Skip to main content

list manipulation - Counting the population of integers


Suppose that myData is a list of sublists. Each sublist has a length of one or greater and contains any number of replicates of the integers 1, 2, 3, and 4. I would like to create a function myFun that counts the number of each integer in the sublist.


As an example, suppose that I have the following myData:


myData = {{2, 3, 3, 1, 1, 3, 2, 2, 4, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1},
{2, 2, 3, 1, 1, 3, 2, 2, 3, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1},

{1, 2, 2, 1, 1, 3, 2, 1, 3, 1, 1, 2, 2, 2, 2, 3, 1, 1},
{2, 3, 3, 1, 1, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1}};

I would like myFun /@ myData to return the output:



{{{1, 8}, {2, 8}, {3, 3}, {4, 1}},
{{1, 9}, {2, 8}, {3, 3}, {4, 0}},
{{1, 8}, {2, 7}, {3, 3}, {4, 0}},
{{1, 8}, {2, 8}, {3, 4}, {4, 0}}}


In other words, sublist 1 has eight 1s, eight 2s, three 3s, and one 4, and similarly for the other three sublists. The key feature here is that all of 1, 2, 3, and 4 are listed, even if one or more of them have zero population.


Using the following (taking advantage of Tally) gets me close:


myFun[sublist_] := SortBy[Tally[sublist], First]
myFun /@ myData


{{{1, 8}, {2, 8}, {3, 3}, {4, 1}},
{{1, 9}, {2, 8}, {3, 3}},
{{1, 8}, {2, 7}, {3, 3}},
{{1, 8}, {2, 8}, {3, 4}}}


But Tally does not list members with zero population, so the above does not satisfy the key feature.


On the other hand, the following (using Count) seems to accomplish my goal:


myFun[sublist_] := Map[{#, Count[sublist, #]} &, {1, 2, 3, 4}]
myFun /@ myData


{{{1, 8}, {2, 8}, {3, 3}, {4, 1}},
{{1, 9}, {2, 8}, {3, 3}, {4, 0}},
{{1, 8}, {2, 7}, {3, 3}, {4, 0}},

{{1, 8}, {2, 8}, {3, 4}, {4, 0}}}

But, is there a more succinct, more elegant way to do this? Thanks for your time!



Answer



Though it is not typically as fast as Tally I am fond of Sow and Reap for this kind of problem:


countBy[dat_, bins_] :=
{bins, Tr /@ Reap[Sow[1, #], bins, Tr@#2 &][[2]]}\[Transpose] & /@ dat

countBy[myData, {1, 2, 3, 4}] // List // MatrixForm


Mathematica graphics


Since Tally is probably better in the long run despite my enjoyment of Sow and Reap here is a way to use that efficiently:


countBy2[dat_, bins_] :=
{bins, Replace[bins, Dispatch @ Append[Rule @@@ Tally @ #, _ -> 0], {1}]}\[Transpose] & /@ dat



Performance


I commented that while a method that jVincent proposed (also in a comment) was short and elegant that it was "not at all efficient computationally." He rebuts this in an answer, to which I will now reply. I was of course not speaking about the absolute performance on this tiny example but rather about algorithmic complexity and the way the method scales to a larger problem. The issue with using Count is that each list must be scanned as many times as there are unique elements. When the list is long and contains a large number of such elements this becomes a very slow process.


I shall use the faster Tally method (countBy3) rather than the playful Sow and Reap method while comparing performance. I will use the following functions:


SetAttributes[timeAvg, HoldFirst]


timeAvg[func_] := Do[If[# > 0.3, Return[#/5^i]] & @@ Timing@Do[func, {5^i}], {i, 0, 15}]

myFun[sublist_] := SortBy[Tally[sublist], First]
myFun[sublist_, elems_] :=
Replace[myFun[sublist~Join~elems], {el : Alternatives @@ elems, n_} :> {el, n - 1}, 1];
leonid[dat_, bins_] := myFun[#, bins] & /@ dat

jV[dat_, bins_] :=
Outer[{#2, Count[#1, #2]} &, dat, bins, 1]


First, using an array of fixed size but varying the number bins to tally:


times =
Table[
With[{dat = RandomInteger[{1, n}, {1000, 1000}]},
timeAvg[#[dat, Range@n]] & /@ {jV, leonid, countBy2}
],
{n, 3^Range@6}
]



{{0.1028, 0.01312, 0.007608},
{0.2776, 0.01996, 0.012104},
{0.811, 0.0418, 0.02372},
{2.371, 0.1216, 0.05616},
{7.161, 0.593, 0.2246},
{21.45, 3.573, 0.468}}

ListPlot[times\[Transpose], PlotRange -> All, AxesOrigin -> {1, -1}, Joined -> True]


Mathematica graphics


The blue line is the timing of jVincent's Count method, the purple line Leonid's, and the yellow line mine. Clearly Count does not scale well with regard to an increasing number of unique elements.


Here is a run-off between Leonid's code an my own (not that he wrote it with peak efficiency in mind), using much longer sublists to allow for more unique elements:


times2 =
Table[
With[{dat = RandomInteger[{1, n}, {20, 25000}]},
timeAvg[#[dat, Range@n]] & /@ {leonid, countBy2}
],
{n, 3^Range@9}
]



{{0.0011728, 0.0009232},
{0.0012976, 0.0009984},
{0.0017712, 0.001248},
{0.003368, 0.001872},
{0.011352, 0.003744},
{0.0688, 0.009608},
{0.561, 0.03304},
{4.914, 0.156},

{43.462, 0.39}}

ListPlot[times2\[Transpose], PlotRange -> All, AxesOrigin -> {1, -1}, Joined -> True, 
PlotStyle -> ColorData[1] /@ {2, 3}]

Mathematica graphics


So once again at the extremes Leonid's code "blows up" while mine does not.


That is what I meant by "not at all efficient computationally."


Comments

Popular posts from this blog

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...

mathematical optimization - Minimizing using indices, error: Part::pkspec1: The expression cannot be used as a part specification

I want to use Minimize where the variables to minimize are indices pointing into an array. Here a MWE that hopefully shows what my problem is. vars = u@# & /@ Range[3]; cons = Flatten@ { Table[(u[j] != #) & /@ vars[[j + 1 ;; -1]], {j, 1, 3 - 1}], 1 vec1 = {1, 2, 3}; vec2 = {1, 2, 3}; Minimize[{Total@((vec1[[#]] - vec2[[u[#]]])^2 & /@ Range[1, 3]), cons}, vars, Integers] The error I get: Part::pkspec1: The expression u[1] cannot be used as a part specification. >> Answer Ok, it seems that one can get around Mathematica trying to evaluate vec2[[u[1]]] too early by using the function Indexed[vec2,u[1]] . The working MWE would then look like the following: vars = u@# & /@ Range[3]; cons = Flatten@{ Table[(u[j] != #) & /@ vars[[j + 1 ;; -1]], {j, 1, 3 - 1}], 1 vec1 = {1, 2, 3}; vec2 = {1, 2, 3}; NMinimize[ {Total@((vec1[[#]] - Indexed[vec2, u[#]])^2 & /@ R...

What is and isn't a valid variable specification for Manipulate?

I have an expression whose terms have arguments (representing subscripts), like this: myExpr = A[0] + V[1,T] I would like to put it inside a Manipulate to see its value as I move around the parameters. (The goal is eventually to plot it wrt one of the variables inside.) However, Mathematica complains when I set V[1,T] as a manipulated variable: Manipulate[Evaluate[myExpr], {A[0], 0, 1}, {V[1, T], 0, 1}] (*Manipulate::vsform: Manipulate argument {V[1,T],0,1} does not have the correct form for a variable specification. >> *) As a workaround, if I get rid of the symbol T inside the argument, it works fine: Manipulate[ Evaluate[myExpr /. T -> 15], {A[0], 0, 1}, {V[1, 15], 0, 1}] Why this behavior? Can anyone point me to the documentation that says what counts as a valid variable? And is there a way to get Manpiulate to accept an expression with a symbolic argument as a variable? Investigations I've done so far: I tried using variableQ from this answer , but it says V[1...