Skip to main content

programming - Underlying Algorithms for List Manipulation Functions


Does anyone know, or know of a link for the underlying algorithms used for list manipulation functions?


Specifically:



This question came about as there has been numerous questions that have asked about removing duplicates or some other list manipulation function. Some of the best answers often included usage of GatherBy due to the performance it offers (a few hundred times over DeleteDuplicates in some cases).


Related: usages of GatherBy and Union:

Checking for duplicates in sublists GatherBy
Ordering function with recognition of duplicates GatherBy
Gather list elements by labels GatherBy
Retrieving duplicates from nested list GatherBy
Delete duplicates in a list, depending on the sequence of numbers Union
How to delete duplicate solutions of this system of equations? Union



Answer



This is an incomplete answer; I will continue it tomorrow. Work In Progress: errors may abound.




Preamble hat-tip to Leonid



For the variations with custom test or ordering functions we can snoop on applications of that function to deduce the algorithm that is used. In the case of the default methods we must rely on observed complexity and guesswork unless they are expressly described in the documentation or other official sources.


In every case the default method (without a test or ordering function) uses a well optimized routine. Since the details of these methods are not directly observable and since I have no special knowledge of the implementation I shall reference them only in comparison to the customized forms and with supposition.


All of these experiments were conducted in Mathematica 7. While I don't expect the underlying algorithms used by these functions to have changed in recent versions I cannot rule it out at present.


For all but Sort, which uses an ordering function rather than a test function we can use:


test = (Print[#, " -- ", #2]; Mod[#, 3] == Mod[#2, 3]) &;

I will make use of the following timing function:


timeAvg = 
Function[func,
Do[If[# > 0.3, Return[#/5^i]] & @@ Timing@Do[func, {5^i}], {i, 0, 15}],

HoldFirst];

I will use random numeric data on the assumption that the algorithm is universal; I have no specific reason to believe otherwise.


SeedRandom[2]
a = RandomInteger[99, 10]


{92, 57, 22, 84, 63, 1, 81, 96, 19, 38}

Split



I'll start with one not like the rest because it is easy to understand the behavior and will serve as an illustration for the other functions.


Split[a, test]


92 -- 57
57 -- 22
22 -- 84
84 -- 63
63 -- 1
1 -- 81

81 -- 96
96 -- 19
19 -- 38


{{92}, {57}, {22}, {84, 63}, {1}, {81, 96}, {19}, {38}}

We see that straightforwardly every adjacent pair of elements is compared in order.
One would expect some overhead from a custom test function but the same O(n) complexity as default Split, and that is observed:


Table[{timeAvg@Split[#], timeAvg@Split[#, Equal]}&@RandomInteger[1,n], {n, 10^Range[0, 6]}];

ListLogPlot[time\[Transpose], Joined -> True,

Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic},
AxesLabel -> {"Length", "Seconds"}]

enter image description here


There is an additional component relative to the number of splits. Observe:


d1 = RandomInteger[99, 5*^6];  (* lots of splits *)
d2 = Round@Clip[d1, {98, 99}]; (* few splits *)
Timing @ Length @ Split @ # & /@ {d1, d2}



{{0.483, 4949976}, {0.125, 99485}}

Union


Applying the same test to the same set with Union:


Union[a, SameTest -> test]


19 -- 1
22 -- 1
38 -- 1

57 -- 38
57 -- 1
63 -- 57
81 -- 57
84 -- 57
92 -- 57
92 -- 38
96 -- 57


{1, 38, 57}


Here we see that the list is pre-sorted as testing starts with the small numbers: {1, 19, 22}. Also, we see that comparisons are each time with one of three unique elements {1, 38, 57} (right column). The algorithm appears to be:



  1. Sort the list using the canonical ordering, and put the first element in a keep list.

  2. Take the last keep list element (1) and compare it to subsequent elements (the left column) until test returns False.

  3. Add the new unique element to the keep list (38) and compare it to subsequent elements from this element.

  4. If in Step #3 a test returns False compare it to each prior element in the keep list; if it matches one continue with Step #3, else add to keep and start a new #3 with that element.

  5. Stop when the last element is reached and compared (if necessary) to all elements of keep.


Some things result from this (supposed) algorithm. One is that the number of comparisons is directly influenced by the number of unique elements (defined by test) there are. Compare the output of these two lines:


Union[{1, 1, 1, 1, 1}, SameTest -> ((Print[#, " -- ", #2]; # == #2) &)]



1 -- 1
1 -- 1
1 -- 1
1 -- 1



Union[Range@5, SameTest -> ((Print[#, " -- ", #2]; # == #2) &)]



2 -- 1
3 -- 2
3 -- 1
4 -- 3
4 -- 2
4 -- 1
5 -- 4
5 -- 3
5 -- 2
5 -- 1




In the first case there are only (n-1) tests whereas in the second there are n (n-1) / 2 tests performed.


With a limited number of unique elements using SameTest has roughly equal computational complexity to the default Union (though several times slower):


times1 = Table[
{timeAvg @ Union[#], timeAvg @ Union[#, SameTest -> Equal]} &@RandomInteger[20, n],
{n, 10^Range[0, 6]}];

ListLogPlot[times1\[Transpose], Joined -> True,
Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic},
AxesLabel -> {"Length", "Seconds"}]


enter image description here


With a fixed length but a different numbers of unique elements very different behavior emerges:


times2 = Table[
{timeAvg @ Union[#], timeAvg @ Union[#, SameTest -> Equal]} &@
RandomInteger[{1, n}, 1*^5],
{n, 5^Range[0, 6]}];

ListLogPlot[times2\[Transpose], Joined -> True,
FrameTicks -> {Thread@{Range@7, "5"^Range[0, 6]}, Automatic}, Frame -> {1, 1, 0, 0},

FrameLabel -> {"Unique elements (max)", "Seconds"}, PlotRange -> All]

enter image description here


Clearly a default Union uses other optimizations. A hint at this optimization can be found on MathGroup in reference to the old way that SameTest worked and by inference the way that default Union still does:



[mg16113]
The Union function sorts the elements before comparing them, and compares only adjacent elements after sorting. This is done for reasons of efficiency.



So the default is akin to Split in this way, and in fact Carl Woll implemented the old-style SameTest in terms of Split in another MathGroup post: [mg23309]. There is additional discussion there of complexity that is very pertinent to this topic.


Gather



Our preliminary experiment once again:


Gather[a, test]


92 -- 57
92 -- 22
57 -- 22
92 -- 84
57 -- 84
92 -- 63

57 -- 63
92 -- 1
57 -- 1
22 -- 1
92 -- 81
57 -- 81
92 -- 96
57 -- 96
92 -- 19
57 -- 19

22 -- 19
92 -- 38


{{92, 38}, {57, 84, 63, 81, 96}, {22, 1, 19}}

The list is not sorted and the first two elements receive the first test. We see quite a few more applications of test this time compared to Union but that proves to be mere chance.


The same three "groups" are formed by the Mod function as with Union: there 1,38,57 in the right column; here 92,57,22 in the left. The tests are no longer in canonical order as the list is not sorted but the same type of comparisons are made.


To show that the apparent difference in the number of test applications between Gather and Union is not significant we can run the test with many other sets


Module[{x, i, test2},
test2 = (i++; Mod[#, 3] == Mod[#2, 3]) &;
Table[

x = RandomInteger[99, 10]; {
i = 0; Union[x, SameTest -> test2]; i,
i = 0; Gather[x, test2]; i
} , {10000}
] // Pane /@ Histogram /@ Transpose[#] & // Row
]

enter image description here


Given the equal number of comparisons one might expect the performance of Gather and Union (with test functions) to track quite well, however there are some surprises.


With a limited number of groups the time complexity trends to the same degree as Union but the scales are surprising:



gtest[a_, b_] := a == b
times1 = Table[
{timeAvg @ Gather[#], timeAvg @ Gather[#, gtest]} &@RandomInteger[20, n],
{n, 10^Range[0, 6]}
];

ListLogPlot[times1\[Transpose], Joined -> True,
Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic},
AxesLabel -> {"Length", "Seconds"}, PlotStyle -> {Red, Green}]


enter image description here


I have overlaid this graphic with the one for Union for easy comparison. On this test with a limited number of unique elements (20) default and test Union were within an order of magnitude, but default and test Gather are nearly five orders of magnitude apart. One might expect Gather with test to be slower than the equivalent Union lists must be built, however it is interesting to note that default Gather is an order of magnitude faster than default Union here.


For the test with a varying number of unique elements I had to abort at 5^5 as the function slowed down too greatly:


gtest[a_, b_] := a == b
times2 = Table[
{timeAvg @ Gather[#], timeAvg @ Gather[#, gtest]} &@RandomInteger[{1, n}, 1*^5],
{n, 5^Range[0, 5]}
];

ListLogPlot[times2\[Transpose], Joined -> True,

FrameTicks -> {Thread@{Range@7, "5"^Range[0, 6]}, Automatic}, Frame -> {1, 1, 0, 0},
FrameLabel -> {"Unique elements (max)", "Seconds"}, PlotRange -> All,
PlotStyle -> {Red, Green}]

enter image description here


Again I have overlaid the result from the corresponding Union result. Complexities here are harder to compare, and more testing is in order.


Automatic enhancements in Gather


You may have noticed that I used gtest[a_, b_] := a == b rather than merely Equal or even # == #2 & as the test. This is because Gather has special handling built in to optimize such cases; this is normally a good thing but in this case it amounts to cheating.





  • Gather[x, Equal] is apparently converted directly to Gather[x].




  • A symmetric pure function of the form lhs == rhs & or lhs === rhs & where lhs and rhs are identical except for the arguments is converted to GatherBy, i.e. GatherBy[x, lhs &].




Examples supporting these assertions:




  1. Timings are the same as Gather[x] for explicit use of Equal and SameQ:



    big = RandomInteger[5000, 10000];

    Gather[big] // timeAvg
    Gather[big, Equal] // timeAvg
    Gather[big, SameQ] // timeAvg


    0.002624


    0.002624


    0.002616






  2. Convertion to GatherBy (or internal equivalent) is directly observable using Print or `Sow:


    Gather[Range@5, Print[#] == Print[#2] &];


    1
    2
    3

    4
    5



    Note that each element is printed only once -- this is the behavior of GatherBy not a pairwise comparison!





Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...