programming - Underlying Algorithms for List Manipulation Functions

Does anyone know, or know of a link for the underlying algorithms used for list manipulation functions?

Specifically:

This question came about as there has been numerous questions that have asked about removing duplicates or some other list manipulation function. Some of the best answers often included usage of GatherBy due to the performance it offers (a few hundred times over DeleteDuplicates in some cases).

Answer

This is an incomplete answer; I will continue it tomorrow. Work In Progress: errors may abound.

Preamble _{^{_{^{hat-tip to Leonid}}}}

For the variations with custom test or ordering functions we can snoop on applications of that function to deduce the algorithm that is used. In the case of the default methods we must rely on observed complexity and guesswork unless they are expressly described in the documentation or other official sources.

In every case the default method (without a test or ordering function) uses a well optimized routine. Since the details of these methods are not directly observable and since I have no special knowledge of the implementation I shall reference them only in comparison to the customized forms and with supposition.

All of these experiments were conducted in Mathematica 7. While I don't expect the underlying algorithms used by these functions to have changed in recent versions I cannot rule it out at present.

For all but Sort, which uses an ordering function rather than a test function we can use:

test = (Print[#, " -- ", #2]; Mod[#, 3] == Mod[#2, 3]) &;

I will make use of the following timing function:

timeAvg = 
  Function[func,
    Do[If[# > 0.3, Return[#/5^i]] & @@ Timing@Do[func, {5^i}], {i, 0, 15}],

    HoldFirst];

I will use random numeric data on the assumption that the algorithm is universal; I have no specific reason to believe otherwise.

SeedRandom[2]
a = RandomInteger[99, 10]

{92, 57, 22, 84, 63, 1, 81, 96, 19, 38}

Split

I'll start with one not like the rest because it is easy to understand the behavior and will serve as an illustration for the other functions.

Split[a, test]

92 -- 57
57 -- 22
22 -- 84
84 -- 63
63 -- 1
1 -- 81

81 -- 96
96 -- 19
19 -- 38
{{92}, {57}, {22}, {84, 63}, {1}, {81, 96}, {19}, {38}}

We see that straightforwardly every adjacent pair of elements is compared in order.
One would expect some overhead from a custom test function but the same O(n) complexity as default Split, and that is observed:

Table[{timeAvg@Split[#], timeAvg@Split[#, Equal]}&@RandomInteger[1,n], {n, 10^Range[0, 6]}];

ListLogPlot[time\[Transpose], Joined -> True,

  Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic},
  AxesLabel -> {"Length", "Seconds"}]

enter image description here

There is an additional component relative to the number of splits. Observe:

d1 = RandomInteger[99, 5*^6];  (* lots of splits *)
d2 = Round@Clip[d1, {98, 99}]; (* few splits *)
Timing @ Length @ Split @ # & /@ {d1, d2}

{{0.483, 4949976}, {0.125, 99485}}

Union

Applying the same test to the same set with Union:

Union[a, SameTest -> test]

19 -- 1
22 -- 1
38 -- 1

57 -- 38
57 -- 1
63 -- 57
81 -- 57
84 -- 57
92 -- 57
92 -- 38
96 -- 57
{1, 38, 57}

Here we see that the list is pre-sorted as testing starts with the small numbers: {1, 19, 22}. Also, we see that comparisons are each time with one of three unique elements {1, 38, 57} (right column). The algorithm appears to be:

Sort the list using the canonical ordering, and put the first element in a keep list.

Take the last keep list element (1) and compare it to subsequent elements (the left column) until test returns False.

Add the new unique element to the keep list (38) and compare it to subsequent elements from this element.

If in Step #3 a test returns False compare it to each prior element in the keep list; if it matches one continue with Step #3, else add to keep and start a new #3 with that element.

Stop when the last element is reached and compared (if necessary) to all elements of keep.

Some things result from this (supposed) algorithm. One is that the number of comparisons is directly influenced by the number of unique elements (defined by test) there are. Compare the output of these two lines:

Union[{1, 1, 1, 1, 1}, SameTest -> ((Print[#, " -- ", #2]; # == #2) &)]

1 -- 1
1 -- 1
1 -- 1
1 -- 1

Union[Range@5, SameTest -> ((Print[#, " -- ", #2]; # == #2) &)]

2 -- 1
3 -- 2
3 -- 1
4 -- 3
4 -- 2
4 -- 1
5 -- 4
5 -- 3
5 -- 2
5 -- 1

In the first case there are only (n-1) tests whereas in the second there are n (n-1) / 2 tests performed.

With a limited number of unique elements using SameTest has roughly equal computational complexity to the default Union (though several times slower):

times1 = Table[
   {timeAvg @ Union[#], timeAvg @ Union[#, SameTest -> Equal]} &@RandomInteger[20, n],
   {n, 10^Range[0, 6]}];

ListLogPlot[times1\[Transpose], Joined -> True, 
 Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic}, 
 AxesLabel -> {"Length", "Seconds"}]

enter image description here

With a fixed length but a different numbers of unique elements very different behavior emerges:

times2 = Table[
   {timeAvg @ Union[#], timeAvg @ Union[#, SameTest -> Equal]} &@
    RandomInteger[{1, n}, 1*^5],
   {n, 5^Range[0, 6]}];

ListLogPlot[times2\[Transpose], Joined -> True, 
 FrameTicks -> {Thread@{Range@7, "5"^Range[0, 6]}, Automatic}, Frame -> {1, 1, 0, 0}, 

 FrameLabel -> {"Unique elements (max)", "Seconds"}, PlotRange -> All]

enter image description here

Clearly a default Union uses other optimizations. A hint at this optimization can be found on MathGroup in reference to the old way that SameTest worked and by inference the way that default Union still does:

[mg16113]
The Union function sorts the elements before comparing them, and compares only adjacent elements after sorting. This is done for reasons of efficiency.

So the default is akin to Split in this way, and in fact Carl Woll implemented the old-style SameTest in terms of Split in another MathGroup post: [mg23309]. There is additional discussion there of complexity that is very pertinent to this topic.

Gather

Our preliminary experiment once again:

Gather[a, test]

92 -- 57
92 -- 22
57 -- 22
92 -- 84
57 -- 84
92 -- 63

57 -- 63
92 -- 1
57 -- 1
22 -- 1
92 -- 81
57 -- 81
92 -- 96
57 -- 96
92 -- 19
57 -- 19

22 -- 19
92 -- 38
{{92, 38}, {57, 84, 63, 81, 96}, {22, 1, 19}}

The list is not sorted and the first two elements receive the first test. We see quite a few more applications of test this time compared to Union but that proves to be mere chance.

The same three "groups" are formed by the Mod function as with Union: there 1,38,57 in the right column; here 92,57,22 in the left. The tests are no longer in canonical order as the list is not sorted but the same type of comparisons are made.

To show that the apparent difference in the number of test applications between Gather and Union is not significant we can run the test with many other sets

Module[{x, i, test2},
 test2 = (i++; Mod[#, 3] == Mod[#2, 3]) &;
 Table[

    x = RandomInteger[99, 10]; {
     i = 0; Union[x, SameTest -> test2]; i,
     i = 0; Gather[x, test2]; i
     } , {10000}
 ] // Pane /@ Histogram /@ Transpose[#] & // Row
]

enter image description here

Given the equal number of comparisons one might expect the performance of Gather and Union (with test functions) to track quite well, however there are some surprises.

With a limited number of groups the time complexity trends to the same degree as Union but the scales are surprising:

gtest[a_, b_] := a == b
times1 = Table[
   {timeAvg @ Gather[#], timeAvg @ Gather[#, gtest]} &@RandomInteger[20, n],
   {n, 10^Range[0, 6]}
  ];

ListLogPlot[times1\[Transpose], Joined -> True, 
 Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic}, 
 AxesLabel -> {"Length", "Seconds"}, PlotStyle -> {Red, Green}]

enter image description here

I have overlaid this graphic with the one for Union for easy comparison. On this test with a limited number of unique elements (20) default and test Union were within an order of magnitude, but default and test Gather are nearly five orders of magnitude apart. One might expect Gather with test to be slower than the equivalent Union lists must be built, however it is interesting to note that default Gather is an order of magnitude faster than default Union here.

For the test with a varying number of unique elements I had to abort at 5^5 as the function slowed down too greatly:

gtest[a_, b_] := a == b
times2 = Table[
  {timeAvg @ Gather[#], timeAvg @ Gather[#, gtest]} &@RandomInteger[{1, n}, 1*^5],
  {n, 5^Range[0, 5]}
 ];

ListLogPlot[times2\[Transpose], Joined -> True, 

 FrameTicks -> {Thread@{Range@7, "5"^Range[0, 6]}, Automatic}, Frame -> {1, 1, 0, 0}, 
 FrameLabel -> {"Unique elements (max)", "Seconds"}, PlotRange -> All, 
 PlotStyle -> {Red, Green}]

enter image description here

Again I have overlaid the result from the corresponding Union result. Complexities here are harder to compare, and more testing is in order.

Automatic enhancements in `Gather`

You may have noticed that I used gtest[a_, b_] := a == b rather than merely Equal or even # == #2 & as the test. This is because Gather has special handling built in to optimize such cases; this is normally a good thing but in this case it amounts to cheating.

Gather[x, Equal] is apparently converted directly to Gather[x].

A symmetric pure function of the form lhs == rhs & or lhs === rhs & where lhs and rhs are identical except for the arguments is converted to GatherBy, i.e. GatherBy[x, lhs &].

Examples supporting these assertions:

Timings are the same as Gather[x] for explicit use of Equal and SameQ:

big = RandomInteger[5000, 10000];

Gather[big]        // timeAvg
Gather[big, Equal] // timeAvg
Gather[big, SameQ] // timeAvg

0.002624

0.002624

0.002616

Convertion to GatherBy (or internal equivalent) is directly observable using Print or `Sow:
```
Gather[Range@5, Print[#] == Print[#2] &];
```
1
2
3

4
5

Note that each element is printed only once -- this is the behavior of GatherBy not a pairwise comparison!

What is and isn't a valid variable specification for Manipulate?

I have an expression whose terms have arguments (representing subscripts), like this: myExpr = A[0] + V[1,T] I would like to put it inside a Manipulate to see its value as I move around the parameters. (The goal is eventually to plot it wrt one of the variables inside.) However, Mathematica complains when I set V[1,T] as a manipulated variable: Manipulate[Evaluate[myExpr], {A[0], 0, 1}, {V[1, T], 0, 1}] (*Manipulate::vsform: Manipulate argument {V[1,T],0,1} does not have the correct form for a variable specification. >> *) As a workaround, if I get rid of the symbol T inside the argument, it works fine: Manipulate[ Evaluate[myExpr /. T -> 15], {A[0], 0, 1}, {V[1, 15], 0, 1}] Why this behavior? Can anyone point me to the documentation that says what counts as a valid variable? And is there a way to get Manpiulate to accept an expression with a symbolic argument as a variable? Investigations I've done so far: I tried using variableQ from this answer , but it says V[1...

Blog

Search This Blog