Does anyone know, or know of a link for the underlying algorithms used for list manipulation functions?
Specifically:
This question came about as there has been numerous questions that have asked about removing duplicates or some other list manipulation function. Some of the best answers often included usage of GatherBy
due to the performance it offers (a few hundred times over DeleteDuplicates
in some cases).
Related: usages of GatherBy
and Union
:
Checking for duplicates in sublists GatherBy
Ordering function with recognition of duplicates GatherBy
Gather list elements by labels GatherBy
Retrieving duplicates from nested list GatherBy
Delete duplicates in a list, depending on the sequence of numbers Union
How to delete duplicate solutions of this system of equations? Union
Answer
This is an incomplete answer; I will continue it tomorrow. Work In Progress: errors may abound.
Preamble hat-tip to Leonid
For the variations with custom test or ordering functions we can snoop on applications of that function to deduce the algorithm that is used. In the case of the default methods we must rely on observed complexity and guesswork unless they are expressly described in the documentation or other official sources.
In every case the default method (without a test or ordering function) uses a well optimized routine. Since the details of these methods are not directly observable and since I have no special knowledge of the implementation I shall reference them only in comparison to the customized forms and with supposition.
All of these experiments were conducted in Mathematica 7. While I don't expect the underlying algorithms used by these functions to have changed in recent versions I cannot rule it out at present.
For all but Sort
, which uses an ordering function rather than a test function we can use:
test = (Print[#, " -- ", #2]; Mod[#, 3] == Mod[#2, 3]) &;
I will make use of the following timing function:
timeAvg =
Function[func,
Do[If[# > 0.3, Return[#/5^i]] & @@ Timing@Do[func, {5^i}], {i, 0, 15}],
HoldFirst];
I will use random numeric data on the assumption that the algorithm is universal; I have no specific reason to believe otherwise.
SeedRandom[2]
a = RandomInteger[99, 10]
{92, 57, 22, 84, 63, 1, 81, 96, 19, 38}
Split
I'll start with one not like the rest because it is easy to understand the behavior and will serve as an illustration for the other functions.
Split[a, test]
92 -- 57
57 -- 22
22 -- 84
84 -- 63
63 -- 1
1 -- 81
81 -- 96
96 -- 19
19 -- 38{{92}, {57}, {22}, {84, 63}, {1}, {81, 96}, {19}, {38}}
We see that straightforwardly every adjacent pair of elements is compared in order.
One would expect some overhead from a custom test function but the same O(n) complexity as default Split
, and that is observed:
Table[{timeAvg@Split[#], timeAvg@Split[#, Equal]}&@RandomInteger[1,n], {n, 10^Range[0, 6]}];
ListLogPlot[time\[Transpose], Joined -> True,
Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic},
AxesLabel -> {"Length", "Seconds"}]
There is an additional component relative to the number of splits. Observe:
d1 = RandomInteger[99, 5*^6]; (* lots of splits *)
d2 = Round@Clip[d1, {98, 99}]; (* few splits *)
Timing @ Length @ Split @ # & /@ {d1, d2}
{{0.483, 4949976}, {0.125, 99485}}
Union
Applying the same test to the same set with Union
:
Union[a, SameTest -> test]
19 -- 1
22 -- 1
38 -- 1
57 -- 38
57 -- 1
63 -- 57
81 -- 57
84 -- 57
92 -- 57
92 -- 38
96 -- 57{1, 38, 57}
Here we see that the list is pre-sorted as testing starts with the small numbers: {1, 19, 22}. Also, we see that comparisons are each time with one of three unique elements {1, 38, 57} (right column). The algorithm appears to be:
- Sort the list using the canonical ordering, and put the first element in a keep list.
- Take the last keep list element (1) and compare it to subsequent elements (the left column) until test returns False.
- Add the new unique element to the keep list (38) and compare it to subsequent elements from this element.
- If in Step #3 a test returns False compare it to each prior element in the keep list; if it matches one continue with Step #3, else add to keep and start a new #3 with that element.
- Stop when the last element is reached and compared (if necessary) to all elements of keep.
Some things result from this (supposed) algorithm. One is that the number of comparisons is directly influenced by the number of unique elements (defined by test) there are. Compare the output of these two lines:
Union[{1, 1, 1, 1, 1}, SameTest -> ((Print[#, " -- ", #2]; # == #2) &)]
1 -- 1
1 -- 1
1 -- 1
1 -- 1
Union[Range@5, SameTest -> ((Print[#, " -- ", #2]; # == #2) &)]
2 -- 1
3 -- 2
3 -- 1
4 -- 3
4 -- 2
4 -- 1
5 -- 4
5 -- 3
5 -- 2
5 -- 1
In the first case there are only (n-1)
tests whereas in the second there are n (n-1) / 2
tests performed.
With a limited number of unique elements using SameTest
has roughly equal computational complexity to the default Union
(though several times slower):
times1 = Table[
{timeAvg @ Union[#], timeAvg @ Union[#, SameTest -> Equal]} &@RandomInteger[20, n],
{n, 10^Range[0, 6]}];
ListLogPlot[times1\[Transpose], Joined -> True,
Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic},
AxesLabel -> {"Length", "Seconds"}]
With a fixed length but a different numbers of unique elements very different behavior emerges:
times2 = Table[
{timeAvg @ Union[#], timeAvg @ Union[#, SameTest -> Equal]} &@
RandomInteger[{1, n}, 1*^5],
{n, 5^Range[0, 6]}];
ListLogPlot[times2\[Transpose], Joined -> True,
FrameTicks -> {Thread@{Range@7, "5"^Range[0, 6]}, Automatic}, Frame -> {1, 1, 0, 0},
FrameLabel -> {"Unique elements (max)", "Seconds"}, PlotRange -> All]
Clearly a default Union
uses other optimizations. A hint at this optimization can be found on MathGroup in reference to the old way that SameTest
worked and by inference the way that default Union still does:
[mg16113]
The Union function sorts the elements before comparing them, and compares only adjacent elements after sorting. This is done for reasons of efficiency.
So the default is akin to Split
in this way, and in fact Carl Woll implemented the old-style SameTest
in terms of Split
in another MathGroup post: [mg23309]. There is additional discussion there of complexity that is very pertinent to this topic.
Gather
Our preliminary experiment once again:
Gather[a, test]
92 -- 57
92 -- 22
57 -- 22
92 -- 84
57 -- 84
92 -- 63
57 -- 63
92 -- 1
57 -- 1
22 -- 1
92 -- 81
57 -- 81
92 -- 96
57 -- 96
92 -- 19
57 -- 19
22 -- 19
92 -- 38{{92, 38}, {57, 84, 63, 81, 96}, {22, 1, 19}}
The list is not sorted and the first two elements receive the first test. We see quite a few more applications of test
this time compared to Union
but that proves to be mere chance.
The same three "groups" are formed by the Mod
function as with Union
: there 1,38,57
in the right column; here 92,57,22
in the left. The tests are no longer in canonical order as the list is not sorted but the same type of comparisons are made.
To show that the apparent difference in the number of test applications between Gather
and Union
is not significant we can run the test with many other sets
Module[{x, i, test2},
test2 = (i++; Mod[#, 3] == Mod[#2, 3]) &;
Table[
x = RandomInteger[99, 10]; {
i = 0; Union[x, SameTest -> test2]; i,
i = 0; Gather[x, test2]; i
} , {10000}
] // Pane /@ Histogram /@ Transpose[#] & // Row
]
Given the equal number of comparisons one might expect the performance of Gather
and Union
(with test functions) to track quite well, however there are some surprises.
With a limited number of groups the time complexity trends to the same degree as Union
but the scales are surprising:
gtest[a_, b_] := a == b
times1 = Table[
{timeAvg @ Gather[#], timeAvg @ Gather[#, gtest]} &@RandomInteger[20, n],
{n, 10^Range[0, 6]}
];
ListLogPlot[times1\[Transpose], Joined -> True,
Ticks -> {Thread@{Range@7, "10"^Range[0, 6]}, Automatic},
AxesLabel -> {"Length", "Seconds"}, PlotStyle -> {Red, Green}]
I have overlaid this graphic with the one for Union
for easy comparison. On this test with a limited number of unique elements (20) default and test Union
were within an order of magnitude, but default and test Gather
are nearly five orders of magnitude apart. One might expect Gather
with test to be slower than the equivalent Union
lists must be built, however it is interesting to note that default Gather
is an order of magnitude faster than default Union
here.
For the test with a varying number of unique elements I had to abort at 5^5
as the function slowed down too greatly:
gtest[a_, b_] := a == b
times2 = Table[
{timeAvg @ Gather[#], timeAvg @ Gather[#, gtest]} &@RandomInteger[{1, n}, 1*^5],
{n, 5^Range[0, 5]}
];
ListLogPlot[times2\[Transpose], Joined -> True,
FrameTicks -> {Thread@{Range@7, "5"^Range[0, 6]}, Automatic}, Frame -> {1, 1, 0, 0},
FrameLabel -> {"Unique elements (max)", "Seconds"}, PlotRange -> All,
PlotStyle -> {Red, Green}]
Again I have overlaid the result from the corresponding Union
result. Complexities here are harder to compare, and more testing is in order.
Automatic enhancements in Gather
You may have noticed that I used gtest[a_, b_] := a == b
rather than merely Equal
or even # == #2 &
as the test. This is because Gather
has special handling built in to optimize such cases; this is normally a good thing but in this case it amounts to cheating.
Gather[x, Equal]
is apparently converted directly toGather[x]
.A symmetric pure function of the form
lhs == rhs &
orlhs === rhs &
wherelhs
andrhs
are identical except for the arguments is converted toGatherBy
, i.e.GatherBy[x, lhs &]
.
Examples supporting these assertions:
Timings are the same as
Gather[x]
for explicit use ofEqual
andSameQ
:big = RandomInteger[5000, 10000];
Gather[big] // timeAvg
Gather[big, Equal] // timeAvg
Gather[big, SameQ] // timeAvg0.002624
0.002624
0.002616
Convertion to
GatherBy
(or internal equivalent) is directly observable usingPrint
or `Sow:Gather[Range@5, Print[#] == Print[#2] &];
1
2
3
4
5Note that each element is printed only once -- this is the behavior of
GatherBy
not a pairwise comparison!
Comments
Post a Comment