I wonder why a single built-in function can be slower than a combination of several built-in functions doing the same job.
Consider this example:
list = RandomInteger[10000, 10000000];
Timing[sol1 = Union[list];]
(*{2.265625, Null}*)
Timing[sol2 = Sort[First /@ Tally[list]];]
(*{0.031250, Null}*)
sol1 == sol2
(* True*)
One built-in function (Union) is so much slower than four built-in functions combined (Tally, Map, First, and Sort).
Questions:
1. Why is this happening?
2. How can one tell that there is a faster way (using more than one built-in function) to do a job than a single built-in function? Is it just by experience?
Answer
The general case
There are indeed some functions in Mathematica that are, or were, not performing nicely. The one I am most scared of is Total (the issue is addressed here; update: apparently Total has been fixed in version 10.0.2). User C.E. provides some more examples in his comment. But I feel the case of Union is different, as it is simply specialised for a specific case, and in that case it performs well.
The case of Union
The way I see it, there is a spectrum of lists on which Union has to work. On one side of the spectrum there are many duplicates, on the other few. Union is optimised to handle relatively few duplicates, and there it is unbeatable. But for many duplicates, using DeleteDuplicates first is faster.
To see that this is true, one can experiment with the following code.
mm = 1*^6;
kk = 1*^4;
rands = RandomInteger[kk, mm];
(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First
(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First
res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]
0.003440
0.152277
0.149979
True
True
Here there are many duplicates, so Sort@*DeleteDuplicates is faster. Compare this with
kk = 1*^10;
rands = RandomInteger[kk, mm];
(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First
(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First
res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]
0.376823
0.365987
0.175067
True
False
where there are few duplicates, and Union is unbeatable.
At first I felt it was strange that Union is so focussed on the few-duplicates case, but then again, techniques like sampling the list to estimate how many duplicates it contains are pretty ugly and heuristic at best. I think the choice to make Union work like this is all right, though it should be documented more clearly.
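To make concrete why such a sampling heuristic feels unsatisfying, here is a minimal sketch of what an automatic dispatcher might look like. The name heuristicUnion, the 10% sample and the 0.5 threshold are my own inventions and would need tuning; a small or unlucky sample can easily misjudge the full list.
(* hypothetical sketch: estimate duplicate density from a random sample,
   then dispatch to the presumably faster strategy *)
heuristicUnion[list_List] /; Length[list] > 0 :=
 Module[{sample, dupFraction},
  sample = RandomSample[list, Ceiling[Length[list]/10]];
  dupFraction = 1. - Length[DeleteDuplicates[sample]]/Length[sample];
  If[dupFraction > 0.5,
   Sort[DeleteDuplicates[list]], (* duplicate-heavy: dedupe first, then sort *)
   Union[list]                   (* duplicate-light: plain Union *)
  ]
 ]
On the two test lists above this should pick the right branch, but the sample itself already costs a pass over part of the data, which is exactly the kind of ugliness I mean.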
In the documentation of DeleteDuplicates one can read that deleting duplicates can be much faster than "using Union". However, it does not mention that in cases like this it may be better to delete duplicates first and then sort.
A tricky part of understanding all this is seeing that DeleteDuplicates can be so much faster. The power of the hash table is easily underestimated, and it is an easy mistake to assume that deleting duplicates by sorting the elements is always a good idea.
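To get a feel for this, one can time the hash-based deduplication against a plain sort on the duplicate-heavy list (a rough illustration only; the exact numbers will vary with machine and version):
mm = 1*^6;
kk = 1*^4;
rands = RandomInteger[kk, mm];
DeleteDuplicates[rands] // Timing // First (* hash pass: only ~kk distinct values to keep *)
Sort[rands] // Timing // First             (* comparison sort of all mm elements *)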
I would love to see something like an option for Union that makes it assume it is dealing with many duplicates. It must be possible to make something that is (at least a little bit) faster than Sort@*DeleteDuplicates. That would also mean the issue would have to be addressed in the documentation of Union.
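In the meantime one can of course wrap the choice in a small helper with an explicit option. The name myUnion and the "DuplicateHeavy" option below are purely illustrative inventions of mine, not anything built into Union:
Options[myUnion] = {"DuplicateHeavy" -> False};
myUnion[list_List, OptionsPattern[]] :=
 If[TrueQ[OptionValue["DuplicateHeavy"]],
  Sort[DeleteDuplicates[list]], (* caller promises many duplicates *)
  Union[list]                   (* default: rely on Union's few-duplicates tuning *)
 ]
(* usage: the caller states the assumption explicitly *)
myUnion[rands, "DuplicateHeavy" -> True]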