I wonder why a single built-in function can be slower than a combination of several built-in functions doing the same job.
Consider this example:
list = RandomInteger[10000, 10000000];
Timing[sol1 = Union[list];]
(*{2.265625, Null}*)
Timing[sol2 = Sort[First /@ Tally[list]];]
(*{0.031250, Null}*)
sol1 == sol2
(* True*)
One built-in function (Union) is so much slower than four built-in functions combined (Tally, Map, First, and Sort).
Questions:
1. Why is this happening?
2. How can one tell that there is a faster way (using more than one built-in function) to do a job than a single built-in function? Is it just by experience?
Answer
The general case
There are indeed some functions in Mathematica that are, or were, not performing nicely. The one I am most scared of is Total (the issue is addressed here; update: apparently Total has been fixed in version 10.0.2). User C.E. provides some more examples in his comment. But I feel the case of Union is different, as it is simply specialised for a specific case, and in that case it performs well.
The case of Union
The way I see it, there is a spectrum of lists on which Union has to work. On one side of the spectrum there are many duplicates, on the other few. Union is optimised to handle relatively few duplicates, and there it is unbeatable. But for many duplicates, using DeleteDuplicates first is faster.
To see that this is true, one can experiment with the following code.
mm = 1*^6;
kk = 1*^4;
rands = RandomInteger[kk, mm];
(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First
(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First
res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]
0.003440
0.152277
0.149979
True
True
Here there are many duplicates, so Sort@*DeleteDuplicates is faster. Compare this with
kk = 1*^10;
rands = RandomInteger[kk, mm];
(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First
(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First
res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]
0.376823
0.365987
0.175067
True
False
where there are few duplicates, and Union is unbeatable.
At first I felt it was strange that Union is so focussed on the few-duplicates case, but then again, techniques like sampling the list to estimate how many duplicates it contains are pretty ugly and heuristic at best. I think the choice to make Union work like this is all right, though it should be documented more clearly.
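To make concrete why such a sampling heuristic feels unsatisfying, here is a minimal sketch of what an automatic dispatcher might look like. The name heuristicUnion, the 10% sample and the 0.5 threshold are my own inventions and would need tuning; a small or unlucky sample can easily misjudge the full list.
(* hypothetical sketch: estimate duplicate density from a random sample,
   then dispatch to the presumably faster strategy *)
heuristicUnion[list_List] /; Length[list] > 0 :=
 Module[{sample, dupFraction},
  sample = RandomSample[list, Ceiling[Length[list]/10]];
  dupFraction = 1. - Length[DeleteDuplicates[sample]]/Length[sample];
  If[dupFraction > 0.5,
   Sort[DeleteDuplicates[list]], (* duplicate-heavy: dedupe first, then sort *)
   Union[list]                   (* duplicate-light: plain Union *)
  ]
 ]
On the two test lists above this should pick the right branch, but the sample itself already costs a pass over part of the data, which is exactly the kind of ugliness I mean.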
In the documentation of DeleteDuplicates one can read that deleting duplicates can be much faster than "using Union". However, it does not mention that in cases like this it may be better to delete duplicates first and then sort.
A tricky part of understanding all this is seeing that DeleteDuplicates can be so much faster. The power of the hash table is easily underestimated, and it is an easy mistake to assume that deleting duplicates by sorting the elements is always a good idea.
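To get a feel for this, one can time the hash-based deduplication against a plain sort on the duplicate-heavy list (a rough illustration only; the exact numbers will vary with machine and version):
mm = 1*^6;
kk = 1*^4;
rands = RandomInteger[kk, mm];
DeleteDuplicates[rands] // Timing // First (* hash pass: only ~kk distinct values to keep *)
Sort[rands] // Timing // First             (* comparison sort of all mm elements *)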
I would love to see something like an option for Union that makes it assume it is dealing with many duplicates. It must be possible to make something that is (at least a little bit) faster than Sort@*DeleteDuplicates. That would also mean the issue would have to be addressed in the documentation of Union.
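In the meantime one can of course wrap the choice in a small helper with an explicit option. The name myUnion and the "DuplicateHeavy" option below are purely illustrative inventions of mine, not anything built into Union:
Options[myUnion] = {"DuplicateHeavy" -> False};
myUnion[list_List, OptionsPattern[]] :=
 If[TrueQ[OptionValue["DuplicateHeavy"]],
  Sort[DeleteDuplicates[list]], (* caller promises many duplicates *)
  Union[list]                   (* default: rely on Union's few-duplicates tuning *)
 ]
(* usage: the caller states the assumption explicitly *)
myUnion[rands, "DuplicateHeavy" -> True]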