Skip to main content

performance tuning - Why some built-in functions are slow


I just wonder why some built-in functions are slower compared to some other ways using more than one built-in functions for doing same job.



consider this example:


list = RandomInteger[10000, 10000000];

Timing[sol1 = Union[list];]

(*{2.265625, Null}*)

Timing[sol2 = Sort[First /@ Tally[list]];]

(*{0.031250, Null}*)


sol1 == sol2

(* True*)

One built-in function (Union) is so slow compared to four built-in functions (Tally, Map, First, and Sort) .


Questions:


1- why is that happening?


2- how to know that there is faster way (using more than one built-in function) for doing job compare to one built-in function? is it just by experience?



Answer




The general case


There are indeed some functions in Mathematica that are/were not performing nicely. The one I am most scared of is Total (the issue is addressed here) (update: apparently Total has been fixed in version 10.0.2). User C.E. provides some more examples in his comment. But I feel the case of Union is different, as it is simply specialised for a specific case and in this case it performs well.


The case of Union


The way I see it, there is a spectrum of lists on which Union has to work. On one side of the spectrum, there are many duplicates, on the other few. Union is optimised to handle relatively few duplicates, and here it is unbeatable. But for many duplicates, using DeleteDuplicates first is faster.


To see that this is true, one can experiment with the following code.


mm = 1*^6;
kk = 1*^4;
rands = RandomInteger[kk, mm];

(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First

(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First

res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]


0.003440
0.152277
0.149979

True
True

Here there are many duplicates, so that Sort@*DeleteDuplicates is faster. Compare this with


kk = 1*^10;
rands = RandomInteger[kk, mm];
(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First
(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First


res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]


0.376823
0.365987
0.175067
True
False


where there are few duplicates, and Union is unbeatable.


At first I felt that it was strange that Union was so focussed on the few-duplicates case, but then again, techniques like sampling the list to see how many duplicates there are are pretty ugly and heuristic at best. I think the choice to make Union work like this is all right, though I think it should be documented more clearly.


In the documentation of DeleteDuplicates one can find that deleting duplicates can be much faster than "using Union". However it does not say that in cases like this it may be better to use technique of deleting duplicates first and then sorting.


A tricky part of understanding all this is understanding that DeleteDuplicates can be so much faster. I think the power of the hash table is easily underestimated, and to think that it is always a good idea to delete duplicates by sorting elements is an easy mistake to make.


I would love to see something like an option of Union to make it assume it is dealing with many duplicates. It must be possible to make something that is (at least a little bit) faster than using Sort@*DeleteDuplicates. That would also mean that the issue would have to be addressed in the documentation of Union.


Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...