Skip to main content

performance tuning - Why some built-in functions are slow


I just wonder why some built-in functions are slower compared to some other ways using more than one built-in functions for doing same job.



consider this example:


list = RandomInteger[10000, 10000000];

Timing[sol1 = Union[list];]

(*{2.265625, Null}*)

Timing[sol2 = Sort[First /@ Tally[list]];]

(*{0.031250, Null}*)


sol1 == sol2

(* True*)

One built-in function (Union) is so slow compared to four built-in functions (Tally, Map, First, and Sort) .


Questions:


1- why is that happening?


2- how to know that there is faster way (using more than one built-in function) for doing job compare to one built-in function? is it just by experience?



Answer




The general case


There are indeed some functions in Mathematica that are/were not performing nicely. The one I am most scared of is Total (the issue is addressed here) (update: apparently Total has been fixed in version 10.0.2). User C.E. provides some more examples in his comment. But I feel the case of Union is different, as it is simply specialised for a specific case and in this case it performs well.


The case of Union


The way I see it, there is a spectrum of lists on which Union has to work. On one side of the spectrum, there are many duplicates, on the other few. Union is optimised to handle relatively few duplicates, and here it is unbeatable. But for many duplicates, using DeleteDuplicates first is faster.


To see that this is true, one can experiment with the following code.


mm = 1*^6;
kk = 1*^4;
rands = RandomInteger[kk, mm];

(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First

(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First

res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]


0.003440
0.152277
0.149979

True
True

Here there are many duplicates, so that Sort@*DeleteDuplicates is faster. Compare this with


kk = 1*^10;
rands = RandomInteger[kk, mm];
(res1 = Sort[DeleteDuplicates[rands]]) // Timing // First
(res2 = DeleteDuplicates[Sort[rands]]) // Timing // First
(res3 = Union[rands]) // Timing // First


res1 == res2 == res3
res1 == Range[0, Min[kk, 10^6]]


0.376823
0.365987
0.175067
True
False


where there are few duplicates, and Union is unbeatable.


At first I felt that it was strange that Union was so focussed on the few-duplicates case, but then again, techniques like sampling the list to see how many duplicates there are are pretty ugly and heuristic at best. I think the choice to make Union work like this is all right, though I think it should be documented more clearly.


In the documentation of DeleteDuplicates one can find that deleting duplicates can be much faster than "using Union". However it does not say that in cases like this it may be better to use technique of deleting duplicates first and then sorting.


A tricky part of understanding all this is understanding that DeleteDuplicates can be so much faster. I think the power of the hash table is easily underestimated, and to think that it is always a good idea to delete duplicates by sorting elements is an easy mistake to make.


I would love to see something like an option of Union to make it assume it is dealing with many duplicates. It must be possible to make something that is (at least a little bit) faster than using Sort@*DeleteDuplicates. That would also mean that the issue would have to be addressed in the documentation of Union.


Comments

Popular posts from this blog

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...

How to thread a list

I have data in format data = {{a1, a2}, {b1, b2}, {c1, c2}, {d1, d2}} Tableform: I want to thread it to : tdata = {{{a1, b1}, {a2, b2}}, {{a1, c1}, {a2, c2}}, {{a1, d1}, {a2, d2}}} Tableform: And I would like to do better then pseudofunction[n_] := Transpose[{data2[[1]], data2[[n]]}]; SetAttributes[pseudofunction, Listable]; Range[2, 4] // pseudofunction Here is my benchmark data, where data3 is normal sample of real data. data3 = Drop[ExcelWorkBook[[Column1 ;; Column4]], None, 1]; data2 = {a #, b #, c #, d #} & /@ Range[1, 10^5]; data = RandomReal[{0, 1}, {10^6, 4}]; Here is my benchmark code kptnw[list_] := Transpose[{Table[First@#, {Length@# - 1}], Rest@#}, {3, 1, 2}] &@list kptnw2[list_] := Transpose[{ConstantArray[First@#, Length@# - 1], Rest@#}, {3, 1, 2}] &@list OleksandrR[list_] := Flatten[Outer[List, List@First[list], Rest[list], 1], {{2}, {1, 4}}] paradox2[list_] := Partition[Riffle[list[[1]], #], 2] & /@ Drop[list, 1] RM[list_] := FoldList[Transpose[{First@li...

front end - keyboard shortcut to invoke Insert new matrix

I frequently need to type in some matrices, and the menu command Insert > Table/Matrix > New... allows matrices with lines drawn between columns and rows, which is very helpful. I would like to make a keyboard shortcut for it, but cannot find the relevant frontend token command (4209405) for it. Since the FullForm[] and InputForm[] of matrices with lines drawn between rows and columns is the same as those without lines, it's hard to do this via 3rd party system-wide text expanders (e.g. autohotkey or atext on mac). How does one assign a keyboard shortcut for the menu item Insert > Table/Matrix > New... , preferably using only mathematica? Thanks! Answer In the MenuSetup.tr (for linux located in the $InstallationDirectory/SystemFiles/FrontEnd/TextResources/X/ directory), I changed the line MenuItem["&New...", "CreateGridBoxDialog"] to read MenuItem["&New...", "CreateGridBoxDialog", MenuKey["m", Modifiers-...