It seems to me like identifying the first occurrence of unique elements should take about the same amount of time as it takes to delete duplicates. My data sets are much larger than this (a is 10-20k and b is 1-10k).
With[{a = 1000, b = 10}, list = RandomInteger[a, b a]];
Timing[uniques = DeleteDuplicates@list;]
Timing[firstOccurrenceIndices = Position[list, #, 1, 1] & /@ uniques // Flatten;]
{0.000116, Null}
{0.653104, Null}
Any ideas?
Answer
Here's a faster way:
Clear[f]
Timing[MapIndexed[If[Not@IntegerQ@f[#1], f[#1] = First[#2]] &, list];]
Now f[elem] will tell you the position of the first occurrence of elem. On my machine this is approximately 8-10 times faster than your approach for a list of 10000 elements.
The timing for a=5000, b=100 is 1.3 s on my machine.
In general I expect your solution to have complexity $O(a^2 b)$ while mine should be $O(ab \ln a)$, in terms of $a,b$ as used in your question and $b>1$. I haven't measured it though. For larger lists the speedup will be more significant though.
An alternative:
dt = Dispatch@Thread[list -> Range@Length[list]];
This takes 1.07 s on my machine for the same list as above.
Now to get the first index of elem just do elem /. dt.
Another much faster alternative:
firstOccurrenceIndices = GatherBy[Range@Length[list], list[[#]] &][[All, 1]];
This takes 0.04 s on my machine.
In the Raspberry Pi version (which is a beta of v10) the function PositionIndex will provide a direct solution to this problem
Comments
Post a Comment