Skip to main content

memory - Efficient code for tallying entries in very large lists


I am dealing with data sets containing tens of millions of (hashable) entries and simply using the Tally function to count the frequency of each unique list element maxes out available memory. What's the most efficient way to perform this sort of operation on very large lists?


Some clarifications:



  1. I actually don't need all values. I need all values that occur more than once and a sizeable majority of the values occur only once.

  2. The datasets themselves do fit in memory in their entirety.




Answer



Preamble


I will discuss here two methods for doing computations on very large data sets which don't fit into memory. The first method is based on sequential reading of chunks of data from a file. The second method is based on converting a data set to a file-backed list representation. The unifying idea for both methods is the use of iterators as a useful abstraction to separate data-fetching mechanism from the computation proper.


Reading data from a file


Since the all-data-fit-into-memory solution has been already given here, I will now show how to augment it by an iterator which would read from a file.


Here we create a sample file (~500 Mb) of random integers:


file = $TemporaryPrefix <> "test.dat";
Do[BinaryWrite[file, RandomInteger[10^4, 10^5], "Integer32"], {1000}];
Close[file];


Here is one possible way to make an iterator:


ClearAll[makeIterator];
makeIterator[file_, chunkSize_, type_String] :=
Module[{ctr = 0, stream = OpenRead[file, BinaryFormat -> True]},
{
iterator[
Function[
If[# === {}, None, #] &@
BinaryReadList[stream, type, chunkSize]
]

],
stream
}];

The above function opens a stream, constructs an iterator and returns a list of both (I didn't include error-handling here, but that can be easiy added).


Now, assuming that one takes the function lazyTally from the linked post, and other functions it depends on, and defines them, here is the code to compute Tally on the data in the file:


iteratorAndStream = makeIterator[file, 10^5, "Integer32"];
(tally = lazyTally[First@iteratorAndStream]); // AbsoluteTiming
Close[Last@iteratorAndStream]


(* {40.643555,Null} *)

By playing with the size of the chunk of data, one can trade memory for speed and get faster execution but use more memory. For example, for the chunk size of 10^7, I get the running time of about 3.5 seconds.


One can also generalize this toy example to more complicated cases where files contain records with different types of entries.


Using file-backed lists


I will now show how one can gain certain speedup with respect to the previous method by converting a file to a file-backed list representation with larger granularity. How much faster this will be depends on whether the problem is IO-bound or the main bottleneck is in the computation (for this particular problem it is rather the latter). This method can be practical if you need to do many operations / computations on a data set, since it will give you



  • much faster reads

  • coarse-grained random access

  • familiar high-level list abstraction



The setup


To start using it, first load the framework:


Import["https://gist.github.com/lshifr/2696189/raw/largeData.m"]

Now, to speed things up by using mx files, define:


$fileNameFunction = mxFileName;
$importFunction = mxImport;
$exportFunction = mxExport;
$compressFunction = Identity;

$uncompressFunction = Identity;

You will have to designate some directory to store the files, for example:


$dir = "C:\\Temp\\LargeData";

Converting data file to a file-backed list representation


You will also want to pick the chunk size defining granularity of the file-backed list. I will pick a larger one than used before for the read-from-file iterator:


$chunkSize = 3*10^6;

Now, here is the code to transfer the content of the file to the file-backed representation (variable ourTest):



iteratorAndStream = makeIterator[file, $chunkSize , "Integer32"];
initList[ourTest];
Module[{next},
While[(next = getNext[First[iteratorAndStream]]) =!= None,
appendTo[ourTest, next, DestinationDirectory :> $dir]
]];
Close[Last@iteratorAndStream];
storeMainList[ourTest, DestinationDirectory :> $dir]

This code executes in about 10 seconds on my machine. Note that appendTo is careful to release the memory for the stored part right after the part has been stored on disk, so at any given time only one chunk of data needs to be stored in RAM.



Construcing the iterator for a file-backed-list, and the computation


Here is a fairly straight-forward implementation for an iterator over a file-backed list:


ClearAll[makeFileBackedListIterator];
makeFileBackedListIterator[fblist_] :=
Module[{ctr = 0},
iterator[
Function[
If[ctr >= Length[fblist],
None,
With[{result = fblist[[++ctr]]},

releasePart[fblist, ctr];
result]]]]];

The only non-trivial part here is an explicit call to the releasePart function to release the memory used to store a given part.


Finally, the computation:


iter = makeFileBackedListIterator[ourTest];
(tallyFB = lazyTally[iter]); // AbsoluteTiming

(* {3.285156, Null} *)


And we can check, that the result is the same as the previous one:


tallyFB ===tally

(* True *)

while the computation took 10 times less time. One can also see that the memory use has been fairly modest:


MaxMemoryUsed[]

(* 117825552 *)


Fair data access speed comparison


If we had used the same chunk size also in the first method, we would have observed that the actual timing difference for the same chunk size is about 2 times only. The speed-up here is not as dramatic because the main bottleneck for this toy problem is in computation rather than IO. This can be confirmed by the following measurement:


iter = makeFileBackedListIterator[ourTest];
While[getNext[iter] =!= None] // Timing

(* {0.312500, Null} *)

This shows that the actual run-time of pure iteration is 10 times less than the full run-time. This is, of course, good, because most of the time is spent on the actual computation, but this is also what limits the speed advantage of using the file - backed list format, in this problem. For the previous method, we would have to run this code (where we use the same chunk size as for a file-backed list version):


iteratorAndStream = makeIterator[file, $chunkSize,"Integer32"];
While[getNext[First@iteratorAndStream ]=!=None]//Timing

Close[Last@iteratorAndStream];

(* {2.093750,Null} *)

which shows, that for this chunk size, BinaryReadList is indeed almost an order of magnitude slower. For more complex data structures one may get more speedup from using the file-backed lists, since in our example the first method using BinaryReadList is also quite efficient.


Summary


I have demonstrated two ways of doing the computation for the large data sets which don't fit into memory. The unifying idea here is to use the iterator abstraction, which decouples the specific way the data is being fetched from the general computation that must be performed on the data.


The first method of sequential reading from a disk is the most straightforward one. However, if one needs to work with a given data set more than once, converting such data set to a file-backed list representation may have advantages, both in terms of speed and in terms of generality, since file-backed list format gives a coarse-grained random access to data.


Comments

Popular posts from this blog

front end - keyboard shortcut to invoke Insert new matrix

I frequently need to type in some matrices, and the menu command Insert > Table/Matrix > New... allows matrices with lines drawn between columns and rows, which is very helpful. I would like to make a keyboard shortcut for it, but cannot find the relevant frontend token command (4209405) for it. Since the FullForm[] and InputForm[] of matrices with lines drawn between rows and columns is the same as those without lines, it's hard to do this via 3rd party system-wide text expanders (e.g. autohotkey or atext on mac). How does one assign a keyboard shortcut for the menu item Insert > Table/Matrix > New... , preferably using only mathematica? Thanks! Answer In the MenuSetup.tr (for linux located in the $InstallationDirectory/SystemFiles/FrontEnd/TextResources/X/ directory), I changed the line MenuItem["&New...", "CreateGridBoxDialog"] to read MenuItem["&New...", "CreateGridBoxDialog", MenuKey["m", Modifiers-...

How to thread a list

I have data in format data = {{a1, a2}, {b1, b2}, {c1, c2}, {d1, d2}} Tableform: I want to thread it to : tdata = {{{a1, b1}, {a2, b2}}, {{a1, c1}, {a2, c2}}, {{a1, d1}, {a2, d2}}} Tableform: And I would like to do better then pseudofunction[n_] := Transpose[{data2[[1]], data2[[n]]}]; SetAttributes[pseudofunction, Listable]; Range[2, 4] // pseudofunction Here is my benchmark data, where data3 is normal sample of real data. data3 = Drop[ExcelWorkBook[[Column1 ;; Column4]], None, 1]; data2 = {a #, b #, c #, d #} & /@ Range[1, 10^5]; data = RandomReal[{0, 1}, {10^6, 4}]; Here is my benchmark code kptnw[list_] := Transpose[{Table[First@#, {Length@# - 1}], Rest@#}, {3, 1, 2}] &@list kptnw2[list_] := Transpose[{ConstantArray[First@#, Length@# - 1], Rest@#}, {3, 1, 2}] &@list OleksandrR[list_] := Flatten[Outer[List, List@First[list], Rest[list], 1], {{2}, {1, 4}}] paradox2[list_] := Partition[Riffle[list[[1]], #], 2] & /@ Drop[list, 1] RM[list_] := FoldList[Transpose[{First@li...

plotting - How to draw lines between specified dots on ListPlot?

I would like to create a plot where I have unconnected dots and some connected. So far, I have figured out how to draw the dots. My code is the following: ListPlot[{{1, 1}, {2, 2}, {3, 3}, {4, 4}, {1, 4}, {2, 5}, {3, 6}, {4, 7}, {1, 7}, {2, 8}, {3, 9}, {4, 10}, {1, 10}, {2, 11}, {3, 12}, {4,13}, {2.5, 7}}, Ticks -> {{1, 2, 3, 4}, None}, AxesStyle -> Thin, TicksStyle -> Directive[Black, Bold, 12], Mesh -> Full] I have thought using ListLinePlot command, but I don't know how to specify to the command to draw only selected lines between the dots. Do have any suggestions/hints on how to do that? Thank you. Answer One possibility would be to use Epilog with Line : ListPlot[ {{1, 1}, {2, 2}, {3, 3}, {4, 4}, {1, 4}, {2, 5}, {3, 6}, {4, 7}, {1, 7}, {2, 8}, {3, 9}, {4, 10}, {1, 10}, {2, 11}, {3, 12}, {4, 13}, {2.5, 7}}, Ticks -> {{1, 2, 3, 4}, None}, AxesStyle -> Thin, TicksStyle -> Directive[Black, Bold, 12], Mesh -> Full, Epilog -> { Line[ ...