programming - Reading periodic elements from a large file


I have a large binary data file (big endian) with 100+ million "rows" of 11 elements each, a combination of floats and integers.


This is the format:


{"Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Integer32", "Integer32"}


This question, "How to read data file quickly?", is related but not exactly the same.


I've been reading in the whole file like this:


str = OpenRead[filename, BinaryFormat -> True];
data = BinaryReadList[str, {"Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Integer32", "Integer32"}, ByteOrdering -> +1];

This requires lots and lots of memory, and in the end I throw away most of the data most of the time. Usually I am just interested in the 4th Real32 and the 2nd Integer32 of each "row". I would like to read only the 4th Real32 and the 2nd Integer32 of each "row" if possible and skip over the rest.


I've tried to use Skip, but the documentation isn't clear on whether it works with BinaryReadList. I get the error "Skip::readf: Real32 is not a valid format specification."


The documentation doesn't describe that you can skip byte by byte, but you can...


str = OpenRead[name, BinaryFormat -> True];
count = FileByteCount[name]/(11*4);

reading = Table[{Skip[str, Byte, 12];
BinaryRead[str, "Real32", ByteOrdering -> +1],
Skip[str, Byte, 24];
BinaryRead[str, "Integer32", ByteOrdering -> +1]},
{count}]; // AbsoluteTiming

edit: This code works now, but it is very slow: about a minute to load a file that takes only 15 seconds with BinaryReadList. However, the memory overhead is orders of magnitude lower.


edit2: Skip appears to be very slow, much slower than SetStreamPosition for some reason. So I wrote some new code that uses SetStreamPosition with a precomputed list of stream positions in bytes. It is about twice as fast as the Skip version, which is okay, but it's still about 3x slower than BinaryReadList.


pos = Range[12, FileByteCount[name], 11*4];
data = {SetStreamPosition[str, #];
BinaryRead[str, "Real32", ByteOrdering -> +1],
SetStreamPosition[str, # + 28];
BinaryRead[str, "Integer32", ByteOrdering -> +1]} & /@ pos; // AbsoluteTiming

Hopefully, someone will have an idea how this can be improved. Memory usage is still low, as expected.


I'm willing to tolerate a slight slowdown (maybe 2x, but not 5-10x) if there is a considerable memory saving to be gained, but it would be great if the process could be sped up as well.


I can't easily provide a copy of my data files, as they are hundreds of megabytes each. I tried to write some code that generates random data and writes it to a file; however, BinaryWrite appears to be extremely slow. I'm on a fast machine with a solid state drive, and it is writing only a few hundred kilobytes per second. Here is the code regardless; maybe someone knows a faster way to make a random binary data file. This will make a ~40 MB file.


outputstr = OpenWrite["randomdata", BinaryFormat -> True]
reals = RandomReal[100, {10^6, 9}];
ints = RandomInteger[100, {10^6, 2}];

both = Flatten@Transpose@Join[Transpose@reals, Transpose@ints];
BinaryWrite[outputstr, both, {"Real32", "Real32", "Real32", "Real32",
"Real32", "Real32", "Real32", "Real32", "Real32", "Integer32",
"Integer32"}, ByteOrdering -> +1]
Close[outputstr]

Answer



I was able to get 50x speedup w.r.t. your fastest code by using highly optimized Java buffered read functionality.


The idea


The idea is quite simple: use buffered read to reduce the IO overhead, and use Java to reduce the symbolic Mathematica overhead.
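As a rough illustration of the buffered-read idea only (this is not the implementation given below), here is a minimal sketch using the standard `BufferedInputStream` and `DataInputStream` classes. The class name `BufferedColumnSketch`, the method names, and the 64 KB buffer size are assumptions made for this example. `DataInputStream` reads multi-byte values in big-endian order, which matches the file format in the question.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper for illustration; not part of the TableReader class below.
class BufferedColumnSketch {

    // Reads one big-endian Real32 per row, located skipBefore bytes into each row.
    static float[] readFloatColumn(String filename, int rowByteCount,
                                   int skipBefore, int rows) throws IOException {
        float[] result = new float[rows];
        int skipAfter = rowByteCount - skipBefore - 4; // bytes left in the row after the value
        // The 64 KB buffer size is an arbitrary choice for this sketch.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(filename), 1 << 16))) {
            for (int i = 0; i < rows; i++) {
                skipFully(in, skipBefore);
                result[i] = in.readFloat(); // DataInputStream is big-endian
                skipFully(in, skipAfter);
            }
        }
        return result;
    }

    // skipBytes may skip fewer bytes than requested, so loop until done.
    static void skipFully(DataInputStream in, int n) throws IOException {
        while (n > 0) {
            int skipped = in.skipBytes(n);
            if (skipped <= 0) {
                throw new EOFException("unexpected end of file while skipping");
            }
            n -= skipped;
        }
    }
}
```

The buffer means most reads and skips are served from memory rather than triggering a separate system call per value, which is the overhead the per-element BinaryRead approach pays.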


Implementation



You will have to run the Java reloader. Then, you call


JCompileLoad@"
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.util.Arrays;

public class TableReader{


public static int byteArrayToInt(byte[] b){
return b[3] & 0xFF |
(b[2] & 0xFF) << 8 |
(b[1] & 0xFF) << 16 |
(b[0] & 0xFF) << 24;
}


public static int[] getIntegerColumn(String filename, int rowByteCount,

int skipBefore, int skipAfter, int rowChunkSize)
throws FileNotFoundException,IOException{
File fl = new File(filename);
FileInputStream str = new FileInputStream(fl);
FileChannel ch = str.getChannel( );
MappedByteBuffer mb = ch.map( FileChannel.MapMode.READ_ONLY, 0L, ch.size( ) );
final int buffrows = rowChunkSize;
final int buffSize = buffrows * rowByteCount;
byte[] buffer = new byte[buffSize];
int rows = (int)(fl.length()/rowByteCount);

int[] result = new int[rows];
int cycles = (int)(rows/buffrows);
int remaining = rows % buffrows;
byte[] remBuffer = new byte[remaining * rowByteCount];
int ctr=0;
try{
for(int j=0;j<cycles;j++){
int bctr = 0;
mb.get(buffer);
for(int i=0;i < buffrows;i++){

bctr+=skipBefore;
result[ctr++] = byteArrayToInt(Arrays.copyOfRange(buffer,bctr,bctr+4));
bctr+=4+skipAfter;
}
}
int bctr = 0;
mb.get(remBuffer);
for(int i=0; i < remaining;i++){
bctr+=skipBefore;
result[ctr++] = byteArrayToInt(Arrays.copyOfRange(remBuffer,bctr,bctr+4));

bctr+=4+skipAfter;
}
} finally{
str.close();
}
return result;
}

public static float[] getFloatColumn(String filename, int rowByteCount, int skipBefore, int skipAfter, int rowChunkSize)
throws FileNotFoundException, IOException{

File fl = new File(filename);
FileInputStream str = new FileInputStream(fl);
FileChannel ch = str.getChannel();
MappedByteBuffer mb = ch.map( FileChannel.MapMode.READ_ONLY, 0L, ch.size( ));
final int buffrows = rowChunkSize;
final int buffSize = buffrows * rowByteCount;
byte[] buffer = new byte[buffSize];
int rows = (int)(fl.length()/rowByteCount);
float[] result = new float[rows];
byte[] intermediate = new byte[4*rows];

int cycles = (int)(rows/buffrows);
int remaining = rows % buffrows;
byte[] remBuffer = new byte[remaining * rowByteCount];
int ctr=0;
try{
for(int j=0;j<cycles;j++){
int bctr = 0;
mb.get(buffer);
for(int i=0;i < buffrows;i++){
bctr+=skipBefore;

System.arraycopy(buffer, bctr,intermediate,4*ctr++,4);
bctr+=4+skipAfter;
}
}
int bctr = 0;
mb.get(remBuffer);
for(int i=0; i < remaining;i++){
bctr+=skipBefore;
System.arraycopy(remBuffer, bctr,intermediate,4*ctr++,4);
bctr+=4+skipAfter;

}
ByteBuffer buf2 = ByteBuffer.wrap(intermediate);
for(int i=0;i<rows;i++){
result[i]=buf2.getFloat();
}
} finally{
str.close();
}
return result;
}

}"

Usage


There are 2 static methods, to extract a single column, of integer or floating point numbers. Both take the same set of 5 parameters: file name, total bytes in one row, bytes to skip before reading the element in one row, bytes to skip after, and the number of rows in a buffer for buffered read.
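To make the skip parameters concrete, here is a small sketch that derives them from a zero-based column index, assuming every field in a row is 4 bytes wide (true for Real32 and Integer32). The class `SkipArgs` and its methods are hypothetical helpers for this example and are not part of TableReader.

```java
// Hypothetical helper: derives the skip arguments for a row of 4-byte fields.
class SkipArgs {
    static final int FIELD_BYTES = 4; // Real32 and Integer32 are both 4 bytes wide

    // Bytes in a row that precede the given zero-based column.
    static int skipBefore(int column) {
        return column * FIELD_BYTES;
    }

    // Bytes in a row that follow the given zero-based column.
    static int skipAfter(int column, int fieldsPerRow) {
        return (fieldsPerRow - column - 1) * FIELD_BYTES;
    }

    public static void main(String[] args) {
        int fields = 11;
        // The 4th Real32 is column 3; the 2nd Integer32 is column 10.
        System.out.println(skipBefore(3) + " " + skipAfter(3, fields));   // prints "12 28"
        System.out.println(skipBefore(10) + " " + skipAfter(10, fields)); // prints "40 0"
    }
}
```

For the 11-field rows in the question, these are the same values as the 3*4, 7*4 and 10*4, 0 arguments used in the benchmark calls.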


Benchmarks


Using your code to produce the 40 MB file, I get:


(jdataInt = TableReader`getIntegerColumn[name,11*4,10*4,0,100]) //Length//AbsoluteTiming



(* {0.0898438,1000000} *)

(jdataFl = TableReader`getFloatColumn[name,11*4,3*4,7*4,100]) //Length//AbsoluteTiming

(* {0.0839844,1000000} *)

while your code on my machine gives


str = OpenRead[name, BinaryFormat -> True];
pos = Range[12, FileByteCount[name], 11*4];

data = {
SetStreamPosition[str, #];
BinaryRead[str, "Real32", ByteOrdering -> +1],
SetStreamPosition[str, # + 28];
BinaryRead[str, "Integer32", ByteOrdering -> +1]
} & /@ pos; // AbsoluteTiming
Close[str];

(* {9.1044922,Null} *)


And we can verify:


Flatten[data[[All,1]]] == jdataFl

(* True *)

Flatten[data[[All,2]]]==jdataInt

(* True *)

Conclusions



By using buffered reads implemented in a compiled language with highly optimized native IO routines (the latter implemented by much brighter people than me :)), I was able to gain a speedup of about two orders of magnitude per column. I suspect that the running time for the Java code is dominated by data transfer, so the read itself is quite a bit faster still. I also think that by going to C one can gain further significant speedups, although probably not as dramatic as here.


Note that I was quite sloppy with the Java-side error-handling, partly intentionally to optimize for speed, partly because at this point I did not really care. If someone decides to use this in production code, more care must be taken however.
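As one possible direction for that extra care, the sketch below validates its arguments, checks the file size, and uses try-with-resources so the channel is always closed. It is an assumption-laden variant, not the code above: the class name `CheckedColumnReader` is made up, and it reads values straight from the mapped buffer with absolute `getFloat(index)` calls instead of copying chunks. One fact worth knowing here is that a single `FileChannel.map` call cannot map more than `Integer.MAX_VALUE` bytes, so files of roughly 2 GB or more would need several mappings, which this sketch does not attempt.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical variant for illustration; not the TableReader class from the answer.
class CheckedColumnReader {

    static float[] getFloatColumn(String filename, int rowByteCount, int skipBefore)
            throws IOException {
        if (rowByteCount <= 0 || skipBefore < 0 || skipBefore + 4 > rowByteCount) {
            throw new IllegalArgumentException("column does not fit inside a row");
        }
        // try-with-resources closes the channel even if an exception is thrown.
        try (FileChannel ch = FileChannel.open(Paths.get(filename), StandardOpenOption.READ)) {
            long size = ch.size();
            if (size % rowByteCount != 0) {
                throw new IOException("file size is not a whole number of rows");
            }
            if (size > Integer.MAX_VALUE) {
                // A single mapping is limited to Integer.MAX_VALUE bytes.
                throw new IOException("file too large for a single mapping");
            }
            MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, size);
            mb.order(ByteOrder.BIG_ENDIAN); // explicit, although it is the default order
            int rows = (int) (size / rowByteCount);
            float[] result = new float[rows];
            for (int i = 0; i < rows; i++) {
                // Absolute read: does not move the buffer position.
                result[i] = mb.getFloat(i * rowByteCount + skipBefore);
            }
            return result;
        }
    }
}
```

Whether this is faster or slower than the chunked copy above has not been measured; the point of the sketch is only the validation and resource handling.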

