I have a large binary data file (big endian) with 100+ million "rows" of 11 elements, combination of floats and integers.
This is the format:
{"Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Integer32", "Integer32"}
The question "How to read data file quickly?" is related, but not exactly the same.
I've been reading in the whole file like this:
str = OpenRead[filename, BinaryFormat -> True];
data = BinaryReadList[str, {"Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Integer32", "Integer32"}, ByteOrdering -> +1];
This requires lots and lots of memory, and in the end I throw away most of the data most of the time. Usually I am just interested in the 4th Real32 and the 2nd Integer32 of each "row". I would like to read only the 4th Real32 and the 2nd Integer32 of each "row", if possible, and skip over the rest.
I've tried to use Skip, but the documentation isn't clear on whether it works with BinaryReadList. I get the error Skip::readf: Real32 is not a valid format specification.
The documentation doesn't describe that you can skip byte by byte, but you can...
str = OpenRead[name, BinaryFormat -> True];
count = FileByteCount[name]/(11*4);
reading = Table[{Skip[str, Byte, 12];
BinaryRead[str, "Real32", ByteOrdering -> +1],
Skip[str, Byte, 24];
BinaryRead[str, "Integer32", ByteOrdering -> +1]},
{count}]; // AbsoluteTiming
edit: This code works now, but it is very slow: about a minute to load a file that takes only 15 seconds with BinaryReadList. However, the memory overhead is orders of magnitude lower.
edit2: Skip appears to be very slow, much slower than SetStreamPosition for some reason. So I wrote some new code that uses SetStreamPosition with a precomputed list of stream positions in bytes. It is about twice as fast as the Skip version, which is okay, but it's still about 3x slower than BinaryReadList.
pos = Range[12, FileByteCount[name], 11*4];
data = {SetStreamPosition[str, #];
BinaryRead[str, "Real32", ByteOrdering -> +1],
SetStreamPosition[str, # + 28];
BinaryRead[str, "Integer32", ByteOrdering -> +1]} & /@ pos; // AbsoluteTiming
Hopefully, someone will have an idea how this can be improved. Memory usage is still low, as expected.
I'm willing to tolerate a slight slow down (maybe 2x but not 5-10x) if there is a considerable memory savings to be gained but it would be great if the process could be sped up as well.
I can't easily provide a copy of my data files, as they are hundreds of megabytes each. I tried to write some code that generates random data and writes it to a file; however, BinaryWrite appears to be extremely slow... I'm on a fast machine with a solid-state drive and it's going at only a few hundred kilobytes per second... Here is the code regardless; maybe someone knows a faster way to make a random binary data file. This will make a ~40 MB file.
outputstr = OpenWrite["randomdata", BinaryFormat -> True]
reals = RandomReal[100, {10^6, 9}];
ints = RandomInteger[100, {10^6, 2}];
both = Flatten@Transpose@Join[Transpose@reals, Transpose@ints];
BinaryWrite[outputstr, both, {"Real32", "Real32", "Real32", "Real32",
"Real32", "Real32", "Real32", "Real32", "Real32", "Integer32",
"Integer32"}, ByteOrdering -> +1]
Close[outputstr]
Answer
I was able to get 50x speedup w.r.t. your fastest code by using highly optimized Java buffered read functionality.
The idea
The idea is quite simple: use buffered read to reduce the IO overhead, and use Java to reduce the symbolic Mathematica overhead.
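As a rough illustration of this idea only (not the implementation given below), here is a minimal sketch that memory-maps the file and picks one big-endian float out of every row with an absolute `getFloat`. The class name `StridedColumnSketch`, the method `floatColumn`, and its parameters are hypothetical, and the sketch assumes the file is smaller than 2 GB, since a single `FileChannel.map` call cannot map more than `Integer.MAX_VALUE` bytes.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not the TableReader from this answer.
final class StridedColumnSketch {

    /** Reads one big-endian float per row, located `offset` bytes into each row. */
    static float[] floatColumn(Path file, int rowBytes, int offset) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Assumption: the file is under 2 GB, the limit for a single mapping.
            MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
            mb.order(ByteOrder.BIG_ENDIAN); // already the default; stated for clarity
            int rows = (int) (ch.size() / rowBytes);
            float[] out = new float[rows];
            for (int r = 0; r < rows; r++) {
                // Absolute get: jumps straight to the wanted field, no per-row copy.
                out[r] = mb.getFloat(r * rowBytes + offset);
            }
            return out;
        }
    }
}
```

With the layout from the question, `StridedColumnSketch.floatColumn(path, 44, 12)` would return the 4th Real32 of every row. The loop runs entirely on the Java side, so Mathematica only sees the finished array.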
Implementation
You will have to run the Java reloader. Then, you call
JCompileLoad@"
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.util.Arrays;
public class TableReader{
public static int byteArrayToInt(byte[] b){
return b[3] & 0xFF |
(b[2] & 0xFF) << 8 |
(b[1] & 0xFF) << 16 |
(b[0] & 0xFF) << 24;
}
public static int[] getIntegerColumn(String filename, int rowByteCount,
int skipBefore, int skipAfter, int rowChunkSize)
throws FileNotFoundException,IOException{
File fl = new File(filename);
FileInputStream str = new FileInputStream(fl);
FileChannel ch = str.getChannel( );
MappedByteBuffer mb = ch.map( FileChannel.MapMode.READ_ONLY, 0L, ch.size( ) );
final int buffrows = rowChunkSize;
final int buffSize = buffrows * rowByteCount;
byte[] buffer = new byte[buffSize];
int rows = (int)(fl.length()/rowByteCount);
int[] result = new int[rows];
int cycles = (int)(rows/buffrows);
int remaining = rows % buffrows;
byte[] remBuffer = new byte[remaining * rowByteCount];
int ctr=0;
try{
for(int j=0;j<cycles;j++){
int bctr = 0;
mb.get(buffer);
for(int i=0;i < buffrows;i++){
bctr+=skipBefore;
result[ctr++] = byteArrayToInt(Arrays.copyOfRange(buffer,bctr,bctr+4));
bctr+=4+skipAfter;
}
}
int bctr = 0;
mb.get(remBuffer);
for(int i=0; i < remaining;i++){
bctr+=skipBefore;
result[ctr++] = byteArrayToInt(Arrays.copyOfRange(remBuffer,bctr,bctr+4));
bctr+=4+skipAfter;
}
} finally{
str.close();
}
return result;
}
public static float[] getFloatColumn(String filename, int rowByteCount, int skipBefore, int skipAfter, int rowChunkSize)
throws FileNotFoundException, IOException{
File fl = new File(filename);
FileInputStream str = new FileInputStream(fl);
FileChannel ch = str.getChannel();
MappedByteBuffer mb = ch.map( FileChannel.MapMode.READ_ONLY, 0L, ch.size( ));
final int buffrows = rowChunkSize;
final int buffSize = buffrows * rowByteCount;
byte[] buffer = new byte[buffSize];
int rows = (int)(fl.length()/rowByteCount);
float[] result = new float[rows];
byte[] intermediate = new byte[4*rows];
int cycles = (int)(rows/buffrows);
int remaining = rows % buffrows;
byte[] remBuffer = new byte[remaining * rowByteCount];
int ctr=0;
try{
for(int j=0;j<cycles;j++){
int bctr = 0;
mb.get(buffer);
for(int i=0;i < buffrows;i++){
bctr+=skipBefore;
System.arraycopy(buffer, bctr,intermediate,4*ctr++,4);
bctr+=4+skipAfter;
}
}
int bctr = 0;
mb.get(remBuffer);
for(int i=0; i < remaining;i++){
bctr+=skipBefore;
System.arraycopy(remBuffer, bctr,intermediate,4*ctr++,4);
bctr+=4+skipAfter;
}
ByteBuffer buf2 = ByteBuffer.wrap(intermediate);
for(int i=0;i<rows;i++){
result[i]=buf2.getFloat();
}
} finally{
str.close();
}
return result;
}
}"
Usage
There are two static methods, each extracting a single column: one for integers and one for floating-point numbers. Both take the same set of 5 parameters: file name, total bytes in one row, bytes to skip before reading the element in one row, bytes to skip after it, and the number of rows in a buffer for the buffered read.
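The three byte counts follow directly from the row layout. As a small worked example, the sketch below derives them from a zero-based column index, assuming every field in a row is 4 bytes wide (true for Real32 and Integer32). The class `ColumnOffsets` and its `skips` method are hypothetical helpers, not part of TableReader.

```java
import java.util.Arrays;

// Hypothetical helper, not part of TableReader: derives the skip arguments
// for a row made of fixed-width 4-byte fields.
final class ColumnOffsets {
    static final int FIELD_BYTES = 4; // Real32 and Integer32 are both 4 bytes

    /** Returns {rowByteCount, skipBefore, skipAfter} for a zero-based column index. */
    static int[] skips(int columnsPerRow, int column) {
        if (column < 0 || column >= columnsPerRow) {
            throw new IllegalArgumentException("column out of range: " + column);
        }
        int rowByteCount = columnsPerRow * FIELD_BYTES;
        int skipBefore = column * FIELD_BYTES;
        int skipAfter = rowByteCount - skipBefore - FIELD_BYTES;
        return new int[] {rowByteCount, skipBefore, skipAfter};
    }

    public static void main(String[] args) {
        // 4th Real32 (index 3) and 2nd Integer32 (index 10) of an 11-field row
        System.out.println(Arrays.toString(skips(11, 3)));  // [44, 12, 28]
        System.out.println(Arrays.toString(skips(11, 10))); // [44, 40, 0]
    }
}
```

For the 11-field rows in the question, this gives 44, 12, 28 for the 4th Real32 and 44, 40, 0 for the 2nd Integer32.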
Benchmarks
Using your code to produce the 40 MB file, I get:
(jdataInt = TableReader`getIntegerColumn[name,11*4,10*4,0,100]) //Length//AbsoluteTiming
(* {0.0898438,1000000} *)
(jdataFl = TableReader`getFloatColumn[name,11*4,3*4,7*4,100]) //Length//AbsoluteTiming
(* {0.0839844,1000000} *)
while your code on my machine gives
str = OpenRead[name, BinaryFormat -> True];
pos = Range[12, FileByteCount[name], 11*4];
data = {
SetStreamPosition[str, #];
BinaryRead[str, "Real32", ByteOrdering -> +1],
SetStreamPosition[str, # + 28];
BinaryRead[str, "Integer32", ByteOrdering -> +1]
} & /@ pos; // AbsoluteTiming
Close[str];
(* {9.1044922,Null} *)
And we can verify:
Flatten[data[[All,1]]] == jdataFl
(* True *)
Flatten[data[[All,2]]]==jdataInt
(* True *)
Conclusions
By using buffered reads implemented in a compiled language with highly optimized native IO routines (the latter implemented by much brighter people than me :)), I was able to gain a speedup of two orders of magnitude. I suspect that the running time for the Java code is dominated by data transfer, so the read itself is quite a bit faster still. I also think that by going to C one can gain further significant speedups, although probably not as dramatic as here.
Note that I was quite sloppy with the Java-side error-handling, partly intentionally to optimize for speed, partly because at this point I did not really care. If someone decides to use this in production code, more care must be taken however.
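To give a sense of what that extra care might look like, here is a hedged sketch of an integer-column reader with basic argument checks, try-with-resources, and plain chunked channel reads instead of a single mapping. The chunked reads matter for very large inputs: `FileChannel.map` rejects sizes above `Integer.MAX_VALUE` bytes, and a file with 100 million 44-byte rows is about 4.4 GB. The class `CheckedColumnReader`, its method name, and the particular checks are assumptions for illustration, and it has not been benchmarked against the code above.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical variant for illustration; names and checks are assumptions.
final class CheckedColumnReader {

    /** Reads one big-endian int per row, located `offset` bytes into each row. */
    static int[] intColumn(Path file, int rowBytes, int offset, int rowsPerChunk)
            throws IOException {
        if (rowBytes <= 0 || rowsPerChunk <= 0 || offset < 0 || offset + 4 > rowBytes) {
            throw new IllegalArgumentException("inconsistent row layout");
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size % rowBytes != 0) {
                throw new IOException("file size is not a multiple of the row size");
            }
            long rows = size / rowBytes;
            if (rows > Integer.MAX_VALUE - 8) {
                throw new IOException("too many rows for one Java array");
            }
            int[] out = new int[(int) rows];
            // ByteBuffer is big-endian by default, matching the file format.
            ByteBuffer buf = ByteBuffer.allocate(Math.multiplyExact(rowsPerChunk, rowBytes));
            int done = 0;
            while (done < out.length) {
                int n = Math.min(rowsPerChunk, out.length - done);
                buf.clear();
                buf.limit(n * rowBytes);
                while (buf.hasRemaining()) { // a single read may return fewer bytes
                    if (ch.read(buf) < 0) {
                        throw new EOFException("file truncated while reading");
                    }
                }
                for (int i = 0; i < n; i++) {
                    out[done++] = buf.getInt(i * rowBytes + offset); // absolute get
                }
            }
            return out;
        }
    }
}
```

A call such as `CheckedColumnReader.intColumn(path, 44, 40, 100)` would correspond to the `getIntegerColumn` benchmark above. Only one chunk of the file is held in memory at a time, besides the result array.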