programming - Reading periodic elements from a large file

I have a large binary data file (big endian) with 100+ million "rows" of 11 elements each, a combination of floats and integers.

This is the format:

{"Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Integer32", "Integer32"}

The question "How to read data file quickly?" is related but not exactly the same.

I've been reading in the whole file like this:

str = OpenRead[filename, BinaryFormat -> True];
data = BinaryReadList[str, {"Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Real32", "Integer32", "Integer32"}, ByteOrdering -> +1];

This requires lots and lots of memory, and in the end I throw away most of the data most of the time. Usually I am just interested in the 4th Real32 and the 2nd Integer32 of each "row". I would like to read only the 4th Real32 and the 2nd Integer32 of each "row" if possible and skip over the rest.

I've tried to use Skip, but the documentation isn't clear on whether it works with BinaryReadList. I get the error "Skip::readf: Real32 is not a valid format specification."

The documentation doesn't describe that you can skip byte by byte, but you can...

str = OpenRead[name, BinaryFormat -> True];
count = FileByteCount[name]/(11*4);

reading = Table[{Skip[str, Byte, 12];
     BinaryRead[str, "Real32", ByteOrdering -> +1], 
     Skip[str, Byte, 24];
     BinaryRead[str, "Integer32", ByteOrdering -> +1]},
 {count}]; // AbsoluteTiming

edit: This code works now, but it is very slow: it takes about a minute to load a file that BinaryReadList loads in only 15 seconds. However, the memory overhead is orders of magnitude lower.

edit2: Skip appears to be very slow, much slower than SetStreamPosition for some reason. So I wrote some new code that uses SetStreamPosition with a precomputed list of stream positions in bytes. It is about twice as fast as the Skip version, which is okay, but it's still about 3x slower than BinaryReadList.

pos = Range[12, FileByteCount[name], 11*4];
data = {SetStreamPosition[str, #]; 
      BinaryRead[str, "Real32", ByteOrdering -> +1], 
      SetStreamPosition[str, # + 28]; 
      BinaryRead[str, "Integer32", ByteOrdering -> +1]} & /@ pos; // AbsoluteTiming

Hopefully, someone will have an idea how this can be improved. Memory usage is still low, as expected.

I'm willing to tolerate a slight slow down (maybe 2x but not 5-10x) if there is a considerable memory savings to be gained but it would be great if the process could be sped up as well.

I can't easily provide a copy of my data files, as they are hundreds of megabytes each. I tried to write some code that generates random data and writes it to a file; however, BinaryWrite appears to be extremely slow. I'm on a fast machine with a solid state drive and it's writing only a few hundred kilobytes per second. Here is the code regardless; maybe someone knows a faster way to make a random binary data file. This will make a ~40 MB file.

outputstr = OpenWrite["randomdata", BinaryFormat -> True]
reals = RandomReal[100, {10^6, 9}];
ints = RandomInteger[100, {10^6, 2}];

both = Flatten@Transpose@Join[Transpose@reals, Transpose@ints];
BinaryWrite[outputstr, both, {"Real32", "Real32", "Real32", "Real32", 
  "Real32", "Real32", "Real32", "Real32", "Real32", "Integer32", 
  "Integer32"}, ByteOrdering -> +1]
Close[outputstr]

Answer

I was able to get a 50x speedup over your fastest code by using Java's highly optimized buffered read functionality.

The idea

The idea is quite simple: use buffered read to reduce the IO overhead, and use Java to reduce the symbolic Mathematica overhead.
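
As a rough, standalone illustration of that idea (not the code used for the timings in this answer), the sketch below reads a file in chunks of whole rows and pulls one big-endian 32-bit float out of each row. The class name BufferedColumnSketch, the method readFloatColumn and its parameters are hypothetical names chosen for the example; it assumes fixed-size rows, that offset + 4 fits within a row, and that the row count fits in an int.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: names and parameters are assumptions for illustration.
final class BufferedColumnSketch {

    // Reads one big-endian float per fixed-size row, chunkRows rows per read.
    // offset is the byte position of the wanted field inside a row.
    static float[] readFloatColumn(Path file, int rowBytes, int offset, int chunkRows)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Any trailing partial row is ignored by the integer division.
            int rows = Math.toIntExact(ch.size() / rowBytes);
            float[] result = new float[rows];
            ByteBuffer buf = ByteBuffer.allocate(chunkRows * rowBytes)
                                       .order(ByteOrder.BIG_ENDIAN);
            int row = 0;
            while (row < rows) {
                // Fill the buffer with as many whole rows as are left, up to chunkRows.
                int wanted = Math.min(chunkRows, rows - row) * rowBytes;
                buf.clear();
                buf.limit(wanted);
                while (buf.hasRemaining()) {
                    if (ch.read(buf) < 0) {
                        throw new EOFException("file ended before the expected row count");
                    }
                }
                // Absolute reads: jump from row to row without touching other fields.
                for (int pos = offset; pos < wanted; pos += rowBytes) {
                    result[row++] = buf.getFloat(pos);
                }
            }
            return result;
        }
    }
}
```

For the file layout in the question, the 4th Real32 would correspond to rowBytes = 44 and offset = 12. The chunk size trades memory for fewer read calls.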

Implementation

You will have to run the Java reloader. Then, you call

JCompileLoad@"
   import java.io.*;
   import java.nio.ByteBuffer;
   import java.nio.MappedByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.channels.FileChannel.MapMode;
   import java.util.Arrays;

   public class TableReader{


     public static int byteArrayToInt(byte[] b){
       return   b[3] & 0xFF |
        (b[2] & 0xFF) << 8 |
        (b[1] & 0xFF) << 16 |
        (b[0] & 0xFF) << 24;
     }


    public  static int[] getIntegerColumn(String filename, int rowByteCount, 
       int skipBefore, int skipAfter, int rowChunkSize) 
         throws FileNotFoundException,IOException{
            File fl = new File(filename);
            FileInputStream str = new FileInputStream(fl);
            FileChannel ch = str.getChannel( );
            MappedByteBuffer mb = ch.map( FileChannel.MapMode.READ_ONLY, 0L, ch.size( ) );
            final int buffrows = rowChunkSize;
            final int buffSize = buffrows * rowByteCount;
            byte[] buffer = new byte[buffSize];
            int rows  = (int)(fl.length()/rowByteCount);

            int[] result = new int[rows];
            int cycles  = (int)(rows/buffrows);
            int remaining = rows % buffrows;
            byte[] remBuffer = new byte[remaining * rowByteCount];
            int ctr=0; 
            try{
               for(int j=0;j<cycles;j++){
                  int bctr = 0;
                  mb.get(buffer);
                  for(int i=0;i < buffrows;i++){

                     bctr+=skipBefore;              
                     result[ctr++] = byteArrayToInt(Arrays.copyOfRange(buffer,bctr,bctr+4));
                     bctr+=4+skipAfter;
                  }
               }
               int bctr = 0;
               mb.get(remBuffer);
               for(int i=0; i < remaining;i++){
                  bctr+=skipBefore;             
                  result[ctr++] = byteArrayToInt(Arrays.copyOfRange(remBuffer,bctr,bctr+4));

                  bctr+=4+skipAfter;
               }            
            } finally{
               str.close();
            }
            return result;      
     }

     public  static float[] getFloatColumn(String filename, int rowByteCount, int skipBefore, int skipAfter, int rowChunkSize) 
       throws FileNotFoundException, IOException{

           File fl = new File(filename);
           FileInputStream str = new FileInputStream(fl);
           FileChannel ch = str.getChannel();
           MappedByteBuffer mb = ch.map( FileChannel.MapMode.READ_ONLY, 0L, ch.size( ));
           final int buffrows = rowChunkSize;
           final int buffSize = buffrows * rowByteCount;
           byte[] buffer = new byte[buffSize];
           int rows  = (int)(fl.length()/rowByteCount);
           float[] result = new float[rows];
           byte[] intermediate = new byte[4*rows];

           int cycles  = (int)(rows/buffrows);
           int remaining = rows % buffrows;
           byte[] remBuffer = new byte[remaining * rowByteCount];
           int ctr=0; 
           try{
              for(int j=0;j<cycles;j++){
                 int bctr = 0;
                 mb.get(buffer);
                 for(int i=0;i < buffrows;i++){
                    bctr+=skipBefore;               

                    System.arraycopy(buffer, bctr,intermediate,4*ctr++,4);                  
                    bctr+=4+skipAfter;
                 }
              }
              int bctr = 0;
              mb.get(remBuffer);
              for(int i=0; i < remaining;i++){
                  bctr+=skipBefore;             
                  System.arraycopy(remBuffer, bctr,intermediate,4*ctr++,4);
                  bctr+=4+skipAfter;

              }
              ByteBuffer buf2 = ByteBuffer.wrap(intermediate);
              for(int i=0;i<rows;i++){
                 result[i]=buf2.getFloat();
              }         
           } finally{
              str.close();
           }
           return result;       
    }   

 }"

Usage

There are two static methods, each extracting a single column: one for integers and one for floating-point numbers. Both take the same five parameters: the file name, the total number of bytes in one row, the number of bytes to skip before the element within a row, the number of bytes to skip after it, and the number of rows per buffer for the buffered read.
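
To make the relationship between the byte-count parameters concrete, here is a small sketch that derives them from a zero-based field index, assuming every field in a row is 4 bytes wide (true for Real32 and Integer32). RowLayoutSketch and its methods are hypothetical helpers for illustration only; they are not part of the TableReader class above.

```java
// Hypothetical helper: derives the byte-count arguments from a field index.
final class RowLayoutSketch {
    static final int FIELD_BYTES = 4; // Real32 and Integer32 are both 4 bytes

    // Total bytes in one row.
    static int rowByteCount(int fieldsPerRow) {
        return fieldsPerRow * FIELD_BYTES;
    }

    // Bytes preceding the wanted field within a row (fieldIndex is zero-based).
    static int skipBefore(int fieldIndex) {
        return fieldIndex * FIELD_BYTES;
    }

    // Bytes following the wanted field within a row.
    static int skipAfter(int fieldsPerRow, int fieldIndex) {
        return (fieldsPerRow - fieldIndex - 1) * FIELD_BYTES;
    }

    public static void main(String[] args) {
        int fields = 11;
        // 4th Real32 is zero-based field 3: prints "44 12 28"
        System.out.println(rowByteCount(fields) + " " + skipBefore(3) + " "
                + skipAfter(fields, 3));
        // 2nd Integer32 is zero-based field 10: prints "44 40 0"
        System.out.println(rowByteCount(fields) + " " + skipBefore(10) + " "
                + skipAfter(fields, 10));
    }
}
```

In every case, skipBefore + 4 + skipAfter equals the row byte count.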

Benchmarks

Using your code to produce the 40 MB file, I get:

(jdataInt = TableReader`getIntegerColumn[name,11*4,10*4,0,100])
   //Length//AbsoluteTiming

(*  {0.0898438,1000000}  *)

(jdataFl =  TableReader`getFloatColumn[name,11*4,3*4,7*4,100])
   //Length//AbsoluteTiming

(* {0.0839844,1000000} *)

while your code on my machine gives

str = OpenRead[name, BinaryFormat -> True];
pos = Range[12, FileByteCount[name], 11*4];

data = {
  SetStreamPosition[str, #];
  BinaryRead[str, "Real32", ByteOrdering -> +1], 
  SetStreamPosition[str, # + 28]; 
  BinaryRead[str, "Integer32", ByteOrdering -> +1]
} & /@  pos; // AbsoluteTiming
Close[str];

(* {9.1044922,Null} *)

And we can verify:

Flatten[data[[All,1]]] == jdataFl

(* True *)

Flatten[data[[All,2]]]==jdataInt

(* True *)

Conclusions

By using buffered reads implemented in a compiled language with highly optimized native IO routines (the latter implemented by much brighter people than me :)), I was able to gain two orders of magnitude speedup. I suspect that the running time for the Java code is dominated by data transfer, so the read itself is quite a bit faster still. I also think that by going to C one could gain further significant speedups, although probably not as dramatic as here.

Note that I was quite sloppy with the Java-side error-handling, partly intentionally to optimize for speed, partly because at this point I did not really care. If someone decides to use this in production code, more care must be taken however.
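
As one possible starting point for that extra care, the sketch below shows the kind of argument checks that could run before any reading starts. TableReaderChecks and checkedRowCount are hypothetical names, and the specific checks are assumptions about what a caller might want, not part of the code above. The size check reflects the fact that FileChannel.map cannot map more than Integer.MAX_VALUE bytes in a single call.

```java
// Hypothetical pre-flight checks; not part of the TableReader class above.
final class TableReaderChecks {

    // Validates the layout arguments and returns the number of rows in the file.
    static int checkedRowCount(long fileLength, int rowByteCount,
                               int skipBefore, int skipAfter) {
        if (rowByteCount <= 0 || skipBefore < 0 || skipAfter < 0) {
            throw new IllegalArgumentException(
                "row size must be positive and skips non-negative");
        }
        // The wanted 4-byte field plus both skips must cover exactly one row.
        if ((long) skipBefore + 4 + skipAfter != rowByteCount) {
            throw new IllegalArgumentException(
                "skipBefore + 4 + skipAfter must equal rowByteCount");
        }
        // A single memory mapping is limited to Integer.MAX_VALUE bytes.
        if (fileLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "file too large to map in one piece");
        }
        if (fileLength % rowByteCount != 0) {
            throw new IllegalArgumentException(
                "file length is not a whole number of rows");
        }
        return (int) (fileLength / rowByteCount);
    }
}
```

Larger files would need to be mapped or read in several pieces.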

