I have a tab-separated values file with 10 million rows, each of which has three tab-separated values. The first value is a string, the second an integer, and the third another string. How can I efficiently (in terms of time and memory footprint) read the $n^{th}$ to $(n+100)^{th}$ rows of the file into Mathematica as
{
{_String, _Integer, _String},
...
}
?
Answer
For a one-off read you can Skip a number of records:
str = OpenRead["test.tsv"];
Skip[str, Record, n - 1]; (* skip the first n - 1 lines *)
(* rows n through n + 100 inclusive are 101 rows *)
data = ReadList[str, {Record, Number, Record}, 101, RecordSeparators -> {"\t", "\n"}];
Close[str];
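As a minimal sketch, the one-off read can be wrapped in a helper that always closes the stream it opens. The name readRows, the sample file, and its contents are illustrative assumptions rather than part of the answer, and the sketch assumes the string fields contain no tabs or newlines.

```mathematica
(* Build a small sample file so the sketch runs on its own (hypothetical data). *)
file = FileNameJoin[{$TemporaryDirectory, "sample.tsv"}];
rows = Table[{"a" <> ToString[i], i, "b" <> ToString[i]}, {i, 500}];
Export[file, StringRiffle[rows, "\n", "\t"] <> "\n", "Text"];

(* Hypothetical helper: read `count` rows starting at row n (1-based). *)
readRows[path_String, n_Integer?Positive, count_Integer?Positive] :=
 Module[{str, data},
  str = OpenRead[path];
  If[n > 1, Skip[str, Record, n - 1]]; (* skip the first n - 1 lines *)
  data = ReadList[str, {Record, Number, Record}, count,
    RecordSeparators -> {"\t", "\n"}];
  Close[str];
  data]

(* Rows 200 through 300 inclusive are 101 rows. *)
data = readRows[file, 200, 101];
```

Only the requested rows are ever held in memory; the skipped lines are scanned but not stored.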
If you will be reading from the same file many times, it may be worth building an index that you can use with SetStreamPosition:
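str = OpenRead["test.tsv"];
(* byte offset of the start of each of the first 100000 rows; raise the count to index more of the file *)
index = Table[pos = StreamPosition[str]; Skip[str, Record]; pos, {100000}];
(* jump straight to row n and read m rows *)
readlines[n_, m_] := Block[{},
  SetStreamPosition[str, index[[n]]];
  ReadList[str, {Record, Number, Record}, m, RecordSeparators -> {"\t", "\n"}]]
data = readlines[50000, 100]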
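The Table call hard-codes the number of rows to index. One possible variant, sketched here and not taken from the answer, walks the file until Skip stops succeeding, so the row count need not be known in advance. It assumes Skip returns Null on success and some other value once the end of the file is reached; the names buildIndex and readLines and the sample file are hypothetical.

```mathematica
(* Hypothetical sample file so the sketch is self-contained. *)
file = FileNameJoin[{$TemporaryDirectory, "sample_index.tsv"}];
rows = Table[{"a" <> ToString[i], i, "b" <> ToString[i]}, {i, 500}];
Export[file, StringRiffle[rows, "\n", "\t"] <> "\n", "Text"];

(* Record the offset of the start of every line until Skip no longer succeeds. *)
buildIndex[str_] := Module[{pos},
  SetStreamPosition[str, 0];
  Flatten[Reap[
     While[True,
      pos = StreamPosition[str];
      If[Skip[str, Record] =!= Null, Break[]]; (* assumed: non-Null at end of file *)
      Sow[pos]]][[2]]]]

str = OpenRead[file];
index = buildIndex[str];

(* Jump straight to row n and read m rows. *)
readLines[n_, m_] := (
  SetStreamPosition[str, index[[n]]];
  ReadList[str, {Record, Number, Record}, m,
   RecordSeparators -> {"\t", "\n"}])

data = readLines[250, 100];
Close[str];
```

The index is a flat list of integers, one per row, so for 10^7 rows it stays small compared with the file itself.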
On my PC, building the index took about half a second for a file of 10^5 rows. Assuming it scales linearly, that would be about a minute for 10^7 rows, so this is only worth doing if you are going to do a lot of reads.
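As a rough check of that extrapolation, assuming the cost per row stays constant:

$$t_{10^7} \approx 0.5\,\text{s} \times \frac{10^7}{10^5} = 0.5\,\text{s} \times 100 = 50\,\text{s},$$

which is on the order of a minute.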