When I read .Z file by something like
jpldatstring = Import["ftp://cddis.gsfc.nasa.gov/products/jpl/2003/jplg0040.04i.Z"]
Unfortunately, it turned out that "Import::infer: "Cannot infer format of file !(\"jplg0040.04i.Z\")"". Indeed, the form “.Z” isn’t included in the ImportFormats. Mathematica can read other types of compressed file included in $ImportFormats quite well such as “GZIP”,”ZIP”, “TAR”.
The .Z files I deal with can be uncompressed by certain software. But there are thousands of them added up to several GBytes , so an automatic way is needed. How can it be done in Mathematica?
The compressed .Z files can be uncompressed by 'uncompress' order under Linux, so is there any way to code Linux order into Mathematica program?
The dat example can be downloaded from ftp://cddis.gsfc.nasa.gov/products/jpl/2003/
The commit of george2079 shows a way out:
dat = Import[ "!uncompress -c /mnt/data/home/huangjp/magnphy/jpl/jplg0040.04i.Z", "GZIP"];
It returns a dir which contains the uncompressed file,again. which means I have to Imoport dat Is there anyway to read dat into mathematica directly? The trick of coding linux order into mathematica used isn't found in document. where could I learn more about this kind of skill?
Answer
This is an unfortunate oversight on the part of Wolfram and would be of general interest, so I wrote a LibraryFunction
in C (compiled using CreateLibrary
) that will decompress data compressed using the Unix compress command (i.e. .Z files, compressed in the LZW format).
Included are functions to extend Import
and ImportString
for the LZW format. You can then do:
importLZW["ftp://cddis.gsfc.nasa.gov/products/jpl/2003/comb2003_midnight.eop.Z", "Table"]
or:
Import["ftp://cddis.gsfc.nasa.gov/products/jpl/2003/comb2003_midnight.eop.Z", {"LZW", "Table"}]
which give the same result.
Here is the notebook (with an attempt at formatting for this site):
LZW Package
Mark Adler, 22 August 2015
Start Package
BeginPackage["LZW`"]
unlzw::usage="unlzw[str] decompresses str, assuming that str is the result of compressing data using the Unix compress command. Unix compress uses the LZW (Lempel\[Dash]Ziv\[Dash]Welch) algorithm. Files compressed using the Unix compress command will have the suffix \".Z\". unlzw[] returns the decompressed data or NULL is the data was not compressed using the Unix compress command, or if the compressed data was corrupted.";
importLZW::usage="importLZW[] can be used in place of Import[], extending its functionality by detecting and, if necessary, decompressing data that was compressed with the Unix compress command. Files compressed using the Unix compress command will have the suffix \".Z\".";
importStringLZW::usage="importStringLZW[] can be used in place of ImportString[], extending its functionality by detecting and, if necessary, decompressing data that was compressed with the Unix compress command. Files compressed using the Unix compress command will have the suffix \".Z\".";
Begin["`Private`"]
Compile unlzw C source code
Needs["CCompilerDriver`"]
unlzw() is adapted from the pigz source code. This decompressor is 20% faster than Unix uncompress.
unlzw$source="
/*
unlzw version 1.4, 22 August 2015
Copyright (C) 2014, 2015 Mark Adler
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler@alumni.caltech.edu
*/
/* Version history:
1.0 28 Sep 2014 First version
1.1 1 Oct 2014 Cast before shift of bit buffer for portability
Use fastest 32-bit type for bit buffer, uint_fast32_t
Use uint_least16_t in case a 16-bit type is not available
1.2 3 Oct 2014 Clean up comments, consolidate return values
1.3 20 Aug 2015 Assure no out-of-bounds access on invalid input
1.4 22 Aug 2015 Return uncompressed data so far on error conditions
Be more permissive on where the input is allowed to end
*/
#include
#include
/* Type for accumulating bits. 23 bits of the register are used to accumulate
up to 16-bit symbols. */
typedef uint_fast32_t bits_t;
/* Double size_t variable n, saturating at the maximum size_t value. */
#define DOUBLE(n) \\
do { \\
size_t was = n; \\
n <<= 1; \\
if (n < was) \\
n = (size_t)0 - 1; \\
} while (0)
/* Decompress compressed data generated by the Unix compress utility (LZW
compression, files with suffix .Z). Decompress in[0..inlen-1] to an
allocated buffer (*out)[0..*outlen-1]. The length of the uncompressed data
in the allocated buffer is returned in *outlen. unlzw() returns zero on
success, negative if the compressed data is invalid, or 1 if out of memory.
The negative return values are -1 for an invalid header, -2 if the first
code is not a literal or if an invalid code is detected, and -3 if the
stream ended in the middle of a code. -1 means that the data was not
produced by Unix compress, -2 generally means random or corrupted data, and
-3 generally means prematurely terminated data. If the decompression
results in a proper zero-length output, then unlzw() returns zero, *outlen
is zero, and *out is NULL. On error, any decompressed data up to that point
is returned using *out and *outlen. */
static int unlzw(unsigned const char *in, size_t inlen,
unsigned char **out, size_t *outlen)
{
unsigned bits; /* current number of bits per code (9..16) */
unsigned mask; /* mask for current bits codes = (1< bits_t buf; /* bit buffer -- holds up to 23 bits */
unsigned left; /* bits left in buf (0..7 after code pulled) */
size_t next; /* index of next input byte in in[] */
size_t mark; /* index where last change in bits began */
unsigned code; /* code, table traversal index */
unsigned max; /* maximum bits per code for this stream */
unsigned flags; /* compress flags, then block compress flag */
unsigned end; /* last valid entry in prefix/suffix tables */
unsigned prev; /* previous code */
unsigned final; /* last character written for previous code */
unsigned stack; /* next position for reversed string */
unsigned char *put; /* allocated output buffer */
size_t size; /* size of put[] allocation */
size_t have; /* number of bytes of data in put[] */
int ret = 0; /* return code */
/* memory for unlzw() -- the first 256 entries of prefix[] and suffix[] are
never used, so could have offset the index but it's faster to waste a
little memory */
uint_least16_t prefix[65536]; /* index to LZW prefix string */
unsigned char suffix[65536]; /* one-character LZW suffix */
unsigned char match[65280 + 2]; /* buffer for reversed match */
/* initialize output for error returns */
*out = NULL;
*outlen = 0;
/* process the header */
if (inlen < 3 || in[0] != 0x1f || in[1] != 0x9d)
return -1; /* invalid header */
flags = in[2];
if (flags & 0x60)
return -1; /* invalid header */
max = flags & 0x1f;
if (max < 9 || max > 16)
return -1; /* invalid header */
if (max == 9) /* 9 doesn't really mean 9 */
max = 10;
flags &= 0x80; /* true if block compress */
/* clear table, start at nine bits per symbol */
bits = 9;
mask = 0x1ff;
end = flags ? 256 : 255;
/* set up: get the first 9-bit code, which is the first decompressed byte,
but don't create a table entry until the next code */
if (inlen == 3)
return 0; /* zero-length input is ok */
buf = in[3];
if (inlen == 4)
return -3; /* a partial code is not ok */
buf += in[4] << 8;
final = prev = buf & mask; /* code */
buf >>= bits;
left = 16 - bits;
if (prev > 255)
return -2; /* first code must be a literal */
/* we have output -- allocate and set up an output buffer four times the
size of the input (Unix compress usually compresses less than 4:1, so
this will avoid a reallocation most of the time) */
size = inlen;
DOUBLE(size);
DOUBLE(size);
put = malloc(size);
if (put == NULL)
return 1;
put[0] = final; /* first decompressed byte */
have = 1;
/* decode codes */
mark = 3; /* start of compressed data */
next = 5; /* consumed five bytes so far */
stack = 0; /* empty stack */
while (next < inlen) {
/* if the table will be full after this, increment the code size */
if (end >= mask && bits < max) {
/* flush unused input bits and bytes to next 8*bits bit boundary
(this is a vestigial aspect of the compressed data format
derived from an implementation that made use of a special VAX
machine instruction!) */
{
unsigned rem = (next - mark) % bits;
if (rem) {
rem = bits - rem;
if (rem >= inlen - next)
break;
next += rem;
}
}
buf = 0;
left = 0;
/* mark this new location for computing the next flush */
mark = next;
/* increment the number of bits per symbol */
bits++;
mask <<= 1;
mask++;
}
/* get a code of bits bits */
buf += (bits_t)(in[next++]) << left;
left += 8;
if (left < bits) {
if (next == inlen) {
ret = -3; /* partial code (not ok) */
break;
}
buf += (bits_t)(in[next++]) << left;
left += 8;
}
code = buf & mask;
buf >>= bits;
left -= bits;
/* process clear code (256) */
if (code == 256 && flags) {
/* flush unused input bits and bytes to next 8*bits bit boundary */
{
unsigned rem = (next - mark) % bits;
if (rem) {
rem = bits - rem;
if (rem > inlen - next)
break;
next += rem;
}
}
buf = 0;
left = 0;
/* mark this new location for computing the next flush */
mark = next;
/* go back to nine bits per symbol */
bits = 9; /* initialize bits and mask */
mask = 0x1ff;
end = 255; /* empty table */
continue; /* get next code */
}
/* process LZW code */
{
unsigned temp = code; /* save the current code */
/* special code to reuse last match */
if (code > end) {
/* Be picky on the allowed code here, and make sure that the
code we drop through (prev) will be a valid index so that
random input does not cause an exception. */
if (code != end + 1 || prev > end) {
ret = -2; /* invalid LZW code */
break;
}
match[stack++] = final;
code = prev;
}
/* walk through linked list to generate output in reverse order */
while (code >= 256) {
match[stack++] = suffix[code];
code = prefix[code];
}
match[stack++] = code;
final = code;
/* link new table entry */
if (end < mask) {
end++;
prefix[end] = prev;
suffix[end] = final;
}
/* set previous code for next iteration */
prev = temp;
}
/* make room for the stack in the output */
if (stack > size - have) {
if (have + stack + 1 < have) {
ret = 1;
break;
}
do {
DOUBLE(size);
} while (stack > size - have);
{
unsigned char *mem = realloc(put, size);
if (mem == NULL) {
ret = 1;
break;
}
put = mem;
}
}
/* write output in forward order */
do {
put[have++] = match[--stack];
} while (stack);
/* stack is now empty (zero) for the next code */
}
/* return the decompressed data, first reducing the allocated memory */
{
unsigned char *mem = realloc(put, have);
if (mem != NULL)
put = mem;
}
*out = put;
*outlen = have;
return ret;
}
#include \"mathlink.h\"
#include \"WolframLibrary.h\"
DLLEXPORT mint WolframLibrary_getVersion( ) {
return WolframLibraryVersion;
}
DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) {
return LIBRARY_NO_ERROR;
}
DLLEXPORT void WolframLibrary_uninitialize( ) {
return;
}
static void report(WolframLibraryData libData, const char *name,
const char *err, const char *text) {
MLINK link = libData->getMathLink(libData);
/* name::err = \"`1`\" */
MLPutFunction(link, \"EvaluatePacket\", 1);
MLPutFunction(link, \"Set\", 2);
MLPutFunction(link, \"MessageName\", 2);
MLPutSymbol(link, name);
MLPutString(link, err);
MLPutString(link, \"`1`\");
libData->processMathLink(link);
if (MLNextPacket(link) == RETURNPKT)
MLNewPacket(link);
/* Message[name::err, text] */
MLPutFunction(link, \"EvaluatePacket\", 1);
MLPutFunction(link, \"Message\", 2);
MLPutFunction(link, \"MessageName\", 2);
MLPutSymbol(link, name);
MLPutString(link, err);
MLPutString(link, text);
libData->processMathLink(link);
if (MLNextPacket(link) == RETURNPKT)
MLNewPacket(link);
}
DLLEXPORT mint munlzw(WolframLibraryData libData, MLINK link) {
const unsigned char *str = NULL;
int len; /* int type used by MLGetByteString() */
unsigned char *out;
size_t outlen;
int ret;
const char *errmsg[] = {
\"Prematurely terminated compress stream\", /* -3 */
\"Corrupted compress stream\", /* -2 */
\"Not a Unix compress (.Z) stream\", /* -1 */
\"Unexpected return code\", /* < -3 or > 1 */
\"Out of memory\" /* 1 */
};
if (MLTestHead(link, \"List\", &len) && len == 1 &&
MLGetByteString(link, &str, &len, 0)) {
MLNewPacket(link);
ret = unlzw(str, len, &out, &outlen);
MLReleaseByteString(link, str, len);
if (ret == 0) {
MLPutByteString(link, out, outlen);
free(out);
}
else {
free(out);
MLPutSymbol(link, \"Null\");
report(libData, \"unlzw\",
ret < 0 ? \"invalid\" : \"memerr\",
errmsg[ret < -3 || ret > 1 ? 3 : ret + 3]);
}
}
else {
MLClearError(link);
MLNewPacket(link);
MLPutSymbol(link, \"Null\");
report(libData, \"unlzw\", \"arg\",
\"String expected at position 1 as the only argument\");
}
MLClearError(link);
return LIBRARY_NO_ERROR;
}";
If[Head[lib]===String,LibraryUnload[lib]]
lib=CreateLibrary[unlzw$source,"unlzw","ShellOutputFunction"->Print]
unlzw=LibraryFunctionLoad[lib,"munlzw",LinkObject,LinkObject]
Provide Import functions that use unlzw
If compressed, decompress the provided string str
(possibly with nested compression) and pass on to ImportString
. If not, pass on the string as is. This function can be used in place of ImportString
, cleanly extending its functionality:
importStringLZW[str_String,opts___]:=
Module[{got},
If[(got=Quiet@unlzw@str)=!=Null,
importStringLZW[got,opts],
ImportString[str,opts]]
]
If compressed, decompress the data from the provided channel
(possibly with nested compression) and pass on to ImportString
. If not, pass on the data as is. This function can be used in place of Import
, cleanly extending its functionality:
importLZW[channel_,opts___]:=
importStringLZW[Import[channel,"String"],opts]
End the Package
End[]
LZW`Private`
EndPackage[]
$Context
Global`
Extend Import
with LZW compression format
Change the DownValue
of Import
to recognize the explicit compression format "LZW"
and call importLZW
with the "LZW"
stripped out. Do the same for ImportString
, calling importStringLZW
above. While it would be possible to extend Import
and ImportString
to automatically detect LZW compression, that could introduce inefficiencies in their operation for non-LZW data, in part because it would force always reading the compressed data into memory. Both importLZW
and Import[_, "LZW"]
will automatically detect LZW compression and decompress as needed.
ImportExport'RegisterImport
is not suitable for this application, due to its expectations for a "low-level" function that returns a set of rules. For a compression format that then passes another format for further import conversions, changing the DownValue
as is done here allows for completely transparent decompression with the automatic detection of the uncompressed data format and the specification of elements therein, e.g. for the TAR format, which was often compressed with Unix compress.
Begin["System`ConvertersDump`"];
Unprotect@Import;
DownValues[Import]={HoldPattern[Import[channel_,"LZW"|{"LZW"},opts___?OptionQ]]:>importLZW[channel,opts],
HoldPattern[Import[channel_,{"LZW",elts__},opts___?OptionQ]]:>importLZW[channel,{elts},opts]}~Join~DownValues[Import];
Protect@Import;
Unprotect@ImportString;
DownValues[ImportString]={HoldPattern[ImportString[str_String,"LZW"|{"LZW"},opts___?OptionQ]]:>importStringLZW[str,opts],
HoldPattern[ImportString[str_String,{"LZW",elts__},opts___?OptionQ]]:>importStringLZW[str,{elts},opts]}~Join~DownValues[ImportString];
Protect@ImportString;
End[]
System`ConvertersDump`
$Context
Global`
Comments
Post a Comment