I need to decompress a data model file embedded in an xlsx file. The file is supposed to use the MS-XLDM file format and should consist of 3 sections (Spreadsheet Data Model Header, Files and Virtual Directory), of which only the middle one is compressed. The first and last sections are XML, presumably in Unicode/UTF-16 encoding (every other byte is 0x00 and the content is preceded by the BOM 0xFF 0xFE). The middle file is preceded by a small chunk of XML. The spec has more detail about the file structure.
Now according to the documentation the file should be compressed using Xpress compression specified here which uses LZ77 compression and DIRECT2 encoding.
Now to get to the point. From my understanding, there should always be a 4-byte bitmask which indicates whether the byte in the corresponding position is literal (1:1) data or metadata.
For example, given a hypothetical 8-bit bitmask, the string "ABCABCDEF" is compressed as (0,0)A(0,0)B(0,0)C(3,3)D(0,0)E(0,0)F. Its bitmask would be b'00010001' (0x11).
If a given position is supposed to be metadata, at least 2 bytes should be read. Out of those 16 bits, the first 13 are the offset and the last 3 are the length (unless all the length bits are 1, in which case another byte must be read).
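A minimal sketch in C of decoding one such 2-byte metadata token, under the 13/3 split just described. The +1/+3 biases are an assumption based on typical MS-XCA/DIRECT2-style LZ77 encodings, not verified against this particular file, and the extra-length-byte extension is omitted:

```c
#include <stdint.h>

/* Decode one 2-byte match descriptor per the layout described above:
 * high 13 bits = offset, low 3 bits = length. The +1 / +3 biases are
 * assumptions (offset stored as distance - 1, length stored as
 * length - 3); the "read another byte" length extension is left out. */
void decode_match(uint16_t word, unsigned *offset, unsigned *length)
{
    *offset = (unsigned)(word >> 3) + 1;  /* distance to copy from, going back */
    *length = (unsigned)(word & 0x7) + 3; /* matches are at least 3 bytes long */
}
```

For example, the word 0x0B02 decodes to offset 353 and length 5 under this scheme; note that if the on-disk word is little-endian, the two bytes 0x0B 0x02 would instead form 0x020B, which decodes differently and may be worth checking.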
So now onto the concrete example that I struggle with. The first 2 chunks are easy.
First one is:
....<Load xmlns="http://schemas.micr
The first 4 bytes (the dots) are 0x00, thus the 32 bytes that follow are uncompressed. The next chunk is similar:
....osoft.com/analysisservices/2003/
Now the 3rd chunk is where I get lost
w±engine":ddl27/"/2G_W?%g100gO8eðg_‡_)§è.Õ®]›‡o
I'm not sure where exactly the chunk ends, because when I started counting every 36 bytes after those first ones, after a while I would reach a portion of the byte stream which should be uncompressed, and it didn't line up.
So back to the 3rd chunk. The bitmask for this one is 0x77 0xB1 0x04 0x01.
Or in binary: 01110111 10110001 00000100 00000001. I tried to line it up with the bytes and it didn't make any sense. Clearly the word engine" is uncompressed, and it fits the previous chunks, because a quick Google search revealed a result with the namespace "http://schemas.microsoft.com/analysisservices/2003/engine".
01110111 10110001 00000100 00000001
engine" :ddl27 /"/2G_W ?%g100gO8eðg_‡_)
This made me think that maybe the bytes of the bitmask are in reverse order. That made more sense to me:
00000001
engine"
If this were true, then the metadata should be 0x0B 0x02.
Or in binary: 00001011 00000010. So if I split it up, the first 13 bits make up the offset, and the length is 0b010 plus the constant 3: 2 + 3 = 5.
Before 0000101100000
Invert 1111010011111
Decimal -353
But looking 353 bytes back lands in the uncompressed portion of the XML section and would return the characters in parentheses (a.m.e). This doesn't make sense to me and is probably wrong.
Here is the file I tried to decompress.
I am currently reading about the PNG file format. It turns out that the first byte of the file is specified to be equal to 0x89.
I am wondering about the reasons for the value of that byte.
What I've already learned about the format is that the first byte is used to detect transmission over a 7-bit channel. If the value were 0x80 (1000 0000), it would make sense (if after transmission the first byte is 0, then 7-bit mode was used and the file is corrupted). But what is the point of the ones at positions zero and three of 0x89 (1000 1001)?
Extract from http://www.libpng.org/pub/png/spec/1.2/PNG-Rationale.html#R.PNG-file-signature
The first two bytes distinguish PNG files on systems that expect the
first two bytes to identify the file type uniquely. The first byte is
chosen as a non-ASCII value to reduce the probability that a text file
may be misrecognized as a PNG file; also, it catches bad file
transfers that clear bit 7
So the high bit of the first byte catches bad transfers, while the remaining bits make 0x89 a distinctive non-ASCII value that, together with the following bytes, identifies the file type uniquely.
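The full signature check is straightforward; this sketch hard-codes the 8 bytes from the PNG specification (0x89 catches bit-7 stripping, "PNG" names the format, \r\n and \n catch line-ending conversion, and 0x1A stops MS-DOS `type` from printing past the header):

```c
#include <string.h>
#include <stddef.h>

/* The full 8-byte PNG file signature from the specification. */
static const unsigned char PNG_SIG[8] =
    { 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n' };

/* Returns 1 if the buffer starts with a valid PNG signature. */
int looks_like_png(const unsigned char *buf, size_t len)
{
    return len >= 8 && memcmp(buf, PNG_SIG, 8) == 0;
}
```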
I have multiple blocks of data compressed with zlib. I want to concatenate these blocks of data and store that in one file.
Obviously, I could use something like JSON or XML to separate the zlib data blocks, but I'm wondering if, to save space, I can just search for the next 78 01, 78 9C or 78 DA?
Basically my question is: can these byte combinations, theoretically, exist in a zlib data stream, or can I be sure that when I find one of them a new zlib data block starts, with the previous one ending at the found position minus one?
I know the uncompressed data blocks are always 1024 bytes or less in length, so the compressed stream will never be > 1024 bytes.
No, you can't. Any byte sequence can appear in the compressed data. At any byte position, there is a probability of 1/1024 of finding a valid zlib header. So you will find a lot of valid zlib headers in a long compressed stream that are not actually zlib headers.
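The 1/1024 figure follows from the header rules in RFC 1950: CM must be 8, CINF at most 7, and the 16-bit header must be divisible by 31 (the FCHECK rule). That gives roughly 8 × 8 valid byte pairs out of 65536. A sketch of such a plausibility check:

```c
#include <stdint.h>

/* Returns 1 if the two bytes could be a zlib header per RFC 1950:
 * CM == 8 (DEFLATE), CINF <= 7 (window <= 32K), and the 16-bit
 * value CMF<<8 | FLG is a multiple of 31 (FCHECK rule). */
int plausible_zlib_header(uint8_t cmf, uint8_t flg)
{
    if ((cmf & 0x0F) != 8)     /* CM must be 8 */
        return 0;
    if ((cmf >> 4) > 7)        /* CINF: window size at most 32K */
        return 0;
    return ((cmf << 8) | flg) % 31 == 0;
}
```

This shows why scanning for 78 01 / 78 9C / 78 DA is unreliable: any byte pair passing this check can also occur by chance inside compressed data.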
You could create your own byte-stuffing scheme that wraps around arbitrary data, including zlib streams or anything else, and assures that certain sequences cannot occur unless they really are delimiters. Such schemes can incur an arbitrarily small expansion of the data. For example, if you find three 0xff's in a row in the data, insert a 0x00 byte. Then 0xff 0xff 0xff 0xff can be a delimiter, since it will never appear in the data. This will only expand the stream, on average, by about 0.000006%.
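A minimal sketch of that stuffing pass (the unstuffing direction, dropping the 0x00 after every 0xff 0xff 0xff, is symmetric):

```c
#include <stddef.h>

/* Copy `in` to `out`, inserting a 0x00 after every run of three 0xff
 * bytes so that four 0xff in a row can never occur in the payload and
 * can serve as a delimiter. `out` must have room for the worst-case
 * expansion (len + len/3). Returns the stuffed length. */
size_t stuff(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0, run = 0;
    for (size_t i = 0; i < len; i++) {
        out[o++] = in[i];
        run = (in[i] == 0xff) ? run + 1 : 0;
        if (run == 3) {        /* break the run before it reaches four */
            out[o++] = 0x00;
            run = 0;
        }
    }
    return o;
}
```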
I am currently developing a proprietary file format based on the PNG file format. I am done so far, except it doesn't work :-p The deflate decompressor I implemented works like a charm, but the PNG decoder doesn't want to perform nicely, so I took a look at the original PNG file.
The standard says that after an IDAT header, the compressed data follows immediately. So, as the data is a deflate stream, the first byte after IDAT is 0x78 == 01111000, which means a mode one block (uncompressed), and it's not the final one.
Strange, though; it's hard for me to imagine that a PNG encoder doesn't use dynamic Huffman coding for compressing the filtered raw image data. The deflate standard says that the rest of the current byte is skipped in mode one.
So the next four bytes would indicate the size of the uncompressed block and its one's complement.
But 0x59FD is not the one's complement of 0xECDA. Even if I screwed up the byte ordering: 0xFD59 is not the one's complement of 0xDAEC either.
Well, the knockout byte just follows. 0x97 is considered to be the first byte of the uncompressed but still filtered raw PNG image data, and as such must be the filter type. But 0x97 == 10010111 is not a valid filter type. Even if I screwed up the bit packing order, 11101001 == 0xE9 is still not a valid filter type.
I didn't focus on RFC 1951 much anymore, as I am able to inflate all kinds of files so far using my implementation of the deflate decompressor, so I suspect some misunderstanding on my part concerning the PNG standard.
I have read RFC 2083 over and over again, but the data I see here doesn't match the RFC; it doesn't make sense to me, there must be a missing piece!
When I look at the following bytes, I can actually not see a valid filter type byte anywhere near, which makes me think that the filtered PNG data stream is compressed after all.
It would make sense if 0x78 (the first byte after IDAT) were read from MSB to LSB, but RFC 1951 says otherwise. Another idea (more likely to me) is that there is some data between the IDAT string and the start of the compressed deflate stream, but RFC 2083 says otherwise. The layout is clear:
4Bytes Size
4Bytes ChunkName (IDAT)
[Size] Bytes (compressed deflate stream)
4Bytes CRC Checksum
So the first byte after IDAT must be the first byte of the compressed deflate stream - which indicates a mode 1 uncompressed data block. That means 0x97 must be the first byte of the uncompressed but filtered PNG image data - which means 0x97 is the filter type for the first row - which is invalid...
I just don't get it, am I stupid or what??
Summary:
Possibility 1:
There is some other data between IDAT and the effective start of the compressed deflate stream, which, if true, is not mentioned in RFC 2083 nor in any book I have read about image compression.
Possibility 2:
The number 0x78 is interpreted MSB -> LSB, which would indicate a mode 3 block (dynamic Huffman coding), but this contradicts RFC 1951, which is very clear about bit packing (LSB -> MSB).
I know already the missing piece must be something very stupid, and I will feel the urgent need to sell my soul if only there were a delete button on Stack Overflow :-p
Two corrections which may help get you on your way:
The zlib header is 2 bytes, not 1 -- see RFC 1950. The first is CMF, the next FLG.
In your data:
78 DA
---CMF--- ---FLG---
0111.1000 1101.1010
CINF -CM- |||\____/
          |||   |
          |||   +--- FCHECK
          ||+------- FDICT
          ++-------- FLEVEL
CINF is 7, indicating the standard 32Kb compression window.
CM is 8, indicating the compression algorithm is, indeed, DEFLATE.
FCHECK is a check value: the 16-bit header CMF<<8 | FLG must be a multiple of 31. I didn't verify it here (but I'd bet it's correct).
FDICT is clear, meaning there is no preset dictionary stored.
FLEVEL is 3, indicating Maximum Compression.
See also Trying to understand zlib/deflate in PNG files, esp. dr. Adler's answer.
LEN and NLEN are only set for uncompressed blocks; that's why you didn't find them. (Also, partially, because you were looking at the wrong byte(s).)
The next byte in the stream is EC; bitwise, this is 1110 1100, but remember to read bits from low to high. So the first bit read is 0, meaning not FINAL, and the next 2 bits read are 0 and then 1 (in that order!), giving BTYPE == binary 10, which indicates a regular dynamic Huffman encoded data block.
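The LSB-first bit order is easy to get wrong; this little reader reproduces the walk just described and can be checked against the 0xEC byte:

```c
#include <stdint.h>

/* A minimal LSB-first bit reader, as deflate (RFC 1951) requires:
 * bits are consumed from the least significant end of each byte. */
typedef struct {
    const uint8_t *p;  /* current byte */
    unsigned bit;      /* next bit position within *p (0..7) */
} bitreader;

/* Read n bits (n <= 8); the first bit read becomes the result's LSB. */
unsigned get_bits(bitreader *br, unsigned n)
{
    unsigned v = 0;
    for (unsigned i = 0; i < n; i++) {
        v |= ((unsigned)(*br->p >> br->bit) & 1u) << i;
        if (++br->bit == 8) { br->bit = 0; br->p++; }
    }
    return v;
}
```

For 0xEC this yields BFINAL = 0 and BTYPE = 2, i.e. a dynamic Huffman block.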
Let us say I have a 32-bit machine running a 32-bit OS with an application program like Notepad. Assume I create a .txt file with that program containing just the single character 'A' and save the file with ANSI (ASCII) coding on disk. With 32 bits making up a single addressable memory block called a word, how would the 4 bytes in the word be used to store 'A' (i.e., the number 65 in ASCII)? 65 translates to 0100 0001 in binary.
ASCII means that you are using just one byte per character. Many encodings use one byte per character, but there are some, like UTF-16, which always use two bytes per character.
The 32 bits only become relevant if you are processing these characters in a CPU register, loading them as an integer. Then the single byte is converted to a 32-bit integer and processed by the CPU; when you save it, it is again one byte long.
How one byte is converted into a 32-bit integer is described, for example, here: http://en.wikipedia.org/wiki/Endianness
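A small sketch of both points: the byte is zero-extended when loaded into a register-sized integer, and its placement among the 4 bytes of a word is an endianness question (shown here by serializing little-endian explicitly, which is portable code):

```c
#include <stdint.h>

/* Loading a single byte into a 32-bit integer zero-extends it:
 * 'A' (0x41) becomes 0x00000041. */
uint32_t widen(unsigned char c)
{
    return (uint32_t)c;
}

/* On a little-endian machine the 32-bit value 65 is laid out in
 * memory as 41 00 00 00; this writes that layout explicitly. */
void store_le32(uint32_t v, unsigned char out[4])
{
    out[0] = (unsigned char)(v);
    out[1] = (unsigned char)(v >> 8);
    out[2] = (unsigned char)(v >> 16);
    out[3] = (unsigned char)(v >> 24);
}
```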
My Systems Programming project has us implementing a compression/decompression program to crunch down ASCII text files by removing the zero top bit and writing the output to a separate file, depending on whether the compression or decompression routine is working. To do this, the professor has required us to use the binary files and Unix system calls, which include open, close, read, write, etc.
From my understanding of read and write, it reads the binary data by defined byte chunks. However, since this data is binary, I'm not sure how to parse it.
This is a stripped down version of my code, minus the error checking:
#include <fcntl.h>   /* for open() and O_RDONLY */

void compress(char readFile[]){
    char buffer[BUFFER]; // buffer size set to 4096, but tunable to system preference
    int openReadFile;

    openReadFile = open(readFile, O_RDONLY);
}
If I use read to read the data into buffer, will the data in buffer be in binary or character format? Nothing I've come across addresses that detail, and it's very relevant to how I parse the contents.
read() will read the bytes in without any interpretation (so "binary" mode).
Since the data is binary and you want to access the individual bytes, you should use a buffer of unsigned char:
unsigned char buffer[BUFFER]. You can regard char/unsigned char as bytes, they'll be 8 bits on linux.
Now, since what you're dealing with is 8 bit ascii compressed down to 7 bit, you'll have to convert those 7 bits into 8 bits again so you can make sense of the data.
To explain what's been done, consider the text Hey. That's 3 bytes. The bytes have 8 bits each, and in ASCII these are the bit patterns:
01001000 01100101 01111001
Now you remove the most significant bit from each byte and pack the remaining bits tightly together:
X1001000 X1100101 X1111001
Above, X is the bit to be removed. Removing those and shifting the rest together, you end up with bytes with this pattern:
10010001 10010111 11001000
The rightmost 3 bits are just filled in with 0. So far no space is saved, though; there are still 3 bytes.
With a string of 8 bytes we'd save 1 byte, as that would compress down to 7 bytes.
Now you have to do the reverse on the bytes you've read back in.
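The packing direction of that scheme can be sketched like this (MSB-first packing, matching the Hey example above; the decompression side just reverses it):

```c
#include <stddef.h>

/* Pack 8-bit ASCII down to 7 bits per character: drop the always-zero
 * top bit and concatenate the 7-bit groups MSB-first, zero-padding the
 * final byte. 8 input bytes become 7 output bytes. */
size_t pack7(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0;
    unsigned acc = 0, nbits = 0;   /* bit accumulator and its fill level */
    for (size_t i = 0; i < len; i++) {
        acc = (acc << 7) | (in[i] & 0x7fu);  /* append the 7 significant bits */
        nbits += 7;
        while (nbits >= 8) {                 /* emit every complete byte */
            nbits -= 8;
            out[o++] = (unsigned char)(acc >> nbits);
            acc &= (1u << nbits) - 1;        /* keep only the unemitted bits */
        }
    }
    if (nbits > 0)                           /* pad the tail with zero bits */
        out[o++] = (unsigned char)(acc << (8 - nbits));
    return o;
}
```

Packing "Hey" this way yields exactly the three bytes 10010001 10010111 11001000 worked out above.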
I'll quote the manual of the fopen function (which is built on top of the open primitive) from http://www.kernel.org/doc/man-pages/online/pages/man3/fopen.3.html
The mode string can also include the
letter 'b' either as a last character
or as a character between the
characters in any of the two-character
strings described above. This is
strictly for compatibility with C89
and has no effect; the 'b' is ignored
on all POSIX conforming systems,
including Linux
So even the high level function ignores the mode :-)
It will read the binary content of the file and load it into the memory that buffer points to. A byte is 8 bits, which is why a char is 8 bits, so if the file was a regular plain text document you'll end up with a printable string (be careful with how it ends: read returns the number of bytes read, i.e. the number of characters in an ASCII-encoded plain text file).
Edit: in case the file you're reading isn't a text file but a collection of binary records, you can give the buffer the type of those records, even if it's a struct.
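For instance, with a hypothetical fixed-size record type (beware that struct padding and endianness make the on-disk layout compiler- and platform-specific, so this only round-trips files written by the same build):

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

/* Hypothetical record type, to illustrate reading fixed-size binary
 * records straight into a typed buffer with the raw read() call. */
struct record {
    uint32_t id;
    uint16_t flags;
};

/* Read up to `count` records from `path`; returns the number of whole
 * records read, or -1 on error. read() delivers raw bytes untranslated. */
ssize_t read_records(const char *path, struct record *buf, size_t count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, count * sizeof *buf);
    close(fd);
    return n < 0 ? -1 : n / (ssize_t)sizeof *buf;
}
```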