Does the Huffman Code Tree in Deflate algorithm have to be a complete tree? - zlib

Does the Huffman code tree in the Deflate algorithm have to be a complete tree? By complete tree, I mean that each leaf node must always represent one symbol. In other words, the last symbol, the one with the longest code, will be assigned the all-ones code.
Take an extreme case for example: given 286 symbols, each symbol is encoded with a 15-bit code, which is possible in general Huffman coding. In this case, however, 2^15 - 286 leaf nodes are not assigned/used. Is that allowed in Deflate? My impression is that this is not allowed in Deflate and that the tree must be complete. Is that true?

Except for one case, the Huffman codes described in a dynamic block in a valid deflate stream must be complete. Those are the bit lengths code, the literal/length code, and the distance code.
The one exception is that if there is only one distance symbol used, it is coded with one bit (a zero) as opposed to zero bits, leaving one code unused (the single bit being a one).
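A common way to check whether a set of code lengths describes a complete code is to add up the code space each symbol consumes. Here is a minimal sketch of that check (my own illustration, not zlib's or puff.c's actual validation code; check_complete and MAXBITS are names I made up):

#include <stddef.h>

#define MAXBITS 15   /* longest code length allowed by deflate */

/* Return 1 if the code described by length[] is complete, 0 if it is
   incomplete, -1 if it is oversubscribed.  length[i] is the code length
   in bits of symbol i, with 0 meaning the symbol is not used. */
int check_complete(const int length[], int nsym)
{
    long left = 1L << MAXBITS;              /* code space still available */
    for (int i = 0; i < nsym; i++)
        if (length[i] != 0)
            left -= 1L << (MAXBITS - length[i]);
    if (left < 0)
        return -1;                          /* oversubscribed: invalid stream */
    return left == 0;                       /* zero left over means complete  */
}

This matches the answer above: a valid deflate stream must never be oversubscribed, and an incomplete code is acceptable only in the single-distance-symbol case.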

Related

puff.c: How does the Huffman decoding work?

I am trying to decompress a raw DEFLATE data stream in order to understand how DEFLATE decompression works. I don't care about performance at the moment. I just want to understand the decompression algorithm in detail, so I searched online for examples and stumbled upon the program puff.c by Mark Adler.
I want to decompress this piece of raw DEFLATE data:
05 C1 01 01 00 00 08 C3 20 6E FF CE 13 44 D2 4D C2 03
It is a single block compressed with dynamic Huffman codes. RFC 1951 gives a good overview of the overall structure of the block. After the block header there are 4 to 19 3-bit integers that define the code lengths for the code length alphabet; after that come the code lengths for the literal/length and distance alphabets, and finally the actual compressed data. So far so good...
I took a look at puff.c to see what an implementation of the Huffman decoding process should look like.
In the function construct (lines 340-379), the symbol table for an alphabet is created, which is then used in the decoding process. In the function decode (lines 235-255), a single symbol is decoded using the symbol table and the table of code length frequencies. This function gives me headaches. I do not understand the relationship between the symbols, the code length frequencies and the actual input bit stream. Also the check
if (code - count < first) /* if length len, return symbol */
on line 247 is pure witchcraft to me. I searched the internet for detailed explanations of the decoding process but I found nothing that explains it so deeply.
Could you please provide an explanation of the process/the decode function?
Here is a link to puff.c in case you want to take a look.
The key is to understand that the Huffman codes used in deflate are canonical, which means that the sequence of zeros and ones assigned to the symbols are determined entirely just from the bit lengths of the codes and an ordering of the symbols. The ordering is specified in the RFC, where, for example, literal/length codes start with the literal bytes in order from 0 to 255, then a single end-of-block code, and then the length codes in order from shortest to longest. For any given, say, literal/length code described in a dynamic header, usually only a subset of those symbols will actually be used in the block and have codes defined for them.
The convention in deflate is to assign the first ordered symbol of the shortest code to be all zeros. Then the code is incremented as an integer (that's key), for the remaining symbols of that length, in order. To step up to the next length, a zero bit is appended to the result of incrementing after the last symbol of the previous length (i.e., it is doubled).
For decoding, all I need is a) a list of the symbols that are coded sorted in order of bit length, and within each bit length, sorted by their symbol order, and b) the number of symbols for each bit length, which has to add up to the number of symbols in the list.
The decoding scheme used in puff.c is to start with the shortest bit length in the list of bit lengths, and get that many bits. Now I check to see if the value of those bits, as an integer, is less than or equal to the last code of that length. If so, I have all of the bits of the current code, and I can return the corresponding symbol by indexing into the list of symbols in a). If the value of the integer is greater than the last code, then I know I need more bits for this code. I append one more bit from the stream to the integer, and I repeat the process for the next code length.
This is complicated a little by the fact that the bits are stored in reverse order in the stream. I need to flip the order as I build up the integer from successive bits, so that I can interpret the result as an integer that can be compared.
So, if (code - count < first) can be read as if (code < first + count), or if (code <= first + count - 1), which is whether code is less than or equal to the last code of that length. If so, we have the code, and the corresponding symbol is h->symbol[index + (code - first)], where index is the position in the list of symbols of the first symbol of the current length.
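To tie the pieces together, here is a simplified sketch of that decoding loop. It is my own rephrasing of the scheme described above, not a verbatim copy of puff.c; count[], symbol[] and the bit reader are assumed to be set up as described (nextbit is a hypothetical helper that returns the next bit of the stream, least-significant bit of each byte first, as deflate stores them):

#include <stddef.h>

#define MAXBITS 15

/* Hypothetical bit reader: return the next bit from buf, pulling bits from
   the least-significant end of each byte first. */
static int nextbit(const unsigned char *buf, size_t *pos)
{
    int bit = (buf[*pos >> 3] >> (*pos & 7)) & 1;
    (*pos)++;
    return bit;
}

/* count[len] : number of symbols whose code is len bits long (1..MAXBITS)
   symbol[]   : the coded symbols, sorted by code length, then symbol order */
int decode_symbol(const unsigned char *buf, size_t *pos,
                  const int count[], const int symbol[])
{
    int code = 0;                        /* bits read so far, as an integer    */
    int first = 0;                       /* first canonical code of this length */
    int index = 0;                       /* symbol[] index of that first code  */

    for (int len = 1; len <= MAXBITS; len++) {
        code |= nextbit(buf, pos);       /* append next bit (flips the order)  */
        int n = count[len];
        if (code - first < n)            /* code is within this length         */
            return symbol[index + (code - first)];
        index += n;                      /* move past this length's symbols    */
        first = (first + n) << 1;        /* first canonical code, next length  */
        code <<= 1;                      /* make room for one more bit         */
    }
    return -1;                           /* ran out of lengths: invalid code   */
}

The comparison code - first < n is the same test as puff.c's code - count < first, just rearranged.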

LZSS vs. LZ77 compression difference

Can someone please explain the difference between the LZSS and LZ77 algorithms? I've been looking online for a couple of hours but I couldn't find the difference. I found the LZ77 algorithm and understood its implementation.
But how does LZSS differ from LZ77? Say we have the string "abracadabra": how would LZSS compress it differently from LZ77? Is there C pseudo-code that I could follow?
Thank you for your time!
Unfortunately, both terms LZ77 and LZSS tend to be used very loosely, so they do not really imply very specific algorithms. When people say that they compressed their data using an LZ77 algorithm, they usually mean that they implemented a dictionary based compression scheme, where a fixed-size window into the recently decompressed data serves as the dictionary and some words/phrases during the compression are replaced by references to previously seen words/phrases within the window.
Let us consider the input data in the form of the word
abracadabra
and assume that the window can be as large as the input data. Then we can represent "abracadabra" as
abracad(-7,4)
Here we assume that letters are copied as is, and that the meaning of two numbers in brackets is "go 7 positions back from where we are now and copy 4 symbols from there", which reproduces "abra".
This is the basic idea of any LZ77 compressor. Now, the devil is in the detail. Note that the original word "abracadabra" contains 11 letters, so assuming an ASCII representation of the word, it is 11 bytes long. Our new representation contains 13 symbols, so if we assume the same ASCII representation, we have just expanded the original message instead of compressing it. One can prove that this can sometimes happen to any compressor, no matter how good it actually is.
So, the compression efficiency depends on the format in which you store the information about uncompressed letters and back references. The original paper where the LZ77 algorithm was first described (Ziv, J. & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3), 337-343) uses a format that can be loosely described here as
(0,0,a)(0,0,b)(0,0,r)(0,1,c)(0,1,d)(0,3,a)
So, the compressed data is the sequence of groups of three items: the absolute (not relative!) position in the buffer to copy from, the length of the dictionary match (0 means no match was found) and the letter that follows the match. Since most letters did not match anything in the dictionary, you can see that this is not a particularly efficient format for anything but very compressible data.
This inefficiency may well be the reason why the original form of LZ77 has not been used in any practical compressors.
The "SS" in "LZSS" refers to a paper that was trying to generalize the ideas of dictionary compression with the sliding window (Storer, J. A. & Szymanski, T. G. (1982). Data compression via textual substitution. Journal of the ACM, 29(4), 928-951). The paper itself looks at several variations of dictionary compression schemes with windows, so once again, you will not find an explicit "algorithm" in it. However, the term LZSS is used by most people to describe the dictionary compression scheme with flag bits, e.g. describing "abracadabra" as
|0a|0b|0r|0a|0c|0a|0d|1-7,4|
where I added vertical lines purely for clarity. In this case the numbers 0 and 1 are actually prefix bits, not bytes. Prefix bit 0 says "copy the next byte into the output as is". Prefix bit 1 says "what follows next is the information for copying a match". Nothing else is really specified; the term LZSS just says something about the use of these prefix signal bits. Hopefully you can see how this can be done compactly, in fact much more efficiently than the format described in the LZ77 paper.
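To make the flag-bit idea concrete, here is a toy decoder sketch for a format like the one above. The exact packing (one flag bit, then either an 8-bit literal or an 8-bit distance followed by an 8-bit length) is something I made up for illustration; real LZSS variants pack these fields differently:

#include <stddef.h>

/* Read n bits from in[], most-significant bit of each byte first
   (an arbitrary choice for this toy format). */
static int getbits(const unsigned char *in, size_t *bitpos, int n)
{
    int v = 0;
    for (int i = 0; i < n; i++) {
        v = (v << 1) | ((in[*bitpos >> 3] >> (7 - (*bitpos & 7))) & 1);
        (*bitpos)++;
    }
    return v;
}

/* Decode a toy LZSS-style stream of nbits bits into out[]; nbits is assumed
   to end exactly on an item boundary and out[] is assumed large enough. */
size_t lzss_decode(const unsigned char *in, size_t nbits, unsigned char *out)
{
    size_t bitpos = 0, outpos = 0;
    while (bitpos < nbits) {
        if (getbits(in, &bitpos, 1) == 0) {               /* flag 0: literal   */
            out[outpos++] = (unsigned char)getbits(in, &bitpos, 8);
        } else {                                          /* flag 1: match     */
            int dist = getbits(in, &bitpos, 8);
            int len  = getbits(in, &bitpos, 8);
            for (int i = 0; i < len; i++) {               /* copy byte by byte */
                out[outpos] = out[outpos - dist];         /* so overlap works  */
                outpos++;
            }
        }
    }
    return outpos;                                        /* bytes produced    */
}

The byte-by-byte copy matters: a match like (-1, 5) legitimately overlaps its own output, which is how runs of a repeated byte get encoded.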

Compression mechanism

I know that Huffman encoding is a popular technique for file compression, and I know that it works by encoding more frequent characters with shorter bits. The problem is you can only decode that if you have the tree. Do you actually have to send over the tree as well? If so, in what form? Details please.
Yes, you have to send a representation of the code first. The Huffman code is made canonical, so that you can just send the number of bits in the code corresponding to each symbol. Then the canonical code can be reconstructed from the lengths at the other end. You never have to send the tree.
The lengths can themselves be compressed as well, for another level of efficiency, as well as complexity. See the deflate specification for an example of how Huffman codes are transmitted efficiently.
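For a concrete picture of the receiving end, here is a sketch of reconstructing the canonical codes from the lengths, following the procedure given in the deflate specification (RFC 1951, section 3.2.2); the function and array names here are mine:

#include <stddef.h>

#define MAXBITS 15

/* Given length[i] = code length in bits of symbol i (0 = symbol unused),
   fill code[i] with its canonical Huffman code. */
void canonical_codes(const int length[], int code[], size_t nsym)
{
    int bl_count[MAXBITS + 1] = {0};
    int next_code[MAXBITS + 1] = {0};

    for (size_t i = 0; i < nsym; i++)                /* count codes per length */
        bl_count[length[i]]++;
    bl_count[0] = 0;

    int c = 0;
    for (int bits = 1; bits <= MAXBITS; bits++) {    /* smallest code per length */
        c = (c + bl_count[bits - 1]) << 1;
        next_code[bits] = c;
    }

    for (size_t i = 0; i < nsym; i++)                /* assign in symbol order */
        if (length[i] != 0)
            code[i] = next_code[length[i]]++;
}

For example, lengths {2, 1, 3, 3} for symbols A..D give codes A=10, B=0, C=110, D=111; both sides compute exactly the same assignment from the four lengths alone, which is why the tree itself never has to be sent.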
How exactly the Huffman tree is transferred depends on the compression format.
Static Huffman encodes the tree. The Deflate algorithm only encodes the number of bits per symbol.
For Adaptive Huffman, there is no need to explicitly encode the tree, as the tree is re-built (or just slightly modified) from time to time. The initial tree is then hardcoded.

how to get <distance, length> pairs from ZLIB compressor

I am compressing several long strings using ZLIB, which uses LZ77 representations of repeated substrings prior to encoding these representations using a Huffman tree. I am interested in studying the sequence of integer tuple representations themselves, and have been looking in the code to figure out where these are generated and how I could print them out one after another. Unfortunately, I am not very strong in C, and it seems that the compressor handles distances as pointers, and not as ints. Could somebody please tell me if there is a simple way to print out the sequence of tuples as the algorithm runs, and point me to the appropriate location in the code.
You can use infgen to disassemble a deflate stream. It will print the decoded symbols in a readable form, e.g. match 41 105 indicating a string to copy of length 41, from a distance back of 105.

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
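In case it helps, here is a rough sketch of that greedy merge heuristic (my own illustration; the names, the fixed-size buffers and the assumption that contained strings were already removed are all simplifications):

#include <stddef.h>
#include <string.h>

#define MAXLEN 4096   /* assumed upper bound on any merged string */

/* Longest suffix of a that is also a prefix of b. */
static size_t overlap(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t max = la < lb ? la : lb;
    for (size_t k = max; k > 0; k--)
        if (strncmp(a + la - k, b, k) == 0)
            return k;
    return 0;
}

/* Greedy superstring heuristic: repeatedly merge the pair of strings with the
   largest overlap until one string remains.  Assumes no string is a substring
   of another (remove those first) and that MAXLEN is never exceeded. */
void greedy_superstring(char words[][MAXLEN], int *n)
{
    while (*n > 1) {
        int bi = 0, bj = 1;
        size_t best = 0;
        for (int i = 0; i < *n; i++)                 /* find the best pair   */
            for (int j = 0; j < *n; j++)
                if (i != j) {
                    size_t k = overlap(words[i], words[j]);
                    if (k >= best) { best = k; bi = i; bj = j; }
                }
        strcat(words[bi], words[bj] + best);         /* append the non-overlap */
        memmove(words[bj], words[*n - 1], MAXLEN);   /* drop the merged-in word */
        (*n)--;
    }
}

For the example above, after removing doll and house (both substrings of dollhouse), the only remaining merge is ragdoll + dollhouse with overlap 4, which yields ragdollhouse.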
I think you can use a radix tree. It costs some memory because of pointers to leaves and parents, but it is easy to match up strings in O(k), where k is the length of the longest string.
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
This looks similar to the knapsack problem, which is NP-complete, so there is no "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical shortcuts for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
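Of the three steps, the move-to-front transform is the easiest to show in a few lines. A rough sketch of the encoder (my own, not the lab's code):

#include <stddef.h>
#include <string.h>

/* Move-to-front transform: each input byte is replaced by its current index
   in a 256-entry list of byte values, and that byte is then moved to the
   front of the list.  After a BWT, runs of identical bytes become runs of
   zeros, which the entropy coder then handles very well. */
void mtf_encode(const unsigned char *in, int *out, size_t n)
{
    unsigned char table[256];
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;

    for (size_t i = 0; i < n; i++) {
        int j = 0;
        while (table[j] != in[i])        /* find the byte's current position  */
            j++;
        out[i] = j;                      /* emit that index                   */
        memmove(table + 1, table, j);    /* shift earlier entries back by one */
        table[0] = in[i];                /* and move the byte to the front    */
    }
}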
Refine step 3.
Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to require the suffix to be longer than some length - longer than 1, for example.)
If yes, then prepend the non-overlapping prefix of the current word to the existing word, and adjust all existing references appropriately (slow!)
If no, add the word to the end of the list as in the current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of work has already gone into compression algorithms, so why not use one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner while keeping operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress like that:
0aaa
3b
2sd
1baco
2ad
The number is the length of the largest common prefix with the preceding string.
You can tweak that scheme, for example by planning a "restart" of the common prefix after every K words, for faster reconstruction.
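A minimal sketch of that front-coding scheme for a sorted word list (the names are mine; a real implementation would also store the suffix length and handle the restart every K words):

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Length of the common prefix of a and b. */
static size_t common_prefix(const char *a, const char *b)
{
    size_t k = 0;
    while (a[k] != '\0' && a[k] == b[k])
        k++;
    return k;
}

/* Front-code a sorted word list: print, for each word, the length of its
   common prefix with the previous word followed by the remaining suffix. */
void front_code(const char *words[], size_t n)
{
    const char *prev = "";
    for (size_t i = 0; i < n; i++) {
        size_t k = common_prefix(prev, words[i]);
        printf("%zu%s\n", k, words[i] + k);      /* e.g. "3b" for aaab */
        prev = words[i];
    }
}

Run on the list above (aaa, aaab, aasd, abaco, abad), this prints 0aaa, 3b, 2sd, 1baco, 2ad.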
