Huffman decompression using tables: maximum code length? - zlib

I know that zlib uses a 2-level table to look up Huffman codes when decompressing a file. This has the underlying assumption that the Huffman code for any symbol will not be longer than 18 (9+9) bits... is there any mathematical reason for this assumption?

The deflate format restricts the maximum Huffman code length to 15 bits.
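
To make the two-level idea concrete, here is a rough sketch in the spirit of what the question describes (not zlib's actual data structures; the table sizes and the builder that fills them are placeholders). A 9-bit root table plus small second-level tables can resolve any code of up to 15 bits in at most two lookups, which is why 9+9 is more than enough:

#include <stdint.h>

#define ROOT_BITS 9
#define MAX_BITS  15
#define SUB_BITS  (MAX_BITS - ROOT_BITS)   /* 6 bits: enough for any deflate code */
#define MAX_SUBTABLES 32                   /* hypothetical upper bound */

typedef struct {
    uint16_t symbol;  /* decoded symbol, or sub-table index when len == 0 */
    uint8_t  len;     /* total code length in bits; 0 marks a sub-table link */
} entry;

static entry root_table[1 << ROOT_BITS];              /* filled by a table builder */
static entry sub_table[MAX_SUBTABLES][1 << SUB_BITS]; /* second-level tables; entries
                                                         for codes shorter than 15 bits
                                                         are simply replicated */

/* `bits` holds the next MAX_BITS bits of the stream, most significant bit first,
   right-aligned in the low 15 bits of the word. */
static int decode_symbol(uint32_t bits, int *consumed)
{
    entry e = root_table[bits >> SUB_BITS];
    if (e.len == 0)   /* long code: finish the lookup in the linked sub-table */
        e = sub_table[e.symbol][bits & ((1u << SUB_BITS) - 1)];
    *consumed = e.len;
    return e.symbol;
}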

They must be limiting it in some way during compression.
For straight Huffman encoding, there is no such limit.
The pathological case is when one symbol is more common than all the remaining symbols combined, and then, for the remaining symbols, one symbol is more common than the rest combined, and so on. For byte-sized symbols, this type of (extremely unlikely) distribution will give a Huffman code length of 255 bits for the two least common codes.
(Calculating the minimal length of an input that has the above property is left as an exercise for the reader).
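
As a quick sanity check of that claim, here is a small sketch of a plain, unconstrained Huffman construction on such a skewed distribution. The 8-symbol alphabet and its counts are just a stand-in for the 256-symbol case; the two least common symbols end up with code length NSYM - 1:

#include <stdio.h>

#define NSYM 8   /* stand-in for a 256-symbol alphabet */

int main(void)
{
    /* pathological counts: each symbol is at least as common as all the
       less common ones combined */
    long count[2 * NSYM] = {1, 1, 2, 4, 8, 16, 32, 64};
    int parent[2 * NSYM] = {0};
    int nodes = NSYM;

    /* naive O(n^2) Huffman construction: repeatedly merge the two
       smallest live (parentless) nodes into a new internal node */
    for (int merges = 0; merges < NSYM - 1; merges++) {
        int lo1 = -1, lo2 = -1;
        for (int i = 0; i < nodes; i++) {
            if (parent[i] != 0) continue;                   /* already merged */
            if (lo1 < 0 || count[i] < count[lo1]) { lo2 = lo1; lo1 = i; }
            else if (lo2 < 0 || count[i] < count[lo2]) lo2 = i;
        }
        count[nodes] = count[lo1] + count[lo2];
        parent[lo1] = parent[lo2] = nodes;
        nodes++;
    }

    /* code length of a symbol = number of parent links up to the root */
    for (int i = 0; i < NSYM; i++) {
        int len = 0;
        for (int n = i; parent[n] != 0; n = parent[n]) len++;
        printf("count %3ld -> code length %d\n", count[i], len);
    }
    return 0;
}

Scaled up to 256 symbols, the same shape produces 255-bit codes for the two rarest symbols, which is exactly why deflate caps code lengths at 15.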

Related

In zlib, what happens when the Huffman code lengths for the alphabets exceed the maximum code length (15)?

From https://www.rfc-editor.org/rfc/rfc1951:
Note that in the "deflate" format, the Huffman codes for the
various alphabets must not exceed certain maximum code lengths.
The maximum code length defined there is 15.
What happens when a Huffman code length exceeds 15?
From https://cs.stackexchange.com/questions/75542/maximum-size-of-huffman-codes-for-an-alphabet-containing-256-letters:
The maximum possible code size for a 256 symbol alphabet is 256 bits. Consider the case when the most frequent symbol has frequency 1/2, the next most frequent symbol has frequency 1/4, then 1/8, and so on.
So for the literal/length alphabet the maximum Huffman code length would be 285 - 1 = 284,
but in zlib the maximum code length is 15.
Why was 15 chosen as the maximum code length?
Will zlib fail if a code length exceeds 15?
We don't know for sure why Phil Katz chose 15, but it was likely to facilitate a fast implementation in a 16-bit processor.
No, zlib will not fail. It happens all the time. The zlib implementation applies the normal Huffman algorithm, after which if the longest code is longer than 15 bits, it proceeds to modify the codes to force them all to 15 bits or less.
Note that your example resulting in a 256-bit long code would require a set of 2^256 ~= 10^77 symbols in order to arrive at those frequencies. I don't think you have enough memory for that.
In any case, zlib normally limits a deflate block to 16384 symbols. For that number, the maximum Huffman code length is 19. That comes from a Fibonacci-like (Lucas) sequence of probabilities, not your powers of two. (Left as an exercise for the reader.)
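For what it's worth, here is a rough sketch of that exercise under one common Fibonacci-style counting argument (the exact minimal counts depend on how the construction breaks ties, so treat the 19 it prints as agreeing with the figure above rather than deriving it independently):

#include <stdio.h>

int main(void)
{
    const unsigned long block_limit = 16384;  /* zlib's usual symbol limit per block */
    unsigned long a = 1, b = 1;               /* the two most recent counts in the chain */
    unsigned long total = 3;                  /* counts so far: 1, 1, 1 */
    int length = 2;                           /* code length reached with three symbols */

    /* Each extra level of depth needs one more symbol whose count is (at least)
       the sum of the previous two, so the minimal counts grow like a
       Fibonacci sequence: 1, 1, 1, 2, 3, 5, 8, ... */
    for (;;) {
        unsigned long next = a + b;
        if (total + next > block_limit)
            break;
        total += next;
        a = b;
        b = next;
        length++;
    }
    printf("worst-case code length for a block of at most %lu symbols: %d bits\n",
           block_limit, length);
    return 0;
}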

Which compression to use between embedded processors (known byte distributions)

I'm working with radio-to-radio communications where bandwidth is really, really precious. It's all done in bare-metal C code (no OS, small Atmel 8-bit microprocessors). So the idea of compression becomes appealing for some large, but rare, transmissions.
I'm no compression expert. I've used the command line tools to shrink files and looked at how much I get. And linked a library or two over the years. But never anything this low level.
In one example, I want to move about 28K over the air between processors. If I just do a simple bzip2 -9 on a representative file, I get about 65% of the original size.
But I'm curious if I can do better. I am (naively?) under the impression that most basic compression formats must have some declaration of metadata up front that describes how to inflate the bitstream that follows. What I don't know is how much space that metadata itself takes up. I histogrammed said file, and a number of other ones, and found that due to the nature of what's being transmitted, the histogram is almost always about the same. So I'm curious if I could hard-code these frequencies in my code so that they were no longer dynamic and no longer transmitted as part of the packet.
For example, my understanding of Huffman encoding is that usually there's a "dictionary" up front, followed by a bitstream, and that if a compressor does it by blocks, each block will have its own dictionary.
On top of this, it's a small processor with a small footprint, so I'd like to keep whatever I do small, simple, and straightforward.
So I guess the basic question is: what, if any, basic compression algorithm would you implement in this kind of environment/scenario, especially taking into account that you can basically precompile a representative histogram of the bytes per transmission?
What you are suggesting, providing preset frequency data, would help very little. Or more likely it would hurt, since you will take a hit by not using optimal codes. As an example, only about 80 bytes at the start of a deflate block is needed to represent the literal/length and distance Huffman codes. A slight increase in the, say, 18 KB of your compressed data could easily cancel that.
With zlib, you could use a representative one of your 28K messages as a dictionary in which to search for matching strings. This could help the compression quite a bit, if there are many common strings in your messages. See deflateSetDictionary() and inflateSetDictionary().
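A minimal sketch of that approach on the sending side, assuming zlib is available (the buffer names are placeholders and error handling is trimmed; the receiver must be built with the exact same dictionary and call inflateSetDictionary() when inflate() asks for it):

#include <string.h>
#include <zlib.h>

/* dict: a representative past message both sides have agreed on.
   msg/msglen: the message to send; out/outlen: space for the result. */
static int compress_with_dict(const unsigned char *dict, size_t dictlen,
                              const unsigned char *msg, size_t msglen,
                              unsigned char *out, size_t outlen,
                              size_t *complen)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    if (deflateInit(&strm, Z_BEST_COMPRESSION) != Z_OK)
        return -1;

    /* seed the sliding window with the shared dictionary so early matches
       can point into it instead of being sent literally */
    deflateSetDictionary(&strm, dict, (uInt)dictlen);

    strm.next_in = (Bytef *)msg;
    strm.avail_in = (uInt)msglen;
    strm.next_out = out;
    strm.avail_out = (uInt)outlen;
    if (deflate(&strm, Z_FINISH) != Z_STREAM_END) {
        deflateEnd(&strm);
        return -1;
    }
    *complen = outlen - strm.avail_out;
    return deflateEnd(&strm);
}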

What can be the least possible value of data-compression-ratio for any real dataset

I am writing a ZLIB-like API for an embedded hardware compressor which uses the deflate algorithm for compression of a given input stream.
Before going further I would like to explain the data compression ratio. The data compression ratio is defined as the ratio between the uncompressed size and the compressed size.
The compression ratio is usually greater than one, which means the compressed data is usually smaller than the uncompressed data, which is the whole point of doing compression. But this is not always the case. For example, using the ZLIB library on pseudo-random data generated on a Linux machine gives a compression ratio of roughly 0.996, which means 9960 bytes were "compressed" into 10000 bytes.
I know ZLIB handles this situation by using a type 0 (stored) block, where it simply returns the original uncompressed data with roughly a 5-byte header, so it adds only about 5 bytes of overhead per data block of up to 64 KB. This is an intelligent solution to the problem, but for some reason I cannot use it in my API. I have to provide extra safe space in advance to handle this situation.
Now if I knew the least possible data compression ratio, it would be easy for me to calculate the extra space I have to provide. Otherwise, to be safe, I would have to provide more extra space than needed, which can be critical in an embedded system.
While calculating the data compression ratio, I am not concerned with the header, footer, extremely small datasets, or system-specific details, as I am handling those separately. What I am particularly interested in is: does there exist any real dataset, with a minimum size of 1K, that gives a compression ratio of less than 0.99 using the deflate algorithm? In that case the calculation would be:
Compression ratio = uncompressed size / (compressed size using deflate, excluding header, footer, and system-specific overhead)
Please provide feedback. Any help would be appreciated. It would be great if a reference to such a dataset could be provided.
EDIT:
MSalters' comment indicates that the hardware compressor is not following the deflate specification properly, and that this may be a bug in the microcode.
Because of the pigeonhole principle
http://en.wikipedia.org/wiki/Pigeonhole_principle
you will always have strings that get compressed and strings that get expanded:
http://matt.might.net/articles/why-infinite-or-guaranteed-file-compression-is-impossible/
Theoretically you can achieve the best compression with zero-entropy data (an unbounded compression ratio) and the worst compression with maximum-entropy data (e.g. AWGN noise), which is essentially incompressible, so its ratio is about 1, or slightly below once the format overhead is counted.
I can't tell from your question whether you're using zlib or not. If you're using zlib, it provides a function, deflateBound(), which does exactly what you're asking for, taking an uncompressed size and returning the maximum compressed size. It takes into account how the deflate stream was initialized with deflateInit() or deflateInit2() in computing the proper header and trailer sizes.
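If zlib is in play, a minimal sketch of that call looks like this (the 10000-byte figure is just an example size; the stream must be initialized the same way it will really be used, since the bound depends on those parameters):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));   /* zalloc/zfree/opaque = Z_NULL */

    /* initialize exactly as the real stream will be initialized, because
       the header and trailer sizes depend on these parameters */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    uLong bound = deflateBound(&strm, 10000UL);   /* worst case for 10000 input bytes */
    printf("10000 bytes can grow to at most %lu bytes\n", (unsigned long)bound);

    deflateEnd(&strm);
    return 0;
}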
If you're writing your own deflate, then you will already know what the maximum compressed size is based on how often you allow it to use stored blocks.
Update: The only way to know for sure the maximum data expansion of a hardware deflator is to obtain the algorithm used. Then through inspection you can determine how often it will emit stored blocks for random data.
The only alternative is empirical and unreliable. You can feed the hardware compressor random data, and examine the results. You can use infgen to disassemble the deflate output and see the stored blocks and their sizes. Then you can write a linear bounding formula for the expansion. Then add some margin to the additive and multiplicative terms to cover for situations that you did not observe in your tests.
This will only work if the hardware deflate algorithm is well behaved, which means that it will not write a fixed or dynamic deflate block if a stored block would be smaller. If it is not well behaved, then all bets are off.
The deflate format takes the same approach you describe ZLIB taking: each block starts with a 3-bit header, and when the two BTYPE bits are 00, the block that follows is stored length-prefixed but otherwise uncompressed.
This means the worst case is a one-byte input that blows up to 6 bytes (3 header bits, 5 bits of padding to the next byte boundary, the 32-bit LEN/NLEN length fields, then the 8 data bits), so the worst ratio is 1/6 ≈ 0.17.
This is of course assuming an optimal encoder. A suboptimal encoder would transmit a Huffman table for that one byte.
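If you need a figure for arbitrary input sizes rather than one byte, here is a rough bound along the same lines, assuming a well-behaved encoder that falls back to stored blocks (which the deflate format caps at 65535 data bytes each) and excluding any wrapper header/trailer, as in the question:

#include <stddef.h>

/* Worst-case deflate output size when every block ends up stored:
   each stored block costs 5 bytes (the 3 header bits padded to a byte
   boundary, plus the 16-bit LEN and NLEN fields) and carries at most
   65535 bytes of data. */
static size_t stored_block_bound(size_t n)
{
    size_t blocks = n / 65535 + (n % 65535 != 0);  /* ceil(n / 65535) */
    if (blocks == 0)
        blocks = 1;                                /* even empty input needs one block */
    return n + 5 * blocks;
}

For a 1K (1024-byte) message that works out to 1029 bytes, i.e. a ratio of roughly 0.995.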

Checksum with low probability of false negative

At the moment I'm using a simple checksum scheme that just adds up the words in a buffer. Firstly, my question is: what is the probability of a false negative, that is, the receiving system calculating the same checksum as the sending system even when the data is different (corrupted)?
Secondly, how can I reduce the probability of false negatives? What is the best checksumming scheme for that? Note that each word in the buffer is 64 bits (8 bytes) in size, that is, a long variable on a 64-bit system.
Assuming a sane checksum implementation, then the probability of a randomly-chosen input string colliding with a reference input string is 1 in 2^n, where n is the checksum length in bits.
However, if you're talking about input that differs from the original by a low number of bits, then the probability of collision is generally much, much lower.
One possibility is to have a look at T. Maxino's thesis titled "The Effectiveness of Checksums for Embedded Networks" (PDF), which contains an analysis for some well-known checksums.
However, usually it is better to go with CRCs, which have additional benefits, such as detection of burst errors.
For these, P. Koopman's paper "Cyclic Redundancy Code (CRC) Selection for Embedded Networks" (PDF) is a valuable resource.
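For illustration, here is a small sketch contrasting the additive scheme from the question with a CRC-32 computed by zlib's crc32() (the buffer contents are placeholders; zlib is used only because it is convenient, and any CRC-32 implementation behaves the same):

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    unsigned long long buf[4] = {1, 2, 3, 4};   /* placeholder 64-bit words */

    /* simple additive checksum: any corruption whose words still sum to the
       same value goes undetected, e.g. swapping two words, or offsetting one
       word up and another down by the same amount */
    unsigned long long sum = 0;
    for (int i = 0; i < 4; i++)
        sum += buf[i];

    /* CRC-32 over the same bytes: detects all single-bit errors and all
       burst errors up to 32 bits long */
    uLong crc = crc32(0L, Z_NULL, 0);                       /* initial value */
    crc = crc32(crc, (const Bytef *)buf, (uInt)sizeof(buf));

    printf("sum = %llu, crc32 = %08lx\n", sum, (unsigned long)crc);
    return 0;
}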

Huffman table entropy decoding simplification (in C)

First time using this site to ask a question, but I have gotten many many answers!
Background:
I am decoding a variable length video stream that was encoded using RLE and Huffman encoding. The stream is 10 to 20 kilobytes long, and therefore I am trying to "squeeze" as much time out of every step as I can so it can be decoded efficiently in real time.
Right now the step I am working on involves converting the bitstream into a number based on a Huffman table. I do this by counting the number of leading zeros to determine the number of trailing bits to include. The table looks like:
001xs range -3 to 3
0001xxs range -7 to 7
00001xxxs range -15 to 15
And so on up to 127. The s is a sign bit: 0 means positive, 1 means negative. So for example, if clz=2 then I would read the next 3 bits, 2 for the value and 1 for the sign.
Question:
Right now the nasty expression I created to do this is:
int outblock[64];
int index = 0;                 // position in the output block
unsigned int value;
// example: value 7 -> 111 (xxs), which translates to -3
value = 7;
outblock[index] = ((value & 1) ? -1 : 1) * (int)(value >> 1); // the expression; the cast keeps the multiply signed
Is there a simpler and faster way to do this?
Thanks for any help!
Tim
EDIT: Expression edited because it was not generating proper positive values. Generates positive and negative properly now.
I just quickly googled "efficient huffman decoding" and found the following links which may be useful:
Efficient Huffman Decoding with Table Lookup
Previous question - how to decode huffman efficiently
It seems the most efficient way to do Huffman decoding is to use a table lookup. Have you tried a method like this?
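As a rough illustration of the table-lookup idea (the table below is a placeholder that would be generated offline from your code table; if your longest code does not fit in PEEK_BITS bits, you would fall back to a second-level table, as zlib's inflate does):

#include <stdint.h>

#define PEEK_BITS 8   /* must be >= the longest code you want to resolve in one step */

typedef struct {
    int8_t  value;    /* decoded signed level */
    uint8_t length;   /* number of bits this codeword actually consumes */
} decode_entry;

/* One entry per possible value of the next PEEK_BITS bits; entries for codes
   shorter than PEEK_BITS are simply replicated.  Filled in elsewhere. */
static decode_entry table[1 << PEEK_BITS];

/* bitbuf holds the next bits of the stream left-aligned (MSB first). */
static int decode_one(uint32_t bitbuf, int *consumed)
{
    const decode_entry *e = &table[bitbuf >> (32 - PEEK_BITS)];
    *consumed = e->length;
    return e->value;
}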
I'd be interested to see your times of the original algorithm before doing any optimisations. Finally, what hardware / OS are you running on?
Best regards,
