In zlib, what happens when the Huffman code lengths for the alphabets exceed the maximum code length (15)?

From https://www.rfc-editor.org/rfc/rfc1951:
Note that in the "deflate" format, the Huffman codes for the
various alphabets must not exceed certain maximum code lengths.
The maximum code length is defined as 15.
What happens when a Huffman code length exceeds 15?
From https://cs.stackexchange.com/questions/75542/maximum-size-of-huffman-codes-for-an-alphabet-containing-256-letters:
The maximum possible code size for a 256 symbol alphabet is 256 bits. Consider the case when the most frequent symbol has frequency 1/2, the next most frequent symbol has frequency 1/4, then 1/8...
So in the literal/length alphabet (286 symbols) an unrestricted Huffman code could in principle be 286 - 1 = 285 bits long, but in zlib the maximum code length is 15.
Why was 15 chosen as the maximum code length?
Will zlib fail if a code length exceeds 15?

We don't know for sure why Phil Katz chose 15, but it was likely to facilitate a fast implementation on a 16-bit processor.
No, zlib will not fail. This happens all the time. zlib applies the normal Huffman algorithm, after which, if the longest code is longer than 15 bits, it modifies the codes to force them all to 15 bits or less.
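The general idea can be sketched in C. This is only an illustration of the length-limiting step, not zlib's actual code (zlib does the equivalent adjustment on its internal tree, in gen_bitlen() in trees.c); the Kraft sum is kept scaled by 2^15 so everything stays in integer arithmetic:

#include <stddef.h>

#define MAXBITS 15

/* Force all code lengths to MAXBITS or less while keeping the set of
 * lengths feasible (Kraft inequality). Assumes n <= 2^MAXBITS and every
 * len[i] >= 1. A real implementation would also shorten codes afterwards
 * to restore optimality; this sketch only restores feasibility. */
void limit_lengths(int len[], size_t n)
{
    long k = 0;  /* Kraft sum, scaled by 2^MAXBITS */
    size_t i;

    /* Clamp overlong codes; this can push the scaled sum past 2^MAXBITS. */
    for (i = 0; i < n; i++) {
        if (len[i] > MAXBITS)
            len[i] = MAXBITS;
        k += 1L << (MAXBITS - len[i]);
    }

    /* While infeasible, lengthen the longest code still shorter than
     * MAXBITS; going from len to len+1 halves that code's Kraft share. */
    while (k > (1L << MAXBITS)) {
        size_t best = n;
        for (i = 0; i < n; i++)
            if (len[i] < MAXBITS && (best == n || len[i] > len[best]))
                best = i;
        k -= 1L << (MAXBITS - len[best] - 1);
        len[best]++;
    }
}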
Note that your example resulting in a 256-bit long code would require a set of 2^256 ~= 10^77 symbols in order to arrive at those frequencies. I don't think you have enough memory for that.
In any case, zlib normally limits a deflate block to 16384 symbols. For that number, the maximum Huffman code length is 19. That comes from a Fibonacci-like (Lucas) sequence of probabilities, not your powers of two. (Left as an exercise for the reader.)

Related

How to convert very large string to number in C?

I am working on my 3x program, and while developing it I found this article about handling big numbers, but the final result is a string, and with a string I can't do math operations. I use the gcc compiler.
Also, this program is not meant to solve a problem; I created it just to test performance.
In fact, you can't. C natively supports integers of limited width (64 bits, about 20 decimal digits) and floating point (about 15 significant digits), nothing larger.
So there is no better way than using a representation in a large radix (a power of 2 or a power of 10) and implementing the operations on that representation yourself. For example, addition can be done digit by digit, with occasional carries.
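As a rough sketch of that digit-by-digit idea, here is addition in radix 10^9, so each "digit" fits in a 32-bit word. The fixed width and names are just for illustration; a real big-number library such as GMP handles all of this far more completely:

#include <stdio.h>

#define BASE 1000000000u   /* radix 10^9: nine decimal digits per word */
#define NWORDS 4           /* fixed width for this toy example */

/* r = a + b; numbers are little-endian arrays of base-10^9 digits */
void big_add(const unsigned a[], const unsigned b[], unsigned r[])
{
    unsigned carry = 0;
    for (int i = 0; i < NWORDS; i++) {
        unsigned long long s = (unsigned long long)a[i] + b[i] + carry;
        r[i] = (unsigned)(s % BASE);   /* digit stays below the radix */
        carry = (unsigned)(s / BASE);  /* occasional carry, 0 or 1 */
    }
}

int main(void)
{
    /* 999999999999999999 + 1 = 1000000000000000000 */
    unsigned a[NWORDS] = { 999999999u, 999999999u, 0, 0 };
    unsigned b[NWORDS] = { 1u, 0, 0, 0 };
    unsigned r[NWORDS];
    big_add(a, b, r);
    /* prints the words most significant first; a real printer would
       zero-pad the inner words with %09u */
    printf("%u %u %u %u\n", r[3], r[2], r[1], r[0]);
    return 0;
}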

Huffman decompression using tables: maximum code length?

I know that zlib uses a 2-level table to look up Huffman codes when decompressing a file. This has the underlying assumption that the Huffman code for any symbol will not be longer than 18 (9+9) bits... is there any mathematical reason for this assumption?
The deflate format restricts the maximum Huffman code length to 15 bits.
They must be limiting it in some way during compression.
For straight Huffman encoding, there is no such limit.
The pathological case is when one symbol is more common than all the remaining symbols combined, and then, for the remaining symbols, one symbol is more common than the rest combined, and so on. For byte-sized symbols, this type of (extremely unlikely) distribution will give a Huffman code length of 255 bits for the two least common codes.
(Calculating the minimal length of an input that has the above property is left as an exercise for the reader).
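For reference, the two-level lookup the question describes works roughly like the sketch below. The entry layout and the peek()/drop() bit-reader helpers are assumptions made here for illustration; zlib's actual inflate tables are packed differently:

#include <stdint.h>

#define ROOT 9  /* bits indexed by the first-level table */

struct entry {
    uint16_t symbol;          /* decoded symbol (valid in leaf entries) */
    uint8_t  bits;            /* leaf: code length; link: ROOT + sub-table width */
    const struct entry *sub;  /* NULL for a leaf, else the second-level table */
};

/* peek(n) returns the next n input bits without consuming them (deflate
   order, first bit in the lowest position); drop(n) consumes n bits. */
uint16_t decode_one(const struct entry *root,
                    uint32_t (*peek)(int), void (*drop)(int))
{
    const struct entry *e = &root[peek(ROOT)];
    if (e->sub != NULL)                      /* code longer than ROOT bits */
        e = &e->sub[peek(e->bits) >> ROOT];  /* extra bits select the sub-entry */
    drop(e->bits);                           /* consume the whole code */
    return e->symbol;
}

With deflate's 15-bit limit and a 9-bit root table, no second-level table ever needs more than 6 index bits, which is why two levels always suffice.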

Checksum with low probability of false negative

At the moment I'm using a simple checksum scheme that just adds up the words in a buffer. Firstly, my question is: what is the probability of a false negative, that is, of the receiving system calculating the same checksum as the sending system even when the data is different (corrupted)?
Secondly, how can I reduce the probability of false negatives? What is the best checksumming scheme for that? Note that each word in the buffer is 64 bits (8 bytes) long, i.e. a long variable on a 64-bit system.
Assuming a sane checksum implementation, the probability of a randomly chosen input string colliding with a reference input string is 1 in 2^n, where n is the checksum length in bits.
However, if you're talking about input that differs from the original by a low number of bits, then the probability of collision is generally much, much lower.
One possibility is to have a look at T. Maxino's thesis titled "The Effectiveness of Checksums for Embedded Networks" (PDF), which contains an analysis for some well-known checksums.
However, usually it is better to go with CRCs, which have additional benefits, such as detection of burst errors.
For these, P. Koopman's paper "Cyclic Redundancy Code (CRC) Selection for Embedded Networks" (PDF) is a valuable resource.
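For illustration, here is a minimal bit-at-a-time CRC-32 using the common reflected polynomial 0xEDB88320 (the one used by zlib and Ethernet); production code uses table-driven or slice-by-N variants, but the logic is the same:

#include <stdint.h>
#include <stddef.h>

/* bitwise CRC-32 (IEEE 802.3, reflected), one input byte at a time */
uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;            /* standard initial value */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)        /* one division step per bit */
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;                           /* standard final inversion */
}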

Huffman table entropy decoding simplification (in C)

First time using this site to ask a question, but I have gotten many many answers!
Background:
I am decoding a variable-length video stream that was encoded using RLE and Huffman encoding. The stream is 10 to 20 kilobytes long, and therefore I am trying to "squeeze" as much time out of every step as I can so it can be decoded efficiently in real time.
Right now the step I am working on involves converting the bitstream into a number based on a Huffman table. I do this by counting the number of leading zeros to determine the number of trailing bits to include. The table looks like:
001xxs      range -3 to 3
0001xxxs    range -7 to 7
00001xxxxs  range -15 to 15
And on till 127. The s is a sign bit, 0 means positive, 1 means negative. So for example if clz=2 then I would read the next 3 bits, 2 for value and 1 for sign.
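In code form, the scheme just described looks roughly like this sketch (next_bit() is an assumed bit-reader helper returning one bit at a time, not part of the original program):

extern int next_bit(void);   /* returns the next bit of the stream */

int decode_symbol(void)
{
    int clz = 0;
    while (next_bit() == 0)  /* count leading zeros up to the 1 bit */
        clz++;

    unsigned value = 0;      /* read clz magnitude bits plus 1 sign bit */
    for (int i = 0; i < clz + 1; i++)
        value = (value << 1) | (unsigned)next_bit();

    /* low bit is the sign (0 positive, 1 negative), the rest is magnitude */
    return ((value & 1) ? -1 : 1) * (int)(value >> 1);
}

For clz = 2 this consumes the prefix 001, then reads xxs, matching the worked example below.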
Question:
Right now the nasty expression I created to do this is:
int outblock[64];
int index = 0;   /* position in the output block */
unsigned int value;
/* example: value 7 -> 111 (xxs), which translates to -3 */
value = 7;
outblock[index] = ((value & 1) ? -1 : 1) * (int)(value >> 1); /* the expression */
Is there a simpler and faster way to do this?
Thanks for any help!
Tim
EDIT: Expression edited because it was not generating proper positive values. Generates positive and negative properly now.
I just quickly googled "efficient huffman decoding" and found the following links which may be useful:
Efficient Huffman Decoding with Table Lookup
Previous question - how to decode huffman efficiently
It seems the most efficient way to Huffman-decode is to use a table lookup. Have you tried a method like this?
I'd be interested to see your times of the original algorithm before doing any optimisations. Finally, what hardware / OS are you running on?
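One micro-level idea worth benchmarking is a branchless version of the sign step; whether it actually beats the ternary depends entirely on the compiler and target, so treat this as a sketch to measure rather than a guaranteed win:

/* decode the xx...xs packing without a branch: conditional two's-complement
   negate, using the identity -m == (m ^ -1) + 1 */
static inline int decode_signed(unsigned int value)
{
    int mag = (int)(value >> 1);  /* magnitude bits */
    int s   = (int)(value & 1);   /* 0 = positive, 1 = negative */
    return (mag ^ -s) + s;        /* s == 1: ~mag + 1 == -mag; s == 0: mag */
}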
Best regards,

Will MD5 ever return the same output as its input? [duplicate]

Is there a fixed point in the MD5 transformation, i.e. does there exist x such that md5(x) == x?
Since an MD5 sum is 128 bits long, any fixed point would necessarily also have to be 128 bits long. Assuming that the MD5 sum of any string is uniformly distributed over all possible sums, then the probability that any given 128-bit string is a fixed point is 1/2^128.
Thus, the probability that no 128-bit string is a fixed point is (1 − 1/2^128)^(2^128), so the probability that there is a fixed point is 1 − (1 − 1/2^128)^(2^128).
Since the limit as n goes to infinity of (1 − 1/n)^n is 1/e, and 2^128 is most certainly a very large number, this probability is almost exactly 1 − 1/e ≈ 63.21%.
Of course, there is no randomness actually involved – either there is a fixed point or there isn't. But, we can be 63.21% confident that there is a fixed point. (Also, notice that this number does not depend on the size of the keyspace – if MD5 sums were 32 bits or 1024 bits, the answer would be the same, so long as it's larger than about 4 or 5 bits).
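As a quick numeric check of that claim (under the same idealized uniform model; log1p() keeps the computation accurate once 1/2^n is too small to subtract directly from 1.0):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* P(at least one fixed point) = 1 - (1 - 1/2^n)^(2^n) */
    for (int n = 4; n <= 64; n += 12) {
        double N = ldexp(1.0, n);                   /* N = 2^n */
        double p = 1.0 - exp(N * log1p(-1.0 / N));  /* stable form */
        printf("n = %2d bits: %.6f\n", n, p);
    }
    printf("1 - 1/e   : %.6f\n", 1.0 - exp(-1.0));  /* ~0.632121 */
    return 0;
}

Already at n = 16 the probability agrees with 1 − 1/e to five decimal places.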
My brute-force attempt found a 12-hex-digit prefix match and a 12-hex-digit suffix match.
prefix 12:
54db1011d76dc70a0a9df3ff3e0b390f -> 54db1011d76d137956603122ad86d762
suffix 12:
df12c1434cec7850a7900ce027af4b78 -> b2f6053087022898fe920ce027af4b78
Blog post:
https://plus.google.com/103541237243849171137/posts/SRxXrTMdrFN
Since the hash is irreversible, this would be very hard to figure out. The only way to solve it would be to calculate the hash of every possible output of the hash and see if you come up with a match.
To elaborate: there are 16 bytes in an MD5 hash, so there are 2^(16*8) ≈ 3.4 × 10^38 combinations. If it took 1 millisecond to compute a hash of a 16-byte value, it would take 10790283070806014188970529154.99 years (about 10^28 years) to calculate all those hashes.
While I don't have a yes/no answer, my guess is "yes", and furthermore that there are maybe 2^32 such fixed points (for the bit-string interpretation, not the character-string interpretation). I'm actively working on this because it seems like an awesome, concise puzzle that will require a lot of creativity (if you don't settle for brute-force search right away).
My approach is the following: treat it as a math problem. We have 128 boolean variables and 128 equations describing the outputs in terms of the inputs (which are supposed to match). By plugging in all of the constants from the tables in the algorithm and the padding bits, my hope is that the equations can be greatly simplified to yield an algorithm optimized for the 128-bit input case. These simplified equations can then be programmed in some nice language for an efficient search, or treated abstractly again, assigning single bits at a time and watching out for contradictions. You only need to see a few bits of the output to know that it does not match the input!
Probably, but finding it would take longer than we have or would involve compromising MD5.
There are two interpretations, and if one is allowed to pick either, the probability of finding a fixed point increases: if each interpretation independently has a fixed point with probability 1 − 1/e, then the probability that at least one of them does is 1 − (1/e)^2 ≈ 86.5%.
Interpretation 1: does the MD5 of an MD5 output in binary match its input?
Interpretation 2: does the MD5 of an MD5 output in hex match its input?
Strictly speaking, MD5 processes its input in 512-bit blocks (a 128-bit input gets padded out to a full block) while the output is only 128 bits, so I would say a literal fixed point is impossible by definition.
