How high do I have to count before I hit an MD5 hash collision?

Never mind why I'm doing this -- this is mainly theoretical.
If I were MD5 hashing string representations of integers, how high would I have to count before two of the hashes collide?

This problem (in the generic case) is known as the birthday paradox.
The probability of a collision in the generic case can be computed easily. However, in your particular case, you would have to actually compute (and store!) each MD5.
EDIT #Scott: not really. The pigeonhole principle says that with 2^128 possible MD5 values, we are guaranteed a collision after 2^128 + 1 tries. The birthday paradox says that the probability of a collision exceeds 0.5 after roughly 2^64 MD5 values (about the square root of the number of possible values).
With these estimates for storage requirements, it's up to you to decide whether the problem is worth it. To me it is not.
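For a rough sense of scale, here is a sketch of the birthday estimate, using the standard approximation P(collision) ~ 1 - exp(-n^2 / (2M)) with M = 2^128 and solving for the 50% point:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Birthday approximation: P(collision) ~ 1 - exp(-n^2 / (2M)). */
    double M = ldexp(1.0, 128);           /* 2^128 possible digests */
    double n = sqrt(2.0 * M * log(2.0));  /* count where P reaches 0.5 */
    printf("P = 0.5 after about %.3g hashes (2^%.1f)\n", n, log2(n));
    return 0;
}

This prints roughly 2.2 x 10^19, i.e. about 2^64.2 hashes before a collision becomes more likely than not.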

Apparently, one can base a thesis on this very thing (or similar problems, anyway). I haven't read it, but maybe something in Stevens' thesis will help you (it's apparently linked from the Wikipedia article).

In a perfect world, up to 2^128 + 1. But I doubt MD5 is perfect; I can't give you an exact number, but it is guaranteed to be <= 2^128 + 1.

Here is a scientific way to find an estimate of how high you would have to count.
Make an MD5 hash that is cut down to, say, 4 bits. Count until it collides (and keep counting until you reach, say, 100 collisions, so you get a good average).
Then do the same thing at 8 bits (again, wait for many collisions so you can compute an average).
Do it again until you have averages for 4, 8, 12, and 16 bits, then see if you can find a trend. Extrapolate that trend up to 128 bits; a sketch of the experiment follows below.
You may want to XOR all 128 bits together to produce your shorter version. Taking just the first or last part may not be the best test.
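A minimal sketch of that experiment, assuming OpenSSL's MD5() is available (compile with -lcrypto); it XOR-folds each 128-bit digest down to k bits and counts how many integer strings get hashed before the first collision:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>

/* XOR-fold a 128-bit digest down to k bits (k <= 31 here). */
static unsigned fold(const unsigned char d[16], int k) {
    unsigned acc = 0;
    for (int i = 0; i < 16; i++)
        acc ^= (unsigned)d[i] << (8 * (i % 4));
    return acc & ((1u << k) - 1);
}

/* Hash "0", "1", "2", ... until two k-bit folded hashes collide. */
static unsigned long first_collision(int k) {
    unsigned char *seen = calloc((size_t)1 << k, 1);
    unsigned char digest[16];
    char buf[32];
    unsigned long n = 0;
    for (;;) {
        snprintf(buf, sizeof buf, "%lu", n);
        MD5((const unsigned char *)buf, strlen(buf), digest);
        unsigned h = fold(digest, k);
        if (seen[h]) break;
        seen[h] = 1;
        n++;
    }
    free(seen);
    return n;
}

int main(void) {
    for (int k = 4; k <= 16; k += 4)
        printf("%2d bits: first collision after %lu hashes\n",
               k, first_collision(k));
    return 0;
}

To average as suggested above, one would repeat the run from different starting integers; for a k-bit hash the first collision is expected after on the order of sqrt(2^k) values.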

Related

How to get pseudo-random uniformly distributed integers in C good enough for statistical simulation?

I'm writing a Monte Carlo simulation and am going to need a lot of random bits for generating integers uniformly distributed over {1,2,...,N} where N<40. The problem with using the C rand function is that I'd waste a lot of perfectly good bits using the standard rand % N technique. What's a better way for generating the integers?
I don't need cryptographically secure random numbers, but I don't want them to skew my results. Also, I don't consider downloading a batch of bits from random.org a solution.
rand % N does not work; it skews your results unless RAND_MAX + 1 is a multiple of N.
A correct approach is to find the largest multiple of N that is no larger than RAND_MAX + 1, draw random numbers, and discard any draw that is not below that bound; only then apply the modulo operation. This gives you a worst-case rejection rate of just under 50%.
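A minimal sketch of that rejection loop (the helper name uniform_below is mine):

#include <stdlib.h>

/* Unbiased integer in [0, n): reject any draw at or above the
   largest multiple of n that fits in rand()'s output range. */
int uniform_below(int n) {
    unsigned long bound = ((unsigned long)RAND_MAX + 1UL) / n * n;
    unsigned long r;
    do {
        r = (unsigned long)rand();
    } while (r >= bound);
    return (int)(r % n);
}

Each residue class mod n appears exactly bound/n times among the accepted draws, so the result is exactly uniform; the worst case (n just over half the range) rejects just under half the draws.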
in addition to Oli's answer:
if you're desperately concerned about bits, you can manage a queue of bits by hand, retrieving only as many as are necessary for the next number (i.e. ceil(log2(n)) bits); see the sketch after this answer.
but you should make sure that your generator is good enough. simple linear congruential generators are better in the higher bits than the lower (see comments), so your current modular-division approach makes more sense there.
Numerical Recipes has a really good section on all this and is very easy to read (not sure it mentions saving bits, but it's a good general reference).
update: if you're unsure whether this is needed or not, i would not worry about it for now (unless you have better advice from someone who understands your particular context).
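a hedged sketch of that bit queue (helper names are mine; note the caveat above about weak low bits in a poor rand()):

#include <stdlib.h>

static unsigned long bitbuf = 0;  /* queued random bits */
static int bitcnt = 0;            /* number of bits currently queued */

/* Pull k random bits (k <= 15), refilling 15 bits at a time from rand(). */
static unsigned get_bits(int k) {
    while (bitcnt < k) {
        bitbuf = (bitbuf << 15) | ((unsigned long)rand() & 0x7FFFUL);
        bitcnt += 15;
    }
    bitcnt -= k;
    return (unsigned)(bitbuf >> bitcnt) & ((1u << k) - 1);
}

/* Uniform integer in [0, n), spending ceil(log2(n)) bits per attempt. */
int uniform_bits(int n) {
    int k = 0;
    while ((1 << k) < n) k++;          /* k = ceil(log2(n)) */
    unsigned r;
    do { r = get_bits(k); } while (r >= (unsigned)n);
    return (int)r;
}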
Represent rand() in base 40 and take the digits as numbers. Drop any incomplete digits; that is, drop the first digit if it doesn't have the full range [0..39], and drop the whole random number if the first digit takes its highest possible value (e.g. if RAND_MAX in base 40 is 21 23 05 06, drop all numbers whose leading base-40 digit is 21).
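One way to implement that scheme, as a sketch (it rejects the whole draw when the leading digit is at its cap, then unpacks the remaining complete digits; the helper name is mine):

#include <stdlib.h>

/* Fill out[] with complete, uniform base-40 digits from one rand()
   draw. Returns the digit count, or 0 if the draw was rejected.
   out[] needs room for 12 digits (enough for RAND_MAX up to 2^63). */
int rand_base40(int out[12]) {
    unsigned long range = (unsigned long)RAND_MAX + 1UL;
    unsigned long pow40 = 1UL;
    int k = 0;
    while (pow40 <= range / 40UL) {    /* largest 40^k <= range */
        pow40 *= 40UL;
        k++;
    }
    unsigned long bound = range / pow40 * pow40;
    unsigned long r = (unsigned long)rand();
    if (r >= bound) return 0;          /* leading digit at its maximum */
    r %= pow40;                        /* drop the incomplete leading digit */
    for (int i = k - 1; i >= 0; i--) {
        out[i] = (int)(r % 40UL);
        r /= 40UL;
    }
    return k;
}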

32-bit checksum algorithm better quality than CRC32?

Are there any 32-bit checksum algorithms with either:
Smaller hash collision probability for input data sizes < 1 KB?
Collision hits with a more uniform distribution?
Both relative to CRC32. I'm practically not counting on the first property, because of the limitation of a 32-bit output. But for the second... it seems there could be improvements.
Any ideas? Thanks. (I need a concrete implementation, preferably in C, but C++/C# or anything to start from is also OK.)
How about MurmurHash? It is said that this hash has good distribution (it passes chi-square tests) and a good avalanche effect, as well as very good computing speed; its finalizer is sketched below.
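For reference, the 32-bit finalizer from MurmurHash3 (the step responsible for the avalanche effect) is short enough to sketch here:

#include <stdint.h>

/* MurmurHash3 fmix32: mixes a word so that flipping one input bit
   flips about half of the output bits. */
uint32_t fmix32(uint32_t h) {
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}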
Not for the first criterion. Any well-designed hash function with a 32-bit output has a 1 in 2^32 chance of a collision for any pair of inputs. The second criterion is not very well defined, although there are surely statistical tests that could be used, and I'm sure someone has done it (chi-square on collision intervals?). As for needing an implementation, I strongly recommend that you not accept any proposed code for a hash function that is not an implementation of a well-known hash, as there is a high risk of security problems or poor performance when rolling your own hash or encryption. A well-known but bad hash function is better than one you designed yourself, even if the latter tests well and has a "good" collision distribution, simply because the former has more eyeballs on it.

How likely are md5 false positive checksums?

I have a client who is distributing large binary files internally. They are also passing md5 checksums of the files and apparently verifying the files against the checksum before use as part of their workflow.
However they claim that "often" they are encountering corruption in the files where the md5 is still saying that the file is good.
Everything I've read suggests that this should be hugely unlikely.
Does this sound likely? Would another hashing algorithm provide better results? Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?
NB, I don't yet know what "often" means in this context. They are processing hundreds of files a day. I don't know if this is a daily, monthly or yearly occurrence.
MD5 is a 128 bit cryptographic hash function, so different messages should be distributed pretty well over the 128-bit space. That would mean that two files (excluding files specifically built to defeat MD5) should have a 1 in 2^128 chance of collision. In other words, if a pair of files is compared every nanosecond, it wouldn't have happened yet.
If a file is corrupted, the probability that the corrupted file has the same MD5 checksum as the uncorrupted file is 1 in 2^128. In other words, it will happen almost as "often" as never. It is astronomically more likely that your client is misreporting what really happened (e.g. they are computing the wrong hash).
Sounds like a bug in their use of MD5 (maybe they are MD5-ing the wrong files), or a bug in the library that they're using. For example, an older MD5 program that I used once didn't handle files over 2GB.
This question suggests that you'd expect a collision on average every 100 years or so if you were generating 6 billion files per second, so it's quite unlikely.
Does this sound likely?
No, the chance of a random corruption producing the same checksum is 1 in 2^128, or about 1 in 3.40 × 10^38. This number puts a one-in-a-billion (10^9) chance to shame.
Would another hashing algorithm provide better results?
Probably not. While MD5 has been broken with respect to collision resistance under deliberate attack, it's fine against random corruption, and it's a popular standard to use.
Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?
Probably, but consider all possible points of problems:
File corrupted before MD5 generation.
File corrupted after MD5 verification.
MD5 program or supporting framework has a bug.
Operator misuse (unintentional, e.g. running the MD5 program on the wrong file).
Operator abuse (intentional, e.g. skipping the verification step).
If it is the last, then one final thought is to distribute the files in a wrapper format that forces the operator to unwrap the file, where the unwrapping performs verification during extraction. I'm thinking of something like gzip or 7-Zip, which support large files, possibly with compression turned off (I don't know offhand whether they allow that, but gzip at least verifies a CRC32 of the uncompressed data on extraction).
There are all sorts of reasons that binaries either won't get distributed or, if they do, arrive corrupted (firewalls, size limitations, virus insertions, etc.). You should always encrypt files when sending binaries (even low-grade encryption is better than none) to help protect data integrity.
Couldn't resist a back-of-envelope calculation:
There are 2^128 possible MD5 hashes, or c. 3.4 × 10^38 (that is, odds of about 340 billion billion billion billion to 1 against). Let's call this number M.
The probability of the Kth hash avoiding all earlier ones, given that the first K-1 are all distinct, is 1 - (K-1)/M, since K-1 of the M values are already taken.
So P(no duplicate in N file hashes) = Product[k = 1...N] (1 - (k-1)/M). When N^2 <<< M this approximates to 1 - (1/2)N^2/M, giving P(one or more duplicates) ≈ (1/2)N^2/M, where (1/2)N^2 approximates the number of pairwise comparisons between hashes.
So let's say we take a photograph of EVERYONE ON THE PLANET (7.8 billion people, a little under 2^33); then there are about 30.4 billion billion pairwise comparisons to make (a little under 2^65).
This means that the chance of a matching MD5 hash (assuming a perfectly even distribution) is still 2^65/2^128 = 2^-63, or about 1 in 10,000,000,000,000,000,000.
MD5 is a pretty decent hash function for non-hostile environments, which means your client's chance of a false match is far smaller than, say, the chance of their CEO going crazy and burning down the data centre, let alone the stuff they actually worry about.

64-bit multiplicative hashing

I'm working on a fast 64-bit hash. Many existing secure hash functions are way too slow, and some non-cryptographic hash functions like FNV are just bad.
Well, I came up with a FNV-like hash:
UINT64 hash = 0;
for (size_t i = 0; i < len; i++)   // for each input byte
    hash = (hash ^ (input[i] + 1)) * HASH_PRIME;
Main question is about HASH_PRIME. Often, we see a "golden ratio" constant used for multiplicative hashing.
For a 64-bit hash, the golden ratio constant is 0x9e3779b97f4a7c13.
I tested the 32-bit golden ratio for its period as a PRNG:
DWORD hash=0;
// loop
hash=(hash^1)*0x9e3779b9;
rnd_out=hash>>24;
A good value here may produce a period of 0xFFFFFFFF, i.e. the maximum possible. This golden ratio produces a notably smaller period.
or just
DWORD hash=~0;
// loop
hash*=0x9e3779b9;
rnd_out=hash>>24;
And again, a good enough multiplier can produce a period of 0x3FFFFFFF output bytes. The golden ratio here again produces a much shorter period.
I never tested 64-bit primes - too computationally expensive.
Is the period important for my hash? And where can I find good 64-bit HASH_PRIME values, and how do I test such stuff?
Are you doing this as an exercise? Otherwise I would advise having a look at well-known hash functions such as Bob Jenkins' lookup8 and the rest of the lookup family (http://burtleburtle.net/bob/hash/) and Austin Appleby's MurmurHash (http://code.google.com/p/smhasher/) - a speed killer and my favorite. Good hash functions are hard to build... and if you are after a rolling type of hash, Rabin fingerprints are hard to beat.
And to make sure that your hashes are decent, if you really want to roll your own, use Appleby's and Jenkins' hash test suites (SMHasher and Jenkins' torture tests).
Not sure about the first two examples, but in the third, to get a full period out of the code you need to add an odd number. Otherwise this will have a maximum period of 65537; it could be as low as 3. There may even be a fixed point.
Wherever you got 0x3FFFFFFF as a good period, it is not correct. One of the Knuth volumes discusses this in excessive detail.
The multiplier must be of the form 4n+1 and there must be an odd addend; see the sketch below.
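A sketch of those conditions at 32 bits (brute-forcing a 64-bit period is impractical, so this checks the 32-bit analogue; a multiplier ≡ 1 mod 4 plus an odd addend is the classic Hull-Dobell recipe for the full period 2^32):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t x = 0;
    uint64_t period = 0;
    do {
        x = x * 0x9e3779b9u + 1u;  /* 4n+1 multiplier, odd addend */
        period++;
    } while (x != 0);
    printf("period: %llu\n", (unsigned long long)period);  /* 4294967296 */
    return 0;
}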

Will MD5 ever return the same output as its input? [duplicate]

Is there a fixed point in the MD5 transformation, i.e. does there exist x such that md5(x) == x?
Since an MD5 sum is 128 bits long, any fixed point would necessarily also have to be 128 bits long. Assuming that the MD5 sum of any string is uniformly distributed over all possible sums, the probability that any given 128-bit string is a fixed point is 1/2^128.
Thus, the probability that no 128-bit string is a fixed point is (1 − 1/2^128)^(2^128), so the probability that there is a fixed point is 1 − (1 − 1/2^128)^(2^128).
Since the limit as n goes to infinity of (1 − 1/n)^n is 1/e, and 2^128 is most certainly a very large number, this probability is almost exactly 1 − 1/e ≈ 63.21%.
Of course, there is no randomness actually involved – either there is a fixed point or there isn't. But we can be 63.21% confident that there is a fixed point. (Also, notice that this number does not depend on the size of the keyspace – if MD5 sums were 32 bits or 1024 bits, the answer would be the same, so long as they're larger than about 4 or 5 bits.)
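A quick numeric check of that limit (nothing MD5-specific, just confirming that 1 − (1 − 1/n)^n approaches 1 − 1/e):

#include <stdio.h>
#include <math.h>

int main(void) {
    for (double n = 16.0; n <= 1e12; n *= 1e3)
        printf("n = %.0e: P = %.6f\n", n, 1.0 - pow(1.0 - 1.0/n, n));
    printf("1 - 1/e : P = %.6f\n", 1.0 - exp(-1.0));
    return 0;
}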
My brute-force attempt found a 12-hex-digit prefix match and a 12-hex-digit suffix match.
prefix 12:
54db1011d76dc70a0a9df3ff3e0b390f -> 54db1011d76d137956603122ad86d762
suffix 12:
df12c1434cec7850a7900ce027af4b78 -> b2f6053087022898fe920ce027af4b78
Blog post:
https://plus.google.com/103541237243849171137/posts/SRxXrTMdrFN
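A hedged sketch of this kind of brute-force search, again assuming OpenSSL's MD5() (compile with -lcrypto); it hashes successive 32-hex-character strings and reports each new record-length prefix match:

#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

int main(void) {
    char in[33], out[33];
    unsigned char d[16];
    int best = 0;
    for (unsigned long long ctr = 0; ctr < 100000000ULL; ctr++) {
        snprintf(in, sizeof in, "%032llx", ctr);        /* candidate input */
        MD5((const unsigned char *)in, 32, d);
        for (int i = 0; i < 16; i++)                    /* digest -> hex */
            sprintf(out + 2 * i, "%02x", d[i]);
        int k = 0;
        while (k < 32 && in[k] == out[k]) k++;          /* common prefix */
        if (k > best) {
            best = k;
            printf("%s -> %s (%d hex digits)\n", in, out, k);
        }
    }
    return 0;
}

Matching a 12-hex-digit prefix, as above, takes on the order of 16^12 ≈ 3 × 10^14 hashes, so expect a very long run.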
Since the hash is irreversible, this would be very hard to figure out. The only way to solve it would be to calculate the hash of every possible output of the hash and see if you come up with a match.
To elaborate, there are 16 bytes in an MD5 hash. That means there are 2^(16*8) = 3.4 × 10^38 combinations. If it took 1 millisecond to compute a hash of a 16-byte value, it would take about 1.08 × 10^28 years to calculate all those hashes.
While I don't have a yes/no answer, my guess is "yes", and furthermore that there are maybe 2^32 such fixed points (for the bit-string interpretation, not the character-string interpretation). I'm actively working on this because it seems like an awesome, concise puzzle that will require a lot of creativity (if you don't settle for brute-force search right away).
My approach is the following: treat it as a math problem. We have 128 boolean variables and 128 equations describing the outputs in terms of the inputs (which are supposed to match). By plugging in all of the constants from the tables in the algorithm and the padding bits, my hope is that the equations can be greatly simplified to yield an algorithm optimized for the 128-bit input case. These simplified equations can then be programmed in some nice language for efficient search, or treated abstractly again, assigning single bits at a time and watching out for contradictions. You only need to see a few bits of the output to know that it is not matching the input!
Probably, but finding it would take longer than we have or would involve compromising MD5.
There are two interpretations, and if one is allowed to pick either, the probability of finding a fixed point rises to about 1 − 1/e^2 ≈ 86.5% (assuming the two cases are independent).
Interpretation 1: does the MD5 of a MD5 output in binary match its input?
Interpretation 2: does the MD5 of a MD5 output in hex match its input?
Strictly speaking, since MD5 processes its input in 512-bit blocks and the output is only 128 bits, I would say that's impossible by definition.
