Will MD5 ever return the same output as its input? [duplicate]

Is there a fixed point in the MD5 transformation, i.e. does there exist x such that md5(x) == x?

Since an MD5 sum is 128 bits long, any fixed point would necessarily also have to be 128 bits long. Assuming that the MD5 sum of any string is uniformly distributed over all possible sums, the probability that any given 128-bit string is a fixed point is 1/2^128.
Thus, the probability that no 128-bit string is a fixed point is (1 − 1/2^128)^(2^128), so the probability that there is a fixed point is 1 − (1 − 1/2^128)^(2^128).
Since the limit of (1 − 1/n)^n as n goes to infinity is 1/e, and 2^128 is most certainly a very large number, this probability is almost exactly 1 − 1/e ≈ 63.21%.
Of course, there is no randomness actually involved – either there is a fixed point or there isn't. But we can be 63.21% confident that there is a fixed point. (Also, notice that this number does not depend on the size of the keyspace – if MD5 sums were 32 bits or 1024 bits, the answer would be the same, so long as the space is larger than about 4 or 5 bits.)
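For the skeptical, here is a quick numerical check (a C sketch; the sample values of n are arbitrary) showing how fast 1 − (1 − 1/n)^n approaches 1 − 1/e:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* compute 1 - (1 - 1/n)^n stably as 1 - exp(n * log1p(-1/n)) */
    double ns[] = { 32, 1024, 1e6, 1e12 };
    for (int i = 0; i < 4; i++) {
        double n = ns[i];
        printf("n = %-8.0e  1 - (1 - 1/n)^n = %.10f\n",
               n, 1.0 - exp(n * log1p(-1.0 / n)));
    }
    printf("limit: 1 - 1/e       = %.10f\n", 1.0 - exp(-1.0));
    return 0;
}

Compile with -lm; already at n = 32 the value is within half a percent of 1 − 1/e ≈ 0.6321205588.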

My brute-force attempt found a 12-hex-digit prefix match and a 12-hex-digit suffix match.
prefix 12:
54db1011d76dc70a0a9df3ff3e0b390f -> 54db1011d76d137956603122ad86d762
suffix 12:
df12c1434cec7850a7900ce027af4b78 -> b2f6053087022898fe920ce027af4b78
Blog post:
https://plus.google.com/103541237243849171137/posts/SRxXrTMdrFN
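For reference, a rough sketch of what such a brute-force search can look like – this is not the blog author's code, and it assumes OpenSSL's legacy MD5() (link with -lcrypto); it steps a 128-bit counter and tracks the longest shared hex prefix between the input and its digest:

#include <openssl/md5.h>   /* legacy API; link with -lcrypto */
#include <stdio.h>

static void to_hex(const unsigned char *in, char *out) {
    static const char hex[] = "0123456789abcdef";
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++) {
        out[2 * i]     = hex[in[i] >> 4];
        out[2 * i + 1] = hex[in[i] & 0x0f];
    }
    out[2 * MD5_DIGEST_LENGTH] = '\0';
}

int main(void) {
    unsigned char x[MD5_DIGEST_LENGTH] = {0};  /* 128-bit candidate */
    unsigned char d[MD5_DIGEST_LENGTH];
    char xs[33], ds[33];
    int best = 0;

    for (unsigned long iter = 0; iter < 50000000UL; iter++) {
        /* step the candidate: treat the buffer as a little-endian counter */
        for (int i = 0; i < MD5_DIGEST_LENGTH && ++x[i] == 0; i++)
            ;
        MD5(x, MD5_DIGEST_LENGTH, d);
        to_hex(x, xs);
        to_hex(d, ds);
        int k = 0;
        while (k < 32 && xs[k] == ds[k])  /* length of shared hex prefix */
            k++;
        if (k > best) {
            best = k;
            printf("%2d-digit prefix: %s -> %s\n", k, xs, ds);
        }
    }
    return 0;
}

Matching all 32 hex digits this way is of course hopeless; the point is only to extend the partial matches a digit at a time.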

Since the hash is irreversible, this would be very hard to figure out. The only way to solve it would be to calculate the hash of every possible output of the hash and see if you come up with a match.
To elaborate, there are 16 bytes in an MD5 hash. That means there are 2^(16*8) = 2^128 ≈ 3.4 × 10^38 combinations. If it took 1 millisecond to compute a hash of a 16-byte value, it would take roughly 1.08 × 10^28 years to calculate all those hashes.

While I don't have a yes/no answer, my guess is "yes", and furthermore that there are maybe 2^32 such fixed points (for the bit-string interpretation, not the character-string interpretation). I'm actively working on this because it seems like an awesome, concise puzzle that will require a lot of creativity (if you don't settle for brute-force search right away).
My approach is the following: treat it as a math problem. We have 128 boolean variables and 128 equations describing the outputs in terms of the inputs (which are supposed to match). By plugging in all of the constants from the tables in the algorithm and the padding bits, my hope is that the equations can be greatly simplified to yield an algorithm optimized for the 128-bit input case. These simplified equations can then be programmed in some nice language for efficient search, or treated abstractly again, assigning single bits at a time and watching out for contradictions. You only need to see a few bits of the output to know that it is not matching the input!

Probably, but finding it would take longer than we have or would involve compromising MD5.

There are two interpretations, and if one is allowed to pick either, the probability of finding a fixed point increases to 1 − 1/e^2 ≈ 86.5% (treating the two cases as independent).
Interpretation 1: does the MD5 of an MD5 output, taken as 16 raw bytes, match its input?
Interpretation 2: does the MD5 of an MD5 output, written out in hex, match its input?

Strictly speaking, since MD5 pads every input up to a multiple of 512 bits while the output is only 128 bits, the block the algorithm actually processes can never equal its digest – in that narrow sense I would say it's impossible by definition.

Related

Finding the pair of strings with the most identical letters in an array

Suppose I have an array of strings of different lengths.
It can be assumed that the strings have no repeating characters.
Using a brute-force algorithm, I can find the pair of strings that has the most identical letters (order does not matter - for example, "ABCDZFW" and "FBZ" have 3 identical letters) in n-squared time.
Is there a more efficient way to do this?
Attempt: I've tried to think of a solution using the trie data structure, but this won't work since a trie would only group together strings with similar prefixes.
I can find the pair of strings that has the most identical letters (order does not matter - for example, "ABCDZFW" and "FBZ" have 3 identical letters) in n-squared time.
I think you can't, since the string comparison itself is O(max(length(s1), length(s2))) on top of the O(n^2) loop for checking all pairs. However, you can optimize the comparison of the strings to some extent.
As you mentioned, the strings don't have duplicate characters, and based on your input I am assuming they consist only of uppercase letters. So each string can be at most 26 characters long.
For each string we can build a bitmask, setting the corresponding bit to 1 for each character of the string. For example:
ABCGH
11000111 (written from MSB down to LSB: the bits for H, G, C, B, A are set)
Thus, we have n bit-masks for n strings.
Way #1
Now you can check all possible pairs of strings with an O(n^2) loop, comparing each pair by ANDing the two corresponding masks and counting the set bits of the result (its Hamming weight). This is an obvious improvement over your version because the string comparison is now optimized: it is just an AND of two 32-bit integers, which is an O(1) operation.
For example for any two strings comparison will be:
ABCDG
ABCEF
X1 = mask(ABCDG) => 1001111
X2 = mask(ABCEF) => 0110111
X1 AND X2 => 0000111
hamming weight(0000111) => 3 // number of set bits
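A minimal sketch of Way #1 in C (assuming GCC/Clang for __builtin_popcount, and uppercase-only strings as discussed; the masks are precomputed once):

#include <stdio.h>

/* build a 26-bit mask with one bit per distinct letter 'A'..'Z' */
static unsigned mask_of(const char *s) {
    unsigned m = 0;
    for (; *s; s++)
        m |= 1u << (*s - 'A');
    return m;
}

int main(void) {
    const char *strs[] = { "ABCDZFW", "FBZ", "ABCEF", "ABCDG" };
    const int n = sizeof strs / sizeof strs[0];
    unsigned masks[sizeof strs / sizeof strs[0]];
    int besti = 0, bestj = 1, best = -1;

    for (int i = 0; i < n; i++)
        masks[i] = mask_of(strs[i]);   /* O(total length) preprocessing */

    for (int i = 0; i < n; i++)        /* O(n^2) pairs, O(1) per pair */
        for (int j = i + 1; j < n; j++) {
            int common = __builtin_popcount(masks[i] & masks[j]);
            if (common > best) { best = common; besti = i; bestj = j; }
        }

    printf("best pair: %s / %s with %d common letters\n",
           strs[besti], strs[bestj], best);
    return 0;
}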
Way #2
Now, one observation: ANDing two set bits yields a set bit. So for each pair of masks we want to maximize the Hamming weight (total number of set bits) of the AND of the two masks, since the pair of strings with the most matched characters has those bits set in both masks, and ANDing keeps exactly the shared ones.
Now build a trie over all the masks – each trie node branches on whether the corresponding bit is 0 or 1, and each mask is inserted from MSB to LSB. Before inserting the i-th mask into the trie (which already holds the first i − 1 masks), query it to maximize the Hamming weight of the AND recursively: follow the same-bit branch (to make that bit 1 in the final AND), but also explore the opposite-bit branch, because deeper levels might contribute more set bits along that path.
For a nice pictorial explanation of this trie technique, you can find a similar thread here (that one works with XOR).
In the worst case we will need to traverse many branches of the trie to maximize the Hamming weight – around 6 × 10^6 operations (roughly a second on a typical machine) – and we also need additional space for the trie. But if the total number of strings is 10^5, an O(n^2) algorithm takes 10^10 operations, which is far too many, so the trie approach is still much better.
Let me know if you're having problems with the implementation. Unfortunately, I can only help you with code if you're a C/C++ or Java person.
Thanks @JimMischel for pointing out a major flaw; I slightly misunderstood the statement at first.

lightweight (quasi-random) integer fingerprint of C string

I would like to generate a nicely-mixed-up integer fingerprint of an arbitrary C string (s). Most C strings will consist of ASCII text characters:
I want very different fingerprints for similar strings, especially such similar strings as "ab" and "ba";
I want it to be difficult to invert back from the fingerprint to the string (my string is typically longer than 32 bits, so many strings map to the same integer), which again means that I want similar strings to yield very different codes;
I want to use the 32 bits available to me efficiently in the integer result;
I want the function source to be small;
I want the function to be fast.
One usage is security (but not encryption) related: I can ask a user for a text password, convert it into an integer for storage, and later test whether this integer is correct. (I know I could store strings, but I don't want to. Guessing a 32-bit integer correctly is practically impossible if my program can slow down incorrect attempts to the point where brute force cannot work faster than password guessing.) Another use of this function is as the start of a hash index function (mod array length) into an array.
Alas, I am probably reinventing the wheel here; such functions have probably been written a million times, and by people who are much more versed in cryptography. I don't need AES, of course, but something much more lightweight – the use is different.
My first thinking was:
Mod 64 each character to take advantage of the ASCII text aspect; now I have 6 bits. Call this x.
I can place a 6-bit string into 5 locations in a 32-bit space, leaving 2 bits over.
Take the current string index position (0, 1, 2...), mod 5 it to determine where I want to start placing my x into my running integer result code, and XOR my x into this running-result integer.
Use the remaining 2 bits to increment a counter (mod 4, to prevent overflow) for each character processed.
Then I thought that bit operations may be computer-fast but take more source code. I can think of other choices: take each index position i and multiply it by the ASCII representation of each character (or the x from above), and call this y[i]. Now do the following:
Calculate the natural logarithm of the sum of the y's (or this sum plus the running result), and just pretend that the first 32 bits of this result (which is really a double; maybe leave off the first few bits) are an integer representation. I can XOR each bitint(log(y[i])) into the running integer result.
Or do it even cheaper: just add the y's, and then do the logarithm with the 32-bit pick-off just once at the end. Alternatively, run a sum of y's through srand as a seed and grab a rand.
There are probably a few other ways to do it, too. In sum, the function should map strings into very different integers, be short to code, and be very fast.
Any pointers?
A common method of generating a short digest or hash of a string is to compute a Cyclic Redundancy Check (CRC).
Source for CRC is widely available; in this case you should use a common CRC-32 such as the one used by Ethernet. Different CRCs work on the same principle, but use different polynomials. Do not be tempted to invent your own polynomial; the distribution is likely to be sub-optimal. (One caveat: a CRC is linear and easy to run backwards, so it is not a good fit for the password use case, but it meets the small/fast requirements.)
What you're looking for is called a "hash". Two examples of hash functions I'm aware of that return short integers are MurmurHash and SipHash. MurmurHash, as I recall, is not designed to be a cryptographic hash, while SipHash is designed with security in mind, as stated on its homepage. MurmurHash has versions that return 32-bit and 64-bit outputs; SipHash returns a 64-bit output.
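To illustrate how small such a function can be, here is a sketch of FNV-1a, a classic 32-bit string hash that is tiny, fast, and mixes well (though, like a CRC, it is not cryptographically secure, so it is a weak choice for the password use case):

#include <stdint.h>
#include <stdio.h>

/* FNV-1a: xor in each byte, then multiply by the FNV prime */
static uint32_t fnv1a_32(const char *s) {
    uint32_t h = 2166136261u;      /* FNV-32 offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;            /* FNV-32 prime */
    }
    return h;
}

int main(void) {
    /* similar strings yield very different fingerprints */
    printf("ab -> %08x\n", fnv1a_32("ab"));
    printf("ba -> %08x\n", fnv1a_32("ba"));
    return 0;
}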

Normalising an 18-bit input to 0-9999

I'm writing a program in which I need to normalise an 18-bit input to the range 0-9999. This is something I have never come across before.
I have searched the internet, and correct me if I am wrong here, but is this as simple as converting the 18-bit binary (000000000000000000) input into a natural number and then dividing it by 1000?
Is there a different and more efficient method?
Thank you
No, what you want to do is multiply your input by 0.03814697265625, which is 10000/2^18.
The reasoning is pretty simple: you take your range of inputs (0..2^18) and split it into 10000 "slices", so each slice covers just over 26 input values. If you divide your input by this slice width (equivalently, multiply it by 10000/2^18), you'll get your number in the 0..9999 range.
Edit: depending on your background, you may need to know that here I use ^ with the meaning of exponentiation. Might be moot since this question is tagged C and it has no first-class concept of exponentiation, but it's definitely not XOR!
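Since the question is tagged C: the same scaling can be done in pure integer arithmetic, avoiding the floating-point constant entirely (a sketch; the widening to 64 bits guards against intermediate overflow):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t input = 262143;  /* any 18-bit value, 0..262143 */
    /* out = floor(input * 10000 / 2^18), i.e. input scaled to 0..9999 */
    uint32_t out = (uint32_t)(((uint64_t)input * 10000u) >> 18);
    printf("%u -> %u\n", input, out);  /* prints 262143 -> 9999 */
    return 0;
}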

Huffman table entropy decoding simplification (in C)

First time using this site to ask a question, but I have gotten many many answers!
Background:
I am decoding a variable-length video stream that was encoded using RLE and Huffman encoding. The stream is 10 to 20 kilobytes long, and therefore I am trying to squeeze as much time out of every step as I can so that it can be decoded efficiently in real time.
Right now the step I am working on involves converting the bitstream into a number based on a Huffman table. I do this by counting the number of leading zeros to determine the number of trailing bits to include. The table looks like:
001xs range -3 to 3
0001xxs range -7 to 7
00001xxxs range -15 to 15
And so on up to ±127. The s is a sign bit: 0 means positive, 1 means negative. So, for example, if clz=2 then I would read the next 3 bits, 2 for the value and 1 for the sign.
Question:
Right now the nasty expression I created to do this is:
int outblock[64];
int index = 0;  // position in outblock currently being filled
unsigned int value;
//example value 7 -> 111 (xxs) which translates to -3
value = 7;
outblock[index] = ((value & 1) ? -1 : 1) * (int)(value >> 1); //expression
Is there a simpler and faster way to do this?
Thanks for any help!
Tim
EDIT: Expression edited because it was not generating proper positive values. Generates positive and negative properly now.
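For what it's worth, a possible branchless variant of the same expression (a sketch; whether it beats the conditional depends on the compiler and target):

#include <stdio.h>

int main(void) {
    unsigned int value = 7;       /* example: xxs = 111, expect -3 */
    int s = -(int)(value & 1);    /* 0 if positive, -1 (all ones) if negative */
    /* (v ^ -1) - (-1) == ~v + 1 == -v, so this conditionally negates */
    int result = ((int)(value >> 1) ^ s) - s;
    printf("%d\n", result);       /* prints -3 */
    return 0;
}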
I just quickly googled "efficient huffman decoding" and found the following links which may be useful:
Efficient Huffman Decoding with Table Lookup
Previous question - how to decode huffman efficiently
It seems the most efficient way to Huffman-decode is to use a table lookup (a toy sketch of the idea appears below, after this answer). Have you tried a method like this?
I'd be interested to see your times of the original algorithm before doing any optimisations. Finally, what hardware / OS are you running on?
Best regards,
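To make the table-lookup idea concrete, here is a toy sketch (not the asker's actual code table): with the tiny code {A:"0", B:"10", C:"11"}, every 2-bit table index that begins with a codeword maps to that codeword's symbol and length, so a single array read replaces the bit-by-bit tree walk:

#include <stdint.h>
#include <stdio.h>

typedef struct { char sym; uint8_t len; } Entry;

int main(void) {
    /* every 2-bit index beginning with a codeword maps to that codeword */
    Entry table[4] = {
        {'A', 1}, {'A', 1},   /* indices 00, 01 -> code "0"  */
        {'B', 2},             /* index  10      -> code "10" */
        {'C', 2},             /* index  11      -> code "11" */
    };

    uint32_t bits = 0xB0000000u; /* bitstream 10110... = B, C, A */
    int have = 5;                /* number of valid bits buffered */

    while (have > 0) {
        Entry e = table[bits >> 30]; /* peek the top 2 bits */
        putchar(e.sym);
        bits <<= e.len;              /* consume only the matched length */
        have -= e.len;
    }
    putchar('\n');                   /* prints BCA */
    return 0;
}

The same shape scales to the asker's clz-based code by widening the peek to the longest codeword.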

How high do I have to count before I hit an MD5 hash collision?

Never mind why I'm doing this -- this is mainly theoretical.
If I were MD5 hashing string representations of integers, how high would I have to count before two of the hashes collide?
This problem (in the generic case) is known as the Birthday Paradox.
The probability of a collision in the generic case can be computed easily. However, in your particular case you would have to actually compute (and store!) each MD5.
EDIT @Scott: not really. The pigeonhole principle guarantees that with 2^128 possible MD5 values, we will certainly have a collision after 1 + 2^128 tries. The birthday paradox says that the probability of a collision exceeds 0.5 after roughly 1.18 × 2^64 MD5 values: for a space of N values, p(n) ≈ 1 − e^(−n^2 / 2N), which crosses 0.5 near n ≈ 1.18 √N.
With these estimates of the storage requirements, it's up to you to decide whether the problem is worth it. To me it is not.
Apparently, one can base a thesis on this very thing (or similar problems, anyway). I haven't read it, but maybe something in Stevens' thesis will help you (it's apparently linked from the Wikipedia article).
In a perfect world, you would have to count to 1 + 2^128. But I doubt MD5 is perfect; I can't give you an exact number, but it is guaranteed to be <= 1 + 2^128.
Here is a scientific way to find an estimate of how high you would have to count.
Make an MD5 hash that is cut down to, say, 4 bits. Count until it collides (and make sure you keep going until you reach, say, 100 collisions, so you get a good average).
Then do the same thing at 8 bits (again, collecting many collisions so you can compute an average).
Do it again and again until you have averages for 4, 8, 12, and 16 bits, and then see if you can find a trend. Extrapolate that trend up to 128 bits.
You may want to XOR all 128 bits together to come up with your shorter version; taking just the first or last part may not be the best test.
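A sketch of that experiment in C, assuming OpenSSL's legacy MD5() (link with -lcrypto -lm); for brevity it runs a single trial per width rather than averaging 100 collisions:

#include <openssl/md5.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    for (int k = 4; k <= 24; k += 4) {        /* truncated hash width */
        unsigned char *seen = calloc(1u << k, 1);
        unsigned char d[MD5_DIGEST_LENGTH];
        char buf[32];
        unsigned long n = 0;

        for (;;) {
            int len = snprintf(buf, sizeof buf, "%lu", n);
            MD5((unsigned char *)buf, (size_t)len, d);
            /* truncate: keep only the first k bits of the digest */
            uint32_t h = (((uint32_t)d[0] << 16) | (d[1] << 8) | d[2])
                         >> (24 - k);
            if (seen[h])
                break;                        /* first collision */
            seen[h] = 1;
            n++;
        }
        printf("k = %2d bits: collision after %8lu strings (sqrt(2^k) = %.0f)\n",
               k, n, sqrt((double)(1u << k)));
        free(seen);
    }
    return 0;
}

The counts should hover around a small constant times sqrt(2^k), which is exactly the birthday-paradox trend you would then extrapolate to k = 128.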
