Weird use of md5 for hashing a file - Does it do anything?

For uploading a file to a service, I was calculating the md5 based on the whole content of the file.
I was asked to do it in a different way: take the md5 of the whole file, and then also of a few more parts: 2% from the start of the file, 2% from 1/3 of the file, 2% from 2/3 of the file, and 2% from the end of the file, and then add the file size in bytes at the end.
Apparently this solves hash collisions between files. To me it seems like a waste of time, since you're not increasing the size of the md5. So for a very large number of files, you're still going to have, statistically, the same number of collisions.
Please help me understand the reasoning behind this.
EDIT: we are then hashing the resulting hashes.
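For reference, here is a rough sketch of the scheme as I understand it (Python, with hypothetical names; the offsets and the 2% slice size are my reading of the instructions):

```
import hashlib
import os

def piecewise_md5(path, fraction=0.02):
    # Hypothetical sketch of the scheme described above: MD5 of the whole
    # file, plus MD5 of 2% slices taken at the start, 1/3, 2/3 and end,
    # then an MD5 of all the resulting digests together with the file size.
    size = os.path.getsize(path)
    chunk = max(1, int(size * fraction))
    digests = []
    with open(path, "rb") as f:
        whole = hashlib.md5()
        for block in iter(lambda: f.read(1 << 20), b""):
            whole.update(block)
        digests.append(whole.digest())
        for offset in (0, size // 3, 2 * size // 3, max(0, size - chunk)):
            f.seek(offset)
            digests.append(hashlib.md5(f.read(chunk)).digest())
    # "hashing the resulting hashes", with the size in bytes appended
    return hashlib.md5(b"".join(digests) + str(size).encode()).hexdigest()
```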

A good cryptographically strong hashing algorithm is already designed with the goal to make it infeasible to intentionally find two different pieces of data with the same hash, let alone by accident. Therefore, just hashing the file is sufficient. Extra hashing of parts of the file is pointless.
This may seem unintuitive because obviously collisions must exist if the length of the hash is shorter than the length of the data. However, it is not feasible to find these collisions, because an MD5 hash is an unpredictable 128-bit number and the number of possible 128-bit numbers (2^128) is mind-boggling. If you could count at a rate of a trillion trillion (10^24) per second, counting through all 128-bit numbers would still take 2^128 / 10^24 seconds, which is about 10 million years. This is probably a good lower limit on the time it would take to find a hash collision by brute force without custom hardware.
That said, this is all assuming that there are no weaknesses in the hashing algorithm that allow you to do better than brute force. MD5 is broken in this regard, so you should not use it if you need to defend against attackers that would try to create collisions. It would be better to use a newer hashing algorithm like SHA-2 or SHA-3. (These also support even larger outputs such as 256 bits.)
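For instance, hashing the file once with SHA-256 is straightforward; a minimal sketch in Python (the function name and block size are just illustrative):

```
import hashlib

def file_sha256(path, block_size=1 << 20):
    # Hash the file once with SHA-256, reading it in blocks so large
    # files don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()  # 256-bit digest as 64 hex characters
```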

Sounds like a dangerous practice, because you're re-hashing without factoring in a lot of the data. The advantage, however, is that by running several hashes you effectively end up with a hash signature consisting of "more bits" (i.e. you are getting multiple MD5 hashes as a result).
If you want to do this - and are in fact okay with having more (larger) hash data to store/compare - you would be MUCH better advised to simply run a different hash function (other than MD5) that is more secure and/or uses a larger number of bits.
MD5 is an "older" algorithm and is known to have cryptographic weaknesses. I'd recommend one of the "SHA" algorithms - like SHA-256 or SHA-512. The advantages are that it is a stronger algorithm, you'd only have to hash the data ONCE, and you'd get more bits than with MD5, yet since you're running it only once it would be faster.
Note that the possibility of hash collisions always exists. Even "high end" storage products which use hashes for duplicate detection will compare the buffers to verify an exact match, even when two hashes match.

Related

How to generate a file from its MD5 sum

The title says it all. From my understanding, every file has a unique MD5 checksum. Is it possible to reverse-engineer the file from its sum?
For example, let's just say a video's sum was 5. I know that's unrealistic, but it's just an example. Could you write a program where you enter 5 and it generates a video?
In other words, instead of generating a sum from a file, you get a file from a sum.
No, it's one-way - otherwise it would be a great method of compression!
To expand on what Jim W said, any hash function is one-way, meaning it doesn't need to be easily reversible; while some hash functions may have inverses, most do not.
MD5 is a cryptographic hash function, which means it's intentionally designed to be very difficult to reverse. MD5 in particular is relatively weak: there are vulnerabilities that make it easy to find collisions, i.e. two files with the same MD5 hash.
Since an MD5 hash is only 128 bits, there are 2^128 different possible MD5 hashes, and while that's a very large number, there are still many, many more files than that in the world (potentially an infinite number, in fact), so it is inevitable that some files will hash to the same value. This, as user2864740 pointed out in a comment, is known as the pigeonhole principle.
A strong cryptographic hash function — like SHA-256 — is one for which it's considered computationally infeasible to reliably find such collisions.
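As a small illustration of why collisions must exist yet are not findable in practice: brute-forcing a collision on a severely truncated digest is quick, while the same search against a full 128- or 256-bit output is hopeless. A sketch (illustrative names only):

```
import hashlib
from itertools import count

def find_truncated_collision(bits=16):
    # Brute-force a collision on the first `bits` bits of SHA-256.
    # Feasible for tiny outputs like this; utterly infeasible for the
    # full 256 bits (or even MD5's 128 bits) by the same method.
    seen = {}
    for i in count():
        msg = str(i).encode()
        prefix = hashlib.sha256(msg).digest()[: bits // 8]
        if prefix in seen:
            return seen[prefix], msg
        seen[prefix] = msg

print(find_truncated_collision())  # typically finds a pair within a few hundred tries
```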

What is the best cryptographic hash function that generates 16-bit hash values in openssl?

I was thinking of just using SHA256 and then using only the first two bytes of the result. Is there anything wrong with this approach?
NOTE: The concern here is not malicious attacks, but to ensure the best possible protection against random bit flips.
Any hash that satisfies the strict avalanche criterion (that is, if any bit is flipped in the input, every bit in the output will be flipped with a probability of 50%) may be used in this way, and that includes every cryptographic hash in common use, including SHA512. There are security implications to using very short hashes, but if they really aren't relevant, as you claim, you're free to select the fastest hash available (probably MD5).
Since short hashes will be particularly vulnerable to the birthday paradox, though, consider using longer hashes anyway. If you're generating so many hashes that 16 bits versus 256 bits is significant, you will run into duplicates even without malicious attackers.
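For example, truncating SHA-256 to two bytes might look like this (a sketch using Python's hashlib, which is backed by OpenSSL; the function name is made up):

```
import hashlib

def hash16(data: bytes) -> int:
    # 16-bit check value: the first two bytes of a SHA-256 digest, as
    # suggested above. Fine for spotting random bit flips, but expect
    # birthday collisions after only a few hundred distinct inputs.
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

print(hex(hash16(b"hello world")))
```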

What are the chances of having 2 strings with the same md5 hash?

I read somewhere that md5 is not 100% secure. Hence, the question.
You seem to be asking 2 separate but related questions.
The probability of a random collision is highly dependent on the size of the data that you're working with; the more strings you're hashing, the more likely a collision is to occur. See the first table at Wikipedia: Birthday Attack for exact probabilities. MD5 uses 128 bits, so to achieve a 50% collision probability, you'll need 2.2E19 strings.
However, while random collisions are suitably rare for small data sets, MD5 has been shown to be completely insecure against intentional collisions. According to the Wikipedia article on MD5, a collision attack exists that can be run in seconds on a 2.6 GHz Pentium 4 processor. For security purposes, MD5 is completely broken, and has been considered so since 2005.
If you need to securely hash something, use one of the more modern hashing algorithms, such as SHA-2, SHA-3 (once its development is finished), or Whirlpool.
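For what it's worth, the 2.2E19 figure above is just the usual birthday approximation n ≈ sqrt(2 · ln 2 · 2^128), which is easy to check:

```
import math

# Birthday bound: number of random 128-bit hashes needed for a ~50%
# chance of at least one collision, n ≈ sqrt(2 * ln 2 * 2^128).
n_50 = math.sqrt(2 * math.log(2) * 2**128)
print(f"{n_50:.2e}")  # ≈ 2.2e19, matching the figure above
```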

How likely are md5 false positive checksums?

I have a client who is distributing large binary files internally. They are also passing md5 checksums of the files and apparently verifying the files against the checksum before use as part of their workflow.
However they claim that "often" they are encountering corruption in the files where the md5 is still saying that the file is good.
Everything I've read suggests that this should be hugely unlikely.
Does this sound likely? Would another hashing algorithm provide better results? Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?
NB, I don't yet know what "often" means in this context. They are processing hundreds of files a day. I don't know if this is a daily, monthly or yearly occurrence.
MD5 is a 128 bit cryptographic hash function, so different messages should be distributed pretty well over the 128-bit space. That would mean that two files (excluding files specifically built to defeat MD5) should have a 1 in 2^128 chance of collision. In other words, if a pair of files is compared every nanosecond, it wouldn't have happened yet.
If a file is corrupted, then the probability that the corrupted file has the same md5 checksum as the uncorrupted file is 1 in 2^128. In other words, it will happen almost as "often" as never. It is astronomically more likely that your client is misreporting what really happened (e.g. they are computing the wrong hash).
Sounds like a bug in their use of MD5 (maybe they are MD5-ing the wrong files), or a bug in the library that they're using. For example, an older MD5 program that I used once didn't handle files over 2GB.
This question suggests that, on average, you would get a collision every 100 years or so if you were generating 6 billion files per second, so it's quite unlikely.
Does this sound likely?
No, the chance of a random corruption causing the same checksum is 1 in 2^128, or 1 in 3.40 × 10^38. This number puts a 1 in a billion (10^9) chance to shame.
Would another hashing algorithm provide better results?
Probably not. While MD5 has been broken for collision-resistance against attack, it's fine against random corruption and a popular standard to use.
Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?
Probably, but consider all possible points of problems:
File corrupted before MD5 generation
File corrupted after MD5 verification.
MD5 program or supporting framework has a bug
Operator misuse (unintentional, e.g. running MD5 program on wrong file)
Operator abuse (intentional, e.g. skipping the verification step)
If it is the last, then one final thought is to distribute the files in a wrapper format that forces the operator to unwrap the file, where the unwrapping does verification during extraction. I'm thinking of something like Gzip or 7-Zip that supports large files and possibly allows turning off compression (I don't know whether they do).
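As a rough illustration of that wrapper idea (assuming gzip as the container and Python as the tooling; note that gzip already carries its own CRC-32 check), verification during extraction could look something like this:

```
import gzip
import hashlib

def extract_and_verify(gz_path, out_path, expected_md5):
    # Decompress the payload and compute its MD5 in the same pass,
    # refusing to accept the output if the digest doesn't match.
    h = hashlib.md5()
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        for block in iter(lambda: src.read(1 << 20), b""):
            h.update(block)
            dst.write(block)
    if h.hexdigest() != expected_md5:
        raise ValueError("MD5 mismatch: payload corrupted or wrong file")
```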
There are all sorts of reasons that binaries either won't get distributed or, if they do, arrive corrupted (firewalls, size limits, virus insertion, etc.). You should always encrypt files (even low-level encryption is better than none) when sending binary files, to help protect data integrity.
Couldn't resist a back-of-envelope calculation:
There are 2^128 possible MD5 hashes, or c. 3.4 × 10^38 (that is, odds of 340 billion billion billion billion to 1 against). Let's call this number 'M'.
The probability of the Kth hash not matching any of the first K-1 (given that those were all distinct) is (1 - (K-1)/M), as we already have K-1 unique hashes out of M.
So P(no duplicate in N file hashes) = Product[k = 1...N] (1 - (k-1)/M). When N^2 << M this approximates to 1 - (1/2)N^2/M, so P(one or more duplicates) ≈ (1/2)N^2/M, where (1/2)N^2 approximates the number of pair-wise comparisons of hashes that have to be made.
So let's say we take a photograph of EVERYONE ON THE PLANET (7.8 billion people, a little under 2^33); then there are about 30.4 billion billion pair-wise comparisons to make (a little under 2^65).
This means that the chance of a matching MD5 hash (assuming a perfectly even distribution) is still about 2^65/2^128 = 2^-63, or 1 in 10,000,000,000,000,000,000.
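The same estimate in a few lines of Python, for anyone who wants to check the arithmetic:

```
M = 2**128                 # possible MD5 values
N = 7_800_000_000          # one photo per person on the planet
pairs = N * (N - 1) // 2   # ~3.0e19 pair-wise comparisons, a little under 2^65
print(f"{pairs:.2e} pairs, P(collision) ~ {pairs / M:.1e}")
# prints roughly 3.04e+19 pairs, P(collision) ~ 8.9e-20, i.e. about 1 in 10^19
```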
MD5 is a pretty decent hash function for non-hostile environments which means the chance of your clients having a false match is far less likely than say the chance of their CEO going crazy and burning down the data centre, let alone the stuff they actually worry about.

Can I prevent duplicate content using md5?

I would like to prevent duplicate content. I do not want to keep copies of the content, so I decided to keep just the md5 signatures.
I read that md5 collisions do happen; different content could result in the same md5 signature.
Do you think md5 is enough?
Should I use md5 and sha1 together?
People have been able to deliberately produce MD5 collisions under contrived circumstances, but for preventing duplicate content (in the absence of malicious users) it's more than adequate.
Having said that, if you can use SHA-1 (or SHA-2) you should - you'll be fractionally but measurably safer from collisions.
MD5 should be fine; collisions are very rare. But if you're really worried, you can use sha-1 as well.
Though I guess the signatures really aren't that large, so if you have the spare processing cycles and the disk space, you could do both. But if space or speed is limited, I'd just go with one.
Why not simply compare the content byte for byte if there is a hash collision? Hash collisions are very rare, so you're only going to have to do a byte-for-byte check very rarely. That way duplicates will only be detected if the items are actually duplicated.
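A minimal sketch of that approach (hypothetical names, with an in-memory store just for illustration):

```
import hashlib

# md5 hex digest -> list of stored byte strings that share that digest
store = {}

def add_if_new(content: bytes) -> bool:
    # Returns True if the content was new and stored, False if an
    # identical copy (byte-for-byte) was already present.
    key = hashlib.md5(content).hexdigest()
    bucket = store.setdefault(key, [])
    if any(existing == content for existing in bucket):
        return False           # genuine duplicate
    bucket.append(content)     # new content (or a rare hash collision)
    return True
```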
md5 should be enough. Yes, there can be collisions, but the chances of that happening are so incredibly small that I wouldn't worry about it unless you were literally tracking many billions of pieces of content.
If you're really afraid of accidental collisions just do both MD5 and SHA1 hashes and compare them. If they both match, it's the same content. If either one differs, it's different content.
Combining algorithms serves to only obfuscate, but does not increase security in a hashing algorithm.
MD5 is too broken to use anyway, IMHO. Researchers have demonstrated forging content that generates an MD5 collision, thereby opening the door to generating a forged CSR to buy a cert from RapidSSL for a domain name they don't own. Security Now! episode 179 explains the process.
For me, SHA-based hashes are stronger and most development platforms support them, so the choice is easy. The remaining deciding factor is then the block size.
A timestamp + md5 together are safe enough.
MD5 is broken and SHA1 is close to it. Use SHA2.
edit
Based on an update from the OP, it doesn't seem that intentional collisions are a serious concern here. For unintentional ones, any decent hash with at least a 64-bit output would be fine.
I would still avoid MD5 and even SHA1, in general, but there's no reason to be dogmatic about it. If the tool fits here, then by all means use it.
