SHA-256 or MD5 for file integrity

I know that SHA-256 is favored over MD5 for security, etc., but if I am using a method only to check file integrity (that is, nothing to do with password encryption, etc.), is there any advantage to using SHA-256?
Since MD5 is 128-bit and SHA-256 is 256-bit (therefore twice as big)...
Would it take up to twice as long to encrypt?
Where time is not of the essence, as in a backup program, and file integrity is all that is needed, would anyone argue against MD5 in favor of a different algorithm, or even suggest a different technique?
Does using MD5 produce a checksum?

Both SHA-256 and MD5 are hashing algorithms. They take your input data, in this case your file, and output a fixed-size number: 256 bits for SHA-256, 128 bits for MD5. This number is a checksum. There is no encryption taking place, because an infinite number of inputs can result in the same hash value (although in reality collisions are rare).
SHA-256 takes somewhat more time to calculate than MD5, according to this answer.
Offhand, I'd say that MD5 would probably be suitable for what you need.
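To make the mechanics concrete, here is a minimal sketch of computing a file checksum with either algorithm, using Python's standard hashlib module (the file name is just a placeholder):

    # Chunked reading keeps memory use constant even for huge files.
    import hashlib

    def file_checksum(path, algorithm="md5", chunk_size=65536):
        h = hashlib.new(algorithm)  # e.g. "md5" or "sha256"
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(file_checksum("backup.tar", "md5"))     # 32 hex digits (128 bits)
    print(file_checksum("backup.tar", "sha256"))  # 64 hex digits (256 bits)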

Every answer seems to suggest that you need to use secure hashes to do the job, but all of these are tuned to be slow, to force a brute-force attacker to spend lots of computing power; depending on your needs, this may not be the best solution.
There are algorithms specifically designed to hash files as fast as possible, for integrity checking and comparison (MurmurHash, xxHash...). Obviously these are not designed for security, as they don't meet the requirements of a secure hash algorithm (i.e. randomness), but they have low collision rates for large messages. These features make them ideal if you are not looking for security but for speed.
Examples of these algorithms and a comparison can be found in this excellent answer: Which hashing algorithm is best for uniqueness and speed?.
As an example, we at our Q&A site use murmur3 to hash the images uploaded by the users so we only store them once even if users upload the same image in several answers.
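As a sketch of what such a fast hash looks like in practice, here is xxHash over a file in Python; this assumes the third-party xxhash package is installed (mmh3 offers MurmurHash3 in the same spirit):

    # Fast, non-cryptographic file hash; fine for integrity checks and
    # comparison, not for security. Assumes: pip install xxhash
    import xxhash

    def fast_file_hash(path, chunk_size=65536):
        h = xxhash.xxh64()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()  # 16 hex digits (64 bits)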

To 1):
Yes; on most CPUs, SHA-256 is only about 40% as fast as MD5.
To 2):
I would argue for a different algorithm than MD5 in such a case; I would definitely prefer an algorithm that is considered safe. However, this is more a feeling than a hard requirement. Cases where it matters are contrived rather than realistic: for example, if your backup system happens to store files constructed as part of an attack on an MD5-based certificate, you could end up with two files that have different data but identical MD5 checksums. In all other cases it doesn't matter, because MD5 checksums virtually never collide (i.e. produce the same checksum for different data) unless a collision is provoked intentionally.
I'm not an expert on the various hashing (checksum-generating) algorithms, so I cannot suggest another algorithm. Hence this part of the question is still open.
Suggested further reading is Cryptographic Hash Function - File or Data Identifier on Wikipedia; further down that page there is also a list of cryptographic hash algorithms.
To 3):
MD5 is an algorithm to calculate checksums. A checksum calculated using this algorithm is then called an MD5 checksum.
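To get a feel for the speed difference mentioned under 1) on your own hardware, a rough benchmark like the following will do; absolute numbers and the exact ratio vary by CPU:

    # Rough single-core throughput of MD5 vs. SHA-256 on in-memory data.
    import hashlib
    import timeit

    data = b"x" * (16 * 1024 * 1024)  # 16 MiB of sample data

    for name in ("md5", "sha256"):
        t = timeit.timeit(lambda: hashlib.new(name, data).digest(), number=10)
        print(f"{name}: {10 * len(data) / t / 1e6:.0f} MB/s")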

The underlying MD5 algorithm is no longer deemed secure. So while md5sum is well-suited for identifying known files in situations that are not security related, it should not be relied on if there is a chance that files have been purposefully and maliciously tampered with. In the latter case, the use of a newer hashing tool such as sha256sum is highly recommended.
So, if you are simply looking to check for file corruption or file differences, and the source of the file is trusted, MD5 should be sufficient. If you are looking to verify the integrity of a file coming from an untrusted source, or coming from a trusted source over an unencrypted connection, MD5 is not sufficient.
Another commenter noted that Ubuntu and others use MD5 checksums. Ubuntu has moved to PGP and SHA-256, in addition to MD5, but the documentation of the stronger verification strategies is more difficult to find. See the HowToSHA256SUM page for more details.
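For illustration, here is a minimal sketch of that sha256sum-style verification in Python; the file name and expected digest are placeholders:

    # Verify a downloaded file against a published SHA-256 checksum.
    import hashlib

    def verify_sha256(path, expected_hex):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest() == expected_hex.lower()

    expected = "0" * 64  # placeholder: paste the digest from SHA256SUMS
    print(verify_sha256("ubuntu.iso", expected))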

No; it's slower, but not that slow.
For a backup program it may even be worth using something faster than MD5.
All in all, I'd say that MD5 in addition to the file name is perfectly adequate. SHA-256 would just be slower and harder to handle because of its size.
You could also use something less secure than MD5 without any problem. If nobody is trying to attack your file integrity, that is safe too.

It is well established that MD5 is faster than SHA-256, so for just verifying file integrity it will be sufficient and better for performance.
You can check out the following resources:
Speed Comparison of Popular Crypto Algorithms
Comparison of cryptographic hash functions

Yes, on most CPUs SHA-256 is two to three times slower than MD5, though not primarily because of its longer hash. See the other answers here and the answers to this Stack Overflow question.
Here's a backup scenario where MD5 would not be appropriate:
1. Your backup program hashes each file being backed up. It then stores each file's data by its hash, so if you're backing up the same file twice you only end up with one copy of it.
2. An attacker can cause the system to back up files they control.
3. The attacker knows the MD5 hash of a file they want to remove from the backup.
4. The attacker can then use the known weaknesses of MD5 to craft a new file that has the same hash as the file to remove. When that file is backed up, it will replace the file to remove, and that file's backed-up data will be lost.
This backup system could be strengthened a bit (and made more efficient)
by not replacing files whose hash it has previously encountered, but
then an attacker could prevent a target file with a known hash from
being backed up by preemptively backing up a specially constructed bogus
file with the same hash.
Obviously most systems, backup and otherwise, do not satisfy the
conditions necessary for this attack to be practical, but I just wanted
to give an example of a situation where SHA-256 would be preferable to
MD5. Whether this would be the case for the system you're creating
depends on more than just the characteristics of MD5 and SHA-256.
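For concreteness, here is a toy sketch of such a store-by-hash backup scheme, keyed on SHA-256 so that crafting a second file with the same key is not feasible; the store directory and index handling are hypothetical:

    # A toy content-addressable store: identical content is kept once.
    # Keying on SHA-256 (rather than MD5) blocks the collision attack
    # described above. Paths here are placeholders.
    import hashlib
    import os
    import shutil

    STORE = "backup_store"  # hypothetical store directory
    os.makedirs(STORE, exist_ok=True)

    def backup(path):
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        dest = os.path.join(STORE, digest)
        if not os.path.exists(dest):  # same content is stored only once
            shutil.copyfile(path, dest)
        return digest  # would be recorded in the backup index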
Yes, cryptographic hashes like the ones generated by MD5 and SHA-256 are a type of checksum.
Happy hashing!

Related

What are the chances of having 2 strings with the same md5 hash?

I read somewhere that MD5 is not 100% secure, hence the question.
You seem to be asking 2 separate but related questions.
The probability of a random collision is highly dependent on the size of the data that you're working with; the more strings you're hashing, the more likely a collision is to occur. See the first table at Wikipedia: Birthday Attack for exact probabilities. MD5 uses 128 bits, so to achieve a 50% collision probability, you'll need 2.2E19 strings.
However, while random collisions are suitably rare for small data sets, MD5 has been shown to be completely insecure against intentional collisions. According to the Wikipedia article on MD5, a collision attack exists that can be run in seconds on a 2.6 GHz Pentium 4 processor. For security, MD5 is completely broken, and has been considered so since 2005.
If you need to securely hash something, use one of the more modern hashing algorithms, such as SHA-2, SHA-3 (when its development is finished), or Whirlpool.
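As a sanity check of the 2.2E19 figure above, the 50% birthday bound for an n-bit hash is roughly sqrt(2 ln 2) * 2^(n/2); a short sketch using only the standard math module:

    # Number of random inputs needed for a ~50% chance of at least one
    # collision in an n-bit hash (birthday bound approximation).
    import math

    def inputs_for_half_collision(bits):
        return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)

    print(f"{inputs_for_half_collision(128):.1e}")  # ~2.2e+19 for MD5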

How reliable is the adler32 checksum?

I wonder how reliable the Adler-32 checksum is, compared to e.g. MD5 checksums? Wikipedia says that Adler-32 is "much less reliable" than MD5, so I wonder how much less, and in which way?
More specifically, I'm wondering if it is reliable enough as a consistency check for long-time archiving of (tar) files of size 20GB+?
For details on the error-checking capabilities of the Adler-32 checksum, see for example Revisiting Fletcher and Adler Checksums (Maxino, 2006).
This paper contains an analysis of the Hamming distance provided by these two checksums, and gives an indication of the residual error rate for data words up to about 2^11 bits, which is obviously much less than your requirement of roughly 2^38 bits (20 GB+).
Adler32 has an entirely different purpose than MD5. Adler32 is a checksum. MD5 is a secure message digest. Adler32 is for quick hashes, has a small bit space, and simple algorithm. Its collision rate is low, but not low enough to be secure. MD5, SHA, and other cryptographic/secure hashes (or message digests) have much larger bitspaces and more complex algorithms, thus have far fewer collisions. Compare SHA2-256, for example; 256 bits compared to Adler32's measly 32 bits.
Adler does have its purpose, in hash tables for instance, or rapid data integrity checks. Still, it is not designed with the same purpose as MD5 or other secure digests.
BTW, if a simple but somewhat reliable checksum is what you need, then it seems Fletcher outperforms Adler. I'd speculate they both outperform CRC, though perhaps not a simple addition-based checksum (which, however, is very prone to collisions). If you want BOTH performance AND security, then use BOTH algorithms: the checksum algorithm for a quick calculation and lookup, then the larger digest for a more thorough confirmation if a match is found.
To answer your question on ensuring the validity of archives, I would say that it would probably suffice just fine. Best choice? Questionable. Possibility of error? Very low.
This is an ancient algorithm; one which, as the Wikipedia page says, "trades accuracy for speed". In short, no, you shouldn't rely on it.
The point is that with multiple corruptions, this checksum might still pass as "okay". Due to the avalanche effect, this is significantly less likely to occur in modern algorithms (even the old MD5).
For today's machines, speed is not so much of a concern, so I'd suggest using a modern algorithm (whichever is current), even for files in the TB range. The insignificant time savings you'd get with an old checksum system are IMHO not enough to balance the significantly increased risk of undetected data corruption; and honestly, 20 GB of files is not so much data these days that you'd need to use weak (and I daresay broken) algorithms.
It is less reliable than, say, MD5 or CRC (about the same as CRC, actually). Its advantage is speed; its disadvantage shows more for short data (a few hundred bytes), where the distribution of hash values does not cover the available 32-bit output space very well. For big files it is a good choice.
Adler-32 and MD5 are not comparable in this way. MD5 is intended to be a cryptographic checksum for when you want to make sure that a file hasn't been tampered with by an adversary, while Adler-32 (and also CRC, which is comparable to Adler-32) is intended for making sure a file hasn't been altered by accident (an integrity checksum).
MD5 is actually considered broken for its cryptographic purposes, and is only useful now as an integrity check when you want more bits of certainty. The only way Adler-32 can be "less reliable" is that it potentially allows more bits to be altered while retaining the same output, which means there is more room for collisions.
This link gives a good discussion of how using Adler-32 can provide performance benefits for some kinds of code that need cryptographic sums for added certainty: namely, you can use the small, cheap checksum to see whether doing the more expensive MD5/SHA/Whirlpool is worth considering when looking for changed files.
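Here is a sketch of that two-tier idea, using Adler-32 from Python's standard zlib module as the quick filter and SHA-256 as the thorough confirmation:

    # Cheap Adler-32 first; fall back to SHA-256 only when the cheap
    # checksums agree and stronger confirmation is wanted.
    import hashlib
    import zlib

    def adler32_file(path, chunk_size=65536):
        value = 1  # Adler-32 initial value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF

    def sha256_file(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def same_content(path_a, path_b):
        if adler32_file(path_a) != adler32_file(path_b):
            return False  # quick check already differs
        return sha256_file(path_a) == sha256_file(path_b)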

A suitable hash function to detect data corruption / check for data integrity?

What is the most suitable hash function for file integrity checking (checksums) to detect corruption?
I need to consider the following:
Wide range of file sizes (1 KB to 10 GB+)
Lots of different file types
Large collection of files (+/-100 TB and growing)
Do larger files require larger digest sizes (SHA-1 vs. SHA-512)?
I see that the SHA-family is referred to as cryptographic hash functions. Are they ill-suited for "general purpose" use such as detecting file corruption? Will something like MD5 or Tiger be better?
If malicious tampering is also a concern, will your answer change w.r.t the most suitable hash function?
External libraries are not an option, only what's available on Windows XP SP3+.
Naturally performance is also of concern.
(Please excuse my terminology if it is incorrect, my knowledge on this subject is very limited).
Any cryptographic hash function, even a broken one, will be fine for detecting accidental corruption. A given hash function may be defined only for inputs up to some limit, but for all standard hash functions that limit is at least 2^64 bits, i.e. about 2 million terabytes. That's quite large.
File type has no bearing whatsoever. Hash functions operate over sequences of bits (or bytes) regardless of what those bits represent.
Hash function performance is unlikely to be an issue. Even the "slow" hash functions (e.g. SHA-256) will run faster on a typical PC than the hard disk: reading the file will be the bottleneck, not hashing it (a 2.4 GHz PC can hash data with SHA-512 at a speed close to 200 MB/s, using a single core). If hash function performance is an issue, then either your CPU is very feeble, or your disks are fast SSDs (and if you have 100 TB of fast SSD then I am kind of jealous). In that case, some hash functions are somewhat faster than others, MD5 being one of the "fast" functions (but MD4 is faster, and it is simple enough that its code can be included in any application without much hassle).
If malicious tampering is a concern, then this becomes a security issue, and that's more complex. First, you will want to use one of the cryptographically unbroken hash functions, hence SHA-256 or SHA-512, not MD4, MD5 or SHA-1 (the weaknesses found in MD4, MD5 and SHA-1 might not apply to a specific situation, but this is a subtle matter and it is better to play it safe). Then, hashing may or may not be sufficient, depending on whether the attacker has access to the hash results. Possibly, you may need to use a MAC, which can be viewed as a kind of keyed hash. HMAC is a standard way of building a MAC out of a hash function; there are also MACs that are not hash-based. Moreover, a MAC uses a secret "symmetric" key, which is not appropriate if you want some people to be able to verify the file integrity without being able to perform silent alterations; in that case, you would have to resort to digital signatures. To be brief: in a security context, you need a thorough security analysis with a clearly defined attack model.
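As an illustration of the keyed-hash approach, here is a minimal HMAC-SHA256 sketch using Python's standard hmac module; the key shown is a placeholder and would have to be generated randomly and kept secret:

    # A keyed integrity tag: without the key, an attacker who can alter
    # the file cannot produce a matching tag.
    import hashlib
    import hmac

    KEY = b"replace-with-a-randomly-generated-secret"  # placeholder

    def tag_file(path, chunk_size=65536):
        mac = hmac.new(KEY, digestmod=hashlib.sha256)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                mac.update(chunk)
        return mac.hexdigest()

    def verify_file(path, expected_tag):
        # compare_digest avoids leaking information through timing
        return hmac.compare_digest(tag_file(path), expected_tag)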

How to find all files with the same content?

This is an interview question: "Given a directory with lots of files, find the files that have the same content". I would propose to use a hash function to generate hash values of the file contents and compare only the files with the same hash values. Does it make sense?
The next question is how to choose the hash function. Would you use SHA-1 for that purpose ?
I'd rather use the hash as a second step: sorting the directory by file size first, and hashing and comparing only when there are duplicate sizes, can greatly reduce your search universe in the general case.
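A sketch of this two-step approach: group files by size first, then hash only the groups with more than one member (MD5 here purely as a fast fingerprint, not for security):

    # Find duplicate files under a directory: size first, hash second.
    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

        by_hash = defaultdict(list)
        for paths in by_size.values():
            if len(paths) < 2:
                continue  # a unique size cannot be a duplicate
            for path in paths:
                with open(path, "rb") as f:
                    by_hash[hashlib.md5(f.read()).hexdigest()].append(path)
        return [group for group in by_hash.values() if len(group) > 1]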
Like most interview questions, it's more meant to spark a conversation than to have a single answer.
If there are very few files, it may be faster to simply do a byte-by-byte comparison until you reach bytes which do not match (assuming you do). If there are many files, it may be faster to compute hashes, as you won't have to seek around the disk reading chunks from multiple files. This process may be sped up by grabbing increasingly large chunks of each file as you progress through the files, eliminating candidates. It may also be necessary to distribute the problem among multiple servers, if there are enough files.
I would begin with a much faster and simpler hash function than SHA-1. SHA-1 is cryptographically secure, which is not necessarily required in this case. In my informal tests, Adler-32, for example, is 2-3 times faster. You could also use an even weaker presumptive test, then retest any files which match. This decision also depends on the relation between I/O bandwidth and CPU power: with a more powerful CPU, use a more specific hash to save having to reread files in subsequent tests; with faster I/O, the rereads may be cheaper than doing expensive hashes unnecessarily.
Another interesting idea would be to use heuristics on the files as you process them to determine the optimal method, based on file size, the computer's speed, and the file's entropy.
Yes, the proposed approach is reasonable and SHA-1 or MD5 will be enough for that task. Here's a detailed analysis for the very same scenario and here's a question specifically on using MD5. Don't forget you need a hash function as fast as possible.
Yes, hashing is the first that comes to mind. For your particular task you need to take the fastest hash function available. Adler32 would work. Collisions are not a problem in your case, so you don't need cryptographically strong function.

Can I prevent duplicate content using md5?

I would like to prevent duplicate content. I do not want to keep copies of the content itself, so I decided to keep just the MD5 signatures.
I read that MD5 collisions do happen: different content could result in the same MD5 signature.
Do you think MD5 is enough?
Should I use MD5 and SHA-1 together?
People have been able to deliberately produce MD5 collisions under contrived circumstances, but for preventing duplicate content (in the absence of malicious users) it's more than adequate.
Having said that, if you can use SHA-1 (or SHA-2) you should - you'll be fractionally but measurably safer from collisions.
MD5 should be fine, collisions are very rare, but if you're really worried, you can use sha-1 as well.
Though I guess the signatures really aren't that large, so if you have the spare processing cycles and the disk space, you could do both. But if space or speed is limited, I'd just go with one.
Why not simply compare the content byte for byte if there is a hash collision? Hash collisions are very rare, so you're only going to have to do a byte-for-byte check very rarely. That way, duplicates will only be detected if the items are actually duplicated.
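A minimal sketch of that idea: index content by its MD5 digest and confirm candidate duplicates byte for byte, so a collision can never cause distinct content to be treated as identical:

    # Duplicate detection with a byte-for-byte fallback on collisions.
    import hashlib

    store = {}  # MD5 hex digest -> list of stored content blobs

    def is_duplicate(content: bytes) -> bool:
        digest = hashlib.md5(content).hexdigest()
        for existing in store.get(digest, []):
            if existing == content:  # byte-for-byte confirmation
                return True
        store.setdefault(digest, []).append(content)
        return False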
md5 should be enough. Yes, there can be collisions, but the chances of that happening are so incredibly small that I wouldn't worry about it unless you were literally tracking many billions of pieces of content.
If you're really afraid of accidental collisions just do both MD5 and SHA1 hashes and compare them. If they both match, it's the same content. If either one differs, it's different content.
Combining algorithms serves only to obfuscate; it does not increase security in a hashing algorithm.
MD5 is too broken to use anyway, IMHO. Researchers have demonstrated forging content that generates an MD5 collision, thereby opening the door to generating a forged CSR to buy a cert from RapidSSL for a domain name they don't own. Security Now! episode 179 explains the process.
For me, SHA-based hashes are stronger and most development platforms support them, so the choice is easy. The remaining deciding factor is then the block size.
A timestamp + md5 together are safe enough.
MD5 is broken and SHA1 is close to it. Use SHA2.
edit
Based on an update from the OP, it doesn't seem that intentional collisions are a serious concern here. For unintentional ones, any decent hash with at least a 64-bit output would be fine.
I would still avoid MD5 and even SHA1, in general, but there's no reason to be dogmatic about it. If the tool fits here, then by all means use it.
