How reliable is the Adler-32 checksum (compared to MD5)?

I wonder how reliable the Adler-32 checksum is, compared to e.g. MD5 checksums. Wikipedia says that Adler-32 is "much less reliable" than MD5, so I wonder how much less, and in what way?
More specifically, I'm wondering whether it is reliable enough as a consistency check for long-term archiving of (tar) files of 20 GB+.

For details on the error-checking capabilities of the Adler-32 checksum, see for example Revisiting Fletcher and Adler Checksums (Maxino, 2006).
This paper analyzes the Hamming distance provided by these two checksums and gives an indication of the residual error rate for data words up to about 2^11 bits, which is obviously far below your requirement of roughly 2^38 bits (20 GB+)...

Adler-32 has an entirely different purpose than MD5. Adler-32 is a checksum; MD5 is a secure message digest. Adler-32 is meant for quick hashing: it has a small bit space and a simple algorithm. Its collision rate is low, but not low enough to be secure. MD5, SHA, and other cryptographic/secure hashes (or message digests) have much larger bit spaces and more complex algorithms, and thus far fewer collisions. Compare SHA-256, for example: 256 bits against Adler-32's measly 32 bits.
Adler does have its purpose, in hash tables for instance, or rapid data integrity checks. Still, it is not designed with the same purpose as MD5 or other secure digests.
BTW, if a simple but somewhat reliable checksum is what you need, then it seems Fletcher out-performs Adler. I'd speculate they both out-perform CRC in speed, though perhaps not a simple addition-based checksum (which, however, is very prone to collisions). If you want BOTH performance AND security, then use BOTH algorithms: use the checksum for a quick calculation and lookup, then use the larger digest for a more thorough confirmation if a match is found.
To answer your question on ensuring the validity of archives, I would say that it would probably suffice just fine. Best choice? Questionable. Possibility of error? Very low.
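As a rough illustration of the "use both" approach above, here is a minimal Python sketch (the function names are my own) that uses zlib's Adler-32 as the quick check and falls back to SHA-256 for confirmation:

    import hashlib
    import zlib

    def quick_checksum(data: bytes) -> int:
        # Fast 32-bit Adler-32 checksum from the standard zlib module.
        return zlib.adler32(data) & 0xFFFFFFFF

    def strong_digest(data: bytes) -> str:
        # Slower, but far more collision-resistant.
        return hashlib.sha256(data).hexdigest()

    def matches(candidate: bytes, known_adler: int, known_sha256: str) -> bool:
        # The cheap checksum rejects most non-matches almost for free...
        if quick_checksum(candidate) != known_adler:
            return False
        # ...and the cryptographic digest confirms the rare checksum matches.
        return strong_digest(candidate) == known_sha256

    data = b"archive contents"
    print(matches(data, quick_checksum(data), strong_digest(data)))  # True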

This is an ancient algorithm; one which, as the Wikipedia page says, "trades accuracy for speed". In short, no, you shouldn't rely on it.
The point is that with multiple corruptions, this checksum might still pass as "okay". Due to the avalanche effect, this is significantly less likely to occur in modern algorithms (even the old MD5).
For today's machines, speed is not so much of a concern, so I'd suggest using a modern algorithm (whichever is current), even for files in the TB range. The insignificant time savings you'd get from an old checksum system are IMHO not enough to balance the significantly increased risk of undetected data corruption - and honestly, 20 GB of files is not so much data these days that you'd need to resort to weak (and I daresay broken) algorithms.

It is less reliable than, say, MD5, and about the same as CRC. Its advantage is speed; its disadvantage shows mostly for short data (a few hundred bytes), where the distribution of hash values does not cover the available 32-bit output well. For big files it is a reasonable choice.
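To make the short-input weakness concrete, here is a small Python experiment (standard zlib only; the sample size is arbitrary) showing that Adler-32 values of 4-byte messages stay confined to a tiny corner of the 32-bit space, while CRC-32 values spread across the full range:

    import os
    import zlib

    samples = [os.urandom(4) for _ in range(100_000)]
    adler_values = [zlib.adler32(s) for s in samples]
    crc_values = [zlib.crc32(s) for s in samples]

    # For 4-byte input the two internal Adler sums stay small, so the combined
    # 32-bit value never gets anywhere near 0xFFFFFFFF.
    print(f"max Adler-32: {max(adler_values):#010x}")
    print(f"max CRC-32:   {max(crc_values):#010x}")
    print(f"distinct Adler-32 values: {len(set(adler_values))}")
    print(f"distinct CRC-32 values:   {len(set(crc_values))}")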

Adler-32 and MD5 are not comparable in this way. MD5 is intended to be a cryptographic checksum for when you want to make sure that a file hasn't been tampered with by an adversary, while Adler-32 (and also CRC, which is comparable to Adler-32) is intended for making sure a file hasn't been tampered with by accident (an integrity checksum).
MD5 is actually considered broken for its cryptographic purposes, and is only useful now as an integrity check when you want more bits for certainty. The only way Adler-32 can be "less reliable" is that it allows potentially more bits to be altered while retaining the same output, which means there is more room for collisions.
This link gives a good discussion as to how using Adler-32 can provide performance benefits for some kinds of code which needs to use cryptographic sums for added certainty. Namely, that you can use the smaller and cheap checksum to see if doing the more expensive MD5/SHA/Whirlpool is worth considering in the event of changed files.
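If you want to see the performance gap for yourself, a quick, unscientific timing of the standard-library implementations on 100 MB of in-memory data looks like this (absolute numbers depend entirely on your machine):

    import hashlib
    import time
    import zlib

    data = bytes(100 * 1024 * 1024)  # 100 MB of zeros is enough for a timing test

    for name, fn in [
        ("adler32", zlib.adler32),
        ("crc32", zlib.crc32),
        ("md5", lambda d: hashlib.md5(d).digest()),
        ("sha256", lambda d: hashlib.sha256(d).digest()),
    ]:
        start = time.perf_counter()
        fn(data)
        print(f"{name:8s} {time.perf_counter() - start:.3f} s")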

Related

64-bit hash/digest in C

I am trying to find out if there is any API in C for calculating a 64 bit hash.
I found out that some people use the top 64 bits of MD5/SHA-1, etc. Is that a good approach?
You could try SipHash in its form as a MAC (which requires key management, though). It is particularly well-suited for short input messages and aims at cryptographic strength. A C implementation is also available.
But if you really care about someone actively messing around with your files, you shouldn't restrict yourself to 64 bits of security. 64 bits can be broken even by brute force today, given enough time and resources. You should use SHA-256 or stronger for that. Or let me state it the other way round, blacklisting broken options: don't use MD5 (or MD-anything for that matter). Use SHA-1 only if you can't use SHA-256 for some reason.
Using a hash also has the advantage that you don't need to manage any keys (as opposed to using a MAC). You should just keep the hashes you compute in a different place than the files you are about to monitor - otherwise somebody tampering with your files can easily tamper with the checksums, too.
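If you do go the MAC route suggested above: Python's standard library does not expose SipHash with a caller-supplied key, so as a stand-in here is a sketch using BLAKE2b in keyed mode with an 8-byte digest, which gives you a short keyed tag (the key must be kept secret and stored away from the monitored files):

    import hashlib
    import hmac
    import secrets

    key = secrets.token_bytes(16)  # keep this secret, and separate from the files

    def tag64(data: bytes, key: bytes) -> bytes:
        # Keyed BLAKE2b with a 64-bit output.
        return hashlib.blake2b(data, digest_size=8, key=key).digest()

    message = b"file contents"
    tag = tag64(message, key)
    print(tag.hex())
    # Verification recomputes the tag and compares in constant time.
    print(hmac.compare_digest(tag, tag64(message, key)))  # True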
Regarding whether truncating hashes is good or bad
In theory, I can't see why it should be wrong to truncate, say, a 160-bit hash value down to 64 bits, regardless of whether you take the most significant bits, the least significant bits, or pick them using any arbitrary pattern. The only reason I can think of why this isn't done more often is efficiency: why bring out the big guns if there are more efficient algorithms for handling the smaller problems?
In what follows, I assume a cryptographically secure hash for this purpose; general-purpose hashes are quite a different topic - they might expose attack surfaces when truncated, for all I know.
But, for a cryptographically secure hash, unless the algorithm is broken, we can assume that its output is indistinguishable from that of a uniformly distributed random variable.
If we truncate this value now, we don't offer any further insight into the inner workings of the algorithm. Still, we do weaken the security by the simple fact that brute-forcing (be it collisions or finding pre-images) now takes less time by laws of probability.
For example, finding a collision for a 64-bit hash takes roughly 2^32 attempts on average, as the birthday paradox tells us. If you truncate your output down to the least significant 32 bits of the original 64-bit hash, then you will find collisions in roughly 2^16 attempts, because you simply ignore the most significant 32 bits and the de facto uniform distribution does the rest - it's as if you had started searching for collisions with a 32-bit value in the first place.
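You can verify that claim empirically; the following Python sketch truncates SHA-256 to 32 bits and searches for a collision among random inputs, which typically succeeds after a number of attempts on the order of 2^16:

    import hashlib
    import os

    seen = {}
    attempts = 0
    while True:
        attempts += 1
        msg = os.urandom(16)
        h32 = hashlib.sha256(msg).digest()[:4]  # keep only 32 of the 256 bits
        if h32 in seen and seen[h32] != msg:
            break
        seen[h32] = msg

    print(f"collision after {attempts} inputs (expected on the order of 2^16)")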
It's a bad idea. Hash function values are always meant to be taken as a whole.
For the implied question of "how to calculate a 64 bit hash": what's your intended use? Remember that 64 bits are too few for a crypto-strength hash function.
Use CRC to protect against random changes.
Use HMAC to protect against an attacker changing your files. HMAC uses a secret key that is necessary to generate and verify the tags. The result of an HMAC is as long as the underlying hash function (e.g. 20 bytes for HMAC-SHA1), but it is frequently truncated; according to NIST SP 800-107 (p. 14), 64-96 bits should be enough for most applications.
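For illustration, a truncated HMAC along those lines can be built with the standard hmac module; here the tag is cut to 96 bits, which is within the range the NIST document mentions (key handling is up to you):

    import hashlib
    import hmac
    import secrets

    key = secrets.token_bytes(32)

    def truncated_hmac(data: bytes, key: bytes, length: int = 12) -> bytes:
        # Full HMAC-SHA-256, then keep the leading 12 bytes (96 bits).
        return hmac.new(key, data, hashlib.sha256).digest()[:length]

    tag = truncated_hmac(b"file contents", key)
    print(len(tag) * 8, tag.hex())  # 96-bit tag
    # Verify by recomputing and comparing in constant time.
    print(hmac.compare_digest(tag, truncated_hmac(b"file contents", key)))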
64 bits is small for a hash, and usually hashes are meant to be taken as a whole.
Now, what do you need these 64 bits for? The answer will depend on the expected usage.
Keep in mind that MD5 is quite broken nowadays and that 64 bits is very low security.
If you just need integrity checking against random changes, then a simple checksum as given in the other answers may be enough.
If you need cryptographic strength to ensure the original content, then 64 bits is too weak. Better to use the full value of an unbroken algorithm, i.e. not MD5. SHA-1 is still okay, but for longer-term security better use SHA-256. Or go even further with HMAC, as mentioned in the other answer.
There is nothing wrong with using the truncated value of a cryptographic hash. In fact, SHA-224/384 are calculated by computing a SHA-256/512 hash with a different initialization vector and then truncating the result. However, this is only valid for cryptographic hashes; it may be a bad idea for normal checksums and table hashes.
Use OpenSSL's API for the calculations (www.openssl.org).
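The truncation described above is easy to do with whatever library you end up with; as a language-neutral illustration (the answers point at OpenSSL's C API), here is the idea sketched in Python:

    import hashlib

    def hash64(data: bytes) -> int:
        # Unkeyed 64-bit digest: SHA-256 truncated to its first 8 bytes.
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    print(hex(hash64(b"some file contents")))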

Is CRC32 really so bad for file integrity check?

Of course MD5 is better than CRC32, SHA-1 is better than MD5, and so on... But they are also much slower than CRC32.
Right now, I am thinking about how to check the consistency of a file being transferred, and CRC32 is the fastest option.
I haven't found anywhere how bad CRC32 is for integrity checks (in other words, how probable it is that CRC32 will fail to detect a malformed file).
Quoting from http://www.mathpages.com/home/kmath458.htm :
So, if we assume that any corruption of our data affects our string in a completely random way, i.e., such that the corrupted string is totally uncorrelated with the original string, then the probability of a corrupted string going undetected is 1/(2^n). This is the basis on which people say a 16-bit CRC has a probability of 1/(2^16) = 1.5E-5 of failing to detect an error in the data, and a 32-bit CRC has a probability of 1/(2^32), which is about 2.3E-10 (less than one in a billion).
My opinion: CRC-32 is more than enough for error detection. It is widely used. However, it is not secure when you want to use it as a "hash function".
Collisions (the same hash output for different data) can occur easily with CRC-32, because CRC-32 uses only 32 bits, compared with other algorithms: MD5 is 128 bits, SHA-1 is 160 bits, and SHA-2 (the SHA-256/512 series) ranges from 224 to 512 bits, depending on the variant you use. Also, no collision has been found for the SHA-2 series so far.
For more information on the mathematics and probability of a collision occurring in your data, look up hash collisions and the birthday paradox problem.
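For the original use case of checking a file being transferred, note that zlib.crc32 accepts a running value, so the checksum can be computed incrementally without loading the file into memory. A minimal sketch (the path is illustrative):

    import zlib

    def file_crc32(path: str, chunk_size: int = 1 << 20) -> int:
        crc = 0
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                crc = zlib.crc32(chunk, crc)  # feed the running value back in
        return crc & 0xFFFFFFFF

    # print(f"{file_crc32('transferred.bin'):08x}")  # compare with the sender's value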

What are the chances of having 2 strings with the same md5 hash?

I read somewhere that md5 is not 100% secure. Hence, the question.
You seem to be asking 2 separate but related questions.
The probability of a random collision is highly dependent on the size of the data that you're working with; the more strings you're hashing, the more likely a collision is to occur. See the first table at Wikipedia: Birthday Attack for exact probabilities. MD5 uses 128 bits, so to achieve a 50% collision probability, you'll need 2.2E19 strings.
However, while random collisions are suitably rare for small data sets, MD5 has been shown to be completely insecure against intentional collisions. According to the Wikipedia article on MD5, a collision attack exists that can be run in seconds on a 2.6 GHz Pentium 4 processor. For security, MD5 is completely broken, and has been considered so since 2005.
If you need to securely hash something, use one of the more modern hashing algorithms, such as SHA-2, SHA-3 (when its development is finished), or Whirlpool.
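The 2.2E19 figure quoted above is just the 50% birthday bound for a 128-bit hash, roughly sqrt(2 ln 2) * 2^64; a one-line check of the arithmetic:

    import math

    n = math.sqrt(2 * math.log(2)) * 2**64  # 50% collision bound for 128 bits
    print(f"{n:.2e}")  # about 2.2e+19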

32-bit checksum algorithm better quality than CRC32?

Are there any 32-bit checksum algorithms with either:
Smaller hash collision probability for input data sizes < 1 KB?
Collision hits with a more uniform distribution?
Both relative to CRC32. I'm practically not counting on the first property, because of the limited storage space of 32 bits. But for the second... it seems there could be improvements.
Any ideas? Thanks. (I need a concrete implementation, preferably in C, but C++/C# or anything to start with is also OK.)
How about MurmurHash? It is said that this hash has a good distribution (it passes chi-square tests) and a good avalanche effect. It is also very fast to compute.
Not for the first criterion. Any well-designed hash function with a 32-bit output has a 1 in 2^32 chance of a collision for any pair of inputs. The second criterion is not very well defined, although there are surely some statistical tests that could be used, and I'm sure someone has done it (chi-square on collision intervals?). As for needing an implementation, I strongly recommend that you not accept any proposed code for a hash function that is not an implementation of a well-known hash, as there is a high risk of security problems or poor performance when rolling your own hash or encryption. A well-known but bad hash function is better than one you designed yourself, even if the latter tests well and has a 'good' collision distribution, simply because the former has more eyeballs on it.
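One way to make the "uniform distribution" criterion testable, along the lines of the chi-square idea mentioned above, is to bucket the 32-bit outputs over random inputs and compare against the uniform expectation. A sketch using CRC-32 as the baseline (bucket and sample counts are arbitrary):

    import os
    import zlib

    BUCKETS = 256
    N = 100_000
    counts = [0] * BUCKETS
    for _ in range(N):
        counts[zlib.crc32(os.urandom(1024)) % BUCKETS] += 1

    expected = N / BUCKETS
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    # For a uniform hash, chi2 should land near the degrees of freedom (255 here).
    print(f"chi-square: {chi2:.1f} (df = {BUCKETS - 1})")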

A suitable hash function to detect data corruption / check for data integrity?

What is the most suitable hash function for file integrity checking (checksums) to detect corruption?
I need to consider the following:
Wide range of file sizes (1 KB to 10 GB+)
Lots of different file types
Large collection of files (+/-100 TB and growing)
Do larger files require larger digest sizes (SHA-1 vs. SHA-512)?
I see that the SHA-family is referred to as cryptographic hash functions. Are they ill-suited for "general purpose" use such as detecting file corruption? Will something like MD5 or Tiger be better?
If malicious tampering is also a concern, will your answer change w.r.t the most suitable hash function?
External libraries are not an option, only what's available on Win XP SP3+.
Naturally performance is also of concern.
(Please excuse my terminology if it is incorrect, my knowledge on this subject is very limited).
Any cryptographic hash function, even a broken one, will be fine for detecting accidental corruption. A given hash function may be defined only for inputs up to some limit, but for all standard hash functions that limit is at least 2^64 bits, i.e. about 2 million terabytes. That's quite large.
File type has no incidence whatsoever. Hash functions operate over sequences of bits (or bytes) regardless of what those bits represent.
Hash function performance is unlikely to be an issue. Even the "slow" hash functions (e.g. SHA-256) will run faster on a typical PC than the hard disk: reading the file will be the bottleneck, not hashing it (a 2.4 GHz PC can hash data with SHA-512 at a speed close to 200 MB/s, using a single core). If hash function performance is an issue, then either your CPU is very feeble, or your disks are fast SSDs (and if you have 100 TB of fast SSD then I am kind of jealous). In that case, some hash functions are somewhat faster than others, MD5 being one of the "fast" functions (but MD4 is faster, and it is simple enough that its code can be included in any application without much hassle).
If malicious tampering is a concern, then this becomes a security issue, and that's more complex. First, you will want to use one of the cryptographically unbroken hash functions, hence SHA-256 or SHA-512, not MD4, MD5 or SHA-1 (the weaknesses found in MD4, MD5 and SHA-1 might not apply to a specific situation, but this is a subtle matter and it is better to play safe). Then, hashing may or may not be sufficient, depending on whether the attacker has access to the hash results. Possibly, you may need to use a MAC, which can be viewed as a kind of keyed hash. HMAC is a standard way of building a MAC out of a hash function; there are other, non-hash-based MACs. Moreover, a MAC uses a secret "symmetric" key, which is not appropriate if you want some people to be able to verify the file integrity without being able to perform silent alterations; in that case, you would have to resort to digital signatures. To be brief, in a security context, you need a thorough security analysis with a clearly defined attack model.
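Since reading the files dominates the cost anyway, hashing them in chunks is the natural approach for large archives. A minimal Python sketch (path and chunk size are illustrative); if tampering is the concern and a secret key is acceptable, the same loop works with hmac.new(key, digestmod=hashlib.sha256) in place of hashlib.sha256():

    import hashlib

    def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)  # hash incrementally; the file never sits in memory
        return h.hexdigest()

    # print(sha256_of_file("archive.tar"))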
