Hash quality of 4-byte prefix of MD5SUM - md5

I have an MD5 sum, but I need a 4-byte hash. Does anybody know whether it is better to use, for example, an XOR of the 4-byte chunks of the MD5, or just its first 4 bytes?
My hunch is that the first 4 bytes should be as good as any other 4 bytes of the MD5 sum, and that further mangling with XOR will not improve the hash. Explanations and pointers as to the "randomness" of parts of an MD5 sum are appreciated.
This is not intended for cryptography, just for lookup in a hash table.

If you just want a 4-byte hash for hash table lookup or similar, then yes, the first four bytes (or any four bytes) are as good as any other combination. In fact, using MD5 is overkill, unless (as you imply) you already have the MD5 hash available for free.
Obviously four bytes are not enough for any kind of cryptographic signature, whatever algorithm you use. But you knew that, right?
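For illustration, here is a minimal sketch in C of the two options discussed, assuming a 16-byte MD5 digest has already been computed elsewhere (the helper names are made up):

    #include <stdint.h>
    #include <string.h>

    /* Option 1: take the first 4 bytes of the digest as the table hash. */
    static uint32_t md5_prefix32(const unsigned char digest[16])
    {
        uint32_t h;
        memcpy(&h, digest, sizeof h);   /* byte order is irrelevant for a table hash */
        return h;
    }

    /* Option 2: XOR the four 4-byte words together.  For a well-mixed digest
     * this is neither better nor worse than the plain prefix above. */
    static uint32_t md5_xor32(const unsigned char digest[16])
    {
        uint32_t w[4];
        memcpy(w, digest, sizeof w);
        return w[0] ^ w[1] ^ w[2] ^ w[3];
    }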

Related

Weird use of md5 for hashing a file - Does it do anything?

For uploading a file to a service, I was calculating the md5 based on the whole content of the file.
I was asked to do it in a different way: take the md5 of the whole file, and also of several more parts: 2% from the start of the file, 2% from 1/3 of the way in, 2% from 2/3 of the way in, and 2% from the end, and then append the file size in bytes at the end.
Apparently this solves hash collisions between files. To me it seems like a waste of time, since you're not increasing the size of the md5. So for a very large number of files, you're still going to have, statistically, the same number of collisions.
Please help me understand the reasoning behind this.
EDIT: we are then hashing the resulting hashes.
A good cryptographically strong hashing algorithm is already designed with the goal to make it infeasible to intentionally find two different pieces of data with the same hash, let alone by accident. Therefore, just hashing the file is sufficient. Extra hashing of parts of the file is pointless.
This may seem unintuitive because obviously there must exist collisions if the length of the hash is shorter than the length of the data. However, it is not feasible to find these collisions, because an MD5 hash is an unpredictable 128-bit number and the number of possible 128-bit numbers (2^128) is mind-boggling. If you could count at a rate of a trillion trillion (1e24) per second, counting through all 128-bit numbers would still take 2^128 / 1e24 seconds, which is about 10 million years. That is probably a good lower bound on the time it would take to find a hash collision by brute force without custom hardware.
That said, this is all assuming that there are no weaknesses in the hashing algorithm that allow you to do better than brute force. MD5 is broken in this regard, so you should not use it if you need to defend against attackers that would try to create collisions. It would be better to use a newer hashing algorithm like SHA-2 or SHA-3. (These also support even larger outputs such as 256 bits.)
Sounds like a dangerous practice, because you're re-hashing without factoring in a lot of the data. The advantage, however, is that by running the extra hashes you effectively wind up with a hash signature consisting of "more bits" (i.e. you end up with multiple MD5 hashes).
If you want to do this - and are in fact okay with having more (larger) hash data to store/compare - you would be MUCH better advised to simply run a different hash function (other than MD5) that is more secure and/or uses a larger number of bits.
MD5 is an "older" algorithm and is known to have cryptographic weaknesses. I'd recommend one of the "SHA" algorithms - like SHA-256 or SHA-512. The advantages are that it is a stronger algorithm, you'd only have to hash the data ONCE, and you'd get more bits than from an MD5 - and since you're running it only once, it would also be faster than hashing multiple pieces with MD5.
Note that the possibility of hash collisions always exists. Even "high end" storage products which use hashes for detection will compare the actual buffers to verify an exact match, even when the two hashes match.
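As a rough sketch of the "hash the data once with a stronger algorithm" suggestion, one way to run SHA-256 over a whole file is OpenSSL's EVP interface (assuming OpenSSL 1.1 or later, linked with -lcrypto; the sha256_file() helper name is just for illustration):

    #include <openssl/evp.h>
    #include <stdio.h>

    /* Hash an entire file with SHA-256; writes 32 bytes into out. */
    static int sha256_file(const char *path, unsigned char out[32])
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;

        EVP_MD_CTX *ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

        unsigned char buf[8192];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            EVP_DigestUpdate(ctx, buf, n);

        unsigned int len = 0;
        EVP_DigestFinal_ex(ctx, out, &len);

        EVP_MD_CTX_free(ctx);
        fclose(f);
        return 0;
    }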

How to generate a file from its MD5 sum

The title says it all. From my understanding, every file has a unique MD5 checksum. Is it possible to reverse-engineer the file from its sum?
For example, let's just say a video's sum was 5. I know, but it's just an example. Could you write a program where you enter 5 and it generates a video?
In other words, instead of generating a sum from a file, you get a file from a sum.
No, it's one-way - otherwise it would be a great method of compression!
To expand on what Jim W said, any hash function is one-way: it is not meant to be easily reversible, and because many different inputs map to the same output, in general there is no inverse at all.
MD5 is a cryptographic hash function, meaning it is intentionally designed to be very difficult to reverse. MD5 in particular is relatively weak: there are known vulnerabilities that make it easy to find collisions, i.e. two files with the same MD5 hash.
Since an MD5 hash is only 128 bits, there are 2^128 different possible MD5 hashes, and while that's a very large number, there are still many, many more files than that in the world (potentially an infinite number, in fact), so it is inevitable that some files will hash to the same value. This, as user2864740 pointed out in a comment, is known as the pigeonhole principle.
A strong cryptographic hash function — like SHA-256 — is one for which it's considered computationally infeasible to reliably find such collisions.

What is the best cryptographic hash function that generates 16-bit hash values in openssl?

I was thinking of just using SHA256 and then using only the first two bytes of the result. Is there anything wrong with this approach?
NOTE: The concern here is not malicious attacks, but to ensure the best possible protection against random bit flips.
Any hash that satisfies the strict avalanche criterion (that is, if any bit is flipped in the input, every bit in the output will be flipped with a probability of 50%) may be used in this way, and that includes every cryptographic hash in common use, including SHA512. There are security implications to using very short hashes, but if they really aren't relevant, as you claim, you're free to select the fastest hash available (probably MD5).
Since short hashes will be particularly vulnerable to the birthday paradox, though, consider using longer hashes anyway. If you're generating so many hashes that 16 bits versus 256 bits is significant, you will run into duplicates even without malicious attackers.
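If truncating SHA-256 to its first two bytes is what you settle on, a minimal sketch using OpenSSL's one-shot EVP_Digest() might look like this (link with -lcrypto; the sha256_16() name is made up). Keep in mind that with only 16 bits the birthday bound says you should expect a duplicate after just a few hundred values (roughly sqrt(2^16)):

    #include <openssl/evp.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 16-bit value taken from the first two bytes of a SHA-256 digest. */
    static uint16_t sha256_16(const void *data, size_t len)
    {
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int mdlen = 0;
        EVP_Digest(data, len, md, &mdlen, EVP_sha256(), NULL);
        return (uint16_t)((md[0] << 8) | md[1]);
    }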

64-bit hash/digest in C

I am trying to find out if there is any API in C for calculating a 64 bit hash.
I found out that some people use the top 64 bits of MD5/SHA1 etc. Is that a good approach?
You could try SipHash in its form as a MAC (which requires key management, though). It is particularly well-suited for short input messages and aims at cryptographic strength. A C implementation is also available.
But if you really care about someone actively messing around with your files, you shouldn't restrict yourself to 64 bits of security. 64 bits can be broken even by brute force today, given enough time and resources. You should use SHA-256 or stronger for that. Or let me state it the other way round, blacklisting broken options: don't use MD5 (or MD-anything for that matter). Use SHA-1 only if you can't use SHA-256 for some reason.
Using a hash also has the advantage that you don't need to manage any keys (opposed to using a MAC). You should just keep the hashes you compute in a different place than the files you are about to monitor - otherwise somebody tampering with your files can easily tamper with the checksum, too.
Regarding whether truncating hashes is good or bad
In theory, I can't see why it should be wrong to truncate a, say, 160-bit hash value down to 64 bits, regardless of whether you take the most significant bits, the least significant bits, or bits picked by any arbitrary pattern. The only reason I can think of why this isn't done more often is efficiency: why bring out the big guns if more efficient algorithms handle the smaller problem?
In what follows, I assume a cryptographically secure hash for this purpose; general-purpose hashes are quite a different topic, and for all I know they might expose attack surfaces when truncated.
But, for a cryptographically secure hash, unless the algorithm is broken, we can assume that its output is indistinguishable from that of a uniformly distributed random variable.
If we truncate this value now, we don't offer any further insight into the inner workings of the algorithm. Still, we do weaken the security by the simple fact that brute-forcing (be it collisions or finding pre-images) now takes less time by laws of probability.
For example, finding a collision for a 64-bit hash takes roughly 2^32 attempts on average, as the birthday paradox tells us. If you truncate your output down to the least significant 32 bits of the original 64-bit hash, then you will find collisions in time roughly 2^16, because you simply ignore the most significant 32 bits and the de facto uniform distribution does the rest - it's as if you had started searching for collisions with a 32-bit value in the first place.
It's a bad idea. Hash function values are always meant to be taken as a whole.
For the implied question of "how to calculate a 64 bit hash": what's your intended use? Remember that 64 bits are too few for a crypto-strength hash function.
Use CRC to protect against random changes.
Use HMAC to protect against an attacker changing your files. HMAC uses a secret key that is necessary to generate and verify the tags. The result of an HMAC is as long as the underlying hash function (e.g. 20 bytes for HMAC-SHA1), but it is frequently truncated: according to NIST SP 800-107 p.14, 64-96 bits should be enough for most applications.
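A hedged sketch of that truncation using OpenSSL's one-shot HMAC() (link with -lcrypto; the hmac_sha256_64() name is just for illustration, and the 64-bit cut follows the NIST guidance cited above):

    #include <openssl/evp.h>
    #include <openssl/hmac.h>
    #include <stdint.h>
    #include <string.h>

    /* HMAC-SHA-256, keeping only the first 8 bytes (64 bits) of the tag. */
    static uint64_t hmac_sha256_64(const void *key, int keylen,
                                   const unsigned char *msg, size_t msglen)
    {
        unsigned char mac[EVP_MAX_MD_SIZE];
        unsigned int maclen = 0;
        HMAC(EVP_sha256(), key, keylen, msg, msglen, mac, &maclen);

        uint64_t tag;
        memcpy(&tag, mac, sizeof tag);   /* truncate: first 64 bits only */
        return tag;
    }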
64 bits is small for a hash, and usually hashes are meant to be taken as a whole.
Now, what do you need these 64 bits for? The answer will depend on the expected usage.
Keep in mind that md5 is quite broken nowadays and 64 bits is very low security.
If you just need integrity checking against random changes, then a simple checksum as given in the other answers may be enough.
If you need cryptographic strength to ensure the original content, then 64 bits is too weak. Better to use the full value of an unbroken algorithm, i.e. not MD5. SHA-1 is still okay, but for longer-term security better to use SHA-256. Or go even further with HMAC, as mentioned in the other answer.
There is nothing wrong with using the truncated value of a cryptographic hash. In fact, SHA224/384 are calculated by calculating a SHA256/512 hash with a different initialization vector and then truncating the result. However, this is only valid for cryptographic hashes. It may be a bad idea for normal checksums and table hashes.
Use OpenSSL's API for the calculations (www.openssl.org).
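Following that advice, a minimal sketch of a 64-bit value taken from the front of a SHA-256 digest, computed with OpenSSL's EVP one-shot call (link with -lcrypto; the sha256_trunc64() name is made up). The bytes are assembled in a fixed order so the result is the same on every platform:

    #include <openssl/evp.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 64-bit hash built from the first 8 digest bytes, big-endian. */
    static uint64_t sha256_trunc64(const void *data, size_t len)
    {
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int mdlen = 0;
        uint64_t h = 0;
        EVP_Digest(data, len, md, &mdlen, EVP_sha256(), NULL);
        for (int i = 0; i < 8; i++)
            h = (h << 8) | md[i];
        return h;
    }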

Univocal hash function for a string 76 chars long

Here's my problem (I'm programming in C):
I have some huge text files containing DNA sequences (each file has something like 65 million rows and a size of about 4-5 GB). In these files there are many duplicates (I don't know how many yet, but there should be many millions of them) and I want to output a file with only distinct values. Each string has a quality value associated with it, so if, e.g., I have 5 equal strings with different quality values, I'll keep the best one and discard the other 4.
Reducing memory requirements and improving speed efficiency as far as I can is VITAL.
My idea was to create a JudyHS array, using a hash function to convert the DNA sequence string (which is 76 letters long and has 7 possible characters) into an integer, in order to reduce memory usage (4 or 8 bytes instead of 76 bytes over many millions of entries would be quite an achievement). This way I could use the integer as an index and store only the best quality value for that index. The problem is that I can't find a hash function that UNIVOCALLY defines such a long string and produces a value that can be stored inside an integer, or even a long long!
My first idea for a hash function was something like the default string hash function in Java: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], but that could produce a maximal value of 8.52*10^59... way too big.
What about doing the same thing and storing the value in a double? Would the computation become a lot slower?
Please note that I'd like a way to UNIVOCALLY define a string, avoiding collisions (or at least they should be extremely rare, because I would have to access the disk at every collision, quite a costly operation...)
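(As an aside on the Java-style polynomial above: computed in an unsigned 64-bit integer it simply wraps around modulo 2^64, so the value never grows too big and no double is needed. A rough sketch, with a made-up function name, is below; like any 64-bit hash of 76-character strings, it still cannot be collision-free.)

    #include <stdint.h>

    /* Java-style polynomial hash reduced modulo 2^64 by the natural
     * wraparound of unsigned 64-bit arithmetic.  Not collision-free,
     * just well spread. */
    static uint64_t poly_hash64(const char *s)
    {
        uint64_t h = 0;
        while (*s)
            h = h * 31u + (unsigned char)*s++;
        return h;
    }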
You have 7^76 possible DNA sequences and want to map them to 2^32 hashes without collisions? Not possible.
You need at least log2(7^76) ≈ 214 bits to do that, which is about 27 bytes.
If you can live with some collisions, I would recommend sticking to CRC32 or md5 instead of reinventing the wheel.
The "simple" way to get a collision-free hash function for N elements is to use a good mixing function (say, a cryptographic hash function) and to truncate the size, so that the hash results live in a space of size at least N^2. Here, you have 65 million rows -- this fits in 26 bits (2^26 is close to 65 million), so 52 bits "ought to be enough".
You can try using a fast cryptographic hash function, even a "broken" one since this is not a security-related problem. MD4, MD5, SHA-1... then truncate the result to the first (or last) 64 bits, store that in a 64-bit integer type. Chances are that you will not get any collision among your 65 million rows; and if you get some, they will be very rare.
For optimized C implementations of hash functions, look up sphlib. Use the provided sph_dec64le() function to "decode" a sequence of 8 bytes into a 64-bit unsigned integer value.
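A rough sketch of that approach, assuming sphlib's conventional sph_md5_init/sph_md5/sph_md5_close interface together with the sph_dec64le() helper mentioned above (check sph_md5.h and sph_types.h for the exact prototypes; the dna_key64() name is made up):

    #include <stddef.h>
    #include <stdint.h>
    #include "sph_md5.h"   /* from sphlib */

    /* 64-bit key for a 76-character DNA read: MD5 the string, keep 8 bytes. */
    static uint64_t dna_key64(const char *seq, size_t len)
    {
        unsigned char digest[16];
        sph_md5_context ctx;

        sph_md5_init(&ctx);
        sph_md5(&ctx, seq, len);
        sph_md5_close(&ctx, digest);

        return (uint64_t)sph_dec64le(digest);   /* first 8 bytes of the digest */
    }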

Resources