If I want a shorter MD5 checksum, should I take the regular one and use the first half or the second half? Or does it even matter?
(obviously it would cease to be an MD5, it would just be a checksum)
It doesn't matter, but I'd think very hard about doing this: you'll greatly increase the chance of a collision (two different bits of data with the same checksum).
Use some sort of CRC hash function if you need a short string. Be aware that it isn't very secure, but it will still be better than taking half of an MD5.
It shouldn't matter. Generally speaking, the entire checksum changes when any input byte changes. But if you can't decide, why not XOR the first half with the second? :-)
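For what it's worth, here is a minimal sketch of that XOR-folding idea in C, assuming you already have the 16-byte digest in hand (e.g. from OpenSSL's MD5()); the function name is made up for this sketch:

    #include <openssl/md5.h>   /* only for MD5_DIGEST_LENGTH (16) */
    #include <stddef.h>

    /* Fold a 16-byte MD5 digest down to 8 bytes by XOR-ing its two halves. */
    void md5_fold(const unsigned char digest[MD5_DIGEST_LENGTH],
                  unsigned char folded[MD5_DIGEST_LENGTH / 2])
    {
        for (size_t i = 0; i < MD5_DIGEST_LENGTH / 2; i++)
            folded[i] = digest[i] ^ digest[i + MD5_DIGEST_LENGTH / 2];
    }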
Has anyone proved or tested whether an MD5 collision can happen for data with a fixed length of 18 bytes?
I.e., can I construct two arrays (18 bytes long) with the same MD5?
Thanks!
Considering that an MD5 digest is only 16 bytes while your inputs are 18 bytes... yes, collisions among 18-byte values are guaranteed to exist: there are 2^144 possible inputs but only 2^128 possible digests, so by the pigeonhole principle some of them must share a digest.
But it's the wrong question to ask. Hashes are by definition prone to collisions. It may even happen with two single byte values. Very unlikely, but possible. If you're using a hash, you must expect collisions to happen. The question is whether this is acceptable for your use case, what implications a collision has for your application, whether you can mitigate that problem, and how likely it is for a collision to happen.
All this together informs your decision whether hashing in general is something you want to use in your situation and/or what hash in particular to choose.
The straightforward way to compare two network-endian u16 or u32 values would be to convert both of them to host endianness and then compare.
But I'm working on a performance-critical program and we have lots of these cases. So I'm wondering: would it help if we just wrote a macro to compare them byte by byte, starting from the MSB? In other words, by adding one extra comparison (for u16) or three extra comparisons (for u32), we can avoid two ntoh calls.
Would it help? Or would it depend on the hardware or the compiler? Is there any better way to do that?
Thanks
PS:
I understand that the extra complexity may not be justified, since the performance gain may be small compared to the whole program. I'm just interested in how the hardware works and how to push it to the extreme :P
I will assume that you only need this code to run on one processor, which most likely will be little endian.
You need 4 compare functions, which you can write as macros. Two that compare the whole word (short or long) when network order matches processor order, and two that compare byte by byte for the other case. It's faster to compare directly than convert and then compare.
If you need individual compares for EQ, LT, GT etc and for signed/unsigned you may need a lot more combinations to get peak performance. I assume you know how to write the code so I won't try.
Naturally having done this you should benchmark the whole thing to make sure it was actually worth it! Unit tests are pretty important too, so not a trivial project.
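For illustration, here is a minimal sketch of the byte-by-byte comparison described above, for unsigned values held in network (big-endian) byte order; the function names are mine:

    #include <stdint.h>
    #include <string.h>

    /* The most significant byte of a network-order value sits at the lowest
     * address, so a lexicographic memcmp over the raw bytes orders unsigned
     * values the same way ntohl()/ntohs() followed by a compare would.
     * Returns <0, 0 or >0, like memcmp.  Equality alone needs no helper at
     * all: a == b gives the same answer in any byte order. */
    static inline int cmp_net_u32(uint32_t net_a, uint32_t net_b)
    {
        return memcmp(&net_a, &net_b, sizeof net_a);
    }

    static inline int cmp_net_u16(uint16_t net_a, uint16_t net_b)
    {
        return memcmp(&net_a, &net_b, sizeof net_a);
    }

Whether this actually beats converting is exactly what the benchmark should decide: compilers usually turn ntohl()/ntohs() into a single byte-swap instruction, and they may optimize a fixed-size memcmp just as aggressively.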
Is it possible to optimise the function:
MD5_Update(&ctx_d, buf, num);
if you know that buf contains only zeros?
Or is this mathematically impossible?
Likewise for SHA1.
If you control the input of the hash function, then you could use a simple count instead of all the zeros, perhaps with some kind of escape. E.g. 000020 in hex could mean 32 zeros. A (very) basic compression scheme like this may be much faster than MD5 or SHA-1.
Obviously this approach will only be faster if it saves you one or more blocks of hash calculations. E.g. it does not matter whether you hash 3 bytes or 16 bytes, as the input will be padded and expanded by the hash function before it is used.
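As a rough sketch of that escape idea (the token format here is invented for illustration), you could feed MD5 a short run-length token instead of the zero bytes themselves; note that this changes what is being hashed, so every party computing the hash must agree on the encoding:

    #include <openssl/md5.h>
    #include <stdint.h>
    #include <string.h>

    /* Feed a 5-byte token (escape marker + run length) to the hash instead
     * of `run_length` literal zero bytes.  The marker value and the byte
     * order of the length are arbitrary choices for this sketch. */
    static void update_zero_run(MD5_CTX *ctx, uint32_t run_length)
    {
        unsigned char token[1 + sizeof run_length];
        token[0] = 0xFF;                            /* escape marker (made up) */
        memcpy(token + 1, &run_length, sizeof run_length);
        MD5_Update(ctx, token, sizeof token);
    }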
Cryptographic hashes are actually supposed to produce significant changes in output for small changes in input, see http://en.wikipedia.org/wiki/Avalanche_effect . It sounds like you're looking for some relationship between some hashed data, and some hashed data pre-padded with zeros. By design this change in your input should produce output that isn't clearly related.
EDIT: To answer your question directly: by design, "a small change in either the key or the plaintext should cause a drastic change in the ciphertext", which means this is meant to be mathematically difficult to do.
You'd probably get some speedup, but it'd be relatively minor. The most important things for high-performance hashing are choosing an optimized implementation and using GPUs (or even FPGAs/ASICs) to exploit parallelism, if that's possible.
There is a known speedup for SHA-1 with a fixed IV and messages that differ only a little. That speedup is around 21%. See "New attack makes some password cracking faster" on Ars Technica.
You might get a similar speedup when you have a completely fixed message but a variable IV. But it'd be a lot of work to implement, especially as a non-expert. Buying additional hardware is probably much cheaper than speeding up your code by a few percent.
If the beginning of your message consists of multiple constant blocks, you can hash them once and cache the intermediate state of the hash function. This might or might not be applicable to your situation.
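If that does apply, a minimal sketch with OpenSSL's MD5_* API could look like this (function names are made up); MD5_CTX is a plain struct, so copying it by assignment is enough to snapshot the intermediate state after the constant prefix:

    #include <openssl/md5.h>
    #include <stddef.h>

    static MD5_CTX prefix_state;    /* state after hashing the constant prefix */

    void prefix_init(const unsigned char *prefix, size_t prefix_len)
    {
        MD5_Init(&prefix_state);
        MD5_Update(&prefix_state, prefix, prefix_len);   /* paid only once */
    }

    void hash_with_prefix(const unsigned char *tail, size_t tail_len,
                          unsigned char digest[MD5_DIGEST_LENGTH])
    {
        MD5_CTX ctx = prefix_state;          /* restore the cached state     */
        MD5_Update(&ctx, tail, tail_len);    /* hash only the variable part  */
        MD5_Final(digest, &ctx);
    }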
Are there any 32-bit checksum algorithms with either:
Smaller hash collision probability for input data sizes < 1 KB?
Collision hits with a more uniform distribution?
Both relative to CRC32. I'm not really counting on the first property, given the limitation of only 32 bits of storage space. But for the second... it seems there could be improvements.
Any ideas? Thanks. (I need a concrete implementation, preferably in C, but C++/C# or anything to start with is also OK.)
How about MurmurHash? It is said that this hash has good distribution (it passes chi-square tests) and a good avalanche effect. It is also very fast to compute.
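If you want something to start from, here is the 32-bit MurmurHash3 variant (x86_32) in C, transcribed from memory of Austin Appleby's public-domain reference; please diff it against the original at https://github.com/aappleby/smhasher before relying on it:

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

    static uint32_t load_le32(const uint8_t *p)        /* portable little-endian load */
    {
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    uint32_t murmur3_32(const void *key, size_t len, uint32_t seed)
    {
        const uint8_t *data = key;
        const size_t nblocks = len / 4;
        const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
        uint32_t h = seed, k;

        for (size_t i = 0; i < nblocks; i++) {         /* body: 4-byte blocks */
            k = load_le32(data + i * 4);
            k *= c1;  k = rotl32(k, 15);  k *= c2;
            h ^= k;   h = rotl32(h, 13);  h = h * 5 + 0xe6546b64;
        }

        const uint8_t *tail = data + nblocks * 4;      /* tail: last 1-3 bytes */
        k = 0;
        switch (len & 3) {
        case 3: k ^= (uint32_t)tail[2] << 16;          /* fall through */
        case 2: k ^= (uint32_t)tail[1] << 8;           /* fall through */
        case 1: k ^= tail[0];
                k *= c1;  k = rotl32(k, 15);  k *= c2;  h ^= k;
        }

        h ^= (uint32_t)len;                            /* finalization / avalanche */
        h ^= h >> 16;  h *= 0x85ebca6b;
        h ^= h >> 13;  h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }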
Not for the first criterion. Any well-designed hash function with a 32-bit output has a 1 in 2^32 chance of a collision for any pair of inputs. The second criterion is not very well defined, although there are surely some statistical tests that could be used, and I'm sure someone has done it (chi-square on collision intervals?).

As for needing an implementation, I strongly recommend that you not accept any proposed code for a hash function that is not an implementation of a well-known hash, as there is a high risk of security problems or poor performance when rolling your own hash or encryption. A well-known but bad hash function is better than one you designed yourself, even if the latter tests well and has a 'good' collision distribution, simply because the former has had more eyeballs on it.
Never mind why I'm doing this -- this is mainly theoretical.
If I were MD5 hashing string representations of integers, how high would I have to count before two of the hashes collide?
This problem (in the generic case) is known as the Birthday Paradox.
The probability of a collision in the generic case can be computed easily. However, in your particular case, you would have to actually compute (and store!) each MD5 to get an exact answer.
EDIT @Scott: not really. The pigeonhole principle (the extreme case of the birthday problem) says that with only 2^128 possible MD5 values, we are certain to have a collision after at most 1 + 2^128 tries. The birthday paradox says that the probability of a collision exceeds 0.5 after roughly sqrt(2 · 2^128 · ln 2) ≈ 2^64 MD5 values.
Given those estimates of the storage requirements, it's up to you to decide whether the problem is worth it. To me it is not.
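As a quick sanity check on that figure, the usual birthday-bound approximation n ≈ sqrt(2·N·ln 2) with N = 2^128 can be evaluated in a couple of lines of C (compile with -lm); it prints roughly 2.2e19, i.e. about 2^64.2:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double N = ldexp(1.0, 128);              /* 2^128 possible digests */
        double n = sqrt(2.0 * N * log(2.0));     /* ~50% collision point   */
        printf("n = %.3g (about 2^%.1f)\n", n, log2(n));
        return 0;
    }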
Apparently, one can base a thesis on this very thing (or similar problems, anyway). I haven't read it, but maybe something in Stevens' thesis will help you (it's apparently linked from the Wikipedia article).
In a perfect world, up to 1 + 2^128. But I doubt MD5 is perfect; I can't give you an exact number, but it is guaranteed to be <= 1 + 2^128.
Here is a scientific way to find an estimate of how high you would have to count.
Take the MD5 hash and cut it down to, say, 4 bits. Count how high you have to go before you hit a collision (and keep going until you've collected, say, 100 collisions, so you get a good average).
Then do the same thing at 8 bits (again, collect many collisions so you can compute an average).
Do it again and again until you have averages for 4, 8, 12 and 16 bits, then see if you can find a trend. Follow that trend up to 128 bits.
You may want to xor all 128 bits to come up with your shorter version. Taking the first or last part may not be the best test.
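Here is a rough sketch of that experiment in C, using OpenSSL's one-shot MD5() and truncating the digest by masking its first bytes (for simplicity; you could XOR-fold the whole digest instead, as suggested above). It stops at the first collision for each width; for a real estimate, repeat with different starting values and average, as described:

    #include <openssl/md5.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hash the decimal strings "0", "1", "2", ... with MD5, keep only the
     * low `bits` bits of the digest, and report how far we count before
     * hitting the first collision. */
    static unsigned long first_collision(unsigned bits)
    {
        size_t slots = (size_t)1 << bits;
        unsigned char *seen = calloc(slots, 1);    /* one flag per truncated value */
        unsigned char digest[MD5_DIGEST_LENGTH];
        char buf[32];
        unsigned long n = 0;

        if (!seen) { perror("calloc"); exit(1); }

        for (;;) {
            int len = snprintf(buf, sizeof buf, "%lu", n);
            MD5((unsigned char *)buf, (size_t)len, digest);

            uint32_t h;                            /* truncate: first 4 bytes, masked */
            memcpy(&h, digest, sizeof h);
            h &= (uint32_t)(slots - 1);

            if (seen[h])
                break;                             /* first collision after n hashes */
            seen[h] = 1;
            n++;
        }
        free(seen);
        return n;
    }

    int main(void)
    {
        for (unsigned bits = 4; bits <= 24; bits += 4)
            printf("%2u-bit truncation: first collision after %lu hashes\n",
                   bits, first_collision(bits));
        return 0;
    }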