Can an MD5 collision happen for input of 18 chars? - md5

Has anyone proved or tested whether an MD5 collision can happen for data with a fixed length of 18 bytes?
i.e. can I construct two arrays (18 bytes long) with the same MD5?
Thanks!

Considering that an MD5 digest is only 16 bytes... yes, by the pigeonhole principle there are far more 18-byte inputs (2^144) than possible digests (2^128), so colliding 18-byte values are guaranteed to exist.
But it's the wrong question to ask. Hashes are by definition prone to collisions. It may even happen with two single byte values. Very unlikely, but possible. If you're using a hash, you must expect collisions to happen. The question is whether this is acceptable for your use case, what implications a collision has for your application, whether you can mitigate that problem, and how likely it is for a collision to happen.
All this together informs your decision whether hashing in general is something you want to use in your situation and/or what hash in particular to choose.

Related

What is the best cryptographic hash function that generates 16-bit hash values in OpenSSL?

I was thinking of just using SHA256 and then using only the first two bytes of the result. Is there anything wrong with this approach?
NOTE: The concern here is not malicious attacks, but to ensure the best possible protection against random bit flips.
Any hash that satisfies the strict avalanche criterion (that is, if any bit is flipped in the input, every bit in the output will be flipped with a probability of 50%) may be used in this way, and that includes every cryptographic hash in common use, including SHA512. There are security implications to using very short hashes, but if they really aren't relevant, as you claim, you're free to select the fastest hash available (probably MD5).
Since short hashes will be particularly vulnerable to the birthday paradox, though, consider using longer hashes anyway. If you're generating so many hashes that 16 bits versus 256 bits is significant, you will run into duplicates even without malicious attackers.
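
As a minimal sketch of the approach discussed above (not from the original answer), this takes the first two bytes of a SHA-256 digest as a 16-bit check value, using OpenSSL's one-shot SHA256() from <openssl/sha.h> (link with -lcrypto):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* 16-bit check value: the first two bytes of the SHA-256 digest. Any two
 * bytes would do for a hash satisfying the strict avalanche criterion. */
static uint16_t hash16(const unsigned char *data, size_t len)
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(data, len, digest);
    return (uint16_t)((digest[0] << 8) | digest[1]);
}

int main(void)
{
    const char *msg = "hello world";
    printf("16-bit check value: 0x%04x\n",
           hash16((const unsigned char *)msg, strlen(msg)));
    return 0;
}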

Optimise updating MD5/SHA1 with streams of zeros

Is it possible to optimise the function:
MD5_Update(&ctx_d, buf, num);
if you know that buf contains only zeros?
Or is this mathematically impossible?
Likewise for SHA1.
If you control the input of the hash function then you could use a simple count instead of all the zeros, maybe using some kind of escape. E.g. 000020 in hex could mean 32 zeros. A (very) basic compression function may be much faster than MD5 or SHA1.
Obviously this solution will only be faster if you save one or more blocks of hash calculations. E.g. it does not matter if you hash 3 bytes or 16 bytes, as the input will be padded and expanded by the hash function before it is used.
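
As an illustrative sketch of the "count instead of zeros" idea (the escape format here is purely hypothetical), a long zero run is replaced by a marker byte plus a 64-bit length before being fed to MD5_Update, using OpenSSL's legacy MD5_* API from <openssl/md5.h>. Note that this yields a different digest than hashing the raw zeros, so it only works if you control both sides of the comparison:

#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

/* Feed a compact record (escape marker + run length) instead of zero_count
 * actual zero bytes. The 0x00 marker and the encoding are hypothetical. */
void update_with_zero_run(MD5_CTX *ctx, uint64_t zero_count)
{
    unsigned char record[9];
    record[0] = 0x00;
    memcpy(record + 1, &zero_count, sizeof zero_count);
    MD5_Update(ctx, record, sizeof record);   /* 9 bytes instead of zero_count bytes */
}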
Cryptographic hashes are actually supposed to produce significant changes in output for small changes in input, see http://en.wikipedia.org/wiki/Avalanche_effect . It sounds like you're looking for some relationship between some hashed data, and some hashed data pre-padded with zeros. By design this change in your input should produce output that isn't clearly related.
EDIT: To answer your question directly, by design "a small change in either the key or the plaintext should cause a drastic change in the ciphertext" which means it's meant to be mathematically difficult to do.
You'd probably get some speedup, but it'd be relatively minor. The most important thing for high-performance hashing is choosing an optimized implementation, and using GPUs (or even FPGAs/ASICs) to exploit parallelism if that's possible.
There is a known speedup for SHA-1 with fixed IV and messages that differ only a little. That speedup is around 21%. See New attack makes some password cracking faster - Ars Technica.
You might get a similar speedup when you have a completely fixed message but a variable IV. But it'd be a lot of work to implement this, especially as a non-expert. Buying additional hardware is probably much cheaper than speeding up your code by a few percent.
If the beginning of your message consists of multiple constant blocks, you can hash them once and cache the intermediate state of the hash function, as sketched below. It might or might not be applicable to your situation.
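
A sketch of that prefix-caching idea, assuming OpenSSL's legacy MD5_* API (<openssl/md5.h>, link with -lcrypto); the MD5_CTX structure can simply be copied by assignment to resume from a saved state:

#include <string.h>
#include <openssl/md5.h>

static MD5_CTX cached;   /* context after absorbing the fixed prefix */

/* Hash the constant prefix exactly once. */
void init_prefix_cache(const unsigned char *prefix, size_t prefix_len)
{
    MD5_Init(&cached);
    MD5_Update(&cached, prefix, prefix_len);
}

/* Hash prefix || tail by resuming from the cached intermediate state. */
void hash_with_prefix(const unsigned char *tail, size_t tail_len,
                      unsigned char out[MD5_DIGEST_LENGTH])
{
    MD5_CTX ctx = cached;
    MD5_Update(&ctx, tail, tail_len);
    MD5_Final(out, &ctx);
}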

Magic Numbers in C

I want to use a magic number as a constant for checking that a memory block has not been violated. Is there a method of "reverse-checking" to bring the signature back into the hexadecimal format of MAGIC_32BIT?
#define MAGIC_32BIT 0x77A5844CU
int signature = (int)MAGIC_32BIT;
Also, I want to use a more creative magic number. Any ideas on generating them, or rules to follow? No offence, but I've heard of Microsoft's 0xB16B00B5 and would like mine to be more human-"readable".
Yes, I found the answer: straightforward checking with macros. cegfault's wiki comment showed that int values can be compared directly, with no casting needed.
#define MAGIC_NUM 0x8BADF00D
#define CHECK_SIG(A) ((A) == MAGIC_NUM)
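
A minimal usage sketch (the struct layout here is assumed, not from the original post): write the magic number into the block when it is set up and verify it later with the macro.

#include <stdio.h>

#define MAGIC_NUM 0x8BADF00D
#define CHECK_SIG(A) ((A) == MAGIC_NUM)

struct guarded_block {
    unsigned int signature;   /* guard value written at allocation time */
    char payload[64];
};

int main(void)
{
    struct guarded_block b = { .signature = MAGIC_NUM };
    if (!CHECK_SIG(b.signature))
        fprintf(stderr, "memory block corrupted\n");
    return 0;
}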
I realize this response may not answer your question, but I still hope that it may be of some help.
Your 'magic number' should really depend upon both your application, and the type of memory corruption you want to detect or are expecting.
I've seen OSes that initialize a task's entire stack with 0xEE--a value that is both easily recognizable and unlikely to be used by most people. This method can also be used to guesstimate the amount of unused stack space by counting the 0xEE bytes. Is it perfect--no; but it is quick, (fairly) cheap and easy to do. One of the benefits to this is that you can sometimes easily identify which bytes are getting corrupted (say, if you have a couple of non-0xEE bytes in a sea of 0xEE bytes). The basic idea should be transferable to other areas; see the sketch below.
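
A rough sketch of that fill-and-count idea (all names here are illustrative): fill the stack area once at task creation, then later count untouched fill bytes to estimate how much of the stack was never used.

#include <stddef.h>
#include <string.h>

#define STACK_FILL 0xEE

/* Fill the whole stack area with the recognizable pattern at task creation. */
void stack_fill(unsigned char *stack, size_t size)
{
    memset(stack, STACK_FILL, size);
}

/* Count untouched fill bytes. Assumes the stack grows downward, so the
 * never-used bytes sit at the low-address end of the area. */
size_t stack_unused(const unsigned char *stack, size_t size)
{
    size_t unused = 0;
    while (unused < size && stack[unused] == STACK_FILL)
        unused++;
    return unused;
}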
You could go the custom route and have a unique magic number per data structure--say a CRC. It's more expensive, but it will be better at detecting whether the data structure has been corrupted or not. It will not tell you where/how/when it was corrupted, but only whether it was or wasn't. This unfortunately would fail your human readable request.
If your memory blocks are large enough, it might be possible and practical to take advantage of the MMU and protect your memory blocks by disabling writes to them by default, and enabling them only for the duration of time you need to make changes. This method would have some write performance penalties, but it can help detect when, where and by whom the corruption is occurring. This completely eliminates the magic number.
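
A POSIX-only sketch of that MMU idea (the original answer doesn't name a specific OS or API): keep the block read-only by default and open a short write window around each legitimate modification. mprotect(2) works on page-aligned, page-sized regions, hence the mmap allocation.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *block = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(block, "initial contents");
    mprotect(block, len, PROT_READ);               /* lock: stray writes now fault */

    mprotect(block, len, PROT_READ | PROT_WRITE);  /* open a write window */
    strcpy(block, "legitimate update");
    mprotect(block, len, PROT_READ);               /* lock again */

    munmap(block, len);
    return 0;
}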
Hope this helps.

64-bit hash/digest in C

I am trying to find out if there is any API in C for calculating a 64 bit hash.
I found out that some people use top 64 bits of MD5/SHA1 etc. Is it a good approach?
You could try SipHash in its form as a MAC (which requires key management, though). It is particularly well-suited for short input messages and aims at cryptographic strength. A C implementation is also available.
But if you really care about someone actively messing around with your files, you shouldn't restrict yourself to 64 bits of security. 64 bits can be broken even by brute force today, given enough time and resources. You should use SHA-256 or stronger for that. Or let me state it the other way round, blacklisting broken options: don't use MD5 (or MD-anything for that matter). Use SHA-1 only if you can't use SHA-256 for some reason.
Using a hash also has the advantage that you don't need to manage any keys (as opposed to using a MAC). You should just keep the hashes you compute in a different place than the files you are about to monitor - otherwise somebody tampering with your files can easily tamper with the checksum, too.
Regarding whether truncating hashes is good or bad
In theory, I can't see why it should be wrong to truncate a, let's say, 160-bit hash value down to 64 bits, regardless of whether you take the most significant bits or the least significant bits or pick them using any arbitrary pattern. The only reason I can think of why this isn't done more often is efficiency - why bring the big guns if there are more efficient algorithms for handling the smaller problems.
In what follows, I assume a cryptographically secure hash for this purpose, general-purpose hashes are quite a different topic - they might expose attack surfaces when truncated for all I know.
But, for a cryptographically secure hash, unless the algorithm is broken, we can assume that its output is indistinguishable from that of a uniformly distributed random variable.
If we truncate this value now, we don't offer any further insight into the inner workings of the algorithm. Still, we do weaken the security by the simple fact that brute-forcing (be it collisions or finding pre-images) now takes less time by laws of probability.
For example, finding a collision for a 64-bit hash takes roughly 2^32 attempts on average - says the birthday paradox. If you truncate your output down to the least significant 32 bits of the original 64-bit hash, then you will find collisions in time roughly 2^16, because you simply ignore the most significant 32 bits and the de-facto uniform distribution does the rest - it's like you started searching for collisions with a 32-bit value in the first place.
It's a bad idea. Hash function values are always meant to be taken as a whole.
For the implied question of "how to calculate a 64 bit hash": what's your intended use? Remember that 64 bits are too few for a crypto-strength hash function.
Use CRC to protect against random changes.
Use HMAC to protect against an attacker changing your files. HMAC uses a secret key that is necessary to generate and verify the tags. The result of an HMAC is as long as the underlying hash function (e.g. 20 bytes for HMAC-SHA1), but it is frequently truncated; e.g. according to NIST SP 800-107 p. 14, 64-96 bits should be enough for most applications. A sketch follows below.
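
A sketch of that truncated-HMAC approach, assuming OpenSSL's HMAC() from <openssl/hmac.h> (link with -lcrypto); the key and message here are illustrative:

#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

int main(void)
{
    const unsigned char key[] = "an illustrative secret key";
    const unsigned char msg[] = "file contents to protect";
    unsigned char tag[EVP_MAX_MD_SIZE];
    unsigned int tag_len = 0;

    /* Full 160-bit HMAC-SHA1 tag... */
    HMAC(EVP_sha1(), key, (int)(sizeof key - 1),
         msg, sizeof msg - 1, tag, &tag_len);

    /* ...truncated: keep only the leading 64 bits. */
    for (int i = 0; i < 8; i++)
        printf("%02x", tag[i]);
    printf("\n");
    return 0;
}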
64 bits is small for a hash and usually, hashes are meant to be taken as a whole.
Now, what do you need these 64 bits for? The answer will depend on the expected usage.
Keep in mind that md5 is quite broken nowadays and 64 bits is very low security.
If you just need integrity checking against random changes, then a simple checksum as given in the other answers may be enough.
If you need cryptographic strength to ensure the original content, then 64 bit is too weak. Better use the full value of an unbroken algorithm, i.e. not MD5. SHA1 is still okay, but for longer term security better use SHA256. Or even go further with HMAC, as mentioned in the other answer.
There is nothing wrong with using the truncated value of a cryptographic hash. In fact, SHA224/384 are calculated by calculating a SHA256/512 hash with a different initialization vector and then truncating the result. However, this is only valid for cryptographic hashes. It may be a bad idea for normal checksums and table hashes.
Use OpenSSL's API for the calculations (www.openssl.org).
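
Along the lines of the answers above, a hedged sketch of obtaining a 64-bit value with OpenSSL (<openssl/sha.h>, link with -lcrypto): compute SHA-256 and keep the first 8 bytes, which is fine for a cryptographic hash as discussed above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* 64-bit hash: the first 8 bytes of the SHA-256 digest, packed big-endian. */
static uint64_t hash64(const unsigned char *data, size_t len)
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    uint64_t h = 0;
    SHA256(data, len, digest);
    for (int i = 0; i < 8; i++)
        h = (h << 8) | digest[i];
    return h;
}

int main(void)
{
    const char *msg = "example input";
    printf("64-bit hash: %016llx\n",
           (unsigned long long)hash64((const unsigned char *)msg, strlen(msg)));
    return 0;
}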

Can I prevent duplicate content using md5?

I would like to prevent duplicate content. I do not want to keep copies of the content, so I decided to keep just the MD5 signatures.
I read that MD5 collisions do happen: different content could result in the same MD5 signature.
Do you think MD5 is enough?
Should I use MD5 and SHA1 together?
People have been able to deliberately produce MD5 collisions under contrived circumstances, but for preventing duplicate content (in the absence of malicious users) it's more than adequate.
Having said that, if you can use SHA-1 (or SHA-2) you should - you'll be fractionally but measurably safer from collisions.
MD5 should be fine, collisions are very rare, but if you're really worried, you can use sha-1 as well.
Though I guess the signatures really aren't that large, so if you have the spare processing cycles and the disk space, you could do both. But if space or speed is limited, I'd just go with one.
Why not simply compare the content byte for byte if there is a hash collision? Hash collisions are very rare, so you're only going to have to do a byte-for-byte check very rarely. That way duplicates will only be detected if the items are actually duplicates.
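
A sketch of that idea (the function name and setup are illustrative, and it assumes the original content is still available for comparison): treat a matching MD5 only as a candidate duplicate and confirm with memcmp, so collision false positives are ruled out.

#include <stdbool.h>
#include <string.h>
#include <openssl/md5.h>

bool is_duplicate(const unsigned char *a, size_t alen,
                  const unsigned char *b, size_t blen)
{
    unsigned char da[MD5_DIGEST_LENGTH], db[MD5_DIGEST_LENGTH];
    MD5(a, alen, da);
    MD5(b, blen, db);
    if (memcmp(da, db, MD5_DIGEST_LENGTH) != 0)
        return false;                  /* digests differ: definitely not duplicates */
    /* Rare case: digests match, so confirm with a full byte-for-byte comparison. */
    return alen == blen && memcmp(a, b, alen) == 0;
}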
md5 should be enough. Yes, there can be collisions, but the chances of that happening are so incredibly small that I wouldn't worry about it unless you were literally tracking many billions of pieces of content.
If you're really afraid of accidental collisions just do both MD5 and SHA1 hashes and compare them. If they both match, it's the same content. If either one differs, it's different content.
Combining algorithms only serves to obfuscate; it does not increase the security of a hashing scheme.
MD5 is too broken to use anyway, IMHO. Researchers have demonstrated forging MD5 hashes: they were able to craft content that generates an MD5 collision, thereby opening the door to generating a forged CSR to buy a cert from RapidSSL for a domain name they don't own. Security Now! episode 179 explains the process.
For me, SHA-based hashes are stronger and most development platforms support it so the choice is easy. The remaining deciding factor is then the block size.
A timestamp + md5 together are safe enough.
MD5 is broken and SHA1 is close to it. Use SHA2.
edit
Based on an update from the OP, it doesn't seem that intentional collisions are a serious concern here. For unintentional ones, any decent hash with at least a 64-bit output would be fine.
I would still avoid MD5 and even SHA1, in general, but there's no reason to be dogmatic about it. If the tool fits here, then by all means use it.

Resources