Recursive MD5 and probability of collision - md5

I wonder if it is 'safe' to hash a bunch of MD5 hash values together to create a new hash or whether this will in any way increase the probability of collisions.
The background: I have a couple of files with dependencies. Each file has an associated hash value which is calculated based on its content. Let's call this the 'single-file' hash value. In addition to this, the file should also have a hash value which includes all the dependent files, the 'multi-file' hash value.
So the question is: can I just take all the single-file MD5 hash values of the dependent files, concatenate them, and then calculate an MD5 over the concatenated values to get the multi-file hash value? Or will this result in an MD5 hash that is more likely to collide than if I concatenated the content of all dependent files together?
Alternatively, could I xor the single-file hash values together to generate a multi-file hash value, or would this likely result in more collisions?

Sounds like you need a Merkle tree.

MD5 has a lot of collision problems, see the MD5 entry on Wikipedia.
However, if you use MD5 not for security but as a unique marker to check dependencies, even hashing concatenated hashes should be pretty safe.
Or, if it's not too late, switch to SHA-1.
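For illustration, here's a minimal sketch of the concatenated-hash idea using Python's hashlib (the function names and the chunked file reading are just assumptions for the example; sorting the dependency list keeps the multi-file hash independent of ordering):

```python
import hashlib

def single_file_hash(path):
    """MD5 of a file's content (the 'single-file' hash)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def multi_file_hash(paths):
    """MD5 over the concatenated single-file hashes (the 'multi-file' hash)."""
    h = hashlib.md5()
    for p in sorted(paths):                      # fixed order -> reproducible result
        h.update(single_file_hash(p).encode("ascii"))
    return h.hexdigest()
```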

I think the risk of a collision is about the same for hashing the concatenated files as for hashing the concatenated file hashes.

Related

Encrypt hash map keys while keeping constant lookup speed

I would like to encrypt the keys and values of a hash map with AES256 CBC, individually.
The challenge is to encrypt the keys while maintaining the constant lookup speed and security (mainly against dictionary attacks).
I read about blind indices, but these need some randomness at creation (salt, nonce), and it is impossible for the lookup function to recreate the nonce when searching. At lookup we would need to know where to fetch the nonce from for a particular key, which in the end would mean being vulnerable elsewhere.
So far, I can only think of two options.
First one would be to just not encrypt keys, although I would prefer to do it.
The second one would be to obtain the blind indices by applying a transformation like
blind_index(key) = encrypt(digest(key))
but the problem here is that you need a unique initialisation vector for each key encryption, which brings us back to the problem described above: the lookup function would need a table of the IVs that were used in order to reconstruct a blind index when searching, which just moves the same problem elsewhere.
For the second approach, my thought was: since I always encrypt unique values (keys are unique and even if they are substrings of one another, e.g. 'awesome' and 'awesome_key', they are hashed before encryption, so they look quite different in their 'hashed & unencrypted' form) I could use a global IV for all encryptions, which can be easily accessible to the lookup function. Since the lookup function requires the encryption key, only the owner will be able to compute the blind index correctly and in the map itself there will be no visible similarities between keys that are similar in plaintext.
The big problem I have with the second approach is that it violates the idea of never using IVs for more than one encryption. I could obfuscate the IV 'to make it more secure,' but that's again a bad idea since IVs are supposed to be plaintext.
More details about the circumstances:
app for mobile
map will be dumped to a file
map will be accessible for lookup through a REST API
Maybe I should use a different algorithm (e.g. ECB)?
Thanks in advance!
This is completely in the realm of Format Preserving Encryption (FPE). However, applying it is hard, and libraries that handle it well are not all that common. FPE takes an amount of bits, or even a range, and returns an encrypted value of the same size or in the same range. This ciphertext is pseudo-random within the given domain as long as the input values are unique (which, for keys in a hash table, they are by definition).
If you can afford to expand your ciphertext compared to the plaintext, then you could also look at SIV modes (AES-SIV or AES-GCM-SIV), which are much easier to handle. These return a byte array, which you could turn into a String, e.g. by using base64 encoding. Otherwise you could wrap the byte array and provide your own equals and hashCode methods. Note that these modes expand your plaintext relatively significantly; they are authenticated modes. Advantage: the IV is calculated from the input, and any change in the input will randomize the ciphertext again.
Finally, you could of course simply use an IV or nonce to produce your ciphertext and prefix it to the value. However, beware that reencryption of changed values using the same IV would be rather dangerous, as you may leak information through repetition. In some modes this could entirely break the confidentiality offered. So you would have to prevent reuse of the IV.
The use of ECB is certainly not recommended for strings. A single block encrypt would work of course if the input is (or can be expanded to) a single block.
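As a rough sketch of the SIV idea, here is a deterministic blind index built with AES-SIV from the pyca/cryptography package (a recent version is assumed; the key handling, encoding and map layout are simplified placeholders, not a vetted design):

```python
import base64
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

key = AESSIV.generate_key(512)      # in practice, load this from secure storage
siv = AESSIV(key)

def blind_index(plain_key: str) -> str:
    # AES-SIV is deterministic: the same plaintext key always produces the
    # same ciphertext, so no per-key IV table is needed for lookups.
    ct = siv.encrypt(plain_key.encode("utf-8"), None)
    return base64.urlsafe_b64encode(ct).decode("ascii")

encrypted_map = {blind_index("awesome_key"): siv.encrypt(b"some value", None)}
value = siv.decrypt(encrypted_map[blind_index("awesome_key")], None)
```

The usual caveat applies: deterministic encryption leaks equality of keys, which is exactly the property that makes the lookup work.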

Does an MD5 have a sufficient hash space to fingerprint files?

I'm looking for speedy but not necessarily good hashing. I know that MD5s can be broken or maliciously crafted, but supposing that I'm not using them for security, and only to ensure that a file is the same as when I initially indexed it, is that adequate?
To what extent can I reasonably use MD5 hashing before I will, on average, have a collision?
I want to store database records as FILE(id,path,size,md5)
Should I be able to make the MD5 unique, or is there not enough entropy for that? If not MD5, which hash will scale to the point where I can, for all intents and purposes, call it unique? Is SHA-1 adequate even if it's slower?
I have a dataset with N~=50,000,000
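A quick birthday-bound estimate for that dataset (the approximation p ≈ n(n−1)/2^(b+1) holds while p is small):

```python
# Approximate probability of at least one accidental collision among
# n random b-bit hash values (birthday bound).
n = 50_000_000
for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    p = n * (n - 1) / 2 ** (bits + 1)
    print(f"{name:8s} ~ {p:.3e}")
# MD5 comes out around 3.7e-24 for 50 million files, i.e. negligible
# as long as nobody is deliberately crafting collisions.
```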

SQL Server hash algorithms

If my input length is less than the hash output length, are there any hashing algorithms that can guarantee no collisions?
I know that by nature a one-way hash can have collisions across multiple inputs due to the lossy nature of hashing, especially when the input size is greater than the output size, but does that still apply with smaller input sizes?
Use a symmetric block cipher with a randomly chosen static key. Encryption can never produce a duplicate because that would prevent unambiguous decryption.
This scheme will force a certain output length, which is a multiple of the cipher block size. If you can make use of a variable-length output, you can use a stream cipher as well.
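A sketch of that idea with AES via the pyca/cryptography package, assuming the input is padded injectively into a single 16-byte block; because AES under a fixed key is a permutation, two distinct padded inputs can never produce the same output:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(32)            # randomly chosen static key; keep it fixed and secret
_cipher = Cipher(algorithms.AES(KEY), modes.ECB())   # ECB is fine for exactly one block

def collision_free_tag(data: bytes) -> bytes:
    """Map inputs of up to 15 bytes to distinct 16-byte outputs."""
    if len(data) > 15:
        raise ValueError("input must fit in a single block after padding")
    # 0x80-then-zeros padding is injective for inputs shorter than one block.
    padded = data + b"\x80" + b"\x00" * (15 - len(data))
    enc = _cipher.encryptor()
    return enc.update(padded) + enc.finalize()
```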
Your question sounds like you're looking for a perfect hash function. The problem with perfect hash functions is they tend to be tailored towards a specific set of data.
The following assumes you're not trying to hide, secure or encrypt the data...
To think of it another way, the easiest way to "generate" a perfect hash function that will accept your inputs is to map the data you want to store to a table and associate those inputs with a surrogate primary key. You then create a unique constraint for the column (or columns) to ensure the input you're mapping only maps to a single surrogate value.
The surrogate key could be int, bigint or a guid. It all depends on how many rows you're looking to store.
If your input lengths are known to be small, such as 32 bits, then you could actually enumerate through all possible inputs and check the resulting hashes for collisions. That's only going to be 4294967296 possible inputs, and it shouldn't take too terribly long to enumerate all of them. Essentially you'd be building a rainbow table to test for collisions.
If there is some security relying on this though, one of the issues is if an attacker knows your input lengths are constrained, it makes it easy for them to also perform the same enumeration to create a map/table that will map hashes back to the original values. "attacker" is a pretty terrible term here though because I have no context of how you are using these hashes and whether you are concerned about being able to reverse them.
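A scaled-down sketch of that enumeration (20-bit inputs instead of 32-bit so it finishes in seconds; the full 32-bit space works the same way, it just needs more time and memory):

```python
import hashlib

INPUT_BITS = 20                      # demo size; bump to 32 for the full check

seen = {}
for i in range(2 ** INPUT_BITS):
    data = i.to_bytes(4, "big")      # fixed-length 4-byte encoding of the input
    digest = hashlib.md5(data).digest()
    if digest in seen:
        print("collision between", seen[digest], "and", data)
        break
    seen[digest] = data
else:
    print("no collisions among", 2 ** INPUT_BITS, "inputs")
```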

Is it okay to use a non-cryptographic hash to fingerprint a block of data?

My problem is this. I have a block of data. Occasionally this block of data is updated and a new changed version appears. I need to detect if the data I am looking at matches the version I am expecting to receive.
I have decided to use a fingerprint so that I can avoid storing the 'expected' version of the data in full. It seems that the 'default' choice for this kind of thing is an MD5 hash.
However MD5 was designed to be cryptographically secure. There are much faster hashing functions. I am looking at modern non-cryptographic functions such as CityHash and SpookyHash.
Since I control all the data in my system I only care about accidental collisions where a changed block of data hashes to the same value. Therefore I don't think I have to worry about the 'attacker-proof' nature of cryptographic hashes and could get away with a simpler hash function.
Are there any problems with using a hash function such as CityHash or SpookyHash for this purpose, or should I just stick with MD5? Or should I be using something specifically designed for fingerprinting such as a Rabin fingerprint?
Yes, it's okay (also take a look at the even faster CRC family of functions). However, I tend to avoid using hashes to differentiate data; serial numbers combined with a date/time value provide a means to determine which version is newer and to detect out-of-sync changes. Fingerprints are used more to detect corrupted files than for versioning.
If you want to compare one set of data with another, then don't use hashes/fingerprints, just compare the data directly. It's faster to compare two streams than it is to take the hashes of two streams and then compare the hashes.
That said, a good quick way to compare lots of files is to take the hashes of each file, then compare the hashes, and when there's a hash match you then compare the raw bytes. The chance of a hash collision is indeed minimal, but it isn't impossible - and I like to absolutely be sure.
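A minimal sketch of that two-step comparison (hash first, confirm byte-for-byte on a match); the helper names and chunk size are arbitrary:

```python
import hashlib
from collections import defaultdict

def file_digest(path, algo="md5", chunk=1 << 16):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.digest()

def same_bytes(a, b, chunk=1 << 16):
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba, bb = fa.read(chunk), fb.read(chunk)
            if ba != bb:
                return False
            if not ba:                # both streams exhausted and equal
                return True

def report_duplicates(paths):
    by_digest = defaultdict(list)
    for p in paths:
        by_digest[file_digest(p)].append(p)
    for group in by_digest.values():
        first = group[0]
        for other in group[1:]:
            if same_bytes(first, other):   # rule out a freak hash collision
                print(f"{other} duplicates {first}")
```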
You may want to use the Rabin hash, which is faster and more collision resilient than cryptographic hashes such as MD5, SHA-1, et al. A Java implementation can be found here. Most large-scale deduplication efforts by web-scale companies utilize the Rabin hash (for example, see Google's efforts led by Henzinger).

How are hash tables implemented internally in popular languages?

Can someone please shed some light on how popular languages like Python and Ruby implement hash tables internally for symbol lookup? Do they use the classic "array with linked list" method, or a balanced tree?
I need a simple (fewer LOC) and fast method for indexing the symbols in a DSL written in C. Was wondering what others have found most efficient and practical.
The classic "array of hash buckets" you mention is used in every implementation I've seen.
One of the most educative versions is the hash implementation in the Tcl language, in the file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, different hash table types, strategies, etc. Side note: the code implementing the Tcl language is really readable.
Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There's also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under "HV". If you're truly adventurous you can dig into hv.c.
The hashing algorithm used to be pretty simple, but it's probably a lot more complicated now with Unicode. Because the algorithm was predictable there was a DoS attack whereby the attacker generated data which would cause hash collisions. For example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash, which then shoved it all into one bucket. The resulting hash was O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result, Perl now perturbs the hash function with a bit of random data.
You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.
As for "most efficient and practical", use someone else's hash library. For god's sake don't write one yourself for production use. There's a hojillion robust and efficient ones out there already.
Lua tables use an utterly ingenious implementation which for arbitrary keys behaves like an 'array of buckets', but if you use consecutive integers as keys it has the same representation and space overhead as an array. In the implementation, each table has a hash part and an array part.
I think this is way cool :-)
Attractive Chaos has a comparison of Hash Table Libraries and an update.
The source code is available and it is in C and C++
Balanced trees sort of defeat the purpose of hash tables since a hash table can provide lookup in (amortized) constant time, whereas the average lookup on a balanced tree is O(log(n)).
Separate chaining (array with linked list) really works quite well if you have enough buckets, and your linked list implementation uses a pooling allocator rather than malloc()ing each node from the heap individually. I've found it to be just about as performant as any other technique when properly tuned, and it is very easy and quick to write. Try starting with 1/8 as many buckets as source data.
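For reference, a bare-bones separate-chaining table in Python (Python lists stand in for the linked lists; a C version for the DSL would typically use a pooled node allocator as described above):

```python
class ChainedHashTable:
    """Array of buckets, each holding (key, value) pairs; the bucket array
    doubles when the load factor passes 0.75."""

    def __init__(self, buckets=8):
        self._buckets = [[] for _ in range(buckets)]
        self._count = 0

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)     # overwrite existing entry
                return
        bucket.append((key, value))
        self._count += 1
        if self._count > 0.75 * len(self._buckets):
            self._resize()

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def _resize(self):
        old, self._buckets = self._buckets, [[] for _ in range(2 * len(self._buckets))]
        for bucket in old:
            for k, v in bucket:
                self._bucket(k).append((k, v))
        # _count is unchanged; entries are just redistributed
```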
You can also use open addressing with quadratic or polynomial probing, as Python does.
If you can read Java, you might want to check out the source code for its various map implementations, in particular HashMap, TreeMap and ConcurrentSkipListMap. The latter two keep the keys ordered.
Java's HashMap uses the standard technique you mention of chaining at each bucket position. It uses fairly weak 32-bit hash codes and stores the keys in the table. The Numerical Recipes authors also give an example (in C) of a hash table essentially structured like Java's but in which (a) you allocate the nodes of the bucket lists from an array, and (b) you use a stronger 64-bit hash code and dispense with storing keys in the table.
What Crashworks meant to say was...
The purpose of hash tables is constant-time lookup, insertion and deletion. In terms of algorithmic complexity, every operation is O(1) amortized.
Whereas if you use a tree, the worst-case operation time will be O(log n) for a balanced tree, where n is the number of nodes. But do we really have hashes implemented as trees?
