MD5 hashing 4-byte and 8-byte keys into 16-byte values; what's the chance of a collision?

I have 2^32 4-byte keys that I'm hashing; what's the chance of collision?
What if I have 2^64 8-byte keys (not really storing every key, but I want to know the worst case)?

Per the Wikipedia page on the Birthday Problem, a good first-order approximation can be found with 1 - e^(-n^2/(2d)), where n is the number of keys and d is the number of possible hash values (2^128 for MD5). Graphing this for your values (logarithmic horizontal axis, zoomed in on where the probability starts to spike) shows that 2^32 keys are nowhere near the spike, while 2^64 keys sit right where it begins. Note that this is only an approximation, and should be considered conservatively (i.e., the real probability may be somewhat higher, but it should be in the right ballpark).
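To put numbers on that approximation, here is a small sketch (my own, not part of the original answer) that evaluates 1 - e^(-n^2/(2d)) for both of your cases; ldexp keeps the 2^128 denominator from overflowing a double, and expm1 keeps 1 - e^(-x) accurate when x is tiny:

    #include <math.h>
    #include <stdio.h>

    /* Birthday-bound approximation p ~ 1 - exp(-n^2 / (2d)) for 2^log2_n keys
     * hashed into a space of d = 2^bits values.  Everything stays a power of
     * two, so the 2^128 denominator never has to be represented directly. */
    static double collision_probability(int log2_n, int bits)
    {
        double x = ldexp(1.0, 2 * log2_n - bits - 1); /* n^2 / (2d) */
        return -expm1(-x);                            /* 1 - e^(-x), accurate for tiny x */
    }

    int main(void)
    {
        printf("2^32 keys, 128-bit hash: p ~ %.3g\n", collision_probability(32, 128));
        printf("2^64 keys, 128-bit hash: p ~ %.3g\n", collision_probability(64, 128));
        return 0;
    }

Built with -lm, this prints roughly 2.7e-20 for the 2^32 case and about 0.39 for the 2^64 case, which is exactly where the curve starts to spike.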

What are you doing with the hash codes? If you're using them to work out whether two pieces of data are the same, an MD5 hash is pretty good, though only if you are working with data that is not being created by malicious entities. (Cryptographic purposes need better hash algorithms precisely in order to deal with the "malicious attacker" problem.)
If you're using them for building a map (i.e., you're building a hash table), it's usually better to use a cheap hash and come up with a way to mitigate the cost of collisions (e.g., by hanging a linked list off each hash bucket and resizing/rebuilding when the average chain length gets too large).
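As a rough illustration of that approach (my own sketch, not anything the answer prescribes), here is a minimal chained table with a cheap djb2-style hash that doubles its bucket array once the load factor passes 0.75; the threshold and the hash are arbitrary choices, and error checks are omitted for brevity:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct node  { char *key; void *value; struct node *next; };
    struct table { struct node **buckets; size_t nbuckets, count; };

    static size_t cheap_hash(const char *s)
    {
        size_t h = 5381;                        /* djb2: cheap but decent */
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h;
    }

    static void table_grow(struct table *t)
    {
        size_t newn = t->nbuckets * 2;
        struct node **nb = calloc(newn, sizeof *nb);
        for (size_t i = 0; i < t->nbuckets; i++) {
            struct node *n = t->buckets[i];
            while (n) {                         /* re-bucket every chained entry */
                struct node *next = n->next;
                size_t b = cheap_hash(n->key) % newn;
                n->next = nb[b];
                nb[b] = n;
                n = next;
            }
        }
        free(t->buckets);
        t->buckets  = nb;
        t->nbuckets = newn;
    }

    static void table_put(struct table *t, const char *key, void *value)
    {
        if ((double)(t->count + 1) / (double)t->nbuckets > 0.75)
            table_grow(t);                      /* keep the average chain short */
        size_t b = cheap_hash(key) % t->nbuckets;
        struct node *n = malloc(sizeof *n);
        n->key   = strdup(key);                 /* strdup is POSIX, not ISO C99 */
        n->value = value;
        n->next  = t->buckets[b];
        t->buckets[b] = n;
        t->count++;
    }

    static void *table_get(const struct table *t, const char *key)
    {
        for (struct node *n = t->buckets[cheap_hash(key) % t->nbuckets]; n; n = n->next)
            if (strcmp(n->key, key) == 0)
                return n->value;
        return NULL;
    }

    int main(void)
    {
        struct table t = { calloc(8, sizeof(struct node *)), 8, 0 };
        table_put(&t, "ip",  "0x0800");
        table_put(&t, "arp", "0x0806");
        printf("ip -> %s\n", (char *)table_get(&t, "ip"));
        return 0;
    }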

Related

Optimize duplicate values in NoSql key-value storage

I am building a maps tile storage, and need to store 1.5 billion ~3KB blobs. Over 95% of them are duplicate. Is there a NoSQL storage engine that would avoid storing identical values?
I could of course implement double de-referencing, e.g. key->hash->value. If the hash is MD5, the 16-byte hashes would use up 24GB for the hashes alone, plus the per-item overhead, which is probably much more. Anything more efficient?
Thanks!
Double de-referencing is the way to go - you'd be saving somewhere between 4-5TB of data by not storing duplicate data, so storing a 24GB set of hashes is worth the cost. Additionally, you only need to compute the hash function on inserts and updates, not on lookups or deletions.
To reduce the cost of double de-referencing on lookups, you can supplement your on-disk key-value database with an in-memory cache, e.g. Redis. You can either cache frequently accessed key->hash pairs to avoid two lookups on the primary database, or directly store the entire key->hash->blob structure in the cache. The former is much simpler to implement because you don't need to replicate the double de-referencing from the primary database, whereas the latter makes more sense if only a small subset of the blobs is ever active.
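If the caching route appeals, the key->hash side of it is only a couple of calls with hiredis; this is just a sketch, and the "tile:" key prefix, the example tile id, and the hex-encoded hash are all made up for illustration:

    #include <stdio.h>
    #include <hiredis/hiredis.h>

    /* Cache key -> hash in Redis so hot tiles resolve their blob hash without
     * touching the primary on-disk store. */
    int main(void)
    {
        redisContext *c = redisConnect("127.0.0.1", 6379);
        if (c == NULL || c->err) {
            fprintf(stderr, "redis connection failed\n");
            return 1;
        }

        const char *tile_key = "12/2147/1363";                    /* example tile id */
        const char *md5_hex  = "9e107d9d372bb6826bd81d3542a419d6"; /* example hash */

        /* populate the cache after a primary-store lookup ... */
        redisReply *r = redisCommand(c, "SET tile:%s %s", tile_key, md5_hex);
        if (r) freeReplyObject(r);

        /* ... and consult it first on subsequent lookups */
        r = redisCommand(c, "GET tile:%s", tile_key);
        if (r && r->type == REDIS_REPLY_STRING)
            printf("cached hash for %s: %s\n", tile_key, r->str);
        if (r) freeReplyObject(r);

        redisFree(c);
        return 0;
    }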
You may be able to use a simpler/smaller hash. The probability of a hash collision is 1 - e^(-k^2 / (2N)), where k is the number of values being hashed and N is the number of possible hash values, so at 1.5 billion values a good 64-bit hash has roughly a 6% chance of at least one collision, while a good 128-bit hash has an infinitesimal chance. MurmurHash has 64- and 128-bit versions, so you can experiment between the two, and it's faster than MD5, largely because MD5 is a cryptographic hash function and Murmur doesn't carry the added expense/complexity of being cryptographically secure (I'm assuming you're not concerned about anybody attempting to intentionally generate hash collisions or anything like that).

Some key-value stores also make it relatively easy to make your design collision-tolerant. For example, you could store the hash in a Riak Map with a flag indicating whether there have been any collisions on that hash value: if false, simply return the blob; otherwise fall back on option 2 (e.g. the indexed blob becomes the two colliding blobs zipped/tarred together along with a CSV recording which keys correspond to which blob). Even with a 64-bit hash this code path will not be exercised very often, so implementation simplicity likely trumps performance; the question is whether the reduced memory/hashing overhead makes up for the complexity of collision tolerance.
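To make the hash-then-verify part of that concrete, here is a toy sketch of the decision logic; FNV-1a stands in for a 64-bit MurmurHash purely to keep the example dependency-free, and the blob contents are placeholders:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* 64-bit FNV-1a; any good 64-bit non-cryptographic hash would do here. */
    static uint64_t hash64(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;              /* FNV prime */
        }
        return h;
    }

    int main(void)
    {
        /* Two tiles with identical bytes should map to one stored blob. */
        const char tile_a[] = "...3KB of tile bytes...";
        const char tile_b[] = "...3KB of tile bytes...";

        uint64_t ha = hash64(tile_a, sizeof tile_a);
        uint64_t hb = hash64(tile_b, sizeof tile_b);

        if (ha == hb && memcmp(tile_a, tile_b, sizeof tile_a) == 0)
            puts("duplicate blob: store key -> hash only, blob already present");
        else if (ha == hb)
            puts("64-bit collision: fall back to the collision-tolerant path");
        else
            puts("new blob: store hash -> blob, then key -> hash");
        return 0;
    }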

How to choose size of hash table?

Suppose I have 200,000 words, and I am going to use hash*33 + word[i] as a hash function. What should the size of the table be to optimize memory use and minimize paging issues?
Platform used - C (C99 version).
The words are English words, ASCII values.
The hash table is initialized once (buckets in linked-list style)
and then used for searching, like a dictionary lookup.
After a collision, the word is added as a new node in the bucket.
A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup. Assuming you have a good hash function.
Based on that, you would want a minimum of about 266,700 buckets (for 75%), or 285,700 buckets for 70%. That's assuming no collisions.
That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.
Long story short: use a better hash function and do some testing at different table sizes.
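In that spirit of testing, a throwaway harness along these lines (my own sketch, seeded with a handful of stand-in words; point it at your real 200,000-word list instead) counts how many insertions land in an already-occupied bucket for the hash from the question versus Jenkins' one-at-a-time hash, at a few candidate table sizes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* The hash from the question: hash = hash*33 + word[i] (the 5381 seed is
     * an assumption; the question doesn't give one). */
    static uint32_t hash33(const char *w)
    {
        uint32_t h = 5381;
        while (*w) h = h * 33 + (unsigned char)*w++;
        return h;
    }

    /* Jenkins one-at-a-time hash, as the "better distribution" comparison. */
    static uint32_t jenkins_oaat(const char *w)
    {
        uint32_t h = 0;
        while (*w) {
            h += (unsigned char)*w++;
            h += h << 10;
            h ^= h >> 6;
        }
        h += h << 3;
        h ^= h >> 11;
        h += h << 15;
        return h;
    }

    /* Count insertions that land in an already-occupied bucket. */
    static size_t collisions(char **words, size_t n, size_t table_size,
                             uint32_t (*hash)(const char *))
    {
        unsigned char *used = calloc(table_size, 1);
        size_t c = 0;
        for (size_t i = 0; i < n; i++) {
            size_t b = hash(words[i]) % table_size;
            if (used[b]) c++; else used[b] = 1;
        }
        free(used);
        return c;
    }

    int main(void)
    {
        char *words[] = { "alpha", "beta", "gamma", "delta", "epsilon" };
        size_t n = sizeof words / sizeof words[0];
        size_t sizes[] = { 266700, 285700, 500000 };

        for (size_t i = 0; i < 3; i++)
            printf("size %zu: hash33 %zu collisions, jenkins %zu collisions\n",
                   sizes[i], collisions(words, n, sizes[i], hash33),
                   collisions(words, n, sizes[i], jenkins_oaat));
        return 0;
    }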
There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.

Can MD5 uniquely identify one hundred million strings?

Given hundreds of millions of unique strings with an average length of a few hundred characters, can MD5 uniquely represent each of them?
Can collisions occur? Security is not a concern, but uniqueness is.
If MD5 distributes its results evenly along the 2^128 space (which it doesn't, but it's pretty close), you could calculate the chance of two values in a collection of size n having a collision. This is often referred to as 'the birthday problem'.
Some of this math may seem confusing, so I'll explain it as best I can.
Let M be the size of the range of MD5 (2^128 since MD5 is a 128 bit hashing function)
Let n be the number of random values in this range (you said 100,000,000)
We can calculate p, the probability of at least one collision, with the birthday-problem approximation:
p ≈ 1 - e^(-n^2 / (2M)) ≈ n^2 / (2M)
Using the values you supplied:
p ≈ (10^8)^2 / (2 × 2^128)
Thanks to Dukeling for providing the answer to the above equation, 1.46E-23, which comes out at 0.0000000000000000000000146. You can read more about the formulae in the Wikipedia article on the birthday problem.
For any hash function whose inputs can be longer than its output, MD5 included, there exist two strings that hash to the same value. So, given any set of unique strings, you can't be sure that no two of them hash to the same value unless you analyse them in depth, or simply hash them all and check.
If you are concerned about an attacker maliciously constructing colliding strings, you cannot use MD5. If that's not an issue, MD5 is most likely good enough for your application with typical failure rates in realistic use cases on the order of one accidental collision per million years.
However, I would suggest picking something even more reliable so you don't have to worry about it. If nothing else, you will always have to defend your decision to use MD5 given that it's "known broken".
For example, you could use RIPEMD-160 or SHA-1 to get 160-bit hashes, or SHA-256 to get 256-bit hashes. None of these has an accidental collision on record (deliberate SHA-1 collisions have since been demonstrated, but only with enormous effort). Accidental collisions are billions of times less likely than failure due to asteroid impact.
The best choice depends on your priorities. What are the consequences of a collision? Do you have to resist malicious attacks? How critical is performance? How critical is hash size? Give us some more details and we can give you better advice.

Building a "sparse" lookup array minimizing memory footprint

Let's say I want to build an array to perform a lookup when parsing network protocols (like an EtherType). Since such an identifier is 2 bytes long, I would end up with a 2^16-cell array if I used direct indexing: this is a real waste, because the array is very likely to be sparse, i.e. to have lots of gaps.
In order to reduce memory usage as much as possible, I would use a perfect hash function generator like CMPH, so that I can map my n identifiers to an n-sized array without any collision. The downside of this approach is that I have to rely on an external, somewhat esoteric library.
I am wondering whether, in my case, there are smarter ways to get a constant-time lookup while keeping memory usage in check; bear in mind that I am interested in indexing 16-bit unsigned numbers and the set size is quite limited.
Thanks
Since you know for a fact that you're dealing with 16-bit values, any lookup algorithm will be a constant-time algorithm, since there are only O(1) different possible values. Consequently, algorithms that on the surface might be slower (for example, linear search, which runs in O(n) for n elements) might actually be useful here.
Barring a perfect hash function, if you want to guarantee fast lookup I would suggest looking into cuckoo hashing, which guarantees worst-case O(1) lookup times and has expected O(1)-time insertion (though you have to be a bit clever with your hash functions). It's really easy to generate hash functions for 16-bit values: pick two random 16-bit multipliers, multiply the high byte and the low byte of the value by them, and add the results; I believe that gives you a good hash function mod any prime number.
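For what it's worth, one member of that hash family might look like the sketch below; the two multipliers are arbitrary constants here, whereas a real cuckoo table would draw fresh random ones whenever an insertion cycle forces a rehash:

    #include <stdint.h>
    #include <stdio.h>

    /* One hash from the family described above: multiply the high and low
     * bytes of the 16-bit key by two (nominally random) 16-bit multipliers,
     * add, and reduce modulo the (prime) table size. */
    static unsigned hash16(uint16_t key, uint16_t a, uint16_t b, unsigned prime)
    {
        unsigned hi = key >> 8, lo = key & 0xFFu;
        return (a * hi + b * lo) % prime;
    }

    int main(void)
    {
        printf("%u\n", hash16(0x0800 /* the IPv4 EtherType */, 40503, 44111, 97));
        return 0;
    }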
Alternatively, if you don't absolutely have to have O(1) lookup and are okay with good expected lookup times, you could also use a standard hash table with open addressing, such as a linear probing hash table or double hashing hash table. Using a smaller array with this sort of hashing scheme could be extremely fast and should be very simple to implement.
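A linear-probing version for 16-bit identifiers really is only a screenful; in this sketch the table size, the EMPTY sentinel, and the identity hash (key mod a prime) are all assumptions you would tune to your own identifier set, and there is no full-table check since the set is small and fixed:

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal open-addressing (linear probing) table mapping 16-bit protocol
     * identifiers to handler indices.  0xFFFF is used as the "empty" marker
     * on the assumption that it is never a real identifier. */
    #define TABLE_SIZE 67
    #define EMPTY      0xFFFF

    struct entry { uint16_t key; int value; };
    static struct entry table[TABLE_SIZE];

    static void table_init(void)
    {
        for (int i = 0; i < TABLE_SIZE; i++) table[i].key = EMPTY;
    }

    static void table_put(uint16_t key, int value)
    {
        int i = key % TABLE_SIZE;
        while (table[i].key != EMPTY && table[i].key != key)
            i = (i + 1) % TABLE_SIZE;           /* probe the next slot */
        table[i].key = key;
        table[i].value = value;
    }

    static int table_get(uint16_t key)
    {
        int i = key % TABLE_SIZE;
        while (table[i].key != EMPTY) {
            if (table[i].key == key) return table[i].value;
            i = (i + 1) % TABLE_SIZE;
        }
        return -1;                              /* not found */
    }

    int main(void)
    {
        table_init();
        table_put(0x0800, 1);   /* IPv4 */
        table_put(0x0806, 2);   /* ARP  */
        table_put(0x86DD, 3);   /* IPv6 */
        printf("0x86DD -> %d\n", table_get(0x86DD));
        return 0;
    }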
For an entirely different approach, if you're storing sparse data and want fast lookup times, an option that might work well for you is a simple balanced binary search tree. For example, the treap data structure is easy to implement and gives expected O(log n) lookups. Since you're dealing with 16-bit values, log n here is about 16 (I think the base of the logarithm is actually a bit different), so lookups should be quite fast. This does introduce a bit of overhead per element, but if you only have a few elements the total overhead should be small. For even less overhead, you might want to look into splay trees, which require only two pointers per element.
Hope this helps!

Specialized hashtable algorithms for dynamic/static/incremental data

I have a number of data sets that follow a key-value pattern, i.e. a string key and a pointer to the data. Right now they are stored in hash tables, each table having an array of slots corresponding to hash keys, with colliding entries forming a linked list under their slot (direct chaining). Everything is implemented in C (and should stay in C), if that matters.
Now, the data is actually 3 slightly different types of data sets:
Some sets can be changed (keys added, removed, replaced, etc.) at will
For some sets data can be added but almost never replaced/removed (i.e. it can happen, but in practice it is very rare)
For some sets the data is added once and then only looked up, it is never changed once the whole set is loaded.
All sets of course have to support lookups as fast as possible, and consume minimal amounts of memory (though lookup speed is more important than size).
So the question is: is there some better hashtable structure/implementation that would suit these specific cases better? I suspect chaining is best for the first case, but I'm not sure about the other two.
If you are using linked lists for each bucket in your hashtable, you have already accepted relatively poor performance on modern CPUs (linked lists have poor locality and therefore poor CPU cache interaction). So I probably wouldn't worry about optimizing the other special cases. However, here are a few tips if you want to continue down the path you are using:
For the 'frequent changes' data set and the 'almost never changes' case, every time you read an item from the hash table, move it to the front of that bucket's linked-list chain. For some even better ideas, the paper Fast and Compact Hash Tables for Integer Keys is a good starting point, even though it focuses on fixed-size keys.
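The move-to-front trick itself is only a few lines once you have a chained bucket; the node layout below is just a stand-in for whatever your tables already use:

    #include <stdio.h>
    #include <string.h>

    struct node { const char *key; int value; struct node *next; };

    /* Move-to-front on lookup: after finding 'key' in a bucket's chain, splice
     * the node to the head so frequently used keys are found on the first hop. */
    static struct node *bucket_find_mtf(struct node **bucket_head, const char *key)
    {
        struct node *prev = NULL, *n = *bucket_head;
        while (n && strcmp(n->key, key) != 0) {
            prev = n;
            n = n->next;
        }
        if (n && prev) {                 /* found, and not already at the front */
            prev->next = n->next;
            n->next = *bucket_head;
            *bucket_head = n;
        }
        return n;
    }

    int main(void)
    {
        struct node c = { "gamma", 3, NULL };
        struct node b = { "beta",  2, &c  };
        struct node a = { "alpha", 1, &b  };
        struct node *head = &a;

        bucket_find_mtf(&head, "beta");          /* "beta" becomes the chain head */
        printf("head of chain: %s\n", head->key);
        return 0;
    }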
For the 'data set never changes' case you should look into perfect hash generators. If you know your keys at compile time, I've had good results with gperf. If your keys are not available until run time, try the C Minimal Perfect Hashing Library (CMPH).
Those sets that are small (tens of elements) might be fastest using a binary or even linear search over the keys stored in sequential memory!
Obviously the key bodies have to be in that sequential memory, or hashes of them. But if you can get that into one or two L1 cache lines, it'll fly.
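For instance, something like this (keys padded to a fixed width so they sit contiguously; the sizes are arbitrary) covers the tiny-set case, and the whole 40-byte key array fits inside a single 64-byte cache line:

    #include <stdio.h>
    #include <string.h>

    /* Small, read-mostly set: keep the key bodies packed in one contiguous
     * array and just scan it linearly. */
    #define NKEYS  5
    #define KEYLEN 8
    static const char keys[NKEYS][KEYLEN] = { "alpha", "beta", "gamma", "delta", "omega" };
    static const int  values[NKEYS]       = { 1, 2, 3, 4, 5 };

    static int lookup(const char *key)
    {
        for (int i = 0; i < NKEYS; i++)
            if (strncmp(keys[i], key, KEYLEN) == 0)
                return values[i];
        return -1;                       /* not found */
    }

    int main(void)
    {
        printf("gamma -> %d\n", lookup("gamma"));
        return 0;
    }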
As for the bigger tables, direct chaining might lose out to open addressing?
You could explore "cache conscious" hash tables and tries.
The Wikipedia article discusses cache lines in detail, describing the various trade-offs to consider.
