Can MD5 uniquely identify one hundred million strings?

Given hundreds of millions of unique strings with an average length of a few hundred characters, can MD5 uniquely represent each of them?
Can collisions occur? Security is not a concern here, but uniqueness is.

If MD5 distributes its results evenly across the 2^128 space (which it doesn't exactly, but it's pretty close), you can calculate the chance of two values in a collection of size n colliding. This is often referred to as 'the birthday problem'.
Some of this math may seem confusing, so I'll explain it as simply as possible.
Let M be the size of the range of MD5 (2^128, since MD5 is a 128-bit hash function)
Let n be the number of random values in this range (you said 100,000,000)
We can approximate p, the probability of at least one collision, with:

    p ≈ 1 - e^(-n(n-1)/(2M))

Using the values you supplied:

    p ≈ 1 - e^(-(10^8 × (10^8 - 1)) / (2 × 2^128)) ≈ 1.47 × 10^-23

Thanks to Dukeling for evaluating the expression, which comes out at 0.0000000000000000000000147. You can read more about the formula on the Wikipedia page for the birthday problem.
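If you want to check the arithmetic yourself, here is a minimal C sketch of the approximation above (the function name is mine; note the use of expm1(), since 1 - e^(-x) rounds to 0 in plain double arithmetic for an x this small):

    #include <math.h>
    #include <stdio.h>

    /* Birthday-bound approximation p ~= 1 - e^(-n(n-1)/(2M)), M = 2^bits. */
    static double collision_probability(double n, int bits)
    {
        double m = ldexp(1.0, bits);                   /* M = 2^bits */
        return -expm1(-(n * (n - 1.0)) / (2.0 * m));   /* 1 - e^(-x) */
    }

    int main(void)
    {
        /* 100,000,000 values into MD5's 128-bit space: prints ~1.47e-23 */
        printf("%.3g\n", collision_probability(1e8, 128));
        return 0;
    }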

For any hash function, MD5 included, there exist two distinct strings that hash to the same value: there are more possible strings than hash values, so by the pigeonhole principle some inputs must collide. Given any set of unique strings, you therefore can't be sure no two of them hash to the same value unless you analyse them in depth, or hash them all and check.

If you are concerned about an attacker maliciously constructing colliding strings, you cannot use MD5. If that's not an issue, MD5 is most likely good enough for your application, with typical accidental failure rates in realistic use cases on the order of one collision per million years.
However, I would suggest picking something even more reliable so you don't have to worry about it. If nothing else, you will always have to defend your decision to use MD5 given that it's "known broken".
For example, you could use RIPEMD-160 or SHA-1 to get 160-bit hashes, or SHA-256 to get 256-bit hashes. (Note that collisions have since been demonstrated for SHA-1 as well, so of these, SHA-256 is the one with no known collisions despite considerable effort to find them.) Accidental collisions with hashes this wide are billions of times less likely than failure due to an asteroid impact.
The best choice depends on your priorities. What are the consequences of a collision? Do you have to resist malicious attacks? How critical is performance? How critical is hash size? Give us some more details and we can give you better advice.

Related

Size of the hash table

Let the size of the hash table be static (I set it once). I want to set it according to the number of entries. Searching yielded that the size should be a prime number equal to 2*N (the closest prime to 2*N, I guess), where N is the number of entries.
For simplicity, assume that the hash table will not accept any new entries and won't delete any.
The number of entries will be 200, 2000, 20000 and 2000000.
However, setting the size to 2*N seems like too much to me. Isn't it? Why? If it is, what size should I pick?
I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.
I'm using C, and I want to build my own structure to educate myself.
the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.
It certainly shouldn't be taken as a hard rule. This recommendation probably just reflects that a load factor of 0.5 is a good tradeoff, at least by default.
As for the primality of the size, it depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic hashing), others don't, and those can benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when the closest "available" table sizes differ by a factor of 2, the hash table's memory usage can be hard to control. So, even when using linear probing or separate chaining, you may want a non-power-of-2 size. In that case, in turn, it's worth choosing a prime size, because:
If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the memory usage unpredictability implied by a power-of-2 size), the table slot computation (modulo by the table size) can be combined with the hashing. See this answer for more.
The point that a power-of-2 table size is undesirable when the hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size: a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by a good avalanching function, e.g. from MurmurHash3.
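To make that concrete, here is a small C sketch. fmix64 is MurmurHash3's actual 64-bit finalizer; the slot helper and its assumption that table_size is a power of two are mine:

    #include <stddef.h>
    #include <stdint.h>

    /* MurmurHash3's 64-bit finalizer ("fmix64"): a handful of multiplies
       and shifts that avalanche the bits of a weak hash. */
    static inline uint64_t fmix64(uint64_t h)
    {
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }

    /* With a power-of-two table size, the slot is a single AND --
       cheaper than the integer division a prime size would require. */
    static inline size_t slot(uint64_t weak_hash, size_t table_size)
    {
        return (size_t)(fmix64(weak_hash) & (table_size - 1));
    }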
The entries will be 200, 2000, 20000 and 2000000.
I don't understand what you meant by this.
However, setting the size to 2*N seems too much to me. It isn't? Why? If it is, which is the size I should pick?
The general rule is called the space-time tradeoff: the more memory you allocate for the hash table, the faster the hash table operates. Here you can find some charts illustrating this. So, if you think that by assigning a table size of ~2*N you would waste memory, you can freely choose a smaller size, but be ready for operations on the hash table to become slower on average.
I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.
It's impossible to avoid collisions completely (remember the birthday paradox? :) A certain rate of collisions is an ordinary situation. That rate only affects the average operation speed; see the previous section.
The answer to your question depends somewhat on the quality of your hash function. If you have a good quality hash function (i.e. one where on average, the bits of the hash code will be "distributed evenly"), then:
the necessity to have a prime number of buckets disappears;
you can expect the number of items per bucket to be Poisson distributed.
So firstly, the advice to use a prime number of buckets is essentially a kludge to help alleviate situations where you have a poor hash function. Provided that you have a good-quality hash function, there aren't really any constraints per se on the number of buckets, and one common choice is to use a power of two so that the modulo is just a bitwise AND (though either way, it's not crucial nowadays). A good hash table implementation will include a secondary hash to try to alleviate the situation where the original hash function is of poor quality; see the source code of Java's HashTable for an example.
A common load factor is 0.75 (i.e. you have 100 buckets for every 75 entries). With a good hash function, this translates to just under half of the buckets being empty and about a third holding exactly one entry, so lookups are fast, though of course some space is wasted. What the "correct" load factor is for you depends on the time/space tradeoff that you want to make.
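To put numbers on the Poisson claim, a minimal C sketch, assuming a load factor of 0.75 and a good hash function:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double lambda = 0.75;        /* load factor = expected entries per bucket */
        double p = exp(-lambda);     /* Poisson P(0): probability a bucket is empty */

        for (int k = 0; k <= 3; k++) {
            printf("P(%d entries in a bucket) = %.3f\n", k, p);
            p *= lambda / (k + 1);   /* next Poisson term: P(k+1) */
        }
        return 0;
    }

This prints roughly 0.472, 0.354, 0.133 and 0.033: just under half the buckets empty, about a third with exactly one entry, and the rest holding the remainder.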
In very high-performance applications, a potential design consideration is also how you actually organise the structure/buckets in memory to maximise CPU cache performance. (The answer to what is the "best" structure is essentially "the one that performs best in your experiments with your data".)

How to choose size of hash table?

Suppose I have 200,000 words, and I am going to use hash*33 + word[i] as a hash function. What should the size of the table be, for optimization and to minimize memory/paging issues?
Platform used - C (C99 version);
the words are English words, ASCII values;
one-time initialization of the hash table (buckets in linked-list style);
the table is then used for searching, like a dictionary lookup;
after a collision, the colliding word is added as a new node in the bucket.
A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup, assuming you have a good hash function.
Based on that, you would want a minimum of about 266,700 buckets (for 75%) or 285,700 buckets (for 70%). That's assuming no collisions.
That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.
Long story short: use a better hash function and do some testing at different table sizes.
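As a sketch of such a test in C (hash33 is the question's hash*33 + word[i] with the commonly used 5381 seed; words stands for your array of 200,000 strings and is assumed to be loaded elsewhere; allocation checks omitted):

    #include <stdint.h>
    #include <stdlib.h>

    /* The question's hash: h = h*33 + c. */
    static uint32_t hash33(const char *s)
    {
        uint32_t h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Count distinct occupied slots for one candidate table size;
       the number of collisions is n minus the value returned. */
    static size_t occupied_slots(char **words, size_t n, size_t table_size)
    {
        unsigned char *used = calloc(table_size, 1);
        size_t occupied = 0;

        for (size_t i = 0; i < n; i++) {
            size_t slot = hash33(words[i]) % table_size;
            if (!used[slot]) {
                used[slot] = 1;
                occupied++;
            }
        }
        free(used);
        return occupied;
    }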
There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.

Expected performance of tries vs bucket arrays with constant load-factor

I know that I can simply use a bucket array for an associative container if I have uniformly distributed integer keys, or keys that can be mapped into uniformly distributed integers. If I can make the array big enough to ensure a certain load factor (which assumes the collection is not too dynamic), then the expected number of collisions for a key will be bounded, because this is simply a hash table with the identity hash function.
Edit: I view strings as equivalent to positional fractions in the range [0..1], so they can be mapped into any integer range by multiplication and taking the floor of the result.
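For illustration, a sketch of that mapping in C (my own naming; the 128-bit multiply is a GCC/Clang extension):

    #include <stdint.h>
    #include <string.h>

    /* Treat the first 8 bytes of the key as a base-256 fraction in [0,1)
       and scale it to [0, nbuckets). Keys sharing a long prefix land in
       nearby buckets, which is what makes prefix queries possible. */
    static size_t bucket_of(const char *key, size_t nbuckets)
    {
        uint64_t frac = 0;
        size_t len = strlen(key);

        for (size_t i = 0; i < 8; i++)
            frac = (frac << 8) | (i < len ? (unsigned char)key[i] : 0);

        /* floor(frac / 2^64 * nbuckets); the 128-bit product avoids overflow */
        return (size_t)(((unsigned __int128)frac * nbuckets) >> 64);
    }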
I can also do prefix queries efficiently, just like with tries. I presume (without knowing a proof) that the expected number of empty slots corresponding to a given prefix, which have to be skipped sequentially before the first bucket with at least one element is reached, is also bounded by a constant (again depending on the chosen load factor).
And of course, I can do stabbing queries in worst-case constant time, and range queries in solely output sensitive linear expected time (if the conjecture of denseness from the previous paragraph is indeed true).
What are the advantages of tries, then?
If the distribution is uniform, I don't see anything that tries do better. But I may be wrong.
If the distribution has a large uncompensated skew (because we have no prior probabilities, or we are just looking at the worst case), the bucket array performs poorly, but tries also become heavily imbalanced and can have linear worst-case performance with strings of arbitrary length. So the use of either structure for such data is questionable.
So my question is - what are the performance advantages of tries over bucket arrays that can be formally demonstrated? What kind of distributions elicit those advantages?
I was thinking of distributions with self-similar structure at different scales. I believe those are called fractal distributions, of which I confess to know nothing. Maybe then, if the distribution is prone to clustering at every scale, tries can provide superior performance by keeping the load factor of each node similar, adding levels in dense regions as necessary - something that bucket arrays cannot do.
Thanks
Tries are good if your strings share common prefixes. In that case, the prefix is stored only once and can be queried with linear performance in the output string length. In a bucket array, all strings with the same prefixes would end up close together in your key space, so you have very skewed load where most buckets are empty and some are huge.
More generally, tries are also good if particular patterns (e.g. the letters t and h together) occur often. If there are many such patterns, the order of the trie's tree nodes will typically be small, and little storage is wasted.
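As a minimal illustration of that prefix sharing, a sketch of a trie node in C (lowercase ASCII keys only; allocation checks omitted). Each key's characters walk or create a path of nodes, so two keys with a common prefix share the nodes for that prefix:

    #include <stdlib.h>

    /* One trie node per prefix; a shared prefix is stored only once. */
    struct trie_node {
        struct trie_node *child[26];   /* one slot per letter 'a'..'z' */
        int is_end;                    /* nonzero if a key ends here */
    };

    static void trie_insert(struct trie_node *root, const char *key)
    {
        struct trie_node *n = root;
        for (; *key; key++) {
            int i = *key - 'a';
            if (!n->child[i])
                n->child[i] = calloc(1, sizeof *n);
            n = n->child[i];
        }
        n->is_end = 1;
    }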
One of the advantages of tries I can think of is insertion. A bucket array may need to be resized at some point, which is an expensive operation, so the worst-case insertion time into a trie is much better than into a bucket array.
Another thing is that you need to map each string to a fraction to use it with a bucket array. So if you have short keys, a trie can in theory be more efficient, because you don't need to do the mapping.

MD5 hashing 4-byte and 8-byte keys into 16-byte values; what's the chance of a collision?

I have 2^32 4-byte keys that I'm hashing; what's the chance of a collision?
What if I have 2^64 8-byte keys (I'm not really storing every key, but I want to know the worst case)?
Per the Wikipedia page on the Birthday Problem, a good first-order approximation is p ≈ 1 - e^(-n^2/(2d)), where n is the number of keys and d is the size of the hash space. For your values, with d = 2^128: n = 2^32 gives p ≈ 2^64/(2 × 2^128) = 2^-65 ≈ 2.7 × 10^-20, while n = 2^64 gives p ≈ 1 - e^(-1/2) ≈ 0.39, so in the second case a collision is quite plausible. Note that this is only an approximation and should be treated conservatively (i.e. the real probability may be somewhat higher, but it should be in the right ballpark).
What are you doing with the hash codes? If you're using them to work out whether two pieces of data are the same, an MD5 hash is pretty good, though only if you are working with data that is not being created by malicious entities. (Cryptographic purposes need better hash algorithms precisely in order to deal with the "malicious attacker" problem.)
If you're using them for building a map (i.e., you're building a hash table), it's usually better to use a cheap hash and come up with a way to mitigate the cost of collisions (e.g., by hanging a linked list off each hash table slot and resizing/rebuilding when the average load gets too large).
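A minimal C sketch of that design (the names are mine, and the resize/rebuild step is left out):

    #include <stddef.h>
    #include <stdint.h>

    /* Cheap hash plus chaining: each slot heads a singly linked list,
       and colliding entries simply join the chain. */
    struct entry {
        const char   *key;
        void         *value;
        struct entry *next;
    };

    struct table {
        struct entry **slots;    /* array of nslots chain heads */
        size_t         nslots;   /* power of two, so slot = hash & (nslots-1) */
        size_t         count;    /* rebuild with more slots when count/nslots
                                    (the load factor) gets too large */
    };

    static void table_insert(struct table *t, struct entry *e, uint64_t hash)
    {
        size_t i = (size_t)(hash & (t->nslots - 1));
        e->next = t->slots[i];   /* push onto this slot's chain */
        t->slots[i] = e;
        t->count++;
    }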

Is it faster to search for a large string in a DB by its hashcode?

If I need to retrieve a large string from a DB, is it faster to search for it using the string itself, or would I gain by hashing the string, storing the hash in the DB as well, and then searching based on that?
If so, what hash algorithm should I use (security is not an issue; I am looking for performance)?
If it matters: I am using C# and MSSQL2005
In general: probably not, assuming the column is indexed. Database servers are designed to do such lookups quickly and efficiently. Some databases (e.g. Oracle) provide options to build indexes based on hashing.
However, in the end this can be only answered by performance testing with representative (of your requirements) data and usage patterns.
I'd be surprised if this offered huge improvement and I would recommend not using your own performance optimisations for a DB search.
If you use a database index, there is scope for performance to be tuned by a DBA using tried and trusted methods. Hard-coding your own index optimisation will prevent this and may stop you gaining from any indexing performance improvements in future versions of the DB.
Though I've never done it, it sounds like this would work in principle. There's a chance you may get false positives but that's probably quite slim.
I'd go with a fast algorithm such as MD5 as you don't want to spend longer hashing the string than it would have taken you to just search for it.
The final thing I can say is that you'll only know if it is better if you try it out and measure the performance.
Are you doing an equality match, or a containment match? For an equality match, you should let the DB handle this (but add a non-clustered index) and just test via WHERE table.Foo = @foo. For a containment match, you should perhaps look at a full-text index.
First - MEASURE it. That is the only way to tell for sure.
Second - If you don't have an issue with the speed of the string searching, then keep it simple and don't use a Hash.
However, for your actual question (and just because it is an interesting thought): it depends on how similar the strings are. Remember that the DB engine doesn't need to compare all the characters in a string, only enough to find a difference. If you are looking through 10 million strings that all start with the same 300 characters, then the hash will almost certainly be faster. If, however, you are looking for the only string that starts with an x, then the plain string comparison could be faster. I think, though, that SQL will still have to read the entire string from disk even if it then only uses the first byte (or the first few bytes for multi-byte characters), so the total string length will still have an impact.
If you are trying the hash comparison, you should make the hash an indexed computed column. It will not be faster if you are working out the hashes for all the strings each time you run a query!
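For example, a T-SQL sketch of that computed-column approach (table and column names are hypothetical; note that SQL Server 2005's HASHBYTES only reads the first 8,000 bytes of its input):

    -- Persisted, indexable 16-byte MD5 of the big string column.
    ALTER TABLE dbo.Documents
        ADD BodyHash AS CAST(HASHBYTES('MD5', Body) AS binary(16)) PERSISTED;

    CREATE INDEX IX_Documents_BodyHash ON dbo.Documents (BodyHash);

    -- Seek on the 16-byte hash, then re-check the string itself to
    -- rule out the (very unlikely) hash collision.
    SELECT *
    FROM dbo.Documents
    WHERE BodyHash = CAST(HASHBYTES('MD5', @needle) AS binary(16))
      AND Body = @needle;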
You could also consider using SQL Server's CHECKSUM function. It produces an int, which will be even quicker to compare and is faster to calculate. But you will have to double-check the results of this query by actually testing the string values, because CHECKSUM is not designed for this sort of usage and is much more likely to return duplicate values. You will need to do the CHECKSUM or hash check in one query, then have an outer query that compares the strings. You will also want to watch the generated query execution plan to make sure the optimiser is processing the query in the order you intended; it might decide to do the string comparisons first and the CHECKSUM or hash checks second.
As someone else has pointed out, this is only any good if you are doing an exact match. A hash can't help if you are trying to do any sort of range or partial match.
If your strings are short (fewer than 100 characters in general), searching on the strings will be faster.
If the strings are large, a hash search may, and most probably will, be faster.
HashBytes with MD4 seems to be the fastest in DML.
If you use a fixed length field and an index it will probably be faster...
TIP: if you are going to store the hash in the database, an MD5 hash is always 16 bytes, so it can be saved in a uniqueidentifier column (and a System.Guid in .NET)
This might offer some performance gain over saving hashes in a different way (I use this method to check for binary/ntext field changes but not for strings/nvarchars).
The 'ideal' answer is definitely yes.
String matching against an indexed column will always be slower than matching a hash value stored in an indexed column. This is what hash values are designed for: they take a large dataset (e.g. 3,000 comparison points, one per character) and coalesce it into a smaller one (e.g. 16 comparison points, one per byte).
So, the most optimized string comparison tool will be slower than the optimized hash value comparison.
However, as has been noted, implementing your own optimized hashing scheme is risky and likely not to go well (I've tried and failed miserably). Hash collisions are not a particular problem, because then you just fall back on the string matching algorithm, which means the approach would be (at worst) exactly as fast as your string comparison method.
But this is all assuming that your hashing is done in an optimal fashion (which it probably won't be), that there will not be any bugs in your hashing component (which there will be), and that the performance increase will be worth the effort (probably not). String comparison algorithms, especially on indexed columns, are already pretty fast, and the hashing effort (programmer time) is likely to be much higher than your possible gain.
And if you want to know about performance, Just Measure It.
I am confused and am probably misunderstanding your question.
If you already have the string (thus you can compute the hash), why do you need to retrieve it?
Do you use a large string as the key for something perhaps?
