Too many collisions in hash function - c

I was trying to hash about 64 million unique unsigned 64-bit integers into 128 million buckets (a 27-bit wide address). I tried Bob Jenkins's HashLittle and MurmurHash (both of these give 32-bit hashes, which I masked to obtain a 27-bit address). In both cases this resulted in about 22% collisions, and in the end only 37% of the buckets were occupied. Is this expected, or am I doing something wrong? I was expecting far fewer collisions and better occupation of the buckets.

It looks slightly worse than I would expect at random, using an approximation based on the Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution). If the expected number of entries in a bucket is 1/2, then the probability of 0 entries in a bucket is about exp(-0.5) = 0.607, and the probability of exactly 1 entry is about half that, or 0.303. This leaves a probability of about 0.09 that a bucket has two or more entries.
Are your integers all unique? If not, are you counting duplicate values as causing a hash collision?
In favourable circumstances, you can choose a hash function so as to get FEWER collisions than you would expect at random. Sometimes hash(x) = x % p, where p is a prime, will achieve this.

If you want to get "random but repeatable" results - which have the best worst-case collision rates even for deliberately difficult inputs* - you can simply create a table like:
uint32_t r[8][256];
Populate it using 8 KB of random data - you can find a website with random data to download, and reformat it for inclusion in your source or for loading from a file at runtime.
(*) - as long as the inputs aren't created by someone malicious who knows your random data too.
Then hash like this:
uint32_t hash(uint64_t n)
{
    /* Treat the 64-bit key as 8 bytes; each byte indexes its own table of
       random 32-bit values, and the eight lookups are XORed together. */
    unsigned char* p = (unsigned char*)&n;
    return r[0][p[0]] ^ r[1][p[1]] ^ r[2][p[2]] ^ r[3][p[3]] ^
           r[4][p[4]] ^ r[5][p[5]] ^ r[6][p[6]] ^ r[7][p[7]];
}
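For completeness, here is one way the table might be filled - a minimal sketch, not part of the original answer; the file name random.bin and the fopen/fread approach are my own assumptions:
#include <stdint.h>
#include <stdio.h>

extern uint32_t r[8][256];        /* the table declared above */

/* Fill the table with 8 KB of random bytes from a file (name is hypothetical). */
int load_random_table(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t got = fread(r, 1, sizeof r, f);   /* sizeof r == 8 * 256 * 4 == 8192 bytes */
    fclose(f);
    return got == sizeof r ? 0 : -1;
}

/* Usage: load_random_table("random.bin"); then hash(key) & ((1u << 27) - 1)
   gives a 27-bit bucket address as in the question. */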
Of course, better worst-case collision behaviour is often a very different thing from better real-world performance - a lot depends on your data set and hardware - so it's just something to benchmark if you really care. Do benchmark a simple pass-through as well. Using a prime number of buckets is very good practice but might be tricky depending on your hash table - e.g. some implementations round any requested size up to a power of two.

Related

What is a proper way of calculating the size of a hash table

I am building a hash table that uses double hashing to resolve collisions. How can I know what the proper size is? I know it has to be prime to minimize the number of collisions.
The easiest way to implement a hash table is to use a power-of-2 sized table.
The reason is that if N = 2^M, then calculating H % N is as simple as calculating H & (N - 1).
With fast hash functions such as MurmurHash3_32, the slowest part of using the hash table is actually calculating the modulo. H & (N - 1) does not calculate a modulo but a bitwise AND, which is much faster (and gives the same result as the modulo when N is a power of 2).
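As a quick illustration of that identity (a minimal sketch; the table size and hash value are arbitrary placeholders):
#include <stdint.h>
#include <assert.h>

int main(void)
{
    const uint32_t N = 1u << 20;          /* power-of-2 table size: 1,048,576 buckets */
    uint32_t H = 0x9e3779b9u;             /* any hash value */
    assert(H % N == (H & (N - 1)));       /* same bucket index, but AND avoids a division */
    return 0;
}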
Somebody could validly claim that MurmurHash suffers from seed-independent multicollisions and is therefore susceptible to a hash collision denial-of-service attack. That's true, but you shouldn't use linked lists to resolve hash collisions anyway. You should only use hash tables whose keys are sortable by some comparison function (greater than, equal, less than), and then you can use red-black trees (or AVL trees) to resolve hash collisions. If there's no natural comparison function (such as for complex numbers), you can invent one.
Using a red-black tree that almost always is just a single root element with MurmurHash is much faster than trying to be "secure" by using SipHash and then stupidly using linked lists to resolve hash collisions (which caused the need for the absurdly slow SipHash in the first place).
In theory, with non-power-of-2-sized hash tables where the size rarely varies, you could use the "fast division by invariant integers using multiplication" trick, but even that is slower than power-of-2 sizing and a bitwise AND.
The prime sizing is just for really poor hash functions. MurmurHash, although it suffers from seed-independent multicollisions, does not suffer from collisions with reasonable (non-attacker-generated) keys if the table size is a power of 2.
No, there is no point in making the size be prime and that adds a lot of extra work for you. Just make the size be a power of two and double it whenever the number of objects in the hash table reaches some threshold, like 50% or 25% of the size.
If you are asking about the current size, you may use the sizeof(table)/sizeof(element) expression, since you are using the double hashing method.
If you are asking about the new size of the hash table once it is full (i.e. passes a certain criterion), then the most common approach is to add 10 new slots. This should be based on what you are using your table for. The default setting for most built-in tables in other languages is: if the table is 0.75 full, add 10 slots.
If it's about something else, then please modify your question, so it's more descriptive.
Edit: I just noticed the answer above mine, and I think that using the 2^p sizing is very common too in exponentially growing tables, and it works well with double hashing.

Is mod prime good enough as a hash function for a hashtable in C

I need a hash function that is as efficient as possible, for a hash table (actually a hash set) that uses probing (open addressing) for collision resolution. The entries stored in the table are all 4 byte ints that take on random values over the range.
I am considering something even faster than djb2, something like
value mod LARGE_PRIME
Then mod it again with my bucket size. I suppose this prime necessarily has to be bigger than my bucket size, which means I also have some kind of sanity limit on how big my table can grow (it probably won't ever get past 256 entries).
I don't require any cryptography aspects of hash function - as long as it isn't terribly collision-prone, it should work fine.
Will this make a good hash function? Can I define a specific algorithm for my hash table capacity every time I resize to improve it?
The right hash function boils down to the data you are hashing: how random are your values? If your values are uniformly distributed over the range, and the range is much larger than the number of hash buckets, then just using
value MOD number_of_buckets
will be a reasonable hash function - adding in MOD <prime> won't actually give you very much, and in fact might well make the hashing worse (because some buckets will be under- or over-used more than they would otherwise have been).
Primes aren't magic - they can sometimes be used to "smooth out" correlation effects due to common factors, but if you don't have those correlations to begin with, you may be better off without them - especially if speed is paramount!
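A minimal sketch of the direct approach for 4-byte keys (my own illustration; the function name and table size parameter are placeholders):
#include <stdint.h>
#include <stddef.h>

/* Map a 4-byte key straight to a bucket; with uniformly random keys,
   no extra mixing or LARGE_PRIME step is needed. */
static size_t bucket_for(uint32_t value, size_t number_of_buckets)
{
    return value % number_of_buckets;
}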

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is the HyperLogLog algorithm, which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which, until recently, I always assumed was calculated rather than estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count duplicates in a list in O(n) time using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you observe a stream of random integers and see an integer whose binary representation starts with some known prefix, there is a good chance that the cardinality of the stream is around 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a good chance that this stream has a cardinality of about 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit of a binary number on most processors.)
Of course, if you observe just one integer, the chance that this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams, keeps the maximum length of the "00...1" prefix seen in each substream, and then estimates the final value by taking the mean over the substreams.
That's the main idea of this algorithm. There are some missing details (the correction for low estimates, for example), but it's all well described in the paper. Sorry for the terrible English.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (having a set and adding elements to the set) it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big-O notation here instead of just "distinct elements"? Because elements can be of different sizes: one element can be 1, another the string "is this big string". So if you have a huge list (or a huge stream of elements), it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it starts with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point: having a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the zero-prefix in the binary representations, and this will give you a reasonable estimate.
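To make the prefix idea concrete, here is a minimal sketch (my own illustration, not code from the paper) of tracking the longest zero-prefix among hashed values:
#include <stdint.h>

static int max_zeros = 0;    /* the single "register" of this naive estimator */

/* Count leading zeros of a 32-bit hash (portable loop; real code would
   use a builtin such as __builtin_clz). */
static int leading_zeros(uint32_t h)
{
    int n = 0;
    for (uint32_t mask = 1u << 31; mask != 0 && (h & mask) == 0; mask >>= 1)
        n++;
    return n;
}

static void observe(uint32_t hashed_value)
{
    int z = leading_zeros(hashed_value);
    if (z > max_zeros) max_zeros = z;
}

/* After feeding every hashed element to observe(), the cardinality estimate
   is roughly 2^max_zeros - a single stored number of about log(k) bits. */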
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, is almost never evenly distributed, and can take any values). But using a good hashing function you can assume that the output bits are evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements (up to a maximum cardinality of around 2^k) by storing only one number of size log(k) bits. The downside is that we have a huge variance in our estimate. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (it is a little bit smarter with the estimate, but we are close).
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one random occurrence of an element with a long zero-prefix can spoil everything. One way to improve this is to use many hash functions, track the max for each hash function, and in the end average them out. This is an excellent idea, which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is treated basically the same as our hash. It was hard for me to grasp what was going on, so I will give an example. Assume you have two elements, and your hash function, which gives values from 0 to 2^10, produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:
344 = 0101 011000 -> bucket 5 will store 1 (one leading zero in the remaining bits)
387 = 0110 000011 -> bucket 6 will store 4 (four leading zeros in the remaining bits)
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
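A minimal sketch of that split (my own illustration; the 16 buckets and 10-bit hash values match the example above):
#include <stdint.h>

#define NUM_BUCKETS 16              /* 2^4 buckets: the top 4 bits pick the bucket */
static int registers[NUM_BUCKETS];  /* each stores the longest zero-prefix seen */

/* h is a 10-bit hash value, as in the example (344, 387, ...). */
static void add_value(uint32_t h)
{
    uint32_t bucket = h >> 6;            /* top 4 of the 10 bits */
    uint32_t rest   = h & 0x3f;          /* remaining 6 bits */
    int zeros = 0;                       /* leading zeros of the 6-bit remainder */
    for (uint32_t mask = 1u << 5; mask != 0 && (rest & mask) == 0; mask >>= 1)
        zeros++;
    if (zeros > registers[bucket])
        registers[bucket] = zeros;
}

/* add_value(344) -> registers[5] = 1;  add_value(387) -> registers[6] = 4 */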
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
And I want to finish with a recent paper which shows an improved version of the HyperLogLog algorithm (until now I haven't had time to fully understand it, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should be distributed evenly over the range. Say the range goes up to 1024 (10 bits), and the minimum value observed is 10. Then the cardinality is estimated to be about 100 (100 × 10 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Why setting HashTable's length to a Prime Number is a good practice?

I was going through Eric Lippert's latest blog post of guidelines and rules for GetHashCode when I hit this paragraph:
We could be even more clever here; just as a List resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.
So it looks like I'm missing something. Why is it good practice to set it to a prime number?
You can find people suggesting the two opposite ends of the spectrum. On one side, choosing a prime number for the size of the hash table reduces the chance of collisions even if the hash function is not very effective at distributing its results. Note that if (in the simplest example to argue about) a power-of-2 size is chosen, only the lower bits affect the bucket, while for a prime number most bits of the hash result will be used.
On the other hand, you can gain more by choosing a better hash function, or even rehashing the result of the hash function by applying some bit operations, and using a power-of-2 hash size to speed up the calculations.
As an example from real life, Java HashTable was initially implemented using prime (or almost prime) sizes, but from Java 1.4 on the design was changed to use a power-of-two number of buckets, with a second fast hash function applied to the result of the initial hash. An interesting article commenting on that change can be found here.
So basically:
a prime number helps disperse the inputs across the different buckets even in the event of not-so-good hash functions.
a similar effect can be achieved by post-processing the result of the hash function and using a power-of-2 size to speed up the modulo operation (a bit mask), which compensates for the post-processing (see the sketch below).
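A minimal sketch of that post-process-then-mask approach (the mixing function here is an arbitrary illustration, not the exact supplemental hash Java uses):
#include <stdint.h>
#include <stddef.h>

/* Illustrative bit-mixing step: XOR-shift and multiply so that high bits
   also influence the low bits that the mask will keep. */
static uint32_t mix(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x45d9f3bu;      /* arbitrary odd multiplier, for illustration only */
    h ^= h >> 13;
    return h;
}

static size_t bucket(uint32_t raw_hash, size_t pow2_size)
{
    return mix(raw_hash) & (pow2_size - 1);   /* fast mask instead of modulo */
}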
Because this produces a better hash function and reduces the number of possible collisions. This is explained in Choosing a good hashing function:
A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them.
The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number.
Say your bucket set length is a power of 2 - that makes the mod calculation quite fast. It also means that the bucket selection is determined solely by the bottom n bits of the hash code (where the length is 2^n), and the top m = 32 - n bits are ignored. So it's like you're throwing away useful bits of the hashcode immediately.
Or, as this blog post from 2006 puts it:
Suppose your hashCode function results in the following hashCodes among others {x, 2x, 3x, 4x, 5x, 6x, ...}; then all of these are going to be clustered in just m buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:
...
Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.
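A small demonstration of the clustering effect described in the quote above (my own sketch; x = 4 and the two table lengths are arbitrary choices):
#include <stdio.h>

/* Count how many distinct buckets the hash codes x, 2x, 3x, ... land in. */
static int distinct_buckets(int table_length, int x, int codes)
{
    int used[1024] = {0}, count = 0;     /* assumes table_length <= 1024 */
    for (int k = 1; k <= codes; k++) {
        int b = (k * x) % table_length;
        if (!used[b]) { used[b] = 1; count++; }
    }
    return count;
}

int main(void)
{
    /* table_length = 64, x = 4: GCD = 4, so only 64/4 = 16 buckets are ever hit. */
    printf("%d\n", distinct_buckets(64, 4, 1000));   /* prints 16 */
    /* prime table_length = 67: GCD = 1, so all 67 buckets get used. */
    printf("%d\n", distinct_buckets(67, 4, 1000));   /* prints 67 */
    return 0;
}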

hash function for src dest ip + port

So, I am looking at different hash functions to use for hashing a 4-tuple of IPs and ports to identify flows.
One I came across was
((size_t)(key.src.s_addr) * 59) ^
((size_t)(key.dst.s_addr)) ^
((size_t)(key.sport) << 16) ^
((size_t)(key.dport)) ^
((size_t)(key.proto));
Now, for the life of me, I cannot explain the prime used (59). Why not 31? And then why go and mess it up by multiplying the sport by a power of 2?
Is there a better hash function to use for IP addresses?
The prime number is used because, when one value is multiplied by a prime, it tends to have a higher probability of remaining unique when other similar operations are accumulated on top of it. The specific value 59 may have been chosen arbitrarily or it may be intentional; it is hard to tell. It is possible that 59 tended to generate a better distribution of values based on the most likely inputs.
The shift by 16 is probably because ports are limited to the range 0 - 2^16 - 1. The function appears to be moving the source port into the higher part of the bit field while leaving the destination port in the lower part. I think this can be explained further in my next paragraph.
Another reason why the multiplication takes place (and this is true of the shift operation as well) is that it breaks the symmetry of the hash function. Remember, XOR is commutative, so the IPs src=192.168.1.1 and dst=192.168.1.254 would hash to the same value as src=192.168.1.254 and dst=192.168.1.1 (swapped) if the multiplication were not there.
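A small sketch of why the multiplier matters (my own illustration; the addresses are the example ones above, hand-encoded as 32-bit integers):
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Plain XOR of src and dst is symmetric: swapping the two addresses gives
   the same hash, so both directions of a flow collide. The multiplier breaks that. */
static size_t hash_no_mult(uint32_t src, uint32_t dst) { return (size_t)src ^ dst; }
static size_t hash_mult(uint32_t src, uint32_t dst)    { return ((size_t)src * 59) ^ dst; }

int main(void)
{
    uint32_t a = 0xC0A80101u;   /* 192.168.1.1   */
    uint32_t b = 0xC0A801FEu;   /* 192.168.1.254 */
    printf("%d\n", hash_no_mult(a, b) == hash_no_mult(b, a));  /* 1: collision */
    printf("%d\n", hash_mult(a, b) == hash_mult(b, a));        /* 0: symmetry broken */
    return 0;
}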
Personally I think you'd be better off reading the four IP bytes as an unsigned long, which gives you a number roughly in the range 0 to 2^32 - 1. Then you figure out how many flows you want to have active at any one time, and that gives you your index table size.
Take 2000, for example. That means you want to map 2^32 numbers onto roughly 2^11 indices (to flow information). That won't work, because hashing almost never works when the table is filled to 100%, and even 90% can be difficult. Using an index table that you only fill to 50% (4000 indices) or even 25% (8000) is no big deal with today's memory.
The exact size of the index table should be an odd number of locations, preferably a prime number. This is because you'll most likely need some overflow handling to deal with collisions (two or more IP numbers which, after hashing, point to the same location in the index table) - and you will get them. The overflow jump should be another prime number, smaller than the index table size. All these prime numbers! What's with them, anyway?
I'll illustrate with an example (in C):
idxsiz = prime(2000 * 2); // 50% loading
ovfjmp = prime(idxsiz/3);
...
Initially, fill the index table's idxsiz positions with an UNUSED marking (-1). Have a DELETED marking ready (-2).
An IP number enters the system and you look up its flow record (which may or may not exist):
stoppos = ip % idxsiz; /* modulo (division) just this once */
i = stoppos;
do
{
    if (index[i] == UNUSED) return NULL;   /* never used: the key cannot be further along */
    if (index[i] != DELETED)
    {
        flowrecptr = &flow_record[index[i]];
        if (!flowrecptr->in_use) { /* hash table is broken */ }
        if (flowrecptr->ip == ip) return flowrecptr;
    }
    i += ovfjmp;                           /* step by the overflow jump ... */
    if (i >= idxsiz) i -= idxsiz;          /* ... wrapping around the table */
}
while (i != stoppos);
return NULL;
The UNUSED marker means that this index has never been used, so searching can stop there. The DELETED marker means that this index has been used but no longer is, which means searching must continue.
That was the get. Since get returned NULL, you need to do a put, which begins by finding the first index position containing UNUSED or DELETED. Replace this value with the index of the first/next free row in the flow_record table. Mark that row as in_use. Put the original IP number into the ip member of the flow_record row.
This is a very simple - but very effective - way to construct a hashing mechanism. Practically every optimization in the form of special functions to be used after this or that function has failed will enhance the effectiveness of the hashing.
The use of prime numbers ensures that - in the worst case, where all index locations are occupied - the mechanism will test every single location. To illustrate the opposite: suppose idxsiz were evenly divisible by ovfjmp; you wouldn't have much overflow handling to speak of. With 35 and 7, only locations 0, 7, 14, 21 and 28 would be tested before the index jumps back to 0, where the while test stops the search.
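A quick way to see this (my own sketch, reusing the 35/7 numbers from the paragraph above, plus a coprime step for comparison):
#include <stdio.h>

/* Count how many distinct slots a probe sequence visits before
   returning to its starting position. */
static int slots_visited(int idxsiz, int ovfjmp, int start)
{
    int i = start, visited = 0;
    do {
        visited++;
        i += ovfjmp;
        if (i >= idxsiz) i -= idxsiz;
    } while (i != start);
    return visited;
}

int main(void)
{
    printf("%d\n", slots_visited(35, 7, 0));   /* prints 5: only 0, 7, 14, 21, 28 */
    printf("%d\n", slots_visited(35, 11, 0));  /* prints 35: a coprime step covers every slot */
    return 0;
}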
Edit - OOPS! I missed that you wanted the port number as well. Assuming IPv4, that means 6 bytes of address. Read these as an unsigned 64-bit integer and clear the top 16 bits / 2 bytes. Then do the modulo calculation as before.
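A minimal sketch of that packing (my own reading of "read the 6 bytes as a 64-bit integer"; the function name is hypothetical):
#include <stdint.h>

/* Pack a 4-byte IPv4 address and a 2-byte port into the low 48 bits of a
   64-bit key; the top 16 bits stay clear. */
static uint64_t make_key(uint32_t ip, uint16_t port)
{
    return ((uint64_t)ip << 16) | port;
}

/* stoppos = make_key(ip, port) % idxsiz;  -- then probe exactly as above */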
Examine the output of the function for uniform distribution. If you don't like it, plug in some different primes until you get a distribution you like. Hashing can be a very dark art with no 'right' answer.
Brian Gideon pretty much sums it up; the multiplication and the shift are intended as symmetry breakers. So this catches the hypothetical case of machine A telnetting to machine B and vice versa where they happen to choose the same ephemeral port number. Not very common, but not impossible. Much of the 5-tuple is pretty constant: the protocol comes from a very small domain, and so does one half of each {address, portnum} pair.
Assuming a prime-sized hashtable, the magic constant 59 could be replaced by any prime, IMO. The (port << 16) could also be replaced by another shift (as long as no bits fall off the top), or even by a (port * some_other_prime) term.
For a power-of-two sized hashtable, all (minus one) members of the 5-tuple should be multiplied by a (different) prime. (In the old days, when division was expensive, that would have been an option.)
