Suppose I have 200.000 of words, and I am going to use hash*33 + word[i] as a hash function, what should be the size of table for optimization, for minimum memory/paging issue?
Platform used - C (c99 version),
words are English char words, ASCII values
One time initialization of hash table (buckets of link list style),
used for searching next, like dictionary search.
After collision , that word will be added as new node into bucket.
A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup. Assuming you have a good hash function.
Based on that, you would want a minimum of about 266,700 buckets (for 75%), or 285,700 buckets for 70%. That's assuming no collisions.
That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.
Long story short: use a better hash function and do some testing at different table sizes.
There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.
Related
Let the size of the hash table to be static (I set it once). I want to set it according to the number of entries. Searching yielded that the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.
For simplicity, assume that the hash table will not accept any new entries and won't delete any.
The number of entries will be 200, 2000, 20000 and 2000000.
However, setting the size to 2*N seems too much to me. It isn't? Why? If it is, which is the size I should pick?
I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.
I using C and I want to build my own structure, for educating myself.
the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.
It certainly shouldn't. Probably this recommendation implies that load factor of 0.5 is good tradeoff, at least by default.
What comes to primality of size, it depends on collision resolution algorithm your choose. Some algorithms require prime table size (double hashing, quadratic hashing), others don't, and they could benefit from table size of power of 2, because it allows very cheap modulo operations. However, when closest "available table sizes" differ in 2 times, memory usage of hash table might be unreliable. So, even using linear hashing or separate chaining, you can choose non power of 2 size. In this case, in turn, it's worth to choose particulary prime size, because:
If you pick prime table size (either because algorithm requires this, or because you are not satisfied with memory usage unreliability implied by power-of-2 size), table slot computation (modulo by table size) could be combined with hashing. See this answer for more.
The point that table size of power of 2 is undesirable when hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have bad hash function, avalanching it and still using power-of-2 size would be faster that switching to prime table size, because a single integral division is still slower on modern CPUs that several of multimplications and shift operations, required by good avalanching functions, e. g. from MurmurHash3.
The entries will be 200, 2000, 20000 and 2000000.
I don't understand what did you mean by this.
However, setting the size to 2*N seems too much to me. It isn't? Why? If it is, which is the size I should pick?
The general rule is called space-time tradeoff: the more memory you allocate for hash table, the faster hash table operate. Here you can find some charts illustrating this. So, if you think that by assigning table size ~ 2 * N you would waste memory, you can freely choose smaller size, but be ready that operations on hash table will become slower on average.
I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.
It's impossible to avoid collisions completely (remember birthday paradox? :) Certain ratio of collisions is an ordinary situation. This ratio affects only average operation speed, see the previous section.
The answer to your question depends somewhat on the quality of your hash function. If you have a good quality hash function (i.e. one where on average, the bits of the hash code will be "distributed evenly"), then:
the necessity to have a prime number of buckets disappears;
you can expect the number of items per bucket to be Poisson distributed.
So firstly, the advice to use a prime number of buckets is is essentially a kludge to help alleviate situations where you have a poor hash function. Provided that you have a good quality hash function, it's not clear that there are really any constraints per se on the number of buckets, and one common choice is to use a power of two so that the modulo is just a bitwise AND (though either way, it's not crucial nowadays). A good hash table implementation will include a secondary hash to try and alleviate the situation where the original hash function is of poor quality-- see the source code to Java's HashTable for an example.
A common load factor is 0.75 (i.e. you have 100 buckets for every 75 entries). This translates to approximately 50% of buckets having just one single entry in them-- so it's good performance-wise-- though of couse it also wastes some space. What the "correct" load factor is for you depends on the time/space tradeoff that you want to make.
In very high-performance applications, a potential design consideration is also how you actually organise the structure/buckets in memory to maximise CPU cache performance. (The answer to what is the "best" structure is essentially "the one that performs best in your experiments with your data".)
I know that I can simply use bucket array for associative container if I have uniformly distributed integer keys or keys that can be mapped into uniformly distributed integers. If I can create the array big enough to ensure a certain load factor (which assumes the collection is not too dynamic), than the expected number of collisions for a key will be bounded, because this is simply hash table with identity hash function.
Edit: I view strings as equivalent to positional fractions in the range [0..1]. So they can be mapped into any integer range by multiplication and taking floor of the result.
I can also do prefix queries efficiently, just like with tries. I presume (without knowing a proof) that the expected number of empty slots corresponding to a given prefix that have to be skipped sequentially before the first bucket with at least one element is reached is also going to be bounded by constant (again depending on the chosen load factor).
And of course, I can do stabbing queries in worst-case constant time, and range queries in solely output sensitive linear expected time (if the conjecture of denseness from the previous paragraph is indeed true).
What are the advantages of a tries then?
If the distribution is uniform, I don't see anything that tries do better. But I may be wrong.
If the distribution has large uncompensated skew (because we had no prior probabilities or just looking at the worst case), the bucket array performs poorly, but tries also become heavily imbalanced, and can have linear worst case performance with strings of arbitrary length. So the use of either structure for your data is questionable.
So my question is - what are the performance advantages of tries over bucket arrays that can be formally demonstrated? What kind of distributions elicit those advantages?
I was thinking of distributions with self-similar structure at different scales. I believe those are called fractal distributions, of which I confess to know nothing. May be then, if the distribution is prone to clustering at every scale, tries can provide superior performance, by keeping the load factor of each node similar, adding levels at dense regions as necessary - something that bucket arrays can not do.
Thanks
Tries are good if your strings share common prefixes. In that case, the prefix is stored only once and can be queried with linear performance in the output string length. In a bucket array, all strings with the same prefixes would end up close together in your key space, so you have very skewed load where most buckets are empty and some are huge.
More generally, tries are also good if particular patterns (e.g. the letters t and h together) occur often. If there are many such patterns, the order of the trie's tree nodes will typically be small, and little storage is wasted.
One of the advantages of tries I can think of is insertion. Bucket array may need to be resized at some point and this is expensive operation. So worst-case insertion time into trie is much better than into bucket array.
Another thing is that you need to map string to fraction to be used with bucket arrays. So if you have short keys, theoretically trie can be more efficient, because you don't need to do the mapping.
I have a C-language app where I need to do table lookups.
The entries are strings, All are known at the start of runtime. The table is initialized once, and then looked up many times. The table can change, but it's basically as if the app starts over. I think this means I can use a perfect-hash? It's ok to consume some time for the hashtable initialization, as it happens just once.
There will be between 3 and 100,000 entries, each one unique, and I estimate that 80% of cases will have fewer than 100 entries. A simple naive lookup is "fast enough" in those cases. (== no one is complaining)
However in the cases where there are 10k+ entries, the lookup speed of a naive approach is unacceptable. What's a good approach for delivering good hashtable-based lookup performance for strings in C?
Assume I do not have a 3rd-party commercial library like Boost/etc. What hash algorithm should I use? how do I decide?
Generating a perfect hash is not a simple problem. There's libraries devoted to the task.
In this case the most popular one is probably CMPH. I haven't used it though so can't help beyond that. gperf is another tool, but it requires the strings to be known at compile time (you could work around it by compiling a .so and loading, but kind of overkill).
But frankly, I'd at least try to go with a binary search first. Simply sort the array using qsort, then search with bsearch (or roll your own). Both those are part of stdlib.h since C89.
If a naive (I assume you mean linear) approach is ok for 100 entries (so 50 comparisons are done on average) then a binary search will be more than sufficient for 100,000 entries (it takes at most 17 comparisons).
So I wouldn't bother with hashes at all but just resort to sorting your string table on startup (e.g. using qsort) and later using a binary search (e.g. using bsearch) to look up entries.
If the (maximal) table size is known, a plain hashtable with chaining is very easy to implement. Size overhead is only two ints per item. With a reasonable hash function only 1.5 probes per lookup are needed on average, this for a 100% loaded table.
Constructing a perfect hash is only feasible if your data does not change. Once it changes, you'll have to recompute and rehash, which is way more expensive than doing a few extra compares.
I'm working on a project where efficiency is crucial. A hash table would be very helpful since I need to easily look up the memory address of a node based on a key. The only problem I foresee is this hash table will need to handle up to 1 million entries. As I understand it usually hash tables buckets are a linked list so that they can handle multiple entries in the same bucket. It seems to me that with a million entries these lists would be way too slow. What is the common way of implementing something like this. Maybe swapping a standard linked list out for a skip list?
If you want a hash table with a million entries, normally you'd have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of the buckets will have zero or one items. You can, in principle, be very unlucky and get all items in one bucket - but you'd have to be even more unlucky than those people who seem to get struck by lightning every other day.
For hashtables that grow, the normal trick is to grow by a constant percentage - the usual textbook case being growth by doubling the hash-table size. You do this whenever the number of items in the hashtable reaches a certain proportion of the hashtable size, irrespective of how many buckets are actually being used. This gives amortized expected performance of O(1) for inserts, deletes and searches.
The linked list in each bucket of a hash-table is just a way of handling collisions - improbable in a per-operation sense, but over the life of a significant hash table, they do happen - especially as the hash-table gets more than half full.
Linked lists aren't the only way to handle collisions - there's a huge amount of lore about this topic. Walter Bright (developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his Dscript gained a significant performance boost relative to Javascript from this design choice.
He used simple (unbalanced) binary trees when I asked, so the worst-case performance was the same as for linked lists, but the key point I guess is that the binary tree handling code is simple, and the hash table itself makes the odds of building large unbalanced trees very small.
In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option may be to use splay trees for collision handling. But overall, this is a minor issue for a few library designers and a few true obsessives to worry about.
You lose all the advantages of a hash table if the per-bucket lists ever have more than a few entries. The usual way to make a hash table scale to millions of entries is to make the primary hash array resizable, so even with millions of entries, the bucket lists stay short.
You can use a Tree instead of a List in the individual "buckets". (AVL or similar)
EDIT: well, Skip List would do too. (and seems to be faster) - O(log n) is what you aim for.
The total number of entries does not matter, only the average number of entries per bucket (N / size of hash). Use a hash function with larger domain (for example, 20 bits, or even larger) to ensure that.
Of course, this will take up more memory, but that's it, it's a common memory vs speed tradeoff.
Not sure if this will help you or not, but maybe: http://memcached.org/
If your keys have normal distribution (That's a very big IF), then the expected number of insertions into the hashtable to exhaust all the buckets in the hashtable is M*logM ( Natural log, to the base e), where M is the number of buckets.
Was surprised couldn't find this easily online!
I have posted the derivation of the same on my blog,and verified it with Code, using rand().It does seem to be a pretty good estimate.
What is the best data structure to store the million/billions of records (assume a record contain a name and integer) in memory(RAM).
Best in terms of - minimum search time(1st priority), and memory efficient (2nd priority)? Is it patricia tree? any other better than this?
The search key is integer (say a 32 bit random integer). And all records are in RAM (assuming that enough RAM is available).
In C, platform Linux..
Basically My server program assigns a 32bit random key to the user, and I want to store the corresponding user record so that I can search/delete the record in efficient manner. It can be assumed that the data structure will be well populated.
Depends.
Do you want to search on name or on integer?
Are the names all about the same size?
Are all the integers 32 bits, or some big number thingy?
Are you sure it all fits into memory? If not then you're probably limited by disk I/O and memory (or disk usage) is no concern at all any more.
Does the index (name or integer) have common prefixes or are they uniformly distributed? Only if they have common prefixes, a patricia tree is useful.
Do you look up indexes in order (gang lookup), or randomly? If everything is uniform, random and no common prefixes, a hash is already as good as it gets (which is bad).
If the index is the integer where gang lookup is used, you might look into radix trees.
my educated guess is a B-Tree (but I could be wrong ...):
B-trees have substantial advantages
over alternative implementations when
node access times far exceed access
times within nodes. This usually
occurs when most nodes are in
secondary storage such as hard drives.
By maximizing the number of child
nodes within each internal node, the
height of the tree decreases,
balancing occurs less often, and
efficiency increases. Usually this
value is set such that each node takes
up a full disk block or an analogous
size in secondary storage. While 2-3
B-trees might be useful in main
memory, and are certainly easier to
explain, if the node sizes are tuned
to the size of a disk block, the
result might be a 257-513 B-tree
(where the sizes are related to larger
powers of 2).
Instead of a hash you can at least use a radix to get started.
For any specific problem, you can do much better than a btree, a hash table, or a patricia trie. Describe the problem a bit better, and we can suggest what might work
If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.
If using a hash table, you want to pre-allocate the hashtable for the expected final size so it doesn't to rehash.
We can use a trie where each node is 1/0 to store the integer values . with this we can ensure that the depth of the tree is 32/64,so fetch time is constant and with sub-linear space complexity.