I'm fairly new to the concept of hash tables, and I've been reading up on different types of hash table lookup and insertion techniques.
I'm wondering what the difference is between the time complexities of linear probing, chaining, and quadratic probing?
I'm mainly interested in the insertion, deletion, and search of nodes in the hash table. So if I graph the system time per operation (insert/search/delete) against the number of operations, how would the graphs differ?
I'm guessing that:
- quadratic probing would be worst-case O(n log n) or O(log n) for searching
- linear probing would be worst-case O(n) for search
- Not sure but I think O(n^2) for chaining
Could someone confirm this? Thanks!
It's actually surprisingly difficult to accurately analyze all of these different hashing schemes, for a variety of reasons. First, unless you make very strong assumptions about your hash function, it is difficult to analyze the behavior of these hashing schemes accurately, because different types of hash functions can lead to different performance. Second, the interactions with processor caches mean that certain types of hash tables that are slightly worse in theory can actually outperform hash tables that are slightly better in theory, because their access patterns are better.
If you assume that your hash function looks like a truly random function, and if you keep the load factor in the hash table to be at most a constant, all of these hashing schemes have expected O(1) lookup times. In other words, each scheme, on expectation, only requires you to do a constant number of lookups to find any particular element.
In theory, linear probing is a bit worse than quadratic hashing and chained hashing, because elements tend to cluster near one another unless the hash function has strong theoretical properties. However, in practice it can be extremely fast because of locality of reference: all of the lookups tend to be close to one another, so fewer cache misses occur. Quadratic probing has fewer collisions, but doesn't have as good locality. Chained hashing tends to have extremely few collisions, but tends to have poorer locality of reference because the chained elements are often not stored contiguously.
In the worst case, all of these data structures can take O(n) time to do a lookup, but it's extremely unlikely for this to occur. In linear probing, this would require all the elements to be stored contiguously with no gaps, and you would have to look up one of the first elements. With quadratic hashing, you'd have to have a very strange-looking set of buckets in order to get this behavior. With chained hashing, your hash function would have to dump every single element into the same bucket to get the absolute worst-case behavior. All of these are exponentially unlikely.
In short, if you pick any of these data structures, you are unlikely to get seriously burned unless you have a very bad hash function. I would suggest using chained hashing as a default since it has good performance and doesn't hit worst-case behavior often. If you know you have a good hash function, or have a small hash table, then linear probing might be a good option.
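To make the chained-hashing recommendation concrete, here is a minimal Java sketch of a separately chained table (the class and method names are my own, and resizing is only noted in a comment); the point is that with a bounded load factor, the chain walked in get() has constant expected length:

```java
import java.util.LinkedList;

/** Minimal separate-chaining hash table sketch (illustrative, not production code). */
class ChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key; V value;
        Entry(K k, V v) { key = k; value = v; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int index(K key) {
        // Map the key's hash code to a bucket index (mask off the sign bit first).
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(K key, V value) {
        LinkedList<Entry<K, V>> bucket = buckets[index(key)];
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; }  // overwrite existing key
        }
        bucket.add(new Entry<>(key, value));
        size++;
        // A real table would resize here once size / buckets.length
        // exceeds some load-factor threshold (e.g. 0.75).
    }

    V get(K key) {
        for (Entry<K, V> e : buckets[index(key)]) {
            if (e.key.equals(key)) return e.value;  // expected O(1): the chain is short
        }
        return null;
    }
}
```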
Hope this helps!
Related
I'm working with an algorithm that has to read a file with 1 million lines and store some information about that file. I found the HashSet structure, which adds, removes, and finds any data in O(1) time. But when I execute the algorithm with the line that adds the data to the HashSet, the execution time becomes more than 4x worse. Does HashSet performance get worse when we insert a lot of data into it?
Different HashSet implementations can vary in performance. First of all, there is a need for either some kind of tree or a set of buckets, and each has its own performance cost. Theoretically, hash data structures are fast, but reality can be quite different. O(1) only means that the execution time is independent of the number of elements; it does not mean that each operation is free or fast.
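One concrete thing worth checking with Java's HashSet is rehashing: the default constructor starts with a small table that gets rebuilt several times as a million elements are added. If the size is known up front, pre-sizing the set avoids that cost; a sketch, with the file read replaced by generated strings:

```java
import java.util.HashSet;
import java.util.Set;

public class PresizedSetExample {
    public static void main(String[] args) {
        int expectedLines = 1_000_000;

        // Size the backing table so it never needs to grow: keep expectedLines
        // below capacity * loadFactor (HashSet's default load factor is 0.75).
        Set<String> seen = new HashSet<>((int) (expectedLines / 0.75f) + 1);

        for (int i = 0; i < expectedLines; i++) {
            seen.add("line-" + i);   // stand-in for a line read from the file
        }
        System.out.println("stored " + seen.size() + " entries");
    }
}
```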
I have a collection of objects (max 500).
My entries will be looked up frequently based on a MAC-like key, whose range is unknown.
Now, I am confused as to which data structure and algorithm to use for effective look up of values.
I am not sure whether to go for a balanced BST (AVL) or a hashtable for this case.
Are 500 keys too few to make building a hash table worthwhile?
What would be the best approach in my case?
I read that computing a hash might prove costly when the number of keys is small.
On a side note, I would also like to know what minimum number of entries needs to be present to make a hash table worth considering.
Please add a comment if further details are needed.
Thanks in advance.
Below are some of the benefits of hash structures:
Fast lookup (O(1) theoretically)
Efficient storage (helps to store key-value)
Though these properties are beneficial, in some scenarios a hash table can underperform:
If you have a large number of objects, more storage space (memory) will be required, which can cause a performance hit.
The hashing/key algorithm should not be complex; otherwise more time will be spent hashing and finding keys.
Key collisions should be minimal, to avoid degrading lookups into a linear search over all values stored under a single key.
In your case, if the hashing algorithm is not too complex, you can definitely use a hash table, since you have only 500 objects. If you have a lookup-intensive workflow, a hash table can save a lot of time. If your data is nearly static, don't worry about the initial loading time, because your lookups will be much faster.
You can also look at other data structures that are efficient for small collections, such as hash sets, AVL trees, and hash trees. For 500 objects, the difference between a linear search and a hash lookup will be on the order of milliseconds or microseconds, so you won't gain much performance either way. Favor simplicity and readability.
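As a concrete sketch (the MAC strings and descriptions below are made up for illustration), the standard library HashMap is likely all you need for 500 entries:

```java
import java.util.HashMap;
import java.util.Map;

public class MacLookupExample {
    public static void main(String[] args) {
        // Hypothetical entries: MAC-like keys mapped to whatever object you store.
        Map<String, String> byMac = new HashMap<>();
        byMac.put("00:1A:2B:3C:4D:5E", "printer on floor 2");
        byMac.put("00:1A:2B:3C:4D:5F", "lobby camera");

        // Expected O(1) lookup: one hashCode() call on the key plus an equals()
        // check or two, regardless of whether there are 5 entries or 500.
        System.out.println(byMac.get("00:1A:2B:3C:4D:5E"));
    }
}
```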
Let's say I want to build an array to perform a lookup when parsing network protocols (by ethertype, for example). Since such an identifier is 2 bytes long, I would end up with a 2^16-cell array if I used direct indexing: this is a real waste, because it is very likely that the array would be sparse, i.e. full of gaps.
In order to reduce memory usage as much as possible, I would use a perfect hash function generator like CMPH, so that I can map my "n" identifiers to an n-sized array without any collision. The downside of this approach is that I have to rely on an external, somewhat esoteric library.
I am wondering whether, in my case, there are smarter ways to get a constant-time lookup while keeping memory usage at bay; bear in mind that I am interested in indexing 16-bit unsigned numbers and that the set size is quite limited.
Thanks
Since you know for a fact that you're dealing with 16-bit values, any lookup algorithm will be a constant-time algorithm, since there are only O(1) different possible values. Consequently, algorithms that on the surface might be slower (for example, linear search, which runs in O(n) for n elements) might actually be useful here.
Barring a perfect hashing function, if you want to guarantee fast lookup, I would suggest looking into cuckoo hashing, which guarantees worst-case O(1) lookup times and has expected O(1)-time insertion (though you have to be a bit clever with your hash functions). It's really easy to generate hash functions for 16-bit values; if you pick two random 16-bit multipliers, multiply the high and low bytes of the 16-bit value by those multipliers, and add the products together, I believe that you get a good hash function mod any prime number.
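A sketch of one way to read that suggestion (the class name and constants are illustrative; in a cuckoo table you would keep two such functions with independent multipliers and redraw them on a rebuild):

```java
import java.util.Random;

/** Simple hash for a 16-bit key: random multipliers on the high and low bytes, mod a prime. */
public class SixteenBitHash {
    private final int a, b;     // random 16-bit multipliers
    private final int prime;    // table size, chosen prime (e.g. 769)

    SixteenBitHash(int prime, Random rng) {
        this.prime = prime;
        this.a = rng.nextInt(1 << 16);
        this.b = rng.nextInt(1 << 16);
    }

    int hash(int key16) {
        int hi = (key16 >>> 8) & 0xFF;   // high byte of the 16-bit value
        int lo = key16 & 0xFF;           // low byte
        return (a * hi + b * lo) % prime;  // products fit easily in an int
    }
}
```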
Alternatively, if you don't absolutely have to have O(1) lookup and are okay with good expected lookup times, you could also use a standard hash table with open addressing, such as a linear probing hash table or double hashing hash table. Using a smaller array with this sort of hashing scheme could be extremely fast and should be very simple to implement.
For an entirely different approach, if you're storing sparse data and want fast lookup times, an option that might work well for you is to use a simple balanced binary search tree. For example, the treap data structure is easy to implement and gives expected O(log n) lookups for values. Since you're dealing with 16-bit values, here log n is about 16 (I think the base of the logarithm is actually a bit different), so lookups should be quite fast. This does introduce a bit of overhead per element, but if you have only a few elements it should be simple to implement. For even less overhead, you might want to look into splay trees, which require only two pointers per element.
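If a balanced BST sounds appealing but you would rather not hand-roll a treap or splay tree, Java's TreeMap (a red-black tree) gives the same O(log n) lookup bound; a sketch with a few well-known ethertype entries, just to show the shape of the solution:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class SparseEthertypeTable {
    public static void main(String[] args) {
        // Keys are 16-bit identifiers; only the entries you actually use are stored.
        NavigableMap<Integer, String> byEthertype = new TreeMap<>();
        byEthertype.put(0x0800, "IPv4");
        byEthertype.put(0x0806, "ARP");
        byEthertype.put(0x86DD, "IPv6");

        // O(log n) lookup; with a handful of entries this is a few comparisons.
        System.out.println(byEthertype.get(0x0806));  // prints "ARP"
    }
}
```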
Hope this helps!
I'm working on a project where efficiency is crucial. A hash table would be very helpful since I need to easily look up the memory address of a node based on a key. The only problem I foresee is that this hash table will need to handle up to 1 million entries. As I understand it, hash table buckets are usually linked lists so that they can handle multiple entries in the same bucket. It seems to me that with a million entries these lists would be way too slow. What is the common way of implementing something like this? Maybe swapping a standard linked list out for a skip list?
If you want a hash table with a million entries, normally you'd have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of the buckets will have zero or one items. You can, in principle, be very unlucky and get all items in one bucket - but you'd have to be even more unlucky than those people who seem to get struck by lightning every other day.
For hashtables that grow, the normal trick is to grow by a constant percentage - the usual textbook case being growth by doubling the hash-table size. You do this whenever the number of items in the hashtable reaches a certain proportion of the hashtable size, irrespective of how many buckets are actually being used. This gives amortized expected performance of O(1) for inserts, deletes and searches.
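A small illustration of why the doubling rule gives amortized O(1) (the capacity and threshold values below are just typical textbook numbers): over a million inserts, the total number of re-insertions caused by resizing stays proportional to n.

```java
/** Demo of geometric growth: total rehash work over n inserts is O(n). */
public class DoublingDemo {
    public static void main(String[] args) {
        int capacity = 16;
        double maxLoad = 0.75;          // resize threshold, a common default
        long rehashedItems = 0;

        for (int size = 1; size <= 1_000_000; size++) {
            if (size > capacity * maxLoad) {
                rehashedItems += size - 1;   // every existing item is re-inserted once
                capacity *= 2;               // grow by a constant factor (doubling)
            }
        }
        System.out.println("final capacity = " + capacity
                + ", total re-insertions = " + rehashedItems);
    }
}
```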
The linked list in each bucket of a hash-table is just a way of handling collisions - improbable in a per-operation sense, but over the life of a significant hash table, they do happen - especially as the hash-table gets more than half full.
Linked lists aren't the only way to handle collisions - there's a huge amount of lore about this topic. Walter Bright (developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his Dscript gained a significant performance boost relative to Javascript from this design choice.
He used simple (unbalanced) binary trees when I asked, so the worst-case performance was the same as for linked lists, but the key point I guess is that the binary tree handling code is simple, and the hash table itself makes the odds of building large unbalanced trees very small.
In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option may be to use splay trees for collision handling. But overall, this is a minor issue for a few library designers and a few true obsessives to worry about.
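For what it's worth, this is no longer only a library-designer curiosity: since Java 8, java.util.HashMap itself converts a bucket's collision chain into a red-black tree once the chain grows past a small threshold. A toy version of that idea, using TreeMap as the per-bucket tree (the threshold and integer keys are just for illustration):

```java
import java.util.LinkedList;
import java.util.TreeMap;

/** Toy bucket that switches from a linked list to a tree once the chain gets long. */
class AdaptiveBucket {
    private static final int TREEIFY_AT = 8;          // illustrative threshold
    private LinkedList<Integer> list = new LinkedList<>();
    private TreeMap<Integer, Integer> tree;           // keys mapped to themselves here

    void add(int key) {
        if (tree != null) { tree.put(key, key); return; }
        list.add(key);
        if (list.size() > TREEIFY_AT) {               // collision chain got long: treeify
            tree = new TreeMap<>();
            for (int k : list) tree.put(k, k);
            list = null;
        }
    }

    boolean contains(int key) {
        return tree != null ? tree.containsKey(key) : list.contains(key);
    }
}
```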
You lose all the advantages of a hash table if the per-bucket lists ever have more than a few entries. The usual way to make a hash table scale to millions of entries is to make the primary hash array resizable, so even with millions of entries, the bucket lists stay short.
You can use a tree (AVL or similar) instead of a list in the individual buckets.
EDIT: well, a skip list would do too (and seems to be faster). O(log n) is what you aim for.
The total number of entries does not matter, only the average number of entries per bucket (N / number of buckets). Use a hash function with a larger output range (for example, 20 bits, or even larger) to ensure that.
Of course, this will take up more memory, but that's it; it's the common memory-vs-speed tradeoff.
Not sure if this will help you or not, but maybe: http://memcached.org/
If your keys have a normal distribution (that's a very big IF), then the expected number of insertions into the hash table needed to exhaust all of its buckets is M ln M (natural logarithm, base e), where M is the number of buckets.
I was surprised I couldn't find this easily online!
I have posted the derivation on my blog and verified it with code, using rand(). It does seem to be a pretty good estimate.
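For reference, this is the classic coupon-collector estimate (M · H_M ≈ M ln M). A quick simulation along the lines the answer describes, using java.util.Random in place of rand() and assuming each insertion picks a bucket uniformly at random:

```java
import java.util.Random;

public class BucketExhaustion {
    public static void main(String[] args) {
        int m = 1 << 16;                 // number of buckets
        boolean[] hit = new boolean[m];
        int distinct = 0;
        long insertions = 0;
        Random rng = new Random();

        // Insert uniformly random keys until every bucket has been hit at least once.
        while (distinct < m) {
            int bucket = rng.nextInt(m);
            insertions++;
            if (!hit[bucket]) { hit[bucket] = true; distinct++; }
        }

        System.out.printf("insertions = %d, M*ln(M) = %.0f%n",
                insertions, m * Math.log(m));
    }
}
```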
Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?
It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.
Did you try Patricia tries (also known as crit-bit or radix tries)? I think they solve the worst-case space issue.
There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing method.
Dan Gusfield's book is the Bible of string algorithms.
I don't think there is a reason to be worried about the worst case, for two reasons:
You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.
If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.
That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small size and grow it as elements are inserted. Worst-case insertion would be c * m, when every hash table has to be reorganized due to growth, where c is the number of possible characters / unique atomic elements.
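A minimal Java sketch of that idea (the names are mine), where each trie node stores its children in a small HashMap that grows on demand; lookup and insert visit O(m) nodes for a key of length m:

```java
import java.util.HashMap;
import java.util.Map;

/** Trie whose per-node child table is a small, growable hash map. */
class HashTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>(4);  // starts small, grows as needed
        boolean isKey;
    }

    private final Node root = new Node();

    void insert(String key) {                        // visits one node per character
        Node n = root;
        for (char c : key.toCharArray()) {
            n = n.children.computeIfAbsent(c, ch -> new Node());
        }
        n.isKey = true;
    }

    boolean contains(String key) {
        Node n = root;
        for (char c : key.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return false;             // no child for this character
        }
        return n.isKey;
    }
}
```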
In my experience, there are three implementations that I think could meet your requirements:
HAT-Trie (combination between trie and hashtable)
JudyArray (compressed n-ary tree)
Double Array Tree
You can see the benchmark here. They are as fast as a hash table, but with lower memory requirements and better worst-case behavior.