I have a collection of objects (at most 500).
My entries will be looked up frequently based on a MAC-like key whose range is unknown.
Now, I am confused as to which data structure and algorithm to use for effective look up of values.
I am not sure whether to go for a balanced BST (AVL) or a hashtable for this case.
Is 500 keys too small a number to justify building a hashtable?
What would be the best approach in my case?
I have read that computing the hash might prove costly when the number of keys is small.
On a side note, I would also like to know the minimum number of entries at which a hashtable becomes worth considering.
Please add a comment if further details are needed.
Thanks in advance.
Below are some of the benefits of hash structures:
Fast lookup (O(1) theoretically)
Efficient storage (key-value pairs are stored compactly)
Although these properties are beneficial, in some scenarios a hash table can underperform:
If you have a large number of objects, more storage space (memory) is required, which can cause a performance hit.
The hashing/key algorithm should not be complex; otherwise more time is spent on hashing and locating the key.
Key collisions should be minimal, to avoid a linear search through all values stored under a single key, or key duplication.
In your case, if the hashing algorithm is not too complex, you can definitely use a hashtable, since you have only 500 objects. If your workflow is lookup-intensive, a hashtable can save a lot of time. If your data is nearly static, don't worry about the initial loading time, because your lookup time will be much faster.
You could also look at other data structures that are efficient for small collections, such as hash sets, AVL trees, or hash trees. For 500 objects the difference between a linear search and a hash lookup will only be on the order of milliseconds or microseconds, so you won't gain much performance; favour simplicity and readability.
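For what it's worth, here is a minimal sketch (not a finished implementation) of what such a table could look like in C, assuming a 6-byte MAC-style key and made-up names: a fixed array of 1024 buckets with chaining for collisions. With only 500 entries the chains stay very short and the hash is cheap to compute.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024            /* plenty of headroom for ~500 entries */

/* Hypothetical entry keyed by a 6-byte MAC-like key. */
typedef struct entry {
    uint8_t key[6];
    void *value;
    struct entry *next;          /* chaining for collisions */
} entry_t;

static entry_t *buckets[NBUCKETS];

/* Cheap FNV-1a style hash over the 6 key bytes. */
static size_t hash_key(const uint8_t key[6])
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 6; ++i) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h & (NBUCKETS - 1);   /* NBUCKETS is a power of two */
}

static void put(const uint8_t key[6], void *value)
{
    size_t b = hash_key(key);
    entry_t *e = malloc(sizeof *e);
    if (!e) return;
    memcpy(e->key, key, 6);
    e->value = value;
    e->next = buckets[b];        /* push onto the bucket's chain */
    buckets[b] = e;
}

static void *get(const uint8_t key[6])
{
    for (entry_t *e = buckets[hash_key(key)]; e; e = e->next)
        if (memcmp(e->key, key, 6) == 0)
            return e->value;
    return NULL;
}
```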
I am looking to compare two values (e.g., whether one is greater than or less than the other) in a HashMap, Hashtable, Map, or some other array-like type.
Could you please help me with this?
Here are some factors that would affect your selection of a data structure:
What is the purpose of the comparison?
What type of data are you comparing?
How often will data be inserted into this data structure?
How often will data be selected from this data structure?
When should you use a HashMap?
Use a HashMap when your major requirement is only retrieving or modifying data based on a key. For example, in web applications the username is stored as a key and the user data is stored as a value in the HashMap, for faster retrieval of the user data corresponding to a username.
When should you use a tree (e.g. std::map) instead of a HashTable?
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map; see the sketch after this list), or

the available/possible hash functions are very collision prone, or

you want to avoid worst-case performance hits for:

handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)

resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory's used), the majority of implementations will outgrow the arrays they're using for the hash table every now and then, then allocate a bigger array and copy content across: this can make the specific insertions that cause this rehashing much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balanced binary tree may serve

your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
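As a sketch of the comparison-only alternative mentioned in the first point above (the types and names here are made up): when all you have is a cmp function, you can keep the items sorted in a plain array and use bsearch, which is the flat-array cousin of a std::map lookup.

```c
#include <stdlib.h>
#include <string.h>

/* Opaque binary blob: we can't hash it meaningfully, but we can compare it. */
typedef struct {
    unsigned char bytes[16];
} blob_t;

/* The comparison function that would also drive a std::map / balanced tree. */
static int blob_cmp(const void *a, const void *b)
{
    return memcmp(a, b, sizeof(blob_t));
}

/* Sort the blobs once, then look them up by comparison only. */
static void sort_blobs(blob_t *items, size_t n)
{
    qsort(items, n, sizeof *items, blob_cmp);
}

static blob_t *find_blob(const blob_t *key, blob_t *items, size_t n)
{
    return bsearch(key, items, n, sizeof *items, blob_cmp);
}
```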
Suppose I have 200,000 words, and I am going to use hash*33 + word[i] as the hash function. What should the size of the table be to optimize for minimal memory/paging issues?
Platform used: C (C99).
The words are English words made up of ASCII characters.
The hash table is initialized once (buckets as linked lists) and is then used for searching, like a dictionary lookup.
On a collision, the word is added as a new node in that bucket.
A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup. Assuming you have a good hash function.
Based on that, you would want a minimum of about 266,700 buckets (for 75%), or 285,700 buckets for 70%. That's assuming no collisions.
That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.
Long story short: use a better hash function and do some testing at different table sizes.
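As suggested above, the quickest way to settle on a size is to measure. Below is a rough C sketch that reads a word list (the file name words.txt and the candidate table sizes are placeholders), hashes each word with the hash*33 + word[i] scheme, and reports how many insertions land in an already occupied bucket for each candidate size.

```c
#include <stdio.h>
#include <stdlib.h>

/* djb2-style hash: hash = hash*33 + c, as in the question. */
static unsigned long hash33(const char *word)
{
    unsigned long h = 5381;
    for (const char *p = word; *p; ++p)
        h = h * 33 + (unsigned char)*p;
    return h;
}

int main(void)
{
    /* Candidate table sizes to test -- placeholder values. */
    const size_t sizes[] = { 266700, 285700, 500000 };
    char word[128];

    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; ++s) {
        size_t table_size = sizes[s];
        unsigned *counts = calloc(table_size, sizeof *counts);
        if (!counts) return 1;

        FILE *f = fopen("words.txt", "r");   /* hypothetical word list */
        if (!f) { free(counts); return 1; }

        unsigned long words = 0, collisions = 0;
        while (fscanf(f, "%127s", word) == 1) {
            size_t bucket = hash33(word) % table_size;
            if (counts[bucket]++ > 0)        /* bucket already occupied */
                ++collisions;
            ++words;
        }
        fclose(f);

        printf("table size %zu: %lu words, %lu collisions\n",
               table_size, words, collisions);
        free(counts);
    }
    return 0;
}
```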
There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.
I'm working on a project where efficiency is crucial. A hash table would be very helpful since I need to easily look up the memory address of a node based on a key. The only problem I foresee is that this hash table will need to handle up to 1 million entries. As I understand it, hash table buckets are usually linked lists so that they can handle multiple entries in the same bucket. It seems to me that with a million entries these lists would be way too slow. What is the common way of implementing something like this? Maybe swapping a standard linked list out for a skip list?
If you want a hash table with a million entries, normally you'd have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of the buckets will have zero or one items. You can, in principle, be very unlucky and get all items in one bucket - but you'd have to be even more unlucky than those people who seem to get struck by lightning every other day.
For hashtables that grow, the normal trick is to grow by a constant percentage - the usual textbook case being growth by doubling the hash-table size. You do this whenever the number of items in the hashtable reaches a certain proportion of the hashtable size, irrespective of how many buckets are actually being used. This gives amortized expected performance of O(1) for inserts, deletes and searches.
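A minimal sketch of that grow-by-doubling policy, assuming a chained table where each node caches its hash (the type and function names are made up for illustration): when the item count reaches 75% of the bucket count, the bucket array is doubled and every node is re-bucketed under the new modulus.

```c
#include <stdlib.h>

/* Hypothetical chained hash table node and table types. */
typedef struct node {
    unsigned long hash;     /* cached hash of the key */
    struct node *next;
    /* key and value fields omitted for brevity */
} node_t;

typedef struct {
    node_t **buckets;
    size_t bucket_count;
    size_t item_count;
} table_t;

/* Double the bucket array and re-bucket every node. */
static int rehash(table_t *t)
{
    size_t new_count = t->bucket_count * 2;
    node_t **new_buckets = calloc(new_count, sizeof *new_buckets);
    if (!new_buckets)
        return -1;

    for (size_t i = 0; i < t->bucket_count; ++i) {
        node_t *n = t->buckets[i];
        while (n) {
            node_t *next = n->next;
            size_t b = n->hash % new_count;   /* new bucket index */
            n->next = new_buckets[b];
            new_buckets[b] = n;
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = new_buckets;
    t->bucket_count = new_count;
    return 0;
}

/* Call after each insertion: grow once the load factor reaches 75%. */
static int maybe_grow(table_t *t)
{
    if (t->item_count * 4 >= t->bucket_count * 3)
        return rehash(t);
    return 0;
}
```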
The linked list in each bucket of a hash-table is just a way of handling collisions - improbable in a per-operation sense, but over the life of a significant hash table, they do happen - especially as the hash-table gets more than half full.
Linked lists aren't the only way to handle collisions - there's a huge amount of lore about this topic. Walter Bright (developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his Dscript gained a significant performance boost relative to Javascript from this design choice.
He used simple (unbalanced) binary trees when I asked, so the worst-case performance was the same as for linked lists, but the key point I guess is that the binary tree handling code is simple, and the hash table itself makes the odds of building large unbalanced trees very small.
In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option may be to use splay trees for collision handling. But overall, this is a minor issue for a few library designers and a few true obsessives to worry about.
You lose all the advantages of a hash table if the per-bucket lists ever have more than a few entries. The usual way to make a hash table scale to millions of entries is to make the primary hash array resizable, so even with millions of entries, the bucket lists stay short.
You can use a Tree instead of a List in the individual "buckets". (AVL or similar)
EDIT: well, Skip List would do too. (and seems to be faster) - O(log n) is what you aim for.
The total number of entries does not matter, only the average number of entries per bucket (N / size of hash). Use a hash function with larger domain (for example, 20 bits, or even larger) to ensure that.
Of course, this will take up more memory, but that's it, it's a common memory vs speed tradeoff.
Not sure if this will help you or not, but maybe: http://memcached.org/
If your hash function spreads keys uniformly over the buckets (that's a very big IF), then the expected number of insertions into the hashtable needed to exhaust all of its buckets is M*ln(M) (natural log, base e), where M is the number of buckets.
I was surprised I couldn't find this easily online!
I have posted the derivation on my blog and verified it with code using rand(). It does seem to be a pretty good estimate.
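A quick way to check that estimate yourself (a sketch only; the bucket count M, the seed, and the trial count are arbitrary) is to keep inserting uniformly random bucket indices until every bucket has been hit at least once, and compare the average number of insertions against M*ln(M). Compile with -lm for log().

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t M = 10000;       /* number of buckets (arbitrary) */
    const int trials = 20;        /* number of repetitions (arbitrary) */
    char *hit = malloc(M);
    if (!hit) return 1;

    double total = 0.0;
    srand(12345);
    for (int t = 0; t < trials; ++t) {
        memset(hit, 0, M);
        size_t remaining = M;
        unsigned long insertions = 0;
        while (remaining > 0) {
            size_t b = (size_t)(rand() % (int)M);  /* random bucket */
            ++insertions;
            if (!hit[b]) { hit[b] = 1; --remaining; }
        }
        total += (double)insertions;
    }
    free(hit);

    printf("measured average: %.0f insertions\n", total / trials);
    printf("M*ln(M) estimate: %.0f insertions\n", M * log((double)M));
    return 0;
}
```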
Ok, tries have been around for a while. A typical implementation should give you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, this same implementation takes up 256 words per input byte, in the worst case.
Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?
It might be appropriate to say I am only interested in worst case behaviour, both in time and space-wise.
Did you try Patricia tries (also known as crit-bit or radix tries)? I think they solve the worst-case space issue.
There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.
Dan Gusfield's book is the Bible of string algorithms.
I don't think there is a reason to worry about the worst case, for two reasons:
You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.
If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.
That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small size and increase it as elements are inserted. Worst-case insertion would be c * m, when every hash table has to be reorganized as it grows, where c is the number of possible characters / unique atomic elements.
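Here is a rough sketch of that idea in C, with hypothetical types: each trie node keeps a small open-addressed child table keyed by the next character, and the table is doubled (and its entries re-slotted) once it gets about 75% full, instead of reserving a fixed 256-pointer array per node.

```c
#include <stdlib.h>

/* Hypothetical trie node whose children live in a small, growable
 * open-addressed table instead of a fixed 256-pointer array. */
typedef struct trie_node {
    struct child {
        unsigned char c;          /* edge label */
        struct trie_node *node;   /* child pointer; NULL means empty slot */
    } *children;
    unsigned capacity;            /* always a power of two */
    unsigned count;
    int is_terminal;              /* set when a key ends at this node */
} trie_node;

static trie_node *node_new(void)
{
    trie_node *n = calloc(1, sizeof *n);
    if (!n) return NULL;
    n->capacity = 4;              /* start tiny; grow on demand */
    n->children = calloc(n->capacity, sizeof *n->children);
    if (!n->children) { free(n); return NULL; }
    return n;
}

/* Linear probing: return the slot holding c, or the empty slot where c
 * would go. capacity is a power of two, so & (cap - 1) wraps around. */
static struct child *find_slot(struct child *tab, unsigned cap, unsigned char c)
{
    unsigned i = (c * 31u) & (cap - 1);
    while (tab[i].node && tab[i].c != c)
        i = (i + 1) & (cap - 1);
    return &tab[i];
}

static void node_grow(trie_node *n)
{
    unsigned new_cap = n->capacity * 2;
    struct child *new_tab = calloc(new_cap, sizeof *new_tab);
    if (!new_tab) return;         /* keep the old table on failure */
    for (unsigned i = 0; i < n->capacity; ++i)
        if (n->children[i].node)  /* re-slot occupied entries */
            *find_slot(new_tab, new_cap, n->children[i].c) = n->children[i];
    free(n->children);
    n->children = new_tab;
    n->capacity = new_cap;
}

/* Walk (or extend) the trie by one character. */
static trie_node *child_get(trie_node *n, unsigned char c, int create)
{
    struct child *slot = find_slot(n->children, n->capacity, c);
    if (slot->node || !create)
        return slot->node;
    if (4 * (n->count + 1) > 3 * n->capacity) {   /* keep load <= 75% */
        node_grow(n);
        slot = find_slot(n->children, n->capacity, c);
    }
    slot->c = c;
    slot->node = node_new();
    if (slot->node) n->count++;
    return slot->node;
}

static void trie_insert(trie_node *root, const char *key)
{
    trie_node *n = root;
    for (const char *p = key; n && *p; ++p)
        n = child_get(n, (unsigned char)*p, 1);
    if (n) n->is_terminal = 1;
}
```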
In my experience there are three implementations that I think could meet your requirements:
HAT-Trie (combination between trie and hashtable)
JudyArray (compressed n-ary tree)
Double-Array Trie
You can see the benchmark here. They are as fast as hashtables, but with lower memory requirements and better worst-case behaviour.
I have a number of data sets that follow a key-value pattern - i.e. a string key and a pointer to the data. Right now they are stored in hashtables, each table having an array of slots corresponding to hash values, and on a collision forming a linked list under the affected slot (direct chaining). All implemented in C (and should stay in C), if it matters.
Now, the data is actually 3 slightly different types of data sets:
Some sets can be changed (keys added, removed, replaced, etc.) at will
For some sets data can be added but almost never replaced/removed (i.e. it can happen, but in practice it is very rare)
For some sets the data is added once and then only looked up, it is never changed once the whole set is loaded.
All sets of course have to support lookups as fast as possible, and consume minimal amounts of memory (though lookup speed is more important than size).
So the question is - is there some better hashtable structure/implementation that would suit these specific cases better? I suspect chaining is best for the first case, but I am not sure about the other two.
If you are using linked lists for each bucket in your hashtable, you have already accepted relatively poor performance on modern CPUs (linked lists have poor locality and therefore poor CPU cache interaction). So I probably wouldn't worry about optimizing the other special cases. However, here are a few tips if you want to continue down the path you are using:
For the 'frequent changes' data set and the 'almost never changes' case, every time you read an item from the hash table, move it to the front of the linked list chain for that bucket. For some even better ideas, this paper, even though it focuses on fixed-size keys, is a good starting point: Fast and Compact Hash Tables for Integer Keys.
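A minimal sketch of that move-to-front heuristic for a single chained bucket, assuming a node type with a string key and a next pointer (the names are illustrative): on a successful lookup the matching node is unlinked and pushed to the head of its chain, so hot keys are found sooner on subsequent lookups.

```c
#include <string.h>

/* Hypothetical chained-bucket node: string key plus opaque value. */
typedef struct node {
    const char *key;
    void *value;
    struct node *next;
} node_t;

/* Look up key in one bucket chain; on a hit, move the node to the
 * front of the chain so frequently used keys are found sooner. */
static node_t *bucket_lookup_mtf(node_t **head, const char *key)
{
    node_t *prev = NULL;
    for (node_t *n = *head; n; prev = n, n = n->next) {
        if (strcmp(n->key, key) == 0) {
            if (prev) {                   /* not already at the front */
                prev->next = n->next;     /* unlink */
                n->next = *head;          /* push to head */
                *head = n;
            }
            return n;
        }
    }
    return NULL;                          /* not found in this bucket */
}
```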
For the 'data set never changes' case you should look into the perfect hash generators. If you know your keys at compile time I've had good results with gperf. If your keys are not available until run-time try the C Minimal Perfect Hashing Library.
Those sets that are small (tens of elements) might be fastest using a binary or even linear search over the keys stored in sequential memory!
Obviously the key bodies themselves (or hashes of them) have to be in that sequential memory. But if you can get it into one or two L1 cache lines, it'll fly.
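A sketch of that layout with made-up types: the keys (fixed-size 8-byte keys here, but hashes of longer keys would work the same way) sit in one contiguous array that is scanned linearly, and the index of the match selects the value from a parallel array.

```c
#include <stdint.h>
#include <stddef.h>

/* Small set stored as two parallel arrays: a packed array of fixed-size
 * keys (scanned linearly, cache friendly) and the corresponding values. */
typedef struct {
    uint64_t keys[32];   /* 32 * 8 bytes = 256 bytes, a few cache lines */
    void *values[32];
    size_t count;
} small_set;

static void *small_set_lookup(const small_set *s, uint64_t key)
{
    for (size_t i = 0; i < s->count; ++i)
        if (s->keys[i] == key)
            return s->values[i];
    return NULL;
}
```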
As for the bigger hashes, the direct chaining might lose out to open addressing?
You could explore "cache conscious" hash tables and tries.
The wikipedia article discusses cache-lines in detail, describing the various trade-offs to consider.