Specialized hashtable algorithms for dynamic/static/incremental data - C

I have a number of data sets that follow a key-value pattern - i.e. a string key and a pointer to the data. Right now they are stored in hash tables, each table having an array of slots corresponding to hash values, with collisions forming a linked list under each slot (direct chaining). Everything is implemented in C (and should stay in C), if that matters.
Now, the data is actually 3 slightly different types of data sets:
Some sets can be changed (keys added, removed, replaced, etc.) at will
For some sets data can be added but almost never replaced/removed (i.e. it can happen, but in practice it is very rare)
For some sets the data is added once and then only looked up, it is never changed once the whole set is loaded.
All sets of course have to support lookups as fast as possible, and consume minimal amounts of memory (though lookup speed is more important than size).
So the question is - is there some better hashtable structure/implementation that would suit these specific cases better? I suspect chaining is the best for the first case, but I'm not sure about the other two cases.

If you are using linked lists for each bucket in your hashtable, you have already accepted relatively poor performance on modern CPUs (linked lists have poor locality and therefore poor CPU cache interaction). So I probably wouldn't worry about optimizing the other special cases. However, here are a few tips if you want to continue down the path you are using:
For the 'frequent changes' data set and the 'almost never changes' case, every time you read an item from the hash table, move it to the front of that bucket's chain. For some even better ideas, this paper, even though it focuses on fixed-size keys, is a good starting point: Fast and Compact Hash Tables for Integer Keys.
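A hedged C sketch of that move-to-front trick for a chained table (the struct and function names are illustrative, not taken from the question's code):

```c
#include <string.h>

struct entry {
    const char   *key;
    void         *value;
    struct entry *next;
};

struct table {
    struct entry **slots;   /* array of bucket heads */
    size_t         nslots;
};

/* Look up 'key' and, if found, move its node to the front of the
 * bucket's chain so hot keys are reached after fewer pointer hops. */
void *lookup_mtf(struct table *t, const char *key,
                 size_t (*hash)(const char *))
{
    size_t        i    = hash(key) % t->nslots;
    struct entry *prev = NULL;
    struct entry *e    = t->slots[i];

    while (e != NULL) {
        if (strcmp(e->key, key) == 0) {
            if (prev != NULL) {            /* not already at the front */
                prev->next  = e->next;     /* unlink */
                e->next     = t->slots[i];
                t->slots[i] = e;           /* relink at the head */
            }
            return e->value;
        }
        prev = e;
        e    = e->next;
    }
    return NULL;
}
```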
For the 'data set never changes' case you should look into perfect hash generators. If you know your keys at compile time, I've had good results with gperf. If your keys are not available until run time, try the C Minimal Perfect Hashing Library (CMPH).
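Once a generator has produced a perfect hash for your fixed key set, the lookup path collapses to one hash plus at most one string comparison. A hedged sketch of that shape, where phash(), ptable and NKEYS are hypothetical placeholders for whatever gperf or CMPH emits for your keys:

```c
#include <stddef.h>
#include <string.h>

#define NKEYS 512                 /* illustrative: number of keys in the set */

/* phash() stands in for the generated perfect hash function; for the
 * keys it was built from, it returns a distinct index in [0, NKEYS). */
extern unsigned phash(const char *key, size_t len);

struct pentry {
    const char *key;              /* stored so unknown keys can be rejected */
    void       *value;
};
extern struct pentry ptable[NKEYS];

void *perfect_lookup(const char *key)
{
    unsigned i = phash(key, strlen(key));
    /* One probe, no chain, no resize: the generator guarantees that keys
     * from the original set never collide with each other. */
    if (i < NKEYS && ptable[i].key && strcmp(ptable[i].key, key) == 0)
        return ptable[i].value;
    return NULL;                  /* key was not in the generated set */
}
```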

Those sets that are small (tens of elements) might be fastest using a binary or even linear search over the keys stored in sequential memory!
Obviously the key bodies (or hashes of them) have to be in sequential memory too. But if you can get that into one or two L1 cache lines, it'll fly.
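A hedged sketch of that small-set idea in C, with illustrative names: precomputed key hashes packed into one small contiguous array and scanned linearly, with the full key compared only on a hash match.

```c
#include <stdint.h>
#include <string.h>

/* A tiny "table": parallel arrays, with the hashes kept contiguous so
 * that, for a few dozen entries, the whole scan stays in one or two
 * cache lines. */
struct small_set {
    uint32_t    hashes[32];   /* precomputed key hashes, packed together */
    const char *keys[32];     /* full keys, touched only on a hash match */
    void       *values[32];
    size_t      count;
};

void *small_set_lookup(const struct small_set *s, const char *key, uint32_t h)
{
    for (size_t i = 0; i < s->count; i++) {
        /* Linear scan over the packed hashes: sequential and
         * prefetch-friendly. */
        if (s->hashes[i] == h && strcmp(s->keys[i], key) == 0)
            return s->values[i];
    }
    return NULL;
}
```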
As for the bigger hashes, the direct chaining might lose out to open addressing?
You could explore "cache conscious" hash tables and tries.
The Wikipedia article discusses cache lines in detail, describing the various trade-offs to consider.
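To illustrate the open-addressing alternative mentioned above, a minimal linear-probing lookup sketch in C (struct names are made up; the table size is assumed to be a power of two and never completely full):

```c
#include <stddef.h>
#include <string.h>

struct oa_entry {
    const char *key;     /* NULL marks an empty slot */
    void       *value;
};

struct oa_table {
    struct oa_entry *slots;
    size_t           mask;    /* nslots - 1, with nslots a power of two */
};

void *oa_lookup(const struct oa_table *t, const char *key, size_t h)
{
    /* Probe consecutive slots starting at the hash position; the entries
     * sit in one contiguous array, so probing stays cache friendly.
     * (Assumes the load factor is kept below 1, so an empty slot is
     * always reached for a missing key.) */
    for (size_t i = h & t->mask; t->slots[i].key != NULL; i = (i + 1) & t->mask) {
        if (strcmp(t->slots[i].key, key) == 0)
            return t->slots[i].value;
    }
    return NULL;   /* hit an empty slot: key is not present */
}
```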

Related

Hashmap, hashtable, map or any other method

I am looking to compare two values (e.g. whether one is greater than or less than the other) stored in a HashMap, Hashtable, Map, or any other array-like type.
Could you please help me with this?
Here are some factors that would affect your selection of a data structure:
What is the purpose of the comparison?
What type of data are you comparing?
How often will data be inserted into this data structure?
How often will data be selected from this data structure?
When should you use a HashMap?
Use a HashMap when your main requirement is only retrieving or modifying data based on a key. For example, in web applications the username is stored as a key and the user data is stored as a value in the HashMap, for faster retrieval of the user data corresponding to a username.
When should you use a tree-based map (e.g. std::map) instead of a hash table?
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software);
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory is used), most implementations will outgrow the array they're using every now and then, then allocate a bigger array and copy the content across; this can make the specific insertions that trigger rehashing much slower than the normal O(1) behaviour, even though the average is still O(1) - if you need more consistent behaviour in all cases, something like a balanced binary tree may serve better;
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket-sorted elements), even if you're not exactly relying on the sort order for e.g. iteration.

Why does Lucene use arrays instead of hash tables for its inverted index?

I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?
Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer O(log N) access and (gasp) O(N) insertion, although merging two sorted arrays is the same complexity as merging two hash tables.
The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).
So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?
Well, I will speculate here (should probably be a comment - but it's going to be too long).
HashMap is in general a fast look-up structure with O(1) search time - meaning it's constant. But that is the average case; since (at least in Java) a HashMap converts large buckets to TreeNodes, the search is O(log n) inside such a bucket. And even if we treat their search complexity as O(1), that does not mean it is the same in wall-clock time; it just means it is constant for each separate data structure.
Memory: indeed. I will give an example here: in short, storing 15,000,000 entries in a HashMap would require a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives instead of objects.
Putting entries in a HashMap (usually) requires all the keys to be re-hashed at some point, which could be a significant performance hit, since they all potentially have to move to different locations.
Probably one extra point here: range searches would probably require some kind of TreeMap, whereas arrays are much better suited for that. I'm thinking about partitioning an index (maybe they do it internally).
I have the same idea as you - arrays are usually contiguous memory, probably much easier to be pre-fetched by a CPU.
And the last point: if I were in their shoes, I would have started with a HashMap first... I am sure there are compelling reasons for their decision. I wonder whether they have actual tests that prove this choice.
I was thinking of the reasoning behind it. Just thought of one use-case that was important in the context of text search. I could be totally wrong :)
Why sorted array and not Dictionary?
Yes, it performs well on range queries, but IMO Lucene was mainly built for text searches. Now imagine you were to do a prefix-based query, e.g. country:Ind*: you would need to scan the whole HashMap/Dictionary, whereas this becomes O(log n) if you have a sorted array.
Since we have a sorted array, it would be inefficient to update the array. Hence, in Lucene, segments (the inverted index resides in segments) are immutable.
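To make the prefix-query point concrete, a hedged C sketch (data and function names are purely illustrative): binary-search for the first term that is >= the prefix, then read matches off sequentially - something a hash table cannot offer, because it stores terms in no useful order.

```c
#include <stdio.h>
#include <string.h>

/* Return the index of the first term >= prefix (lower bound) in a sorted
 * array of terms; all terms starting with the prefix then sit
 * consecutively from that index onwards. */
static size_t lower_bound(const char *const *terms, size_t n, const char *prefix)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(terms[mid], prefix) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int main(void)
{
    /* A tiny sorted "dictionary" (illustrative data). */
    const char *terms[] = {"iceland", "india", "indonesia", "iran", "italy"};
    size_t n = sizeof terms / sizeof terms[0];
    const char *prefix = "ind";
    size_t plen = strlen(prefix);

    for (size_t i = lower_bound(terms, n, prefix);
         i < n && strncmp(terms[i], prefix, plen) == 0; i++)
        printf("%s\n", terms[i]);   /* prints: india, indonesia */
    return 0;
}
```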

Expected performance of tries vs bucket arrays with constant load-factor

I know that I can simply use a bucket array for an associative container if I have uniformly distributed integer keys, or keys that can be mapped into uniformly distributed integers. If I can create the array big enough to ensure a certain load factor (which assumes the collection is not too dynamic), then the expected number of collisions for a key will be bounded, because this is simply a hash table with the identity hash function.
Edit: I view strings as equivalent to positional fractions in the range [0..1]. So they can be mapped into any integer range by multiplication and taking floor of the result.
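A hedged C sketch of that mapping, purely illustrative: the first few bytes of the string are read as digits of a base-256 fraction in [0, 1), which preserves lexicographic order and can then be scaled to any bucket range.

```c
#include <stddef.h>

/* Map a string to a bucket index in [0, nbuckets) by treating its first
 * bytes (only the first 6, for the sketch) as a base-256 fraction in
 * [0, 1).  The mapping is monotone in lexicographic order, which is what
 * makes prefix/range queries on the bucket array possible. */
size_t bucket_of(const char *s, size_t nbuckets)
{
    double f = 0.0, scale = 1.0 / 256.0;
    for (int i = 0; i < 6 && s[i] != '\0'; i++, scale /= 256.0)
        f += (unsigned char)s[i] * scale;
    return (size_t)(f * (double)nbuckets);   /* f < 1, so index < nbuckets */
}
```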
I can also do prefix queries efficiently, just like with tries. I presume (without knowing a proof) that the expected number of empty slots corresponding to a given prefix that have to be skipped sequentially before the first bucket with at least one element is reached is also going to be bounded by a constant (again depending on the chosen load factor).
And of course, I can do stabbing queries in worst-case constant time, and range queries in solely output sensitive linear expected time (if the conjecture of denseness from the previous paragraph is indeed true).
What are the advantages of tries then?
If the distribution is uniform, I don't see anything that tries do better. But I may be wrong.
If the distribution has a large uncompensated skew (because we had no prior probabilities, or we are just looking at the worst case), the bucket array performs poorly, but tries also become heavily imbalanced and can have linear worst-case performance with strings of arbitrary length. So the use of either structure for your data is questionable.
So my question is - what are the performance advantages of tries over bucket arrays that can be formally demonstrated? What kind of distributions elicit those advantages?
I was thinking of distributions with self-similar structure at different scales. I believe those are called fractal distributions, of which I confess to know nothing. Maybe then, if the distribution is prone to clustering at every scale, tries can provide superior performance by keeping the load factor of each node similar, adding levels in dense regions as necessary - something that bucket arrays cannot do.
Thanks
Tries are good if your strings share common prefixes. In that case, the prefix is stored only once and can be queried with linear performance in the output string length. In a bucket array, all strings with the same prefixes would end up close together in your key space, so you have very skewed load where most buckets are empty and some are huge.
More generally, tries are also good if particular patterns (e.g. the letters t and h together) occur often. If there are many such patterns, the order of the trie's tree nodes will typically be small, and little storage is wasted.
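A minimal C sketch of the shared-prefix point (node layout is illustrative; lower-case ASCII keys assumed, and real tries usually compress single-child chains): each node stores one character of the key path, so a common prefix such as "th" exists only once no matter how many keys contain it.

```c
#include <stdlib.h>

#define ALPHABET 26   /* lower-case ASCII keys, for the sketch */

struct trie_node {
    struct trie_node *child[ALPHABET];
    void             *value;          /* non-NULL iff a key ends here */
};

/* Insert key into the trie; the prefix shared with previously inserted
 * keys is walked, not duplicated -- only the divergent suffix allocates
 * new nodes. */
static void trie_insert(struct trie_node *root, const char *key, void *value)
{
    struct trie_node *n = root;
    for (; *key; key++) {
        int c = *key - 'a';
        if (n->child[c] == NULL)
            n->child[c] = calloc(1, sizeof *n->child[c]);
        n = n->child[c];
    }
    n->value = value;
}
```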
One of the advantages of tries I can think of is insertion: a bucket array may need to be resized at some point, and this is an expensive operation, so the worst-case insertion time into a trie is much better than into a bucket array.
Another thing is that you need to map each string to a fraction to use it with bucket arrays. So if you have short keys, a trie can theoretically be more efficient, because you don't need to do the mapping.

C: Storing up to a million entries in a hash table

I'm working on a project where efficiency is crucial. A hash table would be very helpful since I need to easily look up the memory address of a node based on a key. The only problem I foresee is that this hash table will need to handle up to 1 million entries. As I understand it, hash table buckets are usually linked lists so that they can handle multiple entries in the same bucket. It seems to me that with a million entries these lists would be way too slow. What is the common way of implementing something like this? Maybe swapping a standard linked list out for a skip list?
If you want a hash table with a million entries, normally you'd have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of the buckets will have zero or one items. You can, in principle, be very unlucky and get all items in one bucket - but you'd have to be even more unlucky than those people who seem to get struck by lightning every other day.
For hashtables that grow, the normal trick is to grow by a constant percentage - the usual textbook case being growth by doubling the hash-table size. You do this whenever the number of items in the hashtable reaches a certain proportion of the hashtable size, irrespective of how many buckets are actually being used. This gives amortized expected performance of O(1) for inserts, deletes and searches.
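A hedged sketch of that doubling step for a chained table in C (textbook-style structs, not from any particular library): individual resizes are O(n), but doubling keeps the amortized insert cost O(1).

```c
#include <stdlib.h>

struct entry {
    const char   *key;
    void         *value;
    size_t        hash;    /* cached full hash, so rehashing needs no recompute */
    struct entry *next;
};

struct table {
    struct entry **slots;
    size_t         nslots;   /* kept a power of two */
    size_t         count;
};

/* Called when count exceeds, say, 3/4 of nslots: allocate a bucket array
 * twice as large and relink every entry into its new bucket. */
static int table_grow(struct table *t)
{
    size_t         newn  = t->nslots * 2;
    struct entry **slots = calloc(newn, sizeof *slots);
    if (slots == NULL)
        return -1;

    for (size_t i = 0; i < t->nslots; i++) {
        struct entry *e = t->slots[i];
        while (e != NULL) {
            struct entry *next = e->next;
            size_t j = e->hash & (newn - 1);   /* new bucket index */
            e->next  = slots[j];
            slots[j] = e;
            e        = next;
        }
    }
    free(t->slots);
    t->slots  = slots;
    t->nslots = newn;
    return 0;
}
```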
The linked list in each bucket of a hash-table is just a way of handling collisions - improbable in a per-operation sense, but over the life of a significant hash table, they do happen - especially as the hash-table gets more than half full.
Linked lists aren't the only way to handle collisions - there's a huge amount of lore about this topic. Walter Bright (developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his Dscript gained a significant performance boost relative to Javascript from this design choice.
He used simple (unbalanced) binary trees when I asked, so the worst-case performance was the same as for linked lists, but the key point I guess is that the binary tree handling code is simple, and the hash table itself makes the odds of building large unbalanced trees very small.
In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option may be to use splay trees for collision handling. But overall, this is a minor issue for a few library designers and a few true obsessives to worry about.
You lose all the advantages of a hash table if the per-bucket lists ever have more than a few entries. The usual way to make a hash table scale to millions of entries is to make the primary hash array resizable, so even with millions of entries, the bucket lists stay short.
You can use a Tree instead of a List in the individual "buckets". (AVL or similar)
EDIT: well, Skip List would do too. (and seems to be faster) - O(log n) is what you aim for.
The total number of entries does not matter, only the average number of entries per bucket (N / number of buckets). Use a hash function with a larger output range (for example, 20 bits, or even larger) to ensure that.
Of course, this will take up more memory, but that's it, it's a common memory vs speed tradeoff.
Not sure if this will help you or not, but maybe: http://memcached.org/
If your keys are uniformly distributed over the buckets (that's a very big IF), then the expected number of insertions into the hashtable needed to exhaust all the buckets is M * ln M (natural log, base e), where M is the number of buckets.
I was surprised I couldn't find this easily online!
I have posted the derivation on my blog, and verified it with code using rand(). It does seem to be a pretty good estimate.
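For reference, that estimate is the standard coupon-collector expectation, assuming each insertion lands in each of the M buckets with equal probability; a compact version of the calculation:

```latex
% Expected number of insertions T to hit every one of the M buckets
% at least once: when k buckets are already occupied, the next new
% bucket is hit with probability (M-k)/M, so the expected wait for it
% is M/(M-k) insertions.
E[T] = \sum_{k=0}^{M-1} \frac{M}{M-k}
     = M \sum_{j=1}^{M} \frac{1}{j}
     = M H_M \approx M \ln M + \gamma M
```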

When to switch from unordered lists to sorted lists? [optimization]

I have to implement an algorithm to decompose 3D volumes into voxels. The algorithm starts by identifying which vertices are on each side of the cutting plane and, in a second step, which edges traverse the cutting plane.
This process could be optimized by taking advantage of sorted lists: identifying the split point is O(log n). But I have to maintain one such sorted list per axis, and this for both vertices and edges. Since this is to be implemented for a GPU (CUDA), I also have some constraints on memory management. Intrusive lists/trees and C are imposed.
With a complete "voxelization" I expect to end up with ~4000 points and 12000 edges. Fortunately this can be optimized by using a smarter strategy to get rid of processed voxels and ordering the cutting of residual volumes to keep their number to a minimum. In this case I would expect to have fewer than 100 points and 300 edges. This makes the process more complex to manage but could end up being more efficient.
The question is thus to help me identify the criteria to determine when the benefit of using a sorted data structure is worth the effort and complexity overhead compared to simple intrusive linked lists.
chmike, this really sounds like the sort of thing you want to do first the simpler way, and see how it behaves. Any sort of GPU voxelization approach is pretty fragile to system details once you get into big volumes at least (which you don't seem to have). In your shoes I'd definitely want the straightforward implementation first, if for no other reason than to check against...
The question will ALWAYS boil down to which operation is most common: accessing or adding.
If you have an unordered list, adding to it takes no time, and accessing particular items takes extra time.
If you have a sorted list, adding to it takes more time, but accessing it is quicker.
Most applications spend most of their time accessing the data rather than adding to it, which means that the (running) time overhead of creating a sorted list is usually balanced or covered by the time saved in accessing the list.
If there is a lot of churn in your data (which it doesn't sound like there is), then maintaining a sorted list isn't necessarily advisable, because you will be constantly re-sorting the list at considerable CPU cost.
The complexity of the data structures only matters if they cannot be sorted in a useful way. If they can be sorted, then you'll have to go by the heuristic of
number of accesses:number of changes
to determine if sorting is a good idea.
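A back-of-the-envelope way to apply that heuristic in C, with purely illustrative cost constants: compare the expected work of keeping the list sorted against leaving it unsorted, given the expected numbers of accesses and changes.

```c
#include <math.h>
#include <stdio.h>

/* Rough cost model (unit = one element visit), illustrative only:
 *   unsorted list: change ~1, access ~n/2 (linear scan)
 *   sorted list:   change ~n/2 (find spot + relink), access ~log2 n
 * Returns 1 if keeping the list sorted looks cheaper. */
static int prefer_sorted(double n, double accesses, double changes)
{
    double unsorted = changes * 1.0       + accesses * (n / 2.0);
    double sorted   = changes * (n / 2.0) + accesses * log2(n);
    return sorted < unsorted;
}

int main(void)
{
    /* Example in the question's ballpark: ~300 edges, mostly lookups. */
    printf("%d\n", prefer_sorted(300.0, 10000.0, 300.0));  /* prints 1 */
    return 0;
}
```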
After considering all answers I found out that the latter method, used to avoid duplicate computation, would end up being less efficient because of the effort needed to maintain and navigate the data structure. Besides, the initial method is straightforward to parallelize with a few small kernel routines and thus more appropriate for a GPU implementation.
Checking back on my initial method I also found significant optimization opportunities that leave the volume-cut method well behind.
Since I had to pick one answer I chose devinb because he answered the question, but Simon's comment, backed up by Tobias Warre's comment, was just as valuable to me.
Thanks to all of you for helping me sort out this issue.
Stack Overflow is an impressive service.

Resources