I'm implementing a hash table that handles collisions with Robin Hood hashing. However, I previously had chaining instead, and the process of inserting almost 1 million keys was pretty much instantaneous. The same doesn't happen with Robin Hood hashing, which I found strange since I had the impression it was much quicker. So what I want to ask is whether my insertion function is properly implemented. Here's the code:
typedef struct hashNode {
    char *word;
    int freq;               // not used in the insertion
    int probe;              // distance from the calculated index to the index where it was actually inserted
    struct hashNode *next;  // not used in the insertion
    struct hashNode *base;  // not used in the insertion
} hashNode;
typedef hashNode *hash_ptr;

hash_ptr hashTable[NUM_WORDS] = {NULL}; // NUM_WORDS = 1000000
                                        // Number of actual entries = 994707
hash_ptr swap(hash_ptr node, int index) {
    hash_ptr temp = hashTable[index];
    hashTable[index] = node;
    return temp;
}
static void insertion(hash_ptr node, int index) {
    while (hashTable[index]) {
        if (node->probe > hashTable[index]->probe) {
            node = swap(node, index);
        }
        node->probe++;
        index++;
        if (index > NUM_WORDS) index = 0;
    }
    hashTable[index] = node;
}
To contextualize everything:
The node parameter is the new entry.
The index parameter is where the new entry will go, if that slot isn't occupied.
The Robin Hood algorithm is very clever, but it is just as dependent on having a good hash function as any other open-addressing technique.
As a worst case, consider the worst possible hash function:
int hash(const char* key) { return 0; }
Since this will map every item to the same slot, it is easy to see that the total number of probes is quadratic in the number of entries: the first insert succeeds on the first probe; the second insert requires two probes; the third one three probes; and so on, leading to a total of n(n+1)/2 probes for n inserts. This is true whether you use simple linear probing or Robin Hood probing.
Interestingly, this hash function might have no impact whatsoever on insertion into a chained hash table if -- and this is a very big if -- no attempt is made to verify that the inserted element is unique. (This is the case in the code you present, and it's not totally unreasonable; it is quite possible that the hash table is being built as a fixed lookup table and it is already known that the entries to be added are unique. More on this point later.)
In the chained hash implementation, the non-verifying insert function might look like this:
void insert(hashNode *node, int index) {
    node->next = hashTable[index];
    hashTable[index] = node;
}
Note that there is no good reason to use a doubly-linked list for a hash chain, even if you are planning to implement deletion. The extra link is just a waste of memory and cycles.
The fact that you can build the chained hash table in (practically) no time at all does not imply that the algorithm has built a good hash table. When it comes time to look a value up in the table, the problem will be discovered: the average number of probes to find the element will be half the number of elements in the table. The Robin Hood (or linear) open-addressed hash table has exactly the same performance, because all searches start at the beginning of the table. The fact that the open-addressed hash table was also slow to build is probably almost irrelevant compared to the cost of using the table.
We don't need a hash function quite as terrible as the "always use 0" function to produce quadratic performance. It's sufficient for the hash function to have an extremely small range of possible values (compared with the size of the hash table). If the possible values are equally likely, the chained hash will still be quadratic, but the average chain length will be divided by the number of possible values. That's not the case for the linear/Robin Hood probed hash, though, particularly if all the possible hash values are concentrated in a small range. Suppose, for example, the hash function is
int hash(const char *key) {
    unsigned char h = 0;
    while (*key) h += *key++;
    return h;
}
Here, the range of the hash is limited to [0, 256). If the table size is much larger than 256, this will rapidly reduce to the same situation as the constant hash function. Very soon the first 256 entries in the hash table will be filled, and every insert (or lookup) after that point will require a linear search over a linearly-increasing compact range at the beginning of the table. So the performance will be indistinguishable from the performance of the table with a constant hash function.
None of this is intended to motivate the use of chained hash tables. Rather, it is pointing to the importance of using a good hash function. (Good in the sense that the result of hashing a key is uniformly distributed over the entire range of possible node positions.) Nonetheless, it is generally the case that clever open-addressing schemes are more sensitive to bad hash functions than simple chaining.
Open-addressing schemes are definitely attractive, particularly in the case of static lookup tables, because deletion can really be a pain; not having to implement key deletion removes a huge complication. The most common solution for deletion is to replace the deleted element with a DELETED marker element. Lookup probes must still skip over the DELETED markers, but if the lookup is going to be followed by an insertion, the first DELETED marker can be remembered during the lookup scan and overwritten by the inserted node if the key is not found. That works acceptably, but the load factor has to be calculated with the expected number of DELETED markers, and if the usage pattern sometimes successively deletes a lot of elements, the real load factor for the table will go down significantly.
In the case where deletion is not an issue, though, open-addressed hash tables have some important advantages. In particular, they are much lower overhead in the case that the payload (the key and associated value, if any) is small. In the case of a chained hash table, every node must contain a next pointer, and the hash table index must be a vector of pointers to node chains. For a hash table whose key occupies only the space of a pointer, the overhead is 100%, which means that a linear-probed open-addressed hash table with a load factor of 50% occupies a little less space than a chained table whose index vector is fully occupied and whose nodes are allocated on demand.
Not only is the linear probed table more storage efficient, it also provides much better reference locality, which means that the CPU's RAM caches will be used to much greater advantage. With linear probing, it might be possible to do eight probes using a single cacheline (and thus only one slow memory reference), which could be almost eight times as fast as probing through a linked list of randomly allocated table entries. (In practice, the speed up won't be this extreme, but it could well be more than twice as fast.) For string keys in cases where performance really matters, you might think about storing the length and/or the first few characters of the key in the hash entry itself, so that the pointer to the full character string is mostly only used once, to verify the successful probe.
But both the space and time benefits of open addressing are dependent on the hash table being an array of entries, not an array of pointers to entries as in your implementation. Putting the entries directly into the hash index avoids the possibly-significant overhead of a pointer per entry (or at least per chain), and permits the efficient use of memory caches. So that's something you might want to think about in your final implementation.
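For illustration, here is a sketch of what such an entry-array layout might look like (the field names, and the cached hash discussed further below, are my additions, not part of your code):

#include <stdint.h>

#define TABLE_SIZE (1u << 20)

typedef struct {
    uint32_t hash;      /* cached hash of the key; a sentinel such as 0 can mark an empty slot */
    int freq;
    char prefix[8];     /* first bytes of the key, checked before chasing the pointer */
    const char *word;   /* full key, dereferenced only to confirm a probable match */
} hashEntry;

hashEntry table[TABLE_SIZE];   /* entries stored inline: no per-entry allocation, and
                                  successive probes touch adjacent cache lines */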
Finally, it's not necessarily the case that open addressing makes deletion complicated. In a cuckoo hash (and the various algorithms which it has inspired in recent years), deletion is no more difficult than deletion in a chained hash, and possibly even easier. In a cuckoo hash, any given key can only be in one of two places in the table (or, in some variants, one of k places for some small constant k) and a lookup operation only needs to examine those two places. (Insertion can take longer, but it is still expected O(1) for load factors less than 50%.) So you can delete an entry simply by removing it from where it is; that will have no noticeable effect on lookup/insertion speed, and the slot will be transparently reused without any further intervention being necessary. (On the down side, the two possible locations for a node are not adjacent and they are likely to be on separate cache lines. But there are only two of them for a given lookup. Some variations have better locality of reference.)
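A sketch of a cuckoo lookup, reusing the table from the question; hash1 and hash2 stand for two independent hash functions, which are assumptions of mine, not part of your code:

#include <string.h>

extern unsigned hash1(const char *key), hash2(const char *key);  /* assumed to exist */

hash_ptr cuckoo_lookup(const char *key) {
    size_t i1 = hash1(key) % NUM_WORDS;       /* first possible home */
    if (hashTable[i1] && strcmp(hashTable[i1]->word, key) == 0)
        return hashTable[i1];
    size_t i2 = hash2(key) % NUM_WORDS;       /* second possible home */
    if (hashTable[i2] && strcmp(hashTable[i2]->word, key) == 0)
        return hashTable[i2];
    return NULL;                              /* in a cuckoo table, nowhere else to look */
}
/* Deletion: set the slot that held the key back to NULL; nothing else moves. */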
A few last comments on your Robin Hood implementation:
I'm not totally convinced that a 99.5% load factor is reasonable. Maybe it's OK, but the difference between 99% and 99.5% is so tiny that there is no obvious reason to tempt fate. Also, the rather slow remainder operation during the hash computation could be eliminated by making the size of the table a power of two (in this case 1,048,576) and computing the remainder with a bit mask. The end result might well be noticeably faster.
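For example, with the table size rounded up to 2^20 (here hash is the unsigned value returned by your hash function):

#define TABLE_SIZE (1u << 20)               /* 1,048,576 */
size_t index = hash & (TABLE_SIZE - 1);     /* same result as hash % TABLE_SIZE for an
                                               unsigned hash, but a single AND instruction */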
Caching the probe count in the hash entry does work (in spite of my earlier doubts) but I still believe that the suggested approach of caching the hash value instead is superior. You can easily compute the probe distance; it's the difference between the current index in the search loop and the index computed from the cached hash value (or the cached starting index location itself, if that's what you choose to cache). That computation does not require any modification to the hash table entry, so it's cache friendlier and slightly faster, and it doesn't require any more space. (But either way, there is a storage overhead, which also reduces cache friendliness.)
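A sketch of that computation, assuming the power-of-two table size above and a cached 32-bit hash:

#include <stdint.h>

size_t probe_distance(size_t index, uint32_t cached_hash) {
    size_t home = cached_hash & (TABLE_SIZE - 1);   /* the slot the key hashes to */
    return (index - home) & (TABLE_SIZE - 1);       /* unsigned arithmetic wraps correctly */
}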
Finally, as noted in a comment, you have an off-by-one error in your wraparound code; it should be
if(index >= NUM_WORDS) index = 0;
With the strict greater-than test as written, your next iteration will try to use the entry at index NUM_WORDS, which is out of bounds.
Just to leave it here: the 99% fill rate is not reasonable. Neither is 95%, nor 90%. I know they said it in the paper; they are wrong. Very wrong. Use 60%-80%, like you should with open addressing.
Robin Hood hashing does not change the number of collisions when you are inserting; the average (and the total) number of collisions remain the same. Only their distribution changes: Robin Hood improves the worst cases. But for the averages it's the same as linear, quadratic or double hashing.
at 75% you get about 1.5 collisions before a hit
at 80% about 2 collisions
at 90% about 4.5 collisions
at 95% about 9 collisions
at 99% about 30 collisions
I tested on a 10000 element random table. Robin Hood hashing cannot change this average, but it improves the 1% worst case number of collisions from having 150-250 misses (at 95% fill rate) to having about 30-40.
Related
I was wondering about a simple data structure for a set with O(1) lookup time, say for detecting duplicate values in an unsorted linked list.
The best I can come up with is a bool array, wherein the index stands for the value of the number. But this can have a very high space cost depending on the range of values. A red-black tree gives O(log n) time complexity.
Is there an alternative method, hash-table implementation of some kind, that can help me here?
The simpler the better.
You have an inherent space vs. time tradeoff here. To ensure at most O(1) operations are required to test set membership, you need a data structure of at least O(n) size. An array of bool could do it, or you could build a bitset out of an array of, say, unsigned int (I have done this for sets reaching a few thousand members). If you expect the sets to be sparsely populated relative to the range of their elements' values, then a hash table could keep you at the O(n) space level (whereas the space required for an array-based solution scales with the element range).
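A bitset along those lines is only a few lines of C; a minimal sketch (RANGE is whatever upper bound your values have, a hypothetical constant here):

#include <limits.h>

#define RANGE 1000000                        /* hypothetical maximum element value */
#define WORD_BITS (CHAR_BIT * sizeof(unsigned int))

static unsigned int bits[(RANGE + WORD_BITS - 1) / WORD_BITS];   /* zero-initialized */

void set_add(unsigned v)      { bits[v / WORD_BITS] |= 1u << (v % WORD_BITS); }
int  set_contains(unsigned v) { return (bits[v / WORD_BITS] >> (v % WORD_BITS)) & 1u; }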
In theory, every set-of-int implementation is going to have O(1) lookup time. This is because there are a finite number of distinct int values, so there is an upper bound to the size of your set.
So even if the lookup time for a tree is O(log N), in the case of integers that N has a maximum value, say N <= k. Since log k is a constant, your operation has a constant upper bound on its lookup time. That is to say: no matter how slow your algorithm is, it's still faster than it would be on INT_MAX + 1 values.
In my experience, when people ask for constant-time set lookup, they really just want hashes. This effectively reduces the size of k (at the cost of memory). Your bool array idea is an extreme case, reducing k to 1.
Maybe what you want is just a fast set implementation? If this is for academic purposes, then I'd suggest finding out what your professor wants.
I want to implement something along the lines of a B-tree to index some data using variable-length keys, where I expect that each node in the tree will look something like this:
struct key_block {
    block_ptr parent;        // link back up the tree to the parent
    unsigned numkeys;        // number of keys currently used by this block
    struct {
        block_ptr child;     // points to the child immediately preceding this key
        struct {
            unsigned length; // how long this key is
            unsigned offset; // where the data for this key is
        } key;               // support for variable length keys
        data_ptr content;    // ptr to the data indexed by this key
    } entries[];             // as many entries as will fit on a disk block
};  // the last entry will be followed by another block_ptr, which is the right-hand child of the last node
What I intend to do is store the actual key data in the same disk block as the node itself, positioned just after the final key and its right-hand child within the node. The offset and length fields in each key will indicate how far from the start of the current block the actual data for that key begins and how long it runs.
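For illustration (my sketch, using the struct above and assuming the whole disk block has been read into memory), fetching the bytes of key i is then just offset arithmetic from the start of the block:

const char *key_data(const struct key_block *blk, unsigned i) {
    /* offset is relative to the start of the disk block */
    return (const char *)blk + blk->entries[i].key.offset;
}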
However, I want to use a fixed size of disk block for my storage, and since I want to store the variable-length keys inside the same block as the node, the maximum number of keys that can fit in a node depends on the length of the keys in that node. This sort of contradicts my understanding of the way a B-tree generally works, where all nodes have a fixed maximum number of entries, and I am not sure whether I can really implement this using a B-tree at all, because I am violating that typical invariant.
So should I even be looking at using a B-tree structure? If not, what other alternatives exist for very fast external searching, insertion, and deletion? In particular, a key criterion for any solution is that it must scale to very, very large numbers of entries while remaining efficient for searching, insertion, and deletion (and B-trees perform adequately on this front).
If I can still use a B-tree, how would the algorithm be affected when I no longer have an invariant maximum number of keys, but instead the maximum depends on the content of each individual node itself?
There is no fundamental issue with a variable maximum number of keys in a B-tree. However, a B-tree does depend on some minimum and maximum number of keys in each node. If you have a fixed number of keys per node, then this is easy (usually N/2 to N keys). Because you allow a variable number, you will need to determine a heuristic for balancing the tree. The better the heuristic, the more optimal the performance.
Fortunately, the issue will merely be performance. The shape of the B-tree has several invariants, but none of them are affected by your variable number of keys, so you will still be able to search. It just might be a poorly-balancing structure if you choose a bad heuristic.
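For example, one plausible heuristic (my sketch, not something the classic algorithm prescribes) is to drive split and merge decisions by the bytes a node consumes rather than by a key count:

#include <stddef.h>

unsigned bytes_used(const struct key_block *blk) {
    unsigned total = offsetof(struct key_block, entries)    /* block header       */
                   + blk->numkeys * sizeof blk->entries[0]  /* fixed-size entries */
                   + sizeof(block_ptr);                     /* trailing right-hand child */
    for (unsigned i = 0; i < blk->numkeys; i++)
        total += blk->entries[i].key.length;                /* inline key bytes   */
    return total;
}
/* Split a node when bytes_used() would exceed the block size; treat it as
   underfull when it drops below, say, half the block. */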
Hashes provide an excellent mechanism to extract values corresponding to some given key in almost O(1) time, but they never preserve the order in which the keys were inserted. So is there any data structure which can simulate the best of both an array and a hash, that is, return the value corresponding to a given key in O(1) time, as well as return the nth value inserted in O(1) time? The ordering should be maintained, i.e., if the hash is {a:1,b:2,c:3} and something like del hash[b] has been done, nth(2) should return {c,3}.
Examples:
hash = {};
hash[a] = 1;
hash[b] = 2;
hash[c] = 3;
nth(2); //should return 2
hash[d] = 4;
del hash[c];
nth(3); //should return 4, as 'd' has been shifted up
Using modules like TIE::Hash or similar stuff won't do, the onus is on me to develop it from scratch!
It depends on how much memory may be allocated for this data structure. For O(N) space there are several choices:
It's easy to get a data structure with O(1) time for each of these operations: "get value by key", "get nth value inserted", "insert" - but only when "delete" takes O(N) time. Just use a combination of a hash map and an array, as explained by ppeterka.
Less obvious, but still simple is O(sqrt N) for "delete" and O(1) for all other operations.
A little bit more complicated is to "delete" in O(N^(1/4)), O(N^(1/6)), or, in the general case, in O(M*N^(1/M)) time.
It's most likely impossible to decrease "delete" time to O(log N) while retaining O(1) for the other operations. But it is possible if you agree to O(log N) time for every operation; solutions based on a binary search tree or on a skip list allow it. One option is an order statistics tree: augment every node of a binary search tree with a counter storing the number of elements in the subtree under that node, then use it to find the nth node. Another option is an indexable skiplist. One more option is the O(M*N^(1/M)) solution with M = log(N).
And I don't think you can get O(1) "delete" without increasing time for other operations even more.
If unlimited space is available, you can do every operation in O(1) time.
O(sqrt N) "delete"
You can use a combination of two data structures: one to find a value by key and one to find a value by its insertion order. The first is a hash map (mapping key to both value and a position in the other structure). The second is a tiered vector, which maps position to both value and key.
A tiered vector is a relatively simple data structure that may easily be developed from scratch. The main idea is to split the array into sqrt(N) smaller arrays, each of size sqrt(N). Each small array needs only O(sqrt N) time to shift values after a deletion. And since each small array is implemented as a circular buffer, adjacent small arrays can exchange a single element in O(1) time, which allows the "delete" operation to complete in O(sqrt N) time (one such exchange for each sub-array between the deleted value and the first/last sub-array). A tiered vector also allows insertion into the middle in O(sqrt N), but this problem does not require it, so we can just append a new element at the end in O(1) time. To access an element by its position, we determine the starting position of the circular buffer for the sub-array where the element is stored, then get the element from that circular buffer; this also needs only O(1) time.
Since the hash map remembers a position in the tiered vector for each of its keys, it must be updated whenever an element of the tiered vector changes position (O(sqrt N) hash map updates for each "delete").
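To make this concrete, here is a minimal sketch of a tiered vector of ints with a fixed capacity of B*B elements; a real implementation would also store the keys and update the hash map for every element that moves:

#define B 1024                           /* bucket count and bucket size, ~sqrt(capacity) */

typedef struct {
    int data[B];
    int start;                           /* circular-buffer head */
    int count;                           /* elements currently in this bucket */
} tier;

static tier tiers[B];
static int total = 0;

int tv_get(int pos) {                    /* O(1) access by position */
    tier *t = &tiers[pos / B];
    return t->data[(t->start + pos % B) % B];
}

void tv_append(int v) {                  /* O(1) insert at the end */
    tier *t = &tiers[total / B];
    t->data[(t->start + t->count) % B] = v;
    t->count++;
    total++;
}

void tv_delete(int pos) {                /* O(sqrt N) delete */
    int b = pos / B;
    tier *t = &tiers[b];
    /* shift within the bucket that loses an element */
    for (int i = pos % B; i < t->count - 1; i++)
        t->data[(t->start + i) % B] = t->data[(t->start + i + 1) % B];
    t->count--;
    /* rotate one element from each later bucket into its predecessor */
    for (int j = b + 1; j < B && tiers[j].count > 0; j++) {
        tier *src = &tiers[j], *dst = &tiers[j - 1];
        int front = src->data[src->start];               /* pop front of bucket j */
        src->start = (src->start + 1) % B;
        src->count--;
        dst->data[(dst->start + dst->count) % B] = front; /* push back of bucket j-1 */
        dst->count++;
    }
    total--;
}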
O(M*N^(1/M)) "delete"
To optimize "delete" operation even more, you can use approach, described in this answer. It deletes elements lazily and uses a trie to adjust element's position, taking into account deleted elements.
O(1) for every operation
You can use a combination of three data structures to do this. The first is a hash map (mapping key to both value and a position in the array). The second is an array, which maps position to both value and key. And the third is a bit set, with one bit for each element of the array.
The "insert" operation just appends one more element to the array's end and inserts it into the hash map.
The "delete" operation just unsets the corresponding bit in the bit set (which is initialized with every bit = 1) and deletes the corresponding entry from the hash map. (It does not move elements of the array or the bit set.) If, after a "delete", the bit set has more than some constant proportion of its elements deleted (like 10%), the whole data structure should be re-created from scratch (this allows O(1) amortized time).
"Find by key" is trivial; only the hash map is used here.
"Find by position" requires some pre-processing. Prepare a 2D array: one index is the position we search for; the other index is the current state of our data structure, i.e. the bit set reinterpreted as an integer. Calculate the population count for each prefix of every possible bit set, and store the prefix length, indexed by both population count and the bit set itself. Having this 2D array ready, you can perform the operation by first indexing by position and current "state" in this 2D array, then indexing into the array of values.
Time complexity for every operation is O(1) (for insert/delete it is O(1) amortized). Space complexity is O(N * 2^N).
In practice, using the whole bit set to index an array limits the allowed value of N to the pointer size (usually 64), and it is limited even more by available memory. To alleviate this, we can split both the array and the bit set into sub-arrays of size N/C, where C is some constant. Now we can use a smaller 2D array to find the nth element in each sub-array. And to find the nth element in the whole structure, we need an additional structure to record the number of valid elements in each sub-array. This is a structure of constant size C, so every operation on it is also O(1). This additional structure may be implemented as an array, but it is better to use some logarithmic-time structure like an indexable skiplist. After this modification, time complexity for every operation is still O(1); space complexity is O(N * 2^(N/C)).
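As a concrete illustration of that pre-processing, here is the lookup table for the sub-array variant with 16-bit sub-arrays (a size chosen so the table stays around 1 MiB; this is my sketch of the idea, not the only possible layout):

/* nth_set[s][r] = index of the (r+1)-th set bit of the 16-bit mask s.
   With 16-element sub-arrays the table occupies 2^16 * 16 bytes = 1 MiB. */
static unsigned char nth_set[1 << 16][16];

void build_rank_table(void) {
    for (unsigned s = 0; s < (1u << 16); s++) {
        unsigned r = 0;
        for (unsigned i = 0; i < 16; i++)
            if (s & (1u << i))
                nth_set[s][r++] = (unsigned char)i;
    }
}
/* Within one sub-array, "find by position" becomes nth_set[mask][pos]: O(1). */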
Now that the question is clear to me too (better late than never...), here are my proposals:
You could maintain two hashes: one with the keys, and one with the insert order. This, however, is very ugly and slow to maintain when deleting and when inserting in between. It would give the same almost-O(1) time needed to access the elements both ways.
You could use a hash for the keys and maintain an array for the insert order. This is a lot nicer than the two-hash approach; deleting is still not very fast, but I think still a lot quicker than with two hashes. It also gives true O(1) access to the nth element. A sketch follows.
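A minimal sketch of that hash-plus-array proposal; purely to keep it short, keys are assumed to be small integers so a plain array can stand in for the hash map:

#define MAX_KEYS 256                 /* toy bound for the sketch */

int value_of[MAX_KEYS];              /* "hash": key -> value                    */
int pos_of[MAX_KEYS];                /* key -> position in order[]; init to -1  */
int order[MAX_KEYS];                 /* insertion order: position -> key        */
int count = 0;

void put(int key, int value) {       /* O(1) insert */
    value_of[key] = value;
    pos_of[key] = count;
    order[count++] = key;
}

int nth(int n) {                     /* O(1), 1-based, matching nth(2) == 2 above */
    return value_of[order[n - 1]];
}

void del(int key) {                  /* O(N): close the gap in order[] */
    for (int i = pos_of[key]; i < count - 1; i++) {
        order[i] = order[i + 1];
        pos_of[order[i]] = i;
    }
    count--;
    pos_of[key] = -1;
}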
At first, I misunderstood the question, and gave a solution that gives O(1) key lookup, and O(n) lookup of nth element:
In Java, there is the LinkedHashMap for this particular task.
I think however that if someone finds this page, this might not be totally useless, so I leave it here...
There is no data structure that is O(1) for everything you cited. In particular, any data structure with random dynamic insertion/deletion in the middle AND sorted/indexed access cannot have maintenance time lower than O(log N): to maintain such a dynamic collection you have to rely either on the "less than" operator (binary, thus O(log2 N)) or on some computed organization (typically O(sqrt N), using sqrt(N) sub-arrays). Note that O(sqrt N) > O(log N).
So, no.
You might reach O(1) for everything, including keeping order, with a linked list plus a hash map; and if access is mostly sequential, you could cache nth(x) so as to access nth(x+/-1) in O(1).
I guess only a plain array will give you O(1); the best variant is to look for a solution which gives O(n) in the worst case. You can also use a really, really bad approach: using the key as an index into a plain array. I guess there is a way to transform any key into an index into a plain array.
std::string memoryMap[0x10000];
int key = 100;
std::string value = "Hello, World!";
memoryMap[key] = value;
I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to get O(1) lookups, which would be rendered impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:
unsigned long h = 5381;   /* djb2's customary initial value */
for (i = 0; i < len; i++)
    h = 33 * h + p[i];
To compress the resulting h into the range [0,n), I would like to simply do h % n, but I suspect that this will lead to a much higher probability of clashes, in a way that would essentially render my hash useless.
My question, then, is how can I hash either the string or the resulting hash so that the n elements give a relatively uniform distribution over [0,n)?
It's not enough to know n. Allocation of an item to a bucket is a function of the item itself so, if you want a perfect hash function (one item per bucket), you need to know the data.
In any case, if you're limiting the number of elements to a known n, lookup is already technically O(1): the upper bound is a constant based on n. This would be true even for a non-hash solution.
Your best bet is probably to just use the hash function you have and make each bucket a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly minimising the time taken.
Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.
If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h%n but that's going to take an awful lot of storage :-)
My advice is to go the good-enough hash with linked list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population but even the hash you have (which is ideal for all letters having the same frequency) should be adequate.
And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia [a]), you'll end up with more collisions.
But keep in mind that it really depends on the n that you're using.
If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that it (or a balanced binary tree) won't give you enough performance.
A case in point: we have some code which searches through problem dockets looking for names of people that left comments (so we can establish the last member on our team who responded). There's only ever about ten or so members in our team so we just use a sequential search for them - the performance improvement from using a faster data structure was deemed too much trouble.
[a] No offence intended. I just remember the humorous article a long time ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".
What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.
The gperf program generates C code for perfect hashes.
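For example, given an input file names.gperf along these lines (consult the gperf manual for the exact format and options; this sketch is from memory):

%%
alice
bob
carol
%%

then gperf names.gperf > perfect_hash.c emits a hash() function and an in_word_set() lookup tailored to exactly those keys.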
It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might fit your needs. Disclaimer: I've never used it.
The optimal algorithm for mapping n strings to the integers 1..n is to build a DFA where the accepting states map to the integers 1..n. (I'm sure someone here will step up with a fancy name for this... but in the end it's all DFA.) The size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).
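In its simplest form that is just a byte-at-a-time trie. A sketch of the lookup (construction omitted; the node layout is my own, not a standard API):

typedef struct trieNode {
    int id;                        /* 1..n if a key ends here, 0 otherwise */
    struct trieNode *next[256];    /* one edge per possible byte */
} trieNode;

int trie_lookup(const trieNode *root, const char *key) {
    const trieNode *node = root;
    for (; node && *key; key++)
        node = node->next[(unsigned char)*key];
    return node ? node->id : 0;    /* 0 means "not one of the n strings" */
}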
I'm looking for a data structure (or structures) that would allow me to keep an ordered list of integers, with no duplicates, and with indexes and values in the same range.
I need four main operations to be efficient, in rough order of importance:
taking the value from a given index
finding the index of a given value
inserting a value at a given index
deleting a value at a given index
Using an array I have 1 at O(1), but 2 is O(N) and insertion and deletions are expensive (O(N) as well, I believe).
A Linked List has O(1) insertion and deletion (once you have the node), but 1 and 2 are O(N) thus negating the gains.
I tried keeping two arrays a[index]=value and b[value]=index, which turn 1 and 2 into O(1) but turn 3 and 4 into even more costly operations.
Is there a data structure better suited for this?
I would use a red-black tree to map keys to values. This gives you O(log(n)) for 1, 3, 4. It also maintains the keys in sorted order.
For 2, I would use a hash table to map values to keys, which gives you O(1) performance. It also adds O(1) overhead for keeping the hash table updated when adding and deleting keys in the red-black tree.
How about using a sorted array with binary search?
Insertion and deletion are slow, but given the fact that the data are plain integers, they can be optimized with calls to memmove() if you are using C or C++ (note that memcpy() is not safe here, since the source and destination ranges overlap). If you know the maximum size of the array, you can even avoid any memory allocations during its usage, as you can preallocate it to the maximum size.
The "best" approach depends on how many items you need to store and how often you will need to insert/delete compared to finding. If you rarely insert or delete a sorted array with O(1) access to the values is certainly better, but if you insert and delete things frequently a binary tree can be better than the array. For a small enough n the array most likely beats the tree in any case.
If storage size is of concern, the array is better than the trees, too. Trees also need to allocate memory for every item they store and the overhead of the memory allocation can be significant as you only store small values (integers).
You may want to profile which is faster: copying the integers when you insert/delete from the sorted array, or the tree with its memory (de)allocations.
I don't know what language you're using, but if it's Java you can leverage LinkedHashMap or a similar Collection. It's got all of the benefits of a List and a Map, provides constant time for most operations, and has the memory footprint of an elephant. :)
If you're not using Java, the idea of a LinkedHashMap is probably still suitable for a usable data structure for your problem.
Use a vector for the array access.
Use a map as a search index to the subscript into the vector.
given a subscript, fetch the value from the vector: O(1)
given a key, use the map to find the subscript of the value: O(log N)
insert a value: push back on the vector, O(1) amortized, and insert the subscript into the map, O(log N)
delete a value: delete from the map, O(log N)
How about a TreeMap? log(n) for the operations described.
I like balanced binary trees a lot. They are sometimes slower than hash tables or other structures, but they are much more predictable; they are generally O(log n) for all operations. I would suggest using a Red-black tree or an AVL tree.
How do you achieve 2 with RB-trees? We can make them count their children, updating the counts on every insert/delete operation. This doesn't make those operations significantly slower. Then getting down the tree to find the i-th element is possible in log n time. But I see no implementation of this method in Java or the STL.
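A sketch of that augmentation (balancing rotations omitted; whenever the tree is restructured, the size fields along the affected path must be recomputed):

typedef struct osNode {
    int key;
    int size;                      /* nodes in this subtree, self included */
    struct osNode *left, *right;
} osNode;

static int subtree_size(const osNode *n) { return n ? n->size : 0; }

/* Return the node with 1-based rank i, or NULL if i is out of range. */
const osNode *select_nth(const osNode *root, int i) {
    while (root) {
        int r = subtree_size(root->left) + 1;   /* rank of root within its subtree */
        if (i == r) return root;
        if (i < r)
            root = root->left;
        else {
            i -= r;                             /* skip the left subtree and root */
            root = root->right;
        }
    }
    return NULL;
}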
If you're working in .NET, then according to the MS docs http://msdn.microsoft.com/en-us/library/f7fta44c.aspx
SortedDictionary and SortedList both have O(log n) for retrieval
SortedDictionary has O(log n) for insert and delete operations, whereas SortedList has O(n).
The two differ by memory usage and speed of insertion/removal. SortedList uses less memory than SortedDictionary. If the SortedList is populated all at once from sorted data, it's faster than SortedDictionary. So it depends on the situation as to which is really the best for you.
Also, your argument for the linked list is not really valid: the insert itself might be O(1), but you have to traverse the list to find the insertion point, so overall it's really not.