How are hash tables implemented internally in popular languages? - c

Can someone please shed some light on how popular languages like Python and Ruby implement hash tables internally for symbol lookup? Do they use the classic "array with linked-list" method, or a balanced tree?
I need a simple (few lines of code) and fast method for indexing the symbols in a DSL written in C. I was wondering what others have found most efficient and practical.

The classic "array of hash buckets" you mention is used in every implementation I've seen.
One of the most instructive versions is the hash implementation in the Tcl language, in the file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, different hash table types, strategies, etc. Side note: the code implementing the Tcl language is really readable.

Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There's also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under "HV". If you're truly adventurous you can dig into hv.c.
The hashing algorithm used to be pretty simple, but it's probably a lot more complicated now with Unicode. Because the algorithm was predictable, there was a DoS attack whereby the attacker generated data that would cause hash collisions. For example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash, which then shoved everything into one bucket. Lookups in the resulting hash were O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result, Perl now perturbs the hash function with a bit of random data.
You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.
As for "most efficient and practical", use someone else's hash library. For god's sake don't write one yourself for production use. There's a hojillion robust and efficient ones out there already.

Lua tables use an utterly ingenious implementation which for arbitrary keys behaves like an "array of buckets", but if you use consecutive integers as keys, it has the same representation and space overhead as an array. In the implementation, each table has a hash part and an array part.
I think this is way cool :-)

Attractive Chaos has a comparison of hash table libraries and an update.
The source code is available, and it is in C and C++.

Balanced trees sort of defeat the purpose of hash tables since a hash table can provide lookup in (amortized) constant time, whereas the average lookup on a balanced tree is O(log(n)).
Separate chaining (an array with linked lists) really works quite well if you have enough buckets and your linked list implementation uses a pooling allocator rather than malloc()ing each node from the heap individually. I've found it to be just about as performant as any other technique when properly tuned, and it is very easy and quick to write. Try starting with 1/8 as many buckets as there are source data items.
You can also use open addressing with quadratic probing, or a perturbed probe sequence as CPython's dict does.
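To make the pooled-allocator point concrete, here is a minimal sketch of separate chaining in C. The table size, chunk size, and djb2 hash are arbitrary illustrative choices; resizing, duplicate handling, and freeing are omitted, and keys are stored by pointer, so the caller must keep the strings alive:

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS   1024     /* power of two so we can mask instead of mod */
    #define POOL_CHUNK 256      /* nodes allocated per pool block */

    struct node {
        const char  *key;
        int          value;
        struct node *next;
    };

    /* Pool allocator: grab nodes in large chunks instead of one malloc per node.
       Blocks are never freed in this sketch. */
    static struct node *pool = NULL;
    static size_t pool_left = 0;

    static struct node *node_alloc(void) {
        if (pool_left == 0) {
            pool = malloc(POOL_CHUNK * sizeof *pool);
            pool_left = POOL_CHUNK;
        }
        pool_left--;
        return pool++;
    }

    static struct node *buckets[NBUCKETS];

    /* djb2 string hash: an illustrative choice, not a recommendation. */
    static unsigned long hash_str(const char *s) {
        unsigned long h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h;
    }

    void ht_put(const char *key, int value) {
        unsigned long i = hash_str(key) & (NBUCKETS - 1);
        struct node *n = node_alloc();
        n->key = key;
        n->value = value;
        n->next = buckets[i];      /* push onto the bucket's chain */
        buckets[i] = n;
    }

    int *ht_get(const char *key) {
        unsigned long i = hash_str(key) & (NBUCKETS - 1);
        for (struct node *n = buckets[i]; n; n = n->next)
            if (strcmp(n->key, key) == 0)
                return &n->value;
        return NULL;
    }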

If you can read Java, you might want to check out the source code for its various map implementations, in particular HashMap, TreeMap and ConcurrentSkipListMap. The latter two keep the keys ordered.
Java's HashMap uses the standard technique you mention of chaining at each bucket position. It uses fairly weak 32-bit hash codes and stores the keys in the table. The Numerical Recipes authors also give an example (in C) of a hash table essentially structured like Java's but in which (a) you allocate the nodes of the bucket lists from an array, and (b) you use a stronger 64-bit hash code and dispense with storing keys in the table.

What Crashworks meant to say was...
The purpose of a hash table is constant-time lookup, insertion, and deletion; all of these operations are O(1) amortized.
If you use a balanced tree instead, the worst-case time for those operations is O(log n), where n is the number of nodes. But do any implementations really back their hashes with a tree?

Related

Do lock-free hash tables that preserve insertion order exist?

I'm trying to optimize a library in which I use a lock-based hash table.
One way to do it is to substitute that lock-based structure with a lock-free one.
I found some algorithms out there, and I decided to implement one in C based on this paper: Split-ordered lists: lock-free extensible hash tables.
The problem is that this kind of structure does not preserve the insertion order of the elements, and I need this feature for two reasons:
1) to get the element that follows the current one (in insertion order, not hash-key order),
2) to replace old entries with new ones when the maximum number of elements in the hash table is reached. This is because I use the hash table like a buffer and want to keep its size fixed.
So my question is: do all lock-free hash table implementations suffer from this lack of insertion order, or is there a solution?
If memory isn't an issue, a simple way to implement this is by using an atomic reference. Modifications will copy the internal data structure, make the changes and then update the reference.
In a simple implementation, that means the last write wins and all other writes are "ignored". For more complex cases, you add a locking structure in the reference which allows write operations to be queued.
So you pay with another level of indirection but get a very simple way to swap data structures and algorithms.
Since this approach works with any algorithm, you can select one which preserves order.
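A rough sketch of that copy-on-write idea using C11 atomics; the snapshot type is just a stand-in for whatever order-preserving structure you actually use, and safe reclamation of superseded snapshots (RCU, hazard pointers, etc.) is deliberately left out:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct snapshot {
        size_t len;
        int    items[64];        /* insertion-ordered contents (illustrative) */
    };

    static _Atomic(struct snapshot *) current;   /* the atomic reference */

    void snapshots_init(void) {
        atomic_store(&current, calloc(1, sizeof(struct snapshot)));
    }

    /* Readers just load the pointer; they always see one consistent snapshot. */
    const struct snapshot *read_snapshot(void) {
        return atomic_load(&current);
    }

    /* Writers copy, modify the copy, then try to publish it with a CAS.
       A writer that loses the race retries against the newer snapshot.
       Bounds checks and freeing of replaced snapshots are omitted. */
    void append_item(int v) {
        for (;;) {
            struct snapshot *old  = atomic_load(&current);
            struct snapshot *copy = malloc(sizeof *copy);
            *copy = *old;                      /* whole-structure copy */
            copy->items[copy->len++] = v;      /* modification keeps the order */
            if (atomic_compare_exchange_strong(&current, &old, copy))
                return;                        /* published */
            free(copy);                        /* lost the race; try again */
        }
    }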

Is it okay to use a non-cryptographic hash to fingerprint a block of data?

My problem is this. I have a block of data. Occasionally this block of data is updated and a new changed version appears. I need to detect if the data I am looking at matches the version I am expecting to receive.
I have decided to use a fingerprint so that I can avoid storing the 'expected' version of the data in full. It seems that the 'default' choice for this kind of thing is an MD5 hash.
However, MD5 was designed to be cryptographically secure, and there are much faster hashing functions. I am looking at modern non-cryptographic functions such as CityHash and SpookyHash.
Since I control all the data in my system I only care about accidental collisions where a changed block of data hashes to the same value. Therefore I don't think I have to worry about the 'attacker-proof' nature of cryptographic hashes and could get away with a simpler hash function.
Are there any problems with using a hash function such as CityHash or SpookyHash for this purpose, or should I just stick with MD5? Or should I be using something specifically designed for fingerprinting such as a Rabin fingerprint?
Yes, it's okay (also take a look at the even faster CRC family of functions). However, I tend to avoid using hashes to differentiate data; serial numbers combined with a date/time value provide a means to determine which version is newer and to detect out-of-sync changes. Fingerprints are used more to detect corrupted files than for versioning.
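As an illustration of how little code a fast non-cryptographic fingerprint needs, here is 64-bit FNV-1a. It is not the fastest option on large blocks (CityHash and SpookyHash beat it) and offers no protection against an adversary, but it is a reasonable baseline for accidental-change detection:

    #include <stdint.h>
    #include <stddef.h>

    /* 64-bit FNV-1a: fast, simple, non-cryptographic.
       Good enough for detecting accidental changes, useless against attackers. */
    uint64_t fnv1a_64(const void *data, size_t len) {
        const unsigned char *p = data;
        uint64_t h = 0xcbf29ce484222325ULL;       /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 0x100000001b3ULL;                /* FNV prime */
        }
        return h;
    }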
If you want to compare one set of data with another, then don't use hashes/fingerprints, just compare the data directly. It's faster to compare two streams than it is to take the hashes of two streams and then compare the hashes.
That said, a good quick way to compare lots of files is to take the hashes of each file, then compare the hashes, and when there's a hash match you then compare the raw bytes. The chance of a hash collision is indeed minimal, but it isn't impossible, and I like to be absolutely sure.
You may want to use the Rabin hash, which is faster than cryptographic hashes such as MD5 and SHA-1 and has a well-understood collision probability. A Java implementation can be found here. Most large-scale deduplication efforts by web-scale companies use Rabin fingerprints (for example, see Google's efforts led by Henzinger).

C hashtable library natively supporting multiple values per key

If you want to store multiple values for a key, there's always the possibility of tucking a list in between the hashtable and the values. However, I figure that to be rather inefficient, as:
The hashtable has to resolve collisions anyway, so it already does some kind of list walking. Instead of stopping when it finds the first key in a bucket that matches the query key, it could just continue to walk the bucket, presumably giving better cache performance than walking yet another list after following yet another indirection.
Is anyone aware of library implementations that support this by default (and ideally are also otherwise shiny, fast, hashtables as well as BSD or similarly licensed)? I've looked through a couple of libraries but none did what I wanted, glib's datasets coming closest, though storing records, not lists.
So… something like a multimap?
Libgee, building off of GLib, provides a MultiMap. (It's written in Vala, but that is converted to plain C.)
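For the "keep walking the bucket" behaviour the question describes, the change over a plain chained hash table is small. A hypothetical sketch (the struct layout and callback-based API here are my own invention, not from any particular library):

    #include <stddef.h>
    #include <string.h>

    struct entry {
        const char   *key;
        void         *value;
        struct entry *next;        /* next entry in the same bucket */
    };

    struct multitable {
        struct entry **buckets;
        size_t         nbuckets;
        unsigned long (*hash)(const char *);
    };

    /* Invoke `fn` once for every value stored under `key`.
       Instead of stopping at the first matching key, keep walking the
       bucket chain, so duplicate keys cost no extra indirection. */
    void mt_find_all(const struct multitable *t, const char *key,
                     void (*fn)(void *value, void *ctx), void *ctx) {
        size_t i = t->hash(key) % t->nbuckets;
        for (const struct entry *e = t->buckets[i]; e != NULL; e = e->next)
            if (strcmp(e->key, key) == 0)
                fn(e->value, ctx);
    }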

C: Store and index a HUGE amount of info! School project

I need to do a school project in C (I really don't know C++ that well).
I need a data structure to index each word of about 34k documents; that's a lot of words, and I need to do some ranking of the words. I already did this project about two years ago (I took a break from school and came back this year), and I used a hash table of binary trees, but I got a low grade because my project took about two hours to index all the words. I need something a little faster... any suggestions?
Thanks,
Roberto
If you have the option, I'd strongly recommend using a database engine (MSSQL, MySQL, etc.) as that's exactly the sort of datasets and operations these are written for. Best not to reinvent the wheel.
Otherwise, why use binary trees at all? From what you've described (and I realise we're probably not getting the full story...) a straight-up hash table with the word as the key and its rank/count of occurrences as the value should work well.
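A sketch of that word-to-count table, reusing the uthash macros mentioned earlier in the thread (assumed available); the word-length limit and lack of error handling are shortcuts for brevity:

    #include <stdlib.h>
    #include <string.h>
    #include "uthash.h"

    struct word_count {
        char word[64];          /* key */
        long count;
        UT_hash_handle hh;
    };

    static struct word_count *counts = NULL;

    /* Increment the count for `word`, inserting it the first time we see it. */
    void count_word(const char *word) {
        struct word_count *wc;
        HASH_FIND_STR(counts, word, wc);
        if (!wc) {
            wc = calloc(1, sizeof *wc);
            strncpy(wc->word, word, sizeof wc->word - 1);
            HASH_ADD_STR(counts, word, wc);
        }
        wc->count++;
    }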
bogofilter (the spam filter) has to keep word counts. It uses dbm as a backend, since it needs persistent storage of the word -> count map. You might want to look at the code for inspiration. Or not, since you need to implement the db part of it for the school project, not so much the spam filter part.
Minimize the amount of pointer chasing you have to do. Data-dependent memory-load operations are slow, especially on a large working set where you will have cache misses. So make sure your hash table is big enough that you don't need a big tree in each bucket. And maybe check that your binary trees are dense, not degenerate linked lists, when you do get more than one value in a hash bucket.
If it's slow, profile it, and see if your problem is one slow function, or if it's cache misses, or if it's branch mispredictions.

Linked lists or hash tables?

I have a linked list of around 5000 entries (not inserted simultaneously), and I traverse the list looking for a particular entry on occasion (though this is not very often). Should I consider a hash table as the better choice for this case, replacing the linked list (which is doubly-linked and linear)? I'm using C on Linux.
If you have not found the code to be the slow part of the application via a profiler then you shouldn't do anything about it yet.
If it is slow, but the code is tested, works, and is clear, and there are other slower areas that you can work on speeding up, do those first.
If it is buggy then you need to fix it anyway; go for the hash table, as it will be faster than the list. This assumes that the order in which the data is traversed does not matter; if you care about insertion order then stick with the list (you can use a hash table and still keep the order, but that will make the code much trickier).
Given that you need to search the list only on occasion, the odds of this being a significant bottleneck in your code are small.
Another data structure to look at is a "skip list", which basically lets you skip over a large portion of the list. This requires that the list be sorted, however, which, depending on what you are doing, may make the code slower overall.
Whether a hash table is the better choice depends on the use case, which you have not described in detail. But more importantly, make sure the performance bottleneck is actually in this part of the code. If this code is called only once in a while and is not on a critical path, there's no point changing it.
Have you measured and found a performance hit with the lookup? A hash_map or hash table should be good.
If you need to traverse the list in order (not as a part of searching for elements, but say for displaying them) then a linked list is a good choice. If you're only storing them so that you can look up elements then a hash table will greatly outperform a linked list (for all but the worst possible hash function).
If your application calls for both types of operations, you might consider keeping both, and using whichever one is appropriate for a particular task. The memory overhead would be small, since you'd only need to keep one copy of each element in memory and have the data structures store pointers to these objects.
As with any optimization step that you take, make sure you measure your code to find the real bottleneck before you make any changes.
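One way to realize the "keep both" suggestion above is to give each element both a pair of list links (for ordered traversal) and a bucket link (for lookup), so the data itself is stored only once. A minimal sketch; the field sizes and djb2-style hash are illustrative, and the index is assumed zero-initialized:

    #include <stddef.h>
    #include <string.h>

    #define NBUCKETS 256

    struct item {
        char key[32];
        int  value;
        struct item *prev, *next;   /* insertion-order list links */
        struct item *hnext;         /* next item in the same hash bucket */
    };

    struct index {
        struct item *head, *tail;          /* traversal order */
        struct item *buckets[NBUCKETS];    /* lookup */
    };

    static unsigned hash(const char *s) {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    void index_add(struct index *ix, struct item *it) {
        /* append to the ordered list */
        it->prev = ix->tail;
        it->next = NULL;
        if (ix->tail) ix->tail->next = it; else ix->head = it;
        ix->tail = it;
        /* and link into the hash bucket */
        unsigned b = hash(it->key);
        it->hnext = ix->buckets[b];
        ix->buckets[b] = it;
    }

    struct item *index_find(struct index *ix, const char *key) {
        for (struct item *it = ix->buckets[hash(key)]; it; it = it->hnext)
            if (strcmp(it->key, key) == 0)
                return it;
        return NULL;
    }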
If you care about performance, you definitely should. If you're iterating through the thing to find a certain element with any regularity, it's going to be worth it to use a hash table. If it's a rare case, though, and the ordinary use of the list is not a search, then there's no reason to worry about it.
If you only traverse the collection, I don't see any advantage in using a hash map.
I advise against hashes in almost all cases.
There are two reasons. First, the size of the hash table is fixed.
Second, and much more importantly, the hashing algorithm: how do you know you've got it right? How will it behave with real data rather than test data?
I suggest a balanced B-tree: always O(log n), no uncertainty about a hash algorithm, and no size limits.
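If you go the balanced-tree route in C, you don't have to write the tree yourself: POSIX provides tsearch()/tfind() in <search.h> (glibc backs them with a red-black tree). A small usage sketch with an invented struct sym:

    #include <search.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct sym { const char *name; int value; };

    static int cmp_sym(const void *a, const void *b) {
        return strcmp(((const struct sym *)a)->name,
                      ((const struct sym *)b)->name);
    }

    int main(void) {
        void *root = NULL;
        static struct sym syms[] = { {"alpha", 1}, {"beta", 2}, {"gamma", 3} };

        for (size_t i = 0; i < sizeof syms / sizeof *syms; i++)
            tsearch(&syms[i], &root, cmp_sym);     /* insert */

        struct sym key = { "beta", 0 };
        struct sym **found = tfind(&key, &root, cmp_sym);  /* NULL if absent */
        if (found)
            printf("%s = %d\n", (*found)->name, (*found)->value);
        return 0;
    }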
