If you want to store multiple values for a key, there's always the possibility of tucking a list in between the hashtable and the values. However, I figure that to be rather inefficient, as:
The hashtable has to resolve collisions, anyway, so does some kind of list walking. Instead of stopping when it found the first key in a bucket that matches the query key, it could just continue to walk the bucket, presumably giving better cache performance than walking yet another list after following yet another indirection.
Is anyone aware of library implementations that support this by default (and ideally are also otherwise shiny, fast, hashtables as well as BSD or similarly licensed)? I've looked through a couple of libraries but none did what I wanted, glib's datasets coming closest, though storing records, not lists.
So… something like a multimap?
Libgee, building off of GLib, provides a MultiMap. (It's written in Vala, but that is converted to plain C.)
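The bucket-walking idea from the question can be sketched in plain C as follows. This is only a minimal illustration, not how Libgee or GLib do it: the names (multimap, mm_insert, mm_next, mm_count), the fixed bucket count and the djb2-style hash are all assumptions made up for the example. Duplicate keys simply live in the same chain, and a lookup keeps walking after the first match.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MM_BUCKETS 64  /* hypothetical fixed size; a real table would grow */

struct mm_node {
    const char *key;        /* not copied: the caller keeps the string alive */
    void *value;
    struct mm_node *next;
};

struct multimap {
    struct mm_node *bucket[MM_BUCKETS];
};

/* djb2-style string hash, chosen only for brevity. */
unsigned long mm_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Insert without checking for an existing key, so duplicates accumulate. */
int mm_insert(struct multimap *m, const char *key, void *value)
{
    assert(m != NULL && key != NULL);
    struct mm_node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    unsigned long b = mm_hash(key) % MM_BUCKETS;
    n->key = key;
    n->value = value;
    n->next = m->bucket[b];
    m->bucket[b] = n;
    return 0;
}

/* Return the next node matching key. Pass prev == NULL to start,
 * or the previous result to continue along the same chain. */
struct mm_node *mm_next(const struct multimap *m, const char *key,
                        const struct mm_node *prev)
{
    struct mm_node *n = prev ? prev->next
                             : m->bucket[mm_hash(key) % MM_BUCKETS];
    while (n != NULL && strcmp(n->key, key) != 0)
        n = n->next;
    return n;
}

/* Count all values stored under key by walking its bucket once. */
size_t mm_count(const struct multimap *m, const char *key)
{
    size_t c = 0;
    for (struct mm_node *n = mm_next(m, key, NULL); n; n = mm_next(m, key, n))
        c++;
    return c;
}

/* Release every node; keys and values belong to the caller. */
void mm_free(struct multimap *m)
{
    for (size_t b = 0; b < MM_BUCKETS; b++) {
        struct mm_node *n = m->bucket[b];
        while (n != NULL) {
            struct mm_node *next = n->next;
            free(n);
            n = next;
        }
        m->bucket[b] = NULL;
    }
}
```

Whether this actually beats a hashtable of lists on cache behaviour would have to be measured; the sketch only shows that no second indirection is needed.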
I'm using the search.h library to define a hash table through the hcreate function.
How can I go through all the keys in that table? hsearch always expects an entry to search for (or store).
This is the documentation for all three functions that manage the hash table (hcreate, hsearch and hdestroy), but there's no mention of how to iterate through the structure to obtain all the stored keys.
When storing an entry in the table, I malloc the key value and so would like to have an easy way to free those malloc'd values.
Can I avoid having to store those in a separate structure such as an array?
I wouldn't expect hdestroy to be doing this automatically for me, as it has no way of knowing if key points to dynamically allocated or static memory (or indeed if I haven't already freed that memory).
Switching to a different hash search table library is not an option. I have to work with this. I'm on CentOS and use GCC 4.1.2.
There is no standard functionality of iterating through the entries of the hash table. This question is addressed here (in the hdestroy section):
It is important to remember that the elements contained in the hashing
table at the time hdestroy is called are not freed by this function.
It is the responsibility of the program code to free those strings (if
necessary at all). Freeing all the element memory is not possible
without extra, separately kept information since there is no function
to iterate through all available elements in the hashing table. If it
is really necessary to free a table and all elements the programmer
has to keep a list of all table elements and before calling hdestroy
s/he has to free all element’s data using this list. This is a very
unpleasant mechanism and it also shows that this kind of hashing
tables is mainly meant for tables which are created once and used
until the end of the program run.
Without looking at the actual source of the library, I would say there is no way to walk the hash table after it has been created. You would be required to remember the pointers for your malloc'd memory in a separate structure.
Frankly, I don't think I'd touch that library with a ten foot pole. The API has numerous problems:
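A rough sketch of that "separate structure" approach, using only hcreate, hsearch and hdestroy from search.h. The wrapper names (table_open, table_put, table_get, table_close), the tracked_keys array and the MAX_KEYS limit are made up for this example, so treat it as one possible shape rather than the way to do it.

```c
#include <search.h>
#include <stdlib.h>
#include <string.h>

#define MAX_KEYS 128   /* hypothetical capacity, also passed to hcreate() */

static char *tracked_keys[MAX_KEYS];  /* every malloc'd key given to hsearch */
static size_t tracked_count;

/* Create the (single, global) table; returns 0 on success, -1 on failure. */
int table_open(void)
{
    tracked_count = 0;
    return hcreate(MAX_KEYS) != 0 ? 0 : -1;
}

/* Store data under a private copy of key; returns 0 on success, -1 on failure. */
int table_put(const char *key, void *data)
{
    ENTRY probe, *found;

    probe.key = (char *)key;
    probe.data = NULL;
    found = hsearch(probe, FIND);
    if (found != NULL) {            /* key already present: just update it */
        found->data = data;
        return 0;
    }
    if (tracked_count == MAX_KEYS)
        return -1;

    char *copy = malloc(strlen(key) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, key);

    probe.key = copy;
    probe.data = data;
    if (hsearch(probe, ENTER) == NULL) {
        free(copy);
        return -1;
    }
    tracked_keys[tracked_count++] = copy;   /* remember it for cleanup */
    return 0;
}

/* Return the stored data pointer, or NULL if the key is absent. */
void *table_get(const char *key)
{
    ENTRY probe, *found;

    probe.key = (char *)key;
    probe.data = NULL;
    found = hsearch(probe, FIND);
    return found ? found->data : NULL;
}

/* The table cannot be iterated, so free the keys via the side list
 * and only then destroy the table. */
void table_close(void)
{
    for (size_t i = 0; i < tracked_count; i++)
        free(tracked_keys[i]);
    tracked_count = 0;
    hdestroy();
}
```

The side list also doubles as the missing iterator: looping over tracked_keys and calling table_get on each entry visits every stored pair.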
Atrocious documentation
The library can only support a single hash table (note that hcreate does not return a handle which is then passed to hsearch or hdestroy)
The inability to walk the table, or retrieve the keys severely limits its uses.
Instead, depending on your platform (you don't say whether you are on Windows or a Unix-based OS), I'd take a good long look at glib which supports a rich set of data-structures (documentation home)
The docs for hash tables are here. That's for v2.42 of the library - they don't have a generic link for the "latest version".
GLib is the core library underlying GNOME (a Linux desktop environment), but you don't need to use any of the GMainLoop or event-pump features.
In some (horrible 3rd party) code we're working with there is a dictionary lookup routine that scans through a table populated with "'name-string' -> function_pointer" pairs, basically copy-pasted from K&R Section 6.6.
I've had to extend this, and while reading the code was struck by the seemingly pointless inclusion of hashing routines that iterate through the source data structure and create a hash table.
Given that the source data structure is fixed at compile time (so will never be added to or changed when running), is there any point in having hashing routines in there?
I'm just having one of those moments when I can't tell if the author was doing something clever that I've missed, or was being lazy and not thinking (so far the latter has been the case more often than not).
Is there a reason to have a hash table for data that will never change?
Is there a reason to have a hash table for data that will never
change?
Probably the hash table code was already there and working fine, and the programmer just wanted to get the job done (e.g. looking up a function pointer from a string). If this function is not performance critical I see no reason to change it.
If you want to change it, then I suggest to take a look at perfect hash tables.
These are hash tables where the hash function is generated from a fixed set of predefined keys, so that no two keys collide. The good thing about them: they are often faster than a tree data structure.
gperf is a tool that does just this. It generates C code from a set of strings: https://www.gnu.org/software/gperf/
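For comparison, here is what a lookup over compile-time-fixed data can look like with no hashing at all: a sorted static array searched with bsearch from stdlib.h. To be clear, this is not gperf output and not a perfect hash, just a third option next to "hash table" and "linear scan". The command names and handler functions are invented for the example.

```c
#include <stdlib.h>
#include <string.h>

typedef int (*handler_fn)(int);

/* Hypothetical handlers standing in for the real function pointers. */
static int op_double(int x) { return 2 * x; }
static int op_negate(int x) { return -x; }
static int op_square(int x) { return x * x; }

struct command {
    const char *name;
    handler_fn fn;
};

/* Must stay sorted by name (strcmp order) for bsearch to work. */
static const struct command commands[] = {
    { "double", op_double },
    { "negate", op_negate },
    { "square", op_square },
};

/* bsearch passes the search key first and the array element second. */
static int cmp_command(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct command *)elem)->name);
}

/* Return the handler registered under name, or NULL if there is none. */
handler_fn lookup_command(const char *name)
{
    const struct command *c = bsearch(name, commands,
                                      sizeof commands / sizeof commands[0],
                                      sizeof commands[0], cmp_command);
    return c ? c->fn : NULL;
}
```

This costs O(log n) string comparisons per lookup and needs no setup pass at startup; for a handful of entries the difference from a hash table is unlikely to matter.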
I am trying to develop a network resource manager component in C which keeps track of various network elements over TCP/UDP sockets. For this, I use three values :
Hardware Location Number
Service Group Number
Node Number
The rule is that no two elements on a network may have the same set of these three numbers. Thus, each location's identity will be unique on the network. This information needs to be saved in the program (non-persistently) in a way so that given any of the parameters (could be just a single number, or a combination of any two, or all three) the program returns the eligible candidates by performing a quick search.
The addition and deletion should also be efficient, but given that there will be few insertions or deletions after the initial transient phase if they are a bit slower than search, it should be OK. Using trees is one option, but the answer of 'Which one to use?' still eludes me (Not that I know of many, but I look forward to learning newer ones if they serve my purpose).
To do this, I could maintain three separate trees whose nodes point to the same structure in memory, but I feel that is inefficient and not compact. I am looking for a unified data set which can handle these variations, like multiple keys.
Or I could have a single AVL tree with multiple keys (if that is allowed).
The number of elements in the network is dynamic, so using a 3D array is not an option.
A friend also suggested hashing, but I am not too sure.
Please help.
Hashing seems like a silly choice for this. Perhaps the most significant reason is that you seem interested in approximate lookups. Hashing your values will likely mean iterating through the entire collection to find a group of nodes that have a common prefix, or a similar prefix.
PATRICIA is commonly used in routing tables, and makes itself quite amenable to searching for items that have similar keys. Note that I have found much misleading information about PATRICIA tries, which I've written about here. I found this resource to be particularly helpful.
Similarly to an AVL tree, you'll need to combine the three keys to form one (without hashing, preferably).
unsigned int key[3] = { hardware_location_number, service_group_number, node_number };
/* ^------- Use something like this as your key */
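To make the combined key a bit more concrete, here is a small sketch of the two helpers an ordered structure (AVL tree, trie, sorted array) would need: a lexicographic comparison and a prefix test. The function names key_compare and key_has_prefix and the KEY_PARTS constant are assumptions for illustration only.

```c
#include <stddef.h>

#define KEY_PARTS 3   /* hardware location, service group, node */

/* Lexicographic comparison: negative, zero or positive like strcmp.
 * Usable as the ordering for a tree or for a sorted array. */
int key_compare(const unsigned int a[KEY_PARTS],
                const unsigned int b[KEY_PARTS])
{
    for (size_t i = 0; i < KEY_PARTS; i++) {
        if (a[i] < b[i])
            return -1;
        if (a[i] > b[i])
            return 1;
    }
    return 0;
}

/* True if the first `parts` components of key equal those of prefix.
 * parts == 0 matches everything; parts == KEY_PARTS is an exact match. */
int key_has_prefix(const unsigned int key[KEY_PARTS],
                   const unsigned int prefix[KEY_PARTS],
                   size_t parts)
{
    for (size_t i = 0; i < parts && i < KEY_PARTS; i++)
        if (key[i] != prefix[i])
            return 0;
    return 1;
}
```

Note the limitation: with this ordering, all keys sharing a leading part (location alone, or location plus group) sit next to each other, so those queries are cheap. A query on the node number alone does not correspond to a prefix and would still need a scan or a second index.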
Some scripting languages, such as Python and Javascript, have arrays (aka lists) as a separate datatype from hash tables (aka dictionaries, maps, objects). In other scripting languages, such as PHP and Lua, an array is merely a hash table whose keys happen to be integers. (The implementation may be optimized for that special case, as is done in the current version of Lua, but that's transparent to the language semantics.)
Which is the better approach?
The unified approach is more elegant in the sense of having one thing rather than two, though the gain isn't quite as large as it might seem at first glance, since you still need to have the notion of iterating over the numeric keys specifically.
The unified approach is arguably more flexible. You can start off with nested arrays, find you need to annotate them with other stuff, and just add the annotations, without having to rework the data structures to interleave the arrays with hash tables.
In terms of efficiency, it seems to be pretty much a wash (provided the implementation optimizes for the special case, as Lua does).
What am I missing? Does the separate approach have any advantages?
Having separate types means that you can make guarantees about performance, and you know that you will have "normal" semantics for things like array slicing. If you have a unified system, you need to work out what all operations, such as slicing, mean on sparse arrays.
An array is more than a table intentionally restricted to consecutive integer keys. It's a sequence, a collection of n items (not key-value pairs, just the values) with a well-defined order. This is, in my opinion, a data structure that has no place for additional data in the form of non-integer keys. It's conceptually simpler.
Also, implementing the two separately may be simpler, especially when considering the addition of an optimization (which is apparently obscure enough that a performance-oriented language like Lua didn't implement it for many many years) which makes arrays perform well.
Also, the flexibility point is arguable. If the need for more complex annotation arises, it's quite possible that you'll soon also need polymorphism, in which case you should just switch to objects with an array among other attributes.
As mentioned, there are speed and complexity issues involved in having two separate types. However, one of the things that I find important about having two types is that it expresses the intent of the datastore.
A list is an ordered sequence of items. The items and their order ARE the data; the keys only exist in a conceptual manner to describe the order of the items.
A map is a mapping of keys to values. The keys and the values they represent ARE the data.
The point to note is that the keys are part of the data for a map, and they're not for a list... conceptually. When you choose one data type over the other, you're specifying your intent.
I'll add as an aside that every language that shares a data type for lists and maps has certain... annoyances that come along with it. There are always certain concessions that need to be made to allow the combination, and they can bite you sometimes. It's generally not a big deal, but it can be annoying.
Can someone please shed some light on how popular languages like Python and Ruby implement hash tables internally for symbol lookup? Do they use the classic "array with linked-list" method, or use a balanced tree?
I need a simple (fewer LOC) and fast method for indexing the symbols in a DSL written in C. Was wondering what others have found most efficient and practical.
The classic "array of hash buckets" you mention is used in every implementation I've seen.
One of the most instructive versions is the hash implementation in the Tcl language, in file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, different hash table types, strategies, etc. Sidenote: the code implementing the Tcl language is really readable.
Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There's also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under "HV". If you're truly adventurous you can dig into hv.c.
The hashing algorithm used to be pretty simple, but it's probably a lot more complicated now with Unicode. Because the algorithm was predictable, there was a DoS attack whereby the attacker generated data which would cause hash collisions. For example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash, which then shoved it all into one bucket. Lookups in the resulting hash were O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result, Perl now perturbs the hash function with a bit of random data.
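To illustrate what "perturbing the hash function with random data" means, here is a toy seeded hash in C. It is not Perl's actual function: it is FNV-1a with a seed XORed into the starting state, and the name seeded_hash is made up. The seed would be chosen once per process at startup from some random source, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over len bytes, with a per-process seed mixed into the
 * initial state. With a different seed, the same keys land in
 * different buckets, so precomputed collision sets stop working. */
uint32_t seeded_hash(const char *key, size_t len, uint32_t seed)
{
    uint32_t h = UINT32_C(2166136261) ^ seed;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= UINT32_C(16777619);                /* FNV prime */
    }
    return h;
}
```

A simple seed like this only raises the bar; it should not be read as a strong defence against a determined attacker, for which keyed hash functions designed for the purpose exist.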
You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.
As for "most efficient and practical", use someone else's hash library. For god's sake don't write one yourself for production use. There's a hojillion robust and efficient ones out there already.
Lua tables use an utterly ingenious implementation which for arbitrary keys behaves like 'array of buckets', but if you use consecutive integers as keys, it has the same representation and space overhead as an array. In the implementation each table has a hash part and an array part.
I think this is way cool :-)
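A heavily simplified sketch of the "array part plus hash part" idea, for anyone who wants to see the shape of it. This is not Lua's code: the sizes are fixed (Lua resizes both parts dynamically), keys are plain integers, values are ints, and all names (hybrid, hybrid_set, hybrid_get) are invented here.

```c
#include <stdlib.h>
#include <string.h>

#define ARRAY_PART 16   /* hypothetical fixed sizes for the sketch */
#define HASH_PART  32

struct hnode {
    long key;
    int value;
    struct hnode *next;
};

struct hybrid {
    int array[ARRAY_PART];            /* values for keys 1..ARRAY_PART */
    unsigned char used[ARRAY_PART];   /* 1 if that array slot holds a value */
    struct hnode *hash[HASH_PART];    /* chained buckets for all other keys */
};

void hybrid_init(struct hybrid *t)
{
    memset(t, 0, sizeof *t);
}

static int in_array_part(long key)
{
    return key >= 1 && key <= ARRAY_PART;
}

/* Small positive keys go straight into the array; everything else is hashed. */
int hybrid_set(struct hybrid *t, long key, int value)
{
    if (in_array_part(key)) {
        t->array[key - 1] = value;
        t->used[key - 1] = 1;
        return 0;
    }
    size_t b = (size_t)((unsigned long)key % HASH_PART);
    for (struct hnode *n = t->hash[b]; n != NULL; n = n->next)
        if (n->key == key) {
            n->value = value;
            return 0;
        }
    struct hnode *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = key;
    n->value = value;
    n->next = t->hash[b];
    t->hash[b] = n;
    return 0;
}

/* Returns 1 and stores the value in *out if key is present, else 0. */
int hybrid_get(const struct hybrid *t, long key, int *out)
{
    if (in_array_part(key)) {
        if (!t->used[key - 1])
            return 0;
        *out = t->array[key - 1];
        return 1;
    }
    size_t b = (size_t)((unsigned long)key % HASH_PART);
    for (const struct hnode *n = t->hash[b]; n != NULL; n = n->next)
        if (n->key == key) {
            *out = n->value;
            return 1;
        }
    return 0;
}

void hybrid_free(struct hybrid *t)
{
    for (size_t b = 0; b < HASH_PART; b++) {
        struct hnode *n = t->hash[b];
        while (n != NULL) {
            struct hnode *next = n->next;
            free(n);
            n = next;
        }
    }
    hybrid_init(t);
}
```

Keys 1 through 16 cost one array index and no per-entry key storage; only the "odd" keys pay for a node and a chain walk.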
Attractive Chaos has a comparison of hash table libraries and an update.
The source code is available and it is in C and C++
Balanced trees sort of defeat the purpose of hash tables since a hash table can provide lookup in (amortized) constant time, whereas the average lookup on a balanced tree is O(log(n)).
Separate chaining (array with linked list) really works quite well if you have enough buckets, and your linked list implementation uses a pooling allocator rather than malloc()ing each node from the heap individually. I've found it to be just about as performant as any other technique when properly tuned, and it is very easy and quick to write. Try starting with 1/8 as many buckets as source data.
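A minimal sketch of separate chaining where the nodes come from a preallocated pool instead of individual malloc() calls. The names (ptable, ptable_put, ptable_find), the pool size and the hash are assumptions for the example; deletion and growing the pool are left out to keep it short.

```c
#include <string.h>

#define POOL_SIZE 256
#define N_BUCKETS 32    /* POOL_SIZE / 8, following the 1/8 rule of thumb */

struct pnode {
    const char *key;    /* not copied: the caller keeps the string alive */
    int value;
    int next;           /* index of the next node in the pool, -1 = end */
};

struct ptable {
    struct pnode pool[POOL_SIZE];   /* all nodes live here, contiguously */
    int used;                       /* nodes handed out so far */
    int bucket[N_BUCKETS];          /* head index per bucket, -1 = empty */
};

void ptable_init(struct ptable *t)
{
    t->used = 0;
    for (int i = 0; i < N_BUCKETS; i++)
        t->bucket[i] = -1;
}

static unsigned pt_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % N_BUCKETS;
}

/* Return the pool index of the node holding key, or -1 if absent. */
int ptable_find(const struct ptable *t, const char *key)
{
    for (int i = t->bucket[pt_hash(key)]; i != -1; i = t->pool[i].next)
        if (strcmp(t->pool[i].key, key) == 0)
            return i;
    return -1;
}

/* Insert or update; returns 0 on success, -1 if the pool is exhausted. */
int ptable_put(struct ptable *t, const char *key, int value)
{
    int i = ptable_find(t, key);
    if (i != -1) {
        t->pool[i].value = value;
        return 0;
    }
    if (t->used == POOL_SIZE)
        return -1;
    unsigned b = pt_hash(key);
    i = t->used++;                  /* "allocation" is just a counter bump */
    t->pool[i].key = key;
    t->pool[i].value = value;
    t->pool[i].next = t->bucket[b];
    t->bucket[b] = i;
    return 0;
}
```

Using indices rather than pointers keeps the nodes small and means the whole table can be copied or freed in one go.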
You can also use open addressing with quadratic or polynomial probing, as Python does.
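For the open addressing variant, here is a small sketch using quadratic probing in its triangular-number form (offsets 0, 1, 3, 6, 10, ...), which visits every slot exactly once when the table size is a power of two. This is not Python's actual probing scheme, only an illustration of the general technique; the names and the fixed size are made up, and deletion (which would need tombstones) is omitted.

```c
#include <stddef.h>
#include <string.h>

#define OA_SLOTS 16   /* must be a power of two for the probe sequence below */

struct oa_table {
    const char *key[OA_SLOTS];   /* NULL marks an empty slot */
    int value[OA_SLOTS];
    size_t count;
};

void oa_init(struct oa_table *t)
{
    memset(t, 0, sizeof *t);
}

static size_t oa_hash(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return the slot holding key, or the empty slot where it would go.
 * Returns OA_SLOTS if the table is full and the key is absent. */
static size_t oa_probe(const struct oa_table *t, const char *key)
{
    size_t i = oa_hash(key) & (OA_SLOTS - 1);
    for (size_t step = 1; step <= OA_SLOTS; step++) {
        if (t->key[i] == NULL || strcmp(t->key[i], key) == 0)
            return i;
        i = (i + step) & (OA_SLOTS - 1);   /* cumulative offsets 1, 3, 6, ... */
    }
    return OA_SLOTS;
}

/* Insert or update; returns 0 on success, -1 if the table is full. */
int oa_put(struct oa_table *t, const char *key, int value)
{
    size_t i = oa_probe(t, key);
    if (i == OA_SLOTS)
        return -1;
    if (t->key[i] == NULL) {
        t->key[i] = key;         /* not copied: the caller keeps it alive */
        t->count++;
    }
    t->value[i] = value;
    return 0;
}

/* Returns 1 and stores the value in *out if key is present, else 0. */
int oa_get(const struct oa_table *t, const char *key, int *out)
{
    size_t i = oa_probe(t, key);
    if (i == OA_SLOTS || t->key[i] == NULL)
        return 0;
    *out = t->value[i];
    return 1;
}
```

In practice you would resize well before the table fills up, since probe sequences get long as the load factor approaches 1.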
If you can read Java, you might want to check out the source code for its various map implementations, in particular HashMap, TreeMap and ConcurrentSkipListMap. The latter two keep the keys ordered.
Java's HashMap uses the standard technique you mention of chaining at each bucket position. It uses fairly weak 32-bit hash codes and stores the keys in the table. The Numerical Recipes authors also give an example (in C) of a hash table essentially structured like Java's but in which (a) you allocate the nodes of the bucket lists from an array, and (b) you use a stronger 64-bit hash code and dispense with storing keys in the table.
What Crashworks meant to say was...
The purpose of hash tables is constant-time lookup, insertion and deletion. In algorithmic terms, each of these operations is O(1) amortized.
Whereas if you use a tree, the worst-case operation time is O(log n) for a balanced tree, where n is the number of nodes. But do we really have hashes implemented as trees?