Dictionary lookup - why use a hash table for fixed data? (C)

In some (horrible third-party) code we're working with, there is a dictionary lookup routine that scans through a table populated with "name-string" -> function_pointer pairs, basically copy-pasted from K&R Section 6.6.
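For concreteness, the kind of fixed table and linear scan being described looks roughly like this (the command names and handlers here are made up for illustration):

    #include <string.h>

    typedef void (*handler_fn)(void);

    /* hypothetical handlers standing in for the real ones */
    static void do_start(void) { /* ... */ }
    static void do_stop(void)  { /* ... */ }

    /* fixed at compile time; never modified at runtime */
    static const struct {
        const char *name;
        handler_fn  fn;
    } table[] = {
        { "start", do_start },
        { "stop",  do_stop  },
    };

    /* linear scan over the table */
    static handler_fn lookup(const char *name)
    {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(table[i].name, name) == 0)
                return table[i].fn;
        return NULL;
    }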
I've had to extend this, and while reading the code was struck by the seemingly pointless inclusion of hashing routines that iterate through the source data structure and create a hash table.
Given that the source data structure is fixed at compile time (so will never be added to or changed when running), is there any point in having hashing routines in there?
I'm just having one of those moments when I can't tell if the author was doing something clever that I've missed, or was being lazy and not thinking (so far the latter has been the case more often than not).
Is there a reason to have a hash table for data that will never change?

"Is there a reason to have a hash table for data that will never change?"
Probably the hash table code was already there and working fine, and the programmer just wanted to get the job done (e.g. looking up a function pointer from a string). If this function is not performance-critical, I see no reason to change it.
If you do want to change it, then I suggest taking a look at perfect hash tables.
These are hash tables where the hash function is created from a fixed set of predefined keys. The good thing about them: they are often faster than a tree data structure.
GPERF is a tool that does just this. It creates C code from a set of strings: https://www.gnu.org/software/gperf/
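As a sketch, a gperf input file for a table like this might look as follows (the struct and handler names are made up; see the gperf manual for the exact conventions):

    %{
    void do_start(void);
    void do_stop(void);
    %}
    struct cmd { const char *name; void (*fn)(void); };
    %%
    start, do_start
    stop,  do_stop
    %%

Running something like gperf -t commands.gperf > commands.c should then generate a lookup function (named in_word_set by default) that maps a string to its struct cmd entry with a collision-free hash, replacing both the linear scan and any runtime table construction.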

Related

C - Ways to free() groups of elements in a hash table?

I'm currently fiddling with a program that's trying to solve a 2D Rubik's cube. The program uses a hash table as a memory of sorts, where it saves different categories of information, and it runs on repeat. From run to run there are certain categories of information I'd like to free/remove instead of freeing the whole table at the end of each run (which is what I'm currently doing).
I've come up with two ways and I'm unsure which to use. Either I make one array/stack for each of the categories, in which I save a pointer that I can later free, or I make separate hash tables for the different categories and free each one at my discretion.
Are there other options? Somewhere I read about a "pointer pool" and I'm not sure what that might be. Any ideas or helpful comments would be great!
Do you have more memory or more time? If you use hash tables (including separate ones per category), then freeing means checking every element in each table, which takes a lot of time. I think the best way is to record each object in a second structure when you create it. To free everything, you run through that simple array with no checks, then zero the hash table's memory to flush it. You need a little more memory, but the work is more efficient.
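A minimal sketch of that idea in C (all names hypothetical; this is also roughly what a "pointer pool" usually means): keep one growable array of pointers per category, record each allocation as you make it, and free a whole category by walking its array.

    #include <stdlib.h>

    /* one pool per category of information */
    struct ptr_pool {
        void **items;
        size_t count, cap;
    };

    /* remember an allocation so the category can be freed later
       (error handling omitted for brevity) */
    static void pool_track(struct ptr_pool *p, void *ptr)
    {
        if (p->count == p->cap) {
            p->cap = p->cap ? p->cap * 2 : 16;
            p->items = realloc(p->items, p->cap * sizeof *p->items);
        }
        p->items[p->count++] = ptr;
    }

    /* free everything in one category; any hash table entries that
       pointed at these objects must be reset separately */
    static void pool_free_all(struct ptr_pool *p)
    {
        for (size_t i = 0; i < p->count; i++)
            free(p->items[i]);
        p->count = 0;
    }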

Get all keys from (search.h) hash search table

I'm using the search.h library to define a hash table through the hcreate function.
How can I go through all the keys in that table? hsearch always expects an entry to search for (or store).
This is the documentation for the three functions that manage the hash table (hcreate, hsearch and hdestroy), but there's no mention of how to iterate through the structure to obtain all the stored keys.
When storing an entry in the table, I malloc the key value and so would like to have an easy way to free those malloc'd values.
Can I avoid having to store those in a separate structure such as an array?
I wouldn't expect hdestroy to be doing this automatically for me, as it has no way of knowing if key points to dynamically allocated or static memory (or indeed if I haven't already freed that memory).
Switching to a different hash search table library is not an option. I have to work with this. I'm on CentOS and use GCC 4.1.2.
There is no standard functionality for iterating through the entries of the hash table. This question is addressed in the library's documentation (in the hdestroy section):
It is important to remember that the elements contained in the hashing table at the time hdestroy is called are not freed by this function. It is the responsibility of the program code to free those strings (if necessary at all). Freeing all the element memory is not possible without extra, separately kept information since there is no function to iterate through all available elements in the hashing table. If it is really necessary to free a table and all elements the programmer has to keep a list of all table elements and before calling hdestroy s/he has to free all element's data using this list. This is a very unpleasant mechanism and it also shows that this kind of hashing tables is mainly meant for tables which are created once and used until the end of the program run.
Without looking at the actual source of the library, I would say there is no way to walk the hash table after it has been created. You would be required to remember the pointers to your malloc'd memory in a separate structure.
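For example, a sketch of that bookkeeping with the actual search.h API (the array is the "separate structure"; the sizes and names are arbitrary):

    #include <search.h>
    #include <stdlib.h>
    #include <string.h>

    /* separate record of the malloc'd keys, since the table
       cannot be walked later */
    static char  *keys[100];
    static size_t nkeys;

    /* hcreate() must have been called first */
    static int put(const char *key, void *value)
    {
        ENTRY e = { .key = strdup(key), .data = value };
        if (nkeys == sizeof keys / sizeof keys[0] ||
            hsearch(e, ENTER) == NULL) {   /* bookkeeping or table full */
            free(e.key);
            return -1;
        }
        keys[nkeys++] = e.key;             /* remember for freeing */
        return 0;
    }

    static void destroy_all(void)
    {
        for (size_t i = 0; i < nkeys; i++)
            free(keys[i]);
        nkeys = 0;
        hdestroy();                        /* frees neither keys nor data */
    }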
Frankly, I don't think I'd touch that library with a ten-foot pole. The API has numerous problems:
- Atrocious documentation.
- The library can only support a single hash table (note that hcreate does not return a handle which is then passed to hsearch or hdestroy).
- The inability to walk the table or retrieve the keys severely limits its uses.
Instead, since you're on a Unix-based OS, I'd take a good long look at glib, which supports a rich set of data structures (documentation home).
The docs for hash tables are here. That's for v2.42 of the library - they don't have a generic link for the "latest version".
glib is the core of GNOME (the desktop environment used by Ubuntu), but you don't need to use any of the GMainLoop or event-pump related features.
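For comparison, here is a sketch of the same job with GLib's GHashTable, which can both iterate and free your keys for you (based on the documented g_hash_table_* API; the key/value pair is made up):

    #include <glib.h>

    static void print_one(gpointer key, gpointer value, gpointer user_data)
    {
        (void)user_data;
        g_print("%s -> %d\n", (char *)key, GPOINTER_TO_INT(value));
    }

    int main(void)
    {
        /* destroy-notify functions run for each key/value on removal,
           so no separate bookkeeping is needed to free the keys */
        GHashTable *t = g_hash_table_new_full(g_str_hash, g_str_equal,
                                              g_free, NULL);
        g_hash_table_insert(t, g_strdup("answer"), GINT_TO_POINTER(42));

        /* iterate over all entries, which search.h cannot do */
        g_hash_table_foreach(t, print_one, NULL);

        g_hash_table_destroy(t);   /* g_free is called on every key */
        return 0;
    }

Build with something like gcc demo.c $(pkg-config --cflags --libs glib-2.0).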

How does the SETCURRENTKEY() C/AL function in Navision work?

I have the following questions:
1. What does SETCURRENTKEY actually do?
2. What is the benefit of SETCURRENTKEY?
3. Why would I use SETCURRENTKEY?
4. When would I use SETCURRENTKEY?
5. What is the advantage of using an index, and how does it tie in with the analogy of a library's old card-sorting system?
6. What type of database querying efficiency problems does this function solve?
I have been searching all over the internet, and in the 'IT Pro Developer Help' internal Navision documentation, for this poorly documented function, and I cannot find a proper answer to my questions.
The only thing I know is that SETCURRENTKEY sets the current key for a record variable and it sorts the recordset based on it. When SETCURRENTKEY is used with only a few keys, it can improve query performance. I have no idea what is actually happening when a database uses an index versus not using an index.
Someone told me this is how SETCURRENTKEY works:
It is like the old card-sorting system in a library: without SETCURRENTKEY you would have to go through each shelf and manually filter for the book you want. You would find a mix of random books and you would have to say: "No, not this one. Yes, this one." With SETCURRENTKEY you have an index analogous to the old card system, where you can go straight to a book or music CD based on its 'Author' or 'Artist' etc.
That's all fine, but I still can't properly answer my questions.
1. With SETCURRENTKEY you declare the key (table index, which can consist of several fields) to be used when querying the database with FINDSET/FINDFIRST/FINDLAST statements, and the order in which records will be returned while iterating the recordset with the NEXT statement.
2. Performance. The database server uses the selected key (table index) to retrieve the record set. You are always better off explicitly stating SETCURRENTKEY in your code, as it makes you think about your database structure and the indexes required.
3. Performance, and so that you know ahead of time the order of the records you will receive when iterating through a recordset.
4. As for when to use it, the typical pattern is this:
    RecordVar.SETCURRENTKEY(...);
    RecordVar.SETRANGE(Field, ...);
    RecordVar.SETFILTER(Field, ...);
    RecordVar.SETRANGE(Field, ...);
    ...
    IF RecordVar.FINDSET THEN REPEAT
      // do something with records
    UNTIL RecordVar.NEXT = 0;
SETCURRENTKEY is declarative, and comes into effect only when FINDSET is executed. At the moment FINDSET is executed, the database will be queried on the table represented by RecordVar, using the filters declared by SETRANGE/SETFILTER, and the key/index declared by SETCURRENTKEY.
For 5. and 6., and generally, I would truly recommend familiarizing yourself with basic database index theory. That's what this is, and it's pretty well explained by your own library/book analogy.
If modifying key fields (or filtered fields, even if not in the key) in a loop, the standard way to do this in NAV is to declare a second record variable, do a GET on it using the primary key fields from the record variable you are looping through, then change and MODIFY the second record variable.

C hashtable library natively supporting multiple values per key

If you want to store multiple values for a key, there's always the possibility of tucking a list in between the hashtable and the values. However, I figure that to be rather inefficient, as:
The hash table has to resolve collisions anyway, so it already does some kind of list walking. Instead of stopping when it finds the first key in a bucket that matches the query key, it could just continue to walk the bucket, presumably giving better cache performance than walking yet another list after following yet another indirection (see the sketch below).
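A sketch of that bucket-walking idea with a hand-rolled chained table (all names hypothetical): the lookup simply keeps walking the chain instead of returning at the first match.

    #include <string.h>

    #define NBUCKETS 64

    struct entry {
        const char   *key;
        int           value;
        struct entry *next;    /* chain within the bucket */
    };

    static struct entry *buckets[NBUCKETS];

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;             /* djb2 */
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    static void insert(struct entry *e) /* caller sets key and value */
    {
        unsigned b = hash(e->key);
        e->next = buckets[b];
        buckets[b] = e;
    }

    /* visit every value stored under key: keep walking the same
       bucket chain rather than stopping at the first match */
    static void for_each_value(const char *key, void (*visit)(int))
    {
        for (struct entry *e = buckets[hash(key)]; e; e = e->next)
            if (strcmp(e->key, key) == 0)
                visit(e->value);
    }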
Is anyone aware of library implementations that support this by default (and ideally are also otherwise shiny, fast, hashtables as well as BSD or similarly licensed)? I've looked through a couple of libraries but none did what I wanted, glib's datasets coming closest, though storing records, not lists.
So… something like a multimap?
Libgee, building off of GLib, provides a MultiMap. (It's written in Vala, but that is converted to plain C.)

Linked lists or hash tables?

I have a linked list of around 5000 entries (not inserted all at once), and I traverse the list looking for a particular entry on occasion (though this is not very often). Should I consider a hash table as the better choice here, replacing the linked list (which is doubly-linked and linear)? Using C on Linux.
If you have not found, via a profiler, that this code is the slow part of the application, then you shouldn't do anything about it yet.
If it is slow, but the code is tested, works, and is clear, and there are other, slower areas that you can work on speeding up, do those first.
If it is buggy then you need to fix it anyway; go for the hash table, as it will be faster than the list. This assumes that the order in which the data is traversed does not matter; if you care about insertion order then stick with the list (you can do things with a hash table and keep the order, but that will make the code much trickier).
Given that you need to search the list only on occasion, the odds of this being a significant bottleneck in your code are small.
Another data structure to look at is a "skip list", which basically lets you skip over a large portion of the list. This requires that the list be sorted, however, which, depending on what you are doing, may make the code slower overall.
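A sketch of the skip-list idea (the node layout and names are hypothetical): each node carries several forward pointers, and a search drops down a level each time it would overshoot.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_LEVEL 4

    struct sl_node {
        int key;
        struct sl_node *forward[MAX_LEVEL];
    };

    static struct sl_node head;    /* sentinel; forward[] starts NULL */

    /* coin-flip level: higher levels are exponentially rarer */
    static int random_level(void)
    {
        int lvl = 1;
        while (lvl < MAX_LEVEL && (rand() & 1))
            lvl++;
        return lvl;
    }

    static void sl_insert(int key)
    {
        struct sl_node *update[MAX_LEVEL], *x = &head;
        for (int i = MAX_LEVEL - 1; i >= 0; i--) {
            while (x->forward[i] && x->forward[i]->key < key)
                x = x->forward[i];
            update[i] = x;         /* last node before key on level i */
        }
        int lvl = random_level();
        struct sl_node *n = calloc(1, sizeof *n);
        n->key = key;
        for (int i = 0; i < lvl; i++) {
            n->forward[i] = update[i]->forward[i];
            update[i]->forward[i] = n;
        }
    }

    static struct sl_node *sl_search(int key)
    {
        struct sl_node *x = &head;
        for (int i = MAX_LEVEL - 1; i >= 0; i--)   /* top level first */
            while (x->forward[i] && x->forward[i]->key < key)
                x = x->forward[i];
        x = x->forward[0];
        return (x && x->key == key) ? x : NULL;
    }

    int main(void)
    {
        int keys[] = { 30, 10, 50, 20, 40 };
        for (int i = 0; i < 5; i++)
            sl_insert(keys[i]);
        printf("%s\n", sl_search(20) ? "found" : "missing");
        return 0;
    }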
Whether the hash table is the better choice depends on the use case, which you have not described in detail. But more importantly, make sure the performance bottleneck really is in this part of the code. If this code is called only once in a while and is not on a critical path, there is no use bothering to change it.
Have you measured and found a performance hit with the lookup? A hash_map or hash table should be good.
If you need to traverse the list in order (not as a part of searching for elements, but say for displaying them) then a linked list is a good choice. If you're only storing them so that you can look up elements then a hash table will greatly outperform a linked list (for all but the worst possible hash function).
If your application calls for both types of operations, you might consider keeping both, and using whichever one is appropriate for a particular task. The memory overhead would be small, since you'd only need to keep one copy of each element in memory and have the data structures store pointers to these objects.
As with any optimization step that you take, make sure you measure your code to find the real bottleneck before you make any changes.
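A sketch of that dual-structure idea (all names hypothetical): one copy of each element, threaded onto both an insertion-ordered list and a small chained hash index.

    #include <string.h>

    #define NBUCKETS 1024

    struct node {
        char        *key;
        void        *data;
        struct node *prev, *next;   /* insertion-ordered list */
        struct node *hnext;         /* chain within the hash bucket */
    };

    static struct node *head, *tail;        /* ordered traversal */
    static struct node *buckets[NBUCKETS];  /* fast lookup */

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    static void add(struct node *n)         /* one copy, two indexes */
    {
        n->prev = tail;                     /* append to the list */
        n->next = NULL;
        if (tail) tail->next = n; else head = n;
        tail = n;

        unsigned b = hash(n->key);          /* link into its bucket */
        n->hnext = buckets[b];
        buckets[b] = n;
    }

    static struct node *find(const char *key)   /* O(1) expected */
    {
        for (struct node *n = buckets[hash(key)]; n; n = n->hnext)
            if (strcmp(n->key, key) == 0)
                return n;
        return NULL;
    }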
If you care about performance, you definitely should. If you're iterating through the thing to find a certain element with any regularity, it's going to be worth it to use a hash table. If it's a rare case, though, and the ordinary use of the list is not a search, then there's no reason to worry about it.
If you only traverse the collection I don't see any advantages of using a hashmap.
I advise against hashes in almost all cases.
There are two reasons. First, the size of the hash table is fixed.
Second, and much more importantly: the hashing algorithm. How do you know you've got it right? How will it behave with real data rather than test data?
I suggest a balanced B-tree. Always O(log n), no uncertainty with regard to a hash algorithm, and no size limits.
