How are HashTables in GLib Useful? - c

I'm familiar with the idea of a hash function but I'm unclear on how GLib's implementation is useful. I'll explain this with an example.
Suppose I have an expensive function that is recursive (somehow) on the positive real numbers in a weird way that depends on number theory (I'm a mathematician). Let's say I have an algorithm that needs to compute the function on some smallish range of large numbers, say [1000000000, 1000999999].
I don't want to call my expensive function one million times, so I start memoizing values recursively. Then at each call I don't necessarily need to compute the whole function from scratch; I can hopefully reuse any values of the function on the lower numbers (during my recursing) that I have already computed. Let's assume that the actual total number of distinct calls at that first level of recursion is low, so that there are a lot of repeated values and memoizing actually saves a lot of time.
This is my cartoony way of understanding why a hash table data structure is useful. What I don't get is how to do this without knowing exactly what keys I'll need in advance.
Since the recursive function is number theoretic, in general I don't know which values it will take over and over again. So I'd like to just throw these in a bucket (hash table) as they pop out of recursive calls to my function.
For GLib, it would seem that your (key, value) pairs are always pointers to data that you personally have to keep lying around somewhere. So if my function is computing for input x, I don't know how to tell if I've seen the value x before: the function g_hash_table_contains(), for example, needs a pointer, not the value x. So what's the use!?
I'm still learning so be kind. I'm familiar with coding in C, but haven't yet used hash tables in this language and I'm trying to do so and be adept at it with GLib but I just don't get this.

Let me take a stab at explaining it.
First of all, a hash map always takes [key, value] pairs as its input.
So as users of a hash map, we have to be creative about choosing the key, and the right choice varies with the use case.
In your case, as far as I understand, you have a function that is evaluated over a range of inputs. While calculating, it uses memoization so that the results of the smaller subproblems that make up the bigger problem can be reused.
So in your case you could, for example, use a string as the key: the key "1000999999" may need the result for "1000999998", which may in turn need the result for "1000999997", and so on. Whenever a result is not found in the hash map, you calculate it and save it there.
In a nutshell, as users we need to be creative about choosing keys.
A useful analogy is how you would go about choosing the primary key of a database table.
Another example to think through is how you would solve fibonacci(n) using a hash map.
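To make the "keys are not known in advance" point concrete, here is a minimal sketch of memoization with an integer-keyed table, written in plain C so that it stands alone. The table stores each key by value, so nothing has to be kept lying around by the caller. The function collatz_steps is only a stand-in for the asker's expensive recursive function, and MEMO_SLOTS, memo_lookup and memo_store are names invented for this illustration. In GLib itself, a comparable effect is available by creating the table with g_direct_hash and g_direct_equal and packing integers that fit in 32 bits into the key pointer with GINT_TO_POINTER, or by using g_int64_hash and g_int64_equal with heap-allocated gint64 keys.

```c
#include <stddef.h>
#include <stdint.h>

#define MEMO_SLOTS 65536u   /* power of two, so wrapping is a simple mask */

struct memo_slot { uint64_t key; unsigned value; int used; };

static struct memo_slot memo[MEMO_SLOTS];
static size_t memo_count;

/* Hash the integer key itself; no pointer to external data is involved. */
static size_t slot_for(uint64_t key)
{
    return (size_t)((key * 0x9E3779B97F4A7C15ull) >> 48) & (MEMO_SLOTS - 1);
}

/* Returns 1 and fills *value if key was seen before, otherwise 0. */
static int memo_lookup(uint64_t key, unsigned *value)
{
    size_t i = slot_for(key);
    while (memo[i].used) {
        if (memo[i].key == key) {
            *value = memo[i].value;
            return 1;
        }
        i = (i + 1) & (MEMO_SLOTS - 1);   /* linear probing */
    }
    return 0;
}

/* Remembers key -> value. Silently stops storing once the table is half
   full, which guarantees that the probing loops always find a free slot. */
static void memo_store(uint64_t key, unsigned value)
{
    size_t i;
    if (memo_count >= MEMO_SLOTS / 2)
        return;
    i = slot_for(key);
    while (memo[i].used)
        i = (i + 1) & (MEMO_SLOTS - 1);
    memo[i].key = key;
    memo[i].value = value;
    memo[i].used = 1;
    memo_count++;
}

/* Stand-in for the expensive function: number of Collatz steps from n
   down to 1. Requires n >= 1. Keys enter the table as they turn up. */
unsigned collatz_steps(uint64_t n)
{
    unsigned steps;
    if (n == 1)
        return 0;
    if (memo_lookup(n, &steps))
        return steps;
    steps = 1 + collatz_steps((n % 2 == 0) ? n / 2 : 3 * n + 1);
    memo_store(n, steps);
    return steps;
}
```

Calling collatz_steps(x) for each x in the range of interest fills the table as a side effect, so later calls that reach an already seen value return immediately.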

Related

Data Structure to do lookup on large number

I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, I need to return some other data structure. My question is: what data structure should I use to hold this efficiently?
I would have used an array giving me O(1) lookup if the numbers were in the range, say, 1 to 5000. But when my input number gets large, it becomes unrealistic to use an array, as the memory requirements would be huge.
I am hence looking for a data structure that yields the result fast and is not very heavy.
Any clues anybody?
EDIT:
It would not make sense to use an array since I may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements, (200), that will be very fast.
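A minimal sketch of that array-of-structs approach might look as follows. The names struct entry and find_entry are made up for this illustration, and data is a void pointer only because the question does not say what is being stored.

```c
#include <stddef.h>
#include <stdint.h>

struct entry {
    uint32_t index;   /* the large lookup number */
    void    *data;    /* whatever structure belongs to that number */
};

/* Linear search: returns the data stored for `index`, or NULL if absent. */
void *find_entry(const struct entry *table, size_t count, uint32_t index)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].index == index)
            return table[i].data;
    }
    return NULL;
}
```

With at most 200 entries this is a couple of hundred integer comparisons in the worst case, and the whole table is a single contiguous array.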
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.

Hash function for hash table with strings and integers as keys

I am in search of a good hash function that I can use in a hash table implementation. The thing is that I want to give both strings and integers as parameters (keys) to my hash function.
I have a txt file with ~500 data items, and every one of them consists of integers and strings (max 15 chars). So what I want to do is pick one of these ints/strings and use it as the key for my hash function in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)
Use the Integer value if that's present & reasonably well distributed, then hash the String if it's not. Integer hashcode is much cheaper to compute than String.
The algorithm has to be repeatable, obviously.
Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.
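As an illustration, here is a sketch of one of the Jenkins functions, the one-at-a-time hash, applied to arbitrary bytes. The wrappers bucket_for_string and bucket_for_int are hypothetical helpers added for this example; hashing the raw bytes of an int is just one simple way to reuse the same function for integer keys, and the result then depends on the machine's byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Jenkins one-at-a-time hash over `len` bytes starting at `key`. */
uint32_t one_at_a_time(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint32_t hash = 0;
    for (size_t i = 0; i < len; i++) {
        hash += p[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

/* Bucket for a string key (hypothetical helper). */
size_t bucket_for_string(const char *s, size_t nbuckets)
{
    return one_at_a_time(s, strlen(s)) % nbuckets;
}

/* Bucket for an integer key: hashes the bytes of the value, so the
   bucket is repeatable on one machine but not portable across
   machines with a different byte order (hypothetical helper). */
size_t bucket_for_int(int32_t v, size_t nbuckets)
{
    return one_at_a_time(&v, sizeof v) % nbuckets;
}
```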
Since you want to hash using two keys, I presume the distribution improves using two keys.
For string hashing, I have had good results with PJW algorithm. Just google for "PJW Hash String". One variation here
To augment the hash with an integer, see here

Linking filenames or labels to numeric index

In a C99+SDL game, I have an array that contains sound effects (SDL_mixer chunk data and some extra flags and filename string) and is referenced by index such as "sounds[2].data".
I'd like to be able to call sounds by filename, but I don't want to strcmp all the array until a match is found. This way as I add more sounds, or change the order, or allow for player-defined sound mods, they can still be called with a common identifier (such as "SHOT01" or "EXPL04").
What would be the fastest approach for this? I heard about hashing, which would result in something similar to lua's string indexes (such as table["field"]) but I don't know anything about the topic, and seems fairly complicated.
Just in case it matters, I plan to have filenames or labels that are 6 to 8 characters long and all caps (such as "SHOT01.wav").
So to summarize, where can I learn about hashing short strings like that, or what would be the fastest way to keep track of something like sound effects so they can be called using arbitrary labels or identifiers?
I think in your case you can probably just keep all the sounds in a sorted data structure and use a fast search algorithm to find matches. Something like a binary search is very simple to implement and gives good performance.
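A sketch of the sorted-array idea using the standard library's qsort and bsearch could look like this. struct sound, sort_sounds and find_sound are illustrative names, and slot is assumed to be the index into the existing sounds[] array from the question.

```c
#include <stdlib.h>
#include <string.h>

struct sound {
    const char *name;   /* label such as "SHOT01" */
    int         slot;   /* assumed: index into the existing sounds[] array */
};

/* Orders two table entries by name. */
static int cmp_sound(const void *a, const void *b)
{
    const struct sound *x = a, *y = b;
    return strcmp(x->name, y->name);
}

/* Sort once after loading (or after adding) sounds. */
void sort_sounds(struct sound *table, size_t count)
{
    qsort(table, count, sizeof table[0], cmp_sound);
}

/* Binary search: returns the slot for `name`, or -1 if there is none. */
int find_sound(const struct sound *table, size_t count, const char *name)
{
    struct sound key = { name, 0 };
    const struct sound *hit =
        bsearch(&key, table, count, sizeof table[0], cmp_sound);
    return hit ? hit->slot : -1;
}
```

Each lookup then costs about log2(count) string comparisons instead of up to count of them.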
However, if you are interested in hash tables and hashing, the basics of it all are pretty simple. There is no place like Wikipedia to get the basics down and you can then tailor your searches better on Google to find more in depth articles.
The basics are you start out with a fixed size array and store everything in there. To figure out where to store something you take the key (in your case the sound name) and you perform some operation on it such that it gives you an exact location where the value can be found. So the simplest case for string hashing is just adding up all the letters in the string as integer values then take the value and use modulus to give you an index in your array.
position = SUM(string letters) % [array size]
Naturally, multiple strings will have the same sum and thus give you the same position. This is called a collision, and collisions can be handled in many ways. The simplest way is to have an array of lists rather than an array of values, and simply append to the list every time there is a collision. When searching for a value, simply iterate over the list at that position and find the value you need.
Ideally, a good hashing algorithm is quick to compute and produces few collisions, and thus provides a huge performance boost.
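Put together, the scheme described above can be sketched as follows. TABLE_SIZE, table_put and table_get are names chosen for this example, the stored value is assumed to be an index into sounds[], and new entries are pushed onto the front of the bucket's list rather than appended, which makes no difference to lookups.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64

struct node {
    const char  *key;    /* sound label; the caller keeps the string alive */
    int          value;  /* assumed: an index into sounds[] */
    struct node *next;   /* next entry that landed in the same bucket */
};

static struct node *buckets[TABLE_SIZE];

/* position = SUM(string letters) % [array size] */
unsigned hash_sum(const char *s)
{
    unsigned sum = 0;
    while (*s)
        sum += (unsigned char)*s++;
    return sum % TABLE_SIZE;
}

/* Adds key -> value; returns 0 on success, -1 if out of memory. */
int table_put(const char *key, int value)
{
    unsigned pos = hash_sum(key);
    struct node *n = malloc(sizeof *n);
    if (!n)
        return -1;
    n->key = key;
    n->value = value;
    n->next = buckets[pos];   /* on a collision, chain onto the list */
    buckets[pos] = n;
    return 0;
}

/* Returns the stored value, or -1 if the key is absent. */
int table_get(const char *key)
{
    for (struct node *n = buckets[hash_sum(key)]; n; n = n->next) {
        if (strcmp(n->key, key) == 0)
            return n->value;
    }
    return -1;
}
```

Note that "SHOT01" and "SHOT10" contain the same letters and therefore collide under this hash, which is exactly the weakness of a plain sum; the list in each bucket keeps both retrievable.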
I hope this helps :)
You are right, when it comes to mapping objects with a set of string keys, hash tables are often the way to go.
I think this article on wikipedia is a good starting point to understand hash table mechanism: http://en.wikipedia.org/wiki/Hash_table

Hash Function Determination

How can we find the most efficient hash function (least possible chance of collision) for a given set of strings?
Suppose we are given some strings, and the length of the strings is not fixed.
Ajay
Vijay
Rakhi
....
We know the number of strings available, so we can design a hash table of that size. What would be the perfect hash function for such a problem?
Multiplying each character's ASCII value by 31 (a prime number) cumulatively leads to a hash value greater than MAX_INT, and then the modulus would not work properly. So please suggest an efficient way to build a hash function.
I have a small set of strings, let's say count = 10. I need to implement a hash function such that all 10 strings fit uniquely in the hash table. Is there any perfect O(1) hash function available for this kind of problem? The hash table size will be 10 in this case.
C programming only.
Please explain the logic at http://burtleburtle.net/bob/c/perfect.c
This looks very complicated but perfect to me! What is the algorithm used here? Reading the code directly is very difficult!
Thanks.
Check some of these out; they apparently have good distributions.
http://www.partow.net/programming/hashfunctions/#HashingMethodologies
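On the overflow worry raised in the question: in C, arithmetic on unsigned types wraps modulo 2^N by definition, so accumulating the hash in an unsigned variable sidesteps the MAX_INT problem, and the final modulus behaves normally. The sketch below shows only that point; hash31 and slot31 are illustrative names, and nothing here makes the function collision-free for a particular set of ten strings.

```c
#include <stddef.h>

/* Multiply-by-31 string hash accumulated in an unsigned variable.
   Unsigned arithmetic wraps around instead of overflowing, so long
   strings are harmless. */
unsigned hash31(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return h;
}

/* Reduces the hash to a slot in [0, table_size - 1]. */
size_t slot31(const char *s, size_t table_size)
{
    return hash31(s) % table_size;
}
```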
You might want to look into perfect hashing.
You might want to have a look at gperf. You could kind of do this on the fly if you didn't do it too often and your data set is small. If the strings are known ahead of time, then this is the method.
Hash tables are meant to be able to handle dynamic input. If you can guarantee only a particular set of inputs, and you want to guarantee a particular slot for each input, why hash at all?
Just make an array indexed for each known available input.

How to design a hashfunction that is scalable to exactly n elements?

I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to have O(1) lookups, which would be rendered impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:
for ( i = 0; i < len; i++ )
    h = 33 * h + p[i];
To compress the resulting h into the range [0, n-1], I would like to simply do h%n, but I suspect that this will lead to a much higher probability of clashes in a way that would essentially render my hash useless.
My question then, is how can I hash either the string or the resulting hash so that the n elements provide a relatively uniform distribution over [0, n-1]?
It's not enough to know n. Allocation of an item to a bucket is a function of the item itself so, if you want a perfect hash function (one item per bucket), you need to know the data.
In any case, if you're limiting the number of elements to a known n, you're already technically O(1) lookup. The upper bound will be based on the constant n. This would be true even for a non-hash solution.
Your best bet is to probably just use the hash function you have and have each bucket be a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly minimising the time taken.
Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.
If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h%n but that's going to take an awful lot of storage :-)
My advice is to go the good-enough hash with linked list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population but even the hash you have (which is ideal for all letters having the same frequency) should be adequate.
And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia [a]), you'll end up with more collisions.
But keep in mind that it really depends on the n that you're using.
If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that (or a balanced binary tree) won't give you enough performance.
A case in point: we have some code which searches through problem dockets looking for names of people that left comments (so we can establish the last member on our team who responded). There are only ever about ten or so members in our team, so we just use a sequential search for them; the performance improvement from using a faster data structure was deemed not worth the trouble.
[a] No offence intended. I just remember the humorous article a long time ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".
What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.
The gperf program generates C code for perfect hashes.
It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might fit your needs. Disclaimer: I've never used it.
The optimal algorithm for mapping n strings to integers 1-n is to build a DFA where the terminating states are the integers 1-n. (I'm sure someone here will step up with a fancy name for this...but in the end it's all DFA.) Size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).
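One simple way to realise such a DFA is a byte-wise trie, where every node is a state and the terminating states carry the integers 1-n. The sketch below makes that assumption and uses a full 256-entry transition table per state, which is the fast but memory-hungry end of the size/speed tradeoff; struct state, dfa_new, dfa_add and dfa_lookup are names invented for this example.

```c
#include <stdlib.h>

struct state {
    struct state *next[256];  /* one transition per byte value */
    int id;                   /* 1..n in a terminating state, 0 otherwise */
};

/* Allocates an empty state; returns NULL if out of memory. */
struct state *dfa_new(void)
{
    return calloc(1, sizeof(struct state));
}

/* Adds string `s` with the given id (>= 1) to the automaton at `root`.
   Returns 0 on success, -1 if out of memory. */
int dfa_add(struct state *root, const char *s, int id)
{
    struct state *st = root;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (!st->next[c] && !(st->next[c] = dfa_new()))
            return -1;
        st = st->next[c];
    }
    st->id = id;
    return 0;
}

/* Follows one transition per byte; returns the id stored for `s`,
   or 0 if `s` was never added. */
int dfa_lookup(const struct state *root, const char *s)
{
    const struct state *st = root;
    for (; *s && st; s++)
        st = st->next[(unsigned char)*s];
    return st ? st->id : 0;
}
```

A lookup takes one step per character of the name, regardless of how many names are stored, and two different names can never end in the same state, so there are no collisions to resolve.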

Resources