Hash function for hash table with strings and integers as keys - c

i am in search for a good Hash function which i can use in Hash table implementation. The thing is that i want to give both strings and integers as parameters(keys) in my hash function.
i have a txt file with ~500 data and every one of them consists of integers and strings(max 15 chars). So, the thing that i want to do is to pick one of these ints/strings and use it as a key for my hash function in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)

Use the Integer value if that's present & reasonably well distributed, then hash the String if it's not. Integer hashcode is much cheaper to compute than String.
The algorithm has to be repeatable, obviously.

Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.

Since you want to hash using two keys, I presume the distribution improves using two keys.
For string hashing, I have had good results with PJW algorithm. Just google for "PJW Hash String". One variation here
To augment the hash with an integer, see here

Related

Data Structure to do lookup on large number

I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, i need to return some other data structure. My question is that what data structure should i use to effectively hold this?
I would have used an array giving me O(1) lookup if the numbers were in the range say, 1 to 5000. But when my input number goes large, it becomes unrealistic to use an array as the memory requirements would be huge.
I am hence trying to look at a data structure that yields the result fast and is not very heavy.
Any clues anybody?
EDIT:
It would not make sense to use an array since i may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements, (200), that will be very fast.
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.

What hash function can I use for keywords?

I am working in C. To store a set of words for searching through them, I am told to save them in a hash table, and that it will reduce the time complexity to a constant.
Can someone help me out with the hash function? Also, if I have around 25 keywords, can I just make a table of size 25 and map each keyword to an index?
One option is to look for a perfect hash function, a hash function for which collisions don't exist. The Linux tool gperf (not gprof) can be used to automatically generate a perfect hash function from a set of strings. As others have pointed out this is unlikely to give you a huge performance boost unless lookup times are a large part of your program, but it should speed up the lookups.
Hope this helps!
At just 25 entries, a hash table won't bring you much benefit. Just do a linear search instead.
At just 25 strings to match, hashing won't add up to the efficiency. You could look into Horspool Algorithm for string matching, that should work well! And as Bo mentioned you could store them in a sorted order and do a binary search. Or you could store your keywords in a Trie data structure (something like 26-ary tree) to search for words. Hope this helps :)

Linking filenames or labels to numeric index

In a C99+SDL game, I have an array that contains sound effects (SDL_mixer chunk data and some extra flags and filename string) and is referenced by index such as "sounds[2].data".
I'd like to be able to call sounds by filename, but I don't want to strcmp all the array until a match is found. This way as I add more sounds, or change the order, or allow for player-defined sound mods, they can still be called with a common identifier (such as "SHOT01" or "EXPL04").
What would be the fastest approach for this? I heard about hashing, which would result in something similar to lua's string indexes (such as table["field"]) but I don't know anything about the topic, and seems fairly complicated.
Just in case it matters, I plan to have filenames or labels be from 6 to 8 all caps filenames (such as "SHOT01.wav").
So to summarize, where can I learn about hashing short strings like that, or what would be the fastest way to keep track of something like sound effects so they can be called using arbitrary labels or identifiers?
I think in your case you can probably just keep all the sounds in a sorted data structure and use a fast search algorithm to find matches. Something like a binary search is very simple implement and it gives good performance.
However, if you are interested in hash tables and hashing, the basics of it all are pretty simple. There is no place like Wikipedia to get the basics down and you can then tailor your searches better on Google to find more in depth articles.
The basics are you start out with a fixed size array and store everything in there. To figure out where to store something you take the key (in your case the sound name) and you perform some operation on it such that it gives you an exact location where the value can be found. So the simplest case for string hashing is just adding up all the letters in the string as integer values then take the value and use modulus to give you an index in your array.
position = SUM(string letters) % [array size]
Of course naturally multiple strings will have same sum and thus give you the same position. This is called a collision, and collisions can be handled in many ways. The simplest way is to have an array of lists rather than array of values, and simply append to the list every there there is a collision. When searching for a value, simply iterate the lists and find the value you need.
Ideally a good hashing algorithm will have few collisions and quick hashing algorithm thus providing huge performance boost.
I hope this helps :)
You are right, when it comes to mapping objects with a set of string keys, hash tables are often the way to go.
I think this article on wikipedia is a good starting point to understand hash table mechanism: http://en.wikipedia.org/wiki/Hash_table

Hash Function Determination

How can we find the most efficient hash function(least possible chances of collision) for the set of strings.
Suppose we are given with some strings.. And the length of the strings is also not defined.
Ajay
Vijay
Rakhi
....
we know the count of no. of strings available, so we can design a hash table of size(count available). what could be the perfect hash function that we could design for such problem??
Multiplying each character ascii value by 31(prime no.) in increment fashion leads to the a hash value greater than the value of MAX_INT, and then modulus would not work properly... So please give some efficient hash function build up solution....
I have few set of strings,, lets say count = 10.... I need to implement a hash function such that all those 10 strings fit in uniquely in the hash table.... Any perfect hash function O(1) available, for this kind of problem?? hash table size will be 10, for this case...
Only C Programming...
Please explain the logic at website.... http://burtleburtle.net/bob/c/perfect.c
This looks very complicated but perfect to me..!! what is the algorithm used here... Reading the code straight away, is very difficult!!
Thanks....
Check some of these out, they apparantly have good distributions
http://www.partow.net/programming/hashfunctions/#HashingMethodologies
You might want to look into perfect hashing.
you might want to have a look at gperf, you could kinda do this on the fly if you didn't do it too often and your data set a small. if the strings are know ahead of time, then this is the method
Hash tables are meant to be able to handle dynamic input. If you can guarantee only a particular set of inputs, and you want to guarantee a particular slot for each input, why hash at all?
Just make an array indexed for each known available input.

The optimum* way to do a table-lookup-like function in C?

I have to do a table lookup to translate from input A to output A'. I have a function with input A which should return A'. Using databases or flat files are not possible for certain reasons. I have to hardcode the lookup in the program itself.
What would be the the most optimum (*space-wise and time-wise separately): Using a hashmap, with A as the key and A' as the value, or use switch case statements in the function?
The table is a string to string lookup with a size of about 60 entries.
If speed is ultra ultra necessary, then I would consider perfect hashing. Otherwise I'd use an array/vector of string to string pairs, created statically in sort order and use binary search. I'd also write a small test program to check the speed and memory constraints were met.
I believe that both the switch and the table-look up will be equivalent (although one should do some tests on the compiler being used). A modern C compiler will implement a big switch with a look-up table. The table look-up can be created more easily with a macro or a scripting language.
For both solutions the input A must be an integer. If this is not the case, one solution will be to implement a huge if-else statement.
If you have strings you can create two arrays - one for input and one for output (this will be inefficient if they aren't of the same size). Then you need to iterate the contents of the input array to find a match. Based on the index you find, you return the corresponding output string.
Make a key that is fast to calculate, and hash
If the table is pretty static, unlikely to change in future, you could have a look-see if adding a few selected chars (with fix indexes) in the "key" string could get unique values (value K). From those insert the "value" strings into a hash_table by using the pre-calculated "K" value for each "key" string.
Although a hash method is fast, there is still the possibility of collision (two inputs generating the same hash value). A fast method depends on the data type of the input.
For integral types, the fastest table lookup method is an array. Use the incoming datum as an index into the array. One of the problems with this method is that the array must account for the entire spectrum of values for the fastest speed. Otherwise execution is slowed down by translating the original index into an index for the array (kind of like a hashing method).
For string input types, a nested look up may be the fastest. One example is to break up tables by length. The first array returns pointers to the table to search based on length, e.g. char * sub_table = First_Array[5] for a string of length 5. These can be configured for specialized input data.
Another method is to use a B-Tree, which is a binary tree of "pages". Behavior is similar to nested arrays.
If you let us know the input type, we can better answer your question.

Resources