What hash function can I use for keywords? - c

I am working in C. To store a set of words for searching through them, I am told to save them in a hash table, and that it will reduce the time complexity to a constant.
Can someone help me out with the hash function? Also, if I have around 25 keywords, can I just make a table of size 25 and map each keyword to an index?

One option is to look for a perfect hash function: a hash function that produces no collisions for a fixed, known set of keys. The GNU tool gperf (not gprof) can automatically generate a perfect hash function from a set of strings. As others have pointed out, this is unlikely to give you a huge performance boost unless lookups are a large part of your program's running time, but it should speed up the lookups themselves.
Hope this helps!

At just 25 entries, a hash table won't bring you much benefit. Just do a linear search instead.

With only 25 strings to match, hashing won't add much efficiency. You could look into the Horspool algorithm for string matching; that should work well! And as Bo mentioned, you could store them in sorted order and do a binary search. Or you could store your keywords in a trie (something like a 26-ary tree) and search that. Hope this helps :)
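The sorted-array suggestion can be sketched with the standard library's bsearch(). This is only a minimal sketch: the keyword list and the names compare_keyword and is_keyword are made up for illustration, and the array has to be kept in strcmp() order by hand.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical keyword list; it must stay sorted in strcmp() order. */
static const char *const keywords[] = {
    "break", "case", "char", "const", "continue", "do", "else",
    "for", "if", "int", "return", "switch", "void", "while"
};
#define KEYWORD_COUNT (sizeof keywords / sizeof keywords[0])

/* bsearch() passes the key first and a pointer to an array element
   second. The key is a const char *, each element is a const char *,
   so the element pointer needs one extra dereference. */
static int compare_keyword(const void *key, const void *element)
{
    return strcmp((const char *)key, *(const char *const *)element);
}

/* Returns 1 if word is in the table, 0 otherwise. */
int is_keyword(const char *word)
{
    return bsearch(word, keywords, KEYWORD_COUNT,
                   sizeof keywords[0], compare_keyword) != NULL;
}
```

With 25 entries this needs at most about 5 string comparisons per lookup, and there is no hash function to tune.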

Related

Hash function for hash table with strings and integers as keys

I am looking for a good hash function that I can use in a hash table implementation. The thing is that I want to give both strings and integers as parameters (keys) to my hash function.
I have a txt file with ~500 records, and every one of them consists of integers and strings (max 15 chars). So what I want to do is pick one of these ints/strings and use it as the key for my hash function in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)
Use the integer value if it's present and reasonably well distributed; hash the string if it's not. An integer hash code is much cheaper to compute than a string's.
The algorithm has to be repeatable, obviously.
Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.
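For reference, here is a sketch of one of the Jenkins functions, the one-at-a-time hash, which works on any byte sequence. The two bucket helpers (bucket_for_string, bucket_for_int) are hypothetical wrappers added for illustration, not part of Jenkins' code; hashing the integer's bytes is just one possible way to handle integer keys.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bob Jenkins' one-at-a-time hash over an arbitrary byte sequence. */
uint32_t one_at_a_time(const unsigned char *key, size_t length)
{
    uint32_t hash = 0;
    for (size_t i = 0; i < length; i++) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

/* Hypothetical helper: bucket for a NUL-terminated string key. */
size_t bucket_for_string(const char *s, size_t bucket_count)
{
    return one_at_a_time((const unsigned char *)s, strlen(s)) % bucket_count;
}

/* Hypothetical helper: bucket for an integer key. Hashing the bytes
   spreads out clustered values; if the integers are already well
   distributed, value % bucket_count would be cheaper. */
size_t bucket_for_int(uint32_t value, size_t bucket_count)
{
    return one_at_a_time((const unsigned char *)&value, sizeof value)
           % bucket_count;
}
```

Both helpers assume bucket_count is nonzero.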
Since you want to hash using two keys, I presume the distribution improves using two keys.
For string hashing, I have had good results with PJW algorithm. Just google for "PJW Hash String". One variation here
To augment the hash with an integer, see here

How to search a big array for an object?

I had an interview today. I was asked how to search for a number inside an array, and I said binary search. He then asked about a big array that has thousands of objects (for example, stocks), searching for example by the price of the stocks. I said binary search again, and he said that sorting an array of thousands will take a lot of time before binary search can be applied.
Can you please bear with me and teach me how to approach this problem?
thanks
your help is appreciated.
I was asked a similar question. The twist was to search first in a sorted and then in an unsorted array. These were my answers, none of them accepted:
For the sorted array, I suggested finding the center and doing a linear search. Binary search will also work here.
For the unsorted array, I suggested linear search again.
Then I suggested binary search, which is kind of wrong.
I suggested storing the array in a hash set and using hashing. (Not accepted because of the high space complexity.)
I suggested a TreeSet, which is a red-black tree and quite good for lookup. (Not accepted because of the high space complexity.)
Copying into an ArrayList etc. was also considered overhead.
In the end I got negative feedback.
We may think that one of the above is the solution, but surely there is something special about linear search that I am missing.
Note that sorting before searching is also overhead, especially if you use any extra data structures in between.
Any comments welcome.
I am not sure what he had in mind.
If you just want to find the number one time, and you have no guarantees about whether the array is sorted, then I don't think you can beat linear search. On average you will need to seek halfway through the array before you find the value, i.e. expected running time O(N); when sorting you have to touch every single value at least once and probably more than that, i.e. expected running time O(N log N).
But if you need to find multiple values, then the time spent sorting pays off quickly. With a sorted array, you can binary search in O(log N) time instead of O(N), so once the number of searches is on the order of log N you are ahead if you invested the time to sort.
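A rough sketch of the two strategies in C, using plain int values to stand in for prices. All the function names here (linear_find, sort_values, sorted_contains) are made up for illustration; qsort() and bsearch() are the standard library routines.

```c
#include <stdlib.h>

/* Orders two ints without the overflow risk of returning a - b. */
static int compare_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* One-off lookup in unsorted data: O(N). Returns the index or -1. */
long linear_find(const int *values, size_t n, int target)
{
    for (size_t i = 0; i < n; i++)
        if (values[i] == target)
            return (long)i;
    return -1;
}

/* Pay O(N log N) once ... */
void sort_values(int *values, size_t n)
{
    qsort(values, n, sizeof values[0], compare_int);
}

/* ... then each lookup costs O(log N). Returns 1 if found, else 0. */
int sorted_contains(const int *values, size_t n, int target)
{
    return bsearch(&target, values, n, sizeof values[0], compare_int) != NULL;
}
```

For a single query, linear_find is all you need; sort_values only makes sense when many calls to sorted_contains follow.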
You can do even better if you are allowed to build different data structures to help with the problem. You could build some sort of index, such as a hash table; but the champion data structure for this sort of problem probably would be some sort of tree structure. Then you can insert new values into the tree faster than you could append new values and re-sort the array, and the lookup is still going to be O(log N) to find any value. There are different sorts of trees available: binary tree, B-tree, trie, etc.
But as @Hot Licks said, a hash table is often used for this sort of thing, and it's pretty cheap to update: you just append a value to the main array and update the hash table to point to the new value. And a hash table lookup is very close to O(1) time, which you can't beat. (A hash table is O(1) if there are no hash collisions; assuming a good hash algorithm and a big enough hash table, there will be almost no collisions. I think you could say that a lookup is O(k), where k is the average number of entries per "bucket". If I'm wrong about that I expect to be corrected very quickly; this is StackOverflow!)
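To make the "hash table pointing into the main array" idea concrete, here is a minimal sketch using open addressing with linear probing. Everything in it is an assumption for illustration: the fixed SLOT_COUNT, the struct and function names, the multiplicative mixer, and the use of int prices. It has no deletion and no resizing.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT 1024u        /* assumed: comfortably larger than the data */
#define EMPTY ((size_t)-1)      /* marks an unused slot / "not found" */

struct price_index {
    const int *prices;          /* the main array, owned by the caller */
    size_t slots[SLOT_COUNT];   /* each used slot holds an index into prices */
};

/* Multiplicative hashing with Knuth's constant as a simple integer mixer. */
static size_t slot_for(int price)
{
    return (((uint32_t)price * 2654435761u) >> 16) % SLOT_COUNT;
}

void index_init(struct price_index *ix, const int *prices)
{
    ix->prices = prices;
    for (size_t i = 0; i < SLOT_COUNT; i++)
        ix->slots[i] = EMPTY;
}

/* Record that prices[pos] exists. The caller must keep the table less
   than full, otherwise the probe loop would never terminate. */
void index_add(struct price_index *ix, size_t pos)
{
    size_t s = slot_for(ix->prices[pos]);
    while (ix->slots[s] != EMPTY)
        s = (s + 1) % SLOT_COUNT;
    ix->slots[s] = pos;
}

/* Returns an index into prices whose value equals price, or EMPTY. */
size_t index_find(const struct price_index *ix, int price)
{
    size_t s = slot_for(price);
    while (ix->slots[s] != EMPTY) {
        if (ix->prices[ix->slots[s]] == price)
            return ix->slots[s];
        s = (s + 1) % SLOT_COUNT;
    }
    return EMPTY;
}
```

If the main array is ever reallocated when a value is appended, ix->prices has to be updated to the new address before the next lookup.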
I think the interviewer wants you to analyze which algorithm you would use for different initial states of the array. Of course, you must know that you can build a hash table and then find the number in O(1), or, when the array is sorted (the time spent on sorting may be a concern), use binary search, or use some other data structure to do the job.

Linking filenames or labels to numeric index

In a C99+SDL game, I have an array that contains sound effects (SDL_mixer chunk data and some extra flags and filename string) and is referenced by index such as "sounds[2].data".
I'd like to be able to call sounds by filename, but I don't want to strcmp against the whole array until a match is found. This way, as I add more sounds, change the order, or allow for player-defined sound mods, they can still be called with a common identifier (such as "SHOT01" or "EXPL04").
What would be the fastest approach for this? I heard about hashing, which would result in something similar to Lua's string indexes (such as table["field"]), but I don't know anything about the topic, and it seems fairly complicated.
Just in case it matters, I plan to have the filenames or labels be 6 to 8 characters long, all caps (such as "SHOT01.wav").
So to summarize, where can I learn about hashing short strings like that, or what would be the fastest way to keep track of something like sound effects so they can be called using arbitrary labels or identifiers?
I think in your case you can probably just keep all the sounds in a sorted data structure and use a fast search algorithm to find matches. Something like a binary search is very simple to implement and gives good performance.
However, if you are interested in hash tables and hashing, the basics of it all are pretty simple. There is no place like Wikipedia to get the basics down and you can then tailor your searches better on Google to find more in depth articles.
The basics are you start out with a fixed size array and store everything in there. To figure out where to store something you take the key (in your case the sound name) and you perform some operation on it such that it gives you an exact location where the value can be found. So the simplest case for string hashing is just adding up all the letters in the string as integer values then take the value and use modulus to give you an index in your array.
position = SUM(string letters) % [array size]
Of course, multiple strings will have the same sum and thus give you the same position. This is called a collision, and collisions can be handled in many ways. The simplest way is to have an array of lists rather than an array of values, and simply add to the list every time there is a collision. When searching for a value, simply iterate over the list at that position and find the value you need.
Ideally, a good hashing algorithm is quick to compute and produces few collisions, which gives a large performance boost.
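The scheme described above (sum of letters, modulus, array of lists) could look roughly like this in C. It is a minimal sketch: TABLE_SIZE, struct entry, and the function names are all made up for illustration, labels are not copied, and nothing is ever freed. New entries are added at the front of a slot's list rather than the end, which is simpler and makes no difference to lookups.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64   /* assumed size; pick something larger than the sound count */

struct entry {
    const char *label;      /* e.g. "SHOT01"; not copied, must outlive the table */
    int sound_index;        /* index into the existing sounds[] array */
    struct entry *next;     /* next entry that landed in the same slot */
};

static struct entry *table[TABLE_SIZE];

/* position = SUM(string letters) % [array size] */
static unsigned position_of(const char *label)
{
    unsigned sum = 0;
    while (*label)
        sum += (unsigned char)*label++;
    return sum % TABLE_SIZE;
}

/* Returns 0 on success, -1 if memory could not be allocated. */
int table_put(const char *label, int sound_index)
{
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    unsigned pos = position_of(label);
    e->label = label;
    e->sound_index = sound_index;
    e->next = table[pos];   /* put the new entry at the front of the list */
    table[pos] = e;
    return 0;
}

/* Returns the stored index, or -1 if the label is unknown. */
int table_get(const char *label)
{
    for (struct entry *e = table[position_of(label)]; e != NULL; e = e->next)
        if (strcmp(e->label, label) == 0)
            return e->sound_index;
    return -1;
}
```

Note that "SHOT01" and "SHOT10" contain the same letters, so they sum to the same value and collide; the list in that slot is what keeps them apart. A hash that takes character order into account would avoid that particular collision.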
I hope this helps :)
You are right, when it comes to mapping objects with a set of string keys, hash tables are often the way to go.
I think this article on wikipedia is a good starting point to understand hash table mechanism: http://en.wikipedia.org/wiki/Hash_table

Hash Function Determination

How can we find the most efficient hash function (with the least possible chance of collision) for a set of strings?
Suppose we are given some strings, and the length of the strings is not fixed.
Ajay
Vijay
Rakhi
....
We know the number of strings available, so we can design a hash table of that size. What perfect hash function could we design for such a problem?
Multiplying each character's ASCII value by 31 (a prime number) in an incremental fashion leads to a hash value greater than INT_MAX, and then the modulus would not work properly. So please suggest an efficient way to build the hash function.
I have a small set of strings, let's say count = 10. I need to implement a hash function such that all 10 strings fit uniquely in the hash table. Is there any O(1) perfect hash function available for this kind of problem? The hash table size will be 10 in this case.
C only, please.
Please explain the logic at http://burtleburtle.net/bob/c/perfect.c
This looks very complicated, but perfect for me! What algorithm is used here? Reading the code directly is very difficult.
Thanks....
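On the overflow worry specifically: in C, arithmetic on unsigned types wraps around by definition, so a multiply-by-31 hash is safe as long as the accumulator is unsigned. A minimal sketch follows (hash31 is a made-up name). Note that this is an ordinary hash, not a perfect one: nothing guarantees that 10 strings land in 10 distinct slots of a size-10 table.

```c
#include <stddef.h>

/* Multiply-by-31 string hash. Unsigned arithmetic wraps around modulo
   2^N, so overflow is well defined and the final % always yields an
   index in [0, table_size). table_size must be nonzero. */
size_t hash31(const char *s, size_t table_size)
{
    unsigned long h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return (size_t)(h % table_size);
}
```

The problem in the question only arises with a signed accumulator, where overflow is undefined behavior and a negative intermediate value would give a negative remainder.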
Check some of these out; they apparently have good distributions.
http://www.partow.net/programming/hashfunctions/#HashingMethodologies
You might want to look into perfect hashing.
You might want to have a look at gperf. You could even do this on the fly if you don't do it too often and your data set is small. If the strings are known ahead of time, then this is the method.
Hash tables are meant to be able to handle dynamic input. If you can guarantee only a particular set of inputs, and you want to guarantee a particular slot for each input, why hash at all?
Just make an array indexed for each known available input.

Fast dictionary in C without linear search

How can I make a fast dictionary (String => Pointer and Int => Pointer) in C without a linear search? I need a few (or more) lines of code, not a library, and it must be possible to use it in closed-source software (LGPL, ...).
Use a Hash Table. A hash table will have a constant-time lookup. Here are some excerpts in C and an implementation in C (and Portuguese :).
You need to implement a Hash Table which stores objects using a hash code. The lookup time is constant.
A balanced binary search tree can look up an element in O(log n) time.
The Ternary Search Tree was born for this mission.
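A minimal sketch of a ternary search tree for the String => Pointer case, in a few dozen lines and with no library. The names (tst_node, tst_put, tst_get) are made up for illustration; this version does not support empty keys, NULL values, deletion, or freeing the tree.

```c
#include <stdlib.h>

struct tst_node {
    char ch;                        /* character stored at this node */
    void *value;                    /* non-NULL if a key ends here */
    struct tst_node *lo, *eq, *hi;  /* smaller char, next char, larger char */
};

/* Stores value under key. Returns 0 on success, -1 if the key is empty,
   the value is NULL, or memory could not be allocated. */
int tst_put(struct tst_node **root, const char *key, void *value)
{
    if (*key == '\0' || value == NULL)
        return -1;
    struct tst_node **link = root;
    for (;;) {
        if (*link == NULL) {            /* fell off the tree: grow it */
            struct tst_node *fresh = malloc(sizeof *fresh);
            if (fresh == NULL)
                return -1;
            fresh->ch = *key;
            fresh->value = NULL;
            fresh->lo = fresh->eq = fresh->hi = NULL;
            *link = fresh;
        }
        struct tst_node *n = *link;
        if (*key < n->ch)
            link = &n->lo;
        else if (*key > n->ch)
            link = &n->hi;
        else if (key[1] != '\0') {      /* matched, more characters to go */
            link = &n->eq;
            key++;
        } else {                        /* matched the last character */
            n->value = value;
            return 0;
        }
    }
}

/* Returns the stored pointer, or NULL if key is not present. */
void *tst_get(const struct tst_node *n, const char *key)
{
    if (*key == '\0')
        return NULL;
    while (n != NULL) {
        if (*key < n->ch)
            n = n->lo;
        else if (*key > n->ch)
            n = n->hi;
        else if (key[1] != '\0') {
            n = n->eq;
            key++;
        } else
            return n->value;
    }
    return NULL;
}
```

Start with `struct tst_node *root = NULL;` and pass `&root` to tst_put. The tree is not rebalanced, so inserting keys in sorted order gives a worse shape than inserting them in random order.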
If your strings will be long, you cannot consider the hash table constant time! The run time depends on the length of the string, and for long strings this will cause problems. Additionally, you have the problem of collisions with too small a table or too poor a hash function.
If you want to use hashing, please look at Karp-Rabin. If you want an algorithm dependent SOLELY upon the size of the word you are searching for, please look at Aho-Corasick.
