Fast dictionary in C without linear search - c

How can I make a fast dictionary (String => Pointer and Int => Pointer) in C without a linear search? I need a few (or more) lines of code, not a library, and it must be possible to use it in closed-source software (LGPL, ...).

Use a hash table. A hash table gives constant-time lookup on average. Here are some excerpts in C and an implementation in C (and Portuguese :).
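For a sense of what that can look like, here is a minimal sketch of a fixed-size, chained hash table mapping strings to pointers; the table size, hash function, and names are illustrative rather than taken from the linked code:

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024          /* illustrative; pick to suit your data */

struct entry {
    char *key;
    void *value;
    struct entry *next;          /* chaining handles collisions */
};

static struct entry *table[TABLE_SIZE];

/* Simple djb2-style string hash, reduced to a bucket index. */
static unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insert a key (no duplicate check, no resizing, no delete shown). */
void dict_put(const char *key, void *value) {
    unsigned long i = hash_str(key);
    struct entry *e = malloc(sizeof *e);
    e->key = strdup(key);        /* strdup is POSIX; copy manually if absent */
    e->value = value;
    e->next = table[i];
    table[i] = e;
}

/* Expected O(1) lookup: hash, then walk the short collision chain. */
void *dict_get(const char *key) {
    for (struct entry *e = table[hash_str(key)]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;                 /* not found */
}

An Int => Pointer table works the same way with a trivial hash such as key % TABLE_SIZE.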

You need to implement a hash table which stores objects using a hash code. The lookup time is constant on average.
A balanced binary tree can look up an element in O(log n) time.

The Ternary Search Tree was born for this mission.
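A sketch of a TST lookup in C, following the Bentley-Sedgewick formulation; the struct layout and names are illustrative, and insertion is omitted:

/* Ternary search tree node: value is non-NULL only on nodes that
 * complete a stored key (assumed convention for this sketch). */
struct tst_node {
    char splitchar;
    struct tst_node *lo, *eq, *hi;
    void *value;
};

/* Look up a non-empty NUL-terminated key; returns its value or NULL.
 * Each step goes left/right on a mismatch or down one character on a
 * match, so the cost is roughly the key length plus a few comparisons. */
void *tst_get(const struct tst_node *p, const char *key) {
    while (p) {
        if (*key < p->splitchar)
            p = p->lo;
        else if (*key > p->splitchar)
            p = p->hi;
        else if (key[1] == '\0')
            return p->value;      /* matched the final character */
        else {
            key++;
            p = p->eq;
        }
    }
    return NULL;
}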

If your strings can be long, you cannot treat the hash table as constant time: the running time depends on the length of the string, since the hash has to read every character. For long strings, this will cause problems. Additionally, you have the problem of collisions with too small a table or too poor a hash function.
If you want to use hashing, please look at Rabin-Karp. If you want an algorithm dependent solely on the size of the word you are searching for, please look at Aho-Corasick.

Related

Hash function for hash table with strings and integers as keys

I am in search of a good hash function which I can use in a hash table implementation. The thing is that I want to give both strings and integers as parameters (keys) to my hash function.
I have a txt file with ~500 records, and every one of them consists of integers and strings (max 15 chars). So, the thing that I want to do is to pick one of these ints/strings and use it as a key for my hash function in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)
Use the integer value if it's present and reasonably well distributed, and hash the string if it's not. An integer hash code is much cheaper to compute than a string one.
The algorithm has to be repeatable, obviously.
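A sketch of that idea in C; the record layout (has_int_key, int_key, str_key) is hypothetical and only illustrates preferring the integer key when it exists:

/* Hypothetical record: an optional integer key plus a short string key,
 * matching the ~15-char strings from the question. */
struct record {
    int  has_int_key;
    int  int_key;
    char str_key[16];            /* 15 chars + NUL */
};

/* Use the integer when present (cheap), otherwise hash the string. */
unsigned bucket_for(const struct record *r, unsigned nbuckets) {
    if (r->has_int_key)
        return (unsigned)r->int_key % nbuckets;
    unsigned h = 0;
    for (const char *s = r->str_key; *s; s++)
        h = h * 31 + (unsigned char)*s;
    return h % nbuckets;
}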
Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.
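For reference, the smallest member of that family is Jenkins' one-at-a-time hash; the function name here is mine, and for C strings you would pass strlen(key) as the length and reduce the result modulo your table size:

#include <stddef.h>
#include <stdint.h>

/* Bob Jenkins' one-at-a-time hash. */
uint32_t one_at_a_time(const unsigned char *key, size_t len) {
    uint32_t hash = 0;
    for (size_t i = 0; i < len; i++) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}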
Since you want to hash using two keys, I presume the distribution improves when both keys are used.
For string hashing, I have had good results with the PJW algorithm. Just google for "PJW hash string"; one variation is sketched below.
To augment the hash with an integer, see here
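A sketch of the classic PJW string hash, as described in the Dragon Book; the constants assume an unsigned long of at least 32 bits, and the function name is mine:

/* PJW hash for strings (Aho/Sethi/Ullman); assumes >= 32-bit unsigned long. */
unsigned long pjw_hash(const char *s) {
    unsigned long h = 0, high;
    while (*s) {
        h = (h << 4) + (unsigned char)*s++;
        high = h & 0xF0000000UL;
        if (high != 0)
            h ^= high >> 24;
        h &= ~high;
    }
    return h;          /* reduce with h % table_size when indexing */
}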

most efficient way to compare short string with small dictionary (parsing)

I'm trying to optimize my simple C interpreter that I made just for fun. I do parsing like this: first I parse the file into tokens stored in a doubly linked list, then I do syntax and semantic analysis.
I want to optimize a function with this prototype:
bool parsed_keyword(struct token *, char dictionary[][]);
Inside the function I basically call strcmp against all keywords and edit the token type.
This of course leads to around 20 strcmp calls for (almost) each string that is parsed.
I was thinking Rabin-Karp would be best, but it sounds to me like it isn't well suited for this job (matching one word against a small dictionary).
What would be the best algorithm to do this work? Thanks for any suggestions.
A hash table would probably be my choice for this particular problem. It will provide O(1) lookup for a table of your size. A trie would also be a good choice though.
But, the simplest to implement would be to place your words in an array alphabetically, and then use bsearch from the C library. It should be almost as fast as a hash or trie, since you are only dealing with 30 some words. It might actually turn out to be faster than a hash table, since you won't have to compute a hash value.
Steve Jessop's idea is a good one: lay out your strings end to end in identically sized char arrays.
#include <stdlib.h>   /* bsearch */
#include <string.h>   /* strcmp */

/* MAX_KEYWORD_LEN is the length of the longest keyword; word is the
   NUL-terminated string being classified. */
const char keywords[][MAX_KEYWORD_LEN+1] = {
    "auto", "break", "case", /* ... */ "while"
};
#define NUM_KEYWORDS (sizeof(keywords)/sizeof(keywords[0]))

/* bsearch hands keyword_cmp the key and a pointer to one array element;
   both point at NUL-terminated strings, so strcmp compares them directly. */
int keyword_cmp(const void *a, const void *b) {
    return strcmp(a, b);
}

const char *kw = bsearch(word, keywords, NUM_KEYWORDS, sizeof(keywords[0]),
                         keyword_cmp);
int kw_index = kw ? (const char (*)[MAX_KEYWORD_LEN+1])kw - keywords : -1;
If you don't already have it, you should consider acquiring a copy of Compilers: Principles, Techniques, and Tools. Because of its cover, it is often referred to as The Dragon Book.
If you are looking for efficiency, I would say that Rabin-Karp is not your best bet; you would get better results with Boyer-Moore, though it is a fair bit more difficult to implement.
If you are doing this for fun, honestly I don't think there is any need to optimize, as those calls should still run in a pretty short amount of time and you don't really need it to run at industry speed.
If you are looking to play around with string matching algorithms, which is a cool and useful goal, I would suggest looking into the KMP algorithm and the Boyer-Moore algorithm, both of which will teach you a lot during implementation.
There are of course other more straightforward methods, like dictionary lookups and simple binary search etc., but those don't really exploit the fact that you are dealing with strings, and string comparison is a really interesting field that you will inevitably run into at some point.
Assuming your keywords aren't changing, this sounds like the right case for a perfect hash function. A perfect hash function maps inputs to integers (like a regular hash function), but with no collisions.
Wikipedia has links to several perfect hash generators, including GNU gperf.
The first thing that comes to mind when doing lookups is to just use a sorted array of keywords and do a binary search on them.
If the set of keywords is fixed, you can use perfect hashing, for example using gperf. This requires only constant work and a single string comparison, thus being probably faster than other approaches.
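As a rough sketch of the gperf workflow (the file name and invocation below are illustrative, and the generated names follow gperf's defaults, so check its manual): a keywords.gperf input file can be as small as the keyword list itself,

%%
auto
break
case
while

and running gperf keywords.gperf > keyword_hash.c emits a C hash function plus a lookup routine (in_word_set by default, if I recall correctly) that recognises a keyword with at most one string comparison.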

What hash function can I use for keywords?

I am working in C. To store a set of words so I can search through them, I am told to save them in a hash table, and that it will reduce the lookup time to a constant.
Can someone help me out with the hash function? Also, if I have around 25 keywords, can I just make a table of size 25 and map each keyword to an index?
One option is to look for a perfect hash function, a hash function for which collisions don't exist. The Linux tool gperf (not gprof) can be used to automatically generate a perfect hash function from a set of strings. As others have pointed out this is unlikely to give you a huge performance boost unless lookup times are a large part of your program, but it should speed up the lookups.
Hope this helps!
At just 25 entries, a hash table won't bring you much benefit. Just do a linear search instead.
At just 25 strings to match, hashing won't add much efficiency. You could look into the Horspool algorithm for string matching; that should work well! And as Bo mentioned, you could store them in sorted order and do a binary search. Or you could store your keywords in a trie data structure (something like a 26-ary tree) to search for words. Hope this helps :)

Hash Function Determination

How can we find the most efficient hash function (one with the least possible chance of collision) for a set of strings?
Suppose we are given some strings, and the lengths of the strings are not fixed:
Ajay
Vijay
Rakhi
....
We know the count of strings available, so we can design a hash table of that size. What would the perfect hash function be for such a problem?
Multiplying each character's ASCII value by 31 (a prime) in an incremental fashion leads to a hash value greater than INT_MAX, and then the modulus does not work properly... So please suggest an efficient way to build the hash function.
I have a small set of strings, let's say count = 10. I need to implement a hash function such that all 10 strings fit uniquely into the hash table. Is there any perfect O(1) hash function available for this kind of problem? The hash table size will be 10 in this case.
Only C programming...
Please explain the logic used at this website: http://burtleburtle.net/bob/c/perfect.c
It looks very complicated but perfect to me..!! What algorithm is used there? Reading the code straight away is very difficult!!
Thanks....
Check some of these out; they apparently have good distributions:
http://www.partow.net/programming/hashfunctions/#HashingMethodologies
You might want to look into perfect hashing.
You might want to have a look at gperf; you could kind of do this on the fly if you don't do it too often and your data set is small. If the strings are known ahead of time, then this is the method.
Hash tables are meant to be able to handle dynamic input. If you can guarantee only a particular set of inputs, and you want to guarantee a particular slot for each input, why hash at all?
Just make an array indexed for each known available input.

How to design a hash function that is scalable to exactly n elements?

I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to have O(1) lookups, which would be rendered impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:
for (i = 0; i < len; i++)
    h = 33 * h + p[i];
To compress the resulting h into the range [0, n-1], I would like to simply do h % n, but I suspect that this will lead to a much higher probability of collisions in a way that would essentially render my hash useless.
My question, then, is how can I hash either the string or the resulting hash so that the n elements give a relatively uniform distribution over [0, n-1]?
It's not enough to know n. Allocation of an item to a bucket is a function of the item itself so, if you want a perfect hash function (one item per bucket), you need to know the data.
In any case, if you're limiting the number of elements to a known n, you're already technically O(1) lookup. The upper bound will be based on the constant n. This would be true even for a non-hash solution.
Your best bet is to probably just use the hash function you have and have each bucket be a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly minimising the time taken.
Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.
If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h%n but that's going to take an awful lot of storage :-)
My advice is to go the good-enough hash with linked list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population but even the hash you have (which is ideal for all letters having the same frequency) should be adequate.
And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia [a]), you'll end up with more collisions.
But keep in mind that it really depends on the n that you're using.
If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that a sequential search (or a balanced binary tree) won't give you enough performance.
A case in point: we have some code which searches through problem dockets looking for names of people that left comments (so we can establish the last member on our team who responded). There's only ever about ten or so members in our team so we just use a sequential search for them - the performance improvement from using a faster data structure was deemed too much trouble.
[a] No offence intended. I just remember the humorous article a long time ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".
What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.
The gperf program generates C code for perfect hashes.
It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might fit your needs. Disclaimer: I've never used it.
The optimal algorithm for mapping n strings to integers 1-n is to build a DFA where the terminating states are the integers 1-n. (I'm sure someone here will step up with a fancy name for this...but in the end it's all DFA.) Size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).
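A minimal sketch of that idea as a byte-at-a-time trie (a DFA over bytes) where each accepted word stores its 1..n id; the names and the 256-way fanout are illustrative choices for this sketch:

#include <stdlib.h>

/* One DFA/trie state: a transition per possible byte, plus the id (1..n)
 * if a word terminates here, or 0 otherwise. The root must be a
 * zero-initialized state (e.g. from calloc). */
struct state {
    struct state *next[256];
    int id;
};

/* Follow the transitions; return the word's id, or 0 if it is unknown. */
int dfa_lookup(const struct state *s, const char *word) {
    for (; s && *word; word++)
        s = s->next[(unsigned char)*word];
    return s ? s->id : 0;
}

/* Build the DFA by inserting each word with its id (1..n). */
void dfa_insert(struct state *s, const char *word, int id) {
    for (; *word; word++) {
        unsigned char c = (unsigned char)*word;
        if (!s->next[c])
            s->next[c] = calloc(1, sizeof *s->next[c]);
        s = s->next[c];
    }
    s->id = id;
}

Using half-bytes or bits instead of whole bytes shrinks the per-state transition table at the cost of more steps per lookup, which is the size/speed tradeoff mentioned above.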
