I'm trying to optimize a simple C interpreter that I made just for fun. I parse like this: first I split the file into tokens stored in a doubly linked list, then I do syntax and semantic analysis.
I want to optimize a function with this prototype:
bool parsed_keyword(struct token *, char dictionary[][]);
Inside the function I basically call strcmp against all keywords and set the token type.
This of course leads to about 20 strcmp calls for almost every string being parsed.
I was thinking Rabin-Karp would be best, but it sounds to me like it isn't well suited for this job (matching one word against a small dictionary).
What would be the best algorithm for this work? Thanks for any suggestions.
A hash table would probably be my choice for this particular problem. It will provide O(1) lookup for a table of your size. A trie would also be a good choice though.
But the simplest to implement would be to place your words in an array, sorted alphabetically, and then use bsearch from the C library. It should be almost as fast as a hash or trie, since you are only dealing with 30-some words. It might actually turn out to be faster than a hash table, since you won't have to compute a hash value.
Steve Jessop's idea is a good one: lay your strings out end to end in identically sized char arrays.
#include <stdlib.h>   /* bsearch */
#include <string.h>   /* strcmp */

/* The keyword table must be sorted for bsearch to work. */
const char keywords[][MAX_KEYWORD_LEN+1] = {
    "auto", "break", "case", /* ... */ "while"
};
#define NUM_KEYWORDS (sizeof(keywords)/sizeof(keywords[0]))

int keyword_cmp(const void *a, const void *b) {
    return strcmp(a, b);   /* a is the word being tested, b is a keyword slot */
}

const char *kw = bsearch(word, keywords, NUM_KEYWORDS, sizeof(keywords[0]),
                         keyword_cmp);
int kw_index = (kw ? (const char (*)[MAX_KEYWORD_LEN+1])kw - keywords : -1);
If you don't already have it, you should consider acquiring a copy of Compilers: Principles, Techniques, and Tools. Because of its cover, it is often referred to as The Dragon Book.
If you are looking for efficiency, I would say that Rabin-Karp is not your best bet; better results would come from Boyer-Moore, though it is a fair bit more difficult to implement.
If you are doing this for fun, honestly I don't think there is any need to optimize, as those calls should still run in a pretty short amount of time and you don't really need it to run at industry speed.
If you are looking to play around with string matching algorithms, which is a cool and useful goal, I would suggest looking into the KMP algorithm and the Boyer-Moore algorithm, both of which will teach you a lot during implementation.
There are of course other more straightforward methods, like dictionary lookups and simple binary search etc., but those don't really optimize for the fact that you are dealing with strings, and string comparison is a really interesting field that you will inevitably run into at some point.
Assuming your keywords aren't changing, this sounds like the right case for a perfect hash function. A perfect hash function maps inputs to integers (like a regular hash function), but with no collisions.
Wikipedia has links to several perfect hash generators, including GNU gperf.
The first thing that comes to mind for these lookups is to just use a sorted array of keywords and do a binary search on them.
If the set of keywords is fixed, you can use perfect hashing, for example with gperf. This requires only constant work per lookup and a single string comparison, so it is probably faster than the other approaches.
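As a rough sketch of how this might look (my assumption: by default gperf emits a lookup function called in_word_set that returns a non-null pointer on a match, but the exact name and signature depend on your gperf version and options). You list the keywords one per line in an input file, run something like gperf keywords.gperf > kw_hash.c, and call the generated function:

#include <stdbool.h>
#include <string.h>

/* Declared in the gperf-generated kw_hash.c (name and signature assumed). */
const char *in_word_set(const char *str, size_t len);

/* Hypothetical helper for the interpreter's keyword check. */
bool is_keyword(const char *word)
{
    return in_word_set(word, strlen(word)) != NULL;
}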
Suppose I have a very long string, such as a filepath, and I want to search for something in it. For example, something like the $ find command. It seems like a basic implementation of this would be along the lines of:
if(strstr(sent, word) != NULL) {
return 1;
}
Would there be any performance difference between doing that and something like Boyer Moore? Or does strstr already do something just as efficient?
Basically, I have about a billion very long strings, and I'm looking to do a fast(ish) find on them (without any indexing), based on the most efficient substring implementation. What should I use?
Update: To give a more concrete example, let's say I have a billion filepaths I want to search through:
/archive/1002/myfile.txt
/archive/1002/newer.mov
/user/tom/local_2014version1.mov
And from this I would search either one or more strings. Example samples would be:
"1002" // would return the first two fileds
"mov version tom" // would return the first row
Advanced search algorithms like Boyer-Moore and Aho-Corasick work by precomputing lookup tables from the string(s) to be searched for, which incurs a large start-up time. It's very unlikely that searching something as small as a pathname would be able to make up for that high overhead. You really have to be searching something like multi-page documents before those algorithms show their value.
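For strings as short as pathnames, a plain strstr loop is usually good enough. A minimal sketch of that idea, assuming every search term must appear in the path (the function name and the AND semantics are my assumptions):

#include <stdbool.h>
#include <string.h>

/* Returns true only if every term occurs somewhere in the path. */
bool path_matches(const char *path, const char *terms[], size_t nterms)
{
    for (size_t i = 0; i < nterms; i++)
        if (strstr(path, terms[i]) == NULL)
            return false;
    return true;
}

/* Example: terms = { "mov", "version", "tom" } matches
 * "/user/tom/local_2014version1.mov" but not the other two paths. */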
I'm familiar with the idea of a hash function but I'm unclear on how GLib's implementation is useful. I'll explain this with an example.
Suppose I have an expensive function that is recursive (somehow) on the positive real numbers in a weird way that depends on number theory (I'm a mathematician). Let's say I have an algorithm that needs to compute the function on some smallish-range of large numbers. Say [1000000000 - 1000999999].
I don't want to call my expensive function one million times, so I start memoizing values recursively. Then at each call I don't need to necessarily compute the whole function from scratch, I can hopefully remember any values of the function on the lower numbers (during my recursing) that I have already computed. Let's assume that the actual total number of calls at that first level of recursion is low. So that there are a lot of repeated values and memoizing actually saves you a lot of time.
This is my cartoony way of understanding why a hash table data structure is useful. What I don't get is how to do this without knowing exactly what keys I'll need in advance.
Since the recursive function is number theoretic, in general I don't know which values it will take over and over again. So I'd like to just throw these in a bucket (hash table) as they pop out of recursive calls to my function.
For GLib, it would seem that your (key, value) pairs are always pointers to data that you personally have to keep lying around somewhere. So if my function is computing a value for input x, I don't know how to tell if I've seen x before; g_hash_table_contains(), for example, needs a pointer, not the value x. So what's the use!?
I'm still learning so be kind. I'm familiar with coding in C, but haven't yet used hash tables in this language and I'm trying to do so and be adept at it with GLib but I just don't get this.
Let me take a stab at explaining it.
First of all, if we are using a hash map, we need a [key, value] pair as our input.
So as users of the hash map, we have to be creative about choosing the key, and that choice varies with the use case.
In your case, as far as I understand, you have a function that works on a range and gives you a result, and it uses memoization so that results of the smaller subproblems that make up the bigger problem can be reused.
So, for example, you can use a string as your key: the key "1000999999" may use the result for "1000999998", which may in turn use the result for "1000999997", and so on. If you do not find a result in the hash map, you calculate it and save it in the hash map.
In a nutshell, as users we need to be creative about choosing keys.
A good analogy is how you would go about choosing the primary key of a database table.
Another example to think about is how you would solve fibonacci(n) using a hash map.
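To make the fibonacci(n) idea concrete, here is a minimal sketch using GLib. My assumptions: the integer n itself is packed into the key pointer with GUINT_TO_POINTER, so no separate key storage is needed, and g_hash_table_lookup_extended distinguishes "not present" from a stored value.

#include <glib.h>
#include <stdio.h>

/* Memoized Fibonacci: the key is the integer n itself, disguised as a
 * pointer; g_direct_hash/g_direct_equal hash and compare that pointer value. */
static guint64 fib(guint n, GHashTable *memo)
{
    gpointer cached;

    if (n < 2)
        return n;
    if (g_hash_table_lookup_extended(memo, GUINT_TO_POINTER(n), NULL, &cached))
        return *(guint64 *)cached;              /* already computed */

    guint64 *result = g_new(guint64, 1);
    *result = fib(n - 1, memo) + fib(n - 2, memo);
    g_hash_table_insert(memo, GUINT_TO_POINTER(n), result);
    return *result;
}

int main(void)
{
    /* g_free releases the heap-allocated values when the table is destroyed */
    GHashTable *memo = g_hash_table_new_full(g_direct_hash, g_direct_equal,
                                             NULL, g_free);
    printf("fib(90) = %" G_GUINT64_FORMAT "\n", fib(90, memo));
    g_hash_table_destroy(memo);
    return 0;
}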
In a C99+SDL game, I have an array that contains sound effects (SDL_mixer chunk data and some extra flags and filename string) and is referenced by index such as "sounds[2].data".
I'd like to be able to call sounds by filename, but I don't want to strcmp my way through the whole array until a match is found. That way, as I add more sounds, change the order, or allow for player-defined sound mods, they can still be called with a common identifier (such as "SHOT01" or "EXPL04").
What would be the fastest approach for this? I heard about hashing, which would result in something similar to Lua's string indexes (such as table["field"]), but I don't know anything about the topic and it seems fairly complicated.
Just in case it matters, I plan to have the labels be 6-to-8-character, all-caps filenames (such as "SHOT01.wav").
So to summarize, where can I learn about hashing short strings like that, or what would be the fastest way to keep track of something like sound effects so they can be called using arbitrary labels or identifiers?
I think in your case you can probably just keep all the sounds in a sorted data structure and use a fast search algorithm to find matches. Something like a binary search is very simple to implement and gives good performance.
However, if you are interested in hash tables and hashing, the basics of it all are pretty simple. There is no place like Wikipedia to get the basics down and you can then tailor your searches better on Google to find more in depth articles.
The basics are: you start out with a fixed-size array and store everything in there. To figure out where to store something, you take the key (in your case the sound name) and perform some operation on it that gives you an exact location where the value can be found. The simplest case for string hashing is just adding up all the letters of the string as integer values, then taking that sum modulo the array size to get an index into your array.
position = SUM(string letters) % [array size]
Of course, multiple strings will naturally have the same sum and thus give you the same position. This is called a collision, and collisions can be handled in many ways. The simplest way is to have an array of lists rather than an array of values, and simply append to the list every time there is a collision. When searching for a value, you iterate the list at that position and find the value you need.
Ideally a good hashing algorithm will have few collisions and be quick to compute, providing a big performance boost.
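A minimal sketch of that chained approach, assuming the naive sum-of-letters hash described above and labels short enough to fit in a small fixed buffer (the function names, the 64-bucket count, and the 16-byte name limit are all arbitrary choices of mine):

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64            /* arbitrary fixed bucket count */

struct entry {
    char name[16];               /* e.g. "SHOT01.wav" */
    int  sound_index;            /* index into your sounds[] array */
    struct entry *next;          /* chain of colliding entries */
};

static struct entry *buckets[TABLE_SIZE];

/* The naive "sum of letters modulo table size" hash described above. */
static unsigned hash(const char *s)
{
    unsigned sum = 0;
    while (*s)
        sum += (unsigned char)*s++;
    return sum % TABLE_SIZE;
}

void put(const char *name, int sound_index)
{
    struct entry *e = malloc(sizeof *e);
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    e->sound_index = sound_index;
    unsigned h = hash(name);
    e->next = buckets[h];        /* prepend on collision */
    buckets[h] = e;
}

int get(const char *name)
{
    for (struct entry *e = buckets[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->sound_index;
    return -1;                   /* not found */
}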
I hope this helps :)
You are right, when it comes to mapping objects with a set of string keys, hash tables are often the way to go.
I think this article on wikipedia is a good starting point to understand hash table mechanism: http://en.wikipedia.org/wiki/Hash_table
I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to get O(1) lookups, which would be rendered impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:
for ( i = 0; i < len; i++ )
    h = 33 * h + p[i];
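For reference, the full djb2 hash as it is commonly written (the customary seed is 5381; the function name here is just my label):

#include <stddef.h>

unsigned long djb2(const unsigned char *p, size_t len)
{
    unsigned long h = 5381;      /* customary djb2 starting value */
    for (size_t i = 0; i < len; i++)
        h = 33 * h + p[i];       /* equivalent to ((h << 5) + h) + p[i] */
    return h;
}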
To compress the resulting h into the range [0, n), I would like to simply do h % n, but I suspect that this will lead to a much higher probability of clashes, in a way that would essentially render my hash useless.
My question, then, is: how can I hash either the string or the resulting hash so that the n elements give a relatively uniform distribution over [0, n)?
It's not enough to know n. Allocation of an item to a bucket is a function of the item itself so, if you want a perfect hash function (one item per bucket), you need to know the data.
In any case, if you're limiting the number of elements to a known n, you already technically have O(1) lookup. The upper bound will be based on the constant n. This would be true even for a non-hash solution.
Your best bet is probably to just use the hash function you have and make each bucket a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly minimising the time taken.
Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.
If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h%n but that's going to take an awful lot of storage :-)
My advice is to go the good-enough hash with linked list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population but even the hash you have (which is ideal for all letters having the same frequency) should be adequate.
And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia [a]), you'll end up with more collisions.
But keep in mind that it really depends on the n that you're using.
If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that it (or a balanced binary tree) won't give you enough performance.
A case in point: we have some code which searches through problem dockets looking for the names of people that left comments (so we can establish the last member of our team who responded). There are only ever about ten members on our team, so we just use a sequential search for them; the performance improvement from using a faster data structure was deemed not worth the trouble.
[a] No offence intended. I just remember a humorous article from a long time ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".
What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.
The gperf program generates C code for perfect hashes.
It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might fit your needs. Disclaimer: I've never used it.
The optimal algorithm for mapping n strings to integers 1-n is to build a DFA where the terminating states are the integers 1-n. (I'm sure someone here will step up with a fancy name for this...but in the end it's all DFA.) Size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).
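A minimal sketch of that idea as a byte-wise trie, which is one way to realise such a DFA (the struct layout and the 0-for-"not found" convention are my choices, and it trades a lot of memory for speed):

#include <stdlib.h>

/* One node per DFA state; `id` is the 1-based string number at
 * terminating states, 0 everywhere else. */
struct node {
    struct node *next[256];
    int id;
};

static struct node *node_new(void)
{
    return calloc(1, sizeof(struct node));
}

void trie_insert(struct node *root, const char *s, int id)
{
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (!root->next[c])
            root->next[c] = node_new();
        root = root->next[c];
    }
    root->id = id;               /* terminating state carries the integer */
}

int trie_lookup(const struct node *root, const char *s)
{
    for (; *s && root; s++)
        root = root->next[(unsigned char)*s];
    return root ? root->id : 0;  /* 0 means "not one of the n strings" */
}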
How can I make a fast dictionary (String => Pointer and Int => Pointer) in C without a linear search? I need a few (or more) lines of code, not a library, and it must be possible to use it in closed-source software (LGPL, ...).
Use a Hash Table. A hash table will have a constant-time lookup. Here are some excerpts in C and an implementation in C (and Portuguese :).
You need to implement a Hash Table which stores objects using a hash code. The lookup time is constant.
A binary tree can be traversed to look up an element in log(n) time.
The Ternary Search Tree was born for this mission.
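A minimal ternary search tree sketch along the Bentley-Sedgewick lines, mapping non-empty C strings to void* values (the function names are mine, and there is no error handling):

#include <stdlib.h>

struct tst_node {
    char c;
    struct tst_node *lo, *eq, *hi;
    void *value;                 /* non-NULL only at the end of a stored key */
};

static struct tst_node *tst_insert(struct tst_node *n, const char *key, void *value)
{
    if (!n) {
        n = calloc(1, sizeof *n);
        n->c = *key;
    }
    if (*key < n->c)
        n->lo = tst_insert(n->lo, key, value);
    else if (*key > n->c)
        n->hi = tst_insert(n->hi, key, value);
    else if (key[1] != '\0')
        n->eq = tst_insert(n->eq, key + 1, value);
    else
        n->value = value;
    return n;
}

static void *tst_lookup(const struct tst_node *n, const char *key)
{
    while (n) {
        if (*key < n->c)
            n = n->lo;
        else if (*key > n->c)
            n = n->hi;
        else if (key[1] != '\0') {
            n = n->eq;
            key++;
        } else
            return n->value;     /* NULL means "not found" */
    }
    return NULL;
}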
If your strings are long, you cannot consider the hash table constant time! The run time depends on the length of the string, so long strings will cause problems. Additionally, you have the problem of collisions with too small a table or too poor a hash function.
If you want to use hashing, please look at Rabin-Karp. If you want an algorithm dependent solely upon the size of the word you are searching for, please look at Aho-Corasick.