I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.
To simply sum the letters is not a good strategy because a permutation gives the same result.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
More info here.
If you need more alternatives and some performance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps for some very general assumptions, e.g. the above works slightly better with ASCII input), which is the most usual scenario. If you have a known restricted domain (a fixed set of inputs) you can do better; see Fionn's answer.
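To make the permutation problem concrete, here is a minimal sketch that compares an additive hash with djb2 on two anagrams and shows the usual reduction to a table index. The names `sum_hash`, `djb2_hash` and `bucket_for` are illustrative only and do not come from the answers above.

```c
#include <stddef.h>

/* Additive hash: letter order does not matter, so anagrams always collide. */
unsigned long sum_hash(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h;
}

/* djb2 as shown above: hash * 33 + c, starting from 5381. */
unsigned long djb2_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = ((h << 5) + h) + (unsigned char)*s++;
    return h;
}

/* Reduce a hash value to a bucket index; table_size must be nonzero. */
size_t bucket_for(unsigned long hash, size_t table_size)
{
    return (size_t)(hash % table_size);
}
```

With these definitions, `sum_hash("listen")` equals `sum_hash("silent")`, while `djb2_hash` gives the two words different values because each character is weighted by its position.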
Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates an optimized hashing function for the input domain.
If you don't need it be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast and has high diffusion. Easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
If you do need a cryptographically secure hash, then I suggest SHA-256 via OpenSSL (SHA-1, the older choice, is no longer considered collision resistant).
http://www.openssl.org/docs/crypto/sha.html
A bit late, but here is a hashing function with an extremely low collision rate for 64-bit version below, and ~almost~ as good for the 32-bit version:
uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
union { uint64_t h; uint8_t u[8]; } uu;
int i=0; uu.h=strlen(s);
while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
return uu.h; //64-bit
//return (uu.h+(uu.h>>32)); //32-bit
}
The hash-numbers are also very evenly spread across the possible range, with no clumping that I could detect - this was checked using the random strings only.
[edit]Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)
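Collision counts like the ones quoted above can be reproduced with a small harness. The sketch below is one possible way to do it, not the method the author used: it hashes every word, sorts the hash values and counts adjacent duplicates. `count_collisions`, `hash_fn` and the deliberately weak `byte_sum_hash` are hypothetical names introduced for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t (*hash_fn)(const char *);

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Counts hash values that repeat an earlier one. Assumes the words are
   distinct. Returns -1 if memory cannot be allocated. */
long count_collisions(const char *const *words, size_t n, hash_fn hash)
{
    if (n == 0)
        return 0;
    uint64_t *h = malloc(n * sizeof *h);
    if (h == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        h[i] = hash(words[i]);
    qsort(h, n, sizeof *h, cmp_u64);   /* equal hashes become neighbours */
    long collisions = 0;
    for (size_t i = 1; i < n; i++)
        if (h[i] == h[i - 1])
            collisions++;
    free(h);
    return collisions;
}

/* A deliberately weak hash to plug in: the plain sum of the bytes. */
uint64_t byte_sum_hash(const char *s)
{
    uint64_t h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h;
}
```

Any function with the `hash_fn` signature can be passed in, so the same word list can be run through several candidates for comparison.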
I'm reviewing the security of an app for a University project, the app encrypts a file using RSA, specifically it uses this library: https://github.com/ilansmith/rsa (DO NOT use this, it has serious vulnerabilities).
(If you want to take a look, most of the operations between these numbers are implemented in the rsa_num.c file.)
This tool uses arrays of unsigned long long to store the big numbers needed for RSA (n, e and d):
typedef struct {
u64 arr[17]; //u64 is defined as unsigned long long
int top; //points to the last occupied slot of the array
} u1024_t;
The problem is that I don't understand how the numbers are stored in this format.
What I need is being able to print the real numbers in some way, or at least a way to recover the numbers from the components of the arrays.
I tried just concatenating them like strings, but it doesn't seem right.
Thanks to whoever will be able to help!
Thank you @Matthieu! Your comment worked.
I needed to concatenate the unsigned long longs in reverse order and reverse their bytes due to endianness.
Following his solution, I implemented this function, which works perfectly:
void print_u1024(u1024_t number) {
int size = (number.top + 1) * sizeof(u64);
for (int i = size-1; i >= 0; i--) {
printf("%02x", ((unsigned char*)number.arr)[i]);
}
printf("\n");
}
Please note that this solution will probably only work on little-endian systems (most PCs).
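A variant that does not depend on byte order is to format each 64-bit element as a whole, starting from the most significant one. This is only a sketch: it assumes, as the answer above found, that `arr[0]` is the least significant element and that `top` is the index of the last occupied slot. The helper name `format_limbs` is made up for this example.

```c
#include <stdio.h>

typedef unsigned long long u64; /* as defined in the question */

/* Writes the number as hexadecimal text, most significant element first.
   out must have room for (top + 1) * 16 + 1 characters.
   Assumes unsigned long long is 64 bits wide. */
void format_limbs(const u64 *arr, int top, char *out)
{
    for (int i = top; i >= 0; i--) {
        sprintf(out, "%016llx", arr[i]); /* 16 hex digits per element */
        out += 16;
    }
    *out = '\0';
}
```

Because `%016llx` works on the integer value rather than on its bytes in memory, the output is the same on little-endian and big-endian machines.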
I need to create some hashing function... can you help me?
The input is a sequence of numbers. The task is to determine how many of them are repeated.
Each one is a string of digits and letters (*a[]), and n is the number of strings in the input.
The function returns the number of repetitions.
int function(char *a[], int n)
{
int i,j;
int same=0;
for(i=0;i<n-1;i++)
{
for(j=i+1;j<n;j++)
{
if(!strcmp(a[i],a[j]))
same++;
}
}
return same;
}
int main(void)
{
char *a[] = {"AA123456", "BA987689", "AA123123", "AA312312", "BB345345", "AA123123"};
printf("Number of duplicates: %d\n", function(a, 6));
return 0;
}
Read the wikipage on hash functions & hash tables.
Often, a linear combination with prime coefficients (see Bézout's identity) and involving the components and the partial hash gives a good enough result.
For example, something like
int basile_hash(const char*str) {
int h = 65537;
while (*str) {
h = 75553*h + 5531* (*str);
str++;
};
return h;
}
I don't claim it is a very good hash, but it is probably good enough for your needs. All the constants 65537, 75553, 5531 are primes (given by /usr/games/primes from bsdgames Debian package)
You could make a variant with bitwise xor ^, or you could take into account more than one component:
h = 65579*str[0] ^ 5507*str[1] + 17*h;
str += 2;
but you should take care to special-case the situation where str[1] is the terminating null byte.
Read also about MD5
Notice that a lot of standard or popular libraries give you many hash functions. Most of the time the particular choice of some hash function is not very important. On the other hand, you can still earn a PhD on studying and inventing good hash functions. I have another one in this values.c file, function mom_cstring_hash near line 150 (I imagine that it might be better optimized, since for large strings some of the instructions might run "in parallel" inside the processor).
I certainly don't claim to be expert on hash functions.
Study also the source code of hash functions in free software libraries like Glib, Qt, etc.... See also gperf
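To tie the hash function back to the original question, here is a rough sketch of counting repeated strings with a small open-addressing hash table instead of the nested loops. It is an illustration under assumptions, not a tuned implementation: `count_repeats` and `str_hash` are hypothetical names, the hash reuses the prime constants from above with unsigned arithmetic, and n is assumed to be non-negative. Note that it counts every string that has already been seen earlier, so a string appearing three times contributes 2, whereas the pairwise loop in the question would report 3.

```c
#include <stdlib.h>
#include <string.h>

/* Same linear combination as above, in unsigned arithmetic so that
   wrap-around is well defined. */
static unsigned str_hash(const char *s)
{
    unsigned h = 65537u;
    while (*s)
        h = 75553u * h + 5531u * (unsigned char)*s++;
    return h;
}

/* Returns how many strings in a[] already occurred earlier in the array,
   or -1 if the table cannot be allocated. */
int count_repeats(char *a[], int n)
{
    size_t cap = 16;                      /* power of two, at least 2 * n */
    while (cap < (size_t)n * 2)
        cap *= 2;
    const char **table = calloc(cap, sizeof *table);
    if (table == NULL)
        return -1;
    int repeats = 0;
    for (int i = 0; i < n; i++) {
        size_t slot = str_hash(a[i]) & (cap - 1);
        /* linear probing: stop at an empty slot or at an equal string */
        while (table[slot] != NULL && strcmp(table[slot], a[i]) != 0)
            slot = (slot + 1) & (cap - 1);
        if (table[slot] != NULL)
            repeats++;
        else
            table[slot] = a[i];
    }
    free(table);
    return repeats;
}
```

For the six strings in the question this returns 1, and the expected running time is proportional to n rather than to n squared.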
I am currently playing around with hashing and key generation trying to make my own hash key generator.
At the moment I have a list of ~90000 strings (each 1 word and a different word). I was wondering what the best way to generate keys (number keys not string keys) would be?
Currently depending on the words last ascii character I do a calculation based on the value of the letter.
The result is about 50% of the words generate a key that clashes with another.
I have used quadratic probing to then find space in the table for the rest of the words.
My question, as above, is what is generally the best sort of way to generate a key for 90000 different words? I know that the larger the data set, the more likely there will be clashes, but how would you suggest/or minimise the clashes?
Edit: Also - I don't care about cryptography, it just needs to be fast.
Thanks.
You can "borrow" Java's implementation of String's hashCode*:
int hashCode(const char* s) {
int h = 0;
while (*s) {
h = 31*h + (*s++);
}
return h;
}
This function achieves a reasonable separation, and is among the most widely used hash functions out there.
* which, as it turns out, Java in turn "borrowed" from Kernighan & Ritchie's book on C programming.
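One caveat when porting this to C: Java's int arithmetic wraps around on overflow, while signed overflow in C is undefined behaviour. A sketch that avoids the problem by using an unsigned 32-bit type follows; the name `java_style_hash` is illustrative, and the results match Java's hashCode only for plain ASCII strings, since Java hashes UTF-16 code units.

```c
#include <stdint.h>

/* h = 31*h + c over all bytes, with well-defined wrap-around modulo 2^32. */
uint32_t java_style_hash(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = 31u * h + (unsigned char)*s++;
    return h;
}
```

To index a table, reduce the result with `java_style_hash(word) % table_size`.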
To prevent clashes you need a good hash key generator.
There are several algorithms available. One recent and very fast one is called xxHash. It's written in C.
Choosing 90,000 as the size of the hash table is probably not a good choice. There is a much better concept, perfect hashing. Following it, use double hashing: one hash function for the table lookup and a second one to maintain the list. You could try the multiplication method for both; I think that's a good idea.
I've seen Knuth use:
register int h,k; register char *p;
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;
Where hash_prime is a prime larger than 4x the expected number of live entries in the hash table.
See: Knuth's literateprogramming.com, the Adventure example.
Here's the hashing code in context:
#define hash_prime 1009/* the size of the hash table */
typedef struct {
char text[6]; /* string of length at most 5 */
char word_type; /* a |wordtype| */
char meaning;
} hash_entry;
hash_entry hash_table[hash_prime]; /* the table of words we know */
void new_word(w,m)
char *w; /* a string of length 5 or less */
int m; /* its meaning */
{
register int h,k; register char *p;
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;
while (hash_table[h].word_type) {
h++;if (h==hash_prime) h=0;
}
/* ... store the word in hash_table[h] (omitted in this excerpt) ... */
}
int lookup(w)
char *w; /* a string that you typed */
{
register int h; register char *p; register char t;
t=w[5]; w[5]='\0'; /* truncate the word */
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime; /* compute starting address */
w[5]=t; /* restore original word */
if (h<0) return -1; /* a negative character might screw us up */
while (hash_table[h].word_type) {
if (streq(w,hash_table[h].text)) return h;
h++;if (h==hash_prime) h=0;
}
return -1;
}
Note, this code:
register char t;
// . . .
t=w[5]; w[5]='\0'; /* truncate the word */
// . . .
w[5]=t; /* restore original word */
is there for a specific requirement to look only at the first 5 characters; remove it if you want to hash the entire word.
The term you want is avalanche: a property of a hash function where a small change in the input changes roughly half of the output bits, which spreads the keys well.
If you want your keys to be guaranteed to be unique, and if your dataset has zero duplicates
then you can read your word as a base-36 number and convert it into an integer. If you use strtoull() you can get really large integers:
char *p=myword;
for(; *p; p++)
*p=toupper(*p);
unsigned long long key=strtoull(myword, NULL, 36);
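This can overflow: when the value does not fit, strtoull() returns ULLONG_MAX and sets errno to ERANGE, so every overly long word gets the same key. Some hashes may also overflow a 32-bit integer when given a long string. Kernighan's hash and Bernstein's hash do that.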
In reality and as pointed out by several other folks:
Consider that collisions are a function of the hash_table size and the avalanche of the hash_function modulo hash_table size. Instead of truly unique keys what you want may be a better hash_table algorithm and size.
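The base-36 conversion described above can be sketched with explicit error checks. The function name `word_to_key` is hypothetical. The sketch relies on standard strtoull() behaviour: with base 36 it accepts digits and letters in either case, it reports where parsing stopped through its second argument, and it sets errno to ERANGE on overflow.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Returns 1 and stores the key on success. Returns 0 if the word is empty,
   contains a character other than 0-9, a-z or A-Z, or is too large for
   unsigned long long. */
int word_to_key(const char *word, unsigned long long *key)
{
    char *end;
    /* strtoull would skip leading blanks and accept a sign, so reject
       anything that does not start with a letter or digit. */
    if (!isalnum((unsigned char)word[0]))
        return 0;
    errno = 0;
    unsigned long long v = strtoull(word, &end, 36);
    if (*end != '\0' || errno == ERANGE)
        return 0;
    *key = v;
    return 1;
}
```

Words of up to 12 characters always fit in 64 bits, because 36^12 is smaller than 2^64. The keys are distinct only if case is ignored and no word starts with the digit 0, and words containing apostrophes or hyphens are rejected.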
I was wondering what is the fastest way to count the number of occurrences of a string (needle) within another string (haystack). The way I'm doing it is:
int findWord(char * file, char * word){
char *fptr;
char * current = strtok_r(file, " ,.\n", &fptr);
int sum = 0;
while (current != NULL){
//printf("%s\n", current);
if(strcmp(current, word) == 0)
sum+=1;
current = strtok_r(NULL, " ,.\n", &fptr);
}
return sum;
}
Would it be faster to use a more complex algorithm (Boyer-Moore)?
Thanks
Currently, if your program is counting the word "blah" and encounters the token "blahblah", your algorithm counts it as zero occurrences. If it needed to count it as two, you could benefit from a more advanced approach.
If your program does what you want, you are processing as fast as you can: it is already linear in the number of letters of the longer "word", so you cannot speed it up further.
An even more interesting solution would be required to count words with self-aliasing: for example, count "aa"s inside "aaaa" string. If you needed to return 3 for this situation, you'd need a lot more advanced algorithm.
Would it be faster to use a more complex algorithm (Boyer-Moore)?
In your algorithm, the unit of comparison is a word rather than a character. This enables the algorithm to ignore matches that straddle a word boundary, and thus makes it run in O(n) time.
I doubt you'd be able to beat that asymptotically.
As far as lowering the multiplicative constant, right now your algorithm looks at every character in file twice. You can eliminate that redundancy by rewriting the code to use a pair of pointers and a single for loop (figuring out the details is left as an exercise for the reader :))
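One possible shape for that single-pass rewrite is sketched below. It is an assumption about what the exercise is aiming at, not the answerer's own solution: `count_word` is a made-up name, the delimiter set is the one from the question, and unlike strtok_r it does not modify the input buffer.

```c
#include <string.h>

/* Counts tokens in text that are exactly equal to word.
   Tokens are separated by any of " ,.\n". */
int count_word(const char *text, const char *word)
{
    static const char delims[] = " ,.\n";
    size_t wlen = strlen(word);
    int count = 0;
    const char *p = text;
    while (*p) {
        p += strspn(p, delims);            /* skip a run of delimiters */
        size_t tlen = strcspn(p, delims);  /* length of the next token */
        if (tlen > 0 && tlen == wlen && memcmp(p, word, wlen) == 0)
            count++;
        p += tlen;                         /* move past the token */
    }
    return count;
}
```

The token bytes are only compared when the lengths already match, so most tokens are rejected without a character-by-character comparison.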
Unless your system has a bad implementation of string functions, this should be roughly the fastest:
const char *s, *t;
size_t cnt;
/* strstr, not strchr: needle is a string, not a single character */
for (cnt=0, s=haystack; (t=strstr(s, needle)) != NULL; s=t+1, cnt++);
Adjust it a bit (+strlen(needle) rather than +1) if you don't want to count overlapping matches.
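Both variants, overlapping and non-overlapping, can be wrapped in one function. This is a minimal sketch built on the standard strstr(). The name `count_occurrences` and the `overlapping` flag are introduced here for illustration.

```c
#include <string.h>

/* Counts occurrences of needle in haystack. If overlapping is nonzero,
   matches may share characters ("aa" occurs 3 times in "aaaa");
   otherwise the search resumes after each match (2 times).
   An empty needle is counted as zero occurrences. */
size_t count_occurrences(const char *haystack, const char *needle, int overlapping)
{
    size_t nlen = strlen(needle);
    if (nlen == 0)
        return 0;
    size_t step = overlapping ? 1 : nlen;
    size_t count = 0;
    const char *p = strstr(haystack, needle);
    while (p != NULL) {
        count++;
        p = strstr(p + step, needle); /* p + step never passes the final '\0' */
    }
    return count;
}
```

Note that this counts substrings rather than whole words, so "blah" is found twice inside "blahblah".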
Can the performance of this sequential search algorithm (taken from
The Practice of Programming) be improved using any of C's native utilities, e.g. if I set the i variable to be a register variable ?
int lookup(char *word, char*array[])
{
int i;
for (i = 0; array[i] != NULL; i++)
if (strcmp(word, array[i]) == 0)
return i;
return -1;
}
Yes, but only very slightly. A much bigger performance improvement can be achieved by using better algorithms (for example keeping the list sorted and doing a binary search).
In general optimizing a given algorithm only gets you so far. Choosing a better algorithm (even if it's not completely optimized) can give you a considerable (order of magnitude) performance improvement.
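As an illustration of the sorted-list suggestion, here is a sketch that uses the standard library's bsearch(). It assumes the caller keeps the array sorted in strcmp() order and knows its length; `lookup_sorted` and `cmp_str` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* bsearch passes a pointer to the key and a pointer to an array element;
   both point to a string pointer here. */
static int cmp_str(const void *key, const void *elem)
{
    const char *const *k = key;
    const char *const *e = elem;
    return strcmp(*k, *e);
}

/* array must hold n strings sorted with strcmp ordering.
   Returns the index of word, or -1 if it is not present. */
int lookup_sorted(const char *word, const char *const *array, size_t n)
{
    const char *const *hit = bsearch(&word, array, n, sizeof *array, cmp_str);
    return hit != NULL ? (int)(hit - array) : -1;
}
```

This needs about log2(n) string comparisons per lookup instead of up to n, at the cost of sorting the array once (for example with qsort) before the first search.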
I think, it will not make much of a difference. The compiler will already optimize it in that direction.
Besides, the variable i does not have much impact; word stays constant throughout the function and the rest is too large to fit in any register. It is only a matter of how large the cache is and whether the whole array fits in it.
String comparisons are rather expensive computationally.
Can you perhaps use some kind of hashing for the array before searching?
There is a well-known technique called the sentinel method.
To use the sentinel method, you must know the length of "array[]".
You can remove the "array[i] != NULL" comparison by using a sentinel.
int lookup(char *word, char*array[], int array_len)
{
int i = 0;
array[array_len] = word;
for (;; ++i)
if (strcmp(word, array[i]) == 0)
break;
array[array_len] = NULL;
return (i != array_len) ? i : -1;
}
If you're reading TPOP, you will next see how they make this search many times faster with different data structures and algorithms.
But you can make things a bit faster by replacing things like
for (i = 0; i < n; ++i)
foo(a[i]);
with
char **p = a;
for (i = 0; i < n; ++i)
foo(*p);
++p;
If there is a known value at the end of the array (e.g. NULL) you can eliminate the loop counter:
for (p = a; *p != NULL; ++p)
foo(*p);
Good luck, that's a great book!
To optimize that code the best bet would be to rewrite the strcmp routine since you are only checking for equality and don't need to evaluate the entire word.
Other than that you can't do much else. You can't sort as it appears you are looking for text within a larger text. Binary search won't work either since the text is unlikely to be sorted.
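My 2p (C-pseudocode):
#include <stddef.h>
/* Returns a pointer to the first occurrence of the word in arr, or NULL. */
const char *find_word(const char *arr_ptr, size_t arr_len, const char *wrd_ptr, size_t wrd_len)
{
    const char *wrd_end, *arr_end, *w, *a;
    if (wrd_len == 0 || wrd_len > arr_len)
        return NULL;
    wrd_end = wrd_ptr + wrd_len;
    arr_end = arr_ptr + (arr_len - wrd_len); /* last position where a match can start */
    for (; arr_ptr <= arr_end; arr_ptr++)
    {
        w = wrd_ptr; a = arr_ptr;
        while (*w == *a) /* compare the characters, not the pointers */
        {
            w++; a++;
            if (w == wrd_end)
                return arr_ptr;
        }
    }
    return NULL;
}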
Mark Harrison: Your for loop never advances p! (++p is indented, but is not actually within the for :-)
Also, switching between pointers and indexing will generally have no effect on performance, nor will adding register keywords (as mat already mentions). The compiler is smart enough to apply these transformations where appropriate, and if you tell it enough about your CPU architecture, it will do a better job of these than manual pseudo-micro-optimizations.
A faster way to match strings would be to store them Pascal style. If you don't need more than 255 characters per string, store them roughly like this, with the count in the first byte:
char s[] = "\x05Hello";
Then you can do:
for(i=0; i<len; ++i) {
s_len = strings[i][0];
if(
s_len == match_len
&& strings[i][s_len] == match[s_len-1]
&& 0 == memcmp(strings[i]+1, match, s_len-1)
) {
return 1;
}
}
And to get really fast, add memory prefetch hints for string start + 64, + 128 and the start of the next string. But that's just crazy. :-)
Another fast way to do it is to get your compiler to use an SSE2-optimized memcmp. Use fixed-length char arrays and align them so the string starts on a 64-byte boundary. Then I believe you can get the good memcmp functions if you pass const char match[64] instead of const char *match into the function, or strncpy match into a 64-, 128-, 256-, or whatever-byte array.
Thinking a bit more about this, these SSE2 match functions might be part of packages like Intel's and AMD's accelerator libraries. Check them out.
Realistically, setting i to be a register variable won't do anything that the compiler wouldn't do already.
If you are willing to spend some time upfront preprocessing the reference array, you should google "The World's Fastest Scrabble Program" and implement that. Spoiler: it's a DAG optimized for character lookups.
/* it does not get much quicker than this */
int lookup(char *word, char *array[])
{
    int i;
    /* test the current element, and only then advance to the next one */
    for (i = 0; *array != NULL; i++, array++)
        if (strcmp(word, *array) == 0)
            return i;
    return -1;
}