Hash function in C

I need to create a hashing function. Can you help me?
The input is a sequence of strings. The task is to determine how many of them are repeated.
Each item is a string of digits and letters (char *a[]), and n is the number of items in the input.
The function returns the number of repetitions.
#include <stdio.h>
#include <string.h>
int function(char *a[], int n)
{
int i,j;
int same=0;
for(i=0;i<n-1;i++)
{
for(j=i+1;j<n;j++)
{
if(!strcmp(a[i],a[j]))
same++;
}
}
return same;
}
int main(void)
{
char *a[] = {"AA123456", "BA987689", "AA123123", "AA312312", "BB345345", "AA123123"};
printf("Number of duplicates: %d\n", function(a, 6));
return 0;
}

Read the wikipage on hash functions & hash tables.
Often, a linear combination with prime coefficients (see Bézout's identity) involving the components and the partial hash gives a good enough result.
For example, something like
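unsigned basile_hash(const char *str) {
    unsigned h = 65537;
    while (*str) {
        /* unsigned arithmetic wraps on overflow; signed overflow would be undefined */
        h = 75553u * h + 5531u * (unsigned char)*str;
        str++;
    }
    return h;
}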
I don't claim it is a very good hash, but it is probably good enough for your needs. All the constants 65537, 75553, 5531 are primes (given by /usr/games/primes from bsdgames Debian package)
You could make a variant with bitwise xor ^, or you could take into account more than one component:
h = 65579*str[0] ^ 5507*str[1] + 17*h;
str += 2;
but you need to take care to special-case when str[1] is the terminating null byte.
Read also about MD5
Notice that many standard or popular libraries give you a choice of hash functions. Most of the time the particular choice of hash function is not very important. On the other hand, you can still earn a PhD by studying and inventing good hash functions. I have another one in this values.c file, function mom_cstring_hash near line 150 (I imagine that it might be better optimized, since for large strings some of the instructions might run "in parallel" inside the processor).
I certainly don't claim to be expert on hash functions.
Study also the source code of hash functions in free software libraries like Glib, Qt, etc. See also gperf.
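To connect the hash function back to the original task, here is a minimal sketch of counting repeats with a small open-addressing hash table instead of the nested loops. The names str_hash and count_duplicates, the multiplier 31, and the table sizing are assumptions made for illustration, not part of the answer above. Note that it counts every entry that repeats an earlier one, so three identical strings count as 2, whereas the pairwise loop in the question would count 3.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical polynomial string hash. Unsigned arithmetic is used so that
   overflow wraps instead of being undefined. */
static unsigned long str_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = 31 * h + (unsigned char)*s++;
    return h;
}

/* Counts how many entries of a[0..n-1] repeat an earlier entry.
   Returns -1 if memory allocation fails. */
int count_duplicates(char *const a[], int n)
{
    assert(a != NULL && n >= 0);

    size_t cap = 16;                      /* power of two, so masking works */
    while (cap < 2 * (size_t)n)
        cap *= 2;                         /* keep the load factor at or below 0.5 */

    const char **slots = calloc(cap, sizeof *slots);
    if (slots == NULL)
        return -1;
    for (size_t k = 0; k < cap; k++)
        slots[k] = NULL;                  /* explicit, portable "empty" marker */

    int dups = 0;
    for (int i = 0; i < n; i++) {
        size_t k = str_hash(a[i]) & (cap - 1);
        /* linear probing: walk until an empty slot or an equal string */
        while (slots[k] != NULL && strcmp(slots[k], a[i]) != 0)
            k = (k + 1) & (cap - 1);
        if (slots[k] != NULL)
            dups++;                       /* seen before */
        else
            slots[k] = a[i];              /* first occurrence */
    }
    free(slots);
    return dups;
}
```

With the array from the question, calling count_duplicates(a, 6) should report one repeat, and the running time grows roughly linearly with n rather than quadratically.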

Related

C array sorting ignoring special characters

char temp[size];
int b, z;
for (b = 0; b < size; b++) {
for (z = 0; z < size; z++) {
if (strcmp(processNames[b], processNames[z]) < 0) {
strcpy(temp, processNames[b]);
strcpy(processNames[b], processNames[z]);
strcpy(processNames[z], temp);
}
}
}
I'm sorting a list of char ** processNames;
I want it to sort like this:
abc
bee
george
(sally)
saw
thomas
zebra
However, it is sorting it like this:
(sally)
abc
bee
george
saw
thomas
zebra
Thanks, I'm not sure how to negate the special characters and only sort on alphabet. Thanks!
You can pre-process the string and use strcmp to compare the processed string:
// Inside the two-layer for loop
char newb[size], newz[size];
int ib, iz, tb = 0, tz = 0;
for (ib = 0; processNames[b][ib] != '\0'; ib++){
if (isalpha((unsigned char)processNames[b][ib])) {
newb[tb++] = processNames[b][ib];
}
}
newb[tb] = 0;
for (iz = 0; processNames[z][iz] != '\0'; iz++){
if (isalpha((unsigned char)processNames[z][iz])) {
newz[tz++] = processNames[z][iz];
}
}
newz[tz] = 0;
if (strcmp(newb, newz) < 0) {
// swap the ORIGINAL string here
}
The above code is what I came up with at first. It is very inefficient and is not recommended. Alternatively, you can write your own mystrcmp() implementation:
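int mystrcmp(const char *a, const char *b) {
    for (;;) {
        /* skip everything that is not a letter, in both strings */
        while (*a && !isalpha((unsigned char)*a)) a++;
        while (*b && !isalpha((unsigned char)*b)) b++;
        /* stop at the first difference, or when both strings have ended */
        if (*a != *b || *a == '\0')
            return (unsigned char)*a - (unsigned char)*b;
        a++, b++;
    }
}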
“Sorting” means “putting things in order.” What order? The order is defined by some thing that tells us which of two items goes first.
In your code, you are using strcmp to decide which item goes first. That is the thing that decides the order. Since strcmp is giving an order you do not want, you need another function. In this case, you have to write your own function.
Your function should take two strings (via pointers to char), examine the strings, and return a value to indicate whether the first string should be before or after the second string (or whether they are equal).
Since this is likely a class assignment, I will leave it to you to ponder the necessary comparison function.
Alternative
There is an alternative method which is likely to be used in professionally deployed code, in suitable situations. I recommend the above because it is suitable for a class assignment—it addresses the key principle this assignment seems to target.
The alternative is to preprocess all the list items before doing the sort. Since you want to sort on the non-special characters of the names, you would augment the list by creating copies of the names with the special characters removed. These new versions would be your “sort keys”—they would be the values you use to decide order instead of the original names. You could compare them with strcmp.
This method requires allocating new memory for the new versions of the names, managing both the keys and the names while you sort them, and releasing the memory after the sort. It requires some overhead before you start the sort. However, if there are a very large number of things to sort with a considerable number of special characters, then doing the extra work up front can result in better performance overall.
(Again, I mention this only for completeness. It is likely not useful in a class assignment of this sort, just something computer science students should learn over time.)
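As an illustration of the sort-key alternative described above, here is a minimal sketch under a few assumptions: the struct keyed_name type and the function names are hypothetical, keys keep only alphabetic characters, and the standard qsort does the ordering. It builds one key per name, sorts the key/name pairs, writes the reordered pointers back, and releases the keys.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pairing of a sort key with the original name. */
struct keyed_name {
    char *key;    /* letters only, heap-allocated */
    char *name;   /* the original string, not copied */
};

static int by_key(const void *pa, const void *pb)
{
    const struct keyed_name *a = pa, *b = pb;
    return strcmp(a->key, b->key);
}

/* Sorts names[0..count-1] by their letters-only keys.
   Returns 0 on success, -1 if memory allocation fails. */
int sort_by_keys(char *names[], size_t count)
{
    assert(names != NULL || count == 0);
    if (count == 0)
        return 0;
    struct keyed_name *items = malloc(count * sizeof *items);
    if (items == NULL)
        return -1;

    size_t built = 0;   /* number of keys allocated so far */
    for (; built < count; built++) {
        const char *src = names[built];
        char *key = malloc(strlen(src) + 1);
        if (key == NULL)
            break;
        size_t k = 0;
        for (; *src != '\0'; src++)
            if (isalpha((unsigned char)*src))
                key[k++] = *src;          /* keep letters, drop the rest */
        key[k] = '\0';
        items[built].key = key;
        items[built].name = names[built];
    }

    int status = (built == count) ? 0 : -1;
    if (status == 0) {
        qsort(items, count, sizeof *items, by_key);
        for (size_t i = 0; i < count; i++)
            names[i] = items[i].name;     /* write back the reordered pointers */
    }
    for (size_t i = 0; i < built; i++)
        free(items[i].key);
    free(items);
    return status;
}
```

Each key is computed once instead of on every comparison, which is where the up-front work can pay off for large inputs.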
Bonus Notes
You say you are sorting an array of char **ProcessNames. In this case, it is probably not necessary to move the strings themselves with strcpy. Instead, you can simply move the pointers to the strings. E.g., if you want to swap ProcessNames[4] and ProcessNames[7], just make a copy of the pointer that is ProcessNames[4], set ProcessNames[4] to be the pointer that is ProcessNames[7], and set ProcessNames[7] to be the temporary copy you made. This is generally faster than moving strings.
As others note, starting your z loop with z = 0 is probably not a good idea. You likely want z = b+1.
Your code uses size for the size of the string buffer (char temp[size]) and for the size of the ProcessNames array (for (b = 0; b < size; b++)). It is unlikely the number of strings to be sorted is the same as the maximum length of the strings. You should be sure to use the correct size in each instance.
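The bonus notes can be put together in a short sketch: the inner loop starts at b + 1, only the pointers are exchanged, and the element count is passed separately from any buffer size. The names alpha_cmp and sort_names are hypothetical, and the comparison function shown is just one possible way to ignore non-alphabetic characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Compares two strings using only their alphabetic characters. */
static int alpha_cmp(const char *a, const char *b)
{
    for (;;) {
        while (*a && !isalpha((unsigned char)*a)) a++;
        while (*b && !isalpha((unsigned char)*b)) b++;
        if (*a != *b || *a == '\0')
            return (unsigned char)*a - (unsigned char)*b;
        a++;
        b++;
    }
}

/* Sorts names[0..count-1] in place by exchanging pointers only;
   the strings themselves are never copied. */
void sort_names(char *names[], size_t count)
{
    assert(names != NULL || count == 0);
    for (size_t b = 0; b + 1 < count; b++) {
        for (size_t z = b + 1; z < count; z++) {
            if (alpha_cmp(names[z], names[b]) < 0) {
                char *tmp = names[b];   /* swap the two pointers */
                names[b] = names[z];
                names[z] = tmp;
            }
        }
    }
}
```

Because nothing is copied with strcpy, no temporary string buffer is needed and the strings may have any length.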

How to add Strings from a Char Array to a String in C

I'm trying to create a program that generates random words from Katakana (Japanese syllables).
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <wchar.h>
#include <locale.h>
char* word;
char *kata[] = {"ア", "イ", "ウ", "エ", "オ", "カ", "キ", "ク", "ケ", "コ", "サ", "シ", "ス","セ","ソ","タ","チ","ツ","テ","ト","ナ","ニ","ヌ",
"ネ","ノ","ハ","ヒ","フ","ヘ","ホ","マ","ミ","ム","メ","モ","ヤ","ユ","ヨ","ラ","リ","ル","レ","ロ","ワ","ヲ","ン","ガ","ギ",
"グ","ゲ","ゴ","ザ","ジ","ズ","ゼ","ゾ","ダ","ヂ","ヅ","デ","ド","バ","ビ","ブ","ベ","ボ","パ","ピ","プ","ペ","ポ","ャ","ュ",
"ョ","ヴ","ァ","ィ","ゥ","ェ","ォ"};
int x = 0;
void generator (int length) {
for (int z=0; z<length; z++) {
x = rand() % sizeof(*kata);
concat(word,kata[x]);
}
}
int main (void) {
srand((unsigned) time(NULL));
int length = rand() % 5 + 2;
generator(length);
puts(word);
}
word is the String that I want to get printed, and kata is a Char Array containing Katakana. However, if I don't include the "*" to make the array an String array, C complains that there are multiple characters in a char. The rest of the code works fine in my testing.
I'm using BoUoW which has a full Ubuntu environment on Windows, so I don't think that's the problem, but rather how I'm putting the String array into the String.
I've done a similar program in Java in about an hour and this has taken me much longer. Although that's probably because I'm new to C.
Many checks should be added (for example, against overflowing word).
The number of elements in kata is sizeof kata divided by the size of one element; you can wrap that in a countof macro.
strcat is the function you need.
The idea is something like:
char *kata[] = {"ア", "イ", "ウ", "エ", "オ", "カ", "キ", "ク", "ケ", "コ", "サ", "シ", "ス","セ","ソ","タ","チ","ツ","テ","ト","ナ","ニ","ヌ",
"ネ","ノ","ハ","ヒ","フ","ヘ","ホ","マ","ミ","ム","メ","モ","ヤ","ユ","ヨ","ラ","リ","ル","レ","ロ","ワ","ヲ","ン","ガ","ギ",
"グ","ゲ","ゴ","ザ","ジ","ズ","ゼ","ゾ","ダ","ヂ","ヅ","デ","ド","バ","ビ","ブ","ベ","ボ","パ","ピ","プ","ペ","ポ","ャ","ュ",
"ョ","ヴ","ァ","ィ","ゥ","ェ","ォ"};
int x = 0;
static void generator (int nb, char *word, size_t n) {
word[0] = '\0';
while (nb-- > 0) {
x = rand() % (sizeof(kata) / sizeof(char*));
strcat(word, kata[x]);
}
}
int main (void) {
char word[64];
srand((unsigned) time(NULL));
int nb = rand() % 5 + 2;
generator(nb, word, sizeof(word));
puts(word);
return 0;
}
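The overflow check and the countof macro mentioned at the top of this answer could look roughly like the sketch below. COUNTOF, kata_sample (a shortened table) and generate_word are hypothetical names, and the source file is assumed to be saved as UTF-8. The function stops appending as soon as the next syllable plus the terminating null byte would no longer fit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Number of elements in a true array (not a pointer). */
#define COUNTOF(a) (sizeof (a) / sizeof (a)[0])

/* Shortened sample table; the full kata[] array works the same way. */
static const char *const kata_sample[] = {"ア", "イ", "ウ", "エ", "オ"};

/* Appends up to nb random syllables to word, which has room for cap bytes.
   Returns the number of syllables actually appended. */
size_t generate_word(size_t nb, char *word, size_t cap)
{
    assert(word != NULL && cap > 0);
    size_t used = 0, count = 0;
    word[0] = '\0';
    while (nb-- > 0) {
        /* rand() % n is kept for brevity; it is slightly biased */
        const char *syl = kata_sample[(size_t)rand() % COUNTOF(kata_sample)];
        size_t len = strlen(syl);
        if (used + len + 1 > cap)
            break;                          /* would overflow: stop early */
        memcpy(word + used, syl, len + 1);  /* copies the null byte too */
        used += len;
        count++;
    }
    return count;
}
```

Tracking the used length also avoids rescanning the buffer on every append, which strcat would do.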
Which book are you reading? The reason I ask is that you've encountered a series of categorical errors regarding the fundamentals of C which people who read good books don't usually encounter. I can recommend K&R2E to someone who's already experienced programming from another language.
word is the String ...
Not in the code you've shown us, no... word contains a null pointer, and in your code you're assigning into that null pointer. Bad news :(
Stop confusing the concept of strings (which are a category of values) with pointers (which are a category of types).
A string is a sequence of character values that terminates at the first '\0'. That's a value. Strings are stored within arrays; an array is a category of type.
A pointer denotes a type which has values that point at/into arrays (which might or might not contain a string), functions or at nothing (which are null pointers).
... kata is a Char Array containing Katakana ...
Again, this isn't the case. kata is an array of char *. char * is not a character type; it's a character pointer type!
However, if I don't include the "*" to make the array an String array, C complains that there are multiple characters in a char.
I'm not sure what you expect. Since the type of a string literal expression such as "ア" is a char[n] (character array type) which gets converted to a char * (character pointer type) with a value pointing at the first character, and you store multiple of those in an array, the type of your array needs to be char *[m]. The * is necessary! I don't see a problem here.
I do see other problems, however. Firstly, concat isn't defined. You've not asked a question about this, so here's the definition I'll use to fill in the blanks:
void concat(char *dest, char *src) {
strcat(dest, src); // `strcat` is from `<string.h>`
}
sizeof(*kata) retrieves the size of a char *, which is commonly four or eight... so rand() % sizeof(*kata) will equate to rand() % 4 or rand() % 8 on common systems. Perhaps you meant rand() % (sizeof kata / sizeof *kata). More on that later...
As I mentioned earlier, word is a null pointer and you can't assign into such a pointer. You need to make it point at something. You can do this by:
Using the & (address-of) operator on a variable. I assume this isn't suitable for you, as you'll want your pointer to point at a sequence of more than one object, but this is helpful to explain anyway. For example:
int x;
int *pointer_to_x = &x;
Declaring an array, and using the identifier of the array, possibly in conjunction with the + addition operator to point at an element in the array. For example:
int array[42];
int *pointer_to_first = array + 0;
int *pointer_to_second = array + 1;
Calling malloc, realloc, calloc or some other function that returns a pointer to a suitably sized object. For example:
int *pointer_to_whatever = malloc(42 * sizeof *pointer_to_whatever);
// Remember to free(pointer_to_whatever) ONCE when you're done with it
int isn't really appropriate for storing array indexes or lengths; you're better off using size_t as that doesn't have negative values which doesn't just eliminate some bugs, but also makes your code a little more efficient.
rand() % sizeof(*kata) isn't very random.
In fact, it's quite predictable. By reseeding with the same seed, another program can reproduce that exact sequence. By iterating on seeds, starting with seed = time(NULL) and moving backwards in time, it's easy enough to prove that this is no less predictable than a single int value, despite the fact that it is in fact multiple character values.
Additionally, rand tends to introduce biases, especially when you use the % operator to reduce it. You need to remove the bias. You could do this by first assigning your random number to a double, then dividing it by RAND_MAX + 1.0 like so:
double rand_double(void) {
return rand() / (RAND_MAX + 1.0);
}
With this function returning a value between 0.0 and 1.0 (excluding 1.0), you should be able to use rand_double() * (sizeof kata / sizeof *kata) for example, and this will be better... but the bias is still there; it's just reduced. To eliminate the bias, you need to consider that rand returns a sequence of values, each of which lie within [0..RAND_MAX], and that your range doesn't divide evenly into that range; the remainder of the division is a huge part of your bias. You need to take the range, and truncate it down to something that does divide evenly! That is, make a function that wraps rand and discards values greater than RAND_MAX - (RAND_MAX % (sizeof kata / sizeof *kata))... I've described (and solved) this problem in a solution I posted on gist, so for your convenience here's an adaptation of that code:
unsigned int rand_range(unsigned int ceiling) {
int n;
do {
n = rand();
} while (RAND_MAX - n <= RAND_MAX % ceiling);
return n % ceiling;
}
This is better again, but you won't want to use anything rand-derived for security purposes, so don't use this for passwords! This is because of the attack described earlier, where people can go back in time by reseeding to produce values previously generated. Use a cryptographically secure random number generator for that.

Hash table with singly linked list implementation C [duplicate]

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.
To simply sum the letters is not a good strategy because a permutation gives the same result.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
More info here.
If you need more alternatives and some performance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions: eg the above works slightly better with ascii input), which is the most usual scenario. If you have a known restricted domain (set of inputs fixed) you can do better, see Fionn's answer.
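Since the question is about a hash table with singly linked lists, here is a minimal sketch of how djb2 might be plugged into one. TABLE_SIZE, struct node, table_contains and table_insert are hypothetical names chosen for illustration; a real word list would likely want a larger, possibly resizable, table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024   /* hypothetical bucket count */

struct node {
    struct node *next;    /* next entry in the same bucket */
    char *word;           /* heap-allocated copy of the word */
};

static struct node *buckets[TABLE_SIZE];   /* static storage starts out as NULL */

static unsigned long djb2(const char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + c;   /* hash * 33 + c */
    return hash;
}

/* Returns 1 if word is in the table, 0 otherwise. */
int table_contains(const char *word)
{
    assert(word != NULL);
    const struct node *n = buckets[djb2(word) % TABLE_SIZE];
    for (; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return 1;
    return 0;
}

/* Inserts a copy of word. Returns 1 if added, 0 if already present,
   -1 if memory allocation fails. */
int table_insert(const char *word)
{
    assert(word != NULL);
    if (table_contains(word))
        return 0;
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    size_t len = strlen(word) + 1;
    n->word = malloc(len);
    if (n->word == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->word, word, len);
    unsigned long b = djb2(word) % TABLE_SIZE;
    n->next = buckets[b];                  /* push onto the front of the chain */
    buckets[b] = n;
    return 1;
}
```

Unlike summing the letters, this keeps permutations such as "stop" and "pots" apart in most cases, and any collisions that remain are resolved by the strcmp in the chain walk.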
Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates an optimized hashing function for the input domain.
If you don't need it be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast and has high diffusion. Easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
If you do need a cryptographically secure hash, then I suggest a SHA-2 function such as SHA-256 via OpenSSL; SHA-1 is no longer considered collision-resistant.
http://www.openssl.org/docs/crypto/sha.html
A bit late, but here is a hashing function with an extremely low collision rate for 64-bit version below, and ~almost~ as good for the 32-bit version:
uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
union { uint64_t h; uint8_t u[8]; } uu;
int i=0; uu.h=strlen(s);
while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
return uu.h; //64-bit
//return (uu.h+(uu.h>>32)); //32-bit
}
The hash-numbers are also very evenly spread across the possible range, with no clumping that I could detect - this was checked using the random strings only.
[edit]Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)

C: Generating hash keys for large data sets?

I am currently playing around with hashing and key generation trying to make my own hash key generator.
At the moment I have a list of ~90000 strings (each 1 word and a different word). I was wondering what the best way to generate keys (number keys not string keys) would be?
Currently, depending on the word's last ASCII character, I do a calculation based on the value of the letter.
The result is about 50% of the words generate a key that clashes with another.
I have used quadratic probing to then find space in the table for the rest of the words.
My question, as above, is what is generally the best sort of way to generate a key for 90000 different words? I know that the larger the data set, the more likely there will be clashes, but how would you suggest/or minimise the clashes?
Edit: Also - I don't care about cryptography, it just needs to be fast.
Thanks.
You can "borrow" Java's implementation of String's hashCode*:
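unsigned int hashCode(const char *s) {
    unsigned int h = 0;
    while (*s) {
        /* unsigned arithmetic wraps like Java's int; signed overflow is undefined in C */
        h = 31 * h + (unsigned char)*s++;
    }
    return h;
}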
This function achieves a reasonable separation, and is among the most widely used hash functions out there.
* which, as it turns out, Java in turn "borrowed" from Kernighan & Ritchie's book on C programming.
To prevent clashes you need a good hash key generator.
There are several algorithms available. One recent and very fast one is called xxHash. It's written in C.
Choosing a hash table size of 90,000 is unlikely to be a good choice. A much better concept is perfect hashing: use double hashing, with one hash for the table lookup and the other to maintain the list. You should try the multiplication method for both; I think that's a good idea.
I've seen Knuth use:
register int h,k; register char *p;
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;
Where hash_prime is a prime larger than 4x the expected number of live entries in the hash table.
See: Knuth's literateprogramming.com, the Adventure example.
Here's the hashing code in context:
#define hash_prime 1009 /* the size of the hash table */
typedef struct {
char text[6]; /* string of length at most 5 */
char word_type; /* a |wordtype| */
char meaning;
} hash_entry;
hash_entry hash_table[hash_prime]; /* the table of words we know */
void new_word(w,m)
char *w; /* a string of length 5 or less */
int m; /* its meaning */
{
register int h,k; register char *p;
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;
while (hash_table[h].word_type) {
h++;if (h==hash_prime) h=0;
}
/* ... storing w and m into hash_table[h] is omitted from this excerpt ... */
}
int lookup(w)
char *w; /* a string that you typed */
{
register int h; register char *p; register char t;
t=w[5]; w[5]='\0'; /* truncate the word */
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime; /* compute starting address */
w[5]=t; /* restore original word */
if (h<0) return -1; /* a negative character might screw us up */
while (hash_table[h].word_type) {
if (streq(w,hash_table[h].text)) return h;
h++;if (h==hash_prime) h=0;
}
return -1;
}
Note, this code:
register char t;
// . . .
t=w[5]; w[5]='\0'; /* truncate the word */
// . . .
w[5]=t; /* restore original word */
is there for a specific requirement to look at only the first 5 characters; remove it so that you hash the entire word.
The term you want is avalanche - a hash function that provides optimal spread.
If you want your keys to be guaranteed to be unique, and if your dataset has zero duplicates and contains only letters and digits,
then you can read each word as a base-36 number and convert it to an integer. If you use strtoull() you can return really large integers; words of up to 12 characters always fit in 64 bits.
char *p=myword;
for(; *p; p++)
*p=toupper(*p);
unsigned long long key=strtoull(myword, NULL, 36);
Longer words can overflow a 64-bit value; strtoull then returns ULLONG_MAX and sets errno to ERANGE, so such keys are no longer unique. Some hashes, when given a long string, may overflow a 32-bit integer. Kernighan's hash and Bernstein's hash do that.
In reality and as pointed out by several other folks:
Consider that collisions are a function of the hash_table size and the avalanche of the hash_function modulo hash_table size. Instead of truly unique keys what you want may be a better hash_table algorithm and size.
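To see how the table size and the hash function interact, one can simply count how many words land in an already occupied bucket. The sketch below assumes a 31-multiplier hash; hash31 and count_collisions are hypothetical names, and the function is only a measuring aid, not a hash table.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical 31-multiplier string hash in unsigned arithmetic. */
static unsigned int hash31(const char *s)
{
    unsigned int h = 0;
    while (*s)
        h = 31u * h + (unsigned char)*s++;
    return h;
}

/* Counts the words whose bucket (hash modulo table_size) was already taken
   by an earlier word. Returns -1 if memory allocation fails. */
long count_collisions(const char *const words[], size_t n, size_t table_size)
{
    assert(table_size > 0);
    unsigned char *used = calloc(table_size, 1);   /* one flag per bucket */
    if (used == NULL)
        return -1;

    long collisions = 0;
    for (size_t i = 0; i < n; i++) {
        size_t b = hash31(words[i]) % table_size;
        if (used[b])
            collisions++;
        else
            used[b] = 1;
    }
    free(used);
    return collisions;
}
```

Running it over the same word list with several table sizes and hash functions gives a quick, empirical way to compare choices before committing to one.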

Optimizing a search algorithm in C

Can the performance of this sequential search algorithm (taken from
The Practice of Programming) be improved using any of C's native utilities, e.g. if I set the i variable to be a register variable ?
int lookup(char *word, char*array[])
{
int i;
for (i = 0; array[i] != NULL; i++)
if (strcmp(word, array[i]) == 0)
return i;
return -1;
}
Yes, but only very slightly. A much bigger performance improvement can be achieved by using better algorithms (for example keeping the list sorted and doing a binary search).
In general optimizing a given algorithm only gets you so far. Choosing a better algorithm (even if it's not completely optimized) can give you a considerable (order of magnitude) performance improvement.
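As a sketch of the "keep the list sorted and do a binary search" suggestion, the standard bsearch function can do the work. This assumes the array is already sorted in strcmp order and that its length n is known; lookup_sorted and cmp_str are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* bsearch passes the key first and a pointer to an array element second. */
static int cmp_str(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* array must hold n strings sorted in strcmp order.
   Returns the index of word, or -1 if it is not present. */
int lookup_sorted(const char *word, const char *const array[], size_t n)
{
    assert(word != NULL);
    const char *const *hit = bsearch(word, array, n, sizeof *array, cmp_str);
    return hit != NULL ? (int)(hit - array) : -1;
}
```

A lookup then needs on the order of log2(n) string comparisons instead of up to n, at the cost of sorting the array once up front (for example with qsort).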
I think, it will not make much of a difference. The compiler will already optimize it in that direction.
Besides, the variable i does not have much impact, word stays constant throughout the function and the rest is too large to fit in any register. It is only a matter how large the cache is and if the whole array might fit in there.
String comparisons are rather expensive computationally.
Can you perhaps use some kind of hashing for the array before searching?
There is a well-known technique called the sentinel method.
To use it, you must know the length of array[].
With a sentinel you can remove the array[i] != NULL comparison.
int lookup(char *word, char*array[], int array_len)
{
int i = 0;
array[array_len] = word;
for (;; ++i)
if (strcmp(word, array[i]) == 0)
break;
array[array_len] = NULL;
return (i != array_len) ? i : -1;
}
If you're reading TPOP, you will next see how they make this search many times faster with different data structures and algorithms.
But you can make things a bit faster by replacing things like
for (i = 0; i < n; ++i)
foo(a[i]);
with
char **p = a;
for (i = 0; i < n; ++i)
foo(*p);
++p;
If there is a known value at the end of the array (e.g. NULL) you can eliminate the loop counter:
for (p = a; *p != NULL; ++p)
foo(*p);
Good luck, that's a great book!
To optimize that code the best bet would be to rewrite the strcmp routine since you are only checking for equality and don't need to evaluate the entire word.
Other than that you can't do much else. You can't sort as it appears you are looking for text within a larger text. Binary search won't work either since the text is unlikely to be sorted.
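My 2p (C pseudocode made concrete; find_word is a placeholder name, and it searches for a word inside a larger text):
/* Returns a pointer to the first occurrence of wrd in arr, or NULL if there is none. */
const char *find_word(const char *arr_ptr, size_t arr_len, const char *wrd_ptr, size_t wrd_len)
{
    if (wrd_len > arr_len)
        return NULL;
    const char *arr_end = arr_ptr + (arr_len - wrd_len);   /* last possible start */
    for (; arr_ptr <= arr_end; arr_ptr++) {
        size_t k = 0;
        while (k < wrd_len && arr_ptr[k] == wrd_ptr[k])
            k++;                                           /* compare characters, not pointers */
        if (k == wrd_len)
            return arr_ptr;
    }
    return NULL;
}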
Mark Harrison: Your for loop will never terminate! (++p is indented, but is not actually within the for :-)
Also, switching between pointers and indexing will generally have no effect on performance, nor will adding register keywords (as mat already mentions) -- the compiler is smart enough to apply these transformations where appropriate, and if you tell it enough about your cpu arch, it will do a better job of these than manual pseudo-micro-optimizations.
A faster way to match strings would be to store them Pascal style. If you don't need more than 255 characters per string, store them roughly like this, with the count in the first byte:
char s[] = "\x05Hello";
Then you can do:
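for (i = 0; i < len; ++i) {
    s_len = (unsigned char)strings[i][0];   /* length byte, 0..255 */
    if (s_len == match_len
        && (s_len == 0                      /* two empty strings match */
            || (strings[i][s_len] == match[s_len-1]   /* cheap check of the last character */
                && 0 == memcmp(strings[i]+1, match, s_len-1)))) {
        return 1;
    }
}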
And to get really fast, add memory prefetch hints for string start + 64, + 128 and the start of the next string. But that's just crazy. :-)
Another fast way to do it is to get your compiler to use a SSE2 optimized memcmp. Use fixed-length char arrays and align so the string starts on a 64-byte alignment. Then I believe you can get the good memcmp functions if you pass const char match[64] instead of const char *match into the function, or strncpy match into a 64,128,256,whatever byte array.
Thinking a bit more about this, these SSE2 match functions might be part of packages like Intel's and AMD's accelerator libraries. Check them out.
Realistically, setting i to be a register variable won't do anything that the compiler wouldn't do already.
If you are willing to spend some time upfront preprocessing the reference array, you should google "The World's Fastest Scrabble Program" and implement that. Spoiler: it's a DAG optimized for character lookups.
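/* nothing quicker than this */
int lookup(char *word, char *array[])
{
    int i;
    /* test the current element before advancing, so array[0] is not skipped
       and the terminating NULL is never passed to strcmp */
    for (i = 0; *array != NULL; i++, array++)
        if (strcmp(word, *array) == 0)
            return i;
    return -1;
}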
