C: Generating hash keys for large data sets?

I am currently playing around with hashing and key generation, trying to make my own hash key generator.
At the moment I have a list of ~90,000 strings (each a single, distinct word). I was wondering what the best way to generate keys (number keys, not string keys) would be?
Currently, depending on the word's last ASCII character, I do a calculation based on the value of the letter.
The result is that about 50% of the words generate a key that clashes with another.
I have used quadratic probing to then find space in the table for the rest of the words.
My question, as above, is: what is generally the best way to generate a key for 90,000 different words? I know that the larger the data set, the more likely clashes become, but how would you suggest I minimise them?
Edit: Also - I don't care about cryptography; it just needs to be fast.
Thanks.

You can "borrow" Java's implementation of String's hashCode*:
unsigned hashCode(const char *s) {
    unsigned h = 0; /* unsigned, since signed overflow is undefined behaviour in C */
    while (*s) {
        h = 31 * h + (unsigned char)*s++; /* h = s[0]*31^(n-1) + ... + s[n-1] */
    }
    return h;
}
This function achieves a reasonable separation, and is among the most widely used hash functions out there.
* which, as it turns out, Java in turn "borrowed" from Kernighan & Ritchie's book on C programming.
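To turn that hash into a table slot, reduce it modulo the table size. A minimal sketch; the size 196613 (a prime roughly twice the 90,000 words, keeping the load factor near 0.5 for probing) is an assumption, not part of the answer:
#define TABLE_SIZE 196613 /* assumption: a prime about 2x the word count */

unsigned tableIndex(const char *s) {
    return hashCode(s) % TABLE_SIZE; /* hashCode() as defined above */
}
With ~90,000 words in ~196,613 slots some clashes remain unavoidable, but far fewer than the 50% you are seeing.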

To minimise clashes you need a good hash function.
There are several algorithms available. One recent and very fast one is called xxHash. It's written in C.
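For illustration, using it is a one-liner; a minimal sketch assuming the single-header xxhash.h is available (XXH32(input, length, seed) is the library's 32-bit entry point):
#include <string.h>
#include "xxhash.h"

unsigned int xxhash_key(const char *word, unsigned int table_size) {
    return XXH32(word, strlen(word), 0) % table_size; /* seed 0 */
}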

Choosing 90,000 as the size of the hash table is not a good choice. There is a much better concept, perfect hashing, and failing that, double hashing: one hash to pick the slot and a second one to step through the table when probing. You should try the multiplication method for both; I think that's a good idea.
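A minimal sketch of that combination; the table size and the multiplier constants (golden-ratio-style odd constants) are assumptions, not from the answer:
#include <stdint.h>

#define TABLE_SIZE 131071 /* assumption: a prime (2^17 - 1) */

/* k is an integer key, e.g. a string hash from one of the other answers */
static uint32_t h1(uint32_t k) { return (k * 2654435761u) % TABLE_SIZE; }                 /* slot */
static uint32_t h2(uint32_t k) { return 1 + (k * 2246822519u) % (TABLE_SIZE - 1); }       /* step, never 0 */

/* double hashing: the i-th probe position for key k */
static uint32_t probe(uint32_t k, uint32_t i) {
    return (h1(k) + i * h2(k)) % TABLE_SIZE;
}
Because TABLE_SIZE is prime and the step is never zero, the probe sequence visits every slot.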

I've seen Knuth use:
register int h,k; register char *p;
for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;
Where hash_prime is a prime larger than 4x the expected number of live entries in the hash table.
See: Knuth's literateprogramming.com, the Adventure example.
Here's the hashing code in context:
#define hash_prime 1009 /* the size of the hash table */

typedef struct {
    char text[6];   /* string of length at most 5 */
    char word_type; /* a |wordtype| */
    char meaning;
} hash_entry;

hash_entry hash_table[hash_prime]; /* the table of words we know */

void new_word(w,m)
    char *w; /* a string of length 5 or less */
    int m;   /* its meaning */
{
    register int h,k; register char *p;
    for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;
    while (hash_table[h].word_type) {
        h++; if (h==hash_prime) h=0;
    }
    /* the original continues here by copying w (using k), its word type,
       and m into hash_table[h]; that code is elided in this excerpt */
}
int lookup(w)
    char *w; /* a string that you typed */
{
    register int h; register char *p; register char t;
    t=w[5]; w[5]='\0'; /* truncate the word */
    for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime; /* compute starting address */
    w[5]=t; /* restore original word */
    if (h<0) return -1; /* a negative character might screw us up */
    while (hash_table[h].word_type) {
        if (streq(w,hash_table[h].text)) return h;
        h++; if (h==hash_prime) h=0;
    }
    return -1;
}
Note: these lines:
register char t;
// . . .
t=w[5]; w[5]='\0'; /* truncate the word */
// . . .
w[5]=t; /* restore original word */
are there for a game-specific requirement to look only at the first 5 characters, and should be removed so that you hash the entire word.

The term you want is avalanche - a hash function that provides optimal spread.
If you want your keys to be guaranteed unique, and your dataset has zero duplicates,
then you can interpret each word as a base-36 number and convert it to base 10. With strtoull() you can get really large integers back:
#include <ctype.h>  /* toupper */
#include <stdlib.h> /* strtoull */

char *p = myword;
for (; *p; p++)
    *p = toupper((unsigned char)*p);
unsigned long long key = strtoull(myword, NULL, 36);
This stays unique only up to 12 characters: a longer word overflows the 64-bit range (though, the type being unsigned, you still get a positive number; strtoull() saturates at ULLONG_MAX). Some hashes, when given a long string, likewise overflow a 32-bit integer; Kernighan's hash and Bernstein's hash do that.
In reality, and as pointed out by several other folks:
Collisions are a function of the hash-table size and the avalanche of the hash function modulo the hash-table size. Instead of truly unique keys, what you may really want is a better hash-table algorithm and size.
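(A back-of-the-envelope example, not from the answer: hashing n = 90,000 keys uniformly into m = 131,072 slots leaves about m(1 - (1 - 1/m)^n) ≈ 65,000 slots occupied, so roughly 25,000 keys collide even with perfect avalanche; only a larger table or better collision handling changes that.)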


Print a big integer stored as an unsigned long long array

I'm reviewing the security of an app for a university project. The app encrypts a file using RSA; specifically, it uses this library: https://github.com/ilansmith/rsa (DO NOT use this, it has serious vulnerabilities).
(If you want to take a look, most of the operations between these numbers are implemented in the rsa_num.c file.)
This tool uses arrays of unsigned long long to store the big numbers needed for RSA (n, e and d):
typedef struct {
    u64 arr[17]; // u64 is defined as unsigned long long
    int top;     // points to the last occupied slot of the array
} u1024_t;
The problem is that I don't understand how the numbers are stored in this format.
What I need is to be able to print the real numbers in some way, or at least a way to recover the numbers from the components of the arrays.
I tried just concatenating them like strings, but it doesn't seem right.
Thanks to whoever will be able to help!
Thank you @Matthieu! Your comment worked.
I needed to concatenate the unsigned long longs in reverse order and reverse their bytes due to endianness.
Following his solution, I implemented this function, which works perfectly:
void print_u1024(u1024_t number) {
    int size = (number.top + 1) * sizeof(u64); /* number of bytes actually in use */
    for (int i = size - 1; i >= 0; i--) {      /* most significant byte first */
        printf("%02x", ((unsigned char*)number.arr)[i]);
    }
    printf("\n");
}
Please note that this solution will probably only work on little-endian systems (most PCs).
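If portability matters, a variant that does not depend on byte order (same assumption about the layout: arr[0] holds the least significant 64 bits) can print whole limbs instead:
void print_u1024_portable(u1024_t number) {
    /* print each 64-bit limb as 16 hex digits, most significant limb first */
    for (int i = number.top; i >= 0; i--)
        printf("%016llx", (unsigned long long)number.arr[i]);
    printf("\n");
}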

Hash table with singly linked list implementation C [duplicate]

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.
To simply sum the letters is not a good strategy because a permutation gives the same result: "listen" and "silent" would hash identically.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
More info here.
If you need more alternatives and some performance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions: e.g. the above works slightly better with ASCII input), which is the most usual scenario. If you have a known, restricted domain (the set of inputs is fixed), you can do better; see Fionn's answer.
Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates an optimized hashing function for a fixed input domain.
If you don't need it to be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast, has high diffusion, and is easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
If you do need a cryptographically secure hash, then I suggest SHA1 via OpenSSL.
http://www.openssl.org/docs/crypto/sha.html
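A sketch of the non-cryptographic route, assuming MurmurHash3.h/.cpp from the smhasher project are compiled into the build:
#include <stdint.h>
#include <string.h>
#include "MurmurHash3.h"

uint32_t murmur_key(const char *word) {
    uint32_t out;
    /* MurmurHash3_x86_32(key, len, seed, out) writes a 32-bit hash into out */
    MurmurHash3_x86_32(word, (int)strlen(word), 0u, &out);
    return out;
}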
A bit late, but here is a hashing function with an extremely low collision rate in its 64-bit version below, and almost as good in the 32-bit version:
#include <stdint.h>
#include <string.h>

uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
    union { uint64_t h; uint8_t u[8]; } uu;
    int i = 0;
    uu.h = strlen(s);
    while (*s) {
        uu.u[i % 8] += *s + i + (*s >> ((uu.h / (i + 1)) % 5));
        s++; i++;
    }
    return uu.h; //64-bit
    //return (uu.h + (uu.h >> 32)); //32-bit
}
The hash numbers are also very evenly spread across the possible range, with no clumping that I could detect; this was checked using random strings only.
[edit] Also tested against words extracted from local text files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97,000 words and constructs), with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on the same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests.)

hash function in C

I need to create some hashing function... can you help me?
The input is a sequence of strings, each a mix of numbers and letters (char *a[]); n is the number of strings. The task is to determine how many of them are repeated.
The function returns the number of repetitions.
#include <stdio.h>
#include <string.h>

int function(char *a[], int n)
{
    int i, j;
    int same = 0;
    for (i = 0; i < n - 1; i++) {
        for (j = i + 1; j < n; j++) {
            if (!strcmp(a[i], a[j]))
                same++;
        }
    }
    return same;
}

int main(void)
{
    char *a[] = {"AA123456", "BA987689", "AA123123", "AA312312", "BB345345", "AA123123"};
    printf("Number of duplicates: %d\n", function(a, 6));
    return 0;
}
Read the wikipages on hash functions & hash tables.
Often, a linear combination with prime coefficients (see Bézout's identity), involving the components and the partial hash, gives a good enough result.
For example, something like
int basile_hash(const char *str) {
    int h = 65537;
    while (*str) {
        h = 75553 * h + 5531 * (*str);
        str++;
    }
    return h;
}
I don't claim it is a very good hash, but it is probably good enough for your needs. All the constants 65537, 75553, 5531 are primes (given by /usr/games/primes from the bsdgames Debian package).
You could make a variant with bitwise xor ^, or you could take more than one component into account per step:
h = 65579*str[0] ^ 5507*str[1] + 17*h;
str += 2;
but then you must take care, and special-case, when str[1] is the terminating null byte.
Read also about MD5.
Notice that a lot of standard or popular libraries give you many hash functions. Most of the time the particular choice of hash function is not very important. On the other hand, you can still earn a PhD studying and inventing good hash functions. I have another one in this values.c file, function mom_cstring_hash near line 150 (I imagine it might be better optimized, since for large strings some of the instructions might run "in parallel" inside the processor).
I certainly don't claim to be an expert on hash functions.
Study also the source code of hash functions in free-software libraries like Glib, Qt, etc. See also gperf.

C: sum of integer values by string identifiers

So, I have two files of financial data, say 'symbols', and 'volumes'. In symbols I have strings such as:
FOO
BAR
BAZINGA
...
In volumes, I have integer values such as:
0001387
0000022
0123374
...
The idea is that the stock symbols will repeat in the file and I need to find the total volume of each stock. So, each row where I observe foo I increment total volume of foo by the value observed in volumes. The problem is that these files can be huge: easily 5 - 100 million records. A typical day may have ~1K different symbols in the file.
Doing it with strcmp against every symbol seen so far, for each new line, will be very inefficient. I was thinking of using an associative array --- a hash-table library which allows string keys --- such as uthash or Glib's hash table.
I am also reading some pretty good things about Judy arrays - is the licensing a problem in this case?
Any thoughts on the choice of an efficient hash-table implementation? And also, whether I should use hash tables at all or perhaps something else entirely.
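For concreteness, a rough sketch of the uthash approach mentioned above (the 16-byte symbol width is an assumption):
#include <stdlib.h>
#include <string.h>
#include "uthash.h"

struct stock {
    char symbol[16];   /* key; 16 bytes is an assumed maximum width */
    long volume;       /* aggregated volume */
    UT_hash_handle hh; /* makes this structure hashable */
};

static struct stock *stocks = NULL;

static void add_volume(const char *symbol, long vol) {
    struct stock *s;
    HASH_FIND_STR(stocks, symbol, s);  /* look the symbol up */
    if (!s) {                          /* first sighting: insert it */
        s = calloc(1, sizeof *s);
        strncpy(s->symbol, symbol, sizeof s->symbol - 1);
        HASH_ADD_STR(stocks, symbol, s);
    }
    s->volume += vol;
}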
Umm... apologies for the omission earlier: I need a pure C solution.
Thanks.
A hashtable definitely sounds good. You should look at the libiberty implementation.
You can find it in the GCC project, here.
I would use the map from the C++ STL. Here's what the pseudo-code looks like:
map< string, long int > Mp;
while(eof is not reached)
{
    string stock_name = readline_from_file1();
    long int stock_value = readline_from_file2();
    Mp[stock_name] += stock_value;
}
for(each stock_name in Mp)
    cout << stock_name << " " << Mp[stock_name] << endl;
Based on the amount of data you gave, it may be a bit inefficient, but I'd suggest this because it's much easier to implement.
If the solution is to be implemented strictly in C, then hashing will be the best solution. But if you feel that implementing a hash table and writing the code to avoid collisions is complex, I have another idea: using a trie. It may sound weird, but this can also help a bit.
I would suggest you read this one. It has a nice explanation of what a trie is and how to construct one; an implementation in C is also given there. You may wonder where to store the volume for each stock: that value can be stored at the node ending the stock string and updated easily whenever needed.
But as you say that you are new to C, I advise you to first try implementing it with a hash table and then try this one; a minimal sketch of such a trie node follows.
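Here is that sketch (with an assumption built in: symbols are uppercase A-Z only; widen the alphabet as needed):
#include <stdlib.h>

struct trie_node {
    struct trie_node *child[26]; /* one branch per letter 'A'..'Z' */
    long volume;                 /* aggregated volume if a symbol ends here */
};

static struct trie_node *trie_add(struct trie_node *root, const char *sym, long vol)
{
    if (!root)
        root = calloc(1, sizeof *root);
    struct trie_node *n = root;
    for (; *sym; sym++) {
        int i = *sym - 'A';
        if (!n->child[i])
            n->child[i] = calloc(1, sizeof *n->child[i]);
        n = n->child[i];
    }
    n->volume += vol; /* update the stored volume at the end of the string */
    return root;
}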
Thinking about it, why not stick to your associative array idea? I assume that at the end of execution you need a list of unique names with their aggregated values. The code below will work as long as you have memory to hold all the unique names. Of course, this might not be that efficient; however, a few tricks can be done depending upon the patterns of your data.
#include <stdio.h>
#include <string.h>

#define MAX_UNIQUE 1024            /* worst case: every name is unique */

struct customer {
    char name[16];
    long value;
};

static struct customer customers[MAX_UNIQUE];
static int consolidate_index = 0;

static void consolidate_names(const char *name, long value)
{
    int i;
    for (i = 0; i < consolidate_index; i++) {
        if (strcmp(customers[i].name, name) == 0) {
            customers[i].value += value;    /* known name: aggregate */
            return;
        }
    }
    strncpy(customers[consolidate_index].name, name,
            sizeof customers[consolidate_index].name - 1);
    customers[consolidate_index].value = value; /* new name: append it */
    consolidate_index++;
}

int main(void)
{
    char name[16];
    long value;
    /* for brevity this reads symbol/volume pairs from stdin;
       the question's two files would be read in lock-step instead */
    while (scanf("%15s %ld", name, &value) == 2)
        consolidate_names(name, value);
    for (int i = 0; i < consolidate_index; i++)
        printf("%s %ld\n", customers[i].name, customers[i].value);
    return 0;
}
My solution:
I did end up using the JudySL array to solve this problem. After some reading, the solution was quite simple to implement using Judy. I am replicating the solution here in full so that it may be useful to someone else.
#include <stdio.h>
#include <Judy.h>

const unsigned int BUFSIZE = 10; /* A symbol is only 8 chars wide. */

int main(int argc, char const **argv) {
    FILE *fsymb = fopen(argv[1], "r");
    if (fsymb == NULL) return 1;
    FILE *fvol = fopen(argv[2], "r");
    if (fvol == NULL) return 1;
    FILE *fout = fopen(argv[3], "w");
    if (fout == NULL) return 1;

    unsigned int lnumber = 0;
    uint8_t symbol[BUFSIZE];
    unsigned long volume;

    /* Initialize the associative map as a JudySL array. */
    Pvoid_t assmap = (Pvoid_t) NULL;
    Word_t *value;

    while (1) {
        fscanf(fsymb, "%9s", symbol); /* width bounded to the buffer size */
        if (feof(fsymb)) break;
        fscanf(fvol, "%lu", &volume);
        if (feof(fvol)) break;
        ++lnumber;

        /* Insert a new symbol, or return its value pointer if it exists. */
        JSLI(value, assmap, symbol);
        if (value == PJERR) {
            fclose(fsymb);
            fclose(fvol);
            fclose(fout);
            return 2;
        }
        *value += volume;
    }

    symbol[0] = '\0';            /* Start from the empty string. */
    JSLF(value, assmap, symbol); /* Find the next string in the array. */
    while (value != NULL) {
        fprintf(fout, "%s: %lu\n", symbol, *value); /* Print to output file. */
        JSLN(value, assmap, symbol);                /* Get next string. */
    }

    Word_t tmp;
    JSLFA(tmp, assmap); /* Free the entire array. */

    fclose(fsymb);
    fclose(fvol);
    fclose(fout);
    return 0;
}
I tested the solution on a 'small' sample containing 300K lines. The output is correct and the elapsed time was 0.074 seconds.

Fastest way to count the number of occurrences of a string

I was wondering what is the fastest way to count the number of occurrences of a string (needle) within another string (haystack). The way I'm doing it is:
int findWord(char *file, char *word) {
    char *fptr;
    char *current = strtok_r(file, " ,.\n", &fptr);
    int sum = 0;
    while (current != NULL) {
        //printf("%s\n", current);
        if (strcmp(current, word) == 0)
            sum += 1;
        current = strtok_r(NULL, " ,.\n", &fptr);
    }
    return sum;
}
Would it be faster to use a more complex algorithm (Boyer-Moore)?
Thanks
Currently, if your program is counting the word "blah" and encounters the token "blahblah", your algorithm counts zero occurrences. If it needed to count that as two, you could benefit from a more advanced approach.
If your program does what you want, you are processing as fast as you can: it is already linear in the number of characters scanned, so you cannot speed it up asymptotically.
An even more interesting case is counting words with self-aliasing: for example, counting "aa" inside the string "aaaa". If you needed to return 3 there (the overlapping matches), you'd need a substantially more advanced algorithm.
Would it be faster to use a more complex algorithm (Boyer-Moore)?
In your algorithm, the unit of comparison is a word rather than a character. This enables the algorithm to ignore matches that straddle a word boundary, and thus makes it run in O(n) time.
I doubt you'd be able to beat that asymptotically.
As far as lowering the multiplicative constant: right now your algorithm looks at every character in the file twice. You can eliminate that redundancy by rewriting the code to use a pair of pointers and a single for loop (figuring out the details is left as an exercise for the reader :))
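One shape that single-loop rewrite could take (a sketch under the question's delimiter set, not the answerer's own code):
#include <string.h>

static int is_delim(char c) {
    return c == ' ' || c == ',' || c == '.' || c == '\n';
}

int findWordOnePass(const char *file, const char *word) {
    size_t wlen = strlen(word);
    int sum = 0;
    const char *p = file;
    while (*p) {
        while (*p && is_delim(*p)) p++;  /* skip delimiters */
        const char *start = p;
        while (*p && !is_delim(*p)) p++; /* scan one token */
        if ((size_t)(p - start) == wlen && memcmp(start, word, wlen) == 0)
            sum++;
    }
    return sum;
}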
Unless your system has a bad implementation of the string functions, this should be roughly the fastest:
const char *s, *t;
size_t cnt;
for (cnt = 0, s = haystack; (t = strstr(s, needle)) != NULL; s = t + 1, cnt++)
    ; /* strstr finds each occurrence; the empty loop body just counts */
Adjust it a bit (s = t + strlen(needle) rather than t + 1) if you don't want to count overlapping matches.
