Simple hash functions - c

I'm trying to write a C program that uses a hash table to store different words and I could use some help.
Firstly, I create a hash table with the size of a prime number which is closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
I started with the simplest function, adding the letters together, which ended up with 88% collision.
Then I started experimenting with the function and found out that whatever I change it to, the collisions don't get lower than 35%.
Right now I'm using
unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int counter, hashAddress =0;
    for (counter =0; word[counter]!='\0'; counter++){
        hashAddress = hashAddress*word[counter] + word[counter] + counter;
    }
    return (hashAddress%hashTableSize);
}
which is just a random function that I came up with, but it gives me the best results - around 35% collision.
I've been reading articles on hash functions for the past few hours and I tried to use a few simple ones, such as djb2, but all of them gave me even worse results. (djb2 resulted in 37% collision, which isn't much worse, but I was expecting something better rather than worse.)
I also don't know how to use some of the other, more complex ones, such as the murmur2, because I don't know what the parameters (key, len, seed) they take in are.
Is it normal to get more than 35% collisions, even with using the djb2, or am I doing something wrong?
What are the key, len and seed values?

Try sdbm:
hashAddress = 0;
for (counter = 0; word[counter]!='\0'; counter++){
    hashAddress = word[counter] + (hashAddress << 6) + (hashAddress << 16) - hashAddress;
}
Or djb2:
hashAddress = 5381;
for (counter = 0; word[counter]!='\0'; counter++){
    hashAddress = ((hashAddress << 5) + hashAddress) + word[counter];
}
Or Adler32:
uint32_t adler32(const void *buf, size_t buflength) {
    const uint8_t *buffer = (const uint8_t*)buf;
    uint32_t s1 = 1;
    uint32_t s2 = 0;
    for (size_t n = 0; n < buflength; n++) {
        s1 = (s1 + buffer[n]) % 65521;
        s2 = (s2 + s1) % 65521;
    }
    return (s2 << 16) | s1;
}
// ...
hashAddress = adler32(word, strlen(word));
None of these are really great, though. If you really want good hashes, you need something more complex like lookup3, murmur3, or CityHash for example.
Note that a hashtable is expected to have plenty of collisions as soon as it is filled by more than 70-80%. This is perfectly normal and will even happen if you use a very good hash algorithm. That's why most hashtable implementations increase the capacity of the hashtable (e.g. capacity * 1.5 or even capacity * 2) as soon as you are adding something to the hashtable and the ratio size / capacity is already above 0.7 to 0.8. Increasing the capacity means a new hashtable is created with a higher capacity, all values from the current one are added to the new one (therefore they must all be rehashed, as their new index will be different in most cases), the new hashtable array replaces the old one and the old one is released/freed. If you plan on hashing 1000 words, a hashtable capacity of at least 1250 is recommended, better 1400 or even 1500.
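The sizing rule above can be sketched as two small helpers. This is only an illustration; the names min_capacity and needs_grow, and the choice to express the load limit as a whole percentage, are assumptions made for the example and do not come from any particular hashtable library.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest capacity that keeps `count` entries at or below
   `max_load_percent` percent full (rounded up).
   Hypothetical helper, not part of any library. */
size_t min_capacity(size_t count, unsigned max_load_percent)
{
    assert(max_load_percent > 0 && max_load_percent <= 100);
    return (count * 100 + max_load_percent - 1) / max_load_percent;
}

/* Nonzero when adding one more entry would push the ratio
   size / capacity above the limit, i.e. when the table should
   be replaced by a bigger one before the insert. */
int needs_grow(size_t size, size_t capacity, unsigned max_load_percent)
{
    assert(max_load_percent > 0 && max_load_percent <= 100);
    return (size + 1) * 100 > capacity * max_load_percent;
}
```

For 1000 words, min_capacity(1000, 80) gives 1250 and min_capacity(1000, 70) gives 1429, which matches the range recommended above.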
Hashtables are not supposed to be "filled to brim", at least not if they shall be fast and efficient (thus they always should have spare capacity). That's the downside of hashtables, they are fast (O(1)), yet they will usually waste more space than would be necessary for storing the same data in another structure (when you store them as a sorted array, you will only need a capacity of 1000 for 1000 words; the downside is that the lookup cannot be faster than O(log n) in that case). A collision free hashtable is not possible in most cases either way. Pretty much all hashtable implementations expect collisions to happen and usually have some kind of way to deal with them (usually collisions make the lookup somewhat slower, but the hashtable will still work and still beat other data structures in many cases).
Also note that if you are using a pretty good hash function, there is no requirement, and not even an advantage, for the hashtable to have a power of 2 capacity if you are cropping hash values using modulo (%) in the end. The reason why many hashtable implementations always use power of 2 capacities is that they do not use modulo; instead they use AND (&) for cropping, because an AND operation is among the fastest operations you will find on most CPUs (modulo is never faster than AND; in the best case it would be equally fast, in most cases it is a lot slower). If your hashtable uses power of 2 sizes, you can replace any modulo with an AND operation:
x % 4 == x & 3
x % 8 == x & 7
x % 16 == x & 15
x % 32 == x & 31
...
This only works for power of 2 sizes, though. If you use modulo, power of 2 sizes only buy you something if the hash is a very bad hash with a very bad "bit distribution". A bad bit distribution is usually caused by hashes that do not use any kind of bit shifting (>> or <<) or any other operations that would have a similar effect as bit shifting.
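To make the equivalence concrete, here is a minimal sketch of cropping a hash to a table index. The function names are made up for this example. A real table would normally decide once, at creation time, whether it uses AND or modulo; the branch here exists only to show both paths side by side.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if n is a power of two (1, 2, 4, 8, ...). */
int is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Crop a hash value to an index in [0, capacity).
   For power of 2 capacities, hash & (capacity - 1) yields the
   same result as hash % capacity. */
uint32_t crop_hash(uint32_t hash, uint32_t capacity)
{
    assert(capacity > 0);
    if (is_power_of_two(capacity))
        return hash & (capacity - 1);
    return hash % capacity;
}
```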
I created a stripped down lookup3 implementation for you:
#include <stdint.h>
#include <stdlib.h>
#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
#define mix(a,b,c) \
{ \
    a -= c; a ^= rot(c, 4); c += b; \
    b -= a; b ^= rot(a, 6); a += c; \
    c -= b; c ^= rot(b, 8); b += a; \
    a -= c; a ^= rot(c,16); c += b; \
    b -= a; b ^= rot(a,19); a += c; \
    c -= b; c ^= rot(b, 4); b += a; \
}
#define final(a,b,c) \
{ \
    c ^= b; c -= rot(b,14); \
    a ^= c; a -= rot(c,11); \
    b ^= a; b -= rot(a,25); \
    c ^= b; c -= rot(b,16); \
    a ^= c; a -= rot(c,4); \
    b ^= a; b -= rot(a,14); \
    c ^= b; c -= rot(b,24); \
}
uint32_t lookup3 (
    const void *key,
    size_t length,
    uint32_t initval
) {
    uint32_t a,b,c;
    const uint8_t *k;
    const uint32_t *data32Bit;
    data32Bit = key;
    a = b = c = 0xdeadbeef + (((uint32_t)length)<<2) + initval;
    while (length > 12) {
        a += *(data32Bit++);
        b += *(data32Bit++);
        c += *(data32Bit++);
        mix(a,b,c);
        length -= 12;
    }
    k = (const uint8_t *)data32Bit;
    switch (length) {
        case 12: c += ((uint32_t)k[11])<<24;
        case 11: c += ((uint32_t)k[10])<<16;
        case 10: c += ((uint32_t)k[9])<<8;
        case 9 : c += k[8];
        case 8 : b += ((uint32_t)k[7])<<24;
        case 7 : b += ((uint32_t)k[6])<<16;
        case 6 : b += ((uint32_t)k[5])<<8;
        case 5 : b += k[4];
        case 4 : a += ((uint32_t)k[3])<<24;
        case 3 : a += ((uint32_t)k[2])<<16;
        case 2 : a += ((uint32_t)k[1])<<8;
        case 1 : a += k[0];
                 break;
        case 0 : return c;
    }
    final(a,b,c);
    return c;
}
This code is not as highly optimized for performance as the original code; in return it is a lot simpler. It is also not as portable as the original code, but it is portable to all major consumer platforms in use today. It also completely ignores the CPU endianness, yet that is not really an issue; it will work on big and little endian CPUs. Just keep in mind that it will not calculate the same hash for the same data on big and little endian CPUs, but that is no requirement; it will calculate a good hash on both kinds of CPUs and it's only important that it always calculates the same hash for the same input data on a single machine.
You would use this function as follows:
#include <string.h> /* for strlen() */
unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int initval;
    unsigned int hashAddress;
    initval = 12345;
    hashAddress = lookup3(word, strlen(word), initval);
    return (hashAddress%hashTableSize);
    // If hashtable is guaranteed to always have a size that is a power of 2,
    // replace the line above with the following more effective line:
    // return (hashAddress & (hashTableSize - 1));
}
You may wonder what initval is. Well, it is whatever you want it to be. You could call it a salt. It will influence the hash values, yet the hash values will not get better or worse in quality because of this (at least not in the average case; it may lead to more or fewer collisions for very specific data, though). E.g. you can use different initval values if you want to hash the same data twice, yet each time should produce a different hash value (there is no guarantee it will, but it is rather likely if initval is different; if it creates the same value, this would be a very unlucky coincidence that you must treat as a kind of collision). It is not advisable to use different initval values when hashing data for the same hashtable (this will rather cause more collisions on average). Another use for initval is if you want to combine a hash with some other data, in which case the already existing hash becomes initval when hashing the other data (so both the other data and the previous hash influence the outcome of the hash function). You may even set initval to 0 if you like or pick a random value when the hashtable is created (and always use this random value for this instance of hashtable, yet each hashtable has its own random value).
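The same three parameters appear in functions such as murmur2 under the names key, len and seed: key points to the bytes to hash, len is the number of bytes, and seed plays the role that initval plays above. The sketch below illustrates both uses of a seed with a much simpler hash. Note that it swaps in FNV-1a rather than lookup3 to stay short, and that folding the seed into the starting value with XOR is an assumption made for this example, not part of the FNV definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a with a seed mixed into the start value.
   key  : pointer to the bytes to hash
   len  : number of bytes at key
   seed : caller-chosen start value (what lookup3 calls initval) */
uint32_t seeded_hash(const void *key, size_t len, uint32_t seed)
{
    const uint8_t *p = key;
    uint32_t h = 2166136261u ^ seed;   /* FNV offset basis, seed folded in */
    assert(key != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Combine two strings into one hash: the hash of the first
   string becomes the seed when hashing the second one. */
uint32_t hash_pair(const char *first, const char *second, uint32_t seed)
{
    uint32_t h = seeded_hash(first, strlen(first), seed);
    return seeded_hash(second, strlen(second), h);
}
```

For one hashtable you would pick a single seed (for instance when the table is created) and pass that same value on every call.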
A note on collisions:
Collisions are usually not such a huge problem in practice, it usually does not pay off to waste tons of memory just to avoid them. The question is rather how you are going to deal with them in an efficient way.
You said you are currently dealing with 9000 words. If you were using an unsorted array, finding a word in the array will need 4500 comparisons on average. On my system, 4500 string comparisons (assuming that words are between 3 and 20 characters long) need 38 microseconds (0.000038 seconds). So even such a simple, ineffective algorithm is fast enough for most purposes. Assuming that you are sorting the word list and use a binary search, finding a word in the array will need only 13 comparisons on average. 13 comparisons are close to nothing in terms of time, it's too little to even benchmark reliably. So if finding a word in a hashtable needs 2 to 4 comparisons, I wouldn't even waste a single second on the question whether that may be a huge performance problem.
In your case, a sorted list with binary search may even beat a hashtable by far. Sure, 13 comparisons need more time than 2-4 comparisons; however, in case of a hashtable you must first hash the input data to perform a lookup. Hashing alone may already take longer than 13 comparisons! The better the hash, the longer it will take for the same amount of data to be hashed. So a hashtable only pays off performance-wise if you have a really huge amount of data or if you must update the data frequently (e.g. constantly adding/removing words to/from the table, since these operations are less costly for a hashtable than they are for a sorted list). The fact that a hashtable is O(1) only means that regardless of how big it is, a lookup will always need approximately the same amount of time. O(log n) only means that the lookup time grows logarithmically with the number of words; that means more words, slower lookup. Yet the Big-O notation says nothing about absolute speed! This is a big misunderstanding. It is not said that an O(1) algorithm always performs faster than an O(log n) one. The Big-O notation only tells you that if the O(log n) algorithm is faster for a certain number of values and you keep increasing the number of values, the O(1) algorithm will certainly overtake the O(log n) algorithm at some point, but your current word count may be far below that point. Without benchmarking both approaches, you cannot say which one is faster by just looking at the Big-O notation.
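For comparison, the sorted-array alternative needs very little code, because the C standard library already provides qsort() and bsearch(). The following is a minimal sketch; sort_words and contains_word are names invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() and bsearch() pass pointers to the array elements,
   and each element is itself a `const char *`. */
static int compare_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the word list once, before the first lookup. */
void sort_words(const char **words, size_t count)
{
    assert(words != NULL);
    qsort(words, count, sizeof *words, compare_words);
}

/* Binary search: returns 1 if word is in the sorted list, else 0. */
int contains_word(const char **words, size_t count, const char *word)
{
    assert(words != NULL && word != NULL);
    return bsearch(&word, words, count, sizeof *words, compare_words) != NULL;
}
```

Benchmarking this against the hashtable with your real word list is the only reliable way to tell which one is faster for 9000 words.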
Back to collisions. What should you do if you run into a collision? If the number of collisions is small, and here I don't mean the overall number of collisions (the number of words that are colliding in the hashtable) but the per index one (the number of words stored at the same hashtable index, so in your case maybe 2-4), the simplest approach is to store them as a linked list. If there was no collision so far for this table index, there is just a single key/value pair. If there was a collision, there is a linked list of key/value pairs. In that case your code must iterate over the linked list and verify each of the keys and return the value if it matches. Going by your numbers, this linked list won't have more than 4 entries and doing 4 comparisons is insignificant in terms of performance. So finding the index is O(1), finding the value (or detecting that this key is not in the table) is O(n), but here n is only the number of linked list entries (so it is 4 at most).
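The linked-list approach described above could look roughly like this. It is a sketch under several assumptions: the table has a fixed, arbitrarily chosen capacity, it uses djb2 to pick the index, it stores int values, and it keeps the caller's key pointer instead of copying the string. All names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUCKET_COUNT 1024u          /* illustrative fixed capacity */

struct entry {
    const char   *key;              /* not copied; must outlive the table */
    int           value;
    struct entry *next;             /* next entry stored at the same index */
};

static struct entry *buckets[BUCKET_COUNT];

/* djb2, cropped to a table index */
static unsigned bucket_index(const char *key)
{
    unsigned long h = 5381;
    while (*key)
        h = ((h << 5) + h) + (unsigned char)*key++;
    return (unsigned)(h % BUCKET_COUNT);
}

/* Finding the index is O(1); then walk the (short) list at that
   index and compare keys. Returns NULL if the key is absent. */
struct entry *chain_find(const char *key)
{
    struct entry *e;
    assert(key != NULL);
    for (e = buckets[bucket_index(key)]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

/* Insert a new pair or update an existing one.
   Returns 0 on success, -1 if memory allocation fails. */
int chain_put(const char *key, int value)
{
    struct entry *e = chain_find(key);
    unsigned i;
    if (e != NULL) {
        e->value = value;
        return 0;
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    i = bucket_index(key);
    e->key = key;
    e->value = value;
    e->next = buckets[i];           /* a collision simply extends the list */
    buckets[i] = e;
    return 0;
}
```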
If the number of collisions rises, a linked list can become too slow and you may instead store a dynamically sized, sorted array of key/value pairs, which allows lookups in O(log n), and again, n is only the number of keys in that array, not of all keys in the hashtable. Even if there were 100 collisions at one index, finding the right key/value pair takes at most 7 comparisons. That's still close to nothing. That said, if you really have 100 collisions at one index, either your hash algorithm is unsuited for your key data or the hashtable is far too small in capacity. The disadvantage of a dynamically sized, sorted array is that adding/removing keys is somewhat more work than in case of a linked list (code-wise, not necessarily performance-wise). So using a linked list is usually sufficient if you keep the number of collisions low enough, and it is almost trivial to implement such a linked list yourself in C and add it to an existing hashtable implementation.
Most hashtable implementations I have seen use such a "fallback to an alternate data structure" to deal with collisions. The disadvantage is that these require a little bit of extra memory to store the alternative data structure and a bit more code to also search for keys in that structure. There are also solutions that store collisions inside the hashtable itself and that don't require any additional memory. However, these solutions have a couple of drawbacks. The first drawback is that every collision increases the chances for even more collisions as more data is added. The second drawback is that while lookup times for keys in the hashtable get worse linearly with the number of collisions so far (and as I said before, every collision leads to even more collisions as data is added), lookup times for keys not in the hashtable degrade even more, and in the end, if you perform a lookup for a key that is not in the hashtable (yet you cannot know without performing the lookup), the lookup may take as long as a linear search over the whole hashtable (YUCK!!!). So if you can spare the extra memory, go for an alternate structure to handle collisions.

Firstly, I create a hash table with the size of a prime number which is the closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
...
return (hashAddress%hashTableSize);
Since the number of different hashes is comparable to the number of words you cannot expect to have much lower collisions.
I made a simple statistical test with a random hash (which is the best you could achieve) and found that 26% is the limiting collision rate if you have #words == #different hashes.
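A statistical test of this kind is easy to reproduce. The sketch below is not the test used in this answer (which is not shown); it assumes a small xorshift32 generator as a stand-in for an ideal hash and reports two different ways of counting collisions, since the answer does not say which one it used. For equally many words and slots, standard occupancy math gives about 1/e ≈ 36.8% of the words landing in an already occupied slot, and about 1 - 2/e ≈ 26.4% of the slots ending up with two or more words.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* xorshift32: small deterministic generator standing in for an ideal hash */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Put `words` random "hashes" into `slots` slots.
   Returns the fraction of words that landed in an occupied slot.
   If shared_slots is not NULL, it receives the fraction of slots
   that hold two or more words. Returns -1.0 if allocation fails. */
double simulate_collisions(size_t words, size_t slots, double *shared_slots)
{
    unsigned *count;
    size_t i, collided = 0, shared = 0;
    uint32_t state = 2463534242u;       /* arbitrary nonzero seed */
    assert(words > 0 && slots > 0);
    count = calloc(slots, sizeof *count);
    if (count == NULL)
        return -1.0;
    for (i = 0; i < words; i++) {
        size_t s = next_random(&state) % slots;
        if (count[s]++ > 0)
            collided++;
    }
    for (i = 0; i < slots; i++)
        if (count[i] >= 2)
            shared++;
    free(count);
    if (shared_slots != NULL)
        *shared_slots = (double)shared / (double)slots;
    return (double)collided / (double)words;
}
```

Calling simulate_collisions(9001, 9001, &shared) should give values close to the two percentages above, which also suggests that the roughly 35% seen in the question is about what any hash function can achieve when the table is only as large as the word count.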

Related

Dynamically indexing an array in C

Is it possible to create arrays based of their index as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the run, without creating foo[0...3][0...4]?
If not, is there a data structure that allow me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), for example int foo[16][16];, or use a different data structure.
In C++ you could do a map<pair<int, int>, int>
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
    for (int j = 0; j < y; j++)
        foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
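As an illustration of the dynamic allocation mentioned above, the following sketch puts a matrix on the heap while still using foo[i][j] indexing through a pointer to a variable length row. It assumes a compiler with VLA support; the function name and the idea of returning the sum of all cells are just there to make the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a rows x cols matrix on the heap, set every cell to
   value, and return the sum of all cells (-1 if malloc fails). */
long fill_and_sum(size_t rows, size_t cols, int value)
{
    long sum = 0;
    assert(rows > 0 && cols > 0);
    /* m points to rows whose length is only known at run time */
    int (*m)[cols] = malloc(rows * sizeof *m);
    if (m == NULL)
        return -1;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            m[i][j] = value;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += m[i][j];
    free(m);
    return sum;
}
```

Because the storage comes from malloc(), sizes such as 3000x4000 are limited by available memory rather than by the stack.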
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instruction per second and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
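A very small version of such a structure and its functions is sketched below. It is only one of many possible designs: it keeps the explicitly assigned cells in a growing array and finds them by linear search, which is fine for a handful of cells but would need a hash table or sorted rows for large data. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct cell { int row, col, value; };

struct sparse {
    struct cell *cells;         /* only the cells that were assigned */
    size_t       used, capacity;
    int          default_value; /* value of every cell never assigned */
};

struct sparse *sparse_create(int default_value)
{
    struct sparse *s = calloc(1, sizeof *s);
    if (s != NULL)
        s->default_value = default_value;
    return s;
}

void sparse_free(struct sparse *s)
{
    if (s != NULL) {
        free(s->cells);
        free(s);
    }
}

static struct cell *sparse_lookup(const struct sparse *s, int row, int col)
{
    for (size_t i = 0; i < s->used; i++)
        if (s->cells[i].row == row && s->cells[i].col == col)
            return &s->cells[i];
    return NULL;
}

/* Retrieve a cell; unassigned cells read as the default value. */
int sparse_get(const struct sparse *s, int row, int col)
{
    struct cell *c;
    assert(s != NULL);
    c = sparse_lookup(s, row, col);
    return c != NULL ? c->value : s->default_value;
}

/* Assign a cell; returns 0 on success, -1 if out of memory. */
int sparse_set(struct sparse *s, int row, int col, int value)
{
    struct cell *c;
    assert(s != NULL);
    c = sparse_lookup(s, row, col);
    if (c == NULL) {
        if (s->used == s->capacity) {
            size_t cap = s->capacity ? s->capacity * 2 : 8;
            struct cell *grown = realloc(s->cells, cap * sizeof *grown);
            if (grown == NULL)
                return -1;
            s->cells = grown;
            s->capacity = cap;
        }
        c = &s->cells[s->used++];
        c->row = row;
        c->col = col;
    }
    c->value = value;
    return 0;
}
```

With this, the assignment from the question becomes sparse_set(foo, 4, 5, 123), and no storage is reserved for the other cells.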
The most efficient way to create a dynamic index of an array is to create an empty array of the same data type that the array to index is holding.
Let's imagine we are using integers for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index and will be somewhere close to the length of the data.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should sort the data and eliminate duplicates. That is easy to achieve with qsort() (the sorting function of the C standard library) and a duplicate-removing function such as
#include <stdint.h>
/* Copies the distinct values of sorted_arr (which must already be sorted)
   into uniq_arr and returns how many values were copied. */
uint64_t remove_dupes(const uint64_t *sorted_arr, uint64_t *uniq_arr, uint64_t arr_size)
{
    uint64_t i, j=0;
    for (i=0;i<arr_size;i++)
    {
        /* keep the first value and every value that differs from its predecessor */
        if ( i == 0 || sorted_arr[i] != sorted_arr[i-1] ){
            uniq_arr[j] = sorted_arr[i];
            j++;
        }
    }
    return j;
}
Adapt the code above to your needs; you can free() the original array once its unique values have been copied to the new array. The function above is very fast, since it makes a single pass over the sorted data.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be of an exact length, although sticking to powers of 10 will make everything easier in the case of integers.
uint64_t* idx = calloc(pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse your array to index just once and every time you detect a change in the number of significant figures (same as index depth) to the left add the position where that new number was detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts by 10 (103456), you add an entry to the index, let's say that 103456 was detected at position 733, your index entry would be:
index[10] = 733;
The next entry, for numbers beginning with 11, should be added in the next index slot. Let's say that the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array storing 1 million entries, you don't have to iterate the whole array, you just need to check where in your index the first number starting by the first two significant digits is stored. Entry index[10] tells you where the first number starting by 10 is stored. You can then iterate forward until you find your match.
In my example I employed a small index, thus the average number of iterations that you will need to perform will be 1000000/100 = 10000
If you enlarge your index to somewhere close the length of the data the number of iterations will tend to 1, making any search blazing fast.
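Here is a compact sketch of the idea for fixed-width numbers. It is a variation on what is described above rather than a literal transcription: instead of storing only the first position per leading-digit group, it stores the start of every segment plus one closing entry, so that empty segments and missing values are handled without extra checks. The leading digits are obtained by dividing by a divisor (for 6-digit numbers and an index depth of 2, the divisor is 10000). Function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* idx must have buckets + 1 elements. After the call, the values
   whose leading digits (value / divisor) equal b are found at
   positions idx[b] .. idx[b + 1] - 1 of the sorted array.
   All values must be smaller than buckets * divisor. */
void build_index(const uint64_t *sorted, size_t n, uint64_t divisor,
                 size_t *idx, size_t buckets)
{
    size_t pos = 0;
    assert(sorted != NULL && idx != NULL && divisor > 0);
    for (size_t b = 0; b <= buckets; b++) {
        while (pos < n && sorted[pos] / divisor < b)
            pos++;
        idx[b] = pos;
    }
}

/* Returns the position of value in sorted, or n if it is absent.
   Only the segment selected by the leading digits is scanned. */
size_t indexed_find(const uint64_t *sorted, size_t n, uint64_t divisor,
                    const size_t *idx, size_t buckets, uint64_t value)
{
    uint64_t b;
    assert(divisor > 0);
    b = value / divisor;
    if (b >= buckets)
        return n;
    for (size_t i = idx[b]; i < idx[b + 1]; i++)
        if (sorted[i] == value)
            return i;
    return n;
}
```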
What I like to do is to create some simple algorithm that tells me what's the ideal depth of the index after knowing the type and length of the data to index.
Please, note that in the example that I have posed, 64 bit numbers are indexed by their first index depth significant figures, thus 10 and 100001 will be stored in the same index segment. That's not a problem on its own, nonetheless each master has his small book of secrets. Treating numbers as a fixed length hexadecimal string can help keeping a strict numerical order.
You don't have to change the base though, you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base 10 numbers ordered, using different numerical bases is nonetheless trivial in C, which is of great help for this task.
As you make your index depth become larger, the amount of entries per index segment will be reduced
Please note that programming, especially at a lower level as in C, consists in great part of understanding the tradeoff between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As the speed of SSDs gets closer to that of RAM, using files to store indexes should be taken into account. Nevertheless, modern OSs tend to load as much as they can into RAM, so using files would end up being similar from a performance point of view.

Fastest algorithm to figure out if an array has at least one duplicate

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.
Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).
If X ∈ {A..Z} and N ∈ {0..9}, that gives 26^3 * 10^3 = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.
Here you go: a python script to generate all possible 17 million keys:
import itertools
from string import ascii_uppercase
for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print("%s%03d" % (''.join(prefix), numeric))
and a simple C bitset filter:
#include <limits.h>
/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}
/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);
    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;
    return byte & mask;
}
which for your purposes you'd use like so:
#include <stdlib.h>
/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}
/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}
#include <stdio.h>
int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}
This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:
useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt
real 0m0.474s
user 0m0.436s
sys 0m0.036s
useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000
real 0m0.467s
user 0m0.452s
sys 0m0.016s
admittedly the input file is cached in memory for these runs.
The reasonable answer is to use the algorithm with the smallest complexity. I encourage you to use a HashTable to keep track of inserted elements; the final algorithm complexity is O(n), because a search in a HashTable is theoretically O(1). In your case I suggest running the algorithm while reading the file.
public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;
        hashTable.Add(input, string.Empty);
    }
    return false;
}
A fast but memory-inefficient solution would use
// Entries are XXXNNN, read as 6-digit base-36 numbers
char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 }; // or calloc() this
char buffer[100];
while (fgets(buffer, sizeof buffer, istream)) {
    unsigned long index = strtoul(buffer, NULL, 36);
    if (found[index]++) {
        Dupe_found();
        break;
    }
}
The trouble with the post is that it wants "Fastest algorithm", but does not detail memory concerns and its relative importance to speed. So speed must be king and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.
Depending on how many different things there can be you have some options:
Sort the whole array and then look for equal neighbouring elements; the complexity is O(n log n), but it can be done in place, so memory will be O(1)
Build set of all elements. Depending on chosen set implementation can be O(n) (when it will be hash set) or O(n log n) (binary tree), but it would cost you some memory to do so.
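The first option can be sketched as follows for the 6-character keys from the question. The layout (keys stored back to back as 6-byte records without terminators) and the function name are assumptions made for this example. Note also that the C standard does not promise that qsort() sorts in place, so the O(1) memory figure depends on the library or on using your own in-place sort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 6   /* XXXNNN */

static int compare_keys(const void *a, const void *b)
{
    return memcmp(a, b, KEY_LEN);
}

/* keys holds `count` records of KEY_LEN bytes each.
   Sorts the records, then compares each one with its neighbour.
   Returns 1 if any two records are equal, 0 otherwise. */
int has_duplicate_sorted(char (*keys)[KEY_LEN], size_t count)
{
    assert(keys != NULL);
    qsort(keys, count, KEY_LEN, compare_keys);
    for (size_t i = 1; i < count; i++)
        if (memcmp(keys[i - 1], keys[i], KEY_LEN) == 0)
            return 1;       /* equal keys are adjacent after sorting */
    return 0;
}
```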
The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).
The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.
However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "min + (max - min) * N / CPUs" to "min + (max - min) * (N+1) / CPUs"). This means that all CPUs read from the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.
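The range-splitting idea can be shown without any threading code. In the sketch below the loop over workers runs sequentially; in a real implementation each iteration would be handed to its own thread or CPU, which is safe because no two iterations touch the same bitmap. The keys are assumed to be already converted to integers in [min, max), and all names are made up for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* The work of one CPU: read all keys, but only record those in
   [lo, hi), using a bitmap that is private to this range.
   Returns 1 if a duplicate was seen, 0 if not, -1 on allocation failure. */
static int scan_range(const int *keys, size_t count, int lo, int hi)
{
    size_t bits = (size_t)(hi - lo);
    unsigned char *bitmap = calloc(bits / CHAR_BIT + 1, 1);
    int found = 0;
    if (bitmap == NULL)
        return -1;
    for (size_t i = 0; i < count && !found; i++) {
        if (keys[i] >= lo && keys[i] < hi) {
            size_t bit = (size_t)(keys[i] - lo);
            unsigned char mask = (unsigned char)(1u << (bit % CHAR_BIT));
            if (bitmap[bit / CHAR_BIT] & mask)
                found = 1;                  /* bit already set: duplicate */
            bitmap[bit / CHAR_BIT] |= mask;
        }
    }
    free(bitmap);
    return found;
}

/* Worker n owns the integers from min + (max - min) * n / workers
   up to (not including) min + (max - min) * (n + 1) / workers. */
int has_duplicate_partitioned(const int *keys, size_t count,
                              int min, int max, int workers)
{
    assert(keys != NULL && min < max && workers > 0);
    for (int n = 0; n < workers; n++) {     /* one thread each in practice */
        int lo = min + (int)((long long)(max - min) * n / workers);
        int hi = min + (int)((long long)(max - min) * (n + 1) / workers);
        int r = scan_range(keys, count, lo, hi);
        if (r != 0)
            return r;
    }
    return 0;
}
```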
Then next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can/would be done using SIMD; which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string and pack more into each cache line.
Since you have several million entries I think the best algorithm would be counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times every element exists. So you could write a function that does the counting sort to the array :
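#include <stdlib.h>
/* returns 1 if a[] contains a duplicate, 0 if not, -1 if memory allocation fails */
int counting_sort(const int a[], int n, int max)
{
    /* a variable length array cannot have an initializer,
       so the zeroed counters are allocated with calloc() */
    int *count = calloc((size_t)max + 1, sizeof *count);
    int i, found = 0;
    if (count == NULL) return -1;
    for (i = 0; i < n && !found; ++i) {
        count[a[i]]++;
        if (count[a[i]] >= 2) found = 1;
    }
    free(count);
    return found;
}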
Here you should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n,M)) where M is the max value found in the array. So because you have several million entries, if M is also on the order of some millions this will work in O(n) (or less for counting sort, but because you need to find M it is O(n)). If you also know that there is no way that M is greater than some millions, then you can be sure that this gives O(n) and not just O(max(n,M)).
You can see counting sort visualization to understand it better, here:
https://www.cs.usfca.edu/~galles/visualization/CountingSort.html
Note that the above function does not implement counting sort exactly; it stops as soon as it finds a duplicate, which is even more efficient, since you only want to know if there is a duplicate.

Expand hash table without rehash?

I am looking for a hash table data structure that does not require a rehash for expansion and shrinking.
Rehash is a CPU consuming effort. I was wondering if it is possible to design hash table data structure in a way that does not require rehash at all? Have you heard about such a data structure before?
does not require rehash for expansion and shrink? Rehash is a CPU consuming effort. I was wondering if it is possible to design hash table data structure in a way that does not require rehash at all? Have you heard about such a data structure before?
That depends on what you call "rehash":
If you simply mean that the table-level rehash shouldn't reapply the hash function to each key during resizing, then that's easy with most libraries: e.g. wrap the key and its raw (pre-modulo-table-size) real hash value together a la struct X { size_t hash_; Key key_; };, supply the hashtable library with a hash function that returns hash_, but a comparison function that compares key_s (depending on the complexity of key_ comparison, you may be able to use hash_ to optimise, e.g. lhs.hash_ == rhs.hash_ && lhs.key_ == rhs.key_).
This will help most if the hashing of keys was particularly time consuming (e.g. cryptographic strength on longish keys). For very simple hashing (e.g. passthrough of ints) it'll slow you down and waste memory.
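In C, the wrapper idea could be sketched like this. The struct and function names are hypothetical, and djb2 merely stands in for whatever expensive hash you actually use; the point is only that the string is hashed once, when the key is created, and never again during a resize.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct hashed_key {
    size_t      hash;   /* raw hash, computed once, before any modulo */
    const char *key;
};

/* djb2, standing in for an expensive hash function */
static size_t hash_string(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = ((h << 5) + h) + (unsigned char)*s++;
    return h;
}

struct hashed_key make_key(const char *s)
{
    struct hashed_key k;
    assert(s != NULL);
    k.hash = hash_string(s);
    k.key = s;
    return k;
}

/* The "hash function" handed to the table: just returns the stored value. */
size_t key_hash(const struct hashed_key *k)
{
    return k->hash;
}

/* Compare the cheap hashes first; only equal hashes need strcmp(). */
int key_equal(const struct hashed_key *a, const struct hashed_key *b)
{
    return a->hash == b->hash && strcmp(a->key, b->key) == 0;
}

/* After a resize, the new index comes from the stored hash. */
size_t key_index(const struct hashed_key *k, size_t capacity)
{
    assert(capacity > 0);
    return k->hash % capacity;
}
```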
If you mean the table-level operation of increasing or decreasing memory storage and reindexing all stored values, then yes - it can be avoided - but to do so you have to fundamentally change the way the hash table works, and the normal performance profile. Discussed below.
As just one example, you could leverage a more typical hashtable implementation (let's call it H) by having your custom hashtable (C) have an H** p that - up to an initial size limit - will have p[0] be the only instance of H, and simply ferry operations/results through. If the table grows beyond that, you keep p[0] referencing the existing H, while creating a second H hashtable to be tracked by p[1]. Then things start getting dicey:
to search or erase in C, your implementation needs to search p[1] then p[0] and report any match from either
to insert a new value in C, your implementation must confirm it's not in p[0], then insert to p[1]
with each insert (and potentially even for other operations), it could optionally migrate any matching - or an arbitrary - p[0] entry to p[1] so that p[0] gradually empties; you can easily guarantee p[0] will be empty before p[1] becomes so full that a larger table is needed. When p[0] is empty you may want to p[0] = p[1]; p[1] = NULL; to keep the simple mental model of what's where - lots of options.
Some existing hash table implementations are very efficient at iterating over elements (e.g. GNU C++ std::unordered_set), as there's a singly linked list of all the values, and the hash table is really only a collection of pointers (in C++ parlance, iterators) into the linked list. This can mean that if your utilisation falls below some threshold (e.g. 10% load factor) for your only/larger hash table, you know you can very efficiently migrate the remaining elements to a smaller table.
These kinds of tricks are used by some hash tables to avoid a sudden heavy cost during rehashing, and instead spread the pain more evenly over a number of subsequent operations, avoiding a possibly nasty spike in latency.
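The two-table scheme described above can be sketched roughly as follows. This is only an illustration under several assumptions that are not in the answer: it is a set of nonzero uint32_t keys with linear probing, it supports insert and lookup only (no erase), allocation failures are ignored, and all names (inc_set_t, inc_insert, and so on) are made up. Here old plays the role of p[0] and cur the role of p[1].

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define EMPTY 0u /* assumption: 0 is never used as a key */

typedef struct { uint32_t *slots; size_t cap, count; } table_t;

typedef struct {
    table_t cur;   /* p[1] in the text: receives every new insert */
    table_t old;   /* p[0] in the text: drained a few slots at a time */
    size_t  drain; /* next slot of old that still has to be migrated */
} inc_set_t;

static void table_init(table_t *t, size_t cap) {
    t->slots = calloc(cap, sizeof *t->slots); /* NULL check omitted in this sketch */
    t->cap = cap;
    t->count = 0;
}

static bool table_has(const table_t *t, uint32_t k) {
    if (t->slots == NULL) return false;
    for (size_t i = k % t->cap; t->slots[i] != EMPTY; i = (i + 1) % t->cap)
        if (t->slots[i] == k) return true;
    return false;
}

static void table_put(table_t *t, uint32_t k) {
    size_t i = k % t->cap;
    while (t->slots[i] != EMPTY) i = (i + 1) % t->cap;
    t->slots[i] = k;
    t->count++;
}

/* Copy up to n slots of old into cur. Old slots are left intact so that
   probe chains in old keep working; old is freed once fully scanned. */
static void migrate_some(inc_set_t *s, size_t n) {
    while (s->old.slots != NULL && n-- > 0) {
        if (s->drain == s->old.cap) {
            free(s->old.slots);
            s->old.slots = NULL;
            break;
        }
        uint32_t k = s->old.slots[s->drain++];
        if (k != EMPTY) table_put(&s->cur, k);
    }
}

void inc_init(inc_set_t *s, size_t cap) {
    table_init(&s->cur, cap);
    s->old.slots = NULL;
    s->old.cap = s->old.count = 0;
    s->drain = 0;
}

bool inc_contains(const inc_set_t *s, uint32_t k) {
    return table_has(&s->cur, k) || table_has(&s->old, k);
}

void inc_insert(inc_set_t *s, uint32_t k) {
    if (inc_contains(s, k)) return;
    if ((s->cur.count + 1) * 2 > s->cur.cap) {   /* cur would pass 50% load */
        migrate_some(s, s->old.cap + 1);         /* normally a no-op: old is already gone */
        s->old = s->cur;
        s->drain = 0;
        table_init(&s->cur, s->old.cap * 2);
    }
    table_put(&s->cur, k);
    migrate_some(s, 4); /* small, bounded amount of migration work per insert */
}
```

Migrating four slots per insert is an arbitrary choice; it only needs to be large enough that old is empty before cur reaches its own growth threshold.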
Some of the implementation options only make sense for either an open or a closed hashing implementation, or are only useful when the keys and/or values are small or large and depending on whether the table embeds them or points to them. Best way to learn about it is to code....
It depends what you want to avoid. Rehashing implies recomputing the hash values. You can avoid that by storing the hash values in the hash structures. Redispatching the entries into the reallocated hashtable may be less expensive (typically a single modulo or masking operation) and is hardly avoidable for simple hashtable implementations.
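As a rough sketch of the "store the hash value" idea, assuming a separately chained table whose bucket count is a power of two: the full hash is saved in each node when it is inserted, so a resize only masks the saved value and relinks the node. The type and function names (node_t, table_t, table_resize) are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node: the full hash is stored next to the key. */
typedef struct node {
    struct node *next;
    uint32_t     hash; /* computed once, at insert time */
    const char  *key;
} node_t;

typedef struct {
    node_t **buckets;
    size_t   nbuckets; /* assumed to be a power of two */
} table_t;

/* Redistribute all nodes into new_nbuckets buckets without calling the
   hash function again. Returns 0 on success, -1 if allocation fails
   (in which case the table is left unchanged). */
int table_resize(table_t *t, size_t new_nbuckets)
{
    node_t **fresh = calloc(new_nbuckets, sizeof *fresh);
    if (fresh == NULL)
        return -1;
    for (size_t b = 0; b < t->nbuckets; b++) {
        node_t *n = t->buckets[b];
        while (n != NULL) {
            node_t *next = n->next;
            size_t idx = n->hash & (new_nbuckets - 1); /* one masking operation */
            n->next = fresh[idx];
            fresh[idx] = n;
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = fresh;
    t->nbuckets = new_nbuckets;
    return 0;
}
```

The trade-off is the one mentioned earlier in the thread: four extra bytes per entry, which only pays off when hashing a key is noticeably more expensive than reading the stored value.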
Assuming you actually do need this.. It is possible. Here I'll give a trivial example you can build on.
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
// Basic types we deal with
// (named hash_key_t because POSIX headers already define key_t)
typedef uint32_t hash_key_t;
typedef void * value_t;
typedef struct
{
hash_key_t key;
value_t value;
} hash_table_entry_t;
typedef struct
{
uint32_t initialSize;
uint32_t size; // current max entries
uint32_t count; // current filled entries
hash_table_entry_t *entries;
} hash_table_t;
// Hash function depends on the size of the table
hash_key_t hash(value_t value, uint32_t size)
{
// Simple hash function that just does modulo hash table size;
return (hash_key_t)(uintptr_t)value % size;
}
void init(hash_table_t *pTable, uint32_t initialSize)
{
pTable->initialSize = initialSize;
pTable->size = initialSize;
pTable->count = 0;
pTable->entries = malloc(pTable->size * sizeof(*pTable->entries));
/// #todo handle null return;
// Set every byte to 0xFF so that key == UINT32_MAX signals an empty slot.
memset(pTable->entries, ~0, pTable->size * sizeof(*pTable->entries));
}
void insert(hash_table_t *pTable, value_t val)
{
hash_key_t key = hash(val, pTable->size);
// Linear probing, visiting each slot at most once
for (uint32_t n = 0, i = key; n < pTable->size; n++, i = (i+1)%pTable->size)
{
if (pTable->entries[i].key == UINT32_MAX)
{
pTable->entries[i].key = key;
pTable->entries[i].value = val;
pTable->count++;
break;
}
}
// Expand when 50% full
if (pTable->count > pTable->size/2)
{
pTable->size *= 2;
pTable->entries = realloc(pTable->entries, pTable->size * sizeof(*pTable->entries));
/// #todo handle null return;
// Mark only the newly added upper half as empty
memset(pTable->entries + pTable->size/2, ~0, (pTable->size/2) * sizeof(*pTable->entries));
}
}
_Bool contains(hash_table_t *pTable, value_t val)
{
// Try current size first
uint32_t sizeToTry = pTable->size;
do
{
hash_key_t key = hash(val, sizeToTry);
// Probe within the table as it was at that size, matching how insert() wrapped
for (uint32_t n = 0, i = key; n < sizeToTry; n++, i = (i+1)%sizeToTry)
{
if (pTable->entries[i].key == UINT32_MAX)
break;
if (pTable->entries[i].key == key && pTable->entries[i].value == val)
return true;
}
// Try all previous sizes we had. Only report failure if found for none.
sizeToTry /= 2;
} while (sizeToTry >= pTable->initialSize);
return false;
}
The idea is that the hash function depends on the size of the table. When you change the size of the table, you don't rehash current entries. You add new ones with the new hash function. When reading the entries, you try all the hash functions that have ever been used on this table.
This way, get()/contains() and similar operations take longer the more times you expanded your table, but you don't have the huge spike of rehashing. I can imagine some systems where this would be a requirement.

Fast string comparison in C

I currently have this kind of loop
while(1)
{
generate_string(&buffer);
for(int i = 0; i < filelines; i++)
{
if(strcmp(buffer,line[i]) == 0)
{
/* do something */
}
}
}
I have a file with a few million strings(which hopefully should be cut by half sometime soon), the number of all these strings is stored in filelines
line[i] is basically where the string itself is stored.
Currently, due to the comparison of these million strings, function generate_string(&buffer); is executed around 42 times per second.
Is there a faster way to do string comparison in C?
strcmp is usually optimized by all vendors. However, if you're not satisfied with this you can try:
Lookup Burst Tries
Use a suffix tree for fast string comparison -- see this article
Depending on the size of strings in your application you can write a custom string comparator. E.g: GNU libc used to have this optimization for small strings where they tested strings smaller than five bytes as integers. MS cl also has some optimizations for small-strings (do look it up).
But more importantly make sure strcmp is your real bottleneck.
I can assure you, the function strcmp is ABSOLUTELY NOT the bottleneck. Typically, strcmp is well optimized and can do 32 or 64 bit comparisons for strings longer than 4/8 bytes depending on architecture. Both newlib and GNU libc do this. But even if you were to look at each byte in both strings 20 times, it doesn't matter as much as the algo & data structure choices made here.
The real bottleneck is the O(N) search algorithm. A single O(N log N) pass over the file could be used to build an appropriate data structure (whether it's a normal BST, a trie, or just a simple sorted array) for doing O(log N) lookups.
Bear with me here--a lot of math follows. But I think this is a good opportunity to illustrate why choice of algorithm & data structure are sometimes FAR more important than method of string comparison. Steve touches on this, but I wanted to explain it in a little more depth.
With N=1e6, log(1e6, 2) = 19.9, so round up to 20 comparisons on an ideal data structure.
Currently you're doing a worst case search of O(N), or 1e6 operations.
So say you just build a red-black tree with O(log N) insertion time, and you insert N items, that's O(N log N) time to build the tree. So that's 1e6 x 20 or 20e6 operations necessary to build your tree.
In your current approach, building the data structure is O(N), or 1e6 operations, but your worst case search time is O(N) as well. So by the time you read the file and do just 20 search operations, you're up to a theoretical worst case of 21,000,000 operations. By comparison, your worst case with a red-black tree and 20 searches is 20,000,400 operations, or 999,600 operations BETTER than the O(N) search on an unsorted array. So at 20 searches, you're at the first point where a more sophisticated data structure really pays off. But look at what happens at 1000 searches:
Unsorted array = initialization + 1000 x search time = O(N) + 1000 * O(N) = 1,000,000 + 1,000,000,000 = 1,001,000,000 operations.
Red-black = initialization + 1000 x search time = O(N log N) + 1000 * O(log N) = 20,000,000 + 20,000 = 20,020,000 operations.
1,001,000,000 / 20,020,000 ~= 50x as many operations for the O(N) search.
At 1e6 searches, that's (1e6 + 1e6 * 1e6) / (20e6 + 1e6 * 20 ) = 25,000x as many operations.
Assume your computer can handle the 40e6 'operations' it takes to do the log N searches in 1 minute. It would take 25,000 minutes, or 17 DAYS to do the same work with your current algorithm. Or another way to look at is that the O(N) search algorithm can only handle 39 searches in the time the O(log N) algorithm can do 1,000,000. And the more searches you do, the uglier it gets.
See responses from Steve and dirkgently for several better choices of data structures & algorithms. My only additional caution would be that qsort() suggested by Steve might have a worst-case complexity of O(N*N), which is far, far, worse than the O(N log N) you get with a heapsort or various tree-like structures.
Optimization of Computer Programs in C
You can save a little time by checking the first characters of the strings in question before doing the call. Obviously, if the first characters differ, there's no reason to call strcmp to check the rest. Because of the non-uniform distribution of letters in natural languages, the payoff is not 26:1 but more like 15:1 for uppercase data.
#define QUICKIE_STRCMP(a, b) (*(a) != *(b) ? \
(int) ((unsigned char) *(a) - \
(unsigned char) *(b)) : \
strcmp((a), (b)))
If the dictionary of words you are using is well defined (meaning you only care whether strcmp returns 0, not about the ordering), for example a set of command line arguments that begin with the same prefix, e.g. tcp-accept and tcp-reject, then you can rewrite the macro and do some pointer arithmetic to compare not the first char but the Nth one. In this case that is the char at offset 4, the first one after the shared prefix (both strings must be at least offset chars long):
#define QUICKIE_STRCMP(a, b, offset) \
(*((a)+(offset)) != *((b)+(offset)) ? -1 : strcmp((a), (b)))
If I get your question correctly, you need to check if a string is among all the lines read so far. I would propose building a trie, or even better a Patricia tree, from the file lines. This way, instead of going over all the lines, you can check in time linear in the length of your string whether it is present (and with a little more effort, where).
You can try something 'cheap' like screening based on the first char. If the first chars don't match, the strings cannot be equal. If they match, then call strcmp to compare the entire string. You may wish to consider a better algorithm if that is appropriate for your situation; examples would be sorting the file/lines and doing a binary search, using a hash table, or similar string table techniques.
You're already compiling with optimization, right?
If you have a Trie or hashtable data structure lying around the place, ready to use, then you should.
Failing that, a fairly easy change that will probably speed things up is to sort your array line once, before you start generating strings to search for. Then binary search for buffer in the sorted array. It's easy because the two functions you need are standard -- qsort and bsearch.
A binary search into a sorted array only needs to do about log2(filelines) string comparisons, instead of about filelines. So in your case that's 20-something string comparisons per call to generate_string instead of a few million. From the figures you've given, I think you can reasonably expect it to go 20-25 times faster, although I promise nothing.
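A minimal sketch of the qsort/bsearch approach, assuming line is an array of char * as in the question. The helper names (cmp_str_ptr, sort_lines, line_exists) are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Comparator for an array of char *: qsort and bsearch both pass
   pointers to the array elements, i.e. pointers to char *. */
static int cmp_str_ptr(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort the lines once, before the generate/search loop starts. */
void sort_lines(char **line, size_t filelines)
{
    qsort(line, filelines, sizeof *line, cmp_str_ptr);
}

/* Returns nonzero if buffer is equal to one of the (sorted) lines. */
int line_exists(char **line, size_t filelines, const char *buffer)
{
    /* The key is passed the same way as an element: a pointer to a char *. */
    return bsearch(&buffer, line, filelines, sizeof *line, cmp_str_ptr) != NULL;
}
```

Inside the while(1) loop, the inner for loop would then be replaced by a single call such as line_exists(line, filelines, buffer).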
You can use a byte-wise comparator macro instead of strcmp() to achieve a very fast string comparison (of standard 8-bit char) if you know the string length beforehand. I benchmarked the byte-wise comparator macro against glibc's strcmp(), and the macro version significantly outperformed strcmp() implementation; it takes advantage of the CPU's vector processor.
Example:
#define str3_cmp(x, y0, y1, y2, y3) ((x)[0] == (y0) && (x)[1] == (y1) && (x)[2] == (y2) && (x)[3] == (y3))
static inline bool str3_cmp_helper(const char *x, const char *y) {
return str3_cmp(x, *y, *(y + 1), *(y + 2), *(y + 3));
}
const char *i = "hola"; // dynamically generated (eg: received over a network)
if (str3_cmp_helper(i, "hola")) {
/* do something */
} else {
/* do something else */
}
However, writing such a macro is tiresome, so I have included a PHP script to generate the macro. This script takes two arguments, (1) the string length to be compared (this argument is variadic so write as many macros as you want), and (2) the output filename.
#!/usr/bin/php
<?php
function generate_macro($num) : string {
$returner = "#define str".$num."cmp_macro(ptr, ";
for($x = 0; $x < $num; $x++){
$returner .= "c".$x;
if($x != $num-1){ $returner .= ", "; }
}
$returner .= ") ";
for($x = 0; $x < $num; $x++){
$returner .= "*(ptr+".$x.") == c".$x;
if($x != $num-1){ $returner .= " && "; }
}
return $returner;
}
function generate_static_inline_fn(&$generated_macro, $num) : string {
$generated_macro .= "static inline bool str".$num."cmp(const char* ptr, const char* cmp)".
"{\n\t\treturn str".$num."cmp_macro(ptr, ";
for($x = 0; $x < $num; $x++){
$generated_macro .= " *(cmp+".$x.")";
if($x != $num-1){ $generated_macro .= ", "; }
}
$generated_macro .= ");\n}\n";
return $generated_macro;
}
function handle_generation($argc, $argv) : void {
$out_filename = $argv[$argc-1];
$gen_macro = "";
for($x = 0; $x < $argc-2; $x++){
$macro = generate_macro($argv[$x+1])."\n";
$gen_macro .= generate_static_inline_fn($macro, $argv[$x+1]);
}
file_put_contents($out_filename, $gen_macro);
}
handle_generation($argc, $argv);
?>
Script example: $ ./gen_faststrcmp.php 3 5 fast_strcmp.h.
This generates fast_strcmp.h with macros for comparing strings of length 3 and 5:
#define str3cmp_macro(ptr, c0, c1, c2) *(ptr+0) == c0 && *(ptr+1) == c1 && *(ptr+2) == c2
static inline bool str3cmp(const char* ptr, const char* cmp){
return str3cmp_macro(ptr, *(cmp+0), *(cmp+1), *(cmp+2));
}
#define str5cmp_macro(ptr, c0, c1, c2, c3, c4) *(ptr+0) == c0 && *(ptr+1) == c1 && *(ptr+2) == c2 && *(ptr+3) == c3 && *(ptr+4) == c4
static inline bool str5cmp(const char* ptr, const char* cmp){
return str5cmp_macro(ptr, *(cmp+0), *(cmp+1), *(cmp+2), *(cmp+3), *(cmp+4));
}
You can use the macro like so:
const char* compare_me = "Hello";
if(str5cmp(compare_me, "Hello")) { /* code goes here */ }
I don't know that there's a faster way than calling strcmp to do string comparisons, but you can perhaps avoid calling strcmp so much. Use a hash table to store your strings and then you can check whether the string in buffer is in the hash table. If the index of a hit is important when you "do something", the table can map strings to indexes.
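Here is one possible sketch of such a table, assuming open addressing with linear probing and the djb2 hash mentioned earlier on this page. It maps each string to the index of its first occurrence in line. The names (str_index_t, str_index_build, str_index_find) are hypothetical, and the table borrows the string pointers rather than copying them.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char **keys;  /* borrowed pointers into line[]; NULL marks an empty slot */
    size_t      *index; /* position of the string in line[] */
    size_t       cap;   /* power of two, at least twice the number of strings */
} str_index_t;

static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Build the table once. Returns 0 on success, -1 on allocation failure. */
int str_index_build(str_index_t *t, char **line, size_t filelines)
{
    t->cap = 16;
    while (t->cap < 2 * filelines)
        t->cap *= 2;
    t->keys = calloc(t->cap, sizeof *t->keys);
    t->index = malloc(t->cap * sizeof *t->index);
    if (t->keys == NULL || t->index == NULL) {
        free(t->keys);
        free(t->index);
        return -1;
    }
    for (size_t i = 0; i < filelines; i++) {
        size_t slot = djb2(line[i]) & (t->cap - 1);
        while (t->keys[slot] != NULL && strcmp(t->keys[slot], line[i]) != 0)
            slot = (slot + 1) & (t->cap - 1);
        if (t->keys[slot] == NULL) { /* keep the first occurrence only */
            t->keys[slot] = line[i];
            t->index[slot] = i;
        }
    }
    return 0;
}

/* Returns 1 and stores the index in *out if buffer is present, else 0. */
int str_index_find(const str_index_t *t, const char *buffer, size_t *out)
{
    size_t slot = djb2(buffer) & (t->cap - 1);
    while (t->keys[slot] != NULL) {
        if (strcmp(t->keys[slot], buffer) == 0) {
            *out = t->index[slot];
            return 1;
        }
        slot = (slot + 1) & (t->cap - 1);
    }
    return 0;
}
```

Because the table is at most half full, a lookup normally touches only a handful of slots, so strcmp is called a few times per generated string instead of once per line.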
You may be able to get by with a binary comparison in this case, because your program does not actually sort but only compares for equality.
You can also improve comparison speed here by determining the lengths in advance (provided, of course, that they vary enough). When the lengths do not match, "do something" will not happen.
Of course, hashing would be another consideration here, depending on how many times you read the hashed value.
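The length check plus binary comparison might look like the sketch below. It assumes a line_len array, not present in the question, that holds strlen(line[i]) for every line and is filled once when the file is read; the function names are also made up.

```c
#include <stddef.h>
#include <string.h>

/* Equality test for two strings whose lengths are already known. */
static inline int str_equal_n(const char *a, size_t alen,
                              const char *b, size_t blen)
{
    /* memcmp is only reached when the lengths match. */
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* Counts the lines equal to buffer; line_len[i] must equal strlen(line[i]). */
size_t count_matches(char **line, const size_t *line_len,
                     size_t filelines, const char *buffer)
{
    size_t buflen = strlen(buffer); /* computed once per generated string */
    size_t hits = 0;
    for (size_t i = 0; i < filelines; i++)
        if (str_equal_n(buffer, buflen, line[i], line_len[i]))
            hits++;
    return hits;
}
```

This is still a linear scan over all lines, so it only reduces the cost of each comparison; it does not change the O(N) behaviour discussed in the other answers.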
It depends on the length of the string.
If it's not too long, you can try to compare byte by byte:
str[0] == str2[0] && str[1] == str2[1] && str[2] == str2[2]
Otherwise, use memcmp(), it compares chunks of memory.
Use strcmp for regular strings. But if the string is really long you can use memcmp. It will compare chunks of memory.

Ideal data structure for mapping integers to integers?

I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).
Following the example on Wikipedia, I originally had this:
size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table));
However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).
Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).
Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?
Update
For those interested, I ended up using a hash table via the uthash library.
0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.
Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).
(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)
If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.
This is the only way I can foresee getting out of mallocing structs.
You can malloc(3) 0xFFFFFF blocks of size_t on the heap (for simplicity), and address them as you do with an array.
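A small sketch of the heap-allocated version. The function name make_jump_table is made up; a plain loop is used for the fill because memset writes individual bytes, so it cannot store an arbitrary size_t value in each element.

```c
#include <stdlib.h>

/* Allocates n entries on the heap and sets each one to default_value.
   Returns NULL if the allocation fails. The caller frees the result. */
size_t *make_jump_table(size_t n, size_t default_value)
{
    size_t *table = malloc(n * sizeof *table);
    if (table == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        table[i] = default_value;
    return table;
}
```

Calling make_jump_table((size_t)0xFFFFFF + 1, default_value) requests 16,777,216 entries, which is 128 MiB on a platform where size_t is 8 bytes.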
As for the stack overflow. Basically the program receives a SIGSEGV, which can be a result of a stack overflow or accessing illegal memory or writing on a read-only segment etc... They are all abstracted under the same error message "Segmentation fault".
But why don't you use a higher level language like Python that supports associative arrays?
At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.
You could create a sparse array of sorts which has "pages" like this (this example uses 256 "pages", so the upper most byte is the page number):
#include <stdlib.h>
int *pages[256];
/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
for(int i = 0; i < 256; ++i) {
pages[i] = NULL;
}
}
int get_value(int index) {
if(pages[index / 0x10000] == NULL) {
/* calloc so it will zero it out; each page holds 0x10000 ints */
pages[index / 0x10000] = calloc(0x10000, sizeof(int));
}
return pages[index / 0x10000][index % 0x10000];
}
void set_value(int index, int value) {
if(pages[index / 0x10000] == NULL) {
/* calloc so it will zero it out; each page holds 0x10000 ints */
pages[index / 0x10000] = calloc(0x10000, sizeof(int));
}
pages[index / 0x10000][index % 0x10000] = value;
}
This will allocate a page the first time it is touched, whether by a read or a write.
To avoid the overhead of malloc you can use a hashtable where the entries in the table are your structs, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate emptiness of the slot in the table.
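A rough sketch of that layout, assuming 24-bit color keys so that UINT32_MAX can serve as the "empty" marker. The names (int_map_t, int_map_set, int_map_get) and the bit-mixing step are illustrative choices, and the table does not grow, so the caller has to size it with spare room.

```c
#include <stdint.h>
#include <stdlib.h>

#define SLOT_EMPTY UINT32_MAX /* assumption: never a valid key (colors fit in 24 bits) */

typedef struct { uint32_t key; uint32_t value; } slot_t;

typedef struct {
    slot_t *slots; /* entries stored inline: one allocation for the whole table */
    size_t  cap;   /* must be a power of two */
} int_map_t;

/* Simple bit mixer so that similar colors do not cluster in one region. */
static size_t slot_for(const int_map_t *m, uint32_t key)
{
    key ^= key >> 16;
    key *= 0x45d9f3bu;
    key ^= key >> 16;
    return key & (m->cap - 1);
}

/* Returns 0 on success, -1 if allocation fails. */
int int_map_init(int_map_t *m, size_t cap_pow2)
{
    m->slots = malloc(cap_pow2 * sizeof *m->slots);
    if (m->slots == NULL)
        return -1;
    m->cap = cap_pow2;
    for (size_t i = 0; i < cap_pow2; i++)
        m->slots[i].key = SLOT_EMPTY;
    return 0;
}

/* Inserts or overwrites. The caller must keep at least one slot free. */
void int_map_set(int_map_t *m, uint32_t key, uint32_t value)
{
    size_t i = slot_for(m, key);
    while (m->slots[i].key != SLOT_EMPTY && m->slots[i].key != key)
        i = (i + 1) & (m->cap - 1);
    m->slots[i].key = key;
    m->slots[i].value = value;
}

/* Returns the stored value, or default_value if key is absent. */
uint32_t int_map_get(const int_map_t *m, uint32_t key, uint32_t default_value)
{
    size_t i = slot_for(m, key);
    while (m->slots[i].key != SLOT_EMPTY) {
        if (m->slots[i].key == key)
            return m->slots[i].value;
        i = (i + 1) & (m->cap - 1);
    }
    return default_value;
}
```

The default_value parameter fits the skip-table use case from the question: colors that never appear in the pattern are simply absent and fall back to the default skip distance.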
How many values are there in your output space, i.e. how many different values do you map to in the range 0-0xFFFFFF?
Using randomized universal hashing you can come up with a collision-free hash function with a table no bigger than 2 times the number of values in your output space (for a static table).

Resources