What's the best hash for utf-8 strings

What's the best hash function for UTF-8 strings that returns a 32-bit or 64-bit integer, considering both performance and 'minimal collisions'?

XOR version of djb2 algorithm:
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) ^ c; // hash(i - 1) * 33 ^ str[i]
return hash;
}
It's simple, fast and considered one of the best for string hashing.

If you don't have any other, more specific requirements, I'd go with Fowler/Noll/Vo or Jenkins' one-at-a-time.
Keep in mind that you should always check that your input data won't trigger degenerate cases (i.e. excessive collisions).
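For reference, here is a minimal sketch of the FNV-1a variant of Fowler/Noll/Vo in 32-bit and 64-bit widths. The function names fnv1a_32 and fnv1a_64 are made up for this example; the offset basis and prime constants are the published FNV ones. The bytes are read as unsigned, which matters for UTF-8 input because bytes of 0x80 and above would otherwise be sign-extended on platforms where char is signed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 32-bit: XOR each byte into the hash, then multiply by the FNV prime. */
uint32_t fnv1a_32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t h = 2166136261u;            /* 32-bit FNV offset basis */
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                       /* unsigned byte, safe for UTF-8 */
        h *= 16777619u;                  /* 32-bit FNV prime */
    }
    return h;
}

/* FNV-1a, 64-bit: same loop with the 64-bit constants. */
uint64_t fnv1a_64(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = 14695981039346656037ull; /* 64-bit FNV offset basis */
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ull;            /* 64-bit FNV prime */
    }
    return h;
}
```

A call would look like fnv1a_32(str, strlen(str)); the length is passed explicitly so that the same function also works for buffers that are not NUL-terminated.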

I currently use the one below. It is not fundamentally better than the *33 djb version (or FNV or Jenkins), but it has a somewhat better entropy in the lower bits, which is needed if the table sizes are powers of two.
unsigned hash_mem(void *dat, size_t len)
{
unsigned char *str = (unsigned char*) dat;
unsigned val=0;
size_t idx;
for(idx=0; idx < len; idx++ ) {
val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
}
return val;
}

Related

how to signature a string to generate a uint64 value? [duplicate]

I'm working on a hash table in C and I'm testing hash functions for strings.
The first function I tried adds up the ASCII codes and takes the result modulo 100, but I got poor results on the first test data: 40 collisions for 130 words.
The final input data will contain 8000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of the word in a .txt file.
Which is the best algorithm for hashing strings?
And how do I determine the size of the hash table?

I've had nice results with djb2 by Dan Bernstein.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
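#include <stdint.h>
/* Rotate x left by bits; bits must be in 1..31. */
static uint32_t rol(uint32_t x, unsigned bits) {
    return (x << bits) | (x >> (32 - bits));
}
uint32_t hash(char const *input) {
    uint32_t result = 0x55555555;
    while (*input) {
        result ^= (unsigned char)*input++;
        result = rol(result, 5);
    }
    return result;
}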
Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
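The two sizing options differ mainly in how the hash value is reduced to a slot index. The sketch below shows both reductions; index_pow2 and index_mod are names invented for this example, and the hash value is assumed to come from any of the 32-bit functions discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Power-of-two table: keep the low bits of the hash with a bit-mask. */
size_t index_pow2(uint32_t h, size_t capacity)
{
    /* capacity must be a non-zero power of two for the mask to be valid */
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    return h & (capacity - 1);
}

/* Any other table size (for example a prime): take the remainder. */
size_t index_mod(uint32_t h, size_t capacity)
{
    assert(capacity != 0);
    return h % capacity;
}
```

For the 10000-slot table in the question, nearby candidates would be 10007 (the smallest prime above 10000) with index_mod, or 16384 with index_pow2. The mask version only looks at the low bits, so it depends more heavily on the hash function mixing those bits well.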

I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran different little hashing functions on the list of 466K English words to see number of collisions for each:
Hash function | Collisions | Time (words) | Time (file)
=================================================================
CRC32 | 23 (0.005%) | 112 ms | 38 ms
MurmurOAAT | 26 (0.006%) | 86 ms | 10 ms
FNV hash | 32 (0.007%) | 87 ms | 7 ms
Jenkins OAAT | 36 (0.008%) | 90 ms | 8 ms
DJB2 hash | 344 (0.074%) | 87 ms | 5 ms
K&R V2 | 356 (0.076%) | 86 ms | 5 ms
Coffin | 763 (0.164%) | 86 ms | 4 ms
x17 hash | 2242 (0.481%) | 87 ms | 7 ms
-----------------------------------------------------------------
MurmurHash3_x86_32 | 19 (0.004%) | 90 ms | 3 ms
I included time for both: hashing all words individually and hashing the entire file of all English words once. I also included a more complex MurmurHash3_x86_32 into my test for reference.
Conclusion:
There is almost no point in using the popular DJB2 hash function for strings on Intel x86-64 (or AArch64, for that matter): it has many more collisions than similar functions (MurmurOAAT, FNV and Jenkins OAAT) while having very similar throughput. Bernstein's DJB2 performs especially badly on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.
Test code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#define MAXLINE 2048
#define SEED 0x12345678
uint32_t DJB2_hash(const uint8_t *str)
{
uint32_t hash = 5381;
uint8_t c;
while ((c = *str++))
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
uint32_t FNV(const void* key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
h ^= 2166136261UL;
const uint8_t* data = (const uint8_t*)key;
for(int i = 0; i < len; i++)
{
h ^= data[i];
h *= 16777619;
}
return h;
}
uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
// One-byte-at-a-time hash based on Murmur's mix
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
for (; *str; ++str) {
h ^= *str;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint32_t KR_v2_hash(const char *s)
{
// Source: https://stackoverflow.com/a/45641002/5407270
uint32_t hashval = 0;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval;
}
uint32_t Jenkins_one_at_a_time_hash(const char *str, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += str[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
uint32_t crc32b(const uint8_t *str) {
// Source: https://stackoverflow.com/a/21001712
unsigned int byte, crc, mask;
int i = 0, j;
crc = 0xFFFFFFFF;
while (str[i] != 0) {
byte = str[i];
crc = crc ^ byte;
for (j = 7; j >= 0; j--) {
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
i = i + 1;
}
return ~crc;
}
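// static inline: a plain `inline` definition in C99 provides no external definition and can fail to link
static inline uint32_t _rotl32(uint32_t x, int32_t bits)
{
    return x<<bits | x>>(32-bits); // C idiom: will be optimized to a single operation (bits must be 1..31)
}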
uint32_t Coffin_hash(char const *input) {
// Source: https://stackoverflow.com/a/7666668/5407270
uint32_t result = 0x55555555;
while (*input) {
result ^= *input++;
result = _rotl32(result, 5);
}
return result;
}
uint32_t x17(const void * key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
const uint8_t * data = (const uint8_t*)key;
for (int i = 0; i < len; ++i)
{
h = 17 * h + (data[i] - ' ');
}
return h ^ (h >> 16);
}
uint32_t apply_hash(int hash, const char* line)
{
switch (hash) {
case 1: return crc32b((const uint8_t*)line);
case 2: return MurmurOAAT_32(line, SEED);
case 3: return FNV(line, strlen(line), SEED);
case 4: return Jenkins_one_at_a_time_hash(line, strlen(line));
case 5: return DJB2_hash((const uint8_t*)line);
case 6: return KR_v2_hash(line);
case 7: return Coffin_hash(line);
case 8: return x17(line, strlen(line), SEED);
default: break;
}
return 0;
}
int main(int argc, char* argv[])
{
    // Read arguments: <hash number 1-8> <word list file>
    if (argc < 3) {
        fprintf(stderr, "usage: %s <hash 1-8> <file>\n", argv[0]);
        return EXIT_FAILURE;
    }
    const int hash_choice = atoi(argv[1]);
    char const* const fn = argv[2];
    // Open file
    FILE* f = fopen(fn, "r");
    if (!f) {
        perror(fn);
        return EXIT_FAILURE;
    }
    // Read file line by line, calculate hash
    char line[MAXLINE];
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0'; // strip newline
        uint32_t hash = apply_hash(hash_choice, line);
        printf("%08x\n", hash);
    }
    fclose(f);
    return 0;
}
P.S. A more comprehensive review of speed and quality of modern hash functions can be found in SMHasher repository of Reini Urban (rurban). Notice the "Quality problems" column in the table.
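The test program above only prints one hash per line; how the collision column was then computed is not shown. One plausible way, sketched here as an assumption rather than the author's actual method, is to sort the hash values and count every value that repeats an earlier one. The names cmp_u32 and count_collisions are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t values (avoids overflow from subtraction). */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sorts hashes in place and returns n minus the number of distinct values,
   i.e. how many entries repeat a hash that already occurred. */
size_t count_collisions(uint32_t *hashes, size_t n)
{
    size_t collisions = 0;
    assert(hashes != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(hashes, n, sizeof hashes[0], cmp_u32);
    for (size_t i = 1; i < n; i++)
        if (hashes[i] == hashes[i - 1])
            collisions++;
    return collisions;
}
```

With this definition, three words sharing one hash count as two collisions. Other definitions (for example counting colliding pairs) would give slightly different numbers, so the figures in the table are only comparable if the same definition is used throughout.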

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}

There are a number of existing hashtable implementations for C, from the POSIX hcreate/hdestroy/hsearch functions (part of the C library on POSIX systems, though not of ISO C) to those in APR and GLib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
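As a rough illustration of the first option, the sketch below stores a word-to-file-offset mapping with the POSIX hsearch family, which matches the use case in the question. The wrapper names put_offset and get_offset are made up for this example, and the table must have been created with hcreate beforehand. Note that hsearch keeps the key pointer rather than copying the string, so the keys have to outlive the table.

```c
#include <assert.h>
#include <search.h>   /* POSIX: hcreate, hsearch, hdestroy */
#include <stddef.h>
#include <stdint.h>

/* Insert word -> offset into the global hsearch table.
   Returns 0 on success, -1 if the table is full. */
int put_offset(const char *word, long offset)
{
    ENTRY item;
    assert(word != NULL);
    item.key = (char *)word;                 /* pointer is stored, not copied */
    item.data = (void *)(intptr_t)offset;    /* small integer packed into the data pointer */
    return hsearch(item, ENTER) ? 0 : -1;
}

/* Look up a word; returns its stored offset, or -1 if it is absent. */
long get_offset(const char *word)
{
    ENTRY query, *found;
    assert(word != NULL);
    query.key = (char *)word;
    query.data = NULL;
    found = hsearch(query, FIND);
    return found ? (long)(intptr_t)found->data : -1;
}
```

Typical use is hcreate(n) once with the expected number of entries, then put_offset for each word, and hdestroy() at the end. There is only one such table per process, which is one reason the GLib or APR tables may be a better fit for larger programs.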

djb2 has 317 collisions for this 466k English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes).
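The figure of around 25 can be checked with the usual birthday approximation: for n values drawn uniformly from m possibilities, the expected number of colliding pairs is about n(n-1)/(2m). A small sketch, with expected_collisions as a name invented for this example:

```c
#include <assert.h>

/* Expected number of colliding pairs among n uniformly random values
   drawn from m possible values: n(n-1) / (2m). */
double expected_collisions(double n, double m)
{
    assert(m > 0.0);
    return n * (n - 1.0) / (2.0 * m);
}
```

With n = 466000 and m = 2^32 this gives roughly 25.3, in line with the number quoted above. With m = 2^64 the result is far below one, which is consistent with seeing no collisions at all for the 64-bit hash.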
My recommendation is to use MurmurHash if available; it is very fast because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend Murmur's one-byte-at-a-time version:
/* needs <stdint.h>; written so that it compiles as C as well as C++ */
static inline uint32_t MurmurOAAT32(const char *key)
{
    uint32_t h = 3323198485ul;
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}
static inline uint64_t MurmurOAAT64(const char *key)
{
    uint64_t h = 525201411107845655ull;
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e9955bd1e995;
        h ^= h >> 47;
    }
    return h;
}
The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we usually don't know (or don't want to look up) how much memory is available, and it might even change, a good rule of thumb is a table of roughly 2x the expected number of elements. Allocating much more than that makes the table faster, but with rapidly diminishing returns; making it smaller makes lookups slow down sharply as the table fills up. This is because there is a non-linear trade-off between space and time for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58..., apparently.
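Applied to the 8000-word dictionary from the question, the 2x rule can be sketched as follows. Rounding up to a power of two is an extra assumption made here so that the index can be taken with a bit-mask; next_pow2 and table_capacity are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is >= n. */
size_t next_pow2(size_t n)
{
    size_t p = 1;
    /* n must be at least 1 and small enough that the result fits in size_t */
    assert(n >= 1 && n <= ((size_t)-1) / 2 + 1);
    while (p < n)
        p <<= 1;
    return p;
}

/* Capacity of about 2x the expected element count, rounded up to 2^k. */
size_t table_capacity(size_t expected_items)
{
    assert(expected_items >= 1 && expected_items <= ((size_t)-1) / 4);
    return next_pow2(2 * expected_items);
}
```

For 8000 expected words this yields 16384 slots, i.e. a load factor of about 0.49 once the table is full.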

djb2 is good
Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:
One of the K&R hashes is terrible, one is probably pretty good:
Apparently a terrible hash algorithm, as presented in K&R 1st edition. This is simply a summation of all bytes in the string (source):
unsigned long hash(unsigned char *str)
{
unsigned int hash = 0;
int c;
while (c = *str++)
hash += c;
return hash;
}
Probably a pretty decent hash algorithm, as presented in K&R 2nd edition (verified by me on pg. 144 of the book). NB: be sure to remove % HASHSIZE from the return statement if you plan on reducing the value to your array length outside the hash function. Also, I recommend you make the return type and "hashval" unsigned long, or even better uint32_t or uint64_t, instead of plain unsigned (int). This simple algorithm takes the position of each byte in the string into account by computing hashvalue = new_byte + 31*hashvalue for all bytes in the string:
unsigned hash(char *s)
{
unsigned hashval;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval % HASHSIZE;
}
Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is because it does NOT take into consideration string character order, so hash("ab") would therefore return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which would (much better!) return two different values for those strings.
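The "ab" versus "ba" point is easy to verify. In the sketch below, sum_hash and kr2_hash are names chosen for this example; they follow the two K&R functions above, with the % HASHSIZE step left out and the bytes read as unsigned char.

```c
#include <assert.h>
#include <stddef.h>

/* K&R 1st edition style: plain sum of the bytes, so byte order is lost. */
unsigned long sum_hash(const char *s)
{
    unsigned long h = 0;
    assert(s != NULL);
    while (*s)
        h += (unsigned char)*s++;
    return h;
}

/* K&R 2nd edition style: multiplying by 31 before adding each byte
   makes the result depend on the position of every byte. */
unsigned long kr2_hash(const char *s)
{
    unsigned long h = 0;
    assert(s != NULL);
    while (*s)
        h = (unsigned char)*s++ + 31 * h;
    return h;
}
```

Both sum_hash("ab") and sum_hash("ba") give 97 + 98 = 195, while kr2_hash gives 98 + 31*97 = 3105 for "ab" and 97 + 31*98 = 3135 for "ba".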
The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.
The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.
This is a partial answer to the question of what are the GCC C++11 hash functions used, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):
Code:
// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
const size_t m = 0x5bd1e995;
size_t hash = seed ^ len;
const char* buf = static_cast<const char*>(ptr);
// Mix 4 bytes at a time into the hash.
while (len >= 4)
{
size_t k = unaligned_load(buf);
k *= m;
k ^= k >> 24;
k *= m;
hash *= m;
hash ^= k;
buf += 4;
len -= 4;
}
// Handle the last few bytes of the input array.
switch (len)
{
case 3:
hash ^= static_cast<unsigned char>(buf[2]) << 16;
[[gnu::fallthrough]];
case 2:
hash ^= static_cast<unsigned char>(buf[1]) << 8;
[[gnu::fallthrough]];
case 1:
hash ^= static_cast<unsigned char>(buf[0]);
hash *= m;
};
// Do a few final mixes of the hash.
hash ^= hash >> 13;
hash *= m;
hash ^= hash >> 15;
return hash;
}
MurmurHash3 by Austin Appleby is best! It's an improvement over even the hash GCC uses for C++11 std::unordered_map<>, shown above.
Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
See also
Other hash table algorithms to try out and test: http://www.cse.yorku.ca/~oz/hash.html. Hash algorithms mentioned there:
djb2
sdbm
lose lose (K&R 1st edition)

First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.
A hash function with a good reputation is MurmurHash3.
Finally, regarding the size of the hash table, it really depends what kind of hash table you have in mind, especially, whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.

I have tried these hash functions and got the following result. I have about 960^3 entries, each 64 bytes long, 64 chars in different order, hash value 32bit. Codes from here.
Hash function | collision rate | how many minutes to finish
==============================================================
MurmurHash3 | 6.?% | 4m15s
Jenkins One.. | 6.1% | 6m54s
Bob, 1st in link | 6.16% | 5m34s
SuperFastHash | 10% | 4m58s
bernstein | 20% | 14s only finish 1/20
one_at_a_time | 6.16% | 7m5s
crc | 6.16% | 7m56s
One strange thing is that almost all the hash functions have a 6% collision rate for my data.

One thing I've used with good results is the following (I don't know if it's been mentioned already, because I can't remember its name).
You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very cheap. If you really run into a bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
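A minimal sketch of this table-lookup scheme follows. The splitmix64 generator is only a stand-in for "a random number generator", and the names xor_table_init and xor_table_hash are invented here. One caveat worth knowing: taken literally, the XOR of per-byte table entries does not depend on byte order, so anagrams such as "ab" and "ba" always collide. Related schemes avoid this, for example tabulation hashing (a separate table per byte position) or Pearson hashing (which feeds the running hash back into the table index).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t T[256];   /* one random word per possible byte value */

/* splitmix64 step, used here only as a small stand-in PRNG. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9e3779b97f4a7c15ull);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
    return z ^ (z >> 31);
}

/* Fill the table; calling this again with a new seed gives a fresh hash function. */
void xor_table_init(uint64_t seed)
{
    for (int i = 0; i < 256; i++)
        T[i] = (uint32_t)(splitmix64(&seed) >> 32);
}

/* Hash = XOR of the table entries of all bytes in the key. */
uint32_t xor_table_hash(const void *key, size_t len)
{
    const uint8_t *p = key;
    uint32_t h = 0;
    assert(key != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        h ^= T[p[i]];
    return h;
}
```

Because XOR cancels pairs, a key that contains the same byte twice (such as "aa") also loses that information, so this form is best suited to keys where order and repetition do not matter.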

Use of Murmurhash in C

I am in the process of implementing a hash table and hence hash function in C and heard that Murmurhash was a suitably fast algorithm for this purpose. Looking up some C code for this provided:
uint32_t murmur3_32(const char *key, uint32_t len, uint32_t seed) {
static const uint32_t c1 = 0xcc9e2d51;
static const uint32_t c2 = 0x1b873593;
static const uint32_t r1 = 15;
static const uint32_t r2 = 13;
static const uint32_t m = 5;
static const uint32_t n = 0xe6546b64;
uint32_t hash = seed;
const int nblocks = len / 4;
const uint32_t *blocks = (const uint32_t *) key;
int i;
for (i = 0; i < nblocks; i++) {
uint32_t k = blocks[i];
k *= c1;
k = (k << r1) | (k >> (32 - r1));
k *= c2;
hash ^= k;
hash = ((hash << r2) | (hash >> (32 - r2))) * m + n;
}
const uint8_t *tail = (const uint8_t *) (key + nblocks * 4);
uint32_t k1 = 0;
switch (len & 3) {
case 3:
k1 ^= tail[2] << 16;
case 2:
k1 ^= tail[1] << 8;
case 1:
k1 ^= tail[0];
k1 *= c1;
k1 = (k1 << r1) | (k1 >> (32 - r1));
k1 *= c2;
hash ^= k1;
}
hash ^= len;
hash ^= (hash >> 16);
hash *= 0x85ebca6b;
hash ^= (hash >> 13);
hash *= 0xc2b2ae35;
hash ^= (hash >> 16);
return hash;
}
I was wondering if I could clarify a few things with regard to the arguments that are being passed here. "Key" is obviously the string that you are hashing. If this is defined in a struct as having an array length of 46, would this be the value that I would pass as "length" in the above function? The argument "seed", I take it this can be any arbitrary value as long it stays constant between hash calls? Are there any other parameters that I need to change keeping in mind that I am working on a 32-bit machine?
I take it I will also need to modulo the return hash by the size of my hash table?
In addition, if anyone could recommend a superior/faster alternative hash function used for strings then that would be much appreciated
Thanks in advance

About the question regarding the parameters: yes, just read the code, your assumptions are correct.
You don't need modulo as long as the size of your hash table is a power of 2. Then you can just use a bitmask, e.g. (pseudocode)
void* hashtbl[1<<8]; /* 256 */
int key = hash(value, ...) & ((1<<8) - 1); /* 0xff */
Also keep in mind that performance is not the only relevant characteristic of a hash function. It's very important to get an even distribution over the whole key space. I can't tell you how "good" murmurhash is in that respect, but it is probably much better than a very simple hash I used recently for playing around a bit:
static unsigned int
hash(const void *key, size_t keyLen, unsigned int hashmask)
{
size_t i;
unsigned int h = 5381;
for (i=0; i<keyLen; ++i)
{
h += (h << 5) + ((const unsigned char *)key)[i];
}
return h & hashmask;
}
although this simple function is probably faster. It's a tradeoff and a "clever" hashing algorithm tries to be as fast as possible while still giving good distribution. The simplistic function above doesn't really give good distribution, for example it will never use the whole key space for small input (less than 5 bytes).
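On the length argument from the question: passing the full array size (46) only gives equal hashes for equal strings if the bytes after the terminating NUL are identical in every record, for example because the struct was zero-filled. A hedged alternative is to hash only the bytes in use. In the sketch below, struct record, RECORD_NAME_SIZE and used_length are hypothetical names that merely mirror the 46-byte array described in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RECORD_NAME_SIZE 46   /* hypothetical: the array length from the question */

struct record {
    char name[RECORD_NAME_SIZE];
};

/* Number of bytes actually in use: everything before the first NUL,
   or the whole array if no NUL is present. */
size_t used_length(const struct record *r)
{
    const char *nul;
    assert(r != NULL);
    nul = memchr(r->name, '\0', sizeof r->name);
    return nul ? (size_t)(nul - r->name) : sizeof r->name;
}
```

The result can then be passed as the len argument, along the lines of murmur3_32(r->name, (uint32_t)used_length(r), seed), followed by the mask or modulo step described above.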

Is there a way to print the bits without using a loop in C?

Right now, what I do is this:
void print_bits(unsigned int x)
{
int i;
for(i=WORD_SIZE-1; i>=0; i--) {
putchar((x & (1u << i)) ? '1' : '0'); /* 1u: shifting a signed 1 into the sign bit is undefined */
}
printf("\n");
}
Also, it would be great to have a solution independent of word size (currently set to 32 in my example).

How about this:
void print2Bits(int a) {
char* table[] = {
"00",
"01",
"10",
"11"
};
fputs(table[a & 3], stdout); /* fputs, not puts: puts would append a newline after every 2 bits */
}
void printByte(int a) {
print2Bits(a >> 6);
print2Bits(a >> 4);
print2Bits(a >> 2);
print2Bits(a);
}
void print32Bits(int a) {
printByte(a >> 24);
printByte(a >> 16);
printByte(a >> 8);
printByte(a);
}
I think that's the closest you'll get to writing a binary number without a loop.

You may try itoa. Although it is not in standard C lib, it is available in most C compilers.
void print_bits(int x)
{
char bits[33];
itoa(x, bits, 2);
puts(bits);
}

Rather than making multiple calls to putchar or printf in a loop it's likely to be more efficient to build a temporary string first and then output that via one call to e.g. puts:
void print_bits(unsigned int x)
{
const unsigned int n = sizeof(x) * CHAR_BIT;
unsigned int mask = 1u << (n - 1); /* 1u avoids undefined behaviour from shifting into the sign bit; CHAR_BIT needs <limits.h> */
char s[n + 1];
for (unsigned int i = 0; i < n; ++i)
{
s[i] = (x & mask) ? '1' : '0';
mask >>= 1;
}
s[n] = '\0';
puts(s);
}

Here is a slightly hacky way of doing it for a byte that I found some time ago. I think it's worth linking here despite it not being the best solution.
http://gynvael.coldwind.pl/n/c_cpp_number_to_binary_string_01011010
void to_bin(unsigned char c, char *out) {
*(unsigned long long*)out = 0x3030303030303030ULL // ASCII '0'*8
+ (((c * 0x8040201008040201ULL) // spread out eight copies of c
>>7) & 0x101010101010101ULL); // shift to LSB & mask
}
The method provided by @cmaster is optimal and clean. Doing it in chunks of 8 bits could be better, though: you would construct the table in a loop, using your method, to avoid writing 256 strings manually. I don't think memory would be an issue either (the table would take about 2 kB).
That said, I don't think there is a way to do it for a variable of any size without a loop.
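The 8-bits-at-a-time idea can be sketched as below. The table is built once with a loop, and formatting a fixed-width value afterwards needs no loop at all. The names byte_bits, build_table and format_bits32 are invented for this example, and the width is fixed at 32 bits as an assumption; a value of arbitrary size would still need a loop over its bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static char byte_bits[256][8];   /* 8 digits per byte value, no terminator: 2 kB */
static int table_ready = 0;

/* Build the lookup table once, most significant bit first. */
static void build_table(void)
{
    for (int v = 0; v < 256; v++)
        for (int b = 0; b < 8; b++)
            byte_bits[v][b] = (v & (0x80 >> b)) ? '1' : '0';
    table_ready = 1;
}

/* Writes 32 binary digits plus a terminating NUL into out (33 bytes)
   and returns out. */
char *format_bits32(uint32_t x, char out[33])
{
    assert(out != NULL);
    if (!table_ready)
        build_table();
    memcpy(out,      byte_bits[(x >> 24) & 0xff], 8);
    memcpy(out + 8,  byte_bits[(x >> 16) & 0xff], 8);
    memcpy(out + 16, byte_bits[(x >> 8) & 0xff],  8);
    memcpy(out + 24, byte_bits[x & 0xff],         8);
    out[32] = '\0';
    return out;
}
```

Printing is then a single call such as puts(format_bits32(x, buf)) with a 33-byte buffer.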

hash function for string

I'm working on hash table in C language and I'm testing hash function for string.
The first function I've tried is to add ascii code and use modulo (% 100) but i've got poor results with the first test of data: 40 collisions for 130 words.
The final input data will contain 8000 words (it's a dictionary stores in a file). The hash table is declared as int table[10000] and contains the position of the word in a .txt file.
Which is the best algorithm for hashing string?
And how to determinate the size of hash table?

I've had nice results with djb2 by Dan Bernstein.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
int hash(char const *input) {
int result = 0x55555555;
while (*input) {
result ^= *input++;
result = rol(result, 5);
}
}
Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).

I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran different little hashing functions on the list of 466K English words to see number of collisions for each:
Hash function | Collisions | Time (words) | Time (file)
=================================================================
CRC32 | 23 (0.005%) | 112 ms | 38 ms
MurmurOAAT | 26 (0.006%) | 86 ms | 10 ms
FNV hash | 32 (0.007%) | 87 ms | 7 ms
Jenkins OAAT | 36 (0.008%) | 90 ms | 8 ms
DJB2 hash | 344 (0.074%) | 87 ms | 5 ms
K&R V2 | 356 (0.076%) | 86 ms | 5 ms
Coffin | 763 (0.164%) | 86 ms | 4 ms
x17 hash | 2242 (0.481%) | 87 ms | 7 ms
-----------------------------------------------------------------
MurmurHash3_x86_32 | 19 (0.004%) | 90 ms | 3 ms
I included time for both: hashing all words individually and hashing the entire file of all English words once. I also included a more complex MurmurHash3_x86_32 into my test for reference.
Conclusion:
there is almost no point of using the popular DJB2 hash function for strings on Intel x86-64 (or AArch64 for that matter) architecture. Because it has much more collisions than similar functions (MurmurOAAT, FNV and Jenkins OAAT) while having very similar throughput. Bernstein's DJB2 performs especially bad on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.
Test code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#define MAXLINE 2048
#define SEED 0x12345678
uint32_t DJB2_hash(const uint8_t *str)
{
uint32_t hash = 5381;
uint8_t c;
while ((c = *str++))
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
uint32_t FNV(const void* key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
h ^= 2166136261UL;
const uint8_t* data = (const uint8_t*)key;
for(int i = 0; i < len; i++)
{
h ^= data[i];
h *= 16777619;
}
return h;
}
uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
// One-byte-at-a-time hash based on Murmur's mix
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
for (; *str; ++str) {
h ^= *str;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint32_t KR_v2_hash(const char *s)
{
// Source: https://stackoverflow.com/a/45641002/5407270
uint32_t hashval = 0;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval;
}
uint32_t Jenkins_one_at_a_time_hash(const char *str, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += str[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
uint32_t crc32b(const uint8_t *str) {
// Source: https://stackoverflow.com/a/21001712
unsigned int byte, crc, mask;
int i = 0, j;
crc = 0xFFFFFFFF;
while (str[i] != 0) {
byte = str[i];
crc = crc ^ byte;
for (j = 7; j >= 0; j--) {
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
i = i + 1;
}
return ~crc;
}
inline uint32_t _rotl32(uint32_t x, int32_t bits)
{
return x<<bits | x>>(32-bits); // C idiom: will be optimized to a single operation
}
uint32_t Coffin_hash(char const *input) {
// Source: https://stackoverflow.com/a/7666668/5407270
uint32_t result = 0x55555555;
while (*input) {
result ^= *input++;
result = _rotl32(result, 5);
}
return result;
}
uint32_t x17(const void * key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
const uint8_t * data = (const uint8_t*)key;
for (int i = 0; i < len; ++i)
{
h = 17 * h + (data[i] - ' ');
}
return h ^ (h >> 16);
}
uint32_t apply_hash(int hash, const char* line)
{
switch (hash) {
case 1: return crc32b((const uint8_t*)line);
case 2: return MurmurOAAT_32(line, SEED);
case 3: return FNV(line, strlen(line), SEED);
case 4: return Jenkins_one_at_a_time_hash(line, strlen(line));
case 5: return DJB2_hash((const uint8_t*)line);
case 6: return KR_v2_hash(line);
case 7: return Coffin_hash(line);
case 8: return x17(line, strlen(line), SEED);
default: break;
}
return 0;
}
int main(int argc, char* argv[])
{
// Read arguments
const int hash_choice = atoi(argv[1]);
char const* const fn = argv[2];
// Read file
FILE* f = fopen(fn, "r");
// Read file line by line, calculate hash
char line[MAXLINE];
while (fgets(line, sizeof(line), f)) {
line[strcspn(line, "\n")] = '\0'; // strip newline
uint32_t hash = apply_hash(hash_choice, line);
printf("%08x\n", hash);
}
fclose(f);
return 0;
}
P.S. A more comprehensive review of speed and quality of modern hash functions can be found in SMHasher repository of Reini Urban (rurban). Notice the "Quality problems" column in the table.

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}

There are a number of existing hashtable implementations for C, from the C standard library hcreate/hdestroy/hsearch, to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.

djb2 has 317 collisions for this 466k english dictionary while MurmurHash has none for 64 bit hashes, and 21 for 32 bit hashes (around 25 is to be expected for 466k random 32 bit hashes).
My recommendation is using MurmurHash if available, it is very fast, because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste to your project I'd recommend using murmurs one-byte-at-a-time version:
uint32_t inline MurmurOAAT32 ( const char * key)
{
uint32_t h(3323198485ul);
for (;*key;++key) {
h ^= *key;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint64_t inline MurmurOAAT64 ( const char * key)
{
uint64_t h(525201411107845655ull);
for (;*key;++key) {
h ^= *key;
h *= 0x5bd1e9955bd1e995;
h ^= h >> 47;
}
return h;
}
The optimal size of a hash table is - in short - as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory we have available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster but at rapidly diminishing returns, making your hash table smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58... apparently.

djb2 is good
Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:
One of the K&R hashes is terrible, one is probably pretty good:
Apparently a terrible hash algorithm, as presented in K&R 1st edition. This is simply a summation of all bytes in the string (source):
unsigned long hash(unsigned char *str)
{
unsigned int hash = 0;
int c;
while (c = *str++)
hash += c;
return hash;
}
Probably a pretty decent hash algorithm, as presented in K&R version 2 (verified by me on pg. 144 of the book); NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long, or even better: uint32_t or uint64_t, instead of the simple unsigned (int). This is a simple algorithm which takes into account byte order of each byte in the string by doing this style of algorithm: hashvalue = new_byte + 31*hashvalue, for all bytes in the string:
unsigned hash(char *s)
{
unsigned hashval;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval % HASHSIZE;
}
Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is because it does NOT take into consideration string character order, so hash("ab") would therefore return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which would (much better!) return two different values for those strings.
The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.
The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.
This is a partial answer to the question of what are the GCC C++11 hash functions used, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):
Code:
// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
const size_t m = 0x5bd1e995;
size_t hash = seed ^ len;
const char* buf = static_cast<const char*>(ptr);
// Mix 4 bytes at a time into the hash.
while (len >= 4)
{
size_t k = unaligned_load(buf);
k *= m;
k ^= k >> 24;
k *= m;
hash *= m;
hash ^= k;
buf += 4;
len -= 4;
}
// Handle the last few bytes of the input array.
switch (len)
{
case 3:
hash ^= static_cast<unsigned char>(buf[2]) << 16;
[[gnu::fallthrough]];
case 2:
hash ^= static_cast<unsigned char>(buf[1]) << 8;
[[gnu::fallthrough]];
case 1:
hash ^= static_cast<unsigned char>(buf[0]);
hash *= m;
};
// Do a few final mixes of the hash.
hash ^= hash >> 13;
hash *= m;
hash ^= hash >> 15;
return hash;
}
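Since the question is about C and the routine above is C++, here is a rough C port of the same 32-bit code. Treat it as a sketch: the name murmur2_32 is made up, memcpy stands in for GCC's unaligned_load (so the 4-byte loads use host byte order and results can differ between little- and big-endian machines), and it assumes uint32_t is no narrower than unsigned int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C port of the 32-bit routine shown above. */
uint32_t murmur2_32(const void *ptr, size_t len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995u;
    const unsigned char *buf = (const unsigned char *)ptr;
    uint32_t hash = seed ^ (uint32_t)len;

    assert(ptr != NULL || len == 0);

    /* Mix 4 bytes at a time; memcpy avoids unaligned pointer access. */
    while (len >= 4) {
        uint32_t k;
        memcpy(&k, buf, sizeof k);   /* host byte order */
        k *= m;
        k ^= k >> 24;
        k *= m;
        hash *= m;
        hash ^= k;
        buf += 4;
        len -= 4;
    }

    /* Fold in the remaining 0..3 bytes. */
    switch (len) {
    case 3: hash ^= (uint32_t)buf[2] << 16; /* fall through */
    case 2: hash ^= (uint32_t)buf[1] << 8;  /* fall through */
    case 1: hash ^= buf[0];
            hash *= m;
    }

    /* Final mixing steps. */
    hash ^= hash >> 13;
    hash *= m;
    hash ^= hash >> 15;
    return hash;
}
```

A call such as murmur2_32(str, strlen(str), 0) hashes a NUL-terminated UTF-8 string byte by byte; the seed value 0 is just an example.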
MurmurHash3 by Austin Appleby is best! It's an improvement over even his hash used by GCC's C++11 std::unordered_map<>, shown above.
Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
See also
Other hash table algorithms to try out and test: http://www.cse.yorku.ca/~oz/hash.html. Hash algorithms mentioned there:
djb2
sdbm
lose lose (K&R 1st edition)

First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.
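To put a number on "a random generator", here is a small sketch. It assumes a collision is counted each time a key lands in a slot that is already occupied, which may not be how the 40 in the question was counted; the function name is made up.

```c
#include <assert.h>

/*
 * Expected number of collisions when n keys are placed uniformly at random
 * into m slots, counting one collision per key that lands in an occupied
 * slot: n - m * (1 - (1 - 1/m)^n).
 */
double expected_collisions(unsigned n, unsigned m)
{
    double p_empty = 1.0;   /* probability that a given slot stays empty */
    unsigned i;
    assert(m > 0);
    for (i = 0; i < n; i++)
        p_empty *= 1.0 - 1.0 / m;
    return n - m * (1.0 - p_empty);
}
```

For n = 130 and m = 100 this comes out at roughly 57, so under that counting convention 40 collisions is no worse than what purely random placement would be expected to give.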
A hash function with a good reputation is MurmurHash3.
Finally, regarding the size of the hash table, it really depends what kind of hash table you have in mind, especially, whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
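For the extensible-bucket (chained) case, the sizing arithmetic is just a division rounded up. A minimal sketch, with a made-up function name and the average chain length as the knob you pick:

```c
#include <assert.h>
#include <stddef.h>

/* Buckets needed so that chains hold about avg_chain entries on average. */
size_t buckets_for(size_t n_keys, size_t avg_chain)
{
    assert(avg_chain > 0);
    return (n_keys + avg_chain - 1) / avg_chain;   /* round up */
}
```

For the 8000-word dictionary from the question, an average chain length of 2 gives 4000 buckets. With one-slot buckets the table has to be larger than the number of keys; the int table[10000] from the question would be 80% full.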

I have tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long (64 chars in different orders), hashed to 32-bit values. Codes from here.
Hash function    | Collision rate | Time to finish
==================================================
MurmurHash3      | 6.?%           | 4m15s
Jenkins One..    | 6.1%           | 6m54s
Bob, 1st in link | 6.16%          | 5m34s
SuperFastHash    | 10%            | 4m58s
bernstein        | 20%            | 14s (only finished 1/20)
one_at_a_time    | 6.16%          | 7m5s
crc              | 6.16%          | 7m56s
One strange thing is that almost all the hash functions have a collision rate of about 6% for my data.

One thing I've used with good results is the following (I don't know if it's been mentioned already, because I can't remember its name).
You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible. If you really run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
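A minimal sketch of that scheme, for illustration only. The xorshift generator used to fill the table, the seed and the function names are all assumptions, not part of the description above. The idea resembles what is usually called Zobrist or tabulation hashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t T[256];   /* one random word per possible byte value */

/* Fill T from a small xorshift generator; seed must be nonzero. */
void table_init(uint32_t seed)
{
    uint32_t x = seed;
    int i;
    assert(seed != 0);
    for (i = 0; i < 256; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        T[i] = x;
    }
}

/* XOR together the table entries selected by each byte of the key. */
uint32_t table_hash(const void *key, size_t len)
{
    const unsigned char *p = (const unsigned char *)key;
    uint32_t h = 0;
    size_t i;
    for (i = 0; i < len; i++)
        h ^= T[p[i]];
    return h;
}
```

One caveat with the scheme exactly as described: XOR is commutative, so keys that are permutations of each other ("ab" and "ba") get the same value, and a byte that occurs twice cancels itself out. Variants that use a separate table per byte position avoid this at the cost of more memory.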

Converting Char array to Long in C

This question may look silly, but please guide me.
I have a function to convert long data to char array
void ConvertLongToChar(char *pSrc, char *pDest)
{
pDest[0] = pSrc[0];
pDest[1] = pSrc[1];
pDest[2] = pSrc[2];
pDest[3] = pSrc[3];
}
And I call the above function like this
long lTemp = (long) (fRxPower * 1000);
ConvertLongToChar ((char *)&lTemp, pBuffer);
Which works fine.
I need a similar function to reverse the procedure. Convert char array to long.
I cannot use atol or similar functions.

You can do:
union {
unsigned char c[4];
long l;
} conv;
conv.l = 0xABC;
and access c[0] c[1] c[2] c[3]. This is good as it wastes no memory and is very fast because there is no shifting or any assignment besides the initial one and it works both ways.

Leaving the burden of matching the endianness with your other function to you, here's one way:
unsigned long int l = (unsigned long)pdest[0] | ((unsigned long)pdest[1] << 8) | ((unsigned long)pdest[2] << 16) | ((unsigned long)pdest[3] << 24);
Just to be safe, here's the corresponding other direction:
unsigned char pdest[4];
unsigned long int l;
pdest[0] = l & 0xFF;
pdest[1] = (l >> 8) & 0xFF;
pdest[2] = (l >> 16) & 0xFF;
pdest[3] = (l >> 24) & 0xFF;
Going from char[4] to long and back is entirely reversible; going from long to char[4] and back is reversible for values up to 2^32-1.
Note that all this is only well-defined for unsigned types.
(My example is little endian if you read pdest from left to right.)
Addendum: I'm also assuming that CHAR_BIT == 8. In general, substitute multiples of 8 by multiples of CHAR_BIT in the code.
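Put together as a pair of functions, the two directions might look like the sketch below. The names are made up, it uses uint32_t instead of unsigned long so the width is fixed at 32 bits, and it assumes CHAR_BIT == 8 as above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes, least significant byte first. */
uint32_t load_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Split a 32-bit value into 4 bytes, least significant byte first. */
void store_le32(unsigned char *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}
```

Because the byte order is fixed by the shifts rather than by the machine, a buffer written by store_le32 on one host reads back the same value with load_le32 on any other.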

A simple way would be to use memcpy:
char * buffer = ...;
long l;
memcpy(&l, buffer, sizeof(long));
That does not take endianness into account, however, so beware if you have to share data between multiple computers.
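As a matched pair, the memcpy approach could be wrapped like this. It's only a sketch with made-up names; it copies sizeof(long) bytes, which is not necessarily the 4 bytes the function in the question copies (long is 8 bytes on many 64-bit systems).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the object representation of a long into a byte buffer. */
void long_to_bytes(long value, unsigned char *dest)
{
    assert(dest != NULL);
    memcpy(dest, &value, sizeof value);
}

/* Inverse: rebuild the long from sizeof(long) bytes in host byte order. */
long bytes_to_long(const unsigned char *src)
{
    long value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}
```

The buffer passed to either function must hold at least sizeof(long) bytes.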

If you mean to treat sizeof (long) bytes memory as a single long, then you should do the below:
char char_arr[sizeof(long)];
long l;
memcpy (&l, char_arr, sizeof (long));
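The same thing can be done by assembling the long from its individual bytes with bit shifts and ORs, like below.
unsigned long u = 0; /* build the value unsigned to avoid sign extension */
u |= (unsigned long)(unsigned char)char_arr[0];
u |= (unsigned long)(unsigned char)char_arr[1] << 8;
u |= (unsigned long)(unsigned char)char_arr[2] << 16;
u |= (unsigned long)(unsigned char)char_arr[3] << 24;
l = (long)u;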
If you mean to convert "1234\0" string into 1234L then you should
l = strtol (char_arr, NULL, 10); /* to interpret the base as decimal */
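If the text case is what you need, strtol can also tell you when the input was not a clean number. A hedged sketch (parse_long is a made-up helper, not a library function):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns 1 and stores the value on success, 0 on any parse error. */
int parse_long(const char *text, long *out)
{
    char *end;
    long v;
    assert(text != NULL && out != NULL);
    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text)        /* no digits were consumed */
        return 0;
    if (*end != '\0')       /* trailing characters after the number */
        return 0;
    if (errno == ERANGE)    /* value does not fit in a long */
        return 0;
    *out = v;
    return 1;
}
```

With this, "1234" parses to 1234, while "12x" and the empty string are rejected.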

Does this work:
#include <stdio.h>
#include <string.h>

/* Parses a string of decimal digits; no sign or error handling. */
long ConvertCharToLong(const char *pSrc) {
    size_t i = 1;
    long result = pSrc[0] - '0';
    while (i < strlen(pSrc)) {
        result = result * 10 + (pSrc[i] - '0');
        ++i;
    }
    return result;
}

int main(void) {
    const char *str = "34878";
    printf("The answer is %ld", ConvertCharToLong(str)); /* %ld matches long */
    return 0;
}

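This is dirty (the cast can break alignment and strict-aliasing rules, so it is not portable), but it works on many common platforms: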
unsigned char myCharArray[8];
// Put some data in myCharArray here...
long long integer = *((long long*) myCharArray);

char charArray[8]; //ideally, zero initialise
unsigned long long int combined = *(unsigned long long int *) &charArray[0];
Be wary of strings that are null terminated, as you will end up copying any bytes beyond the null terminator into combined; thus in the above assignment, charArray needs to be fully zero-initialised for a "clean" conversion.

Just found this having tried more than one of the above to no avail :=( :
char * vIn = "0";
long vOut = strtol(vIn,NULL,10);
Worked perfectly for me.
To give credit where it is due, this is where I found it:
https://www.convertdatatypes.com/Convert-char-Array-to-long-in-C.html
