Reverse Jenkins' one-at-a-time hash - c

How would I go about obtaining any possible string value that matches a returned hash?
I don't want to obtain the exact key that was used, just any key that when passed into the function, will return the same hash of the unknown key.
uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
size_t i = 0;
uint32_t hash = 0;
while (i != length) {
hash += key[i++];
hash += hash << 10;
hash ^= hash >> 6;
}
hash += hash << 3;
hash ^= hash >> 11;
hash += hash << 15;
return hash;
}
E.g. I pass key as "keynumber1", the function returns 0xA7AF2FFE.
How would I find ANY string that can also be hashed into 0xA7AF2FFE?

While the brute force method suggested by chux works fine as it is, we can in fact speed it up by a factor of up to 256 or so (and, in fact, a lot more if we use all the optimizations described below).
The key realization here is that all the operations used to compute the hash are reversible. (This is by design, since it ensures that e.g. appending the same suffix to all input strings won't increase the number of hash collisions.) Specifically:
The operation hash += hash << n is, of course, equivalent to hash *= (1 << n) + 1. We're working with 32-bit unsigned integers, so all these calculations are done modulo 2^32. To undo this operation, all we need to do is find the modular multiplicative inverse of (1 << n) + 1 = 2^n + 1 modulo 2^32 and multiply hash by it.
We can do this pretty easily e.g. with this Python script, based on this answer here on SO. As it turns out, the multiplicative inverses of 2^10 + 1, 2^3 + 1 and 2^15 + 1 are, in hex, 0xC00FFC01, 0x38E38E39 and 0x3FFF8001 respectively.
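For readers who would rather stay in C than run a separate script, the inverses can also be computed directly. The following is only a minimal sketch, not the script referred to above: the name inverse_mod_2_32 is made up for illustration, and it relies on Newton's iteration x = x * (2 - a * x), which doubles the number of correct low bits in each round for any odd a.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original answer): multiplicative
 * inverse of an odd 32-bit number modulo 2^32, via Newton's iteration. */
static uint32_t inverse_mod_2_32(uint32_t a) {
    uint32_t x = a;               /* a*a == 1 (mod 8) for odd a: 3 correct bits */
    for (int i = 0; i < 4; i++)   /* 3 -> 6 -> 12 -> 24 -> 48 correct bits */
        x *= 2u - a * x;          /* unsigned arithmetic wraps modulo 2^32 */
    return x;
}
```

Calling inverse_mod_2_32((1u << 10) + 1u) gives 0xC00FFC01, matching the first constant quoted above; the other two constants can be checked the same way.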
To find the inverse of hash ^= hash >> n for some constant n, first note that this operation leaves the highest n bits of hash entirely unchanged. The next lower n bits are simply XORed with the highest n bits, so for those, simply repeating the operation undoes it. Looks pretty simple so far, right?
To recover the original values of the third highest group of n bits, we need to XOR them with the original values of the second highest n bits, which we can of course calculate by XORing the two highest groups of n bits as described above. And so on.
What this all boils down to is that the inverse operation to hash ^= hash >> n is:
hash ^= (hash >> n) ^ (hash >> 2*n) ^ (hash >> 3*n) ^ (hash >> 4*n) ^ ...
where, of course, we can cut off the series once the shift amount is equal to or greater than the number of bits in the integers we're working with (i.e. 32, in this case). Alternatively, we could achieve the same result in multiple steps, doubling the shift amount each time until it exceeds the bit length of the numbers we're working with, like this:
hash ^= hash >> n;
hash ^= hash >> 2*n;
hash ^= hash >> 4*n;
hash ^= hash >> 8*n;
// etc.
(The multiple step method scales better when n is small compared to the integer size, but for moderately large n, the single step method may suffer from fewer pipeline stalls on modern CPUs. It's hard to say which one is actually more efficient in any given situation without benchmarking them both, and the results may vary between compilers and CPU models. In any case, such micro-optimizations are mostly not worth worrying too much about.)
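Both ways of undoing hash ^= hash >> n can be written as small helpers and checked against each other. This is a sketch only: the function names are invented here, and n is assumed to be between 1 and 31 (shifting a 32-bit value by 32 or more is undefined behaviour in C, which is why both loops stop before that point).

```c
#include <stdint.h>

/* Single-step form: XOR together all shifts of the ORIGINAL value. */
static uint32_t undo_xorshift_series(uint32_t h, unsigned n) {
    uint32_t r = h;
    for (unsigned s = n; s < 32; s += n)
        r ^= h >> s;
    return r;
}

/* Multi-step form: repeat the xorshift, doubling the shift each time. */
static uint32_t undo_xorshift_doubling(uint32_t h, unsigned n) {
    for (unsigned s = n; s < 32; s *= 2)
        h ^= h >> s;
    return h;
}
```

For n = 6 the first helper expands to exactly the five-term expression used in the loop of the reversed hash, and for n = 11 to the two-term expression used for the finalization step.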
Finally, of course, the inverse of hash += key[i++] is simply hash -= key[--i].
All this means that, if we want to, we can run the hash in reverse like this:
uint32_t reverse_one_at_a_time_hash(const uint8_t* key, size_t length, uint32_t hash) {
hash *= 0x3FFF8001; // inverse of hash += hash << 15;
hash ^= (hash >> 11) ^ (hash >> 22);
hash *= 0x38E38E39; // inverse of hash += hash << 3;
size_t i = length;
while (i > 0) {
hash ^= (hash >> 6) ^ (hash >> 12) ^ (hash >> 18) ^ (hash >> 24) ^ (hash >> 30);
hash *= 0xC00FFC01; // inverse of hash += hash << 10;
hash -= key[--i];
}
return hash; // this should return 0 if the original hash was correct
}
Then calling, say, reverse_one_at_a_time_hash("keynumber1", 10, 0xA7AF2FFE) should return zero, as indeed it does.
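A quick way to gain confidence in the inverse is to hash arbitrary keys forwards and check that running the result backwards over the same key lands on zero. The sketch below repeats the two functions from above so that it compiles on its own; round_trip_ok is a helper name made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t jenkins_one_at_a_time_hash(const uint8_t *key, size_t length) {
    size_t i = 0;
    uint32_t hash = 0;
    while (i != length) {
        hash += key[i++];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

static uint32_t reverse_one_at_a_time_hash(const uint8_t *key, size_t length, uint32_t hash) {
    hash *= 0x3FFF8001;                   /* inverse of hash += hash << 15 */
    hash ^= (hash >> 11) ^ (hash >> 22);  /* inverse of hash ^= hash >> 11 */
    hash *= 0x38E38E39;                   /* inverse of hash += hash << 3 */
    size_t i = length;
    while (i > 0) {
        hash ^= (hash >> 6) ^ (hash >> 12) ^ (hash >> 18) ^ (hash >> 24) ^ (hash >> 30);
        hash *= 0xC00FFC01;               /* inverse of hash += hash << 10 */
        hash -= key[--i];
    }
    return hash;
}

/* Hypothetical helper: returns 1 if hashing forwards and then backwards
 * over the same key brings the state back to zero. */
static int round_trip_ok(const uint8_t *key, size_t length) {
    uint32_t h = jenkins_one_at_a_time_hash(key, length);
    return reverse_one_at_a_time_hash(key, length, h) == 0;
}
```

Because every step is undone exactly, the check should succeed for any key, including the empty one.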
OK, that's cool. But what good is this for finding preimages?
Well, for one thing, if we guess all but the first byte of the input, then we can set the first byte to zero and run the hash backwards over this input. At this point, there are two possible outcomes:
If running the hash backwards like this produces an output that is a valid input byte (i.e. no greater than 255, and possibly with other restrictions if you e.g. want all the input bytes to be printable ASCII), then we can set the first byte of the input to that value, and we're done!
Conversely, if the result of running the hash backwards is not a valid input byte (e.g. if it's greater than 255), then we know that there's no first byte that could make the rest of the input hash to the output we want, and we'll need to try another guess instead.
Here's an example, which finds the same input as chux's code (but prints it as a quoted string, not as a little-endian int):
#define TARGET_HASH 0xA7AF2FFE
#define INPUT_LEN 4
int main() {
uint8_t buf[INPUT_LEN+1]; // buffer for guessed input (and one more null byte at the end)
for (int i = 0; i <= INPUT_LEN; i++) buf[i] = 0;
do {
uint32_t ch = reverse_one_at_a_time_hash(buf, INPUT_LEN, TARGET_HASH);
if (ch <= 255) {
buf[0] = ch;
// print the input with unprintable chars nicely quoted
printf("hash(\"");
for (int i = 0; i < INPUT_LEN; i++) {
if (buf[i] < 32 || buf[i] > 126 || buf[i] == '"' || buf[i] == '\\') printf("\\x%02X", buf[i]);
else putchar(buf[i]);
}
printf("\") = 0x%08X\n", TARGET_HASH);
return 0;
}
// increment buffer, starting from second byte
for (int i = 1; ++buf[i] == 0; i++) /* nothing */;
} while (buf[INPUT_LEN] == 0);
printf("No matching input of %d bytes found for hash 0x%08X. :(", INPUT_LEN, TARGET_HASH);
return 1;
}
And here's a version that restricts the input to printable ASCII (and outputs the five-byte string ^U_N.):
#define TARGET_HASH 0xA7AF2FFE
#define MIN_INPUT_CHAR ' '
#define MAX_INPUT_CHAR '~'
#define INPUT_LEN 5
int main() {
uint8_t buf[INPUT_LEN+1]; // buffer for guessed input (and one more null byte at the end)
buf[0] = buf[INPUT_LEN] = 0;
for (int i = 1; i < INPUT_LEN; i++) buf[i] = MIN_INPUT_CHAR;
do {
uint32_t ch = reverse_one_at_a_time_hash(buf, INPUT_LEN, TARGET_HASH);
if (ch >= MIN_INPUT_CHAR && ch <= MAX_INPUT_CHAR) {
buf[0] = ch;
printf("hash(\"%s\") = 0x%08X\n", buf, TARGET_HASH);
return 0;
}
// increment buffer, starting from second byte, while keeping bytes within the valid range
int i = 1;
while (buf[i] >= MAX_INPUT_CHAR) buf[i++] = MIN_INPUT_CHAR;
buf[i]++;
} while (buf[INPUT_LEN] == 0);
printf("No matching input of %d bytes found for hash 0x%08X. :(", INPUT_LEN, TARGET_HASH);
return 1;
}
Of course, it's easy to modify this code to be even more restrictive about which input bytes to accept. For example, using the following settings:
#define TARGET_HASH 0xA7AF2FFE
#define MIN_INPUT_CHAR 'A'
#define MAX_INPUT_CHAR 'Z'
#define INPUT_LEN 7
produces (after a few seconds of computation) the preimage KQEJZVS.
Restricting the input range does make the code run slower, since the probability of the result of the backwards hash computation being a valid input byte is, of course, proportional to the number of possible valid bytes.
There are various ways in which this code could be made to run even faster. For example, we could combine the backwards hashing with a recursive search, so that we don't have to repeatedly hash the whole input string even if only one byte of it changes:
#define TARGET_HASH 0xA7AF2FFE
#define MIN_INPUT_CHAR 'A'
#define MAX_INPUT_CHAR 'Z'
#define INPUT_LEN 7
static bool find_preimage(uint32_t hash, uint8_t *buf, int depth) {
// first invert the hash mixing step
hash ^= (hash >> 6) ^ (hash >> 12) ^ (hash >> 18) ^ (hash >> 24) ^ (hash >> 30);
hash *= 0xC00FFC01; // inverse of hash += hash << 10;
// then check if we're down to the first byte
if (depth == 0) {
bool found = (hash >= MIN_INPUT_CHAR && hash <= MAX_INPUT_CHAR);
if (found) buf[0] = hash;
return found;
}
// otherwise try all possible values for this byte
for (uint32_t ch = MIN_INPUT_CHAR; ch <= MAX_INPUT_CHAR; ch++) {
bool found = find_preimage(hash - ch, buf, depth - 1);
if (found) { buf[depth] = ch; return true; }
}
return false;
}
int main() {
uint8_t buf[INPUT_LEN+1]; // buffer for results
for (int i = 0; i <= INPUT_LEN; i++) buf[i] = 0; // zero the whole buffer, including the terminator
// first undo the finalization step
uint32_t hash = TARGET_HASH;
hash *= 0x3FFF8001; // inverse of hash += hash << 15;
hash ^= (hash >> 11) ^ (hash >> 22);
hash *= 0x38E38E39; // inverse of hash += hash << 3;
// then search recursively until we find a matching input
bool found = find_preimage(hash, buf, INPUT_LEN - 1);
if (found) {
printf("hash(\"%s\") = 0x%08X\n", buf, TARGET_HASH);
} else {
printf("No matching input of %d bytes found for hash 0x%08X. :(", INPUT_LEN, TARGET_HASH);
}
return !found;
}
But wait, we're not done yet! Looking at the original code of the one-at-a-time hash, we can see that the value of hash after the first iteration of the loop will be ((c << 10) + c) ^ ((c << 4) + (c >> 6)), where c is the first byte of input. Since c is an eight-bit byte, this means that only the lowest 18 bits of hash can be set after the first iteration.
In fact, if we calculate the value of hash after the first iteration for every possible value of the first byte c, we can see that hash never exceeds 1042 * c. (The maximum of the ratio hash / c is only 1041.015625 = 1041 + 2^-6.) This means that, if M is the maximum possible value of a valid input byte, the value of hash after the first iteration cannot exceed 1042 * M. And adding in the next input byte only increases hash by at most M.
So we can speed up the code above significantly by adding the following shortcut check into find_preimage():
// optimization: return early if no first two bytes can possibly match
if (depth == 1 && hash > MAX_INPUT_CHAR * 1043) return false;
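The bound behind this shortcut is easy to confirm by brute force over all 256 byte values. The sketch below uses helper names invented for this example; it recomputes the state after the first loop iteration and checks it against 1042 * c. Adding at most M for the second byte then gives the 1043 * M limit used in the check above.

```c
#include <stdint.h>

/* State of the hash after the first loop iteration, for first byte c. */
static uint32_t state_after_first_byte(uint8_t c) {
    uint32_t hash = c;
    hash += hash << 10;
    hash ^= hash >> 6;
    return hash;
}

/* Returns 1 if state <= 1042 * c holds for every byte value c, else 0. */
static int first_byte_bound_holds(void) {
    for (unsigned c = 0; c <= 255; c++)
        if (state_after_first_byte((uint8_t)c) > 1042u * c)
            return 0;
    return 1;
}
```

The ratio reaches its maximum at c = 64, where the state is 66625 = 64 * 1041.015625.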
In fact, a similar argument can be used to show that, after processing the first two bytes, at most the lowest 28 bits of hash can be set (and, more precisely, that the ratio of hash to the maximum input byte value is at most 1084744.46667). So we can extend the optimization above to cover the last three stages of the search by rewriting find_preimage() like this:
static bool find_preimage(uint32_t hash, uint8_t *buf, int depth) {
// first invert the hash mixing step
hash ^= (hash >> 6) ^ (hash >> 12) ^ (hash >> 18) ^ (hash >> 24) ^ (hash >> 30);
hash *= 0xC00FFC01; // inverse of hash += hash << 10;
// for the lowest three levels, abort early if no solution is possible
switch (depth) {
case 0:
if (hash < MIN_INPUT_CHAR || hash > MAX_INPUT_CHAR) return false;
buf[0] = hash;
return true;
case 1:
if (hash > MAX_INPUT_CHAR * 1043) return false;
else break;
case 2:
if (hash > MAX_INPUT_CHAR * 1084746) return false;
else break;
}
// otherwise try all possible values for this byte
for (uint32_t ch = MIN_INPUT_CHAR; ch <= MAX_INPUT_CHAR; ch++) {
bool found = find_preimage(hash - ch, buf, depth - 1);
if (found) { buf[depth] = ch; return true; }
}
return false;
}
For the example search for a seven byte all-uppercase preimage of the hash 0xA7AF2FFE, this further optimization cuts the running time down to just 0.075 seconds (as opposed to 0.148 seconds for the depth == 1 shortcut alone, 2.456 seconds for the recursive search with no shortcuts, and 15.489 seconds for the non-recursive search, as timed by TIO).

If the hash function is good, just try lots of combinations of keys and see if the hash matches. That is the point of a good hash. It is hard to reverse.
I'd estimate with about 2^32 attempts, you would have a 50% chance of finding one. The below took a few seconds.
With this hash, short cuts may apply.
int main() {
const char *key1 = "keynumber1";
uint32_t match = jenkins_one_at_a_time_hash((const uint8_t *) key1, strlen(key1));
printf("Target 0x%lX\n", (unsigned long) match);
uint32_t i = 0;
do {
uint32_t hash = jenkins_one_at_a_time_hash((const uint8_t *) &i, sizeof i);
if (hash == match) {
printf("0x%lX: 0x%lX\n", (unsigned long) i, (unsigned long) hash);
fflush(stdout);
}
} while (++i);
const char *key2 = "\x3C\xA0\x94\xB9";
uint32_t match2 = jenkins_one_at_a_time_hash((const uint8_t *) key2, strlen(key2));
printf("Match 0x%lX\n", (unsigned long) match2);
}
Output
Target 0xA7AF2FFE
0xB994A03C: 0xA7AF2FFE
Match 0xA7AF2FFE

Related

how to signature a string to generate a uint64 value? [duplicate]

I'm working on hash table in C language and I'm testing hash function for string.
The first function I tried adds up the ASCII codes and takes the result modulo 100, but I got poor results on the first test data: 40 collisions for 130 words.
The final input data will contain 8000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of the word in a .txt file.
Which is the best algorithm for hashing strings?
And how do I determine the size of the hash table?
I've had nice results with djb2 by Dan Bernstein.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
int hash(char const *input) {
int result = 0x55555555;
while (*input) {
result ^= *input++;
result = rol(result, 5); /* rol: rotate-left helper, not defined in this snippet */
}
return result;
}
Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
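To make the two sizing options concrete, here is a small sketch of how a 32-bit hash could be reduced to a table index in each case. The names is_power_of_two and bucket_index are made up for illustration, and the sizes mentioned below (16384 and the prime 10007) are just example values near the 10000 slots from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* True if n is a power of two (n != 0 and exactly one bit set). */
static int is_power_of_two(size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Map a 32-bit hash to a slot in a table of table_size entries. */
static size_t bucket_index(uint32_t hash, size_t table_size) {
    if (is_power_of_two(table_size))
        return hash & (table_size - 1);  /* cheap bit-mask, same result as % */
    return hash % table_size;            /* general case, e.g. a prime size */
}
```

With a table of 16384 slots the mask path is taken; with 10007 slots the function falls back to the modulo.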
I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran different little hashing functions on the list of 466K English words to see the number of collisions for each:
Hash function | Collisions | Time (words) | Time (file)
=================================================================
CRC32 | 23 (0.005%) | 112 ms | 38 ms
MurmurOAAT | 26 (0.006%) | 86 ms | 10 ms
FNV hash | 32 (0.007%) | 87 ms | 7 ms
Jenkins OAAT | 36 (0.008%) | 90 ms | 8 ms
DJB2 hash | 344 (0.074%) | 87 ms | 5 ms
K&R V2 | 356 (0.076%) | 86 ms | 5 ms
Coffin | 763 (0.164%) | 86 ms | 4 ms
x17 hash | 2242 (0.481%) | 87 ms | 7 ms
-----------------------------------------------------------------
MurmurHash3_x86_32 | 19 (0.004%) | 90 ms | 3 ms
I included time for both: hashing all words individually and hashing the entire file of all English words once. I also included a more complex MurmurHash3_x86_32 into my test for reference.
Conclusion:
there is almost no point in using the popular DJB2 hash function for strings on Intel x86-64 (or AArch64, for that matter): it has many more collisions than similar functions (MurmurOAAT, FNV and Jenkins OAAT) while having very similar throughput. Bernstein's DJB2 performs especially badly on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.
Test code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#define MAXLINE 2048
#define SEED 0x12345678
uint32_t DJB2_hash(const uint8_t *str)
{
uint32_t hash = 5381;
uint8_t c;
while ((c = *str++))
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
uint32_t FNV(const void* key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
h ^= 2166136261UL;
const uint8_t* data = (const uint8_t*)key;
for(int i = 0; i < len; i++)
{
h ^= data[i];
h *= 16777619;
}
return h;
}
uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
// One-byte-at-a-time hash based on Murmur's mix
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
for (; *str; ++str) {
h ^= *str;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint32_t KR_v2_hash(const char *s)
{
// Source: https://stackoverflow.com/a/45641002/5407270
uint32_t hashval = 0;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval;
}
uint32_t Jenkins_one_at_a_time_hash(const char *str, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += str[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
uint32_t crc32b(const uint8_t *str) {
// Source: https://stackoverflow.com/a/21001712
unsigned int byte, crc, mask;
int i = 0, j;
crc = 0xFFFFFFFF;
while (str[i] != 0) {
byte = str[i];
crc = crc ^ byte;
for (j = 7; j >= 0; j--) {
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
i = i + 1;
}
return ~crc;
}
static inline uint32_t _rotl32(uint32_t x, int32_t bits) // static: a plain C99 inline may fail to link; bits must be 1..31
{
return x<<bits | x>>(32-bits); // C idiom: will be optimized to a single operation
}
uint32_t Coffin_hash(char const *input) {
// Source: https://stackoverflow.com/a/7666668/5407270
uint32_t result = 0x55555555;
while (*input) {
result ^= *input++;
result = _rotl32(result, 5);
}
return result;
}
uint32_t x17(const void * key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
const uint8_t * data = (const uint8_t*)key;
for (int i = 0; i < len; ++i)
{
h = 17 * h + (data[i] - ' ');
}
return h ^ (h >> 16);
}
uint32_t apply_hash(int hash, const char* line)
{
switch (hash) {
case 1: return crc32b((const uint8_t*)line);
case 2: return MurmurOAAT_32(line, SEED);
case 3: return FNV(line, strlen(line), SEED);
case 4: return Jenkins_one_at_a_time_hash(line, strlen(line));
case 5: return DJB2_hash((const uint8_t*)line);
case 6: return KR_v2_hash(line);
case 7: return Coffin_hash(line);
case 8: return x17(line, strlen(line), SEED);
default: break;
}
return 0;
}
int main(int argc, char* argv[])
{
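// Read arguments
if (argc < 3) {
fprintf(stderr, "usage: %s <hash 1-8> <wordlist>\n", argv[0]);
return 1;
}
const int hash_choice = atoi(argv[1]);
char const* const fn = argv[2];
// Read file
FILE* f = fopen(fn, "r");
if (!f) { perror(fn); return 1; }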
// Read file line by line, calculate hash
char line[MAXLINE];
while (fgets(line, sizeof(line), f)) {
line[strcspn(line, "\n")] = '\0'; // strip newline
uint32_t hash = apply_hash(hash_choice, line);
printf("%08x\n", hash);
}
fclose(f);
return 0;
}
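The test program only prints one hash per line; how the collision counts in the table were obtained from that output is not shown. One plausible way, and this is an assumption rather than the author's actual method, is to sort the hash values and count every value that repeats an earlier one. The names cmp_u32 and count_collisions are made up for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t values. */
static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sorts hashes in place and counts every value that repeats an earlier one. */
static size_t count_collisions(uint32_t *hashes, size_t n) {
    size_t collisions = 0;
    qsort(hashes, n, sizeof hashes[0], cmp_u32);
    for (size_t i = 1; i < n; i++)
        if (hashes[i] == hashes[i - 1])
            collisions++;
    return collisions;
}
```

Note that under this definition duplicate words in the input file would also be counted, so the word list should be de-duplicated first.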
P.S. A more comprehensive review of speed and quality of modern hash functions can be found in SMHasher repository of Reini Urban (rurban). Notice the "Quality problems" column in the table.
Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
There are a number of existing hashtable implementations for C, from the C standard library hcreate/hdestroy/hsearch, to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
djb2 has 317 collisions for this 466k english dictionary while MurmurHash has none for 64 bit hashes, and 21 for 32 bit hashes (around 25 is to be expected for 466k random 32 bit hashes).
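The "around 25" figure follows from the usual birthday approximation: among n uniformly random b-bit values, the expected number of colliding pairs is about n(n-1)/2 divided by 2^b. A minimal sketch of that arithmetic (the function name is invented here) is:

```c
/* Expected number of colliding pairs among n uniformly random b-bit hashes,
 * using the approximation n*(n-1)/2 / 2^b. */
static double expected_collisions(double n, unsigned bits) {
    double space = 1.0;
    for (unsigned i = 0; i < bits; i++)
        space *= 2.0;                 /* space = 2^bits */
    return n * (n - 1.0) / 2.0 / space;
}
```

For n = 466000 and 32 bits this comes to a little over 25, in line with the estimate above, while for 64 bits it is vanishingly small.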
My recommendation is using MurmurHash if available, it is very fast, because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste to your project I'd recommend using murmurs one-byte-at-a-time version:
uint32_t inline MurmurOAAT32 ( const char * key)
{
uint32_t h(3323198485ul);
for (;*key;++key) {
h ^= *key;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint64_t inline MurmurOAAT64 ( const char * key)
{
uint64_t h(525201411107845655ull);
for (;*key;++key) {
h ^= *key;
h *= 0x5bd1e9955bd1e995;
h ^= h >> 47;
}
return h;
}
The optimal size of a hash table is - in short - as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory we have available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster but at rapidly diminishing returns, making your hash table smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58... apparently.
djb2 is good
Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:
One of the K&R hashes is terrible, one is probably pretty good:
Apparently a terrible hash algorithm, as presented in K&R 1st edition. This is simply a summation of all bytes in the string (source):
unsigned long hash(unsigned char *str)
{
unsigned int hash = 0;
int c;
while (c = *str++)
hash += c;
return hash;
}
Probably a pretty decent hash algorithm, as presented in K&R version 2 (verified by me on pg. 144 of the book); NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long, or even better: uint32_t or uint64_t, instead of the simple unsigned (int). This is a simple algorithm which takes into account byte order of each byte in the string by doing this style of algorithm: hashvalue = new_byte + 31*hashvalue, for all bytes in the string:
unsigned hash(char *s)
{
unsigned hashval;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval % HASHSIZE;
}
Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is because it does NOT take into consideration string character order, so hash("ab") would therefore return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which would (much better!) return two different values for those strings.
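The difference is easy to see on the two-character example. The sketch below restates both hashes with made-up names (kr1_hash, kr2_hash), drops the % HASHSIZE step, and casts each character to unsigned char, which is a small deviation from the book's code.

```c
/* K&R 1st edition style: plain sum of the bytes. */
static unsigned kr1_hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h;
}

/* K&R 2nd edition style, without the final % HASHSIZE. */
static unsigned kr2_hash(const char *s) {
    unsigned h = 0;
    for (; *s != '\0'; s++)
        h = (unsigned char)*s + 31 * h;
    return h;
}
```

In ASCII, kr1_hash gives 195 for both "ab" and "ba", while kr2_hash gives 98 + 31*97 = 3105 for "ab" and 97 + 31*98 = 3135 for "ba".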
The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.
The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.
This is a partial answer to the question of what are the GCC C++11 hash functions used, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):
Code:
// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
const size_t m = 0x5bd1e995;
size_t hash = seed ^ len;
const char* buf = static_cast<const char*>(ptr);
// Mix 4 bytes at a time into the hash.
while (len >= 4)
{
size_t k = unaligned_load(buf);
k *= m;
k ^= k >> 24;
k *= m;
hash *= m;
hash ^= k;
buf += 4;
len -= 4;
}
// Handle the last few bytes of the input array.
switch (len)
{
case 3:
hash ^= static_cast<unsigned char>(buf[2]) << 16;
[[gnu::fallthrough]];
case 2:
hash ^= static_cast<unsigned char>(buf[1]) << 8;
[[gnu::fallthrough]];
case 1:
hash ^= static_cast<unsigned char>(buf[0]);
hash *= m;
};
// Do a few final mixes of the hash.
hash ^= hash >> 13;
hash *= m;
hash ^= hash >> 15;
return hash;
}
MurmurHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 std::unordered_map<> hash used above.
Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
See also
Other hash table algorithms to try out and test: http://www.cse.yorku.ca/~oz/hash.html. Hash algorithms mentioned there:
djb2
sdbm
lose lose (K&R 1st edition)
First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.
A hash function with a good reputation is MurmurHash3.
Finally, regarding the size of the hash table, it really depends what kind of hash table you have in mind, especially, whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
I have tried these hash functions and got the following result. I have about 960^3 entries, each 64 bytes long, 64 chars in different order, hash value 32bit. Codes from here.
Hash function | collision rate | how many minutes to finish
==============================================================
MurmurHash3 | 6.?% | 4m15s
Jenkins One.. | 6.1% | 6m54s
Bob, 1st in link | 6.16% | 5m34s
SuperFastHash | 10% | 4m58s
bernstein | 20% | 14s only finish 1/20
one_at_a_time | 6.16% | 7m5s
crc | 6.16% | 7m56s
One strange thing is that almost all the hash functions have a 6% collision rate for my data.
One thing I've used with good results is the following (I don't know if it's been mentioned already, because I can't remember its name).
You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible. If you really run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
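A minimal sketch of this scheme follows. The names init_table and table_xor_hash are made up, and the small xorshift32 generator merely stands in for "your random number generator"; any source of random 32-bit values would do. One property worth keeping in mind: because XOR is commutative, this plain form gives the same value for any reordering of the same bytes, and a byte that occurs twice cancels itself out.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t T[256];  /* one random word per possible byte value */

/* Fill T from a small xorshift32 generator; seed must be nonzero. */
static void init_table(uint32_t seed) {
    uint32_t x = seed;
    for (int i = 0; i < 256; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        T[i] = x;
    }
}

/* XOR together the table entries selected by each byte of the key. */
static uint32_t table_xor_hash(const uint8_t *key, size_t len) {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h ^= T[key[i]];
    return h;
}
```

Re-running init_table with a different seed corresponds to drawing the "fresh batch of random numbers" mentioned above.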

Efficient algorithm for finding a byte in a bit array

Given a bytearray uint8_t data[N] what is an efficient method to find a byte uint8_t search within it even if search is not octet aligned? i.e. the first three bits of search could be in data[i] and the next 5 bits in data[i+1].
My current method involves creating a bool get_bit(const uint8_t* src, struct internal_state* state) function (struct internal_state contains a mask that is bitshifted right, &ed with src and returned, maintaining size_t src_index < size_t src_len) , leftshifting the returned bits into a uint8_t my_register and comparing it with search every time, and using state->src_index and state->src_mask to get the position of the matched byte.
Is there a better method for this?
If you're searching an eight bit pattern within a large array you can implement a sliding window over 16 bit values to check if the searched pattern is part of the two bytes forming that 16 bit value.
To be portable you have to take care of endianness issues which is done by my implementation by building the 16 bit value to search for the pattern manually. The high byte is always the currently iterated byte and the low byte is the following byte. If you do a simple conversion like value = *((unsigned short *)pData) you will run into trouble on x86 processors...
Once value, cmp and mask are set up, cmp and mask are shifted. If the pattern was not found within the high byte, the loop continues by checking the next byte as the start byte.
Here is my implementation including some debug printouts (the function returns the bit position or -1 if pattern was not found):
int findPattern(unsigned char *data, int size, unsigned char pattern)
{
int result = -1;
unsigned char *pData;
unsigned char *pEnd;
unsigned short value;
unsigned short mask;
unsigned short cmp;
int tmpResult;
if ((data != NULL) && (size > 0))
{
pData = data;
pEnd = data + size;
while ((pData < pEnd) && (result == -1))
{
if ((pData + 1) < pEnd) /* still at least two bytes to check? */
{
printf("\n\npData = {%02x, %02x, ...};\n", pData[0], pData[1]); /* only read pData[1] when it exists */
tmpResult = (int)(pData - data) * 8; /* calculate bit offset according to current byte */
/* avoid endianness troubles by "manually" building value! */
value = *pData << 8;
pData++;
value += *pData;
/* create a sliding window to check if search pattern is within value */
cmp = pattern << 8;
mask = 0xFF00;
while (mask > 0x00FF) /* the low byte is checked within next iteration! */
{
printf("cmp = %04x, mask = %04x, tmpResult = %d\n", cmp, mask, tmpResult);
if ((value & mask) == cmp)
{
result = tmpResult;
break;
}
tmpResult++; /* count bits! */
mask >>= 1;
cmp >>= 1;
}
}
else
{
/* only one chance left if there is only one byte left to check! */
if (*pData == pattern)
{
result = (int)(pData - data) * 8;
}
pData++;
}
}
}
return (result);
}
I don't think you can do much better than this in C:
/*
* Searches for the 8-bit pattern represented by 'needle' in the bit array
* represented by 'haystack'.
*
* Returns the index *in bits* of the first appearance of 'needle', or
* -1 if 'needle' is not found.
*/
int search(uint8_t needle, int num_bytes, uint8_t haystack[num_bytes]) {
if (num_bytes > 0) {
uint16_t window = haystack[0];
if (window == needle) return 0;
for (int i = 1; i < num_bytes; i += 1) {
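window = (uint16_t)((window << 8) | haystack[i]); /* parentheses needed: + binds tighter than << */
/* Candidate for unrolling: */
for (int j = 7; j >= 0; j -= 1) {
if (((window >> j) & 0xff) == needle) { /* parentheses needed: == binds tighter than & */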
return 8 * i - j;
}
}
}
}
return -1;
}
The main idea is to handle the 87.5% of cases that cross the boundary between consecutive bytes by pairing bytes in a wider data type (uint16_t in this case). You could adjust it to use an even wider data type, but I'm not sure that would gain anything.
What you cannot safely or easily do is anything involving casting part or all of your array to a wider integer type via a pointer (i.e. (uint16_t *)&haystack[i]). You cannot be assured of proper alignment for such a cast, nor of the byte order with which the result might be interpreted.
I don't know if it would be better, but I would use a sliding window.
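/* Returns index of first bit of first sequence occurrence or -1 if sequence is not found. */
/* Bits are consumed LSB-first within each byte. */
long find_byte(const uint8_t *data, size_t length, uint8_t search) {
size_t counter = 0;
unsigned feeder = 8; /* number of valid bits currently held in window */
uint32_t window;
if (length == 0) return -1;
window = data[0];
while (search ^ (window & 0xff)) {
window >>= 1;
feeder--;
if (feeder < 8) {
counter++;
if (counter >= length) {
feeder = 0;
break;
}
window |= (uint32_t)data[counter] << feeder;
feeder += 8;
}
}
return (feeder > 0) ? (long)((counter + 1) * 8 - feeder) : -1;
}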
Also with some alterations you can use this method to search for arbitrary length (1 to 64-array_element_size_in_bits) bits sequence.
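As an illustration of searching for a bit sequence of arbitrary length, here is a sketch that is not the answerer's code but a simpler bit-at-a-time shift register in the same spirit. The name find_bits is made up, the pattern length is assumed to be 1 to 32 bits, and, unlike the snippet above, bits are read MSB-first within each byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: find the first occurrence of the low `nbits` bits of
 * `pattern` (1 <= nbits <= 32) in `data`, reading bits MSB-first.
 * Returns the bit index of the match, or -1 if there is none. */
static long find_bits(const uint8_t *data, size_t len, uint32_t pattern, unsigned nbits) {
    if (nbits == 0 || nbits > 32) return -1;
    uint32_t mask = (nbits == 32) ? 0xFFFFFFFFu : ((1u << nbits) - 1u);
    uint32_t window = 0;
    size_t total_bits = len * 8;
    pattern &= mask;
    for (size_t bit = 0; bit < total_bits; bit++) {
        unsigned b = (data[bit / 8] >> (7 - bit % 8)) & 1u;  /* next input bit */
        window = ((window << 1) | b) & mask;                 /* keep last nbits */
        if (bit + 1 >= nbits && window == pattern)
            return (long)(bit + 1 - nbits);
    }
    return -1;
}
```

For the two bytes 0x0F, 0xF0 and the 8-bit pattern 0xFF this returns 4, since the run of ones starts at the fifth bit of the first byte.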
If AVX2 is acceptable (with earlier versions it didn't work out so well, but you can still do something there), you can search in a lot of places at the same time. I couldn't test this on my machine (only compile) so the following is more to give to you an idea of how it could be approached than copy&paste code, so I'll try to explain it rather than just code-dump.
The main idea is to read an uint64_t, shift it right by all values that make sense (0 through 7), then for each of those 8 new uint64_t's, test whether the byte is in there. Small complication: for the uint64_t's shifted by more than 0, the highest position should not be counted since it has zeroes shifted into it that might not be in the actual data. Once this is done, the next uint64_t should be read at an offset of 7 from the current one, otherwise there is a boundary that is not checked across. That's fine though, unaligned loads aren't so bad anymore, especially if they're not wide.
So now for some (untested, and incomplete, see below) code,
__m256i needle = _mm256_set1_epi8(find);
size_t i;
for (i = 0; i < n - 6; i += 7) {
// unaligned load here, but that's OK
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
// in the qword right-shifted by 0, all positions are valid
// otherwise, the top position corresponds to an incomplete byte
uint32_t lowmask = 0x7f7f7fffu & _mm256_movemask_epi8(low);
uint32_t highmask = 0x7f7f7f7fu & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
// zero-based index of the lowest set bit (__builtin_ffsl would give a one-based index)
int bitindex = __builtin_ctzll(mask);
// the bit-index and byte-index are swapped
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}
The funny "bit-index and byte-index are swapped" thing is because searching within a qword is done byte by byte and the results of those comparisons end up in 8 adjacent bits, while the search for "shifted by 1" ends up in the next 8 bits and so on. So in the resulting masks, the index of the byte that contains the 1 is a bit-offset, but the bit-index within that byte is actually the byte-offset, for example 0x8000 would correspond to finding the byte at the 7th byte of the qword that was right-shifted by 1, so the actual index is 8*7+1.
There is also the issue of the "tail", the part of the data left over when all blocks of 7 bytes have been processed. It can be done much the same way, but now more positions contain bogus bytes. Now n - i bytes are left over, so the mask has to have n - i bits set in the lowest byte, and one fewer for all other bytes (for the same reason as earlier, the other positions have zeroes shifted in). Also, if there is exactly 1 byte "left", it isn't really left because it would have been tested already, but that doesn't really matter. I'll assume the data is sufficiently padded that accessing out of bounds doesn't matter. Here it is, untested:
if (i < n - 1) {
// make n-i-1 bits, then copy them to every byte
uint32_t validh = ((1u << (n - i - 1)) - 1) * 0x01010101;
// the lowest position has an extra valid bit, set lowest zero
uint32_t validl = (validh + 1) | validh;
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
uint32_t lowmask = validl & _mm256_movemask_epi8(low);
uint32_t highmask = validh & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
// zero-based index of the lowest set bit (__builtin_ffsl would give a one-based index)
int bitindex = __builtin_ctzll(mask);
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}
If you are searching a large amount of memory and can afford an expensive setup, another approach is to use a 64K lookup table. For each possible 16-bit value, the table stores a byte containing the bit shift offset at which the matching octet occurs (+1, so 0 can indicate no match). You can initialize it like this:
uint8_t g_pLookupTable[65536]; // static array: a malloc() call is not a valid file-scope initializer in C
void initLUT(uint8_t octet)
{
memset(g_pLookupTable, 0, 65536); // zero out
for(int i = 0; i < 65536; i++)
{
for(int j = 7; j >= 0; j--)
{
if(((i >> j) & 255) == octet)
{
g_pLookupTable[i] = j + 1;
break;
}
}
}
}
Note that the case where the value is shifted 8 bits is not included (the reason will be obvious in a minute).
Then you can scan through your array of bytes like this:
int findByteMatch(uint8_t* pArray, uint8_t octet, int length)
{
if(length > 0) // pArray[0] is read below, so the array must not be empty
{
uint16_t index = (uint16_t)pArray[0];
if(index == octet)
return 0;
for(int bit, i = 1; i < length; i++)
{
index = (index << 8) | pArray[i];
if(bit = g_pLookupTable[index])
return (i * 8) - (bit - 1);
}
}
return -1;
}
Further optimization:
Read 32 or however many bits at a time from pArray into a uint32_t and then shift and AND each to get byte one at a time, OR with index and test, before reading another 4.
Pack the LUT into 32K by storing a nybble for each index. This might help it squeeze into the cache on some systems.
It will depend on your memory architecture whether this is faster than an unrolled loop that doesn't use a lookup table.

Find a unique bit in a collection of numbers

Best way to explain this is a demonstration.
There is a collection of numbers. They may be repeated, so:
1110, 0100, 0100, 0010, 0110 ...
The number I am looking for is the one that has a bit set, that does not appear in any of the others. The result is the number (in this case 1 - the first number) and the bit position (or the mask is fine) so 1000 (4th bit). There may be more than one solution, but for this purpose it may be greedy.
I can do it by iteration... For each number N, it is:
N & ~(other numbers OR'd together)
But the nature of bits is that there is always a better method if you think outside the box. For instance numbers that appear more than once will never have a unique bit, and have no effect on ORing.
You just need to record whether each bit has been seen once or more and if it's been seen twice or more. Unique bits are those that have been seen once or more and not twice or more. This can be done efficiently using bitwise operations.
count1 = 0
count2 = 0
for n in numbers:
    count2 |= count1 & n
    count1 |= n
for n in numbers:
    if n & count1 & ~count2:
        return n
If you don't want to iterate over the numbers twice you can keep track of some number that you've seen that contains each bit. This might be a good optimisation if the numbers are stored on disk and so streaming them requires disk-access, but of course it makes the code a bit more complex.
examples = [-1] * wordsize
count1 = 0
count2 = 0
for n in numbers:
    if n & ~count1:
        for i in xrange(wordsize):
            if n & (1 << i):
                examples[i] = n
    count2 |= count1 & n
    count1 |= n
for i in xrange(wordsize):
    if (count1 & ~count2) & (1 << i):
        return examples[i]
You might use tricks to extract the bit indexes more efficiently in the loop that sets examples, but since this code is executed at most 'wordsize' times, it's probably not worth it.
This code translates easily to C... I just wrote in Python for clarity.
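As a rough illustration of that translation, here is a sketch of the two-pass version in C. The function name, the 32-bit word size and the out-parameter for the mask are assumptions made for the example, not part of the original answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first number that has a bit
   set in no other number, or -1 if there is none. If mask_out is not NULL
   it receives the unique bit(s) of that number. */
long find_unique_bit(const uint32_t *nums, size_t n, uint32_t *mask_out)
{
    uint32_t seen_once = 0;   /* bits seen at least once  */
    uint32_t seen_twice = 0;  /* bits seen at least twice */

    for (size_t i = 0; i < n; i++) {
        seen_twice |= seen_once & nums[i];
        seen_once |= nums[i];
    }

    uint32_t unique = seen_once & ~seen_twice;
    for (size_t i = 0; i < n; i++) {
        if (nums[i] & unique) {
            if (mask_out)
                *mask_out = nums[i] & unique;
            return (long)i;
        }
    }
    return -1;
}
```

With the numbers from the question (1110, 0100, 0100, 0010, 0110) it returns index 0 and the mask 1000.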
(long version of what I wrote in a comment)
By counting, for every k, the number of times that the bit at index k is one (there is a trick to do this faster than naively, but it's still O(n)), you get a list of bitlength counters in which a count of 1 means that bit was set in exactly one number. The index of that counter (found in O(1) because you have a fixed number of bits) is therefore the bit-position you want. To find the number with that bit set, just iterate over all the numbers again and check whether each one has that bit set (O(n) again); if it does, it's the number you want.
In total: O(n), versus the O(n^2) of checking every number against all others.
This method uses less than 2 passes (but alters the input array)
#include <stdio.h>
unsigned array[] = { 0,1,2,3,4,5,6,7,8,16,17 };
#define COUNTOF(a) (sizeof(a)/sizeof(a)[0])
void swap(unsigned *a, unsigned *b)
{
unsigned tmp;
tmp = *a;
*a = *b;
*b = tmp;
}
int main(void)
{
unsigned idx,bot,totmask,dupmask;
/* First pass: move all elements that introduce new bits to the front of the array.
** totmask is a mask of bits that occur once or more
** dupmask is a mask of bits that occur twice or more
*/
totmask=dupmask=0;
for (idx=bot=0; idx < COUNTOF(array); idx++) {
dupmask |= array[idx] & totmask;
if (array[idx] & ~totmask) goto add;
continue;
add:
totmask |= array[idx];
if (bot != idx) swap(array+bot,array+idx);
bot++;
}
fprintf(stderr, "Bot=%u, totmask=%u, dupmask=%u\n", bot, totmask, dupmask );
/* Second pass: reduce list of candidates by checking if
** they consist of *only* duplicate bits */
for (idx=bot; idx-- > 0 ; ) {
if ((array[idx] & dupmask) == array[idx]) goto del;
continue;
del:
if (--bot != idx) swap(array+bot,array+idx);
}
fprintf(stdout, "Results[%u]:\n", bot );
for (idx=0; idx < bot; idx++) {
fprintf(stdout, "[%u]: %x\n" ,idx, array[idx] );
}
return 0;
}
UPDATE 2011-11-28
Another version, that does not alter the original array. The (temporary) results are kept in a separate array.
#include <stdio.h>
#include <limits.h>
#include <assert.h>
unsigned array[] = { 0,1,2,3,4,5,6,7,8,16,17,32,33,64,96,128,130 };
#define COUNTOF(a) (sizeof(a)/sizeof(a)[0])
void swap(unsigned *a, unsigned *b)
{
unsigned tmp;
tmp = *a, *a = *b, *b = tmp;
}
int main(void)
{
unsigned idx,nfound,totmask,dupmask;
unsigned found[sizeof array[0] *CHAR_BIT ];
/* First pass: save all elements that introduce new bits in the found[] array.
** totmask is a mask of bits that occur once or more
** dupmask is a mask of bits that occur twice or more
*/
totmask=dupmask=0;
for (idx=nfound=0; idx < COUNTOF(array); idx++) {
dupmask |= array[idx] & totmask;
if (array[idx] & ~totmask) goto add;
continue;
add:
totmask |= array[idx];
found[nfound++] = array[idx];
assert(nfound <= COUNTOF(found) );
}
fprintf(stderr, "Bot=%u, totmask=%u, dupmask=%u\n", nfound, totmask, dupmask );
/* Second pass: reduce list of candidates by checking if
** they consist of *only* duplicate bits */
for (idx=nfound; idx-- > 0 ; ) {
if ((found[idx] & dupmask) == found[idx]) goto del;
continue;
del:
if (--nfound != idx) swap(found+nfound,found+idx);
}
fprintf(stdout, "Results[%u]:\n", nfound );
for (idx=0; idx < nfound; idx++) {
fprintf(stdout, "[%u]: %x\n" ,idx, found[idx] );
}
return 0;
}
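As pointed out, the following does not work:
You can XOR together the numbers, the result will give you the mask.
And then you have to find the first number which doesn't give 0 for the N & mask expression.
(XOR only cancels bits that occur an even number of times, so a bit that is set in three of the numbers stays in the mask even though it is not unique.)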

hash function for string

I'm working on a hash table in C and I'm testing hash functions for strings.
The first function I tried adds up the ASCII codes and takes the result modulo 100 (% 100), but I got poor results with the first test data: 40 collisions for 130 words.
The final input data will contain 8000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of the word in a .txt file.
Which is the best algorithm for hashing strings?
And how do I determine the size of the hash table?
I've had nice results with djb2 by Dan Bernstein.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
int hash(char const *input) {
int result = 0x55555555;
while (*input) {
result ^= *input++;
result = rol(result, 5);
}
return result;
}
Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
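To make the two sizing options concrete, here is a small sketch of how a 32-bit hash could be reduced to a slot index in each case. The function names are invented for the example; nothing here comes from a particular hash table library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Power-of-two table size: keep the low bits with a single mask. */
size_t index_pow2(uint32_t hash, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0); /* must be a power of two */
    return hash & (size - 1);
}

/* Any other size (for example a prime): use the modulo operator. */
size_t index_mod(uint32_t hash, size_t size)
{
    assert(size != 0);
    return hash % size;
}
```

For the 8000-word dictionary in the question, 16384 (a power of two) or 10007 (a prime) would both fit this advice better than 10000.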
I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran several small hash functions on a list of 466K English words to see the number of collisions for each:
Hash function | Collisions | Time (words) | Time (file)
=================================================================
CRC32 | 23 (0.005%) | 112 ms | 38 ms
MurmurOAAT | 26 (0.006%) | 86 ms | 10 ms
FNV hash | 32 (0.007%) | 87 ms | 7 ms
Jenkins OAAT | 36 (0.008%) | 90 ms | 8 ms
DJB2 hash | 344 (0.074%) | 87 ms | 5 ms
K&R V2 | 356 (0.076%) | 86 ms | 5 ms
Coffin | 763 (0.164%) | 86 ms | 4 ms
x17 hash | 2242 (0.481%) | 87 ms | 7 ms
-----------------------------------------------------------------
MurmurHash3_x86_32 | 19 (0.004%) | 90 ms | 3 ms
I included time for both: hashing all words individually and hashing the entire file of all English words once. I also included a more complex MurmurHash3_x86_32 into my test for reference.
Conclusion:
There is almost no point in using the popular DJB2 hash function for strings on Intel x86-64 (or AArch64, for that matter): it has many more collisions than similar functions (MurmurOAAT, FNV and Jenkins OAAT) while having very similar throughput. Bernstein's DJB2 performs especially badly on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.
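The example collisions are easy to check by hand. For three-character strings djb2 reduces to a fixed constant plus 1089*a + 33*b + c, and for instance 1089*76 + 33*105 + 122 ("Liz") and 1089*77 + 33*72 + 122 ("MHz") both equal 86351. A minimal sketch to confirm this, using a 32-bit djb2 like the one in the test code (the short function name is just for this example):

```c
#include <stdint.h>

/* djb2: hash = hash * 33 + c, starting from 5381. */
uint32_t djb2(const char *s)
{
    uint32_t hash = 5381;
    while (*s)
        hash = hash * 33 + (uint8_t)*s++;
    return hash;
}
```

Comparing djb2("Liz") with djb2("MHz"), djb2("Bon") with djb2("COM"), and djb2("Rey") with djb2("SEX") shows equal values within each pair.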
Test code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#define MAXLINE 2048
#define SEED 0x12345678
uint32_t DJB2_hash(const uint8_t *str)
{
uint32_t hash = 5381;
uint8_t c;
while ((c = *str++))
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
uint32_t FNV(const void* key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
h ^= 2166136261UL;
const uint8_t* data = (const uint8_t*)key;
for(int i = 0; i < len; i++)
{
h ^= data[i];
h *= 16777619;
}
return h;
}
uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
// One-byte-at-a-time hash based on Murmur's mix
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
for (; *str; ++str) {
h ^= *str;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint32_t KR_v2_hash(const char *s)
{
// Source: https://stackoverflow.com/a/45641002/5407270
uint32_t hashval = 0;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval;
}
uint32_t Jenkins_one_at_a_time_hash(const char *str, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += str[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
uint32_t crc32b(const uint8_t *str) {
// Source: https://stackoverflow.com/a/21001712
unsigned int byte, crc, mask;
int i = 0, j;
crc = 0xFFFFFFFF;
while (str[i] != 0) {
byte = str[i];
crc = crc ^ byte;
for (j = 7; j >= 0; j--) {
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
i = i + 1;
}
return ~crc;
}
static inline uint32_t _rotl32(uint32_t x, int32_t bits) // static: a plain C99 inline function may fail to link
{
return x<<bits | x>>(32-bits); // C idiom: will be optimized to a single operation
}
uint32_t Coffin_hash(char const *input) {
// Source: https://stackoverflow.com/a/7666668/5407270
uint32_t result = 0x55555555;
while (*input) {
result ^= *input++;
result = _rotl32(result, 5);
}
return result;
}
uint32_t x17(const void * key, int len, uint32_t h)
{
// Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
const uint8_t * data = (const uint8_t*)key;
for (int i = 0; i < len; ++i)
{
h = 17 * h + (data[i] - ' ');
}
return h ^ (h >> 16);
}
uint32_t apply_hash(int hash, const char* line)
{
switch (hash) {
case 1: return crc32b((const uint8_t*)line);
case 2: return MurmurOAAT_32(line, SEED);
case 3: return FNV(line, strlen(line), SEED);
case 4: return Jenkins_one_at_a_time_hash(line, strlen(line));
case 5: return DJB2_hash((const uint8_t*)line);
case 6: return KR_v2_hash(line);
case 7: return Coffin_hash(line);
case 8: return x17(line, strlen(line), SEED);
default: break;
}
return 0;
}
int main(int argc, char* argv[])
{
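// Read arguments
if (argc < 3) {
fprintf(stderr, "usage: %s <hash 1-8> <word file>\n", argv[0]);
return 1;
}
const int hash_choice = atoi(argv[1]);
char const* const fn = argv[2];
// Read file
FILE* f = fopen(fn, "r");
if (!f) {
perror(fn);
return 1;
}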
// Read file line by line, calculate hash
char line[MAXLINE];
while (fgets(line, sizeof(line), f)) {
line[strcspn(line, "\n")] = '\0'; // strip newline
uint32_t hash = apply_hash(hash_choice, line);
printf("%08x\n", hash);
}
fclose(f);
return 0;
}
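The program above only prints one hash per line; how the collisions in the table were counted is not shown. One plausible way, sketched below, is to sort the hashes and count every value that repeats an earlier one. Both the helper names and that definition of "collision" are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t values. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the hashes in place and returns n minus the number of distinct
   values, i.e. how many hashes duplicate an earlier one. */
size_t count_collisions(uint32_t *hashes, size_t n)
{
    size_t collisions = 0;
    qsort(hashes, n, sizeof hashes[0], cmp_u32);
    for (size_t i = 1; i < n; i++)
        if (hashes[i] == hashes[i - 1])
            collisions++;
    return collisions;
}
```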
P.S. A more comprehensive review of speed and quality of modern hash functions can be found in SMHasher repository of Reini Urban (rurban). Notice the "Quality problems" column in the table.
Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
There are a number of existing hashtable implementations for C, from the C standard library hcreate/hdestroy/hsearch, to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
djb2 has 317 collisions for this 466k english dictionary while MurmurHash has none for 64 bit hashes, and 21 for 32 bit hashes (around 25 is to be expected for 466k random 32 bit hashes).
My recommendation is using MurmurHash if available, it is very fast, because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste to your project I'd recommend using murmurs one-byte-at-a-time version:
uint32_t inline MurmurOAAT32 ( const char * key)
{
uint32_t h(3323198485ul);
for (;*key;++key) {
h ^= *key;
h *= 0x5bd1e995;
h ^= h >> 15;
}
return h;
}
uint64_t inline MurmurOAAT64 ( const char * key)
{
uint64_t h(525201411107845655ull);
for (;*key;++key) {
h ^= *key;
h *= 0x5bd1e9955bd1e995;
h ^= h >> 47;
}
return h;
}
The optimal size of a hash table is - in short - as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory we have available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster but at rapidly diminishing returns, making your hash table smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58... apparently.
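As a sketch of the "roughly 2x" rule, the helper below picks a table size for an expected element count. Rounding up to a power of two is an extra assumption made here so that a bit-mask can be used for indexing; the name is hypothetical and overflow for huge counts is not handled.

```c
#include <stddef.h>

/* Smallest power of two that is at least twice the expected element count. */
size_t table_size_for(size_t expected)
{
    size_t size = 1;
    while (size < 2 * expected)
        size *= 2;
    return size;
}
```

For the 8000 words in the question this gives 16384 slots, a load factor of about 0.49.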
djb2 is good
Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:
One of the K&R hashes is terrible, one is probably pretty good:
Apparently a terrible hash algorithm, as presented in K&R 1st edition. This is simply a summation of all bytes in the string (source):
unsigned long hash(unsigned char *str)
{
unsigned int hash = 0;
int c;
while (c = *str++)
hash += c;
return hash;
}
Probably a pretty decent hash algorithm, as presented in K&R version 2 (verified by me on pg. 144 of the book); NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long, or even better: uint32_t or uint64_t, instead of the simple unsigned (int). This is a simple algorithm which takes into account byte order of each byte in the string by doing this style of algorithm: hashvalue = new_byte + 31*hashvalue, for all bytes in the string:
unsigned hash(char *s)
{
unsigned hashval;
for (hashval = 0; *s != '\0'; s++)
hashval = *s + 31*hashval;
return hashval % HASHSIZE;
}
Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is because it does NOT take into consideration string character order, so hash("ab") would therefore return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which would (much better!) return two different values for those strings.
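The "ab" versus "ba" point can be checked directly. The sketch below restates both hashes under new names (hash_sum and hash_kr2 are made up for this example) and leaves out the % HASHSIZE step so the raw values are visible.

```c
/* K&R 1st edition style: plain sum of the bytes. */
unsigned hash_sum(const char *s)
{
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h;
}

/* K&R 2nd edition style: new_byte + 31 * hashval, without % HASHSIZE. */
unsigned hash_kr2(const char *s)
{
    unsigned h = 0;
    for (; *s != '\0'; s++)
        h = (unsigned char)*s + 31 * h;
    return h;
}
```

In ASCII, hash_sum gives 195 for both "ab" and "ba", while hash_kr2 gives 98 + 31*97 = 3105 for "ab" and 97 + 31*98 = 3135 for "ba".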
The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.
The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.
This is a partial answer to the question of what are the GCC C++11 hash functions used, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):
Code:
// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
const size_t m = 0x5bd1e995;
size_t hash = seed ^ len;
const char* buf = static_cast<const char*>(ptr);
// Mix 4 bytes at a time into the hash.
while (len >= 4)
{
size_t k = unaligned_load(buf);
k *= m;
k ^= k >> 24;
k *= m;
hash *= m;
hash ^= k;
buf += 4;
len -= 4;
}
// Handle the last few bytes of the input array.
switch (len)
{
case 3:
hash ^= static_cast<unsigned char>(buf[2]) << 16;
[[gnu::fallthrough]];
case 2:
hash ^= static_cast<unsigned char>(buf[1]) << 8;
[[gnu::fallthrough]];
case 1:
hash ^= static_cast<unsigned char>(buf[0]);
hash *= m;
};
// Do a few final mixes of the hash.
hash ^= hash >> 13;
hash *= m;
hash ^= hash >> 15;
return hash;
}
MurmurHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 std::unordered_map<> hash used above.
Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
See also
Other hash table algorithms to try out and test: http://www.cse.yorku.ca/~oz/hash.html. Hash algorithms mentioned there:
djb2
sdbm
lose lose (K&R 1st edition)
First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.
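To put a number on that, here is a sketch that computes what a uniformly random hash would be expected to produce. It assumes a "collision" means an insertion that lands in an already occupied slot, which may not be how the question counted them; the function name is hypothetical.

```c
#include <stddef.h>

/* Expected number of insertions that hit an occupied slot when n keys are
   hashed uniformly at random into m slots: n - m * (1 - (1 - 1/m)^n). */
double expected_collisions(size_t n, size_t m)
{
    double empty = 1.0; /* probability that a given slot is still empty */
    for (size_t i = 0; i < n; i++)
        empty *= 1.0 - 1.0 / (double)m;
    return (double)n - (double)m * (1.0 - empty);
}
```

For 130 words and 100 slots this comes out at roughly 57, so under that definition 40 collisions is no worse than chance.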
A hash function with a good reputation is MurmurHash3.
Finally, regarding the size of the hash table, it really depends what kind of hash table you have in mind, especially, whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
I have tried these hash functions and got the following result. I have about 960^3 entries, each 64 bytes long, 64 chars in different order, hash value 32bit. Codes from here.
Hash function | collision rate | how many minutes to finish
==============================================================
MurmurHash3 | 6.?% | 4m15s
Jenkins One.. | 6.1% | 6m54s
Bob, 1st in link | 6.16% | 5m34s
SuperFastHash | 10% | 4m58s
bernstein | 20% | 14s only finish 1/20
one_at_a_time | 6.16% | 7m5s
crc | 6.16% | 7m56s
One strange thing is that almost all the hash functions have a collision rate of about 6% for my data.
One thing I've used with good results is the following (I don't know if it's been mentioned already, because I can't remember its name).
You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible; if you really run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
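A minimal sketch of this scheme in C follows. The names, the use of rand() as the random source and the 32-bit width are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t T[256]; /* one random word per possible byte value */

/* Fill the table; call again with a new seed for a fresh batch of numbers. */
void table_hash_init(unsigned seed)
{
    srand(seed);
    for (int i = 0; i < 256; i++) {
        /* rand() may yield as few as 15 random bits, so combine three calls */
        T[i] = ((uint32_t)rand() << 20) ^ ((uint32_t)rand() << 10)
             ^ (uint32_t)rand();
    }
}

/* XOR together the table entries of all bytes in the key. */
uint32_t table_hash(const char *key)
{
    uint32_t h = 0;
    while (*key)
        h ^= T[(uint8_t)*key++];
    return h;
}
```

One caveat with this exact form: XOR is commutative and self-cancelling, so keys that are permutations of each other ("abc" and "cba") always collide, and a doubled character contributes nothing. Whether that matters depends on the key set.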

Efficient bitshifting an array of int?

To be on the same page, let's assume sizeof(int)=4 and sizeof(long)=8.
Given an array of integers, what would be an efficient method to logically bitshift the array to either the left or right?
I am contemplating an auxiliary variable such as a long, that will compute the bitshift for the first pair of elements (index 0 and 1) and set the first element (0). Continuing in this fashion, the bitshift for elements (index 1 and 2) will be computed, and then index 1 will be set.
I think this is actually a fairly efficient method, but there are drawbacks. I cannot bitshift greater than 32 bits. I think using multiple auxiliary variables would work, but I'm envisioning recursion somewhere along the line.
There's no need to use a long as an intermediary. If you're shifting left, start with the highest order int, shifting right start at the lowest. Add in the carry from the adjacent element before you modify it.
void ShiftLeftByOne(int * arr, int len)
{
int i;
for (i = 0; i < len - 1; ++i)
{
arr[i] = (arr[i] << 1) | ((arr[i+1] >> 31) & 1);
}
arr[len-1] = arr[len-1] << 1;
}
This technique can be extended to do a shift of more than 1 bit. If you're doing more than 32 bits, you take the bit count mod 32 and shift by that, while moving the result further along in the array. For example, to shift left by 33 bits, the code will look nearly the same:
void ShiftLeftBy33(int * arr, int len)
{
int i;
for (i = 0; i < len - 2; ++i)
{
arr[i] = (arr[i+1] << 1) | ((arr[i+2] >> 31) & 1);
}
arr[len-2] = arr[len-1] << 1;
arr[len-1] = 0;
}
For anyone else, this is a more generic version of Mark Ransom's answer above for any number of bits and any type of array:
/* This function shifts an array of byte of size len by shft number of
bits to the left. Assumes array is big endian. */
#define ARR_TYPE uint8_t
void ShiftLeft(ARR_TYPE * arr_out, ARR_TYPE * arr_in, int arr_len, int shft)
{
const int int_n_bits = sizeof(ARR_TYPE) * 8;
int msb_shifts = shft % int_n_bits;
int lsb_shifts = int_n_bits - msb_shifts;
int byte_shft = shft / int_n_bits;
int last_byt = arr_len - byte_shft - 1;
for (int i = 0; i < arr_len; i++){
if (i <= last_byt){
int msb_idx = i + byte_shft;
arr_out[i] = arr_in[msb_idx] << msb_shifts;
if (i != last_byt)
arr_out[i] |= arr_in[msb_idx + 1] >> lsb_shifts;
}
else arr_out[i] = 0;
}
}
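For 32-bit elements, the same idea can be written in place with unsigned arithmetic, which avoids the undefined behaviour of left-shifting negative ints and of shifting by the full word width. This is a sketch under the same convention as the answers above (arr[0] is the most significant word); the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Shifts the whole array left by `shift` bits in place.
   arr[0] is the most significant word; vacated bits are filled with zeros. */
void shift_left_bits(uint32_t *arr, size_t len, unsigned shift)
{
    size_t words = shift / 32;  /* whole-word part of the shift */
    unsigned bits = shift % 32; /* remaining bit part */

    for (size_t i = 0; i < len; i++) {
        size_t src = i + words;
        uint32_t hi = (src < len) ? arr[src] : 0;
        uint32_t lo = (src + 1 < len) ? arr[src + 1] : 0;
        /* bits == 0 is handled separately: a shift by 32 would be undefined */
        arr[i] = bits ? (hi << bits) | (lo >> (32 - bits)) : hi;
    }
}
```

Working in place is safe here because element i only reads from indices i and above, which have not been overwritten yet.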
Take a look at the BigInteger implementation in Java, which internally stores data as an array of ints. Specifically, you can check out the function shiftLeft(). The syntax is close to that of C, so it wouldn't be too difficult to write a pair of functions like those. Keep in mind, too, that when it comes to bit shifting you can take advantage of unsigned types in C. In Java, by contrast, safely shifting data without messing with the sign usually requires bigger types to hold the data (i.e. an int to shift a short, a long to shift an int, ...).

Resources