Expand hash table without rehash? - c

I am looking for a hash table data structure that does not require rehashing for expansion and shrinking.
Rehashing is a CPU-consuming effort. I was wondering if it is possible to design a hash table data structure in a way that does not require rehashing at all. Have you heard of such a data structure before?

That depends on what you call "rehash":
If you simply mean that the table-level rehash shouldn't reapply the hash function to each key during resizing, then that's easy with most libraries: e.g. wrap the key and its raw (pre-modulo-table-size) hash value together a la struct X { size_t hash_; Key key_; };, supply the hashtable library with a hash function that returns hash_, but a comparison function that compares key_s (depending on the complexity of key_ comparison, you may be able to use hash_ to optimise, e.g. lhs.hash_ == rhs.hash_ && lhs.key_ == rhs.key_).
This will help most if the hashing of keys was particularly time consuming (e.g. cryptographic strength on longish keys). For very simple hashing (e.g. passthrough of ints) it'll slow you down and waste memory.
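For concreteness, a minimal C sketch of that cached-hash wrapper (the names and the fixed-size key are illustrative, not from any particular library):
#include <stddef.h>
#include <string.h>

/* Key stored together with its full (pre-modulo) hash, computed once. */
struct hashed_key {
    size_t hash;   /* cached raw hash value */
    char key[32];  /* illustrative fixed-size key */
};

/* "Hash function" handed to the table: just return the cached value. */
size_t entry_hash(const struct hashed_key *e) { return e->hash; }

/* Comparison: check the cheap cached hash first, then the real key. */
int entry_equal(const struct hashed_key *a, const struct hashed_key *b)
{
    return a->hash == b->hash && memcmp(a->key, b->key, sizeof a->key) == 0;
}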
If you mean the table-level operation of increasing or decreasing memory storage and reindexing all stored values, then yes - it can be avoided - but to do so you have to fundamentally change the way the hash table works, and the normal performance profile. Discussed below.
As just one example, you could leverage a more typical hashtable implementation (let's call it H) by having your custom hashtable (C) have an H** p that - up to an initial size limit - will have p[0] be the only instance of H, and simply ferry operations/results through. If the table grows beyond that, you keep p[0] referencing the existing H, while creating a second H hashtable to be tracked by p[1]. Then things start getting dicey:
to search or erase in C, your implementation needs to search p[1] then p[0] and report any match from either
to insert a new value in C, your implementation must confirm it's not in p[0], then insert to p[1]
with each insert (and potentially even for other operations), it could optionally migrate any matching entry - or an arbitrary p[0] entry - to p[1], so gradually p[0] empties; you can easily guarantee p[0] will be empty before p[1] is full enough that a still-larger table is needed. When p[0] is empty you may want to p[0] = p[1]; p[1] = NULL; to keep the simple mental model of what's where - lots of options. (A sketch of this two-table scheme follows.)
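For illustration, a rough C sketch of that two-table wrapper; the underlying table type H and the h_* functions (h_create, h_free, h_find, h_insert, h_remove, h_pop_any, h_count) are hypothetical stand-ins for whatever implementation you wrap:
#include <stdbool.h>
#include <stddef.h>

typedef struct H H; /* some existing hashtable implementation */
/* assumed API of the wrapped table: */
H *h_create(size_t capacity);
void h_free(H *t);
bool h_find(H *t, int key, int *value_out);
void h_insert(H *t, int key, int value);
bool h_remove(H *t, int key);
bool h_pop_any(H *t, int *key_out, int *value_out); /* remove an arbitrary entry */
size_t h_count(H *t);

typedef struct { H *p[2]; } C; /* p[0] = old table, p[1] = larger new table (or NULL) */

bool c_find(C *c, int key, int *value_out)
{
    if (c->p[1] && h_find(c->p[1], key, value_out)) return true;
    return h_find(c->p[0], key, value_out);
}

void c_insert(C *c, int key, int value)
{
    if (!c->p[1]) { h_insert(c->p[0], key, value); return; }
    h_remove(c->p[0], key); /* ensure the old table can't shadow the new value */
    h_insert(c->p[1], key, value);
    /* migrate one arbitrary old entry per insert, spreading the rehash cost */
    int k, v;
    if (h_pop_any(c->p[0], &k, &v))
        h_insert(c->p[1], k, v);
    if (h_count(c->p[0]) == 0) { /* old table drained: collapse back to one */
        h_free(c->p[0]);
        c->p[0] = c->p[1];
        c->p[1] = NULL;
    }
}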
Some existing hash table implementations are very efficient at iterating over elements (e.g. GNU C++ std::unordered_set), as there's a singly linked list of all the values, and the hash table is really only a collection of pointers (in C++ parlance, iterators) into the linked list. This can mean that if your utilisation falls below some threshold (e.g. 10% load factor) for your only/larger hash table, you know you can very efficiently migrate the remaining elements to a smaller table.
These kinds of tricks are used by some hash tables to avoid a sudden heavy cost during rehashing, and instead spread the pain more evenly over a number of subsequent operations, avoiding a possibly nasty spike in latency.
Some of the implementation options only make sense for either an open or a closed hashing implementation, or are only useful when the keys and/or values are small or large and depending on whether the table embeds them or points to them. Best way to learn about it is to code....

It depends on what you want to avoid. Rehashing implies recomputing the hash values. You can avoid that by storing the hash values in the hash structures. Redispatching the entries into the reallocated hashtable may be less expensive (typically a single modulo or masking operation per entry) and is hardly avoidable for simple hashtable implementations.
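For illustration, a redistribution pass over an open-addressed table whose entries carry a cached hash needs no key re-hashing at all, just one mask per entry (the power-of-2 table size and the entry layout are assumptions of this sketch):
#include <stddef.h>

struct entry { size_t cached_hash; int key; int value; };

/* Move every entry pointer from old_slots (old_size slots) into new_slots
   (new_size slots, a power of two, pre-zeroed), reusing the hash cached
   at first insertion instead of re-hashing the key. */
void redispatch(struct entry **old_slots, size_t old_size,
                struct entry **new_slots, size_t new_size)
{
    for (size_t i = 0; i < old_size; i++) {
        struct entry *e = old_slots[i];
        if (e == NULL) continue;                    /* empty slot */
        size_t j = e->cached_hash & (new_size - 1); /* mask instead of modulo */
        while (new_slots[j] != NULL)                /* linear probe */
            j = (j + 1) & (new_size - 1);
        new_slots[j] = e;
    }
}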

Assuming you actually do need this... it is possible. Here I'll give a trivial example you can build on.
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Basic types we deal with
// (note: key_t may clash with the POSIX type of the same name)
typedef uint32_t key_t;
typedef void * value_t;

typedef struct
{
    key_t key;
    value_t value;
} hash_table_entry_t;

typedef struct
{
    uint32_t initialSize;
    uint32_t size;  // current max entries
    uint32_t count; // current filled entries
    hash_table_entry_t *entries;
} hash_table_t;

// Hash function depends on the size of the table
key_t hash(value_t value, uint32_t size)
{
    // Simple hash function that just takes the value modulo the table size
    return (key_t)(uintptr_t)value % size;
}

void init(hash_table_t *pTable, uint32_t initialSize)
{
    pTable->initialSize = initialSize;
    pTable->size = initialSize;
    pTable->count = 0;
    pTable->entries = malloc(pTable->size * sizeof(*pTable->entries));
    /// #todo handle null return
    // Set all bytes to ~0 so every key reads as the "invalid" marker.
    memset(pTable->entries, ~0, pTable->size * sizeof(*pTable->entries));
}
void insert(hash_table_t *pTable, value_t val)
{
    key_t key = hash(val, pTable->size);
    // The table is kept at most half full, so an empty slot always exists.
    for (key_t i = key; ; i = (i + 1) % pTable->size)
    {
        if (pTable->entries[i].key == (key_t)~0)
        {
            pTable->entries[i].key = key;
            pTable->entries[i].value = val;
            pTable->count++;
            break;
        }
    }
    // Expand when 50% full
    if (pTable->count > pTable->size / 2)
    {
        uint32_t oldSize = pTable->size;
        pTable->size *= 2;
        pTable->entries = realloc(pTable->entries, pTable->size * sizeof(*pTable->entries));
        /// #todo handle null return
        // Mark only the newly added half as invalid.
        memset(pTable->entries + oldSize, ~0, oldSize * sizeof(*pTable->entries));
    }
}
_Bool contains(hash_table_t *pTable, value_t val)
{
    // Try the current size first, then every size this table has ever had.
    for (uint32_t sizeToTry = pTable->size; sizeToTry >= pTable->initialSize; sizeToTry /= 2)
    {
        key_t key = hash(val, sizeToTry);
        key_t i = key;
        // Probe modulo sizeToTry, matching how entries were probed when the
        // table had that size; stop on an empty slot or after a full cycle.
        do
        {
            if (pTable->entries[i].key == (key_t)~0)
                break;
            if (pTable->entries[i].key == key && pTable->entries[i].value == val)
                return true;
            i = (i + 1) % sizeToTry;
        } while (i != key);
    }
    // Only report failure if the value was found under none of the sizes.
    return false;
}
The idea is that the hash function depends on the size of the table. When you change the size of the table, you don't rehash current entries. You add new ones with the new hash function. When reading the entries, you try all the hash functions that have ever been used on this table.
This way, get()/contains() and similar operations take longer the more times you expanded your table, but you don't have the huge spike of rehashing. I can imagine some systems where this would be a requirement.
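A quick usage sketch of the above (error handling still omitted):
int main(void)
{
    hash_table_t table;
    init(&table, 8);

    insert(&table, (value_t)(uintptr_t)0x1234);
    insert(&table, (value_t)(uintptr_t)0x5678);

    _Bool found = contains(&table, (value_t)(uintptr_t)0x1234);   // true
    _Bool missing = contains(&table, (value_t)(uintptr_t)0x9999); // false

    free(table.entries);
    return (found && !missing) ? 0 : 1;
}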

Related

Most efficient way of manipulating a very large text database of SHA256 hashes?

I have to frequently search hashes in a large (up to 1G) CSV database of the format
sha256_hash, md5_hash, sha1_hash, field1, field2, field3 etc
in C. This needs to be very fast and memory usage is a non-issue (32G minimum). I found this, which is very close to what I had in mind: load the data into RAM, one-time order the database by hash, index by the first 'n' bytes of the hash and then search through smaller sublists. But the thread above doesn't seem to address a question I have in mind. Since I'm not a cryptography guy, I was wondering about the distribution of hashes and whether or not it could be used to make searching the sublists even faster. Any suggestions about this or my general approach?
Yes, a bloom filter can be used to kick out 'definite negatives' early, by using the distribution of the hash bits.
http://en.wikipedia.org/wiki/Bloom_filter
To create a bloom filter for a given bucket, logical-OR all the hashes together to create your filter. Then logical-AND the filter with your target hash. If the result is < your target hash (equivalently, result XOR target hash != 0), that bucket definitely does not contain the target hash, and you can skip searching it; but if the result == target hash, the bucket MAY contain your target hash, and you need to continue searching it to be sure. The bloom filter can be cached and updated cheaply when new hashes are added, but has to be recomputed when hashes are removed, so all that remains for the search is the AND and compare operations, which are very cheap and will reduce your O(N) operation to O(1) time in the best-case scenario.
Care has to be taken with regard to bucket size so that filters of meaningful value are produced, because a filter with all bits set is of no value to anyone.
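A sketch of that per-bucket signature test, here on a 64-bit prefix of each hash (the bucket layout is an assumption of the sketch):
#include <stddef.h>
#include <stdint.h>

// Build the filter: OR together (a prefix of) every hash in the bucket.
uint64_t build_filter(const uint64_t *hash_prefixes, size_t n)
{
    uint64_t filter = 0;
    for (size_t i = 0; i < n; i++)
        filter |= hash_prefixes[i];
    return filter;
}

// Definite-negative test: if any bit of the target is missing from the
// filter, no hash in the bucket can equal the target.
int may_contain(uint64_t filter, uint64_t target)
{
    return (filter & target) == target;
}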
This is a very easy problem to solve with a lot of memory. Make the hash the key of a hashtable. Make the hash that you provide to the table the first N bytes of the hash (because they are so random that no human on earth can tell them apart from truly random data).
Not sure what's up with your idea with keying the table with a prefix of the key and having sublists. Any stock library-provided hashtable can easily solve your problem.
Or put it into any database and make the hash the primary key.
The distribution of hashes is uniform, which is helpful because you can put the hashes in a hash table.
// something like this...
#include <stdbool.h> /* bool */
#include <string.h>  /* memcmp */

struct entry {
    bool used;
    unsigned char sha256[32];
    char field1[20];
    char field2[20];
};
If you don't need to delete entries from the hash table, just create a big array of struct entry and insert records from the CSV into the index corresponding to some bits from the SHA-256 hash. Use linear probing to insert entries: if entry i is taken, use i+1, or i+2, until you find a free entry.
struct table {
    int nbits;
    struct entry *entries;
};

unsigned read_int(unsigned char *data)
{
    unsigned v = data[0] | (data[1] << 8) |
        (data[2] << 16) | ((unsigned)data[3] << 24);
    return v;
}

struct entry *find_entry(struct table *table, unsigned char *sha256)
{
    unsigned index = read_int(sha256);
    unsigned mask = (1u << table->nbits) - 1;
    while (1) {
        struct entry *e = &table->entries[index & mask];
        if (!e->used)
            return NULL;
        if (!memcmp(e->sha256, sha256, 32))
            return e;
        index++;
    }
}
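The matching insert with linear probing might look like this (a sketch; it assumes the table is sized so it never fills completely, otherwise the loop would not terminate):
void insert_entry(struct table *table, const struct entry *rec)
{
    unsigned index = read_int((unsigned char *)rec->sha256);
    unsigned mask = (1u << table->nbits) - 1;
    while (table->entries[index & mask].used) // linear probing: skip taken slots
        index++;
    table->entries[index & mask] = *rec;
    table->entries[index & mask].used = true;
}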

Data structure for playing notes in MIDI synthesizer

I'm working on a hardware virtual analog synthesizer using C, and I'm trying to come up with an efficient data structure to manage the dynamic assignment of synthesizer voices in response to incoming MIDI messages.
I have a structure type which holds the data for a single synthesizer voice (pitch, low frequency oscillator, ADSR settings, etc.) and I have a "NoteOn" callback function which is executed when a MIDI "Note On" message is decoded. This function needs to take an "idle" voice from an "idle" pool, modify some settings in the structure, and assign it to a "playing" pool that the main synth engine works with to generate audio samples. Then, when a "Note Off" message is received, the voice with a note value corresponding to the one in the "Note Off" message needs to be selected from the "playing" pool, have its data structure modified again, and eventually returned to the "idle" pool (depending on envelope/ADSR settings.)
I tried an implementation using linked lists for both pools, but my implementation seemed rather cumbersome and slow. I'd like this process to be as quick as possible, to maintain playing responsiveness. Any suggestions?
If a linked list is too slow, the usual answer is to implement a hash table. There are many, many possible variations of the data structure and algorithm. I'll just describe open, "single"-hashing, because that's the variation I'm most familiar with.
So with an open hash table, the table is just an array ("closed" hashing has an array, too, but each element is a linked list). We want the array to be, at most, about half-full for performance reasons. And at maximum-capacity, the filled table will actually have one empty slot still, because this simplifies the algorithm.
We also need a hash function which accepts the type of the key values, and returns integers. It's very difficult to predict how the hash function will behave with respect to clustered keys and overall performance. So just make sure it's an isolated function that can easily be changed later. It can be as simple as shifting-around all the bytes and adding them together.
#include <stdlib.h> /* abs() */

int hash(char *key, int key_length, int table_size)
{
    int ret, i;
    for (i = 0, ret = 0; i < key_length; i++)
    {
        ret += key[i] << i; /* shift each byte by its position, then sum */
    }
    return abs(ret) % table_size;
}
The table-lookup function uses the hash function to decide where to start looking in the array. If the key isn't found there (determined by doing a memcmp() on the actual search key and the key stored at that position in the table), it looks at each successive key, wrapping from the end of the array back to the beginning, and declares failure if it finds an empty table element.
/* Assumes key_value_pair begins with a scalar key_type member named "key",
   and that an all-zero key marks an empty slot. */
#define RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY \
    if (memcmp(&table[i].key, &key, sizeof key) == 0 || table[i].key == 0) \
        return table + i;

key_value_pair *hash_lookup(key_value_pair *table, int table_size, key_type key)
{
    int h, i;
    h = hash((char *)&key, sizeof key, table_size);
    for (i = h; i < table_size; i++)   /* from the hash position to the end */
        RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY
    for (i = 0; i < h; i++)            /* wrap back to the beginning */
        RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY
    return NULL;
}
We'll need one more function in front of this to handle a few quirks. It can return a NULL pointer, which indicates that not only has the key not been found, but the table itself is overfull. An overfull table really means "completely full", though we decided earlier that a "full" table should still keep one empty element. This means that both for loops should not normally run to completion; finding an empty table position counts as a failure. With an overfull table, the lookup has to scan the entire table before discovering that the key is not present, thus losing much of the performance advantage of using a hash at all.
The lookup function can also return a valid pointer to an empty slot. This is also a failure to find the value, but not an error. If adding the key/value pair for the first time, this will be the slot to store it.
Or it could return a pointer to the desired table element. And this will be faster than a linear search, be it an array or linked list.
Deleting a key from the table requires us to fill-in the vacated position in the sequence. There are a couple of options.
If you're not worried about the table running out of space (it's set really large, and the lifetime and usage can be controlled), you can overwrite the entry with a deleted special key, distinct from an empty key.
Or, if you want to reclaim the space too, you'll need to look up the key, then scan the rest of the "chain" (the sequence of keys up to the next empty slot, including wrap-around) and move the last key with a matching hash into the deleted key's position, then overwrite that moved entry's old position with the empty key. This process must then be repeated for the newly vacated position, until you are actually clearing the very last key in the chain. (I need to go fix this in my implementation right now!...)
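For reference, here is a sketch of one standard formulation of that repair, often called backward-shift deletion, written against the same table layout as above (key == 0 still marks an empty slot; pos is the slot being vacated; needs <string.h> for memset):
void hash_delete(key_value_pair *table, int table_size, int pos)
{
    for (int next = (pos + 1) % table_size;
         table[next].key != 0;
         next = (next + 1) % table_size)
    {
        int home = hash((char *)&table[next].key, sizeof table[next].key, table_size);
        /* If home lies cyclically inside (pos, next], the entry is still
           reachable from its home slot; leave it where it is. */
        int in_range = (pos < next)
            ? (home > pos && home <= next)
            : (home > pos || home <= next);
        if (!in_range) {
            table[pos] = table[next]; /* shift the entry back into the hole */
            pos = next;               /* the hole moves forward */
        }
    }
    memset(&table[pos], 0, sizeof table[pos]); /* clear the final hole */
}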

Simple hash functions

I'm trying to write a C program that uses a hash table to store different words and I could use some help.
Firstly, I create a hash table with the size of a prime number which is closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
I started with the simplest function, adding the letters together, which ended up with 88% collision.
Then I started experimenting with the function and found out that whatever I change it to, the collisions don't get lower than 35%.
Right now I'm using
unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int counter, hashAddress = 0;
    for (counter = 0; word[counter] != '\0'; counter++){
        hashAddress = hashAddress * word[counter] + word[counter] + counter;
    }
    return (hashAddress % hashTableSize);
}
which is just a random function that I came up with, but it gives me the best results - around 35% collision.
I've been reading articles on hash functions for the past few hours and I tried to use a few simple ones, such as djb2, but all of them gave me even worse results. (djb2 resulted in 37% collisions, which isn't much worse, but I was expecting something better rather than worse.)
I also don't know how to use some of the other, more complex ones, such as the murmur2, because I don't know what the parameters (key, len, seed) they take in are.
Is it normal to get more than 35% collisions, even with using the djb2, or am I doing something wrong?
What are the key, len and seed values?
Try sdbm:
hashAddress = 0;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = word[counter] + (hashAddress << 6) + (hashAddress << 16) - hashAddress;
}
Or djb2:
hashAddress = 5381;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = ((hashAddress << 5) + hashAddress) + word[counter];
}
Or Adler32:
uint32_t adler32(const void *buf, size_t buflength) {
    const uint8_t *buffer = (const uint8_t*)buf;
    uint32_t s1 = 1;
    uint32_t s2 = 0;
    for (size_t n = 0; n < buflength; n++) {
        s1 = (s1 + buffer[n]) % 65521;
        s2 = (s2 + s1) % 65521;
    }
    return (s2 << 16) | s1;
}

// ...
hashAddress = adler32(word, strlen(word));
None of these are really great, though. If you really want good hashes, you need something more complex like lookup3, murmur3, or CityHash for example.
Note that a hashtable is expected to have plenty of collisions as soon as it is filled by more than 70-80%. This is perfectly normal and will even happen if you use a very good hash algorithm. That's why most hashtable implementations increase the capacity of the hashtable (e.g. capacity * 1.5 or even capacity * 2) as soon as you are adding something to the hashtable and the ratio size / capacity is already above 0.7 to 0.8. Increasing the capacity means a new hashtable is created with a higher capacity, all values from the current one are added to the new one (therefore they must all be rehashed, as their new index will be different in most cases), the new hashtable array replaces the old one, and the old one is released/freed. If you plan on hashing 1000 words, a hashtable capacity of at least 1250 is recommended; better 1400 or even 1500.
Hashtables are not supposed to be "filled to the brim", at least not if they shall be fast and efficient (thus they should always have spare capacity). That's the downside of hashtables: they are fast (O(1)), yet they will usually waste more space than would be necessary for storing the same data in another structure (if you store the words as a sorted array, you will only need a capacity of 1000 for 1000 words; the downside is that the lookup cannot be faster than O(log n) in that case). A collision-free hashtable is not possible in most cases either way. Pretty much all hashtable implementations expect collisions to happen and usually have some way to deal with them (usually collisions make the lookup somewhat slower, but the hashtable will still work and still beat other data structures in many cases).
Also note that if you are using a pretty good hash function, there is no requirement, and not even an advantage, for the hashtable to have a power-of-2 capacity if you are cropping hash values using modulo (%) at the end. The reason why many hashtable implementations always use power-of-2 capacities is that they do not use modulo; instead they use AND (&) for cropping, because an AND operation is among the fastest operations you will find on most CPUs (modulo is never faster than AND; in the best case it would be equally fast, in most cases it is a lot slower). If your hashtable uses power-of-2 sizes, you can replace any modulo with an AND operation:
x % 4 == x & 3
x % 8 == x & 7
x % 16 == x & 15
x % 32 == x & 31
...
This only works for power-of-2 sizes, though. If you use modulo, power-of-2 sizes can only buy you something if the hash is a very bad hash with a very bad "bit distribution". A bad bit distribution is usually caused by hashes that do not use any kind of bit shifting (>> or <<) or any other operations that would have a similar effect.
I created a stripped down lookup3 implementation for you:
#include <stdint.h>
#include <stdlib.h>

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
    a -= c; a ^= rot(c, 4); c += b; \
    b -= a; b ^= rot(a, 6); a += c; \
    c -= b; c ^= rot(b, 8); b += a; \
    a -= c; a ^= rot(c,16); c += b; \
    b -= a; b ^= rot(a,19); a += c; \
    c -= b; c ^= rot(b, 4); b += a; \
}

#define final(a,b,c) \
{ \
    c ^= b; c -= rot(b,14); \
    a ^= c; a -= rot(c,11); \
    b ^= a; b -= rot(a,25); \
    c ^= b; c -= rot(b,16); \
    a ^= c; a -= rot(c,4); \
    b ^= a; b -= rot(a,14); \
    c ^= b; c -= rot(b,24); \
}

uint32_t lookup3 (
    const void *key,
    size_t length,
    uint32_t initval
) {
    uint32_t a,b,c;
    const uint8_t *k;
    const uint32_t *data32Bit;

    data32Bit = key;
    a = b = c = 0xdeadbeef + (((uint32_t)length)<<2) + initval;

    while (length > 12) {
        a += *(data32Bit++);
        b += *(data32Bit++);
        c += *(data32Bit++);
        mix(a,b,c);
        length -= 12;
    }

    k = (const uint8_t *)data32Bit;
    switch (length) { // all cases fall through, picking up the tail bytes
    case 12: c += ((uint32_t)k[11])<<24;
    case 11: c += ((uint32_t)k[10])<<16;
    case 10: c += ((uint32_t)k[9])<<8;
    case 9 : c += k[8];
    case 8 : b += ((uint32_t)k[7])<<24;
    case 7 : b += ((uint32_t)k[6])<<16;
    case 6 : b += ((uint32_t)k[5])<<8;
    case 5 : b += k[4];
    case 4 : a += ((uint32_t)k[3])<<24;
    case 3 : a += ((uint32_t)k[2])<<16;
    case 2 : a += ((uint32_t)k[1])<<8;
    case 1 : a += k[0];
        break;
    case 0 : return c;
    }
    final(a,b,c);
    return c;
}
This code is not as highly optimized for performance as the original code, and therefore it is a lot simpler. It is also not as portable as the original code, but it is portable to all major consumer platforms in use today. It also completely ignores CPU endianness, yet that is not really an issue: it will work on big- and little-endian CPUs. Just keep in mind that it will not calculate the same hash for the same data on big- and little-endian CPUs, but that is no requirement; it will calculate a good hash on both kinds of CPUs, and it is only important that it always calculates the same hash for the same input data on a single machine.
You would use this function as follows:
#include <string.h> /* strlen() */

unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int initval = 12345;
    unsigned int hashAddress = lookup3(word, strlen(word), initval);
    return (hashAddress % hashTableSize);
    // If the hashtable is guaranteed to always have a size that is a power of 2,
    // replace the line above with the following, more efficient line:
    //     return (hashAddress & (hashTableSize - 1));
}
You may wonder what initval is. Well, it is whatever you want it to be. You could call it a salt. It will influence the hash values, yet the hash values will not get better or worse in quality because of it (at least not in the average case; it may lead to more or fewer collisions for very specific data, though). E.g. you can use different initval values if you want to hash the same data twice, yet each time should produce a different hash value (there is no guarantee it will, but it is rather likely if initval is different; if it creates the same value, this would be a very unlucky coincidence that you must treat as a kind of collision). It is not advisable to use different initval values when hashing data for the same hashtable (this will rather cause more collisions on average). Another use for initval is if you want to combine a hash with some other data, in which case the already existing hash becomes initval when hashing the other data (so both the other data and the previous hash influence the outcome of the hash function). You may even set initval to 0 if you like, or pick a random value when the hashtable is created (and always use this random value for this instance of the hashtable, yet each hashtable has its own random value).
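For instance, combining two strings into one hash by feeding the first result in as the initval of the second call (hashPerson is an illustrative name, using the lookup3 above):
#include <string.h>

uint32_t hashPerson(const char *firstName, const char *lastName)
{
    uint32_t h = lookup3(firstName, strlen(firstName), 0);
    /* the previous hash becomes initval, so both strings influence the result */
    return lookup3(lastName, strlen(lastName), h);
}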
A note on collisions:
Collisions are usually not such a huge problem in practice, it usually does not pay off to waste tons of memory just to avoid them. The question is rather how you are going to deal with them in an efficient way.
You said you are currently dealing with 9000 words. If you were using an unsorted array, finding a word in the array will need 4500 comparisons on average. On my system, 4500 string comparisons (assuming that words are between 3 and 20 characters long) need 38 microseconds (0.000038 seconds). So even such a simple, ineffective algorithm is fast enough for most purposes. Assuming that you are sorting the word list and use a binary search, finding a word in the array will need only 13 comparisons on average. 13 comparisons are close to nothing in terms of time, it's too little to even benchmark reliably. So if finding a word in a hashtable needs 2 to 4 comparisons, I wouldn't even waste a single second on the question whether that may be a huge performance problem.
In your case, a sorted list with binary search may even beat a hashtable by far. Sure, 13 comparisons need more time than 2-4 comparisons; however, in case of a hashtable you must first hash the input data to perform a lookup. Hashing alone may already take longer than 13 comparisons! The better the hash, the longer it will take for the same amount of data to be hashed. So a hashtable only pays off performance-wise if you have a really huge amount of data or if you must update the data frequently (e.g. constantly adding/removing words to/from the table, since these operations are less costly for a hashtable than they are for a sorted list). The fact that a hashtable is O(1) only means that regardless how big it is, a lookup will always need approximately the same amount of time. O(log n) only means that the lookup grows logarithmically with the number of words: more words, slower lookup. Yet Big-O notation says nothing about absolute speed! This is a big misunderstanding. It does not say that an O(1) algorithm always performs faster than an O(log n) one. Big-O notation only tells you that if the O(log n) algorithm is faster for a certain number of values and you keep increasing the number of values, the O(1) algorithm will certainly overtake the O(log n) algorithm at some point, but your current word count may be far below that point. Without benchmarking both approaches, you cannot say which one is faster by just looking at the Big-O notation.
Back to collisions. What should you do if you run into a collision? If the number of collisions is small, and here I don't mean the overall number of collisions (the number of words that are colliding in the hashtable) but the per index one (the number of words stored at the same hashtable index, so in your case maybe 2-4), the simplest approach is to store them as a linked list. If there was no collision so far for this table index, there is just a single key/value pair. If there was a collision, there is a linked list of key/value pairs. In that case your code must iterate over the linked list and verify each of the keys and return the value if it matches. Going by your numbers, this linked list won't have more than 4 entries and doing 4 comparisons is insignificant in terms of performance. So finding the index is O(1), finding the value (or detecting that this key is not in the table) is O(n), but here n is only the number of linked list entries (so it is 4 at most).
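A minimal sketch of such per-index chaining (the node layout is illustrative):
#include <string.h>

struct node {
    char *key;
    void *value;
    struct node *next; /* next entry colliding at the same table index */
};

/* Walk the (short) chain at one table index; return the value or NULL. */
void *chain_find(struct node *head, const char *key)
{
    for (struct node *n = head; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return n->value;
    return NULL;
}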
If the number of collisions rises, a linked list can become too slow, and you may instead store a dynamically sized, sorted array of key/value pairs, which allows lookups in O(log n); again, n is only the number of keys in that array, not of all keys in the hashtable. Even if there were 100 collisions at one index, finding the right key/value pair takes at most 7 comparisons. That's still close to nothing. That said, if you really have 100 collisions at one index, either your hash algorithm is unsuited for your key data or the hashtable is far too small in capacity. The disadvantage of a dynamically sized, sorted array is that adding/removing keys is somewhat more work than in the case of a linked list (code-wise, not necessarily performance-wise). So using a linked list is usually sufficient if you keep the number of collisions low enough, and it is almost trivial to implement such a linked list yourself in C and add it to an existing hashtable implementation.
Most hashtable implementations I have seen use such a "fallback to an alternate data structure" to deal with collisions. The disadvantage is that these require a little bit of extra memory to store the alternative data structure and a bit more code to also search for keys in that structure. There are also solutions that store collisions inside the hashtable itself and that don't require any additional memory. However, these solutions have a couple of drawbacks. The first drawback is that every collision increases the chances for even more collisions as more data is added. The second drawback is that while lookup times for keys increase linearly with the number of collisions so far (and, as I said before, every collision leads to even more collisions as data is added), lookup times for keys not in the hashtable degrade even worse, and in the end, if you perform a lookup for a key that is not in the hashtable (yet you cannot know without performing the lookup), the lookup may take as long as a linear search over the whole hashtable (YUCK!!!). So if you can spare the extra memory, go for an alternate structure to handle collisions.
Firstly, I create a hash table with the size of a prime number which is closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
...
return (hashAddress%hashTableSize);
Since the number of different hashes is comparable to the number of words, you cannot expect to have much lower collision rates.
I made a simple statistical test with a random hash (which is the best you could achieve) and found that 26% is the limiting collision rate if you have #words == #different hashes. (With n words thrown independently into n slots, the slot loads are approximately Poisson with mean 1, so the fraction of slots holding two or more words tends to 1 - 2/e ≈ 26%.)

Embedded C - How to create a cache for expensive external reads?

I am working with a microcontroller that has an external EEPROM containing tables of information.
There is a large amount of information, however there is a good chance that we will request the same information cycle to cycle if we are fairly 'stable' - i.e. if we are at a constant temperature for example.
Reads from the EEPROM take around 1 ms, and we do around 30 per cycle. Our cycle is currently about 100 ms, so there are significant savings to be had.
I am therefore looking at implementing a RAM cache. A hit should be significantly faster than 1 ms, since the microcontroller core is running at 8 MHz.
The lookup involves a 16-bit address returning 16-bit data. The microcontroller is 32-bit.
Any input on caching would be greatly appreciated, especially if I am totally missing the mark and should be using something else, like a linked list, or even a pre-existing library.
Here is what I think I am trying to achieve:
-A cache made up of an array of structs. The struct would contain the address, data and some sort of counter indicating how often this piece of data has been accessed (readCount).
-The array would be sorted by address normally. I would have an efficient lookup() function to lookup an address and get the data (suggestions?)
-If I got a cache miss, I would sort the array by readCount to determine the least used cached value and throw it away. I would then fill its position with the new value I have looked up from EEPROM. I would then reorder the array by address. Any sorting would use an efficient sort (shell sort? - not sure how to handle this with arrays)
-I would somehow decrement all of the readCount variables so that they would tend to zero if not used. This should preserve constantly used variables.
Here are my thoughts so far (pseudocode, apologies for my coding style):
#define CACHE_SIZE 50

//one piece of data in the cache
struct cacheItem
{
    uint16_t address;
    uint16_t data;
    uint8_t readCount;
};

//array of cached addresses
struct cacheItem cache[CACHE_SIZE];

//function to get data from the cache
uint16_t getDataFromCache(uint16_t address)
{
    uint16_t data;
    uint8_t cacheResult;
    struct cacheItem *cacheHit; //Pointer to a successful cache hit

    //returns CACHE_HIT if in the cache, else returns CACHE_MISS
    cacheResult = lookUpCache(address, &cacheHit);
    if (cacheResult == CACHE_MISS)
    {
        //Think this is necessary to easily weed out the least accessed address
        sortCacheByReadCount(); //shell sort?
        removeLastCacheEntry(); //delete the last item that hasn't been accessed for a while
        data = getDataFromEEPROM(address); //Expensive EEPROM read
        //Add on to the bottom of the cache
        appendToCache(address, data, 1); //1 = setting readCount to 1 for new addition
        //Think this is necessary to make a lookup function faster
        sortCacheByAddress(); //shell sort?
    }
    else
    {
        data = cacheHit->data;  //We had a hit, so pull the data
        cacheHit->readCount++;  //Up the importance now
    }
    return data;
}

//Main function
main(void)
{
    testData = getDataFromCache(1234);
}
Am I going down the completely wrong track here? Any input is appreciated.
Repeated sorting sounds expensive to me. I would implement the cache as a hash table on the address. To keep things simple, I would start by not even counting hits but rather evicting old entries immediately on seeing a hash collision:
#include <stdint.h>

#define CACHE_SIZE 32 // power of two

struct CacheEntry {
    int16_t address;
    int16_t value;
};

struct CacheEntry cache[CACHE_SIZE]; // initialize addresses to a value that is never looked up

extern int16_t readEeprom(int16_t address); // the expensive EEPROM access, provided elsewhere

// adjust shifts for different CACHE_SIZE
static inline int cacheIndex(int adr) { return (((adr>>10)+(adr>>5)+adr)&(CACHE_SIZE-1)); }

int16_t cachedRead(int16_t address)
{
    int idx = cacheIndex(address);
    struct CacheEntry *pCache = cache + idx;
    if (address != pCache->address) {
        pCache->value = readEeprom(address);
        pCache->address = address;
    }
    return pCache->value;
}
If this proves not effective enough, I would start by fiddling around with the hash function.
Don't be afraid to do more computations, in most cases I/O is slower.
This is the simplest implementation I can think of:
#define CACHE_SIZE 50

something cached_vals[CACHE_SIZE];     // "something" is whatever payload you cache
short int cached_item_num[CACHE_SIZE];
unsigned char cache_hits[CACHE_SIZE];  // 0 means free.

void inc_hits(int index){
    if (cache_hits[index] > 127){      // counter saturating: age all entries
        for (int i = 0; i < CACHE_SIZE; i++){
            cache_hits[i] >>= 1;
            if (cache_hits[i] == 0)
                cache_hits[i]++;       // 0 is reserved as "free" marker
        }
    }
    cache_hits[index]++;
}

int get_new_space(short int item){
    for (int i = 0; i < CACHE_SIZE; i++)
        if (!cache_hits[i]) {
            inc_hits(i);
            cached_item_num[i] = item;
            return i;
        }
    // no free slots: drop the one with the lowest hit count
    int min_i = 0;
    for (int i = 1; i < CACHE_SIZE; i++)
        if (cache_hits[i] < cache_hits[min_i])
            min_i = i;
    cache_hits[min_i] = 2;             // give new values more chances to "survive"
    cached_item_num[min_i] = item;
    return min_i;
}

something* get_item(short int item){
    for (int i = 0; i < CACHE_SIZE; i++){
        if (cache_hits[i] && cached_item_num[i] == item){
            inc_hits(i);
            return cached_vals + i;
        }
    }
    int new_item = get_new_space(item);
    read_from_eeprom(item, cached_vals + new_item);
    return cached_vals + new_item;
}
Sorting and moving data seems like a bad idea, and it's not clear you gain anything useful from it.
I'd suggest a much simpler approach. Allocate 4*N (for some N) bytes of data, as an array of 4-byte structs each containing an address and the data. To look up a value at address A, you look at the struct at index A mod N; if its stored address is the one you want, then use the associated data, otherwise look up the data off the EEPROM and store it there along with address A. Simple, easy to implement, easy to test, and easy to understand and debug later.
If the location of your current lookup tends to be near the location of previous lookups, that should work quite well -- any time you're evicting data, it's going to be from at least N locations away in the table, which means you're probably not likely to want it again any time soon -- I'd guess that's at least as good a heuristic as "how many times did I recently use this". (If your EEPROM is storing several different tables of data, you could probably just do a cache for each one as the simplest way to avoid collisions there.)
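A sketch of that scheme (N, the names, and the EEPROM read function are placeholders; it also assumes the stored addresses are initialized at startup to a value that is never looked up):
#include <stdint.h>

#define N 64 /* number of cache slots, tune to available RAM */

struct slot {
    uint16_t address;
    uint16_t data;
};

static struct slot slots[N];

extern uint16_t readFromEEPROM(uint16_t address); /* the expensive 1 ms read */

uint16_t cachedLookup(uint16_t a)
{
    struct slot *s = &slots[a % N];
    if (s->address != a) { /* miss: fetch and overwrite this slot */
        s->data = readFromEEPROM(a);
        s->address = a;
    }
    return s->data;
}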
You said that which entry you need from the table relates to the temperature, and that the temperature tends to remain stable. As long as the temperature does not change too quickly, it is unlikely that you will need an entry from the table which is more than 1 entry away from the previously needed entry.
You should be able to accomplish your goal by keeping just 3 entries in RAM. The first entry is the one you just used. The next entry is the one corresponding to the temperature just below the last temperature measurement, and the other one is for the temperature just above the last temperature measurement. When the temperature changes, one of these entries probably becomes the new current one. You can then perform whatever task it is you need using this data, and afterwards read the entry you need (higher or lower than the current temperature) once you have finished other work (before reading the next temperature measurement).
Since there are only 3 entries in RAM at a time you don't have to be clever about what data structure you need to store them in to access them efficiently, or even keeping them sorted because it will never be that long.
If temperatures can move faster than 1 unit per examination period, then you could just increase the size of your cache and maybe have a few more anticipatory entries (in the direction the temperature seems to be heading) than trailing entries. Then you may want to store the entries in an efficient structure, though. I wouldn't worry about how recently you accessed an entry, because next-temperature probability distribution predictions based on the current temperature will usually be pretty good. You will need to make sure you handle the case where you are way off and need to read in the entry for a just-read temperature immediately, though.
Here are my suggestions:
A replace-oldest or replace-least-recently-used policy would be better; replacing the least accessed entry would quickly fill up the cache and then just repeatedly replace the last element.
Do not traverse the whole array, but probe a few pseudo-random (seeded by the address) locations to replace (the special case of a single location is already presented by #ruslik).
My idea would be:
#define CACHE_SIZE 50

//one piece of data in the cache
struct cacheItem
{
    uint16_t address;
    uint16_t data;
    uint8_t whenWritten;
};

//array of cached addresses
struct cacheItem cache[CACHE_SIZE];

// circular cache write counter
uint8_t writecount = 0;

// returns a cache location that either contains the actual data
// or is the best candidate to be overwritten
struct cacheItem *cacheLocation(uint16_t address) {
    struct cacheItem *bestc = NULL, *c;
    int bestage = -1, age, i;
    srand(address); // I'll use the standard PRNG to acquire locations; as it is
                    // seeded by the address, it will always give the same
                    // sequence for the same location
    for (i = 0; i < 4; i++) { // any number of iterations you find best
        c = &(cache[rand() % CACHE_SIZE]);
        if (c->address == address) return c; // FOUND!
        age = (writecount - c->whenWritten) & 0xFF; // after age 255 comes age 0 :(
        if (age > bestage) {
            bestage = age;
            bestc = c;
        }
    }
    return bestc;
}
....
struct cacheItem *c = cacheLocation(addr);
if (c->address != addr) {
    c->address = addr;
    c->data = external_read(addr);
    c->whenWritten = ++writecount;
}
The cache age will wrap from 255 back to 0, but that just slightly randomizes cache replacements, so I did not add a workaround for it.

Ideal data structure for mapping integers to integers?

I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).
Following the example on Wikipedia, I originally had this:
size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table));
However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).
Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).
Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?
Update
For those interested, I ended up using a hash table via the uthash library.
0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.
Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).
(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)
If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.
This is the only way I can foresee getting out of mallocing structs.
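A sketch of that smaller-table idea, storing the full key so collisions are at least detectable (how to resolve them -- probing, chaining, etc. -- is left open, as noted above; N is a placeholder reduction factor):
#include <stddef.h>

#define N 256
#define TABLE_SIZE ((0xFFFFFF + 1) / N)

struct cell {
    unsigned color; /* full 24-bit key, so a collision can be detected */
    size_t shift;   /* the value being mapped to */
};

/* caveat: zero-initialization makes color 0 look present; reserve it
   or track occupancy separately */
static struct cell table[TABLE_SIZE];

void put_shift(unsigned hexValue, size_t shift)
{
    /* later insertions simply overwrite colliding entries in this sketch */
    struct cell *c = &table[hexValue % TABLE_SIZE];
    c->color = hexValue;
    c->shift = shift;
}

size_t get_shift(unsigned hexValue, size_t default_shift)
{
    struct cell *c = &table[hexValue % TABLE_SIZE];
    return (c->color == hexValue) ? c->shift : default_shift;
}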
You can malloc(3) a block of 0xFFFFFF + 1 size_t elements on the heap (for simplicity), and address it as you would an array.
As for the stack overflow: the program receives a SIGSEGV, which can be the result of a stack overflow, accessing illegal memory, writing to a read-only segment, etc... They are all abstracted under the same error message: "Segmentation fault".
But why don't you use a higher-level language like Python that supports associative arrays?
At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.
You could create a sparse array of sorts which has "pages" like this (this example uses 256 "pages", so the uppermost byte is the page number):
int *pages[256];

/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
    int i;
    for (i = 0; i < 256; ++i) {
        pages[i] = NULL;
    }
}

int get_value(int index) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so it will zero it out */
    }
    return pages[index / 0x10000][index % 0x10000];
}

void set_value(int index, int value) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so it will zero it out */
    }
    pages[index / 0x10000][index % 0x10000] = value;
}
This will allocate a page the first time it is touched, whether by a read or a write.
To avoid the overhead of malloc you can use a hashtable where the entries in the table are your structs, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate emptiness of the slot in the table.
How many values are there in your output space, i.e. how many different values do you map to in the range 0-0xFFFFFF?
Using randomized universal hashing you can come up with a collision-free hash function with a table no bigger than 2 times the number of values in your output space (for a static table)
