Data structure for playing notes in MIDI synthesizer - c

I'm working on a hardware virtual analog synthesizer using C, and I'm trying to come up with an efficient data structure to manage the dynamic assignment of synthesizer voices in response to incoming MIDI messages.
I have a structure type which holds the data for a single synthesizer voice (pitch, low frequency oscillator, ADSR settings, etc.) and I have a "NoteOn" callback function which is executed when a MIDI "Note On" message is decoded. This function needs to take an "idle" voice from an "idle" pool, modify some settings in the structure, and assign it to a "playing" pool that the main synth engine works with to generate audio samples. Then, when a "Note Off" message is received, the voice with a note value corresponding to the one in the "Note Off" message needs to be selected from the "playing" pool, have its data structure modified again, and eventually returned to the "idle" pool (depending on envelope/ADSR settings.)
I tried an implementation using linked lists for both pools, but my implementation seemed rather cumbersome and slow. I'd like this process to be as quick as possible, to maintain playing responsiveness. Any suggestions?

If a linked list is too slow, the usual answer is to implement a hash table. There are many, many possible variations of the data structure and algorithm; I'll just describe open ("single") hashing, because that's the variation I'm most familiar with.
So with an open hash table, the table is just an array ("closed" hashing has an array, too, but each element is a linked list). We want the array to be, at most, about half-full for performance reasons. And at maximum-capacity, the filled table will actually have one empty slot still, because this simplifies the algorithm.
We also need a hash function which accepts the type of the key values, and returns integers. It's very difficult to predict how the hash function will behave with respect to clustered keys and overall performance, so just make sure it's an isolated function that can easily be changed later. It can be as simple as shifting the bytes around and adding them together.
/* Needs <stdlib.h> for abs(). */
int hash(char *key, int key_length, int table_size)
{
    int ret, i;

    for (i = 0, ret = 0; i < key_length; i++)
    {
        ret += key[i] << i;
    }
    return abs(ret) % table_size;
}
The table-lookup function uses the hash function to decide where to start looking in the array. If the key isn't found there (determined by doing a memcmp() on the actual search key and the key stored at that position in the table), it looks at each successive key, wrapping from the end of the array back to the beginning, and declares failure if it finds an empty table element.
/* Assumes key_value_pair is something like:
       typedef struct { key_type key; value_type value; } key_value_pair;
   with a key value of 0 reserved to mean "empty slot". */
#define RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY \
    if (memcmp(&table[i].key, &key, sizeof key) == 0 || table[i].key == 0) \
        return table + i;

key_value_pair *hash_lookup(key_value_pair *table, int table_size, key_type key)
{
    int h, i;

    h = hash((char *)&key, sizeof key, table_size);
    for (i = h; i < table_size; i++)
        RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY
    for (i = 0; i < h; i++)
        RETURN_TABLE_I_IF_EQUAL_KEY_OR_EMPTY
    return NULL;
}
We'll need one more function in front of this to handle a few quirks. hash_lookup() can return a NULL pointer, which indicates not only that the key has not been found, but that the table itself is overfull. "Overfull" really means "completely full", since we decided earlier that a "full" table should still have one empty element. Under normal operation neither for loop runs to completion: finding an empty table position is what signals failure. With an overfull table, the function has to scan the entire table before discovering that the key is not present, losing much of the performance advantage of using a hash at all.
The lookup function can also return a valid pointer to an empty slot. This is also a failure to find the value, but not an error; if you're adding the key/value pair for the first time, this is the slot to store it in.
Or it can return a pointer to the desired table element, which will be faster than a linear search, be that of an array or a linked list.
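For illustration, a minimal sketch of what that front-end insert function might look like, given the key_value_pair layout assumed above (value_type, like key_type, is just a stand-in for whatever your table actually stores):
/* Sketch only: assumes key_value_pair has members "key" and "value",
   key 0 means "empty slot", and the table is kept at most about half full
   (a fuller implementation would also track the element count). */
int hash_insert(key_value_pair *table, int table_size, key_type key, value_type value)
{
    key_value_pair *slot = hash_lookup(table, table_size, key);

    if (slot == NULL)
        return -1;              /* table is overfull: nothing was stored   */
    slot->key = key;            /* no-op if the key was already present    */
    slot->value = value;        /* insert new value, or overwrite existing */
    return 0;
}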
Deleting a key from the table requires us to fill-in the vacated position in the sequence. There are a couple of options.
If you're not worried about the table running out of space (it's set really large, and the lifetime and usage can be controlled), you can overwrite the entry with a special "deleted" key, distinct from the empty key.
Or, if you want to reclaim the space too, you'll need to look up the key and then scan the rest of the "chain" (the sequence of keys up to the next empty slot, including wrap-around) and move the last key in the chain that hashes to the vacated position into the key-to-delete's position. Then write over that moved key/value's old position with the empty key... oops! That position has now become a hole of its own, so the process must be repeated for it, until we're actually clearing the very last such key in the chain. (I need to go fix this in my implementation right now!)
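A minimal sketch of the first, simpler option (the special "deleted" key), assuming the same key_value_pair layout as before; note that hash_lookup() would also need a small change to skip, rather than stop at, deleted slots while searching:
/* Sketch of the "deleted special key" (tombstone) option, assuming a sentinel
   DELETED_KEY that can never be a real key. hash_lookup() must treat
   DELETED_KEY slots as occupied while searching, so the probe keeps going
   past them; inserts may reuse them. */
#define DELETED_KEY ((key_type)-1)      /* hypothetical sentinel */

int hash_delete(key_value_pair *table, int table_size, key_type key)
{
    key_value_pair *slot = hash_lookup(table, table_size, key);

    if (slot == NULL || slot->key == 0 || slot->key == DELETED_KEY)
        return -1;                      /* key wasn't in the table */
    slot->key = DELETED_KEY;            /* leave a tombstone in the chain */
    return 0;
}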

Related

Dynamically indexing an array in C

Is it possible to create arrays based on their index as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the run, without creating foo[0...3][0...4]?
If not, is there a data structure that allows me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), int foo[16][16]; for example, or use a different data structure.
In C++ you could do a map<pair<int, int>, int>
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
    for (int j = 0; j < y; j++)
        foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
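A minimal sketch of the dynamically allocated variant, using a pointer to a VLA so the usual foo[i][j] indexing still works (the wrapper function here is hypothetical, just to frame the example):
#include <stdlib.h>

void demo(int x, int y, int someNr)
{
    int (*foo)[y] = malloc(x * sizeof *foo);    /* heap allocation, VLA type */
    if (foo == NULL)
        return;
    for (int i = 0; i < x; i++)
        for (int j = 0; j < y; j++)
            foo[i][j] = someNr + i * (x + 1) + j;
    free(foo);
}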
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instructions per second and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
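As a rough sketch of the kind of interface this implies (the names and the linked-list-of-cells representation are just one arbitrary choice, not the only way to do it):
#include <stddef.h>

/* One explicitly stored cell; anything not stored reads back as the default. */
typedef struct cell {
    int row, col;
    double value;
    struct cell *next;
} cell;

typedef struct {
    int max_rows, max_cols;
    double default_value;       /* value of any cell never written            */
    cell *cells;                /* unordered singly linked list of stored cells */
} sparse_array;

sparse_array *sparse_create(int max_rows, int max_cols, double default_value);
void sparse_destroy(sparse_array *a);
void sparse_set(sparse_array *a, int row, int col, double value);

double sparse_get(const sparse_array *a, int row, int col)
{
    for (const cell *c = a->cells; c != NULL; c = c->next)
        if (c->row == row && c->col == col)
            return c->value;
    return a->default_value;    /* no explicit entry for this cell */
}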
The most efficient way to create a dynamic index of an array is to create an empty array of the same data type as the array to be indexed.
Let's imagine we are using integers for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index and will be somewhere close to the length of the data.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should order the data and eliminate duplicates. That's easy to achieve by using qsort() (the C standard library's quicksort function) and a duplicate-removal function such as
/* Assumes unord_arr has already been sorted with qsort(); copies it into
   ord_arr with duplicates removed and returns the number of entries kept. */
uint64_t remove_dupes(uint64_t *unord_arr, uint64_t *ord_arr, uint64_t arr_size)
{
    uint64_t i, j = 0;

    for (i = 1; i < arr_size; i++)
    {
        if (unord_arr[i] != unord_arr[i-1]) {
            ord_arr[j] = unord_arr[i-1];
            j++;
        }
        if (i == arr_size - 1) {
            ord_arr[j] = unord_arr[i];
            j++;
        }
    }
    return j;
}
Adapt the code above to your needs; you should free() the unordered array once the function has finished copying it into the ordered array. The function above is very fast, but it will return zero entries when the array to order contains only one element; that's probably something you can live with.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be of an exact length, although sticking to powers of 10 will make everything easier in the case of integers.
uint64_t *idx = calloc((size_t)pow(10, indexdepth), sizeof(uint64_t)); /* needs <stdlib.h> and <math.h> */
This will create an empty index array.
Then populate the index. Traverse the array to be indexed just once, and every time you detect a change in the leading significant figures (as many of them as the index depth), record the position where that new prefix was first detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts with 10 (e.g. 103456), you add an entry to the index. Let's say that 103456 was detected at position 733; your index entry would be:
index[10] = 733;
The next entry, beginning with 11, should be added in the next index slot. Let's say the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array of 1 million entries, you don't have to iterate over the whole array; you just check in your index where the first number starting with the same first two significant digits is stored. Entry index[10] tells you where the first number starting with 10 is stored. You can then iterate forward until you find your match.
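A minimal sketch of building and searching such an index over a sorted, de-duplicated uint64_t array, using a two-digit prefix (the helper names and the UINT64_MAX "no entry" sentinel are choices made for this example):
#include <stdint.h>

static uint64_t two_digit_prefix(uint64_t v)
{
    while (v >= 100)
        v /= 10;
    return v;                   /* first two significant decimal digits: 0..99 */
}

/* index[p] holds the position of the first element whose prefix is p,
   or UINT64_MAX if no element has that prefix. data must be sorted. */
void build_index(const uint64_t *data, uint64_t n, uint64_t index[100])
{
    for (int p = 0; p < 100; p++)
        index[p] = UINT64_MAX;
    for (uint64_t i = 0; i < n; i++) {
        uint64_t p = two_digit_prefix(data[i]);
        if (index[p] == UINT64_MAX)
            index[p] = i;       /* first position with this prefix */
    }
}

int find(const uint64_t *data, uint64_t n, const uint64_t index[100], uint64_t target)
{
    uint64_t p = two_digit_prefix(target);
    if (index[p] == UINT64_MAX)
        return 0;               /* no element starts with this prefix */
    /* data is sorted, so scan forward from the first element with this
       prefix until we pass the target */
    for (uint64_t i = index[p]; i < n && data[i] <= target; i++)
        if (data[i] == target)
            return 1;
    return 0;
}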
In my example I employed a small index, so the average number of iterations you will need to perform will be about 1000000/100 = 10000.
If you enlarge your index to somewhere close to the length of the data, the number of iterations will tend towards 1, making any search blazingly fast.
What I like to do is to create some simple algorithm that tells me what's the ideal depth of the index after knowing the type and length of the data to index.
Please note that in the example I have posed, 64-bit numbers are indexed by their first indexdepth significant figures, thus 10 and 100001 will be stored in the same index segment. That's not a problem on its own; nonetheless, each master has his small book of secrets: treating the numbers as fixed-length strings (for example in hexadecimal) can help keep a strict numerical order.
You don't have to change the base, though: you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base-10 numbers ordered. Using different numerical bases is in any case trivial in C, which is of great help for this task.
As you make your index depth larger, the number of entries per index segment is reduced.
Please do note that programming, especially at a lower level like C, consists to a great extent in understanding the trade-off between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value, at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As SSD speeds get closer to that of RAM, using files to store indexes is worth taking into account. Nevertheless, modern OSs tend to load into RAM as much as they can, so using files would end up performing similarly.

Expand hash table without rehash?

I am looking for a hash table data structure that does not require rehashing to expand or shrink.
Rehashing is a CPU-consuming effort. I was wondering if it is possible to design a hash table data structure in a way that does not require rehashing at all. Have you heard of such a data structure before?
That depends on what you call "rehash":
If you simply mean that the table-level rehash shouldn't reapply the hash function to each key during resizing, then that's easy with most libraries: e.g. wrap the key and its raw (pre-modulo-table-size) hash value together a la struct X { size_t hash_; Key key_; };, supply the hashtable library with a hash function that returns hash_, but a comparison function that compares key_s (depending on the complexity of key_ comparison, you may be able to use hash_ to optimise, e.g. lhs.hash_ == rhs.hash_ && lhs.key_ == rhs.key_).
This will help most if the hashing of keys was particularly time consuming (e.g. cryptographic strength on longish keys). For very simple hashing (e.g. passthrough of ints) it'll slow you down and waste memory.
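In plain C, the same caching idea might look roughly like this (a sketch; the fixed-size key and the function names are assumptions for the example):
#include <stddef.h>
#include <string.h>

/* Store the full, pre-modulo hash alongside the key so that resizing only
   has to recompute  hash_ % new_table_size , never rehash the key itself. */
typedef struct {
    size_t hash_;               /* raw hash, computed once at insert time       */
    char key_[32];              /* the actual key (fixed-size here for brevity) */
} cached_key;

size_t bucket_of(const cached_key *k, size_t table_size)
{
    return k->hash_ % table_size;       /* cheap: no rehash of key_ needed */
}

int keys_equal(const cached_key *a, const cached_key *b)
{
    return a->hash_ == b->hash_ && memcmp(a->key_, b->key_, sizeof a->key_) == 0;
}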
If you mean the table-level operation of increasing or decreasing memory storage and reindexing all stored values, then yes - it can be avoided - but to do so you have to fundamentally change the way the hash table works, and the normal performance profile. Discussed below.
As just one example, you could leverage a more typical hashtable implementation (let's call it H) by having your custom hashtable (C) hold an H** p that - up to an initial size limit - will have p[0] as the only instance of H, simply ferrying operations/results through. If the table grows beyond that, you keep p[0] referencing the existing H, while creating a second H hashtable to be tracked by p[1]. Then things start getting dicey (a rough C sketch follows the list below):
to search or erase in C, your implementation needs to search p[1] then p[0] and report any match from either
to insert a new value in C, your implementation must confirm it's not in p[0], then insert to p[1]
with each insert (and potentially even for other operations), it could optionally migrate any matching entry - or an arbitrary p[0] entry - to p[1], so gradually p[0] empties; you can easily guarantee p[0] will be empty before p[1] gets so full that a larger table is needed. When p[0] is empty you may want to do p[0] = p[1]; p[1] = NULL; to keep the simple mental model of what's where - lots of options.
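Here is a rough sketch of that two-table arrangement in C. The underlying hashtable type H and its h_* functions are purely hypothetical, invented for illustration:
#include <stddef.h>

/* Hypothetical underlying hashtable type and API, assumed for this sketch. */
typedef struct H H;
void  *h_find(H *t, int key);                      /* NULL if absent             */
void   h_insert(H *t, int key, void *value);
int    h_erase(H *t, int key);                     /* remove key if present      */
int    h_erase_any(H *t, int *key, void **value);  /* pop one entry, 0 if empty  */
size_t h_count(H *t);

typedef struct {
    H *p[2];                    /* p[0] = old (draining) table, p[1] = new table */
} C;

void *c_find(C *c, int key)
{
    void *v = c->p[1] ? h_find(c->p[1], key) : NULL;
    if (v == NULL)
        v = h_find(c->p[0], key);       /* fall back to the old table */
    return v;
}

void c_insert(C *c, int key, void *value)
{
    if (c->p[1] == NULL) {              /* not currently migrating              */
        h_insert(c->p[0], key, value);  /* (growth trigger omitted for brevity) */
        return;
    }
    h_erase(c->p[0], key);              /* a key must live in only one table */
    h_insert(c->p[1], key, value);

    /* migrate one arbitrary old entry per insert, spreading the rehash cost */
    int mk; void *mv;
    if (h_erase_any(c->p[0], &mk, &mv))
        h_insert(c->p[1], mk, mv);

    if (h_count(c->p[0]) == 0) {        /* old table drained: promote the new one */
        c->p[0] = c->p[1];
        c->p[1] = NULL;
    }
}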
Some existing hash table implementations are very efficient at iterating over elements (e.g. GNU C++ std::unordered_set), as there's a singly linked list of all the values, and the hash table is really only a collection of pointers (in C++ parlance, iterators) into the linked list. This can mean that if your utilisation falls below some threshold (e.g. 10% load factor) for your only/larger hash table, you know you can very efficiently migrate the remaining elements to a smaller table.
These kind of tricks are used by some hash tables to avoid a sudden heavy cost during rehashing, and instead spread the pain more evenly over a number of subsequent operations, avoiding a possibly nasty spike in latency.
Some of the implementation options only make sense for either an open or a closed hashing implementation, or are only useful when the keys and/or values are small or large and depending on whether the table embeds them or points to them. Best way to learn about it is to code....
It depends what you want to avoid. Rehashing implies recomputing the hash values. You can avoid that by storing the hash values in the hash structures. Redispatching the entries into the reallocated hashtable may be less expensive (typically a single modulo or masking operation) and is hardly avoidable for simple hashtable implementations.
Assuming you actually do need this... it is possible. Here I'll give a trivial example you can build on.
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

// Basic types we deal with
typedef uint32_t key_t;
typedef void * value_t;

typedef struct
{
    key_t key;
    value_t value;
} hash_table_entry_t;

typedef struct
{
    uint32_t initialSize;
    uint32_t size;   // current max entries
    uint32_t count;  // current filled entries
    hash_table_entry_t *entries;
} hash_table_t;

// Hash function depends on the size of the table
key_t hash(value_t value, uint32_t size)
{
    // Simple hash function that just does modulo hash table size
    return (key_t)(uintptr_t)value % size;
}

void init(hash_table_t *pTable, uint32_t initialSize)
{
    pTable->initialSize = initialSize;
    pTable->size = initialSize;
    pTable->count = 0;
    pTable->entries = malloc(pTable->size * sizeof(*pTable->entries));
    /// #todo handle null return;
    // Set to ~0 to signal invalid keys.
    memset(pTable->entries, ~0, pTable->size * sizeof(*pTable->entries));
}

void insert(hash_table_t *pTable, value_t val)
{
    key_t key = hash(val, pTable->size);
    for (key_t i = key; i != (key-1); i = (i+1) % pTable->size)
    {
        if (pTable->entries[i].key == (key_t)~0)
        {
            pTable->entries[i].key = key;
            pTable->entries[i].value = val;
            pTable->count++;
            break;
        }
    }

    // Expand when 50% full
    if (pTable->count > pTable->size/2)
    {
        pTable->size *= 2;
        pTable->entries = realloc(pTable->entries, pTable->size * sizeof(*pTable->entries));
        /// #todo handle null return;
        // Only the newly added second half needs to be marked invalid.
        memset(pTable->entries + pTable->size/2, ~0, pTable->size/2 * sizeof(*pTable->entries));
    }
}

_Bool contains(hash_table_t *pTable, value_t val)
{
    // Try current size first
    uint32_t sizeToTry = pTable->size;
    do
    {
        key_t key = hash(val, sizeToTry);
        for (key_t i = key; i != (key-1); i = (i+1) % pTable->size)
        {
            if (pTable->entries[i].key == (key_t)~0)
                break;
            if (pTable->entries[i].key == key && pTable->entries[i].value == val)
                return true;
        }
        // Try all previous sizes we had. Only report failure if found for none.
        sizeToTry /= 2;
    } while (sizeToTry >= pTable->initialSize);
    return false;
}
The idea is that the hash function depends on the size of the table. When you change the size of the table, you don't rehash current entries. You add new ones with the new hash function. When reading the entries, you try all the hash functions that have ever been used on this table.
This way, get()/contains() and similar operations take longer the more times you expanded your table, but you don't have the huge spike of rehashing. I can imagine some systems where this would be a requirement.

Implementing a Hash Table using C

I am trying to read a file of names, and hash those names to a spot based on the value of their letters. I have succeeded in getting the value for each name in the file, but I am having trouble creating a hash table. Can I just use a regular array and write a function to put the words at their value's index?
while (fgets(name, 100, ptr_file) != NULL) //while file isn't empty
{
    fscanf(ptr_file, "%s", &name); //get the first name
    printf("%s ", name); //prints the name
    int length = strlen(name); // gets the length of the name
    int i;
    for (i = 0; i < length; i++)
    //for the length of the string add each letter's value up
    {
        value = value + name[i];
        value = value % 50;
    }
    printf("value= %1d\n", value);
}
No, you can't, because you'll have collisions, and you'll thus have to account for multiple values to a single hash. Generally, your hashing is not really a wise implementation -- why do you limit the range of values to 50? Is memory really really sparse, so you can't have a bigger dictionary than 50 pointers?
I recommend using an existing C string hash table implementation, like this one from 2001.
In general, you will have hash collisions and you will need to find a way to handle that. The two main ways of handling that situation are as follows:
If the hash table entry is taken, you use the next entry, and iterate until you find an empty entry. When looking things up, you need to do similarly. This is nice and simple for adding things to the hash table, but makes removing things tricky.
Use hash buckets. Each hash table entry is a linked list of entries with the same hash value. First you find your hash bucket, then you find the entry on the list. When adding a new entry, you merely add it to the list.
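For illustration, a minimal sketch of the second option (hash buckets with separate chaining) for string keys; the bucket count, hash formula, and function names are just choices for the example:
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 256

typedef struct node {
    char *name;
    struct node *next;
} node;

node *buckets[NUM_BUCKETS];     /* each bucket is a linked list of entries */

unsigned bucket_index(const char *s)
{
    unsigned value = 0;
    for (; *s; s++)
        value = value * 31 + (unsigned char)*s;   /* simple string hash */
    return value % NUM_BUCKETS;
}

void add_name(const char *name)
{
    unsigned b = bucket_index(name);
    node *n = malloc(sizeof *n);
    if (n == NULL)
        return;
    n->name = malloc(strlen(name) + 1);
    if (n->name == NULL) { free(n); return; }
    strcpy(n->name, name);
    n->next = buckets[b];       /* push onto the front of the bucket's list */
    buckets[b] = n;
}

node *find_name(const char *name)
{
    for (node *n = buckets[bucket_index(name)]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;
    return NULL;
}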
This is my demo program, which reads strings (words) from stdin and deposits them into a hashtable. Thereafter (at EOF), it iterates over the hashtable, computes the number of words and distinct words, and prints the most frequent word:
http://olegh.cc.st/src/words.c.txt
The hashtable uses a double-hashing algorithm.

Most efficient way of manipulating a very large text database of SHA256 hashes?

I have to frequently search hashes in a large (up to 1G) CSV database of the format
sha256_hash, md5_hash, sha1_hash, field1, field2, field3 etc
in C. This needs to be very fast and memory usage is a non-issue (32G minimum). I found this which is very close to what I had in mind: load the data into RAM, one-time order the database by hash, index by the first 'n' bytes of the hash, and then search through smaller sublists. But the thread above doesn't seem to address a question I have in mind. Since I'm not a cryptography guy, I was wondering about the distribution of hashes and whether or not it could be used to make searching the sublists even faster. Any suggestions about this or my general approach?
Yes, a bloom filter can be used to kick out 'definite negatives' early, by using the distribution of the hash bits.
http://en.wikipedia.org/wiki/Bloom_filter
To create a bloom filter for a given bucket, logical OR all the hashes together to create your filter. Then logical AND the filter with your target hash. If the result is < your target hash (or result XOR target hash != 0), that bucket definitely does not contain that target hash, and you can skip searching it, but if the result == target hash, that bucket MAY contain your target hash, and you need to continue with searching it to be sure. The bloom filter can be cached and updated simply when new hashes are added, but has to be recomputed when hashes are removed, so all that remains for the search is the AND and < operations, which are very cheap and will reduce your O(N) operation to O(1) time in the best case scenario.
Care has to be taken with regards to bucket size so that filters of meaningful value are produced, because a filter of all high bits is of no value to anyone.
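As a minimal sketch of that per-bucket filter, assuming each bucket stores the 64-bit prefixes of its SHA-256 hashes (the struct layout and names are assumptions for the example):
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t filter;            /* OR of the 64-bit prefixes of all hashes in the bucket */
    uint64_t *prefixes;         /* the stored 64-bit hash prefixes                       */
    size_t count;
} bucket;

/* Rebuild the filter: OR every stored prefix together. */
void bucket_update_filter(bucket *b)
{
    b->filter = 0;
    for (size_t i = 0; i < b->count; i++)
        b->filter |= b->prefixes[i];
}

/* Returns 0 if the bucket definitely does not contain target,
   1 if it might (and must actually be searched). */
int bucket_maybe_contains(const bucket *b, uint64_t target)
{
    return (b->filter & target) == target;  /* every set bit of target present? */
}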
This is a very easy problem to solve with a lot of memory. Make the hash the key of a hashtable, and use the first N bytes of the hash as the hashtable's hash value (the bytes are so random that no human on earth can tell them apart from truly random data).
Not sure what's up with your idea with keying the table with a prefix of the key and having sublists. Any stock library-provided hashtable can easily solve your problem.
Or put it into any database and make the hash the primary key.
The distribution of hashes is uniform, which is helpful because you can put the hashes in a hash table.
// something like this...
struct entry {
    bool used;
    unsigned char sha256[32];
    char field1[20];
    char field2[20];
};
If you don't need to delete entries from the hash table, just create a big array of struct entry and insert records from the CSV into the index corresponding to some bits from the SHA-256 hash. Use linear probing to insert entries: if entry i is taken, use i+1, or i+2, until you find a free entry.
struct table {
    int nbits;
    struct entry *entries;
};

unsigned read_int(unsigned char *data)
{
    unsigned v = data[0] | (data[1] << 8) |
                 (data[2] << 16) | ((unsigned)data[3] << 24);
    return v;
}

struct entry *find_entry(struct table *table, unsigned char *sha256)
{
    unsigned index = read_int(sha256);
    unsigned mask = (1u << table->nbits) - 1;
    while (1) {
        struct entry *e = &table->entries[index & mask];
        if (!e->used)
            return NULL;
        if (!memcmp(e->sha256, sha256, 32))
            return e;
        index++;
    }
}
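To go with find_entry(), a matching linear-probing insert might look roughly like this (a sketch; it assumes the table was allocated with more slots than records, so the probe loop always finds a free slot, and that the field sizes match the struct above):
#include <stdbool.h>
#include <string.h>

struct entry *insert_entry(struct table *table, unsigned char *sha256,
                           const char *field1, const char *field2)
{
    unsigned index = read_int(sha256);
    unsigned mask = (1u << table->nbits) - 1;
    while (1) {
        struct entry *e = &table->entries[index & mask];
        if (!e->used) {
            e->used = true;
            memcpy(e->sha256, sha256, 32);
            strncpy(e->field1, field1, sizeof e->field1 - 1);
            strncpy(e->field2, field2, sizeof e->field2 - 1);
            return e;
        }
        if (!memcmp(e->sha256, sha256, 32))
            return e;           /* already present */
        index++;                /* linear probing: try the next slot */
    }
}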

Ideal data structure for mapping integers to integers?

I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).
Following the example on Wikipedia, I originally had this:
size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table));
However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).
Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).
Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?
Update
For those interested, I ended up using a hash table via the uthash library.
0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.
Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).
(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)
If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.
This is the only way I can foresee getting out of mallocing structs.
You can malloc(3) a block of 0xFFFFFF size_t elements on the heap (for simplicity), and address it as you would an array.
As for the stack overflow: basically, the program receives a SIGSEGV, which can be the result of a stack overflow, accessing illegal memory, writing to a read-only segment, etc. They are all abstracted under the same error message, "Segmentation fault".
But why don't you use a higher-level language like Python that supports associative arrays?
At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.
You could create a sparse array of sorts which has "pages" like this (this example uses 256 "pages", so the upper most byte is the page number):
int *pages[256];

/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
    for (int i = 0; i < 256; ++i) {
        pages[i] = NULL;
    }
}

int get_value(int index) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so it will zero it out */
    }
    return pages[index / 0x10000][index % 0x10000];
}

void set_value(int index, int value) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so it will zero it out */
    }
    pages[index / 0x10000][index % 0x10000] = value;
}
This will allocate a page the first time it is touched, whether by a read or a write.
To avoid the overhead of malloc you can use a hashtable where the entries in the table are your structs, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate emptiness of the slot in the table.
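A minimal sketch of that: an open-addressed table whose entries are the (key, value) pairs themselves, with UINT32_MAX as the "empty" sentinel (the fixed power-of-two size and the names are just choices for the example; since the keys here are 24-bit color values, they can never collide with the sentinel):
#include <stdint.h>

#define MAP_SIZE  (1u << 20)            /* power of two, larger than the number of keys stored */
#define EMPTY_KEY UINT32_MAX            /* sentinel: this key value means "unused slot"        */

typedef struct {
    uint32_t key;
    uint32_t value;
} slot;

static slot map[MAP_SIZE];              /* entries stored directly in the table */

void map_init(void)
{
    for (uint32_t i = 0; i < MAP_SIZE; i++)
        map[i].key = EMPTY_KEY;
}

void map_put(uint32_t key, uint32_t value)
{
    uint32_t i = key & (MAP_SIZE - 1);  /* the key itself is "random enough" to use directly */
    while (map[i].key != EMPTY_KEY && map[i].key != key)
        i = (i + 1) & (MAP_SIZE - 1);   /* linear probing */
    map[i].key = key;
    map[i].value = value;
}

int map_get(uint32_t key, uint32_t *value_out)
{
    uint32_t i = key & (MAP_SIZE - 1);
    while (map[i].key != EMPTY_KEY) {
        if (map[i].key == key) {
            *value_out = map[i].value;
            return 1;
        }
        i = (i + 1) & (MAP_SIZE - 1);
    }
    return 0;                           /* key not present */
}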
How many values are there in your output space, i.e. how many different values do you map to in the range 0-0xFFFFFF?
Using randomized universal hashing you can come up with a collision-free hash function with a table no bigger than 2 times the number of values in your output space (for a static table).
