How to handle collisions in hashtables using glib - c

I'm trying to store some data using hashtables and I decided to use glib.
So I could use the g_str_hash to generate the key, but there are equal strings. Basically, the data comes from a csv file and there are multiple lines for the same id, for example. And I wanted to use the ids as keys and still have the lines separately.
So I was trying to implement a similar algorithm from g_str_hash but that when there is already something attached to the key, it would go to the next space available. But I have difficulties because of some issues regarding the types and how to do it.
guint hash(char * key, GHashTable *hasht ) {
    unsigned int hash = 5381;
    int c;
    while ((c = *(key++))) {
        hash = ((hash << 5) + hash) + c;
    }
    // this is where I get lost on how to check if there is already something stored using the hash I generated before
    while (g_hash_table_contains(hasht, hash))
    {
        hash++;
    }
    return hash;
}
So I would really appreciate some help on how to do it! Thank you so much!

If you're just trying to learn, implementing a custom hash table for this would be a good experience. However, if you just want to get some working code you're over-complicating this. Just use a GHashTable of GPtrArrays. Something like (untested, largely from memory and I haven't really used GLib from C in years):
/* create hash table; keys are freed with g_free, values (arrays) with g_ptr_array_unref */
GHashTable* ht = g_hash_table_new_full(g_str_hash, g_str_equal, g_free, (GDestroyNotify) g_ptr_array_unref);
for (/* each record in CSV */) {
    GPtrArray* arr = g_hash_table_lookup(ht, record->id);
    if (arr == NULL) {
        /* We don't have an entry for that ID yet */
        arr = g_ptr_array_new_full(1, g_free);
        g_hash_table_insert(ht, g_strdup(record->id), arr);
    }
    /* Append a copy of this line to the array for its ID */
    g_ptr_array_add(arr, g_strdup(record->value));
}
Unless you're in a very performance-critical loop, this should be fine.

A hash table works by using an array where each element of the array is a 'bucket'. The bucket (array index) for a particular key or element in the hashtable is computed from the hash value for that element. Each bucket contains a linked list of elements which have the same hash value (hash collisions), or NULL if the bucket is empty.
Therefore it is required that your hash function always return the same value for a given key/element. Otherwise, you'll never be able to find/lookup the given key. Note that your hash function is called both when inserting into the hash table, as well when trying to look up a given key. With this hash function, you could insert a key "key1" and then when you later try to lookup "key1", you will increment the hash value because it is already present in the hashtable, and it will not be found. You must not increment the hash value based on if the key is already present in the hash table.
A hash table also requires a way to determine if two keys are equal or not. This is used, for example, to find the particular element you are looking for in a lookup. In short, a lookup involves finding the bucket for the given key using the hash function, and then using the 'key_equal' function to find the specific, given key among all the other keys in that bucket. Something similar happens on an insert, depending on whether the hashtable allows multiple identical keys, and what it does when an equal key already exists in the hash table (e.g., fail or replace the existing key).
I've not used GLib before, but looking at the docs, GLib.HashTable is implementing a hash map of key/value pairs. The keys are hashed, and each key has an associated value. It does not allow duplicate keys. Inserting a duplicate key results in the old one being replaced.
So, with that background, whether or not this data structure is appropriate depends on your requirements and what you need to do with the data later. It sounds like you want to associate all the lines with their id, storing all of the lines individually, and that you'll need to look up some or all lines with a given id later. In that case, with GLib.HashTable, I would recommend using the id as the key, and a collection (perhaps a linked list) of lines as the HashTable value. You should also correct your hash function by removing this part:
while (g_hash_table_contains(hasht, hash))
{
    hash++;
}
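To make the point about a deterministic hash concrete, here is a minimal plain-C sketch (no GLib) of a chained table that keeps every line for an id. The names `Table`, `Entry`, `table_add`, `table_count`, `copy_string` and the bucket count are made up for illustration; the hash is the same loop as in the question, minus the "increment until free" part. Duplicate ids are handled by the chain, not by changing the hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 64  /* arbitrary size chosen for this sketch */

/* One stored line; entries that land in the same bucket are chained via `next`. */
typedef struct Entry {
    char *id;
    char *line;
    struct Entry *next;
} Entry;

typedef struct {
    Entry *buckets[NUM_BUCKETS];
} Table;

/* Depends only on the key, so the same id always yields the same hash. */
static unsigned int str_hash(const char *key) {
    unsigned int hash = 5381;
    int c;
    while ((c = (unsigned char)*key++)) {
        hash = ((hash << 5) + hash) + (unsigned int)c;
    }
    return hash;
}

/* Copy a string into freshly allocated memory (hypothetical helper). */
static char *copy_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL) memcpy(p, s, n);
    return p;
}

/* Add a line under `id`; duplicate ids are simply chained. Returns 0 on success. */
static int table_add(Table *t, const char *id, const char *line) {
    assert(t != NULL && id != NULL && line != NULL);
    Entry *e = malloc(sizeof *e);
    if (e == NULL) return -1;
    e->id = copy_string(id);
    e->line = copy_string(line);
    if (e->id == NULL || e->line == NULL) {
        free(e->id);
        free(e->line);
        free(e);
        return -1;
    }
    unsigned int i = str_hash(id) % NUM_BUCKETS;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    return 0;
}

/* Count the lines stored for `id` by walking its bucket and comparing keys. */
static size_t table_count(const Table *t, const char *id) {
    size_t n = 0;
    for (const Entry *e = t->buckets[str_hash(id) % NUM_BUCKETS]; e != NULL; e = e->next) {
        if (strcmp(e->id, id) == 0) n++;
    }
    return n;
}
```

Start from a zeroed `Table`, call `table_add` once per CSV line, and walk the bucket (as `table_count` does) to visit every line for an id. Freeing the entries is left out to keep the sketch short.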

Related

What does 'x' here mean?

CHAINED-HASH-INSERT(T,x)
Insert x at head of list T[h(x.key)]
CHAINED-HASH-SEARCH(T,k)
Search for an element with key k in list T[h(k)]
CHAINED-HASH-DELETE(T,x)
Delete x from the list T[h(x.key)]
I am studying hash tables right now and I am not able to understand what exactly x means in the attached pseudocode with this question. The three hash table dictionary functions are being implemented here to insert, search and delete an element from the table.
Now I understand what x.key means, but the problem is: what exactly is x in the data that has to be inserted into the table? Could you please help me with an example? I am currently trying to implement it using C.
Also h() is the hash function.
In practice it will be a struct for key/value pairs. x.key is the key that is used for the hashing. The value might be named x.value, or x.data or something like that. There must be a definition for the type of x somewhere, but its name is not important for understanding the algorithm or implementing it. You might use a void pointer to represent the value so that data of any kind could be stored in the hash table.
So x represents the data structure that is stored in the hash table's collision resolution list, which is where the key and value are stored.
Update
I have found this familiar looking reference which explains it all: Introduction to Algorithms
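As a rough C sketch of what x could look like, here is one possible element type together with the three operations from the pseudocode. The struct layout, the names `Element` and `TABLE_SIZE`, and the modulo hash `h` are assumptions made for illustration. This version uses a singly linked list, so deletion has to walk the chain.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8  /* arbitrary number of slots for this sketch */

/* `x` in the pseudocode: a list element carrying a key and its associated data. */
typedef struct Element {
    int key;
    void *value;            /* any payload */
    struct Element *next;   /* next element in the same chain */
} Element;

/* h(k): hypothetical hash function, plain modulo for illustration. */
static size_t h(int k) {
    return (size_t)((unsigned int)k % TABLE_SIZE);
}

/* CHAINED-HASH-INSERT(T, x): insert x at the head of list T[h(x.key)]. */
static void chained_hash_insert(Element *T[], Element *x) {
    assert(x != NULL);
    size_t i = h(x->key);
    x->next = T[i];
    T[i] = x;
}

/* CHAINED-HASH-SEARCH(T, k): return the element with key k, or NULL. */
static Element *chained_hash_search(Element *T[], int k) {
    for (Element *e = T[h(k)]; e != NULL; e = e->next) {
        if (e->key == k) return e;
    }
    return NULL;
}

/* CHAINED-HASH-DELETE(T, x): unlink x from list T[h(x.key)]. */
static void chained_hash_delete(Element *T[], Element *x) {
    assert(x != NULL);
    Element **link = &T[h(x->key)];
    while (*link != NULL && *link != x) {
        link = &(*link)->next;
    }
    if (*link == x) *link = x->next;
}
```

Note that insert and delete take the element x itself, while search takes only a key k and returns the matching element.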

What happens if hash is unique but hash % size is same in hash table?

Recently I've been studying hash tables, and my understanding of the basics is:
create an array, for example
hashtable ht[4];
hash the key
int hash = hash_key(key);
get the index
int index = hash % 4
set to hashtable
ht[index] = insert_or_update(value)
And I know there is the hash collision problem: if key1 and key2 have the same hash, they go to the same ht[index], and separate chaining can solve this.
Keys with the same hash go to the same bucket, and these keys are stored in a linked list.
My question is, what happens if the hashes are different, but the modulus is the same?
For example,
hash(key1): 3
hash(key2): 7
hash(key3): 11
hash(key4): 15
So the index is 3 for all of them: these keys, with different hashes and different keys, go to the same bucket.
I searched Google for some hash table implementations, and it seems they don't deal with this situation. Am I overthinking this? Is anything wrong?
For example, these implementations:
https://gist.github.com/tonious/1377667#file-hash-c-L139
http://www.cs.yale.edu/homes/aspnes/pinewiki/C(2f)HashTables.html?highlight=%28CategoryAlgorithmNotes%29#CA-552d62422da2c22f8793edef9212910aa5fe0701_156
redis:
https://github.com/antirez/redis/blob/unstable/src/dict.c#L488
nginx:
https://github.com/nginx/nginx/blob/master/src/core/ngx_hash.c#L34
They just compare whether the keys are equal.
If two objects' keys hash to the same bucket, it doesn't really matter if it's because they have the same hash, or because they have different hashes but they both map (via modulo) to the same bucket. As you note, a collision that occurs because of either of these situations is commonly dealt with by placing both objects in a bucket-specific list.
When we look for an object in a hashtable, we are looking for an object that shares the same key. The hashing / modulo operation is just used to tell us in which bucket we should look to see if the object is present. Once we've found the proper bucket, we still need to compare the keys of any found objects (i.e., the objects in the bucket-specific list) directly to be sure we've found a match.
So the situation of two objects with different hashes but that map to the same bucket works for the same reason that two objects with the same hashes works: we only use the bucket to find candidate matches, and rely on the key itself to determine a true match.
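The example from the question can be sketched in C as follows. The hash values 3, 7, 11 and 15 are hard-coded from the question rather than computed, and the names `Node`, `bucket_push` and `bucket_find` are made up for illustration. The lookup never compares hashes; it only uses `hash % NUM_BUCKETS` to pick the list and then compares keys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_BUCKETS 4

typedef struct Node {
    const char *key;
    unsigned int hash;   /* full hash, stored only to show that it need not match */
    int value;
    struct Node *next;
} Node;

/* Push a caller-owned node onto the front of its bucket's list. */
static void bucket_push(Node *buckets[], Node *n) {
    assert(n != NULL);
    size_t i = n->hash % NUM_BUCKETS;
    n->next = buckets[i];
    buckets[i] = n;
}

/* Walk the bucket chosen by hash % NUM_BUCKETS and compare the keys directly. */
static Node *bucket_find(Node *buckets[], unsigned int hash, const char *key) {
    for (Node *n = buckets[hash % NUM_BUCKETS]; n != NULL; n = n->next) {
        if (strcmp(n->key, key) == 0) return n;
    }
    return NULL;
}
```

With four nodes whose hashes are 3, 7, 11 and 15, everything ends up in bucket 3, and each key is still found because the final decision is made by `strcmp`.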

Understanding hash tables

I understand that some hash tables use "buckets", which is a linked list of "entries".
HashTable
-size //total possible buckets to use
-count // total buckets in use
-buckets //linked list of entries
Entry
-key //key identifier
-value // the object you are storing for reference
-next //the next entry
In order to get the bucket by index, you have to call:
myBucket = someHashTable[hashIntValue]
Then, you could iterate the linked list of entries until you find the one you are looking for or null.
Does the hash function always return a NUMBER % HashTable.size? That way, you stay within the limit? Is that how the hash function should work?
Mathematically speaking, a hash function is usually defined as a mapping from the universe of elements you want to store in the hash table to the range {0, 1, 2, .., numBuckets - 1}. This means that in theory, there's no requirement whatsoever that you use the mod operator to map some integer hash code into the range of valid bucket indices.
However, in practice, almost universally programmers will use a generic hash code that produces a uniformly-distributed integer value and then mod it down so that it fits in the range of the buckets. This allows hash codes to be developed independently of the number of buckets used in the hash table.
EDIT: Your description of a hash table is called a chained hash table and uses a technique called closed addressing. There are many other implementations of hash tables besides the one you've described. If you're curious - and I hope you are! :-) - you might want to check out the Wikipedia page on the subject.
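The separation between a generic hash code and the bucket count could look roughly like this in C. The function names `hash_code` and `bucket_index` are made up for this sketch, and the multiply-by-33 string hash is just one common choice, not a recommendation.

```c
#include <assert.h>
#include <stddef.h>

/* Generic hash code: knows nothing about the table (multiply-by-33 style, for illustration). */
static size_t hash_code(const char *s) {
    size_t h = 5381;
    while (*s != '\0') {
        h = h * 33 + (unsigned char)*s++;
    }
    return h;
}

/* Reduce any hash code to a valid bucket index in the range [0, num_buckets). */
static size_t bucket_index(size_t code, size_t num_buckets) {
    assert(num_buckets > 0);
    return code % num_buckets;
}
```

Because only `bucket_index` depends on the table size, the table can be resized without touching `hash_code`.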
What is a hash table?
A hash table, also known as a hash map, is a data structure used to implement an associative array. It is a structure that can map keys to values.
How does it work?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
There is no predefined rule for how a hash function should behave. You can have all of your values map to index 0 - a perfectly valid hash function (performs poorly, but works).
Of course, if your hash function returns a value outside of the range of indices in your associated array, it won't work correctly. That's not to say, however, that you need to use the formula (number % TABLE_SIZE).
No, the table is typically an array of entries. You don't iterate it until you found the same hash, you use the hash result (or usually hash modulo numBuckets) to directly index into the array of entries. That gives you the O(1) behaviour (iterating would be O(n)).
When you try to store two different objects with the same hash result (called a 'hash collision'), you have to find some way to make space. Different implementations vary in how they handle collisions. You can create a linked list of all the objects with same hash, or use some rehashing to store in a different entry of the table.

Looking for an array (vs linked list) hashtable implementation in C

I'm looking for a hashtable implementation in C that stores its objects in (two-dimensional) arrays rather than linked lists.
I.e., if a collision happens, the object that is causing the collision will be stored in the next free row index rather than pushed to the head of a linked list.
Plus, the objects themselves must be copied to the hashtable, rather than referenced by pointers (the objects do not live for the whole lifetime of the program, but the table does).
I know that such an implementation might have serious efficiency drawbacks and is not the "standard way of hashing", but as I work on a very special system architecture I need those characteristics.
thanks
A super simple implementation:
char hashtable[MAX_KEY][MAX_MEMORY];
int counts[MAX_KEY] = {0};
/* Inserting something into the table */
SomeStruct* some_struct;
int hashcode = compute_code(some_struct);
int size = sizeof(SomeStruct);
memcpy(hashtable[hashcode] + counts[hashcode] * size, some_struct, size);
++counts[hashcode];
Don't forget to check against MAX_MEMORY.
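A slightly fuller, self-contained variant of the same idea, with the bounds check included, might look like the sketch below. The `Record` type, the table dimensions and the function names are assumptions made for illustration; objects are copied into the table by value, so the original may go out of scope afterwards.

```c
#include <assert.h>
#include <string.h>

#define NUM_BUCKETS 16   /* rows: one per hash value */
#define ROW_SLOTS   4    /* columns: maximum colliding objects per row */

/* Hypothetical record type; stored by value, not by pointer. */
typedef struct {
    int id;
    char name[16];
} Record;

typedef struct {
    Record rows[NUM_BUCKETS][ROW_SLOTS];
    int counts[NUM_BUCKETS];   /* number of used slots in each row */
} ArrayTable;

static unsigned int record_hash(int id) {
    return (unsigned int)id % NUM_BUCKETS;
}

/* Copy *r into the next free slot of its row. Returns 0, or -1 if the row is full. */
static int array_table_insert(ArrayTable *t, const Record *r) {
    assert(t != NULL && r != NULL);
    unsigned int row = record_hash(r->id);
    if (t->counts[row] >= ROW_SLOTS) return -1;
    t->rows[row][t->counts[row]] = *r;   /* struct copy */
    t->counts[row]++;
    return 0;
}

/* Return a pointer to the stored copy with this id, or NULL if absent. */
static const Record *array_table_find(const ArrayTable *t, int id) {
    unsigned int row = record_hash(id);
    for (int i = 0; i < t->counts[row]; i++) {
        if (t->rows[row][i].id == id) return &t->rows[row][i];
    }
    return NULL;
}
```

The table must be zeroed before use (for example with `memset`, or by giving it static storage). No dynamic allocation is involved, at the cost of a hard limit of `ROW_SLOTS` collisions per row.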
My guess is your system does not allow for dynamic memory allocation. Therefore you will need to define up front array bounds that are reasonable for your data (number of total objects and maximum expected collisions) and additionally a custom hash function for your objects so it might be best to implement your own hash table.
It's not in C but in C++, but take a look at Google Sparse Hash - might give you some ideas. The key requirement is that the object being stored has a way to be null.

How to write a hash function in C?

Hash Tables are said to be the fastest/best way of Storing/Retrieving data.
My understanding of a hash table, hashing is as follows (Please correct me if I am wrong or Please add If there is anything more):
A Hash Table is nothing but an array (single or multi-dimensional) to store values.
Hashing is the process to find the index/location in the array to insert/retrieve the data. You take a data item(s) and pass it as a key(s) to a hash function and you would get the index/location where to insert/retrieve the data.
I have a question:
Is the hash function used to store/retrieve the data DIFFERENT from a
cryptographic hash function used in security applications for authentication
like MD5, HMAC, SHA-1 etc...?
In what way(s) are they different?
How to write a hash function in C?
Is there some standard or guidelines to it?
How do we ensure that the output of a hash function i.e, index is not out of range?
It would be great if you could mention some good links to understand these better.
A cryptographic hash emphasizes making it difficult for anybody to intentionally create a collision. For a hash table, the emphasis is normally on producing a reasonable spread of results quickly. As such, the two are usually quite different (in particular, a cryptographic hash is normally a lot slower).
For a typical hash function, the result is limited only by the type -- e.g. if it returns a size_t, it's perfectly fine for it to return any possible size_t. It's up to you to reduce that output range to the size of your table (e.g. using the remainder of dividing by the size of your table, which should often be a prime number).
As an example, a fairly typical normal hash function might look something like:
// warning: untested code.
size_t hash(char const *input) {
    const int ret_size = 32;
    size_t ret = 0x555555;
    const int per_char = 7;
    while (*input) {
        ret ^= *input++;
        ret = (ret << per_char) | (ret >> (ret_size - per_char));
    }
    return ret;
}
The basic idea here is to have every bit of the input string affect the result, and to (as quickly as possible) have every bit of the result affected by at least part of the input. Note that I'm not particularly recommending this as a great hash function -- only trying to illustrate some of the basics of what you're trying to accomplish.
Bob Jenkins wrote an in-depth description of his good, if slightly outdated, hash function. The article has links to newer, better hash functions, but the writeup addresses the concerns of building a good one.
Also, most hash table implementations actually use an array of linked lists to resolve collisions. If you want to just use an array, then the insert and lookup code needs to check for collisions and probe for a new index; the hash function itself must still return the same value for the same key.
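For the array-only case, one common approach is linear probing. The sketch below is a minimal illustration under assumed names (`Slot`, `probe_put`, `probe_get`, `CAPACITY`) with integer keys and a plain modulo hash. On a collision the insert code steps to the next slot, and the lookup repeats the same steps. Deletion is deliberately left out, since it needs extra bookkeeping to keep probe sequences intact.

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY 8   /* arbitrary table size for this sketch */

typedef struct {
    int used;   /* 0 = empty slot */
    int key;
    int value;
} Slot;

/* Home slot for a key; always the same for the same key. */
static size_t slot_hash(int key) {
    return (size_t)((unsigned int)key % CAPACITY);
}

/* Insert or update; probes forward from the home slot. Returns 0, or -1 if the table is full. */
static int probe_put(Slot table[], int key, int value) {
    assert(table != NULL);
    size_t start = slot_hash(key);
    for (size_t step = 0; step < CAPACITY; step++) {
        Slot *s = &table[(start + step) % CAPACITY];
        if (!s->used || s->key == key) {
            s->used = 1;
            s->key = key;
            s->value = value;
            return 0;
        }
    }
    return -1;
}

/* Look up by repeating the same probe sequence; stops at the first empty slot. */
static const Slot *probe_get(const Slot table[], int key) {
    size_t start = slot_hash(key);
    for (size_t step = 0; step < CAPACITY; step++) {
        const Slot *s = &table[(start + step) % CAPACITY];
        if (!s->used) return NULL;
        if (s->key == key) return s;
    }
    return NULL;
}
```

Keys 1 and 9 both have home slot 1 here, so the second one inserted lands in slot 2 and is found again by probing from slot 1.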
The cryptographic hash functions you mention could be used as hash functions for a hash table,
but they are much slower than hash functions designed for a hash table. Speed makes brute force attacks easier.
The design goals are different.
With cryptographic hash functions you want, for example, that the hash and the hash function cannot be used to determine the original data or any other data that would produce the same hash.
Hash functions used with hash tables & other data structures do not need such security properties. It's often enough if the hash function is fast and it will distribute the input set evenly into the set of possible hashes (to avoid unnecessary clustering / collisions).

Resources