What does 'x' mean here?

CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(x.key)]

CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
    delete x from the list T[h(x.key)]
I am studying hash tables right now and I am not able to understand what exactly x means in the pseudocode attached to this question. The three hash-table dictionary operations are implemented here to insert, search for, and delete an element from the table.
Now I understand what x.key means, but what exactly is x in the data that has to be inserted into the table? Could you please help me with an example? I am currently trying to implement it in C.
Also, h() is the hash function.

In practice it will be a struct holding a key/value pair. x.key is the key that is used for the hashing. The value might be named x.value, x.data, or something like that. There must be a definition for the type of x somewhere, but its name is not important for understanding the algorithm or implementing it. You might use a void pointer to represent the value so that data of any kind can be stored in the hash table.
So x represents the node that is stored in the hash table's collision-resolution list, which is where the key and value live.
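A minimal sketch in C of what x could look like (the field names here are my own assumptions, not fixed by the pseudocode):

```c
#include <stdlib.h>

/* A hypothetical element type: in the pseudocode, x is a pointer to one of
 * these nodes. x.key corresponds to node->key, and the payload is a void
 * pointer so that data of any kind can be stored. */
struct node {
    int key;            /* the key that h() hashes, i.e. x.key          */
    void *value;        /* the associated data, e.g. x.value or x.data  */
    struct node *next;  /* next element in the chain (collision list)   */
};

/* Allocate a node ready to be inserted at the head of T[h(key)]. */
struct node *make_node(int key, void *value) {
    struct node *x = malloc(sizeof *x);
    if (x) {
        x->key = key;
        x->value = value;
        x->next = NULL;
    }
    return x;
}
```

CHAINED-HASH-INSERT(T, x) would then link the returned node into the list at bucket h(x->key).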
Update
I have found this familiar-looking reference, which explains it all: Introduction to Algorithms

Related

How to handle collisions in hash tables using GLib

I'm trying to store some data using hash tables and I decided to use GLib.
I could use g_str_hash to generate the key, but there are equal strings. Basically, the data comes from a CSV file and there are multiple lines for the same id, for example. I wanted to use the ids as keys and still keep the lines separate.
So I was trying to implement an algorithm similar to g_str_hash, but one where, if something is already attached to the key, it moves on to the next available slot. I'm having difficulties, though, because of some issues regarding the types and how to do it.
guint hash(char *key, GHashTable *hasht) {
    unsigned int hash = 5381;
    int c;
    while ((c = *(key++))) {
        hash = ((hash << 5) + hash) + c;
    }
    /* this is where I get lost on how to check if there is already
       something stored using the hash I generated above */
    while (g_hash_table_contains(hasht, hash)) {
        hash++;
    }
    return hash;
}
So I would really appreciate some help on how to do it! Thank you so much!
If you're just trying to learn, implementing a custom hash table for this would be a good experience. However, if you just want working code, you're over-complicating this. Just use a GHashTable of GPtrArrays. Something like (untested, largely from memory, and I haven't really used GLib from C in years):
/* create hash table; the value destructor needs a GDestroyNotify cast */
GHashTable *ht = g_hash_table_new_full(g_str_hash, g_str_equal, g_free,
                                       (GDestroyNotify)g_ptr_array_unref);
for (/* each record in CSV */) {
    GPtrArray *arr = g_hash_table_lookup(ht, record->id);
    if (arr == NULL) {
        /* We don't have an entry for that ID yet */
        arr = g_ptr_array_new_full(1, g_free);
        g_hash_table_insert(ht, g_strdup(record->id), arr);
    }
    g_ptr_array_add(arr, g_strdup(record->value));
}
Unless you're in a very performance-critical loop, this should be fine.
A hash table works by using an array where each element of the array is a 'bucket'. The bucket (array index) for a particular key or element in the hashtable is computed from the hash value for that element. Each bucket contains a linked list of elements which have the same hash value (hash collisions), or NULL if the bucket is empty.
Therefore it is required that your hash function always return the same value for a given key/element. Otherwise, you'll never be able to find/look up the given key. Note that your hash function is called both when inserting into the hash table and when trying to look up a given key. With this hash function, you could insert a key "key1", and then when you later try to look up "key1", you will increment the hash value because the key is already present in the hash table, and it will not be found. You must not increment the hash value based on whether the key is already present in the hash table.
A hash table also requires a way to determine if two keys are equal or not. This is used, for example, to find the particular element you are looking for in a lookup. In short, a lookup involves finding the bucket for the given key using the hash function, and then using the 'key_equal' function to find the specific, given key among all the other keys in that bucket. Something similar happens on an insert, depending if the hashtable allows multiple identical keys or not, and what it does when an equal key already exists in the hash table (e.g., fail or replace the existing key.)
I've not used GLib before, but looking at the docs, GHashTable implements a hash map of key/value pairs. The keys are hashed, and each key has an associated value. It does not allow duplicate keys; inserting a duplicate key results in the old one being replaced.
So, with that background, whether or not this data structure is appropriate depends on your requirements and what you need to do with the data later. It sounds like you want to associate all the lines with their id, storing all of the lines individually, and that you'll need to look up some or all lines with a given id later. In that case, with GHashTable, I would recommend using the id as the key and a collection (perhaps a linked list) of lines as the value. And correct your hash function by removing this part:
while (g_hash_table_contains(hasht, hash))
{
    hash++;
}
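For reference, a deterministic version of the asker's djb2-style function might look like the following sketch (plain C, without the GLib types, so it stands alone):

```c
/* Deterministic djb2-style hash: it depends only on the key, never on the
 * current contents of the table, so inserts and later lookups of the same
 * key always agree on the bucket. */
unsigned int str_hash(const char *key) {
    unsigned int hash = 5381;
    int c;
    while ((c = (unsigned char)*key++))
        hash = ((hash << 5) + hash) + c;  /* hash * 33 + c */
    return hash;
}
```

Because the table's contents never enter the computation, hashing "key1" on insert and on lookup yields the same value, which is exactly the property the broken version violates.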

Why does having an index actually speed up look-up time?

I've always wondered about why this is the case.
For instance, say I want to find the number 5 located in an array of numbers. I have to compare my desired number against every single value to find what I'm looking for.
This is clearly O(N).
But say, for instance, I have an index that I know contains my desired item. I can just jump right to it, right? And this is also the case with hashed maps, because as I provide a key to look up, the same hash function that determined its index position is run on the key, so this also lets me jump right to its correct index.
But my question is why is that any different than the O(N) lookup time for finding a value in an array through direct comparison?
As far as a naive computer is concerned, shouldn't an index be the same as looking for a value? Shouldn't the raw operation still be, as I traverse the structure, I must compare the current index value to the one I know I'm looking for?
It makes a great deal of sense why something like binary search can achieve O(logN), but I still can't intuitively grasp why certain things can be O(1).
What am I missing in my thinking?
Arrays are usually stored as a large block of memory.
If you're looking up by index, this allows you to calculate, in O(1), the offset that the index has within this block of memory.
Say the array starts at memory address 124 and each element is 10 bytes large; then you know the element at index 5 is at address 124 + 10*5 = 174.
Binary search actually (usually) does something similar, since by-index lookup is just O(1) for an array: you start off in the middle by doing a by-index lookup to get that element, then look at the element at either the 1/4th or 3/4th position, which again needs a by-index lookup.
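The address calculation above can be written out as a one-line C sketch; the numbers match the example (base 124, 10-byte elements):

```c
#include <stddef.h>

/* With elem_size-byte elements starting at address base, the element at
 * position index lives at base + elem_size * index: a single multiply and
 * add, independent of how many elements the array holds. */
size_t element_address(size_t base, size_t elem_size, size_t index) {
    return base + elem_size * index;
}
```

This is why indexing is O(1): the cost of the arithmetic does not grow with the array's length.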
A HashMap has an array underneath it. When a key/value pair is added to the map, the key's hashCode() is evaluated and normalized so that the pair can be placed at its index in the array. When two keys' codes normalize to the same index of the map, the entries are appended to a linked list.
When you perform a look-up, the key you are looking up has its hashCode() evaluated and normalized to produce the index to search. The linked list at that index is then traversed until the key is found, and the associated value is returned.
This look-up time is, in the best case, the same as looking up array[i], which is O(1).
The reason it is a speed-up is that you don't actually have to traverse your structure to look something up; you just jump right to the place where you expect it to be.

How to handle data structures with indices larger than 32 bits?

I have a large, 80-bit index and its corresponding data to be stored in a data structure that I need to search. Can we use the 80-bit index in a hash table? Or is there a better alternative data structure that gives constant-time lookup (search)?
EDIT:
I think my question was not clear. Here is the setup: I have millions of files, for each of which I will produce a cryptographic hash trapdoor of size 80 bits (to represent the file securely), and each 80-bit trapdoor is to be stored with its data in a data structure like a hash table. Now, since the domain of the 80-bit trapdoors is larger than the range of the hash table, there will certainly be collisions. But I need unique <80-bit trapdoor, data> pairs to be stored in the data structure. How can I achieve this using a hash table? Or is there another suitable data structure?
EDIT 2 :
Let's say that I created a hash table and a collision occurred when adding the keys (say x and y, in that order) because the hash function generated the same index i for both keys. By using a collision-resolution technique (e.g., double hashing), y is inserted at a different location j, which is not i. I understand up to this point. Now, if I search for the key y, does the hash table return location i or j? If not i, how does it find j (the exact desired record)? Does it store a probe counter for the number of collisions?
You should probably review how hash tables work.
The object you want to use as an index is passed through a hash function, and the resulting value is used to find the memory position where you should place, or look for, the data associated with that index value.
If you need constant-time lookups, go for a hash table. Just be sure to use an appropriate hash function.
You can use whatever you want as the index in a hash table as long as you provide a hash function for it. I don't think there is a better alternative if you want constant-time access.
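As a concrete sketch (my own construction, not from the question): an 80-bit trapdoor is just 10 bytes, so it can be hashed with any byte-oriented hash (FNV-1a here) to pick a bucket, while the full 10-byte key is stored in each entry and compared with memcmp. That way, trapdoors whose bucket indices collide are still kept as distinct <trapdoor, data> pairs:

```c
#include <stdint.h>
#include <string.h>

#define KEY_BYTES 10  /* an 80-bit trapdoor is 10 bytes */

/* FNV-1a over the 10 key bytes; reduce the result modulo the bucket count
 * to index the table. Collisions in the bucket index are fine because the
 * full 80-bit key is kept in the entry for exact comparison. */
uint64_t hash80(const unsigned char key[KEY_BYTES]) {
    uint64_t h = 1469598103934665603ULL;   /* FNV offset basis */
    for (int i = 0; i < KEY_BYTES; i++) {
        h ^= key[i];
        h *= 1099511628211ULL;             /* FNV prime */
    }
    return h;
}

/* Exact 80-bit key comparison used inside a bucket's chain. */
int keys_equal(const unsigned char a[KEY_BYTES],
               const unsigned char b[KEY_BYTES]) {
    return memcmp(a, b, KEY_BYTES) == 0;
}
```

On lookup, you hash the queried trapdoor to find its bucket, then use keys_equal to pick out the one matching entry among any collisions, which also answers EDIT 2: the table re-runs the same probe/chain procedure for y that it ran at insert time, rather than remembering i or j.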

Good Average of Speed/Memory Efficiency Method to Create a Set in C?

Let's say that I am streaming non-empty strings (char[]/char*s) into my program. I would like to create a set of them. That is, for any element a in set S, a is unique in S.
I have thought to approach this in a few ways, but have run into issues.
If I knew the number of items n I would be reading, I could just create a hash table of that size, with all elements initialized to null, and on a collision simply not insert the item. When the insertions are done, I would iterate through the hash table's array, counting non-null values to get the size, then create an array of that size and copy all the values into it.
Alternatively, I could just use a single array and resize it before each element is added, using a search algorithm to check whether the element already exists before resizing/adding.
I realize the second method would work, but because the elements may not be sorted, it could take a very long time for large inputs, due to the choice of search algorithm and the resizing.
Any input would be appreciated. Please feel free to ask questions in the comment box below if you need further information. Libraries would be very helpful! (Google searching "Sets in C" and similar things doesn't help very much.)
A hash table can work even if you don't know the number of elements you are going to insert: you would simply define your hash table to use "buckets" (i.e., each position is actually a linked list of elements that hash to the same value), and you would search through the appropriate "bucket" to make sure the element has not already been inserted into the hash table. The key to avoiding large "buckets" to search through is a good hash algorithm.
You can also, if you can define a weak ordering of your objects, use a binary search tree. Then if !(A < B) and !(B < A), it can be assumed that A == B, and you would therefore not insert any additional copies of that object into the tree, which again defines a set.
While I know you're using C, consider the fact that in the C++ STL, std::set uses a RB-tree (red-black tree which is a balanced binary search tree), and std::unordered_set uses a hash-table.
Using an array is a bad idea: resizing operations will take a long time, whereas insertions into a tree can be done in O(log N) time, and into a hash table in amortized O(1).
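A minimal sketch of the bucketed approach described above (fixed bucket count, chaining, duplicate check before insert; all names are mine and error handling is pared down):

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024

struct set_node {
    char *str;
    struct set_node *next;
};

struct set {
    struct set_node *buckets[NBUCKETS];  /* chains of colliding strings */
};

/* djb2 string hash; any decent string hash would do here. */
static unsigned long djb2(const char *s) {
    unsigned long h = 5381;
    int c;
    while ((c = (unsigned char)*s++))
        h = h * 33 + c;
    return h;
}

static char *dup_str(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p) memcpy(p, s, n);
    return p;
}

/* Insert s if not already present; returns 1 if inserted, 0 if duplicate.
 * The duplicate check walks only the one bucket s hashes to. */
int set_add(struct set *set, const char *s) {
    unsigned long b = djb2(s) % NBUCKETS;
    for (struct set_node *n = set->buckets[b]; n; n = n->next)
        if (strcmp(n->str, s) == 0)
            return 0;                    /* already in the set */
    struct set_node *n = malloc(sizeof *n);
    n->str = dup_str(s);
    n->next = set->buckets[b];
    set->buckets[b] = n;
    return 1;
}
```

A production version would grow the bucket array as the load factor rises, but even this fixed-size sketch keeps insertion amortized O(1) for inputs well under NBUCKETS.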

Understanding hash tables

I understand that some hash tables use "buckets", which is a linked list of "entries".
HashTable
-size //total possible buckets to use
-count // total buckets in use
-buckets //linked list of entries
Entry
-key //key identifier
-value // the object you are storing for reference
-next //the next entry
In order to get the bucket by index, you have to call:
myBucket = someHashTable[hashIntValue]
Then, you could iterate the linked list of entries until you find the one you are looking for or null.
Does the hash function always return a NUMBER % HashTable.size? That way, you stay within the limit? Is that how the hash function should work?
Mathematically speaking, a hash function is usually defined as a mapping from the universe of elements you want to store in the hash table to the range {0, 1, 2, ..., numBuckets - 1}. This means that in theory, there's no requirement whatsoever that you use the mod operator to map some integer hash code into the range of valid bucket indices.
However, in practice, almost universally programmers will use a generic hash code that produces a uniformly-distributed integer value and then mod it down so that it fits in the range of the buckets. This allows hash codes to be developed independently of the number of buckets used in the hash table.
EDIT: Your description of a hash table is called a chained hash table and uses a technique called closed addressing. There are many other implementations of hash tables besides the one you've described. If you're curious - and I hope you are! :-) - you might want to check out the Wikipedia page on the subject.
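The hash-code/bucket-count separation described above amounts to a one-line reduction step (a sketch; names are mine):

```c
#include <stddef.h>

/* The hash code is computed independently of any particular table; only
 * at the end is it reduced mod num_buckets to get a valid bucket index.
 * This lets the same hash function serve tables of any size. */
size_t bucket_index(unsigned long hash_code, size_t num_buckets) {
    return (size_t)(hash_code % num_buckets);
}
```

When the table is resized, only num_buckets changes; the stored hash codes (if cached) stay valid and just get re-reduced.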
What is a hash table?
Also known as a hash map, it is a data structure used to implement an associative array: a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
Hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
There is no predefined rule for how a hash function should behave. You could have all of your values map to index 0, and that would be a perfectly valid hash function (it performs poorly, but it works).
Of course, if your hash function returns a value outside the range of indices in the associated array, it won't work correctly. That's not to say, however, that you need to use the formula (number % TABLE_SIZE).
No, the table is typically an array of entries. You don't iterate until you find the same hash; you use the hash result (or usually hash modulo numBuckets) to index directly into the array of entries. That gives you the O(1) behaviour (iterating would be O(n)).
When you try to store two different objects with the same hash result (called a 'hash collision'), you have to find some way to make space. Different implementations vary in how they handle collisions. You can create a linked list of all the objects with same hash, or use some rehashing to store in a different entry of the table.
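Putting the two pieces together, a lookup in a chained table is a direct index followed by a short chain walk; here is a sketch (my own names, and toy_hash is just djb2 for illustration):

```c
#include <stddef.h>
#include <string.h>

struct entry {
    const char *key;
    int value;
    struct entry *next;   /* chain of entries sharing this bucket */
};

/* A simple djb2 string hash, standing in for whatever hash the table uses. */
unsigned long toy_hash(const char *s) {
    unsigned long h = 5381;
    int c;
    while ((c = (unsigned char)*s++))
        h = h * 33 + c;
    return h;
}

/* Index directly into the bucket array with hash % num_buckets (O(1)),
 * then walk only that bucket's chain to resolve any collisions. */
struct entry *chained_lookup(struct entry **buckets, size_t num_buckets,
                             unsigned long (*hash)(const char *),
                             const char *key) {
    struct entry *e = buckets[hash(key) % num_buckets];
    while (e && strcmp(e->key, key) != 0)
        e = e->next;
    return e;  /* NULL if not found */
}
```

With a decent hash function the chains stay short, so the walk is a small constant on average and the direct index dominates.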
