I understand that some hash tables use "buckets", where each bucket is a linked list of "entries".
HashTable
-size    // total possible buckets to use
-count   // total buckets in use
-buckets // linked lists of entries

Entry
-key   // key identifier
-value // the object you are storing for reference
-next  // the next entry
In order to get the bucket by index, you have to call:
myBucket = someHashTable[hashIntValue]
Then, you iterate the linked list of entries until you find the one you are looking for, or reach null.
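In C, that layout might look roughly like this (a sketch with illustrative names, assuming string keys):

#include <stddef.h>
#include <string.h>

typedef struct Entry {
    char *key;            /* key identifier */
    void *value;          /* the object you are storing for reference */
    struct Entry *next;   /* the next entry in this bucket's chain */
} Entry;

typedef struct HashTable {
    size_t size;     /* total possible buckets to use */
    size_t count;    /* total buckets in use */
    Entry **buckets; /* one chain of entries per bucket */
} HashTable;

/* Get the bucket by index, then walk its chain until a key matches or we hit NULL. */
void *hash_table_get(HashTable *ht, const char *key, size_t hashIntValue) {
    Entry *e = ht->buckets[hashIntValue % ht->size]; /* myBucket = someHashTable[hashIntValue] */
    while (e != NULL) {
        if (strcmp(e->key, key) == 0)
            return e->value;
        e = e->next;
    }
    return NULL; /* not found */
}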
Does the hash function always return NUMBER % HashTable.size, so that you stay within the limit? Is that how the hash function should work?
Mathematically speaking, a hash function is usually defined as a mapping from the universe of elements you want to store in the hash table to the range {0, 1, 2, ..., numBuckets - 1}. This means that in theory, there's no requirement whatsoever that you use the mod operator to map some integer hash code into the range of valid bucket indices.
However, in practice, programmers almost universally use a generic hash code that produces a uniformly distributed integer value and then mod it down so that it fits in the range of the buckets. This allows hash codes to be developed independently of the number of buckets used in the hash table.
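A minimal sketch in C of that two-step convention (the string hash here is djb2, shown only as an example of a generic hash code; any uniformly distributed hash works):

/* Step 1: a generic hash code, independent of any table (djb2 string hash). */
unsigned long hash_code(const char *key) {
    unsigned long h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h;
}

/* Step 2: the table-specific step, folding the code into {0, ..., numBuckets - 1}. */
unsigned long bucket_index(const char *key, unsigned long numBuckets) {
    return hash_code(key) % numBuckets;
}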
EDIT: Your description of a hash table is called a chained hash table and uses a technique called closed addressing. There are many other implementations of hash tables besides the one you've described. If you're curious - and I hope you are! :-) - you might want to check out the Wikipedia page on the subject.
What is a hash table?
It is also known as a hash map, and it is a data structure used to implement an associative array: a structure that can map keys to values.
How does it work?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
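A small self-contained C sketch of that idea, using chaining (all names are illustrative):

#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 16 /* illustrative table size */

typedef struct Node {
    char *key;
    int value;
    struct Node *next;
} Node;

Node *buckets[NUM_BUCKETS]; /* the array of buckets (slots) */

/* The hash function computes an index into the bucket array. */
static unsigned index_for(const char *key) {
    unsigned h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return h % NUM_BUCKETS;
}

/* Insert: compute the index, then prepend to that bucket's chain. */
void put(const char *key, int value) {
    unsigned i = index_for(key);
    Node *n = malloc(sizeof *n);
    n->key = strdup(key);
    n->value = value;
    n->next = buckets[i];
    buckets[i] = n;
}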
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
Hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
There is no predefined rule for how a hash function should behave. You can have all of your values map to index 0 - a perfectly valid hash function (performs poorly, but works).
Of course, if your hash function returns a value outside the range of indices in your underlying array, it won't work correctly. That's not to say, however, that you need to use the formula (number % TABLE_SIZE)
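For instance, this is a legal hash function, just a terrible one:

/* A valid but poorly performing hash function: every key maps to bucket 0,
   so all entries collide and lookups degrade to a linear scan. */
unsigned awful_hash(const char *key) {
    (void)key; /* the key is ignored entirely */
    return 0;
}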
No, the table is typically an array of entries. You don't iterate it until you find the same hash; you use the hash result (or usually the hash modulo numBuckets) to directly index into the array of entries. That gives you the O(1) behaviour (iterating would be O(n)).
When you try to store two different objects with the same hash result (called a 'hash collision'), you have to find some way to make space. Different implementations vary in how they handle collisions. You can create a linked list of all the objects with the same hash, or use some rehashing to store them in a different entry of the table.
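As an illustration of the second strategy, here is a sketch of open addressing with linear probing (a simple form of rehashing into a different entry; names and sizes are illustrative):

#include <stddef.h>

#define TABLE_SIZE 8 /* illustrative */

typedef struct {
    int key;
    int value;
    int in_use;
} Slot;

Slot table[TABLE_SIZE];

/* On a collision, step forward until we find the key or an empty slot. */
int insert(int key, int value) {
    size_t i = (unsigned)key % TABLE_SIZE;
    for (size_t probes = 0; probes < TABLE_SIZE; probes++) {
        if (!table[i].in_use || table[i].key == key) {
            table[i].key = key;
            table[i].value = value;
            table[i].in_use = 1;
            return 0;
        }
        i = (i + 1) % TABLE_SIZE; /* probe the next slot */
    }
    return -1; /* table is full */
}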
Related
Recently I've been studying hash tables, and I understand the basics:
create an array, for example
hashtable ht[4];
hash the key
int hash = hash_key(key);
get the index
int index = hash % 4;
set into the hashtable
ht[index] = insert_or_update(value);
And I know there is the hash collision problem: if key1 and key2 have the same hash, they go to the same ht[index], and separate chaining can solve this.
Keys with the same hash go to the same bucket, and these keys are stored in a linked list.
My question is: what happens if the hashes are different, but the modulus is the same?
For example,
hash(key1): 3
hash(key2): 7
hash(key3): 11
hash(key4): 15
With a table size of 4, the index is 3 in every case, so these keys, which have different hashes and are different keys, all go to the same bucket.
I searched Google for some hash table implementations, and it seems they don't deal with this situation. Am I overthinking it? Is anything wrong?
For example, these implementations:
https://gist.github.com/tonious/1377667#file-hash-c-L139
http://www.cs.yale.edu/homes/aspnes/pinewiki/C(2f)HashTables.html?highlight=%28CategoryAlgorithmNotes%29#CA-552d62422da2c22f8793edef9212910aa5fe0701_156
redis:
https://github.com/antirez/redis/blob/unstable/src/dict.c#L488
nginx:
https://github.com/nginx/nginx/blob/master/src/core/ngx_hash.c#L34
They just compare whether the keys are equal.
If two objects' keys hash to the same bucket, it doesn't really matter if it's because they have the same hash, or because they have different hashes but they both map (via modulo) to the same bucket. As you note, a collision that occurs because of either of these situations is commonly dealt with by placing both objects in a bucket-specific list.
When we look for an object in a hashtable, we are looking for an object that shares the same key. The hashing / modulo operation is just used to tell us in which bucket we should look to see if the object is present. Once we've found the proper bucket, we still need to compare the keys of any found objects (i.e., the objects in the bucket-specific list) directly to be sure we've found a match.
So the situation of two objects with different hashes but that map to the same bucket works for the same reason that two objects with the same hashes works: we only use the bucket to find candidate matches, and rely on the key itself to determine a true match.
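A sketch of that lookup logic in C (illustrative names, entries chained per bucket):

typedef struct Entry {
    int key;
    int value;
    struct Entry *next;
} Entry;

/* The bucket only narrows the search; the key itself decides the match. */
Entry *find(Entry **buckets, unsigned num_buckets, unsigned hash, int key) {
    Entry *e = buckets[hash % num_buckets]; /* candidate bucket */
    while (e != 0) {
        if (e->key == key) /* compare the actual key, not the hash */
            return e;
        e = e->next;
    }
    return 0; /* no entry with this key */
}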
I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want to have a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer, but there won't be more than 2 million in total. The number will vary, as keys (and corresponding array entries) will be deleted, but new keys will always be numbered higher than the previous highest-numbered key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take on the order of low 100s of ns, though I expect deletion to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and compile with gcc using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index of a number within the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, then binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", the Binary Search Algorithm still works, without extra effort or space (although removing one element naturally forces you to shift all the elements to the right of the deleted one).
You only have to provide three pieces of data to the Binary Search Algorithm: the array, the size of the array, and the desired key, of course.
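A minimal C sketch of that search (the returned index is exactly the "translated" value asked about, provided the key array is kept sorted and compact):

#include <stddef.h>

/* Classic binary search over a sorted array of 4-byte keys.
   Returns the key's index, or -1 if it is not present. */
long binary_search(const unsigned *keys, size_t n, unsigned target) {
    size_t lo = 0, hi = n; /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] == target)
            return (long)mid;
        if (keys[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1; /* not present */
}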
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key array, and its index in the key array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
You don't need an order-preserving minimal perfect hash, because any old hash would do. You don't want to use a 4 GB array, but with 2M items, you wouldn't mind using 3M lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries, and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected data. For example, if you expect 2M items, then choose a prime around 3M.
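A sketch of that in C (the table size below is illustrative; pick your own prime from a primes table):

#define NUM_BUCKETS 3000017u /* a prime a bit above 3M, for ~2M expected items */

/* Hash a 4-byte integer key by taking the remainder modulo a prime. */
unsigned long bucket_of(unsigned key) {
    return key % NUM_BUCKETS;
}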
I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, I need to return some other data structure. My question is: what data structure should I use to hold this effectively?
I would have used an array, giving me O(1) lookup, if the numbers were in the range of, say, 1 to 5000. But when my input number goes large, it becomes unrealistic to use an array, as the memory requirements would be huge.
I am hence trying to look at a data structure that yields the result fast and is not very heavy.
Any clues, anybody?
EDIT:
It would not make sense to use an array, since I may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements, is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements (200), that will be very fast.
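A sketch of that approach (names illustrative):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t index; /* the large number used as the key */
    void *data;     /* pointer to the associated data structure */
} Pair;

Pair pairs[200]; /* at most ~200 entries, per the question */
size_t n_pairs;

/* Linear search: trivial, cache-friendly, and fast for ~200 entries. */
void *find_data(uint32_t index) {
    for (size_t i = 0; i < n_pairs; i++)
        if (pairs[i].index == index)
            return pairs[i].data;
    return NULL; /* no such index stored */
}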
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.
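For illustration, a minimal sketch of uthash with an integer key, based on its documented HASH_ADD_INT / HASH_FIND_INT macros (double-check against the current uthash documentation):

#include <stdlib.h>
#include "uthash.h"

typedef struct {
    int id;            /* the integer key */
    void *data;        /* the associated value */
    UT_hash_handle hh; /* makes this structure hashable by uthash */
} Item;

Item *items = NULL; /* the table head; must start out NULL */

void add_item(int id, void *data) {
    Item *it = malloc(sizeof *it);
    it->id = id;
    it->data = data;
    HASH_ADD_INT(items, id, it); /* 'id' names the key field */
}

Item *find_item(int id) {
    Item *it;
    HASH_FIND_INT(items, &id, it); /* sets 'it' to NULL if absent */
    return it;
}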
I have a large index of size 80 bits and its corresponding data to be stored in a data structure on which I need to search. Can we use the 80-bit index in a hash table? Or is there a better alternative data structure that will take constant time for lookup (search)?
EDIT:
I think my question was not clear.... Here is the setup: I have millions of files for which I will produce a cryptographic hash trapdoor of size 80 bits (to represent each file securely), and each 80-bit trapdoor is to be stored with its data in a data structure like a hash table. Now, since the domain of 80-bit trapdoors is larger than the range of the hash table, there will be collisions for sure. But I need unique <80-bit trapdoor, data> pairs to be stored in the data structure. How can I achieve this using a hash table? Or is there any other alternative DS?
EDIT 2 :
Let's say that I created a hash table, and there occurred a collision when adding the keys (say x and y, in that order) because the hash function generated the same index (i) for those keys. But by using collision resolution techniques (e.g. double hashing), y is inserted at a different location j, which is not i. I understand up to this point. Now, if I search based on the key y, does the hash table return location i or j? If not i, how will it return j (the exact desired record)? Does it store any counter (probe count) for the number of collisions?
You should probably review how hash tables work.
The object you want to use as an index is passed through a hash function, and the resulting value is used to find the memory position where you should place/look for the data associated with that index value.
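For the 80-bit trapdoor case specifically, a sketch of how this plays out with a chained table: hash the 10 key bytes down to a bucket index, but store the full key and compare it exactly, so colliding trapdoors stay distinct (FNV-1a is used here purely as an example byte hash; all names are illustrative):

#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 1048576u /* illustrative table size */

typedef struct Entry {
    uint8_t trapdoor[10];  /* the full 80-bit key, kept for exact comparison */
    void *data;
    struct Entry *next;
} Entry;

Entry *buckets[NUM_BUCKETS];

/* FNV-1a over the 10 key bytes, folded into the bucket range. */
static size_t bucket_of(const uint8_t key[10]) {
    uint64_t h = 14695981039346656037ULL; /* FNV-1a 64-bit offset basis */
    for (int i = 0; i < 10; i++) {
        h ^= key[i];
        h *= 1099511628211ULL;            /* FNV-1a 64-bit prime */
    }
    return (size_t)(h % NUM_BUCKETS);
}

/* Collisions in the bucket index are fine: the stored 80-bit key decides. */
void *lookup(const uint8_t key[10]) {
    for (Entry *e = buckets[bucket_of(key)]; e != NULL; e = e->next)
        if (memcmp(e->trapdoor, key, 10) == 0)
            return e->data;
    return NULL;
}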
If you need constant-time lookups, go for a hash table. Just be sure to use an appropriate hash function.
You can use whatever you want as an index in a hash table, as long as you provide a hash function. I don't think there is a better alternative if you want constant-time access.
I am in search of a good hash function which I can use in a hash table implementation. The thing is that I want to give both strings and integers as parameters (keys) to my hash function.
I have a txt file with ~500 records, and every one of them consists of integers and strings (max 15 chars). So what I want to do is pick one of these ints/strings and use it as a key for my hash function, in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)
Use the integer value if it's present and reasonably well distributed; hash the string if it's not. An integer hash code is much cheaper to compute than a string's.
The algorithm has to be repeatable, obviously.
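A sketch of that policy (the specific hash functions are illustrative choices, not requirements):

/* Knuth's multiplicative hash for 32-bit integers: cheap and well distributed. */
unsigned hash_int(unsigned key) {
    return key * 2654435761u; /* 2654435761 ~ 2^32 / golden ratio */
}

/* A simple string hash (djb2) as the fallback when no usable integer exists. */
unsigned hash_str(const char *s) {
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Repeatable rule: same record, same key choice, same hash every time. */
unsigned hash_key(int has_int, unsigned int_key, const char *str_key) {
    return has_int ? hash_int(int_key) : hash_str(str_key);
}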
Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.
Since you want to hash using two keys, I presume the distribution improves when both keys contribute to the hash.
For string hashing, I have had good results with the PJW algorithm. Just google for "PJW Hash String". One variation here
To augment the hash with an integer, see here
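For reference, a common formulation of the PJW string hash, plus one simple illustrative way to fold in an integer key:

/* PJW hash (P. J. Weinberger), as popularized by the "dragon book". */
unsigned pjw_hash(const char *s) {
    unsigned h = 0, g;
    while (*s) {
        h = (h << 4) + (unsigned char)*s++;
        if ((g = h & 0xF0000000u) != 0) {
            h ^= g >> 24;
            h &= ~g;
        }
    }
    return h;
}

/* One illustrative way to augment the string hash with an integer. */
unsigned combined_hash(const char *s, unsigned n) {
    return pjw_hash(s) * 31u + n;
}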