Hash tables are said to be the fastest/best way of storing/retrieving data.
My understanding of hash tables and hashing is as follows (please correct me if I am wrong, or add anything that is missing):
A Hash Table is nothing but an array (single or multi-dimensional) to store values.
Hashing is the process of finding the index/location in the array at which to insert/retrieve the data. You take a data item, pass it as a key to a hash function, and get back the index/location at which to insert/retrieve the data.
I have a question:
Is the hash function used to store/retrieve the data DIFFERENT from a
cryptographic hash function used in security applications for authentication
like MD5, HMAC, SHA-1 etc...?
In what way(s) are they different?
How to write a hash function in C?
Is there some standard or guidelines to it?
How do we ensure that the output of the hash function (i.e., the index) is not out of range?
It would be great if you could mention some good links to understand these better.
A cryptographic hash emphasizes making it difficult for anybody to intentionally create a collision. For a hash table, the emphasis is normally on producing a reasonable spread of results quickly. As such, the two are usually quite different (in particular, a cryptographic hash is normally a lot slower).
For a typical hash function, the result is limited only by the type -- e.g. if it returns a size_t, it's perfectly fine for it to return any possible size_t. It's up to you to reduce that output range to the size of your table (e.g. using the remainder of dividing by the size of your table, which should often be a prime number).
As an example, a fairly typical non-cryptographic hash function might look something like:
// warning: untested code.
#include <stddef.h>   // for size_t

size_t hash(char const *input) {
    const int ret_size = 32;    // treat the value as 32 bits for the rotate
    size_t ret = 0x555555;      // arbitrary non-zero starting value
    const int per_char = 7;     // rotate distance per input character

    while (*input) {
        ret ^= (unsigned char)*input++;                            // mix in the next character
        ret = (ret << per_char) | (ret >> (ret_size - per_char));  // rotate the accumulated bits
    }
    return ret;
}
The basic idea here is to have every bit of the input string affect the result, and to (as quickly as possible) have every bit of the result affected by at least part of the input. Note that I'm not particularly recommending this as a great hash function -- only trying to illustrate some of the basics of what you're trying to accomplish.
Bob Jenkins wrote an in-depth description of his good, if slightly outdated, hash function. The article has links to newer, better hash functions, but the writeup addresses the concerns of building a good one.
Also, most hash table implementations actually use an array of linked lists (chaining) to resolve collisions. If you want to use just a flat array (open addressing), then the insertion/lookup code needs to detect collisions and probe for an alternative index.
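For the array-only (open addressing) case, a minimal linear-probing insertion sketch might look like the following (untested; TABLE_SIZE and the storage layout are invented for illustration, and hash() stands for the kind of string hash sketched above):

#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 101                 /* hypothetical prime table size */

static const char *keys[TABLE_SIZE];   /* NULL means "slot empty" */

size_t hash(char const *input);        /* a string hash like the one above */

/* Linear probing: if a slot is taken by a different key, try the next slot. */
int insert(const char *key)
{
    size_t i = hash(key) % TABLE_SIZE;

    for (size_t probes = 0; probes < TABLE_SIZE; probes++) {
        if (keys[i] == NULL || strcmp(keys[i], key) == 0) {
            keys[i] = key;             /* empty slot (or same key): store here */
            return 1;
        }
        i = (i + 1) % TABLE_SIZE;      /* collision: probe the next slot */
    }
    return 0;                          /* table is full */
}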
The cryptographic hash functions you mention could be used as hash functions for a hash table,
but they are much slower than hash functions designed for hash tables. For cryptographic hashes, being slow is partly the point, since speed makes brute-force attacks easier.
The design goals are different.
With cryptographic hash functions you want, for example, that the hash and the hash function cannot be used to recover the original data, or to find any other data that would produce the same hash.
Hash functions used with hash tables and other data structures do not need such security properties. It's often enough if the hash function is fast and distributes the input set evenly across the set of possible hashes (to avoid unnecessary clustering/collisions).
I'm familiar with the idea of a hash function but I'm unclear on how GLib's implementation is useful. I'll explain this with an example.
Suppose I have an expensive function that is recursive (somehow) on the positive real numbers in a weird way that depends on number theory (I'm a mathematician). Let's say I have an algorithm that needs to compute the function on some smallish-range of large numbers. Say [1000000000 - 1000999999].
I don't want to call my expensive function one million times, so I start memoizing values recursively. Then at each call I don't need to necessarily compute the whole function from scratch, I can hopefully remember any values of the function on the lower numbers (during my recursing) that I have already computed. Let's assume that the actual total number of calls at that first level of recursion is low. So that there are a lot of repeated values and memoizing actually saves you a lot of time.
This is my cartoony way of understanding why a hash table data structure is useful. What I don't get is how to do this without knowing exactly what keys I'll need in advance.
Since the recursive function is number theoretic, in general I don't know which values it will take over and over again. So I'd like to just throw these into a bucket (hash table) as they pop out of recursive calls to my function.
For GLib, it would seem that your (key, value) pairs are always pointers to data that you personally have to keep lying around somewhere. So if my function is computing a result for input x, I don't know how to tell whether I've seen the value x before: g_hash_table_contains(), for example, needs a pointer, not the value x. So what's the use!?
I'm still learning so be kind. I'm familiar with coding in C, but haven't yet used hash tables in this language and I'm trying to do so and be adept at it with GLib but I just don't get this.
Let me take a stab at explaining it.
First of all, if we are using a hashmap, then we need a [key, value] pair as our input.
So as users of a hashmap, we have to be creative about choosing the key, and the choice varies depending on the use case.
In your case, as far as I understood, you have a function that works on a range of numbers and gives you a result, and when calculating it uses memoization so that the results of the small problems that make up the bigger problem can be reused.
So, for example, you could use a string as your key, where the key "1000000009" may use the result for "1000999998", which may in turn use the result for "1000999997", and so on; whenever you do not find a result in the hashmap, you calculate it and save it in the hashmap.
In a nutshell, as users, we need to be creative about choosing keys.
A helpful analogy is how you would go about choosing the primary key for a database table.
Another example to think about is how you would solve fibonacci(n) using a hashmap.
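To make this concrete, here is a rough, untested sketch of memoizing an integer-keyed function with GLib (the names cache, expensive and memo_expensive are made up, and the inputs are assumed to fit in an int). With GINT_TO_POINTER the key is packed into the pointer itself, so nothing extra has to be kept lying around:

#include <glib.h>

static GHashTable *cache;   /* key: GINT_TO_POINTER(x), value: GINT_TO_POINTER(f(x)) */

static int expensive(int x);               /* the expensive, recursive function */

static int memo_expensive(int x)
{
    gpointer stored;

    /* lookup_extended distinguishes "not present" from a stored value of 0 */
    if (g_hash_table_lookup_extended(cache, GINT_TO_POINTER(x), NULL, &stored))
        return GPOINTER_TO_INT(stored);    /* seen x before: reuse the result */

    int result = expensive(x);             /* compute once ... */
    g_hash_table_insert(cache, GINT_TO_POINTER(x), GINT_TO_POINTER(result));
    return result;                         /* ... and remember it */
}

/* during setup: cache = g_hash_table_new(g_direct_hash, g_direct_equal); */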
I am building a hash table, where the key is a phone number (here are some of them):
6948060987
6960780800
6963208768
6944870406
6947279288
6953691771
6956094283
6947092062
6960086297
6947719197
6951516975
6957531584
6969211184
6963238579
6957054322
6952077216
6956907738
The number of entries will be 200, 2000, 20000 and 2000000 and the entries will be unique.
About the size of the table, I am following this answer.
I store the phone number as an array of chars. I noticed that all the numbers begin with 69, so I can skip those two digits in the hash function.
My attempt is to take the sum of the digits and do a modulo with the number of cells in the hash table, but it seems (on paper) that this is a bad function, since there are many collisions.
How should I modify my hash function to get better results (less collisions)?
Why do you need a non-standard hash function at all?
There are plenty of hash functions which are well tested and have known properties which will work fine for any input, thus will also work well for phone numbers, which are after all a subset of ASCII strings. Is your application so time critical that you need to design your own hash function and risk something with more collisions? If not, why not use one of the well known hash functions?
For instance, if you need something with cryptographically demonstrable collision resistance, use SHA-256 (truncated if you want). If you are not worried about an adversary, use something like universal hashing. Unless your problem is very specialised, you will be better off using someone else's well tested hash algorithm than trying to invent one yourself.
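If you do go the universal-hashing route, a rough, untested sketch for these phone numbers could treat the eight digits after the leading "69" as a 32-bit key (the constants and names below are invented for illustration):

#include <stdint.h>
#include <stdlib.h>

#define P 4294967311ULL    /* a prime larger than the key universe (here 10^8) */

static uint64_t a, b;      /* picked once at table creation: 1 <= a < P, 0 <= b < P */

void universal_init(void)
{
    a = 1 + (uint64_t)rand() % (P - 1);   /* good enough for a sketch */
    b = (uint64_t)rand() % P;
}

/* key = the 8 digits after the leading "69", e.g. 48060987 for 6948060987 */
size_t universal_hash(uint32_t key, size_t num_buckets)
{
    return (size_t)(((a * key + b) % P) % num_buckets);   /* Carter-Wegman style */
}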
An even easier hash is the original hash perl used, which worked as follows:
# Return the hashed value of a string: $hash = perlhash("key")
# (Defined by the PERL_HASH macro in hv.h)
sub perlhash
{
    $hash = 0;
    foreach (split //, shift) {
        $hash = $hash*33 + ord($_);
    }
    return $hash;
}
In English, it takes the current hash value, multiplies by 33, and adds the ASCII value of the next character on. It's not a great hash, but it worked for perl for a long while.
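For the C hash table in question, the same multiply-by-33 idea might look roughly like this (untested; num_buckets is assumed to be the table size, ideally a prime):

#include <stddef.h>

size_t perl_style_hash(const char *key, size_t num_buckets)
{
    size_t h = 0;
    while (*key)
        h = h * 33 + (unsigned char)*key++;   /* hash = hash*33 + next character */
    return h % num_buckets;                   /* reduce to a valid bucket index */
}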
I am searching for a good hash function to use in a hash table implementation. The thing is that I want to be able to pass both strings and integers as parameters (keys) to my hash function.
I have a txt file with ~500 records, and each of them consists of integers and strings (max 15 chars). So what I want to do is pick one of these ints/strings and use it as the key for my hash function in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)
Use the integer value if it's present and reasonably well distributed; hash the string if it's not. An integer hash code is much cheaper to compute than a string's.
The algorithm has to be repeatable, obviously.
Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.
Since you want to hash using two keys, I presume you expect the distribution to improve when both are used.
For string hashing, I have had good results with PJW algorithm. Just google for "PJW Hash String". One variation here
To augment the hash with an integer, see here
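For reference, a rough, untested sketch of the PJW-style (ELF) string hash mentioned above, written for 32-bit arithmetic:

#include <stdint.h>

uint32_t pjw_hash(const char *s)
{
    uint32_t h = 0, high;

    while (*s) {
        h = (h << 4) + (unsigned char)*s++;   /* shift in 4 bits per character */
        high = h & 0xF0000000u;
        if (high)
            h ^= high >> 24;                  /* fold the top nibble back in */
        h &= ~high;                           /* and clear it */
    }
    return h;                                 /* reduce with modulo table size as usual */
}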
I understand that some hash tables use "buckets", which is a linked list of "entries".
HashTable
    -size     // total possible buckets to use
    -count    // total buckets in use
    -buckets  // linked list of entries

Entry
    -key      // key identifier
    -value    // the object you are storing for reference
    -next     // the next entry
In order to get the bucket by index, you have to call:
myBucket = someHashTable[hashIntValue]
Then, you could iterate the linked list of entries until you find the one you are looking for or null.
Does the hash function always return a NUMBER % HashTable.size? That way, you stay within the limit? Is that how the hash function should work?
Mathematically speaking, a hash function is usually defined as a mapping from the universe of elements you want to store in the hash table to the range {0, 1, 2, ..., numBuckets - 1}. This means that in theory, there's no requirement whatsoever that you use the mod operator to map some integer hash code into the range of valid bucket indices.
However, in practice, almost universally programmers will use a generic hash code that produces a uniformly-distributed integer value and then mod it down so that it fits in the range of the buckets. This allows hash codes to be developed independently of the number of buckets used in the hash table.
EDIT: Your description of a hash table is called a chained hash table and uses a technique called closed addressing. There are many other implementations of hash tables besides the one you've described. If you're curious - and I hope you are! :-) - you might want to check out the Wikipedia page on the subject.
What is a hash table?
Also known as a hash map, it is a data structure used to implement an associative array: a structure that can map keys to values.
How does it work?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
Hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
There is no predefined rule for how a hash function should behave. You can have all of your values map to index 0 - a perfectly valid hash function (performs poorly, but works).
Of course, if your hash function returns a value outside the range of indices of your backing array, it won't work correctly. That's not to say, however, that you need to use the formula (number % TABLE_SIZE).
No, the table is typically an array of entries. You don't iterate it until you find the same hash; you use the hash result (or, usually, the hash modulo numBuckets) to directly index into the array of entries. That gives you the O(1) behaviour (iterating would be O(n)).
When you try to store two different objects with the same hash result (called a 'hash collision'), you have to find some way to make space. Different implementations vary in how they handle collisions. You can create a linked list of all the objects with same hash, or use some rehashing to store in a different entry of the table.
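As a rough illustration of the chained approach (untested; the struct layout and names are invented, and hash() stands for any string hash like the ones above):

#include <stddef.h>
#include <string.h>

struct entry {
    char         *key;
    void         *value;
    struct entry *next;      /* next entry that hashed to the same bucket */
};

struct hash_table {
    size_t        num_buckets;
    struct entry **buckets;  /* array of linked lists of entries */
};

size_t hash(const char *key);   /* any string hash */

void *table_lookup(const struct hash_table *t, const char *key)
{
    size_t index = hash(key) % t->num_buckets;   /* O(1): jump straight to the bucket */

    for (struct entry *e = t->buckets[index]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)            /* walk only the colliding entries */
            return e->value;
    return NULL;                                 /* not found */
}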
I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to have O(1) lookups, which would be rendered impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:
for ( i = 0; i < len; i++ )
    h = 33 * h + p[i];
To compress the resulting h into the range [0,n], I would like to simply do h%n, but I suspect that this will lead to a much higher probability of clashes in a way that would essentially render my hash useless.
My question then, is how can I hash either the string or the resulting hash so that the n elements provide a relatively uniform distribution over [0,n]?
It's not enough to know n. Allocation of an item to a bucket is a function of the item itself so, if you want a perfect hash function (one item per bucket), you need to know the data.
In any case, if you're limiting the number of elements to a known n, you're already technically O(1) lookup. The upper bound will be based on the constant n. This would be true even for a non-hash solution.
Your best bet is probably to just use the hash function you have and have each bucket be a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly minimising the time taken.
Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.
If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h%n but that's going to take an awful lot of storage :-)
My advice is to go the good-enough hash with linked list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population but even the hash you have (which is ideal for all letters having the same frequency) should be adequate.
And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia*), you'll end up with more collisions.
But keep in mind that it really depends on the n that you're using.
If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that a sequential search (or a balanced binary tree) won't give you enough performance.
A case in point: we have some code which searches through problem dockets looking for names of people that left comments (so we can establish the last member on our team who responded). There's only ever about ten or so members in our team so we just use a sequential search for them - the performance improvement from using a faster data structure was deemed too much trouble.
* No offence intended. I just remember the humorous article a long time ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".
What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.
The gperf program generates C code for perfect hashes.
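For example, a minimal gperf input file for a fixed list of names might look like this (a hypothetical sketch; gperf reads it and prints generated C code, including a hash function and an in_word_set lookup routine, to standard output):

%{
#include <string.h>   /* the generated lookup uses strcmp */
%}
%%
alice
bob
carol
dave
%%

Running something like gperf names.gperf > names.c would produce the lookup code; see the gperf documentation for the available options.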
It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might
fit your needs. Disclaimer: I've never used it.
The optimal algorithm for mapping n strings to integers 1-n is to build a DFA where the terminating states are the integers 1-n. (I'm sure someone here will step up with a fancy name for this...but in the end it's all DFA.) Size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).