How to write an order-preserving minimal perfect hash for integer keys? - C

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest way would be to simply use the keys as an array index, but I don't want a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer, but there won't be more than 2 million in total. The number will vary, as keys (and corresponding array entries) will be deleted, but new keys will always be higher numbered than the previous highest numbered key.
In the above example, if key 69 were deleted, then the hash integer returned for 3493 should be 1 (rather than 2), as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take on the order of low hundreds of nanoseconds, though I expect deletion to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and be compiled with gcc using pure C.

Actually, I don't know if I understand exactly what you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index in optimal time is binary search.
Binary Search Algorithm
Since your list is known to be in order, binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", binary search still works, without extra effort or space (however, removing one element from the list naturally forces you to shift every element to the right of the deleted one).
You only have to provide three things to the binary search algorithm: the array, the size of the array, and the desired key, of course.
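A minimal sketch of that lookup in C (names like find_index are illustrative, not from any library): a lower-bound binary search over the sorted key array that returns the key's 0-based rank, which is exactly the translated value the question asks for.

    #include <stdint.h>
    #include <stddef.h>

    /* Return the index of `key` in the sorted array `keys` of length `n`,
     * or -1 if it is not present. O(log n) comparisons. */
    static long find_index(const uint32_t *keys, size_t n, uint32_t key)
    {
        size_t lo = 0, hi = n;               /* search in [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (keys[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        if (lo < n && keys[lo] == key)
            return (long)lo;                 /* 0-based rank of the key */
        return -1;
    }

For 2 million keys that is at most about 21 comparisons, which should fit comfortably within the low-hundreds-of-nanoseconds budget on a modern CPU.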

There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.

It could be solved by using two dynamically allocated arrays: one for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key array, and its index in the key array is the index into the data array.
When you remove a key/data pair, or want to insert a new item, you reallocate the arrays and copy the keys/data over to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
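A sketch of the parallel-array idea in C, with deletion done via memmove instead of a full reallocation (cheaper, same effect); all names and the fixed capacity are illustrative:

    #include <stdint.h>
    #include <string.h>

    #define MAX_KEYS 2000000

    /* Parallel arrays: keys[i] is the key whose payload is data[i]. */
    static uint32_t keys[MAX_KEYS];
    static void    *data[MAX_KEYS];
    static size_t   count = 0;

    /* Delete the entry at index `i` (0 <= i < count), shifting everything
     * after it left so the remaining keys keep their rank order. O(n). */
    static void delete_at(size_t i)
    {
        memmove(&keys[i], &keys[i + 1], (count - i - 1) * sizeof keys[0]);
        memmove(&data[i], &data[i + 1], (count - i - 1) * sizeof data[0]);
        count--;
    }

    /* Append a new key: the question guarantees new keys are always larger
     * than the current maximum, so the arrays stay sorted automatically. */
    static void append(uint32_t key, void *payload)
    {
        keys[count] = key;
        data[count] = payload;
        count++;
    }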

You don't need an order-preserving minimal perfect hash, because any old hash would do. You don't want to use a 4 GB array, but with 2 million items, you wouldn't mind using 3 million lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected data size. For example, if you expect 2 million items, then choose a prime around 3 million.
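A sketch of that in C, using open addressing with linear probing; 3145739 is one prime near 3 million taken from the widely circulated list of hash-table primes, and the EMPTY sentinel (which sacrifices one key value) is just a simplification for illustration:

    #include <stdint.h>
    #include <string.h>

    #define TABLE_SIZE 3145739u   /* prime, a bit above 2 million items */
    #define EMPTY 0xFFFFFFFFu     /* sentinel: this slot holds no key   */

    static uint32_t slot_key[TABLE_SIZE];
    static void    *slot_val[TABLE_SIZE];

    /* Hash an integer key: remainder when dividing by a prime. */
    static uint32_t hash(uint32_t key) { return key % TABLE_SIZE; }

    /* Before first use: memset(slot_key, 0xFF, sizeof slot_key); */

    static void insert(uint32_t key, void *val)
    {
        uint32_t i = hash(key);
        while (slot_key[i] != EMPTY && slot_key[i] != key)
            i = (i + 1) % TABLE_SIZE;        /* linear probing */
        slot_key[i] = key;
        slot_val[i] = val;
    }

    static void *lookup(uint32_t key)
    {
        for (uint32_t i = hash(key); ; i = (i + 1) % TABLE_SIZE) {
            if (slot_key[i] == key)   return slot_val[i];
            if (slot_key[i] == EMPTY) return NULL;
        }
    }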

Related

What C construct would allow me to 'reverse reference' an array?

Looking for an elegant way (or a construct with which I am unfamiliar) that allows me to do the equivalent of 'reverse referencing' an array. That is, say I have an integer array
handle[number] = nameNumber
Sometimes I know the number and need the nameNumber, but sometimes I only know the nameNumber and need the matching [number] in the array.
The integer nameNumber values are each unique, that is, no two nameNumbers are the same, so every [number] and nameNumber pair is also unique.
Is there a good way to 'reverse reference' an array value (or some other construct) without having to sweep the entire array looking for the matching value, (or having to update and keep track of two different arrays with reverse value sets)?
If the array is sorted and you know its length, you could binary search for the element in the array. This would be an O(log(n)) search instead of an O(n) sweep through the array. Divide the array in half and check whether the element at the center is greater or less than what you're looking for, grab the half of the array your element is in, and divide it in half again. Each decision you make eliminates half of the remaining elements. Keep this process going and you'll eventually land on the element you're looking for.
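Since the question asks for a C construct, here is a sketch of that sorted-array reverse lookup; the struct and function names are made up for illustration, and the rev array must be kept sorted by nameNumber:

    #include <stddef.h>

    struct rev_entry {
        int nameNumber;   /* the value stored in handle[] */
        int number;       /* the index it was stored at   */
    };

    /* Binary search the reverse table for `nameNumber`; return the
     * matching `number`, or -1 if it is unknown. O(log n). */
    static int number_of(const struct rev_entry *rev, size_t n, int nameNumber)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (rev[mid].nameNumber < nameNumber)
                lo = mid + 1;
            else
                hi = mid;
        }
        if (lo < n && rev[lo].nameNumber == nameNumber)
            return rev[lo].number;
        return -1;
    }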
I don't know whether it's acceptable for you to use C++ and Boost libraries. If yes, you can use boost::bimap<X, Y>.
Boost.Bimap is a bidirectional maps library for C++. With Boost.Bimap you can create associative containers in which both types can be used as a key. A bimap<X, Y> can be thought of as a combination of a std::map<X, Y> and a std::map<Y, X>.

Is there a way to map a list of integers to a unique number or a unique hash?

The permutation of the list of integers should also be preserved in the hash -- i.e., lists containing the same numbers in a different order should have different hashes.
One way to do this would be to concatenate the list of integers into a string, but this could be an expensive comparison test if the list is massive.
Context: If I already have 5 large arrays 'analyzed' and hashed away, I would be able to quickly check whether an incoming array is new or not.
https://en.wikipedia.org/wiki/Pigeonhole_principle
"In mathematics, the pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item"
It is certainly possible to create a unique number; it's just that it's hilariously huge.
Consider
[1,2,3]
A simple list, but to make sure we have enough holes for our pigeons, we would need to have space for the largest integer in each slot, so assuming 4 bytes per item, we would need a 12-byte integer to store the hash uniquely, which means 2^96, or roughly 7.9e+28, different values. And that's only 3 integers.
No, an efficient hash is rarely unique, but a good hash is unlikely to have collisions for similar values.
To answer your question about checking for existence, consider the following:
If you have an array of n items, in order to hash it, you need to take n steps. In order to check for existence, you need, at worst, n steps to check each item in turn.
In either case, you are going to spend about the same amount of time comparing arrays.
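A sketch of an order-sensitive (but, per the pigeonhole argument above, not unique) hash in C, using the well-known FNV-1a scheme; because each element is folded into the running state in sequence, [1,2,3] and [3,2,1] hash differently:

    #include <stdint.h>
    #include <stddef.h>

    /* 64-bit FNV-1a over an array of 32-bit integers. */
    static uint64_t hash_int_list(const uint32_t *a, size_t n)
    {
        uint64_t h = 14695981039346656037ULL;      /* FNV offset basis */
        for (size_t i = 0; i < n; i++) {
            for (int b = 0; b < 4; b++) {
                h ^= (a[i] >> (8 * b)) & 0xFF;     /* fold in one byte */
                h *= 1099511628211ULL;             /* FNV prime */
            }
        }
        return h;
    }

Comparing two such 64-bit hashes is a single integer comparison, so checking whether an incoming array is new costs one hash pass plus O(1) per stored hash (with a small risk of false positives from collisions).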
An array structure seems to be a perfect choice, where the index differentiates between elements; or you can use a list of elements where each element is assigned an index value just before insertion.
Never use a String as a list structure, because it has its own properties, like immutability (in the case of Java).

Data Structure to do lookup on large number

I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, I need to return some other data structure. My question is: what data structure should I use to hold this efficiently?
I would have used an array, giving me O(1) lookup, if the numbers were in the range of, say, 1 to 5000. But when my input number gets large, it becomes unrealistic to use an array, as the memory requirements would be huge.
I am hence trying to find a data structure that yields the result fast and is not very heavy.
Any clues, anybody?
EDIT:
It would not make sense to use an array, since I may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements, is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements (200), that will be very fast.
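A sketch of that layout (illustrative names; uint64_t so the full 1 - 2^32 range fits):

    #include <stdint.h>
    #include <stddef.h>

    struct lookup_entry {
        uint64_t index;   /* the large number, up to 2^32 */
        void    *data;    /* the structure to return      */
    };

    /* Linear scan: O(n), but trivially fast for ~200 entries. */
    static void *find_data(const struct lookup_entry *tab, size_t n,
                           uint64_t index)
    {
        for (size_t i = 0; i < n; i++)
            if (tab[i].index == index)
                return tab[i].data;
        return NULL;
    }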
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.
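For instance, using uthash's documented integer-key macros (HASH_ADD_INT / HASH_FIND_INT), a lookup table might look like this sketch; for keys wider than int, uthash's generic HASH_ADD/HASH_FIND macros take an explicit key length instead:

    #include <stdlib.h>
    #include "uthash.h"

    struct item {
        int id;             /* key: the number to look up  */
        void *data;         /* the structure to return     */
        UT_hash_handle hh;  /* makes this struct hashable  */
    };

    static struct item *table = NULL;   /* uthash head starts as NULL */

    static void add_item(int id, void *data)
    {
        struct item *it = malloc(sizeof *it);
        it->id = id;
        it->data = data;
        HASH_ADD_INT(table, id, it);    /* `id` names the key field */
    }

    static struct item *find_item(int id)
    {
        struct item *it;
        HASH_FIND_INT(table, &id, it);  /* key is passed by address */
        return it;                      /* NULL if not present */
    }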

Sorting n sets of data into one

I have n arrays of data, each of these arrays is sorted by the same criteria.
The number of arrays will, in almost all cases, not exceed 10, so it is a relatively small number. Each array, however, can contain a large number of objects, which should be treated as infinite for the algorithm I am looking for.
I now want to treat these arrays as if they were one array. However, I need a way to retrieve objects in a given range as fast as possible, without touching all objects before the range and/or all objects after the range. Therefore it is not an option to iterate over all objects and store them in a single array. Fetches with low start values are also more likely than fetches with a high start value, so e.g. fetching objects [20,40) is much more likely than fetching objects [1000,1020), but the latter could happen.
The range itself will be pretty small, around 20 objects, but it can be increased if that helps performance, as long as it does not hit memory limits; I would guess a couple of hundred objects would be fine as well.
Example:
3 arrays, each containing a couple of thousand entries. I now want to get the overall objects in the range [60, 80) without touching either the first 60 objects in each set or the objects that come after object 80 in the arrays.
I am thinking about some sort of combined, modified binary search. My current idea is something like the following (note that this is not fully thought through yet; it is just an idea):
get object 60 of each array - the beginning of the range cannot be after that, as every single array would already meet the requirements
use these objects as the maximum value for the binary search in every array
from one of the arrays, get the centered object (e.g. 30)
with a binary search in all the other arrays, try to find the object in each array that would be before, but as close as possible to, the picked object
we now have 3 objects, e.g. objects 15, 10 and 20. The sum of these positions would be 45, so there are 42 objects strictly in front of them, which is more than the beginning of the range we are looking for (30). We continue our binary search in the remaining left half of one of the arrays
if we instead get a value where the sum is smaller than the beginning of the range we are looking for, we continue our search on the right
at some point we will hit object 30. From there on, we can simply add the objects from each array, one by one, with an insertion sort until we hit the range length
My questions are:
Is there any name for this kind of algorithm I described here?
Are there other algorithms or ideas for this problem, that might be better suited for this issue?
Thanks in advance for any ideas or help!
People usually call this problem something like "selection in the union of multiple sorted arrays". One of the questions in the sidebar is about the special case of two sorted arrays, and this question is about the general case. Several comparison-based approaches appear in the combined answers; they more or less have to determine where the lower endpoint in each individual array is. Your binary search answer is one of the better approaches; there's an asymptotically faster algorithm due to Frederickson and Johnson, but it's complicated and not obviously an improvement for small ranks.
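The primitive those algorithms share is cheap to write in C: for a candidate value, count how many elements across all arrays are smaller, using one lower-bound binary search per array; a selection loop then drives this count toward the desired rank. A sketch with illustrative names:

    #include <stddef.h>

    /* Elements of sorted array `a` (length n) strictly less than `x`. */
    static size_t count_less(const int *a, size_t n, int x)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < x) lo = mid + 1;
            else            hi = mid;
        }
        return lo;
    }

    /* Global rank of `x` in the union of k sorted arrays: the sum of
     * the per-array counts. O(k log n) per candidate value. */
    static size_t global_rank(const int *const *arrays, const size_t *lens,
                              size_t k, int x)
    {
        size_t rank = 0;
        for (size_t i = 0; i < k; i++)
            rank += count_less(arrays[i], lens[i], x);
        return rank;
    }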

Why does having an index actually speed up look-up time?

I've always wondered why this is the case.
For instance, say I want to find the number 5 in an array of numbers. I have to compare my desired number against every single value to find what I'm looking for.
This is clearly O(N).
But, say for instance, I have an index that I know contains my desired item. I can just jump right to it, right? And this is also the case with Maps that are hashed, because as I provide a key to look up, the same hash function that determined the item's index position is run on the key, which again lets me jump right to the correct index.
But my question is: why is that any different from the O(N) lookup time for finding a value in an array through direct comparison?
As far as a naive computer is concerned, shouldn't using an index be the same as looking for a value? Shouldn't the raw operation still be: as I traverse the structure, I must compare the current index value to the one I know I'm looking for?
It makes a great deal of sense why something like binary search can achieve O(logN), but I still can't intuitively grasp why certain things can be O(1).
What am I missing in my thinking?
Arrays are usually stored as a large block of memory.
If you're looking for an index, this allows you to calculate the offset that that index will have in this block of memory in O(1).
Say the array starts at memory address 124 and each element is 10 bytes large; then you know the element at index 5 is at address 124 + 10*5 = 174.
Binary search will actually (usually) do something similar (since by-index lookup is just O(1) for an array): you start off in the middle, doing a by-index lookup to get that element. Then you look at the element at either the 1/4 or 3/4 position, which needs another by-index lookup, and so on.
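That offset computation is exactly what C's indexing operator performs; a small demonstration:

    #include <stdio.h>

    int main(void)
    {
        int a[10] = {0};

        /* a[5] is *(a + 5): base address plus 5 * sizeof(int),
         * computed in O(1) with no traversal of the array. */
        printf("base address:    %p\n", (void *)a);
        printf("address of a[5]: %p\n", (void *)&a[5]);
        printf("byte offset:     %zu\n",
               (size_t)((char *)&a[5] - (char *)a));
        return 0;
    }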
A HashMap has an array underneath it. When a key/value pair is added to the map, the key's hashCode() is evaluated and normalized so that the pair can be placed at its particular index in the array. When two keys' codes are normalized to the same index of the map, the entries are appended to a LinkedList.
When you perform a look-up, the key you are looking up has its hashCode() evaluated and normalized to return an index to search at. The map then traverses the linked list until it finds the key, and returns the associated value.
This look-up time is, in the best case, the same as looking up array[i], which is O(1).
The reason it is a speed-up is that you don't actually have to traverse your structure to look something up; you just jump right to the place where you expect it to be.
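The HashMap description above is Java-flavoured (hashCode(), LinkedList); the same bucket-plus-chain structure in C might look like this sketch:

    #include <stddef.h>

    #define NBUCKETS 1024   /* illustrative table size */

    struct node {
        int key;
        int value;
        struct node *next;  /* chain of entries sharing a bucket */
    };

    static struct node *buckets[NBUCKETS];

    /* Jump straight to the bucket the key hashes to, then walk the
     * (usually short) chain: O(1) expected, not O(N). */
    static struct node *get(int key)
    {
        unsigned idx = (unsigned)key % NBUCKETS;   /* "normalized" hash */
        for (struct node *n = buckets[idx]; n != NULL; n = n->next)
            if (n->key == key)
                return n;
        return NULL;
    }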
