How to approximate LUT index? - c

I have an array
...
//a b
{860, -30},
{853, -29},
{846, -28},
{838, -27},
{830, -26},
{822, -25},
{814, -24},
...
What is the quickest way in C to find b for a given a value? I guess some approximation is required for that? For example, when a = 851 I would like to find -29 as quickly as possible.

The fastest general purpose algorithm is a binary search. Depending on the size of the mapping array, you might consider hand-coding the search; that might be plausible for size 32, but I wouldn't go much larger. On a micro-controller, the fully-expanded binary search might be 50% faster, if you're lucky.
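A minimal sketch of that binary search in C (the table is sorted by descending a, as in the question; the names lut and lut_lookup are just for illustration):

#include <stddef.h>

struct entry { int a; int b; };

/* Table sorted by descending a, as in the question. */
static const struct entry lut[] = {
    {860, -30}, {853, -29}, {846, -28}, {838, -27},
    {830, -26}, {822, -25}, {814, -24},
    /* ... */
};

/* Returns the b of the entry whose a is closest to val (assumes n >= 1). */
int lut_lookup(int val, const struct entry *t, size_t n)
{
    size_t lo = 0, hi = n - 1;

    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (t[mid].a > val)
            lo = mid + 1;        /* val lies further down the descending table */
        else
            hi = mid;            /* t[mid].a <= val: candidate */
    }
    /* lo is now the first entry with a <= val; pick the closer neighbour. */
    if (lo > 0 && t[lo - 1].a - val < val - t[lo].a)
        return t[lo - 1].b;
    return t[lo].b;
}

For a = 851 this returns -29, since 853 is closer to 851 than 846 is.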
But if the mapping is not too non-linear, there's a nice alternative.
Divide the range of a into k equal-sized ranges, where k is not much bigger than the number of entries in the mapping array, chosen so that the index corresponding to each range endpoint is either the same as, or one more than, the index for the next endpoint. (If this is possible; that's precisely what I meant by "not too non-linear".) Create another array which maps each endpoint to an index into the original array. (You only need the indices, not the endpoints, because the endpoints are evenly spaced.) For each range, the bottom endpoint's index value is the index of the smallest a value in the original array that is not less than the range's top endpoint. Note that because of the requirement above, there can be at most one a value in any range, so the index stored for an endpoint always points to the a value at the end of its range, and the a value at the beginning of the range is always at either the same or the previous index.
Now, to look up a value, you first figure out the appropriate range index, which is a simple linear computation ((val - minval) / range_width, where range_width = (maxval - minval) / k), and then compare the value with the a value at the index stored for that range. If the value is less than that a, subtract one from the index. Then return the b value at that index.
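Here is a rough C sketch of one way to implement that scheme, written for a table sorted by ascending a (flip the comparisons for the question's descending table). K, build_index and lookup are illustrative names, and the sketch assumes each range contains at most one table entry, i.e. the "not too non-linear" condition above holds for the K you choose:

#include <stddef.h>

struct entry { int a; int b; };

#define K 16   /* number of equal-sized ranges; tune to your table */

struct range_index {
    int minval;       /* lowest a in the table */
    int width;        /* width of each of the K ranges */
    size_t idx[K];    /* per range: index of the smallest a not below its top endpoint */
};

/* Build the secondary index once, in O(n + K). */
void build_index(struct range_index *ri, const struct entry *t, size_t n)
{
    ri->minval = t[0].a;
    ri->width  = (t[n - 1].a - t[0].a + K - 1) / K;   /* round up */
    if (ri->width == 0)
        ri->width = 1;

    size_t i = 0;
    for (size_t j = 0; j < K; j++) {
        int top = ri->minval + (int)(j + 1) * ri->width;
        while (i < n - 1 && t[i].a < top)
            i++;
        ri->idx[j] = i;
    }
}

/* O(1) lookup: b of the largest a not exceeding val. */
int lookup(const struct range_index *ri, const struct entry *t, int val)
{
    long j = (long)(val - ri->minval) / ri->width;
    if (j < 0) j = 0;
    if (j >= K) j = K - 1;

    size_t i = ri->idx[j];
    if (val < t[i].a && i > 0)
        i--;                      /* step back to the entry at the start of the range */
    return t[i].b;
}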
For an example of such an algorithm, see my answer here.

Related

Algorithm - What is the best algorithm for detecting duplicate numbers in small array?

What is the best algorithm for detecting duplicate numbers in an array: the best in speed and memory, while avoiding overhead?
A small array like [5,9,13,3,2,5,6,7,1]. Note that 5 is duplicated.
After searching and reading about sorting algorithms, I realized that I will use one of these algorithms, Quick Sort, Insertion Sort or Merge Sort.
But actually I am really confused about what to use in my case which is a small array.
Thanks in advance.
To be honest, with that size of array, you may as well choose the O(n^2) solution (checking every element against every other element).
You'll generally only need to worry about performance if/when the array gets larger. For small data sets like this, you could well have found the duplicate with an 'inefficient' solution before the sort phase of an efficient solution has even finished :-)
In other words, you can use something like (pseudo-code):
for idx1 = 0 to nums.len - 2 inclusive:
    for idx2 = idx1 + 1 to nums.len - 1 inclusive:
        if nums[idx1] == nums[idx2]:
            return nums[idx1]
return no dups found
This finds the first value in the array which has a duplicate.
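A direct C rendering of that pseudo-code, with -1 used as a purely illustrative "no duplicate found" sentinel:

#include <stdio.h>

/* Returns the first value that has a duplicate, or -1 if there are none
   (assuming -1 never occurs in the data). */
int first_duplicate(const int *nums, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++)
        for (size_t j = i + 1; j < len; j++)
            if (nums[i] == nums[j])
                return nums[i];
    return -1;
}

int main(void)
{
    int nums[] = {5, 9, 13, 3, 2, 5, 6, 7, 1};
    printf("%d\n", first_duplicate(nums, sizeof nums / sizeof nums[0]));  /* prints 5 */
    return 0;
}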
If you want an exhaustive list of duplicates, then just add the duplicate value to another (initially empty) array (once only per value) and keep going.
You can sort it using any half-decent algorithm; for a data set of the size you're discussing, even a bubble sort would probably be adequate. Then you just process the sorted items sequentially, looking for runs of values, but it's probably overkill in your case.
Two good approaches exist, depending on whether or not you know the range from which the numbers are drawn.
Case 1: the range is known.
Suppose you know that all numbers are in the range [a, b[, so the length of the range is l = b - a.
You can create an array A of length l filled with 0s, then iterate over the original array and, for each element e, increment A[e - a] (this maps the range onto [0, l[).
Once finished, you can iterate over A and find the duplicate numbers: if there exists an i such that A[i] is greater than 1, then i + a is a repeated number.
The same idea is behind counting sort, and it works fine also for your problem.
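A small C sketch of case 1 (it assumes the range length b - a is modest enough for a stack array; use malloc otherwise):

#include <stdio.h>
#include <string.h>

/* Counts occurrences of values known to lie in [a, b[ and prints every
   value seen more than once. */
void print_duplicates(const int *nums, size_t len, int a, int b)
{
    int counts[b - a];               /* C99 VLA; heap-allocate for large ranges */
    memset(counts, 0, sizeof counts);

    for (size_t i = 0; i < len; i++)
        counts[nums[i] - a]++;       /* map [a, b[ onto [0, l[ */

    for (int v = a; v < b; v++)
        if (counts[v - a] > 1)
            printf("%d appears %d times\n", v, counts[v - a]);
}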
Case 2: the range is not known.
Quite simple. Slightly modify the approach above: instead of an array, use a map whose keys are the numbers from your original array and whose values are the number of times you have seen them. At the end, iterate over the keys and report those that were found more than once.
Note.
In both the cases mentioned above, the complexity should be O(N), and you cannot do better, since you have to visit all the stored values at least once.
Look at the first example: we iterate over two arrays, of lengths N and l; assuming l is O(N), the total work is at most about 2N, that is, O(N).
The second example is indeed a bit more complex and dependent on the implementation of the map, but for the sake of simplicity we can safely assume that it is O(N).
In memory, you are constructing data structures whose sizes are proportional to the length of the value range (case 1) or to the number of distinct values in the original array (case 2).
As usually happens, memory occupancy and performance drive your choice: the more memory you spend, the better the performance you can get, and vice versa. As suggested in another response, if you know that the array is small, you can safely rely on an algorithm whose complexity is O(N^2) but which requires no extra memory at all.
Which is the best choice? Well, it depends on your problem, we cannot say.

how to write order preserving minimal perfect hash for integer keys?

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer, but there won't be more than 2 million in total. The number will vary as keys (and corresponding array entries) are deleted, but new keys will always be numbered higher than the previous highest key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take on the order of a few hundred nanoseconds, though I expect deletion to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and compile with gcc using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, then binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", the Binary Search Algorithm works anyway, without extra effort or space (however, removing an element from the list naturally forces you to shift all the elements to the right of the deleted one).
You only have to provide three pieces of data to the Binary Search Algorithm: the array, the size of the array, and the desired key, of course.
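A small C sketch of that lookup (find_index is a placeholder name); it returns the position of the key in the sorted key array, which is exactly the "hash" the question asks for, or -1 if the key is absent:

#include <stddef.h>
#include <stdint.h>

long find_index(const uint32_t *keys, size_t n, uint32_t key)
{
    size_t lo = 0, hi = n;           /* search in [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && keys[lo] == key) ? (long)lo : -1;
}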
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key-array, and its index in the key-array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
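For instance, removal from the two parallel arrays could look roughly like this (a shrinking realloc is omitted for brevity):

#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Removes entry i from parallel key/data arrays of length *n. */
void remove_at(uint32_t *keys, uint32_t *data, size_t *n, size_t i)
{
    memmove(&keys[i], &keys[i + 1], (*n - i - 1) * sizeof keys[0]);
    memmove(&data[i], &data[i + 1], (*n - i - 1) * sizeof data[0]);
    (*n)--;
}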
You don't need an order preserving minimal perfect hash, because any old hash would do. You don't want to use a 4GB array, but with 2 MB of items, you wouldn't mind using 3 MB of lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected data; for example, if you expect around 2M items, choose a prime around 3M.
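As a sketch of that last point, the hash function itself is one line; the table size below is just an example value near 3M (verify that whatever you actually pick is prime), and the rest of the work is ordinary collision handling:

#include <stdint.h>

/* Illustrative only: TABLE_SIZE should be a prime a bit larger than the
   expected number of entries; check primality of whatever you choose. */
#define TABLE_SIZE 3000017u

static uint32_t hash(uint32_t key)
{
    return key % TABLE_SIZE;    /* remainder modulo a prime spreads the keys */
}

Each bucket then holds the key and its assigned value; chaining or open addressing resolves the occasional collision.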

Why does having an index actually speed up look-up time?

I've always wondered about why this is the case.
For instance, say I want to find the number 5 located in an array of numbers. I have to compare my desired number against every other single value, to find what I'm looking for.
This is clearly O(N).
But, say for instance, I have an index that I know contains my desired item. I can just jump right to it, right? And this is also the case with Maps that are hashed, because when I provide a key to look up, the same hash function that determined its index position is run on the key, so this also lets me jump right to its correct index.
But my question is why is that any different than the O(N) lookup time for finding a value in an array through direct comparison?
As far as a naive computer is concerned, shouldn't an index be the same as looking for a value? Shouldn't the raw operation still be, as I traverse the structure, I must compare the current index value to the one I know I'm looking for?
It makes a great deal of sense why something like binary search can achieve O(log N), but I still can't intuitively grasp why certain things can be O(1).
What am I missing in my thinking?
Arrays are usually stored as a large block of memory.
If you're looking for an index, this allows you to calculate the offset that that index will have in this block of memory in O(1).
Say the array starts at memory address 124 and each element is 10 bytes large; then you know the element at index 5 is at address 124 + 5*10 = 174.
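A tiny C illustration of that address arithmetic (the printed addresses will differ per run, but the 5 * sizeof(int) gap will not):

#include <stdio.h>

int main(void)
{
    int a[10];

    /* The compiler turns a[5] into *(a + 5): one multiply-and-add on the
       base address, no matter how many elements the array holds. */
    printf("%p\n", (void *)&a[0]);
    printf("%p\n", (void *)&a[5]);   /* exactly 5 * sizeof(int) bytes later */
    return 0;
}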
Binary search will actually (usually) do something similar (since by-index lookup is just O(1) for an array): you start off in the middle, doing a by-index lookup to get that element, then you look at the element at either the 1/4 or 3/4 position, which needs another by-index lookup, and so on.
A HashMap has an array underneath it. When a key/value pair is added to the map, the key's hashCode() is evaluated and normalized so that its value can be placed at its particular index in the array. When two keys' codes normalize to the same index of the map, they are appended to a LinkedList.
When you perform a look-up, the key you are looking up has its hashCode() evaluated and normalized to give an index to search at. It then traverses the linked list until it finds the key and returns the associated value.
This look-up time is, in the best case, the same as looking up array[i], which is O(1).
The reason it is a speed up is because you don't actually have to traverse your structure to look something up, you just jump right to the place where you expect it to be.

Minimal perfect hash for N number of unknown keys

I have two unsorted arrays of 32-bit unsigned integers, size N1 and N2, respectively. Each array may contain duplicates. I would like to map each value (2^32 possible keys) to a spot in a byte-array of size (N1 + N2) to record frequencies of each key. Duplicate key values should map to the same position in this array. Additionally, the frequency of each integer won't go above 100 (which is why I chose a byte-array to record each key's frequency to save space); if the max possible frequency were to go above this, I would simply change the byte-array to an array of shorts or something.
In the end, I need an array of size N1 + N2 -- not necessarily all entries will be used, as duplicates may have been encountered -- with frequencies of each unique key value. Worst case scenario, only one byte entry will be used (e.g. all values in both arrays are the same) leaving ((N1 + N2) - 1) entries unused. Best case scenario, all byte-entries are used.
From what I understand, I need to find a minimally perfect hashing function to map a known number of unknown keys (N1 + N2; all ranging from 0 - 2^32) to a known number of spots (N1 + N2). I was able to find a few other posts, but both answers basically said use gperf:
Is it possible to make a minimal perfect hash function in this situation?
Minimal perfect hash function
The second one (Minimal perfect hash function) is exactly what I'm attempting to do.
Rather than expecting source code from an answer (I'm using C by the way), I'd much prefer an explanation of how to go about creating a minimally perfect hashing function for N-number of any possible positive integers to N buckets. I could easily do this with a 4 GB array of direct mappings for every possible integer with lots of unused space, but I'd rather try to reduce this massive inefficiency of space. I'm also hoping to not use any external libraries, mostly for educational purposes to learn more about hashing, itself.
This is clearly impossible. If you have N numbers, there's no way to come up with a function which will hash them all to distinct values in the range [0, N) unless you know what those numbers are going to be beforehand. Otherwise, given any such function (with N < 2^32, of course), there will be at least one pair of integers such that both of those integers hash to the same value, so that function won't be perfect if those integers both show up in the input.
If you relax the conditions to allow the function to be created on the fly, this becomes possible, but only in a really trivial and useless way. Namely, a hash function could build itself up as it goes by recording each number that's fed into it and generating a new unique output for each one (say, counting up from 0). But such a function would need a hash table (or something equivalent) as part of its implementation, so it'd certainly be no use in implementing a hash table!
According to the Pigeonhole Principle, you will have "hash slots" occupied by more than one number. In other words: different numbers will "hash" to the same value.
Now, I wonder if you could benefit from a Bloom Filter. From Wikipedia:
False positive matches are possible, but false negatives are not; i.e.
a query returns either "possibly in set" or "definitely not in set".
If something is "definitely" not in the set of keys, you can move on (its frequency is one), and if it possibly is in the set, then you process it further to accumulate its actual statistic.
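A minimal sketch of the Bloom-filter idea in C, using two arbitrary integer mixers as stand-in hash functions and a fixed 2 MB bit array (all sizes and constants here are illustrative and would need tuning for real use):

#include <stdint.h>

#define BLOOM_BITS (1u << 24)                 /* ~16M bits = 2 MB */

static uint8_t bloom[BLOOM_BITS / 8];

/* Two cheap integer mixers standing in for independent hash functions. */
static uint32_t h1(uint32_t x) { x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16; return x; }
static uint32_t h2(uint32_t x) { x ^= x >> 15; x *= 0x2c1b3c6du; x ^= x >> 12; return x; }

static void bloom_add(uint32_t key)
{
    uint32_t a = h1(key) % BLOOM_BITS, b = h2(key) % BLOOM_BITS;
    bloom[a / 8] |= 1u << (a % 8);
    bloom[b / 8] |= 1u << (b % 8);
}

/* 0 means "definitely not seen before"; 1 means "possibly seen". */
static int bloom_maybe_contains(uint32_t key)
{
    uint32_t a = h1(key) % BLOOM_BITS, b = h2(key) % BLOOM_BITS;
    return (bloom[a / 8] >> (a % 8) & 1) && (bloom[b / 8] >> (b % 8) & 1);
}

In the frequency-counting setting, a key the filter has definitely not seen can be recorded with frequency 1 and added to the filter; only keys reported as "possibly seen" need the more expensive exact accounting.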

Categorizing data based on the data's signature

Let us say I have some large collection of rows of data, where each element in the row is a (key, value) pair:
1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
I would like the ability to run some computation that would allow me to determine for a new row, what is the row that is "most similar" to this row?
The most direct way I could think of finding the "most similar" row for any particular row is to directly compare said row against all other rows. This is obviously computationally very expensive.
I am looking for a solution of the following form.
A function that can take a row, and generate some derivative integer for that row. This returned integer would be a sort of "signature" of the row. The important property of this signature is that if two rows are very "similar" they would generate very close integers, if rows are very "different", they would generate distant integers. Obviously, if they are identical rows they would generate the same signature.
I could then take these generated signatures, with the index of the row they point to, and sort them all by their signatures. I would keep this data structure so that I can do fast lookups. Call it database B.
When I have a new row, I wish to know which existent row in database B is most similar, I would:
Generate a signature for the new row
Binary search through the sorted list of (signature, index) pairs in database B for the closest match
Return the closest matching (could be a perfect match) row in database B.
I know there is a lot of hand waving in this question. My problem is that I do not actually know what the function would be that would generate this signature. I see Levenshtein distances, but those represent the transformation cost, not so much a signature. I see that I could try lossy compression: two things might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this.
Thank you.
EDIT: This is my original answer, which we will call Case 1, where there is no precedence to the keys
You cannot do it as a sorted integer because that is one dimensional and your data is multi-dimensional. So "nearness" in that sense cannot be established on a line.
Your example shows bird, fish and soda for all 3 lines. Are the keys fixed and known? If they are not, then your first step is to hash the keys of a row to establish rows that have the same keys.
For the values, consider this as a poor man's Saturday Night similarity trick. Hash the values, any two rows that match on that hash are an exact match and represent the same "spot", zero distance.
If N is the number of key/value pairs:
The closest non-exact "nearness" would mean matching N-1 out of N values. So you generate N more hashes, each one dropping out one of the values. Any two rows that match on those hashes have N-1 out of N values in common.
The next closest non-exact "nearness" would mean matching N-2 out of N values. So you generate N*(N-1)/2 more hashes, this time each hash leaving out a combination of two values. Any two rows that match on those hashes have N-2 out of N values in common.
So you can see where this is going. At the logical extreme you end up with 2^N hashes, not very savory, but I'm assuming you would not go that far because you reach a point where too few matching values would be considered too "far" apart to be worth considering.
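A small C sketch of the exact-match and drop-one-value hashes (FNV-1a per value, XOR to combine; note that XOR ignores which key a value belongs to, so mix the key into each value hash if column position matters):

#include <stdint.h>
#include <stddef.h>

/* 64-bit FNV-1a hash of a string. */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* out[0] = hash of the whole row; out[1..n] = hash with value i-1 dropped.
   out must have room for n + 1 entries. */
void row_signatures(const char *const *values, size_t n, uint64_t *out)
{
    uint64_t full = 0;
    for (size_t i = 0; i < n; i++)
        full ^= fnv1a(values[i]);

    out[0] = full;
    for (size_t i = 0; i < n; i++)
        out[i + 1] = full ^ fnv1a(values[i]);   /* XOR removes that value's hash */
}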
EDIT: To see how we cannot escape dimensionality, consider just two keys, with values 1-9. Plot all possible values on a graph. We see that {1,1} is close to {2,2}, but also that {5,6} is close to {6,7}. So we get a brainstorm and say, Aha! I'll calculate each point's distance from the origin using the Pythagorean theorem! This will make both {1,1} and {2,2} easy to detect. But then the two points {1,9} and {9,1} will get the same number, even though they are as far apart as they can be on the graph. So we say, OK, I need to add the angle for each. Two points at the same distance are distinguished by their angle, two points at the same angle are distinguished by their distance. But of course now we've plotted them on two dimensions.
EDIT: Case 2 would be when there is precedence to the keys, when key 1 is more significant than key 2, which is more significant than key 3, etc. In this case, if the allowed values were A-Z, you would string the values together as if they were digits to get a sortable value. ABC is very close to ABD, but very far from BBD.
If you had a lot of data, and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.
A simpler, but less accurate way of doing it is using Soundex, which is available for many languages. You can store the soundex (which will be a short string, not an integer I'm afraid), and look for exact matches to the soundex, which should point to similar rows.
I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is doing a checksum on each individual tuple, and comparing the checksums for the new row to the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.
