Categorizing data based on the data's signature - database

Let us say I have some large collection of rows of data, where each element in the row is a (key, value) pair:
1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
I would like the ability to run some computation that would allow me to determine for a new row, what is the row that is "most similar" to this row?
The most direct way I could think of finding the "most similar" row for any particular row is to directly compare said row against all other rows. This is obviously computationally very expensive.
I am looking for a solution of the following form.
A function that can take a row, and generate some derivative integer for that row. This returned integer would be a sort of "signature" of the row. The important property of this signature is that if two rows are very "similar" they would generate very close integers, if rows are very "different", they would generate distant integers. Obviously, if they are identical rows they would generate the same signature.
I could then take these generated signatures, with the index of the row they point to, and sort them all by their signatures. This data structure I would keep so that I can do fast lookups. Call it database B.
When I have a new row, I wish to know which existent row in database B is most similar, I would:
Generate a signature for the new row
Binary search through the sorted list of (signature, index) in database B for the closest match
Return the closest matching (could be a perfect match) row in database B.
I know there is a lot of hand waving in this question. My problem is that I do not actually know what the function would be that would generate this signature. I see Levenshtein distances, but those represent the transformation cost, not so much the signature. I see that I could try lossy compression: two things might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this.
Thank you.

EDIT: This is my original answer, which we will call Case 1, where there is no precedence to the keys
You cannot do it as a sorted integer because that is one dimensional and your data is multi-dimensional. So "nearness" in that sense cannot be established on a line.
Your example shows bird, fish and soda for all 3 lines. Are the keys fixed and known? If they are not, then your first step is to hash the keys of a row to establish rows that have the same keys.
For the values, consider this as a poor man's Saturday Night similarity trick. Hash the values, any two rows that match on that hash are an exact match and represent the same "spot", zero distance.
If N is the number of key/value pairs:
The closest non-exact "nearness" would mean matching N-1 out of N values. So you generate N more hashes, each one dropping out one of the values. Any two rows that match on those hashes have N-1 out of N values in common.
The next closest non-exact "nearness" would mean matching N-2 out of N values. So you generate N(N-1)/2 more hashes, one for each pair of values you can leave out. Any two rows that match on those hashes have N-2 out of N values in common.
So you can see where this is going. At the logical extreme you end up with 2^N hashes, not very savory, but I'm assuming you would not go that far because you reach a point where too few matching values would be considered too "far" to be worth considering.
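The drop-out hashing described above could be sketched roughly like this. It is only an illustration: the names `row_signatures` and `max_dropped` are made up here, SHA-1 is an arbitrary choice of hash, and the keys are assumed to be fixed and present in every row.

```python
import hashlib
from itertools import combinations


def row_signatures(row, max_dropped=1):
    """Return the set of hashes for a row, leaving out up to max_dropped pairs.

    row is a list of (key, value) pairs. Hypothetical helper, not a library call.
    """
    items = sorted(row)  # fixed order so that equal rows hash equally
    signatures = set()
    for k in range(max_dropped + 1):
        # Every way of leaving out k of the N pairs.
        for dropped in combinations(range(len(items)), k):
            kept = [pair for i, pair in enumerate(items) if i not in dropped]
            digest = hashlib.sha1(repr(kept).encode("utf-8")).hexdigest()
            signatures.add(digest)
    return signatures


a = [("bird", "eagle"), ("fish", "cod"), ("soda", "coke")]
b = [("bird", "eagle"), ("fish", "cod"), ("soda", "pepsi")]
c = [("bird", "lark"), ("fish", "bass"), ("soda", "fanta")]

# a and b agree on 2 of 3 values, so they share a drop-one hash; a and c share none.
shares_ab = bool(row_signatures(a) & row_signatures(b))
shares_ac = bool(row_signatures(a) & row_signatures(c))
```

Storing every signature in an ordinary hash index (signature -> row ids) would then turn "find rows with at least N-1 matching values" into a handful of exact lookups.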
EDIT: To see how we cannot escape dimensionality, consider just two keys, with values 1-9. Plot all possible values on a graph. We see that {1,1} is close to {2,2}, but also that {5,6} is close to {6,7}. So we get a brainstorm, we say, Aha! I'll calculate each point's distance from the origin using the Pythagorean theorem! This will make both {1,1} and {2,2} easy to detect. But then the two points {1,9} and {9,1} will get the same number, even though they are as far apart as they can be on the graph. So we say, ok, I need to add the angle for each. Two points at the same distance are distinguished by their angle, two points at the same angle are distinguished by their distance. But of course now we've plotted them on two dimensions.
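A quick way to see the collision, as a small sketch (the helper name `polar` is just for illustration): two mirrored corner points such as (1, 9) and (9, 1) get the same distance from the origin, and only the angle tells them apart.

```python
import math


def polar(point):
    """Return (distance from origin, angle in radians) for a 2D point."""
    x, y = point
    return math.hypot(x, y), math.atan2(y, x)


dist_a, angle_a = polar((1, 9))
dist_b, angle_b = polar((9, 1))

same_distance = math.isclose(dist_a, dist_b)   # the one-number signature collides
different_angle = not math.isclose(angle_a, angle_b)  # a second number is needed
```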
EDIT: Case 2 would be when there is precedence to the keys, when key 1 is more significant than key 2, which is more significant than key 3, etc. In this case, if the allowed values were A-Z, you would string the values together as if they were digits to get a sortable value. ABC is very close to ABD, but very far from BBD.
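For Case 2, a rough sketch of the sortable-value idea might look like the following. The function names are invented for the example, and "closest" is assumed to mean "shares the longest run of leading values", which is what precedence among keys implies.

```python
import bisect


def precedence_key(row, key_order):
    """String the values together in order of key significance."""
    values = dict(row)
    return tuple(values[k] for k in key_order)


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def nearest_row(index, key):
    """index is a sorted list of (key_tuple, row_id); return the closest row_id."""
    pos = bisect.bisect_left(index, (key,))
    # Only the two sorted neighbours can share the longest prefix with key.
    candidates = index[max(pos - 1, 0):pos + 1]
    best = max(candidates, key=lambda entry: common_prefix_len(entry[0], key))
    return best[1]


key_order = ["k1", "k2", "k3"]  # hypothetical keys, most significant first
rows = [
    [("k1", "A"), ("k2", "B"), ("k3", "C")],
    [("k1", "B"), ("k2", "B"), ("k3", "D")],
    [("k1", "A"), ("k2", "C"), ("k3", "A")],
]
index = sorted((precedence_key(r, key_order), i) for i, r in enumerate(rows))
closest = nearest_row(index, ("A", "B", "D"))  # ABD is nearest to ABC, row 0
```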

If you had a lot of data, and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.
A simpler, but less accurate way of doing it is using Soundex, which is available for many languages. You can store the soundex (which will be a short string, not an integer I'm afraid), and look for exact matches to the soundex, which should point to similar rows.
I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is doing a checksum on each individual tuple, and comparing the checksums for the new row to the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.

Related

Is there an algorithm to check how similar are two 2d data sets?

I need help
First of all, I'm not checking whether the 2 data sets are equal (A==B), or whether they have similar features, because they are similar.
I have two 2D data sets (they are actually 2 vector fields): one is 'fixed' and the other is 'experimental'. I want to know HOW MUCH they agree. My thought is to get a number per point, in a range of values (0 to 1, including decimals), that says how equal they are at that point. The goal is an iterative algorithm that finds the experimental data set that best agrees with the fixed one... but first I need to find "how equal they are".
It's like measuring the error in order to minimize it.
If one has |A| = |B| and the same (or close) sample points, one could simply use the standard deviation of the pairwise differences |a-b|, where a \in A, b \in B. One doesn't need a separate temporary array if one uses a stable, on-line algorithm like Welford's (just take the square root at the end to get the standard deviation).
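As a minimal sketch of that suggestion, assuming the two data sets are flattened to equal-length sequences of scalars (for vector fields one would replace `abs(a - b)` with the norm of the difference), Welford's update can be written as follows. The function name is made up for the example.

```python
import math


def difference_spread(a_values, b_values):
    """Return (mean, sample standard deviation) of |a - b| in one pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for a, b in zip(a_values, b_values):
        d = abs(a - b)
        count += 1
        delta = d - mean
        mean += delta / count
        m2 += delta * (d - mean)
    if count < 2:
        return mean, 0.0
    return mean, math.sqrt(m2 / (count - 1))


mean_diff, std_diff = difference_spread([1, 2, 3, 4], [1, 1, 1, 1])
```

Both numbers being near zero would mean the experimental set agrees closely with the fixed one; either could serve as the error to minimize.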

Is there a way to map a list of integers to a unique number or a unique hash?

The permutation of the list of integers should also be preserved in the hash -- i.e., lists containing the same numbers in a different order should have different hashes.
One way to do this would be to concatenate the list of integers into a string, but this could be an expensive comparison test if the list is massive.
Context: If I already have 5 large arrays 'analyzed' and hashed away, I would be able to quickly check whether an incoming array is new or not.
https://en.wikipedia.org/wiki/Pigeonhole_principle
"In mathematics, the pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item"
It is certainly possible to create a unique number, it's just that it's hilariously huge.
Consider
[1,2,3]
A simple list, but to make sure we have enough holes for our pigeons, we would need to have space for the largest integer in each slot, so assuming 4 bytes per item, we would need a 12 byte integer to store the hash uniquely, or 2^96 (~7.9228163e+28) different values. And that's only 3 integers.
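The arithmetic is easy to check. In this small sketch (the function name is just for illustration), a list of 3 four-byte items needs 12 bytes, i.e. 96 bits; a figure around 3.4e+38 would correspond to 16 bytes (128 bits), i.e. 4 items.

```python
def distinct_lists(length, bytes_per_item=4):
    """Number of distinct lists of the given length with fixed-width items."""
    bits = 8 * bytes_per_item * length
    return 2 ** bits


three_items = distinct_lists(3)  # 2**96
four_items = distinct_lists(4)   # 2**128
```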
No, an efficient hash is rarely unique, but a good hash is unlikely to have collisions for similar values.
To answer your question about checking for existence, consider the following:
If you have an array of n items, in order to hash it, you need to take n steps. In order to check for existence, you need, at worst, n steps to check each item in turn.
In either case, you are going to be spending about the same amount of time comparing arrays.
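That said, if the already-analyzed arrays are hashed once up front, an incoming array only has to be hashed once rather than compared against each stored array. A possible sketch, assuming the integers fit in 64 bits and choosing SHA-256 arbitrarily (the function name is hypothetical):

```python
import hashlib
import struct


def list_digest(values):
    """Order-sensitive digest of a list of integers (assumed to fit in 64 bits)."""
    h = hashlib.sha256()
    for v in values:
        # Fixed-width packing keeps [1, 23] and [12, 3] from looking alike.
        h.update(struct.pack("<q", v))
    return h.hexdigest()


seen = {list_digest([1, 2, 3]), list_digest([7, 8, 9])}
is_new_same_order = list_digest([1, 2, 3]) not in seen   # already seen
is_new_permuted = list_digest([3, 2, 1]) not in seen     # a permutation counts as new
```

A digest match is not a proof of equality, so if certainty matters, a full comparison against the one matching array can follow.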
An array seems to be a perfect choice, since the index differentiates between elements; alternatively, you can use a list of elements where each element has an index value assigned to it just before insertion.
Never use a String as a list structure, because it has its own properties, like immutability (in the case of Java).

how to write order preserving minimal perfect hash for integer keys?

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4 byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want to have a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer but there wont be more than 2 million in total. The number will vary as keys (and corresponding array entries) will be deleted but new keys will always be higher numbered than the previous highest numbered key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take in the low 100s of ns, though I expect deletion to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and compile with gcc using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, then binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", binary search still works without extra effort or space (although removing an element naturally forces you to shift all the elements to the right of the deleted one).
You only have to give the binary search algorithm three things: the array, the size of the array, and the desired key.
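To illustrate the idea (in Python for brevity, even though the question asks for C; the function name is made up), the key-to-index translation is just a binary search over the sorted key array, and deleting a key automatically shifts the indices of the larger keys down by one:

```python
import bisect


def key_to_index(keys, key):
    """Return the position of key in the sorted list keys, or -1 if absent."""
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return pos
    return -1


keys = [56, 69, 3493, 49956, 345678, 345679]
before = key_to_index(keys, 3493)   # third-lowest key
keys.remove(69)                     # O(n) shift, as noted above
after = key_to_index(keys, 3493)    # now the second-lowest key
```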
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key-array, and its index in the key-array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
You don't need an order preserving minimal perfect hash, because any old hash would do. You don't want to use a 4GB array, but with 2 million items, you wouldn't mind using 3 million lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected data. For example, if you expect 2M of items, then choose a prime around 3M.
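As a toy sketch of "remainder when dividing by a prime", here is an open-addressing table with linear probing. The class and its methods are invented for the example, deletion is left out, and the table size of 13 is only small for demonstration; in practice it would be a prime somewhat larger than the expected number of keys.

```python
class IntHashTable:
    """Minimal integer-keyed hash table using key % size and linear probing."""

    def __init__(self, size):
        self.size = size  # should be a prime larger than the expected item count
        self.keys = [None] * size
        self.values = [None] * size
        self.count = 0

    def _slot(self, key):
        i = key % self.size
        # Walk forward until we find the key or an empty slot.
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.size
        return i

    def put(self, key, value):
        i = self._slot(key)
        if self.keys[i] is None:
            if self.count + 1 >= self.size:
                raise OverflowError("table full")  # always keep one empty slot
            self.keys[i] = key
            self.count += 1
        self.values[i] = value

    def get(self, key, default=None):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else default


table = IntHashTable(13)
for position, key in enumerate([56, 69, 3493]):
    table.put(key, position)  # 56 and 69 collide modulo 13 and get probed apart
```

Note that a plain hash map like this stores whatever value you assign to a key; it does not renumber the remaining keys when one is deleted.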

Sorting n sets of data into one

I have n arrays of data, each of these arrays is sorted by the same criteria.
The number of arrays will, in almost all cases, not exceed 10, so it is a relatively small number. In each array, however, can be a large number of objects, that should be treated as infinite for the algorithm I am looking for.
I now want to treat these arrays as if they are one array. However, I do need a way, to retrieve objects in a given range as fast as possible and without touching all objects before the range and/or all objects after the range. Therefore it is not an option to iterate over all objects and store them in one single array. Fetches with low start values are also more likely than fetches with a high start value. So e.g. fetching objects [20,40) is much more likely than fetching objects [1000,1020), but it could happen.
The range itself will be pretty small, around 20 objects, or can be increased, if relevant for the performance, as long as this does not hit the limits of memory. So I would guess a couple of hundred objects would be fine as well.
Example:
3 arrays, each containing a couple of thousand entries. I now want to get the overall objects in the range [60, 80) without touching either the first 60 objects in each set or all the objects that come after object 80 in the array.
I am thinking about some sort of combined, modified binary search. My current idea is something like the following (note, that this is not fully thought through yet, it is just an idea):
get object 60 of each array - the beginning of the range can not be after that, as every single array would already meet the requirements
use these objects as the maximum value for the binary search in every array
from one of the arrays, get the centered object (e.g. 30)
with a binary search in all the other arrays, try to find the object in each array, that would be before, but as close as possible to the picked object.
we now have 3 objects, e.g. object 15, 10 and 20. The sum of these objects would be 45. So there are 42 objects in front, which is more than the beginning of the range we are looking for (30). We continue our binary search in the remaining left half of one of the arrays
if we instead get a value where the sum is smaller than the beginning of the range we are looking for, we continue our search on the right.
at some point we will hit object 30. From there on, we can simply add the objects from each array, one by one, with an insertion sort until we hit the range length.
My questions are:
Is there any name for this kind of algorithm I described here?
Are there other algorithms or ideas for this problem, that might be better suited for this issue?
Thanks in advance for any idea or help!
People usually call this problem something like "selection in the union of multiple sorted arrays". One of the questions in the sidebar is about the special case of two sorted arrays, and this question is about the general case. Several comparison-based approaches appear in the combined answers; they more or less have to determine where the lower endpoint in each individual array is. Your binary search answer is one of the better approaches; there's an asymptotically faster algorithm due to Frederickson and Johnson, but it's complicated and not obviously an improvement for small ranks.
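One simple comparison-based way to do this (not the Frederickson and Johnson algorithm, and not exactly the procedure from the question) is to binary search for the item at overall rank `start`, derive a start index in each array from it, and then merge lazily from those indices. The function names below are made up, and the arrays are assumed to be sorted lists of mutually comparable items.

```python
import heapq
from bisect import bisect_left, bisect_right
from itertools import islice


def split_points(arrays, k):
    """Return one start index per array so the prefixes hold the k smallest items."""
    if k >= sum(len(a) for a in arrays):
        return [len(a) for a in arrays]
    for arr in arrays:
        lo, hi = 0, len(arr)
        while lo < hi:
            mid = (lo + hi) // 2
            v = arr[mid]
            less = sum(bisect_left(a, v) for a in arrays)
            less_equal = sum(bisect_right(a, v) for a in arrays)
            if less_equal <= k:
                lo = mid + 1      # v ranks before position k
            elif less > k:
                hi = mid          # v ranks after position k
            else:
                # v is the item at overall rank k; hand out ties left to right.
                starts = [bisect_left(a, v) for a in arrays]
                extra = k - less
                for i, a in enumerate(arrays):
                    take = min(extra, bisect_right(a, v) - starts[i])
                    starts[i] += take
                    extra -= take
                return starts
    raise ValueError("arrays must be sorted")


def tail(arr, start):
    """Yield arr[start:] lazily, without visiting earlier items."""
    for i in range(start, len(arr)):
        yield arr[i]


def fetch_range(arrays, start, stop):
    starts = split_points(arrays, start)
    merged = heapq.merge(*(tail(a, s) for a, s in zip(arrays, starts)))
    return list(islice(merged, stop - start))


arrays = [list(range(0, 3000, 3)), list(range(1, 3000, 3)), list(range(2, 3000, 3))]
window = fetch_range(arrays, 60, 80)
```

Only O(log n) items per array are inspected by the searches, plus roughly the range length during the merge, so the objects before and after the range are never scanned.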

Finding k different keys using binary search in an array of n elements

Say, I have a sorted array of n elements. I want to find 2 different keys k1 and k2 in this array using Binary search.
A basic solution would be to apply binary search to each key separately: two calls for 2 keys, for a time complexity of 2 log n.
Can we solve this problem using any other approach(es) for different k keys, k < n ?
Each search you complete can be used to subdivide the input to make it more efficient. For example suppose the element corresponding to k1 is at index i1. If k2 > k1 you can restrict the second search to i1..n, otherwise restrict it to 0..i1.
Best case is when your search keys are sorted also, so every new search can begin where the last one was found.
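A small sketch of that narrowing, assuming the search keys can be sorted first (the function name is invented): each search starts at the position where the previous one ended.

```python
from bisect import bisect_left


def find_many(arr, keys):
    """Map each key to its index in the sorted list arr, or None if absent."""
    result = {}
    lo = 0
    for key in sorted(keys):
        # Everything before lo is smaller than the previous key, so skip it.
        lo = bisect_left(arr, key, lo)
        found = lo < len(arr) and arr[lo] == key
        result[key] = lo if found else None
    return result


arr = [2, 3, 5, 7, 11, 13]
positions = find_many(arr, [11, 4, 2])
```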
You can reduce the real complexity (although it will still be the same big O) by walking the shared search path once. That is, start the binary search until the element you're at is between the two items you are looking for. At that point, spawn a thread to continue the binary search for one element in the range past the pivot element you're at and spawn a thread to continue the binary search for the other element in the range before the pivot element you're at. Return both results. :-)
EDIT:
As Oli Charlesworth mentioned in his comment, you did ask for an arbitrary number of elements. This same logic can be extended to an arbitrary number of search keys though. Here is an example:
You have an array of search keys like so:
searchKeys = ['findme1', 'findme2', ...]
You have a key-value data structure that maps a search key to the value found:
keyToValue = {'findme1': 'foundme1', 'findme2': 'foundme2', 'findme3': 'NOT_FOUND_VALUE'}
Now, following the same logic as before this EDIT, you can pass a "pruned" searchKeys array on each thread spawn where the keys diverge at the pivot. Each time you find a value for the given key, you update the keyToValue map. When there are no more ranges to search but still values in the searchKeys array, you can assume those keys are not to be found and you can update the mapping to signify that in some way (some null-like value perhaps?). When all threads have been joined (or by use of a counter), you return the mapping. The big win here is that you did not have to repeat the initial search logic that any two keys may share.
Second EDIT:
As Mark has added in his answer, sorting the search keys allows you to only have to look at the first item in the key range.
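Leaving the threads out, the shared-path idea could be sketched as a single recursive function: at each pivot the sorted keys are split into those below and those above it, and each half only continues into its own side of the array. The names here are made up, and `None` stands in for the "not found" marker.

```python
from bisect import bisect_left


def find_shared(arr, search_keys):
    """Map each search key to its index in sorted arr (None if not found)."""
    keys = sorted(search_keys)
    result = {key: None for key in keys}

    def search(lo, hi, klo, khi):
        # arr[lo:hi] is the remaining range; keys[klo:khi] may still be in it.
        if lo >= hi or klo >= khi:
            return
        mid = (lo + hi) // 2
        pivot = arr[mid]
        left_end = bisect_left(keys, pivot, klo, khi)  # keys below the pivot
        right_start = left_end
        while right_start < khi and keys[right_start] == pivot:
            result[pivot] = mid
            right_start += 1
        search(lo, mid, klo, left_end)            # "pruned" keys go left
        search(mid + 1, hi, right_start, khi)     # the rest go right

    search(0, len(arr), 0, len(keys))
    return result


arr = [2, 3, 5, 7, 11, 13]
found = find_shared(arr, [11, 4, 2])
```

The two recursive calls are independent of each other, which is what would make it possible to hand them to separate threads as described above.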
You can find academic articles calculating the complexity of different schemes for the general case, which is merging two sorted sequences of possibly very different lengths using the minimum number of comparisons. The paper at http://www.math.cmu.edu/~af1p/Texfiles/HL.pdf analyses one of the best known schemes, by Hwang and Lin, and has references to other schemes, and to the original paper by Hwang and Lin.
It looks a lot like a merge which steps through each item of the smaller list, skipping along the larger list with a stepsize that is the ratio of the sizes of the two lists. If it finds out that it has stepped too far along the large list it can use binary search to find a match amongst the values it has stepped over. If it has not stepped far enough, it takes another step.
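A much simplified sketch of that stepping idea, not the Hwang and Lin algorithm as published: for each item of the smaller sorted list, stride along the larger list by roughly the size ratio, then binary search inside the block that was stepped over. The function name is hypothetical.

```python
from bisect import bisect_left


def stepped_lookup(small, large):
    """For each item of sorted small, return its index in sorted large, or None."""
    step = max(1, len(large) // max(1, len(small)))
    positions = []
    lo = 0  # everything in large[:lo] is smaller than the current item
    for item in small:
        probe = lo
        # Stride forward until we reach or pass the item, or run off the end.
        while probe < len(large) and large[probe] < item:
            lo = probe + 1
            probe += step
        hi = min(probe, len(large))
        # The item, if present, lies in the block that was stepped over.
        pos = bisect_left(large, item, lo, hi)
        found = pos < len(large) and large[pos] == item
        positions.append(pos if found else None)
        lo = pos
    return positions


large = list(range(0, 100, 2))
hits = stepped_lookup([4, 5, 50, 98, 120], large)
```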

Resources