Data search with partial match - database

I have a database with columns A, B, C and row data, for example:
A B C
test1 2.0123 3.0123
test2 2.1234 3.1234
In my program I would like to search for the best-fit match in the database.
For example, if I key in the values b=2.133, c=3.1342, it should return test2. How can I do that?
Please give me some ideas or keywords to google. I was thinking of a searching algorithm, but searching algorithms seem to be about exact matches rather than finding the best fit. Or is this a bin-packing problem? How would I solve it with that?
I have about 5 columns (B, C, D, E, F) and want to find the closest matching value.

It seems like you are looking for a k-d tree that maps a 2-dimensional space (attributes B, C, which form the key) to a value (attribute A).
A k-d tree allows efficient lookup of the nearest neighbor of a given query point, which seems to be exactly what you are after.
Note that the same data structure will efficiently handle more attributes if needed, by increasing the dimensionality of the key.
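For illustration, here is a minimal sketch of that lookup, assuming the rows fit in memory and using SciPy's KDTree; the sample values are the ones from the question:

# A minimal sketch (assumed), not part of the original answer.
from scipy.spatial import KDTree

labels = ["test1", "test2"]                      # column A
points = [(2.0123, 3.0123), (2.1234, 3.1234)]    # columns B, C (add D, E, F the same way)

tree = KDTree(points)                    # build once
dist, idx = tree.query((2.133, 3.1342))  # nearest neighbor of the query point
print(labels[idx])                       # -> "test2"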

Take a look at this (nearest neighbor search):
http://en.wikipedia.org/wiki/Nearest_neighbor_search
The simplest algorithm there (linear search) would look something like this in SQL (for b=2.133, c=3.1342):
SELECT A FROM tablename ORDER BY SQRT(POW(B-2.133,2)+POW(C-3.1342,2)) LIMIT 1;
i.e. take the row with the minimum Euclidean distance from the query point, sqrt((b1-b2)^2+(c1-c2)^2).
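For the asker's five-column case, the same linear scan can also be done in memory; a minimal sketch follows, where the sample values for columns D-F are made up for illustration:

import math

# A sketch only: rows as (label, (B, C, D, E, F)); the D-F values are invented.
rows = [
    ("test1", (2.0123, 3.0123, 1.0, 1.0, 1.0)),
    ("test2", (2.1234, 3.1234, 1.0, 1.0, 1.0)),
]

def best_match(query):
    # O(n): compare the query against every row and keep the closest one.
    return min(rows, key=lambda r: math.dist(r[1], query))[0]

print(best_match((2.133, 3.1342, 1.0, 1.0, 1.0)))  # -> "test2"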

Related

Is there a way to search for multiple strings within a range of cells, and return the range sorted by the first column?

I'm looking for a way to search within the cells in a range for a specific text value, and have them returned sorted by those values along with the corresponding cell in the next column over. For example, in column A I have the description of the item, and in Column B there is a number value:
(Screenshot: unsorted data)
These describe the particular formats (Col A) and how many of each (Col B). So what I'm looking to do is sort by multiple formats, "K-7.75" and "K-20L" for example, and then return that cell along with the corresponding Column B value. Is there a way I can write this all in one formula, searching for multiple strings within Col A? Here's what I'd like the output to look like, with the formulas being in A1, A9, and A13:
(Screenshot: sorted output)
I know there is a filter option, but I can't seem to use multiple strings in it, and I'm looking to eliminate steps in reordering this info. I feel like there's a way to do this with an array formula and SEARCH or SORT or something, but I'm stuck. Thank you in advance for any help.
Solved thanks to Kessy:
Do you know in advance the conditions you will get, like "K-7.75"? If that is the case, you can set the following filter function every x rows or every x columns: =FILTER(A:B, REGEXMATCH(A:A,"K-7.75")) – Kessy
=FILTER(A:B, REGEXMATCH(A:A,"K-5.16|1/6|K-20|20L|K-7.75|1/4|K-30|30L")), using the "|" character as OR logic. Thank you for putting me on the right track! – MisterSixer

Index/Match with three criteria

I have searched and searched, but I can only find solutions for INDEX/MATCH with two criteria.
Does anyone have a solution for INDEX/MATCH with three criteria?
As a sample of my actual data, I would like to INDEX/MATCH the year, type, and name to find the data in the month column.
You can match an unlimited number of criteria by using SUMPRODUCT() to find the proper row:
=INDEX(D2:D9,SUMPRODUCT((A2:A9=2015)*(B2:B9="Revenue")*(C2:C9="Name 1")*ROW(2:9))-1)
EDIT#1:
Scott's comment is correct! The advantage of the SUMPRODUCT() approach is that it is not an array formula and can be expanded to handle many criteria. The disadvantage is that it will only work if there is exactly 1 matching row. The use of SUMPRODUCT() is explained very well here:
xlDynamic Paper
Because your question has numerical data, you can simply use SUMIFS.
SUMIFS provides the sum from a particular range [column D in this case], where any number of other ranges of the same size [the other columns, in this case] each match a particular criteria. For text results, one of the other recommended solutions will be needed.
In addition to being a little cleaner, this has the attribute [could be good or bad depending on your needs] that it will pick up multiple rows of data if multiples exist, and sum them all. If you expect unique rows, that's bad, because it won't warn you there are multiples.
The formula in your case would be as follows [obviously, you should adjust the formulas to reference your ID cells, and pick up the appropriate columns]:
=SUMIFS(D:D,A:A,2015,B:B,"Revenue",C:C,"Name1")
What this does is:
Sum column D, for each row where: (1) column A is the number 2015; (2) column B is the text "Revenue"; AND (3) column C is the word "Name1".
Assuming your data starts in A1 ("Year") and goes to D15 ("????"), you can use this. You basically just join your criteria with &, and then, for the MATCH() lookup range, connect the respective ranges with & as well.
=Index(D2:D9,Match(A15&B15&C15,A2:A9&B2:B9&C2:C9,0))
Enter it with CTRL+SHIFT+ENTER (it is an array formula), and make the references absolute (i.e. $D$2:$D$9); I just didn't, to keep the formula a little easier to read.

Why does having an index actually speed up look-up time?

I've always wondered about why this is the case.
For instance, say I want to find the number 5 located in an array of numbers. I have to compare my desired number against every other single value, to find what I'm looking for.
This is clearly O(N).
But, say for instance, I have an index that I know contains my desired item. I can just jump right to it, right? This is also the case with hashed maps: when I provide a key to look up, the same hash function that determined the key's index position is run on it, which again lets me jump right to its correct index.
But my question is why is that any different than the O(N) lookup time for finding a value in an array through direct comparison?
As far as a naive computer is concerned, shouldn't using an index be the same as looking for a value? Shouldn't the raw operation still be that, as I traverse the structure, I must compare the current index value to the one I know I'm looking for?
It makes a great deal of sense why something like binary search can achieve O(logN), but I still can't intuitively grasp why certain things can be O(1).
What am I missing in my thinking?
Arrays are usually stored as a large block of memory.
If you're looking up an element by index, this allows you to calculate, in O(1), the offset that index has within the block of memory.
Say the array starts at memory address 124 and each element is 10 bytes large; then you know the element at index 5 is at address 124 + 10*5 = 174.
Binary search actually (usually) builds on this (since by-index lookup is just O(1) for an array): you start off in the middle and do a by-index lookup to get that element. Then you look at the element at either the 1/4 or 3/4 position, which again requires a by-index lookup.
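A tiny sketch (an assumed example, not from the answer) of that address arithmetic, to show why indexing is O(1):

BASE_ADDRESS = 124   # where the block of memory starts, as in the example above
ELEMENT_SIZE = 10    # bytes per element

def address_of(index):
    # One multiplication and one addition, regardless of array length: O(1).
    return BASE_ADDRESS + ELEMENT_SIZE * index

print(address_of(5))  # -> 174, matching the worked example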
A HashMap has an array underneath it. When a key/value pair is added to the map, the key's hashCode() is evaluated and normalized so that the value can be placed at its particular index in the array. When two keys' codes normalize to the same index of the map, the entries are appended to a LinkedList.
When you perform a look-up, the key's hashCode() is evaluated and normalized to give the index at which to search. The map then traverses the linked list at that index until it finds the key, and returns the associated value.
In the best case, this look-up time is the same as looking up array[i], which is O(1).
The reason it is a speed-up is that you don't actually have to traverse your structure to look something up; you just jump right to the place where you expect it to be.
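A toy sketch (assumed, not the actual java.util.HashMap code) of the bucket-array idea: hash the key, jump to one bucket, and scan only that bucket's chain:

class TinyHashMap:
    def __init__(self, capacity=16):
        self.buckets = [[] for _ in range(capacity)]   # array of chains

    def _index(self, key):
        return hash(key) % len(self.buckets)           # "normalize" the hash code

    def put(self, key, value):
        self.buckets[self._index(key)].append((key, value))

    def get(self, key):
        # Jump straight to one bucket, then scan its (usually short) chain.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

m = TinyHashMap()
m.put("apple", 5)
print(m.get("apple"))   # -> 5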

Finding k different keys using binary search in an array of n elements

Say I have a sorted array of n elements. I want to find 2 different keys, k1 and k2, in this array using binary search.
A basic solution would be to apply binary search to each of them separately, i.e. two calls for 2 keys, which keeps the time complexity at 2·O(log n).
Can we solve this problem using any other approach(es) for k different keys, k < n?
Each search you complete can be used to subdivide the input and make the next one more efficient. For example, suppose the element corresponding to k1 is at index i1. If k2 > k1 you can restrict the second search to i1..n, otherwise restrict it to 0..i1.
The best case is when your search keys are also sorted, so every new search can begin where the last one was found.
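A minimal sketch (assumed, not the answerer's code) of this idea for keys that are already sorted: each hit restricts the range of the next binary search:

import bisect

def find_sorted_keys(arr, keys):
    # arr and keys are both sorted; returns {key: index or None}.
    result, lo = {}, 0
    for key in keys:
        i = bisect.bisect_left(arr, key, lo, len(arr))  # search only arr[lo:]
        result[key] = i if i < len(arr) and arr[i] == key else None
        lo = i                                          # later keys are >= this one
    return result

print(find_sorted_keys([1, 3, 5, 7, 9, 11], [3, 9]))  # -> {3: 1, 9: 4}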
You can reduce the real complexity (although it will still be the same big O) by walking the shared search path once. That is, start the binary search until the element you're at is between the two items you are looking for. At that point, spawn a thread to continue the binary search for one element in the range past the pivot element you're at and spawn a thread to continue the binary search for the other element in the range before the pivot element you're at. Return both results. :-)
EDIT:
As Oli Charlesworth mentioned in his comment, you did ask for an arbitrary number of keys. The same logic can be extended to an arbitrary number of search keys, though. Here is an example:
You have an array of search keys like so:
searchKeys = ['findme1', 'findme2', ...]
You have a key-value data structure that maps each search key to the value found:
keyToValue = {'findme1': 'foundme1', 'findme2': 'foundme2', 'findme3': 'NOT_FOUND_VALUE'}
Now, following the same logic as before this EDIT, you can pass a "pruned" searchKeys array on each thread spawn where the keys diverge at the pivot. Each time you find a value for the given key, you update the keyToValue map. When there are no more ranges to search but still values in the searchKeys array, you can assume those keys are not to be found and you can update the mapping to signify that in some way (some null-like value perhaps?). When all threads have been joined (or by use of a counter), you return the mapping. The big win here is that you did not have to repeat the initial search logic that any two keys may share.
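A sequential sketch (assumed; the answer spawns threads, this version recurses instead) of the pruning idea, where keys that diverge at the pivot are searched only in their half of the range:

def multi_search(arr, keys, lo=0, hi=None, found=None):
    if hi is None:
        hi = len(arr)
    if found is None:
        found = {}
    if not keys:
        return found
    if lo >= hi:
        for k in keys:
            found[k] = None            # range exhausted: key not present
        return found
    mid = (lo + hi) // 2
    pivot = arr[mid]
    if pivot in keys:
        found[pivot] = mid
    left = [k for k in keys if k < pivot]
    right = [k for k in keys if k > pivot]
    multi_search(arr, left, lo, mid, found)        # would be one spawned thread
    multi_search(arr, right, mid + 1, hi, found)   # would be the other
    return found

print(multi_search([1, 3, 5, 7, 9, 11], [3, 8, 11]))  # -> {3: 1, 11: 5, 8: None}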
Second EDIT:
As Mark has added in his answer, sorting the search keys allows you to only have to look at the first item in the key range.
You can find academic articles calculating the complexity of different schemes for the general case, which is merging two sorted sequences of possibly very different lengths using the minimum number of comparisons. The paper at http://www.math.cmu.edu/~af1p/Texfiles/HL.pdf analyses one of the best known schemes, by Hwang and Lin, and has references to other schemes, and to the original paper by Hwang and Lin.
It looks a lot like a merge that steps through each item of the smaller list, skipping along the larger list with a step size that is the ratio of the sizes of the two lists. If it finds that it has stepped too far along the large list, it can use binary search to find a match among the values it has stepped over. If it has not stepped far enough, it takes another step.
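A simplified sketch of that stepping idea (assumed; it is not the exact Hwang-Lin scheme): walk the big sorted list in strides, then binary-search within the span that overshot the current key:

import bisect

def stepped_search(big, keys):
    # big and keys are sorted; stride is roughly len(big) / len(keys).
    step = max(1, len(big) // max(1, len(keys)))
    result, lo = {}, 0
    for key in keys:
        hi = lo
        while hi < len(big) and big[hi] < key:    # stride until we overshoot
            hi += step
        hi = min(hi, len(big))
        i = bisect.bisect_left(big, key, lo, hi)  # binary search in the skipped span
        result[key] = i if i < len(big) and big[i] == key else None
        lo = i
    return result

print(stepped_search(list(range(0, 100, 2)), [10, 11, 96]))
# -> {10: 5, 11: None, 96: 48}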

Categorizing data based on the data's signature

Let us say I have some large collection of rows of data, where each element in the row is a (key, value) pair:
1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
I would like the ability to run some computation that would allow me to determine for a new row, what is the row that is "most similar" to this row?
The most direct way I could think of finding the "most similar" row for any particular row is to directly compare said row against all other rows. This is obviously computationally very expensive.
I am looking for a solution of the following form.
A function that can take a row, and generate some derivative integer for that row. This returned integer would be a sort of "signature" of the row. The important property of this signature is that if two rows are very "similar" they would generate very close integers, if rows are very "different", they would generate distant integers. Obviously, if they are identical rows they would generate the same signature.
I could then take these generated signatures, along with the index of the row they point to, and sort them all by signature. I would keep this data structure so that I can do fast lookups. Call it database B.
When I have a new row and wish to know which existing row in database B is most similar, I would:
Generate a signature for the new row
Binary search through the sorted list of (signature, index) pairs in database B for the closest match
Return the closest matching (could be a perfect match) row in database B.
I know there is a lot of hand waving in this question. My problem is that I do not actually know what function would generate this signature. I have looked at Levenshtein distances, but those represent a transformation cost, not so much a signature. I see that I could try lossy compression: two things might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this.
Thank you.
EDIT: This is my original answer, which we will call Case 1, where there is no precedence to the keys
You cannot do it as a sorted integer because that is one dimensional and your data is multi-dimensional. So "nearness" in that sense cannot be established on a line.
Your example shows bird, fish and soda for all 3 lines. Are the keys fixed and known? If they are not, then your first step is to hash the keys of a row to establish rows that have the same keys.
For the values, consider this as a poor man's Saturday Night similarity trick. Hash the values; any two rows that match on that hash are an exact match and represent the same "spot": zero distance.
If N is the number of key/value pairs:
The closest non-exact "nearness" would mean matching N-1 out of N values. So you generate N more hashes, each one dropping out one of the values. Any two rows that match on those hashes have N-1 out of N values in common.
The next closest non-exact "nearness" would mean matching N-2 out of N values. So you generate more hashes, N·(N-1)/2 of them this time, each one leaving out a combination of two values. Any two rows that match on those hashes have N-2 out of N values in common.
So you can see where this is going. At the logical extreme you end up with 2^N hashes, which is not very savory, but I'm assuming you would not go that far, because you reach a point where too few matching values would be considered too "far" apart to be worth considering.
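A minimal sketch (assumed, not the answerer's code) of the "leave values out and hash the rest" idea: two rows share an N-1-of-N signature exactly when they differ in a single value:

from itertools import combinations

def near_signatures(row, drop=1):
    # row is a tuple of (key, value) pairs with a fixed key order.
    values = tuple(v for _, v in row)
    sigs = set()
    for dropped in combinations(range(len(values)), drop):
        kept = tuple(v for i, v in enumerate(values) if i not in dropped)
        sigs.add(hash(kept))           # one hash per way of leaving values out
    return sigs

a = (("bird", "eagle"), ("fish", "cod"), ("soda", "coke"))
b = (("bird", "eagle"), ("fish", "cod"), ("soda", "pepsi"))
print(bool(near_signatures(a) & near_signatures(b)))  # True: they match N-1 of N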
EDIT: To see how we cannot escape dimensionality, consider just two keys with values 1-9. Plot all possible values on a graph. We see that {1,1} is close to {2,2}, but also that {5,6} is close to {6,7}. So we get a brainstorm and say, Aha! I'll calculate each point's distance from the origin using the Pythagorean theorem! This makes both {1,1} and {2,2} easy to detect. But then the two points {1,10} and {10,1} get the same number, even though they are as far apart as they can be on the graph. So we say, OK, I need to add the angle for each point. Two points at the same distance are distinguished by their angle, and two points at the same angle are distinguished by their distance. But of course now we've plotted them in two dimensions.
EDIT: Case 2 would be when there is precedence to the keys, when key 1 is more significant than key 2, which is more significant than key 3, etc. In this case, if the allowed values were A-Z, you would string the values together as if they were digits to get a sortable value. ABC is very close to ABD, but very far from BBD.
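A tiny sketch (assumed) of Case 2: with precedence, the concatenated values sort lexicographically, so a binary search over the sorted signatures lands near the closest rows:

import bisect

signatures = sorted(["ABC", "ABD", "BBD"])   # most significant value first

def closest(sig):
    i = bisect.bisect_left(signatures, sig)  # binary search in the sorted signatures
    candidates = signatures[max(0, i - 1):i + 1]
    # Break ties by how many positions differ.
    return min(candidates, key=lambda s: sum(a != b for a, b in zip(s, sig)))

print(closest("ABE"))  # -> "ABD": shares the two most significant values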
If you had a lot of data, and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.
A simpler, but less accurate way of doing it is using Soundex, which is available for many languages. You can store the soundex (which will be a short string, not an integer I'm afraid), and look for exact matches to the soundex, which should point to similar rows.
I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is doing a checksum on each individual tuple, and comparing the checksums for the new row to the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.
