Suppose the following: as input, one gets a record consisting of N numbers and booleans. This vector has to be compared against a database of vectors, each of which includes M additional "result" elements. That is, the database holds P vectors of size N+M.
Each vector in the database holds a boolean as its last element. The aim of the exercise is to find, as fast as possible, the record(s) which are the closest match to the input vector AND whose result elements end with a TRUE boolean.
To make the above a bit more comprehensible, consider the following example:
A database with personal health information, consisting of records holding:
age
gender
weight
length
heart issues (boolean)
lung issues (boolean)
residence
alternative plan chosen (if any)
accepted offer
The program would then get an input like:
36 Male 185pound 68in FALSE FALSE NYC
It would then find out which plan would be the best to offer the client, based on what's in the database.
I know of a few methods which would help do this, e.g. the Levenshtein distance. However, most methods would involve searching the entire database for the best matches.
Are there any algorithms or methods which would cut back on the processing power/time required? I can't imagine that insurance agencies, for example, don't use more efficient methods to search their databases...
Any insights into this area would be greatly appreciated!
Assumption: this is a relational database. If instead it is NoSQL, then please provide more info on which DB.
Do you have the option to create bitmap indexes? They can cut down the number of records returned. That is useful for almost all of the columns, since their cardinalities are low.
After that the only one left is the residence, and you should use a Geo distance for that.
If you are unable to create bitmap indexes then what are your filtering options? If none then you have to do a full table scan.
For each of the components, e.g. age, gender, etc., you need to:
(a) determine a distance metric
(b) determine how to compute both the metric and the distance between different records.
I'm not sure Levenshtein would work here: you need to take each field separately to find its contribution to the overall distance measure.
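To make (a) and (b) concrete, here is a minimal per-field distance sketch in Python; the field names, weights, and normalization scales are all made up for illustration, not taken from the question:

WEIGHTS = {"age": 1.0, "gender": 2.0, "weight": 1.0, "height": 1.0,
           "heart": 3.0, "lung": 3.0}
SCALES = {"age": 100.0, "weight": 300.0, "height": 90.0}  # rough value ranges

def distance(a, b):
    # Numeric fields: absolute difference, normalized to roughly 0..1.
    d = sum(WEIGHTS[f] * abs(a[f] - b[f]) / SCALES[f]
            for f in ("age", "weight", "height"))
    # Booleans/categoricals: 0 if equal, 1 if different.
    d += sum(WEIGHTS[f] * (a[f] != b[f]) for f in ("gender", "heart", "lung"))
    return d

query = {"age": 36, "gender": "M", "weight": 185, "height": 68,
         "heart": False, "lung": False}
row = {"age": 40, "gender": "M", "weight": 170, "height": 70,
       "heart": False, "lung": True}
print(distance(query, row))  # smaller = closer match

The weights encode domain knowledge (e.g. a mismatch on heart issues matters more than a few pounds of weight); no algorithm will choose them for you.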
Related
I am trying to perform ANN, but my data is split into partitions or "tenants." Searches are always restricted to a single tenant, which represents a small percentage of the total documents.
I first tried implementing this using a filter on a tenant string attribute. However, I encountered this piece of documentation, which suggests the performance will be poor:
There is a small problem here, however. If the eligibility list is small in relation to the number of items in the graph, skipping occurs with a high probability. This means that the algorithm needs to consider an exponentially increasing number of candidates, slowing down the search significantly. To solve this, Vespa.ai switches over to a brute-force search when this occurs. The result is an efficient ANN search when combined with filters.
What's the best way to solve my problem? Will partitioning my data into separate namespaces trigger the creation of a separate HNSW graph per namespace?
Performance will be fine; the query planner will just choose not to use the ANN index for these queries. You'll find lots of details on this topic, including how to tune this, in this blog post: https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/
If all your queries are towards a single tenant which is a small percentage of the total documents, I don't think you necessarily need to create an HNSW index at all, but this depends on the absolute numbers and the largest "small percentage".
(Namespaces are not relevant here - their only purpose is to safely add a string to ids so that you can have multiple sources of ids and still be guaranteed global uniqueness.)
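To illustrate the point about not needing an HNSW index, here is a minimal brute-force sketch in Python/NumPy (the array names and sizes are hypothetical; this is not Vespa API):

import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((100_000, 64))    # the whole corpus, one row per document
tenant = rng.integers(0, 1_000, size=100_000)   # tenant id of each document

def tenant_knn(query, tenant_id, k=10):
    idx = np.flatnonzero(tenant == tenant_id)             # the eligibility list
    dists = np.linalg.norm(vectors[idx] - query, axis=1)  # exact distances, no graph
    return idx[np.argsort(dists)[:k]]

print(tenant_knn(rng.standard_normal(64), tenant_id=42))

With on the order of 100 eligible documents per tenant, an exact scan is cheap, which is the same reasoning behind the brute-force fallback quoted above.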
A bit of a tricky question - I might just have to do it through VBA with a proper script, but if someone actually has a complicated answer (let's be honest, I don't think there's a super simple formula for this), I'm all ears. I'd rather do as much as I can through formulas. I've attached a sample.
The data: I have data that relates to countries. In each country, you can have multiple sites. For each site, you may or may not have different distributions. When those distributions meet a given criterion, I want to tally that up as a "break" and count how many there are by country, site, etc.
How it works: I'm using array formulas with SUMPRODUCT() for this. The nice thing is that you can easily add criteria; each criterion returns 0/1, so when you multiply them, it gives you the array you need to sum up to see how many breaks you have.
The problem: I am unable to write the formula so that each site is counted only once when the same site has 2 different distribution types and both meet the break criterion. If both distributions meet the break criterion, I don't want to record that as 2 breaks; otherwise I may end up with more sites with breaks recorded than there are sites. Part of the problem is how I account for the uniqueness of sites:
(tdata[siteid]>"")/COUNTIF(tdata[siteid],tdata[siteid] &"")
This is actually a bit of a hack, in the sense that, as opposed to the other formulas, it doesn't return 0/1 but possibly fractions. They do add up correctly and allow me to, say, count the number of sites correctly, but the array isn't formatted as 0/1, therefore when multiplied with other 0/1 arrays it messes up the results...
I control the data, so I have some leeway. I work with tables (as can be seen) and VBA is already used. I could sort the source tables if that helps. Source data:
1 row = 1 distribution for 1 site on 1 month
The summary table per country I linked is based on those source data.
Any idea?
EDIT - Filtering by distribution is not really an option. I already have event-based filters for the source data, and I can already calculate the indicator correctly for data filtered by distribution. But I also need to display global data (which is currently not working). Also, there are other indicators that need to be calculated which won't work if I filter the data (it's a big dashboard).
EDIT2: In other words, I need some way to account for the fact that if the same criterion (break or not) is met for 2 rows with the same siteid but 2 different distributions, I want to count that as 1 break only. While keeping in mind that if one distribution has a break (and the other doesn't), I still want to record it as 1 site with a break in that country.
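For clarity, here is the counting rule from EDIT2 expressed as a minimal Python/pandas sketch (column names are illustrative; this is not the formula-only solution being asked for):

import pandas as pd

# One row = one distribution for one site in one month; is_break is the
# 0/1 break criterion already evaluated per row.
df = pd.DataFrame({
    "country":      ["FR", "FR", "FR", "DE"],
    "siteid":       ["S1", "S1", "S2", "S3"],
    "distribution": ["A",  "B",  "A",  "A"],
    "is_break":     [1,    1,    0,    1],
})

# A site counts as one break if ANY of its distributions breaks.
site_break = df.groupby(["country", "siteid"])["is_break"].max()
print(site_break.groupby(level="country").sum())  # DE: 1, FR: 1 (S1 counted once)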
EDIT3: I've decided to make a new table that summarizes the data for each site individually (each of which may have more than one distribution). Then I can calculate the global figures from that.
My take-home message from this: I think that when you have many levels of data (e.g. countries, sites, with some kind of sub-level such as distributions) in Excel formulas, it's difficult NOT to summarize the data in intermediate tables for the level of analysis on which you want to focus. E.g. in my case, I am interested in country-level analysis, which is 2 "levels" above the distribution level. This means that there will be "duplication" of data from a site-level perspective. You may be able to navigate around this, but I think by far the simpler solution is to suck it up and make an intermediate table. It does shorten your formulas significantly as well.
I don't mark this as a solution because it's not what I was looking for. Still open to better suggestions allowing me to work only with formulas...
File: https://www.dropbox.com/sh/4ofctha6qhfgtqw/AAD0aPJXr__tononRTpKc1oka?dl=0
Maybe the following can help.
First, you filter out the entries which don't meet the distribution criterion.
In a second step, you sort the table from A to Z based on the column siteid.
Then you add an extra column after the last one with the formula =C3<>C4, where column C contains the siteid entries. In that way, all duplicates are denoted by a FALSE value in the helper column.
After that, you filter out the FALSE values in this column.
You then get unique site ids.
If I've misunderstood your question, I'd be glad about an update so I can try to help further.
We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say: based on your content, you're most like Record X, as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations); however, in our case there are two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
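A minimal sketch of that known-types loop in Python (the regular expressions here are invented examples of discovered types):

import re

# Hypothetical "known types": patterns learned from earlier records, with the
# changing parts (transaction ids, server names) generalized away.
KNOWN_TYPES = [
    re.compile(r"Change Transaction \w+ Assigned To Server \w+"),
    re.compile(r"User \w+ logged in from \S+"),
]

def classify(record):
    for type_id, pattern in enumerate(KNOWN_TYPES):
        if pattern.fullmatch(record):
            return type_id   # matched a known type
    return None              # candidate for a brand-new known type

print(classify("Change Transaction ABC123 Assigned To Server US91"))  # 0
print(classify("Disk quota exceeded on /var/log"))                    # None

Regex matching is all-or-nothing; for the graded high/low-confidence scores you would swap in a fuzzier comparison, but the control flow stays the same.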
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a sequence of two consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way; the search engine will give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
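Generating the n-grams is straightforward; a small Python helper (the function name is mine) might look like:

def word_ngrams(text, n=2):
    # Split a log line into word n-grams, joined with "_" as in the example above.
    words = text.split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("Change Transaction XYZ789 Assigned To Server GB47"))
# ['Change_Transaction', 'Transaction_XYZ789', 'XYZ789_Assigned',
#  'Assigned_To', 'To_Server', 'Server_GB47']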
Two main strategies come to my mind here:
the ad-hoc one: use an information retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record, and the text search engine will (hopefully) return some related log entries to compare it with. Note, however, that the "information retrieval" approach is usually only interested in finding the 10 most similar results.
the clustering approach: you will usually need to turn the data into numerical vectors (which may, however, be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above) and investigate their nature. You might need to tweak this a little so it doesn't, e.g., cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return some similar existing log lines, without much quantitative information on how common each line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
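As a rough illustration of the second strategy, here is a minimal TF-IDF-plus-clustering sketch using scikit-learn (a stand-in; any equivalent library would do):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

logs = [
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "Disk quota exceeded on volume /var",
]

# Sparse TF-IDF vectors; including word bigrams keeps the fixed phrasing
# together while the rare ids get little weight of their own.
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(logs)

labels = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the two "Change Transaction" lines should land in one cluster

On hundreds of millions of rows you would need a mini-batch or streaming variant anyway, which is why MiniBatchKMeans rather than plain k-means is shown here.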
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there, you can train a classifier or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
I am working on a possible architecture for an abuse detection mechanism in an account management system. What I want is to detect possible duplicate users based on certain correlating fields within a table. To keep the problem simple, let's say I have a USER table with the following fields:
Name
Nationality
Current Address
Login
Interests
It is quite possible that one user has created multiple records within this table. There might be a certain pattern in which this user has created his/her accounts. What would it take to mine this table to flag records that may be possible duplicates? Another concern is scale. If we have, let's say, a million users, taking one user and matching it against the remaining users is computationally unrealistic. What if these records are distributed across various machines in various geographic locations?
What are some of the techniques that I can use to solve this problem? I have tried to pose this question in a technology-agnostic manner in the hope that people can provide me with multiple perspectives.
Thanks
The answer really depends upon how you model your users and what constitutes a duplicate.
There could be a user that uses names from all the Harry Potter characters. Good luck finding that pattern :)
If you are looking for records that are approximately similar try this simple approach:
Hash each word in the record and pick the minimum hash value. Do this for k different hash functions. Concatenate these min hashes. What you have is a near-duplicate signature.
To be clear, let's say a record has words w1...wn, and your hash functions are h1...hk.
let m_i = min_j(h_i(w_j))
and the signature is S = m_1.m_2.m_3...m_k
The cool thing about this signature is that if two documents contain 90% of the same words, then each min hash agrees with roughly 90% probability, so near-duplicate documents are likely to produce identical signatures. Hence, instead of looking for near duplicates, you look for exact duplicates among the signatures. If you want to increase the number of matches, decrease k; if you are getting too many false positives, increase k.
Of course, there is also the approach of using implicit features of users, such as their IP addresses, cookies, etc.
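A minimal Python sketch of the signature construction above (simulating the k hash functions by salting one hash with the function index):

import hashlib

def min_hash_signature(words, k=4):
    # m_i = min over all words w of h_i(w); the signature is (m_1, ..., m_k).
    return tuple(
        min(int(hashlib.md5(f"{i}:{w}".encode()).hexdigest(), 16) for w in words)
        for i in range(k)
    )

a = "alice london cats tea books music travel".split()
b = "alice london cats tea books music hiking".split()
print(min_hash_signature(a))
print(min_hash_signature(b))  # each component agrees with probability ~= Jaccard(a, b)

In practice you would store the signature as a key and group records whose signatures collide exactly.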
In relation to my previous question where I was asking for some database suggestions: it just occurred to me that I don't even know if what I'm trying to store is appropriate for a database, or whether some other data storage method should be used.
I have some physical model test data (let's say wind tunnel data; something similar) where for every model (M-1234) I have:
name (M-1234)
length L
breadth B
height H
L/B ratio
L/H ratio
...
lot of other ratios and dimensions ...
force versus speed curve given in the form of a lot of points for x-y plotting
...
few other similar curves (all of them of type x-y).
Now, what I'm trying to accomplish is to store all that in some reasonable way, so that a user of the database can come and see what the ten closest models to L/B=2.5 are (or some similar request), and then somehow get all the data for those models, including the curve data (in a plain text file format).
Is an SQL database (or any other kind, for that matter) an appropriate way of handling something like this? Or should I take some other approach?
I have about a month to finish this, and in that time I have to learn enough about databases as well, so ... give your suggestions, please, bearing that in mind. Assume no previous knowledge on the subject, whatsoever.
I think what you're looking for is possible. I'm using PostgreSQL here, but any database should work. This is my test database:
CREATE TABLE test (
id serial primary key,
ratio double precision
);
COPY test (id, ratio) FROM stdin;
1 0.29999999999999999
2 0.40000000000000002
3 0.59999999999999998
4 0.69999999999999996
\.
Then, to find the nearest values to a particular ratio
select id,ratio,abs(ratio-0.5) as score from test order by score asc limit 2;
In this case, I'm looking for the 2 nearest to 0.5
I'd probably go for a data model where you have one table for the main data, the ratios and so on, and then a second table which holds the curve points, as I'm assuming the curves aren't always the same size.
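A quick sketch of that two-table layout, using Python's built-in sqlite3 so it can be run as-is (table and column names are just illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model (
    id       INTEGER PRIMARY KEY,
    name     TEXT,
    length   REAL, breadth REAL, height REAL,
    lb_ratio REAL, lh_ratio REAL
);
CREATE TABLE curve_point (          -- one row per x-y point of each curve
    model_id INTEGER REFERENCES model(id),
    curve    TEXT,                  -- e.g. 'force_vs_speed'
    x REAL, y REAL
);
""")
conn.execute("INSERT INTO model VALUES (1, 'M-1234', 10, 4, 3, 2.5, 3.33)")

# "The ten closest models to L/B = 2.5":
print(conn.execute(
    "SELECT name, ABS(lb_ratio - 2.5) AS score FROM model ORDER BY score LIMIT 10"
).fetchall())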
Yes, a database is probably the best approach for this.
A relational database (which usually uses SQL for data access) is suitable for data that is more or less structured as tables.
To give you an idea:
You could have a main table model with fields name, width, etc. Then subtable(s) for any values which can appear more than once, which refer back to model (look up "foreign key").
Then a subtable for your actual curves, again referring back to model.
How to actually model the curves in the DB I don't know, as I don't know how you model them. But if it's lots of numbers, it can go into the DB.
It seems you know little about relational DBMSs. Consider reading something on Wikipedia, or doing a few simple DBMS tutorials (PostgreSQL has some: http://www.postgresql.org/docs/8.4/interactive/tutorial.html , but there are many others). Then pick a DBMS to try out (PostgreSQL is probably not a bad choice, but again there are many others).
Then try implementing a simple table schema, and get back to us with any detail questions (which you'll probably have).
One more thing: Those questions are probably more appropriate to serverfault.com.
This is arguably scientific data: you might find libraries/formats intended for arbitrary scientific data useful, e.g. HDF5: http://www.hdfgroup.org/ (note: I am not an expert)
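For what it's worth, a tiny h5py sketch of that idea (the group and dataset names are invented): scalar dimensions become attributes, and each x-y curve gets its own dataset.

import h5py
import numpy as np

with h5py.File("models.h5", "w") as f:
    m = f.create_group("M-1234")
    m.attrs["length"] = 10.0
    m.attrs["lb_ratio"] = 2.5
    speed = np.linspace(0, 30, 50)
    # Store the force-vs-speed curve as a (50, 2) array of x-y points.
    m.create_dataset("force_vs_speed", data=np.column_stack([speed, speed**2]))

Finding the ten closest models to L/B=2.5 then means looping over the groups and sorting on abs(attrs["lb_ratio"] - 2.5), which is fine for hundreds of models but is exactly the kind of query a database would index for you.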