Recommendation for mailing address matching scenario? - sql-server

My SQL server contains 2 tables containing a similar set of fields for a mailing (physical) address. NB these tables are populated before the data gets to my database (can't change that). The set of fields in the tables are similar though not identical - most exist in both tables, some only in one, some the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.
Example fields:
Street Number
Predirection
Street Name
Street Suffix
Postdirection (one table and not the other)
Unit name (one table) v Address 2 (other table) --adds complexity
Zip code (length varies in each table 5 v 5+ digits)
Legal description
Ideally I'd like to a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source highly preferred by client.
Among options such as SOUNDEX, DIFFERENCE, Levenshtein distance (all SQL) and usaddress, dedupe (Python) none stand out as a good-fit solution.

Ideally I'd like to a simple way to call a "function" which returns
either a boolean or a confidence level of match (0.0 - 1.0).
A similarity metric is what you're looking for. You can use Distance Metrics to calculate similarity. The Levenshtein Distance, Damerau-Levenshtein Distance and Hamming Distance are examples of Distance Metrics.
Given the shortest of the two: M the shorter of the two, N the longest, and your distance metric (D) you can measure string Similarity using (M-D)/N. You can also use the Longest Common subsequence or Longest Common Substring (LCS) to measure similarity by dividing LCS/N.
If you can use CLRs I HIGHLY recommend mdq.similarity which you can get from here. It will give a similarity metric using these algorithms:
The Damarau-Levenshtein distance (the documentation only says, "Levenshtein" but they are mistaken)
The Jaccard Similarity coefficient algorithm.
a form of the Jaro-Winkler distance algorithm.
4 a longest common subsequence algorithm (which grows by one when transpositions are involved)
If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.) Ditto my N-Grams function.
You can easily create your own metric using NGrams8K.
For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).
I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

Related

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want since from what I understand it 'clusters' the tuples into more simple, discrete/aproximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gama, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending on using them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure
Conclusion
I've implemented the two algorithms proposed in the answers, that is Set-Trie and Inverted Index and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance wise:
I'm confident that I can now contribute to the solution. One possible quite efficient way is a:
Trie invented by Frankling Mark Liang
Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Tree? A tree where each node except the root contains a single attribute value (number) and a marker (bool) if at this node there is a data entry. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Tree is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all nodes containing a marker that you reach when you go down the tree and up the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching including subsets Say you want to search for attributes 4 and 1. First you order them, the search key is {1,4}. Now startin from root you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller or equal to 1. There is only one, namely 1. Inside you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than 4, that are all. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (or the last attribute in the key). These are {1,2,4} and {1,4} but not {1,3} (no 4) or {2,4} (no 1).
Insertion Is very easy. Go down the tree and store a data entry at the appropriate position. For example data entry {2.5} would be stored as child of {2}.
Add attributes dynamically Is naturally ready, you could immediately insert {1,4,6}. It would come below {1,4} of course.
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
How about having an inverse index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps an index to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: Has milk/doesn't have milk, has sugar/doesn't have sugar) or could be converted to binary(by possibly adding more states) then you have a lightning speed algorithm for your purpose: Bitmap Indices
Bitmap indices can do such comparisons in memory and there literally is nothing in comparison on speed with these (ANDing bits is what computers can really do the fastest).
http://en.wikipedia.org/wiki/Bitmap_index
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Almost all SQL databases supoort Bitmap Indexing and there are several possible optimizations for it as well(by compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance
Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is from this paper by Ron Rivest from 19761. This2 is a more recent paper from 2002. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A,B,C,D,E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be the order of the smallest node, multiplied by the size of a typical set. E.g. to find {A,B,C] I find the order of A is one, so I look at all the sets A is in, S1, and then I check that it has all of BC, since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language, and will give good performance. I would imagine, provided that the typical orders of your database is not large, that its performance is far superior to the algorithms based on set representations.
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. Problem is here that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the wikipedia page there is an implementation in pseudocode (create hash for each feature contained in entry, then increase feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
Also here on SO is a question about feature hashing and amongst others a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution but you didn't want one. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (bloom filter being another keyword) to optimize the speed for your special case. Also there might still be even more efficient approaches if for example retrieval speed is more important than storage (balanced trees maybe?).

Motivation for k-medoids

Why would one use kmedoids algoirthm rather then kmeans? Is it only the fact that
the number of metrics that can be used in kmeans is very limited or is there something more?
Is there an example of data, for which it makes much more sense to choose the best representatives
of cluster from the data rather then from R^n?
The problem with k-means is that it is not interpretable. By interpretability i mean the model should also be able to output the reason that why it has resulted a certain output.
lets take an example.
Suppose there is food review dataset which has two posibility that there is a +ve review or a -ve review so we can say we will have k= 2 where k is the number of clusters. Now if you go with k-means where in the algorithm the third step is updation step where you update your k-centroids based on the mean distance of the points that lie in a particular cluster. The example that we have chosen is text problem, so you would also apply some kind of text-featured vector schemes like BagOfWords(BOW), word2vec. now for every review you would get the corresponding vector. Now the generated centroid c_i that you will get after running the k-means would be the mean of the vectors present in that cluster. Now with that centroid you cannot interpret much or rather i should say nothing.
But for same problem you apply k-medoids wherein you choose your k-centroids/medoids from your dataset itself. lets say you choose x_5 point from your dataset as first medoid. From this your interpretability will increase beacuse now you have the review itself which is termed as medoid/centroid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation of introducing k-mediods
Coming to the metrics part you can apply all the metrics that you apply for k-means
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
In case of relatively small data sets (as k-medoids complexity is greater) - to obtain a clustering more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=32)
a = rng.random((6,2))*2.35 - 3*np.ones((6,2))
b = rng.random((50,2))*0.25 - 2*np.ones((50,2))
c = rng.random((100,2))*0.5 - 1.5*np.ones((100,2))
d = rng.random((7,2))*0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:,0],points[:,1],"g.", markersize=8, alpha=0.3) # green points
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend top movies that the cluster representant (actual person) watched. If the cluster representant wasn't an actual person we wouldn't possess the history of actually watched movies to recommend. Each time we'd have to search additionally e.g. for the closest person from the cluster. Example data: classic MovieLens 1M Dataset
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, cluster representants are invited to the drug trial.
Difference between is that in k-means centroids(cluster centrum) are calculated as average of vectors containing in the cluster, and in k-medoids the medoid (cluster centrum) is record from dataset closest to centroid, so if you need to represent cluster centrum by record of your data you use k-medoids, otherwise i should use k-means (but concept of these algorithms are same)
The K-Means algorithm uses a Distance Function such as Euclidean Distance or Manhattan Distance, which are computed over vector-based instances. The K-Medoid algorithm instead uses a more general (and less constrained) distance function: aka pair-wise distance function.
This distinction works well in contexts like Complex Data Types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and the k-means algorithms, the distance computation phase can spend a lot of time scanning the entire vector of attributes that belongs to an instance; for instance, in the context of documents clustering, using the standard TF-IDF representation. During the computation of the cosine similarity, the distance function scans all the possible words that appear in the whole collection of documents. Which in many cases can be composed by millions of entries. This is why, in this domain, some authors [1] suggests to restrict the words considered to a subset of N most frequent word of that language.
Using K-Kedoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents is possible to use the set of words appearing at least twice in the document; and as a distance measure, there can be used Jaccard Distance.
In this case, vector representation is long as the number of words in your dictionary.
Heterogeneousity and Complex Data Types.
There are many domains where is considerably better to abstract the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data type allows the design of ad-hoc distance measures which can fit better with the proper data domain.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

how to do fuzzy search in big data

I'm new to that area and I wondering mostly what the state-of-the-art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns me all items with keys up to a certain distance to the search-key. Maybe that distance-limit is configureable. Maybe this is also just a lazy iterator. Maybe there can also be a count-limit and an item (key,value) is with some probability P in the returned set where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustId fingerprint and have defined this compare function. They use the PostgreSQL GIN Index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN Partial Match Algorithm but I haven't fully understand wether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution, each application will need different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with even simpler approach.
Btw, you are looking specifically for text search, the simplest way you can do it split your input to N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
I suggest you take a look at FLANN Fast Approximate Nearest Neighbors. Fuzzy search in big data is also known as approximate nearest neighbors.
This library offers you different metric, e.g Euclidian, Hamming and different methods of clustering: LSH or k-means for instance.
The search is always in 2 phases. First you feed the system with data to train the algorithm, this is potentially time consuming depending on your data.
I successfully clustered 13 millions data in less than a minute though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum numbers of neighbors.
As Lukas said, there is no good generic solution, each domain will have its tricks to make it faster or find a better way using the inner property of the data your using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depends on what your key/values are like, the Levenshtein algorithm (also called Edit-Distance) can help. It calculates the least number of edit operations that are necessary to modify one string to obtain another string.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/

How to find that two words differ by how much distance>> Is there any shortest way for this

I have read about Levenshtein distance about the calculation of the distance between the two distinct words.
I have one source string and i have to match it with all 10,000 target words. The closest word should be returned.
The problem is I have given a list of 10,000 target words, and input source words is also huge.... So what shortest and efficient algorithm to apply here. Levenshtein distance calculation for each n every combination(brute force logic) would be very time consuming.
Any hints, or ideas are most welcome.
I guess it depends a little on how the words are structured. For example this guy improved the implementation based on the fact that he processes his words in order and does not repeat calculations for common prefixes. But if all your 10,000 words are totally different that won't do you much good. It's written in python so might be a bit of work involved to port to C.
There are also some kinda homebrew algorithms out there (with which I mean there is no official paper written about it) but that might do the trick.
There's two common approaches for this, and I've blogged about both. The simpler one to implement is BK-Trees - a tree datastructure that speeds lookup based on levenshtein distance by only searching relevant parts of the tree. They'll probably be perfectly sufficient for your use-case.
A more complicated but more efficient approach is Levenshtein Automata. This works by constructing an NFA that recognizes all words within levenshtein distance x of your target string, then iterating through it and the dictionary in lockstep, effectively performing a merge join on them.

Similarity between line strings

I have a number of tracks recorded by a GPS, which more formally can be described as a number of line strings.
Now, some of the recorded tracks might be recordings of the same route, but because of inaccurasies in the GPS system, the fact that the recordings were made on separate occasions and that they might have been recorded travelling at different speeds, they won't match up perfectly, but still look close enough when viewed on a map by a human to determine that it's actually the same route that has been recorded.
I want to find an algorithm that calculates the similarity between two line strings. I have come up with some home grown methods to do this, but would like to know if this is a problem that's already has good algorithms to solve it.
How would you calculate the similarity, given that similar means represents the same path on a map?
Edit: For those unsure of what I'm talking about, please look at this link for a definition of what a line string is: http://msdn.microsoft.com/en-us/library/bb895372.aspx - I'm not asking about character strings.
Compute the Fréchet distance on each pair of tracks. The distance can be used to gauge the similarity of your tracks.
Math alert: Fréchet was a pioneer in the field of metric space which is relevant to your problem.
I would add a buffer around the first line based on the estimated probable error, and then determine if the second line fits entirely within the buffer.
To determine "same route," create the minimal set of normalized path vectors, calculate the total power differences and compare the total to a quality measure.
Normalize the GPS waypoints on total path length,
walk the vectors of the paths together, creating a new set of path vectors for each path based upon the shortest vector at each waypoint,
calculate the total power differences between endpoints of each vector in the normalized paths weighting for vector length, and
compare against a quality measure.
Tune the power of the differences (start with, say, squared differences) and the quality measure (say as a percent of the total power differences) visually. This algorithm produces a continuous quality measure of the path match as well as a binary result (Are the paths the same?)
Paul Tomblin said: I would add a buffer
around the first line based on the
estimated probable error, and then
determine if the second line fits
entirely within the buffer.
You could modify the algorithm as the normalized vector endpoints are compared. You could determine if any endpoint difference was above a certain size (implementing Paul's buffer idea) or perhaps, if the endpoints were outside the "buffer," use that fact to ignore that endpoint difference, allowing a comparison ignoring side trips.
You could walk along each point (Pa) of LineString A and measure the distance from Pa to the nearest line-segment of LineString B, averaging each of these distances.
This is not a quick or perfect method, but should be able to give use a useful number and is pretty quick to implement.
Do the line strings start and finish at similar points, or are they of very different extents?
If you consider a single line string to be a sequence of [x,y] points (or [x,y,z] points), then you could compute the similarity between each pair of line strings using the Needleman-Wunsch algorithm. As described in the referenced Wikipedia article, the Needleman-Wunsch algorithm requires a "similarity matrix" which defines the distance between a pair of points. However, it would be easy to use a function instead of a matrix. In your case you could simply use the 2D Euclidean distance function (or a 3D Euclidean function if your points have elevation) to provide the distance between each pair of points.
I actually side with the person (Aaron F) who said that you might be interested in the Levenshtein distance problem (and cited this). His answer seems to me to be the best so far.
More specifically, Levenshtein distance (also called edit distance), does not measure strictly the character-by-character distance, but also allows you to perform insertions and deletions. The best algorithm for this distance measure can be computed in quadratic time (pretty slow if your strings are long), but the computational biologists have pretty good heuristics for this, that might be of interest to you on their own. Check out BLAST and FASTA.
In your problem, it seems that you are dealing with differences between strings of numbers, and you care about the numbers. If you give more information, I might be able to direct you to the right variant of BLAST/FASTA/etc for your purposes. In any case, you might consider adapting BLAST and FASTA for your needs. They're quite simple.
1: http://en.wikipedia.org/wiki/Levenshtein_distance, http://www.nist.gov/dads/HTML/Levenshtein.html

Resources