Nearest Neighbor Matching using "Matching" Package - matching

I'm trying to apply "nearest neighbor matching algorithm" into "Matching" Package.
My objective is to have the matched data set from a PSM (Propensity score matching) which have the same number of treatment and control.
For example, at the beginning, I have 145 units of control group and 639 units of treatment group.
As I explained, I want to have 145 units of control group and 145 units of treatment group after matching.
Even though, "MatchIt" package has the nearest neighbor algorithm, I think "Matching" package does not have the algorithm as far as know.
I really appreciate your reply.
Thanks.

The Matching package does have nearest neighbor matching; it's the default. You might need to set whether you want matching with or without replacement (in your case, you want without replacement). I recommend using the MatchIt package though; it's more convenient to use and provides the same options.

Related

Recommendation for mailing address matching scenario?

My SQL server contains 2 tables containing a similar set of fields for a mailing (physical) address. NB these tables are populated before the data gets to my database (can't change that). The set of fields in the tables are similar though not identical - most exist in both tables, some only in one, some the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.
Example fields:
Street Number
Predirection
Street Name
Street Suffix
Postdirection (one table and not the other)
Unit name (one table) v Address 2 (other table) --adds complexity
Zip code (length varies in each table 5 v 5+ digits)
Legal description
Ideally I'd like to a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source highly preferred by client.
Among options such as SOUNDEX, DIFFERENCE, Levenshtein distance (all SQL) and usaddress, dedupe (Python) none stand out as a good-fit solution.
Ideally I'd like to a simple way to call a "function" which returns
either a boolean or a confidence level of match (0.0 - 1.0).
A similarity metric is what you're looking for. You can use Distance Metrics to calculate similarity. The Levenshtein Distance, Damerau-Levenshtein Distance and Hamming Distance are examples of Distance Metrics.
Given the shortest of the two: M the shorter of the two, N the longest, and your distance metric (D) you can measure string Similarity using (M-D)/N. You can also use the Longest Common subsequence or Longest Common Substring (LCS) to measure similarity by dividing LCS/N.
If you can use CLRs I HIGHLY recommend mdq.similarity which you can get from here. It will give a similarity metric using these algorithms:
The Damarau-Levenshtein distance (the documentation only says, "Levenshtein" but they are mistaken)
The Jaccard Similarity coefficient algorithm.
a form of the Jaro-Winkler distance algorithm.
4 a longest common subsequence algorithm (which grows by one when transpositions are involved)
If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.) Ditto my N-Grams function.
You can easily create your own metric using NGrams8K.
For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).
I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

How to determine the parameters of DBSCAN?

I tried the kdist plot as introduced in one paper I read and used the knee distance to set the epsilon. However,the results were not satisfying.
I use the WEKA to implement DBSCAN, but it always returns only one cluster.
Can anyone please give me some advice?
This can happen if the k-dist plot has more than 1 knee (this can happen when the dataset contains clusters having different density, and the outcome you have obtained arise when the high density clusters are nested inside the low density ones).
The solution is to search the next knee and re-apply the algorithm on the core points you already have found.

AppEngine: Geospatial queries under Go

There appears to be geospatial-query support under Java (https://cloud.google.com/appengine/docs/java/datastore/geosearch) but there appears to be absolutely no documentation for doing the same under Go. Grepping google.golang.org/appengine for "geo" renders nothing but construction and validation of GeoPoint values.
Since Java supports this the API support must obviously be there. Does anyone have any experience with this or advice? Thanks.
Edit:
It looks like what limited support there is is only offered for Java:
http://startup-with-gae.blogspot.com/2016/01/geospatial-queries-with-google-cloud.html
It's, officially, completely unsupported in Datastore at this time. Use geohashes: https://github.com/gansidui/geohash/blob/master/geohash.go . It reduces this fro ma geospatial-storage/RTREE support problem to a string prefix-search.
It allows you derive a hash that describes a particular location, and then you can take this string and successively remove characters from the right side (and do string prefix-searches against your list of places and their geohashes with what remains) to find places near your principal location and expanding outward.
There is a predictable amount of geographic precision between each adjacent character of a geohash string, so you can either use this or the common prefix bytes between the principal point and the corner latitude/longitude of the map window to identify all places that should appear within it.

how to do fuzzy search in big data

I'm new to that area and I wondering mostly what the state-of-the-art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns me all items with keys up to a certain distance to the search-key. Maybe that distance-limit is configureable. Maybe this is also just a lazy iterator. Maybe there can also be a count-limit and an item (key,value) is with some probability P in the returned set where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustId fingerprint and have defined this compare function. They use the PostgreSQL GIN Index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN Partial Match Algorithm but I haven't fully understand wether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution, each application will need different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with even simpler approach.
Btw, you are looking specifically for text search, the simplest way you can do it split your input to N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
I suggest you take a look at FLANN Fast Approximate Nearest Neighbors. Fuzzy search in big data is also known as approximate nearest neighbors.
This library offers you different metric, e.g Euclidian, Hamming and different methods of clustering: LSH or k-means for instance.
The search is always in 2 phases. First you feed the system with data to train the algorithm, this is potentially time consuming depending on your data.
I successfully clustered 13 millions data in less than a minute though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum numbers of neighbors.
As Lukas said, there is no good generic solution, each domain will have its tricks to make it faster or find a better way using the inner property of the data your using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depends on what your key/values are like, the Levenshtein algorithm (also called Edit-Distance) can help. It calculates the least number of edit operations that are necessary to modify one string to obtain another string.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/

google app engine : query geohash

I have a db.StringProperty() of geohash, by given a hashcode, how do I find the closer 10 result?
I tried below but doesn't seem to be right
pois = POI.all().filter('geohash <', h_latlng).order('-geohash').fetch(10)
A geohash cannot accomplish the task to find the n-nearest results. You can find the contents of any square region by prefix. But to find a reliable result containing the n-nearest you need to fetch at least 9 prefixes, making it a quite expensive query. Complicating the matter is that prefixes of the 9 squares need to be calculated.
IMO this problem is currently a hard problem to solve efficiently on app-engine. So far, I am on it since a year and have not found a sophisticated and fast solution. A Relational DB with geo index or 2 inequalities will perform such tasks better and faster. But I am interested in good solutions, too. :-)
Citation David Troy:
Geohash also has the property that as
the number of digits decreases (from
the right), accuracy degrades. This
property can be used to do bounding
box searches, as points near to one
another will share similar Geohash
prefixes.
However, because a given point may
appear at the edge of a given Geohash
bounding box, it is necessary to
generate a list of Geohash values in
order to perform a true proximity
search around a point. Because the
Geohash algorithm uses a base-32
numbering system, it is possible to
derive the Geohash values surrounding
any other given Geohash value using a
simple lookup table.
See: https://github.com/davetroy/geohash-js

Resources