There appears to be geospatial-query support under Java, but there appears to be absolutely no documentation for doing the same under Go. Grepping for "geo" turns up nothing but construction and validation of GeoPoint values.
Since Java supports this the API support must obviously be there. Does anyone have any experience with this or advice? Thanks.
It looks like what limited support there is is only offered for Java:
It's, officially, completely unsupported in Datastore at this time. Use geohashes. This reduces the problem from one of geospatial storage/R-tree support to a string prefix search.
A geohash lets you derive a hash that describes a particular location. You can then take this string and successively remove characters from the right side (doing string prefix searches against your list of places and their geohashes with what remains) to find places near your principal location, expanding outward.
There is a predictable amount of geographic precision between each adjacent character of a geohash string, so you can either use this or the common prefix bytes between the principal point and the corner latitude/longitude of the map window to identify all places that should appear within it.
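The prefix-shortening idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the encoder follows the standard public geohash scheme, `places_near` and the in-memory `places` dict are hypothetical stand-ins for whatever your datastore query would do, and nothing here is a Datastore API.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash_encode(lat, lon, precision=12):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    even = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | bit
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def places_near(places, lat, lon, min_results=1, precision=12):
    """Drop characters from the right of the query hash until enough
    stored geohashes share the remaining prefix (hypothetical helper)."""
    target = geohash_encode(lat, lon, precision)
    for length in range(precision, 0, -1):
        prefix = target[:length]
        hits = [name for name, gh in places.items() if gh.startswith(prefix)]
        if len(hits) >= min_results:
            return prefix, hits
    return "", list(places)

places = {
    "a": geohash_encode(57.64911, 10.40744),
    "b": geohash_encode(57.65, 10.41),
    "c": geohash_encode(-33.8, 151.2),
}
prefix, hits = places_near(places, 57.64911, 10.40744, min_results=2)
```

On a store that only offers ordered range scans, the `startswith` test would presumably become a range filter on the geohash property (greater than or equal to the prefix, and less than the prefix followed by a high sentinel character). Keep in mind that two nearby points on opposite sides of a cell boundary may share only a short prefix.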
Let's say we have two words with easily confused spellings. Let's say we take the terms:
derailer (a device used to prevent fouling of a rail track)
Derailleur (a device used for changing gear ratios on a bicycle)
Now for some reason we have both terms in our spelling suggestions. As a result a search for one or the other will never yield spelling suggestions, despite it being likely that you'll get poor (or no) results if you meant the other term.
So the question is how can I convince solr to give me spelling suggestions if you search for one or the other, and what controls do I have to ensure that not every search results in showing spelling suggestions?
I'm new to this area and I am mostly wondering what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns all items whose keys are within a certain distance of the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy iterator. Maybe there can also be a count limit, with an item (key,value) being in the returned set with some probability P, where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustID fingerprint and have defined this compare function. They use the PostgreSQL GIN index and, I guess (although I haven't fully read or understood the acoustid-server code), the GIN partial match algorithm, but I haven't fully understood whether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution; each application will need a different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with an even simpler approach.
Btw, if you are looking specifically for text search, the simplest way to do it is to split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
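As a rough illustration of the N-gram idea, here is a minimal in-memory sketch. The class name `NGramIndex`, the padding scheme and the `min_shared` threshold are all assumptions made for the example; a real system would store the postings in its database and then re-rank the candidates with the actual distance function.

```python
from collections import defaultdict

def ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces
    so that word beginnings and endings produce their own n-grams."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

class NGramIndex:
    """Exact-match index on n-grams, used to find fuzzy candidates."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> words containing it

    def add(self, word):
        for gram in ngrams(word, self.n):
            self.postings[gram].add(word)

    def candidates(self, query, min_shared=2):
        """Return indexed words sharing at least min_shared n-grams with
        the query, ordered by the number of shared n-grams (descending)."""
        counts = defaultdict(int)
        for gram in ngrams(query, self.n):
            for word in self.postings.get(gram, ()):
                counts[word] += 1
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, shared in ranked if shared >= min_shared]

index = NGramIndex()
for word in ("derailer", "derailleur", "bicycle"):
    index.add(word)
found = index.candidates("derailur")  # misspelled query
```

Every lookup here is an exact hash lookup on an n-gram, which is the "convert fuzzy matching to exact search" trick described above.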
I suggest you take a look at FLANN (Fast Library for Approximate Nearest Neighbors). Fuzzy search in big data is also known as approximate nearest neighbor search.
This library offers different metrics, e.g. Euclidean and Hamming, and different indexing methods, for instance LSH or k-means clustering.
The search is always in 2 phases. First you feed the system with data to train the algorithm, this is potentially time consuming depending on your data.
I successfully clustered 13 million data points in less than a minute though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum numbers of neighbors.
As Lukas said, there is no good generic solution; each domain will have its own tricks to make the search faster or to find a better way using the inner properties of the data you're using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values are like, the Levenshtein algorithm (also called edit distance) can help. It calculates the smallest number of edit operations (insertions, deletions and substitutions) needed to turn one string into another.
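For reference, a minimal sketch of the edit distance using the usual dynamic-programming formulation, keeping only two rows of the table. The function name `levenshtein` is just chosen for the example.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]
```

Note that this compares one pair of strings at a time, so on its own it is a distance function rather than an index; on a large key set it would normally be applied only to a shortlist of candidates.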
I've recently started working on a personal project involving geo locations, maps (Google Maps V3) etc.
The project is developed in Python and is intended to run on Google App Engine.
I've learned that in order to find markers/positions close to a given position one can use the geohash algorithm (which is pretty cool).
What I don't understand is this: let's say I have all my locations in the datastore (along with a latitude, a longitude and a high-precision geohash for each location).
I know that I should use the prefix of the geohash (to match locations within it), but how do I calculate the geohash of a bounding box? Considering the bounding box is made up of two points, north-east and south-west, I do not understand how to go about doing this.
In order to query which locations should be returned for the currently visible bounding box, I need the geohash of the visible bounding box. Now I know I can geohash the center location of the viewable map, but I do not know how many letters to cut off (to reduce precision) to achieve 'a fit' to the actual bounding box. (Or maybe that isn't the way...?)
What do you do when the bounding box contains two geohashes? (For example, the middle of the viewable area splits between 'dqcjr0' and 'dqcjqb'.)
Also, let's assume I have a 5-letter geohash: how can I convert that back into a viewable bounding box? In other words, how do I know what is 'included' in the hash, and what is in adjacent hashes?
Thanks in advance for your help,
I used geohashes with Google App Engine data types (i.e. db.GeoPt) a lot, and I used to keep a geohash, which I found was inferior to combining db.GeoPt with the very good but somewhat slow library called geomodel. Geomodel can do bounding-box and radius queries, and I suggest that you try the bounding-box query since it is not as expensive as the radius one. I can perform a bounding-box query like this:
articles = Article.bounding_box_fetch(Article.all().filter('modified >',
timeline).filter('published =',
True).filter('modified <=',
max_results=PAGESIZE + 1)
So even though I stored a geohash for every article, using geomodel was much better in my case. Maybe you have already evaluated geomodel and found that it doesn't suit your purpose and that you absolutely must use geohashes. In that case I suggest that we agree on a common geohash library so that our coordinates hash to the same value. I do keep a version of the geohash library I used somewhere, but it is probably outdated, and the recent articles about geospatial queries also mention geomodel. So if you haven't looked at geomodel yet, I really suggest you look at it for your geospatial queries.
You may want to update your question stating whether or not you're using django / django-nonrel?
I'm just about to try this (currently archived) port of Geomodel to django:
Kyle suggests that the upcoming Google "full text search" would replace his Geomodel implementation. Nonetheless, I need it working within the next few days.
(My current conversation re: this topic:!topic/django-non-relational/WCxFjkUzw18
I have a db.StringProperty() holding a geohash. Given a hash code, how do I find the 10 closest results?
I tried the query below, but it doesn't seem to be right:
pois = POI.all().filter('geohash <', h_latlng).order('-geohash').fetch(10)
A geohash on its own cannot find the n nearest results. You can find the contents of any square region by prefix, but to get a reliable result containing the n nearest you need to fetch at least 9 prefixes (the cell containing the point plus its 8 neighbours), which makes it quite an expensive query. Complicating the matter, the prefixes of those 9 squares need to be calculated.
IMO this is currently a hard problem to solve efficiently on App Engine. I have been working on it for a year and have not found a sophisticated and fast solution. A relational DB with a geo index, or with support for 2 inequality filters, will perform such tasks better and faster. But I am interested in good solutions, too. :-)
Citation David Troy:
Geohash also has the property that as the number of digits decreases (from the right), accuracy degrades. This property can be used to do bounding box searches, as points near to one another will share similar Geohash prefixes.
However, because a given point may appear at the edge of a given Geohash bounding box, it is necessary to generate a list of Geohash values in order to perform a true proximity search around a point. Because the Geohash algorithm uses a base-32 numbering system, it is possible to derive the Geohash values surrounding any other given Geohash value using a simple lookup table.
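To make the "9 prefixes" point concrete, here is a small sketch that decodes a geohash back into its bounding box and derives the cell plus its 8 neighbours. Note that it does not use the lookup table mentioned in the quote: as a simpler stand-in, it steps one cell width or height away from the cell center and re-encodes. The function names are made up for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash_bbox(gh):
    """Return (south, west, north, east) of the cell named by geohash gh."""
    lat, lon = [-90.0, 90.0], [-180.0, 180.0]
    even = True  # bits alternate longitude/latitude, longitude first
    for ch in gh:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):
            rng = lon if even else lat
            mid = (rng[0] + rng[1]) / 2
            if (value >> shift) & 1:
                rng[0] = mid  # bit 1: keep the upper half
            else:
                rng[1] = mid  # bit 0: keep the lower half
            even = not even
    return lat[0], lon[0], lat[1], lon[1]

def geohash_encode(lat, lon, precision):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits = [], 0
    for i in range(precision * 5):
        rng, value = (lon_rng, lon) if i % 2 == 0 else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bit = 1 if value >= mid else 0
        rng[1 - bit] = mid  # bit 1 raises the lower bound, bit 0 lowers the upper
        bits = (bits << 1) | bit
        if i % 5 == 4:
            chars.append(BASE32[bits])
            bits = 0
    return "".join(chars)

def covering_cells(gh):
    """Return the cell gh and its neighbours, north-west to south-east."""
    south, west, north, east = geohash_bbox(gh)
    height, width = north - south, east - west
    lat_c, lon_c = (south + north) / 2, (west + east) / 2
    cells = []
    for dlat in (1, 0, -1):
        for dlon in (-1, 0, 1):
            lat = lat_c + dlat * height
            if not -90.0 <= lat <= 90.0:
                continue  # no neighbour beyond the poles
            lon = (lon_c + dlon * width + 180.0) % 360.0 - 180.0  # wrap at 180
            cell = geohash_encode(lat, lon, len(gh))
            if cell not in cells:
                cells.append(cell)
    return cells

cells = covering_cells("dqcjqb")
```

A proximity search would then run one prefix query per returned cell and sort the merged results by true distance. `geohash_bbox` also answers the earlier question of what area a given hash covers.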
I have two images of the real world. Importantly, I approximately know the transformation from one image to the other. Due to a texture problem I don't get enough matches between the two images. How can I take the transformation information into account to get more, and more correct, matches using SIFT?
Any idea will be helpful.
Have you tried other alternatives? Are you sure SIFT is the answer? First, OpenCV provides SIFT, among other tools. (At the moment, I can't speak highly enough of OpenCV).
If I were solving this problem, I would first try:
Downsample your two images to reduce the influence of "texture", i.e. cvPyrDown.
Perform some feature detection: edge detection, etc. OpenCV provides a Harris corner detector, among others. Google "cvGoodFeaturesToTrack" for some detail.
If you have good confidence in your transformations, take advantage of your a priori information and look for features in neighborhoods corresponding to the transformed locations.
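The last suggestion, searching only in the neighborhood of the transformed location, could look roughly like the sketch below. It is plain Python and makes several assumptions: the approximate transformation is available as a 3x3 homography `H` given as nested lists, keypoints are (x, y) tuples, descriptors are plain lists of numbers, and `guided_matches` and `radius` are names invented for the example. It is not an OpenCV API.

```python
import math

def apply_homography(H, point):
    """Map an (x, y) point through the 3x3 homography H."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def descriptor_distance(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def guided_matches(kps_a, desc_a, kps_b, desc_b, H, radius=20.0):
    """For each keypoint in image A, consider only keypoints in image B
    that lie within radius pixels of where H predicts it should be, and
    pick the one with the closest descriptor. Returns (index_a, index_b)."""
    matches = []
    for i, (pt, da) in enumerate(zip(kps_a, desc_a)):
        px, py = apply_homography(H, pt)
        best, best_dist = None, float("inf")
        for j, (qt, db) in enumerate(zip(kps_b, desc_b)):
            if math.hypot(qt[0] - px, qt[1] - py) > radius:
                continue  # outside the predicted neighborhood
            dist = descriptor_distance(da, db)
            if dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            matches.append((i, best))
    return matches

# Toy data: image B is image A shifted by (10, 5) pixels.
H = [[1, 0, 10], [0, 1, 5], [0, 0, 1]]
kps_a, desc_a = [(0, 0), (100, 100)], [[1, 0], [0, 1]]
kps_b = [(110, 106), (11, 4), (500, 500)]
desc_b = [[0, 1], [1, 0], [1, 0]]
result = guided_matches(kps_a, desc_a, kps_b, desc_b, H)
```

In the toy data the third keypoint of image B has exactly the same descriptor as the first keypoint of image A, but it is rejected because it is far from the predicted position. This is how the prior knowledge removes ambiguous matches on repetitive texture.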
If you still want to look at SIFT or SURF, OpenCV provides those capabilities, as well.
If you know the transform, then apply the transform and then apply SURF/SIFT to the transformed image. That's one standard way to extend the robustness of feature descriptors/matchers across large perspective changes.
There is another alternative:
In the SIFT parameters, the contrast threshold is set to 0.04 by default. If you reduce it to a lower value (0.02 or 0.01), SIFT will keep more keypoints, which should give you more matches:
SIFT(int nfeatures=0, int nOctaveLayers=3, double contrastThreshold=0.04, double edgeThreshold=10, double sigma=1.6)
The first step, I think, is to experiment with the settings of the SIFT algorithm to find what works best for your problem.
Another way to use SIFT more effectively is to add COLOR information to it. You can append the color (RGB) of each keypoint to its descriptor. For instance, if your descriptor matrix is 10x128, you have 10 keypoints, each with a 128-element descriptor. You can extract the color of each point and add three columns, making the size 10x(128+3) [R-G-B for each point]. This can make SIFT matching more discriminative. But remember, you need to apply a weight to your descriptor so that the last three columns count more strongly than the other 128 columns. I do not know what the images look like in your case, but this method helped me a lot, and it can make SIFT a stronger method than before.
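A minimal sketch of that descriptor augmentation, in plain Python. The function name `add_color`, the scaling of the channels to the 0..1 range and the default `weight` are assumptions for illustration; the right weight depends on how your descriptors are normalized and has to be tuned.

```python
def add_color(descriptors, colors, weight=2.0):
    """Append weighted R, G, B values to each descriptor.

    descriptors: one list of numbers per keypoint (e.g. 128 values each)
    colors:      one (r, g, b) tuple per keypoint, channels in 0..255
    weight:      hypothetical scale factor making color count more or less
    """
    augmented = []
    for desc, (r, g, b) in zip(descriptors, colors):
        color_part = [weight * r / 255.0,
                      weight * g / 255.0,
                      weight * b / 255.0]
        augmented.append(list(desc) + color_part)
    return augmented

# Ten keypoints with 128-element descriptors become 10 x 131.
descriptors = [[0.0] * 128 for _ in range(10)]
colors = [(255, 0, 128)] * 10
augmented = add_color(descriptors, colors)
```

The augmented descriptors can then be matched with the same distance function as before; the color columns simply contribute extra terms to it.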
A similar implementation can be found here.