Google App Engine: query geohash

I have a db.StringProperty() holding a geohash. Given a hash code, how do I find the 10 closest results?
I tried the following, but it doesn't seem to be right:
pois = POI.all().filter('geohash <', h_latlng).order('-geohash').fetch(10)

A geohash alone cannot find the n nearest results. You can find the contents of any square region by prefix, but to get a reliable result containing the n nearest points you need to fetch at least 9 prefixes (the cell containing the point plus its 8 neighbours), which makes it a fairly expensive query. Complicating matters, the prefixes of those 9 squares have to be calculated first.
IMO this is currently a hard problem to solve efficiently on App Engine. I have been working on it for a year and have not found a sophisticated, fast solution. A relational DB with a geo index, or a datastore that allowed two inequality filters, would perform such queries better and faster. But I am interested in good solutions, too. :-)
Citation from David Troy:
Geohash also has the property that as the number of digits decreases (from the right), accuracy degrades. This property can be used to do bounding box searches, as points near to one another will share similar Geohash prefixes.
However, because a given point may appear at the edge of a given Geohash bounding box, it is necessary to generate a list of Geohash values in order to perform a true proximity search around a point. Because the Geohash algorithm uses a base-32 numbering system, it is possible to derive the Geohash values surrounding any other given Geohash value using a simple lookup table.
See: https://github.com/davetroy/geohash-js
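To make the "9 prefixes" approach concrete, here is a minimal sketch against the POI model from the question. It assumes the python-geohash package for encode()/neighbors() and a hypothetical true_distance() haversine helper for the final client-side ranking; it is a sketch, not a drop-in solution.

import geohash  # assumed: the python-geohash package (encode()/neighbors())

def fetch_nearby(lat, lng, precision=6, per_cell=10):
    # Centre cell plus its 8 neighbours: the "9 prefixes" mentioned above.
    centre = geohash.encode(lat, lng, precision)
    cells = [centre] + geohash.neighbors(centre)
    candidates = []
    for prefix in cells:
        # Standard datastore prefix query: prefix <= geohash < prefix + u'\ufffd'
        q = POI.all().filter('geohash >=', prefix).filter('geohash <', prefix + u'\ufffd')
        candidates.extend(q.fetch(per_cell))
    # Rank the merged candidates client-side by real great-circle distance
    # (true_distance() is a hypothetical haversine helper, not shown).
    candidates.sort(key=lambda poi: true_distance(lat, lng, poi))
    return candidates[:10]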

Related

Algorithm sorting details, but without excluding

I have come across a problem.
I'm not asking for help with constructing what I'm searching for, only for a pointer towards what I'm looking for! 😊
The thing I want to create is some sort of 'Sorting Algorithm/Mechanism'.
Example:
Imagine I have a database with over 1000 pictures of different vehicles.
A person sees a vehicle and now tries to gather as much information and detail about that vehicle as possible, such as:
Shape
number of wheels
number and shape of windows
number and shape of light(s)
number and shape of exhaust(s)
Etc…
He then gives me all the information about the vehicle he saw, BUT without telling me anything about:
Make and model.
…
I will now take that information and tell my database to sort every vehicle so that it arranges all 1000 vehicles by best match, based on the description it has been given.
But it should NOT exclude any vehicle!
So…
If the person tells me that the vehicle only has 4 wheels, but in reality it has 5 (he might not have seen the fifth wheel), it should just get a bad score for the number of wheels.
But if every other aspect matches that vehicle perfectly, it will still get a high score.
That way we don't exclude the vehicle he has seen, and we still have a chance of finding the correct vehicle.
The whole point of this mechanism is, as said, to narrow things down: instead of looking through 1000 vehicles we only need to look through the best matches, hopefully 10 to maybe 50 vehicles out of the 1000.
I tried to describe it as best I could in a language that isn't my mother tongue, so bear with me.
Again, I'm not asking anybody to make this algorithm for me; I'm pretty sure nobody even wants to, or has the time to, do that without getting paid somehow...
But I just need to know where to look to learn and understand how to create this mess of a mechanism.
Kind regards
Gent!
Assuming that all your pictures have been indexed with the relevant fields (number of wheels, window shapes...), and given that they are not too numerous (a thousand is peanuts for a computer), you can proceed as follows:
for every criterion, weight the possible discrepancies (e.g. one wheel too many costs 5, one wheel too few costs 10, a bad window shape costs 8...). Do this in a coherent way so that the costs of the criteria are well balanced.
to perform a search, evaluate the total discrepancy cost of every car and sort the values in increasing order. Report the first ten.
Technically, what you are after is called a "nearest neighbor search" in a high dimensional space. This problem has been well studied. There are fast solutions but they are extremely complex, and in your case are absolutely not worth using.
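A minimal sketch of this cost-based ranking; the field names and cost values are made up for illustration.

# Hypothetical per-criterion discrepancy costs (asymmetric penalties are fine).
COSTS = {
    'n_wheels':  lambda seen, actual: 5 * max(seen - actual, 0) + 10 * max(actual - seen, 0),
    'n_windows': lambda seen, actual: 8 * abs(seen - actual),
}

def best_matches(description, vehicles, top_n=10):
    def total_cost(vehicle):
        # Sum the weighted discrepancies over the criteria the witness reported.
        return sum(cost(description[field], vehicle[field])
                   for field, cost in COSTS.items()
                   if field in description)
    # Every vehicle stays in; poor matches simply rank last.
    return sorted(vehicles, key=total_cost)[:top_n]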
The usual way of doing this, for example in artificial intelligence, is to encode all properties as a vector and to apply a weight to each property. The distance can then be calculated using any metric you like; in your case the Manhattan distance should be fine. In pseudocode:
def distance(first_car, second_car):
    return (abs(first_car.n_wheels - second_car.n_wheels) * wheels_weight
            + ...  # one weighted term per encoded attribute
            + abs(first_car.n_windows - second_car.n_windows) * windows_weight)
This works fine for simple properties like the number of wheels. For more complex properties like the shape of a window, you'll probably need to split it into multiple attributes, depending on your requirements on similarity.
Weights are usually chosen so as to normalize all values when their ranges are known. Optionally, an additional factor can be multiplied in to increase the impact of a specific attribute on the overall distance.

General Big-Data principles for finding pairs of similar objects - "fuzzy inner join"

Firstly, sorry for the vague title and if this question has been asked before; I was not entirely sure how to phrase it.
I am looking for general design principles for finding pairs of 'similar' objects from two different data sources.
Let's say for simplicity that we have two databases, A and B, both containing large volumes of objects, each with a time-stamp and a geo-location, along with some other data that we don't care about here.
Now I want to perform a search along these lines:
Within a certain time-frame and location, dictated at search time, find pairs of objects from A and B respectively, ordered by some similarity score: for example, some scalar 'time/space distance' function distance(a, b) that calculates the distance in time and space between the two objects.
I am expecting to get a (potentially ginormous) set of results where the first result is a pair of data points which has the minimum 'distance'.
I realize that the full search space is cardinality(A) x cardinality(B).
Are there any general guidelines on how to do this in a reasonably efficient way? I assume that I would need to replicate the two databases into a common repository like Hadoop? But then what? I am not sure how to perform such a query in Hadoop either.
What is this type of query called?
To me, this is some kind of "fuzzy inner join" that I struggle to wrap my head around how to construct, let alone efficiently and at scale.
SQL joins don't have to be based on equality. You can use ">", "<", "BETWEEN".
You can even do something like this:
select a.val aval, b.val bval, a.val - b.val diff
from A join B on abs(a.val - b.val) < 100
What you need is a way to divide your objects into buckets in advance, without comparing them (or at least while making a linear, rather than quadratic, number of comparisons). That way, at query time, you will only be comparing a small number of items.
There is no "one-size-fits-all" way to bucket your items. In your case the bucketing can be based on time, geolocation, or both. Time-based bucketing is very natural and can also scale elastically (increase or decrease the bucket size). Geo-clustering buckets can be based on distance from a particular point in space (if the space is abstract), or on some finite division of the space (for example, dividing the entire world map into tiles, which also scales nicely if done right).
A good question to ask is "if my data starts growing rapidly, can I handle it by just adding servers?" If not, you might need to rethink the design.
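A minimal sketch of the bucketing idea described above, assuming objects are plain dicts with 'timestamp', 'lat' and 'lng' keys; the bucket sizes are placeholders you would tune for your data.

import math
from collections import defaultdict

TIME_BUCKET = 3600   # seconds per time bucket (illustrative)
TILE_DEG = 0.5       # degrees of latitude/longitude per geo tile (illustrative)

def bucket_key(obj):
    # Map an object to a (time-bucket, lat-tile, lng-tile) key.
    return (int(obj['timestamp'] // TIME_BUCKET),
            int(math.floor(obj['lat'] / TILE_DEG)),
            int(math.floor(obj['lng'] / TILE_DEG)))

def candidate_pairs(objects_a, objects_b):
    # Only objects in the same bucket are compared, so the pre-pass is linear
    # rather than |A| x |B|. A real implementation would also check the
    # neighbouring buckets so that near misses across a boundary are not lost.
    buckets = defaultdict(list)
    for b in objects_b:
        buckets[bucket_key(b)].append(b)
    for a in objects_a:
        for b in buckets.get(bucket_key(a), []):
            yield a, b   # feed these pairs into the real distance(a, b) scoring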

How can I find the closest document using Google App Engine Search API?

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.
I would like to find the closest document to a specific set of coordinates, but I find that the following code gives incorrect results:
from google.appengine.api import search

# coords are in the form of a tuple e.g. (50.123, 1.123)
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location',
                            value=search.GeoPoint(coords[0], coords[1]))])

# find document function; radius is in metres
def find_document(coords, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)
    search_query = search.Query(
        query_string='distance(location, geopoint(%.3f, %.3f)) < %d'
                     % (coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))
    index = search.Index(name='document-index')
    return index.search(search_query)
With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.
I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is reduced to around 12 km (radius=12000). There are generally no more than 1000 documents within a 12 km radius. (Probably associated with search.SortOptions(limit=1000).)
The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000 (12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?
I believe the issue is the following.
Your query will select up to 10K documents, and those are then sorted according to your distance sort expression and returned. (That is, the sort is not in fact performed over all 400k documents.)
So I suspect that some of the geographically closer points are not included in this 10k selection.
That's why things work better when you narrow your search radius, as you have fewer total points in that radius.
Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on.
You can address this in at least a couple of ways, which you can combine:
Add a ranking, so that the most 'important' docs (by some criteria that makes sense in your domain) are returned in rank order, then these will be sorted by distance.
Filter on one or more document field(s) (e.g., 'business category', if your docs contain information about businesses) to reduce the number of candidate docs.
(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added).
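For illustration, a sketch of combining the two suggestions above; the 'category' field, its value and the importance_score rank are hypothetical and would come from your own documents.

# Hypothetical: documents are indexed with a 'category' field and a rank.
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location',
                            value=search.GeoPoint(coords[0], coords[1])),
            search.TextField(name='category', value='restaurant')],
    rank=importance_score)   # higher-ranked docs are considered first when the query is capped

# Narrow the candidate set so the distance sort operates on well under 10k hits.
query_string = ('category: restaurant AND '
                'distance(location, geopoint(%.3f, %.3f)) < %d'
                % (coords[0], coords[1], radius))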
I have the exact same problem, and I don't think it's possible. The problem happens, as you have figured out yourself, when there are more possible results than returned results. The Google algorithm simply stops when it has reached the limit and only then sorts the results.
I have seen the same clusters as you, and it's part of the Search API.
One hack would be to subdivide your search into sub-sectors, issue multiple simultaneous calls, and then merge and order the results.
A wild idea: why not keep/record the distance from 3 fixed reference points and then calculate from that?
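A sketch of the sub-sector hack, assuming the search_async() variant of the Search API for the simultaneous calls; the sub-sector offsets, radii and limits are illustrative only.

def find_nearest_subdivided(coords, radius=1000000):
    lat, lng = coords
    step = 5.0   # degrees between sub-sector centres (illustrative)
    centres = [(lat + dy, lng + dx) for dy in (-step, 0.0, step)
                                    for dx in (-step, 0.0, step)]
    index = search.Index(name='document-index')
    futures = []
    for c_lat, c_lng in centres:
        query = search.Query(
            query_string='distance(location, geopoint(%.3f, %.3f)) < %d'
                         % (c_lat, c_lng, radius // 3),
            options=search.QueryOptions(limit=1000,
                                        returned_fields=['location']))
        futures.append(index.search_async(query))
    docs = [doc for f in futures for doc in f.get_result().results]
    # Merge, de-duplicate by doc_id and order by true distance to the original
    # point client-side (not shown).
    return docs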

how to do fuzzy search in big data

I'm new to this area and I mostly wonder what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and some distance(key1, key2) defined somehow (I'm not sure whether it must be a metric, i.e. whether the triangle inequality must always hold).
What I want is essentially a search(key) function that returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe it is also just a lazy iterator. Maybe there can also be a count limit, with an item (key, value) being in the returned set with some probability P, where P = 1/distance(key, search-key) or so (i.e. a perfect match would certainly be in the set, and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustID fingerprint and have defined this compare function. They use the PostgreSQL GIN index and, I guess (although I haven't fully read/understood the acoustid-server code), the GIN partial-match algorithm, but I haven't fully understood whether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This mostly serves to narrow the search space down to a smaller one. However, it has several limitations, e.g. a match must still be exact within the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution, each application will need different approach.
Neither of the two examples actually does traditional nearest-neighbour search. AcoustID (I'm the author) just looks for exact matches, but it searches over a very large number of hashes in the hope that some of them will match. The phonetic-search example uses Metaphone to convert words to their phonetic representation and likewise only looks for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with even simpler approach.
Btw, if you are looking specifically for text search, the simplest thing you can do is split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
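A minimal sketch of the N-gram idea, using trigrams and a plain in-memory dict as the inverted index; a real system would back this with a proper index.

from collections import defaultdict

def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Inverted index from each trigram to the keys that contain it.
index = defaultdict(set)

def add(key):
    for gram in ngrams(key):
        index[gram].add(key)

def candidates(query, min_shared=2):
    # Return keys sharing at least min_shared trigrams with the query;
    # re-rank these few candidates with the real distance() function afterwards.
    counts = defaultdict(int)
    for gram in ngrams(query):
        for key in index[gram]:
            counts[key] += 1
    return [k for k, c in counts.items() if c >= min_shared]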
I suggest you take a look at FLANN (Fast Library for Approximate Nearest Neighbors). Fuzzy search in big data is also known as approximate nearest-neighbour search.
This library offers different metrics, e.g. Euclidean and Hamming, and different clustering methods, e.g. LSH or k-means.
The search always runs in two phases. First you feed the system with data to train the algorithm; this is potentially time-consuming depending on your data.
I successfully clustered 13 million data points in less than a minute, though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or a maximum number of neighbours.
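For example, a small sketch using the pyflann Python bindings, assuming your items can be encoded as fixed-length numeric feature vectors; the toy data and index parameters are illustrative.

import numpy as np
from pyflann import FLANN   # assumed: the pyflann bindings that ship with FLANN

dataset = np.random.rand(13000, 32)   # 13k items with 32-dimensional features (toy data)
queries = np.random.rand(5, 32)

flann = FLANN()
# Build a k-means index and return the 10 nearest neighbours of each query vector.
indices, dists = flann.nn(dataset, queries, num_neighbors=10,
                          algorithm='kmeans', branching=32, iterations=7)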
As Lukas said, there is no good generic solution; each domain has its own tricks to make it faster, or a better way that exploits the inner properties of the data you're using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use BoW (bag of words), which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values look like, the Levenshtein algorithm (also called edit distance) can help. It calculates the least number of edit operations necessary to transform one string into another.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/
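For reference, a straightforward (unoptimised) dynamic-programming sketch of the Levenshtein distance:

def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca for cb
        prev = curr
    return prev[-1]

levenshtein('kitten', 'sitting')   # -> 3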

How to calculate the geohash of the viewable area/bounding box?

I've recently started working on a personal project involving geo locations, maps (Google Maps V3), etc.
The project is developed in Python and is intended to run on Google App Engine.
I've learned that in order to find markers/positions close to a given position, one can use the geohash algorithm (which is pretty cool).
What I don't understand is this: let's say I have all my locations in the datastore (along with the latitude, longitude and a high-precision geohash of each location).
I know that I should use the prefix of the geohash (to match locations within it), but how do I calculate the geohash of a bounding box? Considering the bounding box is made up of two points, north-east and south-west, I do not understand how to go about doing this.
In order to query which locations should be returned for the currently visible bounding box, I need the geohash of the visible/viewable bounding box. Now, I know I can geohash the centre of the viewable map, but I do not know how many characters to cut off (to reduce precision) to achieve a fit to the actual bounding box. (Or maybe that isn't the way to do it...?)
What do you do when the bounding box spans two geohashes (for example, when the middle of the viewable area splits between 'dqcjr0' and 'dqcjqb')?
Also, let's assume I have a 5-character geohash: how can I convert that back into a viewable bounding box? In other words, how do I know what is 'included' in the hash, and what is in adjacent hashes?
Thanks in advance for your help,
Ken.
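To illustrate the prefix idea the question describes, a small sketch assuming the python-geohash package and its encode() and bbox() helpers; it is not the approach the answer below ends up recommending.

import geohash   # assumed: the python-geohash package (encode()/bbox())

def viewport_prefix(ne, sw, precision=12):
    # Longest geohash prefix shared by the NE and SW corners of the viewport.
    h_ne = geohash.encode(ne[0], ne[1], precision)
    h_sw = geohash.encode(sw[0], sw[1], precision)
    prefix = ''
    for c1, c2 in zip(h_ne, h_sw):
        if c1 != c2:
            break
        prefix += c1
    # The prefix can be short, or even empty, when the viewport straddles cell
    # boundaries; that is exactly the 'dqcjr0' vs 'dqcjqb' situation described above.
    return prefix

# Going the other way: the bounding box covered by a 5-character hash.
box = geohash.bbox('dqcjr')   # -> {'n': ..., 's': ..., 'e': ..., 'w': ...}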
I used geohash with the Google App Engine data types (i.e. db.GeoPt) a lot. I used to store a geohash as well, but I found that approach inferior to combining db.GeoPt with the very good, if a bit slow, library called geomodel. Geomodel can do bounding-box and proximity (radius) queries, and I suggest you try the bounding-box query, since it is not as expensive as the radius one. I can perform a bounding-box query like this:
articles = Article.bounding_box_fetch(
    Article.all().filter('modified >', timeline)
                 .filter('published =', True)
                 .filter('modified <=', bookmark)
                 .order('-modified'),
    bounds,
    max_results=PAGESIZE + 1)
So even though I stored a geohash for every article, using geomodel was much better in my case. Maybe you have already evaluated geomodel and found that it didn't suit your purpose and that you absolutely must use geohashes; in that case I suggest we agree on a common geohash library so that our coordinates hash to the same values. I still keep a version of the geohash library I used somewhere, but it is probably outdated, and the recent articles about geospatial queries also mention geomodel. So if you haven't looked at geomodel yet, I really recommend the geomodel library for your geospatial queries.
Ken
You may want to update your question to state whether or not you're using Django / django-nonrel.
I'm just about to try this (currently archived) port of Geomodel to Django:
https://bitbucket.org/scotch/django-geomodel/
Kyle suggests that the upcoming Google "full text search" will replace his Geomodel implementation. Nonetheless, I need it working within the next few days.
(My current conversation re: this topic:
https://groups.google.com/forum/#!topic/django-non-relational/WCxFjkUzw18
)
Jon