Search 10 nearest Locations in Datastore - google-app-engine

I have a lot of Entities containing GeoPoints stored in Google's Datastore.
Now I need to get the 10 nearest locations based on a location sent to a Google Cloud Function.
I saw that there is a distance() function in Google's App Engine, but nothing comparable in Google Cloud Functions, and no possibility to calculate anything in the database itself.
Is it possible to get the 10 nearest locations from Datastore using only Google Cloud Functions, or do I need to use a different database for that?
Best Regards,
Pascal

We run a geospatial-heavy service on App Engine.
Our solution is to store the locations in Memcache and do the calculations directly, instead of relying on the database.
This obviously depends on the number of locations, but if you are clever about how you store them, you can search very quickly.
R-trees are a very good example: https://en.wikipedia.org/wiki/R-tree
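As a rough sketch of that idea in Python, using the rtree package (my assumption, not part of the original answer; cached_locations and the query point are placeholders):

from rtree import index

idx = index.Index()
for loc_id, (lat, lng) in enumerate(cached_locations):  # e.g. loaded from Memcache
    idx.insert(loc_id, (lng, lat, lng, lat))  # a point is a degenerate bounding box

# ids of the 10 nearest locations (planar distance on raw degrees, approximate)
nearest_ids = list(idx.nearest((query_lng, query_lat, query_lng, query_lat), 10))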

I had a similar need, and I solved it with a grid-based clustering scheme.
Essentially I created a computed string property that was the concatenation of the latitude & longitude with the decimals chopped off.
If an entity had obj.latitude = 37.123456 & obj.longitude = 45.234567, then obj.grid_id = "37:45".
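In NDB, that computed property might look something like this (a sketch; the model definition is my assumption, not from the original answer):

from google.appengine.ext import ndb

class SomeModel(ndb.Model):
    latitude = ndb.FloatProperty()
    longitude = ndb.FloatProperty()
    # e.g. 37.123456 & 45.234567 -> "37:45"
    # (use math.floor for negative coordinates, since %d truncates toward zero)
    grid_id = ndb.ComputedProperty(
        lambda self: '%d:%d' % (self.latitude, self.longitude))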
When performing a search, I determine the grid of the search latitude & longitude as well as the 8 surrounding grids, and query for all entities that reside in those 9 grids:
# for search latitude = 37.456 & longitude = 45.67
query = SomeModel.query(SomeModel.grid_id.IN([
    '36:44', '36:45', '36:46',
    '37:44', '37:45', '37:46',
    '38:44', '38:45', '38:46',
]))
You would then find the 10 nearest in code.
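Here is a minimal sketch of that last step, assuming each entity has latitude and longitude float properties (the helper names are mine):

import math

def haversine_km(lat1, lng1, lat2, lng2):
    # great-circle distance between two lat/lng pairs, in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_10(entities, lat, lng):
    return sorted(
        entities,
        key=lambda e: haversine_km(e.latitude, e.longitude, lat, lng),
    )[:10]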
Depending on your needs you may want to make the grid IDs include decimal positions (obj.grid_id = "37.1:45.2") or make them less precise (obj.grid_id = "30:40", truncating to tens).
This may or may not work for you depending on the distribution of your data points, in which case Zeb's suggestion to use an R-tree is more robust, but this was simple to implement and sufficed for my needs.

Please have a look at the following post:
Geospatial Query at Google App Engine Datastore
Unfortunately it is not possible to get the nearest location from the Google Cloud Datastore itself. You have to implement your own logic, or use a different database.

Related

Location based horizontal scalable dating app database model

I am assessing backends for a location-based dating app similar to Tinder.
The app's core feature is showing nearby online users (with sex and age filters).
Some database engines in mind are Redis, Cassandra, and MySQL Cluster.
The app should scale horizontally by adding nodes at high-traffic times.
After researching, I am very confused about whether there is a common "best practice" data model or algorithm for this.
My approach is using Redis Cluster:
# Store all online users in the same location (city) in a Set.
# Here, store user:1 in the New York set.
SADD location:NewYork 1
# Store every user's age in a Sorted Set. Here, user:1 has age 30.
ZADD age 30 "1"
# Retrieve users in New York aged 20 to 40: intersect the Set with the
# Sorted Set (AGGREGATE MAX keeps the age as the score), then range by score.
ZINTERSTORE tmpkey 2 location:NewYork age AGGREGATE MAX
ZRANGEBYSCORE tmpkey 20 40
I am inexperienced and cannot foresee the potential problems if this has to scale to millions of concurrent users.
Hope a veteran can shed some light.
For your use case, MongoDB would be a good choice.
You can store each user in a single document, along with their current location.
Create indexes on the fields you want to query on, e.g. age, gender, location.
MongoDB has built-in support for geospatial queries, so it is easy to find users within a 1 km radius of another user.
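As a rough sketch with pymongo (database, collection, and field names are placeholders):

from pymongo import MongoClient, GEOSPHERE

users = MongoClient().dating.users
users.create_index([("location", GEOSPHERE)])  # 2dsphere index for geo queries

# users within 1 km of a point, filtered by age and gender
nearby = users.find({
    "location": {"$nearSphere": {
        "$geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
        "$maxDistance": 1000,  # metres
    }},
    "age": {"$gte": 20, "$lte": 40},
    "gender": "F",
})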
Most NoSQL geo/proximity index features rely on the geohash algorithm:
http://www.bigfastblog.com/geohash-intro
It's a good thing to understand how it works, and it's really quite fascinating. The technique can also be used to create highly efficient indexes on a relational database.
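To give a feel for how it works, here is a minimal geohash encoder (an illustrative sketch, not a production implementation):

def geohash_encode(lat, lng, precision=9):
    # interleave longitude/latitude range-halving bits,
    # emitting one base32 character per 5 bits
    base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits = (bits << 1) | (val >= mid)
        rng[0 if val >= mid else 1] = mid
        even, bit_count = not even, bit_count + 1
        if bit_count == 5:
            chars.append(base32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# nearby points share a common prefix, which is what makes it indexable:
# geohash_encode(57.64911, 10.40744) -> "u4pruydqq"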
Redis does have native support for this, but if you're using ElastiCache, that version of Redis does not, and you'll need to manage this in your API.
Any relational database will give you the most flexibility and the simplest solution. The problem you may face is query time. If you optimize your DB instance for searches (possibly keeping a 'search db' separate from profile/content data), then it's possible to hold the entire index in memory for fast results.
I can also talk a bit about Redis: the sorted-set operations are blazingly fast, but you need to filter. Either you scan through your nearby results and look up meta information to filter, or you maintain separate sets for every combination of filters you may need. The first has more performance overhead. The second requires you to manage the indexes yourself. E.g.: what if someone removes one of their 'likes'? What if they move around?
It's not flash or fancy, but in most cases where you need to search a range of data, relational databases win due to their simplicity and support. Think of your search store as a replica of your master source, and you can always migrate to another solution, or re-shard/scale if you need to in the future.
You may be interested in the Redis Geo API.
The Geo API consists of a set of new commands that add support for storing and querying pairs of longitude/latitude coordinates into Redis keys. GeoSet is the name of the data structure holding a set of (x,y) coordinates. Actually, there isn’t any new data structure under the hood: a GeoSet is simply a Redis SortedSet.
Redis Geo Tutorial
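For example, with redis-py 4.x or later (key and member names are placeholders):

import redis

r = redis.Redis()
r.geoadd("online:NewYork", (-73.99, 40.73, "user:1"))  # lng, lat, member

# members within 1 km of a point, nearest first
nearby = r.geosearch(
    "online:NewYork",
    longitude=-73.99, latitude=40.73,
    radius=1, unit="km", sort="ASC",
)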
I will also support MongoDB on the basis of the requirements. With the development of MongoDB Compass you can also visualize your geospatial data. The MongoDB Compass documentation is at https://docs.mongodb.com/compass/getting-started/.

Geosearch Search API GAE filtering distance(geopoint(MY_GEOPOINT), store_location) < DISTANCE queries

I'm in doubt about how the Search API executes queries, especially how it scans the Documents in the Index. My doubt is the following:
I have an Index with a lot of Documents with GeoPoints in it, and I want to list the points that are within a specific radius. For example, if I have 20 million Documents in the Index and do a search like this one:
String query = "distance(geopoint(MY_GEOPOINT), store_location) < 10000";
it will list the stores that are within a radius of 10 km.
My question is: how is the Search API going to do this? Will it scan the 20 million Documents (and take a long time), or will it optimize in some way?
I'm asking because of performance: I am developing an app that will use geosearch, and I'm afraid it will get slow as the database grows.
Thanks for any help.
Kind regards, JLuiZ20
It will certainly not scan all records. You can infer this from the documentation at https://cloud.google.com/appengine/training/fts_adv/
Because geopoint is a supported data type with a built-in distance function and geospatial query support, the Search API can index geopoints in a way that lets it query on a radius from a point efficiently.
The App Engine docs don't mention the algorithm used; there are many such algorithms it could be using, but you can bet it's efficient, otherwise it wouldn't be a supported type.
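For reference, the same radius query through the Python Search API looks roughly like this (index and field names are placeholders):

from google.appengine.api import search

index = search.Index(name="stores")
results = index.search(search.Query(
    # stores within 10,000 m of the given point
    query_string="distance(store_location, geopoint(37.78, -122.39)) < 10000",
    options=search.QueryOptions(limit=10),
))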

How to implement an efficient map tile engine on Google App Engine?

I am trying to implement a map tile engine on Google App Engine.
The map data is stored in the database, the Datastore (Bigtable). The problem is that 20 requests might come in at approximately the same time to draw 20 tiles based upon the same set of rows in the database.
So 20 requests come in. If I write the code to read from the database for each request, then I will be doing 20 identical reads, one for each tile image output. Since each read is the same query, it doesn't make sense to run the same query 20 times. In fact, this is very inefficient.
Can anyone suggest a better way to do this?
If I use memcache, I need to put the data into memcache, but there are 20 requests coming in at the same time for the same data, so with a naive implementation 20 processes will all be writing to memcache, since they are running in parallel.
I am programming in Google Go version 1 beta on Google App Engine; I refer to the Python docs here since they are more complete.
References:
Google Datastore: http://code.google.com/appengine/docs/python/datastore/overview.html
Leaflet JS, which I am using to show map tiles: http://leaflet.cloudmade.com/
To clarify: I generate the tile images from data in the database. That is, I query the database for the data (which is not the tile image), then draw the data into an image and render the image as a JPEG, since GAE is efficient at drawing images on the server side: http://blog.golang.org/2011/12/from-zero-to-go-launching-on-google.html
I don't know how Google App Engine does it, but MySQL has a query cache, so that if the same query gets asked twice in a row, it uses the results from the first to answer the second. Google is smart about things, so hopefully they do that as well. (You might be able to figure out whether they do by timing it.)
One thing you might need to make sure of is that the queries are exactly the same, not just queries returning the same results. For example, you don't want query1 to be
SELECT lat, lng FROM mytable WHERE tileX=1 AND tileY=1
and query2 to be
SELECT lat, lng FROM mytable WHERE tileX=1 AND tileY=2
I make tiles with gazillions of polygons, and when I did timing and optimization, I found to my surprise that it was faster to return ALL values and weed out the ones I didn't want in PHP than to stick a WHERE clause into the SQL. I think that was partly because the WHERE clause was different for every tile, so the MySQL server couldn't cache effectively.
1. Organize tile entities so that you can find them via key instead of querying for them, i.e. using get() instead of query(). If you identify a tile based on several criteria, create a natural ID by combining those criteria. E.g. if you find a tile based on its vertical and horizontal position inside an image, you'd do: naturalID = imageID + verticalID + horizontalID (you can also add separators for better viewing).
2. Once you have your own unique IDs, you can use them to save the tile in Memcache.
3. If your tiles are immutable (= once created, their content does not change), then you can also cache them inside the instance in a global map.
Edit: removed Objectify reference as I just realized you use Python.
Edit 2: added point 3.
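A minimal Python sketch of points 1-3 (names are illustrative; real code would also guard against the concurrent-write stampede the question mentions):

from google.appengine.api import memcache
from google.appengine.ext import ndb

_tile_cache = {}  # per-instance cache for immutable tiles (point 3)

def get_tile(image_id, vertical_id, horizontal_id):
    natural_id = "%s:%s:%s" % (image_id, vertical_id, horizontal_id)
    tile = _tile_cache.get(natural_id) or memcache.get(natural_id)  # point 2
    if tile is None:
        tile = ndb.Key("Tile", natural_id).get()  # point 1: get() by key, no query
        memcache.set(natural_id, tile)
    _tile_cache[natural_id] = tile
    return tile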
A few things come to mind:
How do you query for the tiles? You should be able to get the tiles using Key.get(), which is far more efficient than a query.
Try to reduce the number of requests; using zoom levels should reduce it to around 4 requests to retrieve the map.

Calculate Distances Between Addresses Without Google Maps

I basically want to create a search in our Sales Orders database to find items that were shipped within a range of a particular address.
I can't use Google's API because:
It will be a report, and there is no way to display a map at runtime, which violates the terms of service.
Google limits you to 1,600 requests a day, so comparing an arbitrary address to all our sales orders would exceed that before one search completed.
I imagine running the Directions API to compare the address to each order in our database would take forever.
A lot of this will depend on the precision you need and exactly what you want to do.
For example, if you want line-of-sight calculations, you can use a service like http://geocoder.us/ to get the latitude and longitude of each address; from those you can do a simple calculation to get the "as the crow flies" distance between two points.
If you want true driving distance, that will be much more complicated.
You could use the Yahoo! geocoding API, cache the lat/lons in the database, and then use the lat/lons and standard mapping library calls to determine the distance to your point of interest.
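For example, with cached coordinates and the geopy package (my suggestion; the answer doesn't name a specific library):

from geopy.distance import great_circle

order_location = (40.6892, -74.0445)   # cached lat/lon for a sales order
search_location = (40.7484, -73.9857)  # geocoded search address
distance_km = great_circle(order_location, search_location).km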
Have you looked at other geocoding APIs, like Yahoo or Bing?
There's also this one, which says it's free and offers a few .NET code samples.

Clustering Lat/Longs in a Database

I'm trying to see if anyone knows how to cluster some lat/long results, using a database, to reduce the number of results sent over the wire to the application.
There are a number of resources about how to cluster, either on the client side OR on the server (application) side... but not on the database side :(
This is a similar question, asked by a fellow S.O. member. The solutions are server-side based (i.e. C# code-behind).
Has anyone had any luck or experience with solving this in a database? Are there any database gurus out there who are after a hawt and sexy DB challenge?
please help :)
EDIT 1: Clarification - by clustering, I'm hoping to group x number of points into a single point for an area. So, if I say cluster everything in a 1 mile / 1 km square, then all the results in that 'square' are GROUPed into a single result (say... the middle of the square).
EDIT 2: I'm using MS SQL 2008, but I'm open to hearing if there are solutions in other DBs.
I'd probably use a modified* version of k-means clustering using the cartesian (e.g. WGS-84 ECF) coordinates for your points. It's easy to implement & converges quickly, and adapts to your data no matter what it looks like. Plus, you can pick k to suit your bandwidth requirements, and each cluster will have the same number of associated points (mod k).
I'd make a table of cluster centroids, and add a field to the original data table to indicate which cluster it belongs to. You'd obviously want to update the clustering periodically if your data is at all dynamic. I don't know if you could do that with a stored procedure & trigger, but perhaps.
*The "modification" would be to adjust the length of the computed centroid vectors so they'd be on the surface of the earth. Otherwise you'd end up with a bunch of points with negative altitude (when converted back to LLH).
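A bare-bones sketch of that approach (pure Python over cartesian point tuples; the re-projection of centroids onto the surface described above is left out):

import random

def kmeans(points, k, iters=20):
    # points: list of (x, y, z) cartesian tuples
    centroids = random.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters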
If you're clustering on geographic location, and I can't imagine it being anything else :-), you could store a "cluster ID" in the database along with the lat/long co-ordinates.
What I mean by that is to divide the world map into (for example) a 100x100 matrix (10,000 clusters), with each co-ordinate assigned to one of those clusters.
Then, you can detect very close co-ordinates by selecting those in the same square, and moderately close ones by selecting those in adjacent squares.
The size of your squares (and therefore the number of them) will be decided by how accurate you need the clustering to be. Obviously, if you only have a 2x2 matrix, you could get clusters of co-ordinates that are a long way apart.
You will always have edge cases, such as two points close together but in different clusters (one northernmost in one cluster, the other southernmost in another), but you can adjust the cluster size OR post-process the results on the client side.
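A toy Python version of the idea, to make it concrete (the cell size is a placeholder; in the database this would become a GROUP BY over precomputed cluster IDs):

from collections import defaultdict

def grid_cluster(points, cell_deg=0.1):
    # assign each (lat, lng) to a square, then represent each occupied
    # square by the mean of its points
    grid = defaultdict(list)
    for lat, lng in points:
        grid[(int(lat // cell_deg), int(lng // cell_deg))].append((lat, lng))
    return [
        (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for pts in grid.values()
    ]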
I did a similar thing for a geographic application where I wanted to ensure I could cache point sets easily. My geohashing code looks like this:
def compute_chunk(latitude, longitude)
  (floor_lon(longitude) * 0x1000) | floor_lat(latitude)
end

def floor_lon(longitude)
  ((longitude + 180) * 10).to_i
end

def floor_lat(latitude)
  ((latitude + 90) * 10).to_i
end
Everything got really easy from there. I had some code for grabbing all of the chunks from a given point out to a given radius, which would translate into a single memcache multi-get (and some code to backfill chunks that were missing).
For movielandmarks.com I used the clustering code from Mike Purvis, one of the authors of Beginning Google Maps Applications with PHP and AJAX. It builds trees of clusters/points for different zoom levels using PHP and MySQL, storing it in the database so that recall is very fast. Some of it may be useful to you even if you are using a different database.
Why not test multiple approaches?
Translate the weka library to the .NET CLI with IKVM.NET.
Add an assembly built from your code and weka.dll (use ilmerge) into your database.
Then run some tests, that is. No single clustering algorithm works better than all others in every case.
I believe you can use MSSQL's spatial data types. If they are similar to other spatial data types I know, they store your points in a tree of rectangles, and you can then go to the lower-resolution rectangles to get implicit clusters.
If you end up wanting to explore geohashes (which were invented at almost exactly the same time you posted this question), here's a more fleshed-out implementation of geohash-related functions for SQL Server's T-SQL in which you might be interested:
QalGeohash-TSQL
I have used the integer version of the geohash extensively to cluster results, to reduce the data sent to a client for a limited viewport.
