I am trying to implement a map tile engine on Google App Engine.
The map data is stored in the datastore (Bigtable). The problem is that 20 requests might come in at approximately the same time to draw 20 tiles based on the same set of rows in the datastore.
So when 20 requests come in, if I write the code to read from the datastore for each request, I will be doing 20 identical reads, one for each tile image output. Since each read is the same query, it doesn't make sense to run the same query 20 times. In fact, this is very inefficient.
Can anyone suggest a better way to do this?
If I use memcache, I need to put the data into memcache, but 20 requests are arriving at the same time for that data, so with a naive implementation 20 processes will all be writing to memcache, since they are running in parallel.
I am programming in Go (version 1 beta) on Google App Engine; I refer to the Python docs here since they are more complete.
References:
Google Datastore: http://code.google.com/appengine/docs/python/datastore/overview.html
Leaflet JS (which I am using for showing map tiles): http://leaflet.cloudmade.com/
To clarify.
I generate the tile images from data in the database; that is, I query the database for the data (not the tile image itself), then draw the data into an image and render it as a JPEG. Go on App Engine is efficient at drawing images on the server side: http://blog.golang.org/2011/12/from-zero-to-go-launching-on-google.html
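The naive memcache approach I'm worried about would look roughly like this (sketched with the Python memcache API since those are the docs I'm reading; the MapData kind, the region key, and the query are made up for illustration):
from google.appengine.api import memcache
from google.appengine.ext import db

def map_rows(region_id):
    # hypothetical: all 20 tiles for one view share the same region of rows
    cache_key = 'maprows:%s' % region_id
    rows = memcache.get(cache_key)
    if rows is None:
        # every request that misses runs the same query and writes the same
        # value back, which is the duplicated work I want to avoid
        rows = db.GqlQuery("SELECT * FROM MapData WHERE region = :1",
                           region_id).fetch(1000)
        memcache.set(cache_key, rows, time=3600)
    return rows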
I don't know how Google App Engine does it, but MySQL has a query cache, so if the same query is asked twice in a row, it uses the results from the first to answer the second. Google is smart about things, so hopefully they do that as well. (You might be able to figure out whether they do by timing it.)
One thing you might need to make sure of is that the queries are exactly the same, not just returning the same results. For example, you don't want query1 to be
SELECT lat, lng FROM mytable WHERE tileX=1 AND tileY=1
and query2 to be
SELECT lat, lng FROM mytable WHERE tileX=1 AND tileY=2
I make tiles with gazillions of polygons, and when I did timing and optimization, I found to my surprise that it was faster to return ALL values and weed out the ones I didn't want in PHP than it was to add a WHERE clause to the SQL. I think that was partly because the WHERE clause was different for every tile, so the MySQL server couldn't cache effectively.
1. Organize tile entities so that you can find them via key instead of querying for them, i.e. using get() instead of query(). If you identify a tile based on several criteria, create a natural ID by combining those criteria. E.g. if you find a tile based on its vertical and horizontal position inside an image, you'd do: naturalID = imageID + verticalID + horizontalID (you can also add separators for better viewing).
2. Once you have your own unique IDs, you can use them to save the tile in Memcache.
3. If your tiles are immutable (= once created, their content does not change), then you can also cache them inside the instance in a global map (see the sketch below).
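A rough sketch of how the three points could fit together in Python (the Tile kind, its property, and the ID format are made up for illustration):
from google.appengine.api import memcache
from google.appengine.ext import db

class Tile(db.Model):
    # stored with key_name set to the natural ID, so it can be fetched with get()
    image = db.BlobProperty()

_tile_cache = {}  # point 3: in-instance cache, safe because tiles are immutable

def get_tile(image_id, vertical_id, horizontal_id):
    natural_id = '%s:%s:%s' % (image_id, vertical_id, horizontal_id)  # point 1
    tile = _tile_cache.get(natural_id)
    if tile is None:
        tile = memcache.get('tile:' + natural_id)            # point 2
        if tile is None:
            tile = Tile.get_by_key_name(natural_id)          # get(), not a query
            if tile is not None:
                memcache.set('tile:' + natural_id, tile)
        _tile_cache[natural_id] = tile
    return tile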
Edit: removed Objectify reference as I just realized you use python.
Edit2: added point 3.
A few things come to mind:
How do you query for the tiles? You should be able to get the tiles using get() by key, which is far more efficient than a query (sketched below).
Try to reduce the number of requests; using levels should bring it down to around 4 requests to retrieve the map.
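For example, if the tiles are stored under key names like "zoom:x:y" (an assumption, not something stated in the question), the whole batch can be fetched by key in one round trip:
from google.appengine.ext import db

def get_tiles(zoom, coords):
    # coords is a list of (x, y) pairs for the visible tiles
    keys = [db.Key.from_path('Tile', '%d:%d:%d' % (zoom, x, y)) for x, y in coords]
    return db.get(keys)  # one batch get by key instead of one query per tile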
Related
I have a lot of Entities containing geoPoints stored in Google's Datastore.
Now I need to get the 10 nearest locations based on a Location sent to a Google Cloud Function.
I saw that there is a distance() function in Google's App Engine, but nothing comparable in Google Cloud Functions, not even the possibility to calculate anything in the database.
Is it possible to get the 10 nearest locations from Datastore using only Google Cloud Functions, or do I need to use a different database for that?
Best Regards,
Pascal
We run a geospatial-heavy service on AppEngine.
Our solution is to store the locations in Memcache and do the calculations directly instead of relying on the database.
This obviously depends on the amount of locations, but if you are clever about how you store the locations, you can search very quickly.
R-Trees are a very good example: https://en.wikipedia.org/wiki/R-tree
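To make "doing the calculations directly" concrete, here is a minimal sketch assuming the locations are already sitting in memory (e.g. pulled from Memcache); the haversine formula is standard, the tuple layout is made up:
import math

def haversine_km(lat1, lng1, lat2, lng2):
    # great-circle distance between two lat/lng points, in kilometres
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def nearest(locations, lat, lng, n=10):
    # locations: list of (id, lat, lng) tuples loaded from the cache
    return sorted(locations, key=lambda p: haversine_km(lat, lng, p[1], p[2]))[:n]
This brute-force scan is O(n) per lookup, which is why a structure like an R-tree pays off once the number of locations grows.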
I had a similar need, and I solved it with a grid-based clustering scheme.
Essentially I created a Computed String Property that was the string concatenation of the latitude & longitude with the decimals chopped off.
If an entity had obj.latitude = 37.123456 & obj.longitude = 45.234567 then obj.grid_id="37:45"
When performing a search, I determine the grid of the search latitude & longitude as well as the 8 surrounding grids, and query for all entities that reside in those 9 grids.
# for search latitude = 37.456 & longitude = 45.67
query = SomeModel.query(SomeModel.grid_id.IN([
    '36:44', '36:45', '36:46',
    '37:44', '37:45', '37:46',
    '38:44', '38:45', '38:46',
]))
You would then find the 10 nearest in code.
Depending on your needs you may want to make grid ids include decimal positions (obj.grid_id="37.1:45.2") or make them less precise (obj.grid_id="30:40").
This may or may not work for you depending on the distribution of your data points, in which case Zeb's suggestion to use an R-Tree is more robust, but this was simple to implement and sufficed for my needs.
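For reference, the computed property described above could be defined roughly like this (assuming ndb, to match the query syntax above; truncating toward zero is just the simplest way to chop the decimals):
from google.appengine.ext import ndb

class SomeModel(ndb.Model):
    latitude = ndb.FloatProperty()
    longitude = ndb.FloatProperty()
    # 37.123456, 45.234567 -> "37:45"; recomputed on every put and indexed
    grid_id = ndb.ComputedProperty(
        lambda self: '%d:%d' % (int(self.latitude), int(self.longitude)))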
Please have a look at the following post:
Geospatial Query at Google App Engine Datastore
Unfortunately it is not possible to get the nearest location from Google Cloud Datastore itself. You have to implement your own logic or use a different database.
I'm creating an app where I will store users under all postalcodes/zipcodes they want to deliver to. The structure looks like this:
postalcodes/{{postalcode}}/{{userId}}=true
The reason for the structure is to easily fetch all users who deliver to a certain postal code.
ex. postalcodes/21121/
If every user applies to around 500 postal codes and the app has about 1000 users, it can become a lot of records:
500x1000 = 500000
Will Firebase easily handle that many records in data storage, or should I consider a different approach/solution? What are your thoughts?
Kind regards,
Elias
I'm quite sure Firebase can return 500k nodes without a problem.
The bigger concerns are how long that retrieval will take (especially in this mobile-first era) and what your application will show your user based on that many nodes.
A list with 500k rows is hardly useful, so most likely you'll show a subset of the data.
Say you just show the first screenful of nodes. How many nodes will that be? 20? So why would you already retrieve the other nodes in that case? I'd simply retrieve the nodes needed to build the first screen and load the rest on demand, when/if needed.
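For what it's worth, that "first screenful" retrieval might look roughly like this with the Firebase Admin SDK for Python (the database URL, credentials file, and page size are placeholders):
import firebase_admin
from firebase_admin import credentials, db

firebase_admin.initialize_app(
    credentials.Certificate('service-account.json'),
    {'databaseURL': 'https://your-project.firebaseio.com'})  # placeholder project

# only the first 20 users who deliver to postal code 21121, not all of them
first_page = db.reference('postalcodes/21121').order_by_key().limit_to_first(20).get()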
Alternatively, I could imagine you show a digest of the nodes (like a total number of nodes and maybe some averages per zip code area). You'd need all nodes to determine that digest, but I'd hardly consider it the task of a client application to determine the digest values. That's more of a server-side task. That server could use the same technology as the client apps (i.e. the JavaScript API), but it wouldn't be bothered (as much) by bandwidth and time constraints.
Just some ideas of how I would approach this, so ymmv.
I am working on a freelance project that captures an audio file, runs some Fourier analysis, and spits out three charts (x-y plots). Each chart has about 3,000 data points, which I plan to display with High Charts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. Plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny overall amount of data. In addition this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per row, will be fine, and that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties about your data is that the x,y coordinates are of a fixed size generally. In other words it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points, just a simple array of x,y points. I would see how big that document is and how my front end handles it; in other words, can High Charts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
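A sketch of that one-document-per-chart layout with pymongo (database, collection, and field names are placeholders); 3,000 small x/y pairs comes to roughly 100 KB per document, far below Mongo's 16 MB document limit:
from pymongo import MongoClient

client = MongoClient()            # assumes a local mongod for the demo
charts = client.demo_db.charts

# one document per chart, all ~3,000 points in a single array
charts.insert_one({
    'recording_id': 'sample-001',
    'chart': 'fourier-magnitude',
    'points': [{'x': i * 0.01, 'y': 0.0} for i in range(3000)],
})

doc = charts.find_one({'recording_id': 'sample-001', 'chart': 'fourier-magnitude'})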
Mongo would handle authentication well. I think it's a good choice for general data storage for an MVP.
Greetings Overflowers,
I'm wondering if there is a way to query some kind of a database and only fetch a certain window in the full result set without having to actually go through them all.
For example, if I query my database and I want only results number 100 to 200, would the database fetch all the results (say 0 to 1000) that match my query and later filter them to exclude anything outside my specified window?
Actually, I'm working on a full text search problem (not really relational db stuff).
So how about Google and other search engines: do they get the full result set and then filter, or do they have direct access to only the needed window?
Thank you all !
Your question is probably best answered in two parts.
For a database (traditional, relational), the query that is executed contains a number of "where" clauses, which cause the database engine to limit the number of results it returns. So if you specify a where clause that limits the results to between two values of the primary key,
SELECT * FROM table WHERE id > 99 AND id < 201;
you'll get what you're asking for.
For a search engine, a query you make to get the results will always paginate - using various techniques, all the results will be pre-split into pages and a few will be cached. Other pages will be generated on demand. So if you want pages 100-200 then you only ever fetch those that are needed.
The option to filter is not very efficient because large data sources never want to load all their data into memory and slice - you only want to load what's needed.
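To make that concrete, here is a toy sketch of the two ways to ask the database for just a window (sqlite3 purely as a stand-in; the same SQL shapes apply to most relational databases):
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for whatever database you use
conn.execute('CREATE TABLE results (id INTEGER PRIMARY KEY, body TEXT)')
conn.executemany('INSERT INTO results (body) VALUES (?)',
                 [('row %d' % i,) for i in range(1, 1001)])

# keyset style: the primary-key index lets the engine seek straight to the window
window = conn.execute('SELECT id, body FROM results WHERE id > ? AND id < ? ORDER BY id',
                      (99, 201)).fetchall()

# OFFSET style: simpler to write, but the engine still walks past the skipped
# rows, so it gets slower the deeper you page
window = conn.execute('SELECT id, body FROM results ORDER BY id LIMIT ? OFFSET ?',
                      (101, 99)).fetchall()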
So I have a User class:
class User(db.Model):
    points = db.IntegerProperty()
I created 1000 dummy entities on the development server with points ranging from 1 to 1000:
query = db.GqlQuery("SELECT * FROM User WHERE points >= 300"
"AND points <= 700"
"LIMIT 20"
"ORDER BY points desc")
I only want 20 results per query ( enough to fill a page). I don't need any pagination of the results.
Everything looks OK; it worked on the development server.
Question:
1. Will it work on a production server with 100,000 - 500,000 user entities? Will I experience great lag? I hope not, since I heard that App Engine indexes the points column automatically.
2. Any other optimization techniques that you can recommend?
I think it is difficult to say what kind of performance issues you will have with such a large number of entities. This one particular query will probably be fine, but you should be aware that no datastore query can ever return more than 1000 entities, so if you need to operate on numbers larger than 1000, you will need to do it in batches, and you may want to partition them into separate entity groups.
As far as optimization goes, you may want to consider caching the results of this query and only running it when you know the information has changed, or at specific intervals. If the query is for some purpose where exactly correct results are not critical (say, displaying a leaderboard or a high score list), you might choose to update and cache the result once every hour or something like that.
The only other optimization I can think of is that you can save the cycles associated with parsing that GQL statement by doing it once and saving the resulting object, either in memcache or a global variable.
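A sketch of what that caching could look like, combining the memcache idea with a GQL statement that is parsed once at module load (the cache key and the one-hour expiry are arbitrary choices):
from google.appengine.api import memcache
from google.appengine.ext import db

# parsed once when the module loads, per the note above
TOP_USERS_QUERY = db.GqlQuery("SELECT * FROM User "
                              "WHERE points >= 300 AND points <= 700 "
                              "ORDER BY points DESC")

def top_users():
    users = memcache.get('top_users')
    if users is None:
        users = TOP_USERS_QUERY.fetch(20)
        # slightly stale results are fine for a leaderboard, so cache for an hour
        memcache.set('top_users', users, time=3600)
    return users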
Your code seems fine for getting the top users, but more complex queries, like finding out the rank of a specific user, will be hard. If you need that kind of functionality too, have a look at google-app-engine-ranklist.
Ranklist is a Python library for Google App Engine that implements a data structure for storing integer scores and quickly retrieving their relative ranks.