Improve throughput of ndb query over large data - google-app-engine

I am trying to perform some data processing in a GAE application over data that is stored in the Datastore. The bottleneck point is the throughput in which the query returns entities and I wonder how to improve the query's performance.
What I do in general:
everything works in a task queue, so we have plenty of time (10 minute deadline).
I run a query over the ndb entities in order to select which entities need to be processed.
as the query returns results, I group entities in batches of, say, 1000 and send them to another task queue for further processing.
the stored data is going to be large (say 500K-1M entities) and there is a chance that the 10 minutes deadline is not enough. Therefore, when the task is reaching the taskqueue deadline, I spawn a new task. This means I need an ndb.Cursor in order to continue the query from where it stopped.
The problem is the rate in which the query returns entities. I have tried several approaches and observed the following performance (which is too slow for my app):
Use fetch_page() in a while loop.
The code is straightforward
while has_more and theres_more_time:
entities, cursor, more = query.fetch_page(1000, ...)
send_to_process_queue(entities)
has_more = more and cursor
With this approach, it takes 25-30 seconds to process 10K entities. Roughly speaking, that is 20K entities per minute. I tried changing the page size or the class of the frontend instance; neither made any difference in performance.
Segment the data and fire multiple fetch_page_async() in parallel.
This approach is taken from here (approach C)
The overall performance remains the same as above. I tried with various number of segments (from 2 to 10) in order to have 2-10 parallel fetch_async() calls. In all cases, the overall time remained the same. The more parallel fetch_page_async() are called, the longer it takes for each one to complete. I also tried with 20 parallel fetches and it got worse. Changing the page size or the fronted instance class did not have and impact either.
Fetch everything with a single fetch() call.
Now this is the least suitable approach (if not unsuitable at all) as the instance may run out of memory, plus I don't get a cursor in case I need to spawn to another task (in fact I won't even have the ability to do so, the task will simply exceed the deadline). I tried this out of curiosity in order to see how it performs and I observed the best performance! It took 8-10 seconds for 10K entities, which is roughly be 60K entities per minute. Now that is approx. 3 times faster than fetch_page(). I wonder why this happens.
Use query.iter() in a single loop.
This is match like the first approach. This will make use of the query iterator's underlying generator, plus I can obtain a cursor from the iterator in case I need to spawn a new task, so it suits me. With the query iterator, it fetched 10K entities in 16-18 seconds, which is approx. 36-40K entities per minute. The iterator is 30% faster than fetch_page, but much slower that fetch().
For all the above approaches, I tried F1 and F4 frontend instances without any difference in Datastore performance. I also tried to change the batch_size parameter in the queries, still without any change.
A first question is why do fetch(), fetch_page() and iter() behave so differently and how to make either fetch_page() or iter() do equally well as fetch()? And then another critical question is whether these throughputs (20-60K entities per minute, depending on api call) are the best we can do in GAE.
I 'm aware of the MapReduce API but I think it doesn't suit me. AFAIK, the MapReduce API doesn't support queries and I don't want to scan all the Datastore entities (it's will be too costly and slow - the query may return only a few results). Last, but not least, I have to stick to GAE. Resorting to another platform is not an option for me. So the question really is how to optimize the ndb query.
Any suggestions?

In case anyone is interested, I was able to significantly increase the throughput of the data processing by re-designing the component - it was suggested that I change the data models but that was not possible.
First, I segmented the data and then processed each data segment in a separate taskqueue.Task instead of calling multiple fetch_page_async from a single task (as I described in the first post). Initially, these tasks were processed by GAE sequentially utilizing only a single Fx instance. To achieve parallelization of the tasks, I moved the component to a specific GAE module and used basic scaling, i.e. addressable Bx instances. When I enqueue the tasks for each data segment, I explicitly instruct which basic instance will handle each task by specifying the 'target' option.
With this design, I was able to process 20.000 entities in total within 4-5 seconds (instead of 40'-60'!), using 5 B4 instances.
Now, this has additional costs because of the Bx instances. We 'll have to fine tune the type and number of basic instances we need.

The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like the Approach C in the other linked question).

Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normal form.
Design a solution from the output perspective - what the simplest query is that produces the required results, then what the entity structure is to support such a query, then what work is needed to create and maintain such an entity structure from the current data.

Related

Using QuerySplitter in Google Datastore to load chunks of a known size

I'd like to load lots of data from a Google Datastore table. For performance, I'd like to run, in parallel, a few queries that each loads a lot of objects. Cursors are not suitable for the parallel execution.
QuerySplitter is. However, for QuerySplitter you have to tell it how many splits you want, and what I care about is loading a certain number of objects. The number is chosen for the needs of my application, large but not not too large, say 800 objects. It's OK if the number of objects returned by each query is only very roughly the same; nothing worse would happen that different threads running different amounts of time.
How do I do this? I could query all objects keys-only in order to count them, and divide by 800. Is there a better way.
Querying all your entities (even keys only) might not scale so well, but you could run your query/ies periodically and save the counts in datastore or memcache, depending on how frequently you need to run your job.
However, to find all the entities of a given kind you can use the Datastore Statistics API which should be a lot quicker. I don't know how frequently the stats are updated but it's probably the same as the stats in the console.
If you are going to more frequent counts, or figures for filtered queries, you might consider sharded counters. Since you only need an approximate number, you could update them asynchronously on each new put.

How many rows can a class in Parse.com realistically handle?

I know that Parse.com has some sort of limit on the number of seconds a request/query can take. Correct me if I'm wrong. Cloud code should typically finish within 7 seconds, for example.
If I have a Parse class with 70 million entities, and I run a query, how long would that take? Is it realistic to expect Parse to work well for that kind of data?
Parse is built on MongoDB, so you should look into its limitations as they should be shared.
From experience (and yours may be different) Parse cannot handle .count or .find operations on a class whose size is above ~80 million, timeouts are basically guaranteed. Response times start to slow down noticeably around 70M, if not earlier. Most of the issues you'll experience depend on the complexity of the requests you're making. Keep it simple.
I believe one solution to these timeouts or slow responses is to .get or .fetch Objects instead of running .find or .first queries for them.
In some cases it is worth reconsidering your current schema or database structure because an "oversized data table issue" is not necessary specific to Parse. Best of luck!

CouchDB Views: How much processing is acceptable in map reduce?

I've been toying around with Map Reduce with CouchDB. Some of the examples show some possibly heavy logic within the map reduce functions. In one particular case, they were performing for loops within map.
Is map reduce run on every single possible document before it emits your selected documents?
If so, I would think that means that running any kind of iterative processing within the map reduce functions would increase processing burden by an order of magnitude, at least.
Basically it boils down to the following question: how much logic can be performed within map reduce before its an unreasonably expensive query?
Lots of expensive processing is acceptable in CouchDB map-reduce.
CouchDB views (map-reduce) are more like CREATE INDEX than they are SELECT FROM.
Specifically, CouchDB guarantees that a map function runs only once per document, ever. (Well, actually once per document change ever.) That is what the "iterative map-reduce" is.
Therefore, suppose you had 10,000 documents and they take 1 second each to process (which is way higher than I have ever seen). That is 10,000 seconds or 2.8 hours to completely build the view. However once the view is complete, querying any row (?key=...) or row slice (?startkey=...&endkey=...) takes the same time as querying for documents directly. Lookup time is O(log n) for the document count.
In other words, even if it takes 1 second per document to execute the map, it will take a few milliseconds to fetch the result. (Of course, the view must build first, since it is actually an index.)
Querying the db is an unrelated activity from the map/reduce of a document. Therefore the query cost is not impacted by the complexity of the map/reduce.
In couchdb you are querying an index. This means it is a copy of your data in a format optimized for query speed. A query is not like a tablescan in sql. It does not loop through records.
So how do you make this index? It is done through the map function. The map function emits a key and a value. The key is put in the index. Some complicated map functions that you mention may loop and emit many keys and values. Couchdb is smart and only runs a document when it needs to, usually on create, updates, and deletes. This is why it is incremental map/reduce.
So as you might see, a complicated map function might impact create, update, and delete speed. But again couchdb is smart in that you can specify how stale the data might be when you query the index.

Insert thousands entities in a reasonnable time into BigTable

I'm having some issues when I try to insert the 36k french cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:
import csv
from databaseModel import *
from google.appengine.ext.db import GqlQuery
def add_cities():
spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
mylist = []
for i in spamReader:
region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1", i[2].decode("utf-8"))
mylist.append(InseeCity(region=region.get(), name=i[11].decode("utf-8"), name_f=strip_accents(i[11].decode("utf-8")).lower()))
db.put(mylist)
It's taking around 5 minutes (!!!) to do it with the local dev server, even 10 when deleting them with db.delete() function.
When I try it online calling a test.py page containing add_cities(), the 30s timeout is reached.
I'm coming from the MySQL world and I think it's a real shame not to add 36k entities in less than a second. I can be wrong in the way to do it, so I'm refering to you:
Why is it so slow ?
Is there any way to do it in a reasonnable time ?
Thanks :)
First off, it's the datastore, not Bigtable. The datastore uses bigtable, but it adds a lot more on top of that.
The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There's two things you can do to speed things up:
Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case.
Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing regions is both a small list and infrequently changing, so there may be no need to store it in the datastore in the first place.
In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 using an offset value.
FWIW we process large CSV's into datastore using mapreduce, with some initial handling/ validation inside a task. Even tasks have a limit (10 mins) at the moment, but that's probably fine for your data size.
Make sure if you're doing inserts,etc. you batch as much as possible - don't insert individual records, and same for lookups - get_by_keyname allows you to pass in an array of keys. (I believe db put has a limit of 200 records at the moment?)
Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around, it's a must-have for larger data sets.
Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if i'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of say 10, then transactionally update the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Does the work indexes have something to do with the efficiency querying for all unapplied markers? i.e., it's better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!

Resources