My app needs to do many datastore operations on each request. I'd like to run them in parallel to get better response times.
For datastore updates I'm doing batch puts so they all happen asynchronously which saves many milliseconds. App Engine allows up to 500 entities to be updated in parallel.
But I haven't found a built-in function that allows datastore fetches of different kinds to execute in parallel.
Since App Engine does allow urlfetch calls to run asynchronously, I created a getter URL for each kind which returns the query results as JSON-formatted text. Now my app can do async urlfetch calls to these URLs which could parallelize the datastore fetches.
This technique works well with small numbers of parallel requests, but App Engine throws errors when attempting to run more than 5 or 10 of these urlfetch calls at the same time.
I'm only testing now, so each urlfetch is the identical query; since they work fine in small volumes but start failing with more than a handful of simultaneous requests, I'm thinking it must have something to do with the async urlfetch calls.
My questions are:
Is there a limit to the number of urlfetch.create_rpc() calls that can run asynchronously?
The synchronous urlfetch.fetch() function has a 'deadline' parameter that will allow the function to wait up to 10 seconds for a response before failing. Is there any way to tell urlfetch.create_rpc() how long to wait for a response?
What do the errors shown below mean?
Is there a better server-side technique to run datastore fetches of different kinds in parallel?
  File "/base/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 501, in get_result
    return self.__get_result_hook(self)
  File "/base/python_lib/versions/1/google/appengine/api/urlfetch.py", line 331, in _get_fetch_result
    raise DownloadError(str(err))
InterruptedError: ('The Wait() request was interrupted by an exception from another callback:', DownloadError('ApplicationError: 5 ',))
Since App Engine allows async urlfetch calls but does not allow async datastore gets, I was trying to use urlfetch RPCs to retrieve from the datastore in parallel.
The lack of async datastore gets is an acknowledged issue:
http://code.google.com/p/googleappengine/issues/detail?id=1889
And there's now a third-party tool that allows async queries:
http://code.google.com/p/asynctools/
"asynctools is a library allowing you to execute Google App Engine API calls in parallel. API calls can be mixed together and queued up and then all are kicked off in parallel."
This is exactly what I was looking for.
While I am afraid that I can't directly answer any of the questions that you pose, I think that I ought to tell you that your research along these lines may not lead you to a working solution for your problem.
The problem is that datastore writes take much longer than reads, so if you find a way to max out the number of reads that can happen, your code will very likely run out of time long before it is able to make corresponding writes to all of the entities that you have read.
I would seriously consider rethinking the design of your datastore classes to reduce the number of reads and writes that need to happen, as this will quickly become a bottleneck for your application.
Have you considered using TaskQueues to do the work of queuing the requests to be executed later?
If the task returns a 4xx status it will be considered failed and will be retried later, so you could pass the error back up and have the task queue handle retrying the requests until they succeed. Also, with some experimentation with bucket sizes and rates, you can probably have the Task Queue slow down the requests enough that you don't max out the database.
There's also a nice wrapper (deferred.defer) which makes things even simpler - you can make a deferred call to (almost) any function in your app.
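To make the retry behaviour concrete, here is a minimal sketch using only the Python standard library. It is not the App Engine Task Queue API: `run_queue`, `flaky_fetch`, and `max_attempts` are hypothetical names, a `deque` stands in for the queue, and a raised exception plays the role of a failing status code.

```python
from collections import deque

def run_queue(tasks, max_attempts=3):
    """Run each (name, func) task; re-enqueue failures until max_attempts."""
    pending = deque((name, func, 1) for name, func in tasks)
    results, failed = {}, []
    while pending:
        name, func, attempt = pending.popleft()
        try:
            results[name] = func()
        except Exception:
            if attempt < max_attempts:
                # Put the task at the back of the queue to be retried later.
                pending.append((name, func, attempt + 1))
            else:
                failed.append(name)
    return results, failed

# Hypothetical task that fails twice before succeeding.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("datastore busy")
    return "ok"

results, failed = run_queue([("fetch", flaky_fetch), ("other", lambda: 42)])
```

On App Engine itself the queue does this bookkeeping for you; with the deferred library the call would look roughly like `deferred.defer(some_function, some_arg)`, where `some_function` is whatever function in your app you want to run later.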
Related
I am learning App Engine and have created a Spring-based application which has a controller for accepting all incoming requests. There is just one method in the controller, which will be used to populate 5 tables in BigQuery. So, I have 5 separate methods to insert data into BigQuery. I am calling each of these methods one at a time, sequentially, in my controller method. But I want to execute these 5 BQ methods in parallel, not in sequence. How can I achieve such parallelism in an App Engine app?
There are two different strategies you can use on GAE: concurrency and deferred approaches. Both have a few flavours.
Concurrency
There are two basic flavours of this, relying on async APIs or creating background threads.
Most of the GAE platform APIs are asynchronous (or can be) and you can invoke multiple of them at once then block until they've all resolved. In this case, you could make 5 asynchronous calls to BigQuery using the UrlFetchService.
GAE also allows the creation of background threads for the duration of a request. All threads must complete before the result is returned to the client. This is generally the least idiomatic approach for GAE.
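As a rough illustration of the thread flavour, here is a sketch in Python (the question itself is about Java, so treat this as the shape of the idea only). `insert_into_table` and the table names are hypothetical stand-ins for the five BigQuery insert methods, and the sleep simulates network latency.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def insert_into_table(table_name):
    """Hypothetical stand-in for one BigQuery insert call."""
    time.sleep(0.1)  # simulate network latency
    return table_name + ": inserted"

TABLES = ["table_a", "table_b", "table_c", "table_d", "table_e"]

def insert_all(tables):
    # Start every insert at once, then block until all of them have finished.
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        futures = [pool.submit(insert_into_table, t) for t in tables]
        # result() re-raises any exception from the worker thread.
        return [f.result() for f in futures]

statuses = insert_all(TABLES)
```

Because all five calls are in flight together, the request waits roughly as long as the slowest insert rather than the sum of all five.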
Deferred processing
GAE offers two flavours of task queue, push and pull.
Push queues are basically queued tasks executed by a specified URL at a rate you control. They can participate in transactions and have retry rules, etc. They can be used to ensure a workload is executed, but independently of the initiating request. This is the most idiomatic solution for the general problem of 'background work' on GAE.
Pull queues are queues that wait for an initiating request to slurp some data out for processing, usually in bulk. They're triggered by cron jobs typically.
In your case, your best bet is to use async http requests, unless you're using an SDK/API wrapper that doesn't expose this. If not, look to task queues. Almost any app you build will end up using them anyway, and they're very graceful and simple to comprehend.
I am using Objectify in my Google Cloud Endpoints module. My endpoint project handles most of my datastore read and write ops, but I wanted to know whether it is an efficient design practice to use task queues to wrap a read or write operation on the datastore in Google App Engine.
All the data necessary for a task execution has to be written somewhere, and App Engine persists this data in a task queue backed by the same Datastore. Unless your write operation involves number crunching, URL fetching, external API calls, updates of hundreds of entities, or some other expensive logic, there is no advantage to wrapping a write call in a task.
Wrapping read calls in tasks is impossible in most cases, as you lose the ability to return this data in the same call.
Consider using a write-behind cache if you want to speed up your writes. There's a small chance that you will lose data, but you will dramatically improve the write speed (as seen by the user).
The idea is to write the entity only into memcache first, so the user does not wait for the actual datastore write, and then have a task queue task or cron job pick up the cached entity and write it into the datastore.
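A minimal sketch of that write-behind flow, with plain dictionaries standing in for memcache and the datastore. `save_entity` and `flush_pending` are hypothetical names, not App Engine APIs.

```python
cache = {}       # stand-in for memcache
datastore = {}   # stand-in for the datastore

def save_entity(key, entity):
    """Fast path seen by the user: write to the cache only."""
    cache[key] = entity

def flush_pending():
    """Run from a task or cron job: persist cached entities in one batch."""
    pending = dict(cache)            # snapshot of what is waiting
    datastore.update(pending)        # one batch write instead of many
    for key in pending:
        cache.pop(key, None)         # purge only what was persisted
    return len(pending)

save_entity("user:1", {"name": "Ann"})
save_entity("user:2", {"name": "Bob"})
flushed = flush_pending()
```

The data-loss risk mentioned above sits between `save_entity` and `flush_pending`: anything evicted from the cache before the flush runs never reaches the datastore.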
I have a GAE Golang app that should be able to handle hundreds of concurrent requests, and for each requests, I do some work on the input and then store it in the datastore.
Using the task queue (appengine/delay lib) I am getting pretty good performance, but it still seems very inefficient to perform single-row inserts for each request (even though the inserts are deferred using task queue).
If this were not App Engine, I would probably append the output to a file, and every once in a while I would batch load the file into the DB using a cron job or some other kind of scheduled service.
So my questions are:
Is there an equivalent scheme I can implement on App Engine? I was thinking that perhaps I should write some of the rows to memcache, and then every couple of seconds bulk load all of the rows from there and purge the cache.
Is this really needed? Can the datastore handle thousands of concurrent writes, one write per HTTP request my app receives?
It really depends on your setup. Are you using ancestor queries? If so, then you are limited to 1 write per second PER ancestor (and all its children and grandchildren). The datastore has a natural queue, so if you try to write too quickly it will queue the writes. It only becomes an issue if you are writing far too many entities far too quickly. You can read some best practices here.
If you think you will be going over that limit, use a pull queue with async multi puts. You would put each entity in the queue. With a backend module (10 minute timeouts) you can pull in the entries in batches (10, 50, 100...) and do a put_async on them in batches. It will handle putting them in at the proper speed. While it's working you can queue up the next batch. Just be wary of the timeout.
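The batching loop could be sketched like this. It is a standard-library illustration only: a `deque` stands in for the pull queue, `stored.extend` stands in for the batch put, and `lease_batch` and `drain` are hypothetical names.

```python
from collections import deque

def lease_batch(queue, batch_size):
    """Take up to batch_size items off the front of the queue."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch

def drain(queue, batch_size, put_multi):
    """Lease and write batches until the queue is empty."""
    batches_written = 0
    while True:
        batch = lease_batch(queue, batch_size)
        if not batch:
            return batches_written
        put_multi(batch)             # one write call per batch
        batches_written += 1

stored = []                          # stand-in for the datastore
queue = deque({"id": i} for i in range(125))
batches = drain(queue, 50, stored.extend)
```

With 125 queued entities and a batch size of 50, this makes three write calls instead of 125. A real worker would also need to stop before its request deadline and leave the rest for the next run.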
I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from on a regular schedule and without any user interaction. The total number of sites isn't going to be static, but they're all saved in the datastore [Table0] alongside the interval at which they're read. The interval may vary from every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue an additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.
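For what it's worth, the schedule check in step 2 is cheap and could look something like this sketch. The field names `interval_days` and `last_pulled` are assumptions about how [Table0] might be laid out, not something stated in the question, and the sites are made up.

```python
from datetime import datetime, timedelta

def is_due(site, now):
    """A site is due if it was never pulled or its interval has elapsed."""
    last = site["last_pulled"]
    return last is None or now - last >= timedelta(days=site["interval_days"])

now = datetime(2024, 1, 31)
sites = [
    {"url": "http://a.example", "interval_days": 1, "last_pulled": datetime(2024, 1, 30)},
    {"url": "http://b.example", "interval_days": 30, "last_pulled": datetime(2024, 1, 15)},
    {"url": "http://c.example", "interval_days": 7, "last_pulled": None},
]

# Only the sites whose interval has elapsed go on to the HTTP GET step.
due = [s["url"] for s in sites if is_due(s, now)]
```

The expensive part is the fetching in steps 2 to 4, which is where the deadline question really applies.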
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
Task grabs the first item (URL) from the datastore, executes the HTTP request, operates on the data, updates the URL record as having been worked on, and then invokes the task URL again.
So cron is basically waking up taskqueue periodically and taskqueue runs recursively until it reaches some stopping point.
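Here is a small standard-library sketch of that chain, to show the control flow only. A `deque` stands in for the task queue, a dictionary stands in for the URL records in the datastore, and `cron_handler`, `process_next`, and `run_tasks` are hypothetical names.

```python
from collections import deque

task_queue = deque()   # stand-in for the task queue
# Stand-in for the datastore: URL -> "already worked on" flag.
urls = {"http://a.example": False, "http://b.example": False, "http://c.example": False}

def process_next():
    """Task handler: work on one unprocessed URL, then enqueue itself again."""
    for url, done in urls.items():
        if not done:
            urls[url] = True                  # mark the record as worked on
            task_queue.append(process_next)   # chain the next task
            return url
    return None                               # stopping point: nothing left

def cron_handler():
    """Cron wakes the chain up by enqueuing the first task."""
    task_queue.append(process_next)

def run_tasks():
    """Stand-in for the queue executing tasks until none remain."""
    processed = []
    while task_queue:
        result = task_queue.popleft()()
        if result is not None:
            processed.append(result)
    return processed

cron_handler()
processed = run_tasks()
```

Each task does one small unit of work, so no single request comes near the deadline, and the chain ends on its own when a task finds nothing left to do.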
You can see it in action in one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.
I'm trying to grasp the concept of the RPC on App Engine. When or why would I need to use one, and what are the benefits?
Do they help with staying within your quota?
Are they more efficient?
When you use the datastore, memcache, URL Fetch, or many of the other services, you are implicitly creating and using an RPC.
Some methods take an optional RPC argument. You can create an RPC with custom settings, such as a deadline, to give you more control over the call. An example of when setting a deadline on datastore operations can be useful is deferring a write to the task queue on a timeout type failure. Setting a lower deadline will ensure you have enough time to try again or insert a task.
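The deadline-then-defer pattern can be sketched with the standard library. This is an analogy, not the App Engine RPC API: `put_with_deadline` and `slow_put` are hypothetical names, a list stands in for the task queue, and unlike a real RPC deadline the timed-out call here is not cancelled, it simply stops being waited on.

```python
import concurrent.futures
import time

deferred_writes = []   # stand-in for tasks added to a queue

def slow_put(entity):
    """Hypothetical write that takes longer than the deadline allows."""
    time.sleep(0.5)
    return "stored"

def put_with_deadline(pool, put, entity, deadline):
    """Try the write; if it exceeds the deadline, defer it instead."""
    future = pool.submit(put, entity)
    try:
        return future.result(timeout=deadline)
    except concurrent.futures.TimeoutError:
        # Still time left in the request, so hand the write to the queue.
        deferred_writes.append(entity)
        return "deferred"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    fast = put_with_deadline(pool, lambda e: "stored", {"id": 1}, deadline=1.0)
    slow = put_with_deadline(pool, slow_put, {"id": 2}, deadline=0.05)
```

The point of the short deadline is the same as in the answer above: fail early enough that the request still has time to fall back to something else.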
rpc on AppEngine is useful when you want to do a URL fetch and you want to do other things while you're waiting for the response to be completed.
Let's say your URL fetch will take 1 second to complete and you have 'other' processing to do for 1 second, which you can do while you're waiting. You can launch an RPC call, do the 'other' processing, and when the RPC fetch is finished you can continue the request. The request will take a total of 1 second (plus overhead) with an RPC, as opposed to the conventional approach, which would take 2 seconds.
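The overlap can be demonstrated with a standard-library sketch. This uses a thread as a stand-in for the async RPC; on App Engine the corresponding calls would be `urlfetch.create_rpc()`, `urlfetch.make_fetch_call(rpc, url)`, and `rpc.get_result()`. `fetch_url` and `other_processing` are hypothetical, and the sleeps are shortened to 0.3 seconds each.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_url(url):
    """Hypothetical slow fetch; the sleep simulates waiting on the network."""
    time.sleep(0.3)
    return "response from " + url

def other_processing():
    """Hypothetical local work that does not need the fetch result."""
    time.sleep(0.3)
    return "processed"

start = time.time()
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_url, "http://example.com/")  # like starting the RPC
    local = other_processing()       # runs while the fetch is in flight
    response = pending.result()      # like get_result(): block until done
elapsed = time.time() - start
```

Run sequentially, the two steps would take about 0.6 seconds; overlapped, `elapsed` should come out close to 0.3 seconds plus overhead.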