Concurrency & Parallelism in AppEngine - google-app-engine

I am learning App Engine and have created a Spring-based application which has a controller for accepting all incoming requests. There is just one method in the controller, which is used to populate 5 tables in BigQuery. So, I have 5 separate methods to insert data into BigQuery, and I am calling each of these methods one at a time, sequentially, in my controller method. But I want to execute these 5 BQ methods in parallel, not in sequence. How can I achieve such parallelism in an App Engine app?

There are two different strategies you can use on GAE: concurrency and deferred processing. Both have a few flavours.
Concurrency
There are two basic flavours of this, relying on async APIs or creating background threads.
Most of the GAE platform APIs are asynchronous (or can be), and you can invoke several of them at once and then block until they've all resolved. In this case, you could make 5 asynchronous calls to BigQuery using the UrlFetchService.
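For illustration, here is a rough Python sketch of that pattern using the asynchronous urlfetch API (in Java, the URLFetchService's fetchAsync call plays the same role); the five endpoint URLs are placeholders:

from google.appengine.api import urlfetch

# Placeholder endpoints for the five BigQuery insert calls
urls = ['https://example.com/insert/table%d' % i for i in range(1, 6)]

rpcs = []
for url in urls:
    rpc = urlfetch.create_rpc(deadline=10)            # per-call deadline in seconds
    urlfetch.make_fetch_call(rpc, url, method=urlfetch.POST)
    rpcs.append(rpc)

# Block until all five calls have resolved
results = [rpc.get_result() for rpc in rpcs]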
GAE also allows the creation of background threads for the duration of a request. All threads must complete before the result is returned to the client. This is generally the least idiomatic approach for GAE.
Deferred processing
GAE offers two flavours of task queue, push and pull.
Push queues deliver each queued task to a URL you specify, at a rate you control. They can participate in transactions, have retry rules, etc., and can be used to ensure a workload is executed independently of the initiating request. This is the most idiomatic solution for the general problem of 'background work' on GAE.
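As a minimal sketch (Python; the worker URL and parameter are placeholders), enqueueing a push task looks like this, and App Engine then POSTs to that URL at the queue's configured rate, retrying on failure:

from google.appengine.api import taskqueue

# Enqueue a task; App Engine POSTs the params to /worker on the 'default' queue
taskqueue.add(url='/worker', params={'table': 'table1'}, queue_name='default')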
Pull queues wait for an initiating request to slurp some data out for processing, usually in bulk; that initiator is typically triggered by a cron job.
In your case, your best bet is to use async HTTP requests, unless you're using an SDK/API wrapper that doesn't expose them. If it doesn't, look to task queues. Almost any app you build will end up using them anyway, and they're graceful and simple to work with.

Related

Bulk Enqueue Google Cloud Tasks

As part of migrating my Google App Engine Standard project from python2 to python3, it looks like I also need to switch from using the Taskqueue API & Library to google-cloud-tasks.
In the taskqueue library I could enqueue up to 100 tasks at a time like this
taskqueue.Queue('default').add([...task objects...])
as well as enqueue tasks asynchronously.
In the new library as well as the new API, it looks like you can only enqueue tasks one at a time
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create
https://googleapis.dev/python/cloudtasks/latest/gapic/v2/api.html#google.cloud.tasks_v2.CloudTasksClient.create_task
I have an endpoint where it receives a batch with thousands of elements, each of which need to get processed in an individual task. How should I go about this?
According to the official documentation (reference 1, reference 2), adding tasks to queues asynchronously (as this post suggests for bulk-adding tasks to a queue) is NOT available via the Cloud Tasks API. It is available to users of the App Engine SDK, though.
However, the documentation does describe a workaround for adding a large number of Cloud Tasks to a queue: the double-injection pattern (this post may be useful too).
To implement this, you create a new injector queue, where a single task carries the information needed to add multiple (e.g. 100) tasks to the original queue you're using. On the receiving end of the injector queue is a service that does the actual addition of the intended tasks to your original queue. Although that service still adds tasks synchronously, one by one, it gives your main application an asynchronous interface for bulk-adding tasks, so you sidestep the synchronous, one-by-one task addition in the main application itself.
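As a rough illustration only (not an official example; the project, location, queue names, and the /inject and /worker handler paths are placeholders), the pattern might look something like this with the google-cloud-tasks client:

import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

def enqueue_injector(elements, project='my-project', location='us-central1'):
    # One injector task carrying up to ~100 elements in its payload.
    parent = client.queue_path(project, location, 'injector-queue')
    task = {
        'app_engine_http_request': {
            'http_method': tasks_v2.HttpMethod.POST,
            'relative_uri': '/inject',
            'body': json.dumps(elements).encode(),
        }
    }
    client.create_task(parent=parent, task=task)

def handle_inject(elements, project='my-project', location='us-central1'):
    # The /inject handler: fans the payload out into the original queue, one by one.
    parent = client.queue_path(project, location, 'original-queue')
    for element in elements:
        task = {
            'app_engine_http_request': {
                'http_method': tasks_v2.HttpMethod.POST,
                'relative_uri': '/worker',
                'body': json.dumps(element).encode(),
            }
        }
        client.create_task(parent=parent, task=task)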
Note that the 500/50/5 pattern of ramping up task additions is the suggested method for avoiding queue or target overload.
As I did not find any examples of this implementation, I will edit the answer as soon as I find one.
Since you are in the middle of a migration, I figured this link would be useful, as it covers migrating from Task Queues to Cloud Tasks (which you stated you are planning to do).
You can find additional details on migrating your code here and here, covering pull queues to Cloud Pub/Sub migration and push queues to Cloud Tasks migration respectively.
In order to recreate a batch pull mechanism, you would have to switch to Pub/Sub. Cloud Tasks does not have pull queues. With Pub/Sub you can batch push and batch pull messages.
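For example (a hedged sketch; the project, topic and subscription names are placeholders), the google-cloud-pubsub client batches publishes on the client side and can pull up to a requested number of messages at once:

from google.cloud import pubsub_v1

# Publisher side: messages are batched client-side up to the configured limits.
batch_settings = pubsub_v1.types.BatchSettings(max_messages=100, max_latency=0.05)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)
topic_path = publisher.topic_path('my-project', 'my-topic')
futures = [publisher.publish(topic_path, data=e.encode()) for e in ('a', 'b', 'c')]
for f in futures:
    f.result()  # wait for each publish to be acknowledged

# Subscriber side: synchronous batch pull, then acknowledge.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'my-subscription')
response = subscriber.pull(subscription=subscription_path, max_messages=100)
ack_ids = [m.ack_id for m in response.received_messages]
if ack_ids:
    subscriber.acknowledge(subscription=subscription_path, ack_ids=ack_ids)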
If you are using a push queue architecture, I would recommend passing those elements as the task payload; however, the maximum task size is limited to 100 KB.

Golang on App Engine Datastore - Using PutMulti to Improve Performance

I have a GAE Golang app that should be able to handle hundreds of concurrent requests, and for each request I do some work on the input and then store it in the datastore.
Using the task queue (appengine/delay lib) I am getting pretty good performance, but it still seems very inefficient to perform single-row inserts for each request (even though the inserts are deferred using task queue).
If this were not App Engine, I would probably append the output to a file, and every once in a while batch-load the file into the DB using a cron job or some other kind of scheduled service.
So my questions are:
Is there an equivalent scheme I can implement on App Engine? I was thinking - perhaps I should write some of the rows to memcache, and then every couple of seconds I would bulk-load all of the rows from there and purge the cache.
Is this really needed? Can the datastore handle thousands of concurrent writes - a write per HTTP request my app is getting?
It really depends on your setup. Are you using ancestor queries? If so, you are limited to about 1 write per second PER entity group (the ancestor and all of its children and grandchildren). The datastore has a natural queue, so if you write slightly too quickly it will queue the writes; it only becomes an issue if you write far too many, far too quickly. You can read some best practices here.
If you think you will be going over that limit, use a pull queue with async multi-puts. You would put each entity in the queue. With a backend module (10-minute timeouts) you can pull the entries in batches (10, 50, 100...) and do an async put on each batch. The datastore will handle putting them in at the proper speed, and while it's working you can queue up the next batch. Just be wary of the timeout.
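A rough Python sketch of that pattern (the same idea applies from Go); the queue name and Row model are placeholders:

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Row(ndb.Model):
    payload = ndb.TextProperty()

def drain_queue():
    q = taskqueue.Queue('rows-pull-queue')
    while True:
        tasks = q.lease_tasks(60, 100)           # lease up to 100 entries for 60 seconds
        if not tasks:
            break
        entities = [Row(payload=t.payload) for t in tasks]
        futures = ndb.put_multi_async(entities)  # non-blocking batch put
        # ...lease or prepare the next batch here while the put is in flight...
        ndb.Future.wait_all(futures)
        q.delete_tasks(tasks)                    # delete only once the put has succeeded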

Is there a way to know when a set of app engine task queue tasks have completed?

Is there a way to determine when a set of Google App Engine tasks (and the child tasks they spawn) have all completed?
Let's say that I have 100 tasks to execute and 10 of those spawn 10 child tasks each. That's 200 tasks. Let's also say that those child tasks might spawn more tasks, recursively, etc...
Is there a way to determine when all tasks have completed? I tried using the app engine pipeline API, but it doesn't look like it's going to work out for my particular use case, even though it is a great API.
My use case is that I want to make a whole bunch of rate limited URL fetch calls while concurrently writing to a blob. At the end of all the URL fetch calls, I want to finalize the blob.
I found the solution with the pipeline API, but it does so much writing to the datastore that it wouldn't be cost effective for me with how often I need to run the pipeline.
There's no way around writing to a persistent storage medium of some sort, and the datastore is the only game in town. You could write your own server to track completions using a backend, but that's an awful lot of overhead for a simple task. Using the pipeline API is your best bet.
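If you do roll your own tracking instead, a bare-bones sketch is a transactional completion counter in the datastore (the entity and function names here are hypothetical, and a busy job would need sharded counters to stay under the per-entity-group write limit):

from google.appengine.ext import ndb

class JobProgress(ndb.Model):
    expected = ndb.IntegerProperty(default=0)   # incremented whenever a task is spawned
    completed = ndb.IntegerProperty(default=0)  # incremented whenever a task finishes

@ndb.transactional
def task_finished(job_key):
    job = job_key.get()
    job.completed += 1
    job.put()
    return job.completed == job.expected        # True => last task; finalize the blob here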

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from on a regular schedule and without any user interaction. The total number of sites isn't going to be static; they're all saved in the datastore [Table0] alongside the interval at which they're read. The interval may vary from as often as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, I parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (it will have a special flag), I need to issue an additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline GAE places on web requests. For requests initiated by cron jobs and tasks, I believe you're allowed 10 minutes to complete the work, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently this library only had a mapper implementation, but Google has now also released an in-memory shuffler that is good for jobs of up to about 100 MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks; which context you execute them in is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
task grabs the first unworked item (URL) from the datastore, executes the HTTP request, operates on the data, updates the URL record as having been worked on, and then invokes the task URL again.
So cron is basically waking up the task queue periodically, and the task queue runs recursively until it reaches some stopping point (sketched below).
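A minimal sketch of that pattern in Python (the handler paths, queue name and the Site model are placeholders):

from google.appengine.api import taskqueue, urlfetch
from google.appengine.ext import ndb
import webapp2

class Site(ndb.Model):
    url = ndb.StringProperty()
    processed = ndb.BooleanProperty(default=False)

class CronKickoff(webapp2.RequestHandler):
    def get(self):
        # Cron hits this URL on its schedule; it only enqueues the first worker task.
        taskqueue.add(url='/tasks/process-next', queue_name='default')

class ProcessNext(webapp2.RequestHandler):
    def post(self):
        site = Site.query(Site.processed == False).get()
        if site is None:
            return                      # stopping point: nothing left to work on
        result = urlfetch.fetch(site.url, deadline=30)
        # ...operate on result.content and store whatever you need here...
        site.processed = True
        site.put()
        # Recurse: enqueue the next iteration of the same task URL.
        taskqueue.add(url='/tasks/process-next', queue_name='default')

app = webapp2.WSGIApplication([
    ('/cron/kickoff', CronKickoff),
    ('/tasks/process-next', ProcessNext),
])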
You can see it in action in one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.

Async urlfetch on App Engine

My app needs to do many datastore operations on each request. I'd like to run them in parallel to get better response times.
For datastore updates I'm doing batch puts so they all happen asynchronously which saves many milliseconds. App Engine allows up to 500 entities to be updated in parallel.
But I haven't found a built-in function that allows datastore fetches of different kinds to execute in parallel.
Since App Engine does allow urlfetch calls to run asynchronously, I created a getter URL for each kind which returns the query results as JSON-formatted text. Now my app can do async urlfetch calls to these URLs which could parallelize the datastore fetches.
This technique works well with small numbers of parallel requests, but App Engine throws errors when attempting to run more than 5 or 10 of these urlfetch calls at the same time.
I'm only testing now, so each urlfetch is the identical query; since they work fine in small volumes but start failing with more than a handful of simultaneous requests, I'm thinking it must have something to do with the async urlfetch calls.
My questions are:
Is there a limit to the number of urlfetch.create_rpc() calls that can run asynchronously?
The synchronous urlfetch.fetch() function has a 'deadline' parameter that will allow the function to wait up to 10 seconds for a response before failing. Is there any way to tell urlfetch.create_rpc() how long to wait for a response?
What do the errors shown below mean?
Is there a better server-side technique to run datastore fetches of different kinds in parallel?
File "/base/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 501, in get_result
return self.__get_result_hook(self)
File "/base/python_lib/versions/1/google/appengine/api/urlfetch.py", line 331, in _get_fetch_result
raise DownloadError(str(err))
InterruptedError: ('The Wait() request was interrupted by an exception from another callback:', DownloadError('ApplicationError: 5 ',))
Since App Engine allows async urlfetch calls but does not allow async datastore gets, I was trying to use urlfetch RPCs to retrieve from the datastore in parallel.
The lack of async datastore gets is an acknowledged issue:
http://code.google.com/p/googleappengine/issues/detail?id=1889
And there's now a third-party tool that allows async queries:
http://code.google.com/p/asynctools/
"asynctools is a library allowing you to execute Google App Engine API calls in parallel. API calls can be mixed together and queued up and then all are kicked off in parallel."
This is exactly what I was looking for.
While I am afraid that I can't directly answer any of the questions that you pose, I think that I ought to tell you that all of your research along these lines may not lead to you to a working solution for your problem.
The problem is that datastore writes take much longer than reads, so if you find a way to max out the number of reads that can happen, your code will very likely run out of time long before it is able to make the corresponding writes to all of the entities that you have read.
I would seriously consider rethinking the design of your datastore classes to reduce the number of reads and writes that need to happen, as this will quickly become a bottleneck for your application.
Have you considered using TaskQueues to do the work of queuing the requests to be executed later?
If the task returns a 4xx status it will be considered failed and will be retried later, so you could pass the error back up and have the task queue handle retrying the requests until they succeed. Also, with some experimentation with bucket sizes and rates, you can probably have the task queue slow the requests down enough that you don't max out the datastore.
There's also a nice wrapper (deferred.defer) which makes things even simpler - you can make a deferred call to (almost) any function in your app.
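For example (a small sketch; the function and queue names are placeholders), deferred.defer pickles the call onto a push queue, and the task machinery retries it if it raises:

from google.appengine.ext import deferred

def fetch_and_store(item_id, url):
    # ...do the urlfetch and the datastore write here; raising an exception triggers a retry...
    pass

# Anywhere in a request handler:
deferred.defer(fetch_and_store, 42, 'https://example.com/item/42',
               _queue='fetch-queue',   # optional: run on a dedicated, rate-limited queue
               _countdown=5)           # optional: delay execution by five seconds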
