Bulk Enqueue Google Cloud Tasks - google-app-engine

As part of migrating my Google App Engine Standard project from python2 to python3, it looks like I also need to switch from using the Taskqueue API & Library to google-cloud-tasks.
In the taskqueue library I could enqueue up to 100 tasks at a time like this:
taskqueue.Queue('default').add([...task objects...])
as well as enqueue tasks asynchronously.
In the new library as well as the new API, it looks like you can only enqueue tasks one at a time
https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create
https://googleapis.dev/python/cloudtasks/latest/gapic/v2/api.html#google.cloud.tasks_v2.CloudTasksClient.create_task
I have an endpoint that receives a batch with thousands of elements, each of which needs to be processed in an individual task. How should I go about this?

According to the official documentation (reference 1, reference 2), adding tasks to queues asynchronously (which this post suggests for adding tasks in bulk) is NOT available via the Cloud Tasks API. It is available to users of the App Engine SDK, though.
However, the documentation does mention a workaround for adding a large number of Cloud Tasks to a queue: the double-injection pattern (this post might be useful too).
To implement this, create a new injector queue in which a single task carries the information needed to add multiple (100) tasks to the original queue you are using. On the receiving end of the injector queue is a service that actually adds the intended tasks to your original queue. Although this service adds tasks synchronously, one by one, it gives your main application an asynchronous interface for adding tasks in bulk. This way you can overcome the limits of synchronous, one-by-one task addition in your main application.
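A minimal sketch of the double-injection flow described above, in Python. The queue names ("injector", "default"), the handler paths (/tasks/inject, /tasks/process) and all helper names are hypothetical. The sender is passed in as a plain callable so that the batching logic can be exercised without the client library; cloud_tasks_sender shows one way to back it with google-cloud-tasks, assuming App Engine HTTP targets.

```python
import json
from typing import Callable, Iterator, List

INJECTOR_QUEUE = "injector"  # hypothetical name of the new injector queue
WORKER_QUEUE = "default"     # hypothetical name of the original queue
BATCH_SIZE = 100             # items per injector task, as in the answer above


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def make_task(relative_uri: str, payload) -> dict:
    """Build an App Engine HTTP task as a plain dict (method defaults to POST)."""
    return {
        "app_engine_http_request": {
            "relative_uri": relative_uri,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode("utf-8"),
        }
    }


def enqueue_injectors(items: List[dict], create_task: Callable) -> int:
    """Main application: one create_task call per batch of 100 items."""
    count = 0
    for batch in chunked(items, BATCH_SIZE):
        create_task(INJECTOR_QUEUE, make_task("/tasks/inject", batch))
        count += 1
    return count


def handle_injector(body: bytes, create_task: Callable) -> int:
    """Injector handler: fan one batch out into individual worker tasks."""
    batch = json.loads(body)
    for item in batch:
        create_task(WORKER_QUEUE, make_task("/tasks/process", item))
    return len(batch)


def cloud_tasks_sender(project: str, location: str) -> Callable:
    """Return a create_task(queue, task) callable backed by google-cloud-tasks."""
    from google.cloud import tasks_v2  # lazy import: only needed in production

    client = tasks_v2.CloudTasksClient()

    def create_task(queue: str, task: dict):
        parent = client.queue_path(project, location, queue)
        return client.create_task(parent=parent, task=task)

    return create_task
```

With this split, an endpoint that receives 2,500 elements makes 25 create_task calls instead of 2,500, and the remaining one-by-one calls happen inside the injector handler, outside the original request. Each batch still has to fit within the task size limit.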
Note that the 500/50/5 pattern for adding tasks to a queue is the suggested way to avoid overloading the queue or its target.
As I did not find any examples of this implementation, I will edit the answer as soon as I find one.
Since you are in the middle of a migration, this link may be useful, as it covers migrating from Task Queue to Cloud Tasks (which you said you are planning to do).
You can find more details on migrating your code here and here, covering Pull queues to Cloud Pub/Sub Migration and Push queues to Cloud Tasks Migration, respectively.

In order to recreate a batch pull mechanism, you would have to switch to Pub/Sub. Cloud Tasks does not have pull queues. With Pub/Sub you can batch push and batch pull messages.
If you are using a push queue architecture, I would recommend passing those elements as the task payload; however, the maximum task size is limited to 100 KB.
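As a rough illustration of passing elements in the task payload while staying under a size limit, the sketch below greedily packs JSON-serializable elements into request bodies. The function name and the default limit are assumptions taken from the figure quoted above; the limit applies to the whole task, not only the body, so in practice some headroom should be left for headers and other fields.

```python
import json
from typing import Iterable, Iterator, List

MAX_BODY_BYTES = 100 * 1024  # figure quoted above; verify against current quotas


def pack_payloads(items: Iterable, max_bytes: int = MAX_BODY_BYTES) -> Iterator[bytes]:
    """Greedily pack items into JSON array bodies no larger than max_bytes."""
    encoded: List[bytes] = []
    size = 2  # the surrounding "[" and "]"
    for item in items:
        blob = json.dumps(item, separators=(",", ":")).encode("utf-8")
        if 2 + len(blob) > max_bytes:
            raise ValueError("a single item exceeds the size limit")
        extra = len(blob) + (1 if encoded else 0)  # +1 for the separating comma
        if size + extra > max_bytes:
            # Current body is full: emit it and start a new one with this item.
            yield b"[" + b",".join(encoded) + b"]"
            encoded, size, extra = [], 2, len(blob)
        encoded.append(blob)
        size += extra
    if encoded:
        yield b"[" + b",".join(encoded) + b"]"
```

Each yielded body can then be used as the body of one task, and the handler recovers the elements with json.loads.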

Related

Difference between TaskQueue and MapReduce in Google App Engine

I have read the docs about taskqueue and push queues in GAE, which are used to create long-running tasks.
I am unsure why MapReduce was needed. Both do their processing in the background, so what are the main differences between them?
Can someone please explain this?
Edit: I guess I was comparing apples with monkeys! Hadoop and MapReduce are related, and GAE is a backend framework.
You are getting confused with two entirely different things altogether.
The MapReduce paradigm is all about distributed, parallel processing of very large amounts of data.
TaskQueue is a scheduler, which can schedule a task to execute at, say, a certain time. It is just a scheduler, like Unix cron jobs.
Please take note of the bold and italic words in the above statements to see the difference.
From the definition of TaskQueue
Task queues let applications perform work, called tasks,
asynchronously outside of a user request. If an app needs to execute
work in the background, it adds tasks to task queues. The tasks are
executed later, by worker services.
By definition, TaskQueue works outside of a user request, meaning there is no actual user request to perform a task (it was simply submitted or scheduled sometime in the past). MapReduce programs are submitted by users to execute, though you may use TaskQueue to schedule one in the future.
You are probably getting confused by words like task, queue, and scheduling being used in the MapReduce world. These are generic terms, so the concepts may look similar, but they are definitely not the same.

Google AppEngine use Google Compute for Task?

I need to run some specific code that can't be run on Google AppEngine (because of restrictions).
Since these workers are asynchronous, I thought about launching a Compute instance every time I need one and connecting it to Google AppEngine through specific tasks on the Task Queue, but I cannot find any documentation on whether this is possible.
TL;DR: Is it possible to specify a Google Compute instance as the target for a Task queue?
No, there is no way to specify a Google Compute instance as the target for a Task queue.
But did you consider using the Flexible environment instead (possibly with a custom runtime to try to address the restrictions)? Or the alternatives suggested for the Flexible environment, which has only limited task queue support? From Task Queue:
The Task Queue service has limited availability outside of the
standard environment. If you want to use the service outside of the
standard environment, you can sign up for the Cloud Tasks alpha.
Outside of the standard environment, you can't add tasks to push
queues, but a service running in the flexible environment can be
the target of a push task. You can specify this using the
target parameter when adding a task to queue or by specifying
the default target for the queue in queue.yaml.
In many cases where you might use pull queues, such as queuing up
tasks or messages that will be pulled and processed by separate
workers, Cloud Pub/Sub can be a good alternative as it offers
similar functionality and delivery guarantees.

Concurrency & Parallelism in AppEngine

I am learning App Engine and have created a Spring-based application which has a controller for accepting all incoming requests. There is just one method in the controller, which will be used to populate 5 tables in BigQuery. So, I have 5 separate methods to insert data into BigQuery. I am calling each of these methods one at a time, sequentially, in my controller method. But I want to execute these 5 BQ methods in parallel, not in sequence. How can I achieve such parallelism in an App Engine app?
There are two different strategies you can use on GAE: concurrency and deferred approaches. Both have a few flavours.
Concurrency
There are two basic flavours of this, relying on async APIs or creating background threads.
Most of the GAE platform APIs are asynchronous (or can be), so you can invoke several of them at once and then block until they've all resolved. In this case, you could make 5 asynchronous calls to BigQuery using the URLFetchService.
GAE also allows the creation of background threads for the duration of a request. All threads must complete before the result is returned to the client. This is generally the least idiomatic approach for GAE.
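The "start everything, then block until all have resolved" idea can be sketched in Python with a thread pool. This is only an illustration of the pattern, not of the Java APIs mentioned above: make_insert and the table names are hypothetical stand-ins for the five BigQuery insert methods.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_all(jobs: Sequence[Callable[[], object]]) -> List[object]:
    """Run zero-argument callables concurrently; return results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(job) for job in jobs]
        # result() blocks until that job is done and re-raises its exception.
        return [future.result() for future in futures]


def make_insert(table: str) -> Callable[[], str]:
    """Hypothetical stand-in for one of the five BigQuery insert methods."""
    def insert() -> str:
        return "inserted into " + table
    return insert


TABLES = ["table_1", "table_2", "table_3", "table_4", "table_5"]
results = run_all([make_insert(table) for table in TABLES])
```

All threads finish before run_all returns, which matches the requirement that background threads complete before the response is sent.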
Deferred processing
GAE offers two flavours of task queue, push and pull.
Push queues are basically queues of tasks, each executed by a specified URL at a rate you control. They can participate in transactions and have retry rules, etc. They can be used to ensure a workload is executed, but independently of the initiating request. This is the most idiomatic solution for the general problem of 'background work' on GAE.
Pull queues are queues that wait for an initiating request to slurp some data out for processing, usually in bulk. They're typically triggered by cron jobs.
In your case, your best bet is to use async HTTP requests, unless you're using an SDK/API wrapper that doesn't expose them. In that case, look to task queues. Almost any app you build will end up using them anyway, and they're very graceful and simple to comprehend.

How to monitor Google App Engine Queue size over time?

The Google App Engine developer console allows you to easily monitor the instantaneous queue size for an app. How can you simply view queue size over time?
For context: the backend process of our application runs through a fairly restrictive queue, as front-end availability is a priority (and it's currently a free app). What I'd like to monitor is the size of the task queue over time, which would give me a good proxy for the backlog of work.
I could set up a process just to log this directly, and then a separate page to graph it; however, this seems a little involved for something that may already be easily available either as a graph or at least as a queryable data series direct from App Engine.
Thanks to #tx802 for help with this answer:
It's not currently simple to view these metrics. However, the process for setting them up is:
Set up a simple cron job to read the QueueStatistics object for the given queue on whatever time basis is interesting (I chose every 5 minutes).
Use the Custom Metrics function to store the value as a custom metric which you can then pull up in the Cloud Monitoring Dashboard.

Is there a way to know when a set of app engine task queue tasks have completed?

Is there a way to determine when a set of Google App Engine tasks (and the child tasks they spawn) have all completed?
Let's say that I have 100 tasks to execute and 10 of those spawn 10 child tasks each. That's 200 tasks. Let's also say that those child tasks might spawn more tasks, recursively, etc...
Is there a way to determine when all tasks have completed? I tried using the app engine pipeline API, but it doesn't look like it's going to work out for my particular use case, even though it is a great API.
My use case is that I want to make a whole bunch of rate limited URL fetch calls while concurrently writing to a blob. At the end of all the URL fetch calls, I want to finalize the blob.
I found the solution with the pipeline API, but it does so much writing to the datastore that it wouldn't be cost effective for me with how often I need to run the pipeline.
There's no way around writing to a persistent storage medium of some sort, and the datastore is the only game in town. You could write your own server to track completions using a backend, but that's an awful lot of overhead for a simple task. Using the pipeline API is your best bet.
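To make the bookkeeping behind "tracking completions" concrete, here is a minimal in-memory sketch of a pending-task counter. It is an assumption-laden illustration only: on App Engine the tasks run in separate requests and instances, so the counter would have to live in persistent storage (as the answer says) and be updated transactionally. CompletionTracker and its method names are hypothetical.

```python
import threading
from typing import Callable


class CompletionTracker:
    """Counts outstanding tasks and fires a callback when the count hits zero."""

    def __init__(self, on_done: Callable[[], None]) -> None:
        self._pending = 0
        self._lock = threading.Lock()  # stands in for a datastore transaction
        self._on_done = on_done

    @property
    def pending(self) -> int:
        with self._lock:
            return self._pending

    def spawned(self, count: int = 1) -> None:
        """Register tasks BEFORE they are enqueued."""
        with self._lock:
            self._pending += count

    def finished(self) -> None:
        """Mark one task as complete; call this last in the task."""
        with self._lock:
            self._pending -= 1
            done = self._pending == 0
        if done:
            self._on_done()  # e.g. finalize the blob
```

The ordering matters: a task must register its children with spawned() before calling finished() for itself, otherwise the counter can reach zero while work is still being scheduled.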
