Google App Engine - Task dependency

In my application, I have a long task, so I split it into n smaller tasks. After these n tasks complete, one more task must be performed, and it depends on the results of those n tasks. How do I achieve this dependency with the Task API, i.e. perform one task after n other tasks?

I think there are two methods that can solve this problem.
Suppose the task TD depends on n other tasks TA, and there is a queue Q.
1. Push the n TA tasks into queue Q. When each TA finishes, it checks whether it is the last one in queue Q. If a TA is the last task in queue Q, it pushes TD to queue Q.
2. Push the n TA tasks and TD to queue Q. When TD runs, it checks whether all TA tasks have finished. If any TA is unfinished, TD cancels its execution by returning any HTTP status code outside the range 200-299.
The key to both methods is getting the number of tasks in queue Q. Although I haven't tried it, I know there is a Python API that provides an experimental method to get the TaskQueue resource of a specific queue. The stats.totalTasks property is the total number of tasks in the queue.
Please see http://code.google.com/intl/en/appengine/docs/python/taskqueue/rest.html
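The first method can be sketched as a small in-memory simulation. This is only an illustration under assumptions: the deque stands in for queue Q, the squared value is a placeholder for the real work, and run_queue is a hypothetical name, not part of any App Engine API.

```python
from collections import deque


def run_queue(n):
    """Simulate queue Q: n TA tasks, then TD once the last TA finishes."""
    queue = deque(("TA", i) for i in range(n))
    results = {}
    log = []
    while queue:
        kind, i = queue.popleft()
        if kind == "TA":
            results[i] = i * i  # placeholder for the real work of a TA task
            log.append("TA%d" % i)
            # "Am I the last TA in Q?" means no other TA is still waiting.
            if not any(k == "TA" for k, _ in queue):
                queue.append(("TD", None))
        else:
            # TD can now rely on the results of every TA.
            log.append("TD sum=%d" % sum(results.values()))
    return log
```

This simulation runs tasks one at a time, so it sidesteps the races that concurrently running tasks could introduce when they inspect the queue size.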

Take a look at the GAE Pipeline API; it is used to build complex task workflows like the one you described.

Yet another approach could be to start by adding all tasks to the queue. Have the N initial tasks log info to the datastore upon completion, in some manner that allows you to query the datastore to see if they have all run.
When the dependent task runs, it performs this datastore query to see if its conditions are met (checks that all initial tasks have logged that they are finished). If not, it needs to run later.
To accomplish this, the dependent task could add a copy of itself to the queue, scheduled to run after some given time interval. Alternatively (as in the answer above), the dependent task could terminate with an error status code, in which case it will be automatically retried at some later point, as long as the retry limit (task_retry_limit) for the queue or the task is not exceeded.
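A minimal sketch of the check described above, assuming a plain Python set stands in for the completion records in the datastore. The function names and the choice of 503 as the failure status are illustrative assumptions.

```python
finished = set()  # stand-in for per-task completion records in the datastore


def run_initial_task(task_id):
    """Do the work of one initial task, then log its completion."""
    # ... the real work would go here ...
    finished.add(task_id)


def run_dependent_task(expected_ids):
    """Return an HTTP-style status: 200 if the task ran, 503 to ask for a retry."""
    missing = set(expected_ids) - finished
    if missing:
        # Any status outside 200-299 makes the queue retry the task later.
        return 503
    # ... combine the results of the initial tasks here ...
    return 200
```

The handler for the dependent task would return this status to the queue, so a 503 simply defers the work until a later retry.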

Related

Parallel and bulk process datastore elements in google cloud dataflow

Problem:
I have data for 2M+ users in my datastore project. I would like to send a weekly newsletter to all users. The mailing API accepts a maximum of 50 email addresses per API call.
Previous Solution:
I used an App Engine backend and a simple datastore query to process all the records in one go. But sometimes I get a critical memory overflow error in the log and the process starts all over again. Because of this, some users get the same email more than once. So I moved to Dataflow.
Current Solution:
I use the FlatMap function to send each email id to a function and then send email individually to each user.
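    import apache_beam as beam

    # ReadFromDatastore, make_query and sendMail are defined or imported elsewhere.
    def process_datastore(project, pipeline_options):
        p = beam.Pipeline(options=pipeline_options)
        query = make_query()
        entities = (p | 'read from datastore' >> ReadFromDatastore(project, query))
        # sendMail is called once per entity, with a single-address list.
        entities | beam.FlatMap(lambda entity: sendMail([entity.properties.get('emailID', "")]))
        return p.run()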
With cloud dataflow, I have ensured that each user gets a mail only once and also nobody is missed out. There are no memory errors.
But the current process takes 7 hours to finish. I tried replacing FlatMap with ParDo, on the assumption that ParDo would parallelize the process, but even that takes the same time.
Question:
How can I group the email IDs into batches of 50, so that each mail API call is used effectively?
How can I parallelize the process so that it takes less than an hour?
You could use query cursors to split your users in batches of 50 and do the actual batch processing (the email sending) inside push queue or deferred tasks. This would be a GAE-only solution, without cloud dataflow, IMHO a lot simpler.
You can find an example of such processing in "Google appengine: Task queue performance" (taking the answer into account as well). That solution uses the deferred library, but it is almost trivial to use push queue tasks instead.
The answer touches on the parallelism aspect in the sense that you may want to limit it to keep costs down.
You could also split the batching itself inside the tasks to obtain an indefinitely scalable solution (any number of recipients, without hitting memory or deadline-exceeded failures), with the task re-enqueuing itself to continue the work from where it left off.
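The grouping into batches of 50 can be sketched independently of the datastore or the mail API. This is a hedged illustration: batches and send_all are hypothetical helpers, and send_batch stands for whatever function wraps the real mailing call.

```python
from itertools import islice


def batches(iterable, size=50):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def send_all(addresses, send_batch):
    """Call send_batch once per group of up to 50 addresses; return the call count."""
    calls = 0
    for chunk in batches(addresses, 50):
        send_batch(chunk)
        calls += 1
    return calls
```

With 2M addresses this yields 40,000 calls instead of 2M. In a task-based setup, each chunk would typically become the payload of one task.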

Figure out group of tasks completion time using TaskQueue and Datastore

I have a push task queue, and each of my jobs consists of multiple similar TaskQueue tasks. Each of these tasks takes less than a second to finish and can add new tasks to the queue (which must also complete before the job is considered finished). Task results are written to the Datastore.
The goal is to understand when a job has finished, i.e. all of its tasks are completed.
Writes are really frequent and I can't store the results inside one entity group. Is there a good workaround for this?
In a similar context I used a scheme based on memcache, which doesn't have the significant write-rate limitation that datastore entity groups have:
* each job gets a unique memcache key associated with it, which it passes to each of the subsequent execution tasks it may enqueue
* every execution task updates the memcache value corresponding to the job key with the current timestamp and also enqueues a completion check task, delayed by an idle timeout value large enough to declare the job completed once it has elapsed
* every completion check task compares the memcache value corresponding to the job key against the current timestamp:
  * if the delta is less than the idle timeout, the job is not complete (some other task was executed since this completion check task was enqueued, hence some other completion check task is in the queue)
  * otherwise the job is completed
Note: the idle timeout should be larger than the maximum time a task might spend in the queue.
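The scheme above can be sketched with a dict standing in for memcache and explicit timestamps standing in for the clock. The IDLE_TIMEOUT value and the function names are assumptions made for the illustration.

```python
IDLE_TIMEOUT = 30  # seconds; assumed value, must exceed the longest queue wait

cache = {}  # stand-in for memcache: job key -> timestamp of the last activity


def on_task_executed(job_key, now):
    """Record activity; return the time at which the completion check should run."""
    cache[job_key] = now
    return now + IDLE_TIMEOUT


def completion_check(job_key, now):
    """Return True if no task touched the job for at least IDLE_TIMEOUT."""
    return now - cache[job_key] >= IDLE_TIMEOUT
```

Only the check enqueued by the last execution task sees a delta that reaches the timeout; earlier checks find a fresher timestamp and report the job as still running.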

How can tasks be prioritized when using the task queue on google app engine?

I'm trying to solve the following problem:
* I have a series of "tasks" which I would like to execute
* I have a fixed number of workers to execute these tasks (since they call an external API using urlfetch and the number of parallel calls to this API is limited)
* I would like these "tasks" to be executed "as soon as possible" (i.e. minimum latency)
* These tasks are parts of larger tasks and can be categorized based on the size of the original task (i.e. a small original task might generate 1 to 100 tasks, a medium one 100 to 1000 and a large one over 1000).
The tricky part: I would like to do all this efficiently (i.e. minimum latency and as many parallel API calls as possible, without going over the limit), but at the same time prevent the large number of tasks generated from "large" original tasks from delaying the tasks generated from "small" original tasks.
To put it another way: I would like to have a "priority" assigned to each task, with "small" tasks having a higher priority, and thus prevent starvation caused by "large" tasks.
Some searching around doesn't seem to indicate that anything pre-made is available, so I came up with the following:
* create three push queues: tasks-small, tasks-medium, tasks-large
* set a maximum number of concurrent requests for each such that the total is the maximum number of concurrent API calls (for example, if the maximum number of concurrent API calls is 200, I could set up tasks-small with a max_concurrent_requests of 30, tasks-medium with 60 and tasks-large with 100)
* when enqueueing a task, check the number of pending tasks in each queue (using something like the QueueStatistics class) and, if another queue is not 100% utilized, enqueue the task there; otherwise just enqueue the task on the queue with the corresponding size.
For example, if we have task T1 which is part of a small task, first check if tasks-small has free "slots" and enqueue it there. Otherwise check tasks-medium and tasks-large. If none of them have free slots, enqueue it on tasks-small anyway and it will be processed after the tasks added before it are processed (note: this is not optimal, because if "slots" free up on the other queues, they still won't process pending tasks from the tasks-small queue).
Another option would be to use a PULL queue and have a central "coordinator" pull from that queue based on priorities and dispatch the tasks; however, that seems to add a little more latency.
However this seems a little bit hackish and I'm wondering if there are better alternatives out there.
EDIT: after some thoughts and feedback I'm thinking of using PULL queue after all in the following way:
* have two PULL queues (medium-tasks and large-tasks)
* have a dispatcher (PUSH) queue with a concurrency of 1 (so that only one dispatch task runs at any time). Dispatch tasks are created in multiple ways:
  * by a once-a-minute cron job
  * after adding a medium/large task to the pull queues
  * after a worker task finishes
* have a worker (PUSH) queue with a concurrency equal to the number of workers
And the workflow:
* small tasks are added directly to the worker queue
* the dispatcher task, whenever it is triggered, does the following:
  * estimates the number of free workers (by looking at the number of running tasks in the worker queue)
  * for any "free" slots it takes a task from the medium/large tasks PULL queue and enqueues it on a worker (or more precisely: adds it to the worker PUSH queue, which will result in it eventually being executed on a worker).
I'll report back once this is implemented and at least moderately tested.
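The dispatcher step in the workflow above can be sketched as follows. This is an illustration under assumptions: MAX_WORKERS is a hard-coded value, deques stand in for the two PULL queues, and draining medium before large is one possible policy that the description does not prescribe.

```python
from collections import deque

MAX_WORKERS = 4  # assumed, hard-coded concurrency of the worker queue


def dispatch(running, medium, large):
    """Move tasks from the pull queues into free worker slots, medium first."""
    free = max(0, MAX_WORKERS - running)
    dispatched = []
    for source in (medium, large):
        while free > 0 and source:
            # In the real system this would add the task to the worker PUSH queue.
            dispatched.append(source.popleft())
            free -= 1
    return dispatched
```

Because the dispatcher queue has a concurrency of 1, only one such call runs at a time, which keeps the free-slot estimate from being consumed twice.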
The small/medium/large original task queues won't help much by themselves: once the original tasks are enqueued they'll keep spawning worker tasks, potentially even breaking the worker task queue size limit. So you need to pace/control the enqueuing of the original tasks.
I'd keep track of the "todo" original tasks in the datastore/GCS and enqueue these original tasks only when the respective queue size is sufficiently low (1 or maybe 2 pending jobs), from either a recurring task, a cron job or a deferred task (depending on the rate at which you need to perform the original task enqueueing) which would implement the desired pacing and priority logic just like a push queue dispatcher, but without the extra latency you mentioned.
I have not used pull queues, but from my understanding they could suit your use case very well. You could define 3 pull queues and have X workers all pulling tasks from them, first trying the "small" queue and then moving on to "medium" if it is empty (where X is your maximum concurrency). You should not need a central dispatcher.
However, then you would be left to pay for X workers even when there are no tasks (or X / threadsPerMachine?), or scale them down & up yourself.
So, here is another thought: make a single push queue with the correct maximum concurrency. When you receive a new task, push its info to the datastore and queue up a generic job. That generic job will then consult the datastore looking for tasks in priority order, executing the first one it finds. This way a short task will still be executed by the next job, even if that job was already enqueued from a large task.
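A minimal sketch of this last idea, assuming a heap stands in for the datastore query "pending tasks in priority order". The names submit and run_generic_job are hypothetical.

```python
import heapq
import itertools

PRIORITY = {"small": 0, "medium": 1, "large": 2}

_pending = []               # stand-in for the datastore of pending task info
_order = itertools.count()  # tie-breaker that keeps FIFO order within a priority


def submit(size, payload):
    """Store the task info; the caller would also enqueue one generic job."""
    heapq.heappush(_pending, (PRIORITY[size], next(_order), payload))


def run_generic_job():
    """Run by the push queue: execute the highest-priority pending task."""
    if not _pending:
        return None
    _, _, payload = heapq.heappop(_pending)
    return payload
```

Each generic job is interchangeable, so a job enqueued on behalf of a large task ends up running a small task if one is pending.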
EDIT: I have now migrated to a simpler solution, similar to what @eric-simonton described:
* I have multiple PULL queues, one for each priority
* Many workers pull on an endpoint (handler)
* The handler generates a random number and does a simple "if less than 0.6, try the small queue first and then the large queue, otherwise the reverse (large then small)"
* If the workers get no tasks or an error, they do semi-random exponential backoff up to a maximum timeout (i.e. they start pulling every 1 second and approximately double the timeout after each empty pull, up to 30 seconds)
This final point is needed - amongst other reasons - because the number of pulls / second from a PULL queue is limited to 10k/s: https://cloud.google.com/appengine/docs/python/taskqueue/overview-pull#Python_Leasing_tasks
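The two rules above (weighted queue order and backoff on empty pulls) can be sketched like this. The 0.6 threshold, the 1 second start and the 30 second cap come from the description. The jitter range of 0.9 to 1.1 is an assumption standing in for "semi-random".

```python
import random


def queue_order(rng=random):
    """Try small first 60% of the time, otherwise large first."""
    if rng.random() < 0.6:
        return ["small", "large"]
    return ["large", "small"]


def next_delay(current, maximum=30.0, rng=random):
    """Roughly double the delay after an empty pull, with jitter, capped at maximum."""
    jitter = rng.uniform(0.9, 1.1)  # assumed jitter range
    return min(maximum, current * 2 * jitter)
```

A worker would reset its delay to 1 second as soon as a pull returns a task, and call next_delay after every empty pull or error.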
I implemented the solution described in the EDIT to the question:
* two PULL queues (medium-tasks and large-tasks)
* a dispatcher (PUSH) queue with a concurrency of 1
* a worker (PUSH) queue with a concurrency equal to the number of workers
See the question for more details. Some notes:
* there is some delay in task visibility due to eventual consistency (i.e. the dispatcher tasks sometimes don't see the tasks from the pull queue even if they are inserted together). I worked around this by adding a countdown of 5 seconds to the dispatcher tasks and also adding a cron job that adds a dispatcher task every minute (so if the original dispatcher task doesn't "see" the task from the pull queue, another will come along later)
* made sure to name every task to eliminate the possibility of double-dispatching them
* you can't lease 0 items from the PULL queues :-)
* batch operations have an upper limit, so you have to do your own batching over the batch taskqueue calls
* there doesn't seem to be a way to programmatically get the "maximum parallelism" value for a queue, so I had to hard-code that in the dispatcher (to calculate how many more tasks it can schedule)
* don't add dispatcher tasks if there are already some (at least 10) in the queue

Does task queue truly run tasks in parallel?

We have an application that takes some input from a user and makes ~50 RPC calls. Each call takes around 4-5 minutes.
In the backend we are using a push queue and enqueuing each of these 50 calls as tasks. This is our queue spec:
    queue:
    - name: some-name
      rate: 500/s
      bucket_size: 100
      max_concurrent_requests: 500
My understanding is that all 50 requests should be run in parallel, and thus all of them should be complete in 4-5 minutes. But what's actually happening is that only around ~15 of these requests are returning results, while the rest cross the 10 min limit and time out. Another thing to note is that this seems to work fine if we bring down the number of requests to < 10.
There's always the possibility that the requests that timed out did so because the RPC response actually took that long. But what I wanted to confirm is:
(1) My understanding of the tasks running in parallel is correct.
(2) Our queue config and the number of tasks we're enqueuing have nothing to do with these requests timing out.
Are these correct?
(1) Parallel execution
Yes, tasks can be executed in parallel (up to 500 in your case), but with push queues your app has no control over the order in which the tasks in a queue are executed and no direct control over how many tasks are executed at once. (Your app can control the sequence in which tasks are added to a queue, though; see the pattern in (2) below.)
App Engine uses several factors to decide which tasks are executed and how fast, chiefly the queue configuration and the scaling configuration (e.g. in app.yaml). Since you are billed for at least the first 15 minutes of each instance, it could get very expensive to really launch 50 instances and then have them idle for 15 minutes before they shut down (until the next request). In this regard, the mechanism that spawns new instances is a little smarter, whether the load comes from user HTTP requests or from task queues.
(2) Request time outs
Yes, it is very unlikely that the enqueuing has anything to do with these request time outs, unless the time outs are an unintentional side effect of wrongly assuming that a particular task had already been executed.
In order to avoid request time outs in general, it makes sense to split a task into multiple tasks. For example, if you have a task do_foo whose executions frequently exceed the time outs (or memory limits), you could instead have do_foo offload work to other tasks that will do the actual jobs.
For some migration tasks I use this pattern in a linear / sequential way. E.g. the classmethod do_foo just queries entities of a certain kind (ordered by creation timestamp, for example), maybe filtered, page by page (e.g. 50 per page, in transactions with an ancestor). It does some writes to the entities first, and only at the very end, after a successful commit, does it create a new transactional do_foo task with a cursor parameter pointing to the next page, possibly with a countdown of 1 sec to avoid transaction errors. The next execution of do_foo will continue with the next page (of course only after the task for the previous page has completed).
Depending on the nature of the tasks, you could alternatively have each task fan out into multiple tasks per execution, e.g. do_foo triggers do_bar, do_something and do_more. Also note that up to five tasks can be enqueued transactionally within a single transaction.
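The sequential cursor pattern can be sketched with plain Python stand-ins. The assumptions: a list plays the role of the entities, an integer offset plays the role of the query cursor, and another list plays the role of the push queue. Only the name do_foo is taken from the description above.

```python
PAGE_SIZE = 50
ITEMS = list(range(120))  # stand-in for the entities of one kind
processed = []            # records the "writes" done so far
task_queue = []           # stand-in for the push queue: pending cursors


def do_foo(cursor=0):
    """Process one page, then enqueue the follow-up task for the next page."""
    page = ITEMS[cursor:cursor + PAGE_SIZE]
    processed.extend(page)  # the writes for this page
    next_cursor = cursor + len(page)
    if next_cursor < len(ITEMS):
        # Enqueued only after the work for this page has succeeded.
        task_queue.append(next_cursor)


def drain():
    """Run queued tasks one after another; return how many tasks ran."""
    task_queue.append(0)
    runs = 0
    while task_queue:
        do_foo(task_queue.pop(0))
        runs += 1
    return runs
```

Each execution handles a bounded amount of work, so no single task comes close to the request deadline regardless of the total number of entities.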

GAE Map Reduce - hitting entities more than once?

I'm using Map Reduce (http://code.google.com/p/appengine-mapreduce/) to do an operation over a set of entities. However, I am finding my operations are being duplicated.
Are map reduce maps sometimes called more than once for a specific entity? Is this the case even if they don't fail the initial time?
edit: here are some more details.
    from mapreduce import operation as op

    def reparent_request(entity):
        # check if the entity has a parent
        if not is_valid_to_reparent(entity):
            return
        # copy it
        try:
            copy = clone_entity(Request, entity, parent=entity.user)
            copy.put()  # we hard put here so we can use the reference later in this function.
        except:
            ...
        # ... update some references to the copied object ...
        # delete the original
        yield op.db.Delete(entity)
At the end, I am non-deterministically left with two entities, both with the new parent.
I've reparented a load of entities before - it was a nightmare because of the exact problem you're facing.
What I would do instead is:
1. Create a new queue. Ensure it's paused and that you have a lot of storage space dedicated to queues. It's only temporary, but you'll need it.
2. Instead of editing your entities in your map reduce job, add them to the queue as tasks with a name that is unique for each entity. The key works fine.
3. Because the queue is paused, you'll get an error if you try to add a task with the same name twice, so catch the error and skip it: you know that entity must already have been touched by the map reduce job.
4. When you're confident that every entity has a matching queue task and the map reduce job has finished, unpause your queue. The queue will do the reparenting.
A couple of notes:
* the task queue size can get pretty big. I can't remember the numbers, but it was gigs. Also, the size of the queue doesn't update in real time, so don't worry if it still says gigs of tasks when the queue is nearly empty.
* the reliability of the queue storage is an unknown, I believe. It didn't happen to us, but queue items could disappear, I guess. Fortunately, you can rerun this process multiple times to ensure it worked, especially if you're deleting entities.
* you may want to ensure your queue has a concurrency limit on it. Without one, a delay in the execution of a couple of tasks can absolutely cripple your application. Learnt that the hard way! I think 30 concurrent tasks went quite well for us.
Hope that's useful, let me know if you come up with any improvements!
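The named-task deduplication in the steps above can be sketched as follows. PausedQueue and DuplicateTaskName are hypothetical stand-ins for the real queue and its "name already used" error, and the "reparent-" prefix is an arbitrary choice.

```python
class DuplicateTaskName(Exception):
    """Hypothetical stand-in for the queue's 'name already used' error."""


class PausedQueue:
    """In-memory stand-in for a paused queue that rejects duplicate task names."""

    def __init__(self):
        self.tasks = {}

    def add(self, name, payload):
        if name in self.tasks:
            raise DuplicateTaskName(name)
        self.tasks[name] = payload


def enqueue_once(queue, entity_key):
    """Called from the mapper; return False if the entity was already queued."""
    try:
        queue.add("reparent-%s" % entity_key, entity_key)
        return True
    except DuplicateTaskName:
        return False
```

However many times the mapper visits an entity, the queue ends up holding exactly one reparenting task for it.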
App Engine mapreduce runs on the task queue, and like anything else that uses the task queue, tasks have to be idempotent - that is, running them multiple times should have the same effect as running them once. Tasks will occasionally be run more than once; the mapreduce library may have its own reasons for rerunning mapper tasks, too.
In your situation, I'd suggest creating the new entity with a key whose ID is the same as the old entity; that way running it multiple times will just overwrite the same entity.
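A minimal sketch of why reusing the ID makes the copy idempotent, assuming a dict keyed by (parent, kind, id) stands in for the datastore. copy_to_parent is a hypothetical helper, not the clone_entity function from the question.

```python
datastore = {}  # stand-in for the datastore: (parent, kind, id) -> properties


def copy_to_parent(old_key, new_parent):
    """Copy the entity under new_parent, reusing the ID of the old entity."""
    _, kind, entity_id = old_key
    new_key = (new_parent, kind, entity_id)
    # Same key on every run, so a rerun overwrites instead of duplicating.
    datastore[new_key] = dict(datastore[old_key])
    return new_key
```

If the mapper runs twice for the same entity before the original is deleted, both runs write to the same key, so only one copy exists under the new parent.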
