Is the execution of a queued task always guaranteed on GAE? - google-app-engine

Is the following simple pattern enough to ensure the task sequence never stops, even after application updates or hard, 'erratic' Google failures?
def do_work():
    ... ....
    deferred.defer(do_work, _countdown=..in 7 days..)
Can I schedule such a self-scheduling worker and never look back?

Two answers:
Yes, tasks will eventually execute and will also retry execution in case of errors in task execution. The retry options are set when you define the task.
No, the task queue is not a scheduler, so you cannot schedule a task to run at a certain time. Tasks put into a task queue are served immediately in a FIFO fashion.
As @Jesse noted, for scheduling jobs you should look into GAE cron.

If a task is queued successfully, it will eventually execute. (And App Engine will keep trying for as long as it takes.)
The pattern you show might be better implemented using cron jobs, though, which run a task on a regular basis. A common pattern I use is to have a daily cron job kick off a task on a task queue with a small number of retries (so that if there's a temporary glitch, it will retry immediately).
If you do want to use the method above, rather than cron, there's another thing to worry about: since your method can be retried due to it failing or other system issues (e.g. the instance running it going down) you should make sure that you don't end up with two tasks. Imagine if it ran, registered the next task and then the node went down; App Engine would retry, starting a second task. To prevent this, you could use the data store (in a transaction) to test and see if the next task has already been enqueued. Something like:
from google.appengine.ext import db, deferred
def do_work(counter):
    ...
    @db.transactional
    def start_next():
        # fetch myModel from the data store here
        if myModel.counter == counter:
            return  # already started next job
        myModel.counter = counter
        myModel.put()
        # enqueued only if the surrounding transaction commits
        deferred.defer(do_work, counter + 1, _transactional=True, _countdown=...)
    start_next()
Note the "_transactional" argument in the defer call; this ensures that the myModel entity will be updated if and only if the next task is enqueued.
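The guard can be tried outside App Engine with a minimal in-memory sketch. The InMemoryScheduler class and its attribute names are hypothetical stand-ins for the datastore entity and the task queue; on App Engine the check and the enqueue would have to happen inside one datastore transaction, which this sketch does not model.

```python
class InMemoryScheduler(object):
    """Hypothetical stand-in for the datastore entity plus the task queue."""

    def __init__(self):
        self.last_scheduled_from = None  # plays the role of myModel.counter
        self.queue = []                  # counters of tasks waiting to run

    def start_next(self, counter):
        # On App Engine this check-and-enqueue would run in one transaction.
        if self.last_scheduled_from == counter:
            return False  # an earlier attempt already enqueued the successor
        self.last_scheduled_from = counter
        self.queue.append(counter + 1)
        return True


def do_work(scheduler, counter):
    # ... the real work would go here ...
    return scheduler.start_next(counter)


scheduler = InMemoryScheduler()
first_attempt = do_work(scheduler, 0)  # normal run
retry_attempt = do_work(scheduler, 0)  # simulated retry of the same task
```

After both calls only one follow-up task is queued, because the retry sees that the counter was already recorded.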
You might also want to look into sending an email to an administrator after a certain number of failed retries. (You can find the retry count in the request HTTP headers, but you can't use the deferred library if you want to do this; you have to use the task queue API directly.)
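As a rough sketch of the retry check: App Engine sets an X-AppEngine-TaskRetryCount header on task requests. The helper names and the threshold below are made up for illustration, and sending the email itself is left out.

```python
def retry_count(headers):
    """Return the retry count from a task request's headers (0 if missing)."""
    try:
        return int(headers.get('X-AppEngine-TaskRetryCount', 0))
    except (TypeError, ValueError):
        return 0


def should_alert_admin(headers, threshold=5):
    # threshold is an arbitrary example value, not an App Engine default
    return retry_count(headers) >= threshold
```

In a task handler you would pass self.request.headers; a plain dict, as used when trying this locally, is case-sensitive, unlike real request headers.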

Related

How to mark a message as "in progress" so other workers don't work on it

I'm attempting to use a pull queue to create a queue of image processing tasks that could take longer than the ack timeout limit of 10 minutes. I'm using the Node.js API and I'm wondering how I could have a worker grab a message off the pull queue, mark it as in progress so no other workers attempt to grab it, do its work and acknowledge the message after the processing is done. This processing could take up to an hour per worker. If an exception occurs, I'd like to remove the "in progress" status and allow other workers to pick up this message and attempt to work on it.
I was hoping there was something in Pub/Sub that would allow me to do this. My alternative is to store an entity (inProgressMessage) with the message ID, ack ID, status=pending and timestamp=now() in Datastore before processing, and have the worker immediately return the ack ID after receiving the message (this will allow other workers to attempt other messages); the worker can then work on the lengthy task. If it succeeds, mark the entity status as complete. If it fails in a non-permanent way, requeue the task into Pub/Sub. If it fails in a permanent way that doesn't allow requeueing, I can have a cron job that checks Datastore for pending tasks older than several hours and either deletes or requeues them.
My alternative feels like I'm re-implementing a lot of what Pub/Sub is supposed to help with.
Let me know if you can think of a better way.
To take longer than the ack deadline to process a message, you'll want to use modifyAckDeadline. You can extend the deadline as many times as you need up to 10 minutes per call. Your workflow would be as follows:
1. Pull the message.
2. Start to process the message.
3. While you are not done with the message, if you are close to the 10 minute ack deadline, call modifyAckDeadline to extend the deadline.
4. Once done processing the message, ack it.
Please note that calling modifyAckDeadline does not guarantee that the message won't be delivered to another worker. In certain circumstances, like server restarts, the message may end up being delivered to another of your subscribers. However, in most normal circumstances, as long as you call modifyAckDeadline before the current ack deadline expires, you can prevent a message's redelivery for as long as necessary.
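The workflow above could be sketched roughly as follows, in Python rather than Node.js. The extend_deadline and ack callbacks are hypothetical: in a real worker they would wrap the modifyAckDeadline and acknowledge calls for the pulled message, and the interval would be chosen comfortably shorter than the ack deadline.

```python
import threading
import time


def process_with_lease(work, extend_deadline, ack, interval_seconds):
    """Run work() while a background thread keeps extending the ack deadline."""
    done = threading.Event()

    def keep_alive():
        # Event.wait() returns False on timeout and True once done is set.
        while not done.wait(interval_seconds):
            extend_deadline()

    extender = threading.Thread(target=keep_alive)
    extender.daemon = True
    extender.start()
    try:
        result = work()
    finally:
        done.set()
        extender.join()
    ack()  # reached only if work() did not raise
    return result


# Dry run with fake callbacks and very short intervals.
calls = {'extend': 0, 'ack': 0}


def fake_extend():
    calls['extend'] += 1


def fake_ack():
    calls['ack'] += 1


def fake_work():
    time.sleep(0.3)
    return 'done'


outcome = process_with_lease(fake_work, fake_extend, fake_ack,
                             interval_seconds=0.05)
```

If work() raises, the message is never acked and the extensions stop, so it becomes available to other workers once the current deadline runs out.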
When creating a subscription (only), you can configure the acknowledgement deadline to be anything up to 10 minutes (https://cloud.google.com/pubsub/subscriber). Once a message has been pulled from the queue, no other worker (on the same subscription) will be able to take it for processing unless the ack deadline is reached, at which point the message is automatically returned to the queue.
Since you need a longer period, you will have to implement something on your own, or seek another queuing solution. I think the design you suggested is fairly simple to implement, and is not really a re-implementation of what pubsub does.

What happens differently when you add a task Asynchronously on GAE?

Google's doc on async tasks assumes knowledge of the difference between regular and asynchronously added tasks.
add_async(task, transactional=False, rpc=None)
Asynchronously add a Task or a list of Tasks to this Queue.
How is adding tasks asynchronously different from adding them regularly?
I.e. what is the difference between using add(task, transactional=False) and add_async(task, transactional=False, rpc=None)?
I've heard that adding tasks regularly blocks certain things. Any explanation of what it blocks and how, and how async tasks don't block would be greatly appreciated.
Tasks are scheduled and run elsewhere.
The async bit refers to the fact that the call returns immediately: your code does not wait for the round trip of the RPC that submits the task to a queue. You still have to check or wait for the result before the end of the request, but it means you can be doing other work in the meantime and then check that the call completed before you exit.
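As an analogy only (this is not the taskqueue API), the same shape can be shown with Python's standard concurrent.futures module. The fake_add_rpc function is a made-up stand-in for the round trip that submits a task.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_add_rpc(task_name):
    """Hypothetical stand-in for the RPC round trip that submits a task."""
    time.sleep(0.1)
    return task_name


events = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fake_add_rpc, 'resize-image')  # returns immediately
    events.append('handler keeps working')              # overlaps with the RPC
    added = future.result()                             # wait only at the end
    events.append('add confirmed: ' + added)
```

A regular add() corresponds to calling fake_add_rpc directly: nothing else in the handler can run until the round trip has finished.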

DeadlineExceededException and DataStore/Task Queue Operations

I'm doing some operations that should complete in under 60 seconds, but there may be some rare cases where they take longer (though never longer than 10 minutes). The App Engine docs say that if you catch a DeadlineExceededException you have less than a second to do operations before the request permanently fails. Would this be enough time to add a task to a queue and/or do a datastore write? I assume the safest way would be to add a task (async) or write a datastore entity (async) at the beginning of an operation and remove it from the queue if the operation completes. The latter method would use up twice as many API calls, but is it worth it?
I would suggest using the queue by default for all such operations, so you won't have to implement the fallback to it when you catch a deadline-exceeded error. It is cleaner and easier to maintain, and the user doesn't have to wait for the operation to complete. To achieve this, you can trigger your queue with an AJAX call and get the result in the background. Yes, it is worth it, since it can "guarantee" the window of time you might need.
The runtime environment gives the request handler a little more time (less than a second) after raising the exception to prepare a custom response, so that should be sufficient to add the operation to a task queue.
If you do not want the client to keep polling for a task queue result, I suggest you have a look at the Channel API. It will enable you to implement push notifications to the client.
At the end of your queued task, you'll just have to send a notification to the client to let them know that the task has been processed.

App Engine: Is it possible to enqueue tasks asynchronously?

Many of my handlers add a task to a task queue to do non-critical background processing. Since this processing isn't critical, if the call to taskqueue.add() throws an exception, my code just ignores it.
Tonight the task queue seemed to be down for around half an hour. Although my handlers correctly ignored the failure, the taskqueue.add() call took about 5 seconds to time out before they could move on to processing the rest of the page. This made my site run very slowly.
So, is it possible to enqueue a task asynchronously - meaning a way to add a task, without waiting to see if the addition succeeded?
Alternatively, is there a way to reduce that timeout from 5 seconds down to, e.g., 1 second?
Thanks.
You can use the new taskqueue methods create_rpc and add_async. If you don't care if the add succeeds, simply call add_async and ignore the result. If you care, but only want to wait 1 second, set the deadline when calling create_rpc, and use the return value as the RPC argument to add_async. Call get_result to find out if the tasks were successfully added.
I think you can't do anything about it because the RPC call underneath the add method is a synchronous blocking API call.
You could try to add some check using the Capabilities API.
I am pretty sure GAE announced that TQ adds will be async with the next release (experimental feature).

Fail-safe datastore updates on app engine

The app engine datastore, of course, has downtime. However, I'd like to have a "fail-safe" put which is more robust in the face of datastore errors (see motivation below). It seems like the task queue is an obvious place to defer writes when the datastore is unavailable. I don't know of any other solutions though (other than shipping off the data to a third-party via urlfetch).
Motivation: I have an entity which really needs to be put in the datastore - simply showing an error message to the user won't do. For example, perhaps some side effect has taken place which can't easily be undone (perhaps some interaction with a third-party site).
I've come up with a simple wrapper which (I think) provides a reasonable "fail-safe" put (see below). Do you see any problems with this, or have an idea for a more robust implementation? (Note: Thanks to suggestions posted in the answers by Nick Johnson and Saxon Druce, this post has been edited with some improvements to the code.)
import logging
from google.appengine.api.labs.taskqueue import taskqueue
from google.appengine.datastore import entity_pb
from google.appengine.ext import db
from google.appengine.runtime.apiproxy_errors import CapabilityDisabledError
def put_failsafe(e, db_put_deadline=20, retry_countdown=60, queue_name='default'):
    """Tries to e.put(). On success, 1 is returned. If this raises a db.Error
    or CapabilityDisabledError, then a task will be enqueued to try to put the
    entity (the task will execute after retry_countdown seconds) and 2 will be
    returned. If the task cannot be enqueued, then 0 will be returned. Thus a
    falsey value is only returned on complete failure.
    Note that since the taskqueue payloads are limited to 10kB, if the protobuf
    representing e is larger than 10kB then the put will be unable to be
    deferred to the taskqueue.
    If a put is deferred to the taskqueue, then it won't necessarily be
    completed as soon as the datastore is back up. Thus it is possible that
    e.put() will occur *after* other, later puts when 1 is returned.
    Ensure e's model is imported in the code which defines the task which tries
    to re-put e (so that e can be deserialized).
    """
    try:
        e.put(rpc=db.create_rpc(deadline=db_put_deadline))
        return 1
    except (db.Error, CapabilityDisabledError), ex1:
        try:
            taskqueue.add(queue_name=queue_name,
                          countdown=retry_countdown,
                          url='/task/retry_put',
                          payload=db.model_to_protobuf(e).Encode())
            logging.info('failed to put to db now, but deferred put to the taskqueue e=%s ex=%s' % (e, ex1))
            return 2
        except (taskqueue.Error, CapabilityDisabledError), ex2:
            return 0
Request handler for the task:
from google.appengine.datastore import entity_pb
from google.appengine.ext import db, webapp
# IMPORTANT: This task deserializes entity protobufs. To ensure that this is
# successful, you must import any db.Model that may need to be
# deserialized here (otherwise this task may raise a KindError).
class RetryPut(webapp.RequestHandler):
    def post(self):
        e = db.model_from_protobuf(entity_pb.EntityProto(self.request.body))
        e.put()  # failure will raise an exception => the task to be retried
I don't expect to use this for every put - most of the time, showing an error message is just fine. It is tempting to use it for every put, but I think sometimes it might be more confusing for the user if I tell them that their changes will appear later (and continue to show them the old data until the datastore is back up and the deferred puts execute).
Your approach is reasonable, but has several caveats:
By default, a put operation will retry until it runs out of time. Since you have a backup strategy, you may want to give up sooner - in which case you should supply an rpc parameter to the put method call, specifying a custom deadline.
There's no need to set an explicit countdown - the task queue will retry failing operations for you at increasing intervals.
You don't need to use pickle - Protocol Buffers have a natural string encoding which is much more efficient. See this post for a demonstration of how to use it.
As Saxon points out, task queue payloads are limited to 10 kilobytes, so you may have trouble with large entities.
Most importantly, this changes the datastore consistency model from 'strongly consistent' to 'eventually consistent'. That is, the put that you enqueued to the task queue could be applied at any time in the future, overwriting any changes that were made in the interim. Any number of race conditions are possible, essentially rendering transactions useless if there are puts pending on the task queue.
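The consistency caveat can be illustrated with a small in-memory sketch. The store dict and pending_puts list are hypothetical stand-ins for the datastore and for puts deferred to the task queue; no App Engine API is used.

```python
store = {}         # hypothetical stand-in for the datastore
pending_puts = []  # hypothetical stand-in for deferred puts on the task queue


def put(key, value):
    store[key] = value


def defer_put(key, value):
    # Called when the datastore is unavailable: the write is only queued.
    pending_puts.append((key, value))


def run_pending_puts():
    while pending_puts:
        key, value = pending_puts.pop(0)
        put(key, value)  # blindly applies whatever was captured earlier


defer_put('profile', 'edit made during the outage')
put('profile', 'newer edit made after recovery')
run_pending_puts()
final_value = store['profile']  # the stale deferred write has won
```

The deferred write lands last and silently replaces the newer edit, which is the kind of race the answer warns about.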
One potential issue is that tasks are limited to 10kb of data, so this won't work if you have an entity which is larger than that once pickled.
