DeadlineExceededException and DataStore/Task Queue Operations - google-app-engine

I'm doing some operations that should complete in under 60 seconds, but there may be rare cases where they take longer (though never more than 10 minutes). The App Engine docs say that if you catch a DeadlineExceededException you have less than a second to do any work before the request permanently fails. Would that be enough time to add a task to a queue and/or do a datastore write? I assume the safest way would be to add a task asynchronously (or write a datastore entity asynchronously) at the beginning of the operation and remove it from the queue if the operation completes. That approach uses twice as many API calls, but is it worth it?

I would suggest using the task queue by default for all of these operations, so you don't have to implement a fallback to it when you catch a DeadlineExceededError. It is cleaner and easier to maintain, and the user doesn't have to wait for the operation to complete. To achieve this you can trigger the work with an AJAX call and let it run in the background, so the user isn't kept waiting. And yes, it is worth it, since it can "guarantee" the window of time you might need.

The runtime environment gives the request handler a little bit more time (less than a second) after raising the exception to prepare a custom response, so that should be enough time to add the work to a task queue.
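For illustration, a minimal sketch of that pattern in a webapp2 handler; run_long_operation, the task URL and the op_id parameter are made-up names, not anything from the question:

import webapp2
from google.appengine.api import taskqueue
from google.appengine.runtime import DeadlineExceededError

class OperationHandler(webapp2.RequestHandler):
    def post(self):
        op_id = self.request.get('op_id')   # hypothetical identifier for the operation
        try:
            run_long_operation(op_id)       # hypothetical function doing the real work
        except DeadlineExceededError:
            # Less than a second remains, but enqueueing a single task fits in it.
            taskqueue.add(url='/tasks/finish-operation', params={'op_id': op_id})
            self.response.set_status(202)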

If you do not want the client to keep polling for a task queue result, I suggest you have a look at the Channel API. It will enable you to implement push notifications to the client.
At the end of your task, you just have to send a notification to the client to let it know that its task has been processed.
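As a rough sketch using the (old) Channel API; client_id here is whatever string you choose to identify the client:

from google.appengine.api import channel

def open_channel(client_id):
    # Call this while rendering the page; pass the returned token to the
    # JavaScript channel client so it can listen for the notification.
    return channel.create_channel(client_id)

def notify_client(client_id):
    # Call this at the end of the task queue handler instead of having
    # the client poll for the result.
    channel.send_message(client_id, 'task-finished')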

Related

How to mark a message as "in progress" so other workers don't work on it

I'm attempting to use a pull queue to create a queue of image-processing tasks that could take longer than the ack deadline limit of 10 minutes. I'm using the Node.js API, and I'm wondering how I could have a worker grab a message off the pull queue, mark it as in progress so no other workers attempt to grab it, do its work, and acknowledge the message after the processing is done. This processing could take up to an hour per worker. If an exception occurs, I'd like to remove the "in progress" status and allow other workers to pick up this message and attempt to work on it.
I was hoping there was something in Pub/Sub that would allow me to do this. My alternative is to store an entity (inProgressMessage) with the message id, ack id, status=pending, timestamp=now() into the datastore before processing, have the worker immediately acknowledge the message after receiving it (this will allow other workers to attempt other messages), and then let the worker work on the lengthy task. If it succeeds, mark the entity status as complete; if it fails in a non-permanent way, requeue the task into Pub/Sub; if it fails in a permanent way that won't allow requeueing, I can have a cron job that checks the datastore for pending tasks older than several hours and either deletes or requeues them.
My alternative feels like I'm re-implementing a lot of what Pub/Sub is supposed to help with.
Let me know if you can think of a better way.
To take longer than the ack deadline to process a message, you'll want to use modifyAckDeadline. You can extend the deadline as many times as you need up to 10 minutes per call. Your workflow would be as follows:
Pull the message.
Start to process the message.
While you are not done with the message, if you are close to the 10 minute ack deadline, call modifyAckDeadline to extend the deadline.
Once done processing the message, ack it.
Please note that calling modifyAckDeadline does not guarantee that the message won't be delivered to another task. In certain circumstances like server restarts, the message may end up being delivered to another of your subscribers. However, in most normal circumstances, as long as you call modifyAckDeadline before the current ack deadline expires, you can prevent a message from being redelivered for as long as necessary.
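For reference, a minimal sketch of that loop using the Python google-cloud-pubsub client (the question uses Node.js, but the calls map directly; process_chunk and the project/subscription names are placeholders):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path('my-project', 'image-tasks')  # placeholders

response = subscriber.pull(request={'subscription': subscription, 'max_messages': 1})
for received in response.received_messages:
    done = False
    while not done:
        done = process_chunk(received.message)  # hypothetical: do a slice of the hour-long job
        if not done:
            # Extend the ack deadline before it expires (600 seconds is the maximum per call).
            subscriber.modify_ack_deadline(request={
                'subscription': subscription,
                'ack_ids': [received.ack_id],
                'ack_deadline_seconds': 600,
            })
    # Only ack once the work is actually finished.
    subscriber.acknowledge(request={'subscription': subscription,
                                    'ack_ids': [received.ack_id]})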
When creating a subscription (and only then), you can configure the acknowledgement deadline to be anything up to 10 minutes (https://cloud.google.com/pubsub/subscriber). Once a message has been pulled from the queue, no other worker on the same subscription will be able to take it for processing unless the ack deadline is reached, at which point the message is automatically returned to the queue.
Since you need a longer period, you will have to implement something of your own or look for another queueing solution. I think the design you suggested is fairly simple to implement and is not really a re-implementation of what Pub/Sub does.

Making db.put() failsafe

I would like to make a db.put() operation in my Google App Engine service as resilient as possible, trying to maximize the likelihood of success even in the event of infrastructure issues or overload. What I have come up with at the moment is to catch every possible exception that could occur and to create a task that retries the commit if the first attempt fails:
try:
    db.put(new_user_record)
except DeadlineExceededError:
    deferred.defer(db.put, new_user_record)
except:
    deferred.defer(db.put, new_user_record)
Does this code trap all possible error paths? Or are there other ways db.put() can fail that would not be caught by this code?
Edit on March 28, 2013 - To clarify when failure is expected
It seems that the answers so far assume that if db.put() fails it is because the datastore is down. In my experience of running fairly high-workload applications, that is not necessarily the case. Sometimes you run into workload-specific API bottlenecks, and sometimes the slowness of one API causes the request deadline to expire in another. Even though such events have a low frequency, their number can be sizable if traffic is high. These are the situations I am trying to cover.
I wouldn't say this is the best approach - whatever caused the original exception is just as likely to happen again. What I would do for extra resilience is first load the record to be saved into memcache, and in the event of any exception from the put, attempt a certain number of retries (for example 3) with a short sleep between each attempt. Depending on your application this could be a synchronous operation, or it could be done asynchronously with deferred tasks using the data in memcache.
Finally, I'd actually query for the record in the datastore even when there was no exception, to confirm that the row has actually been written.
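A rough sketch of that idea, using the db API from the question (the cache key scheme, user_id argument and helper names are made up):

import time
from google.appengine.api import memcache
from google.appengine.datastore import entity_pb
from google.appengine.ext import db, deferred

def save_user(user_id, new_user_record):
    # Stage a serialized copy in memcache, then try the put a few times with a
    # short sleep; if that still fails, hand the retry to a deferred task.
    cache_key = 'pending-user:%s' % user_id
    memcache.set(cache_key, db.model_to_protobuf(new_user_record).Encode(), time=3600)
    for attempt in range(3):
        try:
            db.put(new_user_record)
            memcache.delete(cache_key)
            return
        except Exception:
            time.sleep(0.2)
    deferred.defer(put_from_cache, cache_key)

def put_from_cache(cache_key):
    data = memcache.get(cache_key)
    if data is None:
        return  # expired, or the original put succeeded after all
    db.put(db.model_from_protobuf(entity_pb.EntityProto(data)))
    memcache.delete(cache_key)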
Well, I don't think it's a good idea to attempt such a fallback at all. If the datastore is down, it's down and you're out of luck (which shouldn't happen frequently :)
Some thoughts to your code:
There are many more exceptions that could be raised during a put operation (like InternalError, Timeout, CommittedButStillApplying, TransactionFailedError).
Some of them don't mean that the put has failed (i.e. CommittedButStillApplying just means the put operation is delayed). With your approach, you would end up with that entry twice in the datastore after your deferred call succeeds.
Tasks are limited to ~100KB (total size, not payload). If your payload is close to or above that limit, the deferred API will automatically try to serialize your payload to the datastore in order to keep the task itself below that limit. If the datastore is really unavailable, this will fail too.
So it's probably better to catch datastore errors and inform your user that their request failed.
It's all good to retry, but use exponential backoff and, most importantly, proper transactions so that a failure doesn't end up as a partial write.
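A small sketch of what that could look like with the old db API (helper names are illustrative, and this assumes the write fits in a single entity group so the transaction applies):

import random
import time
from google.appengine.ext import db

def transactional_put(entity):
    db.put(entity)

def resilient_put(entity, attempts=5):
    # Retry with exponential backoff; doing the write inside a transaction
    # means a failed attempt never leaves a partial write behind.
    delay = 0.1
    for attempt in range(attempts):
        try:
            return db.run_in_transaction(transactional_put, entity)
        except (db.TransactionFailedError, db.Timeout, db.InternalError):
            if attempt == attempts - 1:
                raise
            time.sleep(delay + random.uniform(0, 0.1))  # jitter avoids retry stampedes
            delay *= 2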

Is the execution of a queued task always guaranteed on GAE?

Is the following simple pattern enough to ensure the task sequence never stops, even after application updates or hard, 'erratic' Google failures?
def do_work():
    ...
    deferred.defer(do_work, _countdown=..in 7 days..)
Can I schedule such a self-scheduling worker and never look back?
Two answers:
Yes, tasks will eventually execute, and they will also be retried if they fail. The retry options are set when you define the task.
No, the task queue is not a scheduler, so you cannot schedule a task to run at a certain time. Tasks put into a task queue are served immediately, in a FIFO fashion.
As @Jesse noted, for scheduling jobs you should look into GAE cron.
If a task is queued successfully, it will eventually execute. (And App Engine will keep trying for as long as it takes.)
The pattern you show might be better implemented using cron jobs, though, which run a task on a regular basis. A common pattern I use is to have a daily cron job kick off a task on a task queue with a small number of retries (so that if there's a temporary glitch, it will retry immediately).
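For instance, a handler like the sketch below could sit behind the cron URL; the queue name, task URL and handler name are just placeholders:

import webapp2
from google.appengine.api import taskqueue

class DailyKickoffHandler(webapp2.RequestHandler):
    # Hypothetical handler mapped to the URL listed in cron.yaml; it only
    # enqueues the real work, with a small retry limit for transient glitches.
    def get(self):
        retry = taskqueue.TaskRetryOptions(task_retry_limit=3)
        taskqueue.Queue('default').add(
            taskqueue.Task(url='/tasks/do-work', retry_options=retry))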
If you do want to use the method above rather than cron, there's another thing to worry about: since your method can be retried because it failed or because of other system issues (e.g. the instance running it going down), you should make sure that you don't end up with two tasks. Imagine if it ran, registered the next task and then the node went down; App Engine would retry, starting a second task. To prevent this, you could use the datastore (in a transaction) to check whether the next task has already been enqueued. Something like:
def do_work(counter):
    ...

    @db.transactional
    def start_next():
        # fetch myModel from the data store here
        if myModel.counter == counter:
            return  # already started next job
        myModel.counter = counter
        myModel.put()
        deferred.defer(do_work, counter + 1, _transactional=True, _countdown=...)

    start_next()
Note the "transactional" argument in the defer call; this ensures that the MyModel instance will be updated if and only if the next task is enqueued.
You might also want to look into sending an email to an administrator after a certain number of failed retries. (You can find the retry count in the request HTTP headers, but you can't use the deferred library if you want to do this; you have to use the task queue API directly.)
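A sketch of that check, assuming a plain task queue handler (the threshold, sender address and do_work are made up):

import webapp2
from google.appengine.api import mail

class WorkerHandler(webapp2.RequestHandler):
    def post(self):
        # App Engine adds X-AppEngine-TaskRetryCount to task queue requests.
        retries = int(self.request.headers.get('X-AppEngine-TaskRetryCount', 0))
        if retries > 10:  # the threshold is an arbitrary choice
            mail.send_mail_to_admins(
                sender='alerts@your-app-id.appspotmail.com',  # placeholder address
                subject='Scheduled task keeps failing',
                body='The task has been retried %d times.' % retries)
        do_work()  # hypothetical: the actual work for this task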

Is there an elegant way to post messages to AWS SQS with visibility delay of longer than 15 minutes?

In Amazon Web Services, their queues allow you to post messages with a visibility delay up to 15 minutes. What if I don't want messages visible for 6 months?
I'm trying to come up with an elegant solution to the poll/push problem. I can write code to poll the SQS (or a database) every few seconds, check for messages that are ready to be visible, then move them to a "visible queue", or something like that. I wish there was a simpler, more reliable method to have messages become visible in queues far into the future without me having to worry about my polling application working perfectly all the time.
I'm not married to AWS, SQS or any of that, but I'd prefer to find a cloud-friendly solution that is stable, reliable and will trigger an event far into the future without me having to worry about checking on its status every day.
Any thoughts or alternate trees for me to explore barking up are welcome.
Thanks!
It sounds like you might be misunderstanding the visibility delay. Its purpose is to make sure that the polling application doesn't pull the same item off the queue more than once.
In other words, when an item is pulled off the queue it becomes invisible for a predetermined period of time (default is 30 seconds, maximum is 12 hours) in case the polling system has a cluster of machines reading from the queue all at once.
Here's the relevant documentation:
http://docs.amazonwebservices.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/IntroductionArticle.html#AboutVT
...and the sentence in particular that relates to my comment is:
"Immediately after the component receives the message, the message is still in the queue. However, you don't want other components in the system receiving and processing the message again. Therefore, Amazon SQS blocks them with a visibility timeout, which is a period of time during which Amazon SQS prevents other consuming components from receiving and processing that message."
You should be able to use SQS for your purpose since you can leave an item in the queue for as long as you want.
7 years later, and Amazon still doesn't support the feature you need!
The two ways you can sort of get it to work are:
have messages carry a delivery-target datetime in their message_attributes, and have the workers that consume the queue delete and recreate any message that is consumed before its target, with delay = max(0, min(secs_until_target_datetime, 900)); that effectively lets you schedule a message for any arbitrary future time;
or,
(slightly less frequent and less costly:) similarly, if a message isn't due to be handled yet, change its visibility timeout to timeout = max(0, min(secs_until_target_datetime, 43200))
The disadvantage of using visibility timeout is that any read will re-trigger it.
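A rough boto3 sketch of the first workaround (the deliver_at message attribute, queue URL and handle function are assumptions; the second workaround would call sqs.change_message_visibility on the received message instead of re-sending it):

import time
import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # placeholder

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                           MessageAttributeNames=['All'])
for msg in resp.get('Messages', []):
    target = float(msg['MessageAttributes']['deliver_at']['StringValue'])
    remaining = target - time.time()
    if remaining > 0:
        # Not due yet: re-enqueue with up to the 15-minute maximum delay,
        # then delete the copy we just consumed.
        sqs.send_message(QueueUrl=queue_url,
                         MessageBody=msg['Body'],
                         MessageAttributes=msg['MessageAttributes'],
                         DelaySeconds=int(max(0, min(remaining, 900))))
    else:
        handle(msg)  # hypothetical: the message is finally due
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])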
There has been a direct AWS solution possible since 2016-12-01: AWS Step Functions
Each execution can last/idle up to one year, persists the state between transitions, and doesn't cost you any money while it waits.

App Engine: Is it possible to enqueue tasks asynchronously?

Many of my handlers add a task to a task queue to do non-critical background processing. Since this processing isn't critical, if the call to taskqueue.add() throws an exception, my code just ignores it.
Tonight the task queue seemed to be down for around half an hour. Although my handlers correctly ignored the failure, the taskqueue.add() call took about 5 seconds to time out before moving on to processing the rest of the page. This made my site run very slowly.
So, is it possible to enqueue a task asynchronously - meaning a way to add a task, without waiting to see if the addition succeeded?
Alternatively, is there a way to reduce that timeout from 5 seconds down to, say, 1 second?
Thanks.
You can use the new taskqueue methods create_rpc and add_async. If you don't care if the add succeeds, simply call add_async and ignore the result. If you care, but only want to wait 1 second, set the deadline when calling create_rpc, and use the return value as the RPC argument to add_async. Call get_result to find out if the tasks were successfully added.
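As a sketch of what that looks like (queue name and task URL are placeholders):

from google.appengine.api import taskqueue

rpc = taskqueue.create_rpc(deadline=1)  # wait at most 1 second for the add
task = taskqueue.Task(url='/tasks/background-work', params={'key': 'value'})
taskqueue.Queue('default').add_async(task, rpc=rpc)

# ...render the rest of the page here...

try:
    rpc.get_result()  # raises if the add failed; skip this call to fire-and-forget
except taskqueue.Error:
    pass  # non-critical work, so ignore the failure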
I think you can't do anything about it because the RPC call underneath the add method is a synchronous blocking API call.
You could try to add some check using the Capabilities API.
I am pretty sure GAE announced that TQ adds will be async with the next release (experimental feature).
