Fail-safe datastore updates on app engine - google-app-engine

The app engine datastore, of course, has downtime. However, I'd like to have a "fail-safe" put which is more robust in the face of datastore errors (see motivation below). It seems like the task queue is an obvious place to defer writes when the datastore is unavailable. I don't know of any other solutions though (other than shipping off the data to a third-party via urlfetch).
Motivation: I have an entity which really needs to be put in the datastore - simply showing an error message to the user won't do. For example, perhaps some side effect has taken place which can't easily be undone (perhaps some interaction with a third-party site).
I've come up with a simple wrapper which (I think) provides a reasonable "fail-safe" put (see below). Do you see any problems with this, or have an idea for a more robust implementation? (Note: Thanks to suggestions posted in the answers by Nick Johnson and Saxon Druce, this post has been edited with some improvements to the code.)
import logging

from google.appengine.api.labs.taskqueue import taskqueue
from google.appengine.datastore import entity_pb
from google.appengine.ext import db
from google.appengine.runtime.apiproxy_errors import CapabilityDisabledError

def put_failsafe(e, db_put_deadline=20, retry_countdown=60, queue_name='default'):
    """Tries to e.put(). On success, 1 is returned. If this raises a db.Error
    or CapabilityDisabledError, then a task will be enqueued to try to put the
    entity (the task will execute after retry_countdown seconds) and 2 will be
    returned. If the task cannot be enqueued, then 0 will be returned. Thus a
    falsey value is only returned on complete failure.

    Note that since taskqueue payloads are limited to 10kB, if the protobuf
    representing e is larger than 10kB then the put will be unable to be
    deferred to the taskqueue.

    If a put is deferred to the taskqueue, then it won't necessarily be
    completed as soon as the datastore is back up. Thus it is possible that
    e.put() will occur *after* other, later puts when 2 is returned.

    Ensure e's model is imported in the code which defines the task which tries
    to re-put e (so that e can be deserialized).
    """
    try:
        e.put(rpc=db.create_rpc(deadline=db_put_deadline))
        return 1
    except (db.Error, CapabilityDisabledError), ex1:
        try:
            taskqueue.add(queue_name=queue_name,
                          countdown=retry_countdown,
                          url='/task/retry_put',
                          payload=db.model_to_protobuf(e).Encode())
            logging.info('failed to put to db now, but deferred put to the '
                         'taskqueue e=%s ex=%s' % (e, ex1))
            return 2
        except (taskqueue.Error, CapabilityDisabledError), ex2:
            return 0
Request handler for the task:
from google.appengine.datastore import entity_pb
from google.appengine.ext import db, webapp

# IMPORTANT: This task deserializes entity protobufs. To ensure that this is
# successful, you must import any db.Model that may need to be
# deserialized here (otherwise this task may raise a KindError).

class RetryPut(webapp.RequestHandler):
    def post(self):
        e = db.model_from_protobuf(entity_pb.EntityProto(self.request.body))
        e.put()  # failure will raise an exception => the task will be retried
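For context, a call site might look something like the sketch below. The Receipt model and CompletePurchase handler are made-up names; the point is just how the three return values of put_failsafe could be handled:

import logging

from google.appengine.api import users
from google.appengine.ext import db, webapp

class Receipt(db.Model):  # hypothetical model for an irreversible purchase
    user = db.UserProperty()
    order_id = db.StringProperty()

class CompletePurchase(webapp.RequestHandler):
    def post(self):
        # ... irreversible interaction with the third-party site happens here ...
        receipt = Receipt(user=users.get_current_user(),
                          order_id=self.request.get('order_id'))
        status = put_failsafe(receipt)
        if status == 1:
            self.response.out.write('Saved.')
        elif status == 2:
            self.response.out.write('Saved; it may take a few minutes to show up.')
        else:  # 0 - neither the put nor the deferred task worked
            logging.error('could not save or defer receipt %s', receipt)
            self.error(500)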
I don't expect to use this for every put - most of the time, showing an error message is just fine. It is tempting to use it for every put, but I think sometimes it might be more confusing for the user if I tell them that their changes will appear later (and continue to show them the old data until the datastore is back up and the deferred puts execute).

Your approach is reasonable, but has several caveats:
By default, a put operation will retry until it runs out of time. Since you have a backup strategy, you may want to give up sooner - in which case you should supply an rpc parameter to the put method call, specifying a custom deadline.
There's no need to set an explicit countdown - the task queue will retry failing operations for you at increasing intervals.
You don't need to use pickle - Protocol Buffers have a natural string encoding which is much more efficient. See this post for a demonstration of how to use it.
As Saxon points out, task queue payloads are limited to 10 kilobytes, so you may have trouble with large entities.
Most importantly, this changes the datastore consistency model from 'strongly consistent' to 'eventually consistent'. That is, the put that you enqueued to the task queue could be applied at any time in the future, overwriting any changes that were made in the interim. Any number of race conditions are possible, essentially rendering transactions useless if there are puts pending on the task queue.

One potential issue is that tasks are limited to 10kb of data, so this won't work if you have an entity which is larger than that once pickled.
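If entity size is a concern, one option (not from the original answers, just a sketch) is to check the encoded size up front, so the caller knows whether a deferred retry is even possible:

from google.appengine.ext import db

MAX_TASK_PAYLOAD_BYTES = 10 * 1024  # task queue payload limit discussed above

def can_defer_put(entity):
    """Return True if the entity's protobuf encoding fits in a task payload."""
    payload = db.model_to_protobuf(entity).Encode()
    return len(payload) <= MAX_TASK_PAYLOAD_BYTES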

Related

DynamoDB ConditionalCheckFailedException thrown but succeeds

I have seen on many occasions that a DynamoDB conditional put throws ConditionalCheckFailedException but actually succeeds. Usually in this scenario the request takes quite long (~10s) to finish, yet I can see that the item was updated despite the ConditionalCheckFailedException being thrown (and despite the call taking several seconds).
By the way, I don't force any timeout on the DDB request.
Is this a bug, or some DDB conditional put contract that I misunderstand? Has anyone experienced this issue?
Answering this late to inform others:
ConditionalCheckFailedException but the item is persisted:
This typically happens when you save an item to DynamoDB and DynamoDB acknowledges the write request, but the response gets lost on the return path, which can happen for multiple reasons (keep in mind that DynamoDB is one of the largest distributed systems in the cloud).
This causes the SDK's timeout to be exceeded while awaiting a response, which then triggers an SDK retry. When the write request is retried, the condition now evaluates to false because the item already exists, which in turn throws a ConditionalCheckFailedException and can cause confusion.
When I receive a ConditionalCheckFailedException I typically do a strongly consistent GetItem request for the item to ensure it exists with the values I expect, and move on.
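A rough sketch of that check with boto3; the 'orders' table, 'id' key and 'token' attribute are made-up names, so adapt the comparison to whatever "the values I expect" means for your item:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('orders')  # hypothetical table

def create_order(order):
    try:
        table.put_item(
            Item=order,
            ConditionExpression='attribute_not_exists(id)',  # only create if new
        )
    except ClientError as e:
        if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise
        # The condition failed: this may be a retry of a write that actually
        # landed but whose response was lost. Verify with a consistent read.
        existing = table.get_item(Key={'id': order['id']},
                                  ConsistentRead=True).get('Item')
        # Compare whichever attributes tell you the earlier write was yours.
        if existing is not None and existing.get('token') == order.get('token'):
            return  # our earlier write went through; treat as success
        raise  # a genuinely conflicting item exists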

Making db.put() failsafe

I would like to make a db.put() operation in my Google App Engine service as resilient as possible, trying to maximize the likelihood of success even in the event of infrastructure issues or overload. What I have come up with at the moment is to catch every possible exception that could occur and to create a task that retries the commit if the first attempt fails:
from google.appengine.ext import db, deferred
from google.appengine.runtime import DeadlineExceededError

try:
    db.put(new_user_record)
except DeadlineExceededError:
    deferred.defer(db.put, new_user_record)
except:
    deferred.defer(db.put, new_user_record)
Does this code trap all possible error paths? Or are there other ways db.put() can fail that would not be caught by this code?
Edit on March 28, 2013 - To clarify when failure is expected
It seems that the answers so far assume that if db.put() fails, it is because the datastore is down. In my experience of running fairly high-workload applications, this is not necessarily the case. Sometimes you run into workload-specific API bottlenecks, and sometimes the slowness of one API causes the request deadline to expire in another. Even though such events have a low frequency, their number can be sizable if traffic is high. These are the situations I am trying to cover.
I wouldn't say this is the best approach - whatever caused the original exception is just as likely to happen again. What I would do for extra resilience is first load the record to be saved into memcache, and in the event of an exception from the put (any exception) attempt a certain number of retries (for example, three) with a short sleep between each attempt. Depending on your application, this could be either a synchronous operation, or it could be done asynchronously with deferred tasks using the data in memcache.
Finally, I'd actually do a query on the record in the datastore even if there wasn't an exception, to confirm the row has actually been written.
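A rough sketch of that idea; the function name, memcache key and retry counts are illustrative rather than taken from the answer:

import time

from google.appengine.api import memcache
from google.appengine.ext import db, deferred

def resilient_put(entity, cache_key, retries=3, sleep_seconds=1):
    """Stage the entity in memcache, retry the put a few times with short
    sleeps, verify with a read, and fall back to a deferred task."""
    # Stage a serialized copy so a background task could pick it up later.
    memcache.set(cache_key, db.model_to_protobuf(entity).Encode())
    for attempt in range(retries):
        try:
            key = entity.put()
            if db.get(key) is not None:  # confirm the row was actually written
                memcache.delete(cache_key)
                return key
        except Exception:
            if attempt == retries - 1:
                # All synchronous attempts failed; retry in the background.
                deferred.defer(db.put, entity)
                raise
        time.sleep(sleep_seconds)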
Well, I don't think it is a good idea to attempt such a fallback at all. If the datastore is down, it's down and you're out of luck (it shouldn't happen frequently :)
Some thoughts on your code:
There are many more exceptions that could be raised during a put operation (like InternalError, Timeout, CommittedButStillApplying, TransactionFailedError).
Some of them don't mean that the put has failed (e.g. CommittedButStillApplying just means the put operation is delayed). With your approach, you would end up having that entry twice in the datastore after your deferred call succeeds.
Tasks are limited to ~100KB (total size, not payload). If your payload is close to or above that limit, the deferred API will automatically try to serialize your payload to the datastore in order to keep the task itself below that limit. If the datastore is really unavailable, this will fail too.
So it's probably better to catch datastore errors and inform your user that their request failed.
It's all well and good to retry, but use exponential backoff and, most importantly, proper transactions so that a failure doesn't end up as a partial write (see the sketch below).
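A minimal sketch of that advice, assuming the entities share an entity group so a single transaction covers them (function name and retry parameters are made up):

import time

from google.appengine.ext import db

def transactional_put_with_backoff(entities, attempts=4, initial_delay=0.5):
    """Retry an all-or-nothing put with exponential backoff; a failed attempt
    never leaves a partial write behind."""
    def txn():
        db.put(entities)  # committed atomically inside the transaction
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return db.run_in_transaction(txn)
        except db.Error:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff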

How to prevent ndb from batching a put_async() call and make it issue the RPC immediately?

I have a request handler that updates an entity, saves it to the datastore, then needs to perform some additional work before returning (like queuing a background task and json-serializing some results). I want to parallelize this code, so that the additional work is done while the entity is being saved.
Here's what my handler code boils down to:
class FooHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def post(self):
        foo = yield Foo.get_by_id_async(some_id)
        # Do some work with foo

        # Don't yield, as I want to perform the code that follows
        # while foo is being saved to the datastore.
        # I'm in a toplevel, so the handler will not exit as long as
        # this async request is not finished.
        foo.put_async()

        taskqueue.add(...)
        json_result = generate_result()
        self.response.headers["Content-Type"] = "application/json; charset=UTF-8"
        self.response.write(json_result)
However, Appstats shows that the datastore.Put RPC is being done serially, after taskqueue.Add:
A little digging around in ndb.context.py shows that a put_async() call ends up being added to an AutoBatcher instead of the RPC being issued immediately.
So I presume that the _put_batcher ends up being flushed when the toplevel waits for all async calls to be complete.
I understand that batching puts has real benefits in certain scenarios, but in my case here I really want the put RPC to be sent immediately, so I can perform other work while the entity is being saved.
If I do yield foo.put_async(), then I get the same waterfall in Appstats, but with datastore.Put being done before the rest:
This is to be expected, as yield makes my handler wait for the put_async() call to complete before executing the rest of the code.
I also have tried adding a call to ndb.get_context().flush() right after foo.put_async(), but the datastore.Put and taskqueue.BulkAdd calls are still not being made in parallel according to Appstats.
So my question is: how can I force the call to put_async() to bypass the auto batcher and issue the RPC immediately?
There's no supported way to do it. Maybe there should be. Can you try if this works?
loop = ndb.eventloop.get_event_loop()
while loop.run_idle():
    pass
You may have to look at the source code of ndb/eventloop.py to see what else you could try -- basically you want to try most of what run0() does except waiting for RPCs. In particular, it's possible that you would have to do this:
while loop.current:
    loop.run0()
while loop.run_idle():
    pass
(This still isn't supported, because there are other conditions you may have to handle too, but those don't seem to occur in your example.)
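Wrapped up as a helper (still unsupported, as noted above; ndb.eventloop is an internal module and this may break with SDK changes):

from google.appengine.ext import ndb

def send_pending_rpcs():
    """Drain ndb's ready callbacks so the batched Put RPC is issued now,
    without blocking until it completes."""
    loop = ndb.eventloop.get_event_loop()
    while loop.current:
        loop.run0()        # run ready callbacks; this flushes the auto-batcher
    while loop.run_idle():
        pass               # let idlers run without waiting on RPC results

# In the handler above, the idea would be:
#     foo.put_async()
#     send_pending_rpcs()
#     taskqueue.add(...)   # now overlaps with the datastore.Put RPC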
Try this, I'm not 100% certain it will help:
foo = yield Foo.get_by_id_async(some_id)
future = foo.put_async()
future.done()
ndb requests get put into the autobatcher, and the batch is only sent over RPC when you need a result. Since you don't need the result of foo.put_async(), it doesn't get sent until you make another ndb call (you don't) or until the @ndb.toplevel ends.
Calling future.done() does not block, but I'm guessing it might trigger the request.
Another thing to try to force the operation is:
ndb.get_context().flush()

NDB tasklets not visible in appstats

Why is it that the line numbers from methods decorated with @ndb.tasklet are not present in appstats?
In our app we have a convention to include both a synchronous and an asynchronous version of functions, something like:
def do_something(self, param=None):
    return self.do_something_async(param=param).get_result()

@ndb.tasklet
def do_something_async(self, param=None):
    stuff = yield self.do_something_else_async(stuff=param)
    # ...
    raise ndb.Return(stuff)
…but even after setting appengine_config.appstats_MAX_STACK to something huge, and emptying appengine_config.appstats_RE_STACK_SKIP, the reports in appstats still leave my application code the first time some_tasklet.get_result() is called.
Here's an example from appstats:
The expanded stack frame at learn.get_list_of_cards_to_learn() simply returns self.get_list_of_cards_to_learn_async().get_result(), which is a tasklet that in turn calls a bunch of other tasklets. However, none of those tasklets are visible in appstats; all I see is ndb internals.
I'm not sure how exactly ndb executes those decorators, but if I put a pdb trace in one of them and run my test suite, I can see the stack frames all the way down to the pdb line I put in the tasklet, so I don't understand why that is not there in appstats.
Some of the requests cause a large amount of RPC calls, but I'm not sure how to figure out which part of my app is making them, as I cannot trace it past the first tasklet in appstats.
Is there something maybe I need to fine-tune in appengine_config?
This has to do with the way tasklets are managed by NDB's scheduler. There's not much you can do about it.

Is the execution of a queued task always guaranteed on GAE?

Is the following simple pattern enough to ensure that the task sequence never stops, even after application updates or hard, 'erratic' Google failures?
def do_work():
    ... ....
    deferred.defer(do_work, _countdown=..in 7 days..)
Can I schedule such a self-scheduling worker and never look back?
Two answers:
Yes, tasks will eventually execute, and will also be retried if execution fails. The retry options are set when you define the task (see the sketch below).
No, the task queue is not a scheduler, so you cannot schedule a task to run at a certain time. Tasks put into a task queue are served immediately, in FIFO fashion.
As @Jesse noted, for scheduling jobs you should look into GAE cron.
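For the retry options mentioned in the first point, a task can be given its own retry policy when it is added (the URL below is hypothetical); per-queue defaults can also be configured in queue.yaml:

from google.appengine.api import taskqueue

# Sketch: limit retries and control backoff for an individual push task.
taskqueue.add(
    url='/task/do_work',  # hypothetical task handler
    retry_options=taskqueue.TaskRetryOptions(
        task_retry_limit=5,        # stop retrying after 5 attempts
        min_backoff_seconds=10,
        max_backoff_seconds=300,
    ),
)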
If a task is queued successfully, it will eventually execute. (And App Engine will keep trying for as long as it takes.)
The pattern you show might be better implemented using cron jobs, though, which run a task on a regular basis. A common pattern I use is to have a daily cron job kick off a task on a task queue with a small number of retries (so that if there's a temporary glitch, it will retry immediately).
If you do want to use the method above, rather than cron, there's another thing to worry about: since your method can be retried due to it failing or other system issues (e.g. the instance running it going down) you should make sure that you don't end up with two tasks. Imagine if it ran, registered the next task and then the node went down; App Engine would retry, starting a second task. To prevent this, you could use the data store (in a transaction) to test and see if the next task has already been enqueued. Something like:
def do_work(counter):
    ...

    @db.transactional
    def start_next():
        # fetch myModel from the data store here
        if myModel.counter == counter:
            return  # already started next job
        myModel.counter = counter
        myModel.put()
        deferred.defer(do_work, counter + 1, _transactional=True, _countdown=...)

    start_next()
Note the "transactional" argument in the defer call; this ensures that the MyModel instance will be updated if and only if the next task is enqueued.
You might also want to look into sending an email to an administrator after a certain number of failed retries. (You can find this in the request HTTP headers, but you can't use the deferred library if you want to do this; you have to use the task queue API directly.)
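A sketch of that last suggestion: read the retry-count header App Engine adds to task requests, and alert an administrator once a threshold is passed (the handler name, sender address and threshold are made up):

from google.appengine.api import mail
from google.appengine.ext import webapp

MAX_SILENT_RETRIES = 5  # hypothetical threshold

class DoWorkHandler(webapp.RequestHandler):
    def post(self):
        # App Engine adds this header to task queue requests.
        retries = int(self.request.headers.get('X-AppEngine-TaskRetryCount', 0))
        if retries > MAX_SILENT_RETRIES:
            mail.send_mail_to_admins(
                sender='alerts@example.com',  # must be an authorized sender
                subject='Task still failing after %d retries' % retries,
                body='The recurring task keeps failing; please investigate.')
        # ... do the actual work; an uncaught exception triggers another retry ...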