Making db.put() failsafe - google-app-engine

I would like to make a db.put() operation in my Google App Engine service as resilient as possible, trying to maximize the likelihood of success even in the event of infrastructure issues or overload. What I have come up with at the moment is to catch every possible exception that could occur and to create a task that retries the commit if the first attempt fails:
from google.appengine.ext import db, deferred
from google.appengine.runtime import DeadlineExceededError

try:
    db.put(new_user_record)
except DeadlineExceededError:
    deferred.defer(db.put, new_user_record)
except:
    deferred.defer(db.put, new_user_record)
Does this code trap all possible error paths? Or are there other ways db.put() can fail that would not be caught by this code?
Edit on March 28, 2013 - To clarify when failure is expected
It seems that the answers so far assume that if db.put() fails, it is because the datastore is down. In my experience of running fairly high-workload applications, this is not necessarily the case. Sometimes you run into workload-specific API bottlenecks, and sometimes the slowness of one API causes the request deadline to expire in another. Even though such events have a low frequency, their absolute number can be sizable when traffic is high. These are the situations I am trying to cover.

I wouldn't say this is the best approach - whatever caused the original exception is likely to happen again on the retry. What I would do for extra resilience is first load the record to be saved into memcache, and in the event of an exception from the put (any exception), attempt a fixed number of retries (for example, 3) with a short sleep between attempts. Depending on your application, this could be a synchronous operation, or, using deferred tasks, it could be done asynchronously with the data from memcache.
Finally, I'd actually query for the record in the datastore even when there was no exception, to confirm the row has actually been written.
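A rough sketch of that idea (the helper name, cache key, and retry policy are all illustrative, not a definitive implementation):

import time
from google.appengine.api import memcache
from google.appengine.ext import db

def put_with_retries(entity, cache_key, attempts=3, pause=0.5):
    """Cache the entity first, then retry the put a few times."""
    # Keep a serialized copy around so a deferred task could re-put it later.
    memcache.set(cache_key, db.model_to_protobuf(entity).Encode())
    for attempt in range(attempts):
        try:
            key = entity.put()
            # Paranoid verification that the row is actually readable.
            if db.get(key) is not None:
                memcache.delete(cache_key)
                return key
        except Exception:
            if attempt == attempts - 1:
                raise  # let the caller defer a task using the cached copy
        time.sleep(pause)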

Well, I don't think it is a good idea to try such a fallback at all. If the datastore is down, it's down and you're out of luck (which shouldn't happen frequently :)
Some thoughts on your code:
There are many more exceptions that could be raised during a put operation (such as InternalError, Timeout, CommittedButStillApplying, TransactionFailedError).
Some of them don't mean that the put has failed. (For example, CommittedButStillApplying just means the put operation is delayed.) With your approach, you could end up with that entry twice in the datastore after your deferred call succeeds.
Tasks are limited to ~100KB (total size, not payload). If your payload is close to or above that limit, the deferred API will automatically try to serialize your payload to the datastore in order to keep the task itself below that limit. If the datastore is really unavailable, this will fail, too.
So it's probably better to catch datastore errors and inform your user that the request failed.
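To illustrate, a sketch of put error handling that distinguishes the exceptions named above (the policy shown is an example, not a complete treatment):

from google.appengine.ext import db

def careful_put(entity):
    try:
        entity.put()
    except db.CommittedButStillApplying:
        # The write was committed; it is merely slow to apply.
        # Re-putting it from a deferred task could write it a second time.
        pass
    except (db.Timeout, db.InternalError, db.TransactionFailedError):
        # These genuinely may have failed; retry or report an error
        # to the user instead of blindly deferring.
        raise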

It's all good to retry, but use exponential backoff and, most importantly, proper transactions, so that a failure doesn't end up as a partial write.
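A minimal sketch of that combination on App Engine (the UserRecord model and the retry policy are illustrative); db.run_in_transaction makes the write all-or-nothing, so a retry can never observe a partial write:

import time
from google.appengine.ext import db

class UserRecord(db.Model):  # hypothetical model for illustration
    name = db.StringProperty()

def save_user(name):
    def txn():
        # Everything inside the transaction commits or fails as a unit.
        UserRecord(name=name).put()

    delay = 1
    for attempt in range(5):
        try:
            return db.run_in_transaction(txn)
        except (db.TransactionFailedError, db.Timeout):
            if attempt == 4:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1, 2, 4, 8 seconds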

Related

Handling poison messages in Apache Flink

I am trying to figure out the best practices to deal with poison messages / unhandled exceptions with Apache Flink. We have a Job doing real time event processing of location data from IoT devices. There are two potential scenarios where this can arise:
Data is bad in some way - e.g. invalid value
Data triggers a bug due to some edge case we have not anticipated.
Currently, all my data processing stops because of just one message.
I've seen two suggestions:
Catch the exceptions - this requires me to wrap every piece of logic with something that catches every runtime exception
Use side outputs as a kind of DLQ - from what I can tell this seems to be a variation on #1 where I have to catch all the exceptions and send them to the side output.
Is there really no way to do this other than wrap every piece of logic with exception handling? Is there no generic way to catch exceptions and not have processing continue?
I think the idea is not to catch all kinds of exceptions and send them elsewhere, but rather to have well-tested and functioning code and use dead letters only for invalid inputs.
So a typical pipeline would be
source => validate => ... => sink
              \=> dead letter queue
As soon as your record passes your validate operator, you want all errors to bubble up, as any error in these operators may result in corrupted aggregates and data that - once written - cannot be reverted easily.
The validate step would work with either of the two approaches that you outlined. Typically, side outputs have better semantics, but you may end up with more code; a sketch is shown below.
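As a sketch of the side-output variant, assuming a recent PyFlink (side outputs are supported in the Python DataStream API from Flink 1.16 on); the validation rule and all names here are placeholders:

import json
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment, ProcessFunction, OutputTag

dead_letters = OutputTag("dead-letters", Types.STRING())

class Validate(ProcessFunction):
    def process_element(self, value, ctx):
        try:
            event = json.loads(value)
            if event["lat"] is None:  # whatever "invalid" means for you
                raise ValueError("missing coordinate")
            yield value  # good records continue down the main pipeline
        except (ValueError, KeyError):
            yield dead_letters, value  # bad records branch off here

env = StreamExecutionEnvironment.get_execution_environment()
raw = env.from_collection(['{"lat": 1.0}', "not json"], Types.STRING())
validated = raw.process(Validate(), output_type=Types.STRING())
validated.get_side_output(dead_letters).print()  # a real DLQ sink would go here
validated.print()
env.execute("validate-with-dlq")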
Now, you may have a service with high SLAs that should keep producing output even if some of it is corrupted. Or you have a simple transformation pipeline where you'd rather lose a few events and keep the majority (and downstream systems can deal with incomplete data). Then you are right that you need to wrap the code of the operators in try-catch. However, you'd typically still only do it for the fragile operators and not for all of them - trivial operators should be tested and then trusted to work. Further, you'd usually catch only specific kinds of exceptions, to limit the scope to the failures you actually expect.
You might wonder why Flink doesn't have it incorporated as a default pattern. There are two reasons as far as I can see:
If Flink silently ignores any kind of exception and sends an extra message to a secondary sink, how can Flink ensure that the throwing operator is in a sane state afterwards? How can it avoid any kind of leaks that may happen because cleanup code is not executed?
It's more common in Java to let developers explicitly reason about exceptions and exception handling. It's also not straightforward to see what the requirements are: Do you want to keep only the input? Do you also want to store the exception? What about the operator state that may have influenced the outcome? Should Flink still fail when too many errors have been received in a given time window? It quickly becomes a huge feature for something that should not happen at all in an ideal world where high-quality data is ingested and properly processed.
So while it looks easy for your case because you exactly know which kinds of information you want to store, it's not easy to have a solution for all purposes, especially since the extra code that a user has to write is tiny compared to the generic solution.
What you could do is extract most of the complicated logic into a single ProcessFunction and use side outputs as you have outlined. Since it's a central piece, you'd only need to write the side-output function once. If it's needed in multiple places, you could extract a helper function to which you pass your actual code as a RunnableWithException lambda, hiding all the side-output logic. Make sure you use plenty of finally blocks to ensure a sane state.
I'd also add quite a few integration test cases and use mutation testing to harden your pipeline more quickly. If you keep your test data inline, the mutants may simulate exactly the unexpected data issues you're worried about, making your validate operator more complete.

DynamoDB ConditionalCheckFailedException thrown but succeeds

I think I have seen on many occasions that a DynamoDB conditional put throws ConditionalCheckFailedException but succeeds anyway. Usually in this scenario the request takes quite long (~10s) to finish, but I can see that the item is updated despite the fact that a ConditionalCheckFailedException was thrown (and that the request took a few seconds).
By the way I don't force any timeout on the DDB request.
Is this a bug, or some DDB conditional put contract that I misunderstand? Has anyone experienced this issue?
Answering this late to inform others:
ConditionalCheckFailedException but the item is persisted:
This typically happens when you save an item to DynamoDB and DynamoDB acknowledges the write request, but the response gets lost on the return path, which can happen for multiple reasons - keep in mind that DynamoDB is one of the largest distributed systems in the cloud.
This causes the SDK to exceed its timeout while awaiting the response, which triggers an SDK retry. When the write request is retried, the condition now evaluates to false because the item already exists, which in turn throws a ConditionalCheckFailedException - and causes confusion.
When I receive a ConditionalCheckFailedException, I typically do a strongly consistent GetItem request for the item to ensure it exists with the values I expect, and move on.
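As an illustration with boto3 (the table name, key, and attribute names are placeholders):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("users")  # illustrative table

def create_user(user_id, name):
    try:
        table.put_item(
            Item={"user_id": user_id, "name": name},
            ConditionExpression="attribute_not_exists(user_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # The condition failed: read back with strong consistency to see
        # whether an earlier (retried) write already persisted our data.
        item = table.get_item(Key={"user_id": user_id}, ConsistentRead=True).get("Item")
        if item and item.get("name") == name:
            return  # our write made it after all; treat as success
        raise  # a genuine conflict with someone else's data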

DeadlineExceededException and DataStore/Task Queue Operations

I'm doing some operations that should complete in under 60 seconds, but there may be rare cases where they take longer (though never longer than 10 minutes). The App Engine docs say that if you catch a DeadlineExceededException, you have less than a second to do operations before the request permanently fails. Would this be enough time to add a task to a queue and/or do a datastore write? I assume the safest way would be to add a task / write a datastore entity (async) at the beginning of an operation and remove it from the queue if the operation completes. The latter method would use twice as many API calls, but is it worth it?
I would suggest using the queue as the default for all such operations, so you won't have to implement a fallback to it when you catch a deadline-exceeded error. It is cleaner and easier to maintain, and the user doesn't have to wait for the operation to complete. To achieve this, you can trigger the queue with an AJAX call and retrieve the result in the background. Yes, it's worth it, since it can "guarantee" the window of time you might need.
The runtime environment gives the request handler a little bit of extra time (less than a second) after raising the exception to prepare a custom response, so it is sufficient to add a task to the task queue; see the sketch below.
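As a sketch of that pattern (do_the_work and the /task/resume URL are hypothetical):

from google.appengine.api import taskqueue
from google.appengine.ext import webapp
from google.appengine.runtime import DeadlineExceededError

class Worker(webapp.RequestHandler):
    def post(self):
        try:
            do_the_work(self.request)  # hypothetical long-running operation
        except DeadlineExceededError:
            # Less than a second remains: enqueuing a task is a single
            # cheap RPC, so it fits in that window.
            taskqueue.add(url="/task/resume", payload=self.request.body)
            self.response.set_status(202)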
If you do not want the client to keep polling for a task queue result, I suggest you have a look at the Channel API. It will enable you to implement push notifications to the client.
At the end of your task queue handler, you'll just have to send a notification to the client to let them know that their task has been processed.
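For illustration, that final notification step could be as small as this sketch (assuming the page previously requested a channel token for client_id via channel.create_channel):

from google.appengine.api import channel

def notify_done(client_id):
    # client_id must be the same id the page used when opening its channel.
    channel.send_message(client_id, '{"status": "done"}')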

Is there an elegant way to post messages to AWS SQS with visibility delay of longer than 15 minutes?

In Amazon Web Services, their queues allow you to post messages with a visibility delay up to 15 minutes. What if I don't want messages visible for 6 months?
I'm trying to come up with an elegant solution to the poll/push problem. I can write code to poll the SQS (or a database) every few seconds, check for messages that are ready to be visible, then move them to a "visible queue", or something like that. I wish there was a simpler, more reliable method to have messages become visible in queues far into the future without me having to worry about my polling application working perfectly all the time.
I'm not married to AWS, SQS or any of that, but I'd prefer to find a cloud-friendly solution that is stable, reliable and will trigger an event far into the future without me having to worry about checking on its status every day.
Any thoughts or alternate trees for me to explore barking up are welcome.
Thanks!
It sounds like you might be misunderstanding the visibility delay. Its purpose is to make sure that the polling application doesn't pull the same item off the queue more than once.
In other words, when the item is pulled off the queue, it becomes invisible for a predetermined period of time (the default is 30 seconds; the maximum is 12 hours) in case the polling system has a cluster of machines reading from the queue all at once.
Here's the relevant documentation:
http://docs.amazonwebservices.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/IntroductionArticle.html#AboutVT
...and the sentence in particular that relates to my comment is:
"Immediately after the component receives the message, the message is still in the queue. However, you don't want other components in the system receiving and processing the message again. Therefore, Amazon SQS blocks them with a visibility timeout, which is a period of time during which Amazon SQS prevents other consuming components from receiving and processing that message."
Note, though, that SQS retains a message only for a limited time (the retention period is capped at 14 days), so you cannot achieve a six-month delay by simply leaving an item in the queue.
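To make the visibility-timeout mechanics concrete, here is a small sketch in today's boto3 terms (the queue URL and the process() handler are placeholders):

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # illustrative

# While we hold this message, other consumers won't receive it for 60s.
resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1, VisibilityTimeout=60
)
for msg in resp.get("Messages", []):
    process(msg["Body"])  # hypothetical handler
    # Deleting before the timeout expires is what prevents redelivery.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])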
7 years later, and Amazon still doesn't support the feature you need!
The two ways you can sort of get it to work are:
have messages contain a delivery target datetime in their message_attributes, and have the workers that consume the queue's messages just delete and recreate any message that is consumed before its target, with delay = max(0, min(secs_until_target_datetime, 900)) ; that would allow you to effectively schedule a message for any arbitrary future time;
or,
(slightly less frequent and less costly:) similarly, if a message isn't due to be handled yet, recreate it and change its visibility timeout to timeout = max(0, min(secs_until_target_datetime, 43200))
The disadvantage of using visibility timeout is that any read will re-trigger it.
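As an illustration of the first approach, here is a hedged boto3 sketch; the queue URL, the deliver_at message attribute, and the process() handler are all placeholders:

import time
import boto3

sqs = boto3.client("sqs")
QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # illustrative

def handle(msg):
    target = float(msg["MessageAttributes"]["deliver_at"]["StringValue"])
    remaining = target - time.time()
    if remaining > 0:
        # Not due yet: re-enqueue with as much delay as SQS allows (15 min)
        # and drop the old copy; repeat until the target time arrives.
        sqs.send_message(
            QueueUrl=QUEUE,
            MessageBody=msg["Body"],
            MessageAttributes=msg["MessageAttributes"],
            DelaySeconds=int(min(remaining, 900)),
        )
        sqs.delete_message(QueueUrl=QUEUE, ReceiptHandle=msg["ReceiptHandle"])
    else:
        process(msg["Body"])  # hypothetical real work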
There has been a direct AWS solution possible since 2016-12-01: AWS Step Functions
Each execution can last (or idle) for up to one year, state is persisted between transitions, and it doesn't cost you any money while it waits.
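For illustration, here is a minimal sketch of that pattern in Python with boto3; the state-machine definition (a Wait state driven by a timestamp in the input) would be registered separately via create_state_machine or the console, and all ARNs and field names here are placeholders:

import json
import boto3

# A two-state machine: wait until the timestamp in the input, then act.
definition = {
    "StartAt": "WaitUntilDue",
    "States": {
        "WaitUntilDue": {
            "Type": "Wait",
            "TimestampPath": "$.deliver_at",  # e.g. "2025-06-01T00:00:00Z"
            "Next": "Deliver",
        },
        "Deliver": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:deliver",  # illustrative
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:delayed",  # illustrative
    input=json.dumps({"deliver_at": "2025-06-01T00:00:00Z", "payload": "..."}),
)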

Fail-safe datastore updates on app engine

The app engine datastore, of course, has downtime. However, I'd like to have a "fail-safe" put which is more robust in the face of datastore errors (see motivation below). It seems like the task queue is an obvious place to defer writes when the datastore is unavailable. I don't know of any other solutions though (other than shipping off the data to a third-party via urlfetch).
Motivation: I have an entity which really needs to be put in the datastore - simply showing an error message to the user won't do. For example, perhaps some side effect has taken place which can't easily be undone (perhaps some interaction with a third-party site).
I've come up with a simple wrapper which (I think) provides a reasonable "fail-safe" put (see below). Do you see any problems with this, or have an idea for a more robust implementation? (Note: Thanks to suggestions posted in the answers by Nick Johnson and Saxon Druce, this post has been edited with some improvements to the code.)
import logging
from google.appengine.api.labs.taskqueue import taskqueue
from google.appengine.datastore import entity_pb
from google.appengine.ext import db
from google.appengine.runtime.apiproxy_errors import CapabilityDisabledError

def put_failsafe(e, db_put_deadline=20, retry_countdown=60, queue_name='default'):
    """Tries to e.put(). On success, 1 is returned. If this raises a db.Error
    or CapabilityDisabledError, then a task will be enqueued to try to put the
    entity (the task will execute after retry_countdown seconds) and 2 will be
    returned. If the task cannot be enqueued, then 0 will be returned. Thus a
    falsey value is only returned on complete failure.

    Note that since taskqueue payloads are limited to 10kB, if the protobuf
    representing e is larger than 10kB then the put will be unable to be
    deferred to the taskqueue.

    If a put is deferred to the taskqueue, then it won't necessarily be
    completed as soon as the datastore is back up. Thus it is possible that
    e.put() will occur *after* other, later puts when 1 is returned.

    Ensure e's model is imported in the code which defines the task which
    tries to re-put e (so that e can be deserialized).
    """
    try:
        e.put(rpc=db.create_rpc(deadline=db_put_deadline))
        return 1
    except (db.Error, CapabilityDisabledError), ex1:
        try:
            taskqueue.add(queue_name=queue_name,
                          countdown=retry_countdown,
                          url='/task/retry_put',
                          payload=db.model_to_protobuf(e).Encode())
            logging.info('failed to put to db now, but deferred put to the '
                         'taskqueue e=%s ex=%s' % (e, ex1))
            return 2
        except (taskqueue.Error, CapabilityDisabledError), ex2:
            return 0
Request handler for the task:
from google.appengine.datastore import entity_pb
from google.appengine.ext import db, webapp

# IMPORTANT: This task deserializes entity protobufs. To ensure that this is
#            successful, you must import any db.Model that may need to be
#            deserialized here (otherwise this task may raise a KindError).

class RetryPut(webapp.RequestHandler):
    def post(self):
        e = db.model_from_protobuf(entity_pb.EntityProto(self.request.body))
        e.put()  # failure will raise an exception => the task will be retried
I don't expect to use this for every put - most of the time, showing an error message is just fine. It is tempting to use it for every put, but I think sometimes it might be more confusing for the user if I tell them that their changes will appear later (and continue to show them the old data until the datastore is back up and the deferred puts execute).
Your approach is reasonable, but has several caveats:
By default, a put operation will retry until it runs out of time. Since you have a backup strategy, you may want to give up sooner - in which case you should supply an rpc parameter to the put method call, specifying a custom deadline.
There's no need to set an explicit countdown - the task queue will retry failing operations for you at increasing intervals.
You don't need to use pickle - Protocol Buffers have a natural string encoding which is much more efficient. See this post for a demonstration of how to use it.
As Saxon points out, task queue payloads are limited to 10 kilobytes, so you may have trouble with large entities.
Most importantly, this changes the datastore consistency model from 'strongly consistent' to 'eventually consistent'. That is, the put that you enqueued to the task queue could be applied at any time in the future, overwriting any changes that were made in the interim. Any number of race conditions are possible, essentially rendering transactions useless if there are puts pending on the task queue.
One potential issue is that tasks are limited to 10kb of data, so this won't work if you have an entity which is larger than that once pickled.
