DynamoDB ConditionalCheckFailedException thrown but succeeds - database

I think that have seen in many occasions that a DynamoDB conditional put throws ConditionalCheckFailedException but succeeds. Usually in this scenario, the request takes quite long (~10s) to finish, but I can see that the request is updated despite the fact that a ConditionalCheckFailedException is thrown (and the it took few seconds).
By the way I don't force any timeout on the DDB request.
Is this a bug, or some DDB conditional put contract that I misunderstand? Has anyone experienced this issue?

Answering this late to inform others:
ConditionCheckFailedException but item is persisted:
This typically happens when you save an item to DynamoDB, DynamoDB acknowledges the write request but the response gets lost on the return path which can happen for multiple reasons, keeping in mind that DynamoDB is one of the largest distributed systems in the cloud.
This causes the SDK timeout to exceed while awaiting a response, which then triggers an SDK retry. When the write request is retried, the condition now evaluates to False as the item already exists, which in turn throws a ConditionCheckFailedException, which can cause confusion.
When I receive a ConditionCheckFailedException I typically do a strongly consistent GetItem request for the item to ensure it exists with the values I expect and move on.

Related

DeadlineExceededException and DataStore/Task Queue Operations

I'm doing some operations that should complete under 60 seconds but there may be some rare cases where it takes longer (but will never take longer than 10 minutes). It says in the app engine docs if you catch a DeadlineExceededException you have less than a second to do operations before it permanently fails. Would this be enough time to add a task to a queue and/or do a datastore write? I assume the safest way would be to add a task async/write a datastore entity (async) at the beginning of an operation and remove it from the queue if the operation completes. The latter method would use up twice as many api calls but is it worth it?
I would suggest to use the queue as default for all operations so you won't have to implement the fallback to it if you catch a dead line exceed error. It is more clean and easier to maintain along with the fact that the user doesn't have to wait for the operation to complete. In order to achieve this you can trigger your queue with an ajax call and get the result in the background, so the user will not wait for the operation to complete. Yes it worth's it, since it can "guarantee" the window of time you might need.
The runtime environment gives the request handler a little bit more time (less than a second) after raising the exception to prepare a custom response. so it would be sufficient to add that it into task queue.
If you do not want the client to keep polling for a task queue result, I suggest you have a look at the Channel API. It will enable you to implement push notifications to the client.
At the end of your task queue, you'll just have to send a notification to the client to let him now that is task has been processed.

Making db.put() failsafe

I would like to make a db.put() operation in my Google App Engine service as resilient as possible, trying to maximize the likelihood of success even in the event of infrastructure issues or overload. What I have come up with at the moment is to catch every possible exception that could occur and to create a task that retries the commit if the first attempt fails:
try:
db.put(new_user_record)
except DeadlineExceededError:
deferred.defer(db.put,new_user_record)
except:
deferred.defer(db.put,new_user_record)
Does this code trap all possible error paths? Or are there other ways db.put() can fail that would not by caught by this code?
Edit on March 28, 2013 - To clarify when failure is expected
It seems that the answers so far assume that if db.put() fails then it is because the datastore is down. In my experience of having run fairly high-workload applications this is not necessarily a requirement. Sometimes you run into workload-specific API bottlenecks, sometimes the slowness of one API causes the request deadline to expire in another. Even though such events have a low frequency, their number can be sizable if traffic is high. These are the situations I am trying to cover.
I wouldn't say this is the best approach - whatever caused the original exception is just likely to happen again. What I would do for extra resilience is first load the record to be saved into memcache and in the event of an exception with the put (any exception) it could attempt a certain number of retries (for example 3) with a short sleep between each attempt. Depending on your application this could be either a synchronous operation or using deferred tasks it could be done asynchronously using the data in memcache.
Finally I'd actually do a query on the record in the data store even if there wasn't an exception to confirm the row has actually been written.
Well, i don't think that it is a good idea to try such a fallback at all. If the datastore is down, its down and youre out of luck (shouldn't happen frequently :)
Some thoughts to your code:
There are way more exceptions that could be raised during a put-opertation (like InternalError, Timeout, CommittedButStillApplying, TransactionFailedError)
Some of them don't mean that the put has failed. (ie. CommittedButStillApplying just means the put-operation is delayed). With your approach, you would end up having that entry twice in the datastore after your deferred call succeeds.
Tasks are limited to ~100KB (total size, not payload). If your payload is close to or above that limit, the deferred-api will automatically try to
serialize your payload to the datastore in order to keep the task itself below that limit. If the datastore is really unavailable, this will fail, too.
So its probably better to catch datastore errors, and inform your user that his request failed.
Its all good to retry, however use exponential backoff and most important proper transaction use so that fail xoesnt end up o a partial write.

Is there an elegant way to post messages to AWS SQS with visibility delay of longer than 15 minutes?

In Amazon Web Services, their queues allow you to post messages with a visibility delay up to 15 minutes. What if I don't want messages visible for 6 months?
I'm trying to come up with an elegant solution to the poll/push problem. I can write code to poll the SQS (or a database) every few seconds, check for messages that are ready to be visible, then move them to a "visible queue", or something like that. I wish there was a simpler, more reliable method to have messages become visible in queues far into the future without me having to worry about my polling application working perfectly all the time.
I'm not married to AWS, SQS or any of that, but I'd prefer to find a cloud-friendly solution that is stable, reliable and will trigger an event far into the future without me having to worry about checking on its status every day.
Any thoughts or alternate trees for me to explore barking up are welcome.
Thanks!
It sounds like you might be misunderstanding the visibility delay. Its purpose is to make sure that the polling application doesn't pull the same item off the queue more than once.
In other words, when the item is pulled off the queue it becomes invisible for a predetermined period of time (default is 30 seconds, max is 15 minutes) in case the polling system has a cluster of machines reading from the queue all at once.
Here's the relevant documentation:
http://docs.amazonwebservices.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/IntroductionArticle.html#AboutVT
...and the sentence in particular that relates to my comment is:
"Immediately after the component receives the message, the message is still in the queue. However, you don't want other components in the system receiving and processing the message again. Therefore, Amazon SQS blocks them with a visibility timeout, which is a period of time during which Amazon SQS prevents other consuming components from receiving and processing that message."
You should be able to use SQS for your purpose since you can leave an item in the queue for as long as you want.
7 years later, and Amazon still doesn't support the feature you need!
The two ways you can sort of get it to work are:
have messages contain a delivery target datetime in their message_attributes, and have the workers that consume the queue's messages just delete and recreate any message that is consumed before its target, with delay = max(0, min(secs_until_target_datetime, 900)) ; that would allow you to effectively schedule a message for any arbitrary future time;
or,
(slightly less frequent and costly:) similarly, if a message isn't due to be handled yet, recreate it and change its visibility timeout to be timeout = max(0, min(secs_until_target_datetime, 43200))
The disadvantage of using visibility timeout is that any read will re-trigger it.
There has been a direct AWS solution possible since 2016-12-01: AWS Step Functions
Each execution can last/idle up to one year, persists the state between transitions, and doesn't cost you any money while it waits.

Fail-safe datastore updates on app engine

The app engine datastore, of course, has downtime. However, I'd like to have a "fail-safe" put which is more robust in the face of datastore errors (see motivation below). It seems like the task queue is an obvious place to defer writes when the datastore is unavailable. I don't know of any other solutions though (other than shipping off the data to a third-party via urlfetch).
Motivation: I have an entity which really needs to be put in the datastore - simply showing an error message to the user won't do. For example, perhaps some side effect has taken place which can't easily be undone (perhaps some interaction with a third-party site).
I've come up with a simple wrapper which (I think) provides a reasonable "fail-safe" put (see below). Do you see any problems with this, or have an idea for a more robust implementation? (Note: Thanks to suggestions posted in the answers by Nick Johnson and Saxon Druce, this post has been edited with some improvements to the code.)
import logging
from google.appengine.api.labs.taskqueue import taskqueue
from google.appengine.datastore import entity_pb
from google.appengine.ext import db
from google.appengine.runtime.apiproxy_errors import CapabilityDisabledError
def put_failsafe(e, db_put_deadline=20, retry_countdown=60, queue_name='default'):
"""Tries to e.put(). On success, 1 is returned. If this raises a db.Error
or CapabilityDisabledError, then a task will be enqueued to try to put the
entity (the task will execute after retry_countdown seconds) and 2 will be
returned. If the task cannot be enqueued, then 0 will be returned. Thus a
falsey value is only returned on complete failure.
Note that since the taskqueue payloads are limited to 10kB, if the protobuf
representing e is larger than 10kB then the put will be unable to be
deferred to the taskqueue.
If a put is deferred to the taskqueue, then it won't necessarily be
completed as soon as the datastore is back up. Thus it is possible that
e.put() will occur *after* other, later puts when 1 is returned.
Ensure e's model is imported in the code which defines the task which tries
to re-put e (so that e can be deserialized).
"""
try:
e.put(rpc=db.create_rpc(deadline=db_put_deadline))
return 1
except (db.Error, CapabilityDisabledError), ex1:
try:
taskqueue.add(queue_name=queue_name,
countdown=retry_countdown,
url='/task/retry_put',
payload=db.model_to_protobuf(e).Encode())
logging.info('failed to put to db now, but deferred put to the taskqueue e=%s ex=%s' % (e, ex1))
return 2
except (taskqueue.Error, CapabilityDisabledError), ex2:
return 0
Request handler for the task:
from google.appengine.ext import db, webapp
# IMPORTANT: This task deserializes entity protobufs. To ensure that this is
# successful, you must import any db.Model that may need to be
# deserialized here (otherwise this task may raise a KindError).
class RetryPut(webapp.RequestHandler):
def post(self):
e = db.model_from_protobuf(entity_pb.EntityProto(self.request.body))
e.put() # failure will raise an exception => the task to be retried
I don't expect to use this for every put - most of the time, showing an error message is just fine. It is tempting to use it for every put, but I think sometimes it might be more confusing for the user if I tell them that their changes will appear later (and continue to show them the old data until the datastore is back up and the deferred puts execute).
Your approach is reasonable, but has several caveats:
By default, a put operation will retry until it runs out of time. Since you have a backup strategy, you may want to give up sooner - in which case you should supply an rpc parameter to the put method call, specifying a custom deadline.
There's no need to set an explicit countdown - the task queue will retry failing operations for you at increasing intervals.
You don't need to use pickle - Protocol Buffers have a natural string encoding which is much more efficient. See this post for a demonstration of how to use it.
As Saxon points out, task queue payloads are limited to 10 kilobytes, so you may have trouble with large entities.
Most importantly, this changes the datastore consistency model from 'strongly consistent' to 'eventually consistent'. That is, the put that you enqueued to the task queue could be applied at any time in the future, overwriting any changes that were made in the interim. Any number of race conditions are possible, essentially rendering transactions useless if there are puts pending on the task queue.
One potential issue is that tasks are limited to 10kb of data, so this won't work if you have an entity which is larger than that once pickled.

How to abandon a long-running search in System.DirectoryServices.Protocols

I've been trying to work out how to cancel a long-running AD search in System.DirectoryServices.Protocols. Can anyone help?
I've looked at the supportControl/supportedCapabilities attributes on RootDSE and they don't contain the 1.3.6.1.1.8 OID so I think that means it doesn't support the LDAP CANCEL extended operation as defined here: https://www.rfc-editor.org/rfc/rfc3909
That leaves the original LDAP ABANDON command (see here for list). But there doesn't seem to be a matching DirectoryRequest Class.
Anyone have any ideas?
I think I've found my answer: whilst I was reading around your suggestion, Martin, I came across the Abort method on the LdapConnection class. I didn't expect to find it there: starting out from the LDAP documentation I'd expected to find it as just another LDAPMessage but the MS guys seem to have treated it as a special case. If anyone is familiar with a non-MS implementation of LDAP and can comment on whether the MS approach is typical, I'd appreciate it to improve my understanding.
I think, but I'm not positive, there is no asynch query with a cancel. It has an asynch property but it's to allow a collection to be filled, nothing to do with cancelling. The best I can offer is to put your query in a background worker thread and put an asynch callback that will deal with the answer when it comes back. If the user decides to cancel, you can just cancel the background worker thread. You'll free your app up, even if you haven't freed the ldap server up until it finishes it's query. You can find info on background worker threads at http://www.c-sharpcorner.com/UploadFile/LivMic/BGWorker07032007000515AM/BGWorker.aspx
Don't forget to call .Dispose() when cleaning up your active directory objects to prevent memory leaks.
If the query will produce many data also, you can abandon them through paging. Specify a PageResultRequestControl option in the query, giving a fairly low page size (IIUC, 1000 is the default page size). IIUC, you'll send new requests every time you got a page (passing cookies from one response into the next request). When you choose to cancel the query, send another request with zero expected results.

Resources