I have a counter in my app where I expect that 99% of the time there will not be contention issues in updating the counter with transactions.
To handle the 1% times when it is busy, I was thinking of updating the counter by using transactions within deferred tasks as follows:
from google.appengine.ext import deferred, ndb

def update_counter(my_key):
    # enqueue the increment; a contention failure just causes a task retry
    deferred.defer(update_counter_transaction, my_key)

@ndb.transactional
def update_counter_transaction(my_key):
    x = my_key.get()
    x.n += 1
    x.put()
For the occasional instances when contention causes the transaction to fail, the task will be retried.
I'm familiar with sharded counters but this seems easier and suited to my situation.
Is there anything I am missing that might cause this solution to not work well?
A problem may exist with the automatic task retries, which at least in theory may happen for reasons other than transaction collisions on the intended counter increments. If such an undesired retry successfully re-executes the counter increment code, the counter value will be thrown off (higher than expected). That might or might not be acceptable for your app, depending on how the counter is used.
Here's an example of an undesired deferred task invocation: GAE deferred task retried due to "instance unavailable" despite having already succeeded
The answer to that question seems in line with this note in the regular task queue documentation (I saw no such note in the deferred task queues article, but I'd keep the possibility in mind):
Note that task names do not provide an absolute guarantee of once-only semantics. In extremely rare cases, multiple calls to create a task of the same name may succeed, but in this event, only one of the tasks would be executed. It's also possible in exceptional cases for a task to run more than once.
From this perspective it might actually be better to keep the counter increment together with the rest of the related logical/transactional operations (if any) than to isolate it as a separate transaction on a task queue.
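If an occasional over-count is not acceptable, one possible mitigation (my own sketch, not something from the docs; Counter, IncrementMarker and op_id are made-up names) is to write a marker entity in the same transaction, keyed by a unique id generated when the task is deferred, so a re-executed task can detect that its increment was already applied:

    from google.appengine.ext import ndb

    class Counter(ndb.Model):
        n = ndb.IntegerProperty(default=0)

    class IncrementMarker(ndb.Model):
        pass  # existence alone means "this op_id was already applied"

    @ndb.transactional
    def idempotent_increment(counter_key, op_id):
        # The marker lives in the counter's entity group, so it commits
        # atomically with the counter update.
        marker_key = ndb.Key(IncrementMarker, op_id, parent=counter_key)
        if marker_key.get() is not None:
            return  # a retried task already applied this increment
        counter = counter_key.get()
        counter.n += 1
        ndb.put_multi([counter, IncrementMarker(key=marker_key)])

The caller would generate the id once (e.g. op_id = uuid.uuid4().hex) and pass it to deferred.defer(idempotent_increment, counter_key, op_id), so every retry of that task carries the same id.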
Related
I have a bit of a strange problem. I have a module running on GAE that puts a whole lot of little tasks on the default task queue. The tasks access the same ndb models. Each task accesses a bunch of data from a few different tables, then calls put.
The first few tasks work fine but as time continues I start getting these on the final put:
suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)
So I wrapped the put in a try with a randomised timeout so it retries a couple of times. This mitigated the problem a little; it just happens later on.
Here is some pseudocode for my task:
def my_task(request):
    stuff = get_ndb_instances()  # this accesses a few things from different tables
    better_stuff = process(stuff)  # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    if iCountdown < 1:
        return oInstance.put()
    try:
        return oInstance.put()
    except:
        import time
        import random
        logger.info("sleeping")
        time.sleep(random.random() * 20)
        return try_put(oInstance, iCountdown - 1)
Without using try_put the queue gets about 30% of the way through until it stops working. With the try_put it gets further, like 60%.
Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.
EDIT:
There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a whole lot of tasks running simultaneously, and they access ndb in a way that can cause contention. If contention is detected, a randomly timed retry happens, and this eliminates the contention perfectly well, for a little while. Tasks keep running and completing, and the more that successfully return, the more contention happens, even though the processes using the contended data should be finished. Is there something going on that's holding onto datastore handles that shouldn't be? What's going on?
EDIT2:
Here is a little bit about the key structures in play:
My ndb models sit in a hierarchy like this (the direction of the arrows specifies parent-child relationships, i.e. a Type has a bunch of child Instances, etc.):
Type->Instance->Position
The ids of the Positions are limited to a few different names; there are many thousands of Instances and not many Types.
I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in an obvious way) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.
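To make that concrete, here's roughly what the hierarchy looks like in ndb (model properties and ids here are simplified placeholders, not my real code):

    from google.appengine.ext import ndb

    class Type(ndb.Model):
        pass

    class Instance(ndb.Model):
        pass

    class Position(ndb.Model):
        value = ndb.FloatProperty()  # placeholder property

    # Keys are chained through parents: Type -> Instance -> Position.
    type_key = ndb.Key(Type, 'some-type')
    instance_key = ndb.Key(Instance, 12345, parent=type_key)
    position_key = ndb.Key(Position, 'north', parent=instance_key)

    # The entity group is defined by the root ancestor (the Type), so every
    # Instance and Position under one Type lives in a single entity group.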
Contention will get worse over time if you continually exceed the recommended limit of about 1 write/transaction per entity group per second. The answer lies in how Megastore/Paxos work and how Cloud Datastore handles contention in the backend.
When 2 writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and will retry the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.
If sustained writes above the recommended limit are being attempted, the chance that a transaction needs to be retried multiple times increases. The number of transactions in an internal retry state also increases. Eventually, transactions will start reaching our internal retry limit and will return a contention error to the client.
A randomized sleep is not the right way to handle error responses. You should instead look into exponential back-off with jitter (example).
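A minimal sketch of what that could look like as a drop-in alternative to your try_put (the base and cap values here are arbitrary choices, not recommendations):

    import random
    import time

    from google.appengine.api import datastore_errors

    def put_with_backoff(entity, max_attempts=10, base=0.1, cap=30.0):
        for attempt in range(max_attempts):
            try:
                return entity.put()
            except datastore_errors.TransactionFailedError:
                if attempt == max_attempts - 1:
                    raise
                # sleep a random amount, bounded by an exponentially growing cap
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))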
Similarly, the core of your problem is a high write rate into a single entity group. You should look into whether the explicit parenting is required (removing it if not), or whether you should shard the entity group in some manner that makes sense for your queries and consistency requirements.
Today while browsing the source I noticed this comment in Pipeline.start method:
Returns:
A taskqueue.Task instance if return_task was True. This task will *not*
have a name, thus to ensure reliable execution of your pipeline you
should add() this task as part of a separate Datastore transaction.
Interesting, I do want reliable execution of my pipeline after all.
I suspect the comment is a bit inaccurate, since if you use the default return_task=False option the task is added inside a transaction anyway (by _PipelineContext.start)... it seems like the reason you'd want to add the task yourself is only if you want the starting of the pipeline to depend on the success of something in your own transaction.
Can anyone confirm my suspicion or suggest how else following the comment's advice may affect 'reliable execution of your pipeline'?
If you don't include the parameter when you call Pipeline.start(), the task is enqueued in the queue given by the Pipeline's inner variable context (type _PipelineContext). The default name for this queue is "default".
If you do include the parameter when you call Pipeline.start(), the task is not enqueued within these methods. Pipeline.start() will return _PipelineContext.start(), which relies on an inner method txn(). This method is annotated transactional, since it first does a bit of book-keeping on the Datastore records used to run this pipeline. Then, after this book-keeping is done, it creates a task without a name property (see the Task class definition here).
If return_task was not provided, it will go ahead and add that (un-named) task to the default queue for this pipeline's context. It also sets transactional on that task, so that it becomes a "transactional task" which will only be added if the enclosing datastore transaction is committed successfully (i.e. with all the book-keeping in the txn() method successful, so that this task, when run, will interact properly with the other parts of the pipeline, etc.).
If, on the other hand, return_task was set, the un-named task, not added to any queue, is returned. The txn() book-keeping work will nonetheless have taken place to prepare it to run. _PipelineContext.start() returns to Pipeline.start(), and user code gets the un-named, un-added task.
You're absolutely correct to say that the reason you would want this pattern is if you want pipeline execution to be part of a transaction in your code. Maybe you want to receive and store some data, kick off a pipeline, and store the pipeline id on a user's profile somewhere in Datastore. Of course, this means you want not only the datastore events but also the pipeline execution event to be grouped together into this atomic transaction. This pattern allows you to do such a thing. If the transaction fails, the transactional task will not execute, and handle_run_exception() will be able to catch the TransactionFailedError and run pipeline.abort() to make sure the Datastore book-keeping data is cleaned up for the task that never ran.
The fact that the task is un-named will not cause any disruption, since un-named tasks are automatically assigned a unique name when added to a queue, and in fact not having a name is a requirement for tasks added with transactional=True.
All in all, I think the comment just means that, because the returned task is transactional, in order for it to be reliably executed you should make sure that the task.add(queue_name...) takes place inside a transaction. It's not saying that the returned task is somehow "unreliable" just because you set return_task; it's basically using the word "reliable" superfluously, because the task is added within a transaction.
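To illustrate the pattern (UserProfile and its pipeline_id field are made-up names, and the error handling mirrors the abort() step described above):

    from google.appengine.api import datastore_errors
    from google.appengine.ext import ndb

    def store_and_start(profile_key, pipeline):
        # Bookkeeping is written by start(); with return_task=True the task
        # comes back un-named and not yet enqueued.
        task = pipeline.start(return_task=True)

        @ndb.transactional
        def txn():
            profile = profile_key.get()
            profile.pipeline_id = pipeline.pipeline_id  # made-up property
            profile.put()
            # Transactional add: only enqueued if this transaction commits.
            task.add(queue_name='default', transactional=True)

        try:
            txn()
        except datastore_errors.TransactionFailedError:
            # Clean up the bookkeeping for a pipeline whose task never ran.
            pipeline.abort()
            raise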
On the Google App Engine with Python, I am looking for solutions to the race condition problem, i.e., multiple users are trying to increment a certain counter at the same time. I found two of them: the increment_counter() described in transactions and the bump_counter() in compare-and-set.
My questions: 1) Do both of them completely solve the race condition problem? 2) If so, which one is better?
In addition, could somebody elaborate more on each of them, because I can't see how the code solves the problem. For example: 1) during the increment_counter() transaction, if another user updates the counter, will the transaction fail? 2) Similarly, during bump_counter() with compare-and-set, if another user updates the counter, will client.cas() fail?
Yes, they both can eliminate race conditions.
The first uses the datastore, the second memcache, so they cannot really be compared. Memcache is volatile and can be purged at any time; you should not use it for storing permanent data. So in this regard datastore transactions are better. Also, transactions can assure atomicity on a set of entities, while compare_and_set assures atomicity only on a single memcache value.
Transactions do not do blocking. If they detect a collision they fail, and you need to roll back and repeat again yourself.
Ditto for memcache: you need to repeat the procedure yourself.
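Roughly, the two patterns look like this (my own simplified sketch, not the exact code from those docs; the Counter model and the key names are illustrative):

    from google.appengine.ext import db
    from google.appengine.api import memcache

    class Counter(db.Model):
        count = db.IntegerProperty(default=0)

    def increment_counter(key):
        # Datastore transaction: if another write commits first, the
        # function is re-run by run_in_transaction (up to its retry limit).
        def txn():
            counter = db.get(key)
            counter.count += 1
            counter.put()
        db.run_in_transaction(txn)

    def bump_counter(key):
        # Memcache compare-and-set: retry until nobody changed the value
        # between our gets() and cas().
        client = memcache.Client()
        while True:
            value = client.gets(key)
            if value is None:
                client.add(key, 0)  # initialise if missing, then retry
                continue
            if client.cas(key, value + 1):
                return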
I need only confirmation that I get this right.
If, for example, I have an entity X with a field x, and when a request is sent I want to do X.x++: if I just use X = ofy().load().type(X.class).id(xId).get(), then do some calculations, then do X.x++ and save it, and another request comes in during the calculations, I'll get unwanted behavior. Whereas if I do all this in a transaction, the second request won't have access to X until I finish.
Is it so?
Sorry if the question is a bit nooby.
Thanks,
Dan
Yes, you got it right, but when using transactions remember that the first one to complete wins and the rest fail. Look also at @Peter Knego's answer for how they work.
But don't worry about the second request if it fails to read.
You have like 2 options:
Force retries
Use eventual consistency in your transactions
As far as the retries are concerned:
Your transaction function can be called multiple times safely without undesirable side effects. If this is not possible, you can set retries=0, but know that the transaction will fail on the first incident of contention.
Example:
@db.transactional(retries=10)
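For instance, attached to a hypothetical increment function (the model and field here are made up, not from the docs):

    from google.appengine.ext import db

    @db.transactional(retries=10)
    def increment(key):
        counter = db.get(key)  # assumes an entity with an integer 'count' property
        counter.count += 1
        counter.put()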
As far as eventual consistency is concerned:
You can opt out of this protection by specifying a read policy that requests eventual consistency. With an eventually consistent read of an entity, your app gets the current known state of the entity being read, regardless of whether there are still committed changes to be applied. With an eventually consistent ancestor query, the indexes used for the query are consistent with the time the indexes are read from disk. In other words, an eventual consistency read policy causes gets and queries to behave as if they are not a part of the current transaction. This may be faster in some cases, since the operations do not have to wait for committed changes to be written before returning a result.
Example:
@db.transactional()
def test():
    game_version = db.get(
        db.Key.from_path('GameVersion', 1),
        read_policy=db.EVENTUAL_CONSISTENCY)
No, GAE transactions do not do locking; they use optimistic concurrency control. You will have access to X all the time, but when you try to save it in the second transaction it will fail with a ConcurrentModificationException.
I have a function in my app that does some processing in a transaction - creates or fails to create an entity depending on the attributes of others in the entity group.
I have been doing some testing that sees this function called in fast succession; a few times a second is possible in these tests.
The function triggers some deferred tasks that read from the entity group, but do not write to it.
I noticed something funny - when these tasks are triggered immediately, and interleave with the main function calls, I get contention errors quite frequently.
If I put a countdown of a couple of seconds on the deferred tasks, the main functions process successfully.
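For reference, the delay is just the countdown option on defer, roughly like this (the task name and arguments are placeholders):

    from google.appengine.ext import deferred

    # _countdown delays execution of the deferred task by ~2 seconds
    deferred.defer(read_entity_group_task, group_key, _countdown=2)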
That suggests to me that the deferred tasks are causing contention on the entity group the main function writes to, but I thought reads from an entity group couldn't do this? Do lookups by key name cause contention? Queries with filters?
It's kind of puzzling me. Should this be happening? I've read elsewhere that there is a limit of 1 write per second per entity group, but my tests routinely break that limit... at least when my spin-off deferred tasks are delayed for a couple of seconds.
This is on production, by the way.
Thanks for any insight!