ndb data contention getting worse and worse


I have a bit of a strange problem. I have a module running on gae that puts a whole lot of little tasks on the default task queue. The tasks access the same ndb module. Each task accesses a bunch of data from a few different tables then calls put.
The first few tasks work fine but as time continues I start getting these on the final put:
suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)
So I wrapped the put with a try and put in a randomised timeout so it retries a couple of times. This mitigated the problem a little, it just happens later on.
Here is some pseudocode for my task:
import random
import time

def my_task(request):
    stuff = get_ndb_instances()  # this accesses a few things from different tables
    better_stuff = process(stuff)  # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    # last attempt: let any error propagate to the caller
    if iCountdown < 1:
        return oInstance.put()
    try:
        return oInstance.put()
    except Exception:
        logger.info("sleeping")
        time.sleep(random.random() * 20)  # random wait of up to 20 seconds
        return try_put(oInstance, iCountdown - 1)
Without using try_put the queue gets about 30% of the way through until it stops working. With the try_put it gets further, like 60%.
Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.
EDIT:
There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a whole lot of tasks running simultaneously, and they access ndb in a way that can cause contention. If contention is detected, a randomly timed retry happens, and this eliminates contention perfectly well. For a little while. Tasks keep running and completing, and the more that successfully return, the more contention happens, even though the processes using the contended data should be finished. Is there something going on that's holding onto datastore handles that shouldn't be? What's going on?
EDIT2:
Here is a little bit about the key structures in play:
My ndb models sit in a hierarchy where we have something like this (the direction of the arrows specifies parent child relationships, ie: Type has a bunch of child Instances etc)
Type->Instance->Position
The ids of the Positions are limited to a few different names, there are many thousands of instances and not many types.
I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in an obvious way) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.

Contention will get worse over time if you continually exceed the limit of 1 write/transaction per entity group per second. The answer lies in how Megastore/Paxos works and how Cloud Datastore handles contention in the backend.
When 2 writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and will retry the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.
If sustained writes above the recommended limit are being attempted, the chance that a transaction needs to be retried multiple times increases. The number of transactions in an internal retry state also increases. Eventually, transactions will start reaching our internal retry limit and will return a contention error to the client.
A randomized sleep is not the right way to handle these error responses. You should instead look into exponential back-off with jitter (example).
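To make the back-off suggestion concrete, here is a minimal sketch of exponential back-off with full jitter. It is only an illustration: put_with_backoff, its parameters and the default delays are names and values I made up, and it catches Exception for brevity where real code would catch only the contention error (TransactionFailedError in the traceback above).

```python
import random
import time


def put_with_backoff(put, max_attempts=5, base_delay=0.1, max_delay=5.0,
                     sleep=time.sleep, rng=random.random):
    """Call put() and retry on failure with exponential back-off and full jitter.

    `put` is any zero-argument callable, e.g. `entity.put` (hypothetical usage).
    `sleep` and `rng` are injectable so the behaviour can be tested.
    """
    for attempt in range(max_attempts):
        try:
            return put()
        except Exception:  # assumption: narrow this to the contention error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The ceiling doubles each attempt: base, 2*base, 4*base, ... capped.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random time in [0, ceiling).
            sleep(rng() * ceiling)
```

Unlike a flat random sleep, the waiting window grows with each failed attempt, so retrying tasks spread out more the longer the contention lasts.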
The core of your problem, though, is a high write rate into a single entity group. You should look into whether the explicit parenting is required (removing it if not), or whether you should shard the entity group in some manner that makes sense according to your queries and consistency requirements.
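As a rough sketch of what sharding could look like, the function below maps each Instance to one of several synthetic root IDs per Type, so that writes for one Type are spread over several entity groups instead of one. The names shard_parent_id and NUM_SHARDS and the ID format are assumptions for illustration only; whether this fits depends on which queries need to stay strongly consistent.

```python
import zlib

NUM_SHARDS = 20  # assumption: tune to the write rate you need


def shard_parent_id(type_id, instance_id, num_shards=NUM_SHARDS):
    """Map an instance to one of num_shards synthetic parent IDs for a type."""
    # crc32 is stable across processes, unlike the built-in hash() of strings.
    shard = zlib.crc32(instance_id.encode("utf-8")) % num_shards
    return "%s-shard-%d" % (type_id, shard)
```

The returned string would be used as the ID of the root entity in place of the single Type parent. An ancestor query then covers one shard only, so reading everything for a Type means querying each shard.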

Related

Memory is not coming down after data processing in Apache Flink

I am using a BroadcastProcessFunction to perform simple pattern matching. I am broadcasting around 60 patterns. Once processing has completed, the memory does not come down. I am using the garbage collection setting env.java.opts = "-XX:+UseG1GC" in my Flink configuration file to perform GC, but that is not working either. The CPU percentage, however, does come down after the data has been processed. I am checkpointing every 2 minutes and my state backend is the filesystem. Below are screenshots of memory and CPU usage

I don't see anything surprising or problematic in the graphs you have shared. After ingesting the patterns, each instance of your BroadcastProcessFunction will be holding onto a copy of all of the patterns -- so that will consume some memory.
If I understand correctly, it sounds like the situation is that as data is processed for matching against those patterns, the memory continues to increase until the pods crash with out-of-memory errors. Various factors might explain this:
If your patterns involve matching a sequence of events over time, then your pattern matching engine has to keep state for each partial match. If there's no timeout clause to ensure that partial matches are eventually cleaned up, this could lead to a combinatorial explosion.
If you are doing key-partitioned processing and your keyspace is unbounded, you may be holding onto state for stale keys.
The filesystem state backend has considerable overhead. You may have underestimated how much memory it needs.

GAE/P: Using deferred tasks and transactions for a counter

I have a counter in my app where I expect that 99% of the time there will not be contention issues in updating the counter with transactions.
To handle the 1% times when it is busy, I was thinking of updating the counter by using transactions within deferred tasks as follows:
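from google.appengine.ext import deferred
from google.appengine.ext import ndb

def update_counter(my_key):
    # pass my_key through so the deferred function receives it
    deferred.defer(update_counter_transaction, my_key)

@ndb.transactional
def update_counter_transaction(my_key):
    x = my_key.get()
    x.n += 1
    x.put()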
For the occasional instances when contention causes the transaction to fail, the task will be retried.
I'm familiar with sharded counters but this seems easier and suited to my situation.
Is there anything I am missing that might cause this solution to not work well?

A problem may exist with the automatic task retries, which, at least theoretically, may happen for reasons other than transaction collisions for the intended counter increments. If such an undesired retry successfully re-executes the counter increment code, the counter value may be thrown off (it will be higher than the expected value). This might or might not be acceptable for your app, depending on the use of the counter.
Here's an example of an undesired deferred task invocation: GAE deferred task retried due to "instance unavailable" despite having already succeeded
The answer to that question seems in line with this note in the regular task queue documentation (I saw no such note in the deferred task queues article, but I marked it as possible in my brain):
Note that task names do not provide an absolute guarantee of once-only
semantics. In extremely rare cases, multiple calls to create a task of
the same name may succeed, but in this event, only one of the tasks
would be executed. It's also possible in exceptional cases for a task
to run more than once.
From this perspective it might actually be better to keep the counter incrementing together with the rest of the related logical/transactional operations (if any) than to isolate it as a separate transaction on a task queue.
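If the increment does stay on a task queue, one possible way to make an unwanted retry harmless is to tag each increment with an ID generated at enqueue time and skip IDs that were already applied. The sketch below uses an in-memory object as a stand-in for the counter entity; Counter, increment_once and task_id are hypothetical names, and in ndb the check and the write would have to happen inside the same transaction.

```python
import uuid


class Counter(object):
    """In-memory stand-in for a counter entity (assumption, not an ndb model)."""

    def __init__(self):
        self.n = 0
        self.applied = set()  # IDs of increments already applied


def increment_once(counter, increment_id):
    """Apply the increment unless this increment_id was already applied."""
    if increment_id in counter.applied:
        return False  # duplicate task run: do nothing
    counter.n += 1
    counter.applied.add(increment_id)
    return True


# The ID is generated when the task is enqueued, not when it runs,
# so every retry of the same task carries the same ID.
counter = Counter()
task_id = uuid.uuid4().hex
increment_once(counter, task_id)
increment_once(counter, task_id)  # simulated unwanted retry
```

After both calls the counter holds 1, not 2. The set of applied IDs grows with every increment, so a real version would need some way to expire old IDs.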

Improve throughput of ndb query over large data

I am trying to perform some data processing in a GAE application over data that is stored in the Datastore. The bottleneck point is the throughput in which the query returns entities and I wonder how to improve the query's performance.
What I do in general:
everything works in a task queue, so we have plenty of time (10 minute deadline).
I run a query over the ndb entities in order to select which entities need to be processed.
as the query returns results, I group entities in batches of, say, 1000 and send them to another task queue for further processing.
the stored data is going to be large (say 500K-1M entities) and there is a chance that the 10 minutes deadline is not enough. Therefore, when the task is reaching the taskqueue deadline, I spawn a new task. This means I need an ndb.Cursor in order to continue the query from where it stopped.
The problem is the rate in which the query returns entities. I have tried several approaches and observed the following performance (which is too slow for my app):
Use fetch_page() in a while loop.
The code is straightforward
while has_more and theres_more_time:
    entities, cursor, more = query.fetch_page(1000, ...)
    send_to_process_queue(entities)
    has_more = more and cursor
With this approach, it takes 25-30 seconds to process 10K entities. Roughly speaking, that is 20K entities per minute. I tried changing the page size or the class of the frontend instance; neither made any difference in performance.
Segment the data and fire multiple fetch_page_async() in parallel.
This approach is taken from here (approach C)
The overall performance remains the same as above. I tried various numbers of segments (from 2 to 10) in order to have 2-10 parallel fetch_page_async() calls. In all cases, the overall time remained the same. The more parallel fetch_page_async() calls are made, the longer each one takes to complete. I also tried 20 parallel fetches and it got worse. Changing the page size or the frontend instance class did not have an impact either.
Fetch everything with a single fetch() call.
Now this is the least suitable approach (if not unsuitable altogether), as the instance may run out of memory, plus I don't get a cursor in case I need to spawn another task (in fact I won't even have the ability to do so; the task will simply exceed the deadline). I tried this out of curiosity to see how it performs, and I observed the best performance! It took 8-10 seconds for 10K entities, which is roughly 60K entities per minute. That is approx. 3 times faster than fetch_page(). I wonder why this happens.
Use query.iter() in a single loop.
This is much like the first approach. It makes use of the query iterator's underlying generator, plus I can obtain a cursor from the iterator in case I need to spawn a new task, so it suits me. With the query iterator, it fetched 10K entities in 16-18 seconds, which is approx. 36-40K entities per minute. The iterator is 30% faster than fetch_page(), but much slower than fetch().
For all the above approaches, I tried F1 and F4 frontend instances without any difference in Datastore performance. I also tried to change the batch_size parameter in the queries, still without any change.
A first question is why do fetch(), fetch_page() and iter() behave so differently and how to make either fetch_page() or iter() do equally well as fetch()? And then another critical question is whether these throughputs (20-60K entities per minute, depending on api call) are the best we can do in GAE.
I'm aware of the MapReduce API but I think it doesn't suit me. AFAIK, the MapReduce API doesn't support queries and I don't want to scan all the Datastore entities (it would be too costly and slow - the query may return only a few results). Last, but not least, I have to stick to GAE. Resorting to another platform is not an option for me. So the question really is how to optimize the ndb query.
Any suggestions?

In case anyone is interested, I was able to significantly increase the throughput of the data processing by re-designing the component - it was suggested that I change the data models but that was not possible.
First, I segmented the data and then processed each data segment in a separate taskqueue.Task instead of calling multiple fetch_page_async from a single task (as I described in the first post). Initially, these tasks were processed by GAE sequentially utilizing only a single Fx instance. To achieve parallelization of the tasks, I moved the component to a specific GAE module and used basic scaling, i.e. addressable Bx instances. When I enqueue the tasks for each data segment, I explicitly instruct which basic instance will handle each task by specifying the 'target' option.
With this design, I was able to process 20,000 entities in total within 4-5 seconds (instead of 40'-60'!), using 5 B4 instances.
Now, this has additional costs because of the Bx instances. We'll have to fine-tune the type and number of basic instances we need.

The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like the Approach C in the other linked question).

Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normal form.
Design a solution from the output perspective - what the simplest query is that produces the required results, then what the entity structure is to support such a query, then what work is needed to create and maintain such an entity structure from the current data.

GAE Map Reduce - hitting entities more than once?

I'm using Map Reduce (http://code.google.com/p/appengine-mapreduce/) to do an operation over a set of entities. However, I am finding my operations are being duplicated.
Are map reduce maps sometimes called more than once for a specific entity? Is this the case even if they don't fail the initial time?
edit: here are some more details.
def reparent_request(entity):
    # check if the entity has a parent
    if not is_valid_to_reparent(entity):
        return
    # copy it
    try:
        copy = clone_entity(Request, entity, parent=entity.user)
        copy.put()  # we hard put here so we can use the reference later in this function.
    except:
        ...
    ... update some references to the copied object ...
    # delete the original
    yield op.db.Delete(entity)
At the end, I am non-deterministically left with two entities, both with the new parent.

I've reparented a load of entities before - it was a nightmare because of the exact problem you're facing.
What I would do instead is:
Create a new queue. Ensure it's paused and that you have a lot of storage space dedicated to queues. It's only temporary, but you'll need it.
Instead of editing your entities in your map reduce job, add them to the queue with a name that will be unique for each entity. The key works fine.
When adding to the queue, because it's paused you'll get an error if you try to add a task with the same name twice - so catch the error and skip it, because you know that entity must already have been touched by the map reduce job.
When you're confident that every entity has a matching queue task and the map reduce job has finished, unpause your queue. The queue will do the reparenting.
A couple of notes:
* the task queue size can get pretty big. Can't remember numbers, but it was gigs. Also the size of the queue doesn't update in real time - so don't worry that it might still say gigs of tasks when the queue is nearly empty.
* the reliability of the queue storage is an unknown, I believe. It didn't happen to us, but queue items could disappear, I guess. Fortunately, you can rerun this process multiple times to ensure it worked, especially if you're deleting entities.
* you may want to ensure your queue has a concurrency limit on it. Without one, a delay in the execution of a couple of tasks can absolutely cripple your application. Learnt that the hard way! I think 30 concurrent tasks went quite well for us.
Hope that's useful, let me know if you come up with any improvements!

App Engine mapreduce runs on the task queue, and like anything else that uses the task queue, tasks have to be idempotent - that is, running them multiple times should have the same effect as running them once. Tasks will occasionally be run more than once; the mapreduce library may have its own reasons for rerunning mapper tasks, too.
In your situation, I'd suggest creating the new entity with a key whose ID is the same as the old entity; that way running it multiple times will just overwrite the same entity.
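To illustrate why reusing the ID helps, here is a small sketch with a plain dict standing in for the datastore, keyed by (parent, id) tuples. The names copy_under_parent and store and the sample data are made up for the example; this is not the mapreduce or datastore API.

```python
def copy_under_parent(store, old_key, new_parent):
    """Copy the entity at old_key under new_parent, keeping the same ID."""
    old_parent, entity_id = old_key
    new_key = (new_parent, entity_id)
    # Same ID under the new parent: a rerun overwrites rather than duplicates.
    store[new_key] = dict(store[old_key])
    return new_key


store = {(None, 42): {"title": "request"}}
copy_under_parent(store, (None, 42), "user-1")
copy_under_parent(store, (None, 42), "user-1")  # simulated duplicate mapper call
del store[(None, 42)]  # delete the original once the copy exists
```

However many times the mapper runs for the entity, the store ends up with a single copy under the new parent, because both runs write to the same key.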

Handling multiple calls to BeginExecuteNonQuery in SQL Server 2008

I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.
The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:
Buffer the pending data myself until the existing command is finished.
Open multiple connections as needed.
I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.
Any help would be appreciated.

Depending on your locking strategy, it may be worth using several connections but certainly not a number "without upper bounds". So a good strategy/pattern to use here is "thread pool", with each of N dedicated threads holding a connection and picking up write requests as the requests come and the thread finishes the previous one it was doing. Number of threads in the pool for best performance is best determined empirically, by benchmarking various possibilities in a realistic experimental/prototype setting.
If the "buffer" queue (in which your main thread queues write requests and from which the dedicated threads in the pool pick them up) grows beyond a certain threshold, it means you're getting data faster than you can possibly write it out, so, unless you can get more resources, you'll simply have to drop some of the incoming data -- maybe by a random-sampling strategy to avoid biasing future statistical analysis. Just count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say every minute or so), so you can use "stratified sampling" techniques in future data-mining explorations.
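The pattern can be sketched roughly as follows, in Python rather than .NET to keep it short. BoundedWriter and its parameters are hypothetical names, and the write callable stands in for a worker sending one record over its own connection. This sketch drops the newest record when the buffer is full, which is simpler than the random-sampling strategy mentioned above, and a write that raises would end its worker thread.

```python
import queue
import threading


class BoundedWriter(object):
    """N worker threads draining a bounded buffer, counting writes and drops."""

    def __init__(self, write, num_workers=4, max_pending=1000):
        self._write = write  # assumption: writes one record to the database
        self._pending = queue.Queue(maxsize=max_pending)
        self._lock = threading.Lock()
        self.written = 0
        self.dropped = 0
        self._workers = [threading.Thread(target=self._run)
                         for _ in range(num_workers)]
        for worker in self._workers:
            worker.daemon = True
            worker.start()

    def submit(self, record):
        """Queue a record without blocking; count it as dropped if full."""
        try:
            self._pending.put_nowait(record)
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1
            return False

    def _run(self):
        while True:
            record = self._pending.get()
            try:
                if record is None:  # sentinel: shut this worker down
                    return
                self._write(record)
                with self._lock:
                    self.written += 1
            finally:
                self._pending.task_done()

    def close(self):
        """Wait for queued records to be written, then stop the workers."""
        self._pending.join()
        for _ in self._workers:
            self._pending.put(None)
        for worker in self._workers:
            worker.join()
```

The main thread only ever calls submit(), which never blocks, and the written and dropped counters give the per-period figures suggested above.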

Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?
(I'm the original poster, I've just managed to get two accounts without realizing)
