How to avoid concurrent data modifications in GAE Datastore - google-app-engine

Let’s say below is the GAE Datastore kind, where two different players' info needs to be stored in a single entity. How can data correctness be maintained?
@Entity
public class Game
{
    Long id;
    long player1Id;
    long player2Id;
    int score1;
    int score2;
}
Let's say player1 wants to modify score1: he first reads the entity, modifies score1, and saves it back into the datastore.
Similarly, player2 wants to modify score2: he also reads the entity first, modifies score2, and saves it back into the datastore.
With this functionality there is a chance that player1 and player2 try to modify the same entity at the same time, and they could end up with incorrect data due to a dirty read.
Is there a way to avoid dirty reads with the GAE Datastore and ensure data correctness, or to avoid the concurrent modification altogether?

Transactions:
https://developers.google.com/appengine/docs/java/datastore/transactions
When two or more transactions simultaneously attempt to modify
entities in one or more common entity groups, only the first
transaction to commit its changes can succeed; all the others will
fail on commit.

You need to use Transactions, and then you have two options:
Implement your own logic for transaction failures (how many times to retry, etc.) - see the sketch below.
Instead of writing to the datastore directly, create a task to modify the entity. Run the transaction inside the task; if it fails, App Engine will retry the task until it succeeds.
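A minimal sketch of the first option, using the low-level Java Datastore API against the Game kind from the question; the retry count, the property-name parameter and the exception handling are illustrative assumptions rather than a definitive implementation:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;
    import com.google.appengine.api.datastore.Transaction;
    import java.util.ConcurrentModificationException;

    public class GameScores {
        private static final int MAX_RETRIES = 3;

        // Reads the Game entity inside a transaction, updates one score property
        // and commits, retrying a few times if another transaction committed first.
        public static void updateScore(long gameId, String scoreProperty, long newScore)
                throws EntityNotFoundException {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Key gameKey = KeyFactory.createKey("Game", gameId);

            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                Transaction txn = ds.beginTransaction();
                try {
                    Entity game = ds.get(txn, gameKey);   // read inside the transaction
                    game.setProperty(scoreProperty, newScore);
                    ds.put(txn, game);
                    txn.commit();                         // fails if another commit won the race
                    return;
                } catch (ConcurrentModificationException e) {
                    // The other player's transaction committed first: loop and retry
                    // so the update is re-applied on top of freshly read data.
                } finally {
                    if (txn.isActive()) {
                        txn.rollback();
                    }
                }
            }
            throw new ConcurrentModificationException("Could not update score after retries");
        }
    }

Whichever player's commit loses the race gets a ConcurrentModificationException, rolls back, re-reads the entity and applies its score change on top of fresh data, so neither update is lost.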

Related

GAE Push Queue database contention during datastore query

Summary
I have an issue where the database writes from my task queue (approximately 60 tasks, at 10/s) are somehow being overwritten/discarded during a concurrent database read of the same data. I will explain how it works. Each task in the task queue assigns a unique ID to a specific datastore entity of a model.
If I run an indexed datastore query on the model and loop through the entities while the task queue is in progress, I would expect that some of the entities will already have been operated on by the task queue (i.e. assigned an ID) while others are yet to be affected. Unfortunately, what seems to be happening is that during the loop through the query, entities that were already operated on (i.e. successfully assigned an ID) are being overwritten or discarded, as if they had never been operated on, even though - according to my logs - they were.
Why is this happening? I need to be able to read the status of my data without affecting the task queue write operations in the background. I thought maybe it was a caching issue, so I tried enforcing use_cache=False and use_memcache=False on the query, but that did not solve the issue. Any help would be appreciated.
Other interesting notes:
If I allow the task queue to complete fully before doing the datastore query, everything acts as expected and nothing is overwritten/discarded.
This is typically an indication that the write operations on the entities are not performed inside transactions. Transactions can detect such concurrent write (and read!) operations and retry them, ensuring that the data remains consistent.
You also need to be aware that queries (unless they are ancestor queries) are eventually consistent, meaning their results lag a bit behind the actual datastore information: it takes some time from the moment an entity is updated until the indexes that the queries use are updated accordingly. So when processing entities from query results you should also transactionally verify their content. Personally I prefer to make keys_only queries and then obtain the entities via key lookups, which are always consistent (also inside transactions if I intend to update the entities, and on plain reads if needed).
For example if you query for entities which don't have a unique ID you may get entities which were in fact recently operated on and have an ID. So you should (transactionally) check if the entity actually has an ID and skip its update.
Also make sure you're not updating entities obtained from projection queries - results obtained from such queries may not represent the entire entities, writing them back will wipe out properties not included in the projection.
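To make the keys_only-plus-transactional-check pattern concrete, here is a rough sketch. The question itself appears to use Python NDB, but the idea is the same; this version uses the low-level Java Datastore API for consistency with the rest of this page, and the kind name MyModel, the uniqueId property and the UUID-based ID are hypothetical stand-ins for whatever the tasks actually assign:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.Query;
    import com.google.appengine.api.datastore.Transaction;
    import java.util.ConcurrentModificationException;
    import java.util.UUID;

    public class AssignIds {
        public static void assignMissingIds() {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

            // Keys-only query: cheap, and possibly stale because of eventual consistency.
            Query q = new Query("MyModel").setKeysOnly();
            for (Entity keyOnly : ds.prepare(q).asIterable()) {
                Key key = keyOnly.getKey();
                Transaction txn = ds.beginTransaction();
                try {
                    Entity entity = ds.get(txn, key);      // strongly consistent lookup
                    // Re-check inside the transaction: the index may be behind,
                    // so skip entities that already got their ID from a task.
                    if (entity.getProperty("uniqueId") != null) {
                        txn.rollback();
                        continue;
                    }
                    entity.setProperty("uniqueId", UUID.randomUUID().toString());
                    ds.put(txn, entity);
                    txn.commit();
                } catch (EntityNotFoundException e) {
                    // Entity was deleted after the query ran; nothing to do.
                } catch (ConcurrentModificationException e) {
                    // A concurrent writer (e.g. a task) won; leave the entity alone.
                } finally {
                    if (txn.isActive()) {
                        txn.rollback();
                    }
                }
            }
        }
    }

The query only supplies candidate keys; the authoritative check of whether an ID is already present happens on the strongly consistent get() inside the transaction.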

Objectify - many writes to same entity in short period of time with and without transaction

What I'm doing is creating a transaction in which
1) an entity A has a counter incremented by 1, and
2) a new entity B is written to the datastore.
It looks like this:
WrappedBoolean result = ofy().transact(new Work<WrappedBoolean>() {
    @Override
    public WrappedBoolean run() {
        // Increment count of EntityA.numEntityBs by 1 and save it
        entityA.numEntityBs = entityA.numEntityBs + 1;
        ofy().save().entity(entityA).now();

        // Create a new EntityB and save it
        EntityB entityB = new EntityB();
        ofy().save().entity(entityB).now();

        // Return that everything is ok
        return new WrappedBoolean("True");
    }
});
What I am doing is keeping a count of how many EntityB's entityA has. The two operations need to be in a transaction so either both saves happen or neither happen.
However, it is possible that many users will be executing the api method that contains the above transaction. I fear that I may run into problems of too many people trying to update entityA. This is because if multiple transactions try to update the same entity, the first one to commit wins but all the others fail.
This leads me to two questions:
1) Is the transaction I wrote a bad idea and destined to cause writes not being made if a lot of calls are made to the API method? Is there a better way to achieve what I am trying to do?
2) What if there are a lot of updates being made to an entity not in a transaction (such as updating a counter the entity has) - will you eventually run into a scaling problem if a lot of updates are being made in a short period of time? How does datastore handle this?
Sorry for the long-winded question, but I hope someone can shed some light on how this system works with respect to the questions above. Thanks.
Edit: when I say a lot of updates being made to an entity over a short period of time, consider something like Instagram, where you want to keep track of how many "likes" a picture has. Some users have millions of followers, and when they post a new picture they can get something like 10-50 likes a second.
The datastore allows about 1 write/second per entity group. What might not appear obvious is that standalone entities (i.e. entities with no parent and no children) still belong to one entity group - their own. Repeated writes to the same standalone entity are thus subject to the same rate limit.
Exceeding the write limit will eventually cause write ops to fail with something like TransactionFailedError(Concurrency exception.)
Repeated writes to the same entity done outside transactions can overwrite each other. Transactions help with this: conflicting writes are automatically retried a few times. Your approach looks OK from this perspective, but it only works if the average write rate remains below the limit.
You probably want to read Avoiding datastore contention. You need to shard your counter to be able to count events at rates higher than about 1/second; a sketch of such a sharded counter follows below.
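Here is a minimal sketch of a sharded counter for the numEntityBs count, written with Objectify since the question uses it; the shard kind, the number of shards and the sum-all-shards read are assumptions for illustration, not the canonical implementation from the contention article:

    import com.googlecode.objectify.VoidWork;
    import com.googlecode.objectify.annotation.Entity;
    import com.googlecode.objectify.annotation.Id;
    import java.util.Random;
    import static com.googlecode.objectify.ObjectifyService.ofy;

    // Hypothetical shard entity: the total count is spread over NUM_SHARDS entities,
    // so the ~1 write/second/entity-group limit applies per shard, not per counter.
    // Remember to register it: ObjectifyService.register(EntityACounterShard.class)
    @Entity
    class EntityACounterShard {
        @Id Long id;   // shard number, 1..NUM_SHARDS
        long count;
    }

    class EntityACounter {
        private static final int NUM_SHARDS = 20;         // tune to the expected write rate
        private static final Random RANDOM = new Random();

        // Each increment transactionally updates one randomly chosen shard.
        static void increment() {
            final long shardId = 1 + RANDOM.nextInt(NUM_SHARDS);
            ofy().transact(new VoidWork() {
                @Override
                public void vrun() {
                    EntityACounterShard shard =
                            ofy().load().type(EntityACounterShard.class).id(shardId).now();
                    if (shard == null) {
                        shard = new EntityACounterShard();
                        shard.id = shardId;
                    }
                    shard.count++;
                    ofy().save().entity(shard).now();
                }
            });
        }

        // Reading the total sums all shards; the result is cheap but slightly stale,
        // which is usually acceptable for a like/vote counter.
        static long total() {
            long total = 0;
            for (EntityACounterShard shard :
                    ofy().load().type(EntityACounterShard.class).list()) {
                total += shard.count;
            }
            return total;
        }
    }

Increments now spread across NUM_SHARDS entity groups, so the sustainable write rate scales roughly with the number of shards; reads become slightly more expensive and slightly stale in exchange.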

Using SQLAlchemy sessions and transactions

While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and importing this DBSession instance into all my requests (all the insert/update operations) that follow.
When I do this, my DB operations have the following structure:
with transaction manager:
    for each_item in huge_file_of_million_rows:
        DBSession.add(each_item)
        # More create, read, update and delete operations
I do not commit or flush or rollback anywhere assuming my Zope transaction manager takes care of it for me
(it commits at the end of the transaction or rolls back if it fails)
The second way, and the one most frequently mentioned on the web, was:
create a DBSession factory once, like
DBSession = sessionmaker(bind=engine)
and then create a session instance from it per transaction:
session = DBSession()
for row in huge_file_of_million_rows:
    for item in row:
        try:
            session.add(item)
            # More create, read, update and delete operations
            session.flush()
            session.commit()
        except:
            session.rollback()
session.close()
I do not understand which is better (in terms of memory usage, performance, and overall health) and how.
In the first method, I accumulate all the objects in the session and the commit happens at the end. For a bulky insert operation, does adding objects to the session put them in memory (RAM) or elsewhere? Where do they get stored, and how much memory is consumed?
Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes the same time to execute. 100K rows of selects, inserts and updates take about 25-30 minutes. Is there any way to reduce this?
Please point me in the right direction. Thanks in advance.
Here is a very generic answer, with the warning that I don't know much about Zope; just some simple database heuristics. Hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation here
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean by method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate a Session when a database conversation begins, and you surely have several points in the application where different conversations begin. (I'm not sure from your text whether you have different users.)
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your python program, and surely in the database transaction. How much space? That's difficult to say with the information you provide; it will depend on the queries, on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources everything will go slower (or halt).
One commit per record is also not a good idea when processing a bulk file
It means you are asking the database to persist changes for every single row. Certainly too much. Try an intermediate number: commit every n hundred rows. But then it gets more complicated; one commit at the end of the file assures you that the file is either fully processed or not at all, while intermediate commits force you, when something fails, to take into account that the file is half processed and you have to reposition.
As for the times you mention, it is very difficult to judge with only the information you provide, without knowing your database and machine. Anyway, the order of magnitude of your numbers (a select+insert+update per 15ms, probably plus commit) sounds pretty high, but more or less in the expected range (again, it depends on the queries, the database and the machine). If you frequently have to insert that many records you could consider other database solutions; it will depend on your scenario, probably on dialects, and may not be something an ORM like SQLAlchemy can provide.

How can we synchronize reads and writes to mem cache on app engine?

How can we make sure only one JVM instance is modifying a memcache key at any one time, and also that a read isn't happening in the middle of a write? I can use either the low-level store or the JSR wrapper.
List<Foo> foos = cache.get("foos");
foos.add(new Foo());
cache.put("foos", foos);
is there any way for us to protect against concurrent writes etc?
Thanks
Fh. is almost on the right track, but not completely. He is right that memcache works atomically, but this does not save you from taking care of 'synchronization'. Since we have many JVM instances running, we cannot speak of real synchronization in the sense we usually mean in multi-threaded environments. Here's how to 'synchronize' memcache access across one or many instances:
With the method MemcacheService.getIdentifiable(Object key) you get an IdentifiableValue instance. You can later put it back into memcache using MemcacheService.putIfUntouched(...).
Check the API at MemCache API: getIdentifiable().
An Identifiable object is a wrapper containing the object you fetched via its key.
Here's how it works:
Instance A fetches Identifiable X.
Instance B fetches Identifiable X at the 'same' time.
Instances A and B each update X's wrapped object (your object, actually).
The first instance to call putIfUntouched(...) to store the object back will succeed; putIfUntouched returns true.
The second instance trying this will fail, with putIfUntouched returning false.
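A minimal sketch of this optimistic pattern applied to the List<Foo> example above, assuming the "foos" key has already been initialized elsewhere and that Foo is serializable; the retry limit is arbitrary:

    import com.google.appengine.api.memcache.MemcacheService;
    import com.google.appengine.api.memcache.MemcacheService.IdentifiableValue;
    import com.google.appengine.api.memcache.MemcacheServiceFactory;
    import java.util.List;

    public class FooCache {
        // Optimistic, lock-free update of the cached list: re-read and retry until
        // putIfUntouched succeeds, i.e. nobody else modified the value in between.
        @SuppressWarnings("unchecked")
        public static boolean addFoo(Foo foo, int maxAttempts) {
            MemcacheService cache = MemcacheServiceFactory.getSyncCache();
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                IdentifiableValue idv = cache.getIdentifiable("foos");
                if (idv == null) {
                    return false;                      // key evicted or never set
                }
                List<Foo> foos = (List<Foo>) idv.getValue();
                foos.add(foo);
                // Succeeds only if no other instance touched "foos" since getIdentifiable.
                if (cache.putIfUntouched("foos", idv, foos)) {
                    return true;
                }
                // Another instance won the race; loop, re-read and try again.
            }
            return false;
        }
    }

If putIfUntouched keeps failing under heavy contention, the caller can back off or rebuild the list from the datastore instead of retrying forever.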
Are you worried about corrupting memcache by reading a key while a parallel operation is writing to the same key? Don't worry about this; on the memcache server both operations will run atomically.
Memcache will never output a 'corrupt' result, nor have corruption issues while writing.

Google App Engine Go score counting and saving

I want to make a simple GAE app in Go that will let users vote and store their answers in two ways. The first way is the raw data (a datastore record of "voted for X"); the second is a running count of those votes ("12 votes for X, 10 votes for Y"). What is an effective way to store both of those values when the app is being accessed by multiple people at the same time? If I retrieve the data from the Datastore, change it, and save it back in one instance, another instance might be doing the same in parallel, and I'm not sure the final result will be correct.
It seems like a good way to do that is to simply store all vote events as separate entities (the "voted for X" way) and use the Task Queue for the recalculation (the "12 votes for X, 10 votes for Y" way), so the recalculation is done offline and sequentially (without any races and other concurrency issues). Then you'd have to put the recalc task every once in a while to the queue so the results are updated.
The Task Queue doesn't allow adding another task with the same name as an existing one, but it also doesn't allow checking whether a specific task is already enqueued, so simply trying to add a task with the same name (and ignoring the failure if it already exists) should be enough to ensure that multiple recalc tasks are not enqueued at once.
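The question is about Go, but here is the named-task idea sketched with the Java Task Queue API for consistency with the other snippets on this page; the queue name, handler URL and task name are illustrative assumptions:

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskAlreadyExistsException;
    import com.google.appengine.api.taskqueue.TaskOptions;

    public class RecalcScheduler {
        // Enqueue the recalc task under a fixed name; if a task with that name is
        // already pending, the add fails and we simply ignore it.
        public static void scheduleRecalc() {
            Queue queue = QueueFactory.getQueue("recalc");
            try {
                queue.add(TaskOptions.Builder.withUrl("/tasks/recalc")
                                             .taskName("vote-recalc"));
            } catch (TaskAlreadyExistsException e) {
                // A recalc is already queued; the pending task will pick up the new votes.
            }
        }
    }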
Another way would be to use a goroutine waiting for a poke from an input channel in order to recalculate the results. I haven't run such goroutines on App Engine so I'm not sure of the general behavior of this approach.
