Objectify transaction vs. regular load then save - google-app-engine

I need only confirmation that I get this right.
If, for example, I have an entity X with a field x, and when a request comes in I want to do X.x++. Suppose I just do X = ofy().load().type(X.class).id(xId).get(), run some calculations, then do X.x++ and save it. If another request arrives during the calculations, I'll get unwanted behavior. Whereas if I do all of this in a transaction, the second request won't have access to X until I finish.
Is that right?
Sorry if the question is a bit nooby.
Thanks,
Dan

Yes, you got it right, but when using transactions remember that the first one to complete wins and the rest fail. Also look at @Peter Knego's answer for how they work.
But don't worry about the second request if it fails.
You have two options:
Force retries
Use eventual consistency in your transactions
As far as the retries are concerned:
Your transaction function can be called multiple times safely without
undesirable side effects. If this is not possible, you can set
retries=0, but know that the transaction will fail on the first
incident of contention
Example:
@db.transactional(retries=10)
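For instance, a rough sketch of a transactional counter increment using the Python db API (the CounterX model and its field x are made up for illustration):

from google.appengine.ext import db

class CounterX(db.Model):
    x = db.IntegerProperty(default=0)

# Retried up to 10 times on contention before giving up.
@db.transactional(retries=10)
def increment(counter_key):
    counter = db.get(counter_key)  # read inside the transaction
    counter.x += 1
    counter.put()                  # the write only takes effect if the commit wins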
As far as eventual consistency is concerned:
You can opt out of this protection by specifying a read policy that
requests eventual consistency. With an eventually consistent read of
an entity, your app gets the current known state of the entity being
read, regardless of whether there are still committed changes to be
applied. With an eventually consistent ancestor query, the indexes
used for the query are consistent with the time the indexes are read
from disk. In other words, an eventual consistency read policy causes
gets and queries to behave as if they are not a part of the current
transaction. This may be faster in some cases, since the operations do
not have to wait for committed changes to be written before returning
a result.
Example:
@db.transactional()
def test():
    game_version = db.get(
        db.Key.from_path('GameVersion', 1),
        read_policy=db.EVENTUAL_CONSISTENCY)

No, GAE transactions do not do locking; they use optimistic concurrency control. You will have access to X all the time, but when you try to save it in the second transaction, the save will fail with a ConcurrentModificationException.
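To make the failure mode concrete, here is a hedged sketch of the same situation using the Python db API (Objectify raises ConcurrentModificationException where the Python API raises TransactionFailedError; the X model and key below are made up):

from google.appengine.ext import db

class X(db.Model):
    x = db.IntegerProperty(default=0)

def bump(key):
    entity = db.get(key)  # any request can still read X at any time
    entity.x += 1
    entity.put()

key = db.Key.from_path('X', 'some-id')  # hypothetical key
try:
    db.run_in_transaction(bump, key)
except db.TransactionFailedError:
    # Another request committed first; nothing was written here.
    pass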

Related

Is it possible for a DynamoDB read to return state that is older than the state returned by a previous read?

Let's say there is a DynamoDB key with a value of 0, and there is a process that repeatedly reads from this key using eventually consistent reads. While these reads are occurring, a second process sets the value of that key to 1.
Is it possible for the read process to read a 0 after it first reads a 1? Is it possible in DynamoDB's eventual consistency model for a client to successfully read a key's fully up-to-date value, but then read a stale value on a subsequent request?
Eventually, the write will be fully propagated and the read process will only read 1 values, but I'm unsure whether it's possible for the reads to go 'backward in time' while the propagation is occurring.
The property you are looking for is known as monotonic reads, see for example the definition in https://jepsen.io/consistency/models/monotonic-reads.
Obviously, DynamoDB's strongly consistent read (ConsistentRead=true) is also monotonic, but you rightly asked about DynamoDB's eventually consistent read mode.
@Charles in his response gave a link, https://www.youtube.com/watch?v=yvBR71D0nAQ&t=706s, to a nice official talk by Amazon on how eventually consistent reads work. The talk explains that DynamoDB replicates written data to three copies, but a write completes as soon as two of the three copies (including the one designated as the "leader") have been updated. It is possible that the third copy will take some time (usually a very short time) to get updated.
The video goes on to explain that an eventually consistent read goes to one of the three replicas at random.
So in that short window where the third replica has old data, a request might randomly go to one of the updated nodes and return new data, and then another request slightly later might randomly go to the not-yet-updated replica and return old data. This means that the "monotonic read" guarantee is not provided.
To summarize, I believe that DynamoDB does not provide the monotonic read guarantee if you use eventually consistent reads. You can use strongly-consistent reads to get it, of course.
Unfortunately I can't find an official document which claims this. It would also be nice to test this in practice, similar to how the paper http://www.aifb.kit.edu/images/1/17/How_soon_is_eventual.pdf tested whether Amazon S3 (not DynamoDB) guaranteed monotonic reads, and discovered that it did not, by actually observing monotonic-read violations.
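For what it's worth, a rough sketch of such a test with boto3 (the table, key, and attribute names are made up; a separate writer process would keep incrementing val):

import boto3

table = boto3.resource('dynamodb', region_name='us-east-1').Table('monotonic-test')

last_seen = None
for _ in range(10000):
    # Eventually consistent read (ConsistentRead=False is the default, shown for clarity).
    value = table.get_item(Key={'pk': 'counter'}, ConsistentRead=False)['Item']['val']
    if last_seen is not None and value < last_seen:
        print('monotonic-read violation: read %s after %s' % (value, last_seen))
    last_seen = value if last_seen is None else max(last_seen, value)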
One of the implementation details which may make it hard to see these monotonic-read violations in practice is how Amazon handles requests from the same process (which you said is your case). When the same process sends several requests in sequence, it may (but also may not) reuse the same HTTP connection to do so, and Amazon's internal load balancers may (but also may not) decide to send those requests to the same backend replica, despite the statement in the video that each request is sent to a random replica. If this happens, it may be hard to see monotonic-read violations in practice, but they can still happen if the load balancer changes its mind, or the client library opens another connection, and so on, so you still can't trust the monotonic-read property to hold.
Yes it is possible. Requests are stateless so a second read from the same client is just as likely as any other request to see slightly stale data. If that’s an issue, choose strong consistency.
You will (probably) never get the old data after getting the new data.
First off, there's no warning in the docs about repeated reads returning stale data, just that a read after a write may return stale data.
Eventually Consistent Reads
When you read data from a DynamoDB table, the response might not
reflect the results of a recently completed write operation. The
response might include some stale data. If you repeat your read
request after a short time, the response should return the latest
data.
But more importantly, every item in DDB is stored on three storage nodes. A write to DDB doesn't return 200 - Success until that data is written to 2 of the 3 storage nodes. Thus, it's only if your read is serviced by the third node that you'd see stale data. Once that third node is updated, every node has the latest data.
See Amazon DynamoDB Under the Hood
EDIT
@Nadav's answer points out that it's at least theoretically possible; AWS certainly doesn't seem to guarantee monotonic reads. But I believe the reality depends on your application architecture.
Most languages, nodejs being an exception, will use persistent HTTP/HTTPS connections by default to the DDB request router, especially given how expensive it is to open a TLS connection. I suspect, though I can't find any documents confirming it, that there's at least some level of stickiness from the request router to a storage node. @Nadav discusses this possibility. But only AWS knows for sure and they haven't said.
Assuming that belief is correct:
curl in a shell script loop - more likely to see the old data again
loop in C# using a single connection - less likely
The other thing to consider is that in the normal course of things, the third storage node is "only milliseconds behind".
Ironically, if the request router truly picks a storage node at random, a non-persistent connection is then less likely to see old data again given the extra time it takes to establish the connection.
If you absolutely need monotonic reads, then you'd need to use strongly consistent reads.
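With boto3, for example, that is just the ConsistentRead flag on the read (the table and key names here are made up):

import boto3

table = boto3.resource('dynamodb').Table('monotonic-test')
# ConsistentRead=True forces a strongly consistent (and therefore monotonic) read.
item = table.get_item(Key={'pk': 'counter'}, ConsistentRead=True).get('Item')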
Another option might be to stick DynamoDB Accelerator (DAX) in front of your DDB, especially if you're retrieving the key with GetItem(). As I read how it works, it does seem to imply monotonic reads, especially if you've written through DAX, though it does not come right out and say so. Even if you've written around DAX, reading from it should still be monotonic; it's just that there will be more latency until you start seeing the new data.

Prioritizing Transactions in Google AppEngine

Let's say I need to perform two different kinds of write operations on a datastore entity that might happen simultaneously, for example:
The client that holds a write-lock on the entry updates the entry's content
The client requests a refresh of the write-lock (updates the lock's expiration time-stamp)
As the content-update operation is only allowed if the client holds the current write-lock, I need to perform the lock-check and the content-write in a transaction (unless there is another way that I am missing?). Also, a lock-refresh must happen in a transaction because the client needs to first be confirmed as the current lock-holder.
The lock-refresh is a very quick operation.
The content-update operation can be quite complex. Think of it as the client sending the server a complicated update-script that the server executes on the content.
Given this, if there is a conflict between those two transactions (should they be executed simultaneously), I would much rather have the lock-refresh operation fail than the complex content-update.
Is there a way that I can "prioritize" the content-update transaction? I don't see anything in the docs and I would imagine that this is not a specific feature, but maybe there is some trick I can use?
For example, what happens if my content-update reads the entry, writes it back with a small modification (without committing the transaction), then performs the lengthy operation and finally writes the result and commits the transaction? Would the first write be applied immediately and cause a simultaneous lock-refresh transaction to fail? Or are all writes kept until the transaction is committed at the end?
Is there such a thing as keeping two transactions open? Or doing an intermediate commit in a transaction?
Clearly, I can just split my content-update into two transactions: The first one sets a "don't mess with this, please!"-flag and the second one (later) writes the changes and clears that flag.
But maybe there is some other trick to achieve this with fewer reads/writes/transactions?
Another thought I had was that there are 3 different "blocks" of data: The current lock-holder (LH), the lock expiration (EX), and the content that is being modified (CO). The lock-refresh operation needs to perform a read of LH and a write to EX in a transaction, while the content-update operation needs to perform a read of LH, a read of CO, and a write of CO in a transaction. Is there a way to break the data apart into three entities and somehow have the transactions span only the needed entities? Since LH is never modified by these two operations, this might help avoid the conflict in the first place?
The datastore uses optimistic concurrency control, which means that a (datastore primitive) transaction waits until it is committed, then succeeds only if someone else hasn't committed first. Typically, the app retries the failed transaction with fresh data. There is no way to modify this first-wins behavior.
It might help to know that datastore transactions are strongly consistent, so a client can first commit a lock refresh with a synchronous datastore call, and when that call returns, the client knows for sure whether it obtained or refreshed the lock. The client can then proceed with its update and lock clear. The case you describe where a lock refresh and an update might occur concurrently from the same client sounds avoidable.
I'm assuming you need the lock mechanism to prevent writes from other clients while the lock owner performs multiple datastore primitive transactions. If a client is actually only doing one update before it releases the lock and it can do so within seconds (well before the datastore RPC timeout), you might get by with just a primitive datastore transaction with optimistic concurrency control and retries. But a lock might be a good idea for simple serialization of, say, edits to a record in a user interface, where a user hits an "edit" button in a UI and you want that to guarantee that the user has some time to prepare and submit changes without the record being changed by someone else. (Whether that's the user experience you want is your decision. :) )
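As a hedged illustration of that two-step flow on the Python db API (the Lock model, key names, and the apply_update stub are all made up):

import datetime
from google.appengine.ext import db

class Lock(db.Model):
    holder = db.StringProperty()
    expires = db.DateTimeProperty()

def refresh_lock(lock_key, client_id):
    lock = db.get(lock_key)
    if lock is None or lock.holder != client_id or lock.expires < datetime.datetime.utcnow():
        raise LookupError('lock not held')  # transaction rolls back and the caller finds out
    lock.expires = datetime.datetime.utcnow() + datetime.timedelta(seconds=60)
    lock.put()

def apply_update(lock_key, client_id):
    pass  # re-check the lock, then perform the (possibly lengthy) content update

lock_key = db.Key.from_path('Lock', 'doc-1')
client_id = 'client-123'

# 1. Refresh the lock synchronously; when this returns without raising,
#    the client knows for sure it still holds the lock.
db.run_in_transaction(refresh_lock, lock_key, client_id)
# 2. Only then run the content update in its own transaction.
db.run_in_transaction(apply_update, lock_key, client_id)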

Appengine - querying database right after putting something in

In one place of code I do something like this:
FormModel(.. some data here..).put()
And a couple lines below I select from the database:
FormModel.all().filter(..).fetch(100)
The problem I noticed - sometimes the fetch doesn't notice the data I just added.
My theory is that this happens because I'm using high replication storage, and I don't give it enough time to replicate the data. But how can I avoid this problem?
Unless the data is in the same entity group, there is no way to guarantee that the data will be the most up to date (if I understand this section correctly).
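To make "same entity group" concrete, a rough sketch: if FormModel entities are created under a common parent key (made up here), an ancestor query on that parent is strongly consistent even on HRD:

from google.appengine.ext import db

class FormModel(db.Model):
    pass  # fields omitted

parent_key = db.Key.from_path('FormGroup', 'default')  # hypothetical parent

# Writing the entity into that entity group...
FormModel(parent=parent_key).put()

# ...lets an ancestor query see it right away (strong consistency within the group).
results = FormModel.all().ancestor(parent_key).fetch(100)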
Shay is right: there's no way to know when the datastore will be ready to return the data you just entered.
However, you are guaranteed that the data will be entered eventually, once the call to put completes successfully. That's a lot of information, and you can use it to work around this problem. When you get the data back from fetch, just append/insert the new entities that you know will be in there eventually! In most cases it will be good enough to do this on a per-request basis, I think, but you could do something more powerful that uses memcache to cover all requests (except cases where memcache fails).
The hard part, of course, is figuring out when you should append/insert which entities. It's obnoxious to have to do this workaround, but a relatively low price to pay for something as astonishingly complex as the HRD.
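A rough sketch of that patch-up, assuming a made-up status field on FormModel:

from google.appengine.ext import db

class FormModel(db.Model):
    status = db.StringProperty()  # hypothetical field

new_entity = FormModel(status='open')
new_entity.put()

# The query may not see the entity we just wrote (HRD eventual consistency)...
results = FormModel.all().filter('status =', 'open').fetch(100)

# ...so patch it in ourselves; the put() above is guaranteed to take effect eventually.
if new_entity.key() not in [r.key() for r in results]:
    results.append(new_entity)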
From https://developers.google.com/appengine/docs/java/datastore/transactions#Java_Isolation_and_consistency
This consistent snapshot view also extends to reads after writes
inside transactions. Unlike with most databases, queries and gets
inside a Datastore transaction do not see the results of previous
writes inside that transaction. Specifically, if an entity is modified
or deleted within a transaction, a query or get returns the original
version of the entity as of the beginning of the transaction, or
nothing if the entity did not exist then.

Sql Server transactions - usage recommendations

I saw this sentence not only in one place:
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
What does this really mean?
It puzzles me now because I want to use transactions for my app, which in normal use will deal with inserting hundreds of rows from many clients, concurrently.
For example, I have a service which exposes a method: AddObjects(List<Objects>), and of course these objects contain other, different nested objects.
I was thinking of starting a transaction for each call from the client, performing the appropriate actions (a bunch of inserts/updates/deletes for each object with their nested objects). EDIT1: I meant a transaction for the entire "AddObjects" call in order to prevent undefined states/behaviour.
Am I going in the wrong direction? If yes, how would you do that and what are your recommendations?
EDIT2: Also, I understood that transactions are fast for bulk operations, but that somehow contradicts the quoted sentence. What is the conclusion?
Thanks in advance!
A transaction has to cover a business-specific unit of work. It has nothing to do with generic 'objects'; it must always be expressed in domain-specific terms: 'debit of account X and credit of account Y must be in a transaction', 'subtraction of an inventory item and the sale must be in a transaction', etc. Everything that must either succeed together or fail together must be in a transaction. If you are going down the abstract path of 'adding objects to a list is a transaction' then yes, you are on the wrong path. The fact that all inserts/updates/deletes triggered by an object save are in a transaction is not a purpose, but a side effect. The correct semantics should be 'update of object X and update of object Y must be in a transaction'. Even the degenerate case of a single 'object' being updated should still be regarded in domain-specific terms.
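As a minimal sketch of that idea with pyodbc against SQL Server (the connection string, table, and column names are made up):

import pyodbc

conn = pyodbc.connect('DSN=MySqlServer;Trusted_Connection=yes', autocommit=False)
cursor = conn.cursor()
try:
    # Debit account X and credit account Y: one business unit of work, one transaction.
    cursor.execute("UPDATE Accounts SET Balance = Balance - ? WHERE Id = ?", 100, 'X')
    cursor.execute("UPDATE Accounts SET Balance = Balance + ? WHERE Id = ?", 100, 'Y')
    conn.commit()    # both updates become visible together
except Exception:
    conn.rollback()  # neither update is applied
    raise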
That recommendation is best understood as Do not allow user interaction in a transaction. If you need to ask the user during a transaction, roll back, ask and run again.
Other than that, do use transaction whenever you need to ensure atomicity.
It is not the transactions' fault if they cause "concurrency issues"; rather, the database might need some more thought, a better set of indices, or a more standardized data access order.
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
The longer a transaction is kept open the more likely it will lock resources that are needed by other transactions. This blocking will cause other concurrent transactions to wait for the resources (or fail depending on the design).
Sql Server is usually setup in Auto Commit mode. This means that every sql statement is a distinct transaction. Many times you want to use a multi-statement transaction so you can commit or rollback multiple updates. The longer the updates take, the more likely other transactions will conflict.

How to Decide to use Database Transactions

How do you guys decide that you should be wrapping the sql in a transaction?
Please throw some light on this.
Cheers !!
A transaction should be used when you need a set of changes to be processed completely to consider the operation complete and valid. In other words, if only a portion executes successfully, will that result in incomplete or invalid data being stored in your database?
For example, if you have an insert followed by an update, what happens if the insert succeeds and the update fails? If that would result in incomplete data (in this case, an orphaned record), you should wrap the two statements in a transaction to get them to complete as a "set".
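For example, a hedged sketch of that insert-then-update case with pyodbc (the tables and columns are made up):

import pyodbc

conn = pyodbc.connect('DSN=MySqlServer;Trusted_Connection=yes', autocommit=False)
cursor = conn.cursor()
try:
    cursor.execute("INSERT INTO Orders (CustomerId, Total) VALUES (?, ?)", 42, 0)
    order_id = cursor.execute("SELECT @@IDENTITY").fetchval()
    # If this update fails, the commit below never runs and the insert is rolled back,
    # so no orphaned order row is left behind.
    cursor.execute("UPDATE Customers SET LastOrderId = ? WHERE Id = ?", order_id, 42)
    conn.commit()
except Exception:
    conn.rollback()
    raise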
If you are executing two or more statements that you expect to be functionally atomic, you should wrap them in a transaction.
If you have more than a single data-modifying statement to execute to complete a task, all of them should be within a transaction.
This way, if the first one is successful, but any of the following ones has an error, you can rollback (undo) everything as if nothing was ever done.
Whenever you wouldn't like it if part of the operation can complete and part of it doesn't.
Anytime you want to lock up your database and potentially crash your production application, anytime you want to litter your application with hidden scalability nightmares go ahead and create a transaction. Make it big, slow, and put a loop inside.
Seriously, none of the above answers acknowledge the trade-off and potential problems that come with heavy use of transactions. Be careful, and consider the risk/reward each time.
Ebay doesn't use them at all. I'm sure there are many others.
http://www.infoq.com/interviews/dan-pritchett-ebay-architecture
Whenever an operation falls under the ACID (Atomicity, Consistency, Isolation, Durability) criteria, you should use transactions.
Read this article
When you want to use atomic or isolation property of database for a set of changes.
Atomicity: An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs (according to Wikipedia).
Isolation: Isolation determines how transaction integrity is visible to other users and systems (according to Wikipedia).
