Google AppEngine DataStore consistency

My current understanding of Google AppEngine's High Replication DataStore is the following:
Gets and puts of individual entities are always strongly consistent, i.e. once a put of an entity completes, no later get will ever return a version earlier than that put. More precisely: as soon as any one get returns the new version, no later get will ever return the old version again.
Gets and puts of multiple entities are strongly consistent if they belong to the same ancestor group and are performed in a transaction. That is, if two entities are both modified by puts in one transaction and "simultaneously" read by gets in a different transaction, the gets will return either the old versions of both entities or the new versions of both, depending on whether the put-transaction has completed at the time of the get; they will never return the old value of one entity and the new value of the other.
Queries with an ancestor filter can be chosen to be strongly or eventually consistent, where a strongly consistent query takes longer to complete, but will always return the "same" version (old or new) of all entities updated in the same transaction in this ancestor group and never some old and some new versions.
Queries that span ancestors are always eventually consistent, i.e. might return an old version of one result entity and a new version of another.
Did I get this right? Is this actually documented anywhere? (I only found some documentation about the query consistency here (between the first and second "Note") and here, but it doesn't talk about gets and puts...)

Yes, you're correct. They just word it slightly differently:
https://developers.google.com/appengine/docs/java/datastore/
Right at the beginning there are five bullet-point features. The last two describe your question, except that they refer to "reads" instead of "gets".
This probably adds to your confusion, but when they say "read" or "get", they mean fetching an entity directly, by key or ID. If you call the Python get function with an attribute other than the key or ID, it actually issues a query, which is eventually consistent (unless it is an ancestor query).
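To make the distinction concrete, here is a minimal Python ndb sketch (model and values are illustrative, not from the question):

from google.appengine.ext import ndb

class Entry(ndb.Model):
    title = ndb.StringProperty()

key = Entry(title='hello').put()

entry = key.get()  # get by key: strongly consistent
# A non-ancestor query goes through the global indexes and is only
# eventually consistent - it may not see the entity just written.
results = Entry.query(Entry.title == 'hello').fetch()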

Related

Working with accumulated bucket values in Entity Framework

I'm attempting to find design patterns/strategies for working with accumulated bucket values in a database where concurrency can be a problem. I don't know the proper search terms to use to find information on the topic.
Here's my use case (I'm using code-first Entity Framework, so EF-specific advice is welcome):
I have a database table that contains a quantity value. This quantity value can be incremented or decremented by multiple clients at the same time. (I call this a "bucket" value because it is a bucket for a bunch of accumulated activity, as opposed to the other strategy where you keep all the activity records and calculate the value from them.) I am looking for strategies, within the context of EF, for ensuring the accuracy of this bucket value when multiple clients may attempt to change it simultaneously (concurrency).
The answer "you must track activity and derive your value from that activity" is acceptable, but I want to consider all bucket-centric solutions as well.
I am looking for advice on search terms to use to find good information on this topic as well as specific links.
Edit: You may assume that all activity is relative to the "bucket" value (no clients will be making an absolute change to the value; they will only increment or decrement).
Without directly coding the SQL Queries that update the buckets, you would have to use client-side Optimistic Concurrency. See Entity Framework Optimistic Concurrency Patterns. Clients whose update would overwrite a change will get an exception, after which you can reload with the current value and retry. This pattern requires a ROWVERSION column on the target table.
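The read-compare-retry loop itself is framework-agnostic; here is a minimal sketch of the same idea in Python with sqlite3, using an integer version column in place of EF's ROWVERSION (table and column names are made up for illustration):

import sqlite3

def increment_bucket(conn, row_id, delta, max_retries=5):
    for _ in range(max_retries):
        qty, version = conn.execute(
            'SELECT quantity, version FROM buckets WHERE id = ?',
            (row_id,)).fetchone()
        # The UPDATE succeeds only if nobody changed the row since we read it.
        cur = conn.execute(
            'UPDATE buckets SET quantity = ?, version = version + 1 '
            'WHERE id = ? AND version = ?',
            (qty + delta, row_id, version))
        conn.commit()
        if cur.rowcount == 1:
            return qty + delta  # our optimistic write won
    raise RuntimeError('gave up after repeated concurrency conflicts')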
If you code the updates in TSQL you can code an atomic update, something like
update foo with (updlock)
set bucket_a = bucket_a + 1
output inserted.*
where id = @id
(The 'updlock' isn't strictly necessary in this query, but is good form any time you want to ensure this kind of isolation)

What is expected behavior of cloudant PUT with rev for non-existent item?

In a Cloudant database, what is the expected behavior of calling PUT on a document that doesn't exist with a revision defined?
The documentation says:
To update (or create) a document, make a PUT request with the updated JSON content and the latest _rev value (not needed for creating new documents) to https://$USERNAME.cloudant.com/$DATABASE/$DOCUMENT_ID.
I had assumed that if I did provide a revision, that the db would detect that it was not a match and reject the request. In my test cases I have inconsistent behavior. Most of the time I get the expected 409, Document update conflict. However, occasionally, the document ends up getting created (201), and assigned the next revision.
My test consists of creating a document and then using that revision to update a different document.
POST https://{url}/{db} {_id: "T1"} - store the returned revision
PUT https://{url}/{db}/T2 {_rev: "<revision returned from step 1>"}
So if the revision returned was something like 1-79c389ffdbcfe6c33ced242a13f2b6f2, then in the cases where the PUT succeeds, it returns the next revision (like 2-76054ab954c0ef41e9b82f732116154b).
EDIT
If I simplify the test to one step, I can also get different results.
PUT https://{url}/{db}/DoesNotExist {_rev: "1-ffffffffffffffffffffffffffffffff"}
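For reference, the one-step test can be reproduced with a short Python script using the requests library (account, database, and credentials are placeholders):

import requests

resp = requests.put(
    'https://USERNAME.cloudant.com/DATABASE/DoesNotExist',
    json={'_rev': '1-ffffffffffffffffffffffffffffffff'},
    auth=('USERNAME', 'PASSWORD'))
# Usually 409 (update conflict); occasionally 201 (created), as described above.
print(resp.status_code, resp.json())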
Cloudant is an eventually consistent database. You're seeing the effects of that. Most of the time the cluster has time to reach a consistent state between your two API calls and you'll get the expected update conflict. Sometimes you hit the inconsistency window, as your first call has not yet been replicated around the cluster and you hit a different node. It's a valuable insight: it's not safe to read your writes.

Google App Engine / NDB - Strongly Consistent Read of Entity List after Put

Using Google App Engine's NDB datastore, how do I ensure a strongly consistent read of a list of entities after creating a new entity?
The example use case is that I have entities of the Employee kind.
Create a new employee entity
Immediately load a list of employees (including the one that was added)
I understand that the approach below will yield an eventually consistent read of the list of employees which may or may not contain the new employee. This leads to a bad experience in the case of the latter.
e = Employee(...)
e.put()  # the write completes, but the global indexes may not have caught up yet
Employee.query().fetch(...)  # eventually consistent: may miss the new employee
Now here are a few options I've thought about:
IMPORTANT QUALIFIERS
I only care about a consistent list read for the user who added the new employee. I don't care if other users get an eventually consistent read.
Let's assume I do not want to put all the employees under an Ancestor to enable a strongly consistent ancestor query. In the case of thousands and thousands of employee entities, the 5 writes / second limitation is not worth it.
Let's also assume that I want the write and the list read to be the result of two separate HTTP requests. I could theoretically put both write and read into a single transaction (?) but then that would be a very non-RESTful API endpoint.
Option 1
Create a new employee entity in the datastore
Additionally, write the new employee object to memcache, local browser cookie, local mobile storage.
Query datastore for list of employees (eventually consistent)
If the new employee entity is not in this list, add it to the list (in my application code) from memcache / local memory (see the sketch after this list)
Render results to user. If user selects the new employee entity, retrieve the entity using key.get() (strongly consistent).
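A minimal sketch of Option 1's merge step in ndb, assuming the create handler also did memcache.set('new_employee_' + user_id, e) (names are illustrative):

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Employee(ndb.Model):
    name = ndb.StringProperty()

def list_employees_for(user_id):
    employees = Employee.query().fetch()  # eventually consistent
    cached = memcache.get('new_employee_' + user_id)  # the fresh write, if any
    if cached and cached.key not in [e.key for e in employees]:
        employees.append(cached)  # patch the new entity into the results
    return employees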
Option 2
Create a new employee entity using a transaction
Query datastore for list of employees in a transaction
I'm not sure Option #2 actually works.
Technically, does the previous write transaction get written to all the servers before the read transaction of that entity occurs? Or is this not correct behavior?
Transactions (including XG) have a limit on the number of entity groups involved, and a list of employees (each being its own entity group) could exceed this limit.
What are the downsides of read-only transactions vs. normal reads?
Thoughts? Option #1 seems like it would work, but it seems like a lot of work to ensure consistency on a follow-on read.
If you do not use an entity group, you can do a keys-only query followed by a get_multi(keys) lookup for entity consistency. For the new employee, you have to add the new key to the key list passed to get_multi.
Docs: A combination of the keys-only, global query with a lookup method will read the latest entity values. But it should be noted that a keys-only global query can not exclude the possibility of an index not yet being consistent at the time of the query, which may result in an entity not being retrieved at all. The result of the query could potentially be generated based on filtering out old index values. In summary, a developer may use a keys-only global query followed by lookup by key only when an application requirement allows the index value not yet being consistent at the time of a query.
More info and magic here : Balancing Strong and Eventual Consistency with Google Cloud Datastore
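A sketch of that keys-only + lookup approach, with the new employee's key force-included so the lookup covers it even if the index hasn't caught up (model name is illustrative):

from google.appengine.ext import ndb

class Employee(ndb.Model):
    name = ndb.StringProperty()

new_key = Employee(name='Alice').put()

keys = Employee.query().fetch(keys_only=True)  # eventually consistent index scan
if new_key not in keys:
    keys.append(new_key)  # make sure the fresh entity is looked up too
employees = ndb.get_multi(keys)  # lookups by key return the latest values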
I had the same problem, option #2 doesn't really work: a read using the key will work, but a query might still miss the new employee.
Option #1 could work, but only within the same request. The saved memcache key can disappear at any time; a subsequent query on the same instance, or on another instance potentially running on another piece of hardware, could still miss the new employee.
The only "solution" that comes to mind for consistent query results is to not attempt to force the new employee into the results and instead let things flow naturally until it shows up. I'd just add a warning that creating the new user will take "a while". If tolerable, maybe keep polling/querying in the original request until it shows up - that is the only place where the employee creation event is known with certainty.
This question is old as I write this. However, it is a good question and will be relevant long term.
Option #2 from the original question will not work.
If the entity creation and the subsequent query are truly independent, with no context linking them, then you are really just stuck - or you don't care. The trick is that there is almost always some relationship or some use case that must be covered. In other words, if the query is truly ad hoc, then you really don't care; in that case you just quote the CAP theorem and remind the client executing the query how great it is that this system scales. But almost always, if you are worried about eventual consistency, there is some use case or set of cases that must be handled. For example, if you have a high-score list, the highest score must be at the top of the list, and that score may have just been achieved by the user who is now looking at the list. Another example: when an employee is created, that employee must be on the "new employees" list.
So what you usually do is exploit these known cases to balance the throughput needed with consistency. For the high-score example, you may be able to afford to keep a secondary index (an entity) that is the list of high scores. You always get it by key, and you can write to it as frequently as needed because, presumably, high scores are not generated that often. For the new-employee example, you might use the approach you started to suggest, storing the timestamp of the last employee in memcache; then, when you query, you check that your list includes that employee... or something along those lines.
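For instance, a sketch of the high-score idea in ndb: a single entity, always read and written by key, so it is strongly consistent (model and names are illustrative):

from google.appengine.ext import ndb

class HighScores(ndb.Model):
    scores = ndb.JsonProperty()  # list of [user, score] pairs

@ndb.transactional
def record_score(user, score, top_n=10):
    board = HighScores.get_or_insert('singleton', scores=[])
    board.scores = sorted(board.scores + [[user, score]],
                          key=lambda s: s[1], reverse=True)[:top_n]
    board.put()  # one small entity group, so writes are serialized and consistent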
The price in balancing write throughput and consistency on App Engine and similar systems is always the same. It requires increased model complexity / code complexity to bridge the business needs.

What's the difference between findAndModify and update in MongoDB?

I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? For me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read the MongoDB: the definitive guide and it says that it is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?
If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.
findAndModify returns the document, update does not.
If I understood Dwight Merriman (one of the original authors of MongoDB) correctly, using update to modify a single document (i.e. {multi: false}) is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.
From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you could not specify a write concern to findAndModify() to override the default write concern, whereas you have been able to specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.
One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests):
find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented value in one step. Compare: if you (A) perform this operation in two steps and somebody else (B) does the same operation between your steps, then A and B may get the same last counter value instead of two different ones (just one example of possible issues).
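In Python, the same counter pattern looks like this with pymongo's find_one_and_update (the driver-level findAndModify; connection details and names are placeholders):

from pymongo import MongoClient, ReturnDocument

counters = MongoClient().test.counters
doc = counters.find_one_and_update(
    {'_id': 'ticket'},
    {'$inc': {'seq': 1}},
    upsert=True,
    return_document=ReturnDocument.AFTER)  # get the value after the increment
print(doc['seq'])  # unique even if many clients run this concurrently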
This is an old question, but an important one, and the other answers just led me to more questions until I realized: the two methods are quite similar, and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact the <query> and <update> parameters are largely identical.
With both, the atomic change takes place directly on a document matching the query when the server finds it, i.e. an internal write lock is held on that document for the fraction of a millisecond the server takes to confirm the query matches and apply the update.
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
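As an aside, here is a sketch (in pymongo, with made-up field names) of the document-level compare-and-set lock just described, including a crude staleness check for abandoned locks:

from datetime import datetime, timedelta
from pymongo import MongoClient

docs = MongoClient().test.docs

def try_lock(doc_id, owner):
    stale = datetime.utcnow() - timedelta(minutes=5)
    claimed = docs.find_one_and_update(
        {'_id': doc_id,
         '$or': [{'lock': None}, {'lock.at': {'$lt': stale}}]},  # free or abandoned
        {'$set': {'lock': {'by': owner, 'at': datetime.utcnow()}}})
    return claimed is not None  # True only if we atomically claimed the lock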
From what I can tell there are two main ways the methods differ:
If you want a copy of the document when your update was made: only findAndModify allows this, returning either the original (default) or new record after the update, as mentioned; with update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process also changing the record in between your read and update
If there are potentially multiple matching documents: findAndModify only changes one, and allows you to customize the sort to indicate which one should be changed; update can change all of them with multi (although it defaults to just one), but does not let you say which one.
Thus it makes sense what HungryCoder says, that update is more efficient where you can live with its restrictions (eg you don't need to read the document; or of course if you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
We used findAndModify() for counter operations (increment or decrement) and other single-field mutation cases. Migrating our application from Couchbase to MongoDB, I found this API could replace the code that did GetAndLock(), modified the content locally, called Replace() to save, and called Get() again to fetch the updated document. With MongoDB, I just used this single API, which returns the updated document.

Does eventual consistency apply to the set of results of a query? Or the entities themselves that are returned?

I'm using the HRD on Appengine.
Say I have a query that cuts across entity groups (i.e. not an ancestor query). I understand that the set of results returned by this query may not be consistent:
For example, the query may return 4 entities {A, B, C, D} even though a 5th entity, E, matches the query. This makes sense.
However, in the inconsistent query above, is it ALSO the case that any of the results in the set may themselves be inconsistent (i.e. their fields are not the freshest)? That is, if A has a property called foo, is foo consistent?
My question boils down to, which part of the query is inconsistent - the set of results, the properties of the returned results, or both?
Eventual consistency applies to both the entities themselves and the indexes. This means that if you modify an entity, then query with a filter that matches only the modified one (not the value before modification), you could get no records. It also means that potentially you could get entities back from a query whose current versions do not match the index criteria they were fetched for.
You can ensure you have the latest copy of an entity by doing a consistent get (though outside a transaction, this is fairly meaningless, since it could have changed the moment you do the get), but there's no equivalent way to do a consistent index lookup.
I think the answer is that inconsistency can occur both in the set of results and in the properties of the returned results, because inconsistency occurs when you query a replica (or data center, as in the Google docs) that doesn't yet know about some write you made before. And the write can be anything: creating a new entity or updating an existing one.
So if you have for example the entity A with property x and you:
update x on A to 50 (previously it was 40)
query for entities with x >= 30
Then you will certainly get this entity in the result set, but it can have an old value of x (40) if the replica you queried didn't yet know about your update.
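In ndb terms (a sketch; the model is illustrative):

from google.appengine.ext import ndb

class A(ndb.Model):
    x = ndb.IntegerProperty()

a = A(x=40)
a.put()
a.x = 50
a.put()  # replicas and global indexes catch up eventually

# May be served by a replica that still holds x == 40, so the entity
# in the result set can carry the old property value.
results = A.query(A.x >= 30).fetch()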