Google Datastore app architecture questions - google-app-engine

I'm working on a Google AppEngine app connecting to the Google Cloud Datastore via its JSON API (I'm using PHP).
I'm reading all the documentation provided by Google and I still have questions:
In the documentation about Transactions, there is the following mention: "Transactions must operate on entities that belong to a limited number (5) of entity groups" (by the way, a few lines later we can find: "All Datastore operations in a transaction can operate on a maximum of twenty-five entity groups"). I'm not sure what an entity group is. Let's say I have an object Country which is identified only by its kind (COUNTRY) and a Datastore auto-assigned key id. So there is no ancestor path, hierarchical relationship, etc... Do all the Country entities count as only 1 entity group, or does each country count as its own group?
For the Country entity kind I need an incremental unique id (like SQL AUTOINCREMENT). It has to be absolutely unique and without gaps. Also, this kind of object won't be created more than a few times per minute, so there is no need to handle contention and sharding. I'm thinking about having a unique counter that reflects the auto-increment and using it inside a transaction. Is the following code pattern OK?:
Start a transaction, get the counter, commit the creation of the Country along with the update of the counter, and roll back the transaction if the commit fails. Does this pattern prevent the same id from being assigned twice? Could you confirm that if 2 processes get the counter at the same time (and thus the same value), the first one that commits will make the other one fail (so it will be able to restart and get the new counter value)?
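To make it concrete, here is a rough sketch of the pattern I have in mind, written with Python NDB just for readability (I'm actually on PHP + the JSON API, and the kind and property names below are made up):

from google.appengine.ext import ndb

class CountryCounter(ndb.Model):
    value = ndb.IntegerProperty(default=0)

class Country(ndb.Model):
    seq_id = ndb.IntegerProperty()
    name = ndb.StringProperty()

@ndb.transactional(xg=True)  # counter and country live in different entity groups
def create_country(name):
    counter_key = ndb.Key(CountryCounter, 'country-counter')
    counter = counter_key.get() or CountryCounter(key=counter_key)
    counter.value += 1
    country = Country(seq_id=counter.value, name=name)
    ndb.put_multi([counter, country])
    return country

My understanding is that if two such calls read the same counter value, only the first commit succeeds and the other one fails and can be retried with the fresh value. Is that right?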
The documentation also mentions that: "If your application receives an exception when attempting to commit a transaction, it does not necessarily mean that the transaction has failed. It is possible to receive exceptions or error messages even when a transaction has been committed and will eventually be applied successfully." How are we supposed to handle that case? If this behavior occurs on the creation of my Country (question #2), won't I have an issue with my auto-increment id?
Since the Datastore requires all the write actions of a transaction to be done in a single call, and since the transaction ensures that either all or none of the transaction's actions will be performed, why do we have to issue a rollback at all?
Is the limit of 1 write/sec per entity (i.e. something defined by its kind and its key path) or per whole entity group? (I will be reassured only once I'm sure what exactly an entity group is; see question #1.)
I'm stopping here so as not to make a huge post. I'll probably come back with other (or refined) questions after getting answers to these ones ;-)
Thanks for your help.
[UPDATE] Country is just used as a sample class object.

No, ('Country', 123123) and ('Country', 679621) are not in the same entity group. But ('Country', 123123, 'City', '1') and ('Country', 123123, 'City', '2') are in the same entity group. Entities with the same ancestor are in the same group.
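A quick sketch in Python NDB, using the kinds from your example: the entity group is determined by the root of the key path.

from google.appengine.ext import ndb

# Two root Country keys: two different entity groups.
country_a = ndb.Key('Country', 123123)
country_b = ndb.Key('Country', 679621)

# Two City keys with the same Country ancestor: same entity group,
# because they share the root ('Country', 123123) of their key path.
city_1 = ndb.Key('Country', 123123, 'City', '1')
city_2 = ndb.Key('Country', 123123, 'City', '2')

assert city_1.parent() == country_a
assert city_2.parent() == country_a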
It sounds like a really bad idea to use auto-increment for things like countries. Just generate an ID based on the name of the country.
From the same paragraph:
Whenever possible, structure your Datastore transactions so that the end result will be unaffected if the same transaction is applied more than once.
In internal Datastore APIs like db or ndb you don't have to worry about rolling back; it happens automatically.
It's about 1 write per second per whole entity group; that's why you need to keep groups as small as possible.

Related

Working with accumulated bucket values in Entity Framework

I'm attempting to find design patterns/strategies for working with accumulated bucket values in a database where concurrency can be a problem. I don't know the proper search terms to use to find information on the topic.
Here's my use case (I'm using code-first Entity Framework, so EF-specific advice is welcome):
I have a database table that contains a quantity value. This quantity value can be incremented or decremented by multiple clients at the same time (because of this, I call it a "bucket" value, as it is a bucket for a bunch of accumulated activity; this is in opposition to the other strategy where you keep all the activity and calculate the value from that activity). I am looking for strategies to ensure the accuracy of this "bucket" value (within the context of EF) that take into consideration that multiple clients may attempt to change it simultaneously (concurrency).
The answer "you must track activity and derive your value from that activity" is acceptable, but I want to consider all bucket-centric solutions as well.
I am looking for advice on search terms to use to find good information on this topic as well as specific links.
Edit: You may assume that all activity is relative to the "bucket" value (no clients will be making an absolute change to the value; they will only increment or decrement).
Without directly coding the SQL queries that update the buckets, you would have to use client-side optimistic concurrency. See Entity Framework Optimistic Concurrency Patterns. Clients whose update would overwrite a change will get an exception, after which you can reload the current value and retry. This pattern requires a ROWVERSION column on the target table.
If you code the updates in TSQL you can write an atomic update, something like:
update foo with (updlock)
set bucket_a = bucket_a + 1
output inserted.*
where id = @id
(The updlock hint isn't strictly necessary in this query, but it is good form any time you want to ensure this kind of isolation.)

Google App Engine / NDB - Strongly Consistent Read of Entity List after Put

Using Google App Engine's NDB datastore, how do I ensure a strongly consistent read of a list of entities after creating a new entity?
The example use case is that I have entities of the Employee kind.
Create a new employee entity
Immediately load a list of employees (including the one that was added)
I understand that the approach below will yield an eventually consistent read of the list of employees which may or may not contain the new employee. This leads to a bad experience in the case of the latter.
e = Employee(...)
e.put()
# Global (non-ancestor) query: eventually consistent, may not include e yet.
Employee.query().fetch(...)
Now here are a few options I've thought about:
IMPORTANT QUALIFIERS
I only care about a consistent list read for the user who added the new employee. I don't care if other users get an eventually consistent read.
Let's assume I do not want to put all the employees under an Ancestor to enable a strongly consistent ancestor query. In the case of thousands and thousands of employee entities, the 5 writes / second limitation is not worth it.
Let's also assume that I want the write and the list read to be the result of two separate HTTP requests. I could theoretically put both write and read into a single transaction (?) but then that would be a very non-RESTful API endpoint.
Option 1
Create a new employee entity in the datastore
Additionally, write the new employee object to memcache, local browser cookie, local mobile storage.
Query datastore for list of employees (eventually consistent)
If the new employee entity is not in this list, add it to the list (in my application code) from memcache / local memory
Render results to user. If user selects the new employee entity, retrieve the entity using key.get() (strongly consistent).
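Roughly, I imagine Option 1 looking something like this (the memcache key name, expiry and fetch limit are just placeholders):

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Employee(ndb.Model):
    name = ndb.StringProperty()

def create_employee(name, user_id):
    e = Employee(name=name)
    e.put()
    # Remember what this user just created (hypothetical memcache key).
    memcache.set('last_employee_%s' % user_id, e.key.urlsafe(), time=120)
    return e

def list_employees(user_id):
    employees = Employee.query().fetch(100)      # eventually consistent
    cached = memcache.get('last_employee_%s' % user_id)
    if cached:
        key = ndb.Key(urlsafe=cached)
        if key not in [emp.key for emp in employees]:
            recent = key.get()                   # strongly consistent by key
            if recent:
                employees.append(recent)
    return employees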
Option 2
Create a new employee entity using a transaction
Query datastore for list of employees in a transaction
I'm not sure Option #2 actually works.
Technically, does the previous write transaction get written to all the servers before the read transaction of that entity occurs? Or is this not correct behavior?
Transactions (including XG) have a limit on number of entity groups and a list of employees (each is its own entity group) could exceed this limit.
What are the downsides of read-only transactions vs. normal reads?
Thoughts? Option #1 seems like it would work, but it seems like a lot of work to ensure consistency on a follow-on read.
If you do not use an entity group, you can do a keys-only query followed by a get_multi(keys) lookup for entity consistency. For the new employee, you have to add the new key to the key list passed to get_multi.
Docs: A combination of the keys-only, global query with a lookup method will read the latest entity values. But it should be noted that a keys-only global query can not exclude the possibility of an index not yet being consistent at the time of the query, which may result in an entity not being retrieved at all. The result of the query could potentially be generated based on filtering out old index values. In summary, a developer may use a keys-only global query followed by lookup by key only when an application requirement allows the index value not yet being consistent at the time of a query.
More info and magic here : Balancing Strong and Eventual Consistency with Google Cloud Datastore
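A minimal sketch of that keys-only + lookup combination in NDB (assuming your Employee kind; the fetch limit is arbitrary):

from google.appengine.ext import ndb

class Employee(ndb.Model):
    name = ndb.StringProperty()

def list_employees(new_key=None):
    # Eventually consistent index scan, but it only returns keys.
    keys = Employee.query().fetch(100, keys_only=True)
    # Make sure the entity we just wrote is in the key list even if the
    # index hasn't caught up yet.
    if new_key is not None and new_key not in keys:
        keys.append(new_key)
    # Lookups by key read the latest entity values.
    return [e for e in ndb.get_multi(keys) if e is not None]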
I had the same problem, option #2 doesn't really work: a read using the key will work, but a query might still miss the new employee.
Option #1 could work, but only within the same request. The saved memcache key can disappear at any time; a subsequent query on the same instance, or one on another instance potentially running on other hardware, would still miss the new employee.
The only "solution" that comes to mind for consistent query results is to actually not attempt to force the new employee into the results and rather leave things flow naturally until it does. I'd just add a warning that creating the new user will take "a while". If tolerable maybe keep polling/querying in the original request until it shows up? - that would be the only place where the employee creation event is known with certainty.
This question is old as I write this. However, it is a good question and will be relevant long term.
Option #2 from the original question will not work.
If the entity creation and the subsequent query are truly independent, with no context linking them, then you are really just stuck - or you don't care. The trick is that there is almost always some relationship or some use case that must be covered. In other words if the query is truly some kind of, essentially, ad hoc query, then you really don't care. In that case, you just quote CAP theorem and remind the client executing the query how great it is that this system scales. However, almost always, if you are worried about the eventual consistency, there is some use case or set of cases that must be handled. For example, if you have a high score list, the highest score must be at the top of the list. The highest score may have just been achieved by the user who is now looking at the list. Another example might be that when an employee is created, that employee must be on the "new employees" list.
So what you usually do is exploit these known cases to balance the throughput needed with consistency. For example, for the high score example, you may be able to afford to keep a secondary index (an entity) that is the list of the high scores. You always get it by key and you can write to it as frequently as needed because high scores are not generated that often presumably. For the new employee example, you might use an approach that you started to suggest by storing the timestamp of the last employee in memcache. Then when you query, you check to make sure your list includes that employee ... or something along those lines.
The price in balancing write throughput and consistency on App Engine and similar systems is always the same. It requires increased model complexity / code complexity to bridge the business needs.

AppEngine entity modeling - minimizing entity groups and achieving atomic cascading update/delete

I am learning App Engine and have started developing a new app, and I want to clarify something.
I understood that
a. To achieve atomicity of updates/deletes across several entities, we need to do them in a transaction, and hence all the entities should fall under the same entity group
b. Having big entity groups is not scalable as it causes contention.
(Q1: Correct?)
So here is an entity model of an online examination system, for the sake of discussion:
Entities:
Subject
Exam
Page
Question
Answer
As you can see from the top, each entity has a one-to-many relationship with the one immediately below it, i.e. 1 subject can have many exams, 1 exam can have many pages, 1 page can have many questions...
As you can see, I would like to establish cascading update/delete relationships among these entities (the JPA/DataNucleus App Engine implementation supports this under the hood by putting all entities in the same entity group (Q2: correct?), even though App Engine natively doesn't support this constraint), so naturally all of them would go into the same entity group, so that
a. I can delete a page (if my user does) in a transaction and be sure that all its questions and answers are deleted as well
b. or I can delete a subject altogether in a transaction and clear all the stuff underneath it
So when I extend this to my real app, I see that all of my entities (or at least most of them) are interrelated and would have to fit into the same entity group to be able to transact on them all together, making my model inefficient.
Q3: Please advise on how to rethink this design (and on best practice) and still achieve what I need. Ask me for more details if needed.
Would be great if you could point me to relevant examples.
P.S. One solution I could think of is having each entity in a separate entity group and a separate persistent field in each entity (say Exam) named IS_DELETED, defaulting to FALSE (value 0). Once a user deletes an exam, I set the field to 1 (TRUE) so that I don't load it anymore. I would then write a cron job which clears all the related entities in separate transactions in the background, retrying upon failures if needed. But I'm sure this is not elegant, and I'm not sure whether it will work out.
Thanks all for your responses,
Hari
One of the simplest ways to improve things is to just have fewer entities in the first place. I can't really think of a terribly good reason why pages, questions and answers need to be separate entities. I suspect you normally display all of the questions on a single page in the same request, without exception. If that's really the case, just keep them in one entity.
It does make a lot of sense to use the Exam entities as the parent for pages; for one thing, each exam is probably limited to a reasonable, small number of pages, so scaling this up probably won't hurt much.
On the other hand, there probably are a great many exams per subject, and for that reason, subjects should not appear in the ancestry of exams (and by extension, pages).
If, for some reason you needed to delete all of the exams in the subject of math, even if they were in the same entity group, you'd probably be unable to complete the whole delete in one transaction without timing out. You might even have trouble completing the delete in a single request.
That suggests that you should be using the task queue for this operation. When a cascading change on a subject occurs, the request handler needs to insert a new task and then just return successfully. Don't forget to update the subject entity itself right there in the request handler.
The task pulls a block of affected entities from the datastore, updates them, and then checks the time. If there is still time available for more updates, it pulls another block of entities, and so on, until none remain. If time is almost up, the task just adds itself back to the queue so it can restart where it left off when it respawns.
It's a good idea to schedule the first task at least a few seconds into the future of the initial request, so that if, for instance, the subject was deleted, the delete can propagate to future requests and no new exams in that subject can be created by the time the task starts.
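A rough sketch of such a task, assuming (per the above) that Exam entities reference their subject through a key property rather than an ancestor. The handler URL, batch size and kinds are made up, and this version simply re-enqueues itself after each batch instead of checking the clock:

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

BATCH_SIZE = 200  # made-up batch size

class Exam(ndb.Model):
    subject = ndb.KeyProperty()

def delete_subject(subject_key):
    # In the request handler: delete the subject itself, then hand the
    # cascading work to a task that starts a few seconds later.
    subject_key.delete()
    taskqueue.add(url='/tasks/cascade_delete',            # hypothetical handler
                  params={'subject': subject_key.urlsafe()},
                  countdown=5)

def cascade_delete(subject_urlsafe):
    # Body of the hypothetical /tasks/cascade_delete handler.
    subject_key = ndb.Key(urlsafe=subject_urlsafe)
    exam_keys = Exam.query(Exam.subject == subject_key).fetch(
        BATCH_SIZE, keys_only=True)
    ndb.delete_multi(exam_keys)
    if len(exam_keys) == BATCH_SIZE:
        # More left: re-enqueue so the next batch picks up where we left off.
        taskqueue.add(url='/tasks/cascade_delete',
                      params={'subject': subject_urlsafe})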

Clarification: can I put all of a user's data in a single entity group by making up an ancestor key?

I want to do several operations on a user's data in a single transaction, but won't need to update multiple users' data in a single transaction. I see from http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html#Entity_Groups_Ancestors_and_Paths that "A good rule of thumb for entity groups is that [entity groups] should be about the size of a single user's worth of data or smaller," so I think the correct choice is to use a single parent key when building the keys for the other entities related to a user.
Does this seem like a good idea?
Is it easy to code? Something like KeyBuilder.setParent(theKeyOfMyUserEntity)?
1) It is hard to comment without some additional details about the data. There are several things you should be aware of with entity groups; the biggest is that the group is stored together. That means that if you are trying to do many (separate) updates you could face contention, limiting your app's performance.
2) Yes, it is easy to code. The syntax is pretty close to what you posted.
There are other options for transactions. Check out Nick Johnson's article on distributed transactions. If you are wanting transactions for aggregates you should also check out Brett Slatkin's IO talk on high-throughput data pipelines.
Yes, it seems reasonable to store some user data as child entities of a User entity.
Why do you need to manually create keys? The db.Model() constructor already has a convenient "parent" argument which will automatically put both the parent entity and the child entity in the same entity group.
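For example, with the old db API (the kinds here are hypothetical):

from google.appengine.ext import db

class User(db.Model):
    name = db.StringProperty()

class BankAccount(db.Model):
    balance = db.IntegerProperty(default=0)

user = User(key_name='alice', name='Alice')
user.put()

# Same entity group as the user, no manual key building needed.
account = BankAccount(parent=user, balance=100)
account.put()

# Ancestor query stays inside the group and is strongly consistent.
accounts = BankAccount.all().ancestor(user).fetch(10)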

Can I do transactions and locks in CouchDB?

I need to do transactions (begin, commit or rollback) and locks (select for update).
How can I do it in a document model db?
Edit:
The case is this:
I want to run an auction site.
And I'm thinking about how to handle direct purchases as well.
In a direct purchase I have to decrement the quantity field in the item record, but only if the quantity is greater than zero. That is why I need locks and transactions.
I don't know how to address that without locks and/or transactions.
Can I solve this with CouchDB?
No. CouchDB uses an "optimistic concurrency" model. In the simplest terms, this just means that you send a document version along with your update, and CouchDB rejects the change if the current document version doesn't match what you've sent.
It's deceptively simple, really. You can reframe many normal transaction based scenarios for CouchDB. You do need to sort of throw out your RDBMS domain knowledge when learning CouchDB, though. It's helpful to approach problems from a higher level, rather than attempting to mold Couch to a SQL based world.
Keeping track of inventory
The problem you outlined is primarily an inventory issue. If you have a document describing an item, and it includes a field for "quantity available", you can handle concurrency issues like this:
Retrieve the document, take note of the _rev property that CouchDB sends along
Decrement the quantity field, if it's greater than zero
Send the updated document back, using the _rev property
If the _rev matches the currently stored number, be done!
If there's a conflict (when _rev doesn't match), retrieve the newest document version
In this instance, there are two possible failure scenarios to think about. If the most recent document version has a quantity of 0, you handle it just like you would in a RDBMS and alert the user that they can't actually buy what they wanted to purchase. If the most recent document version has a quantity greater than 0, you simply repeat the operation with the updated data, and start back at the beginning. This forces you to do a bit more work than an RDBMS would, and could get a little annoying if there are frequent, conflicting updates.
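A rough sketch of that loop against CouchDB's HTTP API, using Python's requests library (the database URL and field names are made up):

import requests

DB = 'http://localhost:5984/shop'   # made-up database URL

def buy_item(doc_id):
    while True:
        doc = requests.get('%s/%s' % (DB, doc_id)).json()
        if doc.get('quantity', 0) <= 0:
            return False                      # sold out
        doc['quantity'] -= 1
        # The _rev we fetched travels back inside the document body.
        resp = requests.put('%s/%s' % (DB, doc_id), json=doc)
        if resp.status_code in (201, 202):
            return True                       # our _rev matched, update stored
        if resp.status_code == 409:
            continue                          # conflict: someone got there first, retry
        resp.raise_for_status()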
Now, the answer I just gave presupposes that you're going to do things in CouchDB in much the same way that you would in an RDBMS. I might approach this problem a bit differently:
I'd start with a "master product" document that includes all the descriptor data (name, picture, description, price, etc). Then I'd add an "inventory ticket" document for each specific instance, with fields for product_key and claimed_by. If you're selling a model of hammer, and have 20 of them to sell, you might have documents with keys like hammer-1, hammer-2, etc, to represent each available hammer.
Then, I'd create a view that gives me a list of available hammers, with a reduce function that lets me see a "total". These are completely off the cuff, but should give you an idea of what a working view would look like.
Map
function(doc) {
    if (doc.type == 'inventory_ticket' && doc.claimed_by == null) {
        emit(doc.product_key, { 'inventory_ticket': doc._id, '_rev': doc._rev });
    }
}
This gives me a list of available "tickets", by product key. I could grab a group of these when someone wants to buy a hammer, then iterate through sending updates (using the id and _rev) until I successfully claim one (previously claimed tickets will result in an update error).
Reduce
function(keys, values, rereduce) {
    if (rereduce) {
        // re-reduce pass: values are partial counts, add them up
        return values.reduce(function(a, b) { return a + b; }, 0);
    }
    return values.length;
}
This reduce function simply returns the total number of unclaimed inventory_ticket items, so you can tell how many "hammers" are available for purchase.
Caveats
This solution represents roughly 3.5 minutes of total thinking for the particular problem you've presented. There may be better ways of doing this! That said, it does substantially reduce conflicting updates, and cuts down on the need to respond to a conflict with a new update. Under this model, you won't have multiple users attempting to change data in primary product entry. At the very worst, you'll have multiple users attempting to claim a single ticket, and if you've grabbed several of those from your view, you simply move on to the next ticket and try again.
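For illustration, a rough sketch of claiming a ticket from such a view over the HTTP API in Python (the database URL, design document and view names are all invented):

import requests

DB = 'http://localhost:5984/shop'                      # made-up database URL
VIEW = DB + '/_design/inventory/_view/available'       # made-up design doc/view

def claim_ticket(product_key, user_id):
    rows = requests.get(VIEW, params={'key': '"%s"' % product_key,
                                      'reduce': 'false'}).json().get('rows', [])
    for row in rows:
        ticket_id = row['value']['inventory_ticket']
        ticket = requests.get('%s/%s' % (DB, ticket_id)).json()
        if ticket.get('claimed_by'):
            continue                                   # already gone, try the next one
        ticket['claimed_by'] = user_id
        resp = requests.put('%s/%s' % (DB, ticket_id), json=ticket)
        if resp.status_code in (201, 202):
            return ticket_id                           # successfully claimed
        # on 409 conflict somebody else claimed it first: move on
    return None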
Reference: https://wiki.apache.org/couchdb/Frequently_asked_questions#How_do_I_use_transactions_with_CouchDB.3F
Expanding on MrKurt's answer: for lots of scenarios you don't need to have stock tickets redeemed in order. Instead of selecting the first ticket, you can select randomly from the remaining tickets. Given a large number of tickets and a large number of concurrent requests, you will get much less contention on those tickets than if everyone tries to grab the first ticket.
A design pattern for RESTful transactions is to create a "tension" in the system. For the popular example use case of a bank account transfer, you must make sure the totals of both involved accounts are updated:
Create a transaction document "transfer USD 10 from account 11223 to account 88733". This creates the tension in the system.
To resolve any tension scan for all transaction documents and
If the source account is not updated yet update the source account (-10 USD)
If the source account was updated but the transaction document does not show this then update the transaction document (e.g. set flag "sourcedone" in the document)
If the target account is not updated yet update the target account (+10 USD)
If the target account was updated but the transaction document does not show this then update the transaction document
If both accounts have been updated you can delete the transaction document or keep it for auditing.
The scanning for tension should be done in a backend process over all "tension documents" to keep the periods of tension in the system short. In the above example there will be a short window of anticipated inconsistency when the first account has been updated but the second has not been yet. This must be taken into account the same way you deal with eventual consistency if your CouchDB is distributed.
Another possible implementation avoids the need for transactions completely: just store the tension documents and evaluate the state of your system by evaluating every involved tension document. In the example above this would mean that the total for an account is determined only as the sum of the values in the transaction documents in which that account is involved. In CouchDB you can model this very nicely as a map/reduce view.
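A minimal sketch of that evaluation, done client-side in Python for clarity (the database URL and document fields like type, from, to and amount are invented; in practice the summing would live in the map/reduce view itself):

import requests

DB = 'http://localhost:5984/bank'    # made-up database URL

def account_balance(account_id):
    # Walk every document and sum the effect of the transfer ("tension")
    # documents on this account.
    rows = requests.get(DB + '/_all_docs',
                        params={'include_docs': 'true'}).json()['rows']
    balance = 0
    for row in rows:
        doc = row['doc']
        if doc.get('type') != 'transfer':
            continue
        if doc.get('from') == account_id:
            balance -= doc['amount']
        if doc.get('to') == account_id:
            balance += doc['amount']
    return balance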
No, CouchDB is not generally suitable for transactional applications because it doesn't support atomic operations in a clustered/replicated environment.
CouchDB sacrificed transactional capability in favor of scalability. In order to have atomic operations you need a central coordination system, which limits your scalability.
If you can guarantee that you only have one CouchDB instance, or that everyone modifying a particular document connects to the same CouchDB instance, then you could use the conflict detection system to create a sort of atomicity using the methods described above. But if you later scale up to a cluster or use a hosted service like Cloudant, it will break down and you'll have to redo that part of the system.
So my suggestion would be to use something other than CouchDB for your account balances; it will be much easier that way.
As a response to the OP's problem, Couch is probably not the best choice here. Using views is a great way to keep track of inventory, but clamping to 0 is more or less impossible. The problem is the race condition when you read the result of a view, decide you're OK to use a "hammer-1" item, and then write a doc to use it. There's no atomic way to write the doc using the hammer only if the result of the view is that there are > 0 hammer-1's. If 100 users all query the view at the same time and see 1 hammer-1, they can all write a doc to use a hammer-1, resulting in -99 hammer-1's. In practice, the race condition will be fairly small (really small if your DB is running on localhost). But once you scale and have an off-site DB server or cluster, the problem will get much more noticeable. Regardless, it's unacceptable to have a race condition of that sort in a critical, money-related system.
An update to MrKurt's response (it may just be dated, or he may have been unaware of some CouchDB features)
A view is a good way to handle things like balances / inventories in CouchDB.
You don't need to emit the docid and rev in a view. You get both of those for free when you retrieve view results. Emitting them, especially in a verbose format like a dictionary, will just make your view unnecessarily large.
A simple view for tracking inventory balances should look more like this (also off the top of my head)
function(doc) {
    if (doc.InventoryChange != undefined) {
        for (var product_key in doc.InventoryChange) {
            emit(product_key, doc.InventoryChange[product_key]);
        }
    }
}
And the reduce function is even more simple
_sum
This uses a built-in reduce function that just sums the values of all rows with matching keys.
In this view, any doc can have a member "InventoryChange" that maps product_keys to a change in their total inventory. For example:
{
    "_id": "abc123",
    "InventoryChange": {
        "hammer_1234": 10,
        "saw_4321": 25
    }
}
Would add 10 hammer_1234's and 25 saw_4321's.
{
    "_id": "def456",
    "InventoryChange": {
        "hammer_1234": -5
    }
}
Would burn 5 hammers from the inventory.
With this model, you're never updating any data, only appending. This means there's no opportunity for update conflicts. All the transactional issues of updating data go away :)
Another nice thing about this model is that ANY document in the DB can both add and subtract items from the inventory. These documents can have all kinds of other data in them. You might have a "Shipment" document with a bunch of data about the date and time received, warehouse, receiving employee etc. and as long as that doc defines an InventoryChange, it'll update the inventory. As could a "Sale" doc, and a "DamagedItem" doc etc. Looking at each document, they read very clearly. And the view handles all the hard work.
Actually, you can in a way. Have a look at the HTTP Document API and scroll down to the heading "Modify Multiple Documents With a Single Request".
Basically you can create/update/delete a bunch of documents in a single POST request to the URI /{dbname}/_bulk_docs and they will either all succeed or all fail. The documentation does caution that this behaviour may change in the future, though.
EDIT: As predicted, from version 0.9 the bulk docs no longer works this way.
Just use a lightweight solution like SQLite for the transactions, and when a transaction has completed successfully, replicate it and mark it as replicated in SQLite.
SQLite table:

txn_id         txn_attribute1   txn_attribute2   ...   txn_status
dhwdhwu$sg1    x                y                ...   added/replicated
You can also delete the transactions which are replicated successfully.
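A minimal sketch of that table using Python's sqlite3 module (column names follow the ones above):

import sqlite3

conn = sqlite3.connect('txns.db')
conn.execute("""CREATE TABLE IF NOT EXISTS txns (
                    txn_id TEXT PRIMARY KEY,
                    txn_attribute1 TEXT,
                    txn_attribute2 TEXT,
                    txn_status TEXT)""")

def record_transaction(txn_id, attr1, attr2):
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO txns VALUES (?, ?, ?, 'added')",
                     (txn_id, attr1, attr2))

def mark_replicated(txn_id):
    # Called once the transaction has been pushed to the replica successfully.
    with conn:
        conn.execute("UPDATE txns SET txn_status = 'replicated' WHERE txn_id = ?",
                     (txn_id,))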
