If I use the multitenancy feature of the GAE Datastore, will a datastore transaction lock be applied per tenant as well? Or, if one tenant is using a datastore transaction, will all of the other tenants have to wait until that tenant's transaction finishes?
Two things to note:
The namespace is part of the entity's key, so a transaction only affects the entities whose keys (and therefore namespaces) are part of that transaction. Entities in other namespaces are not affected, even if they have the same IDs.
Transactions on GAE do not take locks; they use optimistic concurrency control instead. Transactions therefore never block one another: when two transactions operate on the same entities, the second one fails and the Go runtime automatically retries it up to three times. This auto-retry is the reason your transactions should be idempotent (running the code multiple times should produce the same end result).
The scope of a transaction is limited to an entity group. Namespaces (multitenancy) do not define an entity group on their own; the group is determined by the key (plus any ancestor).
The only clash is between multiple requests writing to the same entity group, and with namespaces in use that cannot happen across tenants.
https://developers.google.com/appengine/docs/go/datastore/entities#Go_Ancestor_paths
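To make that concrete, here is a minimal sketch using the low-level Java datastore API (the kind, property, and tenant names are made up for illustration). Because NamespaceManager.set() bakes the namespace into every key created afterwards, two tenants incrementing "the same" counter are actually working in two different entity groups, so their transactions never contend with each other.

import com.google.appengine.api.NamespaceManager;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class TenantCounter {
    // Increment a per-tenant "Counter" entity. The namespace set here becomes part of
    // the key, so tenant "acme" and tenant "globex" update two unrelated entity groups.
    public static void increment(String tenant) {
        NamespaceManager.set(tenant);
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = ds.beginTransaction();
        try {
            Key key = KeyFactory.createKey("Counter", "page-hits");
            Entity counter;
            try {
                counter = ds.get(txn, key);
            } catch (EntityNotFoundException e) {
                counter = new Entity(key);
                counter.setProperty("count", 0L);
            }
            counter.setProperty("count", (Long) counter.getProperty("count") + 1L);
            ds.put(txn, counter);
            txn.commit(); // can only conflict with another transaction of the SAME tenant
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}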
Related
In many databases, when an operation is performed without explicitly starting a transaction, the database creates a new transaction implicitly.
Does the datastore do this?
If it does not, is there any model for reasoning about how the data changes in the absence of transactions? How do puts, fetches, and reads work outside of transactions?
If it does, is there any characterization of when and how? Does it always do it? What is the scope of the transaction?
A mutation (put, delete) of a single entity will always be atomic (succeed entirely or fail entirely). You can think of the single mutation as transactional, even if you did not provide a transaction.
However, if you send multiple mutations in the same non-transactional request, that overall request is not atomic. Each mutation may succeed or fail independently -- one failure will not cause the other mutations to be reverted.
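As a sketch of the difference (the entities here are assumed to already exist and, for the transactional case, to share an entity group): a batch put issued outside a transaction can partially succeed, while the same puts inside a transaction are applied all or nothing.

import java.util.Arrays;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Transaction;

public class BatchVsTransaction {
    // Non-transactional batch: each entity is written atomically on its own,
    // but one write can fail while the other succeeds.
    public static void saveIndependently(DatastoreService ds, Entity a, Entity b) {
        ds.put(Arrays.asList(a, b));
    }

    // Transactional write (both entities must be in the same entity group):
    // either both writes are applied or neither is.
    public static void saveAtomically(DatastoreService ds, Entity a, Entity b) {
        Transaction txn = ds.beginTransaction();
        try {
            ds.put(txn, Arrays.asList(a, b));
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}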
"Transactions are an optional feature of the Datastore; you're not required to use transactions to perform Datastore operations."
So no transaction is automatically opened for you across more than a single-entity datastore operation.
A single-entity commit behaves the same as a transaction internally, so if you are changing more than one entity, or committing more than once, it is as if you open and close a transaction every time.
The question is fairly simple.
Do Google Datastore transactions use optimistic concurrency control or not?
One part of the documentation says that it does:
When a transaction starts, App Engine uses optimistic concurrency control by checking the last update time for the entity groups used in the transaction. Upon committing a transaction for the entity groups, App Engine again checks the last update time for the entity groups used in the transaction. If it has changed since our initial check, an exception is thrown. Source
Another part of the documentation indicates that it doesn't:
When a transaction is started, the datastore rejects any other attempts to write to that entity group before the transaction is complete. To illustrate this, say you have an entity group consisting of two entities, one at the root of the hierarchy and the other directly below it. If these entities belonged to separate entity groups, they could be updated in parallel. But because they are part of the same entity group, any request attempting to update one of the entities will necessarily prevent a simultaneous request from updating any other entity in the same group until the original request is finished. Source
As I understand it, the first quote tells me that it is fine to start a transaction, read an entity, and not bother closing the transaction if I see no reason to update the entity.
The second quote tells me that if I start a transaction and read an entity, I should always remember to close it again; otherwise I cannot start a new one on the same entity.
Which part of the documentation is correct?
BTW, in case the correct quote is the second one: I am using Objectify to handle all my transactions. Will it remember to close all started transactions, even though no changes were made?
The commenter (Greg) is correct. Whether or not you explicitly close a transaction, all transactions are closed by the container at the end of a request. You can't "leak" transactions (although you could screw up transactions within a single request).
Furthermore, with Objectify's transaction API, transactions are automatically opened and closed for you when you execute a unit of Work. You don't manage transactions yourself.
To answer your root question: Yes, all transactions in the GAE datastore are optimistic. There is no pessimistic locking in the datastore; you can start as many transactions as you want on a single entity group but only the first commit will succeed. All subsequent attempts to commit will rollback with ConcurrentModificationException.
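As a sketch of how that looks with Objectify (the Account entity and its fields are hypothetical): you hand transact() a unit of Work, and Objectify opens the transaction, commits it when the work returns, and re-runs the work if the commit loses the optimistic-concurrency race.

import static com.googlecode.objectify.ObjectifyService.ofy;
import com.googlecode.objectify.Work;

public class Deposit {
    public static Account deposit(final long accountId, final long amount) {
        // Objectify begins and ends the transaction around this Work; on
        // ConcurrentModificationException it re-runs the whole unit of work.
        return ofy().transact(new Work<Account>() {
            public Account run() {
                Account acct = ofy().load().type(Account.class).id(accountId).now();
                acct.setBalance(acct.getBalance() + amount);
                ofy().save().entity(acct).now();
                return acct;
            }
        });
    }
}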
XG transactions must be explicitly enabled when using JPA/JDO. Why so?
Are there possible problems or side effects by enabling them?
Quoting from the docs
An XG transaction that touches only a single entity group has exactly the same performance and cost as a single-group, non-XG transaction.
In an XG transaction that touches multiple entity groups, operations cost the same as if they were performed in a non-XG transaction, but may experience higher latency.
The people who developed the datastore push you to be conscious of the underlying infrastructure. That is why you cannot generate indexes in production or run cross-entity-group transactions by default: you have to know why you are using these features and the tradeoffs they imply.
See @Jimmy Kane's answer for the performance aspect.
There is also a limitation on the number of entity groups (up to five) that can be "touched" by an XG transaction (docs):
The transaction can be applied across a maximum of five entity groups, and will succeed as long as no concurrent transaction touches any of the entity groups to which it applies
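With the low-level Java API the opt-in is per transaction, as sketched below; with JPA/JDO it is a configuration property in persistence.xml / jdoconfig.xml (datanucleus.appengine.datastoreEnableXGTransactions, if I remember correctly, so double-check the docs).

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Transaction;
import com.google.appengine.api.datastore.TransactionOptions;

public class XgExample {
    public static Transaction beginCrossGroup() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        // Explicitly opt in to a cross-group transaction; it may touch at most
        // five entity groups, and behaves like a normal transaction on just one.
        return ds.beginTransaction(TransactionOptions.Builder.withXG(true));
    }
}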
The Google App Engine documentation contains this paragraph:
Note: If your application receives an exception when committing a transaction, it does not always mean that the transaction failed. You can receive DatastoreTimeoutException, ConcurrentModificationException, or DatastoreFailureException exceptions in cases where transactions have been committed and eventually will be applied successfully. Whenever possible, make your Datastore transactions idempotent so that if you repeat a transaction, the end result will be the same.
Wait, what? It seems like there's a very important class of transactions that just simply cannot be made idempotent because they depend on current datastore state. For example, a simple counter, as in a like button. The transaction needs to read the current count, increment it, and write out the count again. If the transaction appears to "fail" but doesn't REALLY fail, and there's no way for me to tell that on the client side, then I need to try again, which will result in one click generating two "likes." Surely there is some way to prevent this with GAE?
Edit:
it seems that this is a problem inherent in distributed systems, as per none other than Guido van Rossum -- see this link:
app engine datastore transaction exception
So it looks like designing idempotent transactions is pretty much a must if you want a high degree of reliability.
I was wondering if it is possible to implement a global system across a whole app for ensuring idempotency. The key would be to maintain a transaction log in the datastore. The client would generate a GUID and include that GUID with the request (the same GUID would be re-sent on retries of the same request). On the server, at the start of each transaction, it would look in the datastore for a record in the Transactions entity group with that ID. If it found one, this is a repeated transaction, so it would return without doing anything.
Of course this would require enabling cross-group transactions, or having a separate transaction log as a child of each entity group. There would also be a performance hit if failed entity-key lookups are slow, because almost every transaction would include a failed lookup, since most GUIDs would be new.
In terms of the additional $ cost of the extra datastore interactions, this would probably still be less than if I had to make every transaction idempotent, since that would require a lot of checking what is already in the datastore at each level.
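Here is a minimal sketch of that idea with the low-level Java API, assuming a client-supplied GUID and an XG transaction so the hypothetical TxnLog entity can live outside the entity group being modified (all kind and property names are invented):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;
import com.google.appengine.api.datastore.TransactionOptions;

public class IdempotentLike {
    // Apply a "like" at most once per client-generated request GUID.
    public static void likeOnce(String requestGuid, Key thingKey) throws EntityNotFoundException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = ds.beginTransaction(TransactionOptions.Builder.withXG(true));
        try {
            Key logKey = KeyFactory.createKey("TxnLog", requestGuid);
            if (exists(ds, txn, logKey)) {
                return; // this GUID was already applied: the request is a retry
            }
            Entity thing = ds.get(txn, thingKey);
            thing.setProperty("likes", (Long) thing.getProperty("likes") + 1L);
            ds.put(txn, thing);
            ds.put(txn, new Entity(logKey)); // record the GUID atomically with the change
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }

    private static boolean exists(DatastoreService ds, Transaction txn, Key key) {
        try {
            ds.get(txn, key);
            return true;
        } catch (EntityNotFoundException e) {
            return false;
        }
    }
}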
dan wilkerson, simon goldsmith, et al. designed a thorough global transaction system on top of app engine's local (per entity group) transactions. at a high level, it uses techniques similar to the GUID one you describe. dan dealt with "submarine writes," ie the transactions you describe that report failure but later surface as succeeded, as well as many other theoretical and practical details of the datastore. erick armbrust implemented dan's design in tapioca-orm.
i don't necessarily recommend that you implement his design or use tapioca-orm, but you'd definitely be interested in the research.
in response to your questions: plenty of people implement GAE apps that use the datastore without idempotency. it's only important when you need transactions with certain kinds of guarantees like the ones you describe. it's definitely important to understand when you do need them, but you often don't.
the datastore is implemented on top of megastore, which is described in depth in this paper. in short, it uses multi-version concurrency control within each entity group and Paxos for replication across datacenters, both of which can contribute to submarine writes. i don't know if there are public numbers on submarine write frequency in the datastore, but if there are, searches with these terms and on the datastore mailing lists should find them.
amazon's S3 isn't really a comparable system; it's more of a CDN than a distributed database. amazon's SimpleDB is comparable. it originally only provided eventual consistency, and eventually added a very limited kind of transactions they call conditional writes, but it doesn't have true transactions. other NoSQL databases (redis, mongo, couchdb, etc.) have different variations on transactions and consistency.
basically, there's always a tradeoff in distributed databases between scale, transaction breadth, and strength of consistency guarantees. this is best known by eric brewer's CAP theorem, which says the three axes of the tradeoff are consistency, availability, and partition tolerance.
The best way I came up with to make counters idempotent is to use a set instead of an integer to count. Thus, when a person "likes" something, instead of incrementing a counter I add the like to the thing, like this:
import java.util.HashSet;
import java.util.Set;

class Thing {
    // A set of users instead of an integer: adding the same user twice is a no-op,
    // which is what makes the "like" operation idempotent.
    Set<User> likes = new HashSet<User>();

    public void like(User u) {
        likes.add(u);
    }

    public Integer getLikeCount() {
        return likes.size();
    }
}
This is in Java, but I hope you get my point even if you are using Python.
This approach is idempotent: you can add the same user as many times as you like, and it will only be counted once. Of course, it has the penalty of storing a potentially huge set instead of a simple counter. But hey, don't you need to keep track of who liked it anyway? If you don't want to bloat the Thing object, create another object, ThingLikes, and cache the like count on the Thing object.
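If you go the separate-object route, a sketch could look like this (ThingLikes and its members are hypothetical): the heavy set lives in its own entity, and the Thing itself only carries a small cached count.

import java.util.HashSet;
import java.util.Set;

// Hypothetical companion entity: holds the heavy set of likers so that Thing
// itself only needs to store a small cached integer count.
class ThingLikes {
    Set<User> likers = new HashSet<User>();

    // add() returns false for a repeat like, so callers know not to bump
    // the cached count on the corresponding Thing.
    boolean addLiker(User u) {
        return likers.add(u);
    }

    int count() {
        return likers.size();
    }
}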
another option worth looking into is app engine's built-in cross-group transaction support, which lets you operate on up to five entity groups in a single datastore transaction.
if you prefer reading on stack overflow, this SO question has more details.
Making my way through the GAE documents.
I have a question I can't find an obvious answer to. Given that transactions on an entity group are limited to roughly 1/sec, how can you scale a request where, say, 10,000 users all want to access a particular user's page at the same time?
Wouldn't this give you 10,000 reads on that particular user's entity group within a second, thereby causing catastrophic system failure and unhappy users?
Or am I confused, and only writes get contentious?
App Engine uses optimistic concurrency control for transactions, meaning that it does not lock the data but throws an exception when it detects that the data is "dirty". So the first transaction to change the data succeeds, while the second gets the exception and must retry.
Given this, I assume that reads do not block if they are not part of a transaction, even if some other transaction is in progress.
Also, to make transactions less of a bottleneck, you should carefully organize entity groups, keep them as small as possible, and arrange them so that there is as little contention (parallel requests on the same group) as possible. Meaning:
Have small entity graphs - do not put a lot of entities under a common parent.
Try having the user entity as the root parent; users usually do not create parallel transactions on their own data (e.g. make multiple money transfers at the same time). See the sketch below.
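As a sketch of that second point (kind names are made up): with the user entity as the root, every piece of that user's data lives in a small, per-user entity group, so transactions for different users never touch the same group.

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class UserRootedKeys {
    public static Entity newTransferFor(String userId) {
        // The user entity is the root of the group; every MoneyTransfer created
        // with the user key as parent lives in that user's (small) entity group.
        Key userKey = KeyFactory.createKey("User", userId);
        return new Entity("MoneyTransfer", userKey);
    }
}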
Right, I wasn't thinking. The answer is memcache, at least partially. That, and an efficient data model/schema.
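As a sketch of the memcache part (the kind and the 60-second expiry are arbitrary choices): serving the hot profile from memcache means most of those 10,000 readers never touch the entity group at all.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class ProfileCache {
    public static Entity loadProfile(Key profileKey) throws EntityNotFoundException {
        MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
        Entity profile = (Entity) cache.get(profileKey);   // Key and Entity are Serializable
        if (profile == null) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            profile = ds.get(profileKey);                  // fall back to the datastore
            cache.put(profileKey, profile, Expiration.byDeltaSeconds(60));
        }
        return profile;
    }
}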