Scalability of concurrent writes to App Engine datastore - google-app-engine

I'm thinking of dozens of concurrent jobs writing to the same datastore Model. Does the datastore scale regardless of the number of concurrent puts?

There is no contention for entity kinds - only for entity groups (entities with the same parent entity). Since you say you're writing to a new entity each time, you should be able to scale arbitrarily.
One subtlety remains, however: If you're inserting a high rate of entities (hundreds per second), and you're using the default auto-generated IDs, you can get 'hot tablets', which can cause contention. If you expect that high a rate of insertions, you should use key names, and select a key that doesn't cluster as auto generated IDs do - examples would be an email address, or a randomly generated UUID.
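A minimal sketch of generating non-clustering key names, assuming a random UUID is acceptable as the key. The helper name new_key_name is hypothetical; in ndb the value would be passed as id=..., and in the older db API as key_name=....

```python
import uuid


def new_key_name():
    """Return a random key name; uuid4 values are not ordered by creation time."""
    return uuid.uuid4().hex


# Five key names for five new entities. Each is a 32-character hex string,
# so entities created back to back do not end up with neighbouring keys.
random_names = [new_key_name() for _ in range(5)]
```

An email address works the same way, provided it is unique per entity.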

The datastore can only handle so many writes per second to any given entity. Trying to write to a specific entity too quickly leads to contention, as described in Avoiding datastore contention. That article recommends sharding an entity if you expect to be consistently updating it more than one or two times per second.
The datastore is optimized for reads, but if your concurrent jobs are writing to separate entities (even if they are within the same model) then your application might scale - it will depend on how long your request handlers take to execute.
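The sharding idea mentioned above can be sketched roughly as follows. This is an in-memory illustration only: a plain dict stands in for the datastore, and NUM_SHARDS, increment and get_count are assumed names, not part of any App Engine API.

```python
import random

NUM_SHARDS = 20  # assumed value; tune to the expected write rate

# A dict stands in for the datastore: shard key name -> partial count.
store = {}


def increment(name):
    """Add one to a randomly chosen shard of the counter called `name`."""
    shard_key = "%s-%d" % (name, random.randint(0, NUM_SHARDS - 1))
    store[shard_key] = store.get(shard_key, 0) + 1


def get_count(name):
    """Sum all shards to get the counter's total."""
    return sum(store.get("%s-%d" % (name, i), 0) for i in range(NUM_SHARDS))


for _ in range(1000):
    increment("page-views")
```

Each write touches only one of the shards, so concurrent writers rarely hit the same entity; the price is that a read has to sum every shard.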

Related

How to implement "Sharding Counters" to create or update single entity more than 5 writes per second in Cloud Datastore?

I need to improve the server's performance by increasing the writing throughput in Google Cloud Datastore.
Requirement:
When the server gets more than 5 requests to create the user data at the same time, the server needs to create or update those entities.
However, I encountered a writing contention problem.
I know a possible solution is a write-behind cache: move the write operations that can lead to contention into Memcache and a Task Queue, which lowers the Cloud Datastore hit rate.
But I want to do it in parallel without any delay time.
1. Is it possible to apply "Sharding Counters" to create or update ndb's user model?
2. Could you provide any sample code for this?
I did some research and want to share as below
https://weishihhsun.blogspot.com/search/label/Google%20App%20Engine
The key point is we can do sharding on our entity group with a random unique id as shown below.
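import random  # needed for random.randint below

NUM_SHARDS = 1000
shard_string_index = str(random.randint(0, NUM_SHARDS - 1))
# FriendShip is the poster's ndb model; its definition is not shown here.
FriendShip(id=shard_string_index,
           user_key='user Id',
           friend_key='friend Id')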
To write a thousand FriendShip entities in parallel, we just need to set NUM_SHARDS to 1000.
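One thing worth checking in the snippet above: the shard index is used as the whole entity id, so two writes that draw the same index address the same key, and the later put replaces the earlier one. A small sketch of the effect, with plain dicts standing in for the datastore; the composite id format is an assumption made for illustration, not something from the post.

```python
import random

NUM_SHARDS = 1000
WRITES = 5000

shard_only = {}  # id is just the shard index
composite = {}   # id combines the shard index with the two user ids

for n in range(WRITES):
    user_id, friend_id = "user%d" % n, "friend%d" % n
    shard = str(random.randint(0, NUM_SHARDS - 1))
    entity = {"user_key": user_id, "friend_key": friend_id}
    # Writing to an existing id overwrites the entity stored there.
    shard_only[shard] = entity
    # A composite id stays unique per friendship.
    composite["%s:%s:%s" % (shard, user_id, friend_id)] = entity
```

With shard-only ids at most NUM_SHARDS friendships can ever exist, whereas the composite ids keep all 5000.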

What exactly is the throughput restriction on an entity group in Google App Engine's datastore?

The documentation describes a limitation on the throughput to an entity group in the datastore, but is vague on what exactly the limitation is. My confusion is in two parts:
1. What is being restricted?
Specifically, is it:
The number of writes?
Number of transactions that write to the datastore?
Number of transactions regardless of whether it reads or writes to the datastore?
2. What is the type of the restriction?
Specifically, is it:
An artificially enforced one-per-second hard rule?
An empirically observed max throughput, that may in practice be better based on factors like network load, etc.?
There's no throughput restriction per se, but to guarantee atomicity in transactions, updates must be serialized and applied sequentially and in order, so if you make enough of them things will start to fail/timeout. This is called datastore contention:
Datastore contention occurs when a single entity or entity group is updated too rapidly. The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception. If you're expecting to update a single entity or write to an entity group more than several times per second, it's best to re-work your design early-on to avoid possible contention once your application is deployed.
To directly answer your question in simple terms: it's specifically the number of writes per entity group (roughly 5 per second), and it's just a rule of thumb; your mileage may vary (greatly).
Some people have reported no contention at all, while others have problems to get more than 1 update per second. As you can imagine this depends on the complexity of the operation and the load of all the machines involved in execution.
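Since contention surfaces as an exception on the caller's side, a common response is to retry with a growing delay. A minimal sketch, assuming the operation is safe to repeat; ContentionError and run_with_retries are hypothetical names, not datastore APIs.

```python
import random
import time


class ContentionError(Exception):
    """Hypothetical stand-in for the datastore's concurrency exception."""


def run_with_retries(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call `operation`, retrying on ContentionError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ContentionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Double the delay each time and add jitter so that competing
            # writers do not all retry at the same instant.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))


attempts = {"n": 0}


def flaky_update():
    """Fails twice with contention, then succeeds (simulated)."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ContentionError("entity group busy")
    return "committed"


# The sleep function is replaced with a no-op so the example runs instantly.
result = run_with_retries(flaky_update, sleep=lambda seconds: None)
```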
Limits:
writes per second to an entity group
entity groups per cross-entity-group transaction (XG transaction)
There is a limit of 1 write per second per entity group. This is a documented limit that in practice appears to be a 'soft' limit, as in it is possible to exceed it, but not guaranteed to be allowed. Transactions 'block' if the entity had been written to in the last second, however the API allows for transient exceptions to occur as well. Obviously you would be susceptible to timeouts as well.
This does not affect the overall number of transactions for your app, just specifically related to that entity group. If you need to, you can design portions of your data model to get around this limitation.
There is a limit of 25 entity groups per XG transaction, meaning a transaction can not incorporate more than 25 entity groups in its context (reads, writes etc). This used to be a limit of 5 but was recently increased.
So to answer your direct questions:
1. Writes for the entire entity group (as defined by the root key) within a second window (which is not strict)
2. An artificially enforced one-per-second soft rule
If you ask that question, then the Google DataStore is probably not for you.
The Google DataStore is an experimental database whose API can change at any time - it is also meant for retail apps and non-critical applications.
A clear indication of this is what you meet when you sign up for the DataStore: something like no responsibility for backwards compatibility, etc. Another indication is the lack of clear examples and the lack of wrappers providing a simple API for accessing the DataStore - and the examples on the net are a soup of complicated installations and procedures just to make a simple query.
My own conclusion so far, after days of research, is that Google DataStore is not ready for commercial use, but looks promising once it is finished and in a stable release version.
When you search the net and look at the few Google examples, if there are any at all, what is not mentioned is more noticeable than what is mentioned - which is to say Google mentions almost nothing ..... ;-) If you look at the vendors "supporting" Google DataStore, they simply link to the Google DataStore site for further information, which mentions nothing, so you end up in a loop where nothing concrete is mentioned ....

GAE transaction failure and idempotency

The Google App Engine documentation contains this paragraph:
Note: If your application receives an exception when committing a
transaction, it does not always mean that the transaction failed. You
can receive DatastoreTimeoutException,
ConcurrentModificationException, or DatastoreFailureException
exceptions in cases where transactions have been committed and
eventually will be applied successfully. Whenever possible, make your
Datastore transactions idempotent so that if you repeat a transaction,
the end result will be the same.
Wait, what? It seems like there's a very important class of transactions that just simply cannot be made idempotent because they depend on current datastore state. For example, a simple counter, as in a like button. The transaction needs to read the current count, increment it, and write out the count again. If the transaction appears to "fail" but doesn't REALLY fail, and there's no way for me to tell that on the client side, then I need to try again, which will result in one click generating two "likes." Surely there is some way to prevent this with GAE?
Edit:
It seems that this is a problem inherent in distributed systems, as per none other than Guido van Rossum -- see this link:
app engine datastore transaction exception
So it looks like designing idempotent transactions is pretty much a must if you want a high degree of reliability.
I was wondering if it was possible to implement a global system across a whole app for ensuring idempotency. The key would be to maintain a transaction log in the datastore. The client would generate a GUID, and then include that GUID with the request (the same GUID would be re-sent on retries for the same request). On the server, at the start of each transaction, it would look in the datastore for a record in the Transactions entity group with that ID. If it found it, then this is a repeated transaction, so it would return without doing anything.
Of course this would require enabling cross-group transactions, or having a separate transaction log as a child of each entity group. Also there would be a performance hit if failed entity key lookups are slow, because almost every transaction would include a failed lookup, because most GUIDs would be new.
In terms of the additional $ cost in terms of additional datastore interactions, this would probably still be less than if I had to make every transaction idempotent, since that would require a lot of checking what's in the datastore in each level.
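The GUID idea described above can be sketched like this. It is an in-memory illustration only: a set and a dict stand in for the datastore, the function name like is hypothetical, and in a real app the lookup, the increment and the log write would all have to happen inside one transaction.

```python
import uuid

# Stand-ins for the datastore.
transaction_log = set()   # request ids that have already been applied
counters = {"likes": 0}


def like(request_id):
    """Apply the increment once per request_id; repeats are no-ops."""
    if request_id in transaction_log:
        return False  # a retry of a request that already went through
    counters["likes"] += 1
    transaction_log.add(request_id)
    return True


# The client generates the id once and re-sends it on every retry.
request_id = str(uuid.uuid4())
first = like(request_id)
retry = like(request_id)
```

The second call sees the logged id and leaves the counter alone, so a retry after an ambiguous failure cannot double-count.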
dan wilkerson, simon goldsmith, et al. designed a thorough global transaction system on top of app engine's local (per entity group) transactions. at a high level, it uses techniques similar to the GUID one you describe. dan dealt with "submarine writes," ie the transactions you describe that report failure but later surface as succeeded, as well as many other theoretical and practical details of the datastore. erick armbrust implemented dan's design in tapioca-orm.
i don't necessarily recommend that you implement his design or use tapioca-orm, but you'd definitely be interested in the research.
in response to your questions: plenty of people implement GAE apps that use the datastore without idempotency. it's only important when you need transactions with certain kinds of guarantees like the ones you describe. it's definitely important to understand when you do need them, but you often don't.
the datastore is implemented on top of megastore, which is described in depth in this paper. in short, it uses multi-version concurrency control within each entity group and Paxos for replication across datacenters, both of which can contribute to submarine writes. i don't know if there are public numbers on submarine write frequency in the datastore, but if there are, searches with these terms and on the datastore mailing lists should find them.
amazon's S3 isn't really a comparable system; it's more of a CDN than a distributed database. amazon's SimpleDB is comparable. it originally only provided eventual consistency, and eventually added a very limited kind of transactions they call conditional writes, but it doesn't have true transactions. other NoSQL databases (redis, mongo, couchdb, etc.) have different variations on transactions and consistency.
basically, there's always a tradeoff in distributed databases between scale, transaction breadth, and strength of consistency guarantees. this is best known by eric brewer's CAP theorem, which says the three axes of the tradeoff are consistency, availability, and partition tolerance.
The best way I came up with making counters idempotent is using a set instead of an integer in order to count. Thus, when a person "likes" something, instead of incrementing a counter I add the like to the thing like this:
class Thing {
    Set<User> likes = ....

    public void like (User u) {
        likes.add(u);
    }

    public Integer getLikeCount() {
        return likes.size();
    }
}
this is in java, but i hope you get my point even if you are using python.
This method is idempotent: you can add a single user as many times as you like and it will only be counted once. Of course, it has the penalty of storing a huge set instead of a simple counter. But hey, don't you need to keep track of likes anyway? If you don't want to bloat the Thing object, create another object ThingLikes, and cache the like count on the Thing object.
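For Python readers, roughly the same idea as the Java class above. This is a plain in-memory sketch, not a datastore model; users are identified by string ids, which is an assumption.

```python
class Thing(object):
    """Counts likes by remembering who liked, rather than incrementing a number."""

    def __init__(self):
        self.likes = set()  # ids of users who liked this thing

    def like(self, user_id):
        # Adding an id that is already present changes nothing,
        # so a retried request cannot inflate the count.
        self.likes.add(user_id)

    def like_count(self):
        return len(self.likes)


thing = Thing()
thing.like("alice")
thing.like("alice")  # the same request, retried
thing.like("bob")
```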
another option worth looking into is app engine's built in cross-group transaction support, which lets you operate on up to five entity groups in a single datastore transaction.
if you prefer reading on stack overflow, this SO question has more details.

GAE Datastore Structure

I have been using Google App Engine for a few months now and I have recently come to doubt some of my practices with regard to the Datastore. I have around 10 entities with 10-12 properties each. Everything works well in my app and the code is pretty straightforward with the way I have my data structured but I am wondering if I should break up these large entities into smaller ones for either optimization of reads and writes or just to follow best practices (which I am not sure of regarding GAE)
Right now I am over my quotas for reads and writes and would like to keep those in check.
Optimizing Reads:
If you use an offset in a query, the offset entities are counted as reads. If you run a query where offset=100, the datastore retrieves and discards the first 100 entities and you are billed for those reads. Use cursors wherever possible to reduce read ops. Cursors will also result in faster queries.
NDB won't necessarily reduce reads when you are running queries. Queries are made against the datastore and entities are returned, no memcache interaction occurs. If you want to retrieve entities from memcache in the context of a query, you will need to run a keys_only query and then attempt to retrieve those keys from memcache. You would then need to go to the datastore for any entities that were cache misses. Retrieving a key is a "small" op which is 1/7 the cost of a read op.
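The keys-only pattern described above can be sketched as follows. Dicts stand in for memcache and the datastore, and keys_only_query and fetch_with_cache are hypothetical helpers, not NDB calls.

```python
# Stand-ins for the datastore and memcache; only k1 is cached at the start.
datastore = {"k1": {"name": "a"}, "k2": {"name": "b"}, "k3": {"name": "c"}}
cache = {"k1": {"name": "a"}}


def keys_only_query():
    """Hypothetical stand-in for a keys_only query (cheap 'small' ops)."""
    return sorted(datastore)


def fetch_with_cache():
    """Return (entities, number of full datastore reads needed)."""
    keys = keys_only_query()
    results = {}
    misses = []
    for key in keys:
        if key in cache:
            results[key] = cache[key]
        else:
            misses.append(key)
    # Only the cache misses cost a full read; cache them for next time.
    for key in misses:
        entity = datastore[key]
        cache[key] = entity
        results[key] = entity
    return [results[key] for key in keys], len(misses)


entities, datastore_reads = fetch_with_cache()
entities_again, reads_again = fetch_with_cache()
```

The first call pays for two full reads (k2 and k3); the second is served entirely from the cache.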
Optimizing Writes:
Remove unused indexes. By default every property on your entity is indexed and each of those incurs 2 writes the first time it is written and 4 writes whenever it is modified. You can disable indexing for a property like so: firstname = db.StringProperty(indexed=False).
If you use list properties, each item in the list is an individual property on the entity. The list properties are abstractions provided for convenience. A list property named things with the value ["thing1", "thing2"] is really two properties in the datastore: things_0="thing1" and things_1="thing2". This can get really expensive when combined with indexing.
Consolidate properties that you don't need to query. If you only need to query on one or two properties, serialize the rest of those properties and store it as a blob on the entity.
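A rough sketch of the consolidation idea, assuming JSON is an acceptable serialization and that only the email property needs to be queried. QUERYABLE, pack and unpack are illustrative names, and the field names are made up.

```python
import json

QUERYABLE = ("email",)  # assumed: the only property the app filters on


def pack(record):
    """Split a record into indexed properties plus one serialized blob."""
    stored = {key: record[key] for key in QUERYABLE}
    rest = {key: value for key, value in record.items() if key not in QUERYABLE}
    stored["payload"] = json.dumps(rest, sort_keys=True)
    return stored


def unpack(stored):
    """Rebuild the original record from the stored form."""
    record = json.loads(stored["payload"])
    for key in QUERYABLE:
        record[key] = stored[key]
    return record


original = {"email": "a@example.com", "firstname": "Ann", "city": "Oslo", "age": 30}
stored = pack(original)
```

The stored form has two properties instead of four, and the payload would be kept in an unindexed blob or text property.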
Further reading:
https://developers.google.com/appengine/docs/billing#Billable_Resource_Unit_Costs
https://developers.google.com/appengine/docs/python/datastore/entities#Understanding_Write_Costs
I would recommend looking into using NDB Entities. NDB will use the in-context cache (and Memcache if need be) before resorting to performing reads/writes to the Datastore. This should help you stay within your quota.
Read here for more information on how NDB uses caching: https://developers.google.com/appengine/docs/python/ndb/cache
And please consult this page for a discussion of best practices with regards to GAE: https://developers.google.com/appengine/articles/scaling/overview
AppEngine Datastore charges a fixed amount per Entity read, no matter how large the Entity is (although there is a max of 1MB). This means it makes sense to combine multiple entities that you often read together into a single one. The only downside is that the latency increases (as it needs to deserialize a larger Entity each time). I found this latency to be quite low (low single-digit ms even for large ones).
Using a framework on top of Datastore is a good idea. I am using Objectify and am very happy. Use the Memcache integration with care though. Google provides only a fixed, limited amount of memory to each application, so as soon as you are talking about larger data this will not solve your problem (since Entities will have been evicted from Memcache and need to be re-read from datastore and put into cache again for each read).

GAE datastore contention avoidance?

Making my way through the GAE documents.
I have a question I can't find an obvious answer to. Given that transactions on an entity group are limited to 1/sec, how can you scale a request where, say, 10,000 users all want to access a particular user's page at the same time?
Wouldn't this give you 10,000 reads on that user's entity group within one second, thereby causing catastrophic system failure and unhappy users?
Or am I confused, and only writes get contentious.
For transactions, App Engine uses optimistic concurrency control, meaning that transactions do not lock the data but throw an exception when they detect that the data is "dirty". So the first transaction to change the data succeeds; the second gets the exception and must retry.
Given this, I assume that reads do not block if they are not part of transaction, even if some other transaction is in progress.
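The optimistic scheme described above can be illustrated with a version number. This is an in-memory sketch, not how the datastore is implemented; VersionedStore and ConflictError are hypothetical names.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for the exception raised on a dirty write."""


class VersionedStore(object):
    """Holds one value and rejects commits based on a stale read."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # Reads never block; they just note which version they saw.
        return self.value, self.version

    def commit(self, new_value, read_version):
        if read_version != self.version:
            raise ConflictError("data changed since it was read")
        self.value = new_value
        self.version += 1


store = VersionedStore(100)

# Two transactions read the same starting state.
a_value, a_version = store.read()
b_value, b_version = store.read()

store.commit(a_value + 10, a_version)  # the first writer succeeds

try:
    store.commit(b_value + 5, b_version)  # the second is now stale
    second_failed = False
except ConflictError:
    second_failed = True
    # Retry: read the fresh state and apply the change again.
    b_value, b_version = store.read()
    store.commit(b_value + 5, b_version)
```

No lock is ever taken: the stale commit is detected and rejected, and the retry applies both changes.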
Also, to make transactions less of a bottleneck, one should carefully organize entity groups and make them as small as possible, and also organize them in such a way that there is as little contention (parallel requests) as possible. Meaning:
Have small entity graphs - do not put a lot of entities under common parent.
Try having user entity as a root parent. Users usually do not create parallel transactions (e.g. make multiple money transfers at the same time, etc..)
Right. I wasn't thinking. The answer is memcache. At least partially. That, and an efficient data model/ schema.

Resources