I have been using Google App Engine for a few months now, and I have recently come to doubt some of my practices with regard to the Datastore. I have around 10 entities with 10-12 properties each. Everything works well in my app and the code is pretty straightforward with the way I have my data structured, but I am wondering if I should break up these large entities into smaller ones, either to optimize reads and writes or just to follow best practices (which I am not sure of regarding GAE).
Right now I am over my quotas for reads and writes and would like to keep those in check.
Optimizing Reads:
If you use an offset in a query, the offset entities are counted as reads. If you run a query where offset=100, the datastore retrieves and discards the first 100 entities and you are billed for those reads. Use cursors wherever possible to reduce read ops. Cursors will also result in faster queries.
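For example, here is a rough sketch of cursor-based paging with NDB (the Article model and its created field are made up for illustration):

    from google.appengine.datastore.datastore_query import Cursor
    from google.appengine.ext import ndb


    class Article(ndb.Model):          # hypothetical model, for illustration only
        created = ndb.DateTimeProperty(auto_now_add=True)


    def get_page(urlsafe_cursor=None, page_size=20):
        # Resume where the previous page ended instead of paying for
        # offset-many skipped reads.
        cursor = Cursor(urlsafe=urlsafe_cursor) if urlsafe_cursor else None
        articles, next_cursor, more = (Article.query()
                                       .order(-Article.created)
                                       .fetch_page(page_size, start_cursor=cursor))
        # Hand next_cursor.urlsafe() back to the client for the next request.
        return articles, next_cursor.urlsafe() if (more and next_cursor) else None

The client sends the urlsafe cursor back with the next request, so no skipped entities are ever read or billed.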
NDB won't necessarily reduce reads when you are running queries. Queries are made against the datastore and entities are returned, no memcache interaction occurs. If you want to retrieve entities from memcache in the context of a query, you will need to run a keys_only query and then attempt to retrieve those keys from memcache. You would then need to go to the datastore for any entities that were cache misses. Retrieving a key is a "small" op which is 1/7 the cost of a read op.
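A minimal sketch of that pattern, assuming NDB and a hypothetical Article model; the keys-only fetch is billed as small ops, and ndb.get_multi() only performs full read ops for whatever isn't already in NDB's caches:

    from google.appengine.ext import ndb


    class Article(ndb.Model):          # hypothetical model, for illustration only
        title = ndb.StringProperty()
        created = ndb.DateTimeProperty(auto_now_add=True)


    def fetch_cached(limit=50):
        # Keys-only query: small ops. get_multi() then serves entities from the
        # in-context cache / memcache where it can and only hits the datastore
        # (full read ops) for the cache misses.
        keys = Article.query().order(-Article.created).fetch(limit, keys_only=True)
        return ndb.get_multi(keys)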
Optimizing Writes:
Remove unused indexes. By default every property on your entity is indexed and each of those incurs 2 writes the first time it is written and 4 writes whenever it is modified. You can disable indexing for a property like so: firstname = db.StringProperty(indexed=False).
If you use list properties, each item in the list is an individual property on the entity. List properties are abstractions provided for convenience. A list property named things with the value ["thing1", "thing2"] is really two properties in the datastore: things_0="thing1" and things_1="thing2". This can get really expensive when combined with indexing.
Consolidate properties that you don't need to query. If you only need to query on one or two properties, serialize the rest and store them as a blob on the entity.
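For instance, something along these lines (Player and its fields are invented for the example); blob values are never indexed, so the pickled fields add no index writes:

    import pickle

    from google.appengine.ext import db


    class Player(db.Model):            # hypothetical model, for illustration only
        level = db.IntegerProperty()   # the one property we actually query on
        extra = db.BlobProperty()      # everything else, pickled; blobs are never indexed

        def set_extra(self, **fields):
            self.extra = db.Blob(pickle.dumps(fields))

        def get_extra(self):
            return pickle.loads(self.extra) if self.extra else {}

A put() then writes index rows for level only, rather than for every individual field.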
Further reading:
https://developers.google.com/appengine/docs/billing#Billable_Resource_Unit_Costs
https://developers.google.com/appengine/docs/python/datastore/entities#Understanding_Write_Costs
I would recommend looking into using NDB Entities. NDB will use the in-context cache (and Memcache if need be) before resorting to performing reads/writes to the Datastore. This should help you stay within your quota.
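As a small illustration of what that caching buys you (Account is a made-up model):

    from google.appengine.ext import ndb


    class Account(ndb.Model):          # hypothetical model, for illustration only
        email = ndb.StringProperty()


    key = ndb.Key(Account, 'some-id')
    account = key.get()   # first get: datastore read, result cached
    account = key.get()   # same request: served from the in-context cache;
                          # later requests can be served from memcache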
Read here for more information on how NDB uses caching: https://developers.google.com/appengine/docs/python/ndb/cache
And please consult this page for a discussion of best practices with regards to GAE: https://developers.google.com/appengine/articles/scaling/overview
AppEngine Datastore charges a fixed amount per Entity read, no matter how large the Entity is (although there is a max of 1 MB). This means it makes sense to combine multiple entities that you often read together into a single one. The only downside is that latency increases (as it needs to deserialize a larger Entity each time). I found this latency to be quite low (low single-digit milliseconds, even for large Entities).
The use of frameworks on top of the Datastore is a good idea. I am using Objectify and am very happy. Use the Memcache integration with care though. Google provides only a fixed, limited amount of memory to each application, so as soon as you are dealing with larger amounts of data this will not solve your problem (entities get evicted from Memcache and have to be re-read from the datastore and re-cached for each read).
Related
Here, in Java Understanding write costs, I was reading about optimizing my entities.
What I don't understand in the first line is
When your application executes a Cloud Datastore put operation
I'm using Node.js, and the Node.js documentation mentions no put command, hence I'm confused whether the extra index write costs apply only to the Insert command or also to other commands like Update.
Update
I found this answer Google Datastore new pricing effect operations
From what I understand, it doesn't matter if I let Datastore automatically index my properties, since I'm only charged once each time an entity is inserted, updated, or read.
I guess the only improvement I get by excluding indexes on some properties is decreased storage requirements?
Yes, the number of indexes won't increase write costs, although they do use storage. You can find the official Datastore pricing model here
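If you still want to exclude some properties from indexing (mainly to save index storage), the client libraries let you mark them per property; the Node.js client takes an excludeFromIndexes option when saving. Here is a rough sketch with the Python client, which I know better (the kind and property names are made up):

    from google.cloud import datastore

    client = datastore.Client()

    # 'body' is still stored and returned on reads, it just never gets
    # built-in index entries, which saves index storage.
    entity = datastore.Entity(key=client.key('Article'),      # hypothetical kind
                              exclude_from_indexes=('body',))
    entity.update({'title': 'Hello', 'body': 'long text nobody queries on'})
    client.put(entity)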
The documentation describes a limitation on the throughput to an entity group in the datastore, but is vague on what exactly the limitation is. My confusion is in two parts:
1. What is being restricted?
Specifically, is it:
The number of writes?
Number of transactions that write to the datastore?
Number of transactions regardless of whether it reads or writes to the datastore?
2. What is the type of the restriction?
Specifically, is it:
An artificially enforced one-per-second hard rule?
An empirically observed max throughput, that may in practice be better based on factors like network load, etc.?
There's no throughput restriction per se, but to guarantee atomicity in transactions, updates must be serialized and applied sequentially and in order, so if you make enough of them things will start to fail/timeout. This is called datastore contention:
Datastore contention occurs when a single entity or entity group is updated too rapidly. The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception. If you're expecting to update a single entity or write to an entity group more than several times per second, it's best to re-work your design early-on to avoid possible contention once your application is deployed.
To directly answer your question in simple terms, it's specifically the number of writes per entity group (5-ish per second), and it's just a rule of thumb; your mileage may vary (greatly).
Some people have reported no contention at all, while others have trouble getting more than 1 update per second. As you can imagine, this depends on the complexity of the operation and the load on all the machines involved in its execution.
Limits:
writes per second to an entity group
entity groups per cross-entity-group transaction (XG transaction)
There is a limit of 1 write per second per entity group. This is a documented limit that in practice appears to be a 'soft' limit, in that it is possible to exceed it, but you are not guaranteed to be allowed to. Transactions 'block' if the entity has been written to in the last second; however, the API allows for transient exceptions to occur as well. Obviously you would be susceptible to timeouts as well.
This does not affect the overall number of transactions for your app, just specifically related to that entity group. If you need to, you can design portions of your data model to get around this limitation.
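The usual way to design around it is sharding: split the hot entity into several independent root entities (each its own entity group) and spread the writes across them. A rough sketch of a sharded counter (CounterShard and the shard count are made up for illustration):

    import random

    from google.appengine.ext import ndb

    NUM_SHARDS = 20                    # size this to your expected write rate


    class CounterShard(ndb.Model):     # hypothetical model, for illustration only
        count = ndb.IntegerProperty(default=0)


    @ndb.transactional
    def increment(name):
        # Each shard is its own root entity (its own entity group), so the
        # per-entity-group write limit applies per shard, not to the total.
        shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
        shard = CounterShard.get_by_id(shard_id) or CounterShard(id=shard_id)
        shard.count += 1
        shard.put()


    def total(name):
        keys = [ndb.Key(CounterShard, '%s-%d' % (name, i)) for i in range(NUM_SHARDS)]
        return sum(s.count for s in ndb.get_multi(keys) if s)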
There is a limit of 25 entity groups per XG transaction, meaning a transaction can not incorporate more than 25 entity groups in its context (reads, writes etc). This used to be a limit of 5 but was recently increased.
So to answer your direct questions:
Writes for the entire entity group (as defined by the root key) within a one-second window (which is not strict)
An artificially enforced one-per-second soft rule
If you ask that question, then the Google DataStore is probably not for you.
The Google DataStore is an experimental database whose API can be changed at any time; it is also meant for retail apps and non-critical applications.
You see a clear indication of this when you sign up for the DataStore: something along the lines of no responsibility for backwards compatibility, etc. Another indication is the lack of clear examples and the lack of wrappers providing a simple API for accessing the DataStore, with the examples on the net being a soup of complicated installations and procedures just to make a simple query.
My own conclusion so far, after days of research, is that the Google DataStore is not ready for commercial use, but it looks promising once it is finished and in a stable release version.
When you search the net and look at the few Google examples, if there are any at all, what stands out is what is not mentioned rather than what is mentioned, and almost nothing is mentioned by Google. If you look at the vendors "supporting" Google DataStore, they simply link to the Google DataStore site for further information, which mentions nothing, so you end up going in a circle where nothing concrete is mentioned.
Everyone learns to use Memcache pretty quick. Another one I've learned recently is setting indexed=False for Model properties that I am not going to query against. What are some others? What are the big ones?
Don't use offset in queries. Use cursors instead.
Explanation: offset loads all data up to offset+limit and charges you for it, but only returns limit entities.
Minimize instance use by tweaking idle instances and pending latency appropriately for your app.
A couple of things helped us (not all of them may be low-hanging fruit at first). First, we denormalized our datastore to reduce joins. I'm using SQL terms because I came from a SQL background. By spreading commonly queried elements around, we reduced the number of reads we had to make considerably, even after factoring in Memcache. This potentially increases writes, but for most apps the number of reads far outweighs the number of writes.
Next, we started using task queues, backends, and the channel API more often. I don't remember specific examples, but I do remember we were able to reduce our front-end usage to below the free quota mark by moving some processing to queues and backends and by sending data down via the channel API rather than having the client poll.
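As an example of the kind of thing that can be moved off the front end, a deferred task is about the simplest way to push work onto a queue (build_report and handle_request here are just placeholders):

    from google.appengine.ext import deferred


    def build_report(user_id):
        # placeholder for the heavy datastore/processing work
        pass


    def handle_request(user_id):
        # Enqueue the work on a task queue and return right away, instead of
        # spending front-end instance time (and the user's patience) on it.
        deferred.defer(build_report, user_id, _queue='default')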
Also, we use Objectify for our data access, which we configure to automatically use memcache wherever appropriate.
I'm writing a very limited-purpose web application that stores about 10-20k user-submitted articles (typically 500-700 words). At any time, any user should be able to perform searches on tags and keywords, edit any part of any article (metadata, text, or tags), or download a copy of the entire database that is recent up-to-the-hour. (It can be from a cache as long as it is updated hourly.) Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously, requiring 100% availability and fast downloads) and intermittent weeks of low activity. This usage pattern is set in stone.
Is GAE a wise choice for this application? It appeals to me for its low cost (hopefully free), elasticity of scale, and professional management of most of the stack. I like the idea of an app engine as an alternative to a host. However, the excessive limitations and quotas on all manner of datastore usage concern me, as does the trade-off between strong and eventual consistency imposed by the datastore's distributed architecture.
Is there a way to fit this application into GAE? Should I use the ndb API instead of the plain datastore API? Or are the requirements so data-intensive that GAE is more expensive than hosts like Webfaction?
As long as you don't require full text search on the articles (which is currently still marked as experimental and limited to ~1000 queries per day), your usage scenario sounds like it would fit just fine in App Engine.
stores about 10-20k user-submitted articles (typically 500-700 words)
Maximum entity size in App Engine is 1 MB, so as long as the total size of the article is lower than that, it should not be a problem. Also, the cost of reading data is not tied to the size of the entity but to the number of entities being read.
At any time, any user should be able to perform searches on tags and keywords.
Again, as long as the searches on the tags and keywords are not full-text searches, App Engine's datastore queries can handle this kind of search efficiently. If you want to search on both tags and keywords at the same time, you would need to build a composite index on both fields. This could increase your write cost.
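To illustrate, a query like the following sketch (Article and its properties are invented) combines equality filters with a sort order, so it needs a composite index declared in index.yaml (the dev server can generate the entry for you), and each composite index row adds to the entity's write cost:

    from google.appengine.ext import ndb


    class Article(ndb.Model):                      # hypothetical model
        tags = ndb.StringProperty(repeated=True)
        keywords = ndb.StringProperty(repeated=True)
        created = ndb.DateTimeProperty(auto_now_add=True)


    # Equality filters on both fields plus a sort order require a composite index.
    results = (Article.query(Article.tags == 'gae',
                             Article.keywords == 'datastore')
               .order(-Article.created)
               .fetch(20))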
download a copy of the entire database that is recent up-to-the-hour.
You could use a cron/scheduled task to schedule an hourly dump to the blobstore. The cron could be targeted to a backend instance if your dump takes more than 60 seconds to finish. Do remember that with each dump you would need to read all entities in the database, and this means 10-20k read ops per hour. You could instead add a timestamp field to your entity and have your dump servlet query for anything newer than the last dump, to save read ops.
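A sketch of that incremental approach (Article and its fields are hypothetical); an auto_now timestamp lets the dump job read only what has changed:

    import datetime

    from google.appengine.ext import ndb


    class Article(ndb.Model):                          # hypothetical model
        body = ndb.TextProperty()
        updated = ndb.DateTimeProperty(auto_now=True)  # refreshed on every put()


    def articles_changed_since(last_dump):
        # Only read what changed since the previous dump instead of
        # re-reading all 10-20k entities every hour.
        return Article.query(Article.updated > last_dump).fetch()


    # e.g. everything touched in the last hour:
    recent = articles_changed_since(datetime.datetime.utcnow() - datetime.timedelta(hours=1))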
Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously requiring 100% availability and fast downloads) and intermittent weeks of low activity.
This is where GAE shines: you can get very efficient instance usage with GAE in this case.
I don't think your application is particularly "database-heavy".
500-700 words is only a few KB of data.
I think GAE is a good fit.
You could store each article as a TextProperty on an entity, with the tags in a list property. For searching text you could use the Search service https://developers.google.com/appengine/docs/python/search/ (which currently has quota limits).
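Roughly like this, assuming NDB plus the Search API (Article and the 'articles' index name are made up); the datastore holds the article and tags, while a mirrored Search document makes the text itself searchable:

    from google.appengine.api import search
    from google.appengine.ext import ndb


    class Article(ndb.Model):                      # hypothetical model
        body = ndb.TextProperty()                  # TextProperty is never indexed by the datastore
        tags = ndb.StringProperty(repeated=True)   # queryable list of tags


    def index_for_search(article):
        # Mirror the article into the Search service so the text itself can be
        # full-text searched; datastore queries alone can't do that.
        doc = search.Document(
            doc_id=article.key.urlsafe(),
            fields=[search.TextField(name='body', value=article.body),
                    search.TextField(name='tags', value=' '.join(article.tags))])
        search.Index(name='articles').put(doc)

A full-text query would then look something like search.Index(name='articles').search('body: datastore').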
Not 100% sure about downloading all the data, but I think you could store all the data in the blobstore (possibly as pdf?) and then allow users to download that blob.
I would choose NDB over regular datastore, mostly for the built-in async functionality and caching.
Regarding staying below quota, it depends on how many people are accessing the site and how much data they download/upload.
I'm thinking of dozens of concurrent jobs writing to the same datastore Model. Does the datastore scale regardless of the number of concurrent puts?
There is no contention for entity kinds - only for entity groups (entities with the same parent entity). Since you say you're writing to a new entity each time, you should be able to scale arbitrarily.
One subtlety remains, however: if you're inserting a high rate of entities (hundreds per second) and you're using the default auto-generated IDs, you can get 'hot tablets', which can cause contention. If you expect that high a rate of insertions, you should use key names, and select a key that doesn't cluster the way auto-generated IDs do - examples would be an email address or a randomly generated UUID.
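A sketch of what scattered key names could look like (Event is a made-up model):

    import uuid

    from google.appengine.ext import ndb


    class Event(ndb.Model):            # hypothetical model, for illustration only
        payload = ndb.TextProperty()


    def insert_event(payload):
        # A random key name spreads new rows across the key space instead of
        # clustering them the way monotonically increasing auto IDs can.
        Event(id=uuid.uuid4().hex, payload=payload).put()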
The datastore can only handle so many writes per second to any given entity. Trying to write to a specific entity too quickly leads to contention, as described in Avoiding datastore contention. That article recommends sharding an entity if you expect to be consistently updating it more than one or two times per second.
The datastore is optimized for reads, but if your concurrent jobs are writing to separate entities (even if they are within the same model) then your application might scale - it will depend on how long your request handlers take to execute.