I was reading the Java docs on Understanding Write Costs while looking into optimizing my entities.
What I don't understand is the first line:
"When your application executes a Cloud Datastore put operation..."
I'm using Node.js, and the Node.js documentation doesn't mention a put command, so I'm confused about whether the extra index write costs apply only to the Insert command or also to other commands like Update.
Update
I found this answer: Google Datastore new pricing effect operations
From what I understand, it doesn't matter whether I let Datastore automatically index my properties, since I'm only charged once each time an entity is inserted, updated, or read.
I guess the only improvement I get by excluding indexes on some properties is decreased storage requirements?
Yes, the number of indexes won't increase your write costs, although they do consume storage. You can find the official Datastore pricing model here.
Related
Everyone learns to use Memcache pretty quickly. Another trick I've learned recently is setting indexed=False for Model properties that I am not going to query against. What are some others? What are the big ones?
Don't use offset in queries. Use cursors instead.
Explanation: offset loads all data up to offset+limit and charges you for it, but only returns limit entities.
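For illustration, here is a minimal Python NDB sketch of paging with a cursor instead of an offset; the Article model and page size are hypothetical:

    from google.appengine.ext import ndb

    class Article(ndb.Model):  # hypothetical model, for illustration only
        created = ndb.DateTimeProperty(auto_now_add=True)

    def get_page(websafe_cursor=None, page_size=20):
        # Resume from the previous page's cursor instead of skipping entities.
        start = ndb.Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
        articles, next_cursor, more = Article.query().order(-Article.created) \
            .fetch_page(page_size, start_cursor=start)
        # Only page_size entities are read and billed; with offset=N the
        # datastore would also read (and bill) the N skipped entities.
        return articles, next_cursor.urlsafe() if (more and next_cursor) else None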
Minimize instance use by tweaking idle instances and pending latency appropriately for your app.
A couple of things helped us (not all may be low-hanging fruit at first). First, we denormalized our datastore to reduce joins (I'm using SQL terms because I came from a SQL background). By spreading commonly queried elements around, we reduced the number of reads we had to make considerably, even after factoring in Memcache. This potentially increases writes, but for most apps the number of reads far outweighs the number of writes.
Next, we started using task queues, backends, and the channel API more often. I don't remember specific examples but I do remember we were able to reduce our front-end usage down below the free quota mark by moving some processing around to queues and backends and by sending data down via channel rather than having the client poll.
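As a small illustration of pushing work off the front end (the original answer is about a Java/Objectify app, so this Python sketch with a hypothetical /tasks/resize handler is only meant to show the idea):

    from google.appengine.api import taskqueue

    def enqueue_resize(photo_key):
        # Instead of resizing in the user-facing request, enqueue a task and
        # return immediately; a worker handler does the heavy lifting later.
        taskqueue.add(url='/tasks/resize', params={'photo_key': photo_key})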
Also, we use Objectify for our data access, which we configure to automatically use memcache wherever appropriate.
I have been using Google App Engine for a few months now, and I have recently come to doubt some of my practices with regard to the Datastore. I have around 10 entities with 10-12 properties each. Everything works well in my app and the code is pretty straightforward with the way I have my data structured, but I am wondering if I should break these large entities up into smaller ones, either to optimize reads and writes or just to follow best practices (which I am not sure of regarding GAE).
Right now I am over my quotas for reads and writes and would like to keep those in check.
Optimizing Reads:
If you use an offset in a query, the offset entities are counted as reads. If you run a query where offset=100, the datastore retrieves and discards the first 100 entities and you are billed for those reads. Use cursors wherever possible to reduce read ops. Cursors will also result in faster queries.
NDB won't necessarily reduce reads when you are running queries. Queries are made against the datastore and entities are returned, no memcache interaction occurs. If you want to retrieve entities from memcache in the context of a query, you will need to run a keys_only query and then attempt to retrieve those keys from memcache. You would then need to go to the datastore for any entities that were cache misses. Retrieving a key is a "small" op which is 1/7 the cost of a read op.
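A rough sketch of that pattern in NDB (the Product model and filter are hypothetical):

    from google.appengine.ext import ndb

    class Product(ndb.Model):  # hypothetical model
        name = ndb.StringProperty()
        price = ndb.FloatProperty()

    def cheap_fetch(limit=50):
        # keys_only query: each returned key is a "small" op, not a full read.
        keys = Product.query(Product.price < 10.0).fetch(limit, keys_only=True)
        # get_multi consults the in-context cache and memcache first; only the
        # cache misses cost full datastore read ops.
        return ndb.get_multi(keys)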
Optimizing Writes:
Remove unused indexes. By default every property on your entity is indexed and each of those incurs 2 writes the first time it is written and 4 writes whenever it is modified. You can disable indexing for a property like so: firstname = db.StringProperty(indexed=False).
If you use list properties, each item in the list is an individual property on the entity. List properties are abstractions provided for convenience. A list property named things with the value ["thing1", "thing2"] is really two properties in the datastore: things_0="thing1" and things_1="thing2". This can get really expensive when combined with indexing.
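In NDB terms (a hypothetical Post model, just to make the point concrete):

    from google.appengine.ext import ndb

    class Post(ndb.Model):  # hypothetical model
        # Each value of a repeated (list) property is stored and indexed as a
        # separate property value, so a post with 20 tags pays index writes
        # for all 20 values.
        tags = ndb.StringProperty(repeated=True)
        # If you never filter by tag, skip the index entries entirely:
        # tags = ndb.StringProperty(repeated=True, indexed=False)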
Consolidate properties that you don't need to query. If you only need to query on one or two properties, serialize the rest and store them as a blob on the entity.
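A sketch of this consolidation with a hypothetical UserProfile model, using ndb.JsonProperty (unindexed by default) as the blob:

    from google.appengine.ext import ndb

    class UserProfile(ndb.Model):  # hypothetical model
        # The one property we actually filter on stays indexed.
        email = ndb.StringProperty()
        # Everything else is packed into a single unindexed blob: one property,
        # no extra index writes.
        extras = ndb.JsonProperty()

    profile = UserProfile(email='a@example.com',
                          extras={'bio': '...', 'timezone': 'UTC', 'theme': 'dark'})
    profile.put()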
Further reading:
https://developers.google.com/appengine/docs/billing#Billable_Resource_Unit_Costs
https://developers.google.com/appengine/docs/python/datastore/entities#Understanding_Write_Costs
I would recommend looking into using NDB Entities. NDB will use the in-context cache (and Memcache if need be) before resorting to performing reads/writes to the Datastore. This should help you stay within your quota.
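Roughly, the caching behaviour looks like this for gets by key (Account is a hypothetical model; NDB's caches apply to gets by key, not to plain queries):

    from google.appengine.ext import ndb

    class Account(ndb.Model):  # hypothetical model
        balance = ndb.IntegerProperty()

    key = ndb.Key(Account, 'alice')
    first = key.get()    # first call in a request: datastore read, result cached
    again = key.get()    # same request: served from the in-context cache, no read op
    # Across requests, memcache is consulted before the datastore (on by default,
    # shown explicitly here):
    later = key.get(use_memcache=True)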
Read here for more information on how NDB uses caching: https://developers.google.com/appengine/docs/python/ndb/cache
And please consult this page for a discussion of best practices with regards to GAE: https://developers.google.com/appengine/articles/scaling/overview
AppEngine Datastore charges a fixed amount per entity read, no matter how large the entity is (although there is a maximum size of 1MB). This means it makes sense to combine multiple entities that you often read together into a single one. The only downside is that latency increases, since a larger entity must be deserialized each time; I found this latency to be quite low (low single-digit milliseconds even for large entities).
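If you are on Python/NDB, one way to do this combining is with a LocalStructuredProperty, which stores what would otherwise be separate entities inside the parent entity (Customer and Address here are hypothetical; in Objectify the analogue would be embedded classes):

    from google.appengine.ext import ndb

    class Address(ndb.Model):        # hypothetical embedded model
        street = ndb.StringProperty()
        city = ndb.StringProperty()

    class Customer(ndb.Model):       # hypothetical model
        name = ndb.StringProperty()
        # Stored as an opaque blob inside the Customer entity, so fetching a
        # customer plus all addresses is one fixed-cost entity read.
        addresses = ndb.LocalStructuredProperty(Address, repeated=True)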
Using frameworks on top of the Datastore is a good idea. I am using Objectify and am very happy. Use the Memcache integration with care, though. Google provides only a fixed, limited amount of memory to each application, so as soon as you are dealing with larger amounts of data this will not solve your problem: entities get evicted from Memcache and need to be re-read from the datastore and put back into the cache on each read.
In order to decrease the cost of an existing application that over-consumes Datastore reads, I am trying to get stats on the application as a whole.
What I'd like to get for the overall application is stats about the queries that return the biggest number of rows during a complete day of production. Since retrieving data costs $0.70 per million reads, there is a big incentive to optimise or cache some queries, but first I have to understand which queries retrieve too much data.
Appstats apparently does not provide this information, as the tool is primarily geared toward optimising individual RPC calls.
Does anyone have a magic solution for this one? One alternative I thought about was to build a tool myself to log, after each query, the number of rows returned, but that looks like overkill and would require opening up the code.
Thanks a lot for your help!
Hugues
See this related post: https://stackoverflow.com/questions/11282567/calculating-datastore-api-usage-per-request/
What you can do to measure and optimize is look at the cost field provided by the LogService (it's called cpm_usd in the admin panel).
Using this information you can find the most expensive URLs and optimize their queries.
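As a sketch (assuming, as I recall, that the Python logservice API exposes the estimated cost on each RequestLog; treat the field names as something to verify):

    import time
    from google.appengine.api import logservice

    # Sum the estimated cost of the last 24 hours of requests per URL path.
    end = time.time()
    start = end - 24 * 3600
    cost_by_path = {}
    for req in logservice.fetch(start_time=start, end_time=end):
        cost_by_path[req.resource] = cost_by_path.get(req.resource, 0.0) + (req.cost or 0.0)

    for path, cost in sorted(cost_by_path.items(), key=lambda kv: -kv[1])[:10]:
        print path, cost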
I was reading the answer by Michael to this post here, which suggests using a pipeline to move data from the Datastore to Cloud Storage to BigQuery.
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a BigQuery table. That means I have to have some way of knowing whether the entities have been processed, so they don't get repeatedly submitted to BigQuery during MapReduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options. I can put a flag on the entities, update it when each entity is processed, and filter it out on subsequent runs - or - I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for opinions or see if there are any gotchas.
Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are perfect for the type of incremental situation you've described, because they avoid the overhead of marking entities as having been processed.
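A rough NDB sketch of the cursor approach, with hypothetical Activity (the source entities) and ExportCheckpoint (where the last cursor is persisted) models; the BigQuery append itself is elided:

    from google.appengine.ext import ndb

    class Activity(ndb.Model):           # hypothetical source entities
        created = ndb.DateTimeProperty(auto_now_add=True)

    class ExportCheckpoint(ndb.Model):   # hypothetical: remembers where we stopped
        cursor = ndb.StringProperty(indexed=False)

    def export_new_activity(batch_size=500):
        checkpoint = ExportCheckpoint.get_or_insert('bigquery-export')
        start = ndb.Cursor(urlsafe=checkpoint.cursor) if checkpoint.cursor else None
        entities, next_cursor, more = (Activity.query()
                                       .order(Activity.created)
                                       .fetch_page(batch_size, start_cursor=start))
        # ... append `entities` to the BigQuery table here ...
        if next_cursor:
            checkpoint.cursor = next_cursor.urlsafe()
            checkpoint.put()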
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
I am having the following problem. I am now using the low-level Google Datastore API rather than JDO, so that I should be in a better position to see exactly what is happening in my code. I am writing an entity to the datastore and shortly thereafter reading it back from the datastore, using Jetty and Eclipse. Sometimes the written entity is not being read. This would be a real problem if it were to happen in production code. I am using the 2.0 RC2 API.
I have tried this several times; sometimes the entity is retrieved from the datastore and sometimes it is not. I am doing a simple query on the datastore just after committing a write transaction. (If I run the code through the debugger, things run slowly enough that the entity has a chance of being read back on the second pass.)
Any help with this issue would be greatly appreciated,
Regards,
The development server has the same consistency guarantees as the High Replication datastore on the live server. A "global" query uses an index that is only guaranteed to be eventually consistent with writes. To perform a query with strongly consistent guarantees, the query must be limited to an entity group, using an "ancestor" key.
A typical technique is to group data specific to a single user in a group, so the user can see changes to queries limited to the user's group with strong consistency guarantees. Another technique is to use fancier client logic to update the client's local view as soon as the change is submitted, so the user sees the change in the UI immediately while the update to the global index is in progress.
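The question is about the Java low-level API, but the idea is the same in any runtime; here is a Python/NDB sketch of the per-user entity group technique, with a hypothetical TodoItem kind grouped under a per-user parent key:

    from google.appengine.ext import ndb

    class TodoItem(ndb.Model):   # hypothetical model
        text = ndb.StringProperty()

    def add_and_list(user_id, text):
        parent = ndb.Key('UserAccount', user_id)   # entity group root
        TodoItem(parent=parent, text=text).put()
        # An ancestor query is strongly consistent, so the item just written is
        # guaranteed to be visible; a global (non-ancestor) query might not see
        # it until the indexes catch up.
        return TodoItem.query(ancestor=parent).fetch()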
See the docs on queries and transactions.