JDO Batch PUT from Memcache - google-app-engine

In an effort to reduce the number of datastore PUTs I am consuming, I wish to use memcache much more frequently. The idea is to store entities in memcache for n minutes before writing them all to the datastore and clearing the cache.
I have two questions:
Is there a way to batch PUT every entity in the cache? I believe makePersistentAll() is not really a batch save but rather saves each entity individually, is it not?
Secondly is there a "callback" function you can place on entities as
you place them in the memcache? I.e. If I add an entity to the cache
(with a 2 minute expiration delta) can I tell AppEngine to save the
entity to the datastore when it is evicted?
Thanks!

makePersistentAll does indeed do a batch PUT, which the log ought to tell you clearly enough.

There's no way to fetch the entire contents of memcache in App Engine. In any case, this is a bad idea - items can be evicted from memcache at any time, including between when you inserted them and when you tried to write the data to the datastore. Instead, use memcache as a read cache, and write data to the datastore immediately (using batch operations when possible).
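To make that suggested pattern concrete, here is a minimal Python/ndb sketch (the question is JDO/Java, and the Score model is made up purely for illustration; note that ndb already layers memcache on top of reads automatically): memcache acts as a read cache, while writes go straight to the datastore in a single batch RPC.

    # Sketch of "memcache as a read cache, immediate batched writes".
    # `Score` is a hypothetical ndb model used only for illustration.
    from google.appengine.api import memcache
    from google.appengine.ext import ndb


    class Score(ndb.Model):
        points = ndb.IntegerProperty()


    def get_score(key_id):
        """Read-through cache: try memcache first, fall back to the datastore."""
        cache_key = 'score:%s' % key_id
        cached = memcache.get(cache_key)
        if cached is not None:
            return cached
        entity = Score.get_by_id(key_id)
        if entity is not None:
            memcache.set(cache_key, entity, time=120)  # cache for 2 minutes
        return entity


    def save_scores(entities):
        """Write immediately to the datastore in a single batch RPC."""
        ndb.put_multi(entities)
        # Refresh the read cache so subsequent reads see the new values.
        memcache.set_multi(
            {'score:%s' % e.key.id(): e for e in entities}, time=120)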

Related

GAE Datastore Structure

I have been using Google App Engine for a few months now and I have recently come to doubt some of my practices with regard to the Datastore. I have around 10 entities with 10-12 properties each. Everything works well in my app and the code is pretty straightforward with the way I have my data structured, but I am wondering if I should break these large entities up into smaller ones, either to optimize reads and writes or simply to follow best practices (which I am not sure of for GAE).
Right now I am over my quotas for reads and writes and would like to keep those in check.
Optimizing Reads:
If you use an offset in a query, the offset entities are counted as reads. If you run a query where offset=100, the datastore retrieves and discards the first 100 entities and you are billed for those reads. Use cursors wherever possible to reduce read ops. Cursors will also result in faster queries.
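For example, a cursor-based paging helper with ndb might look like the following sketch (Item is a hypothetical model; the cursor import path is the one documented for the Python SDK):

    from google.appengine.ext import ndb
    from google.appengine.datastore.datastore_query import Cursor


    class Item(ndb.Model):
        name = ndb.StringProperty()


    def get_page(urlsafe_cursor=None, page_size=20):
        """Fetch one page using a cursor instead of an offset,
        so skipped entities are never billed as reads."""
        cursor = Cursor(urlsafe=urlsafe_cursor) if urlsafe_cursor else None
        items, next_cursor, more = Item.query().fetch_page(
            page_size, start_cursor=cursor)
        # Hand next_cursor.urlsafe() to the client for the next request.
        return items, next_cursor.urlsafe() if more else None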
NDB won't necessarily reduce reads when you are running queries. Queries are made against the datastore and entities are returned, no memcache interaction occurs. If you want to retrieve entities from memcache in the context of a query, you will need to run a keys_only query and then attempt to retrieve those keys from memcache. You would then need to go to the datastore for any entities that were cache misses. Retrieving a key is a "small" op which is 1/7 the cost of a read op.
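A sketch of that keys-only-plus-memcache pattern, written as a helper that works on any ndb query (note that ndb's own get_multi already consults its caches; this just spells out the manual version the answer describes):

    from google.appengine.api import memcache
    from google.appengine.ext import ndb


    def get_items(query):
        """Keys-only query (small ops), then fill in entities from memcache,
        falling back to the datastore only for cache misses."""
        keys = query.fetch(keys_only=True)
        cache_keys = [k.urlsafe() for k in keys]
        cached = memcache.get_multi(cache_keys)

        missing = [k for k in keys if k.urlsafe() not in cached]
        fetched = ndb.get_multi(missing)  # datastore reads only for misses
        memcache.set_multi({e.key.urlsafe(): e for e in fetched if e})

        by_key = dict(cached)
        by_key.update({e.key.urlsafe(): e for e in fetched if e})
        return [by_key.get(k.urlsafe()) for k in keys]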
Optimizing Writes:
Remove unused indexes. By default every property on your entity is indexed and each of those incurs 2 writes the first time it is written and 4 writes whenever it is modified. You can disable indexing for a property like so: firstname = db.StringProperty(indexed=False).
If you use list properties, each item in the list is an individual property on the entity. List properties are abstractions provided for convenience. A list property named things with the value ["thing1", "thing2"] is really two properties in the datastore: things_0="thing1" and things_1="thing2". This can get really expensive when combined with indexing.
Consolidate properties that you don't need to query. If you only need to query on one or two properties, serialize the rest of those properties and store it as a blob on the entity.
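A sketch combining both write optimizations with the db API the answer quotes (the Profile model and its properties are made up): unindexed properties for fields you never filter on, and one serialized blob for everything else.

    import json
    from google.appengine.ext import db


    class Profile(db.Model):
        # Filtered/sorted on, so it stays indexed (2 writes on create, 4 on update).
        username = db.StringProperty()
        # Never queried: disabling the index avoids the extra index writes.
        nickname = db.StringProperty(indexed=False)
        # Fields you never query on, consolidated into a single serialized blob.
        extra_json = db.TextProperty()  # Text/Blob properties are never indexed


    def save_profile(username, nickname, **extras):
        profile = Profile(username=username,
                          nickname=nickname,
                          extra_json=json.dumps(extras))
        profile.put()
        return profile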
Further reading:
https://developers.google.com/appengine/docs/billing#Billable_Resource_Unit_Costs
https://developers.google.com/appengine/docs/python/datastore/entities#Understanding_Write_Costs
I would recommend looking into using NDB Entities. NDB will use the in-context cache (and Memcache if need be) before resorting to performing reads/writes to the Datastore. This should help you stay within your quota.
Read here for more information on how NDB uses caching: https://developers.google.com/appengine/docs/python/ndb/cache
And please consult this page for a discussion of best practices with regards to GAE: https://developers.google.com/appengine/articles/scaling/overview
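A minimal ndb sketch of what that buys you (Account is a made-up model): repeated gets by key within a request are served from the in-context cache instead of costing datastore reads.

    from google.appengine.ext import ndb


    class Account(ndb.Model):
        email = ndb.StringProperty()


    key = Account(email='a@example.com').put()
    # The first get() may hit the datastore (or memcache); the result is then
    # stored in the in-context cache and in memcache automatically.
    account = key.get()
    account_again = key.get()  # served from the in-context cache, no datastore read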
AppEngine Datastore charges a fixed amount per entity read, no matter how large the entity is (although there is a maximum size of 1MB). This means it makes sense to combine multiple entities that you often read together into a single one. The only downside is that latency increases, since a larger entity has to be deserialized each time. I found this latency to be quite low (low single-digit milliseconds, even for large entities).
The use of frameworks on top of the Datastore is a good idea. I am using Objectify and am very happy. Use the Memcache integration with care, though. Google provides only a fixed, limited amount of memory to each application, so as soon as you are dealing with larger data sets this will not solve your problem: entities get evicted from Memcache and have to be re-read from the datastore and put back into the cache on every read.

Write/Read with High Replication Datastore + NDB

So I have been reading a lot of documentation on HRD and NDB lately, yet I still have some doubts regarding how NDB caches things.
Example case:
Imagine a case where a user writes data and the app needs to fetch it immediately after the write. E.g. a user creates a "Group" (similar to a Facebook/LinkedIn group) and is redirected to the group immediately after creating it. (For now, I'm creating a group without assigning it an ancestor.)
Result:
When testing this sort of functionality locally (having enabled high replication), the immediate fetch of the newly created group fails: None is returned.
Question:
Having gone through the High Replication docs and Google IO videos, I understand that there is a higher write latency, however, shouldn't NDB caching take care of this? I.e. A write is cached, and then asynchronously actually written on disk, therefore, an immediate read would be reading from cache and thus there should be no problem. Do I need to enforce some other settings?
Pretty sure you are running into the HRD feature where queries are "eventually consistent". NDB's caching has nothing to do with this behavior.
I suspect it is because of the redirect that None is returned.
https://developers.google.com/appengine/docs/python/ndb/cache#incontext
The in-context cache persists only for the duration of a single incoming HTTP request and is "visible" only to the code that handles that request. It's fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (but never to Memcache).
So you are writing the value to the in-context cache and then redirecting; the subsequent read fails because the redirect is handled by a different HTTP request, which has its own, separate in-context cache.
I'm reaching the limit of my knowledge here, but I'd suggest first trying to perform the create in a transaction and redirecting only on completion/success.
https://developers.google.com/appengine/docs/python/ndb/transactions
Also, when you put the group model into the datastore you get a key back. Can you pass that key (via urlsafe, for example) to the redirect? Then you are guaranteed to retrieve the data, since you have its explicit key. You can't have its key if it's not in the datastore, after all.
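A sketch of that approach with webapp2 (handler names and URLs are invented): the create handler redirects with the new entity's urlsafe key, and the view handler does a get() by key, which is strongly consistent, rather than running a query.

    import webapp2
    from google.appengine.ext import ndb


    class Group(ndb.Model):
        name = ndb.StringProperty()


    class CreateGroup(webapp2.RequestHandler):
        def post(self):
            key = Group(name=self.request.get('name')).put()
            # Pass the key itself so the next request can do a strongly
            # consistent get() instead of an eventually consistent query.
            self.redirect('/group?key=%s' % key.urlsafe())


    class ViewGroup(webapp2.RequestHandler):
        def get(self):
            group = ndb.Key(urlsafe=self.request.get('key')).get()
            self.response.write(group.name if group else 'not found')


    app = webapp2.WSGIApplication([('/create', CreateGroup),
                                   ('/group', ViewGroup)])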
I'd also suggest trying it as is on the production server; behaviour can sometimes be very different locally and in production.

How to handle caching and data-store sync in GAE

I am writing a small appengine application and I want to start using Datastore.
My app has some users and each user is a complicated Java class.
Users might swap some "point" objects between them, so I need the data to be available and fast.
My question is quite generic:
How should I handle the data caching?
Storing the data entirely in the Datastore and fetching it on every call sounds slow at runtime.
On the other hand, holding the data in a static Java class sounds tricky, because every now and then the server restarts and the data is erased.
If I had a main loop, like in a regular console application, I would probably have saved the data two or three times a day at predefined hours.
How should I manage my code so that it saves the state to the datastore every now and then and does not lose any data?
If the number of write operations is smaller than the number of read operations,
it is worth trying "write-through caching" with the memcache module.
https://developers.google.com/appengine/docs/java/memcache/
Basically, write-through caching works like this:
for every write operation, the program updates the datastore, then backs the data up to memcache
for every read operation, the program first tries to read from memcache; if it finds the required data it returns it, otherwise it fetches the data from the datastore and backs the result up to memcache
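A minimal Python sketch of that write-through pattern (the question is Java, but the structure is the same with the Java memcache API; UserData is a hypothetical model):

    from google.appengine.api import memcache
    from google.appengine.ext import ndb


    class UserData(ndb.Model):
        points = ndb.IntegerProperty(default=0)


    def write_user(user):
        """Write-through: update the datastore, then back the data up to memcache."""
        user.put()
        memcache.set('user:%s' % user.key.id(), user)


    def read_user(user_id):
        """Read from memcache first; on a miss, read the datastore and backfill."""
        cache_key = 'user:%s' % user_id
        user = memcache.get(cache_key)
        if user is None:
            user = UserData.get_by_id(user_id)
            if user is not None:
                memcache.set(cache_key, user)
        return user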
As #lucemia noted, you should use memcache.
If you use Objectify, you can set up caching declaratively, without writing any cache-handling code.

Passing (large binary) data to appengine task

I have a task endpoint that needs to process data (say >1MB file) uploaded from a frontend request. However, I do not think I can pass the data from the frontend request via TaskOptions.Builder as I will get the "Task size too large" error.
I need some kind of "temporary" data store for the uploaded data, that can be deleted once the task has successfully processed it.
Option A: Store uploaded data in memcache, pass the key to the task. This is likely going to work most of the time, except when the data is evicted BEFORE the task is processed. If this can be resolved, sounds like a great solution.
Option B: Store the data in datastore (an Entity created just for this purpose). Pass the id to the task. The task is responsible for deleting the entity when it is done.
Option C: Use the Blobstore service. This, IMHO, is similar in concept to Option B.
At the moment, I'm thinking option B is the most feasible way.
I would appreciate any advice on the best way to handle these situations.
If you are storing data larger than 1MB, you must use the Blobstore. (Yes, you can segment the data in the datastore, but it's not worth the work.) There are two things to look out for, however. Make sure that you write the data to the Blobstore in chunks of less than 1MB. Also, since task queue tasks can be executed more than once, your task should not fail if the requested Blobstore key no longer exists: a previous attempt may have deleted it already.
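A sketch of the enqueue/worker side in Python (the question is Java, and the URL and handler names are invented): only the blob key travels in the task payload, and the worker tolerates a missing blob and deletes it once processing succeeds.

    import logging
    import webapp2
    from google.appengine.api import taskqueue
    from google.appengine.ext import blobstore


    def enqueue_processing(blob_key):
        # Only the small blob key goes into the task payload, not the data itself.
        taskqueue.add(url='/tasks/process', params={'blob_key': str(blob_key)})


    class ProcessTask(webapp2.RequestHandler):
        def post(self):
            blob_key = blobstore.BlobKey(self.request.get('blob_key'))
            if blobstore.BlobInfo.get(blob_key) is None:
                # A retried task may find the blob already processed and
                # deleted; returning 200 stops further retries.
                return
            data = blobstore.BlobReader(blob_key).read()
            logging.info('processing %d bytes', len(data))  # real work goes here
            blobstore.delete(blob_key)  # clean up once processing succeeded


    app = webapp2.WSGIApplication([('/tasks/process', ProcessTask)])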
Option B doesn't work either, because the maximum entity size is also 1MB, the same limit as for a task.
Option A can't give any guarantee: values can be evicted from memcache at any time, possibly before the expiration deadline set for the value.
Imho, Option C is best for your needs

using memcached for very small data, good idea?

I have a very small amount of data (~200 bytes) that I retrieve from the database very often. The write rate is insignificant.
I would like to get away from all the unnecessary database calls to fetch this almost static data. Is memcached a good use for this? Something else?
If it's of any relevance I'm running this on GAE using python. The data in question could be easily (de)serialized as json.
Memcache is well-suited for this - reading from the datastore is much more expensive than reading from memcache. This is especially true for small amounts of data for which the cost to retrieve is dominated by latency to the datastore.
If your app receives enough requests that instances typically stay alive for a little while, then you could go one step further and use App Caching to largely avoid memcache too. (Basically, cache the value in a global variable, and also app-cache the time the value was last updated. Provide an accessor for the value which retrieves the latest from memcache/db if it hasn't been updated in X minutes). Memcache is pretty cheap though, so this extra work might only make sense if you access this variable rather frequently.
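A sketch of that instance-memory approach (the names and the 5-minute refresh window are made up): a module-level copy guarded by a timestamp, falling back to memcache and then the datastore.

    import time
    from google.appengine.api import memcache
    from google.appengine.ext import ndb


    class Config(ndb.Model):
        payload = ndb.JsonProperty()   # the ~200 bytes of near-static data


    _MAX_AGE = 300          # refresh the instance copy every 5 minutes
    _cached_value = None    # lives as long as this instance does
    _cached_at = 0


    def get_config():
        """Instance memory first, then memcache, then the datastore."""
        global _cached_value, _cached_at
        if _cached_value is not None and time.time() - _cached_at < _MAX_AGE:
            return _cached_value

        value = memcache.get('config')
        if value is None:
            entity = Config.get_by_id('singleton')
            value = entity.payload if entity else None
            memcache.set('config', value)

        _cached_value, _cached_at = value, time.time()
        return value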
If it changes less often than once per day, you could just hardcode it in webapp code, and reupload the file each time it changes.
