I am planning to use Objectify to talk to Cloud Datastore from GAE Flex. The app is going to be running quite a few background threads talking to Datastore regarding which I have a couple of questions.
I am not planning to use any memcache setup, and since these threads are going to be running for a long time, I don't want the session cache to fill up either. I could not find a way to set ofy() to never cache locally, and the only option seems to be to run a clear() operation periodically. Is there a better way to avoid these caches?
As I see it, we need to wrap any such invocations of ofy() in a run() block to perform the cleanup. I wanted to confirm that this is the only way to use it outside request scope and that there is no built-in support for these longer contexts.
Thanks
You are correct. ObjectifyService.run() is the way to run requests outside of the ObjectifyFilter.
There is not currently any way to disable the session cache. The session cache is pretty deeply woven into the fabric of Objectify in order to get sane behavior for @Load operations. It's not impossible, it just hasn't risen to the top of the priority queue.
The best way to iterate large quantities of your datastore without hitting memory issues is to iterate specifying an explicit chunk() size and then clear() after processing that number of items. If you use Guava's Iterators.partition(), this is pretty much a one-liner.
My Python API initializes a global variable which takes about 10 seconds to fully initialize before the server starts running. When GAE starts a new instance, is this same initialization required again, or am I able to access the same variable across multiple instances?
This answer is just complementary to the other mentioned approaches, in most if not all cases they can be combined.
If you're in the standard environment you can take advantage of the warmup requests to well... warm (most of) your instances up before real traffic hits them.
Multithreading complexity doesn't really matter in such cases since you know that no other request can hit the instance until its init is complete, i.e. until it successfully responds to the warmup request. So you can optimize for this case while still playing it safe (even if not very efficiently) for the rare cases when instances still start up cold and can get multiple requests in parallel.
Warmup requests aren't supported in the flexible environment, but:
To warm up your app, implement a health check handler that only
returns a ready status when the application is warmed up.
For example, you can create a readiness check that returns a ready
status after initializing the cache so your app won't receive traffic
until it is ready.
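The readiness logic described in that quote could look roughly like this. It is a framework-neutral sketch: `initialise_cache` and `readiness_check` are hypothetical names, the URL the check is served on is not shown, and it assumes a non-200 status is treated as "not ready":

```python
import threading

_ready = threading.Event()


def initialise_cache():
    """Hypothetical stand-in for the slow start-up work."""
    # ... load data, build caches ...
    _ready.set()  # flip the flag only once everything is loaded


def readiness_check():
    """Return (HTTP status, body) for a hypothetical readiness endpoint."""
    if _ready.is_set():
        return 200, "ready"
    return 503, "warming up"
```

Running `initialise_cache` in a background thread at start-up lets the server answer the check immediately while the heavy work is still in progress.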
Each instance in the application is a separate interpreter, so globals need to be initialised per instance.
If initialisation is costly, but the computed value doesn't change frequently it may be worth storing the value in memcache, the datastore, a database or some other globally available store. Retrieval from memcache is fast, but persistence is not guaranteed, so you may need to re-run the initialisation from time to time. Retrieval from the datastore or a database is usually slower, but persistence is guaranteed in normal circumstances.
As dhauptman observes in the comments, this article contains some advice on lazy-loading global variables.
Everyone learns to use Memcache pretty quick. Another one I've learned recently is setting indexed=False for Model properties that I am not going to query against. What are some others? What are the big ones?
Don't use offset in queries. Use cursors instead.
Explanation: a query with an offset loads all entities up to offset+limit and charges you for them, but only returns limit entities.
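To make the cost difference concrete, here is a toy in-memory model, not the Datastore API. `FakeStore` and its methods are hypothetical, and "reads" simply counts how many entities each paging style touches:

```python
import bisect


class FakeStore:
    """In-memory stand-in for a datastore; counts entities read."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.reads = 0

    def fetch_offset(self, offset, limit):
        # Offset paging: the skipped entities are still read (and billed).
        scanned = self.keys[: offset + limit]
        self.reads += len(scanned)
        return scanned[offset:]

    def fetch_cursor(self, cursor, limit):
        # Cursor paging: resume right after the last key already seen.
        start = 0 if cursor is None else bisect.bisect_right(self.keys, cursor)
        page = self.keys[start : start + limit]
        self.reads += len(page)
        next_cursor = page[-1] if page else cursor
        return page, next_cursor
```

Walking 100 keys in ten pages of 10, the offset variant in this model reads 10 + 20 + ... + 100 = 550 entities in total, against 100 for the cursor variant.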
Minimize instance use, by tweaking idle instances and pending latency appropriately for your app.
A couple of things helped us (not all may be low-hanging at first). First, we denormalized our datastore to reduce joins. I'm using SQL terms because I came from a SQL background. By spreading commonly queried elements around, we reduced the number of reads we had to make considerably, even after factoring in Memcache. This potentially increases writes, but for most apps the number of reads far outweighs the number of writes.
Next, we started using task queues, backends, and the channel API more often. I don't remember specific examples but I do remember we were able to reduce our front-end usage down below the free quota mark by moving some processing around to queues and backends and by sending data down via channel rather than having the client poll.
Also, we use objectify for our data access which we configure to automatically use memcache wherever appropriate.
I have implemented instance mem-caches because we have very static data and the memcache is not very reliable and rather slow compared to an instance cache.
However, there are some situations where I would like to invalidate the instance caches. Is there any way to look the instances up?
Example
Admin A updates a large gamesheet on instance A, and that instance looks up all other instances and updates their data using a simple REST API.
TL;DR: you can't.
Unlike backends, frontend instances are not individually addressable; that is, there is no way for you to make a RESTy URLFetch call to a specific frontend instance. Even if they were, there is no builtin mechanism for enumerating frontend instances, so you would need to roll your own, e.g. keeping a list of live instances in the datastore and adding to it in a warmup request and removing on repeated connect failure. But at that point you've just implemented a slower, more costly, and less available memcache service.
If you moved all the cache services to backends (using your instance-local static, or, for instance, running a memcached written in Go as a different app version), it's true you would gain a degree of control (or at least transparency) regarding evictions. Availability, speed, and cost would still likely suffer.
Does anybody know if GAE provides a way to route a request to a specified instance? The startup of new instances is killing me on facebook URL linter requests since they timeout before a new instance can start up sometimes. I have no way to control this timeout either. So what I'd like to do is to keep specified instances idle for these calls without needing to hack around it with cron jobs. I think this would be more cost effective as well.
The new modules feature allows for direct addressing of instances, much like how backends used to work.
Like so:
http://instance.version.module.app-id.appspot.com
Read more in the documentation here.
It sounds like you need a dedicated set of "always alive" instances to handle just those calls. Backends might be a good solution for that. You can set a separate url address to route to a specific backend.
http://code.google.com/appengine/docs/python/backends/overview.html#Addressing_Backends
This is not possible for frontends, but you can have requests directed to specific backends, and you can make backends externally accessible if you choose.
I'd suggest working on your app to improve loading time, though. If it's taking so long a bot gives up, that's got to have serious implications for usability by your users. Also, make sure you've got warmup requests enabled.
What did you do to make sure the CPU% is low?
Any sample code to look at?
I ask because every datastore read/query seems to push the CPU% beyond 100% and I get the yellow & red highlight in my dashboard. I read elsewhere that it's normal, but surely there's something that can be done about it.
Use appstats to get more detail on any long running tasks. It does a good job breaking down exactly how the CPU time is spent and lets you drill down individual calls and view the stack to narrow down which command is running long.
URLFetch and database calls tend to be expensive. As Sam suggests, both can be memcached for very significant savings.
You profile your code and improve its efficiency.
Datastore operations are expensive. Try reducing their usage with the help of memcache.
Is your app restarting a lot?
I notice even a very minimal app will take over 1sec to load when it has been inactive for a while -- which brings up a warning marker in the log.
For pages you can cache, you can set a Cache-Control header if you have a request handler.
self.response.headers["Cache-Control"] = "public,max-age=%s" % 86400
In many cases you also can use a cron job to regularly update your cache.
I've written a simple library to reduce datastore operations by using local instance and memcache as storage layers along with datastore. It also supports cached GQL results. I managed to cut my apps' CPU usage by 50% at least. You can give it a try if you're not using any sensitive data.