GAE Occasionally stops serving for 1-5 minutes - google-app-engine

Starting about 1 week ago, my app will occasionally and randomly completely stop serving for 1-5 minutes. Requests during this time hang for the full timeout and then return a 500.
The System Status dashboard reads OK, I have no cron jobs or anything special that might cause this disruption (that I know of).
Has anyone experienced this, and is there a solution?

If you have 'threadsafe: false' in your app.yaml configuration, App Engine will not send concurrent requests to your app. If you have a request that's blocking for a really long time, all other requests coming in will line up (and possibly time out) before being serviced. If this is the cause of your problem, either make your app thread-safe or have a look in your logs to find requests that take a long time and fix them.
Alternatively, if your app gets very little traffic, your instances might be getting shut down after they've been idle for a while. If your app takes a long time to start up, that would explain the behavior you're seeing. In app.yaml, you can set 'min_idle_instances' to some value greater than zero to avoid this startup penalty.
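For reference, here is a minimal app.yaml sketch of those two settings, assuming a Python 2.7 standard-environment app deployed with automatic scaling; the handler and values are placeholders, not the asker's actual configuration:

runtime: python27
api_version: 1
threadsafe: true          # let one instance serve concurrent requests

automatic_scaling:
  min_idle_instances: 1   # keep a warm instance around to absorb the startup penalty
  max_pending_latency: 5s # start a new instance if requests queue longer than this

handlers:
- url: /.*
  script: main.app        # placeholder WSGI application

Note that, as the later answers in this thread point out, an idle (resident) instance absorbs traffic while new dynamic instances spin up; it does not stop the scheduler from starting them.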

Related

App Engine TaskQueue: Interrupted and 20 Minutes to Restart

It seems that when App Engine task queue tasks get interrupted, they take 20 minutes or more to restart. Is this behavior normal?
I am using the TaskQueue on Google Cloud's App Engine Flexible environment. I regularly add tasks to the task queue and they get processed on the system. It appears that occasionally, a task gets interrupted in the middle of what it's doing. I don't know why this happens, but I assume it's probably because the instance it's on restarted itself.
My software is resilient to such restarts, but the problem is that it takes a full 20 minutes for the task to be restarted. Has anyone experienced this before?
I think you're right, an instance grabs the task and then goes down. Taskqueue doesn't realize it and waits for some kind of timeout.
This sounds very similar to an issue I experienced:
app engine instance dies instantly, locking up deferred tasks until they hit 10 minute timeout
So to answer your question, I would say yes, this does happen. As for what to do, I guess it depends on what this task is doing, how often it runs, etc. If the 20-minute lag isn't a big deal, I would just live with it, because fixing it can be a bit of a wild goose chase, but here's what I would try:
- When launching tasks, launch duplicates as well with a staggered value for countdown/eta (see the sketch after this list).
- Set up a separate microservice to handle/execute these tasks; hopefully this will make their execution more predictable, and you'll be able to tweak instance size and scaling settings to better suit it.
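A minimal sketch of the first suggestion, using the standard-environment Java Task Queue API (an app on the Flexible environment would enqueue through the Task Queue REST API instead). The queue name, worker URL, and jobId parameter are hypothetical, and the worker must be idempotent so that the duplicate copies are harmless:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class StaggeredTaskLauncher {
    // Enqueue the same work several times with a staggered countdown, so that
    // if the first attempt dies with its instance, a later copy picks the work up.
    // The worker should check a completion marker and exit early if the job is done.
    public static void launchWithBackups(String jobId) {
        Queue queue = QueueFactory.getQueue("my-queue");             // hypothetical queue
        long[] delaysMillis = {0L, 5 * 60 * 1000L, 15 * 60 * 1000L}; // now, +5 min, +15 min
        for (long delay : delaysMillis) {
            queue.add(TaskOptions.Builder
                    .withUrl("/my-worker")                           // hypothetical handler
                    .param("jobId", jobId)
                    .countdownMillis(delay));
        }
    }
}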

Java App engine backend shuts down abruptly, how to resume work?

I have a cron job which runs every 30 minutes and queues a task to be executed on a dynamic backend (B2).
The backend loops and does some work, then sleeps for a few minutes, and then repeats the work until the complete job is finally over after a few hours, after which the backend shuts down. (While the backend is running, no new task is actioned.)
Now, two days in a row, I have seen my backend stop abruptly (after 1.5 hrs) with the familiar "Process terminated because the backend took too long to shutdown." I have searched through the forums but could not identify WHY exactly my backend shuts down (apart from the theoretical list of reasons that the App Engine docs provide). I have checked my Datastore/Memcache operations and memory, and all looks normal. I upgraded my backend from B1 to B2, but no luck.
Q1. Does anybody know how to debug this issue further?
Q2. Even after this, I want the job to be completed. If I register a shutdown hook with LifecycleManager.getInstance().setShutdownHook(), what is a good way to ensure that the job is resumed (considering that the cron job could still be 29 minutes away from its next execution, and I want the job to do its work every 2 minutes)?
Yes, the same has happened to me. I have a backend that uses constant memory and CPU. App Engine shuts it down periodically, usually after 15 minutes but sometimes sooner. The docs say that it may get shut down without explanation; it will notify the backend and then shut it down.
You are supposed to handle this gracefully, which means your backend should work in chunks and be able to restart its work. If you can't divide the work into chunks, don't use backends; use a Compute Engine instance instead.
For your first question, you'd have to take a closer look at the logs. App Engine does promise to signal shutdown through a request to /_ah/stop, so that would give more insight into the issue.
Now for your second question, stick with App Engine's suggestion of having more than one instance. In your case, you could move away from looping over some entity indefinitely and then sleeping. Instead, have a cron job which looks up a task queue and processes a single task. If it's processed successfully, mark it as such somewhere, or simply remove it from the queue after you're done processing it. That way, in case of failure, the task is still available to be processed unless it's marked successful, and your additional instances can take over.
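To make the shutdown-hook idea from the question concrete, here is a hedged Java sketch. It uses LifecycleManager.getInstance().setShutdownHook() as the question mentions, but the checkpointing, the /resume-job handler, and the chunked work methods are hypothetical placeholders for whatever the real job does:

import com.google.appengine.api.LifecycleManager;
import com.google.appengine.api.LifecycleManager.ShutdownHook;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class ResumableJob {
    private volatile int lastCompletedStep = 0;

    public void run() {
        // Register the hook before starting the long loop. When App Engine decides
        // to shut the backend down, it calls shutdown(), leaving a short window to
        // save progress and re-enqueue the remaining work instead of losing it.
        LifecycleManager.getInstance().setShutdownHook(new ShutdownHook() {
            @Override
            public void shutdown() {
                saveCheckpoint(lastCompletedStep);           // hypothetical: persist progress
                QueueFactory.getDefaultQueue().add(          // re-enqueue the rest of the job
                        TaskOptions.Builder
                                .withUrl("/resume-job")      // hypothetical resume handler
                                .param("startStep", String.valueOf(lastCompletedStep + 1)));
                LifecycleManager.getInstance().interruptAllRequests();
            }
        });

        for (int step = lastCompletedStep + 1; step <= totalSteps(); step++) {
            doChunkOfWork(step);       // hypothetical small, restartable unit of work
            lastCompletedStep = step;
        }
    }

    private void saveCheckpoint(int step) { /* e.g. write a progress entity to the datastore */ }
    private void doChunkOfWork(int step)  { /* one chunk of the multi-hour job */ }
    private int totalSteps()              { return 100; }    // placeholder
}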

GAE: Why do I experience loading requests even though I have fixed the number of instances to exactly one?

I have a low-load application which experienced latency spikes (requests taking up to 10s to return) due to loading requests, as seen in the logs:
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
Here I assume that "new process" means "new instance".
In order to avoid this, I fixed the number of idle instances to exactly one (max=1 and min=1), so there is always one instance running ("resident instance") and GAE shouldn't start new ones. Billing is enabled.
However, I still experience loading requests. Why? Can anything be done about this?
Idle instances are "reserve" instances - they are meant to handle spikes when traffic increases, not the "normal" traffic. Idle instances are used only during the spin-up of the dynamic instances.
So, when you have one idle instance and no dynamic instances running and you get a request, then the idle instance should handle the request, but a new dynamic instance will still be spun up.
I too experienced the same problem with my low-traffic app, and here is the practical setup that almost always prevents my users from facing a cold start:
- 1 resident F4 instance
- pending latency set to 15 sec
- I worked so that my warmup requests are as fast as possible (under 10 sec), which is still quite long because I use the Play framework (Java); see the warmup handler sketch below
- and when I really don't want any problems, I create fake traffic by pinging my app.
With this config, the resident instance usually serves around 50 requests; during that time, a dynamic instance receives a warmup request and then starts serving.
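As an illustration of the warmup point above, a minimal Java sketch of a warmup handler. Warmup requests have to be enabled for the app (for Java, <warmup-requests-enabled>true</warmup-requests-enabled> in appengine-web.xml) and the servlet mapped to /_ah/warmup in web.xml; the preload methods are hypothetical placeholders for whatever initialization is actually slow:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Handles App Engine warmup requests (/_ah/warmup). Keeping this fast, by doing
// only the initialization every request really needs, shortens the window in
// which a user request can land on a cold instance.
public class WarmupServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        preloadFramework();   // hypothetical: force heavy classes/framework to load now
        primeCaches();        // hypothetical: pull hot data into memcache or local memory
        resp.setStatus(HttpServletResponse.SC_OK);
    }

    private void preloadFramework() { /* placeholder */ }
    private void primeCaches()      { /* placeholder */ }
}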

App Engine loading request even when idle instance available

I have a simple app running on App Engine but I'm having odd problems with latency. It's a Python 2.7 app and a loading request takes between 1.5 and 10 secs (I guess depending on how GAE is feeling). This is a low-traffic site right now, so previously GAE was sitting with no idle instances and most requests were loading requests, resulting in a long wait time on the first page view.
I've tried configuring the minimum number of idle instances to "1" so that these infrequent page views can immediately hit a warm instance.
However, I've seen several cases now where even with one instance sitting unused, GAE will route an incoming request to a loading instance, leaving the warm instance untouched:
(screenshot: GAE dashboard showing odd scheduling)
How can I prevent this from happening? I feel I must be understanding something wrong, because I certainly don't expect this behavior.
Update: Also, what makes this even less comprehensible is that the app has threadsafe enabled, so I really don't understand why GAE would get flustered and spin up an instance for a single, lone request.
Actually, I believe this is normal behavior. Idle instances are supposed to guarantee a minimum number of instances always available (for spiky load).
So, when some requests start coming in, they are initially served by idle instances, but at the same time the App Engine scheduler will start launching new instances to guarantee the same number of idle instances even during a sudden increase in load; that is, to "cover" for those idle instances that became busy serving requests.
It is described in detail on the Adjusting Application Performance page.
Arrrgh! Suffer from this myself. This topic-area has come up in several threads (GAE groups & SO). If someone can dial-in the settings for a low-traffic site (billing on/off), that would be a real benefit. IIRC, someone with what I think is deep GAE experience noted in one thread that the Scheduler does not do well with very low volume apps. I have also seen wildly different startup times within a relatively short period of time. Painful to see a spinup take 700ms then 7000ms just a few minutes later. Overall the issue is not so much the cost to me, but more so the waste of infrastructure resources. In testing I've had two instances running despite having pinged the app with an RPC once every few minutes. If 50k other developers are similarly testing, that could accumulate into a significant waste.

Google App Engine - Request was aborted after waiting too long to attempt to service your request

I get this error sometimes.
"Request was aborted after waiting too long to attempt to service your request. Most likely, this indicates that you have reached your simultaneous dynamic request limit. This is almost always due to excessively high latency in your app. Please see http://code.google.com/appengine/docs/quotas.html for more details."
The request that causes it has 10 seconds of latency and 0 ms of CPU time. It is a simple JSP page that doesn't do anything that takes long at all. Also, my app is very low traffic, and every time this has happened, it was the only request being processed.
What causes this?
If your application is low-traffic, it's possibly the startup time. There seems to be an ongoing issue where it takes so long to start an instance up that it breaches the time limit.
Some people have "worked around" this by having a cron/scheduled request that runs every few minutes and does nothing (though personally I think this is counter-productive, somewhat undermining the reason Google spins your app down!).
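A minimal cron.xml sketch of that workaround (the app in this question is Java/JSP; a Python app would use cron.yaml with the same fields). The /ping URL is a hypothetical handler that simply returns 200 OK:

<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <!-- /ping is a hypothetical no-op handler whose only job is to keep an instance warm -->
    <url>/ping</url>
    <description>keep-alive ping to avoid cold starts</description>
    <schedule>every 5 minutes</schedule>
  </cron>
</cronentries>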
There was an issue in their bugtracker about this:
http://code.google.com/p/googleappengine/issues/detail?id=2456
It's now marked as fixed for version 1.4, and there's a little info on it here:
http://googleappengine.blogspot.com/2010/12/happy-holidays-from-app-engine-team-140.html
Always On - For high-priority applications with low or variable traffic, you can now reserve instances via App Engine's Always On feature. Always On is a premium feature costing $9 per month which reserves three instances of your application, never turning them off, even if the application has no traffic. This mitigates the impact of loading requests on applications that have small or variable amounts of traffic.

Resources