It seems that when App Engine TaskQueue tasks get interrupted, they take 20 minutes or more to restart. Is this behavior normal?
I am using the TaskQueue on Google Cloud's App Engine Flexible system. I regularly add tasks to the queue and they get processed on the system. Occasionally, a task gets interrupted in the middle of what it's doing. I don't know why this happens, but I assume it's because the instance that it's on restarted itself.
My software is resilient to such restarts, but the problem is that it takes a full 20 minutes for the task to be restarted. Has anyone experienced this before?
I think you're right: an instance grabs the task and then goes down. The task queue doesn't realize it and waits for some kind of timeout.
This sounds very similar to an issue I experienced:
App Engine instance dies instantly, locking up deferred tasks until they hit 10 minute timeout
So to answer your question, I would say yes, this does happen. As for what to do, I guess it depends on what this task is doing, how often it runs, etc. If the 20 minute lag isn't a big deal I would just live with it, because fixing it can be a bit of a wild goose chase, but here's what I would try:
When launching tasks, launch duplicates as well, with a staggered value for countdown/eta.
Set up a separate microservice to handle/execute these tasks. Hopefully this will make its execution more predictable, and you'll be able to tweak instance size and scaling settings to better suit it.
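The duplicate-with-stagger idea could look roughly like the sketch below. It is only an illustration, not the poster's code: `enqueue` is a hypothetical callable that wraps whatever queue API you use, `dedupe_key` is an invented field name, and the in-memory `_done` set stands in for a datastore or memcache record that would need to be shared (and updated atomically) across instances.

```python
import hashlib


def staggered_countdowns(copies, stagger_seconds, first_delay=0):
    """Return the countdown (in seconds) for each duplicate of a task."""
    return [first_delay + i * stagger_seconds for i in range(copies)]


def enqueue_with_duplicates(enqueue, payload, copies=3, stagger_seconds=120):
    """Enqueue `copies` of the same work item with staggered countdowns.

    `enqueue` is a hypothetical callable (payload, countdown) -> None that
    wraps the real queue API; it is an assumption of this sketch.
    """
    # A shared key lets the handler detect that another copy already finished.
    raw = repr(sorted(payload.items())).encode("utf-8")
    key = hashlib.sha256(raw).hexdigest()
    for countdown in staggered_countdowns(copies, stagger_seconds):
        enqueue(dict(payload, dedupe_key=key), countdown)
    return key


# Stand-in for a shared record of finished keys (assumption: in production
# this would live in a datastore and be checked transactionally).
_done = set()


def handle(task, work):
    """Run `work` once per dedupe_key; later duplicates become no-ops."""
    if task["dedupe_key"] in _done:
        return False
    work(task)
    _done.add(task["dedupe_key"])
    return True
```

If the first copy is stuck on a dead instance, the next copy becomes eligible after `stagger_seconds` instead of waiting for the queue's own timeout. This only makes sense when the work is safe to attempt more than once.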
Related
Starting about 1 week ago, my app will occasionally and randomly completely stop serving for 1-5 minutes. Requests during this time hang for the full timeout and then return a 500.
The System Status dashboard reads OK, I have no cron jobs or anything special that might cause this disruption (that I know of).
Has anyone experienced this, and is there a solution?
If you have 'threadsafe: false' in your app.yaml configuration, App Engine will not send concurrent requests to your app. If you have a request that's blocking for a really long time, all other requests coming in will line up (and possibly time out) before being serviced. If this is the cause of your problem, either make your app thread-safe or have a look in your logs to find requests that take a long time and fix them.
Alternatively, if your app gets very little traffic, your instances might be getting shut down after they've been idle for a while. If your app takes a long time to start up, that would explain the behavior you're seeing. In app.yaml, you can set 'min_idle_instances' to some value greater than zero to avoid this startup penalty.
I am using Google App Engine Task push queues to schedule future tasks that I'd like to run within about a second of their scheduled time.
Typically I would schedule a task 30 seconds from now, that would trigger a change of state in my system, and finally schedule another future task.
Everything works fine on my local development server.
However, now that I have deployed to the GAE servers, I notice that the scheduled tasks run late. I've seen them running even two minutes after they have been scheduled.
From the task queues admin console, it actually says for the ETA:
ETA: "2013/11/02 22:25:14 0:01:38 ago"
Creation Time: "2013/11/02 22:24:44 0:02:08 ago"
Why would this be?
I could not find any documentation about the expectation and precision of tasks scheduled by ETA.
I'm programming in Python, but I doubt this makes any difference.
In the Python code, the eta parameter is documented as follows:
eta: A datetime.datetime specifying the absolute time at which the task
should be executed. Must not be specified if 'countdown' is specified.
This may be timezone-aware or timezone-naive. If None, defaults to now.
My queue Settings:
queue:
- name: mgmt
rate: 30/s
The system is under no load whatsoever, except for 5 tasks that should run every 30 seconds or so.
UPDATE:
I have found https://code.google.com/p/googleappengine/issues/detail?id=4901 which is an accepted feature request for timely queues, although nothing seems to have been done about it. It acknowledges that tasks with an ETA can run late, even by many minutes.
What other alternative mechanisms could I use to schedule a trigger with second-precision?
GAE makes no guarantees about clock synchronization within and across their data centers; see UTC Time on Google App engine? for a related discussion. So you can't even specify the absolute time accurately, even if they made the (different) guarantee that tasks are executed within some tolerance of the target time.
If you really need this kind of precision, you could consider setting up a persistent GAE "backend" instance that synchronizes itself with a trusted external clock, and provides task queuing and execution services.
(Aside: Unfortunately, that approach introduces a single point of failure, so to fix that you could just take the next steps and build a whole cluster of these backends... But at that point you may as well look elsewhere than GAE, since you're moving away from the GAE "automatic transmission" model, toward AWS's "manual transmission" model.)
I reported the issue to the GAE team and I got the following response:
This appears to be an isolation issue. Short version: a high-traffic user is sharing underlying resources and crowding you out.
Not a very satisfying response, I know. I've corrected this instance, but these things tend to revert over time.
We have a project in the pipeline that will correct the underlying issue. Deployment is expected in January or February of 2014.
See https://code.google.com/p/googleappengine/issues/detail?id=10228
See also thread: https://code.google.com/p/googleappengine/issues/detail?id=4901
After they "corrected this instance" I did some testing for a few hours. The situation improved a little, especially for tasks without an ETA. But for tasks with an ETA I still see at least half of them running at least 10 seconds late. This is far from reliable enough for my requirements.
For now I decided to use my own scheduling service on a different host, until the GAE team "corrects the underlying issue" and has a more predictable task scheduling system.
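The poster does not say how that scheduling service was built. Purely as an illustration of the idea, a minimal in-process scheduler with roughly sub-second granularity might look like the sketch below; the class name `TinyScheduler` and its methods are invented for this example, and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import heapq
import itertools
import time


class TinyScheduler(object):
    """Run callbacks at (approximately) their due time on a single host."""

    def __init__(self, clock=time.time, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._heap = []
        # Tie-breaker so entries with equal due times never compare callbacks.
        self._counter = itertools.count()

    def schedule_in(self, delay_seconds, callback):
        """Schedule `callback` to run `delay_seconds` from now."""
        due = self._clock() + delay_seconds
        heapq.heappush(self._heap, (due, next(self._counter), callback))

    def run_pending(self):
        """Run every callback whose due time has passed; return how many ran."""
        ran = 0
        while self._heap and self._heap[0][0] <= self._clock():
            _, _, callback = heapq.heappop(self._heap)
            callback()
            ran += 1
        return ran

    def run_forever(self, tick=0.1):
        """Poll for due callbacks every `tick` seconds."""
        while True:
            self.run_pending()
            self._sleep(tick)
```

A callback can call `schedule_in` again to chain the next trigger, which matches the "change state, then schedule another future task" pattern from the question. Lateness is bounded by `tick` plus however long earlier callbacks take, and pending entries are lost if the process dies, so a real service would also need persistence.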
I have a cron job which runs every 30 minutes and queues a task to be executed on a Dynamic Backend (B2).
The Backend loops and does some work, then sleeps for a few minutes and then repeats the work until the complete job is finally over after a few hours, after which the Backend shuts down. (While the backend is running, no new Task is actioned.)
Now, two days in a row, I have seen my Backend stop abruptly (after 1.5 hours) with the familiar "Process terminated because the backend took too long to shutdown.". I have searched through the forums but could not identify WHY exactly my backend shuts down (apart from the theoretical list of reasons that the App Engine docs provide). I have checked my DS/Memcache operations and memory, and all looks normal. I upgraded my backend from B1 to B2, but no luck.
Q1. Does anybody know how to debug this issue further?
Q2. Even after this, I want the job to be completed. If I register a shutdown hook with LifecycleManager.getInstance().setShutdownHook(), what is a good way to ensure that the job is resumed (considering that the cron job could still be 29 minutes away from its next execution, and I want the job to do its stuff every 2 minutes)?
Yes, the same has happened to me. I have a backend that uses constant memory and CPU. App Engine shuts it down periodically, usually after 15 minutes but sometimes before that. The docs say that it may get shut down without explanation; it will notify the backend and then shut it down.
You are supposed to handle it gracefully, which means the job should work in chunks and be able to restart its work. If you can't divide the work into chunks, don't use backends; use a Compute Engine instance.
For your first question, you'd have to take a closer look at the logs. App Engine does promise to indicate shutdown behaviour through a request to /_ah/stop, so that would give more insight into the issue.
Now for your second question: stick with App Engine's suggestion of having more than one instance. In your case you could move away from looping over some entity indefinitely and going to sleep. Instead, have a cron job that looks up a task queue and processes a single task. If the task is processed successfully, mark it as done somewhere, or remove it from the queue once you've finished processing it. That way, if there is a failure, the task is still available to be processed until it is marked successful, and your additional instances can take over.
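The "work in chunks and mark progress" advice from both answers can be sketched as below. The question is about the Java runtime; this sketch is in Python purely for illustration, and every name in it (`process_in_chunks`, `checkpoint`, `should_stop`) is hypothetical. The `checkpoint` dict stands in for durable storage such as a datastore entity, and `should_stop` stands in for whatever signal a shutdown hook would set.

```python
def process_in_chunks(items, handle_item, checkpoint, chunk_size=10,
                      should_stop=lambda: False):
    """Process `items` in chunks, recording progress after each chunk.

    `checkpoint` is a dict standing in for durable storage (an assumption of
    this sketch). Returns True when every item is done, False if the run was
    interrupted and should be resumed later from the saved position.
    """
    position = checkpoint.get("position", 0)
    while position < len(items):
        if should_stop():
            # Shutdown requested: stop between chunks so no work is half done.
            return False
        chunk = items[position:position + chunk_size]
        for item in chunk:
            handle_item(item)
        position += len(chunk)
        # Persist progress only after the whole chunk succeeded.
        checkpoint["position"] = position
    return True
```

Because progress is saved only after a chunk completes, an instance that dies mid-chunk causes that chunk to be redone by whichever instance picks the job up next, so `handle_item` should tolerate being called twice for the same item.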
I have a simple app running on App Engine but I'm having odd problems with latency. It's a Python 2.7 app and a loading request takes between 1.5 and 10 secs (I guess depending on how GAE is feeling). This is a low traffic site right now, so previously GAE was sitting with no idle instances and most requests were loading requests, resulting in a long wait time on the first page view.
I've tried configuring the minimum number of idle instances to "1" so that these infrequent page views can immediately hit a warm instance.
However, I've seen several cases now where even with one instance sitting unused, GAE will route an incoming request to a loading instance, leaving the warm instance untouched:
(Screenshot: GAE dashboard showing odd scheduling)
How can I prevent this from happening? I feel I must be understanding something wrong, because I certainly don't expect this behavior.
Update: Also, what makes this even less comprehensible is that the app has threadsafe enabled, so I really don't understand why GAE would get flustered and spin up an instance for a single, lone request.
Actually, I believe this is normal behavior. Idle instances are supposed to guarantee a minimum number of instances always available (for spiky load).
So, when some requests start coming in, they are initially served by idle instances, but at the same time the App Engine scheduler will start launching new instances to always guarantee the same number of idle instances, even during suddenly increased load. That is, to "cover" for those idle instances that became busy serving requests.
It is described in detail on the Adjusting Application Performance page.
Arrrgh! Suffer from this myself. This topic-area has come up in several threads (GAE groups & SO). If someone can dial-in the settings for a low-traffic site (billing on/off), that would be a real benefit. IIRC, someone with what I think is deep GAE experience noted in one thread that the Scheduler does not do well with very low volume apps. I have also seen wildly different startup times within a relatively short period of time. Painful to see a spinup take 700ms then 7000ms just a few minutes later. Overall the issue is not so much the cost to me, but more so the waste of infrastructure resources. In testing I've had two instances running despite having pinged the app with an RPC once every few minutes. If 50k other developers are similarly testing, that could accumulate into a significant waste.
I am using Task Queue in GAE to perform some background work for my application. I have learned that there is a 10 minute time limit for a particular task. My concern is how to test this in my local environment. I tried a thread sleep, but it didn't throw any exception as mentioned in the Google App Engine docs. Also, is this time limit measured in CPU time or actual time?
Thanks.
The time is measured in wall clock time. The development server doesn't enforce time limits, although it's unclear why you'd want to test it because it's unlikely your tests will perform the same as they will in production, so trying to guess how much you'll be able to accomplish in 10 minutes on the production servers by seeing how much you can accomplish in 10 minutes on the development server will fail horribly.
On your development server, start a timer when a task is initiated, and keep checking in your code whether you have reached 10 minutes of wall clock time. When you reach it, throw a DeadlineExceededError. It would be better to put the try and except statements in the handler classes that call a particular function of your code.
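A minimal sketch of that timer approach follows. It does not use the App Engine SDK: `SimulatedDeadlineExceededError` is a local stand-in defined here for the production error, and `Deadline` and `run_task` are hypothetical names. The clock is injectable so the check can be exercised without actually waiting 10 minutes.

```python
import time


class SimulatedDeadlineExceededError(Exception):
    """Local stand-in for the production deadline error (hypothetical name)."""


class Deadline(object):
    """Tracks a wall clock budget that starts when the object is created."""

    def __init__(self, limit_seconds=600, clock=time.time):
        self._clock = clock
        self._expires = clock() + limit_seconds

    def check(self):
        """Raise once the wall clock budget has been used up."""
        if self._clock() >= self._expires:
            raise SimulatedDeadlineExceededError("simulated task deadline reached")


def run_task(steps, deadline):
    """Run `steps` in order, stopping cleanly when the deadline is hit.

    Returns the number of steps that completed.
    """
    completed = 0
    try:
        for step in steps:
            deadline.check()  # check between units of work
            step()
            completed += 1
    except SimulatedDeadlineExceededError:
        # In a real handler this is where progress would be saved or the
        # remaining work re-enqueued.
        pass
    return completed
```

Unlike production, where the error can be raised at any point in the request, this version can only fire at the explicit `check()` calls, so it tests the recovery path rather than the exact timing.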