GAE Queue Statistics numbers wrong on development console - google-app-engine

I'm seeing very strange behavior in some code that checks the QueueStatistics for a queue to see if any tasks are currently running. To the best of my knowledge there are NO tasks running, and none have been queued up for the past 12+ hours. The development console corroborates this, saying that there are 0 tasks in the queue.
Looking at the QueueStatistics in my debugger, though, I can see that my process is bailing out because the statistics report on the order of 500+ (!!!) tasks in the queue. They also claim that >1000 tasks ran in the past minute, yet 0 tasks ran in the past hour. And if I parse the ETA Usec values, the ETAs "accurately" fall within the minute after the QueueStatistics were pulled.
This is happening repeatedly whenever I re-run my servlet, and the first thing the servlet does is check the queue statistics. No other servlets, tasks, or cron jobs are running as this is my local development server. Yet the queue statistics continue to insist I've got hundreds of tasks running.
I couldn't find any other reports of this behavior, but it feels like I must be missing something major here in regards to Queue Statistics. The code I'm using is very simple:
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.QueueStatistics;
Queue taskQueue = QueueFactory.getQueue("myQueue");
QueueStatistics stats = taskQueue.fetchStatistics();
if (stats.getNumTasks() > 0) { return; } // bail out if any tasks appear to be pending
What am I missing? Are queue statistics entirely unreliable on the local dev server?

If it works as expected when deployed then that's the standard to go by.
Lots of things don't work the way they do in the deployed environment (parallel threads aren't actually parallel, and backend support for addressing them is somewhat broken at the time of writing), so deploy, deploy, deploy!
Another example is the Channel API. When used locally it falls back to polling, and you'll see hundreds of those polling requests in the logs/browser debugger. But when deployed, all is well and it works as expected.

Related

Why is Google Cloud Tasks so slow?

I use Google Cloud Tasks with AppEngine to process tasks, but the tasks wait about 2-3 minutes in the queue before being sent to my App Engine endpoint.
There is no "delay" set on the tasks, and I expect them to be sent right away.
So the question is: Is Cloud Tasks slow?
As you can see in the following screenshot, Cloud Tasks reports an ETA of about 3 minutes:
The official word from Google is that this is the best you can expect from their task queues.
In my experience, how you configure tasks seems to influence how quickly they get executed.
It seems that:
If you don't change the default behavior of your task queues (e.g., maximum concurrent dispatches, etc.) and you don't specify an execution time for a task (e.g., an eta), then your tasks will execute very soon after submission.
If you mess with either of these two things, then Google takes longer to execute your tasks. My guess is that it is the extra overhead of controlling task rate and execution.
I see from your screenshot that you have a task with an ETA of 2 min 49 sec, which is the time remaining until the task runs. You have high bucket size and concurrency numbers, so I think your issue has more to do with the parameters you use when queueing your tasks, especially the scheduled_time attribute. Check your code to see whether you are adding a delay to your tasks, and tune it down if so.
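For what it's worth, here is a minimal sketch of enqueueing a task without a schedule_time, assuming a recent google-cloud-tasks Python client; the project, location, queue name and handler path are placeholders, not taken from the question:

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")  # placeholders
task = {
    "app_engine_http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "relative_uri": "/handler",  # placeholder App Engine endpoint
    }
    # No "schedule_time" here: leaving it unset lets Cloud Tasks dispatch
    # the task as soon as the queue's rate limits allow.
}
response = client.create_task(request={"parent": parent, "task": task})
print("Created", response.name)

If your code does set schedule_time (or passes a delay through some wrapper), that is what shows up as the ETA in the console.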
Just adding here, that as of February 2023, I can queue tasks and then consume them VERY fast using the Python 3.7 libraries.
Takes me about 13.5 seconds to queue up 1000 tasks.
Takes about 1 minute to process those 1000 tasks using a Cloud Run deployed python/flask app. (No other processing done, just receive and reply with 200).
So, super fast!
BTW, Pub/Sub was much slower in my tests: about 40 ms just to queue each message.

App Engine TaskQueue: Interrupted and 20 Minutes to Restart

It seems that when App Engine task queue tasks get interrupted, they take 20 minutes or more to restart. Is this behavior normal?
I am using the TaskQueue on Google Cloud's App Engine flexible environment. I regularly add tasks to the task queue and they get processed on the system. It appears that occasionally a task gets interrupted in the middle of what it's doing. I don't know why this happens, but I assume it's probably because the instance that it's on restarted itself.
My software is resilient to such restarts, but the problem is that it takes a full 20 minutes for the task to be restarted. Has anyone experienced this before?
I think you're right: an instance grabs the task and then goes down, and the task queue doesn't realize it and waits for some kind of timeout.
This sounds very similar to an issue I experienced:
app engine instance dies instantly, locking up deferred tasks until they hit 10 minute timeout
So to answer your question, I would say yes, this does happen. As for what to do, I guess it depends on what the task is doing, how often it runs, etc. If the 20-minute lag isn't a big deal I would just live with it, because fixing it can be a bit of a wild goose chase, but here's what I would try:
When launching tasks, launch duplicates as well with a staggered value for countdown/eta (see the sketch after this list)
Set up a separate microservice to handle/execute these tasks; hopefully this will make their execution more predictable, and you'll be able to tweak instance size and scaling settings to better suit it.
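For the first suggestion, a rough sketch using the Python taskqueue API (the /worker URL, payload and 5-minute stagger are illustrative; the same idea applies in other runtimes):

from google.appengine.api import taskqueue

def launch_with_backup(payload):
    # Enqueue the primary task to run immediately, plus a duplicate that is
    # staggered by a few minutes. If the first copy is lost when an instance
    # dies, the backup picks the work up without waiting ~20 minutes.
    # The handler must be idempotent so that running both copies is harmless.
    taskqueue.add(url='/worker', params=payload)
    taskqueue.add(url='/worker', params=payload, countdown=300)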

App Engine Tasks with ETA are fired much later than scheduled

I am using Google App Engine push task queues to schedule future tasks that I'd like to fire within about a second of their scheduled time.
Typically I schedule a task 30 seconds from now; when it runs, it triggers a change of state in my system and then schedules another future task.
Everything works fine on my local development server.
However, now that I have deployed to the GAE servers, I notice that the scheduled tasks run late. I've seen them running even two minutes after they have been scheduled.
From the task queues admin console, it actually says for the ETA:
ETA: "2013/11/02 22:25:14 0:01:38 ago"
Creation Time: "2013/11/02 22:24:44 0:02:08 ago"
Why would this be?
I could not find any documentation about the expectation and precision of tasks scheduled by ETA.
I'm programming in Python, but I doubt this makes any difference.
In the python code, the eta parameter is documented as follows:
eta: A datetime.datetime specifying the absolute time at which the task
should be executed. Must not be specified if 'countdown' is specified.
This may be timezone-aware or timezone-naive. If None, defaults to now.
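So a task enqueued roughly like this (the handler URL is illustrative) should, going by the docs, fire about 30 seconds after it is added:

import datetime
from google.appengine.api import taskqueue

# Schedule the next state change ~30 seconds out. eta and countdown are
# mutually exclusive; countdown=30 would be equivalent here.
taskqueue.add(
    queue_name='mgmt',
    url='/trigger_state_change',  # illustrative handler
    eta=datetime.datetime.utcnow() + datetime.timedelta(seconds=30))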
My queue Settings:
queue:
- name: mgmt
  rate: 30/s
The system is under no load whatsoever, except for 5 tasks that should run every 30 seconds or so.
UPDATE:
I have found https://code.google.com/p/googleappengine/issues/detail?id=4901, an accepted feature request for "timely queues", although nothing seems to have been done about it. It acknowledges that tasks with an ETA can run late, even by many minutes.
What other alternative mechanisms could I use to schedule a trigger with second-precision?
GAE makes no guarantees about clock synchronization within and across their data centers; see UTC Time on Google App engine? for a related discussion. So you can't even specify the absolute time accurately, even if they made the (different) guarantee that tasks are executed within some tolerance of the target time.
If you really need this kind of precision, you could consider setting up a persistent GAE "backend" instance that synchronizes itself with a trusted external clock, and provides task queuing and execution services.
(Aside: Unfortunately, that approach introduces a single point of failure, so to fix that you could just take the next steps and build a whole cluster of these backends... But at that point you may as well look elsewhere than GAE, since you're moving away from the GAE "automatic transmission" model, toward AWS's "manual transmission" model.)
I reported the issue to the GAE team and I got the following response:
This appears to be an isolation issue. Short version: a high-traffic user is sharing underlying resources and crowding you out.
Not a very satisfying response, I know. I've corrected this instance, but these things tend to revert over time.
We have a project in the pipeline that will correct the underlying issue. Deployment is expected in January or February of 2014.
See https://code.google.com/p/googleappengine/issues/detail?id=10228
See also thread: https://code.google.com/p/googleappengine/issues/detail?id=4901
After they "corrected this instance" I did some testing for a few hours. The situation improved a little, especially for tasks without an ETA. But for tasks with an ETA I still see at least half of them running at least 10 seconds late. This is far from reliable enough for my requirements.
For now I decided to use my own scheduling service on a different host, until the GAE team "correct the underlying issue" and have a more predictable task scheduling system.

Java App engine backend shuts down abruptly, how to resume work?

I have Cron job which runs every 30mins and queues a task to be executed on a Dynamic Backend (B2).
The Backend loops and does some work, then sleeps for a few minutes, and then repeats the work until the complete job is over after a few hours, after which the Backend shuts down. (While the Backend is running, no new task is actioned.)
Now, two days in a row, I have seen my Backend stop abruptly (after about 1.5 hours) with the familiar "Process terminated because the backend took too long to shutdown." I have searched through the forums but could not identify WHY exactly my Backend shuts down (apart from the theoretical list of reasons that the App Engine docs provide). I have checked my DS/Memcache operations and memory, and all looks normal. I upgraded my Backend from B1 to B2, but no luck.
Q1. Does anybody know how to debug this issue further?
Q2. Even after such a shutdown, I want the job to be completed. If I register a shutdown hook via LifecycleManager.getInstance().setShutdownHook(), what is a good way to ensure the job is resumed (considering that the cron job could still be 29 minutes away from its next execution, and I want the job to do its work every 2 minutes)?
Yes, the same has happened to me. I have a backend that uses constant memory and CPU. App Engine shuts it down periodically, usually after 15 min but sometimes sooner. The docs say that it may get shut down without explanation; it will notify the backend and then shut it down.
You are supposed to handle this gracefully, which means the backend should work in chunks and be able to restart its work. If you can't divide the work into chunks, don't use backends; use a Compute Engine instance.
For your first question you'd have to take a closer look at the logs; App Engine does promise to signal shutdown through a request to /_ah/stop, so that should give more insight into the issue.
Now for your second question, stick with App Engine's suggestion of having more than one instance. In your case you could move away from looping over some entity indefinitely and going to sleep. Instead, have a cron job that looks up a task queue and processes a single task. If the task is processed successfully, mark it as done somewhere, or simply remove it from the queue after you're done processing it. That way, in case of failure the task is still available to be processed (unless it's marked successful), and your additional instances can take over (a rough sketch follows).
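The question is Java, but the shape of that pattern is the same in any runtime; here is a rough sketch with the Python pull-queue API (the queue name and do_chunk are hypothetical):

from google.appengine.api import taskqueue

def handle_cron():
    # Lease one chunk of work, process it, and only delete it on success.
    # If the instance is shut down mid-chunk, the lease expires and the
    # task becomes available again for the next cron run or instance.
    q = taskqueue.Queue('work-chunks')       # hypothetical pull queue
    for task in q.lease_tasks(lease_seconds=120, max_tasks=1):
        do_chunk(task.payload)               # hypothetical work function
        q.delete_task(task)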

Task queue doesn't run all my tasks -- App engine backend

I have an app running on a backend instance. It has 11 tasks. The first one is started by /_ah/start and it, in turn, starts the other ten, the worker tasks. The worker tasks have this structure:
done = False
while not done:
    do_important_stuff()
    time.sleep(30)
    if a_long_time_has_passed():
        done = True
The execution behavior on app engine is the same every time. The scheduling task runs and enqueues the 10 worker tasks. The first seven worker tasks start running, executing correctly. The last three sit in the queue, never running. The task queue app console shows all ten tasks in the queue with seven of them running.
The app also stops responding to HTTP requests, returning 503 status codes, and the logs don't show my HTTP handlers being invoked at all.
The worker task queue is configured with a maximum rate of 1/s and 2 buckets. It's curious that the admin console shows that the enforced rate is 0.1 sec. Since the tasks run forever, they aren't returning unsuccessful completion status codes. And the cpu load is negligible. The workers mostly do a URL fetch and then wait 30 seconds to do it again.
The logs are not helpful. I don't know where to go to find diagnostics that will help me figure it out. I'm testing in a free account. Could there be a limit of 8 tasks executing at one time? I see nothing like that in the documentation, but I've run out of ideas. Eventually, I'd like to run even more tasks in parallel.
Thanks for any advice you can give me.
There's a limit to how many simultaneous requests a backend instance will process, and it sounds like you're running into that limit.
Alternatives include:
Use regular task queues rather than ones against a backend
Start more than one instance of your backend
Use threading to start threads yourself from the start request, rather than relying on the task queue to do it for you (sketched below)
Note that if your tasks are CPU bound, you're not going to get any extra benefit from running 10 of them over 5, 2, or maybe even 1.
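For the third alternative above, a rough sketch using the backends background-thread API (the /_ah/start handler shape and worker_loop stand in for the asker's existing code):

from google.appengine.api import background_thread

def start():  # body of a hypothetical /_ah/start handler
    # Spawn the ten workers as background threads on this backend instance
    # instead of enqueueing ten tasks that all compete for the instance's
    # limited pool of concurrent request slots.
    for worker_id in range(10):
        background_thread.start_new_background_thread(worker_loop, [worker_id])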
