Task queue doesn't run all my tasks -- App engine backend - google-app-engine

I have an app running on a backend instance. It has 11 tasks. The first one is started by /_ah/start and it, in turn, starts the other ten, the worker tasks. The worker tasks have this structure:
    done = False
    while not done:
        do_important_stuff()
        time.sleep(30)
        if a_long_time_has_passed():
            done = True
The execution behavior on app engine is the same every time. The scheduling task runs and enqueues the 10 worker tasks. The first seven worker tasks start running, executing correctly. The last three sit in the queue, never running. The task queue app console shows all ten tasks in the queue with seven of them running.
The app also stops responding to HTTP requests, returning 503 status codes, and the logs do not show my HTTP handlers being invoked.
The worker task queue is configured with a maximum rate of 1/s and 2 buckets. It's curious that the admin console shows that the enforced rate is 0.1 sec. Since the tasks run forever, they aren't returning unsuccessful completion status codes. And the cpu load is negligible. The workers mostly do a URL fetch and then wait 30 seconds to do it again.
The logs are not helpful. I don't know where to go to find diagnostics that will help me figure it out. I'm testing in a free account. Could there be a limit of 8 tasks executing at one time? I see nothing like that in the documentation, but I've run out of ideas. Eventually, I'd like to run even more tasks in parallel.
Thanks for any advice you can give me.

There's a limit to how many simultaneous requests a backend instance will process, and it sounds like you're running into that limit.
Alternatives include:
Use regular task queues rather than ones against a backend
Start more than one instance of your backend
Use threading to start threads yourself from the start request, rather than relying on the task queue to do it for you
Note that if your tasks are CPU bound, you're not going to get any extra benefit from running 10 of them over 5, 2, or maybe even 1.
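The third alternative (starting the threads yourself) could look roughly like the sketch below. It uses the standard threading module; do_important_stuff is the placeholder name from the question, and the worker count, the interval, the stop event, and the results list are assumptions made for illustration. On an App Engine backend, threads that must outlive the start request may need the platform's own background-thread API instead of plain threading.Thread, which is not shown here.

```python
import threading


def do_important_stuff(worker_id, results):
    # Placeholder for the real work from the question (e.g. a URL fetch).
    results.append(worker_id)


def worker(worker_id, stop_event, results, interval_seconds):
    # Repeat the work until asked to stop. Event.wait() acts as an
    # interruptible sleep, so a stop request does not wait out the interval.
    while not stop_event.is_set():
        do_important_stuff(worker_id, results)
        stop_event.wait(interval_seconds)


def start_workers(count, stop_event, results, interval_seconds=30.0):
    # Start one thread per worker and return the thread objects.
    threads = []
    for worker_id in range(count):
        thread = threading.Thread(
            target=worker,
            args=(worker_id, stop_event, results, interval_seconds),
        )
        thread.daemon = True
        thread.start()
        threads.append(thread)
    return threads
```

Because the workers spend most of their time waiting, ten of them fit comfortably in one process, and none of them occupies a request slot on the instance.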

Related

How to increase wait time for Google Cloud Tasks?

I am creating a lot of tasks to be processed in Cloud Tasks, but some of them are failing due to lack of available resources (instances). Please see the image below:
As you can see, the average time Google waits before throwing http 500 error is 10 seconds, and sometimes, less than 10ms is enough to throw http 500. This queue has auto-retry set, so, eventually all tasks are executed, but the error remains.
Is there a way to increase this wait time? I don't mind waiting 5 minutes for a task to be processed; I just want to minimize the number of errors like this in my logging panel.
I asked a similar question not too long ago and didn't get any helpful answers. I don't know if you can increase the rate of creating tasks.
A few things to try:
Batch tasks together -- Instead of creating 10 tasks can you create one task that does the work of all 10?
Serial tasks -- Create a first task that does work and then creates a second task. The second task does work and then creates a third task, etc.
Pull queues might allow for a higher task creation rate. Not sure about that.
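The batching suggestion can be sketched as follows. The helper names, the batch size of 10, and the 25 sample work items are assumptions made for illustration; each resulting batch would become the payload of a single task instead of one task per item.

```python
def make_batches(items, batch_size):
    """Split items into lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def handle_batch(batch, process_item):
    # A single task handler call processes every item in its batch.
    return [process_item(item) for item in batch]


work_items = list(range(25))
# 25 work items become 3 task payloads instead of 25 separate tasks.
batches = make_batches(work_items, 10)
```

Fewer, larger tasks mean fewer dispatches competing for instances, at the cost of retrying a whole batch when one item fails.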

Why is Google Cloud Tasks so slow?

I use Google Cloud Tasks with AppEngine to process tasks, but the tasks wait about 2-3 minutes in the queue before being sent to my App Engine endpoint.
There is no "delay" set on the tasks, and I expect them to be sent right away.
So the question is: Is Cloud Tasks slow?
As you can see in the following screenshot, Cloud Tasks gives an ETA of about 3 minutes:
The official word from Google is that this is the best you can expect from their task queues.
In my experience, how you configure tasks seems to influence how quickly they get executed.
It seems that:
If you don't change the default behavior of your task queues (e.g., maximum concurrent, etc.) and if you don't specify an execution time of a task (e.g., eta) then your tasks will execute very soon after submission.
If you mess with either of these two things, then Google takes longer to execute your tasks. My guess is that it is the extra overhead of controlling task rate and execution.
I see from your screenshot that you have a task with an ETA of 2 min 49 sec which is the time until your task will be run. You have high bucket size and concurrency numbers, so I think your issue has more to do with the parameters you are using when queueing your tasks, especially the scheduled_time attribute. Check your code to see if you are adding a delay to your tasks, and make sure to tune it down.
Just adding here, that as of February 2023, I can queue tasks and then consume them VERY fast using the Python 3.7 libraries.
Takes me about 13.5 seconds to queue up 1000 tasks.
Takes about 1 minute to process those 1000 tasks using a Cloud Run deployed python/flask app. (No other processing done, just receive and reply with 200).
So, super fast!
BTW, pubsub was much slower in my tests... about 40ms per message to queue a message.
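Converting those figures to per-task numbers makes them easier to compare with the 40 ms per message quoted for Pub/Sub. The small calculation below uses only the numbers given above and treats "about 1 minute" as exactly 60 seconds, which is an assumption.

```python
def per_item_ms(total_seconds, count):
    # Average time spent per item, in milliseconds.
    return total_seconds * 1000.0 / count


def items_per_second(total_seconds, count):
    # Average throughput, in items per second.
    return count / total_seconds


enqueue_ms = per_item_ms(13.5, 1000)          # 13.5 ms per task
enqueue_rate = items_per_second(13.5, 1000)   # about 74 tasks per second
process_rate = items_per_second(60.0, 1000)   # about 16.7 tasks per second
```

So enqueueing cost roughly a third of the per-message time measured for Pub/Sub in the same tests.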

Is lease_tasks() in gae pull queues a blocking method?

I have a pull-queue in Google App Engine and a resident backend which processes the tasks in the pull queue. The backend has several worker threads for consuming and processing tasks, as suggested in a post in Google Cloud Platform blog
https://cloud.google.com/resources/articles/ios-push-notifications
Workers poll the pull-queue with lease_tasks(). My question is: is lease_tasks() supposed to be a blocking method, i.e. block the current thread's execution until either the queue has some tasks or a deadline is exceeded?
According to the GAE documentation
https://developers.google.com/appengine/docs/python/taskqueue/overview-pull#Python_Leasing_tasks
lease_tasks() accepts a 'deadline' parameter and may raise DeadlineExceededError, so isn't it reasonable to assume that lease_tasks() blocks for up to 'deadline' seconds?
The problem is that while I'm developing the application on the development server, lease_tasks() returns immediately with an empty list of tasks. As a result, the worker thread's while-loop calls lease_tasks() constantly, consuming 100% of the CPU. If I add an explicit sleep(), say for 5 seconds, the worker goes to sleep and won't wake up if a task is placed in the queue in the meantime. That makes the worker less responsive (in the worst case it might take about 5 seconds to handle the next task), and I would consume more CPU (wake-up/sleep cycles) than if the thread simply blocked on a 'queue' (I know the pull queue is actually an RPC, yet abstractly it remains a producer queue).
Perhaps this happens only with the dev app server, while on GAE lease_tasks() blocks. However, the example code from the blog post mentioned above also suspends thread execution with sleep(). The example code is available on GitHub, and a link is in the blog post (unfortunately I cannot post it here).
lease_tasks does not wait for new tasks to be added. Most task queue calls take up to 5 seconds. The calls to lease tasks and to fetch queue statistics take longer - up to 10 seconds by default.
Most users won't need to set the deadline, it is an optional parameter. If you have very many workers contending on the same queue and often experience transient errors after 10 seconds, consider increasing the lease deadline to 20 seconds (or shard the load over more queues and/or tags). Alternately, if you only have one worker and it always needs time to perform other work in addition to leasing tasks, a small deadline like 5 seconds could be used, but it is much better to use the async API instead.
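Since lease_tasks() returns immediately when the queue is empty, one common way to avoid the busy loop described in the question is to sleep between empty polls and lengthen the sleep each time. This backoff scheme is an illustration, not something the answer above prescribes. In the sketch below, lease is a hypothetical zero-argument callable standing in for a call such as queue.lease_tasks(lease_seconds, max_tasks), handle is a hypothetical per-task handler, and the sleep bounds are assumed values.

```python
import time


def poll_with_backoff(lease, handle, min_sleep=0.5, max_sleep=8.0,
                      max_polls=None, sleep=time.sleep):
    """Call lease() repeatedly, sleeping longer after each empty result."""
    delay = min_sleep
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        tasks = lease()
        if tasks:
            for task in tasks:
                handle(task)
            delay = min_sleep   # work arrived: go back to eager polling
        else:
            sleep(delay)        # queue empty: back off instead of spinning
            delay = min(delay * 2, max_sleep)
```

With these bounds an idle worker polls at most once every 8 seconds, while a busy queue is drained without any sleeping; the trade-off is the added latency for the first task after an idle period.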

Java App engine backend shuts down abruptly, how to resume work?

I have Cron job which runs every 30mins and queues a task to be executed on a Dynamic Backend (B2).
The backend loops and does some work, then sleeps for a few minutes, and repeats until the complete job is finished after a few hours, after which the backend shuts down. (While the backend is running, no new task is actioned.)
Now two days in a row, I have seen my Backend stop abruptly (after 1.5hrs) with the familiar "Process terminated because the backend took too long to shutdown.". I have searched through the forums but could not identify WHY exactly my backend shuts down (apart from the theoretical list of reasons that Appengine doc provides). I have checked my DS/Memcache operations, Memory and all looks normal. I upgraded my backend from B1 to B2, but no luck.
Q1. Does anybody know how to debug this issue further?
Q2. Even after this I wish that the job should be completed. If I register a shutdown hook LifecycleManager.getInstance().setShutdownHook(), what is a good way to ensure that the job is resumed (considering that the Cron job could be still 29minutes away from next execution, and I want the job to do its stuff every 2 minutes)
Yes, the same has happened to me. I have a backend that uses constant memory and CPU. App Engine shuts it down periodically, usually after 15 minutes but sometimes sooner. The docs say that a backend may be shut down without explanation; App Engine notifies the backend and then shuts it down.
You are supposed to handle this gracefully, which means the backend should work in chunks and be able to restart its work. If you can't divide the work into chunks, don't use backends; use a Compute Engine instance.
For your first question, you'd have to take a closer look at the logs. App Engine does promise to signal a shutdown through a request to /_ah/stop, so that should give more insight into the issue.
As for your second question, stick with App Engine's suggestion of having more than one instance. In your case you could move away from looping over some entity indefinitely and sleeping in between. Instead, have a cron job that looks up a task queue and processes a single task. If the task is processed successfully, mark it as done somewhere, or remove it from the queue once you're done processing it. That way, if something fails, the task remains available for processing until it is marked successful, and your additional instances can take over.
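The "work in chunks and restart" advice can be sketched as a job that records a cursor after each item, so a later invocation (after a shutdown, or from the next cron run) resumes where the previous one stopped. The sketch is in Python for brevity even though the question concerns a Java backend. The checkpoint dict is a stand-in for a persistent store such as the datastore, and the function name, the key, and the chunk size are assumptions.

```python
def run_chunk(items, checkpoint, process_item, chunk_size, key="cursor"):
    """Process up to chunk_size items, starting at the saved cursor.

    Returns True once every item has been processed.
    """
    start = checkpoint.get(key, 0)
    end = min(start + chunk_size, len(items))
    for index in range(start, end):
        process_item(items[index])
        # Record progress after each item so an abrupt stop loses
        # at most the item that was in flight.
        checkpoint[key] = index + 1
    return checkpoint.get(key, 0) >= len(items)
```

Each call does a bounded amount of work, so it can run as a short task rather than as a loop that must survive for hours. The item in flight at shutdown may be processed twice, so process_item should be safe to repeat.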

GAE Queue Statistics numbers wrong on development console

I'm seeing very strange behavior in some code that checks the QueueStatistics for a queue to see if any tasks are currently running. To the best of my knowledge there are NO tasks running, and none have been queued up for the past 12+ hours. The development console corroborates this, saying that there are 0 tasks in the queue.
Looking at the QueueStatistics information in my debugger, though, I can see that my process is exiting because it sees on the order of 500+ (!!!) tasks in the queue. It also reports that it ran more than 1000 tasks in the past minute, yet 0 tasks in the past hour. If I parse the ETA Usec value, the time shows as "accurate", that is, the ETA falls within the minute after the QueueStatistics were pulled.
This is happening repeatedly whenever I re-run my servlet, and the first thing the servlet does is check the queue statistics. No other servlets, tasks, or cron jobs are running as this is my local development server. Yet the queue statistics continue to insist I've got hundreds of tasks running.
I couldn't find any other reports of this behavior, but it feels like I must be missing something major here in regards to Queue Statistics. The code I'm using is very simple:
Queue taskQueue = QueueFactory.getQueue("myQueue");
QueueStatistics stats = taskQueue.fetchStatistics();
if (stats.getNumTasks() > 0) { return; }
What am I missing? Are queue statistics entirely unreliable on the local dev server?
If it works as expected when deployed then that's the standard to go by.
Lots of things don't work locally as they do in the deployed environment (parallel threads are not parallel, and backend addressing is somewhat broken at the time of writing), so deploy, deploy, deploy!
Another example is the Channel API. When used locally it falls back to polling, and you'll see hundreds of polling requests if you look in the logs or the browser debugger. But when deployed, all is well and it works as expected.
