GAE deferred task retried due to "instance unavailable" despite having already succeeded

In our GAE application, we occasionally see TombstonedTaskError in deferred tasks, caused by named tasks being submitted more than once with identical names. GAE appears to occasionally resubmit these tasks automatically, even though the first execution of the deferred task succeeded.
An example can be seen in this log screenshot: the task "refresh_stock_status-1451012400-GNeg-completion-poll-2" was submitted on the morning of Dec 25th and completed at 12-25 07:35:05. The task was then, for some reason, retried automatically by GAE ("X-Appengine-Taskretrycount:1") at 07:35:18, with the stated reason being "X-Appengine-Taskretryreason:Instance Unavailable". Clearly an instance (with an id ending in "...2976") had been available, and the task had already completed. Why was it retried like this, and what can be done to prevent it? While rare, these errors are misleading, and checking them costs time when monitoring our application.
The only similar situation I've found described online is https://groups.google.com/forum/#!topic/google-appengine/0JWCp3OGnMI , but no solution was offered there beyond fiddling with the instance scaling configuration (which might reduce the error incidence); in our case that would have too many side effects.

After being in touch with Google support, it emerged that, very rarely, deferred tasks can be executed multiple times by GAE. This means tasks have to be implemented so that they can safely be executed multiple times without errors, which was not the case for us (we had assumed each task would only ever execute once).
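In practice this means making each task idempotent, for example by transactionally recording a completion marker keyed on the task name, so that a duplicate execution becomes a no-op. A minimal sketch of the idea in Python (the CompletedTask model and run_idempotently helper are hypothetical, just for illustration):
from google.appengine.ext import ndb
class CompletedTask(ndb.Model):
    # Marker entity: its mere existence records that the task ran.
    pass
@ndb.transactional(xg=True)
def run_idempotently(task_name, work_fn):
    # If a marker for this task name exists, a retried execution
    # skips the work instead of failing.
    marker_key = ndb.Key(CompletedTask, task_name)
    if marker_key.get() is not None:
        return
    work_fn()
    CompletedTask(key=marker_key).put()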

Related

Re-distributing messages from busy subscribers

We have the following set up in our project: Two applications are communicating via a GCP Pub/Sub message queue. The first application produces messages that trigger executions (jobs) in the second (i.e. the first is the controller, and the second is the worker). However, the execution time of these jobs can vary drastically. For example, one could take up to 6 hours, and another could finish in less than a minute. Currently, the worker picks up the messages, starts a job for each one, and acknowledges the messages after their jobs are done (which could be after several hours).
Now getting to the problem: The worker application runs on multiple instances, but sometimes we see very uneven message distribution across the different instances. Consider the following graph, for example:
It shows the number of messages processed by each worker instance at any given time. You can see that some are hitting the maximum of 15 (configured via the spring.cloud.gcp.pubsub.subscriber.executor-threads property) while others are idling at 1 or 2. At this point, we also start seeing messages without any started jobs (awaiting execution). We assume that these were pulled by the GCP Pub/Sub client in the busy instances but cannot yet be processed due to a lack of executor threads. The threads are busy because they're processing heavier and more time-consuming jobs.
Finally, the question: Is there any way to do backpressure (i.e. tell GCP Pub/Sub that an instance is busy and have it re-distribute the messages to a different one)? I looked into this article, but as far as I understood, the setMaxOutstandingElementCount method wouldn't help us because it would control how many messages the instance stores in its memory. They would, however, still be "assigned" to this instance/subscriber and would probably not get re-distributed to a different one. Is that correct, or did I misunderstand?
We want to utilize the worker instances optimally and have messages processed as quickly as possible. In theory, we could try to split up the more expensive jobs into several different messages, thus minimizing the processing-time differences, but is this the only option?
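The question is about Spring Cloud GCP, but for illustration, here is how the corresponding flow-control cap looks in the plain Python Pub/Sub client (a sketch; the project, subscription, and run_job names are made up). The cap limits how many messages this client leases at once; messages it never leases stay on the subscription and can be delivered to a less busy instance, while messages already pulled are only redelivered elsewhere after their ack deadline lapses:
from google.cloud import pubsub_v1
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'jobs-sub')
def callback(message):
    run_job(message)  # hypothetical long-running job
    message.ack()     # ack only once the job has finished
# Lease at most 15 messages at a time, mirroring executor-threads.
flow_control = pubsub_v1.types.FlowControl(max_messages=15)
subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)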

How to avoid Google App Engine push queue increased error rates when using named tasks?

While reading the documentation for Google App Engine push queues in the Java 8 standard environment, I came across the following information regarding named tasks:
Note that de-duplication logic introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks.
I would like to utilize the de-duplication logic in a production environment; however, I am concerned about the potentially increased error rates. What is the cause of the increased error rates when using named tasks, and how can I effectively avoid these issues? Also, when naming the tasks I would use a user's random 32-character UID as a prefix, so the names would not be sequential.
Data de-duplication is a technique for eliminating duplicate copies of repeating data. It increases the work Cloud Tasks needs to do in order to detect possible duplicates and dispatch your request only once.
If you accidentally add the same task multiple times, the request will still only be dispatched once, thanks to de-duplication.
In conclusion, when you use named tasks in Cloud Tasks, you risk higher latencies, which in turn can cause more errors, such as timeouts.
To avoid these errors, also bear in mind the de-duplication window when deleting tasks, as stated here:
The time period during which adding a task with the same name as a recently deleted task will cause the service to reject it with an error. This is the length of time that task de-duplication remains in effect after a task is deleted.
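A common way to keep the de-duplication benefit while tolerating these errors is to catch them explicitly at enqueue time. A sketch using the Python task queue API (the handler URL and naming scheme are illustrative):
from google.appengine.api import taskqueue
def enqueue_once(user_uid, job_id):
    # Non-sequential name thanks to the random UID prefix.
    name = '%s-%s' % (user_uid, job_id)
    try:
        taskqueue.add(url='/worker/process', name=name, params={'job': job_id})
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        # Already enqueued, or recently completed/deleted within the
        # de-duplication window; treat as success rather than an error.
        pass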

Backend "Process moved to a different machine" and fails withh error 500

I have a process that takes around five minutes to complete. It runs on a cron job every two hours in a backend instance.
Recently the process has started to fail, not every time but a few times a day. The first thing that happens is that memcache starts to throw exceptions:
04:21:13.640 com.google.appengine.api.memcache.LogAndContinueErrorHandler handleServiceError: Service error in memcache
com.google.appengine.api.memcache.MemcacheServiceException: Memcache get: exception getting 1 key (ItemFollowableCompleted:RegionUS:P8XD:0)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$RpcResponseHandler.handleApiProxyException(MemcacheServiceApiHelper.java:68)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$1.absorbParentException(MemcacheServiceApiHelper.java:109)
None of these are fatal exceptions, but a few seconds later the process is terminated without warning or a shutdown message. The logs show
04:21:30.591 Process moved to a different machine.
and an error 500.
Is this a google infrastructure problem related to memcache or is there something in the app code that could be causing it?
No, it's not an error in Google infrastructure. Your process is expected to be moved among instances when needed (maintenance, more demand on your side, ...), and there's nothing you can do to prevent it.
Nonetheless, there are a few things you could do to alleviate any effect this has on your app.
See [1] for suggestions on how to keep track of your pending jobs when your instance is shut down, and also have a look at background threads.
I'm guessing you're using Python; if not, look up the equivalent for your language.
[1] https://developers.google.com/appengine/docs/python/backends/#Python_Backend_states
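For the Python runtime referenced in [1], the hook looks roughly like this (a sketch; save_progress is a hypothetical function that persists your job state):
from google.appengine.api import runtime
def shutdown_hook():
    # Invoked when this backend instance is about to be stopped;
    # persist progress so another instance can resume the job.
    save_progress()
runtime.set_shutdown_hook(shutdown_hook)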
I had the same problem when using ndb.put_multi() to load data. I tried a few things:
1. Increased my backend machine size; I moved to B4_1G
2. Slept between ndb.put_multi() calls (2 minutes for every 200 entities)
3. Dedicated memcache (1G)
1 and 2 were not very helpful; 3 seems to help.
I think rapid datastore updates hitting memcache (ndb uses it as a cache layer) are the root cause in my case. I could not find any other way besides paying for dedicated memcache.
I also ran into the "Process moved to a different machine" issue in a backend module.
The issue context is as follows:
1. Get the query result from one KIND
2. Iterate over each entity in the result, do some work, and write new entities to different KINDs
3. "Process moved to a different machine" happens partway through the iteration
After some experiments, I found it was due to too many write transactions in one request. Everything is fine when the query result is small, but problems appear once it grows larger.
The final solution I took was to use the Task Queue: the work to be done for each entity is treated as one task and put into a push queue. With that, the issue is gone.
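The pattern looks roughly like this (a sketch; the queue name and handler URL are made up):
from google.appengine.api import taskqueue
def fan_out(query):
    # One small task per entity instead of all writes in one long request.
    for key in query.iter(keys_only=True):
        taskqueue.add(
            queue_name='entity-work',
            url='/worker/process_one',
            params={'key': key.urlsafe()})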
Hope this will help :)

App Engine Tasks with ETA are fired much later than scheduled

I am using Google App Engine task push queues to schedule future tasks that I'd like to occur within a second of their scheduled time.
Typically I would schedule a task 30 seconds from now, that would trigger a change of state in my system, and finally schedule another future task.
Everything works fine on my local development server.
However, now that I have deployed to the GAE servers, I notice that the scheduled tasks run late. I've seen them running even two minutes after they have been scheduled.
From the task queues admin console, it actually says for the ETA:
ETA: "2013/11/02 22:25:14 0:01:38 ago"
Creation Time: "2013/11/02 22:24:44 0:02:08 ago"
Why would this be?
I could not find any documentation about the expectation and precision of tasks scheduled by ETA.
I'm programming in Python, but I doubt this makes any difference.
In the python code, the eta parameter is documented as follows:
eta: A datetime.datetime specifying the absolute time at which the task
should be executed. Must not be specified if 'countdown' is specified.
This may be timezone-aware or timezone-naive. If None, defaults to now.
My queue settings:
queue:
- name: mgmt
  rate: 30/s
The system is under no load whatsoever, except for 5 tasks that should run every 30 seconds or so.
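For context, the scheduling pattern is essentially this (a sketch; the /trigger handler is hypothetical):
from google.appengine.api import taskqueue
def schedule_next_trigger():
    # Ask for execution roughly 30 seconds from now on the mgmt queue;
    # the handler at /trigger then schedules the next task itself.
    taskqueue.add(queue_name='mgmt', url='/trigger', countdown=30)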
UPDATE:
I have found https://code.google.com/p/googleappengine/issues/detail?id=4901 , an accepted feature request for timely queues, although nothing seems to have been done about it. It acknowledges that tasks with an ETA can run late, even by many minutes.
What other alternative mechanisms could I use to schedule a trigger with second-precision?
GAE makes no guarantees about clock synchronization within and across their data centers; see "UTC Time on Google App engine?" for a related discussion. So you can't even specify the absolute time accurately, even if they made the (different) guarantee that tasks are executed within some tolerance of the target time.
If you really need this kind of precision, you could consider setting up a persistent GAE "backend" instance that synchronizes itself with a trusted external clock, and provides task queuing and execution services.
(Aside: Unfortunately, that approach introduces a single point of failure, so to fix that you could just take the next steps and build a whole cluster of these backends... But at that point you may as well look elsewhere than GAE, since you're moving away from the GAE "automatic transmission" model, toward AWS's "manual transmission" model.)
I reported the issue to the GAE team and I got the following response:
This appears to be an isolation issue. Short version: a high-traffic user is sharing underlying resources and crowding you out.
Not a very satisfying response, I know. I've corrected this instance, but these things tend to revert over time.
We have a project in the pipeline that will correct the underlying issue. Deployment is expected in January or February of 2014.
See https://code.google.com/p/googleappengine/issues/detail?id=10228
See also thread: https://code.google.com/p/googleappengine/issues/detail?id=4901
After they "corrected this instance" I did some testing for a few hours. The situation improved a little especially for tasks without ETA. But for tasks with ETA I still see at least half of them running at least 10 seconds late. This is far from reliable for my requirements
For now I decided to use my own scheduling service on a different host, until the GAE team "correct the underlying issue" and have a more predictable task scheduling system.

Java App engine backend shuts down abruptly, how to resume work?

I have a cron job which runs every 30 minutes and queues a task to be executed on a dynamic backend (B2).
The backend loops and does some work, then sleeps for a few minutes, and repeats until the complete job is done after a few hours, after which the backend shuts down. (While the backend is running, no new task is actioned.)
Two days in a row now, I have seen my backend stop abruptly (after 1.5 hrs) with the familiar "Process terminated because the backend took too long to shutdown.". I have searched through the forums but could not identify WHY exactly my backend shuts down (apart from the theoretical list of reasons the App Engine docs provide). I have checked my DS/memcache operations and memory, and all looks normal. I upgraded my backend from B1 to B2, but no luck.
Q1. Does anybody know how to debug this issue further?
Q2. Even after this, I want the job to be completed. If I register a shutdown hook with LifecycleManager.getInstance().setShutdownHook(), what is a good way to ensure that the job is resumed (considering that the cron job could still be 29 minutes away from its next run, and I want the job to do its work every 2 minutes)?
Yes, the same has happened to me. I have a backend that uses constant memory and CPU. App Engine shuts it down periodically, usually after 15 minutes but sometimes sooner. The docs say it may get shut down without explanation; it will notify the backend and then shut it down.
You are supposed to handle this gracefully, which means the work must be divided into chunks that can be restarted. If you can't divide the work into chunks, don't use backends; use a Compute Engine instance instead.
For your first question, you'd have to take a closer look at the logs. App Engine does promise to signal shutdown through a request to /_ah/stop, so that should give more insight into the issue.
Now for your second question: stick with App Engine's suggestion of having more than one instance. In your case, you could move away from looping over entities indefinitely and sleeping in between. Instead, have a cron that looks at a task queue and processes a single task. If it is processed successfully, mark it as done somewhere, or remove it from the queue after you're done processing it. That way, in case of failure the task is still available to be processed unless it is marked successful, and your additional instances can take over.
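A sketch of that suggestion, assuming a hypothetical Job model that tracks its own completion state (do_work stands in for your actual processing):
from google.appengine.ext import ndb
class Job(ndb.Model):
    # One entity per unit of work.
    done = ndb.BooleanProperty(default=False)
def process_next_job():
    # Cron handler body: claim one unfinished job, do the work, and
    # mark it done only afterwards, so a mid-run shutdown leaves the
    # job available for the next cron tick or another instance.
    job = Job.query(Job.done == False).get()
    if job is None:
        return
    do_work(job)
    job.done = True
    job.put()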
