Keeping task queues in chronological order in appengine - google-app-engine

In google app engine, as you add tasks to the same push task queue, will they all be queued up one after another chronologically? Or is it possible that a task could be executed before another one although it was added last? (this is all assuming they are using the same queue).

Not necessarily. I can think of two cases where that might not happen:
tasks can have different ETAs (set in the future, for example); the execution order is normally ETA order, see do google app engine pull queues return tasks in FIFO order?
task execution may fail (for whatever reason), in which case the task is automatically retried with a backoff scheme (i.e. after some delay). This means tasks that would normally run after the failed one may actually run before its retry attempt(s).
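Both cases above can be illustrated with a small, self-contained simulation (a toy model, not the actual task queue implementation): tasks are dispatched in ETA order, and a failed task is re-queued with a hypothetical backoff delay, letting a later task overtake it.

```python
import heapq

def dispatch_order(tasks, fails_once=()):
    """Simulate ETA-ordered dispatch where a failed task is retried
    after a backoff delay, allowing later tasks to overtake it.
    `tasks` is a list of (eta, name); names in `fails_once` fail on
    their first attempt and are re-queued at eta + BACKOFF."""
    BACKOFF = 10  # hypothetical retry delay, in seconds
    heap = [(eta, name, 0) for eta, name in tasks]
    heapq.heapify(heap)
    executed = []
    while heap:
        eta, name, attempt = heapq.heappop(heap)
        if name in fails_once and attempt == 0:
            # first attempt fails: retry later with backoff
            heapq.heappush(heap, (eta + BACKOFF, name, 1))
        else:
            executed.append(name)
    return executed

# "a" was enqueued first but fails once, so "b" runs before a's retry
print(dispatch_order([(0, "a"), (1, "b")], fails_once={"a"}))  # ['b', 'a']
```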

Task Queues make no guarantees about execution order. In particular, tasks that are scheduled to run immediately follow a code path that can result in significant re-ordering. Behavior for push and pull queues is also distinctly different.
If you schedule tasks to run a short time in the future, however, execution order is more likely to match ETA order. Again, there are no guarantees, and you should engineer around out-of-order delivery as a normal, albeit uncommon, case. Failure modes are typically a significant number of out-of-order tasks for a brief period, not an occasional isolated out-of-order task.
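One common way to engineer around out-of-order delivery is to make each handler reject stale updates: attach a sequence number (or timestamp) to each task's payload and only apply an update if it is newer than the last one applied. A minimal sketch, with illustrative names:

```python
def apply_update(state, entity, seq, value):
    """Apply an update only if it is newer than what we've already
    seen, so out-of-order task delivery cannot clobber fresh data.
    `state` maps entity -> (seq, value); in a real app this would be
    a datastore entity updated transactionally."""
    current = state.get(entity)
    if current is None or seq > current[0]:
        state[entity] = (seq, value)
        return True
    return False  # stale or duplicate delivery; ignore it

state = {}
apply_update(state, "user:1", 2, "new")
apply_update(state, "user:1", 1, "old")  # arrives late, ignored
print(state["user:1"])  # (2, 'new')
```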

Related

Re-distributing messages from busy subscribers

We have the following set up in our project: Two applications are communicating via a GCP Pub/Sub message queue. The first application produces messages that trigger executions (jobs) in the second (i.e. the first is the controller, and the second is the worker). However, the execution time of these jobs can vary drastically. For example, one could take up to 6 hours, and another could finish in less than a minute. Currently, the worker picks up the messages, starts a job for each one, and acknowledges the messages after their jobs are done (which could be after several hours).
Now getting to the problem: The worker application runs on multiple instances, but sometimes we see very uneven message distribution across the different instances. Consider the following graph, for example:
It shows the number of messages processed by each worker instance at any given time. You can see that some are hitting the maximum of 15 (configured via the spring.cloud.gcp.pubsub.subscriber.executor-threads property) while others are idling at 1 or 2. At this point, we also start seeing messages without any started jobs (awaiting execution). We assume that these were pulled by the GCP Pub/Sub client in the busy instances but cannot yet be processed due to a lack of executor threads. The threads are busy because they're processing heavier and more time-consuming jobs.
Finally, the question: Is there any way to do backpressure (i.e. tell GCP Pub/Sub that an instance is busy and have it re-distribute the messages to a different one)? I looked into this article, but as far as I understood, the setMaxOutstandingElementCount method wouldn't help us because it would control how many messages the instance stores in its memory. They would, however, still be "assigned" to this instance/subscriber and would probably not get re-distributed to a different one. Is that correct, or did I misunderstand?
We want to utilize the worker instances optimally and have messages processed as quickly as possible. In theory, we could try to split up the more expensive jobs into several different messages, thus minimizing the processing time differences, but is this the only option?
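The asker's reading of setMaxOutstandingElementCount matches the following toy model (pure Python, not the real Java Pub/Sub client): flow control caps how many unacked messages a client holds, but the messages it already holds stay assigned to it rather than being handed to an idle instance until the ack deadline lapses.

```python
class BoundedSubscriber:
    """Toy model of per-client flow control, loosely analogous to
    setMaxOutstandingElementCount in the Java Pub/Sub client: the
    client stops pulling once it holds `max_outstanding` unacked
    messages, but held messages remain assigned to this client and
    are not redistributed to idle instances."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self.outstanding = []

    def maybe_pull(self, backlog):
        # pull from the shared backlog only while under the cap
        while backlog and len(self.outstanding) < self.max_outstanding:
            self.outstanding.append(backlog.pop(0))

    def ack(self, msg):
        self.outstanding.remove(msg)

backlog = list(range(20))
busy = BoundedSubscriber(max_outstanding=15)
busy.maybe_pull(backlog)
print(len(busy.outstanding), len(backlog))  # 15 5
```

So a lower cap limits how much a busy instance hoards, which indirectly leaves more messages on the server for other instances to pull, but it is not an explicit "re-assign to another worker" mechanism.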

How to avoid Google App Engine push queue increased error rates when using named tasks?

While reading the documentation for Google App Engine push queues in the Java 8 standard environment, I came across the following information regarding named tasks:
Note that de-duplication logic introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks.
I would like to utilize the de-duplication logic in a production environment, however, I am concerned about the potentially increased error rates. What is the cause of the increased error rates using named tasks and how can I effectively avoid these issues? Also, when naming the tasks I would use the random 32 character UID of a user as a prefix, therefore the names would not be sequential.
Data de-duplication is a technique for eliminating duplicate copies of repeating data. It increases the work Cloud Tasks has to do in order to detect possible duplicates and dispatch your request only once.
If you accidentally add the same task to your queue multiple times, the request will still only be dispatched once, thanks to the de-duplication logic.
In conclusion, when you use named tasks in Cloud Tasks, you accept a risk of higher latencies, which can in turn cause more errors, such as timeout errors.
To avoid this kind of error, please also bear in mind the de-duplication window when deleting tasks, as stated here:
The time period during which adding a task with the same name as a recently deleted task will cause the service to reject it with an error. This is the length of time that task de-duplication remains in effect after a task is deleted.
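The de-duplication window can be modelled as a "tombstone" kept after a named task is deleted (a toy sketch; the real window length and enforcement live inside Cloud Tasks, and the one-hour value here is purely illustrative):

```python
import time

class NamedQueue:
    """Toy model of named-task de-duplication: a name is rejected
    while an identical task is pending, and also for a tombstone
    window after the task is deleted. The window length is an
    illustrative assumption, not the real Cloud Tasks value."""
    TOMBSTONE_SECONDS = 3600

    def __init__(self):
        self.pending = set()
        self.tombstones = {}  # name -> deletion timestamp

    def add(self, name, now=None):
        now = time.time() if now is None else now
        if name in self.pending:
            return False  # duplicate suppressed
        deleted_at = self.tombstones.get(name)
        if deleted_at is not None and now - deleted_at < self.TOMBSTONE_SECONDS:
            return False  # still inside the de-duplication window
        self.pending.add(name)
        return True

    def delete(self, name, now=None):
        now = time.time() if now is None else now
        self.pending.discard(name)
        self.tombstones[name] = now

q = NamedQueue()
q.add("uid1234-daily-digest", now=0)     # accepted
q.add("uid1234-daily-digest", now=10)    # rejected: already pending
q.delete("uid1234-daily-digest", now=20)
q.add("uid1234-daily-digest", now=100)   # rejected: inside the window
q.add("uid1234-daily-digest", now=4000)  # accepted: window elapsed
```

This is why re-adding a just-deleted named task fails with an error: the service must remember recently used names for the whole window, which is also where the extra latency of named tasks comes from.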

Google App Engine (GAE) Task Queue Failure and Recovery Time Window

We're interested in using a push queue in GAE but one thing I can't find is the window for recovery in the event of queue or appengine downtime.
For example, I have a push queue with a number of tasks on it. Some of these tasks get pulled off and are executing. Let's say now the queue goes down (for whatever reason), while these tasks are executing, and then comes back up. What is the time window for the restoration of the queue? Is there a set time window of recovery?
Our concern is that tasks which were pulled off the queue and were executing could reappear on the queue after restoration and be executed again.
We've got idempotence considerations in our code, but it would be good to know if there are time window recovery strategies for the GAE Queue downtimes.
If I understand your question correctly, you're worried that queues can go down in the sense that a knowledge of execution completion is lost for a particular eta range, and those tasks have to be re-executed.
This is not the way things work in the GAE task queue system. We track execution on a task-by-task basis. (We have to, because tasks need not be dispatched in strict ETA order.) The queue doesn't "go down" in the sense you're referring to.
An individual task might execute successfully twice under the current system. When this happens (and it is very rare), there should be at least a minute between successive executions.
There is no time-window recovery strategy you need to consider.
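Given the rare double-execution the answer mentions, the usual idempotence guard is to record completed task ids and skip duplicates. A minimal sketch, assuming an in-memory set for illustration (a real app would use a datastore or memcache entry keyed by task name):

```python
def handle_task(task_id, done, work):
    """Guard a task handler against rare duplicate execution by
    recording completed task ids. `done` is a plain set here for
    illustration; in production it would be durable storage keyed
    by the task name."""
    if task_id in done:
        return "skipped"  # duplicate delivery; work already applied
    work(task_id)
    done.add(task_id)
    return "executed"

done, log = set(), []
handle_task("task-42", done, log.append)  # executed
handle_task("task-42", done, log.append)  # skipped
print(log)  # ['task-42']
```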

Can I block on a Google AppEngine Pull Task Queue until a Task is available?

Can I block on a Google AppEngine Pull Task Queue until a Task is available? Or, do I need to poll an empty queue until a task is available?
You need to poll the queue. A typical use case for pull queues is to have multiple backends, each obtaining one thousand tasks at a time.
For use cases where there are no tasks in the queue for hours at a time, push queues can be a better fit.
Not 100% sure about your question, but I thought I'd try an answer. Having a pull task queue processed by a cron job may apply; it saves the expense of running a backend. I have client-side log data that needs to be serialized and stored. An online handler simply passes the client data to a pull task queue. Cron fires up the processing every minute, and up to 10k log items get serialized and stored on each run. (Change the settings according to your loads -- these more than meet my modest needs.) In this case, the queue acts as a buffer, and load spikes get spread evenly across processing runs. Obviously this is not useful if you want quick access to the task queue data or have wildly unpredictable loads. Very importantly, serializing the log data in batches cuts data writes by a factor of 1,000. May not apply to your question, so I'll end with a big HTH. -stevep
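The cron-plus-buffer pattern described above can be sketched in miniature (names and batch size are illustrative, and the list stands in for the real pull queue lease/delete cycle): each run leases a batch, serializes it into one record, and writes once, which is where the write reduction comes from.

```python
def drain(queue, batch_size, store):
    """Cron-style drain: lease up to batch_size items, serialize
    them into one record, and write once. Turning N queued items
    into a single write is what cuts write ops by roughly the
    batch size."""
    batch, queue[:] = queue[:batch_size], queue[batch_size:]
    if batch:
        store.append("|".join(batch))  # one combined write per batch
    return len(batch)

queue = [f"log{i}" for i in range(250)]
store = []
while drain(queue, 100, store):
    pass
print(len(store))  # 3 writes instead of 250
```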

How to create X tasks as fast as possible on Google App Engine

We push out alerts from GAE, and let's say we need to push out 50 000 alerts to C2DM (Cloud to Device Messaging). For this we:
Read all who wants alerts from the datastore
Loop through and create a "push task" for each notification
The problem is that creating the tasks takes some time, so this doesn't scale as the user base grows. In my experience, we see 20-30 seconds spent just creating the tasks when there are a lot of them. The reason for one task per push message is that we can retry the task if something fails and it will only affect a single subscriber. Also, C2DM only supports sending to one user at a time.
Will it be faster if we:
Read all who wants alerts from the datastore
Loop through and create a "pool task" for each 100 subscribers
Each "Pool task" will generate 100 "push tasks" when they execute
The task execution is very fast, so in our scenario it seems like the creation of the tasks is the bottleneck, not their execution. That's why I thought of this scheme, to increase the parallelism of the application. I would guess this would lead to faster execution, but then again I may be all wrong :-)
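The proposed fan-out can be sketched as a simple chunking step (a toy model; in a real handler each pool would become one enqueued "pool task" whose handler enqueues or sends its 100 push messages):

```python
def make_pool_tasks(subscribers, pool_size=100):
    """Fan-out sketch: instead of enqueueing one push task per
    subscriber from the front-end request, enqueue one 'pool task'
    per pool_size subscribers; each pool task then creates (or
    sends) its own batch in parallel with the others."""
    return [subscribers[i:i + pool_size]
            for i in range(0, len(subscribers), pool_size)]

pools = make_pool_tasks([f"user{i}" for i in range(50000)])
print(len(pools))  # 500 pool tasks instead of 50000 enqueue calls
```

The front-end request now does 500 enqueue calls instead of 50 000, and the 500 pool tasks run concurrently, which is the parallelism increase the question is after.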
We do something similar with APNS (Apple Push Notification Server): we create a task for a batch of notifications at a time (= pool task as you call it). When task executes, we iterate over a batch and send it to push server.
The difference with your setup is that we have a separate server for communicating with push, as APNS only supports socket communication.
The only downside is that if there is an error, the whole task will be repeated and some users might get two notifications.
This sounds like it varies based on the number of alerts you need to send out, how long it takes to send each alert, and the number of active instances you have running.
My guess is that it takes a few milliseconds to tens of milliseconds to send out a C2DM alert, while it takes a few seconds for an instance to spin up, so you can probably issue a few hundred or a few thousand alerts before justifying another task instance. The ratio of the amount of time it takes to send each C2DM message vs the time it takes to launch an instance will dictate how many messages you'd want to send per task.
If you already have a fair number of instances running though, you don't have the delay of waiting for instances to spin up.
BTW, this seems almost like a perfect application of the MapReduce API. It mostly does what you describe in the second version, except it takes your initial query, and breaks that up into subqueries that each return a "page" of the result set. A task is launched for each subquery which processes all the items in its "page". This is an improvement from what you describe, because you don't need to spend the time looping through your initial result set.
I believe the default implementation for the MapReduce API just queries for all entities of a particular kind (ie all User objects), but you can change the filter used.