Google App Engine (GAE) Task Queue Failure and Recovery Time Window - google-app-engine

We're interested in using a push queue in GAE but one thing I can't find is the window for recovery in the event of queue or appengine downtime.
For example, I have a push queue with a number of tasks on it. Some of these tasks get pulled off and are executing. Let's say now the queue goes down (for whatever reason), while these tasks are executing, and then comes back up. What is the time window for the restoration of the queue? Is there a set time window of recovery?
There's the possibility of these tasks that were pulled off the queue and executing now reappearing on the queue and having them execute again due to the restoration time window.
We've got idempotence considerations in our code, but it would be good to know if there are time window recovery strategies for the GAE Queue downtimes.

If I understand your question correctly, you're worried that queues can go down in the sense that a knowledge of execution completion is lost for a particular eta range, and those tasks have to be re-executed.
This is not the way things work in the GAE task queue system. We track execution on a task by task basis. (We have to because the tasks need not be dispatched in strict eta order.) The queue doesn't "go down" in the sense you're referring to.
An individual task might execute successfully twice under the current system. When this happens (and it is very rare), there should be at least a minute between successive executions.
There is no time-window recover strategy you need to consider.

Related

Keeping task queues in chronological order in appengine

In google app engine, as you add tasks to the same push task queue, will they all be queued up one after another chronologically? Or is it possible that a task could be executed before another one although it was added last? (this is all assuming they are using the same queue).
Not necessarily. I can think of 2 cases where that might not happen:
tasks can have different ETAs (in the future, for example), the order would normally be the ETA one, see do google app engine pull queues return tasks in FIFO order?
task execution may fail (for whatever reason) and they may be automatically re-tried with a backoff scheme (i.e. after some delay). Which means other tasks which normally would run after the failed one may actually run before its retry attempt(s).
Task Queues make no guarantees about execution order. In particular, tasks that are scheduled to run immediately follow a code path that can result in significant re-ordering. Behavior for push and pull queues is also distinctly different.
If you schedule tasks to run after a short time in the future, however, execution order is more likely to be eta order. Again, there are no guarantees, and you should engineer around out-of-order delivery being a normal, albeit uncommon, case. Failure modes would typically be a significant number of out-of-order tasks for a brief period, not an occasionally isolated out-of-order task.

How to kill mapreduce running job on google app engine

I started map reduce job on google app engine but I could not stop it. It is just running and could not find a way to kill it. By looking into already asked questions, I deleted all tasks from default queue and even paused queue but still mapper call-back function is running and eating all my free quota every day in a hour. GAE is just making me frustrating...
Are you sure the MR job is running on the default task queue (it is configurable)? MR uses the task queue to invoke itself. If there are no tasks in the queue (and nothing is currently executing) the job can't proceed. A task is rescheduling itself, so if there was a running task while you were clearing the queue it is possible that it may schedule itself after you cleared. Pausing the queue, waiting a bit (to make sure nothing is running) and clearing it should do the job.

GAE w/ Java, Scheduling User Notifications

I'm creating an app on GAE with Java, and looking for advice on how to handle scheduling user notifications (which will be email, text, push, whatever). There are a couple ways notifications will generated: when a producer creates content, and on a consumer's schedule. The later is the tricky part, because a consumer can change its schedule at any time. Here are the options I have considered and my concerns so far:
Keep an entry in the datastore for each consumer, indexed by the time until the next notification. My concern is over the lag for an eventually-consistent index. The longest lag I've seen reported is about 4 hours, which would be unacceptable for this use-case. A user should not delay their schedule by a week, then 4 hours later receive a notification from the old schedule.
The same as above, but with each entry sharing a common parent so that I can use an ancestor query to eliminate its eventual-ness. My concern is that there could be enough consumers to cause a problem with contention. In my wildest dreams I could foresee something like 10,000 schedule changes per minute at peak usage.
Schedule a task for each consumer. When changing the schedule, it could delete the old task and create a new one at the new time. My concern has to do with the interaction of tasks and datastore transactions, since the schedule will be stored in the datastore. The documentation notes that enqueing a task plays nicely with transactions, but what about deleting one? I would not want a task to be deleted only to have the add fail as part of its transaction.
Edit: I experimented with deleting tasks (for option 3), and unfortunately a delete that is part of a failed transaction still succeeds. That is a disappointing asymmetry. I will probably end up going that route anyway, but adding some extra logic and datastore flags to ensure rogue tasks that didn't get deleted properly simply do nothing when they execute.
Eventual consistency in the Datastore typically measures in seconds. As Google states:
the time delay is typically small, but may be longer (even minutes or
more in exceptional circumstances).
Save a time of next notification for each user. Run a cron job periodically (e.g. once per hour), and send notifications to all users who have to be notified at this time (i.e. now >= next notification).
Create a task for each user when a user's schedule is created with the countdown value. When a task executes, it creates the next task for this user.
The first approach is probably more efficient, especially if you choose a large enough window for your cron job.
As for transactions, I don't see why you need them. You can design your system that in the very rare fail situation a user will receive two notifications instead of one (old schedule and new schedule). This is not such a bad thing that you need to design around it.

Can I block on a Google AppEngine Pull Task Queue until a Task is available?

Can I block on a Google AppEngine Pull Task Queue until a Task is available? Or, do I need to poll an empty queue until a task is available?
You need to poll the queue. A typical use case for pull queues is to have multiple backends, each obtaining one thousand tasks at a time.
For use cases where there are no tasks in the queue for hours at a time, push queues can be a better fit.
Not 100% sure about your question, but thought to try an answer. Having a pull task queue started by a cron may apply. Saves the expense of running a backend. I have client-side log data that needs to be serialized and stored. On-line handler simply passes the client data to a task pull queue. Cron fires up the task every minute, and up to 10k log items get serialized and stored each run. (Change settings according to your loads -- these more than meet my modest needs.) In this case, the queue acts as a buffer, and load spikes get spread across even processing units. Obviously not useful if you want quick access to the TQ data or have wildly unpredictable loads. Very importantly the log data serialization cuts data writes by a factor of 1,000. May not apply to your question, so I'll end with a big HTH. -stevep

How to create X tasks as fast as possible on Google App Engine

We push out alerts from GAE, and let's say we need to push out 50 000 alerts to CD2M (Cloud 2 Device Messaging). For this we:
Read all who wants alerts from the datastore
Loop through and create a "push task" for each notification
The problem is that the creation of the task takes some time so this doesn't scale when the user base grows. In my experience we are getting 20-30 seconds just creating the tasks when there is a lot of them. The reason for one task pr. push message is so that we can retry the task if something fails and it will only affect a single subscriber. Also C2DM only supports sending to one user at a time.
Will it be faster if we:
Read all who wants alerts from the datastore
Loop through and create a "pool task" for each 100 subscribers
Each "Pool task" will generate 100 "push tasks" when they execute
The task execution is very fast so in our scenario it seems like the creation of the tasks is the bottleneck and not the execution of the tasks. That's why I thought about this scenario to be able to increase the parallelism of the application. I would guess this would lead to faster execution but then again I may be all wrong :-)
We do something similar with APNS (Apple Push Notification Server): we create a task for a batch of notifications at a time (= pool task as you call it). When task executes, we iterate over a batch and send it to push server.
The difference with your setup is that we have a separate server for communicating with push, as APNS only supports socket communication.
The only downside is if there is an error, then whole task will be repeated and some users might get two notifications.
This sounds like it varies based on the number of alerts you need to send out, how long it takes to send each alert, and the number of active instances you have running.
My guess is that it takes a few milliseconds to tens of milliseconds to send out a CD2M alert, while it takes a few seconds for an instance to spin up, so you can probably issue a few hundred or a few thousand alerts before justifying another task instance. The ratio of the amount of time it takes to send each CD2M message vs the time it takes to launch an instance will dictate how many messages you'd want to send per task.
If you already have a fair number of instances running though, you don't have the delay of waiting for instances to spin up.
BTW, this seems almost like a perfect application of the MapReduce API. It mostly does what you describe in the second version, except it takes your initial query, and breaks that up into subqueries that each return a "page" of the result set. A task is launched for each subquery which processes all the items in its "page". This is an improvement from what you describe, because you don't need to spend the time looping through your initial result set.
I believe the default implementation for the MapReduce API just queries for all entities of a particular kind (ie all User objects), but you can change the filter used.

Resources