AppEngine: scheduled tasks per Application or per Application Instance? - google-app-engine

From what I gather, App Engine fires up "Application Instances" (for lack of a better term) as a function of demand on the application.
Now, let's say I define Scheduled Tasks for my application: is it possible that those tasks end up being run by multiple Application Instances?
The reason I am asking: if my application uses the datastore as a sort of "Task Repository" and I use Scheduled Tasks to pull work items from it, is it possible that one Application Instance might get the same work items as another (assuming I add no additional state to account for this possibility)?

The contract for the Task Queue API is such that it is possible for tasks to be executed more than once - though such occurrences are rare, and they wouldn't result in the same task being executed multiple times simultaneously. If re-execution does occur, it's entirely possible that they'll be executed on different instances.
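Since a task may occasionally run more than once, the usual defense is to claim each work item transactionally and make the handler idempotent. Here is a minimal sketch of that idea; the in-memory dictionary and the `claim_item` function are hypothetical stand-ins for datastore entities and a datastore transaction, not App Engine APIs.

```python
import threading

# Hypothetical in-memory stand-in for a datastore "Task Repository".
# On App Engine you would do this inside a datastore transaction so
# that two executions cannot both succeed in claiming the same item.
_items = {"item-1": {"status": "pending"}}
_lock = threading.Lock()


def claim_item(item_id):
    """Atomically flip an item from 'pending' to 'claimed'.

    Returns True only for the first caller; a re-execution of the same
    task sees 'claimed' and backs off, so duplicate runs are harmless.
    """
    with _lock:  # stands in for transactional isolation
        item = _items.get(item_id)
        if item is None or item["status"] != "pending":
            return False
        item["status"] = "claimed"
        return True
```

With this pattern, even if two instances run the same scheduled task, only one of them successfully claims any given work item.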

Related

Re-distributing messages from busy subscribers

We have the following setup in our project: two applications communicate via a GCP Pub/Sub message queue. The first application produces messages that trigger executions (jobs) in the second (i.e. the first is the controller and the second is the worker). However, the execution time of these jobs can vary drastically: one could take up to 6 hours, while another could finish in less than a minute. Currently, the worker picks up the messages, starts a job for each one, and acknowledges the messages after their jobs are done (which could be several hours later).
Now getting to the problem: The worker application runs on multiple instances, but sometimes we see very uneven message distribution across the different instances. Consider the following graph, for example:
It shows the number of messages processed by each worker instance at any given time. You can see that some are hitting the maximum of 15 (configured via the spring.cloud.gcp.pubsub.subscriber.executor-threads property) while others are idling at 1 or 2. At this point, we also start seeing messages without any started jobs (awaiting execution). We assume that these were pulled by the GCP Pub/Sub client in the busy instances but cannot yet be processed due to a lack of executor threads. The threads are busy because they're processing heavier and more time-consuming jobs.
Finally, the question: Is there any way to do backpressure (i.e. tell GCP Pub/Sub that an instance is busy and have it re-distribute the messages to a different one)? I looked into this article, but as far as I understood, the setMaxOutstandingElementCount method wouldn't help us because it would control how many messages the instance stores in its memory. They would, however, still be "assigned" to this instance/subscriber and would probably not get re-distributed to a different one. Is that correct, or did I misunderstand?
We want to utilize the worker instances optimally and have messages processed as quickly as possible. In theory, we could try to split the more expensive jobs into several different messages, thus minimizing the differences in processing time, but is this the only option?
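The effect of capping outstanding messages per subscriber can be illustrated with a toy model. This is purely a sketch of the flow-control idea, not the actual Pub/Sub client: once a slow worker has pulled its maximum number of outstanding messages, the remaining messages stay in the queue and become available to other subscribers instead of piling up on the busy one.

```python
from collections import deque


def dispatch(messages, capacities):
    """Toy model of pull delivery with per-subscriber flow control.

    Each worker may hold at most capacities[worker] outstanding
    messages. Once a worker is full, undelivered messages remain in
    the shared queue for other workers to pull.
    """
    queue = deque(messages)
    assigned = {w: [] for w in capacities}
    progress = True
    while queue and progress:
        progress = False
        for worker, cap in capacities.items():
            if queue and len(assigned[worker]) < cap:
                assigned[worker].append(queue.popleft())
                progress = True
    return assigned


# A worker stuck on long jobs (capacity 2) cannot hoard messages;
# the idle worker (capacity 15) absorbs the rest.
result = dispatch(range(10), {"busy": 2, "idle": 15})
```

Whether a real deployment behaves this way depends on the client library's flow-control settings; the point of the sketch is only that a low outstanding-message cap on each subscriber keeps excess messages in the queue rather than "assigned" to a busy instance.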

Difference between TaskQueue and MapReduce in Google App Engine

I have read the docs about Task Queues and push queues in GAE, which are used to create long-running tasks.
What I don't understand is why there was a need for MapReduce. Since both do their processing in the background, what are the main differences between them?
Can someone please explain this?
Edit: I guess I was comparing apples with monkeys! Hadoop and MapReduce are related, while GAE is a backend framework.
You are confusing two entirely different things.
The MapReduce paradigm is all about distributed parallel processing of very large amounts of data.
A task queue is a scheduler; it can schedule a task to execute at a certain time. It is just a scheduler, like a Unix cron job.
Note the key phrases in the statements above ("distributed parallel processing" versus "schedule") to see the difference.
From the definition of TaskQueue
Task queues let applications perform work, called tasks, asynchronously outside of a user request. If an app needs to execute work in the background, it adds tasks to task queues. The tasks are executed later, by worker services.
By definition, task queues work outside of a user request, meaning there is no live user request driving the task; it was simply submitted or scheduled at some point in the past. MapReduce programs, by contrast, are submitted by users to execute, though you could use a task queue to schedule one for the future.
You are probably confused because words like task, queue, and scheduling are also used in the MapReduce world. Those are generic terms, so the concepts may look similar, but they are definitely not the same.
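To make the contrast concrete, here is a toy in-process word count written in the MapReduce style (map, shuffle, reduce). This is purely illustrative: real MapReduce distributes these three phases across many machines, which is the whole point of the paradigm, whereas a task queue would merely schedule when such a job runs.

```python
from collections import defaultdict
from functools import reduce

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key into a single result.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
```

In a distributed setting, each map and reduce call above would run on a different worker over a shard of the data; nothing in the task queue API does that partitioning for you.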

App engine TaskQueue task impacting user facing handlers performance

My queue task uses urlfetch to get some data from an external API and saves it to ndb Datastore entities.
This takes about 15 seconds total.
Somehow, when the task runs, all other handlers (simple json response handlers) become slower. (slower means +500ms)
What could be causing this?
Isn't the idea of background tasks that they don't affect user-facing requests?
I stumbled upon this blogpost, but my task takes longer than 1 second to complete. I don't see how that's going to help me.
By default, your tasks are executed by the same instances that serve user requests. Background or not, they share the same CPU, memory and bandwidth. It's a good idea to run these tasks on a different module, which means a different instance. You can do it by specifying a target for your task queue.
Note that the automatic App Engine scheduler will typically spin up a new instance when responses from your current instances slow down. In your case, however, the slowdown is caused not by a growing volume of standard requests but by an unusual request that takes much longer, which prevents the automatic scheduler from reacting to the increased latencies. You can switch to manual or basic scaling, which gives you more control over capacity (the total number of instances) and the rules for spinning up new instances, but creating a separate module for background tasks is the better solution.
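Routing a queue to a dedicated module is done with the `target` field in `queue.yaml`. A sketch along these lines (the `background` queue name and `worker` module name are placeholders; the module itself would be defined in its own `.yaml` file):

```yaml
# queue.yaml: route background work to a dedicated module so it
# does not share CPU, memory, or bandwidth with user-facing instances.
queue:
- name: background
  rate: 5/s
  target: worker
```

Tasks added to the `background` queue are then dispatched to instances of the `worker` module instead of the default, user-facing module.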

How to create X tasks as fast as possible on Google App Engine

We push out alerts from GAE, and let's say we need to push out 50 000 alerts to CD2M (Cloud 2 Device Messaging). For this we:
Read all who wants alerts from the datastore
Loop through and create a "push task" for each notification
The problem is that creating the tasks takes some time, so this doesn't scale as the user base grows. In my experience, just creating the tasks takes 20-30 seconds when there are a lot of them. The reason for one task per push message is that we can retry a task if something fails, and it will only affect a single subscriber. Also, C2DM only supports sending to one user at a time.
Will it be faster if we:
Read all who wants alerts from the datastore
Loop through and create a "pool task" for each 100 subscribers
Each "Pool task" will generate 100 "push tasks" when they execute
The task execution is very fast so in our scenario it seems like the creation of the tasks is the bottleneck and not the execution of the tasks. That's why I thought about this scenario to be able to increase the parallelism of the application. I would guess this would lead to faster execution but then again I may be all wrong :-)
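The two-level fan-out described above can be sketched as follows. This is a pure-Python sketch: the `enqueue` function is a hypothetical stand-in for `taskqueue.add`, and the batch size of 100 matches the pool size proposed in the question.

```python
def chunks(seq, size):
    """Split the subscriber list into fixed-size batches."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


queued = []


def enqueue(task):
    """Stand-in for taskqueue.add(); records what would be enqueued."""
    queued.append(task)


subscribers = ["user-%d" % i for i in range(250)]

# Level 1: the request handler enqueues one "pool task" per
# 100 subscribers instead of one task per subscriber.
for batch in chunks(subscribers, 100):
    enqueue({"kind": "pool", "subscribers": batch})


# Level 2 (runs later, in parallel, inside each pool task):
# each pool task fans out one "push task" per subscriber.
def run_pool_task(task):
    for sub in task["subscribers"]:
        enqueue({"kind": "push", "subscriber": sub})
```

The request handler now issues 3 enqueue calls instead of 250, and the 250 push-task creations happen later, spread across 3 pool tasks that can run in parallel.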
We do something similar with APNS (Apple Push Notification Server): we create a task for a batch of notifications at a time (= pool task as you call it). When task executes, we iterate over a batch and send it to push server.
The difference with your setup is that we have a separate server for communicating with push, as APNS only supports socket communication.
The only downside is that if there is an error, the whole task will be repeated, and some users might get two notifications.
This sounds like it varies based on the number of alerts you need to send out, how long it takes to send each alert, and the number of active instances you have running.
My guess is that it takes a few milliseconds to tens of milliseconds to send out a CD2M alert, while it takes a few seconds for an instance to spin up, so you can probably issue a few hundred or a few thousand alerts before justifying another task instance. The ratio of the amount of time it takes to send each CD2M message vs the time it takes to launch an instance will dictate how many messages you'd want to send per task.
If you already have a fair number of instances running though, you don't have the delay of waiting for instances to spin up.
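The ratio argument above can be made concrete with a back-of-the-envelope calculation. Both numbers below are illustrative assumptions, not measurements:

```python
# Assumed time to send one C2DM alert, in milliseconds.
send_ms = 20
# Assumed time for a new instance to spin up, in milliseconds.
spinup_ms = 5000

# Messages one existing instance can send in the time it would take
# a fresh instance to boot: below this batch size, spinning up
# another instance buys you nothing.
breakeven = spinup_ms // send_ms
```

With these assumed numbers the break-even point is a few hundred messages per task, which is why batching in that range tends to make sense.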
BTW, this seems almost like a perfect application of the MapReduce API. It mostly does what you describe in your second version, except that it takes your initial query and breaks it up into subqueries that each return a "page" of the result set. A task is launched for each subquery, which processes all the items in its "page". This is an improvement over what you describe, because you don't need to spend time looping through your initial result set.
I believe the default implementation of the MapReduce API just queries for all entities of a particular kind (i.e. all User objects), but you can change the filter used.

Strategy for interleaving jobs on AppEngine?

Let's say I have thousands of jobs to perform repeatedly; how would you propose I architect my system on Google App Engine?
I need to be able to add more jobs while scaling the system effectively. Scheduled tasks are part of the solution, of course, as are task queues, but I am looking for more insight as to how best to utilize these resources.
NOTE: There are no dependencies between "jobs".
Based on what little description you've provided, it's hard to say. You probably want to use the Task Queue, and maybe the deferred library if you're using Python. All that's required to use these is to use the API to enqueue a task.
If you're talking about having many repeating tasks, you have a couple of options:
Start off the first task on the task queue manually, and use 'chaining' to have each invocation queue the next one with the appropriate interval.
Store each schedule in the datastore. Have a cron job regularly scan for any tasks that have reached their ETA; fire off a task queue task for each, updating the ETA for the next run.
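The 'chaining' option can be sketched as follows. This is an in-process simulation: the `pending` list and `enqueue` function are hypothetical stand-ins for the task queue, where each invocation would re-enqueue its successor via `taskqueue.add` with an appropriate countdown.

```python
pending = []  # stand-in for the task queue
log = []      # records each execution, for illustration


def enqueue(fn, *args):
    """Stand-in for taskqueue.add(): queue a future invocation."""
    pending.append((fn, args))


def chained_job(job_id, runs_left):
    """Do one unit of work, then schedule the next run of this job.

    On App Engine the enqueue below would carry a countdown/eta so
    each link in the chain runs at the desired interval.
    """
    log.append((job_id, runs_left))
    if runs_left > 1:
        enqueue(chained_job, job_id, runs_left - 1)


enqueue(chained_job, "job-A", 3)

# Drain the simulated queue: each execution may enqueue the next link.
while pending:
    fn, args = pending.pop(0)
    fn(*args)
```

Each job chains independently, so thousands of unrelated jobs can interleave on the queue without any coordination between them.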
I think you could use Cron Jobs.
Regards.
