GAE Task Queues with ETA and large number of tasks

GAE Task Queues with ETA and large number of tasks - google-app-engine

In my app, I need to send emails to a large number of users when an event happens. I'd like to send those emails out gradually instead of all at once. For clarity in explaining, let's say I need to send out emails to 10,000 users.
I currently do this with a task queue with a maximum rate of 1 task/second. I enqueue 10,000 tasks in batches, and the emails get sent out at a rate of 1/second.
I'd like to change this to using an ETA for the tasks instead of limiting the task queue to a maximum rate. Conceptually it would be like this (except that task submission would be batched):
now = datetime.utcnow()
for i, email in enumerate(email_list):
eta = now + datetime.timedelta(seconds=i)
deferred.defer(send_email, email, _eta=eta)
Before implementing a change like this, I'd like to have some confidence that GAE can do this efficiently.
If I have 10,000 tasks in a task queue, each with a different ETA, will the GAE task queue be able to efficiently monitor all the tasks and start them at approximately (the precise ETA isn't important) the appropriate times? I don't know what algorithm Google uses for this.
EDIT:
Imagine if you inserted a billion tasks in a single day each with an ETA. How would GAE monitor those tasks to make sure they got fired off at the right time? Polling all the tasks at some time interval (e.g., every minute) would be a terrible solution. Perhaps GAE uses some kind of priority queue. It would be nice to have some confidence that GAE has implemented an algorithm that will scale for a lot of tasks with an ETA.

With the stated daily quota of 10 billion tasks one would think they should be able to handle 10,000 of them :)
In my current project I'm also sending ~10,000 emails (SendGrid) with tasks & _eta (although in batches of 25) which works fine so far...

In the current infrastructure, the logic can be a little batchy when the throughput is significantly below the configured rate. Queues prepare tasks 5s in advance but processing can slow down if there are no tasks in a given 5s window.
It should work in general, but you might see a pattern of delays of up to 20s followed by bursts.
At a total throughput of 1B tasks/day, you would probably want to split to run over 40 queues at a rate of around 300 tasks/sec/queue. With a rate that steady, delays would be uncommon.

Related

Gatling: Keep fixed number of users/requests at any instant

how can we keep fixed number of active concurrent users/requests at time for a scenario.
I have an unique testing problem where I am required to do the performance testing of services with fixed number of request at a given moment for a given time periods like 10 minutes or 30 minutes or 1 hour.
I am not looking for per second thing, what I am looking for is that we start with N number of requests and as any of request out of N requests completes we add one more so that at any given moment we have N concurrent requests only.
Things which I tried are rampUsers(100) over 10 seconds but what I see is sometimes there are more than 50 users at a given instance.
constantUsersPerSec(20) during (1 minute) also took the number of requests t0 50+ for sometime.
atOnceUsers(20) seems related but I don't see any way to keep it running for given number of seconds and adding more requests as previous ones completes.
Thankyou community in advance, expecting some direction from your side.

There is a throttling mechanism (https://gatling.io/docs/3.0/general/simulation_setup/#throttling) which allow you to set max number of requests, but you must remember that users are injected to simulation independently of that and you must inject enough users to produce that max number of request, without that you will end up with lower req/s. Also users that will be injected but won't be able to send request because of throttling will wait in queue for they turn. It may result in huge load just after throttle ends or may extend your simulation, so it is always better to have throttle time longer than injection time and add maxDuration() option to simulation setup.
You should also have in mind that throttled simulation is far from natural way how users behave. They never wait for other user to finish before opening page or making any action, so in real life you will always end up with variable number of requests per second.

Use the Closed Work Load Model injection supported by Gatling 3.0. In your case, to simulate and maintain 20 active users/requests for a minute, you can use an injection like,
Script.<Controller>.<Scenario>.inject(constantConcurrentUsers(20) during (60 seconds))

Configuring a task queue and instance for non urgent work

I am using an F4 instance (because of memory needs) with automatic scheduling to do some background processing. It is run from a task queue. It takes 40s to 60s to complete each invocation. Because of the high memory needs, each instance should only handle one request at a time.
The action that needs to be done is not urgent. If it doesn't get scheduled for 30 minutes that isn't a problem. Even 60 minutes is acceptable and I'd rather make use of that time rather than spin up more instances. However, if the service gets popular and the is getting more than 60 requests an hour I want to spin up more instances to make sure there isn't more than a 60 minute wait.
I am having trouble figuring out how to configure the instance and queue parameters to keep my costs down but be able to scale in that way. My initial thought was something like this:
<queue>
<name>non-urgent-queue</name>
<target>slow-service</target>
<rate>1/m</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>1</max-concurrent-requests>
</queue>
<automatic-scaling>
<min-idle-instances>0</min-idle-instances>
<max-idle-instances>0</max-idle-instances>
<min-pending-latency>20m</min-pending-latency>
<max-pending-latency>1h</max-pending-latency>
<max-concurrent-requests>1</max-concurrent-requests>
</automatic-scaling>
First of all those latency settings are invalid, but I can't find documentation on the valid range or units. Can anyone direct me to that info?
Secondly, if I understand the queue settings correctly, this configuration would limit it to 60 invocations an hour getting to the service, even if the task queue had 60+ jobs waiting.
Thanks for your help!

Indeed, throttling at the queue level basically defeats the ability to scale when needed. So you can't use the <rate> in the queue configuration at the values you have right now, you need to use the value matching the maximum rate you're willing to accept (with you max number of instances running simultaneously):
the max rate of requests that can go through the queue being limited at 1/min means you can't scale above 60/h
the <bucket-size> set at 1 means no peaks above the rate can be handled (as soon as one task starts the token bucket empties).
the <max-concurrent-requests> set at 1 will basically prevent multiple instances dealing simultaneouly with the queued workload. They may be started by the autoscaler because of the request latencies, but they won't be able to help since only one queue task can be handled at a time.
In the <automatic-scaling> section the <max-concurrent-requests> set to 1 is good - this ensures no instance handles more than 1 request at a time - which is what you want.
The bad news is that the max values for the latencies appear to be 15s. At least when using the app.yaml config for python (but I think it's unlikely for that to differ across language sandboxes):
Error 400: --- begin server output ---
automatic_scaling.min_pending_latency (30s), must be in the range [0.010000s,15.000000s].
--- end server output ---
and
Error 400: --- begin server output ---
automatic_scaling.max_pending_latency (60s), must be in the range [0.010000s,15.000000s].
--- end server output ---
Which probably also explains why your 5m and 1h values aren't accepted - I used 30s and 60s and got the above errors.
This means you won't be able to use the autoscaling parameters to tune such a slow-moving processing like you desire.
The only alternative I can think of is to have 2 queues:
a fast one feeding just trigger tasks for the slow-service jobs, but which your service intercepts and saves in the datastore. Maybe performed by some faster service (you don't want these stuck behind a slow-service job execution as it can cause unnecessary instance launching. Maybe, depending on the rest of your implementation, you can replace this queue completely with just storing the job info in the datastore instead of enqueing tasks in the fast queue.
a slow one for the actual slow-service job execution tasks
You'd also have a cron job executing once a minute, checking how many triggers are pending in the datastore, decide how much to scale and enqueue the corresponding number of slow-service job tasks in the slow queue. The autoscaler would simply bring up the corresponding number of instances (if needed). Low latency autoscaling configs would be desirable in this case - you already decided how you want your app to scale.

This is how I ended up doing it. I use a slow queue and a fast queue configured like this:
<queue>
<name>slow-queue</name>
<target>pdf-service</target>
<rate>2/m</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>1</max-concurrent-requests>
</queue>
<queue>
<name>fast-queue</name>
<target>pdf-service</target>
<rate>10/m</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>5</max-concurrent-requests>
</queue>
The max-concurrent-requests in the slow queue ensures only one task will run at a time, so there will only be one instance active.
Before I post to the slow queue I check to see how many items are already on the queue. The result may not be totally reliable, but for my purposes it is sufficient. In java:
QueueStatistics queueStats = queue.fetchStatistics();
if(queueStats.getNumTasks()<30) {
//post to slow queue
} else {
//post to fast queue
}
So when my slow queue gets too full, I post to the fast queue which allows concurrent requests.
The instance is configured like this:
<automatic-scaling>
<min-idle-instances>0</min-idle-instances>
<max-idle-instances>automatic</max-idle-instances>
<min-pending-latency>15s</min-pending-latency>
<max-pending-latency>15s</max-pending-latency>
<max-concurrent-requests>1</max-concurrent-requests>
</automatic-scaling>
So it will create new instances as slowly as possible (15s is the max latency) and make sure only one process runs on an instance at a time.
With this configuration I'll have a max of 6 instances at a time but that should do about 500/hr. I could increase the rate and concurrent requests to do more.
The negative of this solution is an element of unfairness. Under heavy load, some tasks will be stuck in the slow queue while others will get processed more quickly in the fast queue.
Because of that, I have decreased the max items on the slow queue to 13 so the unfairness won't be so extreme, maybe a 10 minute wait for jobs that go to the slow queue when it is full.

How can tasks be prioritized when using the task queue on google app engine?

I'm trying to solve the following problem:
I have a series of "tasks" which I would like to execute
I have a fixed number of workers to execute these workers (since they call an external API using urlfetch and the number of parallel calls to this API is limited)
I would like for these "tasks" to be executed "as soon as possible" (ie. minimum latency)
These tasks are parts of larger tasks and can be categorized based on the size of the original task (ie. a small original task might generate 1 to 100 tasks, a medium one 100 to 1000 and a large one over 1000).
The tricky part: I would like to do all this efficiently (ie. minimum latency and use as many parallel API calls as possible - without getting over the limit), but at the same time try to prevent a large number of tasks generated from "large" original tasks to delay the tasks generated from "small" original tasks.
To put it an other way: I would like to have a "priority" assigned to each task with "small" tasks having a higher priority and thus prevent starvation from "large" tasks.
Some searching around doesn't seem to indicate that anything pre-made is available, so I came up with the following:
create three push queues: tasks-small, tasks-medium, tasks-large
set a maximum number of concurrent request for each such that the total is the maximum number of concurrent API calls (for example if the max. no. concurrent API calls is 200, I could set up tasks-small to have a max_concurrent_requests of 30, tasks-medium 60 and tasks-large 100)
when enqueueing a task, check the no. pending task in each queue (using something like the QueueStatistics class), and, if an other queue is not 100% utilized, enqueue the task there, otherwise just enqueue the task on the queue with the corresponding size.
For example, if we have task T1 which is part of a small task, first check if tasks-small has free "slots" and enqueue it there. Otherwise check tasks-medium and tasks-large. If none of them have free slots, enqueue it on tasks-small anyway and it will be processed after the tasks added before it are processed (note: this is not optimal because if "slots" free up on the other queues, they still won't process pending tasks from the tasks-small queue)
An other option would be to use PULL queue and have a central "coordinator" pull from that queue based on priorities and dispatch them, however that seems to add a little more latency.
However this seems a little bit hackish and I'm wondering if there are better alternatives out there.
EDIT: after some thoughts and feedback I'm thinking of using PULL queue after all in the following way:
have two PULL queues (medium-tasks and large-tasks)
have a dispatcher (PUSH) queue with a concurrency of 1 (so that only one dispatch task runs at any time). Dispatch tasks are created in multiple ways:
by a once-a-minute cron job
after adding a medium/large task to the push queues
after a worker task finishes
have a worker (PUSH) queue with a concurrency equal to the number of workers
And the workflow:
small tasks are added directly to the worker queue
the dispatcher task, whenever it is triggered, does the following:
estimates the number of free workers (by looking at the number of running tasks in the worker queue)
for any "free" slots it takes a task from the medium/large tasks PULL queue and enqueues it on a worker (or more precisely: adds it to the worker PUSH queue which will result in it being executed - eventually - on a worker).
I'll report back once this is implemented and at least moderately tested.

The small/medium/large original task queues won't help much by themselves - once the original tasks are enqueued they'll keep spawning worker tasks, potentially even breaking the worker task queue size limit. So you need to pace/control enqueing of the original tasks.
I'd keep track of the "todo" original tasks in the datastore/GCS and enqueue these original tasks only when the respective queue size is sufficiently low (1 or maybe 2 pending jobs), from either a recurring task, a cron job or a deferred task (depending on the rate at which you need to perform the original task enqueueing) which would implement the desired pacing and priority logic just like a push queue dispatcher, but without the extra latency you mentioned.

I have not used pull queues, but from my understanding they could suit your use-case very well. Your could define 3 pull queues, and have X workers all pulling tasks from them, first trying the "small" queue then moving on to "medium" if it is empty (where X is your maximum concurrency). You should not need a central dispatcher.
However, then you would be left to pay for X workers even when there are no tasks (or X / threadsPerMachine?), or scale them down & up yourself.
So, here is another thought: make a single push queue with the correct maximum concurrency. When you receive a new task, push its info to the datastore and queue up a generic job. That generic job will then consult the datastore looking for tasks in priority order, executing the first one it finds. This way a short task will still be executed by the next job, even if that job was already enqueued from a large task.

EDIT: I now migrated to a simpler solution, similar to what #eric-simonton described:
I have multiple PULL queues, one for each priority
Many workers pull on an endpoint (handler)
The handler generates a random number and does a simple "if less than 0.6, try first the small queue and then the large queue, else vice-versa (large then small)"
If the workers get no tasks or an error, they do semi-random exponential backoff up to maximum timeout (ie. they start pulling every 1 second and approximately double the timeout after each empty pull up to 30 seconds)
This final point is needed - amongst other reasons - because the number of pulls / second from a PULL queue is limited to 10k/s: https://cloud.google.com/appengine/docs/python/taskqueue/overview-pull#Python_Leasing_tasks
I implemented the solution described in the UPDATE:
two PULL queues (medium-tasks and large-tasks)
a dispatcher (PUSH) queue with a concurrency of 1
a worker (PUSH) queue with a concurrency equal to the number of workers
See the question for more details. Some notes:
there is some delay in task visibility due to eventual consistency (ie. the dispatchers tasks sometimes don't see the tasks from the pull queue even if they are inserted together) - I worked around by adding a countdown of 5 seconds to the dispatcher tasks and also adding a cron job that adds a dispatcher task every minute (so if the original dispatcher task doesn't "see" the task from the pull queue, an other will come along later)
made sure to name every task to eliminate the possibility of double-dispatching them
you can't lease 0 items from the PULL queues :-)
batch operations have an upper limit, so you have to do your own batching over the batch taskqueue calls
there doesn't seem to be a way to programatically get the "maximum parallelism" value for a queue, so I had to hard-code that in the dispatcher (to calculate how many more tasks it can schedule)
don't add dispatcher tasks if they are already some (at least 10) in the queue

Does task queue truly run tasks in parallel?

We have an application that takes some input from a user and makes ~50 RPC calls. Each call takes around 4-5 minutes.
In the backend we are using a push queue and enqueuing each of these 50 calls as tasks. This is our queue spec:
queue:
- name: some-name
rate: 500/s
bucket_size: 100
max_concurrent_requests: 500
My understanding is that all 50 requests should be run in parallel, and thus all of them should be complete in 4-5 minutes. But what's actually happening is that only around ~15 of these requests are returning results, while the rest cross the 10 min limit and time out. Another thing to note is that this seems to work fine if we bring down the number of requests to < 10.
There's always the possibility that the requests that timed out did so because the RPC response actually took that long. But what I wanted to confirm is :
My understanding of the tasks running in parallel is correct.
Our queue config and the number of tasks we're enqueuing has nothing to do with these requests timing out.
Are these correct ?

(1) Parallel execution
Yes, tasks can be executed in parallel (up to 500 in your case), but in push queues, your app has no control in which particular order the tasks in a push queue are executed and no direct control how many tasks are executed at once. (Your app can control in which sequence tasks are added to a queue though, see the pattern in (2) below)
App Engine uses certain factors to decide how fast and which tasks are executed, especially the queue configuration and also the scaling configuration (e.g. in app.yaml). Since you pay for every first 15 minutes of an instance, it could get very expensive to really have 50 instances launched, then idling for 15 minutes before shutting them down (until the next request). In this regard, the mechanism that spawns new instances is a little smarter, whether it is HTTP requests by users or task queues.
(2) Request time outs
Yes, it is very unlikely that the enqueuing has anything to do with these request time outs. Unless the time-outs are an unintentional side-effect of the wrong assumption that a particular task was executed before.
In order to avoid request time outs in general, it makes sense to split a task into multiple tasks. For example, if you have a task do_foo and those executions exceed the time outs frequently (or memory limits), you could instead have do_foo load off work to other tasks that will do the actual jobs.
For some migration tasks I use this pattern in a linear / sequential way. E.g. classmethod do_foo just queries entities of a certain kind (ordered by creation timestamp for example), maybe filtered, by page (e.g. 50 in transactions with ancestor). It does some writes to the entities first, and only at the very end after successful commit it creates a new transactional do_foo task with cursor parameter to the next page, eventually with a countdown of 1 sec to avoid transaction errors. The next execution of do_foo will continue with the next page (of course only after the task with the previous page completed).
Depending on the nature of the tasks, you could alternatively have each task fan out into multiple tasks per execution, e.g. do_foo triggers do_bar, do_something and do_more. Also note that up to five tasks can be created transactionally inside a transaction.

How to handle daily/weekly/mothly boards on AppEngine datastore?

I'm developing a high score web service for my game, and it's running on Google App Engine.
My game has 5 difficulties, so I originally had 5 boards with entries for each (player_login, score and time). If the player submitted a lower score than the previously scored, it got dismissed, so only the highest score is kept for each player.
But to add more fun into this, I'd decided to include daily/weekly/monthly/yearly high score tables. So I've created 5 boards for each difficulty, making it 25 boards. When a score is submitted, it's saved into each board, and the boards are supposed to be cleared on every day/week/month/year.
This happens by a cron job that is invoked and deletes all entries from a specific board.
Here comes the problem: it looks like deleting entries from the datastore is slow. From my test daily cleanups it looks like deleting a single entry takes around 200 ms.
In the worst-case scenario, if the game would be quite popular and would have, say, 100 000 players, and each of them would have an entry in the yearly board, it would take 100 000 * 0.012 seconds = 12 000 seconds (3 hours!!) to clear that board. I think we are allowed to have jobs of up to 30 seconds in App Engine, so this wouldn't work.
I'm deleting with following code (thanks to Nick Johnson):
q = Score.all(keys_only=True).filter('b = ',boardToClear)
results = q.fetch(500)
while results:
self.response.out.write("deleting one batch;")
db.delete(results)
q = Score.all(keys_only=True).filter('b = ',boardToClear).with_cursor(q.cursor())
results = q.fetch(500)
What do you recommend me to do with this problem?
One approach that comes to my mind is to use a task queue and delete older scores than that are permitted in each board, i.e. which have expired, but in smaller quantities. This way I wouldn't hit the CPU limit for one task, but the cleanup would not be (nearly) instantaneous, so my 12 000 seconds long cleanup would be split into 1 200 tasks, each roughly 10 seconds long.
But I think that there is something that I'm doing wrong, this kind of operation would be a lot faster when done in relational database. Possibly something is wrong with my approach to the datastore and scoring, because being locked in RDBMS mindset.

First, a couple of small suggestions:
Does deletion take 200ms per item even when you delete items in a batch process? The fastest way to delete should be to do a keys_only query and then call db.delete() on an entire list of keys at once.
The 30-second limit was recently relaxed to 10 minutes for background work (like the cron jobs or queue tasks that you're contemplating) as of 1.4.0.
These may not fundamentally address your problem, though. I think there's no way to get around the fact that deleting a large number of records (hundreds of thousands, say), will take some time. I'm not sure that this is as big a problem for your use case though, as I can see a couple of techniques that would help.
As you suggest, use a task queue to split up a long-running tasks into several smaller tasks. Your use case (deleting a huge number of items that match a particular query) is ideal for a map-reduce task. Nick Johnson's blog post on the Mapper API may be very helpful for you (so that you don't have to write all of that task management code on your own).
Do you need to delete all the out-of-date board entries immediately? If you had a field that listed which week, month, or year that a particular entry counted for, you could index on that field and then only display entries from the current month on the visible leaderboard. (Disk space is cheap, after all.) And then if you wanted to slowly (over hours, say, instead of milliseconds) remove the out-of-date data, you could do that in the background without ever having incorrect data on your leaderboards.

Delete entities in batches. Although a single delete takes a noticeable amount of time (though 200ms seems very high), batch deletes take no longer, as they delete all the entities in parallel. Task Queue and cron jobs can now run for up to 10 minutes, so timeouts should not be an issue.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight