No matter what I set rate and bucket_size to, I only ever see one or two tasks running at the same time. Does anyone know what the reason might be?
My queue configuration is:
name: MyQueue
rate: 30/s
bucket_size: 50
Note that the documentation mentions
To ensure that the taskqueue system does not overwhelm your
application, it may throttle the rate at which requests are sent. This
throttled rate is known as the enforced rate. The enforced rate may be
decreased when your application returns a 503 HTTP response code, or
if there are no instances able to execute a request for an extended
period of time.
In such situations, the admin console will show an Enforced Rate lower than the 30/s you requested.
We tried to enforce a rate limit on a Cloud PubSub push subscriber by setting the quota on "Push subscriber throughput, kB" to 1, which should mean that PubSub pushes no more than 1 kB/s to the push subscriber.
However, the actual throughput can be higher than that, around 6-8 kB/s.
Why is that not limiting the throughput as expected?
More details:
The goal is to have a rate limit of 50 messages per second.
We can assume an average message size; for our testing we use 50-byte messages, so 50 messages per second * 50 bytes = 2500 bytes per second, roughly 2.5 kB/s. By setting the quota to 1 kB/s we expected to get far fewer than 50 messages per second pushed by PubSub. During testing we got significantly more than that.
At the moment, there is a known issue with the enforcement of push subscriber quota in Google Cloud Pub/Sub.
In general, push subscriber quota is not really a good way to try to enforce flow control. For true flow control, it is better to use pull subscribers and the client libraries. The goal of flow control in the subscriber is to prevent the subscriber from being overwhelmed. In the client library, flow control is defined in terms of outstanding messages and/or outstanding bytes. When one of these limits is reached, Cloud Pub/Sub suspends the delivery of more messages.
The issue with rate-based flow control is that it doesn't account well for unexpected issues with the subscriber or its downstream dependencies. For example, imagine that the subscriber receives messages, writes to a database, and then acknowledges the message. If the database were suffering from high latency or just unavailable for a period of time, then rate-based flow control is still going to deliver more messages to the subscriber, which will back up and could eventually overload its memory. With flow control based on outstanding messages or bytes, the fact that the database is unavailable (which prevents the acknowledgement of messages by the subscriber) means that delivery is completely halted. In this situation where the database cannot process any messages or is processing them extremely slowly, sending more messages--even at a very low rate--is still harmful to the subscriber.
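For illustration, here is a minimal sketch of pull-based flow control with the Go client library (cloud.google.com/go/pubsub); the project ID, subscription ID, and process function are placeholders, not anything from the original question:

    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/pubsub"
    )

    func main() {
        ctx := context.Background()

        // "my-project" and "my-subscription" are hypothetical identifiers.
        client, err := pubsub.NewClient(ctx, "my-project")
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        sub := client.Subscription("my-subscription")
        // Flow control is expressed as outstanding (unacknowledged) work, not a
        // rate: once either limit is reached, the client stops pulling messages.
        sub.ReceiveSettings.MaxOutstandingMessages = 50
        sub.ReceiveSettings.MaxOutstandingBytes = 10 * 1024 // 10 kB

        err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
            // Do the downstream work (e.g. write to the database) and only Ack
            // once it succeeds, so a slow or unavailable dependency naturally
            // halts further delivery.
            if err := process(m.Data); err != nil {
                m.Nack() // redeliver later
                return
            }
            m.Ack()
        })
        if err != nil {
            log.Fatal(err)
        }
    }

    // process stands in for the real downstream work.
    func process(data []byte) error { return nil }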
I was experimenting with concurrent request handling on a few platforms.
The aim of the experiment was to have a broad measure of the capacity bounds of some selected technologies.
I set up a Linux VM on my machine with a basic Go HTTP server (the vanilla http.HandleFunc from the standard net/http package).
The server computes a modified version of the fasta algorithm, restricted to a single thread and process, and returns the result. N was set to 100000.
The algorithm runs in roughly 2 seconds.
I used the same algorithm and logic on a Google App Engine project.
The algorithm is written using the same code; only the handler setup is done in init() instead of main(), as per GAE requirements.
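For reference, here is a minimal sketch of the two setups described above (computeFasta and the route are placeholders standing in for the actual benchmark code, not the code used in the experiment):

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func fastaHandler(w http.ResponseWriter, r *http.Request) {
        // CPU-bound work, restricted to a single thread/process in the experiment.
        fmt.Fprint(w, computeFasta(100000))
    }

    // Standalone server on the local VM:
    func main() {
        http.HandleFunc("/fasta", fastaHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

    // On the (older) App Engine Go runtime the same registration lives in
    // init() and main() is omitted, since the runtime serves the handlers:
    //
    //  func init() {
    //      http.HandleFunc("/fasta", fastaHandler)
    //  }

    // computeFasta is a placeholder for the restricted fasta benchmark.
    func computeFasta(n int) string { return fmt.Sprintf("fasta(%d)", n) }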
On the other end, an Android client spawns 500 threads, each one issuing a GET request in parallel to the fasta-computing server, with a request timeout of 5000 ms.
I was expecting the GAE application to scale and answer every request, and the local Go server to fail on some of the 500 requests, but the results were the opposite:
the local server correctly replied to every request within the timeout bounds, while the GAE application was able to handle just 160 requests out of 500. The remaining requests timed out.
I checked on the Cloud Console and I verified that 18 GAE instances were spawned, but still the vast majority of requests failed.
I thought that most of them failed because of the start-up time of each GAE instance, so I repeated the experiment right after but I had the same results: most of the requests timed out.
I was expecting GAE to scale to accommodate ALL the requests, believing that if a single local VM could successfully reply to 500 concurrent requests, GAE would do the same, but this is not what happened.
The GAE console doesn't show any error and correctly reports the number of incoming requests.
What could be the cause of this?
Also, if a single instance on my machine could handle all the incoming requests using only goroutines, how come GAE needed to scale so much at all?
To make optimal use of instances and minimize costs you need to configure a few things in app.yaml:
Enable threadsafe: true - it actually comes from the Python config and is not applicable to Go, but I would set it just in case.
Adjust the scaling section (see the sketch below):
max_concurrent_requests - set to the maximum of 80
max_idle_instances - set to the minimum of 0
max_pending_latency - set it to automatic or greater than min_pending_latency
min_idle_instances - set it to 0
min_pending_latency - set to a higher number. If you are OK with 1 second of latency and your handlers take 100 ms on average to process, set it to 900ms.
Then you should be able to process a lot of requests on a single instance.
If you are OK with burning cash for the sake of responsiveness & scalability, increase min_idle_instances & max_idle_instances.
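Putting those settings together, a minimal app.yaml sketch might look like this (the values are illustrative, taken from the advice above, and should be checked against the currently allowed ranges):

    automatic_scaling:
      max_concurrent_requests: 80
      min_idle_instances: 0
      max_idle_instances: 0   # as low as the scheduler allows; use 1 if 0 is rejected
      min_pending_latency: 900ms
      max_pending_latency: automatic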
Also, do you use similar instance types for the VM and GAE? The GAE F1 instance is not very fast and is better suited to async tasks such as working with IO (datastore, http, etc.). You can configure a more powerful instance class to scale better for computation-intensive tasks.
Also, do you test on a paid account? Free accounts have quotas, and App Engine will refuse a percentage of requests if it believes the load, continued at the same pattern, would exceed the daily quota.
Extending Alexander's answer.
The GAE scaling logic is based on incoming traffic trend analysis.
The key to handling your case - sudden spikes in traffic (which can't be taken into account in the trend analysis because of how quickly they arrive) - is to have sufficient resident (idle) instances configured for your application to absorb such traffic until GAE spins up additional dynamic instances. It can handle peaks as high as you want (if your pockets are deep enough).
See Scaling dynamic instances for more details.
Thanks to everyone for their help.
The answers to this question raised many interesting points and insights.
The fact that the Cloud Console was reporting no errors led me to believe that the bottleneck was happening after the actual request processing.
I found the reason why the results were not as expected: bandwidth.
Each response had a payload of roughly 1MB and thus responding to 500 simultaneous connections from the same client would clog the lines, resulting in timeouts.
This was obviously not happening when requesting from the VM, where the bandwidth is much larger.
Now GAE scaling is in line with what I expected: it successfully scales to accommodate each incoming request.
Today AppEngine went down for a while:
http://code.google.com/status/appengine/detail/serving/2012/10/26#ae-trust-detail-helloworld-get-latency
The result was that all requests were kept pending, some for as long as 24 minutes. Here is an excerpt from my server log. These requests are generally handled in less than 200 ms.
https://www.evernote.com/shard/s8/sh/ad3b58bf-9338-4cf7-aa35-a255d96aebbc/4b90815ba1c8cd2080b157a54d714ae0
My quota ($8 per day) was blown through in a matter of minutes, when previously it sat at around $2 per day.
How can I prevent pending_ms from eating all my quota, even though my actual requests still respond very fast? I had the pending latency set from 300 ms to Automatic. Would limiting the maximum to 10 seconds prevent that kind of outbreak?
blackjack75,
You're right, raising the pending latency to something like 10 seconds will help reduce the number of instances started.
It looks like the long running requests tied up your instances. When this happens, app engine spins up new instances to handle the new requests, and of course instances cost money.
Lowering your min and max idle instances to smaller numbers should also help.
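For illustration, those two adjustments expressed as an app.yaml automatic_scaling section would look roughly like this (at the time of this question they were sliders in the Admin Console; the values are illustrative and should be checked against the currently allowed ranges):

    automatic_scaling:
      min_pending_latency: 10s   # let requests wait this long before new instances are considered
      max_pending_latency: 15s
      min_idle_instances: 1
      max_idle_instances: 1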
On your dashboard, you can look at your instance graph to see how long the burst of instances was left idle after the request load finished.
You can look at your typical usage to help estimate a safe max.
Lowering them can cause slowness when legitimate traffic needs to spin up a new instance, especially with bursty traffic, so you would want to adjust this to match your budget. For comparison, on a non-production appspot, having the min and max set to 1 works fine.
Besides that, general techniques for reducing App Engine resource usage will help. It sounds like you've gone through that already since your typical request time is low. Enabling concurrent requests could help here if your code handles threads correctly (no globals, etc.) and your instances have enough free memory to handle multiple requests.
Our "Enforced Rate" dropped to 0.10/s even though the queue is defined to be 20/s for a Push Task Queue. We had a 2 hour backlog of tasks built up.
The documentation says:
The enforced rate may be decreased when your application returns a 503
HTTP response code, or if there are no instances able to execute a
request for an extended period of time.
There were no 503s (all 200s), yet we saw 0 (zero) instances running for the version processing that queue, which may explain the Enforced Rate drop. The target version of the application processing the queue is not the primary version, and the target version has no source of requests other than the Push Task Queue workload.
If the Enforced Rate had been dropped due to lack of instances, why did GAE not spin up new instances?
There were no budget limits or rate limits that explain the drop to 0 instances.
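For context, the push-queue definition described would look roughly like this in queue.yaml (the names are hypothetical; target routes the tasks to the non-default version):

    queue:
    - name: my-queue
      rate: 20/s
      target: my-worker-version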
I'm answering this by pointing to the bug we're starring: http://code.google.com/p/googleappengine/issues/detail?id=7338
GAE dashboard shows stats for different URIs of your app. It includes Req/Min, Requests, Runtime MCycles, and Avg Latency.
The help provided seems to be outdated; here is what it states:
The current load table provides two data points for CPU usage, "Avg CPU (API)" and "% CPU". The "Avg CPU (API)" displays the average amount of CPU a request to that URI has consumed over the past hour, measured in megacycles. The "% CPU" column shows the percentage of CPU that URI has consumed since midnight PST with respect to the other URIs in your application.
So I assume Runtime MCycles is what the help calls Avg CPU (API)?
How do I map this number to the request stats in the logs?
For example one of the requests has this kind of logs: ms=583 cpu_ms=519 api_cpu_ms=402.
Do I understand correctly that ms includes cpu_ms and cpu_ms includes api_cpu_ms?
So then cpu_ms is the Runtime MCycles which is shown as average for the given URI on dashboard?
I have an F1 instance (600 MHz) with concurrency enabled for my app. Does that mean this instance's throughput is 600 MCycles per second? So if an average request takes 100 MCycles, it should be able to handle 5-6 requests per second on average?
I am digging into this to try to predict the costs for my app under load.
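As a rough back-of-envelope check of that reasoning (the numbers below are the assumptions from the question, not measurements):

    package main

    import "fmt"

    func main() {
        // Assumptions: an F1 instance clocked at ~600 MHz, i.e. roughly 600
        // megacycles of runtime CPU per second, and an average request
        // costing ~100 megacycles of cpu_ms work.
        const instanceMcyclesPerSec = 600.0
        const avgRequestMcycles = 100.0

        // CPU-bound ceiling for a single instance.
        fmt.Printf("~%.0f requests/sec per instance (CPU-bound ceiling)\n",
            instanceMcyclesPerSec/avgRequestMcycles)

        // With concurrent requests enabled, time spent waiting on APIs
        // (datastore, urlfetch, etc.) can overlap, so IO-heavy handlers may
        // exceed this ceiling in practice.
    }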
This blog post (by Nick Johnson) is a useful summary of what the request log fields mean: http://blog.notdot.net/2011/06/Demystifying-the-App-Engine-request-logs