Concurrent requests handling on Google App Engine - google-app-engine

I was experimenting with concurrent request handling on few platforms.
The aim of the experiment was to have a broad measure of the capacity bounds of some selected technologies.
I set up a Linux VM on my machine with a basic Go http server (the vanilla http.HandleFunc of the http default package).
The server would then compute a modified version of the fasta algorithm that restricted threads and processes to 1, and return the result. N was set to 100000.
The algorithm runs in roughly 2 seconds.
I used the same algorithm and logic on a Google App Engine project.
The algorithm is written using the same code, just the handler set up is done on init() instead of main() as per GAE requirements.
On the other end an Android client is spawning 500 threads each one issuing in parallel a GET request to the fasta computing server, with a request timeout of 5000 ms.
I was expecting the GAE application to scale and answer back to each request and the local Go server to fail on some of the 500 requests but results were the opposite:
the local server correctly replied to each request within the timeout bounds while the GAE application was able to handle just 160 requests out of 500. The remaining requests timed out.
I checked on the Cloud Console and I verified that 18 GAE instances were spawned, but still the vast majority of requests failed.
I thought that most of them failed because of the start-up time of each GAE instance, so I repeated the experiment right after but I had the same results: most of the requests timed out.
I was expecting GAE to scale to accomodate ALL the requests, believing that if a single local VM could successfully reply to 500 concurrent requests GAE would have done the same, but this is not what happened.
The GAE console doesn't show any error and correctly reports the number of incoming requests.
What could be the cause of this?
Also, if a single instance could handle all the incoming requests on my machine by virtue of only goroutines, how come that GAE needed to scale so much at all?

To make optimal usage in terms of minimizing costs you need to configure few things in app.yaml:
Enable threadsafe: true - actually it's from Python config and not applicable to Go but I would set it just in case.
Adjust scaling section:
max_concurrent_requests - set to maximum 80
max_idle_instances - set to minimum 0
max_pending_latency - set it to automatic or greater then min_pending_latency
min_idle_instances - set it to 0
min_pending_latency - set to higher number. If you are OK to get 1 second latency and you handlers take on average 100ms to process set it to 900ms.
Then you should be able to proceed a lot of request on single instance.
If you OK to burn cash for the sake of responsiveness & scalabiluty - increase min_idle_instances & max_idle_instances.
Also do you use similar instance types for VM and GAE? The GAE F1 instance is not too fast and is more optimal for async tasks like working with IO (datastore,http,etc.). You can configure usage of more powerful instance to better scale for computation intensive tasks.
Also do you test on paid account? Free accounts have quotas and AppEngine would refuse percentage of requests if it believe the load would exceed the daily quota if continuous with the same pattern.

Extending on Alexander's answer.
The GAE scaling logic is based on incoming traffic trend analysis.
The key for being able to handle your case - sudden spikes in traffic (which can't be takes into account in the trend analysis due to its variation speed) - is to have sufficient resident (idle) instances configured for your application to handle such traffic until GAE spins up additional dynamic instances. It can handle as high peaks as you want (if your pockets are deep enough).
See Scaling dynamic instances for more details.

Thanks everyone for their help.
Many interesting points and insights have been made by the answers I had on this topic.
The fact the the Cloud Console were reporting no errors led me to believe that the bottleneck was happening after the real request processing.
I found the reason why the results were not as expected: bandwidth.
Each response had a payload of roughly 1MB and thus responding to 500 simultaneous connections from the same client would clog the lines, resulting in timeouts.
This was obviously not happening when requesting to the VM, where the bandwith is much larger.
Now GAE scaling is in line with what I expected: it successfully scales to accomodate each incoming request.

Related

What is the difference between min-instances and min-idle-instances in google app engine?

I want to understand the difference between min-instances & min-idle-instances?
I saw documentation on https://cloud.google.com/appengine/docs/standard/java/config/appref#scaling_elements but I am not able to differentiate between the two.
My use case:
I want at least 1 instance always up, as otherwise in most of the cases GAE would take time in creating instance causing my requests to time out (in case of basic scaling).
It should stay up, no matter if there is traffic or not, and if a request comes it should immediately serve it. If request volume grows then it should scale.
Which one I should use?
The min-idle-instances make reference to the instances that are ready to support your application in case you receive high traffic or CPU intensive tasks, unlike the min_instances which are the instances used to process the incoming request immediately. I suggest you to take a look on this link to have a deeper explanation of idle instances.
Based on this, since your use-case is focused on serve the incoming requests immediately, I think you should rather go with the min_instances functionality and use the min-idle-instances only in case you want to be ready for sudden load spikes.
The min-instances configuration applies to dynamic instances while min-idle-instances applies to idle/resident instances.
See also:
Introduction to instances for a description of the 2 instance types
Why do more requests go to new (dynamic) instances than to resident instance? for a bit more details
min_instances: the minimum number of instances running at any time, traffic or no traffic, rain or shine.
min_idle_instances: the minimum of idle (or "unused") instances running over the currently used instances. Example: you automatically scaled to 5 app engine instances that are receiving requests, by setting min_idle_instances to 2, you will be running 7 instances in total, the 2 "extra" instances are idle and waiting in case you receive more load. The goal is that when load raises, your users don't have to wait the load time it takes to start up an instance.
IMPORTANT: you need to configure warmup requests for that to work
IMPORTANT2: you'll be billed for any instance running, idle or not. App engine is not cheap so be careful.
min_instances applies to the number of instances that you want to have running, from 0 (useful if you want to scale down when you don't receive traffic) to 1000. You are charged for the number of instances you have running, so, this is important to save costs.
For your case set this value to 1, as it's the most straightforward option.

When does Google App Engine start or stop an instance?

We have an App Engine app that handles an average .5 requests per second, and seemingly all those requests can be handled by the same instance running a Go app as the main version.
However, sometimes App Engine kicks off a second instance (and sometimes even a third one), that doesn't seem to do anything past handling one or two requests. Here's an example.
Shutting down that instance manually doesn't seem to cause any harm, so my question is, why does App Engine not kill the instance after it did not get any requests for a while? (The above example had four requests in the past hour, often the requests/age ratio gets even lower).
Update:
A similar situation is when an instance is started on a different version. App Engine only seems to kill the instance after hours of not getting any requests.
Under Application Settings → Performance,
Idle Instances is set to Automatic – 20
Pending Latency is set to 150ms – 250ms
I wish I knew what controls if/when it kills idle instances, but I can't see any documentation of it.
To avoid excess instances starting, I think the main thing you can do here is increase the pending latency:
The Pending Latency slider controls how long requests spend in the pending queue before being served by an Instance of the default version of your application. If the minimum pending latency is high App Engine will allow requests to wait rather than start new Instances to process them. This can reduce the number of instance hours your application uses, but can result in more user-visible latency.
Even if you only average 4 requests/hour, if you happen to get two closely spaced I suppose it's possible it would start a new instance.
You can also see some small amount of information in the logs about why it started a new instance.
The "How Applications Scale" section of the Google App Engine documentation states:
Scaling in Instances
Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.
App Engine also scales instances in reverse when request volumes decrease. This scaling helps ensure that all of your application's current instances are being used to optimal efficiency and cost effectiveness.
It also states you can "specify a minimum number of idle instances", and to "optimize for high performance or low cost" in the administration console.
Try setting the "Idle instances" field to something like 3 - 5, and "optimize for low cost" and see if that affects the instance kill time.

Preparing for a flash crowd on Google App Engine

I recently experienced a sharp, short-lived increase in the load of my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for about a couple of hours. My number of dynamic instances scaled up pretty quickly but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is, how do I determine how many is adequate. I expect a much larger burst in load this time - from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks, and if not, then is there a cap on that? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.
You must understand that Idle Instances are only used when new frontend instances are being spun-up. This means that they are only used during traffic increases. When traffic is steady they are not used.
Now if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic and you traffic INCREASE is 5 req/sec, then you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used during spinning up of new instances and the time it takes to spin up a new instance directly relates to how many idle instances you need. If you use Java this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc..).
Fourth, number of frontend instances needed is VERY application specific. But since you already had traffic increase you should know how many requests your frontend instance can handle per second.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.

Google AppEngine sending all requests to same instance

Lately, I have seen GAE taking much, much longer to process requests than it did just a week ago. Nothing changed in my code, but GAE now is taking 4000-12000ms to respond to requests. What makes is worse is that I have plenty of instances available with 0 requests on them.
Has anyone else seen this happen?
What can I do to fix it?I have gone as far as to spin up 15 extra instances (and paid through the nose for them) but nothing seems to send requests to the other idle instances reliably.
My bill has gone from 70-90c/day to $5-8/day without any code change or increase in traffic. In fact, I am losing traffic because of the huge latency.
QPS* Latency* Requests Errors Age Memory Availability
0.000 0.0 ms 1378 0 10:10:09 57.9 MBytes Dynamic
0.000 0.0 ms 1681 0 15:39:57 57.2 MBytes Dynamic
0.017 9687.0 ms 886 0 10:19:10 56.7 MBytes Dynamic
I recommend installing AppStats to get a picture of what's taking so long in each request. I'd guess that you're having some contention issues or large numbers of reads/writes caused by some new data configuration.
The idle instances won't help decrease latency - it looks like every request takes a long time, and with less than one request per minute (in this sample anyway), 10s requests could run serially on the same instance.
We have a similar problem in our app. In our case, we are under the impression that GAE's scheduler did a poor job in balancing requests to existing instances.
In some cases, the scheduler decided to spin up new instances instead of re-using already existing ones. Since spinning a new instance took 5 to >45 seconds, I suspect this might be what happened to you.
Try to investigate the following and see if it helps you:
Make sure your app has thread-safe enabled so that you could process concurrent requests. You could configure this in your app.yaml if you are using Python, or in your appengine-web.xml if you use Java. Of course, you also need to make sure that the code in your app is threadsafe.
In your application settings, if it is still set on automatic, change the minimum pending latency to a non-automatic setting. I'd suggest around 10 seconds for now, but you could experiment later on which setting would suit you the most. This force the scheduler to wait for a certain time to see if any instance is available within the time before spinning up a new instance.
Now, to answer your original question regarding sending all requests to same instance, as far as I know there is no way to address a specific front-end instance in order to direct the requests to that particular instance.
What you could do is migrate your app to use backend instances instead of the regular frontend instance. Backends provides a way to directly target any particular instance within it. You could deploy your app in a single backend to have more control on the number of instance that you spawn. And since using the backend bypass the scheduler, you would not encounter latencies caused by new instances spinning up.
The major drawback of using this approach is that you lose the auto-scalability benefit of using front-end instances. But seeing from your low daily billing, I think scalability is not yet a major concern for the scale of your app.

How many parallel requests can one Google App Engine Python instance handle?

How many threads/requests can one Google App Engine Python instance handle in parallel? I'm using python27 runtime and threadsafe option is enabled (true).
Are there any restictions or conditions which could limit parallelism?
For clarification: this isn't about Java or Python GAE SDK.
The amount of parallelism you get is highly dependent on the workload of your application. If your requests are CPU bound, you'll only serve one request at a time. On the other hand, if your requests are RPC bound, you could potentially serve 10's of concurrent requests. However, there are two relavent limits:
1. Instance size. The default 600MHz F1 instance can only serve so many concurrent requests before hitting the CPU limit, overloading your instance and causing a significant increase in latency.
2. There is a hard limit on concurrent requests. It's implementation dependent and subject to change, but at this moment on python27, it's 8.
Although I get millions of hits/day my QPS is around 2 and my requests are under a second
So don't expect too much parallelism its 2-3 at most
( It's impossible to determine a QPS value for your use-case, this is my use-case )

Resources