Scaling out GAE when memory utilization exceeds limit - google-app-engine

I am using GAE for heavy tasks with high memory needs. And I got the following error:
Exceeded soft memory limit of 512 MB with 561 MB
after servicing 3 requests total.
Consider setting a larger instance class in app.yaml.
As the tasks are expensive, I assume that two applications can work within one instance. But it does not work for three applications:
While handling this request,
the process that handled this request was found to be using
too much memory and was terminated.
This is likely to cause a new process to be used
for the next request to your application.
If you see this message frequently,
you may have a memory leak in your application or may be
using an instance with insufficient memory.
Consider setting a larger instance class in app.yaml.
My current settings:
runtime: nodejs8
instance_class: B4
basic_scaling:
max_instances: 10
idle_timeout: 1m
I also tried with these settings:
runtime: nodejs8
instance_class: F4
automatic_scaling:
target_cpu_utilization: 0.5
max_instances: 10
The failure of executing the task is "Exceeded soft memory limit".
So, to solve this error, I think the scale out should be done based on the "memory utilization" rather than "cpu utilization".
How can scaling out be done when the memory utilization exceeds the limit?

The decisions of the dynamic instance scheduler are not based on instance memory usage. From Scaling dynamic instances:
The App Engine scheduler decides whether to serve each new request
with an existing instance (either one that is idle or accepts
concurrent requests), put the request in a pending request queue, or
start a new instance for that request. The decision takes into account
the number of available instances, how quickly your application has
been serving requests (its latency), and how long it takes to start a
new instance.
The error messages you received all show a very low number of requests being handled before the error happens, which indicates that the memory starvation is very strong.
What isn't clear is if the high memory usage comes from a single request or if it is related to multiple concurrent requests - i.e. requests hitting an instance at the same time (or too close to each-other for the memory freeing mechanism, if any, to keep up). But that can be experimentally determined.
If multiple concurrent incoming requests is what drives the instance memory usage over the threshold you can try to deal with it by controlling the target_throughput_utilization and/or max_concurrent_requests knobs:
Target Throughput Utilization
Sets the throughput threshold for the number of concurrent requests
after which more instances will be started to handle traffic.
Max Concurrent Requests
Sets the max concurrent requests an instance can accept before the
scheduler spawns a new instance.
If the memory limit is being hit even without the instance handling multiple requests simultaneously (or if you would like to be able to serve more instances concurrently - typically to reduce costs) the only way you can address it is by using instance classes with more memory. The F4 and B4 classes you tried both have 512M, try F4_1G/B4_1G, both of which have 1G.

Related

How can I increase number of concurrent task in Google App Engine?

I'm using App engine to concurrently handle a number of long running tasks (therefore I need to use basic scaling).
I noticed with one instance, only 8 tasks can be handled simultaneously (consistent with the number of workers for a B4 instance). For the ninth task I receive:
POST 503: Request was aborted after waiting too long to attempt to service your request.
How can I handle more task than this simultaneously without adding more instances?
As a best practice, the number of workers you specify should match the instance class of your App Engine app, but you can change it by modifying the number of workers in the entrypoint as in the example below and try and see if it works for you.
entrypoint: gunicorn -b :8080 -w 2 main:app
Consider that a service with basic scaling is configured by setting the maximum number of instances in the max_instances parameter of the basic_scaling setting. You can control the number of live instance scales with the processing volume by changing to manual scaling.
If you use basic scaling, App Engine attempts to keep your cost low, even though that may result in higher latency as the volume of incoming requests increases.
If you tune the scaling settings to reduce costs by minimizing idle instances, then you run the risk of seeing latency spikes if the load increases unexpectedly.
Basic scaling type is designed to minimize costs at the expense of latency.
Your code needs to scale the number of workers based on processing volume. If your code does not handle scaling, you risk wasting computing resources if there are no tasks to process; you also risk latency if you have too many tasks to process.
A good way to speed up requests is to make use of multiple caching layers.
This article is helpful to handle the instance settings and modify it to get the desired performance.
Have you tried increasing max_concurrent_requests in your app.yaml? It should be defaulting to being able to handle 10 requests at a time.
https://cloud.google.com/appengine/docs/standard/python3/config/appref#max_concurrent_requests

Why would GAE not distribute the load to the other available instances?

First thing first, here is my app.yaml:
runtime: nodejs10
env: standard
instance_class: F1
handlers:
- url: /.*
script: auto
automatic_scaling:
min_instances: 1
max_instances: 20
inbound_services:
- warmup
I'm using Apache Benchmark for this:
ab -c30 -n100000 "${URL}"
What I notice in the GAE console is that I have 8 instances available but only 3 take on 99% of the work. The rest is serving either no request or a very small portion.
Any idea what the problem could be here?
I would recommend to use the “max_concurrent_requests” element in your “app.yaml” file, as this element is the number of concurrent requests an automatic scaling instance can accept before scheduler spawns a new instance (Keep in mind that the maximum limit is 80).
Furthermore, you can also set “max_pending_latency” that specifies the maximum amount of time that App Engine should allow a request to wait in the pending queue before starting additional instances to handle requests, so that pending latency is reduced.
If you reach the limit, it will be a signal to scale up, so the number of instances will be increased.
The fact that the load is not evenly distributed across the running instances is normal and actually desired as long as the number of instances still processing requests is sufficient to handle the current load level with a satisfactory level of performance - this allows the other instances to be idle long enough to be automatically shutdown (due to inactivity).
This is part of the dynamic instance management logic used with automatic and basic scheduling.

Does App Engine Flexible for Python support concurrent requests?

From the documentation on how GAE Flexible handles requests, it says that "An instance can handle multiple requests concurrently" but I don't know what this exactly means.
Let's say my application can process a single request every 60 seconds.
After starting to process the initial request, will another request (or 3) that occur say 30 seconds after (so halfway done with the first request), be handled by the same instance, or will it trigger autoscaling and spin up more instances to handle those new requests? This situation assumes that CPU utilization for the first request is still below the scaling CPU-utilization threshold.
I'm worried that because it takes my instance 60 seconds to process a single request and I will be receiving multiple requests at a time, that I'll be inefficiently triggering autoscaling even if there is enough processing power to handle additional requests on the same instance. Is this how it works? I would ideally like to be able to multi-thread my processing and accept additional requests on the same instance while still under the CPU utilization threshold.
The documentation for concurrent requests is scarce for the Flexible environment unlike the Standard environment so I want to be sure.
Perhaps 'number of workers' is the config setting you're looking for:
https://cloud.google.com/appengine/docs/flexible/python/runtime#recommended_gunicorn_configuration
Gunicorn uses workers to handle requests. By default, Gunicorn uses sync workers. This worker class is compatible with all web applications, but each worker can only handle one request at a time. By default, gunicorn only uses one of these workers. This can often cause your instances to be underutilized and increase latency in applications under high load.
And it sounds like you've already seen that you can specify the cpu utilization threshold:
https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#automatic_scaling
You can also use something other than gunicorn if you prefer. Here's one of their example's where they use Honcho instead:
https://github.com/GoogleCloudPlatform/getting-started-python/blob/master/6-pubsub/app.yaml

Concurrent requests handling on Google App Engine

I was experimenting with concurrent request handling on few platforms.
The aim of the experiment was to have a broad measure of the capacity bounds of some selected technologies.
I set up a Linux VM on my machine with a basic Go http server (the vanilla http.HandleFunc of the http default package).
The server would then compute a modified version of the fasta algorithm that restricted threads and processes to 1, and return the result. N was set to 100000.
The algorithm runs in roughly 2 seconds.
I used the same algorithm and logic on a Google App Engine project.
The algorithm is written using the same code, just the handler set up is done on init() instead of main() as per GAE requirements.
On the other end an Android client is spawning 500 threads each one issuing in parallel a GET request to the fasta computing server, with a request timeout of 5000 ms.
I was expecting the GAE application to scale and answer back to each request and the local Go server to fail on some of the 500 requests but results were the opposite:
the local server correctly replied to each request within the timeout bounds while the GAE application was able to handle just 160 requests out of 500. The remaining requests timed out.
I checked on the Cloud Console and I verified that 18 GAE instances were spawned, but still the vast majority of requests failed.
I thought that most of them failed because of the start-up time of each GAE instance, so I repeated the experiment right after but I had the same results: most of the requests timed out.
I was expecting GAE to scale to accomodate ALL the requests, believing that if a single local VM could successfully reply to 500 concurrent requests GAE would have done the same, but this is not what happened.
The GAE console doesn't show any error and correctly reports the number of incoming requests.
What could be the cause of this?
Also, if a single instance could handle all the incoming requests on my machine by virtue of only goroutines, how come that GAE needed to scale so much at all?
To make optimal usage in terms of minimizing costs you need to configure few things in app.yaml:
Enable threadsafe: true - actually it's from Python config and not applicable to Go but I would set it just in case.
Adjust scaling section:
max_concurrent_requests - set to maximum 80
max_idle_instances - set to minimum 0
max_pending_latency - set it to automatic or greater then min_pending_latency
min_idle_instances - set it to 0
min_pending_latency - set to higher number. If you are OK to get 1 second latency and you handlers take on average 100ms to process set it to 900ms.
Then you should be able to proceed a lot of request on single instance.
If you OK to burn cash for the sake of responsiveness & scalabiluty - increase min_idle_instances & max_idle_instances.
Also do you use similar instance types for VM and GAE? The GAE F1 instance is not too fast and is more optimal for async tasks like working with IO (datastore,http,etc.). You can configure usage of more powerful instance to better scale for computation intensive tasks.
Also do you test on paid account? Free accounts have quotas and AppEngine would refuse percentage of requests if it believe the load would exceed the daily quota if continuous with the same pattern.
Extending on Alexander's answer.
The GAE scaling logic is based on incoming traffic trend analysis.
The key for being able to handle your case - sudden spikes in traffic (which can't be takes into account in the trend analysis due to its variation speed) - is to have sufficient resident (idle) instances configured for your application to handle such traffic until GAE spins up additional dynamic instances. It can handle as high peaks as you want (if your pockets are deep enough).
See Scaling dynamic instances for more details.
Thanks everyone for their help.
Many interesting points and insights have been made by the answers I had on this topic.
The fact the the Cloud Console were reporting no errors led me to believe that the bottleneck was happening after the real request processing.
I found the reason why the results were not as expected: bandwidth.
Each response had a payload of roughly 1MB and thus responding to 500 simultaneous connections from the same client would clog the lines, resulting in timeouts.
This was obviously not happening when requesting to the VM, where the bandwith is much larger.
Now GAE scaling is in line with what I expected: it successfully scales to accomodate each incoming request.

GAE: Why do I experience loading requests even though I have fixed the number of instances to exactly one?

I have a low-load application which experienced latency spikes (requests taking up to 10s to return) due to loading requests, as seen in the logs:
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
Here I assume that "new process" means "new instance".
In order to avoid this, I fixed the number of idle instances to exactly one (max=1 and min=1), so there is always one instance running ("resident instance") and GAE shouldn't start new ones. Billing is enabled.
However, I still experience loading requests. Why? Can anything be done about this?
Idle instances are "reserve" instances - they are meant to handle spikes when traffic increases, not the "normal" traffic. Idle instances are used only during the spin-up of the dynamic instances.
So, when you have one idle instance and no dynamic instances running and you get a request, than the idle instance should handle the request, but a new dynamic instance will still be spun up.
I too experienced the same problem with my low-traffic app and here is the practical solution that almost always prevents my users to face a cold start :
- 1 resident F4 instance
- pending latency to 15 sec
- i worked so that my warmup request are as fast as possible (under 10 sec), still quite long cause i use the frameWork Play (Java)
- and when i really don t want to have any problems i create fake traffic by pinging my app.
With this config, the resident usually serves around 50 requests, during that time, a dynamic instance receives a warmup and then start serving.

Resources