The documentation on how GAE Flexible handles requests says that "An instance can handle multiple requests concurrently", but I don't know exactly what this means.
Let's say my application can process a single request every 60 seconds.
After starting to process the initial request, will another request (or 3) that occur say 30 seconds after (so halfway done with the first request), be handled by the same instance, or will it trigger autoscaling and spin up more instances to handle those new requests? This situation assumes that CPU utilization for the first request is still below the scaling CPU-utilization threshold.
Because my instance takes 60 seconds to process a single request and I will be receiving multiple requests at a time, I'm worried that I'll be triggering autoscaling inefficiently even when there is enough processing power to handle additional requests on the same instance. Is this how it works? Ideally I would like to multi-thread my processing and accept additional requests on the same instance while still under the CPU utilization threshold.
The documentation for concurrent requests is scarce for the Flexible environment unlike the Standard environment so I want to be sure.
Perhaps 'number of workers' is the config setting you're looking for:
https://cloud.google.com/appengine/docs/flexible/python/runtime#recommended_gunicorn_configuration
Gunicorn uses workers to handle requests. By default, Gunicorn uses sync workers. This worker class is compatible with all web applications, but each worker can only handle one request at a time. By default, gunicorn only uses one of these workers. This can often cause your instances to be underutilized and increase latency in applications under high load.
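As a rough illustration of the worker setting quoted above: Gunicorn can read its settings from a Python config file. The sketch below assumes a file named gunicorn.conf.py next to the app and a hypothetical helper, suggested_workers. The (2 x cores) + 1 figure is the starting point suggested in Gunicorn's own documentation, not a number from the App Engine docs, and the threads and timeout values are assumptions to tune. Threaded workers mainly help when a request spends most of its 60 seconds waiting rather than using the CPU.

```python
# gunicorn.conf.py (assumed file name; load it with `gunicorn -c gunicorn.conf.py main:app`)
import multiprocessing


def suggested_workers(cores: int) -> int:
    """Hypothetical helper: Gunicorn's rule-of-thumb starting point."""
    return 2 * cores + 1


# Settings Gunicorn reads from this module.
bind = ":8080"
workers = suggested_workers(multiprocessing.cpu_count())
worker_class = "gthread"   # threaded workers, so one worker can serve several requests
threads = 4                # assumed value; tune against memory and CPU
timeout = 120              # assumed value; must be longer than the 60-second request
```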
And it sounds like you've already seen that you can specify the cpu utilization threshold:
https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#automatic_scaling
You can also use something other than gunicorn if you prefer. Here's one of their examples where they use Honcho instead:
https://github.com/GoogleCloudPlatform/getting-started-python/blob/master/6-pubsub/app.yaml
I'm using App engine to concurrently handle a number of long running tasks (therefore I need to use basic scaling).
I noticed with one instance, only 8 tasks can be handled simultaneously (consistent with the number of workers for a B4 instance). For the ninth task I receive:
POST 503: Request was aborted after waiting too long to attempt to service your request.
How can I handle more tasks than this simultaneously without adding more instances?
As a best practice, the number of workers you specify should match the instance class of your App Engine app. You can change it by modifying the number of workers in the entrypoint, as in the example below, and see whether it works for you.
entrypoint: gunicorn -b :8080 -w 2 main:app
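To see why the ninth task is rejected, it can help to estimate how long a request waits for a free worker. The sketch below is a simplified model, not App Engine's actual scheduler: the function name, the 10-minute task length, and the assumption that all tasks arrive together and take the same time are illustrative only.

```python
def wait_for_free_worker(position: int, workers: int, task_seconds: float) -> float:
    """Seconds the task at 1-based `position` waits before a worker frees up.

    Simplified model: all tasks arrive together and each takes `task_seconds`.
    """
    if position < 1 or workers < 1:
        raise ValueError("position and workers must be at least 1")
    full_rounds_ahead = (position - 1) // workers
    return full_rounds_ahead * task_seconds


# With 8 workers and (assumed) 10-minute tasks, the 9th task waits a full task length.
for n in (8, 9):
    print(n, wait_for_free_worker(n, workers=8, task_seconds=600))
```

If that wait is longer than App Engine keeps a request pending, the request is aborted with the 503 above. Raising the worker count shortens the wait only if the instance has enough memory and CPU to run the extra workers.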
Keep in mind that a service with basic scaling is configured by setting the maximum number of instances in the max_instances parameter of the basic_scaling setting. If you want to control the number of running instances directly rather than have it follow processing volume, you can switch to manual scaling.
If you use basic scaling, App Engine attempts to keep your cost low, even though that may result in higher latency as the volume of incoming requests increases.
If you tune the scaling settings to reduce costs by minimizing idle instances, then you run the risk of seeing latency spikes if the load increases unexpectedly.
Basic scaling type is designed to minimize costs at the expense of latency.
Your code needs to scale the number of workers based on processing volume. If your code does not handle scaling, you risk wasting computing resources if there are no tasks to process; you also risk latency if you have too many tasks to process.
A good way to speed up requests is to make use of multiple caching layers.
This article is helpful for understanding the instance settings and tuning them to get the desired performance.
Have you tried increasing max_concurrent_requests in your app.yaml? It defaults to 10 concurrent requests per instance.
https://cloud.google.com/appengine/docs/standard/python3/config/appref#max_concurrent_requests
I want to understand the difference between min-instances and min-idle-instances.
I saw documentation on https://cloud.google.com/appengine/docs/standard/java/config/appref#scaling_elements but I am not able to differentiate between the two.
My use case:
I want at least 1 instance always up, as otherwise in most of the cases GAE would take time in creating instance causing my requests to time out (in case of basic scaling).
It should stay up, no matter if there is traffic or not, and if a request comes it should immediately serve it. If request volume grows then it should scale.
Which one should I use?
min-idle-instances refers to instances that are kept ready to support your application in case you receive high traffic or CPU-intensive tasks, unlike min_instances, which are the instances used to process incoming requests immediately. I suggest taking a look at this link for a deeper explanation of idle instances.
Based on this, since your use case is focused on serving incoming requests immediately, I think you should go with min_instances and use min-idle-instances only if you want to be ready for sudden load spikes.
The min-instances configuration applies to dynamic instances while min-idle-instances applies to idle/resident instances.
See also:
Introduction to instances for a description of the 2 instance types
Why do more requests go to new (dynamic) instances than to resident instance? for a bit more details
min_instances: the minimum number of instances running at any time, traffic or no traffic, rain or shine.
min_idle_instances: the minimum number of idle (or "unused") instances running on top of the currently used instances. Example: you have automatically scaled to 5 App Engine instances that are receiving requests; by setting min_idle_instances to 2, you will be running 7 instances in total. The 2 "extra" instances are idle and waiting in case you receive more load. The goal is that when load rises, your users don't have to wait for an instance to start up.
IMPORTANT: you need to configure warmup requests for that to work
IMPORTANT2: you'll be billed for any instance running, idle or not. App engine is not cheap so be careful.
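On the warmup point above: in the standard environment, warmup requests arrive as GET requests to /_ah/warmup once warmup is listed under inbound_services in app.yaml. The handler below is a minimal sketch using only a plain WSGI callable so it runs without any framework; the name app and the response bodies are placeholders, and a real app would do its cache loading or connection setup where the comment indicates.

```python
def app(environ, start_response):
    """Minimal WSGI app with a warmup route (stdlib only, for illustration)."""
    if environ.get("PATH_INFO") == "/_ah/warmup":
        # Load caches, open connections, import heavy modules here.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"warmed up"]
    # Every other path gets the normal (placeholder) response.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]
```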
min_instances applies to the number of instances that you want to have running, from 0 (useful if you want to scale down when you don't receive traffic) to 1000. You are charged for the number of instances you have running, so, this is important to save costs.
For your case set this value to 1, as it's the most straightforward option.
App Engine has been great for requests that process quickly with no external API calls to databases, caches, or third-party resources. But we've found that once we introduce any sort of "longer running" component or external latency, the "instance hours" compound and drive costs up considerably. One example is an HTTP POST operation that runs asynchronously in the background and might take a second or two to process a few more intense database queries: totally invisible and OK from a UX perspective on the client side because it's asynchronous, but expensive in App Engine billing because it's long running.
These sorts of expense inducing situations where a request is literally just waiting for a response from an external resource and requiring almost zero CPU during their idling seem avoidable, but I'm not sure if it's avoidable with App Engine.
It's almost like a "long poll" where the response might be left open but doing nothing.
Is there a way to do this on App Engine without just paying an insane amount for instance hours, or would we be better off moving to Compute Engine or EC2? Does it scale automatically based on CPU load, or is it based solely on open and perhaps inactive requests in total count? — threadsafe is indeed enabled.
There are really two ways to go about this one (top of mind).
Use Task Queues!
If the work doesn't need to happen at exactly the same time as the request, this is exactly what task queues in App Engine are for. They allow you to put a job on a queue and have another module pick up the work. They're kind of great because you can separately scale your front end and back end processes.
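The pattern itself can be sketched locally. The example below is not the App Engine Task Queue API; it uses Python's standard queue and threading modules as a stand-in, with hypothetical names (jobs, worker, handle_request), just to show the front end returning immediately while a separate worker does the slow part.

```python
import queue
import threading

jobs = queue.Queue()   # local stand-in for an App Engine task queue
results = []


def worker():
    """Back-end side: pull jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.append(job * 2)   # placeholder for the slow work


def handle_request(payload):
    """Front-end side: enqueue the work and return immediately."""
    jobs.put(payload)
    return "accepted"


t = threading.Thread(target=worker)
t.start()
print(handle_request(21))
jobs.put(None)   # tell the worker to stop
t.join()
print(results)
```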
If that doesn't work....
Use App Engine Flexible
Under the hood App Engine Flexible is just running GCE instances. The cost structure is entirely different, since you persistently have a VM running in the background serving your requests.
Hope this helps!
What you're really worried about here is how App Engine scales your instances. Because many of your requests require few resources, your app might be able to handle many more concurrent requests on a single instance than normal. You can look into parameters that shape scaling here. Of particular interest:
max_concurrent_requests The number of concurrent requests an automatic scaling instance can accept before the scheduler spawns a new instance (Default: 8, Maximum: 80).
There is a danger here, where an instance may fill up with non-long-polling requests and become overburdened. To prevent that, you could isolate your long-polling requests into their own service and set its scaling parameters separately from the rest of your app.
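A back-of-the-envelope way to see the effect of max_concurrent_requests is to divide the number of requests in flight by the per-instance limit. This is a rough estimate under assumed numbers (20 req/s, each held open for 2 s), not a description of how the App Engine scheduler actually decides; the function name is hypothetical.

```python
import math


def instances_needed(requests_per_second: float,
                     seconds_per_request: float,
                     max_concurrent_requests: int) -> int:
    """Rough estimate: requests in flight divided by the per-instance limit."""
    in_flight = requests_per_second * seconds_per_request   # Little's law
    return max(1, math.ceil(in_flight / max_concurrent_requests))


# Assumed workload: 20 req/s, each held open for 2 s while waiting on I/O.
print(instances_needed(20, 2, 8))    # default limit quoted above
print(instances_needed(20, 2, 80))   # maximum limit quoted above
```

The estimate only holds while the instance has CPU and memory to spare, which is the danger described above when cheap and expensive requests share an instance.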
My queue task uses urlfetch to get some data from an external API and saves it to ndb Datastore entities.
This takes about 15 seconds total.
Somehow, when the task runs, all other handlers (simple json response handlers) become slower. (slower means +500ms)
What could be causing this?
Isn't the idea of background tasks that they don't affect the user-facing requests?
I stumbled upon this blogpost, but my task takes longer than 1 second to complete. I don't see how that's going to help me.
By default, your tasks are executed by the same instances that serve user requests. Background or not, they share the same CPU, memory and bandwidth. It's a good idea to run these tasks on a different module, which means a different instance. You can do it by specifying a target for your task queue.
Note that typically the automatic App Engine scheduler will spin up a new instance when responses from your current instances slow down. However, the slowdown in your case is caused not by a growing volume of standard requests but by an unusual request that takes much longer. This prevents the automatic scheduler from reacting to the increased latencies. You can switch to manual or basic scaling, which give you more control over capacity (total number of instances) and the rules for spinning up new instances, but creating a different module for background tasks is a better solution.
I recently experienced a sharp, short-lived increase in the load of my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for about a couple of hours. My number of dynamic instances scaled up pretty quickly but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is: how do I determine how many are adequate? I expect a much larger burst in load this time, from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks, and if not, then is there a cap on that? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.
You must understand that Idle Instances are only used when new frontend instances are being spun-up. This means that they are only used during traffic increases. When traffic is steady they are not used.
Now if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic, and your traffic grows by 5 req/sec every second, then 20 * 5 = 100 req/sec of extra load arrives while new instances are starting, so you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
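The same arithmetic as a small function, so other numbers can be plugged in. This is only a sketch of the estimate above: the function and parameter names are made up for illustration, and it assumes traffic grows at a constant rate for the whole startup period.

```python
import math


def idle_instances_needed(startup_seconds: float,
                          growth_per_second: float,
                          capacity_per_instance: float) -> int:
    """Idle instances needed to absorb traffic that arrives while new instances start.

    growth_per_second: how many req/sec of extra traffic is added each second.
    capacity_per_instance: steady req/sec one instance can serve.
    """
    extra_load = startup_seconds * growth_per_second
    return math.ceil(extra_load / capacity_per_instance)


# Numbers from the example above: 20 s startup, +5 req/sec each second, 10 req/sec per instance.
print(idle_instances_needed(20, 5, 10))
```

Halving the startup time or doubling per-instance throughput halves the result, which is why the two recommendations that follow matter.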
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used during spinning up of new instances and the time it takes to spin up a new instance directly relates to how many idle instances you need. If you use Java this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc..).
Finally, the number of frontend instances needed is VERY application specific. But since you already had a traffic increase, you should know how many requests per second your frontend instance can handle.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
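As a minimal sketch of what controlling the cache via headers looks like, here is a plain WSGI callable that marks its response as publicly cacheable. The handler name, the JSON body, and the 5-minute max-age are assumptions for illustration; whether a given response is actually served from cache depends on the platform's caching rules.

```python
def cached_app(environ, start_response):
    """Minimal WSGI app that marks its response as publicly cacheable."""
    headers = [
        ("Content-Type", "application/json"),
        ("Cache-Control", "public, max-age=300"),  # assumed 5-minute lifetime
    ]
    start_response("200 OK", headers)
    return [b'{"status": "ok"}']
```

Only GET responses that are identical for every user are candidates for this; the POST writes described in the question would still reach the instances.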
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.