We have an App Engine app that handles an average of 0.5 requests per second, and seemingly all of those requests can be handled by the same instance running a Go app as the main version.
However, sometimes App Engine kicks off a second instance (and sometimes even a third) that doesn't seem to do anything beyond handling one or two requests. Here's an example.
Shutting down that instance manually doesn't seem to cause any harm, so my question is: why does App Engine not kill the instance after it has not received any requests for a while? (The above example had four requests in the past hour, and the requests/age ratio often gets even lower.)
Update:
A similar situation is when an instance is started on a different version. App Engine only seems to kill the instance after hours of not getting any requests.
Under Application Settings → Performance,
Idle Instances is set to Automatic – 20
Pending Latency is set to 150ms – 250ms
I wish I knew what controls if and when it kills idle instances, but I can't find any documentation on it.
To avoid excess instances starting, I think the main thing you can do here is increase the pending latency:
The Pending Latency slider controls how long requests spend in the pending queue before being served by an Instance of the default version of your application. If the minimum pending latency is high, App Engine will allow requests to wait rather than start new Instances to process them. This can reduce the number of instance hours your application uses, but can result in more user-visible latency.
Even if you only average 4 requests/hour, if two of them happen to arrive closely spaced, I suppose it's possible it would start a new instance.
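If you manage these settings from app.yaml rather than the Application Settings page, the equivalent knobs live under automatic_scaling in the standard environment. A minimal sketch, with purely illustrative values:

```yaml
# app.yaml sketch: prefer queueing requests over starting new instances.
automatic_scaling:
  min_pending_latency: 5s        # wait at least this long before spinning up a new instance
  max_pending_latency: automatic # let the scheduler decide the upper bound
```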
You can also see some small amount of information in the logs about why it started a new instance.
The "How Applications Scale" section of the Google App Engine documentation states:
Scaling in Instances
Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.
App Engine also scales instances in reverse when request volumes decrease. This scaling helps ensure that all of your application's current instances are being used to optimal efficiency and cost effectiveness.
It also states you can "specify a minimum number of idle instances", and to "optimize for high performance or low cost" in the administration console.
Try setting the "Idle instances" field to something like 3 - 5, and "optimize for low cost" and see if that affects the instance kill time.
Related
I want to understand the difference between min-instances and min-idle-instances.
I saw the documentation at https://cloud.google.com/appengine/docs/standard/java/config/appref#scaling_elements, but I am not able to differentiate between the two.
My use case:
I want at least 1 instance always up; otherwise, in most cases GAE takes time to create an instance, causing my requests to time out (as happens with basic scaling).
It should stay up whether or not there is traffic, and when a request comes in it should serve it immediately. If request volume grows, it should scale.
Which one should I use?
min-idle-instances refers to instances that are kept ready to support your application in case you receive high traffic or CPU-intensive tasks, whereas min_instances refers to the instances used to process incoming requests immediately. I suggest you take a look at this link for a deeper explanation of idle instances.
Based on this, since your use case is focused on serving incoming requests immediately, I think you should go with min_instances and use min-idle-instances only if you want to be ready for sudden load spikes.
The min-instances configuration applies to dynamic instances while min-idle-instances applies to idle/resident instances.
See also:
Introduction to instances for a description of the 2 instance types
Why do more requests go to new (dynamic) instances than to resident instance? for a bit more details
min_instances: the minimum number of instances running at any time, traffic or no traffic, rain or shine.
min_idle_instances: the minimum number of idle (or "unused") instances running on top of the instances currently in use. Example: if you have automatically scaled to 5 App Engine instances that are receiving requests, then setting min_idle_instances to 2 means you will be running 7 instances in total; the 2 "extra" instances are idle, waiting in case you receive more load. The goal is that when load rises, your users don't have to wait through the time it takes to start up an instance.
IMPORTANT: you need to configure warmup requests for that to work
ALSO IMPORTANT: you'll be billed for any running instance, idle or not. App Engine is not cheap, so be careful.
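To make the difference concrete, here is a minimal app.yaml sketch; the values are illustrative only, and warmup requests are enabled because idle/resident instances need them:

```yaml
# app.yaml sketch (illustrative values)
inbound_services:
- warmup                  # required so instances can be warmed up before serving user traffic

automatic_scaling:
  min_instances: 1        # always keep one instance serving, traffic or no traffic
  min_idle_instances: 2   # keep two spare instances on top of whatever is currently busy
```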
min_instances applies to the number of instances that you want to have running at all times, from 0 (useful if you want to scale down to nothing when you don't receive traffic) to 1000. You are charged for the number of instances you keep running, so this matters for controlling costs.
For your case, set this value to 1; it's the most straightforward option.
App Engine has been great for requests that process quickly with no external calls to databases, caches, or third-party resources. However, we've found that introducing any sort of longer-running component or external latency drives costs up considerably, because the instance hours compound. For example, an HTTP POST operation that runs asynchronously in the background and takes a second or two to process a few more intensive database queries is totally invisible and fine from a UX perspective on the client side, because it's asynchronous, but it's expensive in App Engine billing because it's long-running.
These sorts of expense-inducing situations, where a request is literally just waiting for a response from an external resource and needs almost zero CPU while idling, seem avoidable, but I'm not sure whether they are avoidable on App Engine.
It's almost like a long poll, where the response might be left open while doing nothing.
Is there a way to do this on App Engine without paying an enormous amount for instance hours, or would we be better off moving to Compute Engine or EC2? Does App Engine scale automatically based on CPU load, or solely on the total count of open (and perhaps inactive) requests? (threadsafe is indeed enabled.)
There are really two ways to go about this one (top of mind).
Use Task Queues!
If the work doesn't need to happen at exactly the same time as the request, this is exactly what task queues in App Engine are for. They allow you to put a job on a queue and have another module pick up the work. They're kind of great because you can scale your front-end and back-end processes separately.
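A minimal sketch of enqueueing work from the Python runtime; the URL, the params, and the worker endpoint are hypothetical:

```python
from google.appengine.api import taskqueue

def handle_upload(payload):
    # Return to the user right away; the heavy lifting happens later in a
    # worker handler (assumed here to be mounted at /worker/process), which
    # can live in its own module/service with its own scaling settings.
    taskqueue.add(
        url='/worker/process',        # hypothetical worker endpoint
        params={'payload': payload},
        queue_name='default',
    )
```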
If that doesn't work....
Use App Engine Flexible
Under the hood App Engine Flexible is just running GCE instances. The cost structure is entirely different, since you persistently have a VM running in the background serving your requests.
Hope this helps!
What you're really worried about here is how App Engine scales your instances. Because many of your requests require few resources, your app might be able to handle many more concurrent requests on a single instance than normal. You can look into parameters that shape scaling here. Of particular interest:
max_concurrent_requests The number of concurrent requests an automatic scaling instance can accept before the scheduler spawns a new instance (Default: 8, Maximum: 80).
There is a danger here, where an instance may fill up with non-long-polling requests and become overburdened. To prevent that, you could isolate your long-polling requests into their own service and set its scaling parameters separately from the rest of your app.
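A sketch of what that isolation could look like; the service name, script path, and values are hypothetical, and this assumes the Python 2.7 standard environment:

```yaml
# longpoll/app.yaml sketch: a dedicated service for slow, mostly idle requests
service: longpoll               # hypothetical service name
runtime: python27
api_version: 1
threadsafe: true                # required for concurrent requests on Python 2.7

automatic_scaling:
  max_concurrent_requests: 80   # these requests mostly wait on I/O, so pack many per instance

handlers:
- url: /.*
  script: longpoll.app          # hypothetical WSGI app
```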
I recently experienced a sharp, short-lived increase in the load on my Google App Engine service. The load went from ~1-2 req/second to about 10 req/second for a couple of hours. My number of dynamic instances scaled up pretty quickly, but in the process I did get a number of "Request waited too long" timeout messages.
So next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is: how do I determine how many are adequate? I expect a much larger burst in load this time, from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is deferred to a task queue via the deferred library (roughly as sketched below)
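For reference, a minimal sketch of that deferred write on the Python runtime; the model and field names are placeholders:

```python
from google.appengine.ext import deferred, ndb


class Submission(ndb.Model):        # placeholder model for the posted data
    body = ndb.TextProperty()


def store_submission(body):
    Submission(body=body).put()     # the single datastore write, run on the task queue


def handle_post(body):
    # The request handler only enqueues the work and returns quickly;
    # the write itself happens later via the deferred library
    # (requires `builtins: - deferred: on` in app.yaml).
    deferred.defer(store_submission, body, _queue='default')
```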
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks, and if not, then is there a cap on that? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.
You must understand that idle instances are only used while new frontend instances are being spun up. This means they are only used during traffic increases; when traffic is steady they are not used.
Now, suppose your instances need 20 sec to spin up, each can handle 10 req/sec of steady traffic, and your traffic increases by 5 req/sec every second. Over the 20 seconds it takes new instances to spin up, you gain 20 * 5 = 100 req/sec of extra traffic, so you'll need 100 / 10 = 10 idle instances if you don't want any requests dropped.
What you should do is:
Maximize instance throughput (the number of requests each instance can handle): optimize your code, use async db operations (see the sketch after this list) and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used while new instances are spinning up, and the time it takes to spin up a new instance directly determines how many idle instances you need. If you use Java, this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc.).
Finally, the number of frontend instances needed is very application-specific. But since you have already been through a traffic increase, you should know how many requests per second one of your frontend instances can handle.
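As an illustration of the "async db operations" point above, on the Python runtime ndb can start several datastore writes in parallel and wait on them together instead of serializing the RPCs. A sketch, with a hypothetical entity class:

```python
from google.appengine.ext import ndb


class Ping(ndb.Model):              # hypothetical entity, for illustration only
    payload = ndb.StringProperty()


def save_batch(payloads):
    # Kick off all writes without blocking, then wait for them together.
    futures = [Ping(payload=p).put_async() for p in payloads]
    ndb.Future.wait_all(futures)
    return [f.get_result() for f in futures]   # keys of the stored entities
```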
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be controlled simply via Cache-Control headers.
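A minimal webapp2 sketch of that on the Python runtime; the handler and the max-age value are just examples:

```python
import webapp2


class ReportHandler(webapp2.RequestHandler):    # hypothetical handler
    def get(self):
        # Let App Engine's cache (and clients) reuse this response for five
        # minutes instead of hitting an instance on every request.
        self.response.headers['Cache-Control'] = 'public, max-age=300'
        self.response.write('cached payload')


app = webapp2.WSGIApplication([('/report', ReportHandler)])
```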
Also, if analytics has a big performance impact on your server, consider using client-side analytics services (like Google Analytics); they also work for mobile devices.
I have a low-load application which experienced latency spikes (requests taking up to 10s to return) due to loading requests, as seen in the logs:
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
Here I assume that "new process" means "new instance".
In order to avoid this, I fixed the number of idle instances to exactly one (max=1 and min=1), so there is always one instance running ("resident instance") and GAE shouldn't start new ones. Billing is enabled.
However, I still experience loading requests. Why? Can anything be done about this?
Idle instances are "reserve" instances - they are meant to handle spikes when traffic increases, not the "normal" traffic. Idle instances are used only during the spin-up of the dynamic instances.
So, when you have one idle instance and no dynamic instances running and you get a request, the idle instance should handle the request, but a new dynamic instance will still be spun up.
I too experienced this problem with my low-traffic app, and here is the practical solution that almost always prevents my users from facing a cold start:
- 1 resident F4 instance
- pending latency set to 15 sec
- I made my warmup requests as fast as possible (under 10 sec), which is still quite long because I use the Play framework (Java)
- and when I really don't want any problems, I create fake traffic by pinging my app.
With this config, the resident instance usually serves around 50 requests; during that time, a dynamic instance receives a warmup request and then starts serving.
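For anyone wondering what a warmup request handler looks like on the Python runtime, a minimal sketch (the handler body is illustrative; warmup also has to be enabled via inbound_services in app.yaml):

```python
import webapp2


class WarmupHandler(webapp2.RequestHandler):
    def get(self):
        # App Engine calls /_ah/warmup before routing user traffic to a new
        # instance; import heavy modules and prime caches here so the first
        # real request doesn't pay the loading cost.
        self.response.write('warmed up')


app = webapp2.WSGIApplication([('/_ah/warmup', WarmupHandler)])
```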
I have a simple app running on App Engine, but I'm having odd problems with latency. It's a Python 2.7 app, and a loading request takes between 1.5 and 10 secs (I guess depending on how GAE is feeling). This is a low-traffic site right now, so previously GAE was sitting with no idle instances and most requests were loading requests, resulting in a long wait time on the first page view.
I've tried configuring the minimum number of idle instances to "1" so that these infrequent page views can immediately hit a warm instance.
However, I've seen several cases now where even with one instance sitting unused, GAE will route an incoming request to a loading instance, leaving the warm instance untouched:
[Screenshot: GAE dashboard showing odd scheduling]
How can I prevent this from happening? I feel I must be understanding something wrong, because I certainly don't expect this behavior.
Update: Also, what makes this even less comprehensible is that the app has threadsafe enabled, so I really don't understand why GAE would get flustered and spin up an instance for a single, lone request.
Actually, I believe this is normal behavior. Idle instances are supposed to guarantee a minimum number of instances always available (for spiky load).
So, when some requests start coming in, they are initially served by idle instances, but at the same time the App Engine scheduler will start launching new instances so that the same number of idle instances remains available even during suddenly increased load. That is, to "cover" for the idle instances that became busy serving requests.
It is described in detail on the Adjusting Application Performance page.
Arrrgh! I suffer from this myself. This topic area has come up in several threads (GAE groups & SO). If someone can dial in the settings for a low-traffic site (billing on/off), that would be a real benefit. IIRC, someone with what I think is deep GAE experience noted in one thread that the scheduler does not do well with very low-volume apps. I have also seen wildly different startup times within a relatively short period; it's painful to see a spin-up take 700ms and then 7000ms just a few minutes later. Overall the issue is not so much the cost to me, but more the waste of infrastructure resources. In testing, I've had two instances running despite having pinged the app with an RPC only once every few minutes. If 50k other developers are testing similarly, that could add up to a significant waste.