Keeping GAE at a single instance with flexibility to scale - google-app-engine

I have a relatively low traffic app that can easily be handled by a single instance more than 95% of the time. Occasionally, having more than one instance running would be helpful to provide a better user experience.
It seems that GAE should be able to automatically scale in this way, but I can't get GAE to keep only a single instance when traffic is low. This is what I have tried:
Set min instances to 1
Set max instances to 3
Set min pending latency to 1 second
Set max pending latency to automatic (and also 1 second)
With this configuration, GAE will just about always run two instances even though one is sufficient.
I know I can set max instances to one, but I want to be able to automatically scale when I need it.
Is it possible to do what I want?

Note that the min/max properties you are setting apply to IDLE instances.
Setting min instances to 1 means that you will ALWAYS have at least one instance running, even when there have been no requests for over 15 minutes. This could be set to 0 if you have low traffic AND your app launches quickly, i.e. in under 1-2 seconds; otherwise users will have a bad experience, with a very slow response to their first request.
Setting max instances to 3 means that it's OK for GAE to keep up to three instances running at any time, even when there are only a few requests. This could be set to 1 to save some costs, but it would make some requests slow (the time it takes to start a new instance plus the time to launch your app) when traffic increases.
The max-idle-instances setting does not limit the number of instances in the event of a traffic spike: your app will always scale, and new instances will keep launching if needed. The min/max settings are only there to help handle a sudden increase in traffic, and there is no way to limit the number of instances that can be launched.
Take a look at this article for some more details: Setting the Number of Idle Instances
Regarding your question, you could try decreasing max-idle-instances to 1 and see if that helps. You don't have to worry about scaling: new instances will still launch if needed. Just keep in mind that the experience might not be as smooth for your users. If you decrease max-idle-instances and still see more than one instance running on very low traffic, then your app might need to be optimized, and multi-threading might need to be enabled if it isn't already.

Related

What is the difference between min-instances and min-idle-instances in google app engine?

I want to understand the difference between min-instances and min-idle-instances.
I saw documentation on https://cloud.google.com/appengine/docs/standard/java/config/appref#scaling_elements but I am not able to differentiate between the two.
My use case:
I want at least 1 instance always up, because otherwise GAE usually takes so long to create an instance that my requests time out (in the case of basic scaling).
It should stay up whether or not there is traffic, and if a request comes in, it should serve it immediately. If request volume grows, it should scale.
Which one should I use?
min-idle-instances refers to instances that are kept ready to support your application in case you receive high traffic or CPU-intensive tasks, unlike min_instances, which are the instances used to process incoming requests immediately. I suggest you take a look at this link for a deeper explanation of idle instances.
Based on this, since your use case is focused on serving incoming requests immediately, I think you should go with min_instances and use min-idle-instances only if you want to be ready for sudden load spikes.
The min-instances configuration applies to dynamic instances while min-idle-instances applies to idle/resident instances.
See also:
Introduction to instances for a description of the 2 instance types
Why do more requests go to new (dynamic) instances than to resident instance? for a bit more details
min_instances: the minimum number of instances running at any time, traffic or no traffic, rain or shine.
min_idle_instances: the minimum number of idle (or "unused") instances kept running on top of the instances currently in use. Example: you have automatically scaled to 5 App Engine instances that are receiving requests. By setting min_idle_instances to 2, you will be running 7 instances in total; the 2 "extra" instances are idle and waiting in case you receive more load. The goal is that when load rises, your users don't have to wait for the time it takes to start up an instance.
IMPORTANT: you need to configure warmup requests for that to work
IMPORTANT2: you'll be billed for any instance running, idle or not. App engine is not cheap so be careful.
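The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the counting, not of how App Engine actually schedules or bills; the function names are hypothetical, and the billing helper ignores free quota and any per-instance minimum charges.

```python
def total_running_instances(serving: int, min_idle_instances: int) -> int:
    """Instances kept running: those serving traffic plus the idle reserve."""
    if serving < 0 or min_idle_instances < 0:
        raise ValueError("instance counts must be non-negative")
    return serving + min_idle_instances


def billed_instance_hours(running: int, hours: float) -> float:
    """Simplified assumption: every running instance accrues hours, idle or not."""
    return running * hours


if __name__ == "__main__":
    # The example from the answer: 5 serving instances, min_idle_instances = 2.
    running = total_running_instances(serving=5, min_idle_instances=2)
    print(running)  # 7
    print(billed_instance_hours(running, hours=24))
```

The point of the second function is just that the two idle instances show up on the bill the same way the five busy ones do.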
min_instances is the number of instances that you want to have running, from 0 (useful if you want to scale down when you don't receive traffic) to 1000. You are charged for the number of instances you have running, so this setting matters for keeping costs down.
For your case set this value to 1, as it's the most straightforward option.

When does Google App Engine start or stop an instance?

We have an App Engine app that handles an average of 0.5 requests per second, and seemingly all those requests can be handled by the same instance running a Go app as the main version.
However, sometimes App Engine kicks off a second instance (and sometimes even a third one), that doesn't seem to do anything past handling one or two requests. Here's an example.
Shutting down that instance manually doesn't seem to cause any harm, so my question is, why does App Engine not kill the instance after it did not get any requests for a while? (The above example had four requests in the past hour, often the requests/age ratio gets even lower).
Update:
A similar situation is when an instance is started on a different version. App Engine only seems to kill the instance after hours of not getting any requests.
Under Application Settings → Performance,
Idle Instances is set to Automatic – 20
Pending Latency is set to 150ms – 250ms
I wish I knew what controls if/when it kills idle instances, but I can't see any documentation of it.
To avoid excess instances starting, I think the main thing you can do here is increase the pending latency:
The Pending Latency slider controls how long requests spend in the pending queue before being served by an Instance of the default version of your application. If the minimum pending latency is high App Engine will allow requests to wait rather than start new Instances to process them. This can reduce the number of instance hours your application uses, but can result in more user-visible latency.
Even if you only average 4 requests/hour, if you happen to get two closely spaced I suppose it's possible it would start a new instance.
You can also see some small amount of information in the logs about why it started a new instance.
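To make the pending-latency trade-off concrete, here is a toy model in Python. It is not App Engine's actual scheduler, which is not documented at this level; the function name, the single-instance first-come-first-served assumption, and the sample numbers are all made up for illustration. It just counts how many requests would wait in the queue longer than the minimum pending latency, which is the situation where a new instance might be started.

```python
def count_scale_up_triggers(arrivals, service_time, min_pending_latency):
    """Count requests whose queue wait exceeds min_pending_latency.

    Toy model: one instance serves requests one at a time, in arrival order.
    All times are in seconds.
    """
    free_at = 0.0  # time at which the single instance next becomes free
    triggers = 0
    for arrival in sorted(arrivals):
        start = max(arrival, free_at)  # wait if the instance is still busy
        wait = start - arrival
        if wait > min_pending_latency:
            triggers += 1
        free_at = start + service_time
    return triggers


if __name__ == "__main__":
    # Two closely spaced requests, then a lone one much later.
    arrivals = [0.0, 0.1, 10.0]
    print(count_scale_up_triggers(arrivals, service_time=0.5, min_pending_latency=0.25))
    print(count_scale_up_triggers(arrivals, service_time=0.5, min_pending_latency=1.0))
```

With a 250 ms threshold the second request waits too long and counts as a trigger; with a 1 second threshold it simply waits, which matches the idea that a higher minimum pending latency trades some user-visible latency for fewer instances.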
The "How Applications Scale" section of the Google App Engine documentation states:
Scaling in Instances
Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.
App Engine also scales instances in reverse when request volumes decrease. This scaling helps ensure that all of your application's current instances are being used to optimal efficiency and cost effectiveness.
It also states you can "specify a minimum number of idle instances", and to "optimize for high performance or low cost" in the administration console.
Try setting the "Idle instances" field to something like 3 - 5, and "optimize for low cost" and see if that affects the instance kill time.

Preparing for a flash crowd on Google App Engine

I recently experienced a sharp, short-lived increase in the load of my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for about a couple of hours. My number of dynamic instances scaled up pretty quickly but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is, how do I determine how many are adequate? I expect a much larger burst in load this time: from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks, and if not, then is there a cap on that? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.
You must understand that Idle Instances are only used while new frontend instances are being spun up. This means that they are only used during traffic increases. When traffic is steady they are not used.
Now if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic, and your traffic INCREASE is 5 req/sec, then you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
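The rule of thumb above can be written as a small Python helper. This is just the answer's formula turned into code, rounded up to a whole instance; the function and parameter names are hypothetical, and you would have to measure the startup time and per-instance throughput for your own app.

```python
import math


def idle_instances_needed(startup_seconds, traffic_increase_rps, instance_capacity_rps):
    """Estimate idle instances needed to absorb a traffic increase.

    startup_seconds:       time a new instance takes to spin up
    traffic_increase_rps:  the traffic increase figure from the rule of thumb (req/sec)
    instance_capacity_rps: steady-state requests per second one instance can handle
    """
    if instance_capacity_rps <= 0:
        raise ValueError("instance_capacity_rps must be positive")
    estimate = startup_seconds * traffic_increase_rps / instance_capacity_rps
    return math.ceil(estimate)  # you cannot run a fraction of an instance


if __name__ == "__main__":
    # The numbers from the answer: 20 * 5 / 10 = 10
    print(idle_instances_needed(20, 5, 10))  # 10
```

Halving the startup time halves the estimate, which is why the advice that follows puts so much weight on fast instance startup.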
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used while new instances are spinning up, and the time it takes to spin up a new instance directly determines how many idle instances you need. If you use Java, this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc.).
Finally, the number of frontend instances needed is VERY application-specific. But since you have already had a traffic increase, you should know how many requests per second your frontend instance can handle.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.
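As a rough sketch of the Cache-Control suggestion, here is a minimal WSGI app in Python that marks its response as publicly cacheable. The handler name, body, and the 300-second lifetime are arbitrary choices for illustration, and this only makes sense for cacheable GET responses, not for the POST writes described in the question.

```python
def app(environ, start_response):
    """Minimal WSGI app whose response may be cached for five minutes."""
    body = b"hello"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Assumption: 300 seconds is an illustrative lifetime, not a recommendation.
        ("Cache-Control", "public, max-age=300"),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    # Serve locally with the standard library's reference WSGI server.
    from wsgiref.simple_server import make_server

    with make_server("localhost", 8080, app) as server:
        server.serve_forever()
```

Any request answered from a cache never reaches an instance, so it does not count toward the load your instances have to absorb.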

Handling AppEngine outage with pending_ms during several minutes

Today AppEngine went down for a while:
http://code.google.com/status/appengine/detail/serving/2012/10/26#ae-trust-detail-helloworld-get-latency
The result was that all requests were kept pending, some for as long as 24 minutes. Here is an excerpt from my server log. These requests are generally handled in less than 200 ms.
https://www.evernote.com/shard/s8/sh/ad3b58bf-9338-4cf7-aa35-a255d96aebbc/4b90815ba1c8cd2080b157a54d714ae0
My quota ($8 per day) was used up in a matter of minutes, when my usage was previously around $2 per day.
How can I prevent pending_ms from eating all my quota, even though my actual requests are still responding very fast? I had the pending latency set from 300 ms to Automatic. Does limiting the maximum to 10 seconds prevent that type of blow-up?
blackjack75,
You're right, raising the pending latency to something like 10 seconds will help reduce the number of instances started.
It looks like the long running requests tied up your instances. When this happens, app engine spins up new instances to handle the new requests, and of course instances cost money.
Lowering your min and max idle instances to smaller numbers should also help.
On your dashboard, you can look at your instance graph to see how long the burst of instances was left idle after the request load was finished.
You can look at your typical usage to help estimate a safe max.
Lowering them can cause slowness when legitimate traffic needs to spin up a new instance, especially with bursty traffic, so you would want to adjust this to match your budget. For comparison, on a non-production appspot, having the min and max set to 1 works fine.
Besides that, general techniques for reducing App Engine resource usage will help. It sounds like you've gone through that already, since your typical request time is low. Enabling concurrent requests could help here if your code handles threads correctly (no globals, etc.) and your instances have enough free memory to handle multiple requests.
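For what "handles threads correctly" can look like, here is a small Python sketch. It is generic threading code, not anything App Engine-specific: the HitCounter class and the simulation helper are hypothetical names, and the threads just stand in for concurrent requests hitting the same instance.

```python
import threading


class HitCounter:
    """Shared state guarded by a lock so concurrent requests can update it safely."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def record(self):
        # The lock makes the read-modify-write a single atomic step.
        with self._lock:
            self._count += 1
            return self._count

    @property
    def count(self):
        with self._lock:
            return self._count


def simulate_concurrent_requests(counter, workers=8, hits_per_worker=1000):
    """Run several threads that each record hits, then return the final count."""

    def handle():
        for _ in range(hits_per_worker):
            counter.record()

    threads = [threading.Thread(target=handle) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count


if __name__ == "__main__":
    print(simulate_concurrent_requests(HitCounter()))  # 8000
```

The simpler option, where it is possible, is to keep per-request data in local variables so there is no shared mutable state to protect in the first place.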

App Engine loading request even when idle instance available

I have a simple app running on App Engine but I'm having odd problems with latency. It's a Python 2.7 app and a loading request takes between 1.5 and 10 secs (I guess depending on how GAE is feeling). This is a low traffic site right now, so previously GAE was sitting with no idle instances and most requests were loading requests, resulting in a long wait time on the first page view.
I've tried configuring the minimum number of idle instances to "1" so that these infrequent page views can immediately hit a warm instance.
However, I've seen several cases now where even with one instance sitting unused, GAE will route an incoming request to a loading instance, leaving the warm instance untouched:
(Screenshot: GAE dashboard showing odd scheduling)
How can I prevent this from happening? I feel I must be understanding something wrong, because I certainly don't expect this behavior.
Update: Also, what makes this even less comprehensible is that the app has threadsafe enabled, so I really don't understand why GAE would get flustered and spin up an instance for a single, lone request.
Actually, I believe this is normal behavior. Idle instances are supposed to guarantee a minimum number of instances always available (for spiky load).
So, when some requests start coming in, they are initially served by idle instances, but at the same time the App Engine scheduler will start launching new instances so that it always keeps the same number of idle instances, even during suddenly increased load. That is, to "cover" for those idle instances that became busy serving requests.
It is described in detail on the Adjusting Application Performance page.
Arrrgh! Suffer from this myself. This topic-area has come up in several threads (GAE groups & SO). If someone can dial-in the settings for a low-traffic site (billing on/off), that would be a real benefit. IIRC, someone with what I think is deep GAE experience noted in one thread that the Scheduler does not do well with very low volume apps. I have also seen wildly different startup times within a relatively short period of time. Painful to see a spinup take 700ms then 7000ms just a few minutes later. Overall the issue is not so much the cost to me, but more so the waste of infrastructure resources. In testing I've had two instances running despite having pinged the app with an RPC once every few minutes. If 50k other developers are similarly testing, that could accumulate into a significant waste.
