GAE: What's the difference between <min-pending-latency> and <max-pending-latency>? - google-app-engine

As far as I can tell from the docs, both settings do the same thing: start a new instance when a request has spent longer in the pending queue than the setting allows.
<max-pending-latency> The maximum amount of time that App Engine should allow a request to wait in the pending queue before starting a new instance to handle it. Default: "30ms".
A low maximum means App Engine will start new instances sooner for pending requests, improving performance but raising running costs.
A high maximum means users might wait longer for their requests to be served, if there are pending requests and no idle instances to serve them, but your application will cost less to run.
<min-pending-latency>
The minimum amount of time that App Engine should allow a request to wait in the pending queue before starting a new instance to handle it.
A low minimum means requests must spend less time in the pending queue when all existing instances are active. This improves performance but increases the cost of running your application.
A high minimum means requests will remain pending longer if all existing instances are active. This lowers running costs but increases the time users must wait for their requests to be served.
Source: https://cloud.google.com/appengine/docs/java/config/appref
What's the difference between min and max then?

The piece of information you might be missing to understand these settings is that App Engine can choose to create an instance at any time between min-pending-latency and max-pending-latency.
This means an instance will never be created to serve a pending request before min-pending-latency and will always be created once max-pending-latency has been reached.
I believe the best way to understand this is to look at the timeline of events when a request enters the pending queue:
A request reaches the application, but no instances are available to serve it, so it is placed in the pending request queue.
Until the min-pending-latency is reached: App Engine tries to find an available instance to serve the request and will not create a new instance. If a request is served below this threshold, it is a signal for App Engine to scale down.
After the min-pending-latency is reached and until max-pending-latency is reached: App Engine tries to find an available instance to serve the request.
After the max-pending-latency is reached: App Engine stops searching for an available instance to serve the request and creates a new instance.
Source: app.yaml automatic_scaling element
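In app.yaml terms, the two settings bracket the window in which the scheduler may create an instance. A minimal sketch (the values here are illustrative, not recommendations):

```yaml
automatic_scaling:
  # The scheduler will NOT create a new instance for a pending request
  # before it has waited 500ms...
  min_pending_latency: 500ms
  # ...and WILL create one once it has waited 2s. In between, App Engine
  # is free to decide either way.
  max_pending_latency: 2s
```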

Related

(Basic Scaling) Will App Engine shut down an app that's still busy handling a request if the idle timeout is reached?

Google describes basic scaling like this:
I don't really have any other choice, as I'm using a B1 instance, so automatic scaling is not allowed.
That raises the question though, if I have an endpoint that takes a variable amount of time (could be minutes, could be hours), and I basically have to set an idle_timeout, will App Engine calculate idle_timeout from the time the request was made in the first place or when the app is done handling the request?
If the former is right, then it feels a bit unfair to have to guess how long requests will take when thread activity is a usable indicator of whether or not to initiate shutdown of an app.
Here you are mixing up two different terms.
idle_timeout is the time that an instance will wait before shutting down after receiving its last request.
Request timeout is the amount of time that App Engine will wait for your app to return a response to a request.
As per the documentation:
Requests can run for up to 24 hours. A manually-scaled instance can choose to handle /_ah/start and execute a program or script for many hours without returning an HTTP response code. Task queue tasks can run up to 24 hours.
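For reference, a basic_scaling sketch with the two relevant settings (the values are illustrative):

```yaml
basic_scaling:
  max_instances: 1
  # Time to wait after the last request before shutting the instance
  # down. Note this is separate from the request deadline quoted above:
  # an individual request may run for up to 24 hours.
  idle_timeout: 20m
```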

Pub/Sub push return 503 for basic scaling

I am using a Pub/Sub push subscription with the ack deadline set to 10 minutes; the push endpoint is hosted within App Engine using basic scaling.
In my logs, I see that some of the Pub/Sub push requests (supposedly delivered to starting instances) fail with a 503 error status and the log message Request was aborted after waiting too long to attempt to service your request. The execution time for these requests varies from 10 seconds (for most of them) up to 30 seconds for some.
According to this article https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed#instance_scaling the deadline for HTTP requests is 24 hours, so a request should not be aborted after 10 seconds.
Is there a way to avoid such exceptions?
These failed requests are most likely timing out in the Pending Request Queue, meaning that no instances are available to serve them. This usually happens during spikes of PubSub messages that are delivered in burst and App Engine can't scale up quickly enough to cope with them.
One option to mitigate this would be to switch the scaling option to automatic scaling in your app.yaml file. You can tweak the min_pending_latency and max_pending_latency to better fit your scenario. You can also specify min_idle_instances to get idle instances that would be ready to handle extra load (make sure to also enable and handle warmup requests).
Take into account though that PubSub will automatically retry to deliver failed messages. It will adjust the delivery rate according to your system's behavior, as documented here. So you may experience some errors during spikes of messages, while new instances are being spawned, but your messages will eventually be processed (as long as you have setup max_instances high enough to handle the load).
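A sketch of the suggested configuration (the keys are real app.yaml settings; the values are illustrative assumptions for a bursty push workload, and inbound_services applies to first-generation runtimes — newer runtimes only need a /_ah/warmup handler):

```yaml
automatic_scaling:
  min_idle_instances: 2     # warm spares ready to absorb message bursts
  max_pending_latency: 1s   # force a scale-up well before requests are aborted
  max_instances: 20         # cap costs; size this to your peak message rate

# Idle instances are only useful if they get warmed up:
inbound_services:
- warmup
```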

Programmatically listing and sending requests to dynamic App Engine instances

I want to send a particular HTTP request (or otherwise communicate a message) to every (dynamic/autoscaled) instance which is currently running for a particular App Engine application.
My goal is to trigger each instance to discard some locally cached data (because I have just modified the underlying data and want them to reload it).
One possible solution is to store a value in Memcache, and have instances check this each time they handle a request to see if they should flush their cache. But this adds latency to every request.
Another possible solution would be to somehow stop all running instances. No fixed overhead, but some impact while instances are restarted.
An even less desirable solution would be to redeploy the application code in order to cause all instances to be stopped. This now adds additional delay on my end as a deployment takes some time.
You could use the management API to list instances for a given version, but I'd suggest using something like the Pub/Sub API to create a subscription on each of your App Engine instances. Since each instance has its own subscription, any message published to the monitored topic will be received by all instances.
You can create the subscription at startup (the /_ah/start endpoint may be useful), and then delete it at shutdown (using the /_ah/stop endpoint).
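A sketch of the per-instance subscription idea, assuming the google-cloud-pubsub client library; "my-project" and "cache-flush" are placeholder names, and the handler bodies would be wired to the /_ah/start and /_ah/stop routes:

```python
import os

def instance_subscription_name(project: str, topic: str) -> str:
    """Build a subscription path unique to this App Engine instance.

    GAE_INSTANCE is set by the App Engine runtime; "local" is a fallback
    for development.
    """
    instance_id = os.environ.get("GAE_INSTANCE", "local")
    return f"projects/{project}/subscriptions/{topic}-{instance_id}"

def on_start():
    """Body of an /_ah/start handler: create this instance's subscription."""
    from google.cloud import pubsub_v1  # requires google-cloud-pubsub
    subscriber = pubsub_v1.SubscriberClient()
    subscriber.create_subscription(
        request={
            "name": instance_subscription_name("my-project", "cache-flush"),
            "topic": "projects/my-project/topics/cache-flush",
        }
    )

def on_stop():
    """Body of an /_ah/stop handler: delete the subscription again."""
    from google.cloud import pubsub_v1
    subscriber = pubsub_v1.SubscriberClient()
    subscriber.delete_subscription(
        request={"subscription": instance_subscription_name("my-project", "cache-flush")}
    )
```

Orphaned subscriptions (e.g. after a hard instance kill) eventually expire on their own if you set an expiration policy on them.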

Can GAE be configured to launch a new instance per request?

I'm building a data processing application, where I want an incoming (REST) request to cause a cloud instance to be started, do some processing, then retrieve the results. Typically, it would be something like this:
receive request
start instance
send request to instance
instance processes (~100% load on all instance CPUs)
poll service running on instance for status
fetch results from instance
shut down instance
I was planning on doing the instance management manually using something like jclouds, but am wondering if GAE could be configured to do something like this (saving me work).
If I have my processing service set up in GAE, can I make it so that a new instance is launched for every incoming request (or whenever the current instance(s) are at 100% CPU usage)?
Referring to instance management only (i.e. 1-4 and 7)...
From Scaling dynamic instances:
The App Engine scheduler decides whether to serve each new request with an existing instance (either one that is idle or accepts concurrent requests), put the request in a pending request queue, or start a new instance for that request. The decision takes into account the number of available instances, how quickly your application has been serving requests (its latency), and how long it takes to spin up a new instance.
Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.
App Engine also scales instances in reverse when request volumes decrease. This scaling helps ensure that all of your application's current instances are being used to optimal efficiency and cost effectiveness.
So in the scaling configuration I'd keep automatic_scaling (which is the default) and play with:
max_pending_latency:
The maximum amount of time that App Engine should allow a request to wait in the pending queue before starting a new instance to handle it. The default value is "30ms".
A low maximum means App Engine will start new instances sooner for pending requests, improving performance but raising running costs.
A high maximum means users might wait longer for their requests to be served (if there are pending requests and no idle instances to serve them), but your application will cost less to run.
min_pending_latency:
The minimum amount of time that App Engine should allow a request to wait in the pending queue before starting a new instance to handle it.
A low minimum means requests must spend less time in the pending queue when all existing instances are active. This improves performance but increases the cost of running your application.
A high minimum means requests will remain pending longer if all existing instances are active. This lowers running costs but increases the time users must wait for their requests to be served.
See also in Change auto scaling performance settings:
Min Pending Latency - Raising Min Pending Latency instructs App Engine's scheduler to not start a new instance unless a request has been pending for more than the specified time. If all instances are busy, user-facing requests may have to wait in the pending queue until this threshold is reached. Setting a high value for this setting will require fewer instances to be started, but may result in high user-visible latency during increased load.
You may also want to take a look at Warmup requests, in case you want to reduce the latency for requests which would cause a new instance to be started.
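A minimal sketch of a warmup handler in Python (plain WSGI, standard library only; the cache-priming logic is a hypothetical stand-in for whatever expensive initialization your instances do):

```python
# Process-local cache, primed by the warmup request so that the first
# real user request doesn't pay the initialization cost.
CACHE = {}

def load_reference_data():
    # Stand-in for an expensive load (datastore fetch, model load, ...).
    return {"ready": True}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path == "/_ah/warmup":
        CACHE["ref"] = load_reference_data()
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"warmed"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]
```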

How do dynamic backends start in Google App Engine

Can we start a dynamic backend programmatically? Meanwhile, while a backend is starting, how can I handle the request by falling back on the application (I mean app.appspot.com)?
When I stop a backend manually in the admin console and send a request to it, it does not start "dynamically".
Dynamic backends come into existence when they receive a request, and are turned down when idle; they are ideal for work that is intermittent or driven by user activity.
Resident backends run continuously, allowing you to rely on the state of their memory over time and perform complex initialization.
http://code.google.com/appengine/docs/python/backends/overview.html
I recently started executing a long running task on a dynamic backend and noticed a dramatic increase in the performance of the frontends. I assume this was because the long running task was competing for resources with normal user requests.
Backends are documented quite thoroughly here. Backends have to be started and stopped with appcfg or the admin console, as documented here. A stopped backend will not handle requests - if you want this, you should probably be using the Task Queue instead.
It appears that a dynamic backend need not be explicitly stopped. The overview (http://code.google.com/appengine/docs/python/backends/overview.html) states that billing for a dynamic backend stops 15 minutes after the last request is processed. So if your app has, for example, a cron job that requires 5 minutes to complete and needs to run every hour, you could configure a backend to do this. The cost you'll incur is 15+5 minutes every hour, or 8 hours for the whole day. I suppose the free quota allows you 9 backend hours, so this type of scenario would be free for you. The backend will start when you send your first request to it through a queue, and will stop 15 minutes after the last request you send is processed completely.
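The cost arithmetic in the paragraph above, spelled out:

```python
# Billed time per run = job runtime + the 15-minute idle window that
# keeps a dynamic backend billable after its last request.
job_minutes = 5      # cron job runtime per run
idle_minutes = 15    # backend remains billable this long after the last request
runs_per_day = 24    # the job runs hourly

billed_hours_per_day = runs_per_day * (job_minutes + idle_minutes) / 60
print(billed_hours_per_day)  # 8.0 -- under the 9 free backend hours mentioned
```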
