Frequent restarts on GAE application flex environment - google-app-engine

I have a GAE application that's set up as a flexible instance, which is expected to be restarted on a weekly basis (and a continually unhealthy instance can be restarted): https://cloud.google.com/appengine/docs/flexible/java/how-instances-are-managed
However, we're seeing this restart ("npm run build" command) several times per week! For example in the past three weeks we've had 9 restarts, and I've confirmed that the log entries leading up are successful 200 responses (no sign of trouble)- all for the active version serving traffic (and not for the other versions that are stopped).
Has anyone seen this symptom before or know of something else that can cause frequent restarts?
Let me know if any other info would be helpful.

An instance restart in the Google App Engine flexible environment can occur for several reasons:
According to the GAE documentation, there is no guarantee that an instance runs indefinitely, it can be restarted due to hardware maintenance, software updates or unforeseen issues. Besides that, as you stated, all instances are restarted on a weekly basis.
An instance can also be restarted if it fails to respond to a specified number of consecutive health check requests.
In case that you observe a unusual number of restarts I recommend you to open a ticket in Google Cloud Platform Support. They have internal tools that are able to check what is going on in the instance and figure out why the restarts are happening.
#DianeKaplan's comment:
Contacting GCP support has given me some a few helpful nuggets so far:
The automatic weekly restart of an instance due to maintenance can occur around different times (so it may only be 5 days since the last one, for example)
our deployments (which result in new GAE versions) make Google Builds
In some cases, a VM was being created overnight and then immediately deleted, where it didn't look like autoscaling was needed. Still looking into this, but was pointed towards the Google Cloud Console section Home > Activity as a good place to find clues

Related

Costs generated by old versions of my appengine application

I have many versions of appengine instances that are active but not used because they are old. What are the costs they can generate? How do you behave with the old versions of appengine instances? Do you delete them or deactivate them?
On the documentation I don't find any reference to the costs of the old instances.
https://cloud.google.com/appengine/pricing?hl=it
UPDATE:
(GAE STANDARD)
Thank you
It's a poorly documented aspect of App Engine. What you describe as versions that are "not used" are more specifically versions that don't receive traffic. But depending on your scaling configuration (essentially defined in your app.yaml file), there may not be a 1:1 relationship between traffic and the number of active instances serving a version.
For example, I'm familiar with using "automatic_scaling" with min_instances = 1. This will prevent a service to scale to zero instance and have latency to serve an incoming request after some idle time, but it also means that any version until deleted will generate a baseline cost of 1 instance running 24/7.
Also, I've found that the estimated number of instances displayed in the dashboard you screenshotted can be misleading (more specifically, it can show 0 instance while there is actually one running).
Note that if you do not have scaling related configuration in your app.yaml file, you should check what are the default values currently considered by App Engine.
It's tricky when you get started and I'm sure I'm not the only one who lost most of the free trial budget because of this.
There is actually a limit of versions you can have depending on your app's pricing:
Limit
Free app
Paid app
Max versions
15
210
It seems that you can have them active in case you want to switch between versions migrating or splitting the traffic between them, but they won't charge you for them if you don't reach more than 15 versions.

app.yaml configuration to to shutdown the idle running instance for GAE

Is there a parameter that can be used in .yaml file, which can turn off the google app engine running instance when idle for a specified time? The intention is to reduce the instance hours hence billing.
There is no option in app.yaml flex environment to stop instance if it is idle.
Flex should have atlease 1 instance running.
If you want to be billed for an instance, stop the instance manually or if you know the certain time when your app is not being used (e.g 6pm to 6am next day), you can schedule to stop / start instance version.
gcloud app versions stop v1
There is no app.yaml element that can stop an App Engine instance based on a condition for a specific amount of time.
The closest thing you can do to reduce costs using the app.yaml file, is to specify a cheaper, albeit less potent Instance Class and / or reducing the resources you assign to the instance, (depending on whether you’re using the standard or flexible environment respectively), as these are part of what you’re billed for.
Reducing the amount of instances you need is another approach; this can be done by lowering the value of max_instances and / or max_idle_instances in standard, and max_num_instances in flexible.
If you don’t want to be billed for an instance at all, you can stop the version associated to it with the gcloud command gcloud app versions stop. In standard you won’t be charged when it’s stopped as it’s not running, but in flexible you will still pay for the disk size despite it.
A tool that can help you anticipate and estimate costs is the Pricing Calculator, where you can enter your desired configuration and see what would the costs approximately be. Setting up budget alerts for when you reach a certain spending limit can be useful too. Similarly, in standard, you can set a spending limit, and when an application exceeds it, operations will consequently fail but you won't be billed for it.

How to prevent downtime in App Engine Flex when instances are automatically restarted

Situation
custom runtime (Docker/Node) on App Engine Flex
manually scaled to 1 single instance as we manage the resources ourselves (2 cpu / 6 gb ram)
liveness and readiness checks are configured
as expected, vm instances are automatically restarted on a weekly basis to apply OS / system updates
this is visible in the Activity pane of the Google Cloud Console
Stackdriver logs confirm this activity (e.g. shutdown-script: INFO Starting shutdown scripts. and startup-script: INFO Starting startup scripts.)
no instance is available during these restarts, resulting in 503 errors when visiting the application running on the instance
Goal
to have some control on the amount of instances to prevent downtime
e.g. temporarily scale to 2 instances while 1 instance is restarting
keeping control of the available resources (cpu / ram)
Question
We've considered simply having 2 instances available at all times, but are worried both would be restarted at the same time since they are part of the same instance group.
What would allow us to keep everything up and running while still controlling the amount of instances / resources used?
I have a flex app with two instances running for similar reasons. For me, an instance will occasionally exceed memory limits and need to be restarted. Since I have a second instance, there should always be an instance available.
I hadn't considered the Google updates to my instances. I just checked my recent history, and Google restarted my two instances yesterday. The restarts were 7 minutes apart so, at least in this example, my users always had an instance available to them.
I suspect that Google does not simultaneously restart all of your instances. This would create a brief period of downtime for all flex customers, and nobody wants downtime for a cloud service.
UPDATE:
This is a guess, but I expect that when Google updates a flex instance, it will create a new instance and only shutdown the old instance after the new instance is available. At least, if I were running Google, that is how I would do it. That way you have 100% uptime and you will very briefly have an extra instance running. This would even work with a single flex instance.
Maybe you should try Automatic scaling showed here: Scaling instances.
This allows your application to automatically create instances based on request rate, response latencies, and other application metrics. When one of your instances are gets shut down, another instance could be created in order to "cover" the missing instance. Thus, your service won't get interrupted.

Do App Engine Flexible Environment VM instance restarts take advantage of automatic scaling?

I'm developing my first App Engine Flexible Environment application.
The docs explain that virtual machines are restarted weekly:
VM instances are restarted on a weekly basis. During restarts
Google's management services will apply any necessary operating system
and security updates.
Will restarts result in downtime for apps with automatic scaling enabled? If so, are there any steps I can take to avoid downtime?
For example, I could frequently migrate traffic to new instances so that no instance runs for more than one week.
Well, Later I checked with the Google support team and here the recommendation from them to avoid the downtime.
My questions are:
The weekly update is not fixed in time. Maybe there is a range in time in which I should expect the reboot of the instances? (ie: every Friday during the night).
The weekly update involves all the instances, independently from when they were created? (ie: an instance created 1 hour or 1 day before the weekly update will be restarted?).
How do we suppose to handle such a problem? it returns 502 for all request in the meantime.
1.- At this moment there is no way to know when the weekly restart is going to happen. GCP determine when is necessary and it does the restart of certain instances (once per week).
2.- No, as long as you have more than 1 one instance running you won’t see all of them being restarted at the same time.
3.- What we recommend to avoid downtime due to weekly restarts is having more than 1 instance as a minimum instance. Try to set at least 2 instances as a minimum.
I hope, this information is useful to others.
The answer to your question is in the docs:
App Engine attempts to keep manual scaling instances running indefinitely, but there is no uptime guarantee. Hardware or software failures that cause early termination or frequent restarts can occur without warning and can take considerable time to resolve. Your application should be able to handle such failures.
Here are some good strategies for avoiding downtime due to instance restarts:
Use load balancing across multiple instances.
Configure more instances than required to handle normal traffic.
Write fall-back logic that uses cached results when a manual scaling instance is unavailable.
Reduce the amount of time it takes for your instances to start up and shutdown.
Duplicate the state information across more than one instance.
For long-running computations, checkpoint the state from time to time so you can resume it if it doesn't complete.

App Engine Tasks with ETA are fired much later than scheduled

I am using Google App Engine Task push queues to schedule future tasks that i'd like to occur within second precision of their scheduled time.
Typically I would schedule a task 30 seconds from now, that would trigger a change of state in my system, and finally schedule another future task.
Everything works fine on my local development server.
However, now that I have deployed to the GAE servers, I notice that the scheduled tasks run late. I've seen them running even two minutes after they have been scheduled.
From the task queues admin console, it actually says for the ETA:
ETA: "2013/11/02 22:25:14 0:01:38 ago"
Creation Time: "2013/11/02 22:24:44 0:02:08 ago"
Why would this be?
I could not find any documentation about the expectation and precision of tasks scheduled by ETA.
I'm programming in python, but I doubt this makes any difference.\
In the python code, the eta parameter is documented as follows:
eta: A datetime.datetime specifying the absolute time at which the task
should be executed. Must not be specified if 'countdown' is specified.
This may be timezone-aware or timezone-naive. If None, defaults to now.
My queue Settings:
queue:
- name: mgmt
rate: 30/s
The system is under no load what so ever, except for 5 tasks that should run every 30 seconds or so.
UPDATE:
I have found https://code.google.com/p/googleappengine/issues/detail?id=4901 which is an accepted feature request for timely queues although nothing seems to have been done about it. It accepts the fact that tasks with ETA can run late even by many minutes.
What other alternative mechanisms could I use to schedule a trigger with second-precision?
GAE makes no guarantees about clock synchronization within and across their data centers; see UTC Time on Google App engine? for a related discussion. So you can't even specify the absolute time accurately, even if they made the (different) guarantee that tasks are executed within some tolerance of the target time.
If you really need this kind of precision, you could consider setting up a persistent GAE "backend" instance that synchronizes itself with a trusted external clock, and provides task queuing and execution services.
(Aside: Unfortunately, that approach introduces a single point of failure, so to fix that you could just take the next steps and build a whole cluster of these backends... But at that point you may as well look elsewhere than GAE, since you're moving away from the GAE "automatic transmission" model, toward AWS's "manual transmission" model.)
I reported the issue to the GAE team and I got the following response:
This appears to be an isolation issue. Short version: a high-traffic user is sharing underlying resources and crowding you out.
Not a very satisfying response, I know. I've corrected this instance, but these things tend to revert over time.
We have a project in the pipeline that will correct the underlying issue. Deployment is expected in January or February of 2014.
See https://code.google.com/p/googleappengine/issues/detail?id=10228
See also thread: https://code.google.com/p/googleappengine/issues/detail?id=4901
After they "corrected this instance" I did some testing for a few hours. The situation improved a little especially for tasks without ETA. But for tasks with ETA I still see at least half of them running at least 10 seconds late. This is far from reliable for my requirements
For now I decided to use my own scheduling service on a different host, until the GAE team "correct the underlying issue" and have a more predictable task scheduling system.

Resources