Unable to export to Monitoring service because: GaxError RPC failed, caused by 3 - google-app-engine

I have a Java application in App Engine, and recently I started getting the following error:
Unable to export to Monitoring service because: GaxError RPC failed, caused by 3: One or more TimeSeries could not be written: Metrics cannot be written to gae_app. See https://cloud.google.com/monitoring/custom-metrics/creating-metrics#which-resource for a list of writable resource types.: timeSeries[0]
This happens every time after the health check log:
Health checks: instance=instanceName start=2020-01-14T14:28:07+00:00 end=2020-01-14T14:28:53+00:00 total=18 unhealthy=0 healthy=18
After some time my instances are restarted and the same thing starts to happen again.
app.yaml:
#https://cloud.google.com/appengine/docs/flexible/java/reference/app-yaml
#General settings
runtime: java
api_version: '1.0'
env: flex
runtime_config:
  jdk: openjdk8
#service: service_name #Required if creating a service. Optional for the default service.
#https://cloud.google.com/compute/docs/machine-types
#Resource settings
resources:
  cpu: 2
  memory_gb: 6 #memory_gb = cpu * [0.9 - 6.5] - 0.4
#  disk_size_gb: 10 #default
##Liveness checks - Liveness checks confirm that the VM and the Docker container are running. Instances that are deemed unhealthy are restarted.
liveness_check:
  path: "/liveness_check"
  timeout_sec: 20 #1-300 Timeout interval for each request, in seconds.
  check_interval_sec: 30 #1-300 Time interval between checks, in seconds.
  failure_threshold: 6 #1-10 An instance is unhealthy after failing this number of consecutive checks.
  success_threshold: 2 #1-10 An unhealthy instance becomes healthy again after successfully responding to this number of consecutive checks.
  initial_delay_sec: 300 #0-3600 The delay, in seconds, after the instance starts during which health check responses are ignored. This setting can allow an instance more time at deployment to get up and running.
##Readiness checks - Readiness checks confirm that an instance can accept incoming requests. Instances that don't pass the readiness check are not added to the pool of available instances.
readiness_check:
  path: "/readiness_check"
  timeout_sec: 10 #1-300 Timeout interval for each request, in seconds.
  check_interval_sec: 15 #1-300 Time interval between checks, in seconds.
  failure_threshold: 4 #1-10 An instance is unhealthy after failing this number of consecutive checks.
  success_threshold: 2 #1-10 An unhealthy instance becomes healthy again after successfully responding to this number of consecutive checks.
  app_start_timeout_sec: 300 #1-3600 The maximum time, in seconds, an instance has to become ready after the VM and other infrastructure are provisioned. After this period, the deployment fails and is rolled back. You might want to increase this setting if your application requires significant initialization tasks, such as downloading a large file, before it is ready to serve.
#Service scaling settings
automatic_scaling:
  min_num_instances: 2
  max_num_instances: 3
  cpu_utilization:
    target_utilization: 0.7

The error is caused by an upgrade of the Stackdriver Logging sidecar to version 1.6.25, which started pushing Fluentd metrics to Stackdriver Monitoring via OpenCensus. However, the integration with App Engine Flex doesn't work yet.
These errors are log noise only. They are not related to the health check logs and should not cause VM restarts. If your VM instances are restarted frequently, there is likely some other cause. In the Stackdriver Logging UI, you can search for "Free disk space" under the vm.syslog stream and for "unhealthy sidecars" under the vm.events stream. If such logs show up, your instance restarts may be caused by low free disk space or by unhealthy sidecar containers.
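For illustration, a query along these lines in the advanced log filter should surface the first kind of entry (the exact log name suffix is an assumption based on the stream names above; adjust it to what your project actually shows):
resource.type="gae_app"
logName:"vm.syslog"
textPayload:"Free disk space"
Swapping vm.syslog / "Free disk space" for vm.events / "unhealthy" finds the sidecar entries.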

Related

How does app engine load balancer route to idle instances?

I am in the process of migrating an existing Spring Boot app to GAE. The app is automatically scaled with a minimum of 0 instances. During my test runs I have noticed that initially the first instance is loaded and starts to handle requests. As auto-scaling kicks in, a second instance is launched. Once the second instance is up and running, all traffic is sent to that second instance. Meanwhile, the first instance remains idle after having handled the first handful of requests that prompted the autoscaling.
I'm fairly new to tuning, so I'm probably missing something.
app.yaml:
runtime: java17
instance_class: F2
automatic_scaling:
  target_cpu_utilization: 0.8
  min_instances: 0
  max_instances: 4
  min_pending_latency: 8s
  max_pending_latency: 10s
  max_concurrent_requests: 25
env_variables:
  SPRING_PROFILES_ACTIVE: "prod"
This is the image of both instances after handling 100 requests in 5 minutes. The majority of the requests are being routed to the second instance, the one that was autoscaled.
I have tried changing some of the autoscale values. I am expecting the GAE load balancer to distribute the load among all available instances. But I don't see an even distribution of traffic. Why is that?

Why are idle instances not being shut down when there is no traffic?

Some weeks ago my app on App Engine just started to increase the number of idle instances to an unreasonably high amount, even when there is close to zero traffic. This of course impacts my bill, which is skyrocketing.
My app is a simple Node.js application serving a GraphQL API that connects to my Cloud SQL database.
Why are all these idle instances being started?
My app.yaml:
runtime: nodejs12
service: default
handlers:
- url: /.*
  script: auto
  secure: always
  redirect_http_response_code: 301
automatic_scaling:
  max_idle_instances: 1
Screenshot of monitoring:
This is very strange behavior; per the documentation, the number of idle instances should only temporarily exceed max_idle_instances:
Note: When settling back to normal levels after a load spike, the number of idle instances can temporarily exceed your specified maximum. However, you will not be charged for more instances than the maximum number you've specified.
Some possible solutions:
1. Confirm that the app.yaml configuration actually deployed is the same as the one shown in the App Engine console (a gcloud sketch for this follows after this list).
2. Set min_idle_instances to 1 and max_idle_instances to 2 (temporarily) and redeploy the application. It could be that something is simply wrong on the scaling side, and redeploying the application could solve it.
3. Check your logs (filtered on App Engine) for any problem with shutting down the idle instances.
4. Tweak settings like max_pending_latency. I have seen people build applications that take 2-3 seconds to start up, while the default is 30ms before another instance is spun up. This post suggests setting the following, which you could try:
   instance_class: F1
   automatic_scaling:
     max_idle_instances: 1 # default value
     min_pending_latency: automatic # default value
     max_pending_latency: 30ms
5. Switch to basic_scaling and let Google determine the best scaling algorithm (last resort option). This would look something like this:
   basic_scaling:
     max_instances: 5
     idle_timeout: 15m
The solution could of course also be a combination of 2 and 4.
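For point 1, a quick way to compare the deployed configuration with your local app.yaml is gcloud; VERSION_ID is a placeholder for the version shown in the console, and the describe output includes that version's scaling settings:
gcloud app versions describe VERSION_ID --service=default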
Update after 24 hours:
I followed #Nebulastic's suggestions, numbers 2 and 4, but it did not make any difference. So in frustration I disabled the entire App Engine application (App Engine > Settings > Disable application), left it off for 10 minutes, and confirmed in the monitoring dashboard that everything was dead (sorry, users!).
After 10 minutes I enabled App Engine again and it booted only 1 instance. I've been monitoring it closely since, and it finally seems to be good now. After the restart it also adheres to the "min" and "max" idle instances configuration - the suggestion from #Nebulastic. Thanks!
Screenshots:
Have you checked to make sure you don't have a bunch of old versions still running? https://console.cloud.google.com/appengine/versions
Check each service in the services dropdown.
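If you prefer the command line, something like the following lists what is still serving per service and stops an old version (the service name and version ID are placeholders):
gcloud app services list
gcloud app versions list --service=default
gcloud app versions stop OLD_VERSION_ID --service=default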

Frequent restarts on Google App Engine Standard second generation

We have a problem with frequent restarts of App Engine instances, which only last for 15-30 minutes, sometimes maybe 1 hour.
In the last 24 hours we had 72 instance restarts. We have looked into the logs but can't find any error messages explaining this.
min_instances is set to 1.
The app is a PHP CodeIgniter app running on the php73 runtime.
Maybe this is relevant, as it shows up regularly in the log, not at the same time as web requests:
A 2020-05-01T17:46:46.675532Z [start] 2020/05/01 17:46:46.674713 Quitting on terminated signal
A 2020-05-01T17:46:46.900441Z [start] 2020/05/01 17:46:46.899377 Start program failed: termination triggered by nginx exit
Looking at the request log, there is no pattern in page requests that could lead to instances crashing.
All page requests typically load in 1-80 ms; there are no heavy scripts. It looks like the instances crash while idle.
We have also tried increasing the instance class to F4, with the same results.
The graphs for CPU usage and memory usage don't give us any clue.
The problem this causes is loading requests for site visitors. Most of the time the site is fast and responsive, but occasional loading times of 1s+ are possible when new instances start. We have set up warmup requests, but that does not cover all instance starts.
Is this normal behavior? How can we debug further? Any clue what could be wrong?
Thanks for any help.
EDIT: Here is our app.yaml:
runtime: php73
entrypoint: serve public_html/index.php
instance_class: F2
automatic_scaling:
  min_instances: 1
inbound_services:
- warmup
vpc_access_connector:
  name: "xx"
handlers:
- url: /
  script: auto
  secure: always
- url: /(.+)
  script: auto
  secure: always
env_variables:
  CLOUD_SQL_CONNECTION_NAME: xx
  REDIS_HOST: xx
  REDIS_PORT: xx

Rolling restarts are causing our App Engine app to go offline. Is there a way to change the config to prevent that from happening?

About once a week our Flexible App Engine Node app goes offline and the following line appears in the logs: "Restarting batch of VMs for version 20181008t134234 as part of rolling restart." We have our app set to automatic scaling with the following settings:
runtime: nodejs
env: flex
beta_settings:
  cloud_sql_instances: tuzag-v2:us-east4:tuzag-db
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 3
liveness_check:
  path: "/"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
readiness_check:
  path: "/"
  check_interval_sec: 15
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
resources:
  cpu: 1
  memory_gb: 1
  disk_size_gb: 10
I understand the rolling restarts of GCP/GAE, but I am confused as to why Google isn't spinning up another VM before taking our primary one offline. Do we have to run with a minimum of 2 instances to prevent this from happening? Is there a way I can configure my app.yaml to make sure another instance is spun up before it reboots the only running instance? After the reboot finishes, everything comes back online fine, but there's still 10 minutes of downtime, which isn't acceptable, especially considering we can't control when it reboots.
We know that it is expected behaviour that Flexible instances are restarted on a weekly basis. Provided that health checks are properly configured and are not the issue, the recommendation is, indeed, to set up a minimum of two instances.
There is no alternative functionality in App Engine Flex, of which I am aware, that raises a new instance to avoid downtime as a result of a weekly restart. You could try to run directly on Google Compute Engine instead of App Engine and manage updates and maintenance by yourself; perhaps that would suit your purpose better.
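For reference, a minimal sketch of the change that follows from the recommendation above - only min_num_instances moves from 1 to 2 relative to the configuration in the question:
automatic_scaling:
  min_num_instances: 2 # keep a second VM so a rolling restart never takes the only instance offline
  max_num_instances: 3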
Are you just guessing this based on that num instances graph in the app engine dashboard? Or is your app engine project actually unresponsive during that time?
You could use cron to hit it every 5 minutes to see if it's responsive.
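If you go that route, a cron.yaml sketch along these lines would ping the app every 5 minutes (the / path is a placeholder for whatever lightweight endpoint you expose):
cron:
- description: "uptime ping"
  url: /
  schedule: every 5 minutes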
Does this issue persist if you change cool_down_period_sec & target_utilization back to their defaults?
If your service is truly down during that time, maybe you should implement a request handler for liveness checks:
https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks
With the default polling config, GAE would detect the outage and relaunch the instance within a couple of minutes.
Another thing worth double checking is how long it takes your instance to start up.

Health API hammered on my GCP Flex application - how do I dial that back?

The GCP infra hits your app on '/_ah/health' to determine that it is still alive. My app is a 'Flex' type, and it is being hit a lot on that URL:
It is making the logs hard to navigate, apart from anything else :(
How do I dial back the frequency of GCP testing the health end-point?
You can configure how App Engine performs health checks against your VM instances hosting your apps as part of the health_check: configuration in your app.yaml.
Health checks
Periodic health check requests are used to confirm that a VM instance has been successfully deployed, and to check that a running instance maintains a healthy status. Each health check must be answered within a specified time interval. An instance is unhealthy when it fails to respond to a specified number of consecutive health check requests. An unhealthy instance will not receive any client requests, but health checks will still be sent. If an unhealthy instance continues to fail to respond to a predetermined number of consecutive health checks, it will be restarted.
Health check requests are enabled by default, with default threshold values. You can customize VM health checking by adding an optional health check section to your configuration file:
health_check:
  enable_health_check: True
  check_interval_sec: 5
  timeout_sec: 4
  unhealthy_threshold: 2
  healthy_threshold: 2
You can use the following options with health checks:
enable_health_check - Enable/disable health checks. Health checks are enabled by default. To disable health checking, set to False. Default: True
check_interval_sec - Time interval between checks. Default: 1 second
timeout_sec - Health check timeout interval. Default: 1 second
unhealthy_threshold - An instance is unhealthy after failing this number of consecutive checks. Default: 1 check
healthy_threshold - An unhealthy instance becomes healthy again after successfully responding to this number of consecutive checks. Default: 1 check
restart_threshold - When the number of failed consecutive health checks exceeds this threshold, the instance is restarted. Default: 300 checks
Health checks are turned on by default, and it is recommended not to turn them off, so that App Engine does not send requests to a VM that is not responding. Instead, you can raise check_interval_sec to adjust how often your VM is health checked.
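For example, the only knob that needs to change is the interval; the value 10 here is purely illustrative (note the deployment quirk with this setting described in the last answer below):
health_check:
  enable_health_check: True
  check_interval_sec: 10 # poll every 10 seconds instead of the 1-second default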
You can also always filter the health check entries out when viewing the logs. Check out the information on the advanced log filtering UI and syntax.
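As a rough illustration (the nginx.request log name and the field path are assumptions about how your requests are logged; adjust them to what your project shows), an exclusion filter like this hides the health check traffic while browsing:
resource.type="gae_app"
logName:"nginx.request"
-httpRequest.requestUrl:"/_ah/health"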
As others have mentioned, there's an app.yaml config page.
As of the date of posting, it doesn't say what the default is.
Here's what's super weird about the check_interval_sec var: my deployments are blocked by the GCP deployer for certain settings values.
A setting of 200 for 'check_interval_sec' caused the deployment to fail with the following message:
"Invalid value for field 'resource.checkIntervalSec': '6000'.
Must be less than or equal to 300"
A setting of 20 caused the deployment to fail with the following message:
"Invalid value for field 'resource.checkIntervalSec': '600'.
Must be less than or equal to 300"
A setting of 10 worked, and it is indeed ten seconds in the console (though a bunch of threads hit the JVM at the same moment for some reason).
TL;DR: seconds are not seconds in 'check_interval_sec' - at least for 'flex' apps.
