I'm running a very simple Node app in a GAE Flex custom runtime instance.
All of a sudden, seemingly out of nowhere, it shuts down, causing a short period of 503s before eventually coming back up.
I'm absolutely certain nobody did this manually.
What's going on? Are GAE apps expected to randomly shut down and restart?
Here's my config:
runtime: custom
api_version: '1.0'
env: flexible
threadsafe: true
automatic_scaling:
  cool_down_period: 120s
  min_num_instances: 1
  max_num_instances: 15
  cpu_utilization:
    target_utilization: 0.5
network: {}
liveness_check:
  initial_delay_sec: 300
  path: /
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 4
  success_threshold: 2
readiness_check:
  path: /
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
It turns out that GAE Flex instances are automatically restarted about once a week to apply OS and security updates. So when you have only one instance running, this behaviour is to be expected.
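If the brief outage matters, a common mitigation is to keep more than one instance warm, so a single restarting VM is less likely to leave you with no healthy instance. A minimal sketch based on the config above, changing only min_num_instances (the value 2 is an example, not a figure from the docs):
automatic_scaling:
  cool_down_period: 120s
  min_num_instances: 2   # keep a second instance warm so one restart doesn't take the whole service down
  max_num_instances: 15
  cpu_utilization:
    target_utilization: 0.5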
Related
My app.yaml configuration has a custom URL /my_readiness_check for readiness check instead of the default /readiness_check.
I see requests to the custom URL that succeed. Then I see my instance restarted with a SIGKILL to the app, and each restart is immediately preceded by a couple of 503 failures to /readiness_check.
runtime: python
env: flex
readiness_check:
  path: "/my_readiness_check"
  check_interval_sec: 10
  timeout_sec: 4
  failure_threshold: 1
  success_threshold: 2
  app_start_timeout_sec: 300
liveness_check:
  path: "/my_liveness_check"
  check_interval_sec: 30
  timeout_sec: 30
  failure_threshold: 3
  initial_delay_sec: 60
...
I deployed a simple Node.js server on Google App Engine Flex.
When it has one instance running, it receives three times as many liveness and readiness checks as it should, given the configuration in my app.yml file.
The documentation says:
If you examine the nginx.health_check logs for your application, you might see health check polling happening more frequently than you have configured, due to the redundant health checkers that are also following your settings. These redundant health checkers are created automatically and you cannot configure them.
Still, this looks like aggressive behaviour. Is this normal?
My app.yml config:
runtime: nodejs
env: flex
service: web
resources:
  cpu: 1
  memory_gb: 3
  disk_size_gb: 10
automatic_scaling:
  min_num_instances: 1
  cpu_utilization:
    target_utilization: 0.6
readiness_check:
  path: "/readiness_check"
  timeout_sec: 4
  check_interval_sec: 5
  failure_threshold: 2
  success_threshold: 1
  app_start_timeout_sec: 300
liveness_check:
  path: "/liveness_check"
  timeout_sec: 4
  check_interval_sec: 30
  failure_threshold: 2
  success_threshold: 1
Yes, this is normal. Three different locations are checking the health of your service, and you have configured the health check to run every five seconds. If you want less health-check traffic, change check_interval_sec: 5 to a larger number.
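For example, a sketch of the readiness block above with only the interval changed (30 seconds is an arbitrary example value):
readiness_check:
  path: "/readiness_check"
  timeout_sec: 4
  check_interval_sec: 30   # was 5; each of the redundant checkers now polls far less often
  failure_threshold: 2
  success_threshold: 1
  app_start_timeout_sec: 300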
I am using an app.yaml file to configure my App Engine service. Below is the file.
runtime: java
env: flex
resources:
  memory_gb: 6.5
  cpu: 5
  disk_size_gb: 20
automatic_scaling:
  min_num_instances: 6
  max_num_instances: 8
  cpu_utilization:
    target_utilization: 0.6
handlers:
- url: /.*
  script: this field is required, but ignored
network:
  session_affinity: true
Now when I click the "view" link in the version list in the Cloud Console, I see the config below.
runtime: java
api_version: '1.0'
env: flexible
threadsafe: true
handlers:
- url: /.*
  script: 'this field is required, but ignored'
automatic_scaling:
  cool_down_period: 120s
  min_num_instances: 6
  max_num_instances: 8
  cpu_utilization:
    target_utilization: 0.6
network: {}
resources:
  cpu: 5
  memory_gb: 6.5
  disk_size_gb: 20
liveness_check:
  initial_delay_sec: 300
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 4
  success_threshold: 2
readiness_check:
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
As you can see, the network property is still blank. If I change other parameters such as cpu or min_num_instances, they all get reflected, but the section below does not, and I'm not sure why:
network:
  session_affinity: true
Actually, this is a known issue for App Engine; the status can be tracked at this link.
You can use gcloud beta app deploy as a workaround to get session affinity working until the issue is resolved.
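For example, deploying the same app.yaml through the beta command instead of the regular one:
gcloud beta app deploy app.yaml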
You may need to add an instance_tag and a name. The others are optional:
network:
  instance_tag: TAG_NAME
  name: NETWORK_NAME
  session_affinity: true (optional)
  subnetwork_name: SUBNETWORK_NAME (optional)
  forwarded_ports: (optional)
    - PORT
    - HOST_PORT:CONTAINER_PORT
    - PORT/tcp
    - HOST_PORT:CONTAINER_PORT/udp
I have a Flask app that deploys fine in the Google App Engine flexible environment, but some new updates have made it relatively resource intensive (I was receiving a [CRITICAL] WORKER TIMEOUT message). In attempting to fix this issue I wanted to increase the number of CPUs for my app.
app.yaml:
env: flex
entrypoint: gunicorn -t 600 --timeout 600 -b :$PORT main:server
runtime: python
threadsafe: false
runtime_config:
  python_version: 2
automatic_scaling:
  min_num_instances: 3
  max_num_instances: 40
  cool_down_period_sec: 260
  cpu_utilization:
    target_utilization: .5
resources:
  cpu: 3
After some time I receive:
"Updating service [default] (this may take several minutes)...failed.
ERROR: (gcloud.app.deploy) Error Response: [13] An internal error occurred during deployment."
Is there some sort of permission issue preventing me from increasing the CPUs? Or is my app.yaml invalid?
You cannot set the number of CPU cores to an odd number other than 1; it has to be even.
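So, assuming the app really does need more than one core, the resources block has to use an even value, for example (4 is simply the nearest even choice, not a sizing recommendation):
resources:
  cpu: 4   # cpu must be 1 or an even number; 3 is not accepted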
I've noticed that when I upgrade from the vm: true tag to the new env: flex one, the autoscaler creates an unbounded number of instances during the deployment process. This only happens when I'm using the new env flag. This is my conf file:
runtime: custom
vm: true
service: default
threadsafe: true
health_check:
  enable_health_check: True
  check_interval_sec: 5
  timeout_sec: 4
  unhealthy_threshold: 2
  healthy_threshold: 2
  restart_threshold: 60
automatic_scaling:
  min_num_instances: 1
  cool_down_period_sec: 60
  cpu_utilization:
    target_utilization: 0.9
It would be great if anyone could help, because this problem is blocking me from migrating my VMs.
Cheers
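Not a confirmed fix, but one defensive tweak worth trying while debugging: the config above sets no upper bound on scaling, so explicitly capping the autoscaler at least limits how far a bad rollout can grow (the value 5 below is an arbitrary example):
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 5   # explicit cap so the deployment cannot keep adding instances beyond this
  cool_down_period_sec: 60
  cpu_utilization:
    target_utilization: 0.9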