High latency on webapp2 endpoint under appengine

I asked about this in the appengine user group here and wasn't able to resolve the issue.
The issue I'm having is that latency seems to be a problem for a seemingly very light endpoint, and for others like it. Here's an example request as shown in GCP's Trace tool:
I'm aware that a request that spawned a new instance, or high memory usage, would explain high latency; neither is the case here.
It seems that intermittently some endpoints simply take a good second or two to respond on top of however long the endpoint itself takes to do its job, and the bulk of that is "untraced time" in GCP Stackdriver's Trace tool. I put a log entry as early as I possibly could, in webapp2's RequestHandler object on initialization. You can see that in the screenshot as "webapp2 request init".
I'm not familiar enough with webapp2's inner workings to know where else I could put a log that could help explain this, if anywhere.
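Not from the original post, but a minimal sketch of where that log sits, plus one place to log even earlier: a thin WSGI wrapper around the webapp2 application, which runs before routing and handler construction. Handler and route names here are just placeholders.

import logging
import time

import webapp2


class EarlyLogMiddleware(object):
    """Logs as soon as the WSGI callable is invoked, before webapp2 routes or builds the handler."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        logging.info('WSGI entered at %.6f for %s',
                     time.time(), environ.get('PATH_INFO'))
        return self.app(environ, start_response)


class PingHandler(webapp2.RequestHandler):
    def __init__(self, request, response):
        logging.info('webapp2 request init')  # the early log mentioned above
        super(PingHandler, self).__init__(request, response)

    def get(self):
        self.response.write('ok')


# Comparing the middleware timestamp with the handler-init timestamp shows how
# much of the "untraced time" is spent before the framework is even entered.
app = EarlyLogMiddleware(webapp2.WSGIApplication([('/ping', PingHandler)]))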
These are the scaling settings for that appengine service defined in my yaml file:
instance_class: F4
automatic_scaling:
  max_idle_instances: 1
  max_pending_latency: 1s
  max_concurrent_requests: 50
Not sure what other information would be useful here.

Related

Uptime checks keeping some AppEngine instances alive and others not?

We've noticed that using GCP's Monitoring Uptime checks on one AppEngine service appears to keep it 'alive' unnecessarily - removing the monitoring configuration reduced the running instances to 0. However, we have two other AppEngine services that would happily reduce to 0 instances, even with the monitoring in place.
We've been unable to find any difference in configuration. The one other visible difference we spotted was in the 'Traffic' graphs: the services whose instances still shut down include 'Sent (cached)' and 'Received (cached)' as series on the graph (alongside Sent and Received):
Whereas the 'problem' service only has Sent and Received:
There is no cloud load balancing in place - we're just using AppEngine to map the endpoints.
Config for both looks like this, with different handlers configured:
runtime: python310
env: standard
instance_class: F1
handlers:
- *snip*
automatic_scaling:
  min_idle_instances: automatic
  max_idle_instances: automatic
  min_pending_latency: automatic
  max_pending_latency: automatic
service_account: XXXX@appspot.gserviceaccount.com
Can anyone clarify what might be different between these? Thank you.

Request Entity Too Large (App Engine + Docker + Java)

I am aware that App Engine has a 32 MB request upload limit. I am wondering if that could be increased.
A lot of other research suggests that I need to use the Blobstore API directly; however, my application has a special requirement that prevents me from using it.
Other issues suggest that you can modify the nginx config in your custom flex environment. However, when I SSH'd into the instance I did not see any nginx. I have reason to believe that it's the GAE load balancer blocking the request before it even reaches the application.
Here is my setup.
GAE Flex Environment
Custom Runtime, Java using Docker
Objective: I want to increase the client_max_body_size to 100 MB.
As you can see here, this limit is stated in the official documentation. There is no way to increase that limit, as it is tied to the runtime environment itself. You can use the Go environment, which has a limit of 64 MB.
This issue is discussed in other forums as well, but for now you just need to handle these kinds of requests programmatically: check if they are bigger than 32 MB, and if they are, split them somehow and aggregate the results.
As a workaround you can also store the data in Google Cloud Storage as a temporary step in your workflow.
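That Cloud Storage workaround can be sketched roughly as below. This is a hypothetical illustration in Python (the setup in the question is Java, but the flow is language-agnostic), with bucket and object names as placeholders: the client uploads the large payload directly to Cloud Storage via a signed URL, so it never passes through the App Engine request path at all.

# Hypothetical sketch: hand the client a V4 signed URL so the 100 MB payload
# goes straight to Cloud Storage instead of through the 32 MB request limit.
from datetime import timedelta

from google.cloud import storage


def make_upload_url(bucket_name, object_name):
    """Returns a short-lived signed URL allowing a single PUT to the given object."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    return blob.generate_signed_url(
        version='v4',
        expiration=timedelta(minutes=15),
        method='PUT',
        content_type='application/octet-stream',
    )

# The client then PUTs the payload to this URL (with the same Content-Type
# header) and only sends the small object name back to the application.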

GCP App Engine flex (GAE): Error when deploying

When deploying using gcloud app deploy I get the following error:
Timed out waiting for the app infrastructure to become healthy
I contacted GCP Support and they told me the same thing I had read in other threads:
the error you are referring to may be related to the Compute Engine “In-Use IP Addresses” Quota limit. You can view your current quota limit information by accessing from your GCP menu “IAM & Admin > Quotas”.
I checked the "In-Use IP Addresses" and it doesn't seem like I have a problem with quotas:
Looking into the error, I found that the Activity tab shows an error when deploying. Apparently, when App Engine tries to delete a VM, the process starts to loop trying to delete it. You can see the error:
(I intentionally covered the project ID)
Edit: It seems like the problem is only with southamerica-east1. I created a new project in southamerica-east1 but I kept getting the same error, so then I created a new project with App Engine in us-west2 and it worked like a charm (I used the same application and app.yaml). I wonder if the problem is GCP's southamerica-east1 region or an unknown misconfiguration on my side.
This is probably related to this issue: https://issuetracker.google.com/u/2/issues/73583699. It does mention the "in-use IP Address" quota, but many people have posted in recent days (Nov 2018) indicating that they are seeing the error and have verified that they have not hit their quota.
Unfortunately, no solution has been posted and there hasn't been any recent comment from the devs.
First, our apologies that you’ve experienced this issue. Be assured that we are aware of the situation and the team works hard to resolve it.
Our goal is to make sure that there are available resources in all zones. This type of issue is rare. When a situation like this occurs, or is about to occur, our team is notified immediately and the issue is investigated.
We recommend deploying and balancing your workload across multiple zones or regions to reduce the likelihood of an outage. Please review our documentation, which outlines how to build resilient and scalable architectures on Google Cloud Platform.
For the time being, you can try relaxing your requirements (e.g. requesting a smaller instance or one with fewer resources) or removing the external IP requirement.
If that proves not to be enough, you can try deploying your application to another region.
Again, we want to offer our sincerest apologies.
Thanks for understanding.
In the end we didn't find a real solution, so we moved all our services from Brazil to us-west2. I'm not sure if the region is the problem, but over in us-west2 everything works like a charm.

Google AppEngine Flex Spamming Liveness and Readiness Checks

I'm trying to debug 502 errors coming out of the nginx container with my AppEngine Flex setup.
I noticed that the logs indicate liveness and readiness checks being spammed very rapidly (see attached).
For clarification, this is currently running a single instance in manual_scaling mode.
check_interval_sec is set to 30s on liveness_check and 5s on readiness_check.
Can anyone provide insight into what is going on here?
It looks like you set up the readiness and liveness checks too aggressively in your app.yaml. Please keep in mind that the checks run against every instance, so if you have a lot of instances, they will occur frequently.
If you only have one instance set up, then the behavior contradicts what the documentation describes. Please file an issue with us on the issue tracker.
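For context (not from the original thread): with the updated split health checks on Flex, each instance is probed at /liveness_check and /readiness_check, and in a custom runtime the application itself has to answer those paths; if it doesn't respond, the nginx sidecar can surface 502s. A minimal sketch of answering both paths, assuming a plain Python WSGI app (the asker's stack isn't stated):

# Hypothetical sketch: return 200 for the documented Flex health-check paths.
def app(environ, start_response):
    path = environ.get('PATH_INFO', '')
    if path in ('/liveness_check', '/readiness_check'):
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'ok']
    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return [b'not found']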

Basic scaling instances of GAE don't shutdown even when idle-timeout is far exceeded

I have configured a version of my default service on Google App Engine Standard (Java, though that shouldn't make any difference) to use basic scaling and run a single B2 instance:
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>${app.id}</application>
  <version>tasks</version>
  <threadsafe>true</threadsafe>
  <runtime>java8</runtime>
  <module>default</module>
  <instance-class>B2</instance-class>
  <basic-scaling>
    <idle-timeout>60s</idle-timeout>
    <max-instances>1</max-instances>
  </basic-scaling>
  <!-- Other stuff -->
</appengine-web-app>
Despite not getting any requests for almost 28 minutes, the instance did not shut down by itself (I manually shut it down with appcfg.cmd stop_module_version...):
There are no background threads.
Why doesn't this instance shut down? Two days prior the instance ran almost the whole day, idle, with this configuration... so what's the issue?
The definition of idle is that no new requests are received in x amount of time. What if the last request is taking 20 minutes to execute? Shouldn't idle be defined as the time since the last request finished?
I posted this question on ServerFault (since this is not a programming question) but was told that Stack Overflow would be the better site...
GCP Support here:
I tried to reproduce the same behaviour, but whether using 1m or 60s, instances would be shut down after serving their last request.
Nonetheless, when I had any long-lasting requests, threads, and/or task queue tasks running for minutes, the instance wouldn't be shut down until that work was finished. You can also find this information here, for both manual/basic scaling:
Requests can run for up to 24 hours. A manually-scaled instance can choose to handle /_ah/start and execute a program or script for many hours without returning an HTTP response code. Task queue tasks can run up to 24 hours.
In your case, it seems there was a request that lasted for minutes before finishing, and thus the instance was active (rather than idle) the whole time until you manually stopped it. You may find the Instance life cycle documentation useful as well.
If you believe this isn't the behaviour you experienced back then, I'd suggest creating a private issue in the issue tracker so we can investigate further. Make sure to provide the project number and all required details there (fresh samples). Once created, share the issue number so we can look into it.
