Errors in vm.syslog and Memory Usage constantly increasing on NodeJS AppEngine - google-app-engine

I am having a problem on some of my AppEngine projects: for the past few days I have been seeing a lot of errors in my vm.syslog logs in Stackdriver Logging (they seem to occur when a health check arrives).
Specifically, these are:
write_gcm: Server response (CollectdTimeseriesRequest) contains errors:#012{#012 "payloadErrors": [#012 {#012 "index": 71,#012 "error": {#012 "code": 3,#012 "message": "Expected 4 labels. Found 0. Mismatched labels for payload [values {\n data_source_name: \"value\"\n data_source_type: GAUGE\n value {\n double_value: 694411264\n }\n}\nstart_time {\n seconds: 1513266364\n nanos: 618061284\n}\nend_time {\n seconds: 1513266364\n nanos: 618061284\n}\nplugin: \"processes\"\nplugin_instance: \"all\"\ntype: \"ps_rss\"\n] on resource [type: \"gce_instance\"\nlabels {\n key: \"instance_id\"\n value: \"xxx\"\n}\nlabels {\n key: \"zone\"\n value: \"europe-west2-a\"\n}\n] for project xxx"#012 }#012 }#012 ]#012}
write_gcm: Unsuccessful HTTP request 400: {#012 "error": {#012 "code": 400,#012 "message": "Field timeSeries[11].metric.labels[1] had an invalid value of \"health_check_type\": Unrecognized metric label.",#012 "status": "INVALID_ARGUMENT"#012 }#012}
write_gcm: Error talking to the endpoint.
write_gcm: wg_transmit_unique_segment failed.
write_gcm: wg_transmit_unique_segments failed. Flushing.
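A note on reading these messages: the `#012` sequences are syslog's octal escape for a newline character, so each message is really a multi-line JSON document. Below is a minimal Python sketch, assuming the log line has been copied into a string, that restores the newlines and parses the JSON body of the second error. The helper name `decode_syslog_json` is hypothetical and not part of any Google tooling.

```python
import json


def decode_syslog_json(line: str) -> dict:
    """Undo syslog's #012 newline escaping and parse the JSON object in the line."""
    text = line.replace("#012", "\n")
    start = text.index("{")  # the JSON body starts at the first opening brace
    return json.loads(text[start:])


# The "Unsuccessful HTTP request 400" line from the log above.
sample = (
    'write_gcm: Unsuccessful HTTP request 400: {#012 "error": {#012 "code": 400,'
    '#012 "message": "Field timeSeries[11].metric.labels[1] had an invalid value of '
    '\\"health_check_type\\": Unrecognized metric label.",'
    '#012 "status": "INVALID_ARGUMENT"#012 }#012}'
)

error = decode_syslog_json(sample)["error"]
print(error["status"], "-", error["message"])
```

Decoded this way, the message shows that the monitoring endpoint rejected a metric label (`health_check_type`) sent by the agent, which is consistent with the errors appearing around health checks.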
At the same time, I noticed that the Memory Usage shown in the AppEngine dashboard for the very same projects keeps increasing over time, to the point where it reaches the maximum amount available and the instance restarts, returning a 502 error to anyone visiting the website that the app serves.
None of this (neither the errors above nor the memory increase) is happening on a couple of projects that have not been updated for at least 2 weeks, but it does happen on a newly created instance deployed with the same codebase as one of the healthy projects. In addition, I don't see any increase in memory when running my project locally.
Can someone kindly tell me if they have experienced something similar, or if they think that the errors and the memory increase are related? I haven't changed my yaml file for deployment recently and I haven't specified any custom configuration for the health checks (which run in legacy mode at the default rate).
Thank you for your help,
Nicola

Similar question here: App Engine Deferred: Tracking Down Memory Leaks
I'm going through the same thing on Compute Engine on a single VM. I've tried increasing memory but the problem persists. It seems to be tied to a Stackdriver method call. I'm not sure what to do; it causes my machines to stop after about 24 hours. In my case, I'm getting information every 3 seconds from a set of APIs, but the error comes up every minute on serial port 1 (console), which makes me suspect that it is some kind of failure outside of my code. More from Google here: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.collectdTimeSeries/create .

I'm not sure about all of the errors, but for the "write_gcm: Server response (CollectdTimeseriesRequest)" I had the same issue and contacted Google Cloud Support. They told me that the Stackdriver service has been updated recently to accept more detailed information on ps_rss metrics, but it has caused metrics from older agents to not be sent at all.
You should be able to fix this issue by upgrading your Stackdriver agent to the latest version. On Compute Engine (which is what I was running) you have control over this. I'm not sure how you'd do it on AppEngine; maybe trigger a new deploy?

Related

Google App Engine: Intermittent Issue: Process terminated because the request deadline was exceeded. (Error code 123)

Problem you have encountered: I have deployed a Spring Boot application (backend) on Google App Engine. For the past few days, I have been getting the following error intermittently:
Error: Process terminated because the request deadline was exceeded. (Error code 123)
The application uses a Cloud SQL (MySQL) database. I have also set up the following auto-scaling properties:
<sessions-enabled>true</sessions-enabled>
<warmup-requests-enabled>true</warmup-requests-enabled>
<instance-class>F2</instance-class>
<automatic-scaling>
  <target-cpu-utilization>0.65</target-cpu-utilization>
  <min-instances>10</min-instances>
  <max-instances>20</max-instances>
  <min-idle-instances>5</min-idle-instances>
  <max-idle-instances>6</max-idle-instances>
  <min-pending-latency>30ms</min-pending-latency>
  <max-pending-latency>500ms</max-pending-latency>
  <max-concurrent-requests>10</max-concurrent-requests>
</automatic-scaling>
<inbound-services>
  <service>warmup</service>
</inbound-services>
What you expected to happen: The application hosted on GAE shouldn't fail with this intermittent error
Steps to reproduce: No steps available, it's intermittent.
Other information (workarounds you have tried, documentation consulted, etc): I have tried multiple different combinations for the configurations of autoscaling, etc.
I just had the exact same issue in a GAE Standard Java 8 application. After A LOT of trial and error I found that the issue was related to Cloud SQL. One symptom was that autoscaled instances restarted (probably crashed) frequently, as if something was stuck. The "Error: Process terminated because the request deadline was exceeded. (Error code 123)" also produced no further logs.
The fix in our case (identified because it was the only thing we had changed right before the error started appearing all the time) concerned a Cloud SQL query that our application ran frequently and that contained an expression like this:
case when column=1 then 1 else -1 end
column could be NULL in some (rare) cases. This was no problem in the normal SQL client we tested with, but for some weird reason Cloud SQL with JDBC had a problem with it, causing these instance issues.
Changing it to
case when coalesce(column,0)=1 then 1 else -1 end
so that there would never be a comparison against NULL in the case statement solved the problem.
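For context: in standard SQL a comparison against NULL evaluates to NULL rather than true, so both forms of the CASE expression should fall through to the else branch and return -1 for a NULL value. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in. It does not use Cloud SQL, MySQL or JDBC, so it cannot reproduce the instance restarts described above; the table `t` and column `col` are hypothetical names.

```python
import sqlite3


def compare_case_expressions(values):
    """Evaluate both CASE variants for each value and return (col, original, rewritten) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col INTEGER)")
    conn.executemany("INSERT INTO t (col) VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        "SELECT col, "
        "CASE WHEN col = 1 THEN 1 ELSE -1 END, "               # original expression
        "CASE WHEN COALESCE(col, 0) = 1 THEN 1 ELSE -1 END "   # NULL-safe rewrite
        "FROM t ORDER BY rowid"
    ).fetchall()
    conn.close()
    return rows


rows = compare_case_expressions([1, 0, None])
for col, original, rewritten in rows:
    print(col, original, rewritten)
```

If the two result columns agree for every row, as they should under standard NULL semantics, the coalesce rewrite changes how the comparison is evaluated without changing what the query returns.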

GCP cloud task triggered API throwing 500 and fails 100%

I am continuously facing the issue below with one POST request that is triggered from Google Cloud Tasks. The Node application is deployed on Google App Engine. I even tried increasing the number of instances, but no luck.
Process terminated because the request deadline was exceeded. Please ensure that your HTTP server is listening for requests on 0.0.0.0 and on the port defined by the PORT environment variable. (Error code 123)
Any help would be useful. Thanks in advance
I have run into this issue in the past. The question, as posted, doesn't have enough information to narrow down the issue, but here are a few things you can try:
Increase memory size: It's possible that your instances are running out of memory and failing to start. Try increasing memory size of the instance to see if that helps.
Billing: Make sure all of your billing information is up to date and the issue isn't related to cost.
Warmups: Do you have a warmup, liveness check, or readiness check turned on? If so, make sure they are working properly.
Add Logging: The request has no logs displayed. You should try adding logging.info(...) statements in your code to see how far it gets before failing.
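Two of the points above (listening on 0.0.0.0 on the port given by the PORT environment variable, as the error message asks, and adding log statements to see how far a request gets) can be sketched as follows. The application in the question is a Node app, so this Python version using only the standard library is purely illustrative; `listen_address`, `TaskHandler` and `serve` are hypothetical names, and the fallback port 8080 is an assumption for local runs.

```python
import logging
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

logging.basicConfig(level=logging.INFO)


def listen_address(environ=os.environ):
    """Return the (host, port) pair to bind: all interfaces, port taken from PORT."""
    return ("0.0.0.0", int(environ.get("PORT", "8080")))  # 8080 is an assumed fallback


class TaskHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Log at each stage so the logs show how far the request got before failing.
        logging.info("POST %s received", self.path)
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        logging.info("read %d bytes, starting work", len(body))
        # ... the actual task handling would go here ...
        self.send_response(200)
        self.end_headers()
        logging.info("POST %s finished", self.path)


def serve():
    """Start the server; blocks until the process is stopped."""
    host, port = listen_address()
    logging.info("listening on %s:%d", host, port)
    HTTPServer((host, port), TaskHandler).serve_forever()
```

Calling `serve()` starts the server. If the "received" line never shows up in the logs for the failing request, the problem is more likely in startup or routing than in the handler itself.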

Increase Per-User Limit for Google API: Saving not possible, always error message "Your input was invalid"

I've the same problem as this user:
http://productforums.google.com/forum/#!topic/analytics/mOOrYoTVWgo
"I am trying to increase the Per-User Limit from 1 requests/second/user to 10 requests/second/user for the Analytics API. I keep getting an error that says "Your input was invalid". I even tried updating without any changes and I keep getting the same error. It appears that there may be a bug with increasing Per-User Limits?"
What can I do?
I was facing the exact same issue & found a workaround. I disabled all the other API services from the "Services" menu in my Google API Console. I turned "off" every other service except the Analytics API, then managed to update the per-user limit for Analytics API. After doing this, I then went back and re-activated the other API services.
Update: Google has fixed the bug, now it works.
If you are still having this issue, limits may have changed on one of the APIs you are already using. Play around with some of your higher-rate services and see if lowering their limits allows you to update the one you want.

Intermittent "500 Server Error" on "/_ah/openid_verify" using AppEngine Federated / OpenID login

I am getting this error about 20% of the time. I've dumped and compared traffic on successful and failed requests and there is no noticeable difference:
There's nothing in the AppEngine logs or dashboard, and also no way to catch exceptions on requests that hit "/_ah" URLs. I've attached a script that tries the login every 5 minutes, as well as the traffic dumps for successful and failed requests.
I would really appreciate it if someone from Google could take a look at this. The error definitely occurs deep in the bowels of the AppEngine OpenID implementation and there is no way for an outsider to see such errors.
Thanks,
Graeme
https://dl.dropboxusercontent.com/u/6618078/AppEngine%20OpenID%20error/error.dump
https://dl.dropboxusercontent.com/u/6618078/AppEngine%20OpenID%20error/success.dump
https://dl.dropboxusercontent.com/u/6618078/AppEngine%20OpenID%20error/test.sh
It could be related to this bug that is only 3.5 years old and not fixed as they don't consider it a "production issue".
https://code.google.com/p/googleappengine/issues/detail?id=3589
The bug is about non-gmail accounts but I have the same server error with gmail accounts (started today for me, 01/12/2014).
No error acknowledged here https://code.google.com/status/appengine.

Random 500 errors on AppEngine

I have a fairly big application which went through a major overhaul.
The newer version uses a lot of JSONP calls and I notice 500 server errors. Nothing is logged in the logs section to determine the cause of the error. It happens on JS files, PNG files and even Jersey (servlets) too.
Searching SO and groups suggested that these errors are common during deployment. But it happens even hours after deployment.
BTW, the application has become slightly bigger and in a few rare cases it even causes a deadline exception while starting instances. Sometimes it starts and serves within 6-10 secs. Sometimes it takes more than 75 secs, thereby causing a timeout for a similar request. I see the same behavior for warmup requests too. Nothing custom is loaded during app warmup.
I feel like you should be seeing the errors in your logs. Are you exceeding quotas or having deadline errors? Perhaps you have an error in your error handler like your file cannot be found, or the path to the error handler overlaps with another static file route?
To troubleshoot, I would implement custom error pages so you could determine the actual error code. I'm assuming Python since you never specified what language you are using. Add the following to your app.yaml and create static html pages that will give the recipient some idea of what's going on and then report back with your findings:
error_handlers:
- file: default_error.html
- error_code: over_quota
  file: over_quota.html
- error_code: dos_api_denial
  file: dos_api_denial.html
- error_code: timeout
  file: timeout.html
If you already have custom error handlers, can you provide some of your app.yaml so we can help you?
Some 500s are not logged in your application logs. They are failures at the front-end of GAE. If, for some reason, you have a spike in requests and new instances of your application cannot be started fast enough to serve those requests, your client may see 500s even though those 500s do not appear in your application's logs. GAE team is working to provide visibility into those front-end logs.
I just saw this myself... I was researching some logs of visitors who only loaded half of the graphics files on a page. I tried clicking on the same link on a blog that they used to get to our site. In my case, I saw a 500 error in the Chrome browser developer console for a js file. Yet when I looked at the GAE logs, they said the file was served correctly with a 200 status. That js file loads other images, which were not loaded. In my case, it was an https request.
It is really important for us to know our customer experience (obviously). I wanted to let you know that this problem is still occurring. Just having it show up in the logs would be great, even attach a warm-up error to it or something so we know it is an unavoidable artefact of a complex server system (totally understandable). I just need to know if I should be adding instances or something else. This error did not wait for 60 seconds, maybe 5 to 10 seconds. It is like the round trip for SSL handshaking failed in the middle but the logs showed it as success.
So can I increase any timeout for the handshake or is that done on the browser side?

Resources