Google App Engine throws error on Basic Scaling - google-app-engine

I'm using golang & Google App Engine for the project. I had a task where I received a huge file, split it into lines and sent these lines one by one to a queue to be resolved. My initial setting for scaling inside the app.yaml file was the following:
instance_class: F1
automatic_scaling:
  min_instances: 0
  max_instances: 4
  min_idle_instances: 0
  max_idle_instances: 1
  target_cpu_utilization: 0.8
  min_pending_latency: 15s
It was working alright, but it had an issue: since there were really a lot of tasks, it would fail 10 minutes later (as per the documentation, of course). So I decided to use the B1 instance class instead of F1 - and this is where things went wrong.
My setup for B1 looks like this:
instance_class: B1
basic_scaling:
  max_instances: 4
Now, I've created a very simple demo to demonstrate the idea:
r.GET("foo", func(c *gin.Context) {
_, err := tm.CreateTask(&tasks.TaskOptions{
QueueID: "bar",
Method: "method",
PostBody: "foooo",
})
if err != nil {
lg.LogErrorAndChill("failed, %v", err)
}
})
r.POST("bar/method", func(c *gin.Context) {
data, err := c.GetRawData()
if err != nil {
lg.LogErrorAndPanic("failed", err)
}
fmt.Printf("data is %v \n", string(data))
})
To explain the logic behind it: I send a request to "foo", which creates a task and adds it to the queue with some body text. The task then makes a POST call based on the QueueID and Method parameters, and the handler receives the text and, in this simple example, just logs it.
Now, when I send the request, I get a 500 error, which looks like this:
[GIN] 2021/10/05 - 19:38:29 | 500 | 301.289µs | 0.1.0.3 | GET "/_ah/start"
And in the logs I can see:
Process terminated because it failed to respond to the start request with an HTTP status code of 200-299 or 404.
And inside the queue in the task (reason to retry):
INTERNAL(13): Instance Unavailable. HTTP status code 500
Now, I've read the documentation and I know about the following:
Manual, basic, and automatically scaling instances startup differently. When you start a manual scaling instance, App Engine immediately sends a /_ah/start request to each instance. When you start an instance of a basic scaling service, App Engine allows it to accept traffic, but the /_ah/start request is not sent to an instance until it receives its first user request. Multiple basic scaling instances are only started as necessary, in order to handle increased traffic. Automatically scaling instances do not receive any /_ah/start request.
When an instance responds to the /_ah/start request with an HTTP status code of 200–299 or 404, it is considered to have successfully started and can handle additional requests. Otherwise, App Engine terminates the instance. Manual scaling instances are restarted immediately, while basic scaling instances are restarted only when needed for serving traffic.
But it is not really helpful - I don't understand why the app does not respond properly to the /_ah/start request, and I am not really sure how to debug or fix it, especially since the F1 instance was working OK.

Requests to the URL /_ah/start are routed to your app, and your app apparently is not ready to handle them, which leads to the 500 response. Check your logs.
Basically, your app needs to be ready to handle incoming requests to the URL /_ah/start (the same way it is ready to handle requests to /foo). If you run the app locally, try to open that URL (via curl etc.) and see what the response is. It needs to respond with a status code of 200–299 or 404 (as mentioned in the text you quoted), otherwise it won't be considered a successfully started instance.
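For example, with gin (as in the question), a minimal sketch of such a handler could look like the one below. The package layout and the r.Run call are assumptions made only to keep the example self-contained; the part that matters is the /_ah/start route returning a 2xx status.

package main

import (
    "net/http"

    "github.com/gin-gonic/gin"
)

func main() {
    r := gin.Default()

    // App Engine sends GET /_ah/start to a basic scaling instance once it
    // receives its first user request; any 2xx (or 404) response marks the
    // instance as successfully started.
    r.GET("/_ah/start", func(c *gin.Context) {
        c.Status(http.StatusOK)
    })

    // ... register the /foo and /bar/method handlers from the question here ...

    // gin listens on $PORT if set, falling back to :8080
    r.Run()
}

Running this locally, curl -i http://localhost:8080/_ah/start should then return 200 OK, which is the kind of response App Engine expects when it probes the instance.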

Related

Network request failed from fetch in reactjs app

I am using fetch in a NodeJS application. Technically, I have a ReactJS front-end calling the NodeJS backend (as a proxy), and then the proxy calls out to backend services on a different domain.
However, from logging errors from consumers (I haven't been able to reproduce this issue myself) I see that a lot of these proxy calls (using fetch) throw an error that just says Network Request Failed, which is of no help. Some context:
This only occurs on a subset of all total calls (let's say 5% of traffic)
Users that encounter this error can often make the same call again some time later (next couple minutes/hours/days) and it will go through
From Application Insights, I can see no correlation between browsers, locations, etc
Calls often return fast, like < 100 ms
All calls are HTTPS, none are HTTP
We have a fetch polyfill from fetch-ponyfill that will take over if fetch is not available (Internet Explorer). I did test this package itself and the calls went through fine. As mentioned, this error also occurs on browsers that do support fetch, so I don't think the polyfill is the cause.
Fetch settings for all requests
Method is set per request, but I've seen it fail on different types (GET, POST, etc)
Mode is set to 'same-origin'. I thought this was odd, since we were sending a request from one domain to another, but I tried to set it differently and it didn't affect anything. Also, why would some requests work for some, but not for others?
Body is set per request, based on the data being sent.
Headers is usually just Accept and Content-Type, both set to JSON.
I have tried researching this topic before, but most posts I found referenced React native applications running on iOS, where you have to set some security permissions in the plist file to allow HTTP requests or something to do with transport security.
I have implemented logging at specific points in Application Insights, and I can see that fetch() was called, but then() was never reached; it went straight to .catch(). So it's not even reaching the code that parses the request, because apparently no response came back (we then parse the JSON response and call other functions, but like I said, it doesn't even reach that point).
Which is also odd, since the response never comes back, yet it fails (often) within 100 ms.
My suspicions:
Some consumers have some sort of add-on for their browser that is messing with the request. Although, I run with uBlock Origin and HTTPS Everywhere and I have not seen this error. I'm not sure what else could be modifying requests in a way that would cause them to immediately fail.
The call goes through, which then reaches an Azure Application Gateway, which might fail for some reason (too many connected clients, not enough ports, etc) and returns a response that immediately fails the fetch call without running the .then() on the response.
For #2, I remember I had traced a network call that failed and returned Network Request Failed: Made it through the proxy -> made it through the Application Gateway -> hit the backend services -> backend services sent a response. I am currently requesting access to backend service logs in order to verify this on some more recent calls (last time I did this, I did it through a screenshare with a backend developer), and hopefully clear up the path back to the client (the ReactJS application). I do remember though that it made it to the backend services successfully.
So I'm honestly not sure what's going on here. Does anyone have any insight?
Based on your excellent description and detective work, it's clear that the problem is between your Node app and the other domain. The other domain is throwing an error and your proxy has no choice but to say that there's an error on the server. That's why it's always throwing a 500-series error, the Network Request Failed error that you're seeing.
It's an intermittent problem, so the error is inconsistent. It's a waste of your time to continue to look at the browser because the problem will have been created beyond that, either in your proxy translating that request or on the remote server. You have to find that error.
Here's what I'd do...
Implement brute-force logging in your Node app. You can use Bunyan, Winston, or just require('fs') and write out to a file when an error occurs. Then look at the results. Only log when the response code from the other server is in the 400 or 500 range. Log the request object and the response object.
Something like this with Bunyan:
fetch(urlToRemoteServer)
    .then(res => {
        if (!res.ok) {
            // Log our request and the failing response from the other domain
            log.info({request: req, status: res.status, statusText: res.statusText});
        }
        return res.json();
    })
    .then(body => whateverElseYoureDoing(body))
    .catch(err => {
        // Network-level failure: there is no response object here, only the error
        log.info({request: req, err: err});
    });
where the res in this case is the response we just got from the other domain and req is our request to them.
The logs on your Azure server will then have the entire request and response. From this you can find commonalities and, hopefully, the cause of the problem.

App Engine urlfetch DeadlineExceededError

I have 2 services. One is hosted on Google App Engine and one is hosted on Cloud Run.
I use urlfetch (Python 2) imported from google.appengine.api in GAE to call APIs provided by the Cloud Run.
Occasionally there are a few (like <10 per week) DeadlineExceededError shown up like this:
Deadline exceeded while waiting for HTTP response from URL
But these few days such error suddenly occurs frequently (like ~40 per day). Not sure if it is due to Christmas peak hour or what.
I've checked Load Balancer logs of Cloud Run and turned out the request has never reached the Load Balancer.
Has anyone encountered similar issue before? Is anything wrong with GAE urlfetch?
I found a conversation which is similar, but the suggestion was to handle the error...
Wonder what can I do to mitigate the issue. Many thanks.
Update 1
Checked again, found some requests from App Engine did show up in Cloud Run Load Balancer logs but the time is weird:
e.g.
Logs from GAE project
10:36:24.706 send request
10:36:29.648 deadline exceeded
Logs from Cloud Run project
10:36:35.742 reached load balancer
10:36:49.289 finished processing
Not sure why it took so long for the request to reach the Load Balancer...
Update 2
I am using GAE Standard located in US with the following settings:
runtime: python27
api_version: 1
threadsafe: true
automatic_scaling:
  max_pending_latency: 5s
inbound_services:
- warmup
- channel_presence
builtins:
- appstats: on
- remote_api: on
- deferred: on
...
The Cloud Run hosted API gateway I was trying to call is located in Asia. In front of it there is a Google Load Balancer whose type is HTTP(S) (classic).
Update 3
I wrote a simple script to directly call Cloud Run endpoint using axios (whose timeout is set to 5s) periodically. After a while some requests were timed out. I checked the logs in my Cloud Run project, 2 different phenomena were found:
For request A, pretty much like what I mentioned in Update 1, logs were found for both Load Balancer and Cloud Run revision.
Time of CR revision log minus time of LB log was > 5s, so I think this is an expected timeout.
But for request B, no logs were found at all.
So I guess the problem is not about urlfetch nor GAE?
Deadline exceeded while waiting for HTTP response from URL is actually a DeadlineExceededError. The URL was not fetched because the deadline was exceeded. This can occur with either the client-supplied deadline (which you would need to change), or the system default if the client does not supply a deadline parameter.
When you are making a HTTP request, App Engine maps this request to URLFetch. URLFetch has its own deadline that is configurable. See the URLFetch documentation.
You can set a deadline for each URLFetch request. By default, the deadline for a fetch is 5 seconds. You can change this default by:
Including the following appengine.api.urlfetch.defaultDeadline setting in your appengine-web.xml configuration file. Specify the timeout in seconds:
<system-properties>:
<property name="appengine.api.urlfetch.defaultDeadline" value="10"/>
</system-properties>
You can also adjust the default deadline by using the urlfetch.set_default_fetch_deadline() function. This function stores the new default deadline on a thread-local variable, so it must be set for each request, for example, in a custom middleware.
from google.appengine.api import urlfetch
urlfetch.set_default_fetch_deadline(45)
If your Cloud Run service is processing long requests, you can increase the request timeout. If your service doesn't return a response within the time specified, the request ends and the service returns an HTTP 504 error.
Update the timeoutSeconds attribute in the service's YAML file:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: SERVICE
spec:
  template:
    spec:
      containers:
      - image: IMAGE
      timeoutSeconds: VALUE
OR
You can update the request timeout for a given revision at any time by using the following command:
gcloud run services update [SERVICE] --timeout=[TIMEOUT]
If requests are terminating earlier with error code 503, you might need to update the request timeout setting for your language framework:
Node.js developers might need to update the server.timeout property via server.setTimeout (use server.setTimeout(0) to achieve an unlimited timeout) depending on the version you are using.
Python developers need to update Gunicorn's default timeout.

How to fetch data via http request from Node.Js on AppEngine?

Everything works perfectly when I run locally. When I deploy my app on App Engine, for some reason, the most simple request gets timeout errors. I even implemented retries and, while I made some progress, it's still not working well.
I don't think it matters, since I don't have the problem running locally, but here's the code I use with the request-retry module:
request({
    url: url,
    maxAttempts: 5,
    retryDelay: 1000, // 1s delay
}, function (error, res, body) {
    if (!error && res.statusCode === 200) {
        resolve(body);
    } else {
        console.log(c.red, 'Error getting data from url:', url, c.Reset);
        reject(error);
    }
});
Any suggestions?
Also, I can see these errors in the debug logs:
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)
The error 203 means that Google App Engine detected that the RPC channel closed unexpectedly and shut down the instance. The request failure is caused by the instance shutting down.
The other message, about a request causing a new process to start in your application, is most likely caused by the instances shutting down. This message appears when a new instance starts serving a request. As your instances were dying due to error 203, new instances were taking their place, serving your new requests and producing that message.
An explanation for why it works on Google Cloud Engine (or locally) is that the App Engine component causing the error is not present in those environments.
Lastly, if you are still interested in solving the issue with App Engine and are entitled to GCP support, I suggest contacting the Technical Support team. The issue seems exclusive to App Engine, but I can't answer further about the reason why, which is why I'm suggesting contacting support. They have more tools available and will be able to help investigate the issue more thoroughly.

GAE App gets socket errors when communicating with BigQuery

Our GAE python application communicates with BigQuery using the Google Api Client for Python (currently we use version 1.3.1) with the GAE-specific authentication helpers. Very often we get a socket error while communicating with BigQuery.
More specifically, we build a python Google API client as follows
1. bq_scope = 'https://www.googleapis.com/auth/bigquery'
2. credentials = AppAssertionCredentials(scope=bq_scope)
3. http = credentials.authorize(httplib2.Http())
4. bq_service = build('bigquery', 'v2', http=http)
We then interact with the BQ service and get the following error
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/gae_override/httplib.py", line 536, in getresponse
'An error occured while connecting to the server: %s' % e)
error: An error occured while connecting to the server: Unable to fetch URL: [api url...]
The error raised is of type google.appengine.api.remote_socket._remote_socket_error.error, not an exception that wraps the error.
Initially we thought that it might be timeout-related, so we also tried setting a timeout altering line 3 in the above snippet to
3. http = credentials.authorize(httplib2.Http(timeout=60))
However, according to the log output of client library the API call takes less than 1 second to crash and explicitly setting the timeout did not change the system behavior.
Note that the error occurs in various API calls, not just a single one, and usually this happens on very light operations, for example we often see the error while polling BQ for a job status and rarely on data fetching. When we re-run the operation, the system works.
Any idea why this might happen and -perhaps- a best-practice to handle it?
All HTTP(s) requests will be routed through the urlfetch service.
Beneath that, the Google Api Client for Python uses httplib2 to make HTTP(s) requests and under the covers this library uses socket.
Since the error is coming from socket you might try to set the timeout there.
import socket
timeout = 30
socket.setdefaulttimeout(timeout)
Moving up the stack, httplib2 will use the socket-level default timeout when no timeout parameter is passed explicitly.
http://httplib2.readthedocs.io/en/latest/libhttplib2.html
Moving further up the stack you can set the timeout and retries for BigQuery.
timeout = 30000
num_retries = 5
query_request = bigquery_service.jobs()
query_data = {
    'query': query_var,
    'timeoutMs': timeout,
}
And finally you can set the timeout for urlfetch.
from google.appengine.api import urlfetch
urlfetch.set_default_fetch_deadline(30)
If you believe it's timeout related you might want to test each library / level to make sure the timeout is being passed correctly. You can also use a basic timer to see the results.
start_query = time.time()
query_response = query_request.query(
    projectId='<project_name>',
    body=query_data).execute(num_retries=num_retries)
end_query = time.time()
logging.info(end_query - start_query)
There are dozens of questions about timeout and deadline exceeded for GAE and BigQuery on this site so I wouldn't be surprised if you're hitting something weird.
Good luck!

Request to App Engine Backend Timing Out

I created an App Engine backend to serve http requests for a long running process. The backend process works as expected when the query references an input of small size, but times out when the input size is large. The query parameter is the url of an App Engine Blobstore blob, which is the input data for the backend process. I thought the whole point of using App Engine backends was to avoid the timeout restrictions that App Engine frontends have. How can I avoid getting a timeout?
I call the backend like this, setting the connection timeout length to infinite:
HttpURLConnection connection = (HttpURLConnection)(new URL(url + "?" + query).openConnection());
connection.setRequestProperty("Accept-Charset", charset);
connection.setRequestMethod("GET");
connection.setConnectTimeout(0);
connection.connect();
InputStream in = connection.getInputStream();
int ch;
while ((ch = in.read()) != -1)
    json = json + String.valueOf((char) ch);
System.out.println("Response Message is: " + json);
connection.disconnect();
The traceback (edited for anonymity) is:
Uncaught exception from servlet
java.net.SocketTimeoutException: Timeout while fetching URL: http://my-backend.myapp.appspot.com/somemethod?someparameter=AMIfv97IBE43y1pFaLNSKO1hAH1U4cpB45dc756FzVAyifPner8_TCJbg1pPMwMulsGnObJTgiC2I6G6CdWpSrH8TrRBO9x8BG_No26AM9LmGSkcbQZiilhC_-KGLx17mrS6QOLsUm3JFY88h8TnFNer5N6-cl0iKA
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:142)
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:43)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:417)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:296)
at org.someorg.server.HUDXML3UploadService.doPost(SomeService.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
As you can see, I'm not getting the DeadlineExceededException, so I think something other than Google's limits is causing the timeout, and also making this a different issue from similar stackoverflow posts on the topic.
I humbly thank you for any insights.
Update 2/19/2012: I see what's going on, I think. I should be able to have the client wait indefinitely using a GWT (or any other client-side async framework) async handler for any client request to complete, so I don't think that is the problem. The problem is that the file upload calls the _ah/upload App Engine system endpoint, which then (once the blob is stored in the Blobstore) calls the upload service's doPost backend to process the blob. The client request to _ah/upload is what is timing out, because the backend doesn't return in a timely fashion. To make this timeout problem go away, I attempted to make the _ah/upload service itself a public backend accessible via http://backend_name.project_name.appspot.com/_ah/upload, but I don't think Google allows a system service (like _ah/upload) to be run as a backend. Now my next approach is to just have _ah/upload return immediately after triggering the backend processing, and then call another service to get the original response I wanted, after processing is finished.
The solution was to start the backend process as a task and add it to the task queue, then return a response to the client before the backend task (which can take a long time) is processed. If I could have assigned _ah/upload to a backend, that would also have solved the problem, since the client's async handler could wait forever for the backend to finish, but I do not think Google permits assigning system servlets to backends. The client will now have to poll the persisted backend process response data, as Paul C mentioned, since tasks cannot respond like a normal servlet.
