Has anyone seen the error message below from PHP 7 on the Google App Engine standard environment?
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Example log from GCP:
xx.xxx.xxx.xx - - [01/Jul/2019:09:16:11 +0100] "GET /api/courses HTTP/1.1" 500 - - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36" "example.com" ms=12009 cpu_ms=3303 cpm_usd=0 loading_request=0 instance=00c61b117c915eddeba77b2a4f20a2ae2b81fc81a941138beb270170b47b91aedffd87f2 app_engine_release=1.9.71 trace_id=1f2cdcf1c2ea56bc5ebf7cf12577b057
I don't see anything in my PHP application's logs, so I don't think the issue is there, but I also can't see any details of the error in GCP.
Where do I need to look? Any help appreciated!
Thanks
EDIT
Using gcloud app logs tail reveals the following errors; I have no idea how to fix these or whether they're the cause of the issue!
2019-07-01 19:21:29 default[20190701t094939] nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /tmp/google-config/nginx.conf:3
2019-07-01 19:21:27 default[20190701t094939] [01-Jul-2019 19:21:27] ERROR: unable to read what child say: Bad file descriptor (9)
2019-07-01 19:21:18 default[20190701t094939] [01-Jul-2019 19:21:18] WARNING: [pool app] child 25 exited on signal 7 (SIGBUS) after 0.718745 seconds from start
EDIT 2
I've added caching to a pretty heavy API endpoint, which has stopped these errors from happening, as you can see after I deployed the change on the evening of July 1st. Looking at the 500 logs, they seem to correlate with spikes in traffic, so if I had to guess, maybe the instance was hitting a memory/CPU limit?
Here is an update.
I removed the retry limit; maybe that explains why tasks are lost.
I also reduced max-concurrent-requests based on Google's suggestions.
Here is the current queue definition:
<queue>
  <name>OsmOrderQueue</name>
  <rate>20/s</rate>
  <max-concurrent-requests>10</max-concurrent-requests>
  <bucket-size>100</bucket-size>
  <retry-parameters>
    <min-backoff-seconds>30</min-backoff-seconds>
    <max-backoff-seconds>30</max-backoff-seconds>
    <max-doublings>0</max-doublings>
  </retry-parameters>
</queue>
Also, here is the backends definition; I added it to override the default number of instances.
<backend name="osm-backend">
  <class>B8</class>
  <instances>4</instances>
  <options>
    <dynamic>true</dynamic>
    <public>true</public>
  </options>
</backend>
But I didn't see any change in the number of instances deployed; it's always 1.
I did the update with
appcfg.cmd update <war directory>
This updates the queue definition even while the queue is running. That's a cool feature.
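One thing I still need to verify against the docs: as far as I remember, backend definitions in backends.xml are not picked up by a plain update and have to be deployed with the separate backends action, something like
appcfg.cmd backends <war directory> update osm-backend
If that is right, it would explain why the instance count stays at 1.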
Now the situation is very different: tasks sit for almost 3,000 seconds and then are switched. I bet I am billed for this time!
2015-03-14 05:06:57.387 /sampleServlet 500 2869079ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
E 2015-03-14 05:06:57.387 A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the nex
2015-03-14 05:06:57.386 /sampleServlet 500 2879643ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
E 2015-03-14 05:06:57.386 A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the nex
2015-03-14 05:06:57.384 /sampleServlet 500 2889684ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
E 2015-03-14 05:06:57.384 A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the nex
2015-03-14 04:47:33.062 /sampleServlet 200 3674187ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
By the way, the task I am performing has no threading. It reads from Datastore and Cloud Storage and writes to BigQuery, which I would think is about the most common model in App Engine. If I run one of these tasks by itself, it normally completes in about 200-300 seconds, which is unbelievably slow for a B8 machine; reading the same file on my PC takes about 10 seconds. I wish I could see an error in my task or queue definition, but I cannot. How can the performance be so poor? How can configuring a task queue be so subtle? I am at a loss.
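For anyone hitting something similar: what eventually helped me (details in the later update) was simply timing each stage of the task. A rough sketch of that kind of per-stage logging, where readInput(), convert() and writeToBigQuery() stand in for my real steps:

import java.util.logging.Logger;

// Fragment from inside the task servlet's doPost(); the three helper
// methods are placeholders for the actual Datastore/Cloud Storage read,
// the conversion work, and the BigQuery load.
private static final Logger log = Logger.getLogger("SampleServlet");

void handleTask() throws java.io.IOException {
    long t0 = System.currentTimeMillis();
    String input = readInput();        // Datastore / Cloud Storage reads
    long t1 = System.currentTimeMillis();
    String output = convert(input);    // the main conversion work
    long t2 = System.currentTimeMillis();
    writeToBigQuery(output);           // BigQuery load
    long t3 = System.currentTimeMillis();
    log.info(String.format("read=%dms convert=%dms write=%dms",
            t1 - t0, t2 - t1, t3 - t2));
}

Logging the three durations separately is what later made it obvious that almost all of the time was spent in the conversion step.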
I am trying to get jobs done in parallel using a task queue with the following configuration.
<queue>
  <name>OsmOrderQueue</name>
  <rate>1/s</rate>
  <max-concurrent-requests>8</max-concurrent-requests>
  <bucket-size>4</bucket-size>
  <retry-parameters>
    <task-retry-limit>7</task-retry-limit>
    <min-backoff-seconds>10</min-backoff-seconds>
    <max-backoff-seconds>200</max-backoff-seconds>
    <max-doublings>2</max-doublings>
  </retry-parameters>
</queue>
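For context, each job is added to this queue from the upload handler, roughly like the standard Task Queue API call below; the servlet path and parameter name are just placeholders for my actual code:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Sketch of how each of the 100 jobs is enqueued; "fileName" stands in
// for whatever payload the real servlet expects.
Queue queue = QueueFactory.getQueue("OsmOrderQueue");
queue.add(TaskOptions.Builder
        .withUrl("/sampleServlet")
        .param("fileName", fileName)
        .method(TaskOptions.Method.POST));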
Is it strange that max-concurrent-requests is larger than the bucket size?
I submitted 100 jobs and checked my logs that 100 tasks were indeed entered. I have tried with only one concurrent request, and all tasks were processed. But here is the kind of error I see: I see many HTTP codes of 500. However, the sum is not equal to the lost jobs. Also, I have a job where bigQuery is not found!
Notice that some jobs run for almost 5 minutes, which I am paying for, then die and say they are moving to another machine. But they don't show up again.
The strange thing is that when I tested this with no concurrency, they all ran fine. But I need to speed up the process by running more in parallel. I don't think my servlet has concurrency issues, as I looked for exceptions from my code execution and there are none that I can see. So why is Google's task queue failing so much?
Finally, why the URL error reaching bigQuery, as shown at the end of the logs below?
2015-03-09 21:42:07.024 /sampleServlet 500 212866ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
0.1.0.2 - - [09/Mar/2015:21:42:07 -0700] "POST /sampleServlet HTTP/1.1" 500 0 "https://1-dot-mindful-highway-451.appspot.com/upload" "AppEngine-Google; (+http://code.google.com/appengine)" "osm-backend.mindful-highway-451.appspot.com" ms=212866 cpu_ms=785288 cpm_usd=0.000061 queue_name=OsmOrderQueue task_name=77053872005060790511 exit_code=107 instance=0 app_engine_release=1.9.18
W 2015-03-09 21:42:07.024
Process moved to a different machine.
2015-03-09 21:42:07.023 /sampleServlet 500 227518ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.023 Process moved to a different machine.
2015-03-09 21:42:07.022 /sampleServlet 500 203726ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.022 Process moved to a different machine.
2015-03-09 21:42:07.020 /sampleServlet 500 196668ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.020 Process moved to a different machine.
2015-03-09 21:42:07.019 /sampleServlet 500 220996ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.019 Process moved to a different machine.
2015-03-09 21:41:43.699 /_ah/start 404 3160ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 21:41:43.699 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-09 21:38:21.758 /_ah/start 404 1968ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 21:38:21.757 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-09 20:15:51.414 /_ah/stop 200 13ms 0kb instance=0 module=default version=osm-backend
2015-03-09 20:04:27.355 /_ah/start 404 2547ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 20:04:27.355 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-09 20:04:11.770 /sampleServlet 500 241352ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 20:04:11.770 Process moved to a different machine.
2015-03-09 20:00:12.995 /_ah/start 404 2154ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 20:00:12.995 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
BIG QUERY FAILED...
2015-03-09 21:51:06.675 /sampleServlet 200 576506ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
0.1.0.2 - - [09/Mar/2015:21:51:06 -0700] "POST /sampleServlet HTTP/1.1" 200 0 "https://1-dot-mindful-highway-451.appspot.com/upload" "AppEngine-Google; (+http://code.google.com/appengine)" "osm-backend.mindful-highway-451.appspot.com" ms=576507 cpu_ms=484901 cpm_usd=0.156302 queue_name=OsmOrderQueue task_name=21675006186640709011 pending_ms=13531 instance=0 app_engine_release=1.9.18
2015-03-09 21:51:06.671
com.example.lifescore.SampleServlet uploadFileToBigQuerry: New table throws exception e:java.io.IOException: Could not fetch URL: https://www.googleapis.com/upload/bigquery/v2/projects/mindful-highway-451/jobs?uploadType=resumable&upload_id=AEnB2UqDHFbMpUsL5m_a88fWh0hnhYzxp20qbbQlHe1mplsiNyo0g0Roktir0Gk5E6yUkBblXrTjz6cxw7aWF3m0dT03Q6CiQA
As suggested, I increased the bucket size to 100 and removed the max-concurrent-requests line. It had no impact: I issued 100 jobs, they were all in the queue, but they still ran sequentially; I don't see any jobs running in parallel. This dump shows 86 in the queue but only 6 running.
Queue Name: OsmOrderQueue
Maximum Rate: 1.0/s
Enforced Rate: 0.10/s
Bucket Size: 100.0
Maximum Concurrent: (not set)
Oldest Task: 2015-03-10 18:25:28 (0:09:45 ago)
Tasks in Queue: 86
Run in Last Minute: 6
Running: 6
But what is interesting is that the failed 500 events seem to come right after an /_ah/start.
2015-03-10 18:37:16.358 /sampleServlet 500 230964ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.358 Process moved to a different machine.
2015-03-10 18:37:16.357 /sampleServlet 500 68596ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.357 Process moved to a different machine.
2015-03-10 18:37:16.355 /sampleServlet 500 88692ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.355 Process moved to a different machine.
2015-03-10 18:37:16.354 /sampleServlet 500 99255ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.354 Process moved to a different machine.
2015-03-10 18:36:51.219 /_ah/start 404 2620ms 0kb instance=0 module=default version=osm-backend
I 2015-03-10 18:36:51.219 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
This came after about 20 tasks ran perfectly (though sadly slowly, about 5 minutes each), which in turn came after about five 500s.
2015-03-10 18:15:16.894 /sampleServlet 500 114343ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.894 Process moved to a different machine.
2015-03-10 18:15:16.893 /sampleServlet 500 98997ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.893 Process moved to a different machine.
2015-03-10 18:15:16.892 /sampleServlet 500 154237ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.892 Process moved to a different machine.
2015-03-10 18:15:16.891 /sampleServlet 500 139429ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.891 Process moved to a different machine.
2015-03-10 18:15:16.890 /sampleServlet 500 122964ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.890 Process moved to a different machine.
2015-03-10 18:15:16.889 /sampleServlet 500 130682ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.889 Process moved to a different machine.
2015-03-10 18:15:16.888 /sampleServlet 500 163503ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.887 Process moved to a different machine.
2015-03-10 18:14:52.896 /_ah/start 404 2668ms 0kb instance=0 module=default version=osm-backend
I 2015-03-10 18:14:52.860 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-10 18:12:35.918 /_ah/start 404 2518ms 0kb instance=0 module=default version=osm-backend
I 2015-03-10 18:12:35.917 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
Here is an image of the task queue performance, littered with constant 500 events and no logging of any errors or exceptions (just "moved to another machine"). Pretty poor, huh!
I tried to add the image, but SO says I need higher reputation. Can anyone help me there? Thanks.
If you do not specify <max-concurrent-requests>, then all your tasks can execute in parallel as long as there are tokens in your bucket. I have answered this in detail over here. You really need to read up on the documentation over here.
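To make the rate/bucket-size semantics concrete, here is a rough token-bucket model of what those two settings mean. This is only an illustration of the documented behaviour, not App Engine's actual scheduler code:

// Illustrative token-bucket model of the queue settings: tokens refill at
// <rate> per second, are capped at <bucket-size>, and each dispatched task
// consumes one token.
class TokenBucket {
    private final double ratePerSecond;   // <rate>, e.g. 1.0
    private final double bucketSize;      // <bucket-size>, e.g. 4
    private double tokens;
    private long lastRefillMillis = System.currentTimeMillis();

    TokenBucket(double ratePerSecond, double bucketSize) {
        this.ratePerSecond = ratePerSecond;
        this.bucketSize = bucketSize;
        this.tokens = bucketSize;
    }

    synchronized boolean tryDispatchTask() {
        long now = System.currentTimeMillis();
        tokens = Math.min(bucketSize,
                tokens + (now - lastRefillMillis) / 1000.0 * ratePerSecond);
        lastRefillMillis = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;   // one token per dispatched task
            return true;
        }
        return false;        // no token yet; the task waits in the queue
    }
}

With rate = 1/s and bucket-size = 4, a burst of at most 4 tasks can start immediately; after that, dispatch is limited to roughly one task per second.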
I see many HTTP codes of 500. However, the sum is not equal to the lost jobs.
I can imagine that you see more 500's than scheduled tasks because failed tasks will retry.
Also, I have a job where bigQuery is not found!!
Expect an occasional glitch when talking to external services and think about your retry strategy. Make sure your calls are idempotent.
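A simple way to keep a task handler idempotent is to key its side effects off the task name that App Engine sends in the X-AppEngine-TaskName header (X-AppEngine-TaskRetryCount is also handy for logging). A sketch, where the TaskMarker entity kind and doTheActualWork() are made up for illustration:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Inside the task servlet: use the task name as a dedup key so a retried
// task that already completed does nothing but acknowledge the request.
protected void doPost(HttpServletRequest req, HttpServletResponse resp)
        throws java.io.IOException {
    String taskName = req.getHeader("X-AppEngine-TaskName");
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Key markerKey = KeyFactory.createKey("TaskMarker", taskName);
    try {
        ds.get(markerKey);       // marker exists: this task already completed
        resp.setStatus(200);     // acknowledge so the queue stops retrying
        return;
    } catch (EntityNotFoundException e) {
        // first attempt, or a retry of a failed attempt: do the work
    }
    doTheActualWork();                // placeholder for the real job
    ds.put(new Entity(markerKey));    // record completion for future retries
    resp.setStatus(200);
}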
It looks like there are two issues here: 1) you are not getting the parallelism you'd expect and 2) your tasks are getting cancelled and moved to other machines.
As to 1: From what you posted, your settings look correct (once you removed max-concurrent-requests from the queue config). The queue status line reported the enforced rate at 0.10/s. I'd expect to see this if the number of instances is capped. Are you seeing the number of instances increase during processing (doc)?
If you're using Backends, be sure to set the maximum number of instances (the 'instances' option); it defaults to one. Even better, switch to modules, since Backends are deprecated and you'll have more control (doc).
As to 2: While it's expected to see applications moved, a 30% rate seems high. I'd expect this to get better after fixing 1. Follow up on this thread if this continues to be a problem.
Here are some workaround suggestions if you need short-term ideas:
Use multiple instances but turn off module multithreading. (This will only allow one of your computations to occur per instance, reducing load but also reducing instance efficiency.)
Checkpoint your results, i.e. save out intermediate results and check for intermediate results on task startup (see the sketch after this list).
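A rough sketch of the checkpointing idea, using memcache keyed by the task name; the chunked loop, totalChunks and processChunk() are made up for illustration:

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Inside the task handler: remember how far this task got, so a retry
// (after "moved to a different machine") resumes instead of starting over.
MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
String checkpointKey = "checkpoint:" + req.getHeader("X-AppEngine-TaskName");

Integer saved = (Integer) cache.get(checkpointKey);
int chunk = (saved == null) ? 0 : saved;

for (; chunk < totalChunks; chunk++) {
    processChunk(chunk);                  // placeholder for the real work
    cache.put(checkpointKey, chunk + 1);  // a retry resumes from here
}
cache.delete(checkpointKey);              // done: clear the checkpoint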
Better logging showed me the main issue. I found that each task I was submitting was spending the vast majority of its time inside my main conversion function, not waiting on data from Datastore or BigQuery. I dug in and found that I had made a pretty junior coding mistake: I was concatenating strings to generate the output file, and this caused the task time to degrade exponentially, up to close to an hour per file. I wouldn't have believed it, as I had checked the code before, but at Google Support's suggestion (which was a good one!) I tested locally and verified that my local code showed the same blow-up in time. The fix was to switch to a StringBuilder, appending each line with sb.append(...) and producing the file with sb.toString().
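In other words, the change was essentially this (a simplified sketch; the real conversion code is more involved, and "lines" stands in for the parsed input):

String[] lines = loadLines();   // placeholder for the parsed input

// Slow version: each += copies the whole string built so far, so building
// an n-line file does on the order of n^2 character copies.
String file = "";
for (String line : lines) {
    file += line + "\n";
}

// Fast version: StringBuilder appends in (amortized) constant time per line.
StringBuilder sb = new StringBuilder();
for (String line : lines) {
    sb.append(line).append("\n");
}
String fileFast = sb.toString();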
So, with many long-running tasks sitting in the queue, I was stressing the queue logic and getting the 500 "moved to another machine" status.
When I restructured the code to use a StringBuilder, the same task was done in seconds. With that fixed, I had no trouble getting 100 tasks to run smoothly and didn't lose any, as expected. I got the following settings working quite well:
<queue>
  <name>OsmOrderQueue</name>
  <rate>20/s</rate>
  <max-concurrent-requests>10</max-concurrent-requests>
  <bucket-size>100</bucket-size>
  <retry-parameters>
    <min-backoff-seconds>10</min-backoff-seconds>
    <max-backoff-seconds>100</max-backoff-seconds>
    <max-doublings>2</max-doublings>
  </retry-parameters>
</queue>
I indeed saw up to 10 tasks running in parallel, and good performance with faster-running tasks.
However, when you work with BigQuery you learn that the limits on table changes affect system design, so I am moving back to larger files, which will mean longer task times but fewer tasks. In that case, I wonder if I will start to see the "moving to another machine" phenomenon again.
So I guess I will say that my task queues work well as long as tasks are not long-running.
I created a GAE application. It worked fine until this morning.
There still aren't many requests going to my application, but GAE keeps recycling my instances the whole time.
Now, when somebody comes to my site, I first make about 10 requests to the server to get some data; before, this went smoothly. Since today, these 10 requests cause a lot of requests to /_ah/warmup, as you can see below, all within one second. As a result I hit my limit of "Frontend Instance Hours" without anything actually happening.
So I wonder, and hope, somebody can give me an idea of what could have caused this, or how I can fix this issue.
Thank you,
2014-05-22 11:21:13.513 /_ah/warmup 200 7688ms 0kb module=default version=1
I 2014-05-22 11:21:13.512 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2014-05-22 11:21:13.093 /_ah/warmup 200 6838ms 0kb module=default version=1
I 2014-05-22 11:21:13.093 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2014-05-22 11:21:11.671 /_ah/warmup 500 5794ms 0kb module=default version=1
I 2014-05-22 11:21:11.670 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
E 2014-05-22 11:21:11.670 Process terminated due to exceeding quotas.
2014-05-22 11:21:11.478 /_ah/warmup 500 5655ms 0kb module=default version=1
I 2014-05-22 11:21:11.475 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
E 2014-05-22 11:21:11.475 Process terminated due to exceeding quotas.
2014-05-22 11:21:11.319 /_ah/warmup 200 5492ms 0kb module=default version=1
I 2014-05-22 11:21:11.319 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2014-05-22 11:21:10.403 /_ah/warmup 500 4587ms 0kb module=default version=1
I 2014-05-22 11:21:10.388 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
E 2014-05-22 11:21:10.388 Process terminated due to exceeding quotas.
2014-05-22 11:21:10.310 /_ah/warmup 200 4494ms 0kb module=default version=1
I 2014-05-22 11:21:10.310 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2014-05-22 11:20:33.941 /_ah/warmup 200 7059ms 0kb module=default version=1
I 2014-05-22 11:20:33.941 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2014-05-22 11:20:33.698 /_ah/warmup 200 7106ms 0kb module=default version=1
I 2014-05-22 11:20:33.697 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2014-05-22 11:20:33.161 /_ah/warmup 200 6731ms 0kb module=default version=1
I 2014-05-22 11:20:33.161 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2014-05-22 11:20:12.854 /_ah/warmup 503 18ms 0kb module=default version=1
2014-05-22 10:19:51.619 /_ah/warmup 500 4085ms 0kb module=default version=1
I 2014-05-22 10:19:51.618 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
E 2014-05-22 10:19:51.618 Process terminated due to exceeding quotas.
I had the same errors. I saw them in Trace but not in the Logs. I asked in many places and finally received this answer:
it appears that the 500 warmup errors you are seeing do not indicate an error in your application. They are rather used internally as signal messages in our architecture. That explains why you are not seeing them in the Logs but only in Trace and why they are marked as healthy messages. As Cloud Trace is still in beta it is understandable to see minor inconsistencies.
The queue seemed to jam up, with a very large number of retries of:
47:32.546 /_ah/mapreduce/controller_callback 200 325ms 0kb AppEngine-Google; (+xxx://code.google.com/appengine)
0.1.0.2 - - [14/May/2013:14:47:32 -0700] "POST /_ah/mapreduce/controller_callback HTTP/1.1" 200 124 "xxx://ah-builtin-python-bundle-dot-ok-alone.appspot.com/_ah/mapreduce/controller_callback" "AppEngine-Google; (+xxx://code.google.com/appengine)" "ah-builtin-python-bundle-dot-ok-alone.appspot.com" ms=326 cpu_ms=0 cpm_usd=0.000014 queue_name=default task_name=appengine-mrcontrol-15811304617282FD9E118-1182 pending_ms=100 app_engine_release=1.8.0 instance=00c61b117ce38007a896105636da1be48f70e6db
'xxx' replaces 'http'
This ran through my data-write quota very rapidly, even though my actual data writes are relatively tiny and few. What caused this problem, and how can I fix it?
I am using just the default queue, without any modifications.
Any help greatly appreciated!
I just had this same issue. I fixed it by manually deleting the unsuccessful jobs in the default Task Queue on the admin console.
I have a project on Google App Engine. It has two separate data stores: one holds articles, and the other holds any article that is classified as a crime (true or false).
But when I try to run my cron to move the crime articles into the "crime" data store, I receive this error:
Has anyone experienced this, and how did they overcome it?
0kb AppEngine-Google;
0.1.0.1 - - [22/Apr/2011:09:47:02 -0700] "GET /place HTTP/1.1" 500 138 - "AppEngine-Google; (+http://code.google.com/appengine)" "geo-event-maps.appspot.com" ms=1642 cpu_ms=801 api_cpu_ms=404 cpm_usd=0.022761 queue_name=__cron task_name=740b13ec69de6ac36b81ff431d584a1a loading_request=1
As a result my cron fails.
I just had a similar problem where my cron was crashing because it hit a non-ASCII character it couldn't process. Try encode('utf-8'). My crons work OK without needing the login URL, but it's a good tip for the future :-)
Just my 2 cents' worth for your question ;-)
It's probably not related to cron. Trying to load your URL directly (http://geo-event-maps.appspot.com/place) returns an HTTP 500 error. As an admin of the app, you should be able to run any cron job without error just by pasting the URL into a browser, so start there.
By the way, make sure to require admin access to any cron URLs. As an unauthorized user I should have received a 401 error, not a 500. Even if you use just one handler, you can do something like this in your app.yaml:
- url: /cron/.*
  script: main.py
  login: admin
- url: /.*
  script: main.py