Here is an update.
I removed the retry limit... maybe that explains why tasks were lost.
I also reduced max-concurrent-requests based on Google's suggestions.
Here is the current queue definition:
<queue>
  <name>OsmOrderQueue</name>
  <rate>20/s</rate>
  <max-concurrent-requests>10</max-concurrent-requests>
  <bucket-size>100</bucket-size>
  <retry-parameters>
    <min-backoff-seconds>30</min-backoff-seconds>
    <max-backoff-seconds>30</max-backoff-seconds>
    <max-doublings>0</max-doublings>
  </retry-parameters>
</queue>
Also, here is the backends definition. I added an entry to override the default number of instances.
<backend name="osm-backend">
  <class>B8</class>
  <instances>4</instances>
  <options>
    <dynamic>true</dynamic>
    <public>true</public>
  </options>
</backend>
But I didn't see any change in the number of instances deployed. It's always 1.
I did the update with
appcfg.cmd update <war directory>
This updates the queue definition even while the queue is running. That's a cool feature.
Now the situation is unbelievably different: the tasks sit for almost 3000 seconds and are then switched to another machine. I bet I am billed for this time!
2015-03-14 05:06:57.387 /sampleServlet 500 2869079ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
E 2015-03-14 05:06:57.387 A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the nex
2015-03-14 05:06:57.386 /sampleServlet 500 2879643ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
E 2015-03-14 05:06:57.386 A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the nex
2015-03-14 05:06:57.384 /sampleServlet 500 2889684ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
E 2015-03-14 05:06:57.384 A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the nex
2015-03-14 04:47:33.062 /sampleServlet 200 3674187ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=maptest-backend
By the way, the task I am performing has no threading. It reads from the datastore and Cloud Storage and writes to BigQuery. This should be the most common model in App Engine, I would think. If I run one of these tasks by itself, it normally completes in about 200-300 seconds, which is unbelievably slow for a B8 machine: reading the same file on my PC takes about 10 seconds. I wish I could see an error in my task or queue definition, but I cannot. How can the performance be so poor? How can configuring a task queue be so subtle? I am at a loss.
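To make the shape of the work concrete, the task handler is essentially the following (a stripped-down sketch; the helper methods are placeholders for my real Datastore/Cloud Storage/BigQuery code, not actual library calls, and the parameter name is illustrative):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Single-threaded read-transform-write worker, invoked by the task queue.
public class SampleServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String orderId = req.getParameter("orderId");  // illustrative parameter name
        byte[] input = readInput(orderId);             // Datastore + Cloud Storage reads
        String converted = convert(input);             // the conversion step
        uploadToBigQuery(orderId, converted);          // BigQuery load job
        resp.setStatus(HttpServletResponse.SC_OK);     // 200 tells the queue not to retry
    }

    // Placeholders standing in for the real I/O and conversion code.
    private byte[] readInput(String orderId) throws IOException { return new byte[0]; }
    private String convert(byte[] input) { return ""; }
    private void uploadToBigQuery(String orderId, String data) throws IOException {}
}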
I am trying to get jobs done in parallel using a task queue with the following configuration.
<queue>
  <name>OsmOrderQueue</name>
  <rate>1/s</rate>
  <max-concurrent-requests>8</max-concurrent-requests>
  <bucket-size>4</bucket-size>
  <retry-parameters>
    <task-retry-limit>7</task-retry-limit>
    <min-backoff-seconds>10</min-backoff-seconds>
    <max-backoff-seconds>200</max-backoff-seconds>
    <max-doublings>2</max-doublings>
  </retry-parameters>
</queue>
Is it strange that max-concurrent-requests is larger than the bucket size?
I submitted 100 jobs, and I checked my logs that indeed 100 tasks were entered. I have tried with only one concurrent session, and all tasks were processed. But here is the kind of error I see: I see many HTTP codes of 500. However, the sum is not equal to the lost jobs. Also, I have a job where BigQuery is not found!!!
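For reference, the jobs are submitted through the Task Queue API, roughly like this (a minimal sketch; the parameter name is illustrative):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class JobSubmitter {
    // Enqueue 100 jobs onto OsmOrderQueue; each task POSTs to the worker servlet.
    public static void submitAll() {
        Queue queue = QueueFactory.getQueue("OsmOrderQueue");
        for (int i = 0; i < 100; i++) {
            queue.add(TaskOptions.Builder
                .withUrl("/sampleServlet")              // worker URL from my logs
                .param("orderId", Integer.toString(i))  // illustrative parameter name
                .method(TaskOptions.Method.POST));
        }
    }
}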
Notice that some jobs run for almost 5 minutes (which I am paying for), then die and say they are moving to another machine. But they never show up again.
The strange thing is that when I tested this with no concurrency, they all ran fine. But I need to speed up the process by running more in parallel. I don't think my servlet has concurrency issues; I looked for exceptions from my code execution and there are none that I can see. So why is Google's task queue failing so much?
Finally, why the URL error reaching BigQuery, shown at the end of the logs below?
2015-03-09 21:42:07.024 /sampleServlet 500 212866ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
0.1.0.2 - - [09/Mar/2015:21:42:07 -0700] "POST /sampleServlet HTTP/1.1" 500 0 "https://1-dot-mindful-highway-451.appspot.com/upload" "AppEngine-Google; (+http://code.google.com/appengine)" "osm-backend.mindful-highway-451.appspot.com" ms=212866 cpu_ms=785288 cpm_usd=0.000061 queue_name=OsmOrderQueue task_name=77053872005060790511 exit_code=107 instance=0 app_engine_release=1.9.18
W 2015-03-09 21:42:07.024
Process moved to a different machine.
2015-03-09 21:42:07.023 /sampleServlet 500 227518ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.023 Process moved to a different machine.
2015-03-09 21:42:07.022 /sampleServlet 500 203726ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.022 Process moved to a different machine.
2015-03-09 21:42:07.020 /sampleServlet 500 196668ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.020 Process moved to a different machine.
2015-03-09 21:42:07.019 /sampleServlet 500 220996ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 21:42:07.019 Process moved to a different machine.
2015-03-09 21:41:43.699 /_ah/start 404 3160ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 21:41:43.699 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-09 21:38:21.758 /_ah/start 404 1968ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 21:38:21.757 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-09 20:15:51.414 /_ah/stop 200 13ms 0kb instance=0 module=default version=osm-backend
2015-03-09 20:04:27.355 /_ah/start 404 2547ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 20:04:27.355 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-09 20:04:11.770 /sampleServlet 500 241352ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-09 20:04:11.770 Process moved to a different machine.
2015-03-09 20:00:12.995 /_ah/start 404 2154ms 0kb instance=0 module=default version=osm-backend
I 2015-03-09 20:00:12.995 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
BIG QUERY FAILED...
2015-03-09 21:51:06.675 /sampleServlet 200 576506ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
0.1.0.2 - - [09/Mar/2015:21:51:06 -0700] "POST /sampleServlet HTTP/1.1" 200 0 "https://1-dot-mindful-highway-451.appspot.com/upload" "AppEngine-Google; (+http://code.google.com/appengine)" "osm-backend.mindful-highway-451.appspot.com" ms=576507 cpu_ms=484901 cpm_usd=0.156302 queue_name=OsmOrderQueue task_name=21675006186640709011 pending_ms=13531 instance=0 app_engine_release=1.9.18
2015-03-09 21:51:06.671
com.example.lifescore.SampleServlet uploadFileToBigQuerry: New table throws exception e:java.io.IOException: Could not fetch URL: https://www.googleapis.com/upload/bigquery/v2/projects/mindful-highway-451/jobs?uploadType=resumable&upload_id=AEnB2UqDHFbMpUsL5m_a88fWh0hnhYzxp20qbbQlHe1mplsiNyo0g0Roktir0Gk5E6yUkBblXrTjz6cxw7aWF3m0dT03Q6CiQA
As suggested, I increased the bucket size to 100 and removed the max-concurrent-requests line. It had no impact. I issued 100 jobs; they were all in the queue, but they still run sequentially. I don't see any jobs running in parallel. This dump shows 86 in the queue, but only 6 running:
Queue Name:          OsmOrderQueue
Maximum Rate:        1.0/s
Enforced Rate:       0.10/s
Bucket Size:         100.0
Maximum Concurrent:  (not set)
Oldest Task:         2015-03-10 18:25:28 (0:09:45 ago)
Tasks in Queue:      86
Run in Last Minute:  6
Running:             6
But what is interesting is that the failed 500 events seem to come right after an /_ah/start.
2015-03-10 18:37:16.358 /sampleServlet 500 230964ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.358 Process moved to a different machine.
2015-03-10 18:37:16.357 /sampleServlet 500 68596ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.357 Process moved to a different machine.
2015-03-10 18:37:16.355 /sampleServlet 500 88692ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.355 Process moved to a different machine.
2015-03-10 18:37:16.354 /sampleServlet 500 99255ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:37:16.354 Process moved to a different machine.
2015-03-10 18:36:51.219 /_ah/start 404 2620ms 0kb instance=0 module=default version=osm-backend
I 2015-03-10 18:36:51.219 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
This came after about 20 tasks that ran perfectly (but sadly slowly, each taking about 5 minutes), which in turn came after about five 500s:
2015-03-10 18:15:16.894 /sampleServlet 500 114343ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.894 Process moved to a different machine.
2015-03-10 18:15:16.893 /sampleServlet 500 98997ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.893 Process moved to a different machine.
2015-03-10 18:15:16.892 /sampleServlet 500 154237ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.892 Process moved to a different machine.
2015-03-10 18:15:16.891 /sampleServlet 500 139429ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.891 Process moved to a different machine.
2015-03-10 18:15:16.890 /sampleServlet 500 122964ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.890 Process moved to a different machine.
2015-03-10 18:15:16.889 /sampleServlet 500 130682ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.889 Process moved to a different machine.
2015-03-10 18:15:16.888 /sampleServlet 500 163503ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=osm-backend
W 2015-03-10 18:15:16.887 Process moved to a different machine.
2015-03-10 18:14:52.896 /_ah/start 404 2668ms 0kb instance=0 module=default version=osm-backend
I 2015-03-10 18:14:52.860 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
2015-03-10 18:12:35.918 /_ah/start 404 2518ms 0kb instance=0 module=default version=osm-backend
I 2015-03-10 18:12:35.917 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
Here is an image of the task queue performance, littered with constant 500 events and no logging of any errors or exceptions (just "moved to another machine"). Pretty poor, huh?
I tried to add the image, but SO says I need higher reputation. Can anyone help me there? Thanks.
If you do not specify <max-concurrent-requests>, then all of your tasks can execute in parallel as long as there are tokens in your bucket. I have answered this in detail over here. You really need to read up on the documentation over here.
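As a rough illustration of those token-bucket semantics (my own toy sketch of the documented behavior, not App Engine's code): with your rate of 1/s and bucket-size of 4, the queue can start at most 4 tasks in an initial burst and then only one per second after that.

// Toy simulation of the documented token-bucket throttle (assumed semantics:
// the bucket holds at most bucketSize tokens, refills at `rate` per second,
// and each task start consumes one token).
public class TokenBucketDemo {
    public static void main(String[] args) {
        double rate = 1.0;        // <rate>1/s</rate>
        double bucketSize = 4.0;  // <bucket-size>4</bucket-size>
        double tokens = bucketSize;
        for (int second = 0; second <= 5; second++) {
            int started = (int) tokens;   // start as many tasks as whole tokens allow
            tokens -= started;
            System.out.printf("t=%ds: started %d task(s)%n", second, started);
            tokens = Math.min(bucketSize, tokens + rate); // refill
        }
    }
}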
I see many HTTP codes of 500. However, the sum is not equal to the lost jobs.
I can imagine that you see more 500's than scheduled tasks because failed tasks will retry.
Also, I have a job where BigQuery is not found!!
Expect an occasional glitch when talking to services, and think about your retry strategy. Make sure your calls are idempotent.
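For example, a task handler can be made effectively idempotent by recording finished work and checking for it before doing anything (a sketch; the "CompletedTask" kind and method names are hypothetical, and a transaction around the check and the final put would make it airtight):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;

// Sketch: skip work that a previous (retried) attempt already finished.
public class IdempotentWorker {
    private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    public void handleTask(String taskName) {
        try {
            ds.get(KeyFactory.createKey("CompletedTask", taskName));
            return; // already done on an earlier attempt; the retry is a no-op
        } catch (EntityNotFoundException e) {
            // not done yet, fall through and do the work
        }
        doActualWork(taskName);                          // placeholder for the real job
        ds.put(new Entity("CompletedTask", taskName));   // mark done for future retries
    }

    private void doActualWork(String taskName) { /* placeholder */ }
}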
It looks like there are two issues here: 1) you are not getting the parallelism you'd expect and 2) your tasks are getting cancelled and moved to other machines.
As to 1: From what you posted, your settings look correct (once you removed max-concurrent-requests from the queue config). The queue status line reported the enforced rate at 0.10/s. I'd expect to see this if the number of instances is capped. Are you seeing the number of instances increase during processing (doc)?
If you're using Backends, be sure to set the maximum number of instances (the 'instances' option); it defaults to one. Even better, switch to Modules, since Backends are deprecated and you'll have more control (doc).
As to 2: While it's expected to see applications moved, a 30% rate seems high. I'd expect this to get better after fixing 1. Follow up on this thread if this continues to be a problem.
Here are some work-around suggestions if you need short-term ideas:
Use multiple instances but turn off module multithreading. (This allows only one of your computations per instance, reducing load but also reducing instance efficiency.)
Checkpoint your results, i.e. save out intermediate results and check for them on task startup (see the sketch below).
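A sketch of the checkpointing idea, assuming the work splits into chunks (the "Checkpoint" kind and the chunk structure are hypothetical):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;

// Persist progress so a task that is "moved to a different machine" can
// resume on retry instead of starting over from scratch.
public class CheckpointedTask {
    private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    public void run(String jobId, int totalChunks) {
        int start = loadCheckpoint(jobId);            // resume where we left off
        for (int chunk = start; chunk < totalChunks; chunk++) {
            processChunk(jobId, chunk);               // placeholder for the real work
            saveCheckpoint(jobId, chunk + 1);         // record progress after each chunk
        }
    }

    private int loadCheckpoint(String jobId) {
        try {
            Entity e = ds.get(KeyFactory.createKey("Checkpoint", jobId));
            return ((Number) e.getProperty("nextChunk")).intValue();
        } catch (EntityNotFoundException e) {
            return 0; // no checkpoint yet: start from the beginning
        }
    }

    private void saveCheckpoint(String jobId, int nextChunk) {
        Entity e = new Entity("Checkpoint", jobId);
        e.setProperty("nextChunk", nextChunk);
        ds.put(e);
    }

    private void processChunk(String jobId, int chunk) { /* placeholder */ }
}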
Better logging showed me the main issue. I found that each task I was submitting was spending the vast majority of its time inside my main conversion function, not waiting on data from Datastore or BigQuery. I dug in and found that I had made a pretty junior coding mistake: I was concatenating strings to generate the output file, which caused the task timing to degrade quadratically, up to close to an hour per file. I wouldn't have believed it, as I had checked the code before, but at Google Support's suggestion (which was a good one!) I tested locally and verified that my local code showed the same blow-up in running time. The fix was switching to a StringBuilder, appending each new line with sb.append(...) and producing the file with sb.toString().
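In outline, the slow and fast versions look like this (simplified from my code):

public class ConversionDemo {
    // Slow: each += copies the whole accumulated string, so building the
    // output is quadratic in the number of lines (nearly an hour per file).
    static String buildSlow(java.util.List<String> lines) {
        String file = "";
        for (String line : lines) {
            file += line + "\n";
        }
        return file;
    }

    // Fast: StringBuilder appends in amortized O(1) per line; the same task
    // finished in seconds after this change.
    static String buildFast(java.util.List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append("\n");
        }
        return sb.toString();
    }
}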
So, with many long-running tasks sitting in the queue, I was stressing the queue logic and getting the 500 status for moving to another machine.
When I restructured the code to use a StringBuilder, the same task was done in seconds. After that, I had no trouble getting 100 tasks to run smoothly, and none went missing, as expected. The following settings worked quite well:
<queue>
  <name>OsmOrderQueue</name>
  <rate>20/s</rate>
  <max-concurrent-requests>10</max-concurrent-requests>
  <bucket-size>100</bucket-size>
  <retry-parameters>
    <min-backoff-seconds>10</min-backoff-seconds>
    <max-backoff-seconds>100</max-backoff-seconds>
    <max-doublings>2</max-doublings>
  </retry-parameters>
</queue>
I indeed saw up to 10 tasks running in parallel, and good performance with the faster-running tasks.
However, when you work with BigQuery you learn that the limits on table changes affect system design. So I am moving back to larger files, which will mean longer task times but fewer tasks, and I wonder whether I will then start to see the "moving to another machine" phenomenon again.
So I will say that my task queues work well as long as the tasks are not long-running.
Whenever I try to start a task on my GAE backend, it gets shut down almost immediately. I've looked at the documentation on the reasons for shutdown, but I can't determine the cause. There is no evidence of excess CPU or memory use. It happens every time.
I added the shutdown handler, and it is getting invoked. Before I put in the handler, I got no log entries from my code, only the "Process terminated because the backend took too long to shutdown" message. Now, with the handler, I get the initial log entry plus the exception that results from the interruptAllRequests call in the handler.
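For reference, the handler follows the standard pattern from the Backends docs (a sketch of what I registered at startup):

import com.google.appengine.api.LifecycleManager;
import com.google.appengine.api.LifecycleManager.ShutdownHook;

// Registered once at startup: logs the shutdown and interrupts in-flight
// requests (the interrupt is what produces the exception I see in the logs).
public class ShutdownHookSetup {
    public static void register() {
        LifecycleManager.getInstance().setShutdownHook(new ShutdownHook() {
            @Override
            public void shutdown() {
                System.err.println("Shutdown Hook");
                LifecycleManager.getInstance().interruptAllRequests();
            }
        });
    }
}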
Any ideas?
Log image: http://imgur.com/8TXHkJS
2014-04-04 11:35:27.372 /camperschoicecloud/task 500 6357ms 0kb instance=0 AppEngine-Google; (+http://code.google.com/appengine) module=default version=sched-backend
D 2014-04-04 11:35:24.883 com.stellarcoresoftware.camperschoice.server.NamespaceFilter doFilter: Server Name: sched-backend.campers-choice.appspot.com
I 2014-04-04 11:35:24.891 com.stellarcoresoftware.camperschoice.server.TaskServiceImpl doPost: executing task: sched, entity 6288495693791232
E 2014-04-04 11:35:27.366 com.stellarcoresoftware.camperschoice.server.util.JobUtil getJob: Error reading job com.google.apphosting.api.ApiProxy$CancelledException: The API cal
2014-04-04 11:35:25.878 /_ah/stop 200 100ms 0kb instance=0 module=default version=sched-backend
[04/Apr/2014:08:35:25 -0700] "GET /_ah/stop HTTP/1.1" 200 2 - - "0.sched-backend.campers-choice.appspot.com" ms=101 cpu_ms=8 cpm_usd=0.000000 instance=0 app_engine_release=1.9.2
E 2014-04-04 11:35:25.868
com.stellarcoresoftware.camperschoice.server.TaskServiceImpl$1 shutdown: Shutdown Hook
2014-04-04 11:35:24.876 /_ah/start 404 3809ms 0kb instance=0 module=default version=sched-backend
[04/Apr/2014:08:35:24 -0700] "GET /_ah/start HTTP/1.1" 404 234 - - "0.sched-backend.campers-choice.appspot.com" ms=3810 cpu_ms=2232 cpm_usd=0.000026 loading_request=1 instance=0 app_engine_release=1.9.2
D 2014-04-04 11:35:24.834
com.stellarcoresoftware.camperschoice.server.NamespaceFilter doFilter: Server Name: 0.sched-backend.campers-choice.appspot.com
I 2014-04-04 11:35:24.873
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
I solved the issue by changing the backend class from B1 to B4. That leads me to believe it was a memory problem, even though there are no specific indications of that. I see that on the Python side there are easy methods to check CPU and memory. Are there Java equivalents?
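For the memory side, at least, the plain JVM calls presumably work in the sandbox (a minimal sketch; I know of no equally direct call for CPU usage):

import java.util.logging.Logger;

// Rough heap snapshot using standard JVM calls; these run in the App Engine
// Java sandbox. There is no direct JVM equivalent for CPU accounting.
public class MemoryCheck {
    private static final Logger log = Logger.getLogger(MemoryCheck.class.getName());

    public static void logUsage() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        log.info("Heap used: " + usedMb + " MB of " + maxMb + " MB max");
    }
}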
My queue seemed to jam up, with a very large number of retries like this:
47:32.546 /_ah/mapreduce/controller_callback 200 325ms 0kb AppEngine-Google; (+xxx://code.google.com/appengine)
0.1.0.2 - - [14/May/2013:14:47:32 -0700] "POST /_ah/mapreduce/controller_callback HTTP/1.1" 200 124 "xxx://ah-builtin-python-bundle-dot-ok-alone.appspot.com/_ah/mapreduce/controller_callback" "AppEngine-Google; (+xxx://code.google.com/appengine)" "ah-builtin-python-bundle-dot-ok-alone.appspot.com" ms=326 cpu_ms=0 cpm_usd=0.000014 queue_name=default task_name=appengine-mrcontrol-15811304617282FD9E118-1182 pending_ms=100 app_engine_release=1.8.0 instance=00c61b117ce38007a896105636da1be48f70e6db
'xxx' replaces 'http'
I ran through my datastore write quota very rapidly, even though my actual data writes are relatively tiny and few. What caused this problem, and how can I fix it?
I am using just the default queue without any modifications.
Any help greatly appreciated!
I just had this same issue. I fixed it by manually deleting the unsuccessful jobs in the default Task Queue on the admin console.
For the last 2 days, when trying to load the home page at http://achhabachhadev.appspot.com/, I have been receiving the following error. I need to know if I can do something to fix it.
Error: Server Error The server encountered an error and could not complete your request. If the problem persists, please report your problem and mention this error message and the query that caused it.
In the logs, all I see is that the request could not be completed within 1 minute. Please tell me if there is something else that could be the problem. Any help is welcome, as it was all working until 2 days ago.
The logs are as given below:
53 / 500 62976ms 0kb Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11
I 2012-03-15 04:41:49.013 javax.servlet.ServletContext log: Initializing Spring root WebApplicationContext
I 2012-03-15 04:42:19.717 javax.servlet.ServletContext log: Initializing Spring FrameworkServlet 'ICCFinal02'
W 2012-03-15 04:42:42.245 Error for / com.google.apphosting.runtime.HardDeadlineExceededError: This request (80d56e654b79f25b) started at 2012/03/15 11:41:41.284 UTC and was st
W 2012-03-15 04:42:42.245 at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthentic
C 2012-03-15 04:42:42.293 Uncaught exception from servlet com.google.apphosting.runtime.HardDeadlineExceededError: This request (80d56e654b79f25b) started at 2012/03/15 11:41:4
I 2012-03-15 04:42:42.315 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ
W 2012-03-15 04:42:42.315 A problem was encountered with the process that handled this request, causing it to exit.
Did you try re-deploying? I've noticed that every once in a while the appservers get hung up and a redeploy is required to fix it.
I have a project on Google App Engine. It has 2 separate datastores: one holds articles, and the other holds any article that is classified as a crime (True or False).
But when I try to run my cron to move the crime articles into the "crime" datastore, I receive the error shown below.
Has anyone experienced this, and how did you overcome it?
0kb AppEngine-Google;
0.1.0.1 - - [22/Apr/2011:09:47:02 -0700] "GET /place HTTP/1.1" 500 138 - "AppEngine-Google; (+http://code.google.com/appengine)" "geo-event-maps.appspot.com" ms=1642 cpu_ms=801 api_cpu_ms=404 cpm_usd=0.022761 queue_name=__cron task_name=740b13ec69de6ac36b81ff431d584a1a loading_request=1
As a result my cron fails.
I just had a similar problem where my cron was crashing because it found a non-ASCII character and was not able to process it. Try encode('utf-8'). My crons work OK without needing the login URL, but it's a good tip for the future :-)
Just my 2 cents worth for your question ;-)
It's probably not related to cron. Trying to load your URL directly (http://geo-event-maps.appspot.com/place) returns an HTTP 500 error. As an admin of the app, you should be able to run any cron job without error just by pasting the URL into a browser, so start there.
By the way, make sure to require admin access to any cron URLs. As an unauthorized user I should have received a 401 error, not a 500. Even if you use just one handler, you can do something like this in your app.yaml:
- url: /cron/.*
script: main.py
login: admin
- url: /.*
script: main.py