Java App engine backend shuts down abruptly, how to resume work? - google-app-engine

I have a cron job that runs every 30 minutes and queues a task to be executed on a dynamic backend (B2).
The backend loops: it does some work, sleeps for a few minutes, and repeats until the complete job is finished after a few hours, at which point the backend shuts down. (While the backend is running, no new task is actioned.)
Two days in a row now, I have seen my backend stop abruptly (after 1.5 hours) with the familiar "Process terminated because the backend took too long to shutdown." I have searched the forums but could not identify why exactly my backend shuts down (apart from the theoretical list of reasons that the App Engine docs provide). I have checked my datastore/memcache operations and memory usage, and all looks normal. I upgraded my backend from B1 to B2, but no luck.
Q1. Does anybody know how to debug this issue further?
Q2. Even after a shutdown, I want the job to be completed. If I register a shutdown hook with LifecycleManager.getInstance().setShutdownHook(), what is a good way to ensure that the job is resumed (considering that the cron job could still be 29 minutes away from its next execution, and I want the job to do its work every 2 minutes)?

Yes, the same has happened to me. I have a backend that uses constant memory and CPU. App Engine shuts it down periodically, usually after 15 minutes but sometimes sooner. The docs say that it may be shut down without explanation; App Engine will notify the backend and then shut it down.
You are supposed to handle this gracefully, which means the backend should work in chunks and be able to restart its work. If you can't divide the work into chunks, don't use backends; use a Compute Engine instance.

For your first question, you'd have to take a closer look at the logs. App Engine does promise to signal a shutdown through a request to /_ah/stop, so that should give more insight into the issue.
For your second question, stick with App Engine's suggestion of having more than one instance. In your case you could move away from looping over some entity indefinitely and sleeping in between. Instead, have a cron job that looks up a task queue and processes a single task. If the task is processed successfully, mark it as done somewhere, or remove it from the queue once you've finished processing it. That way, in case of failure the task is still available to be processed until it is marked successful, and your additional instances can take over.
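Both answers point toward work that proceeds in small units and records its progress, so that a later run can pick up where an interrupted one stopped. A minimal sketch of that idea in plain Java follows. The class name ResumableJob, the in-memory checkpoint, and the requestShutdown() method are all hypothetical; in a real backend the checkpoint would have to live in durable storage, and requestShutdown() is assumed to be called from whatever shutdown hook you register.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

/** Hypothetical sketch: a job that works in chunks and resumes from a checkpoint. */
final class ResumableJob {
    // Stand-in for a durable checkpoint (assumption: kept in memory only for illustration).
    private final AtomicInteger nextChunk = new AtomicInteger(0);
    // Flipped by the shutdown notification (assumption: wired to your shutdown hook).
    private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);
    private final int totalChunks;
    private final IntConsumer chunkWork;

    ResumableJob(int totalChunks, IntConsumer chunkWork) {
        this.totalChunks = totalChunks;
        this.chunkWork = chunkWork;
    }

    void requestShutdown() {
        shutdownRequested.set(true);
    }

    /** Index of the first chunk that has not been completed yet. */
    int nextChunk() {
        return nextChunk.get();
    }

    /** Processes chunks from the checkpoint onward; returns true once the whole job is done. */
    boolean run() {
        shutdownRequested.set(false); // a fresh run starts with no pending shutdown
        while (nextChunk.get() < totalChunks) {
            if (shutdownRequested.get()) {
                return false; // stop between chunks; the checkpoint already reflects finished work
            }
            chunkWork.accept(nextChunk.get());
            nextChunk.incrementAndGet(); // advance only after the chunk completed
        }
        return true;
    }
}
```

Because the checkpoint advances only after a chunk finishes, a chunk that is cut off mid-way is simply repeated on the next run, so the per-chunk work should be safe to repeat.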

Related

App Engine TaskQueue: Interrupted and 20 Minutes to Restart

It seems that when App Engine task queue tasks get interrupted, they take 20 minutes or more to restart. Is this behavior normal?
I am using the TaskQueue on Google Cloud's App Engine Flexible system. I regularly add tasks to the task queue and they get processed on the system. It appears that occasionally a task gets interrupted in the middle of what it's doing. I don't know why this happens, but I assume it's probably because the instance that it's on restarted itself.
My software is resilient to such restarts, but the problem is that it takes a full 20 minutes for the task to be restarted. Has anyone experienced this before?
I think you're right: an instance grabs the task and then goes down. The task queue doesn't realize it and waits for some kind of timeout.
This sounds very similar to an issue I experienced:
app engine instance dies instantly, locking up deferred tasks until they hit 10 minute timeout
So to answer your question, I would say yes, this does happen. As for what to do, I guess it depends on what the task is doing, how often it runs, etc. If the 20 minute lag isn't a big deal I would just live with it, because fixing it can be a bit of a wild goose chase, but here's what I would try:
When launching tasks, launch duplicates as well, with a staggered value for countdown/eta.
Set up a separate microservice to handle/execute these tasks. Hopefully this will make their execution more predictable, and you'll be able to tweak instance size and scaling settings to better suit it.
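Launching duplicates only helps if a duplicate that arrives after the original has succeeded does nothing. A rough sketch of both halves in plain Java is shown here: computing staggered countdown values and skipping work that is already marked complete. The class and method names are hypothetical, and the in-memory set stands in for a durable completion marker that a real deployment would need (ideally one updated transactionally, which this sketch does not attempt).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: staggered duplicate tasks plus an idempotency guard. */
final class StaggeredDuplicates {
    /** Countdowns in milliseconds for the original task (0) and each duplicate, gapMillis apart. */
    static List<Long> countdowns(int duplicates, long gapMillis) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i <= duplicates; i++) {
            out.add(i * gapMillis);
        }
        return out;
    }

    // Stand-in for a durable "job finished" marker (assumption: in memory for illustration).
    private final Set<String> completed = ConcurrentHashMap.newKeySet();

    /** Runs the work unless jobId is already marked complete; returns true if this call did the work. */
    boolean runOnce(String jobId, Runnable work) {
        if (completed.contains(jobId)) {
            return false; // an earlier copy already succeeded, so this duplicate is a no-op
        }
        work.run(); // if this throws, the job stays unmarked and a later duplicate retries it
        completed.add(jobId);
        return true;
    }
}
```

The check and the marker update are not atomic here, so two copies running at the same moment could both do the work; staggering the countdowns makes that overlap less likely but does not rule it out.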

Backend "Process moved to a different machine" and fails with error 500

I have a process that takes around five minutes to complete. It runs on a cron job every two hours in a backend instance.
Recently the process has started to fail; not every time but a few times a day. First thing that happens is that the memcache starts to throw exceptions:
04:21:13.640 com.google.appengine.api.memcache.LogAndContinueErrorHandler handleServiceError: Service error in memcache
com.google.appengine.api.memcache.MemcacheServiceException: Memcache get: exception getting 1 key (ItemFollowableCompleted:RegionUS:P8XD:0)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$RpcResponseHandler.handleApiProxyException(MemcacheServiceApiHelper.java:68)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$1.absorbParentException(MemcacheServiceApiHelper.java:109)
None of these are fatal exceptions but a few seconds later the process terminated without warning or shutdown message. Logs show
04:21:30.591 Process moved to a different machine.
and an error 500.
Is this a google infrastructure problem related to memcache or is there something in the app code that could be causing it?
No, it's not an error in Google infrastructure. Your process is expected to be moved among instances when needed (maintenance, more demand from your side, ...), and there's nothing you can do to prevent it.
Nonetheless there are a few things you could do to alleviate any effect this could have in your app.
See [1] for some suggestions on how to keep track of your pending jobs when your instance is shut down, and also have a look at background threads.
I'm guessing you're using Python, if not, look for your corresponding language.
[1] https://developers.google.com/appengine/docs/python/backends/#Python_Backend_states
I have the same problem when I use ndb.put_multi() to load data. I tried a few things:
1. Increased my backend's machine size; I moved to B4_1G.
2. Slept between ndb.put_multi() calls (2 minutes for every 200 entities).
3. Switched to dedicated memcache (1 GB).
1 and 2 were not very helpful; 3 seems to help.
I think rapid updates to the ndb datastore affecting memcache are the root cause in my case. I could not find any other way besides paying for dedicated memcache.
I also ran into "Process moved to a different machine" in a backend module.
The context is as follows:
Get the query result from one kind.
While iterating over each entity in the query result, do some tasks and write new entities to different kinds.
"Process moved to a different machine" happens about halfway through the iteration.
After some experiments, I found it was due to "too many write transactions in one request". Everything is fine when the query result is small, but problems appear as it grows larger.
The final solution I took was to use the Task Queue: the work to be done for one entity is treated as one task and put into a push queue. With that, the issue is gone.
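The fix described above amounts to splitting one large request into many small units of work. A small helper for that split might look like the following sketch in plain Java. The class name WorkSplitter and the batch-size parameter are assumptions added for illustration; a batch size of 1 corresponds to the one-task-per-entity approach from the answer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split a query result into small batches, one batch per task. */
final class WorkSplitter {
    /** Returns consecutive batches of at most batchSize items, preserving order. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

Each returned batch would then become the payload of one queued task, which keeps the number of writes per request bounded regardless of how large the query result grows.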
Hope this will help :)

jBPM6 not persisting boundary timers

I have a jBPM process setup with a boundary timer on a human task set for 30s (for testing purposes) - this is to escalate to another task if the time expires.
This normally functions correctly - when the task is reached and 30s are up, the flow is moved to the next task.
However, if I bounce the server, it seems that none of the timers are recreated and the flow sits on that task indefinitely.
The chances of the server being bounced in the real world are fairly high, as the timeouts will be more likely to last a couple of days.
Does anyone know if this is a known issue?
How are you executing your process, using the execution server as part of jbpm-console or embedding the engine yourself?
If you are embedding the engine yourself, note that you need to reinitialize your RuntimeManager upon restart (don't wait on the first request to do this, as this won't reactivate timers).

GAE Queue Statistics numbers wrong on development console

I'm seeing very strange behavior in some code that checks the QueueStatistics for a queue to see if any tasks are currently running. To the best of my knowledge there are NO tasks running, and none have been queued up for the past 12+ hours. The development console corroborates this, saying that there are 0 tasks in the queue.
Looking at the QueueStatistics information in my debugger though, confirms that my process is exiting because it's seeing on the order of 500+ (!!!) tasks in the queue. It also says it ran >1000 tasks in the past minute, yet it ran 0 tasks in the past hour. If I parse through the ETA Usec, the time is "accurately" showing as if the ETA is within the next minute of when the QueueStatistics were pulled.
This is happening repeatedly whenever I re-run my servlet, and the first thing the servlet does is check the queue statistics. No other servlets, tasks, or cron jobs are running as this is my local development server. Yet the queue statistics continue to insist I've got hundreds of tasks running.
I couldn't find any other reports of this behavior, but it feels like I must be missing something major here in regards to Queue Statistics. The code I'm using is very simple:
Queue taskQueue = QueueFactory.getQueue("myQueue");
QueueStatistics stats = taskQueue.fetchStatistics();
if (stats.getNumTasks() > 0) { return; }
What am I missing? Are queue statistics entirely unreliable on the local dev server?
If it works as expected when deployed then that's the standard to go by.
Lots of things don't work locally the way they do in the deployed environment (parallel threads are not parallel, and backend support is somewhat broken for addressing them at the time of writing), so deploy, deploy, deploy!
Another example is the Channel API. When used locally it uses polling, and you'll see hundreds of those requests if you look in the logs or the browser debugger. But when deployed, all is well and it works as expected.

crawler on appengine

I want to run a program continuously on App Engine. This program will automatically crawl a website continuously and store the data in its database. Is it possible for the program to keep doing this continuously on App Engine, or will App Engine kill the process?
Note: the website to be crawled is not hosted on App Engine.
"I want to run a program continuously on App Engine."
Can't.
The closest you can get is background-running scheduled tasks that last no more than 30 seconds:
"Notably, this means that the lifetime of a single task's execution is limited to 30 seconds. If your task's execution nears the 30 second limit, App Engine will raise an exception which you may catch and then quickly save your work or log process."
A friend of mine suggested the following:
Create a task queue.
Start the queue by passing it some data.
Use an exception handler to catch DeadlineExceededException.
In your handler, queue a new task for the same purpose.
This way you can run your job indefinitely. You only need to consider the CPU time and storage used.
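The chaining idea in the steps above can be illustrated without the App Engine SDK. In the sketch below, plain Java stands in for the real services: an in-memory deque plays the role of the task queue, and a fixed per-task item budget replaces the deadline exception, purely so the example is deterministic. All names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

/** Hypothetical sketch: a long job run as a chain of short tasks. */
final class ChainedTasks {
    /**
     * Processes items 0..totalItems-1, at most budgetPerTask items per task.
     * Returns the number of tasks that were executed.
     */
    static int runChain(int totalItems, int budgetPerTask, IntConsumer handler) {
        if (budgetPerTask <= 0) {
            throw new IllegalArgumentException("budgetPerTask must be positive");
        }
        // Stand-in for a task queue; each payload is the index to resume from.
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(0);
        int tasksRun = 0;
        while (!queue.isEmpty()) {
            int start = queue.poll();
            tasksRun++;
            int end = Math.min(start + budgetPerTask, totalItems);
            for (int i = start; i < end; i++) {
                handler.accept(i);
            }
            if (end < totalItems) {
                queue.add(end); // budget used up: hand the rest to a successor task
            }
        }
        return tasksRun;
    }
}
```

The important detail is that each task carries its resume position in its payload, so no task needs to know anything about the ones that ran before it.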
You might want to consider Backends, introduced in the newer version of GAE.
These run continuous processes.
Yes, it is possible. I have already built a solution on App Engine: wowprice.
Sharing all the details here would make my answer lengthy, so here is the outline.
Problem: suppose I want to crawl walmart.com. I know that I can't crawl it in one shot (millions of products).
Solution: I designed my spider to break the job into smaller tasks.
Step 1: I submit a job for walmart.com; the job scheduler creates a task.
Step 2: My spider picks up the job and notices that it is an index page. The spider then creates more jobs with the category pages as starting pages, which adds 20 more tasks.
Step 3: The spider then creates smaller jobs for the subcategories, and continues until it reaches a product list page and creates a task for it.
Step 4: For product list pages, it extracts the products and makes a call to store the product data; if there is a next page, it creates one more task to crawl it.
Advantages:
We can crawl without breaking the 30-second rule, and the crawling speed depends on the backend machine. This also provides parallel crawling for a single target.
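The four steps above describe a fan-out: every page visit is one small task that may enqueue further tasks. A simplified illustration in plain Java follows. The site is modeled as a hypothetical map from a page to the pages it links to, the deque stands in for the task queue, and no real fetching or storing happens.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a crawl expressed as many small tasks that enqueue follow-up tasks. */
final class CrawlFanOut {
    /** Returns every page reachable from start, in the order the tasks were created. */
    static List<String> crawl(String start, Map<String, List<String>> links) {
        Deque<String> tasks = new ArrayDeque<>();   // stand-in for the task queue
        Set<String> scheduled = new LinkedHashSet<>(); // pages that already have a task
        tasks.add(start);
        scheduled.add(start);
        while (!tasks.isEmpty()) {
            String page = tasks.poll(); // one short task: handle a single page
            for (String child : links.getOrDefault(page, List.of())) {
                if (scheduled.add(child)) {
                    tasks.add(child); // fan out: one new task per newly discovered page
                }
            }
        }
        return new ArrayList<>(scheduled);
    }
}
```

Tracking which pages already have a task keeps a page that is linked from several categories from being crawled more than once.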
They fixed it for you.
You can run background threads on a manually scaled instance.
Check https://developers.google.com/appengine/docs/python/modules/#Python_Background_threads
You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.
Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.
I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.
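The selection step that such a cron job performs could look roughly like this in plain Java. The class name, the map of last-scraped timestamps, and the limit parameter are assumptions for illustration; choosing the oldest pages first is one reasonable policy, not something the answer specifies.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch: pick a bounded subset of stale pages to enqueue on each cron run. */
final class StalePageSelector {
    /** Returns up to limit pages last scraped more than maxAge ago, oldest first. */
    static List<String> selectStale(Map<String, Instant> lastScraped,
                                    Instant now, Duration maxAge, int limit) {
        return lastScraped.entrySet().stream()
                // keep only pages whose age exceeds maxAge
                .filter(e -> Duration.between(e.getValue(), now).compareTo(maxAge) > 0)
                // earliest timestamp first, so the stalest pages are refreshed first
                .sorted(Map.Entry.comparingByValue())
                // cap the batch so one cron run never floods the target site
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The cap on the batch size is what prevents bursts of traffic against the scraped server; pages that do not fit into this run are picked up by a later one.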
Try this:
Run any program on App Engine. You connect from a browser and click to send the start URL via Ajax. The Ajax call hits the server, which downloads some data from the internet and returns the next URL to your browser. This is not one request; each URL is a separate request. You only have to work out in JS how Ajax calls the URLs in a cycle.
You can use the latest GAE service, called backends. Check this: http://code.google.com/appengine/docs/java/backends/
Backends are special App Engine instances that have no request deadlines, higher memory and CPU limits, and persistent state across requests. They are started automatically by App Engine and can run continuously for long periods. Each backend instance has a unique URL to use for requests, and you can load-balance requests across multiple instances.
