I have a jBPM process set up with a boundary timer on a human task, set to 30s for testing purposes. The timer is there to escalate to another task if the time expires.
This normally functions correctly - when the task is reached and 30s are up, the flow is moved to the next task.
However, if I bounce the server, it seems that none of the timers are recreated and the flow sits on that task indefinitely.
The chances of the server being bounced in the real world are fairly high, as the timeouts will be more likely to last a couple of days.
Does anyone know if this is a known issue?
How are you executing your process, using the execution server as part of jbpm-console or embedding the engine yourself?
If you are embedding the engine yourself, note that you need to reinitialize your RuntimeManager upon restart (don't wait on the first request to do this, as this won't reactivate timers).
Related
I am creating a lot of tasks to be processed in Cloud Tasks, but some of them are failing due to lack of available resources (instances). Please see the image below:
As you can see, the average time Google waits before throwing an HTTP 500 error is 10 seconds, and sometimes less than 10ms is enough to throw an HTTP 500. This queue has auto-retry set, so eventually all tasks are executed, but the errors remain.
Is there a way to increase this wait time? I don't mind waiting 5 minutes for the task to be processed; I just want to minimize the number of errors like this in my logging panel.
I asked a similar question not too long ago and didn't get any helpful answers. I don't know if you can increase the rate of creating tasks.
A few things to try:
Batch tasks together -- Instead of creating 10 tasks can you create one task that does the work of all 10?
Serial tasks -- Create a first task that does work and then creates a second task. The second task does work and then creates a third task, etc.
Pull queues might allow for a higher task creation rate. Not sure about that.
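As a rough illustration of the batching suggestion above, here is a minimal Python sketch. The names `chunk` and `make_batch_payloads` are hypothetical, and the step that actually enqueues each payload is left out because it depends on the client library in use.

```python
from typing import Dict, Iterator, List, Sequence


def chunk(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def make_batch_payloads(items: Sequence[str], size: int) -> List[Dict[str, List[str]]]:
    """Build one task payload per batch instead of one payload per item.

    Each payload would be sent as the body of a single task; the handler
    then loops over payload["items"] and does the work for all of them.
    """
    return [{"items": batch} for batch in chunk(items, size)]
```

With 25 work items and a batch size of 10, this produces 3 payloads rather than 25 tasks, so fewer requests compete for instances at the same moment.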
It seems that when App Engine task queue tasks get interrupted, they take 20 minutes or more to restart. Is this behavior normal?
I am using the TaskQueue on Google Cloud's App Engine Flexible system. I regularly add tasks to the task queue and they get processed on the system. Occasionally, a task gets interrupted in the middle of what it's doing. I don't know why this happens, but I assume it's because the instance that it's on restarted itself.
My software is resilient to such restarts, but the problem is that it takes a full 20 minutes for the task to be restarted. Has anyone experienced this before?
I think you're right, an instance grabs the task and then goes down. Taskqueue doesn't realize it and waits for some kind of timeout.
This sounds very similar to an issue I experienced:
app engine instance dies instantly, locking up deferred tasks until they hit 10 minute timeout
So to answer your question, I would say yes, this does happen. As for what to do, I guess it depends on what this task is doing, how often it runs, etc. If the 20 minute lag isn't a big deal I would just live with it, because fixing it can be a bit of a wild goose chase, but here's what I would try:
When launching tasks, launch duplicates as well with a staggered value for countdown/eta
Set up a separate microservice to handle/execute these tasks. Hopefully this will make its execution more predictable, and you'll be able to tweak instance size and scaling settings to better suit it.
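The first suggestion (staggered duplicates) only works if a duplicate does nothing once the job has already finished. Here is a minimal Python sketch of that idea. `staggered_countdowns` and `OnceOnlyRunner` are hypothetical names, and the in-memory set is only a stand-in for whatever shared store (datastore, memcache) the real app would use to record completion.

```python
from typing import Callable, List, Set


def staggered_countdowns(copies: int, spacing_seconds: int) -> List[int]:
    """Countdown values for the original task and its duplicates."""
    return [i * spacing_seconds for i in range(copies)]


class OnceOnlyRunner:
    """Runs the work for a job id at most once, however many copies arrive.

    Assumption: a real deployment would keep the completed ids in shared
    storage, not in process memory.
    """

    def __init__(self) -> None:
        self._done: Set[str] = set()

    def run(self, job_id: str, work: Callable[[], None]) -> bool:
        if job_id in self._done:
            return False  # an earlier copy already finished the job
        work()
        # Mark as done only after the work succeeded, so an interrupted
        # copy leaves the job open for the next duplicate.
        self._done.add(job_id)
        return True
```

Each value from `staggered_countdowns` would be passed as the countdown of one copy of the task, so a later copy picks up the job if the instance running an earlier copy died.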
I have a cron job which runs every 30 minutes and queues a task to be executed on a Dynamic Backend (B2).
The backend loops and does some work, then sleeps for a few minutes, and then repeats the work until the complete job is over after a few hours, after which the backend shuts down. (While the backend is running, no new task is actioned.)
Now two days in a row, I have seen my Backend stop abruptly (after 1.5hrs) with the familiar "Process terminated because the backend took too long to shutdown.". I have searched through the forums but could not identify WHY exactly my backend shuts down (apart from the theoretical list of reasons that Appengine doc provides). I have checked my DS/Memcache operations, Memory and all looks normal. I upgraded my backend from B1 to B2, but no luck.
Q1. Does anybody know how to debug this issue further?
Q2. Even after this, I want the job to be completed. If I register a shutdown hook with LifecycleManager.getInstance().setShutdownHook(), what is a good way to ensure that the job is resumed (considering that the cron job could still be 29 minutes away from its next execution, and I want the job to do its stuff every 2 minutes)?
Yes, the same has happened to me. I have a backend that uses constant memory and CPU. App Engine shuts it down periodically, usually after 15 minutes but sometimes before that. The docs say that it may get shut down without explanation: it will notify the backend and then shut it down.
You are supposed to handle it gracefully, which means the job should work in chunks and be able to restart its work. If you can't divide the work into chunks, don't use backends; use a Compute Engine instance.
For your first question you'd have to take a closer look at the logs. App Engine does promise to indicate shutdown behaviour through a request to /_ah/stop, so that would give more insight into the issue.
Now for your second question, stick with App Engine's suggestion of having more than one instance. In your case you could move away from looping over some entity indefinitely and going into a sleep state. Instead, have a cron job that looks up a task queue and processes a single task. If the task is processed successfully, mark it as such somewhere, or remove it from the queue once you're done processing it. That way, in case of failure the task is still available to be processed until it's marked successful, and your additional instances can take over.
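The "process a single task and only remove it on success" pattern described above can be sketched in Python like this. It is an illustration only: `process_next` is a hypothetical name, and a `deque` stands in for the real task queue.

```python
from collections import deque
from typing import Callable, Deque, Optional


def process_next(queue: Deque[str], handler: Callable[[str], None]) -> Optional[str]:
    """Process the task at the head of the queue, once per cron tick.

    The task is removed only after the handler returns normally. If the
    handler raises (or the instance dies midway), the task stays at the
    head of the queue for the next tick or for another instance.
    """
    if not queue:
        return None
    task = queue[0]   # peek, do not remove yet
    handler(task)     # may raise; the task then remains queued
    queue.popleft()   # success: now it is safe to remove
    return task
```

Because nothing is removed before the work succeeds, an abrupt shutdown costs at most one partially processed task, which is retried on the next tick.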
I am using Task Queue in GAE to perform some background work for my application. I have learned that there is a 10 minute time limit for any single task. My concern is how to test this in my local environment. I tried a thread sleep, but it didn't throw any exception as mentioned in the Google App Engine docs. Also, is this time limit measured in CPU time or actual time?
Thanks.
The time is measured in wall clock time. The development server doesn't enforce time limits. It's unclear why you'd want to test this, though, because your tests are unlikely to perform the same as they will in production: trying to guess how much you'll be able to accomplish in 10 minutes on the production servers by seeing how much you can accomplish in 10 minutes on the development server will fail horribly.
For your development server, start a timer when a task is initiated, and keep checking in your code whether you have reached 10 minutes of wall clock time. When you reach it, throw a DeadlineExceededError. It would be better to put the try and except statements in the handler classes that call a particular function of your code.
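A minimal Python sketch of that timer approach follows. The `DeadlineExceededError` class here is a local stand-in defined for the example, not the one from the App Engine runtime, and `Deadline` and `run_task` are hypothetical names. The clock is injectable so the behaviour can be exercised without actually waiting 10 minutes.

```python
import time
from typing import Callable, Iterable


class DeadlineExceededError(Exception):
    """Local stand-in for the error production raises at the time limit."""


class Deadline:
    """Tracks wall clock time elapsed since the task started."""

    def __init__(self, limit_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + limit_seconds

    def check(self) -> None:
        if self._clock() >= self._expires_at:
            raise DeadlineExceededError("simulated task deadline reached")


def run_task(steps: Iterable[Callable[[], None]],
             limit_seconds: float = 600,
             clock: Callable[[], float] = time.monotonic) -> int:
    """Run steps until done or until the simulated deadline is hit.

    Returns the number of steps completed. The try/except sits in the
    handler-level function, as suggested above, not inside each step.
    """
    deadline = Deadline(limit_seconds, clock)
    completed = 0
    try:
        for step in steps:
            deadline.check()  # checked between steps, not during one
            step()
            completed += 1
    except DeadlineExceededError:
        pass  # a real handler would save progress or re-enqueue here
    return completed
```

Note that this check only fires between steps, so a single long-running step can still overrun the limit; production interrupts the request wherever it happens to be.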
I'm currently in the process of rewriting my Java code to run it on Google App Engine. Since I cannot use Timer for programming timeouts (no thread creation allowed), I need to rely on the system clock to mark the start time of the timeout so that I can compare against it later to find out whether the timeout has occurred.
Now, several people (even on Google's payroll) have advised developers not to rely on system time, due to the distributed nature of Google's app servers and their inability to keep their clocks in sync. Some say the deviation between system clocks can be up to 10s or even more.
A 1s deviation would be very good for my app and 2 seconds would be tolerable. Anything higher than that would cause a lot of grief for me and my app's users, and a 10 second difference would make my app effectively unusable.
I don't know if anything has changed for the better since then (I hope so), but if not, what are my options other than firing off a new separate request whose handler sleeps for the duration of the timeout (which cannot exceed 30 seconds due to the request timeout limitation) in order to keep the timeout duration consistent?
Thanks!
More Specifically:
I'm trying to develop a poker game server. For those who are not familiar with how online poker works: I have a set of players attached to one game instance. Every player has a certain amount of time to act before the timeout occurs and the next player can act. There is a countdown on each actor and every client has to see it. Only one player can act at a time. The timeout durations I need are 10s and 20s for now.
You should never be making your request handlers sleep or wait. App Engine will only automatically scale your app if request handlers complete in an average of 1000ms or less; deliberately waiting will ruin that. There's invariably a better option than sleeping/waiting - let us know what you're doing, and perhaps we can suggest one.