How to implement automatic retries with GAE tasks? - google-app-engine

Here is my code:
class PublishPhotosHandler(webapp.RequestHandler):
    def post(self):  # method name assumed; the original snippet omits it
        for argument in files_arguments:
            taskqueue.add(url='/upload', params={'key': key})
class UploadWorker(webapp.RequestHandler):
    def post(self):
        key = self.request.get('key')
        result = urlfetch.fetch(...)
        # How do I return an error here so that the task is retried?

If a task fails to execute (by returning any HTTP status code outside of the range 200–299), App Engine retries the task until it succeeds. By default, the system gradually reduces the retry rate to avoid flooding your application with too many requests, but schedules retry attempts to recur at a maximum of once per hour until the task succeeds.
Raising any exception results in a non-2XX status code, so raising an exception causes the task to be queued again and retried.
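A minimal sketch of that behaviour, kept independent of the App Engine SDK so it runs anywhere. The helper `status_for_upload` and its `fetch` argument are hypothetical names introduced for illustration; they are not part of the original code. The helper maps the outcome of the fetch to the status the worker should answer with, and any status outside 200–299 makes the queue retry the task.

```python
def status_for_upload(fetch):
    """Return the HTTP status a task handler should respond with.

    `fetch` is a hypothetical zero-argument callable standing in for
    urlfetch.fetch(...); it returns the remote status code or raises.
    """
    try:
        remote_status = fetch()
    except Exception:
        # An uncaught exception in a real handler would also produce a
        # non-2xx response; here that outcome is made explicit.
        return 500
    if 200 <= remote_status <= 299:
        return 200  # success: the task is removed from the queue
    return 500      # non-2xx: the queue schedules a retry
```

Inside UploadWorker.post, a non-200 result could be applied with self.error(status); alternatively the handler can simply let the exception propagate.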

Related

How to handle transient/application failures in Apache Flink?

My Flink processor listens to Kafka, and the business logic in the processor involves calling external REST services that may be down. I would like to replay the tuple back into the processor. Is there any way to do that? I have used Storm, where we can fail a tuple so that it is not acknowledged and the same tuple is replayed to the processor.
In Flink, the tuple is acknowledged automatically once the message is consumed by the Flink-Kafka consumer. There are ways to work around this; one is to publish the message back to the same queue or to a retry queue. But I am looking for a solution similar to Storm's.
I know that Flink's savepoints/checkpoints are used for fault tolerance. But as I understand it, the tuples are replayed in case of a failure of Flink itself. I would like to get ideas on how to handle transient failures.
Thank you
When interacting with external systems, I would recommend using Flink's async I/O operator. It allows you to execute asynchronous tasks without blocking the execution of an operator.
If you want to retry failed operations without restarting the Flink job from the last successful checkpoint, then I would suggest implementing the retry policy yourself. It could look like this:
new AsyncFunction<IN, OUT>() {
    @Override
    public void asyncInvoke(IN input, ResultFuture<OUT> resultFuture) throws Exception {
        FutureUtils
            .retrySuccessfulWithDelay(
                () -> triggerAsyncOperation(input),
                Time.seconds(1L),
                Deadline.fromNow(Duration.ofSeconds(10L)),
                this::decideWhetherToRetry,
                new ScheduledExecutorServiceAdapter(new DirectScheduledExecutorService()))
            .whenComplete((result, throwable) -> {
                if (result != null) {
                    resultFuture.complete(Collections.singleton(result));
                } else {
                    resultFuture.completeExceptionally(throwable);
                }
            });
    }
}
with triggerAsyncOperation encapsulating your asynchronous operation and decideWhetherToRetry encapsulating your retry strategy. If decideWhetherToRetry returns true, then resultFuture will be completed with the value of this operation attempt.
If resultFuture is completed exceptionally, then it will trigger a failover, which will cause the job to restart from the last successful checkpoint.

How do I catch PermanentTaskFailure

Is this the correct way to catch PermanentTaskFailure? (https://cloud.google.com/appengine/articles/deferred)
def do_something_with_key(k):
    entity = k.get()
    # Do something with entity
    entity.put()
k = ndb.Key('MyModel', 123)
try:
    deferred.defer(do_something_with_key, k, _countdown=60)
except PermanentTaskFailure:
    pass  # catch here
Or do I need to put the try/except inside the do_something_with_key function?
The PermanentTaskFailure exception is typically raised when the task executes or attempts to execute, so you won't catch it when you create the task. Unless, maybe, if you do that from another task execution handler, but in that case it'd be for the enqueueing task, not for the task being enqueued. Or maybe if enqueuing itself has trouble? Not sure - I never got it in such case.
So, at best, I think you might be able to catch it from do_something_with_key(). But you won't be able to catch it for all cases - for example if the task code fails to execute - the exception is caught by the deferred library code itself, see an example in Issue with appengine deferred tasks, execution throws unknown error.
I was able to catch it (again, probably not for all cases), but that was after I switched from the deferred library to directly using the push tasks (which is what the deferred library uses under the hood).
The article you referenced discusses PermanentTaskFailure in the context of your handler code (intentionally) raising the exception to signal to the deferred library that it shouldn't enqueue yet another copy of the task - which is what it does by default if the task execution fails (based on its return code for the request), until the maximum number of retries is reached.
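As a rough sketch of the pattern the article describes, the task function itself can convert errors that will never succeed into PermanentTaskFailure, while letting other exceptions propagate so the task is retried. Treating a missing entity as the "permanent" case and the stand-in exception class (used only when the App Engine SDK is not importable) are assumptions made for illustration.

```python
try:
    from google.appengine.ext.deferred import PermanentTaskFailure
except ImportError:
    # Stand-in so the sketch runs outside App Engine (assumption).
    class PermanentTaskFailure(Exception):
        pass


def do_something_with_key(k):
    try:
        entity = k.get()
        if entity is None:
            # Hypothetical rule: a missing entity will not appear later,
            # so retrying is pointless.
            raise ValueError("entity not found")
        entity.put()
    except ValueError as e:
        # Signal to the deferred library that it should not retry.
        raise PermanentTaskFailure(str(e))
    # Any other exception propagates, and the task is retried.
```

With this shape, the try/except lives inside the deferred function, not around the deferred.defer() call.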

Stop Camel after too many retries

I am trying to implement more advanced Apache Camel error handling:
if there are too many pending retries, stop processing altogether and log all collected exceptions somewhere.
The first part (stopping on too many retries) is already implemented by the following helper method, which gets the size of the retry queue; I simply stop the context if the queue is over some limit:
static Long getToRetryTaskCount(CamelContext context) {
    Long retryTaskCount = null;
    ScheduledExecutorService errorHandlerExecutor = context.getErrorHandlerExecutorService();
    if (errorHandlerExecutor instanceof SizedScheduledExecutorService) {
        SizedScheduledExecutorService svc = (SizedScheduledExecutorService) errorHandlerExecutor;
        ScheduledThreadPoolExecutor executor = svc.getScheduledThreadPoolExecutor();
        BlockingQueue<Runnable> queue = executor.getQueue();
        retryTaskCount = (long) queue.size();
    }
    return retryTaskCount;
}
But this code smells to me, and I don't see any way here to collect the exceptions that caused all these retries.
There is also a new control bus component in Camel 2.11 which could do what you want:
template.sendBody("controlbus:route?routeId=foo&action=stop", null);
I wouldn't try to shut down the CamelContext, just the route in question. That way the rest of your app can still function, and you can get route stats, view or move messages to alternate queues, etc.
See https://camel.apache.org/how-can-i-stop-a-route-from-a-route.html

java.sql.SQLRecoverableException: Connection is already in use

In my Java code, I am processing a huge amount of data, so I moved the code into a servlet that runs as an App Engine cron job. Some days it works fine. As the amount of data has increased, the cron job no longer works and shows the following error message.
2012-09-26 04:18:40.627
'ServletName' 'MethodName': Inside SQLExceptionjava.sql.SQLRecoverableException:
Connection is already in use.
I 2012-09-26 04:18:40.741
This request caused a new process to be started for your application, and thus caused
your application code to be loaded for the first time. This request may thus take
longer and use more CPU than a typical request for your application.
W 2012-09-26 04:18:40.741
A problem was encountered with the process that handled this request, causing it to
exit. This is likely to cause a new process to be used for the next request to your
application. If you see this message frequently, you may be throwing exceptions during
the initialization of your application. (Error code 104)
How to handle this problem?
This exception is typical when a single connection is shared between multiple threads. This will in turn happen when your code does not follow the standard JDBC idiom of acquiring and closing the DB resources in the shortest possible scope in the very same try-finally block like so:
public Entity find(Long id) throws SQLException {
    Connection connection = null;
    // ...
    try {
        connection = dataSource.getConnection();
        // ...
    } finally {
        // ...
        if (connection != null) try { connection.close(); } catch (SQLException ignore) {}
    }
    return entity;
}
Your comment on the question,
@TejasArjun i used connection pooling with servlet Init() method.
doesn't give me the impression that you're doing it the right way. This suggests that you're obtaining a DB connection in the servlet's init() method and reusing the same one across all HTTP requests in all HTTP sessions. This is absolutely not right. A servlet instance is created and initialized only once, during the webapp's startup, and is reused for the entire remainder of the application's lifetime. This at least explains the exception you're facing.
Just rewrite your JDBC code according to the standard try-finally idiom as demonstrated above and you should be all set.
See also:
Is it safe to use a static java.sql.Connection instance in a multithreaded system?

how to manually set a task to run in a gae queue for the second time

I have a task that runs in a GAE queue.
Based on my own logic, I want to decide whether the task runs again.
I don't want it to complete normally and then be put in the queue again,
because I want to be able to check "X-AppEngine-TaskRetryCount"
and quit after several attempts.
As I understand it, a task is only re-executed when an internal GAE error occurs (or when my code takes too long and hits a DeadlineExceededException, and I don't want to hold the code "hostage" for that long :) ).
How can I put a task back in the queue in a way that makes GAE increment X-AppEngine-TaskRetryCount?
You can programmatically retry or restart a task by calling self.error() in Python.
From the docs: App Engine retries a task when it returns any HTTP status code outside of the range 200–299.
At the beginning of the task you can check the number of retries using:
retries = int(self.request.headers['X-Appengine-Taskretrycount'])
if retries < 10:
    self.error(409)
    return
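To give up after a fixed number of attempts, the retry count can be combined with the outcome of the work. The following is only a sketch: `status_for_attempt`, `work_succeeded`, and `max_retries` are hypothetical names, and the function is written without the SDK so it can run anywhere.

```python
def status_for_attempt(headers, work_succeeded, max_retries=10):
    """Decide which status a task handler should return.

    `headers` is any mapping holding the request headers; App Engine
    sets X-AppEngine-TaskRetryCount on task requests. A plain dict is
    case-sensitive, unlike real request headers (simplification).
    """
    retries = int(headers.get('X-AppEngine-TaskRetryCount', 0))
    if work_succeeded:
        return 200
    if retries < max_retries:
        return 409  # non-2xx: ask the queue to retry
    return 200      # give up: a 2xx response stops further retries
```

In a handler, a non-200 result would be passed to self.error(), as in the snippet above.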
