Visibly & permanently failing GAE tasks - google-app-engine

The App Engine task queues ensure that tasks are retried if they return a status code outside the 2xx range, which of course includes stray exceptions. This is fine for occasional failures such as timeouts, but in the case of permanent failures - when a task cannot complete successfully no matter how many times it's retried - it causes unnecessary load. One could of course return 2xx in such a case, but then the request would not be registered as erroneous by GAE and would not show up in the 'Errors' table on the Admin Console dashboard.
Hence I'm asking: is there a way to fail a task in such fashion that it is:
not retried (permanent failure)
visible in the Admin Console as an erroneous request

All you have to do to make a request show up in the errors tab of the admin console is to log at least one message at level ERROR or higher. Simply log said message, then return a 200 status code to ensure your task is not re-enqueued.
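A rough sketch of such a handler in Python with webapp2 (do_work and PermanentFailure are hypothetical placeholders, not part of the original question):

    import logging
    import webapp2

    class MyTaskHandler(webapp2.RequestHandler):
        def post(self):
            try:
                do_work(self.request)  # hypothetical: the task's actual work
            except PermanentFailure:   # hypothetical: an error retrying won't fix
                # Logging at ERROR level is what makes the request show up
                # in the Errors tab of the Admin Console.
                logging.exception("Task failed permanently, giving up")
                # Returning 200 tells the task queue the task is done,
                # so it will not be re-enqueued.
                self.response.set_status(200)
                return
            self.response.set_status(200)

    app = webapp2.WSGIApplication([('/tasks/my-task', MyTaskHandler)])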
If all you want to do is stop retrying after a certain number of retries, you can configure that.
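For example, retries can be capped per queue in queue.yaml with retry_parameters (the queue name and numbers here are just placeholders):

    queue:
    - name: my-task-queue
      rate: 5/s
      retry_parameters:
        task_retry_limit: 3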

There's a bit of a catch-22 here. You want to avoid retrying a task that fails in some particular way. But other than HTTP response codes, how is GAE to know? "Yes, I failed, but that's O.K." or "Yes, I failed, and I forever will" aren't possible to communicate with the HTTP response codes available (other than maybe "501 Not Implemented", which means something else). The closest you can get is a 2xx response, which some failure scenarios will preclude.
There's no facility for examining task stack traces, but if there were, determining that a particular stack trace means that a failure condition is permanent is going to be rather difficult. PhD dissertations might be involved.
I think this one comes down to testing and vigilance.

Related

GAE custom Go runtime - internal.flushLog error

I have recently changed to use custom Go runtime on GAE, and noticed many errors like this from logs:
internal.flushLog: Flush RPC: Call error 3: invalid security ticket: 6c8027dc99b3ed3e
internal.flushLog: Flush RPC: Canceled: (timeout)
The server is still running well, but I have no idea about that error, as well as why it happens.
I'm using a custom Go runtime via a Dockerfile, and the App Engine release is 1.9.37.
Any help to clarify the error would be highly appreciated. Thanks.
This is a known issue with the Go runtime on App Engine Flexible. It tends to happen when a line is logged right before the end of a request/response.
What happens is that when the line is logged it is actually put in a list of log lines to be batched together and sent to the application server as an RPC at periodic intervals. The security ticket is canceled at the end of a request/response, which can sometimes happen before the log lines have been flushed. It's harmless, except that you may lose a log line or two. :\
We're actively working on fixing it.

App Engine silently fails on some requests

Some requests silently fail in my python app, intermittently and unpredictably. The hallmarks of the failure are:
Request returns a 200, so the client doesn't know there's a problem.
Request does NOT successfully execute on the server.
No logging statements are recorded for the request.
Below is an example from my logs of a bunch of requests which are each supposed to write an entity to the datastore. You can see for the lower, successful request, a blue 'i' is present, indicating that info level logs were recorded. When I examine the datastore, an entity was successfully written for this request.
However, for the failed request, you can see there is just a white box, and there are no logging statements present at all. While the server returned a 200, no entity was written to the datastore for this request.
Has anyone encountered something like this before on App Engine? Any ideas on how to debug it? I've seen it in multiple different apps myself, but I've never been able to figure it out.
EDIT
To clarify, the main problem here is that code doesn't execute, as measured by the failure to write an entity. The spurious 200 and lack of logging is an associated symptom.
From a comment originally, but seems to be the resolution path for this issue:
Given that there are no log statements at all in the line and you appear to unpack the arguments and log them as soon as you enter the handler, this starts to look like an infrastructure/platform issue.
In such a case, it's best to open a public issue tracker issue, with "Type-Production" as a tag, including your app's app id and a timeframe, and as much information about your app and request handler involved as possible, and platform support will pick up the issue in the course of triage.
That said, it's worth examining the handler to make absolutely sure there's no way you could be exiting from it and sending a 200 without logging anything or seeing an exception. It all depends on what the code handling the request is capable of, what stack of libraries it's built upon, etc.
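As a sanity check, one minimal pattern (build_entity here is a hypothetical stand-in for whatever the handler actually does) is to log unconditionally as the very first statement, so any request that actually reaches your code leaves a trace:

    import logging
    import webapp2

    class WriteEntityHandler(webapp2.RequestHandler):
        def post(self):
            # Log before doing anything else; if the request shows a 200 but
            # this line never appears, the handler code was never entered.
            logging.info("WriteEntityHandler entered: %r", dict(self.request.params))
            entity = build_entity(self.request.params)  # hypothetical
            entity.put()
            logging.info("Entity written: %r", entity.key())
            self.response.set_status(200)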

Tasks targeted at dynamic backend fail frequently, silently

I had converted some tasks to run on a dynamic backend.
The tasks are failing silently [no logged error, no retry, nothing] ~20% of the time (min:10%, max:60%, sample:large, long term). Switching the task away from the backend restores retries and gets the failure rate back to ~0%.
Any ideas?
Converting it to a backend exacerbated the problem but wasn't the problem.
I had specified a task_retry_limit and the queue was a push queue. With a backend the number of instances is specified. (I believe you can replicate this issue on the frontend by ramping up requests rapidly, to a big number).
Tasks were failing with 503: Instance Unavailable until they hit the task_retry_limit. This is visible temporarily in Task Queues, but will not show up in the Logs.
I should be using pull queues. Even if my use case was ill-advised, I'd still +1 having a task that dies after repeated 503: Instance Unavailable responses log something, so it doesn't look like a phantom task.
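For reference, a rough sketch of the pull-queue pattern (assuming a queue named 'my-pull-queue' configured with mode: pull, and a hypothetical handle_payload worker):

    from google.appengine.api import taskqueue

    def process_pull_tasks():
        queue = taskqueue.Queue('my-pull-queue')
        # Lease up to 100 tasks for 60 seconds; the worker sets the pace,
        # so busy instances don't translate into push-side 503s.
        tasks = queue.lease_tasks(60, 100)
        for task in tasks:
            handle_payload(task.payload)  # hypothetical worker function
        if tasks:
            queue.delete_tasks(tasks)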
Which runtime are you using on the backend?
Try running the backend for a bit without dynamic set to true and exercise the failing component.
On my project, I have seen tasks that target a static backend disappear on occasion, but nowhere near the rate you are seeing.

receive an alert when job is *not* running

I know how to set up a job to alert when it's running.
But I'm writing a job which is meant to run many times a day, and I don't want to be bombarded by emails, but rather I'd like a solution where I get an alert when the job hasn't been executed for X minutes.
This can be achieved by setting the job to alert on execution, and then setting up some process which checks for these alerts and warns when no such alert has been seen for X minutes.
I'm wondering if anyone's already implemented such a thing (or equivalent).
Supporting multiple jobs with different X values would be great.
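A rough sketch of that monitoring process on App Engine (the memcache key names, thresholds, and notify_admin helper are all hypothetical; each job calls job_heartbeat on a successful run, and a frequent cron job calls check_heartbeats):

    import datetime
    from google.appengine.api import memcache

    # Per-job alert thresholds in minutes; names and values are illustrative.
    JOB_THRESHOLDS = {'sync-pos': 15, 'cleanup': 120}

    def job_heartbeat(job_name):
        # Called by the job itself whenever it runs successfully.
        memcache.set('heartbeat:' + job_name, datetime.datetime.utcnow())

    def check_heartbeats():
        # Called from a frequent cron job; alerts only when a job has gone quiet.
        now = datetime.datetime.utcnow()
        for job_name, minutes in JOB_THRESHOLDS.items():
            last = memcache.get('heartbeat:' + job_name)
            if last is None or now - last > datetime.timedelta(minutes=minutes):
                notify_admin('%s has not run for over %d minutes' % (job_name, minutes))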
The danger of this approach is this: suppose you set this up. One day you receive no emails. What does this mean?
It could mean
the supposed-to-be-running job is running successfully (and silently), and so the absence-of-running monitor job has nothing to say
or alternatively
the supposed-to-be-running job is NOT running successfully, but the absence-of-running monitor job has ALSO failed
or even
your server has caught fire, and can't send any emails even if it wants to
Don't seek to avoid receiving success messages - instead devise a strategy for coping with them. Because the only way to know that a job is running successfully is to get a message which says precisely this.

Service Broker error handling simulation

I'm currently working on a project in which multiple POSes are synchronized with a main server using the Service Broker feature. I'm now preparing error handling for this solution and want to show the client how it works. That means I will prepare test scripts for every kind of error, and the client will run them on a test POS to see whether errors are processed correctly.
We will use SQL Server 2008 R2 with poison message detection = OFF.
Message type = XML (but inside there can be different kinds of data; some nodes will contain BLOBs).
The POSes will be outside the domain, so the transport will be secured (but no dialog encryption).
I divide the errors into several sub-groups:
1. Logical errors (e.g. a string instead of a number). These will be processed by a TRY-CATCH block on the server side. Easy to simulate.
2. Service Broker configuration errors (a message either is not returned or cannot reach its destination). I think these can be handled using SQL Server Service Broker events, and the simulation would be some kind of "bad configuration" (SB GUID, service name, etc.).
3. Transport errors, i.e. a broken message. It is in fact the client's wish to test this kind of error. I do not know whether a secured transport level (certificate) protects us from this kind of error. Another question is how I can simulate it.
Questions:
are there other error types?
is the error handling logic described for #2 good enough?
how do I handle and simulate #3?
The second part of my article here goes into a discussion of Service Broker errors, how they occur and how to handle them. The important thing for you is to distinguish between two categories of errors:
recoverable: transport problems, most configuration errors like bad routing, or an unreachable server. All of these will result not in an SSB error, but in a delay. Messages will stay in transmission_queue on the expectation that the problem is transient and can be solved, including some configuration problems. Once the problem is solved, SSB will retry and the message gets delivered.
unrecoverable: these are problems SSB deems non-recoverable, e.g. a bad message format. In such a case the conversation will be aborted and both endpoints receive an Error message.
I also have an article Error Handling in Service Broker procedures that discusses some of the topics particular to exception handling in SSB activated context.
A final note: I strongly discourage you from turning poison message detection OFF. It is much better to disable the processing than to spin ad nauseam without making progress because of a poison message.
As for how to simulate a corrupted message: it is hard to do (you could try setting up a port forwarder that lets all traffic pass through but randomly corrupts some of it), but it is rather pointless. All SSB traffic, even when in clear text, is cryptographically signed, and any message corruption would result in an abrupt disconnect due to a message signing validation failure.
