Tasks targeted at a dynamic backend fail frequently and silently - google-app-engine

I had converted some tasks to run on a dynamic backend.
The tasks are failing silently [no logged error, no retry, nothing] ~20% of the time (min: 10%, max: 60%, large sample, over the long term). Switching the tasks away from the backend restores retries and brings the failure rate back to ~0%.
Any ideas?

Converting the tasks to run on a backend exacerbated the problem, but wasn't the root cause.
I had specified a task_retry_limit and the queue was a push queue. With a backend, the number of instances is fixed, so a burst of tasks can exhaust them. (I believe you could replicate this on the frontend by ramping up requests rapidly to a large number.)
Tasks were failing with 503: Instance Unavailable until they hit the task_retry_limit. This is visible temporarily in the Task Queues view, but never shows up in the Logs.
I should be using pull queues. Even if my use case was ill-advised, I'd still +1 having a task that dies after repeated 503: Instance Unavailable responses log something, so it doesn't look like a phantom task.
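
For concreteness, a minimal sketch of the kind of enqueue call involved, using the GAE Python taskqueue API; the handler URL, backend name, queue name, and retry limit below are illustrative placeholders, not the actual values from this setup.

    from google.appengine.api import taskqueue

    # Hypothetical values: '/work' handler, 'crunch' backend, retry limit of 3.
    task = taskqueue.Task(
        url='/work',
        target='crunch',  # route the task to the dynamic backend
        params={'key': 'some-entity-key'},
        retry_options=taskqueue.TaskRetryOptions(task_retry_limit=3),
    )
    task.add(queue_name='default')
    # If the backend's fixed pool of instances is saturated, each attempt can
    # fail with 503: Instance Unavailable; once task_retry_limit such failures
    # accumulate, the task is dropped without anything appearing in the Logs.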

Which runtime are you using on the backend?
Try running the backend for a bit without dynamic set to true and exercise the failing component.
On my project, I have seen tasks that target a static backend disappear on occasion, but nowhere near the rate you are seeing.

Related

What's the right way to do long-running processes on app engine?

I'm working with Go on App Engine and I'm trying to build an API that needs to perform a long-running background task - in this case it needs to parse and chunk a big file out to task queues. I'd like it to return a 200 and close the user connection immediately and let the process keep running in the background until it's complete (this could take 5-10 minutes). Task queues alone don't really work for my use case because parsing the initial file can take more than the time limit for an API request.
At first I tried a goroutine as a solution to this problem. This failed because my App Engine context expired as soon as the parent function closed the user connection. (I suppose I could try writing a goroutine that doesn't require a context, but then I'd lose logging and I'd need to fetch the entire remote file and pass it to the goroutine.)
Looking through the docs, it looks like App Engine used to have functionality to support exactly what I want to do: [runtime.RunInBackground], but that functionality is now deprecated and the replacement isn't obvious.
Is there a "right" or recommended way to do background processing now?
I suppose I could put a link to my big file into a task queue, but if I understand correctly, even functions called through task queues have to complete execution within a specified amount of time (is it 90 seconds?), and I need to be able to run longer than that.
Thanks for any help.
Try using:
appengine.BackgroundContext()
It should be long-lived, but it will only work on GAE Flex.

App Engine silently fails on some requests

Some requests silently fail in my Python app, intermittently and unpredictably. The hallmarks of the failure are:
Request returns a 200, so the client doesn't know there's a problem.
Request does NOT successfully execute on the server.
No logging statements are recorded for the request.
Below is an example from my logs of a bunch of requests which are each supposed to write an entity to the datastore. You can see that for the lower, successful request, a blue 'i' is present, indicating that info-level logs were recorded. When I examine the datastore, an entity was successfully written for this request.
However, for the failed request, you can see there is just a white box, and there are no logging statements present at all. While the server returned a 200, no entity was written to the datastore for this request.
Has anyone encountered something like this before on App Engine? Any ideas on how to debug it? I've seen it in multiple different apps myself, but I've never been able to figure it out.
EDIT
To clarify, the main problem here is that code doesn't execute, as measured by the failure to write an entity. The spurious 200 and the lack of logging are associated symptoms.
From a comment originally, but this seems to be the resolution path for the issue:
Given that there are no log statements at all for the request, and you appear to unpack the arguments and log them as soon as you enter the handler, this starts to look like an infrastructure/platform issue.
In such a case, it's best to open an issue on the public issue tracker with "Type-Production" as a tag, including your app's app id, a timeframe, and as much information as possible about your app and the request handler involved; platform support will pick up the issue in the course of triage.
That said, it's worth examining the handler to make absolutely sure there's no way you could be exiting from it and sending a 200 without logging anything or seeing an exception. It all depends on what the code handling the request is capable of, what stack of libraries it's built upon, etc.
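
As a sketch of that last point, a handler can be structured so that no code path can return a 200 without first logging. This is illustrative webapp2-style code with a placeholder write, not the original handler from the question.

    import logging
    import webapp2


    class WriteEntityHandler(webapp2.RequestHandler):
        """Hypothetical handler: log unconditionally on entry and exit so a
        request that reaches application code can never finish without a trace."""

        def post(self):
            # Log the raw arguments before any other work happens.
            logging.info("write_entity called with: %r", dict(self.request.POST))
            try:
                self._do_write()
            except Exception:
                # Log and re-raise so the request is recorded as a 5xx error.
                logging.exception("write_entity failed")
                raise
            logging.info("write_entity completed")
            self.response.set_status(200)

        def _do_write(self):
            # Placeholder for the datastore write; not part of the original post.
            pass


    app = webapp2.WSGIApplication([("/write_entity", WriteEntityHandler)])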

Getting "DeadlineExceededError" using GAE when doing many (~10K) DB updates

I am using Django 1.4 on GAE + Google Cloud SQL. My code works perfectly fine (on dev with a local sqlite3 DB for Django) but chokes with Server Error (500) when I try to "refresh" the DB. This involves parsing certain files, creating ~10K records, and saving them (I'm saving them in batches using commit_on_success).
Any advice?
This error is raised for frontend requests after 60 seconds (the limit was increased from the earlier 30 seconds).
Solution options:
Use the task queue (a time limit of 10 minutes is imposed, which is enough in practice); see the sketch after this answer.
Divide your task into smaller batches.
How we do it: we divide the work into smaller chunks on the client side and call the server repeatedly.
Both solutions work fine; which to choose depends on how you make these calls and whether you need the results back. The task queue doesn't return results to the client.
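
A minimal sketch of the task-queue variant, assuming GAE's deferred library; the batch size and the save_records() placeholder are illustrative, not the poster's code.

    from google.appengine.ext import deferred

    BATCH_SIZE = 500  # assumed batch size; keep each task well under its deadline


    def refresh_db(rows):
        """Split the parsed rows into batches and enqueue one task per batch."""
        for start in range(0, len(rows), BATCH_SIZE):
            # Each deferred task gets the task-queue deadline (minutes),
            # not the 60-second frontend request limit.
            deferred.defer(save_records, rows[start:start + BATCH_SIZE])


    def save_records(batch):
        """Placeholder for the real write, e.g. building and saving model
        objects inside Django's commit_on_success block."""
        pass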
For tasks that take longer than 30 seconds you should use the task queue.
Database operations can also time out when the batches are too big; try using smaller batches.
Google App Engine has a maximum time allowed for a request. If a request takes longer than 30 seconds, this error is raised. If you have a large quantity of data to upload, either import it directly from the admin console, break the request up into smaller chunks, or use the command line python manage.py dbshell to upload the data from your computer.

Visibly & permanently failing GAE tasks

The App Engine task queues ensure that tasks are retried if they return a status code outside of the 2xx range, which obviously includes stray exceptions. This is fine for occasional failures such as timeouts, but in the case of permanent failures, when a task cannot complete successfully no matter how many times it's retried, it causes unnecessary load. Of course one could return a 2xx in such a case, but then the request would not be registered as erroneous by GAE and would not show up in the 'Errors' table on the Admin Console dashboard.
Hence I'm asking: is there a way to fail a task in such a fashion that it is:
not retried (a permanent failure)
visible in the Admin Console as an erroneous request
All you have to do to make a request show up in the Errors tab of the Admin Console is to log at least one message at level ERROR or higher. Simply log that message, then return a 200 status code to ensure your task is not re-enqueued.
If all you want is to stop retrying after a certain number of attempts, you can configure that with task_retry_limit.
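
A minimal sketch of that recipe in a webapp2-style push-queue handler; PermanentTaskError and do_work() are hypothetical stand-ins, not part of the original answer.

    import logging
    import webapp2


    class PermanentTaskError(Exception):
        """Hypothetical marker for failures that retrying cannot fix."""


    def do_work(params):
        # Placeholder for the real task body.
        pass


    class WorkerHandler(webapp2.RequestHandler):
        def post(self):
            try:
                do_work(dict(self.request.POST))
            except PermanentTaskError as err:
                # Log at ERROR so the request shows up in the Admin Console's
                # Errors table, then return 200 so the task is not retried.
                logging.error("Giving up on task permanently: %s", err)
                self.response.set_status(200)
                return
            except Exception:
                # Transient failure: let it propagate as a 500 so GAE retries,
                # subject to any configured task_retry_limit.
                logging.exception("Transient task failure; will be retried")
                raise


    app = webapp2.WSGIApplication([("/tasks/worker", WorkerHandler)])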
There's a bit of a catch-22 here. You want to avoid retrying a task that fails in some particular way. But other than HTTP response codes, how is GAE to know? "Yes, I failed, but that's OK" and "Yes, I failed, and I forever will" aren't possible to communicate with the HTTP response codes available (other than maybe 501 Not Implemented, which means something else). The closest you can get is a 2xx response, which some failure scenarios will preclude.
There's no facility for examining task stack traces, and even if there were, determining that a particular stack trace means a failure condition is permanent would be rather difficult. PhD dissertations might be involved.
I think this one comes down to testing and vigilance.

SQL Server Service Broker Service Disappearing (Automatically Deleted)?

I've implemented a messaging system over SQL Server Service Broker. It is working great, with the sole exception that every once in a while (maybe once per week per server) my initiator service just vanishes without a trace. The corresponding queue is still there, but the service is missing.
Obviously this causes problems in my system. It's a simple matter to recreate the service by hand, but I'm confused as to what might cause this behavior. I understand that automatic poison message handling causes queues to be disabled, but I don't see anything that indicates services can be disabled or deleted automatically.
When this happens, I usually have a large backlog of messages in multiple application queues, but nothing extreme. Total message backlog is around 200,000.
Does anyone know what might be happening here?
You must have a bug of some sort that issues a DROP SERVICE statement. That is the only way a service gets deleted.
Check the default trace: the DROP statement gets traced and saved into it, so you can track down the application/user/statement that issued the DROP. Query sys.traces to find the location of the default trace, then open the .TRC file in Profiler.
