How to get notifications when a GAE cron job fails? - google-app-engine

I have an app which stores user data in GCP Datastore. Since this data is very important, I have made a cron job that is scheduled to export the data in the datastore using the instructions given here: https://cloud.google.com/datastore/docs/schedule-export
Now I want to be notified when this cron job fails. I tried the Stackdriver Error Reporting service (https://console.cloud.google.com/errors?time=PT1H&filter&order=COUNT_DESC), but it doesn't notify me when the job fails (yes, I intentionally made it fail to test this). The problem is that Stackdriver records cron job failures as mere warnings rather than errors.
How can I get notifications when the cron job fails?
One way would be to fetch the Stackdriver logs with the Logging entries.list API (https://cloud.google.com/logging/docs/reference/v2/rest/) from a second cron job that notifies me whenever a log entry has severity WARNING, ERROR, or CRITICAL, but this process is very tedious.

You can try to catch the exceptions and send yourself an email to get notified of a task failure.
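
As a rough illustration of that suggestion, here is a minimal Python sketch. The names `build_failure_email` and `run_with_notification` are hypothetical, and `send` stands in for whatever mail-sending function your app actually uses (it is an assumption, not a specific App Engine API). The wrapper re-raises the exception so the failure still shows up in the logs.

```python
import traceback
from email.message import EmailMessage


def build_failure_email(job_name, error_text, sender, recipient):
    """Compose a plain-text email describing a failed job."""
    msg = EmailMessage()
    msg["Subject"] = "Cron job failed: " + job_name
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(error_text)
    return msg


def run_with_notification(job, job_name, send, sender, recipient):
    """Run job(); on any exception, hand an email to send() and re-raise."""
    try:
        return job()
    except Exception:
        # `send` is a placeholder for your real mail-sending callable.
        send(build_failure_email(job_name, traceback.format_exc(),
                                 sender, recipient))
        raise  # keep the failure visible to the caller and the logs
```

In the cron handler you would call something like `run_with_notification(export_datastore, "datastore-export", send, sender, recipient)`, where `export_datastore` is your own export function.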

Related

Cloud Tasks client ignores retry configuration

Basically what the title says. The API and client docs state that a retry can be passed to create_task:
retry (Optional[google.api_core.retry.Retry]): A retry object used
to retry requests. If ``None`` is specified, requests will
be retried using a default configuration.
But this simply doesn't work. Passing a Retry instance does nothing and the queue-level settings are still used. For example:
from google.api_core.retry import Retry
from google.cloud.tasks_v2 import CloudTasksClient
client = CloudTasksClient()
retry = Retry(predicate=lambda _: False)
client.create_task('/foo', retry=retry)
This should create a task that is never retried. I've tried all sorts of different configurations, and every time it just uses whatever settings are set on the queue.
You can pass a custom predicate to retry on different exceptions. There is no formal indication that this parameter prevents retrying. You may check the Retry page for details.
Google Cloud Support has confirmed that task-level retries are not currently supported. The documentation for this client library is incorrect. A feature request exists here https://issuetracker.google.com/issues/141314105.
Task-level retry parameters are available in the Google App Engine bundled service for task queuing, Task Queues. If your app is on GAE, which I'm guessing it is since your question is tagged with google-app-engine, you could switch from Cloud Tasks to GAE Task Queues.
Of course, if your app relies on something that is exclusive to Cloud Tasks like the beta HTTP endpoints, the bundled service won't work (see the list of new features, and don't worry about the "List Queues command" since you can always see that in the configuration you would use in the bundled service). Barring that, here are some things to consider before switching to Task Queues.
Considerations
Supplier preference - Google seems to be preferring Cloud Tasks. From the push queues migration guide intro: "Cloud Tasks is now the preferred way of working with App Engine push queues"
Lock in - even if your app is on GAE, moving your queue solution to the GAE bundled one increases your "lock in" to GAE hosting (i.e. it makes it even harder for you to leave GAE if you ever want to change where you run your app, because you'll lose your task queue solution and have to deal with that in addition to dealing with new hosting)
Queues by retry - the GAE Task Queues to Cloud Tasks migration guide section Retrying failed tasks suggests creating a dedicated queue for each set of retry parameters, and then enqueuing tasks accordingly. This might be a suitable way to continue using Cloud Tasks
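
The "queues by retry" idea could be sketched roughly as follows in Python. The profile names and queue IDs are hypothetical, and each queue is assumed to have been created beforehand with its own retry settings. The helper only builds the queue path in the `projects/.../locations/.../queues/...` form that Cloud Tasks uses as the `parent` when creating a task.

```python
# Hypothetical retry profiles mapped to dedicated queue IDs. Each queue is
# assumed to exist already and to carry its own retry configuration.
QUEUE_BY_PROFILE = {
    "no_retry": "tasks-no-retry",
    "default": "tasks-default",
    "aggressive": "tasks-retry-aggressive",
}


def queue_path(project, location, profile):
    """Return the fully qualified queue name for a retry profile."""
    try:
        queue_id = QUEUE_BY_PROFILE[profile]
    except KeyError:
        raise ValueError("unknown retry profile: %r" % (profile,))
    return "projects/{}/locations/{}/queues/{}".format(
        project, location, queue_id)
```

A task that must never be retried would then be enqueued with `queue_path(project, location, "no_retry")` as its parent, instead of relying on a per-task retry argument.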

Setting dataflow job dependencies using Cloud function and App engine

Just a thought: my first job will be triggered by a Cloud Function on any file-arrival event, and I will capture its job ID in the Cloud Function itself. Once I have the job ID, I will pass it to App Engine code that monitors the job with that ID for completion. Once the App Engine code sees that the job's status is 'success', it will trigger another job that depends on the successful completion of the first.
Is this possible? If so, any help with code samples would be highly appreciated.
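
The monitoring step described above might look roughly like this Python sketch. `get_state` and `trigger_next` are hypothetical callables that you would implement against the Dataflow API and your own job launcher. The state strings are assumed to follow Dataflow's `JOB_STATE_*` naming, and the polling interval and limit are arbitrary.

```python
import time

# Terminal states after which the dependent job should not be started.
TERMINAL_FAILURES = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_then_trigger(job_id, get_state, trigger_next,
                      poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll a job's state; start the dependent job once it is done.

    Returns True if the dependent job was triggered, False if the first
    job failed, was cancelled, or did not finish within max_polls checks.
    """
    for _ in range(max_polls):
        state = get_state(job_id)  # hypothetical lookup of the job state
        if state == "JOB_STATE_DONE":
            trigger_next(job_id)   # hypothetical launcher for the next job
            return True
        if state in TERMINAL_FAILURES:
            return False
        sleep(poll_seconds)
    return False
```

Passing `get_state` and `trigger_next` in as arguments keeps the dependency logic separate from the API calls, so the loop can be exercised without touching a real job.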

How do I run a cron job on Google App Engine immediately?

I have configured Google App Engine to record exception with ereporter.
The cron job is configured to run every 59 minutes. The cron.yaml is as follows:
cron:
- description: Daily exception report
  url: /_ereporter?sender=xxx.xxx@gmail.com # The sender must be an app admin.
  schedule: every 59 minutes
How do I run this immediately?
What I am trying to do here is simulate a 500 HTTP error and see the stack trace delivered immediately via the cron job.
Just go to the URL from your browser.
You can't do that with cron. Cron is a scheduling system; the closest you can get is to have it run every minute.
Alternatively, you could wrap your entire handler in a try/except block and try to catch everything (you can do this for some DeadlineExceededErrors, for instance), then fire off a task that invokes the ereporter handler, and then re-raise the exception.
However, in many cases Google infrastructure can be the cause of the Error 500, and you won't be able to catch the error. To be honest, you are only likely to be able to trigger an email for a subset of all possible Error 500s. The most reliable way would probably be to have a process continuously monitor the logs and send email from there.
Mind you, email isn't considered reliable or fast, so a 1-minute cron cycle is probably fast enough.
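
For the log-monitoring approach, the filtering step could be sketched like this in Python. It assumes log entries have already been fetched as dictionaries with a `severity` field using the standard Cloud Logging severity names. Fetching the entries and sending the email are left out, and `entries_at_or_above` is a hypothetical helper name.

```python
# Cloud Logging severity names, from least to most severe.
SEVERITY_RANK = {
    name: rank for rank, name in enumerate([
        "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
        "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
    ])
}


def entries_at_or_above(entries, threshold="WARNING"):
    """Return the entries whose severity is at least `threshold`.

    Entries with a missing or unrecognized severity are treated as DEFAULT.
    """
    minimum = SEVERITY_RANK[threshold]
    return [
        entry for entry in entries
        if SEVERITY_RANK.get(entry.get("severity"), 0) >= minimum
    ]
```

A monitoring job would run this over each batch of new entries and send a notification only when the result is non-empty. Because the threshold is WARNING, it also picks up failures that are logged as warnings rather than errors.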
I came across this thread as I was trying to do this as well. A (hacky) solution I found was to add a curl command at the end of my cloudbuild.yaml file that triggers the job immediately, per this thread: "Make a curl request in Cloud Build CI/CD pipeline". Hope this helps!

How can Google App Engine be prevented from immediately rescheduling tasks after status code 500?

I have a Google App Engine servlet that is cron configured to run once a week. Since it will take more than 1 minute of execution time, it launches a task (i.e. another servlet task/clear) on the application's default push task queue.
Now what I'm observing is this: if the task causes an exception (e.g. NullPointerException inside its second servlet), this gets translated into HTTP status 500 (i.e. HttpURLConnection.HTTP_INTERNAL_ERROR) and Google App Engine apparently reacts by immediately relaunching the same task again. It announces this by printing:
Web hook at http://127.0.0.1:8888/task/clear returned status code 500. Rescheduling..
I can see how this can sometimes be a feature, but in my scenario it's inappropriate. Can I request that Google App Engine should not do such automatic rescheduling, or am I expected to use other status codes to indicate error conditions that would not cause rescheduling by its rules? Or is this something that happens only on the dev. server?
BTW, I am currently also running other tasks (with different frequencies) on the same task queue, so throttling reschedules on the level of task queue configuration would be inconvenient (so I hope there is another/better option too.)
As per https://developers.google.com/appengine/docs/java/taskqueue/overview-push#Java_Task_execution, the task must return a response code between 200 and 299 to be considered successful.
You can return a status code in that range, set taskRetryLimit in RetryOptions, or check the X-AppEngine-TaskExecutionCount header when the task starts to see how many times it has already run and act accordingly.
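
The header-based option could be sketched as follows. The logic is shown in Python for brevity, but the same check applies in a Java servlet. It assumes the header carries the number of earlier failed executions (so 0 on the first attempt) and that `headers` is a plain dictionary keyed by the exact header name. `should_give_up` and `status_for` are hypothetical helpers.

```python
def should_give_up(headers, max_attempts=1):
    """Return True once the task has already failed max_attempts times."""
    raw = headers.get("X-AppEngine-TaskExecutionCount", "0")
    try:
        previous_failures = int(raw)
    except ValueError:
        previous_failures = 0  # unreadable header: treat as first attempt
    return previous_failures >= max_attempts


def status_for(headers, work, max_attempts=1):
    """Run work() and return the HTTP status the task handler should send."""
    if should_give_up(headers, max_attempts):
        return 200  # a 2xx response tells the queue to stop rescheduling
    try:
        work()
    except Exception:
        return 500  # a non-2xx response lets the queue reschedule the task
    return 200
```

With `max_attempts=1`, a failing task runs once, is rescheduled once, and the second delivery returns 200 without doing the work again.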
I think I've found a solution: in the Java API, there is a method RetryOptions#taskRetryLimit, which serves my case.

'Version is not ready' error on update - GAE Python

I am unable to update either my frontends or my backends. I get the error message 'Version is not ready'. This bug has persisted for almost 24 hours now. I have a task perpetually running in a queue. My best guess is that this task is blocking the update. I am unable to delete the task as it is perpetually running, nor can I delete the queue as I am unable to upload a new queue.yaml definition. The same task previously failed due to a maximum recursion error, as I had a synchronous RPC within an asynchronous tasklet.
I'm pretty sure the fix will require someone from the GAE side forcibly resetting the task queue. Thus, this question would be more suitably directed to the GAE team with details about my app in a less public forum. Though, from what I can see, they do not allow direct support questions and suggest posting the question here. My follow up question, then, is when you have a GAE issue that requires action from the GAE team - how do you get hold of them (other than paying US$500/month for a premium support account)?
EDIT:
The task is/was meant to be running on a backend instance. I intended to shut down all backend and frontend instances via the console, assuming that this would cancel the task and that the instances would restart themselves. But I found that only one frontend instance was running, and no backends. After shutting down that frontend instance, the dashboard reported that I had 0 instances running, yet the website is still serving and the task remains perpetually running.
EDIT:
Disabling the app stopped the task from running. After reenabling the app, I was able to update it. Though I am left with a ghost task in my queue.
If you have a stuck task queue job, I'd try disabling the queue and killing the instance running that job. If that doesn't work, I'd try disabling the app temporarily.
