I'm currently trying to monitor Flink streaming jobs with Prometheus.
One of the requirements is to send an alert when a job has failed.
According to the documentation, the metric flink_jobmanager_job_downtime emits -1 for completed jobs, so I created an alert using the following expression.
expr: flink_jobmanager_job_downtime{job_id=".*"} == -1
The problem is that, checking the Prometheus web UI, the metric flink_jobmanager_job_downtime never emits -1 for a failed job.
In fact it only ever emits 0, so the alert never triggers.
Am I missing something or is this really the expected behavior?
Related
We are using the Flink REST API to submit jobs to Flink EMR clusters that are already running in session mode. We want to know if there is any way to pass the following Flink JobManager configuration parameters while submitting a job via the Flink REST API:
s3.connection.maximum: 1000
state.backend.local-recovery: true
state.checkpoints.dir: hdfs://ha-nn-uri/flink/checkpoints
state.savepoints.dir: hdfs://ha-nn-uri/flink/savepoints
I figured out that the job submission request has a "programArgs" field and tried using it, but the Flink JobManager configuration didn't pick up these settings:
"programArgs": f" --s3.connection.maximum 1000 state.backend.local-recovery true --stage '{ddb_config}' --cell-name '{cluster_name}'"
I'm using Kafka and Flink. Flink reads my messages from Kafka, executes some business logic, writes to a DB, and sends the result to a third-party API (e.g. Mail, Google Sheets). Each message must be sent exactly once. Everything works well, but if the job fails and restarts (I'm using checkpoints), some messages are replayed and re-sent to the third-party API. I could use Redis to check whether a message has already been sent, but then every message has to be checked against Redis, which hurts performance. I'm wondering if there is a solution that doesn't need Redis to check for duplicates.
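Not part of the original question, but one commonly suggested Redis-free sketch is to deduplicate inside the job with Flink keyed state: the "already sent" flag is checkpointed together with the Kafka offsets, so a replay after a restart finds the flag and drops the duplicate. The Message type and its getId() accessor below are hypothetical, and note that this only removes replayed duplicates inside the job; a call to a non-transactional third-party API still cannot be made atomically exactly-once.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Used as: stream.keyBy(Message::getId).flatMap(new DedupFunction())
public class DedupFunction extends RichFlatMapFunction<Message, Message> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void flatMap(Message msg, Collector<Message> out) throws Exception {
        // The flag lives in checkpointed keyed state, so a message replayed
        // after a failure/restart is recognised and not forwarded again.
        if (seen.value() == null) {
            seen.update(true);
            out.collect(msg);   // only the first occurrence reaches the third-party sink
        }
    }
}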
I'm trying to monitor the availability of my Flink jobs using Prometheus alerts.
I have tried the flink_jobmanager_job_uptime/downtime metrics, but they don't seem to fit, since they simply stop being emitted after the job has failed or finished.
I have already been pointed to the numRunningJobs metric as a way to alert on a missing job. I don't want to use that solution, since I would have to update my Prometheus config each time I want to deploy a new job.
Has anyone managed to create an alert for a failed Flink job using Prometheus?
Prometheus has an absent() function that returns 1 if the metric doesn't exist, so you can just set the alert expression to something like
absent(flink_jobmanager_job_uptime) == 1
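For reference, a minimal alerting-rule sketch built around that expression (group name, alert name, and the 5m hold time are placeholders, not from the original answer). Note that absent() only fires when no series of the metric exists at all, so if several jobs report to the same Prometheus you would put a per-job label selector inside absent().

groups:
  - name: flink
    rules:
      - alert: FlinkJobGone
        expr: absent(flink_jobmanager_job_uptime) == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "flink_jobmanager_job_uptime is no longer being reported"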
I have an app which stores user data in GCP Datastore. Since this data is very important, I have made a cron job that is scheduled to export the data in the datastore using the instructions given here: https://cloud.google.com/datastore/docs/schedule-export
Now, I want to get notified when this cron job fails. I have tried the Error Reporting service from Stackdriver (https://console.cloud.google.com/errors?time=PT1H&filter&order=COUNT_DESC), but it doesn't notify me when the job fails (yes, I intentionally made it fail to test this). The problem is that Stackdriver regards cron job failures as mere warnings instead of errors.
How can I get notified when the cron job fails?
One way would be to fetch the Stackdriver logs using the Logging entries.list API (https://cloud.google.com/logging/docs/reference/v2/rest/) and then run another cron job that notifies me whenever any log entry has severity WARNING, ERROR, or CRITICAL, but this process is very tedious.
You can try to catch the exceptions and send yourself an email to get notified of a task failure.
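A rough sketch of that suggestion, assuming the cron handler is a Java servlet (the triggerDatastoreExport() helper is a placeholder for the HTTP call described in the schedule-export guide, and the sender address is a placeholder too):

import com.google.appengine.api.mail.MailService;
import com.google.appengine.api.mail.MailServiceFactory;
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class DatastoreExportCronServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try {
            triggerDatastoreExport();   // placeholder for the managed-export call
            resp.setStatus(200);
        } catch (Exception e) {
            // Don't rely on Stackdriver's severity mapping: mail the admins directly.
            MailService.Message msg = new MailService.Message();
            msg.setSender("cron@your-project-id.appspotmail.com");  // placeholder sender
            msg.setSubject("Datastore export cron failed");
            msg.setTextBody("The scheduled Datastore export failed: " + e);
            MailServiceFactory.getMailService().sendToAdmins(msg);
            resp.setStatus(500);        // keep the failure visible in the cron log
        }
    }

    private void triggerDatastoreExport() {
        throw new UnsupportedOperationException("see the schedule-export guide");
    }
}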
I have a Google App Engine servlet that is cron-configured to run once a week. Since it will need more than 1 minute of execution time, it launches a task (i.e. another servlet, /task/clear) on the application's default push task queue.
Now what I'm observing is this: if the task throws an exception (e.g. a NullPointerException inside the second servlet), this gets translated into HTTP status 500 (i.e. HttpURLConnection.HTTP_INTERNAL_ERROR), and Google App Engine apparently reacts by immediately relaunching the same task. It announces this by printing:
Web hook at http://127.0.0.1:8888/task/clear returned status code 500. Rescheduling..
I can see how this can sometimes be a feature, but in my scenario it's inappropriate. Can I request that Google App Engine not do such automatic rescheduling, or am I expected to use other status codes to indicate error conditions that would not cause rescheduling under its rules? Or is this something that happens only on the dev server?
BTW, I am currently also running other tasks (with different frequencies) on the same task queue, so throttling reschedules at the level of the task queue configuration would be inconvenient (so I hope there is another/better option).
As per https://developers.google.com/appengine/docs/java/taskqueue/overview-push#Java_Task_execution, the task must return a response code between 200 and 299.
You can either return such a success code, set taskRetryLimit in RetryOptions, or check the X-AppEngine-TaskExecutionCount header when the task launches to see how many times it has already been launched and act accordingly.
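A sketch of the header-based variant inside the /task/clear servlet (the give-up threshold of 3 is arbitrary):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ClearTaskServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // 0 on the first attempt, incremented by App Engine on each retry.
        int attempts = Integer.parseInt(req.getHeader("X-AppEngine-TaskExecutionCount"));
        if (attempts >= 3) {
            log("Giving up on /task/clear after " + attempts + " failed attempts");
            resp.setStatus(200);   // any 2xx tells the queue the task is done
            return;
        }
        try {
            // ... the actual clearing work ...
            resp.setStatus(200);
        } catch (RuntimeException e) {
            log("clear task failed", e);
            resp.setStatus(500);   // non-2xx: the queue will retry/reschedule
        }
    }
}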
I think I've found a solution: in the Java API, there is a method RetryOptions#taskRetryLimit, which serves my case.
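For completeness, a sketch of what that looks like where the weekly cron servlet enqueues the task (the limit of 0 is illustrative and, as far as I understand it, disables automatic retries for this task only, leaving the rest of the default queue untouched):

import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.RetryOptions;
import com.google.appengine.api.taskqueue.TaskOptions;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class WeeklyCronServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        // Per-task retry options override the queue defaults, so other tasks
        // on the default queue keep their normal rescheduling behaviour.
        QueueFactory.getDefaultQueue().add(TaskOptions.Builder
                .withUrl("/task/clear")
                .retryOptions(RetryOptions.Builder.withTaskRetryLimit(0)));
    }
}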