I'm trying to monitor the availability of my Flink jobs using Prometheus alerts.
I have tried the flink_jobmanager_job_uptime/downtime metrics, but they don't seem to fit, since they simply stop being emitted after the job has failed/finished.
I have already been pointed to the numRunningJobs metric as a way to alert on a missing job. I don't want to use this solution, since I would have to update my Prometheus config each time I want to deploy a new job.
Has anyone managed to create an alert for a failed Flink job using Prometheus?
Prometheus has an absent() function that returns 1 if the metric doesn't exist, so you can set the alert expression to something like
absent(flink_jobmanager_job_uptime) == 1
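Wired into an alerting rule, a minimal sketch could look like this (the group and alert names, the 5m duration, and the labels are placeholders to adapt):

groups:
  - name: flink-availability            # placeholder group name
    rules:
      - alert: FlinkJobMissing
        # Fires when no flink_jobmanager_job_uptime series exists at all,
        # i.e. no job is reporting uptime anymore.
        expr: absent(flink_jobmanager_job_uptime) == 1
        for: 5m                         # tolerate short scrape gaps
        labels:
          severity: critical
        annotations:
          summary: "No Flink job is reporting the uptime metric"

Note that absent() only fires when no series with that name exists at all; to watch one specific job you would match on its labels, e.g. absent(flink_jobmanager_job_uptime{job_name="myjob"}) (assuming your metrics reporter attaches a job_name label).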
Related
Currently, I have a streaming job which fires a batch job when it receives a specific trigger.
I want to follow that fired batch job and, when it finishes, insert an entry into a database like Elasticsearch.
Any ideas how we can achieve this? How can we listen to that job?
Flink provides some REST APIs to query job status; you could use this one to query the batch job's state: https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#jobs-jobid. While tasks are running, their status is reported to the JobManager, so through this API you can get the job state from the response to the request.
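A minimal polling sketch in Python, assuming the JobManager's REST endpoint is reachable at http://hostname:8081; the Elasticsearch endpoint, index name, and poll interval are placeholders:

import time
import requests

FLINK_REST = "http://hostname:8081"              # assumed JobManager address
TERMINAL = {"FINISHED", "FAILED", "CANCELED"}    # Flink's terminal job states

def wait_for_job(job_id, poll_seconds=10):
    # Poll /jobs/<jobid> until the job reaches a terminal state.
    while True:
        resp = requests.get(f"{FLINK_REST}/jobs/{job_id}")
        resp.raise_for_status()
        state = resp.json()["state"]
        if state in TERMINAL:
            return state
        time.sleep(poll_seconds)

state = wait_for_job("<jobid>")  # the batch job's id, captured at submission
if state == "FINISHED":
    # Record completion, e.g. index a document into Elasticsearch.
    requests.post("http://elastic:9200/job-completions/_doc",  # placeholder
                  json={"job_id": "<jobid>", "state": state})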
I'm currently trying to monitor Flink streaming jobs with Prometheus.
One of the requirements is to send an alert when a job has failed.
According to the documentation, the metric flink_jobmanager_job_downtime emits -1 for completed jobs, so I have created an alert using the following expression.
expr: flink_jobmanager_job_downtime{job_id=~".*"} == -1
The problem is that when I check the Prometheus web UI, the metric flink_jobmanager_job_downtime never emits -1 for a failed job.
In fact it only emits 0, so the alert never triggers.
Am I missing something, or is this really the expected behavior?
I am trying to use Flink's monitoring REST API in order to retrieve some metrics for a specific time period.
Looking at the documentation, I can find the metrics of the job by navigating to http://hostname:8081/jobs/:jobid, where I get the following:
{
  "jid": "692c1d818afb77daaca891484e0b6a7g",
  "name": "myjob",
  "isStoppable": false,
  "state": "RUNNING",
  "start-time": 1570552858876,
  "end-time": -1,
  "duration": 62639599,
  "now": 1570615498475,
  ...
}
I would like to know if there is a method for requesting metrics for a specific start-time and end-time; the documentation does not mention whether this can be done.
I don't think you can achieve that via the REST API.
But you can definitely export Flink metrics for further analysis.
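The REST API only returns the current value of a metric, so for a time range you have to sample it yourself (or let a metrics reporter such as the Prometheus one do the sampling and store the history). A small Python sketch, with the host and metric name as placeholders:

import requests

FLINK_REST = "http://hostname:8081"   # assumed JobManager address
JOB_ID = "<jobid>"

# Returns only the value at the time of the request; sample this
# periodically and store it if you need values over a time period.
resp = requests.get(f"{FLINK_REST}/jobs/{JOB_ID}/metrics",
                    params={"get": "numRestarts"})  # placeholder metric name
for metric in resp.json():  # [{"id": "...", "value": "..."}, ...]
    print(metric["id"], metric["value"])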
Maybe you can help me with my problem.
I start a Spark job on Google Dataproc through the API. The job writes its results to Google Cloud Storage.
When it finishes, I want my application to get a callback.
Do you know any way to get one? I don't want to poll the job status through the API over and over.
Thanks in advance!
I'll agree that it would be nice if there were a way to either wait for or get a callback when operations such as VM creation, cluster creation, job completion, etc. finish. Out of curiosity, are you using one of the API clients (like google-cloud-java), or are you using the REST API directly?
In the meantime, there are a couple of workarounds that come to mind:
1) Google Cloud Storage (GCS) callbacks
GCS can trigger callbacks (either Cloud Functions or Pub/Sub notifications) when you create files. You can create a file at the end of your Spark job, which will then trigger a notification. Or just add a trigger for when you put an output file on GCS; see the sketch below.
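Not an official sample, just a minimal sketch of such a callback as a Python background Cloud Function, deployed with a google.storage.object.finalize trigger on the output bucket; the function name and the notification step are placeholders:

def on_job_output(event, context):
    # Runs whenever an object is finalized (created) in the trigger bucket.
    # 'event' carries the metadata of the file the Spark job just wrote.
    print(f"Job output arrived: gs://{event['bucket']}/{event['name']}")
    # Notify your application here, e.g. via an HTTP call or a Pub/Sub publish.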
If you're modifying the job anyway, you could also just have the Spark job call back directly to your application when it's done.
2) Use the gcloud command line tool (probably not the best choice for web servers)
gcloud already waits for jobs to complete. You can either use gcloud dataproc jobs submit spark ... to submit and wait for a new job to finish, or gcloud dataproc jobs wait <jobid> to wait for an in-progress job to finish.
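For example (the cluster, region, jar, and job id are placeholders, and exact flags may vary by gcloud version):

# Submit and block until the job finishes:
gcloud dataproc jobs submit spark \
    --cluster=my-cluster --region=us-central1 \
    --jar=gs://my-bucket/my-job.jar

# Or wait for an already-running job:
gcloud dataproc jobs wait my-job-id --region=us-central1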
That being said, if you're purely looking for a callback for choosing whether to run another job, consider using Apache Airflow + Cloud Composer.
In general, the more you tell us about what you're trying to accomplish, the better we can help you :)
Just a thought: my first job will be triggered by a Cloud Function on any file-arrival event, and I will capture its job ID in the Cloud Function itself. Once I have the job ID, I will pass it to App Engine code, and that code will monitor the job with that particular ID for completion. Once the App Engine code sees the 'success' status of that job, it will trigger another job that depends on the successful completion of the prior one.
Any idea whether this can be made to work? If yes, any help with code samples would be highly appreciated.
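Not an official sample, but a minimal polling sketch along those lines in Python, using the Dataproc REST API with google-auth for credentials; the project, region, and the submit_next_job() helper are placeholders/assumptions:

import time
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT, REGION = "my-project", "us-central1"   # placeholders

credentials, _ = google.auth.default()
session = AuthorizedSession(credentials)

def wait_for_dataproc_job(job_id, poll_seconds=30):
    # Poll the Dataproc REST API until the job reaches a terminal state.
    url = (f"https://dataproc.googleapis.com/v1/projects/{PROJECT}"
           f"/regions/{REGION}/jobs/{job_id}")
    while True:
        job = session.get(url).json()
        state = job["status"]["state"]
        if state in ("DONE", "ERROR", "CANCELLED"):
            return state
        time.sleep(poll_seconds)

if wait_for_dataproc_job("<job-id>") == "DONE":
    submit_next_job()  # hypothetical helper that kicks off the dependent job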