How to check if Flink task is backpressured programmatically - apache-flink

Using Flink 1.11. I have a requirement to identify if a Flink task is facing back pressure. Using the web UI, we can monitor the backpressure status. Is there any way to check within the Flink application if a particular task is facing backpressure?

You should be able to use the Flink Job Manager REST API to get back-pressure information: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/rest_api/#jobs-jobid-vertices-vertexid-backpressure
/jobs/:jobid/vertices/:vertexid/backpressure
Returns back-pressure information for a job, and may initiate back-pressure sampling if necessary.
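If you need the check inside your own code rather than the web UI, you can poll that endpoint over HTTP and inspect the returned JSON. A minimal sketch is below; the JobManager address, job id and vertex id are placeholders, and the "backpressure-level" field name is taken from the 1.11-era REST docs, so verify it against your version. Note that the first call may only trigger sampling, so you typically poll until the statistics are available.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BackpressureCheck {

    // Fetches the back-pressure JSON for one job vertex from the JobManager REST API.
    public static String fetchBackpressure(String jobManagerUrl, String jobId, String vertexId) throws Exception {
        URL url = new URL(jobManagerUrl + "/jobs/" + jobId + "/vertices/" + vertexId + "/backpressure");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholders: substitute your JobManager address, job id and vertex id.
        String json = fetchBackpressure("http://localhost:8081", "<jobId>", "<vertexId>");
        System.out.println(json);
        // Crude check; a proper implementation would parse the JSON and also
        // handle the case where sampling has only just been triggered.
        if (json.contains("\"backpressure-level\":\"high\"")) {
            System.out.println("Vertex is back-pressured");
        }
    }
}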

Related

Apache Flink sinking its results into OpenSearch

I am running Apache Flink v1.14 on the server, which does some pre-processing on the data that it reads from Kafka. I need it to write the results to OpenSearch, after which I can fetch the results from OpenSearch.
However, when going through the list of Flink v1.14 connectors, I don't see OpenSearch. Is there any other way I can implement it?
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/overview/
In the above link, I see only Elasticsearch, not OpenSearch.
I think the OpenSearch sink was added in Flink 1.16, so you may consider updating your cluster. Otherwise, you may need to port the changes back to 1.14 (which shouldn't be hard at all) and publish them as a custom library.
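If you do upgrade, usage of the separate flink-connector-opensearch artifact would look roughly like the sketch below. The builder and emitter names follow the 1.16 connector as I remember it, so treat them as an assumption to verify against the connector docs; the host, index name and String element type are made up for illustration.

import java.util.Map;
import org.apache.flink.connector.opensearch.sink.OpensearchSink;
import org.apache.flink.connector.opensearch.sink.OpensearchSinkBuilder;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.http.HttpHost;
import org.opensearch.action.index.IndexRequest;

public class OpensearchSinkExample {

    // Builds a sink that indexes each String record as a one-field JSON document.
    static OpensearchSink<String> buildSink() {
        return new OpensearchSinkBuilder<String>()
                .setHosts(new HttpHost("localhost", 9200, "http")) // OpenSearch endpoint (placeholder)
                .setBulkFlushMaxActions(100)                       // flush after 100 buffered requests
                .setEmitter((element, context, indexer) ->
                        indexer.add(new IndexRequest("my-index")   // target index (placeholder)
                                .source(Map.of("data", element))))
                .build();
    }

    static void attach(DataStream<String> stream) {
        stream.sinkTo(buildSink());
    }
}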

how to deploy a new job without downtime

I have an Apache Flink application that reads from a single Kafka topic.
I would like to update the application from time to time without experiencing downtime. For now the Flink application executes some simple operators such as map, plus some synchronous IO to external systems via HTTP REST APIs.
I have tried to use the stop command, but I get "Job termination (STOP) failed: This job is not stoppable." I understand that the Kafka connector does not support the stop behavior (a link).
A simple solution would be to cancel with savepoint and to redeploy the new jar with the savepoint, but then we get downtime.
Another solution would be to control the deployment from the outside, for example, by switching to a new topic.
What would be a good practice?
If you don't need exactly-once output (i.e., you can tolerate some duplicates), you can take a savepoint without cancelling the running job. Once the savepoint is completed, you start a second job from it. The second job could write to a different topic, but it doesn't have to. When the second job is up, you can cancel the first job.
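With the CLI, that sequence looks roughly like the following (job ids, the savepoint directory and the jar name are placeholders; note that flink savepoint on its own does not cancel the running job):

# 1. Take a savepoint while the old job keeps running
./bin/flink savepoint <old-job-id> hdfs:///savepoints

# 2. Start the new job from that savepoint (same topic, or a new one)
./bin/flink run -d -s <savepoint-path> new-application.jar

# 3. Once the new job is healthy, cancel the old one
./bin/flink cancel <old-job-id>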

How to monitor Flink Backpressure in Grafana with Prometheus metrics

The Flink Web UI has a brilliant backpressure section, but I cannot see any metrics exposed by the Prometheus reporter that could be used to detect backpressure in the same way on a Grafana dashboard.
Is there some way to get the same information outside of the Flink Web UI, using the metrics described here: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html? Or even a Prometheus scraper for scraping the web API?
The back pressure monitoring that appears in the Flink dashboard isn't using the metrics system, so those values aren't available via a MetricsReporter. But you can access this info via the REST API at
/jobs/:jobid/vertices/:vertexid/backpressure
While this back pressure detection mechanism is useful, it does have its limitations. It works by calling Thread.getStackTrace(), which is expensive, and some operators (such as AsyncFunction) do critical activities in threads that aren't being sampled.
Another way to investigate back pressure is to set this configuration option in flink-conf.yaml
taskmanager.network.detailed-metrics: true
and then you can look at the metrics measuring inbound/outbound network queue lengths.
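With the Prometheus reporter those metrics typically show up under names like the ones below (the reporter flattens the metric scope and prefixes it with flink_, so the exact names depend on your scope configuration; check your TaskManager's /metrics endpoint to confirm):

# Buffer pool usage; values close to 1.0 suggest the task is back-pressured
# or is causing back pressure upstream
flink_taskmanager_job_task_buffers_inPoolUsage
flink_taskmanager_job_task_buffers_outPoolUsage

# Queue lengths, populated when taskmanager.network.detailed-metrics is true
flink_taskmanager_job_task_buffers_inputQueueLength
flink_taskmanager_job_task_buffers_outputQueueLength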

How to schedule a job in apache-flink

I want to write a task that is triggered by Apache Flink every 24 hours and then processed by Flink. What is a possible way to do this? Does Flink provide any job scheduling functionality?
Apache Flink is not a job scheduler but an event processing engine, which is a different paradigm: Flink jobs are supposed to run continuously instead of being triggered by a schedule.
That said, you could achieve the functionality by using an off-the-shelf scheduler (e.g. cron) that starts a job on your Flink cluster and then stops it after you receive some sort of notification that the job is done (e.g. through a Kafka topic), or simply by using a timeout after which you assume that the job is finished and you can stop it. But again, especially because Flink is not designed for this kind of use case, you would most certainly run into edge cases which Flink does not support.
Alternatively, you can simply use a 24-hour tumbling window and run your task in the corresponding trigger function. See https://flink.apache.org/news/2015/12/04/Introducing-windows.html for details on that matter.
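A minimal sketch of that second approach, assuming a dummy in-memory source standing in for your real input and a simple sum standing in for the daily task:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DailyWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("key", 1L), Tuple2.of("key", 2L)) // stand-in for a Kafka source
           .keyBy(t -> t.f0)                                         // group records per key
           .window(TumblingProcessingTimeWindows.of(Time.days(1)))   // collect 24 hours of data
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))           // the "daily task" runs when the window fires
           .print();

        env.execute("daily-window-sketch");
    }
}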

How to handle backpressure in flink streaming job?

I am running a streaming Flink job which consumes streaming data from Kafka, does some processing over the data in a Flink map function, and writes the data to Azure Data Lake and Elasticsearch. For the map function I used a parallelism of one because I need to process the incoming data one by one over a list of data I maintain as a global variable. Now when I run the job, as Flink starts to consume the streaming data from Kafka, the backpressure becomes high in the map function. Are there any settings or configurations I could apply to avoid backpressure in Flink?
Backpressure on a given operator indicates that the next operator is consuming elements slowly. From your description it would seem that one of the sinks is performing poorly. Consider scaling up the sink, commenting out a sink for troubleshooting purposes, and/or investigating whether you're hitting an Azure rate limit.
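A small sketch of scaling only the sink while keeping the map at parallelism one; print() stands in for the Elasticsearch or Data Lake sink, and the parallelism values are arbitrary:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SinkParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")            // stand-in for the Kafka source
           .map(String::toUpperCase)               // the single-threaded processing step
           .setParallelism(1)
           .print()                                // stand-in for the slow sink
           .setParallelism(4);                     // give the sink more parallel instances

        env.execute("sink-parallelism-sketch");
    }
}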
