How to monitor Flink backpressure in Grafana with Prometheus metrics

The Flink Web UI has a brilliant backpressure section, but I cannot see any metrics exposed by the Prometheus reporter that could be used to detect backpressure in the same way for a Grafana dashboard.
Is there some way to get the same information outside of the Flink Web UI, using the metrics described at https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html? Or even a Prometheus scraper for scraping the web API?

The back pressure monitoring that appears in the Flink dashboard isn't using the metrics system, so those values aren't available via a MetricsReporter. But you can access this info via the REST API at
/jobs/:jobid/vertices/:vertexid/backpressure
While this back pressure detection mechanism is useful, it does have its limitations. It works by calling Thread.getStackTrace(), which is expensive, and some operators (such as AsyncFunction) perform critical work in threads that aren't being sampled.
Another way to investigate back pressure is to set this configuration option in flink-conf.yaml:
taskmanager.network.detailed-metrics: true
and then you can look at the metrics measuring inbound/outbound network queue lengths.
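For the Grafana side of the original question, a minimal sketch of exposing Flink's metrics to Prometheus in flink-conf.yaml (the port is an arbitrary choice here, and newer Flink versions configure reporters via a factory-style key instead):

metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249

Once Prometheus is scraping that endpoint, queue-length and buffer-usage gauges such as buffers.inputQueueLength, buffers.outputQueueLength, buffers.inPoolUsage, and buffers.outPoolUsage should appear under each task's scope and can be charted in Grafana as a backpressure proxy.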

Related

How to check if Flink task is backpressured programmatically

Using Flink 1.11. I have a requirement to identify if a Flink task is facing backpressure. Using the web UI, we can monitor the backpressure status. Is there any way to check in a Flink application if a particular task is facing backpressure?
You should be able to use the Flink Job Manager REST API to get back-pressure information: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/rest_api/#jobs-jobid-vertices-vertexid-backpressure
/jobs/:jobid/vertices/:vertexid/backpressure
Returns back-pressure information for a job, and may initiate back-pressure sampling if necessary.
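A minimal sketch of checking this from application code, assuming Java 11+ and a JobManager reachable at localhost:8081; the job and vertex IDs are placeholders you would first look up via GET /jobs and GET /jobs/<jobId>:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackpressureCheck {
    public static void main(String[] args) throws Exception {
        String jobId = "your-job-id";       // placeholder: list jobs via GET /jobs
        String vertexId = "your-vertex-id"; // placeholder: list vertices via GET /jobs/<jobId>
        String url = "http://localhost:8081/jobs/" + jobId
                + "/vertices/" + vertexId + "/backpressure";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The JSON body reports a backpressure level (ok / low / high) plus
        // per-subtask ratios; parse it with your JSON library of choice.
        System.out.println(response.body());
    }
}

Since the first request may trigger a fresh sampling run, you may need to poll until the statistics are available.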

Latency Monitoring in Flink 1.14

I am following this Flink tutorial for reactive scaling and am interested in knowing how overall end-to-end latencies are affected by such rapid changes in the number of worker nodes. As per the documentation, I have added metrics.latency.interval: 1000 to the config map with the understanding that a new latency metric will be added, with markers being sent every second. However, I cannot seem to find the corresponding histogram in Prometheus among the available latency-related metrics.
I am using Flink 1.14. Is there something I am missing?
I suspect that something happened to the latency metric between releases 1.13.2 and 1.14. As of now, I am not able to see the latency metrics from Flink after migrating to 1.14, despite setting the latency interval to a positive number. Have you tried 1.13.2?
Further exploration led me to believe it is the migration to the KafkaSource / KafkaSink classes, as opposed to the deprecated FlinkKafkaConsumer and FlinkKafkaProducer, that actually made the latency metric disappear. Currently, I am seeing the latency measures on Flink 1.14, but only when using the deprecated Kafka sources / sinks.
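For anyone who needs the same workaround, a minimal sketch using the deprecated consumer together with latency tracking; the topic name, bootstrap servers, and group id are placeholders, and FlinkKafkaConsumer comes from the flink-connector-kafka dependency:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class LatencyWorkaround {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setLatencyTrackingInterval(1000); // emit latency markers every second

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "latency-test");            // placeholder

        // Deprecated source, kept here only because the latency metric reportedly
        // disappears when switching to the new KafkaSource on 1.14.
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props));

        stream.print();
        env.execute("latency-workaround");
    }
}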

Measuring event-time latency with Flink CEP

I have implemented a pattern with Flink CEP that matches three events, such as A->B->C. After I have defined my pattern, I generate a
PatternStream<Event> patternStream = CEP.pattern(eventStream, pattern);
with a PatternSelectFunction such that
patternStream.select(new MyPatternSelectFunction()).print();
This works like a charm, but I am interested in the event time of all matched events. I know that the traditional Flink streaming API offers rich functions which allow you to register Flink's internal latency tracker, as described in this question. I have also seen that a new RichPatternSelectFunction has been added in Flink 1.8, but unfortunately I cannot set up Flink 1.8 with Flink CEP.
Finally, is there a way to get the event-time of all matched events?
You don't need Rich Functions to use Flink's latency tracking. You just need to enable it by setting latencyTrackingInterval to a positive number in either the Flink configuration or ExecutionConfig, e.g.,
env.getConfig().setLatencyTrackingInterval(1000);
and then you can observe the results in your metrics solution, or via the REST API (latency metrics are not reported in the Flink web UI).
Documentation
Update:
The latency statistics are job metrics, and are in the list returned by
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics
Latency metric values can be fetched from
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics?get=<metric_name>
These metrics have names like
latency.source_id.<ID>.operator_id.<ID>.operator_subtask_index.<SUBTASK>.<metric>
where the IDs identify the source and operator nodes in the job graph between which the latency is being measured.
For example, I can determine the 95th percentile latency between the source and one of the sinks in a job I am running right now with this request:
http://localhost:8081/jobs/94b189a96b98b3aafaba6db6aa8b770b/metrics?get=latency.source_id.bc764cd8ddf7a0cff126f51c16239658.operator_id.fd0ee602f2fa8d310d9bd9f694e185f5.operator_subtask_index.0.latency_p95
Alternatively, you could use a ProcessFunction to add processing-time timestamps to your events before they enter the CEP part of your job, and then use another ProcessFunction afterwards to measure the elapsed time, as sketched below.
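A minimal sketch of that second approach; Event and its enterTimestamp accessors are hypothetical names, and your PatternSelectFunction would need to carry the timestamp through to its output (assumed here to also be an Event):

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event type carrying a wall-clock entry timestamp.
class Event {
    private long enterTimestamp;
    public void setEnterTimestamp(long t) { enterTimestamp = t; }
    public long getEnterTimestamp() { return enterTimestamp; }
}

// Stamps each event with the current processing time before it enters CEP.
class EnterStamp extends ProcessFunction<Event, Event> {
    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) {
        event.setEnterTimestamp(System.currentTimeMillis());
        out.collect(event);
    }
}

// Applied after the pattern select: emits the elapsed wall-clock time.
class MeasureElapsed extends ProcessFunction<Event, Long> {
    @Override
    public void processElement(Event event, Context ctx, Collector<Long> out) {
        out.collect(System.currentTimeMillis() - event.getEnterTimestamp());
    }
}

Wire it up as eventStream.process(new EnterStamp()) before CEP.pattern(...), and apply MeasureElapsed to the selected stream afterwards.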

Flink latency metrics not being shown

While running Flink 1.5.0 with a local environment I was trying to get latency metrics via REST (with something similar to http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/vertices/e70bbd798b564e0a50e10e343f1ac56b/metrics) but there isn't any reference to latency.
All of this while latency tracking is enabled, which I confirmed by checking with the debugger that the LatencyMarksEmitter is emitting the marks.
What can I be doing wrong?
In 1.5 latency metrics aren't exposed for tasks but for jobs instead, the reasoning being that latency metrics inherently contain information about multiple tasks. You have to query "http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/metrics" instead.

Filesystem metrics from the Heapster API are not available

I set up the Heapster + InfluxDB + Grafana combination for my Minikube Kubernetes cluster. The Heapster metrics API documentation mentions filesystem metrics along with the CPU, memory, and network related APIs. I can get the CPU and memory related metrics using the Heapster API, but I am not able to access the filesystem metrics. Any help, guys?
