Flink latency metrics not being shown - apache-flink

While running Flink 1.5.0 with a local environment, I was trying to get latency metrics via REST (with something similar to http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/vertices/e70bbd798b564e0a50e10e343f1ac56b/metrics), but there is no reference to latency in the response.
All of this while latency tracking is enabled, which I confirmed by checking with the debugger that the LatencyMarksEmitter is emitting the marks.
What could I be doing wrong?

In 1.5, latency metrics are exposed not for tasks but for jobs, the reasoning being that latency metrics inherently contain information about multiple tasks. You have to query http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/metrics instead.
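For example, assuming the job id from the question, you can first list the job-level metrics and then fetch specific ones via the get query parameter (the latency metric id below is a placeholder; use the actual ids returned by the first call):

http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/metrics
http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/metrics?get=<latency-metric-id>

The first request returns the available metric ids; the second returns values for a comma-separated list of ids.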

Related

The metric flink.task.isBackPressured will not be reported because only number types are supported by this reporter

While the Flink cluster is running, the following INFO log message is output from time to time:
org.apache.flink.metrics.datadog.DatadogHttpReporter [] - The metric flink.task.isBackPressured will not be reported because only number types are supported by this reporter.
Does anyone know what the problem with this metric is?
I am using Datadog as the metric reporter, and I need to know which Flink tasks were backpressured.
isBackPressured is a boolean, and it has the limitation you've run into when used with the Datadog reporter.
Fortunately there's a better way to assess backpressure that's available since Flink 1.13: you can look at backPressuredTimeMsPerSecond and related metrics. These metrics are based on a smoothed aggregation measured over a couple of seconds, and thus provide a more meaningful view of the actual behavior than isBackPressured, which is a point-in-time snapshot of whether a task was (perhaps only momentarily) backpressured.
See the docs for more details.
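As a sketch (the job and vertex ids are placeholders), the value can also be read per task via the REST API, which is handy as a cross-check against what the reporter ships to Datadog:

/jobs/<jobid>/vertices/<vertexid>/metrics?get=backPressuredTimeMsPerSecond

backPressuredTimeMsPerSecond, together with busyTimeMsPerSecond and idleTimeMsPerSecond, is numeric (milliseconds per second), so it does not run into the number-types-only restriction that isBackPressured does.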

Latency Monitoring in Flink 1.14

I am following this Flink tutorial for reactive scaling and am interested in knowing how overall end-to-end latencies are affected by such rapid changes in the number of worker nodes. As per the documentation, I have added metrics.latency.interval: 1000 to the config map with the understanding that a new latency metric will be added, with markers being sent every 1 second. However, I cannot seem to find the corresponding histogram in Prometheus. Listed below are the latency-related metrics which are available:
I am using Flink 1.14. Is there something which I am missing?
I suspect that something happened to the latency metric between releases 1.13.2 and 1.14. As of now, I am not able to see the latency metrics from Flink after migrating to 1.14, despite setting the latency interval to a positive number. Have you tried 1.13.2?
Further exploration led me to believe it was the migration to the KafkaSource / KafkaSink classes, as opposed to the deprecated FlinkKafkaConsumer and FlinkKafkaProducer, that actually made the latency metric disappear. Currently, I am seeing the latency measures on Flink 1.14, but only when using the deprecated Kafka sources / sinks.
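For reference, a minimal flink-conf.yaml sketch for enabling latency tracking, in case it helps others reproduce the comparison (the granularity line is optional):

metrics.latency.interval: 1000
metrics.latency.granularity: operator

metrics.latency.granularity accepts single, operator, or subtask; operator is the default, and subtask produces considerably more histograms.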

How to monitor Flink Backpressure in Grafana with Prometheus metrics

The Flink Web UI has a brilliant backpressure section, but I cannot see any metrics exposed by the Prometheus reporter that could be used to detect backpressure in the same way for a Grafana dashboard.
Is there some way to get the same information outside of the Flink Web UI, using the metrics described at https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html? Or even a Prometheus scraper for scraping the web API?
The back pressure monitoring that appears in the Flink dashboard isn't using the metrics system, so those values aren't available via a MetricsReporter. But you can access this info via the REST API at
/jobs/:jobid/vertices/:vertexid/backpressure
While this back pressure detection mechanism is useful, it does have its limitations. It works by calling Thread.getStackTrace(), which is expensive, and some operators (such as AsyncFunction) do critical activities in threads that aren't being sampled.
Another way to investigate back pressure is to set this configuration option in flink-conf.yaml
taskmanager.network.detailed-metrics: true
and then you can look at the metrics measuring inbound/outbound network queue lengths.
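For example (treat the exact metric names as an assumption to verify against your Flink version's documentation, since the network metric scopes have changed between releases), the queue lengths can be read per vertex via the REST API:

/jobs/<jobid>/vertices/<vertexid>/subtasks/metrics?get=buffers.inputQueueLength,buffers.outputQueueLength

The related buffers.inPoolUsage and buffers.outPoolUsage gauges are also useful: a full input pool downstream combined with a full output pool upstream is a typical backpressure signature.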

Apache flink - Limit the amount of metrics exposed

We have a Flink job with roughly 30 operators. When we run this job with a parallelism of 12, Flink outputs 400,000 metrics in total, which is more than our metrics platform can handle well.
Looking at the kinds of metrics, this does not seem to be a bug or anything like that.
It's just that with lots of operators, many taskmanagers, and many task slots, the metrics get duplicated often enough to reach 400,000 (maybe job restarts also duplicate the metrics?).
This is the config I use for our metrics:
metrics.reporters: graphite
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.graphite.host: some-host.com
metrics.reporter.graphite.port: 2003
metrics.reporter.graphite.protocol: TCP
metrics.reporter.graphite.interval: 60 SECONDS
metrics.scope.jm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager
metrics.scope.jm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager.<job_name>
metrics.scope.tm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>
metrics.scope.tm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<task_id>.<subtask_index>
metrics.scope.operator: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<operator_id>.<subtask_index>
As we don't need all 400,000 of them, is it possible to influence which metrics are being exposed?
You are probably experiencing the cardinality explosion of latency metrics present in some versions of Flink, wherein latencies are tracked from each source subtask to each operator subtask. This was addressed in Flink 1.7. See https://issues.apache.org/jira/browse/FLINK-10484 and https://issues.apache.org/jira/browse/FLINK-10243 for details.
For a quick fix, you could try disabling latency tracking by configuring metrics.latency.interval to be 0.
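In the config style shown above, that is a one-line change (0 disables the latency markers entirely):

metrics.latency.interval: 0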

How to get the throughput of KafkaSource in Flink?

I want to know the throughput of the KafkaSource. In other words, I want to measure the speed at which Flink reads data. My idea is to add a map operator after the source and use the built-in metrics in the map operator. Will this increase the overhead? I hope to get this metric without adding a lot of overhead. What should I do? Or is there a way to get the output throughput of this topic in Kafka? Or should I get the KafkaSource's numRecordsOutPerSecond through the REST API?
Take a look at Kafka Manager, which displays a lot of metrics related to Kafka. It's a tool used to manage Kafka and acts as a real-time dashboard; you need to install and configure it separately.
It can be used to check the consumption rate of your Flink consumer.
You can also make use of the built-in metrics on the source operator, without adding a Map operator just for that purpose.
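For example (the job and vertex ids are placeholders, and this assumes the standard numRecordsOutPerSecond metric is populated for your source vertex), the source's record rate can be read straight from the REST API without changing the job graph:

/jobs/<jobid>/vertices/<source-vertexid>/subtasks/metrics?get=numRecordsOutPerSecond

This avoids adding a Map operator just for measurement; the meter is already maintained by the metrics system, so reading it adds no overhead to the job itself.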
