Latency Monitoring in Flink 1.14 - apache-flink

I am following this Flink tutorial for reactive scaling and am interested in knowing how overall end-to-end latencies are affected by such rapid changes in the number of worker nodes. As per the documentation, I have added metrics.latency.interval: 1000 to the config map with the understanding that a new latency metric will be added, with markers being sent every 1 second. However, I cannot seem to find the corresponding histogram in Prometheus. Listed below are the metrics associated with latency that are available:
I am using Flink 1.14. Is there something I am missing?
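For reference, this is the only line I added to the flink-conf.yaml section of the config map (a minimal sketch; the rest of the configuration is unchanged):
metrics.latency.interval: 1000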

I suspect that something happened to the latency metric between releases 1.13.2 and 1.14. As of now, I am not able to see the latency metrics from Flink after migrating to 1.14, despite setting the latency interval to a positive number. Have you tried 1.13.2?

.. further exploration led me to believe that it is the migration to the KafkaSource / KafkaSink classes, as opposed to the deprecated FlinkKafkaConsumer and FlinkKafkaProducer, that actually made the latency metric disappear. Currently I am seeing the latency metrics on Flink 1.14, but only when using the deprecated Kafka source / sink.
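For context, these are the two ways of wiring the Kafka source that I am switching between (a minimal sketch assuming a plain string topic; the broker address, topic name, and group id are placeholders):
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Deprecated API: the latency histograms still show up with this source.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");
props.setProperty("group.id", "my-group");
DataStream<String> legacy = env.addSource(
        new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

// New KafkaSource API: after migrating to this, the latency histograms disappeared for me.
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
DataStream<String> current =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");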

Related

The metric flink.task.isBackPressured will not be reported because only number types are supported by this reporter

When the Flink cluster is running, this INFO log is output from time to time:
org.apache.flink.metrics.datadog.DatadogHttpReporter [] - The metric flink.task.isBackPressured will not be reported because only number types are supported by this reporter.
Does anyone know what the problem with this metric is?
I am using Datadog as the metric reporter and I need to know which Flink tasks were backpressured.
isBackPressured is a boolean, and has the limitation you've run into when used with the Datadog reporter.
Fortunately there's a better way to assess backpressure that's available since Flink 1.13: you can look at backPressuredTimeMsPerSecond and related metrics. These metrics are based on a smoothed aggregation measured over a couple of seconds, and thus provide a more meaningful view of the actual behavior than isBackPressured, which is a point-in-time snapshot of whether a task was (perhaps only momentarily) backpressured.
See the docs for more details.
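If you want to spot-check these values yourself, one option (a sketch, not specific to Datadog; the job and vertex IDs are placeholders, and depending on the Flink version the metric name may need a subtask-index prefix such as 0.backPressuredTimeMsPerSecond) is to query the task metrics via the REST API, using the same endpoint layout as the answers further down:
http://<job_manager_rest_endpoint>/jobs/<job_id>/vertices/<vertex_id>/metrics?get=backPressuredTimeMsPerSecond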

Flink: Reading from Kinesis causes ReadProvisionedThroughputExceeded

I've got a Flink app with a Kinesis source. I'm seeing a lot of ReadProvisionedThroughputExceeded errors from AWS when running the Flink app. I've tried updating the consumer config with different settings to reduce the number of get record calls and increase time between calls but that doesn't seem to help:
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_MAX, "500")
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS, "30000")
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_BASE, "3000")
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_MAX, "10000")
Are there other settings that I should be tuning? Thanks!
Try the following checks:
Check the Kinesis monitoring to see which limit is exceeded for each poll: the number of records or the total bytes across all records.
Competing consumers. Are there any other consumers reading from the same shards? If enhanced fan-out is not enabled, they will all share the same throughput.
Enabling enhanced fan-out (https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kinesis/#using-enhanced-fan-out) is a potential solution; see the sketch below.
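A minimal sketch of what enabling enhanced fan-out looks like on the consumer config, based on the linked docs (the region, consumer name, and stream name are placeholders):
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

Properties consumerConfig = new Properties();
consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1");
// Use enhanced fan-out instead of polling, so this consumer gets its own dedicated read throughput per shard.
consumerConfig.put(ConsumerConfigConstants.RECORD_PUBLISHER_TYPE,
        ConsumerConfigConstants.RecordPublisherType.EFO.name());
consumerConfig.put(ConsumerConfigConstants.EFO_CONSUMER_NAME, "my-flink-efo-consumer");

FlinkKinesisConsumer<String> kinesis =
        new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), consumerConfig);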
Turns out it's a bug in the version I'm using (Flink 1.12): https://issues.apache.org/jira/browse/FLINK-21661

Measuring event-time latency with Flink CEP

I have implemented a pattern with Flink CEP that matches three Events such as A->B->C. After I have defined my pattern I generate a
PatternStream<Event> patternStream = CEP.pattern(eventStream, pattern);
with a PatternSelectFunction such that
patternStream.select(new MyPatternSelectFunction()).print();
This works like a charm but I am interested in the event-time of all matched events. I know that the traditional Flink streaming API offers rich functions which allow you to register Flink's internal latency tracker as described in this question. I have also seen that for Flink 1.8 a new RichPatternSelectFunction has been added. But unfortunately I cannot set up Flink 1.8 with Flink CEP.
Finally, is there a way to get the event-time of all matched events?
You don't need Rich Functions to use Flink's latency tracking. You just need to enable it by setting latencyTrackingInterval to a positive number in either the Flink configuration or ExecutionConfig, e.g.,
env.getConfig().setLatencyTrackingInterval(1000);
and then you can observe the results in your metrics solution, or via the REST API (latency metrics are not reported in the Flink web UI).
Documentation
Update:
The latency statistics are job metrics, and are in the list returned by
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics
Latency metric values can be fetched from
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics?get=<metric_name>
These metrics have names like
latency.source_id.<ID>.operator_id.<ID>.operator_subtask_index.<SUBTASK>.<metric>
where the IDs identify the source and operator nodes in the job graph between which the latency is being measured.
For example, I can determine the 95th percentile latency between the source and one of the sinks in a job I am running right now with this request:
http://localhost:8081/jobs/94b189a96b98b3aafaba6db6aa8b770b/metrics?get=latency.source_id.bc764cd8ddf7a0cff126f51c16239658.operator_id.fd0ee602f2fa8d310d9bd9f694e185f5.operator_subtask_index.0.latency_p95
Alternatively, you could use a ProcessFunction to add processing time timestamps to your events before they enter the CEP part of your job, and then use another ProcessFunction afterwards to measure the elapsed time.
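A rough sketch of that alternative (assuming the Event type from the question, and that you can carry the stamp through the pattern, e.g. by matching on Tuple2<Event, Long> or copying the stamp into the match result; the class names here are made up for illustration):
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Before the CEP part of the job: attach the current processing time to each event.
public class StampWithProcessingTime extends ProcessFunction<Event, Tuple2<Event, Long>> {
    @Override
    public void processElement(Event event, Context ctx, Collector<Tuple2<Event, Long>> out) {
        out.collect(Tuple2.of(event, ctx.timerService().currentProcessingTime()));
    }
}

// After the pattern has been matched: compare the stamp against the current processing time.
public class MeasureElapsedTime extends ProcessFunction<Tuple2<Event, Long>, Event> {
    @Override
    public void processElement(Tuple2<Event, Long> stamped, Context ctx, Collector<Event> out) {
        long elapsedMs = ctx.timerService().currentProcessingTime() - stamped.f1;
        // Feed elapsedMs into a histogram metric or a log line here.
        out.collect(stamped.f0);
    }
}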

Apache flink - Limit the amount of metrics exposed

We have a Flink job with roughly 30 operators. When we run this job with a parallelism of 12, Flink outputs 400,000 metrics in total, which is too many metrics for our metrics platform to handle well.
Looking at the kind of metrics, this does not seem to be a bug or anything like that.
It's just that with lots of operators, many task managers and task slots, the number of metrics gets duplicated often enough to reach 400,000 (maybe job restarts also duplicate the number of metrics?).
This is the config I use for our metrics:
metrics.reporters: graphite
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.graphite.host: some-host.com
metrics.reporter.graphite.port: 2003
metrics.reporter.graphite.protocol: TCP
metrics.reporter.graphite.interval: 60 SECONDS
metrics.scope.jm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager
metrics.scope.jm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager.<job_name>
metrics.scope.tm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>
metrics.scope.tm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<task_id>.<subtask_index>
metrics.scope.operator: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<operator_id>.<subtask_index>
As we don't need all 400,000 of them, is it possible to influence which metrics are being exposed?
You are probably experiencing the cardinality explosion of latency metrics present in some versions of Flink, wherein latencies are tracked from each source subtask to each operator subtask. This was addressed in Flink 1.7. See https://issues.apache.org/jira/browse/FLINK-10484 and https://issues.apache.org/jira/browse/FLINK-10243 for details.
For a quick fix, you could try disabling latency tracking by configuring metrics.latency.interval to be 0.
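A sketch of both ways of doing that (the programmatic variant assumes you set it on the ExecutionConfig, as in the CEP answer above; any non-positive value disables the latency markers):
metrics.latency.interval: 0
or, in the job itself:
env.getConfig().setLatencyTrackingInterval(0);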

Flink latency metrics not being shown

While running Flink 1.5.0 with a local environment I was trying to get latency metrics via REST (with something similar to http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/vertices/e70bbd798b564e0a50e10e343f1ac56b/metrics) but there isn't any reference to latency.
All of this while latency tracking is enabled, which I confirmed by checking with the debugger that the LatencyMarksEmitter is emitting the marks.
What can I be doing wrong?
In 1.5 latency metrics aren't exposed for tasks but for jobs instead, the reasoning being that latency metrics inherently contain information about multiple tasks. You have to query "http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/metrics" instead.
