We have a Flink job with roughly 30 operators. When we run this job with a parallelism of 12, Flink outputs 400,000 metrics in total, which is too many metrics for our metrics platform to handle well.
Looking at the kinds of metrics involved, this does not seem to be a bug or anything like that.
It's just that with lots of operators and many TaskManagers and task slots, each metric is duplicated often enough to reach the 400,000 (maybe job restarts also multiply the number of metrics?).
This is the config I use for our metrics:
metrics.reporters: graphite
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.graphite.host: some-host.com
metrics.reporter.graphite.port: 2003
metrics.reporter.graphite.protocol: TCP
metrics.reporter.graphite.interval: 60 SECONDS
metrics.scope.jm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager
metrics.scope.jm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.jobmanager.<job_name>
metrics.scope.tm: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>
metrics.scope.tm.job: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<task_id>.<subtask_index>
metrics.scope.operator: applications.__ENVIRONMENT__.__APPLICATION__.<host>.taskmanager.<tm_id>.<job_name>.<operator_id>.<subtask_index>
As we don't need all 400,000 of them, is it possible to influence which metrics are exposed?
You are probably experiencing the cardinality explosion of latency metrics present in some versions of Flink, wherein latencies are tracked from each source subtask to each operator subtask. This was addressed in Flink 1.7. See https://issues.apache.org/jira/browse/FLINK-10484 and https://issues.apache.org/jira/browse/FLINK-10243 for details.
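To get a feel for the numbers, here is a rough back-of-the-envelope estimate under the assumption that latency is tracked from every source subtask to every operator subtask: with ~30 operators at parallelism 12 there are about 30 × 12 = 360 operator subtasks, and 12 source subtasks then yield 12 × 360 = 4,320 latency histograms; since each histogram is typically exported as a number of statistics (min, max, mean, several percentiles), latency tracking alone can account for tens of thousands of time series.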
For a quick fix, you could try disabling latency tracking by configuring metrics.latency.interval to be 0.
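For example, in flink-conf.yaml:
metrics.latency.interval: 0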
Related
When the Flink cluster is running, the following INFO log is output from time to time:
org.apache.flink.metrics.datadog.DatadogHttpReporter [] - The metric flink.task.isBackPressured will not be reported because only number types are supported by this reporter.
Does anyone know what the problem with this metric is?
I am using the Datadog metric reporter, and I need to know which Flink tasks were backpressured.
isBackPressured is a boolean, and so it has the limitation you've run into when used with the Datadog reporter.
Fortunately there's a better way to assess backpressure that's available since Flink 1.13: you can look at backPressuredTimeMsPerSecond and related metrics. These metrics are based on a smoothed aggregation measured over a couple of seconds, and thus provide a more meaningful view of the actual behavior than isBackPressured, which is a point-in-time snapshot of whether a task was (perhaps only momentarily) backpressured.
See the docs for more details.
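As a quick way to inspect these values outside of Datadog, you can also query them per task through Flink's REST API (the job and vertex IDs here are placeholders):
GET /jobs/<job-id>/vertices/<vertex-id>/metrics?get=backPressuredTimeMsPerSecond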
I am following this Flink tutorial for reactive scaling and am interested in knowing how overall end-to-end latencies are affected by such rapid changes in the number of worker nodes. As per the documentation, I have added metrics.latency.interval: 1000 to the config map, with the understanding that a new latency metric will be added, with markers being sent every second. However, I cannot seem to find the corresponding histogram in Prometheus. Listed below are the latency-related metrics that are available:
I am using Flink 1.14. Is there something which I am missing?
I suspect that something happened to the latency metric between releases 1.13.2 and 1.14. As of now, I am not able to see the latency metrics from Flink after migrating to 1.14, despite setting the latency interval to a positive number. Have you tried 1.13.2?
... further exploration led me to believe it is the migration to the KafkaSource / KafkaSink classes, as opposed to the deprecated FlinkKafkaConsumer and FlinkKafkaProducer, that actually made the latency metric disappear. Currently, I am seeing the latency measures on Flink 1.14, but only when using the deprecated Kafka sources/sinks.
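For reference, here is a minimal sketch of the new-style source involved in that migration, as it would look on Flink 1.14 (the broker address, topic, consumer group, and the surrounding env variable are placeholders/assumptions):
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")   // placeholder broker address
        .setTopics("input-topic")             // placeholder topic
        .setGroupId("my-group")               // placeholder consumer group
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// Unlike the deprecated FlinkKafkaConsumer, the new source is attached via fromSource();
// env is the StreamExecutionEnvironment.
DataStream<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");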
I am new to Flink, and I deployed my Flink application, which performs simple pattern matching, in a Kubernetes cluster with 1 JobManager and 6 TaskManagers. I am sending messages of size 4.4 KB, at a rate of 200k messages every 10 minutes, to an Event Hub topic, and performing load testing. I added a restart strategy and checkpointing as shown below; I am not explicitly using any state in my code, as there is no requirement for it:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        5, // number of restart attempts
        Time.of(5, TimeUnit.MINUTES) // delay
));
Initially I was facing a Netty server issue with network buffers, so I followed this link https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate on Flink network and heap memory optimizations, applied the settings below, and everything worked fine:
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have a few questions:
If a TaskManager restarts because of a failure, it restarts successfully and registers with the JobManager, but afterwards the restarted TaskManager doesn't process any data; it just sits idle. Is this normal Flink behavior, or do I need to add a setting to make the TaskManager resume processing?
Correct me if my understanding is wrong: Flink has a restart strategy, and in my code I set a limit of 5 restart attempts. What happens if my Flink job does not overcome the task failure within those attempts? Will the entire Flink job remain in an idle state so that I have to restart it manually, or is there a mechanism I can add to restart the job even after it has exceeded the limit of restart attempts?
Is there any documentation on how to calculate the number of cores and the amount of memory I should assign to a Flink job cluster, based on the data size and the rate at which my system receives data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in the JobManager logs before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance; please help me resolve my doubts.
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
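As an illustration of constraining a match so its state can be cleaned up, here is a minimal CEP sketch (the Event type and its getType() accessor are hypothetical):
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// "A followed by B", bounded to a 10-minute window so that partial matches
// can be discarded instead of being kept in state indefinitely.
Pattern<Event, ?> pattern = Pattern.<Event>begin("a")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "A".equals(e.getType());
            }
        })
        .followedBy("b")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "B".equals(e.getType());
            }
        })
        .within(Time.minutes(10));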
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
Some of your entries in flink-conf.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
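A minimal sketch of such automation, assuming the default REST port 8081 and a placeholder host and job ID (uses the Java 11+ HTTP client):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlinkJobWatcher {
    public static void main(String[] args) throws Exception {
        String jobId = args[0]; // e.g. taken from `flink list` or /jobs/overview
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://jobmanager:8081/jobs/" + jobId))
                .GET()
                .build();
        // The response JSON contains a "state" field such as "RUNNING" or "FAILED";
        // parse it and trigger your own resubmission logic when the job has failed.
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(body);
    }
}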
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are:
do everything you can to constrain the matches; avoid patterns that must keep state indefinitely
getEventsForPattern can be expensive
While running Flink 1.5.0 with a local environment I was trying to get latency metrics via REST (with something similar to http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/vertices/e70bbd798b564e0a50e10e343f1ac56b/metrics) but there isn't any reference to latency.
All of this while latency tracking is enabled, which I confirmed by checking with the debugger that the LatencyMarksEmitter is emitting the marks.
What can I be doing wrong?
In 1.5, latency metrics aren't exposed for tasks but for jobs instead, the reasoning being that latency metrics inherently contain information about multiple tasks. You have to query http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/metrics instead.
On our real-time streaming data platform running on Flink 1.1.5, I have met an issue.
Problem description:
The business logic of our Flink job is a generic ETL process: the source and sink operators are both based on Kafka, and the transform operators implement the related ETL logic. The records the source operator reads from the Kafka topic vary in size: records are less than 100 KB in off-peak hours but nearly 600 KB in peak hours. In off-peak periods, the performance of the Flink job is fine; in peak periods, however, performance drops sharply and stays low until the peak passes, after which it recovers. I didn't see any full GC happen on the TaskManagers, and the system load of every TaskManager decreased during that period. Besides, there are no exceptions or other errors in the logs. Via JMX I can see that all threads are blocked requesting buffers at that time.
Related configuration and metrics:
Flink standalone cluster on 1.1.5 with 1 JobManager and 5 TaskManagers (each with 18 slots, which is fewer than its CPU cores).
JDK version is 1.8, with G1 as the GC.
Source and sink are both Kafka.
Job consistency type: exactly-once.
Because the source Kafka topic has fewer partitions than there are transform operator instances, the parallelism of the operators is not all the same, and the tasks cannot be chained into one task chain as an optimization.
Based on what I described above, we have tried the following approaches:
modify the memory segment size (32 KB and 64 KB)
increase the network buffer pool capacity
modify the Kafka consumer API parameters
scale up the Flink cluster by adding more TaskManager nodes and more processing threads
decrease the number of transform operator threads and keep the parallelism of the source, transform, and sink operators the same, so that the tasks can be chained into one task chain
We tried approaches #1-#4 above, but they did not work; only #5 was effective. However, #5 is not a good option for us because performance still declines.
Have you met this issue before? Is there a better approach to avoid it?