High Flink network buffer usage, which causes Kafka lag

Our Flink job contains a filter, a keyBy on session id, and then a session window with a 30-minute gap. The session window needs to accumulate all the events for the session and process them using a ProcessWindowFunction.
We are using Flink 1.9, with 128 containers and 20 GB of memory in total to run our job, and the cut-off ratio is 0.3.
We are doing incremental checkpoints.
When the session windows start to trigger the process function, network buffer usage gets quite high, and then we start seeing Kafka input lag.
Our setting:
state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/service
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
#https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
state.backend.rocksdb.memory.write-buffer-ratio: 0.6
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
state.backend.rocksdb.block.blocksize: 16mb
state.backend.rocksdb.writebuffer.count: 8
state.backend.rocksdb.writebuffer.size: 256mb
state.backend.rocksdb.timer-service.factory: heap
containerized.heap-cutoff-ratio: 0.25
taskmanager.network.memory.fraction: 0.85
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 7168mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 4mb
taskmanager.network.memory.floating-buffers-per-gate: 16
taskmanager.network.netty.transport: poll
Some of the graphs were attached (not reproduced here).
Any suggestions would be appreciated!

If I had access to the details, here's what I would look at to try to improve performance for this application:
(1) Could the windows be re-implemented to do incremental aggregation? Currently the windows are building up what may be rather long lists of events, and they only work through those lists when the sessions end. This apparently takes long enough to cause backpressure on Kafka. If you can pre-aggregate the session results, this will even out the processing, and the problem should go away.
And no, I'm not contradicting what I said elsewhere. If I haven't been clear, let me know.
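As a minimal sketch of the incremental-aggregation pattern (the Tuple2 event shape and the count logic are placeholders, not the asker's actual job), the window can combine an AggregateFunction so that only a small accumulator lives in state:
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class IncrementalSessionAggregation {

    // Incrementally counts events per session instead of buffering them all in the window.
    public static class CountPerSession
            implements AggregateFunction<Tuple2<String, Long>, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(Tuple2<String, Long> value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    // events: (sessionId, payload) pairs -- a stand-in for the real event type.
    public static DataStream<Long> sessionCounts(DataStream<Tuple2<String, Long>> events) {
        return events
                .keyBy(e -> e.f0)                                          // key by session id
                .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
                .aggregate(new CountPerSession());                         // only one Long per window in state
    }
}
If window metadata is still needed, the same aggregate() call also accepts a lightweight ProcessWindowFunction as a second argument, which then receives only the pre-aggregated result rather than the full list of events.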
(2) You've put a lot of extra network buffering in place. This is usually counter-productive; you want the backpressure to ripple back quickly and throttle the source, rather than pushing more data into Flink's network buffers.
You would do better to reduce the network buffering, and if possible, use your available resources to provide more slots instead. Having more slots will reduce the impact when one slot is busy working through the contents of a long session that just ended. Giving more memory to RocksDB might help too.
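As a rough illustration (not part of the original answer), dialing these settings back toward the Flink 1.9 defaults would look something like the lines below; the right values depend on the job:
taskmanager.network.memory.fraction: 0.1
taskmanager.memory.segment-size: 32kb
taskmanager.network.memory.buffers-per-channel: 2
taskmanager.network.memory.floating-buffers-per-gate: 8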
(3) See if you can optimize serialization. There can be a 10x difference in throughput between the best and worst serializers. See Flink Serialization Tuning. If there are any fields in the records that you don't actually need, drop them.
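One hedged way to surface slow serialization paths (an assumption on my part, not something the answer prescribes) is to make any fallback to Kryo fail fast:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Raises an exception whenever a data type would be handled by the generic (slow) Kryo
// serializer, so those types can be reworked into POJOs, Tuples, or given custom serializers.
env.getConfig().disableGenericTypes();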
(4) Look at tuning RocksDB. Make sure you are using the fastest available local disks for RocksDB, such as local SSDs. Avoid using network attached storage (such as EBS) for state.backend.rocksdb.localdir.
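For example (the path is a placeholder), pointing RocksDB's working directory at a local SSD mount:
state.backend.rocksdb.localdir: /mnt/local-ssd/flink/rocksdb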

I do not know the internals of Flink, but the reason could be related to the session window.
What I mean is: if you have many session operations with the same gap (30 minutes), those session operations may all be performed at around the same time, which can create a delay.

Related

Flink task managers are not processing data after restart

I am new to Flink and have deployed an application that basically performs simple pattern matching. It is deployed in a Kubernetes cluster with 1 JobManager and 6 TaskManagers. For load testing, I am sending messages of size 4.4 KB, 200k messages every 10 minutes, to an Event Hub topic. I added a restart strategy and checkpointing as shown below, and I am not explicitly using any state in my code as there is no requirement for it.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        5,                            // number of restart attempts
        Time.of(5, TimeUnit.MINUTES)  // delay
));
Initially I was facing a Netty server issue with network buffers, so I followed this link https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate on Flink network and heap memory optimizations, applied the settings below, and everything is working fine.
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have a few questions:
If any TaskManager restarts because of a failure, it restarts successfully and registers with the JobManager, but afterwards the restarted TaskManager doesn't process any data; it just sits idle. Is this normal Flink behavior, or do I need to add a setting to make the TaskManager start processing again?
Correct me if my understanding is wrong: Flink has a restart strategy, and in my code I set a limit of 5 restart attempts. What happens if my Flink job does not successfully overcome the task failure? Will the entire Flink job remain in an idle state so that I have to restart it manually, or is there a mechanism I can add to restart the job even after it has crossed the limit of restart attempts?
Is there any documentation on how to calculate the number of cores and the amount of memory I should assign to a Flink job cluster, based on the data size and the rate at which my system receives data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in the JobManager logs before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance; please help me resolve my doubts.
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
Some of your entries in config.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are
do everything you can to constrain the matches; avoid patterns that must keep state indefinitely
getEventsForPattern can be expensive
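As an illustrative sketch of the first point (keyedInput is an assumed keyed DataStream<String>), a within() clause is the usual way to keep partial matches from sitting in state indefinitely:
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

Pattern<String, String> pattern = Pattern.<String>begin("first")
        .where(new SimpleCondition<String>() {
            @Override
            public boolean filter(String value) { return value.startsWith("A"); }
        })
        .followedBy("second")
        .where(new SimpleCondition<String>() {
            @Override
            public boolean filter(String value) { return value.startsWith("B"); }
        })
        // Without within(), partial matches for "first" can be kept in state forever.
        .within(Time.minutes(10));

PatternStream<String> matches = CEP.pattern(keyedInput, pattern);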

Checkpointing issues in Flink 1.10.1 using RocksDB state backend

We are experiencing a very difficult-to-observe problem with our Flink job.
The Job is reasonably simple, it:
Reads messages from Kinesis using the Flink Kinesis connector
Keys the messages and distributes them to ~30 different CEP operators, plus a couple of custom WindowFunctions
The messages emitted from the CEP operators/windows are forwarded to a SinkFunction that writes messages to SQS
We are running Flink 1.10.1 on Fargate, using 2 containers with 4 vCPUs / 8 GB; we are using the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true
state.backend.rocksdb.files.open: 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 seconds. Over time, the checkpoint sizes increase, but the times are still a very reasonable couple of seconds:
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason:
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
Checkpoint size (when it does complete) is around 60MB
CPU usage is high, but not 100% (usually around 60-80%)
Looking at in-progress checkpoints, usually 95%+ of operators complete the checkpoint within 30 seconds, but a handful will just get stuck and never complete. The SQS sink is always among these, but the SinkFunction is not rich and has no state.
Using the backpressure monitor on these operators reports a HIGH backpressure
Eventually this situation resolves in one of two ways:
Enough checkpoints fail to trigger the job to fail due to a failed checkpoint proportion threshold
The checkpoints eventually start succeeding, but never go back down to the 5-10s they take initially (when the state size is more like 30MB vs. 60MB)
We are really at a loss as to how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low; we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
Thanks,
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing, and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it to clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
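As a sketch with illustrative values (the descriptor name and the 24-hour TTL are assumptions, not taken from the question), state TTL that the RocksDB compaction filter can actually clean up looks roughly like this:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Expire per-key state 24 hours after the last write, and let RocksDB's
// compaction filter remove expired entries in the background.
StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(24))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupInRocksdbCompactFilter(1000)  // re-query the current time after this many processed entries
        .build();

ValueStateDescriptor<Long> lastSeen = new ValueStateDescriptor<>("lastSeen", Long.class);
lastSeen.enableTimeToLive(ttlConfig);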
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
blocking i/o is being done in a user function
a large number of timers are firing simultaneously
event time skew between different sources is causing large amounts of state to be buffered
data skew (a hot key) is overwhelming one subtask or slot
lengthy GC pauses
contention for critical resources (e.g., using a NAS as the local disk for RocksDB)
Enabling RocksDB native metrics might provide some insight.
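As a hedged example (exact option names depend on your Flink version), a few of the native metrics can be switched on in flink-conf.yaml:
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.actual-delayed-write-rate: true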
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
If you do not add this, it defaults to 1.
Link: https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBOptions.java#L62

Apache Flink Resource Planning best practices

I'm looking for recommendations/best practices in determining required optimal resources for deploying a streaming job on Flink Cluster.
Resources are
No. of task slots per TaskManager
Optimal Memory allocation for TaskManager
Max Parallelism
This blog post gives some ideas on sizing. It's meant for moving a Flink application under development to production.
I'm not aware of a resource that helps with sizing before that, as the topology of the job has a tremendous impact. So you'd usually start with a PoC and low data volume and then extrapolate your findings.
Memory settings are described in the Flink docs. I'd also use the page appropriate for your Flink version, as it changed recently.
Number of task slots per Task Manager
One slot per TM is a rough rule of thumb as a starting point, but you probably want to keep the number of TMs under 100 or so. This is because the Checkpoint Coordinator will eventually struggle if it has to manage too many distinct TMs. Running with lots of slots per TM works better with RocksDB than with the heap-based state backends, because with RocksDB the state is off-heap -- with state on the heap, running with lots of slots increases the likelihood of significant GC pauses.
Max Parallelism
The default is 128. Changing this parameter is painful, as it is baked into each checkpoint and savepoint. But making it larger than necessary comes with some cost (in memory/performance). Make it large enough that you will never have to change it, but no larger.
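For instance (the values are illustrative), max parallelism is set once on the environment and then left alone:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Upper bound for future rescaling; it is baked into checkpoints and savepoints,
// so pick it generously once and do not change it afterwards.
env.setMaxParallelism(1024);
// The actual parallelism can still be changed freely, up to that bound.
env.setParallelism(8);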

Flink checkpoints causes backpressure

I have a Flink job processing data at around 200k qps. Without checkpoints, the job is running fine.
But when I tried to add checkpoints (with a 50-minute interval), it caused backpressure at the first task, which adds a key field to each entry, and the data lag goes up constantly as well.
(Graph: the lag of my two Kafka topics. In the first half, checkpoints were enabled and the lag goes up very quickly; in the second part, with very low lag, checkpoints were disabled and the lag stays within milliseconds.)
I am using at-least-once checkpoint mode, which should be an asynchronous process. Could anyone suggest what to look into?
My checkpointing settings:
env.enableCheckpointing(1800000, CheckpointingMode.AT_LEAST_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.getCheckpointConfig().setCheckpointTimeout(600000); // 10 minutes
env.getCheckpointConfig().setFailOnCheckpointingErrors(
        jobConfiguration.getCheckpointConfig().getFailOnCheckpointingErrors());
My job has 128 containers.
With a 10-minute checkpoint, the stats are as follows:
I am trying a 30-minute checkpoint to see how that behaves.
I was trying to tune memory usage, but it doesn't seem to help.
But in the task manager, it's still:
TL;DR: it's sometimes hard to analyse the problem. I have two lucky guesses/shots: if you are using the RocksDB state backend, you could switch to the FsStateBackend - it's usually faster, and RocksDB makes the most sense with large state sizes that do not fit into memory (or if you really need the incremental checkpointing feature). The second is to fiddle with the parallelism, either increasing or decreasing it.
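A minimal sketch of that switch, assuming a placeholder checkpoint URI:
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Keeps working state on the TaskManager heap; checkpoints go to the configured path.
env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"));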
I would suspect the same thing that @ArvidHeise wrote. Your checkpoint size is not huge, but it's also not trivial. It can add enough extra overhead to push the job over the threshold from barely keeping up with the traffic to not keeping up and causing backpressure. If you are under backpressure, the latency will just keep accumulating, so even a couple of percent of extra overhead can make the difference between end-to-end latencies of milliseconds and an unbounded, ever-growing value.
If you cannot simply add more resources, you will have to analyse what exactly is adding this extra overhead and which resource is the bottleneck.
Is it CPU? Check CPU usage on the cluster. If it's ~100%, that's the thing you need to optimise for.
Is it IO? Check IO usage on the cluster and compare it against the maximal throughput/number of requests per second that you can achieve.
If both CPU and IO usage are low, you might want to try to increase parallelism, but...
Keep in mind data skew. Backpressure could be caused by a single task, and in that case it is hard to analyse the problem, as it will be a single bottlenecked thread (on either IO or CPU), not the whole machine.
After figuring out what resource is the bottleneck, next question would be why? It might be immediately obvious once you see it, or it might require digging in, like checking GC logs, attaching profiler etc.
Answering those questions could either give you information what you could try to optimise in your job or allow you to tweak configuration or could give us (Flink developers) an extra data point what we could try to optimise on the Flink side.
Any kind of checkpointing adds computation overhead. Most of the checkpointing is done asynchronously (as you have stated), but it still adds general I/O operations. These additional I/O requests may, for example, congest your access to external systems. Also, if you enable checkpointing, Flink needs to keep track of more information (new vs. already checkpointed).
Have you tried to add more resources to your job? Could you share your whole checkpointing configuration?

How to make Flink job with huge state finish

We are running a Flink cluster to process terabytes of historic streaming data. The calculation has a huge state, for which we use keyed state - Value and Map states with the RocksDB backend. At some point in the calculation the job performance starts degrading, and input and output rates drop to almost 0. At this point exceptions like "Communication with TaskManager X timeout error" can be seen in the logs, however the job is already compromised before that.
I presume the problem we are facing has to do with RocksDB's disk backend. As the state of the job grows, it needs to access the disk more often, which drags the performance down to 0. We have played with some of the options and have set some which make sense for our particular setup:
We are using the SPINNING_DISK_OPTIMIZED_HIGH_MEM predefined profile, further tuned with optimizeFiltersForHits and some other options, which has somewhat improved performance. However, none of this provides a stable computation, and on a job re-run against a bigger data set the job halts again.
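(For illustration, applying such a predefined profile typically looks like the sketch below; the checkpoint URI is a placeholder, and the custom options factory used for optimizeFiltersForHits is omitted.)
import java.io.IOException;

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;

public class RocksDbProfile {
    static StateBackend spinningDiskBackend() throws IOException {
        // Placeholder checkpoint URI.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://my-bucket/flink/checkpoints");
        // Predefined tuning profile for spinning disks with plenty of memory.
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        return backend;
    }
}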
What we are looking for is a way to modify the job so that it progresses at SOME speed even as the input and the state grow. We are running on AWS with limits set to around 15 GB for the Task Manager and no limit on disk space.
Using SPINNING_DISK_OPTIMIZED_HIGH_MEM will cost a lot of off-heap memory for RocksDB's memtables. Seeing as you are running the job with a memory limit of around 15 GB, I think you will encounter OOM issues; but if you choose the default predefined profile, you will face write stalls or CPU overhead from decompressing RocksDB's page cache. So I think you should increase the memory limit.
Here are some posts about RocksDB, FYI:
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
https://www.ververica.com/blog/manage-rocksdb-memory-size-apache-flink
