Flink checkpoints interval and state size - apache-flink

We are running a few Flink jobs, all of which have a Kafka source and multiple Cassandra sinks. We rely heavily on time windows with a reduce function, on keyed data.
Our TPS is currently around 100-200.
I have a few questions about checkpoints and the size of the state being saved:
Since we're using a reduce function, is the state size only influenced by the number of open windows? If an hourly window and a minute window have the same accumulator, should we expect a similar state size? For some reason we're seeing that the hourly window has a much larger state than the minute window, and the daily window has a larger state than the hourly window.
What is considered a reasonable number of open windows? What is considered a large state? What are the most common checkpoint intervals (ours is 5 seconds, which seems far too often to me)? How long should a checkpoint of 1 GB of state take on reasonable storage? How can TBs of state (which I read some systems have) be checkpointed in a reasonable amount of time? I know these are abstract questions, but we're not sure whether our Flink setup is working as expected and what to expect as our data grows.
We're seeing both async and sync checkpoint times in the UI. Can anyone explain why Flink is using both?
Thanks to anyone who can help with any of these questions.

There are a lot of factors that can influence checkpointing performance, including which version of Flink you are running, which state backend you are using and how it is configured, and which kind of time window is involved (e.g. sliding vs. tumbling windows). Incremental checkpoints can have a huge impact when TBs of state are involved.
One factor that can have a large impact is the number of distinct keys involved for different time intervals. You've indicated these are keyed windows, and I would expect that over the course of an hour, many more distinct keys are used than during a typical minute. Windows are created lazily, when the first event is assigned to them, so there will be many more keyed windows created for an hour-long window than for a one-minute-long window. The same effect will arise for day-long keyed windows, but to a lesser extent.
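For illustration, here is a minimal sketch of the kind of keyed reduce window under discussion (the Event type, key selector, and source/sink variables are placeholders, not anything from the original question). With a ReduceFunction, the state kept per key per open window is a single accumulator, so total window state grows with the number of distinct (key, window) pairs:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Placeholder source; in this setup it would be a Kafka consumer.
    DataStream<Event> events = env.addSource(kafkaSource);

    events
        .keyBy(event -> event.getUserId())                    // one window instance per distinct key
        .window(TumblingEventTimeWindows.of(Time.hours(1)))   // created lazily, on the first event for that key
        .reduce((a, b) -> a.merge(b))                         // state per (key, window) is one accumulator
        .addSink(cassandraSink);                              // placeholder sink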
Each of your job's operators go through a (hopefully brief) synchronous phase during checkpoint handling regardless of whether the bulk of the checkpointing is done synchronously or asynchronously. With the heap-based state backends, both synchronous and asynchronous snapshots are supported -- you'll want asynchronous snapshots for optimal performance.
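For example, with a heap-based backend, asynchronous snapshots can be requested when constructing the backend. A minimal sketch, assuming the older explicit-constructor API (the checkpoint URI and the 60-second interval are placeholders, and newer Flink versions expose the same settings through configuration instead):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Checkpoint every 60 seconds rather than every 5.
    env.enableCheckpointing(60_000);

    // Heap-based state backend with asynchronous snapshots (second argument),
    // so only a brief synchronous phase blocks processing during each checkpoint.
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints", true));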

Related

High Flink network buffer usage, which causes Kafka lagging

Our Flink jobs contain a filter, a keyBy on session id, and then a session window with a 30-minute gap. The session window needs to accumulate all the events for the session and process them using a ProcessWindowFunction.
We are using Flink 1.9, 128 containers with 20G memory in total to run our job and the cut-off ratio is 0.3.
We are doing incremental checkpoints.
When the session windows start to trigger the process function, the network buffer usage gets pretty high, and then the Kafka input starts lagging.
Our settings:
state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/service
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
#https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
state.backend.rocksdb.memory.write-buffer-ratio: 0.6
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
state.backend.rocksdb.block.blocksize: 16mb
state.backend.rocksdb.writebuffer.count: 8
state.backend.rocksdb.writebuffer.size: 256mb
state.backend.rocksdb.timer-service.factory: heap
containerized.heap-cutoff-ratio: 0.25
taskmanager.network.memory.fraction: 0.85
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 7168mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 4mb
taskmanager.network.memory.floating-buffers-per-gate: 16
taskmanager.network.netty.transport: poll
Some of the graphs:
Any suggestions will be appreciated!
If I had access to the details, here's what I would look at to try to improve performance for this application:
(1) Could the windows be re-implemented to do incremental aggregation? Currently the windows are building up what may be rather long lists of events, and they only work through those lists when the sessions end. This apparently takes long enough to cause backpressure on Kafka. If you can pre-aggregate the session results, this will even out the processing, and the problem should go away (see the sketch after this list).
And no, I'm not contradicting what I said here. If I haven't been clear, let me know.
(2) You've put a lot of extra network buffering in place. This is usually counter-productive; you want the backpressure to ripple back quickly and throttle the source, rather than pushing more data into Flink's network buffers.
You would do better to reduce the network buffering, and if possible, use your available resources to provide more slots instead. Having more slots will reduce the impact when one slot is busy working through the contents of a long session that just ended. Giving more memory to RocksDB might help too.
(3) See if you can optimize serialization. There can be a 10x difference in throughput between the best and worst serializers. See Flink Serialization Tuning. If there are any fields in the records that you don't actually need, drop them.
(4) Look at tuning RocksDB. Make sure you are using the fastest available local disks for RocksDB, such as local SSDs. Avoid using network attached storage (such as EBS) for state.backend.rocksdb.localdir.
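A minimal sketch of the incremental aggregation suggested in point (1), assuming the per-session result can be folded into a small accumulator as events arrive (the Event and SessionSummary types and their methods are placeholders):

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    events
        .keyBy(e -> e.getSessionId())
        .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
        // aggregate() updates a small accumulator per event instead of buffering the
        // whole session and processing it in one burst when the session ends.
        .aggregate(new AggregateFunction<Event, SessionSummary, SessionSummary>() {
            @Override
            public SessionSummary createAccumulator() { return new SessionSummary(); }

            @Override
            public SessionSummary add(Event e, SessionSummary acc) { return acc.update(e); }

            @Override
            public SessionSummary getResult(SessionSummary acc) { return acc; }

            @Override
            public SessionSummary merge(SessionSummary a, SessionSummary b) { return a.merge(b); }
        });

If window metadata (such as the session's start and end timestamps) is still needed, the two-argument aggregate(aggregateFunction, processWindowFunction) variant passes only the pre-aggregated result to the ProcessWindowFunction.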
I do not know the internals of Flink, but the reason could be related to the session windows.
What I mean is: if you have many session operations with the same gap (30 minutes), all of those session operations will be performed at the same time, which can create a delay.

Checkpointing issues in Flink 1.10.1 using RocksDB state backend

We are experiencing a very difficult-to-observe problem with our Flink job.
The Job is reasonably simple, it:
Reads messages from Kinesis using the Flink Kinesis connector
Keys the messages and distributes them to ~30 different CEP operators, plus a couple of custom WindowFunctions
The messages emitted from the CEP/window operators are forwarded to a SinkFunction that writes messages to SQS
We are running Flink 1.10.1 on Fargate, using 2 containers with 4 vCPUs/8GB; we are using the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true
state.backend.rocksdb.files.open: 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 seconds. Over time, the checkpoint sizes increase but the times are still a very reasonable couple of seconds:
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason:
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
Checkpoint size (when it does complete) is around 60MB
CPU usage is high, but not 100% (usually around 60-80%)
Looking at in-progress checkpoints, usually 95%+ of operators complete the checkpoint within 30 seconds, but a handful will just stick and never complete. The SQS sink will always be included in this, but the SinkFunction is not rich and has no state.
Using the backpressure monitor on these operators reports HIGH backpressure
Eventually this situation resolves in one of two ways:
Enough checkpoints fail that the job fails, due to the failed-checkpoint threshold
The checkpoints eventually start succeeding, but never go back down to the 5-10s they took initially (when the state size was more like 30MB vs. 60MB)
We are really at a loss as to how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low; we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
Thanks,
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing, and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it to clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
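For reference, state TTL cleanup is configured on each state descriptor you create. A minimal sketch, assuming you want RocksDB's compaction-filter cleanup (the 24-hour TTL, the query count, and the descriptor name are placeholders):

    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;

    StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(24))                          // placeholder TTL
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupInRocksdbCompactFilter(1000)                 // clean up expired entries during RocksDB compaction
        .build();

    ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("lastSeen", Long.class);
    descriptor.enableTimeToLive(ttlConfig);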
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
blocking i/o is being done in a user function
a large number of timers are firing simultaneously
event time skew between different sources is causing large amounts of state to be buffered
data skew (a hot key) is overwhelming one subtask or slot
lengthy GC pauses
contention for critical resources (e.g., using a NAS as the local disk for RocksDB)
Enabling RocksDB native metrics might provide some insight.
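The native metrics are opt-in, one property per metric, in the Flink configuration. A few examples of the kind of properties you could enable (the selection here is only illustrative; the names follow the state.backend.rocksdb.metrics.* pattern, so check the docs for your Flink version for the complete list):

    state.backend.rocksdb.metrics.estimate-num-keys: true
    state.backend.rocksdb.metrics.estimate-live-data-size: true
    state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
    state.backend.rocksdb.metrics.num-running-compactions: true
    state.backend.rocksdb.metrics.estimate-pending-compaction-bytes: true
    state.backend.rocksdb.metrics.actual-delayed-write-rate: true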
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
If you do not add this, it will be 1 (the default).
Link: https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBOptions.java#L62

How to make Flink job with huge state finish

We are running a Flink cluster to process terabytes of historic streaming data. The calculation has a huge state, for which we use keyed state (Value and Map states) with the RocksDB backend. At some point in the calculation the job's performance starts degrading, and the input and output rates drop to almost 0. At this point exceptions like "Communication with TaskManager X timeout error" can be seen in the logs, however the job is compromised even before that.
I presume the problem we are facing has to do with RocksDB's disk backend. As the state of the job grows, it needs to access the disk more often, which drags the performance down to 0. We have played with some of the options and have set those that make sense for our particular setup:
We are using the SPINNING_DISK_OPTIMIZED_HIGH_MEM predefined profile, further tuned with optimizeFiltersForHits and some other options, which has somewhat improved performance. However, none of this provides a stable computation, and on a job re-run against a bigger data set the job halts again.
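For context, wiring those options up programmatically looks roughly like this with the Flink 1.x RocksDB backend API (a sketch only; the checkpoint URI and the env variable are placeholders, and just the options named above are shown):

    import org.apache.flink.contrib.streaming.state.OptionsFactory;
    import org.apache.flink.contrib.streaming.state.PredefinedOptions;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.rocksdb.ColumnFamilyOptions;
    import org.rocksdb.DBOptions;

    // Note: this constructor can throw IOException.
    RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true);
    backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
    backend.setOptions(new OptionsFactory() {
        @Override
        public DBOptions createDBOptions(DBOptions currentOptions) {
            return currentOptions;
        }

        @Override
        public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
            // Skip bloom filters for the bottommost level to save memory on read-heavy workloads.
            return currentOptions.setOptimizeFiltersForHits(true);
        }
    });
    env.setStateBackend(backend);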
What we are looking for is a way to modify the job so that it progresses at SOME speed even as the input and the state grow. We are running on AWS with limits set to around 15 GB for the Task Manager and no limit on disk space.
Using SPINNING_DISK_OPTIMIZED_HIGH_MEM will consume a lot of off-heap memory for RocksDB's memtables. Seeing as you are running the job with a memory limit of around 15 GB, I think you will encounter OOM issues. If you choose the default predefined profile instead, you will face write stalls or CPU overhead from decompressing RocksDB's page cache, so I think you should increase the memory limit.
And here are some posts about RocksDB, FYI:
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
https://www.ververica.com/blog/manage-rocksdb-memory-size-apache-flink

Flink, basic rule for checkpointing?

I have 2 questions regarding Flink checkpointing strategy,
I know that checkpointing is related to state (right?), so if I'm not using state (ValueState and the like) explicitly in my job code, do I need to care about checkpointing? Is it still necessary?
If I do need to enable checkpointing, what should the interval be? Are there any basic rules for setting the interval? Suppose we're talking about a quite busy system (Kafka + Flink), with several billion messages per day.
Many thanks.
Even if you are not using state explicitly in your application, Flink's Kafka source and sink connectors use state on your behalf in order to provide you with either at-least-once or exactly-once guarantees -- assuming you care about those guarantees. Some other operators, such as windows and other streaming aggregations, also use state somewhat transparently on your behalf.
If your Flink job fails, then it will be rewound back to the most recent successful checkpoint, and resume processing from there. So, for example, if your checkpoint interval is 10 minutes, then after recovery your job might have 10+ minutes of data to catch up on before it can resume processing live data. So choose a checkpoint interval that you can live with from this perspective.
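As a concrete illustration of that trade-off (the one-minute interval and 30-second minimum pause are placeholder values, not recommendations for any particular job):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Checkpoint once a minute: after a failure the job replays at most roughly one
    // minute of Kafka data, at the cost of checkpointing overhead while running.
    env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

    // Leave some breathing room between the end of one checkpoint and the start of the next.
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);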

Manual checkpoint from Flink stream

Is it possible to trigger a checkpoint from a Flink streaming job?
My use case is this: I have two streams, R and S, to join with tumbling time windows. The source is Kafka. I use event-time processing and a BoundedOutOfOrdernessGenerator to make sure events from the two streams end up in the same window.
The problem is that my state is large and a regular periodic checkpoint sometimes takes too much time. At first I wanted to disable checkpointing and rely on the Kafka offsets. But the out-of-orderness means I already have some data in future windows beyond the current offset, so I need checkpointing.
If it were possible to trigger checkpoints after a window gets cleaned, instead of periodically, it would be more efficient. Maybe in the evictAfter method.
Does that make sense, and is it possible? If not, I'd appreciate a workaround.
Seems the issue here is checkpoint efficiency. Consider using the RocksDB state backend with incremental checkpoints, discussed in the docs under Debugging and Tuning Checkpoints and Large State.
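A minimal sketch of that suggestion on the Flink 1.x API (the checkpoint URI is a placeholder):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

    // The second constructor argument enables incremental checkpoints, so each checkpoint
    // uploads only the RocksDB files created since the previous one, rather than the full state.
    // Note: this constructor can throw IOException.
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));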
