Checkpoint Shared is Very Large - apache-flink

I am running my Flink app with a parallelism of 16. After 20 minutes the shared checkpoint size has grown to 235MB. How can I handle this? It will become very large over a long run.
Every task manager is an OpenShift pod
Task managers: 4
Tasks per task manager: 4
CPU per task manager: 4 Core
Memory per task manager: 6GB
Using the RocksDB state backend
Incremental checkpointing enabled
The image below is for one task manager (pod)

Flink will use only as much space for state as is required to do what you've asked it to do. If you are unhappy with the result, you need to somehow ask it to do less.
Here are some things you might do:
Make sure your application isn't leaking state. This can happen, for example, if you are using keyed state with an unbounded key space, and aren't clearing the state.
Establish a state retention interval (for the Table/SQL API).
Use State TTL to free unneeded state.
There are certain anti-patterns that require a lot of buffering in state. You should avoid those. :)
You could restrict the resources available for storing state, but this will result in the job failing when those resources are exhausted.
Also, 235MB across 16 slots isn't very large for RocksDB. With incremental checkpointing, RocksDB is storing multiple (uncompacted) copies of the state. The actual active state you're using could be much less.
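To make the state-leak point concrete, here is a small, self-contained Python sketch. It is a toy model, not Flink code: the `simulate_state` helper is hypothetical, it tracks only the last update time per key, and it models TTL as "drop any key that has not been updated within `ttl` time units". The event stream (one brand-new key per tick) and the TTL of 60 are invented for illustration.

```python
def simulate_state(events, ttl=None):
    """Return how many keys are still held in state after processing all events.

    events: iterable of (timestamp, key) pairs in timestamp order.
    ttl:    if given, keys not updated for more than `ttl` time units are dropped.
    """
    last_seen = {}
    for ts, key in events:
        last_seen[key] = ts
        if ttl is not None:
            # Toy cleanup: scan everything and drop entries older than the TTL.
            expired = [k for k, seen in last_seen.items() if ts - seen > ttl]
            for k in expired:
                del last_seen[k]
    return len(last_seen)


# Unbounded key space: every event carries a key that is never seen again.
events = [(t, f"user-{t}") for t in range(1000)]

print(simulate_state(events))          # 1000: state grows with every new key
print(simulate_state(events, ttl=60))  # 61: only recently updated keys remain
```

Without any cleanup the retained state grows linearly with the number of distinct keys, while with a TTL it levels off. In a real job the corresponding mechanisms are the ones listed above: clearing state explicitly, State TTL, or a state retention interval for the Table/SQL API.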


Access to Subtask Metrics in a KeyedStream

I want to load a large amount of data into Flink's state backend (RocksDB) and process events using this data (for example in a CoProcessFunction). The data is much larger than the memory of a TaskManager.
The data blocks for a key can be very large, which has a negative impact on latency if the data needs to be loaded from the state backend each time. Therefore I would like to keep the data for frequent keys locally in the CoProcessFunction.
However, the total data in the state backend is larger than the memory of a TaskManager so it is not possible to keep the corresponding data block from the state backend locally for each key.
To solve this problem I would need to know the current memory usage of a subtask to decide whether a data block for a key can be kept locally or whether something needs to be deleted. So here is my question: since keys are not clearly assigned to a subtask, is there a way to access memory-related subtask information or custom subtask metrics in a KeyedStream? Or is there another way to solve this problem? (External access via Async I/O is not an option.)
The RocksDB block cache is already doing roughly what you describe. Rather than implementing your own caching layer, you should be able to get good results by tuning RocksDB, including giving it plenty of memory to work with.
"Using RocksDB State Backend in Apache Flink: When and How" is a good starting point, and includes pointers to where you can learn more about the native RocksDB metrics, memory management, etc. I also recommend reading "The Impact of Disks on RocksDB State Backend in Flink: A Case Study".
As for your question about accessing subtask metrics from within your Flink job -- I don't know of any way to do that locally. You could, I suppose, implement a Flink source connector that fetches them and streams them into the job as another data source.
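If you still want to experiment with a local cache for frequent keys, one way to avoid needing subtask memory metrics at all is to give the cache a fixed byte budget and evict the least recently used entries when the budget is exceeded. The following is only a sketch of that idea in plain Python: the class name `ByteBudgetLruCache` and the budget are hypothetical, and nothing here is Flink or RocksDB API.

```python
from collections import OrderedDict


class ByteBudgetLruCache:
    """LRU cache bounded by the total size of its values, not by entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        if key not in self._entries:
            return None  # caller falls back to reading from the state backend
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, block):
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        if len(block) > self.max_bytes:
            return  # a block larger than the whole budget is never cached
        self._entries[key] = block
        self.used_bytes += len(block)
        while self.used_bytes > self.max_bytes:
            _, evicted = self._entries.popitem(last=False)  # evict the LRU entry
            self.used_bytes -= len(evicted)


cache = ByteBudgetLruCache(max_bytes=10)
cache.put("a", b"aaaa")
cache.put("b", b"bbbb")
cache.get("a")           # "a" is now more recently used than "b"
cache.put("c", b"cccc")  # exceeds the budget, so "b" is evicted
print(cache.get("b"))    # None
print(cache.used_bytes)  # 8
```

Such a cache lives on the heap of each subtask, so its budget has to be counted against the TaskManager's memory per slot. Tuning the RocksDB block cache, as suggested above, is likely the simpler route.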

Checkpointing issues in Flink 1.10.1 using RocksDB state backend

We are experiencing a very difficult-to-observe problem with our Flink job.
The Job is reasonably simple, it:
Reads messages from Kinesis using the Flink Kinesis connector
Keys the messages and distributes them to ~30 different CEP operators, plus a couple of custom WindowFunctions
The messages emitted from the CEP/Windows are forwarded to a SinkFunction that writes messages to SQS
We are running Flink 1.10.1 on Fargate, using 2 containers with 4vCPUs/8GB each. We are using the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 sec. Over time, the checkpoint sizes increase but the times are still very reasonable (a couple of seconds):
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason:
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
Checkpoint size (when it does complete) is around 60MB
CPU usage is high, but not 100% (usually around 60-80%)
Looking at in-progress checkpoints, usually 95%+ of operators complete the checkpoint within 30 seconds, but a handful will just stick and never complete. The SQS sink will always be included in this, but the SinkFunction is not rich and has no state.
Using the backpressure monitor on these operators reports a HIGH backpressure
Eventually this situation resolves one of 2 ways:
Enough checkpoints fail to trigger the job to fail due to a failed checkpoint proportion threshold
The checkpoints eventually start succeeding, but never go back down to the 5-10s they take initially (when the state size is more like 30MB vs. 60MB)
We are really at a loss as to how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low; we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing, and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it to clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
blocking i/o is being done in a user function
a large number of timers are firing simultaneously
event time skew between different sources is causing large amounts of state to be buffered
data skew (a hot key) is overwhelming one subtask or slot
lengthy GC pauses
contention for critical resources (e.g., using a NAS as the local disk for RocksDB)
Enabling RocksDB native metrics might provide some insight.
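To check for the data-skew case from the list above, it can help to take a sample of keys and count how many records would land on each subtask. The Python sketch below is a rough approximation under stated assumptions: `zlib.crc32` is only a stand-in for the hash Flink actually applies to keys, the helper names are hypothetical, and the sample (one hot key plus 100 ordinary keys) is invented. The mapping from key group to subtask, `key_group * parallelism // max_parallelism`, follows my understanding of how Flink assigns key groups; treat it as an assumption if your version differs.

```python
import zlib
from collections import Counter


def subtask_for_key(key, parallelism, max_parallelism=128):
    """Approximate which subtask a key is routed to.

    NOTE: crc32 is a stand-in; Flink uses its own hash of the key's hashCode,
    so the concrete subtask numbers will differ from a real job.
    """
    key_group = zlib.crc32(key.encode("utf-8")) % max_parallelism
    return key_group * parallelism // max_parallelism


def load_per_subtask(keys, parallelism, max_parallelism=128):
    """Count records per subtask for a sample of record keys."""
    counts = Counter(subtask_for_key(k, parallelism, max_parallelism) for k in keys)
    return [counts.get(i, 0) for i in range(parallelism)]


# Invented sample: one hot key produces 90% of the records.
sample = ["hot-user"] * 900 + [f"user-{i}" for i in range(100)]
load = load_per_subtask(sample, parallelism=8)

print(load)
print(f"busiest subtask handles {max(load) / sum(load):.0%} of the records")
```

Because all records for one key go to the same subtask, the hot key's 900 records end up on a single subtask no matter which hash is used. That subtask then becomes the bottleneck and the source of backpressure, while raising the parallelism does not help.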
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
If you do not set it, it defaults to 1.

Apache Flink Resource Planning best practices

I'm looking for recommendations/best practices in determining required optimal resources for deploying a streaming job on Flink Cluster.
Resources are
Number of task slots per TaskManager
Optimal Memory allocation for TaskManager
Max Parallelism
This blog post gives some ideas on how to size. It's meant for moving a Flink application under development to production.
I'm not aware of a resource that helps to size before that, as the topology of the job has a tremendous impact. So you'd usually start with a PoC and low data volume and then extrapolate your findings.
Memory settings are described in the Flink docs. Make sure to use the page that matches your Flink version, as the memory configuration changed recently.
Number of task slots per Task Manager
One slot per TM is a rough rule of thumb as a starting point, but you probably want to keep the number of TMs under 100 or so. This is because the Checkpoint Coordinator will eventually struggle if it has to manage too many distinct TMs. Running with lots of slots per TM works better with RocksDB than with the heap-based state backends, because with RocksDB the state is off-heap -- with state on the heap, running with lots of slots increases the likelihood of significant GC pauses.
Max Parallelism
The default is 128. Changing this parameter is painful, as it is baked into each checkpoint and savepoint. But making it larger than necessary comes with some cost (in memory/performance). Make it large enough that you will never have to change it, but no larger.
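One reason the choice matters: keyed state is split into as many key groups as the max parallelism, and key groups are the unit in which state is distributed over subtasks. The Python sketch below counts key groups per subtask for two parallelism values. The helper name is hypothetical, and the assignment formula `key_group * parallelism // max_parallelism` reflects my understanding of Flink's behaviour rather than something stated in this thread.

```python
from collections import Counter


def key_groups_per_subtask(parallelism, max_parallelism=128):
    """Return how many key groups each subtask owns (requires parallelism <= max_parallelism)."""
    counts = Counter(
        key_group * parallelism // max_parallelism
        for key_group in range(max_parallelism)
    )
    return [counts.get(i, 0) for i in range(parallelism)]


even = key_groups_per_subtask(16)     # 128 key groups over 16 subtasks
uneven = key_groups_per_subtask(100)  # 128 key groups over 100 subtasks

print(set(even))                 # {8}: every subtask owns 8 key groups
print(min(uneven), max(uneven))  # 1 2: some subtasks own twice as many as others
```

With a max parallelism of 128, a parallelism of 16 divides evenly, but at a parallelism of 100 some subtasks hold two key groups and others only one. The parallelism also can never exceed the max parallelism, which is why the value should leave room for future growth.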

Are there limitations in using a State in Apache Flink?

Apache Flink allows me to use a State in a RichMapFunction. I am planning to build a continuously running job which analyses a stream of web events. Part of the processing will be the creation of a session context with session scoped metrics (like nth of the session, duration etc) and additionally a user context.
A session context will timeout after 30 minutes, but a user context may exist for a year to handle returning users.
There will be millions of sessions and users, so I would end up with millions of individual states. Every state is just a few KB in size.
Is this something that can be handled properly with the Flink states?
How is Flink actually cleaning up deprecated states?
Would it make sense to think about providing a custom backend to store the state in a KV cluster?
For large state I would recommend using Flink's RocksDBStateBackend. This state backend uses RocksDB to store state. Since RocksDB gracefully spills to disk, it is only limited by your available disk space. Thus, Flink should be able to handle your use case.
At the moment you need to register timers to clean up state. However, with the next Flink release, the community will add clean up for state with TTL. This will then automatically clean up your state when it is expired.
Keeping your state close to your computation with periodic checkpoints which are persisted will keep your application fast. If every state access went to a remote KV cluster, it would considerably slow down the processing.
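A quick back-of-the-envelope estimate helps to check whether state of this shape fits on the local disks available to RocksDB. The Python sketch below uses invented numbers (5 million users, 1 million concurrently open sessions, 3 KiB per entry), since the question only says "millions" and "a few KB". It also ignores serialization and RocksDB overhead, so treat the result as a lower bound.

```python
KIB = 1024
GIB = 1024 ** 3


def estimated_state_bytes(num_keys, bytes_per_key):
    """Raw payload size of keyed state: one entry of bytes_per_key for each key."""
    return num_keys * bytes_per_key


# Assumed figures, not taken from the question.
user_state = estimated_state_bytes(5_000_000, 3 * KIB)
session_state = estimated_state_bytes(1_000_000, 3 * KIB)

total_gib = (user_state + session_state) / GIB
print(f"user contexts:    {user_state / GIB:.1f} GiB")
print(f"session contexts: {session_state / GIB:.1f} GiB")
print(f"total:            {total_gib:.1f} GiB")  # about 17.2 GiB
```

Under these assumptions the total is under 20 GiB, which is spread across all parallel subtasks and their disks.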

Flink Capacity Planning For Large State in YARN Cluster

We have hit a roadblock moving an app to production scale and were hoping to get some guidance. The application is a pretty common use case in stream processing, but it does require maintaining a large amount of keyed state. We are processing 2 streams: one is a daily burst (normally around 50 mil events, but it could go up to 100 mil in a one-hour burst) and the other is a constant stream of around 70-80 mil per hour. We are doing a low-level join between the two keyed streams using a CoProcessFunction. The CoProcessFunction needs to refresh (upsert) state from the daily burst stream and decorate the constantly streaming data with values from the state built from the bursty stream. All of the logic is working pretty well in a standalone dev environment, where we are throwing about 500k events of bursty traffic at it for state and about 2-3 mil events of the data stream. There we have 1 TM with 16GB memory, 1 JM with 8GB memory, and 16 slots (1 per core on the server). We have been taking savepoints in case we need to restart the app for code changes etc., and the app does seem to recover from state very well. Based on the savepoints, the total volume of state in the production flow should be around 25-30GB.
At this point, however, we are trying to deploy the app at production scale. The app also has a flag that can be set at startup to ignore the data stream, so we can simply initialize state. So basically we are trying to see if we can initialize the state first and take a savepoint as a test. We are using 10 TMs with 4 slots and 8GB memory each (the idea was to allocate around 3 times the estimated state size to start with), but the TMs keep getting killed by YARN with a GC Overhead Limit Exceeded error. We have gone through quite a few blogs/docs on Flink managed memory, off-heap vs. heap memory, disk spill-over, state backends, etc. We did try to tweak the managed-memory configs in multiple ways (off/on heap, fraction, network buffers, etc.) but can’t seem to figure out a good way to tune the app to avoid these issues. Ideally, we would hold state in memory for performance reasons (we do have enough capacity in the production environment for it) and spill over to disk (which I believe Flink should provide out of the box?). It feels like 3x the anticipated state volume in cluster memory should have been enough to just initialize state. So instead of just continuing to increase memory (which may or may not help, as the error is about GC overhead), we wanted to get some input from experts on best practices and how to plan this application better.
Appreciate your input in advance!
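For reference, the figures in this question work out as follows. This is plain arithmetic in Python using only the numbers stated above. The even split of state across slots is an assumption, and the variable names are just for illustration.

```python
def events_per_second(events_per_hour):
    return events_per_hour / 3600


# Throughput implied by the stated hourly volumes.
burst_peak = events_per_second(100_000_000)  # worst-case daily burst
steady_high = events_per_second(80_000_000)  # upper end of the constant stream
print(f"burst peak:  {burst_peak:,.0f} events/sec")
print(f"steady high: {steady_high:,.0f} events/sec")

# Production cluster as described: 10 TMs, 4 slots and 8GB each.
task_managers, slots_per_tm, memory_per_tm_gb = 10, 4, 8
cluster_memory_gb = task_managers * memory_per_tm_gb
total_slots = task_managers * slots_per_tm

# Estimated state volume from the savepoints: 25-30GB.
state_low_gb, state_high_gb = 25, 30
print(f"cluster memory: {cluster_memory_gb} GB over {total_slots} slots")
print(f"memory / state: {cluster_memory_gb / state_high_gb:.2f}x to "
      f"{cluster_memory_gb / state_low_gb:.2f}x")

# Assuming keys are spread evenly, each slot holds this much state at most.
print(f"state per slot: up to {state_high_gb / total_slots:.2f} GB")
```

So the "around 3 times" estimate holds for total cluster memory, but each TM's 8GB also has to cover network buffers, framework overhead, and per-record garbage, and a heap-based state backend stores state as Java objects that usually take more space than their serialized size in a savepoint.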
