Flink app's checkpoint size keeps growing

I have a pipeline like this:
env.addSource(kafkaConsumer, name_source)
.keyBy { value -> value.f0 }
.window(EventTimeSessionWindows.withGap(Time.seconds(2)))
.process(MyProcessor())
.addSink(kafkaProducer)
The keys are guaranteed to be unique in the data that is currently being processed.
Thus I would expect the state size not to grow beyond roughly 2 seconds' worth of data.
However, I notice the state size has been steadily growing over the last day (since the app was deployed).
Is this a bug in Flink?
Using Flink 1.11.2 in AWS Kinesis Data Analytics.

Kinesis Data Analytics always uses RocksDB as its state backend. With RocksDB, dead state isn't immediately cleaned up; it's merely marked with a tombstone and is later compacted away. I'm not sure how KDA configures RocksDB compaction, but typically it happens when a level reaches a certain size -- and I suspect your state size is still small enough that compaction hasn't occurred.
With incremental checkpoints (which is what KDA does), checkpointing is done by copying RocksDB's SST files -- which in your case are presumably full of stale data. If you let this run long enough you should eventually see a significant drop in checkpoint size, once compaction has been done.
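For what it's worth, on a self-managed Flink cluster the compaction behaviour can be tuned via RocksDB options in flink-conf.yaml; KDA may not expose these, and the values below are only illustrative, not recommendations:

state.backend: rocksdb
state.backend.incremental: true
# RocksDB compaction tuning (RocksDBConfigurableOptions); smaller file/level
# targets generally make compaction kick in sooner, at the cost of extra I/O.
state.backend.rocksdb.compaction.style: LEVEL
state.backend.rocksdb.compaction.level.target-file-size-base: 8mb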

Related

What if the size of the state is larger than the Flink memory size?

I am wondering what happens when the size of the state is larger than Flink's memory size.
Since the state is defined through Flink's APIs at the code level, e.g. as MapState<K,V>, it is possible for the state to hold values far larger than the available memory (such as 100 GB or 200 GB).
Can this work?
You might be interested in reading about State Backends
The HashMapStateBackend holds data internally as objects on the Java heap
HashMapStateBackend will OOM your task managers if your MapStates are too big.
The EmbeddedRocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager local data directories
[...] Note that the amount of state that you can keep is only limited by the amount of disk space available. This allows keeping very large state, compared to the HashMapStateBackend that keeps state in memory. This also means, however, that the maximum throughput that can be achieved will be lower with this state backend. All reads/writes from/to this backend have to go through de-/serialization to retrieve/store the state objects, which is also more expensive than always working with the on-heap representation as the heap-based backends are doing.
EmbeddedRocksDBStateBackend will use the disk, so you have more capacity. Note that it is slower, but caches can help alleviate some of that slowness; I suggest you look at how they are configured (Flink uses RocksDB's own mechanism for this).
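A minimal sketch of selecting the RocksDB backend in code, assuming Flink 1.13+ (where HashMapStateBackend and EmbeddedRocksDBStateBackend exist); the checkpoint path is a placeholder:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Keep working state on local disk via RocksDB instead of on the JVM heap
env.setStateBackend(new EmbeddedRocksDBStateBackend());
// Checkpoints still need a durable location; the path here is a placeholder
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
env.enableCheckpointing(60_000); // e.g. checkpoint every 60 seconds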

Heavy back pressure and huge checkpoint size

I have an Apache Flink application that I have deployed on Kinesis Data Analytics.
Payload schema processed by the application (simplified version):
{
  id: String = uuid (each request gets one),
  category: String = uuid (we have 10 of these),
  org_id: String = uuid (we have 1000 of these),
  count: Integer (some integer)
}
This application is doing the following:
Source: Consume from a single Kafka topic (128 partitions)
Filter: Do some filtering for invalid records (nothing fancy here)
Key-by: based on 2 fields in the input, Tuple.of(org_id, category).
Flatmap (de-duplication): Maintains a Guava cache (with size 30k and a 5-minute expiration). A single String id field (the id in the payload) is stored in the cache. Each time a record comes in, we check whether its id is present in the cache. If it is present, the record is skipped; otherwise the id is added to the cache and the record is forwarded (see the sketch after this list).
Rebalance: Just to make sure some sinks don't remain idle while the others are taking all the load.
Sink: Writes to S3 (and this S3 has versioning enabled).
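Regarding the de-duplicating flatmap, here is a minimal sketch of what is described above, assuming a POJO Event with a getId() accessor (the names are illustrative, not the actual classes from the job); the Guava cache lives in a transient field and is rebuilt in open(), so it is never part of operator state:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;
import java.util.concurrent.TimeUnit;

public class DeduplicatingFlatmap extends RichFlatMapFunction<Event, Event> {
    // transient: the cache is a local, per-instance optimization, not Flink state
    private transient Cache<String, Boolean> seenIds;

    @Override
    public void open(Configuration parameters) {
        seenIds = CacheBuilder.newBuilder()
                .maximumSize(30_000)
                .expireAfterWrite(5, TimeUnit.MINUTES)
                .build();
    }

    @Override
    public void flatMap(Event event, Collector<Event> out) {
        if (seenIds.getIfPresent(event.getId()) == null) {
            seenIds.put(event.getId(), Boolean.TRUE);
            out.collect(event); // first occurrence: forward downstream
        }
        // duplicate within the cache window: drop it
    }
}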
This is deployed with:
in KDA terms: parallelism of 64 and parallelism per KPU of 2.
That means we have a cluster of 32 nodes, each with 1 CPU core and 4 GB of RAM.
All of the issues mentioned below happen at 2000 rps.
Now to the issue I am facing:
My lastCheckPointSize seems to be 471MB. This seems very high given that we are not using any state (note: the Guava cache is not stored in Flink state; see the Gist with just the interesting parts).
I see heavy back pressure. Because of this the record_lag_max builds up.
I am unable to understand why my checkpoint size is so huge, since I am not using any state. I was thinking it would just be the Kafka offsets processed by each of these stages, but 471MB seems too big for that.
Is this huge checkpoint responsible for the backpressure I am facing? When I look at the S3 metrics, it looks like 20ms per write, which I assume is not too much.
I am seeing a few rate limits on S3, but from my understanding this seems pretty low compared to the number of writes I am making.
Any idea why I am facing this backpressure and also why my checkpoints are so huge?
Edit (added as an afterthought): Now that I think about it, could the fact that I am not marking the LoaderCache as `transient` in my DeduplicatingFlatmap play any role in the huge checkpoint size?
Edit 2: Adding details related to my sink:
I am using a StreamingFileSink:
StreamingFileSink
.forRowFormat(new Path(s3Bucket), new JsonEncoder<>())
.withBucketAssigner(bucketAssigner)
.withRollingPolicy(DefaultRollingPolicy.builder()
.withRolloverInterval(60000)
.build())
.build()
The JsonEncoder takes the object and converts it to json and writes out bytes like this: https://gist.github.com/vmohanan1/3ba3feeb6f22a5e34f9ac9bce20ca7bf
The BucketAssigner gets the product and org from the schema and appends them with the processing time from context like this: https://gist.github.com/vmohanan1/8d443a419cfeb4cb1a4284ecec48fe73

Memory is not coming down after data processing in Apache Flink

I am using a BroadcastProcessFunction to perform simple pattern matching. I am broadcasting around 60 patterns. Once the processing has completed, the memory does not come down. I am using the garbage collection setting env.java.opts = "-XX:+UseG1GC" in my Flink configuration file to perform GC, but it is not working either. The CPU percentage does come back down after the processing of the data completes. I am checkpointing every 2 minutes and my state backend is filesystem. Below are screenshots of memory and CPU usage.
I don't see anything surprising or problematic in the graphs you have shared. After ingesting the patterns, each instance of your BroadcastProcessFunction will be holding onto a copy of all of the patterns -- so that will consume some memory.
If I understand correctly, it sounds like the situation is that as data is processed for matching against those patterns, the memory continues to increase until the pods crash with out-of-memory errors. Various factors might explain this:
If your patterns involve matching a sequence of events over time, then your pattern matching engine has to keep state for each partial match. If there's no timeout clause to ensure that partial matches are eventually cleaned up, this could lead to a combinatorial explosion.
If you are doing key-partitioned processing and your keyspace is unbounded, you may be holding onto state for stale keys (see the state TTL sketch after this list).
The filesystem state backend has considerable overhead. You may have underestimated how much memory it needs.
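If the stale-keys factor applies, keyed state can be given a time-to-live so that Flink eventually cleans it up. A minimal sketch, with a placeholder state descriptor:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(1))  // expire entries one hour after the last write
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

// "lastSeen" is a placeholder descriptor; TTL takes effect wherever it is used
ValueStateDescriptor<Long> lastSeen = new ValueStateDescriptor<>("lastSeen", Long.class);
lastSeen.enableTimeToLive(ttlConfig);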

Flink checkpoints keep failing

We are trying to set up a Flink stateful job using the RocksDB backend.
We are using a session window with a 30-minute gap. We use an AggregateFunction, so we are not using any Flink state variables ourselves.
With sampling, we have less than 20k events/s and 20-30 new sessions/s. Our sessions basically gather all the events, so the size of the session accumulator grows over time.
We are using 10 GB of memory in total, with Flink 1.9 and 128 containers.
Following are the settings:
state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/myjob/path
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.4
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
containerized.heap-cutoff-ratio: 0.45
taskmanager.network.memory.fraction: 0.5
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 2560mb
From our monitoring at a given point in time:
the RocksDB memtable size is less than 10 MB,
our heap usage is less than 1 GB, but our direct memory usage (network buffers) is 2.5 GB. The buffer pool / buffer usage metrics are all at 1 (full).
Our checkpoints keep failing.
I wonder if it's normal that the network buffer part could use up this much memory?
I'd really appreciate it if you could give some suggestions :)
Thank you!
For what it's worth, session windows do use Flink state internally. (So do most sources and sinks.) Depending on how you are gathering the session events into the session accumulator, this could be a performance problem. If you need to gather all of the events together, why are you doing this with an AggregateFunction, rather than having Flink do this for you?
For the best windowing performance, you want to use a ReduceFunction or an AggregateFunction that incrementally reduces/aggregates the window, keeping only a small bit of state that will ultimately be the result of the window. If, on the other hand, you use only a ProcessWindowFunction without pre-aggregation, then Flink will internally use an appending list state object that when used with RocksDB is very efficient -- it only has to serialize each event to append it to the end of the list. When the window is ultimately triggered, the list is delivered to you as an Iterable that is deserialized in chunks. On the other hand, if you roll your own solution with an AggregateFunction, you may have RocksDB deserializing and reserializing the accumulator on every access/update. This can become very expensive, and may explain why the checkpoints are failing.
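For example, letting Flink gather the session contents for you would look roughly like this; Event, Result, getKey() and summarize() are placeholders rather than your actual types:

stream
    .keyBy(e -> e.getKey())
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .process(new ProcessWindowFunction<Event, Result, String, TimeWindow>() {
        @Override
        public void process(String key, Context ctx, Iterable<Event> events, Collector<Result> out) {
            // With RocksDB, the events arrive as an Iterable backed by list state,
            // deserialized lazily in chunks; no accumulator is re-serialized per event.
            out.collect(summarize(key, events));
        }
    });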
Another interesting fact you've shared is that the buffer pool / buffer usage metrics show that they are fully utilized. This is an indication of significant backpressure, which in turn would explain why the checkpoints are failing. Checkpointing relies on the checkpoint barriers being able to traverse the entire execution graph, checkpointing each operator as they go, and completing a full sweep of the job before timing out. With backpressure, this can fail.
The most common cause of backpressure is under-provisioning -- or in other words, overwhelming the cluster. The network buffer pools become fully utilized because the operators can't keep up. The answer is not to increase buffering, but to remove/fix the bottleneck.

How flink checkpoints help in failure recovery

My Flink job reads from Kafka using FlinkKafkaConsumer010 and sinks into HDFS using a CustomBucketingSink. We have a series of transformations: kafka -> flatmaps (2-3 transformations) -> keyBy -> tumblingWindow (5 mins) -> Aggregation -> hdfsSink. We have Kafka input of 3 million events/min on average and around 20 million events/min at peak time. The checkpointing duration and the minimum pause between two checkpoints are 3 minutes, and I am using the FsStateBackend.
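(For reference, the checkpointing setup described above corresponds roughly to the following, assuming the 3 minutes refers to the checkpoint interval; the HDFS path is a placeholder.)

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(3 * 60 * 1000);                                 // checkpoint every 3 minutes
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3 * 60 * 1000); // 3-minute minimum pause
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));   // FsStateBackend with a placeholder path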
Here are my assumptions :
Flink consumes some fixed number of events from Kafka (multiple offsets from multiple partitions at once) and waits till they reach the sink, and then checkpoints. In case of success, it commits the Kafka partition offsets it read and maintains some state related to the HDFS file it was writing. While multiple transformations are going on after Kafka hands the events over to other operators, the Kafka consumer sits idle until it gets confirmation of success for the events it sent. So we can say that while the sink is writing data to HDFS, all previous operators are sitting idle. In case of failure, Flink goes back to the previous checkpointed state, points to the last committed Kafka partition offsets, and points to the HDFS file offset it should start writing to.
Here are my doubts based on above assumptions:
1) Is the above assumption correct?
2) Does it make sense for the tumbling window to have state, since in case of failure we start from the last committed Kafka partition offset anyway?
3) If the tumbling window does keep state, when will this state be used by Flink?
4) Why do checkpoint and savepoint state sizes vary?
5) In case of any failure, Flink always starts from the source operator. Right?
Your assumptions are not correct.
(1) Checkpointing does not depend in any way on events or results reaching the sink(s).
(2) Flink does its own Kafka offset management. When restoring from a checkpoint, after a failure, the offsets in the checkpoint are used, not those that may have been committed back to Kafka.
(3) No operators are ever idle in the way you've described. The pipeline is not stalled by checkpointing.
The best way to understand how checkpointing works is to go through the Flink operations playground, especially the section on Observing Failure and Recovery. This will give you a much clearer understanding of this topic, because you'll be able to observe exactly what's happening.
I can also recommend reading https://ci.apache.org/projects/flink/flink-docs-master/training/fault_tolerance.html, and following the links contained there.
But to walk through how checkpointing works in your application, here are the basic steps:
(1) When the checkpoint coordinator (part of the job manager) decides it's time to initiate another checkpoint, it informs each of the task managers to start checkpoint n.
(2) All of the source instances checkpoint their own state, and insert checkpoint barrier n into their outgoing streams. In your case, the sources are Kafka consumers, and they checkpoint the current offset for each partition.
(3) Whenever the checkpoint barrier reaches the head of the input queue in a stateful operator, that operator checkpoints its state and forwards the barrier. This part has some complexity to it -- but basically, the state is held in a multi-version, concurrency controlled hash map. The operator creates a new version n+1 of the state that can be modified by the events behind the checkpoint barrier, and creates a new thread to asynchronously snapshot all the state in version n.
In your case, the window and sink are stateful. The window's state includes the current window contents, the state of the trigger, and other state you're using for window processing, if any.
(4) Sinks use the arrival of the barrier to flush any queued output, and commit pending transactions. Again, there's some complexity here, as transactional sinks use a two-phase commit protocol.
In your application, if the checkpoint interval is much smaller than the window duration, then the sink will complete many checkpoints before ever receiving any output from the window.
(5) When the checkpoint coordinator has heard back from every task that the checkpoint is complete, it finalizes the checkpoint metadata.
During recovery, the state of every operator is reset to the state in the most recent checkpoint. This means that the sources are rewound to the offsets in the checkpoint, and processing resumes with the state in the window and sink corresponding to what it should be after having consumed the events up to those offsets.
Note: To keep this reasonably simple, I've glossed over a bunch of details. Also, FLIP-76 will introduce a new approach to checkpointing.
