How Flink checkpoints help in failure recovery

My Flink job reads from Kafka using FlinkKafkaConsumer010 and sinks into HDFS using a CustomBucketingSink. We have a series of transformations: kafka -> flatMap (2-3 transformations) -> keyBy -> tumblingWindow(5 mins) -> aggregation -> hdfsSink. Kafka input is around 3 million events/min on average and around 20 million events/min at peak. The checkpoint interval and the minimum pause between two checkpoints are both 3 minutes, and I am using the FsStateBackend.
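For reference, a rough sketch of the checkpointing setup just described (the HDFS path is a placeholder):

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // checkpoint every 3 minutes, and leave at least 3 minutes between checkpoints
    env.enableCheckpointing(3 * 60 * 1000);
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3 * 60 * 1000);

    // FsStateBackend keeps working state on the heap and writes checkpoints to a filesystem
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));   // placeholder path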
Here are my assumptions :
Flink consumes some fixed number of events from Kafka (multiple offsets from multiple partitions at once), waits till they reach the sink, and then checkpoints. On success, it commits the Kafka partition offsets it read and maintains some state related to the HDFS file it was writing. While the downstream transformations process the events that Kafka handed over to the other operators, the Kafka consumer sits idle until it gets confirmation that the events it sent succeeded. So we can say that while the sink is writing data to HDFS, all previous operators are sitting idle. In case of failure, Flink goes back to the previous checkpoint state, points to the last committed Kafka partition offsets, and points to the HDFS file offset it should start writing to.
Here are my doubts based on the above assumptions:
1) Is the above assumption correct?
2) Does it make sense for the tumbling window to have state, since in case of failure we anyway start from the last committed Kafka partition offset?
3) If the tumbling window does keep state, when can this state be used by Flink?
4) Why do checkpoint and savepoint state sizes vary?
5) In case of any failure, Flink always starts from the source operator. Right?

Your assumptions are not correct.
(1) Checkpointing does not depend in any way on events or results reaching the sink(s).
(2) Flink does its own Kafka offset management. When restoring from a checkpoint, after a failure, the offsets in the checkpoint are used, not those that may have been committed back to Kafka.
(3) No operators are ever idle in the way you've described. The pipeline is not stalled by checkpointing.
The best way to understand how checkpointing works is to go through the Flink operations playground, especially the section on Observing Failure and Recovery. This will give you a much clearer understanding of this topic, because you'll be able to observe exactly what's happening.
I can also recommend reading https://ci.apache.org/projects/flink/flink-docs-master/training/fault_tolerance.html, and following the links contained there.
But to walk through how checkpointing works in your application, here are the basic steps:
(1) When the checkpoint coordinator (part of the job manager) decides it's time to initiate another checkpoint, it informs each of the task managers to start checkpoint n.
(2) All of the source instances checkpoint their own state and insert checkpoint barrier n into their outgoing streams. In your case, the sources are Kafka consumers, and they checkpoint the current offset for each partition.
(3) Whenever the checkpoint barrier reaches the head of the input queue in a stateful operator, that operator checkpoints its state and forwards the barrier. This part has some complexity to it -- but basically, the state is held in a multi-version, concurrency-controlled hash map. The operator creates a new version n+1 of the state that can be modified by the events behind the checkpoint barrier, and creates a new thread to asynchronously snapshot all the state in version n.
In your case, the window and sink are stateful. The window's state includes the current window contents, the state of the trigger, and other state you're using for window processing, if any.
(4) Sinks use the arrival of the barrier to flush any queued output, and commit pending transactions. Again, there's some complexity here, as transactional sinks use a two-phase commit protocol.
In your application, if the checkpoint interval is much smaller than the window duration, then the sink will complete many checkpoints before ever receiving any output from the window.
(5) When the checkpoint coordinator has heard back from every task that the checkpoint is complete, it finalizes the checkpoint metadata.
During recovery, the state of every operator is reset to the state in the most recent checkpoint. This means that the sources are rewound to the offsets in the checkpoint, and processing resumes with the state in the window and sink corresponding to what it should be after having consumed the events up to those offsets.
Note: To keep this reasonably simple, I've glossed over a bunch of details. Also, FLIP-76 will introduce a new approach to checkpointing.
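If it helps to see step (3) from an operator's point of view, here is a minimal, hypothetical sketch of a stateful function participating in checkpointing (the class name and counting logic are made up purely for illustration):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.util.Collector;

    // Hypothetical operator: passes events through and keeps a running count as operator state.
    public class CountingFlatMap extends RichFlatMapFunction<String, String>
            implements CheckpointedFunction {

        private transient ListState<Long> checkpointedCount;
        private long count;

        @Override
        public void flatMap(String value, Collector<String> out) {
            count++;
            out.collect(value);
        }

        // invoked when checkpoint barrier n reaches this operator
        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            checkpointedCount.clear();
            checkpointedCount.add(count);
        }

        // invoked on startup and on recovery; restores the count recorded in the most recent checkpoint
        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            checkpointedCount = context.getOperatorStateStore()
                    .getListState(new ListStateDescriptor<>("count", Long.class));
            for (Long c : checkpointedCount.get()) {
                count += c;
            }
        }
    }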

Related

How to handle the case for watermarks when num of kafka partitions is larger than Flink parallelism

I am trying to figure out a solution to the problem of watermark progress when the number of Kafka partitions is larger than the Flink parallelism employed.
Consider for example that I have a Flink app with a parallelism of 3 that needs to read data from 5 Kafka partitions. My issue is that when starting the Flink app, it has to consume historical data from these partitions. As I understand it, each Flink task starts consuming events from a corresponding partition (probably buffering a significant amount of events) and progresses event time (and therefore watermarks) before the same task transitions to another partition, which will now hold stale data according to the watermarks already issued.
I tried a watermark strategy using watermark alignment of a few seconds, but that does not solve the problem, since historical data are consumed immediately from one partition and therefore the event time/watermark has already progressed. Below is a snippet of code that showcases the watermark strategy implemented.
WatermarkStrategy.forGenerator(ws)
    .withTimestampAssigner(
        (event, timestamp) -> (long) event.get("event_time"))
    .withIdleness(IDLENESS_PERIOD)
    .withWatermarkAlignment(
        GROUP,
        Duration.ofMillis(DEFAULT_MAX_WATERMARK_DRIFT_BETWEEN_PARTITIONS),
        Duration.ofMillis(DEFAULT_UPDATE_FOR_WATERMARK_DRIFT_BETWEEN_PARTITIONS));
I also tried using a downstream operator to sort events, as described here: Sorting union of streams to identify user sessions in Apache Flink, but again this cannot effectively tackle my issue, since event record times can deviate significantly.
How can I tackle this issue? Do I need to have the same number of Flink tasks as the number of Kafka partitions, or am I missing something about the way data are read from Kafka partitions?
The easiest solution to this problem is to supply the WatermarkStrategy to fromSource instead of assigning it downstream with assignTimestampsAndWatermarks.
When you use the WatermarkStrategy directly in fromSource with the Kafka connector, the watermarks will be partition-aware, so the watermark generated by a given source operator will be the minimum over all partitions assigned to that operator.
Assigning watermarks directly in the source will solve the problem you are facing, but it has one main drawback: since the generated watermark is the minimum over all partitions processed by a given operator, if some partition is idle, the watermark for that operator will not progress either.
The docs describe Kafka connector watermarking here.
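For illustration, a minimal sketch of what this could look like with the newer KafkaSource; the broker, topic, out-of-orderness bound, and idleness timeout are all placeholders:

    KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")                    // placeholder
            .setTopics("events")                                   // placeholder
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

    // Because the strategy is supplied to fromSource, watermarks are generated per Kafka
    // partition, and each source subtask emits the minimum across the partitions it owns.
    DataStream<String> stream = env.fromSource(
            source,
            WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withIdleness(Duration.ofMinutes(1)),
            "kafka-source");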

Flink app's checkpoint size keeps growing

I have a pipeline like this:
env.addSource(kafkaConsumer, name_source)
.keyBy { value -> value.f0 }
.window(EventTimeSessionWindows.withGap(Time.seconds(2)))
.process(MyProcessor())
.addSink(kafkaProducer)
The keys are guaranteed to be unique in the data that is currently being processed.
Thus I would expect the state to never hold more than about 2 seconds' worth of data.
However, I notice the state size has been steadily growing over the last day (since the app was deployed).
Is this a bug in Flink?
I'm using Flink 1.11.2 in AWS Kinesis Data Analytics.
Kinesis Data Analytics always uses RocksDB as its state backend. With RocksDB, dead state isn't immediately cleaned up, it's merely marked with a tombstone and is later compacted away. I'm not sure how KDA configures RocksDB compaction, but typically it's done when a level reaches a certain size -- and I suspect your state size is still small enough that compaction hasn't occurred.
With incremental checkpoints (which is what KDA does), checkpointing is done by copying RocksDB's SST files -- which in your case are presumably full of stale data. If you let this run long enough you should eventually see a significant drop in checkpoint size, once compaction has been done.
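For reference, outside of KDA (where this is all managed for you), incremental checkpointing with RocksDB on Flink 1.11 would be enabled roughly like this; the checkpoint URI is a placeholder:

    // the second argument enables incremental checkpoints: only new/changed SST files are uploaded
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));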

How does Flink handle expired keys with CEP

I have a streaming job which listens to events and does operations on them using CEP.
The flow is:
stream = source
    .assignTimestampsAndWatermarks(...)
    .filter(...);

CEP.pattern(stream.keyBy(e -> e.getId()), pattern)
    .process(new PattenMatchProcessFunction())
    .addSink(...);
The keys are all short-lived, and the process function doesn't contain any state, that is, no state that could be removed by setting a TTL. I'm using event time characteristics.
My question: how does Flink handle the expired keys, and does this have any impact on GC?
If Flink removes the keys itself, at what frequency does this happen?
I'm facing GC issues; the job gets stuck about 3 hours after being deployed.
I'm doing memory tuning, but want to eliminate this as a possible cause.
FsStateBackend will hold the state in-memory for your CEP operator.
For CEP, Flink buffers the elements in a MapState[Long, List[T]], which maps a timestamp to all elements that arrived at that time. Once a watermark arrives, Flink processes the buffered events as follows:
// 1) get the queue of pending elements for the key and the corresponding NFA,
// 2) process the pending elements in event time order and custom comparator if exists by feeding them in the NFA
// 3) advance the time to the current watermark, so that expired patterns are discarded.
// 4) update the stored state for the key, by only storing the new NFA and MapState iff they have state to be used later.
// 5) update the last seen watermark.
Once the events have been processed, Flink will advance the watermark, which will cause old entries in the state to be expired (you can see this inside NFA.advanceTime). This means that the eviction of elements in your case depends on how often watermarks are being created and pushed through your stream.
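As an aside, the pruning done in NFA.advanceTime is bounded by the pattern's time constraint, so if you want partial matches for short-lived keys to be discarded promptly, make sure the pattern has a within() window. A hypothetical sketch (the Event type and conditions are made up):

    Pattern<Event, ?> pattern = Pattern.<Event>begin("first")
            .where(new SimpleCondition<Event>() {
                @Override
                public boolean filter(Event event) {
                    return "start".equals(event.getType());   // hypothetical condition
                }
            })
            .followedBy("second")
            .where(new SimpleCondition<Event>() {
                @Override
                public boolean filter(Event event) {
                    return "end".equals(event.getType());     // hypothetical condition
                }
            })
            .within(Time.minutes(5));   // partial matches older than 5 minutes are dropped as time advances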

Flink checkpoints keeps failing

We are trying to set up a Flink stateful job using the RocksDB backend.
We are using session windows with a 30-minute gap. We use an AggregateFunction, so we're not using any Flink state variables directly.
With sampling, we have less than 20k events/s and 20 - 30 new sessions/s. Our sessions basically gather all the events, so the size of the session accumulator grows over time.
We are using 10G of memory in total, with Flink 1.9 and 128 containers.
Here are the settings:
state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/myjob/path
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.4
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
containerized.heap-cutoff-ratio: 0.45
taskmanager.network.memory.fraction: 0.5
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 2560mb
From our monitoring at a given point in time:
the RocksDB memtable size is less than 10m,
and our heap usage is less than 1G, but our direct memory usage (network buffers) is at 2.5G. The buffer pool / buffer usage metrics are all at 1 (full).
Our checkpoints keep failing.
I wonder if it's normal that the network buffers could use up this much memory?
I'd really appreciate any suggestions :)
Thank you!
For what it's worth, session windows do use Flink state internally. (So do most sources and sinks.) Depending on how you are gathering the session events into the session accumulator, this could be a performance problem. If you need to gather all of the events together, why are you doing this with an AggregateFunction, rather than having Flink do this for you?
For the best windowing performance, you want to use a ReduceFunction or an AggregateFunction that incrementally reduces/aggregates the window, keeping only a small bit of state that will ultimately be the result of the window. If, on the other hand, you use only a ProcessWindowFunction without pre-aggregation, then Flink will internally use an appending list state object that, when used with RocksDB, is very efficient -- it only has to serialize each event to append it to the end of the list. When the window is ultimately triggered, the list is delivered to you as an Iterable that is deserialized in chunks. By contrast, if you roll your own solution with an AggregateFunction, you may have RocksDB deserializing and reserializing the accumulator on every access/update. This can become very expensive, and may explain why the checkpoints are failing.
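To make the incremental approach concrete, here's a rough sketch, assuming a hypothetical Event type and a session result that only needs a count rather than the full event list:

    stream
        .keyBy(event -> event.getKey())                               // hypothetical key extractor
        .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
        .aggregate(new AggregateFunction<Event, Long, Long>() {
            // Flink only keeps this accumulator per window, not the full list of events.
            @Override public Long createAccumulator() { return 0L; }
            @Override public Long add(Event event, Long acc) { return acc + 1; }
            @Override public Long getResult(Long acc) { return acc; }
            @Override public Long merge(Long a, Long b) { return a + b; }
        });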
Another interesting fact you've shared is that the buffer pool / buffer usage metrics show that they are fully utilized. This is an indication of significant backpressure, which in turn would explain why the checkpoints are failing. Checkpointing relies on the checkpoint barriers being able to traverse the entire execution graph, checkpointing each operator as they go, and completing a full sweep of the job before timing out. With backpressure, this can fail.
The most common cause of backpressure is under-provisioning -- or in other words, overwhelming the cluster. The network buffer pools become fully utilized because the operators can't keep up. The answer is not to increase buffering, but to remove/fix the bottleneck.

How does TM recovery handle past broadcasted data

In the context of HA of TaskManagers (TMs): when a TM goes down, a new one is restored by the JobManager (JM) from the latest checkpoint of the faulted TM.
Say we have 3 TMs (tm1, tm2, and tm3).
At a given time t, everyone's checkpoint (cp) is at cp1. All TMs broadcast data among themselves.
Now tm2 goes down, and the JM brings up tm2' with the cp1 checkpoint as part of HA. By time t+x, when the new TM is up, the others have in the meantime progressed to cp2.
How is the data broadcast by tm1 and tm3 as part of cp2 replayed on tm2'?
The contents of checkpoints are determined by checkpoint barriers. A given checkpoint includes exactly the effects throughout the entire cluster of everyone having processed all events up to the corresponding barrier, and none of the events after that barrier.
During a restore, the entire cluster is reset to the contents of the most recent checkpoint, and processing then resumes from that consistent starting point.
Broadcast data is checkpointed more or less like everything else, except that each instance stores its own copy of the broadcast data -- with the expectation that these copies are identical. During recovery, the broadcast source is rewound to the point recorded in the checkpoint, and the broadcast state is also recovered from the checkpoint. Any new instance (due to scaling up the cluster) will get a copy of the broadcast state (taken by reading the state intended for one of the other instances).
It may be that at the time of a failure, some machines have completed a new checkpoint, but a checkpoint will not be used for a restore unless every TM has completed that checkpoint, and the Job Manager has finalized it.
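For context, here is a minimal sketch of the broadcast state pattern being described, with hypothetical Rule and Event types; each parallel instance holds its own copy of the rules map, and that copy is what gets checkpointed and restored:

    // descriptor for the broadcast state; every parallel instance keeps its own (identical) copy
    MapStateDescriptor<String, Rule> ruleStateDescriptor =
            new MapStateDescriptor<>("rules", String.class, Rule.class);

    BroadcastStream<Rule> ruleBroadcast = ruleStream.broadcast(ruleStateDescriptor);

    eventStream
        .keyBy(event -> event.getKey())                   // hypothetical key extractor
        .connect(ruleBroadcast)
        .process(new KeyedBroadcastProcessFunction<String, Event, Rule, String>() {
            @Override
            public void processElement(Event event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                // read-only access to the broadcast state on the keyed side
                Rule rule = ctx.getBroadcastState(ruleStateDescriptor).get("current");   // hypothetical key
                if (rule != null && rule.matches(event)) {
                    out.collect(event.toString());
                }
            }

            @Override
            public void processBroadcastElement(Rule rule, Context ctx, Collector<String> out) throws Exception {
                // update the broadcast state; this per-instance copy is what gets checkpointed
                ctx.getBroadcastState(ruleStateDescriptor).put("current", rule);
            }
        });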
