I looked for this in the Flink documentation but couldn't find an answer.
I have a Flink app "Kafka Source->Operations->Window->Kafka Sink".
Checkpoint Configuration is:
10 seconds interval
around 30MB of state size
each checkpoint takes about 1 second or less
EXACTLY_ONCE on the Kafka producer configuration.
When I submit the job with checkpointing enabled and the sink topic has 30+ partitions, the checkpoints succeed. But when I run the same code (with the same resources and parallelism) against a sink topic with, for example, 2 partitions, the checkpoints keep failing after reaching the timeout.
I looked into this but couldn't find a good explanation. Is the number of Kafka sink partitions related to checkpointing? If so, how?
I'm having trouble understanding why my Flink job's offset commits to Kafka are taking so long. I have a checkpoint interval of 1s and the following warning appears. I'm currently using version 1.14.
Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity
For comparison, some Kafka streams we have running show a commit latency of around 100 ms.
Can you point me in the right direction? Are there any metrics that I can look at?
I tried to find metrics that could help to debug this
Since Flink is continually committing offsets (sometimes overlapping when a commit runs long), network-related blips and other external issues that cause a checkpoint or its offset commit to take longer can result in what you are seeing: a subsequent checkpoint completes before the commit for the previous one has succeeded.
There are a handful of useful metrics related to checkpointing that you may want to explore that might help determine what's occurring:
lastCheckpointDuration - The time it took to complete the last checkpoint (in milliseconds).
lastCheckpointSize - The checkpointed size of the last checkpoint (in bytes), this metric could be different from lastCheckpointFullSize if incremental checkpoint or changelog is enabled.
Monitoring these as well as some of the other checkpointing metrics, along with task/job manager logs, might help you piece together a story for what caused the slower commit to take so long.
If you find that you are continually encountering this, you may look at adjusting the checkpointing configuration for the job to tolerate these longer durations.
I'm using Flink + Kafka to process streaming documents. I have set up filters to keep strange documents out of the Flink jobs, but there are still types of documents that I couldn't foresee. If the job consumes one of these documents, processing takes an unusually long time.
As I have seen in the job's checkpoints, many subtasks finish quite fast and then wait for the slow ones to finish (e.g. in the image below, all finished but one). My question is: can I make Flink drop these slow subtasks after a certain threshold and commit those that have already finished? I tried setting flink.job.checkpoint.timeout, but found that the checkpoint fails if it exceeds the timeout, and the job then reads the last offset and processes it again. Is there a way to make the checkpoint succeed and read the next offset?
It looks like unaligned checkpoints are what I need: https://flink.apache.org/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html
But I have to upgrade Flink to 1.11 and then set the following in flink-conf.yaml:
execution.checkpointing.unaligned: true
execution.checkpointing.aligned-checkpoint-timeout: 60 s
I use the DataStream connectors KafkaSource and HbaseSinkFunction to consume data from Kafka and write it to HBase.
I enable the checkpoint like this:
env.enableCheckpointing(3000,CheckpointingMode.EXACTLY_ONCE);
The data in Kafka has already been successfully written to HBase, but the checkpoint status on the UI page is still "in progress" and has not changed.
Why does this happen and how to deal with it?
Flink version: 1.13.3
HBase version: 1.3.1
Kafka version: 0.10.2
Maybe you can post the complete checkpoint configuration parameters.
In addition, you can adjust the checkpoint interval and observe it.
The current checkpoint interval is 3s, which is relatively short.
Hello everyone.
Please help me.
I am writing an Apache Flink streaming job that reads JSON messages from Apache Kafka (500-1000 messages per second), deserializes them into POJOs, and performs some operations (filter-keyby-process-sink). I use the RocksDB state backend with exactly-once semantics. But I do not understand which checkpointing interval I need to set.
On forums, people mostly suggest 1000 or 5000 ms.
I tried setting the interval to 10ms, 100ms, 500ms, 1000ms, and 5000ms. I have not noticed any difference.
Two factors argue in favor of a reasonably small checkpoint interval:
(1) If you are using a sink that does two-phase transactional commits, such as Kafka or the StreamingFileSink, then those transactions will only be committed during checkpointing. Thus any downstream consumers of the output of your job will experience latency that is governed by the checkpoint interval.
Note that you will not experience this delay with Kafka unless you have taken all of the steps required to have exactly-once semantics, end-to-end. This means that you must set Semantic.EXACTLY_ONCE in the Kafka producer, and set the isolation.level in downstream consumers to read_committed. And if you are doing this, you should also increase transaction.max.timeout.ms beyond the default (which is 15 minutes). See the docs for more.
(2) If your job fails and needs to recover from a checkpoint, the inputs will be rewound to the offsets recorded in the checkpoint, and processing will resume from there. If the checkpoint interval is very long (e.g., 30 minutes), then your job may take quite a while to catch back up to the point where it is once again processing events in near real-time (assuming you are processing live data).
On the other hand, checkpointing does add some overhead, so doing it more often than necessary has an impact on performance.
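As a rough illustration of the exactly-once note above, the Kafka client settings it mentions could be collected like this. This is only a sketch using plain java.util.Properties: the class and method names are made up for the example, and the 10x headroom factor is an arbitrary assumption, not a Flink or Kafka rule. The one hard constraint it checks is that the producer's transaction.timeout.ms must not exceed the broker's transaction.max.timeout.ms (15 minutes, i.e. 900000 ms, by default).

```java
import java.util.Properties;

/** Hypothetical helper: gathers the Kafka settings named in the answer above. */
final class ExactlyOnceKafkaSettings {

    /** Properties for the Kafka producer used by the Flink sink. */
    static Properties producerProperties(long checkpointIntervalMs) {
        Properties props = new Properties();
        // Assumption: keep transactions alive well beyond one checkpoint interval,
        // so a slow checkpoint or a restart does not expire an open transaction.
        long transactionTimeoutMs = Math.max(60_000L, 10 * checkpointIntervalMs);
        props.setProperty("transaction.timeout.ms", Long.toString(transactionTimeoutMs));
        return props;
    }

    /** Properties for consumers that read the job's output topic. */
    static Properties downstreamConsumerProperties() {
        Properties props = new Properties();
        // Without this, consumers also see records from uncommitted transactions.
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    /** True if the producer timeout stays within the broker's transaction.max.timeout.ms. */
    static boolean fitsBrokerLimit(Properties producerProps, long brokerTransactionMaxTimeoutMs) {
        long timeoutMs = Long.parseLong(producerProps.getProperty("transaction.timeout.ms"));
        return timeoutMs <= brokerTransactionMaxTimeoutMs;
    }
}
```

With a 10-second checkpoint interval this yields a 100000 ms producer timeout, which fits under the broker default. With a 2-minute interval it would not, which is the case where the broker's transaction.max.timeout.ms has to be raised.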
In addition to the points described by @David, my suggestion is to also use the following function to configure the checkpoint timing:
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(milliseconds);
This way, you guarantee that your job will be able to make some progress in case the state gets bigger than planned or the storage where the checkpoints are made is slow.
I recommend reading the Flink documentation on Tuning Checkpointing to better understand these scenarios.
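To see why the minimum pause helps, here is a simplified model of when the next checkpoint may start. It is a sketch, not Flink's scheduler: it assumes only one checkpoint runs at a time, and the class and method names are invented for the example. The next checkpoint starts at the regular interval unless the previous one ran long, in which case it waits until the pause has elapsed after the previous checkpoint finished.

```java
/** Simplified model of checkpoint scheduling with a minimum pause (not Flink code). */
final class MinPauseModel {

    /** Earliest start time of the next checkpoint, assuming one checkpoint at a time. */
    static long nextStartMs(long prevStartMs, long prevDurationMs, long intervalMs, long minPauseMs) {
        long prevEndMs = prevStartMs + prevDurationMs;
        // Whichever is later: the periodic trigger, or the end of the pause.
        return Math.max(prevStartMs + intervalMs, prevEndMs + minPauseMs);
    }

    public static void main(String[] args) {
        // Interval 1000 ms, minimum pause 500 ms.
        System.out.println(nextStartMs(0, 200, 1_000, 500)); // fast checkpoint: prints 1000
        System.out.println(nextStartMs(0, 900, 1_000, 500)); // slow checkpoint: prints 1400
    }
}
```

In the slow case the job gets 500 ms of uninterrupted processing instead of 100 ms, which is the "some progress" guarantee described above.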
I've set up a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers, and I'm using JMeter to load-test it by producing Kafka messages / events which are then processed. The processing job runs on a TaskManager and usually handles ~15K events/s.
The job uses EXACTLY_ONCE checkpointing and persists state and checkpoints to Amazon S3.
If I shut down the TaskManager running the job, it takes a few seconds and then the job is resumed on a different TaskManager. The job mainly logs the event ids, which are consecutive integers (e.g. from 0 to 1200000).
When I check the output on the TaskManager I shut down, the last count is, for example, 500000. When I then check the output of the resumed job on a different TaskManager, it starts with ~400000. This means ~100K duplicated events. This number depends on the speed of the test and can be higher or lower.
Not sure if I'm missing something, but I would expect the job to display the next consecutive number (like 500001) after resuming on the different TaskManager.
Does anyone know why this is happening, or which extra settings I have to configure to obtain exactly-once behavior?
You are seeing the expected behavior for exactly-once. Flink implements fault-tolerance via a combination of checkpointing and replay in the case of failures. The guarantee is not that each event will be sent into the pipeline exactly once, but rather that each event will affect your pipeline's state exactly once.
Checkpointing creates a consistent snapshot across the entire cluster. During recovery, operator state is restored and the sources are replayed from the most recent checkpoint.
For a more thorough explanation, see this data Artisans blog post: High-throughput, low-latency, and exactly-once stream processing with Apache Flink™, or the Flink docs.
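The difference between "logged once" and "affects state once" can be shown with a toy simulation. This is a sketch in plain Java, not Flink code: the class name, the single-operator setup, and the checkpoint-every-N-events trigger are all assumptions made for the example. A source emits consecutive ids, an operator counts them and logs each id, and a checkpoint stores the source offset together with the count. After a simulated crash, both are restored and the source is replayed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of checkpoint-and-replay recovery (not Flink code). */
final class CheckpointReplayDemo {

    /** Outcome of one simulated run. */
    static final class Result {
        final List<Integer> logged = new ArrayList<>(); // side effect, like the job's log lines
        long stateCount;                                // operator state: events counted
    }

    static Result run(int totalEvents, int checkpointEvery, int crashAtOffset) {
        Result r = new Result();
        int checkpointOffset = 0;  // source offset stored in the last checkpoint
        long checkpointCount = 0;  // operator state stored in the last checkpoint
        boolean crashed = false;
        int offset = 0;
        while (offset < totalEvents) {
            if (!crashed && offset == crashAtOffset) {
                // Failure: restore state and rewind the source to the checkpoint.
                crashed = true;
                offset = checkpointOffset;
                r.stateCount = checkpointCount;
                continue;
            }
            r.logged.add(offset); // the log line is NOT rolled back on recovery
            r.stateCount++;
            offset++;
            if (offset % checkpointEvery == 0) {
                checkpointOffset = offset;
                checkpointCount = r.stateCount;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = run(1_000, 200, 500);
        System.out.println("state count  = " + r.stateCount);    // prints 1000
        System.out.println("logged lines = " + r.logged.size()); // prints 1100
    }
}
```

With a crash at offset 500 and the last checkpoint at 400, ids 400 to 499 are logged twice, yet the restored count ends at exactly 1000. The same thing happens in the question: at ~15K events/s, ~100K repeated log lines correspond to roughly 7 seconds of events processed since the last completed checkpoint.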