Flink checkpoint status is always in progress - apache-flink

I use the DataStream connectors KafkaSource and HBaseSinkFunction to consume data from Kafka and write it to HBase.
I enable checkpointing like this:
env.enableCheckpointing(3000, CheckpointingMode.EXACTLY_ONCE);
The data in Kafka has already been written to HBase successfully, but the checkpoint status on the UI page is still “in progress” and has not changed.
Why does this happen, and how do I deal with it?
Flink version: 1.13.3,
HBase version: 1.3.1,
Kafka version: 0.10.2

Maybe you can post the complete checkpoint configuration parameters.
In addition, you can adjust the checkpoint interval and observe it.
The current checkpoint interval is 3s, which is relatively short.

Related

Savepoint on Flink Job Finish

I have a use case where I need to seed a Flink application (both RocksDB state and broadcast state) using bounded S3 sources and then read other unbounded/bounded S3 sources after the seeding is complete.
I was trying to achieve this in 2 steps:
Seeding: Trigger a Flink job with only the bounded seeding source and take a savepoint after the job finishes.
Regular Processing: Restore from the seeded savepoint on a new Flink graph to process the other unbounded/bounded S3 sources.
Questions:
For Step 1: Does Flink support taking savepoints automatically after the job finishes in streaming mode?
If only a manual savepoint trigger is supported, what can be used as a done signal that all the seeding data has been processed completely and all the tasks have finished processing?
Any other approaches to achieve the seeding use case are appreciated as well.
Note: Approaches where we buffer the regular data until the seeding data is processed are not feasible for my use case.
Thanks
With unbounded sources you can use externalized checkpoints, which let you start or resume a job from a checkpoint. If you enable this feature, you need a process that cleans up the checkpoints when the job is cancelled; otherwise Flink won't delete them.
You can use the new feature available in Flink 1.15 (checkpoints with finished tasks) to do that.

Are Flink checkpoints related to Kafka partitions?

I tried to look for this in the Flink documentation but couldn't find an answer.
I have a Flink app "Kafka Source->Operations->Window->Kafka Sink".
The checkpoint configuration is:
10 second interval
around 30 MB of state
each checkpoint takes roughly less than 1 second
EXACTLY_ONCE in the Kafka producer configuration.
When I submit the job with checkpointing enabled on a topic with 30+ partitions, the checkpoints succeed. But when I run the same code (with the same resources and parallelism) on a topic (the sink topic) with, for example, 2 partitions, the checkpoints keep failing (after reaching the timeout).
I looked for an explanation but couldn't find a good one. Is the number of Kafka sink partitions related to checkpointing? If so, how?

Do I really need Flink checkpointing?

I have a Flink application that reads some events from Kafka, enriches the data from MySQL, buffers the data using a window function, and writes the data inside a window to HBase. I've currently enabled checkpointing, but it turns out that checkpointing is quite expensive: over time it takes longer and longer and affects my job's latency (the job falls behind the Kafka ingest rate). If I figure out a way to make my HBase writes idempotent, is there a strong reason for me to use checkpointing? I can just configure the internal Kafka consumer client to commit every so often, right?
If the only thing you are checkpointing is the Kafka provider offset(s), then it would surprise me that the checkpointing time is significant enough to slow down your workflow. Or is state being saved elsewhere as well? If so, you could skip that (as long as, per your note, the HBase writes are idempotent).
Note that you can also adjust the checkpointing interval, and (if need be) use incremental checkpoints with RocksDB.
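To make the "idempotent writes" idea above concrete, here is a minimal toy sketch in plain Java. It is not Flink or HBase code: a Map stands in for the HBase table, a List stands in for a Kafka partition, and every name in it is hypothetical. It assumes each record has a deterministic row key and that an offset is only ever committed after the corresponding writes have landed. Under those assumptions, replaying records after a restart overwrites rows instead of duplicating them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Flink or HBase code: a Map stands in for an HBase table
// and a List for a Kafka partition. All names here are hypothetical.
class IdempotentReplaySketch {

    // Upserts records[from, to) into the store, keyed by a deterministic row key.
    static void writeRange(List<String[]> records, int from, int to, Map<String, String> store) {
        for (int i = from; i < to; i++) {
            String[] record = records.get(i); // record[0] = row key, record[1] = value
            store.put(record[0], record[1]);  // put overwrites, so a replay is harmless
        }
    }

    // Processes records up to crashAt, "crashes", then restarts from the last
    // committed offset and processes the rest of the input.
    static Map<String, String> runWithCrash(List<String[]> records, int committedOffset, int crashAt) {
        Map<String, String> store = new LinkedHashMap<>();
        writeRange(records, 0, crashAt, store);                      // first attempt
        writeRange(records, committedOffset, records.size(), store); // replay after restart
        return store;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            records.add(new String[] {"row-" + i, "value-" + i});
        }
        // Offsets 2 and 3 are written twice, yet the table holds each row once.
        Map<String, String> store = runWithCrash(records, 2, 4);
        System.out.println(store.size() + " rows: " + store.keySet());
    }
}
```

The sketch deliberately ignores windowed state. In the job described in the question, records buffered in a window but not yet written would be lost on restart if their offsets had already been committed, so the "commit only after the write" assumption matters.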

Commit Kafka Offsets Manually in Flink

In a Flink streaming application that is ingesting messages from Kafka,
1) How do I disable auto-committing?
2) How do I manually commit from Flink after successfully processing a message?
Thanks.
By default Flink commits offsets on checkpoints. You can disable it as follows:
val consumer = new FlinkKafkaConsumer011[T](...)
consumer.setCommitOffsetsOnCheckpoints(false)
If you don't have checkpoints enabled see here
Why would you do that, though? Flink's checkpointing mechanism is there to solve this problem for you. Flink won't commit offsets in the presence of failures. If you throw an exception at some point downstream of the Kafka consumer, Flink will attempt to restart the stream from the previous successful checkpoint. If the error persists, Flink will restart repeatedly, up to the configured number of times, before failing the stream.
This means it is unlikely that you will lose messages because Flink committed offsets for messages your code hasn't successfully processed.
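The behaviour described above (offsets are committed only when a checkpoint completes, and a failure rewinds to the last successful checkpoint) can be illustrated with a small toy model in plain Java. This is a sketch under stated assumptions, not the FlinkKafkaConsumer implementation; the class and method names are hypothetical, and the model treats the committed offset and the restore offset as the same value for simplicity.

```java
import java.util.TreeMap;

// Toy model, not the FlinkKafkaConsumer implementation; all names are hypothetical.
class OffsetCommitSketch {
    private long currentOffset = 0;   // next record to read
    private long committedOffset = 0; // offset of the last completed checkpoint
    private final TreeMap<Long, Long> pendingByCheckpoint = new TreeMap<>();

    void processRecords(int count) {
        currentOffset += count;
    }

    // A checkpoint is triggered: remember the offset that belongs to it.
    void snapshot(long checkpointId) {
        pendingByCheckpoint.put(checkpointId, currentOffset);
    }

    // The checkpoint completed for the whole job: only now commit its offset.
    void notifyCheckpointComplete(long checkpointId) {
        Long offset = pendingByCheckpoint.get(checkpointId);
        if (offset == null) {
            return; // unknown or already handled checkpoint
        }
        committedOffset = offset;
        pendingByCheckpoint.headMap(checkpointId, true).clear(); // drop this and older entries
    }

    // Failure: rewind to the last completed checkpoint and discard pending ones.
    void restoreAfterFailure() {
        currentOffset = committedOffset;
        pendingByCheckpoint.clear();
    }

    long currentOffset() { return currentOffset; }
    long committedOffset() { return committedOffset; }

    public static void main(String[] args) {
        OffsetCommitSketch consumer = new OffsetCommitSketch();
        consumer.processRecords(10);
        consumer.snapshot(1);
        consumer.processRecords(5);
        consumer.notifyCheckpointComplete(1); // commits offset 10
        consumer.processRecords(3);
        consumer.snapshot(2);                 // checkpoint 2 never completes
        consumer.restoreAfterFailure();       // rewinds to offset 10
        System.out.println("committed=" + consumer.committedOffset()
                + " current=" + consumer.currentOffset());
    }
}
```

In this model the records read after checkpoint 1 are simply read again after the restart, which is why an offset is never committed for a message that was not covered by a completed checkpoint.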

Flink exactly once - checkpoint and barrier acknowledgement at sink

I have a Flink job with a sink that is writing the data into MongoDB. The sink is an implementation of RichSinkFunction.
Externalized checkpointing is enabled. The interval is 5000 ms and the mode is EXACTLY_ONCE.
Flink version 1.3,
Kafka (source topic) 0.9.0
I can't upgrade to the TwoPhaseCommitSink of Flink 1.4.
I have a few questions:
At which point in time does the sink acknowledge the checkpoint barrier: at the start of the invoke function or when invoke has completed? In other words, does it wait for the persistence (save to MongoDB) response before acknowledging the barrier?
If committing the checkpoint is done by an asynchronous thread, how can Flink guarantee exactly-once in case of a job failure? What if data is saved by the sink to MongoDB but the checkpoint is not committed? I think this will end up as duplicate data on restart.
When I cancel a job from the Flink dashboard, will Flink let the async checkpoint threads complete, or is it a hard kill -9 call?
First of all, Flink can only guarantee end-to-end exactly-once consistency if the sources and sinks support it. If you are using Flink's Kafka consumer, Flink can guarantee that the internal state of the application is exactly-once consistent. To achieve full end-to-end exactly-once consistency, the sink needs to properly support this as well. You should check whether the implementation of the MongoDB sink works correctly in this respect.
Checkpoint barriers are sent as regular messages over the data transport channels, i.e., a barrier for checkpoint n separates the stream into records that go into checkpoint n and records that go into checkpoint n + 1. A sink operator will process a barrier between two invoke() calls and trigger the state backend to perform a checkpoint. It is then up to the state backend whether and how it can perform the checkpoint asynchronously. Once the call to trigger the checkpoint returns, the sink can continue processing. The sink operator will report to the JobManager that it has completed checkpointing its state once it is notified by the state backend. An overall checkpoint completes when all operators have successfully reported that they completed their checkpoints.
This blog post discusses end-to-end exactly-once processing and the requirements for sink operators in more detail.
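The completion rule described in this answer (a checkpoint is complete only once every operator has reported) can be sketched as a small toy model in plain Java. This is an illustration with hypothetical names, not Flink's actual coordinator code. It also shows why a single task that never acknowledges keeps a checkpoint "in progress" even though data keeps flowing to the sink.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Toy model only, not Flink's checkpoint coordinator; all names are hypothetical.
class PendingCheckpointSketch {
    private final long checkpointId;
    private final Set<String> notYetAcknowledged;

    PendingCheckpointSketch(long checkpointId, Collection<String> taskNames) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(taskNames);
    }

    // Called when a task reports that its state snapshot is done.
    // Returns false for an unknown or duplicate acknowledgement.
    boolean acknowledge(String taskName) {
        return notYetAcknowledged.remove(taskName);
    }

    // The checkpoint is complete only when no acknowledgement is outstanding.
    String status() {
        return notYetAcknowledged.isEmpty() ? "COMPLETED" : "IN_PROGRESS";
    }

    public static void main(String[] args) {
        PendingCheckpointSketch checkpoint =
                new PendingCheckpointSketch(1L, Arrays.asList("source", "window", "sink"));
        checkpoint.acknowledge("source");
        checkpoint.acknowledge("window");
        // The sink has not reported yet, so the checkpoint is still pending.
        System.out.println("checkpoint " + checkpoint.checkpointId + ": " + checkpoint.status());
        checkpoint.acknowledge("sink");
        System.out.println("checkpoint " + checkpoint.checkpointId + ": " + checkpoint.status());
    }
}
```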

Resources