Commit Kafka Offsets Manually in Flink - apache-flink

In a Flink streaming application that is ingesting messages from Kafka,
1) How do I disable auto-committing?
2) How do I manually commit from Flink after successfully processing a message?
Thanks.

By default, Flink commits offsets back to Kafka when a checkpoint completes. You can disable this as follows:
val consumer = new FlinkKafkaConsumer011[T](...)
consumer.setCommitOffsetsOnCheckpoints(false)
If you don't have checkpoints enabled, the consumer falls back to Kafka's periodic auto-commit, which is controlled by the standard enable.auto.commit and auto.commit.interval.ms consumer properties.
Why would you do that, though? Flink's checkpointing mechanism is there to solve this problem for you. Flink won't commit offsets in the presence of failures. If you throw an exception at some point downstream of the Kafka consumer, Flink will attempt to restart the stream from the previous successful checkpoint. If the error persists, Flink will repeatedly restart the configured number of times before failing the job.
This means it is unlikely that you will lose messages due to Flink committing the offsets of messages your code hasn't successfully processed.
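If you do still want to control this yourself, here is a minimal Java sketch of both knobs discussed above: disabling offset commits on checkpoints, and the Kafka auto-commit properties that only apply when checkpointing is disabled. The broker address, group id, and topic name are placeholders.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder broker
props.setProperty("group.id", "my-group");                 // placeholder group id

// Only relevant when checkpointing is NOT enabled: the consumer then falls back to
// Kafka's periodic auto-commit, driven by these standard Kafka properties.
props.setProperty("enable.auto.commit", "true");
props.setProperty("auto.commit.interval.ms", "5000");

FlinkKafkaConsumer011<String> consumer =
        new FlinkKafkaConsumer011<>("input-topic", new SimpleStringSchema(), props);

// With checkpointing enabled, this stops Flink from committing offsets back to Kafka
// on checkpoint completion; the checkpointed offsets are still used for recovery.
consumer.setCommitOffsetsOnCheckpoints(false);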

Related

Flink checkpoint status is always in progress

I use the DataStream connectors KafkaSource and HbaseSinkFunction to consume data from Kafka and write it to HBase.
I enabled checkpointing like this:
env.enableCheckpointing(3000, CheckpointingMode.EXACTLY_ONCE);
The data from Kafka has already been written to HBase successfully, but the checkpoint status on the UI page is still "in progress" and never changes.
Why does this happen and how can I deal with it?
Flink version: 1.13.3
HBase version: 1.3.1
Kafka version: 0.10.2
Maybe you can post the complete checkpoint configuration parameters.
In addition, you can adjust the checkpoint interval and observe it.
The current checkpoint interval is 3s, which is relatively short.
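As a sketch of a fuller checkpoint configuration to experiment with (the interval and timeout values below are only illustrative):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// A longer interval than 3 s makes it easier to see completed checkpoints in the UI.
env.enableCheckpointing(60000L, CheckpointingMode.EXACTLY_ONCE);

// Abort a checkpoint that does not finish within 10 minutes rather than letting it hang.
env.getCheckpointConfig().setCheckpointTimeout(600000L);

// Leave some breathing room between checkpoints and avoid overlapping ones.
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000L);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);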

How does Flink achieve its exactly-once mechanism?

From a previous post, it seems Flink achieves exactly-once by
After a successful pre-commit, the commit must be guaranteed to
eventually succeed
I think "a successful pre-commit" is achieved by Flink Task Manager; and the "eventual succeed" is achieved by the Flink sink.
How Flink sink node achieves the "eventual succeed"?
Does this exactly-once mechanism have anything to do with checkpoint?
Flink's two-phase commit sinks typically couple their actions with the checkpointing mechanism in the following way:
onSnapshot: Flush all records and pre-commit
onCheckpointComplete: Commit pending transactions and publish data
onRecovery: Check and commit any pending transactions
Note that it is possible for data to be lost if the external system times out pending transactions that would be committed during the onRecovery phase.
You can learn more about this in An Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache Kafka, too!).
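To make the three phases above concrete, here is a minimal, hedged sketch of a sink built on Flink's TwoPhaseCommitSinkFunction; the ExternalTxn type and the external-system comments are hypothetical placeholders, not a real connector.

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

public class SketchTwoPhaseSink
        extends TwoPhaseCommitSinkFunction<String, SketchTwoPhaseSink.ExternalTxn, Void> {

    // Hypothetical handle to a transaction in the external system.
    public static class ExternalTxn { }

    public SketchTwoPhaseSink() {
        super(new KryoSerializer<>(ExternalTxn.class, new ExecutionConfig()),
              VoidSerializer.INSTANCE);
    }

    @Override
    protected ExternalTxn beginTransaction() {
        // Open a new transaction in the external system.
        return new ExternalTxn();
    }

    @Override
    protected void invoke(ExternalTxn txn, String value, Context context) {
        // Write the record inside the open transaction (not yet visible to readers).
    }

    @Override
    protected void preCommit(ExternalTxn txn) {
        // onSnapshot: flush all buffered records and pre-commit the transaction.
    }

    @Override
    protected void commit(ExternalTxn txn) {
        // onCheckpointComplete (and onRecovery): commit the transaction and publish the data.
    }

    @Override
    protected void abort(ExternalTxn txn) {
        // Roll back a transaction whose checkpoint did not complete.
    }
}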

Is it possible to recover when a slot has been removed during a Flink streaming job?

I have a standalone cluster where there is a Flink streaming job with 1-hour event-time windows. After 2-3 hours of running, the job dies with an "org.apache.flink.util.FlinkException: The assigned slot ... was removed" exception.
The job works fine when my windows are only 15 minutes.
How can the job recover after losing a slot?
Is it possible to run the same calculations on multiple slots to prevent this error?
Shall I increase any of the timeouts? If so, which one?
A Flink streaming job recovers from failures using checkpoints. If your checkpoints are externalized, for example in S3, you can recover from the most recent checkpoint manually or let Flink do it automatically.
Depending on your upstream message queuing service, you will likely get duplicated messages, so it's good to make your ingestion idempotent.
Also, the "slot removed" failure can be a symptom of various underlying problems:
underlying hardware
network
memory pressure
What do you see in the log of the task manager whose slot was removed?
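For the recovery part of the question, a hedged sketch of enabling externalized checkpoints together with a restart strategy (the values are illustrative) looks like this:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Keep completed checkpoints on external storage (state.checkpoints.dir, e.g. an S3 path,
// is configured in flink-conf.yaml) so a restarted job can resume from them.
env.enableCheckpointing(60000L);
env.getCheckpointConfig().enableExternalizedCheckpoints(
        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

// Retry a few times with a delay instead of failing the whole job on the first lost slot.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(30)));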

Flink exactly once - checkpoint and barrier acknowledgement at sink

I have a Flink job with a sink that writes the data into MongoDB. The sink is an implementation of RichSinkFunction.
Externalized checkpointing is enabled. The interval is 5000 ms and the mode is EXACTLY_ONCE.
Flink version 1.3
Kafka (source topic) 0.9.0
I can't upgrade to the TwoPhaseCommitSink of Flink 1.4.
I have a few doubts:
At which point in time does the sink acknowledge the checkpoint barrier: at the start of the invoke function or once invoke has completed? In other words, does it wait for the persist (save to MongoDB) response before acknowledging the barrier?
If committing the checkpoint is done by an asynchronous thread, how can Flink guarantee exactly-once in case of job failure? What if data is saved to MongoDB by the sink but the checkpoint is not committed? I think this will result in duplicate data on restart.
When I cancel a job from the Flink dashboard, will Flink wait for the async checkpoint threads to complete, or is it a hard kill -9?
First of all, Flink can only guarantee end-to-end exactly-once consistency if the sources and sinks support it. If you are using Flink's Kafka consumer, Flink can guarantee that the internal state of the application is exactly-once consistent. To achieve full end-to-end exactly-once consistency, the sink needs to properly support this as well. You should check whether the implementation of the MongoDB sink works correctly.
Checkpoint barriers are sent as regular messages over the data transport channels, i.e., a barrier for checkpoint n separates the stream into records that go into checkpoint n and those that go into checkpoint n + 1. A sink operator will process a barrier between two invoke() calls and trigger the state backend to perform a checkpoint. It is then up to the state backend whether and how it can perform the checkpoint asynchronously. Once the call to trigger the checkpoint returns, the sink can continue processing. The sink operator will report to the JobManager that it completed checkpointing its state once it is notified by the state backend. An overall checkpoint completes when all operators have successfully reported that they completed their checkpoints.
This blog post discusses end-to-end exactly-once processing and the requirements for sink operators in more detail.
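Since TwoPhaseCommitSinkFunction is not available in Flink 1.3, one common pattern (at-least-once, not exactly-once) is to buffer writes in the sink and flush them when the checkpoint barrier triggers snapshotState(), so a checkpoint never completes before the buffered data has been persisted. The following is only a sketch; the writeBatchToMongo helper stands in for a real MongoDB bulk write.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class FlushingMongoSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private transient List<String> pending;

    @Override
    public void open(Configuration parameters) {
        pending = new ArrayList<>();
    }

    @Override
    public void invoke(String value) {
        // Buffer records between checkpoints; they are persisted in snapshotState().
        pending.add(value);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Called when the barrier reaches this sink, before the sink acknowledges the
        // checkpoint to the JobManager: persist everything buffered so far.
        writeBatchToMongo(pending);
        pending.clear();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // No operator state in this sketch; after a failure the replayed Kafka source
        // re-emits the records that were buffered but not yet checkpointed.
    }

    private void writeBatchToMongo(List<String> batch) {
        // Hypothetical helper: issue a bulk write via the MongoDB driver here.
    }
}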

Stop/Start Kafka Consumer/Producer Stream in Local Execution Mode

Setup:
Java 8
Flink 1.2 (Mac OSX)
Kafka 0.10.0 (VirtualBox/Ubuntu)
FlinkKafkaConsumer010
FlinkKafkaProducer010
I created a simple example program to consume 1M messages from one Kafka topic and produce them to another, running in local execution mode. Both topics have 32 partitions.
When I let it run from start to finish, it consumes and produces all messages. If I start and then stop it (SIGINT) before it completes, and then restart it, the producer only receives a subset of the original 1M messages.
I have confirmed the consumer's offsets, and it did read all 1M messages.
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(32);
env.enableCheckpointing(1000L, CheckpointingMode.EXACTLY_ONCE);
--
producer.setFlushOnCheckpoint(true);
producer.setLogFailuresOnly(false);
In local execution mode, is this expected? Do I need to enable savepoints to stop and restart a streaming job? It appears the producer is not committing all the messages when this happens.
Thanks in advance!
First of all, on subsequent runs it only receives a subset of the messages because the FlinkKafkaConsumer uses the committed offsets in Kafka as the starting positions. Currently (up to release 1.2.0), the only way to avoid this is to always assign a new group.id. In the next release, there will be new options for this: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/kafka.html#kafka-consumers-start-position-configuration.
As a side note, the committed offsets in Kafka are not used at all for Flink's exactly-once processing guarantees; Flink relies only on the checkpointed offsets for that. More details can be found in the Flink Kafka connector docs linked above.
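For reference, the start-position options that the linked documentation describes (available from Flink 1.3) look roughly like this; the topic, schema, and properties are placeholders:

import java.util.Properties;

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "example-group");

FlinkKafkaConsumer010<String> consumer =
        new FlinkKafkaConsumer010<>("input-topic", new SimpleStringSchema(), props);

// Pick exactly one start position:
consumer.setStartFromGroupOffsets();  // default: resume from the committed group offsets
// consumer.setStartFromEarliest();   // always re-read the topic from the beginning
// consumer.setStartFromLatest();     // skip any backlog and start at the latest offsets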
