How to achieve exactly-once semantics in the Apache Kafka connector - apache-flink

I am using Flink version 1.8.0. My application reads data from Kafka -> transforms it -> publishes back to Kafka. To avoid any duplicates during a restart, I want to use a Kafka producer with exactly-once semantics, which I read about here:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-011-and-newer
My Kafka version is 1.1.
return new FlinkKafkaProducer<String>(topic, new KeyedSerializationSchema<String>() {
    @Override
    public byte[] serializeKey(String element) {
        return element.getBytes();
    }

    @Override
    public byte[] serializeValue(String element) {
        return element.getBytes();
    }

    @Override
    public String getTargetTopic(String element) {
        return topic;
    }
}, prop, opt, FlinkKafkaProducer.Semantic.EXACTLY_ONCE, 1);
Checkpoint code:
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointTimeout(15 * 1000);
checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.enableCheckpointing(5000);
If I add exactly-once semantics to the Kafka producer, my Flink consumer is not reading any new data.
Can anyone please share sample code/an application with exactly-once semantics?
Please find the complete code here:
https://github.com/sris2/sample_flink_exactly_once
Thanks

Can anyone please share sample code/an application with exactly-once semantics?
An exactly-once example is hidden in an end-to-end test in Flink. Since it uses some convenience functions, it may be hard to follow without checking out the whole repository.
If I add exactly-once semantics to the Kafka producer, my Flink consumer is not reading any new data.
[...]
Please find the complete code here:
https://github.com/sris2/sample_flink_exactly_once
I checked out your code and found the issue (I had to fix the whole setup/code to actually get it running). The sink cannot actually configure the transactions correctly. As described in the Flink Kafka connector documentation, you need to adjust transaction.timeout.ms: either raise transaction.max.timeout.ms on your Kafka broker to 1 hour, or lower transaction.timeout.ms in your application to at most 15 minutes:
prop.setProperty("transaction.timeout.ms", "900000");
The respective excerpt is:
Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow setting transaction timeouts for the producers larger than its value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode.
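For completeness, a minimal sketch of the producer properties this implies (the broker address is a placeholder; the timeout value simply mirrors the fix above):
import java.util.Properties;

Properties prop = new Properties();
prop.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
// Must stay <= the broker's transaction.max.timeout.ms (15 minutes by default),
// otherwise the broker rejects the EXACTLY_ONCE producer's transaction configuration.
prop.setProperty("transaction.timeout.ms", "900000");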

Related

Flink 1.12.x DataSet --> Flink 1.14.x DataStream

I am trying to migrate from the Flink 1.12.x DataSet API to the Flink 1.14.x DataStream API. mapPartition is not available in the Flink DataStream API.
Our code using the Flink 1.12.x DataSet API:
dataset
    .<few operations>
    .mapPartition(new SomeMapPartitionFn())
    .<few more operations>

public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {

    @Override
    public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
        for (InputModel record : records) {
            /*
             * do some operation
             */
            if (/* some condition based on processing *MULTIPLE* records */) {
                out.collect(...); // Conditional collect ---> (1)
            }
        }
        // At the end of the data, collect
        out.collect(...); // Collect processed data ---> (2)
    }
}
(1) - Collector.collect invoked based on some condition after processing a few records
(2) - Collector.collect invoked at the end of the data
Initially we thought of using flatMap instead of mapPartition, but the Collector is not available in the close function.
https://issues.apache.org/jira/browse/FLINK-14709 - Only available in case of chained drivers
How can we implement this with the Flink 1.14.x DataStream API? Please advise...
Note: Our application works with only a finite set of data (batch mode).
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement a similar function, you need to define a window over the stream. Windows discretize streams, which is somewhat similar to mini-batches, but windows offer far more flexibility.
Solution provided by Zhipeng:
One solution could be to use a stream operator that implements the BoundedOneInput interface.
Example code can be found here [1].
[1]
https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
Flink user mailing link: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z
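To make that concrete, below is a minimal, hypothetical sketch of such an operator (it is not the flink-ml utility linked above); the class name and the simple counting logic are illustrative assumptions. Per-record work goes into processElement, and endInput() fires once the bounded input of the subtask is exhausted, which is where the end-of-partition emission of the old mapPartition would go.
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class MapPartitionLikeOperator<IN> extends AbstractStreamOperator<Long>
        implements OneInputStreamOperator<IN, Long>, BoundedOneInput {

    private transient long count;

    @Override
    public void open() throws Exception {
        super.open();
        count = 0L;
    }

    @Override
    public void processElement(StreamRecord<IN> element) {
        // Per-record logic; conditional output.collect(...) calls
        // (case (1) in the question) are also possible here.
        count++;
    }

    @Override
    public void endInput() {
        // Called once all input of this subtask has been consumed,
        // i.e. the place for the final collect (case (2) in the question).
        output.collect(new StreamRecord<>(count));
    }
}

Usage (the operator name and output type are placeholders):
DataStream<Long> result = stream.transform("mapPartition-like",
        org.apache.flink.api.common.typeinfo.Types.LONG, new MapPartitionLikeOperator<>());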

Unable to commit consuming offsets to Kafka on checkpoint in Flink new Kafka consumer-api (1.14)

I am referring to the Flink 1.14 Kafka source connector with the code below.
I am expecting the following requirements:
At the very first start, the application has to read from the latest offsets of the Kafka topic
On checkpoint, it has to commit the consumed offsets to Kafka
After a restart (when the application is killed manually or by a system error), it has to pick up from the last committed offsets, consume the consumer lag, and thereafter the fresh event feed.
With the new Flink Kafka consumer API (KafkaSource) I am facing the problems below:
Able to meet the above requirements, but the consumed offsets are not committed on each checkpoint (500 ms); they are rather committed after 2 s or 3 s.
When you kill the application manually within those 2-3 s and restart it, the last consumed message is read twice (a duplicate), since it was not committed.
To cross-check this behaviour I tried Flink's older Kafka consumer API (FlinkKafkaConsumer). There it works perfectly: as soon as a message is consumed, it is immediately committed back to Kafka.
Steps followed:
Set up the Kafka environment
Run the Flink code below to consume. The code includes both the old and the new API; both consume from the Kafka topic and print to the console
Push some messages to the Kafka topic.
After pushing some messages and once they are visible in the console, kill the Flink job.
Check the Kafka consumer groups for both APIs. The new Flink consumer API's group id (test1) has a consumer lag > 0, unlike the older consumer API's group id (older_test1).
When you restart the Flink job, those uncommitted messages from the new Flink Kafka consumer API are visible in the console again, leading to duplicate messages.
Please suggest if there is anything I am missing or any property that needs to be added.
@Test
public void test() throws Exception {
    System.out.println("FlinkKafkaStreamsTest started ..");

    StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
    env.enableCheckpointing(500);
    env.setParallelism(4);

    Properties propertiesOld = new Properties();
    Properties properties = new Properties();
    String inputTopic = "input_topic";
    String bootStrapServers = "localhost:29092";
    String groupId_older = "older_test1";
    String groupId = "test1";

    propertiesOld.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServers);
    propertiesOld.put(ConsumerConfig.GROUP_ID_CONFIG, groupId_older);
    propertiesOld.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

    properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServers);
    properties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);

    /******************** Old Kafka API **************/
    FlinkKafkaConsumer<String> flinkKafkaConsumer = new FlinkKafkaConsumer<>(inputTopic,
            new KRecordDes(),
            propertiesOld);
    flinkKafkaConsumer.setStartFromGroupOffsets();
    env.addSource(flinkKafkaConsumer).print("old-api");

    /******************** New Kafka API **************/
    KafkaSourceBuilder<String> sourceBuilder = KafkaSource.<String>builder()
            .setBootstrapServers(bootStrapServers)
            .setTopics(inputTopic)
            .setGroupId(groupId)
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .setProperty("enable.auto.commit", "false")
            .setProperty("commit.offsets.on.checkpoint", "true")
            .setProperties(properties)
            .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.LATEST));

    KafkaSource<String> kafkaSource = sourceBuilder.build();

    SingleOutputStreamOperator<String> source = env
            .fromSource(kafkaSource, WatermarkStrategy.forMonotonousTimestamps(), "Kafka Source");
    source.print("new-api");

    env.execute();
}

static class KRecordDes implements KafkaDeserializationSchema<String> {

    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false;
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> consumerRecord) throws Exception {
        return new String(consumerRecord.value());
    }
}
Note: I have other requirements where I want the Flink Kafka bounded source reader in the same code, which is available in the new API (KafkaSource).
From the documentation of Kafka Source:
Note that Kafka source does NOT rely on committed offsets for
fault tolerance. Committing offset is only for exposing the progress
of consumer and consuming group for monitoring.
When the Flink job recovers from a failure, instead of using the committed offsets on the broker, it will restore state from the latest successful checkpoint and resume consuming from the offset stored in that checkpoint, so records after the checkpoint will be "replayed" a little bit. Since you are using the print sink, which does not support exactly-once semantics, you will see duplicated records that are actually records after the latest successful checkpoint.
The 2-3 second delay of the offset commit you mentioned is caused by the implementation of SourceReaderBase. In short, the SplitFetcher manages a task queue, and when an offset-commit task is pushed into the queue, it won't be executed until a running fetch task invoking KafkaConsumer#poll() times out. The delay can be longer if the traffic is quite small. But note that this won't affect correctness: KafkaSource doesn't use the committed offsets for fault tolerance.
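If the duplicates themselves are the concern, the usual remedy is to replace the non-transactional print sink with an exactly-once sink. Below is a hedged sketch using the KafkaSink introduced in Flink 1.14; the broker address, topic, and transactional-id prefix are placeholders, and the builder method names should be double-checked against your exact connector version:
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers("localhost:29092")                  // placeholder broker
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("output_topic")                        // placeholder topic
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        // Writes happen inside Kafka transactions committed on checkpoint completion,
        // so replayed records after a failure are not visible to read_committed consumers.
        .setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .setTransactionalIdPrefix("flink-test-sink")             // required for EXACTLY_ONCE
        .build();

source.sinkTo(sink); // instead of source.print("new-api")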

Flink CEP Event Not triggering

I have implemented a CEP pattern in Flink which works as expected when connecting to a local Kafka broker. But when I connect to a cluster-based cloud Kafka setup, the Flink CEP does not trigger.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//saves checkpoint
env.getCheckpointConfig().enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
I am using AscendingTimestampExtractor,
consumer.assignTimestampsAndWatermarks(
        new AscendingTimestampExtractor<ObjectNode>() {
            @Override
            public long extractAscendingTimestamp(ObjectNode objectNode) {
                Instant instant = Instant.parse(objectNode.get("value").get("timestamp").asText());
                return instant.toEpochMilli();
            }
        });
I am also getting a warning message:
AscendingTimestampExtractor:140 - Timestamp monotony violated: 1594017872227 < 1594017873133
I also tried using AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks; neither of them works.
I have attached a Flink console screenshot where no watermark is assigned.
Updated Flink console screenshot.
Could anyone help?
CEP must first sort the input stream(s), which it does based on the watermarking. So
the problem could be with watermarking, but you haven't shown us enough to debug the cause. One common issue is having an idle source, which can prevent the watermarks from advancing.
But there are other possible causes. To debug the situation, I suggest you look at some metrics, either in the Flink Web UI or in a metrics system if you have one connected. To begin, check if records are flowing, by looking at numRecordsIn, numRecordsOut, or numRecordsInPerSecond and numRecordsOutPerSecond at different stages of your pipeline.
If there are events, then look at currentOutputWatermark throughout the different tasks of your job to see if event time is advancing.
Update:
It appears you may be calling assignTimestampsAndWatermarks on the Kafka consumer, which will result in per-partition watermarking. In that case, if you have an idle partition, that partition won't produce any watermarks, and that will hold back the overall watermark. Try calling assignTimestampsAndWatermarks on the DataStream produced by the source instead, to see if that fixes things. (Of course, without per-partition watermarking, you won't be able to use an AscendingTimestampExtractor, since the stream won't be in order.)
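For illustration, a hedged sketch of that change: watermarks are assigned on the DataStream returned by addSource rather than on the consumer, and a bounded-out-of-orderness extractor replaces the ascending one because the merged stream is no longer guaranteed to be in timestamp order. The 5-second bound is an illustrative assumption, and the ObjectNode import assumes the shaded-Jackson type produced by Flink's JSON deserialization schema:
import java.time.Instant;

import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<ObjectNode> events = env.addSource(consumer);    // no watermarks on the consumer itself

DataStream<ObjectNode> withTimestamps = events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<ObjectNode>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(ObjectNode objectNode) {
                return Instant.parse(objectNode.get("value").get("timestamp").asText())
                        .toEpochMilli();
            }
        });

// Feed withTimestamps into CEP.pattern(...) instead of the original stream.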

Apache Flink - kafka producer to sink messages to kafka topics but on different partitions

Right now my Flink code processes a file and sinks the data to a Kafka topic with 1 partition.
Now I have a topic with 2 partitions, and I want the Flink code to sink data to both partitions using the DefaultPartitioner.
Could you help me with that?
Here is a snippet of my current code:
DataStream<String> speStream = inputStream.map(new MapFunction<Row, String>() { ... });
Properties props = Producer.getProducerConfig(propertiesFilePath);
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()), props,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
Solved this by changing the Flink producer to
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new SimpleStringSchema(), props));
Earlier I was using
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()), props,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
In Flink version 1.11 (which I'm using with Java), the SimpleStringSchema needs a wrapper (i.e. KeyedSerializationSchemaWrapper), which @Ankit also used but which was removed from the suggested solution, as I was otherwise getting the constructor-related error below.
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<String>(
        topic_name, new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        properties, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
Error:
The constructor FlinkKafkaProducer<String>(String, SimpleStringSchema, Properties, FlinkKafkaProducer.Semantic) is undefined
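As a side note, if exactly-once delivery is still needed while leaving partitioning to Kafka's DefaultPartitioner, the connector also offers constructors that take an Optional custom partitioner; passing Optional.empty() avoids Flink's FlinkFixedPartitioner so the Kafka producer decides the partition. A hedged sketch (topic, properties, and pool size are placeholders; verify this constructor signature against your connector version):
import java.util.Optional;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.internals.KeyedSerializationSchemaWrapper;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");        // placeholder
props.setProperty("transaction.timeout.ms", "900000");           // keep below the broker's transaction.max.timeout.ms

FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "output_topic",                                           // placeholder topic
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        props,
        Optional.<FlinkKafkaPartitioner<String>>empty(),          // no FlinkFixedPartitioner -> Kafka's DefaultPartitioner decides
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE,
        5);                                                       // producer pool size for EXACTLY_ONCE

speStream.addSink(producer);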

Apache Flink flatMap with millions of outputs

Whenever I receive a message, I want to read from a database, possibly returning millions of rows, which I then want to pass on down the stream. Is this considered good practice in Flink?
public static class StatsReader implements FlatMapFunction<Msg, Json> {

    Transactor txor = ...;

    @Override
    public void flatMap(Msg msg, Collector<Json> out) {
        // Possibly lazy and async stream
        java.util.stream.Stream<Json> results = txor.exec(Stats.read(msg));
        results.forEach(stat -> out.collect(stat));
    }
}
Edit:
Background: I would like to dynamically run a report. The DB basically acts as a huge window. The report is based on that window + live data. The report is highly customizable, therefore it's hard to preprocess results or define pipelines a priori.
I use vanilla Java today, and the pipeline is roughly like this:
ReportDefinition -> ( elasticsearch query + realtime stream ) -> ( ReportProcessingPipeline ) -> ( Websocket push )
In principle this should be possible. However, I'd recommend using an AsyncFunction instead of a FlatMapFunction.
Please note that such a setup might require tuning the checkpointing parameters, such as the checkpoint interval.
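A hedged sketch of what the AsyncFunction variant could look like, reusing the Msg/Json/Transactor/Stats types from the question; the timeout, capacity, and thread-pool handling are illustrative assumptions rather than a tuned implementation:
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public static class AsyncStatsReader extends RichAsyncFunction<Msg, Json> {

    private transient Transactor txor;

    @Override
    public void open(Configuration parameters) {
        txor = ...; // initialize the database client here rather than in the constructor
    }

    @Override
    public void asyncInvoke(Msg msg, ResultFuture<Json> resultFuture) {
        // Runs the blocking query off the task thread and hands the rows to Flink.
        // Note: ResultFuture.complete takes a whole collection, so a very large
        // result set is still materialized in memory per message.
        CompletableFuture
                .supplyAsync(() -> txor.exec(Stats.read(msg)).collect(Collectors.toList()))
                .thenAccept(resultFuture::complete)
                .exceptionally(t -> { resultFuture.completeExceptionally(t); return null; });
    }
}

// Wiring: at most 10 lookups in flight per subtask, each allowed 60 seconds.
DataStream<Json> stats = AsyncDataStream.unorderedWait(
        messages, new AsyncStatsReader(), 60, TimeUnit.SECONDS, 10);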
