FlinkKafkaConsumer read from the start of the topic - apache-flink

I am trying to read a Kafka topic as a DataStream in Flink, using FlinkKafkaConsumer to read the topic.
The problem I am facing is that after some testing I want to start reading again from the beginning of the topic to do some extra testing. Ideally, changing the group.id and restarting the job should be enough to accomplish this.
But after the restart, the consumer is still able to find the old checkpoints/Kafka commits. I also tried deleting all the checkpoints, deleting all ConfigMaps and deployments, and restarting everything, but the same thing happened again. I can see the offsets being set in the TaskManager logs.
How can I read from the start of the topic again?
2021-02-17 10:08:41,287 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [] - [Consumer clientId=consumer-FlinkChangeConsumerNewAgain-2, groupId=FlinkChangeConsumerNewAgain] Discovered group coordinator idsp-cdp-qa-ehns-2.servicebus.windows.net:9093 (id: 2147483647 rack: null)
2021-02-17 10:08:41,324 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator [] - [Consumer clientId=consumer-FlinkChangeConsumerNewAgain-2, groupId=FlinkChangeConsumerNewAgain] Setting offset for partition adhoc-testing-0 to the committed offset FetchPosition{offset=40204, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=idsp-cdp-qa-ehns-2.servicebus.windows.net:9093 (id: 0 rack: null), epoch=-1}}
2021-02-17 10:08:41,326 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator [] - [Consumer clientId=consumer-FlinkChangeConsumerNewAgain-2, groupId=FlinkChangeConsumerNewAgain] Setting offset for partition adhoc-testing-1 to the committed offset FetchPosition{offset=39962, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=idsp-cdp-qa-ehns-2.servicebus.windows.net:9093 (id: 0 rack: null), epoch=-1}}
2021-02-17 10:08:41,328 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator [] - [Consumer clientId=consumer-FlinkChangeConsumerNewAgain-2, groupId=FlinkChangeConsumerNewAgain] Setting offset for partition adhoc-testing-4 to the committed offset FetchPosition{offset=40444, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=idsp-cdp-qa-ehns-2.servicebus.windows.net:9093 (id: 0 rack: null), epoch=-1}}
2021-02-17 10:08:41,328 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator [] - [Consumer clientId=consumer-FlinkChangeConsumerNewAgain-2, groupId=FlinkChangeConsumerNewAgain] Setting offset for partition adhoc-testing-2 to the committed offset FetchPosition{offset=40423, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=idsp-cdp-qa-ehns-2.servicebus.windows.net:9093 (id: 0 rack: null), epoch=-1}}
2021-02-17 10:08:41,328 INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator [] - [Consumer clientId=consumer-FlinkChangeConsumerNewAgain-2, groupId=FlinkChangeConsumerNewAgain] Setting offset for partition adhoc-testing-3 to the committed offset FetchPosition{offset=40368, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=idsp-cdp-qa-ehns-2.servicebus.windows.net:9093 (id: 0 rack: null), epoch=-1}}

I don't think the problem is that the consumer is able to find old commits or old checkpoints, as long as you are starting the job from scratch and not from a savepoint.
The issue seems to be that you don't set auto.offset.reset on the Kafka consumer, which means the default value, latest, is used. So whenever you start a job with a new group.id it will always start from the latest offsets in the topic. You can change that by setting the auto.offset.reset property to earliest in the properties passed to the consumer, for example as sketched below.
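For illustration, here is a minimal sketch of passing that property to the FlinkKafkaConsumer (the broker address and the SimpleStringSchema are placeholder assumptions; the topic and group.id are taken from the logs above):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9093");          // placeholder broker address
props.setProperty("group.id", "FlinkChangeConsumerNewAgain");
// Only consulted when the group has no committed offset for a partition
props.setProperty("auto.offset.reset", "earliest");

FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("adhoc-testing", new SimpleStringSchema(), props);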

This should do the trick, provided the job isn't recovering from a checkpoint, or being started from a savepoint:
FlinkKafkaConsumer<T> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.setStartFromEarliest();
Otherwise the default is to start from the committed group offsets.
Sounds like you've already seen this explanation in the docs, but just in case:
Note that these start position configuration methods do not affect the start position when the job is automatically restored from a failure or manually restored using a savepoint. On restore, the start position of each Kafka partition is determined by the offsets stored in the savepoint or checkpoint (please see the next section for information about checkpointing to enable fault tolerance for the consumer).

Related

Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of the record, then writes the result back to Kafka.
After reading from the Kafka topic, I chose to use reinterpretAsKeyedStream() rather than keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used to partition in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper
        extends KeyedProcessFunction<String, Envelope, Envelope> {
    ...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource);

DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
With a parallelism of 1, the code runs fine. With a parallelism greater than 1 (e.g. 4), I am running into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 # 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when the EnvelopeMapper class validates that the record was sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
        env.addSource(kafkaSource)   // Line x
        .map(statelessMapper1)
        .flatMap(statelessMapper2);

messageStream.keyBy(Envelope::getId)
        .process(new EnvelopeMapper(parameters))
        .addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") versus right before the stateful mapper (EnvelopeMapper)?
With
reinterpretAsKeyedStream(
        DataStream<T> stream,
        KeySelector<T, K> keySelector,
        TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key-partitioned in a particular way. Using it successfully with Kafka can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the keyGroups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
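To see why the two partitionings usually diverge, here is a rough sketch (not from the original answer) that compares Kafka's default partitioner with Flink's key-group assignment for the same key; the key value is made up, while the partition count and parallelism come from the question:

import java.nio.charset.StandardCharsets;
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
import org.apache.kafka.common.utils.Utils;

public class PartitioningComparison {
    public static void main(String[] args) {
        String key = "envelope-id-42";  // hypothetical Envelope id
        int kafkaPartitions = 24;       // topic partitions, from the question
        int flinkParallelism = 4;       // job parallelism, from the question
        int maxParallelism = 128;       // Flink's default max parallelism

        // Kafka's default partitioner: murmur2 hash of the serialized key, modulo partition count.
        int kafkaPartition =
                Utils.toPositive(Utils.murmur2(key.getBytes(StandardCharsets.UTF_8))) % kafkaPartitions;

        // Flink: the key's hashCode is hashed into a key group, and key groups are mapped to subtasks.
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        int flinkSubtask =
                KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, flinkParallelism);

        System.out.printf("Kafka partition: %d, Flink key group: %d, Flink subtask: %d%n",
                kafkaPartition, keyGroup, flinkSubtask);
    }
}

Because the two hash schemes are unrelated, a record arriving from a given Kafka partition generally does not belong to a key group owned by the subtask that read it, which is what the "KeyGroupRange ... does not contain key group 85" error above is reporting.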

Flink Streaming - kafka.internals.Handover$ClosedException

I am using Apache Flink v1.12.3.
Recently I have encountered this error, and I do not know exactly what it means. Is the error related to Kafka or to Flink?
Error log:
2021-07-02 21:32:50,149 WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator [] - [Consumer clientId=consumer-myGroup-24, groupId=myGroup] Offset commit failed on partition my_topic-14 at offset 11328862986: The request timed out.
// ...
2021-07-02 21:32:50,150 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator [] - [Consumer clientId=consumer-myGroup-24, groupId=myGroup] Group coordinator 1.2.3.4:9092 (id: 2147483616 rack: null) is unavailable or invalid, will attempt rediscovery
// ...
2021-07-02 21:33:20,553 INFO org.apache.kafka.clients.FetchSessionHandler [] - [Consumer clientId=consumer-myGroup-21, groupId=myGroup] Error sending fetch request (sessionId=1923351260, epoch=9902) to node 29: {}.
// ...
2021-07-02 21:33:19,457 INFO org.apache.kafka.clients.FetchSessionHandler [] - [Consumer clientId=consumer-myGroup-15, groupId=myGroup] Error sending fetch request (sessionId=1427185435, epoch=10820) to node 29: {}.
org.apache.kafka.common.errors.DisconnectException: null
// ...
2021-07-02 21:34:10,157 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: my_topic_stream (4/15)#0 (2e2051d41edd606a093625783d844ba1) switched from RUNNING to FAILED.
org.apache.flink.streaming.connectors.kafka.internals.Handover$ClosedException: null
at org.apache.flink.streaming.connectors.kafka.internals.Handover.close(Handover.java:177) ~[blob_p-a7919582483974414f9c0d4744bab53199b880d7-d9edc9d0741b403b3931269bf42a4f6b:?]
at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.cancel(KafkaFetcher.java:164) ~[blob_p-a7919582483974414f9c0d4744bab53199b880d7-d9edc9d0741b403b3931269bf42a4f6b:?]
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.cancel(FlinkKafkaConsumerBase.java:945) ~[blob_p-a7919582483974414f9c0d4744bab53199b880d7-d9edc9d0741b403b3931269bf42a4f6b:?]
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.lambda$createAndStartDiscoveryLoop$2(FlinkKafkaConsumerBase.java:913) ~[blob_p-a7919582483974414f9c0d4744bab53199b880d7-d9edc9d0741b403b3931269bf42a4f6b:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
It is a Kafka issue. The Kafka consumer client throws an error (timeout) while committing the offset to the Kafka cluster. One possible reason is that the Kafka cluster is busy and cannot respond in time. This error causes the task manager running the Kafka consumer to fail.
Try adding parameters to the properties used when creating the source stream from Kafka. A possible parameter is request.timeout.ms; set it to a longer timeout and try again, for example as sketched below.
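A minimal sketch of raising that timeout on the consumer properties (the broker address, deserialization schema, and the concrete timeout value are assumptions for illustration; the topic and group.id come from the logs above):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");   // placeholder broker address
props.setProperty("group.id", "myGroup");
// Give the broker more time to answer offset-commit and fetch requests (the default is 30000 ms)
props.setProperty("request.timeout.ms", "60000");

FlinkKafkaConsumer<String> source =
        new FlinkKafkaConsumer<>("my_topic", new SimpleStringSchema(), props);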
References:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/#kafka-consumer
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#consumerconfigs_request.timeout.ms

Solr recovery mode

I am running a Solr 7.4 cluster with 2 nodes, 9 shards, and 2 replicas for each shard.
When one of the servers crashes, I see this message (Skipping download for _3nap.fnm because it already exists) in the logs:
2019-04-16 09:20:21.333 INFO (recoveryExecutor-4-thread-36-processing-n:192.168.1.2:4239_solr x:telegram_channel_post_archive_shard5_replica_n53 c:telegram_channel_post_archive s:shard5 r:core_node54) [c:telegram_channel_post_archive s:shard5 r:core_node54 x:telegram_channel_post_archive_shard5_replica_n53] o.a.s.h.IndexFetcher Skipping download for _3nap.fnm because it already exists
2019-04-16 09:20:35.265 INFO (recoveryExecutor-4-thread-36-processing-n:192.168.1.2:4239_solr x:telegram_channel_post_archive_shard5_replica_n53 c:telegram_channel_post_archive s:shard5 r:core_node54) [c:telegram_channel_post_archive s:shard5 r:core_node54 x:telegram_channel_post_archive_shard5_replica_n53] o.a.s.h.IndexFetcher Skipping download for _3nap.dim because it already exists
2019-04-16 09:20:51.437 INFO (recoveryExecutor-4-thread-36-processing-n:192.168.1.2:4239_solr x:telegram_channel_post_archive_shard5_replica_n53 c:telegram_channel_post_archive s:shard5 r:core_node54) [c:telegram_channel_post_archive s:shard5 r:core_node54 x:telegram_channel_post_archive_shard5_replica_n53] o.a.s.h.IndexFetcher Skipping download for _3nap.si because it already exists
2019-04-16 09:21:00.528 INFO (qtp1543148593-32) [c:telegram_channel_post_archive s:shard20 r:core_node41 x:telegram_channel_post_archive_shard20_replica_n38] o.a.s.u.p.LogUpdateProcessorFactory [telegram_channel_post_archive_shard20_replica_n38] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=dedupe&distrib.from=http://192.168.1.1:4239/solr/telegram_channel_post_archive_shard20_replica_n83/&min_rf=2&wt=javabin&version=2}{add=[9734588300_4723 (1630961769251864576), 9734588300_4693 (1630961769253961728), 9734588300_4670 (1630961769255010304), 9734588300_4656 (1630961769255010305)]} 0 80197
How does the recovery process work in Solr?
Will it transfer all the documents from the shard or only the broken parts?
I found this note in the document:
If a leader goes down, it may have sent requests to some replicas and not others. So when a new potential leader is identified, it runs a synch process against the other replicas. If this is successful, everything should be consistent, the leader registers as active, and normal actions proceed. If a replica is too far out of sync, the system asks for a full replication/replay-based recovery.
but I don't understand this part; what does it mean?
If a replica is too far out of sync
The note just says that Solr will attempt to sync as little as possible, but if that's not possible (i.e. the replica is so far behind that the transaction log isn't usable any longer), the complete set of files in the index will be replicated to the replica. This takes longer than a regular sync.
The message you're getting is that the file in question has already been replicated, so it doesn't have to be sent to the replica again.

librdkafka C API Kafka Consumer doesn't read all messages correctly

I am using the librdkafka C API consumer (specifically rd_kafka_consumer_poll to read, and I did call rd_kafka_poll_set_consumer before this).
The problem I see is that in my Google Test I do the following:
write 3 messages to kafka
init/start kafka consumer (rd_kafka_consumer_poll)
in rebalance_cb I set each partition's offset to RD_KAFKA_OFFSET_STORED and assign them to the handle
At this point I believe it should read 3 messages, but it reads only the last message; surprisingly, the offset for each partition is already updated!
Am I missing something here with the Kafka consumer?
One more question: I initially thought the stored offset lives in the Kafka broker and that there is a unique offset for each topic + consumer group id + partition combination.
So I thought different consumer groups reading the same topic should have different offsets.
However, that doesn't seem to be the case. I am always reading from the same offset even when using different consumer groups.
I suspect this may be related to offset commits but am not sure where to tackle this.
Any insight?
Configuration to look at: auto.offset.reset
From the Kafka consumer documentation:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server
From the librdkafka documentation:
Action to take when there is no initial offset in offset store or the desired offset is out of range: 'smallest','earliest' - automatically reset the offset to the smallest offset, 'largest','latest' - automatically reset the offset to the largest offset, 'error' - trigger an error which is retrieved by consuming messages and checking 'message->err'. Type: enum value
The default value is latest.
Furthermore,
#define RD_KAFKA_OFFSET_STORED -1000
So you're trying to set the partition offset to -1000, which is obviously not a valid offset.
Apparently, librdkafka reads the last message in this case (I didn't check the code).

Apache Flink not deleting old checkpoints

I have a very simple setup of a 4-node Flink cluster where one of the nodes is the JobManager and the others are TaskManagers, started by the start-cluster script.
All TaskManagers have the same configuration; regarding state and checkpointing it is as follows:
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///root/flink-1.3.1/checkpoints/fs
state.backend.rocksdb.checkpointdir: file:///root/flink-1.3.1/checkpoints/rocksdb
# state.checkpoints.dir: file:///root/flink-1.3.1/checkpoints/metadata
# state.checkpoints.num-retained: 2
(The latter two options are commented out intentionally; I tried uncommenting them and it didn't change anything.)
And in code I have:
val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
streamEnv.enableCheckpointing(10.minutes.toMillis)
streamEnv.getCheckpointConfig.setCheckpointTimeout(1.minute.toMillis)
streamEnv.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
After the job has been running for 40 minutes, in the directory
/root/flink-1.3.1/checkpoints/fs/.../
I see 4 checkpoint directories with the name pattern "chk-" + index, whereas I expected that old checkpoints would be deleted and only one checkpoint would be left (according to the docs, only one checkpoint is retained by default). Meanwhile, in the web UI Flink marks the first three checkpoints as "discarded".
Did I configure anything wrong, or is this expected behaviour?
The deletion is done by the JobManager, which probably has no way of accessing your files (in /root).
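If that is the case, a possible fix (a sketch, not part of the original answer) is to point the checkpoint directories at storage that the JobManager and all TaskManagers can reach, such as an NFS mount or HDFS; the paths below are illustrative:
state.backend: rocksdb
# Paths must be readable and writable from every node,
# so the JobManager can clean up old checkpoints.
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints/fs
state.checkpoints.dir: hdfs:///flink/checkpoints/metadata
state.checkpoints.num-retained: 1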
