commitOffsetsInFinalize() and checkpoints in Apache Beam

I am working on a Beam application that uses KafkaIO as an input:
KafkaIO.<Long, GenericRecord>read()
.withBootstrapServers("bootstrapServers")
.withTopic("topicName")
.withConsumerConfigUpdates(confs)
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(Deserializer.class)
.commitOffsetsInFinalize()
.withoutMetadata();
I am trying to understand how exactly commitOffsetsInFinalize() works.
How can the streaming job be finalized?
The last step in the pipeline is a custom DoFn that writes the messages to DynamoDb. Is there any way to manually call some finalize() method there, so that the offsets are committed after each successful execution of the DoFn?
Also, I am having a hard time understanding the relation between checkpoints and finalization. If checkpointing is not enabled on the pipeline, will I still be able to finalize and get commitOffsetsInFinalize() to work?
P.S. With the pipeline as it is right now, even with commitOffsetsInFinalize(), every message that is read is committed regardless of whether there is a failure downstream, which causes data loss.
Thank you!

The finalize here refers to finalization of the checkpoint, in other words the point when the data has been durably committed into Beam's runtime state (such that, after worker failures/reassignment, the work will be retried without having to read this message from Kafka again). It does not mean that the data in question has made it through the rest of the pipeline.
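On the Flink runner, that finalization is tied to Flink's checkpoints, so commitOffsetsInFinalize() generally only takes effect when checkpointing is enabled for the runner. A minimal sketch, assuming the Beam Flink runner's FlinkPipelineOptions (the 60-second interval is just an example):

FlinkPipelineOptions options = PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
// Each completed checkpoint finalizes KafkaIO's checkpoint mark, which is when
// the consumed offsets are committed back to Kafka.
options.setCheckpointingInterval(60_000L);
Pipeline p = Pipeline.create(options);
// build the KafkaIO read with commitOffsetsInFinalize() and the DynamoDB DoFn here, then:
p.run();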

Related

Flink: sharing state between functions

As we know, there's no shared-state mechanism in Flink at the moment, but I suppose we can achieve something similar. Suppose we have a Flink job (with a single input source) and we want to know what happened at the end of it, in order to adjust the job's processing steps.
I have thought of:
Sinking the state into a broadcast source, then consuming it to update the state of functions
Using external services to store and retrieve it:
sink the state to a DB and use an async function to retrieve it amid the job flow
use a stateful function to update/read from external services amid the job flow
store the state in a Redis table and retrieve it amid the job flow
I think the first should be the most suitable, as the others require extra setup and extend the complexity to other systems.
What's your opinion on those options?
Are there other ways?
Thanks
If you use Stateful Functions then it's easy to send a message from the final processing step back to the upstream operator(s).
If you're OK with potentially losing this state if it's in-flight and your job restarts (so it's a hint re adjusting job processing, versus a requirement), then you can use an IterativeStream to send it back upstream. That would remove the need for Kafka or some other external feedback system. See also How does Flink treat checkpoints and state within IterativeStream?
I used Kafka. Whenever the state changed, I sent it to a Kafka sink as a side output, and the other tasks subscribed to the same topic were notified.
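For the broadcast option mentioned above, a minimal sketch of the pattern, assuming the feedback is published to a Kafka topic named state-updates with string payloads, and that env, kafkaProps and mainStream already exist (all names are illustrative):

MapStateDescriptor<String, String> hintDescriptor =
    new MapStateDescriptor<>("hints", Types.STRING, Types.STRING);

// Feedback stream: the final processing step writes its "state" to this topic
// (e.g. as a side output into a Kafka sink), and we read it back here.
DataStream<String> hints = env.addSource(
    new FlinkKafkaConsumer<>("state-updates", new SimpleStringSchema(), kafkaProps));

mainStream
    .connect(hints.broadcast(hintDescriptor))
    .process(new BroadcastProcessFunction<String, String, String>() {
        @Override
        public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
            // May be null until the first update arrives.
            String hint = ctx.getBroadcastState(hintDescriptor).get("latest");
            // adjust the processing of value based on the hint here
            out.collect(value);
        }

        @Override
        public void processBroadcastElement(String hint, Context ctx, Collector<String> out) throws Exception {
            // Every parallel instance receives the broadcast element and stores it.
            ctx.getBroadcastState(hintDescriptor).put("latest", hint);
        }
    });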

Flink + Kafka High Memory Usage

I have a basic Flink job which reads from Kafka as strings, performs some string operations in a flatMap, and sends the messages back to Kafka as strings. There is no window or state. The message count is about 25K/second.
The state backend is hashmap.
When I check the task manager in the Flink UI, I see that heap memory usage sometimes goes up to 10 GB.
When I watch it, I see that it fluctuates between 3 GB and 10 GB.
I have no idea where this memory is used. A message is about 1 KB, so I receive about 25 MB of data per second, and without any state or window I write it back to Kafka. No keyBy, no window, no state. Nothing.
Any idea why the memory usage is so high? Any advice would be very helpful for analyzing the problem.
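For reference, a minimal sketch of the kind of job described (env, kafkaProps, the topic names and the flatMap body are all placeholders):

DataStream<String> in = env.addSource(
    new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), kafkaProps));

in.flatMap((String value, Collector<String> out) -> {
        // stateless string manipulation only
        out.collect(value.toUpperCase());
    })
    .returns(Types.STRING)
    .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), kafkaProps));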

What exactly happens if checkpointed data cannot be committed?

I'm reading into the details of Flink's checkpointing mechanism right now, and by now I think I have a really good overview of how everything is tied together, but one last issue strikes me here.
It's about how checkpoints and commits interact with each other in the exactly-once context, because I have the feeling that there's still potential for data loss/duplicate records. Mainly I was thinking about potential failures of the commit message or its callback when I stumbled upon this paragraph in the Flink blog:
After a successful pre-commit, the commit must be guaranteed to eventually succeed – both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
Up until this point, I still had the impression that checkpoints would have to be acknowledged by the sink commit first before they would be viewed as "valid". But apparently, once all operators are ready to actually commit, the checkpoint starts to exist, and from that point on the sink has to guarantee the commit can be done to ensure no data is lost. What exactly happens if my commit can never be done, e.g. if my Kafka sink is down for a longer period of time? Does this mean that if the defined retries eventually run out, the checkpointed state will just be treated as the correct state, or will Flink only be able to resume the job once this specific commit succeeds, and thus be stuck until the broker is available again?
And what if the callback of the commit is somehow lost: will this be resolved in the next retry attempt, or, since the transaction is "done" now, will the producer be unable to commit, so that we enter this loop of repeated retries? (More of a Kafka question, probably.)
For committing side effects (things like external state, e.g. Kafka transactions), Flink uses the two-phase commit protocol.
Let's say we are performing checkpoint 42. First, pre-commit requests are issued. If all participants (parallel subtasks/operators) successfully acknowledge the pre-commit, the JobManager/CheckpointCoordinator will start sending out commit requests.
The thing is, if a failure happens at this point in time, there is no going back. If either some commit fails or there is some other, unrelated failure, the job will be restarted from checkpoint 42 and Flink will re-attempt to commit the pending/pre-committed transactions. If a failure happens again, rinse and repeat according to your selected restart strategy. If you want to avoid data loss, the commit attempts must eventually succeed. There is simply no other way. We cannot revert those transactions: once the commit requests have been issued, some transactions might have already been committed, so we cannot roll back only a portion of them (otherwise we would have a data duplication problem).
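The operator-side hooks for this protocol are exposed through Flink's TwoPhaseCommitSinkFunction. A bare-bones sketch of where pre-commit and commit fit in, assuming a placeholder MyTxn transaction handle (the method bodies are illustrative only):

class MyTxn {
    // placeholder: handle to an open transaction in the external system
}

class MyTwoPhaseSink extends TwoPhaseCommitSinkFunction<String, MyTxn, Void> {

    MyTwoPhaseSink() {
        super(new KryoSerializer<>(MyTxn.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected MyTxn beginTransaction() {
        return new MyTxn(); // open a new transaction in the external system
    }

    @Override
    protected void invoke(MyTxn txn, String value, Context context) {
        // write the record inside the open transaction
    }

    @Override
    protected void preCommit(MyTxn txn) {
        // flush; after this point the transaction must be guaranteed committable
    }

    @Override
    protected void commit(MyTxn txn) {
        // called once the checkpoint is complete; must eventually succeed and
        // must be idempotent, because it can be re-attempted after recovery
    }

    @Override
    protected void abort(MyTxn txn) {
        // roll back a transaction that never reached the commit phase
    }
}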

Is there a way to programmatically check if a Flink streaming job started from a savepoint before executing the stream?

Before calling execute on the StreamExecutionEnvironment and starting the stream job, is there a way to programmatically find out whether or not the job was restored from a savepoint? I need to know such information so that I can set the offset of a Kafka source depending on it while building the job graph.
It seems that the FlinkKafkaConsumerBase class, which has an initializeState method, has access to such information (code). However, there is no way to intercept the FunctionInitializationContext and retrieve the isRestored() value, since initializeState is a final method. Also, the initializeState method gets called after the job graph is executed, so I don't think there is a feasible solution along those lines.
Another attempt I made was to find a Flink job parameter that indicates whether or not the job was started from a savepoint. However, I don't think such parameter exists.
You can get the effect you are looking for by simply doing this:
FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.setStartFromEarliest();
If you use setStartFromEarliest then Flink will ignore the offsets stored in Kafka, and instead begin reading from the earliest record. Moreover, even if you use setStartFromEarliest, if Flink is resuming from a checkpoint or savepoint, it will instead use the offsets stored in that snapshot.
Note that Flink does its own Kafka offset management, and when recovering from a checkpoint ignores the offsets stored in Kafka. Flink does this as a part of providing exactly-once guarantees, which requires knowing exactly how much of the input was consumed to produce the results present in the rest of the state captured in a checkpoint or savepoint. For this reason, Flink always stores the offsets as part of every state snapshot (checkpoint or savepoint).
This is documented here and here.
As for your original question about initializeState, this is available if you implement the CheckpointedFunction interface, but it's quite rare to actually need this.
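For completeness, a minimal sketch of how isRestored() surfaces through CheckpointedFunction (the operator and how it reacts to the flag are illustrative; note that this runs when the operator is initialized on the cluster, not while you are still building the job graph):

class RestoreAwareMapper extends RichMapFunction<String, String> implements CheckpointedFunction {

    private transient boolean restored;

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // true when this operator's state was restored from a checkpoint or savepoint
        restored = context.isRestored();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // nothing to snapshot in this sketch
    }

    @Override
    public String map(String value) {
        // consult the flag here, e.g. to log or alter behavior after a restore
        return value;
    }
}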

Flink Kinesis Consumer not storing last successfully processed sequence nos

We are using Flink Kinesis Consumer to consume data from Kinesis stream into our Flink application.
The KCL library uses a DynamoDB table to store the last successfully processed Kinesis stream sequence numbers, so that the next time the application starts, it resumes from where it left off.
But it seems that the Flink Kinesis Consumer does not maintain any such sequence numbers in a persistent store. As a result, we need to rely on the ShardIteratorType (TRIM_HORIZON, LATEST, etc.) to decide where to resume Flink application processing upon application restart.
A possible solution could be to rely on Flink's checkpointing mechanism, but that only works when the application resumes after a failure, and not when the application has been deliberately cancelled and needs to be restarted from the last successfully consumed Kinesis stream sequence number.
Do we need to store these last successfully consumed sequence numbers ourselves?
Best practice with Flink is to use checkpoints and savepoints, as these create consistent snapshots that contain offsets into your message queues (in this case, Kinesis stream sequence numbers) together with all of the state throughout the rest of the job graph that resulted from having consumed the data up to those offsets. This makes it possible to recover or restart without any loss or duplication of data.
Flink's checkpoints are snapshots taken automatically by Flink itself for the purpose of recovery from failures, and are in a format optimized for rapid restoration. Savepoints use the same underlying snapshot mechanism, but are triggered manually, and their format is more concerned about operational flexibility than performance.
Savepoints are what you are looking for. In particular, cancel with savepoint and resume from savepoint are very useful.
Another option is to use retained checkpoints with ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION.
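For the retained-checkpoints option, the configuration is roughly as follows (assuming a StreamExecutionEnvironment named env; the interval is just an example):

env.enableCheckpointing(60_000L);
// keep the latest checkpoint when the job is cancelled, so it can later be
// used to resume, e.g. with flink run -s <retainedCheckpointPath>
env.getCheckpointConfig().enableExternalizedCheckpoints(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);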
To add to David's response, I'd like to explain the reasoning behind not storing sequence numbers.
Any kind of offset committing into the source system would limit the checkpointing/savepointing feature to fault tolerance only. That is, you could only recover from the latest checkpoint/savepoint.
However, Flink actually supports jumping back to a previous checkpoint/savepoint. Consider an application upgrade: you take a savepoint beforehand, upgrade, and let the job run for a couple of minutes, during which it creates a few checkpoints. Then you discover a critical bug. You would like to roll back to the savepoint that you took and discard all of the checkpoints.
Now, if Flink committed the source offsets only to the source systems, we would not be able to replay the data between now and the restored savepoint. So Flink needs to store the offsets in the savepoint itself, as David pointed out. Given that, additionally committing them to the source system yields no benefit and is confusing when restoring to a previous savepoint/checkpoint.
Do you see any benefit in storing the offsets additionally?
