Savepoint on Flink Job Finish - apache-flink

I have a use case where I need to seed a Flink application (both RocksDB state and broadcast state) from bounded S3 sources and then read other unbounded/bounded S3 sources after the seeding is complete.
I was trying to achieve this in 2 steps:
Seeding: Trigger a Flink job with only the bounded seeding-data source and take a savepoint after the job finishes.
Regular Processing: Restore from seeded savepoint on a new Flink graph to process other unbounded/bounded S3 sources.
Questions:
For Step 1: Does Flink support taking savepoints automatically after the job finishes in streaming mode?
If only manually triggered savepoints are supported, what can be used as a done signal that all the seeding data has been processed completely and all the tasks have finished processing?
Any other approaches to achieve the seeding use case are appreciated as well.
Note: Approaches where we buffer the regular data until the seeding data is processed are not feasible for my use case.
Thanks

With unbounded sources you can make use of externalized checkpoints, and you will be able to start/resume jobs from a retained checkpoint. When enabling this feature, you need a process to clean up the checkpoints when the job is cancelled; otherwise the checkpoints won't be deleted by Flink.
You can use the new feature available in Flink 1.15 (checkpoints with finished tasks) to do that.
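For illustration, here is a minimal sketch of wiring both pieces together, assuming the Flink 1.14/1.15 APIs; the checkpoint interval is an arbitrary placeholder and the actual pipeline is omitted:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SeedingJobCheckpointSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // In Flink 1.14 this must be switched on explicitly; in 1.15 it is enabled by default.
        conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // placeholder interval
        // Retain the last checkpoint when the job is cancelled, so the follow-up job can be started from it.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... define the bounded seeding source and the stateful operators here, then call env.execute().
    }
}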

Related

Flink checkpoint status is always in progress

I use the DataStream connectors KafkaSource and HbaseSinkFunction to consume data from Kafka and write it to HBase.
I enable checkpointing like this:
env.enableCheckpointing(3000,CheckpointingMode.EXACTLY_ONCE);
The data from Kafka has already been successfully written to HBase, but the checkpoint status on the UI page is still "in progress" and has not changed.
Why does this happen, and how do I deal with it?
Flink version: 1.13.3
HBase version: 1.3.1
Kafka version: 0.10.2
Maybe you can post the complete checkpoint configuration parameters.
In addition, you can increase the checkpoint interval and observe the behavior.
The current checkpoint interval is 3 s, which is relatively short.
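As a hedged illustration (not taken from the question), a fuller checkpoint configuration might look like the sketch below; the values are placeholders to tune for your workload, not recommendations from the answer:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // much longer than 3 s

        CheckpointConfig cc = env.getCheckpointConfig();
        cc.setCheckpointTimeout(120_000);         // fail checkpoints that take longer than 2 minutes
        cc.setMinPauseBetweenCheckpoints(30_000); // leave breathing room between checkpoints
        cc.setMaxConcurrentCheckpoints(1);        // avoid overlapping checkpoints

        // ... add the KafkaSource -> HbaseSinkFunction pipeline here, then call env.execute().
    }
}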

Initialize Flink Job

We are deploying a new Flink stream processing job, and its state (stores) needs to be initialized with historical data; this data should be available in the state store before the job starts processing any new application events. We don't want to significantly modify the Flink job to also load the historical data.
We considered writing another, separate Flink job to process the historical data, update its state store, and create a savepoint, and then use this savepoint to initialize the state in the main Flink job. It looks like the State Processor API only works with the DataSet API, and we are wondering about alternative solutions. Thanks.
The State Processor API is a good solution. It provides a sort of savepoint connector that you use in a DataSet job to read/modify/update the savepoints that you use in your DataStream jobs.
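For illustration, a minimal sketch of bootstrapping keyed state with the DataSet-based State Processor API (as it existed up to Flink 1.14). The operator uid "seeded-operator", the state descriptor name, the Tuple2 schema, and the savepoint path are illustrative placeholders, not details from the question:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

public class SeedSavepointJob {

    // Writes one ValueState entry per key; the descriptor must match the one used by
    // the keyed operator in the streaming job that will later restore this savepoint.
    public static class SeedBootstrapper extends KeyedStateBootstrapFunction<String, Tuple2<String, Long>> {
        private transient ValueState<Long> seeded;

        @Override
        public void open(Configuration parameters) {
            seeded = getRuntimeContext().getState(new ValueStateDescriptor<>("seeded-value", Long.class));
        }

        @Override
        public void processElement(Tuple2<String, Long> record, Context ctx) throws Exception {
            seeded.update(record.f1);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for reading the historical data (e.g. from files).
        DataSet<Tuple2<String, Long>> historical = env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L));

        BootstrapTransformation<Tuple2<String, Long>> transformation = OperatorTransformation
                .bootstrapWith(historical)
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> value) {
                        return value.f0;
                    }
                })
                .transform(new SeedBootstrapper());

        Savepoint
                .create(new MemoryStateBackend(), 128)           // must match the target job's max parallelism
                .withOperator("seeded-operator", transformation) // must match uid(...) of the target operator
                .write("file:///tmp/savepoints/seed");           // placeholder path

        env.execute("seed-savepoint");
    }
}

The main streaming job is then started from that savepoint (e.g. flink run -s <path>) and must give the matching operator the same uid and state descriptor.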
It's a pretty simple change (definitely not "significant") to support a -preload mode for your job, where the non-historical data sources get replaced by empty/non-terminating sources. I typically use counters to decide when state has been fully populated, then stop with a savepoint, and restart without the -preload option.
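As a sketch of the "empty/non-terminating source" part of that -preload approach (the class name is illustrative):

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Emits nothing and never finishes, so the job topology stays identical in -preload mode.
public class IdleSource<T> implements SourceFunction<T> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<T> ctx) throws Exception {
        while (running) {
            Thread.sleep(1000); // stay alive until the job is cancelled or stopped with a savepoint
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}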

Flink exactly once - checkpoint and barrier acknowledgement at sink

I have a Flink job with a sink that is writing the data into MongoDB. The sink is an implementation of RichSinkFunction.
Externalized checkpointing is enabled. The interval is 5000 ms and the mode is EXACTLY_ONCE.
Flink version: 1.3
Kafka (source topic) version: 0.9.0
I can't upgrade to the TwoPhaseCommitSinkFunction of Flink 1.4.
I have a few doubts:
At which point in time does the sink acknowledge the checkpoint barrier: at the start of the invoke function or after invoke has completed? In other words, does it wait for the persist (save to MongoDB) response before acknowledging the barrier?
If committing the checkpoint is done by an asynchronous thread, how can Flink guarantee exactly-once in case of a job failure? What if data is saved by the sink to MongoDB but the checkpoint is not committed? I think this will end up with duplicate data on restart.
When I cancel a job from the Flink dashboard, will Flink wait for the async checkpoint threads to complete, or is it a hard kill -9 call?
First of all, Flink can only guarantee end-to-end exactly-once consistency if the sources and sinks support it. If you are using Flink's Kafka consumer, Flink can guarantee that the internal state of the application is exactly-once consistent. To achieve full end-to-end exactly-once consistency, the sink needs to properly support this as well. You should check whether the implementation of the MongoDB sink handles this correctly.
Checkpoint barriers are sent as regular messages over the data transport channels, i.e., a barrier for checkpoint n separates the stream into records that go into checkpoint n and n + 1. A sink operator will process a barrier between two invoke() calls and trigger the state backend to perform a checkpoint. It is then up to the state backend whether and how it can perform the checkpoint asynchronously. Once the call to trigger the checkpoint returns, the sink can continue processing. The sink operator will report to the JobManager that it completed checkpointing its state once it is notified by the state backend. An overall checkpoint completes when all operators have successfully reported that they completed their checkpoints.
This blog post discusses end-to-end exactly-once processing and the requirements for sink operators in more detail.
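To make the barrier/invoke() ordering concrete, here is a hedged sketch (not the asker's sink and not an official connector) of a RichSinkFunction that buffers records and flushes them when the checkpoint barrier triggers snapshotState(). On its own this gives at-least-once; for exactly-once the writes would additionally need to be idempotent (e.g. upserts keyed by a unique id) or the buffer would have to be kept in operator state and committed in notifyCheckpointComplete(). Class name and connection details are placeholders:

import java.util.ArrayList;
import java.util.List;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.bson.Document;

public class BufferingMongoSink extends RichSinkFunction<Document> implements CheckpointedFunction {

    private final List<Document> buffer = new ArrayList<>();
    private transient MongoClient client;
    private transient MongoCollection<Document> collection;

    @Override
    public void open(Configuration parameters) {
        client = MongoClients.create("mongodb://localhost:27017"); // placeholder connection string
        collection = client.getDatabase("db").getCollection("events");
    }

    @Override
    public void invoke(Document value) {
        buffer.add(value); // barriers are processed between invoke() calls
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Called when the barrier reaches this sink: flush everything that belongs to this checkpoint.
        if (!buffer.isEmpty()) {
            collection.insertMany(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // For stronger guarantees the buffer would also be stored in operator state here,
        // so unflushed records survive a failure; omitted for brevity.
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
        }
    }
}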

How to schedule a job in apache-flink

I want to write a task that is triggered by Apache Flink every 24 hours and then processed by Flink. What is the possible way to do this? Does Flink provide any job scheduling functionality?
Apache Flink is not a job scheduler but an event processing engine, which is a different paradigm, as Flink jobs are supposed to run continuously instead of being triggered by a schedule.
That said, you could achieve the functionality by simply using an off-the-shelf scheduler (e.g. cron) that is set up to start a job on your Flink cluster and then stop it after you receive some sort of notification that the job is done (e.g. through a Kafka topic), or simply use a timeout after which you would assume that the job is finished and you can stop it. But again, especially because Flink is not designed for this kind of use case, you would most certainly run into edge cases which Flink does not support.
Alternatively, you can simply use a 24-hour tumbling window and run your task in the corresponding trigger function. See https://flink.apache.org/news/2015/12/04/Introducing-windows.html for details on that matter.
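A minimal sketch of the tumbling-window variant; the source, key, and window logic are placeholders (with a bounded demo source the last processing-time window may never fire):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class DailyWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")                 // stand-in for a real unbounded source
           .keyBy(value -> value)
           .window(TumblingProcessingTimeWindows.of(Time.days(1)))
           .process(new ProcessWindowFunction<String, String, String, TimeWindow>() {
               @Override
               public void process(String key, Context ctx, Iterable<String> elements, Collector<String> out) {
                   // the "daily task" runs here, once per key per 24-hour window
                   out.collect("processed window for key " + key);
               }
           })
           .print();

        env.execute("daily-window");
    }
}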

When are flink checkpoint files cleaned?

I have a streaming job that:
reads from Kafka --> maps events to some other DataStream --> keyBy(0) --> reduces over a processing-time window of 15 seconds and writes back to a Redis sink.
When starting up, everything works great. The problem is that after a while, the disk space gets full with what I think are Flink checkpoints.
My question is: are the checkpoints supposed to be cleaned/deleted while the Flink job is running? I could not find any resources on this.
I'm using a filesystem state backend that writes to /tmp (no HDFS setup).
Flink cleans up checkpoint files while it is running. There were some corner cases where it "forgot" to clean up all files in case of system failures.
But for Flink 1.3 the community is working on fixing all these issues.
In your case, I'm assuming that you don't have enough disk space to store the data of your windows on disk.
Checkpoints are by default not persisted externally and are only used to resume a job from failures. They are deleted when a program is cancelled.
If you are taking externalized checkpoints, there are two cleanup policies:
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the externalized checkpoint when the job is cancelled. Note that you have to manually clean up the checkpoint state after cancellation in this case.
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the externalized checkpoint when the job is cancelled. The checkpoint state will only be available if the job fails.
For more details, see
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html
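For reference, a hedged sketch of enabling externalized checkpoints with one of those retention policies, using the API as documented for that release line; the interval is a placeholder and the checkpoint directory itself is configured separately:

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExternalizedCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(15_000); // placeholder interval
        // RETAIN_ON_CANCELLATION keeps the checkpoint after cancel (manual cleanup required);
        // DELETE_ON_CANCELLATION keeps it only if the job fails.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // The checkpoint directory is set in flink-conf.yaml via state.checkpoints.dir.
    }
}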
