Avoid running initialization code in Apache Flink job when resuming from savepoint - apache-flink

I have an Apache Flink job, implemented with the DataStream API, which contains some initialization code before the definition and submission of the job graph. The initialization code should run only the first time the job is submitted, not when the job resumes from a checkpoint or is updated using a savepoint.
It seems that when restarting the job during a failover from a checkpoint, the job is restarted from a job graph stored in the checkpoint - in particular, the initialization code is not run a second time (which is what I want).
Is the same possible when running a job from a savepoint? In other words, is there a way to execute code only when the job is not started from a savepoint?

If you implement the CheckpointedFunction interface, then initializeState(FunctionInitializationContext context) will be called during initialization. Then you can use context.isRestored() to determine whether the job is being started for the first time, or not.
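A minimal sketch of that approach, assuming the one-time initialization can be moved out of the client-side main method and into an operator. The class name InitOnFirstStart, the state name "setup-done", and the method runOneTimeSetup() are hypothetical placeholders. The sketch registers a small piece of operator state because, depending on the Flink version, isRestored() may report false for operators that hold no state.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

/** Pass-through map that runs a one-time setup only when the job starts without restored state. */
public class InitOnFirstStart<T> implements MapFunction<T, T>, CheckpointedFunction {

    // Marker state so that the operator has something to snapshot and restore.
    private transient ListState<Boolean> setupMarker;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        setupMarker = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("setup-done", Boolean.class));

        if (!context.isRestored()) {
            // Fresh start: no checkpoint or savepoint was used to start this job.
            runOneTimeSetup();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Record that setup has happened; the value itself is never read back.
        setupMarker.clear();
        setupMarker.add(Boolean.TRUE);
    }

    @Override
    public T map(T value) {
        // Records pass through unchanged.
        return value;
    }

    /** Hypothetical placeholder for the initialization code from the question. */
    private void runOneTimeSetup() {
        // e.g. create external resources; keep this idempotent.
    }
}
```

Note that initializeState is called once per parallel subtask, so the setup either needs to be idempotent or the operator needs to run with a parallelism of 1.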

Related

Savepoint on Flink Job Finish

I have a use case where I need to seed a Flink application (both RocksDB state and broadcast state) using bounded S3 sources and then read other unbounded/bounded S3 sources after the seeding is complete.
I was trying to achieve this in 2 steps:
Seeding: Trigger a Flink job with only the seeding data bounded source and take a savepoint after the job finishes.
Regular Processing: Restore from seeded savepoint on a new Flink graph to process other unbounded/bounded S3 sources.
Questions:
For Step 1: Does Flink support taking savepoints automatically after the job finishes in streaming mode?
If only manually triggered savepoints are supported, what can be used as a done signal that all the seeding data has been processed completely and all the tasks have finished processing?
Any other approaches to achieve the seeding use case are appreciated as well.
Note: Approaches where we buffer the regular data until the seeding data is processed are not feasible for my use case.
Thanks
With unbounded sources, you can use externalized checkpoints and start or resume jobs from them. If you enable this feature, you need a process that cleans up the checkpoints after the job is cancelled; otherwise Flink won't delete them.
You can use the new feature available in Flink 1.15 (checkpoints with finished tasks) to do that.

Flink-RocksDB behaviour after task manager failure

I am experimenting with my new Flink cluster (3 different machines: 1 Job Manager and 2 Task Managers) using RocksDB as the state backend; however, the checkpointing behaviour I am getting is a little confusing.
More specifically, I have designed a simple WordCount example and my data source is netcat. When I submit my job, the job manager assigns it to a random task manager (with no replication). I provide some words and then I kill the currently running task manager. After a while, the job restarts on the other task manager and I can provide some new words. The confusing part is that the state from the first task manager is preserved even though I have killed it.
To my understanding, RocksDB maintains its state in a local directory of the running task manager, so I expected that killing the first task manager would lose the entire state and the word counts would start from the beginning. So does Flink somehow maintain its state in memory, or broadcast it through the JobManager?
Am I missing something?
The RocksDB state backend does keep its working state on each task manager's local disk, while checkpoints are normally stored in a distributed filesystem.
If you have checkpointing enabled, then the spare task manager is able to recover the state from the latest checkpoint and resume processing.

How to deploy a new job without downtime

I have an Apache Flink application that reads from a single Kafka topic.
I would like to update the application from time to time without experiencing downtime. For now the Flink application executes some simple operators such as map and some synchronous IO to external systems via http rest APIs.
I have tried to use the stop command, but I get "Job termination (STOP) failed: This job is not stoppable." I understand that the Kafka connector does not support the stop behavior.
A simple solution would be to cancel with savepoint and to redeploy the new jar with the savepoint, but then we get downtime.
Another solution would be to control the deployment from the outside, for example, by switching to a new topic.
What would be a good practice?
If you don't need exactly-once output (i.e., you can tolerate some duplicates), you can take a savepoint without cancelling the running job. Once the savepoint is completed, you start a second job from it. The second job could write to a different topic but doesn't have to. When the second job is up, you can cancel the first job.
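The first step above (a savepoint without cancelling) can also be triggered over the JobManager's REST API. The following is only a sketch: the class name SavepointTrigger, the base URL, and the command-line arguments are assumptions for illustration, and the request shape follows the `POST /jobs/:jobid/savepoints` endpoint with `cancel-job` set to false.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SavepointTrigger {

    /** Builds a request that asks for a savepoint while leaving the job running. */
    static HttpRequest buildRequest(String restBase, String jobId, String targetDir) {
        // "cancel-job": false keeps the first job alive while the second one starts.
        String body = "{\"target-directory\": \"" + targetDir + "\", \"cancel-job\": false}";
        return HttpRequest.newBuilder()
                .uri(URI.create(restBase + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed invocation: SavepointTrigger <jobId> <targetDirectory>
        HttpRequest request = buildRequest("http://localhost:8081", args[0], args[1]);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // The savepoint is asynchronous; the response carries an id to poll for completion.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once the savepoint has completed, the new jar can be started from the savepoint path (for example with `flink run -s <savepointPath>`), and the old job cancelled after the new one is running.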

Difference between savepoint and checkpoint in Flink

I know there are similar questions on Stack Overflow, but after investigating several of them, I know that:
A savepoint is triggered manually, while a checkpoint is triggered automatically.
They use different storage formats.
But these are not the confusing points; I have no idea when to use one and when to use the other.
Consider the following two scenarios:
If I need to shut down or restart the whole application for some reason (e.g. a bug fix or an unexpected crash), will I have to use a savepoint to restore the whole application?
I thought that checkpoints are only used internally in Flink for fault tolerance while the application is running; that is, the application itself keeps running, but tasks or other things may fail, and Flink will then use a checkpoint for state recovery?
There is also the externalized checkpoint. I think it is the same as a savepoint in functionality; that is, an externalized checkpoint can also be used to recover a restarted application?
Does Flink use checkpoint for state recovery?
Basically you're right. As you said, the checkpoint is usually used internally in Flink for fault tolerance and it's more like a concept inside the framework. When your application fails, the program will try to restart from the latest checkpoint. That's how checkpoints work in Flink, without any manual intervention.
Should I use savepoint to restore the whole application for bug fix?
Yes. In these cases, you don't want to restore from the checkpoint because the latest checkpoint may have been taken several minutes ago. Instead, you'd like to snapshot the current state of the whole application and restart it from that savepoint, which may be the quickest way to restore the application without too much delay.
Externalized checkpoint.
It's still a checkpoint, but it is persisted externally based on your configuration. It can be used to restore the application, but the restored state may lag behind the current state because there is an interval between checkpoints.
For more information, take a look at this blog article: https://data-artisans.com/blog/differences-between-savepoints-and-checkpoints-in-flink.

How to schedule a job in apache-flink

I want to write a task that is triggered by Apache Flink every 24 hours and then processed by Flink. What is a possible way to do this? Does Flink provide any job scheduling functionality?
Apache Flink is not a job scheduler but an event processing engine, which is a different paradigm: Flink jobs are supposed to run continuously instead of being triggered by a schedule.
That said, you could achieve the functionality by using an off-the-shelf scheduler (e.g. cron) that starts a job on your Flink cluster and then stops it after you receive some sort of notification that the job is done (e.g. through a Kafka topic), or by using a timeout after which you assume that the job is finished and stop it. But again, because Flink is not designed for this kind of use case, you would most certainly run into edge cases which Flink does not support.
Alternatively you can simply use a 24 hour tumbling window and run your task in the corresponding trigger function. See https://flink.apache.org/news/2015/12/04/Introducing-windows.html for details on that matter.
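To see which 24 hours such a window covers, the boundary arithmetic can be sketched in plain Java. This is an illustration of epoch-aligned tumbling windows, not Flink's own code; the class name DailyWindow and the sample timestamp are made up. In a Flink job the window itself would come from a tumbling window assigner with a 24-hour size, optionally with an offset.

```java
import java.time.Duration;
import java.time.Instant;

public class DailyWindow {

    static final long SIZE_MS = Duration.ofHours(24).toMillis();

    /** Start (inclusive) of the 24-hour window that contains the timestamp, shifted by offsetMs. */
    static long windowStart(long timestampMs, long offsetMs) {
        // floorMod keeps the remainder non-negative, also for negative offsets.
        return timestampMs - Math.floorMod(timestampMs - offsetMs, SIZE_MS);
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2015-12-04T15:30:00Z").toEpochMilli();

        // Offset 0: windows run from midnight to midnight in UTC.
        long utcStart = windowStart(ts, 0L);
        System.out.println(Instant.ofEpochMilli(utcStart) + " .. "
                + Instant.ofEpochMilli(utcStart + SIZE_MS));

        // Offset of -8 hours: windows start at local midnight in a UTC+8 time zone.
        long shiftedStart = windowStart(ts, -Duration.ofHours(8).toMillis());
        System.out.println(Instant.ofEpochMilli(shiftedStart) + " .. "
                + Instant.ofEpochMilli(shiftedStart + SIZE_MS));
    }
}
```

With a zero offset the daily task fires at UTC midnight, so a time zone offset is worth considering if the 24-hour boundary should match a local day.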
