upgrade Flink minor version and restore from checkpoint - apache-flink

According to the official documentation, Flink supports minor version upgrades: restoring a snapshot taken with an older minor version of Flink (1.x → 1.y).
Q1. Does this mean I can upgrade the Flink version of my job in the following way:
Stop the job running with Flink 1.10.
Record the latest checkpoint, for example: chk-123.
Upgrade Flink to 1.15 (or higher).
Restore the job from chk-123.
Q2. I found a savepoint compatibility table, but checkpoints are not mentioned. Is the checkpoint compatibility table the same as the savepoint one, or is it just as described, Flink (1.x → 1.y), where x is any version earlier than y?

I think externalized checkpoints should behave the same way as savepoints in terms of compatibility. That said, a savepoint can be taken as part of stopping the job (stop-with-savepoint), so it is the natural choice when upgrading Flink because it minimizes the amount of data reprocessed. Is there a reason why you can't use a savepoint?

This depends on which type of snapshot you're using (canonical savepoint, native savepoint, aligned checkpoint, or unaligned checkpoint) and, of course, on whether you are also changing your Flink application.
You can find an overview of these snapshot types and the capabilities and limitations they offer at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations
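
For the "record latest checkpoint" step in the question, the following is a minimal sketch of how one might pick the newest checkpoint from a list of directory names. It assumes only that checkpoint directories follow the chk-<number> naming seen in the question (chk-123); the class LatestCheckpoint and its method are hypothetical helpers, not part of Flink. It only looks at names, so it cannot tell whether a checkpoint actually completed.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Flink: picks the newest chk-N directory name. */
final class LatestCheckpoint {
    private static final Pattern CHK = Pattern.compile("chk-(\\d+)");

    /** Returns the name with the highest checkpoint number, ignoring other entries. */
    static Optional<String> latest(List<String> directoryNames) {
        String best = null;
        long bestId = -1;
        for (String name : directoryNames) {
            Matcher m = CHK.matcher(name);
            if (m.matches()) {
                // Compare numerically: as plain strings, "chk-9" would sort after "chk-123".
                long id = Long.parseLong(m.group(1));
                if (id > bestId) {
                    bestId = id;
                    best = name;
                }
            }
        }
        return Optional.ofNullable(best);
    }
}
```

The numeric comparison is the main design point here: sorting the names as strings would pick the wrong directory once the checkpoint counter gains a digit.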

Related

What is the restoring mechanism of the Flink on K8S when rollingUpdate is executed for update strategy?

I am wondering how restoring from a checkpoint or savepoint works in Flink when a job is restarted by a rolling update on k8s.
Let me explain with a simple example.
Assume that I have 4 pods in my Flink k8s job and the following simple dataflow with parallelism 1:
source -> filter -> map -> sink
Each pod is responsible for one operator, and data is consumed through the source function. Since I don't want to lose data, I configured the dataflow for at-least-once or exactly-once processing in Flink.
When a rolling update occurs, the pods are restarted sequentially. Suppose that filter is managed by pod1, map by pod2, sink by pod3, and source by pod4. When pod1 (filter) is restarted by the rolling update, are the records in the source task (or in other tasks) immediately saved to external storage as a checkpoint, so that they can be restored without data loss after the restart?
I am also wondering whether the data in the map task (pod2) is persisted to external storage when the rolling update happens, even though the checkpoint has not finished.
That is, when the rolling update happens, Flink is in the middle of processing records and the checkpoint has not completed. In this case, is the data currently being processed in the task lost?
I need more clarification on how data is restored when we use checkpoints with Flink on k8s and the update uses a rolling strategy.
Flink doesn't support rolling upgrades. If one of the pods on which a Flink application is running becomes unavailable, the Flink application will usually restart.
The answer from David at "Is the whole job restarted if one task fails" explains this in more detail.
I would also recommend looking at the current documentation for Task Failure Recovery at https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/task_failure_recovery/ and the checkpointing and savepoint documentation that is also listed there.

How to scale up/down a running Flink cluster on Kubernetes with Flink 1.11?

I'm running Flink on Kubernetes. When I update the replicas of the TaskManager deployment, Kubernetes scales the number of TM pods up or down for me. However, although the newly added TM is up, it is not getting any tasks, and I'm not sure whether that is all I need to do. Do I need to do anything else to make the job adapt to more or fewer TMs in Flink 1.11.3?
To get this to work the way you expected, upgrade to Flink 1.13 and use reactive mode. See https://flink.apache.org/2021/05/06/reactive-mode.html.
With Flink 1.11, you'll have to rescale manually, by restarting from a checkpoint or savepoint while specifying the new parallelism. If you are using a native kubernetes deployment, Flink will use its kubernetes resource manager, and will create the appropriate number of pods automatically. (Note that native kubernetes deployments have also been improved since 1.11.) On the other hand, with a standalone kubernetes deployment, Flink is unaware of kubernetes, and you're on your own, and need to manually create the right number of pods.
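
To see why a newly added TaskManager can sit idle until the job is restarted with a higher parallelism, here is a back-of-the-envelope sketch. It assumes a single job on the cluster and default slot sharing, so the job occupies as many slots as its highest operator parallelism. SlotMath is a hypothetical helper, not a Flink API.

```java
/** Back-of-the-envelope slot arithmetic; assumes one job and default slot sharing. */
final class SlotMath {
    /** Slots left unused when a job with the given parallelism runs on the cluster. */
    static int idleSlots(int taskManagers, int slotsPerTaskManager, int jobParallelism) {
        int totalSlots = taskManagers * slotsPerTaskManager;
        // The job's parallelism is fixed until it is restarted, so extra slots stay idle.
        return Math.max(0, totalSlots - jobParallelism);
    }
}
```

With 2 TaskManagers of 2 slots each and a job parallelism of 4, nothing is idle; adding a third TaskManager only raises the idle count to 2 until the job is restarted from a checkpoint or savepoint with a larger parallelism.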

Is the whole job restarted if one task fails

I have a job that has stateful operators and has checkpointing enabled. One of the tasks of a stateful operator fails for some reason and has to be restarted and recover the checkpointed state.
I would like to ask which of the following is the restart behavior:
only the failed task is restarted and restored
all tasks of the operator that contains the failed task are restarted and restored
the whole job is restarted and restored
Is the whole job restarted if one task fails?
tldr: For streaming jobs the answer is usually yes, but not necessarily.
Recovery of a Flink streaming job involves rewinding the sources to the offsets recorded in a checkpoint, and resetting the state back to what it had been after having consumed only the data up to those offsets.
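
The rewind-and-reset idea in the paragraph above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (a single task, an in-memory array as the source, a running sum as the state, exactly one simulated failure); ReplaySketch is hypothetical and is not how Flink is implemented.

```java
/** Toy model of checkpoint-and-rewind recovery; not Flink code. */
final class ReplaySketch {
    /**
     * Sums the records, checkpointing after every checkpointInterval records and
     * simulating one failure right after the record at index failAfter is processed.
     */
    static long sumWithOneFailure(int[] records, int checkpointInterval, int failAfter) {
        int offset = 0;             // current source position
        long sum = 0;               // operator state
        int checkpointedOffset = 0; // source position in the last completed checkpoint
        long checkpointedSum = 0;   // state in the last completed checkpoint
        boolean failed = false;

        while (offset < records.length) {
            sum += records[offset];
            offset++;
            if (offset % checkpointInterval == 0) {
                // A completed checkpoint captures offset and state together.
                checkpointedOffset = offset;
                checkpointedSum = sum;
            }
            if (!failed && offset - 1 == failAfter) {
                // Failure: drop progress made since the checkpoint, rewind the
                // source, and reset the state to the checkpointed value.
                failed = true;
                offset = checkpointedOffset;
                sum = checkpointedSum;
            }
        }
        return sum;
    }
}
```

For the records 1 through 7 with a checkpoint every 3 records and a failure after index 4, the result is 28, the same as without a failure: records processed after the last completed checkpoint are replayed rather than lost or counted twice.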
Restarting only the failed task would result in inconsistencies, and make it impossible to provide exactly-once semantics, unless the failed task had no dependencies on any upstream tasks, and no downstream tasks depended on it.
What Flink can do then is to restore the state and restart processing on the basis of failover regions, which take into account these dependencies within the job graph. In the case of a streaming job, only if the job is embarrassingly parallel is it possible to do less than a restore and restart of the entire job. So in the case of an embarrassingly parallel job, only the failed region is restored and restarted (which includes all of its subtasks from source to sink), while the other regions continue running.
This approach is used if jobmanager.execution.failover-strategy is set to region, which has been the default since Flink 1.10.
To learn more about this, see FLIP-1: Fine Grained Recovery from Task Failures and the Apache Flink 1.9.0 Release Announcement, where this feature was introduced.
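
As a rough illustration of why only embarrassingly parallel streaming jobs split into several failover regions, the sketch below models a region as a connected component of subtasks linked by pipelined data exchanges. The class FailoverRegions, the union-find approach, and the three-operator source -> map -> sink topology are assumptions made for this example; Flink's actual region computation is more involved.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified model: a region is a connected component of pipelined subtasks. */
final class FailoverRegions {
    private final int[] parent;

    FailoverRegions(int subtaskCount) {
        parent = new int[subtaskCount];
        for (int i = 0; i < subtaskCount; i++) {
            parent[i] = i;
        }
    }

    private int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    /** Records a pipelined connection between two subtasks. */
    void connect(int a, int b) {
        parent[find(a)] = find(b);
    }

    int regionCount() {
        Set<Integer> roots = new HashSet<>();
        for (int i = 0; i < parent.length; i++) {
            roots.add(find(i));
        }
        return roots.size();
    }

    /** source -> map -> sink with parallelism p; the second edge is forward or all-to-all. */
    static int regionsFor(int p, boolean keyedSecondEdge) {
        // Subtask ids: source i is i, map i is p + i, sink i is 2p + i.
        FailoverRegions graph = new FailoverRegions(3 * p);
        for (int i = 0; i < p; i++) {
            graph.connect(i, p + i); // forward edge source[i] -> map[i]
            if (keyedSecondEdge) {
                for (int j = 0; j < p; j++) {
                    graph.connect(p + i, 2 * p + j); // keyBy: every map feeds every sink
                }
            } else {
                graph.connect(p + i, 2 * p + i); // forward edge map[i] -> sink[i]
            }
        }
        return graph.regionCount();
    }
}
```

With parallelism 3, regionsFor(3, false) yields 3 regions (one per source-to-sink pipeline), while regionsFor(3, true) yields a single region, so a failure anywhere restarts the whole job.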

Use of flink/kubernetes to replace etl jobs (on ssis) : one flink cluster per jobtype or create and destroy flink cluster per job execution

I am trying to assess the feasibility of replacing hundreds of feed file ETL jobs created using SSIS packages with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one flink cluster for one type of job".
Since I have a handful of jobs per day of each job type, does this mean the best way for me is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources? I am setting up the Flink cluster without a job manager.
Any suggestions on best practices for using Flink for batch ETL activities?
Maybe the most important question: is Flink the right solution for this problem, or should I look more into Talend and other classic ETL tools?
Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:
Session cluster
A session cluster allows you to run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.
Benefits:
No additional cluster deployment time needed when submitting jobs => Faster job submissions
Better resource utilization if individual jobs don't need many resources
One place to control all your jobs
Downsides:
No strict isolation between jobs
Failures caused by job A can cause job B to restart
Job A runs in the same JVM as job B and hence can influence it if statics are used
Per-job cluster
A per-job cluster starts a dedicated Flink cluster for every job.
Benefits:
Strict job isolation
More predictable resource consumption since only a single job runs on the TaskExecutors
Downsides:
Cluster deployment time is part of the job submission time, resulting in longer submission times
Not a single cluster which controls all your jobs
Recommendation
So if you have many short-lived ETL jobs which require a fast response, then I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, then this additional time carries little weight and I would choose the per-job mode, which gives you more predictable runtime behaviour because of strict job isolation.

Apache Flink - How Checkpoint/Savepoint works If we run duplicate jobs (Multi Tenancy)

I have multiple Kafka topics (multi-tenancy) and I run the same job multiple times, once per topic, with each job consuming messages from one topic. I have configured the file system as the state backend.
Assume there are 3 jobs running. How do checkpoints work here? Do all 3 jobs store their checkpoint information in the same path? If one of the jobs fails, how does it know where to recover its checkpoint information from? We give a job name when submitting a job to the Flink cluster. Does that have anything to do with it? In general, how does Flink tell the jobs and their checkpoint information apart when restoring after a failure or a manual restart (regardless of whether the jobs are the same or different)?
Case1: What happens in case of job failure?
Case2: What happens if we manually restart the job?
Thank you
To follow on from what @ShemTov was saying:
Each job will write its checkpoints in a sub-dir named with its jobId.
If you manually cancel a job, the checkpoints are deleted (since they are no longer needed for recovery), unless they have been configured to be retained:
// Keep checkpoints when the job is cancelled so they can be used for a manual restart.
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
Retained checkpoints can be used for manually restarting, and for rescaling.
Docs on retained checkpoints.
If you have high availability configured, the job manager's metadata about checkpoints will be stored in the HA store, so that recovery does not depend on the job manager's survival.
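
To make the per-job sub-directory point concrete, here is a small sketch of the resulting layout, assuming the form <checkpoint dir>/<jobId>/chk-<n>. CheckpointLayout is a hypothetical helper for illustration only; Flink builds these paths itself, and the base directory and job IDs passed in are placeholders.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrates the directory layout only; Flink creates these paths itself. */
final class CheckpointLayout {
    /** Returns baseDir/jobId/chk-checkpointId. */
    static Path checkpointPath(String baseDir, String jobId, long checkpointId) {
        // The job ID level is what keeps several jobs under one base directory apart.
        return Paths.get(baseDir, jobId, "chk-" + checkpointId);
    }
}
```

Because the job ID is part of the path, three copies of the same job pointed at the same base directory still write to three different sub-directories, even when their checkpoint numbers coincide.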
The JobManager is aware of each job's checkpoints and keeps that metadata. Checkpoints are saved to the checkpoint directory (configured via flink-conf.yaml); under this directory, Flink creates a sub-directory with a hash-like name for each job.
Case 1: The job will restart (depending on your restart strategy), and if checkpointing is enabled it will read the last checkpoint.
Case 2: I'm not 100% sure, but I think that if you cancel the job manually and then submit it again, it won't read the checkpoint. You'll need to use a savepoint. (You can stop your job with a savepoint, and then submit the job again from that savepoint.) Just be sure that every operator has a UID. You can read more about savepoints here: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html
