Flink's failure recovery process - apache-flink

I want to understand Flink's failure recovery process in detail. For standalone mode I can guess some of the steps. For example, when a TaskManager fails, the failure is detected first, all tasks stop processing, and the tasks are redeployed. The checkpoint is then downloaded from HDFS and each operator loads its state. Once loading is complete, the source resumes sending data. Am I right? Does anyone know the correct, detailed recovery process?

Flink recovers from failures through checkpoints. Checkpoints can be stored locally, in S3, or in HDFS. On restore, the state of every operator is recovered from the checkpoint.
The details of the recovery process depend on your state backend. If you are using RocksDB:
your checkpoints can be incremental
you can use the checkpoint data like a savepoint, as long as you do not need to change the backend. This means you can change the parallelism when restoring from the checkpoint.

Related

What is the restoring mechanism of the Flink on K8S when rollingUpdate is executed for update strategy?

I am wondering how restoring from a checkpoint or savepoint works in Flink when a job is restarted by a rolling update on k8s.
Let me explain with a simple example.
Assume that my Flink job runs on 4 pods in k8s and has the following simple dataflow with parallelism 1.
source -> filter -> map -> sink
Each pod is responsible for one operator, and data is consumed through the source function. Since I don't want to lose data, I configured the dataflow for at-least-once or exactly-once mode in Flink.
When a rolling update occurs, the pods are restarted one after another. Suppose that filter runs on pod1, map on pod2, sink on pod3, and source on pod4. When pod1 (filter) is restarted by the rolling update, are the records in the source task (or any other task) saved immediately to external storage as a checkpoint, so that they can be restored without data loss after the restart?
I am also wondering whether the data in the map task (pod2) is persisted to external storage when the rolling update happens, even though the checkpoint has not finished.
That is, when the rolling update happens, Flink is in the middle of processing records and the checkpoint is not complete. In that case, is the data currently being processed in the task lost?
I need more clarification on how data is restored when Flink uses checkpoints on k8s and is updated with the rolling strategy.
Flink doesn't support rolling upgrades. If one of the pods where a Flink application is currently running becomes unavailable, the Flink application will usually restart.
The answer from David at "Is the whole job restarted if one task fails" explains this in more detail.
I would also recommend looking at the current documentation for Task Failure Recovery at https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/task_failure_recovery/ and the checkpointing/savepointing documentation that's also listed there.

Is the whole job restarted if one task fails

I have a job with stateful operators and checkpointing enabled. One of the tasks of a stateful operator fails for some reason and has to be restarted and recover its checkpointed state.
Which of the following is the restart behavior?
only the failed task is restarted and restored
all tasks of the operator that contains the failed task are restarted and restored
the whole job is restarted and restored
Is the whole job restarted if one task fails?
tldr: For streaming jobs the answer is usually yes, but not necessarily.
Recovery of a Flink streaming job involves rewinding the sources to the offsets recorded in a checkpoint, and resetting the state back to what it had been after having consumed only the data up to those offsets.
Restarting only the failed task would result in inconsistencies, and make it impossible to provide exactly-once semantics, unless the failed task had no dependencies on any upstream tasks, and no downstream tasks depended on it.
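The rewind-and-reset behavior described above can be illustrated with a toy model. This is only a sketch in plain Java, not Flink code: the class `RewindSketch`, the method `runWithFailure`, and the single "running sum" operator are all hypothetical names made up for the illustration. It shows that resetting operator state without also rewinding the source loses records, while restoring both together gives the same result as a failure-free run.

```java
import java.util.Arrays;
import java.util.List;

// Toy model, not Flink code: one source reading from a list and one
// downstream operator whose state is a running sum of the records seen.
final class RewindSketch {

    // Processes the input, takes one checkpoint before reading the record at
    // index checkpointAt, and fails once before reading the record at failAt.
    // On failure the operator state is always reset to the checkpointed sum;
    // the source offset is rewound only if rewindSource is true.
    static long runWithFailure(List<Integer> input, int checkpointAt, int failAt, boolean rewindSource) {
        int offset = 0;
        long sum = 0;
        int checkpointedOffset = 0;
        long checkpointedSum = 0;
        boolean failed = false;

        while (offset < input.size()) {
            if (!failed && offset == checkpointAt) {
                // Checkpoint: source offset and operator state are recorded together.
                checkpointedOffset = offset;
                checkpointedSum = sum;
            }
            if (!failed && offset == failAt) {
                failed = true;
                sum = checkpointedSum;                           // reset operator state
                if (rewindSource) offset = checkpointedOffset;   // rewind the source
                continue;
            }
            sum += input.get(offset);
            offset++;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(runWithFailure(input, 2, 4, true));  // prints 15
        System.out.println(runWithFailure(input, 2, 4, false)); // prints 8
    }
}
```

With the source rewound, the records between the checkpoint and the failure are replayed and the sum matches the failure-free result of 15. Without the rewind, the records at offsets 2 and 3 are missing from the restored state, which is the kind of inconsistency that rules out restarting a task in isolation.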
What Flink can do then is to restore the state and restart processing on the basis of failover regions, which take into account these dependencies within the job graph. In the case of a streaming job, only if the job is embarrassingly parallel is it possible to do less than a restore and restart of the entire job. So in the case of an embarrassingly parallel job, only the failed region is restored and restarted (which includes all of its subtasks from source to sink), while the other regions continue running.
This approach is used if jobmanager.execution.failover-strategy is set to region, which has been the default since Flink 1.10.
To learn more about this, see FLIP-1: Fine Grained Recovery from Task Failures and the Apache Flink 1.9.0 Release Announcement, where this feature was introduced.
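As a rough illustration of the failover-region idea, the sketch below treats a region as a connected component of tasks linked by pipelined data exchanges. It is a simplification written for this thread, not Flink's actual implementation: `FailoverRegionSketch`, `tasksToRestart`, and the integer task numbering are hypothetical, and real region computation works on Flink's execution graph.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model, not Flink code: tasks are numbered 0..taskCount-1 and each edge
// {a, b} stands for a pipelined data exchange between two tasks.
final class FailoverRegionSketch {

    // Union-find lookup with path halving.
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Returns every task in the same connected component as the failed task.
    static Set<Integer> tasksToRestart(int taskCount, int[][] pipelinedEdges, int failedTask) {
        int[] parent = new int[taskCount];
        for (int i = 0; i < taskCount; i++) parent[i] = i;
        for (int[] edge : pipelinedEdges) {
            parent[find(parent, edge[0])] = find(parent, edge[1]);
        }
        int root = find(parent, failedTask);
        Set<Integer> region = new TreeSet<>();
        for (int i = 0; i < taskCount; i++) {
            if (find(parent, i) == root) region.add(i);
        }
        return region;
    }

    public static void main(String[] args) {
        // Parallelism 2: sources are tasks 0 and 1, maps 2 and 3, sinks 4 and 5.
        // Embarrassingly parallel: each subtask only talks to its own pipeline.
        int[][] forwardOnly = {{0, 2}, {2, 4}, {1, 3}, {3, 5}};
        System.out.println(tasksToRestart(6, forwardOnly, 3)); // prints [1, 3, 5]

        // An all-to-all exchange (for example a keyBy) between map and sink
        // connects everything, so one failure restarts every task.
        int[][] withShuffle = {{0, 2}, {1, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}};
        System.out.println(tasksToRestart(6, withShuffle, 3)); // prints [0, 1, 2, 3, 4, 5]
    }
}
```

In the first case only the pipeline containing the failed map subtask is restarted, from source to sink, while the other pipeline keeps running. In the second case the shuffle makes the whole job a single region.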

Apache Flink - How Checkpoint/Savepoint works If we run duplicate jobs (Multi Tenancy)

I have multiple Kafka topics (multi-tenancy), and I run the same job multiple times, once per topic, with each job consuming messages from one topic. I have configured the file system as the state backend.
Assume there are 3 jobs running. How do checkpoints work here? Do all 3 jobs store their checkpoint information in the same path? If one of the jobs fails, how does it know where to recover its checkpoint information from? We give a job name when submitting a job to the Flink cluster. Does that have anything to do with it? In general, how does Flink tell jobs and their checkpoint information apart when restoring after a failure or a manual restart (irrespective of whether the jobs are the same or different)?
Case 1: What happens in case of a job failure?
Case 2: What happens if we manually restart the job?
Thank you
To follow on from what @ShemTov was saying:
Each job will write its checkpoints in a sub-dir named with its jobId.
If you manually cancel a job, the checkpoints are deleted (since they are no longer needed for recovery), unless they have been configured to be retained:
// Keep checkpoints when the job is cancelled instead of deleting them.
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
Retained checkpoints can be used for manually restarting, and for rescaling.
Docs on retained checkpoints.
If you have high availability configured, the job manager's metadata about checkpoints will be stored in the HA store, so that recovery does not depend on the job manager's survival.
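To make the per-job sub-directory point concrete, here is a small sketch of how checkpoint paths stay separate when several jobs share one base directory. It assumes the `<base>/<job-id>/chk-<n>` layout described in the Flink checkpoint documentation; the helper `checkpointDir`, the class name, the base directory, and the job IDs are hypothetical and only serve the illustration.

```java
// Toy helper, not a Flink API: builds the directory a given checkpoint of a
// given job would land in, assuming the <base>/<job-id>/chk-<n> layout.
final class CheckpointPathSketch {

    static String checkpointDir(String baseDir, String jobId, long checkpointId) {
        // Drop a trailing slash so the result does not contain "//".
        String base = baseDir.endsWith("/") ? baseDir.substring(0, baseDir.length() - 1) : baseDir;
        return base + "/" + jobId + "/chk-" + checkpointId;
    }

    public static void main(String[] args) {
        String base = "hdfs:///flink/checkpoints"; // hypothetical base directory
        // Hypothetical job IDs for three runs of the same job, one per topic.
        String[] jobIds = {"job-id-topic-a", "job-id-topic-b", "job-id-topic-c"};
        for (String jobId : jobIds) {
            System.out.println(checkpointDir(base, jobId, 12));
        }
    }
}
```

Because the job ID is part of the path, the three jobs never write into each other's directories, and the job name given at submission time plays no role in locating checkpoints.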
The JobManager is aware of each job's checkpoints and keeps that metadata. Checkpoints are saved to the checkpoint directory (configured via flink-conf.yaml), and under this directory it will create a randomly hashed directory for each checkpoint.
Case 1: The job will restart (depending on your Fallback Strategy...), and if checkpointing is enabled it will read the last checkpoint.
Case 2: I'm not 100% sure, but I think that if you cancel the job manually and then submit it again, it won't read the checkpoint. You'll need to use a savepoint. (You can stop your job with a savepoint and then submit it again from the same savepoint.) Just make sure that every operator has a UID. You can read more about savepoints here: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html

What happen to state in Flink Task Manager when crash?

May I know what happens to the state stored in a Flink Task Manager when that Task Manager crashes? Say the state backend is RocksDB: would that data be transferred to other running Task Managers so that the complete state is ready for data processing?
Flink does not (yet) support dynamic rescaling of state, so the failed task manager must be recovered, and the job will be restarted from a checkpoint.
Exactly what that involves depends on how your cluster is configured, and whether the job failed because of an exception or because the machine/container running the task manager failed.
If you are using RocksDB and local recovery is enabled, then if the job died because of an exception, the task managers will all be able to restart the job more-or-less immediately from their local copy of the state. On the other hand, if a new task manager has to be spun up, then once it is running it will fetch what it needs from the latest checkpoint (from whatever distributed file system is used) and then the job will resume.
Without local recovery, every task manager will have to fetch the relevant portions of the checkpoint from the DFS.
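The two cases above boil down to a simple decision per task manager, sketched below. This is a deliberately simplified model and not Flink's scheduling logic: `RestoreSourceSketch`, `chooseRestoreSource`, and both boolean parameters are hypothetical names, and the real behavior also depends on where tasks are rescheduled.

```java
// Toy model, not Flink code: where does a restarted task read its state from?
final class RestoreSourceSketch {

    enum Source { LOCAL_COPY, DISTRIBUTED_FS }

    // localRecoveryEnabled: assumed to mirror whether local recovery is turned on.
    // taskManagerSurvived: true if the job failed with an exception but the
    // task manager process (and its local state copy) is still there.
    static Source chooseRestoreSource(boolean localRecoveryEnabled, boolean taskManagerSurvived) {
        if (localRecoveryEnabled && taskManagerSurvived) {
            return Source.LOCAL_COPY;
        }
        // A freshly started task manager has no local copy, and without local
        // recovery nobody keeps one, so state comes from the checkpoint in the DFS.
        return Source.DISTRIBUTED_FS;
    }

    public static void main(String[] args) {
        System.out.println(chooseRestoreSource(true, true));   // prints LOCAL_COPY
        System.out.println(chooseRestoreSource(true, false));  // prints DISTRIBUTED_FS
        System.out.println(chooseRestoreSource(false, true));  // prints DISTRIBUTED_FS
    }
}
```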
In some cases it is possible to do something less expensive than a full recovery. See fine-grained recovery for details.

Does checkpoint block every other operation?

When SQL Server issues a checkpoint, does it block every other operation until the checkpoint is completed?
If I understand correctly, when a checkpoint occurs, the server should write all dirty pages.
When it's complete, it will write a checkpoint record to the transaction log, so in case of a failure it only has to process transactions from that point in time (or transactions that had already started at the time of the checkpoint).
How does SQL Server prevent a non-dirty page from becoming dirty while the checkpoint is in progress?
Does it block all writes until the checkpoint is completed?
Checkpoints do not block writes.
A checkpoint has a start and an end LSN. It guarantees that all pages on disk are at least at the start LSN of the checkpoint. It does not matter if any page is at a later LSN (because it has been written to after the checkpoint has started).
The checkpoint only guarantees a minimum LSN for all pages on disk. It does not guarantee an exact LSN.
This makes sense because you can delete all transaction log records which contain information from LSNs which are earlier than the checkpoint start LSN. That is the purpose of a checkpoint: Allow parts of the log to become inactive.
Checkpoints are not needed for data consistency and correctness. They just free log space and shorten recovery times.
"when a checkpoint occurs, the server should write all dirty pages"
And that's what it does. However, the guarantee given by a checkpoint is that it writes all the pages that were dirty at the instant the checkpoint started. Any page that became dirty while the checkpoint was in progress may or may not be written; it is not guaranteed to be written. What this guarantee offers is an optimization: physical recovery can start REDO from the last checkpoint, since everything in the log prior to it has already been applied to the data pages (and does not have to be redone). This is even described on the Wikipedia page for ARIES:
"The naive way for checkpointing involves locking the whole database to avoid changes to the DPT and the TT during the creation of the checkpoint. Fuzzy logging circumvents that by writing two log records. One Fuzzy Log Starts Here record and, after preparing the checkpoint data, the actual checkpoint. Between the two records other log records can be created"
usr's answer explains how this is achieved (by using a checkpoint start LSN and end LSN).
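The "minimum LSN, not exact LSN" guarantee from both answers can be shown with a small simulation. This is a toy model written in Java for illustration, not SQL Server internals: the class `FuzzyCheckpointSketch` and all of its methods are hypothetical, pages are plain integers, and a "flush" simply copies a page's in-memory LSN to disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of a fuzzy checkpoint: writers are never blocked, and the
// checkpoint only promises that changes up to its start LSN are on disk.
final class FuzzyCheckpointSketch {
    private final Map<Integer, Long> memoryLsn = new HashMap<>(); // page -> LSN of latest in-memory change
    private final Map<Integer, Long> diskLsn = new HashMap<>();   // page -> LSN of the version on disk
    private final List<long[]> log = new ArrayList<>();           // log records as {lsn, page}
    private long lastLsn = 0;

    // Modifies a page in memory and appends a log record for the change.
    long write(int page) {
        long lsn = ++lastLsn;
        memoryLsn.put(page, lsn);
        log.add(new long[] {lsn, page});
        return lsn;
    }

    long lastLsn() { return lastLsn; }

    // Pages whose in-memory version is newer than the version on disk.
    Set<Integer> dirtyPages() {
        Set<Integer> dirty = new TreeSet<>();
        for (Map.Entry<Integer, Long> e : memoryLsn.entrySet()) {
            if (e.getValue() > diskLsn.getOrDefault(e.getKey(), 0L)) dirty.add(e.getKey());
        }
        return dirty;
    }

    // Writes the current in-memory version of the page to disk.
    void flush(int page) { diskLsn.put(page, memoryLsn.get(page)); }

    // True if every logged change with an LSN at or before the given LSN is on disk.
    boolean diskCoversLogUpTo(long lsn) {
        for (long[] rec : log) {
            if (rec[0] <= lsn && diskLsn.getOrDefault((int) rec[1], 0L) < rec[0]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        FuzzyCheckpointSketch db = new FuzzyCheckpointSketch();
        db.write(1);                                // LSN 1
        db.write(2);                                // LSN 2
        long startLsn = db.lastLsn();               // checkpoint starts at LSN 2
        Set<Integer> toFlush = db.dirtyPages();     // pages 1 and 2 were dirty at the start
        db.flush(1);
        db.write(3);                                // LSN 3, concurrent write during the checkpoint
        db.write(2);                                // LSN 4, page 2 changes again before its flush
        db.flush(2);                                // page 2 reaches disk at LSN 4, later than the start LSN
        System.out.println(toFlush);                           // prints [1, 2]
        System.out.println(db.diskCoversLogUpTo(startLsn));    // prints true
        System.out.println(db.dirtyPages());                   // prints [3]
    }
}
```

Page 3 became dirty after the checkpoint started and is still dirty when it ends, which is allowed. Everything logged up to the start LSN is on disk, so in this model redo would only need log records after that point.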

Resources