How to achieve fault tolerance (recovery) with TaskManagers in Apache Flink?

Recovery of the JobManager is achieved using ZooKeeper, but what if a TaskManager fails? How do you recover from this? Does the JobManager automatically recover TaskManagers?

In general, the JobManager takes care of recovering from TaskManager failures. How this is done depends on your setup.
If you run Flink on YARN, the JobManager will start a new TaskManager when it realizes that a TaskManager has died, and reassign the tasks.
If you run Flink stand-alone on a cluster, you have to make sure you have one (or more) standby TaskManager(s) running. The JobManager will assign the tasks of the failed TM to a standby TM. This also means that you have to ensure that enough standby TMs are up and running.
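In a standalone cluster such standby TMs are just ordinary TaskManager processes that register with the JobManager and sit idle until they are needed. As a rough sketch (the hostname and slot count are placeholders), the flink-conf.yaml on an extra TaskManager machine only needs to point at the JobManager:

jobmanager.rpc.address: jobmanager-host
jobmanager.rpc.port: 6123
taskmanager.numberOfTaskSlots: 4

Any slots not used by running jobs are then available for the JobManager to reschedule tasks onto after a TM failure.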


Fault Tolerance in Flink

How can we configure a Flink application to start/restart only the pods/(sub)tasks that crashed, instead of restarting the whole job, i.e. restarting all the tasks/sub-tasks in the pipeline, including the ones that are healthy? It does not make sense, and feels unnecessary, to restart the healthy tasks along with the crashed ones. The stream processing application processes messages from Kafka and writes the output back to Kafka; it runs on Flink 1.13.5 with Kubernetes as the resource manager, using Lyft's Kubernetes operator to schedule and run the Flink job. We tried setting the property **jobmanager.execution.failover-strategy** to **region**, but it did not help.
Flink only supports partial restarts to the extent that this is possible without sacrificing completely correct, exactly-once results.
After recovery, failed tasks are restarted from the latest checkpoint. Their inputs are rewound, and they will reproduce previously emitted results. If healthy downstream consumers of those failed tasks aren't also reset and restarted from that same checkpoint, then they will end up producing duplicate/inflated results.
With streaming jobs, only embarrassingly parallel pipelines will give you disjoint pipelined regions. Any use of keyBy or rebalancing (e.g., to change the parallelism) will produce a job with a single failure region.
Restart Pipelined Region Failover Strategy.
This strategy groups tasks into disjoint regions. When a task failure is detected, this strategy computes the smallest set of regions that must be restarted to recover from the failure. For some jobs this can result in fewer tasks that will be restarted compared to the Restart All Failover Strategy.
Refer to https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
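For reference, a minimal flink-conf.yaml sketch for region-based failover could look like this (the restart-strategy values are placeholders; some restart strategy must be configured for any automatic restart to happen at all):

jobmanager.execution.failover-strategy: region
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

As explained above, though, this only reduces the restart scope if the job graph actually contains more than one pipelined region.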
Another failover strategy, approximate task-local recovery, is in progress; see https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable.

Is the whole job restarted if one task fails

I have a job that has stateful operators and also has checkpointing enabled. One of the tasks of the stateful operator fails for some reason and has to be restarted and recover its checkpointed state.
I would like to ask which of the following is the restart behavior:
only the failed task is restarted and restored
all of the operator(contain failed task)'s tasks are restarted and restored
the whole job is restarted and restored
Is the whole job restarted if one task fails?
tldr: For streaming jobs the answer is usually yes, but not necessarily.
Recovery of a Flink streaming job involves rewinding the sources to the offsets recorded in a checkpoint, and resetting the state back to what it had been after having consumed only the data up to those offsets.
Restarting only the failed task would result in inconsistencies, and make it impossible to provide exactly-once semantics, unless the failed task had no dependencies on any upstream tasks, and no downstream tasks depended on it.
What Flink can do then is to restore the state and restart processing on the basis of failover regions, which take into account these dependencies within the job graph. In the case of a streaming job, only if the job is embarrassingly parallel is it possible to do less than a restore and restart of the entire job. So in the case of an embarrassingly parallel job, only the failed region is restored and restarted (which includes all of its subtasks from source to sink), while the other regions continue running.
This approach is used if jobmanager.execution.failover-strategy is set to region, which has been the default since Flink 1.10.
To learn more about this, see FLIP-1: Fine Grained Recovery from Task Failures and the Apache Flink 1.9.0 Release Announcement, where this feature was introduced.
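Recovery from the latest checkpoint of course presupposes that checkpointing is enabled. A minimal flink-conf.yaml sketch (the interval, backend, and directory are placeholders) might be:

execution.checkpointing.interval: 1 min
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: filesystem
state.checkpoints.dir: file:///opt/flink/checkpoints
jobmanager.execution.failover-strategy: region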

Flink Failure Recovery: what if JobManager or TaskManager failed

I'm reading the Flink official doc about Task Failure Recovery: https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html
As I understand it, this doc tells us that if some task fails for some reason, Flink is able to recover it with the help of the checkpoint mechanism.
So now I have two more questions:
What if a TaskManager fails? As I understand it, a task is assigned to one or more slots, and slots are located on one or more TaskManagers. After reading the doc above, I know that Flink can recover a failed task, but if a TaskManager fails, what happens? Can Flink recover it too? If a failed TaskManager can be recovered, will the tasks assigned to it continue running automatically after it is recovered?
What if the JobManager fails? If the JobManager fails, will all of the TaskManagers fail too? If so, when I recover the JobManager with the help of ZooKeeper, will all of the TaskManagers and their tasks continue running automatically?
In a purely standalone cluster, if a Task Manager dies and you have a standby Task Manager running, it will be used. Otherwise the Job Manager will wait for a new Task Manager to magically appear; making that happen is up to you. On the other hand, if you are using YARN, Mesos, or Kubernetes, the cluster management framework will take care of making sure there are enough TMs.
As for Job Manager failures, in a standalone cluster you should run standby Job Managers, and configure Zookeeper to do leader election. With YARN, Mesos, and Kubernetes, you can let the cluster framework handle restarting the Job Manager, or run standbys, as you prefer, but in either case you will still need Zookeeper to provide HA storage for the Job Manager's metadata.
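As a sketch, the ZooKeeper-based HA entries in flink-conf.yaml look roughly like this (the quorum hosts, storage directory, and cluster id are placeholders):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.cluster-id: /my-flink-cluster

Note that ZooKeeper itself only stores pointers to the JobManager metadata; high-availability.storageDir must point to durable storage (e.g., HDFS or S3) that survives JobManager failures.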
Task Managers can survive a Job Manager failure/recovery situation. The jobs don't have to be restarted.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html.

In Apache Flink, what is the difference between the Job Manager and the Job Master?

In Apache Flink (e.g. v1.8), what is the difference between the Job Manager and the Job Master?
Job Manager and Job Master seem to be used analogously in the logs.
Thanks!
The JobManager is composed of mainly 3 components:
Dispatcher - accepts job submissions and starts a new JobMaster for each submitted job
ResourceManager - allocates the required resources (task slots) for the job
JobMaster - supervises and coordinates the execution of the tasks of a single Flink job
So, the JobMaster is part of the JobManager. As per the docs, a single JobManager is assigned to each individual Flink application, which can contain multiple Flink jobs.
For example, a Flink application with 2 jobs will instantiate 1 JobManager but will contain 2 JobMasters.
JobManager and JobMaster have different roles.
For the JobManager, according to the JobManager Data Structures section of the documentation:
During job execution, the JobManager keeps track of distributed tasks, decides when to schedule the next task (or set of tasks), and reacts to finished tasks or execution failures.
The JobManager receives the JobGraph, which is a representation of the data flow consisting of operators (JobVertex) and intermediate results (IntermediateDataSet). Each operator has properties, like the parallelism and the code that it executes. In addition, the JobGraph has a set of attached libraries, that are necessary to execute the code of the operators.
The role of the JobMaster is more limited according to the Javadoc:
JobMaster implementation. The job master is responsible for the execution of a single JobGraph.

Recovery with a single JobManager

I am attempting to recover my jobs and state when my job manager goes down and I haven't been able to restart my jobs successfully.
From my understanding, TaskManager recovery is aided by the JobManager (this works as expected) and JobManager recovery is completed through Zookeeper.
I am wondering if there is a way to recover the jobmanager without zookeeper?
I am using docker for my setup and all checkpoints & savepoints are persisted to mapped volumes.
Is flink able to recover when all job managers go down? I can afford to wait for the single JobManager to restart.
When I restart the jobmanager I get the following exception: org.apache.flink.runtime.rest.NotFoundException: Job 446f4392adc32f8e7ba405a474b49e32 not found
I have set the following in my flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: file:///opt/flink/checkpoints
state.savepoints.dir: file:///opt/flink/savepoints
I think my issue may be that the JAR gets deleted when the job manager is restarted, but I am not sure how to solve this.
At the moment, Flink only supports recovering from a JobManager fault if you are using ZooKeeper. However, theoretically you can also make it work without ZooKeeper if you can guarantee that there is only ever a single JobManager running. See this answer for more information.
You can check out running your cluster as a "Flink Job Cluster". This will automatically start the job that you baked into the docker image when the container comes up. You can read more here.
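As a rough sketch, such a Job Cluster could be run with docker-compose roughly as follows (the image tag, job class name, and volume paths are placeholders, and the job JAR is assumed to be baked into the image under /opt/flink/usrlib):

version: "2.2"
services:
  jobmanager:
    image: flink:1.13.5-scala_2.12
    command: standalone-job --job-classname com.example.MyStreamingJob
    volumes:
      - ./checkpoints:/opt/flink/checkpoints
      - ./savepoints:/opt/flink/savepoints
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
  taskmanager:
    image: flink:1.13.5-scala_2.12
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

When the jobmanager container restarts, the baked-in job is resubmitted automatically; to resume from where it left off rather than from scratch, it still needs to be pointed at a retained checkpoint or savepoint on the mapped volume (e.g., via --fromSavepoint).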
