Fault Tolerance in Flink

How can we configure a Flink application to start/restart only the pods/(sub)tasks that crashed, instead of restarting the whole job, i.e. restarting all the tasks/sub-tasks in the pipeline, including the ones that are healthy? It seems unnecessary to restart the healthy tasks along with the crashed ones. The stream processing application consumes messages from Kafka and writes its output back to Kafka; it runs on Flink 1.13.5 with a Kubernetes resource manager, using Lyft's Kubernetes operator to schedule and run the Flink job. We tried setting the property **jobmanager.execution.failover-strategy** to **region**, but it did not help.

Flink only supports partial restarts to the extent that this is possible without sacrificing completely correct, exactly-once results.
After recovery, failed tasks are restarted from the latest checkpoint. Their inputs are rewound, and they will reproduce previously emitted results. If healthy downstream consumers of those failed tasks aren't also reset and restarted from that same checkpoint, then they will end up producing duplicate/inflated results.
With streaming jobs, only embarrassingly parallel pipelines have disjoint pipelined regions. Any use of keyBy or rebalancing (e.g., to change the parallelism) produces a job with a single failure region.
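For illustration, here is a minimal sketch of such a Kafka-to-Kafka job (the broker address, topics, and checkpoint interval are placeholders). Because its operators are connected only by forward edges, each parallel slice of the pipeline forms its own failover region; the commented-out keyBy shows the kind of change that collapses the job into a single region.

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class FailoverRegionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder broker

        DataStream<String> source =
            env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // Only forward connections here: with jobmanager.execution.failover-strategy: region,
        // each parallel slice (source subtask -> filter subtask -> sink subtask) is its own
        // failover region and can be restarted without touching the other slices.
        source.filter(s -> !s.isEmpty())
              .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props));

        // A keyBy (or a rebalance / change of parallelism) introduces an all-to-all connection,
        // merging everything into a single region, so any task failure restarts the whole job:
        // source.keyBy(s -> s).map(...).addSink(...);

        env.execute("failover region sketch");
    }
}
```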

Restart Pipelined Region Failover Strategy
This strategy groups tasks into disjoint regions. When a task failure is detected, this strategy computes the smallest set of regions that must be restarted to recover from the failure. For some jobs this can result in fewer tasks being restarted compared to the Restart All Failover Strategy.
Refer to https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
Another failover strategy, approximate task-local recovery, is in progress; see https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable.

Related

What is the restore mechanism of Flink on K8s when a rollingUpdate is executed as the update strategy?

I am wondering about the checkpoint/savepoint restore procedure in Flink when a job is restarted by a rolling update on k8s.
Let me explain with a simple example.
Assume that I have 4 pods in my Flink k8s job and the following simple dataflow with parallelism 1:
source -> filter -> map -> sink
Each pod is responsible for one operator, and data is consumed through the source function. Since I don't want to lose data, I set up my dataflow in at-least-once or exactly-once mode in Flink.
When a rolling update occurs, each pod is restarted sequentially. Suppose that filter runs on pod1, map on pod2, sink on pod3 and source on pod4. When pod1 (filter) is restarted by the rolling update, are the records in the source task (the other task) immediately saved to external storage as a checkpoint, so that everything can be restored without data loss after the restart?
I am also wondering whether the data in the map task (pod2) is persisted to external storage when the rolling update happens, even though the checkpoint has not finished.
That is, if the rolling update happens while Flink is still processing records and the checkpoint is not yet complete, is the data currently being processed in the task lost?
I need more clarification on how data is restored when checkpoints are used and the Flink job on k8s is updated with a rolling strategy.
Flink doesn't support rolling upgrades. If one of the pods where your Flink application is currently running becomes unavailable, the Flink application will usually restart.
The answer from David at Is the whole job restarted if one task fails explains this in more detail.
I would also recommend looking at the current documentation for Task Failure Recovery at https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/task_failure_recovery/ and the checkpointing/savepointing documentation that's also listed there.
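As a minimal sketch of the checkpointing setup such a job would rely on (the interval, mode, and storage path are placeholders, using the DataStream APIs available around Flink 1.13), enabling exactly-once checkpointing with externalized checkpoints on durable storage is what makes it possible to restore the job after its pods are replaced:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 seconds with exactly-once guarantees.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Store checkpoints on durable, pod-independent storage (placeholder path).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // Keep the latest checkpoint even if the job is cancelled, so a redeployed
        // job can be restored from it (or from a savepoint taken before the update).
        env.getCheckpointConfig()
           .enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        env.fromElements("a", "b", "c").print(); // placeholder pipeline
        env.execute("checkpoint setup sketch");
    }
}
```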

Is the whole job restarted if one task fails

I have a job that has stateful operators and also has checkpointing enabled. One of the tasks of the stateful operator fails for some reason and has to be restarted and recover the checkpointed state.
I would like to ask which of the following is the restart behavior:
only the failed task is restarted and restored
all of the tasks of the operator containing the failed task are restarted and restored
the whole job is restarted and restored
Is the whole job restarted if one task fails?
tldr: For streaming jobs the answer is usually yes, but not necessarily.
Recovery of a Flink streaming job involves rewinding the sources to the offsets recorded in a checkpoint, and resetting the state back to what it had been after having consumed only the data up to those offsets.
Restarting only the failed task would result in inconsistencies, and make it impossible to provide exactly-once semantics, unless the failed task had no dependencies on any upstream tasks, and no downstream tasks depended on it.
What Flink can do then is to restore the state and restart processing on the basis of failover regions, which take into account these dependencies within the job graph. In the case of a streaming job, only if the job is embarrassingly parallel is it possible to do less than a restore and restart of the entire job. So in the case of an embarrassingly parallel job, only the failed region is restored and restarted (which includes all of its subtasks from source to sink), while the other regions continue running.
This approach is used if jobmanager.execution.failover-strategy is set to region, which has been the default since Flink 1.10.
To learn more about this, see FLIP-1: Fine Grained Recovery from Task Failures and the Apache Flink 1.9.0 Release Announcement, where this feature was introduced.

Use of Flink/Kubernetes to replace ETL jobs (on SSIS): one Flink cluster per job type, or create and destroy a Flink cluster per job execution?

I am trying to assess the feasibility of replacing hundreds of feed-file ETL jobs created using SSIS packages with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one Flink cluster for one type of job".
Since I have a handful of jobs per day of each job type, does this mean the best approach is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources? Is that the correct way to do it? I am setting up the Flink cluster without a job manager.
Any suggestions on best practices for using Flink for batch ETL activities?
Maybe the most important question: is Flink the correct solution for this problem, or should I look more into Talend and other classic ETL tools?
Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:
Session cluster
A session cluster allows you to run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.
Benefits:
No additional cluster deployment time needed when submitting jobs => Faster job submissions
Better resource utilization if individual jobs don't need many resources
One place to control all your jobs
Downsides:
No strict isolation between jobs
Failures caused by job A can cause job B to restart
Job A runs in the same JVM as job B and hence can influence it if statics are used
Per-job cluster
A per-job cluster starts a dedicated Flink cluster for every job.
Benefits:
Strict job isolation
More predictable resource consumption since only a single job runs on the TaskExecutors
Downsides:
Cluster deployment time is part of the job submission time, resulting in longer submission times
Not a single cluster which controls all your jobs
Recommendation
So if you have many short-lived ETL jobs which require a fast response, then I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, then this additional time will carry no weight, and I would choose the per-job mode, which gives you more predictable runtime behaviour because of strict job isolation.

How can I share state between my Flink jobs?

I run multiple jobs from my .jar file. I want to share state between my jobs, but every job consumes all the inputs (from Kafka) and generates duplicate output.
In my Flink dashboard I see that 'records sent' is 3 for all of the jobs. I think this number should be split across my jobs.
I create the job with this command:
bin/flink run app.jar
How can I fix it?
Because of its focus on scalability and high performance, Flink state is local. Flink doesn't really provide a mechanism for sharing state between jobs.
However, Flink does support splitting up a large job among a fleet of workers. A Flink cluster is able to run a single job in parallel, using the resources of one or many multi-core CPUs. Some Flink jobs are running on thousands of cores, just to give an idea of its scalability.
When used with Kafka, each Kafka partition can be read by a different subtask in Flink, and processed by its own parallel instance of the pipeline.
You might begin by running a single parallel instance of your job via
bin/flink run --parallelism <parallelism> app.jar
For this to succeed, your cluster will have to have at least as many free slots as the parallelism you request. The parallelism should be less than or equal to the number of partitions in the Kafka topic(s) being consumed. The Flink Kafka consumers will coordinate amongst themselves -- with each of them reading from one or more partitions.
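As a rough sketch (the broker address, topic, and group id are placeholders), a single job built like this can be submitted once with --parallelism, and the Kafka source will divide the topic's partitions among its parallel subtasks instead of every job re-reading all of the input:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ParallelKafkaJob {
    public static void main(String[] args) throws Exception {
        // Parallelism is deliberately not set in code, so it can be chosen at
        // submission time, e.g. bin/flink run --parallelism 3 app.jar
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder broker
        props.setProperty("group.id", "my-app");              // placeholder consumer group

        // Each parallel source subtask is assigned a subset of the topic's partitions,
        // so the input is split across subtasks rather than duplicated per job.
        env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("parallel kafka job");
    }
}
```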

What happens to the state in a Flink Task Manager when it crashes?

May I know what happens to the state stored in a Flink Task Manager when that Task Manager crashes? Say the state storage is RocksDB; would that data be transferred to another running Task Manager so that the complete state is available for data processing?
Flink does not (yet) support dynamic rescaling of state, so the failed task manager must be recovered, and the job will be restarted from a checkpoint.
Exactly what that involves depends on how your cluster is configured, and whether the job failed because of an exception or because the machine/container running the task manager failed.
If you are using RocksDB and local recovery is enabled, then if the job died because of an exception, the task managers will all be able to restart the job more-or-less immediately from their local copy of the state. On the other hand, if a new task manager has to be spun up, then once it is running it will fetch what it needs from the latest checkpoint (from whatever distributed file system is used) and then the job will resume.
Without local recovery, every task manager will have to fetch the relevant portions of the checkpoint from the DFS.
In some cases it is possible to do something less expensive than a full recovery. See fine-grained recovery for details.
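As a hedged sketch of what enabling the RocksDB state backend with local recovery looks like (in a real deployment, state.backend.local-recovery and the state backend are normally configured in flink-conf.yaml on the task managers; the checkpoint path is a placeholder and the flink-statebackend-rocksdb dependency is assumed to be on the classpath):

```java
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalRecoverySketch {
    public static void main(String[] args) throws Exception {
        // These settings normally live in flink-conf.yaml; they are shown here
        // programmatically only to illustrate which knobs are involved.
        Configuration conf = new Configuration();
        conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, true); // keep a local copy of state on each task manager

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // A durable checkpoint location on a DFS is still required; the local copy
        // only speeds up recovery when the same task manager survives the failure.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints"); // placeholder path
        env.enableCheckpointing(60_000);

        env.fromElements(1, 2, 3).print(); // placeholder pipeline
        env.execute("local recovery sketch");
    }
}
```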
