This question is basically similar to the one asked here: Apache Flink fault tolerance.
i.e. what happens if a job restarts between two checkpoints? Will it reprocess the records that were already processed after the last checkpoint?
Take for example: I have two jobs, job1 and job2. Job1 consumes records from Kafka, processes them, and produces them to a second Kafka topic. Job2 consumes from this second topic and processes the records (in my case it updates values in Aerospike using AerospikeClient).
Now, from the answer to the question Apache Flink fault tolerance, I can more or less believe that if job1 restarts, it will not produce duplicate records in the sink. I am using FlinkKafkaProducer011, which extends TwoPhaseCommitSinkFunction (https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html). Please explain how this prevents reprocessing (i.e. duplicate production of records to Kafka).
According to the Flink docs, Flink restarts a job from the last successful checkpoint. So if job2 restarts before completing a checkpoint, it will restart from the last checkpoint, and the records that were already processed after that checkpoint will be reprocessed (i.e. multiple updates in Aerospike).
Am I right, or is there something else in Flink (and Aerospike) that prevents this reprocessing in job2?
In such a scenario, Flink will indeed reprocess some events. During recovery the input partitions will have their offsets reset to the offsets in the most recent checkpoint, and events that had been read after that checkpoint will be re-ingested.
However, the FlinkKafkaProducer uses Kafka transactions that are committed when checkpoints are completed. When a job fails, whatever output it has produced since the last checkpoint is protected by transactions that are never committed. So long as that job's consumers are configured to use read_committed as their isolation.level, they won't see any duplicates.
For more details, see Best Practices for Using Kafka Sources/Sinks in Flink Jobs.
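For illustration, here is a minimal sketch of the two pieces this relies on, with made-up topic name, broker address, and group id (none of them come from the question): job1's transactional producer and job2's read_committed consumer.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class ExactlyOnceSketch {
    public static void main(String[] args) {
        // job1's sink: records are written inside a Kafka transaction that is
        // committed only when the Flink checkpoint that covers them completes.
        Properties producerProps = new Properties();
        producerProps.setProperty("bootstrap.servers", "kafka:9092");
        // Should exceed the maximum expected checkpoint interval plus recovery time,
        // otherwise the broker may abort an open transaction before Flink commits it.
        producerProps.setProperty("transaction.timeout.ms", "900000");

        FlinkKafkaProducer011<String> job1Sink = new FlinkKafkaProducer011<>(
                "second-topic",
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                producerProps,
                FlinkKafkaProducer011.Semantic.EXACTLY_ONCE);

        // job2's source: with read_committed, records that job1 wrote after its
        // last successful checkpoint (i.e. in a transaction that was never
        // committed) are invisible, so job2 never sees those duplicates.
        Properties consumerProps = new Properties();
        consumerProps.setProperty("bootstrap.servers", "kafka:9092");
        consumerProps.setProperty("group.id", "job2");
        consumerProps.setProperty("isolation.level", "read_committed");

        FlinkKafkaConsumer011<String> job2Source =
                new FlinkKafkaConsumer011<>("second-topic", new SimpleStringSchema(), consumerProps);
    }
}
```

Note that this only covers the Kafka side: as described above, job2 itself will still re-read and reprocess records after its own restart, so duplicate updates in Aerospike remain possible unless those updates are idempotent.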
Related
I have a Flink application (v1.13.2) which reads from multiple Kafka topics as a source. There is a filter operator to remove unwanted records from the source stream and finally a JDBC sink to persist the data into Postgres tables. The SQL query can perform upserts, so the same data being processed again is not a problem. Checkpointing is enabled.
According to the documentation, the JDBC sink provides an at-least-once guarantee. Also:
A JDBC batch is executed as soon as one of the following conditions is true:
the configured batch interval time is elapsed
the maximum batch size is reached
a Flink checkpoint has started
And from the Kafka source documentation:
Kafka source commits the current consuming offset when checkpoints are completed, for ensuring the consistency between Flink’s checkpoint state and committed offsets on Kafka brokers.
With Flink’s checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all its Kafka offsets, together with the state of other operations. In case of a job failure, Flink will restore the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that were stored in the checkpoint.
Is it safe to say that in my scenario, any record whose offset gets committed back to Kafka will always be present in the database? Flink stores offsets as part of the checkpoints and commits them back only if the checkpoint completes successfully. And if the JDBC query fails for some reason, the checkpoint itself will fail. I want to ensure there is no data loss in this use case.
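For reference, here is a minimal sketch of the setup described above, assuming the Flink 1.13 KafkaSource and JdbcSink APIs; the topic, table, and column names, connection details, and batch settings are all made up:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToPostgresJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Offsets are committed back to Kafka only when a checkpoint completes,
        // and the JDBC sink flushes its pending batch when a checkpoint starts.
        env.enableCheckpointing(30_000);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("topic-a", "topic-b")
                .setGroupId("my-consumer-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .filter(value -> !value.isEmpty())        // stand-in for the real filter
           .addSink(JdbcSink.sink(
                   // Upsert, so records replayed after a restart are harmless.
                   "INSERT INTO events (id, payload) VALUES (?, ?) "
                           + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                   (ps, value) -> {
                       ps.setString(1, value);       // illustrative: real code would extract a key
                       ps.setString(2, value);
                   },
                   JdbcExecutionOptions.builder()
                           .withBatchSize(500)
                           .withBatchIntervalMs(200)
                           .withMaxRetries(3)
                           .build(),
                   new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                           .withUrl("jdbc:postgresql://localhost:5432/mydb")
                           .withDriverName("org.postgresql.Driver")
                           .withUsername("user")
                           .withPassword("secret")
                           .build()));

        env.execute("kafka-to-postgres");
    }
}
```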
I am wondering about the checkpoint/savepoint restore procedure in Flink when a job is restarted by a rolling update on Kubernetes.
Let me explain with a simple example.
Assume that I have 4 pods in my Flink Kubernetes job and the following simple dataflow with parallelism 1.
source -> filter -> map -> sink
Each pod is responsible for one operator, and data is consumed through the source function. Since I don't want to lose data, I set up my dataflow in at-least-once or exactly-once mode in Flink.
When a rolling update occurs, the pods are restarted sequentially. Suppose that filter is managed by pod1, map by pod2, sink by pod3, and source by pod4. When pod1 (filter) is restarted as part of the rolling update, are the records in the source task (a different task) immediately saved to external storage as part of a checkpoint, so that they can be restored without data loss after the restart?
I am also wondering whether the data in the map task (pod2) is persisted externally when the rolling update happens, even though the checkpoint has not finished.
In other words, when the rolling update happens, Flink is in the middle of processing records and the checkpoint is not yet complete. Is the data currently being processed in the tasks lost in this case?
I need more clarification on how data is restored from checkpoints when Flink runs on Kubernetes and is updated with a rolling strategy.
Flink doesn't support rolling upgrades. If one of the pods where your Flink application is currently running becomes unavailable, the Flink application will usually restart.
The answer from David to "Is the whole job restarted if one task fails?" explains this in more detail.
I would also recommend looking at the current documentation for Task Failure Recovery at https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/task_failure_recovery/ and the checkpointing/savepointing documentation that is also listed there.
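If the goal is simply for the job to come back on its own after such a pod failure and resume from the latest completed checkpoint, the usual combination is checkpointing plus a restart strategy. A minimal sketch (the interval and retry values are arbitrary):

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Periodic checkpoints, so that after a restart the operators resume
        // from the latest completed checkpoint instead of reprocessing everything.
        env.enableCheckpointing(60_000);

        // Restart the job up to 10 times, waiting 30 seconds between attempts
        // (for example while a replacement pod is being scheduled).
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));
    }
}
```

The same settings can also be provided via flink-conf.yaml instead of in code.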
How can we configure a Flink application to restart only the pods/(sub)tasks that crashed, instead of restarting the whole job, i.e. all the tasks/sub-tasks in the job/pipeline including the ones that are healthy? It does not make sense, and feels unnecessary, to restart the healthy tasks along with the crashed ones. The stream processing application processes messages from Kafka and writes the output back to Kafka; it runs on Flink 1.13.5 with a Kubernetes resource manager, using Lyft's Kubernetes operator to schedule and run the Flink job. We tried setting the property **jobmanager.execution.failover-strategy** to **region**, and it did not help.
Flink only supports partial restarts to the extent that this is possible without sacrificing completely correct, exactly-once results.
After recovery, failed tasks are restarted from the latest checkpoint. Their inputs are rewound, and they will reproduce previously emitted results. If healthy downstream consumers of those failed tasks aren't also reset and restarted from that same checkpoint, then they will end up producing duplicate/inflated results.
With streaming jobs, only embarrassingly parallel pipelines have disjoint pipelined regions. Any use of keyBy or rebalancing (e.g., to change the parallelism) produces a job with a single failure region.
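To illustrate that last point, here is a sketch (sources and numbers are made up) contrasting an embarrassingly parallel pipeline with one that uses keyBy; only the first splits into multiple pipelined regions:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RegionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // Embarrassingly parallel: forward connections only, so each of the 4
        // parallel subtask chains forms its own pipelined region and can, in
        // principle, be restarted without touching the others.
        env.fromSequence(0, 1_000_000)
           .filter(x -> x % 2 == 0)
           .print();

        // The keyBy introduces all-to-all connections between subtasks, which
        // collapses this pipeline into a single region: any task failure in it
        // restarts all of its tasks.
        env.fromSequence(0, 1_000_000)
           .keyBy(x -> x % 10)
           .reduce(Long::sum)
           .print();

        env.execute("region-sketch");
    }
}
```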
Restart Pipelined Region Failover Strategy.
This strategy groups tasks into disjoint regions. When a task failure is detected, this strategy computes the smallest set of regions that must be restarted to recover from the failure. For some jobs this can result in fewer tasks that will be restarted compared to the Restart All Failover Strategy.
Refer to https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
Another failover strategy, approximate task-local recovery, is being worked on in FLIP-135 (https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery):
Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable.
I have multiple Kafka topics (multi-tenancy) and I run the same job multiple times, based on the number of topics, with each job consuming messages from one topic. I have configured the filesystem as the state backend.
Assume there are 3 jobs running. How do checkpoints work here? Do all 3 jobs store their checkpoint information in the same path? If any of the jobs fails, how does it know where to recover the checkpoint information from? We give a job name while submitting a job to the Flink cluster. Does that have anything to do with it? In general, how does Flink differentiate jobs and their checkpoint information when restoring after a failure or a manual restart (whether the jobs are the same or different)?
Case 1: What happens in case of a job failure?
Case 2: What happens if we manually restart the job?
Thank you
To follow on to what #ShemTov was saying:
Each job will write its checkpoints in a sub-dir named with its jobId.
If you manually cancel a job, the checkpoints are deleted (since they are no longer needed for recovery), unless they have been configured to be retained:
```java
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
```
Retained checkpoints can be used for manually restarting, and for rescaling.
Docs on retained checkpoints.
If you have high availability configured, the job manager's metadata about checkpoints will be stored in the HA store, so that recovery does not depend on the job manager's survival.
The JobManager is aware of each job's checkpoints and keeps that metadata. Checkpoints are saved to the checkpoint directory (configured via flink-conf.yaml); under this directory it will create a randomly hashed directory for each checkpoint.
Case 1: The job will restart (depending on your restart strategy...), and if checkpointing is enabled it will read the last checkpoint.
Case 2: I'm not 100% sure, but I think if you cancel the job manually and then submit it again, it won't read the checkpoint. You'll need to use a savepoint. (You can cancel your job with a savepoint, and then submit your job again with that same savepoint.) Just be sure that every operator has a UID. You can read more about savepoints here: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html
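On the UID point, here is a small sketch (topic, group id, and operator names are made up, and it assumes the universal FlinkKafkaConsumer) of assigning stable IDs so the state in a savepoint can be mapped back to the right operators when the job is resubmitted:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class UidSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "tenant-1");

        env.addSource(new FlinkKafkaConsumer<>("tenant-1-topic", new SimpleStringSchema(), props))
           .uid("kafka-source")          // offsets stored in the savepoint map back to this operator
           .filter(value -> !value.isEmpty())
           .uid("drop-empty")            // stable ID for any state this operator may carry
           .addSink(new PrintSinkFunction<>())
           .uid("sink");

        env.execute("uid-sketch");
    }
}
```

The typical flow is then to take a savepoint while cancelling the job and to resubmit it with the -s/--fromSavepoint option of flink run, pointing at that savepoint path.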
When SQL Server issues a checkpoint, does it block every other operation until the checkpoint is completed?
If I understand correctly, when a checkpoint occurs, the server should write all dirty pages.
When it's complete, it will write a checkpoint record to the transaction log, so in case of a failure it will only process transactions from that point in time (or transactions which had already started at the time of the checkpoint).
How does SQL Server prevent a non-dirty page from becoming dirty while the checkpoint is in progress?
Does it block all writes until the checkpoint is completed?
Checkpoints do not block writes.
A checkpoint has a start and an end LSN. It guarantees that all pages on disk are at least at the start LSN of the checkpoint. It does not matter if any page is at a later LSN (because it has been written to after the checkpoint has started).
The checkpoint only guarantees a minimum LSN for all pages on disk. It does not guarantee an exact LSN.
This makes sense because you can delete all transaction log records which contain information from LSNs which are earlier than the checkpoint start LSN. That is the purpose of a checkpoint: Allow parts of the log to become inactive.
Checkpoints are not needed for data consistency and correctness. They just free log space and shorten recovery times.
when a checkpoint occurs, the server should write all dirty pages
And that's what it does. However, the guarantee given by a checkpoint is that it writes all the pages that were dirty at the instant the checkpoint started. Any page that became dirty while the checkpoint was in progress may or may not be written, but it is certainly not guaranteed to be written. What this guarantee offers is an optimization: physical recovery can start REDO from the last checkpoint, since everything in the log prior to it has already been applied to the data pages (and does not have to be redone). This is even on the Wikipedia page for ARIES:
The naive way for checkpointing involves locking the whole database to avoid changes to the DPT and the TT during the creation of the checkpoint. Fuzzy logging circumvents that by writing two log records. One Fuzzy Log Starts Here record and, after preparing the checkpoint data, the actual checkpoint. Between the two records other log records can be created.
usr's answer explains how this is achieved (by using a checkpoint start LSN and end LSN).