How does flink deal with the opened transaction when failure happens - apache-flink

I am using Flink 1.12.0, and I got a question about Flink 2PC mechanism for end-to-end consistency guarantee.
At the start of checkpoint, a transaction is opened, and the transaction is committed after the successful completion of checkpoint.
Then what happens if failure happens? I think the opened transaction should be rolled back? also when the transaction is rolled back? Thanks.

Because the operators and Task Managers are distributed inside a cluster, Flink has to ensure that all components agree together in order to claim that a commit is successful. Flink uses the 2 phase commit protocol, as you said, and with a pre-commit. The pre-commit is the key to deal with failures during the checkpoint, as it says on the documentation
The pre-commit phase finishes when the checkpoint barrier passes through all of the operators and the triggered snapshot callbacks complete.

Related

What is the restoring mechanism of the Flink on K8S when rollingUpdate is executed for update strategy?

I am wondering that the restoring procedure of checkpoint or savepoint in Flink when job is restarted by rolling updates on k8s.
Let me explain simple example as below.
Assume that I have 4 pods in my flink k8s job and have following simple dataflow using parallelism 1.
source -> filter -> map -> sink
Each pod is responsible for each operator and data is consumed through the source function. Since I don't want to lose my data so I set up my dataflow as at least or exactly at once mode in Flink.
And then when rolling update occurs, each pod gets restarted in a sequential way. Suppose that filter is managed by pod1, map is pod2, sink is pod3 and source is pod4 respectively. When the pod1 (filter) is restarted according to the rolling update, does the records in the source task (other task) is saved to the external place for checkpoint immediately? So it can be restored perfectly without data loss even after restarting?
And also, I am wondering that the data in map task (pod3) keep persistent to the external source when rolling update happens even though the checkpoint is not finished?
It means that when the rolling update is happen, the flink is now processing the data records and the checkpoint is not completed. In this case, the current processed data in the task is loss?
I need more clarification for data restoring when we use checkpoint and k8s on flink updated by rolling strategy.
Flink doesn't support rolling upgrades. If one of your pods where a Flink application is currently running becomes unavailable , the Flink application will usually restart.
The answer from David at Is the whole job restarted if one task fails explains this in more detail.
I would also recommend to look at the current documentation for Task Failure Recovery at https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/task_failure_recovery/ and the checkpointing/savepointing documentation that's also listed there.

What happens if Flink job restarts between two checkpoints?

This question is basically similar to the one asked here: Apache Flink fault tolerance.
i.e. what happens if a job restarts between two checkpoints? Will it reprocess the records that were already processed after the last checkpoint?
Take for example I have two jobs, job1 and job2. Job1 consumes records from Kafka, processes them and again produces them to second Kafka topic. Job2 consumes from this second topic and processes the records (in my case its updating values in aerospike using AerospikeClient).
Now from the answer to this question Apache Flink fault tolerance, I can somehow believe that if job1 restarts, it will not produce duplicates records in the sink. I am using FlinkKafkaProducer011 which extends TwoPhaseCommitSinkFunction (https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html). Please explain how it will prevent reprocessing (ie duplicate production of records to Kafka).
According to Flink doc, flink restarts a job from last successful checkpoint. So if job2 restarts before completing the checkpoint, it will restart from last checkpoint and the records that were already processed after that last checkpoint will be reprocessed (ie multiple updations in aerospike).
Am I right or is there something else in Flink (& aerospike) that prevents this reprocessing in job2?
In such a scenario, Flink will indeed reprocess some events. During recovery the input partitions will have their offsets reset to the offsets in the most recent checkpoint, and events that had been read after that checkpoint will be re-ingested.
However, the FlinkKafkaProducer uses Kafka transactions that are committed when checkpoints are completed. When a job fails, whatever output it has produced since the last checkpoint is protected by transactions that are never committed. So long as that job's consumers are configured to use read_committed as their isolation.level, they won't see any duplicates.
For more details, see Best Practices for Using Kafka Sources/Sinks in Flink Jobs.

Flink's failure recovery process

I want to know the detailed failure recovery process of flink.In standalone mode, I guess some steps, such as a TaskManager failure, first detect the failure, all tasks stop processing, and then redeploy the tasks. Then download the checkpoint from HDFS, and each operator loads the state. After the loading is completed, the source continues to send data. Am I right? Does anyone know the correct and detailed recovery process?
Flink recovers from failure through checkpoints. Checkpoints can be stored locally, in S3 or HDFS. When restored, all states of different operators will be revived.
For detailed recovery process, it really depends on your backend. If you are using RocksDB.
your checkpoint can be incremental
you can use the checkpoint data as a savepoint if you do not need to change the backend. This means you can change the parallelism while reviving from the checkpoint.

Is there a delay before other transactions see a commit if using asynchronous commit in Postgresql?

I'm looking into optimizing throughput in a Java application that frequently (100+ transactions/second) updates data in a Postgresql database. Since I don't mind losing a few transactions if the database crashes, I think that using asynchronous commit could be a good fit.
My only concern is that I don't want a delay after commit until other transactions/queries see my commit. I am using the default isolation level "Read committed".
So my question is: Does using asynchronous commit in Postgresql in any way mean that there will be delays before other transactions see my committed data or before they proceed if they have been waiting for my transaction to complete? (As mentioned before, I don't care if there is a delay before my data is persisted to disk.)
It would seem that this is the behavior you're looking for.
http://www.postgresql.org/docs/current/static/wal-async-commit.html
Selecting asynchronous commit mode means that the server returns success as soon as the transaction is logically completed, before the
WAL records it generated have actually made their way to disk. This
can provide a significant boost in throughput for small transactions.
The WAL is used to provide on-disk data integrity, and has nothing to do about table integrity of a running server; it's only important if the server crashes. Since they specifically mention "as soon as the transaction is logically completed", the documention is indicating that this does not affect table behavior.

Does checkpoint block every other operation?

When Sql Server issues a checkpoint, does it block every other operation until the checkpoint is completed?
If I understand correctly, when a checkpoint occurs, the server should write all dirty pages.
When it's complete, it will write checkpoint to the transaction log, so in case of any failure it will process only transactions from that point of time (or transactions which already started at time of checkpoint).
How does sql server prevent some non dirty page to become dirty while the checkpoint is in progress?
Does it block all writes until the checkpoint is completed?
Checkpoints do not block writes.
A checkpoint has a start and an end LSN. It guarantees that all pages on disk are at least at the start LSN of the checkpoint. It does not matter if any page is at a later LSN (because it has been written to after the checkpoint has started).
The checkpoint only guarantees a minimum LSN for all pages on disk. It does not guarantee an exact LSN.
This makes sense because you can delete all transaction log records which contain information from LSNs which are earlier than the checkpoint start LSN. That is the purpose of a checkpoint: Allow parts of the log to become inactive.
Checkpoints are not needed for data consistency and correctness. They just free log space and shorten recovery times.
when a checkpoint occurs, the server should write all dirty pages
And that's what it does. However the guarantee given by checkpoint is it writes all the pages that were dirty at the instant the checkpoint started. Any page that got dirty while the checkpoint was making progress may or may not be written, but is sure not guaranteed to be written. What this guarantee offers is an optimization that physical recovery can start REDO from the last checkpoint since everything in the log prior to it is already been applied to the data pages (does not have to be redone). Is even on Wikipedia page for ARIES:
The naive way for checkpointing involves locking the whole database to
avoid changes to the DPT and the TT during the creation of the
checkpoint. Fuzzy logging circumvents that by writing two log records.
One Fuzzy Log Starts Here record and, after preparing the checkpoint
data, the actual checkpoint. Between the two records other logrecords
can be created
usr's answer explains how this is achieved (by using a checkpoint start LSN and end LSN).

Resources