Why is Two-Phase Commit (2PC) blocking? - distributed-transactions

Can anyone explain why 2PC blocks when the coordinator fails? Is it because the cohorts don't use timeouts in 2PC?
Good reference: Analysis and Verification of Two-Phase Commit & Three-Phase Commit Protocols, by Muhammad Atif.

Two-phase commit is a blocking protocol because once the participants enter the prepared state they have to wait for the coordinator to decide the next step. If the coordinator fails, they have to wait until it is brought back up.
It's not possible to start another coordinator to reach a result. Participants are not permitted to change their state until they're commanded to do so.
I take it you are comparing 3PC with 2PC. 3PC (as I understand it) is a family of protocols with several variants. It addresses the blocking nature of 2PC.
The main point is to finish the transaction consistently (commit or rollback) using only knowledge of "the environment". A new (backup) coordinator is expected to be started (probably selected from among the participants) so that the transaction can be finished. Timeouts can be added so that a participant aborts after some time. Even then, the newly started coordinator should be able to finish the whole transaction consistently (probably by rolling back in that case).

2PC doesn't always block when the coordinator fails. A system using 2PC only blocks on coordinator failure when someone tries to read a prepared (in-doubt) resource.
If the commit message (of phase 2) to a participant is lost, the participant's resource stays in the prepared state, and the participant must ask the coordinator what state the resource is actually in. A participant cannot determine the outcome of a prepared resource by itself.
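The in-doubt state described above can be sketched as a tiny state machine. This is only an illustration, not any particular system's implementation: the class, method, and state names are made up for the example.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PREPARED = "prepared"   # voted yes, outcome unknown (in doubt)
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Hypothetical 2PC participant; names are illustrative only."""

    def __init__(self) -> None:
        self.state = State.INIT

    def on_prepare(self, can_commit: bool) -> str:
        """Phase 1: vote. A yes vote is a promise to obey the coordinator."""
        self.state = State.PREPARED if can_commit else State.ABORTED
        return "yes" if can_commit else "no"

    def on_decision(self, decision: str) -> None:
        """Phase 2: only the coordinator's decision moves us out of PREPARED."""
        if self.state is State.PREPARED:
            self.state = State.COMMITTED if decision == "commit" else State.ABORTED

    def on_timeout(self) -> State:
        """Called when the coordinator has been silent for too long."""
        if self.state is State.INIT:
            # No vote was sent yet, so aborting unilaterally is safe.
            self.state = State.ABORTED
        # In PREPARED we cannot tell whether others committed: stay blocked.
        return self.state
```

A participant that times out before voting can abort on its own, while one that times out after voting yes stays in `PREPARED` until it hears the decision. That second case is the blocking the answers describe.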

Related

What exactly happens if checkpointed data cannot be committed?

I'm reading into the details of Flink's checkpointing mechanism right now and by now, I think I have a really good overview about how everything is tied together but one last issue strikes me here.
It's about how checkpoints and commits interact with each other in the ExactlyOnce context, because I have the feeling that there's still potential for data loss/duplicate records. Mainly I was thinking about potential failures of the commit message or its callback, when I stumbled upon this paragraph in the Flink Blog:
After a successful pre-commit, the commit must be guaranteed to eventually succeed – both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
Up until this point, I still had the impression that checkpoints would have to be acknowledged by the sink commit first, before they would be viewed as "valid". But apparently, once all operators are ready to actually commit, the checkpoint starts to exist and from that point on, the sink has to guarantee the commit can be done to ensure no data being lost. What exactly happens if my commit can never be done, e.g. if my Kafka sink is down for a longer period of time? Does this mean if the defined retries run out eventually, the checkpointed state will just be treated as the correct state or will Flink only be able to resume the job once this specific commit was able to be done and thus be stuck until broker is available again?
And what if the callback of the commit is lost somehow, will this be resolved in the next retry attempt or since the transaction is "done" now, the producer will not be able to commit and we enter this loop of repeated retries? (more of a Kafka question probably)
To commit side effects (external state such as Kafka transactions), Flink uses a two-phase commit protocol.
Let's say we are performing checkpoint 42. First, pre-commit requests are issued. If all participants (parallel subtasks/operators) successfully acknowledge the pre-commit, the JobManager/CheckpointCoordinator starts sending out commit requests.
The thing is, if a failure happens at this point, there is no going back. If some commit fails, or there is some other unrelated failure, the job is restarted from checkpoint 42 and Flink re-attempts to commit the pending/pre-committed transactions. If a failure happens again, rinse and repeat according to your selected restart strategy. If you want to avoid data loss, commit attempts must eventually succeed. There is simply no other way. We cannot revert those transactions: once some commit requests have been issued, some transactions might already have been committed, so we cannot roll back only a portion of them (otherwise we would have a data duplication problem).
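The restart-and-recommit loop described above can be simulated in a few lines. This is a toy model, not Flink's API: `FlakyExternalSystem`, `recover_and_commit`, and the transaction ids are hypothetical, and the external commit is assumed to be idempotent so that re-committing an already committed transaction is harmless.

```python
class FlakyExternalSystem:
    """Hypothetical sink backend whose commit fails the first `failures` times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.committed = set()

    def commit(self, txn_id: str) -> None:
        if txn_id in self.committed:
            return  # assumed idempotent: a repeated commit is a no-op
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("commit failed for " + txn_id)
        self.committed.add(txn_id)


def recover_and_commit(pending, system, max_restarts: int) -> int:
    """Re-attempt all pre-committed transactions after each simulated restart.

    Returns the number of restarts that were needed. Raises if the restart
    budget runs out, which is the case where pre-committed data stays
    uncommitted.
    """
    for restart in range(max_restarts + 1):
        try:
            for txn_id in pending:
                system.commit(txn_id)
            return restart
        except ConnectionError:
            continue  # simulated job restart from the same checkpoint
    raise RuntimeError("restart strategy exhausted; transactions still pending")
```

Because every attempt starts again from the same list of pending transactions, a commit that succeeded just before a failure is simply skipped on the next attempt.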

What problems can two-phase commits cause?

Recently I've read several times that two-phase commits are bad, but always as a side note, never with a good explanation.
For example in CQRS Journey Chapter 5:
Second, we're trying to avoid two-phase commits because they always
cause problems in the long run.
Or in Implementing Domain-Driven Design on page 563:
The second ReadRecorts() is used by the infrastructure to replicate
events, to publish them without the need for two-phase commit, ...
I thought two-phase commits are implemented to ensure consistency among multiple database servers.
What problems can occur when using two-phase commits? Why is it better to avoid them?
The biggest problem is scalability, due to the blocking nature of the two-phase commit protocol.
2PC requires careful coordination between the participating parties. In particular, each party has to acknowledge the prepare phase and the commit. Once a party has acknowledged that it is ready to commit, it has to block until the transaction coordinator sends the commit or rollback message. If any of the parties is reached over a network, network latency becomes a bottleneck for the communication between the nodes.
Furthermore, once a party has acknowledged that it is ready to commit, it must actually be able to commit the transaction afterwards, even if it crashes in between. This requires checkpointing to persistent storage (even when the transaction is rolled back afterwards), which can also limit throughput.

How does three-phase commit avoid blocking?

I am trying to understand how three-phase commit avoids blocking.
Consider the following two failure scenarios:
Scenario 1: In phase 2 the coordinator sends preCommit messages to all cohorts and has gotten an ack from all except cohort A. Network problems prevent cohort A from receiving the coordinator's preCommit message. Cohort A times out waiting for the preCommit message and chooses to abort. Then both the coordinator and cohort A crash.
Scenario 2: The protocol reaches phase 3. The coordinator sends a doCommit message to cohort A. But before it can send more doCommit messages the coordinator crashes. Cohort A commits its part of the transaction then crashes.
As far as I can tell the remaining cohorts have the exact same state at the end of scenario 1 and scenario 2. So when a recovery coordinator steps in how can it find out from the remaining cohorts whether we are in scenario 1 and abort or we are in scenario 2 and commit and thus avoid blocking?
Three-phase commit isn't magic; it's just more resilient than two-phase commit. In particular, 3PC is resilient against single-point failure, but not all kinds of multiple-point failure. Both scenarios in the question posit two-point failures. In other words, the premise of the question is misguided; it's asking more of 3PC than it's capable of.
For further reading, here's a sentence from the abstract of a paper on the subject, Analysis and Verification of Two-Phase Commit & Three-Phase Commit Protocols, by Muhammad Atif, to whet your appetite:
We also apply our method to its “amended” variant, the Three-Phase
Commit Protocol (3PC) and prove it to be erroneous for simultaneous
site failures
I found this paper to provide a foothold into the literature. There's no small amount of it on this subject, if you want to delve in.
In the two-phase commit the coordinator sends a prepare message to all participants (nodes) and waits for their answers. The coordinator then sends their answers to all other sites. Every participant waits for these answers from the coordinator before committing to or aborting the transaction.
The two-phase commit protocol also has limitations in that it is a blocking protocol. For example, participants will block resource processes while waiting for a message from the coordinator. If for any reason this fails, the participant will continue to wait and may never resolve its transaction. Therefore the resource could be blocked indefinitely. On the other hand, a coordinator will also block resources while waiting for replies from participants. In this case, a coordinator can also block indefinitely if no acknowledgement is received from the participant.
However, the three-phase protocol introduces a third phase called the pre-commit. The aim of this is to 'remove the uncertainty period for participants that have committed and are waiting for the global abort or commit message from the coordinator'.
When receiving a pre-commit message, participants know that all others have voted to commit.
If a pre-commit message has not been received the participant will abort and release any blocked resources.
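The participant-side timeout rule can be written down as a small function. This is a sketch under assumptions: the state names are invented for the example, and the "commit on timeout once pre-commit has been received" branch is the usual textbook 3PC behaviour, which is only safe when there is no network partition.

```python
def on_timeout_3pc(state: str) -> str:
    """Return the state a 3PC participant moves to when its wait times out.

    State names ("init", "prepared", "precommit", ...) are illustrative.
    """
    if state in ("init", "prepared"):
        # No pre-commit seen: no participant can have committed yet,
        # so aborting and releasing resources is safe.
        return "aborted"
    if state == "precommit":
        # Pre-commit seen: every participant voted yes, so commit.
        return "committed"
    # "committed" and "aborted" are final states.
    return state
```

In 2PC the first branch would not be safe for a participant that has already voted yes, which is exactly the uncertainty the extra phase is meant to remove.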
The thing that helped me understand the non-blocking property was to realize that after the first round of messages both protocols are in essentially the same state; all participants have agreed that they can commit and are waiting for confirmation to do so.
Now, consider what a participant knows after it replied "ok to commit" in the first round.
2PC: another participant may have already received a message to commit and committed or, alternatively, no participants may have done any commits. So if the participant doesn't hear anything it has no idea what the group decision was.
3PC: the participant can be sure that no other participant has performed any commit actions. It's still safe to roll back the entire operation in the case of coordinator failure.
Moving on, after the second round of messages and confirmations in 3PC we are guaranteed that all participants know that the group decision is to commit.
This means that there is never a time in 3PC when a participant does a commit action that another participant is not anticipating.
In scenario 1:
During recovery:
All cohorts except A will be in the PRECOMMIT state. This tells the recovery node that all cohorts had voted for the commit and moved forward, so A and the coordinator should be in the PRECOMMIT state. Since this is a non-final state, the transaction is aborted.
In scenario 2:
During recovery:
All cohorts except A will be in the PRECOMMIT state. This tells the recovery node that all cohorts had voted for the commit and moved ahead. But since A received the doCommit message, it is in the COMMITTED state. Had a crash not happened, the recovery node would ask all cohorts to commit, since at least one cohort has committed. Since A crashed, the recovery node sees no live cohort in the COMMITTED state and hence deduces that no cohort got the doCommit message. Hence the transaction will be aborted and all cohorts will be asked to release resources.
When A returns from the crash and begins recovery, it will find that all the other cohorts aborted the transaction, and it too will abort the transaction.
(source: swturner at regal.csep.umflint.edu)

3 phase commit protocol

I was reading 3 phase commit protocol on wikipedia (http://en.wikipedia.org/wiki/Three-phase_commit_protocol) and here is a scenario that came to my mind where 3PC will fail:
Assume there are two participants A and B and a Coordinator C:
1) C sent the precommit message to A, and before it sends the precommit message to B, both A and C fail simultaneously.
2) The transaction is now restarted and B ends up aborting it because there is no reply from A.
3) A commits the transaction because it has already got the precommit message.
Wasn't this also the original problem in 2PC that 3PC was supposed to address? How is 3PC solving the problem? What am I missing? Thanks.
Update:
Do the participants not commit then until they receive the doCommit message from the coordinator?
After receiving the preCommit message, the participants will wait first, and if a timeout happens, they will just go ahead and commit.
If the coordinator fails after sending the precommit message and at least one of the participants has received it, the rest of the system can just go ahead and commit, since they already know the state of the system.
Yes, once the new coordinator sees that there is a participant that has already received the preCommit message, it will resend preCommit messages to the other participants.
If the co-ordinator should crash at any point, a recovery node can take over the transaction and query the state from any remaining replicas. If a replica that has committed the transaction has crashed, we know that every other replica has received a ‘prepare to commit’ message (otherwise the co-ordinator wouldn’t have moved to the commit phase), and therefore the recovery node will be able to determine that the transaction was able to be committed, and safely shepherd the protocol to its conclusion. If any replica reports to the recovery node that it has not received ‘prepare to commit’, the recovery node will know that the transaction has not been committed at any replica, and will therefore be able either to pessimistically abort or re-run the protocol from the beginning.
--cited from http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/
So I think the new coordinator will query the cohorts' state: only when all the live cohorts have received the pre-commit message will the new coordinator send the do-commit message; otherwise, the transaction will be aborted.
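The rule suggested in this answer could be sketched as follows. It encodes only the reading given here (commit when every live cohort has seen pre-commit, otherwise abort), plus the assumption that an already final state reported by a live cohort wins. The function and state names are hypothetical, and other answers in these threads describe different recovery rules.

```python
def recovery_decision(live_states) -> str:
    """Decide the outcome from the states reported by the live cohorts.

    `live_states` is a list of strings such as "prepared", "precommit",
    "committed" or "aborted" (illustrative names).
    """
    if "committed" in live_states:
        return "commit"   # assumption: a final state seen on a live cohort wins
    if "aborted" in live_states:
        return "abort"
    if live_states and all(s == "precommit" for s in live_states):
        return "commit"   # every live cohort knows the vote was unanimous
    return "abort"        # some live cohort never saw pre-commit
```

For example, `recovery_decision(["precommit", "prepared"])` yields an abort, because one live cohort has not yet been told that the vote was unanimous.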
3PC only tolerates single-point failure, not multi-point failure. Actually, to make sure 3PC works, all of the following three conditions must be met:
no network failure (i.e. no network partition; every message gets to its destination before the timeout if the destination machine is working, i.e. not crashed)
at most one participant can fail (crash). To be precise, if the coordinator fails (crashes), no cohort may fail
the participant machine can distinguish between a timeout and a failure (this is not trivial: consider a machine that crashes, e.g. its power is cut, right after a timeout, before it could write anything to persistent storage to remind itself on recovery that it was a timeout rather than a crash)
None of these conditions is practical, so I don't think 3PC can be implemented in the real world.

Two phase commit

I believe most people know what 2PC (two-phase commit protocol) is and how to use it in Java or most modern languages. Basically, it is used to make sure transactions stay in sync when you have two or more DBs.
Assume I have two DBs (A and B) using 2PC in two different locations. Before A and B commit a transaction, both DBs report back to the transaction manager saying they are ready to commit. Once the transaction manager has received both acknowledgements, it sends a signal back to A and B telling them to go ahead.
Here is my question: let's say A received the signal and committed the transaction. Once everything is completed, B is about to do the same, but someone unplugs the power cable, causing the whole server to shut down. When B is back online, what will B do? And how does B do it?
Remember, A is committed but B is not, and we are using 2PC (so the design of 2PC stops working, doesn't it?)
On Two-Phase Commit
Two phase commit does not guarantee that a distributed transaction can't fail, but it does guarantee that it can't fail silently without the TM being aware of it.
In order for B to report the transaction as being ready to commit, B must have the transaction in persistent storage (i.e. B must be able to guarantee that the transaction can commit in all circumstances). In this situation, B has persisted the transaction but the transaction manager has not yet received a message from B confirming that B has completed the commit.
The transaction manager will poll B again when B comes back online and ask it to commit the transaction. If B has already committed the transaction it will report the transaction as committed. If B has not yet committed the transaction it will then commit as it has already persisted it and is thus still in a position to commit the transaction.
In order for B to fail in this situation, it would have to undergo a catastrophic failure that lost data or log entries. The transaction manager would still be aware that B had not reported a successful commit.
In practice, if B can no longer commit the transaction, it would imply that the disaster that took B out had caused data loss, and B would report an error when the TM asked it to commit a TxID that it wasn't aware of or didn't think was in a committable state.
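B's side of this exchange can be sketched as below. The class and exception names are invented for the example, and the in-memory sets stand in for B's durable log, which in a real resource manager would survive the power cut.

```python
class UnknownTxError(Exception):
    """Raised when asked to commit a TxID that is not in a committable state."""


class ResourceManager:
    """Hypothetical resource manager B; the sets model its persistent log."""

    def __init__(self) -> None:
        self.prepared = set()
        self.committed = set()

    def prepare(self, tx_id: str) -> None:
        # Persist the transaction before reporting "ready to commit".
        self.prepared.add(tx_id)

    def commit(self, tx_id: str) -> str:
        if tx_id in self.committed:
            return "already committed"   # TM re-asked after a lost reply
        if tx_id not in self.prepared:
            raise UnknownTxError(tx_id)  # signals data loss to the TM
        self.prepared.remove(tx_id)
        self.committed.add(tx_id)
        return "committed"
```

When the TM polls B again after the restart, it gets one of three answers: the commit happens now, it had already happened, or B reports an unknown TxID and the TM knows that something was lost.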
Thus, two phase commit does not prevent a catastrophic failure from occurring, but it does prevent the failure from going unnoticed. In this scenario the transaction manager will report an error back to the application if B cannot commit.
The application still has to be able to recover from the error, but the transaction cannot fail silently without the application being made aware of the inconsistent state.
Semantics
If a resource manager or network goes down in phase 1, the
transaction manager will detect a fatal error (can't connect to
resource manager) and mark the sub-transaction as failed. When the
network comes back up it will abort the transaction on all of the
participating resource managers.
If a resource manager or network goes down in phase 2, the
transaction manager will continue to poll the resource manager until
it comes back up. When it re-connects back to the resource manager
it will tell the RM to commit the transaction. If the RM returns an
error along the lines of 'Unknown TxID' the TM will be aware that
there is a data loss issue in the RM.
If the TM goes down in phase 1 then the client will block until the
TM comes back up, unless it times out or receives an error due to the
broken network connection. In this case the client is made aware of
the error and can either re-try or initiate the abort itself.
If the TM goes down in phase 2 then it will block the client until
the TM comes back up. It has already reported the transaction as
committable and no fatal error should be presented to the client,
although it may block until the TM comes back up. The TM will still
have the transaction in an uncommitted state and will poll the RMs
to commit when it comes back up.
Post-commit data loss events in the resource managers are not handled by the transaction manager and are a function of the resilience of the RMs.
Two-phase commit does not guarantee fault tolerance - see Paxos for an example of a protocol that does address fault tolerance - but it does guarantee that partial failure of a distributed transaction cannot go un-noticed.
Note that this sort of failure (a catastrophic failure that loses data or log entries in a resource manager) could also lose data from previously committed transactions. Two phase commit does not guarantee that the resource managers can't lose or corrupt data or that DR procedures don't screw up.
I believe three phase commit is a much better approach. Unfortunately I haven't found anyone implementing such a technology.
http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/
Here are the essential parts of the above article:
The fundamental difficulty with 2PC is that, once the decision to commit has been made by the co-ordinator and communicated to some replicas, the replicas go right ahead and act upon the commit statement without checking to see if every other replica got the message. Then, if a replica that committed crashes along with the co-ordinator, the system has no way of telling what the result of the transaction was (since only the co-ordinator and the replica that got the message know for sure). Since the transaction might already have been committed at the crashed replica, the protocol cannot pessimistically abort – as the transaction might have had side-effects that are impossible to undo. Similarly, the protocol cannot optimistically force the transaction to commit, as the original vote might have been to abort.
This problem is – mostly – circumvented by the addition of an extra phase to 2PC, unsurprisingly giving us a three-phase commit protocol. The idea is very simple. We break the second phase of 2PC – ‘commit’ – into two sub-phases. The first is the ‘prepare to commit’ phase. The co-ordinator sends this message to all replicas when it has received unanimous ‘yes’ votes in the first phase. On receipt of this messages, replicas get into a state where they are able to commit the transaction – by taking necessary locks and so forth – but crucially do not do any work that they cannot later undo. They then reply to the co-ordinator telling it that the ‘prepare to commit’ message was received.
The purpose of this phase is to communicate the result of the vote to every replica so that the state of the protocol can be recovered no matter which replica dies.
The last phase of the protocol does almost exactly the same thing as the original ‘commit or abort’ phase in 2PC. If the co-ordinator receives confirmation of the delivery of the ‘prepare to commit’ message from all replicas, it is then safe to go ahead with committing the transaction. However, if delivery is not confirmed, the co-ordinator cannot guarantee that the protocol state will be recovered should it crash (if you are tolerating a fixed number f of failures, the co-ordinator can go ahead once it has received f+1 confirmations). In this case, the co-ordinator will abort the transaction.
If the co-ordinator should crash at any point, a recovery node can take over the transaction and query the state from any remaining replicas. If a replica that has committed the transaction has crashed, we know that every other replica has received a ‘prepare to commit’ message (otherwise the co-ordinator wouldn’t have moved to the commit phase), and therefore the recovery node will be able to determine that the transaction was able to be committed, and safely shepherd the protocol to its conclusion. If any replica reports to the recovery node that it has not received ‘prepare to commit’, the recovery node will know that the transaction has not been committed at any replica, and will therefore be able either to pessimistically abort or re-run the protocol from the beginning.
So does 3PC fix all our problems? Not quite, but it comes close. In the case of a network partition, the wheels rather come off – imagine that all the replicas that received ‘prepare to commit’ are on one side of the partition, and those that did not are on the other. Then both partitions will continue with recovery nodes that respectively commit or abort the transaction, and when the network merges the system will have an inconsistent state. So 3PC has potentially unsafe runs, as does 2PC, but will always make progress and therefore satisfies its liveness properties. The fact that 3PC will not block on single node failures makes it much more appealing for services where high availability is more important than low latencies.
Your scenario is not the only one where things can ultimately go wrong despite all effort. Suppose A and B have both reported "ready to commit" to the TM, and then someone unplugs the line between the TM and, say, B. B is waiting for the go-ahead (or no-go) from the TM, but it certainly won't keep waiting forever until the TM reconnects (its own resources involved in the transaction must stay locked/inaccessible throughout the entire wait, for obvious reasons). So when B has been kept waiting too long for its own taste, it will make what is called a "heuristic decision". That is, it will decide to commit or roll back independently of the TM, based on, well, I don't really know what, but that doesn't really matter. It should be obvious that any such heuristic decision can deviate from the actual commit decision taken by the TM.

Resources