How do two-phase commits prevent last-second failure?

I am studying how two-phase commit works across a distributed transaction. It is my understanding that in the first phase the transaction coordinator asks each node whether it is ready to commit. If everyone agrees, it then tells them all to go ahead and commit.
What prevents the following failure?

1. All nodes respond that they are ready to commit.
2. The transaction coordinator tells them to "go ahead and commit", but one of the nodes crashes before receiving this message.
3. All other nodes commit successfully, but now the distributed transaction is corrupt.
4. It is my understanding that when the crashed node comes back, its transaction will have been rolled back (since it never got the commit message).

I am assuming each node is running a normal database that doesn't know anything about distributed transactions. What did I miss?

No, they are not instructed to roll back, because in the original poster's scenario some of the nodes have already committed. What happens is that when the crashed node becomes available again, the transaction coordinator tells it to commit again.
Because the node responded positively in the "prepare" phase, it is required to be able to "commit", even when it comes back from a crash.
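In code, the coordinator side of this might look roughly like the following sketch. The IParticipant interface and the in-memory "decision log" are made-up illustrations of the protocol, not any real API; a real coordinator would write its decision to stable storage:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Hypothetical participant abstraction; not a real library API.
public interface IParticipant
{
    bool Prepare();   // phase 1: vote yes only if commit is guaranteed possible
    void Commit();    // phase 2: must eventually succeed once "yes" was voted
    void Rollback();
}

public class Coordinator
{
    private readonly List<string> decisionLog = new List<string>(); // stands in for stable storage

    public void Run(string txId, IReadOnlyList<IParticipant> participants)
    {
        // Phase 1: collect votes from every participant.
        bool allPrepared = participants.All(p => p.Prepare());
        if (!allPrepared)
        {
            foreach (var p in participants) p.Rollback();
            return;
        }

        // Durably record the decision BEFORE sending any commit message,
        // so the outcome survives a coordinator crash.
        decisionLog.Add($"COMMIT {txId}");

        // Phase 2: deliver the decision, retrying crashed/unreachable
        // nodes until they acknowledge. A recovered node is told to
        // commit again; it is never rolled back at this point.
        foreach (var p in participants)
        {
            while (true)
            {
                try { p.Commit(); break; }
                catch (Exception) { Thread.Sleep(1000); /* node down; retry later */ }
            }
        }
    }
}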

Summarizing everyone's answers:

1. One cannot use normal databases with distributed transactions. The database must explicitly support a transaction coordinator.
2. The nodes are not instructed to roll back, because some of the nodes have already committed. What happens is that when the crashed node comes back, the transaction coordinator tells it to finish the commit.

No. Point 4 is incorrect. Each node records in stable storage that it was able to commit or roll back the transaction, so that it can do as commanded even across crashes. When the crashed node comes back up, it must recognize that it has a transaction in the prepared (pre-commit) state, reinstate any relevant locks or other controls, and then contact the coordinator site to learn the status of the transaction.
Problems only occur if the crashed node never comes back up; everyone else thinks the transaction completed (or will complete as soon as the crashed node returns).
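A sketch of that recovery path on the participant side. The writeAheadLog dictionary and askCoordinator callback are illustrative stand-ins for the node's stable-storage log and a channel to the coordinator site, not a real API:

using System;
using System.Collections.Generic;
using System.Linq;

public enum TxState { Prepared, Committed, RolledBack }

public static class ParticipantRecovery
{
    // Run once at startup, before the node accepts new work.
    public static void Recover(IDictionary<string, TxState> writeAheadLog,
                               Func<string, TxState> askCoordinator)
    {
        foreach (var txId in writeAheadLog.Keys.ToList())
        {
            if (writeAheadLog[txId] != TxState.Prepared) continue; // already resolved

            // In-doubt transaction: we voted "yes" before the crash, so we
            // must reinstate its locks and ask the coordinator for the
            // global outcome instead of unilaterally rolling back.
            ReacquireLocks(txId);
            writeAheadLog[txId] = askCoordinator(txId); // Committed or RolledBack
        }
    }

    private static void ReacquireLocks(string txId)
    {
        // Placeholder: restore whatever locks/controls the prepared
        // transaction held before the crash.
    }
}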

Two-phase commit isn't foolproof; it is just designed to work in the vast majority of cases.
"The protocol assumes that there is stable storage at each node with a write-ahead log, that no node crashes forever, that the data in the write-ahead log is never lost or corrupted in a crash, and that any two nodes can communicate with each other."
http://en.wikipedia.org/wiki/Two-phase_commit_protocol

There are many ways to attack the problems with two-phase commit. Almost all of them wind up as some variant of the Paxos three-phase commit algorithm. Mike Burrows, who designed the Chubby lock service at Google (which is based on Paxos), said in a lecture I saw that there are two types of distributed commit algorithms: "Paxos, and incorrect ones".
One thing the crashed node could do, when it reawakens, is ask the coordinator "I never heard about this transaction, should it have been committed?", and the coordinator will tell it what the vote was.
Bear in mind that this is an example of a more general problem: the crashed node could miss many transactions before it recovers. Therefore it's terribly important that upon recovery it should talk either to the coordinator or another replica before making itself available. If the node itself can't tell whether or not it has crashed, then things get more involved but still tractable.
If you use a quorum system for database reads, the inconsistency will be masked (and made known to the database itself).
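To illustrate the quorum idea, here is a toy sketch (entirely illustrative, not tied to any real database): read from a majority of replicas and take the value with the highest version, which masks a stale replica that missed a commit:

using System;
using System.Collections.Generic;
using System.Linq;

// Toy quorum read: each replica returns (version, value); reading a
// majority guarantees overlap with the latest successful quorum write,
// so one stale (e.g. crashed-and-recovered) replica is masked.
public static class QuorumRead
{
    public static string Read(IReadOnlyList<Func<(long Version, string Value)>> replicas)
    {
        int quorum = replicas.Count / 2 + 1;
        var responses = new List<(long Version, string Value)>();

        foreach (var replica in replicas)
        {
            try { responses.Add(replica()); }
            catch (Exception) { /* replica down; keep going */ }
            if (responses.Count >= quorum) break;
        }

        if (responses.Count < quorum)
            throw new InvalidOperationException("quorum not reachable");

        // Highest version wins; stale replicas are outvoted.
        return responses.OrderByDescending(r => r.Version).First().Value;
    }
}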

Related

What exactly happens if checkpointed data cannot be committed?

I'm reading into the details of Flink's checkpointing mechanism right now, and by now I think I have a really good overview of how everything is tied together, but one last issue strikes me here.
It's about how checkpoints and commits interact with each other in the ExactlyOnce context, because I have the feeling that there's still potential for data loss/duplicate records. Mainly I was thinking about potential failures of the commit message or its callback, when I stumbled upon this paragraph in the Flink Blog:
After a successful pre-commit, the commit must be guaranteed to eventually succeed – both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
Up until this point, I still had the impression that checkpoints would first have to be acknowledged by the sink's commit before they would be viewed as "valid". But apparently, once all operators are ready to actually commit, the checkpoint starts to exist, and from that point on the sink has to guarantee that the commit can be done to ensure no data is lost. What exactly happens if my commit can never be done, e.g. if my Kafka sink is down for a longer period of time? If the defined retries eventually run out, will the checkpointed state just be treated as the correct state, or will Flink only be able to resume the job once this specific commit succeeds, and thus be stuck until the broker is available again?
And what if the callback of the commit is lost somehow? Will this be resolved in the next retry attempt, or, since the transaction is "done" now, will the producer be unable to commit, so that we enter a loop of repeated retries? (This is probably more of a Kafka question.)
For committing side effects (things like external state, e.g. Kafka transactions), Flink uses the two-phase commit protocol.
Let's say we are performing checkpoint 42. First, pre-commit requests are issued. Once all participants (parallel subtasks/operators) have successfully acknowledged the pre-commit, the JobManager/CheckpointCoordinator starts sending out commit requests.
The thing is, if a failure happens at this point in time, there is no going back. If some commit fails, or there is some other unrelated failure, the job will be restarted from checkpoint 42 and Flink will re-attempt to commit the pending/pre-committed transactions. If failure happens again, rinse and repeat according to your selected restart strategy. If you want to avoid data loss, commit attempts must eventually succeed; there is simply no other way. We cannot revert those transactions: once some commit requests have been issued, some transactions might already have been committed, so we cannot roll back only a portion of them (otherwise we would have a data duplication problem).
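That lifecycle can be sketched like this. The TransactionalSink shape below is made up, loosely modeled on the two-phase-commit sink idea the answer describes; the names and signatures are illustrative, not Flink's actual API:

using System;

// Illustrative transactional sink lifecycle; NOT Flink's real API.
public abstract class TransactionalSink<TTxn>
{
    protected abstract TTxn BeginTransaction();   // open a txn for the next checkpoint
    protected abstract void PreCommit(TTxn txn);  // flush + make the txn durable ("vote yes")
    protected abstract void Commit(TTxn txn);     // must eventually succeed after pre-commit
    protected abstract void Abort(TTxn txn);      // only allowed BEFORE commit was decided

    // Called on restore after a failure: any transaction that was
    // pre-committed before the crash is committed again, never aborted,
    // because other participants may already have committed.
    public void RecoverPreCommitted(TTxn pendingTxn)
    {
        while (true)
        {
            try { Commit(pendingTxn); return; }
            catch (Exception)
            {
                // e.g. broker still down: the job stays in this
                // restart/retry loop; it cannot roll back instead.
                System.Threading.Thread.Sleep(5000);
            }
        }
    }
}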

How does 2PC prevent commit failure? [duplicate]

I understand, in a fuzzy sort of way, how regular ACID transactions work. You perform some work on a database in such a way that the work is not confirmed until some kind of commit flag is set. The commit part is based on some underlying assumption (like a single disk block write is atomic). In the event of a catastrophic error, you can just clear out the uncommitted data in the recovery phase.
How do distributed transactions work? In some of the MS documentation I have read that you can somehow perform a transaction across databases and filesystems (among other things).
This technology could be (and probably is) used for installers, where you want the program to be fully installed or fully absent. You simply begin a transaction at the start of the installer. Next you could connect to the registry and filesystem, making the changes that define the installation. When the job is done, simply commit, or rollback if the installation fails for some reason. The registry and filesystem are automatically cleaned for you by this magical distributed transaction coordinator.
How is it possible that two disparate systems can be transacted upon in this fashion? It seems to me that it is always possible to leave the system in an inconsistent state, where the filesystem has committed its changes and the registry has not. I think in MSDTC it is even possible to perform a transaction across the network.
I have read http://blogs.msdn.com/florinlazar/archive/2004/03/04/84199.aspx, but it feels like only the beginning of the explanation, and that step 4 should be expanded considerably.
Edit: From what I gather on http://en.wikipedia.org/wiki/Distributed_transaction, it can be accomplished by a two-phase commit (http://en.wikipedia.org/wiki/Two-phase_commit). After reading this, I'm still not understanding the method 100%, it seems like there is a lot of room for error between the steps.
About "step 4":
The transaction manager coordinates
with the resource managers to ensure
that all succeed to do the requested
work or none of the work if done, thus
maintaining the ACID properties.
This of course requires all participants to provide the proper interfaces and (error-free) implementations. The interface looks vaguely like this:

public interface ITransactionParticipant {
    // Phase 1: answer true only if this participant can guarantee that
    // a later Commit() will succeed under all allowable error conditions.
    bool WouldCommitWork();

    // Phase 2: make the prepared work permanent; only called after
    // every participant answered true.
    void Commit();

    // Called when any participant voted no, raised an error, or timed out.
    void Rollback();
}
At commit time, the transaction manager queries all participants as to whether they are willing to commit the transaction. A participant may only assert this if it is able to commit the transaction under all allowable error conditions (validation, system errors, etc.). After all participants have asserted the ability to commit, the manager sends the Commit() message to all participants. If any participant instead raises an error or times out, the whole transaction aborts and the individual members are rolled back.
This protocol requires participants to have recorded their whole transaction content before asserting their ability to commit. Of course, this has to be kept in a special local transaction log structure so that the participant can recover from various kinds of failures.
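As a rough illustration, a transaction manager driving the interface above might look like the following. This is a sketch only; the abort-on-exception policy shown is an assumption, not part of any specific product:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of a transaction manager built on ITransactionParticipant.
public class TransactionManager
{
    public bool RunTwoPhaseCommit(IReadOnlyList<ITransactionParticipant> participants)
    {
        // Phase 1: ask every participant whether commit would succeed.
        bool allWilling;
        try
        {
            allWilling = participants.All(p => p.WouldCommitWork());
        }
        catch (Exception) // a participant raised an error or timed out
        {
            allWilling = false;
        }

        if (!allWilling)
        {
            // Abort: no one has committed yet, so rollback is still safe.
            foreach (var p in participants) p.Rollback();
            return false;
        }

        // Phase 2: every participant promised success, so commit all.
        // From here on, failures are handled by retrying, never rollback.
        foreach (var p in participants) p.Commit();
        return true;
    }
}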

Why is Two-Phase Commit (2PC) blocking?

Can anyone let me know why 2PC is blocking when the coordinator fails? Is it because the cohorts don't employ a timeout concept in 2PC?
A good reference: Analysis and Verification of Two-Phase Commit & Three-Phase Commit Protocols, by Muhammad Atif.
Two-phase commit is a blocking protocol because once the participants enter the prepared state, they have to wait for the coordinator to decide the next step of processing. When the coordinator fails, they have to wait until it is resurrected.
It's not possible to start another coordinator to reach a result: participants are not permitted to change their state until they are commanded to do so.
I understand you are comparing 3PC with 2PC. 3PC (as I understand it) is really a family of protocols, of which several variants exist. 3PC addresses the blocking nature of 2PC.
The main point is to finish the transaction consistently (commit or rollback) using only knowledge of "the environment". The expectation is that a new (backup) coordinator is started, probably elected from among the participants, so that the transaction can be finished. Timeouts can be included so that a participant aborts after some time; even then, the newly started coordinator should be capable of consistently finishing the whole transaction (probably by rolling back in such a case).
2PC doesn't always block when the coordinator fails; a system using 2PC only blocks, after a coordinator failure, when someone reads a prepared (in-doubt) resource.
If the commit message (of phase 2) to a participant is lost, the participant's resource stays in the prepared state, and the participant must refer to the coordinator to find out what the exact state of the resource is. A participant cannot determine the exact state of a prepared resource by itself.
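A sketch of why reads block on an in-doubt resource. The types here are hypothetical and purely for illustration:

using System;
using System.Collections.Generic;

public enum ResourceState { Free, Prepared, Committed, RolledBack }

// Toy read path: rows touched by an in-doubt (prepared) transaction
// cannot be served until the coordinator reveals the outcome.
public class Participant
{
    private readonly Dictionary<string, ResourceState> rows =
        new Dictionary<string, ResourceState>();
    private readonly Func<string, ResourceState> askCoordinator;

    public Participant(Func<string, ResourceState> askCoordinator)
    {
        this.askCoordinator = askCoordinator;
    }

    public ResourceState Read(string rowKey)
    {
        var state = rows.TryGetValue(rowKey, out var s) ? s : ResourceState.Free;
        if (state == ResourceState.Prepared)
        {
            // In-doubt: only the coordinator knows the global decision.
            // If the coordinator is down, this call blocks; this is the
            // blocking behaviour the answer above describes.
            state = askCoordinator(rowKey);
            rows[rowKey] = state;
        }
        return state;
    }
}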

What problems can two-phase commits cause?

Recently I've read multiple times that two-phase commits are bad, but always as a side note, so there was never a good explanation to go with it.
For example, in CQRS Journey, Chapter 5:
"Second, we're trying to avoid two-phase commits because they always cause problems in the long run."
Or in Implementing Domain-Driven Design, on page 563:
"The second ReadRecords() is used by the infrastructure to replicate events, to publish them without the need for two-phase commit, ..."
I thought two-phase commits are implemented to ensure consistency among multiple database servers.
What problems can occur when using two-phase commits? Why is it better to avoid them?
The biggest problem is scalability due to the blocking nature of the 2 phase commit protocol.
2PC requires careful coordination between the participating parties: in particular, each party has to acknowledge the prepare phase and the commit. Once a party has acknowledged that it is ready to commit, it has to block until the transaction coordinator sends the commit or rollback message. If any of the parties is across a network, the network latency causes a bottleneck for the communication between the nodes.
Furthermore, once a party has acknowledged that it is ready to commit, it must actually be able to commit the transaction afterwards, even if it crashes in between. This requires checkpointing to persistent storage (even when the transaction is rolled back afterwards), which also potentially limits throughput.
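A sketch of that prepare-side cost (illustrative names and file layout; the forced log flush before voting yes is the point):

using System;
using System.IO;
using System.Text;

// Illustrative participant prepare handler: the "yes" vote may only be
// sent AFTER the transaction's effects are forced to durable storage,
// which is exactly the per-transaction flush cost that limits throughput.
public class PrepareHandler
{
    private readonly FileStream log =
        new FileStream("tx.log", FileMode.Append, FileAccess.Write);

    public bool Prepare(string txId, byte[] redoUndoRecords)
    {
        try
        {
            var header = Encoding.UTF8.GetBytes($"PREPARED {txId}\n");
            log.Write(header, 0, header.Length);
            log.Write(redoUndoRecords, 0, redoUndoRecords.Length);

            // Force to disk BEFORE answering: after voting yes, this node
            // must be able to commit even across a crash.
            log.Flush(flushToDisk: true);
            return true;  // vote yes
        }
        catch (IOException)
        {
            return false; // cannot guarantee commit: vote no
        }
    }
}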

Two phase commit

I believe most people know what 2PC (the two-phase commit protocol) is and how to use it in Java or most modern languages. Basically, it is used to make sure transactions stay in sync when you have two or more DBs.
Assume I've got two DBs (A and B) using 2PC in two different locations. When A and B are ready to commit a transaction, both DBs report back to the transaction manager saying they are ready to commit. So, once the transaction manager has been acknowledged, it sends a signal back to A and B telling them to go ahead.
Here is my question: let's say A received the signal and committed the transaction. Once everything is completed, B is about to do the same, but someone unplugs the power cable, causing the whole server to shut down. When B is back online, what will B do? And how does B do it?
Remember, A is committed but B is not, and we are using 2PC (so the design of 2PC stops working, doesn't it?)
On Two-Phase Commit
Two phase commit does not guarantee that a distributed transaction can't fail, but it does guarantee that it can't fail silently without the TM being aware of it.
In order for B to report the transaction as being ready to commit, B must have the transaction in persistent storage (i.e. B must be able to guarantee that the transaction can commit in all circumstances). In this situation, B has persisted the transaction but the transaction manager has not yet received a message from B confirming that B has completed the commit.
The transaction manager will poll B again when B comes back online and ask it to commit the transaction. If B has already committed the transaction it will report the transaction as committed. If B has not yet committed the transaction it will then commit as it has already persisted it and is thus still in a position to commit the transaction.
In order for B to fail in this situation, it would have to undergo a catastrophic failure that lost data or log entries. Even then, the transaction manager would still be aware that B had not reported a successful commit.[1]
In practice, if B can no longer commit the transaction, it would imply that the disaster that took B out had caused data loss, and B would report an error when the TM asked it to commit a TxID that it wasn't aware of or didn't think was in a committable state.
Thus, two-phase commit does not prevent a catastrophic failure from occurring, but it does prevent the failure from going unnoticed. In this scenario the transaction manager will report an error back to the application if B cannot commit.
The application still has to be able to recover from the error, but the transaction cannot fail silently without the application being made aware of the inconsistent state.
Semantics
- If a resource manager or network goes down in phase 1, the transaction manager will detect a fatal error (can't connect to the resource manager) and mark the sub-transaction as failed. When the network comes back up it will abort the transaction on all of the participating resource managers.
- If a resource manager or network goes down in phase 2, the transaction manager will continue to poll the resource manager until it comes back up. When it re-connects to the resource manager it will tell the RM to commit the transaction. If the RM returns an error along the lines of 'Unknown TxID' the TM will be aware that there is a data loss issue in the RM.
- If the TM goes down in phase 1, then the client will block until the TM comes back up, unless it times out or receives an error due to the broken network connection. In this case the client is made aware of the error and can either re-try or initiate the abort itself.
- If the TM goes down in phase 2, then it will block the client until the TM comes back up. It has already reported the transaction as committable and no fatal error should be presented to the client, although it may block until the TM comes back up. The TM will still have the transaction in an uncommitted state and will poll the RMs to commit when it comes back up.
Post-commit data loss events in the resource managers are not handled by the transaction manager and are a function of the resilience of the RMs.
Two-phase commit does not guarantee fault tolerance - see Paxos for an example of a protocol that does address fault tolerance - but it does guarantee that partial failure of a distributed transaction cannot go un-noticed.
[1] Note that this sort of failure could also lose data from previously committed transactions. Two-phase commit does not guarantee that the resource managers can't lose or corrupt data or that DR procedures don't screw up.
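The phase-2 polling behaviour described in the semantics above can be sketched as follows. The types are illustrative, and 'Unknown TxID' detection is modeled here as an exception rather than any real driver's error code:

using System;
using System.Threading;

// Illustrative RM handle; not a real API. Commit throws
// UnreachableException while the RM is down, and UnknownTxIdException
// if the RM lost the prepared transaction.
public class UnreachableException : Exception { }
public class UnknownTxIdException : Exception { }

public interface IResourceManager
{
    void Commit(string txId);
}

public static class Phase2
{
    // Phase 2 for one RM: poll until the RM is reachable, then commit.
    // An 'Unknown TxID' reply means the RM suffered data loss, which is
    // reported rather than silently ignored.
    public static void CommitWithPolling(IResourceManager rm, string txId)
    {
        while (true)
        {
            try
            {
                rm.Commit(txId);
                return;
            }
            catch (UnreachableException)
            {
                Thread.Sleep(1000); // RM or network down: keep polling
            }
            catch (UnknownTxIdException)
            {
                // The RM no longer knows the prepared transaction:
                // surface the data loss instead of letting it go unnoticed.
                throw new InvalidOperationException(
                    $"data loss in RM: transaction {txId} lost after prepare");
            }
        }
    }
}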
I believe three phase commit is a much better approach. Unfortunately I haven't found anyone implementing such a technology.
http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/
Here are the essential parts of the above article:
The fundamental difficulty with 2PC is that, once the decision to commit has been made by the co-ordinator and communicated to some replicas, the replicas go right ahead and act upon the commit statement without checking to see if every other replica got the message. Then, if a replica that committed crashes along with the co-ordinator, the system has no way of telling what the result of the transaction was (since only the co-ordinator and the replica that got the message know for sure). Since the transaction might already have been committed at the crashed replica, the protocol cannot pessimistically abort – as the transaction might have had side-effects that are impossible to undo. Similarly, the protocol cannot optimistically force the transaction to commit, as the original vote might have been to abort.
This problem is – mostly – circumvented by the addition of an extra phase to 2PC, unsurprisingly giving us a three-phase commit protocol. The idea is very simple. We break the second phase of 2PC – ‘commit’ – into two sub-phases. The first is the ‘prepare to commit’ phase. The co-ordinator sends this message to all replicas when it has received unanimous ‘yes’ votes in the first phase. On receipt of this message, replicas get into a state where they are able to commit the transaction – by taking necessary locks and so forth – but crucially do not do any work that they cannot later undo. They then reply to the co-ordinator telling it that the ‘prepare to commit’ message was received.
The purpose of this phase is to communicate the result of the vote to every replica so that the state of the protocol can be recovered no matter which replica dies.
The last phase of the protocol does almost exactly the same thing as the original ‘commit or abort’ phase in 2PC. If the co-ordinator receives confirmation of the delivery of the ‘prepare to commit’ message from all replicas, it is then safe to go ahead with committing the transaction. However, if delivery is not confirmed, the co-ordinator cannot guarantee that the protocol state will be recovered should it crash (if you are tolerating a fixed number f of failures, the co-ordinator can go ahead once it has received f+1 confirmations). In this case, the co-ordinator will abort the transaction.
If the co-ordinator should crash at any point, a recovery node can take over the transaction and query the state from any remaining replicas. If a replica that has committed the transaction has crashed, we know that every other replica has received a ‘prepare to commit’ message (otherwise the co-ordinator wouldn’t have moved to the commit phase), and therefore the recovery node will be able to determine that the transaction was able to be committed, and safely shepherd the protocol to its conclusion. If any replica reports to the recovery node that it has not received ‘prepare to commit’, the recovery node will know that the transaction has not been committed at any replica, and will therefore be able either to pessimistically abort or re-run the protocol from the beginning.
So does 3PC fix all our problems? Not quite, but it comes close. In the case of a network partition, the wheels rather come off – imagine that all the replicas that received ‘prepare to commit’ are on one side of the partition, and those that did not are on the other. Then both partitions will continue with recovery nodes that respectively commit or abort the transaction, and when the network merges the system will have an inconsistent state. So 3PC has potentially unsafe runs, as does 2PC, but will always make progress and therefore satisfies its liveness properties. The fact that 3PC will not block on single node failures makes it much more appealing for services where high availability is more important than low latencies.
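A condensed sketch of those three phases on the coordinator side (hypothetical types; the recovery-node logic the article describes is omitted for brevity):

using System.Collections.Generic;
using System.Linq;

// Illustrative 3PC participant; not a real API.
public interface IReplica
{
    bool Vote();              // phase 1: yes/no vote
    bool PrepareToCommit();   // phase 2: take locks, ack "prepared to commit"
    void Commit();            // phase 3: actually commit
    void Abort();
}

public static class ThreePhaseCoordinator
{
    public static void Run(IReadOnlyList<IReplica> replicas)
    {
        // Phase 1: collect votes.
        if (!replicas.All(r => r.Vote()))
        {
            foreach (var r in replicas) r.Abort();
            return;
        }

        // Phase 2: spread the decision BEFORE anyone commits, so any
        // single surviving replica can tell a recovery node the outcome.
        if (!replicas.All(r => r.PrepareToCommit()))
        {
            // Delivery not confirmed everywhere: safe to abort, since
            // no replica has committed yet.
            foreach (var r in replicas) r.Abort();
            return;
        }

        // Phase 3: every replica knows the decision; commit for real.
        foreach (var r in replicas) r.Commit();
    }
}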
Your scenario is not the only one where things can ultimately go wrong despite all effort. Suppose A and B have both reported "ready to commit" to the TM, and then someone unplugs the line between the TM and, say, B. B is waiting for the go-ahead (or no-go) from the TM, but it certainly won't keep waiting forever until the TM reconnects (its own resources involved in the transaction must stay locked/inaccessible throughout the entire wait time, for obvious reasons). So when B has been kept waiting too long for its own taste, it will take what is called a "heuristic decision". That is, it will decide to commit or roll back independently of the TM, based on, well, I don't really know what, but that doesn't really matter. It should be obvious that any such heuristic decision can deviate from the actual commit decision taken by the TM.
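A sketch of such a heuristic decision; everything here is illustrative, and the timeout plus the arbitrary local choice are the whole point:

using System;

public enum Outcome { Committed, RolledBack }

// Illustrative in-doubt participant: after waiting too long for the
// coordinator, it decides on its own, possibly contradicting the
// coordinator's actual decision.
public class InDoubtParticipant
{
    public Outcome AwaitDecision(Func<TimeSpan, Outcome?> waitForCoordinator)
    {
        Outcome? decision = waitForCoordinator(TimeSpan.FromMinutes(5));
        if (decision.HasValue)
            return decision.Value; // normal path: obey the coordinator

        // Heuristic decision: resources can't stay locked forever.
        // Rolling back here is an arbitrary local policy; if the
        // coordinator actually decided to commit, the global transaction
        // is now inconsistent (a "heuristic mixed" outcome).
        return Outcome.RolledBack;
    }
}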
