How is Raft linearlizable? - distributed

I am pretty new to distributed systems and was wondering how Raft consensus algorithm is linearizable. Raft commits log entries through quorum. At the moment the leader Raft commits, this means that more than half of the participants has the replicated log. But there may be a portion of the participants that doesn't have the latest logs or that they have the logs but haven't received instructions to commit those logs.
Or does Raft's read linearizability require a read quorum?

Well, linearizability pertains to both reads and writes, and yes, both are accomplished with a quorum. To make reads linearizable, reads must be handled by the leader, and the leader must verify it has not been superseded by a newer leader after applying the read to the state machine but before responding to the client. In practice, though, many real-world implementations use relaxed consistency models for reads, e.g. allowing reads from followers. But note that while quorums guarantee linearizability for the Raft cluster, that doesn’t mean client requests are linearizable. To extend linearizability to clients, sessions must be added to prevent dropped/duplicated client requests from producing multiple commits in the Raft log for a single commit, which would violate linearizabity.

kuujo has explained what's linearizability. I'll answer your other doubt in the question.
But there may be a portion of the participants that don't have the latest logs or that they have the logs but haven't received instructions to commit those logs.
This is possible after a leader commits a log entry and some of the peers do not have this log entry right now, but other peers will have it eventually in some ways, Raft does several things to guarantee that:
Even if the leader has committed a log entry (let's say LogEntry at index=8) and answered the client, background routines are still trying to sync logEntry:8 to other peers. If RPC fails, (it's possible)
Raft sends a heartbeat periodically(a kind of AppendEntries), this heartbeat RPC will sync logs the leader has to followers if followers do not have it. After followers have logEntry:8, followers will compare its local commitIndex and leaderCommitIndex in the RPC args to decide if it should commit logEntry:8. If the leader fails immediately after committing a log, (it's rare but possible)
Based on the election rule, only the one who has the logEntry:8 can win the election. After a new leader elected, the new leader will continue using the heartBeat to sync logEntry:8 to other peers. What happens if a follower falls behind so much that it can not get all logs from the leader? (it happens a lot when you try to add a new node). In this scenario, Raft will use a snapshot RPC mechanism to sync all data and trimmed logs.

Related

What exactly happens if checkpointed data cannot be committed?

I'm reading into the details of Flink's checkpointing mechanism right now and by now, I think I have a really good overview about how everything is tied together but one last issue strikes me here.
It's about how checkpoints and commits interact with each other in the ExactlyOnce context, because I have the feeling that there's still potential for data loss/duplicate records. Mainly I was thinking about potential failures of the commit message or its callback, when I stumbled upon this paragraph in the Flink Blog:
After a successful pre-commit, the commit must be guaranteed to eventually succeed – both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
Up until this point, I still had the impression that checkpoints would have to be acknowledged by the sink commit first, before they would be viewed as "valid". But apparently, once all operators are ready to actually commit, the checkpoint starts to exist and from that point on, the sink has to guarantee the commit can be done to ensure no data being lost. What exactly happens if my commit can never be done, e.g. if my Kafka sink is down for a longer period of time? Does this mean if the defined retries run out eventually, the checkpointed state will just be treated as the correct state or will Flink only be able to resume the job once this specific commit was able to be done and thus be stuck until broker is available again?
And what if the callback of the commit is lost somehow, will this be resolved in the next retry attempt or since the transaction is "done" now, the producer will not be able to commit and we enter this loop of repeated retries? (more of a Kafka question probably)
For committing the side effects (so things like external state, vide Kafka transactions), Flink is using two phase commit protocol.
Let's say we are performing checkpoint 42. First pre-commit requests are issued. If all participants (parallel subtasks/operators) successfully acknowledged the pre-commit, JobManager/CheckpointCoordinator will start sending out commit requests.
The thing is, if failure happens at this point of time, there is no way going back. If either some commit fails or there is some other unrelated failure, job will be restarted from the checkpoint 42 and Flink will re-attempt to commit the pending/pre-committed transactions. If failure happens again, rinse and repeat according to your selected restart strategy. If you want to avoid data loss, commit attempts must eventually succeed. There is simply no other way. We can not revert those transactions, as once some commit request were issued, some transactions might have already been committed, so we can not rollback only portion of them (otherwise we would have data duplication problem).

Eventual consistency with both database and message queue records

I have an application where I need to store some data in a database (mysql for instance) and then publish some data in a message queue. My problem is: If the application crashes after the storage in the database, my data will never be written in the message queue and then be lost (thus eventual consistency of my system will not be guaranted).
How can I solve this problem ?
I have an application where I need to store some data in a database (mysql for instance) and then publish some data in a message queue. My problem is: If the application crashes after the storage in the database, my data will never be written in the message queue and then be lost (thus eventual consistency of my system will not be guaranted). How can I solve this problem ?
In this particular case, the answer is to load the queue data from the database.
That is, you write the messages that need to be queued to the database, in the same transaction that you use to write the data. Then, asynchronously, you read that data from the database, and write it to the queue.
See Reliable Messaging without Distributed Transactions, by Udi Dahan.
If the application crashes, recovery is simple -- during restart, you query the database for all unacknowledged messages, and send them again.
Note that this design really expects the consumers of the messages to be designed for at least once delivery.
I am assuming that you have a loss-less message queue, where once you get a confirmation for writing data, the queue is guaranteed to have the record.
Basically, you need a loop with a transaction that can roll back or a status in the database. The pseudo code for a transaction is:
Begin transaction
Insert into database
Write to message queue
When message queue confirms, commit transaction
Personally, I would probably do this with a status:
Insert into database with a status of "pending" (or something like that)
Write to message queue
When message confirms, change status to "committed" (or something like that)
In the case of recovery from failure, you may need to check the message queue to see if any "pending" records were actually written to the queue.
I'm afraid that answers (VoiceOfUnreason, Udi Dahan) just sweep the problem under the carpet. The problem under carpet is: How the movement of data from database to queue should be designed so that the message will be posted just once (without XA). If you solve this, then you can easily extend that concept by any additional business logic.
CAP theorem tells you the limits clearly.
XA transactions is not 100% bullet proof solution, but seems to me best of all others that I have seen.
Adding to what #Gordon Linoff said, assuming durable messaging (something like MSMQ?) the method/handler is going to be transactional, so if it's all successful, the message will be written to the queue and the data to your view model, if it fails, all will fail...
To mitigate the ID issue you will need to use GUIDs instead of DB generated keys (if you are using messaging you will need to remove your referential integrity anyway and introduce GUIDS as keys).
One more suggestion, don't update the database, but inset only/upsert (the pending row and then the completed row) and have the reader do the projection of the data based on the latest row (for example)
Writing message as part of transaction is a good idea but it has multiple drawbacks like
If your
a. database/language does not support transaction
b. transaction are time taking operation
c. you can not afford to wait for queue response while responding to your service call.
d. If your database is already under stress, writing message will exacerbate the impact of higher workload.
the best practice is to use Database Streams. Most of the modern databases support streams(Dynamodb, mongodb, orcale etc.). You have consumer of database stream running which reads from database stream and write to queue or invalidate cache, add to search indexer etc. Once all of them are successful you mark the stream item as processed.
Pros of this approach
it will work in the case of multi-region deployment where there is a regional failure. (you should read from regional stream and hydrate all the regional data stores.)
No Overhead of writing more records or performance bottle necks of queues.
You can use this pattern for other data sources as well like caching, queuing, searching.
Cons
You may need to call multiple services to construct appropriate message.
One database stream might not be sufficient to construct appropriate message.
ensure the reliability of your streams, like redis stream is not reliable
NOTE this approach also does not guarantee exactly once semantics. The consumer logic should be idempotent and should be able to handle duplicate message

What is the purpose of Google Pub/Sub?

So I was looking at using Google's Pub/Sub service for queues but by trial and error I came to a conclusion that I have no idea what it's good for in real applications.
Google says that it's
A global service for real-time and reliable messaging and streaming
data
but the way it work is really strange to me. It holds acked messages up to 7 days, if the subscriber re-subscribes it will get all the messages from the past 7 days even if it already acked them, acked messages will most likely be sent again to the same subscriber that acked them already and there's no FIFO as well.
So I really do not understand how one should use this service if the only thing that it guarantees is that a message will be delivered at least once to any subscriber. This cannot be used for idempotent actions, each subscriber has to store an information about all messages that were acked already so it won't process the message multiple times and so on...
Google Cloud Pub/Sub has a lot of different applications where decoupled systems need to send and receive messages. The overview page offers a number of use cases including balancing work loads, logging, and event notifications. It is true that Google Cloud Pub/Sub does not currently offer any FIFO guarantees and that messages can be redelivered.
However, the fact that the delivery guarantee is "at least once" should not be taken to mean acked messages are redelivered when a subscriber re-subscribers. Redelivery of acked messages is a rare event. This generally only happens when the ack did not make it all the way back to the service due to a networking issue, a machine failure, or some other exceptional condition. While that means that apps do need to be able to handle this case, it does not mean it will happen frequently.
For different applications, what happens on message redelivery can differ. In a case such as cache invalidation, mentioned in the overview page, getting two events to invalidate an entry in a cache just means the value will have to be reloaded an extra time, so there is not a correctness concern.
In other cases, like tracking button clicks or other events on a website for logging or stats purposes, infrequent acked message redelivery is likely not going to affect the information gathered in a significant way, so not bothering to check if events are duplicates is fine.
For cases where it is necessary to ensure that messages are processed exactly once, then there has to be some sort of tracking on the subscriber side to ensure this is the case. It might be that the subscriber is already accessing and updating an underlying database in response to messages and duplicate events can be detected via that storage.

Perform reading from a paxos-based distributed cluster

Could any one help introduce how to read contents from the distributed cluster?
I mean there is a distributed cluster who's consistency is guaranteed by Paxos algorithm.
In real-world application, how does the client read the contents they have written to the cluster?
For example, in a 5 servers cluster, maybe only 3 of them get the newest data and the other 2 have old data due to network delay or something.
Does this means the client needs to read at least majority of all nodes? In 5-servers, it means reading data from at least 3 servers and checked the one with newest version number?
If so, it seems quite slow since you need to read 3 copies? How does the real world implement this ?
Clients should read from the leader. If a node knows it is not the leader it should redirect the client to the leader. If a node does not know who is leader it should throw an error and the client should pick another node at random until it is told or finds the leader. If the node thinks it is the leader it is dangerous to return a read from local state as it may have just lost connectivity to the rest of the cluster right when it gets a massive stall (cpu load, io stall, vm overload, large gc, some background task, server maintenance job, ...) such that it actually looses the leadership during replying to the client and gives out a stale read. This can be avoided by running a round of (multi)Paxos for the read.
Lamport Clocks and Vector Clock say you must pass messages to assign that operation A happens before operation B when they run on different machines. If not they run concurrently. This provides the theoretic underpinning as to why we cannot say a read from a leader is not stale without exchanging messages with the majority of the cluster. The message exchange establishes a "happened-before" relationship of the read to the next write (which may happen on a new leader due to a failover).
The leader itself can be an acceptor and so in a three node cluster it just needs one response from one other node to complete a round of (multi)Paxos. It can send messages in parallel and reply to the client when it gets the first response. The network between nodes should be dedicated to intra-cluster traffic (and the best you can get) such that this does not add much latency to the client.
There is an answer which describes how Paoxs can be used for a locking service which cannot tolerate stale reads or reordered writes where a crash scenario is discussed over at some questions about paxos Clearly a locking service cannot have reads and writes to the locks "running concurrently" hence why it does a round of (multi)Paxos for each client message to strictly order reads and writes across the cluster.

Two phase commit

I believe most of people know what 2PC (two-phase commit protocol) is and how to use it in Java or most of modern languages. Basically, it is used to make sure the transactions are in sync when you have 2 or more DBs.
Assume I've two DBs (A and B) using 2PC in two different locations. Before A and B are ready to commit a transaction, both DBs will report back to the transaction manager saying they are ready to commit. So, when the transaction manager is acknowledged, it will send a signal back to A and B telling them to go ahead.
Here is my question: let's say A received the signal and commited the transaction. Once everything is completed, B is about to do the same but someone unplugs the power cable, causing the whole server shutdown. When B is back online, what will B do? And how does B do it?
Remember, A is committed but B is not, and we are using 2PC (so, the design of 2PC stops working, does not it?)
On Two-Phase Commit
Two phase commit does not guarantee that a distributed transaction can't fail, but it does guarantee that it can't fail silently without the TM being aware of it.
In order for B to report the transaction as being ready to commit, B must have the transaction in persistent storage (i.e. B must be able to guarantee that the transaction can commit in all circumstances). In this situation, B has persisted the transaction but the transaction manager has not yet received a message from B confirming that B has completed the commit.
The transaction manager will poll B again when B comes back online and ask it to commit the transaction. If B has already committed the transaction it will report the transaction as committed. If B has not yet committed the transaction it will then commit as it has already persisted it and is thus still in a position to commit the transaction.
In order for B to fail in this situation, it would have to undergo a catastrophic failure that lost data or log entries. The transaction manager would still be aware that B had not reported a successful commit.1
In practice, if B can no longer commit the transaction, it would imply that the disaster that took B out had caused data loss, and B would report an error when the TM asked it to commit a TxID that it wasn't aware of or didn't think was in a commitable state.
Thus, two phase commit does not prevent a catastrophic failure from occuring, but it does prevent the failure from going unnoticed. In this scenario the transaction manager will report an error back to the application if B cannot commit.
The application still has to be able to recover from the error, but the transaction cannot fail silently without the application being made aware of the inconsistent state.
Semantics
If a resource manager or network goes down in phase 1, the
transaction manager will detect a fatal error (can't connect to
resource manager) and mark the sub-transaction as failed. When the
network comes back up it will abort the transaction on all of the
participating resource managers.
If a resource manager or network goes down in phase 2, the
transaction manager will continue to poll the resource manager until
it comes back up. When it re-connects back to the resource manager
it will tell the RM to commit the transaction. If the RM returns an
error along the lines of 'Unknown TxID' the TM will be aware that
there is a data loss issue in the RM.
If the TM goes down in phase 1 then the client will block until the
TM comes back up, unless it times out or receives an error due to the
broken network connection. In this case the client is made aware of
the error and can either re-try or initiate the abort itself.
If the TM goes down in phase 2 then it will block the client until
the TM comes back up. It has already reported the transaction as
committable and no fatal error should be presented to the client,
although it may block until the TM comes back up. The TM will still
have the transaction in an uncommitted state and will poll the RMs
to commit when it comes back up.
Post-commit data loss events in the resource managers are not handled by the transaction manager and are a function of the resilience of the RMs.
Two-phase commit does not guarantee fault tolerance - see Paxos for an example of a protocol that does address fault tolerance - but it does guarantee that partial failure of a distributed transaction cannot go un-noticed.
Note that this sort of failure could also lose data from previously committed transactions. Two phase commit does not guarantee that the resource managers can't lose or corrupt data or that DR procedures don't screw up.
I believe three phase commit is a much better approach. Unfortunately I haven't found anyone implementing such a technology.
http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/
Here are the essential parts of the above article :
The fundamental difficulty with 2PC is that, once the decision to commit has been made by the co-ordinator and communicated to some replicas, the replicas go right ahead and act upon the commit statement without checking to see if every other replica got the message. Then, if a replica that committed crashes along with the co-ordinator, the system has no way of telling what the result of the transaction was (since only the co-ordinator and the replica that got the message know for sure). Since the transaction might already have been committed at the crashed replica, the protocol cannot pessimistically abort – as the transaction might have had side-effects that are impossible to undo. Similarly, the protocol cannot optimistically force the transaction to commit, as the original vote might have been to abort.
This problem is – mostly – circumvented by the addition of an extra phase to 2PC, unsurprisingly giving us a three-phase commit protocol. The idea is very simple. We break the second phase of 2PC – ‘commit’ – into two sub-phases. The first is the ‘prepare to commit’ phase. The co-ordinator sends this message to all replicas when it has received unanimous ‘yes’ votes in the first phase. On receipt of this messages, replicas get into a state where they are able to commit the transaction – by taking necessary locks and so forth – but crucially do not do any work that they cannot later undo. They then reply to the co-ordinator telling it that the ‘prepare to commit’ message was received.
The purpose of this phase is to communicate the result of the vote to every replica so that the state of the protocol can be recovered no matter which replica dies.
The last phase of the protocol does almost exactly the same thing as the original ‘commit or abort’ phase in 2PC. If the co-ordinator receives confirmation of the delivery of the ‘prepare to commit’ message from all replicas, it is then safe to go ahead with committing the transaction. However, if delivery is not confirmed, the co-ordinator cannot guarantee that the protocol state will be recovered should it crash (if you are tolerating a fixed number f of failures, the co-ordinator can go ahead once it has received f+1 confirmations). In this case, the co-ordinator will abort the transaction.
If the co-ordinator should crash at any point, a recovery node can take over the transaction and query the state from any remaining replicas. If a replica that has committed the transaction has crashed, we know that every other replica has received a ‘prepare to commit’ message (otherwise the co-ordinator wouldn’t have moved to the commit phase), and therefore the recovery node will be able to determine that the transaction was able to be committed, and safely shepherd the protocol to its conclusion. If any replica reports to the recovery node that it has not received ‘prepare to commit’, the recovery node will know that the transaction has not been committed at any replica, and will therefore be able either to pessimistically abort or re-run the protocol from the beginning.
So does 3PC fix all our problems? Not quite, but it comes close. In the case of a network partition, the wheels rather come off – imagine that all the replicas that received ‘prepare to commit’ are on one side of the partition, and those that did not are on the other. Then both partitions will continue with recovery nodes that respectively commit or abort the transaction, and when the network merges the system will have an inconsistent state. So 3PC has potentially unsafe runs, as does 2PC, but will always make progress and therefore satisfies its liveness properties. The fact that 3PC will not block on single node failures makes it much more appealing for services where high availability is more important than low latencies.
Your scenario is not the only one where things can ultimately go wrong despite all effort. Suppose A and B have both reported "ready to commit" to TM, and then someone unplugs the line between TM and, say, B. B is waiting for the go-ahead (or no-go) from TM, but it certainly won't keep waiting forever until TM reconnects (its own resources involved in the transaction must stay locked/inaccessible throughout the entire wait time for obvious reasons). So when B is kept waiting too long for its own taste, it will take what is called "heuristic decisions". That is, it will decide to commit or rollback independently from TM, based on, well, I don't really know what, but that doesn't really matter. It should be obvious that any such heuristic decisions can deviate from the actual commit decision taken by TM.

Resources