Raft loss of data/Log entries on leader re-election - distributed

Scenario 1: Let's assume there is a 5-node cluster in which 2 nodes are already out of commission, leaving 3 nodes in service: A (leader), B, C.
A replicated a log entry to B and C, received successful responses, committed it, applied it, responded to the client, and then died. Now there are only two nodes, B and C, with some un-applied log entries.
Now, if D comes up and B becomes the new leader, what will happen to the un-applied entries? Will they be committed/applied as well?
Scenario 2: In the 5-node cluster, 3 nodes just went down; A is still the leader and B is online.
A replicated an entry to B but could not commit it, and then A got killed; C and D come back up (so B, C, D are up). What will happen to the entries that were replicated to B? Will they be committed/applied?

Keep in mind there's a difference between an entry being committed and an entry being applied. Entries can be committed but not yet applied, but entries can't be applied without having been committed. Once an entry is committed (replicated to a majority of the cluster in the leader's current term) it's guaranteed to eventually be applied on all nodes. So the next leader must apply it; otherwise you could end up in an inconsistent state if you're using Raft to manage a replicated state machine. The election protocol guarantees that the next leader elected will have the committed entries (so either B or C, but not D), and the new leader's first action will be to replicate those entries to another node (D), determine that they're committed, then apply them and update its followers.
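As a rough illustration of that commit/apply distinction, here is a minimal sketch (field and function names are illustrative, not taken from any particular Raft implementation) of how a leader advances its commit index and how applying trails behind it:

```python
# Illustrative sketch only; Entry, match_index, etc. are not from a real implementation.
from collections import namedtuple

Entry = namedtuple("Entry", "term command")

def advance_commit_index(log, current_term, commit_index, match_index, cluster_size):
    """Leader-side commit rule: only an entry from the leader's *current* term may be
    counted as committed once replicated to a majority; earlier-term entries become
    committed implicitly once a later current-term entry is committed."""
    for n in range(len(log), commit_index, -1):                        # check highest index first
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)  # leader itself + followers
        if 2 * replicas > cluster_size and log[n - 1].term == current_term:
            return n                                                   # new commit index
    return commit_index

def apply_committed(state_machine, log, last_applied, commit_index):
    """Applying always trails committing: entries are applied in log order up to commit_index."""
    while last_applied < commit_index:
        last_applied += 1
        state_machine.apply(log[last_applied - 1].command)
    return last_applied
```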
In the second case, if B gets elected with the additional entries then it will ultimately commit and apply them. This is why sessions are a critical component of guaranteeing linearizability for clients. When an operation fails due to a failure to reach a quorum, it's always possible that the change could actually be committed later. Sessions should be used to guarantee that a change is applied to the state machine only once, no matter how many times it's committed (idempotence). See the Raft dissertation for more on sessions.
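And a small sketch of the session/deduplication idea (the structure here is hypothetical; the Raft dissertation describes the full design): each client tags its commands with a client id and a sequence number, and the state machine caches the last result per client so that a re-committed retry is not applied twice.

```python
# Hypothetical session-based deduplication for a replicated state machine.
class SessionedStateMachine:
    def __init__(self):
        self.data = {}
        self.sessions = {}  # client_id -> (last_seq, cached_response)

    def apply(self, client_id, seq, command):
        last_seq, cached = self.sessions.get(client_id, (-1, None))
        if seq <= last_seq:
            # Already applied (e.g. the client retried after a timeout and the
            # original proposal was committed anyway): return the cached result.
            return cached
        result = self._execute(command)
        self.sessions[client_id] = (seq, result)
        return result

    def _execute(self, command):
        op, key, value = command
        if op == "set":
            self.data[key] = value
        return self.data.get(key)
```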

Related

How does Cassandra handle write timestamp conflicts between QUORUM reads?

In the incredibly unlikely event that 2 QUORUM writes happen in parallel to the same row, and result in 2 partition replicas being inconsistent with the same timestamp:
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Then the next question is how does the cluster reach consistency again since the data has the same timestamp?
I understand this situation is highly improbable, but my guess is that it is still possible.
Example diagram (not reproduced here): two concurrent writes, W1 (data = 1) and W2 (data = 2), both land with timestamp t = 1, leaving the replicas inconsistent.
Here is what I got from DataStax support:
Definitely a possible situation to consider. Cassandra/Astra handles this scenario with the following precedence rules so that results to the client are always consistent:
Timestamps are compared and latest timestamp always wins
If data being read has the same timestamp, deletes have priority over inserts/updates
In the event there is still a tie to break, Cassandra/Astra chooses the value for the column that is lexically larger
While these are certainly a bit arbitrary, Cassandra/Astra cannot know which value is supposed to take priority, and these rules do function to always give the exact same results to all clients when a tie happens.
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Cassandra/Astra would handle this for you behind the scenes while traversing the read path. If there is a discrepancy between the data being returned by the two replicas, the data would be compared and synced amongst those two nodes involved in the read prior to sending the data back to the client.
So with regards to your diagram, with W1 and W2 both taking place at t = 1, the data coming back to the client would be data = 2 because 2 > 1. In addition, Node 1 would now have the missing data = 2 at t = 1 record. Node 2 would still only have data = 1 at t = 1 because it was not involved in that read.
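For illustration, the precedence rules quoted above boil down to a small comparison function. This is only a sketch of the documented tie-breaking behaviour, not Cassandra's actual reconciliation code, and the Cell type is made up:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cell:
    value: Optional[bytes]   # None represents a tombstone (delete)
    timestamp: int           # write timestamp

def reconcile(a: Cell, b: Cell) -> Cell:
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b     # 1. latest timestamp wins
    a_del, b_del = a.value is None, b.value is None
    if a_del != b_del:
        return a if a_del else b                         # 2. deletes beat inserts/updates
    if a_del:
        return a                                         # both tombstones: equivalent outcome
    return a if a.value >= b.value else b                # 3. lexically larger value wins

# With W1 = Cell(b"1", 1) and W2 = Cell(b"2", 1), reconcile(W1, W2) picks data = 2, as in the example.
```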

What to pick? Quorum OR Latest timestamp data in Cassandra

I was reading about Cassandra and learned that there is a quorum concept (i.e., if there are multiple nodes/replicas on which a particular key is stored, then during a read operation the data held by the majority of those replicas is chosen and returned) to deal with consistency during read operations.
My doubt may be silly, but I am not able to see how the quorum concept is useful in the case where the majority data value is different from the latest-timestamp data.
How do we then decide which data value to return?
Example: for a particular key "key1", with timestamps t1 > t2, 5 replicas, and replica0 (the main node) down:
replica  - value  - timestamp
replica1 - value1 - t1
replica2 - value2 - t2
replica3 - value2 - t2
replica4 - value2 - t2
So in the above case, what should we return: the majority value (value2) or the latest-timestamp value (value1)?
Can someone please help?
In Cassandra the last write always wins. That means that for (a1, t1) and (a2, t2) with t2 > t1, value a2 will be considered the right one.
Regarding your question, a QUORUM read on its own is not that useful. That is because in order to have full consistency, the following rule must be followed:
RC+WC>RF
(RC - read consistency; WC - Write consistency; RF - Replication factor)
In your case (when a majority of replicas have the old data), QUORUM will increase the chance of getting the right data, but it won't guarantee it.
The most common use case is using quorum for both read and write. That would mean that for a RF of 5, 3 nodes would have the right value. Now, if we also read from 3 nodes then there is no way that one of the 3 does not have the newer value (since at most 2 have the old value).
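To make the arithmetic concrete, here is a tiny illustrative check of that rule (not part of any driver):

```python
# Strong consistency requires the read and write replica sets to overlap: RC + WC > RF.
def replica_sets_overlap(rc: int, wc: int, rf: int) -> bool:
    return rc + wc > rf

quorum = 5 // 2 + 1                              # QUORUM on RF = 5 means 3 replicas
print(replica_sets_overlap(quorum, quorum, 5))   # True:  QUORUM writes + QUORUM reads always overlap
print(replica_sets_overlap(quorum, 1, 5))        # False: ONE writes + QUORUM reads can miss the newest value
```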
Regarding how reading works: when you ask for quorum on an RF of 5, the coordinator node will ask one node for the actual data and 2 nodes for a digest of that data. The coordinator then compares the digest from the first node (the actual data) with the other 2 digests. If they match, all is good and the data from the first node is returned. If they differ, a read repair will be triggered, meaning that the data will be updated across all available nodes.
So if you write with consistency ONE on an RF of 5, not only do you risk getting old data even with quorum reads, but if something happens to the node that had the good data, you could lose it altogether. Finding the balance depends on the particular use case. If in doubt, use quorum for both reads and writes.
Hope this made sense,
Cheers!
Quorum just means that a majority of the nodes should provide the answer. But the answers could have different timestamps, so the coordinator node will select the answer with the latest timestamp to send to the client, and at the same time will trigger a repair operation for the nodes that have the old data.
But in your situation you may still get the old answer, because with RF=5 the quorum is 3 nodes, and the coordinator can pick up results from replicas 2-4, which have the old data. You'll get the newest result only if the coordinator includes replica 1 in the list of queried nodes.
P.S. In Cassandra there is no main replica - all replicas are equal.

Cassandra write consistency level ALL clarification

According to Datastax documentation for Cassandra:
"If the coordinator cannot write to enough replicas to meet the requested consistency level, it throws an Unavailable Exception and does not perform any writes."
Does this mean that while the write is in progress, the data updated by the write will not be available to read requests? I mean, it is possible that 4/5 nodes have successfully sent a SUCCESS to the coordinator, meaning that their data has been updated. But the 5th one is yet to do the write. Now if a read request comes in and goes to one of these 4 nodes, will it still show the old data until the coordinator receives a confirmation from the 5th node and marks the new data valid?
If the coordinator knows that it cannot possibly achieve consistency before it attempts the write, then it will fail the request immediately before doing the write. (This is described in the quote given)
However, if the coordinator believes that there are enough nodes to achieve its configured consistency level at the time of the attempt, it will start to send the data to its peers. If one of the peers does not return a success, the request will fail, and you will end up in a state where the nodes that failed have the old data and the ones that succeeded have the new data.
If a read request comes in, it will return whatever data it finds on the nodes it reaches, whether that data is old or new.
Let us take your example to demonstrate.
If you have 5 nodes and a replication factor of 3, then 3 of those 5 nodes will receive the write that you have sent. However, suppose one of those three nodes returned a failure to the coordinator. Now if you read with consistency level ALL, you will read all three nodes and will always get the new write (the latest timestamp always wins).
However, if you read with consistency level ONE, there is a 1/3 chance you will get the old value.
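For reference, with the DataStax Python driver the consistency level is set per statement, roughly like this (the contact point, keyspace and table names are placeholders):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])             # placeholder contact point
session = cluster.connect("my_keyspace")     # placeholder keyspace

# Write at ALL: fails with an Unavailable error if not enough replicas are alive,
# but a partial write can still land on some replicas if one of them fails mid-request.
insert = SimpleStatement(
    "INSERT INTO events (id, data) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ALL,
)
session.execute(insert, (42, "new value"))

# A read at ONE may or may not see such a partially applied write, as described above.
select = SimpleStatement(
    "SELECT data FROM events WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(select, (42,)).one()
```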

Aerospike ACID - How to know the final result of the transaction on timeouts?

I'm new to Aerospike.
I would like to know that in all possible timeout scenarios, as stated in this link:
https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852
Client can't connect within the specified timeout (timeout=). A timeout of zero means that there is no timeout set.
Client does not receive a response within the specified timeout (timeout=).
Server times out the transaction during its own processing (default of 1 second if the client doesn't specify a timeout). To investigate this, confirm that the server transaction latencies are not the bottleneck.
Client times out after M iterations of retries when there was no error due to a failed node or a failed connection.
Client can't obtain a valid node after N retries (where retries are set from your client).
Client can't obtain a valid connection after X retries. The retry count is usually the limiting factor, not the timeout value. The reasoning is that if you can't get a connection after R retries, you never will, so just time out early.
Of all the timeout scenarios mentioned, under which circumstances could I be absolutely certain that the final result of the transaction is FAILED?
Does Aerospike offer anything, e.g. to roll back the transaction if the client does not get a response?
In the worst case, if I couldn't be certain about the final result, how would I be able to know for certain the final state of the transaction?
Many thanks in advance.
Edit:
We came up with a temporary solution:
Keep a map of [generation -> value read] for that record (maybe via a background thread constantly reading the record, etc.), and then on timeouts periodically check the map (key = the expected generation) to see whether the value actually written is the one we put into the map. If they are the same, it means the write succeeded; otherwise it means the write failed.
Do you guys think it's necessary to do this? Or is there other way?
First, timeouts are not the only error you should be concerned with. Newer clients have an 'inDoubt' flag associated with errors that will indicate that the write may or may not have applied.
There isn't a built-in way of resolving an in-doubt transaction to a definitive answer, and if the network is partitioned, there isn't a way in AP mode to rigorously resolve in-doubt transactions. Rigorous methods do exist for 'Strong Consistency' mode; the same methods can be used to handle common AP scenarios, but they will fail under partition.
The method I have used is as follows.
Setup: each record will need a list bin (txns) that contains the last N transaction-ids. For my use case, I gave each client a unique 2-byte identifier, each client thread a unique 2-byte identifier, and each client thread a 4-byte counter; a particular transaction-id is then an 8-byte value packed from the two identifiers and the counter. N needs to be large enough that, given the contention on the key, a client is sure to have enough time to verify its transaction. N also affects the stored size of the record, so choosing it too big will cost disk resources and choosing it too small will render the algorithm ineffective.
1. Read the record's metadata with the getHeader API - this avoids reading the record's bins from storage. (Note - my use case wasn't an increment, so I actually had to read the record and write with a generation check. This pattern should be more efficient for a counter use case.)
2. Write the record using operate with gen-equal set to the read generation and these operations: increment the integer bin, prepend your transaction-id to the txns list, and trim the list to the maximum size you selected.
3. If the transaction is successful then you are done.
4. If the transaction is 'inDoubt', read the key and check the txns list for your transaction-id. If it is present then your transaction 'definitely succeeded'.
5. If your transaction-id isn't in txns, repeat step 2 with the generation returned from the read in step 4, then return to step 3 - with the exception that a 'generation error' on this retry also needs to be considered 'in-doubt', since it may have been the previous attempt that finally applied.
Also consider that reading the record in step 4 and not finding your transaction-id in txns does not ensure that the transaction 'definitely failed'. If you wanted to leave the record unchanged but still have a 'definitely failed' semantic, you would need to have observed the generation move past the previous write's gen-check value. If it hasn't, you could replace the write in step 5 with a touch: if the touch succeeds then the initial write 'definitely failed', and if you get a generation error you will need to check whether you raced the application of the initial write, which may now have 'definitely succeeded'.
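A compact sketch of that loop in Python, where read_header, gen_checked_write and read_txns are placeholders for the client's getHeader/operate/get calls (exact client method names, policies and exception types vary by client library, so treat this as annotated pseudocode of the steps above):

```python
import itertools
import struct

class InDoubtError(Exception):      # placeholder: timeout or other 'inDoubt' client error
    pass

class GenerationError(Exception):   # placeholder: generation-check failure on a write
    pass

def read_header(key):
    """Placeholder: return the record's current generation without reading its bins."""
    raise NotImplementedError

def gen_checked_write(key, gen, txn_id, max_txns):
    """Placeholder: a single operate() call that increments the counter bin, prepends
    txn_id to the txns list bin and trims it to max_txns, gated on generation == gen."""
    raise NotImplementedError

def read_txns(key):
    """Placeholder: read the record, returning (generation, txns list)."""
    raise NotImplementedError

_counter = itertools.count()

def make_txn_id(client_id: int, thread_id: int) -> bytes:
    """8-byte transaction-id: 2-byte client id + 2-byte thread id + 4-byte counter."""
    return struct.pack(">HHI", client_id, thread_id, next(_counter) & 0xFFFFFFFF)

def safe_increment(key, client_id: int, thread_id: int, max_txns: int = 64) -> str:
    txn_id = make_txn_id(client_id, thread_id)
    gen = read_header(key)                                  # step 1: metadata-only read
    while True:
        try:
            gen_checked_write(key, gen, txn_id, max_txns)   # step 2: gen-checked operate
            return "definitely succeeded"                   # step 3
        except (InDoubtError, GenerationError):             # steps 4-5: outcome unknown
            gen, txns = read_txns(key)
            if txn_id in txns:
                return "definitely succeeded"               # our write landed after all
            # Not found: retry the gen-checked write with the freshly read generation.
            # A generation error on the retry is treated as in-doubt again (same branch),
            # since the earlier attempt may have been the one that finally applied.
```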
Again, with 'Strong Consistency' the mentions of 'definitely succeeded' and 'definitely failed' are accurate statements, but in AP these statements have failure modes (especially around network partitions).
Recent clients provide an extra flag on timeouts, called "in doubt". If false, you are certain the transaction did not succeed (the client couldn't even connect to the node, so it couldn't have sent the transaction). If true, there is still uncertainty, as the client did send the transaction but doesn't know whether it reached the cluster or not.
You may also consider looking at Aerospike's Strong Consistency feature which could help your use case.

How to publish data changes between databases within RabbitMQ?

Suppose we have two applications A and B with their own databases, A1 (MySQL) and B1 (Postgres) respectively. We create entities X and Y, associated with each other, in application A; Y belongs to X. On every insert into database A1 (after commit) we publish a message to RabbitMQ to make application B aware of the brand-new entities - one event per entity, X1 and Y1. Everything is fine if RabbitMQ keeps the order of the messages, so that workers in application B can process X1 first and Y1 second to establish the right association between the new A and B records in database B1. But as far as I understand, RabbitMQ is not intended to preserve message order and only does so under very specific circumstances: publish within one channel, send to one exchange, push to one queue, consume within one channel.
So my question is about the correct direction and general approach:
Should I choose another message queue, that guarantees messages order?
Have I missed something specific in RabbitMQ messages order specifics?
Should I implement some kind of retry mechanism on the application B side that re-enqueues messages back to RabbitMQ in case the message order is not as expected?
Maybe this gives more context: B1 is a data warehouse that aggregates data not only from A1 but also from other databases.
Everything is good, if RabbitMQ keeps the order of the messages
Is that the only solution?
Can't X and Y instances get a sequence number assigned, which will then be used to rebuild the right sequence within B?
to establish right association between new A and B records
Can't X1 and Y1 express an explicit relation, allowing the A and B records to be created without relying on sequence?
The point is:
Message ordering is always expensive. You either have hard constraints (e.g. one consumer) or you have less availability and speed. Your best bet is finding a way to not rely on order.
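As an illustration of the second suggestion, here is a sketch of a consumer on the application B side that relies on the explicit X reference carried in each Y event rather than on queue order (the queue name, payload fields and in-memory repository are placeholders; a real implementation would write to B1):

```python
import json
import pika

class InMemoryRepo:
    """Stand-in for writes into the B1 warehouse."""
    def __init__(self):
        self.xs, self.ys = {}, {}
    def upsert_x(self, x_id, payload): self.xs[x_id] = payload
    def x_exists(self, x_id): return x_id in self.xs
    def upsert_y(self, y_id, x_id, payload): self.ys[y_id] = (x_id, payload)

repository = InMemoryRepo()

def on_message(ch, method, properties, body):
    event = json.loads(body)
    if event["type"] == "X_created":
        repository.upsert_x(event["x_id"], event["payload"])
        ch.basic_ack(delivery_tag=method.delivery_tag)
    elif event["type"] == "Y_created":
        if repository.x_exists(event["x_id"]):          # parent already ingested
            repository.upsert_y(event["y_id"], event["x_id"], event["payload"])
            ch.basic_ack(delivery_tag=method.delivery_tag)
        else:
            # Y arrived before its X: requeue (or park in a delay queue) and try again later.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="a1_changes", on_message_callback=on_message)
channel.start_consuming()
```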

Resources