I was reading about Cassandra and learned about the quorum concept (i.e., if a particular key is stored on multiple nodes/replicas, then during a read operation the data that a majority of those replicas agree on is chosen and returned) for dealing with consistency during reads.
My doubt may be silly, but I am not able to understand how the quorum concept helps in the case where the majority value is different from the value with the latest timestamp.
How do we then decide which value to return?
Ex -
For a particular key "key1", with timestamps t1 > t2 and 5 replicas, where replica0 (the main node) is down:
replicaNo - value - timestamp
replica1 - value1 - t1
replica2 - value2 - t2
replica3 - value2 - t2
replica4 - value2 - t2
So in the above case, what should we return: the majority value (value2) or the value with the latest timestamp (value1)?
Can someone please help?
In Cassandra the last write always wins. That means that for (a1, t1) and (a2, t2) with t2 > t1, value a2 will be considered the right one.
Regarding your question, a QUORUM read on its own is not that useful. That is because in order to have full consistency, the following rule must be followed:
RC + WC > RF
(RC - read consistency; WC - write consistency; RF - replication factor)
In your case (when a majority of replicas have the old data), QUORUM will increase the chance of getting the right data, but it won't guarantee it.
The most common use case is using QUORUM for both reads and writes. That would mean that for an RF of 5, 3 nodes would have the right value. Now, if we also read from 3 nodes, then at least one of those 3 is guaranteed to have the newer value (since at most 2 have the old value).
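As an illustration, a minimal sketch of requesting QUORUM for both reads and writes with the DataStax Python driver might look like this (the keyspace, table and column names are made up for the example):

```python
# Sketch using the DataStax Python driver (cassandra-driver);
# keyspace/table/column names are assumptions for the example.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Write at QUORUM: the coordinator waits for a majority of replicas (3 of 5 with RF = 5).
write = SimpleStatement(
    "INSERT INTO my_table (id, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("key1", "value1"))

# Read at QUORUM: with RC + WC > RF (3 + 3 > 5), at least one of the replicas
# queried must hold the latest write, and the latest timestamp wins.
read = SimpleStatement(
    "SELECT value FROM my_table WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, ("key1",)).one()
```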
Regarding how reading works: when you ask for QUORUM on an RF of 5, the coordinator node will ask one node for the actual data and 2 nodes for a digest of that data. The coordinator then compares the digest computed from the first node's actual data with the other 2 digests. If they match, all is good and the data from the first node is returned. If they differ, a read repair will be triggered, meaning that the data will be updated across all available nodes.
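A toy sketch of the digest idea (purely illustrative, not Cassandra's actual implementation): one replica returns the full data, the others return only a hash, and the coordinator compares them.

```python
# Toy illustration of digest-based reads; not Cassandra's real code.
import hashlib

def digest(value: str) -> str:
    # Cheap stand-in for a replica's data digest.
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def digests_match(full_data: str, other_digests: list[str]) -> bool:
    # The coordinator compares the digest of the full data it received
    # with the digests reported by the other replicas.
    return all(d == digest(full_data) for d in other_digests)

# Replica 1 returned "value1"; replicas 2 and 3 returned digests of "value2".
if not digests_match("value1", [digest("value2"), digest("value2")]):
    # A real coordinator would now fetch the full data from all replicas,
    # reconcile by timestamp, and trigger a read repair.
    pass
```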
So if you write with consistency ONE on an RF of 5, not only do you risk getting old data even with QUORUM reads, but if something happens to the node that had the good data, you could lose it altogether. Finding the balance depends on the particular use case. If in doubt, use QUORUM for both reads and writes.
Hope this made sense,
Cheers!
QUORUM just means that a majority of the replicas should provide an answer. But the answers could have different timestamps, so the coordinator node will select the answer with the latest timestamp to send to the client, and at the same time will trigger a read repair for the nodes that have the old data.
But in your situation you may still get the old answer, because with RF=5 the quorum is 3 nodes, and the coordinator can pick up results from replicas 2-4, which have old data. You'll get the newest result only if the coordinator includes replica 1 in the list of queried nodes.
P.S. In Cassandra there is no main replica; all replicas are equal.
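If you want to observe this, the DataStax Python driver can return a query trace showing which replicas the coordinator contacted and whether a read repair was triggered (a sketch; keyspace/table names are made up):

```python
# Sketch with the DataStax Python driver: trace a QUORUM read to see which
# replicas the coordinator contacted. Keyspace/table names are assumptions.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

stmt = SimpleStatement(
    "SELECT value FROM my_table WHERE id = 'key1'",
    consistency_level=ConsistencyLevel.QUORUM,
)
result = session.execute(stmt, trace=True)

# The trace events list the replicas that served data or digest requests,
# and mention read repair when the coordinator had to reconcile them.
for event in result.get_query_trace().events:
    print(event.source, event.description)
```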
Background
I'm beginning work to set up a ClickHouse cluster with 3 CH nodes. The first node (Node A) would be write-only, and the remaining 2 (Nodes B + C) would be read-only. By this I mean that writes for a given table to Node A would automatically replicate to Nodes B + C. When querying the cluster, reads would only be resolved against Nodes B + C.
The purpose for doing this is two-fold.
This datastore serves both real-time and background jobs. Both are high volume, but only on the read side, so it makes sense to segment the traffic. Node A would be used for writing to the cluster and for all background reads. Nodes B + C would be used strictly for the UX.
The volume of writes is very low, perhaps 1 write per 10,000 reads. Data is entirely refreshed once per week. Background jobs need to be certain that the most current data is being read before they can be kicked off. Reading off of replicas introduces eventual consistency as a concern, so reading directly from Node A (rather than through the cluster) guarantees the data is strongly consistent.
Question
I'm not finding much specific information in the CH documentation, and am wondering whether this might be possible. If so, what would the cluster configuration look like?
Yes, it is possible to do so. But wouldn't the best solution be to read and write to each server sequentially using the Distributed table?
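If you do decide to pin writes to Node A and reads to Nodes B + C from the application side instead, a minimal sketch with the clickhouse-driver Python client could look like this (host names, database and table are assumptions for the example):

```python
# Sketch using the clickhouse-driver Python package; host names and the
# my_db.events table are assumptions, not part of the original question.
import random
from clickhouse_driver import Client

write_client = Client(host="node-a.example.internal")   # write-only node (Node A)
read_clients = [
    Client(host="node-b.example.internal"),              # read-only replicas
    Client(host="node-c.example.internal"),              # (Nodes B + C)
]

# Writes and consistency-sensitive background reads go straight to Node A.
write_client.execute(
    "INSERT INTO my_db.events (id, payload) VALUES",
    [(1, "hello"), (2, "world")],
)

# UX reads are spread across the replicas; they may briefly lag behind
# Node A until replication catches up (eventual consistency).
rows = random.choice(read_clients).execute(
    "SELECT id, payload FROM my_db.events ORDER BY id"
)
```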
In the incredibly unlikely event that 2 QUORUM writes happen in parallel to the same row, and result in 2 partition replicas being inconsistent with the same timestamp:
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Then the next question is how does the cluster reach consistency again since the data has the same timestamp?
I understand this situation is highly improbable, but my guess is it is still possible.
Example diagram:
Here is what I got from Datastax support:
Definitely a possible situation to consider. Cassandra/Astra handles this scenario with the following precedence rules so that results to the client are always consistent:
Timestamps are compared and latest timestamp always wins
If data being read has the same timestamp, deletes have priority over inserts/updates
In the event there is still a tie to break, Cassandra/Astra chooses the value for the column that is lexically larger
While these are certainly a bit arbitrary, Cassandra/Astra cannot know which value is supposed to take priority, and these rules do function to always give the exact same results to all clients when a tie happens.
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Cassandra/Astra would handle this for you behind the scenes while traversing the read path. If there is a discrepancy between the data being returned by the two replicas, the data would be compared and synced amongst those two nodes involved in the read prior to sending the data back to the client.
So with regards to your diagram, with W1 and W2 both taking place at t = 1, the data coming back to the client would be data = 2 because 2 > 1. In addition, Node 1 would now have the missing data = 2 at t = 1 record. Node 2 would still only have data = 1 at t = 1 because it was not involved in that read.
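A toy sketch of that precedence order (purely illustrative, not Cassandra/Astra's actual code): latest timestamp wins, then deletes beat inserts/updates, then the lexically larger value wins.

```python
# Toy illustration of the tie-break precedence described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cell:
    value: Optional[str]   # None models a tombstone (delete)
    timestamp: int

def reconcile(a: Cell, b: Cell) -> Cell:
    # 1. The latest timestamp always wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # 2. Same timestamp: deletes (tombstones) beat inserts/updates.
    if (a.value is None) != (b.value is None):
        return a if a.value is None else b
    # 3. Still tied: the lexically larger value wins (both deletes: either one).
    if a.value is None:
        return a
    return a if a.value >= b.value else b

# The diagram's case: W1 wrote "1" and W2 wrote "2", both at t = 1.
print(reconcile(Cell("1", 1), Cell("2", 1)))   # Cell(value='2', timestamp=1)
```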
According to Datastax documentation for Cassandra:
"If the coordinator cannot write to enough replicas to meet the requested consistency level, it throws an Unavailable Exception and does not perform any writes."
Does this mean that while the write is in progress, the data updated by the write will not be available to read requests? I mean, it is possible that 4/5 nodes have successfully sent a SUCCESS to the coordinator, meaning that their data has been updated. But the 5th one is yet to do the write. Now if a read request comes in and goes to one of these 4 nodes, will it still show the old data until the coordinator receives a confirmation from the 5th node and marks the new data valid?
If the coordinator knows that it cannot possibly achieve consistency before it attempts the write, then it will fail the request immediately before doing the write. (This is described in the quote given)
However, if the coordinator believes there are enough nodes to achieve its configured consistency level at the time of the attempt, it will start to send the data to its peers. If one of the peers does not return a success, the request will fail, and you will end up in a state where the nodes that failed have the old data and the ones that succeeded have the new data.
If a read request comes in, it will show the data it finds on the nodes it reaches, whether that data is old or new.
Let us take your example to demonstrate.
Say you have 5 nodes and a replication factor of 3. This means that 3 of those 5 nodes will hold the write you have sent. However, suppose one of those three nodes returned a failure to the coordinator. Now, if you read with consistency level ALL, you will read all three nodes and will always get the new write (the latest timestamp always wins).
However, if you read with consistency level ONE, there is a 1/3 chance you will get the old value.
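With the DataStax Python driver, these two failure modes surface as different exceptions: Unavailable when the coordinator knows up front that too few replicas are alive, and WriteTimeout when the write was attempted but not enough replicas acknowledged it in time (a sketch; keyspace/table names are made up):

```python
# Sketch with the DataStax Python driver; keyspace/table names are assumptions.
from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

stmt = SimpleStatement(
    "INSERT INTO my_table (id, value) VALUES ('key1', 'value1')",
    consistency_level=ConsistencyLevel.QUORUM,
)

try:
    session.execute(stmt)
except Unavailable:
    # The coordinator knew up front that too few replicas were alive,
    # so it did not attempt the write at all.
    pass
except WriteTimeout:
    # The write was attempted but not enough replicas acknowledged it in time;
    # some replicas may now hold the new value while others hold the old one.
    pass
```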
I have thus far been living under the impression that you cannot truly delete a row in a replication-based distributed database. It all works well in a copy-based one. But in replication you mark rows as "consider this deleted" and filter them out in every last query. But you do not ever actually delete something from the DB. I think it is time to verify whether that assumption is true.
My understanding is that you would run into a Race Condition with the Replication if there was ever a key collision. It goes something like this:
Database A:
Adds an entry under Key 11 (11A)
Database B:
Adds an entry under Key 11 (11B)
Database A:
Deletes the entry under Key 11
Now it depends on the order in which these 3 operations "meet" in the wild (a small sketch after the orderings below replays all three):
The expected order would be:
11A Create
11 Delete (which means 11A)
11B Create
But what if this happens instead?
11A Create
11B Create (fails, already a key 11)
11 Delete
Or even worse, this?
11B Create
11A Create (fails, already a key 11)
11 Delete (which will hit 11B)
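A toy replay of these three orderings against a plain in-memory map (purely illustrative; it models a naive store with no tombstones or versioning) shows how the end state diverges:

```python
# Toy replay of the three orderings; a naive key/value store with no
# tombstones or versioning, purely to illustrate the divergence.
def replay(ops):
    store = {}
    for op, key, value in ops:
        if op == "create":
            store.setdefault(key, value)   # a second create of the same key fails silently
        elif op == "delete":
            store.pop(key, None)           # deletes whatever currently sits under the key
    return store

expected = [("create", 11, "11A"), ("delete", 11, None), ("create", 11, "11B")]
variant1 = [("create", 11, "11A"), ("create", 11, "11B"), ("delete", 11, None)]
variant2 = [("create", 11, "11B"), ("create", 11, "11A"), ("delete", 11, None)]

print(replay(expected))   # {11: '11B'}  -- the intended outcome
print(replay(variant1))   # {}           -- 11B's create was lost
print(replay(variant2))   # {}           -- the delete removed 11B instead of 11A
```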
I'll assume that we are talking about a leaderless distributed database, that is, one where all nodes play the same role (there is no master), so reads and writes can both be served by any node. Otherwise, if there is a single master, it can impose a specific ordering on all the writes/deletes and thus resolve the concurrency problem you are describing.
But in replication you mark rows as "consider this deleted" and filter them out in every last query.
That's right and it's done for 2 main reasons:
correctness: if items were deleted instead of tombstoned, then there could be an ambiguous situation where 2 nodes are consulted, node A has the item, and node B does not. The system as a whole cannot distinguish whether the item was deleted (but the delete failed on A) or whether the item was recently created (but the create failed on B). With tombstones, this distinction can be made clear.
performance: most of these systems do not perform in-place updates (as RDBMS databases usually do), but instead perform append-only operations. That is done to improve performance, since random-access operations on disk are much slower than sequential operations. As a result, performing deletes via tombstones aligns well with this approach.
But you do not ever actually delete something from the DB.
That is not necessarily true. Usually, the tombstones are eventually removed from the database (in a garbage-collection fashion). Eventually here means that they are deleted when the system can be sure that the example described above cannot happen anymore for these items (because the deletes have propagated to all the nodes).
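A toy sketch of that lifecycle (purely illustrative, not any specific database's implementation): deletes write a tombstone, reads filter tombstoned keys out, and a later garbage-collection pass drops tombstones once the delete has reached every replica.

```python
# Toy tombstone lifecycle; purely illustrative, not any real database's code.
TOMBSTONE = object()

class Replica:
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data[key] = TOMBSTONE      # mark as deleted, don't remove

    def get(self, key):
        value = self.data.get(key)
        return None if value is TOMBSTONE else value   # filter tombstones on read

    def gc(self, fully_propagated_keys):
        # Drop tombstones only for keys whose delete has reached every replica.
        for key in list(self.data):
            if self.data[key] is TOMBSTONE and key in fully_propagated_keys:
                del self.data[key]

r = Replica()
r.put("key1", "value1")
r.delete("key1")
print(r.get("key1"))                      # None: the tombstone hides the row
r.gc(fully_propagated_keys={"key1"})
print(r.data)                             # {}: the tombstone itself is gone
```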
My understanding is that you would run into a Race Condition with the Replication if there was ever a key collision
That's right for most distributed systems of that kind. The result will depend on the order in which the operations reached the database. However, some of these databases provide alternative mechanisms, such as conditional writes/deletes. In this way, you can delete an item only if it is a specific version, or update an item only if its version is a specific one (thus aborting the update if someone else updated it in the meantime). Examples of operations of this kind in Cassandra are conditional deletes and the so-called lightweight transactions.
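For instance, a conditional (lightweight-transaction) insert or delete in Cassandra only applies if the current state matches what you expect; a sketch in CQL via the Python driver, with made-up keyspace, table and column names:

```python
# Sketch of Cassandra lightweight transactions with the DataStax Python driver;
# keyspace, table and column names are assumptions for the example.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Only insert if the key does not exist yet; the first result column
# is the [applied] boolean telling you whether the condition held.
created = session.execute(
    "INSERT INTO items (id, value) VALUES (%s, %s) IF NOT EXISTS",
    (11, "11A"),
).one()[0]

# Only delete if the row still holds the value we think it holds,
# so a concurrent writer's value is not removed by mistake.
deleted = session.execute(
    "DELETE FROM items WHERE id = %s IF value = %s",
    (11, "11A"),
).one()[0]
```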
Below are some references that describe how Riak and Cassandra perform deletes, which contain a lot of information around tombstones as well:
Riak: Object deletion
About deletes and tombstones in Cassandra
I mean, when we have a multi-master cluster with 3 master nodes and then add a new master node (a fourth node), we have to somehow be notified when it is up to date, or maybe ask it every 5 seconds to find out whether it is up to date.
Is it possible to implement this?
This feature is not currently present, and looking at the roadmap, I don't think it will be.
Consider that if you have a situation with 5 nodes (so the quorum is 3), the sync takes place between only two nodes, while the other 3 still form a quorum to write data. So you can keep doing inserts and other operations, and you do not need a mechanism to tell you whether, for example, node 5 is ready or not. This, of course, only applies if you can set up such a situation.