How does Cassandra handle write timestamp conflicts between QUORUM reads?

In the incredibly unlikely event that 2 QUORUM writes happen in parallel to the same row, and result in 2 partition replicas being inconsistent with the same timestamp:
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Then the next question is how does the cluster reach consistency again since the data has the same timestamp?
I understand this situation is highly improbable, but my guess is it is still possible.
Example diagram:

Here is what I got from Datastax support:
Definitely a possible situation to consider. Cassandra/Astra handles this scenario with the following precedence rules so that results to the client are always consistent:
Timestamps are compared and latest timestamp always wins
If data being read has the same timestamp, deletes have priority over inserts/updates
In the event there is still a tie to break, Cassandra/Astra chooses the value for the column that is lexically larger
While these are certainly a bit arbitrary, Cassandra/Astra cannot know which value is supposed to take priority, and these rules do function to always give the exact same results to all clients when a tie happens.
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Cassandra/Astra would handle this for you behind the scenes while traversing the read path. If there is a discrepancy between the data being returned by the two replicas, the data would be compared and synced amongst those two nodes involved in the read prior to sending the data back to the client.
So with regards to your diagram, with W1 and W2 both taking place at t = 1, the data coming back to the client would be data = 2 because 2 > 1. In addition, Node 1 would now have the missing data = 2 at t = 1 record. Node 2 would still only have data = 1 at t = 1 because it was not involved in that read.
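The precedence rules quoted above can be sketched as a tiny reconciliation function. This is only an illustration of the documented ordering (the names are mine; Cassandra's real implementation compares serialized cell values), not actual Cassandra code:

```python
# Sketch of Cassandra's cell reconciliation rules (illustrative only).
# Each cell is modeled as (timestamp, is_tombstone, value).

def reconcile(cell_a, cell_b):
    """Return the winning cell under the tie-breaking rules above."""
    ts_a, del_a, val_a = cell_a
    ts_b, del_b, val_b = cell_b
    # Rule 1: the latest timestamp always wins.
    if ts_a != ts_b:
        return cell_a if ts_a > ts_b else cell_b
    # Rule 2: on a timestamp tie, deletes beat inserts/updates.
    if del_a != del_b:
        return cell_a if del_a else cell_b
    # Rule 3: otherwise the lexically larger value wins.
    return cell_a if str(val_a) >= str(val_b) else cell_b

# W1 and W2 both at t = 1: data = 2 wins because 2 > 1.
print(reconcile((1, False, "1"), (1, False, "2")))  # -> (1, False, '2')
```

Whichever replicas participate in the read, the same winner falls out, which is exactly why all clients see consistent results.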

Related

What to pick? Quorum OR Latest timestamp data in Cassandra

I was reading about Cassandra and got to know that there is Quorum concept (i.e, if there are multiple nodes/replicas where a particular key is stored, then during read operation choose and return data which has majority across those replicas) to deal with consistency during read operation.
My doubt may be silly, but I am not able to see how the quorum concept is useful in the case where the majority data value differs from the latest-timestamp data.
How do we decide then which data value to return?
Ex - for a particular key "key1":
timestamps: t1 > t2
5 replicas; replica0 (main node) is down
replicaNo - value - timestamp
replica1 - value1 - t1
replica2 - value2 - t2
replica3 - value2 - t2
replica4 - value2 - t2
So in the above case, what should we return: the majority value (value2) or the latest-timestamp value (value1)?
Can someone please help?
In Cassandra the last write always wins. That means that for (a1, t1) and (a2, t2) with t2 > t1, value a2 will be considered the right one.
Regarding your question, a QUORUM read on its own is not that useful. That is because in order to have full consistency, the following rule must be followed:
RC+WC>RF
(RC - read consistency; WC - Write consistency; RF - Replication factor)
In your case (when a majority of replicas have the old data), QUORUM will increase the chance of getting the right data, but it won't guarantee it.
The most common use case is using quorum for both read and write. That would mean that for a RF of 5, 3 nodes would have the right value. Now, if we also read from 3 nodes then there is no way that one of the 3 does not have the newer value (since at most 2 have the old value).
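The overlap argument above can be checked mechanically. A minimal sketch (the helper names are mine):

```python
def guarantees_fresh_read(read_cl, write_cl, rf):
    """True if every read set must intersect every write set (RC + WC > RF)."""
    return read_cl + write_cl > rf

def quorum(rf):
    """A quorum is a strict majority of the replicas."""
    return rf // 2 + 1

rf = 5
print(quorum(rf))                                         # 3
print(guarantees_fresh_read(quorum(rf), quorum(rf), rf))  # True: 3 + 3 > 5
print(guarantees_fresh_read(quorum(rf), 1, rf))           # False: write at ONE
```

With quorum on both sides, any 3 replicas read must include at least one of the 3 replicas written, so the latest value is always seen.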
Regarding how reading works: when you ask for quorum on RF of 5, the coordinator node will ask one node for the actual data and 2 nodes for a digest of that data. The coordinator then compares the digest from the first node (the actual data) with the other 2 digests. If they match, all is good and the data from the first node is returned. If they differ, a read repair is triggered, meaning that the data will be updated across all available nodes.
So if you write with consistency ONE on RF of 5, not only do you risk getting old data even with quorum reads, but if something happens to the node that had the good data, you could lose it altogether. Finding the balance depends on the particular use case. If in doubt, use quorum for both reads and writes.
Hope this made sense,
Cheers!
Quorum just means that a majority of the nodes should provide the answer. But the answers could have different timestamps, so the coordinator node will select the answer with the latest timestamp to send to the client, and at the same time will trigger a repair operation for the nodes that have the old data.
But in your situation you may still get the old answer, because with RF=5 the quorum is 3 nodes, and the coordinator can pick up results from replicas 2-4, which have the old data. You'll get the newest result only if the coordinator includes replica 1 in the list of queried nodes.
P.S. in Cassandra there is no main replica - all replicas are equal.
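To make the coordinator-selection point concrete, assuming the coordinator picks its 3-node quorum uniformly from the 4 live replicas (an assumption; real snitches weight replicas by latency), the chance of a stale result can be enumerated:

```python
from itertools import combinations

# replica1 has the new value (t1); replicas 2-4 have the old value (t2), t1 > t2.
live = ["replica1", "replica2", "replica3", "replica4"]

# Count quorum choices (3 of 4) that exclude replica1 entirely.
stale = sum(1 for picked in combinations(live, 3) if "replica1" not in picked)
total = len(list(combinations(live, 3)))  # C(4, 3) = 4

print(f"{stale}/{total} quorum choices return the old value")  # 1/4
```

Any quorum that includes replica1 returns the new value, because the latest timestamp wins during reconciliation.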

Deleting Huge Data In Cassandra Cluster

I have Cassandra cluster with three nodes. We have data close to 7 TB from last 4 years. Now because of less space available in the server, we would like to keep data only for last 2 years. But we don't want to delete it completely(data older than 2 years). We want to keep specific data even it is older than 2 years.
Currently I can think of one approach:
1) A Java client using the "MutationBatch" object: I can get the keys of all records that fall into the date range, excluding the rows we don't want to delete, and then delete the records in batches. But this solution raises performance concerns, as the data is huge.
Is it possible to handle this at the server level (OpsCenter)? I read about TTL, but how can I apply it to existing data while also keeping some of the data that is older than 2 years?
Please help me in finding out the best solution.
The main thing you need to understand is that when you remove data in Cassandra, you're actually adding data by writing a tombstone; deletion of the actual data happens later, during compaction.
So it's very important to perform deletion correctly. There are different types of deletes - individual cells, rows, ranges, and whole partitions (from least efficient to most efficient by number of tombstones generated). The best option for you is to remove by partition; second best is by ranges inside a partition. The following article describes in great detail how data is removed.
You may need to perform the deletion in several steps, so you don't add too much data as tombstones. You also need to check that you have enough disk space for compaction.
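One practical way to follow the partition-delete advice above is to generate partition-level DELETE statements for the partitions you are allowed to drop. The sketch below only builds the CQL strings; the table and column names are hypothetical, and executing them against the cluster (in throttled steps) is not shown:

```python
def build_partition_deletes(partition_keys, keep_keys, table="sensor_data"):
    """Build one partition-level DELETE per key we are allowed to drop.

    A partition delete generates a single tombstone, which is far cheaper
    than tombstoning individual rows or cells one by one.
    """
    keep = set(keep_keys)
    return [f"DELETE FROM {table} WHERE sensor_id = '{key}';"
            for key in partition_keys if key not in keep]

# Drop old partitions s1 and s3, but keep s2 even though it is old.
stmts = build_partition_deletes(["s1", "s2", "s3"], keep_keys=["s2"])
print(stmts)  # DELETE statements for s1 and s3 only
```

Running the deletes in batches over several days, with compaction headroom monitored between batches, keeps the tombstone load manageable.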

Cassandra write consistency level ALL clarification

According to Datastax documentation for Cassandra:
"If the coordinator cannot write to enough replicas to meet the requested consistency level, it throws an Unavailable Exception and does not perform any writes."
Does this mean that while the write is in progress, the data updated by the write will not be available to read requests? I mean, it is possible that 4/5 nodes have successfully sent a SUCCESS to the coordinator, meaning their data has been updated, but the 5th one is yet to do the write. Now if a read request comes in and goes to one of these 4 nodes, will it still show the old data until the coordinator receives a confirmation from the 5th node and marks the new data valid?
If the coordinator knows that it cannot possibly achieve consistency before it attempts the write, then it will fail the request immediately before doing the write. (This is described in the quote given)
However, if the coordinator believes that there are enough nodes to achieve its configured consistency level at the time of the attempt, it will start to send its data to its peers. If one of the peers does not return a success, the request will fail, and you will end up in a state where the nodes that failed have the old data and the ones that succeeded have the new data.
If a read request comes in, it will show the data it finds on the nodes it reaches, no matter if that data is old or new.
Let us take your example to demonstrate.
If you have 5 nodes and a replication factor of 3, then 3 of those 5 nodes will have the write that you sent. However, suppose one of the three nodes returned a failure to the coordinator. Now if you read with consistency level ALL, you will read all three replicas and will always get the new write (the latest timestamp always wins).
However, if you read with consistency level ONE, there is a 1/3 chance you will get the old value.
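The 1/3 figure can be verified in a few lines, under the stated assumption that a CL=ONE read is equally likely to land on any of the three replicas:

```python
from fractions import Fraction

# RF = 3: two replicas applied the write, one replica's write failed.
replicas = ["new", "new", "old"]

# CL=ONE reads a single replica, so the chance of the old value is 1/3.
p_stale = Fraction(replicas.count("old"), len(replicas))
print(p_stale)  # 1/3

# CL=ALL reconciles all replicas by latest timestamp, so the result is
# "new" as long as at least one replica has the new value.
print("new" in replicas)  # True
```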

Aerospike ACID - How to know the final result of the transaction on timeouts?

I'm new to Aerospike.
I would like to know that in all possible timeout scenarios, as stated in this link:
https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852
Client can't connect within the specified timeout (timeout=). A timeout of zero means that no timeout is set.
Client does not receive a response within the specified timeout (timeout=).
Server times out the transaction during its own processing (default of 1 second if the client doesn't specify a timeout). To investigate this, confirm that the server transaction latencies are not the bottleneck.
Client times out after M iterations of retries when there was no error due to a failed node or a failed connection.
Client can't obtain a valid node after N retries (where retries are set from your client).
Client can't obtain a valid connection after X retries. The retry count is usually the limiting factor, not the timeout value. The reasoning is that if you can't get a connection after R retries, you never will, so just time out early.
Of all the timeout scenarios mentioned, under which circumstances could I be absolutely certain that the final result of the transaction is FAILED?
Does Aerospike offer anything, e.g. to roll back the transaction if the client does not respond?
In the worst case, if I couldn't be certain about the final result, how would I be able to know for certain the final state of the transaction?
Many thanks in advance.
Edit:
We came up with a temporary solution:
Keep a map of [generation -> value read] for that record (maybe via a background thread constantly reading the record), and then on timeouts periodically check the map (key = the expected generation) to see whether the value actually written is the one put into the map. If they are the same, the write succeeded; otherwise the write failed.
Do you guys think it's necessary to do this? Or is there other way?
First, timeouts are not the only error you should be concerned with. Newer clients have an 'inDoubt' flag associated with errors that will indicate that the write may or may not have applied.
There isn't a built-in way of resolving an in-doubt transaction to a definitive answer, and if the network is partitioned, there isn't a way in AP mode to rigorously resolve in-doubt transactions. Rigorous methods do exist for 'Strong Consistency' mode; the same methods can be used to handle common AP scenarios, but they will fail under partition.
The method I have used is as follows:
1. Each record will need a list bin; the list bin will contain the last N transaction ids.
For my use case, I gave each client a unique 2-byte identifier, each client thread a unique 2-byte identifier, and each client thread a 4-byte counter. A particular transaction-id is then the 8-byte value masked together from the two ids and the counter.
2. Read the record's metadata with the getHeader API - this avoids reading the record's bins from storage.
Note - my use case wasn't an increment, so I actually had to read the record and write with a generation check. This pattern should be more efficient for a counter use case.
3. Write the record using operate and gen-equal to the read generation, with these operations: increment the integer bin, prepend your transaction-id to the txns list, and trim the list to the maximum size N you selected.
N needs to be large enough that a record is sure to have enough time to verify its transaction given the contention on the key. N also affects the stored size of the record, so choosing too big will cost disk resources and choosing too small will render the algorithm ineffective.
4. If the transaction is successful then you are done.
5. If the transaction is 'inDoubt', read the key and check the txns list for your transaction-id. If present, your transaction 'definitely succeeded'.
6. If your transaction-id isn't in txns, return to step 3 with the generation returned from the read in step 5 - with the exception that, on this retry, a 'generation error' would also need to be considered 'in-doubt', since it may have been the previous attempt that finally applied.
Also consider that reading the record in step 5 and not finding the transaction-id in txns does not ensure that the transaction 'definitely failed'. If you wanted to leave the record unchanged but still have a 'definitely failed' semantic, you would need to have observed the generation move past the previous write's gen-check policy. If it hasn't, you could replace the operation in step 6 with a touch: if the touch succeeds, the initial write 'definitely failed'; if you get a generation-error, you will need to check whether you raced the application of the initial transaction, since the initial write may now have 'definitely succeeded'.
Again, with 'Strong Consistency' the mentions of 'definitely succeeded' and 'definitely failed' are accurate statements, but in AP these statements have failure modes (especially around network partitions).
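The transaction-id layout described in the steps above (2-byte client id, 2-byte thread id, 4-byte counter) and the prepend-and-trim bookkeeping might look like the following sketch. The helper names are mine, and a real implementation would perform the list operations server-side in a single operate call rather than in client code:

```python
def make_txn_id(client_id, thread_id, counter):
    """Pack a 2-byte client id, 2-byte thread id, and 4-byte counter
    into a single 8-byte transaction id."""
    assert 0 <= client_id < 1 << 16 and 0 <= thread_id < 1 << 16
    return (client_id << 48) | (thread_id << 32) | (counter & 0xFFFFFFFF)

def record_txn(txns, txn_id, n_max):
    """Prepend the txn id and trim the list to N entries, mirroring the
    prepend + trim operations applied atomically to the record's list bin."""
    return ([txn_id] + txns)[:n_max]

txns = []
for c in range(5):
    txns = record_txn(txns, make_txn_id(1, 2, c), n_max=3)

print(len(txns))                     # 3: trimmed to the last N writes
print(make_txn_id(1, 2, 4) in txns)  # True: the most recent write is verifiable
```

With N = 3 here, the two oldest transaction ids have been trimmed away, which illustrates why N must be sized to the contention on the key: an id trimmed before verification can no longer prove its write succeeded.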
Recent clients will provide an extra flag on timeouts, called "in doubt". If false, you are certain the transaction did not succeed (client couldn't even connect to the node so it couldn't have sent the transaction). If true, then there is still an uncertainty as the client would have sent the transaction but wouldn't know if it had reached the cluster or not.
You may also consider looking at Aerospike's Strong Consistency feature which could help your use case.

Storing Signals in a Database

I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:
Date and time of the sequence start
Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
The sequence of samples, each of exactly 4 bytes
My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals I mean that for example I need to show the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database:
Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.
I wondered if there are best practices for such cases.
Many thanks.
Storing each sample in its own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see whether any optimization is really necessary.
I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient as well.
Since your main goal appears to be to display the information in interesting and flexible ways I'd go with separate rows for each data point. I presume most of the effort required to write this program well is likely on the display side, you should minimize the complexity on that side as much as possible.
Storing data in BLOBs is good if the content isn't relevant and you would never want to run queries against it. In this case, your data will be the contents of the database, and therefore very relevant.
I think you should:
1. Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.
As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
This might also be "good enough" when querying the data. However, performance may become an issue if you get massive volume: 100,000 sensors x 96 samples per day (one every 15 minutes) = 9,600,000 rows per day, or roughly 17,529,600,000 rows over five years. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
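The row-count arithmetic works out as follows:

```python
sensors = 100_000
samples_per_day = 24 * 60 // 15   # one sample every 15 minutes = 96 per day
rows_per_day = sensors * samples_per_day

print(rows_per_day)               # 9,600,000 rows per day
print(rows_per_day * 1826)        # ~17.5 billion rows over 5 years (incl. leap day)
```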
Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse every day/week/month/whatever, depends on time, effort, available resources, and so on.
A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.
