Cassandra write consistency level ALL clarification

According to Datastax documentation for Cassandra:
"If the coordinator cannot write to enough replicas to meet the requested consistency level, it throws an Unavailable Exception and does not perform any writes."
Does this mean that while the write is in progress, the data updated by the write will not be available to read requests? I mean, it is possible that 4/5 nodes have already sent a SUCCESS to the coordinator, meaning that their data has been updated, but the 5th one is yet to do the write. Now if a read request comes in and goes to one of these 4 nodes, will it still show the old data until the coordinator receives a confirmation from the 5th node and marks the new data valid?

If the coordinator knows, before it attempts the write, that it cannot possibly achieve the requested consistency level, it will fail the request immediately without performing any writes. (This is what the quote describes.)
However, if the coordinator believes that there are enough nodes to achieve its configured consistency level at the time of the attempt, it will start sending the data to its peers. If one of the peers does not return a success, the request will fail, and you end up in a state where the nodes that failed have the old data and the ones that succeeded have the new data.
If a read request comes in, it will return whatever data it finds on the nodes it reaches, whether that data is old or new.
Let us take your example to demonstrate.
Say you have 5 nodes and a replication factor of 3. This means that 3 of those 5 nodes will receive the write you sent. Now suppose one of those three nodes returned a failure to the coordinator. If you then read with consistency level ALL, you will read all three replicas and will always get the new write (the latest timestamp always wins).
However, if you read with consistency level ONE, there is a 1/3 chance you will get the old value.
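To make this concrete, below is a minimal sketch using the DataStax Python driver (cassandra-driver). The keyspace, table, and column names are invented for illustration; only the consistency-level mechanics are the point.

    # Sketch: per-statement consistency levels with the DataStax Python driver.
    # Assumes a keyspace "ks" (RF=3) with a table users(id int PRIMARY KEY, name text).
    from cassandra.cluster import Cluster
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("ks")

    # Write at CL=ALL: the coordinator refuses up front (Unavailable) if it
    # believes fewer than all replicas are alive; otherwise a single replica
    # failure fails the request but may leave the other replicas updated.
    write = SimpleStatement(
        "UPDATE users SET name = %s WHERE id = %s",
        consistency_level=ConsistencyLevel.ALL,
    )
    session.execute(write, ("alice", 42))

    # Read at CL=ONE: may land on the one replica that missed a partially
    # applied write and return the old value.
    read = SimpleStatement(
        "SELECT name FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    print(session.execute(read, (42,)).one())

Reading back at CL=ALL (or any level whose replica set overlaps the successful writes) would instead always return the newer value, since the cell with the latest timestamp wins during reconciliation.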

Related

Is it possible for a DynamoDB read to return state that is older than the state returned by a previous read?

Let's say there is a DynamoDB key with a value of 0, and there is a process that repeatably reads from this key using eventually consistent reads. While these reads are occurring, a second process sets the value of that key to 1.
Is it possible for the read process to ever read a 0 after it first reads a 1? Is it possible in DynamoDB's eventual consistency model for a client to successfully read a key's fully up-to-date value, but then read a stale value on a subsequent request?
Eventually, the write will be fully propagated and the read process will only read 1 values, but I'm unsure if it's possible for the reads to go 'backward in time' while the propagation is occurring.
The property you are looking for is known as monotonic reads, see for example the definition in https://jepsen.io/consistency/models/monotonic-reads.
Obviously, DynamoDB's strongly consistent read (ConsistentRead=true) is also monotonic, but you rightly asked about DynamoDB's eventually consistent read mode.
@Charles in his response gave a link, https://www.youtube.com/watch?v=yvBR71D0nAQ&t=706s, to a nice official talk by Amazon on how eventually consistent reads work. The talk explains that DynamoDB replicates written data to three copies, but a write completes as soon as two of the three copies (including one designated as the "leader") have been updated. The third copy may take some time (usually a very short time) to be updated.
The video goes on to explain that an eventually consistent read goes to one of the three replicas at random.
So in that short amount of time where the third replica has old data, a request might randomly go to one of the updated nodes and return new data, and then another request slightly later might go by random to the non-updated replica and return old data. This means that the "monotonic read" guarantee is not provided.
To summarize, I believe that DynamoDB does not provide the monotonic read guarantee if you use eventually consistent reads. You can use strongly-consistent reads to get it, of course.
Unfortunately I can't find an official document which claims this. It would also be nice to test this in practice, similar to how the paper http://www.aifb.kit.edu/images/1/17/How_soon_is_eventual.pdf tested whether Amazon S3 (not DynamoDB) guarantees monotonic reads, and discovered that it did not by actually observing monotonic-read violations.
One of the implementation details which may make it hard to see these monotonic-read violations in practice is how Amazon handles requests from the same process (which you said is your case). When the same process sends several requests in sequence, it may (but also may not...) use the same HTTP connection to do so, and Amazon's internal load balancers may (but also may not) decide to send those requests to the same backend replica - despite the statement in the video that each request is sent to a random replica. If this happens, it may be hard to see monotonic-read violations in practice - but they may still happen if the load balancer changes its mind, or the client library opens another connection, and so on, so you still can't trust the monotonic-read property to hold.
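For what it's worth, here is a rough sketch of how one could probe for such violations from a single process, in the spirit of the paper linked above. It uses boto3 with a hypothetical table "demo" keyed by "pk" with a numeric attribute "val"; given the tiny inconsistency window and the possible connection stickiness just described, it may well never observe a violation even though the guarantee is absent.

    # Best-effort probe for a monotonic-read violation (0 seen after 1)
    # using eventually consistent GetItem calls. Table/key/attribute names
    # are hypothetical.
    import boto3

    table = boto3.resource("dynamodb").Table("demo")
    table.put_item(Item={"pk": "k", "val": 0})
    table.put_item(Item={"pk": "k", "val": 1})

    seen_new = False
    for _ in range(100_000):
        item = table.get_item(Key={"pk": "k"}, ConsistentRead=False).get("Item", {})
        val = item.get("val")
        if val == 1:
            seen_new = True
        elif val == 0 and seen_new:
            print("monotonic-read violation: read 0 after having read 1")
            break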
Yes, it is possible. Requests are stateless, so a second read from the same client is just as likely as any other request to see slightly stale data. If that’s an issue, choose strong consistency.
You will (probably) never get the old data after getting the new data.
First off, there's no warning in the docs about repeated reads returning stale data, just that a read after a write may return stale data.
Eventually Consistent Reads
When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
But more importantly, every item in DDB is stored in three storage nodes. A write to DDB doesn't return a 200 - Success until that data is written to 2 of 3 storage nodes. Thus, it's only if your read is serviced by the third node that you'd see stale data. Once that third node is updated, every node has the latest.
See Amazon DynamoDB Under the Hood
EDIT
@Nadav's answer points out that it's at least theoretically possible; AWS certainly doesn't seem to guarantee monotonic reads. But I believe the reality depends on your application architecture.
Most languages, nodejs being an exception, will use persistent HTTP/HTTPS connections to the DDB request router by default, especially given how expensive it is to open a TLS connection. I suspect, though I can't find any documents confirming it, that there's at least some level of stickiness from the request router to a storage node. @Nadav discusses this possibility. But only AWS knows for sure, and they haven't said.
Assuming that belief is correct:
curl in a shell script loop - more likely to see the old data again
loop in C# using a single connection - less likely
The other thing to consider is that in the normal course of things, the third storage node is "only milliseconds behind".
Ironically, if the request router truly picks a storage node at random, a non-persistent connection is then less likely to see old data again given the extra time it takes to establish the connection.
If you absolutely need monotonic reads, then you'd need to use strongly consistent reads.
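For example (using boto3, with the same hypothetical table and key names as in the earlier sketch), switching the read to ConsistentRead=True is all it takes:

    import boto3

    table = boto3.resource("dynamodb").Table("demo")
    # A strongly consistent read reflects every write acknowledged before it,
    # so a sequence of such reads can never appear to go backward in time.
    item = table.get_item(Key={"pk": "k"}, ConsistentRead=True)["Item"]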
Another option might be to put DynamoDB Accelerator (DAX) in front of your DDB table, especially if you're retrieving the key with GetItem(). As I read how it works, it does seem to imply monotonic reads, especially if you've written through DAX, though it does not come right out and say so. Even if you've written around DAX, reading from it should still be monotonic; it's just that there will be more latency until you start seeing the new data.

How does cassandra handle write timestamp conflicts between QUORUM reads?

In the incredibly unlikely event that 2 QUORUM writes happen in parallel to the same row, and result in 2 partition replicas being inconsistent with the same timestamp:
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Then the next question is how does the cluster reach consistency again since the data has the same timestamp?
I understand this situation is highly improbable, but my guess is it is still possible.
Example diagram:
Here is what I got from Datastax support:
Definitely a possible situation to consider. Cassandra/Astra handles this scenario with the following precedence rules so that results to the client are always consistent:
Timestamps are compared and latest timestamp always wins
If data being read has the same timestamp, deletes have priority over inserts/updates
In the event there is still a tie to break, Cassandra/Astra chooses the value for the column that is lexically larger
While these are certainly a bit arbitrary, Cassandra/Astra cannot know which value is supposed to take priority, and these rules do function to always give the exact same results to all clients when a tie happens.
When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?
Cassandra/Astra would handle this for you behind the scenes while traversing the read path. If there is a discrepancy between the data being returned by the two replicas, the data would be compared and synced amongst those two nodes involved in the read prior to sending the data back to the client.
So with regards to your diagram, with W1 and W2 both taking place at t = 1, the data coming back to the client would be data = 2 because 2 > 1. In addition, Node 1 would now have the missing data = 2 at t = 1 record. Node 2 would still only have data = 1 at t = 1 because it was not involved in that read.
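The precedence rules above are easy to express in code. The following is a conceptual sketch (not driver code, and not Cassandra's actual implementation) showing how the W1/W2 tie from the diagram resolves to data = 2:

    # Conceptual tie-break between two versions of the same cell.
    # Each version is (timestamp, is_tombstone, value).
    def reconcile(a, b):
        if a[0] != b[0]:
            return a if a[0] > b[0] else b   # 1. latest timestamp wins
        if a[1] != b[1]:
            return a if a[1] else b          # 2. a delete beats an insert/update
        return a if a[2] > b[2] else b       # 3. lexically larger value wins

    w1 = (1, False, "1")   # data = 1 written at t = 1
    w2 = (1, False, "2")   # data = 2 written at t = 1
    print(reconcile(w1, w2))   # -> (1, False, "2"): the client sees data = 2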

Aerospike ACID - How to know the final result of the transaction on timeouts?

I'm new to Aerospike.
I would like to know that in all possible timeout scenarios, as stated in this link:
https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852
Client can’t connect by specified timeout (timeout=). Timeout of zero means that there is no timeout set.
Client does not receive a response by the specified timeout (timeout=).
Server times out the transaction during its own processing (default of 1 second if the client doesn’t specify a timeout). To investigate this, confirm that the server transaction latencies are not the bottleneck.
Client times out after M iterations of retries when there was no error due to a failed node or a failed connection.
Client can’t obtain a valid node after N retries (where retries are set from your client).
Client can’t obtain a valid connection after X retries. The retry count is usually the limiting factor, not the timeout value. The reasoning is that if you can’t get a connection after R retries, you never will, so just time out early.
Of all the timeout scenarios mentioned, under which circumstances could I be absolutely certain that the final result of the transaction is FAILED?
Does Aerospike offer anything, e.g. to roll back the transaction, if the client does not respond?
In the worst case, if I couldn’t be certain about the final result, how would I be able to know for certain the final state of the transaction?
Many thanks in advance.
Edit:
We came up with a temporary solution:
Keep a map of [generation -> value read] for that record (maybe with a background thread constantly reading the record, etc.), and then on timeouts we would periodically check the map (key = the expected generation) to see whether the value actually written is the one we put into the map. If they are the same, it means the write succeeded; otherwise it means the write failed.
Do you guys think it's necessary to do this? Or is there other way?
First, timeouts are not the only error you should be concerned with. Newer clients have an 'inDoubt' flag associated with errors that will indicate that the write may or may not have applied.
There isn't a built-in way of resolving an in-doubt transaction to a definitive answer, and if the network is partitioned, there isn't a way in AP mode to rigorously resolve in-doubt transactions. Rigorous methods do exist for 'Strong Consistency' mode; the same methods can be used to handle common AP scenarios, but they will fail under partition.
The method I have used is as follows:
1. Each record will need a list bin; the list bin will contain the last N transaction-ids.
For my use case, I gave each client a unique 2 byte identifier and each client thread a unique 2 byte identifier, and each client thread had a 4 byte counter. A particular transaction-id is then an 8 byte value masked together from the two ids and the counter.
2. Read the record's metadata with the getHeader api - this avoids reading the record's bins from storage.
Note - my use case wasn't an increment, so I actually had to read the record and write with a generation check. This pattern should be more efficient for a counter use case.
3. Write the record using operate and gen-equal to the read generation, with these operations: increment the integer bin, prepend your transaction-id to the txns list, and trim the txns list to the maximum size you selected.
N needs to be large enough that a transaction can be sure to have enough time to be verified given the contention on the key. N affects the stored size of the record, so choosing it too big will cost disk resources and choosing it too small will render the algorithm ineffective.
4. If the transaction is successful then you are done.
5. If the transaction is 'inDoubt', then read the key and check the txns list for your transaction-id. If it is present, your transaction 'definitely succeeded'.
6. If your transaction-id isn't in txns, repeat step 3 with the generation returned from the read in step 5.
7. Return to step 3 - with the exception that on step 5 a 'generation error' would also need to be considered 'in-doubt', since it may have been the previous attempt that finally applied.
Also consider that reading the record in step 5 and not finding your transaction-id in txns does not ensure that the transaction 'definitely failed'. If you want to leave the record unchanged but still get a 'definitely failed' answer, you would need to have observed the generation move past the previous write's gen-check policy. If it hasn't, you could replace the operation in step 6 with a touch - if it succeeds, the initial write 'definitely failed'; if you get a generation error, you will need to check whether you raced the application of the initial transaction, whose write may now have 'definitely succeeded'.
Again, with 'Strong Consistency' the mentions of 'definitely succeeded' and 'definitely failed' are accurate statements, but in AP these statements have failure modes (especially around network partitions).
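For reference, a rough sketch of the pattern above using the Python aerospike client is shown below. The bin names ("count", "txns"), the namespace/set/key, and LIST_MAX are all made up, and the exact client calls should be verified against the client version you run; treat it as an illustration of the algorithm, not a drop-in implementation.

    import aerospike
    from aerospike import exception as ex
    from aerospike_helpers.operations import operations as ops
    from aerospike_helpers.operations import list_operations as lops

    LIST_MAX = 32   # "N": how many recent transaction-ids each record keeps

    client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
    key = ("test", "demo", "counter")

    def increment_once(txn_id, amount=1):
        # Step 2: read only the record metadata to learn the current generation.
        _, meta = client.exists(key)
        gen = meta["gen"] if meta else 0

        while True:
            try:
                # Step 3: increment, prepend our txn-id, trim the list - all
                # gated on the generation we read (gen-equal policy).
                client.operate(
                    key,
                    [
                        ops.increment("count", amount),
                        lops.list_insert("txns", 0, txn_id),
                        lops.list_trim("txns", 0, LIST_MAX),
                    ],
                    meta={"gen": gen},
                    policy={"gen": aerospike.POLICY_GEN_EQ},
                )
                return True   # step 4: definitely succeeded
            except (ex.TimeoutError, ex.RecordGenerationError):
                # Steps 5-7: in doubt (or lost the generation race) - re-read
                # and look for our txn-id before deciding to retry.
                _, meta, bins = client.get(key)
                if txn_id in bins.get("txns", []):
                    return True    # our write actually applied
                gen = meta["gen"]  # retry the write with the fresh generation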
Recent clients will provide an extra flag on timeouts, called "in doubt". If false, you are certain the transaction did not succeed (client couldn't even connect to the node so it couldn't have sent the transaction). If true, then there is still an uncertainty as the client would have sent the transaction but wouldn't know if it had reached the cluster or not.
You may also consider looking at Aerospike's Strong Consistency feature which could help your use case.

Why does PW=all and PR=all not give strong consistency in Riak?

Suppose we have a 5 machine cluster with n_val = 3. Why does setting PW=3 and PR=3 for writes and reads not guarantee strong consistency?
Writing a key using PW=3 writes the value to 3 primary vnodes. But the success is returned once the vnode receives the value to be written, i.e. before it is persisted.
After acknowledging the value, the vnode must still commit it to the appropriate backend store; anything that interrupts this would result in that vnode returning the previous value (or notfound) on the next read.
Using the durable write option (DW=3) would cause the vnode to delay acknowledgment of the value until after it has been committed to the backend store. However, this is still not quite a guarantee, because the backend does not ensure the cache is flushed after each value is written, so a failure could still interrupt the value being written to disk. This is a decidedly smaller window than with PW=3 alone, but still not a guarantee.
Granted, the likelihood of this occurring on all 3 primary nodes such that the next PR=3 read succeeds but returns the wrong value is quite remote, but it is not nonexistent.
There is also the possibility of sibling values from simultaneous writes. If the first successful read after a successful write returns a set of sibling values instead of the last written value, is that strong consistency?
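For completeness, here is how these options look from the (legacy) Python riak client, with a made-up bucket and key; even with pw=dw=pr=3 you are still exposed to the windows described above (acknowledgement before a backend flush, and sibling resolution):

    import riak

    client = riak.RiakClient(protocol="pbc", pb_port=8087)
    bucket = client.bucket("accounts")

    obj = bucket.new("alice", data={"balance": 100})
    obj.store(w=3, pw=3, dw=3)                 # require 3 primary, durable acks

    fetched = bucket.get("alice", r=3, pr=3)   # read from all 3 primaries
    print(fetched.data)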

Prioritizing Transactions in Google AppEngine

Let's say I need to perform two different kinds of write operations on a datastore entity that might happen simultaneously, for example:
The client that holds a write-lock on the entry updates the entry's content
The client requests a refresh of the write-lock (updates the lock's expiration time-stamp)
As the content-update operation is only allowed if the client holds the current write-lock, I need to perform the lock-check and the content-write in a transaction (unless there is another way that I am missing?). Also, a lock-refresh must happen in a transaction because the client needs to first be confirmed as the current lock-holder.
The lock-refresh is a very quick operation.
The content-update operation can be quite complex. Think of it as the client sending the server a complicated update-script that the server executes on the content.
Given this, if there is a conflict between those two transactions (should they be executed simultaneously), I would much rather have the lock-refresh operation fail than the complex content-update.
Is there a way that I can "prioritize" the content-update transaction? I don't see anything in the docs and I would imagine that this is not a specific feature, but maybe there is some trick I can use?
For example, what happens if my content-update reads the entry, writes it back with a small modification (without committing the transaction), then performs the lengthy operation and finally writes the result and commits the transaction? Would the first write be applied immediately and cause a simultaneous lock-refresh transaction to fail? Or are all writes kept until the transaction is committed at the end?
Is there such a thing as keeping two transactions open? Or doing an intermediate commit in a transaction?
Clearly, I can just split my content-update into two transactions: The first one sets a "don't mess with this, please!"-flag and the second one (later) writes the changes and clears that flag.
But maybe there is some other trick to achieve this with fewer reads/writes/transactions?
Another thought I had was that there are 3 different "blocks" of data: The current lock-holder (LH), the lock expiration (EX), and the content that is being modified (CO). The lock-refresh operation needs to perform a read of LH and a write to EX in a transaction, while the content-update operation needs to perform a read of LH, a read of CO, and a write of CO in a transaction. Is there a way to break the data apart into three entities and somehow have the transactions span only the needed entities? Since LH is never modified by these two operations, this might help avoid the conflict in the first place?
The datastore uses optimistic concurrency control, which means that a (datastore primitive) transaction waits until it is committed, then succeeds only if someone else hasn't committed first. Typically, the app retries the failed transaction with fresh data. There is no way to modify this first-wins behavior.
It might help to know that datastore transactions are strongly consistent, so a client can first commit a lock refresh with a synchronous datastore call, and when that call returns, the client knows for sure whether it obtained or refreshed the lock. The client can then proceed with its update and lock clear. The case you describe where a lock refresh and an update might occur concurrently from the same client sounds avoidable.
I'm assuming you need the lock mechanism to prevent writes from other clients while the lock owner performs multiple datastore primitive transactions. If a client is actually only doing one update before it releases the lock and it can do so within seconds (well before the datastore RPC timeout), you might get by with just a primitive datastore transaction with optimistic concurrency control and retries. But a lock might be a good idea for simple serialization of, say, edits to a record in a user interface, where a user hits an "edit" button in a UI and you want that to guarantee that the user has some time to prepare and submit changes without the record being changed by someone else. (Whether that's the user experience you want is your decision. :) )
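To illustrate the "commit a lock refresh, then proceed" flow, here is a minimal sketch using the legacy App Engine ndb library; the Lock model, field names, and TTL are invented for illustration. If a concurrent transaction commits first, this one fails with a collision and can simply be retried (or dropped, if you prefer the refresh to lose).

    import datetime
    from google.appengine.ext import ndb

    class Lock(ndb.Model):
        holder = ndb.StringProperty()      # "LH": current lock holder
        expires = ndb.DateTimeProperty()   # "EX": lock expiration

    @ndb.transactional(retries=0)          # fail fast instead of auto-retrying
    def refresh_lock(lock_key, client_id, ttl_seconds=60):
        lock = lock_key.get()
        if lock is None or lock.holder != client_id:
            return False                   # we are not the current holder
        lock.expires = (datetime.datetime.utcnow()
                        + datetime.timedelta(seconds=ttl_seconds))
        lock.put()
        return True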
