Cassandra Read Consistency when write path fails

I am new to Cassandra and trying to figure out how Cassandra provides consistency in the case of failed writes. Consider the following scenario where CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request will go to all 3 replicas as usual; if the write fails on 2 replicas and succeeds on 1 replica, Cassandra will return a failure. Since Cassandra does not roll back, the record will continue to exist on the successful replica. Now, when a read comes in with CL=QUORUM, the read request will be forwarded to 2 replica nodes, and if one of them is the previously successful one, Cassandra will return the new record because it has the latest timestamp. But from the client's perspective this record was not written at all, since Cassandra had returned a failure during the write.
If this is the case, then Cassandra will never be consistent in this scenario.
How do I handle such a scenario?
Please let me know if this understanding is correct.

Your understanding is correct. The client in this case receives a write error (a write timeout if the replicas failed to respond in time, or an UnavailableException if too few replicas were alive for the write to be attempted at all), but it should understand that a write which reached at least one replica will eventually propagate to the other replicas (once those nodes are alive again), so from Cassandra's point of view this is not a failed write.
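As an illustration (not part of the original answer), here is a minimal sketch of how a client can tell those two cases apart with the DataStax Python driver; the contact point, keyspace, table and column names are placeholders.

import uuid
from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)

try:
    session.execute(insert, (uuid.uuid4(), "alice"))
except Unavailable:
    # Too few replicas were alive, so the coordinator never attempted the write:
    # nothing was stored anywhere, and a plain retry later is safe.
    raise
except WriteTimeout:
    # The write was attempted but not enough replicas acknowledged it in time.
    # It may still exist on one replica and surface on later QUORUM reads, which
    # is exactly the situation described in the question.
    raise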
For more details see the following articles:
https://www.datastax.com/blog/2014/10/cassandra-error-handling-done-right
http://mighty-titan.blogspot.com/2012/06/understanding-cassandras-consistency.html

Related

Drop reads and drop mutations in Cassandra

I have a large Cassandra cluster with multiple DCs. Sometimes I get INFO messages about dropped reads and dropped mutations in debug.log, with 213 internal and 514 cross node. However, the application was not impacted. As per my understanding, the actual requests were not failing; some of the replicas did not respond to the coordinator, but if the consistency level was achieved then the request succeeded. Please clarify if I am misunderstanding this.
The application will not get an error from the coordinator if the consistency level for the request is satisfied. You mentioned that the application was not impacted, but that's likely because:
the read or write request has a low consistency level (for example, ONE or LOCAL_ONE), or
the request consistency level is LOCAL_* and the failure was for a replica (or replicas) in a remote DC.
FWIW, "internal" dropped messages are ones the local node rejected for a read or write request, and "cross node" counts requests to remote nodes (replicas). Cheers!

MongoDB - WiredTiger Snapshots vs. Locking

I am not completely understanding how these two features relate to one another in a (WiredTiger) MongoDB program:
1) WiredTiger Snapshots
2) Data Locking
If each read operation using the WiredTiger engine is, at read time, provided with a database-level 'snapshot' (so as to provide consistency, the C in ACID), why then do we also need locking? Let's use an example.
I perform a query at the Document level (a read operation). Okay, so I know I get the database level snapshot, so that my data is consistent EVEN IF another user is concurrently writing to that same Document, updating it.
So at this point, what is the use for having a Shared-Lock on that document, which is blocking all write (exclusive) operations on that document until the Shared-Lock is released? What could possibly go wrong in writing to that Document concurrently while I'm reading it, if I am, in fact, using a snapshot of the Document that was provided to me at read-time? Why would I care if the Document is locked during my read-operation period or not? I already have my (consistent) data from that point-in-time, no?
I'm obviously missing a key concept here... Any help?
Thanks.
You are right that the read operation will acquire a snapshot. When using the WiredTiger storage engine, MongoDB does not lock individual documents for either reads or writes. Instead, WiredTiger uses multi-version concurrency control (MVCC). When performing an update of a document, the update will succeed as long as the document still has the same version it had when the snapshot was acquired. If not, WiredTiger returns an error (WT_ROLLBACK) indicating that the update had a write conflict. In this case the update is aborted, all pending changes are undone, and MongoDB transparently retries the operation.
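To make that concrete, here is a purely conceptual sketch of the same optimistic check done at the application level with pymongo; WiredTiger performs this version check internally and you normally never write such a loop yourself, and the collection and field names below are made up.

from pymongo import MongoClient

coll = MongoClient()["demo"]["accounts"]  # made-up database and collection

def optimistic_update(doc_id, new_balance, retries=5):
    for _ in range(retries):
        doc = coll.find_one({"_id": doc_id})              # our "snapshot" of the document
        result = coll.update_one(
            {"_id": doc_id, "version": doc["version"]},   # apply only if nobody changed it meanwhile
            {"$set": {"balance": new_balance}, "$inc": {"version": 1}},
        )
        if result.modified_count == 1:                    # our version was still current
            return True
        # Another writer won the race; this is the case in which WiredTiger
        # returns WT_ROLLBACK internally and MongoDB retries for you.
    return False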

How to synchronize distributed system data across Cassandra clusters

Hypothetically speaking, I plan to build a distributed system with Cassandra as the database. The system will run on multiple servers, say servers A, B, C, D, E, etc. Each server will have a Cassandra instance and all servers will form a cluster.
In my hypothetical distributed system, X of the total servers should process user requests, e.g. 3 of the servers A, B, C, D, E should process requests from user uA. Each application should update its Cassandra instance with an exact copy of the data. E.g. if user uA sends a message to user uB, each application should update its database with an exact copy of the message sent and to whom, and, as expected, Cassandra should take over from that point to ensure all nodes are up to date.
How do I configure Cassandra to make sure it first checks that all copies inserted into the database are exactly the same before updating all the other nodes?
Psst: kindly keep explanations as simple as possible. I'm new to Cassandra, crossing over from MySQL. Thank you in advance.
Every time a change happens in Cassandra, it is communicated to all relevant nodes (nodes that have a replica of the data). But sometimes that doesn't happen, either because a node is down or too busy, the network fails, etc.
What you are asking is how to get consistency out of Cassandra, or in other words, how to make a change and guarantee that the next read sees the most up-to-date information.
In Cassandra you choose the consistency level in each query you make, so you can have consistent data if you want to. There are multiple consistency options, but normally you would only use:
ONE - Only one node has to get or accept the change. This means fast reads/writes, but low consistency (if you write to A, someone can read from B before it has been updated).
QUORUM - 51% of your nodes must get or accept the change. This means somewhat slower reads and writes, but you get FULL consistency IF you use it for BOTH reads and writes. That's because if more than half of your nodes have the data after you insert/update/delete, then when reading from more than half of your nodes at least one of them will have the most recent information, and that is the value that gets delivered. (If you have 3 nodes A, B, C and you write to A and B, someone can read from C, but the quorum read will also contact A or B, so it will always get the most up-to-date information.) The sketch after the documentation link below shows this write/read combination in code.
Cassandra knows which is the most up-to-date information because every change carries a timestamp and the most recent one wins.
You also have other options, such as ALL, which is NOT RECOMMENDED because it requires all nodes to be up and available. If a node is unavailable, your system is down.
Cassandra Documentation (Consistency)
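As a small illustration (not from the original answer), this is roughly how you pick the consistency level per query with the DataStax Python driver; the contact points, keyspace, table and columns are made up, and using QUORUM on both the write and the read is what gives the guarantee described above.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["server_a", "server_b", "server_c"]).connect("chat")

# Write at QUORUM: 2 of the 3 replicas must acknowledge before success is returned.
write = SimpleStatement(
    "INSERT INTO messages (from_user, to_user, body) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("uA", "uB", "hello"))

# Read at QUORUM: 2 of the 3 replicas are consulted, so at least one of them
# overlaps with the write quorum and the newest timestamp wins.
read = SimpleStatement(
    "SELECT body FROM messages WHERE from_user = %s AND to_user = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(read, ("uA", "uB"))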

Eventual consistency with both database and message queue records

I have an application where I need to store some data in a database (MySQL for instance) and then publish some data in a message queue. My problem is: if the application crashes after the storage in the database, my data will never be written to the message queue and will be lost (thus the eventual consistency of my system will not be guaranteed).
How can I solve this problem?
In this particular case, the answer is to load the queue data from the database.
That is, you write the messages that need to be queued to the database, in the same transaction that you use to write the data. Then, asynchronously, you read that data from the database, and write it to the queue.
See Reliable Messaging without Distributed Transactions, by Udi Dahan.
If the application crashes, recovery is simple -- during restart, you query the database for all unacknowledged messages, and send them again.
Note that this design really expects the consumers of the messages to be designed for at-least-once delivery.
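To make the pattern concrete, here is a rough sketch of such an outbox relay, using sqlite3 only to keep the example self-contained; publish_to_queue stands in for whatever message-queue client you actually use, and the table names are made up.

import sqlite3

db = sqlite3.connect("app.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, payload TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS outbox ("
           "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)")

def write_order(payload):
    # The business row and the outgoing message are written in ONE local
    # transaction, so a crash can never keep the data while losing the message.
    with db:
        db.execute("INSERT INTO orders (payload) VALUES (?)", (payload,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def relay_outbox(publish_to_queue):
    # Runs asynchronously, and again on every restart: resend everything that has
    # not been marked as sent. Consumers must tolerate duplicates (at-least-once).
    rows = db.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish_to_queue(payload)
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))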
I am assuming that you have a lossless message queue, where once you get a confirmation that the data was written, the queue is guaranteed to have the record.
Basically, you need a loop with a transaction that can roll back or a status in the database. The pseudo code for a transaction is:
Begin transaction
Insert into database
Write to message queue
When message queue confirms, commit transaction
Personally, I would probably do this with a status:
Insert into database with a status of "pending" (or something like that)
Write to message queue
When message confirms, change status to "committed" (or something like that)
In the case of recovery from failure, you may need to check the message queue to see if any "pending" records were actually written to the queue.
I'm afraid those answers (VoiceOfUnreason, Udi Dahan) just sweep the problem under the carpet. The problem under the carpet is: how should the movement of data from the database to the queue be designed so that the message is posted just once (without XA)? If you solve this, then you can easily extend that concept with any additional business logic.
The CAP theorem tells you the limits clearly.
XA transactions are not a 100% bulletproof solution, but they seem to me the best of all the alternatives I have seen.
Adding to what Gordon Linoff said, assuming durable messaging (something like MSMQ?), the method/handler is going to be transactional, so if it's all successful the message will be written to the queue and the data to your view model; if it fails, all will fail...
To mitigate the ID issue you will need to use GUIDs instead of DB-generated keys (if you are using messaging you will need to remove your referential integrity anyway and introduce GUIDs as keys).
One more suggestion: don't update the database, but insert only/upsert (the pending row and then the completed row) and have the reader do the projection of the data based on the latest row (for example).
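A rough sketch of that append-only idea, again with sqlite3 for brevity and made-up table and column names: the application generates the key itself, appends a pending row and later a completed row, and the reader projects the current state from the latest row.

import sqlite3, uuid

db = sqlite3.connect("events.db")
db.execute("CREATE TABLE IF NOT EXISTS order_events ("
           "order_id TEXT, seq INTEGER, status TEXT, PRIMARY KEY (order_id, seq))")

order_id = str(uuid.uuid4())  # app-generated GUID, not a DB identity column

with db:  # append the pending row; never update it in place
    db.execute("INSERT INTO order_events VALUES (?, ?, ?)", (order_id, 1, "pending"))

# ... publish the message, wait for the queue's confirmation ...

with db:  # append the completed row once the queue has confirmed
    db.execute("INSERT INTO order_events VALUES (?, ?, ?)", (order_id, 2, "committed"))

# The reader projects the current state as the row with the highest seq per key.
row = db.execute("SELECT status FROM order_events WHERE order_id = ? "
                 "ORDER BY seq DESC LIMIT 1", (order_id,)).fetchone()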
Writing the message as part of the transaction is a good idea, but it has multiple drawbacks if:
a. your database/language does not support transactions,
b. transactions are time-consuming operations,
c. you cannot afford to wait for the queue's response while responding to your service call, or
d. your database is already under stress, and writing the message would exacerbate the impact of the higher workload.
In those cases, the best practice is to use database streams. Most modern databases support streams (DynamoDB, MongoDB, Oracle, etc.). You have a consumer of the database stream running which reads from the stream and writes to the queue, invalidates caches, adds to a search indexer, etc. Once all of these succeed, you mark the stream item as processed. (A short sketch follows the pros and cons below.)
Pros of this approach
It will work in the case of a multi-region deployment where there is a regional failure (you should read from the regional stream and hydrate all the regional data stores).
No overhead of writing extra records, and no performance bottlenecks from the queue.
You can use this pattern to feed other systems as well, such as caching, queueing and search indexing.
Cons
You may need to call multiple services to construct the appropriate message.
One database stream might not be sufficient to construct the appropriate message.
You must ensure the reliability of your streams; a Redis stream, for example, is not reliable.
NOTE: this approach also does not guarantee exactly-once semantics. The consumer logic should be idempotent and able to handle duplicate messages.
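As one possible sketch of such a stream consumer, here is what it could look like with MongoDB change streams (the answer also names DynamoDB and Oracle); publish_to_queue is a placeholder for your queue client, the deployment must be a replica set for change streams to work, and the consumer has to be idempotent for the reason given in the note above.

from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]  # made-up database and collection

def run_stream_consumer(publish_to_queue, resume_token=None):
    # resume_after lets the consumer pick up where it left off after a restart
    with orders.watch(resume_after=resume_token) as stream:
        for change in stream:
            if change["operationType"] == "insert":
                publish_to_queue(change["fullDocument"])
            # Persist stream.resume_token somewhere durable here; a crash between
            # the publish and that save causes redelivery, which is why the
            # consumer logic must handle duplicates.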

What are the common practice to handle Cassandra write failure?

In the doc [1], it was said that
if using a write consistency level of QUORUM with a replication factor
of 3, Cassandra will send the write to 2 replicas. If the write fails on
one of the replicas but succeeds on the other, Cassandra will report a
write failure to the client.
So assume only 2 replicas receive the update and the write fails. But due to eventual consistency, all the nodes will receive the update eventually.
So, should I retry? Or just leave it as is?
Any strategy?
[1] http://www.datastax.com/docs/1.0/dml/about_writes
Those docs aren't quite correct. Regardless of the consistency level (CL), writes are sent to all available replicas; Cassandra simply won't send a request to nodes that are down. If there aren't enough replicas available from the outset to satisfy the CL, an UnavailableException is thrown and no write is attempted on any node.
However, the write can still succeed on some nodes while an error is returned to the client. In the example from [1], if one replica was down before the write was attempted, then what is written there is true.
So assume only 2 replicas receive the update and the write fails. But due to eventual consistency, all the nodes will receive the update eventually.
Be careful though: a failed write doesn't tell you how many nodes the write was made to. It could be none, so the write may not propagate eventually.
So, should I retry? Or just leave it as is?
In general you should retry, because the write may not have been made at all. You should only regard your write as written when you get a successful return from the write.
If you're using counters, though, you should be careful with retries. Because you don't know whether the write was made or not, you could get duplicate counts. For counters you probably don't want to retry (since, more often than not, the write will have been made to at least one node, at least for higher consistency levels).
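For example, a simple client-side retry with the DataStax Python driver could look like the sketch below; it is meant for ordinary (idempotent) INSERT/UPDATE statements and, for the reason given above, should not be used for counter updates.

import time
from cassandra import Unavailable, WriteTimeout

def execute_with_retry(session, statement, params=None, attempts=3, delay=0.5):
    for attempt in range(attempts):
        try:
            return session.execute(statement, params)
        except (Unavailable, WriteTimeout):
            if attempt == attempts - 1:
                raise                           # give up; the outcome stays unknown
            time.sleep(delay * (attempt + 1))   # simple linear back-off between retries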
Retrying will not change much. The problem is that you actually cannot know whether the data was persisted at all, because Cassandra always throws the same exception.
You have a few options:
enable hints and retry the request with CL=ANY - a successful response would mean that at least a hint was created, so you know the data is there but not yet accessible.
disable hints and retry with CL=ONE - a successful response would mean that at least one node received the data. In case of an error, execute a delete.
use Astyanax and its retry strategy
update to Cassandra 1.2 and use the write-ahead log
