Drop reads and drop mutations in Cassandra - database

I have a large Cassandra cluster with multiple DCs. Sometimes I get INFO messages about dropped reads and dropped mutations in debug.log, for example 213 internal and 514 cross-node. However, the application was not impacted. My understanding is that the actual request did not fail: some replicas did not respond to the coordinator, but the consistency level was still achieved, so the request succeeded. Please clarify if I have misunderstood.

The application will not get an error from the coordinator if the consistency level for the request is satisfied. You mentioned that the application was not impacted, but that's likely because:
the read or write request has a low consistency level (for example, ONE or LOCAL_ONE), or
the request consistency level is LOCAL_* but the request failed for one or more replicas in a remote DC.
FWIW, internal dropped messages are read or write requests dropped by the local node, and cross-node dropped messages are requests to remote nodes (replicas). Cheers!
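To make the second point concrete, here is a minimal sketch of how a LOCAL_QUORUM check might look. It is a toy model, not Cassandra's code: the names ReplicaResult and local_quorum_met, the DC names, and the replication factor of 3 per DC are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaResult:
    dc: str          # data center the replica lives in
    responded: bool  # False models a dropped read or mutation on that replica


def local_quorum_met(results, local_dc, local_replication_factor):
    """Return True if enough replicas in the local DC responded."""
    required = local_replication_factor // 2 + 1
    acks = sum(1 for r in results if r.dc == local_dc and r.responded)
    return acks >= required


# Hypothetical cluster: 3 replicas in each of two DCs.
results = [
    ReplicaResult("dc1", True),
    ReplicaResult("dc1", True),
    ReplicaResult("dc1", False),  # dropped, but quorum is still reachable
    ReplicaResult("dc2", False),  # remote replicas do not count for LOCAL_*
    ReplicaResult("dc2", False),
    ReplicaResult("dc2", True),
]

print(local_quorum_met(results, "dc1", 3))  # True
```

In this model the request succeeds from the client's point of view even though three of the six replicas dropped the message, which matches the behaviour described in the question.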

Related

Artificially create Cassandra error to test retry policy

How might you force a Cassandra error that triggers a retry policy?
According to DataStax, the following errors trigger a retry policy:
Read and write timeouts.
When the coordinator determines there are not enough replicas to satisfy a request.
Edit: The driver I'm using is called gocql
Toggle the network connection of a node off, and it will appear momentarily down. Any request that needed that node to satisfy the consistency level, or for which that node was the coordinator, will need to be retried.

Cassandra Read Consistency when write path fail

I am new to Cassandra and trying to figure out how Cassandra provides consistency in case of failed writes. Consider the following scenario where CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request goes to all 3 replicas as usual. If the write fails on 2 replicas and succeeds on 1, Cassandra returns a failure. Since Cassandra does not roll back, the record continues to exist on the successful replica. Now, when a read comes with CL=QUORUM, the read request is forwarded to 2 replica nodes, and if one of them is the previously successful one, Cassandra returns the new record because it has the latest timestamp. But from the client's perspective this record was never written, since Cassandra returned a failure during the write.
If this is the case, then Cassandra will never be consistent in this scenario.
How should such a scenario be handled?
Please let me know if this understanding is correct.
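Your understanding is correct. The client in this case should receive a write timeout error rather than an UnavailableException (which is raised only when the coordinator knows up front that too few replicas are alive, so no write is attempted), but should understand that the write will eventually propagate to the other replicas (if the nodes are alive or come alive), and that this is not a failed write.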
For more details see the following articles:
https://www.datastax.com/blog/2014/10/cassandra-error-handling-done-right
http://mighty-titan.blogspot.com/2012/06/understanding-cassandras-consistency.html
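The scenario in the question can be reproduced with a small in-memory model. This is only a sketch under simplifying assumptions: the Replica class, quorum_write and quorum_read are hypothetical names, the read always contacts the first two replicas, and conflict resolution is plain last-write-wins by timestamp.

```python
class Replica:
    def __init__(self, up=True):
        self.up = up
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, ts, value):
        """Apply the write if the node is up; keep the newest timestamp."""
        if not self.up:
            return False
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)
        return True

    def read(self, key):
        return self.data.get(key)


def quorum_write(replicas, key, ts, value):
    """Send the write to every replica; succeed only if a quorum acked."""
    acks = sum(r.write(key, ts, value) for r in replicas)
    return acks >= len(replicas) // 2 + 1


def quorum_read(replicas, key):
    """Read from a quorum (here: the first ones) and return the newest value."""
    quorum = len(replicas) // 2 + 1
    answers = [r.read(key) for r in replicas[:quorum]]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return max(answers, key=lambda a: a[0])[1]


replicas = [Replica(), Replica(up=False), Replica(up=False)]
ok = quorum_write(replicas, "k", ts=1, value="v1")
print(ok)  # False: only 1 of 3 replicas acked, so the client sees a failure

for r in replicas:
    r.up = True  # the two nodes come back
print(quorum_read(replicas, "k"))  # v1: the "failed" write is still visible
```

There is no rollback step anywhere in the model, which is why the value written to the single healthy replica wins the later read.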

Eventual consistency with both database and message queue records

I have an application where I need to store some data in a database (MySQL, for instance) and then publish some data to a message queue. My problem is: if the application crashes after the write to the database, my data will never be written to the message queue and will be lost (thus eventual consistency of my system will not be guaranteed).
How can I solve this problem?
In this particular case, the answer is to load the queue data from the database.
That is, you write the messages that need to be queued to the database, in the same transaction that you use to write the data. Then, asynchronously, you read that data from the database, and write it to the queue.
See Reliable Messaging without Distributed Transactions, by Udi Dahan.
If the application crashes, recovery is simple -- during restart, you query the database for all unacknowledged messages, and send them again.
Note that this design really expects the consumers of the messages to be designed for at-least-once delivery.
I am assuming that you have a lossless message queue, where once you get a confirmation for writing data, the queue is guaranteed to have the record.
Basically, you need a loop with a transaction that can roll back or a status in the database. The pseudocode for a transaction is:
Begin transaction
Insert into database
Write to message queue
When message queue confirms, commit transaction
Personally, I would probably do this with a status:
Insert into database with a status of "pending" (or something like that)
Write to message queue
When message confirms, change status to "committed" (or something like that)
In the case of recovery from failure, you may need to check the message queue to see if any "pending" records were actually written to the queue.
I'm afraid that these answers (VoiceOfUnreason, Udi Dahan) just sweep the problem under the carpet. The problem under the carpet is: how should the movement of data from the database to the queue be designed so that the message is posted just once (without XA)? If you solve this, then you can easily extend that concept with any additional business logic.
CAP theorem tells you the limits clearly.
XA transactions are not a 100% bulletproof solution, but they seem to me the best of all the options I have seen.
Adding to what @Gordon Linoff said: assuming durable messaging (something like MSMQ?), the method/handler is going to be transactional, so if it all succeeds, the message is written to the queue and the data to your view model; if it fails, everything fails.
To mitigate the ID issue you will need to use GUIDs instead of DB-generated keys (if you are using messaging you will need to remove your referential integrity anyway and introduce GUIDs as keys).
One more suggestion: don't update the database, but insert only/upsert (the pending row and then the completed row) and have the reader do the projection of the data based on the latest row (for example).
Writing the message as part of the transaction is a good idea, but it has multiple drawbacks, for example if:
a. your database/language does not support transactions
b. transactions are time-consuming operations
c. you cannot afford to wait for the queue response while responding to your service call
d. your database is already under stress, since writing the message will exacerbate the impact of the higher workload
The best practice is to use database streams. Most modern databases support streams (DynamoDB, MongoDB, Oracle, etc.). You run a consumer of the database stream that reads from the stream and writes to the queue, invalidates the cache, adds to the search indexer, etc. Once all of them succeed, you mark the stream item as processed.
Pros of this approach
It works in a multi-region deployment when there is a regional failure (you should read from the regional stream and hydrate all the regional data stores).
No overhead of writing more records, and no performance bottlenecks from queues.
You can use this pattern for other data sources as well, like caching, queuing, searching.
Cons
You may need to call multiple services to construct an appropriate message.
One database stream might not be sufficient to construct an appropriate message.
You need to ensure the reliability of your streams; a Redis stream, for example, is not reliable.
NOTE: this approach also does not guarantee exactly-once semantics. The consumer logic should be idempotent and able to handle duplicate messages.
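An idempotent consumer can be as simple as remembering which message IDs it has already handled. The sketch below assumes every message carries a unique ID; the class name IdempotentConsumer is made up for this example, and the in-memory set stands in for what would need to be durable storage in practice.

```python
class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # assumption: a durable store in a real deployment

    def consume(self, message_id, payload):
        """Handle the message once; return False for a duplicate delivery."""
        if message_id in self.seen:
            return False
        self.handler(payload)
        self.seen.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
consumer.consume("m1", "order placed")
consumer.consume("m1", "order placed")  # redelivery of the same message
print(processed)  # ['order placed']
```

The ID is recorded only after the handler returns, so a handler that raises leaves the message eligible for another attempt.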

Coordinator node and its impact on performance

I was studying up on Cassandra, and I understand that it is a peer-to-peer database where there are no masters or slaves.
Each read/write is facilitated by a coordinator node, which then forwards the read/write request to the specific node by using the replication strategy and snitch.
My question is about the performance problems with this method.
Isn't there an extra hop?
Is the write buffered and then forwarded to the right replicas?
How does the performance change with different replication strategies?
Can I improve the performance by bypassing the coordinator node and writing to the replica nodes myself?
1) There will occasionally be an extra hop but your driver will most likely have a TokenAware Strategy for selecting the coordinator which will choose the coordinator to be a replica for the given partition.
2) The write is buffered, and depending on your consistency level you will not receive acknowledgment of the write until it has been accepted on multiple nodes. For example, with consistency level ONE you will receive an ACK as soon as the write has been accepted by a single node. The other nodes will have writes queued up and delivered, but you will not receive any info about them. If one of those writes fails or cannot be delivered, a hint will be stored on the coordinator to be delivered when the replica comes back online. Obviously there is a limit to the number of hints that can be saved, so after long downtimes you should run repair.
With higher consistency levels the client will not receive an acknowledgment until the number of nodes in the CL have accepted the write.
3) The performance should scale with the total number of writes. If a cluster can sustain a net 10k writes per second but has RF = 2, you most likely can only do 5k writes per second, since every write is actually 2. This will happen regardless of your consistency level, since those writes are sent even though you aren't waiting for their acknowledgment.
4) There is really no way to get around the coordination. The token aware strategy will pick a good coordinator which is basically the best you can do. If you manually attempted to write to each replica, your write would still be replicated by each node which received the request so instead of one coordination event you would get N. This is also most likely a bad idea since I would assume you have a better network between your C* nodes than from your client to the c* nodes.
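The arithmetic in points 2 and 3 can be written out as a rough back-of-the-envelope sketch. The function names are made up for this example, only three consistency levels are modelled, and the throughput figure is an idealised estimate that ignores everything except replication.

```python
def client_write_throughput(cluster_writes_per_sec, replication_factor):
    """Every client write becomes `replication_factor` replica writes."""
    return cluster_writes_per_sec / replication_factor


def acks_required(consistency_level, replication_factor):
    """Number of replica acknowledgments the coordinator waits for."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: " + consistency_level)


print(client_write_throughput(10_000, 2))  # 5000.0
print(acks_required("ONE", 3))             # 1
print(acks_required("QUORUM", 3))          # 2
```

Note that the consistency level appears only in acks_required: it changes how long the client waits, not how many replica writes the cluster has to absorb.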
I don't have answers for 2 and 3, but as for 1 and 4:
1) Yes, this can cause an extra hop.
4) Yes, well, kind of. The DataStax driver, as well as the Netflix Astyanax driver, can be set to be token aware, which means it will listen to the ring's gossip to know which nodes have which token ranges and send the insert to a coordinator on the node it will be stored on, eliminating the additional network hop.
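The idea behind token-aware routing can be illustrated with a toy hash ring. This is a sketch, not how any driver is implemented: the Ring class and pick_coordinator are hypothetical, each node owns a single token, MD5 stands in for Cassandra's default Murmur3 partitioner, and replicas are simply the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(value):
    # Stand-in hash for illustration; not the Murmur3 partitioner.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        self.rf = replication_factor
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key):
        """Walk clockwise from the key's token and collect `rf` nodes."""
        start = bisect.bisect_left(self.tokens, token(partition_key))
        count = len(self.entries)
        return [self.entries[(start + i) % count][1] for i in range(self.rf)]


def pick_coordinator(ring, partition_key):
    """Token-aware choice: send the request straight to a replica."""
    return ring.replicas(partition_key)[0]


ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
owners = ring.replicas("user:42")
coordinator = pick_coordinator(ring, "user:42")
print(coordinator in owners)  # True: the coordinator already holds the data
```

Because the chosen coordinator is itself a replica, one of the replica writes is local and the extra hop disappears for that copy.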
To add to Andrew's response, don't assume the coordinator hop is going to cause significant latency. Do your queries and measure. Think about consistency levels more than the extra hop. Tune your consistency for higher read or higher write speed, or a balance of the two. Then MEASURE. If you find latencies to be unacceptable, you may then need to tweak your consistency levels and/or change your data model.

What are the common practice to handle Cassandra write failure?

In the docs [1], it is said that
if using a write consistency level of QUORUM with a replication factor
of 3, Cassandra will send the write to 2 replicas. If the write fails on
one of the replicas but succeeds on the other, Cassandra will report a
write failure to the client.
So assume only 2 replicas receive the update and the write failed. But due to eventual consistency, all the nodes will receive the update eventually.
So, should I retry? Or just leave it as is?
Any strategy?
[1] http://www.datastax.com/docs/1.0/dml/about_writes
Those docs aren't quite correct. Regardless of the consistency level (CL), writes are sent to all available replicas. If replicas aren't available, Cassandra won't send a request to the down nodes. If there aren't enough available from the outset to satisfy the CL, an UnavailableException is thrown and no write is attempted to any node.
However, the write can still succeed on some nodes while an error is returned to the client. In the example from [1], what is written is true if one replica was already down before the write was attempted.
So assume only 2 replicas receive the update and the write failed. But due to eventual consistency, all the nodes will receive the update eventually.
Be careful though: a failed write doesn't tell you how many nodes the write was made to. It could be none, so the write may never propagate.
So, should I retry? Or just leave it as is?
In general you should retry, because it may not be written at all. You should only regard your write as written when you get a successful return from the write.
If you're using counters though you should be careful with retries. Because you don't know if the write was made or not, you could get duplicate counts. For counters, you probably don't want to retry (since more often than not the write will have been made to at least one node, at least for higher consistency levels).
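The distinction between ordinary writes and counters could be expressed as a retry helper that only retries writes declared idempotent. This is a sketch: WriteTimeout is a stand-in exception rather than a real driver class, and execute_with_retry, flaky_write, and the idempotent flag are names invented for this example.

```python
class WriteTimeout(Exception):
    """Stand-in for the timeout error a driver would raise."""


def execute_with_retry(write, idempotent, max_attempts=3):
    """Retry a timed-out write only when repeating it is safe."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write()
        except WriteTimeout:
            # Counter updates are not idempotent: a retry could double count.
            if not idempotent or attempt == max_attempts:
                raise


calls = {"n": 0}


def flaky_write():
    """Simulated write that times out twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise WriteTimeout()
    return "applied"


print(execute_with_retry(flaky_write, idempotent=True))  # applied
```

A regular insert or update with a fixed value can be passed with idempotent=True, while a counter increment would use idempotent=False so the timeout surfaces to the caller after a single attempt.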
Retrying will not change much. The problem is that you cannot actually know whether the data was persisted at all, because Cassandra always throws the same exception.
You have a few options:
enable hints and retry the request with CL=ANY; a successful response means that at least a hint was created, so you know the data is there but not yet accessible.
disable hints and retry with CL=ONE; a successful response means that at least one node received the data. In case of an error, execute a delete.
use Astyanax and its retry strategy
update to Cassandra 1.2 and use the write-ahead log
