How might you force a Cassandra error that triggers a retry policy?
According to the DataStax documentation, the following errors trigger a retry policy:
Read and write timeouts.
When the coordinator determines there are not enough replicas to satisfy a request.
Edit: The driver I'm using is called gocql
Toggle the network connection of a node off and it will appear momentarily down. Any request that needed that node to satisfy the consistency level, or that had that node as the coordinator of the query, will need to retry.
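The question mentions gocql, but as a rough, driver-agnostic sketch of what this looks like from the client side, here is the same experiment with the DataStax Java driver (the contact point, keyspace and table are assumptions): wrap the retry policy in a logging decorator, run a query loop, and cut a node's network while the loop is running.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DefaultRetryPolicy;
import com.datastax.driver.core.policies.LoggingRetryPolicy;

public class RetryDemo {
    public static void main(String[] args) {
        // Wrap the default policy in a logging decorator so its retry
        // decisions show up in the driver logs.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // assumed contact point
                .withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))
                .build();
        Session session = cluster.connect("my_ks");   // assumed keyspace

        // While this runs, take one replica's network down. Requests that
        // needed that node to meet the consistency level will time out, and
        // the retry policy will be consulted.
        for (int i = 0; i < 1000; i++) {
            session.execute(new SimpleStatement(
                    "SELECT * FROM my_table WHERE id = ?", i));   // assumed table
        }
        cluster.close();
    }
}
```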
Related
I have a large Cassandra cluster with multiple DCs. Sometimes I get INFO messages for dropped reads and dropped mutations in debug.log, with 213 internal and 514 cross-node. However, the application was not impacted. My understanding is that the actual request was not failing: some of the replicas did not respond to the coordinator, but the request still succeeded because the consistency level was achieved. Please clarify if I am misunderstanding this.
The application will not get an error from the coordinator if the consistency level for the request is satisfied. You mentioned that the application was not impacted, but that's likely because:
the read or write request has a low consistency level (for example, ONE or LOCAL_ONE), or
the request consistency level is LOCAL_* and the failures were for replica(s) in a remote DC.
FWIW, internal dropped messages are when the local node rejected the read or write request, and cross-node is for requests to remote nodes (replicas). Cheers!
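As an illustration of why the consistency level decides whether those drops ever surface, here is a minimal sketch with the DataStax Java driver (keyspace and table names are assumptions): the same query succeeds at LOCAL_ONE but can fail with a read timeout at LOCAL_QUORUM when too many replicas drop the message.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.ReadTimeoutException;

public class ConsistencyDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");   // assumed keyspace

        SimpleStatement stmt =
                new SimpleStatement("SELECT * FROM my_table WHERE id = 1");   // assumed table

        // One healthy local replica is enough: a few dropped cross-node
        // messages never reach the client as an error.
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
        session.execute(stmt);

        // A quorum of local replicas must answer in time; if they drop the
        // request, the coordinator reports a timeout to the client.
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        try {
            session.execute(stmt);
        } catch (ReadTimeoutException e) {
            System.err.println("Not enough replicas answered: " + e.getMessage());
        }
        cluster.close();
    }
}
```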
I am using the Morphline Solr Sink to store information in Solr. The problem I am facing is that the Flume agent never stops retrying failed requests, which can accumulate over time. This results in the Flume warning about MaxIO workers being used, and the system suffers performance issues. Is there any way, other than writing my own sink, to make Flume stop retrying or back off exponentially so the system performs better? My source is an avroSource.
Thanks.
You should fix the reason for the failed requests.
Flume is doing exactly what it's designed to do. It's transactionally trying to store the batch of events in your store. If it can't store those events then, yes, it keeps on trying.
You haven't explained what problem is causing these failures. I would recommend thinking about an interceptor to fix whatever is wrong in the data, or to drop events you don't want to store; see the sketch below.
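A minimal sketch of such an interceptor, assuming events are "bad" only when their body is empty (swap in whatever validation your data actually needs); the class name is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Drops events the sink would never be able to index, so they don't end up
// in an endless retry loop. The "empty body" check is a placeholder rule.
public class DropBadEventsInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        // Returning null removes the event from the flow.
        return (body == null || body.length == 0) ? null : event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<>(events.size());
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors through a Builder.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new DropBadEventsInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}
```

You would then wire it to the avro source in the agent configuration, something like `agent.sources.avroSrc.interceptors = dropBad` and `agent.sources.avroSrc.interceptors.dropBad.type = your.package.DropBadEventsInterceptor$Builder` (agent and source names are whatever yours are).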
In the doc [1], it says that
if using a write consistency level of QUORUM with a replication factor
of 3, Cassandra will send the write to 2 replicas. If the write fails on
one of the replicas but succeeds on the other, Cassandra will report a
write failure to the client.
So assume only 2 replicas receive the update and the write fails. But due to eventual consistency, all the nodes will receive the update eventually.
So, should I retry? Or just leave it as is?
Any strategy?
[1] http://www.datastax.com/docs/1.0/dml/about_writes
Those docs aren't quite correct. Regardless of the consistency level (CL), writes are sent to all available replicas. If replicas aren't available, Cassandra won't send a request to the down nodes. If there aren't enough available from the outset to satisfy the CL, an UnavailableException is thrown and no write is attempted to any node.
However, the write can still succeed on some nodes while an error is returned to the client. In the example from [1], what is written is true if one replica was down before the write was attempted.
So assume only 2 replicas receive the update and the write fails. But
due to eventual consistency, all the nodes will receive the update
eventually.
Be careful though: a failed write doesn't tell you how many nodes the write was made to. It could be none, so the write may never propagate.
So, should I retry? Or just leave it as is?
In general you should retry, because the data may not have been written at all. You should only regard your write as written when you get a successful return from the write.
If you're using counters, though, you should be careful with retries. Because you don't know whether the write was made or not, you could get duplicate counts. For counters, you probably don't want to retry (since, more often than not, the write will have been made to at least one node, at least at higher consistency levels).
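For non-counter writes, a retry wrapper can be as small as the following sketch (DataStax Java driver assumed; the attempt count and backoff are arbitrary):

```java
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.UnavailableException;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class WriteRetry {

    // Retry an idempotent (non-counter) write a few times before giving up.
    // Only a successful return confirms the write; anything else leaves the
    // outcome unknown, which is why retrying is safe here but not for counters.
    static void writeWithRetry(Session session, Statement stmt, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                session.execute(stmt);
                return;                        // confirmed: consistency level met
            } catch (WriteTimeoutException | UnavailableException e) {
                if (attempt >= maxAttempts) {
                    throw e;                   // caller still doesn't know the outcome
                }
                Thread.sleep(100L * attempt);  // crude linear backoff
            }
        }
    }
}
```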
Retrying will not change much. The problem is that you cannot actually know whether the data was persisted at all, because Cassandra always throws the same exception.
You have a few options:
enable hints and retry the request with CL=ANY - a successful response would mean that at least a hint was created, so you know the data is there but not yet accessible.
disable hints and retry with CL=ONE - a successful response would mean that at least one node received the data. In case of an error, execute a delete.
use Astyanax and its retry strategy
upgrade to Cassandra 1.2 and use the write-ahead log
I need to POST (HTTP method) some info to an external URL when a trigger executes.
I know there are a lot of security and performance implications to using triggers, so I am afraid a trigger is not the place to do this kind of processing. But I am posting this anyway to get some feedback or ideas on how to approach the problem. Some considerations:
The transaction fired in the trigger could be asynchronous.
The process must take care of authorization.
The end URL is a php script on the internet.
What really triggers this execution is an insert or an update of one record in a table, so I must use a trigger since I can't touch the (third-party) application.
On a side note, could Service Broker be something to consider?
Any ideas will be welcome.
You are right, this is not something you want to do in a trigger. The last thing you want in your application is to introduce the latency of an HTTP request into every update/insert/delete, which will be very visible even when things work well. But when things work badly, it will work very badly: the added coupling will cause your application to fail when the HTTP resource has availability problems, and even worse are the correctness issues around rollbacks (the transaction that executed the trigger may roll back, but the HTTP call has already been made).
This is why it is paramount to introduce a layer that decouples the trigger from the HTTP call, and this is done via a queue. Whether it is a table used as a queue, a Service Broker queue, or even an MSMQ queue is your call. The simplest solution is to use a table as a queue:
the trigger enqueues (inserts) a request for the HTTP call to be made
after the transaction that ran the trigger commits, the request becomes available to dequeue
an external application that monitors (polls) the queue picks up the request and places the HTTP call
The advantage of Service Broker over custom tables-as-queues is Internal Activation, which would allow your HTTP-handling code to run on demand when there are items to process in the queue, instead of polling. But making the HTTP call from inside the engine, via SQLCLR, is quite ill-advised. An external process is much better suited to something like HTTP, and therefore the added complexity of Service Broker is not warranted.
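A minimal sketch of that external process, assuming a queue table `dbo.HttpQueue(Id, Payload)` that the trigger inserts into, a single worker process, and a made-up target URL:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Drains the queue table and places the HTTP POST outside the trigger's
// transaction. Connection string, table layout and URL are assumptions.
public class HttpQueueWorker {

    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
        String target = "https://example.com/endpoint.php";

        try (Connection con = DriverManager.getConnection(jdbcUrl)) {
            while (true) {
                try (Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT TOP (1) Id, Payload FROM dbo.HttpQueue ORDER BY Id")) {
                    if (rs.next()) {
                        long id = rs.getLong("Id");
                        post(target, rs.getString("Payload"));
                        // Only remove the request once the POST succeeded.
                        try (PreparedStatement del = con.prepareStatement(
                                "DELETE FROM dbo.HttpQueue WHERE Id = ?")) {
                            del.setLong(1, id);
                            del.executeUpdate();
                        }
                    } else {
                        Thread.sleep(5000);   // queue empty: poll again shortly
                    }
                }
            }
        }
    }

    private static void post(String target, String payload) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(target).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        // Authorization (e.g. an API key header) would be added here.
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() >= 300) {
            // A real worker would log and leave the row for a later retry
            // instead of throwing and exiting.
            throw new IllegalStateException("POST failed with HTTP " + conn.getResponseCode());
        }
    }
}
```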
I am looking to use apache-camel to poll an imap inbox, but I am wondering how this setup would behave in a cluster. I would deploy apache camel on each node of the cluster, and each node would poll the inbox.
How can I avoid having many consumers pick up the same message?
I decided to take the simple road and not install additional components. I used a clustered Quartz job to trigger the polling of the inbox. The poller then places a retrieval command on a Hazelcast distributed queue, which is received by an array of message-retrieval components in the cluster.
Installing JMS and James in addition to Camel, just to solve this task, smelled bad to me.
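For reference, the hand-off described above is roughly this (Hazelcast 3.x package names and the queue name are assumptions):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IQueue;

// The clustered Quartz job calls enqueueRetrievalCommand() once per tick;
// any node in the cluster may take the command and do the actual IMAP fetch.
public class MailRetrievalQueue {

    private final IQueue<String> commands;

    public MailRetrievalQueue(HazelcastInstance hz) {
        this.commands = hz.getQueue("mail-retrieval-commands");
    }

    // Called from the (clustered, so single-fire) Quartz job.
    public void enqueueRetrievalCommand(String folder) {
        commands.offer(folder);          // e.g. "INBOX"
    }

    // Called by the message-retrieval components on every node.
    public String awaitNextCommand() throws InterruptedException {
        return commands.take();          // blocks until a command is available
    }

    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        MailRetrievalQueue queue = new MailRetrievalQueue(hz);
        queue.enqueueRetrievalCommand("INBOX");
        System.out.println("Next command: " + queue.awaitNextCommand());
    }
}
```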
Not very easy, since IMAP is not really a protocol for these kinds of tasks.
The trick is still to have one consumer do the polling, not many. If you have many nodes for high availability, you could do some tricks with JMS to trigger IMAP polls.
For instance, you could use a JMS trigger message to initiate a poll and have all members of the cluster listen for it. Keep concurrentConsumers at 1 and asynchronous JMS consumption disabled in Camel. You can rely on Message Groups or an ActiveMQ exclusive consumer to be sure that only one node gets the trigger messages (while it's alive; otherwise another node will take over). Generating the polling messages might be tricky, but it could be done as simply as a timer route on each Camel node. Just tune the frequency.
This setup will avoid race conditions in IMAP; while it isn't load balanced, it is at least secured for failover. It might be good enough to just go ahead and do concurrent polling, with few issues, but I don't think you will be 100% safe unless you allow only one consumer.
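A rough sketch of those two routes in the Camel Java DSL (broker setup, mail credentials and queue names are assumptions; the exclusive-consumer or message-group guarantee is configured on the ActiveMQ side):

```java
import org.apache.camel.builder.RouteBuilder;

public class TriggeredImapRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Every node emits a trigger periodically. Duplicate triggers are
        // cheap because only one node consumes each message from the queue.
        from("timer:imapTrigger?period=60000")
            .setBody(constant("poll"))
            .to("activemq:queue:imap.poll");

        // Single consumer: at most one IMAP poll runs per trigger message.
        from("activemq:queue:imap.poll?concurrentConsumers=1")
            .pollEnrich("imaps://imap.example.com?username=user&password=secret", 10000)
            .filter(body().isNotNull())
            .log("Fetched a message from the inbox");
    }
}
```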
In a clustered environment you may consider electing a single active Camel route that does the IMAP polling, and then having logic for failover if that node goes down.
In Camel you can take a look at route policies, which can be applied to routes.
http://camel.apache.org/routepolicy
The ZooKeeper component has a policy for electing a leader in a cluster and only allowing one route to be active. This requires, though, that you use ZooKeeper.
http://camel.apache.org/zookeeper
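A minimal sketch of that approach, assuming camel-zookeeper is on the classpath; the ZooKeeper address, znode path and mail endpoint are placeholders:

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.zookeeper.policy.ZooKeeperRoutePolicy;

public class ElectedImapPoller extends RouteBuilder {
    @Override
    public void configure() {
        // "1" = at most one instance of this route is active across the
        // cluster; if the leader dies, another node wins the election and
        // starts polling.
        ZooKeeperRoutePolicy policy =
                new ZooKeeperRoutePolicy("zookeeper:zkhost:2181/camel/imap-poller", 1);

        from("imaps://imap.example.com?username=user&password=secret&consumer.delay=60000")
            .routePolicy(policy)
            .to("log:imap-poller");
    }
}
```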