Could anyone explain how to read contents from a distributed cluster?
I mean a distributed cluster whose consistency is guaranteed by the Paxos algorithm.
In a real-world application, how does the client read the contents it has written to the cluster?
For example, in a 5-server cluster, maybe only 3 of them have the newest data and the other 2 have old data due to network delay or something similar.
Does this mean the client needs to read from at least a majority of all nodes? In a 5-server cluster, that means reading data from at least 3 servers and picking the copy with the newest version number?
If so, it seems quite slow, since you need to read 3 copies. How do real-world systems implement this?
Clients should read from the leader. If a node knows it is not the leader, it should redirect the client to the leader. If a node does not know who the leader is, it should throw an error, and the client should pick another node at random until it is told, or finds, the leader. If a node thinks it is the leader, it is dangerous to serve a read from local state: it may have just lost connectivity to the rest of the cluster right when it hits a massive stall (CPU load, IO stall, VM overload, a large GC pause, some background task, a server maintenance job, ...), such that it actually loses the leadership while replying to the client and gives out a stale read. This can be avoided by running a round of (multi-)Paxos for the read.
Lamport clocks and vector clocks say you must pass messages to establish that operation A happens before operation B when they run on different machines; otherwise they run concurrently. This provides the theoretical underpinning for why we cannot guarantee that a read from a leader is not stale without exchanging messages with a majority of the cluster. The message exchange establishes a "happened-before" relationship between the read and the next write (which may happen on a new leader due to a failover).
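The happened-before rule above can be sketched with a minimal Lamport clock (an illustrative toy, not tied to any particular implementation): two machines can only order their events once a timestamped message passes between them.

```python
class LamportClock:
    """Toy Lamport clock: a counter bumped on every event, and merged
    with the sender's timestamp on receive."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time  # timestamp carried on the outgoing message

    def receive(self, msg_time):
        # receive happens-after the send: jump past the sender's timestamp
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # machine A sends at logical time 1
t_recv = b.receive(t_send)  # machine B's receive is now ordered after A's send
assert t_recv > t_send
```

Without that `receive` call, nothing relates B's events to A's, which is exactly why the leader must exchange messages before claiming its read is fresh.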
The leader itself can be an acceptor, so in a three-node cluster it needs only one response from one other node to complete a round of (multi-)Paxos. It can send messages in parallel and reply to the client when it gets the first response. The network between nodes should be dedicated to intra-cluster traffic (and the best you can get), so that this does not add much latency for the client.
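A minimal sketch of that confirmation round, with invented names (`confirm_leadership` is not from any real library): the leader counts itself as one acceptor and serves the read only once a majority is still on its ballot.

```python
def confirm_leadership(ballot, peer_ballots, cluster_size):
    """Serve a read only if a majority of the cluster (leader included)
    is still on this leader's ballot number."""
    acks = 1  # the leader is an acceptor too, so it counts itself
    for peer_ballot in peer_ballots:   # replies gathered from other nodes
        if peer_ballot <= ballot:      # peer has not promised a higher ballot
            acks += 1
    return acks > cluster_size // 2

# 3-node cluster: the leader plus one peer reply is already a majority
assert confirm_leadership(ballot=5, peer_ballots=[5], cluster_size=3)
# a peer that promised ballot 6 means a newer leader may exist: refuse the read
assert not confirm_leadership(ballot=5, peer_ballots=[6], cluster_size=3)
```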
There is an answer describing how Paxos can be used for a locking service that cannot tolerate stale reads or reordered writes, where a crash scenario is discussed, over at some questions about paxos. Clearly a locking service cannot have reads and writes to the locks "running concurrently", which is why it does a round of (multi-)Paxos for each client message to strictly order reads and writes across the cluster.
I am pretty new to distributed systems and was wondering how the Raft consensus algorithm is linearizable. Raft commits log entries through a quorum. At the moment the leader commits, more than half of the participants have the replicated log entry. But there may be a portion of the participants that don't have the latest logs, or that have the logs but haven't received instructions to commit those logs.
Or does Raft's read linearizability require a read quorum?
Well, linearizability pertains to both reads and writes, and yes, both are accomplished with a quorum. To make reads linearizable, reads must be handled by the leader, and the leader must verify it has not been superseded by a newer leader after applying the read to the state machine but before responding to the client. In practice, though, many real-world implementations use relaxed consistency models for reads, e.g. allowing reads from followers. But note that while quorums guarantee linearizability for the Raft cluster, that doesn't mean client requests are linearizable. To extend linearizability to clients, sessions must be added to prevent dropped/duplicated client requests from producing multiple commits in the Raft log for a single client request, which would violate linearizability.
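The leader-verification step can be sketched roughly like this (a simplified, hypothetical ReadIndex-style check; all names are invented for illustration): the leader confirms a heartbeat majority before answering, and refuses the read if it cannot.

```python
class KVStateMachine:
    """Tiny key-value state machine the leader applies committed entries to."""

    def __init__(self):
        self.applied_index = 0
        self.data = {}

    def apply(self, index, key, value):
        self.data[key] = value
        self.applied_index = index

    def get(self, key):
        return self.data.get(key)

def linearizable_read(sm, commit_index, heartbeat_acks, cluster_size, key):
    """Answer a read only after a majority of heartbeat replies confirms
    this node is still the leader (heartbeat_acks from peers, plus self)."""
    if heartbeat_acks + 1 <= cluster_size // 2:
        raise RuntimeError("leadership not confirmed; redirect the client")
    if sm.applied_index < commit_index:
        raise RuntimeError("state machine lagging; wait for apply")
    return sm.get(key)
```

For example, in a 5-node cluster the leader needs 2 peer heartbeat acks (3 of 5 including itself) before the read is safe to serve.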
kuujo has explained what linearizability is. I'll answer the other doubt in your question.
But there may be a portion of the participants that don't have the latest logs or that they have the logs but haven't received instructions to commit those logs.
This is possible: after a leader commits a log entry, some of the peers may not have that log entry yet, but the other peers will get it eventually. Raft does several things to guarantee that:
Even if the leader has committed a log entry (say, the entry at index=8) and answered the client, background routines keep trying to sync logEntry:8 to the other peers. What if that RPC fails? (It's possible.)
Raft sends a heartbeat periodically (a kind of AppendEntries). This heartbeat RPC syncs the logs the leader has to the followers if the followers do not have them. After a follower has logEntry:8, it compares its local commitIndex with the leaderCommitIndex in the RPC args to decide whether it should commit logEntry:8. What if the leader fails immediately after committing a log entry? (It's rare but possible.)
Based on the election rule, only a candidate that has logEntry:8 can win the election. After a new leader is elected, the new leader will continue using the heartbeat to sync logEntry:8 to the other peers. What happens if a follower falls so far behind that it cannot get all the logs from the leader? (This happens a lot when you add a new node.) In this scenario, Raft uses a snapshot RPC mechanism to sync all data plus the trimmed logs.
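The election rule mentioned above can be sketched as a simple comparison (illustrative only): a voter grants its vote only if the candidate's log is at least as up to date as its own, so a candidate missing the committed logEntry:8 cannot collect a majority.

```python
def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """Raft's up-to-date check: compare last log terms first, then
    (on a tie) the last log indexes. True means the voter may grant its vote."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# voter holds the committed entry at index 8, term 3
assert log_is_up_to_date(3, 8, 3, 8)      # candidate has it: vote granted
assert not log_is_up_to_date(3, 7, 3, 8)  # candidate missing index 8: rejected
```

Because the committed entry sits on a majority, any electable candidate must have it, which is why the entry survives the failover.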
Hypothetically speaking, I plan to build a distributed system with Cassandra as the database. The system will run on multiple servers, say servers A, B, C, D, E, etc. Each server will have a Cassandra instance, and all servers will form a cluster.
In my hypothetical distributed system, X of the total servers should process user requests; e.g., 3 of the servers A, B, C, D, E should process requests from user uA. Each application should update its Cassandra instance with an exact copy of the data. E.g., if user uA sends a message to user uB, each application should update its database with an exact copy of the message sent and to whom, and as expected, Cassandra should take over from that point to ensure all nodes are up to date.
How do I configure Cassandra to make sure it first checks that all copies inserted into the database are exactly the same before updating all other nodes?
Psst: kindly keep explanations as simple as possible. I'm new to Cassandra, crossing over from MySQL. Thank you in advance.
Every time a change happens in Cassandra, it is communicated to all relevant nodes (nodes that have a replica of the data). But sometimes that doesn't happen, either because a node is down or too busy, the network fails, etc.
What you are asking is how to get consistency out of Cassandra, or in other terms, how to make a change and guarantee that the next read has the most up to date information.
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. There are multiple consistency options but normally you would only use:
ONE - Only one node has to get or accept the change. This means fast reads/writes, but low consistency (if you write to A, someone can read from B before it is updated).
QUORUM - A majority (51%) of your nodes must get or accept the change. This means slower reads and writes, but you get FULL consistency IF you use it for BOTH reads and writes. That's because if more than half of your nodes have your data after you insert/update/delete, then, when reading from more than half of your nodes, at least one node will have the most recent information, and that is the copy that gets delivered. (If you have 3 nodes A, B, C and you write to A and B, someone can read from C but also from A or B, meaning the read will always see the most up-to-date information.)
Cassandra knows what is the most up to date information because every change has a timestamp and the most recent wins.
You also have other options such as ALL, which is NOT RECOMMENDED because it requires all nodes to be up and available. If a node is unavailable, your system is down.
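The QUORUM overlap described above can be sketched with a toy model (illustrative, not real Cassandra code): with N replicas, if the write set and the read set each hold a majority, they must share at least one replica, and the newest timestamp picks the winner.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1

def read_latest(replicas, read_set):
    """Each replica holds a (timestamp, value) pair; the newest wins,
    mirroring Cassandra's last-write-wins timestamp resolution."""
    return max((replicas[i] for i in read_set), key=lambda v: v[0])[1]

N = 3
replicas = {i: (1, "old") for i in range(N)}
for i in range(quorum(N)):   # write "new" at timestamp 2 to a quorum (nodes 0, 1)
    replicas[i] = (2, "new")
# any read quorum (2 of 3) must overlap the write quorum, so the read
# always sees the newest timestamp, even if it also touches a stale node
assert read_latest(replicas, {1, 2}) == "new"
```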
Cassandra Documentation (Consistency)
I am looking for something to use as a simple service registry and am considering etcd. For this use-case availability is more important than consistency. Clients must be able to read/write keys to any of the nodes even when the cluster is split. Can etcd be used in this way? It doesn't matter if some of the writes are lost when things come back together as they will be quickly updated by service "I am alive" heartbeat timers.
I'm also new to etcd. What I have noticed is that when a network partition happens, reads still work on the nodes that are not in the main quorum, but they will see inconsistent (possibly stale) data.
As for the writes, they fail with "Raft internal error".
I was studying up on Cassandra and I understand that it is a peer-to-peer database where there are no masters or slaves.
Each read/write is facilitated by a coordinator node, which then forwards the read/write request to the specific nodes by using the replication strategy and snitch.
My question is around the performance problems with this method.
1) Isn't there an extra hop?
2) Is the write buffered and then forwarded to the right replicas?
3) How does the performance change with different replication strategies?
4) Can I improve the performance by bypassing the coordinator node and writing to the replica nodes myself?
1) There will occasionally be an extra hop, but your driver will most likely have a TokenAware strategy for selecting the coordinator, which will choose a coordinator that is a replica for the given partition.
2) The write is buffered, and depending on your consistency level you will not receive acknowledgment of the write until it has been accepted on multiple nodes. For example, with consistency level ONE you will receive an ACK as soon as the write has been accepted by a single node. The other nodes will have the writes queued up and delivered, but you will not receive any info about them. In the case that one of those writes fails or cannot be delivered, a hint will be stored on the coordinator to be delivered when the replica comes back online. Obviously there is a limit to the number of hints that can be saved, so after long downtimes you should run repair.
With higher consistency levels the client will not receive an acknowledgment until the number of nodes in the CL have accepted the write.
3) The performance should scale with the total number of writes. If a cluster can sustain a net 10k writes per second but has RF = 2, you most likely can only do 5k client writes per second, since every write is actually 2. This happens regardless of your consistency level, since those writes are sent even when you aren't waiting for their acknowledgment.
4) There is really no way to get around the coordination. The token-aware strategy will pick a good coordinator, which is basically the best you can do. If you manually attempted to write to each replica, your write would still be replicated by each node that received the request, so instead of one coordination event you would get N. This is also most likely a bad idea, since I would assume you have a better network between your C* nodes than from your client to the C* nodes.
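The acknowledgment-and-hints behaviour from point 2 can be sketched roughly like this (a toy model with invented names, not driver or server code): the coordinator acknowledges the client once enough replicas accept, and stores a hint for every replica that could not be reached.

```python
def coordinate_write(replica_ok, required_acks):
    """replica_ok: dict of replica name -> True/False (write accepted?).
    Returns (acknowledged, hints), where hints lists the failed replicas
    the coordinator will replay when they come back online."""
    acks = sum(1 for ok in replica_ok.values() if ok)
    hints = [r for r, ok in replica_ok.items() if not ok]
    return acks >= required_acks, hints

# CL=ONE: a single accepting replica is enough; the down node gets a hint
acked, hints = coordinate_write({"A": True, "B": True, "C": False}, required_acks=1)
assert acked and hints == ["C"]
# CL=QUORUM with RF=3 needs 2 acks, so one accepting replica is not enough
acked, _ = coordinate_write({"A": True, "B": False, "C": False}, required_acks=2)
assert not acked
```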
I don't have answers for 2 and 3, but as for 1 and 4.
1) Yes, this can cause an extra hop
4) Yes, well, kind of. The DataStax driver, as well as Netflix's Astyanax driver, can be set to be token aware, which means it will listen to the ring's gossip to know which nodes have which token ranges and send the insert to the coordinator on the node it will be stored on, eliminating the additional network hop.
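A toy token ring illustrates the token-aware routing idea (invented names, not the actual driver logic): hash the partition key to a token, walk the ring clockwise to find the replicas, and pick one of them as coordinator so the extra hop disappears.

```python
import hashlib

def token_for(key, ring_size=100):
    """Map a partition key onto a small toy token space."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % ring_size

def replicas_for(token, node_tokens, rf):
    """node_tokens: list of (token, node). Walk clockwise from the key's
    token and take the next `rf` nodes, wrapping around the ring."""
    ordered = sorted(node_tokens)
    start = next((i for i, (t, _) in enumerate(ordered) if t >= token), 0)
    return [ordered[(start + k) % len(ordered)][1] for k in range(rf)]

ring = [(10, "A"), (40, "B"), (70, "C")]
# a key whose token is 50 lands on C first, then wraps to A with RF=2;
# a token-aware client would send the write straight to C or A
assert replicas_for(50, ring, rf=2) == ["C", "A"]
```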
To add to Andrew's response, don't assume the coordinator hop is going to cause significant latency. Do your queries and measure. Think about consistency levels more than the extra hop. Tune your consistency for higher read or higher write speed, or a balance of the two. Then MEASURE. If you find latencies to be unacceptable, you may then need to tweak your consistency levels and/or change your data model.
I have some trouble using Apache Cassandra. I have been trying to solve this problem for several weeks now.
This is my setup. I have 2 computers running Apache Cassandra (let's call them computer C1 and computer C2). I create a keyspace with replication factor 2, so that each computer has a local copy of the data.
I have a program that reads a fairly large amount of data, say about 500 MB.
Scenario 1)
Say only computer C1 has Cassandra running. I run the read program on computer C1, and the read completes within half a minute to a minute.
Scenario 2)
I now start the Cassandra instance on computer C2 and run the read program on computer C1 again - it now takes a very long time to complete, on the order of 20 minutes.
I am not sure why this is happening. The read consistency is set to ONE.
Expected performance
Ideally the read program on both computers C1 and C2 should complete fast. This should be possible, as both computers have a local copy of the data.
Can anyone please point me in the right direction? I really appreciate the help,
Thanks
Update: Network Usage
This may not mean much, but I monitored the connection using nethogs, and when both Cassandra nodes are up and I read the database, bandwidth is used by Cassandra to communicate with the other node - presumably this is read repair occurring in the background, as I've used read consistency level ONE, and in my case the closest node with the required data is the local computer's Cassandra instance (all nodes have all the data) - so the source of the data should be the local computer...
Update: transient SQL exceptions: TimedOutException()
When both nodes are up, the program that reads the database hits several transient SQL exceptions: TimedOutException(). I use the default timeout of 10 sec. But that raises the question of why the SQL statements are timing out when all data retrieval should be from the local instance. Also, the same SQL code runs fine if only one node is up.
There is no such thing as a read consistency of "ANY" (that only applies to writes). The lowest read consistency is ONE. You need to check what your read consistency really is.
Perhaps your configuration is setup in such a way that a read requires data from both servers to be fetched (if both are up), and fetching data from C2 to C1 is really slow.
Force set your read consistency level to "ONE".
You appear to have a token collision, which in your case translates to both nodes owning 100% of the keys. What you need to do is reassign one of the nodes such that it owns half the tokens. Use nodetool move (use token 85070591730234615865843651857942052864) followed by nodetool cleanup.
The slow speeds most likely come from the high network latency, which, when multiplied across all your transactions (with some subset actually timing out), results in a correspondingly long job time. Many client libraries use auto node discovery to learn about new or downed nodes, then round-robin requests across the available nodes. So even though you're only telling it about localhost, it's probably learning about the other node on its own.
In any distributed computing environment where nodes must communicate, network latency and reliability are a huge factor and must be dealt with.