How to synchronize distributed system data across Cassandra clusters

Hypothetically speaking, I plan to build a distributed system with Cassandra as the database. The system will run on multiple servers, say servers A, B, C, D, E, etc. Each server will have a Cassandra instance, and all servers will form a cluster.
In my hypothetical distributed system, X of the total servers should process user requests; e.g., 3 of servers A, B, C, D, E should process requests from user uA. Each application should update its Cassandra instance with an exact copy of the data. E.g., if user uA sends a message to user uB, each application should update its database with an exact copy of the message sent and its recipient, and, as expected, Cassandra should take over from that point to ensure all nodes are up to date.
How do I configure Cassandra to make sure it first checks that all copies inserted into the database are exactly the same before updating all other nodes?
Psst: kindly keep explanations as simple as possible. I'm new to Cassandra, crossing over from MySQL. Thank you in advance.

Every time a change happens in Cassandra, it is communicated to all relevant nodes (nodes that have a replica of the data). But sometimes that doesn't happen, either because a node is down or too busy, the network fails, etc.
What you are asking is how to get consistency out of Cassandra, or in other terms, how to make a change and guarantee that the next read sees the most up-to-date information.
In Cassandra you choose the consistency level for each query you make, so you can have consistent data if you want to. There are multiple consistency options, but normally you would only use:
ONE - Only one node has to get or accept the change. This means fast reads/writes, but low consistency (if you write to A, someone can read from B before it has been updated).
QUORUM - A majority of the replicas (more than half) must get or accept the change. This means reads and writes are not as fast, but you get full consistency IF you use it for BOTH reads and writes (see the sketch after this list). That's because if more than half of the replicas have your data after you insert/update/delete, then, when reading from more than half of them, at least one node will have the most recent information, and that copy is what gets delivered. (If you have 3 replicas A, B, C and you write to A and B, a quorum read must touch at least two of them, so it always includes A or B and therefore always returns the most up-to-date information.)
Cassandra knows which copy is the most up-to-date because every change carries a timestamp and the most recent one wins.
You also have other options such as ALL, which is NOT RECOMMENDED because it requires all replicas to be up and available. If a node is unavailable, your reads or writes at that level fail.
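For example, with the DataStax Java driver (3.x) you can request QUORUM on each statement. This is only a minimal sketch of the idea; the contact point, the keyspace "chat" and the table "messages" are made-up placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("chat")) {            // hypothetical keyspace

            // Write with QUORUM: a majority of replicas must accept the change.
            SimpleStatement write = new SimpleStatement(
                    "INSERT INTO messages (from_user, to_user, body) VALUES (?, ?, ?)",
                    "uA", "uB", "hello");                            // hypothetical table/columns
            write.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            // Read with QUORUM: a majority of replicas is consulted, so at least one
            // of them already has the latest write, and that copy wins by timestamp.
            SimpleStatement read = new SimpleStatement(
                    "SELECT body FROM messages WHERE from_user = ?", "uA");
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            System.out.println(session.execute(read).one());
        }
    }
}
```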
Cassandra Documentation (Consistency)

Related

Quorum vs. Versioning: when should I use what?

A quorum is the majority of servers that have to agree on a certain operation in order to move forward.
Versioning is a counter for each record.
In a database, the latest version will always give me the latest and correct record.
When and why should I use quorum for distributed systems?
Quorum is required in a distributed environment where you are running a cluster of machines and any one of these machines can accept a write/modify request and update the data. In such scenarios, a quorum is used to identify the leader that will accept the writes, or to determine which node can accept write/modify requests for a given range of keys.
Let's consider a scenario where you have 3 master servers accepting writes. In that case, if you want to update the data, can you just match the version on one of the masters and assume it is safe to update?
No, because at the same moment some other write request to another master server can assume the same thing, and hence you will end up with different states of the data on different machines.
In this scenario, you need a quorum to identify the leader that will accept writes for a given range of data, and then you can use versioning (optimistic locking) to ensure the data is consistent across all machines and the writes are serialized.
Versioning, however, is helpful when you have one master accepting the writes and multiple users might try to update the same data; using versioning here can help you achieve optimistic locking. This is generally helpful when the chances of contention are low.
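As a rough sketch of the versioning (optimistic locking) idea on a single master, here is what a compare-and-set update could look like with plain JDBC; the "records" table and its columns are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticLock {
    // Returns true if the update was applied; false means another writer changed
    // the row first, so the caller should re-read and retry with the new version.
    static boolean updateWithVersionCheck(Connection conn, long id,
                                          String newData, long expectedVersion) throws SQLException {
        String sql = "UPDATE records SET data = ?, version = version + 1 "
                   + "WHERE id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, newData);
            ps.setLong(2, id);
            ps.setLong(3, expectedVersion);
            return ps.executeUpdate() == 1;   // 0 rows: the version we read is stale
        }
    }
}
```

If executeUpdate returns 0, another writer got there first; the caller re-reads the row, picks up the new version, and retries.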

Improving application performance by load balancing SQL queries across Always On availability group nodes?

We have an intranet browser-based application written in ASP.NET with MS SQL Server as the database backend. One of our clients has an Always On availability group set up with two nodes. Our application requests are routed (via an availability group listener) to the primary R/W node, and our client uses the R/O node for their custom reporting (Crystal Reports).
As the number of users grows, we're running into performance problems - mostly CPU-related.
We would like the customer to add more CPUs, while they want us to start routing read-only queries to the R/O node.
We are really hesitant because these would be application changes and really non-trivial ones:
We understand that reporting is an ideal case to be sent to the R/O node (reduces load, blocking, …). Is it a recommended practice to load balance by sending the read-only queries to the read-only node(s)?
It seems to me we would need to be very careful in terms of what we can afford. It takes some time before the R/O node is synchronized, so we would need to always understand that we could be reading old data. For example, the user clicks the “Save” button and after the record is saved, we re-read the list of records to be displayed. I assume we would have to go to the R/W node to guarantee that the new records will be there. Is that correct?
If we send queries to the R/O node, don't we degrade the robustness of the system? If one node crashes, the other node needs to be able to sustain the load on its own. Are there recommended scenarios when it makes sense to send requests to the R/O node and when it does not?
It is preferable to send only reporting-related queries to the secondary nodes, so that CPU-intensive reports do not degrade the performance of your online database and your transactions are not affected by non-transactional usage.
However, this does not mean that you should run all read-only queries on the secondary node. Say you have a transactional operation that first needs a SELECT with a row lock; you shouldn't do that read on the passive node and the DML on the active node.
Generally, all operational queries should go to the active node, whereas the passive node(s) are more appropriate for long-running reports.
For your second question: if the secondary node is configured as asynchronous, then yes, there may be some delay, and in the case of a log shipping failure it is possible to see old data.
For the third question, it really depends on current/future/peak-hour system load; it is hard to give a blanket answer. It also depends on the budget: if you can afford it, you can add one more node. But keep in mind that RDBMS systems are not very amenable to horizontal scaling.
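One relatively low-risk way to do the routing is to open a separate connection for reporting-style queries with the documented ApplicationIntent=ReadOnly property, so the availability group listener redirects them to a readable secondary (this requires read-only routing to be configured on the AG). The sketch below uses the Microsoft JDBC driver purely for illustration; the same keyword exists in ADO.NET/SqlClient connection strings. Host, database, table and credentials are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadOnlyRouting {
    public static void main(String[] args) throws Exception {
        // Normal transactional work keeps using the listener as before (primary, R/W).
        String rwUrl = "jdbc:sqlserver://ag-listener:1433;databaseName=AppDb";

        // Reporting queries ask the listener to route them to a readable secondary.
        String roUrl = "jdbc:sqlserver://ag-listener:1433;databaseName=AppDb;applicationIntent=ReadOnly";

        try (Connection reporting = DriverManager.getConnection(roUrl, "appUser", "secret");
             Statement st = reporting.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM dbo.Orders")) {  // hypothetical table
            rs.next();
            // Data on the secondary may lag slightly behind the primary.
            System.out.println("Order count (possibly slightly stale): " + rs.getLong(1));
        }
    }
}
```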

Coordinator node and its impact on performance

I was studying up on Cassandra and I understand that it is a peer-to-peer database with no masters or slaves.
Each read/write is facilitated by a coordinator node, which then forwards the request to the appropriate replica nodes based on the replication strategy and snitch.
My question is around the performance problems with this method.
1) Isn't there an extra hop?
2) Is the write buffered and then forwarded to the right replicas?
3) How does the performance change with different replication strategies?
4) Can I improve the performance by bypassing the coordinator node and writing to the replica nodes myself?
1) There will occasionally be an extra hop, but your driver will most likely have a token-aware strategy for selecting the coordinator, which will choose a coordinator that is a replica for the given partition.
2) The write is buffered and, depending on your consistency level, you will not receive acknowledgment of the write until it has been accepted on multiple nodes. For example, with consistency level ONE you will receive an ACK as soon as the write has been accepted by a single node. The other nodes will have writes queued up and delivered, but you will not receive any info about them. In case one of those writes fails or cannot be delivered, a hint will be stored on the coordinator to be delivered when the replica comes back online. Obviously there is a limit to the number of hints that can be saved, so after long downtimes you should run repair.
With higher consistency levels the client will not receive an acknowledgment until the number of nodes in the CL have accepted the write.
3) The performance should scale with the total number of writes. If a cluster can sustain a net 10k writes per second but has RF = 2, you most likely can only do 5k client writes per second, since every write is actually two. This happens regardless of your consistency level, since those replica writes are sent even though you aren't waiting for their acknowledgment.
4) There is really no way to get around the coordination. The token-aware strategy will pick a good coordinator, which is basically the best you can do. If you manually attempted to write to each replica, your write would still be replicated by each node that received the request, so instead of one coordination event you would get N. This is also most likely a bad idea, since I would assume you have a better network between your C* nodes than from your client to the C* nodes.
I don't have answers for 2 and 3, but as for 1 and 4:
1) Yes, this can cause an extra hop
4) Yes, well, kind of. The DataStax driver, as well as the Netflix Astyanax driver, can be set to be token aware, which means it will listen to the ring's gossip to know which nodes own which token ranges and send the insert to the coordinator on the node where it will be stored, eliminating the additional network hop.
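For reference, wiring up token awareness explicitly with the DataStax Java driver (3.x) looks roughly like this; note that recent driver versions already wrap the default policy in a token-aware one, so treat this as a sketch rather than something you normally have to add:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareSetup {
    public static void main(String[] args) {
        // Wrap the DC-aware round-robin policy so the driver prefers a coordinator
        // that is itself a replica for the statement's partition key, avoiding the
        // extra network hop discussed above.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")   // any reachable node; the rest are discovered
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder().build()))
                .build();
        System.out.println("Cluster handle created: " + cluster.getClusterName());
        cluster.close();
    }
}
```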
To add to Andrew's response, don't assume the coordinator hop is going to cause significant latency. Do your queries and measure. Think about consistency levels more than the extra hop. Tune your consistency for higher read or higher write speed, or a balance of the two. Then MEASURE. If you find latencies to be unacceptable, you may then need to tweak your consistency levels and / or change your data model.

Issue with reading data from Apache Cassandra

I have some trouble using Apache Cassandra. I have been trying to solve this problem for several weeks now.
This is my setup. I have 2 computers running Apache Cassandra (let's call them computer C1 and computer C2). I create a keyspace with replication factor 2. This is so that each computer has a local copy of the data.
I have a program that reads a fairly large amount of data, say about 500 MB.
Scenario 1)
Say only computer C1 has Cassandra running. I run the read program on computer C1, and the read completes within half a minute to a minute.
Scenario 2)
I now start the Cassandra instance on computer C2 and run the read program on computer C1 again - it now takes a very long time to complete, on the order of 20 minutes.
I am not sure why this is happening. The read consistency is set to "One".
Expected performance
Ideally the read program should complete quickly on both computers C1 and C2. This should be possible as both computers have a local copy of the data.
Can anyone please point me in the right direction? I really appreciate the help,
Thanks
Update: Network Usage
This may not mean much, but I monitored the network connection using nethogs, and when both Cassandra nodes are up and I read the database, bandwidth is used by Cassandra to communicate with the other node. Presumably this is read repair occurring in the background, as I've used the read consistency level 'One' and in my case the closest node with the required data is the local computer's Cassandra instance (all nodes have all the data), so the data should be coming from the local computer...
Update: SQLTransentExceptions: TimedOutException()
When both nodes are up, the program that reads the database gets several SQLTransentExceptions: TimedOutException(). I use the default timeout of 10 sec. That raises the question of why the SQL statements are timing out when all data retrieval should be from the local instance. Also, the same SQL code runs fine if only one node is up.
There is no such thing as a read consistency of "ANY" (that only applies to writes). The lowest read consistency is ONE. You need to check what your read consistency really is.
Perhaps your configuration is set up in such a way that a read requires data to be fetched from both servers (if both are up), and fetching data from C2 to C1 is really slow.
Force set your read consistency level to "ONE".
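If you want to be certain the program really runs at ONE, you can make it the driver-wide default instead of relying on per-query settings. A sketch with the DataStax Java driver follows; the client in the original question may be an older Thrift-based one, so adapt accordingly:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

public class ForceConsistencyOne {
    public static void main(String[] args) {
        // Make ONE the default for every statement issued through this Cluster,
        // so no individual query accidentally runs at a higher consistency level.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.ONE))
                .build();
        try (Session session = cluster.connect()) {
            session.execute("SELECT release_version FROM system.local");
        } finally {
            cluster.close();
        }
    }
}
```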
You appear to have a token collision, which in your case translates to both nodes owning 100% of the keys. What you need to do is reassign one of the nodes such that it owns half the tokens. Use nodetool move (use token 85070591730234615865843651857942052864) followed by nodetool cleanup.
The slow speeds most likely come from the high network latency, which, when multiplied across all your transactions (with some subset actually timing out), results in a correspondingly large job time. Many client libraries use auto node discovery to learn about new or downed nodes, then round-robin requests across available nodes. So even though you're only telling it about localhost, it's probably learning about the other node on its own.
In any distributed computing environment where nodes must communicate, network latency and reliability are a huge factor and must be dealt with.

What are good algorithms to keep consistency across multiple files in a network?

What are good algorithms to keep consistency in multiple files?
This is a school project. I have to implement, in C, some replication across a network.
I have 2 servers,
Server A1
Server A2
Both servers have their own file called "data.txt"
If I write something to one of them, I need the other to be updated.
I also have another scenario, with 3 Servers.
Server B1
Server B2
Server B3
I need these to do pretty much the same.
While this would be fairly simple to implement, if one or two of the servers were to go down, then when coming back up they would have to update themselves.
I'm sure there are algorithms that solve this efficiently. I know what I want, I just don't know exactly what I'm looking for!
Can someone point me in the right direction, please?
Thank you!
The fundamental issue here is known as the 'CAP theorem', which defines three properties that a distributed system can have:
Consistency: Reading data from the system always returns the most up-to-date data.
Availability: Every request receives a response, whether success or failure (it doesn't just keep waiting until things recover)
Partition tolerance: The system can operate when its servers are unable to communicate with each other (a server being down is one special case of this)
The CAP theorem states that you can only have two of these. If your system is consistent and partition tolerant, then it loses the availability condition - you might have to wait for a partition to heal before you get a response. If you have consistency and availability, you'll have downtime when there's a partition, or enough servers are down. If you have availability and partition tolerance, you might read stale data, or have to deal with conflicting writes.
Note that this applies separately between reads and writes - you can have an Available and Partition-Tolerant system for reads, but a Consistent and Available system for writes. This is basically a master-slave system; in a partition, writes might fail (if they're on the wrong side of a partition), but reads will work (although they might return stale data).
So if you want to be Available and Partition Tolerant for reads, one easy option is to just designate one host as the only one that can do writes, and sync from it (e.g., using rsync from a cron script or something - in your C project, you'd just copy the file over using some simple network code periodically, and do an extra copy just after modifying it).
If you need partition tolerance for writes, though, it's more complex. You can have two servers that can't talk to each other both doing writes, and later have to figure out what data wins. This basically means you'll need to compare the two versions when syncing and decide what wins. This can be as simple as 'let the highest timestamp win', or you can use vector clocks as in Dynamo to implement a more complex policy - which is appropriate here depends on your application.
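As a starting point for the 'highest timestamp wins' policy, here is a minimal merge sketch (written in Java for brevity, even though the project itself is in C); it assumes each record carries the timestamp of the node that wrote it:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal "last write wins" merge: when two servers reconcile after a partition,
// for every key the copy with the newer timestamp survives. Clock skew and ties
// are the known weakness of this policy; vector clocks address that at the cost
// of having to resolve genuine conflicts explicitly.
public class LwwMerge {

    static final class Record {
        final String value;
        final long timestampMillis;   // taken from the writing node's clock

        Record(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    static Map<String, Record> merge(Map<String, Record> a, Map<String, Record> b) {
        Map<String, Record> merged = new HashMap<>(a);
        b.forEach((key, candidate) -> {
            Record current = merged.get(key);
            if (current == null || candidate.timestampMillis > current.timestampMillis) {
                merged.put(key, candidate);   // newer timestamp wins
            }
        });
        return merged;
    }
}
```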
Check out rsync and how Dropbox works.
With every write to server A, fork a process to write the same content to server B.
That way all the writes to server A are replicated on server B. If you have multiple servers, make the forked process write to all the backup servers.
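A sketch of that idea (in Java rather than C, with made-up host names, port, and framing): after each local write, the same bytes are pushed to every backup server in the background, which stands in for the forked process described above:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ReplicatedAppend {
    static final List<String> BACKUPS = List.of("serverB", "serverC"); // hypothetical hosts
    static final int PORT = 9000;                                      // hypothetical port

    static void write(String line) throws Exception {
        byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);

        // 1. Apply the write locally first.
        try (FileOutputStream local = new FileOutputStream("data.txt", true)) {
            local.write(bytes);
        }

        // 2. Push the same bytes to every backup in a background thread per backup.
        for (String host : BACKUPS) {
            new Thread(() -> {
                try (Socket s = new Socket(host, PORT);
                     OutputStream out = s.getOutputStream()) {
                    out.write(bytes);
                } catch (Exception e) {
                    // A backup being down is exactly the case that needs catch-up later.
                    System.err.println("Replication to " + host + " failed: " + e.getMessage());
                }
            }).start();
        }
    }
}
```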
