Data Partitioning and Replication on Cassandra cluster - database

I have a 3-node Cassandra cluster with RF=3. When I run nodetool status, the "Owns" value for each node in the cluster is 100%.
But when I have 5 nodes in the cluster with RF=3, the "Owns" value is approximately 60% (as shown in the image below).
As I understand it, the partitioner calculates the hash that identifies the first replica node, and the data is then replicated to other nodes according to the RF.
Now we have a 5-node cluster and RF is 3.
Shouldn't 3 nodes own all the data evenly (100%), since the partitioner points to one node as per the partitioning strategy and the same data is then replicated to RF-1 further nodes? Instead, it looks like the data is getting evenly distributed among all 5 nodes even though RF is 3.
Edit1:
As per my understanding, the reason for the approximately 60% ownership on each node is that the RF is 3. That means there will be 3 replicas of each row, so 300% data in total. There are 5 nodes in the cluster, and the partitioner will be using the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
But the issue is that we checked all the nodes of our cluster, and every node contains all the data even though the RF is 3.
Edit2:
@Aaron I did as specified in the comment. I created a new cluster with 3 nodes.
I created a keyspace "test" and set the class to SimpleStrategy and the RF to 2.
Then I created a table "emp" with partition key (id, name).
Then I inserted a single row through the first node.
As per your explanation, it should only be on 2 nodes, since RF=2.
But when I logged into all 3 nodes, I could see the row on every one of them.
I think that since the keyspace is replicated to all the nodes, the data is getting replicated to all of them as well.

Percent ownership is not affected (at all) by actual data being present. You could add a new node to a single node cluster (RF=1) and it would instantly say 50% on each.
Percent ownership is purely about the percentage of token ranges which a node is responsible for. When a node is added, the token ranges are recalculated, but data doesn't actually move until a streaming event happens. Likewise, data isn't actually removed from its original node until cleanup.
For example, if you have a 3-node cluster with an RF of 3, each node will be at 100%. Add one node (with RF=3), and percent ownership drops to about 75%. Add a 5th node (again, keep RF=3) and ownership for each node correctly drops to about 3/5, or 60%. Again, with an RF of 3 it's all about each node being responsible for a set of primary, secondary, and tertiary token ranges.
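The arithmetic behind those percentages can be sketched in a few lines. This is only an illustration that assumes perfectly balanced token ranges; the function name effective_ownership is made up for this example and is not part of nodetool or any Cassandra API.

```python
def effective_ownership(num_nodes: int, replication_factor: int) -> float:
    """Fraction of the token ring each node is a replica for.

    Assumes token ranges are perfectly balanced across nodes (an idealization).
    """
    if num_nodes <= 0 or replication_factor <= 0:
        raise ValueError("num_nodes and replication_factor must be positive")
    # A node cannot hold more than one copy of a range, so cap at 100%.
    return min(replication_factor, num_nodes) / num_nodes


for nodes in (3, 4, 5):
    print(f"{nodes} nodes, RF=3 -> {effective_ownership(nodes, 3):.0%} per node")
```

For 3, 4, and 5 nodes this prints 100%, 75%, and 60%, matching the figures above. Real clusters usually show slightly uneven values because token ranges are never split perfectly.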
"the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster."
Actually, the distributed hash with Murmur3 partitioner will evenly distribute the token ranges, not the data. That's an important distinction. If you wrote all of your data to a single partition, I guarantee that you would not get even distribution of data.

Data that is streamed to new nodes when you add them isn't cleared from the original nodes automatically. You need to run nodetool cleanup on the "old" nodes after you add the new node to the cluster. This removes the data for the ranges that were moved to other nodes.

Related

How DSE Spread Data?

We are using DSE with Cassandra + Solr.
I'm not sure how it spreads the data. Let's say we have 6 nodes and a replication factor of 3.
Our platform uses all 6 nodes to query data. If I query one of the 6 nodes, is there a chance that data will be missing?
Or do I need a replication factor equal to the number of nodes if I want to use all the nodes from the platform?
How does it work?
In Cassandra, each node stores part of the data. When you build the cluster, each node becomes responsible for a specific part of the data, decided by the token value assigned to that node. Every insert or select has a partition key. A hash value is calculated from that partition key, and the request is sent to the node responsible for that token value.
If there are 6 nodes and RF=3, then the cluster holds 3 copies of the entire data set. The primary copy is placed as described above. The replicas are placed according to the replication class that you specify when you create the keyspace. With SimpleStrategy, replicas go to the next nodes clockwise on the ring, i.e. data owned by node1 is also stored on node2 and node3, data owned by node2 is also stored on node3 and node4, and so on.
If you query through one node, the query is forwarded, based on the partition key, to a node responsible for that partition key. To find out which nodes hold a given key, you can use the nodetool utility:
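The clockwise placement can be illustrated with a toy ring. This is a simplified sketch, not Cassandra's implementation: it assumes one token per node (no vnodes), uses an MD5-based stand-in instead of the Murmur3 partitioner, and the names token_for and replicas_for are invented for the example.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Toy token function (MD5 reduced to 32 bits); NOT Cassandra's Murmur3 partitioner."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 2**32


def replicas_for(key: str, ring: list[tuple[int, str]], rf: int) -> list[str]:
    """Return the nodes holding `key`: the token owner, then the next nodes clockwise.

    `ring` is a list of (token, node) pairs sorted by token, one token per node.
    """
    tokens = [token for token, _ in ring]
    # The owner is the first node whose token is >= the key's token, wrapping to index 0.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(min(rf, len(ring)))]


# Six hypothetical nodes with evenly spaced tokens.
ring = [(i * 2**32 // 6, f"node{i + 1}") for i in range(6)]

for key in ("alice", "bob", "carol"):
    print(key, "->", replicas_for(key, ring, rf=3))
```

Each key lands on three consecutive nodes of the ring, so with 6 nodes and RF=3 any single node holds only about half of the keys.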
nodetool getendpoints <keyspace> <table> <key>
This prints the IP addresses of the nodes that hold the data for that key.
If you have 6 nodes and RF 3, it means 3 copies of the data exist in your Cassandra cluster. Data availability also depends on the consistency level. If you are using consistency level ONE and 2 nodes are down, you will still get the data with no loss, but with TWO, QUORUM, THREE, or ALL the scenario will be different.
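How the consistency level changes the outcome can be sketched as a replica count. This is a simplified model that looks only at one partition's replicas in a single datacenter; required_replicas and can_satisfy are hypothetical helper names, not driver or nodetool functions.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,  # majority of the replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def can_satisfy(consistency_level: str, replication_factor: int, replicas_down: int) -> bool:
    """True if enough replicas of a partition are still up to meet the level."""
    alive = replication_factor - replicas_down
    return alive >= required_replicas(consistency_level, replication_factor)


# Worst case for RF=3: both down nodes are replicas of the partition being read.
for level in ("ONE", "TWO", "QUORUM", "THREE", "ALL"):
    print(level, can_satisfy(level, replication_factor=3, replicas_down=2))
```

In that worst case only ONE can still be served. If the two down nodes are not replicas of the requested partition, every level succeeds.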

Processing records in database with application running in different nodes

I need to process 100K records from DB, process them and update the status of the record in DB. If application is running on multiple nodes, how to make sure that same record is not picked by multiple nodes for processing?
This process is triggered by a quartz scheduler that runs every hour and we do not have the flexibility to configure the scheduler on each node to run at different times.
What is the best way to achieve this?
There are various approaches and the following two come to my mind at the moment.
(1) Use DB row locks
Make your nodes place an exclusive DB row lock (select for update) on the record the node is about to process. If the node can place the lock, then the process is free to process the record. Otherwise, the node should try another record because some other node has locked the record and is currently processing the record. You should randomize the selection of unprocessed records to somewhat minimize the chances of multiple nodes competing to process identical records.
(2) Assign your nodes some IDs and make your nodes use these IDs to select disjoint records from the data set to be processed. For example, if you have 10 nodes, assign them IDs 0 to 9. Then split your unprocessed records into disjoint sets based on the record ID by applying some function that produces numbers 0 to 9. For example, you can use the MOD function. Each node will then select only those unprocessed records whose record ID, after applying that function (e.g. record ID MOD 10), is equal to the node ID. This is very simple to implement in SQL as long as you can assign some unique and consecutive IDs to your nodes.
If I were you, I would probably pick the second solution.
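The second approach can be sketched as follows. It is a minimal in-memory illustration that assumes integer record IDs and node IDs numbered 0 to node_count - 1; the function name records_for_node is invented for the example, and in practice the same filter would live in the SQL WHERE clause.

```python
def records_for_node(record_ids: list[int], node_id: int, node_count: int) -> list[int]:
    """Select the records this node is responsible for using the MOD function."""
    if not 0 <= node_id < node_count:
        raise ValueError("node_id must be in the range 0..node_count-1")
    return [rid for rid in record_ids if rid % node_count == node_id]


# Hypothetical unprocessed record IDs split across 3 nodes.
unprocessed = list(range(100, 112))
for node_id in range(3):
    print(f"node {node_id}:", records_for_node(unprocessed, node_id, 3))
```

Every record belongs to exactly one node, so no locking is needed. The trade-off is that if a node is down, its share of the records waits until the node returns or the node count is reconfigured.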

What to do when nodes in a Cassandra cluster reach their limit?

I am studying Cassandra and am in the process of setting up a cluster for a project that I'm working on. Consider this example:
Say I set up a 5-node cluster with 200 GB of space on each node. That adds up to 1000 GB (roughly 1 TB) of space overall. Assuming that my partitions are split equally across the cluster, I can easily add nodes and achieve linear scalability. However, what if these 5 nodes start approaching the SSD limit of 200 GB? In that case, I can add 5 more nodes, and the partitions would then be split across 10 nodes. But the older nodes would still be writing data, as they are part of the cluster. Is there a way to make these 5 older nodes 'read-only'? I want to fire random read queries across the entire cluster, but I don't want to write to the older nodes anymore (as they are capped by the 200 GB limit).
Help would be greatly appreciated. Thank you.
Note: I can say that 99% of the queries will be write queries, with 1% or less for reads. The app has to persist click events in Cassandra.
Usually, when a cluster reaches its limit, we add a new node to the cluster. After the new node is added, the old Cassandra nodes stream part of their data to it. After that, we run nodetool cleanup on every old node to remove the data that was handed over to the new node. The entire scenario happens within a single DC.
For example:
Suppose you have 3 nodes (A, B, C) in DC1 and 1 node (D) in DC2. Your nodes are reaching their limit, so you decide to add a new node (E) to DC1. Nodes A, B, and C will stream part of their data to node E, and you then run nodetool cleanup on A, B, and C to free up the space.
I have some trouble understanding the question properly.
I am assuming you know that when you add 5 new nodes, some of the data load is transferred to them, because some token ranges are assigned to the new nodes.
Given that, if you are concerned that the old 5 nodes will not be able to accept writes because they have reached their limit, that is not going to happen: the new nodes have taken over part of the data load, so the old nodes have free space again for further writes.
Isolating reads and writes to different nodes is a totally different problem. But if you want to send reads only to these 5 nodes and writes only to the new 5 nodes, then the best way to do this is to add the new 5 nodes as another datacenter in the same cluster, and then use different consistency levels for reads and writes so that the old datacenter is effectively read-only.
But the new datacenter will not lighten the data load of the first one; it will take on the same load itself. (So you would need more than 5 nodes to solve both problems at once: a few nodes to lighten the load, and others to isolate reads from writes by creating a new datacenter with them. The new datacenter should also have more than 5 nodes.) Best practice is to monitor the data load and fix it before such a problem happens, by adding new nodes or increasing the storage limit.
Once that is done, you will also need to ensure that the nodes you use for reads and for writes are in different datacenters.
Consider you have following situation :
dc1(n1, n2, n3, n4, n5)
dc2(n6, n7, n8, n9, n10)
Now, for reads you connect through node n1 and for writes you connect through node n6.
The read/write isolation can then be done by choosing the right consistency level from the options below:
LOCAL_QUORUM
or
LOCAL_ONE
These basically would confine the search for the replicas to local datacenter only.
Look at these references for more: "Adding a datacenter to a cluster" and "Consistency Levels".

Orientdb in distributed mode: how to manage cluster Id's in distributed mode

According to the documentation here http://orientdb.com/docs/2.0/orientdb.wiki/Distributed-Architecture.html under the heading Creation of records (documents, vertices and edges) it states that
In distributed mode the RID is assigned with cluster locality. If you have class Customer and 3 nodes (node1, node2, node3), you'll have these clusters:
customer with id=#15 (this is the default one, assigned to node1)
customer_node2 with id=#16
customer_node3 with id=#17
This would imply that you cannot rely on the clusterID in your code. For example, selecting a record like this
select from #15:1
would break in a distributed setup, because once node1 fails and node2 takes over, you'll have to select
select from #16:1
My question is, is it the responsibility of the programmer to handle which clusterID to use, or is this automatically handled by OrientDB, in which case
select from #15:1
will always work, no matter which node is up or down?

question about mnesia distribution

I have two nodes running mnesia. I created schema and some tables on node 1, and used mnesia:add_table_copy on node 2 to copy the tables from node 1 to node 2.
Everything works well until I call q() on node 1 and then q() on node 2. I found that when I start node 1 again, mnesia:wait_for_tables([sometable], infinity) won't return. It will only return when I start node 2 again.
Is there a way to fix this? This is a problem because I won't be able to start node 1 again if node 2 is down.
In this discussion a situation similar to the one you're facing is presented.
Reading from that source:
"At startup Mnesia tries to connect with the other nodes and if that succeeds it loads its tables from them. If the other nodes are down, it looks for mnesia_down marks in its local transaction log in order to determine if it has a consistent replica or not of its tables. The node that was shutdown last has mnesia_down's from all the other nodes. This means that it safely can load its tables. If some of the other nodes were started first (as in your case) Mnesia will wait indefinitely for another node to connect in order to load its tables"
You're shutting down node 1 first, so it doesn't have the mnesia_down from the other node. What happens if you reverse the shutting down order?
Also, it should be possible to force the table loading via the force_load_table/1 function:
force_load_table(Tab) -> yes | ErrorDescription
"The Mnesia algorithm for table load might lead to a situation where a table cannot be loaded. This situation occurs when a node is started and Mnesia concludes, or suspects, that another copy of the table was active after this local copy became inactive due to a system crash.
If this situation is not acceptable, this function can be used to override the strategy of the Mnesia table load algorithm. This could lead to a situation where some transaction effects are lost with an inconsistent database as result, but for some applications high availability is more important than consistent data."
