We have a Cassandra cluster running with 3 nodes and a replication factor of 2. Maybe we should have selected 3 from the start, but that is not the case.
Our quorum is therefore floor(2/2) + 1 = 2.
Let's say we lose one node, so now only two Cassandra nodes are online.
We can still read from the cluster if we set our consistency level to "ONE", so this is not a problem.
The thing I do not understand is the following.
We still have two nodes running, so why is it not possible to do a serial (lightweight transaction) insert into our keyspace? We have two nodes up, so shouldn't it be possible to get a quorum of 2 when trying to insert?
Is it because one of the row's replicas is placed on the missing node?
When you insert data, it is placed according to its token value (computed by the configured partitioner) and replicated around the ring.
For example, suppose you insert data X into a keyspace with a replication factor of 2 in a 3-node cluster: Node1 (owning token A), Node2 (owning token B) and Node3 (owning token C). If X maps to token B, Cassandra writes it to Node2 and Node3 (continuing around the ring until all replicas are placed). If X maps to token C, Cassandra writes it to Node3 and Node1.
So a consistency level that requires 2 replicas means the write must be acknowledged by 2 nodes.
In your case, Node1 (token A) and Node2 (token B) are up and Node3 (token C) is down. If the data maps to token B, Cassandra tries to write to Node2 and Node3, and you get a consistency error because it cannot write to Node3.
So to insert, you must either increase the replication factor to 3 or decrease the consistency level to ONE.
For more on consistency, see the docs: https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_config_consistency_c.html
Lightweight transactions require a QUORUM consistency level, which cannot be reached in case the unavailable node is a replica of the affected key. What's relevant here is the number of available replicas, not the number of nodes in the cluster.
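To make the "available replicas, not available nodes" point concrete, here is a minimal sketch of the 3-node, RF=2 scenario from the question. It assumes one token per node and a simplified clockwise placement (primary owner plus the next node on the ring); the function names are made up for illustration and are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Quorum size: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas(primary_index, nodes, rf):
    """Primary owner plus the next rf - 1 nodes clockwise (simplified placement)."""
    return [nodes[(primary_index + i) % len(nodes)] for i in range(rf)]


nodes = ["Node1", "Node2", "Node3"]  # owning tokens A, B and C
rf = 2
down = {"Node3"}                     # the node we lost

results = {}
for index, token in enumerate("ABC"):
    owners = replicas(index, nodes, rf)
    alive = [node for node in owners if node not in down]
    # A quorum is counted over the replicas of this token, not over the cluster.
    results[token] = len(alive) >= quorum(rf)
    print(token, owners, "quorum reachable:", results[token])
```

In this simplified model only the keys that map to token A can still reach a quorum of 2, even though two of the three nodes are up.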
We are using DSE with Cassandra + Solr.
I'm not sure how it spreads the data. Let's say we have 6 nodes and a replication factor of 3.
Our platform uses all 6 nodes to query data. If I query one of the 6 nodes, is there a chance that data will be missing?
Or do I need a replication factor equal to the number of nodes if I want to query all the nodes from the platform?
How does it work?
In Cassandra each node stores part of the data. When you build the cluster, each node becomes responsible for a specific part of the data, decided by the token value assigned to that node. Every insert or select carries a partition key. A hash value is calculated from that partition key, and the request is sent to the node responsible for that token value.
If there are 6 nodes and RF = 3, the cluster holds 3 copies of the entire data set. The primary copy is placed as described above. The replicas are placed according to the replication class that you specify when you create the keyspace. With SimpleStrategy, replicas go to the next nodes clockwise, i.e. the replicas of node1's data are stored on node2 and node3, the replicas of node2's data on node3 and node4, and so on.
If you query one node, the query is forwarded, based on the partition key, to a node responsible for that partition key. To find out which nodes those are, you can use the nodetool utility:
nodetool getendpoints <keyspace> <table> <key>
This gives you the IP address(es) of the node(s) that the query will be sent to in order to get the result.
If you have 6 nodes and RF 3, 3 copies of the data exist in your Cassandra cluster. Data availability also depends on the consistency level. With consistency ONE you can still read the data even if 2 nodes are down, but with TWO, QUORUM, THREE or ALL the outcome will be different.
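As a rough illustration of how the consistency level interacts with lost replicas, the sketch below counts how many replicas of a single key must respond at each level. It assumes the worst case where the nodes that are down all hold replicas of the key being read; the helper names are hypothetical.

```python
# Number of replica responses each consistency level needs, as a function of RF.
REQUIRED = {
    "ONE": lambda rf: 1,
    "TWO": lambda rf: 2,
    "THREE": lambda rf: 3,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def satisfiable(level, rf, replicas_down):
    """True if enough replicas of the key are still alive for this level."""
    alive = rf - replicas_down
    return alive >= REQUIRED[level](rf)


rf = 3
for replicas_down in (1, 2):
    ok = [level for level in REQUIRED if satisfiable(level, rf, replicas_down)]
    print(replicas_down, "replica(s) down ->", ok)
```

With RF=3 and two replicas of a key down, only ONE can be satisfied; with a single replica down, ONE, TWO and QUORUM still work.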
I have a 3 node Cassandra cluster with RF=3. When I run nodetool status, the "Owns" value for each node in the cluster is 100%.
But when I have 5 nodes in the cluster with RF=3, the "Owns" value is 60% (approximately, as shown in the image below).
Now as per my understanding the partitioner will calculate the hash corresponding to first replica node and the data will also be replicated as per the RF on the other nodes.
Now we have a 5 node cluster and RF is 3.
Shouldn't 3 nodes own all the data evenly (100%), since the partitioner points to one node as per the partitioning strategy and the same data is then replicated to the remaining RF-1 nodes? It looks as if the data is evenly distributed among all 5 nodes even though RF is 3.
Edit1:
As per my understanding the reason for 60%(approx) owns for each node is because the RF is 3. It means there will be 3 replicas for each row. It means there will be 300% data. Now there are 5 nodes in the cluster and partitioner will be using the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
But now the issue is that we checked all the nodes of our cluster and all the nodes contain all the data even though the RF is 3.
Edit2:
#Aaron I did as specified in the comment. I created a new cluster with 3 nodes.
I created a keyspace "test" and set the class to SimpleStrategy and the RF to 2.
Then I created a table "emp" having partition key (id,name).
Now I inserted a single row into the first node.
As per your explanation, it should only be on 2 nodes as RF=2.
But when I logged into all 3 nodes, I could see the row replicated on all of them.
I think since the keyspace is getting replicated in all the nodes therefore, the data is also getting replicated.
Percent ownership is not affected (at all) by actual data being present. You could add a new node to a single node cluster (RF=1) and it would instantly say 50% on each.
Percent ownership is purely about the percentage of token ranges which a node is responsible for. When a node is added, the token ranges are recalculated, but data doesn't actually move until a streaming event happens. Likewise, data isn't actually removed from its original node until cleanup.
For example, if you have a 3 node cluster with a RF of 3, each node will be at 100%. Add one node (with RF=3), and percent ownership drops to about 75%. Add a 5th node (again, keep RF=3) and ownership for each node correctly drops to about 3/5, or 60%. Again, with a RF of 3 it's all about each node being responsible for a set of primary, secondary, and tertiary token ranges.
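The arithmetic behind those percentages can be sketched as follows. This assumes an idealized ring where every node owns one equally sized primary range and replicas go to the next nodes clockwise; real clusters with vnodes will only approximate these numbers. The function name is made up for illustration.

```python
from fractions import Fraction


def effective_ownership(node_count, rf):
    """Fraction of all token ranges each node is responsible for."""
    held = [0] * node_count
    copies = min(rf, node_count)  # cannot place more copies than there are nodes
    for primary in range(node_count):
        for step in range(copies):
            # The primary range plus the replica ranges of the preceding nodes.
            held[(primary + step) % node_count] += 1
    return [Fraction(count, node_count) for count in held]


for node_count, rf in [(1, 1), (2, 1), (3, 3), (4, 3), (5, 3)]:
    share = effective_ownership(node_count, rf)[0]
    print(f"{node_count} nodes, RF={rf}: {float(share) * 100:.0f}% per node")
```

Each node ends up responsible for RF out of N ranges, which is where 3/5, or 60%, comes from. No data is involved in the calculation at all.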
"the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster."
Actually, the distributed hash with Murmur3 partitioner will evenly distribute the token ranges, not the data. That's an important distinction. If you wrote all of your data to a single partition, I guarantee that you would not get even distribution of data.
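A small sketch of that distinction follows. It uses MD5 from the Python standard library as a stand-in hash (not Murmur3) and a simple modulo in place of real token ranges, so it only illustrates the principle; the node names and keys are made up.

```python
import hashlib
from collections import Counter

NODES = ["node1", "node2", "node3", "node4", "node5"]


def primary_node(partition_key):
    """Map a partition key to a node via a stand-in hash (MD5, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]  # modulo stands in for token-range lookup


# 10,000 rows spread over 10,000 different partition keys.
many_partitions = Counter(primary_node(f"user-{i}") for i in range(10_000))

# 10,000 rows that all share one partition key.
one_partition = Counter(primary_node("hot-key") for _ in range(10_000))

print("many partitions:", dict(many_partitions))
print("one partition:  ", dict(one_partition))
```

With many distinct keys the rows spread over all nodes, while the single-partition case puts every row on the same node (and, in a real cluster, on that partition's replicas).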
Data that moves to other nodes when you add them isn't cleared from the original nodes automatically. You need to run nodetool cleanup on the "old" nodes after you add the new node to the cluster. This removes the data for ranges that were moved to other nodes.
I am developing a distributed database system. Suppose each table has 3 copies on different machines. When a write request comes in, the replication layer's job is to replicate that write to the other two nodes. The question is how to make the replication efficient in terms of throughput. The network bandwidth is 10Gbps.
Example:
cmd1: write on table 1
-> send to node A and node B
cmd2: write on table 2
-> send to node B and node C
cmd3: write on table 1
-> send to node A and node B
cmd4: write on table 3
-> send to node B and node D
cmd5: write on table 2
-> send to node B and node C
The above example shows the replication command queue. To achieve high performance, what about using multi-threaded network replication? And if there are two writes to the same node, would it be better to combine the two and send them together?
Also, suppose those writes will eventually be stored on disk (think of spinning disks). Is there anything that can be done during replication to speed up the storing process? (Each table is mapped to a specific disk on a node, and writes are append-only.)
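The "combine writes to the same node" idea can be sketched like this, using the five commands from the example. This is only an illustration of the grouping step, not a tested replication design; the tuple layout and function names are assumptions.

```python
from collections import defaultdict

# (command id, table, destination nodes), taken from the example queue above.
queue = [
    ("cmd1", "table1", ["A", "B"]),
    ("cmd2", "table2", ["B", "C"]),
    ("cmd3", "table1", ["A", "B"]),
    ("cmd4", "table3", ["B", "D"]),
    ("cmd5", "table2", ["B", "C"]),
]


def batch_by_node(commands):
    """Group queued writes by destination so each node gets one combined send."""
    batches = defaultdict(list)
    for cmd_id, table, targets in commands:
        for node in targets:
            batches[node].append((cmd_id, table))
    return dict(batches)


def group_by_table(batch):
    """Stable sort by table: per-table order is kept, same-table writes are adjacent."""
    return sorted(batch, key=lambda entry: entry[1])


batches = batch_by_node(queue)
for node, batch in sorted(batches.items()):
    print(node, group_by_table(batch))
```

Grouping a node's batch by table keeps writes for the same table next to each other, which may help if each table is an append-only file on its own disk. Whether this beats sending each write immediately depends on latency requirements.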
According to the documentation here http://orientdb.com/docs/2.0/orientdb.wiki/Distributed-Architecture.html under the heading Creation of records (documents, vertices and edges) it states that
In distributed mode the RID is assigned with cluster locality. If you have class Customer and 3 nodes (node1, node2, node3), you'll have these clusters:
customer with id=#15 (this is the default one, assigned to node1)
customer_node2 with id=#16
customer_node3 with id=#17
This would imply that you cannot rely on the clusterID in your code. For example, selecting a record like this
select from #15:1
would break in a distributed setup, because once node1 fails and node2 takes over, you'll have to select
select from #16:1
My question is, is it the responsibility of the programmer to handle which clusterID to use, or is this automatically handled by OrientDB, in which case
select from #15:1
will always work, no matter which node is up or down?
I have two nodes running mnesia. I created schema and some tables on node 1, and used mnesia:add_table_copy on node 2 to copy the tables from node 1 to node 2.
Everything works well until I call q() on node 1 and then q() on node 2. I found that when I start node 1 again, mnesia:wait_for_tables([sometable], infinity) won't return. It will only return when I start node 2 again.
Is there a way to fix this? This is a problem because I won't be able to start node 1 again if node 2 is down.
In this discussion a situation similar to the one you're facing is presented.
Reading from that source:
At startup Mnesia tries to connect with the other nodes and if that succeeds it loads its tables from them. If the other nodes are down, it looks for mnesia_down marks in its local transaction log in order to determine if it has a consistent replica or not of its tables. The node that was shutdown last has mnesia_down's from all the other nodes. This means that it safely can load its tables. If some of the other nodes were started first (as in your case) Mnesia will wait indefinitely for another node to connect in order to load its tables.
You're shutting down node 1 first, so it doesn't have the mnesia_down from the other node. What happens if you reverse the shutting down order?
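The rule from that quote can be modelled with a toy simulation. This is a simplified sketch written in Python, not Mnesia code: it only tracks which node has seen which peer go down, and the function names are invented for illustration.

```python
def simulate_shutdown(nodes, shutdown_order):
    """Return, per node, the set of peers it recorded as down while still running."""
    running = set(nodes)
    down_marks = {node: set() for node in nodes}
    for stopping in shutdown_order:
        running.discard(stopping)
        for survivor in running:
            # Only nodes that are still up can note that a peer went down.
            down_marks[survivor].add(stopping)
    return down_marks


def can_load_alone(node, nodes, down_marks):
    """A node may load its tables alone if it saw every other node go down."""
    others = set(nodes) - {node}
    return down_marks[node] == others


nodes = ["node1", "node2"]
marks = simulate_shutdown(nodes, ["node1", "node2"])  # node 1 stops first
for node in nodes:
    print(node, "can start alone:", can_load_alone(node, nodes, marks))
```

In this model node 1, which stopped first, has no down mark for node 2 and therefore waits, while node 2 could start on its own. Reversing the shutdown order reverses the result.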
Also, it should be possible to force the table loading via the force_load_table/1 function:
force_load_table(Tab) -> yes | ErrorDescription
The Mnesia algorithm for table load might lead to a situation where a table cannot be loaded. This situation occurs when a node is started and Mnesia concludes, or suspects, that another copy of the table was active after this local copy became inactive due to a system crash.
If this situation is not acceptable, this function can be used to override the strategy of the Mnesia table load algorithm. This could lead to a situation where some transaction effects are lost with an inconsistent database as result, but for some applications high availability is more important than consistent data.