I have two nodes running mnesia. I created schema and some tables on node 1, and used mnesia:add_table_copy on node 2 to copy the tables from node 1 to node 2.
Everything works well until I call q() on node 1 and then q() on node 2. I found that when I start node 1 again, mnesia:wait_for_tables([sometable], infinity) won't return. It will only return when I start node 2 again.
Is there a way to fix this? This is a problem because I won't be able to start node 1 again if node 2 is down.
In this discussion a situation similar to the one you're facing is presented.
Reading from that source:
At startup Mnesia tries to connect
with the other nodes and if that
suceeds it loads its tables from
them. If the other nodes are down, it
looks for mnesia_down marks in its
local transaction log in order to
determine if it has a consistent
replica or not of its tables. The node
that was shutdown last has
mnesia_down's from all the other
nodes. This means that it safely can
load its tables. If some of the other
nodes where started first (as in your
case) Mnesia will wait indefinitely
for another node to connect in order
to load its tables
You're shutting down node 1 first, so it doesn't have the mnesia_down from the other node. What happens if you reverse the shutting down order?
Also, it should be possible to force the table loading via the force_load_table/1 function:
force_load_table(Tab) -> yes | ErrorDescription
The Mnesia algorithm for table load
might lead to a situation where a
table cannot be loaded. This situation
occurs when a node is started and
Mnesia concludes, or suspects, that
another copy of the table was active
after this local copy became inactive
due to a system crash.
If this situation is not acceptable,
this function can be used to override
the strategy of the Mnesia table load
algorithm. This could lead to a
situation where some transaction
effects are lost with a inconsistent
database as result, but for some
applications high availability is more
important than consistent data.
Related
I have been trying to run the TimeScaleDB in cluster mode. Totally I have four nodes(service) one is for the access node, and another three are for the data node. The access node stores the chunks in the data node.
My problem is when any one of the data nodes is down, I could not perform any query in access node, Because the access node trying to reach the failed data node. and after a few minutes, I got errors like request connection timeout.
In my scenario, If any one of the data nodes is failed, then the access node inserts the data or gets the data from the other data node. Is this possible? If anyone knows the answer, let me know.
I have followed the following document for the multi-node setup
https://docs.timescale.com/mst/latest/mst-multi-node/
https://docs.timescale.com/timescaledb/latest/how-to-guides/multinode-timescaledb/multinode-config/
https://docs.timescale.com/timescaledb/latest/how-to-guides/multinode-timescaledb/multinode-ha/#node-failures
Based on the design of timescaledb you cannot just store data somewhere else, since the chunk (which belong to a specific time range (and possibly additional dimension) is fixed. Storing somewhere else means that you'd break the time-chunk-relation assumption which would make the data not unavailable afterwards.
There are people who tried (not sure they succeeded) to use something like pgbouncer or pgpool between the access node and the data nodes to achieve data node level hot standby servers which in turn would be their own mini clusters and failover to the hot standby when the primary data node fails.
That said, it'd look like something along those lines:
(access node) -> (pg pooling) -> (dn1 primary | dn1 secondary) + (dn2 primary | dn2 secondary) + (dn3 primary | dn3 secondary)
I have a 3 node Cassandra cluster with RF=3. Now when I do nodetool status I get the owns for each node in the cluster as 100%.
But when I have 5 nodes in the cluster wit RF=3. The owns is 60%(approx as shown in image below).
Now as per my understanding the partitioner will calculate the hash corresponding to first replica node and the data will also be replicated as per the RF on the other nodes.
Now we have a 5 node cluster and RF is 3.
Shouldn't 3 nodes be owning all the data evenly(100%) as partitioner will point to one node as per the partitoning strategy and then same data be replicated to remaining nodes which equals RF-1? It's like the data is getting evenly distributed among all the nodes(5) even though RF is 3.
Edit1:
As per my understanding the reason for 60%(approx) owns for each node is because the RF is 3. It means there will be 3 replicas for each row. It means there will be 300% data. Now there are 5 nodes in the cluster and partitioner will be using the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
But now the issue is that we checked all the nodes of our cluster and all the nodes contain all the data even though the RF is 3.
Edit2:
#Aaron I did as specified in the comment. I created a new cluster with 3 nodes.
I created a Keyspace "test" and set the class to simplestrategy and RF to 2.
Then I created a table "emp" having partition key (id,name).
Now I inserted a single row into the first node.
As per your explanation, It should only be in 2 nodes as RF=2.
But when I logged into all the 3 nodes, i could see the row replicated in all the nodes.
I think since the keyspace is getting replicated in all the nodes therefore, the data is also getting replicated.
Percent ownership is not affected (at all) by actual data being present. You could add a new node to a single node cluster (RF=1) and it would instantly say 50% on each.
Percent ownership is purely about the percentage of token ranges which a node is responsible for. When a node is added, the token ranges are recalculated, but data doesn't actually move until a streaming event happens. Likewise, data isn't actually removed from its original node until cleanup.
For example, if you have a 3 node cluster with a RF of 3, each node will be at 100%. Add one node (with RF=3), and percent ownership drops to about 75%. Add a 5th node (again, keep RF=3) and ownership for each node correctly drops to about 3/5, or 60%. Again, with a RF of 3 it's all about each node being responsible for a set of primary, secondary, and tertiary token ranges.
the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
Actually, the distributed hash with Murmur3 partitioner will evenly distribute the token ranges, not the data. That's an important distinction. If you wrote all of your data to a single partition, I guarantee that you would not get even distribution of data.
The data replicated to another nodes when you add them isn't cleared up automatically - you need to call nodetool cleanup on the "old" nodes after you add the new node into cluster. This will remove the ranges that were moved to other nodes.
I am studying up Cassandra and in the process of setting up a cluster for a project that I'm working on. Consider this example :
Say I setup a 5 node cluster with 200 gb space for each. That equals up to 1000 gb ( round about 1 TB) of space overall. Assuming that my partitions are equally split across the cluster, I can easily add nodes and achieve linear scalability. However, what if these 5 nodes start approaching the SSD limit of 200 gb? In that case, I can add 5 more nodes and now the partitions would be split across 10 nodes. But the older nodes would still be writing data, as they are part of the cluster. Is there a way to make these 5 older nodes 'read-only'? I want to shoot off random read-queries across the entire cluster, but don't want to write to the older nodes anymore( as they are capped by a 200 gb limit).
Help would be greatly appreciated. Thank you.
Note: I can say that 99% of the queries will be write queries, with 1% or less for reads. The app has to persist click events in Cassandra.
Usually when cluster reach its limit we add new node to cluster. After adding a new node, old cassandra cluster nodes will distribute their data to the new node. And after that we use nodetool cleanup in every node to cleanup the data that distributed to the new node. The entire scenario happens in a single DC.
For example:
Suppose, you have 3 node (A,B,C) in DC1 and 1 node (D) in DC2. Your nodes are reaching their limit. So, decided to add a new node (E) to DC1. Node A, B, C will distribute their data to node E and we'll use nodetool cleanup in A,B,C to cleanup the space.
Problem in understanding the question properly.
I am assuming you know that by adding new 5 nodes, some of the data load would be transferred to new nodes as some token ranges will be assigned to them.
Now, as you know this, if you are concerned that old 5 nodes would not be able to write due to their limit reached, its not going to happen as new nodes have shared the data load and hence these have free space now for further write.
Isolating the read and write to nodes is totally a different problem. But if you want to isolate read to these 5 nodes only and write to new 5 nodes, then the best way to do this is to add new 5 nodes in another datacenter under the same cluster and then use different consistency levels for read and write to satisfy your need to make old datacenter read only.
But the new datacenter will not lighten the data load from first. It will even take the same load to itself. (So you would need more than 5 nodes to accomplish both problems simultaneously. Few nodes to lighten the weight and others to isolate the read-write by creating new datacenter with them. Also the new datacenter should have more then 5 nodes). Best practice is to monitor data load and fixing it before such problem happen, by adding new nodes or increasing data limit.
Considering done that, you will also need to ensure that the nodes you provided for read and write should be from different datacenters.
Consider you have following situation :
dc1(n1, n2, n3, n4, n5)
dc2(n6, n7, n8, n9, n10)
Now, for read you provided with node n1 and for write you provided with node n6
Now the read/write isolation can be done by choosing the right Consistency Levels from bellow options :
LOCAL_QUORUM
or
LOCAL_ONE
These basically would confine the search for the replicas to local datacenter only.
Look at these references for more :
Adding a datacenter to a cluster
and
Consistency Levels
I mean when we have a multi-master cluster with 3 master nodes then after we added a new master node (fourth node) we have to be somehow notified when it will be up to date or may be ask it every 5 seconds to find out if it up to date.
Is it possible to realize?
This feature currently is not present, and looking at the roadmap, i don't think there will be.
Consider the fact that if you get to have a situation with 5 nodes (therefore the quorum is 3), so the sync takes place only between two nodes, while the other 3 have a quorum to write data. So you can always keep doing insert or other and you do not need a mechanism to tell you that for example the node 5 is ready or not. This of course if you can get a situational so.
According to the documentation here http://orientdb.com/docs/2.0/orientdb.wiki/Distributed-Architecture.html under the heading Creation of records (documents, vertices and edges) it states that
In distributed mode the RID is assigned with cluster locality. If you have class Customer and 3 nodes (node1, node2, node3), you'll have these clusters:
customer with id=#15 (this is the default one, assigned to node1)
customer_node2 with id=#16
customer_node3 with id=#17
This would imply that you cannot rely on the clusterID in your code. For example, selecting a record like this
select from #15:1
would break in a distributed setup, because once node1 fails and node2 takes over, you'll have to select
select from #16:1
My question is, is it the responsibility of the programmer to handle which clusterID to use, or is this automatically handled by OrientDB, in which case
select from #15:1
will always work, not matter which node is up or down?