Multi-node in TimescaleDB

I have been trying to run TimescaleDB in multi-node (cluster) mode. In total I have four nodes (services): one is the access node and the other three are data nodes. The access node stores the chunks on the data nodes.
My problem is that when any one of the data nodes is down, I cannot perform any query on the access node, because the access node keeps trying to reach the failed data node, and after a few minutes I get errors like a connection request timeout.
In my scenario, if one of the data nodes fails, I want the access node to insert into or read from the other data nodes instead. Is this possible? If anyone knows the answer, please let me know.
I followed these documents for the multi-node setup:
https://docs.timescale.com/mst/latest/mst-multi-node/
https://docs.timescale.com/timescaledb/latest/how-to-guides/multinode-timescaledb/multinode-config/
https://docs.timescale.com/timescaledb/latest/how-to-guides/multinode-timescaledb/multinode-ha/#node-failures

Based on the design of TimescaleDB, you cannot just store the data somewhere else: each chunk belongs to a specific time range (and possibly additional dimensions) and its placement is fixed. Storing it somewhere else would break the time-to-chunk mapping and make the data unavailable afterwards.
Some people have tried (I'm not sure they succeeded) to put something like pgbouncer or pgpool between the access node and the data nodes to get data-node-level hot standby servers. Each data node then becomes its own mini cluster that fails over to the hot standby when the primary data node fails.
That said, it would look something like this:
(access node) -> (pg pooling) -> (dn1 primary | dn1 secondary) + (dn2 primary | dn2 secondary) + (dn3 primary | dn3 secondary)
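For illustration only, a rough sketch of what attaching the data nodes through such pool endpoints could look like (the host names, database name and hypertable are placeholders, and whether pgbouncer/pgpool handles the access-node-to-data-node protocol cleanly is exactly the part people were unsure about):

SELECT add_data_node('dn1', host => 'dn1-pool.example.com', database => 'tsdb');
SELECT add_data_node('dn2', host => 'dn2-pool.example.com', database => 'tsdb');
SELECT add_data_node('dn3', host => 'dn3-pool.example.com', database => 'tsdb');
-- assuming a table conditions(time, location, ...) already exists,
-- its chunks are then spread across dn1..dn3 as usual
SELECT create_distributed_hypertable('conditions', 'time', 'location');

Each pool endpoint would front a primary/standby pair, so a failed primary is replaced by its standby without the access node having to know about it.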

Related

Cassandra: Shipping Disk to New DC in order to sync 50TB of data

We're adding a new datacenter to our Cassandra cluster. Currently, we have a 15-node DC with RF=3, resulting in roughly 50 TB of data.
We are adding another datacenter in a different country and we want both data centers to contain all the data. Obviously, synchronizing 50TB of data across the internet will take a gargantuan amount of time.
Is it possible to copy a full backup to a few disks, ship those to the new DC, and then restore? I'm just wondering what the procedure for doing so would be.
Could someone give me a few pointers on this operation, if possible at all?
Or any other tips?
Our new DC is going to be smaller (6 nodes) for the time being, although enough space will be available. The new DC is mostly meant as a live-backup/failover and will not be the primary cluster for writing, generally speaking.
TL;DR: Due to the topology (node count) change between the two DCs, avoiding streaming the data in isn't possible AFAIK.
Our new DC is going to be smaller (6 nodes) for the time being
The typical process isn't going to work because the token alignment on the nodes will be different (the new cluster's ring will change). So just copying the existing SSTables won't work: the nodes that hold those files might not own the tokens corresponding to the data in them, and so C* won't be able to find that data.
Bulk loading the data into the new DC is out too, as you'd be overwriting the old data when you re-insert it.
To give you an overview of the process if you were to retain the topology (a rough command-level sketch follows the list):
1. Snapshot the data on the original DC.
2. Configure the new DC. It's extremely important that you set initial_token for each machine. You can get the list of tokens you need by running nodetool ring on the original cluster. This is why you need the same number of nodes. Just as importantly, when copying the SSTable files over, the files and the tokens need to come from the same node.
3. Ship the data to the new DC (remember: if the new node 10.0.0.1 got its tokens from 192.168.0.100 in the old DC, then it also has to get its snapshot data from 192.168.0.100).
4. Start the new DC and ensure both DCs see each other OK.
5. Rebuild and repair system_distributed and system_auth (assuming you have authentication enabled).
6. Update client consistency to whatever you need. (Do you want to write to both DCs? From your description it sounds like no, so you should be all good.)
7. Update the schema: ensure that you're using NetworkTopologyStrategy for any keyspace that you want shared, then add replication for the new DC.
ALTER KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'oldDC' : 3, 'newDC':3 };
8. Run a full repair on each node in the new DC.
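For reference, a hypothetical command-level outline of the steps above (the keyspace ks and the DC names oldDC/newDC are placeholders matching the ALTER KEYSPACE example; run each command on the nodes indicated):

# step 1, on every original-DC node: snapshot the keyspace
nodetool snapshot -t dc-migration ks

# step 2, on the original cluster: list the tokens; each original node's tokens
# become initial_token in cassandra.yaml on the new-DC node receiving its files
nodetool ring

# step 3: copy each original node's snapshot SSTables into the data directory
# of the new-DC node that was given that node's tokens, then start the new DC

# step 5, on each new-DC node: rebuild and repair the system keyspaces
nodetool rebuild -- oldDC
nodetool repair system_distributed
nodetool repair system_auth

# step 8, on each new-DC node, after the ALTER KEYSPACE of step 7:
nodetool repair -full ks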

Data Partitioning and Replication on Cassandra cluster

I have a 3-node Cassandra cluster with RF=3. When I run nodetool status, the ownership ("Owns") for each node in the cluster shows as 100%.
But when I have 5 nodes in the cluster with RF=3, the ownership is approximately 60% for each node.
Now, as per my understanding, the partitioner calculates the hash to pick the first replica node, and the data is then also replicated to other nodes according to the RF.
Now we have a 5-node cluster and the RF is 3.
Shouldn't 3 nodes own all the data evenly (100% each), since the partitioner points to one node as per the partitioning strategy and the same data is then replicated to the remaining RF-1 nodes? Instead, it looks like the data is being evenly distributed among all 5 nodes even though the RF is 3.
Edit1:
As per my understanding, the reason for the ~60% ownership on each node is that the RF is 3. That means there are 3 replicas of each row, i.e. 300% of the data in total. Now there are 5 nodes in the cluster and the partitioner will be using the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
But now the issue is that we checked all the nodes of our cluster, and every node contains all the data even though the RF is 3.
Edit2:
@Aaron I did as specified in the comment. I created a new cluster with 3 nodes.
I created a keyspace "test" and set the replication class to SimpleStrategy and the RF to 2.
Then I created a table "emp" with partition key (id, name).
Now I inserted a single row on the first node.
As per your explanation, it should only be on 2 nodes since RF=2.
But when I logged into all 3 nodes, I could see the row on every node.
I think since the keyspace is replicated to all the nodes, the data is also getting replicated.
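For reference, a rough CQL version of that test (the column types are guesses, since the question only names the keyspace, the table and the partition key):

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
CREATE TABLE test.emp (
    id int,
    name text,
    salary int,
    PRIMARY KEY ((id, name))    -- composite partition key (id, name)
);
INSERT INTO test.emp (id, name, salary) VALUES (1, 'alice', 100);

Note that a SELECT issued from cqlsh goes through whichever node you connect to as a coordinator, so seeing the row from every node does not by itself show where the replicas live; nodetool getendpoints reports the nodes that actually hold a given partition key.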
Percent ownership is not affected (at all) by actual data being present. You could add a new node to a single node cluster (RF=1) and it would instantly say 50% on each.
Percent ownership is purely about the percentage of token ranges which a node is responsible for. When a node is added, the token ranges are recalculated, but data doesn't actually move until a streaming event happens. Likewise, data isn't actually removed from its original node until cleanup.
For example, if you have a 3 node cluster with a RF of 3, each node will be at 100%. Add one node (with RF=3), and percent ownership drops to about 75%. Add a 5th node (again, keep RF=3) and ownership for each node correctly drops to about 3/5, or 60%. Again, with a RF of 3 it's all about each node being responsible for a set of primary, secondary, and tertiary token ranges.
the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
Actually, the distributed hash with the Murmur3 partitioner evenly distributes the token ranges, not the data. That's an important distinction. If you wrote all of your data to a single partition, I guarantee that you would not get an even distribution of data.
The data replicated to other nodes when you add them isn't cleaned up automatically - you need to run nodetool cleanup on the "old" nodes after you add the new node to the cluster. This will remove the ranges that were moved to the other nodes.

What to do when nodes in a Cassandra cluster reach their limit?

I am studying up on Cassandra and am in the process of setting up a cluster for a project that I'm working on. Consider this example:
Say I set up a 5-node cluster with 200 GB of space on each node. That adds up to 1000 GB (roughly 1 TB) of space overall. Assuming that my partitions are split equally across the cluster, I can easily add nodes and achieve linear scalability. However, what if these 5 nodes start approaching their 200 GB SSD limit? In that case, I can add 5 more nodes, and the partitions would then be split across 10 nodes. But the older nodes would still be taking writes, as they are part of the cluster. Is there a way to make these 5 older nodes 'read-only'? I want to send random read queries across the entire cluster, but I don't want to write to the older nodes anymore (as they are capped by the 200 GB limit).
Help would be greatly appreciated. Thank you.
Note: I can say that 99% of the queries will be write queries, with 1% or less for reads. The app has to persist click events in Cassandra.
Usually when a cluster reaches its limit, we add a new node to the cluster. After adding the new node, the old Cassandra nodes distribute some of their data to the new node. After that we run nodetool cleanup on every old node to clean up the data that was moved to the new node. The entire scenario happens in a single DC.
For example:
Suppose you have 3 nodes (A, B, C) in DC1 and 1 node (D) in DC2, and your nodes are reaching their limit. So you decide to add a new node (E) to DC1. Nodes A, B and C will distribute some of their data to node E, and we then run nodetool cleanup on A, B and C to reclaim the space.
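A minimal command sketch of that scenario (the cluster_name/seed settings and node letters are just illustrative):

# start node E with the same cluster_name and seeds in cassandra.yaml,
# then wait until it shows as UN (Up/Normal):
nodetool status

# afterwards, on A, B and C, reclaim the ranges that moved to E:
nodetool cleanup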
I had a bit of trouble understanding the question properly, so here is my take.
I am assuming you know that by adding 5 new nodes, some of the data load will be transferred to the new nodes, as some token ranges will be assigned to them.
Given that, if you are concerned that the old 5 nodes would not be able to take writes because they have reached their limit, that is not going to happen: the new nodes share the data load, so the old nodes now have free space again for further writes.
Isolating reads and writes to specific nodes is a different problem altogether. If you want to isolate reads to the old 5 nodes and writes to the new 5 nodes, the best way to do this is to add the 5 new nodes as another datacenter under the same cluster, and then use different consistency levels for reads and writes to make the old datacenter effectively read-only.
But the new datacenter will not lighten the data load on the first one; it will take on the same load itself. (So you would need more than 5 new nodes to solve both problems at once: a few nodes to lighten the load, and the others to isolate reads from writes by forming the new datacenter, which should itself have more than 5 nodes.) Best practice is to monitor the data load and fix it before such a problem happens, by adding new nodes or increasing the storage limit.
Having done that, you will also need to ensure that the nodes your clients contact for reads and writes are in different datacenters.
Consider you have the following situation:
dc1(n1, n2, n3, n4, n5)
dc2(n6, n7, n8, n9, n10)
Now, for reads you provide your client with node n1 as the contact point, and for writes with node n6.
The read/write isolation can then be achieved by choosing the right consistency level from the options below:
LOCAL_QUORUM
or
LOCAL_ONE
These basically confine the requests to the replicas in the local datacenter only.
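As a small illustration, in cqlsh the session consistency level can be set like this (the keyspace and table are placeholders; a real application would set the consistency level in the driver instead):

CONSISTENCY LOCAL_QUORUM;
SELECT * FROM ks.clicks WHERE id = 1;

Both LOCAL_QUORUM and LOCAL_ONE only wait for replicas in the datacenter of the coordinator the client is connected to, which is what makes the per-DC isolation described above possible.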
Look at these references for more:
Adding a datacenter to a cluster
and
Consistency Levels

Is it possible to get notified when an OrientDB master node is up to date?

I mean, when we have a multi-master cluster with 3 master nodes and we add a new master node (a fourth node), we need to somehow be notified when it is up to date, or maybe poll it every 5 seconds to find out whether it is up to date.
Is it possible to implement this?
This feature is currently not present, and looking at the roadmap, I don't think it will be.
Consider that if you end up in a situation with 5 nodes (so the quorum is 3), the sync takes place only between two of the nodes, while the other 3 still form a quorum to write data. So you can keep doing inserts and other operations, and you do not need a mechanism to tell you whether, for example, node 5 is ready or not. This of course assumes you can arrange such a setup.

question about mnesia distribution

I have two nodes running Mnesia. I created the schema and some tables on node 1, and used mnesia:add_table_copy on node 2 to copy the tables from node 1 to node 2.
Everything works well until I call q() on node 1 and then q() on node 2. I found that when I start node 1 again, mnesia:wait_for_tables([sometable], infinity) won't return. It will only return when I start node 2 again.
Is there a way to fix this? This is a problem because I won't be able to start node 1 again if node 2 is down.
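For reference, a rough sketch of that kind of setup (the node names, table name and attributes are placeholders; run from node 1 with both nodes connected and Mnesia not yet started):

Nodes = ['node1@host1', 'node2@host2'],
ok = mnesia:create_schema(Nodes),              %% disc schema on both nodes
rpc:multicall(Nodes, mnesia, start, []),       %% start Mnesia on both nodes
{atomic, ok} = mnesia:create_table(sometable,
    [{disc_copies, ['node1@host1']}, {attributes, [key, value]}]),
{atomic, ok} = mnesia:add_table_copy(sometable, 'node2@host2', disc_copies).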
In this discussion a situation similar to the one you're facing is presented.
Reading from that source:
At startup Mnesia tries to connect with the other nodes and if that succeeds it loads its tables from them. If the other nodes are down, it looks for mnesia_down marks in its local transaction log in order to determine whether it has a consistent replica of its tables or not. The node that was shut down last has mnesia_down's from all the other nodes. This means that it can safely load its tables. If some of the other nodes were started first (as in your case), Mnesia will wait indefinitely for another node to connect in order to load its tables.
You're shutting down node 1 first, so it doesn't have the mnesia_down mark from the other node. What happens if you reverse the shutdown order?
Also, it should be possible to force the table loading via the force_load_table/1 function:
force_load_table(Tab) -> yes | ErrorDescription
The Mnesia algorithm for table load might lead to a situation where a table cannot be loaded. This situation occurs when a node is started and Mnesia concludes, or suspects, that another copy of the table was active after this local copy became inactive due to a system crash.
If this situation is not acceptable, this function can be used to override the strategy of the Mnesia table load algorithm. This could lead to a situation where some transaction effects are lost with an inconsistent database as a result, but for some applications high availability is more important than consistent data.
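A minimal sketch of how that could be combined with a bounded wait (the table name is the one from the question; the timeout value is arbitrary):

case mnesia:wait_for_tables([sometable], 30000) of
    ok ->
        ok;
    {timeout, Tabs} ->
        %% the other node never connected; accept possible inconsistency
        [mnesia:force_load_table(T) || T <- Tabs]
end.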
