distributed database replication design: efficient network transfer - database

I am developing a distributed database system. Suppose each table has 3 copies on different machines. When a write request comes in, the replication layer's job is to replicate that write to the other two nodes. The question is how to make the replication efficient in terms of throughput. The network bandwidth is 10 Gbps.
Example:
cmd1: write on table 1
-> send to node A and node B
cmd2: write on table 2
-> send to node B and node C
cmd3: write on table 1
-> send to node A and node B
cmd4: write on table 3
-> send to node B and node D
cmd5: write on table 2
-> send to node B and node C
The above example shows the replication command queue. To achieve high performance, what about using multi-threaded network replication? And if there are two writes to the same node, would it be better to combine the two and send them together?
Also, suppose those writes will eventually be stored on disk (think spinning disks). Is there anything that can be done during replication to speed up the storing process? (Each table is mapped to a specific disk on a node, and writes are append-only.)
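The batching idea raised in the question can be sketched as follows. This is only a minimal illustration in Python, not a recommended design: the queue contents are taken from the example above, while the names `QUEUE` and `batch_by_node` are hypothetical. It groups queued writes by destination node and, within each node, by table, so that each node could receive one combined message and append each table's writes to its disk in one sequential run.

```python
from collections import defaultdict

# Command queue from the example: (command id, table, target nodes).
QUEUE = [
    ("cmd1", "table1", ["A", "B"]),
    ("cmd2", "table2", ["B", "C"]),
    ("cmd3", "table1", ["A", "B"]),
    ("cmd4", "table3", ["B", "D"]),
    ("cmd5", "table2", ["B", "C"]),
]


def batch_by_node(queue):
    """Group queued writes by destination node, then by table.

    Order within each (node, table) list follows queue order, which
    matters for append-only storage.
    """
    batches = defaultdict(lambda: defaultdict(list))
    for cmd, table, nodes in queue:
        for node in nodes:
            batches[node][table].append(cmd)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {node: dict(tables) for node, tables in batches.items()}


batches = batch_by_node(QUEUE)
for node in sorted(batches):
    print(node, batches[node])
```

With this grouping, node B would get a single batch covering all five commands instead of five separate sends. Whether that is a net win depends on how long the sender is willing to hold writes back before flushing a batch.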

Related

Multi-Node in TimeScaleDB

I have been trying to run TimescaleDB in cluster mode. In total I have four nodes (services): one is the access node, and the other three are data nodes. The access node stores the chunks on the data nodes.
My problem is that when any one of the data nodes is down, I cannot run any query on the access node, because the access node tries to reach the failed data node. After a few minutes, I get errors such as a request connection timeout.
In my scenario, if any one of the data nodes fails, I would like the access node to insert data into, or read data from, the other data nodes. Is this possible? If anyone knows the answer, let me know.
I have followed the following document for the multi-node setup
https://docs.timescale.com/mst/latest/mst-multi-node/
https://docs.timescale.com/timescaledb/latest/how-to-guides/multinode-timescaledb/multinode-config/
https://docs.timescale.com/timescaledb/latest/how-to-guides/multinode-timescaledb/multinode-ha/#node-failures
Based on the design of TimescaleDB, you cannot just store data somewhere else, since the chunk (which belongs to a specific time range and possibly an additional dimension) is fixed. Storing it somewhere else would break the time-chunk relation assumption, which would make the data unavailable afterwards.
Some people have tried (I'm not sure they succeeded) to put something like pgbouncer or pgpool between the access node and the data nodes to get hot standby servers at the data node level. Each data node would then be its own mini cluster that fails over to the hot standby when the primary data node fails.
That said, it would look something along these lines:
(access node) -> (pg pooling) -> (dn1 primary | dn1 secondary) + (dn2 primary | dn2 secondary) + (dn3 primary | dn3 secondary)

ClickHouse - How to remove node from cluster for reading?

Background
I'm beginning work to set up a ClickHouse cluster with 3 CH nodes. The first node (Node A) would be write-only, and the remaining 2 (Nodes B + C) would be read-only. By this I mean that writes for a given table to Node A would automatically replicate to Nodes B + C. When querying the cluster, reads would only be resolved against Nodes B + C.
The purpose for doing this is two-fold.
This datastore serves both real-time and background jobs. Both are high volume, only on the read side, so it makes sense to segment the traffic. Node A would be used for writing to the cluster and all background reads. Nodes B + C would be strictly used for the UX.
The volume of writes is very low, perhaps 1 write per 10,000 reads. Data is entirely refreshed once per week. Background jobs need to be certain that the most current data is being read before they can be kicked off. Reading off of replicas introduces eventual consistency as a concern, so reading from the node directly (rather than the cluster) from Node A guarantees the data to be strongly consistent.
Question
I'm not finding much specific information in the CH documentation, and am wondering whether this might be possible. If so, what would the cluster configuration look like?
Yes, it is possible to do so. But wouldn't the best solution be to read and write to each server sequentially using the Distributed table?

How DSE Spread Data?

We are using DSE with Cassandra + Solr.
I'm not sure how it spreads the data. Let's say we have 6 nodes and a replication factor of 3.
Our platform uses all 6 nodes to query data. If I query one node out of the 6, is there a chance that data will be missing?
Or do I need the replication factor to equal the number of nodes if I want to use all the nodes from the platform?
How does it work?
In Cassandra each node stores some parts of the data. When you build the cluster each node will be responsible for specific part of the data. That is decided based on the token value that is assigned to that node. Now when you insert or select the data, each insert or select will have a partition key. Based on that partition key a hash value is calculated and the data will be sent to node which is responsible for that specific token value.
If there are 6 nodes and RF = 3, then the cluster holds 3 copies of the entire data. The primary copy is stored based on the concept above. The replicas are stored according to the replication class that you specify when you create the keyspace. With SimpleStrategy, replicas are stored on the next nodes clockwise on the ring, i.e. the replicas of node1 are stored on node2 and node3, the replicas of node2 on node3 and node4, and so on.
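The clockwise placement described above can be illustrated with a toy model. This is a simplified sketch in Python, not Cassandra's implementation: it assumes one token per node and a fixed ring order, and the names `RING` and `simple_strategy_replicas` are hypothetical.

```python
def simple_strategy_replicas(ring, primary_index, rf):
    """Return the primary node plus the next rf - 1 nodes clockwise on the ring."""
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]


# Six nodes arranged clockwise, replication factor 3.
RING = ["node1", "node2", "node3", "node4", "node5", "node6"]

for index, node in enumerate(RING):
    print(node, "->", simple_strategy_replicas(RING, index, 3))
```

Note how the placement wraps around: data whose primary is node6 also lives on node1 and node2. Any single node therefore holds only part of the data, but every partition has 3 copies somewhere in the cluster.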
If you query from one node then based on the partition key the query will be sent to specific node which is responsible for that partition keys. To know to which node your query will be sent you can use nodetool utility.
nodetool getendpoints <keyspace> <table> <key>
This will give you the IPs of the nodes that hold the replicas for that key, i.e. where the query will be sent to get the result.
If you have 6 nodes and RF 3, it means 3 copies of the data exist in your Cassandra cluster. Data availability also depends on the consistency level. If you are using consistency level ONE and 2 nodes are down, you will still get the data with no loss, but with TWO, QUORUM, THREE or ALL the scenario will be different.
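As a rough check of the consistency-level point above, the following Python sketch computes whether a request can still be satisfied when some replicas of a partition are down. It is a simplified model: `replicas_down` counts only the down nodes that actually hold a replica of the requested partition, and the names `REQUIRED` and `can_satisfy` are hypothetical.

```python
# Number of replicas that must respond for each consistency level,
# as a function of the replication factor (rf).
REQUIRED = {
    "ONE": lambda rf: 1,
    "TWO": lambda rf: 2,
    "THREE": lambda rf: 3,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def can_satisfy(level, rf, replicas_down):
    """Return True if enough replicas of the partition are still up."""
    return rf - replicas_down >= REQUIRED[level](rf)


for level in ("ONE", "TWO", "QUORUM", "THREE", "ALL"):
    print(level, can_satisfy(level, rf=3, replicas_down=2))
```

With RF 3 and two replicas down, only ONE can still be served. With a single replica down, ONE, TWO and QUORUM work, while THREE and ALL do not.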

What to do when nodes in a Cassandra cluster reach their limit?

I am studying up Cassandra and in the process of setting up a cluster for a project that I'm working on. Consider this example :
Say I setup a 5 node cluster with 200 gb space for each. That equals up to 1000 gb ( round about 1 TB) of space overall. Assuming that my partitions are equally split across the cluster, I can easily add nodes and achieve linear scalability. However, what if these 5 nodes start approaching the SSD limit of 200 gb? In that case, I can add 5 more nodes and now the partitions would be split across 10 nodes. But the older nodes would still be writing data, as they are part of the cluster. Is there a way to make these 5 older nodes 'read-only'? I want to shoot off random read-queries across the entire cluster, but don't want to write to the older nodes anymore( as they are capped by a 200 gb limit).
Help would be greatly appreciated. Thank you.
Note: I can say that 99% of the queries will be write queries, with 1% or less for reads. The app has to persist click events in Cassandra.
Usually, when a cluster reaches its limit, we add a new node to the cluster. After the new node is added, the old Cassandra nodes distribute part of their data to it. After that, we run nodetool cleanup on every old node to remove the data that was moved to the new node. The entire scenario happens within a single DC.
For example:
Suppose you have 3 nodes (A, B, C) in DC1 and 1 node (D) in DC2, and your nodes are reaching their limit. So you decide to add a new node (E) to DC1. Nodes A, B and C will distribute part of their data to node E, and we then run nodetool cleanup on A, B and C to free up the space.
I had some trouble understanding the question properly.
I am assuming you know that by adding 5 new nodes, some of the data load will be transferred to the new nodes, as some token ranges will be assigned to them.
Given that, if you are concerned that the old 5 nodes will not be able to accept writes because they have reached their limit, that is not going to happen: the new nodes have taken over part of the data load, so the old nodes have free space again for further writes.
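The effect of adding nodes can be estimated with simple arithmetic. The Python sketch below assumes that token ranges are spread evenly across nodes and that nodetool cleanup has been run on the old nodes; the function name `per_node_load_gb` is hypothetical, and the 1000 GB figure is the total from the question (5 nodes at their 200 GB limit).

```python
def per_node_load_gb(total_stored_gb, node_count):
    """Average data held per node, assuming an even token distribution."""
    return total_stored_gb / node_count


# 5 nodes that are each full at 200 GB hold 1000 GB in total.
before = per_node_load_gb(1000, 5)
# After adding 5 more nodes and running cleanup on the old ones.
after = per_node_load_gb(1000, 10)

print(f"before: {before} GB per node, after: {after} GB per node")
```

Under these assumptions, each of the old nodes drops from 200 GB to about 100 GB, which is why they can keep accepting writes.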
Isolating reads and writes to separate nodes is a different problem altogether. If you want reads to go only to the old 5 nodes and writes only to the new 5 nodes, the best way is to add the new 5 nodes as another datacenter in the same cluster and then use different consistency levels for reads and writes, so that the old datacenter is effectively read-only.
But the new datacenter will not lighten the data load on the first one; it will take on the same load itself. (So you would need more than 5 nodes to solve both problems at once: a few nodes to lighten the load, and others to isolate reads from writes by forming the new datacenter. The new datacenter should also have more than 5 nodes.) Best practice is to monitor the data load and fix it before such a problem occurs, by adding new nodes or increasing the disk limit.
Having done that, you will also need to ensure that the contact nodes you provide for reads and for writes are in different datacenters.
Consider you have following situation :
dc1(n1, n2, n3, n4, n5)
dc2(n6, n7, n8, n9, n10)
Now, for reads you provide node n1 and for writes you provide node n6.
The read/write isolation can then be done by choosing the right consistency level from the options below:
LOCAL_QUORUM
or
LOCAL_ONE
These basically would confine the search for the replicas to local datacenter only.
Look at these references for more :
Adding a datacenter to a cluster
and
Consistency Levels

question about mnesia distribution

I have two nodes running mnesia. I created schema and some tables on node 1, and used mnesia:add_table_copy on node 2 to copy the tables from node 1 to node 2.
Everything works well until I call q() on node 1 and then q() on node 2. I found that when I start node 1 again, mnesia:wait_for_tables([sometable], infinity) won't return. It will only return when I start node 2 again.
Is there a way to fix this? This is a problem because I won't be able to start node 1 again if node 2 is down.
In this discussion a situation similar to the one you're facing is presented.
Reading from that source:
At startup Mnesia tries to connect with the other nodes and if that succeeds it loads its tables from them. If the other nodes are down, it looks for mnesia_down marks in its local transaction log in order to determine if it has a consistent replica or not of its tables. The node that was shutdown last has mnesia_down's from all the other nodes. This means that it safely can load its tables. If some of the other nodes were started first (as in your case) Mnesia will wait indefinitely for another node to connect in order to load its tables
You're shutting down node 1 first, so it doesn't have the mnesia_down from the other node. What happens if you reverse the shutting down order?
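The rule quoted above can be modelled in a few lines. This is only a toy illustration in Python of the decision as described in the quote, not Mnesia's actual code; the function name `can_load_locally` and the node names are hypothetical.

```python
def can_load_locally(mnesia_down_marks, other_nodes):
    """A node may load its own tables without waiting only if its log holds
    an mnesia_down mark for every other node, i.e. it was shut down last."""
    return set(other_nodes) <= set(mnesia_down_marks)


# Node 1 was stopped first, so it never saw node 2 go down.
node1_can_load = can_load_locally(mnesia_down_marks=[], other_nodes=["node2"])
# Node 2 was stopped last, so it recorded that node 1 went down.
node2_can_load = can_load_locally(mnesia_down_marks=["node1"], other_nodes=["node2" and "node1"])

print("node1 can load alone:", node1_can_load)
print("node2 can load alone:", node2_can_load)
```

In this model node 1 has to wait for node 2, which matches the behaviour described in the question, while node 2 could start on its own.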
Also, it should be possible to force the table loading via the force_load_table/1 function:
force_load_table(Tab) -> yes | ErrorDescription
The Mnesia algorithm for table load might lead to a situation where a table cannot be loaded. This situation occurs when a node is started and Mnesia concludes, or suspects, that another copy of the table was active after this local copy became inactive due to a system crash.
If this situation is not acceptable, this function can be used to override the strategy of the Mnesia table load algorithm. This could lead to a situation where some transaction effects are lost with an inconsistent database as result, but for some applications high availability is more important than consistent data.
