How does Multi-Raft group nodes together?

I am trying to implement an architecture similar to CockroachDB's Multi-Raft: https://www.cockroachlabs.com/blog/scaling-Raft/.
Does anyone have a brief explanation of how Multi-Raft groups these individual Raft clusters? Specifically, is there a Raft instance coordinating the membership of the servers participating in each range/session/unit of the smaller Raft groups?

If I had to guess, they implement logic similar to rendezvous hashing (or consistent hashing).
For example, say there are 10 nodes in total and a range X. They could use hashing to decide which 3 nodes are responsible for that range. One of those nodes will be the leader for the range. For another range, the set of nodes will be different.
Since there are many more ranges than nodes, every node will participate in multiple ranges. This is useful: when a node sends a heartbeat to a follower, that single heartbeat covers all ranges where the sender is the leader and the receiver is a follower.
At the end of the day, every node still sends a heartbeat to every other node.
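To make the guess above concrete, here is a minimal sketch of rendezvous hashing picking 3 of 10 nodes per range, followed by heartbeat coalescing per (leader, follower) pair. It illustrates the idea in this answer only and is not CockroachDB's actual placement logic; the node names, range names, the `score` function, and the rule that the top-ranked replica is the leader are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def score(node: str, range_id: str) -> int:
    """Deterministic weight for a (node, range) pair."""
    digest = hashlib.sha256(f"{node}:{range_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def replicas_for(range_id: str, nodes: list, replication: int = 3) -> list:
    """Rendezvous hashing: the highest-scoring nodes host the range."""
    ranked = sorted(nodes, key=lambda n: score(n, range_id), reverse=True)
    return ranked[:replication]

nodes = [f"node{i}" for i in range(10)]        # hypothetical node names
ranges = [f"range{i}" for i in range(100)]     # hypothetical range names

# Assumption for illustration: the top-ranked replica acts as the range's leader.
placement = {r: replicas_for(r, nodes) for r in ranges}

# Coalesce heartbeats: one message per (leader, follower) pair covers
# every range the two nodes share in those roles.
heartbeats = defaultdict(list)
for r, (leader, *followers) in placement.items():
    for follower in followers:
        heartbeats[(leader, follower)].append(r)

print(len(heartbeats), "coalesced heartbeats instead of", 2 * len(ranges))
```

With 10 nodes there are at most 90 ordered (leader, follower) pairs, so the number of heartbeat messages is bounded by the node count rather than the range count.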

Related

Data Partitioning and Replication on Cassandra cluster

I have a 3-node Cassandra cluster with RF=3. When I run nodetool status, the Owns column shows 100% for each node in the cluster.
But when I have 5 nodes in the cluster with RF=3, Owns is approximately 60% (as shown in the image below).
As per my understanding, the partitioner calculates the hash that identifies the first replica node, and the data is then replicated to other nodes as per the RF.
Now we have a 5-node cluster and RF is 3.
Shouldn't 3 nodes own all the data evenly (100%), since the partitioner points to one node as per the partitioning strategy and the same data is then replicated to the remaining RF-1 nodes? Instead, it looks as if the data is evenly distributed among all 5 nodes even though RF is 3.
Edit1:
As per my understanding the reason for 60%(approx) owns for each node is because the RF is 3. It means there will be 3 replicas for each row. It means there will be 300% data. Now there are 5 nodes in the cluster and partitioner will be using the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster.
But now the issue is that we checked all the nodes of our cluster and all the nodes contain all the data even though the RF is 3.
Edit2:
@Aaron I did as specified in the comment. I created a new cluster with 3 nodes.
I created a keyspace "test" and set the class to SimpleStrategy and RF to 2.
Then I created a table "emp" with partition key (id, name).
Now I inserted a single row into the first node.
As per your explanation, it should only be on 2 nodes as RF=2.
But when I logged into all 3 nodes, I could see the row replicated on all of them.
I think that since the keyspace is replicated to all the nodes, the data is also getting replicated.
Percent ownership is not affected (at all) by actual data being present. You could add a new node to a single node cluster (RF=1) and it would instantly say 50% on each.
Percent ownership is purely about the percentage of token ranges which a node is responsible for. When a node is added, the token ranges are recalculated, but data doesn't actually move until a streaming event happens. Likewise, data isn't actually removed from its original node until cleanup.
For example, if you have a 3 node cluster with a RF of 3, each node will be at 100%. Add one node (with RF=3), and percent ownership drops to about 75%. Add a 5th node (again, keep RF=3) and ownership for each node correctly drops to about 3/5, or 60%. Again, with a RF of 3 it's all about each node being responsible for a set of primary, secondary, and tertiary token ranges.
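The arithmetic behind those percentages can be sketched in a few lines. This is not how nodetool computes the Owns column internally; it is only the RF/N ratio described above, under the assumption that token ranges are spread evenly across nodes. The function name is made up for this example.

```python
def effective_ownership(num_nodes: int, replication_factor: int) -> float:
    """Fraction of the token ring each node is responsible for,
    assuming token ranges are evenly spread (an idealization)."""
    return min(1.0, replication_factor / num_nodes)

for n in (3, 4, 5):
    print(f"{n} nodes, RF=3 -> {effective_ownership(n, 3):.0%}")
```

The same formula gives 50% for two nodes with RF=1, matching the single-node example above.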
"the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster."
Actually, the distributed hash with Murmur3 partitioner will evenly distribute the token ranges, not the data. That's an important distinction. If you wrote all of your data to a single partition, I guarantee that you would not get even distribution of data.
The data replicated to other nodes when you add them isn't cleaned up automatically. You need to run nodetool cleanup on the "old" nodes after you add the new node to the cluster. This removes the ranges that were moved to other nodes.

Unavailable nodes in consistent hashing

From everything I have read, in consistent hashing, if a node crashes, the keys handled by that node will be re-mapped to the adjacent node in the hash ring. This conceptually makes sense to me.
What I don't understand is how this would work in practice for a distributed database. How can the data be moved to another node if the node has crashed? Does it assume there is a backup/standby cluster available? Or redundant nodes it can be copied from?
Yes. Data is copied from other nodes in the cluster. If the data is not replicated, there is no way to bring back the data.
Consistent hashing gives us a single node to which a key is assigned. How are the other nodes on which the key is replicated identified?
The answer is that the replication strategy is built on top of consistent hashing. First, the node to which the key belongs is identified using consistent hashing. Second, the system replicates the data using another algorithm. One strategy is for the system to write the data to the nodes that come next, in a clockwise direction, after the owning node in the consistent hash ring. As an example, you can find some other replication strategies here.
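The clockwise strategy described above can be sketched as follows. This is a simplified illustration, not any particular database's implementation: the `Ring` class, the use of MD5 for ring positions, and the node names are all assumptions.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Map a node name or key onto a 32-bit ring (hash choice is arbitrary here)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, nodes):
        # Sorted (position, node) pairs form the ring.
        self.points = sorted((ring_position(n), n) for n in nodes)
        self.positions = [p for p, _ in self.points]

    def replicas(self, key: str, count: int):
        """Owner is the first node clockwise from the key;
        the remaining replicas are the nodes that follow it."""
        start = bisect.bisect_left(self.positions, ring_position(key))
        count = min(count, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]

nodes = ["A", "B", "C", "D", "E"]
before = Ring(nodes).replicas("user:42", 3)
print("replicas:", before)

# If the owner crashes, the key falls to the next node clockwise,
# which already holds a copy of the data.
survivors = [n for n in nodes if n != before[0]]
print("after owner crash:", Ring(survivors).replicas("user:42", 3))
```

Because the new owner was already the second replica, the data can be served and re-replicated from surviving copies rather than from the crashed node.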

Apache Giraph graph partitioning: can a partition p1 reside partially in worker w1 and partially in worker w2?

I am a newbie in Apache Giraph. My question is related to Giraph graph partitioning. As far as I know, Giraph partitions the large graph randomly, possibly with #partitions > #workers in order to balance load. But my question is: is #partitions/worker always an integer? Put another way, can a partition (say p1) reside partially in worker w1 and partially in worker w2? Or must p1 reside entirely in either w1 or w2?
A partition in Giraph refers to a vertex partition, not a graph partition. For example, if a graph has 10 vertices numbered from 1 to 10, then a possible partitioning would be {1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10}. Each partition knows where its outgoing edges point. Each worker creates a thread for each partition assigned to it. The thread iterates over each vertex in the partition and executes the compute function.
So with this information I would say a partition has to reside on a single worker entirely.
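As a rough illustration of that point, the sketch below assigns vertices to partitions and whole partitions to workers. It does not use Giraph's API; the modulo-based assignment functions and the counts (6 partitions, 4 workers) are assumptions chosen so that the number of partitions per worker is not an integer ratio.

```python
from collections import defaultdict

NUM_PARTITIONS = 6   # hypothetical values, chosen so 6 / 4 is not an integer
NUM_WORKERS = 4

def partition_of(vertex_id: int) -> int:
    """Each vertex belongs to exactly one partition."""
    return vertex_id % NUM_PARTITIONS

def worker_of(partition_id: int) -> int:
    """Each partition is assigned, as a whole, to exactly one worker."""
    return partition_id % NUM_WORKERS

partitions = defaultdict(list)
for vertex in range(1, 11):
    partitions[partition_of(vertex)].append(vertex)

workers = defaultdict(list)
for p in sorted(partitions):
    workers[worker_of(p)].append(p)

for w in sorted(workers):
    print(f"worker {w}: partitions {workers[w]}")
```

Here two workers hold two partitions each and two hold one, so the per-worker count varies, but no partition is ever split between workers.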
Hello @zahorak,
If Giraph implemented Pregel as it is, then as per the Pregel paper it is not necessary to have #partitions == #workers. It says,
The master determines how many partitions the graph will have, and assigns one or more partitions to each worker machine. The number may be controlled by the user. Having more than one partition per worker allows parallelism among the partitions and better load balancing, and will usually improve performance.
UPDATE: I found a similar question on the Giraph user mailing list. The answers given in the replies might be helpful. Here is the link to the thread: https://www.mail-archive.com/user@giraph.apache.org/msg01869.html
AFAIK no, actually I would have said, #partitions == #workers
The reason for partitioning is to handle parts of the graph on one server. After the superstep is executed messages sent to other partitions are exchanged between the servers within a cluster.
Maybe you understand the term partitioning differently than I do, but for me partitioning means:
Giraph runs on a cluster with multiple servers, and in order to leverage all of them it needs to partition the graph. It then simply assigns each node randomly to one of the n servers. Out of this you get n partitions, and the nodes within each partition are executed by the one server they were assigned to, and no other.

What is the unidirectional property and why does it help with hotspots?

In the Kademlia paper it's written that the XOR metric is unidirectional. What does that mean precisely?
More importantly, in what way does it alleviate the problem of a frequently queried node?
Could you explain that to me from the point of view of a node? I mean, if I am a hotspot that is requested frequently by different nodes, do they exchange cached nodes to get to the target? Can't they just exchange the target IP?
Furthermore, it doesn't seem to me that lookups converge along the same path as written. I think it's more logical that each node follows a different path while going farther and farther from itself.
The XOR metric is symmetric: A^B gives the same distance as B^A. Unidirectional, as the paper uses the term, means that for any given point x and distance d > 0 there is exactly one point y such that x^y = d. I'm not sure that it directly alleviates the problem of a frequently queried node; it's more that nodes at different addresses in the network will perceive the nodes on a search path as being at different distances from themselves, and so cache different nodes after a query completes. Subsequent queries to local nodes will be given different remote nodes in response, thereby potentially spreading the load around the DHT network somewhat.
When querying the DHT network, the more common query is to ask for data regarding a particular info hash. That's stored by the nodes with the smallest distances between their node IDs and the info hash in question. It's only when you begin querying nodes that are close to the target info hash that they start to respond with IP addresses of peers for that torrent. Nodes can't just arbitrarily return peer IPs, as that would require that all nodes store all IPs for all torrents, or that nodes perform subsequent queries on your behalf, which would lead to suboptimal network use and be open to exploitation.
Your observation that lookups don't converge on the same path is only correct when there is a surfeit of nodes at the distance being queried. Eventually, as you get closer to nodes storing data for the desired info hash, there will be fewer and fewer nodes with such proximity to the target. Thus toward the end of queries, most querying nodes will converge on similar nodes. It's worth keeping in mind that this isn't a problem. Those nodes will only be "hot" for data related to that one particular info hash, as the distance between info hashes is going to be very large on average on account of the enormous size of the hash space used. Additionally, if it is a popular info hash, nodes close to that hash that aren't coping with the traffic will be penalized by the network and returned less often by nodes on the search path.
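A small check of the two XOR properties discussed in this thread may help: symmetry, and the paper's unidirectional property (for a fixed point and distance there is exactly one matching point). The 4-bit ID space and the concrete values are assumptions picked to keep the example tiny; real Kademlia IDs are 160 bits.

```python
def distance(a: int, b: int) -> int:
    """Kademlia's distance between two IDs is their bitwise XOR."""
    return a ^ b

ID_BITS = 4                 # toy ID space for illustration
x, y = 0b1011, 0b0110

# Symmetric: the distance is the same in both directions.
print(distance(x, y) == distance(y, x))

# Unidirectional: for a fixed x and distance d, exactly one ID lies at
# that distance, namely x ^ d.
d = 0b0101
matches = [c for c in range(2 ** ID_BITS) if distance(x, c) == d]
print(matches, x ^ d)
```

Contrast this with distance on a number line, where two points (x - d and x + d) lie at distance d from x.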

Simple basic explanation of a Distributed Hash Table (DHT)

Could any one give an explanation on how a DHT works?
Nothing too heavy, just the basics.
Ok, they're fundamentally a pretty simple idea. A DHT gives you a dictionary-like interface, but the nodes are distributed across the network. The trick with DHTs is that the node that gets to store a particular key is found by hashing that key, so in effect your hash-table buckets are now independent nodes in a network.
This gives a lot of fault-tolerance and reliability, and possibly some performance benefit, but it also throws up a lot of headaches. For example, what happens when a node leaves the network, by failing or otherwise? And how do you redistribute keys when a node joins so that the load is roughly balanced? Come to think of it, how do you evenly distribute keys anyhow? And when a node joins, how do you avoid rehashing everything? (Remember you'd have to do this in a normal hash table if you increase the number of buckets).
One example DHT that tackles some of these problems is a logical ring of n nodes, each taking responsibility for 1/n of the keyspace. Once you add a node to the network, it finds a place on the ring to sit between two other nodes, and takes responsibility for some of the keys in its sibling nodes. The beauty of this approach is that none of the other nodes in the ring are affected; only the two sibling nodes have to redistribute keys.
For example, say in a three-node ring the first node has keys 0-10, the second 11-20 and the third 21-30. If a fourth node comes along and inserts itself between the third node and the first (remember, they're in a ring), it can take responsibility for, say, half of the third node's keyspace, so now it deals with 26-30 and the third node deals with 21-25.
There are many other overlay structures such as this that use content-based routing to find the right node on which to store a key. Locating a key in a ring requires searching round the ring one node at a time (unless you keep a local look-up table, which is problematic in a DHT of thousands of nodes), which is O(n)-hop routing. Other structures, including augmented rings, guarantee O(log n)-hop routing, and some claim O(1)-hop routing at the cost of more maintenance.
Read the Wikipedia page, and if you really want to know in a bit of depth, check out this course page at Harvard which has a pretty comprehensive reading list.
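The join in this example can be written out as a short sketch. The dictionary layout, the node names, and the `join` helper are made up for illustration; a real DHT would also have to transfer the stored values for the keys that change hands.

```python
# Each node owns an inclusive key range, as in the example above.
ring = {"node1": (0, 10), "node2": (11, 20), "node3": (21, 30)}

def join(ring: dict, new_node: str, donor: str) -> dict:
    """The new node takes the upper half of the donor's range.
    All other nodes keep their ranges unchanged."""
    low, high = ring[donor]
    mid = (low + high) // 2
    updated = dict(ring)
    updated[donor] = (low, mid)
    updated[new_node] = (mid + 1, high)
    return updated

after = join(ring, "node4", donor="node3")
print(after)
```

Only the donor's entry changes, which is the property that makes ring-based DHTs cheap to grow.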
DHTs provide the same type of interface to the user as a normal hashtable (look up a value by key), but the data is distributed over an arbitrary number of connected nodes. Wikipedia has a good basic introduction that I would essentially be regurgitating if I write more -
http://en.wikipedia.org/wiki/Distributed_hash_table
I'd like to add onto HenryR's useful answer as I just had an insight into consistent hashing. A normal/naive hash lookup is a function of two variables, one of which is the number of buckets. The beauty of consistent hashing is that we eliminate the number of buckets "n", from the equation.
In naive hashing, the first variable is the key of the object to be stored in the table. We'll call the key "x". The second variable is the number of buckets, "n". So, to determine which bucket/machine the object is stored in, you have to calculate: hash(x) mod n. Therefore, when you change the number of buckets, you also change the address at which almost every object is stored.
Compare this to consistent hashing. Let's define "R" as the range of a hash function. R is just some constant. In consistent hashing, the address of an object is located at hash(x)/R. Since our lookup is no longer a function of the number of buckets, we end up with less remapping when we change the number of buckets.
http://michaelnielsen.org/blog/consistent-hashing/
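To see the remapping difference in practice, the sketch below counts how many keys change owner when a fifth bucket is added, first with hash(x) mod n and then with a ring-based consistent hash. The ring variant with virtual nodes is a common way to realize consistent hashing and is an assumption here, not the exact hash(x)/R formulation above; the helper names, node names, and key names are invented for the example.

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

def naive_owner(key: str, n: int) -> int:
    """Naive placement: depends directly on the bucket count n."""
    return h(key) % n

def build_ring(nodes, vnodes=100):
    # Several points per node even out the arc sizes (virtual nodes).
    points = sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
    return [p for p, _ in points], [n for _, n in points]

def ring_owner(ring, key: str) -> str:
    """Owner is the first ring point clockwise from the key's hash."""
    positions, owners = ring
    return owners[bisect.bisect_left(positions, h(key)) % len(positions)]

keys = [f"key{i}" for i in range(1000)]

naive_moved = sum(naive_owner(k, 4) != naive_owner(k, 5) for k in keys)

old_ring = build_ring(["n0", "n1", "n2", "n3"])
new_ring = build_ring(["n0", "n1", "n2", "n3", "n4"])
ring_moved = [k for k in keys if ring_owner(old_ring, k) != ring_owner(new_ring, k)]

print(f"naive: {naive_moved / len(keys):.0%} of keys moved")
print(f"ring:  {len(ring_moved) / len(keys):.0%} of keys moved")
```

On the ring, every key that moves goes to the newly added node; keys never shuffle between the existing nodes.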
The core of a DHT is a hash table. Key-value pairs are stored in DHT and a value can be looked up with a key. The keys are unique identifiers to values that can range from blocks in a blockchain to addresses and to documents.
What differentiates a DHT from a normal hash table is the fact that storage and lookup on DHT are distributed across multiple (can be millions) nodes or machines. This very characteristic of DHT makes it look like distributed databases used for storage and retrieval. There is no master-slave hierarchy or a centralized control among the participating nodes. All the nodes are treated as peers.
DHT provides freedom to the participating nodes such that the nodes can join or leave the network anytime. Due to this reason, DHTs are widely used in Peer-to-Peer (P2P) networks. In fact, part of the motivation behind the research of DHT stems from its usage in P2P networks.
Characteristics of DHT
Decentralized: Since there is no central authority or coordination
Scalable: The system can easily scale up to millions of nodes
Fault-tolerant: A DHT replicates data across multiple nodes. Therefore, even if one node leaves the network, it should not affect other nodes in the network.
Let’s see how lookup happens in a popular DHT protocol like Chord. Consider a circular doubly-linked list of nodes. Each node has a reference pointer to the node previous as well as next to it. The node next to the node in question is called the successor. The node that is previous to the node in question is called the predecessor.
Speaking in terms of a DHT, each node has a unique node ID of k bits and these nodes are arranged in the increasing order of their node IDs.
Assume these nodes are arranged in a ring structure called identifier ring. For each node, the successor has the shortest distance clockwise away. For most nodes, this is the node whose ID is closest to but still greater than the current node’s ID.
To find the node responsible for a particular key, first hash the key K and all the nodes to exactly k bits using a hash function such as SHA-1, as is done in consistent hashing.
Start at any point in the ring and traverse clockwise until you reach the first node whose node ID is equal to or greater than the hashed key K. This node is the one responsible for storage and lookup for that particular key.
In an iterative style of lookup, each node Q queries its successor node for the KV (key-value) pair. If the queried node does not have the target key, it returns a set of nodes S that may be closer to the target. The querying node Q then queries the nodes in S that are closest to the target. This continues until either the target KV pair is returned or there are no more nodes to query.
This lookup is very suitable for an ideal scenario where all the nodes have a perfect uptime. But how to handle scenarios when nodes leave the network either intentionally or by failure? This calls for the need for a robust join/leave protocol.
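The successor rule described above can be sketched on a small identifier ring. This is a toy model, not a Chord implementation: the 6-bit ID space, the node IDs, and the function names are assumptions, and the hop counter models a ring where each node knows only its immediate successor (no finger tables).

```python
import bisect
import hashlib

ID_BITS = 6                                   # identifiers live in [0, 2**6)
NODE_IDS = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])   # hypothetical node IDs

def key_to_id(key: str) -> int:
    """Hash a key with SHA-1 and reduce it to ID_BITS bits."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % 2 ** ID_BITS

def successor(key_id: int) -> int:
    """First node clockwise whose ID is >= key_id, wrapping past the largest ID."""
    idx = bisect.bisect_left(NODE_IDS, key_id % 2 ** ID_BITS)
    return NODE_IDS[idx % len(NODE_IDS)]

def hops_via_successor_pointers(start: int, key_id: int) -> int:
    """Hops needed when every node only forwards to its successor (O(n) routing)."""
    target = successor(key_id)
    idx = NODE_IDS.index(start)
    hops = 0
    while NODE_IDS[idx] != target:
        idx = (idx + 1) % len(NODE_IDS)
        hops += 1
    return hops

print(successor(54))                          # 56
print(successor(58))                          # wraps around to 1
print(hops_via_successor_pointers(8, 54))     # 8
print(successor(key_to_id("some-key")))
```

The linear hop count is why practical designs add extra routing state per node to reach a key in far fewer hops.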
Popular DHT protocols and implementations
Chord
Kademlia
Apache Cassandra
Koorde
TomP2P
Voldemort
References:
https://en.wikipedia.org/wiki/Distributed_hash_table
https://steffikj19.medium.com/dht-demystified-77dd31727ea7
https://www.linuxjournal.com/article/6797
