Unavailable nodes in consistent hashing - database

From everything I have read, in consistent hashing, if a node crashes, the keys handled by that node will be re-mapped to the adjacent node in the hash ring. This conceptually makes sense to me.
What I don't understand is how this would work in practice for a distributed database. How can the data be moved to another node if the node has crashed? Does it assume there is a backup/standby cluster available? Or redundant nodes it can be copied from?

Yes. Data is copied from other nodes in the cluster. If the data is not replicated, there is no way to bring back the data.
Consistent hashing gives us a single node to which a key is assigned. How are the other nodes, on which the key is replicated, identified?
The answer is that the replication strategy is built on top of consistent hashing. First, the node to which the key belongs is identified using consistent hashing. Second, the system replicates the data using another algorithm. One common strategy is for the system to write the data to the nodes that come next, in a clockwise direction, after the owning node on the consistent hash ring. You can find some other replication strategies here, for example.
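To make that concrete, here is a minimal sketch of that "next nodes clockwise" placement (not any particular database's implementation; the node names, the hash function, and the replication factor of 3 are assumptions for illustration). Keys and nodes are hashed onto the same ring, the first node clockwise from a key owns it, and the key is also written to the next distinct nodes clockwise, which is what makes recovery possible when the owner crashes:

```python
import hashlib
from bisect import bisect_right

def ring_hash(s: str) -> int:
    # Any stable hash works; SHA-1 truncated to 32 bits keeps positions readable.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (1 << 32)

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.points = sorted((ring_hash(n), n) for n in nodes)   # the hash ring

    def preference_list(self, key: str):
        """The owning node plus the next replicas-1 distinct nodes clockwise."""
        positions = [p for p, _ in self.points]
        i = bisect_right(positions, ring_hash(key))              # first node clockwise
        out = []
        while len(out) < min(self.replicas, len(self.points)):
            node = self.points[i % len(self.points)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out

ring = Ring([f"node-{i}" for i in range(5)])
print(ring.preference_list("user:42"))   # owner first, then its two clockwise successors
```

If the owner becomes unavailable, the other nodes on the key's preference list still hold copies, so the data can be re-replicated from them rather than from the dead node.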

Related

Making a B+ tree concurrent (C)

I am currently attempting to make a b+ tree concurrent.
So far, the approach I had in mind as a starting point is to walk down the tree on an insert, locking each node (each node has its own lock) and releasing a node's lock once I have acquired the lock on the next node down, until I reach a node that has a child holding (order of the B+ tree - 1) keys, since anything under that node can be modified by the insert; that node stays locked while all the necessary insert operations are run, and is then unlocked.
This is obviously a very naive approach and doesn't offer much in the way of concurrency, so I was wondering if there is a better way to go about this? Any input whatsoever would be greatly appreciated!
I have just finished a project implementing a concurrent B+ tree. You can find some intuition in CMU 15-445 (Database Systems):
https://15445.courses.cs.cmu.edu/fall2018/slides/09-indexconcurrency.pdf (Slides)
https://www.youtube.com/watch?v=6AiAR_giC6A&list=PLSE8ODhjZXja3hgmuwhf89qboV1kOxMx7&index=9 (Video)
One way to do this is called "latch crabbing". Basically, you need an RWLock for each node.
When you are searching for a leaf node, you take a read (search) or write (insert/delete) lock on each node you visit. Once you discover that a node is "safe" (i.e. it won't split on an insert, or it won't merge/redistribute with neighbors on a delete), you can release the locks on its ancestors, since you know the modification is confined to the subtree under this node. In this way, you are acquiring locks at the front and releasing locks at the back, walking like a crab; that's why it is called "latch crabbing" (I am using "latch" and "lock" loosely here).
This can be hard to implement, good lock :)
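As a rough illustration of the crabbing pattern for an insert, here is a sketch only: it uses plain exclusive locks instead of the reader/writer latches the slides describe, and a simplified "safe" test; the node layout and names are made up for the example:

```python
import threading

class Node:
    def __init__(self, order, leaf=True):
        self.order = order                 # order = max children, so at most order - 1 keys
        self.leaf = leaf
        self.keys = []
        self.children = []                 # empty for leaves
        self.latch = threading.Lock()      # per-node latch (exclusive only, for brevity)

    def safe_for_insert(self):
        # "Safe": an insert below cannot split this node, so its ancestors may be released.
        return len(self.keys) < self.order - 1

def crab_to_leaf(root, key):
    """Descend root -> leaf, keeping latched only the part of the path a split could touch."""
    root.latch.acquire()
    held = [root]
    node = root
    while not node.leaf:
        i = sum(1 for k in node.keys if key >= k)   # index of the child to follow
        child = node.children[i]
        child.latch.acquire()                       # latch the child before looking at it
        held.append(child)
        if child.safe_for_insert():                 # child can absorb the insert, so the
            for ancestor in held[:-1]:              # ancestors can no longer change:
                ancestor.latch.release()            # release them (the "crab" step)
            held = [child]
        node = child
    return held   # caller inserts into held[-1], splits upward through held, then releases all
```

The payoff over the naive "lock the whole path" approach is that on most inserts nothing splits, so only one or two latches are held at a time and other threads can work in other subtrees.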

Inserting into B+ tree using locks

I am trying to figure out how to insert an item into a B+ tree using locks and don't really understand the theory behind it.
So for searching, my view is that I put a lock on the root node, and then decide which child node I should go to and lock it, at this point I can release the parent node and continue this operation until I reach the leaf node.
But inserting is a lot more complicated because I can't allow any other threads to interfere with the insertion. My idea is to put a lock on each node along the path to the leaf node, but putting that many locks is quite expensive; and then the question I have is, what happens when the leaf node splits because it is too large?
Does anyone know how to properly insert an item into a B+ tree using locks?
There are many different strategies for dealing with locking in B-trees in general; most of these actually deal with B+ trees and their variations, since those have been dominating the field for decades. Summarising these strategies would be tantamount to summarising four decades of progress; it's virtually impossible. Here are some highlights.
One strategy for minimising the amount of locking during initial descent is to lock not the whole path starting from the root, but only the sub-path beginning at the last 'stable' node (i.e. a node that won't split or merge as a result of the currently planned operation).
Another strategy is to assume that no split or merge will happen, which is true most of the time anyway. This means the descent can be done by locking only the current node and the child node one will descend into next, then release the lock on the previously 'current' node and so on. If it turns out that a split or merge is necessary after all then re-descend from the root under a heavier locking regime (i.e. path rooted at last stable node).
Another staple in the bag of tricks is to ensure that each node 'descended through' is stable by preventative splitting/merging; that is, when the current node would split or merge under a change bubbling up from below then it gets split/merged right away before continuing the descent. This can simplify operations (including locking) and it is somewhat popular in reinventions of the wheel - homework assignments and 'me too' implementations, rather than sophisticated production-grade systems.
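A minimal single-threaded sketch of that preventative-splitting descent (CLRS-style, with minimum degree t so a node is full at 2t - 1 keys; an assumption for illustration, not taken from any of the papers below). No latching is shown, but because no node you descend through is ever left full, simple parent/child lock coupling would suffice on top of it:

```python
import bisect

class Node:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []            # empty for leaves
        self.leaf = leaf

class BTree:
    def __init__(self, t):
        self.t = t                    # minimum degree: a node holds at most 2t - 1 keys
        self.root = Node(leaf=True)

    def _split_child(self, parent, i):
        """Split the full child parent.children[i]; its median key moves up into parent."""
        t, full = self.t, parent.children[i]
        right = Node(leaf=full.leaf)
        parent.keys.insert(i, full.keys[t - 1])
        parent.children.insert(i + 1, right)
        right.keys, full.keys = full.keys[t:], full.keys[:t - 1]
        if not full.leaf:
            right.children, full.children = full.children[t:], full.children[:t]

    def insert(self, key):
        if len(self.root.keys) == 2 * self.t - 1:        # full root: grow the tree upward first
            old_root, self.root = self.root, Node(leaf=False)
            self.root.children.append(old_root)
            self._split_child(self.root, 0)
        node = self.root
        while not node.leaf:
            i = bisect.bisect_right(node.keys, key)
            if len(node.children[i].keys) == 2 * self.t - 1:   # split *before* descending,
                self._split_child(node, i)                     # so splits never bubble up
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        bisect.insort(node.keys, key)

tree = BTree(t=2)
for k in [5, 3, 8, 1, 9, 7, 2]:
    tree.insert(k)
print(tree.root.keys)     # [5]; all splits happened on the way down, never retroactively
```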
Some strategies allow most normal operations to be performed without any locking at all but usually they require that the standard B+Tree structure be slightly modified; see B-link trees for example. This means that different concurrent threads operating on the tree can 'see' different physical views of this tree - depending on when they got where and followed which link - but they all see the same logical view.
Seminal papers and good overviews:
Efficient Locking for Concurrent Operations on B-Trees (Lehman/Yao 1981)
Concurrent Operations on B*-Trees with Overtaking (Sagiv 1986)
A survey of B-tree locking techniques (Graefe 2010)
B+Tree Locking (slides from Stanford U, including Blink trees)
A Blink Tree method and latch protocol for synchronous deletion in a high concurrency environment (Malbrain 2010)
A Lock-Free B+Tree (Braginsky/Petrank 2012)

What is the unidirectional property and why does it help with hotspots?

In the Kademlia paper it's written that the XOR metric is unidirectional. What does that mean, precisely?
More importantly, in what way does it alleviate the problem of a frequently queried node?
Could you explain that to me from the point of view of a node? I mean, if I, a hotspot, am requested frequently by different nodes, do they exchange cached nodes to get to the target? Can't they just exchange the target IP?
Furthermore, it doesn't seem to me that lookups converge along the same path as written; I think it's more logical that each node follows a different path while getting farther and farther from itself.
Strictly speaking, A^B giving the same distance as B^A is the symmetry of the XOR metric; unidirectionality is the property that, for a given point A and distance d, there is exactly one point B with A^B = d, which is what makes lookups for the same key converge toward the same region of the ID space. I'm not sure that it directly alleviates the problem of a frequently queried node by itself; it's more that nodes at different addresses in the network will perceive the nodes on a search path as being at different distances from themselves, thereby caching different nodes after a query completes. Subsequent queries to local nodes will be given different remote nodes in response, thereby potentially spreading the load around the DHT network somewhat.
When querying the DHT network, the more common query is to ask for data regarding a particular info hash. That data is stored by the nodes whose node IDs have the smallest distances to the info hash in question. It's only when you begin querying nodes that are close to the target info hash that nodes start to respond with IP addresses of peers for that torrent. Nodes can't just arbitrarily return peer IPs, as that would require either that all nodes store all IPs for all torrents, or that nodes perform subsequent queries on your behalf, which would lead to suboptimal network use and be open to exploitation.
Your observation that lookups don't converge on the same path is only correct when there is a surfeit of nodes at the distance being queried. As you get closer to the nodes storing data for the desired info hash, there will be fewer and fewer nodes with such proximity to the target, so toward the end of a query most querying nodes will converge on similar nodes. It's worth keeping in mind that this isn't a problem: those nodes will only be "hot" for data related to that one particular info hash, as the distance between info hashes is going to be very large on average on account of the enormous size of the hash space. Additionally, if it's a popular info hash being queried for, nodes close to that hash that aren't coping with the traffic will be penalized by the network and returned less often by nodes on the search path.
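To see both properties on toy 4-bit IDs (the numbers are chosen arbitrarily, just to make the check exhaustive):

```python
a, b = 0b1011, 0b0110
assert a ^ b == b ^ a                 # symmetry: the distance is the same in both directions

# Unidirectionality: from any point a, each distance d is realised by exactly one ID,
# namely a ^ d, so every node measuring distances to a target ranks IDs the same way.
d = 0b0101
candidates = [x for x in range(16) if a ^ x == d]
assert candidates == [a ^ d]
```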

What are the differences between B trees and B+ trees?

In a b-tree you can store both keys and data in the internal and leaf nodes, but in a b+ tree you have to store the data in the leaf nodes only.
Is there any advantage of doing the above in a b+ tree?
Why not use b-trees instead of b+ trees everywhere, as intuitively they seem much faster?
I mean, why do you need to replicate the key (data) in a b+ tree?
The image below helps show the differences between B+ trees and B trees.
Advantages of B+ trees:
Because B+ trees don't have data associated with interior nodes, more keys can fit on a page of memory. Therefore, it will require fewer cache misses in order to access data that is on a leaf node.
The leaf nodes of B+ trees are linked, so doing a full scan of all objects in a tree requires just one linear pass through all the leaf nodes. A B tree, on the other hand, would require a traversal of every level in the tree. This full-tree traversal will likely involve more cache misses than the linear traversal of B+ leaves.
Advantage of B trees:
Because B trees contain data with each key, frequently accessed nodes can lie closer to the root, and therefore can be accessed more quickly.
The principal advantage of B+ trees over B trees is they allow you to pack in more pointers to other nodes by removing pointers to data, thus increasing the fanout and potentially decreasing the depth of the tree.
The disadvantage is that there are no early outs when you might have found a match in an internal node. But since both data structures have huge fanouts, the vast majority of your matches will be on leaf nodes anyway, making on average the B+ tree more efficient.
Doing a full scan of a B+ tree, as in looking at every piece of data the tree indexes, is much easier and performs much better, since the terminal nodes form a linked list. To do a full scan of a B-tree you need a full tree traversal to find all the data.
B-trees, on the other hand, can be faster when you do a seek (looking up a specific piece of data by key), especially when the tree resides in RAM or other non-block storage. Since you can elevate commonly used nodes in the tree, fewer comparisons are required to get to the data.
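A tiny sketch of why the leaf chain matters for scans (the field names are made up; a real index would locate the starting leaf by descending from the root rather than being handed it):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Leaf:
    keys: List[int] = field(default_factory=list)
    rows: List[str] = field(default_factory=list)
    next: Optional["Leaf"] = None        # sibling pointer: the leaves form a linked list

def range_scan(start: Leaf, lo: int, hi: int) -> List[Tuple[int, str]]:
    """One linear pass along the leaf chain; interior nodes are never revisited."""
    out, leaf = [], start
    while leaf is not None:
        for k, row in zip(leaf.keys, leaf.rows):
            if k > hi:
                return out               # keys are sorted, so we can stop early
            if k >= lo:
                out.append((k, row))
        leaf = leaf.next
    return out

# three chained leaves standing in for a small index
c = Leaf([7, 9], ["g", "i"])
b = Leaf([4, 5], ["d", "e"], next=c)
a = Leaf([1, 2], ["a", "b"], next=b)
print(range_scan(a, 2, 7))               # [(2, 'b'), (4, 'd'), (5, 'e'), (7, 'g')]
```

In a B-tree there is no such chain, so the equivalent scan has to walk back up and down through interior nodes between leaves.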
In a B tree search keys and data are stored in internal or leaf nodes. But in a B+-tree data is stored only in leaf nodes.
Full scan of a B+ tree is very easy because all data are found in leaf nodes. Full scan of a B tree requires a full traversal.
In a B tree, data may be found in leaf nodes or internal nodes. Deletion of internal nodes is very complicated. In a B+ tree, data is only found in leaf nodes. Deletion of leaf nodes is easy.
Insertion in B tree is more complicated than B+ tree.
A B+ tree stores search keys redundantly, whereas a B tree stores no key more than once.
In a B+ tree, the leaf nodes are chained together as a sequential linked list, but in a B tree the leaf nodes cannot be linked this way. Many database implementations prefer the structural simplicity of a B+ tree.
Example from Database System Concepts, 5th edition: a B+-tree and the corresponding B-tree (images not reproduced here).
I guess one crucial point you people are missing is the difference between data and pointers, as explained in this section.
Pointer: a pointer to other nodes.
Data: in the context of database indexes, the data is just another pointer to the real data (the row), which resides somewhere else.
Hence, in the case of a B tree, each node holds three pieces of information: keys, pointers to the data associated with the keys, and pointers to child nodes.
In a B+ tree, internal nodes keep keys and pointers to child nodes, while leaf nodes keep keys and pointers to the associated data. This allows more keys for a given node size; node size is determined mainly by the block size.
The advantage of having more keys per node is explained well above, so I will save my typing effort.
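A back-of-the-envelope illustration of that point, with made-up sizes (a 4 KB block, 8-byte keys, 8-byte pointers; real systems also spend bytes on headers and variable-length keys):

```python
BLOCK = 4096        # hypothetical block size in bytes
KEY = 8             # hypothetical key size
PTR = 8             # hypothetical pointer size (child pointer or row pointer)

# B+ tree interior node: keys + child pointers only
bplus_keys_per_block = BLOCK // (KEY + PTR)            # ~256
# B tree node: keys + child pointers + a row pointer per key
btree_keys_per_block = BLOCK // (KEY + 2 * PTR)        # ~170

print(bplus_keys_per_block, btree_keys_per_block)
# With ~50% more keys per block, each interior block read narrows the search further,
# so the interior levels of the index occupy fewer blocks and cache better.
```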
B+ trees are especially good with block-based storage (e.g. a hard disk). With this in mind, you get several advantages, for example (off the top of my head):
high fanout / low depth: that means you have to fetch fewer blocks to get to the data. With data intermingled with the pointers, each read gets fewer pointers, so you need more seeks to get to the data.
simple and consistent block layout: an inner node holds N pointers and nothing else; a leaf node holds data and nothing else. That makes it easy to parse, debug and even reconstruct.
high key density means the top nodes are almost certainly in cache; in many cases all the inner nodes get cached quickly, so only the data access has to go to disk.
Define "much faster". Asymptotically they're about the same. The differences lie in how they make use of secondary storage. The Wikipedia articles on B-trees and B+trees look pretty trustworthy.
In a B+ tree, since the internal nodes store only keys and pointers, they are significantly smaller than the internal nodes of a B tree (which store data as well as keys).
Hence, more of the B+ tree's index can be fetched from external storage in a single disk read and processed to find the location of the target. With a B tree, a disk read may be required for each and every decision along the way. Hope I made my point clear! :)
"The major drawback of the B-tree is the difficulty of traversing the keys sequentially. The B+ tree retains the rapid random access property of the B-tree while also allowing rapid sequential access."
ref: Data Structures Using C, by Aaron M. Tenenbaum
http://books.google.co.in/books?id=X0Cd1Pr2W0gC&pg=PA456&lpg=PA456&dq=drawback+of+B-Tree+is+the+difficulty+of+Traversing+the+keys+sequentially&source=bl&ots=pGcPQSEJMS&sig=F9MY7zEXYAMVKl_Sg4W-0LTRor8&hl=en&sa=X&ei=nD5AUbeeH4zwrQe12oCYAQ&ved=0CDsQ6AEwAg#v=onepage&q=drawback%20of%20B-Tree%20is%20the%20difficulty%20of%20Traversing%20the%20keys%20sequentially&f=false
The primary distinction between the B-tree and the B+ tree is that the B-tree eliminates the redundant storage of search-key values. Since search keys are not repeated in the B-tree, we may be able to store the index using fewer tree nodes than in the corresponding B+ tree index. However, since a search key that appears in a non-leaf node appears nowhere else in the B-tree, we are forced to include an additional pointer field for each search key in a non-leaf node.
There are space advantages to the B-tree, as repetition does not occur, and it can be used for large indices.
Take one example: you have a table with a huge amount of data per row, which means every instance of the object is big.
If you use a B tree here, then most of the time is spent scanning pages full of row data, which is of no use. In databases, that is the reason for using B+ trees: to avoid scanning object data.
B+ trees separate keys from data.
But if your data size is small, you can store it together with the keys, which is what a B tree does.
A B+ tree is a balanced tree in which every path from the root of the tree to a leaf is of the same length, and each non-leaf node of the tree has between ⌈n/2⌉ and n children, where n is fixed for a particular tree. It contains index pages and data pages.
Binary trees have only two children per parent node; B+ trees can have a variable number of children for each parent node.
One possible use of B+ trees is that they are suitable for situations where the tree grows so large that it does not fit into available memory. Thus, you'd generally expect to be doing multiple I/Os.
It does often happen that a B+ tree is used even when it in fact fits into memory, and then your cache manager might keep it there permanently. But this is a special case, not the general one, and caching policy is separate from B+ tree maintenance as such.
Also, in a B+ tree, the leaf pages are linked together in a linked list (or doubly-linked list), which optimizes traversals (for range searches, sorting, etc.). So the number of pointers is a function of the specific algorithm that is used.

Simple basic explanation of a Distributed Hash Table (DHT)

Could any one give an explanation on how a DHT works?
Nothing too heavy, just the basics.
Ok, they're fundamentally a pretty simple idea. A DHT gives you a dictionary-like interface, but the nodes are distributed across the network. The trick with DHTs is that the node that gets to store a particular key is found by hashing that key, so in effect your hash-table buckets are now independent nodes in a network.
This gives a lot of fault-tolerance and reliability, and possibly some performance benefit, but it also throws up a lot of headaches. For example, what happens when a node leaves the network, by failing or otherwise? And how do you redistribute keys when a node joins so that the load is roughly balanced? Come to think of it, how do you evenly distribute keys anyhow? And when a node joins, how do you avoid rehashing everything? (Remember you'd have to do this in a normal hash table if you increase the number of buckets.)
One example DHT that tackles some of these problems is a logical ring of n nodes, each taking responsibility for 1/n of the keyspace. Once you add a node to the network, it finds a place on the ring to sit between two other nodes, and takes responsibility for some of the keys in its sibling nodes. The beauty of this approach is that none of the other nodes in the ring are affected; only the two sibling nodes have to redistribute keys.
For example, say in a three node ring the first node has keys 0-10, the second 11-20 and the third 21-30. If a fourth node comes along and inserts itself between nodes 3 and 0 (remember, they're in a ring), it can take responsibility for say half of 3's keyspace, so now it deals with 26-30 and node 3 deals with 21-25.
There are many other overlay structures such as this that use content-based routing to find the right node on which to store a key. Locating a key in a ring requires searching round the ring one node at a time (unless you keep a local look-up table, problematic in a DHT of thousands of nodes), which is O(n)-hop routing. Other structures - including augmented rings - guarantee O(log n)-hop routing, and some claim O(1)-hop routing at the cost of more maintenance.
Read the wikipedia page, and if you really want to know in a bit of depth, check out this coursepage at Harvard which has a pretty comprehensive reading list.
DHTs provide the same type of interface to the user as a normal hashtable (look up a value by key), but the data is distributed over an arbitrary number of connected nodes. Wikipedia has a good basic introduction that I would essentially be regurgitating if I wrote more:
http://en.wikipedia.org/wiki/Distributed_hash_table
I'd like to add onto HenryR's useful answer as I just had an insight into consistent hashing. A normal/naive hash lookup is a function of two variables, one of which is the number of buckets. The beauty of consistent hashing is that we eliminate the number of buckets "n", from the equation.
In naive hashing, the first variable is the key of the object to be stored in the table. We'll call the key "x". The second variable is the number of buckets, "n". So, to determine which bucket/machine the object is stored on, you have to calculate hash(x) mod n. Therefore, when you change the number of buckets, you also change the address at which almost every object is stored.
Compare this to consistent hashing. Let's define "R" as the range of the hash function; R is just some constant. In consistent hashing, an object is placed at the point hash(x)/R on a fixed interval (a ring), and the machines are placed on the same interval; the object is stored on the machine whose point comes next. Since the lookup is no longer a function of the number of buckets, we end up with far less remapping when we change the number of buckets.
http://michaelnielsen.org/blog/consistent-hashing/
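A quick way to see the difference is to count how many keys change owner when a fifth node is added, under both schemes (a rough sketch; with only one point per node the exact fraction varies run to run, and real systems use many virtual points per node to smooth this out):

```python
import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(10_000)]

# Naive hashing: bucket = hash(x) mod n. Going from 4 to 5 buckets moves most keys.
moved_mod = sum(h(k) % 4 != h(k) % 5 for k in keys)

# Consistent hashing: nodes and keys share one ring; a key belongs to the next node clockwise.
def owner(node_names, key):
    ring = sorted((h(n), n) for n in node_names)
    i = bisect_right([p for p, _ in ring], h(key))
    return ring[i % len(ring)][1]

before = [f"node-{i}" for i in range(4)]
after = before + ["node-4"]
moved_ring = sum(owner(before, k) != owner(after, k) for k in keys)

print(moved_mod / len(keys), moved_ring / len(keys))
# roughly 0.8 for mod-n versus only the keys falling into the new node's arc for the ring
```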
The core of a DHT is a hash table. Key-value pairs are stored in the DHT, and a value can be looked up with its key. The keys are unique identifiers of values that can range from blocks in a blockchain to addresses to documents.
What differentiates a DHT from a normal hash table is the fact that storage and lookup are distributed across multiple (possibly millions of) nodes or machines. This makes a DHT look like a distributed database used for storage and retrieval. There is no master-slave hierarchy or centralized control among the participating nodes; all the nodes are treated as peers.
A DHT gives the participating nodes the freedom to join or leave the network at any time. For this reason, DHTs are widely used in peer-to-peer (P2P) networks. In fact, part of the motivation behind DHT research stems from their use in P2P networks.
Characteristics of DHT
Decentralized: there is no central authority or coordination.
Scalable: the system can easily scale up to millions of nodes.
Fault-tolerant: the DHT replicates data across nodes, so even if one node leaves the network, it should not affect the other nodes in the network.
Let’s see how lookup happens in a popular DHT protocol like Chord. Consider a circular doubly-linked list of nodes. Each node has a reference pointer to the node previous as well as next to it. The node next to the node in question is called the successor. The node that is previous to the node in question is called the predecessor.
Speaking in terms of a DHT, each node has a unique node ID of k bits and these nodes are arranged in the increasing order of their node IDs.
Assume these nodes are arranged in a ring structure called identifier ring. For each node, the successor has the shortest distance clockwise away. For most nodes, this is the node whose ID is closest to but still greater than the current node’s ID.
To find the node responsible for a particular key, first hash the key K and all the node IDs down to exactly k bits, using a consistent-hashing-style hash function such as SHA-1.
Start at any point in the ring and traverse clockwise until you reach the first node whose node ID is greater than or equal to the key K (wrapping around if necessary). This node is the one responsible for storage and lookup of that particular key.
In an iterative style of lookup, the querying node Q asks a node it already knows (e.g. its successor) for the KV (key-value) pair. If the queried node does not have the target key, it returns a set of nodes S that are closer to the target. Q then queries the nodes in S that are closest to the target, and so on, until either the target KV pair is returned or there are no more nodes to query.
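A stripped-down sketch of the "first node clockwise from the key" rule described above (the 16-bit identifier size and the node names are arbitrary assumptions; real Chord keeps finger tables so each node can jump roughly halfway to the target instead of walking successor by successor):

```python
import hashlib
from bisect import bisect_left

BITS = 16                                   # hypothetical identifier size

def chord_id(name: str) -> int:
    """Hash a node name or key onto the 2**BITS identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (1 << BITS)

def successor(node_ids, key_id):
    """The node responsible for key_id: the first node ID clockwise from it."""
    ids = sorted(node_ids)
    i = bisect_left(ids, key_id)
    return ids[i % len(ids)]                # wrap around past the highest ID

nodes = [chord_id(f"node-{i}") for i in range(8)]
key = chord_id("some-value")
print(f"key {key} lives on node {successor(nodes, key)}")
```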
This lookup works well in an ideal scenario where all the nodes have perfect uptime. But how do we handle nodes leaving the network, either intentionally or through failure? This calls for a robust join/leave protocol.
Popular DHT protocols and implementations
Chord
Kademlia
Apache Cassandra
Koorde
TomP2P
Voldemort
References:
https://en.wikipedia.org/wiki/Distributed_hash_table
https://steffikj19.medium.com/dht-demystified-77dd31727ea7
https://www.linuxjournal.com/article/6797

Resources