Understanding B+ tree insertion


I'm trying to create a B+ tree with the following sequence:
10 20 30 40 50 60 70 80 90 100
All index nodes should have a minimum of 2 and a maximum of 3 keys. I was able to insert up to 90, but as soon as I insert 100 the height increases from 2 to 3.
The problem is that the second child of the root has only one key, and I cannot fix it. It should have at least 2, right? Can someone guide me?
UPDATE: I'm following this algorithm:
If the bucket is not full (at most b - 1 entries after the insertion), add the record.
Otherwise, split the bucket.
Allocate new leaf and move half the bucket's elements to the new bucket.
Insert the new leaf's smallest key and address into the parent.
If the parent is full, split it too.
Add the middle key to the parent node.
Repeat until a parent is found that need not split.
If the root splits, create a new root which has one key and two pointers. (That is, the value that gets pushed to the new root gets removed from the original node)
P.S: I'm doing it manually, by hand, to understand the algorithm. There's no code!
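For anyone who wants to check the hand-worked result, the steps above can be sketched in Python. This is a toy under stated assumptions, not a reference implementation: the names `Node`, `insert` and `levels` are made up for illustration, every node holds at most 3 keys (`max_keys=3`), an overflowing leaf is split 2/2 with the right half's smallest key copied up, and an overflowing index node pushes its middle key up.

```python
import bisect

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # stays empty for leaves

def _split(node):
    """Split an overflowing node; return (separator key, new right sibling)."""
    mid = len(node.keys) // 2
    right = Node(node.leaf)
    if node.leaf:
        # Leaf split: the separator is COPIED up and stays in the right leaf.
        right.keys = node.keys[mid:]
        node.keys = node.keys[:mid]
        return right.keys[0], right
    # Index split: the middle key MOVES up and is removed from this level.
    sep = node.keys[mid]
    right.keys = node.keys[mid + 1:]
    right.children = node.children[mid + 1:]
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return sep, right

def _insert(node, key, max_keys):
    if node.leaf:
        bisect.insort(node.keys, key)
    else:
        i = bisect.bisect_right(node.keys, key)
        split = _insert(node.children[i], key, max_keys)
        if split is None:
            return None
        sep, right = split
        node.keys.insert(i, sep)
        node.children.insert(i + 1, right)
    return _split(node) if len(node.keys) > max_keys else None

def insert(root, key, max_keys=3):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key, max_keys)
    if split is None:
        return root
    sep, right = split
    new_root = Node(leaf=False)  # the root split: one key, two pointers
    new_root.keys = [sep]
    new_root.children = [root, right]
    return new_root

def levels(root):
    """Keys of every node, level by level, from the root down."""
    out, layer = [], [root]
    while layer:
        out.append([n.keys for n in layer])
        layer = [c for n in layer for c in n.children]
    return out

root = Node(leaf=True)
for k in range(10, 101, 10):
    root = insert(root, k)
for level in levels(root):
    print(level)
# [[70]]
# [[30, 50], [90]]
# [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
```

Under these assumptions, inserting 100 does grow the tree from 2 to 3 levels, and the second child of the root ends up with a single key (90).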

I believe your B+ tree is OK, assuming its order is 3. If the order is m, each internal node can have ⌈m/2⌉ to m children. In your case, each internal node can have 2 to 3 children. In a B+ tree, a node that has just 2 children requires only 1 key, so your tree violates no constraints.
If you are still confused, look at this B+ Tree Simulator. Try it.

To get the tree you've drawn after inserting the values 10 to 100, the Order of your tree must be 4 not 3. Otherwise the answer given is correct: order m allows m-1 keys in each leaf and each node. After that the Wikipedia description gets a bit confusing as it concentrates on children not keys, and doesn't mention what to do with rounding. Dealing with just keys, the rules are:
Max keys for all nodes = Order-1
Min keys for leaf nodes = floor(Order/2)
Min keys for internal nodes = floor(maxkeys/2)
So you are correct in having one key in the node (order=4, max=3, minleaf=2, minnode=1). You might find this page useful as it has an online JavaScript version of the processes as well as documentation of both insert and delete:
http://goneill.co.nz/btree.php
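
The three key rules above are easy to check with a few lines of Python. This is only a sketch of those rules as stated in this answer; the function name `key_bounds` and the dictionary keys are made up for illustration.

```python
def key_bounds(order):
    """Key limits per node, following the three rules listed above."""
    max_keys = order - 1
    return {
        "max_keys": max_keys,                 # all nodes
        "min_leaf_keys": order // 2,          # floor(Order / 2)
        "min_internal_keys": max_keys // 2,   # floor(maxkeys / 2)
    }

print(key_bounds(4))
# {'max_keys': 3, 'min_leaf_keys': 2, 'min_internal_keys': 1}
```

With order 4 an internal node may hold a single key, which is the situation in the question.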

Related

Optimizing AVLTree with B-tree

PREMISE
Lately I have been thinking about a problem that is common to databases: optimizing insertion, search, deletion and update of data.
Most databases nowadays use a B-tree or B+ tree for this, but those are usually used to store data on disk, and I want to work with in-memory data, so I thought about using an AVL tree (the difference should be minimal, because B-trees and AVL trees serve much the same purpose, although the implementation differs and so do the effects).
Before continuing with the reasoning, I would like to describe in more detail what I am trying to solve.
In a modern database, data is stored in a table with a PRIMARY KEY, which tends to be INDEXED (I am not very experienced with indexing, so what follows is basic reasoning). Usually the PRIMARY KEY is an increasing number starting from 1 (even though that is nowadays considered bad practice).
A plain AVL tree should be more than enough to solve the problem, because it is always balanced and offers O(log2(n)) operations, BUT I wanted to take this further and optimize it even more than needed.
THEORY
As the title of the question suggests, I am trying to optimize the AVL tree by merging it with a B-tree.
Basically, every node of this new tree holds, let's say, an array of ten elements sorted in ascending order, and every node also stores its height in the tree.
INSERTION
Insertion initially fills the array of the root node. When the root node is full, it generates left and right children, each of which also contains an array of 10 elements.
Whenever a new node is added, the tree rebalances itself based on the first key of the vectors of the left and right children, also using their heights (this is how an AVL tree behaves, except that an AVL node has only 2 children and no vector, just a single value).
SEARCH
Searching for an element works this way: starting from the root, we compare the value K we are searching for with the first and last keys of the current node's array. If K lies between them, we know that, if it is present at all, it must be in the current node's array, so we can run a binary search with O(log2(n)) complexity on this array of ten elements. Otherwise we go left if K is smaller than the first key, or right if it is bigger than the last key.
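A minimal sketch of that search in Python, assuming every node holds a non-empty sorted list. The names `VecNode` and `contains` are invented for illustration, and the small tree is built by hand rather than by the insertion procedure described above.

```python
from bisect import bisect_left

class VecNode:
    def __init__(self, keys, left=None, right=None):
        self.keys = sorted(keys)   # assumed non-empty, e.g. up to ten keys
        self.left = left
        self.right = right

def contains(node, k):
    """Return True if k is stored somewhere in the tree rooted at node."""
    while node is not None:
        if k < node.keys[0]:
            node = node.left       # smaller than the first key: go left
        elif k > node.keys[-1]:
            node = node.right      # bigger than the last key: go right
        else:
            # k lies within this node's range: binary search inside the array
            i = bisect_left(node.keys, k)
            return node.keys[i] == k
    return False

tree = VecNode([40, 50, 60], VecNode([10, 20, 30]), VecNode([70, 80, 90]))
print(contains(tree, 20))  # True
print(contains(tree, 55))  # False
```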
DELETION
The same as searching, but we delete the value.
UPDATE
The same as searching, but we update the value.
CONCLUSION
If I am not wrong this should have a complexity of O(log10(log2(n))) which is always logarithmic, so we shouldn't need to care about this optimization, but in my opinion it could make the height of the tree much smaller while also keeping search fast.

B trees and B+ trees are indeed used for disk storage because of their block design. But there is no reason why they could not also be used as in-memory data structures.
The advantages of a B tree include its use of arrays inside a single node. Look-up in a limited vector of maybe 10 entries can be very fast.
Your idea of a compromise between B tree and AVL would certainly work, but be aware that:
You need to perform tree rotations like in AVL in order to keep the tree balanced. In B trees you work with redistributions, merges and splits, but no rotations.
Like with AVL, the tree will not always be perfectly balanced.
You need to describe what will be done when a vector is full and a value needs to be added to it: the node will have to split, and one half will have to be reinjected as a leaf.
You need to describe what will be done when a vector gets a very low fill-factor (due to deletions). If you leave it like that, the tree could degenerate into an AVL tree where every vector only has 1 value, and then the additional vector overhead will make it less efficient than a genuine AVL tree. To keep the fill-factor of a vector above a minimum you cannot easily apply the redistribution mechanism with a sibling node, as would be done in B-trees. It would work with leaf nodes, but not with internal nodes. So this needs to be clarified...
You need to describe what will be done when a value in a vector is updated. Of course, you would insert it in its sorted position: but if it becomes the first or last value in that vector, this may violate the order with regards to left and right children, and so also there you may need to define more precisely the algorithm.
Binary search in a vector of 10 may be overkill: a simple left-to-right scan may be faster, as CPUs are optimised to read consecutive memory. This does not impact the time complexity, since we set that the vector size is limited to 10. So we are talking about doing either at most 4 comparisons (3-4 on average depending on binary search implementation) or at most 10 comparisons (5 on average).
If I am not wrong this should have a complexity of O(log10(log2(n))) which is always logarithmic
Actually, if that were true, it would be sub-logarithmic, i.e. O(log log n). But there is a mistake here. The binary search in a vector is not related to n, but to 10. Also, this work comes in addition to finding the node with that vector. So it is not a logarithm of a logarithm, but a sum of logarithms:
O(log_10(n) + log_2(10)) = O(log n)
Therefore the time complexity is no different than the one for AVL or B-tree -- provided that the algorithm is completed with the missing details, keeping within the logarithmic complexity.
You might also consider implementing a pure B tree or B+ tree: that way you also benefit from some advantages that neither the AVL tree nor the in-between structure has:
The leaves of the tree are all at the same level
No rotations are needed
The tree height only changes at one spot: the root.
B+ trees provide a very fast means of iterating over all values in order.

If the leaf node to the left of the current node has space, should I redistribute the records before insertion?

I'm working on a homework problem involving B+ Trees. The question asks us to insert (2,3,5,7,11,17,19,23,29,31) in a B+ Tree whose node structure has 4 pointers.
This is where I have reached.
[Image 1: Solution so far]
[Image 2]
I want to know whether, to insert 17, I should now split the current node into two and distribute the key-values (same as step 2 in Image 1), or move 5 to the leaf node on the left, update the key-value in the parent node to 7, and then insert 17 by shifting 7 and 11 one place to the left (as in Image 2). The second approach would impose computational overhead but save space. There is no mention of this in the text (Database System Concepts).

Space taken by fringe in DFS

I was recently studying uninformed search from here. For depth-first search, it says that the space taken by the fringe is O(b.m), but I am unable to figure out why (I could not find a proof of this anywhere online). Any help or pointers to specific material would be much appreciated.

A depth-first tree search needs to store only a single path from the root to a leaf node, along with the remaining unexpanded sibling nodes for each node on the path.
Once a node has been expanded, it can be removed from memory as soon as all its descendants have been fully explored.
So, for a state space with branching factor b and maximum depth m, depth-first search requires storage of only O(b*m) nodes.
Ref.: Russell & Norvig, Artificial Intelligence, Figure 3.16, p. 87

Depth-first search (DFS) has to store only a few nodes in the fringe because it processes the most recently added node first (last in, first out), which results in a space complexity of O(b*d). That is, for a depth of d it has to store at most the b children of each of the d nodes above.
Breadth-first search (BFS), however, takes the earliest inserted node first (first in, first out). Because of this, it has to keep track of all the child nodes it encounters, which results in a space complexity of O(b^d).
That is, for a depth of d it has to store the children, the children's children, and so on, resulting in exponential growth.
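
To see the two bounds in numbers, here is a small experiment in Python. It is a sketch under a simplifying assumption: the search space is a complete tree in which every node at depth less than m has exactly b children, and only each node's depth is stored in the fringe. The function name `max_fringe` is made up for illustration.

```python
from collections import deque

def max_fringe(b, m, depth_first):
    """Largest fringe size while searching a complete b-ary tree of depth m."""
    fringe = deque([0])   # the root, at depth 0
    largest = 1
    while fringe:
        # DFS takes the newest entry (LIFO), BFS the oldest one (FIFO).
        depth = fringe.pop() if depth_first else fringe.popleft()
        if depth < m:
            fringe.extend([depth + 1] * b)   # expand: add the b children
        largest = max(largest, len(fringe))
    return largest

print(max_fringe(3, 4, depth_first=True))    # 9  = (b - 1) * m + 1
print(max_fringe(3, 4, depth_first=False))   # 81 = b ** m
```

The DFS fringe holds at most b - 1 unexpanded siblings per level above the current node plus the b children just generated, which is where the linear bound comes from.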

B+ Tree Finding Number of records

Recently during my studies I came across a question like this:
What is the minimum number of levels of a B+ tree and a B-tree index required for 5000 keys if the order of a B+ tree node (P) is 10? (Assume P is the maximum number of pointers that can be stored in a B+ tree node.)
For the B-tree I calculated 4 levels. When attempting the B+ tree I got confused: is the order mentioned in the question the internal node order or the leaf node order? If it is the internal node order, how can the number of levels be calculated when the order of the leaf nodes is not known? Could someone help me?

You are right, the question should have mentioned the leaf node capacity.
Whatever it might be - let's call it L - the number of leaf nodes required is clearly ceiling(N / L) because the leaf node layer must contain all the data. If every leaf node can hold up to 10 records (data items) then we get a minimum leaf node count of 500. Once you have the required number of leaf nodes, you can compute the required height of the index part as usual for a B-tree.
In our case the lowest layer of internal nodes (i.e. the bottom-most layer of the index part of the B+ tree) needs to have at least 500 outgoing pointers in order to reach each leaf. ceiling(log(500)/log(10)) is 3, which gives you the minimum number of index levels above the sequence set. Hence the B+ tree also has at least 4 levels in this case, just like the plain B-tree.
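
The same arithmetic as a short Python sketch, using integer ceiling division instead of logarithms. The function and parameter names (`min_levels`, `fanout`, `leaf_capacity`) are made up for illustration, and the leaf capacity of 10 is the assumption made above, since the question does not state it.

```python
def min_levels(n_keys, fanout, leaf_capacity):
    """Return (number of leaves, total levels) for a fully packed B+ tree."""
    leaves = -(-n_keys // leaf_capacity)   # ceiling(N / L)
    nodes, levels = leaves, 1              # level 1 is the leaf layer
    while nodes > 1:
        nodes = -(-nodes // fanout)        # each index node covers `fanout` nodes below
        levels += 1
    return leaves, levels

print(min_levels(5000, fanout=10, leaf_capacity=10))   # (500, 4)
```

The loop goes 500 -> 50 -> 5 -> 1, i.e. three index levels on top of the leaf layer.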

AVL Tree rotation. Different possibilities

I have a question regarding insertion in an AVL tree. I noticed that there are some cases in which, after you insert an element, both the parent and its child break the AVL condition. For example here https://www.youtube.com/watch?v=EsgAUiXbOBo, at min. 12:50, after 1 is inserted, both 4 and 3 break the AVL condition. My question is: on which node should we do the rotation? The one closest to the root (in this case the root itself) or the one farthest from the root, given that we would get two different trees? Or is it correct either way?

Rotation starts from the bottom (the inserted node).
Suppose all nodes up to and including P have been balanced, so the subtree of P is balanced in the AVL sense. We go to P's parent (Q). The subtree of Q is checked and, if necessary, rotated. The resulting tree (whose root may have changed if a rotation was performed) is balanced. Then advance up again.
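
A minimal recursive sketch of this bottom-up procedure in Python. The names (`AVLNode`, `avl_insert`, and the helper functions) are made up for illustration. Because the balance check runs as the recursion unwinds from the inserted node toward the root, the unbalanced node farthest from the root is the one that gets rotated first.

```python
class AVLNode:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def height(n):
    return n.height if n else 0

def update(n):
    n.height = 1 + max(height(n.left), height(n.right))

def balance(n):
    return height(n.left) - height(n.right)

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y   # x becomes the new subtree root
    update(y)
    update(x)
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x    # y becomes the new subtree root
    update(x)
    update(y)
    return y

def avl_insert(node, key):
    """Insert key below node and return the new root of this subtree."""
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = avl_insert(node.left, key)
    else:
        node.right = avl_insert(node.right, key)
    # From here on we are walking back up: check this node, rotate if needed.
    update(node)
    if balance(node) > 1:                     # left-heavy
        if balance(node.left) < 0:            # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance(node) < -1:                    # right-heavy
        if balance(node.right) > 0:           # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node

root = None
for k in range(1, 8):
    root = avl_insert(root, k)
print(root.key, root.height)   # 4 3
```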
