B+Tree splitting leads to leaf nodes with less capacity

I am currently trying to understand and recreate a database. One common way of storing data in a database is a B+Tree. Data is stored only in so-called 'leaf nodes', which are indexed by 'index nodes'. If a full leaf node is inserted into, it is split, and its contents are distributed equally between the two new nodes. I understand this happens to keep the tree balanced and easily searchable, but I can't quite understand the procedure.
The example is taken from this thread 'splitting a node in b+ tree'.
Let's say we have a B+Tree like this:
              [21 | 30 | 50]
          /      |      |      \
[10|20|-] => [21|22|25] => [30|40|-] => [50|60|80]
And then we insert the key '23'. We would need to split the [21|22|25] node. In order to balance the tree we distribute the keys evenly and arrive at this tree:
          [21 | 23 | 30 | 50]            (the index node absorbs the new separator 23 and now overflows)
        /      |      |      |      \
[10|20|-] => [21|22|-] => [23|25|-] => [30|40|-] => [50|60|80]

                 ||
                 \/

                [ 30 ]
               /      \
       [21 | 23]      [50]
      /    |    \    /    \
[10|20|-] => [21|22|-] => [23|25|-] => [30|40|-] => [50|60|80]
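To make the split mechanics concrete, here is a minimal sketch in C of an even leaf split (the struct layout and names are illustrative assumptions, not taken from any particular database):

#include <string.h>

#define LEAF_CAP 3                 /* max keys per leaf, as in the example */

typedef struct Leaf {
    int keys[LEAF_CAP];
    int nkeys;
    struct Leaf *next;             /* sibling pointer for range scans */
} Leaf;

/* Split a full leaf `left` that is about to receive `key`. The CAP+1
 * logical keys are distributed evenly: the lower half stays in `left`,
 * the upper half moves to the new leaf `right`. Returns the separator
 * key to push into the parent index node. */
int split_leaf(Leaf *left, Leaf *right, int key)
{
    int tmp[LEAF_CAP + 1];
    int i = 0, j, mid;

    /* merge the existing keys and the new key in sorted order */
    while (i < left->nkeys && left->keys[i] < key) { tmp[i] = left->keys[i]; i++; }
    tmp[i] = key;
    for (j = i; j < left->nkeys; j++) tmp[j + 1] = left->keys[j];

    mid = (LEAF_CAP + 1) / 2;      /* even split point */
    left->nkeys = mid;
    memcpy(left->keys, tmp, mid * sizeof(int));
    right->nkeys = LEAF_CAP + 1 - mid;
    memcpy(right->keys, tmp + mid, right->nkeys * sizeof(int));

    right->next = left->next;      /* keep the leaf chain intact */
    left->next = right;
    return right->keys[0];         /* 23 in the example above */
}

Inserting 23 into the full leaf [21|22|25] yields tmp = {21, 22, 23, 25}, so the function leaves [21|22] on the left, creates [23|25] on the right, and returns 23 as the new separator.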
In my understanding of B+Trees, this leaves the [21|22|-] leaf unable to be filled to maximum capacity: any key smaller than 21 will be entered into the [10|20|-] leaf, and 23 is already present.
If we think of leaf nodes as pages, as they are present in a database, this lets precious memory and disk space go to waste, and more disk operations become necessary to find values, making the database slower.
I would really appreciate feedback on my thoughts and some help clearing up misunderstandings if they are present.
Thank you very much!

This is known as the B-tree "fill factor". When inserting in random order, a fill factor of about 70% is the best that can be achieved in practice. When inserting in order, it drops to 50%, as you have observed. The B*-tree split algorithm performs rebalancing before splitting, and it achieves a ~66% fill factor when inserting entries in order.
B-tree designs can improve the fill factor even further using a few techniques. One is to detect in-order insertion and split the leaf unevenly: the new left node keeps most of the original entries, while the new right node starts out mostly empty. This technique also needs to detect reverse insertion order, to avoid creating an abysmal fill factor.
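A hedged sketch of that heuristic in C: choose the split point from where the new key would land, so sequential appends leave the left node full and reverse-sequential inserts leave the right node full (purely illustrative, not the algorithm of any specific engine):

/* `pos` is where the new key lands among the cap+1 logical keys of a
 * full node; returns how many of those keys stay in the left node. */
int split_point(int pos, int cap)
{
    if (pos == cap) return cap;    /* in-order append: left stays full, right starts empty */
    if (pos == 0)   return 1;      /* reverse order: right stays full instead */
    return (cap + 1) / 2;          /* random insertion: fall back to an even split */
}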
By combining various techniques, a fill factor of 100% is possible with in-order insertion, and 80% with random insertion order. Going higher than 80% requires more sophisticated rebalancing steps, which might end up being too costly in practice.

Related

How searching a million keys organized as a B-tree will need 114 comparisons?

Please explain how it will take 114 comparisons. The following is a screenshot taken from my book (page 350, Data Structures Using C, 2nd ed., Reema Thareja, Oxford Univ. Press). My reasoning is that in the worst case each node will have just the minimum number of children (i.e. 5), so I took log base 5 of a million, which comes to 9. Then, assuming that at each level of the tree we search the minimum number of keys (i.e. 4), it comes to somewhere around 36 comparisons, nowhere near 114.
Consider a situation in which we have to search an un-indexed and
unsorted database that contains n key values. The worst case running
time to perform this operation would be O(n). In contrast, if the data
in the database is indexed with a B tree, the same search operation
will run in O(log n). For example, searching for a single key on a set
of one million keys will at most require 1,000,000 comparisons. But if
the same data is indexed with a B tree of order 10, then only 114
comparisons will be required in the worst case.
Page 350, Data Structures Using C, 2nd Ed. Reema Thareja, Oxford Univ. Press
The worst case tree has the minimum number of keys everywhere except on the path you're searching.
If the size of each internal node is in [5,10), then in the worst case, a tree with a million items will be about 10 levels deep, when most nodes have 5 keys.
The worst-case path to a node, however, might have 10 keys in each node. The statement seems to assume that you'll do a linear search instead of a binary search inside each node (I would advise doing a binary search instead), which can lead to around 10 * 10 = 100 comparisons.
If you carefully consider the details, the real number might very well come out to 114.
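As a back-of-the-envelope check of that model (minimum fanout of 5 for the depth, a linear scan of up to 10 keys per node; these assumptions come from the answer above, not from the book's derivation):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1e6;                             /* one million keys */
    int levels = (int)ceil(log(n) / log(5));    /* depth if most nodes have only 5 children */
    int per_node = 10;                          /* worst-case linear scan inside one node */
    printf("levels ~ %d, comparisons ~ %d\n", levels, levels * per_node);
    /* prints: levels ~ 9, comparisons ~ 90 -- the same order of magnitude as 114 */
    return 0;
}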
(This is not an Answer to the question asked, but a related discussion.)
Sounds like a textbook question, not a real-life question.
Counting comparisons is likely to be the best way to judge an in-memory tree, but not for a disk-based dataset.
Even so, the "average" number of comparisons (for in-memory) or disk hits (for disk-based) is likely to be the metric to compute.
(Sure, it is good to compute the maximum numbers as a useful exercise for understanding the structures.)
Perhaps the optimal "tree" for in-memory searching is a binary-search-style tree, but with 3-way fan-out, kept balanced with 2 or 3 elements in each node (in other words, a 2-3 tree).
For disk-based searching -- think databases -- the optimum is likely to be a B-tree with the size of a block based on what is efficient to read from disk. Counting comparisons is a poor second when it comes to the overall time taken to fetch a row.

Optimizing AVLTree with B-tree

PREMISE
So lately I have been thinking about a problem that is common to databases: trying to optimize insertion, search, deletion and update of data.
Usually I have seen that most databases nowadays use a BTree or B+Tree to solve this problem, but those are usually used to store data on disk, and I wanted to work with in-memory data, so I thought about using an AVLTree (the difference should be minimal, because the purpose of BTrees is roughly the same as that of the AVLTree, but the implementation is different and so are the effects).
Before continuing with the reasoning behind this, I would like to go into a deeper level of what I am trying to solve.
So in a modern database, data is stored in a table with a PRIMARY KEY, which tends to be INDEXED (I am not very experienced with indexing, so what I will say is the basic reasoning I put into this problem); usually the PRIMARY KEY is an increasing number starting from 1 (even though nowadays this is considered bad practice).
Normally, using an AVLTree should be more than enough to solve the problem, because this particular tree is always balanced and offers O(log2(n)) operations, BUT I wanted to take this to a deeper level, trying to optimize it even more than needed.
THEORY
So, as the title of the question suggests, I am trying to optimize the AVLTree by merging it with a BTree.
Basically, every node of this new tree is, let's say, an array of ten elements; every node also has its height in the tree, and the elements of each array are ordered ascending.
INSERTION
Insertion initially fills the array of the root node; when the root node is full, it generates left and right children, which also contain an array of 10 elements each.
Whenever a new node is added, the tree rebalances the nodes based on the first key of the vectors of the left and right child, also using their heights (note that this is actually how the AVLTree behaves, except that the AVLTree only has 2 children and no vector, just the values).
SEARCH
Searching for an element works this way: starting from the root, we compare the value we are searching for, K, with the first and last key of the array of the current node. If the value is in between, we know that it must be in the array of the current node, so we can run a binary search, with O(log2(n)) complexity, on this array of ten elements; otherwise we go to the left if the key we are searching for is smaller than the first key, or to the right if it is bigger.
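A minimal sketch of that search step in C, assuming the 10-element array layout described above (all names are illustrative):

#define NODE_CAP 10

typedef struct Node {
    int keys[NODE_CAP];            /* sorted ascending */
    int nkeys;
    struct Node *left, *right;
} Node;

/* Returns 1 if k is present. Descends like an AVL tree on the node's
 * key range, and binary-searches inside the node's sorted array. */
int search(const Node *node, int k)
{
    while (node) {
        if (k < node->keys[0])               { node = node->left;  continue; }
        if (k > node->keys[node->nkeys - 1]) { node = node->right; continue; }
        int lo = 0, hi = node->nkeys - 1;    /* k falls inside this node's range */
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (node->keys[mid] == k) return 1;
            if (node->keys[mid] < k)  lo = mid + 1;
            else                      hi = mid - 1;
        }
        return 0;                            /* in range, but not present */
    }
    return 0;
}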
DELETION
The same as searching, but we delete the value.
UPDATE
The same as searching, but we update the value.
CONCLUSION
If I am not wrong, this should have a complexity of O(log10(log2(n))), which is always logarithmic, so we shouldn't care about this optimization; but in my opinion this could make the height of the tree much smaller while also providing quick search times.
B-trees and B+ trees are indeed used for disk storage because of their block design, but there is no reason why they could not also be used as in-memory data structures.
The advantages of a B tree include its use of arrays inside a single node. Look-up in a limited vector of maybe 10 entries can be very fast.
Your idea of a compromise between B tree and AVL would certainly work, but be aware that:
You need to perform tree rotations like in AVL in order to keep the tree balanced. In B trees you work with redistributions, merges and splits, but no rotations.
Like with AVL, the tree will not always be perfectly balanced.
You need to describe what will be done when a vector is full and a value needs to be added to it: the node will have to split, and one half will have to be reinjected as a leaf.
You need to describe what will be done when a vector gets a very low fill-factor (due to deletions). If you leave it like that, the tree could degenerate into an AVL tree where every vector only has 1 value, and then the additional vector overhead will make it less efficient than a genuine AVL tree. To keep the fill-factor of a vector above a minimum you cannot easily apply the redistribution mechanism with a sibling node, as would be done in B-trees. It would work with leaf nodes, but not with internal nodes. So this needs to be clarified...
You need to describe what will be done when a value in a vector is updated. Of course, you would insert it in its sorted position: but if it becomes the first or last value in that vector, this may violate the order with regards to left and right children, and so also there you may need to define more precisely the algorithm.
Binary search in a vector of 10 may be overkill: a simple left-to-right scan may be faster, as CPUs are optimized to read consecutive memory. This does not impact the time complexity, since the vector size is capped at 10. So we are talking about doing either at most 4 comparisons (3-4 on average, depending on the binary search implementation) or at most 10 comparisons (5 on average).
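For comparison, the left-to-right scan recommended here is just (a trivial sketch):

/* Scan a small sorted vector: at most nkeys comparisons, but branch-
 * predictable and cache-friendly for nkeys <= 10. Returns the index
 * of k, or -1 if absent. */
int linear_search(const int *keys, int nkeys, int k)
{
    for (int i = 0; i < nkeys && keys[i] <= k; i++)
        if (keys[i] == k) return i;
    return -1;
}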
If I am not wrong this should have a complexity of O(log10(log2(n))) which is always logarithmic
Actually, if that were true, it would be sub-logarithmic, i.e. O(log log n). But there is a mistake here. The binary search in a vector is not related to n, but to 10. Also, this work comes in addition to finding the node with that vector. So it is not a logarithm of a logarithm, but a sum of logarithms:
O(log10(n) + log2(10)) = O(log n)
Therefore the time complexity is no different than the one for AVL or B-tree -- provided that the algorithm is completed with the missing details, keeping within the logarithmic complexity.
You should maybe also consider implementing a pure B-tree or B+ tree: that way you also benefit from some advantages that neither the AVL tree nor the in-between structure has:
The leaves of the tree are all at the same level
No rotations are needed
The tree height only changes at one spot: the root.
B+ trees provide a very fast means of iterating over all values in their order.

What is a good representation for a searchable bit matrix with fixed number of columns?

The raw data can be described as a fixed number of columns (on the order of a few thousand) and a large (on the order of billions) and variable number of rows. Each cell is a bit. The desired query would be something like: find all rows where bits 12, 329, 2912 and 3020 are set. Something like
for (i = 0; i < max_ents; i++)
    if ((entry[i].data & mask) == mask)   /* parenthesize: == binds tighter than & in C */
        add_result(i);
In a typical case not many bits (e.g. 5%) are set in any particular row, but that's not guaranteed; there's a degree of variability.
On a higher level, the data describes a bitwise fingerprint of entries, and the data itself is a kind of search index, so maximal speed is desired. What algorithm would be good for this kind of search? At the moment I'm thinking of having a separate sparse (packed/compressed) bit vector for each column. I doubt it's optimal though.
This looks similar to "text search", in particular to intersecting inverted indexes. Let me go through the simplest algorithm for doing that.
First, you should create, for each bit position, a sorted list of the row numbers in which that bit is set. E.g., for the table of rows:
Row 1 -> 10110
Row 2 -> 00111
Row 3 -> 11110
Row 4 -> 00011
Row 5 -> 01010
Row 6 -> 10101
you can create an inverted index:
Bit 0 is set in -> 2, 4, 6
Bit 1 is set in -> 1, 2, 3, 4, 5
Bit 2 is set in -> 1, 2, 3, 6
etc.
Now, for a query (let's say bits 0 & 1 & 2), you just have to intersect these sorted lists using a merge-sort-like algorithm: first intersect lists 0 and 1, giving you {2, 4}, and then intersect that with list 2, giving you {2}.
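The core step of that merge, the linear intersection of two sorted postings lists, might look like this sketch in C:

#include <stddef.h>

/* Intersect two ascending arrays a[0..na) and b[0..nb) into out[],
 * returning the number of common elements. Runs in O(na + nb). */
size_t intersect(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else { out[k++] = a[i]; i++; j++; }  /* row appears in both lists */
    }
    return k;
}

Applied to the example: intersect({2,4,6}, {1,2,3,4,5}) gives {2,4}, and intersecting that with {1,2,3,6} gives {2}.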
Several optimizations are possible, including, but not limited to, compressing these lists (since the difference between consecutive items is typically small), doing more efficient merging, etc.
But, to save more hassle, why not reuse work that others have already done? ;)... You can readily use (it should take less than 1 day of coding) any open-source text search engine (I suggest Lucene) to perform this task, and it will contain several optimizations that people have built over a long time ;). (Hint: you should treat each row as a "doc" in text-search parlance, and each bit as a "token".)
Edit (adding some of the algorithms at the request of the question's author):
a) Compression: One of the most effective things you can do is compress the postings lists (the sorted list corresponding to each bit position). Most algorithms take differences of consecutive terms and then compress them according to some encoding (Gamma coding and varint encoding, to name a few). This shrinks the inverted list so that it either consumes less file space (thus less file I/O) or uses less memory to encode the same set of numbers. In your case, I estimate that each postings list will contain ~5% * 1e9 = 5e7 elements. If they are uniformly distributed across 0 - 1e9, the gaps should be around 20, so let us say encoding each gap takes ~8 bytes on average (this is a large overestimation), adding up to 500MB per list. So for 1000 lists you will need 500GB of space, which definitely needs to live on disk. This in turn means that you should go for as good a compression algorithm as possible, since better compression means less file I/O and you are going to be I/O bound.
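A minimal sketch of the delta-plus-varint idea in C (LEB128-style continuation bits; one of many possible encodings, shown for illustration only):

#include <stdint.h>
#include <stddef.h>

/* Encode an ascending postings list as gaps, each gap as a varint:
 * 7 payload bits per byte, high bit set on every byte but the last.
 * Gaps around 20 therefore cost a single byte each. Returns the
 * number of bytes written to out[]. */
size_t encode_postings(const uint32_t *ids, size_t n, uint8_t *out)
{
    size_t k = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t gap = ids[i] - prev;
        prev = ids[i];
        while (gap >= 0x80) { out[k++] = (uint8_t)(gap | 0x80); gap >>= 7; }
        out[k++] = (uint8_t)gap;
    }
    return k;
}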
b) Intersection order: You should always intersect lists starting from the smallest, since that is guaranteed to create the smallest intermediate lists, which means fewer comparisons later by the techniques shown below.
c) Merge algorithm: Since your index almost certainly spills to disk, there is probably not much you can do at an algorithmic level. But one idea is to use a binary-search-based procedure for merging two lists instead of the straightforward linear merge when one of the lists is much smaller than the other (this leads to O(N*log(M)) complexity instead of O(N+M), where M >> N). But for file-based indices this is almost never a good idea, since binary search makes many random accesses, which can completely ruin your disk latency, whereas the linear merge procedure is strictly sequential.
d) Skip lists: This is another great data structure for storing sorted postings lists, which can then also support the efficient "binary search" mentioned before. The key idea here is that the upper levels of the skip list can be kept in memory, which can greatly speed up the last stages of your intersection algorithm: you simply search through the in-memory upper levels to get to a disk offset, and then do the disk access from there. There is a point at which a binary search + skip-list based merge becomes more efficient than a linear merge, and it can be found by experimentation.
e) Caching: A no-brainer. If some of your terms occur frequently, cache them in memory so that you can get them more efficiently in the future. Note that the cache can also be, e.g., a faster flash-based disk, which can give you better throughput as well as cache a significant number of the more frequent terms (32GB of memory can only hold ~64 of these lists, whereas a 256GB flash disk can hold ~512).

Which data structure should I use to represent a large number of records, each representing a range of items?

I'm looking for a software representation for a very large number of records (more than 400K).
Each record has two keys: one for the lower bound and one for the upper bound. These numbers represent a range. Each record also has some piece of information; let's call it I. In other words, each record aggregates common item indexes and has some common description of them.
My software is given an item number, and I have to retrieve that info about it.
I thought about AVL trees, B-trees or Fibonacci heaps, but I'm not sure which will be best for that large a number of records. I would definitely go for an AVL / balanced AVL tree for a small database.
From a data structure point of view, what you are looking for is an interval tree.
The Wikipedia article is quite good. What you can do is augment a (balanced) binary search tree like an AVL tree or a Red-Black tree. Interval trees based on binary search trees have their own section in the classic DS book by Cormen et al.
A good data structure scales well to large amounts of data. The complexity of the major dictionary operations is O(k + log n), where n is the number of intervals in the tree and k is the number of overlapping intervals in the range. This is usually pretty good. It grows slowly with the number of interval items, except for cases where many or most intervals overlap all the others.
If you cannot hold your data in main memory, a B-Tree would be a good choice.
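As an aside: if the 400K ranges happen not to overlap (which "aggregates common item indexes" hints at), no augmentation is needed at all; a binary search over records sorted by lower bound already answers the query. A sketch in C under that assumption:

#include <stddef.h>

typedef struct { long lo, hi; const char *info; } Record;

/* records[] sorted ascending by lo, ranges assumed non-overlapping.
 * Returns the record whose [lo, hi] contains item, or NULL. */
const Record *find_range(const Record *records, size_t n, long item)
{
    size_t lo = 0, hi = n;                   /* half-open search window */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (records[mid].lo <= item) lo = mid + 1;
        else                         hi = mid;
    }
    if (lo == 0) return NULL;                /* item precedes every range */
    const Record *r = &records[lo - 1];      /* last record with .lo <= item */
    return (item <= r->hi) ? r : NULL;
}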
Any database will do what you want just fine.
If you are searching on an index, the increase in look-up speed when going from 2 to 4 records is the same as going from 2 million to 4 million records...one more level to the tree...it's an exponential relationship.

High-Performance Hierarchical Text Search

I'm now in the final stages of upgrading the hierarchy design in a major transactional system, and I have been staring for a while at this 150-line query (which I'll spare you all the tedium of reading) and thinking that there has got to be a better way.
A quick summary of the question is as follows:
How would you implement a hierarchical search that matches several search terms at different levels in the hierarchy, optimized for fastest search time?
I found a somewhat related question, but it's really only about 20% of the answer I actually need. Here is the full scenario/specification:
The end goal is to find one or several arbitrary items at arbitrary positions in the hierarchy.
The complete hierarchy is about 80,000 nodes, projected to grow up to 1M within a few years.
The full text of an entire path down the hierarchy is unique and descriptive; however, the text of an individual node may not be. This is a business reality, and not a decision that was made lightly.
Example: a node might have a name like "Door", which is meaningless by itself, but the full context, "Aaron > House > Living Room > Liquor Cabinet > Door", has a clear meaning: it describes a specific door in a specific location. (Note that this is just an example; the real design is far less trivial.)
In order to find this specific door, a user might type "aaron liquor door", which would likely turn up only one result. The query is translated as a sequence: An item containing the text "door", under an item containing the text "liquor", under another item containing the text "aaron."
Or, a user might just type "house liquor" to list all the liquor cabinets in people's houses (wouldn't that be nice). I mention this example explicitly to indicate that the search need not match any particular root or leaf level. This user knows exactly which door he is looking for, but can't remember offhand who owns it, and would remember if the name popped up in front of him.
All terms must be matched in the specified sequence, but as the above examples suggest, levels in the hierarchy can be "skipped." The term "aaron booze cabinet" would not match this node.
The platform is SQL Server 2008, but I believe that this is a platform-independent problem and would prefer not to restrict answers to that platform.
The hierarchy itself is based on hierarchyid (materialized path), indexed both breadth-first and depth-first. Each hierarchy node/record has a Name column which is to be queried on. Hierarchy queries based on the node are extremely fast, so don't worry about those.
There is no strict hierarchy - a root may contain no nodes at all or may contain 30 subtrees fanning out to 10,000 leaf nodes.
The maximum nesting is arbitrary, but in practice it tends to be no more than 4-8 levels.
The hierarchy can and does change, although infrequently. Any node can be moved to any other node, with the obvious exceptions (parent can't be moved into its own child, etc.)
In case this wasn't already implied: I do have control over the design and can add indexes, fields, tables, whatever might be necessary to get the best results.
My "dream" is to provide instant feedback to the user, as in a progressive search/filter, but I understand that this may be impossible or extremely difficult. I'd be happy with any significant improvement over the current method, which usually takes between 0.5s to 1s depending on the number of results.
For the sake of completeness, the existing query (stored procedure) starts by gathering all leaf nodes containing the final term, then joins upward and excludes any whose paths don't match with the earlier terms. If this seems backward to anyone, rest assured, it is a great deal more efficient than starting with the roots and fanning out. That was the "old" way and could easily take several seconds per search.
So my question again: Is there a better (more efficient) way to perform this search?
I'm not necessarily looking for code, just approaches. I have considered a few possibilities but they all seem to have some problems:
Create a delimited "path text" column and index it with Full-Text Search. The trouble is that a search on this column would return all child nodes as well; "aaron house" also matches "aaron house kitchen" and "aaron house basement".
Create a NamePath column that is actually a nested sequence of strings, using a CLR type, similar to hierarchyid itself. The problem is, I have no idea how Microsoft is able to "translate" queries on this type into index operations, and I'm not even sure that is possible with a UDT. If the net result is just a full index scan, I've gained nothing by this approach.
It's not really the end of the world if I can't do better than what I already have; the search is "pretty fast", nobody has complained about it. But I'm willing to bet that somebody has tackled this problem before and has some ideas. Please share them!
Take a look at Apache Lucene. You can implement very flexible yet efficient searches using Lucene. It may be useful.
Also take a look at Search Patterns - what you are describing may fit into the Faceted Search pattern.
It is quite easy to implement a trivial "Aaron House Living Door" algorithm, but I am not sure the regular SVM/classification/entropy-based algorithms would scale to a large data set. You may also want to look at the "approximation search" concepts by Motwani and Raghavan.
Please post back what you find, if possible :-)
Hi Aaron, I have the following idea:
From your description I have the following image in my mind:

                Aaron
               /     \
           House      Cars
             |       /    \
      Living Room  Ferrari  Mercedes
             |
      Liquor Cabinet
       /     |     \
   Table   Door   Window
This is how your search tree might look. Now I would sort the nodes on every level (edges then cross between levels, but every node keeps its parent link):

Level 0: Aaron
Level 1: Cars, House
Level 2: Ferrari, Living Room, Mercedes
Level 3: Liquor Cabinet
Level 4: Door, Table, Window
Now it should be easy and fast to process a query:
Start with the last word in the query and the lowest node level (the leaves).
Since all the nodes within one level are sorted, you can use binary search and therefore find a match in O(log N) time, where N is the node count.
Do this for every level. There are O(log N) levels in the tree.
Once you find a match, process all its parent nodes to see if the path matches your query. The path has length O(log N). If it matches, store it in the results to be shown to the user.
Let M be the number of overall matches (the number of nodes matching the last word in the query). Then our processing time is O((log N)^2 + M * log N):
Binary search takes O(log N) time per level and there are O(log N) levels, so we have to spend at least O((log N)^2) time. Now, for every match, we have to test whether the complete path from the matching node up to the root matches the complete query. The path has length O(log N). Thus, given M matches overall, we spend another O(M * log N) time, and the resulting execution time is O((log N)^2 + M * log N).
When you have few matches, the processing time approaches O((log N)^2), which is pretty good. In the opposite, worst case (every single path matches the query, M = N), the processing time approaches O(N log N), which is not too good, but also not too likely.
Implementation:
You said that you only wanted an idea, and my knowledge of databases is very limited, so I won't write much here, just outline some ideas.
The node table could look like this:
ID : int
Text : string
Parent : int -> node ID
Level : int // I don't expect this to change too often, so you can store it and update it as the database changes.
This table would have to be sorted by the "Text" column. Using the algorithm described above, an SQL query inside the loop might look like:
SELECT ID FROM node WHERE Level = $i AND Text LIKE $text
I hope you get my point.
One could speed things up even more by sorting the table not just by the "Text" column, but by the combined "Level" and "Text" columns; that is, all entries within Level=20 sorted, all entries within Level=19 sorted, etc. (no overall sorting of the complete table is necessary). However, the node count PER LEVEL is in O(N), so there is no asymptotic runtime improvement, but I think it's worth trying out, considering the lower constants you get in practice.
Edit: Improvement
I just noticed that the iterative algorithm is completely unnecessary (and thus the Level information can be abandoned). It is fully sufficient to:
Store all nodes sorted by their text value
Find all matches for the last word of the query at once using binary search over all nodes.
From every match, trace the path up to the root and check if the path matches the whole query.
This improves the runtime to O(log N + M * (log N)).
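A sketch in C of step 3, the upward path check (case-folding and the binary search over the sorted texts are omitted for brevity; all names are illustrative):

#include <string.h>

typedef struct Node {
    const char *text;
    struct Node *parent;                     /* NULL at the root */
} Node;

/* terms[0..nterms-1] is the query in root-to-leaf order; `match` is a
 * node whose text already matched terms[nterms-1]. Walk upward and try
 * to consume the remaining terms in order; levels may be skipped. */
int path_matches(const Node *match, const char **terms, int nterms)
{
    int t = nterms - 2;                      /* next term to find, going up */
    for (const Node *n = match->parent; n && t >= 0; n = n->parent)
        if (strstr(n->text, terms[t]))       /* substring test, like LIKE '%term%' */
            t--;
    return t < 0;                            /* all remaining terms were found */
}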
