Do btrees and b+trees only store data at the leafs? - theory

Do b trees and b+ trees only store data at their leaves? I am assuming that they use their internal nodes to search for the required data.
Is that the case or do they store data in every node?

Non-leaf node "records" contain:
- a pointer (a node "address" of sorts) to a node in the next level down the tree
- the value of the key of the first (or the last, depending on implementation) record in that node
Such non-leaf "records" are listed in key order so that by scanning (or binary searching within) a non-leaf node, one knows which node in the next level down may contain the searched value.
Leaf node records contain complete data records: the key value and whatever else.
Therefore "real" data is only contained in the leaf nodes; the non-leaf nodes only contain [a copy of] the key values, and only for a very small proportion of the data (this proportion depends on the average number of data records found in a leaf node).
This is illustrated in the following image from the Wikipedia article on B+ Trees
The non-leaf node at the top (the only one in this simplistic tree) contains only two non-leaf node records, each with a copy of a key value (bluish color) and a pointer to the corresponding node (gray color). This tree happens to have only two levels, so the "records" in the root node point to leaf nodes. One can imagine additional levels above the topmost node shown (call it the "3-5 node"); in that case the node above it would contain, along with other similar records, a record with key value 3 and a pointer to the "3-5" node.
Also note that only the key values 3 and 5 are contained in non-leaf nodes (i.e. not even all key values are reproduced in the non-leaf nodes).
BTW, in this example the non-leaf nodes contain the key of the last record in the next node (it would also work if the first record were used instead, with a slight difference in how the search logic is implemented).
The leaf nodes contain the key value (in bluish color too) and the corresponding data record (d1, d2... shown in grey). The reddish pointer shown at the end of each leaf node points to the next leaf node, i.e. the one containing the very next data record in key order; these pointers are useful to "scan" a range of data records.
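The search and range-scan logic described above can be sketched roughly as follows. This is a minimal illustration, not a full B+ tree: it assumes the "last key" convention mentioned above, and the class names (`Leaf`, `Internal`), the helper functions, and the small example tree are all made up for this sketch rather than taken from the Wikipedia figure.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list                      # sorted key values
    records: list                   # records[i] is the full data record for keys[i]
    next: Optional["Leaf"] = None   # link to the next leaf in key order


@dataclass
class Internal:
    keys: list       # keys[i] is a copy of the last (largest) key under children[i]
    children: list   # child nodes, either Internal or Leaf


def find_leaf(node, key):
    """Descend from the root to the leaf that may contain `key`."""
    while isinstance(node, Internal):
        # Pick the first child whose largest key is >= key; clamp so that keys
        # beyond the last separator still land in the rightmost child.
        i = min(bisect_left(node.keys, key), len(node.children) - 1)
        node = node.children[i]
    return node


def search(root, key):
    """Return the data record stored for `key`, or None if it is absent."""
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.records[i]
    return None


def range_scan(root, low, high):
    """Yield records with low <= key <= high by following the leaf links."""
    leaf = find_leaf(root, low)
    while leaf is not None:
        for k, rec in zip(leaf.keys, leaf.records):
            if k > high:
                return
            if k >= low:
                yield rec
        leaf = leaf.next


# Hypothetical two-level tree: the root only holds copies of keys 3 and 5.
leaf2 = Leaf(keys=[4, 5], records=["d4", "d5"])
leaf1 = Leaf(keys=[1, 2, 3], records=["d1", "d2", "d3"], next=leaf2)
root = Internal(keys=[3, 5], children=[leaf1, leaf2])
```

Note that `search` always ends in a leaf, even for keys 3 and 5 that also appear in the root: the root copies are only used for routing.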

In a B+ tree, all data is in the leaves.
Wiki on B+.

There is some confusion about BTrees and B+Trees. B+Trees keep data only at the leaf level: each leaf entry holds either the record itself or a pointer to a record stored elsewhere. BTrees may store data in every node. There are advantages and disadvantages to each. I've noticed that some sites draw BTrees exactly the same as B+Trees. In general, because their internal nodes hold only keys, B+Trees tend to be more efficient as indexes.

Related

Where does a clustered index store row data for keys on intermediate levels?

Please help me understand what I feel is an inconsistency between two facts:
- SQL Server stores data in a B-Tree structure
- Only leaf nodes contain actual table data, while intermediate ones store only keys and pointers to children
In general, a B-Tree has the property that, for a given key in the intermediate node, all keys in the left subtree are smaller than it and in the right subtree greater, such that:
In the above example (image credit), clearly a row with the ID = 7 was inserted into the table. But where is the row data (non-key columns) for that ID if it can't be in the root node of the example and there is no 7 in the leaf nodes?
Clearly, there's more to it than "indexes are B-Trees" and I would appreciate some insight.
That diagram is for a B-tree, but technically speaking SQL Server uses a B+tree structure. Scroll down a bit in that Wiki article and you will find:
In the B+ tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in addition, a leaf node may include a pointer to the next leaf node to speed sequential access (Comer 1979, p. 129).
Thus the internal nodes only hold copies of the keys, and those keys are duplicated in the leaves (where, in the case of a clustered index, the actual data is held as well).
You can find more specifics here. You'll notice in the comments section a couple other folks noting that SQL Server uses a B+tree.
I think a good overview is this article:
https://www.simple-talk.com/sql/database-administration/sql-server-storage-internals-101/
See the Indexes section. It shows that keys such as 7 or 16 also have their own entries in the leaves.
Also, I highly recommend the book:
SQL Server 2012 Internals by Kalen Delaney
https://www.amazon.com/Microsoft-Server-Internals-Developer-Reference/dp/0735658560
When building a B-Tree index, SQL Server starts with the leaf level: the data is sorted and written to data pages, and a doubly linked list of those pages is created.
The smallest key value is taken from each page (NULL is used for the very first page) and used to build the index pages for the next level of the index, so each row in an index page contains the ID of the page below and the smallest key value from it. The same is done again, taking the smallest key value from each index page, to create the next level.
This continues until everything fits into a single page: this is the root.
Pages on all intermediate levels and the root follow the same pattern: page ID and the smallest key value from that page.
In the picture above, assuming it's just the three leaf level pages and the root, the root should contain (pageID:1 Key:NULL), (pageID:2 Key:9) and (pageID:3 Key:18).
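As a rough illustration of the bottom-up build described above, here is a small sketch. It is not SQL Server's actual code: the function name `build_levels`, the `fanout` parameter, the page numbering, and the example keys are assumptions made for this example, and the doubly linked list between pages is left out.

```python
def build_levels(leaf_pages, fanout):
    """Build index levels bottom-up from already sorted leaf pages.

    leaf_pages: list of pages, each a non-empty sorted list of keys.
    fanout: maximum number of rows per index page (an assumed value).
    Returns all levels, leaf level first and root level last. Each index row
    is (smallest key of the page below, page number of that page), with
    pages numbered from 1 within their level.
    """
    levels = [leaf_pages]
    current = leaf_pages
    smallest = [page[0] for page in leaf_pages]  # smallest key on each page
    while len(current) > 1:
        rows = [(key, page_no) for page_no, key in enumerate(smallest, start=1)]
        rows[0] = (None, 1)  # the very first row carries NULL instead of a key
        pages = [rows[i:i + fanout] for i in range(0, len(rows), fanout)]
        levels.append(pages)
        # The smallest key under each new index page is that of its first child.
        smallest = [smallest[i] for i in range(0, len(rows), fanout)]
        current = pages
    return levels


# Three hypothetical leaf pages whose smallest keys are 1, 9 and 18.
levels = build_levels([[1, 4, 7], [9, 12, 15], [18, 21, 24]], fanout=4)
root_page = levels[-1][0]
```

With these made-up pages, `root_page` holds the rows `(None, 1)`, `(9, 2)` and `(18, 3)`, which matches the root contents given above.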
(Please excuse my Word drawing skills)
Although your image does represent a B-Tree, SQL Server actually uses a slightly different implementation, specifically a B+Tree. I'll try to explain using visuals as well, taking the diagram below as an example:
As the diagram shows, the keys do not exist in only one node (in this case the root); they are copied and distributed to the child nodes all the way down to the leaf nodes. (In this case the tree only has 2 levels, the root and the leaf level.)
So, when running a query for the key (Adams, Joe), the key will be looked for in the B-Tree as per the rules you mentioned in the question (smaller keys to the left, greater keys to the right).
This will continue until a LEAF node is reached.
At this point there are two cases, specific to SQL Server:
- non-clustered index (represented in the diagram above):
  - the leaf level contains a ROW_ID / PAGE_ID column which points to the data page where that row exists
  - the database engine retrieves that page and looks inside it for the ROW_ID
- clustered index:
  - the leaf level contains the entire data page
  - the database does not need to retrieve another page because it is already at the leaf level, and just does a lookup for the key inside the page

Hash Tables or BST?

Currently I have a problem that I'm trying to figure out, but I'm not sure if my answers are correct.
You have 1 million records. In these records you will frequently need to search by
two criteria: employee ID and salary (but not by both at the same time).
You have the following constraints:
each record is very large and because of that you can only keep one copy of this data.
Your program needs to be reasonably fast. Simply scanning through all the items for each search would be too slow.
What data structure would you use?
My Answer?
I would use a hash table because the worst case time would be O(1000000) = O(1)
How will you retrieve the record when you search by ID?
How will you retrieve the record when you search by salary?
I'd expect many collisions in a hash table keyed on salary, but one keyed on ID could quite easily work with no collisions, since IDs are unique and a suitable hash function can be chosen. It seems odd to want to search by exact salary rather than sort or fetch some range, which could be done much more easily with a BST.
The short of it though is that if you want to search by two independent properties you're going to have to maintain two structures. Fortunately pointers exist, so you don't have to keep multiple copies. Personally I'd keep a hash table of IDs to references, then a BST of salaries to references, but if I'm restricted to one datatype I'd have to do a BST with nodes like this:
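class Node {
    int id;
    Node idLessThan;         // left child in the tree ordered by id
    Node idGreaterThan;      // right child in the tree ordered by id
    int salary;
    Node salaryLessThan;     // left child in the tree ordered by salary
    Node salaryGreaterThan;  // right child in the tree ordered by salary
    Data fileInfo;           // the single copy of the large record
}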
This essentially creates two BSTs over the same set of nodes.
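The "two structures, one copy of the data" idea could look something like the sketch below. It is only an illustration: a Python dict plays the role of the hash table on ID, and a sorted list searched with `bisect` stands in for the BST on salary (inserting into it is O(n), so a real balanced tree would replace it). The class name `EmployeeStore`, its methods, and the sample records are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort


class EmployeeStore:
    def __init__(self):
        self.by_id = {}       # employee id -> record (the hash table)
        self.by_salary = []   # sorted (salary, id) pairs; stand-in for a BST

    def add(self, record):
        # Both structures refer to the same record object; nothing is copied.
        self.by_id[record["id"]] = record
        insort(self.by_salary, (record["salary"], record["id"]))

    def find_by_id(self, emp_id):
        return self.by_id.get(emp_id)

    def find_by_salary(self, low, high=None):
        """Return records with low <= salary <= high (high defaults to low)."""
        high = low if high is None else high
        lo = bisect_left(self.by_salary, (low,))
        hi = bisect_right(self.by_salary, (high, float("inf")))
        return [self.by_id[emp_id] for _, emp_id in self.by_salary[lo:hi]]


store = EmployeeStore()
store.add({"id": 7, "salary": 50000, "name": "Ann"})
store.add({"id": 3, "salary": 42000, "name": "Bob"})
store.add({"id": 9, "salary": 50000, "name": "Cy"})
```

Searching by salary returns the very same record objects that the ID lookup returns, so the large records exist only once.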

Requesting feedback on database design

I am building a database for a Christmas tree growing operation. I have put together what I believe to be a workable schema. I am hoping to get some feedback from someone, and I have no one. You are my only hope.
So, there are 3 growing plots, we will call them Orchards. Each Orchard has rows & columns, and each row/column intersection can have zero or one trees, planted in it. The rows/columns are numbers and letters, so row 3, column f, etc. Each row/column intersection has a status (empty, in use). A tree can be different species (denoted by manually created GID {Genetic ID}), modified (have a different species grafted on), or moved to a different location. So a plant can have one or many locations, and a location can contain, through history, one or many trees, but only one at a time.
Here is a diagram I put together:
So I was thinking that, for historical purposes, I would use the treelocation table. Do you think it is unnecessary?
No, but in that case you should have the information pertaining to the tree's location in the tree location table. For instance "MovedYear". If a tree moves multiple times, don't you want to keep the Year of each Move, instead of just one MovedYear for each tree?
It's fine to have a history table the way you do, but right now, if TreeId 1 has been in 3 different locations, how could you query your database to see which location it's in NOW? All you'll see is:
TreeId LocationId
1 1
1 2
1 3
You won't know in what order the moves took place. (Unless you have some business rule that says trees can only move from 1 to 2 and from 2 to 3, and never follow any other order).
The usual way to solve this is to have a StartDate and EndDate in the history table.
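A minimal sketch of such a history table, using Python's built-in `sqlite3` module so it can be run as-is. The table and column names follow the ones used above where they exist; the dates and the convention that a NULL EndDate marks the current location are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TreeLocation (
        TreeId     INTEGER NOT NULL,
        LocationId INTEGER NOT NULL,
        StartDate  TEXT    NOT NULL,  -- ISO dates sort correctly as text
        EndDate    TEXT               -- NULL (assumed) means the tree is still there
    );
    -- Made-up history for tree 1: three locations over time.
    INSERT INTO TreeLocation VALUES
        (1, 1, '2015-04-01', '2017-03-15'),
        (1, 2, '2017-03-15', '2019-05-02'),
        (1, 3, '2019-05-02', NULL);
""")

# Where is tree 1 now?
current = conn.execute(
    "SELECT LocationId FROM TreeLocation WHERE TreeId = ? AND EndDate IS NULL",
    (1,),
).fetchone()[0]

# In what order did the moves take place?
history = [row[0] for row in conn.execute(
    "SELECT LocationId FROM TreeLocation WHERE TreeId = ? ORDER BY StartDate",
    (1,),
)]
```

With the dates in place, both the current location and the order of the moves can be read directly from the table.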
It seems that the statement
"A plant can have one or many locations"
is not quite right. A plant has one location, but it can move.
To model this we need to:
- Have a location foreign key (FK) inside the Tree table, showing the current tree location.
- Make this FK mandatory (expressing the "has a" relationship).
- Put a unique key constraint on this FK column, to prevent multiple trees from having the same location.
A plant can move, so to trace a plant's location history:
- We will need a plant-location-history table.
"Each row/column intersection has a status (empty, in use)"
So the intersection's status can only take a limited set of predefined values.
Do we need a LocationStatus table? I don't think so. Status can be a static field inside the location table with a check constraint (1 = empty, 2 = in use, 3 = etc.).
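The constraints suggested above could be declared roughly like this. It is a sketch using Python's built-in `sqlite3` module; the table and column names, the two status codes, and the helper `try_insert` are assumptions for illustration, and the syntax may need adjusting for another database engine.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
db.executescript("""
    CREATE TABLE Location (
        LocationId INTEGER PRIMARY KEY,
        -- status as a plain column with a check constraint: 1 = empty, 2 = in use
        Status     INTEGER NOT NULL CHECK (Status IN (1, 2))
    );
    CREATE TABLE Tree (
        TreeId     INTEGER PRIMARY KEY,
        -- mandatory (NOT NULL) and UNIQUE: one current location per tree,
        -- and at most one tree per location
        LocationId INTEGER NOT NULL UNIQUE REFERENCES Location(LocationId)
    );
    INSERT INTO Location VALUES (1, 2), (2, 1);
    INSERT INTO Tree VALUES (10, 1);
""")


def try_insert(sql):
    """Run an INSERT and report whether the constraints accepted it."""
    try:
        db.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False
```

For example, `try_insert("INSERT INTO Tree VALUES (11, 1)")` is rejected because location 1 already holds a tree, while a tree placed at the free location 2 is accepted.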

B Tree and B+Tree Index differences

I'm studying B+ Trees and B Trees and I would like to understand two things about them. If someone can clarify them for me I would appreciate it:
- Why can I store more search keys in a B+ Tree index? My guess is that it is because the nodes of a B+ Tree point to sub-trees instead of data.
- Is there any type of data comparison that will not work with a B+ Tree index, or can I use all of them (=, >=, !=, <, <>...)?
I am not sure that I understand your questions completely (maybe that's why someone gave you a negative vote), but I will give it a try.
A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves.
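From this Wikipedia quote, it follows that the organization of keys/values is different, but I do not deduce from it that either a B-tree or a B+ tree can store more keys in total than the other. Per node, though, your guess is reasonable: an internal B+ tree node holds only keys and child pointers, so more keys fit in a node of the same size.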
If you are asking whether any type of data that has comparison operators can be used as the key, the answer is yes.
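To put rough numbers on the "more search keys per node" part of the question, here is a back-of-the-envelope calculation. All sizes are assumed values chosen for illustration, not figures from any particular database.

```python
PAGE_SIZE = 4096     # bytes per node/page (assumed)
KEY_SIZE = 8         # bytes per key (assumed)
POINTER_SIZE = 8     # bytes per child pointer (assumed)
RECORD_SIZE = 100    # bytes of data kept next to each key in a B-tree node (assumed)


def entries_per_node(entry_size):
    """How many fixed-size entries fit in one page, ignoring page headers."""
    return PAGE_SIZE // entry_size


# B+ tree internal node: each entry is a key plus a child pointer.
bplus_internal = entries_per_node(KEY_SIZE + POINTER_SIZE)

# B-tree node: each entry also carries the data record itself.
btree_internal = entries_per_node(KEY_SIZE + POINTER_SIZE + RECORD_SIZE)
```

Under these assumptions a B+ tree internal node fits 256 entries while a B-tree node fits 35, so the B+ tree needs fewer levels to reach the same number of records.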

Using b trees in a database

I have to implement a database using b trees for a school project. The database is for storing audio files (songs), and a number of different queries can be made, like asking for all the songs of a given artist or of a specific album.
The intuitive idea is to use one b tree for each field (songs, albums, artists, ...). The problem is that one can be asked to delete any member of any field, and in case you delete an artist you have to delete all his albums and songs from the other b trees, keeping in mind that, for example, all the songs of a given artist don't have to be near each other in the b tree that corresponds to songs.
My question is: is there a way to do so (delete the songs after an artist has been deleted) without having to iterate over all elements of the other b trees? I'm not looking for code, just ideas, because all the ones I've come up with are brute-force ones.
This is my understanding and may not be entirely right.
Typically in a database implementation B Trees are used for indexes, so unless you want to force your user to index every column, defaulting to creating a B Tree for each field is unnecessary. Although this many indexes will lead to fast reads in virtually every case (with an index on everything, you won't have to do a full table scan), it will also cause extremely slow inserts/updates/deletes, as the corresponding data has to be updated in each tree. As I'm sure you know, most modern databases will have you define at least one index (the primary key), so you will have at least one B Tree keyed on the primary key, with a pointer to the appropriate node.
Every node in a B Tree index should have a pointer/reference to the full object it represents.
Future indexes created would include the attributes you specify in the index, such as song name, artist, etc., but will still contain the pointer/reference to the corresponding node. Thus when you modify, let's say, the song title, you will want to modify the referenced node which all the indexes reference. If any index has the modified attribute as part of its key, you will also have to modify the values in that index itself.
Unfortunately I believe you are correct that you will have to brute-force your way through the other B Trees when deleting/updating; this is one of the downsides of using a lot of indexes (slower update/delete times). If you just delete the referenced nodes, you will likely end up with pointers to deleted objects, which will (depending on your language) give you some form of a NullPointerException. To prevent this, the references have to be removed from all the trees.
Keep in mind though that doing a full scan of your indexes will still be much better than doing full table scans.

Resources