I'm studying B+ trees and B-trees and would like to understand two things about them. I would appreciate it if someone could clarify them for me:
Why can I store more search keys in a B+ tree index? My guess is that it is because the nodes of a B+ tree point
to sub-trees instead of data.
Is there any type of comparison that will not work with a
B+ tree index, or can I use all of them (=, >=, !=, <, <>, ...)?
I am not sure that I understand your questions completely (maybe that's why someone gave you a negative vote), but I will give it a try.
A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves.
From this Wikipedia quote, it follows that the organization of keys/values is different, but I do not deduce that either B or B+ can store more keys than the other.
If you are asking whether any type of data that has comparison operators can be used as the key, the answer is yes.
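To make the point about comparison operators concrete, here is a minimal sketch, assuming the leaf-level keys are simply kept in sorted order (which is what a B+ tree gives you). The key list and the helper names `eq`, `ge`, `lt` and `ne` are made up for illustration. Each operator turns into one or two binary searches over the sorted keys:

```python
from bisect import bisect_left, bisect_right

# Leaf-level keys in sorted order (made-up data, duplicates allowed).
keys = [3, 7, 7, 9, 12, 18]

def eq(k):
    """key = k: one contiguous slice of the sorted keys."""
    return keys[bisect_left(keys, k):bisect_right(keys, k)]

def ge(k):
    """key >= k: everything from the first match onwards."""
    return keys[bisect_left(keys, k):]

def lt(k):
    """key < k: everything before the first match."""
    return keys[:bisect_left(keys, k)]

def ne(k):
    """key != k (same as <>): the part below k plus the part above k."""
    return keys[:bisect_left(keys, k)] + keys[bisect_right(keys, k):]
```

Note that `ne` still works, but it returns nearly the whole key range, so in practice an index helps much less for != than for = or a narrow range.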
Please help me in understanding what I feel is inconsistency between two facts:
SQL Server stores data in a B-Tree structure
Only leaf nodes contain actual table data, while intermediate ones store only keys and pointers to children
In general, a B-Tree has the property that, for a given key in the intermediate node, all keys in the left subtree are smaller than it and in the right subtree greater, such that:
In the above example (image credit), clearly a row with the ID = 7 was inserted into the table. But where is the row data (non-key columns) for that ID if it can't be in the root node of the example and there is no 7 in the leaf nodes?
Clearly, there's more to it than "indexes are B-Trees" and I would appreciate some insight.
That diagram is for a B-tree, but technically speaking SQL Server uses a B+tree structure. Scroll down a bit in that Wiki article and you will find
In the B+ tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in addition, a leaf node may include a pointer to the next leaf node to speed sequential access (Comer 1979, p. 129).
Thus the internal nodes would only have a copy of the keys, and will be duplicated in the leaves (where, in the case of a clustered index, the actual data are held as well).
You can find more specifics here. You'll notice in the comments section a couple other folks noting that SQL Server uses a B+tree.
I think this article is a good overview:
https://www.simple-talk.com/sql/database-administration/sql-server-storage-internals-101/
See the Indexes section. It shows that keys such as 7 or 16 also have their own entries in the leaves.
Also, I highly recommend the book:
SQL Server 2012 Internals by Kalen Delaney
https://www.amazon.com/Microsoft-Server-Internals-Developer-Reference/dp/0735658560
When building a B-Tree index, it starts with the leaf level - the data is sorted and written to data pages and a double-linked list created.
The smallest key value (NULL from the very first page) is taken from each page and used to build the index pages for the next level of index, so each row in the index page contains the ID of the page below and the smallest key value from it. It does the same again, taking the smallest key value from each index page, to create the next level.
This continues until everything fits into a single page - this is the root.
Pages on all intermediate levels and the root follow the same pattern: the page ID and the smallest key value from that page.
In the picture above, assuming it's just the three leaf level pages and the root, the root should contain (pageID:1 Key:NULL), (pageID:2 Key:9) and (pageID:3 Key:18).
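The bottom-up build described above can be sketched roughly as follows. This is only an illustration of the idea, not how SQL Server actually lays out pages: the function name `build_index`, the `fanout` parameter, the way new page IDs are allocated and the leaf keys are all assumptions.

```python
def build_index(leaves, fanout):
    """Build index levels bottom-up from sorted leaf pages.

    leaves: list of (page_id, sorted_keys) in key order.
    fanout: max rows per index page (assumed >= 2).
    Returns a list of levels, lowest index level first. Each level is a
    list of index pages; each page is a list of (child_page_id, smallest_key).
    """
    # One row per leaf page: its ID and its smallest key.
    below = [(page_id, keys[0]) for page_id, keys in leaves]
    # The very first page is recorded with NULL (None) as its key.
    below[0] = (below[0][0], None)

    next_id = max(page_id for page_id, _ in leaves) + 1  # hypothetical ID allocation
    levels = []
    while True:
        # Pack the rows describing the level below into index pages.
        pages = [below[i:i + fanout] for i in range(0, len(below), fanout)]
        levels.append(pages)
        if len(pages) == 1:
            return levels  # everything fits into a single page: the root
        # Otherwise take the smallest key of each index page and repeat.
        below = []
        for page in pages:
            below.append((next_id, page[0][1]))
            next_id += 1

# Three leaf pages; the keys are made up so that pages 2 and 3 start at 9 and 18.
leaves = [(1, [1, 4, 7]), (2, [9, 12, 16]), (3, [18, 20, 25])]
root_only = build_index(leaves, fanout=4)
```

With a fanout large enough to hold all three rows, the single resulting page is the root and contains exactly the rows listed above.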
(Please excuse my Word drawing skills)
Although your image does represent a B-Tree, the actual SQL Server has a slightly different implementation, specifically a B+Tree. I'll try to explain using visuals as well, taking the below diagram as an example:
As the diagram shows, the keys do not exist in only one node (in this case the root); they are copied and distributed to the child nodes, all the way down to the leaf nodes. (In this case the tree only has 2 levels: the root and the leaf level.)
So, when running a query for the key (Adams, Joe), the key will be looked for in the B-Tree as per the rules you mentioned in the question (smaller keys to the left, greater keys to the right).
This will continue until a LEAF node is reached.
At this point there are 2 distinctions, specifically for SQL Server:
non-clustered index (represented in the diagram above):
contains a ROW_ID / PAGE_ID column which points to a data page where that row exists
the database engine retrieves that page and looks inside it for the ROW_ID
clustered index:
contains the entire data page at the leaf level
the database does not need to retrieve the page because it is already at the leaf level, and just does a lookup for the key inside the page
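The difference between the two leaf layouts can be sketched like this. It is a toy model, not SQL Server's actual storage format: plain dictionaries stand in for the leaf level (the tree descent is left out), and the page IDs, row IDs and the "department" column are invented.

```python
# Hypothetical data pages: page_id -> {row_id -> full row}.
data_pages = {
    10: {0: ("Adams", "Joe", "Sales"), 1: ("Baker", "Ann", "IT")},
}

# Non-clustered leaf level: key -> (page_id, row_id) locator only.
nonclustered_leaf = {
    ("Adams", "Joe"): (10, 0),
    ("Baker", "Ann"): (10, 1),
}

# Clustered leaf level: key -> the full row itself.
clustered_leaf = {
    ("Adams", "Joe"): ("Adams", "Joe", "Sales"),
    ("Baker", "Ann"): ("Baker", "Ann", "IT"),
}

def lookup_nonclustered(key):
    page_id, row_id = nonclustered_leaf[key]  # step 1: locator from the index leaf
    return data_pages[page_id][row_id]        # step 2: extra fetch of the data page

def lookup_clustered(key):
    return clustered_leaf[key]                # the leaf already holds the row
```

Both lookups return the same row; the non-clustered one just needs one more hop to get there.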
I have to implement a database using B-trees for a school project. The database is for storing audio files (songs), and a number of different queries can be made, like asking for all the songs of a given artist or of a specific album.
The intuitive idea is to use one B-tree for each field (songs, albums, artists, ...). The problem is that one can be asked to delete any member of any field, and if you delete an artist you have to delete all of their albums and songs from the other B-trees, keeping in mind that, for example, all the songs of a given artist don't have to be near each other in the B-tree that corresponds to songs.
My question is: is there a way to do so (delete the songs after an artist has been deleted) without having to iterate over all elements of the other B-trees? I'm not looking for code, just ideas, because all the ones I've come up with are brute-force ones.
This is my understanding and may not be entirely right.
Typically in a database implementation, B-trees are used for indexes, so unless you want to force your user to index every column, defaulting to creating a B-tree for each field is unnecessary. Although this many indexes will lead to fast reads in virtually every case (with an index on everything, you won't have to do a full table scan), it will also make inserts, updates and deletes extremely slow, as the corresponding data has to be updated in each tree. As I'm sure you know, modern databases generally require you to have at least one index (the primary key), so you will have at least one B-tree keyed on the primary key, with a pointer to the appropriate node.
Every node in a B Tree index should have a pointer/reference to the full object it represents.
Any further indexes you create would include the attributes you specify in the index, such as song name, artist, etc., but will still contain the pointer/reference to the corresponding node. Thus when you modify, let's say, the song title, you will want to modify the referenced node that all the indexes reference. If you have any indexes that have the modified attribute as a key, you will have to modify the values in those indexes as well.
Unfortunately, I believe you are correct that you will have to brute-force your way through the other B-trees when deleting/updating; this is one of the downsides of using a lot of indexes (slower updates and deletes). If you just delete the referenced nodes, you will likely end up with pointers to deleted objects, which will (depending on your language) give you some form of NullPointerException. To prevent this, the references will have to be removed from all the trees.
Keep in mind though that doing a full scan of your indexes will still be much better than doing full table scans.
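As a rough illustration of the dangling-reference problem and the brute-force cleanup, here is a small sketch. Plain dictionaries stand in for the B-tree indexes, and the names `records`, `by_title`, `by_album` and `delete_artist` as well as the data are invented for the example.

```python
# The full objects, keyed by a record id.
records = {
    1: {"title": "Song A", "album": "First", "artist": "X"},
    2: {"title": "Song B", "album": "First", "artist": "X"},
    3: {"title": "Song C", "album": "Other", "artist": "Y"},
}

# Stand-ins for two B-tree indexes: indexed value -> record id(s).
by_title = {"Song A": 1, "Song B": 2, "Song C": 3}
by_album = {"First": [1, 2], "Other": [3]}

def delete_artist(artist):
    """Delete an artist's records and scrub every index that references them."""
    doomed = {rid for rid, rec in records.items() if rec["artist"] == artist}
    for rid in doomed:
        del records[rid]
    # Brute-force pass over each index: drop entries that point at deleted records.
    for title in [t for t, rid in by_title.items() if rid in doomed]:
        del by_title[title]
    for album in list(by_album):
        by_album[album] = [rid for rid in by_album[album] if rid not in doomed]
        if not by_album[album]:
            del by_album[album]
    return doomed

deleted = delete_artist("X")
```

If the two index passes were skipped, `by_title["Song A"]` would still return id 1, and looking that id up in `records` would fail, which is the dangling pointer described above.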
I'm trying to wrap my head around graph databases. So maybe someone could help explain to me the right way to model this relationship. This is mostly from the perspective of neo4j, but I assume it would be applicable to most graph databases
I have a Recipe node and Ingredient nodes. The Ingredient nodes have an ingredient_in relationship to the Recipe node. The relationship will hold several attributes; of particular note is an amount field with a unit of measure.
I can imagine that elsewhere in the graph there would be UnitOfMeasure nodes that would have converts_to relationships with a conversion ratio.
The point I'm struggling with is how to represent the Ingredient->Recipe relationship as having a UnitOfMeasure. Coming from an RDBMS, this feels like I need another node in between, but that feels wrong for a graph database.
It depends on two things:
a) do you have attributed relations or n-ary relations
b) how do you use the units and amounts - possibly a node in between is easier
Imo, using a "normal" design like this
Recipe -- Entry -- Ingredient
          amount: double
            |
            |
          UnitOfMeasure
is fine with Entry being a Node - even if you use a graph database which can handle attributed edges. The design would be quite the same with an attributed n-ary edge btw - the only difference would be that Entry, now possibly named "contains", would be an Edge not a Node.
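To show what the in-between node looks like in practice, here is a minimal in-memory sketch of that design. It is plain Python rather than neo4j, and the class and field names simply mirror the diagram; the recipe data is made up.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class UnitOfMeasure:
    name: str

@dataclass(frozen=True)
class Ingredient:
    name: str

@dataclass
class Entry:
    """The node in between: links one ingredient to a recipe with an amount and a unit."""
    ingredient: Ingredient
    amount: float
    unit: UnitOfMeasure

@dataclass
class Recipe:
    name: str
    entries: list = field(default_factory=list)  # list of Entry nodes

gram = UnitOfMeasure("gram")
flour = Ingredient("flour")

pancakes = Recipe("pancakes")
pancakes.entries.append(Entry(ingredient=flour, amount=200.0, unit=gram))
```

Because `Entry` is a node of its own, it can hold a real reference to the shared `UnitOfMeasure` node, which a plain attribute on an edge could not do.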
Specifically a Multigraph.
Some colleague suggested this and I'm completely baffled.
Any insights on this?
It's pretty straightforward to store a graph in a database: you have a table for nodes, and a table for edges, which acts as a many-to-many relationship table between the nodes table and itself. Like this:
create table node (
id integer primary key
);
create table edge (
start_id integer references node,
end_id integer references node,
primary key (start_id, end_id)
);
However, there are a couple of sticky points about storing a graph this way.
Firstly, the edges in this scheme are naturally directed - the start and end are distinct. If your edges are undirected, then you will either have to be careful in writing queries, or store two entries in the table for each edge, one in either direction (and then be careful writing queries!). If you store a single edge, I would suggest normalising the stored form - perhaps always consider the node with the lowest ID to be the start (and add a check constraint to the table to enforce this). You could have a genuinely unordered representation by not having the edges refer to the nodes, but rather having a join table between them, but that doesn't seem like a great idea to me.
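Here is a small sketch of the "lowest ID is the start" normalisation with a check constraint. It uses SQLite through Python's `sqlite3` module purely because it is easy to run; the helper `add_undirected_edge` is a made-up name. Note that a strict `start_id < end_id` check also forbids self-loops, which may or may not be what you want.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table node (
        id integer primary key
    );
    create table edge (
        start_id integer references node,
        end_id integer references node,
        primary key (start_id, end_id),
        check (start_id < end_id)  -- the lowest ID is always the start
    );
    insert into node (id) values (1), (2);
""")

def add_undirected_edge(a, b):
    """Store the edge in normalised form, whatever order the caller used."""
    low, high = sorted((a, b))
    conn.execute("insert into edge (start_id, end_id) values (?, ?)", (low, high))

add_undirected_edge(2, 1)
stored = conn.execute("select start_id, end_id from edge").fetchall()

# Writing the un-normalised form directly is rejected by the check constraint.
try:
    conn.execute("insert into edge (start_id, end_id) values (2, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```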
Secondly, the schema above has no way to represent a multigraph. You can extend it easily enough to do so; if edges between a given pair of nodes are indistinguishable, the simplest thing would be to add a count to each edge row, saying how many edges there are between the referred-to nodes. If they are distinguishable, then you will need to add something to the edge table to allow them to be distinguished - an autogenerated edge ID might be the simplest thing (it would also have to replace (start_id, end_id) as the primary key, since that pair is no longer unique).
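A minimal sketch of the distinguishable-edges variant, again using SQLite via Python's `sqlite3` only for convenience. The edge table gets its own autogenerated ID as the primary key, so two edges between the same pair of nodes can coexist:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table node (
        id integer primary key
    );
    create table edge (
        id integer primary key,  -- autogenerated edge ID
        start_id integer not null references node,
        end_id integer not null references node
    );
    insert into node (id) values (1), (2);
    -- two parallel edges between the same pair of nodes
    insert into edge (start_id, end_id) values (1, 2), (1, 2);
""")

parallel = conn.execute(
    "select count(*) from edge where start_id = 1 and end_id = 2"
).fetchone()[0]
edge_ids = [row[0] for row in conn.execute("select id from edge order by id")]
```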
However, even having sorted out the storage, you have the problem of working with the graph. If you want to do all of your processing on objects in memory, and the database is purely for storage, then no problem. But if you want to do queries on the graph in the database, then you'll have to figure out how to do them in SQL, which doesn't have any inbuilt support for graphs, and whose basic operations aren't easily adapted to work with graphs. It can be done, especially if you have a database with recursive SQL support (PostgreSQL, Firebird, some of the proprietary databases), but it takes some thought. If you want to do this, my suggestion would be to post further questions about the specific queries.
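As one example of what a graph query in recursive SQL can look like, here is a reachability sketch. It runs on SQLite (which also supports `with recursive`) through Python's `sqlite3` rather than on the databases named above, and the function name `reachable_from` and the sample edges are made up. Using `union` rather than `union all` is what stops the recursion from looping forever on a cycle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table node (
        id integer primary key
    );
    create table edge (
        start_id integer references node,
        end_id integer references node,
        primary key (start_id, end_id)
    );
    insert into node (id) values (1), (2), (3), (4);
    -- a directed cycle 1 -> 2 -> 3 -> 1; node 4 is isolated
    insert into edge (start_id, end_id) values (1, 2), (2, 3), (3, 1);
""")

def reachable_from(start):
    """Return the IDs of all nodes reachable from start, including start itself."""
    rows = conn.execute(
        """
        with recursive reach(id) as (
            select ?
            union
            select edge.end_id
            from edge join reach on edge.start_id = reach.id
        )
        select id from reach order by id
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]
```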
It's an acceptable approach. You need to consider how that information will be manipulated. More than likely you'll need a language separate from your database to do the kinds of graph-related computations this type of data implies. Skiena's Algorithm Design Manual has an extensive section on graph data structures and their manipulation.
Without considering what types of queries you might execute, start with two tables: vertices and edges. Vertices are simple: an identifier and a name. Edges are more complex, given the multigraph. Edges should be uniquely identified by a combination of two vertices (i.e. foreign keys) and some additional information. The additional information depends on the problem you're solving. For instance, for flight information it would be the departure and arrival times and the airline. Furthermore, you'll need to decide whether the edge is directed (i.e. one way) or not, and keep track of that information as well.
Depending on the computation you may end up with a problem that's better solved with some sort of artificial intelligence / machine learning algorithm. For instance, optimal flights. The book Programming Collective Intelligence has some useful algorithms for this purpose. But where the data is kept doesn't change the algorithm itself.
Well, the information has to be stored somewhere, a relational database isn't a bad idea.
It would just be a many-to-many relationship: one table listing the nodes and another table listing the edges/connections.
Consider how Facebook might implement the social graph in their database. They might have a table for people and another table for friendships. The friendships table has at least two columns, each being foreign keys to the table of people.
Since friendship is symmetric (on Facebook) they might ensure that the ID for the first foreign key is always less than the ID for the second foreign key. Twitter has a directed graph for its social network, so it wouldn't use a canonical representation like that.
Do b trees and b+ trees only store data at their leafs? I am assuming that they use their internal nodes to search the required data.
Is that the case or do they store data in every node?
Non-leaf node "records" contain:
a pointer (a node "address" of sorts) to a node in the next level down the tree
the value of the key of the first (or the last, depending on implementation) record in that node
Such non-leaf "records" are listed in key order so that by scanning (or binary searching within) a non-leaf node, one knows which node in the next level down may contain the searched value.
Leaf node records contain complete data records: the key value and whatever else.
Therefore "real" data is only contained in the leaf nodes; the non-leaf nodes only contain [a copy of] the key values, and only for a very small proportion of the data (this proportion depends on the average number of data records found in a leaf node).
This is illustrated in the following image from the Wikipedia article on B+ Trees
The non-leaf node at the top (the only one in this simplistic tree) only contains two non-leaf node records, each with a copy of a key value (bluish color) and a pointer to the corresponding node (gray color). This tree happens to only have two levels, therefore the "records" in the root node point to leaf nodes. One can imagine additional levels above the topmost node shown (call it the "3-5 node"); if that were the case, the node above it would contain (along with other similar records) a record with key value 3 and a pointer to the "3-5" node.
Also note that only the key values 3 and 5 are contained in non-leaf nodes (i.e. not even all key values are reproduced in the non leaf-nodes).
BTW in this example the non-leaf nodes contain the key of the last record in the next node (would also work if the first record were used instead, slight difference in the way the search logic is then implemented).
The leaf nodes contain the key value (in bluish color too) and the corresponding data record (d1, d2... shown in grey). The red-ish pointer shown at the end of each leaf node points to the next leaf node, i.e. the one containing the very next data record in key order; these pointers are useful to "scan" a range of data records.
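The search and the range "scan" along the leaf pointers can be sketched as follows. This is a simplified illustration, not a full B+ tree (no insertion or splitting): the class and function names are invented, the leaf contents are made up around the separator keys 3 and 5, and the sketch assumes the convention where each separator is the first key of the node to its right.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys, values):
        self.keys = keys      # sorted key values
        self.values = values  # the data records, in the same order
        self.next = None      # pointer to the next leaf in key order

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # copies of key values only, no data
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, key):
    """Descend from the root to the leaf that may contain key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node

def range_scan(root, low, high):
    """Return all (key, record) pairs with low <= key <= high."""
    leaf = find_leaf(root, low)
    found = []
    while leaf is not None:
        for key, record in zip(leaf.keys, leaf.values):
            if key > high:
                return found
            if key >= low:
                found.append((key, record))
        leaf = leaf.next  # follow the leaf-to-leaf pointer
    return found

# A two-level tree: one root with separators 3 and 5, three linked leaves.
a = Leaf([1, 2], ["d1", "d2"])
b = Leaf([3, 4], ["d3", "d4"])
c = Leaf([5, 6, 7], ["d5", "d6", "d7"])
a.next, b.next = b, c
root = Internal([3, 5], [a, b, c])
```

A range scan only descends the tree once; after that it walks sideways through the leaves and never goes back up to the non-leaf nodes.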
All data is in the leaves.
Wiki on B+.
There is some confusion on BTrees and B+Trees. B+Trees only store data on the leaf nodes as pointers. This means the data must be stored elsewhere. BTrees may store data on every node. There are advantages and disadvantages to each. I've noticed that some sites show BTrees exactly the same as B+Trees. In general, BTrees are better at holding the actual data, and B+Trees are much more efficient as indexes.