After reading a few articles and some answers on Stack Overflow, my baby mind has started overflowing with information. Yet I am still confused about where exactly the data gets stored in SQL Server.
What I assume I know of right now:
The page is the basic unit where the data gets stored in SQL Server
SQL Server uses a binary tree to store the data
There are two types of nodes: a leaf page and a non-leaf page
The leaf pages are the ones at the bottom of a B-tree
The index is stored in a leaf page
My questions:
Is the data stored in leaf page?
What gets stored in a non-leaf page?
If indexes are stored in the leaf pages, then how are they the first thing the SQL Server engine checks, without going through all the non-leaf nodes of the B-tree? Isn't this time-consuming?
What I have read so far:
https://dba.stackexchange.com/questions/36815/what-are-the-differences-between-leaf-and-non-leaf-pages
What is a page in SQL Server and do I need to worry?
https://www.simple-talk.com/sql/performance/14-sql-server-indexing-questions-you-were-too-shy-to-ask/
Thank you
SQL Server uses a balanced tree (B-tree) to store indexes, not data.
There are two types of tables: heaps and clustered tables. A heap doesn't have a clustered index; in this case the data is stored in pages in no particular order.
When a table has a clustered index, the index holds the data itself: the data pages become the leaf level of the index.
Non-leaf pages store index entries built from the index keys, directing the search for a key down the tree.
An oversimplified way to understand this is to think about a string key and the alphabet: the non-leaf level could register each letter of the alphabet and direct the search for that letter to the correct pages lower in the tree.
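The alphabet analogy above can be sketched in a few lines of Python. This is only an illustration, not how SQL Server lays out its pages: the separator keys, the page contents, and the names find_leaf and lookup are all made up for the example.

```python
from bisect import bisect_right

# Hypothetical non-leaf page: sorted separator keys, one child per gap.
# Child i holds keys >= separators[i - 1] and < separators[i].
separators = ["G", "N", "T"]

# Hypothetical leaf pages, in key order.
leaf_pages = [
    ["Adams", "Baker", "Flynn"],      # keys below "G"
    ["Garcia", "Jones", "Miller"],    # "G" up to (not including) "N"
    ["Nolan", "Price", "Smith"],      # "N" up to (not including) "T"
    ["Turner", "Young"],              # "T" and above
]

def find_leaf(key):
    """Return the index of the only leaf page that could contain `key`."""
    return bisect_right(separators, key)

def lookup(key):
    """Use the non-leaf page to pick one leaf, then search only that leaf."""
    return key in leaf_pages[find_leaf(key)]

print(find_leaf("Smith"), lookup("Smith"))
```

The point is that the non-leaf level is consulted first precisely so that only one leaf page has to be read, instead of all of them.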
I am trying to cluster on spatial locality (not just create a spatial index), but SQL Server does not allow this. To create a spatial index, it first wants me to create a clustered primary key, but I have nothing that makes sense to cluster on. I want to create a spatial index and then cluster on spatial location in some way.
I have an idea: define spatial bins, assign each geometry to a bin, and give each bin an integer. Then use that integer as the required clustered primary key, so that at least some of my data is clustered close together spatially.
I am kind of baffled that SQL Server doesn't do this already, so either I am missing how to do this or, more likely, someone has already thought of this and can propose a good enough solution.
I want to cluster on spatial location because I am dealing with big data and the first filter I apply is by spatial location (creating map tiles). Without clustering on spatial location, my pages are scattered according to some meaningless auto-increment integer.
If a simple implementation of binning by spatial location hasn't been proposed, I figured I could just cut the bounds of my geometry into equal squares and then for each center point run a distance formula that includes all geometries that intersect that bin.
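A minimal sketch of that binning idea, in Python rather than SQL and under stated assumptions: the function names bin_id and bins_for_box are hypothetical, every geometry is assumed to lie inside the overall bounds, and a geometry is matched to bins through its bounding box instead of the distance formula described above.

```python
def bin_id(x, y, min_x, min_y, cell_size, cols):
    """Map a point to the integer id of the grid square that contains it."""
    col = int((x - min_x) // cell_size)
    row = int((y - min_y) // cell_size)
    return row * cols + col  # row-major numbering of the squares

def bins_for_box(box, min_x, min_y, cell_size, cols):
    """Return the ids of all grid squares touched by a box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    first_col = int((x0 - min_x) // cell_size)
    last_col = int((x1 - min_x) // cell_size)
    first_row = int((y0 - min_y) // cell_size)
    last_row = int((y1 - min_y) // cell_size)
    return [row * cols + col
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]

# A 10-column grid of 1x1 squares starting at the origin.
print(bin_id(2.5, 1.5, 0.0, 0.0, 1.0, 10))                      # 12
print(bins_for_box((0.5, 0.5, 1.5, 1.5), 0.0, 0.0, 1.0, 10))    # [0, 1, 10, 11]
```

Two caveats: a bin id on its own is not unique, so it would have to be combined with some unique column before it could serve as a primary key, and row-major numbering only keeps neighbours within the same row close together. A space-filling curve such as Z-order is one common alternative numbering.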
This is not specific to SQL Server per se; I am looking for general approaches to indexing/clustering on spatial location. I assume non-MSSQL databases may come with this functionality built in.
I don't see how this would be possible, regardless of implementation. Specifically, the idea of a clustering key is so that you (the db engine) can tell the order in which rows should be stored. This is possible with every other datatype (and combination thereof) because ultimately you can say whether a given tuple is bigger, smaller, or equal to another. What metric would you use for generalized spatial data to say that one instance is bigger or smaller than another? Size? Proximity to the origin? Some other measure? There isn't a well-defined sense of that in the general case, and so you can't do it.
But all is not lost. Just assign an arbitrary identifier to your rows (i.e. an identity column or a column populated by a sequence) and cluster on that. Then you can put a spatial index on the spatial column and go to town. Looking at your problem, if your bins are pre-defined, you can put those in another table and do a join using STIntersects. But that may be putting the cart before the horse.
DAG = directed acyclic graph;
roots = vertices without incoming edges.
I have a DAG larger than available RAM, so I need a disk-based graph database to work with it.
My DAG is shallow: I have billions of root nodes, but from each node only dozens of nodes are reachable.
It is also not well connected: the majority of the nodes have only one incoming edge. So for any pair of root nodes, the reachable subgraphs usually have very few nodes in common.
So my DAG can be thought of as a large number of small trees, only a few of which intersect.
I need to perform the following queries on my DAG in bulk numbers: given a root node, get all nodes reachable from it.
It can be thought of as a batch query: given a few thousand root nodes, return all nodes reachable from them.
As far as I know there are algorithms to improve disk storage locality for graphs. Three examples are:
http://ceur-ws.org/Vol-733/paper_pacher.pdf
http://www.cs.ox.ac.uk/dan.olteanu/papers/g-store.pdf
http://graphlab.org/files/osdi2012-kyrola-blelloch-guestrin.pdf
It also seems there are older-generation graph databases that don't exploit graph locality, for example the popular Neo4j graph database:
http://www.ibm.com/developerworks/library/os-giraph/
Neo4j relies on data access methods for graphs without considering
data locality, and the processing of graphs entails mostly random data
access. For large graphs that cannot be stored in memory, random disk
access becomes a performance bottleneck.
My question is: are there any graph databases suited well for my workload?
Support for Win64 and the possibility of working with the database from something other than Java are a plus.
From the task itself it doesn't seem that you need a graph database.
You can simply use some external-memory programming library, such as stxxl.
First perform a topological sort on the graph (in edge format). Then you only need to scan sequentially until you have finished all the "root nodes". The I/O complexity is bounded by the topological sort. Actually, you don't need a topological sort; you just need to identify the root nodes. This can be done with a join between the edge table and the node table, which is linear time.
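To make the logic concrete, here is a small in-memory sketch in Python. It is not external-memory code and does not use stxxl (a C++ library); it only shows the two steps, identifying roots and collecting what each root reaches. The function names and the toy edge list are assumptions for illustration.

```python
from collections import defaultdict, deque

def find_roots(edges):
    """Roots are vertices that occur as a source but never as a target."""
    sources = {u for u, _ in edges}
    targets = {v for _, v in edges}
    return sources - targets

def reachable_from(roots, edges):
    """For each root, collect every node reachable from it (breadth-first)."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
    result = {}
    for root in roots:
        seen = {root}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for child in adjacency.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        result[root] = seen - {root}
    return result

# Two small trees that share the node "c".
edges = [("a", "b"), ("b", "c"), ("d", "c"), ("d", "e")]
roots = find_roots(edges)
print(sorted(roots))
print(reachable_from(roots, edges))
```

Note that a vertex with no edges at all is also a root by the definition in the question, but it never appears in an edge list, so it would have to come from the node table.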
Can anyone explain why databases tend to use B-tree indexes rather than a linked list of ordered elements?
My thinking is this: in a B+ tree (used by most databases), the non-leaf nodes are a collection of pointers to other nodes. Each collection (node) is an ordered list. The leaf nodes, which is where all the data pointers are, form a linked list of clusters of data pointers.
The non-leaf nodes are just used to find the correct leaf node in which your target data pointer lives. So, as the leaf nodes are just like a linked list, why not do away with the tree elements and just have the linked list? Metadata can be provided which gives the minimum and maximum value of each leaf node cluster, so the application can just read the metadata and find the correct leaf where the data pointer lives.
Just to be clear: the most efficient algorithm for searching a randomly accessible ordered list is a binary search, which has a performance of O(log n), the same as a B-tree. The benefit of using a linked list rather than a tree is that it doesn't need to be balanced.
Is this structure feasible?
After some research and paper reading I found the answer.
In order to cope with large amounts of data, such as millions of records, indexes have to be organised into clusters. A cluster is a contiguous group of sectors on a disk that can be read into memory quickly. These are usually about 4096 bytes long.
Each one of these clusters can contain a bunch of index entries which can point to other clusters or to data on a disk. So if we had a linked list index, each element of the list would be made up of the collection of index entries contained in a single cluster (say 100).
So, when we are looking for a specific record, how do we know which cluster it is on? We perform a binary search to find the cluster in question [O(log n)].
However, to do a binary search we need to know the range of values in each cluster, so we need meta-data that says the min and max value of each cluster and where that cluster is. This is great. Except if each cluster can contain 100 index entries, and our meta-data is also held on a single cluster (for speed), then our meta-data can only point to 100 clusters.
What happens if we want more than 100 clusters? We need two meta-data clusters, each pointing to 100 clusters (10,000 records each). Well, that’s not enough. Let’s add another meta-data cluster, and we can now reach 300 clusters (30,000 records). So how do we know which of the three meta-data clusters to query in order to find our target data cluster? We could search one and then the next, but that doesn’t scale. So I add a meta-meta-data cluster to indicate which of the three meta-data clusters I should query. It has room for 100 such pointers, so this layout tops out at 1,000,000 records. Now I have a tree!
So that’s why databases use trees. It’s not the speed; it’s the size of the indexes and the need to have indexes referencing other indexes. What I have described above is a B+Tree: non-leaf nodes contain references to other non-leaf nodes or to leaf nodes, and leaf nodes contain references to data on disk.
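The arithmetic behind this can be checked with a few lines of Python. This is a rough sketch that assumes, as above, that every cluster holds exactly 100 entries; the function name levels_needed is made up for the example.

```python
def levels_needed(num_records, fanout=100):
    """Count the cluster levels needed so one top cluster reaches every record."""
    levels = 1
    capacity = fanout            # a single data cluster holds `fanout` entries
    while capacity < num_records:
        capacity *= fanout       # each extra level multiplies the reach
        levels += 1
    return levels

for n in (100, 10_000, 30_000, 1_000_000, 1_000_000_000):
    print(n, levels_needed(n))
```

With a fanout of 100, a million records need three levels and a billion need five, which is why the extra levels cost so little compared with the amount of data they cover.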
Phew!
I guess I answered that question in Chapter 1 of my SQL Indexing Tutorial: http://use-the-index-luke.com/sql/anatomy
To summarize the most important parts, with respect to your particular question:
-- from "The Leaf Nodes"
The primary purpose of an index is to provide an ordered
representation of the indexed data. It is, however, not possible to
store the data sequentially because an insert statement would need to
move the following entries to make room for the new one. But moving
large amounts of data is very time-consuming, so that the insert
statement would be very slow. The problem's solution is to establish a
logical order that is independent of physical order in memory.
-- from "The B-Tree":
The index leaf nodes are stored in an arbitrary order—the position on
the disk does not correspond to the logical position according to the
index order. It is like a telephone directory with shuffled pages. If
you search for “Smith” but open it at “Robinson” in the first
place, it is by no means granted that Smith comes farther back.
Databases need a second structure to quickly find the entry among the
shuffled pages: a balanced search tree—in short: B-Tree.
Linked lists are usually not ordered by key value, but by the moment of insertion: insertion is done at the end of list and each new entry contains a pointer to the previous entry of the list.
They are usually implemented as heap structures.
This has 2 main benefits:
they are very easy to manage (you just need a pointer for each element)
if used in combination with an index you can overcome the problem of sequential access.
If instead you use a list ordered by key value, you will have ease of access (binary search), but you will run into problems each time you edit, delete, or insert an element: you must in fact keep your list ordered after each operation, which makes the algorithms more complex and time-consuming.
B+ trees are better structures, having all the properties you stated, and other advantages:
you can run range searches (by intervals of key values) at roughly the cost of a single search plus the size of the result, because the insertion algorithm keeps the elements in the leaves ordered; this is not possible with an insertion-ordered linked list, since it would require many linear searches over the list.
cost is logarithmic in the number of elements, and because these structures are kept balanced, the cost of access does not depend on the particular value you are looking for (very useful).
these structures are very efficient for update, insert, and delete operations.
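As a rough illustration of the range-search point, here is a toy Python sketch of linked B+ tree leaves. It is not a full B+ tree: there is a single level of separator keys, no insertion or balancing, and the names Leaf, build_leaves, and range_search are hypothetical.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys     # sorted keys stored in this leaf
        self.next = None     # link to the next leaf in key order

def build_leaves(chunks):
    """Create leaves from sorted chunks and link each one to its successor."""
    leaves = [Leaf(chunk) for chunk in chunks]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves

def range_search(leaves, separators, low, high):
    """Return all keys k with low <= k <= high."""
    leaf = leaves[bisect_right(separators, low)]   # one descent to the start
    found = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return found                       # past the interval: stop
            if key >= low:
                found.append(key)
        leaf = leaf.next                           # follow the leaf link
    return found

leaves = build_leaves([[1, 3, 5], [7, 9, 11], [13, 15]])
separators = [7, 13]   # first key of every leaf except the first
print(range_search(leaves, separators, 4, 13))     # [5, 7, 9, 11, 13]
```

The search descends once to find where the interval starts and then just walks the leaf links, so it never goes back up through the tree.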
In a b-tree you can store both keys and data in the internal and leaf nodes, but in a b+ tree you have to store the data in the leaf nodes only.
Is there any advantage of doing the above in a b+ tree?
Why not use b-trees instead of b+ trees everywhere, as intuitively they seem much faster?
I mean, why do you need to replicate the key (data) in a b+ tree?
The image below helps show the differences between B+ trees and B trees.
Advantages of B+ trees:
Because B+ trees don't have data associated with interior nodes, more keys can fit on a page of memory. Therefore, it will require fewer cache misses in order to access data that is on a leaf node.
The leaf nodes of B+ trees are linked, so doing a full scan of all objects in a tree requires just one linear pass through all the leaf nodes. A B tree, on the other hand, would require a traversal of every level in the tree. This full-tree traversal will likely involve more cache misses than the linear traversal of B+ leaves.
Advantage of B trees:
Because B trees contain data with each key, frequently accessed nodes can lie closer to the root, and therefore can be accessed more quickly.
The principal advantage of B+ trees over B trees is they allow you to pack in more pointers to other nodes by removing pointers to data, thus increasing the fanout and potentially decreasing the depth of the tree.
The disadvantage is that there are no early outs when you might have found a match in an internal node. But since both data structures have huge fanouts, the vast majority of your matches will be on leaf nodes anyway, making on average the B+ tree more efficient.
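A back-of-the-envelope calculation in Python shows the effect on fanout and depth. All the sizes below are assumptions picked for illustration (4096-byte nodes, 8-byte keys and pointers), and the depth estimate ignores the fact that a B tree also holds keys in its internal nodes, so treat the numbers as rough.

```python
PAGE_SIZE = 4096   # bytes per node (assumed)
KEY_SIZE = 8       # bytes per key (assumed)
CHILD_PTR = 8      # bytes per pointer to a child node (assumed)
DATA_PTR = 8       # bytes per pointer to the row data (assumed)

def fanout(entry_size):
    """How many entries of the given size fit in one node."""
    return PAGE_SIZE // entry_size

def depth(num_keys, f):
    """Smallest number of levels d such that f ** d >= num_keys."""
    d, capacity = 1, f
    while capacity < num_keys:
        capacity *= f
        d += 1
    return d

# B+ tree internal entry: key + child pointer.
bplus_fanout = fanout(KEY_SIZE + CHILD_PTR)               # 256
# B tree internal entry: key + child pointer + data pointer.
btree_fanout = fanout(KEY_SIZE + CHILD_PTR + DATA_PTR)    # 170

print(bplus_fanout, depth(1_000_000_000, bplus_fanout))   # 256 4
print(btree_fanout, depth(1_000_000_000, btree_fanout))   # 170 5
```

Under these assumptions, dropping the data pointers from internal nodes saves one level for a billion keys, which means one fewer node to read per lookup.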
Full scans (looking at every piece of data that the tree indexes) are much easier and faster with B+ trees, since the terminal nodes form a linked list. To do a full scan with a B-tree you need to do a full tree traversal to find all the data.
B-trees, on the other hand, can be faster when you do a seek (looking for a specific piece of data by key), especially when the tree resides in RAM or other non-block storage. Since you can elevate commonly used nodes in the tree, fewer comparisons are required to get to the data.
In a B tree search keys and data are stored in internal or leaf nodes. But in a B+-tree data is stored only in leaf nodes.
Full scan of a B+ tree is very easy because all data are found in leaf nodes. Full scan of a B tree requires a full traversal.
In a B tree, data may be found in leaf nodes or internal nodes. Deletion of internal nodes is very complicated. In a B+ tree, data is only found in leaf nodes. Deletion of leaf nodes is easy.
Insertion in B tree is more complicated than B+ tree.
B+ trees store redundant search keys but B tree has no redundant value.
In a B+ tree, leaf node data is ordered as a sequential linked list but in a B tree the leaf node cannot be stored using a linked list. Many database systems' implementations prefer the structural simplicity of a B+ tree.
Example from Database system concepts 5th
B+-tree
corresponding B-tree
Adegoke A, Amit
I guess one crucial point you people are missing is the difference between data and pointers, as explained in this section.
Pointer: a pointer to other nodes.
Data: in the context of database indexes, data is just another pointer to the real data (row), which resides somewhere else.
Hence, in a B tree each node holds three kinds of information: keys, pointers to the data associated with the keys, and pointers to child nodes.
In a B+ tree, internal nodes keep keys and pointers to child nodes, while leaf nodes keep keys and pointers to the associated data. This allows more keys for a given node size. The node size is determined mainly by the block size.
The advantage of having more keys per node is explained well above, so I will save my typing effort.
B+ trees are especially good in block-based storage (e.g. hard disks). With this in mind, you get several advantages, for example (off the top of my head):
high fanout / low depth: that means you have to fetch fewer blocks to get to the data. With data intermingled with the pointers, each read gets fewer pointers, so you need more seeks to get to the data.
simple and consistent block storage: an inner node has N pointers, nothing else; a leaf node has data, nothing else. That makes it easy to parse, debug and even reconstruct.
high key density means the top nodes are almost certainly in cache; in many cases all inner nodes get cached quickly, so only the data access has to go to disk.
Define "much faster". Asymptotically they're about the same. The differences lie in how they make use of secondary storage. The Wikipedia articles on B-trees and B+trees look pretty trustworthy.
In a B+ tree, since only keys and pointers are stored in the internal nodes, they are significantly smaller than the internal nodes of a B tree (which store data as well as keys).
Hence, the index portion of a B+ tree can be fetched from external storage in fewer disk reads (in the best case a single one) and then processed in memory to find the location of the target. Had it been a B tree, a disk read could be required for each decision along the way. Hope I made my point clear! :)
The major drawback of B-Tree is the difficulty of Traversing the keys
sequentially. The B+ Tree retains the rapid random access property of
the B-Tree while also allowing rapid sequential access
ref: Data Structures Using C, by Aaron M. Tenenbaum
http://books.google.co.in/books?id=X0Cd1Pr2W0gC&pg=PA456&lpg=PA456&dq=drawback+of+B-Tree+is+the+difficulty+of+Traversing+the+keys+sequentially&source=bl&ots=pGcPQSEJMS&sig=F9MY7zEXYAMVKl_Sg4W-0LTRor8&hl=en&sa=X&ei=nD5AUbeeH4zwrQe12oCYAQ&ved=0CDsQ6AEwAg#v=onepage&q=drawback%20of%20B-Tree%20is%20the%20difficulty%20of%20Traversing%20the%20keys%20sequentially&f=false
The primary distinction between a B-tree and a B+tree is that the B-tree eliminates the redundant storage of search key values. Since search keys are not repeated in the B-tree, we may be able to store the index using fewer tree nodes than in the corresponding B+tree index. However, since search keys that appear in non-leaf nodes appear nowhere else in the B-tree, we are forced to include an additional pointer field for each search key in a non-leaf node.
There are space advantages for the B-tree, as keys are not repeated, which can matter for large indices.
Take one example: you have a table with a huge amount of data per row. That means every instance of the object is big.
If you use a B tree here, then most of the time is spent scanning pages full of data, which is of no use for the search. That is the reason databases use B+ trees: to avoid scanning object data.
B+ trees separate keys from data.
But if your data is small, then you can store it with the key, which is what a B tree does.
A B+tree is a balanced tree in which every path from the root of the tree to a leaf is of the same length, and each non-leaf node of the tree has between ⌈n/2⌉ and n children, where n is fixed for a particular tree. It contains index pages and data pages.
Binary trees have at most two children per parent node, whereas B+ trees can have a variable number of children for each parent node.
One reason to use B+ trees is that they suit situations
where the tree grows so large that it does not fit into available
memory. Thus, you'd generally expect to be doing multiple I/Os.
It does often happen that a B+ tree is used even when it in fact fits into
memory, and then your cache manager might keep it there permanently. But
this is a special case, not the general one, and caching policy is
separate from B+ tree maintenance as such.
Also, in a B+ tree, the leaf pages are linked together in
a linked list (or doubly-linked list), which optimizes traversals
(for range searches, sorting, etc.). So the number of pointers is
a function of the specific algorithm that is used.