I'm puzzled about how clustered primary keys work. I understand from the docs that rows are physically stored by primary key. Does this mean that inserting a new row with a key that fits in the middle would take ages because all the following rows would have to be shifted down? If not, how is this avoided?
These clustered indexes are organized into "pages". Pages are linked together in elaborate ways which we application programmers don't have to worry about. Pages often have free space in them. If a new "in the middle" row fits in the appropriate page, the server puts it there. If it doesn't the server performs a so-called "page split" to make room for the new entry. The pages that result from the split will then have more free space in them. Page splitting isn't a terribly expensive operation. But tons and tons of page splits fragment the clustered index and make a table slower to access.
New rows "at the end" generate new pages.
If you want to experience this for yourself, try making and populating table where a guid is the primary key. Pretty much every row you insert goes "in the middle" and you'll get lots of page splits.
Read this for details of index maintenance.
I have a warehouse fact table that has a clustered index on a BIGINT column. Due to circumstances beyond my control, I have to add in data to the fact table from another source where the keys from the two different source systems overlap (I want the data from the second source in its own datamart but I was overruled). To handle this I am adding 1000000 to the key and multiplying it by -1.
If I am inserting all these negative keys into a clustered index, does it add more overhead when reorganizing or rebuilding the index vs continually adding keys > 0 that just get larger and larger.
Thanks
Well, yes, it does. Think of a list written on a couple of paper sheets that is sorted. Adding an entry at the end will be no problem: If there’s enough space on the last page, the entry can be written on that page, and if the last page is full, you can just add a blank page at the end and write the entry on that page. But what if you have to add a new entry at the beginning of the list? All entries on the first page have to be shifted down to make room for the new entry, and if the page is full, you have to add a new page… In SQL Server, this new page will not be in front of the other pages, so the list of pages will get fragmented, which will impact query performance.
If you do not need the records to be physically sorted on disk by that key, consider to change the index to be non-clustered.
If you add more values (positive or negative) into your clustered index, the data volume grows - so yes, it is a bit more overhead over time.
But that doesn't depend on whether those are positive or negative values... the index doesn't care about that - it's just a bunch of BIGINT values.
I have been thinking about two questions. Couldn't find any resources on the internet about this. How do dbms handle it ? Or do they ? Especially Oracle.
Before the questions, here is an example: Say I have a master table "MASTER" and slave table "SLAVE".
Master table has an "ID" column which is the primary key and index is created by Oracle.Slave table has the foreign key "MASTER_ID" which refers to master table and "SLAVE_NO". These two together is the primary key of slave table, which is again indexed.
**MASTER** | **SLAVE**
(P) ID <------> (P)(F) MASTER_ID
(P) SLAVE_NO
Now the questions;
1- If MASTER_ID is an autoincremented column, and no record is ever deleted, doesn't this get the table's index unbalanced ? Does Oracle rebuilds indexes periodically ? As far as i know Oracle only balances index branches at build time. Does Oracle re-builds indexes Automatically ever ? Say if the level goes up to some high levels ?
2- Assuming Oracle does not rebuild automatically, apart from scheduling a job that rebuilds index periodically, would it be wiser to order SLAVE table's primary key columns reverse ? I mean instead of "MASTER_ID", "SLAVE_NO" ordering it as "SLAVE_NO", "MASTER_ID"i, would it help the slave table's b-tree index be more balanced ? (Well each master table might not have exact number of slave records, but still, seems better than reverse order)
Anyone know anything about that ? Or opinions ?
If MASTER_ID is an autoincremented column, and no record is ever deleted, doesn't this get the table's index unbalanced ?
Oracle's indexes are never "unbalanced": every leaf in the index is at the same depth as any other leaf.
No page split introduces a new level by itself: a leaf page does not become a parent for new pages like it would be on a non-self-balancing tree.
Instead, a sibling for the split page is made and the new record (plus possibly some of the records from the old page) go to the new page. A pointer to the new page is added to the parent.
If the parent page is out of space too (can't accept the pointer to the newly created leaf page), it gets split as well, and so on.
These splits can propagate up to the root page, whose split is the only thing which increases the index depth (and does it for all pages at once).
Index pages are additionally organized into double-linked lists, each list on its own level. This would be impossible if the tree were unbalanced.
If master_id is auto-incremented this means that all splits occur at the end (such called 90/10 splits) which makes the most dense index possible.
Would it be wiser to order SLAVE table's primary key columns reverse?
No, it would not, for the reasons above.
If you join slave to master often, you may consider creating a CLUSTER of the two tables, indexed by master_id. This means that the records from both tables, sharing the same master_id, go to the same or nearby data pages which makes a join between them very fast.
When the engine found a record from master, with an index or whatever, this also means it already has found the records from slave to be joined with that master. And vice versa, locating a slave also means locating its master.
The b-tree index on MASTER_ID will remain balanced for most useful definitions of "balanced". In particular, the distance between the root block and any child block will always be the same and the amount of data in any branch will be at least roughly euqal to the amount of data in any other sibling branch.
As you insert data into an index on a sequence-generated column, Oracle will perform 90-10 block splits on the leaves when the amount of data in any particular level increases. If, for example, you have a leaf block that can hold 10 rows, when it is full and you want to add an 11th row, Oracle creates a new block, leaves the first 9 entries in the first block, puts 2 entries in the new block, and updates the parent block with the address of the new block. If the parent block needs to split because it is holding addresses for too many children, a similar process takes place. That leaves the indexes relatively balanced throughout their life. Richard Foote (the expert on Oracle indexes) has an excellent blog on When Does an Oracle Index Increase in Height that goes into much more detail about how this works.
The only case that you potentially have to worry about an index becoming skewed is when you regularly delete most but not all blocks from the left-hand side of the index. If, for example, you decide to delete 90% of the data from the left-hand side of the index leaving one row in every leaf node pointing to data, your index can become unbalanced in the sense that some paths lead to vastly more data than other paths. Oracle doesn't reclaim index leaf nodes until they are completely empty so you can end up with the left-hand side of the index using a lot more space than is really necessary. This doesn't really affect the performance of the system much-- it is more of a space utilization issue-- but it can be fixed by coalescing the index after you've purged the data (or structuring your purges so that you don't leave lots of sparse blocks).
[11] tells:
"In a nonclustered index, the leaf level does not contain all the data. In addition to the key values, each index row in the leaf level (the lowest level of the tree) contains a bookmark that tells SQL Server where to find the data row corresponding to the key in the index.
A bookmark can take one of two forms. If the table has a clustered index, the bookmark is the clustered index key for the corresponding data row
. If the table is a heap (in other words, it has no clustered index), the bookmark is a row identifier (RID), which is an actual row locator in the form File#:Page#:Slot#."
Is this a copy of clustered index key or nonclustered index has a pointer to it?
Should all clustered index structure, i.e. b-tree with intermediate data, be traversed to get to row data through non-clustered index bookmark on clustered table?
What does clustered index indtroduce that direct referencing become impossible?
Update:
Let me re-phrase the question. How this is done I can read myself but I want to understand why it is done this way.
Would not it be much more efficient to continue referencing row data by RID from non-clustered index having (added) clustered one?
Suppose a table has only non-clustered index(es) (but no clustered index).
Non-clustered index leaves contains RID to real data. For direct access of row data without any need of lookup/traversals.
Adding clustered index means elimination of IAM (Index Allocation Map) pages and substituting all RIDs of all non-clustered indexes by clustered index keys + necessity of additional lookup instead of direct access.
What is the point in this?
Update2:
Was my question downvoted by Microsoft himself? Thanks again, this is an honor.
It is pointless to downvote without explaining.
Update3:
#PerformanceDB", I could not understand the phrase in your answer:
""It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them)."
Can you explain it?
Yes, I'd like illustrations.
I started to read: Debunking myths about clustered indexes - part 4 (CIXs, TPC-C & Oracle clusters) and it, as many other sources, explicitly refer to the fact that SQL Server, in contrat to Oracle, lacks direct access features on a clustered table.
Update4 (Update5 - corrected by strike-out):
A few answerers referred to the fact that a bookmark CI key in NCI leaf is for address independence in case of page splits.
Aren't during index reorganization or de-fragmenting in non-clustered table with CI NCI (non-clustered index) the rows relocated and corresponding RIDs in NCI change in NCI are modified?
This seems to me as addressing scheme deficiency - the row should have moved with its address, should have not it?
Also, Is heap completely immune to page splits? due to size increase of variable size data types in row
Related questions:
What do I miss in understanding the clustered index?
Why/when/how is whole clustered index scan chosen rather than full table scan?
Reasons not to have a clustered index in SQL Server 2005
Cited:
[11]
Inside Microsoft® SQL Server™ 2005: The Storage Engine
By Kalen Delaney - (Solid Quality Learning)
...............................................
Publisher: Microsoft Press
Pub Date: October 11, 2006
Print ISBN-10: 0-7356-2105-5
Print ISBN-13: 978-0-7356-2105-3
Pages: 464
[11a]
p.250 Section Index organization from Chapter 7. Index Internals and Management
Here is helpful online copypaste from it
http://sqlserverindexeorgnization.blogspot.com/
though without any credits to source
The problem is that the doco is gobbledegook, and increases the very confusion it is alleging it clarifies. If you forget about all that and start again, it is actually quite straight-forward. Since you are inquiring re data storage structures, and concerned re performance, let's look at that perspective (not the logical). There is no data storage structure caled "Table".
Heap
Data pages containing rows. There is no Clustered Index. The rows are not shifted as a result of Inserts/Deletes. The rows can be read in entirety (table scan) or singly (via a NonClustered Index). It gets badly fragmented.
Clustered Index
B-Tree. The Index is Clustered with the Data Rows. The leaf level is the data row. That means one less I/O on every access. It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them). The Heap (entire data storage stucture) is eliminated. There are no pointers. The rows are maintained in Clustered Index Key order (rows are moved on the page as a result of Insert/Delete/Expand). Pages are trimmed within the extents.
NonClustered Index
B-Tree. Full height as required by number of rows.
Where there is a Clustered Index, the Leaf level is the Clustered Index Key (so that it can go to the exact location in the CI, which is the row).
Where there is no Clustered Index, the Leaf level is a pointer: File:Page:Offset (so that it can go to the Heap, and get the row). The RowIds in the Heap do not change (if they did, every time you Inserted/Deleted one row, you would have to update all the NCI entries in all associated NCIs, for all other rows on the page).
That is why, when you create a CI, all NCIs are automatically rebuilt (they have to be switched from [2] to 1). Obviously, always create the CI before the NCIs.
There is no File:Page:Slot, the row length is variable, it is Offset within Page.
There is no Bookmark or other goobledegook.
Re "No direct access to data row in clustered table - why"
Nonsense. You have direct and immediate access to each data row, via the CI (one less I/O) or the NCI⇢CI Key.
This is very fast, invented by Britton Lee; re-implemented and patented by Sybase; obtained by dishonesty and for a pittance by Darth Vader.
If you need further clarification, I can provide illustrations.
Responses to Comments
"It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them)."
Let's say you have a tables with 1 billion rows. The "height" of the B-Tree of any given index (eg. Unique, on PK) drawn vertically is say 8; or you can say the index is 8 levels deep, between the top (a single entry) and the bottom, the leaf level. the leaf level is of course the widest, and most polpulated; it will have 1 billion entries. Given that each index page contains say 256 entries, the leaf-minus-one level contains 390K entries.
The CI B-tree (index only portion) will contain 7 levels, 390K entries, taking 10MB; because the leaf level IS the data row (of which there are 1 billion entries, spread nicely across 100GB), and is thus excluded, or not repeated.
Yes, I'd like illustrations.
Ok. I have a set of finished Sybase docs; I have butchered one for you, so as to avoid confusion, and excluded the bits that Sybase has, that MS does not. Sorry. Don't follow the links, just stay on the one page. Also the very low levels of Fragmentation in the heap are different by the fragmentation in the Heap is massive, in both Sybase and MS, so I have left that intact.
Data Storage Basics
(That is a condensed version of my much more elaborate Sybase diagrams, which I have butchered for the MS context. There is a link at the bottom of that doc, if you want the full Sybase set.)
"I started to read: Debunking myths about clustered indexes - part 4 (CIXs, TPC-C & Oracle clusters) and it, as many other sources, explicitly refer to the fact that SQL Server, in contrat to Oracle, lacks direct access features on a clustered table."
Be careful what you read, the web is full of superficial information; half truths discussed out of context; misinformation (from vendiors as well as well-meaning ignorants). As you notice, I just answer questions; I do not waste time answering points raised in references.
Just think about this. Well-implemented Tables with a CI do not need de-fragmentation; and when implemented badly, need infrequent de-fragmentation; tables without a CI need frequent and pretty much offline de-fragmentation. That's your maintenance window running into Monday morning. Just an example of why discussing items in isolation is actually misinformation. Which is why my docs are all linked and related to one another.
"A few answerers referred to the fact that CI key in NCI leaf is for address independence in case of page splits."
Yes, I just would not put it that way, that's as confusing as the first reference you posted. Page splits have nothing to do with it. I put is the way I did in my post above on purpose, for clarity. Since the rows move (the CI keeps the pages and Extents trim), the NCI MUST have the CI key, in order to find the row. It can't use a RowId which would keep changing all the time. Unless you have wide CI keys, this is no big deal; a 4-byte RowId (plus processing overhead) vs an 8-byte CI key (minus said overhead) ... who cares (ok, maybe you). Address the higher level issues, and the low level issues will be small enough to not warrant address. Squeezing out 1% performance improvement at the low level when your db is fragmented and unnormalised, is amore than a bit silly.
A system in an integrated set of components, none can be changed or evaluated in isolation. A bunch of components that are not integrated are dis-integrated, not a system. At your level of questioning, you are not yet in a position to form conclusions, or have grudges against this or that, if you do, they are premature conclusions and grudges, that will impede your progress. On top of that, there is a big difference between knowledge gained by question-and-answer vs knowledge gained by reading plus experience.
"Aren't during undex reotganization or defragmanting of non-clustered table with CI the rows relocated and corresponding RIDs in NCI change in NCI?"
Do you mean "non-clustered INDEX with CI" ? Well the NCIs are not worth de-fragmenting, just drop/create them.
Or do you mean "de-fragmenting a CI [whole table]" ? I have already posted, when you re-create the CI (or de-fragment it in place), the NCIs are automatically rebuilt. It is not about RowIds, it is about the change: when you drop the CI, the NCIs have to be rewritten from CI keys to RowIds; when you create the CI, the NCIs have to the changed back to CI Keys. Switched on DBAs drop the NCI before dropping the CI.
"This seems to me as addressing scheme deficiency - the row should have moved with its address, should have not it?"
You're getting too low-level without understanding the higher levels. If the row moves, its address changes; if the address changes, the row moves. Either you have a CI (rows move) xor you have a Heap (rows do not move).
"Also, Is heap completely immune to page splits?"
No. Page Splits still happen when variable length rows expand and there is no room on the page. But in the scheme of things, massive fragmentation on Heaps, due to never moving rows, due to it being RowId based (which the NCIs rely on), this is a small item.
Let me re-phrase the question. How
this is done I can read myself but I
want to understand why it is done this
way.
Would not it be much more efficient to
continue referencing row data by RID
from non-clustered index having
(added) clustered one?
NO ! If a table has an insert, and a page split occurs, then you would have to potentially update a lot of references that use a RID to point to the new locations of those data rows that have been moved to a new page in SQL Server. That's exactly why the SQL Server team chose to use the clustering key instead, as the "data pointer", so to speak. The value of the clustering key does not change when a page split occurs, so no update to indices are needed.
Would not it be much more efficient to continue referencing row data by RID from non-clustered index having (added) clustered one?
The whole point of a clustered index is that the records are accessed via a logical locator (which is not normally meant to change), not physical.
If the indexes were pointing to a physical RID and a row changed its physical location (say from a page split), all indexes would need to be updated too.
It's exactly the kind of problem the clustered indexes were invented to deal with.
If a table has a clustered index, each non-clustered index row contains a copy of the clustered index key.
If a table does not have a clustered index, i.e. the table is a heap, each non-clustered index row contains a pointer built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).
When you identify (select) a row using a clustered index, you have all the columns from the row.
When you identify a row in a non-clustered index, you need to perfrom another lookup step to obtain the columns not included in the non-clustered index.
Would not it be much more efficient to
continue referencing row data by RID
from non-clustered index having
(added) clustered one?
In many cases it would be more efficient, yes. I believe that clustered indexes were originally implemented that way (in version 6.0?). Presumably they were changed for the reasons that marc_s mentioned, which make sense if your clustered index is such that it has a lot of page splits.
I would not have posted this (my) question, have I seen before my posting here that answer by AlexSmith there, which I saw just a few minutes after posting and having been already answered here:
"Greg Linwood wrote a great series of blogs. It is a must read: Debunking myths about clustered indexes"
It is pity, it is impossible to accept it there as an answer here
Update:
The accepted here answer by PerformanceDBA told: "The problem is that the doco is gobbledegook, and increases the very confusion it is alleging it clarifies"
Well, all msdn docs tell and show, for ex., cf. pictures from Clustered Index Structures vs. "Heap Structures" that clustered table does not have IAM page. Meanwhile, the output from following the code from Inside the Storage Engine: Using DBCC PAGE and DBCC IND to find out if page splits ever roll back shows the opposite.
Having no desire to continue spamming here I shifted my questioning and participation to /www.sqlservercentral.com/Forums
My related question there:
Does clustered table have IAM page?
If I have a table column with data and create an index on this column, will the index take same amount of disc space as the column itself?
I'm interested because I'm trying to understand if b-trees actually keep copies of column data in leaf nodes or they somehow point to it?
Sorry if this a "Will Java replace XML?" kind question.
UPDATE:
created a table without index with a single GUID column, added 1M rows - 26MB
same table with a primary key (clustered index) - 25MB (even less!), index size - 176KB
same table with a unique key (nonclustered index) - 26MB, index size - 27MB
So only nonclustered indexes take as much space as the data itself.
All measurements were done in SQL Server 2005
The B-Tree points to the row in the table, but the B-Tree itself still takes some space on disk.
Some database, have special table which embed the main index and the data. In Oracle, it's called IOT -- index-organized table.
Each row in a regular table can be identified by an internal ID (but it's database specific) which is used by the B-Tree to identify the row. In Oracle, it's called rowid and looks like AAAAECAABAAAAgiAAA :)
If I have a table column with data and
create an index on this column, will
the index take same amount of disc
space as the column itself?
In a basic B-Tree, you have the same number of node as the number of item in the column.
Consider 1,2,3,4:
1
/
2
\ 3
\ 4
The exact space can still be a bit different (the index is probably a bit bigger as it need to store links between nodes, it may not be balanced perfectly, etc.), and I guess database can use optimization to compress part of the index. But the order of magnitude between the index and the column data should be the same.
I'm almost sure it's quite a DB dependent, but generally – yeah, they take additional space. This happens because of two reasons:
This way you can utilize the fact
the data in BTREE leafs is sorted;
You gain lookup speed advantage as
you don't have to seek back and
forth to fetch neccessary stuff.
PS just checked our mysql server: for a 20GB table indexes take 10GB of space :)
Judging by this article, it will, in fact, take at least the same amount of space as the data in the column (in PostgreSQL, anyway).
The article also goes to suggest a strategy to reduce disk and memory usage.
A way to check for yourself would be to use e.g. the derby DB, create a table with a million rows and a single column, check it's size, create an index on the column and check it's size again. If you take the 10-15 minutes to do so, let us know the results. :)