No direct access to data row in clustered table - why? - sql-server

[11] says:
"In a nonclustered index, the leaf level does not contain all the data. In addition to the key values, each index row in the leaf level (the lowest level of the tree) contains a bookmark that tells SQL Server where to find the data row corresponding to the key in the index.
A bookmark can take one of two forms. If the table has a clustered index, the bookmark is the clustered index key for the corresponding data row. If the table is a heap (in other words, it has no clustered index), the bookmark is a row identifier (RID), which is an actual row locator in the form File#:Page#:Slot#."
Is this a copy of the clustered index key, or does the nonclustered index hold a pointer to it?
Must the whole clustered index structure, i.e. the B-tree with its intermediate levels, be traversed to reach the row data through a nonclustered index bookmark on a clustered table?
What does a clustered index introduce that makes direct referencing impossible?
Update:
Let me re-phrase the question. How this is done I can read myself, but I want to understand why it is done this way.
Would it not be much more efficient to keep referencing row data by RID from a non-clustered index after a clustered one has been added?
Suppose a table has only non-clustered index(es) (but no clustered index).
The non-clustered index leaves contain RIDs pointing to the real data, giving direct access to row data without any lookup or traversal.
Adding a clustered index means eliminating the IAM (Index Allocation Map) pages and replacing all RIDs in all non-clustered indexes with clustered index keys, plus the need for an additional lookup instead of direct access.
What is the point of this?
Update2:
Was my question downvoted by Microsoft himself? Thanks again, this is an honor.
It is pointless to downvote without explaining.
Update3:
@PerformanceDBA, I could not understand this phrase in your answer:
"It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them)."
Can you explain it?
Yes, I'd like illustrations.
I started to read: Debunking myths about clustered indexes - part 4 (CIXs, TPC-C & Oracle clusters) and it, like many other sources, explicitly refers to the fact that SQL Server, in contrast to Oracle, lacks direct access features on a clustered table.
Update4 (Update5 - corrected by strike-out):
A few answerers referred to the fact that the CI key in the NCI leaf is there for address independence in case of page splits.
During index reorganization or defragmentation of a table with an NCI (non-clustered index), aren't the rows relocated and the corresponding RIDs in the NCI modified?
This seems to me to be a deficiency of the addressing scheme - the row should have moved together with its address, shouldn't it?
Also, is a heap completely immune to page splits, e.g. due to a size increase of variable-size data types in a row?
Related questions:
What do I miss in understanding the clustered index?
Why/when/how is whole clustered index scan chosen rather than full table scan?
Reasons not to have a clustered index in SQL Server 2005
Cited:
[11]
Inside Microsoft® SQL Server™ 2005: The Storage Engine
By Kalen Delaney - (Solid Quality Learning)
Publisher: Microsoft Press
Pub Date: October 11, 2006
Print ISBN-10: 0-7356-2105-5
Print ISBN-13: 978-0-7356-2105-3
Pages: 464
[11a]
p.250 Section Index organization from Chapter 7. Index Internals and Management
Here is helpful online copypaste from it
http://sqlserverindexeorgnization.blogspot.com/
though without any credits to source

The problem is that the doco is gobbledegook, and increases the very confusion it is alleging it clarifies. If you forget about all that and start again, it is actually quite straight-forward. Since you are inquiring re data storage structures, and concerned re performance, let's look at that perspective (not the logical). There is no data storage structure called "Table".
Heap
Data pages containing rows. There is no Clustered Index. The rows are not shifted as a result of Inserts/Deletes. The rows can be read in entirety (table scan) or singly (via a NonClustered Index). It gets badly fragmented.
Clustered Index
B-Tree. The Index is Clustered with the Data Rows. The leaf level is the data row. That means one less I/O on every access. It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them). The Heap (entire data storage structure) is eliminated. There are no pointers. The rows are maintained in Clustered Index Key order (rows are moved on the page as a result of Insert/Delete/Expand). Pages are trimmed within the extents.
NonClustered Index
B-Tree. Full height as required by number of rows.
1. Where there is a Clustered Index, the Leaf level is the Clustered Index Key (so that it can go to the exact location in the CI, which is the row).
2. Where there is no Clustered Index, the Leaf level is a pointer: File:Page:Offset (so that it can go to the Heap, and get the row). The RowIds in the Heap do not change (if they did, every time you Inserted/Deleted one row, you would have to update all the NCI entries in all associated NCIs, for all other rows on the page).
That is why, when you create a CI, all NCIs are automatically rebuilt (they have to be switched from [2] to [1]). Obviously, always create the CI before the NCIs.
There is no File:Page:Slot, the row length is variable, it is Offset within Page.
There is no Bookmark or other gobbledegook.
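To make the two NCI leaf forms concrete, here is a minimal Python sketch. It is a toy model, not SQL Server internals: the dictionaries, the `seek` helper and the sample rows are all hypothetical. It only shows that an NCI over a heap hands back a physical address, while an NCI over a clustered table hands back the CI key, which is then used for a second seek.

```python
from bisect import bisect_left

# Heap: rows sit at fixed physical addresses (file, page, slot). Hypothetical data.
heap = {
    (1, 100, 0): {"id": 7, "name": "carol"},
    (1, 100, 1): {"id": 3, "name": "alice"},
    (1, 101, 0): {"id": 5, "name": "bob"},
}

# NCI on name over the heap: leaf entries are (name, RID), sorted by name.
nci_heap = sorted((row["name"], rid) for rid, row in heap.items())

# Clustered table: rows kept in clustered-key (id) order; the leaf IS the row.
clustered = sorted(((row["id"], row) for row in heap.values()), key=lambda e: e[0])

# NCI on name over the clustered table: leaf entries are (name, clustered key).
nci_clustered = sorted((row["name"], row["id"]) for row in heap.values())


def seek(entries, key):
    """Binary-search a sorted list of (key, payload) pairs and return the payload."""
    keys = [k for k, _ in entries]
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        raise KeyError(key)
    return entries[i][1]


def lookup_via_heap(name):
    rid = seek(nci_heap, name)           # NCI seek yields a physical address
    return heap[rid]                     # fetch the row at that address


def lookup_via_clustered(name):
    ci_key = seek(nci_clustered, name)   # NCI seek yields the clustered key
    return seek(clustered, ci_key)       # second seek, this time through the CI


print(lookup_via_heap("bob"))
print(lookup_via_clustered("bob"))
```

Both calls return the same row; the difference is only in what the NCI leaf stores and therefore in how the second step reaches the row.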
Re "No direct access to data row in clustered table - why"
Nonsense. You have direct and immediate access to each data row, via the CI (one less I/O) or the NCI⇢CI Key.
This is very fast, invented by Britton Lee; re-implemented and patented by Sybase; obtained by dishonesty and for a pittance by Darth Vader.
If you need further clarification, I can provide illustrations.
Responses to Comments
"It also means the B-Tree is reduced by one level in index height (which is why they are tiny if you inspected them)."
Let's say you have a table with 1 billion rows. The "height" of the B-Tree of any given index (eg. Unique, on PK) drawn vertically is say 8; or you can say the index is 8 levels deep, between the top (a single entry) and the bottom, the leaf level. The leaf level is of course the widest, and most populated; it will have 1 billion entries. Given that each index page contains say 256 entries, the leaf-minus-one level contains about 3.9 million entries (1 billion / 256).
The CI B-tree (index only portion) will contain 7 levels, with about 3.9 million entries at its lowest level, taking roughly 120MB (about 15K pages of 8KB); because the leaf level IS the data row (of which there are 1 billion entries, spread nicely across 100GB), and is thus excluded, or not repeated.
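The level arithmetic can be checked with a short Python sketch. It assumes a uniform fan-out of 256 entries per page at every level and treats the CI index-only portion as the same tree minus its leaf level, which is the simplification used above; `level_entries` is a hypothetical helper, not anything SQL Server exposes. With a uniform fan-out of 256 the tree comes out shallower than the illustrative 8 levels, but the one-level saving is the same.

```python
def level_entries(leaf_entries, fanout):
    """Entries per level, leaf first, root last, assuming a uniform fan-out.

    Each level above holds one entry per page of the level below; the loop
    stops once a level fits in a single page (the root).
    """
    sizes = [leaf_entries]
    while sizes[-1] > fanout:
        pages_below = -(-sizes[-1] // fanout)   # ceiling division
        sizes.append(pages_below)
    return sizes


rows, fanout = 1_000_000_000, 256

nci_levels = level_entries(rows, fanout)   # full tree: the leaf repeats every key
ci_index_levels = nci_levels[1:]           # CI: the leaf is the data row itself

print("NCI levels:", len(nci_levels), nci_levels)
print("CI index-only levels:", len(ci_index_levels), ci_index_levels)
```

Under these assumptions the level directly above the leaf holds 3,906,250 entries, and the clustered index needs one level fewer than a nonclustered index over the same keys.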
Yes, I'd like illustrations.
Ok. I have a set of finished Sybase docs; I have butchered one for you, so as to avoid confusion, and excluded the bits that Sybase has, that MS does not. Sorry. Don't follow the links, just stay on the one page. Also, the very low-level details of fragmentation in the Heap differ between the two, but the fragmentation in the Heap is massive in both Sybase and MS, so I have left that intact.
Data Storage Basics
(That is a condensed version of my much more elaborate Sybase diagrams, which I have butchered for the MS context. There is a link at the bottom of that doc, if you want the full Sybase set.)
"I started to read: Debunking myths about clustered indexes - part 4 (CIXs, TPC-C & Oracle clusters) and it, as many other sources, explicitly refer to the fact that SQL Server, in contrast to Oracle, lacks direct access features on a clustered table."
Be careful what you read, the web is full of superficial information; half truths discussed out of context; misinformation (from vendors as well as well-meaning ignorants). As you notice, I just answer questions; I do not waste time answering points raised in references.
Just think about this. Well-implemented Tables with a CI do not need de-fragmentation; and when implemented badly, need infrequent de-fragmentation; tables without a CI need frequent and pretty much offline de-fragmentation. That's your maintenance window running into Monday morning. Just an example of why discussing items in isolation is actually misinformation. Which is why my docs are all linked and related to one another.
"A few answerers referred to the fact that CI key in NCI leaf is for address independence in case of page splits."
Yes, I just would not put it that way, that's as confusing as the first reference you posted. Page splits have nothing to do with it. I put it the way I did in my post above on purpose, for clarity. Since the rows move (the CI keeps the pages and Extents trim), the NCI MUST have the CI key, in order to find the row. It can't use a RowId which would keep changing all the time. Unless you have wide CI keys, this is no big deal; a 4-byte RowId (plus processing overhead) vs an 8-byte CI key (minus said overhead) ... who cares (ok, maybe you). Address the higher level issues, and the low level issues will be small enough to not warrant address. Squeezing out 1% performance improvement at the low level when your db is fragmented and unnormalised, is more than a bit silly.
A system is an integrated set of components; none can be changed or evaluated in isolation. A bunch of components that are not integrated are dis-integrated, not a system. At your level of questioning, you are not yet in a position to form conclusions, or have grudges against this or that; if you do, they are premature conclusions and grudges, that will impede your progress. On top of that, there is a big difference between knowledge gained by question-and-answer vs knowledge gained by reading plus experience.
"Aren't during index reorganization or defragmenting of non-clustered table with CI the rows relocated and corresponding RIDs in NCI change in NCI?"
Do you mean "non-clustered INDEX with CI" ? Well the NCIs are not worth de-fragmenting, just drop/create them.
Or do you mean "de-fragmenting a CI [whole table]" ? I have already posted, when you re-create the CI (or de-fragment it in place), the NCIs are automatically rebuilt. It is not about RowIds, it is about the change: when you drop the CI, the NCIs have to be rewritten from CI keys to RowIds; when you create the CI, the NCIs have to be changed back to CI Keys. Switched on DBAs drop the NCI before dropping the CI.
"This seems to me as addressing scheme deficiency - the row should have moved with its address, should have not it?"
You're getting too low-level without understanding the higher levels. If the row moves, its address changes; if the address changes, the row moves. Either you have a CI (rows move) xor you have a Heap (rows do not move).
"Also, Is heap completely immune to page splits?"
No. Page Splits still happen when variable length rows expand and there is no room on the page. But in the scheme of things, massive fragmentation on Heaps, due to never moving rows, due to it being RowId based (which the NCIs rely on), this is a small item.

Let me re-phrase the question. How this is done I can read myself but I want to understand why it is done this way.
Would not it be much more efficient to continue referencing row data by RID from non-clustered index having (added) clustered one?
NO! If a table has an insert, and a page split occurs, then you would potentially have to update a lot of references that use a RID, so that they point to the new locations of the data rows that have been moved to a new page. That's exactly why the SQL Server team chose to use the clustering key instead as the "data pointer", so to speak. The value of the clustering key does not change when a page split occurs, so no updates to the indexes are needed.
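A small Python sketch of that argument, under loudly stated assumptions: pages hold at most 4 rows, a full page splits in half, and a row's physical address is (page id, slot). The page ids, the capacity and the `insert` helper are all hypothetical. The sketch compares addresses before and after one insert and lists the rows whose RID-style address would have changed.

```python
PAGE_CAPACITY = 4  # hypothetical rows per page


def rid_map(pages):
    """Map each key to its physical address (page_id, slot)."""
    return {key: (pid, slot) for pid, rows in pages for slot, key in enumerate(rows)}


def insert(pages, key, next_page_id):
    """Insert key into key-ordered pages, splitting a full page in half."""
    # Target page: the last page whose first key is <= key (else the first page).
    idx = 0
    for i, (_, rows) in enumerate(pages):
        if rows[0] <= key:
            idx = i
    _, rows = pages[idx]
    if len(rows) == PAGE_CAPACITY:
        half = PAGE_CAPACITY // 2
        new_rows = rows[half:]          # upper half moves to a brand-new page
        del rows[half:]
        pages.insert(idx + 1, (next_page_id, new_rows))
        if key >= new_rows[0]:
            rows = new_rows
    rows.append(key)
    rows.sort()


# Two full pages, in clustered-key order: (page_id, rows).
pages = [(1, [10, 20, 30, 40]), (2, [50, 60, 70, 80])]

before = rid_map(pages)
insert(pages, 25, next_page_id=3)
after = rid_map(pages)

changed = sorted(k for k in before if after[k] != before[k])
print("rows whose physical address changed:", changed)
print("clustering keys still valid:", set(before) <= set(after))
```

In this toy run, rows 30 and 40 end up on a new page, so every nonclustered index entry holding their RID would need rewriting; entries holding the clustering keys 30 and 40 need no change at all.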

Would not it be much more efficient to continue referencing row data by RID from non-clustered index having (added) clustered one?
The whole point of a clustered index is that the records are accessed via a logical locator (which is not normally meant to change), not physical.
If the indexes were pointing to a physical RID and a row changed its physical location (say from a page split), all indexes would need to be updated too.
It's exactly the kind of problem the clustered indexes were invented to deal with.

If a table has a clustered index, each non-clustered index row contains a copy of the clustered index key.
If a table does not have a clustered index, i.e. the table is a heap, each non-clustered index row contains a pointer built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).
When you identify (select) a row using a clustered index, you have all the columns from the row.
When you identify a row in a non-clustered index, you need to perform another lookup step to obtain the columns not included in the non-clustered index.

Would not it be much more efficient to continue referencing row data by RID from non-clustered index having (added) clustered one?
In many cases it would be more efficient, yes. I believe that clustered indexes were originally implemented that way (in version 6.0?). Presumably they were changed for the reasons that marc_s mentioned, which make sense if your clustered index is such that it has a lot of page splits.

I would not have posted this question had I seen, before posting here, the answer by AlexSmith there, which I only saw a few minutes after posting, when the question had already been answered here:
"Greg Linwood wrote a great series of blogs. It is a must read: Debunking myths about clustered indexes"
It is a pity that it is impossible to accept that answer here.
Update:
The answer accepted here, by PerformanceDBA, said: "The problem is that the doco is gobbledegook, and increases the very confusion it is alleging it clarifies"
Well, all the MSDN docs say and show (e.g., compare the pictures in "Clustered Index Structures" vs. "Heap Structures") that a clustered table does not have an IAM page. Meanwhile, the output of the code from "Inside the Storage Engine: Using DBCC PAGE and DBCC IND to find out if page splits ever roll back" shows the opposite.
Having no desire to continue spamming here, I shifted my questions and participation to www.sqlservercentral.com/Forums
My related question there:
Does clustered table have IAM page?

Related

How do I know if I should create a Non clustered Index on a Clustered Index or on a heap?

I have a DB containing some tables; no table has a non-clustered index defined. The big application which uses this DB is slow (because the number of rows is close to a million). I want to optimize DB fetch operations by adding indexes. When I read about indexes I came across index names like:
Clustered Index
Non clustered Index on a Clustered Index
Non Clustered Index on a heap
Also, indexes need to be created only on some columns. How do I identify which kind of index needs to be created on a table, and on which column(s)?
P.S. The execution plan shown while running a query suggests creating an NCI on all columns. Can I blindly go ahead and create the index as suggested by SQL Server?
A clustered index is a type of index which defines how the data of your table will be stored (more precisely, how the data is sorted). This is the reason why the clustered index columns should be chosen very carefully (sequentially inserted data is essential, or you will end up with fragmentation and performance issues over time; an integer "identity" column is a good pick, for example).
I have found that it is good practice to always have a clustered index on your permanent tables.
A table without a clustered index is a heap because data is not sorted in a particular way (it'll be added wherever there is free space, typically at the end), so data is harder to retrieve. The only improvement you can get from using a heap without indexes is that data insertion will be faster.
A non-clustered index is a separate structure that will help speed up your queries on the columns you choose (it stores the values of the indexed columns and a reference to the location of each row in the main structure). As your table grows larger, having those separate structures can dramatically improve the performance of your queries, because the db engine won't have to scan the entire table for the data you are looking for, but can just look up the position of the rows to retrieve in the index (which contains ordered data of the columns you've chosen).
Adding indexes will speed up your select queries, but slow down writing operations as the indexes have to be updated. So, don't create too many indexes on too many columns !
There are two types of tables: heap tables (which have no clustered index) and clustered tables (which do). Each of these can have any number of non-clustered indexes built on them.
When do you use a heap table? Realistically, in only one scenario: when you're doing parallel bulk imports. This specific scenario requires that the table have no clustered index. In all other scenarios, a heap table has worse performance than a table with a clustered index -- don't take my word for it, though: Microsoft has an article on this that, while dated, is still relevant. In other words, for most practical database work, you can ignore heap tables as a curiosity.
On what do you create your clustered index? Ideally, on a column with values that are ever increasing (or decreasing) and aren't changed in updates. Why? Because this has the least overhead for updating, as no data has to be moved. Because of these two requirements, surrogate keys in the form of IDENTITY columns are popular, since they neatly meet them. This is certainly not the only possible choice, though: indexing on an ever increasing timestamp is also popular (in big data warehouses, for example).
With that (mostly) out of the way, how do you decide what other columns to index? Now that's a great question, but not one I feel qualified to answer in all its glory here. I've gotten a lot of experience myself with index design over the years, but I'm not aware of specific books or articles that I could recommend (which is not to say they don't exist, and I hope other people can chime in with suggestions). For what it's worth, Microsoft itself has written a guide here, which is quite in-depth (perhaps too much so), but I haven't thoroughly read this myself.
Can you blindly go ahead and create the indexes as suggested by the query optimizer? If by that you mean "should I", then the answer is almost certainly no. The query optimizer is very eager to suggest any and all possible indexes that could speed up a query, but that doesn't mean they should all be created -- every index increases the overhead of performing inserts and updates on the table. If you followed the optimizer's advice, it's probable that you would eventually end up with indexes covering every possible combination of columns, which would be pretty terrible for anything that's not a SELECT query. Having said that, creating too many indexes is almost always not as awful as creating no indexes at all, since that quickly kills performance for most queries that involve tables with more than about 10,000 rows.
I could write books on this topic, but I haven't the time or (I fear) the skill. I hope this at least gets you started.

What do I miss in understanding the clustered index?

In the absence of any index, the table rows are accessed through the IAM (Index Allocation Map).
Can I directly access a row programmatically using the IAM?
Does the absence of an index mean that the only way to read a specific row is a full table scan, reading the whole table?
Why can't the IAM be used for more specific direct access?
"If the table is a heap (in other words, it has no clustered index), the bookmark is a row identifier (RID), which is an actual row locator in the form File#:Page#:Slot#" [1a]
There was no further definition of slot. Other sources say that Slot# is really the row number. Is that correct, or is some further cross-referencing with the IAM needed to determine a specific row?
Now, introducing a clustered index means that no data can be accessed directly, but only through a clustered index lookup or by traversing the clustered leaf nodes sequentially.
Do I understand correctly that introducing a clustered index is beneficial only for selecting contiguous (ranges of) rows, and only through clustered index keys?
What other benefits are there in clustering a table?
Do I understand correctly that introducing a clustered index reduces the performance benefit of using non-clustered indexes for non-exact match queries? No direct access, sequential access cannot be parallelized, non-clustered indexes are enlarged by clustered index keys, etc., correct?
Well, I see that clustering a table makes sense for quite specific and well understood contexts, while creating a primary key always clusters the table by default. Why is that?
What am I missing in my understanding of clustered indexes?
[1]
Inside Microsoft® SQL Server™ 2005: The Storage Engine
By Kalen Delaney - (Solid Quality Learning)
Publisher: Microsoft Press
Pub Date: October 11, 2006
Print ISBN-10: 0-7356-2105-5
Print ISBN-13: 978-0-7356-2105-3
Pages: 464
[1a] p.250 Section Index organization from Chapter 7. Index Internals and Management
Here is helpful online copypaste from it
http://sqlserverindexeorgnization.blogspot.com/
though without any credits to source
Related questions:
No direct access to data row in clustered table - why?
Why/when/how is whole clustered index scan chosen rather than full table scan?
Reasons not to have a clustered index in SQL Server 2005
Update: @PerformanceDBA,
"please, forget the doco you reference and start again"
Start again on the basis of what?
Any references, advice, or techniques on how to start again?
"A Clustered Index is always better"
Can you answer my question "Why/when/how is whole clustered index scan chosen rather than full table scan?" My doubt is about the meaning of a Full Clustered Index Scan. Doesn't it read more than a Full Table Scan?
"If there is an IAM, then there is an Index"
So, there is no IAM if there is no index at all?
Is there an IAM if there is a CI?
How am I supposed to verify/study it
if all the docs say the opposite:
- there is an IAM on a non-indexed table
- there is no IAM if there is a clustered index.
That's a lot of questions. Yes the IAM is used to look up pages on a heap. The difference is that without an index there is no way to know what pages to retrieve for any given piece of data. An important feature of the SQL / relational model of data is that queries access data by data values only - never by using pointers or other structures directly.
A slot number just identifies a row within a page. Row data is not logically ordered within a page, even in a clustered index. Each data page contains a row offset table that points to the position of rows within a page.
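The slot idea can be sketched in a few lines of Python. This is a toy page, not the real on-disk format: the `Page` class is hypothetical, rows are plain byte strings, and it assumes the slot array (row offset table) is the thing kept in key order while the row bytes are simply appended wherever there is room.

```python
class Page:
    """Toy data page: row bytes are appended in arrival order, and a slot
    array (row offset table) records where each row starts and how long it is."""

    def __init__(self):
        self.data = bytearray()
        self.slots = []            # slot number -> (offset, length)

    def insert(self, row):
        offset = len(self.data)
        self.data += row           # physical placement: just append
        self.slots.append((offset, len(row)))
        # Reorder only the slot array by row value; the bytes stay where they are.
        self.slots.sort(key=lambda s: bytes(self.data[s[0]:s[0] + s[1]]))

    def row(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])


page = Page()
for value in (b"cherry", b"apple", b"banana"):
    page.insert(value)

print(bytes(page.data))                     # physical order: arrival order
print([page.row(i) for i in range(3)])      # order seen through the slot array
print([offset for offset, _ in page.slots])
```

A locator of the form File#:Page#:Slot# therefore names an entry in this table, and the table in turn says at which byte offset the row actually sits.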
A clustered index can slow down data access from nonclustered indexes because of the additional bookmark lookups required. This can be mitigated by using the INCLUDE clause to add columns to a NC index. Sometimes it may be more efficient not to have a clustered index on a table.
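The effect of a covering index can be observed with Python's built-in sqlite3 module. This is an analogy only: SQLite is a different engine and has no INCLUDE clause, so the sketch widens the index key instead, and the table, column and index names are made up. What it shows is the general principle: when the index already carries every column the query needs, the plan no longer has to visit the base table.

```python
import sqlite3


def plan(index_sql):
    """Build a small table with one index and return the query plan text."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, client TEXT, amount REAL, note TEXT)"
    )
    con.executemany(
        "INSERT INTO orders (client, amount, note) VALUES (?, ?, ?)",
        [("acme", 10.0, "a"), ("zenith", 20.0, "b")],
    )
    con.execute(index_sql)
    rows = con.execute(
        "EXPLAIN QUERY PLAN SELECT amount FROM orders WHERE client = 'acme'"
    ).fetchall()
    con.close()
    return " ".join(str(r[-1]) for r in rows)   # last column holds the plan text


# Index on the search column only: the row must still be fetched for 'amount'.
narrow = plan("CREATE INDEX ix_client ON orders (client)")

# Index that also carries 'amount': the query is answered from the index alone.
wide = plan("CREATE INDEX ix_client_amount ON orders (client, amount)")

print(narrow)
print(wide)
```

In SQL Server the same idea would be expressed by adding the extra column with INCLUDE rather than making it part of the key.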
Please read my answer under "No direct access to data row in clustered table - why?", first.
If there is an IAM, then there is an Index.
But if there is no CI, then the rows are in a Heap, and yes, if you want to read it directly (without using an NCI, or where no Indices exist), you can only table scan a Heap.
A Clustered Index is always better than not having one. There is one exception, and one caveat, both for abnormal or sub-standard conditions:
Non-Unique CI Key. This causes Overflow Pages. Relational tables are required to have unique keys, so this is not a Relational table. The CI can be made unique quite easily by overloading the columns. It is still better (as per my other post) to have a Non-unique CI than no CI.
Monotonic Key. Typically an IDENTITY column. Instead of random Inserts which insert rows distributed throughout the data storage structure (as is normal with a "good" natural Relational key), the inserted Key is always on the last page. This causes an Insert hotspot, and reduces concurrency. Relational keys are supposed to be naturally unique; the surrogate is always an additional index. A surrogate-only table is simply not a relational table (it is a group of Unnormalised spreadsheets with row identifiers linking them together; you will not get the power of a database from that).
So the standing advice is, use an NCI for monotonic keys, and ensure that the CI allows good data distribution.
The advantages of CIs are vast, there is no good reason not to have one (there may be bad reasons as alluded to above).
CIs allow range queries; NCIs do not. But that is not the only reason.
Another caveat is you need to keep the width of the CI Key small, because it is carried in the NCIs. Now normally this is not a problem, as in, wide CI keys are fine. But where you have an Unnormalised bunch of spreadsheets masquerading as a database, which results in many more indices than a Normalised database, that does become a consideration. Therefore the standing advice for Empire devotees is, keep the width of the CI key down. CIs do not "increase" the NCIs, that is not stated accurately. If you have NCIs, well, each is going to have a pointer or a CI key; if you have a CI (with all the benefits) then the cost is a CI key instead of a RowId, which is negligible. So the accurate statement is, fat wide CI keys increase the NCIs.
Whoever says sequential access of CIs cannot be parallelised is wrong (MS may break it in one version and fix it in the next, but that is transient).
Using the ANSI SQL ...PRIMARY KEY ... notation defaults to UNIQUE CLUSTERED, because the db is supposed to be Relational. And the Unique PK is supposed to be a nice friendly Relational key, not an idiotic IDENTITY column. Therefore invariably (not counting exceptions) the PRIMARY KEY is the best candidate for clustering.
You can always create whatever indices you want by avoiding the ANSI SQL ...PRIMARY KEY ... notation and using CREATE [UNIQUE] [CLUSTERED] INDEX notation instead.
It is not possible to answer that last question of yours, you will have to keep asking questions until you run out. But please, forget the doco you reference and start again, otherwise we will be here for days discussing the difference between clear knowledge and gobbledegook.

Effects of Clustered Index on DB Performance

I recently became involved with a new software project which uses SQL Server 2000 for its data storage.
In reviewing the project, I discovered that one of the main tables uses a clustered index on its primary key which consists of four columns:
Sequence numeric(18, 0)
Date datetime
Client varchar(9)
Hash tinyint
This table experiences a lot of inserts in the course of normal operation.
Now, I'm a C++ developer, not a DB Admin, but my first impression of this table design was that having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.
In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?
So basically I need some ammunition for when I go to the powers that be to convince them that the table design should be changed.
The clustered index should contain the column(s) most queried by to give the greatest chance of seeks or of making a nonclustered index cover all the columns in the query.
The primary key and the clustered index do not have to be the same. They are both candidate keys, and tables often have more than one such key.
You said
In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?
That's not true. A seek can be had just by using the first column or two of the clustered index. It may be a range seek, but it's still a seek. You don't have to specify all the columns of it in order to get that benefit. But the order of the columns does matter a lot. If you're predominantly querying on Client, then the Sequence column is a bad choice as the first in the clustered index. The choice of the second column should be the item that is most queried in conjunction with the first (not by itself). If you find that a second column is queried by itself almost as often as the first column, then a nonclustered index will help.
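The leading-column point can be illustrated with a sorted list standing in for the index, in Python. The sample rows and the `seek_prefix` and `scan_by_sequence` helpers are hypothetical, and the key order (Client, Sequence, Date) is just one possible choice. A binary search on the first column alone finds a contiguous range, whereas a condition on a later column alone has to look at every entry.

```python
from bisect import bisect_left, bisect_right

# Entries sorted by the composite key (client, sequence, date). Hypothetical data.
index = sorted([
    ("acme", 17, "2009-01-05"),
    ("acme", 42, "2009-01-02"),
    ("bolt", 3, "2009-01-09"),
    ("bolt", 8, "2009-01-01"),
    ("zeta", 1, "2009-01-03"),
])


def seek_prefix(entries, client):
    """Range seek on the leading column only: two binary searches, one slice."""
    lo = bisect_left(entries, (client,))                  # first entry for client
    hi = bisect_right(entries, (client, float("inf")))    # just past the last one
    return entries[lo:hi]


def scan_by_sequence(entries, sequence):
    """A non-leading column alone gives no ordering to exploit: full scan."""
    return [e for e in entries if e[1] == sequence]


print(seek_prefix(index, "bolt"))
print(scan_by_sequence(index, 42))
```

The seek touches only the "bolt" range, while the lookup by sequence number walks the whole list, which is why the order of columns in the key matters so much.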
As others have said, reducing the number of columns/bytes in the clustered index as much as possible is important.
It's too bad that the Sequence is a random value instead of incrementing, but that may not be able to be helped. The answer isn't to throw in an identity column unless your application can start using it as the primary query condition on this table (unlikely). Now, since you're stuck with this random Sequence column (presuming it IS the most often queried), let's look at another of your statements:
having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.
That's not entirely true.
The physical location on the disk is not really what we're talking about here, but it does come into play in terms of fragmentation, which is a performance implication.
The rows inside each 8k page are not ordered. It's just that all the rows in each page are less than the next page and more than the previous one. The problem occurs when you insert a row and the page is full: you get a page split. The engine has to copy all the rows after the inserted row to a new page, and this can be expensive. With a random key you're going to get a lot of page splits. You can ameliorate the problem by using a lower fillfactor when rebuilding the index. You'd have to play with it to get the right number, but 70% or 60% might serve you better than 90%.
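A rough Python simulation of the difference between a random key and an ever-increasing one. The assumptions are mine, not measured SQL Server behaviour: 8 rows per page, a full page splits in half, and an insert past the end of the last page simply starts a new page without moving anything. Fill factor is not modelled. The `simulate` helper is hypothetical.

```python
import random


def simulate(keys, capacity=8):
    """Insert keys into key-ordered pages; return (splits, rows_moved)."""
    pages = [[]]
    splits = moved = 0
    for key in keys:
        # Target page: the last page whose first key is <= key (else the first).
        idx = 0
        for i, page in enumerate(pages):
            if page and page[0] <= key:
                idx = i
        page = pages[idx]
        if len(page) == capacity:
            if idx == len(pages) - 1 and key > page[-1]:
                # Insert at the very end: start a fresh page, nothing moves.
                pages.append([key])
                continue
            half = capacity // 2
            new_page = page[half:]      # upper half is copied to a new page
            del page[half:]
            pages.insert(idx + 1, new_page)
            splits += 1
            moved += len(new_page)
            if key >= new_page[0]:
                page = new_page
        page.append(key)
        page.sort()
    return splits, moved


ascending = list(range(2000))
shuffled = ascending[:]
random.Random(0).shuffle(shuffled)      # fixed seed so runs are repeatable

print("ascending key (splits, rows moved):", simulate(ascending))
print("random key    (splits, rows moved):", simulate(shuffled))
```

Under these assumptions the ascending key never moves an existing row, while the random key splits pages throughout the run; leaving free space on each page, as a lower fillfactor does, postpones those splits rather than removing them.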
I believe that having datetime as the second CI column could be beneficial, since you'd still be dealing with pages needing to be split between two different Sequence values, but it's not nearly as bad as if the second column in the CI was also random, since you'd be guaranteed to page split on every insert, where with an ascending value you can get lucky if the row can be added to a page because the next Sequence number starts on the next page.
Shortening the data types and number of all columns in a table as well as its nonclustered indexes can boost performance too, since more rows per page = fewer page reads per request. Especially if the engine is forced to do a table scan. Moving a bunch of rarely-queried columns to a separate 1-1 table could do wonders for some of your queries.
Last, there are some design tweaks that could help as well (in my opinion):
Change the Sequence column to a bigint to save a byte for every row (8 bytes instead of 9 for the numeric).
Use a lookup table for Client with a 4-byte int identity column instead of a varchar(9). This saves 5 bytes per row. If possible, use a smallint (-32768 to 32767) which is 2 bytes, an even greater savings of 7 bytes per row.
Summary: The CI should start with the column most queried on. Remove any columns from the CI that you can. Shorten columns (bytes) as much as you can. Use a lower fillfactor to mitigate the page splits caused by the random Sequence column (if it has to stay first because of being queried the most).
Oh, and get your online defragging going. If the table can't be changed, at least it can be reorganized frequently to keep it in best possible shape. Don't neglect statistics, either, so the engine can pick appropriate execution plans.
UPDATE
Another strategy to consider is if the composite key used in the table can be converted to an int, and a lookup table of the values is created. Let's say some combination of less than all 4 columns is repeated in over 100 rows, for example, Sequence + Client + Hash but only with varying Date values. Then an insert to a separate SequenceClientHash table with an identity column could make sense, because then you could look up the artificial key once and use it over and over again. This would also get your CI to add new rows only on the last page (yay) and substantially reduce the size of the CI as repeated in all nonclustered indexes (yippee). But this would only make sense in certain narrow usage patterns.
Now, marc_s suggested just adding an additional int identity column as the clustered index. It is possible that this could help by making all the nonclustered indexes get more rows per page, but it all depends on exactly where you want the performance to be, because this would guarantee that every single query on the table would have to use a bookmark lookup and you could never get a table seek.
About "tons of page splits and bad index fragmentation": as I already said this can be ameliorated somewhat with a lower fill factor. Also, frequent online index reorganization (not the same as rebuilding) can help reduce the effect of this.
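The effect of a lower fill factor on page splits can be sketched with a toy simulation. This is not how SQL Server stores pages internally; it only assumes that a page holds a fixed number of rows, that an insert into a full page splits it in half, and that new keys arrive in random order. The capacity, key range and row counts are made up for illustration.

```python
import bisect
import random

PAGE_CAPACITY = 10  # hypothetical rows per page, kept tiny for readability


def build_pages(keys, fill_percent):
    """Pack sorted keys into pages filled to fill_percent of capacity."""
    per_page = max(1, PAGE_CAPACITY * fill_percent // 100)
    keys = sorted(keys)
    return [keys[i:i + per_page] for i in range(0, len(keys), per_page)]


def insert_and_count_splits(pages, new_keys):
    """Insert keys in arrival order and return how many page splits occurred."""
    splits = 0
    for key in new_keys:
        first_keys = [page[0] for page in pages]
        idx = max(0, bisect.bisect_right(first_keys, key) - 1)
        page = pages[idx]
        bisect.insort(page, key)
        if len(page) > PAGE_CAPACITY:
            half = len(page) // 2
            pages[idx:idx + 1] = [page[:half], page[half:]]
            splits += 1
    return splits


rng = random.Random(42)
existing = rng.sample(range(1_000_000), 1000)                # rows already in the table
incoming = [rng.randrange(1_000_000) for _ in range(200)]    # random-key inserts

splits_at_100 = insert_and_count_splits(build_pages(existing, 100), incoming)
splits_at_70 = insert_and_count_splits(build_pages(existing, 70), incoming)
print(splits_at_100, splits_at_70)
```

The free space left by the 70 percent fill absorbs most of the random inserts, so far fewer pages split. The price is more pages to read for the same number of rows, which is why this only mitigates the problem.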
Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized. For some systems, having a slower insert isn't bad as long as selects are always fast. For others, having consistent but slightly slower select times is more important than having slightly faster but inconsistent select times. For others, the data isn't really read until it's pushed to a data warehouse anyway so the inserts need to be as fast as possible. Added to the mix is the fact that performance isn't just about user wait time or even query response time but also about server resources, especially in the case of massive parallelism, so that total throughput (say, in client responses per time unit) matters more than any other factor.
Clustered indexes (CI) work best over ever-increasing, narrow, rarely changing values. You'll want your CI to cover the column(s) that get hit the most often in queries with >=, <=, or BETWEEN statements.
I'm not sure how your data normally gets hit. Most often you'll see a CI on an IDENTITY column or another narrow column (because this column will also be returned "tacked on" to all non-clustered indexes, and we don't want a ton of data added on to every fetch if it isn't needed). It's possible the data might be getting queried most often on date, and that may be a good choice, but all four columns is likely not correct (I stress likely, because I don't know the set-up; this may not have anything wrong with it). There are some pointers here: http://msdn.microsoft.com/en-us/library/aa933131%28SQL.80%29.aspx
There are a few things you are misunderstanding about how SQL creates and uses indexes.
Clustered indexes aren't necessarily physically ordered on disk by the clustering key, at least not in real time. The ordering is a logical one.
I wouldn't expect a major performance hit based on this structure and removing the clustered index before you have actually identified a performance issue related to that index is clearly premature optimization.
Also, an index can be useful (especially one with several fields in it) even for searches that don't sort or get queried on all columns included in it.
Obviously, there should be a justification for creating a multi-part clustered index, just like any index, so it makes sense to ask for that if you think it was added capriciously.
Bottom line: Don't optimize the indexes for insert performance until you have actually detected a performance problem with inserts. It usually isn't worth it.
If you have only that single clustered index on your table, that might not be too bad. However, the clustering key is also used for looking up the real data page for any hit in a non-clustered index - therefore, the clustering key (all its columns) is also part of each and every non-clustered index you might have on your table.
So if you have a few nonclustered indices on your table, then you're definitely a) wasting a lot of space (and not just on disk - also in your server's RAM!), and b) hurting your performance.
A good clustered index ought to be:
small (best bet: a 4-byte INT) - yours is pretty bad with up to 28 bytes per entry
unique
stable (never change)
ever-increasing
I would bet your current setup violates at least two if not more of those requirements. Not following these recommendations will lead to waste of space, and as you rightfully say, lots of page and index fragmentation and page splits (having to "rearrange" the data when an insert happens somewhere in the middle of the clustered index).
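As a back-of-the-envelope check on the space argument, the following Python sketch multiplies out how often the clustering key is stored. The row count and the number of nonclustered indexes are hypothetical. The model also ignores cases where a nonclustered index already contains some of the clustering columns, so treat the result as an upper-bound illustration rather than a measurement.

```python
def clustering_key_bytes(rows, key_bytes, nonclustered_indexes):
    """Bytes spent on the clustering key: once in the clustered index
    and once more in every row of every nonclustered index."""
    return rows * key_bytes * (1 + nonclustered_indexes)


ROWS = 10_000_000      # hypothetical table size
NONCLUSTERED = 4       # hypothetical number of nonclustered indexes

wide_key = clustering_key_bytes(ROWS, 28, NONCLUSTERED)   # 28-byte composite key
narrow_key = clustering_key_bytes(ROWS, 4, NONCLUSTERED)  # 4-byte INT key
saved_mb = (wide_key - narrow_key) / (1024 * 1024)
print(f"{saved_mb:.0f} MB saved")
```

Under these assumptions the 28-byte key costs seven times as much as a 4-byte INT, on disk and in the buffer pool alike.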
Quite honestly: just add a surrogate ID INT IDENTITY(1,1) to your table and make that the primary clustered key - you should see quite a nice boost in performance, just from that, if you have lots of INSERT (and UPDATE) operations going on!
See some more background info on what makes a good clustering key, and what is important about them, here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
I ultimately agree with Erik's last paragraph:
"Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized..."
This is the basic thing I force people to learn: there's no universal solution.
You have to know your data and the actions performed against it. You have to know how frequent different types of actions are, along with their impact and expected execution times (you don't have to hard-tune some rarely executed query and impact everything else if the end user agrees the query execution time is not so important; let's say waiting a few minutes for some report once per week is okay). Of course, as Erik said
"performance isn't just about user wait time or even query response time but also about server resources"
If such a query affects overall server performance, it should be considered a serious candidate for optimization, even if execution time is fine. I've seen some very fast queries that used a huge amount of CPU on multiprocessor servers, while slightly slower solutions were incomparably "lighter" from a resource utilization point of view. In that case I almost always go for the slower one.
Once you know what your goal is, you can decide how many indexes you need and which one should be clustered. Unique constraints, filtered indexes, and indexes with included columns are quite powerful tools for tuning. Choosing the proper columns is important, but often choosing the proper order of columns is even more important. And in the end, don't kill insert/update performance with tons of indexes if the table is frequently modified.

SQL Server: How does the type of an index affect a join's performance?

If I am trying to squeeze every last drop of performance out of a query, what effect does it have when my joins use each of these types of index?
clustered index.
non-clustered index.
clustered or non-clustered index with extra columns that may not be involved in the join.
Will I gain any performance if I go through and create clustered indexes that only contain the columns involved in my joins and nothing else?
(I realize I may have to move the clustered index from another index, making that index non-clustered, since a table can only have one.)
In addition to Gareth Saul's answer a tiny clarification:
"Non-clustered indexes repeat the included fields, with pointer to the rows that have that value."
This pointer to the actual data value is the column (or the set of columns) that are in your clustering key.
That's one of the main reasons why you should try and keep the clustering key small and static - small because otherwise you'll waste a lot of space, on disk and in your server's RAM, and static because otherwise, you'll have to update not just your clustering index, but also all your non-clustered indices as well, if your value changes.
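A toy model in Python may make the "lookup pointer is the clustering key" idea concrete. Plain dictionaries stand in for the B-trees, and the table, column names and values are invented for illustration. It only shows the two-step lookup and why a changed clustering key ripples into every nonclustered index.

```python
# Clustered index: clustering key -> full row (the row lives in the leaf level).
clustered = {
    101: {"id": 101, "email": "ann@example.com", "city": "Oslo"},
    102: {"id": 102, "email": "bob@example.com", "city": "Lima"},
}

# Nonclustered indexes: indexed value -> clustering key (the lookup value).
by_email = {row["email"]: key for key, row in clustered.items()}
by_city = {row["city"]: key for key, row in clustered.items()}


def lookup_by_email(email):
    key = by_email[email]      # step 1: seek the nonclustered index
    return clustered[key]      # step 2: seek the clustered index with that key


def change_clustering_key(old_key, new_key):
    """Move a row to a new clustering key; return how many structures changed."""
    row = clustered.pop(old_key)
    row["id"] = new_key
    clustered[new_key] = row
    touched = 1  # the clustered index itself
    for index in (by_email, by_city):
        for value, key in list(index.items()):
            if key == old_key:
                index[value] = new_key
                touched += 1
    return touched


print(lookup_by_email("bob@example.com")["city"])
print(change_clustering_key(102, 202))
```

With a static key the second function never has to run, and with a small key each entry in `by_email` and `by_city` stays small.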
This "lookup pointer is the clustering key" feature has been in SQL Server since version 7, as Kim Tripp will explain in great detail here:
What is a clustered index?
"In SQL Server 7.0 and higher the internal dependencies on the clustering key CHANGED. (Yes, it's important to know that things CHANGED in 7.0... why? Because there are still some folks out there that don't realize how RADICAL of a change occurred in the internals (wrt to the clustering key) in SQL Server 7.0). What changed is that the clustering key gets used as the "lookup" value from the nonclustered indexes."
Will I gain any performance if I go through and create clustered indexes that only contain the columns involved in my joins and nothing else?
Not as I understand it. The point of a clustered index is that it sorts the data on disk around that index (which is why you can only have one), so if your join data isn't being sorted by those exact columns as well, I don't think it'd make any difference. Plus, by putting data that might change (as opposed to the key) into the clustered index, you make it more likely that things will need rebuilding periodically, slowing the overall database down.
Sorry if this sounds a daft question, but have you tried running your query through the index tuning wizard? Not foolproof by any stretch but I've had some decent improvements from it in the past.
You only get one clustered index - this is what controls the physical storage of the table on disk / in memory.
Non-clustered indexes repeat the included fields, with pointer to the rows that have that value. Having an index on the columns being used in your joins should improve performance. You can further optimise by using "included columns" in your index - this duplicates the row information directly into the index, which can remove the performance penalty of having to look up the row itself to perform the select.
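The benefit of carrying the selected columns in the index can be seen in a query plan. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, because it runs anywhere. SQLite is not SQL Server, it has no INCLUDE clause, and its plan text differs, so a composite index plays the role of "included columns" here. The table and index names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "total REAL, notes TEXT)"
)

QUERY = "SELECT customer_id, total FROM orders WHERE customer_id = ?"


def plan(sql):
    """Return SQLite's query plan description for the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (1,)).fetchall()
    return " | ".join(row[3] for row in rows)


# Index on the join/filter column only: each hit still needs a row lookup.
conn.execute("CREATE INDEX idx_narrow ON orders (customer_id)")
narrow_plan = plan(QUERY)
conn.execute("DROP INDEX idx_narrow")

# Index that also carries the selected column: no row lookup needed.
conn.execute("CREATE INDEX idx_wide ON orders (customer_id, total)")
wide_plan = plan(QUERY)

print(narrow_plan)
print(wide_plan)
```

In SQL Server the second index would typically be written with the selected column in an INCLUDE clause rather than in the key, but the idea is the same: the query is answered from the index alone.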
It is useful to pay attention to the order in which your joins occur - the sequence of columns in your index should match up to this. Remember that the SQL engine may optimise and re-order your query internally - profiling may be helpful.
In most situations, you can just use the Database Engine Tuning Advisor - the recommendations it provides are pretty much spot on.
If you can, your best bet is a non-clustered index that has all the elements of your join in it and, if possible, the fields you are selecting.
This creates a spanning index (more commonly called a covering index), meaning that all the fields SQL Server requires to perform the query are in one index.
If possible, have an index with no unnecessary fields in it. Every field added makes each individual index record larger, and the smaller each index record is, the more of them you get in each page. The more index items you get in each page, the less often you have to go to disk.
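To see how index record size feeds into disk reads, here is a rough Python sketch of B-tree depth. It assumes every level of the index uses entries of the same size and that a page offers 8096 usable bytes. Both the entry sizes and the row count are illustrative assumptions.

```python
PAGE_BODY_BYTES = 8096  # 8192-byte page minus the 96-byte page header


def index_levels(total_rows: int, entry_bytes: int) -> int:
    """Levels (page reads per seek) needed to address total_rows entries."""
    entries_per_page = PAGE_BODY_BYTES // entry_bytes
    levels = 1
    reachable = entries_per_page
    while reachable < total_rows:
        reachable *= entries_per_page
        levels += 1
    return levels


ROWS = 10_000_000  # hypothetical

lean_levels = index_levels(ROWS, 16)    # narrow index record
bloated_levels = index_levels(ROWS, 200)  # record padded with unneeded fields
print(lean_levels, bloated_levels)
```

Each extra level is one more page read on every seek, on top of the larger number of leaf pages a range scan has to walk.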
Clustered Index - The table is laid out in the order specified in the index, which means you will get better performance for select * from TABLE where INDEXFIELD = 3. Unless you are selecting lots of large data items, this should not be required.

Does it make sense to put an index on tiny tables

I have a few tables with 4 to 10 rows. We don't anticipate that these tables will ever grow by much more than a few rows.
Does it make sense to put an index on their primary keys?
If they have primary keys as you stated, then you already have at least one index. This is probably a clustered index, and I think you are good to go.
The correct answer is: IT DOESN'T MATTER.
Any time the table is small enough to fit inside a single 8k data page, SQL server can simply load that one page, and have the "entire table" available to do whatever it needs.
A clustered index is the table itself, so if you add a clustered index, you're not really adding any overhead, you're just specifying a sort preference within the single data page where the table resides.
A nonclustered index, on the other hand, is a separate object, so it would just be wasted space, because it would never be used. (The query optimizer is never going to load an index that only has pointers to a single data page. It'll just load the only data page directly).
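The single-page argument can be checked with a few lines of Python. The row width and per-row overhead are assumed values, and the read counts come from a deliberately crude model in which a nonclustered lookup costs one index page plus one data page.

```python
PAGE_BODY_BYTES = 8096   # 8192-byte page minus the 96-byte page header
ROW_OVERHEAD_BYTES = 9   # assumed per-row cost, illustrative only


def data_pages(rows: int, row_bytes: int) -> int:
    """Pages needed to hold the table in this simplified model."""
    per_page = PAGE_BODY_BYTES // (row_bytes + ROW_OVERHEAD_BYTES)
    return -(-rows // per_page)  # ceiling division


def reads_for_scan(rows: int, row_bytes: int) -> int:
    return data_pages(rows, row_bytes)


def reads_via_nonclustered(index_levels: int = 1) -> int:
    return index_levels + 1  # index page(s) plus the data page lookup


tiny_scan = reads_for_scan(10, 100)        # hypothetical 10-row lookup table
tiny_via_index = reads_via_nonclustered()
print(tiny_scan, tiny_via_index)
```

For a table this small the scan is already the cheapest possible access path, so a separate index has nothing to offer.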
By all means make sure you have a primary key, but if you also add the clustered index, it isn't going to mean much (and likely wouldn't ever be used) unless the table grows well beyond one page.
You should almost always have a clustered index in the very least. I would say yes, go ahead and index them.
It certainly won't hurt, and should the data in that table grow past your expectations, you will at least have a simple indexing strategy in place to help mitigate the effect of the increased table size.
If you're using a primary key, then you will already have one clustered index.
Hopefully you have a primary key which is your clustered index. But beyond the column(s) already in that index, there is little point in indexing anything else: a 4 to 10 row table is tiny, and there is more cost associated with looking up an index than with an actual table scan.
Someone please keep me honest here - for SQL 2008 in large scale production and reporting environments, we do not bother with indices on tables with less than 50k rows.
I’m with Tapori on this; adding indexes will unnecessarily add overhead.

Resources