Clustered indexes SQL Server - sql-server

I have an Oracle background, and using index-organized tables (IOTs) for every table sounds unreasonable in Oracle; I have never actually seen it done. Yet in SQL Server, every database I have worked on has a clustered index on every table, which is conceptually the same as an IOT.
Why is that? Is there any reason for using clustered indexes everywhere? It seems to me they would be good only for a handful of cases.
Thanks

A clustered index is not quite the same thing as an index-organised table. An IOT must be organised around its primary key, whereas a clustered index on SQL Server does not have to be unique and does not have to be the primary key.
Clustered indexes are widely used on SQL Server, as there is almost always some natural ordering that makes a commonly used query more efficient. IOTs in Oracle carry more baggage, so they aren't quite as useful, although they may well be more useful than they're commonly given credit for.
Historically, really old versions of SQL Server (pre-7.0, IIRC) did not support row-level locking and could only lock at a table or page level. Often a clustered index would be used to ensure that writes were scattered around the table's physical storage to minimise contention on page locks. However, SQL Server 6.x went out of support years ago, so applications with this issue will be restricted to rare legacy systems.

Without a clustered index, your table is organized as a heap. This means that every row that is inserted is added to a data page at the end of the table. Also, as rows get updated, they get moved to a data page at the end of the table if the updated data is larger than before.
When it is good not to have a clustered index
If you have a table that needs the fastest possible inserts, but can sacrifice update and read speed, then not having a clustered index may work for you. One example would be a table used as a queue: lots of inserts that later just get read and moved to a different table.
Clustered Indexes
Clustered indexes organize the data in your table based on the columns in the clustered index. If you cluster on the wrong thing, for instance a uniqueidentifier, this can slow things down (see below).
As long as your clustered index is on the value most commonly used for searching, and that value is unique and increasing, you get some amazing performance benefits out of the clustered index. For instance, if you have a table called USERS and you commonly look up user data by USER_ID, then clustering on USER_ID will speed up all of those lookups, simply because it reduces the number of data pages that need to be read to get at your data.
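As an illustration, a minimal sketch of that USERS scenario (the exact column definitions are hypothetical):

    -- Cluster on the unique, ever-increasing USER_ID.
    CREATE TABLE USERS (
        USER_ID   INT IDENTITY(1,1) NOT NULL,
        USER_NAME NVARCHAR(100)     NOT NULL,
        CONSTRAINT PK_USERS PRIMARY KEY CLUSTERED (USER_ID)
    );

    -- Lookups by USER_ID traverse the clustered B-tree directly,
    -- touching only a handful of data pages.
    SELECT USER_ID, USER_NAME FROM USERS WHERE USER_ID = 42;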
If you have too many key columns in your clustered index, this can also slow things down.
General rules for clustered indexes:
Don't cluster on any varchar columns.
Clustering on INT IDENTITY columns is usually best.
Cluster on what you commonly search on.
Clustering on UniqueIdentifiers
Uniqueidentifiers in an index are extremely inefficient because they have no natural sort order. Given the B-tree structure of the index, you end up with extremely fragmented indexes when using uniqueidentifiers, and they become extremely fragmented again soon after rebuilding or reorganizing. So you end up with a slower index that is huge in memory and on disk due to the fragmentation. Also, on insert of a uniqueidentifier you are more likely to cause a page split in the index, slowing your insert. Generally, uniqueidentifiers are bad news for indexes.
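You can measure this fragmentation directly; a sketch against the standard DMV for the current database:

    -- Report fragmentation for every index in the current database.
    SELECT OBJECT_NAME(ips.object_id)          AS table_name,
           i.name                              AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id
     AND i.index_id  = ips.index_id
    ORDER BY ips.avg_fragmentation_in_percent DESC;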
Summary
My recommendation is that every table should have a clustered index on it unless there is a really good reason not to (e.g. a table functioning as a queue).

I can't see why you would prefer a heap over a clustered index most of the time. With clustering, you get one index of your choice for free. Most of the time this is the primary key (which you probably want to enforce anyway!).
Heaps are mostly for special situations.

In relational databases we use primary keys, and relationships are generally established via these keys. Most people name the first column TableID and make it the primary key. When you join two or more tables in your query, you will get the fastest results if you use clustered indexes.

Related

SQL Server: ~2000 Heap Tables all using GUID Uniqueidentifier - Possible Clustered Indexing?

I have just taken over a database which has around 2200 tables. Over 2000 of these have no clustered index (some have no indexes at all).
All of the tables are keyed on a GUID (uniqueidentifier) column.
Just looking at the query plans, I can see that there are many table scans occurring. Most searches use the uniqueidentifier to search on.
I am wondering if it is better to have a clustered index on the GUID than not to have a clustered index at all. I imagine that a clustered index on a 16-byte column will inevitably lead to fragmentation.
I could arguably cluster on other columns but the majority of searches tend to search by or join via the GUIDS.
Any advice would be very much welcomed. I've never seen so many GUIDs!
In general, I would recommend having an identity column as the primary key and using that for clustering. This is also a better choice for joins.
Why? First, identity keys are generally shorter than uniqueidentifiers, so foreign key references and indexes are smaller.
More importantly, inserts will always go at the "end" of the table. With GUIDs, inserts often cause fragmentation. If you are inserting many rows, I would say that a secondary index on the GUID might be better than a clustered index (then the fragmentation is confined to the index).
With 2000 tables, I doubt you will change the structure. You can ameliorate the fragmentation using newsequentialid().
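A sketch of that mitigation; dbo.MyTable and its Id column are placeholders for your own schema. Note that NEWSEQUENTIALID() may only appear in a DEFAULT constraint, and only newly inserted rows benefit:

    -- Make future GUIDs sequential to reduce page splits on insert.
    ALTER TABLE dbo.MyTable
        ADD CONSTRAINT DF_MyTable_Id
        DEFAULT NEWSEQUENTIALID() FOR Id;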
A GUID column with random values is usually not the best choice for a clustered index, because it can be the root cause of index fragmentation:
Read-ahead by the database engine won't be effective;
Insert operations become too expensive, because you get lots of page-split overhead.
There are three ways you can live with that:
Schedule regular index reorganizing and rebuilding, which will reduce index fragmentation and refresh your statistics;
Use newsequentialid() for generating values of this column;
Generate GUID values sequentially outside of the database (the Guid.Comb identifier in NHibernate is a great example of solving this issue).
This is really a comment on your question, following up on Gordon's good answer:
Firstly, don't forget to check the index DMVs to see which indexes are being used (or not used), and look at the expensive query plans in the cache, to focus on the tables and queries that are causing the most pain. I would expect that many of those 2200 tables are relatively small, and that queries can look up rows quickly even via the GUID clustered index.
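One way to run that first check is the usage-stats DMV; a sketch (counters reset on instance restart, and an index that never appears here hasn't been touched since then):

    -- Seeks/scans/lookups = reads; user_updates = write overhead.
    SELECT OBJECT_NAME(us.object_id) AS table_name,
           i.name                    AS index_name,
           us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
    FROM sys.dm_db_index_usage_stats AS us
    JOIN sys.indexes AS i
      ON i.object_id = us.object_id
     AND i.index_id  = us.index_id
    WHERE us.database_id = DB_ID()
    ORDER BY us.user_seeks + us.user_scans + us.user_lookups;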
For those tables that aren't clustered, clustering on the guid would reduce fragmentation, since it forces all the data for the table to be colocated rather than allowing pages to be put in the next free extent, spreading the table all over the disk. This should make some of the I/O more efficient.
Check that you have a low enough fill factor so that your regular index rebuilds head off page splits in advance, although this is also workload dependent (OLTP vs DW, and the read/write ratio of the table).
If your applications do explicit column selects/inserts, then you may be able to add an identity column without breaking anything. That allows you to cluster on the identity and add an index to the guid. Whether this really helps depends on the relative (in)efficiency of the new plans.
You could consider clustering on a non-guid field that queries look up against fairly regularly (e.g. a date range) and indexing the guid separately.
You'd have to look at the queries and relative performance more closely for that.
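A sketch of the identity-column approach mentioned above, assuming a hypothetical dbo.Orders table currently keyed only on an OrderGuid column:

    -- Add a narrow surrogate and cluster on it...
    ALTER TABLE dbo.Orders ADD OrderId INT IDENTITY(1,1) NOT NULL;
    CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderId ON dbo.Orders (OrderId);

    -- ...while keeping GUID lookups fast with a non-clustered index.
    CREATE UNIQUE NONCLUSTERED INDEX IX_Orders_OrderGuid ON dbo.Orders (OrderGuid);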

How do I know if I should create a Non-clustered Index on a Clustered Index or on a heap?

I have a DB containing some tables, and no table has a non-clustered index defined. The big application which uses this DB is slow (because the number of rows is close to a million). I want to optimize DB fetch operations by adding indexes. When I read about indexes I came across index names like:
Clustered Index
Non clustered Index on a Clustered Index
Non Clustered Index on a heap
Also, indexes should be created only on some columns. How do I identify which kind of index needs to be created on a table, and on which column(s)?
P.S. The execution plan shown while running my queries suggests creating an NCI on all columns. Can I blindly go ahead and create the indexes as suggested by SQL Server?
A clustered index is a type of index which defines how the data of your table is stored (more precisely, how it is sorted). This is why the clustered index columns should be chosen very carefully (sequential inserts are essential, or you will end up with fragmentation and performance issues over time; an integer identity column is a good pick, for example).
I have found that it is good practice to always have a clustered index on your permanent tables.
A table without a clustered index is a heap, because its data is not sorted in any particular way (new data is added at the end of the file), and is therefore harder to retrieve. The only improvement you get from using a heap without indexes is that data insertion is faster.
A non-clustered index is a separate structure that helps speed up your queries on the columns you choose (it stores the values of the indexed columns and their references to the row locations in the main file). As your table grows, these separate structures can dramatically improve the performance of your queries, because the db engine won't have to scan the entire table for the data you are looking for, but can just look up the positions of the rows to retrieve in the index (which contains ordered data for the columns you've chosen).
Adding indexes will speed up your select queries but slow down write operations, as the indexes have to be updated. So don't create too many indexes on too many columns!
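As a concrete (hypothetical) example: an index supporting lookups by CustomerId that also covers the two columns the query reads, so the engine can answer from the index alone:

    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, TotalAmount);   -- covered columns, no key lookup needed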
There are two types of tables: heap tables (which have no clustered index) and clustered tables (which do). Each of these can have any number of non-clustered indexes built on them.
When do you use a heap table? Realistically, in only one scenario: when you're doing parallel bulk imports. This specific scenario requires that the table have no clustered index. In all other scenarios, a heap table has worse performance than a table with a clustered index -- don't take my word for it, though: Microsoft has an article on this that, while dated, is still relevant. In other words, for most practical database work, you can ignore heap tables as a curiosity.
On what do you create your clustered index? Ideally, on a column with values that are ever increasing (or decreasing) and aren't changed in updates. Why? Because this has the least overhead for updating, as no data has to be moved. Because of these two requirements, surrogate keys in the form of IDENTITY columns are popular, since they neatly meet them. This is certainly not the only possible choice, though: indexing on an ever increasing timestamp is also popular (in big data warehouses, for example).
With that (mostly) out of the way, how do you decide what other columns to index? Now that's a great question, but not one I feel qualified to answer in all its glory here. I've gotten a lot of experience myself with index design over the years, but I'm not aware of specific books or articles that I could recommend (which is not to say they don't exist, and I hope other people can chime in with suggestions). For what it's worth, Microsoft itself has written a guide here, which is quite in-depth (perhaps too much so), but I haven't thoroughly read this myself.
Can you blindly go ahead and create the indexes as suggested by the query optimizer? If by that you mean "should I", then the answer is almost certainly no. The query optimizer is very eager to suggest any and all possible indexes that could speed up a query, but that doesn't mean they should all be created -- every index increases the overhead of inserts and updates on the table. If you followed the optimizer's advice, you would probably end up with indexes covering every possible combination of columns, which would be pretty terrible for anything that's not a SELECT query. Having said that, creating too many indexes is almost always not as awful as creating no indexes at all, since a lack of indexes quickly kills performance for most queries that involve tables with more than about 10,000 rows.
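If you want to see the optimizer's accumulated suggestions in one place rather than plan by plan, the missing-index DMVs expose them; a sketch (treat the output as hints to review, not a to-do list):

    -- Suggestions ranked by rough estimated impact.
    SELECT mid.statement AS table_name,
           mid.equality_columns,
           mid.inequality_columns,
           mid.included_columns,
           migs.avg_user_impact,
           migs.user_seeks
    FROM sys.dm_db_missing_index_details AS mid
    JOIN sys.dm_db_missing_index_groups AS mig
      ON mig.index_handle = mid.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS migs
      ON migs.group_handle = mig.index_group_handle
    ORDER BY migs.avg_user_impact * migs.user_seeks DESC;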
I could write books on this topic, but I haven't the time or (I fear) the skill. I hope this at least gets you started.

Use of non-clustered index on guid type column in SQL Server

I would like optimize the performance of a database that my team is using for an application.
I have been looking for places to add foreign keys and, in turn, to index those columns to improve the performance of joins. However, many of our tables are joined on an id of GUID type, generated upon insertion of an item, and the tables holding data associated with that item generally have an item_id column containing the GUID.
I have read that adding clustered indexes to GUID type columns is a very bad decision because the index will need to be constantly reconstructed in order to be effective. However, I was wondering, is there any detriment to utilizing a non-clustered index in the scenario described above? Or is it reasonable to assume that it would help performance? I can provide more information if needed.
An index on a <anytype> column is by far the best option you have to improve joins and singleton lookups. Without this index, a query will always have to scan the entire table end-to-end, with (often) abysmal performance results and concurrency gone down the drain.
It is true that uniqueidentifier makes a poor choice for indexes for the reasons you mention, but by no means does that imply you should not create these indexes. Changing the data type to INT or BIGINT would be advisable, if possible. Using NEWSEQUENTIALID() or UuidCreateSequential to generate the values would help with fragmentation. If all alternatives fail, you may have to do index maintenance (rebuild, reorganize) more often than for other indexes. But by no means do any of these drawbacks outweigh the benefit of having the index in the first place!
There are two kinds of performance to consider: insert and select.
An index should improve selects, but it will slow down inserts.
If the inserts are in order, the index does not fragment; if they are not in order, it will fragment. Index fragmentation slows down both inserts and selects, but maintenance can defragment the index.
Adding a non-clustered index to the column that references the FK will help the joins. Since that column is most likely not ordered anyway, the fact that it is a GUID is no loss. It is on the table that owns the PK (the one the FK references) that a GUID is a poor choice for the clustered index: with a GUID as PK, that index fragments on insert. An int or a sequential id is a better candidate, as it would not fragment the PK on insert. But it's no big deal; just defragment those tables.
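The defragmentation itself is a single statement either way; the index and table names below are placeholders:

    -- Light fragmentation: reorganize in place (online, interruptible).
    ALTER INDEX IX_Orders_OrderGuid ON dbo.Orders REORGANIZE;

    -- Heavy fragmentation: rebuild the index from scratch.
    ALTER INDEX IX_Orders_OrderGuid ON dbo.Orders REBUILD;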
Yes, you are better off changing the GUID index from clustered to non-clustered. The GUID can still be the primary key, and you don't need to change your query/source code. No reordering of data, and increased performance.
In databases like SQL Azure it is mandatory to have a clustered index, so you could use a date/datetime field. Creating an additional int identity/autoincrement column is unnecessary: some developers in a team will tend to use those and others the GUID, resulting in an inconsistent application. So keep only the GUID, full stop!
Talking about sequential GUIDs: I think GUIDs are better created in code than in the database. Modern DALs and repository patterns prefer not to depend on the DB for CRUD (example scenario: LINQ queries and automated builds with unit testing, without a DB dependency). And creating a sequential GUID ourselves is not a good idea (at least for me). So a GUID as primary key with a non-clustered index is the best option there is.
I have backing from Microsoft on the non-clustered subject http://blogs.msdn.com/b/sqlazure/archive/2010/05/05/10007304.aspx
Edited: Backing is gone ("No Resource Found")
It would usually help performance. But you may wish to create the index with a fillfactor of less than 100% such that the inevitable page-splits don't have to happen quite so often. Regular maintenance on the index would certainly be a plus.
Yes, a non-clustered index would be ideal for your situation. The underlying structure is a B-tree, like the clustered index, but the underlying table data is not sorted on it, so the problems with the non-sequential nature of the GUID do not exist. The NC index exists separately from the table.
Be careful not to add too many non-clustered indexes, though. Optimize only where you need to. Run the profiler to see which queries are taking a long time, and optimize only those. Additionally, be sure to set the fill factor to a value < 50%, unless the database rarely gets any updates or space is a constraint.
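For instance (hypothetical names; the 40% fill factor just follows the sub-50% suggestion above, leaving headroom for random GUID inserts):

    CREATE NONCLUSTERED INDEX IX_ItemData_item_id
    ON dbo.ItemData (item_id)
    WITH (FILLFACTOR = 40);   -- leave free space on each page to absorb inserts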
Relevant MSDN: http://msdn.microsoft.com/en-us/library/ms177484(v=sql.105).aspx

Reasons not to have a clustered index in SQL Server 2005

I've inherited some database creation scripts for a SQL Server 2005 database.
One thing I've noticed is that all primary keys are created as NON CLUSTERED indexes, as opposed to clustered.
I know that you can only have one clustered index per table, and that you may want to put it on a non-primary-key column so that searches etc. perform well. However, there are no other CLUSTERED indexes on the tables in question.
So my question is: are there any technical reasons not to have clustered indexes on a primary key column, apart from the above?
On any "normal" data or lookup table: no, I don't see any reason whatsoever.
On stuff like bulk import tables, or temporary tables - it depends.
To the surprise of some people, having a good clustered index can actually speed up operations like INSERT or UPDATE. See Kimberly Tripp's excellent blog post The Clustered Index Debate continues...., in which she explains in great detail why this is the case.
In this light: I don't see any valid reason not to have a good clustered index (narrow, stable, unique, ever-increasing = INT IDENTITY as the most obvious choice) on any SQL Server table.
To get some deep insights into how and why to choose clustering keys, read all of Kimberly Tripp's excellent blog posts on the topic:
http://www.sqlskills.com/BLOGS/KIMBERLY/category/Clustering-Key.aspx
http://www.sqlskills.com/BLOGS/KIMBERLY/category/Clustered-Index.aspx
Excellent stuff from the "Queen of Indexing"! :-)
Clustered Tables vs Heap Tables
(Good article on subject at www.mssqltips.com)
HEAP Table (without a clustered index)
Data is not stored in any particular order
Specific data cannot be retrieved quickly, unless there are also non-clustered indexes
Data pages are not linked, so sequential access needs to refer back to the index allocation map (IAM) pages
Since there is no clustered index, no additional time is needed to maintain it
Since there is no clustered index, no additional space is needed to store the clustered index tree
These tables have an index_id value of 0 in the sys.indexes catalog view
Clustered Table
Data is stored in order based on the clustered index key
Data can be retrieved quickly based on the clustered index key, if the query uses the indexed columns
Data pages are linked for faster sequential access
Additional time is needed to maintain the clustered index on INSERTs, UPDATEs and DELETEs
Additional space is needed to store the clustered index tree
These tables have an index_id value of 1 in the sys.indexes catalog view
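Those index_id values make it easy to take an inventory of which tables are heaps and which are clustered; a sketch:

    -- index_id 0 = heap, index_id 1 = clustered index.
    SELECT t.name AS table_name,
           CASE i.index_id WHEN 0 THEN 'HEAP' ELSE 'CLUSTERED' END AS storage_type
    FROM sys.indexes AS i
    JOIN sys.tables  AS t ON t.object_id = i.object_id
    WHERE i.index_id IN (0, 1)
    ORDER BY t.name;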
Please read my answer under "No direct access to data row in clustered table - why?", first. Specifically item [2] Caveat.
The people who created the "database" are cretins. They had:
a bunch of unnormalised spreadsheets, not normalised relational tables
the PKs are all IDENTITY columns (the spreadsheets are linked to each other; they have to be navigated one-by-one-by-one); there is no relational access or relational power across the database
they used PRIMARY KEY constraints, which produce UNIQUE CLUSTERED indexes
they found that that prevented concurrency
they removed the CI and made them all NCIs
they were too lazy to finish the reversal; to nominate an alternate (current NCI) to become the new CI, for each table
the IDENTITY column remains the Primary Key (it isn't really, but it is in this hamfisted implementation)
For such collections of spreadsheets masquerading as databases, it is becoming more and more common to avoid CIs altogether, and just have NCIs plus the Heap. Obviously they get none of the power or benefits of the CI, but hell, they get none of the power or benefit of Relational databases, so who cares that they get none of the power of CIs (which were designed for Relational databases, which theirs is not). The way they look at it, they have to "refactor" the darn thing every so often anyway, so why bother. Relational databases do not need "refactoring".
If you need to discuss this response further, please post the CREATE TABLE/INDEX DDL; otherwise it is a time-wasting academic argument.
Here is another possible reason (has it already been given in other answers?), one still to be understood:
SQL Server - Poor performance of PK delete
I hope to update this later; for now it is mainly a desire to link these topics.
Update:
What do I miss in understanding the clustered index?
With some B-tree servers/programming languages still used today, fixed- or variable-length flat ASCII files are used for storing data. When a new record/row is added to a file (table), the record is (1) appended to the end of the file (or replaces a deleted record) and (2) the indexes are rebalanced. When data is stored this way, you don't have to be concerned about system performance (as far as what the B-tree server is doing to return a pointer to the first data record); the response time is affected only by the number of nodes in your index files.
When you get into using SQL, you hopefully come to realize that system performance has to be considered whenever you write an SQL statement. Using an ORDER BY on a non-indexed column can bring a system to its knees. Using a clustered index might put an unnecessary load on the CPU. It's the 21st century, and I wish we didn't have to think about system performance when programming in SQL, but we still do.
With some older programming languages, it was mandatory to use an index whenever sorted data was retrieved. I only wish this requirement were still in place today. I can only wonder how many companies have upgraded their slow computer systems because of a poorly written SQL statement on non-indexed data.
In my 25 years of programming, I've never needed my physical data stored in a particular order, so maybe that is why some programmers avoid using clustered indexes. It's hard to know what the tradeoff is (storage time versus retrieval time), especially if the system you are designing might store millions of records someday.

Cluster the index on ever-increasing datetime column on logging table?

I'm not a DBA ("Good!", you'll be thinking in a moment.)
I have a table of logging data with these characteristics and usage patterns:
A datetime column for storing log timestamps, whose values are ever-increasing and mostly (but only mostly) unique
Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
No updates at all
Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
Infrequent selects using other columns as the criteria (and not including the timestamp column)
A good amount of data, but nowhere near enough that I'm worried much about storage space
Additionally, there is currently a daily maintenance window during which I could do table optimization.
I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.
I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).
But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.
All of this suggests to me that I should:
Put a clustered index on the timestamp column with a 100% fill-factor
Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
Schedule the bulk deletes to occur during the daily maintenance interval
Schedule a rebuild of the clustered index to occur immediately after the bulk delete
Relax and get out more
Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?
Thanks in advance.
Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!
Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.
She mentions (around the middle of the article):
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and optimally ever-increasing. This is best served with an INT IDENTITY column.
Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!
If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
I agree with putting the clustered index on the timestamp column. My query would be about the fill factor: 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fill factor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best for your situation.
After the bulk deletes it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fill factor.
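A sketch of that post-delete maintenance, assuming a hypothetical clustered index CIX_Log_Timestamp on dbo.Log (the 90% fill factor is only an example value):

    -- Rebuild to reclaim the gaps left by the bulk delete and
    -- reapply the chosen fill factor, then refresh statistics.
    ALTER INDEX CIX_Log_Timestamp ON dbo.Log
        REBUILD WITH (FILLFACTOR = 90);

    UPDATE STATISTICS dbo.Log;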
Finally, yes, put non-clustered indexes on other appropriate columns, but only ones that are very selective, e.g. not bit fields. But remember: the more indexes, the more write performance is affected.
There are two "best practice" ways to index a high-traffic logging table:
an integer identity column as a primary clustered key
a uniqueidentifier column as primary key, with DEFAULT NEWSEQUENTIALID()
Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.
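For illustration, a sketch of both patterns (table and column names are hypothetical):

    -- Pattern 1: integer identity as the clustered primary key.
    CREATE TABLE dbo.LogA (
        LogId    INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
        LoggedAt DATETIME2      NOT NULL,
        Message  NVARCHAR(4000) NULL
    );

    -- Pattern 2: sequential GUID as the clustered primary key.
    CREATE TABLE dbo.LogB (
        LogId    UNIQUEIDENTIFIER NOT NULL
                 CONSTRAINT DF_LogB_LogId DEFAULT NEWSEQUENTIALID()
                 CONSTRAINT PK_LogB PRIMARY KEY CLUSTERED,
        LoggedAt DATETIME2      NOT NULL,
        Message  NVARCHAR(4000) NULL
    );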
The obvious answer is that it depends on how you will query it. The point of an index is to reduce the number of comparisons needed when selecting data. A clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64K block with one read). If you include an ID and a datetime as the primary key but don't use them in your selection criteria, they will do nothing but hinder your performance. This is why people usually drop indexes before bulk-loading data.
