I'm not a DBA ("Good!", you'll be thinking in a moment.)
I have a table of logging data with these characteristics and usage patterns:
A datetime column for storing log timestamps whose value is ever-increasing and mostly (but only mostly) unique
Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
No updates at all
Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
Infrequent selects using other columns as the criteria (and not including the timestamp column)
A good amount of data, but nowhere near enough that I'm worried much about storage space
Additionally, there is currently a daily maintenance window during which I could do table optimization.
I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.
I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).
But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.
All of this suggests to me that I should:
Put a clustered index on the timestamp column with a 100% fill-factor
Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
Schedule the bulk deletes to occur during the daily maintenance interval
Schedule a rebuild of the clustered index to occur immediately after the bulk delete
Relax and get out more
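In rough DDL terms, that plan might look something like the sketch below (dbo.EventLog and all of its columns are made up for illustration):

CREATE TABLE dbo.EventLog (
    LoggedAt datetime      NOT NULL,
    Source   varchar(50)   NOT NULL,
    Severity tinyint       NOT NULL,
    Message  nvarchar(max) NULL
);

-- Clustered index on the ever-increasing timestamp, 100% fill factor
CREATE CLUSTERED INDEX IX_EventLog_LoggedAt
    ON dbo.EventLog (LoggedAt) WITH (FILLFACTOR = 100);

-- Non-clustered index on another likely search column
CREATE NONCLUSTERED INDEX IX_EventLog_Source ON dbo.EventLog (Source);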
Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?
Thanks in advance.
Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!
Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.
She mentions (around the middle of the article):
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and, ideally, ever-increasing. This is best served by an INT IDENTITY column.
Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!
If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
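As a minimal sketch (table and column names are illustrative, not from the question), such a clustering key might be declared like this:

CREATE TABLE dbo.Orders (
    OrderID   int IDENTITY(1,1) NOT NULL,  -- narrow, unique, ever-increasing
    CreatedAt datetime NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);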
I agree with putting the clustered index on the timestamp column. My query would be on the fill factor - 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fill factor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best for your situation.
After the bulk deletes it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fill factor.
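A minimal sketch of that maintenance step, reusing the illustrative table and index names from the question's sketch (the fill factor here is just an example value):

-- Rebuild after the bulk delete; this also resets pages to the specified fill factor
ALTER INDEX IX_EventLog_LoggedAt ON dbo.EventLog
    REBUILD WITH (FILLFACTOR = 90);

-- Refresh statistics so the optimizer sees the post-delete data distribution
UPDATE STATISTICS dbo.EventLog;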
Finally, yes, put non-clustered indexes on other appropriate columns, but only ones that are reasonably selective (e.g. not bit fields). But remember: the more indexes you add, the more write performance suffers.
There are two "best practice" ways to index a high-traffic logging table:
an integer identity column as a primary clustered key
a uniqueidentifier column as primary key, with DEFAULT NEWSEQUENTIALID()
Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
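As a sketch of both patterns (table and column names are made up):

-- Pattern 1: integer identity as the clustered primary key
CREATE TABLE dbo.LogA (
    LogID    bigint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    LoggedAt datetime NOT NULL
);

-- Pattern 2: sequential GUID as the clustered primary key
CREATE TABLE dbo.LogB (
    LogID    uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY CLUSTERED,
    LoggedAt datetime NOT NULL
);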
I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.
The obvious answer is that it depends on how you will query it. The point of an index is to reduce the number of comparisons needed when selecting data. The clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64 KB block with one read). If you include an ID and a datetime as the primary key but don't use them in your selection criteria, they will do nothing but hinder your performance. This is why people usually drop indexes before bulk-loading data.
I have just taken over a database which has around 2200 tables. Over 2000 of these have no clustered index (some have no indexes at all).
All of the tables have been configured to use a GUID (uniqueidentifier) as the primary key.
Just looking at the query plans, I can see that there are many table scans occurring. Most searches use the uniqueidentifier to search on.
I am wondering if it is better to have a clustered index on the GUID than not to have a clustered index at all. I imagine that a clustered index on a 16-byte column will inevitably lead to fragmentation.
I could arguably cluster on other columns but the majority of searches tend to search by or join via the GUIDS.
Any advice would be very much welcomed. I've never seen so many GUIDs!!
In general, I would recommend having an identity column as the primary key and using that for clustering. This is also a better choice for joins.
Why? First, identity keys are generally shorter than uniqueidentifiers, so foreign key references and indexes are smaller.
More importantly, inserts would always go at the "end" of the table. When using GUIDs, inserts are often going to cause fragmentation. If you are inserting rows, I would say that a secondary index on the GUID might be better than a clustered index (the fragmentation is only in the index).
With 2000 tables, I doubt you will change the structure. You can ameliorate the fragmentation using newsequentialid().
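For a column whose values are generated by the database, that amelioration is just a default-constraint swap; a sketch, with made-up table, column and constraint names:

-- Replace a NEWID() default with NEWSEQUENTIALID()
ALTER TABLE dbo.Customer DROP CONSTRAINT DF_Customer_CustomerID;
ALTER TABLE dbo.Customer ADD CONSTRAINT DF_Customer_CustomerID
    DEFAULT NEWSEQUENTIALID() FOR CustomerID;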
A GUID column with random values is usually not the best choice for a clustered index, because it can be the root cause of index fragmentation:
Read-ahead by the database engine won't be effective;
The cost of insert operations will be too high, because you'll get lots of page-split overhead;
There are three ways you can live with that:
Schedule regular index reorganizing and rebuilding, which will reduce index fragmentation and update your statistics;
Use newsequentialid() for generating the values of this column;
Generate the GUID values sequentially outside of the database (the Guid.Comb identifier in NHibernate is a great example of solving this issue).
This is really a comment on your question, to accompany Gordon's good answer:
Firstly, don't forget to check the index DMVs to see which ones are being used (or not used) and have a look at the expensive query plans in the cache to focus on the tables and queries that will be causing most pain. I would expect that many of those 2200 tables are relatively small & the queries are able to look up pretty quickly even from the guid clustered index.
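A sketch of that first check against the index usage DMV (run it in the target database; the ordering is just one reasonable choice):

-- Spot unused or hot indexes in the current database
SELECT OBJECT_NAME(ius.[object_id]) AS table_name,
       i.name AS index_name,
       ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates
FROM sys.dm_db_index_usage_stats ius
JOIN sys.indexes i
    ON i.[object_id] = ius.[object_id]
   AND i.index_id = ius.index_id
WHERE ius.database_id = DB_ID()
ORDER BY ius.user_seeks + ius.user_scans + ius.user_lookups;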
For those tables that aren't clustered, clustering on the guid would reduce fragmentation, since it forces all the data for the table to be colocated rather than allowing pages to be put in the next free extent & spreading tables all over the disk. This should make some of the I/O more efficient.
Check you have a low enough fillfactor so that your regular index rebuilds avoid page splitting in advance, although this is also workload-dependent (OLTP vs DW, and the read/write ratio of the table).
If you have applications that are doing explicit column selects/inserts then you may be able to add an identity column without breaking anything. That allows you to cluster around the identity and add an index to the guid. Whether this really helps depends on the relative (in)efficiency of the new plans.
You could consider clustering around a non-guid field where queries will lookup against it fairly regularly (eg, a date range) and index the guid separately.
You'd have to look at the queries & relative performance for that more closely.
Recently I found a couple of tables in a database with no clustered indexes defined.
They do have non-clustered indexes defined, but the tables themselves are heaps.
On analysis I found that select statements filter on the columns covered by those non-clustered indexes.
Does not having a clustered index on these tables affect performance?
It's hard to state this more succinctly than SQL Server MVP Brad McGehee:
As a rule of thumb, every table should have a clustered index. Generally, but not always, the clustered index should be on a column that monotonically increases–such as an identity column, or some other column where the value is increasing–and is unique. In many cases, the primary key is the ideal column for a clustered index.
BOL echoes this sentiment:
With few exceptions, every table should have a clustered index.
The reasons for doing this are many and are primarily based upon the fact that a clustered index physically orders your data in storage.
If your clustered index is on a single column that monotonically increases, inserts occur in order on your storage device and page splits will not happen.
Clustered indexes are efficient for finding a specific row when the indexed value is unique, such as the common pattern of selecting a row based upon the primary key.
A clustered index often allows for efficient queries on columns that are often searched for ranges of values (between, >, etc.).
Clustering can speed up queries where data is commonly sorted by a specific column or columns.
A clustered index can be rebuilt or reorganized on demand to control table fragmentation.
These benefits can even be applied to views.
You may not want to have a clustered index on:
Columns that have frequent data changes, as SQL Server must then physically re-order the data in storage.
Columns that are already covered by other indexes.
Wide keys, as the clustered index is also used in non-clustered index lookups.
GUID columns, which are larger than identities and also effectively random values (not likely to be sorted upon), though newsequentialid() could be used to help mitigate physical reordering during inserts.
A rare reason to use a heap (table without a clustered index) is if the data is always accessed through nonclustered indexes and the RID (SQL Server internal row identifier) is known to be smaller than a clustered index key.
Because of these and other considerations, such as your particular application workloads, you should carefully select your clustered indexes to get maximum benefit for your queries.
Also note that when you create a primary key on a table in SQL Server, it will by default create a unique clustered index (if it doesn't already have one). This means that if you find a table that doesn't have a clustered index, but does have a primary key (as all tables should), a developer had previously made the decision to create it that way. You may want to have a compelling reason to change that (of which there are many, as we've seen). Adding, changing or dropping the clustered index requires rewriting the entire table and any non-clustered indexes, so this can take some time on a large table.
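A quick sketch of that default and the explicit opt-out (names are illustrative):

-- PRIMARY KEY becomes a unique clustered index by default
CREATE TABLE dbo.Widget (
    WidgetID int IDENTITY NOT NULL,
    CONSTRAINT PK_Widget PRIMARY KEY (WidgetID)
);

-- Keep the key, but cluster on something else
CREATE TABLE dbo.Gadget (
    GadgetID  uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    CreatedAt datetime NOT NULL,
    CONSTRAINT PK_Gadget PRIMARY KEY NONCLUSTERED (GadgetID)
);
CREATE CLUSTERED INDEX IX_Gadget_CreatedAt ON dbo.Gadget (CreatedAt);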
I would not say "every table should have a clustered index"; I would say "look carefully at every table and how it is accessed, and try to define a clustered index on it if it makes sense". It's a plus, like a Joker: you have only one Joker per table, but you don't have to use it. Other database systems don't have this, at least not in this form, BTW.
Putting clustered indices everywhere without understanding what you're doing can also kill your performance (in general INSERT performance, because a clustered index means physical re-ordering on the disk, or at least that's a good way to understand it), for example with GUID primary keys, which we see more and more.
So, read Tim Lehner's exceptions and reasons.
Performance is a big hairy problem. Make sure you are optimizing for the right thing.
Free advice is always worth its price, and there is no substitute for actual experimentation.
The purpose of an index is to find matching rows and help retrieve the data when found.
A non-clustered index on your search criteria will help to find rows, but there needs to be additional operation to get at the row's data.
If there is no clustered index, SQL Server uses an internal rowId to point to the location of the data.
However, if there is a clustered index on the table, that rowId is replaced by the key values of the clustered index.
So if the query only needs those key values, the step of reading the row's data is not needed; it is covered by the values already in the index.
Even if a clustered index isn't very selective, it may still be helpful: if its key columns are frequently most or all of the results requested, they are already present at the leaf of every non-clustered index, so the query can be answered without a lookup.
Yes, you should have a clustered index on a table, so that all non-clustered indexes perform better.
Consider using a clustered index on columns that contain a large number of distinct values, to avoid the need for SQL Server to add a "uniqueifier" to duplicate key values.
Disadvantage: it takes longer to update records, but only when the fields in the clustering key are changed.
Avoid clustered index designs where there is a risk that many concurrent inserts will happen on almost the same clustering key value.
Searches against a non-clustered index will appear slower if the clustered index isn't built correctly, or if it does not include all the columns needed to return the data to the calling application. In the event that the non-clustered index doesn't contain all the needed data, SQL Server will go to the clustered index to get the missing data (via a lookup), which makes the query run slower because the lookup is done row by row.
Yes, every table should have a clustered index. The clustered index sets the physical order of data in a table. You can compare this to the ordering of music in a store by band name, or the Yellow Pages ordered by last name. Since this deals with the physical order, you can have only one per table, though it can comprise many columns.
It’s best to place the clustered index on columns often searched for a range of values. Example would be a date range. Clustered indexes are also efficient for finding a specific row when the indexed value is unique. Microsoft SQL will place clustered indexes on a PRIMARY KEY constraint automatically if no clustered indexes are defined.
Clustered indexes are not a good choice for:
Columns that undergo frequent changes - this results in the entire row moving (because SQL Server must keep the data values of a row in physical order). This is an important consideration in high-volume transaction processing systems where data tends to be volatile.
Wide keys - the key values from the clustered index are used by all nonclustered indexes as lookup keys and therefore are stored in each nonclustered index leaf entry.
I have an Oracle background, and using "index-organized tables" (IOTs) for every table sounds unreasonable in Oracle; I have never actually seen it done. In SQL Server, every database I have worked on has a clustered index on every table, which is conceptually the same as an IOT.
Why is that? Is there any reason for using clustered indexes everywhere? It seems to me they would be good only for a handful of cases.
Thanks
A clustered index is not quite the same thing as an index-organised table. An IOT must be organised on its primary key, whereas a clustered index on SQL Server does not have to be unique and does not have to be the primary key.
Clustered indexes are widely used on SQL Server, as there is almost always some natural ordering that makes a commonly used query more efficient. IOTs in Oracle carry more baggage, so they aren't quite as useful, although they may well be more useful than they're commonly given credit for.
Historically, really old versions of SQL Server (pre 6.5 or 7.0, IIRC) did not support row-level locking and could only lock at a table or page level. Often a clustered index would be used to ensure that writes were scattered around the table's physical storage to minimise contention on page locks. However, SQL Server 6 went out of support years ago, so applications with this issue are restricted to rare legacy systems.
Without a clustered index, your table is organized as a heap. This means that every row that is inserted is added to a data page at the end of the table. Also, as rows get updated, they are moved to a data page at the end of the table if the updated data is larger than before.
When it is good to not have a clustered index
If you have a table that needs the fastest possible inserts, but can sacrifice update and read speed, then not having a clustered index may work for you. One example would be a table being used as a queue: lots of inserts that later just get read and moved to a different table.
Clustered Indexes
Clustered indexes organize the data in your table based on the columns in the clustered index. If you cluster on the wrong thing, for instance a uniqueidentifier, this can slow things down (see below).
As long as your clustered index is on the value that is most commonly used for searching, and it is unique and increasing, you get some amazing performance benefits out of the clustered index. For instance, if you have a table called USERS where you commonly look up user data based on USER_ID, then clustering on USER_ID would speed up all of those lookups. It simply reduces the number of data pages that need to be read to get at your data.
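A sketch of that USERS example (only USER_ID comes from the text above; the other column is made up):

CREATE TABLE dbo.USERS (
    USER_ID   int IDENTITY(1,1) NOT NULL,
    USER_NAME nvarchar(100) NOT NULL,
    CONSTRAINT PK_USERS PRIMARY KEY CLUSTERED (USER_ID)
);

-- This lookup becomes a direct seek into the clustered index:
SELECT USER_ID, USER_NAME FROM dbo.USERS WHERE USER_ID = 42;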
If you have too many keys in your clustered index this can slow things down also.
General rules for clustered indexes:
Don't cluster on any varchar columns.
Clustering on INT IDENTITY columns is usually best.
Cluster on what you commonly search on.
Clustering on UniqueIdentifiers
Uniqueidentifiers are extremely inefficient as index keys because they have no natural sort order. Based on the b-tree structure of the index, you end up with extremely fragmented indexes when using uniqueidentifiers; even after rebuilding or reorganizing, they quickly become fragmented again. So you end up with a slower index that is really huge in memory and on disk due to the fragmentation. Also, inserting a uniqueidentifier is more likely to cause a page split in the index, slowing your insert. Generally, uniqueidentifiers are bad news for indexes.
Summary
My recommendation is that every table should have a clustered index on it unless there is a really good reason not to (i.e. a table functioning as a queue).
I wouldn't know why you would prefer a heap over a clustered index most of the time. Using clustering, you get one index of your choice for free. Most of the time this is the primary key (which you probably want to enforce anyway!).
Heaps are mostly for special situations.
We use primary keys in relational databases, and in general relationships are established via these primary keys. Most people name the first field TableID and make it the primary key. When you join two or more tables in your query, you will get the fastest result if you use clustered indexes.
If I'm trying to squeeze every last drop of performance out of a query, what effect does each of these types of indexes have when used by my joins?
clustered index.
non-clustered index.
clustered or non-clustered index with extra columns that may not be involved in the join.
Will I gain any performance if I go through and create clustered indexes that only contain the columns involved in my joins and nothing else?
(I realize I may have to move the clustered index from another index (making that index non-clustered), since a table can only have one.)
In addition to Gareth Saul's answer a tiny clarification:
Non-clustered indexes repeat the included fields, with pointers to the rows that have that value.
This pointer to the actual data value is the column (or the set of columns) that are in your clustering key.
That's one of the main reasons why you should try and keep the clustering key small and static - small because otherwise you'll waste a lot of space, on disk and in your server's RAM, and static because otherwise, you'll have to update not just your clustering index, but also all your non-clustered indices as well, if your value changes.
This "lookup pointer is the clustering key" feature has been in SQL Server since version 7, as Kim Tripp will explain in great detail here:
What is a clustered index?
In SQL Server 7.0 and higher the internal dependencies on the clustering key CHANGED. (Yes, it's important to know that things CHANGED in 7.0... why? Because there are still some folks out there that don't realize how RADICAL of a change occurred in the internals (wrt the clustering key) in SQL Server 7.0.) What changed is that the clustering key gets used as the "lookup" value from the nonclustered indexes.
Will I gain any performance if I go through and create clustered indexes that only contain the columns involved in my joins and nothing else?
Not as I understand it. The point of a clustered index is that it sorts the data on disk around that index (hence why you can have only one), so if your join data isn't sorted by exactly those columns, I don't think it would make any difference. Plus, by putting data that might change (as opposed to the key) into the clustered index, you make it more likely that things will need rebuilding periodically, slowing the overall database down.
Sorry if this sounds like a daft question, but have you tried running your query through the index tuning wizard? Not foolproof by any stretch, but I've had some decent improvements from it in the past.
You only get one clustered index - this is what controls the physical storage of the table on disk / in memory.
Non-clustered indexes repeat the included fields, with pointers to the rows that have that value. Having an index on the columns used in your joins should improve performance. You can further optimise by using "included columns" in your index; this duplicates the row information directly into the index, which can remove the performance penalty of having to look up the row itself to perform the select.
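A sketch of an index with included columns, using hypothetical table and column names:

-- Seek on CustomerID; Amount and OrderDate ride along at the leaf,
-- so a query selecting them needs no lookup into the base table
CREATE NONCLUSTERED INDEX IX_Sales_CustomerID
    ON dbo.Sales (CustomerID)
    INCLUDE (Amount, OrderDate);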
It is useful to pay attention to the order in which your joins occur - the sequence of columns in your index should match up to this. Remember that the SQL engine may optimise and re-order your query internally - profiling may be helpful.
In most situations, you can just use the Database Engine Tuning Advisor - the recommendations it provides are pretty much spot on.
If you can, your best bet is a non-clustered index that has all the elements of your join in it and, if possible, the fields you are selecting.
This will create a spanning (covering) index, meaning that all the fields SQL Server requires to satisfy the query are in one index.
If possible, have an index with no unnecessary fields in it. Every field added makes an individual index record larger; the smaller each index record, the more records fit in each page, and the more index items you get in each page, the less often you have to go to disk.
Clustered index - this means the table is laid out in the order specified in the index, so you will get better performance for select * from TABLE where INDEXFIELD = 3. Unless you are selecting lots of large data items, this should not be required.
I am working on a database that usually uses GUIDs as primary keys.
By default SQL Server places a clustered index on primary key columns. I understand that this is a silly idea for GUID columns, and that non-clustered indexes are better.
What do you think - should I get rid of all the clustered indexes and replace them with non-clustered indexes?
Why wouldn't SQL's performance tuner offer this as a recommendation?
A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.
As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.
You almost certainly want to establish a clustered index on every table in your database.
If a table does not have a clustered index it is what is referred to as a "Heap" and performance of most types of common queries is less for a heap than for a clustered index table.
Which fields the clustered index should be established on depends on the table itself and the expected usage patterns of queries against the table. In almost every case you probably want the clustered index to be on a column, or a combination of columns, that is unique (i.e. an alternate key), because if it isn't, SQL Server will add a unique value to the end of whatever fields you select anyway. Your table may also have a column or columns that are frequently used by queries to select or filter multiple records: for example, a sales transactions table that your application frequently queries by product Id; or better, an invoice details table, where in almost every case you retrieve all the detail records for a specific invoice; or an invoice table where you often retrieve all the invoices for a particular customer. (This is true whether you select a large number of records by a single value or by a range of values.)
These columns are candidates for the clustered index. The order of the columns in the clustered index is critical: the first column defined in the index should be the column that will be selected or filtered on first in expected queries.
The reason for all this is based on understanding the internal structure of a database index. These indices are called balanced-tree (B-tree) indices. They are kinda like a binary tree, except that each node in the tree can have an arbitrary number of entries (and child nodes), instead of just two. What makes a clustered index different is that the leaf nodes in a clustered index are the actual physical disk data pages of the table itself, whereas the leaf nodes of a non-clustered index just "point" to the table's data pages.
When a table has a clustered index, therefore, the table's data pages are the leaf level of that index, and each one has a pointer to the previous page and the next page in the index order (they form a doubly linked list).
So if your query requests a range of rows that is in the same order as the clustered index... the processor only has to traverse the index once (or maybe twice), to find the start page of the data, and then follow the linked list pointers to get to the next page and the next page, until it has read all the data pages it needs.
For a non-clustered index, it has to traverse the index once for every row it retrieves...
EDIT:
To address the sequential issue for GUID key columns, be aware that SQL Server 2005 has NEWSEQUENTIALID(), which does in fact generate GUIDs the "old" sequential way.
Or you can investigate Jimmy Nilsson's COMB GUID algorithm, which is implemented in client-side code:
COMB Guids
The problem with clustered indexes on a GUID field is that the GUIDs are random, so when a new record is inserted, data on disk has to be moved around to fit the record into the middle of the table (a page split).
However, with integer-based clustered indexes, the integers are normally sequential (as with an IDENTITY spec), so they just get added to the end and no data needs to be moved around.
On the other hand, clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. If you need to be able to SELECT records quickly, then use a clustered index... the INSERT speed will suffer, but the SELECT speed will be improved.
While clustering on a GUID is normally a bad idea, be aware that GUIDs can under some circumstances cause fragmentation even in non-clustered indexes.
Note that if you're using SQL Server 2005, the newsequentialid() function produces sequential GUIDs. This helps to prevent the fragmentation problem.
I suggest using a SQL query like the following to measure fragmentation before making any decisions (excuse the non-ANSI syntax):
SELECT OBJECT_NAME (ips.[object_id]) AS 'Object Name',
si.name AS 'Index Name',
ROUND (ips.avg_fragmentation_in_percent, 2) AS 'Fragmentation',
ips.page_count AS 'Pages',
ROUND (ips.avg_page_space_used_in_percent, 2) AS 'Page Density'
FROM sys.dm_db_index_physical_stats
(DB_ID ('MyDatabase'), NULL, NULL, NULL, 'DETAILED') ips
CROSS APPLY sys.indexes si
WHERE si.object_id = ips.object_id
AND si.index_id = ips.index_id
AND ips.index_level = 0;
If you are using NEWID(), you could switch to NEWSEQUENTIALID(). That should help insert performance.
Yes, there's no point in having a clustered index on a random value.
You probably do want clustered indexes SOMEWHERE in your database. For example, if you have an "Author" table and a "Book" table with a foreign key to "Author", and if you have a query in your application that says "select ... from Book where AuthorId = ..", then you would be reading a set of books. It will be faster if those books are physically next to each other on disk, so that the disk head doesn't have to bounce around from sector to sector gathering all the books of that author.
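A sketch of that Author/Book layout (names illustrative): cluster Book on AuthorId so one author's books land on contiguous pages, while BookId stays the primary key:

CREATE TABLE dbo.Book (
    BookId   int IDENTITY NOT NULL,
    AuthorId int NOT NULL,
    Title    nvarchar(200) NOT NULL,
    CONSTRAINT PK_Book PRIMARY KEY NONCLUSTERED (BookId)
);

-- BookId is appended to keep the clustering key unique
CREATE CLUSTERED INDEX IX_Book_AuthorId ON dbo.Book (AuthorId, BookId);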
So, you need to think about your application, the ways in which it queries the database.
Make the changes.
And then test, because you never know...
As most have mentioned, avoid using a random identifier in a clustered index - you will not gain the benefits of clustering. Actually, you will experience increased delay. Getting rid of all of them is solid advice. Also keep in mind that newsequentialid() can be extremely problematic in a multi-master replication scenario: if database A and B both invoke newsequentialid() prior to replication, you will have a conflict.
Yes you should remove the clustered index on GUID primary keys for the reasons Galwegian states above. We have done this on our applications.
It depends if you're doing a lot of inserts, or if you need very quick lookup by PK.