I have some doubts on choosing the right index and have some questions:
Clustered index
What is the best candidate?
Usually it is the primary key, but if the primary key is not used in searches (e.g. CustomerNo is used to search for customers), should the clustered index be put on CustomerNo instead?
Views with SchemaBinding
If I have a view with indexes, I have read that those indexes are not used, but the ones on the underlying tables are.
Pointless, no? Or am I missing the point? Will it make a difference to use NOEXPAND to force SQL Server to read the index on the view rather than the one on the table?
Nonclustered indexes
Is it good practice, when adding a nonclustered index, to include every possible column until you reach the limit?
Many thanks for your time. I am reading from a massive database and speed is a must.
The clustered index is the index that (a) defines the storage layout of your table (the table data is physically sorted by the clustering key), and (b) is used as the "row locator" in every single nonclustered index on that table.
Therefore, the clustered index should be
narrow (4 bytes is ideal, 8 bytes is OK - anything else is too much)
unique (if you don't use a unique clustered index, SQL Server will add a 4-byte uniqueifier to rows with duplicate key values)
static (shouldn't change)
optimally it should be ever-increasing
fixed width - e.g. don't use large VARCHAR(x) columns in your clustered index
Out of these requirements, INT IDENTITY seems to be the most logical, most obvious choice. Don't use variable-length columns, don't use multiple columns (if at all possible), and don't use a GUID (that's a horribly bad choice because of its size and randomness).
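A minimal sketch of how this advice could apply to the CustomerNo case from the question: keep the clustered index on a narrow INT IDENTITY column and serve the CustomerNo searches from a separate nonclustered index. The table name dbo.Customer, the column types, the index names, and the assumption that CustomerNo is unique are all illustrative; only the CustomerNo column itself comes from the question.

```sql
-- Hypothetical table: narrow, static, ever-increasing clustering key.
CREATE TABLE dbo.Customer
(
    CustomerID   INT IDENTITY(1,1) NOT NULL,
    CustomerNo   VARCHAR(20)       NOT NULL,  -- assumed type and length
    CustomerName NVARCHAR(100)     NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID)
);
GO

-- Searches by CustomerNo use this index; each entry carries the
-- 4-byte CustomerID as its row locator back into the clustered index.
CREATE UNIQUE NONCLUSTERED INDEX UX_Customer_CustomerNo
    ON dbo.Customer (CustomerNo);
GO

-- Typical lookup: seek on the nonclustered index, then a key lookup
-- into the clustered index to fetch CustomerName.
SELECT CustomerID, CustomerNo, CustomerName
FROM dbo.Customer
WHERE CustomerNo = 'C-10042';
```

Whether this beats clustering directly on CustomerNo depends on the workload, so it is worth comparing execution plans for the real queries.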
For more background info on clustering keys and clustered indexes - read everything that Kimberly Tripp ever publishes! She's the Queen of Indexing in SQL Server - she knows her stuff extremely well!
See e.g. these blog posts:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
In general: don't over-index! Too many indices are often worse than none!
For non-clustered indexes: I would typically index foreign key columns - those indexes help with JOINs and other operations and make things faster.
Other than that: don't put too many indexes in your database! Every index must be maintained on every insert, update and delete on your table. This is overhead - don't index excessively!
An index with all columns of a table is an especially bad idea since it really cannot be used for much - but carries a lot of administrative overhead.
Run your app, profile it - see which operations are slow, try to optimize those by adding a few selective indexes to your table.
Clustered Indexes
Just to add to marc_s's good answer: one exception to the standard INT IDENTITY PK approach to clustered indexes is when you have parent-child tables where the children are almost always retrieved at the same time as the parent. In this case, clustering the child table by the parent PK will reduce the number of pages read when the children are retrieved. For example:
CREATE TABLE Invoice
(
-- Use the default MS Approach on the parent, viz Clustered by Surrogate PK
InvoiceID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
-- Index Fields here
);
CREATE TABLE InvoiceLineItem
(
-- Own Surrogate Key
InvoiceLineItemID INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
InvoiceID INT NOT NULL FOREIGN KEY REFERENCES Invoice(InvoiceID),
-- Line Item Fields Here
);
-- But Cluster on the Parent FK
CREATE CLUSTERED INDEX CL_InvoiceLineItem ON InvoiceLineItem(InvoiceID);
NonClustered Indexes
No, never just include columns without careful thought - the index tree needs to be as narrow as possible. The ordering of the index columns is critical, and always ensure that the index is designed with selectivity of the data in mind - you will need to have a good understanding of the distribution of your data in order to choose optimal indexes.
You can consider using covering indexes to include (at most, a few) columns which would otherwise have required a bookmark lookup from the Nonclustered index back into the table when tuning performance-critical queries.
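As a rough illustration of a covering index, the sketch below adds two non-key columns with INCLUDE so that one specific query can be answered from the nonclustered index alone, without a lookup into the table. The table dbo.CustomerOrder and all column and index names are hypothetical.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE dbo.CustomerOrder
(
    CustomerOrderID INT IDENTITY(1,1) NOT NULL,
    CustomerID      INT               NOT NULL,
    OrderDate       DATETIME          NOT NULL,
    TotalDue        DECIMAL(12,2)     NOT NULL,
    Notes           NVARCHAR(400)     NULL,
    CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED (CustomerOrderID)
);
GO

-- The key column (CustomerID) keeps the index tree narrow;
-- the INCLUDE columns are stored only at the leaf level.
CREATE NONCLUSTERED INDEX IX_CustomerOrder_CustomerID
    ON dbo.CustomerOrder (CustomerID)
    INCLUDE (OrderDate, TotalDue);
GO

-- Covered: every referenced column is in the index, so no lookup is needed.
SELECT OrderDate, TotalDue
FROM dbo.CustomerOrder
WHERE CustomerID = 42;

-- Not covered: Notes is not in the index, so each matching row
-- needs a lookup into the clustered index.
SELECT OrderDate, TotalDue, Notes
FROM dbo.CustomerOrder
WHERE CustomerID = 42;
```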
A very basic rule of thumb I use is to rely on nonclustered indexes when small amounts of data will be returned, and on clustered indexes when larger result sets will be returned by your query.
I recommend you read Clustered Index Design Guidelines.
As for indexing views: indexing views works the same as indexing tables. It can improve performance, but like indexing tables, it can also slow things down.
I recommend you read Improving Performance with SQL Server 2008 Indexed Views.
In general, when indexing I find less is better. You need to research your data, not just slap indexes on everything. Check what you are joining on, add indexes, and check the execution plan. Sometimes what you think would make a good index can actually make things slower.
Views with SchemaBinding
...
Pointless no? Or am I missing the point?
(More properly, indexed views, schemabinding is a means to an end here, and the rest of the text is more talking about indexed views)
There can be (at least) two reasons for creating an indexed view. Without seeing your database, it's impossible to tell which of those reasons apply.
The first is to compute intermediate results which are expensive to compute from the base table. In order to benefit from that computation, you need to ensure your query uses the indexes. To use the indexes you either need to query the view and specify NOEXPAND, or be using Enterprise or Developer edition (on Ent/Dev editions the index might be used even if the base table is queried and the view isn't mentioned).
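A minimal sketch of this first use, assuming a hypothetical dbo.SalesLine table and view name: the view pre-aggregates quantities per product, the unique clustered index materializes it, and the final query uses the NOEXPAND hint so the view's index is read directly. It assumes the session SET options required for indexed views (such as ANSI_NULLS and QUOTED_IDENTIFIER ON) are in effect.

```sql
-- Hypothetical base table.
CREATE TABLE dbo.SalesLine
(
    SalesLineID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    ProductID   INT NOT NULL,
    Quantity    INT NOT NULL   -- NOT NULL so SUM() is allowed in an indexed view
);
GO

-- SCHEMABINDING and two-part names are required for an indexed view;
-- COUNT_BIG(*) is required because the view uses GROUP BY.
CREATE VIEW dbo.vProductQuantity
WITH SCHEMABINDING
AS
SELECT ProductID,
       SUM(Quantity) AS TotalQuantity,
       COUNT_BIG(*)  AS LineCount
FROM dbo.SalesLine
GROUP BY ProductID;
GO

-- The first index on a view must be unique and clustered.
CREATE UNIQUE CLUSTERED INDEX CIX_vProductQuantity
    ON dbo.vProductQuantity (ProductID);
GO

-- NOEXPAND tells the optimizer to treat the view like a table
-- and use its index instead of expanding to the base table.
SELECT ProductID, TotalQuantity
FROM dbo.vProductQuantity WITH (NOEXPAND)
WHERE ProductID = 42;
```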
The second reason is to enforce a constraint that isn't enforceable in a simpler manner, by implementing e.g. a unique constraint on the view, this may be enforcing some form of conditional uniqueness on the base table.
An example of the second - say you want table T to be able to contain multiple rows with the same U value - but of those rows, only one may be marked as the Default. Before filtered indexes were available, this was commonly achieved as:
CREATE VIEW DRI_T_OneDefault
WITH SCHEMABINDING
AS
SELECT U
FROM S.T
WHERE [Default] = 1 -- DEFAULT is a reserved word, so the column name must be bracketed
GO
CREATE UNIQUE CLUSTERED INDEX IX_DRI_T_OneDefault on DRI_T_OneDefault (U)
The point is that these indexes enforce a constraint. It doesn't matter (in such a case) whether any query ever actually uses the index, in the same way that a unique constraint may be declared on a base table but never actually be used by any query.
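For comparison, here is a sketch of the filtered-index alternative mentioned above (available from SQL Server 2008). The schema S, table T and columns U and Default follow the example; the ID column, the data types and the index name are assumptions.

```sql
CREATE SCHEMA S;
GO

-- Assumed shape of the example table.
CREATE TABLE S.T
(
    ID        INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    U         INT NOT NULL,
    [Default] BIT NOT NULL
);
GO

-- Uniqueness of U is enforced only among rows flagged as the default,
-- so many rows may share a U value but at most one of them is the default.
CREATE UNIQUE NONCLUSTERED INDEX UX_T_OneDefault
    ON S.T (U)
    WHERE [Default] = 1;
GO

INSERT INTO S.T (U, [Default]) VALUES (1, 0);  -- allowed
INSERT INTO S.T (U, [Default]) VALUES (1, 1);  -- allowed: first default for U = 1
INSERT INTO S.T (U, [Default]) VALUES (1, 0);  -- allowed: non-default rows are not indexed
-- A second (1, 1) row would fail with a duplicate key error.
```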
I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)
I'm using a GUID because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migration data between databases.
I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).
What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but it doesn't need to be that way! I've personally seen massive performance gains when breaking up a previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: about 27 MB vs. 107 MB - and that's just on a single table!
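The figures above can be reproduced with a short calculation. This is only a sketch under a simplifying assumption: it counts just the bytes of the clustering key itself (4 for INT, 16 for a GUID), once per row in the base table and once per row in each of the 6 nonclustered indexes, and ignores row and page overhead. The variable names are illustrative.

```sql
DECLARE @Rows BIGINT = 1000000;
DECLARE @NonclusteredIndexes INT = 6;
DECLARE @BytesPerMB DECIMAL(10,1) = 1048576.0;

-- One row per key type: key size in bytes drives all three figures.
SELECT k.KeyType,
       CAST(@Rows * k.KeyBytes / @BytesPerMB AS DECIMAL(10,2))
           AS BaseTableMB,
       CAST(@Rows * k.KeyBytes * @NonclusteredIndexes / @BytesPerMB AS DECIMAL(10,2))
           AS NonclusteredIndexesMB,
       CAST(@Rows * k.KeyBytes * (1 + @NonclusteredIndexes) / @BytesPerMB AS DECIMAL(10,2))
           AS TotalMB
FROM (VALUES ('INT', 4), ('GUID', 16)) AS k (KeyType, KeyBytes);
```

Under these assumptions the key bytes alone add up to roughly 26.7 MB for INT and 106.8 MB for a GUID, a fourfold difference that follows directly from the 4-byte versus 16-byte key size.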
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.
Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:
CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
MyINT INT IDENTITY(1,1) NOT NULL
-- .... add more columns as needed ......
);
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID);
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT);
Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED.
This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!
I've been using GUIDs as PKs since 2005. In this distributed database world, it is absolutely the best way to merge distributed data. You can fire-and-forget merge tables without all the worry of ints matching across joined tables. Rows joined on GUIDs can be copied without any worry.
This is my setup for using GUIDs:
PK = GUID. GUIDs are indexed similar to strings, so high row tables (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.
The PK GUID gets a NONCLUSTERED index. Never put the clustered index on a GUID unless it is generated by NEWSEQUENTIALID(). But even then, a server reboot will cause major breaks in ordering.
Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.
Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.
Here is my Email table...
CREATE TABLE [Core].[Email] (
[EmailID] UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (newsequentialid()) NOT NULL,
[EmailAddress] NVARCHAR (50) CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,
[CreatedDate] DATETIME CONSTRAINT [DF_Email_CreatedDate] DEFAULT (getutcdate()) NOT NULL,
[ClusterID] INT NOT NULL IDENTITY,
CONSTRAINT [PK_Email] PRIMARY KEY NonCLUSTERED ([EmailID] ASC)
);
GO
CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID])
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] Asc)
I am currently developing a web application with EF Core and here is the pattern I use:
All my classes (tables) have an int PK and FK.
I then have an additional column of type Guid (generated by the C# constructor) with a non clustered index on it.
All the joins between tables within EF are managed through the int keys, while all access from outside (controllers) is done with the Guids.
This solution avoids exposing the int keys in URLs while keeping the model tidy and fast.
This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key, unless I have a specific need not to and I also let SQL server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play here including but not limited to expected db size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option and you should choose what is more appropriate for your app and what allows you to develop easier, quicker, and more effectively (If you never complete the app what difference does the rest make :).
https://web.archive.org/web/20120812080710/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.
Well, if your data never reaches millions of rows, you are fine. If you ask me, I never use a GUID as a database identity column of any type, PK included, even if you force me to design with a shotgun at my head.
Using a GUID as the primary key is a definitive scaling stopper, and a critical one.
I recommend you check the database identity and sequence options. A sequence is table-independent and may provide a solution for your needs (MS SQL has sequences).
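A minimal sketch of the sequence option, assuming SQL Server 2012 or later: the sequence is a standalone object, and a table can draw its key values from it through a default constraint. The sequence, table, column and constraint names are hypothetical.

```sql
-- A table-independent generator of ever-increasing BIGINT values.
CREATE SEQUENCE dbo.OrderKeySequence
    AS BIGINT
    START WITH 1
    INCREMENT BY 1;
GO

CREATE TABLE dbo.SalesOrder
(
    SalesOrderID BIGINT NOT NULL
        CONSTRAINT DF_SalesOrder_SalesOrderID
        DEFAULT (NEXT VALUE FOR dbo.OrderKeySequence),
    OrderDate    DATETIME NOT NULL,
    CONSTRAINT PK_SalesOrder PRIMARY KEY CLUSTERED (SalesOrderID)
);
GO

-- The key is filled in from the sequence when it is not supplied.
INSERT INTO dbo.SalesOrder (OrderDate) VALUES (GETUTCDATE());

-- The application can also reserve a key before inserting.
DECLARE @NextID BIGINT = NEXT VALUE FOR dbo.OrderKeySequence;
INSERT INTO dbo.SalesOrder (SalesOrderID, OrderDate) VALUES (@NextID, GETUTCDATE());
```

Because the same sequence can feed several tables, it gives sequential keys without tying key generation to a single IDENTITY column.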
If your tables start reaching some dozens of millions of rows, e.g. 50 million, you will not be able to read/write information in acceptable time, and even standard database index maintenance will become impossible.
Then you need to use partitioning, which lets you scale up to half a billion or even 1-2 billion rows. Adding partitioning along the way is not the easiest thing: all read/write statements must include the partition column (full app changes!).
These numbers (50 million and 500 million) are of course for light select usage. If you need to select information in a complex way and/or have lots of inserts/updates/deletes, they could even be 1-2 million and 50 million instead, for a very demanding system. If you also add factors like full recovery model, high availability and no maintenance window, common for modern systems, things become extremely ugly.
Note at this point that 2 billion is the int limit, which looks bad, but an int is 4 times smaller than a GUID and is a sequential type of data; small size and sequential ordering are the #1 factors for database scalability. And you can use bigint, which is only half the size of a GUID but still sequential. Sequential ordering is what is really, deadly important - even more important than size - when it comes to many millions or a few billion rows.
If the GUID is also clustered, things are much worse. Each newly inserted row will be stored at an effectively random physical position.
Even when it is just a column, not the PK or part of the PK, just indexing it is trouble from a fragmentation perspective.
Having a GUID column is perfectly OK, like any varchar column, as long as you do not use it as part of the PK or, in general, as a key column to join tables. Your database must have its own PK elements, and filter and join data using them - filtering also by a GUID afterwards is perfectly OK.
Having sequential ID's makes it a LOT easier for a hacker or data miner to compromise your site and data. Keep that in mind when choosing a PK for a website.
If you use a GUID as the primary key and create a clustered index on it, then I suggest using NEWSEQUENTIALID() as its default value.
Another reason not to expose an Id in the user interface is that a competitor can see your Id incrementing over a day or other period and so deduce the volume of business you are doing.
Most of the time it should not be used as the primary key for a table because it really hurts the performance of the database.
Useful links regarding the impact of GUIDs on performance and their use as a primary key:
https://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/
Recently I found a couple of tables in a Database with no Clustered Indexes defined.
But there are non-clustered indexes defined, so they are on HEAP.
On analysis I found that select statements were using filter on the columns defined in non-clustered indexes.
Does not having a clustered index on these tables affect performance?
It's hard to state this more succinctly than SQL Server MVP Brad McGehee:
As a rule of thumb, every table should have a clustered index. Generally, but not always, the clustered index should be on a column that monotonically increases–such as an identity column, or some other column where the value is increasing–and is unique. In many cases, the primary key is the ideal column for a clustered index.
BOL echoes this sentiment:
With few exceptions, every table should have a clustered index.
The reasons for doing this are many and are primarily based upon the fact that a clustered index physically orders your data in storage.
If your clustered index is on a single column that monotonically increases, inserts occur in order on your storage device and page splits caused by inserts are largely avoided.
Clustered indexes are efficient for finding a specific row when the indexed value is unique, such as the common pattern of selecting a row based upon the primary key.
A clustered index often allows for efficient queries on columns that are often searched for ranges of values (between, >, etc.).
Clustering can speed up queries where data is commonly sorted by a specific column or columns.
A clustered index can be rebuilt or reorganized on demand to control table fragmentation.
These benefits can even be applied to views.
You may not want to have a clustered index on:
Columns that have frequent data changes, as SQL Server must then physically re-order the data in storage.
Columns that are already covered by other indexes.
Wide keys, as the clustered index is also used in non-clustered index lookups.
GUID columns, which are larger than identities and also effectively random values (not likely to be sorted upon), though newsequentialid() could be used to help mitigate physical reordering during inserts.
A rare reason to use a heap (table without a clustered index) is if the data is always accessed through nonclustered indexes and the RID (SQL Server internal row identifier) is known to be smaller than a clustered index key.
Because of these and other considerations, such as your particular application workloads, you should carefully select your clustered indexes to get maximum benefit for your queries.
Also note that when you create a primary key on a table in SQL Server, it will by default create a unique clustered index (if it doesn't already have one). This means that if you find a table that doesn't have a clustered index, but does have a primary key (as all tables should), a developer had previously made the decision to create it that way. You may want to have a compelling reason to change that (of which there are many, as we've seen). Adding, changing or dropping the clustered index requires rewriting the entire table and any non-clustered indexes, so this can take some time on a large table.
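To see which tables are currently heaps, the catalog views can be queried directly. The following is a sketch using sys.indexes, where an entry with type = 0 represents a heap; the column aliases are illustrative.

```sql
-- Lists every user table that has no clustered index.
SELECT OBJECT_SCHEMA_NAME(i.object_id) AS SchemaName,
       OBJECT_NAME(i.object_id)        AS TableName
FROM sys.indexes AS i
WHERE i.type = 0   -- 0 = heap, 1 = clustered, 2 = nonclustered
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY SchemaName, TableName;
```

Each table returned is a candidate for review rather than an automatic fix, since adding a clustered index rewrites the whole table and its nonclustered indexes.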
I would not say "Every table should have a clustered index", I would say "Look carefully at every table and how they are accessed and try to define a clustered index on it if it makes sense". It's a plus, like a Joker, you have only one Joker per table, but you don't have to use it. Other database systems don't have this, at least in this form, BTW.
Putting clustered indices everywhere without understanding what you're doing can also kill your performance (in general, the INSERT performance because a clustered index means physical re-ordering on the disk, or at least it's a good way to understand it), for example with GUID primary keys as we see more and more.
So, read Tim Lehner's exceptions and reason.
Performance is a big hairy problem. Make sure you are optimizing for the right thing.
Free advice is always worth its price, and there is no substitute for actual experimentation.
The purpose of an index is to find matching rows and help retrieve the data when found.
A non-clustered index on your search criteria will help to find rows, but there needs to be additional operation to get at the row's data.
If there is no clustered index, SQL uses an internal rowId to point to the location of the data.
However, if there is a clustered index on the table, that rowId is replaced by the key values of the clustered index.
So when a query needs only those key columns, the step of reading the row's data is not needed, because the values are already covered by the nonclustered index.
Even if a clustered index isn't very good at being selective, if those keys are frequently most or all of the results requested - it may be helpful to have them as the leaf of the non-clustered index.
Yes, you should have a clustered index on a table, so that all nonclustered indexes perform better.
Consider putting the clustered index on columns that contain a large number of distinct values, to avoid the need for SQL Server to add a "uniqueifier" to duplicate key values.
Disadvantage: updates take longer, but only when the fields in the clustering index are changed.
Avoid clustering index constructions where there is a risk that many concurrent inserts will happen on almost the same clustering index value.
Searches against a nonclustered index will appear slower if the clustered index isn't built correctly, or if the nonclustered index does not include all the columns needed to return the data to the calling application. In the event that the nonclustered index doesn't contain all the needed data, SQL Server will go to the clustered index to get the missing data (via a lookup), which makes the query run slower because the lookup is done row by row.
Yes, every table should have a clustered index. The clustered index sets the physical order of data in a table. You can compare this to the ordering of music at a store by band name, or the Yellow Pages ordered by last name. Since this deals with the physical order, you can have only one clustered index per table; it can comprise many columns, but you can only have one.
It's best to place the clustered index on columns often searched for a range of values. An example would be a date range. Clustered indexes are also efficient for finding a specific row when the indexed value is unique. Microsoft SQL Server will place the clustered index on a PRIMARY KEY constraint automatically if no clustered index is defined.
Clustered indexes are not a good choice for:
Columns that undergo frequent changes
This results in the entire row moving (because SQL Server must keep
the data values of a row in physical order). This is an important
consideration in high-volume transaction processing systems where
data tends to be volatile.
Wide keys
The key values from the clustered index are used by all
nonclustered indexes as lookup keys and therefore are stored in each
nonclustered index leaf entry.
I need to add proper index to my tables and need some help.
I'm confused and need to clarify a few points:
Should I use an index for non-int columns? Why/why not?
I've read a lot about clustered and non-clustered index yet I still can't decide when to use one over the other. A good example would help me and a lot of other developers.
I know that I shouldn't use indexes for columns or tables that are often updated. What else should I be careful about and how can I know that it is all good before going to test phase?
A clustered index alters the way that the rows are stored. When you create a clustered index on a column (or a number of columns), SQL server sorts the table’s rows by that column(s). It is like a dictionary, where all words are sorted in alphabetical order in the entire book.
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. It creates a completely different object within the table that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data. It is like an index in the last pages of a book, where keywords are sorted and contain the page number to the material of the book for faster reference.
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way!
One rule of thumb I would apply is this: any "regular" table (one that you use to store data in, a lookup table, etc.) should have a clustering key. There's really no point in not having a clustering key. Actually, contrary to common belief, having a clustering key speeds up all the common operations - even inserts and deletes (since the table organization is different and usually better than with a heap - a table without a clustering key).
Kimberly Tripp, the Queen of Indexing has a great many excellent articles on the topic of why to have a clustering key, and what kind of columns to best use as your clustering key. Since you only get one per table, it's of utmost importance to pick the right clustering key - and not just any clustering key.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc
You should be using indexes to help SQL server performance. Usually that implies that columns that are used to find rows in a table are indexed.
Clustered indexes make SQL Server order the rows on disk according to the index order. This implies that if you access data in the order of a clustered index, then the data will be present on disk in the correct order. However, if the column(s) covered by a clustered index are frequently changed, then the row(s) will move around on disk, causing overhead - which generally is not a good idea.
Having many indexes is not good either. They cost to maintain. So start out with the obvious ones, and then profile to see which ones you miss and would benefit from. You do not need them from start, they can be added later on.
Most column datatypes can be used when indexing, but it is better to have small columns indexed than large. Also it is common to create indexes on groups of columns (e.g. country + city + street).
Also you will not notice performance issues until you have quite a bit of data in your tables. And another thing to think about is that SQL server needs statistics to do its query optimizations the right way, so make sure that you do generate that.
A comparison of a non-clustered index with a clustered index with an example
As an example of a non-clustered index, let's say that we have a non-clustered index on the EmployeeID column. A non-clustered index will store both the value of the EmployeeID AND a pointer to the row in the Employee table where that value is actually stored. But a clustered index, on the other hand, will actually store the row data for a particular EmployeeID - so if you are running a query that looks for an EmployeeID of 15, the data from other columns in the table like EmployeeName, EmployeeAddress, etc. will all actually be stored in the leaf node of the clustered index itself.
This means that with a non-clustered index extra work is required to follow that pointer to the row in the table to retrieve any other desired values, as opposed to a clustered index which can just access the row directly since it is being stored in the same order as the clustered index itself. So, reading from a clustered index is generally faster than reading from a non-clustered index.
In general, use an index on a column that's going to be used (a lot) to search the table, such as a primary key (which by default has a clustered index). For example, if you have the query (in pseudocode)
SELECT * FROM FOO WHERE FOO.BAR = 2
You might want to put an index on FOO.BAR. A clustered index should be used on a column that will be used for sorting. A clustered index is used to sort the rows on disk, so you can only have one per table. For example if you have the query
SELECT * FROM FOO ORDER BY FOO.BAR ASC
You might want to consider a clustered index on FOO.BAR.
Probably the most important consideration is how much time your queries are taking. If a query doesn't take much time or isn't used very often, it may not be worth adding indexes. As always, profile first, then optimize. SQL Server Management Studio can give you suggestions on where to optimize, and MSDN has some information that you might find useful.
Clustered index:
faster to read than a non-clustered index, as the data is physically stored in index order
only one can be created per table
Non-clustered index:
quicker for insert and update operations than a clustered index
multiple non-clustered indexes can be created per table
I'm not a DBA ("Good!", you'll be thinking in a moment.)
I have a table of logging data with these characteristics and usage patterns:
A datetime column for storing log timestamps whose value is ever-increasing and mostly (but only mostly) unique
Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
No updates at all
Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
Infrequent selects using other columns as the criteria (and not including the timestamp column)
A good amount of data, but nowhere near enough that I'm worried much about storage space
Additionally, there is currently a daily maintenance window during which I could do table optimization.
I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.
I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).
But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.
All of this suggests to me that I should:
Put a clustered index on the timestamp column with a 100% fill-factor
Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
Schedule the bulk deletes to occur during the daily maintenance interval
Schedule a rebuild of the clustered index to occur immediately after the bulk delete
Relax and get out more
Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?
Thanks in advance.
Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!
Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.
She mentions (about in the middle of the article):
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and optimally ever-increasing. This is best served with an INT IDENTITY column.
Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!
If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
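A minimal sketch of that layout in T-SQL, assuming a hypothetical logging table (the table, column, and index names are illustrative, not from the question):

```sql
-- Hypothetical logging table; all names are illustrative.
CREATE TABLE dbo.LogEntry (
    LogEntryID  INT IDENTITY(1,1) NOT NULL,  -- narrow, unique, stable, ever-increasing
    LoggedAt    DATETIME          NOT NULL,
    Severity    TINYINT           NOT NULL,
    MessageText NVARCHAR(400)     NULL,
    CONSTRAINT PK_LogEntry PRIMARY KEY CLUSTERED (LogEntryID)
);

-- Searches by timestamp go through a separate nonclustered index.
-- Each of its entries carries LogEntryID as the row locator.
CREATE NONCLUSTERED INDEX IX_LogEntry_LoggedAt
    ON dbo.LogEntry (LoggedAt);
```

New rows always land at the end of the clustered index here, which is the ever-increasing pattern the quoted post describes.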
I agree with putting the clustered index on the timestamp column. My only question would be about the fillfactor: 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fillfactor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best for your situation.
After the bulk deletes it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fillfactor.
Finally, yes, put nonclustered indexes on other appropriate columns, but only on ones that are highly selective (e.g. not bit fields). Remember that the more indexes you have, the more write performance suffers.
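The maintenance routine described above might look roughly like this in T-SQL. The table, its columns, the index name, the 90-day retention period, and the fill factor are all assumptions chosen for illustration.

```sql
-- Hypothetical log table clustered on its timestamp column.
CREATE TABLE dbo.AppLog (
    LoggedAt    DATETIME      NOT NULL,
    Source      VARCHAR(50)   NOT NULL,
    MessageText NVARCHAR(400) NULL
);

CREATE CLUSTERED INDEX CIX_AppLog_LoggedAt
    ON dbo.AppLog (LoggedAt);

-- 1. During the maintenance window, bulk-delete rows from the start
--    of the timestamp range (90 days is an arbitrary retention period).
DELETE FROM dbo.AppLog
WHERE LoggedAt < DATEADD(DAY, -90, GETDATE());

-- 2. Rebuild the clustered index; this reclaims the gaps left by the
--    delete and reapplies the chosen fill factor.
ALTER INDEX CIX_AppLog_LoggedAt ON dbo.AppLog
    REBUILD WITH (FILLFACTOR = 100);

-- 3. Refresh the statistics on the table.
UPDATE STATISTICS dbo.AppLog;
```

The fill factor only takes effect when an index is created or rebuilt, which is why the rebuild step is where it gets "reset".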
There are two "best practice" ways to index a high traffic logging table:
an integer identity column as a primary clustered key
a uniqueidentifier column as primary key, with DEFAULT NEWSEQUENTIALID()
Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.
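Both options could be sketched in T-SQL as follows; the table and column names are hypothetical, and clustering the uniqueidentifier key is an assumption about what this answer intends.

```sql
-- Option 1: integer identity column as the clustered primary key.
CREATE TABLE dbo.TrafficLogInt (
    LogID    INT IDENTITY(1,1) NOT NULL,
    LoggedAt DATETIME          NOT NULL,
    Detail   NVARCHAR(400)     NULL,
    CONSTRAINT PK_TrafficLogInt PRIMARY KEY CLUSTERED (LogID)
);

-- Option 2: uniqueidentifier primary key filled by NEWSEQUENTIALID().
-- NEWSEQUENTIALID() can only be used in a DEFAULT constraint.
CREATE TABLE dbo.TrafficLogGuid (
    LogID    UNIQUEIDENTIFIER  NOT NULL
             CONSTRAINT DF_TrafficLogGuid_LogID DEFAULT NEWSEQUENTIALID(),
    LoggedAt DATETIME          NOT NULL,
    Detail   NVARCHAR(400)     NULL,
    CONSTRAINT PK_TrafficLogGuid PRIMARY KEY CLUSTERED (LogID)
);
```

Note that the uniqueidentifier key is 16 bytes against 4 bytes for the INT, and that width is repeated in every nonclustered index on the table.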
The obvious answer is that it depends on how you will query it. The point of an index is to reduce the number of comparisons needed when selecting data. The clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64k block with one read). If you include an ID and a datetime as the primary key but do not use them in your selection criteria, they will do nothing but hinder your performance. This is also why people usually drop indexes before bulk-loading data.
I have a series of questions about Keys, Indexes and Constraints in SQL, SQL 2005 in particular. I have been working with SQL for about 4 years but I have never been able to get definitive answers on this topic and there is always contradictory info on blog posts, etc. Most of the time tables I create and use just have an Identity column that is a Primary Key and other tables point to it via a Foreign Key.
With join tables I have no Identity and create a composite Primary Key over the Foreign Key columns. The following is a set of statements of my current beliefs, which may be wrong, please correct me if so, and other questions.
So here goes:
As I understand it the difference between a Clustered and Non Clustered Index (regardless of whether it is Unique or not) is that the Clustered Index affects the physical ordering of data in a table (hence you can only have one in a table), whereas a Non Clustered Index builds a tree data structure. When creating Indexes why should I care about Clustered vs Non Clustered? When should I use one or the other? I was told that inserting and deleting are slow with Non-Clustered indexes as the tree needs to be "rebuilt." I take it Clustered indexes do not affect performance this way?
I see that Primary Keys are actually just Clustered Indexes that are Unique (do they have to be clustered?). What is special about a Primary Key vs a Clustered Unique Index?
I have also seen Constraints, but I have never used them or really looked at them. I was told that the purpose of Constraints is to enforce data integrity, whereas Indexes are aimed at performance. I have also read that constraints are actually implemented as Indexes anyway, so they are "the same." This doesn't sound right to me. How are constraints different from Indexes?
Clustered indexes are, as you put it correctly, the definition as to how data in a table is stored physically, i.e. you have a B-tree sorted using the clustering key and you have the data at the leaf level.
Non-clustered indexes on the other hand are separate tree structures which at the leaf level only have the clustering key (or a RID if the table is a heap), meaning that when you use a non-clustered index, you'll have to use the clustered index to get the other columns (unless your request is fully covered by the non-clustered index, which can happen if you request only the columns, which constitute the non-clustered index key columns).
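To make the covered versus non-covered distinction concrete, here is a small T-SQL sketch on a hypothetical client table (all names are invented). Which plan the optimizer actually picks depends on the data, so treat the comments as the expected behaviour rather than a guarantee.

```sql
-- Hypothetical table; the primary key is also the clustering key.
CREATE TABLE dbo.Client (
    ClientID  INT IDENTITY(1,1) NOT NULL,
    LastName  NVARCHAR(50)      NOT NULL,
    FirstName NVARCHAR(50)      NOT NULL,
    Email     NVARCHAR(100)     NULL,
    CONSTRAINT PK_Client PRIMARY KEY CLUSTERED (ClientID)
);

CREATE NONCLUSTERED INDEX IX_Client_LastName
    ON dbo.Client (LastName);

-- Covered: LastName is the index key and ClientID (the clustering key)
-- is stored at the leaf level of the nonclustered index.
SELECT ClientID, LastName
FROM dbo.Client
WHERE LastName = N'Smith';

-- Not covered: Email is only stored in the clustered index, so each
-- matching index entry needs an extra lookup by ClientID.
SELECT ClientID, LastName, Email
FROM dbo.Client
WHERE LastName = N'Smith';

-- From SQL Server 2005 on, non-key columns can be added to the leaf
-- level with INCLUDE so that the second query is covered as well.
CREATE NONCLUSTERED INDEX IX_Client_LastName_Email
    ON dbo.Client (LastName)
    INCLUDE (Email);
```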
When should you use one or the other? Well, since you can have only one clustered index, define it on the columns which make the most sense, i.e. when you look up clients by ID most of the time, define a clustered index on the ID. Non-clustered indexes should be defined on columns which are used less often.
Regarding performance, inserts or updates that change the index key are always painful, regardless of whether it is a clustered or non-clustered index, since page splits can happen, which force data to be moved between pages (page splits in a clustered index hurt more, since you have more data at the leaf level). Thus the general rule is to avoid changing the index key and to insert new values so that they are sequential. Otherwise you'll encounter fragmentation and will have to rebuild your index on a regular basis.
Finally, regarding constraints, by definition, they have nothing to do with indexes, yet SQL Server has chosen to implement them using indexes. E.g. currently, a unique constraint is implemented as an index, however this can change in a future version (though I doubt that will happen). The type of index (clustered or not) is up to you, just remember that you can have only one clustered index.
If you have more questions of this type, I highly recommend reading this book, which covers these topics in depth.
Your assumption about the clustered vs non-clustered is pretty good
It also seems that a primary key enforces non-null uniqueness, while a unique index does not enforce non-null (see: primary vs unique).
The primary key is a logical concept in relational database theory - it's a key (and typically also an index) which is designed to uniquely identify any of your rows. Therefore it must be unique and it cannot be NULL.
The clustering key is a storage-physical concept of SQL Server specifically. It's a special index that isn't just used for lookups etc., but also defines the physical structure of your data in your table. In a printed phonebook in Western European culture (except maybe for Iceland), the clustered index would be "LastName, FirstName".
Since the clustering index defines your physical data layout, you can only ever have one of those (or none - not recommended, though).
Requirements for a clustering key are:
should be unique (if it is not, SQL Server will add a 4-byte "uniqueifier")
should be stable (never changing)
should be as small as possible (INT is best)
should be ever-increasing (think: IDENTITY)
SQL Server makes your primary key the clustering key by default - but you can change that if you need to. Also, mind you: the columns that make up the clustering key will be added to each and every entry of each and every non-clustered index on your table - so you want to keep your clustering key as small as possible. This is because the clustering key will be used to do the "bookmark lookup" - if you found an entry in a non-clustered index (e.g. a person by their social security number) and now you need to grab the entire row of data to get more details, you need to do a lookup, and for this, the clustering key is used.
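A short T-SQL sketch of changing that default, using a hypothetical person table (the names and the choice of SSN as primary key are only for illustration):

```sql
CREATE TABLE dbo.Person (
    PersonID  INT IDENTITY(1,1) NOT NULL,
    SSN       CHAR(11)          NOT NULL,
    LastName  NVARCHAR(50)      NOT NULL,
    FirstName NVARCHAR(50)      NOT NULL,
    -- Explicit NONCLUSTERED overrides the default of clustering
    -- the table on its primary key.
    CONSTRAINT PK_Person PRIMARY KEY NONCLUSTERED (SSN)
);

-- The clustering key is chosen separately: small, unique, stable,
-- and ever-increasing.
CREATE UNIQUE CLUSTERED INDEX CIX_Person_PersonID
    ON dbo.Person (PersonID);
```

With this layout, finding a person by SSN goes through PK_Person first and then uses the 4-byte PersonID stored in that index entry to fetch the full row from the clustered index.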
There's a great debate about what makes a good or useful clustering and/or primary key - here's a few excellent blog posts to read about this:
all of Kimberly Tripp's Indexing blog posts are a must-read
GUIDs as primary key and/or clustering key
The Clustered index debate continues....
Marc
You have several questions. I'll break some of them out:
When creating Indexes why should I care about Clustered vs Non Clustered?
Sometimes you do care how the rows are organized. It depends on your data and how you will use it. For example, if your primary key is a uniqueidentifier, you may not want it to be CLUSTERED, because GUID values are essentially random. This will cause SQL to insert rows randomly throughout the table, causing page splits which hurt performance. If your primary key value will always increment sequentially (int IDENTITY for example), then you probably want it to be CLUSTERED, so your table will always grow at the end.
A primary key is CLUSTERED by default, and most of the time you don't have to worry about it.
I was told that inserting and deleting are slow with Non-Clustered indexes as the tree needs to be "rebuilt." I take it Clustered indexes do not affect performance this way?
Actually, the opposite can be true. NONCLUSTERED indexes are kept as a separate data structure, but the structure is designed to allow some modification without needing to be "re-built". When the index is initially created, you can specify the FILLFACTOR, which specifies how much free space to leave on each page of the index. This allows the index to tolerate some modification before a page split is necessary. Even when a page split must occur, it only affects the neighboring pages, not the entire index.
The same behavior applies to CLUSTERED indexes, but since CLUSTERED indexes store the actual table data, page splitting operations on the index can be much more expensive because the whole row may need to be moved (versus just the key columns and the row locator in a NONCLUSTERED index).
The following MSDN page talks about FILLFACTOR and page splits:
http://msdn.microsoft.com/en-us/library/aa933139(SQL.80).aspx
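For reference, a fill factor is specified like this when an index is created. This is a sketch on a hypothetical table, and the value 80 is an arbitrary example rather than a recommendation.

```sql
-- Hypothetical table; names are illustrative.
CREATE TABLE dbo.Customer (
    CustomerID INT IDENTITY(1,1) NOT NULL,
    CustomerNo VARCHAR(20)       NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID)
);

-- Leave roughly 20% of each leaf page empty when the index is built,
-- so that later inserts of out-of-order CustomerNo values can be
-- absorbed before a page split becomes necessary.
CREATE NONCLUSTERED INDEX IX_Customer_CustomerNo
    ON dbo.Customer (CustomerNo)
    WITH (FILLFACTOR = 80);
```

The free space is only set aside when the index is created or rebuilt; it is not maintained as rows are added afterwards.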
What is special about a Primary Key vs a Clustered Unique Index?
How are constraints different to Indexes?
For both of these I think it's more about declaring your intentions. When you call something a PRIMARY KEY you are declaring that it is the primary method for identifying a given row. Is a PRIMARY KEY physically different from a CLUSTERED UNIQUE INDEX? I'm not sure. The behavior is essentially the same, but your intentions may not be clear to someone working with your database.
Regarding constraints, there are many types of constraints. For a UNIQUE CONSTRAINT, there isn't really a difference between that and a UNIQUE INDEX, other than declaring your intention. There are other types of constraints that do not map directly to a type of index, such as CHECK constraints, DEFAULT constraints, and FOREIGN KEY constraints.
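The different kinds of constraints can be seen side by side in a T-SQL sketch like the following; the two tables and every name in them are hypothetical.

```sql
CREATE TABLE dbo.Department (
    DepartmentID   INT IDENTITY(1,1) NOT NULL,
    DepartmentName NVARCHAR(50)      NOT NULL,
    CONSTRAINT PK_Department PRIMARY KEY CLUSTERED (DepartmentID)
);

CREATE TABLE dbo.Employee (
    EmployeeID   INT IDENTITY(1,1) NOT NULL,
    BadgeNo      CHAR(8)           NOT NULL,
    Email        NVARCHAR(100)     NOT NULL,
    Salary       DECIMAL(10,2)     NOT NULL,
    HiredOn      DATETIME          NOT NULL
                 CONSTRAINT DF_Employee_HiredOn DEFAULT GETDATE(),  -- DEFAULT: no index
    DepartmentID INT               NOT NULL,
    CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED (EmployeeID),      -- backed by a unique index
    CONSTRAINT UQ_Employee_BadgeNo UNIQUE (BadgeNo),                -- backed by a unique index
    CONSTRAINT CK_Employee_Salary CHECK (Salary >= 0),              -- CHECK: no index
    CONSTRAINT FK_Employee_Department FOREIGN KEY (DepartmentID)
        REFERENCES dbo.Department (DepartmentID)                    -- FOREIGN KEY: no index is created automatically
);

-- The same kind of uniqueness rule, declared as an index instead of a constraint.
CREATE UNIQUE NONCLUSTERED INDEX UX_Employee_Email
    ON dbo.Employee (Email);
```

UQ_Employee_BadgeNo and UX_Employee_Email enforce the same kind of rule; the difference is mainly whether it is declared as part of the table's logical definition or as a physical index.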
I don't have time to answer this in depth, so here is some info off the top of my head:
You're right about clustered indexes. They rearrange the physical data according to the sort order of the clustered index. You can use clustered indexes specifically for range-bound queries (e.g. between dates).
PKs are by default clustered, but they don't have to be. That's just a default setting. The PK is supposed to be a UID for the row.
Constraints can be implemented as indexes (for example, unique constraints), but can also be implemented as default values.