Nonclustered primary key dilemma - sql-server

Suppose we had to define optimal indexing for Stack Overflow questions. Let's not take the schema of the actual Posts table, though; let's just include the columns that are actually relevant:
create table Posts (
    Id int not null identity,
    PostTypeId tinyint not null,
    LastActivityDate datetime not null default getdate(),
    Title nvarchar(500) null, -- answers don't have titles
    Body nvarchar(max) not null,
    ...
)
I've made Id an identity column even though Data Stack Exchange shows that none of the tables have a primary key constraint or identity columns. They just have a number of unique/non-unique, clustered/non-clustered indices.
Usage scenarios
So basically these are the main scenarios for posts:
1. They're chronologically displayed in descending order by their LastActivityDate column (or maybe LastEditDate, which I haven't included above as it's not so important)
2. They're individually displayed on the question details page
3. Answers are displayed on the question details page in vote order (the ScoreCount column is not part of my code above)
Indexing optimization
Which indices would be best for the above scenarios, especially if we say that #1 is the most common scenario, so it has to work really fast?
I'd say that one of the better possibilities would be to create these indices:
-- index 1
alter table Posts
add primary key nonclustered (Id);
-- index 2
create clustered index IX_Posts_LastActivityDate
on Posts(LastActivityDate desc);
-- index 3
create index IX_Posts_ParentId
on Posts(ParentId, PostTypeId)
include (ScoreCount);
This way we basically end up with three indices of which the second one is clustered.
So in order for #1 to work really fast I've set clustered index on LastActivityDate column, because clustered indices are especially great when we do range comparison on them. And we would be ordering questions chronologically newest to oldest hence I've set ordering direction and also included type on the clustered index.
So what did we solve with this?
scenario #1 is very efficiently covered by index 2 as it's clustered and fully covered; we can also easily and efficiently do result paging;
scenario #2 is somewhat covered with unique index 1 (to get the question) and non-unique index 3 to get all related answers (scenario #3) ordered by ScoreCount; and if we decide to chronologically order answers that's also covered with index 2;
Question 1
SQL Server internals are such that it implicitly adds the clustered key to each nonclustered index so it can locate records in the row store:
if the clustered index is unique, then that's the key that will be added to nonclustered indices, and
if the clustered index is non-unique, SQL Server supposedly generates its own UniqueId and uses that
Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL Server will still supply its own unique key on the non-unique clustered index, or will it use the nonclustered primary key to uniquely identify each record instead?
Question 2
So if the primary key isn't used to locate records in the row store (clustered index), does it even make sense to actually create a PK? Would it in this case be better to do this instead?
create unique index UX_Posts_Id
on Posts(Id);
-- include (Title, Body, ScoreCount);
It would be great to also include commented out columns, but then that would make this index inefficient as it will be worse in caching... Why I'm asking whether it would be better to create this index instead of a primary key constraint is because we can include additional non-key columns to this index while we can't do the same when we add a PK constraint that internally generates a unique index...
Question 3
I'm aware that LastActivityDate changes which isn't desired with clustered indices, but we have to consider the fact that this column is more likely to change for some time before it becomes more or less static, so it shouldn't cause too much index fragmentation as records will mostly be appended to the end whenever LastActivityDate changes. Index fragmentation on some arbitrary page should never happen because some new record would be inserted into some old(er) page as LastActivityDate will only increase. Hence most modifications will happen on the last page.
So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key:
it's not unique - although one could argue about this, especially if we'd change datetime to datetime2 and use higher precision function sysdatetime() and set index as unique
it's narrow - pretty much
it's not static - but I've explained how it changes
it's ever increasing

Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL will still supply its own unique key on clustered non-unique index or will it use nonclustered primary key to uniquely identify each record instead?
SQL Server adds a 4-byte "uniqueifier" when a given non-unique clustered index key value isn't unique. All non-clustered index leaf nodes, including the primary key, will include LastActivityDate plus the uniqueifier (when present) as the row locator. The internal uniqueifier would be needed here only for posts with the same LastActivityDate so I'd expect relatively few rows would actually need a uniqueifier.
So if primary key isn't used to locate records on row store (clustered index) does it even make sense to actually create a PK? Would in this case be better to rather do this?
From a data modeling perspective, every relational table should have a primary key. The implicitly created index can be declared as either clustered or non-clustered as needed to optimize performance. If LastActivityDate is a better choice for performance, then the primary key index must be non-clustered. This primary key index will provide the needed index to retrieve singleton posts.
Unfortunately, SQL Server doesn't provide a way to specify included columns on primary key and unique constraint definitions. This is a case where one can bend the rules and use a unique index instead of a declared primary key constraint in order to avoid the cost of redundant indexes and gain the benefits of included columns. The unique index is functionally identical to a primary key and can be referenced by foreign key constraints.
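As a rough sketch of the unique-index approach just described (not a tested recommendation for the real Posts table): the table below is a cut-down, hypothetical version of Posts, and the names UX_Posts_Id, dbo.PostComments and FK_PostComments_Posts are made up for illustration.

```sql
-- Hypothetical, trimmed-down Posts table (column list is an assumption).
create table dbo.Posts (
    Id int not null identity,
    LastActivityDate datetime not null
        constraint DF_Posts_LastActivityDate default getdate(),
    Title nvarchar(500) null,
    ScoreCount int not null
        constraint DF_Posts_ScoreCount default 0
);
go

-- Clustered index on the activity date, as proposed in the question.
create clustered index IX_Posts_LastActivityDate
    on dbo.Posts (LastActivityDate desc);

-- Unique index in place of a PRIMARY KEY constraint, so that
-- non-key columns can be carried in the leaf level.
create unique nonclustered index UX_Posts_Id
    on dbo.Posts (Id)
    include (Title, ScoreCount);
go

-- A foreign key can reference the column covered by the unique index.
create table dbo.PostComments (
    CommentId int not null identity
        constraint PK_PostComments primary key,
    PostId int not null
        constraint FK_PostComments_Posts
        foreign key references dbo.Posts (Id)
);
go
```

Whether the included columns are worth the extra index size depends on the workload; the choice of Title and ScoreCount here is only an example.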
So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key
LastActivityDate alone can never be guaranteed to be unique regardless of the level of precision (barring single-threaded inserts or retry logic). One approach could be a composite primary key on LastActivityDate and Id. Individual posts would need to be retrieved using both values. That would eliminate the need for the separate unique index on Id previously discussed.
My biggest concern about LastActivityDate as the leftmost clustered index key column is that it may change often for recent posts. This would require a lot of row movement to maintain the logical key order, may impact concurrency significantly compared to the current static Id key, and require updates to the non-clustered index row locator values upon each change. So even though this clustered index key may be optimal for many queries, the other costs on a highly transactional system may outweigh the benefits.
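A minimal sketch of the composite-key idea mentioned above, assuming a hypothetical table dbo.PostsByActivity with a reduced column list. It does nothing about the update cost raised in the last paragraph; it only shows the key shape and the two access patterns.

```sql
-- Hypothetical table; names and column list are assumptions for illustration.
create table dbo.PostsByActivity (
    Id int not null identity,
    LastActivityDate datetime2(3) not null
        constraint DF_PostsByActivity_LastActivityDate default sysdatetime(),
    Title nvarchar(500) null,
    -- LastActivityDate leads for chronological scans;
    -- Id makes the key unique, so no uniqueifier is needed.
    constraint PK_PostsByActivity
        primary key clustered (LastActivityDate desc, Id)
);
go

-- Scenario #1: newest-first listing, read in clustered key order.
select top (50) Id, Title, LastActivityDate
from dbo.PostsByActivity
order by LastActivityDate desc, Id;

-- A singleton lookup has to supply both key columns to seek on the key.
declare @Id int = 42,
        @LastActivityDate datetime2(3) = '2012-01-01T12:00:00';

select Id, Title
from dbo.PostsByActivity
where LastActivityDate = @LastActivityDate
  and Id = @Id;
go
```

Every change to LastActivityDate still moves the row within the clustered index and changes the row locator stored in any non-clustered index, so the trade-off described above applies to this layout as well.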

Related

Is there any advantage in creating a clustered index - if we are not going to query/search for records based on that column?

I am doing a review of some DB tables that were created in our project and came across this. The table contains an identity column (ID) which is the primary key for the table, and a clustered index has been defined on this ID column. But when I look at the stored procedure that retrieves records from this table, I see that the ID column is never used in the query; records are queried by a USERID column (this column is not unique), and there can be multiple records for the same USERID.
So my question is: is there any advantage/purpose in creating a clustered index when we know that the records won't be queried by that column?
If the IDENTITY column is never used in WHERE and JOIN clauses, or referenced by foreign keys, perhaps USERID should be a clustered primary key. I would question the need for the ID column at all in that case.
The best choice for the clustered index depends much on how the table is queried. If the majority of queries are by USERID, then it should probably be a unique clustered index (or clustered unique constraint) and the ID column non-clustered.
Keep in mind that the clustered index key is implicitly included in all non-clustered indexes as the row locator. The implication is that non-clustered indexes are more likely to cover queries, and that non-clustered index leaf rows are wider as a result.
I would say your table is mis-designed. Someone apparently thought every table needs a primary key and the primary key is the clustered index. Adding a system-generated unique number as an identifier just adds noise if that number isn't used anywhere. Noise in the clustered index is unhelpful, to say the least.
They are different concepts, by the way. A primary key is a data modeling concern, a logical concept. An index is a physical design issue. A SQL DBMS must support primary keys, but need not have any indexes, clustered or no.
If USERID is what is usually used to search the table, it should be in your clustered index. The clustered index need not be unique and need not be the primary key. I would look at the data carefully to see if some combination of USERID and another column (or two, or more) form a unique identifier for the row. If so, I'd make that the primary key (and clustered index), with USERID as the first column. If query analysis showed that many queries use only USERID and nothing else (for existence testing) I might create a separate index just of USERID.
If no combination of columns constitutes a unique identifier, you have a logical problem, to wit: what does the row mean? What aspect of the real world does it represent?
A basic tenet of the Relational Model is that elements in a relation (rows in a table) are unique, that each one identifies something. If two rows are identical, they identify the same thing. What does it mean to delete one of them? Is the thing that they both identify still there, or not? If it is, what purpose did the 2nd row serve?
I hope that gives you another way to think about clustered indexes and keys. I wouldn't be surprised if you find other tables that could be improved, too.
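To make the suggestions in these answers a bit more concrete, here is a hedged sketch. Both tables are hypothetical, and the uniqueness of (USERID, LoginTime) is an assumption made purely for the example.

```sql
-- Hypothetical table: the USERID + LoginTime combination is ASSUMED unique.
create table dbo.UserLogins (
    USERID int not null,
    LoginTime datetime2(3) not null,
    ClientIp varchar(45) null,
    -- Natural composite key, clustered, with USERID as the leading column.
    constraint PK_UserLogins primary key clustered (USERID, LoginTime)
);
go

-- The usual lookup becomes a range seek on the leading clustered key column.
declare @UserId int = 1001;

select USERID, LoginTime, ClientIp
from dbo.UserLogins
where USERID = @UserId;
go

-- If no unique combination exists, a non-unique clustered index is still allowed.
create table dbo.UserNotes (
    USERID int not null,
    Note nvarchar(200) null
);

create clustered index CIX_UserNotes_USERID
    on dbo.UserNotes (USERID);
go
```

In the second table SQL Server has to distinguish rows with the same USERID internally, which is the uniqueifier overhead discussed in the first thread.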

Proper table design for sparse primary key

In my system I have temporary entities that are created based on rules stored in my database, and the entities are not persisted.
Now, what I need is to store information about these entities, and because they are created based on rules and are not stored, they have no ID.
I came up with a formula to generate an ID for these temp entities based on the rule that was used to generate them: id = rule id + "-" + entity index in the rule. This formula generates unique strings of the form 164-3, 123-0, 432-2, etc...
My question is how should I build my table (regarding primary key and clustered index) when my keys have no relation or order? Keep in mind that I will only (99.9% of the time) query the table using the id mentioned above.
Options I thought about after much reading, but don't have the knowledge to determine which is better:
1) primary key on a varchar column with clustered index. -According to various sources, this would be bad because of fragmentation and the wideness of the key. Also their format is pretty weird for sorting.
2) primary key on varchar column without clustered index (heap table). -Also a bad idea according to various sources due to indexing and fragmentation issues.
3) identity int column with clustered index, and a varchar column as primary key with unique index. -Can't really see the benefit of the surrogate key here since it would mainly help with range queries and ordering and I would never query the table based on this key because it would be unknown at all times.
4) 2 columns composite key: rule id + rule index columns. -Now I don't have strings but I have two columns that will be copied to FKs and non clustered indexes. Also I'm not sure what indexes I would use in this case.
Can anybody shine a light here? Any help is appreciated.
--Edit
I will perform more selects than inserts;
I will perform more inserts than updates;
All selects will include at least rule id;
If I use a surrogate primary key, and a unique index on (rule id, index), then I can use the surrogate for subsequent operations after retrieving data by rule id, which would be faster. Also, inserts would be faster.
However, because the data will be stored according to the surrogate key, I might have records that have the same rule id, but different index, stored quite far from each other on disk, which means even with an index on rule id, retrieving the data could be kinda slow.
If I use (rule id, index) as clustered primary key, rows with same rule id would be stored close to each other, and selecting data by rule id would be efficient enough. However, I suspect inserts would be slow.
Is the rationale above correct?
Using a heap is generally a bad idea unless proven otherwise. Even so, you will need a very solid reason for not having a clustered index (any one will make things better, even on identity column).
Storing this key in a single column is okay; if you want natural sorting, you can pad your numbers with zeroes, for example. However, this will widen the key.
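For what it's worth, the zero-padding idea could look roughly like this. The width of 10 digits per part and the variable names are assumptions; any fixed width large enough for your largest rule id and index would do.

```sql
-- Sketch only: builds a fixed-width string id from the two numeric parts
-- so that string ordering matches numeric ordering.
declare @RuleId int = 164,
        @EntityIndex int = 3;

select right(replicate('0', 10) + cast(@RuleId as varchar(10)), 10)
       + '-'
       + right(replicate('0', 10) + cast(@EntityIndex as varchar(10)), 10)
       as PaddedId;
```

The padded value takes 21 characters, compared with 8 bytes for two int columns, which is the widening mentioned above.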
Having a composite primary key (and, subsequently, foreign keys) is completely acceptable, especially when dealing with natural keys, like the one you have. This will give you the narrowest possible key - int + int or some such - while eliminating the sorting issue at the same time. I would recommend to make this PK clustered to reduce additional key lookups.
Fragmentation here will not be a big issue; at least, no bigger than with any other indexing decision. Any index built on such a key will be prone to fragmentation, clustered or no. In any case, your DBA should know how to keep an index such as this in top form.
Regarding the order of columns in the index, the following rules usually apply:
If partial key match will take place (filtering by one part of the key but not by the other) the one which is used most often should go first;
If No.1 isn't applicable and all parts of the key are used in all queries, the column with the highest cardinality should go first.
The order of remaining columns (if there are more than 1) isn't of much importance because SQL Server only creates distribution statistics for the first column in a composite index. However, it is a good idea to list them in order of decreasing cardinality.
EDIT: Seeing your update with additional details, here are the most suitable options. Suppose your table looks like this:
-- Sample table
create table dbo.TempEntities (
RuleId int not null,
IndexId int not null,
-- Remaining columns listed here
EntityData xml not null
);
go
From here, the most straightforward way is to use the natural key as a clustered index:
-- Option 1 - natural clustered index
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (RuleId, IndexId);
go
However, if you have any child tables that would reference this one, it might not be the most convenient solution, because natural keys are prone to updates, which creates a mess where you could avoid it. Instead, a surrogate key can be introduced, like this:
-- Option 2 - surrogate clustered, natural nonclustered
alter table dbo.TempEntities add Id bigint identity(1,1) not null;
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (Id);
alter table dbo.TempEntities
add constraint UQ_TempEntities_RuleIdIndexId unique (RuleId, IndexId);
go
It makes sense to have the surrogate PK clustered, because it will result in far fewer page splits, making inserts faster (despite having one more index compared to Option 1). Without any intimate knowledge of your queries, this is probably the most balanced solution.
Shuffling the clustered attribute between surrogate and natural keys has mostly academic value and can only make a difference on a high-load system with hundreds of inserts happening every second on a 24/7 schedule. If your system is indeed such, please seek a professional consultant who will analyse your queries and provide a solution tailored to your situation.

Non nullable column with clustered unique index. Why need Primary Key?

In SQL Server, I have a non nullable column with a unique clustered index on it.
If I make this column a Primary Key the exact same index is created automatically plus
the column is recognized as a Primary Key.
I understand the abstract/semantic difference.
(Primary Key identifies the entity, while any other column with this index may not.
For example, a Person can have Email field which is Unique,Non-nullable... but can be changed)
But what bothers me is the actual difference when it comes to the DB engine itself.
What will happen if I will just create an Id column, make it non-nullable, create a unique clustered index for it, make it Identity Increment, but without the Primary Key constraint?
In what scenarios does the Primary Key constraint come into play?
(I've looked at many related questions before asking this, but all the answers I saw ended up with an abstract/theoretical explanation).
Nothing will be different really. You specify PRIMARY KEY to relay your intentions, not so that the engine does anything differently. When constructing a query plan, the optimizer will still use the uniqueness for all of its properties, and will still use the clustered index for all of its properties, regardless of whether you technically created it as a PRIMARY KEY. When creating a FOREIGN KEY, you can still reference the column(s) specified as unique (clustered or not). The difference is solely in the metadata (sys.indexes.is_primary_key) and in SSMS' representation to you (oh and the fact that you can create a unique clustered index on a NULLable column, but you can't create a PRIMARY KEY on that column).
In fact there are many cases where you want to completely separate the clustered index from the PRIMARY KEY. If you have a table where the PK is a GUID, for example, and you are typically running date range queries against the table, you are probably better off having the PK be non-clustered and have a clustered index on a naturally increasing column (the datetime column) - both to minimize page splits on heavy insert activity and also to best assist date range queries. The non-clustered index will be perfectly fine for looking up individual GUIDs. (I wanted to mention that because a lot of people think the primary key has to be clustered. Not true.)
Also interesting to note that if you create a PRIMARY KEY constraint, then create a unique clustered index with the same name using DROP_EXISTING, the is_primary_key column will still be 1 and Object Explorer will still show the index name under Keys.
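A small sketch of the GUID case described above, with the metadata check mentioned in the first paragraph. The table dbo.Events and all constraint and index names are hypothetical.

```sql
-- Hypothetical table: GUID primary key, date-range queries on CreatedAt.
create table dbo.Events (
    EventId uniqueidentifier not null
        constraint DF_Events_EventId default newid(),
    CreatedAt datetime2(3) not null
        constraint DF_Events_CreatedAt default sysdatetime(),
    Payload nvarchar(400) null,
    -- The primary key is declared, but kept non-clustered.
    constraint PK_Events primary key nonclustered (EventId)
);
go

-- The clustered index goes on the naturally increasing column instead.
create clustered index CIX_Events_CreatedAt
    on dbo.Events (CreatedAt);
go

-- The primary key shows up only as a metadata flag on its index.
select name, type_desc, is_unique, is_primary_key
from sys.indexes
where object_id = object_id(N'dbo.Events');
go
```

The last query should list both indexes, with is_primary_key set only for PK_Events.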
Here is one scenario - a lot of code to data mapping frameworks look at the database metadata (what are the primary keys, foreign keys, etc) to determine how code is executed. For example Hibernate requires a primary key.
A typical scenario might be generating a where clause for an update.

Does an index already cover a clustered primary key?

Let's say I have a table like this:
CREATE TABLE t(
[guid] [uniqueidentifier] NOT NULL,
[category] [nvarchar](400)
{,...other columns}
)
Where guid is my primary key, and has a clustered index.
Now, I want an index that covers both category and guid, because I'm rolling up some other stuff related to t by category, and I want to avoid including the t table itself.
Is it sufficient to create index covering category, or do I need to include guid as well?
I would expect SQL Server indexes to point directly to page offsets in t rather than simply referring to a guid primary key value, which means I would need to explicitly include the PK column to avoid hitting t. Is this the case?
Actually your assumption is wrong - all SQL Server non-clustered indices do include the clustering key (single or multiple columns) and do not point directly at some physical page.
This prevents SQL Server from having to reorganize and update lots of index entries when a page needs to be split in two or relocated. So if you are seeking in a non-clustered index and you find a value, then you have the clustering key and SQL Server will need to do a "bookmark lookup" (or key lookup) to retrieve the actual data page (the leaf page in the clustering index) to get the whole set of data belonging to a single row.
That said - if you ever have a situation where it depends on the ordering of the key columns, then you still might need to create an index specifically on (guid, category) - of course, in that case, SQL Server is smart enough to figure out that the clustering key column is already in the index and won't be adding it one more time.
The fact that the clustering key column(s) are included in every single non-clustered index is another strong reason why your clustering keys should be narrow, static and unique. Making them too wide (anything beyond 8 bytes) is a sure recipe for bloat and slow-down.
Differing slightly to marc_s' answer.
A covering index on (category, guid) will have a different sort on GUID to the primary key sort. Therefore, guid may appear twice in the index because it is in the key column list and the pointer to the clustered index.
If you INCLUDEd (as a non-key column) guid SQL Server won't add it again.
I can't test the key column thing just now, but I have verified the INCLUDE one before on SQL Server 2005.
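Going by the answers above, a sketch of the simplest variant would be the following. The table is a reduced, hypothetical version of t, and the index name and the other_data column are made up.

```sql
-- Reduced version of the table from the question.
create table dbo.t (
    [guid] uniqueidentifier not null
        constraint PK_t primary key clustered,
    [category] nvarchar(400) null,
    other_data int null   -- stand-in for the "other columns"
);
go

-- Index on category only; the clustering key ([guid]) is carried
-- along in the index rows as the row locator.
create nonclustered index IX_t_category
    on dbo.t ([category]);
go

-- Both referenced columns are available in IX_t_category,
-- so the query should not need to touch the clustered index.
select [category], [guid]
from dbo.t
where [category] = N'books';
go
```

Checking the actual execution plan for the absence of a key lookup is the way to confirm this on your own data and version.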

SQL 2005: Keys, Indexes and Constraints Questions

I have a series of questions about Keys, Indexes and Constraints in SQL, SQL 2005 in particular. I have been working with SQL for about 4 years but I have never been able to get definitive answers on this topic and there is always contradictory info on blog posts, etc. Most of the time tables I create and use just have an Identity column that is a Primary Key and other tables point to it via a Foreign Key.
With join tables I have no Identity and create a composite Primary Key over the Foreign Key columns. The following is a set of statements of my current beliefs, which may be wrong, please correct me if so, and other questions.
So here goes:
As I understand it the difference between a Clustered and Non Clustered Index (regardless of whether it is Unique or not) is that the Clustered Index affects the physical ordering of data in a table (hence you can only have one in a table), whereas a Non Clustered Index builds a tree data structure. When creating Indexes why should I care about Clustered vs Non Clustered? When should I use one or the other? I was told that inserting and deleting are slow with Non-Clustered indexes as the tree needs to be "rebuilt." I take it Clustered indexes do not affect performance this way?
I see that Primary Keys are actually just Clustered Indexes that are Unique (do they have to be clustered?). What is special about a Primary Key vs a Clustered Unique Index?
I have also seen Constraints, but I have never used them or really looked at them. I was told that the purpose of Constraints is that they are for enforcing data integrity, whereas Indexes are aimed at performance. I have also read that constraints are actually implemented as Indexes anyway so they are "the same." This doesn't sound right to me. How are constraints different to Indexes?
Clustered indexes are, as you put it correctly, the definition as to how data in a table is stored physically, i.e. you have a B-tree sorted using the clustering key and you have the data at the leaf level.
Non-clustered indexes on the other hand are separate tree structures which at the leaf level only have the clustering key (or a RID if the table is a heap), meaning that when you use a non-clustered index, you'll have to use the clustered index to get the other columns (unless your request is fully covered by the non-clustered index, which can happen if you request only the columns, which constitute the non-clustered index key columns).
When should you use one or the other ? Well, since you can have only one clustered index, define it on the columns which makes most sense, i.e. when you look up clients by ID most of the time, define a clustered index on the ID. Non-clustered indexes should be defined on columns which are used less often.
Regarding performance, inserts or updates that change the index key are always painful, regardless of whether it is a clustered or non-clustered index, since page splits can happen, which forces data to be moved between pages (moving the pages of a clustered index hurts more, since you have more data in the leaf level). Thus the general rule is to avoid changing the index key and to insert new values so that they are sequential. Otherwise you'll encounter fragmentation and will have to rebuild your index on a regular basis.
Finally, regarding constraints, by definition, they have nothing to do with indexes, yet SQL server has chosen to implement them using indexes. E.g. currently, a unique constraint is implemented as an index, however this can change in a future version (though I doubt that will happen). The type of index (clustered or not) is up to you, just remember that you can have only one clustered index.
If you have more questions of this type, I highly recommend reading this book, which covers these topics in depth.
Your assumption about clustered vs non-clustered is pretty good.
It also seems that a primary key enforces non-null uniqueness, while a unique index does not enforce non-null (primary vs unique).
The primary key is a logical concept in relational database theory - it's a key (and typically also an index) which is designed to uniquely identify any of your rows. Therefore it must be unique and it cannot be NULL.
The clustering key is a storage-physical concept of SQL Server specifically. It's a special index that isn't just used for lookups etc., but also defines the physical structure of your data in your table. In a printed phonebook in Western European culture (except maybe for Iceland), the clustered index would be "LastName, FirstName".
Since the clustering index defines your physical data layout, you can only ever have one of those (or none - not recommended, though).
Requirements for a clustering key are:
must be unique (if not, SQL Server will add a 4-byte "uniqueifier")
should be stable (never changing)
should be as small as possible (INT is best)
should be ever-increasing (think: IDENTITY)
SQL Server makes your primary key the clustering key by default - but you can change that if you need to. Also, mind you: the columns that make up the clustering key will be added to each and every entry of each and every non-clustered index on your table - so you want to keep your clustering key as small as possible. This is because the clustering key will be used to do the "bookmark lookup" - if you found an entry in a non-clustered index (e.g. a person by their social security number) and now you need to grab the entire row of data to get more details, you need to do a lookup, and for this, the clustering key is used.
There's a great debate about what makes a good or useful clustering and/or primary key - here's a few excellent blog posts to read about this:
all of Kimberly Tripp's Indexing blog posts are a must-read
GUIDs as primary key and/or clustering key
The Clustered index debate continues....
Marc
You have several questions. I'll break some of them out:
When creating Indexes why should I care about Clustered vs Non Clustered?
Sometimes you do care how the rows are organized. It depends on your data and how you will use it. For example, if your primary key is a uniqueidentifier, you may not want it to be CLUSTERED, because GUID values are essentially random. This will cause SQL to insert rows randomly throughout the table, causing page splits which hurt performance. If your primary key value will always increment sequentially (int IDENTITY for example), then you probably want it to be CLUSTERED, so your table will always grow at the end.
A primary key is CLUSTERED by default, and most of the time you don't have to worry about it.
I was told that inserting and deleting are slow with Non-Clustered indexes as the tree needs to be "rebuilt." I take it Clustered indexes do not affect performance this way?
Actually, the opposite can be true. NONCLUSTERED indexes are kept as a separate data structure, but the structure is designed to allow some modification without needing to be "re-built". When the index is initially created, you can specify the FILLFACTOR, which specifies how much free space to leave on each page of the index. This allows the index to tolerate some modification before a page split is necessary. Even when a page split must occur, it only affects the neighboring pages, not the entire index.
The same behavior applies to CLUSTERED indexes, but since CLUSTERED indexes store the actual table data, page splitting operations on the index can be much more expensive because the whole row may need to be moved (versus just the key columns and the row locator in a NONCLUSTERED index).
The following MSDN page talks about FILLFACTOR and page splits:
http://msdn.microsoft.com/en-us/library/aa933139(SQL.80).aspx
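As an illustration of the FILLFACTOR option discussed above, here is a sketch against a hypothetical dbo.Orders table. The value 80 is an arbitrary example, not a recommendation.

```sql
-- Hypothetical table used only to have something to index.
create table dbo.Orders (
    OrderId int not null identity
        constraint PK_Orders primary key clustered,
    CustomerId int not null,
    OrderDate datetime not null
        constraint DF_Orders_OrderDate default getdate()
);
go

-- Leave roughly 20% free space on each leaf page when the index is built.
create nonclustered index IX_Orders_CustomerId
    on dbo.Orders (CustomerId)
    with (fillfactor = 80);
go

-- Fragmentation can be checked later...
select index_id, avg_fragmentation_in_percent, page_count
from sys.dm_db_index_physical_stats(
    db_id(), object_id(N'dbo.Orders'), null, null, 'LIMITED');
go

-- ...and a rebuild reapplies the fill factor.
alter index IX_Orders_CustomerId
    on dbo.Orders
    rebuild with (fillfactor = 80);
go
```

The fill factor is only applied when the index is created or rebuilt; pages fill up again as rows are inserted afterwards.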
What is special about a Primary Key vs a Clustered Unique Index?
How are constraints different to Indexes?
For both of these I think it's more about declaring your intentions. When you call something a PRIMARY KEY you are declaring that it is the primary method for identifying a given row. Is a PRIMARY KEY physically different from a CLUSTERED UNIQUE INDEX? I'm not sure. The behavior is essentially the same, but your intentions may not be clear to someone working with your database.
Regarding constraints, there are many types of constraints. For a UNIQUE CONSTRAINT, there isn't really a difference between that and a UNIQUE INDEX, other than declaring your intention. There are other types of constraints that do not map directly to a type of index, such as CHECK constraints, DEFAULT constraints, and FOREIGN KEY constraints.
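A quick way to see which constraints are backed by an index is to compare the catalog views. This is only a sketch, and dbo.Customers and its constraint names are hypothetical.

```sql
-- Hypothetical table with three kinds of constraints.
create table dbo.Customers (
    CustomerId int not null identity
        constraint PK_Customers primary key clustered,
    Email nvarchar(200) not null
        constraint UQ_Customers_Email unique,
    Age int null
        constraint CK_Customers_Age check (Age >= 0)
);
go

-- Indexes on the table: the PRIMARY KEY and UNIQUE constraints each have one.
select name, type_desc, is_primary_key, is_unique_constraint
from sys.indexes
where object_id = object_id(N'dbo.Customers');

-- Constraint objects on the table: the CHECK constraint appears here
-- but has no matching row in sys.indexes.
select name, type_desc
from sys.objects
where parent_object_id = object_id(N'dbo.Customers');
go
```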
I don't have time to answer this in depth, so here is some info off the top of my head:
You're right about clustered indexes. They rearrange the physical data according to the sort order of the clustered index. You can use clustered indexes specifically for range-bound queries (e.g. between dates).
PKs are by default clustered, but they don't have to be. That's just a default setting. The PK is supposed to be a UID for the row.
Constraints can be implemented as indexes (for example, unique constraints), but can also be implemented as default values.
