I would like to create a covering index to improve query performance.
I know that this additional index will affect INSERT performance.
The table has only INSERT operations, no UPDATE or DELETE. The data in the covering index will be unique because the index keys contain the table's PK, so I need no further uniqueness constraint; my only goal is to improve query performance.
Question
Which type of index will degrade INSERT performance less: unique or non-unique?
Which type of index will degrade INSERT performance less: unique or
non-unique?
The difference, if any, is negligible.
On the other hand, what could have some impact is the index key length. Smaller is better. You could remove the primary key from the index key and make the index non-unique. It will remain a covering index because all nonclustered indexes carry the clustered index key in their leaf pages, assuming your primary key is clustered.
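As a minimal sketch of that suggestion, assume a hypothetical insert-only table dbo.EventLog whose clustered primary key is EventId (the table, column, and index names are illustrative and not from the question). The index below does not list EventId, yet a query that returns it is still covered:

```sql
-- Hypothetical insert-only table; an inline PRIMARY KEY is clustered by default.
CREATE TABLE dbo.EventLog (
    EventId    bigint        NOT NULL IDENTITY PRIMARY KEY,
    DeviceId   int           NOT NULL,
    RecordedAt datetime2(3)  NOT NULL,
    Reading    decimal(9, 2) NOT NULL
);

-- Non-unique index keyed only on the search columns.
-- EventId is not listed, but it is stored in the index as the row locator.
CREATE NONCLUSTERED INDEX IX_EventLog_DeviceId_RecordedAt
    ON dbo.EventLog (DeviceId, RecordedAt)
    INCLUDE (Reading);

-- Every column referenced here is available in the index,
-- so no lookup into the clustered index is needed.
SELECT EventId, RecordedAt, Reading
FROM dbo.EventLog
WHERE DeviceId = 42
  AND RecordedAt >= '2024-01-01';
```

Comparing the execution plan of the SELECT with and without the index is the simplest way to confirm that no key lookup appears.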
The data in the covering index will be unique because the index keys contain the table's PK
...
Which type of index will be optimal, unique or not unique?
Assuming your table has a clustered index:
A non-unique nonclustered index in SQL Server is physically stored as a unique index after the clustered index keys are added as index key columns.
So for a nonclustered index that includes the clustered index keys there is absolutely no difference whether you declare the index to be unique or not.
This depends on your definition of "degrade". If you're indexing the same column in either case and can use the UNIQUE keyword because the column will always contain unique values, then specifying it can make future queries that use the index slightly more efficient.
You can read about the performance benefits of declaring an index UNIQUE in Brent Ozar's "Performance Benefits of Unique Indexes".
In response to your comments, there will be no difference in INSERT performance if you denote your index as UNIQUE vs not specifying that keyword.
In our case, I prefer to use non-unique indexes. For instance, we have a table with hundreds of millions of keywords to be analyzed. The keyword is unique (clustered index and primary key). But as we cannot process them all at once, we have a priority column as well. So we create a non-unique, nonclustered index on only the Priority column, with the keyword as an included column.
This gives the following performance advantages:
No additional constraint is checked when adding.
When inserting a keyword, any page that has free space and contains entries with the same priority can be used. This means less fragmentation and less maintenance (rebuild/reorganize). Note that we also delete from this table and priority is just a tinyint, so this happens in most cases.
Related
Is searching a primary key column in a table faster than searching a non primary key column due to the default clustered index on primary keys in SQL Server?
From a previous question: "The reason we specify keys for a table is primarily to improve the data integrity and usefulness of the data. Keys guarantee the table is free from duplicate data and therefore they allow the user/consumer of the data to identify information correctly. DBMS query optimizers and storage engines are designed to take advantage of keys so having a key will also give your DBMS the best chance of executing some queries efficiently but there's no guarantee that adding a key will improve performance in every case"
That being said, the existence of an index should make searching faster in most cases, but that is unrelated to the key per se.
Searching according to an index is generally faster than searching without an index (unless your table is extremely small, or the index is in a terrible condition). The fact that this index also supports a primary key is inconsequential to this discussion.
In general a relational database will be optimised to use primary keys efficiently, however your database structure, your query and which database engine on what platform will all be major factors in the performance.
Additionally, if the non-primary key column is not indexed, searching it requires a full scan, which on a large table can be orders of magnitude slower.
Is searching a primary key column in a table faster than searching a
non primary key column due to the default clustered index on primary
keys in SQL Server?
Since the optimizer doesn't care whether an index supports a constraint or not, I'll paraphrase the question as:
Is searching using the clustered index key column(s) faster than searching via a
non-clustered index key column(s)?
A clustered index seek on a disk-based table is the fastest method to return the entire row 1) for singleton lookups and 2) for key range searches. This minimizes the number of logical reads needed to locate and retrieve rows.
SQL Server uses clustered as the default for primary key indexes to leverage this performance benefit and encourage the practice that all tables should have a clustered index. Also, since the clustered index key is implicitly included in all non-clustered indexes as the row locator, the likelihood that non-clustered indexes cover queries is increased.
This is not to say that the primary key should always be the clustered index. There may be a better choice when queries most often use other columns in join/where clause predicates.
Suppose we had to define optimal indexing for Stack Overflow questions. But let's not take the schema of the actual Posts table; let's include just the columns that are actually relevant:
create table Posts (
    Id int not null
        identity,
    PostTypeId tinyint not null,
    LastActivityDate datetime not null
        default getdate(),
    Title nvarchar(500) null, -- answers don't have titles
    Body nvarchar(max) not null,
    ...
)
I've made Id an identity column even though Data Stackexchange shows that none of the tables have a primary key constraint or identity columns. There are just many unique/non-unique clustered/non-clustered indices.
Usage scenarios
So basically there are three main scenarios for posts:
1. They're chronologically displayed in descending order by their LastActivityDate column (or maybe LastEditDate, which I haven't included above as it's not so important)
2. They're individually displayed on question details
3. Answers are displayed on the question details page in votes order (the ScoreCount column is not part of my code above)
Indexing optimization
Which indices would be best for the above scenarios, especially if we say that #1 is the most common scenario, so it has to work really fast?
I'd say that one of the better possibilities would be to create these indices:
-- index 1
alter table Posts
add primary key nonclustered (Id);
-- index 2
create clustered index IX_Posts_LastActivityDate
on Posts(LastActivityDate desc);
-- index 3
create index IX_Posts_ParentId
on Posts(ParentId, PostTypeId)
include (ScoreCount);
This way we basically end up with three indices of which the second one is clustered.
So in order for #1 to work really fast I've set clustered index on LastActivityDate column, because clustered indices are especially great when we do range comparison on them. And we would be ordering questions chronologically newest to oldest hence I've set ordering direction and also included type on the clustered index.
So what did we solve with this?
scenario #1 is very efficiently covered by index 2 as it's clustered and fully covered; we can also easily and efficiently do result paging;
scenario #2 is somewhat covered with unique index 1 (to get the question) and non-unique index 3 to get all related answers (scenario #3) ordered by ScoreCount; and if we decide to chronologically order answers that's also covered with index 2;
Question 1
SQL internals are such that SQL implicitly adds clustered key to nonclustering index so it can locate records in the row store.
if clustering index is unique, then that's the key that will be added to nonclustering indices, and
if clustering index is non-unique, SQL supposedly generates its own UniqueId and uses that
Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL will still supply its own unique key on the non-unique clustered index, or whether it will use the nonclustered primary key to uniquely identify each record instead?
Question 2
So if the primary key isn't used to locate records in the row store (clustered index), does it even make sense to create a PK? Would it be better in this case to do this instead?
create unique index UX_Posts_Id
on Posts(Id);
-- include (Title, Body, ScoreCount);
It would be great to also include commented out columns, but then that would make this index inefficient as it will be worse in caching... Why I'm asking whether it would be better to create this index instead of a primary key constraint is because we can include additional non-key columns to this index while we can't do the same when we add a PK constraint that internally generates a unique index...
Question 3
I'm aware that LastActivityDate changes, which isn't desirable for clustered indices, but we have to consider that this column is likely to change only for some time before it becomes more or less static, so it shouldn't cause too much index fragmentation: records will mostly be appended to the end whenever LastActivityDate changes. Fragmentation on arbitrary older pages should not happen, because no record would ever be inserted into an older page, as LastActivityDate only increases. Hence most modifications will happen on the last page.
So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key:
it's not unique - although one could argue about this, especially if we'd change datetime to datetime2, use the higher precision function sysdatetime() and set the index as unique
it's narrow - pretty much
it's not static - but I've explained how it changes
it's ever increasing
Since I've also added a nonclustered primary key on the table (which
must by design be unique), I would like to know whether SQL will still
supply its own unique key on clustered non-unique index or will it use
nonclustered primary key to uniquely identify each records instead?
SQL Server adds a 4-byte "uniqueifier" when a given non-unique clustered index key value isn't unique. All non-clustered index leaf nodes, including the primary key, will include LastActivityDate plus the uniqueifier (when present) as the row locator. The internal uniqueifier would be needed here only for posts with the same LastActivityDate so I'd expect relatively few rows would actually need a uniqueifier.
So if primary key isn't used to locate records on row store (clustered
index) does it even make sense to actually create a PK? Would in this
case be better to rather do this?
From a data modeling perspective, every relational table should have a primary key. The implicitly created index can be declared as either clustered or non-clustered as needed to optimize performance. If LastActivityDate is a better choice for performance, then the primary key index must be non-clustered. This primary key index will provide the needed index to retrieve singleton posts.
Unfortunately, SQL Server doesn't provide a way to specify included columns on primary key and unique constraint definitions. This is a case where one can bend the rules and use a unique index instead of a declared primary key constraint in order to avoid the cost of redundant indexes and gain the benefits of included columns. The unique index is functionally identical to a primary key and can be referenced by foreign key constraints.
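A minimal sketch of that approach, assuming ScoreCount is one of the columns elided from the Posts definition in the question. The PostComments table and its constraint name are hypothetical and serve only to show a foreign key referencing the unique index:

```sql
-- Unique index on Id used in place of a primary key constraint,
-- carrying extra non-key columns at the leaf level.
CREATE UNIQUE NONCLUSTERED INDEX UX_Posts_Id
    ON Posts (Id)
    INCLUDE (Title, ScoreCount);

-- Hypothetical child table: the foreign key references the column
-- backed by the unique index rather than a PRIMARY KEY constraint.
CREATE TABLE PostComments (
    CommentId int NOT NULL IDENTITY,
    PostId    int NOT NULL
        CONSTRAINT FK_PostComments_Posts REFERENCES Posts (Id),
    Body      nvarchar(600) NOT NULL
);
```

Whether to include wide columns such as Body is a trade-off: each included column makes the index larger and less cache-friendly, as the question itself notes.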
So the question is whether these changes can be harmful as
LastActivityDate isn't the best candidate for clustering index key
LastActivityDate alone can never be guaranteed to be unique regardless of the level of precision (barring single-threaded inserts or retry logic). One approach could be a composite primary key on LastActivityDate and Id. Individual posts would need to be retrieved using both values. That would eliminate the need for the separate unique index on Id previously discussed.
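The composite-key idea could look roughly like the following sketch. It assumes the Posts table from the question has no primary key or clustered index yet; the constraint name PK_Posts and the variable values are illustrative:

```sql
-- Composite clustered primary key: Id breaks ties between posts that
-- share a LastActivityDate, so the key is unique without a uniqueifier.
ALTER TABLE Posts
    ADD CONSTRAINT PK_Posts
    PRIMARY KEY CLUSTERED (LastActivityDate DESC, Id);

-- A singleton lookup has to supply both values to seek on this key.
DECLARE @Id int = 12345;
DECLARE @LastActivityDate datetime = '2024-01-01T12:00:00';

SELECT Id, Title, Body
FROM Posts
WHERE LastActivityDate = @LastActivityDate
  AND Id = @Id;
```

A lookup that knows only Id cannot seek on this key, because Id is not the leading column.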
My biggest concern about LastActivityDate as the leftmost clustered index key column is that it may change often for recent posts. This would require a lot of row movement to maintain the logical key order, may impact concurrency significantly compared to the current static Id key, and require updates to the non-clustered index row locator values upon each change. So even though this clustered index key may be optimal for many queries, the other costs on a highly transactional system may outweigh the benefits.
I have a database where all tables include a Site column (char(4)) and a PrimaryId column (int).
Currently the clustered index on all tables is the combination of these two columns. Many customers only have one site so in those cases I think it definitely makes sense to change the clustered index to only include the PrimaryId.
In cases where there are multiple sites though, I'm wondering whether it would still be advantageous to only use the PrimaryId as the clustered index? Might having a smaller clustered index produce better performance than having a unique one?
In case it's relevant, there are generally not going to be more than a few sites. 10 sites would be a lot.
The answer is simple: a UNIQUE index is always better than a NON-UNIQUE one. There is some maths behind it, but the greater the uniqueness, the faster the server can look up a record from the index.
A CLUSTERED index is great because it physically orders the records on disk, and it is always a good idea to use a CLUSTERED INDEX on UNIQUE keys.
A CLUSTERED INDEX with a PRIMARY KEY gives very good performance with large data. If your data volume is not high, it will not matter much.
I recently read an article about how nonclustered indexes match table rows. I will try to summarize what I believe is relevant to your question.
There are two types of tables (in the context of indexes):
heap - a table without clustered index
clustered index - a table with clustered index
In the first case, a nonclustered index matches rows using RID-based bookmarks, which have the following format:
file number - page number - row number
and a nonclustered index looks like this (figure not reproduced; in it, the RID bookmark is highlighted in red).
Generally speaking, the rows of a heap do not move; once they have
been inserted into a page they remain on that page. To be more
technically-precise: rows in a heap seldom move, and when they do
move, they leave a forwarding address at the old location. The rows of
a clustered index, however, can move; that is, they can be relocated
to another page during data modification or index reorganization.
In the second case, the nonclustered index uses the index key of the clustered index as a bookmark, and the clustered index key itself should meet several criteria:
it must be unique
it should be short
it should be static
I am going to describe the first criterion (the others are described in the link below):
Each index entry bookmark must allow SQL Server to find the one row in
the table that corresponds to that entry. If you create a clustered
index that is not unique, SQL Server will make the clustered index
unique by generating an additional value that "breaks the tie" for
duplicate keys. This extra value, generated by SQL Server to create
uniqueness, is called the uniquifier and is transparent to any client
application. You should carefully consider whether or not to allow
duplicates in a clustered index, for the following reasons:
Generating uniquifiers is extra overhead. SQL Server must decide, at
insert time, if a new row's key is a duplicate of an existing row's
key; and, if so, generate a uniquifier value to add to the new row
The uniquifier is a meaningless piece of information; a meaningless
piece of information that is being propagated into the table's
nonclustered indexes. It's usually better to propagate a meaningful
piece of information into the nonclustered indexes.
The whole article can be found here.
The documentation for SQL Server 2008 R2 states:
Wide keys are a composite of several columns or several large-size columns. The key values from the clustered index are used by all nonclustered indexes as lookup keys. Any nonclustered indexes defined on the same table will be significantly larger because the nonclustered index entries contain the clustering key and also the key columns defined for that nonclustered index.
Does this mean that when there is a search using a non-clustered index, the clustered index is searched as well? I originally thought that the non-clustered index directly contains the address of the page (block) with the row it references. From the text above it seems that it contains just the key from the clustered index instead of the address.
Could somebody explain please?
Yes, that's exactly what happens:
SQL Server searches for your search value in the non-clustered index
if a match is found, in that index entry, there's also the clustering key (the column or columns that make up the clustered index)
with that clustering key, a key lookup (often also called a bookmark lookup) is now performed - the clustered index is searched for that given value
when the item is found, the entire data record at the leaf level of the clustered index navigation structure is present and can be returned
SQL Server does this because using a physical address would be really, really bad:
if a page split occurs, all the entries that are moved to a new page would have to be updated
for all those entries, all nonclustered indices would also have to be updated
and this is really, really bad for performance.
This is one of the reasons why it is beneficial to use limited column lists in SELECT (instead of always SELECT *) and possibly even include a few extra columns in the nonclustered index (to make it a covering index). That way, you can avoid unnecessary and expensive bookmark lookups.
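The difference between a plain and a covering nonclustered index can be sketched as follows. The dbo.Orders table and all index names are hypothetical and only illustrate the idea:

```sql
-- Hypothetical table, clustered on OrderId.
CREATE TABLE dbo.Orders (
    OrderId    int           NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    CustomerId int           NOT NULL,
    OrderDate  date          NOT NULL,
    Total      money         NOT NULL,
    Notes      nvarchar(max) NULL
);

-- Plain index: finds rows by CustomerId, but OrderDate and Total still
-- require a key lookup into the clustered index for each match.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId);

-- Covering index: carries the extra columns at the leaf level.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, Total);

-- Covered: every referenced column is in the second index
-- (OrderId is there as the clustering key).
SELECT OrderId, OrderDate, Total
FROM dbo.Orders
WHERE CustomerId = 7;

-- Not covered: SELECT * also needs Notes, so the row itself must be read.
SELECT *
FROM dbo.Orders
WHERE CustomerId = 7;
```

In practice only one of the two indexes would be kept; both appear here only for comparison.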
And because the clustering key is included in each and every nonclustered index, it's highly important that this be a small and narrow key - optimally an INT IDENTITY or something like that - and not a huge structure; the clustering key is the most replicated data structure in SQL Server and should be as small as possible.
The fact that these bookmark lookups are relatively expensive is also one of the reasons why the query optimizer might opt for an index scan as soon as you select a larger number of rows - at that point, just scanning the clustered index might be cheaper than doing a lot of key lookups.
Recently I found a couple of tables in a Database with no Clustered Indexes defined.
But there are non-clustered indexes defined, so they are on HEAP.
On analysis I found that select statements were using filter on the columns defined in non-clustered indexes.
Does not having a clustered index on these tables affect performance?
It's hard to state this more succinctly than SQL Server MVP Brad McGehee:
As a rule of thumb, every table should have a clustered index. Generally, but not always, the clustered index should be on a column that monotonically increases–such as an identity column, or some other column where the value is increasing–and is unique. In many cases, the primary key is the ideal column for a clustered index.
BOL echoes this sentiment:
With few exceptions, every table should have a clustered index.
The reasons for doing this are many and are primarily based upon the fact that a clustered index physically orders your data in storage.
If your clustered index is on a single column that monotonically increases, inserts occur in order on your storage device and page splits caused by inserts are largely avoided.
Clustered indexes are efficient for finding a specific row when the indexed value is unique, such as the common pattern of selecting a row based upon the primary key.
A clustered index often allows for efficient queries on columns that are often searched for ranges of values (between, >, etc.).
Clustering can speed up queries where data is commonly sorted by a specific column or columns.
A clustered index can be rebuilt or reorganized on demand to control table fragmentation.
These benefits can even be applied to views.
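For the rebuild/reorganize point in the list above, a rough sketch is shown below. It assumes a hypothetical table dbo.Orders whose clustered index is named PK_Orders; both names are illustrative:

```sql
-- Check fragmentation of the clustered index (index_id = 1).
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Orders'), 1, NULL, 'LIMITED');

-- Reorganize: defragments the leaf level in place.
ALTER INDEX PK_Orders ON dbo.Orders REORGANIZE;

-- Rebuild: drops and recreates the index structure.
ALTER INDEX PK_Orders ON dbo.Orders REBUILD;
```

Normally only one of the two ALTER INDEX statements would be run, chosen according to how fragmented the index is.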
You may not want to have a clustered index on:
Columns that have frequent data changes, as SQL Server must then physically re-order the data in storage.
Columns that are already covered by other indexes.
Wide keys, as the clustered index is also used in non-clustered index lookups.
GUID columns, which are larger than identities and also effectively random values (not likely to be sorted upon), though newsequentialid() could be used to help mitigate physical reordering during inserts.
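A minimal sketch of the newsequentialid() mitigation, using a hypothetical dbo.Documents table. The function can only be used in a DEFAULT constraint, which is why it appears there rather than in the INSERT:

```sql
-- Hypothetical table with a GUID clustered key generated sequentially.
CREATE TABLE dbo.Documents (
    DocumentId uniqueidentifier NOT NULL
        CONSTRAINT DF_Documents_DocumentId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED,
    Title nvarchar(200) NOT NULL
);

-- DocumentId is filled in by the default; new values sort after earlier
-- ones generated on the same machine, so inserts land near the end.
INSERT INTO dbo.Documents (Title)
VALUES (N'first'), (N'second');
```

The key is still 16 bytes wide, so this addresses only the random insert order, not the wide-key concern.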
A rare reason to use a heap (table without a clustered index) is if the data is always accessed through nonclustered indexes and the RID (SQL Server internal row identifier) is known to be smaller than a clustered index key.
Because of these and other considerations, such as your particular application workloads, you should carefully select your clustered indexes to get maximum benefit for your queries.
Also note that when you create a primary key on a table in SQL Server, it will by default create a unique clustered index (if it doesn't already have one). This means that if you find a table that doesn't have a clustered index, but does have a primary key (as all tables should), a developer had previously made the decision to create it that way. You may want to have a compelling reason to change that (of which there are many, as we've seen). Adding, changing or dropping the clustered index requires rewriting the entire table and any non-clustered indexes, so this can take some time on a large table.
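To see which tables in the current database are stored as heaps, a query along these lines against the catalog views can serve as a starting point:

```sql
-- List user tables that have no clustered index.
-- In sys.indexes, type = 0 identifies a heap.
SELECT s.name AS schema_name,
       t.name AS table_name
FROM sys.tables  AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.indexes AS i ON i.object_id = t.object_id
WHERE i.type = 0
ORDER BY s.name, t.name;
```

Each table returned can then be reviewed individually, as the answer suggests, before deciding whether a clustered index is worth the rewrite.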
I would not say "Every table should have a clustered index", I would say "Look carefully at every table and how they are accessed and try to define a clustered index on it if it makes sense". It's a plus, like a Joker, you have only one Joker per table, but you don't have to use it. Other database systems don't have this, at least in this form, BTW.
Putting clustered indices everywhere without understanding what you're doing can also kill your performance (in general, the INSERT performance because a clustered index means physical re-ordering on the disk, or at least it's a good way to understand it), for example with GUID primary keys as we see more and more.
So, read Tim Lehner's exceptions and reason.
Performance is a big hairy problem. Make sure you are optimizing for the right thing.
Free advice is always worth its price, and there is no substitute for actual experimentation.
The purpose of an index is to find matching rows and help retrieve the data when found.
A non-clustered index on your search criteria will help to find rows, but there needs to be additional operation to get at the row's data.
If there is no clustered index, SQL Server uses an internal row ID to point to the location of the data.
However, if there is a clustered index on the table, that row ID is replaced by the data values of the clustered index key.
So when a query needs only the indexed columns and the clustered index key, the step of reading the row's data is not needed, because the values in the index cover it.
Even if a clustered index isn't very good at being selective, if those keys are frequently most or all of the results requested - it may be helpful to have them as the leaf of the non-clustered index.
Yes, you should have a clustered index on a table, so that all nonclustered indexes perform better.
Consider using a clustered index on columns that contain a large number of distinct values, to avoid the need for SQL Server to add a "uniqueifier" to duplicate key values.
Disadvantage: It takes longer to update records, but only when the fields in the clustering index are changed.
Avoid clustering index constructions where there is a risk that many concurrent inserts will happen on almost the same clustering index value
Searches against a nonclustered index will appear slower if the clustered index isn't built correctly, or if the nonclustered index does not include all the columns needed to return the data to the calling application. When the nonclustered index doesn't contain all the needed data, SQL Server goes to the clustered index to get the missing data (via a lookup), which makes the query run slower because the lookup is done row by row.
Yes, every table should have a clustered index. The clustered index sets the physical order of data in a table. You can compare this to the ordering of music at a store by band name, or the Yellow Pages ordered by last name. Since this deals with the physical order, you can have only one per table, although it can comprise many columns.
It's best to place the clustered index on columns often searched for a range of values. An example would be a date range. Clustered indexes are also efficient for finding a specific row when the indexed value is unique. SQL Server will place a clustered index on a PRIMARY KEY constraint automatically if no clustered index is defined.
Clustered indexes are not a good choice for:
Columns that undergo frequent changes
This results in the entire row moving (because SQL Server must keep
the data values of a row in physical order). This is an important
consideration in high-volume transaction processing systems where
data tends to be volatile.
Wide keys
The key values from the clustered index are used by all
nonclustered indexes as lookup keys and therefore are stored in each
nonclustered index leaf entry.