I have the following table that serves to join 3 tables:
ClientID int
BlogID int
MentionID int
Assuming that queries will always come via ClientID, I can create 1 multi-column index (ClientID, BlogID, MentionID).
The question is, should I create it as a clustered index or a unique key? I understand a clustered index stores the data on its leaf nodes. Of course, in this case, the index is the data, so I don't know if SQL Server will duplicate the data or not. Be that as it may, I can't find anything on MSDN about the significance of using "unique key".
How does this differ from Type = Index & IsUnique = yes?
Can someone tell me the advantages each way?
A clustered index is "the table itself": the index nodes are arranged in a tree, and its leaf nodes contain the row data. A clustered index doesn't have to be declared unique (though it usually is); if it is not unique, the server implicitly adds a "uniqueifier" to the index, so that each row is uniquely identified.
Other indexes store the clustered index value in their leaf nodes (and possibly some other columns, if they are added with the INCLUDE clause of the CREATE INDEX statement).
Any index may be declared unique, in which case the server performs an additional check to prevent duplicate values from getting into the table.
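As a rough illustration of the points above, here is a minimal T-SQL sketch. The table and index names are hypothetical, and the column layout is only assumed from the question.

```sql
-- Hypothetical names, for illustration only.
CREATE TABLE dbo.MentionLink (
    ClientID  int NOT NULL,
    BlogID    int NOT NULL,
    MentionID int NOT NULL
);

-- Non-unique clustered index: the leaf level holds the rows themselves.
-- Rows that share a ClientID get a hidden uniqueifier added by the server.
CREATE CLUSTERED INDEX CIX_MentionLink_ClientID
    ON dbo.MentionLink (ClientID);

-- Unique nonclustered index: inserts that would duplicate the triad fail.
-- Its leaf rows carry the clustered key as the row locator.
CREATE UNIQUE NONCLUSTERED INDEX UX_MentionLink_Blog_Mention_Client
    ON dbo.MentionLink (BlogID, MentionID, ClientID);
```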
It seems you are asking for the difference among:
MYTABLE
id integer primary key autoincrement
clientid integer
blogid integer
mentionid integer
-- with a unique composite index on (clientid, blogid, mentionid) and three foreign key constraints
and
MYTABLE
clientid
blogid
mentionid
-- with a composite primary key on (clientid, blogid, mentionid) and three foreign key constraints
and
MYTABLE
id integer primary key autoincrement
clientid integer
blogid integer
mentionid integer
-- with an index on clientid, an index on blogid, and three foreign key constraints
In the first, you have the index on the integer primary key and also the alternative unique index on the triad. In the second, you have only the unique index on the triadic primary key. In the third, you have a unique index on the integer primary key and two other non-unique indexes, one on clientid and the other on blogid.
The performance gain with the second option's marginally greater efficiency would be de minimis, and so I'd base the decision on other factors. The third is the most flexible in terms of queries and offers greater simplicity of coding; it offers the benefit of indexes on client and blog both, in case you wanted to have a query with blog, not client, in the WHERE clause. As for coding, some GUI tools and middleware have trouble with multi-part primary keys, and your update/insert/delete logic will be simpler when it has to deal with a single integer PK column. I have found that code simplicity and ease of maintenance are far better things than a few seconds or only a few fractions of seconds of improvement in query response time.
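For concreteness, the third option might look roughly like this in T-SQL. This is a sketch only: the names are hypothetical, and the three foreign key constraints are left out because the referenced tables are not shown in the question.

```sql
-- Hypothetical sketch of the third layout: surrogate key plus two indexes.
CREATE TABLE dbo.MyTable (
    id        int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_MyTable PRIMARY KEY,  -- single-column integer PK
    clientid  int NOT NULL,
    blogid    int NOT NULL,
    mentionid int NOT NULL
);

-- Non-unique indexes to support lookups by client or by blog.
CREATE INDEX IX_MyTable_clientid ON dbo.MyTable (clientid);
CREATE INDEX IX_MyTable_blogid   ON dbo.MyTable (blogid);
```

Note that nothing in this layout prevents duplicate (clientid, blogid, mentionid) rows; a unique index on the triad, as in the first option, would be needed for that.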
A unique index, a unique key and a unique constraint are basically the same thing. They result in an index that enforces uniqueness.
Clustered means that the index becomes the table itself. It's good to have a clustered index, otherwise the table hangs around in an unordered heap.
Unique and clustered are unrelated properties. You can combine them in any way you like. So in your case, I'd create a unique clustered index. The normal way to do that is by creating the index as a clustered primary key.
The data will not be duplicated if you create a clustered unique index on your three columns.
The unique clustered index will be the data - and the index at the same time :-)
Since this is a three-way join table, this clustered index probably does make a lot of sense. I'd say: go for it!
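A minimal sketch of that suggestion, assuming the three columns from the question (the table and constraint names are made up):

```sql
-- Hypothetical names; the clustered primary key is the table, so the
-- three columns are stored once, ordered by (ClientID, BlogID, MentionID).
CREATE TABLE dbo.ClientBlogMention (
    ClientID  int NOT NULL,
    BlogID    int NOT NULL,
    MentionID int NOT NULL,
    CONSTRAINT PK_ClientBlogMention
        PRIMARY KEY CLUSTERED (ClientID, BlogID, MentionID)
);

-- A lookup by ClientID can seek on the leading key column.
SELECT BlogID, MentionID
FROM dbo.ClientBlogMention
WHERE ClientID = 42;
```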
UNIQUE INDEX and UNIQUE CONSTRAINT are somewhat different concepts.
UNIQUE CONSTRAINT is a logical concept and means "make sure this column is unique, no matter how"
UNIQUE INDEX is a physical concept and means "create a B-Tree index on this column and fail whenever duplicates are inserted there"
The latter implies the former but not vice versa.
For instance, in Oracle, if you have a non-unique index on col1:
CREATE UNIQUE INDEX (col1) will fail and say "these columns are already indexed"
ALTER TABLE ADD CONSTRAINT UNIQUE(col1) will succeed and use the existing index to police the constraint.
Use CONSTRAINT if you just want the column to be unique and INDEX if you know a B-Tree index is what you want (to speed up searches etc).
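The Oracle behaviour above is as described by the answerer. On SQL Server, which the question is about, the two forms can be sketched as follows (hypothetical table and names); there, a unique constraint is enforced by a unique index that the server creates behind the scenes.

```sql
-- Hypothetical demo table.
CREATE TABLE dbo.Demo (
    col1 int NOT NULL,
    col2 int NOT NULL
);

-- Logical form: declare that col1 must be unique.
ALTER TABLE dbo.Demo
    ADD CONSTRAINT UQ_Demo_col1 UNIQUE (col1);

-- Physical form: explicitly create a unique index on col2.
CREATE UNIQUE INDEX UX_Demo_col2 ON dbo.Demo (col2);
```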
I am doing a review of some DB tables that were created in our project and came across this. The table contains an identity column (ID), which is the primary key for the table, and a clustered index has been defined on this ID column. But when I look at the SPROC that retrieves records from this table, I see that the ID column is never used in the query; the records are queried by a USERID column, which is not unique, so there can be multiple records for the same USERID.
So my question is: is there any advantage or purpose in creating a clustered index on a column when we know that the records won't be queried by that column?
If the IDENTITY column is never used in WHERE and JOIN clauses, or referenced by foreign keys, perhaps USERID should lead the clustered index instead. I would question the need for the ID column at all in that case.
The best choice for the clustered index depends much on how the table is queried. If the majority of queries are by USERID, then USERID should probably be the clustered index key (a non-unique clustered index here, since USERID repeats, or a unique one if it is combined with another column) and the ID column, if kept, non-clustered.
Keep in mind that the clustered index key is implicitly included in all non-clustered indexes as the row locator. The implication is that non-clustered indexes are more likely to cover queries, but their leaf node pages are wider as a result.
I would say your table is mis-designed. Someone apparently thought every table needs a primary key and the primary key is the clustered index. Adding a system-generated unique number as an identifier just adds noise if that number isn't used anywhere. Noise in the clustered index is unhelpful, to say the least.
They are different concepts, by the way. A primary key is a data modeling concern, a logical concept. An index is a physical design issue. A SQL DBMS must support primary keys, but need not have any indexes, clustered or no.
If USERID is what is usually used to search the table, it should be in your clustered index. The clustered index need not be unique and need not be the primary key. I would look at the data carefully to see if some combination of USERID and another column (or two, or more) form a unique identifier for the row. If so, I'd make that the primary key (and clustered index), with USERID as the first column. If query analysis showed that many queries use only USERID and nothing else (for existence testing) I might create a separate index just of USERID.
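As a sketch of that idea, assume (purely for illustration) that USERID together with a hypothetical EventTime column identifies a row. The table and column names other than USERID are invented.

```sql
-- Hypothetical table: (USERID, EventTime) is assumed to be unique.
CREATE TABLE dbo.UserEvent (
    USERID    int           NOT NULL,
    EventTime datetime      NOT NULL,
    Detail    nvarchar(200) NULL,
    -- USERID leads the key, so rows for one user are stored together.
    CONSTRAINT PK_UserEvent PRIMARY KEY CLUSTERED (USERID, EventTime)
);

-- The common query by USERID becomes a range seek on the clustered index.
SELECT EventTime, Detail
FROM dbo.UserEvent
WHERE USERID = 1001;
```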
If no combination of columns constitutes a unique identifier, you have a logical problem, to wit: what does the row mean? What aspect of the real world does it represent?
A basic tenet of the Relational Model is that elements in a relation (rows in a table) are unique, that each one identifies something. If two rows are identical, they identify the same thing. What does it mean to delete one of them? Is the thing that they both identify still there, or not? If it is, what purpose did the 2nd row serve?
I hope that gives you another way to think about clustered indexes and keys. I wouldn't be surprised if you find other tables that could be improved, too.
Suppose we had to define optimal indexing for Stack Overflow questions. Rather than take the schema of the actual Posts table, let's include only those columns that are actually relevant:
create table Posts (
Id int not null
identity,
PostTypeId tinyint not null,
LastActivityDate datetime not null
default getdate(),
Title nvarchar(500) null, -- answers don't have titles
Body nvarchar(max) not null,
...
)
I've made Id an identity column even though Data Stackexchange shows that none of the tables have a primary key constraint on them, nor identity columns. They just have many unique/non-unique, clustered/non-clustered indices.
Usage scenarios
So basically there are three main scenarios for posts:
1. They're chronologically displayed in descending order by their LastActivityDate column (or maybe LastEditDate, which I haven't included above as it's not so important)
2. They're individually displayed on question details
3. Answers are displayed on the question details page in votes order (the ScoreCount column is not part of my code above)
Indexing optimization
Which indices would best serve the above scenarios, especially if we say that #1 is the most common scenario, so it has to work really fast?
I'd say that one of the better possibilities would be to create these indices:
-- index 1
alter table Posts
add primary key nonclustered (Id);
-- index 2
create clustered index IX_Posts_LastActivityDate
on Posts(LastActivityDate desc);
-- index 3
create index IX_Posts_ParentId
on Posts(ParentId, PostTypeId)
include (ScoreCount);
This way we basically end up with three indices of which the second one is clustered.
So in order for #1 to work really fast I've set clustered index on LastActivityDate column, because clustered indices are especially great when we do range comparison on them. And we would be ordering questions chronologically newest to oldest hence I've set ordering direction and also included type on the clustered index.
So what did we solve with this?
scenario #1 is very efficiently covered by index 2 as it's clustered and fully covered; we can also easily and efficiently do result paging;
scenario #2 is somewhat covered with unique index 1 (to get the question) and non-unique index 3 to get all related answers (scenario #3) ordered by ScoreCount; and if we decide to chronologically order answers that's also covered with index 2;
Question 1
SQL Server internals are such that it implicitly adds the clustered key to each nonclustered index so it can locate records in the row store.
if the clustered index is unique, then that's the key that will be added to nonclustered indices, and
if the clustered index is non-unique, SQL Server supposedly generates its own UniqueId and uses that
Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL Server will still supply its own unique key on the clustered non-unique index, or whether it will use the nonclustered primary key to uniquely identify each record instead?
Question 2
So if primary key isn't used to locate records on row store (clustered index) does it even make sense to actually create a PK? Would in this case be better to rather do this?
create unique index UX_Posts_Id
on Posts(Id);
-- include (Title, Body, ScoreCount);
It would be great to also include commented out columns, but then that would make this index inefficient as it will be worse in caching... Why I'm asking whether it would be better to create this index instead of a primary key constraint is because we can include additional non-key columns to this index while we can't do the same when we add a PK constraint that internally generates a unique index...
Question 3
I'm aware that LastActivityDate changes which isn't desired with clustered indices, but we have to consider the fact that this column is more likely to change for some time before it becomes more or less static, so it shouldn't cause too much index fragmentation as records will mostly be appended to the end whenever LastActivityDate changes. Index fragmentation on some arbitrary page should never happen because some new record would be inserted into some old(er) page as LastActivityDate will only increase. Hence most modifications will happen on the last page.
So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key:
it's not unique - although one could argue about this, especially if we'd change datetime to datetime2, use the higher-precision function sysdatetime() and set the index as unique
it's narrow - pretty much
it's not static - but I've explained how it changes
it's ever increasing
Since I've also added a nonclustered primary key on the table (which
must by design be unique), I would like to know whether SQL will still
supply its own unique key on clustered non-unique index or will it use
nonclustered primary key to uniquely identify each record instead?
SQL Server adds a 4-byte "uniqueifier" when a given non-unique clustered index key value isn't unique. All non-clustered index leaf nodes, including the primary key, will include LastActivityDate plus the uniqueifier (when present) as the row locator. The internal uniqueifier would be needed here only for posts with the same LastActivityDate so I'd expect relatively few rows would actually need a uniqueifier.
So if primary key isn't used to locate records on row store (clustered
index) does it even make sense to actually create a PK? Would in this
case be better to rather do this?
From a data modeling perspective, every relational table should have a primary key. The implicitly created index can be declared as either clustered or non-clustered as needed to optimize performance. If LastActivityDate is a better choice for the clustered index, then the primary key index must be non-clustered. This primary key index will provide the needed index to retrieve singleton posts.
Unfortunately, SQL Server doesn't provide a way to specify included columns on primary key and unique constraint definitions. This is a case where one can bend the rules and use a unique index instead of a declared primary key constraint, in order to avoid the cost of redundant indexes and gain the benefits of included columns. The unique index is functionally equivalent to a primary key (provided the key column is declared NOT NULL) and can be referenced by foreign key constraints.
So the question is whether these changes can be harmful as
LastActivityDate isn't the best candidate for clustering index key
LastActivityDate alone can never be guaranteed to be unique regardless of the level of precision (barring single-threaded inserts or retry logic). One approach could be a composite primary key on LastActivityDate and Id. Individual posts would need to be retrieved using both values. That would eliminate the need for the separate unique index on Id previously discussed.
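The composite-key approach could be sketched as follows. This uses a reduced, hypothetical PostsDemo table rather than the real Posts schema, and the constraint names and sample values are invented.

```sql
-- Hypothetical, reduced stand-in for the Posts table.
CREATE TABLE dbo.PostsDemo (
    Id               int IDENTITY(1,1) NOT NULL,
    LastActivityDate datetime NOT NULL
        CONSTRAINT DF_PostsDemo_LastActivityDate DEFAULT GETDATE(),
    Title            nvarchar(500) NULL,
    -- Id breaks ties between posts that share a LastActivityDate,
    -- so the clustered key is unique and no uniqueifier is needed.
    CONSTRAINT PK_PostsDemo
        PRIMARY KEY CLUSTERED (LastActivityDate DESC, Id)
);

-- A singleton lookup must supply both key values to seek efficiently.
DECLARE @LastActivityDate datetime = '2012-06-01T12:00:00';
DECLARE @Id int = 123;

SELECT Id, Title
FROM dbo.PostsDemo
WHERE LastActivityDate = @LastActivityDate
  AND Id = @Id;
```

The trade-off is that callers who only know the Id can no longer seek on the clustered index; they would need a separate index on Id after all.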
My biggest concern about LastActivityDate as the leftmost clustered index key column is that it may change often for recent posts. This would require a lot of row movement to maintain the logical key order, may impact concurrency significantly compared to the current static Id key, and require updates to the non-clustered index row locator values upon each change. So even though this clustered index key may be optimal for many queries, the other costs on a highly transactional system may outweigh the benefits.
A clustered index stores the actual data rows at the leaf level of the index. Returning to the example above, that would mean that the entire row of data associated with the primary key value of 123 would be stored in that leaf node.
Question - suppose the primary key does not exist and I set the Name column as the clustered index. In that case, does the above statement become contradictory?
No - why?
The clustered index will still store the actual data pages at its leaf level, (initially) physically sorted by the name column.
The index navigation structure above the leaf level will contain the name column values for all rows.
So overall: nothing changes.
The primary key is a logical construct, designed to uniquely identify each row in your table. That's why it has to be unique and non-null.
The clustering index is a physical construct that will (initially) physically sort your data by the clustering key and arrange the SQL Server pages accordingly.
While in SQL Server the primary key is used by default as the clustering key, the two do not have to coincide - nor does one have to exist with the other. You can have a table with a non-clustered primary key, or a clustered table without a primary key. Both are possible. Whether it's sensible to have that is another discussion - but it's technically possible.
Update: if your primary key is your clustering key, uniqueness is guaranteed (since the primary key must be unique). If you're choosing some column that is not the primary key as your clustering key, and that column does not guarantee uniqueness, SQL Server will - behind the scenes - add a 4-byte (INT) uniqueifier column to those duplicate values to make them unique. So you might have Smith, Smith1, Smith2 and so forth in your clustered index navigation structure for your Smiths.
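To make the separation concrete, here is a small sketch with a non-clustered primary key and a separate, non-unique clustering key on a Name column. The table, constraint and index names are hypothetical.

```sql
-- Hypothetical table: the primary key is declared NONCLUSTERED,
-- leaving the clustered index free for another column.
CREATE TABLE dbo.Person (
    PersonID int NOT NULL
        CONSTRAINT PK_Person PRIMARY KEY NONCLUSTERED,
    Name     nvarchar(100) NOT NULL
);

-- Non-unique clustered index on Name: the leaf level still holds
-- the full data rows, now ordered by Name.
CREATE CLUSTERED INDEX CIX_Person_Name ON dbo.Person (Name);

-- The two Smith rows are duplicates on the clustering key, so the
-- server distinguishes them with the hidden uniqueifier.
INSERT INTO dbo.Person (PersonID, Name)
VALUES (1, N'Smith'), (2, N'Smith'), (3, N'Jones');
```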
See:
MSDN: Clustering Index Design Guidelines
Simple-Talk: Effective Clustered Indexes
If the clustered index is not unique, SQL Server creates a 4-byte uniqueifier and adds it to the clustered index value. The uniqueifier is added only if the clustered index value is duplicate, not for all clustered index values.
All nonclustered indexes will contain this value at their leaf level, and a non-unique nonclustered index will also have this uniqueifier value in its non-leaf level entries, as part of the bookmark.
The difference between a primary key and a unique index (or constraint) is that NULL values are not allowed in the primary key column. There is no need to have a primary key on a table, but it makes things easier for external applications that edit the rows in the table, and even then it's not really a necessity with most external applications.
In terms of performance, this changes nothing. What matters is the presence or absence of indexes (unique or not, clustered or not, with NULL values or not), and the primary key is essentially just one more unique index without NULL values.
For the clustered index, the column doesn't need to be unique and/or without null. A column with duplicates and null values is fine for creating a clustered index.
A foreign key must reference a column with a unique index on it, but that column need not be a primary key or be declared NOT NULL. It's perfectly legal to reference a column that is not a primary key and allows NULL values, as long as there is a unique index on it. Notice that because there must be a unique index on it, this column cannot hold more than a single NULL value.
There is no limitation on the foreign key column itself (the column in the referencing table), but performance-wise, putting an index on it is often a good thing.
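A minimal sketch of a foreign key that references a uniquely indexed, nullable column rather than the primary key. All table, column and constraint names here are hypothetical.

```sql
-- Hypothetical parent table: Code is nullable and is not the primary key.
CREATE TABLE dbo.Parent (
    ParentID int NOT NULL CONSTRAINT PK_Parent PRIMARY KEY,
    Code     int NULL
);

-- The unique index is what makes Code a valid foreign key target.
-- It also means at most one row can have Code = NULL.
CREATE UNIQUE INDEX UX_Parent_Code ON dbo.Parent (Code);

CREATE TABLE dbo.Child (
    ChildID    int NOT NULL CONSTRAINT PK_Child PRIMARY KEY,
    ParentCode int NULL,
    CONSTRAINT FK_Child_Parent_Code
        FOREIGN KEY (ParentCode) REFERENCES dbo.Parent (Code)
);

-- Optional, but often helpful for joins and for deletes on the parent.
CREATE INDEX IX_Child_ParentCode ON dbo.Child (ParentCode);
```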
Can anyone tell me what is the difference between a primary key and index key. And when to use which?
A primary key is a special kind of index in that:
there can be only one;
it cannot be nullable; and
it must be unique.
You tend to use the primary key as the most natural unique identifier for a row (such as social security number, employee ID and so forth, although there is a school of thought that you should always use an artificial surrogate key for this).
Indexes, on the other hand, can be used for fast retrieval based on other columns. For example, an employee database may have your employee number as the primary key but it may also have an index on your last name or your department.
Both of these indexes (last name and department) would disallow NULLs (probably) and allow duplicates (almost certainly), and they would be useful to speed up queries looking for anyone with (for example) the last name 'Corleone' or working in the 'HitMan' department.
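The employee example could be sketched like this in T-SQL. The table and index names are hypothetical, and the NOT NULL choices simply follow the "probably" in the paragraph above.

```sql
-- Hypothetical employee table.
CREATE TABLE dbo.Employee (
    EmployeeID int          NOT NULL
        CONSTRAINT PK_Employee PRIMARY KEY,  -- one per table, unique, not nullable
    LastName   nvarchar(50) NOT NULL,
    Department nvarchar(50) NOT NULL
);

-- Ordinary (non-unique) indexes: duplicates are allowed.
CREATE INDEX IX_Employee_LastName   ON dbo.Employee (LastName);
CREATE INDEX IX_Employee_Department ON dbo.Employee (Department);

-- A query that can use the LastName index.
SELECT EmployeeID, LastName, Department
FROM dbo.Employee
WHERE LastName = N'Corleone';
```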
A key (minimal superkey) is a set of attributes, the values of which are unique for every tuple (every row in the table at some point in time).
An index is a performance optimisation feature that enables data to be accessed faster.
Keys are frequently good candidates for indexing and some DBMSs automatically create indexes for keys, but that doesn't have to be so.
The phrase "index key" mixes these two quite different words and might be best avoided if you want to avoid any confusion. "Index key" is sometimes used to mean "the set of attributes in an index". However the set of attributes in question are not necessarily a key because they may not be unique.
Oracle Database enforces a UNIQUE key or PRIMARY KEY integrity constraint on a table by creating a unique index on the unique key or primary key. This index is automatically created by the database when the constraint is enabled.
You can create indexes explicitly (outside of integrity constraints) using the SQL statement CREATE INDEX.
Indexes can be unique or non-unique. Unique indexes guarantee that no two rows of a table have duplicate values in the key column (or columns). Non-unique indexes do not impose this restriction on the column values.
Use the CREATE UNIQUE INDEX statement to create a unique index.
Specifying the Index Associated with a Constraint
If you require more explicit control over the indexes associated with UNIQUE and PRIMARY KEY constraints, the database lets you:
1. Specify an existing index that the database is to use to enforce the constraint
2. Specify a CREATE INDEX statement that the database is to use to create the index and enforce the constraint
These options are specified using the USING INDEX clause.
Example:
CREATE TABLE a (
a1 INT PRIMARY KEY USING INDEX (create index ai on a (a1)));
http://docs.oracle.com/cd/B28359_01/server.111/b28310/indexes003.htm
Other responses are defining the Primary Key, but not the Primary Index.
A Primary Index isn't an index on the Primary Key.
A Primary Index is your table's data structure, but only if that data structure is ordered by the Primary Key, thus allowing efficient lookups without requiring a separate data structure to look up records by the Primary Key.
All databases (that I'm aware of) have a Primary Key.
Not all databases have a Primary Index. Most of those that don't build a secondary index on the Primary Key by default.
I have a junction table in my SQL Server 2005 database that consists of two columns:
object_id (uniqueidentifier)
property_id (integer)
These values together make a compound primary key.
What's the best way to create this PK index for SELECT performance?
If the columns were two integers, I would just use a compound clustered index (the default). However, I've heard bad things about clustered indexes when uniqueidentifiers are involved.
Anyone have experience with this situation?
Yes, GUIDs are really bad for clustered indexes, since GUIDs are by design very random and thus lead to massive fragmentation and, in turn, performance problems.
See Kim Tripp's blog - most notably "The Clustered Index Debate continues" and "GUIDs as PRIMARY and/or CLUSTERED key" - for a lot of valuable background info.
If you really need to have an index on these TWO columns, I'd suggest a non-clustered index - it can be a primary index - just better not a clustered index.
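A sketch of that suggestion, using a hypothetical table name: the compound primary key is declared NONCLUSTERED, so the GUID never becomes the clustering key. Without any other clustered index, the table is stored as a heap.

```sql
-- Hypothetical junction table with a nonclustered compound primary key.
CREATE TABLE dbo.ObjectPropertyHeap (
    object_id   uniqueidentifier NOT NULL,
    property_id int              NOT NULL,
    CONSTRAINT PK_ObjectPropertyHeap
        PRIMARY KEY NONCLUSTERED (object_id, property_id)
);
```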
Marc
One alternative is to use what is known as a surrogate key (which incidentally can also be assigned as the primary key).
For example, adding an identity column that can be used to uniquely identify each row within the table i.e. a primary key.
Understand that a GUID is used to identify a record globally within SQL Server (which arguably is not a relationally correct practice however that is not a concern for us here).
The identity column, now also a primary key can/will have a clustered index applied. A separate, nonclustered index can then be applied to the compound key described by the original poster.
This practice avoids the issue of frequent page splits occurring within the clustered index (inserts into a random GUID primary key) as well as producing a smaller and more efficient clustered index, whilst also preserving the relationships defined within the database.
Surrogate Key Definition: http://en.wikipedia.org/wiki/Surrogate_key
I would create an identity column and make it your primary key and clustered index. You can then create nonclustered indexes on object_id and property_id as needed.
You can create a unique constraint to ensure uniqueness of your key.
The reason for this is that the rows will be inserted sequentially, so you're reducing page splits. In addition, using an integer for your PK means you have a smaller value for your clustered index.
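Put together, the surrogate-key layout might look like this. It is a sketch with hypothetical table and constraint names; the id column is the added identity column, not part of the original table.

```sql
-- Hypothetical junction table with a narrow, ever-increasing clustered key.
CREATE TABLE dbo.ObjectProperty (
    id          int IDENTITY(1,1) NOT NULL,
    object_id   uniqueidentifier  NOT NULL,
    property_id int               NOT NULL,
    -- New rows land at the end of the clustered index.
    CONSTRAINT PK_ObjectProperty PRIMARY KEY CLUSTERED (id),
    -- The original compound key stays unique and gets its own
    -- nonclustered index for lookups by object_id.
    CONSTRAINT UQ_ObjectProperty_object_property
        UNIQUE NONCLUSTERED (object_id, property_id)
);
```

The nonclustered index on the GUID can still fragment, but it is smaller than the table itself and its leaf rows carry only the 4-byte id as the row locator.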