I have an existing table with this structure:
CREATE TABLE TableA
(
Id1 int NOT NULL,
Id2 int NOT NULL
)
Where the primary key is the composite (Id1, Id2). If you haven't deduced it yet, this is a many-to-many associative table. These are the only columns in the table.
The actual application data populating the table consists only of one-to-many relationships, due to the nature of the business use case in this instance. The number of rows is quite small, somewhere around 50. New Id2 records occasionally get created and then associated with existing Id1 records. Even more rarely, a new Id1 record will be created that requires inserting a new set of (Id1, Id2) records. On a day-to-day basis, however, the data is static. The table is heavily used in join queries.
The only index on the table is nonclustered, unique, primary key (created as part of the constraint definition) on (Id1, Id2).
To meet some requirements for synchronizing data to another database, I need to add a clustered index to this table.
What is the best way to do this while maintaining the best performance and good physical data organization?
Given the small number of rows, I'm leaning toward replacing the non-clustered index with a clustered index.
Some thoughts:
Since there are no other columns in the table, the clustered index can't be added on any other columns.
Adding a clustered index on only one column doesn't make sense and could be detrimental.
Will it hurt to have both a clustered index and a non-clustered index on the same columns?
Because the actual data is one-to-many and does not utilize the many-to-many structure, replacing the non-clustered index with a clustered index is not bad.
Data inserts into a clustered index on the PK columns cause bad physical data organization.
Adding an identity column to the table and putting the clustered index on it gets around the issue, but provides no benefit to querying at all.
I'm probably over-analyzing this.
I'd say that with 50 rows it doesn't really matter. I'd create a
clustered index (primary key) on (id1, id2)
plus non-clustered unique index on (id2, id1)
This will cover all possible queries.
Once in a while (once a day or week, or after changes to this infrequently changing table) you can rebuild all indexes to defragment them and keep statistics up to date. This kind of maintenance should be done for all tables anyway.
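The suggestion above could be sketched roughly as follows. The constraint name PK_TableA, the index name UX_TableA_Id2_Id1 and the dbo schema are assumptions, not taken from the question; look up the real constraint name in sys.key_constraints first. If any foreign keys reference the primary key, they would have to be dropped and recreated around this.

```sql
-- Assumption: the existing nonclustered PK constraint is named PK_TableA.
ALTER TABLE dbo.TableA DROP CONSTRAINT PK_TableA;

-- Recreate the primary key, this time as the clustered index.
ALTER TABLE dbo.TableA
    ADD CONSTRAINT PK_TableA PRIMARY KEY CLUSTERED (Id1, Id2);

-- Reverse-order unique index for joins and lookups that start from Id2.
CREATE UNIQUE NONCLUSTERED INDEX UX_TableA_Id2_Id1
    ON dbo.TableA (Id2, Id1);

-- Occasional maintenance: rebuild every index on the table.
ALTER INDEX ALL ON dbo.TableA REBUILD;
```

With only about 50 rows each of these statements should finish almost instantly, so the order of operations matters more for correctness than for speed.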
I am using SQL Server 2012, and for one of the tables I see a non-clustered primary key (a composite key) and a clustered index on a different column. Can somebody help me understand what happens in this situation?
Is this going to degrade performance for DML operations? If so, how can I measure it?
Will this cause locking, blocking, or deadlocks on this table when DML operations run concurrently?
Note: this table has a huge number of records in it, about 10 million.
One common scenario where you might end up with a primary key which is a non-clustered composite key is a junction table. A junction table mainly exists to store a relationship between two primary key values from other tables. A simple example would be storing, say, the relationships between students and the courses they take. As such, the primary (unique) key in such a table would actually be the combination of the two foreign key columns. That being said, there can still be a clustered index on some other column. There is nothing at all out of the ordinary here, assuming such a table falls in line with your design intentions.
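A minimal sketch of such a junction table, assuming hypothetical table, column, and index names (none of them come from the question). The second statement is one rough way to see how often each index is read versus written, which may help with the "how to measure it" part.

```sql
-- Hypothetical junction table: composite nonclustered PK,
-- clustered index on a separate surrogate column.
CREATE TABLE dbo.StudentCourse
(
    StudentCourseId int IDENTITY(1,1) NOT NULL,
    StudentId       int NOT NULL,
    CourseId        int NOT NULL,
    CONSTRAINT PK_StudentCourse
        PRIMARY KEY NONCLUSTERED (StudentId, CourseId)
);

CREATE UNIQUE CLUSTERED INDEX CX_StudentCourse_Id
    ON dbo.StudentCourse (StudentCourseId);

-- Reads (seeks/scans/lookups) versus writes (updates) per index
-- since the last restart of the instance.
SELECT i.name, i.type_desc,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.StudentCourse');
```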
I am doing a review of some DB tables that were created in our project and came across this. The table contains an identity column (ID) which is the primary key for the table, and a clustered index has been defined on this ID column. But when I look at the SPROC that retrieves records from this table, I see that the ID column is never used in the query; the records are queried by a USERID column (this column is not unique), and there can be multiple records for the same USERID.
So my question is: is there any advantage or purpose in creating a clustered index when we know that the records won't be queried by that column?
If the IDENTITY column is never used in WHERE and JOIN clauses, or referenced by foreign keys, perhaps USERID should be a clustered primary key. I would question the need for the ID column at all in that case.
The best choice for the clustered index depends much on how the table is queried. If the majority of queries are by USERID, then it should probably be a unique clustered index (or clustered unique constraint) and the ID column non-clustered.
Keep in mind that the clustered index key is implicitly included in all non-clustered indexes as the row locator. The implication is that non-clustered indexes are more likely to cover queries, and that non-clustered index leaf pages are wider as a result.
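As a rough illustration of the row locator point, here is a sketch with a hypothetical table; only the ID and USERID column names come from the question, everything else is made up for the example.

```sql
-- Hypothetical table: nonclustered PK on ID, clustered index on USERID.
CREATE TABLE dbo.UserRecord
(
    ID      int IDENTITY(1,1) NOT NULL,
    USERID  int NOT NULL,
    Payload varchar(100) NULL,
    CONSTRAINT PK_UserRecord PRIMARY KEY NONCLUSTERED (ID)
);

-- Non-unique clustered index on the column the queries filter by.
CREATE CLUSTERED INDEX CX_UserRecord_USERID
    ON dbo.UserRecord (USERID);

-- The nonclustered PK index stores USERID as part of its row locator,
-- so this query can be answered from that index alone.
SELECT ID, USERID
FROM dbo.UserRecord
WHERE ID = 42;

-- Adding Payload to the select list would require a key lookup
-- into the clustered index.
```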
I would say your table is mis-designed. Someone apparently thought every table needs a primary key and the primary key is the clustered index. Adding a system-generated unique number as an identifier just adds noise if that number isn't used anywhere. Noise in the clustered index is unhelpful, to say the least.
They are different concepts, by the way. A primary key is a data modeling concern, a logical concept. An index is a physical design issue. A SQL DBMS must support primary keys, but need not have any indexes, clustered or no.
If USERID is what is usually used to search the table, it should be in your clustered index. The clustered index need not be unique and need not be the primary key. I would look at the data carefully to see if some combination of USERID and another column (or two, or more) form a unique identifier for the row. If so, I'd make that the primary key (and clustered index), with USERID as the first column. If query analysis showed that many queries use only USERID and nothing else (for existence testing) I might create a separate index just of USERID.
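A sketch of that layout, assuming (purely for illustration) that USERID together with a hypothetical ActivityTime column identifies a row uniquely. The table and index names are also hypothetical.

```sql
-- Assumption: (USERID, ActivityTime) is unique in the real data.
CREATE TABLE dbo.UserActivity
(
    USERID       int          NOT NULL,
    ActivityTime datetime2(3) NOT NULL,
    Detail       varchar(200) NULL,
    CONSTRAINT PK_UserActivity
        PRIMARY KEY CLUSTERED (USERID, ActivityTime)
);

-- Optional narrow index for queries that only test whether a USERID exists.
CREATE NONCLUSTERED INDEX IX_UserActivity_USERID
    ON dbo.UserActivity (USERID);

-- Typical access path: a range seek on the leading clustered key column.
SELECT USERID, ActivityTime, Detail
FROM dbo.UserActivity
WHERE USERID = 42;
```

Whether the extra index on USERID alone pays for itself depends on the workload; the clustered index already supports seeks on USERID as its leading column.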
If no combination of columns constitutes a unique identifier, you have a logical problem, to wit: what does the row mean? What aspect of the real world does it represent?
A basic tenet of the Relational Model is that elements in a relation (rows in a table) are unique, that each one identifies something. If two rows are identical, they identify the same thing. What does it mean to delete one of them? Is the thing that they both identify still there, or not? If it is, what purpose did the 2nd row serve?
I hope that gives you another way to think about clustered indexes and keys. I wouldn't be surprised if you find other tables that could be improved, too.
I was searching for how to move a table from one filegroup to another, and I had some doubts as to why most of the replies I found dealt with clustered indexes, considering that my question had to do with tables.
Then I looked at How I can move table to another filegroup?, and it says that the clustered index is the table data, which explains the reasoning behind recreating a clustered index with CREATE CLUSTERED INDEX.
But in that same question it says that if my clustered index is unique, then do something else.
My question: I assume that when I create tables on a database, a clustered index is created for that table. So how can it not be unique?
Thanks.
If you have an int array and you store the number 1 twice in it - how can that array not be unique?! (Trick question to get you thinking. It clearly can be not unique.) Being unique is a constraint on the data. Fundamentally, there is nothing preventing you from creating multiple rows that have the same values in all columns.
In a heap this is not a problem physically at all. The internal row identifier is its location on disk.
In a b-tree based index (a "clustered index") the physical data structure indeed requires uniqueness. Note that the logical structure (the table) does not. This is a physical concern. It's an implementation detail. SQL Server does this by internally appending a key column that contains a sequence number that is counted upwards. This disambiguates the records. You can observe this effect by creating more than about 2^31 rows with the same non-unique key (the sequence number is a 4-byte integer). You will receive an error.
So there's a hidden column in the table that you cannot access. It's officially called "uniqueifier". Internally, it's used to complete the CI key to make it unique. It's stored and used everywhere where normally the unique CI key would be used: In the CI, in non-unique NCIs, in the lock hash and in query plans.
If a clustered index is not unique, SQL Server internally adds a uniquifier to make each record unique. I will try to explain with an example:
CREATE TABLE Test2 (Col1 INT, Col2 INT);
CREATE CLUSTERED INDEX idxClustered ON Test2 (Col1);
CREATE NONCLUSTERED INDEX idxNonClustered ON Test2 (Col2);
Here the clustered index is not unique:
INSERT INTO Test2 VALUES (1,1), (2,2)
INSERT INTO Test2 VALUES (3,3)
INSERT INTO Test2 VALUES (3,3)
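--List the pages of every index on Test2 (-1 = all indexes) in database Test,
--to get the page number of the nonclustered index
DBCC IND (Test, Test2, -1)
--Examine the contents of the page
--Do not run in production
DBCC TRACEON (3604);
--3376 is the page number from the DBCC IND output on my system; substitute your own
DBCC PAGE(Test, 1, 3376, 3);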
You will see a Uniquifier key with the corresponding value that makes each duplicate unique. If your clustered index is a unique clustered index, it will not have that Uniquifier attribute.
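If you only want to know whether an index was declared unique, without looking at pages, the catalog view sys.indexes exposes this directly. A minimal sketch, assuming Test2 from the example above was created in the dbo schema:

```sql
-- One row per index on the table; is_unique = 0 means SQL Server
-- has to add a uniquifier to duplicate key values in a clustered index.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       i.type_desc,
       i.is_unique,
       i.is_primary_key
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.Test2');
```

A table without any clustered index shows up here with type_desc = HEAP instead of CLUSTERED.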
usr has a good post worth reading. I will add to it from Microsoft's documentation.
First, you are not alone with Clustered-Indexes. Honestly, the name itself is somewhat confusing (Structured-Indexes or Disk-Indexes would probably be better in SQL).
Refer back to the official documentation from MSDN. Any alterations by me are in italics:
A Clustered Index is an on-disk structure of the table. This means the values are pointing to a physical location. This is why when you move the table you need to recreate the Index because the physical location has been altered.
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the index
definition. There can be only one clustered index per table, because
the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table has
no clustered index, its data rows are stored in an unordered structure
called a heap.
Nonclustered
Nonclustered indexes have a structure separate from the data rows (like pointers, this is a logical ordering of the data that consumes a fraction of the physical disk space).
A nonclustered index contains the nonclustered index key values and each
key value entry has a pointer to the data row that contains the key
value.
The pointer from an index row in a nonclustered index to a data row
is called a row locator. The structure of the row locator depends on
whether the data pages are stored in a heap or a clustered table (think ordered).
For a heap, a row locator is a pointer to the row.
For a clustered table, the row locator is the clustered index key.
ABSTRACT VIEW:
A table created is not necessarily a clustered (ordered) table.
An index does not necessarily have to be unique. It is an abstract view of the table.
Unique means that a value or set of values will not repeat themselves. If you wish to enforce this, you can add a constraint by the index (i.e. UNIQUE CLUSTERED INDEX) or a CONSTRAINT such as PRIMARY KEY if you wish this to be managed in the table structure itself.
You may have multiple unique indexes since as long as the values are represented logically, they will not share the same value as another row pointer.
Consider you have Columns A, B, and C in a given table.
Column A was created with a UNIQUE CLUSTERED INDEX. This means that either A already had an enforceable UNIQUE constraint (like PK, UNIQUE CONSTRAINT) or was DECLARED EXPLICITLY.
A Column Group {B,C} could be a unique index so long as B and C never repeat together. In the same way, you could theoretically have indexes with the groups {A}, {B,C}, {A,C}, and every one of them be unique. Recall that an index is a logical ordering of the data so they likely will not have the same logical value (and thus are unique).
HOWEVER: unless the datatype, constraint (including the INDEX constraint), or table structure enforces a unique constraint on a COLUMN, you should not assume the index is unique. Furthermore, you cannot create a UNIQUE index if more than one row contains the same combination of NULL values, since SQL Server will treat them as the same value (NULL being unknown).
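To make the A, B, C example concrete, here is a small sketch. The table and index names are hypothetical. The second insert is expected to fail with a duplicate-key error, because the unique index on (B, C) treats the two (NULL, NULL) pairs as equal.

```sql
-- Hypothetical table matching the A, B, C example.
CREATE TABLE dbo.UniqueDemo
(
    A int NOT NULL,
    B int NULL,
    C int NULL
);

-- Column A is declared unique explicitly, as the clustered index.
CREATE UNIQUE CLUSTERED INDEX CX_UniqueDemo_A
    ON dbo.UniqueDemo (A);

-- A second unique index on the column group {B, C}.
CREATE UNIQUE NONCLUSTERED INDEX UX_UniqueDemo_B_C
    ON dbo.UniqueDemo (B, C);

-- First row with (NULL, NULL) in (B, C) is accepted.
INSERT INTO dbo.UniqueDemo (A, B, C) VALUES (1, NULL, NULL);

-- Second row with the same NULL combination violates UX_UniqueDemo_B_C.
BEGIN TRY
    INSERT INTO dbo.UniqueDemo (A, B, C) VALUES (2, NULL, NULL);
END TRY
BEGIN CATCH
    SELECT ERROR_NUMBER() AS error_number, ERROR_MESSAGE() AS error_message;
END CATCH;
```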
Will SQL Server use your indexes, unique or not? Well that is another story and depends on a number of things. But hopefully you find this post helpful.
Sources:
MSDN - Clustered and Nonclustered Indexes Described
A clustered index doesn't have to be unique. But, there can be only one clustered index on a table, because a clustered index actually determines the physical order of the table rows on disk (but I find it confusing to say that the clustered index is the table data, per se, even though they are strongly tied to each other).
HERE is a good post all about non-unique clustered indexes. Even if the index was the entire row of data, you can certainly have duplicate rows (no PK), which would equate to duplicate clustered index nodes.
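For the filegroup move that prompted the question, a rough sketch of the usual technique looks like this. The table dbo.MyTable, the index CX_MyTable, the column Col1 and the filegroup SECONDARY are all hypothetical; the filegroup must already exist and contain at least one file.

```sql
-- Rebuild the existing (non-unique) clustered index on another filegroup.
-- Because the clustered index holds the data rows, the table moves with it.
CREATE CLUSTERED INDEX CX_MyTable
    ON dbo.MyTable (Col1)
    WITH (DROP_EXISTING = ON)
    ON [SECONDARY];

-- If the existing clustered index is unique, the definition has to say so,
-- which is why the unique case is written differently:
-- CREATE UNIQUE CLUSTERED INDEX CX_MyTable
--     ON dbo.MyTable (Col1)
--     WITH (DROP_EXISTING = ON)
--     ON [SECONDARY];
```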
I have a table which has 4 columns (region_id, product_id, cate_id, month_id) as a primary key.
This primary key was created with the defaults, so a clustered index was created for the PK.
This table contains more than 10 million rows.
If I drop the existing PK and create a new PK as a non-clustered index, will that be better than the clustered index for the following query?
select region_id, product_id, cate_id, month_id, a, b, c
from fact_a
where month_id > 100
Thanks in advance.
A simple nonclustered index on month_id will certainly improve the average performance for that query (assuming month_id for most of the rows is less than 100, so that the where clause excludes most of the rows). However, if you're creating the index specifically for that query (or any queries with month_id in the where clause and a, b, c, month_id or a subset of those in the select), you will get even better results by including the selected values in the index, like this:
CREATE INDEX index_fact_a_month_id ON fact_a (month_id) INCLUDE (a,b,c)
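One way to check whether the index actually helps is to compare logical reads and elapsed time for the query before and after creating it. This is only a sketch of that measurement, not a benchmark; the results appear on the Messages tab in Management Studio.

```sql
-- Report page reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT region_id, product_id, cate_id, month_id, a, b, c
FROM fact_a
WHERE month_id > 100;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Note that while the primary key stays clustered, its columns (region_id, product_id, cate_id, month_id) are carried in every nonclustered index as the row locator, so the index above covers the whole select list.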
The quick answer: yes, removing the primary key (or rather, replacing the current multi-column primary key with a single identity column) and then creating your NCI on Month_ID will be better/faster/more efficient.
Clustered Index - it IS the data. It contains every column of every row in the table. There can only be one CI because the table data only needs to exist once. Each row has a key...
Primary Key - it is the key to identify a row in a Clustered Index.
Non-Clustered Index - it acts as a table of a subset of columns from the rows in the Clustered Index.
Keeping it simple, a Non-Clustered Index contains less data than the Clustered Index, and it orders the data in a way (Month_id ASC) that makes queries against it much more efficient than querying against the CI (region_id, product_id, cate_id, month_id). SQL Server has no way to "dip" into the CI Primary Key or row data and say, "Hey, I'm filtering by Month_ID, so I'll just go right to that column." By nature of Clustered Indexes, SQL Server "reads" all CI rows (index scan), every column, every byte of data. Very inefficient and wasteful since your WHERE clause will be filtering out a lot of these rows.
The Non Clustered Index only contains a subset of columns, so it is much more efficient in that it can say, "Hey, I'm filtering by Month_ID, and I only contain Month_ID, aaannnd Month_ID is in ascending order, so I can just jump right to the rows that I want!" (index seek). Much more efficient since only the rows you want to return will be "read" by SQL Server.
Getting a little more advanced, since the Non Clustered Index is only Month_ID, but you are querying for all the columns in the Clustered Index, SQL Server needs to be able to go back to the CI from the NCI to get rest of the columns. To do that, the Primary Key of the CI is stored in the NCI, along with the column subset. So the NCI is really like a two column table of (Month_ID, CI Primary Key).
If your Primary Key is monstrous, your NCIs will also be monstrous, and therefore less efficient (more disk reads, more buffer pool consumption, bad database stuff).
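To see how large each index really is, something like the following can be run before and after changing the clustered key. It is a sketch that assumes fact_a lives in the dbo schema.

```sql
-- Pages and approximate size per index on fact_a (pages are 8 KB).
SELECT i.name                            AS index_name,
       i.type_desc,
       SUM(ps.row_count)                 AS row_count,
       SUM(ps.used_page_count)           AS used_pages,
       SUM(ps.used_page_count) * 8 / 1024.0 AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID(N'dbo.fact_a')
GROUP BY i.name, i.type_desc
ORDER BY used_pages DESC;
```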
Disclaimer: there can be specific scenarios where you want every column to be the clustered index key/PK. I don't sense that is applicable here, but it is possible. If you have a heavily used query that refers to every column of the table in WHERE clauses or joins, then a covering clustered index may be beneficial.
We have a table with about 100,000 records which is used frequently in our applications. We had an identity (ID) column with a clustered index on it, and everything worked well. But for some reasons we had to use a uniqueidentifier column as the primary key. So we added a non-clustered index on it and removed the clustered index on the ID column. But now our customers report a lot of performance degradation at peak times. Is it because the table has no clustered index now?
The fact that you added a primary key by no means implies you had to drop the clustered index. The two concepts are distinct. You can have a uniqueidentifier PK implemented by a non-clustered index and a separate clustered index of your choice (e.g. the old ID column).
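A minimal sketch of that arrangement, with hypothetical table, column, and constraint names (the question does not give any):

```sql
-- Hypothetical table: uniqueidentifier PK enforced by a nonclustered index,
-- clustered index kept on the old identity column.
CREATE TABLE dbo.Customer
(
    ID           int IDENTITY(1,1) NOT NULL,
    CustomerGuid uniqueidentifier  NOT NULL
        CONSTRAINT DF_Customer_Guid DEFAULT NEWID(),
    Name         nvarchar(100)     NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerGuid)
);

CREATE UNIQUE CLUSTERED INDEX CX_Customer_ID
    ON dbo.Customer (ID);
```

On an existing table the same end state could be reached by creating the clustered index on ID again, without touching the new primary key.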
But the real question is: how did you change your application when you added the uniqueidentifier PK? Did you also modify the application code to retrieve the records by this new PK (the uniqueidentifier)? Did you update all joins to reference the new PK? Did you modify all foreign key constraints that referenced the old ID column? Or does the application continue to retrieve the data using the old identity ID column? My expectation is that you changed both the application and the table, and that access is now predominantly of the form SELECT ... FROM table WHERE pk=#uniqueidentifier. If only such access occurs, then the table should perform OK even with a non-clustered uniqueidentifier primary key and no clustered index. So there must be something else at play:
your application continues to access the table based on the old identity ID column
there are joins in your query based on the old identity ID column
there are foreign key constraints referencing the table on the old ID column
Ultimately you have a performance troubleshooting issue at hand, and you should approach it as a performance troubleshooting problem. I have two great resources for you: the Waits and Queues methodology and the Performance Troubleshooting Flowchart.
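As a very rough starting point for the waits-based approach, the top waits on the instance can be listed from a DMV. This is only a sketch: it requires the VIEW SERVER STATE permission, the numbers are cumulative since the last restart, and it does not filter out benign background waits the way a proper analysis would.

```sql
-- Top 10 wait types by total wait time since the instance started.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
```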
Hi, I think you can make the uniqueidentifier column the clustered index if you populate it with NEWSEQUENTIALID() instead of NEWID(). NEWSEQUENTIALID() generates sequential IDs, which suits a clustered index much better.
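A small sketch of that idea, with hypothetical names. NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a uniqueidentifier column, so the value has to come from the default rather than from application code.

```sql
-- Hypothetical table: clustered PK on a sequentially generated GUID.
CREATE TABLE dbo.CustomerSeq
(
    CustomerGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_CustomerSeq_Guid DEFAULT NEWSEQUENTIALID(),
    Name         nvarchar(100)    NOT NULL,
    CONSTRAINT PK_CustomerSeq PRIMARY KEY CLUSTERED (CustomerGuid)
);

-- The GUIDs are filled in by the default, in increasing order.
INSERT INTO dbo.CustomerSeq (Name) VALUES (N'first'), (N'second');

SELECT CustomerGuid, Name
FROM dbo.CustomerSeq
ORDER BY CustomerGuid;
```

Two caveats: rows that already hold NEWID() values stay randomly distributed, and after a Windows restart the generated values can start again from a lower range.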