Which index is better non-clustered vs clustered in this case? - sql-server

I have a table whose primary key consists of 4 columns (region_id, product_id, cate_id, month_id).
The primary key was created with the defaults, so a clustered index was created for it.
The table contains more than 10 million rows.
If I drop the existing PK and create a new PK as a non-clustered index, will that perform better than the clustered index for the following query?
select region_id, product_id, cate_id, month_id, a, b, c
from fact_a
where month_id > 100
Thanks in advance.

A simple nonclustered index on month_id will certainly improve the average performance for that query (assuming month_id for most of the rows is 100 or less, so that the where clause excludes most of the rows). However, if you're creating the index specifically for that query (or any queries with month_id in the where clause and a, b, c, month_id or a subset of those in the select), you will get even better results by including the selected values in the index. The clustered key columns (region_id, product_id, cate_id) do not need to be listed, because every nonclustered index on a clustered table already carries the clustering key as its row locator. For example:
CREATE INDEX index_fact_a_month_id ON fact_a (month_id) INCLUDE (a,b,c)

The quick answer: yes. Removing the primary key (more precisely, replacing the current multi-column primary key with a single identity column) and then creating your NCI on Month_ID will be better, faster, and more efficient.
Clustered Index - it IS the data. It contains every column of every row in the table. There can only be one CI because the table data only needs to exist once. Each row has a key...
Primary Key - it is the key to identify a row in a Clustered Index.
Non-Clustered Index - it acts as a table of a subset of columns from the rows in the Clustered Index.
Keeping it simple, a Non-Clustered Index contains less data than the Clustered Index, and it orders the data in a way (Month_id ASC) that makes queries against it much more efficient than querying against the CI, whose key is (region_id, product_id, cate_id, month_id). Because Month_ID is the last column of that key, SQL Server has no way to "dip" into the CI and say, "Hey, I'm filtering by Month_ID, so I'll just go right to that column." Instead, SQL Server "reads" all CI rows (index scan), every column, every byte of data. That is very inefficient and wasteful, since your WHERE clause will be filtering out a lot of these rows.
The Non Clustered Index only contains a subset of columns, so it is much more efficient in that it can say, "Hey, I'm filtering by Month_ID, and I only contain Month_ID, aaannnd Month_ID is in ascending order, so I can just jump right to the rows that I want!" (index seek). Much more efficient since only the rows you want to return will be "read" by SQL Server.
Getting a little more advanced: since the Non Clustered Index contains only Month_ID, but you are querying for all the columns in the Clustered Index, SQL Server needs to be able to go back to the CI from the NCI to get the rest of the columns. To do that, the key of the CI (here, the primary key) is stored in the NCI, along with the column subset. So the NCI is really like a two-column table of (Month_ID, CI Primary Key).
If your Primary Key is monstrous, your NCIs will also be monstrous, and therefore less efficient (more disk reads, more buffer pool consumption, bad database stuff).
Disclaimer: there can be specific scenarios where you want every column to be part of the clustered index key/PK. I don't sense that is applicable here, but it is possible. If you have a heavily used query that refers to every column of the table in WHERE clauses or joins, then a covering clustered index may be beneficial.
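As a rough sketch of the restructuring described in this answer (not a tested migration script): the constraint name PK_fact_a, the new column fact_a_id, and the index and constraint names below are assumptions, since the original post does not give them. The unique constraint on the four original key columns is also an addition here, so that the business key stays enforced after the primary key moves to the identity column.

```sql
-- Assumption: the existing clustered PK is named PK_fact_a.
-- Look up the real name in sys.key_constraints before running anything.
ALTER TABLE dbo.fact_a DROP CONSTRAINT PK_fact_a;

-- Hypothetical narrow surrogate key that becomes the clustered PK.
ALTER TABLE dbo.fact_a ADD fact_a_id BIGINT IDENTITY(1,1) NOT NULL;
ALTER TABLE dbo.fact_a
    ADD CONSTRAINT PK_fact_a PRIMARY KEY CLUSTERED (fact_a_id);

-- Keep the original four-column key unique (an addition, not from the answer).
ALTER TABLE dbo.fact_a
    ADD CONSTRAINT UQ_fact_a_business_key
    UNIQUE NONCLUSTERED (region_id, product_id, cate_id, month_id);

-- The nonclustered index that supports WHERE month_id > 100.
CREATE NONCLUSTERED INDEX IX_fact_a_month_id
    ON dbo.fact_a (month_id);
```

On a table with more than 10 million rows, each of these statements rewrites a lot of data, so it would be worth trying on a copy first and comparing the execution plans before and after.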

Related

Is there any advantage in creating a clustered index - if we are not going to query/search for records based on that column?

I am doing a review of some DB tables that were created in our project and came across this. The table contains an identity column (ID) which is the primary key for the table, and a clustered index has been defined on this ID column. But when I look at the SPROC that retrieves records from this table, I see that the ID column is never used in the query. Instead, the records are queried by a USERID column (this column is not unique), and there can be multiple records for the same USERID.
So my question is: is there any advantage or purpose in creating a clustered index when we know that the records won't be queried by that column?
If the IDENTITY column is never used in WHERE and JOIN clauses, or referenced by foreign keys, perhaps USERID should be a clustered primary key. I would question the need for the ID column at all in that case.
The best choice for the clustered index depends much on how the table is queried. If the majority of queries are by USERID, then it should probably be a unique clustered index (or clustered unique constraint) and the ID column non-clustered.
Keep in mind that the clustered index key is implicitly included in all non-clustered indexes as the row locator. The implication is that non-clustered indexes are more likely to cover queries, and that non-clustered index leaf pages are wider as a result.
I would say your table is mis-designed. Someone apparently thought every table needs a primary key and the primary key is the clustered index. Adding a system-generated unique number as an identifier just adds noise if that number isn't used anywhere. Noise in the clustered index is unhelpful, to say the least.
They are different concepts, by the way. A primary key is a data modeling concern, a logical concept. An index is a physical design issue. A SQL DBMS must support primary keys, but need not have any indexes, clustered or no.
If USERID is what is usually used to search the table, it should be in your clustered index. The clustered index need not be unique and need not be the primary key. I would look at the data carefully to see if some combination of USERID and another column (or two, or more) form a unique identifier for the row. If so, I'd make that the primary key (and clustered index), with USERID as the first column. If query analysis showed that many queries use only USERID and nothing else (for existence testing) I might create a separate index just of USERID.
If no combination of columns constitutes a unique identifier, you have a logical problem, to wit: what does the row mean? What aspect of the real world does it represent?
A basic tenet of the Relational Model is that elements in a relation (rows in a table) are unique, that each one identifies something. If two rows are identical, they identify the same thing. What does it mean to delete one of them? Is the thing that they both identify still there, or not? If it is, what purpose did the 2nd row serve?
I hope that gives you another way to think about clustered indexes and keys. I wouldn't be surprised if you find other tables that could be improved, too.
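To make the separation between primary key and clustered index concrete, here is a minimal sketch. The table name dbo.UserRecords, the Payload column, and the constraint and index names are all hypothetical, since the question does not show the real schema.

```sql
-- Hypothetical table: the PK stays on ID but is declared NONCLUSTERED,
-- so the clustered index is free to go on the column that queries filter by.
CREATE TABLE dbo.UserRecords
(
    ID      INT IDENTITY(1,1) NOT NULL,
    USERID  INT               NOT NULL,
    Payload NVARCHAR(100)     NULL,
    CONSTRAINT PK_UserRecords PRIMARY KEY NONCLUSTERED (ID)
);

-- Non-unique clustered index: rows for the same USERID are stored together.
CREATE CLUSTERED INDEX CIX_UserRecords_USERID
    ON dbo.UserRecords (USERID);

-- A lookup by USERID can now seek directly into the clustered index.
SELECT ID, USERID, Payload
FROM dbo.UserRecords
WHERE USERID = 42;
```

If a combination such as (USERID, ID) is acceptable as the clustering key, declaring the index as UNIQUE CLUSTERED on that pair would be one way to stay unique while still leading with USERID.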

If clustered index is table data, how can it not be unique?

I was searching for how to move a table from one filegroup to another, and I wondered why most of the replies I found dealt with clustered indexes, considering that my question had to do with tables.
Then I looked at How I can move table to another filegroup?, and it says that the clustered index is the table data, which explains the reasoning behind recreating a clustered index with CREATE CLUSTERED INDEX.
But in that same question it says that if my clustered index is unique, then do something else.
My question: I assume that when I create tables on a database, a clustered index is created for that table. So how can it not be unique?
Thanks.
If you have an int array and you store the number 1 twice in it - how can that array not be unique?! (Trick question to get you thinking. It clearly can be not unique.) Being unique is a constraint on the data. Fundamentally, there is nothing preventing you from creating multiple rows that have the same values in all columns.
In a heap this is not a problem physically at all. The internal row identifier is its location on disk.
In a b-tree based index (a "clustered index") the physical data structure indeed requires uniqueness. Note that the logical structure (the table) does not. This is a physical concern. It's an implementation detail. SQL Server does this by internally appending a key column that contains a sequence number that is counted upwards. This disambiguates the records. You can observe this effect by creating more than 2^32 rows with the same non-unique key. You will receive an error.
So there's a hidden column in the table that you cannot access. It's officially called "uniqueifier". Internally, it's used to complete the CI key to make it unique. It's stored and used everywhere where normally the unique CI key would be used: In the CI, in non-unique NCIs, in the lock hash and in query plans.
If the clustered index is not unique, SQL Server internally adds a uniquifier to make each record unique. I will try to explain with an example:
CREATE TABLE Test2 (Col1 INT, Col2 INT)
CREATE CLUSTERED INDEX idxClustered ON Test2 (Col1)
CREATE NONCLUSTERED INDEX idxNonClustered ON test2 (Col2)
-- Here the clustered index is not unique
INSERT INTO Test2 VALUES (1,1), (2,2)
INSERT INTO Test2 VALUES (3,3)
INSERT INTO Test2 VALUES (3,3)
--Get the Page Number of the Non Clustered Index
DBCC IND (Test, Test2, -1)
--Examine the Results of the Page
--Not to run in production
DBCC TRACEON (3604);
DBCC PAGE(Test, 1, 3376, 3);
You will see the uniquifier key with its corresponding value. If your clustered index is a unique clustered index, it will not have the uniquifier attribute.
usr has a good post worth reading. I will add to it from Microsoft's documentation.
First, you are not alone with Clustered-Indexes. Honestly, the name itself is somewhat confusing (Structured-Indexes or Disk-Indexes would probably be better in SQL).
Refer back to the official documentation from MSDN. My own additions are the parenthetical remarks:
A Clustered Index is an on-disk structure of the table. This means the values are pointing to a physical location. This is why when you move the table you need to recreate the Index because the physical location has been altered.
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the index
definition. There can be only one clustered index per table, because
the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table has
no clustered index, its data rows are stored in an unordered structure
called a heap.
Nonclustered
Nonclustered indexes have a structure separate from the data rows (like pointers, this is a logical ordering of the data that consumes a fraction of the physical disk space).
A nonclustered index contains the nonclustered index key values and each
key value entry has a pointer to the data row that contains the key
value.
The pointer from an index row in a nonclustered index to a data row
is called a row locator. The structure of the row locator depends on
whether the data pages are stored in a heap or a clustered table (think ordered).
For a heap, a row locator is a pointer to the row.
For a clustered table, the row locator is the clustered index key.
ABSTRACT VIEW:
A table created is not necessarily a clustered (ordered) table.
An index does not necessarily have to be unique. It is an abstract view of the table.
Unique means that a value or set of values will not repeat themselves. If you wish to enforce this, you can add a constraint by the index (i.e. UNIQUE CLUSTERED INDEX) or a CONSTRAINT such as PRIMARY KEY if you wish this to be managed in the table structure itself.
You may have multiple unique indexes since as long as the values are represented logically, they will not share the same value as another row pointer.
Consider you have Columns A, B, and C in a given table.
Column A was created with a UNIQUE CLUSTERED INDEX. This means that either A already had an enforceable UNIQUE constraint (like PK, UNIQUE CONSTRAINT) or was DECLARED EXPLICITLY.
A column group {B,C} could be a unique index so long as B and C never repeat together. In the same way, you could theoretically have indexes on the groups {A}, {B,C}, {A,C}, and every one of them could be unique. Recall that an index is a logical ordering of the data, so the entries likely will not have the same logical value (and thus are unique).
HOWEVER: unless the datatype, constraint (including the INDEX constraint), or table structure enforces a unique constraint on a COLUMN, you should not assume the index is unique. Furthermore, you cannot create a UNIQUE index if more than one row contains the same combination of NULL values, since SQL Server treats them as the same value for uniqueness purposes (NULL being unknown).
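The NULL behaviour is easy to try out. This is a small sketch with a hypothetical table dbo.NullDemo; the filtered index at the end is a commonly used workaround and is an addition here, not part of the original answer.

```sql
-- Hypothetical demo table with a nullable column.
CREATE TABLE dbo.NullDemo (A INT NULL);

INSERT INTO dbo.NullDemo (A) VALUES (NULL), (NULL), (1);

-- Expected to fail with a duplicate key error, because the two NULLs
-- count as equal for the purposes of a unique index:
-- CREATE UNIQUE INDEX UX_NullDemo_A ON dbo.NullDemo (A);

-- A filtered unique index (SQL Server 2008 and later) enforces uniqueness
-- only among the non-NULL values, so it can be created on this data.
CREATE UNIQUE NONCLUSTERED INDEX UX_NullDemo_A_NotNull
    ON dbo.NullDemo (A)
    WHERE A IS NOT NULL;
```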
Will SQL Server use your indexes, unique or not? Well that is another story and depends on a number of things. But hopefully you find this post helpful.
Sources:
MSDN - Clustered and Nonclustered Indexes Described
A clustered index doesn't have to be unique. But, there can be only one clustered index on a table, because a clustered index actually determines the physical order of the table rows on disk (but I find it confusing to say that the clustered index is the table data, per se, even though they are strongly tied to each other).
HERE is a good post all about non-unique clustered indexes. Even if the index was the entire row of data, you can certainly have duplicate rows (no PK), which would equate to duplicate clustered index nodes.
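For the filegroup move that prompted the question, a minimal sketch looks like the following. The table dbo.MyTable, the index CIX_MyTable, the key column SomeColumn, and the filegroup name SECONDARY are all assumptions; the statement also assumes a clustered index with that name already exists on the table.

```sql
-- Rebuild the existing clustered index on another filegroup.
-- Because the clustered index holds the data rows, the rows move with it.
CREATE CLUSTERED INDEX CIX_MyTable
    ON dbo.MyTable (SomeColumn)
    WITH (DROP_EXISTING = ON)
    ON [SECONDARY];

-- If the existing clustered index is unique, keep the UNIQUE keyword so the
-- rebuilt index matches the original definition:
-- CREATE UNIQUE CLUSTERED INDEX CIX_MyTable
--     ON dbo.MyTable (SomeColumn)
--     WITH (DROP_EXISTING = ON)
--     ON [SECONDARY];
```

This is why the answers distinguish between unique and non-unique clustered indexes: the index is re-created with the same definition, only in a different location.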

Add Clustered Index to Table with Non-Clustered Unique PK Index

I have an existing table with structure:
TableA
(
Id1 int not null,
Id2 int not null
)
Where the primary key is the composite (Id1, Id2). If you haven't deduced it yet, this is a many-to-many associative table. These are the only columns in the table.
The actual application data populating the table are only one-to-many relationships, due to the nature of the business use case in this instance. The number of rows is quite small, around 50. New Id2 records occasionally get created and then associated with existing Id1 records. Even more rarely, a new Id1 record will be created that requires inserting a new set of Id1, Id2 records. On a day-to-day basis, however, the data is static. The table is heavily used in join queries.
The only index on the table is nonclustered, unique, primary key (created as part of the constraint definition) on (Id1, Id2).
To meet some requirements for synchronizing data to another database, I need to add a clustered index to this table.
What is the best way to do this while maintaining the best performance and good physical data organization?
Given the small number of rows, I'm leaning toward replacing the non-clustered index with a clustered index.
Some thoughts:
Since there are no other columns in the table, the clustered index can't be added on any other columns.
Adding a clustered index on only one column doesn't make sense and could be detrimental.
Will it hurt to have both a clustered index and a non-clustered index on the same columns?
Because the actual data is one-to-many and does not utilize the many-to-many structure, replacing the non-clustered index with a clustered index is not bad.
Data inserts into a clustered index on the PK columns cause bad physical data organization.
Adding an identity column to the table and putting the clustered index on it gets around the issue, but provides no benefit to querying at all.
I'm probably over-analyzing this.
I'd say, that with 50 rows it doesn't really matter. I'd create a
clustered index (primary key) on (id1, id2)
plus non-clustered unique index on (id2, id1)
This will cover all possible queries.
Once in a while (once a day or week, or after changes to this infrequently changing table) you can rebuild all indexes to defragment them and keep statistics up to date. This kind of maintenance should be done for all tables anyway.
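The suggestion above could be sketched as follows. The constraint name PK_TableA and the index name UX_TableA_Id2_Id1 are assumptions, since the question does not name the existing constraint.

```sql
-- Assumption: the existing nonclustered PK constraint is named PK_TableA.
ALTER TABLE dbo.TableA DROP CONSTRAINT PK_TableA;

-- Re-create the primary key as the clustered index on (Id1, Id2).
ALTER TABLE dbo.TableA
    ADD CONSTRAINT PK_TableA PRIMARY KEY CLUSTERED (Id1, Id2);

-- Reverse-order unique index so joins that start from Id2 can also seek.
CREATE UNIQUE NONCLUSTERED INDEX UX_TableA_Id2_Id1
    ON dbo.TableA (Id2, Id1);

-- Occasional maintenance: rebuild every index on the table.
ALTER INDEX ALL ON dbo.TableA REBUILD;
```

With roughly 50 rows, all of this runs almost instantly, which supports the point that the choice matters little at this size.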

SQL Server - Clustered Index Key Issue on FACT Table with millions of rows

We have a FACT table with 237,383,163 rows, and it contains a lot of duplicate data.
Queries against this table do a SCAN across all of those rows, resulting in long execution times (because we haven't created a clustered index).
Can someone suggest a way to create a clustered key using some combination of existing fields, possibly along with a new field (like an identity column)?
The non-clustered indexes created on the table are of no help either.
Regards
Thoughts:
Adding a clustered index that is not unique will require a 4-byte uniqueifier
Adding a surrogate IDENTITY column will leave you with duplicates
A clustered index is best when narrow and numeric, especially if you have non-clustered indexes
First thing, de-duplicate the data
Then I'd consider one of 2 things, based on whether there are non-clustered indexes:
1) Without NC indexes, create a unique clustered index on some or all of the FACT columns
2) With NC indexes, create an IDENTITY column and use this as the clustered index. Create a unique NC index on the FACT columns
Option 1 will be a lot smaller on disk. I've done this before for a billion+ row fact table and it shrank by 65%. There were no NC indexes.
Both options will need to be tested to see the effect on load and response times, etc.
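A minimal sketch of the de-duplicate-then-cluster approach (option 1), under stated assumptions: the table name dbo.FactSales and the columns date_id, product_id and store_id are hypothetical, and the PARTITION BY list should really contain whichever columns define a duplicate in the actual table.

```sql
-- Number the rows within each group of duplicates, then delete all but one.
-- ORDER BY (SELECT NULL) means "any order", since the copies are identical.
WITH numbered AS
(
    SELECT ROW_NUMBER() OVER (
               PARTITION BY date_id, product_id, store_id
               ORDER BY (SELECT NULL)) AS rn
    FROM dbo.FactSales
)
DELETE FROM numbered
WHERE rn > 1;

-- With duplicates gone, the fact columns can carry a unique clustered index,
-- which avoids both the uniqueifier and an extra IDENTITY column.
CREATE UNIQUE CLUSTERED INDEX CIX_FactSales
    ON dbo.FactSales (date_id, product_id, store_id);
```

On a table of this size, a single DELETE may generate a very large transaction, so deleting in batches or copying the distinct rows into a new table could be gentler alternatives to test.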

Non Clustered Index not working sql server

I have a table that doesn't have a primary key. The data is already there. I have created a non-clustered index, but when I run a query, the actual execution plan does not show the index being used. I think the non-clustered index is not working. What could be the reason? Please help.
First of all, why isn't there a primary key? If it doesn't have a primary key, it's not a table. Just add one! That will help on so many levels.
Secondly: even if you have an index, SQL Server query optimizer will always look at your query to decide whether it makes sense to use the index (or not). If you select all columns, and a large portion of the rows, then using an index is pointless.
So things to avoid are:
SELECT * FROM dbo.YourTable is almost guaranteed not to use any indices
if you don't have a good WHERE clause in your query
if your index is on a column that doesn't really select a small percentage of data; an index on a boolean column, or a Gender column with at most three different values doesn't help at all
Without knowing a lot more about your table structure, the data contained in those tables, the number of rows, and what kind of queries you're executing, no one can really answer your question - it's just way too broad....
Update: if you want to create a clustered index on a table which is different from your primary key, do these steps:
1) First, design your table
2) Then open up the index designer and create a new, clustered index on a column of your choice. Mind you, this is NOT the primary key!
3) After that, you can put your primary key on the ID column. It will create an index, but that index is not clustered!
Without having any more information I'd guess that the reason is that the table is too small for an index seek to be worth it.
If your table has fewer than a few thousand rows, then SQL Server will almost always choose to do a table / index scan regardless of the indexes on that table, simply because a scan is in fact faster at that size.
An index scan in itself doesn't necessarily indicate a performance problem - is the query actually slow?
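One way to check whether the optimizer's choice is reasonable is to compare its plan against a forced use of the index. This is only a diagnostic sketch: dbo.YourTable is the placeholder name used above, while the column SomeColumn and the index IX_YourTable_SomeColumn are hypothetical.

```sql
-- Report logical reads for each statement in the Messages tab.
SET STATISTICS IO ON;

-- Let the optimizer choose the plan.
SELECT SomeColumn
FROM dbo.YourTable
WHERE SomeColumn = 42;

-- Force the nonclustered index with a table hint, for comparison only.
SELECT SomeColumn
FROM dbo.YourTable WITH (INDEX (IX_YourTable_SomeColumn))
WHERE SomeColumn = 42;

SET STATISTICS IO OFF;
```

If the forced version does not read fewer pages, the optimizer was probably right to ignore the index. The hint is meant for this kind of comparison, not for leaving in production queries.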
