Create aligned index on a foreign key column - sql-server

I have a fact table which is partitioned along the PeriodDate column.
CREATE TABLE MyFactTable
(
    PeriodDate DATE,
    OrderID INT,
    CONSTRAINT MyFk FOREIGN KEY (OrderID) REFERENCES DimOrder(ID)
)
I'd like to create a partition-aligned index on the OrderID column, and as I understood from BOL, I need to include the partitioning key (PeriodDate) in order to have the index aligned.
Like this:
CREATE NONCLUSTERED INDEX MyAlignedOrderIdIndex
ON MyFactTable (OrderID, PeriodDate);
My question is: in what order should I put the two columns in the index above?
ON MyFactTable (OrderID, PeriodDate);
or
ON MyFactTable (PeriodDate, OrderID);
I also read on BOL that column order matters in composite indexes, and my queries will usually use OrderID to look up Dim table data.
Putting OrderID first, then PeriodDate, seems the logical choice, but since I am not familiar with partitioning I don't know how it will "like it" when the table has millions of rows.
What do best practices dictate here?

My question is: In what order should I put the two columns in the index above?
(OrderID, PeriodDate). The index is there to enable retrieval of all the facts for a given set of OrderIDs, and if your partitions contain multiple PeriodDate values, an index with PeriodDate first wouldn't be as helpful.
The general rule of thumb here is that you don't partition by the leading column. That way you get partition elimination and the index order as fully independent access paths.
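To make the column order concrete, here is a minimal sketch of an aligned index on the table from the question. The partition function pfPeriod, the scheme psPeriod, and the boundary dates are hypothetical, and the sketch assumes MyFactTable was created on psPeriod (PeriodDate). As far as I know, SQL Server aligns a non-unique nonclustered index with its table by default and carries the partitioning column along if it is not already in the index, so listing PeriodDate explicitly mainly controls its position in the key.

```sql
-- Hypothetical partition function and scheme; the boundary dates are placeholders.
CREATE PARTITION FUNCTION pfPeriod (DATE)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME psPeriod
AS PARTITION pfPeriod ALL TO ([PRIMARY]);

-- Assumes MyFactTable was created ON psPeriod (PeriodDate).
-- OrderID leads so that lookups by order can seek within each partition;
-- PeriodDate follows as the partitioning column.
CREATE NONCLUSTERED INDEX MyAlignedOrderIdIndex
ON MyFactTable (OrderID, PeriodDate)
ON psPeriod (PeriodDate);

-- Check: an aligned index reports the same partition scheme as the table itself.
SELECT i.name AS index_name, ds.name AS data_space, ds.type_desc
FROM sys.indexes AS i
JOIN sys.data_spaces AS ds ON ds.data_space_id = i.data_space_id
WHERE i.object_id = OBJECT_ID(N'MyFactTable');
```

A query that filters on both OrderID and PeriodDate can then eliminate partitions by date and seek on OrderID inside the partitions that remain.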
My dimension will have a dozen or at most a hundred rows. The fact table will have millions of rows. Is it worth creating this index?
You'll have to test. It really depends on the queries. However, if your fact table is a clustered columnstore (which a fact table with millions of rows typically should be), you'll probably find that the index is not used much for the query workload. I.e., it may be used for queries for a single product, but not for queries that filter by non-key product attributes.

Related

How to turn a non-clustered index into a covering index

I googled covering index and found:
"A covering index is a special type of composite index, where all of the columns exist in the index."
I understand the main purpose is to keep a non-clustered index from having to look up the clustered index. In SQL Server we can use INCLUDE columns when creating an index, so SQL Server adds them at the leaf level of the index and there is no need to look anything up via the clustered index.
But imagine we have a Customers table (CustID, FirstName, City) that has a clustered index on CustID.
If we create a non-clustered index (called IX_FIRSTNAME) on the FirstName column, so that this column is stored in the leaf nodes of the index, and query as:
select FirstName from Customers where FirstName Like 'T%';
then there is no need to look up through the clustered index. So can IX_FIRSTNAME be considered a covering index?
Or does it have to meet the requirement for all columns?
Do we need to create a non-clustered index on all three columns for it to be a covering index?
There are two concepts here:
clustered versus non-clustered indexes
covering indexes
A covering index is one that contains every column a query references, in the select list as well as the where clause, so the query can be answered from the index alone. As you are likely to have more than one query running against your data, a given index may be "covering" for one query, but not for another.
In your example, IX_FIRSTNAME is a covering index for the query
select FirstName from Customers where FirstName Like 'T%';
but not for the query
select FirstName from Customers where FirstName Like 'T%' and City Like 'London';
A lot of performance optimization boils down to "understand your queries, especially the where clause, and design covering indexes". Creating indexes for every possible combination of columns is a bad idea as indexes do have a performance cost of their own.
The second concept is "clustered" versus "non-clustered" indexes. This is more of a physical concern - with a clustered index, the data is stored on disk in the order of the index. This makes it great for creating an index on the primary key of a table (if your primary key is an incrementing integer). In your example, you would create a clustered index on custid, which would be covering the following query:
select FirstName from Customers where custid = 12
It would also help joins (e.g. from customer to order).
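As a rough illustration of the covering idea, the sketch below adds City as an included column so that the second query no longer needs a lookup. The index name IX_FIRSTNAME_CITY is made up for this example, and the schema is assumed to be the Customers(CustID, FirstName, City) table from the question.

```sql
-- Assumed schema: Customers(CustID, FirstName, City), clustered on CustID.
-- FirstName is the key column (used for the seek); City is stored only at the leaf level.
CREATE NONCLUSTERED INDEX IX_FIRSTNAME_CITY
ON Customers (FirstName)
INCLUDE (City);

-- Both predicates and the select list can now be answered from the index alone.
SELECT FirstName
FROM Customers
WHERE FirstName LIKE 'T%' AND City LIKE 'London';
```

CustID does not need to be listed: the clustered index key is carried in every non-clustered index, so a query that also selects CustID would still be covered.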

Is there any advantage in creating a clustered index - if we are not going to query/search for records based on that column?

I am reviewing some DB tables that were created in our project and came across this. The table contains an identity column (ID), which is the primary key for the table, and a clustered index has been defined on this ID column. But when I look at the stored procedure that retrieves records from this table, I see that the ID column is never used in the query. Instead, records are queried by a USERID column, which is not unique: there can be multiple records for the same USERID.
So my question is: is there any advantage or purpose in creating a clustered index when we know that the records won't be queried by that column?
If the IDENTITY column is never used in WHERE and JOIN clauses, or referenced by foreign keys, perhaps USERID should be a clustered primary key. I would question the need for the ID column at all in that case.
The best choice for the clustered index depends much on how the table is queried. If the majority of queries are by USERID, then it should probably be a unique clustered index (or clustered unique constraint) and the ID column non-clustered.
Keep in mind that the clustered index key is implicitly included in all non-clustered indexes as the row locator. The implication is that non-clustered indexes are more likely to cover queries, and that non-clustered index leaf pages are wider as a result.
I would say your table is mis-designed. Someone apparently thought every table needs a primary key and the primary key is the clustered index. Adding a system-generated unique number as an identifier just adds noise if that number isn't used anywhere. Noise in the clustered index is unhelpful, to say the least.
They are different concepts, by the way. A primary key is a data modeling concern, a logical concept. An index is a physical design issue. A SQL DBMS must support primary keys, but need not have any indexes, clustered or no.
If USERID is what is usually used to search the table, it should be in your clustered index. The clustered index need not be unique and need not be the primary key. I would look at the data carefully to see if some combination of USERID and another column (or two, or more) form a unique identifier for the row. If so, I'd make that the primary key (and clustered index), with USERID as the first column. If query analysis showed that many queries use only USERID and nothing else (for existence testing) I might create a separate index just of USERID.
If no combination of columns constitutes a unique identifier, you have a logical problem, to wit: what does the row mean? What aspect of the real world does it represent?
A basic tenet of the Relational Model is that elements in a relation (rows in a table) are unique, that each one identifies something. If two rows are identical, they identify the same thing. What does it mean to delete one of them? Is the thing that they both identify still there, or not? If it is, what purpose did the 2nd row serve?
I hope that gives you another way to think about clustered indexes and keys. I wouldn't be surprised if you find other tables that could be improved, too.
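The suggestion above to cluster on USERID could look roughly like the following. This is only a sketch: the table dbo.UserRecords, its Detail column, and the constraint and index names are hypothetical, and appending ID to the clustering key is just one way to make it unique when no natural combination of columns is available.

```sql
-- Hypothetical table standing in for the one under review.
CREATE TABLE dbo.UserRecords
(
    ID INT IDENTITY(1,1) NOT NULL,
    USERID INT NOT NULL,
    Detail NVARCHAR(100) NULL,
    -- Keep ID as the primary key, but do not cluster on it.
    CONSTRAINT PK_UserRecords PRIMARY KEY NONCLUSTERED (ID)
);

-- Cluster on the column the stored procedure filters by.
-- USERID alone is not unique, so ID is appended to make the key unique.
CREATE UNIQUE CLUSTERED INDEX CIX_UserRecords_USERID
ON dbo.UserRecords (USERID, ID);
```

With this layout all rows for one USERID sit next to each other, so a query filtering on USERID can be served by a range seek on the clustered index.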

Which index is better non-clustered vs clustered in this case?

I have a table which has 4 columns (region_id, product_id, cate_id, month_id) as a primary key.
This primary key was created with the defaults, so a clustered index was created for the PK.
This table contains more than 10 million rows.
If I drop the existing PK and create a new PK with a non-clustered index, will that be better than the clustered index for the following query?
select region_id, product_id, cate_id, month_id, a, b, c
from fact_a
where month_id > 100
Thanks in advance.
A simple nonclustered index on month_id will certainly improve the average performance for that query (assuming month_id for most of the rows is less than 100, so that the where clause excludes most of the rows). However, if you're creating the index specifically for that query (or any queries with month_id in the where clause and a, b, c, month_id or a subset of those in the select), you will get even better results by including the selected values in the index, like this:
CREATE INDEX index_fact_a_month_id ON fact_a (month_id) INCLUDE (a,b,c)
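One way to check whether such an index pays off is to compare logical reads before and after creating it. A rough sketch, using the query from the question: because the clustered primary key columns (region_id, product_id, cate_id, month_id) are stored in every nonclustered index as the row locator, the index above with a, b, c included should cover the whole select list.

```sql
-- Report logical reads per table in the Messages output.
SET STATISTICS IO ON;

SELECT region_id, product_id, cate_id, month_id, a, b, c
FROM fact_a
WHERE month_id > 100;

SET STATISTICS IO OFF;
```

Run it once without the index and once with it. If most rows satisfy month_id > 100, the optimizer may still prefer a scan, and the read counts will show that.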
The quick answer: yes, removing the primary key (or better, replacing the current multi-column primary key with a single identity column) and then creating your NCI on Month_ID will be better, faster, and more efficient.
Clustered Index - it IS the data. It contains every column of every row in the table. There can only be one CI because the table data only needs to exist once. Each row has a key...
Primary Key - it is the key that identifies a row; here it is also the key of the Clustered Index.
Non-Clustered Index - it acts as a table of a subset of columns from the rows in the Clustered Index.
Keeping it simple, a Non-Clustered Index contains less data than the Clustered Index, and it orders the data in a way (Month_ID ASC) that makes queries against it much more efficient than querying against the CI (region_id, product_id, cate_id, month_id). Because month_id is the last key column of the CI, SQL Server has no way to "dip" into the CI and say, "Hey, I'm filtering by Month_ID, so I'll just go right to that column." Instead, SQL Server "reads" all CI rows (index scan), every column, every byte of data. That is very inefficient and wasteful, since your WHERE clause will be filtering out a lot of these rows.
The Non Clustered Index only contains a subset of columns, so it is much more efficient in that it can say, "Hey, I'm filtering by Month_ID, and I only contain Month_ID, aaannnd Month_ID is in ascending order, so I can just jump right to the rows that I want!" (index seek). Much more efficient since only the rows you want to return will be "read" by SQL Server.
Getting a little more advanced: since the Non Clustered Index is only Month_ID, but you are querying for other columns as well, SQL Server needs to be able to go back to the CI from the NCI to get the rest of the columns. To do that, the key of the CI (here the Primary Key) is stored in the NCI, along with the column subset. So the NCI is really like a two-column table of (Month_ID, CI key).
If your Primary Key is monstrous, your NCIs will also be monstrous, and therefore less efficient (more disk reads, more buffer pool consumption, bad database stuff).
Disclaimer: there can be specific scenarios where you want every column to be in the clustered index key/PK. I don't sense that is applicable here, but it is possible. If you have a heavily used query that refers to every column of the table in where clauses or joins, then a covering clustered index may be beneficial.

Will adding a clustered index to an existing table improve performance?

I have a SQL 2005 database I've inherited, with a table that has grown to about 17 million records over the course of about 15 years, and is now horribly slow.
The table layout looks about like this:
id_column = nvarchar(20), indexed, not unique
column2 = nvarchar(20), indexed, not unique
column3 = nvarchar(10), indexed, not unique
column4 = nvarchar(10), indexed, not unique
column5 = numeric(8,0), indexed, not unique
column6 = numeric(8,0), indexed, not unique
column7 = nvarchar(20), indexed, not unique
column8 = nvarchar(10), indexed, not unique
(and about 5 more columns that look pretty much the same, not indexed)
The 'id' field is a value entered in a front-end application by the end-user.
There are no defined primary keys, and no columns that can be combined to make a unique row (unless all columns are combined). The table actually is a 'details' table to another table, but there are no constraints ensuring referential integrity.
Every column is heavily used in 'where' clauses in queries, which is why I assume there's an index on every one, or perhaps a desperate attempt to speed things up by another DBA.
Having said all that, my question is: could adding a clustered index do me any good at this point?
If I did add a clustered index, I assume it would have to be on a new column, i.e., an identity column? Basically, is it worth the trouble?
Appreciate any advice.
I would say only add the clustered index if there is a reason for needing it. So ask these questions:
Does the order of the data make sense?
Is there sequential value to the way the data is inserted?
Do I need to use a feature that requires it have a clustered index, such as full text index?
If the answer to all of these questions is "No", then a clustered index might not be of any additional help over a good non-clustered index strategy. Instead you might want to consider how and when you update statistics, when you rebuild the indexes, and whether or not filtered indexes make sense in your situation. Looking at the table you have as an example, it's tough to say, but maybe it makes sense to normalize the table further and use numeric keys instead of nvarchar.
http://www.mssqltips.com/sqlservertip/3041/when-sql-server-nonclustered-indexes-are-faster-than-clustered-indexes/
This article is a great example of when non-clustered indexes might make more sense.
I would recommend adding a clustered index, even if it's on an identity column, for 3 reasons:
Assuming that your existing queries have to go through the entire table every time, a clustered index scan is still faster than a table scan.
The table is a child of some other table. With some extra work, you can use the new child_id to join against the parent table. This enables a clustered index seek, which is a lot faster than a scan in some cases.
Depending on how they are set up, the existing indexes may not do much good. I've come across some terrible indexes that cover 1 column each, or indexes that don't include the appropriate columns, causing costly Key Lookup operations. Check your index stats to see if they are being used at all.
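A minimal sketch of both suggestions, under the assumption that the table is called dbo.DetailTable (a hypothetical name, since the question does not give one) and that a new identity column named row_id is acceptable. Adding the column and building the clustered index rewrite the whole table, so on 17 million rows this belongs in a maintenance window.

```sql
-- Hypothetical table and column names.
-- Add a narrow, ever-increasing surrogate column and cluster on it.
ALTER TABLE dbo.DetailTable
ADD row_id INT IDENTITY(1,1) NOT NULL;

CREATE UNIQUE CLUSTERED INDEX CIX_DetailTable_row_id
ON dbo.DetailTable (row_id);

-- Check whether the existing single-column indexes are used at all.
-- Counters are reset when the instance restarts.
SELECT i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.DetailTable');
```

Indexes that show many user_updates but no seeks, scans, or lookups are candidates for removal, since they cost write time without helping any query.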

which execution plan has better performance?

(my english is not good enough. so bear with me)
I am working on optimizing this query:
SELECT (Select Title from Groups Where GroupID = WWPMD.GroupID) as GroupName
FROM dbo.Detail AS WWPMD INNER JOIN
dbo.News AS WWPMN ON WWPMD.DetailID = WWPMN.NewsID
WHERE
WWPMD.IsDeleted = 0 AND WWPMD.IsEnabled= 1 AND WWPMD.GroupID IN (
Select ModuleID FROM Page_Content_Item WHERE ContentID='a6780e80-4ead4e62-9b22-1190bb88ed90')
In this case, the tables have clustered indexes on their primary keys, which are GUIDs. In the execution plan the arrows were a little thick, the cost of the clustered index seek on table "Detail" was 87%, and there was no cost for a key lookup.
Then I changed the indexes on table "Detail": I put the clustered index on a datetime column and created 3 nonclustered indexes on the PK and FKs. Now in the execution plan the index seek cost for table "Detail" is 4 percent and the key lookup is 80 percent, with thin arrows.
I want to know which execution plan is better, and what I can do to improve this query.
UPDATE:
Thank you all for your guidance. One more question: I want to know which is better, an 80% cost on a clustered index seek, or an 80% total cost on a non-clustered index seek plus key lookup?
IN is fine for selecting a small amount of data. If you are selecting more data, you should use INNER JOIN; it performs better than IN for large data sets.
IN is faster than a JOIN on DISTINCT.
EXISTS is more efficient than IN, because "EXISTS returns only one row".
A clustered index on a GUID column is not a good idea when your GUIDs are not sequential, since this will lead to performance loss on inserts. The records in your table are physically ordered based on the clustered index. The clustered index should be put on a column which has sequential values and doesn't change (often).
If you have a nonclustered index on GroupID (table Groups), then you could make Title an included column on that index. See MSDN for included columns.
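A sketch of that included-column suggestion, assuming Groups has a nonclustered index on GroupID; the index name IX_Groups_GroupID is made up for this example. If GroupID is already the clustered key of Groups, as the question implies for primary keys, this index adds little, because the clustered index already contains Title.

```sql
-- Hypothetical index name; the column names come from the query in the question.
-- Title is stored at the leaf level, so the GroupName subquery
-- can be answered from this index without a key lookup.
CREATE NONCLUSTERED INDEX IX_Groups_GroupID
ON Groups (GroupID)
INCLUDE (Title);
```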
I suggest using the following join:
INNER JOIN (SELECT DISTINCT ModuleID FROM Page_Content_Item WHERE ContentID='a6780e80-4ead4e62-9b22-1190bb88ed90') z ON WWPMD.GroupID = z.ModuleID
Instead of:
AND WWPMD.GroupID IN (
Select ModuleID FROM Page_Content_Item WHERE ContentID='a6780e80-4ead4e62-9b22-1190bb88ed90')
You should also examine the execution plan of this query. It seems that an index on Detail.DetailID with a filter (IsDeleted = 0 AND IsEnabled = 1) would be very useful.
Indexes will not be particularly useful unless the leftmost indexed columns are specified in JOIN and/or WHERE clauses. I suggest the indexes below (unique when possible): dbo.Detail(GroupID), dbo.News(DetailID), dbo.Page_Content_Item(ContentID)
You can fine-tune these indexes with included columns and/or filters, but I suspect performance may be good enough without those measures.
Note that primary keys are important. Not only is that a design best practice, a unique index will automatically be created on the key column(s), which can improve performance of joins to the table on related columns. Consider reviewing your model to ensure you have proper primary keys and relationships.
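Two of the suggested indexes might be written as follows. The index names are hypothetical, the column names are taken from the query in the question, and the filter requires SQL Server 2008 or later. Treat this as a starting point to test against the actual execution plan, not a definitive design.

```sql
-- Filtered index: only rows that are enabled and not deleted are indexed,
-- which matches the WHERE clause of the query.
CREATE NONCLUSTERED INDEX IX_Detail_GroupID_Active
ON dbo.Detail (GroupID)
WHERE IsDeleted = 0 AND IsEnabled = 1;

-- Supports the subquery: seek on ContentID, return ModuleID from the leaf level.
CREATE NONCLUSTERED INDEX IX_Page_Content_Item_ContentID
ON Page_Content_Item (ContentID)
INCLUDE (ModuleID);
```

If the Detail query still shows an expensive key lookup, adding the selected columns to the first index with INCLUDE is the usual next step.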
