Two non-clustered indexes with the same attributes as key - sql-server

I have two indexes like this.
Index #1:
CREATE NONCLUSTERED INDEX [index1]
ON [dbo].[table1] ([column1], [column2], [column3])
INCLUDE ([column4], [column5], [column6]) WITH (ONLINE = ON)
Index #2:
CREATE NONCLUSTERED INDEX [index2]
ON [dbo].[table1] ([column2], [column1], [column3])
INCLUDE ([column4], [column5]) WITH (ONLINE = ON)
The question is: since both indexes have the same key columns (but in a different sequence) and index1 includes more columns than index2, would you say that index2 could be removed and all the queries that it was serving would be served by index1?
Any help will be greatly appreciated!

Would you say that index2 could be removed and all the queries that it was serving would be served by index1?
No, not necessarily: index1 cannot efficiently serve every query that used index2.
An index is sorted by its key columns, in the order they are listed. So indexes with different key orders, even if they contain the same set of key columns, are used differently by the query optimizer. A query that filters only on column2 can seek on index2, but could at best scan index1.
But having overlapping indexes (indexes with the same set of key columns) is not optimal. I would identify the queries that use the indexes and, where possible, adjust their join conditions and WHERE clause predicates so that a single index serves them all, which would let you eliminate the other index.
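To make the key-order point concrete, here is a minimal sketch against the table from the question. The literal values are made up and the columns are assumed to be numeric; the final query reads the index usage DMV, whose counters reset when the instance restarts, so treat its output as a hint rather than proof.

```sql
-- Filters on column1, the leading key of index1: index1 supports a seek.
SELECT column4, column5
FROM dbo.table1
WHERE column1 = 10;

-- Filters only on column2, the leading key of index2.
-- index2 supports a seek; index1 can at best be scanned end to end,
-- because its rows are ordered by column1 first.
SELECT column4, column5
FROM dbo.table1
WHERE column2 = 20;

-- Equality on both leading columns: either index supports a seek,
-- so a query of this shape would not miss index2.
SELECT column4, column5
FROM dbo.table1
WHERE column1 = 10
  AND column2 = 20;

-- How often each index on the table has been used for seeks, scans and lookups.
SELECT i.name AS index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.table1');
```

If index2 shows seeks that come from queries of the second shape, dropping it would probably turn those seeks into scans of index1.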

Related

execution plan suggesting to add an index on columns which are not part of where clause

I am running the following query in SSMS, and the execution plan suggests adding an index that covers columns which are not part of the WHERE clause. I was planning to add an index on the two columns used in the WHERE clause (OID and TransactionDate).
SELECT
[OID] , -- this is not a PK. The primary key column is not part of the SQL script
[CustomerNum] ,
[Amount] ,
[TransactionDate] ,
[CreatedDate]
FROM [dbo].[Transaction]
WHERE OID = 489
AND TransactionDate > '01/01/2018 06:13:06.46';
Index suggestion
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Transaction] ([OID],[TransactionDate])
INCLUDE ([CustomerNum],[Amount],[CreatedDate])
Updated
Do I need to include the other columns? Data is imported into that table by a back-end process using the SqlBulkCopy class in .NET, and I am wondering whether a non-clustered index over all columns would reduce import performance. (The table has a PK column called TransactionID which is not needed right now; I keep it in case it's needed in the future, although SqlBulkCopy works better with a heap. The other option is to drop the indexes before the SqlBulkCopy operation and recreate them afterwards.)
The INCLUDE keyword specifies the non-key columns to be added to the leaf level of the nonclustered index.
This means that if you add this index and run the query again, SQL Server can get all the information it needs from the index, eliminating the need to perform a lookup in the table as well.
As a general rule of thumb: when SSMS suggests an index, create it. You can always drop it later if it doesn't help.
You don't need to add all table columns to your non-clustered index; the suggested index is good for the query provided. SQL Server database engine suggestions are usually really good.
The key columns are what make a NONCLUSTERED INDEX SEEK possible; the INCLUDE columns are what avoid the KEY LOOKUP.
All in all: with no NONCLUSTERED INDEX, you get a Clustered Index Scan (or a Table Scan if the table is a heap).
A NONCLUSTERED INDEX with no included columns results in a NONCLUSTERED INDEX SEEK plus a lookup for each matching row.
A NONCLUSTERED INDEX with included columns results in a NONCLUSTERED INDEX SEEK alone.
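One way to see the difference for yourself is to create both variants and compare the plans and logical reads. This is only a sketch: the index names are invented for illustration, the INDEX hints are there purely to force each variant for the comparison, and the date literal is rewritten in the language-neutral yyyymmdd form.

```sql
SET STATISTICS IO ON;  -- reports logical reads per table in the Messages tab

-- Variant 1: key columns only (hypothetical name).
CREATE NONCLUSTERED INDEX IX_Transaction_OID_Date_KeysOnly
ON dbo.[Transaction] (OID, TransactionDate);

-- Variant 2: same keys plus the remaining selected columns at the leaf level.
CREATE NONCLUSTERED INDEX IX_Transaction_OID_Date_Covering
ON dbo.[Transaction] (OID, TransactionDate)
INCLUDE (CustomerNum, Amount, CreatedDate);

-- Expect a seek on the index plus a lookup for every matching row.
SELECT OID, CustomerNum, Amount, TransactionDate, CreatedDate
FROM dbo.[Transaction] WITH (INDEX (IX_Transaction_OID_Date_KeysOnly))
WHERE OID = 489
  AND TransactionDate > '20180101 06:13:06.460';

-- Expect a seek only: every selected column is available in the index.
SELECT OID, CustomerNum, Amount, TransactionDate, CreatedDate
FROM dbo.[Transaction] WITH (INDEX (IX_Transaction_OID_Date_Covering))
WHERE OID = 489
  AND TransactionDate > '20180101 06:13:06.460';

-- Keep only the variant that pays for itself on the bulk-load side.
DROP INDEX IX_Transaction_OID_Date_KeysOnly ON dbo.[Transaction];
```

Each extra index adds work to every bulk insert, so it is worth measuring the load time as well as the query.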

Understanding Include on Index

I have the below query:
USE [AxReports]
GO
DECLARE @paramCompany varchar(3)
SET @paramCompany = 'adf'
SELECT stl.MAINSALESID,
st.DATAAREAID,
Sum(sl.SALESQTY) as 'Quantity',
Sum(sl.SALESQTY * sl.SALESPRICE) as 'SalesValue'
INTO #openrel
FROM
DynamicsV5Realtime.dbo.SALESTABLE st
INNER JOIN
DynamicsV5Realtime.dbo.SALESLINE sl
ON
sl.SALESID = st.SALESID
and sl.DATAAREAID = st.DATAAREAID
INNER JOIN
DynamicsV5Realtime.dbo.INVENTTABLE it
ON
it.ITEMID = sl.ITEMID
and it.DATAAREAID = sl.DATAAREAID
INNER JOIN
DynamicsV5Realtime.dbo.SALESTABLELINKS stl
ON
stl.SUBSALESID = st.SALESID
and stl.DATAAREAID = st.DATAAREAID
WHERE
st.DATAAREAID = @paramCompany
and st.SALESTYPE = 3 -- Release Order
and st.SALESSTATUS = 1
and sl.SALESSTATUS <> 4
and it.ITEMGROUPID <> 'G0022A'
GROUP BY
stl.MAINSALESID,
st.DATAAREAID
My execution plan is recommending an index of :
USE [DynamicsV5Realtime]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
GO
However I already have an index on that table that is similar which the plan is using but performs a table scan against it. The current index is below:
CREATE NONCLUSTERED INDEX [I_ITEMGROUPIDX] ON [dbo].[INVENTTABLE]
(
[ITEMID] ASC,
[DATAAREAID] ASC
)
INCLUDE ( [ITEMGROUPID])
GO
My understanding is that you should only put things in as included columns when you are not bothered about them being sorted at the leaf level (I think that's correct?).
In this case the WHERE clause has it.ITEMGROUPID <> 'G0022A', so putting that in as a key column would make sense, as it will be quicker to seek on that column in order (again, I think I am right in saying that?).
However, what about the joins: why does it recommend putting the ITEMID column in as an include but not the DATAAREAID column? ITEMID and DATAAREAID make up the PK in this case, so is it something to do with not needing to sort both columns? And would using the existing index, but with ITEMGROUPID as a key column, perhaps be a better solution than adding a new index? (That's something I can test, I suppose.)
Thanks
Let's consider this table in relative isolation first; that is we'll only pay attention to those parts of the query where it is directly mentioned.
Executing the query needs to do the following:
Find all rows in INVENTTABLE where the ITEMGROUPID column is not equal to 'G0022A'.
Find the values of the DATAAREAID and ITEMID columns in those rows, for use in finding the necessary rows in SALESLINE.
The best index for doing part one is one that has a key on ITEMGROUPID but no other columns. Such a key (we'll ignore included columns for now) would enable an index seek to find the relevant rows and those only.
If there was no such index but there was an index that had ITEMGROUPID as one of its columns, then that index could be used in an index scan instead, though not quite as efficiently.
Now, when we come to considering the second part, the only values we actually care about getting from the row are DATAAREAID and ITEMID.
If those fields were included, then they can be read straight from the index, with no lookup into the table.
If they are actually parts of the key, or one of them is and the other is included, then that index can supply them in the same way.
So. At this point, considering only those aspects we said we would consider at this point and ignoring other considerations (index size, cost of inserts, etc), then any of the following indices would be useful here:
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID],[ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID])
INCLUDE ([DATAAREAID],[ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID])
INCLUDE ([ITEMID],[DATAAREAID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMID],[ITEMGROUPID])
INCLUDE ([DATAAREAID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID])
INCLUDE ([ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[ITEMID])
INCLUDE ([DATAAREAID])
Each one of these indices contains ITEMGROUPID as all or part of the key and both ITEMID and DATAAREAID as either part of the key, or as an included column.
Note that the index you do have is the opposite of this; it has the column that would ideally be a key as an included column, and the others as part of the key. It's better than nothing and the query planner can re-jigger things to make use of it, but it's not the ideal index for what we've determined we want.
Now, let's consider the query as a whole.
Note that we will be searching SALESTABLE based on its DATAAREAID column.
Note that SALESLINE is joined to that column on its own DATAAREAID column.
Note that INVENTTABLE is in turn joined to that column on SALESLINE based on its own DATAAREAID column.
From this we can deduce that we logically only want those records from INVENTTABLE that have the value @paramCompany in their DATAAREAID column.
And the planner made that deduction.
So, considering the query as a whole, we can change our two actions above to:
Find all rows in INVENTTABLE where the ITEMGROUPID column is not equal to 'G0022A' and where DATAAREAID is equal to @paramCompany.
Find the values of the DATAAREAID (already got in step 1) and ITEMID columns in those rows.
Hence the ideal index for this would be either:
CREATE NONCLUSTERED INDEX [someName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID])
INCLUDE ([ITEMID])
GO
OR
CREATE NONCLUSTERED INDEX [someName]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
GO
(Or one that includes all three in the key, but there are other reasons not to have a large key if you don't actually need it).
And the second is indeed what you were advised to do.
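As a rough way to check this in isolation, the recommended index can be created under a real name and the INVENTTABLE part of the query run on its own. The index name below is invented for illustration, and the standalone SELECT is only an approximation of what the full query asks of this table.

```sql
USE [DynamicsV5Realtime]
GO
-- Hypothetical name; keys and include are those of the recommendation.
CREATE NONCLUSTERED INDEX [IX_INVENTTABLE_DataArea_ItemGroup]
ON [dbo].[INVENTTABLE] ([DATAAREAID], [ITEMGROUPID])
INCLUDE ([ITEMID])
GO

DECLARE @paramCompany varchar(3)
SET @paramCompany = 'adf'

-- Equality on the leading key (DATAAREAID) allows a seek; the <> test on
-- ITEMGROUPID can then be applied within that range, and ITEMID is read
-- from the leaf level, so the base table does not need to be touched.
SELECT it.ITEMID,
       it.DATAAREAID
FROM dbo.INVENTTABLE AS it
WHERE it.DATAAREAID = @paramCompany
  AND it.ITEMGROUPID <> 'G0022A'
GO
```

Comparing the actual plan of this statement before and after creating the index should show whether the scan of I_ITEMGROUPIDX turns into a seek on the new index.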
This should be easy to Google, but I would say to basically just have the columns that are used in joins as index keys and add the returned columns as included columns, so that there is no need to do a lookup on the actual table (everything needed is in the index).
I would say recommendations can be more or less reliable, perhaps due to bad statistics or whatever, so don't blindly rely on them. Also, I believe indexes often don't help much when the operator is '<>': the index can still be used, but such a predicate usually matches most of the rows, so a scan is likely.

Why does performance degrade when using a non-indexed field in the SELECT clause?

Consider these three queries:
select sampleno from sample
where markupdate > '1/1/2010'
select sampleno, markupdate from sample
where markupdate > '1/1/2010'
select sampleno, markuptime from sample
where markupdate > '1/1/2010'
sampleno and markupdate are indexed fields (sampleno is the primary key)
markuptime is not indexed
Queries 1 and 2 take about 1 second to run (returning 237K rows). Query 3 is still running after 3 minutes.
Why would the inclusion of a non-indexed field in the SELECT clause cause such a performance degradation?
This is a SQL 6.5 database.
A table's data (basically: all columns) is stored in a clustered index. A clustered index is a B-tree (a balanced tree) that allows a fast search on the indexed column(s). It is special (clustered) in that it contains all other columns at the leaf level. Usually, the clustered index is also the primary key. In your case, it's:
(sampleno) include (markupdate, markuptime, ...)
A non-clustered index contains the indexed column(s) and (at the leaf level) the clustered index key. When you use a non-clustered index, the database has to look up all the other columns in the clustered index. That process is called a lookup. In your case, the non-clustered index on (markupdate) is:
(markupdate) include (sampleno)
This index contains all data for a query on markupdate, sampleno. The technical term for such an index is a covering index. But when you add markuptime to the query, the index is no longer covering. The database has to look up the value for markuptime in the clustered index, once per matching row. And lookups are expensive.
Only your third query requires lookups. And that's why your third query is slower.
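If the third query matters, one option is to make the index cover it. The sketch below shows two ways; the index names are invented, and the first form assumes SQL Server 2005 or later, because the INCLUDE clause does not exist in 6.5. On 6.5 the closest equivalent is to make markuptime a second key column.

```sql
-- SQL Server 2005 and later (assumption): keep the key narrow and carry
-- markuptime at the leaf level only.
CREATE NONCLUSTERED INDEX IX_sample_markupdate_covering
ON sample (markupdate)
INCLUDE (markuptime);

-- Older versions without INCLUDE: add markuptime to the key instead.
CREATE NONCLUSTERED INDEX IX_sample_markupdate_markuptime
ON sample (markupdate, markuptime);

-- With either index in place, this query can be answered from the index
-- alone, provided sampleno is the clustered key as described above.
select sampleno, markuptime from sample
where markupdate > '1/1/2010'
```

Only one of the two indexes is needed; create whichever your version supports.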

sql server multi column index queries

Suppose I have created a single index on two columns, [lastName] and [firstName], in that order, and I then run a query to find the number of people with the first name daniel:
SELECT count(*)
FROM people
WHERE firstName = N'daniel'
Will this search in each section of the first index key (lastName) and use the secondary key (firstName) to quickly search through each of the blocks of lastName entries?
This seems like an obvious thing to do and I assume that it is what happens but you know what they say about assumptions.
Yes, this query may, and probably will, use this index (doing an Index Scan) if the query optimizer thinks that it's better to "quickly search through each of the blocks of LastName entries", as you say, than to do a full scan of the table.
An index on (firstName) would be more efficient for this particular query, though, so if there is such an index, SQL Server will use that one (and do an Index Seek).
Tested in SQL-Server 2008 R2, Express edition:
CREATE TABLE Test.dbo.people
( lastName NVARCHAR(30) NOT NULL
, firstName NVARCHAR(30) NOT NULL
) ;
INSERT INTO people
VALUES
('Johnes', 'Alex'),
... --- about 300 rows
('Johnes', 'Bill'),
('Brown', 'Bill') ;
Query without any index, Table Scan:
SELECT count(*)
FROM people
WHERE firstName = N'Bill' ;
Query with index on (lastName, firstName), Index Scan:
CREATE INDEX last_first_idx
ON people (lastName, firstName) ;
SELECT ...
Query with index on (firstName), Index Seek:
CREATE INDEX first_idx
ON people (firstName) ;
SELECT ...
If you have an index on (lastname, firstname), in this order, then a query like
WHERE firstname = 'daniel'
can't use the index for a seek, as long as you don't include the first column of the composite index (i.e. lastname) in the WHERE clause. To efficiently search for firstname only, you will need a separate index on that column.
If you frequently search on both columns, create two separate single-column indexes. But keep in mind that each index has to be updated on every insert/update, which affects write performance.
Also, avoid composite indexes if they aren't covering indexes at the same time. For tips regarding composite indexes, see the following article at sql-server-performance.com:
Tips on Optimizing SQL Server Composite Indexes
Update (to address downvoters):
In this specific case of SELECT Count(*) the index is a covering index (as pointed out by @ypercube in the comment), so the optimizer may choose it for execution. Using the index in this case means an Index Scan and not an Index Seek.
Doing an Index Scan means reading every single row in the index. This is faster than scanning the table when the index is narrower than the table, that is, when it holds fewer columns and therefore fewer pages. If the index is nearly as wide as the table itself, there usually won't be a big difference between a Clustered Index Scan (which iterates over the clustered index, typically the PK) and a Non-Clustered Index Scan (which iterates over the index). A Table Scan (as seen in the screenshot of @ypercube's answer) means that there is no clustered index on the table, i.e. the table is a heap, so the rows are not stored in key order.
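The seek versus scan distinction discussed in this thread can be summarised with three queries against the test table above. This sketch assumes that only last_first_idx exists (first_idx has been dropped or was never created); the surname values are the ones from the test data.

```sql
-- Both key columns supplied: the index can be seeked directly.
SELECT count(*)
FROM people
WHERE lastName = N'Brown'
  AND firstName = N'Bill' ;

-- Only the leading key column supplied: still a seek, on the lastName range.
SELECT count(*)
FROM people
WHERE lastName = N'Brown' ;

-- Only the second key column supplied: the index is ordered by lastName first,
-- so every entry has to be read (Index Scan), even though the index covers
-- the query.
SELECT count(*)
FROM people
WHERE firstName = N'Bill' ;
```

Viewing the actual execution plan for each statement shows which operator the optimizer picked on your data.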

SQL Server Indexes

What is the need for a non-clustered index even though the table has a clustered index?
For optimal performance you have to create an index for every combination used in your queries. For instance if you have a select like this.
SELECT *
FROM MyTable
WHERE Col_1 = @SomeValue AND
Col_2 = @SomeOtherValue
Then you should do a clustered index with Col_1 and Col_2.
On the other hand if you have an additional query which only looks up one of the Column like:
SELECT *
FROM MyTable
WHERE Col_1 = @SomeValue
Then you should have an index with just the Col_1.
So you end up with two indexes. One with Col_1 and Col_2 and another with just Col_1.
The "need" is to do faster lookups of columns not included in the clustered index.
Don't get clustered indexes confused with indexes across multiple columns. That isn't the same thing.
Here's an article that does a good job of explaining clustered vs. non-clustered indexes.
In SQL Server you can only have one clustered index per table, and by default it's the primary key. A clustered index is "attached" to the table, so it doesn't need to go back to the table to get any other data elements that might be in the "select" clause. A non-clustered index is not attached, but contains a reference back to the table row with all the rest of the data.
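A minimal sketch of the two kinds of index side by side. The Id and Payload columns, the data types and the index and constraint names are all assumptions made up for illustration; only MyTable, Col_1 and Col_2 come from the example above.

```sql
-- The clustered index IS the table: its leaf level holds every column.
CREATE TABLE dbo.MyTable
( Id      int          NOT NULL
, Col_1   int          NOT NULL
, Col_2   int          NOT NULL
, Payload varchar(100) NULL
, CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (Id)
) ;

-- A non-clustered index is a separate structure ordered by its own keys.
-- Its leaf level stores Col_1, Col_2 and the clustered key (Id), which is
-- the "reference back to the table row".
CREATE NONCLUSTERED INDEX IX_MyTable_Col1_Col2
ON dbo.MyTable (Col_1, Col_2) ;

-- Lookups by Col_1 (with or without Col_2) can seek on the non-clustered
-- index, then follow Id into the clustered index for the remaining columns.
DECLARE @SomeValue int ;
SET @SomeValue = 1 ;

SELECT *
FROM dbo.MyTable
WHERE Col_1 = @SomeValue ;
```

Without the non-clustered index, the same query would have to scan the whole clustered index, because the table is ordered by Id and not by Col_1.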
