Understanding Include on Index - sql-server

I have the below query:
USE [AxReports]
GO
DECLARE @paramCompany varchar(3)
SET @paramCompany = 'adf'
SELECT stl.MAINSALESID,
st.DATAAREAID,
Sum(sl.SALESQTY) as 'Quantity',
Sum(sl.SALESQTY * sl.SALESPRICE) as 'SalesValue'
INTO #openrel
FROM
DynamicsV5Realtime.dbo.SALESTABLE st
INNER JOIN
DynamicsV5Realtime.dbo.SALESLINE sl
ON
sl.SALESID = st.SALESID
and sl.DATAAREAID = st.DATAAREAID
INNER JOIN
DynamicsV5Realtime.dbo.INVENTTABLE it
ON
it.ITEMID = sl.ITEMID
and it.DATAAREAID = sl.DATAAREAID
INNER JOIN
DynamicsV5Realtime.dbo.SALESTABLELINKS stl
ON
stl.SUBSALESID = st.SALESID
and stl.DATAAREAID = st.DATAAREAID
WHERE
st.DATAAREAID = @paramCompany
and st.SALESTYPE = 3 -- Release Order
and st.SALESSTATUS = 1
and sl.SALESSTATUS <> 4
and it.ITEMGROUPID <> 'G0022A'
GROUP BY
stl.MAINSALESID,
st.DATAAREAID
My execution plan is recommending an index of :
USE [DynamicsV5Realtime]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
GO
However I already have an index on that table that is similar which the plan is using but performs a table scan against it. The current index is below:
CREATE NONCLUSTERED INDEX [I_ITEMGROUPIDX] ON [dbo].[INVENTTABLE]
(
[ITEMID] ASC,
[DATAAREAID] ASC
)
INCLUDE ( [ITEMGROUPID])
GO
My understanding is that you should only make something an included column when you are not bothered about it being sorted at the leaf level (I think that's correct?).
In this case the WHERE clause has it.ITEMGROUPID <> 'G0022A', so putting that as a key column would make sense, as it will be quicker to seek on that column when it is in order (again, I think I am right in saying that?).
However, what about the joins? Why does it recommend putting the ITEMID column as an include but not the DATAAREAID column? ITEMID and DATAAREAID make up the PK in this case, so is it something to do with not needing to sort both columns? And would using the existing index, but with ITEMGROUPID as a key column, perhaps be a better solution than adding a new index? (That's something I can test, I suppose.)
Thanks

Let's consider this table in relative isolation first; that is we'll only pay attention to those parts of the query where it is directly mentioned.
Executing the query needs to do the following:
Find all rows in INVENTTABLE where the ITEMGROUPID column is not equal to 'G0022A'.
Find the values of the DATAAREAID and ITEMID columns in those rows, for use in finding the necessary rows in SALESLINE.
The best index for doing part one is one that has a key on ITEMGROUPID but no other columns. Such a key (we'll ignore included columns for now) would enable an index seek to find the relevant rows and those only.
If there was no such index but there was an index that had ITEMGROUPID as one of its columns, then that index could be used in an index scan instead, though not quite as efficiently.
Now, when we come to considering the second part, the only values we actually care about getting from the row are DATAAREAID and ITEMID.
If those fields were included, then they can be used in an index scan.
If they are actually parts of the key, or one of them is and the other is included, then that index can also be used for such an index scan.
So. At this point, considering only those aspects we said we would consider at this point and ignoring other considerations (index size, cost of inserts, etc), then any of the following indices would be useful here:
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID],[ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID])
INCLUDE ([DATAAREAID],[ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID])
INCLUDE ([ITEMID],[DATAAREAID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMID],[ITEMGROUPID])
INCLUDE ([DATAAREAID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID])
INCLUDE ([ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[ITEMID])
INCLUDE ([DATAAREAID])
Each one of these indices contains ITEMGROUPID as all or part of the key and both ITEMID and DATAAREAID as either part of the key, or as an included column.
Note that the index you do have is the opposite of this; it has the column that would ideally be a key as an included column, and the others as part of the key. It's better than nothing and the query planner can re-jigger things to make use of it, but it's not the ideal index for what we've determined we want.
Now, let's consider the query as a whole.
Note that we will be searching SALESTABLE based on its DATAAREAID column.
Note that SALESLINE is joined to that column on its own DATAAREAID column.
Note that INVENTTABLE is in turn joined to that column on SALESLINE based on its own DATAAREAID column.
From this we can deduce that we logically only want those records from INVENTTABLE that have the value @paramCompany in their DATAAREAID column.
And the planner made that deduction.
So, considering the query as a whole, we can change our two actions above to:
Find all rows in INVENTTABLE where the ITEMGROUPID column is not equal to 'G0022A' and where DATAAREAID is equal to @paramCompany.
Find the values of the DATAAREAID (already got in step 1) and ITEMID columns in those rows.
Hence the ideal index for this would be either:
CREATE NONCLUSTERED INDEX [someName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID])
INCLUDE ([ITEMID])
GO
OR
CREATE NONCLUSTERED INDEX [someName]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
GO
(Or one that includes all three in the key, but there are other reasons not to have a large key if you don't actually need it).
And the second is indeed what you were advised to do.
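One way to check this against your own data is to run only the INVENTTABLE part of the query with I/O statistics switched on, before and after creating the suggested index. This is a minimal sketch rather than a definitive test: the index name IX_INVENTTABLE_DATAAREAID_ITEMGROUPID is made up for illustration, and the literal 'adf' stands in for the company parameter from the question.

```sql
USE [DynamicsV5Realtime];
GO
-- Hypothetical index name; the key and included columns are the ones the plan suggested.
CREATE NONCLUSTERED INDEX [IX_INVENTTABLE_DATAAREAID_ITEMGROUPID]
ON [dbo].[INVENTTABLE] ([DATAAREAID], [ITEMGROUPID])
INCLUDE ([ITEMID]);
GO
-- Report logical reads per table on the Messages tab.
SET STATISTICS IO ON;

-- Only the part of the original query that touches INVENTTABLE.
SELECT it.ITEMID, it.DATAAREAID
FROM dbo.INVENTTABLE AS it
WHERE it.DATAAREAID = 'adf'        -- company code used in the question
AND it.ITEMGROUPID <> 'G0022A';

SET STATISTICS IO OFF;
GO
```

Comparing the logical reads and the plan operator (seek or scan) with and without the new index shows whether it is worth keeping alongside the existing one.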

This should be easy to Google, but I would say to basically just have the columns that are used in joins in the index and include the returned columns so that there is no need to do a lookup on the actual table (all is included in the index).
I would say recommendations can be more or less reliable, perhaps due to bad statistics or whatever, don't blindly rely on them. Also, I believe indexes can not be used when the operator is '<>'.
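On the '<>' point, it may help to see that a not-equal predicate on a column is logically the same as two ranges, one on either side of the excluded value, so an index keyed on that column is not automatically ruled out. The sketch below only spells out that equivalence for the INVENTTABLE filter from the question (rows where ITEMGROUPID is NULL are excluded by both forms). Whether the optimizer actually seeks or scans depends on the data and statistics, so check the plan rather than assume.

```sql
-- Form used in the question.
SELECT it.ITEMID, it.DATAAREAID
FROM dbo.INVENTTABLE AS it
WHERE it.DATAAREAID = 'adf'
AND it.ITEMGROUPID <> 'G0022A';

-- Logically equivalent form written as two ranges around the excluded value.
SELECT it.ITEMID, it.DATAAREAID
FROM dbo.INVENTTABLE AS it
WHERE it.DATAAREAID = 'adf'
AND (it.ITEMGROUPID < 'G0022A' OR it.ITEMGROUPID > 'G0022A');
```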

Related

Add all primary key constraints in non clustered indexes

I have created a table with the following columns. All columns are unique key (column) there is no primary key in my table.
Table Product:
Bat_Key,
product_no,
value,
pgm_name,
status,
industry,
created_by,
created_date
I have altered my table to add constraints
ALTER TABLE [dbo].[Product]
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY NONCLUSTERED ([Bat_Key] ASC, [product_no] ASC,
[value] ASC, [pgm_name] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
GO
And if I created indexes as below:
CREATE NONCLUSTERED INDEX [PRODUCT_BKEY_PNO_IDX]
ON [dbo].[PRODUCT] ([Bat_Key] ASC, [product_no] ASC, [value], [pgm_name])
INCLUDE ([status], [industry])
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
GO
Whether this design is good for the following select queries :
select *
from Product
where Bat_Key = ? and product_no=?
order by product_no, pgm_name;
select *
from Product
where Bat_Key = ? and product_no=? and pgm_name = ? and value = ?
select *
from Product
where Bat_Key = ? and product_no=?
delete from Product
where Bat_Key = ? and product_no=?
or should I create different indexes based on my where clauses?
A clustered index is very different from a non-clustered index. Effectively, both types of index contain the data sorted according to the columns you specify. However,
The clustered index also contains the rest of the data in the table (except for a few things like nvarchar(max)). You can consider this to be how the table is saved in the database
Non-clustered indexes only contain the columns you have included in the index
If you don't have a clustered index, you have a 'heap'. Instead of being located by a PK, its rows have a built-in row identifier.
In your case, as your primary key is non-clustered, it makes no sense to make another index with the same fields. To read the data, it must get the row identifier(s) from your PK, then go and read the data from the heap.
If, on the other hand, your primary key is clustered (which is the default), having a non-clustered index on the fields can be useful in some circumstances. But note that every non-clustered index you add can also slow down updates, inserts and deletes (as the indexes must be maintained as well).
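If you are not sure which of these situations applies to your table, the catalog view sys.indexes will tell you. A small sketch, assuming the table lives in the dbo schema as in the question:

```sql
-- Lists the storage structures of dbo.Product.
-- A row with type_desc = 'HEAP' means the table has no clustered index.
SELECT i.index_id,
       i.name,
       i.type_desc,        -- HEAP, CLUSTERED or NONCLUSTERED
       i.is_primary_key,
       i.is_unique
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.Product')
ORDER BY i.index_id;
```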
In your example - say you had a field there which was a varchar(8000) on the row which contains a lot of information. To even read one row from the clustered index, it must read (say) 100 bytes from the other fields, and up to 8000 bytes from that new field. In other words, it multiplies the amount you need to read by 80x.
I tend to see tables as having two types of data
Data you aggregate
Data you only care about on a row-by-row level
For example, in a transaction table, you may have transaction_id, transaction_date, transaction_amount, transaction_description, transaction_entered_by_user_id.
In most cases, whenever you're getting totals etc., you'll need the transaction amounts and dates (e.g., what was the total of transactions this week?)
On the other hand, the description and user_id are only used when you refer to specific rows (e.g., who did this specific transaction?)
In these cases, I often put a non-clustered index on the fields used in aggregation, even if they overlap with the clustered index. It just reduces the amounts of reads required.
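As a rough illustration of that idea, assuming a hypothetical table dbo.Transactions with the columns listed above (the table name, index name and dates are made up for the example):

```sql
-- Narrow index holding only the columns needed for totals.
CREATE NONCLUSTERED INDEX IX_Transactions_Date_Amount
ON dbo.Transactions (transaction_date)
INCLUDE (transaction_amount);
GO
-- A weekly total can be answered from the narrow index alone,
-- without reading the wide description column.
SELECT SUM(transaction_amount) AS total_amount
FROM dbo.Transactions
WHERE transaction_date >= '20240101'
AND transaction_date < '20240108';
```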
A really good video on this is by Brent Ozar, called How to think like the SQL Server Engine - I strongly recommend it as it helped me a lot in understanding how indexes are used.
Regarding your specific examples - there are two things to look for in indexes:
The ability to 'seek' to a specific point in the data set (based on the sort of the index).
Capability to reduce amount to be read.
In terms of allowing seeks, you need to sort the index in the most appropriate way. When doing it for filtering (e.g., WHERE clauses, JOINs), one rule of thumb is to first look for 'exact' matches. For these, it doesn't matter what order they are in, as long as the index has all the ones up to that point.
In your case, you have
where Bat_Key = ? and product_no=?
where Bat_Key = ? and product_no=? and pgm_name = ? and value = ?
This suggests your first two fields should be Bat_Key and product_no (in either order). Then you can also have pgm_name and value (also in either order).
You also have
where Bat_Key = ? and product_no=?
order by product_no, pgm_name;
which suggests to me that the third field should be pgm_name (as an index on Bat_Key, product_no and pgm_name would provide what you need there).
However - and this is a big however - you have lots of *s in there e.g.,
select *
from Product
where Bat_Key = ? and product_no=?
Because you are selecting *, any index that is not the clustered index needs to also go back to the actual rows to get the rest of what's included in the *.
As these want all the fields from the table (not just the ones in the index) it will need to go back to the heap (in your case). If you had a clustered index on the fields above, as well as a non-clustered index, it would have to read from the clustered index anyway because information is in there that is needed for your query.
Once again - the video above - explains this much better than I do.
Therefore, in your case, I suggest the following Primary Key
ALTER TABLE [dbo].[Product]
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY CLUSTERED ([Bat_Key] ASC, [product_no] ASC,
[pgm_name] ASC, [value] ASC)
Differences
It is clustered rather than non-clustered
The order of the 3rd and 4th fields is swapped to help with the order by pgm_name
No real need for a second non-clustered index as there is not much other stuff to be read.

execution plan suggesting to add an index on columns which are not part of where clause

I am running following query in SSMS and execution plan suggesting to add index on columns which are not part of where clause. I was planning to add index on two columns which are being used in where clause (OID and TransactionDate).
SELECT
[OID] , -- this is not a PK. Primary key column is not a part of sql script
[CustomerNum] ,
[Amount] ,
[TransactionDate] ,
[CreatedDate]
FROM [dbo].[Transaction]
WHERE OID = 489
AND TransactionDate > '01/01/2018 06:13:06.46';
Index suggestion
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Transaction] ([OID],[TransactionDate])
INCLUDE ([CustomerNum],[Amount],[CreatedDate])
Updated
Do I need to include the other columns? Data is imported into that table by a back-end process using the SqlBulkCopy class in .NET. I am wondering whether having a non-clustered index on all columns would reduce performance. (The table has a PK column called TransactionID, which is not needed, but I keep it in case it's needed in the future; otherwise SqlBulkCopy works better with a heap. The other option is to drop and recreate the indexes before and after the SqlBulkCopy operation.)
The INCLUDE keyword specifies the non-key columns to be added to the leaf level of the nonclustered index.
This means that if you will add this index and run the query again, SQL Server can get all the information needed from the index, thus eliminating the need to perform a lookup in the table as well.
As a general rule of thumb - when SSMS suggest an index, create it. You can always drop it later if it doesn't help.
You don't need to add all table columns to your non-clustered index; the suggested index is good for the query provided. SQL Server database engine suggestions are usually really good.
INCLUDE keyword is required to avoid KEY LOOKUP and use NONCLUSTERED INDEX SEEK.
All in all: No NONCLUSTERED INDEX results in Clustered index scan
Created NONCLUSTERED INDEX with no included columns results in NONCLUSTERED INDEX scan plus key lookup.
Created NONCLUSTERED INDEX with included columns results in NONCLUSTERED INDEX SEEK.
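For the bulk-load concern in the question, one alternative to dropping and recreating the index is to disable it before the import and rebuild it afterwards. This is only a sketch under assumptions: the index name IX_Transaction_OID_TransactionDate is made up, and whether this is faster than leaving the index enabled depends on the size of each load, so it is worth measuring.

```sql
-- The suggested covering index, given a hypothetical name.
CREATE NONCLUSTERED INDEX IX_Transaction_OID_TransactionDate
ON dbo.[Transaction] (OID, TransactionDate)
INCLUDE (CustomerNum, Amount, CreatedDate);
GO
-- Before the bulk load: stop maintaining the nonclustered index.
-- (Only do this to nonclustered indexes; disabling a clustered index
-- makes the table inaccessible.)
ALTER INDEX IX_Transaction_OID_TransactionDate ON dbo.[Transaction] DISABLE;
GO
-- ... run the SqlBulkCopy import here ...

-- After the load: rebuild the index so queries can use it again.
ALTER INDEX IX_Transaction_OID_TransactionDate ON dbo.[Transaction] REBUILD;
GO
```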

Redundant indexes?

I noticed a strange combination of indexes in one of the databases I was working on.
Here is the table design:
CREATE TABLE tblABC
(
id INT NOT NULL IDENTITY(1,1),
AnotherId INT NOT NULL, --not unique column
Othercolumn1 INT,
OtherColumn2 VARCHAR(10),
OtherColumn3 DATETIME,
OtherColumn4 DECIMAL(14, 4),
OtherColumn5 INT,
CONSTRAINT idxPKNCU
PRIMARY KEY NONCLUSTERED (id)
)
CREATE CLUSTERED INDEX idx1
ON tblABC(AnotherId ASC)
CREATE NONCLUSTERED INDEX idx2
ON tblABC(AnotherId ASC) INCLUDE(OtherColumn4)
CREATE NONCLUSTERED INDEX idx3
ON tblABC (AnotherId) INCLUDE (OtherColumn2, OtherColumn4)
Please note that column id is identity and defined as primary key.
A clustered index is defined on column - AnotherId, this column is not unique.
There are two additional nonclustered indexes defined on AnotherId, with additional include columns
My opinion is that both of the nonclustered indexes on AnotherId (idx2 and idx3) are redundant, because the main copy of the table (the clustered index) has the same data.
When I checked the index usage, I was expecting to see no usage on idx2 and idx3, but idx3 had highest index seeks.
I have given screenshots of the index design and usage
My question is - aren't these nonclustered indexes, idx2 and idx3, redundant? The optimizer can get the same data from the clustered index, idx1. Maybe it would have done so if there were no NC indexes defined.
Am I missing something?
Regards,
Nayak
It is a bit odd to have two very similar non-clustered indexes, though they may both be getting used equally. I do also find it positively weird that the clustered index was made on a non-unique field.
Check out the following link for information and a free tool to ascertain index usage. I use this all the time to see which indexes are being used etc.
https://www.brentozar.com/blitzindex/
For the non-clustered indexes - you can consolidate them and remove the unused ones; if you're only ever writing to an index, it is a royal waste of resources.
For the clustered index, you may consider redoing it based on your findings with the blitz index tool.
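If you just want a quick look at the usage counters without installing a tool, the DMV sys.dm_db_index_usage_stats holds them. A minimal sketch, assuming the table is dbo.tblABC in the current database; note that these counters are reset when the instance restarts, so judge them over a representative period.

```sql
-- Seeks, scans, lookups (reads) versus updates (writes) for each index on the table.
SELECT i.name AS index_name,
       i.type_desc,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.tblABC')
ORDER BY i.index_id;
```

An index with high user_updates but no seeks, scans or lookups is a candidate for removal.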

Does SQL Server allow including a computed column in a non-clustered index? If not, why not?

When a column is included in non-clustered index, SQL Server copies the values for that column from the table into the index structure (B+ tree). Included columns don't require table look up.
If the included column is essentially a copy of original data, why does not SQL Server also allow including computed columns in the non-clustered index - applying the computations when it is copying/updating the data from table to index structure? Or am I just not getting the syntax right here?
Assume:
DateOpened is datetime
PlanID is varchar(6)
This works:
create nonclustered index ixn_DateOpened_CustomerAccount
on dbo.CustomerAccount(DateOpened)
include(PlanID)
This does not work with left(PlanID, 3):
create nonclustered index ixn_DateOpened_CustomerAccount
on dbo.CustomerAccount(DateOpened)
include(left(PlanID, 3))
or
create nonclustered index ixn_DateOpened_CustomerAccount
on dbo.CustomerAccount(DateOpened)
include(left(PlanID, 3) as PlanType)
My use case is somewhat like below query.
select
case
when left(PlanID, 3) = '100' then 'Basic'
else 'Professional'
end as 'PlanType'
from
CustomerAccount
where
DateOpened between '2016-01-01 00:00:00.000' and '2017-01-01 00:00:00.000'
The query cares only for the left 3 of PlanID and I was wondering instead of computing it every time the query runs, I would include left(PlanID, 3) in the non-clustered index so the computations are done when the index is built/updated (fewer times) instead at the query time (frequently)
EDIT: We use SQL Server 2014.
As Laughing Vergil stated - you CAN index computed columns provided that they are persisted. You have a few options, here's a couple:
Option 1: Create the column as PERSISTED then index it
(or, in your case, include it in the index)
First the sample data:
CREATE TABLE dbo.CustomerAccount
(
PlanID int PRIMARY KEY,
DateOpened datetime NOT NULL,
First3 AS LEFT(PlanID,3) PERSISTED
);
INSERT dbo.CustomerAccount (PlanID, DateOpened)
VALUES (100123, '20160114'), (100999, '20151210'), (255657, '20150617');
and here's the index:
CREATE NONCLUSTERED INDEX nc_CustomerAccount ON dbo.CustomerAccount(DateOpened)
INCLUDE (First3);
Now let's test:
-- Note: IIF is available for SQL Server 2012+ and is cleaner
SELECT PlanID, PlanType = IIF(First3 = 100, 'Basic', 'Professional')
FROM dbo.CustomerAccount;
Execution Plan:
As you can see- the optimizer picked the nonclustered index.
Option #2: Perform the CASE logic inside your table DDL
First the updated table structure:
DROP TABLE dbo.CustomerAccount;
CREATE TABLE dbo.CustomerAccount
(
PlanID int PRIMARY KEY,
DateOpened datetime NOT NULL,
PlanType AS
CASE -- NOTE: casting as varchar(12) will make the column a varchar(12) column:
WHEN LEFT(PlanID,3) = 100 THEN CAST('Basic' AS varchar(12))
ELSE 'Professional'
END
PERSISTED
);
INSERT dbo.CustomerAccount (PlanID, DateOpened)
VALUES (100123, '20160114'), (100999, '20151210'), (255657, '20150617');
Notice that I use CAST to assign the data type, the table will be created with this column as varchar(12).
Now the index:
CREATE NONCLUSTERED INDEX nc_CustomerAccount ON dbo.CustomerAccount(DateOpened)
INCLUDE (PlanType);
Let's test again:
SELECT DateOpened, PlanType FROM dbo.CustomerAccount;
Execution plan:
... again, it used the nonclustered index
A third option, which I don't have time to go into, would be to create an indexed view. This would be a good option for you if you were unable to change your existing table structure.
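For completeness, a rough sketch of what the indexed view could look like. The view and index names are hypothetical, and it assumes PlanID is unique in dbo.CustomerAccount (as in the sample table above), since the first index on a view must be a unique clustered index. The usual SET options for indexed views (ANSI_NULLS, QUOTED_IDENTIFIER and so on) must also be on.

```sql
-- The view must be schema-bound and use two-part object names.
CREATE VIEW dbo.vCustomerAccountPlan
WITH SCHEMABINDING
AS
SELECT PlanID,
       DateOpened,
       LEFT(PlanID, 3) AS First3
FROM dbo.CustomerAccount;
GO
-- Materialises the view; requires a unique key (assumed: PlanID).
CREATE UNIQUE CLUSTERED INDEX cix_vCustomerAccountPlan
ON dbo.vCustomerAccountPlan (PlanID);
GO
-- Supports the date-range filter and carries the precomputed value.
CREATE NONCLUSTERED INDEX nc_vCustomerAccountPlan_DateOpened
ON dbo.vCustomerAccountPlan (DateOpened)
INCLUDE (First3);
GO
-- NOEXPAND asks the optimizer to use the view's own indexes.
SELECT DateOpened, First3
FROM dbo.vCustomerAccountPlan WITH (NOEXPAND)
WHERE DateOpened >= '20160101' AND DateOpened < '20170101';
```

The NOEXPAND hint is included because, as far as I know, only some editions match indexed views automatically; on the others the view is expanded to the base table unless you ask for it explicitly.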
SQL Server 2014 allows creating indexes on computed columns, but you're not doing that -- you're attempting to create the index directly on an expression. This is not allowed. You'll have to make PlanType a column first:
ALTER TABLE dbo.CustomerAccount ADD PlanType AS LEFT(PlanID, 3);
And now creating the index will work just fine (if your SET options are all correct, as outlined here):
CREATE INDEX ixn_DateOpened_CustomerAccount ON CustomerAccount(DateOpened) INCLUDE (PlanType)
It is not required that you mark the column PERSISTED. This is required only if the column is not precise, which does not apply here (this is a concern only for floating-point data).
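If you want to confirm that for your own column, COLUMNPROPERTY reports whether SQL Server considers a computed column deterministic, precise and indexable. A small sketch, assuming the PlanType column has been added to dbo.CustomerAccount as above:

```sql
-- Each property returns 1 (true), 0 (false) or NULL (invalid input).
SELECT
    COLUMNPROPERTY(OBJECT_ID(N'dbo.CustomerAccount'), N'PlanType', 'IsDeterministic') AS IsDeterministic,
    COLUMNPROPERTY(OBJECT_ID(N'dbo.CustomerAccount'), N'PlanType', 'IsPrecise')       AS IsPrecise,
    COLUMNPROPERTY(OBJECT_ID(N'dbo.CustomerAccount'), N'PlanType', 'IsIndexable')     AS IsIndexable;
```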
Incidentally, the real benefit of this index is not so much that LEFT(PlanID, 3) is precalculated (the calculation is inexpensive), but that no clustered index lookup is needed to get at PlanID. With an index only on DateOpened, a query like
SELECT PlanType FROM CustomerAccount WHERE DateOpened >= '2012-01-01'
will result in an index seek on CustomerAccount, followed by a clustered index lookup to get PlanID (so we can calculate PlanType). If the index does include PlanType, the index is covering and the extra lookup disappears.
This benefit is relevant only if the index is truly covering, however. If you select other columns from the table, an index lookup is still required and the included computed column is only taking up space for little gain. Likewise, suppose that you had multiple calculations on PlanID or you needed PlanID itself as well -- in this case it would make much more sense to include PlanID directly rather than PlanType.
Computed columns are only allowed in indexes if they are Persisted - that is, if the data is written to the table. If the information is not persisted, then the information isn't even calculated / available until the field is queried.

Log table with non unique columns; what indexes to create

I have a log table with two columns.
DocumentType (varchar(250), not unique, not null)
DateEntered (Date, not unique, not null)
The table will only have rows inserted, never updated or deleted.
Here is the stored procedure for the report:
SELECT DocumentType,
COUNT(DocumentType) AS "CountOfDocs"
FROM DocumentTypes
WHERE DateEntered>= @StartDate AND DateEntered<= @EndDate
GROUP BY DocumentType
ORDER BY DocumentType ASC;
In the future user may want to also filter by document type in a different report. I currently have a non-clustered index containing both columns. Is this the proper index to create?
Clustered index on the date, for sure.
I think your NCI is fine. I would put both in as key columns, as I assume you will have the date in the WHERE clause for your queries. I don't think 1000 per day worst case scenario will have a major impact on insert times when loading the data.
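A minimal sketch of the clustered-on-date suggestion, assuming the table is dbo.DocumentTypes as in the stored procedure. The index name and the date values are made up for illustration.

```sql
-- Cluster the log table on the date so a date-range report reads one contiguous slice.
CREATE CLUSTERED INDEX CIX_DocumentTypes_DateEntered
ON dbo.DocumentTypes (DateEntered);
GO
-- The report query from the question, with example parameter values.
DECLARE @StartDate date, @EndDate date;
SET @StartDate = '20240101';
SET @EndDate = '20240131';

SELECT DocumentType,
       COUNT(DocumentType) AS CountOfDocs
FROM dbo.DocumentTypes
WHERE DateEntered >= @StartDate AND DateEntered <= @EndDate
GROUP BY DocumentType
ORDER BY DocumentType ASC;
```

Because the table has only these two columns, the clustered index already contains everything the report needs; the existing two-column nonclustered index can then be compared against it in the execution plan.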
Don't add any index. It'll be a heap table, and it can wait for your "future you" who gets the task of selecting something from this table :).
If you want an index:
With a heap: add an index on the field you will filter on, and if the second one is only in the SELECT (i.e. isn't in the WHERE clause), put it in as an included column. If you'll filter on both columns, put the index on both columns.
If you want to add a clustered index (for example on a new auto-increment primary key column), add only one index on the column you want to filter on, or try not adding an additional index and check the execution plan and efficiency. In most cases a clustered index with seeks is enough.
Don't create a clustered index on non-unique columns (that's used only in very special cases).
