I have created a table with the following columns. All of the columns are unique key columns; there is no primary key on the table.
Table Product:
Bat_Key,
product_no,
value,
pgm_name,
status,
industry,
created_by,
created_date
I have altered my table to add constraints
ALTER TABLE [dbo].[Product]
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY NONCLUSTERED ([Bat_Key] ASC, [product_no] ASC,
[value] ASC, [pgm_name] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
GO
And if I created indexes as below:
CREATE NONCLUSTERED INDEX [PRODUCT_BKEY_PNO_IDX]
ON [dbo].[PRODUCT] ([Bat_Key] ASC, [product_no] ASC, [value], [pgm_name])
INCLUDE ([status], [industry])
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
GO
Is this design good for the following queries?
select *
from Product
where Bat_Key = ? and product_no=?
order by product_no, pgm_name;
select *
from Product
where Bat_Key = ? and product_no=? and pgm_name = ? and value = ?
select *
from Product
where Bat_Key = ? and product_no=?
delete from Product
where Bat_Key = ? and product_no=?
Or should I create different indexes based on my WHERE clauses?
A clustered index is very different from a non-clustered index. Both types of index contain the data sorted according to the columns you specify. However,
The clustered index also contains the rest of the data in the table (except for a few things like nvarchar(max) values that are stored off-row). You can consider this to be how the table is saved in the database.
Non-clustered indexes only contain the columns you have included in the index, plus a pointer back to the row.
If you don't have a clustered index, you have a 'heap'. Instead of a clustering key, its rows have built-in row identifiers (RIDs).
In your case, as your primary key is non-clustered, it makes no sense to make another index with the same fields. To read the data, it must get the row identifier(s) from your PK, then go and read the data from the heap.
If, on the other hand, your primary key is clustered (which is the default), having a non-clustered index on the fields can be useful in some circumstances. But note that every non-clustered index you add can also slow down updates, inserts and deletes (as the indexes must be maintained as well).
In your example - say you had a varchar(8000) field on the row which contains a lot of information. To read even one row from the clustered index, it must read (say) 100 bytes from the other fields, and up to 8000 bytes from that new field. In other words, it can multiply the amount you need to read by about 80x.
I tend to see tables as having two types of data
Data you aggregate
Data you only care about on a row-by-row level
For example, in a transaction table, you may have transaction_id, transaction_date, transaction_amount, transaction_description, transaction_entered_by_user_id.
In most cases, when you're getting totals, you'll need the transaction amounts and dates (e.g., what was the total of transactions this week?)
On the other hand, the description and user_id are only used when you refer to specific rows (e.g., who did this specific transaction?)
In these cases, I often put a non-clustered index on the fields used in aggregation, even if they overlap with the clustered index. It just reduces the number of reads required.
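As a rough sketch of that idea (the table name, column types, index name and date value are all assumptions made up for illustration, not taken from a real schema), a narrow non-clustered index on the aggregated fields might look like this:

```sql
-- Hypothetical table mirroring the transaction example; names and types are assumptions.
CREATE TABLE dbo.TransactionLog
(
    transaction_id                 INT IDENTITY(1,1) NOT NULL,
    transaction_date               DATETIME2(0)      NOT NULL,
    transaction_amount             DECIMAL(14, 4)    NOT NULL,
    transaction_description        VARCHAR(8000)     NULL,
    transaction_entered_by_user_id INT               NOT NULL,
    CONSTRAINT PK_TransactionLog PRIMARY KEY CLUSTERED (transaction_id)
);
GO

-- Narrow index holding only the columns used for totals.
-- The wide description column stays out of it, so far fewer pages are read.
CREATE NONCLUSTERED INDEX IX_TransactionLog_Date_Amount
    ON dbo.TransactionLog (transaction_date)
    INCLUDE (transaction_amount);
GO

-- "What was the total of transactions this week?"
-- Both referenced columns live in the narrow index, so it can answer this alone.
DECLARE @week_start DATETIME2(0) = '2024-01-01';  -- placeholder value

SELECT SUM(transaction_amount) AS week_total
FROM dbo.TransactionLog
WHERE transaction_date >= @week_start
  AND transaction_date <  DATEADD(DAY, 7, @week_start);
```

A row-by-row question such as "who entered transaction 42?" would still go to the clustered index, which is fine because it touches a single row.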
A really good video on this is by Brent Ozar called How to think like the SQL Server Engine - I strongly recommend it as it helped me a lot in understanding how indexes are used.
Regarding your specific examples - there are two things to look for in indexes:
The ability to 'seek' to a specific point in the data set (based on the sort order of the index).
The ability to reduce the amount of data to be read.
In terms of allowing seeks, you need to sort the index in the most appropriate way. When doing it for filtering (e.g., WHERE clauses, JOINs), one rule of thumb is to first look for 'exact' matches. For these, it doesn't matter what order the columns are in, as long as the query filters on every index column up to that point.
In your case, you have
where Bat_Key = ? and product_no=?
where Bat_Key = ? and product_no=? and pgm_name = ? and value = ?
This suggests your first two fields should be Bat_Key and product_no (in either order). Then you can also have pgm_name and value (also in either order).
You also have
where Bat_Key = ? and product_no=?
order by product_no, pgm_name;
which suggests to me that the third field should be pgm_name (as an index on Bat_Key, product_no and pgm_name would provide what you need there).
However - and this is a big however - you have lots of *s in there e.g.,
select *
from Product
where Bat_Key = ? and product_no=?
Because you are selecting *, any index that is not the clustered index needs to also go back to the actual rows to get the rest of what's included in the *.
As these want all the fields from the table (not just the ones in the index) it will need to go back to the heap (in your case). If you had a clustered index on the fields above, as well as a non-clustered index, it would have to read from the clustered index anyway because information is in there that is needed for your query.
Once again - the video above - explains this much better than I do.
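To make the point about * concrete, here is a minimal sketch comparing the two shapes of query against the non-clustered index from the question. The parameter types and values are assumptions, since the question does not give the column types.

```sql
-- Placeholder parameters; INT is an assumption about the column types.
DECLARE @bat_key INT = 1, @product_no INT = 100;

-- Asks for every column, including created_by and created_date, which are not
-- in PRODUCT_BKEY_PNO_IDX. Each matching index row needs an extra lookup.
SELECT *
FROM dbo.Product
WHERE Bat_Key = @bat_key AND product_no = @product_no;

-- Asks only for columns held in PRODUCT_BKEY_PNO_IDX (key columns plus the
-- INCLUDE list), so the index can answer the query without going back to the table.
SELECT [Bat_Key], [product_no], [value], [pgm_name], [status], [industry]
FROM dbo.Product
WHERE Bat_Key = @bat_key AND product_no = @product_no;
```

Comparing the actual execution plans of the two statements should show the lookup operator in the first and not in the second.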
Therefore, in your case, I suggest the following Primary Key
ALTER TABLE [dbo].[Product]
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY CLUSTERED ([Bat_Key] ASC, [product_no] ASC,
[pgm_name] ASC, [value] ASC)
Differences
It is clustered rather than non-clustered
The order of the 3rd and 4th fields is swapped to help with the order by pgm_name
No real need for a second non-clustered index as there is not much other stuff to be read.
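If the non-clustered primary key and the extra index from the question already exist, switching over could be sketched roughly as follows. This assumes the object names from the question and that nothing else (such as a foreign key) references the existing constraint; on a large table it is worth trying on a copy first.

```sql
-- Drop the redundant non-clustered index first, so it is not rebuilt
-- when the table changes from a heap to a clustered index.
DROP INDEX [PRODUCT_BKEY_PNO_IDX] ON [dbo].[Product];

-- Replace the non-clustered primary key with a clustered one.
ALTER TABLE [dbo].[Product] DROP CONSTRAINT [PRODUCT_PK];

ALTER TABLE [dbo].[Product]
ADD CONSTRAINT [PRODUCT_PK]
PRIMARY KEY CLUSTERED ([Bat_Key] ASC, [product_no] ASC,
                       [pgm_name] ASC, [value] ASC);
GO

-- Check the effect: logical reads per query appear in the Messages tab.
SET STATISTICS IO ON;

DECLARE @bat_key INT = 1, @product_no INT = 100;  -- placeholder values, types assumed

SELECT *
FROM dbo.Product
WHERE Bat_Key = @bat_key AND product_no = @product_no
ORDER BY product_no, pgm_name;

SET STATISTICS IO OFF;
```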
I noticed a strange combination of indexes in one of the databases I was working on.
Here is the table design:
CREATE TABLE tblABC
(
id INT NOT NULL IDENTITY(1,1),
AnotherId INT NOT NULL, --not unique column
Othercolumn1 INT,
OtherColumn2 VARCHAR(10),
OtherColumn3 DATETIME,
OtherColumn4 DECIMAL(14, 4),
OtherColumn5 INT,
CONSTRAINT idxPKNCU
PRIMARY KEY NONCLUSTERED (id)
)
CREATE CLUSTERED INDEX idx1
ON tblABC(AnotherId ASC)
CREATE NONCLUSTERED INDEX idx2
ON tblABC(AnotherId ASC) INCLUDE(OtherColumn4)
CREATE NONCLUSTERED INDEX idx3
ON tblABC (AnotherId) INCLUDE (OtherColumn2, OtherColumn4)
Please note that column id is identity and defined as primary key.
A clustered index is defined on column - AnotherId, this column is not unique.
There are two additional nonclustered indexes defined on AnotherId, with additional include columns
My opinion is that both nonclustered indexes on AnotherId (idx2 and idx3) are redundant, because the main copy of the table (the clustered index) has the same data.
When I checked the index usage, I was expecting to see no usage on idx2 and idx3, but idx3 had highest index seeks.
I have attached screenshots of the index design and usage.
My question is: aren't the nonclustered indexes idx2 and idx3 redundant? The optimizer can get the same data from the clustered index idx1. Maybe it would have done so if no nonclustered indexes were defined.
Am I missing something?
Regards,
Nayak
It is a bit odd to have two very similar non-clustered indexes, though they may both be getting used equally. I do also find it positively weird that the clustered index was made on a non-unique field.
Check out the following link for information and a free tool to ascertain index usage. I use this all the time to see which indexes are being used etc.
https://www.brentozar.com/blitzindex/
For the non-clustered indexes - you can consolidate them and remove the unused ones. If you're only writing to an index and never reading from it, it is a royal waste of resources.
For the clustered index, you may consider redoing it based on your findings with the blitz index tool.
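If you just want the raw usage numbers without installing anything, a minimal sketch using the built-in DMV sys.dm_db_index_usage_stats is below. It assumes tblABC lives in the current database; keep in mind that these counters are reset when the instance restarts, so judge them over a representative period.

```sql
-- Reads (seeks/scans/lookups) versus writes (updates) per index on tblABC.
SELECT i.name        AS index_name,
       i.type_desc,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'tblABC')
ORDER BY i.index_id;

-- idx3 has the same key as idx2 and includes everything idx2 includes,
-- so if idx2 shows only user_updates it is a candidate for removal:
-- DROP INDEX idx2 ON tblABC;
```

The DROP is left commented out on purpose; it is only a suggestion to act on once the numbers support it.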
We have a table which holds all email messages ready to send and which have already been sent. The table contains over 1 million rows.
Below is the query to find the messages which still need to be sent. After 5 errors the message is not attempted anymore and needs to be fixed manually. SentDate remains null until the message is sent.
SELECT TOP (15)
ID,
FromEmailAddress,
FromEmailDisplayName,
ReplyToEmailAddress,
ToEmailAddresses,
CCEmailAddresses,
BCCEmailAddresses,
[Subject],
Body,
AttachmentUrl
FROM sysEmailMessage
WHERE ErrorCount < 5
AND SentDate IS NULL
ORDER BY CreatedDate
The query is slow, which I assumed was due to missing indexes. I ran the query through the Database Engine Tuning Advisor. It suggests the below index (and some statistics, which I generally ignore):
SET ANSI_PADDING ON
CREATE NONCLUSTERED INDEX [_dta_index_sysEmailMessage_7_1703677117__K14_K1_K12_5_6_7_8_9_10_11_15_17_18] ON [dbo].[sysEmailMessage]
(
[SentDate] ASC,
[ID] ASC,
[ErrorCount] ASC
)
INCLUDE ( [FromEmailAddress],
[ToEmailAddresses],
[CCEmailAddresses],
[BCCEmailAddresses],
[Subject],
[Body],
[AttachmentUrl],
[CreatedDate],
[FromEmailDisplayName],
[ReplyToEmailAddress]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
(On a sidenote: this index has a suggested size of 5,850,573 KB (?) which is nearly 6 GB and doesn't make any sense to me at all.)
My question is does this suggested index make any sense? Why for example is the ID column included, while it's not needed in the query (as far as I can tell)?
As far as my knowledge of indexes goes they are meant to be a fast lookup to find the relevant row. If I had to design the index myself I would come up with something like:
SET ANSI_PADDING ON
CREATE NONCLUSTERED INDEX [index_alternative_a] ON [dbo].[sysEmailMessage]
(
[SentDate] ASC,
[ErrorCount] ASC
)
WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
Is the optimizer really clever or is my index more efficient and probably better?
There are 2 different aspects to selecting an index: the fields you need for finding the rows (= the actual indexed fields), and the fields that are needed after that (= the included fields). If you're always fetching the top 15 rows, you can totally ignore the included fields, because 15 key lookups will be fast, and adding the whole email to the index would make it huge.
For the indexed fields, it's quite important to know what percentage of the data matches your criteria.
Assuming almost all of your rows have ErrorCount < 5, you should not have it in the index -- but if it's a rare case, then it's good to have.
Assuming SentDate is really rarely NULL, then you should have that as the first column of the index.
Whether to have CreatedDate in the index depends on how many rows on average are found in the table with the ErrorCount and SentDate criteria. If it is a lot (thousands), then it might help to have it there so the oldest rows, which the ORDER BY asks for first, can be found fast.
But like always, several things affect the performance so you should test how different options affect your environment.
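One way to run that test is sketched below. The two index names are made up, and ErrorCount is left out of both keys on the assumption that almost all rows have ErrorCount < 5; if that assumption does not hold for your data, add it to the candidates.

```sql
-- Candidate A: only the rarely-NULL SentDate column.
CREATE NONCLUSTERED INDEX IX_sysEmailMessage_SentDate
    ON dbo.sysEmailMessage (SentDate);

-- Candidate B: SentDate plus CreatedDate, so unsent rows are already in ORDER BY order.
CREATE NONCLUSTERED INDEX IX_sysEmailMessage_SentDate_CreatedDate
    ON dbo.sysEmailMessage (SentDate, CreatedDate);
GO

-- Compare logical reads and elapsed time with each candidate in place.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP (15)
       ID, FromEmailAddress, FromEmailDisplayName, ReplyToEmailAddress,
       ToEmailAddresses, CCEmailAddresses, BCCEmailAddresses,
       [Subject], Body, AttachmentUrl
FROM sysEmailMessage
WHERE ErrorCount < 5
  AND SentDate IS NULL
ORDER BY CreatedDate;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Neither candidate includes the email columns, so the 15 returned rows are fetched with key lookups. Drop whichever candidate the plan does not end up using.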
I have a table with approx. 135M rows:
CREATE TABLE [LargeTable]
(
[ID] UNIQUEIDENTIFIER NOT NULL,
[ChildID] UNIQUEIDENTIFIER NOT NULL,
[ChildType] INT NOT NULL
)
It has a non-clustered index with no included columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX]
ON [LargeTable]
(
[ChildID] ASC
)
(It is clustered on ID).
I wish to join this against a temporary table which contains a few thousand rows:
CREATE TABLE #temp
(
ChildID UNIQUEIDENTIFIER PRIMARY KEY,
ChildType INT
)
...add #temp data...
SELECT lt.ChildID, lt.ChildType
FROM #temp t
INNER JOIN [LargeTable] lt
ON lt.[ChildID] = t.[ChildID]
However the query plan includes an index scan on the large table:
If I change the index to include extra columns:
CREATE NONCLUSTERED INDEX [LargeTable_ChildID_IX] ON [LargeTable]
(
[ChildID] ASC
)
INCLUDE ([ChildType])
Then the query plan changes to something more sensible:
So my question is: Why can't SQL Server still use an index seek in the first scenario, but with a RID lookup to get from the non-clustered index to the table data? Surely that would be more efficient than an index scan on such a large table?
The first query plan actually makes a lot of sense. Remember that SQL Server never reads records, it reads pages. In your table, a page contains many records, since those records are so small.
With the original index, if the second query plan were used, then after finding all the row locators in the index (and reading index pages to do so), pages in the clustered index would need to be read to get the ChildType column. In a worst case scenario, that is an entire page for each record it needs to read. As there are many records per page, that might boil down to reading a large percentage of the pages in the clustered index.
SQL Server guessed, based on statistics, that simply scanning the pages in the clustered index would require fewer page reads in total, because it then avoids reading the pages in the non-clustered index.
What matters here is the number of rows in the temp table compared to the number of pages in the large table. Assuming a random distribution of ChildID in the large table, as soon as the number of rows in the temp table approaches or exceeds the number of pages in the large table, SQL Server will have to read virtually every page in the large table anyway.
Because the column ChildType isn't covered by the index, it has to go back to the clustered index (a key lookup, since the table is clustered on ID) to get the values for ChildType.
When you INCLUDE this column in the nonclustered index it will be added to the leaf-level of the index where it is available for querying.
Colloquially this is called 'the index tipping point': the point at which the cost-based optimizer considers it more effective to do a scan rather than a seek plus lookup. It is usually around 20% of the size, which in your case will be based on an estimate coming from the #temp table stats. YMMV.
You already have your answer: include the required column, make the index covering.
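If you are curious whether the optimizer's choice was right for your data, one diagnostic sketch (not something to leave in production code) is to force the seek with the FORCESEEK table hint and compare the reads. It assumes #temp has already been populated as in the question.

```sql
SET STATISTICS IO ON;

-- The optimizer's own choice (a scan, with the original narrow index).
SELECT lt.ChildID, lt.ChildType
FROM #temp t
INNER JOIN [LargeTable] lt
    ON lt.[ChildID] = t.[ChildID];

-- Same query, but forcing a seek on LargeTable, followed by lookups for ChildType.
SELECT lt.ChildID, lt.ChildType
FROM #temp t
INNER JOIN [LargeTable] lt WITH (FORCESEEK)
    ON lt.[ChildID] = t.[ChildID];

SET STATISTICS IO OFF;
```

If the forced version reports far fewer logical reads on LargeTable, the row estimate for #temp was probably off; either way, the covering index removes the lookups and with them the dilemma.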
I have the below query:
USE [AxReports]
GO
DECLARE @paramCompany varchar(3)
SET @paramCompany = 'adf'
SELECT stl.MAINSALESID,
st.DATAAREAID,
Sum(sl.SALESQTY) as 'Quantity',
Sum(sl.SALESQTY * sl.SALESPRICE) as 'SalesValue'
INTO #openrel
FROM
DynamicsV5Realtime.dbo.SALESTABLE st
INNER JOIN
DynamicsV5Realtime.dbo.SALESLINE sl
ON
sl.SALESID = st.SALESID
and sl.DATAAREAID = st.DATAAREAID
INNER JOIN
DynamicsV5Realtime.dbo.INVENTTABLE it
ON
it.ITEMID = sl.ITEMID
and it.DATAAREAID = sl.DATAAREAID
INNER JOIN
DynamicsV5Realtime.dbo.SALESTABLELINKS stl
ON
stl.SUBSALESID = st.SALESID
and stl.DATAAREAID = st.DATAAREAID
WHERE
st.DATAAREAID = @paramCompany
and st.SALESTYPE = 3 -- Release Order
and st.SALESSTATUS = 1
and sl.SALESSTATUS <> 4
and it.ITEMGROUPID <> 'G0022A'
GROUP BY
stl.MAINSALESID,
st.DATAAREAID
My execution plan is recommending an index of :
USE [DynamicsV5Realtime]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
GO
However, I already have a similar index on that table, which the plan uses, but only by scanning it. The current index is below:
CREATE NONCLUSTERED INDEX [I_ITEMGROUPIDX] ON [dbo].[INVENTTABLE]
(
[ITEMID] ASC,
[DATAAREAID] ASC
)
INCLUDE ( [ITEMGROUPID])
GO
I have an understanding that you should only put things as an included column when you are not bothered about them being sorted at the leaf level (I think that's correct?).
In this case the WHERE clause has it.ITEMGROUPID <> 'G0022A', so putting that as a key column would make sense, as it will be quicker to seek on that column in order (again, I think I am right in saying that?).
However, what about the joins? Why does it recommend putting the ITEMID column as an include but not the DATAAREAID column? ITEMID and DATAAREAID make up the PK in this case, so is it something to do with not needing to sort both columns? And would using the existing index, but with ITEMGROUPID as a key column, perhaps be a better solution than adding a new index? (That's something I can test, I suppose.)
Thanks
Let's consider this table in relative isolation first; that is we'll only pay attention to those parts of the query where it is directly mentioned.
Executing the query needs to do the following:
Find all rows in INVENTTABLE where the ITEMGROUPID column is not equal to 'G0022A'.
Find the values of the DATAAREAID and ITEMID columns in those rows, for use in finding the necessary rows in SALESLINE.
The best index for doing part one is one that has a key on ITEMGROUPID but no other columns. Such a key (we'll ignore included columns for now) would enable an index seek to find the relevant rows and those only.
If there was no such index but there was an index that had ITEMGROUPID as one of its columns, then that index could be scanned instead, though not quite as efficiently.
Now, when we come to considering the second part, the only values we actually care about getting from the row are DATAAREAID and ITEMID.
If those fields were included, then they can be used in an index scan.
If they are actually parts of the key, or one of them is and the other is included, then that index can also be used for such an index scan.
So. At this point, considering only those aspects we said we would consider at this point and ignoring other considerations (index size, cost of inserts, etc), then any of the following indices would be useful here:
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID],[ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID])
INCLUDE ([DATAAREAID],[ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID])
INCLUDE ([ITEMID],[DATAAREAID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMID],[ITEMGROUPID])
INCLUDE ([DATAAREAID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID])
INCLUDE ([ITEMID])
CREATE NONCLUSTERED INDEX [someIndexName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[ITEMID])
INCLUDE ([DATAAREAID])
Each one of these indices contains ITEMGROUPID as all or part of the key and both ITEMID and DATAAREAID as either part of the key, or as an included column.
Note that the index you do have is the opposite of this; it has the column that would ideally be a key as an included column, and the others as part of the key. It's better than nothing and the query planner can re-jigger things to make use of it, but it's not the ideal key for what we've determined we want.
Now, let's consider the query as a whole.
Note that we will be searching SALESTABLE based on its DATAAREAID column.
Note that SALESLINE is joined to that column on its own DATAAREAID column.
Note that INVENTTABLE is in turn joined to SALESLINE on its own DATAAREAID column.
From this we can deduce that we logically only want those records from INVENTTABLE that have the value @paramCompany in their DATAAREAID column.
And the planner made that deduction.
So, considering the query as a whole, we can change our two actions above to:
Find all rows in INVENTTABLE where the ITEMGROUPID column is not equal to 'G0022A' and where DATAAREAID is equal to @paramCompany.
Find the values of the DATAAREAID (already got in step 1) and ITEMID columns in those rows.
Hence the ideal index for this would be either:
CREATE NONCLUSTERED INDEX [someName]
ON [dbo].[INVENTTABLE] ([ITEMGROUPID],[DATAAREAID])
INCLUDE ([ITEMID])
GO
OR
CREATE NONCLUSTERED INDEX [someName]
ON [dbo].[INVENTTABLE] ([DATAAREAID],[ITEMGROUPID])
INCLUDE ([ITEMID])
GO
(Or one that includes all three in the key, but there are other reasons not to have a large key if you don't actually need it).
And the second is indeed what you were advised to do.
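Put differently, the part of the query that touches INVENTTABLE behaves roughly like the standalone statement below. This is only a sketch to show what the index has to serve; the variable is written with T-SQL's @ prefix and given the value from the question.

```sql
DECLARE @paramCompany varchar(3) = 'adf';

-- Both filter columns are index keys in the recommended index, and ITEMID
-- is included, so the index can answer this without touching the table.
SELECT it.ITEMID, it.DATAAREAID
FROM DynamicsV5Realtime.dbo.INVENTTABLE AS it
WHERE it.DATAAREAID = @paramCompany      -- deduced from the join chain
  AND it.ITEMGROUPID <> 'G0022A';        -- from the original WHERE clause
```

Running it on its own with and without the recommended index is a cheap way to see the difference before touching the full query.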
This should be easy to Google, but I would say to basically just have the columns that are used in joins as the index keys, and to include the returned columns so that there is no need to do a lookup on the actual table (everything needed is in the index).
I would say recommendations can be more or less reliable, perhaps due to bad statistics or whatever, so don't blindly rely on them. Also, I believe indexes are of much less use when the operator is '<>', because such a condition usually matches most of the rows.
How do I switch off the default index on primary keys?
I don't want all my tables to be indexed (sorted), but they must have a primary key.
You can define a primary key index as NONCLUSTERED to prevent the table rows from being ordered according to the primary key, but you cannot define a primary key without some associated index.
Tables are always unsorted - there is no "default" order for a table and the optimiser may or may not choose to use an index if one exists.
In SQL Server an index is effectively the only way to implement a key. You get a choice between clustered or nonclustered indexes - that is all.
The means by which SQL Server implements Primary and Unique keys is by placing an index on those columns. So you cannot have a Primary Key (or Unique constraint) without an index.
You can tell SQL Server to use a nonclustered index to implement these constraints. If there are only nonclustered indexes on a table (or no indexes at all), you have a heap. It's pretty rare that this is what you actually want.
Just because a table has a clustered index, this in no way indicates that the rows of the table will be returned in the "order" defined by such an index - the fact that the rows are usually returned in that order is an implementation quirk.
And the actual code would be:
CREATE TABLE T (
Column1 char(1) not null,
Column2 char(1) not null,
Column3 char(1) not null,
constraint PK_T PRIMARY KEY NONCLUSTERED (Column2,Column3)
)
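To see what that definition produces, a small sketch against the catalog view sys.indexes is shown below. It assumes T was created in the current database as above.

```sql
-- Expect two rows: the heap itself (index_id 0, no name, type_desc HEAP)
-- and PK_T as a NONCLUSTERED index with is_primary_key = 1.
SELECT name, index_id, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'T');

-- Whatever indexes exist, an explicit ORDER BY is the only way
-- to guarantee the order in which rows come back.
SELECT Column1, Column2, Column3
FROM T
ORDER BY Column2, Column3;
```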
What does "I don't want all my tables to be sorted" mean? If it means that you want the rows to appear in the order in which they were entered, there's only one way to guarantee it: have a field that stores that order (or the time, if you don't have a lot of transactions). And in that case, you will want to have a clustered index on that field for best performance.
You might end up with a non-clustered PK (like the productId) AND a clustered unique index on your autonumber_or_timestamp field for max performance.
But that really depends on the reality you're trying to model, and your question contains too little information about this. DB design is NOT abstract thinking.
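For what it's worth, that combination could be sketched like this. The table and column names are hypothetical, chosen only to illustrate a non-clustered PK next to a clustered autonumber.

```sql
-- Hypothetical table: business key as non-clustered PK, entry order as clustered index.
CREATE TABLE dbo.ProductEntry
(
    ProductId   INT               NOT NULL,
    EntrySeq    INT IDENTITY(1,1) NOT NULL,  -- autonumber recording insertion order
    ProductName VARCHAR(50)       NOT NULL,
    CONSTRAINT PK_ProductEntry PRIMARY KEY NONCLUSTERED (ProductId)
);
GO

-- Rows are physically organised by the order in which they were entered.
CREATE UNIQUE CLUSTERED INDEX CIX_ProductEntry_EntrySeq
    ON dbo.ProductEntry (EntrySeq);
GO

-- The ORDER BY is still required to guarantee the order of the result.
SELECT ProductId, ProductName
FROM dbo.ProductEntry
ORDER BY EntrySeq;
```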