I searched extensively for an answer to my specific issue but couldn't find one.
Here is my problem:
I created a filtered index on a table (SQL Server):
CREATE NONCLUSTERED INDEX [IDX_FilteredIdx] ON [dbo].[MyTable]
(
[Type] ASC,
[CreationDate] ASC
)
WHERE ([Statut]=N'CREATED')
What I see is that SQL queries such as:
select Id, CreationDate, [An other field not in the index] from MyTable
where Type='XXX' and CreationDate<'YYYY'
and Statut=N'CREATED'
use the filtered index (plus a key lookup for the other field) if 'XXX' is a value that exists in the index OR a value that does not exist in the table at all (yes, that sounds very weird).
If 'XXX' does not exist in the index (for example, no row of that type has the CREATED statut) but does exist in the table, the execution plan switches to a clustered index scan with awful performance (the table is big, > 600 M rows) and finally returns an empty result, as expected.
(The total number of rows with statut CREATED is very small compared to the whole table.)
I saw that Microsoft recommends including the column used for filtering:
"A column in the filtered index expression should be a key or included column in the filtered index definition if the column is in the query result set."
but I get the same issue with the column included in the index (and it seems very counterintuitive to add it when the filter criterion is a fixed literal value).
The only way I found to make this query use the filtered index and get decent performance was to add the Statut column in the second position of the index key.
(I also understand that, depending on the selected fields and the cardinality of the result, the engine may choose a clustered index scan to avoid a costly key lookup step.)
Does anyone have an explanation for this behavior?
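For reference, the workaround mentioned above (Statut added as the second key column) would look roughly like this. The index name is made up for illustration, and the filter is unchanged:

```sql
-- Sketch of the workaround: same filter, but Statut is also a key column.
-- IDX_FilteredIdx_WithStatut is a hypothetical name.
CREATE NONCLUSTERED INDEX [IDX_FilteredIdx_WithStatut] ON [dbo].[MyTable]
(
    [Type] ASC,
    [Statut] ASC,
    [CreationDate] ASC
)
WHERE ([Statut] = N'CREATED');
```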
Regards.
Related
I have a table with these indexes:
pk_id_sales PRIMARY KEY (id) -> Clustered unique index
uk_sales_id UNIQUE (sales_id) -> Non clustered unique index
uk_sales_date_party_name (sales_date, party_name) -> Non clustered, non unique index
I want to partition this table on the column sales_date.
Should I include sales_date into the clustered index to get the benefits of partitioning? Is this an optional one? What should be the factors to be considered to make this decision if it is an optional one?
What should be the order of columns in the clustered index If I add sales_date? Should it be (id, sales_date) or (sales_date, id)? What is the role of order here?
Will the order of columns in the index make any performance impact in this case?
If we include the partition column in the query, will partition elimination always happen regardless of the indexes we have? (Eg: I already have a unique non-clustered index on the sales_id (it doesn't contain sales_date). If I make a query with sales_id and sales_date in the where clause, will the partition elimination happen?)
Please share if there is a comprehensive write-up or video that will help to gain a fair understanding of the above-given concepts.
Any response will be appreciated. I can share more details if required.
I tried the following scenarios on an existing empty table. In both cases, new records are inserted into the respective partitions and partition elimination happens correctly (I verified this from the actual execution plan in Azure Data Studio).
SCENARIO 1
I followed the steps given below, based on a tutorial. I don't know why we are performing the 4th step.
Drop the existing clustered index on ID
Create a new non-clustered index on ID
Create a clustered index on sales_date
Drop the clustered index on sales_date
SCENARIO 2
Based on another tutorial, I tried the following.
Drop the existing clustered index on ID
Create a new non-clustered index on ID
Create a clustered index on sales_date
For your first question, the partitioning column is required to be specified explicitly as a key column for all unique indexes. Furthermore, SQL Server will automatically add the partitioning column to clustered index keys if not already specified.
The partitioning column is automatically added as an included column in non-unique non-clustered indexes when not already a key or included column.
EDIT:
For this question asked in comment:
The existing clustered index on my table is id (it is an IDENTITY column and
auto incremented). I want to partition the table based on sales_date.
My understanding is that we need to add sales_date to the clustered
index. In the examples I saw on web, they are adding it as a second
part of the clustered index, ie, (id, sales_date). But for me, it
looks like (sales_date,id) will be more helpful as id is unique and
it will not help to improve performance.
It depends on your queries. The partitioning column must be specified to eliminate partitions, and the leftmost key column must be specified to perform an index seek.
With unique clustered index key (id,sales_date) and no other indexes:
WHERE id = 1 will perform an index seek against every partition to
find the single row.
WHERE sales_date = '20221114' will perform a full scan of the single
partition containing the date and return only rows matching the date.
WHERE id = 1 AND sales_date = '20221114' will perform a seek
against only the single partition containing the date and touch the
single row.
With unique clustered index key (sales_date,id):
WHERE id = 1 will perform a full scan of every partition to find the single
row.
WHERE sales_date = '20221114' will perform an index seek on only
the partition containing the date and touch only rows that qualify.
WHERE id = 1 AND sales_date = '20221114' will perform an index
seek on only the partition containing the date and touch only the single
row.
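To try the cases above, a minimal setup might look like the following. The object names (pf_sales_date, ps_sales_date, dbo.sales, cix_sales), the column types, and the boundary dates are assumptions for illustration, not taken from the question.

```sql
-- Monthly partitions on sales_date (boundary values are placeholders).
CREATE PARTITION FUNCTION pf_sales_date (date)
AS RANGE RIGHT FOR VALUES ('20221001', '20221101', '20221201');

-- Map every partition to PRIMARY for simplicity.
CREATE PARTITION SCHEME ps_sales_date
AS PARTITION pf_sales_date ALL TO ([PRIMARY]);

CREATE TABLE dbo.sales
(
    id         int IDENTITY(1,1) NOT NULL,
    sales_id   int NOT NULL,
    sales_date date NOT NULL,
    party_name varchar(100) NULL
) ON ps_sales_date (sales_date);

-- Unique clustered index with the partitioning column as the leftmost key.
CREATE UNIQUE CLUSTERED INDEX cix_sales
ON dbo.sales (sales_date, id)
ON ps_sales_date (sales_date);

-- Specifies both the partitioning column and the leftmost key column.
SELECT id, sales_id
FROM dbo.sales
WHERE sales_date = '20221114' AND id = 1;
```

Swapping the key order to (id, sales_date) in the clustered index gives the first set of behaviors listed above instead.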
I have created a nonclustered (NC) index with "LastUpdated" as an included column, but "LastUpdated" is used in the ORDER BY clause of my query. Should a column used in the ORDER BY clause be an included column of the NC index?
CREATE NONCLUSTERED INDEX [IXNI__symboltab__Status_active_Symbol]
ON symboltab (Status,active,Symbol)
INCLUDE (LastUpdated)
SELECT TOP 10000 symbol,LastUpdated
from symboltab
with (nolock index = IXNI__symboltab__Status_active_Symbol)
WHERE
active = 1
AND Status = 999 --999 is default
AND Symbol NOT LIKE '/%'
AND Symbol NOT LIKE '%#%'
AND Symbol NOT LIKE '!%'
ORDER BY LastUpdated ASC
As far as I know, columns listed in the INCLUDE clause are not part of the actual B-tree index, but rather appear only in the leaf nodes. A consequence of this is that for your current index, the leaf nodes would generally not be sorted by the LastUpdated values. The values would be there in the leaf nodes, but there is no guarantee of any sort. Therefore, if you want to give your index a chance to cover all parts of your query, you should move LastUpdated into the actual index structure:
CREATE NONCLUSTERED INDEX [IXNI__symboltab__Status_active_Symbol]
ON symboltab (Status, active, Symbol, LastUpdated);
The best index for this query is actually
CREATE NONCLUSTERED INDEX [IXNI__symboltab__Status_active_Symbol]
ON symboltab (Status, active, LastUpdated) INCLUDE (Symbol);
-- alternatively
CREATE NONCLUSTERED INDEX [IXNI__symboltab__Status_active_Symbol]
ON symboltab (active, Status, LastUpdated) INCLUDE (Symbol);
The reason is that both active and Status have equality predicates, and can be seeked directly, therefore they should come first (in either order).
Symbol cannot be seeked, as it has multiple inequality predicates. Even if it had only one, it would still mess up the final sort. Therefore it must go in the INCLUDE, which is not part of the index key.
Finally, LastUpdated goes last in the key. This means that the rows matching the equality predicates are already sorted by LastUpdated, so no extra sort is needed.
You can see the difference in this db<>fiddle
Side notes:
If you get the indexing right, you do not need an index hint.
Do not use NOLOCK unless you really know what you're doing. It's not a go-faster switch, it's a give-incorrect-results switch.
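With either of the (Status, active, LastUpdated) indexes suggested above in place, the query from the question could be written without the hints, roughly like this. This is a sketch only: whether the optimizer actually picks the index still depends on your data and statistics.

```sql
SELECT TOP (10000) Symbol, LastUpdated
FROM symboltab
WHERE active = 1
  AND Status = 999          -- equality predicates on the leading key columns
  AND Symbol NOT LIKE '/%'  -- Symbol is read from the INCLUDE column
  AND Symbol NOT LIKE '%#%'
  AND Symbol NOT LIKE '!%'
ORDER BY LastUpdated ASC;   -- matches the key order after the equality columns
```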
SELECT TOP or FETCH FIRST is actually a filter even if ORDER BY is absent, although there is usually an ORDER BY clause. The index should be:
CREATE NONCLUSTERED INDEX [IXNI__symboltab__Status_active_Symbol]
ON symboltab (LastUpdated,Status,active,Symbol)
And you are done searching the rows with your filters as fast as possible.
Note that if another similar query does not filter on 'active', for example, but does filter on 'Symbol' and the rest, only the LastUpdated+Status part of the index will be used. So if you have more queries, work out a better column order for the index: place columns further left the more often they are used, with the most frequently used leftmost.
Included columns are only used for reading, not for searching. A column that is already part of the index key does not need to be included. Since you select symbol and LastUpdated, which are both part of the index key, no included columns are needed. If you add another column such as 'FirstUpdated' tomorrow and do not filter on it (it appears only in the select list for display), you can add that column to INCLUDE to make the query faster: once the rows are found through the index, the included column lets the engine read the value from the index itself. Otherwise it has to read the found rows from the table to get that new column.
I have a table with a non-unique clustered index. I want to move these records to a table with a unique clustered index (I don't care which single record I get out of the table). I can do this via a GROUP BY, but I suspect it could be faster to select off of the hidden uniquifier column.
I have found some examples of how to access the value of the uniquifier column, but none that show how it might be used in a query, e.g.
SELECT *
FROM NonUniqueClusteredIndexTable
WHERE uniquifier = 0
Any ideas how to access this value, or otherwise how to quickly de-duplicate such a table?
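For comparison with the GROUP BY approach, the "keep one arbitrary row per key" requirement is often written with ROW_NUMBER. This is only a sketch: it does not read the uniquifier, and KeyCol, OtherCol, and UniqueClusteredIndexTable are hypothetical names standing in for the real clustering key, the remaining columns, and the target table.

```sql
-- KeyCol, OtherCol and UniqueClusteredIndexTable are hypothetical names.
WITH numbered AS
(
    SELECT KeyCol, OtherCol,
           -- ORDER BY (SELECT NULL): no preference for which duplicate is kept
           ROW_NUMBER() OVER (PARTITION BY KeyCol ORDER BY (SELECT NULL)) AS rn
    FROM NonUniqueClusteredIndexTable
)
INSERT INTO UniqueClusteredIndexTable (KeyCol, OtherCol)
SELECT KeyCol, OtherCol
FROM numbered
WHERE rn = 1;
```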
Suppose I need to update myTab from luTab as follows
update myTab
set LookupValue = (select LookupValue from luTab B
where B.idLookup = myTab.idLookup)
luTab consists of 2 columns (idLookup(unique), LookupValue)
Which is preferable: a unique clustered index on idLookup, or one on idLookup and LookupValue combined? Is a covering index going to make any difference in this situation?
(I'm mostly interested in SQL server)
Epilogue :
I followed up Krips tests below with 27M rows in myTab, 1.5M rows in luTab.
The crucial part seems to be the uniqueness of the index.
If the index is specified as unique, the update uses a hash table.
If it is not specified as unique, then the update first aggregates luTab by idLookup (the Stream Aggregate) and then uses a nested loop. This is much slower.
When I use the extended index, SQL Server is no longer assured that each idLookup returns a single LookupValue, so it is forced down the much slower Stream Aggregate / nested loop route.
Firstly:
A covering index is always non-clustered
You should always have a PK and a clustered index (they are the same by default on SQL Server)
The 2 concepts are separate
So:
Your PK (clustered) would be idLookup if this uniquely identifies a row
The covering index would be (idLookup) INCLUDE (LookupValue)
However:
idLookup is the PK (clustered), so you don't need a covering index
the clustered index (PK) is implicitly "covering" by the nature of a clustered index (simply, index is data at the lowest level)
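As a rough sketch of that layout (the column types and the constraint name PK_luTab are assumptions, since the question does not give them):

```sql
-- Column types and the constraint name are assumptions.
CREATE TABLE luTab
(
    idLookup    int NOT NULL CONSTRAINT PK_luTab PRIMARY KEY CLUSTERED,
    LookupValue varchar(50) NULL
);

-- The update from the question; the unique clustered PK on idLookup
-- guarantees at most one matching row per myTab row.
UPDATE myTab
SET LookupValue = (SELECT B.LookupValue
                   FROM luTab AS B
                   WHERE B.idLookup = myTab.idLookup);
```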
I've created your tables and loaded just a few records (50 or so lookup, and 15 in myTab).
Then I've tried various index options. The Index Seek on luTab always has a cost of 29%.
The interesting bit is that if you add in the LookupValue column to the index on luTab the execution plan shows two extra steps after the Index Seek: Stream Aggregate and Assert. While cost is 0%, that may go up with more data.
I've also tried a nonclustered index on just idLookup, with LookupValue as an 'Included Column'. That way the data pages don't need to be accessed to retrieve that column. That may be an option for you, although the execution plan doesn't show anything different (but it doesn't have the Stream Aggregate / Assert either).
-Krip
I'm in the process of trying to optimize a query that looks up historical data. I'm using the query analyzer to lookup the Execution Plan and have found that the majority of my query cost is on something called a "Bookmark Lookup". I've never seen this node in an execution plan before and don't know what it means.
Is this a good thing or a bad thing in a query?
A bookmark lookup is the process of finding the actual data in the SQL table, based on an entry found in a non-clustered index.
When you search for a value in a non-clustered index, and your query needs more fields than are part of the index leaf node (all the index fields, plus any possible INCLUDE columns), then SQL Server needs to go retrieve the actual data page(s) - that's what's called a bookmark lookup.
In some cases, that's really the only way to go. But if your query requires just one more field (not a whole bunch of 'em), it might be a good idea to INCLUDE that field in the non-clustered index. In that case, the leaf-level node of the non-clustered index would contain all fields needed to satisfy your query (a "covering" index), and thus a bookmark lookup wouldn't be necessary anymore.
Marc
It's a NESTED LOOP which joins a non-clustered index with the table itself on a row pointer.
Happens for the queries like this:
SELECT col1
FROM table
WHERE col2 BETWEEN 1 AND 10
, if you have an index on col2.
The index on col2 contains pointers to the indexed rows.
So, in order to retrieve the value of col1, the engine needs to scan the index on col2 for the key values from 1 to 10, and for each index leaf, refer to the table itself using the pointer contained in the leaf, to find out the value of col1.
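Using the example above, an index that also carries col1 in its leaf level would let the engine answer the query from the index alone, without going back to the table. A sketch, with a made-up index name:

```sql
-- IX_table_col2_covering is a hypothetical name.
-- [table] is bracketed because TABLE is a reserved word.
CREATE NONCLUSTERED INDEX IX_table_col2_covering
ON [table] (col2)
INCLUDE (col1);
```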
This article points out that Bookmark Lookup is SQL Server 2000's term, which is replaced by a Nested Loops join between the index and the table in SQL Server 2005 and above.
From MSDN regarding Bookmark Lookups:
The Bookmark Lookup operator uses a
bookmark (row ID or clustering key) to
look up the corresponding row in the
table or clustered index. The Argument
column contains the bookmark label
used to look up the row in the table
or clustered index. The Argument
column also contains the name of the
table or clustered index in which the
row is looked up. If the WITH PREFETCH
clause appears in the Argument column,
the query processor has determined
that it is optimal to use asynchronous
prefetching (read-ahead) when looking
up bookmarks in the table or clustered
index.