Indexing for complex predicates - sql-server

I'm struggling to identify effective indexes (or rewrite the query) to improve a query with the following confounding predicates:
JOIN on a date from one table being in range - between two date fields on second table (one is nullable, one is not nullable in PK).
The date used is actually the value in date field (nullable) +1.
WHERE clauses includes OR logic on multiple flag fields.
The simplified version of the query is:
select
d.dim_date_id
,f.dim_provider_id
,f.dim_event_id
,d.date
from DWH.dbo.tbl_fact_outcome f
join DWH.dbo.tbl_dim_date d on DATEADD(DAY,1,d.date) between f.known_from and f.known_to
where
f.known_from > getdate()-12
and (d.flag_latest_day = 'Y' or d.flag_end_of_month = 'Y' or (d.flag_end_of_week = 'Y' AND d.flag_latest_week = 'Y'))
and d.flag_future_day = 'N'
and f.deleted = 0
tbl_fact_outcome has these indexes:
PK clustered index on input_form_id, known_from
Non-unique Nonclustered index on deleted, known_from, known_to (INCLUDES the required _dim_id fields)
tbl_dim_date has these indexes:
PK clustered index on dim_date_id
Non-unique nonclustered index on flag_future_day, date (INCLUDES relevant flag fields)
At present, it estimates 853 rows but returns 16,784.
Here is the query plan:
https://www.brentozar.com/pastetheplan/?id=rydKb_3AI
Statistics are up to date.
I have tried re-ordering the covering indexes but no improvement.
I'm totally stumped as to what else to try with indexes or the code itself to improve performance, so any pointers appreciated.
EDIT 05/07/2020
Ruled out following suggestions here:
Filtered index (on deleted) on tbl_fact_outcome - less than 1% of records would be filtered out, so not worthwhile
Filtered index (using entire WHERE clause from query) on tbl_dim_date - not possible to use OR in index
Index on tbl_dim_date with INCLUDEd fields as key fields - tried this, made no difference, not used by optimizer.

Guessing all/most queries filter on deleted I would suggest a filtered index.
CREATE NONCLUSTERED INDEX TodoNewIndexName ON DWH.dbo.tbl_fact_outcome (
known_from ASC
,known_to ASC
)
INCLUDE (dim_event_id,dim_provider_id)
WHERE deleted = 0;
If this query is really running frequently you could also consider using a filterd index for tbl_dim_date. This will probably only be used by this query, since the where is an exact match of your query:
CREATE NONCLUSTERED INDEX TodoNewIndexName ON DWH.dbo.tbl_dim_date (DATE ASC)
WHERE (
d.flag_latest_day = 'Y'
OR d.flag_end_of_month = 'Y'
OR (
d.flag_end_of_week = 'Y'
AND d.flag_latest_week = 'Y'
)
)
AND d.flag_future_day = 'N'
If you don't want a filterd index on the flag fields. You should add the flag fields to the index instead of includes.
CREATE NONCLUSTERED INDEX TodoNewIndexName ON DWH.dbo.tbl_dim_date (
DATE ASC
,flag_latest_day ASC
,flag_end_of_month ASC
,flag_end_of_week ASC
,flag_latest_week ASC
)
What this should do is get rid of the Index Spool (Eager Spool) more info about eager spool.
Eager index spools are often a sign that a useful permanent index is
missing from the database schema. This is not always the case, as the
streaming table-valued function examples show.

Has your date dimension a nextDay column or something? If not you can add this column and replace DATEADD(DAY,1,d.date) with this new column.

Related

SQL Query is slow when ORDER BY statement added

I have a table [Documents] with the following columns:
Name (string)
Status (string)
DateCreated [datetime]
This table has around 1 million records. All three of these columns have an index (a single index for each one).
When I run this query:
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New';
Execution is really fast (300 ms.)
If I run the same query but with the ORDER BY clause, it's really slow (3000 ms)
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New'
order by DateCreated;
I understand that its searching in another index (DateCreated), but should it really be that much slower? If so, why? Anything I can do to speed this query up (a composite index)?
Thanks
BTW: All Indexes including DateCreated have really low fragmentation, in fact I ran a reorganize and it didn't change a thing.
As far as why the query is slower, the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of CreatedDate, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether the row is to be returned, looking at the values in Status and Name columns.
If the optimizer chooses not to use the index with CreatedDate as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set, and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfied the predicates, but not necessarily the 50 rows with the lowest values of DateCreated,you could restructure/rewrite your query, to get (at most) 50 rows, and then perform the sort of just those.
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
ON Documents (Status, DateCreated, Name)
SQL Server might be able to use that index to satisfy the equality predicate on Status, and also return the rows in DateCreated order without a sort operation. SQL server may also be able to satisfy the predicate on Name from the index, limiting the number of lookups to pages in the underlying table, which it needs to do for rows to be returned, to get "all" of the columns for the row.
For SQL Server 2008 or later, I'd consider a filtered index... dependent on the cardinality of Status='New' (that is, if rows that satisfy the predicate Status='New' is a relatively small subset of the table.
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name
so that the order by clause matches the index, it doesn't really change the order that the rows are returned in.
As a more complicated alternative, I would consider adding a persisted computed column and adding a filtered index on that
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
WHERE new_none_date_created IS NOT NULL
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;
If DateCreated field means insertion time to table, you can create an integer id field and order by that integer field.
You need an index by 2 columns: (Name, DateCreated). The order of fields in the index is important. So, replace your index for just Name with a new index for two columns (Name, DateCreated).

How do indexes work behind the scenes

Im a begginer. I know indexes are necessary for performance boosts, but i want to know how they actually work behind the scenes. Beforehand, I used to think that we should make indexes on those columns which are included in where clause (which I realized is wrong)
For example, SELECT * from MARKS where marks_obtained > 50
Consider that there's a clustered index on primary key of this table and I created a non-clustered index on marks_obtained column as its there in my where clause.
My perception: So the leaf nodes will be containing pointers to clustered index and as clustered index points to actual rows, it will select entire rows (due to asteric in my query)
Scenario
I came across following query (from AdventureWorks DB on which a non-clustered index was created) which works fine and took less than a second to execute 3200000 rows until a new column was inserted into it:
Query
SELECT x.*
INTO#X
FROM dbo.bigProduct AS p
CROSS APPLY
(
SELECT TOP 1000 *
FROM dbo.bigTransactionHistory AS bth
WHERE
bth.ProductId = p.bth.ProductId
ORDER BY
TransactionDate DESC
) AS x
WHERE
p.ProductId BETWEEN 1000 AND 7500
GO
NEW INSERTED COLUMN
ALTER TABLE dbo.bigTransactionHistory
ADD CustomerId INT NULL
After insertion of above column it took 17 seconds! means 17 times slower. A non-clusered index was now missing CustomerId column in the index. Just after including CustomerId, problem was gone.
Question CustomerId seemed to be the culprit until it was added to the index. BUT HOW???
The execution plan would answer this but I'll make a guess: The non-clustered index was no longer enough to satisfy the query after the additional column had been added. This can cause the index to not be used anymore. It also can cause one clustered index seek per row.
Learn to read execution plans. Turn on the "actual execution plan" feature routinely for each query that you test.

Index Scan with PROBE instead of an Index Seek

I have a query that looks like this:
--Updated To remove Distinct per Aaron Bertrand's suggestion in the comments
SELECT TOP 100 ord.OrderId
FROM Customer cust
JOIN CustomerOrder ord
ON ord.CustomerId = cust.CustomerId
WHERE cust.FirstName LIKE (#firstName + '%')
ORDER BY ord.CreatedWhen DESC
And I have an index like this:
CREATE NONCLUSTERED INDEX [IX_MyIndex] ON CustomerOrder
(
OrderId DESC,
CustomerId DESC,
CreatedWhen Desc
)
GO
When I run my query, the index gets used, but it is an index scan. And it gives this message:
PROBE([Bitmap1011],[MyDatabase].[order].[CustomerOrder].[OrderId] as [ord].[OrderId],N'[IN ROW]')
The output list consists of the OrderId and CreatedWhen.
What is this PROBE doing and why I don't get an Index Seek?
UPDATE:
The FirstName column on the Customer table does have an index that is being used in an IndexSeek.
CREATE NONCLUSTERED INDEX [IX_Customer_FirstName] ON Customer
(
[FirstName] ASC
)
GO
The reason that an Index Scan gets used is because your WHERE clause predicate is based on CustomerId, but it appears as the SECOND column in the list of columns in your non-clustered index [IX_MyIndex].
If you want an Index Seek to be performed, you would need to specify a new non-clustered index just on column CustomerId.
And that would essentially be a good practice - have two separate NC indices for OrderId and CustomerId. So when you join Customer and CustomerOrder tables, it will use the NC Index for CustomerId, and when you join Order and CustomerOrder tables, it will use the NC index for OrderId.
Refer to this article to read more about the difference between a multi-column non-clustered index (which you currently have) and multiple non-clustered indexes (which I proposed using).
[UPDATE]
But creating separate non-clustered indexes is not sufficient in getting an Index Seek everytime. That will depend on the columns being selected in the query, and the size of the data being read - based on that the query optimizer will accordingly make a decision on whether to use an Index Seek or an Index Scan. See this answer for more information.
[UPDATE Feb 8, 2021]
At a high-level, the PROBE function in question would essentially try to verify whether the CustomerOrder.OrderId column value is present in the Customer table. This is achieved internally through the using of bitmaps and hash keys, and you can read in detail about it here.
Note that a PROBE is not specific to an Index Scan or an Index Seek. It is simply a function that is utilized for verifying matches (based on a certain hash keyed column(s)) between two tables in a join.
Simple reason: your FirstName column isn't in the index. It must scan every row to see if the row matches the pattern you want.

SQL Server index - very large table with where clause against a very small range of values - do I need an index for the where clause?

I am designing a database with a single table for a special scenario I need to implement a solution for. The table will have several hundred million rows after a short time, but each row will be fairly compact. Even when there are a lot of rows, I need insert, update and select speeds to be nice and fast, so I need to choose the best indexes for the job.
My table looks like this:
create table dbo.Domain
(
Name varchar(255) not null,
MetricType smallint not null, -- very small range of values, maybe 10-20 at most
Priority smallint not null, -- extremely small range of values, generally 1-4
DateToProcess datetime not null,
DateProcessed datetime null,
primary key(Name, MetricType)
);
A select query will look like this:
select Name from Domain
where MetricType = #metricType
and DateProcessed is null
and DateToProcess < GETUTCDATE()
order by Priority desc, DateToProcess asc
The first type of update will look like this:
merge into Domain as target
using #myTablePrm as source
on source.Name = target.Name
and source.MetricType = target.MetricType
when matched then
update set
DateToProcess = source.DateToProcess,
Priority = source.Priority,
DateProcessed = case -- set to null if DateToProcess is in the future
when DateToProcess < DateProcessed then DateProcessed
else null end
when not matched then
insert (Name, MetricType, Priority, DateToProcess)
values (source.Name, source.MetricType, source.Priority, source.DateToProcess);
The second type of update will look like this:
update Domain
set DateProcessed = source.DateProcessed
from #myTablePrm source
where Name = source.Name and MetricType = #metricType
Are these the best indexes for optimal insert, update and select speed?
-- for the order by clause in the select query
create index IX_Domain_PriorityQueue
on Domain(Priority desc, DateToProcess asc)
where DateProcessed is null;
-- for the where clause in the select query
create index IX_Domain_MetricType
on Domain(MetricType asc);
Observations:
Your updates should use the PK
Why not use tinyint (range 0-255) to make the rows even narrower?
Do you need datetime? Can you use smalledatetime?
Ideas:
Your SELECT query doesn't have an index to cover it. You need one on (DateToProcess, MetricType, Priority DESC) INCLUDE (Name) WHERE DateProcessed IS NULL
`: you'll have to experiment with key column order to get the best one
You could extent that index to have a filtered indexes per MetricType too (keeping DateProcessed IS NULL filter). I'd do this after the other one when I do have millions of rows to test with
I suspect that your best performance will come from having no indexes on Priority and MetricType. The cardinality is likely too low for the indexes to do much good.
An index on DateToProcess will almost certainly help, as there is lilely to be high cardinality in that column and it is used in a WHERE and ORDER BY clause. I would start with that first.
Whether an index on DateProcessed will help is up for debate. That depends on what percentage of NULL values you expect for this column. Your best bet, as usual, is to examine the query plan with some real data.
In the table schema section, you have highlighted that 'MetricType' is one of two Primary keys, therefore this should definately be indexed along with the Name column. As for the 'Priority' and 'DateToProcess' fields as these will be present in a where clause it can't hurt to have them indexed also but I don't recommend the where clause you have on that index of 'DateProcessed' is null, indexing just a set of the data is not a good idea, remove this and index the whole of both those columns.

Suitable indexes for sorting in ranking functions

I have a table which keeps parent-child-relations between items. Those can be changed over time, and it is necessary to keep a complete history so that I can query how the relations were at any time.
The table is something like this (I removed some columns and the primary key etc. to reduce noise):
CREATE TABLE [tblRelation](
[dtCreated] [datetime] NOT NULL,
[uidNode] [uniqueidentifier] NOT NULL,
[uidParentNode] [uniqueidentifier] NOT NULL
)
My query to get the relations at a specific time is like this (assume #dt is a datetime with the desired date):
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY r.uidNode ORDER BY r.dtCreated DESC) ix, r.*
FROM [tblRelation] r
WHERE (r.dtCreated < #dt)
) r
WHERE r.ix = 1
This query works well. However, the performance is not yet as good as I would like. When looking at the execution plan, it basically boils down to a clustered index scan (36% of cost) and a sort (63% of cost).
What indexes should I use to make this query faster? Or is there a better way altogether to perform this query on this table?
The ideal index for this query would be with key columns uidNode, dtCreated and included columns all remaining columns in the table to make the index covering as you are returning r.*. If the query will generally only be returning a relatively small number of rows (as seems likely due to the WHERE r.ix = 1 filter) it might not be worthwhile making the index covering though as the cost of the key lookups might not outweigh the negative effects of the large index on CUD statements.
The window/rank functions on SQL Server 2005 are not that optimal sometimes (based on answers here). Apparently better in SQL Server 2008
Another alternative is something like this. I'd have a non-clustered index on (uidNode, dtCreated) INCLUDE any other columns required by SELECT. Subject to what Martin Smith said about lookups.
WITH MaxPerUid AS
(
SELECT
MAX(r.dtCreated) AS MAXdtCreated, r.uidNode
FROM
MaxPerUid
WHERE
r.dtCreated < #dt
GROUP BY
r.uidNode
)
SELECT
...
FROM
MaxPerUid M
JOIN
MaxPerUid R ON M.uidNode = R.uidNode AND M.MAXdtCreated = R.dtCreated

Resources