WHERE clause gives poor query plan - sql-server

I'm not sure how best to tune this query and/or its indexes to avoid a blunt FORCE ORDER hint.
This main query runs fine, currently returns 0 rows in 0 seconds:
SELECT S1.ID, S.LOAD_DATE, s.Deleted,S1.HUB_FORM_ID
FROM #TMP S
INNER JOIN HUB_FORM H1 ON
H1.Form_ID = S.HUB_FORM_BK
INNER JOIN HUB_ORG H2 ON
H2.Organisation_ID = S.HUB_ORG_BK
INNER JOIN HUB_PERSON H3 ON
H3.person_id = S.HUB_PERSON_BK
INNER JOIN HUB_EVENT H4 ON
H4.job_id = S.HUB_EVENT_BK
INNER JOIN HUB_WORKFLOW_STEP H5 ON
H5.step_id = S.HUB_WORKFLOW_STEP_BK
INNER JOIN LNK_FORM_ENTITY S1 ON
H1.HUB_FORM_ID = S1.HUB_FORM_ID AND H2.HUB_ORG_ID = S1.HUB_ORG_ID AND H3.HUB_PERSON_ID = S1.HUB_PERSON_ID AND H4.HUB_EVENT_ID = S1.HUB_EVENT_ID AND H5.HUB_WORKFLOW_STEP_ID = S1.HUB_WORKFLOW_STEP_ID
INNER JOIN DK_SAT_LNK_FORM_ENTITY S2 ON
S1.ID = S2.Parent_ID
Adding a WHERE clause on S2.LOAD_DATE_TO makes it run and run (killed off after a minute or two).
WHERE S2.LOAD_DATE_TO = '31/12/9999'
I'm not sure why that happens as:
Without the filter, no rows are returned, so it can make no difference.
The index used for the table containing this field in the good plan (with no date filter), already contains that field as the second key field so I'd have thought any additional cost is negligible
NB - it doesn't always return 0 rows, but it needs to run (and complete in a reasonable time) whether rows are returned or not.
CREATE NONCLUSTERED INDEX [JM_TEST_190221_2] ON [dbo].[DK_SAT_LNK_FORM_ENTITY]
(
[Parent_ID] ASC,
[LOAD_DATE_TO] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
The live query plan shows it running through millions of rows in the LNK_ and DK_ tables and subsequently joined tables, whereas in the original plan it shows actual number of rows = 56 (56 executions - expected 1 row) on the LNK_ table and 0 actual rows (56 executions) on the DK_ table.
If I add OPTION (FORCE ORDER) after the WHERE clause, it runs in 0 seconds again, with a different query plan to the original good one.
Clearly that resolves the issue in the short term, but I'm wary of using such a blunt instrument given that it may not always be the optimal choice as data changes over time.
Edit
I have tried updating statistics with FULL SCAN, and rebuilding key indexes but it had no impact.
Query plans below - any tips or explanation gratefully received!
Original good plan (actual plan): no WHERE clause : https://www.brentozar.com/pastetheplan/?id=HyG3SwTZd
Poor plan (from live query plan at point killed off) : https://www.brentozar.com/pastetheplan/?id=rJpBSPpWO
Good plan with FORCE ORDER hint : https://www.brentozar.com/pastetheplan/?id=SJqxUvT-d

Your issue is that the join to HUB_FORM is selective enough to cut the row count to 0 at the very start, but the optimizer does not realize that and so it reverses the order of the joins.
To enforce the order without hammering the rest of the query via FORCE ORDER, we have two options:
Pre-compute the join of #TMP, HUB_FORM into a temp table or table variable. This can often cause a fair bit of extra IO.
A much better option is to persuade the optimizer to compute the join first, but without using explicit hints.
This is often best done by putting the join inside a subquery with a SELECT TOP, but you may need to modify this by adding one or two further joins.
SELECT S1.ID, S.LOAD_DATE, S.Deleted, S1.HUB_FORM_ID
FROM (
-- H1 is not visible outside the derived table, so expose its key under an alias
SELECT TOP (9223372036854775807) S.*, H1.HUB_FORM_ID AS H1_HUB_FORM_ID
FROM #TMP S
INNER JOIN HUB_FORM H1 ON
H1.Form_ID = S.HUB_FORM_BK
) S
INNER JOIN HUB_ORG H2 ON
H2.Organisation_ID = S.HUB_ORG_BK
INNER JOIN HUB_PERSON H3 ON
H3.person_id = S.HUB_PERSON_BK
INNER JOIN HUB_EVENT H4 ON
H4.job_id = S.HUB_EVENT_BK
INNER JOIN HUB_WORKFLOW_STEP H5 ON
H5.step_id = S.HUB_WORKFLOW_STEP_BK
INNER JOIN LNK_FORM_ENTITY S1 ON
S.H1_HUB_FORM_ID = S1.HUB_FORM_ID AND H2.HUB_ORG_ID = S1.HUB_ORG_ID AND H3.HUB_PERSON_ID = S1.HUB_PERSON_ID AND H4.HUB_EVENT_ID = S1.HUB_EVENT_ID AND H5.HUB_WORKFLOW_STEP_ID = S1.HUB_WORKFLOW_STEP_ID
INNER JOIN DK_SAT_LNK_FORM_ENTITY S2 ON
S1.ID = S2.Parent_ID
If that doesn't work, you may be able to persuade it by changing the TOP to a variable, and adding an OPTIMIZE FOR hint at the end:
DECLARE @topRows bigint = 9223372036854775807;
SELECT S1.ID, S.LOAD_DATE, s.Deleted, S1.HUB_FORM_ID
FROM (
SELECT TOP (@topRows) S.*
FROM #TMP S
INNER JOIN HUB_FORM H1 ON
H1.Form_ID = S.HUB_FORM_BK
) S
INNER JOIN HUB_ORG H2 ON
.........
OPTION (OPTIMIZE FOR (@topRows = 1));
This causes the optimizer to think it will only get 1 row out of the join, but still allows more rows through if that is the case at runtime.
Note that none of this changes the essential semantics of the query.
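The first option (materializing the #TMP to HUB_FORM join before running the rest) might look like the sketch below. The temp table name #TMP_FORM and the alias H1_HUB_FORM_ID are made up for illustration, and the sketch assumes #TMP does not already have a column with that name.

```sql
-- Step 1: materialize the selective join (hypothetical table #TMP_FORM).
SELECT S.*, H1.HUB_FORM_ID AS H1_HUB_FORM_ID   -- hypothetical alias
INTO #TMP_FORM
FROM #TMP S
INNER JOIN HUB_FORM H1 ON
    H1.Form_ID = S.HUB_FORM_BK;

-- Step 2: run the remaining joins against the (possibly empty) result.
SELECT S1.ID, S.LOAD_DATE, S.Deleted, S1.HUB_FORM_ID
FROM #TMP_FORM S
INNER JOIN HUB_ORG H2 ON
    H2.Organisation_ID = S.HUB_ORG_BK
INNER JOIN HUB_PERSON H3 ON
    H3.person_id = S.HUB_PERSON_BK
INNER JOIN HUB_EVENT H4 ON
    H4.job_id = S.HUB_EVENT_BK
INNER JOIN HUB_WORKFLOW_STEP H5 ON
    H5.step_id = S.HUB_WORKFLOW_STEP_BK
INNER JOIN LNK_FORM_ENTITY S1 ON
    S.H1_HUB_FORM_ID = S1.HUB_FORM_ID
    AND H2.HUB_ORG_ID = S1.HUB_ORG_ID
    AND H3.HUB_PERSON_ID = S1.HUB_PERSON_ID
    AND H4.HUB_EVENT_ID = S1.HUB_EVENT_ID
    AND H5.HUB_WORKFLOW_STEP_ID = S1.HUB_WORKFLOW_STEP_ID
INNER JOIN DK_SAT_LNK_FORM_ENTITY S2 ON
    S1.ID = S2.Parent_ID
WHERE S2.LOAD_DATE_TO = '31/12/9999';   -- literal copied from the question

DROP TABLE #TMP_FORM;
```

Because the optimizer compiles the second statement after #TMP_FORM has been populated, it can see the real row count, at the price of the extra write to tempdb mentioned above.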

Related

Table UPDATE in SQL Server slows down an Index Seek in a subquery

I have the following query in SQL Server Management Studio 18, let's call it Query1:
SELECT
stage.IDContratto,
SUM(stageReg.Costo) AS Costo
FROM STAGING.TabContrattiRedditivita AS stage
INNER JOIN STAGING.TabCommesse AS stageCom ON stage.CodiceContratto = stageCom.CodiceContrattoCommessa
INNER JOIN STAGING.TabRegistrazioneOreRisorse AS stageReg
ON stageCom.CodiceCommessa = stageReg.CodiceCommessaCalcolato
AND stageReg.DataRegistrazione BETWEEN stage.StartDate AND stage.EndDate
WHERE stageCom.SeMotivoNonFatturabilePerditaCommessa = 1
GROUP BY stage.IDContratto
TabContrattiRedditivita has 16K rows, TabCommesse has 49K rows, and TabRegistrazioneOreRisorse has 6.8 million rows. Query1 returns 1,200 rows. Because of the IX_CostiCommessa non-clustered index I put on TabRegistrazioneOreRisorse (details below), this query completes in about 3 min, which all in all is fine for me. You can see the actual exec plan here.
However I actually use Query1 inside an UPDATE of TabContrattiRedditivita, let's call it Query2:
UPDATE STAGING.TabContrattiRedditivita
SET
ActualCostoCommesseNonFatturanti += costi.Costo,
TotaleCostoCommesseNonFatturanti += costi.Costo
FROM STAGING.TabContrattiRedditivita AS stage
INNER JOIN (Query1) AS costi ON stage.IDContratto = costi.IDContratto
Query2 completes in 16 min or more, which is not fine. You can see the actual exec plan here.
You might think it's a problem with the write operations, but below I report some strange facts that lead me to think it is not.
First, Query1 returns just 1,200 rows, so the write operations are insignificant (in my ETL I do UPDATEs 2 to 3 orders of magnitude larger without any performance problem). Second, as you can see above, the actual exec plan of the subquery Query1 inside Query2 looks identical to the actual exec plan of Query1 executed alone (except for percentages, of course). Third, live statistics for Query2 seem to reveal that the Index Seek on TabRegistrazioneOreRisorse is what slows down Query2, not the UPDATE operation, which instead takes < 1 sec (notice the total running time was 17 min 11 sec):
This is the same Index Seek that in Query1 only took about 3 min (total running time: 3 min 10 sec):
So it seems like the mere presence of the UPDATE is causing Query1 to slow down dramatically even before the UPDATE is executed.
Here comes the twist: if I copy my data warehouse tables TabContrattiRedditivita, TabCommesse and TabRegistrazioneOreRisorse into temp tables #Tab1, #Tab2 and #Tab3 respectively, and then create the same PKs and indexes on these temp tables, everything suddenly works. Query1:
SELECT
stage.IDContratto,
SUM(stageReg.Costo) AS Costo
FROM #Tab1 AS stage
INNER JOIN #Tab2 AS stageCom ON stage.CodiceContratto = stageCom.CodiceContrattoCommessa
INNER JOIN #Tab3 AS stageReg
ON stageCom.CodiceCommessa = stageReg.CodiceCommessaCalcolato
AND stageReg.DataRegistrazione BETWEEN stage.StartDate AND stage.EndDate
WHERE stageCom.SeMotivoNonFatturabilePerditaCommessa = 1
GROUP BY stage.IDContratto
Execution time about 3 min, just as the previous Query1; actual exec plan here. Query2:
UPDATE #Tab1
SET
ActualCostoCommesseNonFatturanti += costi.Costo,
TotaleCostoCommesseNonFatturanti += costi.Costo
FROM #Tab1 AS stage
INNER JOIN (Query1) AS costi ON stage.IDContratto = costi.IDContratto
Execution time about 3 min 10 sec, instead of 16 or 17 min like the previous Query2; actual exec plan here.
How can this be? Any clue about how to fix this?
Note: I also tried a couple of alternatives, which turned out to be ineffective.
I tried to use a #temp table: I put Query1 INTO #temp, then executed Query2 this way:
UPDATE STAGING.TabContrattiRedditivita
SET
ActualCostoCommesseNonFatturanti += costi.Costo,
TotaleCostoCommesseNonFatturanti += costi.Costo
FROM STAGING.TabContrattiRedditivita AS stage
INNER JOIN #temp AS costi ON stage.IDContratto = costi.IDContratto
The results are the same, but this time Query1 is the slow part: Query1 runs in 16 min, with the Index Seek on TabRegistrazioneOreRisorse very slow, then Query2 runs in a few seconds.
I also tried to use a CTE in two ways. Way number 1:
WITH CostoRegistrazioni AS (Query1)
UPDATE STAGING.TabContrattiRedditivita
SET
ActualCostoCommesseNonFatturanti += costi.Costo,
TotaleCostoCommesseNonFatturanti += costi.Costo
FROM STAGING.TabContrattiRedditivita AS stage
INNER JOIN CostoRegistrazioni AS costi ON stage.IDContratto = costi.IDContratto
Way number 2:
WITH updateStage AS (
SELECT
ActualCostoCommesseNonFatturanti,
TotaleCostoCommesseNonFatturanti,
costi.Costo
FROM STAGING.TabContrattiRedditivita AS stage
INNER JOIN (Query1) AS costi ON stage.IDContratto = costi.IDContratto
)
UPDATE updateStage
SET
ActualCostoCommesseNonFatturanti += Costo,
TotaleCostoCommesseNonFatturanti += Costo
In both cases the result is the same: Query1 runs in 16 min, with the Index Seek on TabRegistrazioneOreRisorse very slow.
Technical details
Where you see Clustered Index Scans on PK_TableName in the above exec plans, PK_TableName are just the standard clustered indexes that SQL Server creates on the table's PK. IX_CostiCommessa is instead defined as follows (on #Tab3 it is exactly the same):
CREATE NONCLUSTERED INDEX [IX_CostiCommessa]
ON [STAGING].[TabRegistrazioneOreRisorse] (
[DataRegistrazione] ASC,
[CodiceCommessaCalcolato] ASC
)
INCLUDE (
[Costo],
[SeRisorsaInterna],
[SeRisolutivo]
)
WITH (
PAD_INDEX = ON,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ONLINE = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON,
FILLFACTOR = 100
)
Temp tables are defined as follows:
SELECT *
INTO #Tab1
FROM STAGING.TabContrattiRedditivita
SELECT *
INTO #Tab2
FROM STAGING.TabCommesse
SELECT *
INTO #Tab3
FROM STAGING.TabRegistrazioneOreRisorse
You may have noticed some missing index suggestions in the actual exec plans above. I already tried creating them; the only effect was to slow down the queries (even Query1).
The warning you may have seen in the actual exec plan of Query2 is an ExcessiveGrant, which I'm not sure how to interpret:
Directly updating an updatable CTE should prove to be faster, as you don't need to re-query #Tab1:
WITH costi AS (
SELECT
stage.ActualCostoCommesseNonFatturanti,
stage.TotaleCostoCommesseNonFatturanti,
SUM(stageReg.Costo) OVER (PARTITION BY stage.IDContratto) AS Costo
FROM #Tab1 AS stage
INNER JOIN #Tab2 AS stageCom ON stage.CodiceContratto = stageCom.CodiceContrattoCommessa
INNER JOIN #Tab3 AS stageReg
ON stageCom.CodiceCommessa = stageReg.CodiceCommessaCalcolato
AND stageReg.DataRegistrazione BETWEEN stage.StartDate AND stage.EndDate
WHERE stageCom.SeMotivoNonFatturabilePerditaCommessa = 1
)
UPDATE costi
SET
ActualCostoCommesseNonFatturanti += costi.Costo,
TotaleCostoCommesseNonFatturanti += costi.Costo;
I would also recommend the following indexes:
stage (IDContratto) INCLUDE (CodiceContratto, StartDate, EndDate, ActualCostoCommesseNonFatturanti, TotaleCostoCommesseNonFatturanti)
stageCom (SeMotivoNonFatturabilePerditaCommessa, CodiceContrattoCommessa) INCLUDE (CodiceCommessa)
stageReg (CodiceCommessaCalcolato, DataRegistrazione) INCLUDE (Costo)
You could alternately make a filtered index on stageCom
stageCom (CodiceContrattoCommessa) INCLUDE (CodiceCommessa, SeMotivoNonFatturabilePerditaCommessa) WHERE (SeMotivoNonFatturabilePerditaCommessa = 1)
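Written out as DDL against the permanent STAGING tables, the recommendations above might look like the following sketch. The index names are invented for illustration; the aliases stage, stageCom and stageReg are mapped to the tables they stand for in the question.

```sql
-- stage = STAGING.TabContrattiRedditivita (index names below are hypothetical)
CREATE NONCLUSTERED INDEX IX_Contratti_IDContratto
ON STAGING.TabContrattiRedditivita (IDContratto)
INCLUDE (CodiceContratto, StartDate, EndDate,
         ActualCostoCommesseNonFatturanti, TotaleCostoCommesseNonFatturanti);

-- stageCom = STAGING.TabCommesse
CREATE NONCLUSTERED INDEX IX_Commesse_Flag_Contratto
ON STAGING.TabCommesse (SeMotivoNonFatturabilePerditaCommessa, CodiceContrattoCommessa)
INCLUDE (CodiceCommessa);

-- stageReg = STAGING.TabRegistrazioneOreRisorse
CREATE NONCLUSTERED INDEX IX_Registrazione_Commessa_Data
ON STAGING.TabRegistrazioneOreRisorse (CodiceCommessaCalcolato, DataRegistrazione)
INCLUDE (Costo);

-- Alternative to the second index: a filtered index on stageCom
CREATE NONCLUSTERED INDEX IX_Commesse_Contratto_Filtered
ON STAGING.TabCommesse (CodiceContrattoCommessa)
INCLUDE (CodiceCommessa, SeMotivoNonFatturabilePerditaCommessa)
WHERE SeMotivoNonFatturabilePerditaCommessa = 1;
```

Note that the stageReg index leads with the equality column (CodiceCommessaCalcolato) and puts the range column (DataRegistrazione) second, which is the reverse of the key order in IX_CostiCommessa.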

SQL Server Index over lookup table of distinct values

I am trying to speed up the following SQL Server query:
SELECT
V.Id, V.Number, V.VisitDate, V.ArrivalTime, V.VisitKindId,
VK.Description AS VisitKindDescription,
VK.DescriptionAr AS VisitKindDescriptionAr,
V.StatusId, V.Note, V.CancelingReason, V.CancelingTime, V.EnterToDoctorRoomTime,
V.PatientId, P.Number AS PatientNumber, P.FirstName, P.LastName, P.BirthDate,
P.Note AS PatientNotes,
V.DoctorId, D.FullName AS DoctorFullName,
V.CreatedById, U.FullName AS UserFullName,
V.CreationDate, V.VersionNo
FROM
Patient_Tbl P INNER JOIN
Visit_Tbl V ON P.Id = V.PatientId INNER JOIN
VisitKind_Tbl VK ON V.VisitKindId = VK.Id INNER JOIN
Doctor_Tbl D ON V.DoctorId = D.Id INNER JOIN
User_Tbl U ON V.CreatedById = U.Id INNER JOIN
VisitStatus_Tbl VS ON V.StatusId = VS.Id
WHERE V.StatusId = 2 --patient is in doctor room
VisitStatus_Tbl holds the following 4 values:
(1 -> In Waiting Room, 2 -> In Doctor Room, 3 -> Canceled, 4 -> Completed)
and at any one time there is only one record in the Visit table for a given person in the doctor's room.
The end users tell me that there is a delay in the use case that depends on the above query.
Please help us speed up the system by suggesting the proper index.
Thanks,
You do not indicate whether you have any indexes on the tables now. I will assume that the 'Id' columns of Patient_Tbl, etc. are primary keys (clustered or not) and therefore indexed. If not, that is another problem.
Simple rule: start by indexing foreign keys (the columns that point at lookup tables) and the columns used in WHERE clauses.
CREATE INDEX ix_visit_tbl_statusid ON visit_tbl(statusId)
CREATE INDEX ix_visit_tbl_patientid ON visit_tbl(patientId)
CREATE INDEX ix_visit_tbl_visitkindId ON visit_tbl(visitkindId)
CREATE INDEX ix_visit_tbl_doctorid ON visit_tbl(doctorId)
CREATE INDEX ix_visit_tbl_createdbyid ON visit_tbl(createdbyId)
Now for the comments on how that is too many indexes. It depends ...
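Before adding any of the indexes above, it may help to check what already exists on the table. A minimal sketch using the catalog views, assuming the table lives in the dbo schema (the question does not say):

```sql
-- List every index on Visit_Tbl with its key and included columns.
-- Assumption: the table is dbo.Visit_Tbl; adjust the schema if needed.
SELECT
    i.name                 AS index_name,
    i.type_desc            AS index_type,
    i.is_primary_key,
    c.name                 AS column_name,
    ic.key_ordinal,
    ic.is_included_column
FROM sys.indexes AS i
INNER JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id
   AND ic.index_id = i.index_id
INNER JOIN sys.columns AS c
    ON c.object_id = ic.object_id
   AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'dbo.Visit_Tbl')
ORDER BY i.name, ic.is_included_column, ic.key_ordinal;
```

Any of the five suggested indexes whose leading column already shows up here as the first key of an existing index is probably redundant.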

Why is using Table Spool slower than not?

There are two similar SQL statements running in SQL Server. The table TBSFA_DAT_CUST has millions of rows and no constraints (no index and no primary key);
the other two have just a few rows and a normal primary key.
s, the slower one:
SELECT A.CUST_ID, C.CUST_NAME, A.xxx --and several specific columns
FROM TBSFA_DAT_ORD_LIST A JOIN VWSFA_ORG_EMPLOYEE B ON A.EMP_ID = B.EMP_ID
LEFT JOIN TBSFA_DAT_CUST C ON A.CUST_ID = B.CUST_ID
JOIN VWSFA_ORG_EMPLOYEE D ON A.REVIEW_ID = D.EMP_ID
WHERE ISNULL(A.BATCH_ID, '') != ''
execution plan of slower one
f for faster one:
SELECT *
FROM TBSFA_DAT_ORD_LIST A JOIN VWSFA_ORG_EMPLOYEE B ON A.EMP_ID = B.EMP_ID
LEFT JOIN TBSFA_DAT_CUST C ON A.CUST_ID = B.CUST_ID
JOIN VWSFA_ORG_EMPLOYEE D ON A.REVIEW_ID = D.EMP_ID
WHERE ISNULL(A.BATCH_ID, '') != ''
execution plan of faster one
f (about 0.6 s) is much faster than s (about 4.6 s).
In addition, I found two ways to make s as fast as f:
1. Add a constraint and primary key on TBSFA_DAT_CUST.CUST_ID;
2. Specify more than 61 columns of table TBSFA_DAT_CUST (80 columns in total).
My question is why the SQL optimizer uses a Table Spool when I specify columns in the SELECT clause rather than '*', and why the plan with the Table Spool executes more slowly.
In the slower query you are limiting your result set to specific columns. Since this is an unindexed, unconstrained table, the optimizer is creating a temporary table from the original table scan with only the specific columns required. It is then running the nested loop operator against that temporary table. When it knows it's going to need every column of the table (SELECT *), it can run the nested loop operator directly off the table scan, because the result set of the scan will be joined in full to the top table.
Outside of that your query has a couple other possible problems:
LEFT JOIN TBSFA_DAT_CUST C ON A.CUST_ID = B.CUST_ID
You aren't joining C to anything here; you are joining the entire table to every record. Did you mean a.cust_id = c.cust_id, or b.cust_id = c.cust_id, or a.cust_id = c.cust_id and b.cust_id = c.cust_id?
Also, this function in the WHERE clause is pointless and can degrade performance:
WHERE ISNULL(A.BATCH_ID, '') != ''
change it to:
WHERE A.BATCH_ID is not null and A.Batch_ID <> ''
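Putting both fixes together, the slower query might be rewritten as in the sketch below. It assumes the intended join condition was A.CUST_ID = C.CUST_ID, which is only one of the candidates listed above, and it drops the elided "xxx" columns from the select list.

```sql
SELECT A.CUST_ID, C.CUST_NAME
FROM TBSFA_DAT_ORD_LIST A
JOIN VWSFA_ORG_EMPLOYEE B ON A.EMP_ID = B.EMP_ID
-- Assumption: customers are matched on the order's CUST_ID
LEFT JOIN TBSFA_DAT_CUST C ON A.CUST_ID = C.CUST_ID
JOIN VWSFA_ORG_EMPLOYEE D ON A.REVIEW_ID = D.EMP_ID
-- Plain column comparisons instead of wrapping the column in ISNULL()
WHERE A.BATCH_ID IS NOT NULL AND A.BATCH_ID <> '';
```

With a real join predicate on C, the optimizer no longer has to pair every customer row with every order row, which is the situation that makes spooling the scan attractive in the first place.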

SQL Server 2008 Stored Procedure Performance issue

Hi I have a Stored Procedure
ALTER PROCEDURE [dbo].[usp_EP_GetTherapeuticalALternates]
(
@NDCNumber CHAR(11) ,
@patientid INT ,
@pbmid INT
)
AS
BEGIN
TRUNCATE TABLE TempTherapeuticAlt
INSERT INTO TempTherapeuticAlt
SELECT --PR.ProductID AS MedicationID ,
NULL AS MedicationID ,
PR.ePrescribingName AS MedicationName ,
U.Strength AS MedicationStrength ,
FRM.FormName AS MedicationForm ,
PR.DEAClassificationID AS DEASchedule ,
NULL AS NDCNumber
--INTO #myTemp
FROM DatabaseTwo.dbo.Product PR
JOIN ( SELECT MP.MarketedProductID
FROM DatabaseTwo.dbo.Therapeutic_Concept_Tree_Specific_Product TCTSP
JOIN DatabaseTwo.dbo.Marketed_Product MP ON MP.SpecificProductID = TCTSP.SpecificProductID
JOIN ( SELECT TCTSP.TherapeuticConceptTreeID
FROM DatabaseTwo.dbo.Marketed_Product MP
JOIN DatabaseTwo.dbo.Therapeutic_Concept_Tree_Specific_Product TCTSP ON MP.SpecificProductID = TCTSP.SpecificProductID
JOIN ( SELECT
PR.MarketedProductID
FROM
DatabaseTwo.dbo.Package PA
JOIN DatabaseTwo.dbo.Product PR ON PA.ProductID = PR.ProductID
WHERE
PA.NDC11 = @NDCNumber
) PAPA ON MP.MarketedProductID = PAPA.MarketedProductID
) xxx ON TCTSP.TherapeuticConceptTreeID = xxx.TherapeuticConceptTreeID
) MPI ON PR.MarketedProductID = MPI.MarketedProductID
JOIN ( SELECT P.ProductID ,
O.Strength ,
O.Unit
FROM DatabaseTwo.dbo.Product AS P
INNER JOIN DatabaseTwo.dbo.Marketed_Product
AS M ON P.MarketedProductID = M.MarketedProductID
INNER JOIN DatabaseTwo.dbo.Specific_Product
AS S ON M.SpecificProductID = S.SpecificProductID
LEFT OUTER JOIN DatabaseTwo.dbo.OrderableName_Combined
AS O ON S.SpecificProductID = O.SpecificProductID
GROUP BY P.ProductID ,
O.Strength ,
O.Unit
) U ON PR.ProductID = U.ProductID
JOIN ( SELECT PA.ProductID ,
S.ScriptFormID ,
F.Code AS NCPDPScriptFormCode ,
S.FormName
FROM DatabaseTwo.dbo.Package AS PA
INNER JOIN DatabaseTwo.dbo.Script_Form
AS S ON PA.NCPDPScriptFormCode = S.NCPDPScriptFormCode
INNER JOIN DatabaseTwo.dbo.FormCode AS F ON S.FormName = F.FormName
GROUP BY PA.ProductID ,
S.ScriptFormID ,
F.Code ,
S.FormName
) FRM ON PR.ProductID = FRM.ProductID
WHERE
( PR.OffMarketDate IS NULL )
OR ( PR.OffMarketDate = '' )
OR (PR.OffMarketDate = '1899-12-30 00:00:00.000')
OR ( PR.OffMarketDate <> '1899-12-30 00:00:00.000'
AND DATEDIFF(dd, GETDATE(),PR.OffMarketDate) > 0
)
GROUP BY PR.ePrescribingName ,
U.Strength ,
FRM.FormName ,
PR.DEAClassificationID
-- ORDER BY pr.ePrescribingName
SELECT LL.ProductID AS MedicationID ,
temp.MedicationName ,
temp.MedicationStrength ,
temp.MedicationForm ,
temp.DEASchedule ,
temp.NDCNumber ,
fs.[ReturnFormulary] AS FormularyStatus ,
copay.CopaTier ,
copay.FirstCopayTerm ,
copay.FlatCopayAmount ,
copay.PercentageCopay ,
copay.PharmacyType,
dbo.udf_EP_GetBrandGeneric(LL.ProductID) AS BrandGeneric
FROM TempTherapeuticAlt temp
OUTER APPLY ( SELECT TOP 1
ProductID
FROM DatabaseTwo.dbo.Product
WHERE ePrescribingName = temp.MedicationName
) AS LL
OUTER APPLY [dbo].[udf_EP_tbfGetFormularyStatus](@patientid,
LL.ProductID,
@pbmid) AS fs
OUTER APPLY ( SELECT TOP 1
*
FROM udf_EP_CopayDetails(LL.ProductID,
@PBMID,
fs.ReturnFormulary)
) copay
--ORDER BY LL.ProductID
TRUNCATE TABLE TempTherapeuticAlt
END
On my dev server each table holds about 63k rows, and this procedure takes about 30 seconds to return a result.
On my production server it times out, or takes more than 1 minute.
My production tables hold about 1,400 million records; can this be the reason?
If so, what can be done? I have all the required indexes on the tables.
Any help would be greatly appreciated.
Thanks
Execution Plan
http://www.sendspace.com/file/hk8fao
Major Leakage
OUTER APPLY [dbo].[udf_EP_tbfGetFormularyStatus](@patientid,
LL.ProductID,
@pbmid) AS fs
Some strategies that may help:
Remove the first ORDER BY statement; those are a killer on complex queries and shouldn't be necessary.
Use CTEs to break the query into smaller pieces that can be individually addressed.
Reduce the nesting in the first set of JOINs
Extract the second and third set of joins (the GROUPED ones) and insert those into a temporary indexed table before joining and grouping everything.
You did not include the definition for function1 or function2 -- custom functions are often a place where performance issues can hide.
Without seeing the execution plan, it's difficult to see where the particular problems may be.
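As an illustration of the temporary indexed table suggestion, the grouped derived table aliased U in the procedure could be materialized once and indexed before the main query runs. This is only a sketch: the temp table name #ProductStrength and the index name are hypothetical.

```sql
-- Materialize the "U" derived table (strength/unit per product) once.
SELECT P.ProductID,
       O.Strength,
       O.Unit
INTO #ProductStrength                       -- hypothetical name
FROM DatabaseTwo.dbo.Product AS P
INNER JOIN DatabaseTwo.dbo.Marketed_Product AS M
    ON P.MarketedProductID = M.MarketedProductID
INNER JOIN DatabaseTwo.dbo.Specific_Product AS S
    ON M.SpecificProductID = S.SpecificProductID
LEFT OUTER JOIN DatabaseTwo.dbo.OrderableName_Combined AS O
    ON S.SpecificProductID = O.SpecificProductID
GROUP BY P.ProductID,
         O.Strength,
         O.Unit;

-- Index it on the column the main query joins on.
CREATE CLUSTERED INDEX CIX_ProductStrength_ProductID   -- hypothetical name
ON #ProductStrength (ProductID);
```

The main INSERT would then join to #ProductStrength AS U on PR.ProductID = U.ProductID instead of repeating the subquery; the FRM derived table could be handled the same way.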
You have a query that selects data from 4 or 5 tables , some of them multiple times. It's really hard to say how to improve without deep analysis of what you are trying to achieve and what table structure actually is.
Data size is definitely an issue; the more data has to be processed, the longer the query will take. Some general advice: run the query directly and check the execution plan, which may reveal bottlenecks. Then check whether statistics are up to date. Also review your tables; partitioning may help a lot in some cases. In addition, you can try altering the tables to create the clustered index not on the PK (the default unless otherwise specified) but on other column(s), so that your query benefits from a particular physical order of records. Note: do this only if you are absolutely sure what you are doing.
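One way to check how fresh the statistics are is sketched below for a single table from the procedure (DatabaseTwo.dbo.Product); the same pattern would apply to the other tables involved.

```sql
USE DatabaseTwo;
GO

-- When was each statistics object on dbo.Product last updated?
SELECT s.name                              AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Product')
ORDER BY last_updated;

-- Refresh them if they look stale (a full scan can be slow on large tables).
UPDATE STATISTICS dbo.Product WITH FULLSCAN;
GO
```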
Finally, try refactoring your query. I have a feeling that there is a better way to get desired results (sorry, without understanding of table structure and expected results I cannot tell exact solution, but multiple joins of the same tables and bunch of derived tables don't look good to me)

Why would SQL Server choose Clustered Index Scan over Non-Clustered one?

In one of the tables I am querying, a clustered index was created over a key that's not a primary key. (I don't know why.)
However, there's a non-clustered index for the primary key for this table.
In the execution plan, SQL is choosing the clustered index, rather than the non-clustered index for the primary key which I am using in my query.
Is there a reason why SQL would do this? How can I force SQL to choose the non-clustered index instead?
Appending more detail:
The table has many fields and the query contains many joins. Let me abstract it a bit.
The table definition looks like this:
SlowTable
[SlowTable_id] [int] IDENTITY(200000000,1) NOT NULL,
[fk1Field] [int] NULL,
[fk2Field] [int] NULL,
[other1Field] [varchar] NULL,
etc. etc...
and then the indices for this table are:
fk1Field (Clustered)
SlowTable_id (Non-Unique, Non-Clustered)
fk2Field (Non-Unique, Non-Clustered)
... and 14 other Non-Unique, Non-Clustered indices on other fields
Presumably there are lots more queries made against fk1Field which is why they selected this as the basis for the Clustered index.
The query I have uses a view:
SELECT
[field list]
FROM
SourceTable1 S1
INNER JOIN SourceTable2 S2
ON S2.S2_id = S1.S2_id
INNER JOIN SourceTable3 S3
ON S3.S3_id = S2.S3_id
INNER JOIN SlowTable ST
ON ST.SlowTable_id = S1.SlowTable_id
INNER JOIN [many other tables, around 7 more...]
The execution plan is quite big, with the nodes concerned say
Hash Match
(Inner Join)
Cost: 9%
with a thick arrow pointing to
Clustered Index Scan (Clustered)
SlowTable.fk1Field
Cost: 77%
I hope this provides enough detail on the issue.
Thanks!
ADDENDUM 2:
Correction to my previous post. The view doesn't have a where clause. It is just a series of inner joins. The execution plan was taken from an Insert statement that uses the View (listed as SLOW_VIEW) in a complex query that looks like the following:
(What this stored procedure does is to do a proportional split of the total amount of some records, based on weights, computed as percentage against a total. This mimics distributing a value from, say, one account, to other accounts.)
INSERT INTO dbo.WDTD(
FieldA,
FieldB,
GWB_id,
C_id,
FieldC,
PG_id,
FieldD,
FieldE,
O_id,
FieldF,
FieldG,
FieldH,
FieldI,
GWBIH_id,
T_id,
JO_id,
PC_id,
PP_id,
FieldJ,
FieldK,
FieldL,
FieldM,
FieldN,
FieldO,
FieldP,
FieldQ,
FieldS)
SELECT DISTINCT
@FieldA FieldA,
GETDATE() FieldB,
@Parameter1 GWB_id,
GWBIH.C_id C_id,
P.FieldT FieldC,
P.PG_id PG_id,
PAM.FieldD FieldD,
PP.FieldU FieldE,
GWBIH.O_id O_id,
CO.FieldF FieldF,
CO.FieldG FieldG,
PSAM.FieldH FieldH,
PSAM.FieldI FieldI,
SOURCE.GWBIH_id GWBIH_id,
' ' T_id,
GWBIH.JO_id JO_id,
SOURCE.PC_id PC_id,
GWB.PP_id,
SOURCE.FieldJ FieldJ,
1 FieldK,
ROUND((SUM(GWBIH.Total) / AGG.Total) * SOURCE.Total, 2) FieldL,
ROUND((SUM(GWBIH.Total) / AGG.Total) * SOURCE.Total, 2) FieldM,
0 FieldN,
' ' FieldO,
ESGM.FieldP_flag FieldP,
SOURCE.FieldQ FieldQ,
'[UNPROCESSED]'
FROM
dbo.Table1 GWBIH
INNER JOIN dbo.Table2 GWBPH
ON GWBPH.GWBP_id = GWBIH.GWBP_id
INNER JOIN dbo.Table3 GWB
ON GWB.GWB_id = GWBPH.GWB_id
INNER JOIN dbo.Table4 P
ON P.P_id = GWBPH.P_id
INNER JOIN dbo.Table5 ESGM
ON ESGM.ET_id = P.ET_id
INNER JOIN dbo.Table6 PAM
ON PAM.PG_id = P.PG_id
INNER JOIN dbo.Table7 O
ON O.dboffcode = GWBIH.O_id
INNER JOIN dbo.Table8 CO
ON
CO.Country_id = O.Country_id
AND CO.Brand_id = O.Brand_id
INNER JOIN dbo.Table9 PSAM
ON PSAM.Office_id = GWBIH.O_id
INNER JOIN dbo.Table10 PCM
ON PCM.PC_id = GWBIH.PC_id
INNER JOIN dbo.Table11 PC
ON PC.PC_id = GWBIH.PC_id
INNER JOIN dbo.Table12 PP
ON PP.PP_id = GWB.PP_id
-- THIS IS THE VIEW THAT CONTAINS THE CLUSTERED INDEX SCAN
INNER JOIN dbo.SLOW_VIEW GL
ON GL.JO_id = GWBIH.JO_id
INNER JOIN (
SELECT
GWBIH.C_id C_id,
GWBPH.GWB_id,
SUM(GWBIH.Total) Total
FROM
dbo.Table1 GWBIH
INNER JOIN dbo.Table2 GWBPH
ON GWBPH.GWBP_id = GWBIH.GWBP_id
INNER JOIN dbo.Table10 PCM
ON PCM.PC_id = GWBIH.PC_id
WHERE
PCM.Split_flag = 0
AND GWBIH.JO_id IS NOT NULL
GROUP BY
GWBIH.C_id,
GWBPH.GWB_id
) AGG
ON AGG.C_id = GWBIH.C_id
AND AGG.GWB_id = GWBPH.GWB_id
INNER JOIN (
SELECT
GWBIH.GWBIH_id GWBIH_id,
GWBIH.C_id C_id,
GWBIH.FieldQ FieldQ,
GWBP.GWB_id GWB_id,
PCM.PC_id PC_id,
CASE
WHEN WT.FieldS IS NOT NULL
THEN WT.FieldS
WHEN WT.FieldS IS NULL
THEN PCMS.FieldT
END FieldJ,
SUM(GWBIH.Total) Total
FROM
dbo.Table1 GWBIH
INNER JOIN dbo.Table2 GWBP
ON GWBP.GWBP_id = GWBIH.GWBP_id
INNER JOIN dbo.Table4 P
ON P.P_id = GWBP.P_id
INNER JOIN dbo.Table10 PCM
ON PCM.PC_id = GWBIH.PC_id
INNER JOIN dbo.Table11 PCMS
ON PCMS.PC_id = PCM.PC_id
LEFT JOIN dbo.WT WT
ON WT.ET_id = P.ET_id
AND WT.PC_id = GWBIH.PC_id
WHERE
PCM.Split_flag = 1
GROUP BY
GWBIH.GWBI_id,
GWBIH.C_id,
GWBIH.FieldQ,
GWBP.GWB_id,
WT.FieldS,
PCM.PC_id,
PCMS.ImportCode
) SOURCE
ON SOURCE.C_id = GWBIH.C_id
AND SOURCE.GWB_id = GWBPH.GWB_id
WHERE
PCM.Split_flag = 0
AND AGG.Total > 0
AND GWBPH.GWB_id = @Parameter1
AND NOT EXISTS (
SELECT *
FROM dbo.WDTD TD
WHERE
TD.C_id = GWBIH.C_id
AND TD.FieldA = GWBPH.GWB_id
AND TD.JO_id = GWBIH.JO_id
AND TD.PC_id = SOURCE.PC_id
AND TD.GWBIH_id = ' ')
GROUP BY
GWBIH.C_id,
P.FieldT,
GWBIH.JO_id,
GWBIH.O_id,
GWBPH.GWB_id,
P.PG_id,
PAM.FieldD,
PP.FieldU,
GWBIH.O_id,
CO.FieldF,
CO.FieldG,
PSAM.FieldH,
PSAM.FieldI,
GWBIH.JO_id,
SOURCE.PC_id,
GWB.PP_id,
SOURCE.FieldJ,
ESGM.FieldP_flag,
SOURCE.GWBIH_id,
SOURCE.FieldQ,
AGG.Total,
SOURCE.Total
ADDENDUM 3: When doing an execution plan on the select statement of the view, I see this:
Hash Match              <====  Bitmap             <------ etc...
(Inner Join)                   (Bitmap Create)
Cost: 0%                       Cost: 0%
                                  ^
                                  |
                                  |
                        Parallelism                      Clustered Index Scan (Clustered)
                        (Repartition Streams)     <====  Slow_Table.fk1Field
                        Cost: 1%                         Cost: 98%
ADDENDUM 4: I think I found the problem. The Clustered Index Scan isn't referring to my clause that references the Primary Key, but rather another clause that needs a field that is, in some way, related to fk1Field above.
Most likely one of:
too many rows to make the index effective
index doesn't fit the ON/WHERE conditions
index isn't covering and SQL Server avoids a key lookup
Edit, after update:
Your indexes are useless because they are all single column indexes, so it does a clustered index scan.
You need an index that matches your ON, WHERE, GROUP BY conditions with INCLUDES for your SELECT list.
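For the join shown in the question (ST.SlowTable_id = S1.SlowTable_id), such an index might look like the sketch below. The index name and the INCLUDE list are placeholders: the real list has to contain whichever SlowTable columns the view actually selects, which the question does not show, and the dbo schema is assumed.

```sql
-- Key column matches the ON clause; INCLUDE covers the selected columns.
-- fk2Field and other1Field are stand-ins for the view's real column list.
CREATE NONCLUSTERED INDEX IX_SlowTable_SlowTable_id_Covering   -- hypothetical name
ON dbo.SlowTable (SlowTable_id)
INCLUDE (fk2Field, other1Field);
```

The clustering key (fk1Field) does not need to be listed, because SQL Server carries the clustered index key in every non-clustered index on the table.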
If the query you're executing isn't selecting a small subset of the records, SQL Server may well choose to ignore any "otherwise useful" non-clustered index and just scan through the clustered index (in this instance, most likely all rows in the table), the logic being that the amount of I/O required to perform the query via the non-clustered index outweighs that required for a full scan.
If you can post the schema of your table(s) + a sample query, I'm sure we can offer more information.
Ideally you shouldn't be telling SQL Server to do either; it can pick the best option if you give it a good query. Query hints were created to steer the engine a bit, but you shouldn't have to use them just yet.
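For completeness, since the question asks how to force the choice: a table hint is the mechanism for naming a specific index. The sketch below is a cut-down version of the view's join, with a made-up index name (IX_SlowTable_SlowTable_id) standing in for the existing non-clustered index on SlowTable_id.

```sql
-- Force the non-clustered index on SlowTable for this one reference.
-- IX_SlowTable_SlowTable_id is a hypothetical name; use the real index name.
SELECT ST.SlowTable_id, ST.fk2Field
FROM SourceTable1 AS S1
INNER JOIN SlowTable AS ST WITH (INDEX (IX_SlowTable_SlowTable_id))
    ON ST.SlowTable_id = S1.SlowTable_id;
```

If the hinted index does not cover the query, the plan will add a key lookup per row, which can easily be slower than the scan the optimizer chose on its own.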
Sometimes it is beneficial to cluster the table on something other than the primary key. It is rare, but it can be useful (the clustering controls the data layout while the primary key ensures correctness).
I can tell you exactly why SQL Server picks the clustered index if you show me your query and schema; otherwise I'd only be guessing at likely causes. The execution plan is also helpful in these cases.
For a non-clustered index to be considered it has to be meaningful to the query, and if your non-clustered index doesn't cover your query, there's no guarantee that it will be used at all.
A clustered index scan is essentially a table scan (on a table that happens to have a clustered index). You really should post your statement to get a better answer. Your WHERE clause may not be sargable (that is, not usable as an index search argument), or, if you are selecting many records, SQL Server may prefer to scan the table rather than use the index and then have to look up the related columns.

Resources