SQL query optimization - sql-server

I have a query that I want to execute as fast as possible.
Here it is:
select d.InvoiceDetailId,a.Fee,a.FeeTax
from InvoiceDetail d
LEFT JOIN InvoiceDetail a on a.AdjustDetailId = d.InvoiceDetailId
I put an ascending index on the AdjustDetailId column.
I then ran the query with 'Show Actual Execution Plan' and the resulting estimated subtree cost (off of the topmost SELECT node) was 2.07.
I then thought maybe I could do something to improve this, so I added a condition to the left join like so:
select d.InvoiceDetailId,a.Fee,a.FeeTax
from InvoiceDetail d
LEFT JOIN InvoiceDetail a on a.AdjustDetailId is not null
and a.AdjustDetailId = d.InvoiceDetailId
I re-ran it and got a subtree cost of 0.98. So I thought, great, I made it twice as fast. I then clicked show client statistics and executed both queries 4-5 times each and, believe it or not, the first query averaged out to be faster. I don't get it. By the way, the query returns 120K rows.
Any insight?
Maybe I get tainted results because of caching, but I don't know if that is the case or how to reset the cache.
EDIT:
Okay, I googled how to clear the query cache, so I added the following before the queries:
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
I then ran each query 5 times and the first query was still a little faster (13%).
1st Query: Client Processing time: 239.4
2nd Query: Client Processing time: 290
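A cold-cache comparison like the one described above could be set up roughly as follows. This is only a sketch and assumes a test server, because both DBCC commands affect the whole instance. The CHECKPOINT is an addition not in the original post: DBCC DROPCLEANBUFFERS only removes clean pages, so dirty pages are written out first. SET STATISTICS TIME and IO report server-side work in the Messages pane, which can be compared with the client statistics.

```sql
-- Test servers only: these commands flush caches for the whole instance.
CHECKPOINT;               -- write dirty pages to disk so they become clean
DBCC DROPCLEANBUFFERS;    -- remove clean data pages from the buffer pool
DBCC FREEPROCCACHE;       -- discard cached execution plans

SET STATISTICS TIME ON;   -- CPU and elapsed time per statement
SET STATISTICS IO ON;     -- logical and physical reads per table

-- First variant of the query; repeat the whole script for the second variant.
SELECT d.InvoiceDetailId, a.Fee, a.FeeTax
FROM InvoiceDetail d
LEFT JOIN InvoiceDetail a ON a.AdjustDetailId = d.InvoiceDetailId;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```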
So I guess the question is: why do you think that is? Could it be that when the table quadruples in size the second query will be faster? Or is the left join causing the query to hit the index twice, so that it will always be slower?
Please don't flame me, I'm just trying to get educated.
EDIT # 2:
I need to get all the InvoiceDetails, not just the adjusted ones hence the left join.
EDIT # 3:
The real problem I'm trying to solve with the query is to sum up all of the InvoiceDetail rows while applying their adjustments at the same time. So ultimately it seems that the best query to run is the following. I thought that doing a join and then adding in the joined table's values would be the only way, but it seems that grouping by a conditional expression solves the problem most elegantly.
SELECT CASE WHEN AdjustDetailId IS NULL THEN InvoiceDetailId ELSE AdjustDetailId END AS InvoiceDetailId
,SUM(Fee + FeeTax) AS Fee
FROM dbo.InvoiceDetail d
GROUP BY CASE WHEN AdjustDetailId IS NULL THEN InvoiceDetailId ELSE AdjustDetailId END
Example: With the following rows
InvoiceDetailId|Fee|FeeTax|AdjustDetailId
1|300|0|NULL
2|-100|0|1
3|-50|0|1
4|250|0|NULL
My desire was to get the following:
InvoiceDetailId|Fee
1|150
4|250
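For anyone who wants to check this against the sample rows, here is a small self-contained sketch. The table variable, its column types (money, int) and the use of COALESCE in place of the CASE expression are assumptions made for illustration; COALESCE(AdjustDetailId, InvoiceDetailId) is just a shorter way to write the same CASE. The multi-row VALUES syntax needs SQL Server 2008 or later.

```sql
-- Hypothetical stand-in for dbo.InvoiceDetail with the four sample rows.
DECLARE @InvoiceDetail TABLE (
    InvoiceDetailId int   NOT NULL PRIMARY KEY,
    Fee             money NOT NULL,
    FeeTax          money NOT NULL,
    AdjustDetailId  int   NULL
);

INSERT INTO @InvoiceDetail (InvoiceDetailId, Fee, FeeTax, AdjustDetailId)
VALUES (1,  300, 0, NULL),
       (2, -100, 0, 1),
       (3,  -50, 0, 1),
       (4,  250, 0, NULL);

-- Adjustment rows are grouped under the invoice they adjust;
-- unadjusted rows are grouped under their own id.
SELECT COALESCE(AdjustDetailId, InvoiceDetailId) AS InvoiceDetailId,
       SUM(Fee + FeeTax) AS Fee
FROM @InvoiceDetail
GROUP BY COALESCE(AdjustDetailId, InvoiceDetailId)
ORDER BY COALESCE(AdjustDetailId, InvoiceDetailId);
-- Two rows: invoice 1 with 150 (300 - 100 - 50) and invoice 4 with 250.
```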
Thanks everybody for your input.

If you want to make that query really fast, you need to:
turn the LEFT JOIN into an INNER JOIN
make sure that InvoiceDetail.AdjustDetailId and InvoiceDetail.InvoiceDetailId are both indexed
SELECT
d.InvoiceDetailId, a.Fee, a.FeeTax
FROM
dbo.InvoiceDetail d
INNER JOIN
dbo.InvoiceDetail a ON a.AdjustDetailId = d.InvoiceDetailId
Next, you need to make sure your statistics are up to date, so that the cost-based query optimizer can work properly.
To update the statistics, use the UPDATE STATISTICS (table) command; see the MSDN docs on UPDATE STATISTICS.
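As a rough illustration of the statistics step, the statements below refresh the statistics on the table and then list when each statistics object was last updated. The dbo schema is taken from the query above; whether FULLSCAN is worth its cost on a given table is a judgment call, not something the original answer specifies.

```sql
-- Refresh all statistics on the table using default sampling.
UPDATE STATISTICS dbo.InvoiceDetail;

-- Alternatively, read every row for the most accurate histogram
-- (slower on large tables).
UPDATE STATISTICS dbo.InvoiceDetail WITH FULLSCAN;

-- Check when each statistics object on the table was last updated.
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.InvoiceDetail');
```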

I would have guessed that they would be the same (with the same execution plan), since a predicate like a.AdjustDetailId = d.InvoiceDetailId can never be true if one side is null. So adding the Is Not Null condition is redundant. But maybe the query processor is executing additional unnecessary steps with that extra predicate in there...
But what the other answer mentions is more important. Do you really need to output all the rows where there is no matching record (invoices without an adjusting invoice)? If not, change it to an inner join and it will speed up a lot.
If you really need them, however, you might try a Union:
Select d.InvoiceDetailId,a.Fee,a.FeeTax
From InvoiceDetail d
Join InvoiceDetail a
On a.AdjustDetailId = d.InvoiceDetailId
Union
Select InvoiceDetailId, null, null
from InvoiceDetail
Where AdjustDetailId Is Null
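This is meant to do the same thing without using an outer join. Note that, as written, it is not an exact substitute for the LEFT JOIN: the second Select filters on the row's own AdjustDetailId rather than on whether any adjustment row points at it, and Union also removes duplicate rows.
(It is debatable whether two queries with a union will run faster than the single outer join query.)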
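A variant that should return exactly the rows of the LEFT JOIN is sketched below, using the question's four sample rows in a hypothetical table variable (column types are assumptions). It swaps in NOT EXISTS for the second branch and UNION ALL for UNION, so unmatched rows are found by looking for adjustments that point at them and no rows are de-duplicated. Whether it is faster than the outer join would still need to be measured.

```sql
-- Hypothetical stand-in for InvoiceDetail with the sample rows from the question.
DECLARE @InvoiceDetail TABLE (
    InvoiceDetailId int   NOT NULL PRIMARY KEY,
    Fee             money NOT NULL,
    FeeTax          money NOT NULL,
    AdjustDetailId  int   NULL
);

INSERT INTO @InvoiceDetail (InvoiceDetailId, Fee, FeeTax, AdjustDetailId)
VALUES (1, 300, 0, NULL), (2, -100, 0, 1), (3, -50, 0, 1), (4, 250, 0, NULL);

-- Branch 1: rows that have at least one adjustment, one row per adjustment.
SELECT d.InvoiceDetailId, a.Fee, a.FeeTax
FROM @InvoiceDetail d
JOIN @InvoiceDetail a ON a.AdjustDetailId = d.InvoiceDetailId
UNION ALL
-- Branch 2: rows that no adjustment points at, padded with NULLs.
SELECT d.InvoiceDetailId, NULL, NULL
FROM @InvoiceDetail d
WHERE NOT EXISTS (SELECT 1
                  FROM @InvoiceDetail a
                  WHERE a.AdjustDetailId = d.InvoiceDetailId);
-- Five rows: invoice 1 twice (fees -100 and -50), and invoices 2, 3, 4 with NULLs.
```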

You only have 1 table in this query, right?
If you use
select InvoiceDetailId, Fee, FeeTax
from InvoiceDetail
That WILL get all the rows, not just the adjusted ones.
Assuming you are doing a self-join, and doing it for a good reason, I would index InvoiceDetailId and AdjustDetailId and see which index(es) the execution plan uses.
You could also try to "include" the Fee and FeeTax columns in your index (making it a covering index for this query); this will help a lot if the table is really wide.
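A covering index along those lines might look like this. The index name is made up for the example, and the dbo schema is assumed; if an index on AdjustDetailId already exists, this one would replace it rather than sit beside it.

```sql
-- AdjustDetailId is the key used by the join; Fee and FeeTax are stored at the
-- leaf level so the "a" side of the self-join never has to touch the base table.
CREATE NONCLUSTERED INDEX IX_InvoiceDetail_AdjustDetailId_Covering
    ON dbo.InvoiceDetail (AdjustDetailId)
    INCLUDE (Fee, FeeTax);
```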

For your queries, I can think of 3 different reasonable execution plans:
LOOP JOIN OUTER [a.AdjustDetailId = d.InvoiceDetailId]
    TABLE SCAN InvoiceDetail d
    TABLE SCAN InvoiceDetail a
HASH JOIN OUTER [a.AdjustDetailId = d.InvoiceDetailId]
    TABLE SCAN InvoiceDetail d
    TABLE SCAN InvoiceDetail a
LOOP JOIN OUTER
    HASH JOIN OUTER [x.AdjustDetailId = d.InvoiceDetailId] AS y
        TABLE SCAN InvoiceDetail d
        INDEX SEEK [InvoiceDetail, AdjustDetailId IS NOT NULL] x
    InvoiceDetail a [a.AdjustDetailId = y.AdjustDetailId]
Perhaps adding the IS NOT NULL condition makes the optimizer choose a different one of these plans; it's hard to say.

Related

Possible causes of a slow ORDER BY in a SQL Server statement

I have the following query, which returns 1550 rows.
SELECT *
FROM V_InventoryMovements -- 2 seconds
ORDER BY V_InventoryMovements.TransDate -- 23 seconds
It takes about 2 seconds to return the results.
But when I include the ORDER BY clause, then it takes about 23 seconds.
It is a BIG change just for adding an ORDER BY.
I would like to know what is happening, and whether there is a way to improve the query with the ORDER BY. Removing the ORDER BY should not be the solution.
Here is a bit of information; please let me know if you need more.
V_InventoryMovements
CREATE VIEW [dbo].[V_InventoryMovements]
AS
SELECT some_fields
FROM FinTime
RIGHT OUTER JOIN V_Outbound ON FinTime.StdDate = dbo.TruncateDate(V_Outbound.TransDate)
LEFT OUTER JOIN ReasonCode_Grouping ON dbo.V_Outbound.ReasonCode = dbo.ReasonCode_Grouping.ReasonCode
LEFT OUTER JOIN Items ON V_Outbound.ITEM = Items.Item
LEFT OUTER JOIN FinTime ON V_Outbound.EventDay = FinTime.StdDate
V_Outbound
CREATE VIEW [dbo].[V_Outbound]
AS
SELECT V_Outbound_WMS.*
FROM V_Outbound_WMS
UNION
SELECT V_Transactions_Calc.*
FROM V_Transactions_Calc
V_OutBound_WMS
CREATE VIEW [dbo].[V_OutBound_WMS]
AS
SELECT some_fields
FROM Transaction_Log
INNER JOIN MFL_StartDate ON Transaction_Log.TransDate >= MFL_StartDate.StartDate
LEFT OUTER JOIN Rack ON Transaction_Log.CHARGE = Rack.CHARGE AND Transaction_Log.CHARGE_LFD = Rack.CHARGE_LFD
V_Transactions_Calc
CREATE VIEW [dbo].[V_Transactions_Calc]
AS
SELECT some_fields
FROM Transactions_Calc
INNER JOIN MFL_StartDate ON dbo.Transactions_Calc.EventDay >= dbo.MFL_StartDate.StartDate
And here I will also share a part of the execution plan (the part where you can see the main cost). I don't know exactly how to read it and improve the query. Let me know if you need to see the rest of the execution plan. But all the other parts are 0% of Cost. The main Cost is in the: Nested Loops (Left Outer Join) Cost 95%.
Execution Plan With ORDER BY
Execution Plan Without ORDER BY
I think the short answer is that the optimizer is executing in a different order in an attempt to minimize the cost of the sorting, and doing a poor job. Its job is made very hard by the views within views within views, as GuidoG suggests. You might be able to convince it to execute differently by creating some additional index or statistics, but it's going to be hard to advise on that remotely.
A possible workaround might be to select into a temp table, then apply the ordering afterwards:
SELECT *
INTO #temp
FROM V_InventoryMovements;
SELECT *
FROM #temp
ORDER BY TransDate

SQL Query is taking infinite time when using with order by

Below is my SQL Query
select top(10) ClientCode
FROM (((Branch INNER JOIN BusinessLocation ON
Branch.BranchCode=BusinessLocation.BranchCode)
INNER JOIN Center ON BusinessLocation.LocationCode = Center.LocationCode)
INNER JOIN Groups ON Center.CenterCode = Groups.CenterCode)
INNER JOIN Client ON Groups.GroupCode = Client.GroupCode
WHERE
((Client.CBStatus) IS NULL) AND ((Branch.PartnerName) in
('SVCL','Edelweiss'))
order by Client.ClientCode DESC
When I run it without the ORDER BY it runs fine, but with the ORDER BY it does not finish executing. Why does it behave this way?
When you select using TOP without an ORDER BY, the joins and calculations do not necessarily have to be evaluated for every row: the engine can stop as soon as it has 10 rows. When you add an ORDER BY, at least the sort column has to be evaluated for all qualifying rows before the top 10 can be chosen. It is a long-running query because your tables are large; the behavior is not faulty. Don't let the fast-running query without the ORDER BY mislead you about the complexity of your second query.
You can create an index on the ClientCode column. That should speed things up.
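Such an index could be sketched as follows. The index name, the dbo schema and the INCLUDE list are assumptions for illustration; the idea is that with ClientCode available in descending order, the engine may be able to walk the index from the top and stop after 10 qualifying rows instead of sorting everything. Whether the optimizer actually picks that plan depends on the data and the rest of the joins.

```sql
-- Key ordered to match "ORDER BY Client.ClientCode DESC"; the included columns
-- are the other Client columns the query touches (join key and filter).
CREATE NONCLUSTERED INDEX IX_Client_ClientCode
    ON dbo.Client (ClientCode DESC)
    INCLUDE (GroupCode, CBStatus);
```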

TSQL Join, Query Processing order and storage

Table structure:
CREATE TABLE dbo.Transactions
(
actid INT NOT NULL, --Account ID
tranid INT NOT NULL, -- Transaction ID
val MONEY NOT NULL, --- Transaction value
CONSTRAINT PK_Transactions PRIMARY KEY(actid, tranid)
);
The following inefficient query tries to determine the running balance after each transaction
SELECT
T1.actid, T1.tranid, T1.val,
SUM(T2.val) AS balance
FROM
dbo.Transactions AS T1
JOIN
dbo.Transactions AS T2 ON T2.actid = T1.actid
AND T2.tranid <= T1.tranid
GROUP BY
T1.actid, T1.tranid, T1.val;
I am not sure how the join is processed in this query. Is the join treated as a subquery, where the join statement is executed for each group (T1.actid, T1.tranid, T1.val)? Does that mean that if there are 10K transactions, 10K joined data sets are created by this query?
Execute your query in SSMS. Then highlight it and press Ctrl + L to view the estimated execution plan. This will show you how SQL Server plans to execute the query, and it sometimes suggests indexes, etc.
It means you will get exactly the number of rows that satisfy the join.
Each row in T1 is processed and brings in the rows from T2 that satisfy the join conditions.
The join can be processed as a loop, hash, or merge join. The optimizer picks one based on estimated cost; hash is a common choice for large inputs.
The best thing to do is just run it. The output should tell a story.
The ONLY way to know is by 'studying' the query plan.
FYI: it seems to me your query is equivalent to
SELECT
T1.actid, T1.tranid, T1.val,
balance = (SELECT SUM(T2.val)
FROM dbo.Transactions AS T2
WHERE T2.actid = T1.actid
AND T2.tranid <= T1.tranid)
FROM
dbo.Transactions AS T1
To be honest, I prefer 'this' version because it looks more readable to me; I'm also expecting this version to be slightly 'leaner' as there is less need for sorting, but only actual testing will tell. It's sometimes surprising to see what the optimizer does behind the scenes! Again, the query plan will show.
Therefore, run both queries and compare the resulting query plans, those should give you an idea about their relative cost. Now, keep in mind that "cost" isn't always directly correlated to "time"; so you might want to check which one runs faster too on your hardware and under 'typical load'; also keep in mind that e.g. caching may have an effect here!
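For comparison, a third formulation that avoids the self-join entirely is a windowed SUM. This is not from the answers above; it is a sketch that assumes SQL Server 2012 or later (where ordered window aggregates are available) and uses a table variable with made-up sample rows so it can be run on its own.

```sql
-- Hypothetical sample data in the same shape as dbo.Transactions.
DECLARE @Transactions TABLE (
    actid  int   NOT NULL,
    tranid int   NOT NULL,
    val    money NOT NULL,
    PRIMARY KEY (actid, tranid)
);

INSERT INTO @Transactions (actid, tranid, val)
VALUES (1, 1, 4.00), (1, 2, -2.00), (1, 3, 5.00), (2, 1, 2.00);

-- Running balance per account: each row sums val from the first
-- transaction of the account up to and including the current one.
SELECT actid, tranid, val,
       SUM(val) OVER (PARTITION BY actid
                      ORDER BY tranid
                      ROWS UNBOUNDED PRECEDING) AS balance
FROM @Transactions
ORDER BY actid, tranid;
-- Balances: 4.00, 2.00, 7.00 for account 1 and 2.00 for account 2.
```

As with the other two versions, comparing the query plans and timings on real data is the only way to know which one is cheapest.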

Why is this CTE so much slower than using temp tables?

We have had an issue since a recent update to our database (I made this update, so I am the guilty one): one of our queries has been much slower since then. I tried to modify the query to get faster results and managed to achieve my goal with temp tables, which is not bad, but I fail to understand why this solution performs better than a CTE-based one that runs the same queries. Maybe it has to do with some of the tables being in a different DB?
Here's the query that performs badly (22 minutes on our hardware) :
WITH CTE_Patterns AS (
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH(NOLOCK) ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
),
CTE_Emails AS (
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH(NOLOCK) ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
)
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM CTE_Patterns AS BL WITH(NOLOCK)
INNER JOIN CTE_Emails AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
When running both CTE queries separately, it's super fast (0 seconds in SSMS, returning 122 rows and 13k rows). When running the full query, with the INNER JOIN on sEmail, it's super slow (22 minutes).
Here's the query that performs well, with temp tables (0 seconds on our hardware), and which does the exact same thing and returns the same result:
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
INTO #tb1
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email PELE ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
INTO #tb2
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM #tb1 AS BL WITH(NOLOCK)
INNER JOIN #tb2 AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
DROP TABLE #tb1
DROP TABLE #tb2
Tables stats :
OtherDb.dbo.Purchased_Email_List : 13 rows, 2 rows flagged bPattern = 1
OtherDb.dbo.Purchased_Email_List_Email : 324289 rows, 122 rows with patterns (which are used in this issue)
dbo.NewsletterService_import_list_email : 15.5M rows
dbo.NewsletterService_import_list_email_distinct ~1.5M rows
WHERE ILE.iId_newsletterservice_import_list = 1000 retrieves ~ 13k rows
I can post more info about tables on request.
Can someone help me understand this ?
UPDATE
Here is the query plan for the CTE query :
Here is the query plan with temp tables :
As you can see in the query plan, with CTEs the engine reserves the right to apply them basically as a lookup, even when you want a join.
If it isn't sure it can run the whole thing independently, in advance (essentially generating a temp table), it may just run it once for each row.
This is perfect for the recursive queries they can do like magic.
But you're seeing, in the nested Nested Loops, where it can go terribly wrong.
You're already finding the answer on your own by trying the real temp table.
Parallelism. If you look at your TEMP TABLE query, the 3rd query shows Parallelism in both distributing and gathering the work of the 1st query, and Parallelism when combining the results of the 1st and 2nd queries. The 1st query also incidentally has a relative cost of 77%. So the query engine in your TEMP TABLE example was able to determine that the 1st query can benefit from Parallelism, especially since the Parallelism operators are Gather Streams and Distribute Streams: the data is distributed in a way that allows the work (the join) to be divvied up and then recombined. Notice that the cost of the 2nd query is 0%, so you can ignore it apart from the point where it needs to be combined.
Looking at the CTE, that is processed entirely serially and not in parallel. So somehow with the CTE it could not figure out that the 1st query can be run in parallel, nor the relationship between the 1st and 2nd queries. It's possible that with multiple CTE expressions it assumes some dependency and did not look ahead far enough.
Another test you can do with the CTE is to keep CTE_Patterns but eliminate CTE_Emails by putting it as a derived table (subquery) in the 3rd query. It would be interesting to see the execution plan, and whether there is Parallelism when it is expressed that way.
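The test described above could be written roughly like this, reusing the table and column names from the question. It is only a rearrangement of the original query for experimentation (CTE_Patterns kept, CTE_Emails inlined as a derived table), not a claimed fix, and it needs the question's tables to run.

```sql
WITH CTE_Patterns AS (
    SELECT PEL.iId_purchased_email_list,
           PELE.sEmail
    FROM OtherDb.dbo.Purchased_Email_List AS PEL WITH (NOLOCK)
    INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH (NOLOCK)
        ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
    WHERE PEL.bPattern = 1
)
SELECT I.iId_newsletterservice_import_list,
       I.iId_newsletterservice_import_list_email,
       BL.iId_purchased_email_list
FROM CTE_Patterns AS BL
INNER JOIN (
    -- Formerly CTE_Emails, now a derived table in the final query.
    SELECT ILE.iId_newsletterservice_import_list,
           ILE.iId_newsletterservice_import_list_email,
           ILED.sEmail
    FROM dbo.NewsletterService_import_list_email AS ILE WITH (NOLOCK)
    INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH (NOLOCK)
        ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
    WHERE ILE.iId_newsletterservice_import_list = 1000
) AS I
    ON I.sEmail LIKE BL.sEmail;
```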
In my experience it's best to use CTEs for recursion and temp tables when you need to join back to the data. That typically makes for a much faster query.

Why is this non-correlated query so slow?

I have this query...
SELECT Distinct([TargetAttributeID]) FROM
(SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x
Execution Plan for the above query
The two inner distincts are looking at 32 and 10,000 rows respectively. This query returns 5 rows and executes in under 1 second.
If I then use the result of this query as the subject of an IN like so...
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT Distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT Distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x)
Execution Plan for the above query
Then it takes over 3 minutes! If I just take the result of the query and perform the IN "manually" then again it comes back extremely quickly.
However if I remove the two inner DISTINCTS....
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x)
Execution Plan for the above query
..then it comes back in under a second.
What is SQL Server thinking? Can it not figure out that it can perform the two sub-queries and use the result as the subject of the IN? It seems as slow as a correlated sub-query, but it isn't correlated!
In the estimated execution plan there are three Clustered Index Scans, each with a cost of 100%! (Execution Plan is here)
Can anyone tell me why the inner DISTINCTs make this query so much slower (but only when used as the subject of an IN)?
UPDATE
Sorry it's taken me a while to get these execution plans up...
Query 1
Query 2 (The slow one)
Query 3 - No Inner Distincts
Honestly I think it comes down to the fact that, in terms of relational operators, you have a gratuitously baroque query there, and SQL Server stops searching for alternate execution plans within the time it allows itself to find one.
After the parse and bind phase of plan compilation, SQL Server will apply logical transforms to the resulting tree, estimate the cost of each, and choose the one with the lowest cost. It doesn't exhaust all possible transformations, just as many as it can compute within a given window. So presumably, it has burned through that window before it arrives at a good plan, and it's the addition of the outer semi-self-join on AST_tblAttributes that pushed it over the edge.
How is it gratuitously baroque? Well, first off, there's this (simplified for noise reduction):
select distinct intAttributeID from (
select distinct intAttributeID from AST_tblAttributes ....
union all
select distinct intAttributeID from AST_tblAttributes ....
)
Concatenating two sets and projecting the unique elements? Turns out there's an operator for that: it's called UNION. So given enough time during plan compilation and enough logical transformations, SQL Server will realize that what you really mean is:
select intAttributeID from AST_tblAttributes ....
union
select intAttributeID from AST_tblAttributes ....
But wait, you put this in an IN subquery. Well, an IN subquery is a semi-join, and the right relation does not require logical deduplication in a semi-join. So SQL Server may logically rewrite the query as this:
select * from AST_tblAttributes
where intAttributeID in (
select intAttributeID from AST_tblAttributes ....
union all
select intAttributeID from AST_tblAttributes ....
)
And then go about physical plan selection. But to get there, it has to see through the cruft first, and that may fall outside the optimization window.
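The logical equivalences described above can be checked on a toy example. The table variables and their contents below are invented for illustration and stand in for the two branches and the outer table; the point is only that all three forms describe the same set of ids, not that they compile to the same plan.

```sql
-- Hypothetical stand-ins: @A and @B for the two branches, @Target for the outer table.
DECLARE @A TABLE (id int NOT NULL);
DECLARE @B TABLE (id int NOT NULL);
DECLARE @Target TABLE (id int NOT NULL PRIMARY KEY, name varchar(10) NOT NULL);

INSERT INTO @A (id) VALUES (1), (1), (2);
INSERT INTO @B (id) VALUES (2), (3), (3);
INSERT INTO @Target (id, name)
VALUES (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four');

-- Form 1: DISTINCT over a UNION ALL of DISTINCTs (the shape in the question).
SELECT DISTINCT id FROM (
    SELECT DISTINCT id FROM @A
    UNION ALL
    SELECT DISTINCT id FROM @B
) x;

-- Form 2: plain UNION. Same three ids: 1, 2 and 3.
SELECT id FROM @A
UNION
SELECT id FROM @B;

-- Form 3: inside IN, duplicates on the right side cannot change the result,
-- so UNION ALL with no DISTINCT still returns ids 1, 2 and 3 once each.
SELECT t.id, t.name
FROM @Target t
WHERE t.id IN (
    SELECT id FROM @A
    UNION ALL
    SELECT id FROM @B
);
```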
EDIT:
Really, the way to explore this for yourself, and corroborate the speculation above, is to put both versions of the query in the same window and compare estimated execution plans side-by-side (Ctrl-L in SSMS). Leave one as is, edit the other, and see what changes.
You will see that some alternate forms are recognized as logically equivalent and generate the same good plan, and others generate less optimal plans, as you bork the optimizer.**
Then, you can use SET STATISTICS IO ON and SET STATISTICS TIME ON to observe the actual amount of work SQL Server performs to execute the queries:
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT ....
SELECT ....
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
The output will appear in the messages pane.
** Or not--if they all generate the same plan, but actual execution time still varies like you say, something else may be going on--it's not unheard of. Try comparing actual execution plans and go from there.
El Ronnoco
First of all a possible explanation:
You say: "This query returns 5 rows and executes in under 1 second." But how many rows does it ESTIMATE are returned? If the estimate is far off, using the query inside the IN could cause a scan of the entire AST_tblAttributes table in the outer part, instead of an index seek on it (which could explain the big difference).
If you shared the query plans for the different variants (as a file, please), I think I should be able to get you an idea of what is going on under the hood here. It would also allow us to validate the explanation.
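One way to see estimated and actual row counts side by side, without exporting a plan file, is SET STATISTICS PROFILE. The sketch below uses a placeholder query against sys.objects purely so that it runs anywhere; the slow query would go in its place.

```sql
SET STATISTICS PROFILE ON;   -- adds a second result set with one row per plan operator

-- Placeholder query (an assumption for this example); substitute the query under test.
SELECT name
FROM sys.objects
WHERE type = 'U';

SET STATISTICS PROFILE OFF;
-- In the extra result set, compare the Rows column (actual) with
-- EstimateRows (estimated) for each operator; large gaps are worth a closer look.
```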
Edit: each DISTINCT keyword can add a new Sort (or other duplicate-removing) node to your query plan. Basically, by having those other DISTINCTs in there, you're asking SQL Server to de-duplicate the intermediate results again and again to make sure that it isn't returning duplicates. Each such operation can add substantially to the cost of the query. It is worth reviewing the effects that the DISTINCT operator can have, intended and unintended. I've been bitten by this myself.
Are you using SQL Server 2008? If so, you can try this, putting the DISTINCT work into a CTE and then joining to your main table. I've found CTEs to be pretty fast:
WITH DistinctAttribID
AS
(
SELECT Distinct([TargetAttributeID])
FROM (
SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
UNION ALL
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate
) x
)
SELECT attx.intAttributeID,
attx.txtAttributeName,
attx.txtAttributeLabel,
attx.txtType,
attx.txtEntity
FROM AST_tblAttributes attx
JOIN DistinctAttribID attrib
ON attx.intAttributeID = attrib.TargetAttributeID

Resources