NOT IN Subquery Optimization


I have a dynamic query that identifies CDs that members have not rented yet. I am using a NOT IN subquery, but when the member table is large the query becomes really slow. Any suggestions on how to optimize the query?
SELECT DVDTitle AS "DVD Title"
FROM DVD
WHERE DVDId NOT IN
(SELECT DISTINCT DVDId FROM Rental WHERE MemberId = 'AL240');
thanks

Using NOT EXISTS will have slightly better performance because it can "short circuit" rather than evaluating the entire set for each match. At the very least, it will be "no worse" than NOT IN or an OUTER JOIN, though there are exceptions to every rule. Here is how I would write this query:
SELECT DVDTitle AS [DVD Title]
FROM dbo.DVD AS d
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.Rental
WHERE MemberId = 'AL240'
AND DVDId = d.DVDId
);
I would guess you will optimize performance better by investigating the execution plan and ensuring that your indexes are best suited for this query (without causing negative impact to other parts of your workload).
Also see Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?
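As a rough sketch of the indexing suggestion above: the NOT EXISTS probe looks up Rental by MemberId and DVDId, so a composite index on those two columns may let each probe be resolved with an index seek. The index name is hypothetical, and whether this helps depends on your data and the indexes you already have, so compare the execution plan before and after.

```sql
-- Hypothetical index name; table and column names are taken from the query above.
CREATE NONCLUSTERED INDEX IX_Rental_MemberId_DVDId
    ON dbo.Rental (MemberId, DVDId);
```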

SELECT d.DVDTitle AS "DVD Title"
FROM DVD d
left outer join Rental r on d.DVDId = r.DVDId
and r.MemberId = 'AL240' -- must be in the ON clause; in WHERE it would discard the unmatched rows
WHERE r.DVDId is null
Make sure you have indexes on:
d.DVDId
r.DVDId
r.MemberId

Related

Force joined view not to be optimized

I have a somewhat complex view which includes a join to another view. For some reason the generated query plan is highly inefficient, and the query runs for many hours. However, if I select the sub-view into a temporary table first and then join with that, the same query finishes in a few minutes.
My question is: Is there some kind of query hint or other trick which will force the optimizer to execute the joined sub-view in isolation before performing the join, just as when using a temp table? Clearly the default strategy chosen by the optimizer is not optimal.
I cannot use the temporary-table trick since views do not allow temporary tables. I understand I could probably rewrite everything as a stored procedure, but that would break the composability of views, and it also seems bad for maintenance to rewrite everything just to trick the optimizer into not using a bad plan.

Adam Machanic explained one such way at a SQL Saturday I recently attended. The presentation was called Clash of the Row Goals. The method involves using a TOP X at the beginning of the sub-select. He explained that when doing a TOP X, the query optimizer assumes it is more efficient to grab the TOP X rows one at a time. As long as you set X as a sufficiently large number (limit of INT or BIGINT?), the query will always get the correct results.
So one example that Adam provided:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
becomes:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT TOP(2147483647)
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
It is a super cool trick and very useful.

When things get messy, the query optimizer often resorts to loop joins.
If materializing to a temp table fixed it, then most likely that is the problem.
The optimizer often does not deal with views very well.
I would rewrite your view so that it does not use other views.
Join Hints (Transact-SQL)
You may be able to use these hints on views.
Try merge and hash.
Try changing the order of the joins.
Move conditions into the join whenever possible:
select *
from table1
join table2
on table1.FK = table2.[Key]
where table2.[desc] = 'cat1'
should be
select *
from table1
join table2
on table1.FK = table2.[Key]
and table2.[desc] = 'cat1'
Now, the query optimizer will get that one correct, but as the query gets more complex the optimizer goes into what I call stupid mode and uses loop joins. But that is also done to protect the server and keep as little in memory as possible.
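To illustrate the "try merge and hash" suggestion, here is a minimal sketch using the documented join hint syntax on the sample query above. Hints override the optimizer's own choice, and a join hint in the FROM clause also enforces the written join order, so treat this as an experiment to compare plans rather than a fix to leave in place blindly.

```sql
-- Force a hash join between the two tables (join hint in the FROM clause).
SELECT *
FROM table1
INNER HASH JOIN table2
    ON table1.FK = table2.[Key]
WHERE table2.[desc] = 'cat1';

-- Alternatively, restrict the whole query to merge or hash joins
-- and let the optimizer pick the cheaper of the two.
SELECT *
FROM table1
INNER JOIN table2
    ON table1.FK = table2.[Key]
WHERE table2.[desc] = 'cat1'
OPTION (MERGE JOIN, HASH JOIN);
```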

Is moving a constraint into a join more efficient than join and a where clause?

I have been trying to test this, but I have doubts about my tests as the timings vary so much.
-- Scenario 1
SELECT * FROM Foo f
INNER JOIN Bar b ON f.id = b.id
WHERE b.flag = 1;
-- Scenario 2
SELECT * FROM Foo f
INNER JOIN Bar b ON b.flag = 1 AND f.id = b.id;
Logically it seems like scenario 2 would be more efficient, but I wasn't sure if SQL Server is smart enough to optimize this or not.

Not sure why you think scenario 2 would "logically" be more efficient. On an INNER JOIN everything is basically a filter so SQL Server can collapse the logic to the exact same underlying plan shape. Here's an example from AdventureWorks2012 [execution plan image]:
I prefer separating the join criteria from the filter criteria, so will always write the query in the format on the left. However @HLGEM makes a good point, these clauses are interchangeable in this case only because it's an INNER JOIN. For an OUTER JOIN, it is very important to place the filters on the outer table in the join criteria, else you unwittingly end up with an INNER JOIN and drastically change the semantics of the query. So my advice about how the plan can be collapsed only holds true for inner joins.
If you're worried about performance, I'd start by getting rid of SELECT * and only pulling the columns you actually need (and make sure there's a covering index).
Four months later, another answer has emerged claiming that there usually will be a difference in performance, and that putting filter criteria in the ON clause will be better. While I won't dispute that it is certainly plausible that this could happen, I contend that it certainly isn't the norm and shouldn't be something you use as an excuse to always put all filter criteria in the ON clause.
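The OUTER JOIN caveat above can be sketched with the Foo and Bar tables from the question. This assumes flag is a bit column (hence = 1), which the question does not state.

```sql
-- Filter in the ON clause: every row of Foo is returned;
-- Bar columns are NULL where no flagged match exists.
SELECT f.id, b.flag
FROM Foo AS f
LEFT OUTER JOIN Bar AS b
    ON f.id = b.id
   AND b.flag = 1;

-- Filter in the WHERE clause: unmatched rows have b.flag = NULL,
-- fail the predicate, and are discarded, so this behaves like an INNER JOIN.
SELECT f.id, b.flag
FROM Foo AS f
LEFT OUTER JOIN Bar AS b
    ON f.id = b.id
WHERE b.flag = 1;
```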

The accepted answer is correct only for your test case.
An answer to the headline question as stated is yes, moving the constraint to the join condition can greatly improve the query and ensures. I have seen forms similar to this (but perhaps not exactly)...
select *
from A
inner join B
on B.id = a.id
inner join C
on C.id = A.id
where B.z = 1 and C.z = 2;
...not optimize to the same plan as the "on join" equivalents so I tend to use the "on join" constraints as a best practice even for the simpler cases that might have resolved optimally either way.

Why is this non-correlated query so slow?

I have this query...
SELECT Distinct([TargetAttributeID]) FROM
(SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x
Execution Plan for the above query
The two inner distincts are looking at 32 and 10,000 rows respectively. This query returns 5 rows and executes in under 1 second.
If I then use the result of this query as the subject of an IN like so...
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT Distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT Distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x)
Execution Plan for the above query
Then it takes over 3 minutes! If I just take the result of the query and perform the IN "manually" then again it comes back extremely quickly.
However if I remove the two inner DISTINCTS....
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x)
Execution Plan for the above query
..then it comes back in under a second.
What is SQL Server thinking? Can it not figure out that it can perform the two sub-queries and use the result as the subject of the IN? It seems as slow as a correlated sub-query, but it isn't correlated!
In the estimated execution plan there are three Clustered Index Scans, each with a cost of 100%! (Execution Plan is here)
Can anyone tell me why the inner DISTINCTS make this query so much slower (but only when used as the subject of an IN...) ?
UPDATE
Sorry it's taken me a while to get these execution plans up...
Query 1
Query 2 (The slow one)
Query 3 - No Inner Distincts

Honestly I think it comes down to the fact that, in terms of relational operators, you have a gratuitously baroque query there, and SQL Server stops searching for alternate execution plans within the time it allows itself to find one.
After the parse and bind phase of plan compilation, SQL Server will apply logical transforms to the resulting tree, estimate the cost of each, and choose the one with the lowest cost. It doesn't exhaust all possible transformations, just as many as it can compute within a given window. So presumably, it has burned through that window before it arrives at a good plan, and it's the addition of the outer semi-self-join on AST_tblAttributes that pushed it over the edge.
How is it gratuitously baroque? Well, first off, there's this (simplified for noise reduction):
select distinct intAttributeID from (
select distinct intAttributeID from AST_tblAttributes ....
union all
select distinct intAttributeID from AST_tblAttributes ....
)
Concatenating two sets, and projecting the unique elements? Turns out there's an operator for that; it's called UNION. So given enough time during plan compilation and enough logical transformations, SQL Server will realize what you really mean is:
select intAttributeID from AST_tblAttributes ....
union
select intAttributeID from AST_tblAttributes ....
But wait, you put this in an IN subquery. Well, an IN subquery is a semi-join, and the right relation does not require logical deduplication in a semi-join. So SQL Server may logically rewrite the query as this:
select * from AST_tblAttributes
where intAttributeID in (
select intAttributeID from AST_tblAttributes ....
union all
select intAttributeID from AST_tblAttributes ....
)
And then go about physical plan selection. But to get there, it has to see through the cruft first, and that may fall outside the optimization window.
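Spelled out against the tables in the question, the simplified form might look like the sketch below. The variable types and values are assumptions for illustration only; the question does not show how they are declared.

```sql
DECLARE @intProfileID int = 1;              -- assumed type and value
DECLARE @cutoffdate datetime = '20100101';  -- assumed type and value

SELECT attx.intAttributeID,
       attx.txtAttributeName,
       attx.txtAttributeLabel,
       attx.txtType,
       attx.txtEntity
FROM AST_tblAttributes AS attx
WHERE attx.intAttributeID IN
(
    -- No DISTINCT anywhere: IN only tests for membership,
    -- so duplicates on this side cannot change the result.
    SELECT att1.intAttributeID
    FROM AST_tblAttributes AS att1
    INNER JOIN AST_lnkProfileDemandAttributes AS pda
        ON pda.intAttributeID = att1.intAttributeID
       AND pda.intProfileID = @intProfileID
    UNION ALL
    SELECT ca2.intAttributeID
    FROM AST_lnkCapturePolicyAttributes AS ca2
    INNER JOIN AST_lnkEmployeeCapture AS ec2
        ON ec2.intAdminCaptureID = ca2.intAdminCaptureID
       AND ec2.intTeamID = 57
    WHERE ec2.dteCreatedDate >= @cutoffdate
);
```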
EDIT:
Really, the way to explore this for yourself, and corroborate the speculation above, is to put both versions of the query in the same window and compare estimated execution plans side-by-side (Ctrl-L in SSMS). Leave one as is, edit the other, and see what changes.
You will see that some alternate forms are recognized as logically equivalent and generate the same good plan, and others generate less optimal plans, as you bork the optimizer.**
Then, you can use SET STATISTICS IO ON and SET STATISTICS TIME ON to observe the actual amount of work SQL Server performs to execute the queries:
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT ....
SELECT ....
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
The output will appear in the messages pane.
** Or not--if they all generate the same plan, but actual execution time still varies like you say, something else may be going on--it's not unheard of. Try comparing actual execution plans and go from there.

El Ronnoco
First of all a possible explanation:
You say that: "This query returns 5 rows and executes in under 1 second.". But how many rows does it ESTIMATE are returned? If the estimate is very much off, using the query as part of the IN part could cause you to scan the entire: AST_tblAttributes in the outer part, instead of index seeking it (which could explain the big difference)
If you shared the query plans for the different variants (as a file, please), I think I should be able to get you an idea of what is going on under the hood here. It would also allow us to validate the explanation.
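One way to compare estimated and actual row counts without exporting a plan file is SET STATISTICS PROFILE, which returns an extra result set containing both a Rows and an EstimateRows column for each operator. The sketch below runs it on the first branch of the subquery; the variable's type and value are assumptions.

```sql
DECLARE @intProfileID int = 1;  -- assumed type and value, for illustration only

SET STATISTICS PROFILE ON;

SELECT DISTINCT att1.intAttributeID AS [TargetAttributeID]
FROM AST_tblAttributes AS att1
INNER JOIN AST_lnkProfileDemandAttributes AS pda
    ON pda.intAttributeID = att1.intAttributeID
   AND pda.intProfileID = @intProfileID;

SET STATISTICS PROFILE OFF;
```

A large gap between Rows and EstimateRows on any operator would support the explanation above.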

Edit: each DISTINCT keyword can add a new Sort (or other duplicate-removal) node to your query plan. Basically, by having those other DISTINCTs in there, you may be forcing SQL Server to re-sort the rows again and again to make sure that it isn't returning duplicates. Each such operation can quadruple the cost of the query. Here's a good review of the effects that the DISTINCT operator can have, intended and unintended. I've been bitten by this myself.
Are you using SQL 2008? If so, you can try this, putting the DISTINCT work into a CTE and then joining to your main table. I've found CTEs to be pretty fast:
WITH DistinctAttribID
AS
(
SELECT Distinct([TargetAttributeID])
FROM (
SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
UNION ALL
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate
) x
)
SELECT attx.intAttributeID,
attx.txtAttributeName,
attx.txtAttributeLabel,
attx.txtType,
attx.txtEntity
FROM AST_tblAttributes attx
JOIN DistinctAttribID attrib
ON attx.intAttributeID = attrib.TargetAttributeID

WHERE clause better execute before IN and JOIN or after

I read this article:
Logical Processing Order of the SELECT statement
At the end of the article it says that the ON and JOIN clauses are considered before WHERE.
Consider a master table with 10 million records and a detail table (with an FK reference to the master table) with 50 million records. We have a query that wants just 100 records from the detail table, selected by a PK value in the master table.
In this situation, are ON and JOIN executed before WHERE? I mean, do we get 500 million records after the JOIN, and only then is WHERE applied? Or is WHERE applied first, and then JOIN and ON are considered? If the second answer is true, is it inconsistent with the article above?
thanks

In the case of an INNER JOIN or a table on the left in a LEFT JOIN, in many cases, the optimizer will find that it is better to perform any filtering first (highest selectivity) before actually performing whatever type of physical join - so there are obviously physical order of operations which are better.
To some extent you can sometimes control this (or interfere with this) with your SQL, for instance, with aggregates in subqueries.
The logical order of processing the constraints in the query can only be transformed according to known invariant transformations.
So:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is still logically equivalent to:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and they will generally have the same execution plan.
On the other hand:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is NOT equivalent to:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and so the optimizer isn't going to transform them into the same execution plan.
The optimizer is very smart and is able to move things around pretty successfully, including collapsing views and inline table-valued functions as well as even pushing things down through certain kinds of aggregates fairly successfully.
Typically, when you write SQL, it needs to be understandable, maintainable and correct. As far as efficiency in execution, if the optimizer is having difficulty turning the declarative SQL into an execution plan with acceptable performance, the code can sometimes be simplified or appropriate indexes or hints added or broken down into steps which should perform more quickly - all in successive orders of invasiveness.
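For the master/detail scenario in the question, a minimal sketch might look like the following. All table, column, and index names are hypothetical, since the question does not give a schema. With an index on the foreign key column, the optimizer can usually seek straight to the matching detail rows rather than joining everything first, but check the actual plan to confirm.

```sql
-- Hypothetical schema: dbo.MasterTable (MasterId PK)
-- and dbo.DetailTable (DetailId PK, MasterId FK).
CREATE NONCLUSTERED INDEX IX_DetailTable_MasterId
    ON dbo.DetailTable (MasterId);

-- Logically the JOIN is evaluated before WHERE, but physically the
-- predicate on MasterId can be applied to both tables before joining.
SELECT d.DetailId, d.MasterId
FROM dbo.MasterTable AS m
INNER JOIN dbo.DetailTable AS d
    ON d.MasterId = m.MasterId
WHERE m.MasterId = 42;  -- arbitrary example key
```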

It doesn't matter
Logical processing order is always honoured: regardless of actual processing order
INNER JOINs and WHERE conditions are effectively associative and commutative (hence the ANSI-89 "join in the where" syntax) so actual order doesn't matter
Logical order becomes important with outer joins and more complex queries: applying WHERE on an OUTER table changes the logic completely.
Again, it doesn't matter how the optimiser does it internally so long as the query semantics are maintained by following logical processing order.
And the key word here is "optimiser": it does exactly what it says

Just re-reading Paul White's excellent series on the Query Optimiser and remembered this question.
It is possible to use an undocumented command to disable specific transformation rules and get some insight into the transformations applied.
For (hopefully!) obvious reasons only try this on a development instance and remember to re-enable them and remove any suboptimal plans from the cache.
USE AdventureWorks2008;
/*Disable the rules*/
DBCC RULEOFF ('SELonJN');
DBCC RULEOFF ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
You can see with those two rules disabled it does a cartesian join and filter after.
/*Re-enable them*/
DBCC RULEON ('SELonJN');
DBCC RULEON ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
With them enabled the predicate is pushed right down into the index seek and so reduces the number of rows processed by the join operation.

There is no defined order. The SQL engine determines what order to perform the operations based on the execution strategy chosen by its optimizer.

I think you have misread ON as IN in the article.
However, the order it is showing in the article is correct (obviously it is msdn anyway). The ON and JOIN are executed before WHERE naturally because WHERE has to be applied as a filter on the temporary resultset obtained due to JOINS
The article just says it is logical order of execution and also at end of the paragraph it adds this line too ;)
"Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list."

Adding more OR searches with CONTAINS brings query to a crawl

I have a simple query that relies on two full-text indexed tables, but it runs extremely slowly when I combine CONTAINS with any additional OR search. As seen in the execution plan, the two full-text searches crush the performance. If I query with just one of the CONTAINS predicates, or neither, the query is sub-second, but the moment you add OR into the mix the query becomes ill-fated.
The two tables are nothing special, they're not overly wide (42 cols in one, 21 in the other; maybe 10 cols are FT indexed in each) or even contain very many records (36k recs in the biggest of the two).
I was able to solve the performance by splitting the two CONTAINS searches into their own SELECT queries and then UNION the three together. Is this UNION workaround my only hope?
SELECT a.CollectionID
FROM collections a
INNER JOIN determinations b ON a.CollectionID = b.CollectionID
WHERE a.CollrTeam_Text LIKE '%fa%'
OR CONTAINS(a.*, '"*fa*"')
OR CONTAINS(b.*, '"*fa*"')

I'd be curious to see if a LEFT JOIN to an equivalent CONTAINSTABLE would perform any better. Something like:
SELECT a.CollectionID
FROM collections a
INNER JOIN determinations b ON a.CollectionID = b.CollectionID
LEFT JOIN CONTAINSTABLE(collections, *, '"*fa*"') ct1 on a.CollectionID = ct1.[Key]
LEFT JOIN CONTAINSTABLE(determinations, *, '"*fa*"') ct2 on b.CollectionID = ct2.[Key]
WHERE a.CollrTeam_Text LIKE '%fa%'
OR ct1.[Key] IS NOT NULL
OR ct2.[Key] IS NOT NULL

I was going to suggest to UNION each as their own query, but as I read your question I saw that you have found that. I can't think of a better way, so if it helps use it. The UNION method is a common approach to a poor performing query that has several OR conditions where each performs well on its own.
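The question does not show the UNION version, so the following is only a sketch of what it might look like, with one SELECT per OR branch. Note that UNION removes duplicates, so unlike the original query it returns each CollectionID once even when several determinations rows match.

```sql
-- Branch 1: the LIKE predicate on its own.
SELECT a.CollectionID
FROM collections AS a
INNER JOIN determinations AS b ON a.CollectionID = b.CollectionID
WHERE a.CollrTeam_Text LIKE '%fa%'

UNION

-- Branch 2: full-text search on collections only.
SELECT a.CollectionID
FROM collections AS a
INNER JOIN determinations AS b ON a.CollectionID = b.CollectionID
WHERE CONTAINS(a.*, '"*fa*"')

UNION

-- Branch 3: full-text search on determinations only.
SELECT a.CollectionID
FROM collections AS a
INNER JOIN determinations AS b ON a.CollectionID = b.CollectionID
WHERE CONTAINS(b.*, '"*fa*"');
```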

I would probably use the UNION. If you are really against it, you might try something like:
SELECT a.CollectionID
FROM collections a
LEFT OUTER JOIN (SELECT CollectionID FROM collections WHERE CONTAINS(*, '"*fa*"')) c
ON c.CollectionID = a.CollectionID
LEFT OUTER JOIN (SELECT CollectionID FROM determinations WHERE CONTAINS(*, '"*fa*"')) d
ON d.CollectionID = a.CollectionID
WHERE a.CollrTeam_Text LIKE '%fa%'
OR c.CollectionID IS NOT NULL
OR d.CollectionID IS NOT NULL

We've experienced the exact same problem and, at the time, put it down to our query being badly formed: SQL 2005 had let us get away with it, but 2008 wouldn't.
In the end, we split the query into 2 SELECTs that were called using an IF. Glad someone else has had the same problem and that it's a known issue. We were seeing queries on a table with ~150,000 rows + full-text going from < 1 second (2005) to 30+ seconds (2008).
