Is moving a constraint into a join more efficient than join and a where clause? - sql-server

I have been trying to test this, but I have doubts about my tests as the timings vary so much.
-- Scenario 1
SELECT * FROM Foo f
INNER JOIN Bar b ON f.id = b.id
WHERE b.flag = 1;
-- Scenario 2
SELECT * FROM Foo f
INNER JOIN Bar b ON b.flag = 1 AND f.id = b.id;
Logically it seems like scenario 2 would be more efficient, but I wasn't sure if SQL Server is smart enough to optimize this or not.

Not sure why you think scenario 2 would "logically" be more efficient. On an INNER JOIN everything is basically a filter, so SQL Server can collapse the logic to the exact same underlying plan shape; comparing the two forms against AdventureWorks2012 produces identical execution plans.
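A pair along these lines (the Sales.SalesOrderHeader / Sales.SalesOrderDetail tables and the OnlineOrderFlag filter are my choice for illustration, not necessarily the original example) shows the same plan for both forms when you compare actual execution plans:
-- Filter in the WHERE clause
SELECT d.SalesOrderID, d.OrderQty
FROM Sales.SalesOrderDetail AS d
INNER JOIN Sales.SalesOrderHeader AS h
    ON h.SalesOrderID = d.SalesOrderID
WHERE h.OnlineOrderFlag = 1;
-- Same filter moved into the ON clause
SELECT d.SalesOrderID, d.OrderQty
FROM Sales.SalesOrderDetail AS d
INNER JOIN Sales.SalesOrderHeader AS h
    ON h.SalesOrderID = d.SalesOrderID
    AND h.OnlineOrderFlag = 1;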
I prefer separating the join criteria from the filter criteria, so I will always write the query in the first format (filter in the WHERE clause). However, @HLGEM makes a good point: these clauses are interchangeable in this case only because it's an INNER JOIN. For an OUTER JOIN, it is very important to place the filters on the outer table in the join criteria, else you unwittingly end up with an INNER JOIN and drastically change the semantics of the query. So my advice about how the plan can be collapsed only holds true for inner joins.
If you're worried about performance, I'd start by getting rid of SELECT * and only pulling the columns you actually need (and make sure there's a covering index).
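A hypothetical sketch of what that might look like for the question's tables (the name and amount columns and the index name are invented for illustration):
-- Pull only the columns you need, backed by an index that covers the filter,
-- the join key, and the returned Bar column.
CREATE INDEX IX_Bar_flag_id ON dbo.Bar (flag, id) INCLUDE (amount);

SELECT f.id, f.name, b.amount
FROM dbo.Foo AS f
INNER JOIN dbo.Bar AS b
    ON b.id = f.id
WHERE b.flag = 1;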
Four months later, another answer has emerged claiming that there usually will be a difference in performance, and that putting filter criteria in the ON clause will be better. While I won't dispute that this is plausible in specific cases, I contend that it certainly isn't the norm and shouldn't be used as an excuse to always put all filter criteria in the ON clause.

The accepted answer is correct only for your test case.
The answer to the headline question as stated is yes: moving the constraint into the join condition can greatly improve the query in some cases. I have seen forms similar to this (but perhaps not exactly)...
SELECT *
FROM A
INNER JOIN B
    ON B.id = A.id
INNER JOIN C
    ON C.id = A.id
WHERE B.z = 1 AND C.z = 2;
...not optimize to the same plan as the "on join" equivalents so I tend to use the "on join" constraints as a best practice even for the simpler cases that might have resolved optimally either way.
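The "on join" form I mean is something along these lines (same hypothetical tables):
SELECT *
FROM A
INNER JOIN B
    ON B.id = A.id
    AND B.z = 1
INNER JOIN C
    ON C.id = A.id
    AND C.z = 2;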

Related

Is there a performance difference or another reason to use this JOIN syntax?

I have come across several stored procedures in our "legacy" code that have joins that look like this:
SELECT *
FROM TableA
INNER JOIN TableB
INNER JOIN TableC ON TableC.TableBId = TableB.TableBId
ON TableA.TableAId = TableB.TableAId
I would write this query differently, like this:
SELECT *
FROM TableA
INNER JOIN TableB ON TableA.TableAId = TableB.TableAId
INNER JOIN TableC ON TableB.TableBId = TableC.TableBId
The results are the same, but I find the second example to be much easier to follow, especially in situations where there are several joins. Is there any advantage to writing JOIN statements with the ON clause "deferred" until after all of the joins have been specified, as in the first example?
No, predicate order doesn't matter at all to how the query is executed. You could put the ON predicates in a WHERE instead, or mix them at random, and end up with the same execution plan. While I'd tread lightly when rewriting code I don't fully understand, in your own code you should definitely write it the way that's most readable.
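For example, the old ANSI-89 "join in the WHERE" form of the same query (a sketch using the question's table names) comes out to the same plan:
SELECT *
FROM TableA, TableB, TableC
WHERE TableA.TableAId = TableB.TableAId
  AND TableB.TableBId = TableC.TableBId;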

Query performance - INNER Join filtering

I would like to work out which query would be faster. I think that the JOINs are done first, and then the WHERE clause.
So, my thinking is that this:
SELECT *
FROM Table1 t1
INNER JOIN Table2 t2
ON t1.field = t2.field
AND t2.Deleted = 0
INNER JOIN Table2 t3
ON t2.field = t3.field
AND t3.Deleted = 0
WHERE t1.Deleted = 0
Would be faster than:
SELECT *
FROM Table1 t1
INNER JOIN Table2 t2
ON t1.field = t2.field
INNER JOIN Table2 t3
ON t2.field = t3.field
WHERE
t1.Deleted = 0 AND
t2.Deleted = 0 AND
t3.Deleted = 0
The joins in the first query would filter out the data earlier, and hence there would be less joining to do.
(I understand this can be different when you have LEFT joins.)
Note that the WHERE clause you provide is in Conjunctive Normal Form. Whenever your query has simple comparison or equality terms joined in Conjunctive Normal Form, the optimizer would be completely negligent if it did not perform the transformation from your second form to your first. I believe it can be safely assumed that the optimizers for ALL mature SQL products are quite capable of efficiently optimizing simple predicates in either Disjunctive or Conjunctive Normal Form.
However, at a certain point the optimizer has to stop analyzing the query and begin constructing the optimized query. In cases where your filter clauses are complex, there can be benefit in helping the optimizer out by locating clauses in the correct sub-query.
However, this step is not usually performed unless the query actually tests as non-performant. Human labour is much more expensive than CPU cycles, and that includes the time humans spend reading code during review and maintenance. Until a query has been proved non-performant to the point of requiring manual tuning, it is best to keep your filtering terms in the WHERE clause, where they can be readily identified and verified.
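As an illustration of locating clauses in the correct sub-query, here is a hedged sketch; the complex Status/ReviewedDate predicate is invented for the example:
SELECT t1.*
FROM Table1 AS t1
INNER JOIN
(
    -- The complex filter lives with the table it constrains, so the intent is
    -- explicit even if the optimizer would have pushed it down anyway.
    SELECT field
    FROM Table2
    WHERE Deleted = 0
      AND (Status IN (1, 2) OR ReviewedDate >= DATEADD(DAY, -30, GETDATE()))
) AS t2
    ON t1.field = t2.field
WHERE t1.Deleted = 0;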

NOT IN Subquery Optimization

I have a dynamic query that runs to identify DVDs that members have not rented yet. I am using a NOT IN subquery, but with a large member table it gets really slow. Any suggestions on how to optimize the query?
SELECT DVDTitle AS "DVD Title"
FROM DVD
WHERE DVDId NOT IN
(SELECT DISTINCT DVDId FROM Rental WHERE MemberId = 'AL240');
thanks
Using NOT EXISTS will have slightly better performance because it can "short circuit" rather than evaluating the entire set for each match. At the very least, it will be "no worse" than NOT IN or an OUTER JOIN, though there are exceptions to every rule. Here is how I would write this query:
SELECT DVDTitle AS [DVD Title]
FROM dbo.DVD AS d
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.Rental
WHERE MemberId = 'AL240'
AND DVDId = d.DVDId
);
I would guess you will optimize performance better by investigating the execution plan and ensuring that your indexes are best suited for this query (without causing negative impact to other parts of your workload).
Also see Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?
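One of those alternatives, sketched against the question's tables (the EXCEPT form; names follow the question):
WITH NotRented AS
(
    -- All DVDs minus the ones this member has rented
    SELECT DVDId FROM dbo.DVD
    EXCEPT
    SELECT DVDId FROM dbo.Rental WHERE MemberId = 'AL240'
)
SELECT d.DVDTitle AS [DVD Title]
FROM dbo.DVD AS d
INNER JOIN NotRented AS nr
    ON nr.DVDId = d.DVDId;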
SELECT d.DVDTitle AS "DVD Title"
FROM DVD d
LEFT OUTER JOIN Rental r
    ON d.DVDId = r.DVDId
    AND r.MemberId = 'AL240'
WHERE r.DVDId IS NULL
Make sure you have indexes on the following columns (example DDL below):
d.DVDId
r.DVDId
r.MemberId
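A sketch of that DDL (index names are invented; if DVDId is already the primary key of DVD, the first index is redundant):
CREATE INDEX IX_DVD_DVDId       ON dbo.DVD (DVDId);
CREATE INDEX IX_Rental_DVDId    ON dbo.Rental (DVDId);
CREATE INDEX IX_Rental_MemberId ON dbo.Rental (MemberId);
-- Alternatively, one composite index serves both the MemberId filter and the join:
-- CREATE INDEX IX_Rental_MemberId_DVDId ON dbo.Rental (MemberId, DVDId);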

"Sub SELECT before INNER JOIN" or "WHERE after INNER JOIN"?

There are tables A and B. I want to join these tables on two columns, but only for selected rows of table A.
Query scenarios:
SELECT B.*
FROM B
INNER JOIN (SELECT * FROM A WHERE A.COLUMN1 BETWEEN somevalue1 AND somevalue2) C
ON B.COLUMN2 = C.COLUMN2
AND B.COLUMN3 = C.COLUMN3
OR
SELECT B.*
FROM B
INNER JOIN A
ON B.COLUMN2 = A.COLUMN2
AND B.COLUMN3 = A.COLUMN3
WHERE A.COLUMN1 BETWEEN somevalue1 AND somevalue2
Both tables A and B have millions of records. With the WHERE condition, table A will return only 1000 rows, so the actual join to be performed is to find matching rows from B for only those 1000 rows of A.
Question:
Which one should be faster? (I do not have access to view the query execution plan)
Thanks!
It's hard to predict performance here without actually measuring.
My instincts say the latter option should be faster: with the former, the optimizer may want to fully materialize the derived table before the join, which in addition to being slow all by itself could stop it from using any indexes that might help the join along. With the latter, the optimizer should still be smart enough to pre-filter table A before the join, with no risk of losing index access, and it only has to materialize the rows that match the join. Notice all the weasel words in there, though; my instincts could be way off in this case. The real lesson to take away from this is to measure your query using real data under conditions as close to production as possible.
More importantly, I prefer the latter because (imo) it's just more readable and maintainable.
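If you do want to measure rather than guess, a simple starting point in SSMS is to compare logical reads and CPU time for both forms:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the derived-table version, then the JOIN + WHERE version,
-- and compare the output on the Messages tab.

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;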

Is the WHERE clause executed before IN and JOIN, or after?

I read this article:
Logical Processing Order of the SELECT statement
At the end of the article it says that the ON and JOIN clauses are considered before WHERE.
Suppose we have a master table with 10 million records and a detail table (which has a foreign key reference to the master table) with 50 million records. We have a query that wants just 100 records from the detail table, according to a PK in the master table.
In this situation, do ON and JOIN execute before WHERE? I mean, do we end up with 500 million records after the JOIN, with WHERE then applied to that? Or is WHERE applied first, and then the JOIN and ON are considered? If the second is true, doesn't that contradict the article?
thanks
In the case of an INNER JOIN, or the table on the left in a LEFT JOIN, the optimizer will in many cases find that it is better to perform any filtering first (highest selectivity) before actually performing whatever type of physical join - so there are obviously physical orders of operations which are better.
To some extent you can sometimes control this (or interfere with this) with your SQL, for instance, with aggregates in subqueries.
The logical order of processing the constraints in the query can only be transformed according to known invariant transformations.
So:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is still logically equivalent to:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and they will generally have the same execution plan.
On the other hand:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is NOT equivalent to:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and so the optimizer isn't going to transform them into the same execution plan: the WHERE version filters out the NULL-extended rows for unmatched rows of a (effectively turning the query back into an inner join), while the ON version keeps every row from a and only controls which rows of b are joined.
The optimizer is very smart and is able to move things around pretty successfully, including collapsing views and inline table-valued functions as well as even pushing things down through certain kinds of aggregates fairly successfully.
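For instance, an inline table-valued function is expanded into the calling query much like a view, so an outer predicate can still be pushed inside it. A hedged sketch with invented table and function names:
CREATE FUNCTION dbo.ShippedOrders()
RETURNS TABLE
AS RETURN
(
    SELECT OrderId, CustomerId, ShippedDate
    FROM dbo.Orders
    WHERE ShippedDate IS NOT NULL
);
GO

-- The function body is inlined, so the OrderId predicate can still drive an index seek.
SELECT OrderId, CustomerId
FROM dbo.ShippedOrders()
WHERE OrderId = 42;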
Typically, when you write SQL, it needs to be understandable, maintainable and correct. As far as efficiency in execution, if the optimizer is having difficulty turning the declarative SQL into an execution plan with acceptable performance, the code can sometimes be simplified or appropriate indexes or hints added or broken down into steps which should perform more quickly - all in successive orders of invasiveness.
It doesn't matter
Logical processing order is always honoured: regardless of actual processing order
INNER JOINs and WHERE conditions are effectively associative and commutative (hence the ANSI-89 "join in the where" syntax) so actual order doesn't matter
Logical order becomes important with outer joins and more complex queries: applying WHERE on an OUTER table changes the logic completely.
Again, it doesn't matter how the optimiser does it internally so long as the query semantics are maintained by following logical processing order.
And the key word here is "optimiser": it does exactly what it says
Just re-reading Paul White's excellent series on the Query Optimiser and remembered this question.
It is possible to use an undocumented command to disable specific transformation rules and get some insight into the transformations applied.
For (hopefully!) obvious reasons only try this on a development instance and remember to re-enable them and remove any suboptimal plans from the cache.
USE AdventureWorks2008;
/*Disable the rules*/
DBCC RULEOFF ('SELonJN');
DBCC RULEOFF ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
With those two rules disabled, the resulting plan performs a cartesian join and filters afterwards.
/*Re-enable them*/
DBCC RULEON ('SELonJN');
DBCC RULEON ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
With them re-enabled, the predicate is pushed right down into the index seek, which reduces the number of rows processed by the join operation.
There is no defined order. The SQL engine determines what order to perform the operations based on the execution strategy chosen by its optimizer.
I think you have misread ON as IN in the article.
However, the order shown in the article is correct (it is MSDN, after all). The ON and JOIN are logically evaluated before WHERE, because WHERE has to be applied as a filter on the intermediate result set produced by the JOINs.
The article just describes the logical order of execution, and at the end of the paragraph it adds this line too ;)
"Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list."
