Is there any difference (performance, best-practice, etc...) between putting a condition in the JOIN clause vs. the WHERE clause?
For example...
-- Condition in JOIN
SELECT *
FROM dbo.Customers AS CUS
INNER JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
AND CUS.FirstName = 'John'
-- Condition in WHERE
SELECT *
FROM dbo.Customers AS CUS
INNER JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
WHERE CUS.FirstName = 'John'
Which do you prefer (and perhaps why)?
The relational algebra allows interchangeability of the predicates in the WHERE clause and the INNER JOIN, so even INNER JOIN queries with WHERE clauses can have the predicates rearrranged by the optimizer so that they may already be excluded during the JOIN process.
I recommend you write the queries in the most readable way possible.
Sometimes this includes making the INNER JOIN relatively "incomplete" and putting some of the criteria in the WHERE simply to make the lists of filtering criteria more easily maintainable.
For example, instead of:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
AND c.State = 'NY'
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
AND a.Status = 1
Write:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
WHERE c.State = 'NY'
AND a.Status = 1
But it depends, of course.
For inner joins I have not really noticed a difference (but as with all performance tuning, you need to check against your database under your conditions).
However where you put the condition makes a huge difference if you are using left or right joins. For instance consider these two queries:
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
WHERE ORD.OrderDate >'20090515'
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
AND ORD.OrderDate >'20090515'
The first will give you only those records that have an order dated later than May 15, 2009 thus converting the left join to an inner join.
The second will give those records plus any customers with no orders. The results set is very different depending on where you put the condition. (Select * is for example purposes only, of course you should not use this in production code.)
The exception to this is when you want to see only the records in one table but not the other. Then you use the where clause for the condition not the join.
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
WHERE ORD.OrderID is null
Most RDBMS products will optimize both queries identically. In "SQL Performance Tuning" by Peter Gulutzan and Trudy Pelzer, they tested multiple brands of RDBMS and found no performance difference.
I prefer to keep join conditions separate from query restriction conditions.
If you're using OUTER JOIN sometimes it's necessary to put conditions in the join clause.
WHERE will filter after the JOIN has occurred.
Filter on the JOIN to prevent rows from being added during the JOIN process.
I prefer the JOIN to join full tables/Views and then use the WHERE To introduce the predicate of the resulting set.
It feels syntactically cleaner.
I typically see performance increases when filtering on the join. Especially if you can join on indexed columns for both tables. You should be able to cut down on logical reads with most queries doing this too, which is, in a high volume environment, a much better performance indicator than execution time.
I'm always mildly amused when someone shows their SQL benchmarking and they've executed both versions of a sproc 50,000 times at midnight on the dev server and compare the average times.
Agree with 2nd most vote answer that it will make big difference when using LEFT JOIN or RIGHT JOIN. Actually, the two statements below are equivalent. So you can see that AND clause is doing a filter before JOIN while the WHERE clause is doing a filter after JOIN.
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
AND ORD.OrderDate >'20090515'
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN (SELECT * FROM dbo.Orders WHERE OrderDate >'20090515') AS ORD
ON CUS.CustomerID = ORD.CustomerID
Joins are quicker in my opinion when you have a larger table. It really isn't that much of a difference though especially if you are dealing with a rather smaller table. When I first learned about joins, i was told that conditions in joins are just like where clause conditions and that i could use them interchangeably if the where clause was specific about which table to do the condition on.
Putting the condition in the join seems "semantically wrong" to me, as that's not what JOINs are "for". But that's very qualitative.
Additional problem: if you decide to switch from an inner join to, say, a right join, having the condition be inside the JOIN could lead to unexpected results.
It is better to add the condition in the Join. Performance is more important than readability. For large datasets, it matters.
Related
How can I optimize Performance of the below mentioned query when the table structure is as shown in the pic below
Pic Showing The Table Structure
select CounterID, OutletTitle, CounterTitle
from(
select OutletID, Text as OutletTitle
from Outlets as q1
inner join
TranslationTexts as tt
on q1.TitleID=tt.TranslationID
where tt.Locale='ar-SA' and q1.CompanyID=311 and q1.OutletID=8 --Locale & CompanyID & OutletID
) as O
inner join
(
select CounterID, Text as CounterTitle, OutletID
from Counters as q1
inner join
TranslationTexts as tt
on q1.TitleID=tt.TranslationID
where tt.Locale='ar-SA' and q1.OutletID=8 --Locale & OutletID
) as C
on O.OutletID=C.OutletID
You should try this request :
SELECT CounterID, tou.Text as OutletTitle, tco.Text as CounterTitle
FROM Counters as co
INNER JOIN Outlets as ou ON co.OutletID = ou.OutletID
INNER JOIN TranslationTexts as tco on co.TitleID=tco.TranslationID
INNER JOIN TranslationTexts as tou on ou.TitleID=tou.TranslationID
WHERE co.CompanyID=311 and co.OutletID=8 AND tco.Locale='ar-SA' and tou.Locale='ar-SA'
To have much better performance, you could add some indexes on the 3 tables.
This is a different approach. I cannot say about improvement in performance because that depends on a lot of other things, but I believe it is an equivalent version and an easier one to read.
SELECT
C.CounterID
, tt.Text AS OutletTitle
, tt.Text AS CounterTitle
FROM
Outlets AS q1
INNER JOIN TranslationTexts AS tt ON q1.TitleID=tt.TranslationID
INNER JOIN Counters C ON c.OutletID=q1.OutletID
INNER JOIN TranslationTexts AS tt2 ON tt2.TranslationID=tt.TranslationID AND tt2.Locale=tt.Locale
WHERE
tt.Locale='ar-SA' and q1.CompanyID=311 and q1.OutletID=8;
The question is what you want to optimize.. readability (and maintainability) and/or performance ?
Most people have their own 'style' when writing queries. I prefer the one below, but to the server it will probably look the same and most likely the system will have the exact same amount of 'work' to get the data even though it 'looks' different to us humans. I'd suggest to google around a bit and learn how to interpret a Query Plan.
SELECT q2.CounterID,
tt1.Text as OutletTitle,
tt2.Text as CounterTitle
FROM Outlets as q1
INNER JOIN Counters as q2
ON q2.OutletID = q1.OutletID
INNER JOIN TranslationTexts as tt1
ON tt1.TranslationID = q1.TitleID
AND tt1.Locale = 'ar-SA'
INNER JOIN TranslationTexts as tt2
ON tt2.TranslationID = q2.TitleID
AND tt2.Locale = 'ar-SA'
WHERE q1.CompanyID = 311
AND q1.OutletID = 8
On of the things I notice is that you pass both CompanyID and OutletID as filters for the Outlets table. Since OutletID is the primary key of that table I wonder if you really need the filter on CompanyID. At best it will eliminate the record because it's the wrong company, but somehow I'm under the impression that you already know the right CompanyID.
As for performance, I'd advice these indexes
CREATE INDEX idx_Locale ON TranslationTexts (Locale, Translation_id)
CREATE INDEX idx_CompanyID ON Outlets (CompanyID) INCLUDE (TitleID, OutletID)
Most likely you even can make that index on Local a UNIQUE index making it work even better.
In theory, why would inner join work remarkably faster then left outer join given the fact that both queries return same result set. I had a query which would take long time to describe, but this is what I saw changing single join: left outer join - 6 sec, inner join - 0 sec (the rest of the query is the same). Result set: the same
Actually depending on the data, left outer join and inner join would not return the same results..most likely left outer join will have more result and again depends on the data..
I'd be worried if I changed a left join to an inner join and the results were not different. I would suspect that you have a condition on the left side of the table in the where clause effectively (and probably incorrectly) turning it into an inner join.
Something like:
select *
from table1 t1
left join table2 t2 on t1.myid = t2.myid
where t2.somefield = 'something'
Which is not the same thing as
select *
from table1 t1
left join table2 t2
on t1.myid = t2.myid and t2.somefield = 'something'
So first I would be worried that my query was incorrect to begin with, then I would worry about performance. An inner join is NOT a performance enhancement for a Left Join, they mean two different things and should return different results unless you have a table where there will always be a match for every record. In this case you change to an inner join because the other is incorrect not to improve performance.
My best guess as to the reason the left join takes longer is that it is joining to many more rows that then get filtered out by the where clause. But that is just a wild guess. To know you need to look at the Execution plans.
Consider the example from MSDN documentation:
SELECT p.Name, pr.ProductReviewID
FROM Production.Product p
LEFT OUTER JOIN Production.ProductReview pr
ON p.ProductID = pr.ProductID
In this example, it is clear that the table on the left is "Production" and that is where all rows will be returned from, and then only those that match in ProductReview.
But now consider the following hypothetical query with 3 tables A,B,C
select * from A
inner Join B on A.field1 = B.field1
left outer join C on C.field2 = b.Field2
Which is the left table in this query (from which all records will be returned, regardless of a match to C)? Is it A or B? Or is it the result of the join from A & B?
My confusion arises from the following MSDN documentation, which states that "Outer joins can be specified in the FROM clause only" which would mean that the left table in my hypothetical query is A, but then I dont have an ON clause that specifies the join condition - in which case is my hypothetical query a bad one?
Since there is an INNER JOIN between A and B, only rows from B that match A will qualify for the LEFT JOIN to C.
I'm not 100% sure I understand you question, but assuming I am understanding it correctly:
Your "left" table in your hypothetical query is B, since your ON condition specifies the B.Field2.
The terms 'left" and "right" are not sufficiently specific in this context. Instead, you should use the terms "preserved" and "unpreserved". In that light, tables A and B are preserved and table C is unpreserved.
The reference in the MSDN documentation is meant to imply you cannot use joins (outer or otherwise) in the Select, Where, Group By, Having or Order By clauses outside of a subquery (where they are still in a From clause).
From your joins
A inner Join B on A.field1 = B
left outer join C on C.field2 = b.Field2
You need to have records from table A and B.
The left join only has data from table C field field2 matching the B table, but note that table A field2 does not have to match.
To see your data for table C run the following:
select c.*
from A inner Join B on A.field1 = B.field1 left outer join C on C.field2 = b.Field2
They use the term FROM clause in a general (broad) sense meaning the whole section of the query that starts from the keyword FROM and includes all the joins there are.
Here's a fuller context (note the previous sentence):
Inner joins can be specified in either the FROM or WHERE clauses. Outer joins can be specified in the FROM clause only.
See? They mean you cannot specify an outer join in the WHERE clause as is the case with inner joins. You can only do that in the FROM clause (that is, after however many other joins too). The result will be applied to the result of the previous joins.
Is there any argument, performance wise, to do filtering in the join, as opposed to the WHERE clause?
For example,
SELECT blah FROM TableA a
INNER JOIN TableB b
ON b.id = a.id
AND b.deleted = 0
WHERE a.field = 5
As opposed to
SELECT blah FROM TableA a
INNER JOIN TableB b
ON b.id = a.id
WHERE a.field = 5
AND b.deleted = 0
I personally prefer the latter, because I feel filtering should be done in the filtering section (WHERE), but is there any performance or other reasons to do either method?
If the query optimizer does its job, there is no difference at all (except clarity for others) in the two forms for inner joins.
That said, with left joins a condition in the join means to filter rows out of the second table before joining. A condition in the where means to filter rows out of the final result after joining. Those mean very different things.
With inner joins you will have the same results and probably the same performance. However, with outer joins the two queries would return different results and are not equivalent at all as putting the condition in the where clause will in essence change the query from a left join to an inner join (unless you are looking for the records where some field is null).
No there is no differences between these two, because in the logical processing of the query, WHERE will always go right after filter clause(ON), in your examples you will have:
Cartesian product (number of rows from TableA x number of rows from TableB)
Filter (ON)
Where.
Your examples are in ANSI SQL-92 standard, you could also write the query with ANSI SQL-89 standard like this:
SELECT blah FROM TableA a,TableB b
WHERE b.id = a.id AND b.deleted = 0 AND a.field = 5
THIS IS TRUE FOR INNER JOINS, WITH OUTER JOINS IS SIMILAR BUT NOT THE SAME
I am building a view in SQL Server 2000 (and 2005) and I've noticed that the order of the join statements greatly affects the execution plan and speed of the query.
select sr.WTSASessionRangeID,
-- bunch of other columns
from WTSAVW_UserSessionRange us
inner join WTSA_SessionRange sr on sr.WTSASessionRangeID = us.WTSASessionRangeID
left outer join WTSA_SessionRangeTutor srt on srt.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeClass src on src.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeStream srs on srs.WTSASessionRangeID = sr.WTSASessionRangeID
--left outer join MO_Stream ms on ms.MOStreamID = srs.MOStreamID
left outer join WTSA_SessionRangeEnrolmentPeriod srep on srep.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeStudent stsd on stsd.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionSubrange ssr on ssr.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
left outer join MO_Stream ms on ms.MOStreamID = srs.MOStreamID
On SQL Server 2000, the query above consistently generates a plan of cost 946. If I uncomment the MO_Stream join in the middle of the query and comment out the one at the bottom, the cost drops to 263. The execution speed drops accordingly. I always thought that the query optimizer would interpret the query appropriately without considering join order, but it seems that order matters.
So since order does seem to matter, is there a join strategy I should be following for writing faster queries?
(Incidentally, on SQL Server 2005, with almost identical data, the query plan costs were 0.675 and 0.631 respectively.)
Edit: On SQL Server 2000, here are the profiled stats:
946-cost query: 9094ms CPU, 5121 reads, 0 writes, 10123ms duration
263-cost query: 172ms CPU, 7477 reads, 0 writes, 170ms duration
Edit: Here is the logical structure of the tables.
SessionRange ---+--- SessionRangeTutor
|--- SessionRangeClass
|--- SessionRangeStream --- MO_Stream
|--- SessionRangeEnrolmentPeriod
|--- SessionRangeStudent
+----SessionSubrange --- SessionSubrangeRoom
Edit: Thanks to Alex and gbn for pointing me in the right direction. I also found this question.
Here's the new query:
select sr.WTSASessionRangeID // + lots of columns
from WTSAVW_UserSessionRange us
inner join WTSA_SessionRange sr on sr.WTSASessionRangeID = us.WTSASessionRangeID
left outer join WTSA_SessionRangeTutor srt on srt.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeClass src on src.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeEnrolmentPeriod srep on srep.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeStudent stsd on stsd.WTSASessionRangeID = sr.WTSASessionRangeID
// SessionRangeStream is a many-to-many mapping table between SessionRange and MO_Stream
left outer join (
WTSA_SessionRangeStream srs
inner join MO_Stream ms on ms.MOStreamID = srs.MOStreamID
) on srs.WTSASessionRangeID = sr.WTSASessionRangeID
// SessionRanges MAY have Subranges and Subranges MAY have Rooms
left outer join (
WTSA_SessionSubrange ssr
left outer join WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
) on ssr.WTSASessionRangeID = sr.WTSASessionRangeID
SQLServer2000 cost: 24.9
I have to disagree with all previous answers, and the reason is simple: if you change the order of your left join, your queries are logically different and as such they produce different result sets. See for yourself:
SELECT 1 AS a INTO #t1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4;
SELECT 1 AS b INTO #t2
UNION ALL SELECT 2;
SELECT 1 AS c INTO #t3
UNION ALL SELECT 3;
SELECT a, b, c
FROM #t1 LEFT JOIN #t2 ON #t1.a=#t2.b
LEFT JOIN #t3 ON #t2.b=#t3.c
ORDER BY a;
SELECT a, b, c
FROM #t1 LEFT JOIN #t3 ON #t1.a=#t3.c
LEFT JOIN #t2 ON #t3.c=#t2.b
ORDER BY a;
a b c
----------- ----------- -----------
1 1 1
2 2 NULL
3 NULL NULL
4 NULL NULL
(4 row(s) affected)
a b c
----------- ----------- -----------
1 1 1
2 NULL NULL
3 NULL 3
4 NULL NULL
The join order does make a difference to the resulting query. This is documented in BOL in the docs for FROM:
<joined_table>
Is a result set that is the product of two or more tables. For multiple joins, use parentheses to change the natural order of the joins.
You can alter the join order using parenthesis around the joins (BOL does show this in the syntax at the top of the docs, but it is easy to miss).
This is known as chiastic behaviour. You can also use the query hint OPTION (FORCE ORDER) to force a specific join order, but this can result in what are called "bushy plans" which may not be the most optimal for the query being executed.
Obviously, the SQL Server 2005 optimizer is a lot better than the SQL Server 2000 one.
However, there's a lot of truth in your question. Outer joins will cause execution to vary wildly based on order (inner joins tend to be optimized to the most efficient route, but again, order matters). If you think about it, as you build up left joins, you need to figure out what the heck is on the left. As such, each join must be calculated before every other join can be done. It becomes sequential, and not parallel. Now, obviously, there are things you can do to combat this (such as indexes, views, etc). But, the point stands: The table needs to know what's on the left before it can do a left outer join. And if you just keep adding joins, you're getting more and more abstraction to what, exactly is on the left (especially if you use joined tables as the left table!).
With inner joins, however, you can parallelize those quite a bit, so there's less of a dramatic difference as far as order's concerned.
A general strategy for optimizing queries containing JOINs is to look at your data model and the data and try to determine which JOINs will reduce number of records that must be considered the most quickly. The fewer records that must be considered, the faster the query will run. The server will generally produce a better query plan too.
Along with the above optimization make sure that any fields used in JOINs are indexed
You query is probably wrong anyway. Alex is correct. Eric may be correct too, but the query is wrong.
Lets' take this subset:
WTSA_SessionRange sr
left outer join
WTSA_SessionSubrange ssr on ssr.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join
WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
You are joining WTSA_SessionSubrangeRoom onto WTSA_SessionSubrange. You may have no rows from WTSA_SessionSubrange.
The join should be this:
WTSA_SessionRange sr
left outer join
(SELECT WTSASessionRangeID, columns I need
FROM
WTSA_SessionSubrange ssr
left outer join
WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
) foo on foo.WTSASessionRangeID = sr.WTSASessionRangeID
This is why the join order is affecting results because it's a different query, declaratively speaking.
You'd also need to change the MO_Stream and WTSA_SessionRangeStream join too.
it depends on which of the join fields are indexed - if it has to table scan the first field, but use an index on the second, it's slow. If your first join field is an index, it'll be quicker. My guess is that 2005 optimizes it better by determining the indexed fields and performing those first
At DevConnections a few years ago a session on SQL Server performance stated that (a) order of outer joins DOES matter, and (b) when a query has a lot of joins, it will not look at all of them before making a determination on a plan. If you know you have joins that will help speed up a query, they should be early on in the FROM list (if you can).