Inner join vs select statements on multiple tables - sql-server

THe below 2 queries performs the same operation, but wondering which would be the fastest and most preferable?
NUM is the primary key on table1 & table2...
select *
from table1 tb1,
table2 tb2
where tb1.num = tb2.num
select *
from table1 tb1
inner join
table2 tb2
on tb1.num = tb2.num

They are the same query. The first is an older alternate syntax, but they both mean do an inner join.
You should avoid using the older syntax. It's not just readability, but as you build more complex queries, there are things that you simply can't do with the old syntax. Additionally, the old syntax is going through a slow process of being phased out, with the equivalent outer join syntax marked as deprecated in most products, and iirc dropped already in at least one.

The 2 SQL statements are equivalent. You can look at the execution plan to confirm. As a rule, given 2 SQL statements which affect/return the same rows in the same way, the server is free to execute them the same way.

They're equivalent queries - both are inner joins, but the first uses an older, implicit join syntax. Your database should execute them in exactly the same way.
If you're unsure, you could always use the SQL Management Studio to view and compare the execution plans of both queries.

The first example is what I have seen referred to as an Oracle Join. As mentioned already there appears to be little performance difference. I prefer the second example from a readability standpoint because it separates join conditions from filter conditions.

Related

Database Join Performance Comparison [duplicate]

This question already has answers here:
Explicit vs implicit SQL joins
(12 answers)
Closed 7 years ago.
I am on a project where much of the queries are performed by including multiple tables in the FROM clause. I know this is legal, but I have always used explicit JOINs instead.
For example, two tables (using SQL Server DDL)
CREATE TABLE Manufacturers(
ManufacturerID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
Name varchar(100))
CREATE TABLE Cars (
ModelID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
ManufacturerID INT CONSTRAINT FOREIGN KEY FK_Manufacturer REFERENCES Manufacturers(ManufacturerID),
ModelName VARCHAR(100))
If I want to find the models for GM, I could do either:
SELECT ModelName FROM Cars c, Manufacturers m WHERE c.ManufacturerID=m.ManufacturerID AND m.Name='General Motors'
or
SELECT ModelName FROM Cars c INNER JOIN Manufacturers m ON c.ManufacturerID=m.ManufacturerID WHERE m.Name='General Motors'
My question is this: does one form perform better than the other? Aside from how the tables are defined in Oracle vs SQL Server, does one form of join work better than the other in Oracle or SQL Server? What if you include more tables, say 3 or 4? Does that change the performance characteristics, assuming the queries are constructed to return an equivalent record set?
My question is this: does one form perform better than the other?
They should not. You can check your execution plans to be certain, but every RDBMS I've worked with generates the same plans for comma (ANSI-89) joins as they do for ANSI-92 explicit joins. (Note that comma joins didn't stop being ANSI SQL, it's just that ANSI-92 is where the explicit syntax first appeared.)
Aside from how the tables are defined in Oracle vs SQL Server, does one form of join work better than the other in Oracle or SQL Server?
As far as the server is concerned, no.
What if you include more tables, say 3 or 4? Does that change the performance characteristics, assuming the queries are constructed to return an equivalent record set?
It's possible. With comma joins, I'm not sure it's possible to control the JOIN order with parentheses like you can with explicit joins:
SELECT *
FROM Table1 t1
INNER JOIN (
Table2 t2
INNER JOIN Table3 t3
t2.id = t3.id)
ON t1.id = t2.id
This can affect overall query performance (for better or worse). I'm not sure how to accomplish the same level of control with comma joins, but I've never fully learned comma join syntax. I don't know if you can say Table1 t1, (Table2 t2, Table3 t3), but I don't believe you can. I think you'd have to use subqueries to do that.
The primary advantages of explicit joins are:
Easier to read. It makes it very clear which conditions are used with which join. You won't ever see Table1 t1, Table2 t2, Table3 t3 and then have to dig into the WHERE clause to figure out if one of those joins is an outer join. It also means the WHERE clause isn't stuffed full of all these join conditions you know you don't care about changing when you modify a query.
Easier to use outer joins. In the case of outer joins, you can even specify literal filter values in the outer table without having to handle nulls from the outer join.
Easier to reuse existing joins. If you just want to query from the same relations, you can just grab the FROM clause. You don't have to worry about what bits from the WHERE clause you want and what bits you don't.
Identical syntax across RDBMSs. When you spend all day switching between Oracle and SQL Server, the last thing you want to worry about is confusing + and *= to get your outer joins right.
All of the above make the explicit join syntax more maintainable, which is a very important factor for software.

What is the 'old style' syntax for joins in T-Sql?

I'm rewriting a bunch of old, badly written Oracle queries against a new(-er) Sql Server 2008 environment. They use old-school Oracle join syntax like
select <whatever>
from Table1, Table2, Table3
where Table1.T1ID = Table2.T2ID -- old Oracle inner join
and Table2.T3ID = Table3.T3ID (+) -- old Oracle left join (I think)
Except a lot more complicated. There's a lot of mixed joins and a lot of nesting and a lot of views piled on views going on in these things. It's not pretty. The data is disparate between the two servers too, making testing a chore.
I figured the easiest way to replicate would be to make the queries look as similar as possible in Sql Server (ie, using the same style of join), and then do a massive clean-up job after once I'm confident they're both definitely doing the same thing & I don't have a join in the wrong place somewhere (and yes, I have compatibility mode temporarily set to support old joins).
I know the 'old' syntax for an inner join in T-Sql is
select <whatever>
from T1, T2
where T1.ID = T2.ID
but what is the 'old' syntax for a left outer join or a right outer join?
From the documentation on TechNet (on SQL Server 2000, so be aware this might not be supported any more!), you need to use *= instead of the (+) as Oracle does:
select <whatever>
from T1, T2
where T1.ID *= T2.ID

Force joined view not to be optimized

I have a somewhat complex view which includes a join to another view. For some reason the generated query plan is highly inefficient. The query runs for many hours. However if I select the sub-view into a temporary table first and then join with this, the same query finished in a few minutes.
My question is: Is there some kind of query hint or other trick which will force the optimizer to execute the joined sub-view in isolation before performing the join, just as when using a temp table? Clearly the default strategy chosen by the optimizer is not optimal.
I cannot use the temporary table-trick since views does not allow temporary tables. I understand I could probably rewrite everything to a stored procedure, but that would break composeability of views, and it seems also like bad for maintenance to rewrite everything just to trick the optimizer to not use a bad optimization.
Adam Machanic explained one such way at a SQL Saturday I recently attended. The presentation was called Clash of the Row Goals. The method involves using a TOP X at the beginning of the sub-select. He explained that when doing a TOP X, the query optimizer assumes it is more efficient to grab the TOP X rows one at a time. As long as you set X as a sufficiently large number (limit of INT or BIGINT?), the query will always get the correct results.
So one example that Adam provided:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
becomes:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT TOP(2147483647)
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
It is a super cool trick and very useful.
When things get messy the query optimize often resorts to loop joins
If materializing to a temp fixed it then most likely that is the problem
The optimizer often does not deal with views very well
I would rewrite you view to not uses views
Join Hints (Transact-SQL)
You may be able to use these hints on views
Try merge and hash
Try changing the order of join
Move condition into the join whenever possible
select *
from table1
join table2
on table1.FK = table2.Key
where table2.desc = 'cat1'
should be
select *
from table1
join table2
on table1.FK = table2.Key
and table2.desc = 'cat1'
Now the query optimizer will get that correct but as the query gets more complex the query optimize goes into what I call stupid mode and loop joins. But that is also done to protect the server and have as little in memory as possible.

WHERE clause better execute before IN and JOIN or after

I read this article:
Logical Processing Order of the SELECT statement
in end of article has been write ON and JOIN clause consider before WHERE.
Consider we have a master table that has 10 million records and a detail table (that has reference to master table(FK)) with 50 million record. We have a query that want just 100 record of detail table according a PK in master table.
In this situation ON and JOIN execute before WHERE?I mean that do we have 500 million record after JOIN and then WHERE apply to it?or first WHERE apply and then JOIN and ON Consider? If second answer is true do it has incoherence with top article?
thanks
In the case of an INNER JOIN or a table on the left in a LEFT JOIN, in many cases, the optimizer will find that it is better to perform any filtering first (highest selectivity) before actually performing whatever type of physical join - so there are obviously physical order of operations which are better.
To some extent you can sometimes control this (or interfere with this) with your SQL, for instance, with aggregates in subqueries.
The logical order of processing the constraints in the query can only be transformed according to known invariant transformations.
So:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is still logically equivalent to:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and they will generally have the same execution plan.
On the other hand:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is NOT equivalent to:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and so the optimizer isn't going to transform them into the same execution plan.
The optimizer is very smart and is able to move things around pretty successfully, including collapsing views and inline table-valued functions as well as even pushing things down through certain kinds of aggregates fairly successfully.
Typically, when you write SQL, it needs to be understandable, maintainable and correct. As far as efficiency in execution, if the optimizer is having difficulty turning the declarative SQL into an execution plan with acceptable performance, the code can sometimes be simplified or appropriate indexes or hints added or broken down into steps which should perform more quickly - all in successive orders of invasiveness.
It doesn't matter
Logical processing order is always honoured: regardless of actual processing order
INNER JOINs and WHERE conditions are effectively associative and commutative (hence the ANSI-89 "join in the where" syntax) so actual order doesn't matter
Logical order becomes important with outer joins and more complex queries: applying WHERE on an OUTER table changes the logic completely.
Again, it doesn't matter how the optimiser does it internally so long as the query semantics are maintained by following logical processing order.
And the key word here is "optimiser": it does exactly what it says
Just re-reading Paul White's excellent series on the Query Optimiser and remembered this question.
It is possible to use an undocumented command to disable specific transformation rules and get some insight into the transformations applied.
For (hopefully!) obvious reasons only try this on a development instance and remember to re-enable them and remove any suboptimal plans from the cache.
USE AdventureWorks2008;
/*Disable the rules*/
DBCC RULEOFF ('SELonJN');
DBCC RULEOFF ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
You can see with those two rules disabled it does a cartesian join and filter after.
/*Re-enable them*/
DBCC RULEON ('SELonJN');
DBCC RULEON ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
With them enabled the predicate is pushed right down into the index seek and so reduces the number of rows processed by the join operation.
There is no defined order. The SQL engine determines what order to perform the operations based on the execution strategy chosen by its optimizer.
I think you have misread ON as IN in the article.
However, the order it is showing in the article is correct (obviously it is msdn anyway). The ON and JOIN are executed before WHERE naturally because WHERE has to be applied as a filter on the temporary resultset obtained due to JOINS
The article just says it is logical order of execution and also at end of the paragraph it adds this line too ;)
"Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list."

SQL Server CTE referred in self joins slow

I have written a table-valued UDF that starts by a CTE to return a subset of the rows from a large table.
There are several joins in the CTE. A couple of inner and one left join to other tables, which don't contain a lot of rows.
The CTE has a where clause that returns the rows within a date range, in order to return only the rows needed.
I'm then referencing this CTE in 4 self left joins, in order to build subtotals using different criterias.
The query is quite complex but here is a simplified pseudo-version of it
WITH DataCTE as
(
SELECT [columns] FROM table
INNER JOIN table2
ON [...]
INNER JOIN table3
ON [...]
LEFT JOIN table3
ON [...]
)
SELECT [aggregates_columns of each subset] FROM DataCTE Main
LEFT JOIN DataCTE BananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality = 100
LEFT JOIN DataCTE DamagedBananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality < 20
LEFT JOIN DataCTE MangosSubset
ON [...]
GROUP BY [
I have the feeling that SQL Server gets confused and calls the CTE for each self join, which seems confirmed by looking at the execution plan, although I confess not being an expert at reading those.
I would have assumed SQL Server to be smart enough to only perform the data retrieval from the CTE only once, rather than do it several times.
I have tried the same approach but rather than using a CTE to get the subset of the data, I used the same select query as in the CTE, but made it output to a temp table instead.
The version referring the CTE version takes 40 seconds. The version referring the temp table takes between 1 and 2 seconds.
Why isn't SQL Server smart enough to keep the CTE results in memory?
I like CTEs, especially in this case as my UDF is a table-valued one, so it allowed me to keep everything in a single statement.
To use a temp table, I would need to write a multi-statement table valued UDF, which I find a slightly less elegant solution.
Did some of you had this kind of performance issues with CTE, and if so, how did you get them sorted?
Thanks,
Kharlos
I believe that CTE results are retrieved every time. With a temp table the results are stored until it is dropped. This would seem to explain the performance gains you saw when you switched to a temp table.
Another benefit is that you can create indexes on a temporary table which you can't do to a cte. Not sure if there would be a benefit in your situation but it's good to know.
Related reading:
Which are more performant, CTE or temporary tables?
SQL 2005 CTE vs TEMP table Performance when used in joins of other tables
http://msdn.microsoft.com/en-us/magazine/cc163346.aspx#S3
Quote from the last link:
The CTE's underlying query will be
called each time it is referenced in
the immediately following query.
I'd say go with the temp table. Unfortunately elegant isn't always the best solution.
UPDATE:
Hmmm that makes things more difficult. It's hard for me to say with out looking at your whole environment.
Some thoughts:
can you use a stored procedure instead of a UDF (instead, not from within)?
This may not be possible but if you can remove the left join from you CTE you could move that into an indexed view. If you are able to do this you may see performance gains over even the temp table.

Resources