Join queries with same execution plan - sql-server

I have two queries for the same task
ONE:
select * from table1 t1
INNER JOIN table2 t2
ON t1.id=t2.id
TWO
select * from table1 t1
INNER JOIN (select * from table2) t2
ON t1.id=t2.id
I checked the execution plans for both queries, and they are the same. Still, is there any difference between the two queries? If so, which one is more efficient?

You haven't mentioned which DBMS. SQL is declarative: you tell Oracle (or any other RDBMS) what you want, and the execution plan is what ultimately decides how the query will be executed. So if the plans of both queries are the same, you can rest assured there will be no difference in performance. As far as the RDBMS is concerned, both queries are executed identically.
Even though both queries give the same result, the first one is the preferred way to write it. Taken literally, the second form asks the RDBMS to do a full scan of table2 before joining, but Oracle's CBO is usually smart enough to rewrite the second query into the first. This is something you need to be aware of: some RDBMSs have powerful optimizers that rewrite your query before even deriving the plan if doing so reduces the execution cost of the query.
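One way to confirm on SQL Server that both forms compile to the same plan is to ask for the plan text without executing the queries. This is only a minimal sketch: it reuses the table1 and table2 names from the question, and the GO batch separators assume a client such as SSMS or sqlcmd.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO
-- While SHOWPLAN_TEXT is ON, the queries below are compiled but not executed;
-- SQL Server returns the estimated plan as text instead of result rows.
SELECT *
FROM table1 AS t1
INNER JOIN table2 AS t2
    ON t1.id = t2.id;

SELECT *
FROM table1 AS t1
INNER JOIN (SELECT * FROM table2) AS t2
    ON t1.id = t2.id;
GO
SET SHOWPLAN_TEXT OFF;
GO
```

If the two plan texts match operator for operator, the derived table was flattened away and the queries should cost the same.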

Related

TSQL Join, Query Processing order and storage

Table structure:
CREATE TABLE dbo.Transactions
(
actid INT NOT NULL, --Account ID
tranid INT NOT NULL, -- Transaction ID
val MONEY NOT NULL, --- Transaction value
CONSTRAINT PK_Transactions PRIMARY KEY(actid, tranid)
);
The following inefficient query tries to determine the running balance after each transaction
SELECT
T1.actid, T1.tranid, T1.val,
SUM(T2.val) AS balance
FROM
dbo.Transactions AS T1
JOIN
dbo.Transactions AS T2 ON T2.actid = T1.actid
AND T2.tranid <= T1.tranid
GROUP BY
T1.actid, T1.tranid, T1.val;
I am not sure how the join is processed in this query. Is the join treated as a subquery that is executed once for each group (T1.actid, T1.tranid, T1.val)? Does that mean that if there are 10K transactions, this query creates 10K joined data sets?
Execute your query in SSMS. Then highlight it and press Ctrl + L to view the estimated execution plan. This shows how SQL Server plans to execute the query and sometimes suggests indexes, etc.
Before grouping, the result will have exactly as many rows as satisfy the join condition.
Each row in T1 is processed and brings in the rows from T2 that satisfy the join conditions.
The join can be processed as a loop, hash, or merge join. Typically the optimizer will use a hash join.
The best thing to do is just run it. The output should tell a story.
The ONLY way to know is by 'studying' the query plan.
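To make the "rows the join satisfies" point concrete, here is a small sketch that shows the joined rows before GROUP BY collapses them. The table variable and its three sample rows are hypothetical; only the column names come from the question.

```sql
-- Hypothetical sample data: one account with three transactions.
DECLARE @Transactions TABLE
(
    actid  INT   NOT NULL,
    tranid INT   NOT NULL,
    val    MONEY NOT NULL,
    PRIMARY KEY (actid, tranid)
);

INSERT INTO @Transactions (actid, tranid, val)
VALUES (1, 1, 10.00), (1, 2, 5.00), (1, 3, -2.00);

-- The join without the GROUP BY: each T1 row pairs with every T2 row of the
-- same account whose tranid is less than or equal to its own.
SELECT T1.actid,
       T1.tranid,
       T2.tranid AS joined_tranid,
       T2.val    AS joined_val
FROM @Transactions AS T1
JOIN @Transactions AS T2
    ON T2.actid = T1.actid
   AND T2.tranid <= T1.tranid
ORDER BY T1.tranid, T2.tranid;
-- Returns 6 rows, as (tranid, joined_tranid):
-- (1,1), (2,1), (2,2), (3,1), (3,2), (3,3)
```

Logically this is one join producing one row set, not one data set per group. An account with n transactions contributes n(n+1)/2 joined rows, which is why the query gets expensive as accounts grow.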
FYI: it seems to me your query is equivalent to
SELECT
T1.actid, T1.tranid, T1.val,
balance = (SELECT SUM(T2.val)
FROM dbo.Transactions AS T2
WHERE T2.actid = T1.actid
AND T2.tranid <= T1.tranid)
FROM
dbo.Transactions AS T1
To be honest, I prefer 'this' version because it looks more readable to me; I'm also expecting this version to be slightly 'leaner' as there is less need for sorting, but only actual testing will tell. It's sometimes surprising to see what the optimizer does behind the scenes! Again, the query plan will show.
Therefore, run both queries and compare the resulting query plans; those should give you an idea of their relative cost. Keep in mind that "cost" isn't always directly correlated to "time", so you might also want to check which one runs faster on your hardware and under 'typical load'. Also keep in mind that caching, for example, may have an effect here!
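A third variant worth adding to that comparison is a windowed aggregate. This is not what either version above does; it is a different technique, and it assumes SQL Server 2012 or later, where SUM ... OVER accepts ORDER BY and a frame.

```sql
-- Running balance per account using a window aggregate instead of a self-join.
-- (actid, tranid) is the primary key, so the ordering within each account is
-- unique and the result should match the join-based query.
SELECT
    actid,
    tranid,
    val,
    SUM(val) OVER (PARTITION BY actid
                   ORDER BY tranid
                   ROWS UNBOUNDED PRECEDING) AS balance
FROM dbo.Transactions;
```

As with the other two forms, only the actual plans and timings on your data will tell whether it is faster.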

Force joined view not to be optimized

I have a somewhat complex view which includes a join to another view. For some reason the generated query plan is highly inefficient, and the query runs for many hours. However, if I select the sub-view into a temporary table first and then join with that, the same query finishes in a few minutes.
My question is: Is there some kind of query hint or other trick which will force the optimizer to execute the joined sub-view in isolation before performing the join, just as when using a temp table? Clearly the default strategy chosen by the optimizer is not optimal.
I cannot use the temporary-table trick since views do not allow temporary tables. I understand I could probably rewrite everything as a stored procedure, but that would break the composability of views, and it also seems bad for maintenance to rewrite everything just to trick the optimizer into not using a bad optimization.
Adam Machanic explained one such way at a SQL Saturday I recently attended. The presentation was called Clash of the Row Goals. The method involves using a TOP X at the beginning of the sub-select. He explained that when doing a TOP X, the query optimizer assumes it is more efficient to grab the TOP X rows one at a time. As long as you set X as a sufficiently large number (limit of INT or BIGINT?), the query will always get the correct results.
So one example that Adam provided:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
becomes:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT TOP(2147483647)
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
It is a super cool trick and very useful.
When things get messy, the query optimizer often resorts to loop joins.
If materializing to a temp table fixed it, then most likely that is the problem.
The optimizer often does not deal with views very well.
I would rewrite your view so that it does not use other views.
Join Hints (Transact-SQL)
You may be able to use these hints on views.
Try merge and hash.
Try changing the order of the joins.
Move conditions into the join whenever possible.
select *
from table1
join table2
on table1.FK = table2.[Key]
where table2.[desc] = 'cat1'
should be
select *
from table1
join table2
on table1.FK = table2.[Key]
and table2.[desc] = 'cat1'
For a query this simple the optimizer will get it right either way, but as the query gets more complex, the query optimizer goes into what I call stupid mode and falls back to loop joins. That is also done to protect the server and keep as little in memory as possible.
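The "try merge and hash" suggestion can be sketched in two ways, using the placeholder tables from the example above. Key and desc are bracketed because both are reserved words in T-SQL. As far as I know, the OPTION clause is not allowed inside a view definition, so the join-level form is the one to try in a view.

```sql
-- Join-level hint: ask for a hash join between these two tables.
-- Note that a join hint in the FROM clause also makes the optimizer keep
-- the join order as written for the whole query.
SELECT *
FROM table1
INNER HASH JOIN table2
    ON table1.FK = table2.[Key]
WHERE table2.[desc] = 'cat1';

-- Query-level hint: allow only merge or hash joins for every join in the
-- statement; the optimizer picks the cheaper of the allowed strategies.
SELECT *
FROM table1
JOIN table2
    ON table1.FK = table2.[Key]
WHERE table2.[desc] = 'cat1'
OPTION (MERGE JOIN, HASH JOIN);
```

Treat hints as an experiment: compare the plan and timings with and without them before leaving one in place.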

Why is this CTE so much slower than using temp tables?

We have had an issue since a recent update on our database (I made this update, I am guilty here): one of the queries we use has been much slower since then. I tried to modify the query to get faster results and managed to achieve my goal with temp tables, which is not bad, but I fail to understand why this solution performs better than a CTE-based one that runs the same queries. Maybe it has to do with some tables being in a different DB?
Here's the query that performs badly (22 minutes on our hardware) :
WITH CTE_Patterns AS (
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH(NOLOCK) ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
),
CTE_Emails AS (
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH(NOLOCK) ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
)
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM CTE_Patterns AS BL WITH(NOLOCK)
INNER JOIN CTE_Emails AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
When running both CTE queries separately, it's super fast (0 sec in SSMS, returning 122 rows and 13k rows). When running the full query, with the INNER JOIN on sEmail, it's super slow (22 minutes).
Here's the query that performs well, with temp tables (0 sec on our hardware). It does the exact same thing and returns the same result:
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
INTO #tb1
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email PELE ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
INTO #tb2
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM #tb1 AS BL WITH(NOLOCK)
INNER JOIN #tb2 AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
DROP TABLE #tb1
DROP TABLE #tb2
Tables stats :
OtherDb.dbo.Purchased_Email_List : 13 rows, 2 rows flagged bPattern = 1
OtherDb.dbo.Purchased_Email_List_Email : 324289 rows, 122 rows with patterns (which are used in this issue)
dbo.NewsletterService_import_list_email : 15.5M rows
dbo.NewsletterService_import_list_email_distinct ~1.5M rows
WHERE ILE.iId_newsletterservice_import_list = 1000 retrieves ~ 13k rows
I can post more info about tables on request.
Can someone help me understand this ?
UPDATE
Here is the query plan for the CTE query :
Here is the query plan with temp tables :
As you can see in the query plan, with CTEs the engine reserves the right to apply them basically as a lookup, even when you want a join.
If it isn't sure enough that it can run the whole thing independently, in advance, essentially generating a temp table, it may just run it once for each row.
This is perfect for the recursive queries CTEs can do like magic.
But you're seeing, in the nested Nested Loops operators, where it can go terribly wrong.
You're already finding the answer on your own by trying the real temp table.
Parallelism. If you look at your temp table plan, the 3rd query shows parallelism in both distributing and gathering the work of the 1st query, and parallelism when combining the results of the 1st and 2nd queries. The 1st query also has a relative cost of 77%. So in your temp table example the query engine was able to determine that the 1st query can benefit from parallelism. That matters especially because the parallelism operators are Gather Streams and Distribute Streams: the data is distributed in a way that allows the work (the join) to be divvied up and then recombined. Notice that the cost of the 2nd query is 0%, so you can ignore it as having no cost other than when its result needs to be combined.
Looking at the CTE plan, it is processed entirely serially, not in parallel. So somehow with the CTE the optimizer could not figure out that the 1st query can be run in parallel, nor the relationship between the 1st and 2nd queries. It's possible that with multiple CTE expressions it assumes some dependency and does not look ahead far enough.
Another test you can do with the CTE is to keep CTE_Patterns but eliminate CTE_Emails by moving it into the 3rd query as a derived table (subquery). It would be interesting to see that execution plan and whether there is parallelism when the query is expressed that way.
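The test described above might look like the following sketch. It reuses the object names from the question unchanged and only moves the body of CTE_Emails into a derived table; whether the plan changes is exactly what the experiment is meant to find out.

```sql
WITH CTE_Patterns AS (
    SELECT
        PEL.iId_purchased_email_list,
        PELE.sEmail
    FROM OtherDb.dbo.Purchased_Email_List AS PEL WITH (NOLOCK)
    INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH (NOLOCK)
        ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
    WHERE PEL.bPattern = 1
)
SELECT I.iId_newsletterservice_import_list,
       I.iId_newsletterservice_import_list_email,
       BL.iId_purchased_email_list
FROM CTE_Patterns AS BL
INNER JOIN (
    -- Former CTE_Emails, now written inline as a derived table.
    SELECT
        ILE.iId_newsletterservice_import_list,
        ILE.iId_newsletterservice_import_list_email,
        ILED.sEmail
    FROM dbo.NewsletterService_import_list_email AS ILE WITH (NOLOCK)
    INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH (NOLOCK)
        ON ILED.iId_newsletterservice_import_list_email_distinct
         = ILE.iId_newsletterservice_import_list_email_distinct
    WHERE ILE.iId_newsletterservice_import_list = 1000
) AS I
    ON I.sEmail LIKE BL.sEmail;
```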
In my experience it's best to use CTE's for recursion and temp tables when you need to join back to the data. Makes for a much faster query typically.

Inner join vs select statements on multiple tables

The two queries below perform the same operation, but which would be faster and preferable?
NUM is the primary key on table1 & table2...
select *
from table1 tb1,
table2 tb2
where tb1.num = tb2.num
select *
from table1 tb1
inner join
table2 tb2
on tb1.num = tb2.num
They are the same query. The first is an older alternate syntax, but they both mean do an inner join.
You should avoid using the older syntax. It's not just readability, but as you build more complex queries, there are things that you simply can't do with the old syntax. Additionally, the old syntax is going through a slow process of being phased out, with the equivalent outer join syntax marked as deprecated in most products, and iirc dropped already in at least one.
The 2 SQL statements are equivalent. You can look at the execution plan to confirm. As a rule, given 2 SQL statements which affect/return the same rows in the same way, the server is free to execute them the same way.
They're equivalent queries - both are inner joins, but the first uses an older, implicit join syntax. Your database should execute them in exactly the same way.
If you're unsure, you could always use the SQL Management Studio to view and compare the execution plans of both queries.
The first example is what I have seen referred to as an Oracle Join. As mentioned already there appears to be little performance difference. I prefer the second example from a readability standpoint because it separates join conditions from filter conditions.
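One practical reason behind the readability argument, sketched with the tables from the question (the filter on num is a hypothetical addition for illustration):

```sql
-- Old comma syntax: if the WHERE clause is forgotten, the statement is still
-- valid SQL, but it silently returns every combination of rows (a cross join).
SELECT *
FROM table1 tb1,
     table2 tb2;

-- Explicit syntax: the join condition lives in ON and filters live in WHERE.
-- Leaving out the ON clause of an INNER JOIN is a syntax error, so the
-- accidental cross join above cannot happen unnoticed.
SELECT *
FROM table1 tb1
INNER JOIN table2 tb2
    ON tb1.num = tb2.num
WHERE tb1.num > 100;  -- hypothetical filter, kept apart from the join condition
```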

Does the order of JOINs make a difference?

Say I have a query like the one below:
SELECT t1.id, t1.Name
FROM Table1 as t1 --800,000 records
INNER JOIN Table2 as t2 --500,000 records
ON t1.fkID = t2.id
INNER JOIN Table3 as t3 -- 1,000 records
ON t1.OtherId = t3.id
Would I see a performance improvement if I changed the order of my joins on Table2 and Table3? See below:
SELECT t1.id, t1.Name
FROM Table1 as t1 --800,000 records
INNER JOIN Table3 as t3 -- 1,000 records
ON t1.OtherId = t3.id
INNER JOIN Table2 as t2 --500,000 records
ON t1.fkID = t2.id
I've heard that the Query Optimizer will try to determine the best order but that this doesn't always work. Does the version of SQL Server you are using make a difference?
The order of joins makes no difference.
What does make a difference is ensuring your statistics are up to date.
One way to check your statistics is to run a query in SSMS and include the actual execution plan. If the estimated number of rows is very different from the actual number of rows in any part of the execution plan, your statistics are probably out of date.
Statistics are rebuilt when the related indexes are rebuilt. If your production maintenance window allows, I would update statistics every night.
This will update statistics for all tables in a database:
exec sp_MSforeachtable 'UPDATE STATISTICS ?'
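To check how stale the statistics on one table are before refreshing everything, something like the following may help. It is a sketch that assumes Table1 from the question lives in the dbo schema.

```sql
-- List each statistics object on the table with the date it was last updated.
SELECT
    OBJECT_NAME(s.object_id)            AS table_name,
    s.name                              AS stats_name,
    STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Table1');

-- Refresh the statistics for just this table, reading every row
-- rather than a sample.
UPDATE STATISTICS dbo.Table1 WITH FULLSCAN;
```

FULLSCAN is the most accurate option but also the slowest on large tables, so a sampled update may be a better fit for a tight maintenance window.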
The order of joins makes a difference only if you specify OPTION (FORCE ORDER). Otherwise, the optimizer will rearrange your query in whichever way it deems most efficient.
There actually are certain instances where I find that I need to use FORCE ORDER, but of course they are few and far between. If you aren't sure, just SET STATISTICS [TIME|IO] ON and see for yourself. You'll probably find that your version runs slower than the optimized version in most if not all cases.
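A sketch of that experiment with the query from the question: the first statement leaves the optimizer free to reorder the joins, the second pins the order as written.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Optimizer is free to choose the join order.
SELECT t1.id, t1.Name
FROM Table1 AS t1
INNER JOIN Table2 AS t2
    ON t1.fkID = t2.id
INNER JOIN Table3 AS t3
    ON t1.OtherId = t3.id;

-- Join order is fixed to the order in which the tables are written.
SELECT t1.id, t1.Name
FROM Table1 AS t1
INNER JOIN Table2 AS t2
    ON t1.fkID = t2.id
INNER JOIN Table3 AS t3
    ON t1.OtherId = t3.id
OPTION (FORCE ORDER);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Compare the logical reads and elapsed times reported on the Messages tab for the two statements, and run each more than once so that caching does not skew the comparison.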
The Query Optimizer should easily handle these as exactly the same query, and work out the best way of doing it.
A lot of it is more about the statistics than the number of records. For example, if the vast majority of values in t1.fkID are identical, this information can influence the QO a lot.

Resources