Does the order of JOINs make a difference? - sql-server

Say I have a query like the one below:
SELECT t1.id, t1.Name
FROM Table1 as t1 --800,000 records
INNER JOIN Table2 as t2 --500,000 records
ON t1.fkID = t2.id
INNER JOIN Table3 as t3 -- 1,000 records
ON t1.OtherId = t3.id
Would I see a performance improvement if I changed the order of my joins on Table2 and Table3? See below:
SELECT t1.id, t1.Name
FROM Table1 as t1 --800,000 records
INNER JOIN Table3 as t3 -- 1,000 records
ON t1.OtherId = t3.id
INNER JOIN Table2 as t2 --500,000 records
ON t1.fkID = t2.id
I've heard that the Query Optimizer will try to determine the best order, but that it doesn't always succeed. Does the version of SQL Server you are using make a difference?

The order of joins makes no difference.
What does make a difference is ensuring your statistics are up to date.
One way to check your statistics is to run a query in SSMS and include the Actual execution plan. If the Estimated number of rows is very different from the Actual number of rows for any part of the execution plan, then your statistics are out of date.
Statistics are rebuilt when the related indexes are rebuilt. If your production maintenance window allows, I would update statistics every night.
This will update statistics for all tables in a database:
exec sp_MSforeachtable "UPDATE STATISTICS ?"
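If you would rather check how stale the statistics are before rebuilding anything, something like the following should work; dbo.Table1 is just a placeholder name, substitute your own table:
-- Show when each statistics object on a table was last updated
-- (dbo.Table1 is a placeholder; substitute the real table name).
SELECT s.name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.Table1');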

The order of joins makes a difference only if you specify OPTION (FORCE ORDER). Otherwise, the optimizer will rearrange your query in whichever way it deems most efficient.
There actually are certain instances where I find that I need to use FORCE ORDER, but of course they are few and far between. If you aren't sure, just SET STATISTICS [TIME|IO] ON and see for yourself. You'll probably find that your version runs slower than the optimized version in most if not all cases.
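For reference, here is a minimal sketch of both ideas applied to the query from the question; nothing here is required, it's only for comparing the two variants yourself:
-- Show I/O and CPU/elapsed time for whichever variant you run.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Force the optimizer to join in the written order (rarely a good idea).
SELECT t1.id, t1.Name
FROM Table1 AS t1
INNER JOIN Table3 AS t3 ON t1.OtherId = t3.id
INNER JOIN Table2 AS t2 ON t1.fkID = t2.id
OPTION (FORCE ORDER);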

The Query Optimizer should easily handle these as exactly the same query, and work out the best way of doing it.
A lot of it is more about the statistics than the number of records. For example, if the vast majority of values in t1.fkID are identical, this information can influence the QO a lot.
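If you want to see what the optimizer actually knows about that column's distribution, you can inspect the statistics histogram; the statistics name below is hypothetical:
-- View the histogram the optimizer uses for t1.fkID
-- ('IX_Table1_fkID' is a placeholder statistics/index name).
DBCC SHOW_STATISTICS ('dbo.Table1', IX_Table1_fkID);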

TSQL Join, Query Processing order and storage

Table structure:
CREATE TABLE dbo.Transactions
(
    actid  INT   NOT NULL, -- Account ID
    tranid INT   NOT NULL, -- Transaction ID
    val    MONEY NOT NULL, -- Transaction value
    CONSTRAINT PK_Transactions PRIMARY KEY (actid, tranid)
);
The following inefficient query tries to determine the running balance after each transaction:
SELECT
    T1.actid, T1.tranid, T1.val,
    SUM(T2.val) AS balance
FROM
    dbo.Transactions AS T1
    JOIN dbo.Transactions AS T2
        ON T2.actid = T1.actid
       AND T2.tranid <= T1.tranid
GROUP BY
    T1.actid, T1.tranid, T1.val;
I am not sure how the join is processed in this query. Is the join treated like a subquery, where the join is executed for each group (T1.actid, T1.tranid, T1.val)? Does that mean that if there are 10K transactions, 10K joined data sets are created by this query?
Execute your query in SSMS. Then highlight it and press Ctrl + L to view the Execution Plan. This will show you how SQL Server plans to execute the query and sometimes suggests indexes, etc.
It means you will get exactly the number of rows that satisfy the join.
Each row in T1 is processed and brings in the rows from T2 that satisfy the join conditions.
The join can be processed as a nested loops, hash, or merge join. Typically the optimizer will use hash.
The best thing to do is just run it. The output should tell a story.
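For example, one bare-bones way to see which physical join operator the optimizer picked for the query from the question is a sketch like this:
-- Return the estimated plan as text instead of executing the query;
-- the plan output will show Nested Loops, Hash Match, or Merge Join.
SET SHOWPLAN_TEXT ON;
GO
SELECT T1.actid, T1.tranid, T1.val, SUM(T2.val) AS balance
FROM dbo.Transactions AS T1
JOIN dbo.Transactions AS T2
  ON T2.actid = T1.actid AND T2.tranid <= T1.tranid
GROUP BY T1.actid, T1.tranid, T1.val;
GO
SET SHOWPLAN_TEXT OFF;
GO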
The ONLY way to know is by 'studying' the query plan.
FYI: it seems to me your query is equivalent to
SELECT
    T1.actid, T1.tranid, T1.val,
    balance = (SELECT SUM(T2.val)
               FROM dbo.Transactions AS T2
               WHERE T2.actid = T1.actid
                 AND T2.tranid <= T1.tranid)
FROM
    dbo.Transactions AS T1
To be honest, I prefer 'this' version because it looks more readable to me; I'm also expecting this version to be slightly 'leaner' as there is less need for sorting, but only actual testing will tell. It's sometimes surprising to see what the optimizer does behind the scenes! Again, the query plan will show.
Therefore, run both queries and compare the resulting query plans, those should give you an idea about their relative cost. Now, keep in mind that "cost" isn't always directly correlated to "time"; so you might want to check which one runs faster too on your hardware and under 'typical load'; also keep in mind that e.g. caching may have an effect here!

Join queries with same execution plan

I have two queries for the same task
ONE:
select * from table1 t1
INNER JOIN table2 t2
ON t1.id=t2.id
TWO
select * from table1 t1
INNER JOIN (select * from table2) t2
ON t1.id=t2.id
I checked the execution plan for both queries. Both execution plans are the same. But I wonder: is there any difference between the two queries? If yes, which one is more efficient?
You haven't mentioned which DBMS. SQL is just declarative - you tell Oracle (or any other RDBMS) what you want. But the Execution Plan is what ultimately decides how the query will be executed. So if the plans of both queries are the same, then you can rest assured there will be no difference in performance. Both queries will execute identically as far as the RDBMS is concerned.
Even though both queries are the same, the first one is the preferred/right way to do it. The second form implies that the RDBMS needs to do a FULL scan on table2 before joining, but Oracle's CBO is usually smart enough to rewrite the 2nd one to be the same as the 1st. This is something you need to be aware of: some RDBMSs have powerful optimizers that rewrite your query before even deriving the plan, if doing so reduces the execution cost of the query.

Query Optimization on SQL server 2008

I have a small SQL query that runs on SQL Server 2008. It uses the following tables, with their row counts:
dbo.date_master - 245424
dbo.ers_hh_forecast_consumption - 436061472
dbo.ers_hh_forecast_file - 15105
dbo.ers_ed_supply_point - 8485
I am quite new to the world of SQL Server and am learning. Please guide me on how I can optimize this query to run much faster.
I'll be quite happy to learn if anyone can point out my mistakes and what I am doing that makes it take so long to query the resulting table.
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
),
CTE_MPAN AS
(
SELECT T2.FORECAST_FILE_ID
,T2.MPAN_CORE
FROM CTE_CONS AS T1
LEFT JOIN dbo.ers_hh_forecast_file AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
),
CTE_GSP AS
(
SELECT T2.MPAN_CORE
,T2.GSP_GROUP_ID
FROM CTE_MPAN AS T1
LEFT JOIN dbo.ers_ed_supply_point AS T2 ON T1.MPAN_CORE=T2.MPAN_CORE
)
SELECT T1.CONVERTED_DATE
,T1.TOTAL
,T2.MPAN_CORE
,T1.TOTAL
FROM CTE_CONS AS T1
LEFT JOIN CTE_MPAN AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
LEFT JOIN CTE_GSP AS T3 ON T2.MPAN_CORE=T3.MPAN_CORE
Basically, without looking at the actual table design and indexes, it is difficult to tell exactly what you would need to change. But for starters, you could definitely consider two things:
In your CTE_CONS query, you are doing a left join on a Datetime field. This is definitely not a good idea unless you have some kind of index on that field. I would strongly urge you to create an index if there isn't one already.
CREATE NONCLUSTERED INDEX IX_UTC_DATETIME
ON dbo.ers_hh_forecast_consumption (UTC_DATETIME ASC)
INCLUDE (FORECAST_FILE_ID, FORECAST_CONSUMPTION);
The other thing you could consider doing would be partitioning your table dbo.ers_hh_forecast_consumption. That way, each read touches much less of the table and records are retrieved a lot more quickly as well. Here is a quick guide on How To Decide if You Should Use Table Partitioning.
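A rough sketch of what that could look like, assuming monthly partitioning on UTC_DATETIME; the boundary values, object names, and filegroup below are placeholders, not taken from the original post:
-- Hypothetical monthly partition function and scheme on UTC_DATETIME.
CREATE PARTITION FUNCTION pf_forecast_month (DATETIME)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01',
                           '2015-04-01', '2015-05-01', '2015-06-01');

CREATE PARTITION SCHEME ps_forecast_month
AS PARTITION pf_forecast_month ALL TO ([PRIMARY]);

-- The table is moved onto the scheme by building its clustered index there
-- (assumes UTC_DATETIME is an acceptable clustering key; adjust to the real design).
CREATE CLUSTERED INDEX CIX_ers_hh_forecast_consumption
ON dbo.ers_hh_forecast_consumption (UTC_DATETIME)
ON ps_forecast_month (UTC_DATETIME);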
Hope this helps!
Apart from the fact that you'll need to offer quite a bit more info for us to get a good idea of what's going on, I think I spotted a bit of an issue with your query here:
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
)
On first sight, you're trying to SUM() the values of T1.FORECAST_CONSUMPTION per T2.CONVERTED_DATE, T1.FORECAST_FILE_ID combination. However, in the GROUP BY you also add T1.FORECAST_CONSUMPTION again. This has the exact same effect as doing a DISTINCT over the three fields. Either remove the field you're SUM()ing on from the GROUP BY, or use a DISTINCT and get rid of the SUM() and GROUP BY, depending on what effect you're after.
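For the first option, the CTE would look like this; only the GROUP BY changes, the rest of the query stays as posted:
WITH CTE_CONS AS
(
    SELECT T2.CONVERTED_DATE
          ,T1.FORECAST_FILE_ID
          ,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
    FROM dbo.ers_hh_forecast_consumption AS T1
    LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME = T2.STRDATETIME
    WHERE T2.CONVERTED_DATE >= '2015-01-01' AND T2.CONVERTED_DATE <= '2015-06-01'
    GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID  -- FORECAST_CONSUMPTION removed
)
SELECT * FROM CTE_CONS;  -- placeholder; the original outer query goes here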
Anyway, could you add the following things to your question:
EXEC sp_helpindex <table_name> for all tables involved.
if possible, a screenshot of the Execution Plan (either from SSMS, or from SQL Sentry Plan Explorer).

Why is this CTE so much slower than using temp tables?

We have had an issue since a recent update to our database (I made this update, I am guilty here): one of the queries used has been much slower since then. I tried to modify the query to get faster results, and managed to achieve my goal with temp tables, which is not bad, but I fail to understand why this solution performs better than a CTE-based one which does the same queries. Maybe it has to do with the fact that some tables are in a different DB?
Here's the query that performs badly (22 minutes on our hardware):
WITH CTE_Patterns AS (
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH(NOLOCK) ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
),
CTE_Emails AS (
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH(NOLOCK) ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
)
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM CTE_Patterns AS BL WITH(NOLOCK)
INNER JOIN CTE_Emails AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
When running both CTE queries separately, each is super fast (0 seconds in SSMS, returning 122 rows and 13k rows); when running the full query, with the INNER JOIN on sEmail, it's super slow (22 minutes).
Here's the query that performs well, with temp tables (0 sec on our hardware), and which does the exact same thing and returns the same result:
SELECT
PEL.iId_purchased_email_list,
PELE.sEmail
INTO #tb1
FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
INNER JOIN OtherDb.dbo.Purchased_Email_List_Email PELE ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
WHERE PEL.bPattern = 1
SELECT
ILE.iId_newsletterservice_import_list,
ILE.iId_newsletterservice_import_list_email,
ILED.sEmail
INTO #tb2
FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
WHERE ILE.iId_newsletterservice_import_list = 1000
SELECT I.iId_newsletterservice_import_list,
I.iId_newsletterservice_import_list_email,
BL.iId_purchased_email_list
FROM #tb1 AS BL WITH(NOLOCK)
INNER JOIN #tb2 AS I WITH(NOLOCK) ON I.sEmail LIKE BL.sEmail
DROP TABLE #tb1
DROP TABLE #tb2
Table stats:
OtherDb.dbo.Purchased_Email_List: 13 rows, 2 rows flagged bPattern = 1
OtherDb.dbo.Purchased_Email_List_Email: 324289 rows, 122 rows with patterns (which are used in this issue)
dbo.NewsletterService_import_list_email: 15.5M rows
dbo.NewsletterService_import_list_email_distinct: ~1.5M rows
WHERE ILE.iId_newsletterservice_import_list = 1000 retrieves ~ 13k rows
I can post more info about tables on request.
Can someone help me understand this ?
UPDATE
Here is the query plan for the CTE query:
Here is the query plan with temp tables:
As you can see in the query plan, with CTEs the engine reserves the right to apply them basically as a lookup, even when you want a join.
If it isn't sure it can run the whole thing independently in advance (essentially generating a temp table), it will just run it once for each row.
This is perfect for the recursive queries CTEs can do like magic.
But you're seeing, in the nested Nested Loops, where it can go terribly wrong.
You had already found the answer on your own by trying the real temp table.
Parallelism. If you look at your temp table version, the plan for the 3rd query indicates parallelism in both distributing and gathering the work of the 1st query, and parallelism when combining the results of the 1st and 2nd queries. The 1st query also, incidentally, has a relative cost of 77%. So in your temp table example the query engine was able to determine that the 1st query can benefit from parallelism, especially since the parallelism uses Gather Streams and Distribute Streams, allowing the work (the join) to be divvied up because the data is distributed in a way that lets it split the work and then recombine the results. Notice the cost of the 2nd query is 0%, so you can ignore it; it has no cost other than when its results need to be combined.
Looking at the CTE plan, it is processed entirely serially, not in parallel. So somehow, with the CTE, the engine could not figure out that the 1st query can be run in parallel, nor the relationship between the 1st and 2nd queries. It's possible that, with multiple CTE expressions, it assumes some dependency and does not look far enough ahead.
Another test you can do with the CTE is to keep CTE_Patterns but eliminate CTE_Emails by putting it in as a derived-table subquery in the 3rd query. It would be curious to see the execution plan for that, and whether there is parallelism when it is expressed that way.
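Restating that suggestion as code, roughly (an untested sketch: CTE_Emails is simply inlined as a derived table, and the table hints on the non-base-table aliases are dropped):
WITH CTE_Patterns AS (
    SELECT PEL.iId_purchased_email_list,
           PELE.sEmail
    FROM OtherDb.dbo.Purchased_Email_List PEL WITH(NOLOCK)
    INNER JOIN OtherDb.dbo.Purchased_Email_List_Email AS PELE WITH(NOLOCK)
        ON PELE.iId_purchased_email_list = PEL.iId_purchased_email_list
    WHERE PEL.bPattern = 1
)
SELECT I.iId_newsletterservice_import_list,
       I.iId_newsletterservice_import_list_email,
       BL.iId_purchased_email_list
FROM CTE_Patterns AS BL
INNER JOIN (
    SELECT ILE.iId_newsletterservice_import_list,
           ILE.iId_newsletterservice_import_list_email,
           ILED.sEmail
    FROM dbo.NewsletterService_import_list_email AS ILE WITH(NOLOCK)
    INNER JOIN dbo.NewsletterService_import_list_email_distinct AS ILED WITH(NOLOCK)
        ON ILED.iId_newsletterservice_import_list_email_distinct = ILE.iId_newsletterservice_import_list_email_distinct
    WHERE ILE.iId_newsletterservice_import_list = 1000
) AS I
    ON I.sEmail LIKE BL.sEmail;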
In my experience it's best to use CTEs for recursion, and temp tables when you need to join back to the data. That typically makes for a much faster query.

Join multiple table performance

In my current project, I have to left join multiple tables (about 10-20) together. Among these tables, there are about 1-3 large tables with millions of rows (at maximum: 80 million); the other tables only have thousands of rows at most.
Currently, my query looks like this:
SELECT *
FROM table1
left join table2 on table1.A=table2.A
left join table3 on table1.B=table3.B
left join table4 on table1.C=table4.C
left join table5 on table1.D=table5.D
....
left join table15 on table1.Z=table15.Z
table1 and table2 are large tables; the others are small.
I have a clustered index on all of these tables, but the performance is still low. So, I want to know if there is anything I can try to increase the performance.
P.S.: I have tried creating nonclustered indexes on these tables, but the performance became lower than before.
Well, the fastest query would be if you de-normalized your table1 so that the split-out normalized values were actually part of the table.
Another solution you might try is building a temp table that is one big collection of the 20 other small tables, and then just joining that temp table back to your table1.
First of all, do you really need all of that joined data? I suspect in most situations you don't. If you do, you probably need to review your requirements and architecture.
So the trick is to get only the data you want, instead of all of it, and to filter the data as early as possible (even before joining the next table; but don't worry, SQL Server will do some of that optimization for you).
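For illustration only, here is a sketch of that idea using the question's placeholder table names (the extra column names and the filter predicate are made up):
-- Pick only the columns you actually need and filter the big table
-- before joining (SomeCol, OtherCol, and the WHERE clause are hypothetical).
SELECT t1.A, t1.B, t2.SomeCol, t3.OtherCol
FROM (
    SELECT A, B
    FROM table1
    WHERE A IS NOT NULL            -- filter as early as possible
) AS t1
LEFT JOIN table2 AS t2 ON t1.A = t2.A
LEFT JOIN table3 AS t3 ON t1.B = t3.B;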
I would start by checking the execution plan with Ctrl+L. Try to find the "Index Scan" nodes and build indexes for them. I can't go any further without seeing your execution plan.
