joining a small table significantly slows down a query - sql-server

Problem: Joining a relatively small table into a query with many joins doubles the execution time of a query.
Challenge: How could query or data structures be optimized so that the query executes much faster?
There are indexes everywhere, and statistics are up to date.
3 passes, measured with SQL Server Profiler:
Without extra join: avg 3741ms
With extra join: avg 6733ms - this is +80%
Basically, the query is about filtering and aggregating values from a large budget table.
Table budget contains around 700,000 records
For filtering, we have tables f1,f2,f3 which contain values selected by user. These filter tables have 1 to max. 70 records
For aggregating, we have tables a1,a2,a3 which are used for aggregating the dimensions to group aggregates.
Budget Table: 700,000 records
Final Result: 2,400 records
Original query:
select a1.agg1, a2.agg2, a3.agg3, sum(b.value)
from
budget b -- 700.000 records
inner join f1 on b.dim1 = f1.dim1
inner join f2 on b.dim2 = f2.dim2
inner join f3 on b.dim3 = f3.dim3
inner join a1 on b.dim1 = a1.dim1
inner join a2 on b.dim2 = a2.dim2
inner join a3 on b.dim3 = a3.dim3
group by a1.agg1, a2.agg2, a3.agg3
order by a1.agg1, a2.agg2, a3.agg3
Now I'm adding the left join. It brings in an additional count of an extra table which holds comments to some figures.
So this join adds a column in the result, but does not change the rows.
The table comments only has around 1,500 records.
New query:
select a1.agg1, a2.agg2, a3.agg3, sum(b.value), count(c.CommentText) as commentcount
from
budget b -- 700.000 records
inner join f1 on b.dim1 = f1.dim1
inner join f2 on b.dim2 = f2.dim2
inner join f3 on b.dim3 = f3.dim3
inner join a1 on b.dim1 = a1.dim1
inner join a2 on b.dim2 = a2.dim2
inner join a3 on b.dim3 = a3.dim3
left join comments c on b.dim1= c.dim1 and b.dim2=c.dim2 and b.dim3=c.dim3
group by a1.agg1, a2.agg2, a3.agg3
order by a1.agg1, a2.agg2, a3.agg3
I compared the execution plans of both queries in detail. The extra left join leads to 3 extra operations with the following costs:
Clustered Index scan (on the joined table): 1%
Merge Join (left outer join): 10%
Sort: 13%
The Sort is an extra sort which appears only with the join. With extra join, I have two sorts (one before and one after the merge join), otherwise only one sort.
Interestingly, the estimated costs do not really match the measured times. In any case, the basic question is: how could this be improved? Any ideas?
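One direction that might be worth testing is to aggregate the comments table to one row per (dim1, dim2, dim3) before joining it, so the outer join cannot multiply budget rows. The sketch below is only an illustration using the table and column names from the question; the alias cc and the column name total_value are made up here, and whether the optimizer picks a cheaper plan has to be measured. The comment count comes out the same as in the query above, while sum(b.value) differs only if a dimension combination has more than one comment (in that case the query above counts the budget value once per comment).

```sql
-- Sketch only: pre-aggregate comments, then join the small aggregated set.
select a1.agg1, a2.agg2, a3.agg3,
       sum(b.value) as total_value,                      -- hypothetical column name
       sum(isnull(cc.commentcount, 0)) as commentcount
from
budget b -- 700,000 records
inner join f1 on b.dim1 = f1.dim1
inner join f2 on b.dim2 = f2.dim2
inner join f3 on b.dim3 = f3.dim3
inner join a1 on b.dim1 = a1.dim1
inner join a2 on b.dim2 = a2.dim2
inner join a3 on b.dim3 = a3.dim3
left join (
    -- at most one row per dimension combination
    select dim1, dim2, dim3, count(CommentText) as commentcount
    from comments
    group by dim1, dim2, dim3
) cc on b.dim1 = cc.dim1 and b.dim2 = cc.dim2 and b.dim3 = cc.dim3
group by a1.agg1, a2.agg2, a3.agg3
order by a1.agg1, a2.agg2, a3.agg3
```

Comparing the actual execution plans and Profiler timings of both variants would show whether the extra sort disappears.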

Related

Right Join in SQL Server is taking too long

SELECT
b.1, b.2, b.3, b.4, a.4, a.3, a.5
FROM
a
RIGHT JOIN
b ON a.id = b.id
This query is taking more than 7 minutes.
Both tables have around 100,000 records, and a plain select from each table takes around 12 seconds on average. The execution plan shows around 8,708 logical reads on table a and 100% operator cost for it. Both tables have a clustered index on ID.
Verify that an INDEX on the ID column exists in table A. For each row selected in B there will be a lookup of rows in A on the ID column. If an index does not exist on that column this will result in a table scan i.e. a lookup through 100k rows to find rows with that specific ID. Not efficient.
PS - General advice: write queries that don't use RIGHT JOIN; stick to INNER and LEFT joins unless there is no other way (there almost always is).
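As an illustration of that advice, a RIGHT JOIN can be rewritten as a LEFT JOIN by swapping the two tables. The sketch below assumes the tables a and b from the question; the column names col1 to col4 are hypothetical placeholders, since the ones in the question are not valid identifiers.

```sql
-- Returns every row of b, with the matching columns of a or NULL,
-- which is the same result as "a RIGHT JOIN b ON a.id = b.id".
SELECT
    b.col1, b.col2, a.col3, a.col4   -- hypothetical column names
FROM
    b
LEFT JOIN
    a ON a.id = b.id;
```

The rewrite is logically equivalent, so it is mainly a readability change and is not expected to alter the plan by itself.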
Use the SQL below to help identify any missing indexes. My guess is you are missing at least one.
SELECT
statement AS [database.scheme.table],
column_id , column_name, column_usage,
migs.user_seeks, migs.user_scans,
migs.last_user_seek, migs.avg_total_user_cost,
migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
CROSS APPLY sys.dm_db_missing_index_columns (mid.index_handle)
INNER JOIN sys.dm_db_missing_index_groups AS mig
ON mig.index_handle = mid.index_handle
INNER JOIN sys.dm_db_missing_index_group_stats AS migs
ON mig.index_group_handle=migs.group_handle
ORDER BY mig.index_group_handle, mig.index_handle, column_id

SQL Cartesian join optimisation

I have a big query for an ETL view that has a cartesian join (see below) which is then left joined to 5 other tables.
SELECT W.Field1, W.Field2
FROM datedim AS d
INNER JOIN employee AS W
ON 1 = 1
The query takes 5 minutes to run hence I'm trying to optimise it. The cartesian join is having a big impact on performance.
Any ideas?
-- Additional info
The Cartesian results are then used in a join like the one below. There are several joins very similar to it.
LEFT OUTER JOIN detail AS det
ON det.id = W.id
AND d.datevalue >= det.validfrom
AND d.datevalue <= det.validto
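For reference, the two fragments above can be put together with an explicit CROSS JOIN, which returns the same rows as INNER JOIN ... ON 1 = 1 and makes the intent visible. This is only a sketch using the names from the question; the commented-out date filter is a hypothetical example of shrinking datedim before the cross product is built, and it only applies if the view does not need every date.

```sql
SELECT W.Field1, W.Field2
FROM datedim AS d
CROSS JOIN employee AS W              -- same rows as INNER JOIN ... ON 1 = 1
LEFT OUTER JOIN detail AS det
    ON det.id = W.id
   AND d.datevalue >= det.validfrom
   AND d.datevalue <= det.validto
-- WHERE d.datevalue >= '20240101'    -- hypothetical: restrict the date range if possible
;
```

The row count of the cross product is the number of dates times the number of employees, so any reduction on either side shrinks the input to all five left joins.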

Lookup on huge table is not happening sql-server

I am using SQL Server. This is my query:
select asst_id,camp_asst.amp_asst_id,asst.camp_asst_id,lyty_no,campaign_id
into camp.asst_respy
from camp.asst_respy resp
inner join camp.camp_wave wave on wave.wave_cd=resp.camp_id
inner join camp.camp_cust cust on cust.cust_lyty_no=resp.big_id
inner join camp.camp_asst asst on asst.sst_trck_url=resp.dum_url
inner join camp.camp_camp_assty camp_asst on camp_asst.camp_asst_id=asst.asst_id
inner join camp.camp_cust_assty cust_asst on cust_asst.camp_camp_asst_id=camp_asst.asst_id -- this table has about 16 billion rows.
inner join camp.camp_camp_custy camp_cust on camp_cust.camp_camp_cust_id=cust_asst.cust_id
Can somebody please guide me on this join? It is taking a very long time to run.
There are indexes defined on the table. I am looking into partitioning the table to make this work; please advise.
All the remaining tables have more than 10 million rows each.

Index with Leftouter join there is always Index scan in sql server 2005

I have a query joining several tables; the last table is joined with a LEFT JOIN. The last table has more than a million rows, and the execution plan shows a table scan on it. I have indexed the columns on which the join is made. It always uses an index scan, but if I replace the LEFT JOIN with an INNER JOIN, an index seek is used and execution takes a few seconds. With the LEFT JOIN there is a table scan, so execution takes several minutes. Does using outer joins turn off indexes? Did I miss something?
What is the reason for such behavior?
Here is the Query
Select *
FROM
Subjects s
INNER join Question q ON q.SubjectID = s.SubjectID
INNER JOIN Answer a ON a.QuestionID = q.QuestionID
Left outer JOIN Cell c ON c.QuestionID = q.QuestionID
Where S.SubjectID =15
There is a clustered index on SubjectID in the "Subject" table, and there is a nonclustered index on QuestionID in the other tables.
Solution:
I tried it another way and now I get an index seek on the Cell table. Here is the modified query:
Select *
FROM
Subjects s
INNER join Question q ON q.SubjectID = s.SubjectID
INNER JOIN Answer a ON a.QuestionID = q.QuestionID
Left outer JOIN Cell c ON c.QuestionID = q.QuestionID
AND C.QuestionID > 0
AND C.CellKey > 0
Where S.SubjectID =15
This way I got high selectivity on the Cell table. :)
I just tried to simulate the same issue; however, there was no table scan, and the clustered index of Cell was used instead. You could also try to force the index; you can check the syntax here and the issues you may face when forcing an index here. Hope this helps.
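Forcing an index, as mentioned above, is done with a table hint. The following is a minimal sketch, assuming the tables from the question; the index name IX_Cell_QuestionID is hypothetical and has to be replaced with the actual nonclustered index on Cell.QuestionID.

```sql
SELECT *
FROM Subjects s
INNER JOIN Question q ON q.SubjectID = s.SubjectID
INNER JOIN Answer a ON a.QuestionID = q.QuestionID
LEFT OUTER JOIN Cell c WITH (INDEX(IX_Cell_QuestionID))  -- hypothetical index name
    ON c.QuestionID = q.QuestionID
WHERE s.SubjectID = 15;
```

A hint like this overrides the optimizer's choice, so it is worth checking first whether outdated statistics or the SELECT * column list are what push the plan towards a scan.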

Why does the order of join clauses affect the query plan in SQL Server?

I am building a view in SQL Server 2000 (and 2005) and I've noticed that the order of the join statements greatly affects the execution plan and speed of the query.
select sr.WTSASessionRangeID,
-- bunch of other columns
from WTSAVW_UserSessionRange us
inner join WTSA_SessionRange sr on sr.WTSASessionRangeID = us.WTSASessionRangeID
left outer join WTSA_SessionRangeTutor srt on srt.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeClass src on src.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeStream srs on srs.WTSASessionRangeID = sr.WTSASessionRangeID
--left outer join MO_Stream ms on ms.MOStreamID = srs.MOStreamID
left outer join WTSA_SessionRangeEnrolmentPeriod srep on srep.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeStudent stsd on stsd.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionSubrange ssr on ssr.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
left outer join MO_Stream ms on ms.MOStreamID = srs.MOStreamID
On SQL Server 2000, the query above consistently generates a plan of cost 946. If I uncomment the MO_Stream join in the middle of the query and comment out the one at the bottom, the cost drops to 263. The execution time drops accordingly. I always thought that the query optimizer would interpret the query appropriately without considering join order, but it seems that order matters.
So since order does seem to matter, is there a join strategy I should be following for writing faster queries?
(Incidentally, on SQL Server 2005, with almost identical data, the query plan costs were 0.675 and 0.631 respectively.)
Edit: On SQL Server 2000, here are the profiled stats:
946-cost query: 9094ms CPU, 5121 reads, 0 writes, 10123ms duration
263-cost query: 172ms CPU, 7477 reads, 0 writes, 170ms duration
Edit: Here is the logical structure of the tables.
SessionRange ---+--- SessionRangeTutor
|--- SessionRangeClass
|--- SessionRangeStream --- MO_Stream
|--- SessionRangeEnrolmentPeriod
|--- SessionRangeStudent
+----SessionSubrange --- SessionSubrangeRoom
Edit: Thanks to Alex and gbn for pointing me in the right direction. I also found this question.
Here's the new query:
select sr.WTSASessionRangeID -- + lots of columns
from WTSAVW_UserSessionRange us
inner join WTSA_SessionRange sr on sr.WTSASessionRangeID = us.WTSASessionRangeID
left outer join WTSA_SessionRangeTutor srt on srt.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeClass src on src.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeEnrolmentPeriod srep on srep.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join WTSA_SessionRangeStudent stsd on stsd.WTSASessionRangeID = sr.WTSASessionRangeID
-- SessionRangeStream is a many-to-many mapping table between SessionRange and MO_Stream
left outer join (
WTSA_SessionRangeStream srs
inner join MO_Stream ms on ms.MOStreamID = srs.MOStreamID
) on srs.WTSASessionRangeID = sr.WTSASessionRangeID
-- SessionRanges MAY have Subranges and Subranges MAY have Rooms
left outer join (
WTSA_SessionSubrange ssr
left outer join WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
) on ssr.WTSASessionRangeID = sr.WTSASessionRangeID
SQLServer2000 cost: 24.9
I have to disagree with all previous answers, and the reason is simple: if you change the order of your left join, your queries are logically different and as such they produce different result sets. See for yourself:
SELECT 1 AS a INTO #t1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4;
SELECT 1 AS b INTO #t2
UNION ALL SELECT 2;
SELECT 1 AS c INTO #t3
UNION ALL SELECT 3;
SELECT a, b, c
FROM #t1 LEFT JOIN #t2 ON #t1.a=#t2.b
LEFT JOIN #t3 ON #t2.b=#t3.c
ORDER BY a;
SELECT a, b, c
FROM #t1 LEFT JOIN #t3 ON #t1.a=#t3.c
LEFT JOIN #t2 ON #t3.c=#t2.b
ORDER BY a;
a b c
----------- ----------- -----------
1 1 1
2 2 NULL
3 NULL NULL
4 NULL NULL
(4 row(s) affected)
a b c
----------- ----------- -----------
1 1 1
2 NULL NULL
3 NULL 3
4 NULL NULL
The join order does make a difference to the resulting query. This is documented in BOL in the docs for FROM:
<joined_table>
Is a result set that is the product of two or more tables. For multiple joins, use parentheses to change the natural order of the joins.
You can alter the join order using parentheses around the joins (BOL does show this in the syntax at the top of the docs, but it is easy to miss).
This is known as chiastic behaviour. You can also use the query hint OPTION (FORCE ORDER) to force a specific join order, but this can result in what are called "bushy plans" which may not be the most optimal for the query being executed.
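As a small illustration of the hint mentioned above, OPTION (FORCE ORDER) goes at the end of the statement and asks the optimizer to keep the join order as written. The temp tables #o1, #o2 and #o3 below are hypothetical and exist only to make the sketch self-contained.

```sql
-- Hypothetical demo tables
SELECT 1 AS a INTO #o1
UNION ALL SELECT 2
UNION ALL SELECT 3;
SELECT 1 AS b INTO #o2
UNION ALL SELECT 2;
SELECT 1 AS c INTO #o3
UNION ALL SELECT 3;

-- Joins are processed in the order they are written: #o1, then #o2, then #o3
SELECT a, b, c
FROM #o1
LEFT JOIN #o2 ON #o1.a = #o2.b
LEFT JOIN #o3 ON #o2.b = #o3.c
ORDER BY a
OPTION (FORCE ORDER);

DROP TABLE #o1, #o2, #o3;
```

The hint changes only how the query is executed, not what it returns, so it is best treated as a diagnostic tool for comparing plans rather than a permanent fix.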
Obviously, the SQL Server 2005 optimizer is a lot better than the SQL Server 2000 one.
However, there's a lot of truth in your question. Outer joins will cause execution to vary wildly based on order (inner joins tend to be optimized to the most efficient route, but again, order matters). If you think about it, as you build up left joins, you need to figure out what the heck is on the left. As such, each join must be calculated before every other join can be done. It becomes sequential, and not parallel. Now, obviously, there are things you can do to combat this (such as indexes, views, etc). But, the point stands: The table needs to know what's on the left before it can do a left outer join. And if you just keep adding joins, you're getting more and more abstraction to what, exactly is on the left (especially if you use joined tables as the left table!).
With inner joins, however, you can parallelize those quite a bit, so there's less of a dramatic difference as far as order's concerned.
A general strategy for optimizing queries containing JOINs is to look at your data model and the data and try to determine which JOINs will reduce number of records that must be considered the most quickly. The fewer records that must be considered, the faster the query will run. The server will generally produce a better query plan too.
Along with the above optimization, make sure that any fields used in JOINs are indexed.
Your query is probably wrong anyway. Alex is correct. Eric may be correct too, but the query is wrong.
Let's take this subset:
WTSA_SessionRange sr
left outer join
WTSA_SessionSubrange ssr on ssr.WTSASessionRangeID = sr.WTSASessionRangeID
left outer join
WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
You are joining WTSA_SessionSubrangeRoom onto WTSA_SessionSubrange. You may have no rows from WTSA_SessionSubrange.
The join should be this:
WTSA_SessionRange sr
left outer join
(SELECT ssr.WTSASessionRangeID -- , plus the other columns needed
FROM
WTSA_SessionSubrange ssr
left outer join
WTSA_SessionSubrangeRoom ssrr on ssrr.WTSASessionSubrangeID = ssr.WTSASessionSubrangeID
) foo on foo.WTSASessionRangeID = sr.WTSASessionRangeID
This is why the join order is affecting results because it's a different query, declaratively speaking.
You'd also need to change the MO_Stream and WTSA_SessionRangeStream join too.
It depends on which of the join fields are indexed: if it has to table scan on the first field but can use an index on the second, it's slow. If your first join field is indexed, it'll be quicker. My guess is that 2005 optimizes it better by determining the indexed fields and performing those joins first.
At DevConnections a few years ago a session on SQL Server performance stated that (a) order of outer joins DOES matter, and (b) when a query has a lot of joins, it will not look at all of them before making a determination on a plan. If you know you have joins that will help speed up a query, they should be early on in the FROM list (if you can).
