Query Optimization on SQL server 2008 - sql-server

I have a small sql query that runs on SQL Server 2008. It uses the following tables and their row counts:
dbo.date_master - 245424
dbo.ers_hh_forecast_consumption - 436061472
dbo.ers_hh_forecast_file - 15105
dbo.ers_ed_supply_point - 8485
I am quite new to SQL Server and still learning. Please guide me on how to optimize this query so that it runs much faster.
I'd be happy to learn what mistakes I am making that cause the query to take so long.
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
),
CTE_MPAN AS
(
SELECT T2.FORECAST_FILE_ID
,T2.MPAN_CORE
FROM CTE_CONS AS T1
LEFT JOIN dbo.ers_hh_forecast_file AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
),
CTE_GSP AS
(
SELECT T2.MPAN_CORE
,T2.GSP_GROUP_ID
FROM CTE_MPAN AS T1
LEFT JOIN dbo.ers_ed_supply_point AS T2 ON T1.MPAN_CORE=T2.MPAN_CORE
)
SELECT T1.CONVERTED_DATE
,T1.TOTAL
,T2.MPAN_CORE
,T1.TOTAL
FROM CTE_CONS AS T1
LEFT JOIN CTE_MPAN AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
LEFT JOIN CTE_GSP AS T3 ON T2.MPAN_CORE=T3.MPAN_CORE

Basically, without looking at the actual table design and indices, it is difficult to tell exactly what all you would need to change. But for starters, you could definitely consider two things:
In your CTE_CONS query, you are doing a left join on a datetime field. This is not a good idea unless that field is indexed. I would strongly urge you to create an index if there isn't one already.
CREATE NONCLUSTERED INDEX IX_UTC_DATETIME ON dbo.ers_hh_forecast_consumption
(UTC_DATETIME ASC) INCLUDE (
FORECAST_FILE_ID
,FORECAST_CONSUMPTION
);
The other thing you could consider is partitioning your table dbo.ers_hh_forecast_consumption. That way, queries read much less of the table and retrieve records a lot quicker. Here is a quick guide on How To Decide if You Should Use Table Partitioning.
Hope this helps!
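As a rough sketch of what partitioning could look like, assuming UTC_DATETIME is a datetime column and that monthly partitions suit the data. The function name, scheme name and boundary values are made up for illustration, and table partitioning in SQL Server 2008 requires Enterprise Edition:

```sql
-- Hypothetical names and boundaries; adjust to your own data.
-- RANGE RIGHT: each boundary value is the first value of the partition to its right.
CREATE PARTITION FUNCTION pf_ForecastMonth (datetime)
AS RANGE RIGHT FOR VALUES ('20150101', '20150201', '20150301',
                           '20150401', '20150501', '20150601');

-- Map every partition to the PRIMARY filegroup (the simplest possible layout).
CREATE PARTITION SCHEME ps_ForecastMonth
AS PARTITION pf_ForecastMonth ALL TO ([PRIMARY]);

-- Check which partition a given value would land in.
SELECT $PARTITION.pf_ForecastMonth('20150315') AS partition_number;
```

The table (or its clustered index) would then be created or rebuilt ON ps_ForecastMonth(UTC_DATETIME). Keep in mind that partition elimination only helps when the query filters on the partitioning column itself, and the query in the question filters on CONVERTED_DATE from dbo.date_master instead.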

Apart from the fact that you'll need to offer quite a bit more info for us to get a good idea on what's going on, I think I spotted a bit of an issue with your query here:
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
)
At first sight you're trying to SUM() the values of T1.FORECAST_CONSUMPTION per T2.CONVERTED_DATE, T1.FORECAST_FILE_ID combination. However, in the GROUP BY you also include T1.FORECAST_CONSUMPTION. This produces one row per distinct combination of the three fields, much like a DISTINCT over them, with TOTAL being the consumption value multiplied by the number of duplicate rows. Either remove the field you're SUM()ing from the GROUP BY, or use a DISTINCT and get rid of the SUM() and GROUP BY, depending on what effect you're after.
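If the goal is one total per date and forecast file, the whole query could be reduced to something like the sketch below. It assumes FORECAST_FILE_ID is unique in dbo.ers_hh_forecast_file and MPAN_CORE is unique in dbo.ers_ed_supply_point (otherwise the joins multiply rows), and that GSP_GROUP_ID was meant to be returned. It uses an INNER JOIN to dbo.date_master because the WHERE clause on T2.CONVERTED_DATE discards unmatched rows anyway:

```sql
WITH CTE_CONS AS
(
    -- Aggregate the big table once, per date and forecast file.
    SELECT T2.CONVERTED_DATE
          ,T1.FORECAST_FILE_ID
          ,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
    FROM dbo.ers_hh_forecast_consumption AS T1
    INNER JOIN dbo.date_master AS T2
            ON T1.UTC_DATETIME = T2.STRDATETIME
    WHERE T2.CONVERTED_DATE >= '2015-01-01'
      AND T2.CONVERTED_DATE <= '2015-06-01'
    GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID
)
-- Look up MPAN and GSP group directly from the small tables.
SELECT C.CONVERTED_DATE
      ,F.MPAN_CORE
      ,S.GSP_GROUP_ID
      ,C.TOTAL
FROM CTE_CONS AS C
LEFT JOIN dbo.ers_hh_forecast_file AS F
       ON C.FORECAST_FILE_ID = F.FORECAST_FILE_ID
LEFT JOIN dbo.ers_ed_supply_point AS S
       ON F.MPAN_CORE = S.MPAN_CORE;
```

In the original, CTE_MPAN and CTE_GSP are each built on top of CTE_CONS, so the large aggregation can be evaluated up to three times. The final SELECT there also lists T1.TOTAL twice and never returns a column from T3.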
Anyway, could you add the following things to your question :
EXEC sp_helpindex <table_name> for all tables involved.
if possible, a screenshot of the Execution Plan (either from SSMS, or from SQL Sentry Plan Explorer).
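For reference, gathering that information could look like this. The SET STATISTICS lines are an extra suggestion rather than something requested above; they print logical reads and timings in the Messages tab:

```sql
-- List the indexes on each table involved.
EXEC sp_helpindex 'dbo.ers_hh_forecast_consumption';
EXEC sp_helpindex 'dbo.date_master';
EXEC sp_helpindex 'dbo.ers_hh_forecast_file';
EXEC sp_helpindex 'dbo.ers_ed_supply_point';

-- Report logical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run the query under investigation here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```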

Related

TSQL Join, Query Processing order and storage

Table structure:
CREATE TABLE dbo.Transactions
(
actid INT NOT NULL, --Account ID
tranid INT NOT NULL, -- Transaction ID
val MONEY NOT NULL, --- Transaction value
CONSTRAINT PK_Transactions PRIMARY KEY(actid, tranid)
);
The following inefficient query tries to determine the running balance after each transaction
SELECT
T1.actid, T1.tranid, T1.val,
SUM(T2.val) AS balance
FROM
dbo.Transactions AS T1
JOIN
dbo.Transactions AS T2 ON T2.actid = T1.actid
AND T2.tranid <= T1.tranid
GROUP BY
T1.actid, T1.tranid, T1.val;
I am not sure how the join is processed in this query. Is the join treated as a subquery that is executed for each group (T1.actid, T1.tranid, T1.val)? Does that mean that if there are 10K transactions, 10K joined data sets are created by this query?
Execute your query in SSMS. Then highlight it and press Ctrl + L to view the Execution Plan. This will show you how SQL Server plans to execute the query and sometimes suggest indexes, etc.
You will get exactly as many rows as satisfy the join.
Each row in T1 is processed and brings in the rows from T2 that satisfy the join conditions.
The join can be processed as a loop, hash, or merge join. Typically the optimizer will use hash.
The best thing to do is just run it. The output should tell a story.
The ONLY way to know is by 'studying' the query plan.
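To make the row counts concrete, here is a small sketch using a table variable with made-up data in the same shape as dbo.Transactions. Logically, each T1 row is paired with every T2 row of the same account whose tranid is not greater, so an account with n transactions contributes n(n+1)/2 joined rows before grouping. How that is physically executed is up to the optimizer:

```sql
-- Table variable mirroring dbo.Transactions; the data is invented.
DECLARE @T TABLE
(
    actid  INT   NOT NULL,
    tranid INT   NOT NULL,
    val    MONEY NOT NULL,
    PRIMARY KEY (actid, tranid)
);

INSERT INTO @T (actid, tranid, val)
VALUES (1, 1, 10), (1, 2, 20), (1, 3, -5), (2, 1, 100);

-- Count the rows the join produces before GROUP BY collapses them.
-- Account 1 contributes 1 + 2 + 3 = 6 rows, account 2 contributes 1,
-- so this returns 7.
SELECT COUNT(*) AS joined_rows
FROM @T AS T1
JOIN @T AS T2
  ON T2.actid = T1.actid
 AND T2.tranid <= T1.tranid;
```

This quadratic growth per account is why the query gets slow as accounts accumulate transactions.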
FYI: it seems to me your query is equivalent to
SELECT
T1.actid, T1.tranid, T1.val,
balance = (SELECT SUM(T2.val)
FROM dbo.Transactions AS T2
WHERE T2.actid = T1.actid
AND T2.tranid <= T1.tranid)
FROM
dbo.Transactions AS T1
To be honest, I prefer 'this' version because it looks more readable to me; I'm also expecting this version to be slightly 'leaner' as there is less need for sorting, but only actual testing will tell. It's sometimes surprising to see what the optimizer does behind the scenes! Again, the query plan will show.
Therefore, run both queries and compare the resulting query plans, those should give you an idea about their relative cost. Now, keep in mind that "cost" isn't always directly correlated to "time"; so you might want to check which one runs faster too on your hardware and under 'typical load'; also keep in mind that e.g. caching may have an effect here!
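As an aside, on SQL Server 2012 or later (an assumption, since the question does not state a version) a windowed aggregate computes the running balance without any self-join. A sketch against the same table:

```sql
-- Running total per account, ordered by transaction ID.
-- ROWS UNBOUNDED PRECEDING sums from the first row of the partition
-- up to and including the current row.
SELECT actid,
       tranid,
       val,
       SUM(val) OVER (PARTITION BY actid
                      ORDER BY tranid
                      ROWS UNBOUNDED PRECEDING) AS balance
FROM dbo.Transactions;
```

With the primary key on (actid, tranid) the rows are already available in the required order, so it is worth comparing this plan against the two versions above.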

Force joined view not to be optimized

I have a somewhat complex view which includes a join to another view. For some reason the generated query plan is highly inefficient. The query runs for many hours. However if I select the sub-view into a temporary table first and then join with this, the same query finished in a few minutes.
My question is: Is there some kind of query hint or other trick which will force the optimizer to execute the joined sub-view in isolation before performing the join, just as when using a temp table? Clearly the default strategy chosen by the optimizer is not optimal.
I cannot use the temporary-table trick since views do not allow temporary tables. I understand I could probably rewrite everything as a stored procedure, but that would break the composability of views, and it also seems bad for maintenance to rewrite everything just to trick the optimizer into not using a bad plan.
Adam Machanic explained one such way at a SQL Saturday I recently attended. The presentation was called Clash of the Row Goals. The method involves using a TOP X at the beginning of the sub-select. He explained that when doing a TOP X, the query optimizer assumes it is more efficient to grab the TOP X rows one at a time. As long as you set X as a sufficiently large number (limit of INT or BIGINT?), the query will always get the correct results.
So one example that Adam provided:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
becomes:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT TOP(2147483647)
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
It is a super cool trick and very useful.
When things get messy the query optimizer often resorts to loop joins.
If materializing to a temp table fixed it, then most likely that is the problem.
The optimizer often does not deal with views very well.
I would rewrite your view so that it does not use other views.
Join Hints (Transact-SQL)
You may be able to use these hints on views.
Try merge and hash.
Try changing the order of the joins.
Move conditions into the join whenever possible.
select *
from table1
join table2
on table1.FK = table2.[Key]
where table2.[desc] = 'cat1'
should be
select *
from table1
join table2
on table1.FK = table2.[Key]
and table2.[desc] = 'cat1'
For a simple query like this the optimizer will get it right either way, but as the query gets more complex the query optimizer goes into what I call stupid mode and falls back to loop joins. That is also done to protect the server and keep as little in memory as possible.
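A sketch of the two ways to apply the join hints mentioned above, reusing table1 and table2 from the example (the dbo schema is an assumption, and Key and desc are bracketed because they are reserved words):

```sql
-- Join hint in the FROM clause: forces a hash join for this one join.
SELECT t1.*, t2.*
FROM dbo.table1 AS t1
INNER HASH JOIN dbo.table2 AS t2
        ON t1.FK = t2.[Key]
WHERE t2.[desc] = 'cat1';

-- Query-level hint: restricts every join in the statement to hash or merge.
SELECT t1.*, t2.*
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2
        ON t1.FK = t2.[Key]
WHERE t2.[desc] = 'cat1'
OPTION (HASH JOIN, MERGE JOIN);
```

A join hint in the FROM clause also makes the optimizer keep the join order as written. The OPTION clause is not allowed inside a view definition, so within a view only the FROM-clause form is available; the OPTION form would go on the query that selects from the view.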

How can I speed up this SQL view?

I'm a beginner at this so I hope you can help. I'm working in SQL Server 2008 R2 and have a view that is composed of four tables all joined together:
SELECT DISTINCT ad.award_id,
bl.funding_id,
bl.budget_line,
dd4.monthnumberofyear AS month,
dd4.yearcalendar AS year,
CASE
WHEN frb.full_value IS NULL THEN '0'
ELSE frb.full_value
END AS Expenditure_value,
bl.budget_id,
frb.accode,
'Actual' AS Type
FROM dw.dbo.dimdate5 AS dd4
LEFT OUTER JOIN dbo.award_data AS ad
ON dd4.fulldate BETWEEN ad.usethisstartdate AND
ad.usethisenddate
LEFT OUTER JOIN dbo.budget_line AS bl
ON bl.award_id = ad.award_id
LEFT OUTER JOIN dw.dbo.fctresearchbalances AS frb
ON frb.el3 = bl.award_id
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The view has 9 columns and 1.5 million rows and growing. A select * from this view was taking 20 minutes for all the rows. I added indexes on the fields in the tables that are joined on and that improved it to 10 minutes. My question is what else could I do to get the select to run faster?
Many thanks, Violet.
Try getting rid of the case statement.
If you have 1.5 million rows and you're interested in an aggregation of those rows rather than the whole set, you might want to sum the rows in fctResearchBalances first and then do the joins.
(It's a bit difficult to determine what else you might benefit from, without seeing the access plan.)
1- You can use a stored procedure to benefit from plan caching.
2- You can use an indexed view, which means creating an index on a schema-bound view.
3- You can use join hints to tell the query optimizer to use a particular kind of join.
4- You can use table partitioning.
SELECT DISTINCT --#1 - potential bottleneck
ad.award_id
, bl.funding_id
, bl.budget_line
, [month] = dd4.monthnumberofyear
, [year] = dd4.yearcalendar
, Expenditure_value = ISNULL(frb.full_value, '0')
, bl.budget_id
, frb.accode
, [type] = 'Actual'
FROM dbo.dimdate5 dd4
LEFT JOIN dbo.award_data ad ON dd4.fulldate BETWEEN ad.usethisstartdate AND ad.usethisenddate
LEFT JOIN dbo.budget_line bl ON bl.award_id = ad.award_id
LEFT JOIN dbo.fctresearchbalances frb ON frb.el3 = bl.award_id --#2 - join by multiple columns
AND frb.element4groupidnew = bl.budget_line
AND dd4.yearfiscal = frb.yr
AND dd4.monthnumberfiscal = frb.period
The CASE statement can be replaced by
COALESCE(frb.full_value,'0') AS Expenditure_value
Without more info it's not possible to tell exactly what is wrong, but here are some pointers.
When you have so many LEFT JOINS the order of the joins can make a difference.
Do you have standard indexes or covering indexes with included columns?
If you don't have covering indexes, then primary keys matter in the joins. Including all the primary key columns in the join condition will speed up the query.
Then look at your data: do you need all those LEFT JOINs, based on the foreign keys between those tables? Depending on your keys a LEFT JOIN may be equivalent to an INNER JOIN.
And with all those LEFT JOINS is having a DISTINCT really useful?
How much RAM do you have? If you have 8GB+ then 1.5m rows is nothing for SQL Server. You need to optimise those joins.
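To illustrate the covering-index question above: a sketch of an index on the fact table that matches the four join columns and carries the two selected columns, so the join can be answered from the index alone. The index name is made up and the key column order is a guess:

```sql
-- Hypothetical covering index for the join to fctresearchbalances.
-- Key columns: the four columns used in the ON clause.
-- Included columns: the two columns the view selects from this table.
CREATE NONCLUSTERED INDEX IX_fctresearchbalances_join
ON dw.dbo.fctresearchbalances (el3, element4groupidnew, yr, period)
INCLUDE (full_value, accode);
```

Whether it actually gets used is something only the execution plan will show.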

SQL Server CTE referred in self joins slow

I have written a table-valued UDF that starts with a CTE to return a subset of the rows from a large table.
There are several joins in the CTE: a couple of inner joins and one left join to other tables, which don't contain a lot of rows.
The CTE has a where clause that restricts the rows to a date range, in order to return only the rows needed.
I'm then referencing this CTE in 4 self left joins, in order to build subtotals using different criteria.
The query is quite complex but here is a simplified pseudo-version of it
WITH DataCTE as
(
SELECT [columns] FROM table
INNER JOIN table2
ON [...]
INNER JOIN table3
ON [...]
LEFT JOIN table3
ON [...]
)
SELECT [aggregates_columns of each subset] FROM DataCTE Main
LEFT JOIN DataCTE BananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality = 100
LEFT JOIN DataCTE DamagedBananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality < 20
LEFT JOIN DataCTE MangosSubset
ON [...]
GROUP BY [
I have the feeling that SQL Server gets confused and calls the CTE for each self join, which seems confirmed by looking at the execution plan, although I confess not being an expert at reading those.
I would have assumed SQL Server to be smart enough to perform the data retrieval from the CTE only once, rather than several times.
I have tried the same approach but rather than using a CTE to get the subset of the data, I used the same select query as in the CTE, but made it output to a temp table instead.
The version referencing the CTE takes 40 seconds. The version referencing the temp table takes between 1 and 2 seconds.
Why isn't SQL Server smart enough to keep the CTE results in memory?
I like CTEs, especially in this case as my UDF is a table-valued one, so it allowed me to keep everything in a single statement.
To use a temp table, I would need to write a multi-statement table valued UDF, which I find a slightly less elegant solution.
Have any of you had this kind of performance issue with CTEs, and if so, how did you sort it out?
Thanks,
Kharlos
I believe that CTE results are retrieved every time. With a temp table the results are stored until it is dropped. This would seem to explain the performance gains you saw when you switched to a temp table.
Another benefit is that you can create indexes on a temporary table, which you can't do with a CTE. Not sure if there would be a benefit in your situation but it's good to know.
Related reading:
Which are more performant, CTE or temporary tables?
SQL 2005 CTE vs TEMP table Performance when used in joins of other tables
http://msdn.microsoft.com/en-us/magazine/cc163346.aspx#S3
Quote from the last link:
The CTE's underlying query will be called each time it is referenced in the immediately following query.
I'd say go with the temp table. Unfortunately elegant isn't always the best solution.
UPDATE:
Hmmm, that makes things more difficult. It's hard for me to say without looking at your whole environment.
Some thoughts:
can you use a stored procedure instead of a UDF (instead, not from within)?
This may not be possible, but if you can remove the left join from your CTE you could move that into an indexed view. If you are able to do this you may see performance gains over even the temp table.
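One more option that keeps everything in a single statement is conditional aggregation. This is a different technique from the temp table, and it only applies if each subtotal can be written as a filter over the same CTE rows. The sketch below uses a table variable with invented data; GroupKey and Amount are hypothetical column names standing in for whatever the real query groups by and sums:

```sql
-- Made-up data standing in for the rows the CTE returns.
DECLARE @Sales TABLE
(
    GroupKey INT         NOT NULL,  -- hypothetical grouping column
    Product  VARCHAR(20) NOT NULL,
    Quality  INT         NOT NULL,
    Amount   MONEY       NOT NULL   -- hypothetical value being subtotalled
);

INSERT INTO @Sales (GroupKey, Product, Quality, Amount)
VALUES (1, 'Bananas', 100, 10),
       (1, 'Bananas',  10,  4),
       (1, 'Mangos',   80,  7),
       (2, 'Bananas', 100,  5);

WITH DataCTE AS
(
    SELECT GroupKey, Product, Quality, Amount
    FROM @Sales
)
-- The CTE is referenced once; each subtotal is a filtered SUM.
SELECT GroupKey,
       SUM(Amount) AS TotalAmount,
       SUM(CASE WHEN Product = 'Bananas' AND Quality = 100
                THEN Amount ELSE 0 END) AS BananasAmount,
       SUM(CASE WHEN Product = 'Bananas' AND Quality < 20
                THEN Amount ELSE 0 END) AS DamagedBananasAmount,
       SUM(CASE WHEN Product = 'Mangos'
                THEN Amount ELSE 0 END) AS MangosAmount
FROM DataCTE
GROUP BY GroupKey;
```

Because the CTE is read only once, the repeated evaluation that the self-joins cause does not arise, and the statement still fits in an inline table-valued function.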

Optimizing ROW_NUMBER() in SQL Server

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering - 1
The problem is, it runs really slowly (several minutes on a table with about 10k entries). I tried creating separate indexes on Machine_ID and Date_Time, and a single combined index, but nothing helps.
Is there any way to rewrite this query to go faster?
The given ROW_NUMBER() partition and order require an index on (Machine_ID, Date_Time) to be satisfied in one pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
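On SQL Server 2012 or later (an assumption; the thread appears to target an older version) the same composite index also serves the LAG window function, which returns the previous row's value without joining the table to itself. A sketch:

```sql
-- For each row, fetch Date_Time from the preceding row of the same machine.
-- The first row per machine has no predecessor and gets NULL.
SELECT d.*,
       LAG(d.Date_Time) OVER (PARTITION BY d.Machine_ID
                              ORDER BY d.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS d;
```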
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on Machine_ID, Date_Time and for correct results, I'm assuming that this is unique.
You haven't mentioned what is hidden away in *, and that can sometimes mean a lot, since a Machine_ID, Date_Time index will not generally be covering and if you have a lot of columns there or they have a lot of data, ...
If the number of rows in dbo.DataTable is large then it is likely that you are experiencing the issue because the CTE is being joined to itself. There is a blog post explaining the issue in some detail here
Occasionally in such cases I have resorted to creating a temporary table to insert the result of the CTE query into and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required; in the case of a single join the performance difference will be less noticeable)
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).
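A sketch of that temp-table test against the table from the question. The temp table and index names are made up, and it uses a LEFT JOIN so that the first reading of each machine is kept with a NULL previous time:

```sql
-- Materialize the numbered rows once.
SELECT Machine_ID,
       Date_Time,
       ROW_NUMBER() OVER (PARTITION BY Machine_ID
                          ORDER BY Date_Time) AS Ordering
INTO #Ordered
FROM dbo.DataTable;

-- Index the temp table on the join columns.
CREATE UNIQUE CLUSTERED INDEX IX_Ordered
ON #Ordered (Machine_ID, Ordering);

-- Pair each row with the one immediately before it for the same machine
-- (Ordering - 1 is the earlier reading).
SELECT cur.Machine_ID,
       cur.Date_Time,
       prev.Date_Time AS PreviousDateTime,
       DATEDIFF(SECOND, prev.Date_Time, cur.Date_Time) AS SecondsSincePrevious
FROM #Ordered AS cur
LEFT JOIN #Ordered AS prev
       ON prev.Machine_ID = cur.Machine_ID
      AND prev.Ordering   = cur.Ordering - 1;

DROP TABLE #Ordered;
```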
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
What if you use a trigger to store the last timestamp and subtract each time to get the difference?
If you require this data often, rather than calculate it each time you pull the data, why not add a column and calculate/populate it whenever a row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)

Resources