Awkward JOIN causes poor performance - sql-server

I have a stored procedure that combines data from several tables via UNION ALL. If the parameters passed in to the stored procedure don't apply to a particular table, I attempt to "short-circuit" that table by using "helper bits", e.g. @DataSomeTableExists, and adding a corresponding condition in the WHERE clause, e.g. WHERE @DataSomeTableExists = 1.
One (pseudo) table in the stored procedure is a bit awkward and is causing me some grief.
DECLARE @DataSomeTableExists BIT = (SELECT CASE WHEN EXISTS(SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable') THEN 1 ELSE 0 END);
...
UNION ALL
SELECT *
FROM REF_MinuteDimension AS dim WITH (NOLOCK)
CROSS JOIN (SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable') AS T
CROSS APPLY dbo.fGetLastValueFromSomeTable(T.ParentId, dim.TimeStamp) dpp
WHERE @DataSomeTableExists = 1 AND dim.TimeStamp >= @StartDateTime AND dim.TimeStamp <= @EndDateTime
UNION ALL
...
Note: REF_MinuteDimension is nothing more than smalldatetimes with minute increments.
(1) The execution plan (below) indicates a warning on the nested loops operator saying that there is no join predicate. This is probably not good, but there really isn't a natural join between the tables. Is there a better way to write such a query? For each ParentId in T, I want the value from the UDF for every minute between @StartDateTime and @EndDateTime.
(2) Even when @DataSomeTableExists = 0, there is I/O activity on the tables in this query, as reported by SET STATISTICS IO ON and the actual execution plan. The execution plan reports a 14.2% cost, which is too much considering these tables don't even apply in this case.
SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable' comes back empty.
Is it the way my query is written? Why wouldn't the helper bit or an empty T short circuit this query?

For (2), I can say for certain that the line
CROSS JOIN (SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable') AS T
will force #T to be analysed and to take part in a join. You could create two versions of the stored procedure, with and without that join, and use the flag to execute one or the other, but I cannot say whether that will save any response time, CPU, I/O bandwidth, or memory.
For (1), I suggest removing the (NOLOCK) hint if you are using SQL Server 2005 or later, and keeping a close eye on that UDF. I cannot say more without a good SQL Fiddle.

I should mention, I have no clue if this will ever work, as it's kind of an odd way to write a sproc and table-valued UDFs aren't well understood by the query optimizer. You might have to build your resultset into a table variable or temp table conditionally, based on IF statements, then return that data. But I would try this, first:
--helper bit declared
declare @DataSomeTableExists BIT = 0x0
if exists (select 1 from #T where StorageTable = 'DATA_SomeTable')
begin
set @DataSomeTableExists = 0x1
end
...
UNION ALL
SELECT *
FROM REF_MinuteDimension AS dim WITH (NOLOCK)
CROSS JOIN (SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable' and @DataSomeTableExists = 0x1) AS T
CROSS APPLY dbo.fGetLastValueFromSomeTable(T.ParentId, dim.TimeStamp) dpp
WHERE @DataSomeTableExists = 0x1 AND dim.TimeStamp >= @StartDateTime AND dim.TimeStamp <= @EndDateTime
UNION ALL
...
And if you don't know already, the UDF might be giving you weird readings in the execution plans. I don't know enough to give you accurate data, but you should search around to understand the limitations.
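The fallback mentioned above (building the result set conditionally with IF statements) could look roughly like the following. This is only a sketch: the #Result table, its column list, the Value column returned by the UDF, and the sample date range are all assumptions made for illustration, and #T is taken to exist as in the question.

```sql
-- Sketch only: #Result, its columns, dpp.Value and the dates are hypothetical.
DECLARE @StartDateTime SMALLDATETIME = '2012-01-01 00:00',
        @EndDateTime   SMALLDATETIME = '2012-01-01 01:00';

CREATE TABLE #Result (ParentId INT, [TimeStamp] SMALLDATETIME, Value FLOAT);

-- ... the other UNION ALL branches would insert into #Result the same way ...

-- The IF is evaluated at run time, so when no rows apply this branch
-- is skipped entirely and neither the dimension table nor the UDF is read.
IF EXISTS (SELECT 1 FROM #T WHERE StorageTable = 'DATA_SomeTable')
BEGIN
    INSERT INTO #Result (ParentId, [TimeStamp], Value)
    SELECT T.ParentId, dim.[TimeStamp], dpp.Value
    FROM REF_MinuteDimension AS dim
    CROSS JOIN (SELECT ParentId FROM #T WHERE StorageTable = 'DATA_SomeTable') AS T
    CROSS APPLY dbo.fGetLastValueFromSomeTable(T.ParentId, dim.[TimeStamp]) AS dpp
    WHERE dim.[TimeStamp] >= @StartDateTime
      AND dim.[TimeStamp] <= @EndDateTime;
END

SELECT ParentId, [TimeStamp], Value FROM #Result;
```

The trade-off is an extra write to tempdb for every branch, so it is worth comparing the STATISTICS IO output of both versions before committing to it.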

Since your query is dependent on run-time variables, consider using dynamic SQL to create your query on the fly. This way you can include the tables you want and exclude the ones you don't want.
There are downsides to dynamic SQL, so read up on them first.
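As a rough illustration of the dynamic SQL idea, the statement text can be assembled branch by branch and run through sp_executesql with real parameters. The dbo.SomeOtherSource table, the column names, and the sample dates below are hypothetical; only the awkward branch is taken from the question.

```sql
-- Sketch only: dbo.SomeOtherSource and the column names are hypothetical.
DECLARE @StartDateTime SMALLDATETIME = '2012-01-01 00:00',
        @EndDateTime   SMALLDATETIME = '2012-01-01 01:00';
DECLARE @sql NVARCHAR(MAX);

-- A branch that always applies.
SET @sql = N'
SELECT ParentId, [TimeStamp], Value
FROM dbo.SomeOtherSource
WHERE [TimeStamp] >= @Start AND [TimeStamp] <= @End';

-- Append the awkward branch only when it has rows to contribute.
IF EXISTS (SELECT 1 FROM #T WHERE StorageTable = 'DATA_SomeTable')
    SET @sql = @sql + N'
UNION ALL
SELECT T.ParentId, dim.[TimeStamp], dpp.Value
FROM REF_MinuteDimension AS dim
CROSS JOIN (SELECT ParentId FROM #T WHERE StorageTable = ''DATA_SomeTable'') AS T
CROSS APPLY dbo.fGetLastValueFromSomeTable(T.ParentId, dim.[TimeStamp]) AS dpp
WHERE dim.[TimeStamp] >= @Start AND dim.[TimeStamp] <= @End';

-- Values are passed as parameters rather than concatenated into the text.
EXEC sys.sp_executesql
     @sql,
     N'@Start SMALLDATETIME, @End SMALLDATETIME',
     @Start = @StartDateTime,
     @End   = @EndDateTime;
```

A temp table created by the caller is visible inside the dynamic batch. If T is a table variable instead, it is not, and it would have to be passed in as a READONLY table-valued parameter.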

Related

Trying to find a solution to long running SQL code where I think NESTED SQL statement is the culprit

I have a SQL statement that has a weird 2nd nested SQL statement that I think is causing this query to run for 6+ min and any suggestions/help would be appreciated. I tried creating a TEMP table for the values in the nested SQL statement and just do a simple join but there is nothing to join on in the SQL code so that is why they used a 1=1 in the ON statement for the join. Here is the SQL code:
Declare @TransactionEndDate datetime;
Select @TransactionEndDate = lastmonth_end from dbo.DTE_udfCommonDates(GETDATE());
Select ''''+TreatyName as Treaty,
cast(EndOfMonth as Date) as asOfDate,
Count(Distinct ClaimSysID) as ClaimCount,
Count(Distinct FeatureSysID) as FeatureCount,
Sum(OpenReserve) as OpenReserve
From (
Select
TreatyName,
EndOfMonth,
dbo.CMS_Claims.ClaimSysID,
FeatureSysID,
sum(IW_glGeneralLedger.TransactionAmount)*-1 as OpenReserve
From dbo.CMS_Claims
Inner Join dbo.CMS_Claimants
On dbo.CMS_Claims.ClaimSysID = dbo.CMS_Claimants.ClaimSysID
Inner Join dbo.CMS_Features
On dbo.CMS_Features.ClaimantSysID = dbo.CMS_Claimants.ClaimantSysID
Left Join dbo.IW_glGeneralLedger
On IW_glGeneralLedger.FeatureID = dbo.CMS_Features.FeatureSysID
Left Join dbo.IW_glSubChildAccount
On dbo.IW_glSubChildAccount.glSubChildAccountID = dbo.IW_glGeneralLedger.glSubChildAccountSysID
Left Join dbo.IW_glAccountGroup
On dbo.IW_glAccountGroup.glAccountGroupID = dbo.IW_glSubChildAccount.glAccountGroupSysID
Left Join dbo.IW_BankRegister
On dbo.IW_BankRegister.BankRegisterSysID = dbo.IW_glGeneralLedger.BankRegisterID
Left Join dbo.IW_BankRegisterStatus
On dbo.IW_BankRegisterStatus.BankRegisterStatusSysID = dbo.IW_BankRegister.BankRegisterStatusID
-- The derived table below is the part in question:
Left Join (Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
From IW_Calendar
Where dt Between '3/1/2004'
and @TransactionEndDate) as dates
on 1=1
Left Join dbo.IW_ReinsuranceTreaty
On dbo.IW_ReinsuranceTreaty.TreatySysID = IW_glGeneralLedger.PolicyTreatyID
Where dbo.IW_glGeneralLedger.TransactionDate Between '1/1/2004 00:00:00' And EndOfMonth
And dbo.IW_glAccountGroup.Code In ('RESERVEINDEMNITY')
And (
(dbo.IW_glGeneralLedger.BankRegisterID Is Null)
Or (
(IW_BankRegister.PrintedDate Between '1/1/2004 00:00:00' And EndOfMonth Or dbo.IW_glGeneralLedger.BankRegisterID = 0)
And
(dbo.IW_BankRegisterStatus.EnumValue In ('Approved','Outstanding','Cleared','Void') Or dbo.IW_glGeneralLedger.BankRegisterID = 0))
)
Group By TreatyName, dbo.CMS_Claims.ClaimSysID, FeatureSysID, EndOfMonth
Having sum(IW_glGeneralLedger.TransactionAmount) <> 0
) As Data
Group By TreatyName,EndOfMonth
Order By EndOfMonth, TreatyName
This nested SQL code only provides a table of End of Month values in one column called EndOfMonth and this is what I'm trying to fix:
Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
From IW_Calendar
Where dt Between '3/1/2004'
and @TransactionEndDate
Please use the below methods to increase the query performance.
Use temporary tables. ( load relevant data into temporary tables with necessary where conditions and then join).
Use clustered and non clustered indexes to your tables.
Create Multiple-Column Indexes.
Index the ORDER-BY / GROUP-BY / DISTINCT Columns for Better Response Time.
Use Parameterized Queries.
Use query hints accordingly.
NOLOCK: Tells SQL Server to read without taking shared locks and without waiting on other sessions' exclusive locks, so the query can return uncommitted data, also known as a dirty read. Since it is possible to read some old values and some new values, data sets can contain inconsistencies. Do not use this anywhere data quality is important.
RECOMPILE: Adding this to the end of a query will result in a new execution plan being generated each time this query executed. This should not be used on a query that is executed often, as the cost to optimize a query is not trivial. For infrequent reports or processes, though, this can be an effective way to avoid undesired plan reuse. This is often used as a bandage when statistics are out of date or parameter sniffing is occurring.
MERGE/HASH/LOOP: This tells the query optimizer to use a specific type of join as part of a join operation. This is super-risky as the optimal join will change as data, schema, and parameters evolve over time. While this may fix a problem right now, it will introduce an element of technical debt that will remain for as long as the hint does.
OPTIMIZE FOR: Can specify a parameter value to optimize the query for. This is often used when we want performance to be controlled for a very common use case so that outliers do not pollute the plan cache. Similar to join hints, this is fragile and when business logic changes, this hint usage may become obsolete.
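For the query in this question, the first tip and the RECOMPILE hint might be combined along these lines. This is a sketch, not a tested fix: the #MonthEnds table and its index name are made up for illustration, and it assumes the scalar UDF dbo.DTE_get_month_end is a large part of the cost.

```sql
DECLARE @TransactionEndDate DATETIME;
SELECT @TransactionEndDate = lastmonth_end
FROM dbo.DTE_udfCommonDates(GETDATE());

-- Evaluate the scalar UDF once, up front, instead of inside the big join.
-- #MonthEnds is a hypothetical name.
SELECT DISTINCT dbo.DTE_get_month_end(dt) AS EndOfMonth
INTO #MonthEnds
FROM IW_Calendar
WHERE dt BETWEEN '20040301' AND @TransactionEndDate;

CREATE UNIQUE CLUSTERED INDEX IX_MonthEnds ON #MonthEnds (EndOfMonth);

-- In the main query, the derived table joined ON 1=1 would then become:
--     CROSS JOIN #MonthEnds AS dates
-- Query hints go in an OPTION clause at the very end of a statement:
SELECT EndOfMonth
FROM #MonthEnds
WHERE EndOfMonth <= @TransactionEndDate
OPTION (RECOMPILE);
```

Because the WHERE clause already filters on EndOfMonth, replacing the LEFT JOIN ... ON 1=1 with a CROSS JOIN should return the same rows, but that is worth confirming against the real data.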

Should recursive common table expressions over dmvs be built on cached data?

I have written a little CTE to get the total blocking time of a head blocker process, and I am unsure if I should first copy all of the processes that I want the CTE to run over into a temp table and then perform the query over this - i.e. I want to be sure that the data cannot change under my feet whilst the query runs and (worst case scenario), I end up with an infinite recursive loop!
This is my SQL including the temp table - I'd prefer not to have to use the table for performance reasons, and go directly to the sysprocesses dmv inside my CTE, but I'm not sure of the possible implications of this.
DECLARE @proc TABLE(
spid SMALLINT PRIMARY KEY,
blocked SMALLINT INDEX blocked_index,
waittime BIGINT)
INSERT INTO @proc
SELECT spid, blocked, waittime
FROM master..sysprocesses
;WITH block_cte AS
(
SELECT spid, CAST(blocked AS BIGINT) [wait_time], spid [root_spid]
FROM @proc
WHERE blocked = 0
UNION ALL
SELECT blocked.spid, blocked.waittime, block_cte.spid
FROM @proc AS blocked
INNER JOIN block_cte ON blocked.blocked = block_cte.spid
)
SELECT root_spid blocking_spid, SUM(wait_time) total_blocking_time
FROM block_cte
GROUP BY root_spid
This question is probably best transferred to Stack DBA. I'm sure those clever guys and girls can not only tell you the answer but also the reason behind it.
Not being sure myself, I decided to test it...
My script captures the record count from sysProcesses 1,000 times. To do this I had to work around several limits placed on recursive CTEs. Among other restrictions, you cannot use aggregate functions in the recursive part, which makes counting records quite hard. So I created an inline table function to return the current row count from sysProcesses.
sysProcess Count Function
CREATE FUNCTION ProcessCount()
RETURNS TABLE
AS
RETURN
(
-- Return the current process count.
SELECT
COUNT(*) AS RecordCount
FROM
Master..sysProcesses
)
;
I wrapped this function in a CTE.
CTE
WITH RCTE AS
(
/* CTE to test if recursion is affected by updates to
* underlying data.
*/
-- Anchor part.
SELECT
1 AS ExecutionCount,
1 AS JoinField,
RecordCount
FROM
ProcessCount()
UNION ALL
-- Recursive part.
SELECT
r.ExecutionCount + 1 AS ExecutionCount,
1 AS JoinField,
pc.RecordCount
FROM
ProcessCount() AS pc
INNER JOIN RCTE AS r ON r.JoinField = 1
WHERE
r.ExecutionCount < 1000
)
SELECT
MIN(RecordCount) AS MinRecordCount,
MAX(RecordCount) AS MaxRecordCount
FROM
RCTE
OPTION
(MAXRECURSION 1000)
;
GO
If the min and max record counts are always equal this would suggest there is only one consistent view of sysProcesses, used throughout the query. Any difference proves this is not the case. Running on SQL Server 2008 R2 I did find differences:
Results
Run  Min  Max
1    113  254
2    107  108
3    86   108
Of course the inline function could be to blame here. It certainly changed my execution plan. This has taught me a lesson: I really need to better understand execution plans. I'm sure reading the OP's plan would provide a definitive answer.
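Given that the test above saw the row count change mid-query, taking a snapshot first looks like the safer route. The following is a minimal sketch under a few assumptions: the #proc_snapshot name is made up, the recursive member carries root_spid forward (rather than the immediate blocker's spid, as in the question) so that totals roll up to the head blocker, and the MAXRECURSION value is an arbitrary bound.

```sql
-- Snapshot once so every recursion level sees the same rows.
-- #proc_snapshot is a hypothetical name.
SELECT spid, blocked, CAST(waittime AS BIGINT) AS waittime
INTO #proc_snapshot
FROM master..sysprocesses;

WITH block_cte AS
(
    -- Anchor: sessions that are not blocked by anyone.
    SELECT spid, CAST(0 AS BIGINT) AS wait_time, spid AS root_spid
    FROM #proc_snapshot
    WHERE blocked = 0
    UNION ALL
    -- Recursive part: sessions blocked by a session already in the chain.
    SELECT b.spid, b.waittime, c.root_spid
    FROM #proc_snapshot AS b
    INNER JOIN block_cte AS c ON b.blocked = c.spid
)
SELECT root_spid AS blocking_spid, SUM(wait_time) AS total_blocking_time
FROM block_cte
GROUP BY root_spid
OPTION (MAXRECURSION 100); -- fail with an error instead of looping forever

DROP TABLE #proc_snapshot;
```

With a fixed snapshot each session has exactly one blocker value, so chains starting from unblocked sessions cannot loop. The MAXRECURSION bound is a backstop that raises an error if the chain is ever deeper than expected.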

Use SQL variable for comparison in the same SELECT statement that sets it

How do I make the following T-SQL statement legal? I can copy the subquery that sets the @Type variable into every CASE option, but I'd rather execute the subquery only once. Is it possible?
SELECT
@Type = (SELECT CustomerType FROM dbo.Customers WHERE CustomerId = (SELECT CustomerId FROM dbo.CustomerCategories WHERE CatId = @CatId)),
CASE
WHEN @Type = 'Consumer' THEN dbo.Products.FriendlyName
WHEN @Type = 'Company' THEN dbo.Products.BusinessName
WHEN @Type IS NULL THEN dbo.Products.FriendlyName
WHEN @Type = '' THEN dbo.Products.FriendlyName
END Product,
...
FROM
Products
INNER JOIN
Category
...
Edit: modified my example to be more concrete...have to run now...will be back tomorrow...sorry for signing off short but have to pick up kids :D will check back tomorrow. THX!!
Clarification: I can't separate the two: in the subquery's WHERE clause, I need to refer to columns from tables that are used in the main query's joins. If I separate them, then @Type will lose relevance.
Why not just separate it into two operations? What do you think you gain by trying to glom them into a single statement?
SELECT @Type = (subquery);
SELECT CASE WHEN @Type = 'Consumer'...
At the risk of sounding obtuse, do you really need the variable at all? Why not:
SELECT CASE WHEN col_from_subquery = 'Consumer' THEN ...
END Product
FROM (<subquery>) AS x;
With that form you'll need to decide whether you want to assign values to variables or retrieve results.
You can also pull multiple variables, e.g.
SELECT @Col1 = Col1, @Col2 = Col2
FROM (<subquery>) AS x;
-- then refer to those variables in your other query:
SELECT *, @Col1, @Col2 FROM dbo.Products WHERE Col1 = @Col2;
But this is all conjecture, because you haven't shared enough specifics.
EDIT: Okay, now that we have a real query and can understand a bit better what you're after, let's see if we can write you a new version. I'll assume that you were only trying to store the @Type variable so you can re-use it within the query, and that you weren't trying to store a value there to use later (after this query).
SELECT CASE
WHEN c.CustomerType = 'Company' THEN p.BusinessName
WHEN COALESCE(c.CustomerType, '') IN ('Consumer', '') THEN p.FriendlyName
END
--, other columns
FROM dbo.Products AS p
INNER JOIN dbo.Category AS cat
ON p.CatId = cat.CatId
INNER JOIN dbo.CustomerCategories AS ccat
ON ccat.CatId = cat.CatId
INNER JOIN dbo.Customers AS c
ON c.CustomerId = ccat.CustomerId
WHERE cat.CategoryId = @CatId;
Some notes:
I'm not sure why you thought subqueries are the right way to approach this. Usually it is much better (and clearer to other developers) to build proper joins and let SQL Server optimize the query overall instead of trying to be smart and optimize individual subqueries largely independent of the main query. A proper join will help to eliminate rows up front that would otherwise, through the subqueries, potentially be materialized - only to be discarded. Trust SQL Server to do its job, and in this case its job is to perform a join across multiple tables.
The join to dbo.Category might not be needed if the SELECT doesn't need to display the category name. If so, change the WHERE clause and remove that join (join to CustomerCategories instead).
The second case can be changed to a simple ELSE if you've covered all the possible scenarios.
I made an assumption about the join between Products and Category (why is Category not plural like the others?). If this isn't it please fill us in.
You cannot do that; you will get the following error:
A SELECT statement that assigns a value to a variable must not be combined with data-retrieval operations.
Separate the two, and then return the variable as part of the SELECT statement.
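Separated into two statements, the question's example might look like this. It is a sketch that only applies when the subquery does not depend on columns from the main query; the VARCHAR(20) type and the sample @CatId value are assumptions.

```sql
DECLARE @CatId INT = 1;        -- sample value, for illustration only
DECLARE @Type  VARCHAR(20);    -- assumed type of CustomerType

-- Step 1: assignment only, no result set is returned.
SELECT @Type = c.CustomerType
FROM dbo.Customers AS c
WHERE c.CustomerId = (SELECT cc.CustomerId
                      FROM dbo.CustomerCategories AS cc
                      WHERE cc.CatId = @CatId);

-- Step 2: data retrieval only, reusing the variable as often as needed.
SELECT CASE
           WHEN @Type = 'Company' THEN p.BusinessName
           WHEN COALESCE(@Type, '') IN ('Consumer', '') THEN p.FriendlyName
       END AS Product,
       @Type AS CustomerType
FROM dbo.Products AS p;
```

The subquery now runs once, and the variable can still be returned as an ordinary column in the second statement.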

Dynamic inner query

Is there a way to code a dynamic inner query? Basically, I find myself typing something like the following query over and over:
;with tempData as (
--this inner query is the part that changes, but there's always a timeGMT column.
select timeGMT, dataCol2, dataCol3
from tbl1 t1
join tbl2 t2 on t1.ID=t2.ID
)
select dateadd(ss,d.gmtOffset,t.timeGMT) timeLocal,
t.*
from tempData t
join dst d on t.timeGMT between d.sTimeGMT and d.eTimeGMT
where d.zone = 'US-Eastern'
The only thing I can think of is a stored proc with the inner query text as the input for some dynamic sql... However, my understanding of the optimizer (which is, admittedly, limited) says this isn't really a good idea.
From a performance perspective, what you have there is the version on which I would expect the optimizer to do the best job.
If the "outer" part of your example is static and code maintenance overrides performance, I'd look to encapsulating the dateadd result in a table-valued function (TVF). Since the time conversion is very much the common thread in these queries, I would definitely focus on that part of the workload.
For example, your query that can vary would look like this:
select timeGMT, dataCol2, dataCol3, lt.timeLocal
from tbl1 t1
join tbl2 t2 on t1.ID = t2.ID
cross apply dbo.LocalTimeGet(timeGMT, 'US-Eastern') AS lt
Where the TVF dbo.LocalTimeGet contains the logic for dateadd(ss,d.gmtOffset,t.timeGMT) and the lookup of the time zone offset value based on the time zone name. The implementation of that function would look something like:
CREATE FUNCTION dbo.LocalTimeGet (
@TimeGMT datetime,
@TimeZone varchar(20)
)
RETURNS TABLE
AS
RETURN (
SELECT DATEADD(ss, d.gmtOffset, @TimeGMT) AS timeLocal
FROM dst AS d
WHERE d.zone = @TimeZone
);
GO
The upside of this approach is when you upgrade to 2008 or later, there are system functions you could use to make this conversion a lot easier to code and you'll only have to alter the TVF. If your result sets are small, I'd consider a system scalar function (SQL 2008) over a TVF, even if it implements those same system functions. Based on your comment, it sounds like the system functions won't do what you need, but you could still stick with your implementation of a dst table, which is encapsulated in the TVF above.
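TVFs can be a performance problem: for multi-statement TVFs the optimizer assumes a fixed, very small row count (1 row in these versions of SQL Server), whereas an inline TVF like the one above is expanded into the calling query.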
If you need to combine encapsulation and performance, then I'd do the time zone calc in the application code instead. Even though you'd have to apply it to each project that uses it, you would only have to implement it 1x in each project (in the Data Access Layer) and treat it as a common utility library if you'll be using across projects.
To answer the OP's follow-on question, a SQL Server 2008 solution would look like this:
First, create permanent definitions:
CREATE TYPE dbo.tempDataType AS TABLE (
timeGMT DATETIME,
dataCol2 int,
dataCol3 int)
GO
CREATE PROCEDURE ComputeDateWithDST
@tempData tempDataType READONLY
AS
SELECT dateadd(ss,d.gmtOffset,t.timeGMT) timeLocal, t.*
FROM @tempData t
JOIN dst d ON t.timeGMT BETWEEN d.sTimeGMT AND d.eTimeGMT
WHERE d.zone = 'US-Eastern'
GO
Thereafter, whenever you want to plug a subquery (which has now become a separate query, no longer a CTE) into the stored procedure:
DECLARE @tempData tempDataType
INSERT @tempData
-- sample subquery:
SELECT timeGMT, dataCol2, dataCol3
FROM tbl1 t1
JOIN tbl2 t2 ON t1.ID=t2.ID
EXEC ComputeDateWithDST @tempData;
GO
Performance could be an issue because you'd be running separately what used to be a CTE instead of letting SQL Server combine it with the main query to optimize the execution plan.

select statement performance degradation when using DISTINCT with parameters

Note for bounty - START:
PARAMETERS SNIFFING (that is the only "idea" that was reported in pre-bounty questions) is not the issue here, as you can read in the "update" section at the end of the question. The problem is really related to how sql server creates execution plans for a parametrized query when distinct is used.
I uploaded a very simple database backup (it works with sql server 2008 R2) here (you must wait 20 seconds before downloading). Against this DB you can try to run the following queries:
-- PARAMETRIZED QUERY
declare @IS_ADMINISTRATOR int
declare @User_ID int
set @IS_ADMINISTRATOR = 1 -- 1 for administrator 0 for normal
set @User_ID = 50
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
1 = @IS_ADMINISTRATOR OR ROL.USER_ID = @USER_ID
-- NON PARAMETRIZED QUERY
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
1 = 1 OR ROL.USER_ID = 50
Final note: I noticed DISTINCT is the problem; my goal is to achieve the same speed (or at least almost the same speed) in both queries.
Note for bounty - END:
Original question:
I noticed that there is a heavy difference in performance between
-- Case A
select distinct * from table where id > 1
compared to (this is the sql generated by my Delphi application)
-- Case B1
exec sp_executesql N'select distinct * from table where id > @P1',N'@P1 int',1
that is equivalent to
-- Case B2
declare @P1 int
set @P1 = 1
select distinct * from table where id > @P1
A performs much faster than B1 and B2. The performance becomes the same in case I remove DISTINCT.
May you comment on this?
Here I posted a trivial query; I noticed this on a query with 3 INNER JOINs. Anyway, not a complex query.
Note: I was expecting to have THE EXACT SAME PERFORMANCE, in cases A and B1/B2.
So are there some caveats in using DISTINCT?
UPDATE:
I tried to disable parameter sniffing using DBCC TRACEON (4136, -1) (the flag to disable parameter sniffing) but nothing changes. So in this case the problem is NOT LINKED TO PARAMETERS SNIFFING. Any idea?
The problem isn't that DISTINCT causes a performance degradation with parameters. It's that the rest of the query isn't being optimized away in the parameterized query, because the optimizer won't optimize away all of the joins for 1 = @IS_ADMINISTRATOR the way it will for 1 = 1. Without DISTINCT it won't optimize the joins away in either case, because it needs to return the duplicates that the joins produce.
Why? Because an execution plan that tosses out all of the joins would be invalid for any value other than @IS_ADMINISTRATOR = 1. It will never generate that plan regardless of whether you are caching plans or not.
This performs as well as the non parameterized query on my 2008 server:
-- PARAMETRIZED QUERY
declare @IS_ADMINISTRATOR int
declare @User_ID int
set @IS_ADMINISTRATOR = 1 -- 1 for administrator 0 for normal
set @User_ID = 50
IF 1 = @IS_ADMINISTRATOR
BEGIN
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
1 = 1
END
ELSE
BEGIN
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
ROL.USER_ID = @USER_ID
END
What's clear from the query plan I see running your example is that @IS_ADMINISTRATOR = 1 does not get optimized out the same way as 1 = 1. In your non-parameterized example, the joins are completely optimized out, and it just returns every id in the DOCUMENTS table (very simple).
There are also different optimizations missing when @IS_ADMINISTRATOR <> 1. For instance, the LEFT OUTER JOINs are automatically changed to INNER JOINs without that OR clause, but they are left as-is with it.
See also this answer: SQL LIKE % FOR INTEGERS for a dynamic SQL alternative.
Of course, this doesn't really explain the performance difference in your original question, since you don't have the OR in there. I assume that was an oversight.
But also see "parameter sniffing" issue.
Why does a parameterized query produces vastly slower query plan vs non-parameterized query
https://groups.google.com/group/microsoft.public.sqlserver.programming/msg/1e4a2438bed08aca?hl=de
Have you tried running your second (slower) query without dynamic SQL? Have you cleared the cache and rerun the first query? You may be experiencing parameter sniffing with the parameterized dynamic SQL query.
I think the DISTINCT is a red herring and not the actual issue.
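One more option that may be worth testing, besides the IF/ELSE split and dynamic SQL, is OPTION (RECOMPILE). On SQL Server 2008 SP1 with recent cumulative updates and later, a statement-level recompile lets the optimizer use the current values of local variables as if they were literals, which might let it simplify 1 = @IS_ADMINISTRATOR the way it simplifies 1 = 1. This sketch has not been run against the question's database, and the cost is a fresh compilation on every execution.

```sql
DECLARE @IS_ADMINISTRATOR INT = 1;  -- 1 for administrator, 0 for normal
DECLARE @User_ID INT = 50;

SELECT DISTINCT DOC.DOCUMENT_ID
FROM DOCUMENTS DOC
LEFT OUTER JOIN FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID
LEFT OUTER JOIN ROLES ROL ON FOL.FOLDER_ID = ROL.FOLDER_ID
WHERE 1 = @IS_ADMINISTRATOR OR ROL.USER_ID = @User_ID
OPTION (RECOMPILE);  -- plan is built for the current variable values
```

Comparing the actual execution plan of this statement with the plan of the literal 1 = 1 version would show whether the joins are removed on a given build.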
