select statement performance degradation when using DISTINCT with parameters - sql-server

Note for bounty - START:
Parameter sniffing (the only "idea" suggested before the bounty) is not the issue here, as you can read in the "update" section at the end of the question. The problem is really related to how SQL Server creates execution plans for a parameterized query when DISTINCT is used.
I uploaded a very simple database backup (it works with SQL Server 2008 R2) here (you must wait 20 seconds before downloading). Against this DB you can try running the following queries:
-- PARAMETRIZED QUERY
declare @IS_ADMINISTRATOR int
declare @User_ID int
set @IS_ADMINISTRATOR = 1 -- 1 for administrator 0 for normal
set @User_ID = 50
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
1 = @IS_ADMINISTRATOR OR ROL.USER_ID = @User_ID
-- NON PARAMETRIZED QUERY
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
1 = 1 OR ROL.USER_ID = 50
Final note: I noticed DISTINCT is the problem; my goal is to achieve the same speed (or at least almost the same speed) in both queries.
Note for bounty - END:
Original question:
I noticed that there is a heavy difference in performance between
-- Case A
select distinct * from table where id > 1
compared to (this is the sql generated by my Delphi application)
-- Case B1
exec sp_executesql N'select distinct * from table where id > @P1',N'@P1 int',1
that is equivalent to
-- Case B2
declare @P1 int
set @P1 = 1
select distinct * from table where id > @P1
A performs much faster than B1 and B2. The performance becomes the same if I remove DISTINCT.
Can you comment on this?
I posted a trivial query here; I actually noticed this on a query with 3 INNER JOINs, though still not a complex one.
Note: I was expecting THE EXACT SAME PERFORMANCE in cases A and B1/B2.
So are there some caveats in using DISTINCT?
UPDATE:
I tried to disable parameter sniffing using DBCC TRACEON (4136, -1) (the trace flag that disables parameter sniffing), but nothing changed. So in this case the problem is NOT LINKED TO PARAMETER SNIFFING. Any ideas?

The problem isn't that DISTINCT causes a performance degradation with parameters; it's that the rest of the query isn't being optimized away in the parameterized query, because the optimizer won't simply optimize away all of the joins using 1 = @IS_ADMINISTRATOR the way it will with 1 = 1. And without DISTINCT it won't optimize the joins away at all, because it needs to return duplicates based on the result of the joins.
Why? Because an execution plan that tosses out all of the joins would be invalid for any value other than @IS_ADMINISTRATOR = 1. The optimizer will never generate that plan, regardless of whether you are caching plans or not.
This performs as well as the non-parameterized query on my 2008 server:
-- PARAMETRIZED QUERY
declare @IS_ADMINISTRATOR int
declare @User_ID int
set @IS_ADMINISTRATOR = 1 -- 1 for administrator 0 for normal
set @User_ID = 50
IF 1 = @IS_ADMINISTRATOR
BEGIN
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
1 = 1
END
ELSE
BEGIN
SELECT DISTINCT -- PLEASE REMEMBER DISTINCT MAKES THE DIFFERENCE!!!
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
ROL.USER_ID = @User_ID
END
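If you'd rather not maintain two copies of the query, another option worth trying is OPTION (RECOMPILE). This is only a sketch: on builds where that hint performs parameter embedding (the behavior varied across SQL Server 2008 service-pack levels), the statement is compiled with the current variable values, so 1 = @IS_ADMINISTRATOR can be folded away just like 1 = 1:
SELECT DISTINCT
DOC.DOCUMENT_ID
FROM
DOCUMENTS DOC LEFT OUTER JOIN
FOLDERS FOL ON FOL.FOLDER_ID = DOC.FOLDER_ID LEFT OUTER JOIN
ROLES ROL ON (FOL.FOLDER_ID = ROL.FOLDER_ID)
WHERE
1 = @IS_ADMINISTRATOR OR ROL.USER_ID = @User_ID
OPTION (RECOMPILE)
The trade-off is a fresh compilation on every execution, which is fine for an occasional query but not for one run hundreds of times per second.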
What's clear from the query plan I see when running your example is that @IS_ADMINISTRATOR = 1 does not get optimized out the same way 1 = 1 does. In your non-parameterized example, the joins are completely optimized out, and the plan just returns every id in the DOCUMENTS table (very simple).
Other optimizations also go missing when @IS_ADMINISTRATOR <> 1. For instance, without that OR clause the LEFT OUTER JOINs are automatically changed to INNER JOINs, but with it they are left as-is.
See also this answer: SQL LIKE % FOR INTEGERS for a dynamic SQL alternative.
Of course, this doesn't really explain the performance difference in your original question, since you don't have the OR in there. I assume that was an oversight.

But also see the "parameter sniffing" issue:
Why does a parameterized query produce a vastly slower query plan than a non-parameterized query?
https://groups.google.com/group/microsoft.public.sqlserver.programming/msg/1e4a2438bed08aca?hl=de

Have you tried running your second (slower) query without dynamic SQL? Have you cleared the cache and rerun the first query? You may be experiencing parameter sniffing with the parameterized dynamic SQL query.
I think the DISTINCT is a red herring and not the actual issue.
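To test that theory, you can flush the plan cache between runs so each query compiles fresh (on a test server only; this affects the whole instance):
-- Discard all cached execution plans so the next run compiles from scratch.
DBCC FREEPROCCACHE;
-- Optionally also empty the buffer pool to compare cold-cache I/O.
DBCC DROPCLEANBUFFERS;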

Related

Trying to find a solution to long running SQL code where I think NESTED SQL statement is the culprit

I have a SQL statement with a weird second nested SQL statement that I think is causing this query to run for 6+ minutes, and any suggestions/help would be appreciated. I tried creating a TEMP table for the values in the nested SQL statement and just doing a simple join, but there is nothing to join on in the SQL code, which is why they used 1=1 in the ON clause for the join. Here is the SQL code:
Declare @TransactionEndDate datetime;
Select @TransactionEndDate = lastmonth_end from dbo.DTE_udfCommonDates(GETDATE());
Select ''''+TreatyName as Treaty,
cast(EndOfMonth as Date) as asOfDate,
Count(Distinct ClaimSysID) as ClaimCount,
Count(Distinct FeatureSysID) as FeatureCount,
Sum(OpenReserve) as OpenReserve
From (
Select
TreatyName,
EndOfMonth,
dbo.CMS_Claims.ClaimSysID,
FeatureSysID,
sum(IW_glGeneralLedger.TransactionAmount)*-1 as OpenReserve
From dbo.CMS_Claims
Inner Join dbo.CMS_Claimants
On dbo.CMS_Claims.ClaimSysID = dbo.CMS_Claimants.ClaimSysID
Inner Join dbo.CMS_Features
On dbo.CMS_Features.ClaimantSysID = dbo.CMS_Claimants.ClaimantSysID
Left Join dbo.IW_glGeneralLedger
On IW_glGeneralLedger.FeatureID = dbo.CMS_Features.FeatureSysID
Left Join dbo.IW_glSubChildAccount
On dbo.IW_glSubChildAccount.glSubChildAccountID = dbo.IW_glGeneralLedger.glSubChildAccountSysID
Left Join dbo.IW_glAccountGroup
On dbo.IW_glAccountGroup.glAccountGroupID = dbo.IW_glSubChildAccount.glAccountGroupSysID
Left Join dbo.IW_BankRegister
On dbo.IW_BankRegister.BankRegisterSysID = dbo.IW_glGeneralLedger.BankRegisterID
Left Join dbo.IW_BankRegisterStatus
On dbo.IW_BankRegisterStatus.BankRegisterStatusSysID = dbo.IW_BankRegister.BankRegisterStatusID
Left Join (Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
From IW_Calendar
Where dt Between '3/1/2004'
and @TransactionEndDate) as dates
on 1=1
Left Join dbo.IW_ReinsuranceTreaty
On dbo.IW_ReinsuranceTreaty.TreatySysID = IW_glGeneralLedger.PolicyTreatyID
Where dbo.IW_glGeneralLedger.TransactionDate Between '1/1/2004 00:00:00' And EndOfMonth
And dbo.IW_glAccountGroup.Code In ('RESERVEINDEMNITY')
And (
(dbo.IW_glGeneralLedger.BankRegisterID Is Null)
Or (
(IW_BankRegister.PrintedDate Between '1/1/2004 00:00:00' And EndOfMonth Or dbo.IW_glGeneralLedger.BankRegisterID = 0)
And
(dbo.IW_BankRegisterStatus.EnumValue In ('Approved','Outstanding','Cleared','Void') Or dbo.IW_glGeneralLedger.BankRegisterID = 0))
)
Group By TreatyName, dbo.CMS_Claims.ClaimSysID, FeatureSysID, EndOfMonth
Having sum(IW_glGeneralLedger.TransactionAmount) <> 0
) As Data
Group By TreatyName,EndOfMonth
Order By EndOfMonth, TreatyName
This nested SQL code only provides a table of end-of-month values in one column called EndOfMonth, and this is what I'm trying to fix:
Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
From IW_Calendar
Where dt Between '3/1/2004'
and @TransactionEndDate
Please use the methods below to increase query performance.
Use temporary tables: load relevant data into temporary tables with the necessary WHERE conditions and then join (see the sketch after this list).
Add clustered and non-clustered indexes to your tables.
Create multiple-column indexes.
Index the ORDER BY / GROUP BY / DISTINCT columns for better response time.
Use parameterized queries.
Use query hints accordingly.
NOLOCK: In the event that data is locked, this tells SQL Server to read the last known value available, also known as a dirty read. Since it is possible to mix some old values and some new values, result sets can contain inconsistencies. Do not use this anywhere data quality is important.
RECOMPILE: Adding this to the end of a query will result in a new execution plan being generated each time the query is executed. This should not be used on a query that is executed often, as the cost of optimizing a query is not trivial. For infrequent reports or processes, though, it can be an effective way to avoid undesired plan reuse. It is often used as a bandage when statistics are out of date or parameter sniffing is occurring.
MERGE/HASH/LOOP: This tells the query optimizer to use a specific type of join as part of a join operation. This is super-risky, as the optimal join will change as data, schema, and parameters evolve over time. While this may fix a problem right now, it introduces an element of technical debt that remains for as long as the hint does.
OPTIMIZE FOR: Specifies a parameter value to optimize the query for. This is often used when we want performance to be controlled for a very common use case, so that outliers do not pollute the plan cache. Like join hints, this is fragile, and when business logic changes, the hint may become obsolete.
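For the date list in this particular query, a minimal sketch of the temp-table approach could look like the following (#MonthEnds is a hypothetical name; the tables and functions are taken from the question):
Declare @TransactionEndDate datetime;
Select @TransactionEndDate = lastmonth_end from dbo.DTE_udfCommonDates(GETDATE());
-- Materialize the month-end list once, instead of evaluating the
-- DISTINCT subquery as part of the join.
Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
Into #MonthEnds
From IW_Calendar
Where dt Between '3/1/2004' and @TransactionEndDate;
-- In the main query, the "Left Join (...) as dates on 1=1" can then be
-- replaced by the equivalent: Cross Join #MonthEnds as dates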

TSQL Query performance

I have the following query which takes about 20s to complete.
declare @shoppingBasketID int
select @shoppingBasketID = [uid]
from shoppingBasket sb
where sb.requestID = 21918154 and sb.[status] > 0
select
ingredientGroup.shoppingBasketItemID as itemID,
ingredientGroup.[uid] as groupUID
from shoppingBasketItem item
left outer join shoppingBasketItemBundle itemBundle on itemBundle.primeMenuItemID = item.[uid]
left outer join shoppingBasketItem bundleItem on bundleItem.[uid] = isnull(itemBundle.linkMenuItemID, item.[uid])
left outer join shoppingBasketItemIngredientGroup ingredientGroup on ingredientGroup.shoppingBasketItemID = isnull(itemBundle.linkMenuItemID, item.[uid])
left outer join shoppingBasketItemIngredient ingredient on ingredient.shoppingBasketItemIngredientGroupID = ingredientGroup.[uid]
where item.shoppingBasketID = @shoppingBasketID
The 'shoppingBasketItemIngredient' table has 40 million rows.
When I change the last line to the following, the query returns the results almost instantly (I moved the first select into the second select query).
where item.shoppingBasketID = (select [uid] from shoppingBasket sb where sb.requestID = 21918154 and sb.[status] > 0)
Do you know why?
This is too long for a comment.
Queries in stored procedures are compiled the first time they are run and the query plan is cached. So, if you test the stored procedure on an empty table, then it might generate a bad query plan -- and that doesn't get updated automatically.
You can force a recompile at either the stored-procedure level (WITH RECOMPILE) or the query level (OPTION (RECOMPILE)). Here is some documentation.
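For example, a sketch of the query-level form applied to the query from the question (the body of the SELECT is unchanged):
select
...
where item.shoppingBasketID = @shoppingBasketID
OPTION (RECOMPILE);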
You could add a query hint.
When using a variable, the query optimizer may generate a slow execution plan.
It's easier for the query optimizer to calculate the optimal plan when a fixed value is used.
But by adding the right hint(s), you can steer it toward a faster execution plan.
For example:
select
...
where item.shoppingBasketID = @shoppingBasketID
OPTION ( OPTIMIZE FOR (@shoppingBasketID UNKNOWN) );
In the example UNKNOWN was used, but you can give a value instead.
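For instance, to optimize for one specific basket (12345 is purely a placeholder value):
where item.shoppingBasketID = @shoppingBasketID
OPTION ( OPTIMIZE FOR (@shoppingBasketID = 12345) );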

How can I fix this query (IN-clause) so that SQL Server performance does not degrade when there is a low volume of data?

The following chart shows the performance of a process over time.
The process is calling a stored procedure with the following form:
CREATE PROCEDURE [dbo].[GetResultSetsAndResultsWhereStatusIsValidatedByPatientId]
@PatientId uniqueidentifier
AS
BEGIN
SELECT DISTINCT resultSetTable.ResultSetId,
resultSetTable.OrderId,
resultSetTable.ReceivedDateTime,
resultSetTable.ProfileId,
resultSetTable.Status,
profileTable.Code,
testResultTable.AbnormalFlag,
testResultTable.Result,
orderTable.ReceptionDateTime,
testTable.TestCode,
orderedProfileTable.[Status] as opStatus
FROM dbo.ResultSet resultSetTable
INNER JOIN dbo.[Profile] profileTable on (profileTable.ProfileId = resultSetTable.ProfileId)
INNER JOIN dbo.[TestResult] testResultTable on (testResultTable.ResultSetId = resultSetTable.ResultSetId)
INNER JOIN dbo.[Order] orderTable on (resultSetTable.OrderId = orderTable.OrderId)
INNER JOIN dbo.[Test] testTable on (testResultTable.TestId = testTable.TestId)
INNER JOIN dbo.OrderedProfile orderedProfileTable on (orderedProfileTable.ProfileId = resultSetTable.ProfileId)
WHERE orderTable.PatientId = @PatientId
AND orderedProfileTable.[Status] in ('V', 'REP')
END
The problem seems to be the IN-clause. If I remove the IN-clause and only check for one of the values then I get consistent performance as seen in the second part of the graph.
AND orderedProfileTable.[Status] = 'V'
The issue also seems to be related to the amount of data in the tables. Only two tables grow, [ResultSet] and [TestResult], and both these tables are empty at the start of performance runs.
I have tried the following:
Move the IN-clause to an outer select - no effect
Replace the IN-clause with a join - severe performance degradation
Create an index for the "Status" field used in the IN-clause - no effect
Is there a way to always get the low performance even when there is no data in the two relevant tables?
Have you tried throwing the IN query into an EXISTS clause?
WHERE
orderTable.PatientId = @PatientId
AND
EXISTS
(SELECT *
FROM dbo.OrderedProfile as p
WHERE
p.profileid = orderedprofiletable.profileid
AND
[Status] IN ('v','rep'))
Since you're only searching for static results ('v' and 'rep'), I would think that the IN clause by itself would be your best bet, but EXISTS can sometimes speed up performance, so it's worth a shot.
In the end, the problem was not related to the IN-clause but to a logic error. We started questioning why the query needed DISTINCT, and when we removed it we discovered a logic error in the last join (it needed some more criteria in what it matches against).
The error has been partially resolved, and the performance issue appears to be resolved as well.
The stored procedure now completes in less than 10 ms on average, with no performance degradation.

SQL query showing drastically different performance depending on the parameters

We have a stored procedure that searches products based on a number of input parameters that differ from one scenario to the next. Depending on the input parameters the search involves anywhere from two to about a dozen different tables. In order to avoid unnecessary joins we build the actual search query as dynamic SQL and execute it inside the stored procedure.
In one of the most basic scenarios the user searches products by a keyword alone (see Query 1 below), which usually takes less than a second. However, if they search by a keyword and department (Query 2 below), the execution time goes up to well over a minute, and the execution plan looks somewhat different (the attached snapshots of the plans show just the parts that differ).
Query 1 (fast)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
WHERE (1=1)
AND (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
Query 2 (slow)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
WHERE (1=1)
AND (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
AND (Product.SourceID = @SourceID)
AND (Product.DepartmentID = @DepartmentID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
Both the Product and ProductVariant tables have some string columns that participate in the full-text index. The Product table has a non-clustered index on the SourceID column and another non-clustered index on SourceID+DepartmentID (this redundancy is not an oversight but is intended). ProductVariant.ProductID is a FK to Product and has a non-clustered index on it. Statistics are updated for all indexes and columns, and no missing indexes are reported by SQL Management Studio.
Any suggestions on what might be causing this drastically different performance?
P.S. Forgot to mention that Product.DepartmentID is a FK to a table of departments, in case it makes any difference.
Thanks to @MartinSmith for the suggestion to break the full-text search logic out into temp tables and then use them to filter the results of the main query. The following returns in just 2 seconds:
SELECT
[Key] AS ProductID
INTO
#matchingProducts
FROM
CONTAINSTABLE(Product, *, @Keywords)
SELECT
[Key] AS VariantID
INTO
#matchingVariants
FROM
CONTAINSTABLE(ProductVariant, *, @Keywords)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
LEFT OUTER JOIN #matchingProducts ON #matchingProducts.ProductID = Product.ProductID
LEFT OUTER JOIN #matchingVariants ON #matchingVariants.VariantID = ProductVariant.VariantID
WHERE (1=1)
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
AND (Product.DepartmentID = @DepartmentID)
AND (NOT #matchingProducts.ProductID IS NULL OR NOT #matchingVariants.VariantID IS NULL)
Curiously, when I tried to simplify the above solution using nested queries as shown below, the results were somewhere in-between in terms of speed (around 25 secs). Theoretically, the query below should be identical to the one above, yet somehow SQL Server internally compiles the second one differently.
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
LEFT OUTER JOIN
(
SELECT
[Key] AS ProductID
FROM
CONTAINSTABLE(Product, *, @Keywords)
) MatchingProducts
ON MatchingProducts.ProductID = Product.ProductID
LEFT OUTER JOIN
(
SELECT
[Key] AS VariantID
FROM
CONTAINSTABLE(ProductVariant, *, @Keywords)
) MatchingVariants
ON MatchingVariants.VariantID = ProductVariant.VariantID
WHERE (1=1)
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
AND (Product.DepartmentID = @DepartmentID)
AND (NOT MatchingProducts.ProductID IS NULL OR NOT MatchingVariants.VariantID IS NULL)
This may have been your mistake
In order to avoid unnecessary joins we build the actual search query as dynamic SQL and execute it inside the stored procedure.
Dynamic SQL cannot be optimized by the server in most cases. There are certain techniques to mitigate this; read more in The Curse and Blessings of Dynamic SQL.
Get rid of your dynamic SQL and build a decent query using proper indices. I assure you, SQL Server knows better than you when it comes to optimizing. Define ten different queries if you must (or a hundred).
Secondly, why would you ever expect the same execution plan when running different queries using different columns/indices? The execution plans and results you get seem perfectly natural to me, given your approach.
You will not get the same execution plan / performance because you are not querying the 'DepartmentID' column in the first query.

Awkward JOIN causes poor performance

I have a stored procedure that combines data from several tables via UNION ALL. If the parameters passed in to the stored procedure don't apply to a particular table, I attempt to "short-circuit" that table by using "helper bits", e.g. @DataSomeTableExists and adding a corresponding condition in the WHERE clause, e.g. WHERE @DataSomeTableExists = 1
One (pseudo) table in the stored procedure is a bit awkward and is causing me some grief.
DECLARE @DataSomeTableExists BIT = (SELECT CASE WHEN EXISTS(SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable') THEN 1 ELSE 0 END);
...
UNION ALL
SELECT *
FROM REF_MinuteDimension AS dim WITH (NOLOCK)
CROSS JOIN (SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable') AS T
CROSS APPLY dbo.fGetLastValueFromSomeTable(T.ParentId, dim.TimeStamp) dpp
WHERE @DataSomeTableExists = 1 AND dim.TimeStamp >= @StartDateTime AND dim.TimeStamp <= @EndDateTime
UNION ALL
...
Note: REF_MinuteDimension is nothing more than smalldatetimes with minute increments.
(1) The execution plan (below) indicates a warning on the nested loops operator saying that there is no join predicate. This is probably not good, but there really isn't a natural join between the tables. Is there a better way to write such a query? For each ParentId in T, I want the value from the UDF for every minute between @StartDateTime and @EndDateTime.
(2) Even when @DataSomeTableExists = 0, there is I/O activity on the tables in this query, as reported by SET STATISTICS IO ON and the actual execution plan. The execution plan reports a 14.2% cost, which is too much considering these tables don't even apply in this case.
SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable' comes back empty.
Is it the way my query is written? Why wouldn't the helper bit or an empty T short-circuit this query?
For (2), I can say for sure that the line
CROSS JOIN (SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable') AS T
will force #T to be analyzed and to enter a join. You could create two versions of the SP, with and without that join, and use the flag to execute one or the other, but I cannot say whether that will save any response time, CPU cycles, I/O bandwidth, or memory.
For (1), I suggest removing the (NOLOCK) if you are using SQL Server 2005 or better, and keeping a close eye on that UDF. I cannot say more without a good SQL Fiddle.
I should mention that I have no clue if this will ever work, as it's kind of an odd way to write a sproc, and table-valued UDFs aren't well understood by the query optimizer. You might have to build your result set into a table variable or temp table conditionally, based on IF statements, and then return that data. But I would try this first:
--helper bit declared
declare @DataSomeTableExists BIT = 0x0
if exists (select 1 from #T where StorageTable = 'DATA_SomeTable')
begin
set @DataSomeTableExists = 0x1
end
...
UNION ALL
SELECT *
FROM REF_MinuteDimension AS dim WITH (NOLOCK)
CROSS JOIN (SELECT * FROM #T WHERE StorageTable = 'DATA_SomeTable' and @DataSomeTableExists = 0x1) AS T
CROSS APPLY dbo.fGetLastValueFromSomeTable(T.ParentId, dim.TimeStamp) dpp
WHERE @DataSomeTableExists = 0x1 AND dim.TimeStamp >= @StartDateTime AND dim.TimeStamp <= @EndDateTime
UNION ALL
...
And if you don't know already, the UDF might be giving you weird readings in the execution plans. I don't know enough to give you accurate data, but you should search around to understand the limitations.
Since your query is dependent on run-time variables, consider using dynamic SQL to create your query on the fly. This way you can include the tables you want and exclude the ones you don't want.
There are downsides to dynamic SQL, so read up on them first.
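A minimal sketch of that idea for the branch in the question (variable types are assumed; note that a temp table such as #T created in the calling scope remains visible inside sp_executesql):
DECLARE @sql nvarchar(max) = N'';
-- Append the awkward branch only when its helper bit is set, so the
-- optimizer never even sees the join in the @DataSomeTableExists = 0 case.
IF @DataSomeTableExists = 0x1
SET @sql = @sql + N'
SELECT dpp.*
FROM REF_MinuteDimension AS dim
CROSS JOIN (SELECT * FROM #T WHERE StorageTable = ''DATA_SomeTable'') AS T
CROSS APPLY dbo.fGetLastValueFromSomeTable(T.ParentId, dim.TimeStamp) dpp
WHERE dim.TimeStamp >= @StartDateTime AND dim.TimeStamp <= @EndDateTime';
-- ...append the other UNION ALL branches the same way, inserting
-- N' UNION ALL ' between any non-empty branches...
IF @sql <> N''
EXEC sp_executesql @sql,
N'@StartDateTime smalldatetime, @EndDateTime smalldatetime',
@StartDateTime, @EndDateTime;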
