I have a query that runs against a pretty large table, and I need to do a count on it.
If I use a literal, the query runs in a few seconds, but when I pass the values in as variables (which I need to do) the query takes forever and presumably does a full table scan.
I've done quite a lot of reading about this and I understand it is most likely down to parameter sniffing. I can't pretend to understand parameter sniffing; I just want to know how to fix it, otherwise I'm going to have to fall back on building generated query strings in C#.
This query runs in a few seconds:
SELECT Count(Id) FROM dbo.BeaconScan WHERE State = 'Archived' AND LastSeen < '29 February 2020';
This one takes forever:
DECLARE @Date DATE = '31 March 2020'
DECLARE @Status NVARCHAR(256) = 'Archived'
SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @Status AND LastSeen < @Date;
If you are running this inside a stored procedure, you can eliminate parameter sniffing by copying the parameters into local variables:
DECLARE @DateLocal DATE = @Date
DECLARE @StatusLocal NVARCHAR(256) = @Status
SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @StatusLocal AND LastSeen < @DateLocal
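For context, here is a minimal sketch of the whole procedure, assuming the table and parameters from the question (the procedure name is made up):

CREATE PROCEDURE dbo.CountArchivedScans  -- hypothetical name
    @Date   DATE,
    @Status NVARCHAR(256)
AS
BEGIN
    -- Copying the parameters into local variables makes the optimizer use
    -- density-based estimates instead of sniffing the first caller's values.
    DECLARE @DateLocal   DATE          = @Date;
    DECLARE @StatusLocal NVARCHAR(256) = @Status;

    SELECT Count(Id)
    FROM dbo.BeaconScan
    WHERE State = @StatusLocal AND LastSeen < @DateLocal;
END;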
You should examine your actual execution plans to confirm that the problem really is parameter sniffing. Only the actual plan shows you the actual versus estimated number of rows, plus the parameter values for which the plan was built.
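If you want to capture the actual plan from T-SQL rather than toggling "Include Actual Execution Plan" in SSMS, one sketch, reusing the query from the question:

SET STATISTICS XML ON;  -- returns the actual plan as XML alongside the results

DECLARE @Date DATE = '31 March 2020';
DECLARE @Status NVARCHAR(256) = 'Archived';
SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @Status AND LastSeen < @Date;

SET STATISTICS XML OFF;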
If you want your query to use the actual parameter values every time, you can add the RECOMPILE option at the query level:
SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @Status AND LastSeen < @Date
option(recompile);
SQL Server optimizes queries based on row count estimates and heuristics. These estimates can differ between literals, local variables, and parameters.
With a literal or parameter (parameter declared in the app code and passed with the command), SQL Server estimates counts based on the actual value provided and the statistics histogram (if an index or statistics exist on the column). This generally results in accurate estimates and an optimal plan when stats are up-to-date.
With a local variable (T-SQL DECLARE statement) or OPTIMIZE FOR UNKNOWN query hint, SQL Server estimates counts based on the overall average density of values and ignores the actual value and histogram. This generally results in a compromise plan that might be good enough overall but can be suboptimal for certain values. Adding an OPTION (RECOMPILE) query hint to a query with local variables will instead use the actual local variable values for optimization and yield the same plan as if literals were specified.
Note that parameterized query plans without the RECOMPILE hint are cached and reused. Current parameter values are ignored when the plan is reused so the reused query plan might be suboptimal for the current parameter values. This is another case where OPTION (RECOMPILE) might improve performance.
Use the OPTION (RECOMPILE) hint judiciously, considering query execution frequency. The compilation overhead can outweigh the runtime savings for queries that are executed frequently (e.g. many times per second).
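To gauge that trade-off, you can compare the compile time against the run time; a quick sketch using the query from the question:

DECLARE @Date DATE = '31 March 2020';
DECLARE @Status NVARCHAR(256) = 'Archived';

SET STATISTICS TIME ON;  -- reports "parse and compile time" and "execution time" separately

SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @Status AND LastSeen < @Date
OPTION (RECOMPILE);

SET STATISTICS TIME OFF;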
With a literal date, the optimizer can determine whether a SEEK will outperform a SCAN. If the table has many years of data, but the query only asks for data after 29 Feb 2020, the optimizer can determine that it needs a small data set and will SEEK. The query will run relatively quickly.
The optimizer views a variable date as unknown. Therefore, the optimizer must build a plan that accounts for dates like 1 Jan 2001 or 12 Dec 2012. Large datasets do better with SCAN (index scan or table scan). Given the unknown value, the optimizer will often select SCAN. The query will run much longer because it is reading every row and not using the indexes.
To avoid the unknown, you can use the OPTIMIZE FOR query hint. But, depending on your use case, that may be no different than just using a literal.
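As a sketch of that hint, with the variables declared as in the question (either optimize for a representative literal, or explicitly for UNKNOWN):

DECLARE @Date DATE = '31 March 2020';
DECLARE @Status NVARCHAR(256) = 'Archived';

-- Optimize the plan for a specific, representative value (close to using a literal)
SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @Status AND LastSeen < @Date
OPTION (OPTIMIZE FOR (@Date = '29 February 2020'));

-- Or deliberately optimize for an unknown value (density-based estimate)
SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @Status AND LastSeen < @Date
OPTION (OPTIMIZE FOR (@Date UNKNOWN));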
Parameter sniffing usually refers to stored procedures. To avoid it, assign the procedure parameter to a local variable within the first line or two of the procedure. It isn't necessary to know the full explanation of parameter sniffing in order to avoid it.
I'm running Python pyodbc against SQL Server. I have a very complex query, which I've minimized here as
DECLARE @mydate DATETIME = '2021-02-08'
SELECT *
FROM atable
WHERE adate = @mydate AND afield = ?
On the Python side I'm executing the usual
crsr.execute(sql, field)
What is baffling me is that it returns all the results and ignores the condition afield = field, with no other errors, but in a strange order, so that when I plot the graph it is very confused! Why does this happen?
(Edit: of course I should have added an ORDER BY.)
I have already fixed the code with an initial
DECLARE @myfield VARCHAR(32) = ?
followed by the WHERE condition ending with afield = @myfield, and now it works as expected: the order is the normal one even though I have not introduced an explicit ORDER BY.
In other words, aside from the fact that the final correct fix is adding an ORDER BY, e.g.
SELECT *
FROM atable
WHERE adate = @mydate AND afield = ?
ORDER BY id
I'm wondering why the change described above was enough to alter the order.
Because your SQL connection driver does not seem to support proper parameters, it is embedding the parameter as text. I have no idea how good its sanitizing and escaping method is, but this is usually not a good idea because of the risk of SQL injection.
I note that the pymssql driver does support named parameters.
What can happen is that when the compiler can see the exact value you want to use, it will calculate the statistics of how many rows are likely to match (the "cardinality"), and therefore it may choose a different access pattern based on what indexes are available.
When the value comes through as a proper parameter (when using a good SQL driver), the parameter is "sniffed", and the compiler will use the statistics based on the value in the first run of the query. This can be of benefit if there is a commonly-used value.
If the value is in a local variable, then parameter-sniffing is not used, and the compiler uses the average cardinality for that predicate. This may be better or worse than "sniffing".
It might sound like embedding the value is better, but this is not the case.
When the value is embedded, each query stands on its own and must be compiled again every time. This can have a big impact on execution time and CPU usage, as well as on the memory used to store the query plans.
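For contrast, a parameterized call on the server side looks roughly like the sketch below (sp_executesql is what most drivers generate); the plan is cached once and reused across values. The table and columns are borrowed from the first question in this thread:

EXEC sp_executesql
    N'SELECT Count(Id) FROM dbo.BeaconScan WHERE State = @Status AND LastSeen < @Date;',
    N'@Status NVARCHAR(256), @Date DATE',
    @Status = N'Archived',
    @Date = '2020-02-29';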
I have the following queries:
DECLARE @application_number CHAR(8) = '37832904';
SELECT
la.LEASE_NUMBER AS lease_number,
la.[LEASE_APPLICATION] AS application_number,
tnu.[FOLLOWUP_CODE] AS note_type_code -- catch codes not in codes table
FROM [dbo].[lease_applications] la
LEFT JOIN [dbo].tickler_notes_uniq tnu ON tnu.[ACCOUNT_NUMBER] = la.[ACCOUNT_NUMBER]
WHERE la.LEASE_APPLICATION = @application_number
OR @application_number IS NULL;
SELECT
la.LEASE_NUMBER AS lease_number,
la.[LEASE_APPLICATION] AS application_number,
tnu.[FOLLOWUP_CODE] AS note_type_code -- catch codes not in codes table
FROM [dbo].[lease_applications] la
LEFT JOIN [dbo].tickler_notes_uniq tnu ON tnu.[ACCOUNT_NUMBER] = la.[ACCOUNT_NUMBER]
WHERE la.LEASE_APPLICATION = @application_number;
The only difference between these 2 queries is that in the first I've added a check for whether the variable is NULL.
The execution plans of these queries are:
You can find graphical plan here
So the question is: why are the plans so different?
UPDATE:
The actual execution plan of the first query can be found here
OPTION(RECOMPILE) changed the actual execution plan to the good one. However, the downside is that my main goal was to create a TVF with these parameters, and then everybody who uses that function would be expected to supply that option.
It is also worth mentioning that my main goal is to create a TVF with 2 parameters. Each of them might or might not be null, but at least 1 of them is supposed to be NOT NULL. The parameters are more or less equivalent; they are just different keys into the 2 tables that would give the same result anyway (the same number of rows and so on). That's why I wanted to do something like
WHERE (col1 = @param1 OR @param1 IS NULL) AND (col2 = @param2 OR @param2 IS NULL) AND (@param1 IS NOT NULL OR @param2 IS NOT NULL)
So, basically I am not interested in ALL records at all
You have two different plans for two different queries.
It makes sense that when you have an equality condition in the WHERE clause (la.LEASE_APPLICATION = @application_number), and indexes in place, you get an index seek: working as expected!
On the other hand, when you write both conditions into one WHERE clause (la.LEASE_APPLICATION = @application_number OR @application_number IS NULL), the query optimizer has chosen to do a scan.
Even though the parameter value has been supplied and is not null, the plan being used is the cached one, and at compile time the optimizer cannot know the actual value of your parameter.
This is the case if you have a stored procedure and you are calling it with parameters. This is not the case when executing a simple query using a variable.
As @sepupic has stated, variable values do not get sniffed.
The plan is generated to handle both cases: when you have a value for your parameter as well as when you have none.
One option to fix your problem would be using OPTION(RECOMPILE) as it has been stated already in the comments.
Another option would be to have your queries separated (for example, two different stored procedures called by a third "wrapper" procedure), so that each one gets optimized on its own.
I would suggest you take a look at this article by Kimberly L. Tripp: Building High Performance Stored Procedures, and this one by Aaron Bertrand: An Updated "Kitchen Sink" Example. I think these are the best articles explaining this kind of scenario.
Both articles explain this situation, the problems it can cause, and possible solutions such as OPTION(RECOMPILE), dynamic SQL, or separate stored procedures.
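As a rough sketch of the OPTION(RECOMPILE) route inside a wrapper procedure (the procedure name is made up, col1/col2 are the placeholder column names from the question, and the parameter types are assumptions; note that a query-level OPTION clause cannot be used inside an inline TVF, which is one reason to prefer a procedure here):

CREATE PROCEDURE dbo.GetLeaseApplications  -- hypothetical name
    @param1 CHAR(8)     = NULL,
    @param2 VARCHAR(32) = NULL  -- assumed types; at least one value must be non-NULL
AS
BEGIN
    SELECT la.LEASE_NUMBER, la.LEASE_APPLICATION, tnu.FOLLOWUP_CODE
    FROM [dbo].[lease_applications] la
    LEFT JOIN [dbo].tickler_notes_uniq tnu
        ON tnu.[ACCOUNT_NUMBER] = la.[ACCOUNT_NUMBER]
    WHERE (col1 = @param1 OR @param1 IS NULL)
      AND (col2 = @param2 OR @param2 IS NULL)
      AND (@param1 IS NOT NULL OR @param2 IS NOT NULL)
    OPTION (RECOMPILE);  -- recompiled with the current values, so the NULL branches fold away
END;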
Good luck!
Your queries do not use parameters; they use a variable. The variable is not sniffed at the moment the batch is compiled (compilation = making a plan) because the batch is seen as one whole thing. So the server has no idea whether the variable is null or not null, and it must make a plan that will be suitable in both cases.
The first query may filter out no rows at all, so the scan is selected.
The second query does filter, but the value is unknown, so if you use SQL Server 2014 and the filtered column is not unique, the estimate is C^(3/4) (where C is the table cardinality).
The situation is different if you use the RECOMPILE query option. When you add it to your query, the query is recompiled AFTER the variable assignment is done. In this case the variable value is known, and you'll get another plan: one based on column statistics for the known value of your filter.
I have some long-running (a few hours) stored procedures that contain queries against tables holding millions of records in a distributed environment. These stored procedures take a date parameter and filter the tables according to that date parameter.
I've been thinking that, because of SQL Server's parameter sniffing feature, the first time my stored procedure gets called the query execution plan will be cached according to that specific date, and any future calls will use that exact plan. Since creating an execution plan takes only a few seconds, why would I not use the RECOMPILE option in my long-running queries? Does it have any cons that I have missed?
If the query runs within your acceptable performance limits and you suspect parameter sniffing is the cause, I suggest you add the RECOMPILE hint to the query.
Also, if the query is part of a stored procedure, instead of recompiling the entire procedure you can do a statement-level recompilation, like:
create proc procname
(
@a int
)
as
begin
    -- statement-level recompile: only this query gets a fresh plan on each run
    select * from [table] where a = @a
    option (recompile);

    -- no recompile here; this statement reuses its cached plan
    select *
    from [table] t1
    join t2 on t1.id = t2.id;
end
Also, bear in mind that recompiling a query will cost you. But to quote Paul White:
There is a price to pay for the plan compilation on every execution, but the improved plan quality often repays this cost many times over.
Query Store in SQL Server 2016 helps you track these issues and also stores plans for queries over time, so you can see which ones are performing worse.
If you are not on 2016, William Durkin has developed Open Query Store for earlier versions (2008-2014), which works more or less the same way and helps you troubleshoot these issues.
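Enabling Query Store is a per-database setting; a minimal sketch (the database name is a placeholder):

ALTER DATABASE YourDatabase SET QUERY_STORE = ON;  -- hypothetical database name
ALTER DATABASE YourDatabase SET QUERY_STORE (OPERATION_MODE = READ_WRITE);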
Further reading:
Parameter Sniffing, Embedding, and the RECOMPILE Options
I have a view that runs fast (< 1s) when specifying a value in the where clause:
SELECT *
FROM vwPayments
WHERE AccountId = 8155
...but runs slow (~3s) when that value is a variable:
DECLARE @AccountId BIGINT = 8155
SELECT *
FROM vwPayments
WHERE AccountId = @AccountId
Why is the execution plan different for the second query? Why is it running so much slower?
In the first case the parameter value was known while compiling the statement. The optimizer used the statistics histogram to generate the best plan for that particular parameter value.
When you define a local variable, SQL Server is not able to use its value to find the optimal plan. Since the value is unknown at compile time, the optimizer calculates an estimated number of rows based on a 'uniform distribution'. The optimizer comes up with a plan that would be 'good enough' for any possible input value.
Another interesting article that almost exactly describes your case can be found here.
In short, the statistical analysis the query optimizer uses to pick the best plan chooses a seek when the value is known, because it can leverage statistics, and a scan when the value is not known, because the plan is compiled before the value in the WHERE clause is known.
While I rarely recommend bossing the query analyzer around, in this specific case you can use a FORCESEEK hint or other query hints to override the engine. Be aware, however, that finding a way to get an optimal plan with the engine's help is a MUCH better solution.
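For reference, a sketch of what that hint looks like against the view from the question (treat it as a diagnostic, not a fix):

DECLARE @AccountId BIGINT = 8155;
SELECT *
FROM vwPayments WITH (FORCESEEK)  -- forces a seek access path on the underlying index
WHERE AccountId = @AccountId;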
I did a quick Google search and found a decent article that goes more deeply into the concept of local variables affecting query plans.
DECLARE @Local_AccountId BIGINT = @AccountId
SELECT *
FROM vwPayments
WHERE AccountId = @Local_AccountId
OPTION(RECOMPILE)
It works for me
It could be parameter sniffing. Try the following - I assume this is in a stored procedure?
DECLARE @Local_AccountId BIGINT = @AccountId
SELECT *
FROM vwPayments
WHERE AccountId = @Local_AccountId
For details about parameter sniffing, you can view this link : http://blogs.technet.com/b/mdegre/archive/2012/03/19/what-is-parameter-sniffing.aspx
See if the results are different. I have encountered this problem several times, especially when the query is called a lot during peaks and the cached execution plan is one that was created off-peak.
Another option, which you should not need in your case, is adding WITH RECOMPILE to the procedure definition. This causes the procedure to be recompiled every time it is called. See http://www.techrepublic.com/article/understanding-sql-servers-with-recompile-option/5662581
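A sketch of that procedure-level option, with a hypothetical procedure name around the query from the question:

CREATE PROCEDURE dbo.GetPaymentsByAccount  -- hypothetical name
    @AccountId BIGINT
WITH RECOMPILE  -- compile a fresh plan on every call; nothing is cached
AS
BEGIN
    SELECT *
    FROM vwPayments
    WHERE AccountId = @AccountId;
END;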
I think @souplex made a very good point.
Basically, in the first case it's just a number, which is easy for the system to understand, while the second is a variable, meaning the system needs to find the actual value of it and do the check for each statement - a different method.
I'm having trouble understanding the behavior of the estimated query plans for my statement in SQL Server when changing from a parameterized query to a non-parameterized query.
I have the following query:
DECLARE @p0 UniqueIdentifier = '1fc66e37-6eaf-4032-b374-e7b60fbd25ea'
SELECT [t5].[value2] AS [Date], [t5].[value] AS [New]
FROM (
SELECT COUNT(*) AS [value], [t4].[value] AS [value2]
FROM (
SELECT CONVERT(DATE, [t3].[ServerTime]) AS [value]
FROM (
SELECT [t0].[CookieID]
FROM [dbo].[Usage] AS [t0]
WHERE ([t0].[CookieID] IS NOT NULL) AND ([t0].[ProductID] = @p0)
GROUP BY [t0].[CookieID]
) AS [t1]
OUTER APPLY (
SELECT TOP (1) [t2].[ServerTime]
FROM [dbo].[Usage] AS [t2]
WHERE ((([t1].[CookieID] IS NULL) AND ([t2].[CookieID] IS NULL))
OR (([t1].[CookieID] IS NOT NULL) AND ([t2].[CookieID] IS NOT NULL)
AND ([t1].[CookieID] = [t2].[CookieID])))
AND ([t2].[CookieID] IS NOT NULL)
AND ([t2].[ProductID] = @p0)
ORDER BY [t2].[ServerTime]
) AS [t3]
) AS [t4]
GROUP BY [t4].[value]
) AS [t5]
ORDER BY [t5].[value2]
This query is generated by a Linq2SQL expression and extracted from LINQPad. It produces a nice query plan (as far as I can tell) and executes in about 10 seconds on the database. However, if I replace the two uses of the parameter with the exact value, that is, replace the two '= @p0' parts with '= '1fc66e37-6eaf-4032-b374-e7b60fbd25ea'', I get a different estimated query plan and the query now runs much longer (more than 60 seconds; I haven't seen it through).
Why does performing this seemingly innocent replacement produce a much less efficient query plan and execution? I have cleared the procedure cache with DBCC FREEPROCCACHE to ensure that I was not caching a bad plan, but the behavior remains.
My real problem is that I can live with the 10-second execution time (at least for a good while), but I can't live with the 60+ second execution time. My query will (as hinted above) be produced by Linq2SQL, so it is executed on the database as
exec sp_executesql N'
...
WHERE ([t0].[CookieID] IS NOT NULL) AND ([t0].[ProductID] = @p0)
...
AND ([t2].[ProductID] = @p0)
...
',N'@p0 uniqueidentifier',@p0='1FC66E37-6EAF-4032-B374-E7B60FBD25EA'
which produces the same poor execution time (which I think is doubly strange, since this appears to be using parameterized queries).
I'm not looking for advice on which indexes to create or the like; I'm just trying to understand why the query plan and execution are so dissimilar for three seemingly similar queries.
EDIT: I have uploaded execution plans for the non-parameterized and the parameterized query as well as an execution plan for a parameterized query (as suggested by Heinz) with a different GUID here
Hope it helps you help me :)
If you provide an explicit value, SQL Server can use statistics of this field to make a "better" query plan decision. Unfortunately (as I've experienced myself recently), if the information contained in the statistics is misleading, sometimes SQL Server just makes the wrong choices.
If you want to dig deeper into this issue, I recommend you check what happens if you use other GUIDs: if a different query plan is used for different concrete GUIDs, that's an indication that statistics data is being used. In that case, you might want to look at sp_updatestats and related commands.
EDIT: Have a look at DBCC SHOW_STATISTICS: The "slow" and the "fast" GUID are probably in different buckets in the histogram. I've had a similar problem, which I solved by adding an INDEX table hint to the SQL, which "guides" SQL Server towards finding the "right" query plan. Basically, I've looked at what indices are used during a "fast" query and hard-coded those into the SQL. This is far from an optimal or elegant solution, but I haven't found a better one yet...
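For example, to inspect the histogram and refresh the statistics (using the table and index named later in this thread):

-- Show the histogram for a statistics object; check which bucket each GUID falls into
DBCC SHOW_STATISTICS ('dbo.Usage', IX_NonCluster_ProductID_CookieID_With_ServerTime);

-- Rebuild out-of-date statistics across the current database
EXEC sp_updatestats;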
I'm not looking for advice on which indexes to create or the like; I'm just trying to understand why the query plan and execution are so dissimilar for three seemingly similar queries.
You seem to have two indexes:
IX_NonCluster_Config (ProductID, ServerTime)
IX_NonCluster_ProductID_CookieID_With_ServerTime (ProductID, CookieID) INCLUDE (ServerTime)
The first index does not cover CookieID but is ordered on ServerTime, and hence is more efficient for the less selective ProductIDs (i.e. those of which you have many).
The second index does cover all columns but is not ordered, and hence is more efficient for the more selective ProductIDs (those of which you have few).
On average, your ProductID cardinality is such that SQL Server expects the second method to be efficient, which is what it uses when you use parameterized queries or explicitly provide selective GUIDs.
However, your original GUID is considered less selective, which is why the first method is used.
Unfortunately, the first method requires additional filtering on CookieID, which is why it is in fact less efficient.
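If you want to test that explanation, an INDEX table hint can force the covering index on the inner query (a diagnostic sketch, not a recommended fix):

DECLARE @p0 UNIQUEIDENTIFIER = '1fc66e37-6eaf-4032-b374-e7b60fbd25ea';

SELECT [t0].[CookieID]
FROM [dbo].[Usage] AS [t0]
    WITH (INDEX (IX_NonCluster_ProductID_CookieID_With_ServerTime))  -- force the covering index
WHERE ([t0].[CookieID] IS NOT NULL) AND ([t0].[ProductID] = @p0)
GROUP BY [t0].[CookieID];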
My guess is that when you take the non-parameterized route, your GUID has to be converted from a varchar to a uniqueidentifier, which may prevent an index from being used, while it will be used when taking the parameterized route.
I've seen this happen with queries that have a smalldatetime in the WHERE clause against a column that uses a datetime.
It's difficult to tell without looking at the execution plans; however, if I were going to guess at a reason, I'd say it's a combination of parameter sniffing and poor statistics. In the case where you hard-code the GUID into the query, the query optimiser attempts to optimise the query for that value of the parameter. I believe the same thing happens with the parameterised / prepared query (this is called parameter sniffing - the execution plan is optimised for the parameters used the first time the prepared statement is executed); however, this definitely doesn't happen when you declare the value as a local variable and use it in the query.
Like I said, SQL Server attempts to optimise the execution plan for that value, so usually you should see better results. It seems here that the information it is basing its decisions on is incorrect or misleading, and you are better off (for some reason) when it optimises the query for a generic parameter value.
This is mostly guesswork, however - it's impossible to tell without the execution plans. If you can upload them somewhere, I'm sure someone will be able to help you with the real reason.