Sql Server predicates lazy? - sql-server

I have a query:
SELECT
someFields
FROM
someTable
WHERE
cheapLookup=1
AND (CAST(someField as FLOAT)/otherField)<0.9
So, will the CAST and division be performed in the case that cheapLookup is 0? If not, how can I avoid the calculation in this case?

It depends on the query plan, which is determined by the estimated cost of each considered alternative plan that would produce correct results.
If the predicate 'cheapLookup = 1' can use an index, and it is sufficiently selective, SQL Server would likely choose to seek on that index and apply the second predicate as a residual (that is, only evaluating it on rows that are matched by the seeking operation).
On the other hand, if cheapLookup is not the leading key in an index, or if it is not very selective, SQL Server might choose to scan, applying both predicates to every row encountered.
The second predicate will not be chosen for a seeking operation, unless there happens to be an indexed computed column on the whole expression, and using that index turns out to be the cheapest way to execute the whole query. If a suitable index exists, SQL Server would seek on 'second predicate result < 0.9', and apply 'cheapLookup=1' as a residual. There is also the possibility that the indexed computed column has cheapLookup as its second key, which would result in a pure seek, with no residual.
The other thing about the second predicate is that without a computed column (whether or not indexed), SQL Server will have to guess at the selectivity of the expression. With the computed column, the server might be able to create statistics on the expression-result column, which will help the optimizer. Note that a computed column on 'CAST(someField as FLOAT)/otherField' would have to be persisted before it could be indexed or have statistics created on it, because it contains an imprecise data type.
In summary, it's not the complexity of the expression that counts so much as the estimated cost of the whole plan that uses each of the available access methods considered by the optimizer.
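The indexed computed column described above might look roughly like the following. This is only a sketch under assumptions: the column name ratio and the index name IX_someTable_ratio are made up for illustration, NULLIF is added solely so that persisting the column does not fail on rows where otherField is 0, and someFields stands in for the real column list as in the question. Whether the optimizer actually uses the index still depends on the estimated cost of the whole plan.

```sql
-- Hypothetical names: ratio, IX_someTable_ratio.
-- The expression yields FLOAT (an imprecise type), so the column has to be
-- PERSISTED before it can be indexed.
ALTER TABLE someTable
    ADD ratio AS (CAST(someField AS FLOAT) / NULLIF(otherField, 0)) PERSISTED;
GO  -- batch separator understood by SSMS / sqlcmd

-- Expression result as the leading key, cheapLookup as the second key
CREATE INDEX IX_someTable_ratio ON someTable (ratio, cheapLookup);
GO

-- The query can then filter on the stored value instead of recomputing it
SELECT someFields
FROM someTable
WHERE cheapLookup = 1
  AND ratio < 0.9;
```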

SQL is declarative: you tell the database what you want, not how you want it done. The database is entirely free to evaluate lazily or eagerly. In fact, it can evaluate thrice in reverse order for all I know :)
In rare cases, you can improve performance by reframing your query in such a way that it avoids a specific expensive operation. For example, moving the floating point math to a separate query would force lazy evaluation:
-- Step 1: materialize only the rows that pass the cheap predicate
declare @t table (id int, someField float, otherField float)
insert @t select id, someField, otherField from someTable
where cheapLookup = 1
-- Step 2: apply the expensive predicate to the smaller intermediate set
delete @t where (CAST(someField as FLOAT)/otherField) >= 0.9
select * from @t
In your example, I would expect SQL Server to choose the best way without any hints or tricks.

What you're referring to is short-circuiting, which other languages (e.g. C#) support.
I believe SQL Server can short-circuit, but whether it does depends on the scenario and on what the optimizer decides, so there is certainly no guarantee that it will. It just might.
Excellent reference on this by Remus Rusanu here: http://rusanu.com/2009/09/13/on-sql-server-boolean-operator-short-circuit/
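When the concern is an error (such as division by zero) rather than pure cost, a commonly used guard is to wrap the expensive expression in CASE, whose WHEN conditions are evaluated in order in most situations. The sketch below assumes the table and columns from the question; it is one way to express the guard, not a guarantee about the plan SQL Server will pick.

```sql
SELECT someFields
FROM someTable
WHERE cheapLookup = 1
  AND CASE
          -- Only rows that pass the cheap checks reach the division
          WHEN cheapLookup = 1 AND otherField <> 0
              THEN CAST(someField AS FLOAT) / otherField
          -- No ELSE branch: other rows yield NULL, and NULL < 0.9 is not true
      END < 0.9;
```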

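It depends on how SQL Server optimizes the query; you could run it in Query Analyzer and look at the execution plan for your particular case.
One way to state the intent explicitly would be to use a CTE, although SQL Server generally inlines CTEs, so this is not guaranteed to change the plan: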
WITH QueryResult AS (
SELECT
someFields
FROM
someTable
WHERE
cheapLookup=1
)
SELECT * FROM QueryResult WHERE (CAST(someField as FLOAT)/otherField)<0.9

Related

Query is fast with direct comparison, but not with table comparison with same column index

I have a fairly complex query that does a direct comparison with @EventId if provided, and it is fast since it grabs the clustered index row. However, sometimes I have to filter on a group of these event IDs, and the second line of the WHERE clause takes almost 30 seconds to run. I figured it would work the same way as looking up the primary key. Is there a reason why it's so much slower?
DECLARE @EventIds TABLE(Id INT NOT NULL);
WHERE
(@EventId IS NULL OR (ev.Id = @EventId)) AND
(NOT EXISTS(SELECT 1 FROM @EventIds) OR ev.Id IN (SELECT * FROM @EventIds))
There's no real good reason to have the expression
NOT EXISTS(SELECT 1 FROM @EventIds) OR ev.Id IN (SELECT * FROM @EventIds)
First, even if the first expression is true, that doesn't preclude the evaluation of the second expression, because SQL Server doesn't guarantee short-circuit evaluation of boolean expressions.
Second, table variables have been known to cause bad execution plans, because they carry no column statistics and their row count is often misestimated. Please refer to this essay on the difference between table variables and temporary tables (topics: Cardinality, and No column statistics).
It might help to add the following query hint at the end of the query:
OPTION(RECOMPILE);
Yes this recompiles the plan each time, but if you're getting horrible performance the small additional compile time doesn't matter that much.
This query hint is also recommended if you have optional filters, as you have with @EventId.
It may also help to define a primary key on Id in the @EventIds table variable. This would allow an index seek instead of a table scan.
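Putting the two suggestions together might look like the sketch below. The table name dbo.Events and the selected column are assumptions, since the question only shows the WHERE clause; the rest follows the fragment in the question.

```sql
DECLARE @EventId INT = NULL;

-- A primary key gives the table variable a unique clustered index on Id
DECLARE @EventIds TABLE (Id INT NOT NULL PRIMARY KEY);

SELECT ev.Id
FROM dbo.Events AS ev          -- hypothetical table name
WHERE (@EventId IS NULL OR ev.Id = @EventId)
  AND (NOT EXISTS (SELECT 1 FROM @EventIds)
       OR ev.Id IN (SELECT Id FROM @EventIds))
OPTION (RECOMPILE);            -- compile with the actual variable values and row count
```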

SQL Server : Multiple Where Clauses

Suppose I have a T-SQL command with multiple WHERE conditions like this:
SELECT *
FROM TableName
WHERE Column1 NOT LIKE '%exclude%'
AND Column2 > 10
Would the query exclude a row as soon as Column1 was not met or would it still go on to test the next condition for Column2?
I am asking because I want to see if it would be more efficient to swap my conditions around to first test if Column2 > 10 before I run a more time-consuming condition.
Edit: If it matters, Column1 is of type bigint and Column2 is of type ntext
SQL Server will devise a query plan based on the available indexes and statistics. SQL doesn't necessarily have "short-circuit" expression evaluation per se, because it is a declarative rather than a procedural language, but the query plan may ultimately perform short-circuit evaluation.
Swapping the expressions should not affect performance.
As Marc said, swapping columns in where clause will not make any change in performance. Instead, you could look for changing the data type NTEXT into nvarchar(X) where x represents some meaningful data length.

How can I force a subquery to perform as well as a #temp table?

I am re-iterating the question asked by Mongus Pong Why would using a temp table be faster than a nested query? which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are:
1. The subquery or CTE may be being repeatedly re-evaluated.
2. Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
3. Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works however there are no statistics on the spool.
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force the SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
qty smallint NOT NULL) AS
BEGIN
INSERT @t (title, qty)
SELECT t.title, s.qty
FROM sales s
JOIN titles t ON t.title_id = s.title_id
WHERE s.stor_id = @storeid
RETURN
END
GO
CREATE VIEW SalesData As
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in incorrect order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.
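For illustration, the TOP workaround applied to the query above might look like this. It is a sketch of the hack, not a recommendation: the literal 2147483647 is just the largest INT value chosen as a "ridiculous" limit, ordering by Id is an assumption, and nothing guarantees the optimizer will evaluate the derived table first.

```sql
SELECT bar.*
FROM
    -- TOP with a huge limit: the outer CreatedOn filter cannot be pushed
    -- below the TOP, so the Deleted = 1 filter is applied inside the subquery
    (SELECT TOP (2147483647) *
     FROM TableFoo
     WHERE Deleted = 1
     ORDER BY Id) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE());
```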

Index not used due to type conversion?

I have a process that is performing badly due to full table scans on a particular table. I have computed statistics, rebuilt existing indices and tried adding new indices for this table but this hasn't solved the issue.
Can an implicit type conversion stop an index being used? What about other reasons? The cost of a full table scan is around 1000 greater than the index lookup should be.
EDIT:
SQL statement:
select unique_key
from src_table
where natural_key1 = :1
and natural_key2 = :2
and natural_key3 = :3;
Cardinality of natural_key1 is high, but there is a type conversion.
The other parts of the natural key are low cardinality, and bitmap indices are not enabled.
Table size is around 1,000,000 records.
Java code (not easily modifiable):
ps.setLong(1, oid);
This conflicts with the column datatype: varchar2
An implicit conversion can prevent an index from being used by the optimizer. Consider:
SQL> CREATE TABLE a (ID VARCHAR2(10) PRIMARY KEY);
Table created
SQL> insert into a select rownum from dual connect by rownum <= 1e6;
1000000 rows inserted
This is a simple table, but the datatype is not 'right', i.e. if you query it like this it will do a full scan:
SQL> select * from a where id = 100;
ID
----------
100
This query is in fact equivalent to:
select * from a where to_number(id) = 100;
It cannot use the index since we indexed id and not to_number(id). If we want to use the index we will have to be explicit:
select * from a where id = '100';
In reply to pakr's comment:
There are lots of rules concerning implicit conversions. One good place to start is the documentation. Among other things, we learn that:
During SELECT FROM operations, Oracle converts the data from the column to the type of the target variable.
It means that when implicit conversion occurs during a "WHERE column=variable" clause, Oracle will convert the datatype of the column and NOT of the variable, therefore preventing an index from being used. This is why you should always use the right kind of datatypes or explicitly convert the variable.
From the Oracle doc:
Oracle recommends that you specify explicit conversions, rather than rely on implicit or automatic conversions, for these reasons:
SQL statements are easier to understand when you use explicit datatype conversion functions.
Implicit datatype conversion can have a negative impact on performance, especially if the datatype of a column value is converted to that of a constant rather than the other way around.
Implicit conversion depends on the context in which it occurs and may not work the same way in every case. For example, implicit conversion from a datetime value to a VARCHAR2 value may return an unexpected year depending on the value of the NLS_DATE_FORMAT parameter.
Algorithms for implicit conversion are subject to change across software releases and among Oracle products. Behavior of explicit conversions is more predictable.
Make your condition sargable, that is, compare the field itself to a constant condition.
This is bad:
SELECT *
FROM mytable
WHERE TRUNC(date) = TO_DATE('2009.07.21')
, since it cannot use the index. Oracle cannot reverse the TRUNC() function to get the range bounds.
This is good:
SELECT *
FROM mytable
WHERE date >= TO_DATE('2009.07.21')
AND date < TO_DATE('2009.07.22')
To get rid of implicit conversion, well, use explicit conversion:
This is bad:
SELECT *
FROM mytable
WHERE guid = '794AB5396AE5473DA75A9BF8C4AA1F74'
-- This uses implicit conversion. In fact this is RAWTOHEX(guid) = '794AB5396AE5473DA75A9BF8C4AA1F74'
This is good:
SELECT *
FROM mytable
WHERE guid = HEXTORAW('794AB5396AE5473DA75A9BF8C4AA1F74')
Update:
This query:
SELECT unique_key
FROM src_table
WHERE natural_key1 = :1
AND natural_key2 = :2
AND natural_key3 = :3
heavily depends on the type of your fields.
Explicitly cast your variables to the field type, as if from string.
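Applied to the query in the question, an explicit cast on the bind variable might look like the sketch below. It assumes only natural_key1 has the type mismatch (a VARCHAR2 column receiving a number from setLong) and that the stored strings match the default TO_CHAR formatting of the number, for example no leading zeros or padding.

```sql
SELECT unique_key
FROM   src_table
WHERE  natural_key1 = TO_CHAR(:1)   -- convert the bind value, not the indexed column
AND    natural_key2 = :2
AND    natural_key3 = :3;
```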
You could use a function-based index.
Your query is:
select
unique_key
from
src_table
where
natural_key1 = :1
In your case the index isn't being used because natural_key1 is a varchar2 and :1 is a number. Oracle is converting your query to:
select
unique_key
from
src_table
where
to_number(natural_key1) = :1
So... put on an index for to_number(natural_key1):
create index ix_src_table_fnk1 on src_table(to_number(natural_key1));
Your query will now use the ix_src_table_fnk1 index.
Of course, better to get your Java programmers to do it properly in the first place.
What happens to your query if you run it with an explicit conversion around the argument (e.g., to_char(:1) or to_number(:1) as appropriate)? If doing so makes your query run fast, you have your answer.
However, if your query still runs slowly with the explicit conversion, there may be another issue. You don't mention what version of Oracle you're running, but if your high-cardinality column (natural_key1) has values with a very skewed distribution, you may be using a query plan generated when the query was first run, which used an unfavorable value for :1.
For example, if your table of 1 million rows had 400,000 rows with natural_key1 = 1234, and the remaining 600,000 were unique (or nearly so), the optimizer would not choose the index if your query constrained on natural_key1 = 1234. Since you're using bind variables, if that was the first time you ran the query, the optimizer would choose that plan for all subsequent runs.
One way to test this theory would be to run this command before running your test statement:
alter system flush shared_pool;
This will remove all query plans from the optimizer's brain, so the next statement run will be optimized fresh. Alternatively, you could run the statement as straight SQL with literals, no bind variables. If it ran well in either case, you'd know your problem was due to plan corruption.
If that is the case, you don't want to use that alter system command in production - you'll probably ruin the rest of your system's performance if you run it regularly, but you could get around it by using dynamic sql instead of bind variables, or if it is possible to determine ahead of time that :1 is non-selective, use a slightly different query for the nonselective cases (such as re-ordering the conditions in the WHERE clause, which will cause the optimizer to use a different plan).
Finally, you can try adding an index hint to your query, e.g.:
SELECT /*+ INDEX(src_table,<name of index for natural_key1>) */
unique_key
FROM src_table
WHERE natural_key1 = :1
AND natural_key2 = :2
AND natural_key3 = :3;
I'm not a big fan of index hints - they're a pretty fragile method of programming. If the name changed on the index down the road, you'd never know it until your query started to perform poorly, plus you're potentially shooting yourself in the foot if server upgrades or data distribution changes result in the optimizer being able to choose an even better plan.

SQL Server query plan differences

I'm having trouble understanding the behavior of the estimated query plans for my statement in SQL Server when I change from a parameterized query to a non-parameterized query.
I have the following query:
DECLARE @p0 UniqueIdentifier = '1fc66e37-6eaf-4032-b374-e7b60fbd25ea'
SELECT [t5].[value2] AS [Date], [t5].[value] AS [New]
FROM (
SELECT COUNT(*) AS [value], [t4].[value] AS [value2]
FROM (
SELECT CONVERT(DATE, [t3].[ServerTime]) AS [value]
FROM (
SELECT [t0].[CookieID]
FROM [dbo].[Usage] AS [t0]
WHERE ([t0].[CookieID] IS NOT NULL) AND ([t0].[ProductID] = @p0)
GROUP BY [t0].[CookieID]
) AS [t1]
OUTER APPLY (
SELECT TOP (1) [t2].[ServerTime]
FROM [dbo].[Usage] AS [t2]
WHERE ((([t1].[CookieID] IS NULL) AND ([t2].[CookieID] IS NULL))
OR (([t1].[CookieID] IS NOT NULL) AND ([t2].[CookieID] IS NOT NULL)
AND ([t1].[CookieID] = [t2].[CookieID])))
AND ([t2].[CookieID] IS NOT NULL)
AND ([t2].[ProductID] = @p0)
ORDER BY [t2].[ServerTime]
) AS [t3]
) AS [t4]
GROUP BY [t4].[value]
) AS [t5]
ORDER BY [t5].[value2]
This query is generated by a Linq2SQL expression and extracted from LINQPad. This produces a nice query plan (as far as I can tell) and executes in about 10 seconds on the database. However, if I replace the two uses of the parameter with the exact value, that is, replace the two '= @p0' parts with '= '1fc66e37-6eaf-4032-b374-e7b60fbd25ea' ', I get a different estimated query plan and the query now runs much longer (more than 60 seconds, I haven't seen it through).
Why is it that performing the seemingly innocent replacement produces a much less efficient query plan and execution? I have cleared the procedure cache with 'DBCC FreeProcCache' to ensure that I was not caching a bad plan, but the behavior remains.
My real problem is that I can live with the 10 seconds execution time (at least for a good while) but I can't live with the 60+ seconds execution time. My query will (as hinted above) be produced by Linq2SQL, so it is executed on the database as
exec sp_executesql N'
...
WHERE ([t0].[CookieID] IS NOT NULL) AND ([t0].[ProductID] = @p0)
...
AND ([t2].[ProductID] = @p0)
...
',N'@p0 uniqueidentifier',@p0='1FC66E37-6EAF-4032-B374-E7B60FBD25EA'
which produces the same poor execution time (which I think is doubly strange, since this seems to be using parameterized queries).
I'm not looking for advice on which indexes to create or the like, I'm just trying to understand why the query plan and execution are so dissimilar on three seemingly similar queries.
EDIT: I have uploaded execution plans for the non-parameterized and the parameterized query as well as an execution plan for a parameterized query (as suggested by Heinz) with a different GUID here
Hope it helps you help me :)
If you provide an explicit value, SQL Server can use statistics of this field to make a "better" query plan decision. Unfortunately (as I've experienced myself recently), if the information contained in the statistics is misleading, sometimes SQL Server just makes the wrong choices.
If you want to dig deeper into this issue, I recommend you to check what happens if you use other GUIDs: If it uses a different query plan for different concrete GUIDs, that's an indication that statistics data is used. In that case, you might want to look at sp_updatestats and related commands.
EDIT: Have a look at DBCC SHOW_STATISTICS: The "slow" and the "fast" GUID are probably in different buckets in the histogram. I've had a similar problem, which I solved by adding an INDEX table hint to the SQL, which "guides" SQL Server towards finding the "right" query plan. Basically, I've looked at what indices are used during a "fast" query and hard-coded those into the SQL. This is far from an optimal or elegant solution, but I haven't found a better one yet...
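A minimal sketch of the statistics check suggested above, assuming the dbo.Usage table from the question. The index name is the one quoted in another answer in this thread and may differ on your system; substitute whichever statistics object covers ProductID.

```sql
-- Inspect the histogram to see which bucket each GUID falls into
DBCC SHOW_STATISTICS ('dbo.Usage', IX_NonCluster_ProductID_CookieID_With_ServerTime)
    WITH HISTOGRAM;

-- Refresh the statistics from a full scan if they look stale or misleading
UPDATE STATISTICS dbo.Usage WITH FULLSCAN;
```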
I'm not looking for advice on which indexes to create or the like, I'm just trying to understand why the query plan and execution are so dissimilar on three seemingly similar queries.
You seem to have two indexes:
IX_NonCluster_Config (ProductID, ServerTime)
IX_NonCluster_ProductID_CookieID_With_ServerTime (ProductID, CookieID) INCLUDE (ServerTime)
The first index does not cover CookieID but is ordered on ServerTime, and hence is more efficient for the less selective ProductIDs (i.e. those with many rows).
The second index does cover all columns but is not ordered on ServerTime, and hence is more efficient for the more selective ProductIDs (those with few rows).
On average, your ProductID cardinality is such that SQL Server expects the second method to be efficient, which is what it uses when you use parameterized queries or explicitly provide selective GUIDs.
However, your original GUID is considered less selective, which is why the first method is used.
Unfortunately, the first method requires additional filtering on CookieID, which is why it is in fact less efficient.
My guess is that when you take the non-parameterized route, your GUID has to be converted from a varchar to a UniqueIdentifier, which may cause an index not to be used, while it will be used when taking the parameterized route.
I've seen this happen with queries that have a smalldatetime in the where clause against a column that uses a datetime.
It's difficult to tell without looking at the execution plans, but if I were going to guess at a reason, I'd say that it's a combination of parameter sniffing and poor statistics. In the case where you hard-code the GUID into the query, the query optimiser attempts to optimise the query for that value. I believe that the same thing happens with the parameterised / prepared query (this is called parameter sniffing: the execution plan is optimised for the parameters used the first time that the prepared statement is executed); however, this definitely doesn't happen when you declare the parameter as a local variable and use it in the query.
Like I said, SQL Server attempts to optimise the execution plan for that value, and so usually you should see better results. It seems here that the information it is basing its decisions on is incorrect or misleading, and you are better off (for some reason) when it optimises the query for a generic parameter value.
This is mostly guesswork, however. It's impossible to tell really without the execution plan. If you can upload the execution plan somewhere then I'm sure someone will be able to help you with the real reason.
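If optimising for a generic parameter value really is the better choice here, one way to ask for that explicitly is the OPTIMIZE FOR UNKNOWN query hint, which tells the optimizer to use average density statistics rather than the sniffed value. The sketch below uses a deliberately simplified statement against dbo.Usage rather than the full generated query, and it assumes you can change the SQL text, which may not be possible with Linq2SQL.

```sql
EXEC sp_executesql
    N'SELECT COUNT(*)
      FROM [dbo].[Usage] AS [t0]
      WHERE [t0].[ProductID] = @p0
      OPTION (OPTIMIZE FOR UNKNOWN);',   -- plan is built for a generic value, not the sniffed one
    N'@p0 uniqueidentifier',
    @p0 = '1FC66E37-6EAF-4032-B374-E7B60FBD25EA';
```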
