Does wrapping nullable columns in ISNULL cause table scans? - sql-server

Code analysis rule SR0007 for Visual Studio 2010 database projects states that:
You should explicitly indicate how to handle NULL values in comparison expressions by wrapping each column that can contain a NULL value in an ISNULL function.
However, code analysis rule SR0006 is violated when:
As part of a comparison, an expression contains a column reference ... Your code could cause a table scan if it compares an expression that contains a column reference.
Does this also apply to ISNULL, or does ISNULL never result in a table scan?

Yes, it causes table scans (though it seems to get optimised out if the column isn't actually nullable).
The SR0007 rule is extremely poor blanket advice, as it renders the predicate unsargable and means any indexes on the column will be useless. Even if there is no index on the column, it can still make cardinality estimates inaccurate, affecting other parts of the plan.
The categorization of it in the Microsoft.Performance category is quite amusing as it seems to have been written by someone with no understanding of query performance.
It claims the rationale is
If your code compares two NULL values or a NULL value with any other
value, your code will return an unknown result.
Whilst the expression itself does evaluate to Unknown, your code returns a completely deterministic result once you understand that any =, <>, >, < etc. comparison with NULL evaluates as Unknown, and that the WHERE clause only returns rows where the expression evaluates to true.
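A quick demonstration of that three-valued logic, using a two-row derived table:
SELECT * FROM (VALUES (1), (NULL)) AS t(x) WHERE x = x       -- 1 row: for the NULL row, x = x is Unknown, not true
SELECT * FROM (VALUES (1), (NULL)) AS t(x) WHERE NOT (x = x) -- 0 rows: NOT Unknown is still Unknown
SELECT * FROM (VALUES (1), (NULL)) AS t(x) WHERE x IS NULL   -- 1 row: IS NULL evaluates to true, not Unknown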
It is possible that they mean the case where ANSI_NULLS is off, but the example they give in the documentation of WHERE ISNULL([c2],0) > 2; vs WHERE [c2] > 2; would not be affected by this setting anyway. This setting
affects a comparison only if one of the operands of the comparison is
either a variable that is NULL or a literal NULL.
Execution plans showing scans vs. seeks are below.
CREATE TABLE #foo
(
x INT NULL UNIQUE
)

INSERT INTO #foo
SELECT ROW_NUMBER() OVER (ORDER BY @@SPID)
FROM sys.all_columns

-- Scan: wrapping the column in ISNULL makes the predicate unsargable
SELECT *
FROM #foo
WHERE ISNULL(x, 10) = 10

-- Seek
SELECT *
FROM #foo
WHERE x = 10

-- Seek: the sargable way of handling NULL explicitly
SELECT *
FROM #foo
WHERE x = 10
OR x IS NULL
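To see the difference without looking at the graphical plans, SET STATISTICS IO shows the scan reading the whole index while the seek touches only a few pages (a sketch; exact page counts will vary):
SET STATISTICS IO ON

SELECT * FROM #foo WHERE ISNULL(x, 10) = 10 -- scan: logical reads cover the whole index
SELECT * FROM #foo WHERE x = 10             -- seek: only a handful of logical reads

SET STATISTICS IO OFF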

Related

Switching a case statement to a constant greatly slows query

I have run into an issue with SQL Server 2017 where replacing a CASE statement that assigns a numerical value with a constant numerical value slows down the query by a factor of 6+.
The rather complicated query has the general form of:
WITH CTE1 AS
(
...
),
CTE2 AS
(
SELECT
--conditions based on below
FROM
(SELECT
--various math,
CASE
--statement assigning values to different runID combinations for samples with matching siteIDs and dates (due to the ON clause below)
ELSE NULL
....
END AS whichCombination
FROM
CTE1 AS value1
JOIN
CTE1 AS value2 ON (value1.siteID = value2.siteID
AND value1.date = value2.date
AND value1.sampleID <> value2.sampleID)
) AS combinations
WHERE combinations.whichCombination IS NOT NULL
)
SELECT various data
FROM dataTable
LEFT JOIN
(stuff from CTE2) AS pairTable ON dataTable.sampleID = pairTable.sampleID
The CASE statement assigns a pair number to different combinations of rows from the self join.
This then is used to select only the combinations that I want.
However, when the CASE statement is replaced with 1 AS whichCombination (a constant value, so no rows are assigned NULL), the query slows dramatically. This also occurs if CASE WHEN 1 = 1 THEN 1 is used.
This makes no sense to me as either way the values are:
numerical
not unique
not an index
The only thing that is unique is that each combination of rows is assigned a unique value.
Is SQL Server somehow using this as an index that speeds things up?
And how would I replicate this behavior without the CASE statement as this answer says you cannot create indices for CTE's?
EDIT: Also of note is that the slowdown occurs only if the main SELECT statement (the last 5 lines) is included (i.e. it does not occur if CTE2 is run as the main SELECT statement instead of being a CTE).
Best, JD
One workaround would be splitting these CTEs into temp tables; then you could add indexes if needed, as in the sketch below.
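A minimal sketch of that workaround (dataTable and sampleID come from the question; the query that fills #pairTable is a stand-in for the logic behind CTE2):
-- Materialise the pair table once; unlike a CTE, a temp table can be indexed
SELECT sampleID,
       ROW_NUMBER() OVER (ORDER BY sampleID) AS whichCombination -- stand-in for the CASE logic
INTO   #pairTable
FROM   dataTable

CREATE CLUSTERED INDEX IX_pairTable ON #pairTable (sampleID)

SELECT d.*
FROM   dataTable AS d
LEFT JOIN #pairTable AS p ON d.sampleID = p.sampleID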

COUNT(NULL) and the IN clause

This is more of a curiosity question. I know it seems like an oddball, but I use NULL when checking for data because I'm not concerned with what data is there, only whether data is there. I believe the following scenario only occurs in SQL Server.
When I want to see if a record exists I'll use:
IF(EXISTS(SELECT null FROM Table1 WHERE Criteria IN (1, 2)))
The following code also works:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria = 1) = 2)
But this doesn't work:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria IN (1,2)) = 2)
and get this error:
Operand data type NULL is invalid for count operator.
Why is the third statement any different because of the IN clause?
Here is a SQL Fiddle of what I'm talking about:
http://sqlfiddle.com/#!6/6d7db/8
I've narrowed it down further: it only fails if there are multiple items in the IN clause.
It seems to be something about the query optimizer.
In the first two queries (from your fiddle), the COUNT(null) seems to be converted to COUNT(*), as you can see in the execution plan.
In the second query, IN with only one value is optimized to =, resulting in exactly the same query as the first.
With IN (1,2) the query fails. It's the same if you use COUNT(1): it's converted to COUNT(*) in the queries that can only match one row, but stays COUNT(1) in the third.
Another side note: the effect only occurs with a real table. If you use a table variable, all three statements throw the error (see the sketch below).
The bottom line should probably be: COUNT(null) is wrong (as Heinzi explained); it just may slip through the optimizer in very rare circumstances.
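For reference, the table-variable behaviour is easy to reproduce (a sketch; per the observation above, all three forms fail here):
DECLARE @t TABLE (Criteria INT)

-- Each of these throws "Operand data type NULL is invalid for count operator",
-- including the forms that slip through against a real table
SELECT COUNT(NULL) FROM @t WHERE Criteria = 1
SELECT COUNT(NULL) FROM @t WHERE Criteria IN (1)
SELECT COUNT(NULL) FROM @t WHERE Criteria IN (1, 2)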
COUNT(null), the short form of COUNT(ALL null), simply does not make sense. Let's have a look at the definition of COUNT (emphasis mine):
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates expression for each row in a group and returns the number of unique, nonnull values.
Thus, COUNT(ALL someExpressionThatYieldsNull) would always return 0, no matter how many records are matched by your WHERE clause. Obviously, that makes it utterly unsuitable for counting rows. COUNT(*) would be correct here.
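A quick illustration of the difference:
SELECT COUNT(*) AS cnt_rows,   -- 2: counts rows, including the one with NULL
       COUNT(x) AS cnt_nonnull -- 1: counts only the non-null values of x
FROM (VALUES (1), (NULL)) AS t(x)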
I am quite surprised that your second example works at all, you might have stumbled upon a bug here. Trying the following in MSSQL 2012 (SQLFiddle):
SELECT COUNT(NULL) FROM someTable;
yields the following error:
Operand data type NULL is invalid for count operator.
which makes perfect sense.

determine order of operations in query

Say I have a query like this:
SELECT *
FROM Foo
WHERE Name IN ('name1', 'name2')
AND (Date<'2013-01-01' AND Date>'2010-01-01')
AND Type = 1
Is there a way to force the SQL server to evaluate the expressions in the order I determine and not what the query optimizer says? For example I want the IN clause evaluated first, the output of that evaluated by Type = 1 and finally the dates, in EXACTLY that order.
Yes, it is largely possible (though there are some caveats and counter-examples discussed in the answers here):
SELECT *
FROM Foo
WHERE 1 = CASE
WHEN Name IN ( 'name1', 'name2' ) THEN
CASE
WHEN Type = 1 THEN
CASE
WHEN ( Date < '2013-01-01'
AND Date > '2010-01-01' ) THEN 1
END
END
END
But why bother? There are only very limited circumstances in which I can see this would be useful (e.g. preventing divide by zero if an earlier predicate evaluated to 0; see the sketch below).
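For instance, a guard like the following (a sketch; the table and column names are hypothetical) performs the division only when the divisor is known to be non-zero:
SELECT *
FROM Measurements
WHERE 1 = CASE
WHEN Denominator <> 0 THEN
CASE
WHEN Numerator / Denominator > 10 THEN 1
END
END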
Wrapping the predicates up in nested CASE expressions, as in the rewritten query above, makes the query completely unsargable and prevents index usage for any of the three (otherwise sargable) predicates. It guarantees a full scan reading all rows.
To see an example of this
CREATE TABLE Foo
(
Id INT IDENTITY PRIMARY KEY,
Name VARCHAR(10),
[Date] DATE,
[Type] TINYINT,
Filler CHAR(8000) NULL
)
CREATE NONCLUSTERED INDEX IX_Name
ON Foo(Name)
CREATE NONCLUSTERED INDEX IX_Date
ON Foo(Date)
CREATE NONCLUSTERED INDEX IX_Type
ON Foo(Type)
INSERT INTO Foo
(Name,
[Date],
[Type])
SELECT TOP (100000) 'name' + CAST(0 + CRYPT_GEN_RANDOM(1) AS VARCHAR),
DATEADD(DAY, 7 * CRYPT_GEN_RANDOM(1), '2012-01-01'),
0 + CRYPT_GEN_RANDOM(1)
FROM master..spt_values v1,
master..spt_values v2
Running the original query in the question vs. this query gives the following plans.
Note the second query is costed as being 100% of the cost of the batch.
The Query optimizer left to its own devices first seeks into the 414 rows matching the type predicate and uses that as a build input for the hash table. It then seeks into the 728 rows matching the name, sees if it matches anything in the hash table and for the 4 that do it performs a key lookup for the other columns and evaluates the Date predicate against those. Finally it returns the single matching row.
The second query just ploughs through all the rows in the table and evaluates the predicates in the desired order. The difference in number of pages read is pretty significant.
Original query:
Table 'Foo'. Scan count 3, logical reads 23
Table 'Worktable'. Scan count 0, logical reads 0
Nested CASE:
Table 'Foo'. Scan count 1, logical reads 100373
Short answer: no!
You can try using brackets or hints, or studying the query plan, etc.
But is it wise to mess with the engine/optimizer that way?
You will need a lot of study and experience to outsmart the optimizer; that said, please let the engine take care of those details for you.

NOT IN subquery fails when there are NULL-valued results

Sorry guys, I had no idea how to phrase this one, but I have the following in a WHERE clause:
person_id not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
)
When the subquery returns NULL-valued results, my whole SELECT fails to return anything. To work around this, I replaced person_id in the subquery with isnull(person_id, '00000000-0000-0000-0000-000000000000').
It seems to work, but is there a better way to solve this?
It is better to use NOT EXISTS anyway. Note that the outer query's column must be qualified with the outer table's alias (assumed here to be t); left unqualified, person_id would resolve to pal.person_id:
WHERE NOT EXISTS(
SELECT 1 FROM protocol_application_log_devl pal
WHERE pal.person_id = t.person_id -- t: alias of the outer table (hypothetical)
AND pal.set_id = #set_id
)
Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?
A pattern I see quite a bit, and wish that I didn't, is NOT IN. When
I see this pattern, I cringe. But not for performance reasons – after
all, it creates a decent enough plan in this case:
The main problem is that the results can be surprising if the target
column is NULLable (SQL Server processes this as a left anti semi
join, but can't reliably tell you if a NULL on the right side is equal
to – or not equal to – the reference on the left side). Also,
optimization can behave differently if the column is NULLable, even if
it doesn't actually contain any NULL values
Instead of NOT IN, use a correlated NOT EXISTS for this query pattern.
Always. Other methods may rival it in terms of performance, when all
other variables are the same, but all of the other methods introduce
either performance problems or other challenges.
While I support Tim's answer as being correct-in-practice (NOT IN is not appropriate here), this is an interesting case noted in the IN / NOT IN documentation:
Caution: Any null values returned by subquery or expression that are compared to test_expression using IN or NOT IN return UNKNOWN. Using null values together with IN or NOT IN can produce unexpected results.1
This is why the isnull "fixes" the problem - it masks any such NULL values and avoids the unexpected behavior. With that in mind, the following approach would also work (but please heed the advice about not using NOT IN to begin with):
person_id not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
AND person_id IS NOT NULL -- guard here
)
However, a NULL person_id is suspicious and might indicate other issues...
1 The proof is in the pudding:
select case when 1 not in (2) then 1 else 0 end as r1,
case when 1 not in (2, NULL) then 1 else 0 end as r2
-- r1: 1, r2: 0
I just replaced the null value with an empty value using the ISNULL function, as in the example below. It solved my issue:
where isnull(UserId,'') not in (select UserID from users where ...)
This should work (note: NVL is Oracle syntax; SQL Server's equivalent is ISNULL, and the wrap must be applied to the subquery's column, since that is where the NULLs come from):
person_id not in (
SELECT distinct isnull(person_id, '')
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
)

Divide by zero error encountered in SQL Server [duplicate]

I've been trying to understand why I get a "Divide by zero error encountered" (Msg 8134) with my SQL query, but I must be missing something. I would like to know the why for the specific case below; I am not looking for NULLIF, CASE WHEN..., or similar, as I already know about them (and can of course use them in a situation like the one below).
I have an SQL statement with a computed column similar to
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
Running this statement fails with the above-mentioned error message, which, in itself, is not the problem - obviously TotalSize might well be 0 and therefore cause the error.
Now what I don't understand is that I do not have any rows where the TotalSize column is 0 when I comment the computed column out; I double-checked that this isn't the case.
Then I thought that for some reason the column computation would be performed on the whole result set before actually filtering with the conditions of the WHERE clause, but this a) wouldn't make sense IMHO, and b) when trying to reproduce the error with a test set-up, everything works fine (see below):
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0001',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0002',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0003',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0004',0)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0005',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0006',0)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0007',1)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (1,100,21)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (2,100,10)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (3,100,55)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (4,0,10)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (5,100,23)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (6,100,18)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (7,100,11)
-- This statement does not throw an error as apparently the row for ComputerID 4
-- is filtered out before computing the (FreeSpace / TotalSize * 100)
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM
tblComputer
JOIN
tblHDD ON
tblComputer.ID = tblHDD.ComputerID
WHERE
IsServer = 1
I am quite stumped and would like to know what the reason is.
Any ideas or pointers in the right direction are very welcome. Thanks in advance.
Update
Thank you so far for your input, but unfortunately I seem not to be getting closer to the root of the problem. I managed to strip the statement down a little bit and now have the case that I can execute it without errors if one JOIN is removed (I would need it for additional columns in the output which I temporarily removed).
I do not understand why adding the JOIN leads to the error; shouldn't a standard INNER JOIN always return the same number of rows or fewer, but never more?
Working code
SELECT
TotalSize,
FreeSpace,
((FreeSpace / TotalSize) * 100)
FROM
MyTable1
INNER JOIN
MyTable2 ON
MyTable1.ID = MyTable2.Table1ID
WHERE
SomeCondition
Error causing code
SELECT
TotalSize,
FreeSpace,
((FreeSpace / TotalSize) * 100)
FROM
MyTable1
INNER JOIN
MyTable2 ON
MyTable1.ID = MyTable2.Table1ID
-- This JOIN causes "divide by zero encountered" error
INNER JOIN
MyTable3 ON
MyTable2.ID = MyTable3.Table2ID
WHERE
SomeCondition
I also tried my luck using a cursor and looping over the result row by row, but in that case no error occurred (no matter which of the two statements above I tried).
Sorry for the messy code indentation, somehow the correct formatting doesn't seem to be applied.
G.
SQL is a declarative language; you write a query that logically describes the result you want, but it is up to the optimizer to produce a physical plan. This physical plan may not bear much relation to the written form of the query, because the optimizer does not simply reorder 'steps' derived from the textual form of the query, it can apply over 300 different transformations to find an efficient execution strategy.
The optimizer has considerable freedom to reorder expressions, joins, and other logical query constructions. This means that you cannot, in general, rely on any written query form to force one thing to be evaluated before another. In particular, the rewrite given by Lieven does not force the WHERE clause predicate to be evaluated before the expression. The optimizer may, depending on cost estimations, decide to evaluate the expression wherever it seems most efficient to do so. This may even mean, in some cases, that the expression is evaluated more than once.
The original question considered this possibility, but rejected it as 'not making much sense'. Nevertheless, this is the way the product works - if SQL Server estimates that a join will reduce the set size enough to make it cheaper to compute the expression on the result of the join, it is free to do so.
The general rule is to never depend on a particular evaluation order to avoid things like overflow or divide-by-zero errors. In this example, one would employ a CASE statement to check for a zero divisor - an example of defensive programming.
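A minimal sketch of that defensive pattern, using the column names from the question:
SELECT
TotalSize,
FreeSpace,
CASE
WHEN TotalSize = 0 THEN NULL -- or whatever sentinel suits the report
ELSE (FreeSpace / TotalSize) * 100
END AS PercentFree
FROM
tblComputer
JOIN
tblHDD ON
tblComputer.ID = tblHDD.ComputerID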
The optimizer's freedom to reorder things is a fundamental tenet of its design. You can find cases where it leads to counter-intuitive behaviours, but overall the benefits far outweigh the disadvantages.
Paul
The basic steps that SQL Server uses to process a single SELECT statement include the following:
1. The parser scans the SELECT statement and breaks it into logical units such as keywords, expressions, operators, and identifiers.
2. A query tree, sometimes referred to as a sequence tree, is built describing the logical steps needed to transform the source data into the format required by the result set.
3. The query optimizer analyzes different ways the source tables can be accessed. It then selects the series of steps that returns the results fastest while using fewer resources. The query tree is updated to record this exact series of steps. The final, optimized version of the query tree is called the execution plan.
4. The relational engine starts executing the execution plan. As the steps that require data from the base tables are processed, the relational engine requests that the storage engine pass up data from the rowsets requested from the relational engine.
5. The relational engine processes the data returned from the storage engine into the format defined for the result set and returns the result set to the client.
My interpretation is that there is no guarantee that your WHERE clause gets evaluated before the computed column is evaluated for all rows.
You could test that assumption by changing your query as below, pushing the WHERE clause into a derived table in the hope that it is applied before the computation (although, as the answer above explains, even this rewrite does not guarantee the evaluation order):
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM (
SELECT
TotalSize,
FreeSpace
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
) t
What rows are returned when you run:
SELECT
TotalSize
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
and ((TotalSize * 100) = 0)
This might give you a clue as to how SQL Server is evaluating (TotalSize * 100) to be zero.
Another idea: is there anything in your WHERE clause that might also be the problem?
You're assuming it's TotalSize, but it might be somewhere else.
I was running into the same issue. In my case NULLs were acceptable, so I was able to fix it this way:
Select Expression1 / Expression2            -- causes divide by zero
Select Expression1 / NULLIF(Expression2, 0) -- causes the result to be NULL
If you need other handling, you can wrap the entire expression in an ISNULL function like this:
Select ISNULL(Expression1 / NULLIF(Expression2, 0), -5) -- returns -5 instead of NULL or a divide-by-zero error
