COUNT(NULL) and the IN clause

COUNT(NULL) and the IN clause - sql-server

This is more of a curious question. I know this question seems like an odd ball but I use null when checking for data because I'm not concerned what data is there but only IF data is there. I believe the following scenario only occurs in SQL Server.
When I want to see if a record exists I'll use:
IF(EXISTS(SELECT null FROM Table1 WHERE Criteria IN (1, 2)))
The following code also works:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria = 1) = 2)
But this doesn't work:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria IN (1,2)) = 2)
and get this error:
Operand data type NULL is invalid for count operator.
Why is the third statement any different because of the IN clause?
Here is a SQL Fiddle of what I'm talking about:
http://sqlfiddle.com/#!6/6d7db/8
Narrowed it down to only if there are multiple items in the IN clause too

It seems to be something about the query optimizer.
In the first two queries (from your fiddle), the count(null) seems to be converted to COUNT(*) as you can see in the execution plan.
In the second line, IN with only one value is optimized to =, resulting in the exact same query as above:
With IN (1,2) the query fails. It's the same if you use COUNT(1): It's converted to COUNT(*) where the query can only return one row, but stays COUNT(1) in the third.
Another sidenote: The effect only works with a real table. If you use a table variable, all three statements throw the error.
The bottom line should probably be: count(null) is wrong (as Heinzi explained), it just may slip through the optimizer in very rare circumstances.

COUNT(null), the short form of COUNT(ALL null), simply does not make sense. Let's have a look at the definition of COUNT (emphasis mine):
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates expression for each row in a group and returns the number of unique, nonnull values.
Thus, COUNT(ALL someExpressionThatYieldsNull) would always return 0, no matter how many records are matched by your WHERE clause. Obviously, that makes it utterly unsuitable for counting rows. COUNT(*) would be correct here.
I am quite surprised that your second example works at all, you might have stumbled upon a bug here. Trying the following in MSSQL 2012 (SQLFiddle):
SELECT COUNT(NULL) FROM someTable;
yields the following error:
Operand data type NULL is invalid for count operator.
which makes perfect sense.

Related

Group by an evaluated field (sql server) [duplicate]

Why are column ordinals legal for ORDER BY but not for GROUP BY? That is, can anyone tell me why this query
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY OrgUnitID
cannot be written as
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY 1
When it's perfectly legal to write a query like
SELECT OrgUnitID FROM Employee AS e ORDER BY 1
?
I'm really wondering if there's something subtle about the relational calculus, or something, that would prevent the grouping from working right.
The thing is, my example is pretty trivial. It's common that the column that I want to group by is actually a calculation, and having to repeat the exact same calculation in the GROUP BY is (a) annoying and (b) makes errors during maintenance much more likely. Here's a simple example:
SELECT DATEPART(YEAR,LastSeenOn), COUNT(*)
FROM Employee AS e
GROUP BY DATEPART(YEAR,LastSeenOn)
I would think that SQL's rule of normalize to only represent data once in the database ought to extend to code as well. I'd want to only right that calculation expression once (in the SELECT column list), and be able to refer to it by ordinal in the GROUP BY.
Clarification: I'm specifically working on SQL Server 2008, but I wonder about an overall answer nonetheless.

One of the reasons is because ORDER BY is the last thing that runs in a SQL Query, here is the order of operations
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
so once you have the columns from the SELECT clause you can use ordinal positioning
EDIT, added this based on the comment
Take this for example
create table test (a int, b int)
insert test values(1,2)
go
The query below will parse without a problem, it won't run
select a as b, b as a
from test
order by 6
here is the error
Msg 108, Level 16, State 1, Line 3
The ORDER BY position number 6 is out of range of the number of items in the select list.
This also parses fine
select a as b, b as a
from test
group by 1
But it blows up with this error
Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.

There is a lot of elementary inconsistencies in SQL, and use of scalars is one of them. For example, anyone might expect
select * from countries
order by 1
and
select * from countries
order by 1.00001
to be a similar queries (the difference between the two can be made infinitesimally small, after all), which are not.

I'm not sure if the standard specifies if it is valid, but I believe it is implementation-dependent. I just tried your first example with one SQL engine, and it worked fine.

use aliasses :
SELECT DATEPART(YEAR,LastSeenOn) as 'seen_year', COUNT(*) as 'count'
FROM Employee AS e
GROUP BY 'seen_year'
** EDIT **
if GROUP BY alias is not allowed for you, here's a solution / workaround:
SELECT seen_year
, COUNT(*) AS Total
FROM (
SELECT DATEPART(YEAR,LastSeenOn) as seen_year, *
FROM Employee AS e
) AS inline_view
GROUP
BY seen_year

databases that don't support this basically are choosing not to. understand the order of the processing of the various steps, but it is very easy (as many databases have shown) to parse the sql, understand it, and apply the translation for you. Where its really a pain is when a column is a long case statement. having to repeat that in the group by clause is super annoying. yes, you can do the nested query work around as someone demonstrated above, but at this point it is just lack of care about your users to not support group by column numbers.

NOT IN subquery fails when there are NULL-valued results

Sorry guys, I had no idea how to phrase this one, but I have the following in a where clause:
person_id not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
)
When the subquery returns no results, my whole select fails to return anything. To work around this, I replaced person_id in the subquery with isnull(person_id, '00000000-0000-0000-0000-000000000000').
It seems to work, but is there a better way to solve this?

It is better to use NOT EXISTS anyway:
WHERE NOT EXISTS(
SELECT 1 FROM protocol_application_log_devl pal
WHERE pal.person_id = person_id
AND pal.set_id = #set_id
)
Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?
A pattern I see quite a bit, and wish that I didn't, is NOT IN. When
I see this pattern, I cringe. But not for performance reasons – after
all, it creates a decent enough plan in this case:
The main problem is that the results can be surprising if the target
column is NULLable (SQL Server processes this as a left anti semi
join, but can't reliably tell you if a NULL on the right side is equal
to – or not equal to – the reference on the left side). Also,
optimization can behave differently if the column is NULLable, even if
it doesn't actually contain any NULL values
Instead of NOT IN, use a correlated NOT EXISTS for this query pattern.
Always. Other methods may rival it in terms of performance, when all
other variables are the same, but all of the other methods introduce
either performance problems or other challenges.

While I support Tim's answer as being correct-in-practice (NOT IN is not appropriate here), this is an interesting case noted in the IN / NOT IN documentation:
Caution: Any null values returned by subquery or expression that are compared to test_expression using IN or NOT IN return UNKNOWN. Using null values in together with IN or NOT IN can produce unexpected results1.
This is why the isnull "fixes" the problem - it masks any such NULL values and avoids the unexpected behavior. With that in mind, the following approach would also work (but please heed the advice about not using NOT IN to begin with):
person_id not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
AND person_id NOT NULL -- guard here
)
However, a NULL person_id is suspicious and might indicate other issues ..
1 Here is the Proof pudding:
select case when 1 not in (2) then 1 else 0 end as r1,
case when 1 not in (2, NULL) then 1 else 0 end as r2
-- r1: 1, r2: 0

I just replaced the null value with empty value using isnull function as below example. It solved my issue
where isnull(UserId,'') not in (select UserID from users where ...)

This should work:
nvl(person_id, '') not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
)

Error converting date in sql server [duplicate]

I've been trying to understand why I get a "divide by zero encountered" (Msg 8134) with my SQL query, but I must be missing something. I would like like to know the why for the specific case below, I am not looking for NULLIF, CASE WHEN... or similar as I already know about them (and can of course use them in a situation as the one below).
I have an SQL statement with a computed column similar to
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
Running this statement errors with the above mentioned error messages, which, in itself, is not the problem - obviously TotalSize might well be 0 and therefore cause the error.
Now what I don't understand is that I do not have any rows where the TotalSize column is 0 when I comment the computed column out, I double checked that this isn't the case.
Then I thought that for some reason the column computation would be performed on the whole result set before actually filtering with the conditions of the where clause, but this a) wouldn't make sense imho and b) when trying to reproduce the error with a test set-up everything works fine (see below):
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0001',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0002',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0003',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0004',0)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0005',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0006',0)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0007',1)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (1,100,21)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (2,100,10)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (3,100,55)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (4,0,10)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (5,100,23)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (6,100,18)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (7,100,11)
-- This statement does not throw an error as apparently the row for ComputerID 4
-- is filtered out before computing the (FreeSpace / TotalSize * 100)
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM
tblComputer
JOIN
tblHDD ON
tblComputer.ID = tblHDD.ComputerID
WHERE
IsServer = 1
I am quite stumped and would like to know what the reason is.
Any ideas or pointers into the right direction are very welcome, thanks in advance
Update
Thank you so far for your input, but unfortunately I seem not to be getting closer to the root of the problem. I managed to strip the statement down a little bit and now have the case that I can execute it without errors if one JOIN is removed (I would need it for additional columns in the output which I temporarily removed).
I do not understand, why using the JOIN leads to the error, shouldn't a standard INNER JOIN always either return the same number of rows or less, but never more?
Working code
SELECT
TotalSize,
FreeSpace
((FreeSpace / TotalSize) * 100)
FROM
MyTable1
INNER JOIN
MyTable2 ON
MyTable1.ID = MyTable2.Table1ID
WHERE
SomeCondition
Error causing code
SELECT
TotalSize,
FreeSpace
((FreeSpace / TotalSize) * 100)
FROM
MyTable1
INNER JOIN
MyTable2 ON
MyTable1.ID = MyTable2.Table1ID
-- This JOIN causes "divide by zero encountered" error
INNER JOIN
MyTable3 ON
MyTable2.ID = MyTable3.Table2ID
WHERE
SomeCondition
I also tried my luck using a cursor and looping over the result row by row, but in that case no error occurred (no matter, which of the two statements above I tried).
Sorry for the messy code indentation, somehow the correct formatting doesn't seem to be applied.
G.

SQL is a declarative language; you write a query that logically describes the result you want, but it is up to the optimizer to produce a physical plan. This physical plan may not bear much relation to the written form of the query, because the optimizer does not simply reorder 'steps' derived from the textual form of the query, it can apply over 300 different transformations to find an efficient execution strategy.
The optimizer has considerable freedom to reorder expressions, joins, and other logical query constructions. This means that you cannot, in general, rely on any written query form to force one thing to be evaluated before another. In particular, the rewrite given by Lieven does not force the WHERE clause predicate to be evaluated before the expression. The optimizer may, depending on cost estimations, decide to evaluate the expression wherever it seems most efficient to do so. This may even mean, in some cases, that the expression is evaluated more than once.
The original question considered this possibility, but rejected it as 'not making much sense'. Nevertheless, this is the way the product works - if SQL Server estimates that a join will reduce the set size enough to make it cheaper to compute the expression on the result of the join, it is free to do so.
The general rule is to never depend on a particular evaluation order to avoid things like overflow or divide-by-zero errors. In this example, one would employ a CASE statement to check for a zero divisor - an example of defensive programming.
The optimizer's freedom to reorder things is a fundamental tenet of its design. You can find cases where it leads to counter-intuitive behaviours, but overall the benefits far outweigh the disadvantages.
Paul

The basic steps that SQL Server uses to process a single SELECT statement include the following
The parser scans the SELECT statement and breaks it into logical
units such as keywords, expressions,
operators, and identifiers.
A query tree, sometimes referred to as a sequence tree, is built
describing the logical steps needed to
transform the source data into the
format required by the result set.
The query optimizer analyzes different ways the source tables can
be accessed. It then selects the
series of steps that returns the
results fastest while using fewer
resources. The query tree is updated
to record this exact series of steps.
The final, optimized version of the
query tree is called the execution
plan.
The relational engine starts executing the execution plan. As the
steps that require data from the base
tables are processed, the relational
engine requests that the storage
engine pass up data from the rowsets
requested from the relational engine.
The relational engine processes the data returned from the storage
engine into the format defined for the
result set and returns the result set
to the client.
My interpretation of things is that there is no guarantee that your where clause get's evaluated before evaluating the computed column for all rows.
You could verify that assumption by changing you query like below and forcing the where clause to be evaluated before the computation.
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM (
SELECT
TotalSize,
FreeSpace,
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
) t

What rows are returned when you run:
SELECT
TotalSize
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
and ((TotalSize * 100) = 0)
This might give you a clue as to how SQL Serve ris evaluating (TotalSize * 100) to be zero.
Another idea, is there anything in your where statement which might also be the problem?
You're assuming it's the TotalSize, but it might be somewhere else.

I was running into the same issue. In my case NULLs were acceptable so I was able to fix it this way:
Select Expression1 / Expression2 -- Caused Division By 0
Select Expression1 / NULLIF(Expression2,0) -- Causes result to be NULL
If you need other handling, you can wrap the entire expression in an ISNULL function like this:
Select ISNULL(Expression1 / NULLIF(Expression2,0)-5) -- Returns -5 instead of null or divide by 0

Does wrapping nullable columns in ISNULL cause table scans?

Code analysis rule SR0007 for Visual Studio 2010 database projects states that:
You should explicitly indicate how to handle NULL values in comparison expressions by wrapping each column that can contain a NULL value in an ISNULL function.
However code analysis rule SR0006 is violated when:
As part of a comparison, an expression contains a column reference ... Your code could cause a table scan if it compares an expression that contains a column reference.
Does this also apply to ISNULL, or does ISNULL never result in a table scan?

Yes it causes table scans. (though seems to get optimised out if the column isn't actually nullable)
The SR0007 rule is extremely poor blanket advice as it renders the predicate unsargable and means any indexes on the column will be useless. Even if there is no index on the column it might still make cardinality estimates inaccurate affecting other parts of the plan.
The categorization of it in the Microsoft.Performance category is quite amusing as it seems to have been written by someone with no understanding of query performance.
It claims the rationale is
If your code compares two NULL values or a NULL value with any other
value, your code will return an unknown result.
Whilst the expression itself does evaluate to unknown your code returns a completely deterministic result once you understand that any =, <>, >, < etc comparison with NULL evaluate as Unknown and that the WHERE clause only returns rows where the expression evaluates to true.
It is possible that they mean if ANSI_NULLS is off but the example they give in the documentation of WHERE ISNULL([c2],0) > 2; vs WHERE [c2] > 2; would not be affected by this setting anyway. This setting
affects a comparison only if one of the operands of the comparison is
either a variable that is NULL or a literal NULL.
Execution plans showing scans vs seek or below
CREATE TABLE #foo
(
x INT NULL UNIQUE
)
INSERT INTO #foo
SELECT ROW_NUMBER() OVER (ORDER BY ##SPID)
FROM sys.all_columns
SELECT *
FROM #foo
WHERE ISNULL(x, 10) = 10
SELECT *
FROM #foo
WHERE x = 10
SELECT *
FROM #foo
WHERE x = 10
OR x IS NULL

SQL with unknown purpose

I have to change some SQL queries (SQL Server 2005) done by another person and in that code I see often the following construction:
SELECT fieldA, SUM(CASE fieldB WHEN null THEN 0 ELSE fieldB END) as AliasName FROM ...
I don't understand the case statement because as far as I know, null can not be checked within a case and therefore I think that the above code does the same as:
SELECT fieldA, SUM(fieldB) as AliasName FROM ...
I have also done some tests and have not seen any differences in the result. Am I missing something, or can I replace the upper statement through the short one?
UPDATE
Only for completeness because it's not mentioned in the answers: The upper code returns the same result as the lower. The used case construction does not replace null's through zero's and therefore it can be ommited. If the purpose of the original sql was to make sure that never null will be returned, the coalesce or the isnull-operator can be used (as stated in the answers).

The output of your second statement will contain nulls (when aggregating records that only have null values for fieldB). If you don't mind that, you're ok.
If you want zeros in your output rather than null values, use this:
select fieldA, sum(isnull(fieldB, 0)) as AliasName from ...

You would achieve this more readably with
SELECT fieldA, COALESCE(fieldB, 0) as AliasName

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight