Sorry guys, I had no idea how to phrase this one, but I have the following in a where clause:
person_id not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
)
When the subquery returns no results, my whole select fails to return anything. To work around this, I replaced person_id in the subquery with isnull(person_id, '00000000-0000-0000-0000-000000000000').
It seems to work, but is there a better way to solve this?
It is better to use NOT EXISTS anyway:
WHERE NOT EXISTS (
    SELECT 1
    FROM protocol_application_log_devl pal
    WHERE pal.person_id = t.person_id -- qualify the outer column; unqualified, person_id binds to pal's own column
      AND pal.set_id = #set_id
)
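For clarity, here is the full shape with the outer table aliased; person_table is a stand-in for whatever table the original WHERE clause belongs to:

SELECT t.*
FROM person_table AS t -- hypothetical name for the outer table
WHERE NOT EXISTS (
    SELECT 1
    FROM protocol_application_log_devl AS pal
    WHERE pal.person_id = t.person_id -- correlated to the outer row
      AND pal.set_id = #set_id
);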
Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?
A pattern I see quite a bit, and wish that I didn't, is NOT IN. When
I see this pattern, I cringe. But not for performance reasons – after
all, it creates a decent enough plan in this case.
The main problem is that the results can be surprising if the target
column is NULLable (SQL Server processes this as a left anti semi
join, but can't reliably tell you if a NULL on the right side is equal
to – or not equal to – the reference on the left side). Also,
optimization can behave differently if the column is NULLable, even if
it doesn't actually contain any NULL values.
Instead of NOT IN, use a correlated NOT EXISTS for this query pattern.
Always. Other methods may rival it in terms of performance, when all
other variables are the same, but all of the other methods introduce
either performance problems or other challenges.
While I support Tim's answer as being correct-in-practice (NOT IN is not appropriate here), this is an interesting case noted in the IN / NOT IN documentation:
Caution: Any null values returned by subquery or expression that are compared to test_expression using IN or NOT IN return UNKNOWN. Using null values together with IN or NOT IN can produce unexpected results.¹
This is why the isnull "fixes" the problem - it masks any such NULL values and avoids the unexpected behavior. With that in mind, the following approach would also work (but please heed the advice about not using NOT IN to begin with):
person_id not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
AND pal.person_id IS NOT NULL -- guard against NULLs here
)
However, a NULL person_id is suspicious and might indicate other issues.
¹ Here is the proof:
select case when 1 not in (2) then 1 else 0 end as r1,
case when 1 not in (2, NULL) then 1 else 0 end as r2
-- r1: 1, r2: 0
I just replaced the NULL value with an empty string using the ISNULL function, as in the example below. It solved my issue:
where isnull(UserId,'') not in (select UserID from users where ...)
This should work (NVL is Oracle's function; the SQL Server equivalent is ISNULL, and since person_id looks like a uniqueidentifier, substitute a GUID literal rather than an empty string):
isnull(person_id, '00000000-0000-0000-0000-000000000000') not in (
SELECT distinct person_id
FROM protocol_application_log_devl pal
WHERE pal.set_id = #set_id
)
I have run into an issue with SQL Server 2017 where replacing a CASE expression that assigns a numerical value with a constant numerical value slows down the query by a factor of 6+.
The rather complicated query has the general form of:
WITH CTE1 AS
(
...
),
CTE2 AS
(
SELECT
--conditions based on below
FROM
(SELECT
--various math,
CASE
--statement assigning values to different runID combinations for samples with matching siteIDs and dates (due to the ON statement below)
ELSE NULL
....
END AS whichCombination
FROM
CTE1 AS value1
JOIN
CTE1 AS value2 ON (value1.siteID = value2.siteID
               AND value1.date = value2.date
               AND value1.sampleID <> value2.sampleID)
) AS combinations
WHERE combinations.whichCombination IS NOT NULL
)
SELECT various data
FROM dataTable
LEFT JOIN
(stuff from CTE2) AS pairTable ON dataTable.sampleID = pairTable.sampleID
The CASE statement assigns a pair number to different combinations of rows from the self join.
This then is used to select only the combinations that I want.
However, when the CASE statement is replaced with: 1 AS whichCombination (a constant value, so no rows are assigned NULL) the query slows dramatically. This also occurs if CASE WHEN 1 = 1 THEN 1 END is used.
This makes no sense to me as either way the values are:
numerical
not unique
not an index
The only thing that is unique is that each combination of rows is assigned a unique value.
Is SQL Server somehow using this as an index that speeds things up?
And how would I replicate this behavior without the CASE statement, since this answer says you cannot create indexes for CTEs?
EDIT: Also of note is that the slowdown occurs only if the main select statement (the last 5 lines) is included; it disappears if CTE2 is run as the main select statement instead of being a CTE.
Best, JD
One workaround would be splitting these CTEs into temp tables; then you could add indexes if needed.
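For instance, a minimal sketch of that approach (column names are taken from the query above; the CTE1 body stays elided, just as in the question):

-- Materialize CTE1 once instead of evaluating it twice in the self-join
SELECT siteID, [date], sampleID -- plus whatever the CASE logic needs, e.g. a runID column
INTO #cte1
FROM ... -- body of CTE1 here

-- Index the join keys so the self-join can seek rather than scan
CREATE NONCLUSTERED INDEX ix_cte1_site_date
ON #cte1 (siteID, [date])
INCLUDE (sampleID);

-- Then reference #cte1 twice in place of CTE1 in the pairing query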
I have the SQL below:
SELECT Cast(Format(Sum(COALESCE(InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet,
BP.BoundProjectId AS ProjectId
FROM BoundProducts BP
WHERE ( BP.IsDeleted IS NULL
OR BP.IsDeleted = 0 )
GROUP BY BP.BoundProjectId
I already have an index on the table BoundProducts with the column order (BoundProjectId, IsDeleted).
Currently this query takes around 2-3 seconds to return the result. I am trying to reduce it to zero seconds.
This query returns 25077 rows as of now.
Please provide any ideas to improve the query.
Looking at this from a slightly different point of view, I think your OR condition is hurting your query, so why not rewrite it like this?
SELECT CAST(FORMAT(SUM(COALESCE(BP.InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet
, BP.BoundProjectId AS ProjectId
FROM (
SELECT BP.BoundProjectId, BP.InstalledSubtotal
FROM dbo.BoundProducts AS BP
WHERE BP.IsDeleted IS NULL
UNION ALL
SELECT BP.BoundProjectId, BP.InstalledSubtotal
FROM dbo.BoundProducts AS BP
WHERE BP.IsDeleted = 0
) AS BP
GROUP BY BP.BoundProjectId;
I've had better experience with UNION ALL than with OR.
I think it should work exactly the same. On top of that, I'd create this index:
CREATE NONCLUSTERED INDEX idx_BoundProducts_IsDeleted_BoundProjectId_InstalledSubtotal
ON dbo.BoundProducts (IsDeleted, BoundProjectId)
INCLUDE (InstalledSubtotal);
It should satisfy your query conditions with an index seek. I know it's usually not a good idea to index bit fields, but it's worth trying.
P.S. Why not default your IsDeleted column to 0 and make it NOT NULL? Then a simple WHERE IsDeleted = 0 check would be enough, which would boost your query too.
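A hedged sketch of that change (assuming IsDeleted is a bit column and nothing else relies on its NULLs):

-- Backfill the existing NULLs, then enforce NOT NULL with a default of 0
UPDATE dbo.BoundProducts SET IsDeleted = 0 WHERE IsDeleted IS NULL;
ALTER TABLE dbo.BoundProducts ALTER COLUMN IsDeleted bit NOT NULL;
ALTER TABLE dbo.BoundProducts
ADD CONSTRAINT DF_BoundProducts_IsDeleted DEFAULT (0) FOR IsDeleted;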
If you really want to try an index seek, it should be possible using the FORCESEEK query hint, but I don't think it's going to make it any faster.
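For reference, a minimal sketch of the hint syntax; note that SQL Server raises an error rather than silently falling back to a scan if no seek plan can be found:

SELECT BP.BoundProjectId, SUM(COALESCE(BP.InstalledSubtotal, 0)) AS TotalSoldNet
FROM dbo.BoundProducts AS BP WITH (FORCESEEK)
WHERE BP.IsDeleted IS NULL OR BP.IsDeleted = 0
GROUP BY BP.BoundProjectId;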
The options I suggested last time are still valid: remove FORMAT and/or create an indexed view.
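A sketch of the indexed-view idea, subject to the usual indexed-view restrictions (SCHEMABINDING, COUNT_BIG(*) alongside GROUP BY, SUM over a non-nullable expression); the view and index names are made up:

CREATE VIEW dbo.vw_BoundProjectTotals
WITH SCHEMABINDING
AS
SELECT BP.BoundProjectId,
       SUM(ISNULL(BP.InstalledSubtotal, 0)) AS TotalSoldNet, -- SUM in an indexed view must not be nullable
       COUNT_BIG(*) AS RowCnt -- required when the view uses GROUP BY
FROM dbo.BoundProducts AS BP
WHERE ISNULL(BP.IsDeleted, 0) = 0
GROUP BY BP.BoundProjectId;
GO
CREATE UNIQUE CLUSTERED INDEX ix_vw_BoundProjectTotals
ON dbo.vw_BoundProjectTotals (BoundProjectId);

Any FORMAT-ing would then happen at query time, against the pre-aggregated view.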
You should also test whether the problem is the query itself or just displaying the results afterwards, for example by trying it with "select ... into #tmp". If that's fast, then the problem is not the query.
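For example, a sketch of that test using the question's query:

SELECT CAST(FORMAT(SUM(COALESCE(BP.InstalledSubtotal, 0)), 'F') AS MONEY) AS TotalSoldNet,
       BP.BoundProjectId AS ProjectId
INTO #tmp -- materializes the result instead of streaming 25k rows to the client
FROM dbo.BoundProducts AS BP
WHERE BP.IsDeleted IS NULL OR BP.IsDeleted = 0
GROUP BY BP.BoundProjectId;

DROP TABLE #tmp;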
The index name in the screenshot is not the same as in the create table statement, but I assume that's just a name you changed for the question. If the scan is happening on another index, then you should include that too.
This is more of a curiosity question. I know it seems like an oddball, but I use NULL when checking for data because I'm not concerned with what data is there, only whether data is there. I believe the following scenario only occurs in SQL Server.
When I want to see if a record exists I'll use:
IF(EXISTS(SELECT null FROM Table1 WHERE Criteria IN (1, 2)))
The following code also works:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria = 1) = 2)
But this doesn't work:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria IN (1,2)) = 2)
and get this error:
Operand data type NULL is invalid for count operator.
Why is the third statement any different because of the IN clause?
Here is a SQL Fiddle of what I'm talking about:
http://sqlfiddle.com/#!6/6d7db/8
I've narrowed it down: it only happens if there are multiple items in the IN clause.
It seems to be something about the query optimizer.
In the first two queries (from your fiddle), the count(null) seems to be converted to COUNT(*) as you can see in the execution plan.
In the second query, an IN with only one value is optimized to =, resulting in the exact same plan as above.
With IN (1,2) the query fails. It's the same if you use COUNT(1): it's converted to COUNT(*) where the query can only match one value, but stays COUNT(1) in the third.
Another sidenote: The effect only works with a real table. If you use a table variable, all three statements throw the error.
The bottom line should probably be: count(null) is wrong (as Heinzi explained), it just may slip through the optimizer in very rare circumstances.
COUNT(null), the short form of COUNT(ALL null), simply does not make sense. Let's have a look at the definition of COUNT (emphasis mine):
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates expression for each row in a group and returns the number of unique, nonnull values.
Thus, COUNT(ALL someExpressionThatYieldsNull) would always return 0, no matter how many records are matched by your WHERE clause. Obviously, that makes it utterly unsuitable for counting rows. COUNT(*) would be correct here.
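Applied to the failing statement from the question, that correction looks like this (the PRINT is just illustrative):

IF ((SELECT COUNT(*) FROM Table1 WHERE Criteria IN (1, 2)) = 2)
    PRINT 'exactly two matching rows' -- COUNT(*) counts rows, regardless of NULLs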
I am quite surprised that your second example works at all, you might have stumbled upon a bug here. Trying the following in MSSQL 2012 (SQLFiddle):
SELECT COUNT(NULL) FROM someTable;
yields the following error:
Operand data type NULL is invalid for count operator.
which makes perfect sense.
Code analysis rule SR0007 for Visual Studio 2010 database projects states that:
You should explicitly indicate how to handle NULL values in comparison expressions by wrapping each column that can contain a NULL value in an ISNULL function.
However code analysis rule SR0006 is violated when:
As part of a comparison, an expression contains a column reference ... Your code could cause a table scan if it compares an expression that contains a column reference.
Does this also apply to ISNULL, or does ISNULL never result in a table scan?
Yes, it causes table scans (though it seems to get optimised out if the column isn't actually nullable).
The SR0007 rule is extremely poor blanket advice as it renders the predicate unsargable and means any indexes on the column will be useless. Even if there is no index on the column it might still make cardinality estimates inaccurate affecting other parts of the plan.
The categorization of it in the Microsoft.Performance category is quite amusing as it seems to have been written by someone with no understanding of query performance.
It claims the rationale is
If your code compares two NULL values or a NULL value with any other
value, your code will return an unknown result.
Whilst the expression itself does evaluate to unknown, your code returns a completely deterministic result once you understand that any =, <>, >, < etc. comparison with NULL evaluates as Unknown and that the WHERE clause only returns rows where the expression evaluates to true.
It is possible that they mean if ANSI_NULLS is off but the example they give in the documentation of WHERE ISNULL([c2],0) > 2; vs WHERE [c2] > 2; would not be affected by this setting anyway. This setting
affects a comparison only if one of the operands of the comparison is
either a variable that is NULL or a literal NULL.
Execution plans showing scan vs. seek for the queries below:
CREATE TABLE #foo
(
    x INT NULL UNIQUE
)

INSERT INTO #foo
SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) -- arbitrary ORDER BY, just to satisfy ROW_NUMBER
FROM sys.all_columns

SELECT *
FROM #foo
WHERE ISNULL(x, 10) = 10 -- scan: wrapping x in ISNULL makes the predicate unsargable

SELECT *
FROM #foo
WHERE x = 10 -- seek

SELECT *
FROM #foo
WHERE x = 10
   OR x IS NULL -- seek: the sargable way to also include NULLs
I have to change some SQL queries (SQL Server 2005) written by another person, and in that code I often see the following construction:
SELECT fieldA, SUM(CASE fieldB WHEN null THEN 0 ELSE fieldB END) as AliasName FROM ...
I don't understand the CASE statement because, as far as I know, NULL cannot be matched by the simple CASE form (it compares with =, and fieldB = NULL never evaluates to true), and therefore I think that the above code does the same as:
SELECT fieldA, SUM(fieldB) as AliasName FROM ...
I have also done some tests and have not seen any differences in the result. Am I missing something, or can I replace the upper statement through the short one?
UPDATE
Only for completeness, because it's not mentioned in the answers: the upper code returns the same result as the lower. The CASE construction used does not replace NULLs with zeros, and therefore it can be omitted. If the purpose of the original SQL was to make sure that NULL is never returned, COALESCE or ISNULL can be used (as stated in the answers).
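A small demo of why the simple CASE form never matches NULL while the searched form does (@x is just an illustration variable):

DECLARE @x int -- @x is NULL by default
SELECT CASE @x WHEN NULL THEN 'matched' ELSE 'not matched' END AS simple_form,
       CASE WHEN @x IS NULL THEN 'matched' ELSE 'not matched' END AS searched_form
-- simple_form: 'not matched' (@x = NULL evaluates to UNKNOWN); searched_form: 'matched'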
The output of your second statement will contain nulls (when aggregating records that only have null values for fieldB). If you don't mind that, you're ok.
If you want zeros in your output rather than null values, use this:
select fieldA, sum(isnull(fieldB, 0)) as AliasName from ...
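To see the difference on a group whose fieldB values are all NULL, here is a sketch (the VALUES row constructor needs SQL Server 2008+; on 2005, a UNION ALL of SELECTs behaves the same):

SELECT fieldA,
       SUM(fieldB) AS sumRaw, -- NULL when every fieldB in the group is NULL
       SUM(ISNULL(fieldB, 0)) AS sumWithZero -- 0 in that case
FROM (VALUES ('a', CAST(NULL AS int)), ('a', NULL), ('b', 5)) AS t(fieldA, fieldB)
GROUP BY fieldA;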
You would achieve this more readably with
SELECT fieldA, SUM(COALESCE(fieldB, 0)) as AliasName