Do I need to handle nulls on LEFT JOINs? - sql-server

There is a more senior SQL developer (the DBA) at the office who told me that in all the LEFT JOINs of my script, I must handle the scenario where the join column of the left table is possibly null; otherwise, I have to use INNER JOINs. Now, being a noob, I might be wrong here, but I can't see his point, and it has left me confused.
His explanation was, unless the column is non-nullable, either I must
use ISNULL(LeftTable.ColumnA,<replacement value here>) on the ON clause, or
handle null values in the ON clause or the WHERE clause, by adding either AND LeftTable.ColumnA IS NOT NULL or AND LeftTable.ColumnA IS NULL.
I thought those were unnecessary, since one uses a LEFT JOIN if one does not mind returning null rows from the right table when the values of the right table join column do not match the left table join column, whether the comparison uses equality or inequality. My intent is that it does not have to be equal to the right table join column values. If the left table join column is null, it is ok for me to return null rows on the right table, as a null is not equal to anything.
What is it that I am not seeing here?
MAJOR EDIT:
So I am adding table definitions and scripts. These are not the exact scripts; they just illustrate the problem. I have removed earlier edits that were incorrect, as I was not in front of the script at the time.
CREATE TABLE dbo.Contact (
ContactID int NOT NULL, --PK
FirstName varchar(10) NULL,
LastName varchar(10) NULL,
StatusID int NULL,
CONSTRAINT PK_Contact_ContactID
PRIMARY KEY CLUSTERED (ContactID)
);
GO
CREATE TABLE dbo.UserGroup (
UserGroupID int NOT NULL, --PK
UserGroup varchar(50) NULL,
StatusID int NULL,
CONSTRAINT PK_UserGroup_UserGroupID
PRIMARY KEY CLUSTERED (UserGroupID)
);
GO
CREATE TABLE dbo.UserGroupContact (
UserGroupID int NOT NULL, --PK,FK
ContactID int NOT NULL, --PK,FK
StatusID int NULL,
CONSTRAINT PK_UserGroupContact_UserGroupContactID
PRIMARY KEY CLUSTERED (UserGroupID, ContactID),
CONSTRAINT FK_UserGroupContact_UserGroupId
FOREIGN KEY (UserGroupId)
REFERENCES [dbo].[UserGroup](UserGroupId),
CONSTRAINT FK_UserGroupContact_ContactId
FOREIGN KEY (ContactId)
REFERENCES [dbo].[Contact](ContactId)
);
GO
CREATE TABLE dbo.Account (
AccountID int NOT NULL, --PK
AccountName varchar(50) NULL,
AccountManagerID int NULL, --FK
Balance int NULL,
CONSTRAINT PK_Account_AccountID
PRIMARY KEY CLUSTERED (AccountID),
CONSTRAINT FK_Account_AccountManagerID
FOREIGN KEY (AccountManagerID)
REFERENCES [dbo].[Contact](ContactId)
);
GO
My original query would look like the one below. When I say "left table", I mean the table on the left of the ON clause in a join. When I say "right table", it's the table on the right of the ON clause.
SELECT
a.AccountId,
a.AccountName,
a.Balance,
ug.UserGroup,
ugc.UserGroupID,
a.AccountManagerID,
c.FirstName,
c.LastName
FROM dbo.Account a
LEFT JOIN dbo.Contact c
ON a.AccountManagerID = c.ContactID
AND c.StatusID=1
LEFT JOIN dbo.UserGroupContact ugc
ON a.AccountManagerID = ugc.ContactID
AND ugc.StatusID=1
LEFT JOIN dbo.UserGroup ug
ON ugc.UserGroupID = ug.UserGroupID
AND ug.StatusID=1
WHERE
a.Balance > 0
AND ugc.UserGroupID = 10
AND a.AccountManagerID NOT IN (20,30)
Notice in the example script above that the first and second left joins have a nullable column on the left table and a non-nullable column on the right table. The third left join has nullable columns on both the left and right tables.
The suggestion was to "change to inner join or handle NULL condition in where clause" or "There is use of LEFT JOIN but there are non null conditions referenced in the WHERE clause."
The suggestion is to do either of these depending on intent:
a) convert to inner join (not possible as I want unmatched rows from Account table)
SELECT
a.AccountId,
a.AccountName,
a.Balance,
ug.UserGroup,
ugc.UserGroupID,
a.AccountManagerID,
c.FirstName,
c.LastName
FROM dbo.Account a
INNER JOIN dbo.Contact c
ON a.AccountManagerID = c.ContactID
AND c.StatusID=1
INNER JOIN dbo.UserGroupContact ugc
ON a.AccountManagerID = ugc.ContactID
AND ugc.StatusID=1
INNER JOIN dbo.UserGroup ug
ON ugc.UserGroupID = ug.UserGroupID
AND ug.StatusID=1
WHERE
a.Balance > 0
AND ugc.UserGroupID = 10
AND a.AccountManagerID NOT IN (20,30)
b) handle nulls in WHERE clause (not possible as I want to return rows with nulls on column a.AccountManagerID and on ugc.UserGroupID)
SELECT
a.AccountId,
a.AccountName,
a.Balance,
ug.UserGroup,
ugc.UserGroupID,
a.AccountManagerID,
c.FirstName,
c.LastName
FROM dbo.Account a
LEFT JOIN dbo.Contact c
ON a.AccountManagerID = c.ContactID
AND c.StatusID=1
LEFT JOIN dbo.UserGroupContact ugc
ON a.AccountManagerID = ugc.ContactID
AND ugc.StatusID=1
LEFT JOIN dbo.UserGroup ug
ON ugc.UserGroupID = ug.UserGroupID
AND ug.StatusID=1
WHERE
a.Balance > 0
AND ugc.UserGroupID = 10
AND a.AccountManagerID NOT IN (20,30)
AND a.AccountManagerID IS NOT NULL
AND ugc.UserGroupID IS NOT NULL
c) handle nulls in ON clause (I settled on this which I thought doesn't make sense because it's redundant)
SELECT
a.AccountId,
a.AccountName,
a.Balance,
ug.UserGroup,
ugc.UserGroupID,
a.AccountManagerID,
c.FirstName,
c.LastName
FROM dbo.Account a
LEFT JOIN dbo.Contact c
ON a.AccountManagerID = c.ContactID
AND c.StatusID=1
AND a.AccountManagerID IS NOT NULL
LEFT JOIN dbo.UserGroupContact ugc
ON a.AccountManagerID = ugc.ContactID
AND ugc.StatusID=1
AND a.AccountManagerID IS NOT NULL
LEFT JOIN dbo.UserGroup ug
ON ugc.UserGroupID = ug.UserGroupID
AND ug.StatusID=1
AND ugc.UserGroupID IS NOT NULL
WHERE
a.Balance > 0
AND ugc.UserGroupID = 10
AND a.AccountManagerID NOT IN (20,30)
I did not provide example for ISNULL(). Also, I think he was not referring to implicit inner joins.
To recap, how do I handle this suggestion: "There is use of LEFT JOIN but there are non null conditions referenced in the WHERE clause."? He commented it's a "questionable LEFT JOIN logic".

One thing your question doesn't talk about is ANSI NULLs, whether they're on or off. If ANSI NULLs are on, comparing NULL = NULL does not evaluate to true (it yields UNKNOWN, which ON and WHERE treat like false), but if they're off, NULL = NULL returns true.
You can read more about ANSI NULLs here: https://learn.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql
So if ANSI NULLs are OFF, you do need to care about comparisons written against a NULL literal or a NULL variable, such as ColumnA = NULL: those will match rows where the column is NULL. According to the page linked above, though, the setting does not change comparisons where both sides are columns, so a join condition like LeftTable.ColumnA = RightTable.ColumnA still does not match NULL to NULL.
With a column-to-column condition, the LEFT OUTER JOIN will behave as expected under either setting, and NULL foreign keys will not match anything.
If another dev is telling you that you need to be careful about NULLs in OUTER JOINs, it is worth checking whether the database you're working with has ANSI NULLs OFF and whether any join conditions compare against NULL literals or variables.
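The Microsoft page linked above states that SET ANSI_NULLS only changes comparisons in which one operand is a literal NULL or a variable that is NULL; column-to-column comparisons are described as unaffected. Here is a minimal sketch for checking that on your own server. The table variable @T and its columns A and B are hypothetical, not taken from the question.

```sql
-- Hypothetical table variable; the names @T, A and B are for illustration only.
SET ANSI_NULLS OFF;

DECLARE @T TABLE (A int NULL, B int NULL);
INSERT INTO @T (A, B) VALUES (NULL, NULL), (1, 1);

-- Comparison against a literal NULL: this is the case the setting is documented to change.
SELECT COUNT(*) AS LiteralNullMatches
FROM @T
WHERE A = NULL;

-- Column-to-column comparison, the shape of a typical join condition:
-- documented as unaffected, so the (NULL, NULL) row should not count as a match.
SELECT COUNT(*) AS ColumnMatches
FROM @T
WHERE A = B;

SET ANSI_NULLS ON;  -- restore the default
```

If ColumnMatches counts only the (1, 1) row under both settings, then a join such as a.AccountManagerID = c.ContactID is not affected by ANSI_NULLS.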

"one uses a LEFT JOIN if one does not mind returning null rows from the right table"
Left table LEFT JOIN right table ON condition returns INNER JOIN rows plus unmatched left table rows extended by nulls.
One uses left join if that's what one wants.
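A minimal sketch of that definition, using two hypothetical table variables (@LeftT and @RightT are illustrative names, not the question's tables). The left row whose key is NULL and the left row whose key has no partner both fail the ON condition, so both come back null-extended:

```sql
-- Hypothetical data for illustration only.
DECLARE @LeftT  TABLE (LeftID int NOT NULL, JoinKey int NULL);
DECLARE @RightT TABLE (JoinKey int NOT NULL, Label varchar(10) NOT NULL);

INSERT INTO @LeftT  (LeftID, JoinKey) VALUES (1, 10), (2, NULL), (3, 99);
INSERT INTO @RightT (JoinKey, Label)  VALUES (10, 'ten'), (20, 'twenty');

SELECT l.LeftID, l.JoinKey, r.Label
FROM @LeftT AS l
LEFT JOIN @RightT AS r
    ON l.JoinKey = r.JoinKey
ORDER BY l.LeftID;
-- LeftID 1 is an inner-join row (Label = 'ten').
-- LeftID 2 (NULL key) and LeftID 3 (key 99, no partner) are each returned once
-- with Label = NULL.
```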
"the join column of the left table"
A join is not on "the join column"--whatever that means. It is on the condition.
That might, say, be one column in the left table being equal to the same-named column in the right. Or be a function of one column in the left table being equal to the same-named column in the right. Or be a boolean function of same-named columns. Or involve/include any of those. Or be any boolean function of any of the input columns.
"If the left table join column is null, it is ok for me to return null rows on the right table, as a null is not equal to anything."
It seems you are suffering from a fundamental misconception. The only thing that is "ok for me to return" is the rows you were told to return, for certain possible input.
It's not a matter of, say, coding some condition on some tables because we want certain inner join rows and then accepting whatever null-extended rows we get. If we use a left join, it's because it returns the correct inner join rows & the correct null-extended rows; otherwise we want a different expression.
It is not a matter of, say, a left table row having null meaning that that row must not be part of the inner join & must be null-extended. We have some input; we want some output. If we want the inner join of two tables on some condition no matter how that condition uses nulls or any other input values plus the unmatched left table rows then we left join those tables on that condition; otherwise we want a different expression.
(Your question uses but doesn't explain "handle". You don't tell us the rows you were told to return, for certain possible input. You don't even give us example desired output for example input or your actual output for some query. So we have no way of addressing what your DBA's critique is trying to say about what you ought to do or what you are doing in your queries.)

Going to expand a bit on my comment here; this, however, is guess work based on what we have at the moment.
Based on your current wording, what you've stated is wrong. Let's take these simple tables:
USE Sandbox;
GO
CREATE TABLE Example1 (ID int NOT NULL, SomeValue varchar(10));
GO
CREATE TABLE Example2 (ID int NOT NULL, ParentID int NOT NULL, SomeOtherValue varchar(10));
GO
INSERT INTO Example1
VALUES (1,'abc'),(2,'def'),(3,'bcd'),(4,'zxy');
GO
INSERT INTO Example2
VALUES (1,1,'sadfh'),(2,1,'asdgfkhji'),(3,3,'sdfhdfsbh');
Now, let's have a simple query with a LEFT JOIN:
SELECT *
FROM Example1 E1
LEFT JOIN Example2 E2 ON E1.ID = E2.ParentID
ORDER BY E1.ID, E2.ID;
Note that 5 rows are returned. No handling of NULL was required. If you added an OR to the ON it would be nonsensical, as ParentID cannot have a value of NULL.
If, however, we add something to the WHERE for example:
SELECT *
FROM Example1 E1
LEFT JOIN Example2 E2 ON E1.ID = E2.ParentID
WHERE LEFT(E2.SomeOtherValue,1) = 's'
ORDER BY E1.ID, E2.ID;
This now turns the LEFT JOIN into an implicit INNER JOIN. The above would therefore be better written as:
SELECT *
FROM Example1 E1
JOIN Example2 E2 ON E1.ID = E2.ParentID
WHERE LEFT(E2.SomeOtherValue,1) = 's'
ORDER BY E1.ID, E2.ID;
This, however, may not be the intended output; you may well want unmatched rows (which is why you initially used a LEFT JOIN). There are 2 ways you could do that. The first is to add the criteria to the ON clause:
SELECT *
FROM Example1 E1
LEFT JOIN Example2 E2 ON E1.ID = E2.ParentID
AND LEFT(E2.SomeOtherValue,1) = 's'
ORDER BY E1.ID, E2.ID;
The other would be to add an OR (don't use ISNULL, it affects SARGability!):
SELECT *
FROM Example1 E1
LEFT JOIN Example2 E2 ON E1.ID = E2.ParentID
WHERE LEFT(E2.SomeOtherValue,1) = 's'
OR E2.ID IS NULL
ORDER BY E1.ID, E2.ID;
This, I imagine is what your senior is talking about.
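One caveat: the two forms return the same rows for the sample data above, but they are not equivalent in general. A sketch with hypothetical table variables (@P and @C are illustrative names) where a parent has a child row that fails the filter:

```sql
-- Hypothetical data: parent 1 has one child, and it does not start with 's'.
DECLARE @P TABLE (ID int NOT NULL);
DECLARE @C TABLE (ID int NOT NULL, ParentID int NOT NULL, Val varchar(10) NOT NULL);

INSERT INTO @P (ID) VALUES (1), (2);
INSERT INTO @C (ID, ParentID, Val) VALUES (1, 1, 'apple');

-- Filter in ON: every parent survives; both rows come back with Val = NULL.
SELECT p.ID, c.Val
FROM @P AS p
LEFT JOIN @C AS c
    ON p.ID = c.ParentID
    AND LEFT(c.Val, 1) = 's'
ORDER BY p.ID;

-- Filter in WHERE with OR ... IS NULL: parent 1 is joined to 'apple' first,
-- then that row fails both predicates and is dropped. Only parent 2 is returned.
SELECT p.ID, c.Val
FROM @P AS p
LEFT JOIN @C AS c
    ON p.ID = c.ParentID
WHERE LEFT(c.Val, 1) = 's'
   OR c.ID IS NULL
ORDER BY p.ID;
```

Which one is right depends on whether a left row with only non-qualifying matches should still appear.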
To repeat though:
SELECT *
FROM Example1 E1
LEFT JOIN Example2 E2 ON E1.ID = E2.ParentID OR E2.ID IS NULL
ORDER BY E1.ID, E2.ID;
Makes no sense. E2.ID cannot have a value of NULL, so the clause makes no change to the query, apart from probably making it run slower.
Cleanup:
DROP TABLE Example1;
DROP TABLE Example2;
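Applied to the tables in the question, the same idea might look like the sketch below. It assumes (this is a guess at the intent) that accounts with no manager, or with a manager who is not in user group 10, should still be returned. Under that assumption the ugc.UserGroupID = 10 test moves into the ON clause, and the NOT IN test gets an explicit NULL branch, because NULL NOT IN (20, 30) evaluates to UNKNOWN and would otherwise filter those accounts out.

```sql
-- Assumes the dbo.Account, dbo.Contact, dbo.UserGroupContact and dbo.UserGroup
-- tables from the question already exist. The intent described above is an assumption.
SELECT
    a.AccountID,
    a.AccountName,
    a.Balance,
    ug.UserGroup,
    ugc.UserGroupID,
    a.AccountManagerID,
    c.FirstName,
    c.LastName
FROM dbo.Account AS a
LEFT JOIN dbo.Contact AS c
    ON a.AccountManagerID = c.ContactID
    AND c.StatusID = 1
LEFT JOIN dbo.UserGroupContact AS ugc
    ON a.AccountManagerID = ugc.ContactID
    AND ugc.StatusID = 1
    AND ugc.UserGroupID = 10   -- moved out of WHERE so unmatched accounts are kept
LEFT JOIN dbo.UserGroup AS ug
    ON ugc.UserGroupID = ug.UserGroupID
    AND ug.StatusID = 1
WHERE
    a.Balance > 0
    -- NOT IN yields UNKNOWN for a NULL AccountManagerID, so allow NULL explicitly.
    AND (a.AccountManagerID NOT IN (20, 30) OR a.AccountManagerID IS NULL);
```

If, on the other hand, only accounts whose manager is in group 10 are wanted, the original WHERE condition already behaves like an INNER JOIN and the joins can be written that way.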

In my eyes this is very simple, as far as I understood it.
Let's try with an example.
Imagine you have 2 tables, a master and a details table.
MASTER TABLE "TheMaster"
ID NAME
1 Foo1
2 Foo2
3 Foo3
4 Foo4
5 Foo5
6 Foo6
DETAILS TABLE "TheDetails"
ID ID_FK TheDetailValue
1 1 3
2 1 5
3 3 3
4 5 2
5 5 9
6 3 6
7 1 4
TheDetails table is linked to TheMaster table through the field ID_FK.
Now, imagine running a query where you need to sum the values of the column TheDetailValue. I would go with something like this:
SELECT TheMaster.ID, TheMaster.NAME, Sum(TheDetails.TheDetailValue) AS SumOfTheDetailValue
FROM TheMaster INNER JOIN TheDetails ON TheMaster.ID = TheDetails.ID_FK
GROUP BY TheMaster.ID, TheMaster.NAME;
You would get a list like this:
ID NAME SumOfTheDetailValue
1 Foo1 12
3 Foo3 9
5 Foo5 11
But what if your query uses a LEFT JOIN instead of an INNER JOIN? For example:
SELECT TheMaster.ID, TheMaster.NAME, Sum(TheDetails.TheDetailValue) AS SumOfTheDetailValue
FROM TheMaster LEFT JOIN TheDetails ON TheMaster.ID = TheDetails.ID_FK
GROUP BY TheMaster.ID, TheMaster.NAME;
The result would be:
ID  NAME  SumOfTheDetailValue
1   Foo1  12
2   Foo2  NULL
3   Foo3  9
4   Foo4  NULL
5   Foo5  11
6   Foo6  NULL
You will obtain a NULL for each master row having no rows in the details table.
How do you exclude these rows? Using an IS NOT NULL test!
SELECT TheMaster.ID, TheMaster.NAME, Sum(TheDetails.TheDetailValue) AS SumOfTheDetailValue
FROM TheMaster LEFT JOIN TheDetails ON TheMaster.ID = TheDetails.ID_FK
WHERE (((TheDetails.ID_FK) Is Not Null))
GROUP BY TheMaster.ID, TheMaster.NAME;
...which would take us to these results:
ID NAME SumOfTheDetailValue
1 Foo1 12
3 Foo3 9
5 Foo5 11
...which is exactly what we obtained before using an INNER JOIN.
So, in the end, I guess your colleague is talking about using an IS NULL / IS NOT NULL test in order to exclude the records that have no related row in another table.
That's it.
For example purposes only, the queries were written in MS Access (for a rapid test), where the test is spelled "Is Null" or "Is Not Null". In T-SQL the predicate is likewise IS NULL / IS NOT NULL; note that the T-SQL ISNULL() function is something different, since it substitutes a replacement value for NULL rather than testing for it.
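In T-SQL specifically, ISNULL(expression, replacement) substitutes a value, while IS NULL / IS NOT NULL are the tests. A small sketch of both, using hypothetical table variables (@TheMaster and @TheDetails) that hold a cut-down version of the sample data:

```sql
-- Hypothetical, reduced copy of the sample data above.
DECLARE @TheMaster  TABLE (ID int NOT NULL, NAME varchar(10) NOT NULL);
DECLARE @TheDetails TABLE (ID int NOT NULL, ID_FK int NOT NULL, TheDetailValue int NOT NULL);

INSERT INTO @TheMaster  (ID, NAME) VALUES (1, 'Foo1'), (2, 'Foo2');
INSERT INTO @TheDetails (ID, ID_FK, TheDetailValue) VALUES (1, 1, 3), (2, 1, 5);

-- ISNULL() replaces the NULL sum with 0; it does not remove the row.
SELECT m.ID,
       m.NAME,
       SUM(d.TheDetailValue)            AS RawSum,     -- 8 for Foo1, NULL for Foo2
       ISNULL(SUM(d.TheDetailValue), 0) AS SumOrZero   -- 8 for Foo1, 0 for Foo2
FROM @TheMaster AS m
LEFT JOIN @TheDetails AS d
    ON m.ID = d.ID_FK
GROUP BY m.ID, m.NAME
ORDER BY m.ID;

-- IS NOT NULL in the WHERE clause removes the unmatched row (Foo2),
-- which gives the same result as an INNER JOIN.
SELECT m.ID, m.NAME, SUM(d.TheDetailValue) AS SumOfTheDetailValue
FROM @TheMaster AS m
LEFT JOIN @TheDetails AS d
    ON m.ID = d.ID_FK
WHERE d.ID_FK IS NOT NULL
GROUP BY m.ID, m.NAME;
```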

Related

T-SQL How to join to multiple tables depending on value

So I have some complicated production tables. What it comes down to, however, is that I'd like to be able to join to multiple tables depending on where the value I'm seeking is. Specifically, say I have an employee ID, "JHDOE", and I want to join it to the table where I can get that employee's name. The main table for employees is "Table A":
Notice that the field ID_2 does NOT have the value "JHDOE". Instead, it has "DOEJH". Well, there is another table that actually has the value I'm seeking, "Table B":
In this table, we do see "JHDOE" so at first I tried something like this:
from TableStart as start
left join TableB as b on
case
when start.EMP_ID like '[0-9]%'
then b.ID
else b.ID_2
end = start.EMP_ID
But this created other problems. So what I'd like to do is do something like join to EITHER table, or at least something to the same effect. One method I tried was this:
from TableStart as start
left join (select a.Name, a.ID, a.ID_2
from TableA as a
union
select b.Name, b.ID, b.ID_2
from TableB as b) names on
case
when start.EMP_ID like '[0-9]%'
then names.ID
else names.ID_2
end = start.EMP_ID
The result set for the union looks like this:
On my production data, this same scenario resulted in a blank. I suppose it doesn't know which row to join to? So I think I need to do something like pivot the rows into columns and then do an OR... but I'm not sure. I would be greatly appreciative of any guidance or instruction.
Instead of using a CASE in the join I think an IN operator would work. That will function as an OR (EMP_ID = ID or EMP_ID = ID_2). And if you LEFT JOIN from TableStart to both TableA and TableB and use the COALESCE function with each column from TableA and TableB you will get the first non-null value.
This assumes that if both TableA and TableB have a match for TableStart you are fine with taking the first non-null value which could result in a mixture of values from TableA and TableB if TableA has some null values.
select
start.*
, coalesce (a.Name, b.Name) as [Name]
, coalesce (a.ID, b.ID) as [ID]
, coalesce (a.ID_2, b.ID_2) as [ID_2]
from TableStart as start
left join TableA as a on start.EMP_ID in (a.ID, a.ID_2)
left join TableB as b on start.EMP_ID in (b.ID, b.ID_2)

Ignore condition in WHERE clause when column is NULL

I have a table where one row (with Type = 'E') is related to another row.
I have written a query to return the COUNT of those related rows. The problem is that there is no explicit relationship (like an ID column that would clearly say which row is related to which). Therefore I am trying to find the relationship based on multiple conditions in the WHERE clause.
The problem is that in a few cases, columns A and B could be NULL (for records where Type = 'M'). In such cases I would like to ignore that condition, so it would use only the first 3 conditions to determine the relationship.
I have tried a CASE expression, but it is not working as expected:
SELECT [T1].[ID],[T1].[AlphaId],[T1].[Type],[T1].[A],[T1].[B],[T1].[Date],[T1].[ServiceID]
,( SELECT COUNT(*)
FROM MyTable T2
WHERE [T1].[AlphaId]=[T2].[AlphaId] AND
[T1].[Date]=[T2].[Date] AND
[T1].[ServiceID]=[T2].[ServiceID] AND
[T2].[A]=CASE WHEN [T2].[A] IS NULL THEN NULL ELSE [T1].[A] END AND
[T2].[B]=CASE WHEN [T2].[B] IS NULL THEN NULL ELSE [T1].[B] END AND
[T2].[Type]='M'
) as TotalCount
FROM MyTable T1
WHERE [T1].[Type] = 'E'
I can't drop that condition entirely, since in some cases the Date and ServiceID could be the same, and it's A and B that distinguish the records. Luckily, where A and B are NULL, it is the Date and ServiceID that distinguish the two records.
http://sqlfiddle.com/#!3/c98db/1
Many thanks in advance.
You could join the tables and use COUNT and GROUP BY to get the counts. Then you can JOIN [A] and [B] if they are equal or NULL.
SELECT [T1].[ID],[T1].[AlphaId],[T1].[Type],[T1].[A],[T1].[B],[T1].[Date],[T1].[ServiceID], count([T2].[ID])
FROM MyTable T1
INNER JOIN MyTable T2 ON [T1].[AlphaId]=[T2].[AlphaId] AND
[T1].[Date]=[T2].[Date] AND
[T1].[ServiceID]=[T2].[ServiceID] AND
([T2].[A] = [T1].[A] OR [T2].[A] IS NULL) AND
([T2].[B] = [T1].[B] OR [T2].[B] IS NULL) AND
[T2].[Type] <> [T1].[Type]
WHERE [T1].[Type] = 'E'
GROUP BY [T1].[ID],[T1].[AlphaId],[T1].[Type],[T1].[A],[T1].[B],[T1].[Date],[T1].[ServiceID]

T-SQL: Can a subquery in the SELECT clause implicitly reference a table in the main outer query?

In T-SQL, is it possible to have a subquery in the SELECT clause that implicitly references tables in the main, outer query? For example:
select NAME,
case when exists (select o.ORDERID) then 1 else 0 end BUYER
from CUSTOMER c
left join [ORDER] o
on c.CUSTID = o.CUSTID
In other words, can I write subqueries without a FROM clause?
Intellisense seems to recognize outer table aliases in subqueries, but I can't find any documentation that says this is acceptable T-SQL. I can certainly run some of my own tests, but I also wanted to check with the community. Thanks.
Yes this is valid syntax.
A SELECT without a FROM is treated as though selecting from a single row table. Referencing columns from the outer query is required for correlated sub queries and is perfectly valid there.
The particular query you have makes no sense though. It will always evaluate to 1 as that subquery always returns a single row (with a single column containing the corresponding o.OrderId) going into the EXISTS check.
Probably you want to check o.OrderId IS NULL instead (note that ORDER is a reserved word, so the table name needs brackets):
SELECT NAME,
CASE
WHEN o.ORDERID IS NULL THEN 0
ELSE 1
END AS BUYER
FROM CUSTOMER c
LEFT JOIN [ORDER] o
ON c.CUSTID = o.CUSTID
Where it does make sense to use this type of syntax is in a null safe equality check.
e.g.
SELECT A,
B,
CASE
WHEN EXISTS (SELECT T.A
EXCEPT
SELECT T.B) THEN 1
ELSE 0
END AS DistinctFrom
FROM T
Is equivalent to
SELECT A,
B,
CASE
WHEN A <> B
OR ( A IS NULL
AND B IS NOT NULL )
OR ( A IS NOT NULL
AND B IS NULL ) THEN 1
ELSE 0
END AS DistinctFrom
FROM T
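To see the null-safe comparison at work, here is a short sketch over a hypothetical table variable @T covering the four interesting cases:

```sql
-- Hypothetical data: equal, different, one NULL, both NULL.
DECLARE @T TABLE (A int NULL, B int NULL);
INSERT INTO @T (A, B) VALUES (1, 1), (1, 2), (NULL, 1), (NULL, NULL);

SELECT t.A,
       t.B,
       CASE
           -- EXCEPT treats two NULLs as equal, so the subquery returns a row
           -- only when A and B are genuinely different.
           WHEN EXISTS (SELECT t.A
                        EXCEPT
                        SELECT t.B) THEN 1
           ELSE 0
       END AS DistinctFrom
FROM @T AS t;
-- (1, 1)       -> 0
-- (1, 2)       -> 1
-- (NULL, 1)    -> 1
-- (NULL, NULL) -> 0
```

A plain A <> B would yield UNKNOWN for the last two rows, which is why the longer CASE expression above needs the extra IS NULL branches.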

Query with Left Outer Join

I'm having trouble figuring this out.
According to Jeff Atwood's "A Visual Explanation of SQL Joins", a left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.
The left table (TableA) doesn't have duplicates. The right table (TableB) has 1 or 2 entries for each client number. The PrimaryTP column designates one as primary with 1, and the other has 0.
I shouldn't have to include the line And B.PrimaryTP = 1 because TableA doesn't have duplicates. Yet if I leave it out I get duplicate client numbers. Why?
Can you help me understand how this works? It's very confusing to me. The logic of And B.PrimaryTP = 1 escapes me, yet it seems to work. Still, I'm scared to trust it if I don't understand it. Or do I have a logic error hidden in the query?
SELECT A.ClientNum --returns a list with no duplicate client numbers
FROM (...<TableA>
) as A
Left Outer Join
<TableB> as B
on A.ClientNum = B.ClientNum
--eliminate mismatch of (ClientNum <> FolderNum)
Where A.ClientNum Not In
(
Select ClientNum From <TableB>
Where ClientNum Is Not Null
And ClientNum <> IsNull(FolderNum, '')
)
--eliminate case where B.PrimaryTP <> 1
And B.PrimaryTP = 1
The difference between an INNER JOIN and a LEFT JOIN is just that the LEFT JOIN still returns the rows in Table A when there are no corresponding rows in Table B.
But it's still a JOIN, which means that if there is more than one corresponding row in Table B, it will join the row from Table A to each one of them.
So if you want to make sure that you get no more than one result for each row in Table A, you have to make sure that no more than one row from Table B is found - hence the And B.PrimaryTP = 1.
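Where that condition is placed also matters. A sketch with hypothetical table variables (@A and @B stand in for TableA and TableB, and the data is made up): in the WHERE clause the test also removes clients that have no primary row in B, while in the ON clause those clients are kept and null-extended.

```sql
-- Hypothetical data: client 1 has a primary and a secondary row,
-- client 2 has only a secondary row, client 3 has no rows in B.
DECLARE @A TABLE (ClientNum int NOT NULL);
DECLARE @B TABLE (ClientNum int NOT NULL, PrimaryTP bit NOT NULL);

INSERT INTO @A (ClientNum) VALUES (1), (2), (3);
INSERT INTO @B (ClientNum, PrimaryTP) VALUES (1, 1), (1, 0), (2, 0);

-- Condition in WHERE: only client 1 is returned. Client 2's row has PrimaryTP = 0,
-- and client 3's null-extended row fails NULL = 1, so both are filtered out.
SELECT a.ClientNum, b.PrimaryTP
FROM @A AS a
LEFT OUTER JOIN @B AS b
    ON a.ClientNum = b.ClientNum
WHERE b.PrimaryTP = 1;

-- Condition in ON: one row per client. Client 1 pairs with its primary row;
-- clients 2 and 3 come back with PrimaryTP = NULL.
SELECT a.ClientNum, b.PrimaryTP
FROM @A AS a
LEFT OUTER JOIN @B AS b
    ON a.ClientNum = b.ClientNum
    AND b.PrimaryTP = 1
ORDER BY a.ClientNum;
```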
If you have one client number in A and two matches in Table B, then you will get duplicates.
Suppose you have the following data,
Table-A(client Num)    Table-B(client Num)
1                      2
2                      2
The left join results:
Table-A(client Num)    Table-B(client Num)
1                      (null)
2                      2
2                      2
This is the cause of duplicates. So you need to take distinct values from Table B or perform a DISTINCT on the result set.
"I shouldn't have to include the line And B.PrimaryTP = 1 because TableA doesn't have duplicates. Yet if I leave it out I get duplicate client numbers. Why?"
Because both rows in the right table match a row in the left table. There is no way for SQL Server to output a triangular result; it must show the columns from both tables for every joined row. And this is true for INNER JOIN as well.
DECLARE @a TABLE(a INT);
DECLARE @b TABLE(b INT);
INSERT @a VALUES(1),(2);
INSERT @b VALUES(1),(1);
SELECT a.a, b.b FROM @a AS a
LEFT OUTER JOIN @b AS b ON a.a = b.b;
SELECT a.a, b.b FROM @a AS a
INNER JOIN @b AS b ON a.a = b.b;
Results:
a b
-- ----
1 1
1 1
2 NULL
a b
-- --
1 1
1 1
The joins are explained very well at the link you gave. The problem is not duplicates in table A; it is that for one record in A there are (in some cases) two records in B. To avoid this you can use either a DISTINCT clause or a GROUP BY clause.
The LEFT OUTER JOIN will give you all the records from A with all the matching records from B. The difference with an INNER JOIN is that if there are no matching records in B, an INNER join will omit the record from A entirely, while the LEFT join will then still include a row with the results from A.
In your case, however, you may also want to check out the DISTINCT keyword.

Make use of index when JOIN'ing against multiple columns

Simplified, I have two tables, contacts and donotcall
CREATE TABLE contacts
(
id int PRIMARY KEY,
phone1 varchar(20) NULL,
phone2 varchar(20) NULL,
phone3 varchar(20) NULL,
phone4 varchar(20) NULL
);
CREATE TABLE donotcall
(
list_id int NOT NULL,
phone varchar(20) NOT NULL
);
CREATE NONCLUSTERED INDEX IX_donotcall_list_phone ON donotcall
(
list_id ASC,
phone ASC
);
I would like to see which contacts match a phone number in a specific DoNotCall list.
For faster lookup, I have indexed donotcall on list_id and phone.
When I make the following JOIN it takes a long time (e.g. 9 seconds):
SELECT DISTINCT c.id
FROM contacts c
JOIN donotcall d
ON d.list_id = 1
AND d.phone IN (c.phone1, c.phone2, c.phone3, c.phone4)
Execution plan on Pastebin
While if I LEFT JOIN on each phone field separately it runs a lot faster (e.g. 1.5 seconds):
SELECT c.id
FROM contacts c
LEFT JOIN donotcall d1
ON d1.list_id = 1
AND d1.phone = c.phone1
LEFT JOIN donotcall d2
ON d2.list_id = 1
AND d2.phone = c.phone2
LEFT JOIN donotcall d3
ON d3.list_id = 1
AND d3.phone = c.phone3
LEFT JOIN donotcall d4
ON d4.list_id = 1
AND d4.phone = c.phone4
WHERE
d1.phone IS NOT NULL
OR d2.phone IS NOT NULL
OR d3.phone IS NOT NULL
OR d4.phone IS NOT NULL
Execution plan on Pastebin
My assumption is that the first snippet runs slowly because it doesn't utilize the index on donotcall.
So, how to do a join towards multiple columns and still have it use the index?
SQL Server might think resolving IN (c.phone1, c.phone2, c.phone3, c.phone4) using an index is too expensive.
You can test if the index would be faster with a hint:
SELECT c.*
FROM contacts c
JOIN donotcall d with (index(IX_donotcall_list_phone))
ON d.list_id = 1
AND d.phone IN (c.phone1, c.phone2, c.phone3, c.phone4)
The query plans you posted show that the first plan is estimated to produce 40k rows but actually returns just 21. The second plan estimates 1 row (and of course returns 21 too).
Are your statistics up to date? Out-of-date statistics can explain the query analyzer making bad choices. Statistics should be updated automatically or in a weekly job. Check the age of your statistics with:
select object_name(ind.object_id) as TableName
, ind.name as IndexName
, stats_date(ind.object_id, ind.index_id) as StatisticsDate
from sys.indexes ind
order by
stats_date(ind.object_id, ind.index_id) desc
You can update them manually with:
EXEC sp_updatestats;
With this poor database structure, a UNION ALL query might be fastest.
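A sketch of what such a UNION ALL query could look like against the contacts and donotcall tables from the question. Each branch uses a plain equality on phone, which gives the optimizer a chance to seek on IX_donotcall_list_phone. Whether it is actually faster on your data is something to measure; nothing here has been timed.

```sql
-- Assumes the contacts and donotcall tables defined in the question.
-- One equality join per phone column, combined with UNION ALL;
-- the outer DISTINCT removes contacts that match on more than one column.
SELECT DISTINCT m.id
FROM (
    SELECT c.id
    FROM contacts AS c
    JOIN donotcall AS d
        ON d.list_id = 1 AND d.phone = c.phone1
    UNION ALL
    SELECT c.id
    FROM contacts AS c
    JOIN donotcall AS d
        ON d.list_id = 1 AND d.phone = c.phone2
    UNION ALL
    SELECT c.id
    FROM contacts AS c
    JOIN donotcall AS d
        ON d.list_id = 1 AND d.phone = c.phone3
    UNION ALL
    SELECT c.id
    FROM contacts AS c
    JOIN donotcall AS d
        ON d.list_id = 1 AND d.phone = c.phone4
) AS m;
```

NULL phone columns need no special handling here, since NULL never satisfies d.phone = c.phoneN.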

Resources