Make use of index when JOIN'ing against multiple columns - sql-server

Simplified, I have two tables, contacts and donotcall:
CREATE TABLE contacts
(
id int PRIMARY KEY,
phone1 varchar(20) NULL,
phone2 varchar(20) NULL,
phone3 varchar(20) NULL,
phone4 varchar(20) NULL
);
CREATE TABLE donotcall
(
list_id int NOT NULL,
phone varchar(20) NOT NULL
);
CREATE NONCLUSTERED INDEX IX_donotcall_list_phone ON donotcall
(
list_id ASC,
phone ASC
);
I would like to see which contacts match a phone number on a specific DoNotCall list.
For faster lookup, I have indexed donotcall on list_id and phone.
When I make the following JOIN it takes a long time (e.g. 9 seconds):
SELECT DISTINCT c.id
FROM contacts c
JOIN donotcall d
ON d.list_id = 1
AND d.phone IN (c.phone1, c.phone2, c.phone3, c.phone4)
Execution plan on Pastebin
While if I LEFT JOIN on each phone field separately it runs a lot faster (e.g. 1.5 seconds):
SELECT c.id
FROM contacts c
LEFT JOIN donotcall d1
ON d1.list_id = 1
AND d1.phone = c.phone1
LEFT JOIN donotcall d2
ON d2.list_id = 1
AND d2.phone = c.phone2
LEFT JOIN donotcall d3
ON d3.list_id = 1
AND d3.phone = c.phone3
LEFT JOIN donotcall d4
ON d4.list_id = 1
AND d4.phone = c.phone4
WHERE
d1.phone IS NOT NULL
OR d2.phone IS NOT NULL
OR d3.phone IS NOT NULL
OR d4.phone IS NOT NULL
Execution plan on Pastebin
My assumption is that the first snippet runs slowly because it doesn't utilize the index on donotcall.
So, how do I join on multiple columns and still have the query use the index?

SQL Server might think resolving IN (c.phone1, c.phone2, c.phone3, c.phone4) using an index is too expensive.
You can test if the index would be faster with a hint:
SELECT c.*
FROM contacts c
JOIN donotcall d with (index(IX_donotcall_list_phone))
ON d.list_id = 1
AND d.phone IN (c.phone1, c.phone2, c.phone3, c.phone4)
The query plans you posted show that the first plan is estimated to produce 40k rows, yet it returns only 21 rows. The second plan estimates 1 row (and of course also returns 21).
Are your statistics up to date? Out-of-date statistics can explain the optimizer making bad choices. Statistics should be updated automatically or by a weekly job. Check the age of your statistics with:
select object_name(ind.object_id) as TableName
, ind.name as IndexName
, stats_date(ind.object_id, ind.index_id) as StatisticsDate
from sys.indexes ind
order by
stats_date(ind.object_id, ind.index_id) desc
You can update them manually with:
EXEC sp_updatestats;
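Or, to refresh just the index in question rather than the whole database (a small example; WITH FULLSCAN is optional but scans every row for maximum accuracy):
UPDATE STATISTICS donotcall IX_donotcall_list_phone WITH FULLSCAN;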

With this poor database structure, a UNION ALL query might be fastest.
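A minimal sketch of that rewrite, assuming the schema above: each UNION ALL branch exposes one phone column, so every branch of the join can seek IX_donotcall_list_phone on (list_id, phone).
SELECT DISTINCT p.id
FROM (
    SELECT id, phone1 AS phone FROM contacts
    UNION ALL
    SELECT id, phone2 FROM contacts
    UNION ALL
    SELECT id, phone3 FROM contacts
    UNION ALL
    SELECT id, phone4 FROM contacts
) p
JOIN donotcall d
    ON d.list_id = 1
    AND d.phone = p.phone;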

Related

LEFT JOIN With Redundant Predicate Performs Better Than a CROSS JOIN?

I'm looking at the execution plans for two of these statements and am kind of stumped on why the LEFT JOIN statement performs better than the CROSS JOIN statement:
Table Definitions:
CREATE TABLE [Employee] (
[ID] int NOT NULL IDENTITY(1,1),
[FirstName] varchar(40) NOT NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED ([ID] ASC)
);
CREATE TABLE [dbo].[Numbers] (
[N] INT IDENTITY (1, 1) NOT NULL,
CONSTRAINT [PK_Numbers] PRIMARY KEY CLUSTERED ([N] ASC)
); --The Numbers table contains numbers 0 to 100,000.
Queries in Question where I join one 'day' to each Employee:
DECLARE @PeriodStart AS date = '2019-11-05';
DECLARE @PeriodEnd AS date = '2019-11-05';
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS JOIN (SELECT DATEADD(day, N.N, @PeriodStart) AS ClockDate
FROM Numbers N
WHERE N.N <= DATEDIFF(day, @PeriodStart, @PeriodEnd)
) CD
WHERE E.ID > 2000;
SELECT E.FirstName, CD.ClockDate
FROM Employee E
LEFT JOIN (SELECT DATEADD(day, N.N, @PeriodStart) AS ClockDate
FROM Numbers N
WHERE N.N <= DATEDIFF(day, @PeriodStart, @PeriodEnd)
) CD ON CD.ClockDate = CD.ClockDate
WHERE E.ID > 2000;
The Execution Plans:
https://www.brentozar.com/pastetheplan/?id=B139JjPKK
As you can see, according to the optimizer the second (left join) query with the seemingly redundant predicate seems to cost way less than the first (cross join) query. This is also the case when the period dates span multiple days.
What's weird is that if I change the LEFT JOIN's predicate to something different like 1 = 1, it'll perform like the CROSS JOIN. I also tried changing the SELECT portion of the LEFT JOIN to SELECT N and joining on CD.N = CD.N, but that also performs poorly.
According to the execution plan, the second query has an index seek that only reads 3000 rows from the Numbers table while the first query is reading 10 times as many. The second query's index seek also has this predicate (which I assume comes from the LEFT JOIN):
dateadd(day,[Numbers].[N] as [N].[N],[@PeriodStart])=dateadd(day,[Numbers].[N] as [N].[N],[@PeriodStart])
I would like to understand why the second query seems to perform so much better even though I wouldn't expect it to. Does it have something to do with the fact that I'm joining the results of the DATEADD function? Is SQL Server evaluating the results of DATEADD before joining?
The reason these queries get different estimates, even though the plan is almost the same and will probably take the same time, appears to be that DATEADD(day, N.N, @PeriodStart) is nullable, so CD.ClockDate = CD.ClockDate essentially just verifies that the result is not null. The optimizer cannot see that it will always be non-null, so it lowers the row estimate because of it.
But it seems to me that the primary performance problem in your query is that you are selecting the whole of your Numbers table every time. Instead, you should select just the number of rows you need:
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS JOIN (
SELECT TOP (DATEDIFF(day, @PeriodStart, @PeriodEnd) + 1)
DATEADD(day, N.N, @PeriodStart) AS ClockDate
FROM Numbers N
ORDER BY N.N
) CD
WHERE E.ID > 2000;
Using this technique, you can even use CROSS APPLY (SELECT TOP (outerValue) ...) if you want to correlate the number of generated rows to the rest of the query, as sketched below.
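For illustration, a hedged sketch of that correlated form; DaysToGenerate is a hypothetical per-employee column, not part of the schema above:
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS APPLY (
    SELECT TOP (E.DaysToGenerate) -- hypothetical outer value driving the row count
        DATEADD(day, N.N, @PeriodStart) AS ClockDate
    FROM Numbers N
    ORDER BY N.N
) CD
WHERE E.ID > 2000;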
For further tips on numbers tables, see Itzik Ben-Gan's excellent series

Speed up view performance

I have an old view that takes 4 mins to run, I have been asked to speed it up. The FROM looks like this:
FROM TableA
CROSS JOIN ViewA
INNER JOIN TableB on ViewA.Name = TableB.Name
AND TableA.Code = TableB.Code
AND TableA.Location = TableB.Location
WHERE (DATEDIFF(m, ViewA.SubmitDate, GETDATE()) = 1) -- Only pull last month's rows
TableA has around 99k rows, ViewA has around 2,000 rows, and TableB has around 101k rows. I think the problem is at the INNER JOIN, because if I remove it, the query takes 1 second.
My first thought was to see if I could cut down the number of rows in ViewA by breaking the whole thing into CTEs, but this made zero impact. I am thinking I need to index TableB, because the joins use a bunch of varchar columns. I am now changing it to temp tables so I can index them. I cannot change the underlying tables and views. Are indexed temp tables a good way to go, or is there a better solution?
Edit to add info regarding existing indexes. Only thing with an index on it right now is TableA.Id which is the PK and a clustered Index. TableB has an Id field but it is not the PK. ViewA is not indexed.
Edit again to correct some structure. SubmitDate is in the View, not the table.
Here is a very basic structure:
CREATE TABLE TableA
(
Id int NOT NULL PRIMARY KEY,
Section varchar(20) NULL,
Code varchar(20) NULL
)
CREATE TABLE TableB
(
Id int NOT NULL PRIMARY KEY,
Name varchar(20) NULL,
Code varchar(20) NULL,
Section varchar(20) NULL
)
CREATE TABLE TableC
(
Id int NOT NULL PRIMARY KEY,
Name varchar(20) NULL,
SubmitDate DateTime NOT NULL
)
CREATE TABLE TableD
(
Id int NOT NULL PRIMARY KEY,
Section varchar(20) NULL
)
CREATE VIEW ViewA
AS
SELECT c.Section, d.Name, c.SubmitDate
FROM TableC c
JOIN TableD d ON c.Id = d.Id
One improvement is to rewrite the WHERE clause so that it is sargable. Add an index on SubmitDate if there is none, and change the query to:
FROM TableA
CROSS JOIN ViewA
INNER JOIN TableB on ViewA.Name = TableB.Name
AND TableA.Code = TableB.Code
AND TableA.Location = TableB.Location
WHERE
ViewA.SubmitDate >= DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0)
AND ViewA.SubmitDate < DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
Also add nonclustered indexes on Name, Code and Location columns.
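Since you mention moving to temp tables, here is a minimal sketch of that approach (Location is assumed to exist on the real TableB; it is absent from the simplified structure above):
SELECT Id, Name, Code, Location
INTO #TableB
FROM TableB;

CREATE NONCLUSTERED INDEX IX_TableB_Name_Code_Location
ON #TableB (Name, Code, Location);
-- ...then join against #TableB instead of TableB.
Whether this beats simply indexing the base table depends on how often the view runs, since the copy itself costs a full scan of TableB every time.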

How do I compare two rows from a SQL database table based on DateTime within 3 seconds?

I have a table of DetailRecords containing records that seem to be "duplicates" of other records, but they have a unique primary key [ID]. I would like to delete these "duplicates" from the DetailRecords table and keep the record with the longest/highest Duration. I can tell that they are linked records because their DateTime field is within 3 seconds of another row's DateTime field and the Duration is within 2 seconds of one another. Other data in the row will also be duplicated exactly, such as Number, Rate, or AccountID, but this could be the same for the data that is not "duplicate" or related.
CREATE TABLE #DetailRecords (
[AccountID] INT NOT NULL,
[ID] VARCHAR(100) NULL,
[DateTime] VARCHAR(100) NULL,
[Duration] INT NULL,
[Number] VARCHAR(200) NULL,
[Rate] DECIMAL(8,6) NULL
);
I know that I will most likely have to perform a self join on the table, but how can I find two rows that are similar within a DateTime range of plus or minus 3 seconds, instead of just exactly the same?
I am having the same trouble with the Duration within a range of plus or minus 2 seconds.
The key is taking the absolute value of the difference between the dates and durations. I don't know SQL Server, but here's how I'd do it in SQLite. The technique should be the same; only the specific function names will differ.
SELECT a.id, b.id
FROM DetailRecords a
JOIN DetailRecords b
ON a.id > b.id
WHERE abs(strftime('%s', a.DateTime) - strftime('%s', b.DateTime)) <= 3
AND abs(a.duration - b.duration) <= 2
Taking the absolute value of the difference covers the "plus or minus" part of the range. The self join uses a.id > b.id rather than a.id <> b.id because the latter would return every pair twice, once in each order.
Given the entries...
ID|DateTime |Duration
1 |2014-01-26T12:00:00|5
2 |2014-01-26T12:00:01|6
3 |2014-01-26T12:00:06|6
4 |2014-01-26T12:00:03|11
5 |2014-01-26T12:00:02|10
6 |2014-01-26T12:00:01|6
I get the pairs...
5|4
2|1
6|1
6|2
And you should really store those dates as DateTime types if you can.
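Since this is SQL Server, a hedged translation of the same technique to T-SQL, assuming the DateTime column can be cast to datetime as suggested above:
SELECT a.ID, b.ID
FROM DetailRecords a
JOIN DetailRecords b
    ON a.ID > b.ID
WHERE ABS(DATEDIFF(second, CAST(a.[DateTime] AS datetime), CAST(b.[DateTime] AS datetime))) <= 3
    AND ABS(a.Duration - b.Duration) <= 2;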
You could use a self-referential CTE and compare the DateTime fields.
;WITH CTE AS (
SELECT AccountID,
ID,
DateTime,
rn = ROW_NUMBER() OVER (PARTITION BY AccountID, ID, <insert any other matching keys> ORDER BY AccountID)
FROM table
)
SELECT earliestAccountID = c1.AccountID,
earliestDateTime = c1.DateTime,
recentDateTime = c2.DateTime,
recentAccountID = c2.AccountID
FROM cte c1
INNER JOIN cte c2
ON c1.rn = 1 AND c2.rn = 2 AND c1.DateTime <> c2.DateTime
Edit
I made several assumptions about the data set, so this may not be as relevant as you need. If you're simply looking for difference between possible duplicates, specifically DateTime differences, this will work. However, this does not constrain to your date range, nor does it automatically assume what the DateTime column is used for or how it is set.

TSQL optimisation

I have the below query which is taking 2 seconds to execute as there is a significant number of rows (1 million + each) in the two tables and was wondering if there is anything further I can do to optimise the query.
Tables
tblInspection.ID bigint (Primary Key)
tblInspection.IsPassedFirstTime bit (Non clustered index)
tblInspectionFailures.ID bigint (Primary Key)
tblInspectionFailures.InspectionID bigint (Non clustered index)
Query
SELECT TOP 1 tblInspection.ID FROM tblInspection
INNER JOIN tblInspectionFailures ON tblInspection.ID = tblInspectionFailures.InspectionID
WHERE (tblInspection.IsPassedFirstTime = 1)
Execution Plan
I can see that I am doing clustered seeks on the indexes, but it's still taking some time.
The only thing I can think of is:
SELECT i.ID FROM
(select TOP 1 id from tblInspection
WHERE IsPassedFirstTime = 1) i
INNER JOIN tblInspectionFailures ON
i.ID = tblInspectionFailures.InspectionID
Try:
SET ROWCOUNT 1
SELECT tblInspection.ID FROM tblInspection
INNER JOIN tblInspectionFailures ON tblInspection.ID = tblInspectionFailures.InspectionID
WHERE (tblInspection.IsPassedFirstTime = 1)
This does basically the same thing, but tells SQL Server to stop returning rows after the first one.
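Note that SET ROWCOUNT stays in effect for the session until you reset it with SET ROWCOUNT 0. A hedged alternative that keeps TOP and lets the optimizer treat the failure lookup as a semi-join, stopping at the first matching failure row per inspection:
SELECT TOP 1 i.ID
FROM tblInspection i
WHERE i.IsPassedFirstTime = 1
    AND EXISTS (SELECT 1
        FROM tblInspectionFailures f
        WHERE f.InspectionID = i.ID);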

How can I get this query to return 0 instead of null?

I have this query:
SELECT (SUM(tblTransaction.AmountPaid) - SUM(tblTransaction.AmountCharged)) AS TenantBalance, tblTransaction.TenantID
FROM tblTransaction
GROUP BY tblTransaction.TenantID
But there's a problem with it; there are other TenantIDs that don't have transactions, and I want to get those too.
For example, the transaction table has 3 rows for bob, 2 rows for john, and none for jane. I want it to return the sums for bob and john AND return 0 for jane (or possibly null, if there's no other way).
How can I do this?
Tables are like this:
Tenants
ID
Other Data
Transactions
ID
TenantID (fk to Tenants)
Other Data
(You didn't state your SQL engine, so I'm going to link to the MySQL documentation.)
This is pretty much exactly what the COALESCE() function is meant for. You can feed it a list, and it'll return the first non-null value in the list. You would use this in your query as follows:
SELECT COALESCE((SUM(tr.AmountPaid) - SUM(tr.AmountCharged)), 0) AS TenantBalance, te.ID
FROM tblTenant AS te
LEFT JOIN tblTransaction AS tr ON (tr.TenantID = te.ID)
GROUP BY te.ID;
That way, if the SUM() result would be NULL, it's replaced with zero.
Edited: I rewrote the query using a LEFT JOIN as well as the COALESCE(); I think this is the key to what you were missing originally. If you only select from the Transactions table, there is no way to get information about things not in the table. However, by using a left join from the Tenants table, you get a row for every existing tenant.
Below is a full walkthrough of the problem. The function isnull has also been included to ensure that a balance of zero (rather than null) is returned for Tenants with no transactions.
create table tblTenant
(
ID int identity(1,1) primary key not null,
Name varchar(100)
);
create table tblTransaction
(
ID int identity(1,1) primary key not null,
tblTenantID int,
AmountPaid money,
AmountCharged money
);
insert into tblTenant(Name)
select 'bob' union all select 'Jane' union all select 'john';
insert into tblTransaction(tblTenantID,AmountPaid, AmountCharged)
select 1,5.00,10.00
union all
select 1,10.00,10.00
union all
select 1,10.00,10.00
union all
select 2,10.00,15.00
union all
select 2,15.00,15.00
select * from tblTenant
select * from tblTransaction
SELECT
tenant.ID,
tenant.Name,
isnull(SUM(Trans.AmountPaid) - SUM(Trans.AmountCharged),0) AS Balance
FROM tblTenant tenant
LEFT JOIN tblTransaction Trans ON
tenant.ID = Trans.tblTenantID
GROUP BY tenant.ID, tenant.Name;
drop table tblTenant;
drop table tblTransaction;
Select Tenants.ID, IsNull((Sum(Transactions.AmountPaid) - Sum(Transactions.AmountCharged)), 0) As TenantBalance
From Tenants
Left Outer Join Transactions On Tenants.ID = Transactions.TenantID
Group By Tenants.ID
I didn't syntax check it but it is close enough.
SELECT (SUM(ISNULL(tblTransaction.AmountPaid, 0))
- SUM(ISNULL(tblTransaction.AmountCharged, 0))) AS TenantBalance
, tblTransaction.TenantID
FROM tblTransaction
GROUP BY tblTransaction.TenantID
I only added this because, if your intention is to account for one of the parts being null, you'll need to apply ISNULL to each SUM separately.
Actually, I found an answer:
SELECT tenant.ID, ISNULL(SUM(trans.AmountPaid) - SUM(trans.AmountCharged),0) AS Balance FROM tblTenant tenant
LEFT JOIN tblTransaction trans
ON tenant.ID = trans.TenantID
GROUP BY tenant.ID
