LEFT JOIN With Redundant Predicate Performs Better Than a CROSS JOIN? - sql-server

I'm looking at the execution plans for these two statements and am kind of stumped as to why the LEFT JOIN statement performs better than the CROSS JOIN statement:
Table Definitions:
CREATE TABLE [Employee] (
[ID] int NOT NULL IDENTITY(1,1),
[FirstName] varchar(40) NOT NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED ([ID] ASC)
);
CREATE TABLE [dbo].[Numbers] (
[N] INT IDENTITY (1, 1) NOT NULL,
CONSTRAINT [PK_Numbers] PRIMARY KEY CLUSTERED ([N] ASC)
); --The Numbers table contains numbers 0 to 100,000.
Queries in Question where I join one 'day' to each Employee:
DECLARE @PeriodStart AS date = '2019-11-05';
DECLARE @PeriodEnd AS date = '2019-11-05';
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS JOIN (SELECT DATEADD(day, N.N, @PeriodStart) AS ClockDate
FROM Numbers N
WHERE N.N <= DATEDIFF(day, @PeriodStart, @PeriodEnd)
) CD
WHERE E.ID > 2000;
SELECT E.FirstName, CD.ClockDate
FROM Employee E
LEFT JOIN (SELECT DATEADD(day, N.N, @PeriodStart) AS ClockDate
FROM Numbers N
WHERE N.N <= DATEDIFF(day, @PeriodStart, @PeriodEnd)
) CD ON CD.ClockDate = CD.ClockDate
WHERE E.ID > 2000;
The Execution Plans:
https://www.brentozar.com/pastetheplan/?id=B139JjPKK
As you can see, according to the optimizer the second (left join) query with the seemingly redundant predicate seems to cost way less than the first (cross join) query. This is also the case when the period dates span multiple days.
What's weird is that if I change the LEFT JOIN's predicate to something different like 1 = 1, it performs like the CROSS JOIN. I also tried changing the SELECT portion of the LEFT JOIN to SELECT N and joining on CD.N = CD.N ... but that also seems to perform poorly.
According to the execution plan, the second query has an index seek that only reads 3000 rows from the Numbers table while the first query is reading 10 times as many. The second query's index seek also has this predicate (which I assume comes from the LEFT JOIN):
dateadd(day,[Numbers].[N] as [N].[N],[@PeriodStart])=dateadd(day,[Numbers].[N] as [N].[N],[@PeriodStart])
I would like to understand why the second query seems to perform so much better even though I wouldn't expect it to. Does it have something to do with the fact that I'm joining on the results of the DATEADD function? Is SQL evaluating the results of DATEADD before joining?

The reason these queries get different estimates, even though the plans are almost the same and will probably take about the same time, appears to be that DATEADD(day, N.N, @PeriodStart) is nullable, so CD.ClockDate = CD.ClockDate essentially just verifies that the result is not null. The optimizer cannot see that it will always be non-null, so it lowers the row estimate because of it.
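Since x = x is true exactly when x is not null, the predicate is equivalent to an explicit null check. A minimal sketch of that rewrite (using the question's tables and variables; I can't promise the optimizer produces exactly the same estimate for it, but the semantics are identical):
SELECT E.FirstName, CD.ClockDate
FROM Employee E
LEFT JOIN (SELECT DATEADD(day, N.N, @PeriodStart) AS ClockDate
           FROM Numbers N
           WHERE N.N <= DATEDIFF(day, @PeriodStart, @PeriodEnd)
          ) CD ON CD.ClockDate IS NOT NULL  -- same filtering effect as CD.ClockDate = CD.ClockDate
WHERE E.ID > 2000;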
But it seems to me that the primary performance problem in your query is that you are selecting the whole of your Numbers table every time. Instead, you should select only the number of rows you need:
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS JOIN (
SELECT TOP (DATEDIFF(day, @PeriodStart, @PeriodEnd) + 1)
DATEADD(day, N.N, @PeriodStart) AS ClockDate
FROM Numbers N
ORDER BY N.N
) CD
WHERE E.ID > 2000;
Using this technique, you can even use CROSS APPLY (SELECT TOP (outerValue) ...) if you want to correlate the number of rows to the rest of the query.
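As a sketch of that correlated form, assuming a hypothetical Employee.NumDays column that says how many days each employee needs (it is not part of the original schema):
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS APPLY (
    SELECT TOP (E.NumDays)  -- hypothetical column; TOP is driven by the outer row
           DATEADD(day, N.N, @PeriodStart) AS ClockDate
    FROM Numbers N
    ORDER BY N.N
) CD
WHERE E.ID > 2000;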
For further tips on numbers tables, see Itzik Ben-Gan's excellent series

Related

How does order by work when all column values are identical?

I use SQL Server 2016. Below are the rows in the table test_account. You can see that the values of updDtm and fileCreatedTime are identical. id is the primary key.
id accno updDtm fileCreatedTime
-----------------------------------------------------------------------
1 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
2 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
3 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
I want to query the latest account id where accno is 123456789, ordered by updDtm, fileCreatedTime.
When I run the following SQL, the output is id = 1:
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1
My question is: is the query result repeatable and reliable (always outputting id = 1, whether run once or many times) when the values of the updDtm and fileCreatedTime columns are identical, or does it output a random id?
I read some articles saying that for MySQL and Oracle the query result is not reliable or reproducible. How about SQL Server?
The context of this documentation reference is ORDER BY usage with OFFSET and FETCH but the same considerations apply to all ORDER BY usage, including windowing functions like ROW_NUMBER(). In summary,
To achieve stable results between query requests, the following conditions must be met:
The underlying data that is used by the query must not change.
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
I'm trying to find a case to test whether the query would output a result other than id = 1, but with no luck.
The ordering of rows when duplicate ORDER BY values exist is undefined (a.k.a. non-deterministic and arbitrary) because it depends on the execution plan (which may vary due to available indexes, stats, and the optimizer), parallelism, database engine internals, and even physical data storage. The example below yields different results due to a parallel plan on my test instance.
DROP TABLE IF EXISTS dbo.test_account;
CREATE TABLE dbo.test_account(
id int NOT NULL
CONSTRAINT pk_test_account PRIMARY KEY CLUSTERED
, accno int NOT NULL
, updDtm datetime2 NOT NULL
, fileCreatedTime datetime2 NOT NULL
);
--insert 100K rows
WITH
t10 AS (SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(n))
,t1k AS (SELECT 0 AS n FROM t10 AS a CROSS JOIN t10 AS b CROSS JOIN t10 AS c)
,t1g AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS num FROM t1k AS a CROSS JOIN t1k AS b CROSS JOIN t1k AS c)
INSERT INTO dbo.test_account (id, accno, updDtm, fileCreatedTime)
SELECT num, 123456789, '2022-07-27 09:41:10.0000000', '2022-07-27 11:33:33.8300000'
FROM t1g
WHERE num <= 100000;
GO
--run query 10 times
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
Example results:
1
27001
25945
57071
62813
1
1
1
36450
78805
The simple solution is to add the primary key as the last column to the ORDER BY clause to break ties. This returns the same id value (1) in every iteration regardless of the execution plan and indexes.
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC, a.id) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
On a side note, this index will optimize the query:
CREATE NONCLUSTERED INDEX idx ON dbo.test_account(accno, updDtm DESC, fileCreatedTime DESC, id);

Join tables via project names: if a project name meets specific criteria, then replace it with another field

I need to join two tables via their project names. But for a few project names that meet a specific criterion, I need the join to match on their descriptions instead (a job description is like a name and is unique). I am not 100% sure how to do this. Can a case expression be applied? I have provided what I have so far, but it's not joining properly when I use a case expression on names that are like 'BTG -'.
SELECT
[Name] AS 'NAME'
,[DATA_Id] AS 'ID_FIELD'
,format([ApprovedOn], 'MM/dd/yyyy') as 'DATE_APPROVED'
,[DATA_PROJECT_NAME]
,[PHASE_NAME]
,[DATA_JOB_ID]
,[JOB_TYPE]
,[SUB_TYPE]
,format([CREATED_DATE], 'MM/dd/yyyy') as 'DATE_CREATED'
,CASE
WHEN [DATA_JOB_ID] = [DATA_Id] THEN 'OK'
WHEN [DATA_JOB_ID] != [DATA_Id] THEN 'NEED DATA NUMBER'
ELSE 'NEED DATA NUMBER'
END AS ACTION_SPECIALISTS
,DATA_PROJECTS
FROM [MI].[MI_B_View].[app_build]
LEFT JOIN
(SELECT * ,
CASE
WHEN [DATA_PROJECT_NAME] LIKE 'BTG -%' THEN [JOB_DESCRIPTION]
ELSE [DATA_PROJECT_NAME]
END AS DATA_PROJECTS
FROM [ExternalUser].[DATA].[JOB] WHERE [JOB_DESCRIPTION] LIKE '%ROW%' AND REGION = 'CITY') AS B
ON [Name] = [DATA_PROJECTS]
WHERE
REGION_ID = 1
AND APPROVED = 1
ORDER BY [ApprovedOn] DESC
TL;DR: The answer by Caius Jard is correct - you can join on anything, as long as it evaluates to true or false (ignoring unknown).
Unfortunately, the way you join between two tables can have drastically different performance depending on your methodology. If you join on an expression, you will usually get very poor performance. Using computed columns, materializing the intermediate result in a table, or splitting up your join conditions can all help with poor performance.
Joins are not the only place where expressions can ding you; grouping, aggregates, filters, or anything that relies on a good cardinality estimate will suffer when expressions are involved.
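As a rough sketch of the "materialize the intermediate result" option mentioned above (using tables shaped like the #Build/#Job temp tables created later in this answer; the #JobMaterialized name is made up for illustration):
SELECT J.JOB_DESCRIPTION,
       J.DATA_PROJECT_NAME,
       CASE WHEN J.DATA_PROJECT_NAME LIKE 'BTG -%'
            THEN J.DATA_PROJECT_NAME
            ELSE J.JOB_DESCRIPTION
       END AS JoinOnMe
INTO #JobMaterialized  -- hypothetical staging table
FROM #Job J;

-- the staged column is a real stored column, so SQL Server can build statistics on it
-- and the join below gets an estimate based on actual data rather than a guess
SELECT *
FROM #Build AppBuild
LEFT OUTER JOIN #JobMaterialized JM
    ON AppBuild.Name = JM.JoinOnMe;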
When I compare two methods of joining (they are functionally equivalent despite the new magic column; more on that later)
SELECT *
FROM #Build AppBuild
LEFT OUTER JOIN #Job Job
ON ( AppBuild.Name = Job.DATA_PROJECT_NAME
AND Job.DATA_PROJECT_NAME NOT LIKE 'BTG -%' )
OR ( Job.DATA_PROJECT_NAME LIKE 'BTG -%'
AND Job.JOB_DESCRIPTION = AppBuild.Name );
SELECT *
FROM #Build AppBuild
LEFT OUTER JOIN #Job Job
ON AppBuild.Name = Job.JoinOnMe;
The resulting query plans have huge differences:
You'll notice that the estimated cost of the first join is much higher - but that doesn't even tell the whole story. If I actually run these two queries with ~6M rows in each table, I end up with the second one finishing in ~7 seconds, and the first one being nowhere near done after 2 minutes. This is because the join predicate was pushed down onto the #Job table's table scan:
SQL Server has no idea what percentage of records will have a DATA_PROJECT_NAME (NOT) LIKE 'BTG -%', so it picks an estimate of 1 row. This then leads it to pick a nested loop join, and a sort, and a spool, all of which really end up making things perform quite poorly for us when we get far more than 1 row out of that table scan.
The fix? Computed columns. I created my tables like so:
CREATE TABLE #Build
(
Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);
CREATE TABLE #Job
(
JOB_DESCRIPTION varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
DATA_PROJECT_NAME varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
JoinOnMe AS CASE WHEN DATA_PROJECT_NAME LIKE N'BTG -%' THEN DATA_PROJECT_NAME
ELSE JOB_DESCRIPTION END
);
It turns out that SQL Server will maintain statistics on JoinOnMe, even though there is an expression inside of it and this value has not been materialized anywhere. If you wanted to, you could even index the computed column.
Because we have statistics on JoinOnMe, a join on it will give a good cardinality estimate (when I tested it was exactly correct), and thus a good plan.
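If you did want to index the computed column, a minimal sketch would be something like the following (the index name is arbitrary; the CASE expression is deterministic, which is what makes it indexable, and the usual SET options required for indexed computed columns must be in effect):
-- index on the computed column; statistics and seeks then come for free
CREATE NONCLUSTERED INDEX IX_Job_JoinOnMe ON #Job (JoinOnMe);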
If you don't have the freedom to alter the table, then you should at least split the join into two joins. It may seem counterintuitive, but if you're ORing together a lot of conditions for an outer join, SQL Server will usually get a better estimate (and thus better plans) if each OR condition is a separate join, and you then COALESCE the result set.
When I include a query like this:
SELECT AppBuild.Name,
COALESCE( Job.JOB_DESCRIPTION, Job2.JOB_DESCRIPTION ) JOB_DESCRIPTION,
COALESCE( Job.DATA_PROJECT_NAME, Job2.DATA_PROJECT_NAME ) DATA_PROJECT_NAME
FROM #Build AppBuild
LEFT OUTER JOIN #Job Job
ON ( AppBuild.Name = Job.DATA_PROJECT_NAME
AND Job.DATA_PROJECT_NAME NOT LIKE 'BTG -%' )
LEFT OUTER JOIN #Job Job2
ON ( Job2.DATA_PROJECT_NAME LIKE 'BTG -%'
AND Job2.JOB_DESCRIPTION = AppBuild.Name );
It is also 0% of total cost, relative to the first query. When compared against joining on the computed column, the difference is about 58%/42%.
Here is how I created the tables and populated them with test data:
DROP TABLE IF EXISTS #Build;
DROP TABLE IF EXISTS #Job;
CREATE TABLE #Build
(
Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);
CREATE TABLE #Job
(
JOB_DESCRIPTION varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
DATA_PROJECT_NAME varchar(50) COLLATE DATABASE_DEFAULT NOT NULL,
JoinOnMe AS CASE WHEN DATA_PROJECT_NAME LIKE N'BTG -%' THEN DATA_PROJECT_NAME
ELSE JOB_DESCRIPTION END
);
INSERT INTO #Build
( Name )
SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
FROM master.dbo.spt_values
CROSS APPLY master.dbo.spt_values SV2;
INSERT INTO #Job
( JOB_DESCRIPTION, DATA_PROJECT_NAME )
SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )),
CASE WHEN ROUND( RAND(), 0 ) = 1 THEN CAST(ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )) AS nvarchar(20))
ELSE 'BTG -1' END
FROM master.dbo.spt_values SV
CROSS APPLY master.dbo.spt_values SV2;
Sure, any expression that evaluates to a truth can be used in a join:
SELECT *
FROM
person
INNER JOIN
country
ON
country.name =
CASE person.homeCity
WHEN 'London' THEN 'England'
WHEN 'Amsterdam' THEN 'Holland'
ELSE person.homecountry
END
Suppose homecountry contains values like 'united kingdom', 'great britain' and 'netherlands', but these don't match up with our country names in the country table. We could use a CASE WHEN to convert them (I've cased on the city name just to demo that it doesn't have to be anything to do with the country), and for all the others (the ELSE case) we just pass through the country name from the person table unchanged.
Ultimately the CASE WHEN will output some string, and this will be matched against the other table's column (but it could be matched against another CASE WHEN, etc.).
In your scenario you might find you can avoid all this and just write something using AND and OR, like
a JOIN b
ON
(a.projectname like 'abc%' and a.projectname = b.description) OR
(a.projectname like '%def' and a.whatever = b.othercolumn)
Evaluations in a CASE WHEN are short-circuited, evaluating from left to right.
Remember: anything that ultimately evaluates to a truth value can be used in an ON. Even ON 5<10 is valid (it joins all rows to all other rows because it's always true).

How do I compare two rows from a SQL database table based on DateTime within 3 seconds?

I have a table of DetailRecords containing records that seem to be "duplicates" of other records, but they have a unique primary key [ID]. I would like to delete these "duplicates" from the DetailRecords table and keep the record with the longest/highest Duration. I can tell that they are linked records because their DateTime field is within 3 seconds of another row's DateTime field and the Duration is within 2 seconds of one another. Other data in the row will also be duplicated exactly, such as Number, Rate, or AccountID, but this could be the same for the data that is not "duplicate" or related.
CREATE TABLE #DetailRecords (
[AccountID] INT NOT NULL,
[ID] VARCHAR(100) NULL,
[DateTime] VARCHAR(100) NULL,
[Duration] INT NULL,
[Number] VARCHAR(200) NULL,
[Rate] DECIMAL(8,6) NULL
);
I know that I will most likely have to perform a self join on the table, but how can I find two rows that are similar within a DateTime range of plus or minus 3 seconds, instead of just exactly the same?
I am having the same trouble with the Duration within a range of plus or minus 2 seconds.
The key is taking the absolute value of the difference between the dates and durations. I don't know SQL Server, but here's how I'd do it in SQLite. The technique should be the same; only the specific function names will be different.
SELECT a.id, b.id
FROM DetailRecords a
JOIN DetailRecords b
ON a.id > b.id
WHERE abs(strftime('%s', a.DateTime) - strftime('%s', b.DateTime)) <= 3
AND abs(a.duration - b.duration) <= 2
Taking the absolute value of the difference covers the "plus or minus" part of the range. The self join is on a.id > b.id because a.id = b.id would duplicate every pair.
Given the entries...
ID|DateTime |Duration
1 |2014-01-26T12:00:00|5
2 |2014-01-26T12:00:01|6
3 |2014-01-26T12:00:06|6
4 |2014-01-26T12:00:03|11
5 |2014-01-26T12:00:02|10
6 |2014-01-26T12:00:01|6
I get the pairs...
5|4
2|1
6|1
6|2
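Since the question is about SQL Server, a rough T-SQL equivalent of the SQLite query might use DATEDIFF and ABS (this assumes the varchar DateTime values convert cleanly to datetime2):
SELECT a.ID, b.ID
FROM DetailRecords a
JOIN DetailRecords b
    ON a.ID > b.ID  -- same trick as above to avoid self-pairs and mirrored pairs
WHERE ABS(DATEDIFF(second,
                   CAST(a.[DateTime] AS datetime2),
                   CAST(b.[DateTime] AS datetime2))) <= 3
  AND ABS(a.Duration - b.Duration) <= 2;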
And you should really store those dates as DateTime types if you can.
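A minimal sketch of that change, assuming the real table is named DetailRecords and every stored value parses as a date/time:
-- this will fail if any existing value is not a valid date/time string
ALTER TABLE DetailRecords ALTER COLUMN [DateTime] datetime2 NULL;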
You could use a self-referential CTE and compare the DateTime fields.
;WITH CTE AS (
SELECT AccountID,
ID,
DateTime,
rn = ROW_NUMBER() OVER (PARTITION BY AccountID, ID, <insert any other matching keys> ORDER BY AccountID)
FROM table
)
SELECT earliestAccountID = c1.AccountID,
earliestDateTime = c1.DateTime,
recentDateTime = c2.DateTime,
recentAccountID = c2.AccountID
FROM cte c1
INNER JOIN cte c2
ON c1.rn = 1 AND c2.rn = 2 AND c1.DateTime <> c2.DateTime
Edit
I made several assumptions about the data set, so this may not be as relevant as you need. If you're simply looking for difference between possible duplicates, specifically DateTime differences, this will work. However, this does not constrain to your date range, nor does it automatically assume what the DateTime column is used for or how it is set.

OVER (ORDER BY Col) generates 'Sort' operation

I'm working on a query that needs to do filtering, ordering and paging according to the user's input. Now I'm testing a case that's really slow, upon inspection of the Query Plan a 'Sort' is taking 96% of the time.
The datamodel is really not that complicated, the following query should be clear enough to understand what's happening:
WITH OrderedRecords AS (
SELECT
A.Id
, A.col2
, ...
, B.Id
, B.col1
, ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
FROM A
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT
*
FROM OrderedRecords WHERE RowNumber Between x AND y
A is a table containing about 100k records, but will grow to tens of millions in the field, while B is category type table with 5 items (and this will never grow any bigger then perhaps a few more). There are clustered indexes on A.Id and B.Id.
Performance is really dreadful and I'm wondering if it's possible to remedy this somehow. If, for example, the ordering is on A.Id instead of B.col1, everything is pretty darn fast. Perhaps I can optimize B.col1 with some sort of index.
I already tried putting an index on the field itself, but this didn't help. Probably because the number of distinct items in table B is very small (in itself & compared to A).
Any ideas?
I think this may be part of the problem:
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
Your LEFT JOIN is going to logically act like an INNER JOIN because of the WHERE clause you have in place, since only certain B.ID rows are going to be returned. If that's your intent, then go ahead and use an inner join, which may help the optimizer realize that you are looking for a restricted number of rows.
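A sketch of that rewrite, keeping the placeholders from the question as-is:
WITH OrderedRecords AS (
    SELECT
          A.Id
        , A.col2
        , ...
        , B.Id
        , B.col1
        , ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
    FROM A
    INNER JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)  -- INNER JOIN makes the intent explicit
    WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT *
FROM OrderedRecords
WHERE RowNumber BETWEEN x AND y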
I suggest you try the following.
For the B table create index:
create index IX_B_1 on B (col1, Id, SomeThing)
For the A table create index:
create index IX_A_1 on A (col2, BId) include (Id, ...)
In the INCLUDE, put all the other columns of table A that are listed in the SELECT of the OrderedRecords CTE.
However, as you can see, index IX_A_1 takes up space, and can grow to roughly the size of the table data itself.
So, as an alternative, you may try omitting the extra columns from the INCLUDE part of the index:
create index IX_A_2 on A (col2, BId) include (Id)
but in this case you will have to slightly modify your query:
;WITH OrderedRecords AS (
SELECT
AId = A.Id
, A.col2
-- remove other A columns from here
, bid = B.Id
, B.col1
, ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
FROM A
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT
R.*, A.OtherColumns
FROM OrderedRecords R
join A on A.Id = R.AId
WHERE R.RowNumber Between x AND y

SQL Server pick random (or first) value with aggregation

How can I get SQL Server to return the first value (any one, I don't care, it just needs to be fast) it comes across when aggregating?
For example, let's say I have:
ID Group
1 A
2 A
3 A
4 B
5 B
and I need to get any one of the ID's for each group. I can do this as follows:
Select
max(id)
,group
from Table
group by group
which returns
ID Group
3 A
5 B
That does the job, but it seems stupid to me to ask SQL Server to calculate the highest ID when all it really needs to do is to pick the first ID it comes across.
Thanks
PS - the fields are indexed, so maybe it doesn't really make a difference?
There is an undocumented aggregate called ANY which is not valid syntax in a query, but which it is possible to get to appear in your execution plans. This does not provide any performance advantage, however.
Assuming the following table and index structure
CREATE TABLE T
(
id int identity primary key,
[group] char(1)
)
CREATE NONCLUSTERED INDEX ix ON T([group])
INSERT INTO T
SELECT TOP 1000000 CHAR( 65 + ROW_NUMBER() OVER (ORDER BY @@SPID) % 3)
FROM sys.all_objects o1, sys.all_objects o2, sys.all_objects o3
I have also populated it with sample data so that there are many rows per group.
Your original query
SELECT MAX(id),
[group]
FROM T
GROUP BY [group]
Gives Table 'T'. Scan count 1, logical reads 1367 and the plan
|--Stream Aggregate(GROUP BY:([T].[group]) DEFINE:([Expr1003]=MAX([T].[id])))
     |--Index Scan(OBJECT:([T].[ix]), ORDERED FORWARD)
Rewritten to get the ANY aggregate...
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [group] ORDER BY [group] ) AS RN
FROM T)
SELECT id,
[group]
FROM cte
WHERE RN=1
Gives Table 'T'. Scan count 1, logical reads 1367 and the plan
|--Stream Aggregate(GROUP BY:([T].[group]) DEFINE:([T].[id]=ANY([T].[id])))
     |--Index Scan(OBJECT:([T].[ix]), ORDERED FORWARD)
Even though SQL Server could potentially stop processing the group as soon as the first value is found and skip to the next one, it doesn't. It still processes all rows, and the logical reads are the same.
For this particular example with many rows in the group a more efficient version would be a recursive CTE.
WITH RecursiveCTE
AS (
SELECT TOP 1 id, [group]
FROM T
ORDER BY [group]
UNION ALL
SELECT R.id, R.[group]
FROM (
SELECT T.*,
rn = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM T
JOIN RecursiveCTE R
ON R.[group] < T.[group]
) R
WHERE R.rn = 1
)
SELECT *
FROM RecursiveCTE
OPTION (MAXRECURSION 0);
Which gives
Table 'Worktable'. Scan count 2, logical reads 19
Table 'T'. Scan count 4, logical reads 12
The logical reads are much lower because it retrieves the first row per group and then seeks into the next group, rather than reading a load of records that don't contribute to the final result.
