OVER (ORDER BY Col) generates 'Sort' operation - sql-server

I'm working on a query that needs to do filtering, ordering and paging according to the user's input. Now I'm testing a case that's really slow, upon inspection of the Query Plan a 'Sort' is taking 96% of the time.
The datamodel is really not that complicated, the following query should be clear enough to understand what's happening:
WITH OrderedRecords AS (
SELECT
A.Id
, A.col2
, ...
, B.Id
, B.col1
, ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
FROM A
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT
*
FROM OrderedRecords WHERE RowNumber Between x AND y
A is a table containing about 100k records, but will grow to tens of millions in the field, while B is category type table with 5 items (and this will never grow any bigger then perhaps a few more). There are clustered indexes on A.Id and B.Id.
Performance is really dreadful and I'm wondering if it's possible to remedy this somehow. If, for example, the ordering is on A.Id instead of B.col1 everything is pretty darn fast. Perhaps I can optimize B.col1 is some sort of index.
I already tried putting an index on the field itself, but this didn't help. Probably because the number of distinct items in table B is very small (in itself & compared to A).
Any ideas?

I think this may be part of the problem:
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.Id = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...)
Your LEFT JOIN is going to logically act like an INNER JOIN because of the WHERE clause you have in place, since only certain B.ID rows are going to be returned. If that's your intent, then go ahead and use an inner join, which may help the optimizer realize that you are looking for a restricted number of rows.

I suggest you to try following.
For the B table create index:
create index IX_B_1 on B (col1, Id, SomeThing)
For the A table create index:
create index IX_A_1 on A (col2, BId) include (Id, ...)
In the include put all other columns of the table A, that listed in SELECT of OrderedRecords CTE.
However, as you see, index IX_A_1 is space taking, and can take size of about table data itself.
So, as an alternative you may try omit extra columns from include part of the index:
create index IX_A_2 on A (col2, BId) include (Id)
but in this case you will have to slightly modify your query:
;WITH OrderedRecords AS (
SELECT
AId = A.Id
, A.col2
-- remove other A columns from here
, bid = B.Id
, B.col1
, ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
FROM A
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT
R.*, A.OtherColumns
FROM OrderedRecords R
join A on A.Id = R.AId
WHERE R.RowNumber Between x AND y

Related

LEFT JOIN With Redundant Predicate Performs Better Than a CROSS JOIN?

I'm looking at the execution plans for two of these statements and am kind of stumped on why the LEFT JOIN statement performs better than the CROSS JOIN statement:
Table Definitions:
CREATE TABLE [Employee] (
[ID] int NOT NULL IDENTITY(1,1),
[FirstName] varchar(40) NOT NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED ([ID] ASC)
);
CREATE TABLE [dbo].[Numbers] (
[N] INT IDENTITY (1, 1) NOT NULL,
CONSTRAINT [PK_Numbers] PRIMARY KEY CLUSTERED ([N] ASC)
); --The Numbers table contains numbers 0 to 100,000.
Queries in Question where I join one 'day' to each Employee:
DECLARE #PeriodStart AS date = '2019-11-05';
DECLARE #PeriodEnd AS date = '2019-11-05';
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS JOIN (SELECT DATEADD(day, N.N, #PeriodStart) AS ClockDate
FROM Numbers N
WHERE N.N <= DATEDIFF(day, #PeriodStart, #PeriodEnd)
) CD
WHERE E.ID > 2000;
SELECT E.FirstName, CD.ClockDate
FROM Employee E
LEFT JOIN (SELECT DATEADD(day, N.N, #PeriodStart) AS ClockDate
FROM Numbers N
WHERE N.N <= DATEDIFF(day, #PeriodStart, #PeriodEnd)
) CD ON CD.ClockDate = CD.ClockDate
WHERE E.ID > 2000;
The Execution Plans:
https://www.brentozar.com/pastetheplan/?id=B139JjPKK
As you can see, according to the optimizer the second (left join) query with the seemingly redundant predicate seems to cost way less than the first (cross join) query. This is also the case when the period dates span multiple days.
What's weird is if I change the LEFT JOIN's predicate to something different like 1 = 1 it'll perform like the CROSS APPLY. I also tried changing the SELECT portion of the LEFT JOIN to SELECT N and joined on CD.N = CD.N ... but that also seems to perform poorly.
According to the execution plan, the second query has an index seek that only reads 3000 rows from the Numbers table while the first query is reading 10 times as many. The second query's index seek also has this predicate (which I assume comes from the LEFT JOIN):
dateadd(day,[Numbers].[N] as [N].[N],[#PeriodStart])=dateadd(day,[Numbers].[N] as [N].[N],[#PeriodStart])
I would like to understand why the second query seems to perform so much better even though I wouldn't except it to? Does it have something to do with the fact I'm joining the results of the DATEADD function? Is SQL evaluating the results of DATEADD before joining?
The reason these queries get different estimates, even though the plan is almost the same and will probably take the same time, appears to be because DATEADD(day, N.N, #PeriodStart) is nullable, therefore CD.ClockDate = CD.ClockDate essentially just verifies that the result is not null. The optimizer cannot see that it will always be non-null, so takes the row-estimate down because of it.
But it seems to me that the primary performance problem in your query is that you are selecting the whole of your numbers table every time. Instead you should just select the amount of rows you need
SELECT E.FirstName, CD.ClockDate
FROM Employee E
CROSS JOIN (
SELECT TOP (DATEDIFF(day, #PeriodStart, #PeriodEnd) + 1)
DATEADD(day, N.N, #PeriodStart) AS ClockDate
FROM Numbers N
ORDER BY N.N
) CD
WHERE E.ID > 2000;
Using this technique, you can even use CROSS APPLY (SELECT TOP (outerValue) if you want to correlate the amount of rows to the rest of the query.
For further tips on numbers tables, see Itzik Ben-Gan's excellent series

MSSQL can't understand what's happening with the action "having count(*) lesser than <some field of other table>"

I've tried to understand some part of an exercise i'm doing and just couldn't get it.
There's a part where 'T' is selected, grouped by 'a' and than it's redirected to "having count(*) < T3.a",
and I don't know how to approach it.
I've tried googling this sort of thing and see if there are similar examples but all other examples were using regular numbers for ex.: "having count(*) < 5" and not whole fields for comparison.
The exercise is this:
MSSQL exercise
create table T(a int, b int);
insert into T values(1,2);
insert into T values(1,1);
insert into T values(2,3);
insert into T values(2,4);
insert into T values(3,4);
insert into T values(4,5);
select T3.b, (select count(T5.a)
from T T5
where T5.a = T3.b)
from (select T1.a as a, T2.b as b
from T T1, T T2
where T1.b < T2.a) as T3
where not exists (select T4.a
from T T4
group by T4.a
having count(*) < T3.a);
I thought that the having count(*) was comparing each value that was grouped by to each value of T3.a in each row and if all rows have met the criteria than the value is getting selected but I somehow get different results.
Can someone please explain to me what is really going on behind this "having count(*) < T3.a" operation?
Thank you in advance.
To repeat myself from the comments, a HAVING is like a WHERE for aggregate functions. You cannot use aggregate function in the WHERE, for example WHERE SUM(SomeColumn) > 5, so you need to do them in the HAVING: HAVING SUM(SomeColumn) > 5. This would returns any rows where the SUM of the column SomeColumn is greater than 5 in the group.
For your expression, HAVING COUNT(*) < T3.a it would only return rows where the value of COUNT(*) is less than the value of T3.a.
Let's break this down to it's separate parts.
First the FROM
from (select T1.a as a, T2.b as b
from T T1, T T2
where T1.b < T2.a) as T3
This uses the old-style deprecated cross-join syntax. It can be rewritten as a normal join:
from (select T1.a as a, T2.b as b
from T T1
join T T2 on T1.b < T2.a
) as T3
If we analyze what it does, we realize that it is actually what is known as a triangular join: every row is self-joined to every row lower than it. This was commonly done when window aggregates were not available.
WHERE
where not exists (select T4.a
from T T4
group by T4.a
having count(*) < T3.a);
This is a correlated subquery: T3.a is a reference to the outer query.
What this predicate says is: for this particular row, there must be no rows in the subquery.
The subquery itself says: take all rows in T, group them by a and count, then only include rows for which the count is less than the outer reference a.
Note that because it is an EXIST, the actual selected value is not used. I suspect this may not have been the intention.
SELECT
select T3.b, (select count(T5.a)
from T T5
where T5.a = T3.b)
We then take b from the first join, and the count from a subquery of all matching T rows. Again, this was common when window aggregates were not available.
So the whole thing can be rewritten as follows:
select T2.b, (select count(T5.a)
from T T5
where T5.a = T3.b)
from (
select *, count(*) over (partition by a) as cnt
from T
) T1
join T T2 on T1.b < T2.a
where T1.cnt < T1.a;
There is something not quite right about the logic in your query, but without knowing what the original intention was, and without seeing the table and column names, I cannot say. The triangular join in particular looks very suspect.

T-SQL How to join to multiple tables depending on value

So I have some complicated production tables. What it comes down to, however, is that I'd like to be able to join to multiple tables depending on the where the value is that I'm seeking. Specifically, say I have an employee ID, "JHDOE", and I want to join it to the table where I can get that employee's name. The main table for employees is "Table A":
Notice that the field ID_2 does NOT have the value "JHDOE". Instead, it has "DOEJH". Well, there is another table that actually has the value I'm seeking, "Table B":
In this table, we do see "JHDOE" so at first I tried something like this:
from TableStart as start
left join TableB as b on
case
when start.EMP_ID like '[0-9]%'
then b.ID
else b.ID_2
end = start.EMP_ID
But this created other problems. So what I'd like to do is do something like join to EITHER table, or at least something to the same effect. One method I tried was this:
from TableStart as start
left join (select a.Name, a.ID, a.ID_2
from TableA as a
union
select b.Name, b.ID, b.ID_2
from TableB as b) names on
case
when start.EMP_ID like '[0-9]%'
then names.ID
else names.ID_2
end = start.EMP_ID
The result set for the union looks like this:
On my production data, this same scenario resulted in a blank. I suppose it doesn't know which row to join to? So I think I need to do something like pivot the rows into columns and then do an OR... but I'm not sure. I would be greatly appreciative of any guidance or instruction.
Instead of using a CASE in the join I think an IN operator would work. That will function as an OR (EMP_ID = ID or EMP_ID = ID_2). And if you LEFT JOIN from TableStart to both TableA and TableB and use the COALESCE function with each column from TableA and TableB you will get the first non-null value.
This assumes that if both TableA and TableB have a match for TableStart you are fine with taking the first non-null value which could result in a mixture of values from TableA and TableB if TableA has some null values.
select
start.*
, coalesce (a.Name, b.Name) as [Name]
, coalesce (a.ID, b.ID) as [ID]
, coalesce (a.ID_2, b.ID_2) as [ID_2]
from TableStart as start
left join TableA as a on start.EMP_ID in (a.ID, a.ID_2)
left join TableB as b on start.EMP_ID in (b.ID, b.ID_2)
Here is the full demo.

Getting non-deterministic results from WITH RECURSIVE cte

I'm trying to create a recursive CTE that traverses all the records for a given ID, and does some operations between ordered records. Let's say I have customers at a bank who get charged a uniquely identifiable fee, and a customer can pay that fee in any number of installments:
WITH recursive payments (
id
, index
, fees_paid
, fees_owed
)
AS (
SELECT id
, index
, fees_paid
, fee_charged
FROM table
WHERE index = 1
UNION ALL
SELECT t.id
, t.index
, t.fees_paid
, p.fees_owed - p.fees_paid
FROM table t
JOIN payments p
ON t.id = p.id
AND t.index = p.index + 1
)
SELECT *
FROM payments
ORDER BY 1,2;
The join logic seems sound, but when I join the output of this query to the source table, I'm getting non-deterministic and incorrect results.
This is my first foray into Snowflake's recursive CTEs. What am I missing in the intermediate result logic that is leading to the non-determinism here?
I assume this is edited code, because in the anchor of you CTE you select the fourth column fee_charged which does not exist, and then in the recursion you don't sum the fees paid and other stuff, basically you logic seems rather strange.
So creating some random data, that has two different id streams to recurse over:
create or replace table data (id number, index number, val text);
insert into data
select * from values (1,1,'a'),(2,1,'b')
,(1,2,'c'), (2,2,'d')
,(1,3,'e'), (2,3,'f')
v(id, index, val);
Now altering you CTE just a little bit to concat that strings together..
WITH RECURSIVE payments AS
(
SELECT id
, index
, val
FROM data
WHERE index = 1
UNION ALL
SELECT t.id
, t.index
, p.val || t.val as val
FROM data t
JOIN payments p
ON t.id = p.id
AND t.index = p.index + 1
)
SELECT *
FROM payments
ORDER BY 1,2;
we get:
ID INDEX VAL
1 1 a
1 2 ac
1 3 ace
2 1 b
2 2 bd
2 3 bdf
Which is exactly as I would expect. So how this relates to your "it gets strange when I join to other stuff" is ether, your output of you CTE is not how you expect it to be.. Or your join to other stuff is not working as you expect, Or there is a bug with snowflake.
Which all comes down to, if the CTE results are exactly what you expect, create a table and join that to your other table, so eliminate some form of CTE vs JOIN bug, and to debug why your join is not working.
But if your CTE output is not what you expect, then lets help debug that.

set difference in SQL query

I'm trying to select records with a statement
SELECT *
FROM A
WHERE
LEFT(B, 5) IN
(SELECT * FROM
(SELECT LEFT(A.B,5), COUNT(DISTINCT A.C) c_count
FROM A
GROUP BY LEFT(B,5)
) p1
WHERE p1.c_count = 1
)
AND C IN
(SELECT * FROM
(SELECT A.C , COUNT(DISTINCT LEFT(A.B,5)) b_count
FROM A
GROUP BY C
) p2
WHERE p2.b_count = 1)
which takes a long time to run ~15 sec.
Is there a better way of writing this SQL?
If you would like to represent Set Difference (A-B) in SQL, here is solution for you.
Let's say you have two tables A and B, and you want to retrieve all records that exist only in A but not in B, where A and B have a relationship via an attribute named ID.
An efficient query for this is:
# (A-B)
SELECT DISTINCT A.* FROM (A LEFT OUTER JOIN B on A.ID=B.ID) WHERE B.ID IS NULL
-from Jayaram Timsina's blog.
You don't need to return data from the nested subqueries. I'm not sure this will make a difference withiut indexing but it's easier to read.
And EXISTS/JOIN is probably nicer IMHO then using IN
SELECT *
FROM
A
JOIN
(SELECT LEFT(B,5) AS b1
FROM A
GROUP BY LEFT(B,5)
HAVING COUNT(DISTINCT C) = 1
) t1 On LEFT(A.B, 5) = t1.b1
JOIN
(SELECT C AS C1
FROM A
GROUP BY C
HAVING COUNT(DISTINCT LEFT(B,5)) = 1
) t2 ON A.C = t2.c1
But you'll need a computed column as marc_s said at least
And 2 indexes: one on (computed, C) and another on (C, computed)
Well, not sure what you're really trying to do here - but obviously, that LEFT(B, 5) expression keeps popping up. Since you're using a function, you're giving up any chance to use an index.
What you could do in your SQL Server table is to create a computed, persisted column for that expression, and then put an index on that:
ALTER TABLE A
ADD LeftB5 AS LEFT(B, 5) PERSISTED
CREATE NONCLUSTERED INDEX IX_LeftB5 ON dbo.A(LeftB5)
Now use the new computed column LeftB5 instead of LEFT(B, 5) anywhere in your query - that should help to speed up certain lookups and GROUP BY operations.
Also - you have a GROUP BY C in there - is that column C indexed?
If you are looking for just set difference between table1 and table2,
the below query is simple that gives the rows that are in table1, but not in table2, such that both tables are instances of the same schema with column names as
columnone, columntwo, ...
with
col1 as (
select columnone from table2
),
col2 as (
select columntwo from table2
)
...
select * from table1
where (
columnone not in col1
and columntwo not in col2
...
);

Resources