Window function behaves differently in Subquery/CTE? - sql-server

I thought the following three SQL statements are semantically the same. The database engine will expand the second and third query to the first one internally.
select ....
from T
where Id = 1
select *
from
(select .... from T) t
where Id = 1
select *
from
(select .... from T where Id = 1) t
However, I found the window function behaves differently. I have the following code.
-- Prepare test data
with t1 as
(
select *
from (values ( 2, null), ( 3, 10), ( 5, -1), ( 7, null), ( 11, null), ( 13, -12), ( 17, null), ( 19, null), ( 23, 1759) ) v ( id, col1 )
)
select *
into #t
from t1
alter table #t add primary key (id)
go
The following query returns all the rows.
select
id, col1,
cast(substring(max(cast(id as binary(4)) + cast(col1 as binary(4)))
over (order by id
rows between unbounded preceding and 1 preceding), 5, 4) as int) as lastval
from
#t
id col1 lastval
-------------------
2 NULL NULL
3 10 NULL
5 -1 10
7 NULL -1
11 NULL -1
13 -12 -1
17 NULL -12
19 NULL -12
23 1759 -12
Without CTE/subquery: then I added a condition just return the row which Id = 19.
select
id, col1,
cast(substring(max(cast(id as binary(4)) + cast(col1 as binary(4))) over (order by id rows between unbounded preceding and 1 preceding), 5, 4) as int) as lastval
from
#t
where
id = 19;
However, lastval returns null?
With CTE/subquery: now the condition is applied to the CTE:
with t as
(
select
id, col1,
cast(substring(max(cast(id as binary(4)) + cast(col1 as binary(4))) over (order by id rows between unbounded preceding and 1 preceding ), 5, 4) as int) as lastval
from
#t)
select *
from t
where id = 19;
-- Subquery
select
*
from
(select
id, col1,
cast(substring(max(cast(id as binary(4)) + cast(col1 as binary(4))) over (order by id rows between unbounded preceding and 1 preceding), 5, 4) as int) as lastval
from
#t) t
where
id = 19;
Now lastval returns -12 as expected?

The logic order of operations of the SELECT statement is import to understand the results of your first example. From the Microsoft documentation, the order is, from top to bottom:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
Note that the WHERE clause processing happens logically before the SELECT clause.
The query without the CTE is being filtered where id = 19. The order of operations causes the where to process before the window function in the select clause. There is only 1 row with an id of 19. Therefore, the where limits the rows to id = 19 before the window function can process the rows between unbounded preceding and 1 preceding. Since there are no rows for the window function, the lastval is null.
Compare this to the CTE. The outer query's filter has not yet been applied, so the CTE operates an all of the data. The rows between unbounded preceding finds the prior rows. The outer part of the query applies the filter to the intermediate results returns just the row 19 which already has the correct lastval.
You can think of the CTE as creating a temporary #Table with the CTE data in it. All of the data is logically processed into a separate table before returning data to the outer query. The CTE in your example creates a temporary work table with all of the rows that includes the lastval from the prior rows. Then, the filter in the outer query gets applied and limits the results to id 19.
(In reality, the CTE can shortcut and skip generating data, if it can do so to improve performance without affecting the results. Itzik Ben-Gan has a great example of a CTE that skips processing when it has returned enough data to satisfy the query.)
Consider what happens if you put the filter in the CTE. This should behave exactly like the first example query that you provided. There is only 1 row with an id = 19, so the window function does not find any preceding rows:
with t as ( select id, col1,
cast(substring(max(cast(id as binary(4)) + cast(col1 as binary(4))) over ( order by id
rows between unbounded preceding and 1 preceding ), 5, 4) as int) as lastval
from #t
where id = 19 -- moved filter inside CTE
)
select *
from t

Window functions operate on your result set, so when you added where id = 19 your result set only had 1 row. Since your window function specifies rows between unbounded preceding and 1 preceding there was no preceding row, and resulted in null.
By using the subquery/cte you are allowing the window function to operate over the unfiltered result set (where the preceding rows exist), then retrieving only those rows from that result set where id = 19.

The querys you are comparing are not equivalent.
select id ,
(... ) as lastval
from #t
where id = 19;
will take only 1 row, so 'lastval' will take NULL from col1 as for the windowed function does not find preceding row.

Related

SQL Server : how to divide database records into even, random groups

tblNames
OrganizationID (int)
LastName (varchar)
...
GroupNumber (int)
GroupNumber is currently NULL for all records, I need an UPDATE statement to update this column.
I need to split up records on an OrganizationID level into even, random groups.
If there are < 20,000 records for an OrganizationID, I need 2 even, random groups. So records for that OrganizationID will have a GroupNumber of 1 or 2. There will be the same (or if odd number of records difference of only 1) number of records for GroupNumber = 1 and for GroupNumber = 2, and there will be no recognizable way to tell how a person got into a GroupNumber - i.e. LastNames that start with A-L are group 1, M-Z are group 2 would not be OK.
If there are > 20,000 records for an OrganizationID, I need 4 even, random groups. So records for that OrganizationID will have a GroupNumber values of 1, 2, 3, or 4. There will be the same (or if odd number of records difference of only 1) number of records for each GroupNumber, and there will be no recognizable way to tell how a person got into a GroupNumber - i.e. LastNames that start with A-F are group 1, G-L are group 2, etc. would not be OK.
There are only about 20 organizations, so I can run an update statement 20 times, once per organizationID if needed.
I have full control of the table so I can add keys or columns, but for now this is what it is.
Would appreciate any help.
Create row numbers randomly (with ROW_NUMBER and GETID). Then get their modulo 2 or 4 depending on the record count to get buckets 0 to 1 or 0 to 3.
select
organizationid, lastname, ...,
case when cnt <= 20000 then rn % 2 else rn % 4 end as bucket
from
(
select
organizationid, lastname, ...,
row_number() over(order by newid()) as rn,
count(*) over () as cnt
from mytable
) randomized;
UPDATE: I suppose the update statement would have to look something like this:
with randomized as
(
select
groupnumber,
row_number() over(order by newid()) as rn,
count(*) over () as cnt
from mytable
)
update randomzized
set groupnumber = case when cnt <= 20000 then rn % 2 else rn % 4 end + 1;
Another slightly different approach;
Setting up some fake data:
if object_id('tempdb.dbo.#Orgs') is not null drop table #Orgs
create table #Orgs
(
RID int identity(1,1) primary key clustered,
OrganizationId int,
LastName varchar(36),
GroupId int
)
insert into #Orgs (OrganizationId, LastName)
select top 40000 row_number() over (order by (select null)) % 20000, newid()
from sys.all_objects a, sys.all_objects b
then using the rarely useful ntile() function to get as close to identically sized groups as possible. Sorting by newid() essentially sorts the data randomly (or as random as generating one guid to the next is).
declare #NumRandomGroups int = 4
update o
set GroupId = x.GroupId
from #orgs o
inner join (select RID, GroupId = ntile(#NumRandomGroups) over (order by newid())
from #orgs) x
on o.RID = x.RID
select GroupId, count(1)
from #Orgs
group by GroupId
select *
from #Orgs
order by RID
You can then set #NumRandomGroups to whatever you want it to be based on the count of Organizations

TSQL : Find PAIR Sequence in a table

I have following table in T-SQL(there are other columns too but no identity column or primary key column):
Oid Cid
1 a
1 b
2 f
3 c
4 f
5 a
5 b
6 f
6 g
7 f
So in above example I would like to highlight that following Oid are duplicate when looking at Cid column values as "PAIRS":
Oid:
1 (1 matches Oid: 5)
2 (2 matches Oid: 4 and 7)
Please NOTE that Oid 2 match did not include Oid 6, since the pair of 6 has letter 'G' as well.
Is it possible to create a query without using While loop to highlight the "Oid" like above? along with how many other matches count exist in database?
I am trying to find the patterns within the dataset relating to these two columns. Thank you in Advance.
Here is a worked example - see comments for explanation:
--First set up your data in a temp table
declare #oidcid table (Oid int, Cid char(1));
insert into #oidcid values
(1,'a'),
(1,'b'),
(2,'f'),
(3,'c'),
(4,'f'),
(5,'a'),
(5,'b'),
(6,'f'),
(6,'g'),
(7,'f');
--This cte gets a table with all of the cids in order, for each oid
with cte as (
select distinct Oid, (select Cid + ',' from #oidcid i2
where i2.Oid = i.Oid order by Cid
for xml path('')) Cids
from #oidcid i
)
select Oid, cte.Cids
from cte
inner join (
-- Here we get just the lists of cids that appear more than once
select Cids, Count(Oid) as OidCount
from cte group by Cids
having Count(Oid) > 1 ) as gcte on cte.Cids = gcte.Cids
-- And when we list them, we are showing the oids with duplicate cids next to each other
Order by cte.Cids
select o1.Cid, o1.Oid, o2.Oid
, count(*) + 1 over (partition by o1.Cid) as [cnt]
from table o1
join table o2
on o1.Cid = o2.Cid
and o1.Oid < o2.Oid
order by o1.Cid, o1.Oid, o2.Oid
Maybe Like this then:
WITH CTE AS
(
SELECT Cid, oid
,ROW_NUMBER() OVER (PARTITION BY cid ORDER BY cid) AS RN
,SUM(1) OVER (PARTITION BY oid) AS maxRow2
,SUM(1) OVER (PARTITION BY cid) AS maxRow
FROM oid
)
SELECT * FROM CTE WHERE maxRow != 1 AND maxRow2 = 1
ORDER BY oid

TSQL matching the first instances of multiple values in a resultset

Say I have part of a large query, as below, that returns a resultset with multiple rows of the same key information (PolNum) with different value information (PolPremium) in a random order.
Would it be possible to select the first matching PolNum fields and sum up the PolPremium. In this case I know that there are 2 PolNumber's used so given the screenshot of the resultset (yes I know it starts at 14 for illustration purposes) and return the first values and sum the result.
First match for PolNum 000035789547
(ROW 14) PolPremium - 32.00
First match for PolNum 000035789547
(ROW 16) PolPremium - 706043.00
Total summed should be 32.00 + 706043.00 = 706072.00
Query
OUTER APPLY
(
SELECT PolNum, PolPremium
FROM PN20
WHERE PolNum IN(SELECT PolNum FROM SvcPlanPolicyView
WHERE SvcPlanPolicyView.ControlNum IN (SELECT val AS ServedCoverages FROM ufn_SplitMax(
(SELECT TOP 1 ServicedCoverages FROM SV91 WHERE SV91.AccountKey = 3113413), ';')))
ORDER BY PN20.PolEffDate DESC
}
Resultset
Suppose that pic if the final result your query produces. Then you can do something like:
DECLARE #t TABLE
(
PolNum VARCHAR(20) ,
PolPremium MONEY
)
INSERT INTO #t
VALUES ( '000035789547', 32 ),
( '000035789547', 76 ),
( '000071709897', 706043.00 ),
( '000071709897', 1706043.00 )
SELECT t.PolNum ,
SUM(PolPremium) AS PolPremium
FROM ( SELECT * ,
ROW_NUMBER() OVER ( PARTITION BY PolNum ORDER BY PolPremium ) AS rn
FROM #t
) t
WHERE rn = 1
GROUP BY GROUPING SETS(t.PolNum, ( ))
Output:
PolNum PolPremium
000035789547 32.00
000071709897 706043.00
NULL 706075.00
Just replace #t with your query. Also I assume that row with minimum of premium is the first. You could probably do filtering top row in outer apply part but it really not clear for me what is going on there without some sample data.

T-SQL First_Value() - How can it return more than a single value?

I have the following query:
Select PH.SubId
From dbo.PanelHistory PH
Where
PH.Scribe2Time <> (Select FIRST_VALUE(ReadTimeLocal) OVER (Order By ReadTimeLocal) From dbo.PanelWorkflow Where ProcessNumber = 2690 And dbo.PanelWorkflow.SubId = PH.SubId)
I'm getting an error (512) that says: Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
How can the subquery return more than a single value? There can only be one first value. I must be overlooking something with this query.
By the way, I realize I could easily use Min() instead of First_Value, but I wanted to experiment with some of these Windowing functions.
How many rows do you see?
SELECT FIRST_VALUE(name) OVER (ORDER BY create_date) AS RN
FROM sys.objects
Even though there is only one distinct first value it still returns it for every row in the query.
So if the sub query itself matches multiple rows you will get this error. You could get rid of it with DISTINCT or TOP 1.
Probably not very efficient but you say this is just for experimental purposes.
This isn't an answer. It's just an extended comment generated by the following conclusion:
I could easily use Min() instead of First_Value, but I
wanted to experiment with some of these Windowing functions.
Min can't be used instead of FIRST_VALUE.
Example:
SET NOCOUNT ON;
DECLARE #MyTable TABLE(ID INT, TranDate DATETIME)
INSERT #MyTable VALUES (1, '2012-02-02'), (2, '2011-01-01'), (3, '2013-03-03')
SELECT MIN(ID) AS MIN_ID FROM #MyTable
SELECT ID, MIN(ID) OVER(ORDER BY TranDate) AS MIN_ID_ORDER_BY FROM #MyTable;
SELECT ID, FIRST_VALUE(ID) OVER(ORDER BY TranDate) AS FIRST_VALUE_ID_ORDER_BY FROM #MyTable;
Results:
MIN_ID
-----------
1
ID MIN_ID_ORDER_BY
----------- ---------------
2 2
1 1
3 1
ID FIRST_VALUE_ID_ORDER_BY
----------- -----------------------
2 2
1 2
3 2
FIRST_VALUE() will still return a row for every record that meets tour WHERE clause. TOP 1 should work:
Select PH.SubId
From dbo.PanelHistory PH
Where
PH.Scribe2Time <> (Select TOP 1 ReadTimeLocal
From dbo.PanelWorkflow
Where ProcessNumber = 2690
And dbo.PanelWorkflow.SubId = PH.SubId
Order By ReadTimeLocal DESC)
or MIN:
Select PH.SubId
From dbo.PanelHistory PH
Where
PH.Scribe2Time <> (Select MIN(ReadTimeLocal)
From dbo.PanelWorkflow
Where ProcessNumber = 2690
And dbo.PanelWorkflow.SubId = PH.SubId)
The PARTITION/OVER functions are look-ahead column functions. They aren't row functions - by that, I mean, they don't effect an entire row, number of rows returned, etc. An OVER aggregate can depend on values in other rows, but the tangible result is only to calculate a single column in the current row.
You may have seen something similar to what you are trying to do via an OVER ROW_NUMBER ranking function. Multiple rows are still returned, but only one of them has a ROW_NUMBER of 1. The rest are filtered in an encapsulating WHERE or JOIN predicate.

SQL Select Statement For Calculating A Running Average Column

I am trying to have a running average column in the SELECT statement based on a column from the n previous rows in the same SELECT statement. The average I need is based on the n previous rows in the resultset.
Let me explain
Id Number Average
1 1 NULL
2 3 NULL
3 2 NULL
4 4 2 <----- Average of (1, 3, 2),Numbers from previous 3 rows
5 6 3 <----- Average of (3, 2, 4),Numbers from previous 3 rows
. . .
. . .
The first 3 rows of the Average column are null because there are no previous rows. The row 4 in the Average column shows the average of the Number column from the previous 3 rows.
I need some help trying to construct a SQL Select statement that will do this.
This should do it:
--Test Data
CREATE TABLE RowsToAverage
(
ID int NOT NULL,
Number int NOT NULL
)
INSERT RowsToAverage(ID, Number)
SELECT 1, 1
UNION ALL
SELECT 2, 3
UNION ALL
SELECT 3, 2
UNION ALL
SELECT 4, 4
UNION ALL
SELECT 5, 6
UNION ALL
SELECT 6, 8
UNION ALL
SELECT 7, 10
--The query
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM RowsToAverage rta
)
SELECT nr.ID, nr.Number,
CASE
WHEN nr.RowNumber <=3 THEN NULL
ELSE ( SELECT avg(Number)
FROM NumberedRows
WHERE RowNumber < nr.RowNumber
AND RowNumber >= nr.RowNumber - 3
)
END AS MovingAverage
FROM NumberedRows nr
Assuming that the Id column is sequential, here's a simplified query for a table named "MyTable":
SELECT
b.Id,
b.Number,
(
SELECT
AVG(a.Number)
FROM
MyTable a
WHERE
a.id >= (b.Id - 3)
AND a.id < b.Id
AND b.Id > 3
) as Average
FROM
MyTable b;
Edit: I missed the point that it should average the three previous records...
For a general running average, I think something like this would work:
SELECT
id, number,
SUM(number) OVER (ORDER BY ID) /
ROW_NUMBER() OVER (ORDER BY ID) AS [RunningAverage]
FROM myTable
ORDER BY ID
A simple self join would seem to perform much better than a row referencing subquery
Generate 10k rows of test data:
drop table test10k
create table test10k (Id int, Number int, constraint test10k_cpk primary key clustered (id))
;WITH digits AS (
SELECT 0 as Number
UNION SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
)
,numbers as (
SELECT
(thousands.Number * 1000)
+ (hundreds.Number * 100)
+ (tens.Number * 10)
+ ones.Number AS Number
FROM digits AS ones
CROSS JOIN digits AS tens
CROSS JOIN digits AS hundreds
CROSS JOIN digits AS thousands
)
insert test10k (Id, Number)
select Number, Number
from numbers
I would pull the special case of the first 3 rows out of the main query, you can UNION ALL those back in if you really want it in the row set. Self join query:
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM test10k rta
)
SELECT nr.ID, nr.Number,
avg(trailing.Number) as MovingAverage
FROM NumberedRows nr
join NumberedRows as trailing on trailing.RowNumber between nr.RowNumber-3 and nr.RowNumber-1
where nr.Number > 3
group by nr.id, nr.Number
On my machine this takes about 10 seconds, the subquery approach that Aaron Alton demonstrated takes about 45 seconds (after I changed it to reflect my test source table) :
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM test10k rta
)
SELECT nr.ID, nr.Number,
CASE
WHEN nr.RowNumber <=3 THEN NULL
ELSE ( SELECT avg(Number)
FROM NumberedRows
WHERE RowNumber < nr.RowNumber
AND RowNumber >= nr.RowNumber - 3
)
END AS MovingAverage
FROM NumberedRows nr
If you do a SET STATISTICS PROFILE ON, you can see the self join has 10k executes on the table spool. The subquery has 10k executes on the filter, aggregate, and other steps.
Want to improve this post? Provide detailed answers to this question, including citations and an explanation of why your answer is correct. Answers without enough detail may be edited or deleted.
Check out some solutions here. I'm sure that you could adapt one of them easily enough.
If you want this to be truly performant, and arn't afraid to dig into a seldom-used area of SQL Server, you should look into writing a custom aggregate function. SQL Server 2005 and 2008 brought CLR integration to the table, including the ability to write user aggregate functions. A custom running total aggregate would be the most efficient way to calculate a running average like this, by far.
Alternatively you can denormalize and store precalculated running values. Described here:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/01/23/denormalizing-to-enforce-business-rules-running-totals.aspx
Performance of selects is as fast as it goes. Of course, modifications are slower.

Resources