Unexpected analytic function output in common table expression

In SQL Server 2019, analytic functions are not returning the results that I would expect in the context of recursive common table expressions. Consider the following non-recursive T-SQL query:
WITH SourceData (RowNum, Uniform, RowVal) AS (
SELECT 1, 'A', 'A' UNION ALL
SELECT 2, 'A', 'B' UNION ALL
SELECT 3, 'A', 'C' UNION ALL
SELECT 4, 'A', 'D'
),
RecursiveCte0 (RowNum, Uniform, RowVal, MinVal, SomeSum, RowNumCalc, RecursiveLevel) AS (
SELECT RowNum, Uniform, RowVal, RowVal, RowNum, CAST(RowNum AS BIGINT), 0
FROM SourceData
),
RecursiveCte1 (RowNum, Uniform, RowVal, MinVal, SomeSum, RowNumCalc, RecursiveLevel) AS (
SELECT * FROM RecursiveCte0
UNION ALL
SELECT
RowNum, Uniform, RowVal,
MIN(MinVal) OVER (PARTITION BY Uniform),
SUM(RowNum) OVER (PARTITION BY Uniform),
ROW_NUMBER() OVER (PARTITION BY Uniform ORDER BY RowNum),
RecursiveLevel + 1
FROM RecursiveCte0
)
SELECT *
FROM RecursiveCte1
ORDER BY RecursiveLevel, RowNum;
Results:
RowNum Uniform RowVal MinVal SomeSum RowNumCalc RecursiveLevel
1 A A A 1 1 0
2 A B B 2 2 0
3 A C C 3 3 0
4 A D D 4 4 0
1 A A A 10 1 1
2 A B A 10 2 1
3 A C A 10 3 1
4 A D A 10 4 1
As expected, the MIN, SUM, and ROW_NUMBER functions generate the appropriate values based on all rows from RecursiveCte0. I would expect the following recursive query to be logically identical to the non-recursive version above, but it produces different results:
WITH SourceData (RowNum, Uniform, RowVal) AS (
SELECT 1, 'A', 'A' UNION ALL
SELECT 2, 'A', 'B' UNION ALL
SELECT 3, 'A', 'C' UNION ALL
SELECT 4, 'A', 'D'
),
RecursiveCte (RowNum, Uniform, RowVal, MinVal, SomeSum, RowNumCalc, RecursiveLevel) AS (
SELECT RowNum, Uniform, RowVal, RowVal, RowNum, CAST(RowNum AS BIGINT), 0
FROM SourceData
UNION ALL
SELECT
RowNum, Uniform, RowVal,
MIN(MinVal) OVER (PARTITION BY Uniform),
SUM(RowNum) OVER (PARTITION BY Uniform),
ROW_NUMBER() OVER (PARTITION BY Uniform ORDER BY RowNum),
RecursiveLevel + 1
FROM RecursiveCte
WHERE RecursiveLevel < 1
)
SELECT *
FROM RecursiveCte
ORDER BY RecursiveLevel, RowNum;
Results:
RowNum Uniform RowVal MinVal SomeSum RowNumCalc RecursiveLevel
1 A A A 1 1 0
2 A B B 2 2 0
3 A C C 3 3 0
4 A D D 4 4 0
1 A A A 1 1 1
2 A B B 2 1 1
3 A C C 3 1 1
4 A D D 4 1 1
For each of the three analytic functions, it appears that the grouping is only being applied within the context of each individual row, rather than all of the rows at that level. This unexpected behavior also happens if I partition over (SELECT NULL). I would expect the analytic functions to apply to the entire recursion level, as per MSDN:
Analytic and aggregate functions in the recursive part of the CTE are
applied to the set for the current recursion level and not to the set
for the CTE. Functions like ROW_NUMBER operate only on the subset of
data passed to them by the current recursion level and not the entire
set of data passed to the recursive part of the CTE.
Why do these two queries produce different results? Is there a way to effectively use analytic functions with recursive common table expressions?
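The two result sets above can be reproduced with a small Python model. It is only a sketch of the observed outputs, not a statement about how SQL Server evaluates the query internally. The helper `apply_windows` is a hypothetical name; it computes the three window functions over whatever set of rows it receives. Passing the whole level gives the non-recursive output, and passing one row at a time gives the recursive output.

```python
from collections import defaultdict

# Level-0 rows: (RowNum, Uniform, RowVal, MinVal, SomeSum, RowNumCalc, RecursiveLevel)
LEVEL0 = [(n, "A", v, v, n, n, 0) for n, v in enumerate("ABCD", start=1)]


def apply_windows(rows):
    """Evaluate MIN, SUM and ROW_NUMBER over PARTITION BY Uniform for one input set."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[1]].append(row)
    out = []
    for part in parts.values():
        part.sort(key=lambda r: r[0])          # ORDER BY RowNum
        min_val = min(r[3] for r in part)      # MIN(MinVal) OVER (...)
        some_sum = sum(r[0] for r in part)     # SUM(RowNum) OVER (...)
        for pos, r in enumerate(part, start=1):  # ROW_NUMBER() OVER (...)
            out.append((r[0], r[1], r[2], min_val, some_sum, pos, r[6] + 1))
    return out


# Window functions see all four rows of the level at once.
whole_level = apply_windows(LEVEL0)

# Window functions see a single row per evaluation.
row_at_a_time = [apply_windows([r])[0] for r in LEVEL0]
```

`whole_level` matches the level-1 rows of the first query (MinVal A, SomeSum 10, RowNumCalc 1 to 4), while `row_at_a_time` matches the level-1 rows of the recursive query.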

Related

Group records based on time interval starting from timestamp of first record in each group

Struggling with this; need to group records within a specific time interval starting from the first timestamp (FREEZE_TIME) - but the first record outside the first group is the starting point for the time interval for the next group and so on. Expected result, THAW_COUNT, is the count of all groups for a PARENT_SAMPLE_ID. So for table:
SAMPLE_ID FREEZE_TIME PARENT_SAMPLE_ID
1 null null
2 2015-11-27 10:23:10 1
3 2015-11-27 10:59:23 1
4 2015-11-27 11:05:43 1
5 2015-11-27 12:53:48 1
6 2015-11-27 13:42:25 1
I would like to get a result of:
PARENT_SAMPLE_ID THAW_COUNT
1 2
So sample_id:s 2,3 and 4 should be in the same group and sample id:s 5 and 6 are in the next group.
I have tried something like:
with SampleList as
(
select PARENT_SAMPLE_ID, FREEZE_TIME,
ROW_NUMBER() OVER (partition by PARENT_SAMPLE_ID order by FREEZE_TIME asc) RN
from
SAMPLE
)
,
FirstSample as
(
select PARENT_SAMPLE_ID, FREEZE_TIME
from SampleList
where RN = 1
)
,
SelectedSample as
(
select
s.PARENT_SAMPLE_ID,
ABS(DATEDIFF(MINUTE, s.FREEZE_TIME, sFirst.FREEZE_TIME))/60 DiffToFirst
from SampleList s
inner join FirstSample sFirst ON s.PARENT_SAMPLE_ID = sFirst.PARENT_SAMPLE_ID
group by s.PARENT_SAMPLE_ID, ABS(DATEDIFF(MINUTE, s.FREEZE_TIME, sFirst.FREEZE_TIME))/60
)
select PARENT_SAMPLE_ID, count(*) THAW_COUNT
from SelectedSample
group by PARENT_SAMPLE_ID
But this will return a THAW_COUNT of 3 as sampleId:s 5 and 6 will be in different groups because the grouping is based on hour intervals from freeze time of sampleId 2 only. How do I get the grouping for group 2 to start from the first record outside the first group (sampleId 5) and so on?
This can be treated as a gaps and islands problem. Using window functions to check counts, and LAG to look at the "previous" row, we can solve this. If you have multiple values for PARENT_SAMPLE_ID you will want to add some partitioning.
create table #Something
(
SAMPLE_ID int
, FREEZE_TIME datetime
, PARENT_SAMPLE_ID int
)
insert #Something
select 1, null, null union all
select 2, '2015-11-27 10:23:10', 1 union all
select 3, '2015-11-27 10:59:23', 1 union all
select 4, '2015-11-27 11:05:43', 1 union all
select 5, '2015-11-27 12:53:48', 1 union all
select 6, '2015-11-27 13:42:25', 1;
with MyGroups as
(
select *
, GroupNum = count(IsNewGroup) over (order by FREEZE_TIME rows unbounded preceding)
from
(
select *
, IsNewGroup = case when LAG(FREEZE_TIME, 1, '') over(order by FREEZE_TIME) < dateadd(hour, -1, FREEZE_TIME) then 1 end
from #Something
) x
)
select coalesce(PARENT_SAMPLE_ID, SAMPLE_ID)
, count(distinct GroupNum)
from MyGroups
group by coalesce(PARENT_SAMPLE_ID, SAMPLE_ID)
drop table #Something
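The rule stated in the question anchors each group at its first record, whereas the query above starts a new group when the gap to the previous row exceeds one hour. A rough Python sketch of the anchored rule follows. The function name `thaw_count` and the choice to treat a record exactly one hour after the anchor as the start of a new group are assumptions, not something the question specifies.

```python
from datetime import datetime, timedelta


def thaw_count(freeze_times, window=timedelta(hours=1)):
    """Count groups, where each group covers `window` from its first timestamp."""
    groups = 0
    group_start = None
    # NULL freeze times (None here) belong to no group.
    for ts in sorted(t for t in freeze_times if t is not None):
        # Assumption: a record at exactly group_start + window opens a new group.
        if group_start is None or ts >= group_start + window:
            group_start = ts
            groups += 1
    return groups


sample = [
    None,
    datetime(2015, 11, 27, 10, 23, 10),
    datetime(2015, 11, 27, 10, 59, 23),
    datetime(2015, 11, 27, 11, 5, 43),
    datetime(2015, 11, 27, 12, 53, 48),
    datetime(2015, 11, 27, 13, 42, 25),
]
print(thaw_count(sample))
```

For the sample rows both rules give 2. They can differ on other data: with records at 10:00, 10:40 and 11:20, the anchored rule gives two groups, while a gap-to-previous-row rule gives one.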

Moving Median, Mode in T-SQL

I am using SQL Server 2012 and I know it is quite simple to calculate moving averages.
But what I need is to get the mode and the median for a defined window frame like so (with a window of 2 preceding to current row; month unique):
MONTH | CODE | MEDIAN | MODE
1 0 0 0
2 3 1.5 0
3 2 2 0
4 2 2 2
5 2 2 2
6 5 2 2
7 3 3 2
If several values qualify as mode, then pick the first.
I commented my code thoroughly. Read my comments on my Mode calculations and let me know if it needs tweaking. Overall, it's a relatively simple query. It just has a lot of ugly subqueries and it has a lot of comments. Check it out:
DECLARE @Table TABLE ([Month] INT,[Code] INT);
INSERT INTO @Table
VALUES (1,0),
(2,3),
(3,2),
(4,2), --Try commenting this out to test my special mode thingymajig
(5,2),
(6,5),
(7,3);
WITH CTE
AS
(
SELECT ROW_NUMBER() OVER (ORDER BY [Month]) row_num,
[Month],
CAST(Code AS FLOAT) Code
FROM @Table
)
SELECT [Month],
Code,
ISNULL((
SELECT CASE
--When there is only one previous value at row_num = 2, find Mean of first two codes
WHEN A.row_num = 2 THEN (LAG(B.code,1) OVER (ORDER BY [Code]) + B.Code)/2.0
--Else find middle code value of current and previous two rows
ELSE B.Code
END
FROM CTE B
--How subquery relates to outer query
WHERE B.row_num BETWEEN A.row_num - 2 AND A.row_num
ORDER BY B.[Code]
--Order by code and offset by 1 so don't select the lowest value, but fetch the one above the lowest value
OFFSET 1 ROW FETCH NEXT 1 ROW ONLY),
0) AS Median,
--I did mode a little different
--Instead of Avg(D.Code) you could list the values because with mode,
--If there's a tie with more than one of each number, you have multiple modes
--Instead of doing that, I simply return the mean of the tied modes
--When there's one, it doesn't change anything.
--If you were to delete the month 4, then your number of Codes 2 and number of Codes 3 would be the same in the last row.
--Proper mode would be 2,3. I instead average them out to be 2.5.
ISNULL((
SELECT AVG(D.Code)
FROM (
SELECT C.Code,
COUNT(*) cnt,
DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) dnse_rank
FROM CTE C
WHERE C.row_num <= A.row_num
GROUP BY C.Code
HAVING COUNT(*) > 1) D
WHERE D.dnse_rank = 1),
0) AS Mode
FROM CTE A
Results:
Month Code Median Mode
----------- ---------------------- ---------------------- ----------------------
1 0 0 0
2 3 1.5 0
3 2 2 0
4 2 2 2
5 2 2 2
6 5 2 2
7 3 3 2
If I understood your requirements correctly, your source table contains MONTH and CODE columns, and you want to calculate MEDIAN and MODE.
The query below calculates MEDIAN and MODE over a moving window of at most 3 months ("2 preceding to current row") and returns results matching your example.
-----------------------------------------------------
--Demo data
-----------------------------------------------------
CREATE TABLE #Data(
[Month] INT NOT NULL,
[Code] INT NOT NULL,
CONSTRAINT [PK_Data] PRIMARY KEY CLUSTERED
(
[Month] ASC
));
INSERT #Data
([Month],[Code])
VALUES
(1,0),
(2,3),
(3,2),
(4,2),
(5,2),
(6,5),
(7,3);
-----------------------------------------------------
--Query
-----------------------------------------------------
DECLARE @PrecedingRowsLimit INT = 2;
WITH [MPos] AS
(
SELECT [R].[Month]
, [RB].[Month] AS [SubId]
, [RB].[Code]
, ROW_NUMBER() OVER(PARTITION BY [R].[Month] ORDER BY [RB].[Code]) AS [RowNumberInPartition]
, CASE
WHEN [R].[Count] % 2 = 1 THEN ([R].[Count] + 1) / 2
ELSE NULL
END AS [MedianPosition]
, CASE
WHEN [R].[Count] % 2 = 0 THEN [R].[Count] / 2
ELSE NULL
END AS [MedianPosition1]
, CASE
WHEN [R].[Count] % 2 = 0 THEN [R].[Count] / 2 + 1
ELSE NULL
END AS [MedianPosition2]
FROM
(
SELECT [RC].[Month]
, [RC].[RowNumber]
, CASE WHEN [RC].[Count] > @PrecedingRowsLimit + 1 THEN @PrecedingRowsLimit + 1 ELSE [RC].[Count] END AS [Count]
FROM
(
SELECT [Month]
, ROW_NUMBER() OVER(ORDER BY [Month]) AS [RowNumber]
, ROW_NUMBER() OVER(ORDER BY [Month]) AS [Count]
FROM #Data
) [RC]
) [R]
INNER JOIN #Data [RB]
ON [R].[Month] >= [RB].[Month]
AND [RB].[Month] >= [R].[RowNumber] - @PrecedingRowsLimit
)
SELECT DISTINCT [M].[Month]
, [ORIG].[Code]
, COALESCE([ME].[Code],([M1].[Code] + [M2].[Code]) / 2.0) AS [Median]
, [MOD].[Mode]
FROM [MPos] [M]
LEFT JOIN [MPOS] [ME]
ON [M].[Month] = [ME].[Month]
AND [M].[MedianPosition] = [ME].[RowNumberInPartition]
LEFT JOIN [MPOS] [M1]
ON [M].[Month] = [M1].[Month]
AND [M].[MedianPosition1] = [M1].[RowNumberInPartition]
LEFT JOIN [MPOS] [M2]
ON [M].[Month] = [M2].[Month]
AND [M].[MedianPosition2] = [M2].[RowNumberInPartition]
INNER JOIN
(
SELECT [MG].[Month]
, FIRST_VALUE([MG].[Code]) OVER (PARTITION BY [MG].[Month] ORDER BY [MG].[Count] DESC , [MG].[SubId] ASC) AS [Mode]
FROM
(
SELECT [Month] , MIN([SubId]) AS [SubId], [Code] , COUNT(1) AS [Count]
FROM [MPOS]
GROUP BY [Month] , [Code]
) [MG]
) [MOD]
ON [M].[Month] = [MOD].[Month]
INNER JOIN #Data [ORIG]
ON [ORIG].[Month] = [M].[Month]
ORDER BY [M].[Month];
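As a cross-check on the expected numbers, the moving median and mode can be sketched in a few lines of Python. This is only a reference model of the requirement (window of 2 preceding rows to the current row, first value wins on a mode tie), not a translation of either query. The function name `moving_median_mode` is made up for illustration.

```python
from statistics import median


def moving_median_mode(codes, preceding=2):
    """Return (median, mode) per row over a window of `preceding` rows plus the current row."""
    results = []
    for i in range(len(codes)):
        window = codes[max(0, i - preceding): i + 1]
        # max() returns the first element with the highest count,
        # so ties resolve to the value that appears first in the window.
        mode = max(window, key=window.count)
        results.append((median(window), mode))
    return results


codes = [0, 3, 2, 2, 2, 5, 3]  # CODE values for months 1 to 7
for month, (med, mode) in enumerate(moving_median_mode(codes), start=1):
    print(month, codes[month - 1], med, mode)
```

Run against the sample data, this yields medians 0, 1.5, 2, 2, 2, 2, 3 and modes 0, 0, 0, 2, 2, 2, 2, the same as the table in the question.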

t-sql single column result-set to 3 columns

This is a bit of a weird question, and I know it would probably be easier to not do it in SQL, but it will make my life a lot easier.
Basically I have a single column result-set, and I need to turn that into 3 columns, not based on any criteria.
eg.
1
2
3
4
5
6
7
into:
1 2 3
4 5 6
7
It will always be a fixed 3 column result I need in this case.
Currently I am using a cursor and inserting into a table variable, which seems a bit terrible. There must be a better way.
Thanks
Try this:
declare @t table(n int)
insert @t(n) values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
select [0],[1],[2]
from
(
select n
, (ROW_NUMBER() over (order by n) - 1) % 3 c
, (ROW_NUMBER() over (order by n) - 1) / 3 r
from @t
) x
pivot (max(n) for c in ([0], [1], [2])) p
It's possible, but man is this an ugly requirement. This really belongs in the presentation tier, not in the sql.
WITH original As
(
SELECT MyColumn, row_number() over (order by MyColumn) - 1 as ordinal -- zero-based position
FROM RestOfOriginalQueryHere
),
Grouped As
(
SELECT MyColumn, ordinal / 3 As row, ordinal % 3 As col
FROM original
)
SELECT g1.MyColumn, g2.MyColumn, g3.MyColumn
FROM grouped g1
LEFT JOIN grouped g2 on g2.row = g1.row and g2.col = 1
LEFT JOIN grouped g3 on g3.row = g1.row and g3.col = 2
WHERE g1.col = 0
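Both answers rely on the same arithmetic: with a zero-based position, integer division by 3 gives the output row and the remainder gives the output column. A minimal Python sketch of that mapping, with `to_columns` as a hypothetical helper name:

```python
def to_columns(values, width=3):
    """Lay out a flat sequence as rows of `width` columns, padding the last row with None."""
    rows = []
    for index, value in enumerate(values):  # index is zero-based
        r, c = divmod(index, width)         # r = index // width, c = index % width
        if c == 0:
            rows.append([None] * width)     # start a new output row
        rows[r][c] = value
    return rows


print(to_columns([1, 2, 3, 4, 5, 6, 7]))
```

The zero-based index matters. Starting the position at 1 would shift every value one column to the right and leave the first column of the first row empty.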

Split a string without delimiter in T-SQL

I have a result set from a SELECT statement. How can I split one column into single characters when it has no delimiter?
this is my result
Size TCount TDevice
2 5 E01
4.5 3 W02E01
I want to have this
Size TCount TDevice
2 5 E
2 5 0
2 5 1
4.5 3 W
4.5 6 0 (we have 2 times of 0)
4.5 3 2
4.5 3 1
thank you
You can join onto an auxiliary numbers table. I am using spt_values for demo purposes but you should create a permanent one.
WITH Nums
AS (SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number BETWEEN 1 AND 1000),
Result(Size, TCount, TDevice)
AS (SELECT 2, 5,'E01'
UNION ALL
SELECT 4.5,3,'W02E01')
SELECT Size,
COUNT(*) * TCount AS TCount,
SUBSTRING(TDevice, number, 1) AS TDevice
FROM Result
JOIN Nums
ON Nums.number <= LEN(TDevice)
GROUP BY Size,
TCount,
SUBSTRING(TDevice, number, 1)
;with cte as
(
select Size,TCount,
substring(TDevice, 1, 1) as Chars,
stuff(TDevice, 1, 1, '') as TDevice
from t1
union all
select Size,TCount,
substring(TDevice, 1, 1) as Chars,
stuff(TDevice, 1, 1, '') as TDevice
from cte
where len(TDevice) > 0
)
select distinct Size,sum(TCount),Chars
from cte
group by Size,Chars
Advantage: It doesn't require any User defined function (UDF) to be created.
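The intended transformation can be modelled in Python to check what either query should return: one output row per distinct character, with TCount multiplied by the number of times that character occurs. The name `split_devices` is an assumption made for this sketch.

```python
from collections import Counter


def split_devices(rows):
    """Expand (Size, TCount, TDevice) rows into one row per distinct character."""
    out = []
    for size, tcount, tdevice in rows:
        # Counter keeps characters in order of first appearance.
        for char, occurrences in Counter(tdevice).items():
            out.append((size, tcount * occurrences, char))
    return out


rows = [(2, 5, "E01"), (4.5, 3, "W02E01")]
for row in split_devices(rows):
    print(row)
```

Note that for 'W02E01' this also produces a row for 'E', which the expected output in the question does not list.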

Filter Duplicate Rows on Conditions

I would like to filter duplicate rows so that, for each unique combination of rid and did, the row with the maximum active and then the minimum modified is picked. Should I use a self join, or is there an approach that performs better?
Example:
id rid modified active did
1 1 2010-09-07 11:37:44.850 1 1
2 1 2010-09-07 11:38:44.000 1 1
3 1 2010-09-07 11:39:44.000 1 1
4 1 2010-09-07 11:40:44.000 0 1
5 2 2010-09-07 11:41:44.000 1 1
6 1 2010-09-07 11:42:44.000 1 2
Output expected is
1 1 2010-09-07 11:37:44.850 1 1
5 2 2010-09-07 11:41:44.000 1 1
6 1 2010-09-07 11:42:44.000 1 2
Commenting on the first answer: the suggestion does not work for the dataset below (where the row with active=0 has the minimum modified value):
id rid modified active did
1 1 2010-09-07 11:37:44.850 1 1
2 1 2010-09-07 11:38:44.000 1 1
3 1 2010-09-07 11:39:44.000 1 1
4 1 2010-09-07 11:36:44.000 0 1
5 2 2010-09-07 11:41:44.000 1 1
6 1 2010-09-07 11:42:44.000 1 2
Assuming SQL Server 2005+. Use RANK() instead of ROW_NUMBER() if you want ties returned.
;WITH YourTable as
(
SELECT 1 id,1 rid,cast('2010-09-07 11:37:44.850' as datetime) modified, 1 active,1 did union all
SELECT 2,1,'2010-09-07 11:38:44.000', 1,1 union all
SELECT 3,1,'2010-09-07 11:39:44.000', 1,1 union all
SELECT 4,1,'2010-09-07 11:36:44.000', 0,1 union all
SELECT 5,2,'2010-09-07 11:41:44.000', 1,1 union all
SELECT 6,1,'2010-09-07 11:42:44.000', 1,2
),cte as
(
SELECT id,rid,modified,active, did,
ROW_NUMBER() OVER (PARTITION BY rid,did ORDER BY active DESC, modified ASC ) RN
FROM YourTable
)
SELECT id,rid,modified,active, did
FROM cte
WHERE rn=1
order by id
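The selection rule in the ROW_NUMBER query (per rid and did, prefer the highest active, then the earliest modified) can be sketched in Python as follows. The function name `pick_rows` is hypothetical, and the timestamps are kept as fixed-format strings, which sort correctly as text.

```python
# (id, rid, modified, active, did), the second dataset from the question
rows = [
    (1, 1, "2010-09-07 11:37:44.850", 1, 1),
    (2, 1, "2010-09-07 11:38:44.000", 1, 1),
    (3, 1, "2010-09-07 11:39:44.000", 1, 1),
    (4, 1, "2010-09-07 11:36:44.000", 0, 1),
    (5, 2, "2010-09-07 11:41:44.000", 1, 1),
    (6, 1, "2010-09-07 11:42:44.000", 1, 2),
]


def pick_rows(rows):
    """Keep one row per (rid, did): highest active first, then earliest modified."""
    best = {}
    for row in rows:
        _, rid, modified, active, did = row
        rank = (-active, modified)  # ORDER BY active DESC, modified ASC
        key = (rid, did)            # PARTITION BY rid, did
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return sorted((row for _, row in best.values()), key=lambda r: r[0])


for row in pick_rows(rows):
    print(row)
```

On this dataset the rows with id 1, 5 and 6 are kept. Row 4 is not chosen even though it has the earliest modified value, because active is compared first.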
select id, rid, min(modified), max(active), did from foo group by rid, did order by id;
You can get good performance with a CROSS APPLY if you have a table that has one row for each combination of rid and did:
SELECT
X.*
FROM
ParentTable P
CROSS APPLY (
SELECT TOP 1 *
FROM YourTable T
WHERE P.rid = T.rid AND P.did = T.did
ORDER BY active DESC, modified
) X
Substituting (SELECT DISTINCT rid, did FROM YourTable) for ParentTable would work but will hurt performance.
Also, here is my crazy, single scan magic query which can often outperform other methods:
SELECT
id = Substring(Packed, 6, 4),
rid,
modified = Convert(datetime, Substring(Packed, 2, 4)),
Active = Convert(bit, 1 - Substring(Packed, 1, 1)),
did
FROM
(
SELECT
rid,
did,
Packed = Min(Convert(binary(1), 1 - active) + Convert(binary(4), modified) + Convert(binary(4), id))
FROM
YourTable
GROUP BY
rid,
did
) X
This method is not recommended because it's not easy to understand, and it's very easy to make mistakes with it. But it's a fun oddity because it can outperform other methods in some cases.
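The idea behind the packed query is that concatenating fixed-width fields, most significant byte first, makes a plain byte-wise MIN pick the row with the best (active, modified) pair and carry its id along. Here is a rough Python model of that idea using `struct`. The field widths, the arbitrary epoch, and the helper names `pack`, `unpack` and `pick_ids` are all assumptions made for the sketch, and whole seconds are used, so fractional seconds are dropped.

```python
import struct
from datetime import datetime

EPOCH = datetime(2000, 1, 1)  # arbitrary reference point, assumed for this sketch


def pack(active, modified, row_id):
    """Pack (1 - active, seconds since EPOCH, id) as big-endian fixed-width bytes."""
    seconds = int((modified - EPOCH).total_seconds())
    return struct.pack(">BII", 1 - active, seconds, row_id)


def unpack(packed):
    """Recover (active, seconds since EPOCH, id) from the packed bytes."""
    inv_active, seconds, row_id = struct.unpack(">BII", packed)
    return 1 - inv_active, seconds, row_id


def pick_ids(rows):
    """Single pass: keep the smallest packed value per (rid, did)."""
    best = {}
    for row_id, rid, modified, active, did in rows:
        packed = pack(active, modified, row_id)
        key = (rid, did)
        if key not in best or packed < best[key]:  # bytes compare lexicographically
            best[key] = packed
    return sorted(unpack(p)[2] for p in best.values())


rows = [
    (1, 1, datetime(2010, 9, 7, 11, 37, 44), 1, 1),
    (2, 1, datetime(2010, 9, 7, 11, 38, 44), 1, 1),
    (3, 1, datetime(2010, 9, 7, 11, 39, 44), 1, 1),
    (4, 1, datetime(2010, 9, 7, 11, 36, 44), 0, 1),
    (5, 2, datetime(2010, 9, 7, 11, 41, 44), 1, 1),
    (6, 1, datetime(2010, 9, 7, 11, 42, 44), 1, 2),
]
print(pick_ids(rows))
```

Storing `1 - active` first means active rows sort ahead of inactive ones, which is why the T-SQL version subtracts from 1 before packing and again after unpacking.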
