SQL Numbering of non-sequential Groups - sql-server

Can anyone help me with a select statement that will return the "Section" numbering I have shown below? I found some similar questions and answers but nothing that addresses my specific requirements.
My data (simplified for this example) are the "Sequence" and "Data" columns, and I want to produce the "Section" column based on (1) my data being ordered by the value in the Sequence column and (2) the break in value of the Data column.
Note that the "Section" numbering I desire breaks on the "change in value" of the Data column with no consideration for the actual values in that column or for them having to be in any particular sequence.
I should also clarify that the values in the Sequence column will be contiguous so no missing numbers in the sequence, which the chosen answer satisfies.

We can use the difference in row number method here:
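WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY Sequence) -
           ROW_NUMBER() OVER (PARTITION BY Data ORDER BY Sequence) AS rn
    FROM yourTable
),
islands AS (
    -- rn alone can coincide for different Data values when a value reappears later,
    -- so identify each island by (Data, rn) and order islands by their first Sequence
    SELECT *, MIN(Sequence) OVER (PARTITION BY Data, rn) AS island_start
    FROM cte
)
SELECT Sequence, Data, DENSE_RANK() OVER (ORDER BY island_start) AS Section
FROM islands
ORDER BY Sequence;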
It is difficult to explain in words why this method works here, but if you are curious, you may try SELECT * FROM cte to see what is happening.
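To make the intermediate values visible, here is a minimal sketch of the row-number difference. It runs in SQLite through Python's sqlite3 module rather than SQL Server (an assumption, made only so the example is self-contained), and the six rows are hypothetical sample data. It prints rn for every row and then numbers each (Data, rn) island by its first Sequence value, since rn by itself can coincide for two different Data values (Sequence 3 to 5 below).

```python
import sqlite3

# Hypothetical sample data: (Sequence, Data). "A" reappears at Sequence 5.
rows = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "A"), (6, "C")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (Sequence INTEGER, Data TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)

query = """
WITH cte AS (
    -- overall row number minus row number within the same Data value
    SELECT Sequence, Data,
           ROW_NUMBER() OVER (ORDER BY Sequence) -
           ROW_NUMBER() OVER (PARTITION BY Data ORDER BY Sequence) AS rn
    FROM yourTable
),
islands AS (
    -- an island is a run of rows sharing both Data and rn
    SELECT Sequence, Data, rn,
           MIN(Sequence) OVER (PARTITION BY Data, rn) AS island_start
    FROM cte
)
SELECT Sequence, Data, rn,
       DENSE_RANK() OVER (ORDER BY island_start) AS Section
FROM islands
ORDER BY Sequence
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)  # (Sequence, Data, rn, Section)
```

Within a run of equal Data values both row numbers advance together, so their difference stays constant; that constant is what marks the run.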

This solution uses a window function twice. It assumes each Data value occurs in a single contiguous run.
-- [2] then use dense_rank() on the smallest Sequence by Data
select *, dense_rank() over (order by s) as Section
from
(
    -- [1] find the smallest Sequence for each group of Data
    select *, s = min(Sequence) over (partition by Data)
    from tbl
) t
order by Sequence

Related

rank gets affected by groupby, why

I've recently seen a query like the one below (rank and dense_rank with a group by clause). I found that the group by clause makes rank behave like dense_rank, and I could not find Microsoft documentation about it.
with FactTransactionHistory as
(
select 2 as ProductKey,'abc1' as trx
union
select 3 as ProductKey,'abc1' as trx
union
select 4 as ProductKey,'abc' as trx
union
select 4 as ProductKey,'abc2' as trx
union
select 4 as ProductKey,'abc3' as trx
union
select 5 as ProductKey,'abc' as trx
)
select ProductKey, DENSE_RANK() over(order by ProductKey) rowNumDense, RANK() over(order by ProductKey) rowNum
/*, count(*) recordCount*/
from FactTransactionHistory
group by ProductKey
My understanding is if the over clause has partition by, it will be ordered within the partition, hence the rank value is determined within the partition.
But this query has no partition by, so the order by is on the whole dataset, and I cannot explain why the rank function behaves like dense_rank.
Can you please help explain why?
Note: if I remove the group by clause, rank and dense_rank show different values, as the documentation states.
"I found the group by clause makes rank behave like dense rank."
These two ranking functions only differ on how they handle ties. Here, you are ordering the over() clause of the window function with the same column that is used in the group by - that is ProductKey. By nature, aggregation guarantees no duplicates on the product key, so both functions give the same result.
"But this query has no partition by, so the order by is on the whole dataset"
This is where your expectation goes wrong. To quote the docs on the OVER clause:
"If PARTITION BY is not specified, the function treats all rows of the query result set as a single group."
Note the words "query result set": it is the result set rows (after GROUP BY), not the source rows, that make up the single partition here.
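The effect can be reproduced with the question's own rows. The following is a minimal sketch that uses SQLite through Python's sqlite3 module instead of SQL Server (an assumption, for the sake of a self-contained example); the ranking expressions are the ones from the question, run once without and once with GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FactTransactionHistory (ProductKey INTEGER, trx TEXT)")
conn.executemany(
    "INSERT INTO FactTransactionHistory VALUES (?, ?)",
    [(2, "abc1"), (3, "abc1"), (4, "abc"), (4, "abc2"), (4, "abc3"), (5, "abc")],
)

ranking = """
SELECT ProductKey,
       DENSE_RANK() OVER (ORDER BY ProductKey) AS rowNumDense,
       RANK()       OVER (ORDER BY ProductKey) AS rowNum
FROM FactTransactionHistory
"""

# Window functions see the six source rows: ProductKey 4 is tied three times.
without_group = conn.execute(ranking + " ORDER BY ProductKey").fetchall()

# Window functions see the four grouped rows: no ties are left.
with_group = conn.execute(
    ranking + " GROUP BY ProductKey ORDER BY ProductKey"
).fetchall()

print(without_group)
print(with_group)
```

Without GROUP BY, RANK skips from 3 to 6 after the three tied rows while DENSE_RANK continues with 4; with GROUP BY there is one row per ProductKey, so the two columns are identical.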

The field in ORDER BY affects the result of window functions

I have a simple T-SQL query, which calculates the row number, row count and total volume across all records:
DECLARE @t TABLE
(
    id varchar(100),
    volume float,
    prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
    row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
    rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY prev_date),
    vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date),
    *
FROM @t;
I get the following result:
However, this is NOT what I expected: in both rows, rows_count should be 2 and vol_total should be 300.
The workaround would be to add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. However, I thought that there must be another way.
In the end I found out that the ORDER BY clause must use the id field rather than the prev_date field:
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY id),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY id)
After this change the query's output is as expected.
But I don't understand why this is so. How can the ordering affect the partitioning?
For aggregate functions, an ORDER BY in the window definition is generally not required unless you want the aggregation applied cumulatively in order, as in a running total. Simply removing the ORDER BY will fix the problem.
Put another way, with ORDER BY the window expands row by row: it starts at the first row and, for each row, aggregates everything from the start of the partition up to the current row (for the first row, that is just the row itself).
If you remove the ORDER BY, the aggregation is computed over all the rows in the partition, and no ordering is applied.
You can change the ORDER BY in the window definition to see its effect.
Of course, ranking functions need the ORDER BY; this point applies only to aggregations.
DECLARE @t TABLE
(
    id varchar(100),
    volume float,
    prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
    row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
    rows_count = COUNT(*) OVER (PARTITION BY id),
    vol_total = SUM(volume) OVER (PARTITION BY id),
    *
FROM @t;
Support for ORDER BY in the window of aggregate functions was added in SQL Server 2012; it was not part of the first release of the feature in 2005.
For a detailed explanation of ordering in window functions on aggregates, this is a great help:
Producing a moving average and cumulative total - SQL Server documentation
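The difference between the variants can be seen side by side. When ORDER BY is present and no frame is given, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, and rows that tie on the ORDER BY key count as part of the current row's frame. That is why ORDER BY id appeared to fix the query: every row in the partition ties on id. The following is a minimal sketch using the question's two rows; it runs in SQLite through Python's sqlite3 module rather than SQL Server (an assumption, for a self-contained example), and the column aliases are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, volume REAL, prev_date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("0318610084", 100, "2019-05-16"), ("0318610084", 200, "2016-06-04")],
)

query = """
SELECT prev_date,
       -- ORDER BY on a distinct key: running count and running total
       COUNT(*)    OVER (PARTITION BY id ORDER BY prev_date) AS running_count,
       SUM(volume) OVER (PARTITION BY id ORDER BY prev_date) AS running_total,
       -- ORDER BY on a key where all rows tie: whole partition
       SUM(volume) OVER (PARTITION BY id ORDER BY id)        AS tied_total,
       -- no ORDER BY: whole partition
       SUM(volume) OVER (PARTITION BY id)                    AS partition_total
FROM t
ORDER BY prev_date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Only the columns ordered by prev_date accumulate; the other two report 300 on both rows.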

Return column with varying values depending on change points

I'm fairly new to Microsoft SQL Server, so maybe this is very simple yet I just don't have the experience to pull from.
The data I have is similar to the first three columns shown (A, B, C). I want to use those columns to return the data in the yellow highlighted column (D). Basically, I'm trying to show all values of a variable from the current week onward, including when there are change points of the variable. The value of the variable should continue forward in time until the value of the variable changes (column C).
Thanks in advance.
SELECT T1.*, COALESCE(SQ.NewValue, T1.StartingValue) AS Result
FROM YourTable T1
OUTER APPLY (SELECT TOP 1 T2.NewValue
             FROM YourTable T2
             WHERE T2.Week <= T1.Week  -- only the current or earlier weeks
               AND T2.NewValue IS NOT NULL
             ORDER BY T2.Week DESC) SQ;
One way is to make Column D a correlated sub-query that gets the most recent previous value of C that is not NULL.
One method, which doesn't need 2 table scans is to use a CTE to create a "group number" and then the OVER clause with a MAX:
WITH VTE AS (
    SELECT *
    FROM (VALUES (1, 0.5, NULL),
                 (2, 0.5, 1),
                 (3, 0.5, NULL),
                 (4, 0.5, NULL),
                 (5, 0.5, 0.8),
                 (6, 0.5, NULL)) V(WeekNo, Starting, New)),
CTE AS (
    SELECT *,
           COUNT(New) OVER (ORDER BY WeekNo ASC
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
    FROM VTE)
SELECT WeekNo, Starting, New,
       ISNULL(MAX(New) OVER (PARTITION BY CTE.Grp), Starting) AS Result
FROM CTE
ORDER BY WeekNo;
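To show how the "group number" carries a value forward, here is a minimal sketch of the same query with Grp exposed in the output. It runs in SQLite through Python's sqlite3 module rather than SQL Server (an assumption, for a self-contained example), so ISNULL is swapped for COALESCE and the New column is renamed NewValue; the six rows are the ones from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

query = """
WITH VTE(WeekNo, Starting, NewValue) AS (
    VALUES (1, 0.5, NULL), (2, 0.5, 1),   (3, 0.5, NULL),
           (4, 0.5, NULL), (5, 0.5, 0.8), (6, 0.5, NULL)
),
CTE AS (
    -- COUNT ignores NULLs, so Grp increases only on rows with a new value
    SELECT *,
           COUNT(NewValue) OVER (ORDER BY WeekNo
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
    FROM VTE
)
SELECT WeekNo, Grp,
       -- each Grp holds at most one non-NULL NewValue; MAX picks it up
       COALESCE(MAX(NewValue) OVER (PARTITION BY Grp), Starting) AS Result
FROM CTE
ORDER BY WeekNo
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)  # (WeekNo, Grp, Result)
```

Week 1 falls in group 0, which has no new value yet, so it falls back to Starting; weeks 2 to 4 share group 1 and weeks 5 to 6 share group 2.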

SQL Server - Delete Duplicate Rows - how does Partition By affect this query?

I've been using the following inherited query where I'm trying to delete duplicate rows and I'm getting some unexpected results when first running it as a SELECT - I believe it has something to do with my lack of understanding of the Partition part of the statement:
WITH CTE AS (
    SELECT [Id],
           [Url],
           [Identifier],
           [Name],
           [Entity],
           [DOB],
           RN = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name)
    FROM Data.Statistics
    WHERE Id = 2170
)
DELETE FROM CTE WHERE RN > 1
Can someone help me understand exactly what I'm doing with the Partition BY Name part of this? This doesn't limit the query in any way to only looking for duplicates in the Name field, correct? I need to ensure that it's looking for records where all 5 of the fields inside the CTE definition are the same for a record to be considered a duplicate.
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) doesn't make a lot of sense. You wouldn't ORDER BY the same thing you used in PARTITION BY since it will be the same value for everything in the partition, making the ORDER BY part useless.
Basically the CTE part of this query is saying to split the matching rows (those with [Id] = 2170) temporarily into groups for each distinct name, and within each group of rows with the same name, order those by name (which are obviously all the same value) and then return the row number within that sequence group as RN. Unique names will all have a row number of 1, because there is only one row with that name. Duplicate names will have row numbers 1, 2, 3, and so on. The order of those rows is undefined in this case because of the silly ORDER BY clause, but if you changed the ORDER BY to something meaningful, the row numbers would follow that sequence.
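For the requirement in the question (a row counts as a duplicate only when all of the compared fields match), one option is to list every one of those fields in PARTITION BY. The following is a minimal sketch under several assumptions: it uses SQLite through Python's sqlite3 module rather than SQL Server, the table name stats and the four rows are hypothetical, and it only counts the rows that would get RN > 1 instead of deleting them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats (Id INTEGER, Url TEXT, Identifier TEXT, "
    "Name TEXT, Entity TEXT, DOB TEXT)"
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2170, "u1", "i1", "Ann", "e1", "1990-01-01"),
        (2170, "u1", "i1", "Ann", "e1", "1990-01-01"),  # exact duplicate
        (2170, "u2", "i1", "Ann", "e1", "1990-01-01"),  # same Name, other Url
        (2170, "u3", "i3", "Bob", "e2", "1985-05-05"),
    ],
)

def count_flagged(partition_columns):
    """Count rows that would get RN > 1 for the given PARTITION BY list."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT ROW_NUMBER() OVER (
                       PARTITION BY {partition_columns}
                       ORDER BY Id) AS RN
            FROM stats
            WHERE Id = 2170
        ) AS numbered
        WHERE RN > 1
    """
    return conn.execute(sql).fetchone()[0]

flagged_by_name = count_flagged("Name")
flagged_by_all = count_flagged("Url, Identifier, Name, Entity, DOB")
print(flagged_by_name, flagged_by_all)
```

Partitioning by Name alone also flags the row that differs in Url, whereas partitioning by all the compared columns flags only the exact duplicate.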

Retrieving specific number of rows based on sum of row number

After reading and experimenting, I decided I need to ask:
I am trying to retrieve a specific number of rows from a table based on the sum of the row number: This is a basic table with two columns: CusID, CusName.
I started by numbering each row to 1 so that I can use a SUM of the row number, or so I thought.
WITH Example AS
(
SELECT
*,
ROW_NUMBER() OVER (Partition by CusID ORDER BY CusID) AS RowNumber
FROM
MySchema.MyTable
)
I am not sure how to move beyond here. I tried using the HAVING clause but obviously that would not work. I could also use TOP or Percent.
But I would like to retrieve the rows based on the sum of row number.
What's the way to do this?
First of all, windowed functions cannot be used in the context of another windowed function or aggregate, so you cannot use an aggregate function inside ROW_NUMBER(). I think it would be better to apply the aggregate after your WITH clause, like this:
WITH Example AS
(
SELECT *, ROW_NUMBER() OVER (Partition by CusID ORDER BY CusID) AS RowNumber
FROM MySchema.MyTable
)
select cusid,cusname,sum(rownumber) from example
group by Cusid,Cusname
having .....

Resources