TSQL Partition by with Order by

TSQL Partition by with Order by - sql-server

I noticed that when I use a partition by like below
SELECT
ROW_NUMBER() OVER(PARTITION BY categoryid
ORDER BY unitprice, productid) AS rownum,
categoryid, productid, productname, unitprice
FROM Production.Products;
the result set returned to me accordingly in the proper partiton and order.
Does that mean I do not have to provide an Order BY clause at the end to absolutely guarantee the order that I want?
Thanks

No.
To guarantee the result order, you must use an ORDER BY that applies to the outermost query. Anything else is just coincidence.
SELECT
ROW_NUMBER() OVER(PARTITION BY categoryid
ORDER BY unitprice, productid) AS rownum,
categoryid, productid, productname, unitprice
FROM Production.Products
ORDER BY categoryid,unitprice,productid;
ORDER BY has two roles:
To actually define how another feature works. This is true when using TOP, say, or within an OVER() partition function. It doesn't require sorting to occur, it just says "this definition only makes sense if we consider the rows in the result set to occur in a particular order - here's the one I want to use"
To dictate the sort order of the result set. This is true when it's an ORDER BY clause on the outermost statement that is part of a particular query - not in a subquery, a CTE, an OVER() paritition function, etc.
Occasionally you will experience it being used in both senses at once - when the outermost statement includes a TOP or OFFSET/FETCH, then it's used for both purposes.

Related

rank gets affected by groupby, why

I've recently seen a query like below (the rank, dense_rank, with group by clause). I found the group by clause makes the rank behaves like dense rank, and could not find microsoft documentation about it.
with FactTransactionHistory as
(
select 2 as ProductKey,'abc1' as trx
union
select 3 as ProductKey,'abc1' as trx
union
select 4 as ProductKey,'abc' as trx
union
select 4 as ProductKey,'abc2' as trx
union
select 4 as ProductKey,'abc3' as trx
union
select 5 as ProductKey,'abc' as trx
)
select ProductKey, DENSE_RANK() over(order by ProductKey) rowNumDense, RANK() over(order by ProductKey) rowNum
/*, count(*) recordCount*/
from FactTransactionHistory
group by ProductKey
My understanding is if the over clause has partition by, it will be ordered within the partition, hence the rank value is determined within the partition.
But this query has no partitition by, so the order by is on the whole dataset, and I could not explain about the rank function, why it is behaving like dense_rank.
Can you please help on explaining why?
Note: if I remove the group by clause, the rank and dense_rank has shown different value as the documentation stated.

I found the group by clause makes rank behave like dense rank.
These two ranking functions only differ on how they handle ties. Here, you are ordering the over() clause of the window function with the same column that is used in the group by - that is ProductKey. By nature, aggregation guarantees no duplicates on the product key, so both functions give the same result.

But this query has no partition by, so the order by is on the whole dataset
This is the place where your expectation goes wrong. To quote the docs on the OVER clause
If PARTITION BY is not specified, the function treats all rows of the query result set as a single group.
My emphasis. It's the result set rows, not the source rows, that make up the single partition here.

Query with window function over slightly different over clauses, remove table spool

I have this query where I can't get the plan I am after.
My query looks like this:
Rows are put in different groups (partition by), from each group I want to pick one row, using row_number to assign a priority.
I also need a flag, which is an aggregate over all rows in each group.
SELECT *
FROM (
SELECT [ProductReviewID]
,[ProductID]
,[ReviewerName]
,[ReviewDate]
,[Rating]
,[Comments]
,[ModifiedDate]
,ROW_NUMBER() over (partition by ProductID order by ReviewDate) as Prio
,min(rating) over (partition by ProductID) as Flag
FROM [AdventureWorks2008R2].[Production].[ProductReview]
) x
WHERE Prio = 1 and Flag >= 2
The resulting plan looks like this (query1):
https://1drv.ms/u/s!Auamqfb9LjGpaoxpkRpYS3iIVbE
The problem are the table spools.
The plan I want looks like Query2, where there is one sort the feeds into a window spool ( I cant show an exact plan, because I dont have a dev-server with sql2016).
select ... <- window spool <- sort <- ...
The problem arises because the optimizer is not smart enough to see that one sort (ProductId, ReviewDate) is good for both window functions.
I can get the plan I want, if make the over clause the same (this should give you a plan like just described):
,ROW_NUMBER() over (partition by ProductID order by ReviewDate) as Prio
,min(rating) over (partition by ProductID order by ReviewDate) as Flag
But that changes the result of min() in way that does not meet the required logic here.
Any ideas? Thanks for help.
EDIT: This is what I have with lptrs help, the trick is to put one "dummy" column as the first for order by. I regard it as answered. Thanks.
,ROW_NUMBER() over (partition by ProductID order by ProductID, ReviewDate) as Prio
,min(rating) over (partition by ProductID order by ProductID) as Flag

The field in ORDER BY affects the result of window functions

I have simple T-SQL query, which calculates row number, rows count and total volume across all records:
DECLARE #t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO #t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY prev_date),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date),
*
FROM #t;
I get the following result:
However, this is NOT what I expected: in all two rows the rows_count must be 2 and vol_total must be 300:
The workaround would be to add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. However, I thought that there must be another way.
In the end of the day I have found out that the ORDER BY clause must use id field rather prev_date field:
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY id),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY id)
After this change the query's output is as expected.
But! I don't understand why is this so? How come the ordering affects partitioning?

For Aggregate functions generally it is not required to have order in the window definition unless you want to do the aggregation one at a time in an ordered fashion, it is like running total. Simply removing the orders will fix the problem.
If I want to explain it from another way it would be like a window that is expanding row by row as you move on to another row. It is started with the first row, calculate the aggregation with all the rows from before (which in the first row is just the current row!) to the position of row.
if you remove the order, the aggregation will be computed for all the rows in the window definition and no order of applying window will take effect.
You can change the order in window definition to see the effect of it.
Of course, ranking functions need the order and this point is just for the aggregations.
DECLARE #t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO #t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id),
vol_total = SUM(volume) OVER (PARTITION BY id),
*
FROM #t;
Enabling order in the window for aggregations added after SqlServer 2012 and it was not part of the first release of the feature in 2005.
For a detailed explanation of the order in window functions on aggregates this is a great help:
Producing a moving average and cumulative total - SqlServer Documentation

SQL ORDER BY clause ERROR in the ROW_NUMBER

I want to use ROW_NUMBER() function and get first and latest values.
I write bellow query. But I got an error.
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
help me to solve the issue. Below the sql query
SELECT *
FROM(
SELECT OPP_ID,PRJ_ID,
ROW_NUMBER() OVER (PARTITION BY OPP_ID ORDER BY MAX(CREATION_DATE) DESC) AS RN
FROM OPPOR
GROUP BY OPP_ID,PRJ_ID
ORDER BY MAX(CREATION_DATE) DESC) OP
WHERE OP.RN = 1

The row_number function can do it's own aggregation and ordering, so no need to use group by or order by in your subquery (order by won't work in subqueries as you've seen). It is a little unclear if you want to partition by opp_id or opp_id and prj_id though. But this should be what you're looking for:
SELECT *
FROM(
SELECT OPP_ID,PRJ_ID,
ROW_NUMBER() OVER (PARTITION BY OPP_ID ORDER BY CREATION_DATE DESC) AS RN
FROM OPPOR
) OP
WHERE OP.RN = 1

How to elegantly write a SQL ORDER BY (which is invalid in inline query) but required for aggregate GROUP BY?

I have a simple query that runs in SQL 2008 and uses a custom CLR aggregate function, dbo.string_concat which aggregates a collection of strings.
I require the comments ordered sequentially hence the ORDER BY requirement.
The query I have has an awful TOP statement in it to allow ORDER BY to work for the aggregate function otherwise the comments will be in no particular order when they are concatenated by the function.
Here's the current query:
SELECT ID, dbo.string_concat(Comment)
FROM (
SELECT TOP 10000000000000 ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
) x
GROUP BY ID
Is there a more elegant way to rewrite this statement?

So... what you want is comments concatenated in order of ID then CommentDate of the most recent comment?
Couldn't you just do
SELECT ID, dbo.string_concat(Comment)
FROM Comments
GROUP BY ID
ORDER BY ID, MAX(CommentDate) DESC
Edit: Misunderstood your objective. Best I can come up with is that you could clean up your query a fair bit by making it SELECT TOP 100 PERCENT, it's still using a top but at least it gets around having an arbitrary number as the limit.

Since you're using sql server 2008, you can use a Common Table Expression:
WITH cte_ordered (ID, Comment, CommentDate)
AS
(
SELECT ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
)
SELECT ID, dbo.string_concat(Comment)
FROM cte_ordered
GROUP BY ID