rank gets affected by groupby, why - sql-server

I've recently seen a query like the one below (rank and dense_rank used together with a group by clause). I found the group by clause makes rank behave like dense rank, and could not find Microsoft documentation about it.
with FactTransactionHistory as
(
select 2 as ProductKey,'abc1' as trx
union
select 3 as ProductKey,'abc1' as trx
union
select 4 as ProductKey,'abc' as trx
union
select 4 as ProductKey,'abc2' as trx
union
select 4 as ProductKey,'abc3' as trx
union
select 5 as ProductKey,'abc' as trx
)
select ProductKey, DENSE_RANK() over(order by ProductKey) rowNumDense, RANK() over(order by ProductKey) rowNum
/*, count(*) recordCount*/
from FactTransactionHistory
group by ProductKey
My understanding is that if the over clause has a partition by, rows are ordered within each partition, so the rank value is determined within the partition.
But this query has no partition by, so the order by is on the whole dataset, and I cannot explain why the rank function behaves like dense_rank.
Can you please help explain why?
Note: if I remove the group by clause, rank and dense_rank show different values, as the documentation states.

I found the group by clause makes rank behave like dense rank.
These two ranking functions differ only in how they handle ties. Here, the over() clause of the window function is ordered by the same column that is used in the group by, that is, ProductKey. By nature, aggregation guarantees no duplicates on the product key, so there are no ties and both functions give the same result.

But this query has no partition by, so the order by is on the whole dataset
This is the place where your expectation goes wrong. To quote the docs on the OVER clause
If PARTITION BY is not specified, the function treats all rows of the query result set as a single group.
My emphasis (on "the query result set"). It's the rows of the result set, produced after grouping, not the source rows, that make up the single partition here.
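A small sketch of the same point, reusing the sample data from the question. Ranking the output of an explicit grouping step gives the same numbers as the original query, because each ProductKey appears only once by the time the window functions run. The Grouped CTE name is made up for illustration.

```sql
WITH FactTransactionHistory AS
(
    SELECT 2 AS ProductKey, 'abc1' AS trx UNION ALL
    SELECT 3, 'abc1' UNION ALL
    SELECT 4, 'abc'  UNION ALL
    SELECT 4, 'abc2' UNION ALL
    SELECT 4, 'abc3' UNION ALL
    SELECT 5, 'abc'
),
Grouped AS  -- hypothetical name: the rows that exist after GROUP BY
(
    SELECT ProductKey, COUNT(*) AS recordCount
    FROM FactTransactionHistory
    GROUP BY ProductKey
)
SELECT ProductKey,
       recordCount,
       DENSE_RANK() OVER (ORDER BY ProductKey) AS rowNumDense,
       RANK()       OVER (ORDER BY ProductKey) AS rowNum
FROM Grouped
ORDER BY ProductKey;
-- ProductKey  recordCount  rowNumDense  rowNum
-- 2           1            1            1
-- 3           1            2            2
-- 4           3            3            3
-- 5           1            4            4
```

With one row per ProductKey there are no ties, so RANK never skips a value.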

Related

Using Top in T-SQL

A question on using Top. For example, we have this SQL statement:
SELECT TOP (5) WITH TIES orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
It orders the returned rows by orderdate first, then selects the top five rows.
But doesn't the ORDER BY clause happen after the SELECT clause, which would mean that five arbitrary orders are returned first and then those five rows are ordered by orderdate?
The order of commands in the statement doesn't reflect the actual order of operations that SQL follows. See this article which shows the order to be:
from
where
group by
having
select
order by
limit
As you can see, the TOP operation (limit) is the last to be executed.
The question already has an accepted answer, but I would like to quote the Microsoft documentation.
Logical Processing Order of the SELECT statement
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
But doesn't the ORDER BY clause happen after the SELECT clause, which would mean
that five arbitrary orders are returned first and then those
five rows are ordered by orderdate?
No. ORDER BY is processed after the SELECT, but limiting the result set to 5 rows happens even later.
The physical details of actual query processing may vary, but the end result is as if the server sorted the whole table by orderdate, picked the top 5 rows (or more if needed due to ties), returned those rows and discarded the rest.
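One way to see the "sort everything, then cut" behaviour is to compare TOP ... WITH TIES against an explicit RANK() filter. This is only a sketch: the @Orders table variable and its rows are made-up stand-ins for Sales.Orders, and the date type assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data standing in for Sales.Orders.
DECLARE @Orders TABLE (orderid int, orderdate date);
INSERT INTO @Orders (orderid, orderdate) VALUES
    (1, '20240101'), (2, '20240102'), (3, '20240103'),
    (4, '20240103'), (5, '20240104');

-- TOP is applied after the ORDER BY; WITH TIES keeps rows that tie
-- with the last qualifying row, so this returns orders 5, 3 and 4.
SELECT TOP (2) WITH TIES orderid, orderdate
FROM @Orders
ORDER BY orderdate DESC;

-- The same three rows, written as "rank the whole set, then filter".
SELECT orderid, orderdate
FROM (
    SELECT orderid, orderdate,
           RANK() OVER (ORDER BY orderdate DESC) AS rnk
    FROM @Orders
) AS ranked
WHERE rnk <= 2
ORDER BY orderdate DESC;
```

Both statements return three rows here, because orders 3 and 4 share the second-latest date.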

SQL Server - Delete Duplicate Rows - how does Partition By affect this query?

I've been using the following inherited query where I'm trying to delete duplicate rows and I'm getting some unexpected results when first running it as a SELECT - I believe it has something to do with my lack of understanding of the Partition part of the statement:
WITH CTE AS(
SELECT [Id],
[Url],
[Identifier],
[Name],
[Entity],
[DOB],
RN = ROW_NUMBER()OVER(PARTITION BY Name ORDER BY Name)
FROM Data.Statistics
where Id = 2170
)
DELETE FROM CTE WHERE RN > 1
Can someone help me understand exactly what I'm doing with the Partition BY Name part of this? This doesn't limit the query in any way to only looking for duplicates in the Name field, correct? I need to ensure that it's looking for records where all 5 of the fields inside the CTE definition are the same for a record to be considered a duplicate.
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) doesn't make a lot of sense. You wouldn't ORDER BY the same thing you used in PARTITION BY since it will be the same value for everything in the partition, making the ORDER BY part useless.
Basically the CTE part of this query is saying to split the matching rows (those with [Id] = 2170) temporarily into groups for each distinct name, and within each group of rows with the same name, order those by name (which are obviously all the same value) and then return the row number within that sequence group as RN. Unique names will all have a row number of 1, because there is only one row with that name. Duplicate names will have row numbers 1, 2, 3, and so on. The order of those rows is undefined in this case because of the silly ORDER BY clause, but if you changed the ORDER BY to something meaningful, the row numbers would follow that sequence.
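To treat a row as a duplicate only when all of the compared columns match, the PARTITION BY list can name every one of them. The following is a sketch under assumptions: #Stats and its rows are invented for illustration, and the five columns are taken to be Url, Identifier, Name, Entity and DOB.

```sql
-- Hypothetical stand-in for the real table.
CREATE TABLE #Stats (
    Id int, Url varchar(100), Identifier varchar(50),
    Name varchar(50), Entity varchar(50), DOB date
);
INSERT INTO #Stats (Id, Url, Identifier, Name, Entity, DOB) VALUES
    (2170, 'u1', 'X1', 'Ann', 'E1', '19800101'),
    (2170, 'u1', 'X1', 'Ann', 'E1', '19800101'),  -- exact duplicate
    (2170, 'u2', 'X2', 'Ann', 'E1', '19800101');  -- same Name, different Url

WITH CTE AS (
    SELECT RN = ROW_NUMBER() OVER (
                    -- one partition per distinct combination of all five columns
                    PARTITION BY Url, Identifier, Name, Entity, DOB
                    -- the rows in a partition are identical, so any order will do
                    ORDER BY (SELECT NULL))
    FROM #Stats
    WHERE Id = 2170
)
DELETE FROM CTE WHERE RN > 1;

SELECT * FROM #Stats;  -- two rows remain: one 'u1' row and the 'u2' row
DROP TABLE #Stats;
```

With PARTITION BY Name alone, the 'u2' row would have landed in the same partition as the other two and could have been deleted as well.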

Building sql query with count() where count() is > 1

If I have a table where there are duplicate IDs, how can I count the number of times the same ID appears in the table and only show records that have a count greater than 1?
I've tried:
SELECT COUNT(ID) AS myCount FROM myTbl
WHERE myCount > 1 GROUP BY ID
But it says myCount is invalid column name. Can someone show me what I'm doing wrong?
You need to use the HAVING keyword:
SELECT COUNT(ID) AS myCount FROM myTbl
GROUP BY ID
HAVING COUNT(ID) > 1
From MSDN:
Specifies a search condition for a group or an aggregate. HAVING can
be used only with the SELECT statement. HAVING is typically used in a
GROUP BY clause. When GROUP BY is not used, HAVING behaves like a
WHERE clause.
You need to understand the logical query processing phases.
Following are the main query clauses
specified in the order that you are supposed to type them (known as “keyed-in order”):
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
The logical query processing order, which is the conceptual interpretation
order, is different. It starts with the FROM clause. Here is the logical query processing
order of the six main query clauses:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
A typical mistake made by people who don’t understand logical query processing is attempting
to refer in the WHERE clause to a column alias defined in the SELECT clause. You
can’t do this because the WHERE clause is evaluated before the SELECT clause.
If you understand that the WHERE clause is evaluated before the SELECT clause, you realize
that this attempt is wrong because at this phase, the attribute myCount doesn’t yet exist.
It’s important to understand the difference between WHERE and HAVING. The WHERE
clause is evaluated before rows are grouped, and therefore is evaluated per row.
The HAVING clause is evaluated after rows are grouped, and therefore is evaluated per group.
The HAVING clause (evaluated per group):
can contain aggregate functions
is executed after grouping (it excludes groups after grouping)
is normally paired with GROUP BY (without one, the whole result set is treated as a single group)
On the other hand, the WHERE clause:
cannot contain aggregate functions (as in your case)
is processed after FROM
can be used without GROUP BY
So your query should be like below :
SELECT COUNT(ID) AS myCount FROM myTbl
GROUP BY ID
HAVING COUNT(ID) > 1
Note: the ORDER BY clause is the first and only clause that is allowed to refer to column
aliases defined in the SELECT clause. That’s because the ORDER BY clause is the only one
to be evaluated after the SELECT clause.
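A self-contained sketch that pulls these points together. The @myTbl table variable and its values are made up for illustration.

```sql
-- Hypothetical sample data: ID 1 appears twice, ID 3 three times.
DECLARE @myTbl TABLE (ID int);
INSERT INTO @myTbl (ID) VALUES (1), (1), (2), (3), (3), (3);

SELECT ID, COUNT(ID) AS myCount
FROM @myTbl
GROUP BY ID
HAVING COUNT(ID) > 1    -- the aggregate is repeated: the alias does not exist yet
ORDER BY myCount DESC;  -- the alias is visible here: ORDER BY runs after SELECT
-- ID  myCount
-- 3   3
-- 1   2
```

Replacing the HAVING line with HAVING myCount > 1 fails with the same "invalid column name" error as the WHERE version in the question.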

How to elegantly write a SQL ORDER BY (which is invalid in inline query) but required for aggregate GROUP BY?

I have a simple query that runs in SQL 2008 and uses a custom CLR aggregate function, dbo.string_concat which aggregates a collection of strings.
I require the comments ordered sequentially hence the ORDER BY requirement.
The query I have has an awful TOP statement in it to allow ORDER BY to work for the aggregate function otherwise the comments will be in no particular order when they are concatenated by the function.
Here's the current query:
SELECT ID, dbo.string_concat(Comment)
FROM (
SELECT TOP 10000000000000 ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
) x
GROUP BY ID
Is there a more elegant way to rewrite this statement?
So... what you want is comments concatenated in order of ID then CommentDate of the most recent comment?
Couldn't you just do
SELECT ID, dbo.string_concat(Comment)
FROM Comments
GROUP BY ID
ORDER BY ID, MAX(CommentDate) DESC
Edit: I misunderstood your objective. The best I can come up with is to clean up your query a fair bit by making it SELECT TOP 100 PERCENT. It still uses a TOP, but at least it avoids an arbitrary number as the limit.
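Since you're using SQL Server 2008, you can use a Common Table Expression. SQL Server rejects ORDER BY inside a CTE unless TOP or FOR XML is also specified, so the CTE still needs a TOP, and the order in which the aggregate receives the rows is not guaranteed:
WITH cte_ordered (ID, Comment, CommentDate)
AS
(
SELECT TOP 100 PERCENT ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
)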
SELECT ID, dbo.string_concat(Comment)
FROM cte_ordered
GROUP BY ID
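For comparison, a commonly used way to get ordered concatenation on SQL Server 2008 is a correlated FOR XML PATH subquery, where ORDER BY is allowed. Note that this swaps in a different technique and does not use dbo.string_concat. The @Comments table variable, its rows and the ', ' separator are assumptions made for this sketch.

```sql
-- Hypothetical sample data standing in for the Comments table.
DECLARE @Comments TABLE (ID int, Comment varchar(100), CommentDate datetime);
INSERT INTO @Comments (ID, Comment, CommentDate) VALUES
    (1, 'first',  '20240101'),
    (1, 'second', '20240102'),
    (2, 'only',   '20240103');

SELECT c.ID,
       STUFF((SELECT ', ' + c2.Comment
              FROM @Comments AS c2
              WHERE c2.ID = c.ID
              ORDER BY c2.CommentDate DESC        -- valid here because of FOR XML
              FOR XML PATH(''), TYPE
             ).value('.', 'varchar(max)'),        -- undo XML entity escaping
             1, 2, '') AS Comments                -- strip the leading ', '
FROM @Comments AS c
GROUP BY c.ID
ORDER BY c.ID;
-- ID  Comments
-- 1   second, first
-- 2   only
```

The ordering is defined inside the subquery that builds each string, so it does not depend on how rows happen to reach an aggregate.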

SQL Error with Order By in Subquery

I'm working with SQL Server 2005.
My query is:
SELECT (
SELECT COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id
) as dorduncuay
And the error:
The ORDER BY clause is invalid in views, inline functions, derived
tables, subqueries, and common table expressions, unless TOP or FOR
XML is also specified.
How can I use ORDER BY in a sub query?
This is the error you get (emphasis mine):
The ORDER BY clause is invalid in
views, inline functions, derived
tables, subqueries, and common table
expressions, unless TOP or FOR XML is
also specified.
So, how can you avoid the error? By specifying TOP, would be one possibility, I guess.
SELECT (
SELECT TOP 100 PERCENT
COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id
) as dorduncuay
If you're working with SQL Server 2012 or later, this is now easy to fix. Add an offset 0 rows:
SELECT (
SELECT
COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id OFFSET 0 ROWS
) as dorduncuay
Besides the fact that order by doesn't seem to make sense in your query....
To use order by in a sub select you will need to use TOP 2147483647.
SELECT (
SELECT TOP 2147483647
COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id
) as dorduncuay
My understanding is that "TOP 100 PERCENT" no longer guarantees ordering, starting with SQL Server 2005:
In SQL Server 2005, the ORDER BY
clause in a view definition is used
only to determine the rows that are
returned by the TOP clause. The ORDER
BY clause does not guarantee ordered
results when the view is queried,
unless ORDER BY is also specified in
the query itself.
See SQL Server 2005 breaking changes
Hope this helps,
Patrick
If you are building a derived table, move the ORDER BY clause from inside the derived table's code block to the outside.
Not allowed:
SELECT * FROM (
SELECT A FROM Y
ORDER BY Y.A
) X;
Allowed:
SELECT * FROM (
SELECT A FROM Y
) X
ORDER BY X.A;
You don't need ORDER BY in your subquery. Move it out into the main query, and include the column you want to order by in the subquery.
However, your query is just returning a count, so I don't see the point of the ORDER BY.
A subquery (nested view) as you have it returns a dataset that you can then order in your calling query. Ordering the subquery itself will make no (reliable) difference to the order of the results in your calling query.
As for your SQL itself:
a) I see no reason for an order by as you are returning a single value.
b) I see no reason for the sub query anyway as you are only returning a single value.
I'm guessing there is a lot more information here that you might want to tell us in order to fix the problem you have.
Add the Top command to your sub query...
SELECT
(
SELECT TOP 100 PERCENT
COUNT(1)
FROM
Seanslar
WHERE
MONTH(tarihi) = 4
GROUP BY
refKlinik_id
ORDER BY
refKlinik_id
) as dorduncuay
:)
maybe this trick will help somebody
SELECT
[id],
[code],
[created_at]
FROM
( SELECT
[id],
[code],
[created_at],
(ROW_NUMBER() OVER (
ORDER BY
created_at DESC)) AS Row
FROM
[Code_tbl]
WHERE
[created_at] BETWEEN '2009-11-17 00:00:01' AND '2010-11-17 23:59:59'
) Rows
WHERE
Row BETWEEN 10 AND 20;
Here the inner subquery is numbered in order of the created_at field (it could be any column from your table), and the outer query filters on that row number.
In this example ordering adds no information - the COUNT of a set is the same whatever order it is in!
If you were selecting something that did depend on order, you would need to do one of the things the error message tells you - use TOP or FOR XML
Try moving the ORDER BY clause outside the subselect and adding the ORDER BY column to the subselect (the count also needs a column alias, because every column of a derived table must have a name):
SELECT * FROM
(SELECT COUNT(1) AS cnt, refKlinik_id FROM Seanslar WHERE MONTH(tarihi) = 4 GROUP BY refKlinik_id)
as dorduncuay
ORDER BY refKlinik_id
For me this solution works fine as well:
SELECT tbl.a, tbl.b
FROM (SELECT TOP (select count(1) FROM yourtable) a,b FROM yourtable order by a) tbl
Good day.
For some people, using ORDER BY in a subquery is questionable.
However, ORDER BY in a subquery is a must if you need to delete records based on some sort order, for example:
delete from someTable where ID in (select top(1) ID from someTable where condition order by insertionstamp desc)
This deletes the most recently inserted row from the table.
There are actually three ways to do this kind of deletion, and ORDER BY in a subquery is useful in many other cases as well.
For the deletion methods that use ORDER BY in a subquery, see the link below:
http://web.archive.org/web/20100212155407/http://blogs.msdn.com/sqlcat/archive/2009/05/21/fast-ordered-delete.aspx
I hope it helps. Thank you all.
For a simple count like the OP is showing, the ORDER BY isn't strictly needed. If they are using the result of the subquery, it may be. I am working on a similar issue and got the same error in the following query:
-- I want the rows from the cost table with an updateddate equal to the max updateddate:
SELECT * FROM #Costs Costs
INNER JOIN
(
SELECT Entityname, costtype, MAX(updatedtime) MaxUpdatedTime
FROM #HoldCosts cost
GROUP BY Entityname, costtype
ORDER BY Entityname, costtype -- *** This causes an error***
) CostsMax
ON Costs.Entityname = CostsMax.entityname
AND Costs.Costtype = CostsMax.Costtype
AND Costs.UpdatedTime = CostsMax.MaxUpdatedtime
ORDER BY Costs.Entityname, Costs.costtype
-- *** To accomplish this, there are a few options:
-- Add an extraneous TOP clause, This seems like a bit of a hack:
SELECT * FROM #Costs Costs
INNER JOIN
(
SELECT TOP 99.999999 PERCENT Entityname, costtype, MAX(updatedtime) MaxUpdatedTime
FROM #HoldCosts cost
GROUP BY Entityname, costtype
ORDER BY Entityname, costtype
) CostsMax
ON Costs.Entityname = CostsMax.entityname
AND Costs.Costtype = CostsMax.Costtype
AND Costs.UpdatedTime = CostsMax.MaxUpdatedtime
ORDER BY Costs.Entityname, Costs.costtype
-- **** Create a temp table to order the maxCost
SELECT Entityname, costtype, MAX(updatedtime) MaxUpdatedTime
INTO #MaxCost
FROM #HoldCosts cost
GROUP BY Entityname, costtype
ORDER BY Entityname, costtype
SELECT * FROM #Costs Costs
INNER JOIN #MaxCost CostsMax
ON Costs.Entityname = CostsMax.entityname
AND Costs.Costtype = CostsMax.Costtype
AND Costs.UpdatedTime = CostsMax.MaxUpdatedtime
ORDER BY Costs.Entityname, costs.costtype
Other possible workarounds could be CTEs or table variables. But each situation requires you to determine what works best for you. I tend to look first towards a temp table. To me, it is clear and straightforward. YMMV.
One possible need to order a subquery is when you have a UNION:
Say you generate a call book of all teachers and students.
SELECT name, phone FROM teachers
UNION
SELECT name, phone FROM students
You want to display it with all teachers first, followed by all students, each group ordered by name. So you can't simply apply a global ORDER BY on name.
One solution is to include a key to force the first level of ordering, and then order by name:
SELECT name, phone, 1 AS orderkey FROM teachers
UNION
SELECT name, phone, 2 AS orderkey FROM students
ORDER BY orderkey, name
I think it's much clearer than faking an offset on the subquery result.
I use this code to get the second-highest salary.
I also got an error like:
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified.
I used TOP 100 to avoid the error:
select * from (
select tbl.Coloumn1 ,CONVERT(varchar, ROW_NUMBER() OVER (ORDER BY (SELECT 1))) AS Rowno from (
select top 100 * from Table1
order by Coloumn1 desc) as tbl) as tbl where tbl.Rowno=2
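The query above relies on the row numbers following the order of the inner TOP subquery, which ORDER BY (SELECT 1) does not guarantee. A sketch of an alternative that puts the ordering inside the window function, so no ORDER BY is needed in the subquery at all. The @Table1 table variable, the Salary column and the Rnk alias are invented for this example.

```sql
-- Hypothetical sample data; the top salary appears twice.
DECLARE @Table1 TABLE (Salary int);
INSERT INTO @Table1 (Salary) VALUES (100), (300), (200), (300);

SELECT Salary
FROM (
    SELECT Salary,
           -- DENSE_RANK gives tied salaries the same rank and leaves no gaps,
           -- so rank 2 is the second-highest distinct salary
           DENSE_RANK() OVER (ORDER BY Salary DESC) AS Rnk
    FROM @Table1
) AS ranked
WHERE Rnk = 2;
-- returns 200
```

If several rows share the second-highest salary, this returns all of them. Add DISTINCT to the outer SELECT when a single value is wanted.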

Resources