Top vs Rank/Row_Number functions - which performs better? - sql-server

I attempted to Google the Cost of using Top in a query vs using a Ranking or Row_Number type function.
Does the cost of each depend on the situation or can the cost of these two features be determined across the board for all situations?
Some mock SQL using a simple CTE to demonstrate my question is below:
WITH fData AS
(
SELECT 1 AS ID, 'John' AS fName, 'Black' AS lName, CAST('05/19/1975' AS DATE) AS birthDate UNION ALL
SELECT 2 AS ID, 'John' AS fName, 'Black' AS lName, CAST('04/1/1989' AS DATE) AS birthDate UNION ALL
SELECT 3 AS ID, 'John' AS fName, 'Black' AS lName, CAST('11/16/1995' AS DATE) AS birthDate UNION ALL
SELECT 4 AS ID, 'John' AS fName, 'Black' AS lName, CAST('01/16/1968' AS DATE) AS birthDate UNION ALL
SELECT 5 AS ID, 'John' AS fName, 'Black' AS lName, CAST('01/16/1968' AS DATE) AS birthDate
)
/* Using TOP 1 vs Row_Number() - Uncomment this and comment the below to VIEW TOP version */
--SELECT TOP 1 d.ID, d.fName, d.lName, d.birthDate
--FROM fData d
--ORDER BY d.birthDate
/* Using the below vs TOP 1 */
SELECT * FROM
( SELECT d.ID, d.fName, d.lName, d.birthDate, Row_Number() OVER (ORDER BY d.birthDate) AS ranker
FROM fData d
) r
WHERE r.ranker = 1
When using TOP there's no need to apply a secondary wrapping query around it, and it looks cleaner. After applying ROW_NUMBER or a ranking function you must then wrap the query to say which rows you want... either WHERE ranker = 1 or ranker <= 5 to achieve the same as TOP 1 or TOP 5.
Which is faster, if that is even something that can be determined?

In the case of your example the TOP is somewhat more efficient.
The execution plan for the TOP query uses a TOP N Sort operator.
The TOP N Sort with N=1 just needs to keep track of the row with the lowest birthDate that it sees.
For the ROW_NUMBER query the optimiser recognises that the row number is always ascending and adds a TOP 1 to the plan itself, but it doesn't combine the separate TOP and Sort operators into a TOP N Sort, so it does a full sort of all 5 rows.
In the case that an index supplies rows in the desired order without the need for a sort there won't be much in it. The row_number query will have an extra couple of operators that are fairly inexpensive anyway.
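For instance, a minimal sketch (assuming the CTE were a persisted table, here called dbo.fData, which the question itself doesn't have) of an index that would supply the rows already ordered by birthDate:
-- Hypothetical table and index: with birthDate as the index key, both the TOP 1
-- and the ROW_NUMBER() queries can read the leading row(s) without a Sort operator.
CREATE TABLE dbo.fData
(
    ID        int         NOT NULL PRIMARY KEY,
    fName     varchar(50) NOT NULL,
    lName     varchar(50) NOT NULL,
    birthDate date        NOT NULL
);

CREATE NONCLUSTERED INDEX IX_fData_birthDate
    ON dbo.fData (birthDate)
    INCLUDE (fName, lName);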
WHY use ranking functions in SQL Server when it has TOP
Ranking functions in general are more powerful than TOP.
For the cases where both would work consider that TOP is a fairly ancient proprietary syntax and not standard SQL. It was in the product a long time before window functions were added. If portable SQL is a concern you should not use TOP.
Though you might not use ranking functions for this either, as another (standard SQL) alternative is:
SELECT d.ID, d.fName, d.lName, d.birthDate
FROM fData d
ORDER BY d.birthDate
OFFSET 0 ROWS
FETCH NEXT 1 ROW ONLY
which gives the same plan as TOP 1
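The same OFFSET ... FETCH syntax also covers the TOP 5 / ranker <= 5 case from the question, for example:
SELECT d.ID, d.fName, d.lName, d.birthDate
FROM fData d
ORDER BY d.birthDate
OFFSET 0 ROWS
FETCH NEXT 5 ROWS ONLY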

Related

While loop for Teradata

I'm trying to find a way to compile multiple events together. The data is basically a series of rows of event logs.
I want to generate an aggregation of these row events such that if a new event occurred within 30 seconds of the other ending, it combines the time together. However, if the event log does not have an abutting event, then it is not captured. And these events are 'person' specific.
I envision the output to look something like this:
My intuition suggests using some kind of while loop, but I'm not sure where to start
There's no need for recursion (which would be very hard to write) or a loop over a cursor.
SELECT
Person,
Min(Start_timestamp),
Max(Stop_timestamp),
-- get a concatenated string
Trim(Trailing ',' FROM (XmlAgg(Reason || ',' ORDER BY Reason ) (VARCHAR(1000))))
FROM
(
SELECT Person, Start_timestamp, Stop_timestamp, Reason,
-- assign the same number to all rows within 30 seconds
Sum(flag)
Over (PARTITION BY Person
ORDER BY Start_timestamp
ROWS Unbounded Preceding) AS grp
FROM
(
SELECT Person, Start_timestamp, Stop_timestamp, Reason,
-- check if previous end is within 30 seconds of the current start
CASE WHEN Lag(Stop_timestamp)
Over (PARTITION BY Person
ORDER BY Start_timestamp) + INTERVAL '30' SECOND >= Start_timestamp
THEN 0
ELSE 1
END AS flag
FROM tab
) AS dt
) AS dt
-- aggregate per person and group
GROUP BY Person, grp
If your Teradata version supports SESSIONIZE you can simplify the group calculation, but I couldn't write this syntax ad hoc :-)
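For concreteness, here is a purely hypothetical tab table with a few invented rows, only to illustrate the shape of data the query above expects:
-- Hypothetical DDL and rows (names, times and reasons are invented for illustration)
CREATE TABLE tab
(
    Person          VARCHAR(20),
    Start_timestamp TIMESTAMP(0),
    Stop_timestamp  TIMESTAMP(0),
    Reason          VARCHAR(50)
);

INSERT INTO tab VALUES ('PersonA', TIMESTAMP '2020-01-01 08:00:00', TIMESTAMP '2020-01-01 08:05:00', 'login');
INSERT INTO tab VALUES ('PersonA', TIMESTAMP '2020-01-01 08:05:10', TIMESTAMP '2020-01-01 08:20:00', 'search');   -- starts within 30 seconds, same group
INSERT INTO tab VALUES ('PersonA', TIMESTAMP '2020-01-01 09:00:00', TIMESTAMP '2020-01-01 09:10:00', 'checkout'); -- gap over 30 seconds, new group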
You can achieve this using a recursive CTE:
WITH RECURSIVE MYREC(Person,Start_timestamp,Stop_timestamp ,Reason,LVL)
AS(
SELECT Person,MIN(Start_timestamp),MAX(Stop_timestamp),MIN(Reason)(varchar(100)) AS Reason,1
FROM MYTABLE
GROUP BY 1
UNION ALL
SELECT b.Person,b.Start_timestamp,b.Stop_timestamp ,trim(a.Reason) || ',' || trim( b.Reason), LVL+1
FROM MYTABLE a INNER JOIN MYREC b
ON a.Person = b.Person
AND a.Reason > b.Reason
)
SELECT Person,Start_timestamp,Stop_timestamp,Reason
FROM MYREC
QUALIFY RANK() OVER(PARTITION BY Person ORDER BY Reason DESC) = 1
Change MYTABLE to your tablename

Find/replace string values based upon table values

I have a somewhat unusual need to find/replace values in a string from values in a separate table.
Basically, I need to standardize a bunch of addresses, and one of the steps is to replace things like St, Rd or Blvd with Street, Road or Boulevard. I was going to write a function with a bunch of nested REPLACE() statements, but this is 1) inefficient; and 2) not practical. There are over 500 possible abbreviations for street types according to the USPS website.
What I'd like to do is something akin to:
REPLACE(Address1, Col1, Col2) where col1 and col2 are abbreviation and full street type in a separate table.
Anyone have any insight into something like this?
You can do such replacements using a recursive CTE. Something like this:
with r as (
      select rep.*, row_number() over (order by id) as seqnum
      from replacements rep
     ),
     cte as (
      select replace(t.address, r.col1, r.col2) as address, r.seqnum as i
      from t cross join
           r
      where r.seqnum = 1
      union all
      select replace(cte.address, r.col1, r.col2), cte.i + 1
      from cte join
           r
           on r.seqnum = cte.i + 1
     )
select cte.*
from (select cte.*, max(i) over () as maxi
      from cte
     ) cte
where maxi = i
option (maxrecursion 0);  -- the default limit of 100 recursion levels is too low for 500+ replacements
That said, this is basically iteration. It will be quite expensive to do this on a table where there are 500 replacements per row.
SQL is probably not the best tool for this.
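If you do stay in SQL, one procedural sketch (assuming, as in the query above, a replacements table with col1 = abbreviation and col2 = full word, and an addresses table t with an address column) applies one UPDATE per replacement row instead of recursing:
-- Loop over the replacement pairs once; each pass rewrites every address.
DECLARE @abbrev varchar(50), @full varchar(50);

DECLARE rep_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT col1, col2 FROM replacements;

OPEN rep_cursor;
FETCH NEXT FROM rep_cursor INTO @abbrev, @full;

WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE t SET address = REPLACE(address, @abbrev, @full);
    FETCH NEXT FROM rep_cursor INTO @abbrev, @full;
END;

CLOSE rep_cursor;
DEALLOCATE rep_cursor;
This still scans the address table once per abbreviation, which is exactly the cost warned about above.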

Is there a way to combine these queries?

I have begun working some of the programming problems on HackerRank as a "productive distraction".
I was working on the first few in the SQL section and came across this problem (link):
Query the two cities in STATION with the shortest and
longest CITY names, as well as their respective lengths
(i.e.: number of characters in the name). If there is
more than one smallest or largest city, choose the one
that comes first when ordered alphabetically.
Input Format
The STATION table is described as follows:
where LAT_N is the northern latitude and LONG_W is
the western longitude.
Sample Input
Let's say that CITY only has four entries:
1. DEF
2. ABC
3. PQRS
4. WXY
Sample Output
ABC 3
PQRS 4
Explanation
When ordered alphabetically, the CITY names are listed
as ABC, DEF, PQRS, and WXY, with the respective lengths
3, 3, 4 and 3. The longest-named city is obviously PQRS,
but there are options for shortest-named city; we choose
ABC, because it comes first alphabetically.
I agree that this requirement could be written much more clearly, but the basic gist is pretty easy to get, especially with the clarifying example. The question I have, though, occurred to me because the instructions given in the comments for the question read as follows:
/*
Enter your query here.
Please append a semicolon ";" at the end of the query and
enter your query in a single line to avoid error.
*/
Now, writing a query on a single line doesn't necessarily imply a single query, though that seems to be the intended thrust of the statement. However, I was able to pass the test case using the following submission (submitted on 2 lines, with a carriage return in between):
SELECT TOP 1 CITY, LEN(CITY) FROM STATION ORDER BY LEN(CITY), CITY;
SELECT TOP 1 CITY, LEN(CITY) FROM STATION ORDER BY LEN(CITY) DESC, CITY;
Again, none of this is advanced SQL. But it got me thinking. Is there a non-trivial way to combine this output into a single results set? I have some ideas in mind where the WHERE clause basically adds some sub-queries in an OR statement to combine the two queries into one. Here is another submission I had that passed the test case:
SELECT
CITY,
LEN(CITY)
FROM
STATION
WHERE
ID IN (SELECT TOP 1 ID FROM STATION ORDER BY LEN(CITY), CITY)
OR
ID IN (SELECT TOP 1 ID FROM STATION ORDER BY LEN(CITY) DESC, CITY)
ORDER BY
LEN(CITY), CITY;
And, yes, I realize that the final , CITY in the final ORDER BY clause is superfluous, but it kind of makes the point that this query hasn't really saved that much effort, especially against returning the query results separately.
Note: This isn't a true MAX and MIN situation. Given the following input, you aren't actually taking the first and last rows:
Sample Input
1. ABC
2. ABCD
3. ZYXW
Based on the requirements as written, you'd take #1 and #2, not #1 and #3.
This makes me think that my solutions actually might be the most efficient way to accomplish this, but my set-based thinking could always use some strengthening, and I'm not sure if that might play in here or not.
Here's another alternative. I think it's pretty straightforward and easy to understand what's going on. Performance is good.
Still has a couple of sub-queries though.
select
min(City), len(City)
from Station
group by
len(City)
having
len(City) = (select min(len(City)) from Station)
or
len(City) = (select max(len(City)) from Station)
Untested as well, but I don't see a reason for it not to work:
SELECT *
FROM (
SELECT TOP (1) CITY, LEN(CITY) AS CITY_LEN
FROM STATION
ORDER BY CITY_LEN, CITY
) AS T
UNION ALL
SELECT *
FROM (
SELECT TOP (1) CITY, LEN(CITY) AS CITY_LEN
FROM STATION
ORDER BY CITY_LEN DESC, CITY
) AS T2;
You can't have an ORDER BY for each SELECT statement in a UNION ALL, but you can work around it by using subqueries together with a TOP (1) clause and ORDER BY.
UNTESTED:
WITH CTE AS (
    Select ID, City, len(City) as CityLen,
           row_number() over (order by City) as AlphaRN,
           row_number() over (order by Len(City) desc) as LenRN
    From STATION
)
Select * from CTE
Where AlphaRN = 1 and (LenRN = (select max(LenRN) from CTE) or
                       LenRN = (Select min(LenRN) from CTE))
Here's the best I could come up with:
with Ordering as
(
select
City,
Forward = row_number() over (order by len(City), City),
Backward = row_number() over (order by len(City) desc, City)
from
Station
)
select City, len(City) from Ordering where 1 in (Forward, Backward);
There are definitely a lot of ways to approach this as evidenced by the variety of answers, but I don't think anything beats your original two-query solution in terms of cleanly and concisely expressing the intended behavior. Interesting question, though!
This is what I came up with. I tried to use only one query, without CTEs or sub-queries.
;WITH STATION AS ( --Dummy table
SELECT *
FROM (VALUES
(1,'DEF','EU',1,9),
(2,'ABC','EU',1,6), -- This is shortest
(3,'PQRS','EU',1,5),
(4,'WXY','EU',1,4),
(5,'FGHA','EU',1,2),
(6,'ASDFHG','EU',1,3) --This is longest
) as t(ID, CITY, [STATE], LAT_N,LONG_W)
)
SELECT TOP 1 WITH TIES CITY,
LEN(CITY) as CITY_LEN
FROM STATION
ORDER BY ROW_NUMBER() OVER(PARTITION BY LEN(CITY) ORDER BY CITY ASC),
CASE WHEN MAX(LEN(CITY)) OVER (ORDER BY (SELECT NULL)) = LEN(CITY)
OR MIN(LEN(CITY)) OVER (ORDER BY (SELECT NULL))= LEN(CITY)
THEN 0 ELSE 1 END
Output:
CITY CITY_LEN
ABC 3
ASDFHG 6
select min(CITY), length(CITY)
from STATION
group by length(CITY)
having length(CITY) = (select min(length(CITY)) from STATION)
or length(CITY) = (select max(length(CITY)) from STATION);

T-SQL First_Value() - How can it return more than a single value?

I have the following query:
Select PH.SubId
From dbo.PanelHistory PH
Where
PH.Scribe2Time <> (Select FIRST_VALUE(ReadTimeLocal) OVER (Order By ReadTimeLocal) From dbo.PanelWorkflow Where ProcessNumber = 2690 And dbo.PanelWorkflow.SubId = PH.SubId)
I'm getting an error (512) that says: Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
How can the subquery return more than a single value? There can only be one first value. I must be overlooking something with this query.
By the way, I realize I could easily use Min() instead of First_Value, but I wanted to experiment with some of these Windowing functions.
How many rows do you see?
SELECT FIRST_VALUE(name) OVER (ORDER BY create_date) AS RN
FROM sys.objects
Even though there is only one distinct first value it still returns it for every row in the query.
So if the sub query itself matches multiple rows you will get this error. You could get rid of it with DISTINCT or TOP 1.
Probably not very efficient but you say this is just for experimental purposes.
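For example, keeping FIRST_VALUE() but collapsing the subquery to a single row with DISTINCT (a sketch against the query from the question):
Select PH.SubId
From dbo.PanelHistory PH
Where
PH.Scribe2Time <> (Select DISTINCT FIRST_VALUE(ReadTimeLocal) OVER (Order By ReadTimeLocal)
                   From dbo.PanelWorkflow
                   Where ProcessNumber = 2690
                   And dbo.PanelWorkflow.SubId = PH.SubId)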
This isn't an answer. It's just an extended comment prompted by the following statement:
I could easily use Min() instead of First_Value, but I
wanted to experiment with some of these Windowing functions.
Min can't be used instead of FIRST_VALUE.
Example:
SET NOCOUNT ON;
DECLARE @MyTable TABLE(ID INT, TranDate DATETIME)
INSERT @MyTable VALUES (1, '2012-02-02'), (2, '2011-01-01'), (3, '2013-03-03')
SELECT MIN(ID) AS MIN_ID FROM @MyTable
SELECT ID, MIN(ID) OVER(ORDER BY TranDate) AS MIN_ID_ORDER_BY FROM @MyTable;
SELECT ID, FIRST_VALUE(ID) OVER(ORDER BY TranDate) AS FIRST_VALUE_ID_ORDER_BY FROM @MyTable;
Results:
MIN_ID
-----------
1
ID MIN_ID_ORDER_BY
----------- ---------------
2 2
1 1
3 1
ID FIRST_VALUE_ID_ORDER_BY
----------- -----------------------
2 2
1 2
3 2
FIRST_VALUE() will still return a row for every record that meets your WHERE clause. TOP 1 should work:
Select PH.SubId
From dbo.PanelHistory PH
Where
PH.Scribe2Time <> (Select TOP 1 ReadTimeLocal
From dbo.PanelWorkflow
Where ProcessNumber = 2690
And dbo.PanelWorkflow.SubId = PH.SubId
Order By ReadTimeLocal)
or MIN:
Select PH.SubId
From dbo.PanelHistory PH
Where
PH.Scribe2Time <> (Select MIN(ReadTimeLocal)
From dbo.PanelWorkflow
Where ProcessNumber = 2690
And dbo.PanelWorkflow.SubId = PH.SubId)
The PARTITION/OVER functions are look-ahead column functions. They aren't row functions - by that, I mean, they don't affect an entire row, the number of rows returned, etc. An OVER aggregate can depend on values in other rows, but the tangible result is only to calculate a single column in the current row.
You may have seen something similar to what you are trying to do via an OVER ROW_NUMBER ranking function. Multiple rows are still returned, but only one of them has a ROW_NUMBER of 1. The rest are filtered in an encapsulating WHERE or JOIN predicate.
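A small self-contained illustration of that pattern: the window function is computed for every row, and the outer WHERE keeps only the row numbered 1.
SELECT name, create_date
FROM (SELECT name, create_date,
             ROW_NUMBER() OVER (ORDER BY create_date) AS rn
      FROM sys.objects) AS x
WHERE rn = 1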

Any way to get a value similar to @@ROWCOUNT when TOP is used?

If I have a SQL statement such as:
SELECT TOP 5
*
FROM Person
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
PRINT @@ROWCOUNT
-- shows '5'
Is there any way to get a value like @@ROWCOUNT that is the actual count of all of the rows that match the query, without re-issuing the query again sans the TOP 5?
The actual problem is a much more complex and intensive query that performs beautifully since we can use TOP n or SET ROWCOUNT n, but then we cannot get a total count, which is required to display paging information in the UI correctly. Presently we have to re-issue the query with a @Count = COUNT(ID) instead of *.
Whilst this doesn't exactly meet your requirement (in that the total count isn't returned as a variable), it can be done in a single statement:
;WITH rowCTE
AS
(
SELECT *
,ROW_NUMBER() OVER (ORDER BY ID DESC) AS rn1
,ROW_NUMBER() OVER (ORDER BY ID ASC) AS rn2
FROM Person
WHERE Name LIKE 'Sm%'
)
SELECT *
,(rn1 + rn2) - 1 as totalCount
FROM rowCTE
WHERE rn1 <=5
The totalCount column will have the total number of rows matching the where filter.
It would be interesting to see how this stacks up performance-wise against two queries on a decent-sized data-set.
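A related single-statement variant (a sketch, assuming SQL Server 2005 or later) uses COUNT(*) OVER (); the window count is computed over every row that passes the WHERE filter before TOP trims the output:
SELECT TOP 5
*
,COUNT(*) OVER () AS totalCount
FROM Person
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC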
you'll have to run another COUNT() query:
SELECT TOP 5
*
FROM Person
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
DECLARE @r int
SELECT
@r=COUNT(*)
FROM Person
WHERE Name LIKE 'Sm%'
select @r
Something like this may do it:
SELECT TOP 5
*
FROM Person
cross join (select count(*) HowMany
from Person
WHERE Name LIKE 'Sm%') tot
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
The subquery returns one row with one column containing the full count; the cross join includes it with all rows returned by the "main" query; and "SELECT *" includes the new column HowMany.
Depending on your needs, the next step might be to filter out that column from your return set. One way would be to load the data from the query into a temp table, and then return just the desired columns, and get rowcount from the HowMany column from any row.
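A sketch of that temp-table step (the column names in the final SELECT are assumed purely for illustration):
-- Capture the requested page plus the HowMany column once...
SELECT TOP 5 p.*, tot.HowMany
INTO #PersonPage
FROM Person p
CROSS JOIN (SELECT COUNT(*) AS HowMany
            FROM Person
            WHERE Name LIKE 'Sm%') tot
WHERE p.Name LIKE 'Sm%'
ORDER BY p.ID DESC;

-- ...then read the total from any row and return only the desired columns.
DECLARE @total int = (SELECT TOP (1) HowMany FROM #PersonPage);

SELECT ID, Name   -- assumed column names; HowMany is filtered out here
FROM #PersonPage
ORDER BY ID DESC;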
