Efficiently extract row range from SELECT with ORDER BY statement - sql-server

I know that you can use ROW_NUMBER() to get the row number and then apply a WHERE clause to the results, as shown here:
USE AdventureWorks2012;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
However, this will first sort all the data and will only then extract the row range.
Since the kth largest element of an unsorted array can be found without a full sort (see "How to find the kth largest element in an unsorted array of length n in O(n)?"), I would hope something similar is possible in SQL.

The sort can be avoided with an index on OrderDate:
CREATE INDEX IX_SalesOrderHeader_OrderDate ON Sales.SalesOrderHeader(OrderDate);
An ordered scan of this index will be performed until the specified upper ROW_NUMBER() limit is reached, limiting the scan to that number of rows. Rows with ROW_NUMBER() values lower than the specified range will be discarded from the results.
As with any set-based pagination technique in SQL where a useful ordering index exists, performance will largely depend on the number of rows that need to be skipped and returned.
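One way to check this on your own instance is to compare logical reads with and without the index. The sketch below assumes the AdventureWorks2012 sample database and the index created above. It also adds SalesOrderID as a tie-breaker, which is my addition and not part of the original query, so that the numbering is deterministic when several orders share an OrderDate; this assumes SalesOrderID is the clustering key, as it is in the sample database.

```sql
USE AdventureWorks2012;
GO
-- Report logical reads for each statement in the Messages output.
SET STATISTICS IO ON;
GO
WITH OrderedOrders AS
(
    SELECT SalesOrderID, OrderDate,
           -- SalesOrderID breaks ties between orders on the same date.
           ROW_NUMBER() OVER (ORDER BY OrderDate, SalesOrderID) AS RowNumber
    FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
GO
SET STATISTICS IO OFF;
GO
```

If the plan uses an ordered scan of the index, the reported logical reads should stay small for an early range and grow as the range moves deeper into the table.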

Related

T-SQL: aggregate function for calculating Nth percentile

I am trying to calculate the Nth percentile of all of the values in a single column in a table. All I want is a scalar, aggregate value that N percent of the values fall below. For instance, if the table has 100 rows where the value is the same as the row index plus one (1 to 100 consecutively), then I'd want this value to tell me that 95% of the values are below 95.
The PERCENTILE_CONT analytic function looks closest to what I want. But if I try to use it like this:
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
I get one row per row in the table, all with the same value. I could use TOP 1 to just give me one of those rows, but now I've done an additional table scan.
I am not trying to create a wizbang table of results partitioned by some other column in the original table. I just want an aggregate, scalar value.
Edit: I have been able to use PERCENTILE_CONT in a query with a WHERE clause. For example:
DECLARE @P95 INT
SELECT TOP 1 @P95 = (PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER ())
FROM ExampleTable
WHERE LOWER(Color) = 'blue'
SELECT @P95
Including the WHERE clause gives a different result than I got without it.
From what I can tell, you will need a subquery or CTE here. For example, to find the number of records strictly below the 95th percentile we can try:
WITH cte AS (
SELECT ValueColumn,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
FROM yourTable
)
SELECT COUNT(*)
FROM cte
WHERE ValueColumn < P95;
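If the goal is a single scalar rather than a count, one commonly used workaround is to collapse the identical per-row values with DISTINCT, or to assign the value to a variable. This is a minimal sketch, assuming SQL Server 2012 or later; the temp table #ExampleTable and its contents (the integers 1 to 100) are hypothetical demo data.

```sql
-- Hypothetical demo data: the integers 1..100.
CREATE TABLE #ExampleTable (ValueColumn int NOT NULL);

INSERT INTO #ExampleTable (ValueColumn)
SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;

-- Every row carries the same windowed value, so DISTINCT leaves one row.
SELECT DISTINCT
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
FROM #ExampleTable;

-- Alternatively, capture the value in a variable.
-- PERCENTILE_CONT returns float, so an INT variable would lose the fraction.
DECLARE @P95 float;
SELECT TOP (1)
       @P95 = PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER ()
FROM #ExampleTable;
SELECT @P95 AS P95;

DROP TABLE #ExampleTable;
```

Neither form is guaranteed to avoid the underlying sort or scan, so it is worth checking the execution plan on real data.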

SQL Server Conditional Sort Performance Issue

I have a table with around 5 million rows. When I use a conditional sort on this table it takes around 25 seconds, but when I change the conditional sort to a fixed sort column it takes 1 second. The only difference is shown below:
--takes 1 second
ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber
--takes around 25 seconds
CASE @SortColumn WHEN 'OrderId' THEN ROW_NUMBER() OVER (ORDER BY OrderId DESC) END AS RowNumber
Can anyone explain what SQL Server is doing in this scenario?
OrderId must be indexed. Thus in the first instance:
ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber
SQL does not need to perform a sort having originally performed an index scan on the OrderId column. It knows that the index is ordered by the column you want to order by so does not need to perform another sort.
However, in the second example SQL Server has to evaluate CASE @SortColumn WHEN 'OrderId' THEN ROW_NUMBER() OVER (ORDER BY OrderId DESC) END for each row. Thus it performs a Compute Scalar operation on each row to work out the result of the CASE expression. The results of this operation cannot be mapped to an index because they do not represent a column, so a further sort operation is required. Over 5 million rows this is a very expensive operation.
If you were to run the queries over non-indexed columns:
--takes around 25 seconds
ROW_NUMBER() OVER (ORDER BY NonIndexedColumn DESC) AS RowNumber
--takes around 25 seconds
CASE @SortColumn WHEN 'NonIndexedColumn' THEN ROW_NUMBER() OVER (ORDER BY NonIndexedColumn DESC) END AS RowNumber
then both queries would presumably run equally slowly, as SQL Server would have to sort in both instances (rather than just using a sorted index). Thus passing in a column to sort by is always going to end up with slow performance over a large number of rows if someone picks a non-indexed column. You therefore need to ensure your results are filtered down to a manageable number of rows before the ORDER BY is applied.
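One way to keep a plain column in each ORDER BY, so that an index on that column can still supply the order, is to branch on the parameter instead of wrapping ROW_NUMBER() in CASE. This is a sketch of that alternative, not something the answer above proposes; the temp table #Orders, the OrderDate column and the sample rows are hypothetical.

```sql
-- Hypothetical demo table; the primary key gives an index ordered by OrderId.
CREATE TABLE #Orders
(
    OrderId   int  NOT NULL PRIMARY KEY,
    OrderDate date NOT NULL
);
INSERT INTO #Orders (OrderId, OrderDate)
VALUES (1, '2024-01-03'), (2, '2024-01-01'), (3, '2024-01-02');

DECLARE @SortColumn sysname = N'OrderId';

-- Each branch is a separate statement with a fixed ORDER BY column,
-- so the optimizer can match it against an index on that column.
IF @SortColumn = N'OrderId'
    SELECT OrderId, OrderDate,
           ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber
    FROM #Orders;
ELSE IF @SortColumn = N'OrderDate'
    SELECT OrderId, OrderDate,
           ROW_NUMBER() OVER (ORDER BY OrderDate DESC) AS RowNumber
    FROM #Orders;
ELSE
    RAISERROR(N'Unsupported sort column.', 16, 1);

DROP TABLE #Orders;
```

The trade-off is one branch per allowed sort column, which also acts as a whitelist of the columns a caller may sort by.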

Retrieving X rows from an ordered CTE, TOP vs Range

Objective:
Want to know which is faster/better performance when trying to retrieve a finite number of rows from CTE that is already ordered.
Example:
Say I have a CTE (intentionally simplified) that looks like this, and I only want the top 5 rows:
WITH cte
AS (
SELECT Id = RANK() OVER (ORDER BY t.ActionID asc)
, t.Name
FROM tblSample AS t -- tblSample is indexed on Id
)
Which is faster:
SELECT TOP 5 * FROM cte
OR
SELECT * FROM cte WHERE Id BETWEEN 1 AND 5 ?
Notes:
I am not a DB programmer, so to me the TOP solution seems better: once SQL Server finds the 5th row, it will stop executing and "return" (100% assumption), while with the other method I feel it will unnecessarily process the whole CTE.
My question is for a CTE, would the answer to this question be the same if it were a table?
The most important thing to note is that the two queries will not always produce the same result set. Consider the following data:
CREATE TABLE #tblSample (ActionId int not null, name varchar(10) not null);
INSERT #tblSample VALUES (1,'aaa'),(2,'bbb'),(3,'ccc');
Both of these will produce the same result:
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT TOP(2) * FROM CTE;
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT * FROM CTE WHERE id BETWEEN 1 AND 2;
Now let's do this update:
UPDATE #tblSample SET ActionId = 1;
After this update the first query still returns two rows, the second query returns 3. Keep in mind too that, without an ORDER BY in the TOP query the results are not guaranteed because there is no default order in SQL.
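If the TOP form should behave like the range predicate when ranks tie, TOP ... WITH TIES combined with an explicit ORDER BY is one option. The sketch below reuses the sample data in its post-update state; the temp table name #tblSampleTies is mine, chosen to avoid clashing with the table above.

```sql
-- Same data as above after the update: every ActionId is 1.
CREATE TABLE #tblSampleTies (ActionId int NOT NULL, name varchar(10) NOT NULL);
INSERT #tblSampleTies VALUES (1,'aaa'),(1,'bbb'),(1,'ccc');

WITH CTE AS
(
    SELECT id = RANK() OVER (ORDER BY t.ActionId ASC), t.name
    FROM #tblSampleTies t
)
-- WITH TIES requires ORDER BY and also returns every row that ties
-- with the last row inside the TOP limit.
SELECT TOP (2) WITH TIES *
FROM CTE
ORDER BY id;

DROP TABLE #tblSampleTies;
```

Because all three rows share rank 1, this returns three rows, the same as the BETWEEN 1 AND 2 query on that data.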
With that out of the way - which performs better? It depends. It depends on your indexing, your statistics, number of rows, and the execution plan that the SQL Engine goes with.
TOP 5 returns whichever 5 rows come first according to the index used for the scan, whereas Id BETWEEN 1 AND 5 fetches rows by the Id column; whether it does so with an index seek or a scan depends on the columns selected. These are two different queries. The BETWEEN query might be slow if you do not have an index on Id.
Let me try to explain with an example.
Consider this to be your data, in a table named yourcte:
create index nci_name on yourcte(id) include(name)
--drop index nci_name on yourcte
;with cte as (
select * from yourcte )
select top 5 * from cte
;with cte as (
select * from yourcte )
select * from cte where id between 1 and 5
First I create an index on id with name included. With that index in place, the second query does an index seek while the first does an index scan and then takes the top 5, so in this case the second approach is better (see the execution plans).
Now I remove the index by executing:
drop index nci_name on yourcte
Both approaches now do a table scan. Looking at the two table scans, the first reads only 5 rows, while the second reads 10 rows and applies the predicate (see the execution plan properties for each plan). Now the first approach is better.
In your case this index needs to be on ActionId, which determines the id.
Hence performance depends on how you index your base table.
In order to get the RANK() which you are calculating in your cte it must sort all the data by t.ActionID. Sorting is a blocking operation: the entire input must be processed before a single row is output.
So in this case, whether you select any five rows or take the five that sorted to the top of the pile is probably irrelevant.

Need help select query in SQL Server

I have this query :
SELECT *
FROM
(SELECT
*,
ROW_NUMBER() OVER (ORDER BY sort_by) as row
FROM table_name) a
WHERE
row > start_row AND row <= limit_row
This query selects everything from table_name, from start_row up to limit_row, with the result arranged by the sort_by column.
But I also need to add the condition WHERE column_name = column_value. And the data arranged by the sort_by column can be in either ascending or descending order.
My question is where should I add the condition column_name = column_value, and the ORDER ASC/DESC in my query?
If my question isn't clear, please ask. Thanks.
SELECT *
FROM
(
SELECT *,
ROW_NUMBER() OVER (ORDER BY sort_by DESC) as row
FROM table_name
WHERE column_name = column_value
) a
WHERE row > start_row
AND row <= limit_row
ORDER BY a.row DESC
The row_number function uses the order to determine the order of the data for numbering purposes; this means the row number order is important to know and understand, especially if you are paging data. Typically, when paging data, you want your ordering so that row 1 is the newest record because you want your first page of data to be the most recent; this generally means the order by on the row number would be descending.
The outer order by only changes the order returned back to you and is really acting only as a display ordering. So, typically, that order by would be ascending when paging data as you are already ordering from newest to oldest.
Also, if you are using SQL Server 2012 or later, there is a built-in paging clause, OFFSET ... FETCH on the ORDER BY, that performs much better (in my experience) than the row-number paging used in the past.
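As a sketch of OFFSET ... FETCH paging (SQL Server 2012 or later), the query in the question might be rewritten roughly as follows. The temp table #Items, its columns and the sample values are hypothetical stand-ins for table_name, column_name and sort_by; @start_row and @limit_row play the roles of start_row and limit_row.

```sql
-- Hypothetical demo table and data.
CREATE TABLE #Items
(
    id         int IDENTITY(1,1) PRIMARY KEY,
    category   varchar(10) NOT NULL,   -- stands in for column_name
    created_at date        NOT NULL    -- stands in for sort_by
);

INSERT INTO #Items (category, created_at)
VALUES ('blue', '2024-01-01'), ('red',  '2024-01-02'), ('blue', '2024-01-03'),
       ('blue', '2024-01-04'), ('red',  '2024-01-05'), ('blue', '2024-01-06');

DECLARE @start_row int = 1, @limit_row int = 3;
-- row > start_row AND row <= limit_row means: skip start_row rows,
-- then take (limit_row - start_row) rows.
DECLARE @page_size int = @limit_row - @start_row;

SELECT id, category, created_at
FROM #Items
WHERE category = 'blue'              -- the filter is applied before paging
ORDER BY created_at DESC, id DESC    -- id breaks ties for a stable page order
OFFSET @start_row ROWS
FETCH NEXT @page_size ROWS ONLY;

DROP TABLE #Items;
```

Switching between ascending and descending order is then just a matter of changing the direction in the single ORDER BY.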

Retrieving specific number of rows based on sum of row number

After reading and experimenting I decided I need to ask:
I am trying to retrieve a specific number of rows from a table based on the sum of the row number: This is a basic table with two columns: CusID, CusName.
I started by numbering each row to 1 so that I can use a SUM of the row number, or so I thought.
WITH Example AS
(
SELECT
*,
ROW_NUMBER() OVER (Partition by CusID ORDER BY CusID) AS RowNumber
FROM
MySchema.MyTable
)
I am not sure how to move beyond here. I tried using the HAVING clause but obviously that would not work. I could also use TOP or Percent.
But I would like to retrieve the rows based on the sum of row number.
What's the way to do this?
First of all, windowed functions cannot be used in the context of another windowed function or aggregate, so you cannot combine an aggregate function with ROW_NUMBER() in the same SELECT. I think it is better to apply the aggregate after your WITH clause, like this:
WITH Example AS
(
SELECT *, ROW_NUMBER() OVER (Partition by CusID ORDER BY CusID) AS RowNumber
FROM MySchema.MyTable
)
select cusid,cusname,sum(rownumber) from example
group by Cusid,Cusname
having .....
