SQL Server Conditional Sort Performance Issue - sql-server

I have a table with around 5 million rows. When I run a conditional sort on this table it takes around 25 seconds, but when I change the conditional sort to a fixed sort criterion it takes 1 second. The only difference is shown below:
--takes 1 second
ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber
--takes around 25 seconds
CASE @SortColumn WHEN 'OrderId' THEN ROW_NUMBER() OVER (ORDER BY OrderId DESC) END AS RowNumber
Can someone explain what is going on in SQL Server in this scenario?

OrderId must be indexed. Thus in the first instance:
ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber
SQL Server does not need to perform a sort: it performs an ordered scan of the index on OrderId, and because that index is already ordered by the column you want to order by, no separate Sort operator is required.
However, in the second example SQL Server has to evaluate CASE @SortColumn WHEN 'OrderId' THEN ROW_NUMBER() OVER (ORDER BY OrderId DESC) END for each row. It therefore performs a Compute Scalar operation on every row to work out the result of the CASE expression. The results of this operation cannot be mapped to an index because they do not represent a column, so a further Sort operation is required. Over 5 million rows this is a very expensive operation.
If you were to run the queries over non-indexed columns:
--takes around 25 seconds
ROW_NUMBER() OVER (ORDER BY NonIndexedColumn DESC) AS RowNumber
--takes around 25 seconds
CASE @SortColumn WHEN 'NonIndexedColumn' THEN ROW_NUMBER() OVER (ORDER BY NonIndexedColumn DESC) END AS RowNumber
then both queries would presumably run equally slowly, as SQL Server would have to sort in both cases (rather than just use a sorted index). So passing in a column to sort by will always end up with slow performance over a large number of rows if someone picks a non-indexed column. You therefore need to ensure your results are filtered down to a manageable number of rows before the ORDER BY is applied. Another common option is to branch on the sort column outside the query, as sketched below.
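For illustration only (this sketch is not from the original answer; dbo.Orders and SomeOtherColumn are assumed names): branching on @SortColumn outside the statement lets each branch keep a plain ORDER BY that the optimizer can satisfy from an index, avoiding the per-row CASE evaluation entirely.
IF @SortColumn = 'OrderId'
    SELECT ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber, *
    FROM dbo.Orders;   -- table name assumed
ELSE
    SELECT ROW_NUMBER() OVER (ORDER BY SomeOtherColumn DESC) AS RowNumber, *
    FROM dbo.Orders;   -- only fast if SomeOtherColumn is indexed too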

Related

The field in ORDER BY affects the result of window functions

I have a simple T-SQL query which calculates a row number, a row count, and the total volume across all records:
DECLARE @t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY prev_date),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date),
*
FROM @t;
I get the following result:
row_num  rows_count  vol_total  id          volume  prev_date
1        1           200        0318610084  200     2016-06-04
2        2           300        0318610084  100     2019-05-16
However, this is NOT what I expected: in both rows, rows_count should be 2 and vol_total should be 300.
The workaround would be to add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. However, I thought that there must be another way.
At the end of the day I found out that the ORDER BY clause must use the id field rather than the prev_date field:
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY id),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY id)
After this change the query's output is as expected.
But! I don't understand why this is so. How come the ordering affects the partitioning?
For aggregate functions it is generally not required to have an ORDER BY in the window definition, unless you want the aggregation applied one row at a time in an ordered fashion, like a running total. Simply removing the ORDER BY will fix the problem.
To explain it another way: with an ORDER BY, the window expands row by row as you move from one row to the next. It starts with the first row, and for each row the aggregation is computed over all rows from the start of the partition (which, for the first row, is just the current row!) up to the current row's position. This is because the default frame when an ORDER BY is present is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
If you remove the ORDER BY, the aggregation is computed over all the rows in the partition, and no row-by-row expansion takes place.
You can change the ORDER BY in the window definition to see its effect.
Of course, ranking functions do need the ORDER BY; this point applies only to the aggregates.
DECLARE @t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id),
vol_total = SUM(volume) OVER (PARTITION BY id),
*
FROM @t;
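Alternatively, sticking with the workaround already mentioned in the question, you can keep the ORDER BY but make the frame explicit so it covers the whole partition:
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY prev_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
*
FROM @t;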
ORDER BY in the window definition of aggregate functions was added in SQL Server 2012; it was not part of the first release of window functions in SQL Server 2005.
For a detailed explanation of ORDER BY in window functions over aggregates, this is a great help:
Producing a moving average and cumulative total - SqlServer Documentation

How to identify which column(s) have different value in SQL Server

I have a table with more than 100 columns. Normally, contract_id should be unique in this table, but sometimes there are duplicate values. I use this SQL statement to retrieve data from the table:
select distinct contract_id, col1, col2,...colM
from the_table;
but I found duplicate contract_id values. I know some of the other columns must hold different values in those rows, which is why I still see duplicate contract_id values even though I use DISTINCT. Is there a way to find out which columns have differing values? There are lots of fields and only a few columns differ, so it is difficult to compare each column one by one manually.
Try something along the lines of
SELECT contract_id
FROM the_table
GROUP BY contract_id
HAVING COUNT(contract_id)>1;
or
WITH NumberedRows AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY contract_id ORDER BY(SELECT NULL)) AS RowNumber
,*
FROM the_table
)
SELECT *
FROM NumberedRows
WHERE RowNumber>1;
The first will show you all the contract_id values which occur at least twice; the second will show you all the rows you might want to manipulate (delete/change).
Attention: I used SELECT NULL in the ORDER BY of the OVER() clause. It is very important to use a fitting ORDER BY clause here, because it determines which row gets number 1 and which rows get the higher numbers that show up in the result due to the > 1 filter.
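As a further, non-authoritative sketch for narrowing down which columns actually differ (col1 and col2 are placeholders for your real columns; note that COUNT(DISTINCT ...) ignores NULLs, so NULL-versus-value differences need separate handling):
SELECT contract_id,
       COUNT(DISTINCT col1) AS distinct_col1_values,  -- > 1 means col1 differs within this contract_id
       COUNT(DISTINCT col2) AS distinct_col2_values   -- > 1 means col2 differs within this contract_id
FROM the_table
GROUP BY contract_id
HAVING COUNT(*) > 1;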

Using a running total calculated column in SQL Server table variable

I have inherited a stored procedure that utilizes a table variable to store data, then updates each row with a running total calculation. The order of the records in the table variable is very important, as we want the volume to be ordered highest to lowest (i.e. the running total will get increasingly larger as you go down the table).
My problem is that during the step where the table variable is updated, the running total is calculated, but not in the order the data in the table variable was previously sorted by (descending by volume):
DECLARE @TableVariable TABLE ([ID] int, [Volume] decimal(18, 2), [SortValue] decimal(18, 2), [RunningTotal] decimal(18, 2)) -- data types not shown in the original; assumed here
--Populate table variable and order by the sort value...
INSERT INTO @TableVariable (ID, Volume, SortValue)
SELECT
[ID], [Volume], ABS([Volume]) as SortValue
FROM
dbo.VolumeTable
ORDER BY
SortValue DESC
--Set TotalVolume variable...
SELECT @TotalVolume = ABS(SUM([Volume]))
FROM @TableVariable
--Calculate running total, update rows in table variable...I believe this is where problem occurs?
SET @RunningTotal = 0
UPDATE @TableVariable
SET @RunningTotal = RunningTotal = @RunningTotal + [Volume]
FROM @TableVariable
--Output...
SELECT
ID, Volume, SortValue, RunningTotal
FROM
@TableVariable
ORDER BY
SortValue DESC
The result is that the record with the highest volume, which I would have expected the running total to start from (so that its running total = [Volume]), somehow ends up much further down the list. The running total seems to be calculated in a random order.
Here is what I would expect to get:
But here is what the code actually generates:
Is there a way to get the UPDATE statement to be applied to the table variable in such a way that it is ordered by volume descending? From what I've read so far, it could be an issue with the sorting behavior of a table variable, but I'm not sure how to correct it. Can anyone help?
GarethD provided the definitive link to the multiple ways of calculating running totals and their performance. The best one is both the simplest and the fastest, about 300 times faster than the quirky update. That's because it can take advantage of any indexes that cover the sort column, and because it's a lot simpler.
I repeat it here to make clear how much simpler this is when the database provides the appropriate windowing functions:
SELECT
[Date],
TicketCount,
SUM(TicketCount) OVER (ORDER BY [Date] RANGE UNBOUNDED PRECEDING)
FROM dbo.SpeedingTickets
ORDER BY [Date];
The SUM line means: sum all ticket counts over all (UNBOUNDED) the rows that come before (PRECEDING) the current one, as if they were ordered by date.
That ends up being roughly 300 times faster than the quirky update.
The equivalent query for VolumeTable would be:
SELECT
ID,
Volume,
ABS(Volume) as SortValue,
SUM(Volume) OVER (ORDER BY ABS(Volume) DESC RANGE UNBOUNDED PRECEDING)
FROM
VolumeTable
ORDER BY ABS(Volume) DESC
Note that this will be a lot faster if there is an index on the sort column (Volume) and ABS isn't used. Applying any function to a column means the optimizer can't use indexes that cover it, because the actual sort value is different from the one stored in the index.
If the table is very large and performance suffers, you could create a computed column and create an index on it, as sketched below.
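A minimal sketch of that idea, assuming the VolumeTable used above (the computed-column and index names are invented for illustration):
ALTER TABLE dbo.VolumeTable
    ADD AbsVolume AS ABS(Volume) PERSISTED;      -- persisted computed column holding the sort value

CREATE INDEX IX_VolumeTable_AbsVolume
    ON dbo.VolumeTable (AbsVolume DESC)
    INCLUDE (ID, Volume);                        -- covers the running-total query above
With this in place, ordering by AbsVolume DESC can be satisfied from the index instead of sorting the computed values at query time.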
Take a peek at the window functions offered in SQL Server.
For example
Declare @YourTable table (ID int, Volume int)
Insert Into @YourTable values
(100,1306489),
(125,898426),
(150,907404)
Select ID
,Volume
,RunningTotal = sum(Volume) over (Order by Volume Desc)
From @YourTable
Order By Volume Desc
Returns
ID Volume RunningTotal
100 1306489 1306489
150 907404 2213893
125 898426 3112319
To be clear, @YourTable is for demonstration purposes only. There should be no need to insert your actual data into a table variable.
EDIT to support 2008 (the good news is that Row_Number() is supported in 2008):
Select ID
,Volume
,RowNr=Row_Number() over (Order by Volume Desc)
Into #Temp
From @YourTable
Select A.ID
,A.Volume
,RunningTotal = sum(B.Volume)
From #Temp A
Join #Temp B on (B.RowNr<=A.RowNr)
Group By A.ID,A.Volume
Order By A.Volume Desc

Efficiently extract row range from SELECT with ORDER BY statement

I know that you can use ROW_NUMBER() to get the row number and then apply a WHERE to the results, as shown here:
USE AdventureWorks2012;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
However, this will first sort all the data and will only then extract the row range.
Since one can find the kth element of an unsorted array in O(n) (see "How to find the kth largest element in an unsorted array of length n in O(n)?"), I would hope something similar would also be possible in SQL.
The sort can be avoided with an index on OrderDate:
CREATE INDEX IX_SalesOrderHeader_OrderDate ON Sales.SalesOrderHeader(OrderDate);
An ordered scan of this index will be performed until the specified upper ROW_NUMBER() limit is reached, limiting the scan to that number of rows. ROW_NUMBER() values lower than the specified range are then discarded from the results.
As with any set-based pagination technique in SQL where a useful ordering index exists, performance will largely depend on the number of rows that need to be skipped and returned.
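As a non-authoritative sketch of the same idea made explicit (this is not part of the original answer), adding TOP to the CTE tells the optimizer directly that only the first 60 rows by OrderDate are ever needed:
WITH OrderedOrders AS
(
    SELECT TOP (60) SalesOrderID, OrderDate,
           ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
    FROM Sales.SalesOrderHeader
    ORDER BY OrderDate      -- ORDER BY is allowed in the CTE here because TOP is present
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;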

Need help with a SELECT query in SQL Server

I have this query :
SELECT *
FROM
(SELECT
*,
ROW_NUMBER() OVER (ORDER BY sort_by) as row
FROM table_name) a
WHERE
row > start_row AND row <= limit_row
This query will select everything from table_name, starting after start_row and up to limit_row, and the result will be arranged by the sort_by column.
But I also need to add the condition WHERE column_name = column_value. And the data arranged by the sort_by column can be in either ascending or descending order.
My question is where should I add the condition column_name = column_value, and the ORDER ASC/DESC in my query?
If my question isn't clear, please ask. Thanks.
SELECT *
FROM
(
SELECT *,
ROW_NUMBER() OVER (ORDER BY sort_by DESC) as row
FROM table_name
WHERE column_name = column_value
) a
WHERE row > start_row
AND row <= limit_row
ORDER BY a.row DESC
The row_number function uses the order to determine the order of the data for numbering purposes; this means the row number order is important to know and understand, especially if you are paging data. Typically, when paging data, you want your ordering so that row 1 is the newest record because you want your first page of data to be the most recent; this generally means the order by on the row number would be descending.
The outer order by only changes the order returned back to you and is really acting only as a display ordering. So, typically, that order by would be ascending when paging data as you are already ordering from newest to oldest.
Also, if you are using a newer version of SQL Server (2012 or later), there is a paging feature (OFFSET/FETCH) that performs much better (in my experience) than the row-numbering paging used in the past; a sketch follows.
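A minimal sketch of that OFFSET/FETCH paging, using the placeholder names from the question (@start_row and @page_size are assumed parameters):
SELECT *
FROM table_name
WHERE column_name = column_value
ORDER BY sort_by DESC
OFFSET @start_row ROWS
FETCH NEXT @page_size ROWS ONLY;   -- @page_size corresponds to limit_row - start_row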
