Retrieving X rows from an ordered CTE, TOP vs Range - sql-server

Objective:
I want to know which approach performs better when retrieving a fixed number of rows from a CTE that is already ordered.
Example:
Say I have a CTE (intentionally simplified) that looks like this, and I only want the top 5 rows:
WITH cte
AS (
SELECT Id = RANK() OVER (ORDER BY t.ActionID asc)
, t.Name
FROM tblSample AS t -- tblSample is indexed on Id
)
Which is faster:
SELECT TOP 5 * FROM cte
OR
SELECT * FROM cte WHERE Id BETWEEN 1 AND 5 ?
Notes:
I am not a DB programmer, so to me the TOP solution seems better:
once SQL Server finds the 5th row, it will stop executing and "return" (100%
assumption), while with the other method I feel it will unnecessarily
process the whole CTE.
Also, would the answer to this question for a CTE be the same if it were a table?

The most important thing to note is that the two queries will not always produce the same result set. Consider the following data:
CREATE TABLE #tblSample (ActionId int not null, name varchar(10) not null);
INSERT #tblSample VALUES (1,'aaa'),(2,'bbb'),(3,'ccc');
Both of these will produce the same result:
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT TOP(2) * FROM CTE;
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT * FROM CTE WHERE id BETWEEN 1 AND 2;
Now let's do this update:
UPDATE #tblSample SET ActionId = 1;
After this update every row ties at rank 1, so the first query still returns two rows while the second query returns three. Keep in mind too that, without an ORDER BY in the TOP query, the results are not guaranteed, because there is no default order in SQL.
With that out of the way, which performs better? It depends on your indexing, your statistics, the number of rows, and the execution plan that the SQL engine chooses.
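The difference in result sets is easy to reproduce outside SQL Server. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in engine, LIMIT plays the role of TOP, and a regular table named tblSample mirrors #tblSample. It says nothing about SQL Server's execution plans; it only shows the row counts before and after the update.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption);
# LIMIT is used where the T-SQL version would use TOP.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblSample (ActionId INTEGER NOT NULL, name TEXT NOT NULL)")
conn.executemany("INSERT INTO tblSample VALUES (?, ?)",
                 [(1, "aaa"), (2, "bbb"), (3, "ccc")])

CTE = """
WITH cte AS (
    SELECT RANK() OVER (ORDER BY ActionId) AS id, name
    FROM tblSample
)
"""

def row_counts(conn):
    """Return (row count of the LIMIT query, row count of the BETWEEN query)."""
    top = conn.execute(CTE + "SELECT * FROM cte ORDER BY id LIMIT 2").fetchall()
    rng = conn.execute(CTE + "SELECT * FROM cte WHERE id BETWEEN 1 AND 2").fetchall()
    return len(top), len(rng)

before = row_counts(conn)   # distinct ActionId values give ranks 1, 2, 3
conn.execute("UPDATE tblSample SET ActionId = 1")
after = row_counts(conn)    # all rows tie, so every row gets rank 1

print(before, after)
```

Before the update both queries return two rows; afterwards the LIMIT query still returns two, while the BETWEEN query returns all three tied rows.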

TOP 5 returns whichever 5 rows come first according to the index used to read the table, whereas Id BETWEEN 1 AND 5 fetches rows based on the Id column, by index seek or scan depending on the selected columns. These are two different queries. The 'Id between' query might be slow if you do not have an index on Id.
Let me try to explain with an example.
Consider this to be your data:
create index nci_name on yourcte(id) include(name)
--drop index nci_name on yourcte
;with cte as (
select * from yourcte )
select top 5 * from cte
;with cte as (
select * from yourcte )
select * from cte where id between 1 and 5
First I create an index on id with name included. Now the second query does an index seek, while the first one does an index scan and selects the top 5, so in this case the second approach is better.
See the execution plan:
Now I remove the index by executing:
--drop index nci_name on yourcte
Now both approaches do a table scan.
Notice that, of the two table scans, the first reads only 5 rows, while the second reads 10 rows and applies the predicate.
See the execution plan properties for the first plan:
For the second approach it reads 10 rows.
Now the first approach is better.
In your case, this index needs to be on ActionId, which determines the id.
Hence performance depends on how you index your base table.

In order to get the RANK() which you are calculating in your cte it must sort all the data by t.ActionID. Sorting is a blocking operation: the entire input must be processed before a single row is output.
So in this case, whether you take any five rows or the five that sorted to the top of the pile is probably irrelevant to performance.

Related

How to filter or split a CTE so that 2 rows are not added with the same value in a specific column

The title sounds convoluted because my problem kind of is.
I have a CTE that pulls in some values (LineId, OrderNumber, OrderLine, Type, BuildUsed).
Later on I have a SELECT that populates a view and joins on the CTE with something like this:
left join CTE C on C.LineId = (select top 1 lineId from CTE C2 where C2.orderNumber = orderNumber and C2.orderLine = orderLine order by LineId)
An example of my data would look like
LineId = 10, Order : OIP001, Line = 1, Type = Active, BuildUsed = XE9
LineId = 80, Order : OIP001, Line = 1, Type = Inactive, BuildUsed = XB2
The CTE does a SELECT, UNION, SELECT. The first SELECT gets all the active entries and the second gets all the inactive entries.
Any given order could have both active and inactive entries, or just one of them.
The issue I am having is that my runtime is bad. It runs in close to 20 seconds when it should take 4 or 5. The join I listed above has to search and sort every time, and it's a huge time sink.
So I wondered whether there was a way to break the CTE into 2 steps:
Insert all the active orders (these are the ones I would want to pick if they are available).
Insert all the inactive orders (if that ordernumber and orderline do not already exist in the first step).
That way I don't have to sort for every single join and can just do a normal join that's significantly faster.
If it helps at all the LineId is based on a rownumber() in the CTE that looks like
ROW_NUMBER() OVER(ORDER BY Type desc, DescriptionStatus asc) as LineId
So the LineId is already ordered correctly.
Is there any way to split the CTE so that the second part of the select can check whether the ordernumber and orderline already exist in the first part?
To be specific: I would like to find any Active entries for the ordernumber and orderline first and, only if none are found, try the Inactive entries.
WHAT I HAVE TRIED SO FAR:
I tried adding the query for the second part into the first part as a WHERE clause, so it would only add rows that don't exist in the first part. But the query time got so insane that I stopped running it and scrapped that idea.
I believe you're just looking for a WHERE NOT EXISTS that uses a correlated sub-query to eliminate rows from your second result set that you've already retrieved in your first result set.
WHERE NOT EXISTS is generally pretty performant, but test the CTE by itself to be sure it meets your needs.
Something similar to this:
WITH cte
AS
(
SELECT
act.LineID,
act.OrderNumber,
act.OrderLine,
act.Type,
act.BuildUsed
FROM
ActiveSource AS act
UNION ALL
SELECT
inact.LineID
,inact.OrderNumber
,inact.OrderLine
,inact.Type
,inact.BuildUsed
FROM
InactiveSource AS inact
WHERE
NOT EXISTS
(
SELECT
1
FROM
ActiveSource AS a
WHERE
a.OrderNumber = inact.OrderNumber
AND a.OrderLine = inact.OrderLine
)
)
SELECT * FROM cte;
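To check that the NOT EXISTS pattern really gives "active first, inactive only as a fallback", here is a small sketch. It assumes Python's sqlite3 as a stand-in engine and uses the ActiveSource and InactiveSource names from the query above. The sample rows for order OIP001 come from the question; the OIP002 row (and its XC4 build) is invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for SQL Server (assumption)
conn.executescript("""
CREATE TABLE ActiveSource   (LineId INTEGER, OrderNumber TEXT, OrderLine INTEGER,
                             Type TEXT, BuildUsed TEXT);
CREATE TABLE InactiveSource (LineId INTEGER, OrderNumber TEXT, OrderLine INTEGER,
                             Type TEXT, BuildUsed TEXT);
INSERT INTO ActiveSource   VALUES (10, 'OIP001', 1, 'Active',   'XE9');
-- OIP002 is a made-up order that has only an inactive entry
INSERT INTO InactiveSource VALUES (80, 'OIP001', 1, 'Inactive', 'XB2'),
                                  (90, 'OIP002', 1, 'Inactive', 'XC4');
""")

rows = conn.execute("""
WITH cte AS (
    SELECT LineId, OrderNumber, OrderLine, Type, BuildUsed
    FROM ActiveSource
    UNION ALL
    SELECT i.LineId, i.OrderNumber, i.OrderLine, i.Type, i.BuildUsed
    FROM InactiveSource AS i
    WHERE NOT EXISTS (              -- skip inactive rows that have an active twin
        SELECT 1
        FROM ActiveSource AS a
        WHERE a.OrderNumber = i.OrderNumber
          AND a.OrderLine   = i.OrderLine
    )
)
SELECT LineId, OrderNumber, Type FROM cte ORDER BY LineId
""").fetchall()

print(rows)
```

Each (OrderNumber, OrderLine) pair now appears at most once per source, so the later join no longer needs a TOP 1 ... ORDER BY sub-query to pick a row, provided each source itself holds no duplicates for a pair.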

Why SQL Server 2014 uses an index scan operation for a query with a window function in a subquery?

There is a table containing about 5 million rows with an indexed column (for example, [X]). When I try to get a value from the [X] column of a specified row along with a value from the previous row in accordance with the order specified by the index, I get an inefficient actual execution plan with fat pipes.
Here is a simplified example.
CREATE TABLE #tbl (Id INT IDENTITY PRIMARY KEY, Val INT);
INSERT INTO #tbl
(Val)
SELECT TOP 1000000
a.object_id
FROM
sys.all_objects AS a
CROSS JOIN
sys.all_objects AS b
SELECT
Id, Val, PrevId
FROM
(SELECT *, PrevId = LAG(Id) OVER(ORDER BY Id) FROM #tbl) AS t
WHERE
ID = 42069;
DROP TABLE #tbl;
Is there a better solution?
Any help is greatly appreciated.
Try this:
CREATE TABLE #tbl (Id INT IDENTITY PRIMARY KEY, Val INT);
INSERT INTO #tbl
(Val)
SELECT TOP 1000000
a.object_id
FROM
sys.all_objects AS a
CROSS JOIN
sys.all_objects AS b
SELECT t.*, rez.PrevId
FROM #tbl t
OUTER APPLY (SELECT TOP 1 ti.Id as PrevId
FROM #tbl ti
WHERE ti.id < t.id
ORDER BY ti.id desc) rez
WHERE t.Id = 42069
DROP TABLE #tbl;
Since you intend to return a single row (because you know the Id field is a primary key), why not tell the sub-query to return just one row?
New query plan (with slim pipes):
Giving this additional information helps the optimizer pick a better plan (it does two Clustered Index Seeks instead of a full Clustered Index Scan).
I rewrote the query so that it uses the index as much as possible. Even though the two queries are semantically identical, not using LAG gives the optimizer a better chance of producing an efficient execution plan.
If you definitely want to use LAG, then I think the most efficient query you can write (which is slightly less efficient than the version above) is:
SELECT TOP 1 *
, PrevId = LAG(Id) OVER(ORDER BY Id)
FROM #tbl t
WHERE t.Id <= 42069
ORDER BY t.Id desc
Which gives you this execution plan:
The performance stats of this query can be found in the screenshot below (green is best overall, orange is best when using LAG):
Starting from your query (which uses LAG), I tried the various options shown above. All of them gave better results, but the OUTER APPLY version gives the best performance.
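The claim that the three formulations return the same row can be checked with a small sketch. It assumes Python's sqlite3 as a stand-in engine, so it shows equivalence of results only, not SQL Server plans or timings. SQLite has no OUTER APPLY, so a correlated scalar sub-query with LIMIT 1 is substituted for it. The table name tbl, the target Id of 420 and the deliberate gap at Id 419 are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for SQL Server (assumption)
conn.execute("CREATE TABLE tbl (Id INTEGER PRIMARY KEY, Val INTEGER)")
# Leave a gap at Id 419 so that "previous row" is not simply Id - 1.
conn.executemany("INSERT INTO tbl (Id, Val) VALUES (?, ?)",
                 [(i, i * 10) for i in range(1, 1001) if i != 419])
target = 420

# 1. Original shape: LAG over the whole table, then filter.
lag_all = """
SELECT Id, Val, PrevId
FROM (SELECT Id, Val, LAG(Id) OVER (ORDER BY Id) AS PrevId FROM tbl) AS t
WHERE Id = ?
"""

# 2. Seek-based shape: look up the single previous row directly
#    (correlated sub-query used in place of OUTER APPLY ... TOP 1).
seek = """
SELECT t.Id, t.Val,
       (SELECT ti.Id FROM tbl AS ti
        WHERE ti.Id < t.Id ORDER BY ti.Id DESC LIMIT 1) AS PrevId
FROM tbl AS t
WHERE t.Id = ?
"""

# 3. LAG restricted to rows up to the target, keeping only the last one.
lag_limited = """
SELECT Id, Val, LAG(Id) OVER (ORDER BY Id) AS PrevId
FROM tbl
WHERE Id <= ?
ORDER BY Id DESC
LIMIT 1
"""

results = [conn.execute(q, (target,)).fetchone() for q in (lag_all, seek, lag_limited)]
print(results)
```

All three return the same row; the difference discussed above is only in how much of the table each one has to read to get there.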

SQL Server - Delete Duplicate Rows - how does Partition By affect this query?

I've been using the following inherited query where I'm trying to delete duplicate rows and I'm getting some unexpected results when first running it as a SELECT - I believe it has something to do with my lack of understanding of the Partition part of the statement:
WITH CTE AS(
SELECT [Id],
[Url],
[Identifier],
[Name],
[Entity],
[DOB],
RN = ROW_NUMBER()OVER(PARTITION BY Name ORDER BY Name)
FROM Data.Statistics
where Id = 2170
)
DELETE FROM CTE WHERE RN > 1
Can someone help me understand exactly what I'm doing with the Partition BY Name part of this? This doesn't limit the query in any way to only looking for duplicates in the Name field, correct? I need to ensure that it's looking for records where all 5 of the fields inside the CTE definition are the same for a record to be considered a duplicate.
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) doesn't make a lot of sense. You wouldn't ORDER BY the same thing you used in PARTITION BY since it will be the same value for everything in the partition, making the ORDER BY part useless.
Basically the CTE part of this query is saying to split the matching rows (those with [Id] = 2170) temporarily into groups for each distinct name, and within each group of rows with the same name, order those by name (which are obviously all the same value) and then return the row number within that sequence group as RN. Unique names will all have a row number of 1, because there is only one row with that name. Duplicate names will have row numbers 1, 2, 3, and so on. The order of those rows is undefined in this case because of the silly ORDER BY clause, but if you changed the ORDER BY to something meaningful, the row numbers would follow that sequence.
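To see how the PARTITION BY list decides what counts as a duplicate, here is a small sketch. It assumes Python's sqlite3 as a stand-in engine, a hypothetical table named Stats in place of Data.Statistics, and invented sample rows. It only counts the rows that would be flagged with RN > 1 under each partitioning; it does not delete anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for SQL Server (assumption)
conn.execute("""CREATE TABLE Stats (Id INTEGER, Url TEXT, Identifier TEXT,
                                    Name TEXT, Entity TEXT, DOB TEXT)""")
conn.executemany("INSERT INTO Stats VALUES (?, ?, ?, ?, ?, ?)", [
    (2170, "u1", "a", "Smith", "E1", "1990-01-01"),
    (2170, "u1", "a", "Smith", "E1", "1990-01-01"),  # exact duplicate of the row above
    (2170, "u2", "b", "Smith", "E2", "1985-05-05"),  # same Name only, other fields differ
    (2170, "u3", "c", "Jones", "E3", "1970-07-07"),
])

def flagged(partition_cols):
    """Count rows that would get RN > 1 for the given PARTITION BY column list."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT ROW_NUMBER() OVER (PARTITION BY {partition_cols} ORDER BY Id) AS rn
            FROM Stats
            WHERE Id = 2170
        ) AS numbered
        WHERE rn > 1
    """
    return conn.execute(sql).fetchone()[0]

by_name = flagged("Name")                                   # groups on Name alone
by_all = flagged("Id, Url, Identifier, Name, Entity, DOB")  # groups on all fields

print(by_name, by_all)
```

Partitioning by Name alone flags two of the three Smith rows, including one that differs in every other field. Listing all of the columns in PARTITION BY flags only the exact duplicate.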

Using a running total calculated column in SQL Server table variable

I have inherited a stored procedure that utilizes a table variable to store data, then updates each row with a running total calculation. The order of the records in the table variable is very important, as we want the volume to be ordered highest to lowest (i.e. the running total will get increasingly larger as you go down the table).
My problem is that, during the step where the table variable is updated, the running total does seem to be calculated, but not in the order in which the table variable was previously sorted (descending by volume).
DECLARE @TableVariable TABLE ([ID], [Volume], [SortValue], [RunningTotal])
--Populate table variable and order by the sort value...
INSERT INTO @TableVariable (ID, Volume, SortValue)
SELECT
[ID], [Volume], ABS([Volume]) as SortValue
FROM
dbo.VolumeTable
ORDER BY
SortValue DESC
--Set TotalVolume variable...
SELECT @TotalVolume = ABS(sum([Volume]))
FROM @TableVariable
--Calculate running total, update rows in table variable...I believe this is where problem occurs?
SET @RunningTotal = 0
UPDATE @TableVariable
SET @RunningTotal = RunningTotal = @RunningTotal + [Volume]
FROM @TableVariable
--Output...
SELECT
ID, Volume, SortValue, RunningTotal
FROM
@TableVariable
ORDER BY
SortValue DESC
The result is that the record with the highest volume, which I would have expected the running total to be calculated on first (thus running total = [Volume]), somehow ends up much further down the list. The running total seems to be calculated in random order.
Here is what I would expect to get:
But here is what the code actually generates:
I'm not sure whether there is a way to get the UPDATE statement to be applied to the table variable in volume-descending order. From what I've read so far, it could be an issue with the sorting behavior of a table variable, but I'm not sure how to correct it. Can anyone help?
GarethD provided the definitive link to the multiple ways of calculating running totals and their performance. The correct one is both the simplest and the fastest, 300 times faster than the quirky update. That's because it can take advantage of any indexes that cover the sort column, and because it's a lot simpler.
I repeat it here to make clear how much simpler this is when the database provides the appropriate windowing functions:
SELECT
[Date],
TicketCount,
SUM(TicketCount) OVER (ORDER BY [Date] RANGE UNBOUNDED PRECEDING)
FROM dbo.SpeedingTickets
ORDER BY [Date];
The SUM line means: sum the ticket counts over all (UNBOUNDED) the rows that come before (PRECEDING) the current one, up to and including the current row, when they are ordered by date.
That ends up being 300 times faster than the quirky update.
The equivalent query for VolumeTable would be:
SELECT
ID,
Volume,
ABS(Volume) as SortValue,
SUM(Volume) OVER (ORDER BY ABS(Volume) DESC RANGE UNBOUNDED PRECEDING)
FROM
VolumeTable
ORDER BY ABS(Volume) DESC
Note that this will be a lot faster if there is an index on the sort column (Volume) and ABS isn't used. Applying a function to a column generally means that the optimizer can't use indexes that cover it, because the actual sort value differs from the one stored in the index.
If the table is very large and performance suffers, you could create a computed column and create an index on it.
Take a peek at the Window functions offered in SQL
For example
Declare @YourTable table (ID int,Volume int)
Insert Into @YourTable values
(100,1306489),
(125,898426),
(150,907404)
Select ID
,Volume
,RunningTotal = sum(Volume) over (Order by Volume Desc)
From @YourTable
Order By Volume Desc
Returns
ID Volume RunningTotal
100 1306489 1306489
150 907404 2213893
125 898426 3112319
To be clear, @YourTable is for demonstration purposes only. There should be no need to INSERT your actual data into a table variable.
EDIT to Support 2008 (Good news is Row_Number() is supported in 2008)
Select ID
,Volume
,RowNr=Row_Number() over (Order by Volume Desc)
Into #Temp
From @YourTable
Select A.ID
,A.Volume
,RunningTotal = sum(B.Volume)
From #Temp A
Join #Temp B on (B.RowNr<=A.RowNr)
Group By A.ID,A.Volume
Order By A.Volume Desc
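The two approaches above, the windowed SUM and the ROW_NUMBER self-join for 2008, should agree on the sample data. The sketch below checks that, assuming Python's sqlite3 as a stand-in engine, a regular table named YourTable, and a CTE in place of the #Temp table. It compares results only and makes no claim about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for SQL Server (assumption)
conn.execute("CREATE TABLE YourTable (ID INTEGER, Volume INTEGER)")
conn.executemany("INSERT INTO YourTable VALUES (?, ?)",
                 [(100, 1306489), (125, 898426), (150, 907404)])

# Window-function version: running total in descending Volume order.
windowed = conn.execute("""
SELECT ID, Volume,
       SUM(Volume) OVER (ORDER BY Volume DESC) AS RunningTotal
FROM YourTable
ORDER BY Volume DESC
""").fetchall()

# Self-join version: number the rows, then sum every row at or before each one.
self_join = conn.execute("""
WITH numbered AS (
    SELECT ID, Volume,
           ROW_NUMBER() OVER (ORDER BY Volume DESC) AS RowNr
    FROM YourTable
)
SELECT A.ID, A.Volume, SUM(B.Volume) AS RunningTotal
FROM numbered AS A
JOIN numbered AS B ON B.RowNr <= A.RowNr
GROUP BY A.ID, A.Volume
ORDER BY A.Volume DESC
""").fetchall()

print(windowed)
print(self_join)
```

Both queries produce the same three rows as the "Returns" table above. The self-join compares every row with every earlier row, so its cost grows roughly quadratically with the row count, which is why the window function is preferable wherever it is available.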

How is ROW_NUMBER used with insertions?

I have multiple UNIONed statements in MS SQL Server, and it is very hard to find a unique column in the result.
I need a unique value for each row, so I've used the ROW_NUMBER() function.
This result set is being copied to another place (actually a SOLR index).
The next time I run the same query, I need to pick up only the newly added rows.
So I need to confirm that the newly added rows will be numbered after the last ROW_NUMBER value from the previous run.
In other words, does ROW_NUMBER number the results in insertion order, assuming I don't add any ORDER BY clause?
If not (as I suspect), are there any alternatives?
Thanks.
Without seeing the SQL I can only give the general answer: MS SQL does not guarantee the order of SELECT statements without an ORDER BY clause, which means the row_number may not follow insertion order.
I guess you can do something like this:
;WITH
cte
AS
(
SELECT * , rn = ROW_NUMBER() OVER (ORDER BY SomeColumn)
FROM
(
/* Your Union Queries here*/
)q
)
INSERT INTO Destination_Table
SELECT * FROM
CTE LEFT JOIN Destination_Table
ON CTE.Refrencing_Column = Destination_Table.Refrencing_Column
WHERE Destination_Table.Refrencing_Column IS NULL
I would suggest you consider 'timestamping' the row with the time it was inserted. Or adding an identity column to the table.
But what it sounds like you want to do is get current max id and then add the row_number to it.
Select col1, col2, mid + row_number() over(order by smt) id
From (
Select col1, col2, (select max(id) from tbl) mid
From query
) t
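Here is a small sketch of the "current max id plus row_number" idea, assuming Python's sqlite3 as a stand-in engine. The table names tbl (rows already numbered) and incoming (new rows from the union query), and the ordering column col1, are hypothetical. COALESCE is added so that numbering starts at 1 when tbl is still empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for SQL Server (assumption)
conn.executescript("""
CREATE TABLE tbl (id INTEGER, col1 TEXT);   -- rows numbered by an earlier run
INSERT INTO tbl VALUES (1, 'a'), (2, 'b'), (3, 'c');
CREATE TABLE incoming (col1 TEXT);          -- hypothetical newly added rows
INSERT INTO incoming VALUES ('d'), ('e');
""")

new_rows = conn.execute("""
SELECT col1,
       mid + ROW_NUMBER() OVER (ORDER BY col1) AS id   -- continue after the old maximum
FROM (
    SELECT col1,
           (SELECT COALESCE(MAX(id), 0) FROM tbl) AS mid
    FROM incoming
) AS t
ORDER BY id
""").fetchall()

print(new_rows)
```

The new rows continue from the previous maximum, so ids handed out in earlier runs stay stable. This still relies on an explicit ORDER BY inside ROW_NUMBER and on something, such as the LEFT JOIN ... IS NULL filter shown earlier, to decide which rows are actually new.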
