Efficient limit result set in SQL window function - sql-server

My question would be better served as a comment on Limit result set in sql window function, but I don't have the necessary reputation to comment.
Given a table of moving vehicle locations, for each vehicle I wish to find the most recent recorded position (and other data about the vehicle at that time). Based on answers in the other question, I can run a query like:
Table definition:
CREATE TABLE VehiclePositions
(
Id BIGINT NOT NULL,
VehicleID NVARCHAR(12) NULL,
Timestamp DATETIME NULL,
PositionX FLOAT NULL,
PositionY FLOAT NULL,
PositionZ SMALLINT NULL,
Speed SMALLINT NULL,
Heading SMALLINT NULL
)
Query:
select *
from
(select
*,
row_number() over (partition by VehicleID order by Timestamp desc) as ranking
from VehiclePositions) as x
where
ranking = 1
Now, the problem is that this does a full table scan. I thought that by creating an appropriate index, I could avoid this:
CREATE INDEX idx_VehicPosition ON VehiclePositions(VehicleID, Timestamp);
However, SQL Server will happily ignore this index in the query and still perform the table scan.
Note: I can get SQL Server to use the index, but the code is rather ugly:
DECLARE @ids TABLE (id NVARCHAR(12) UNIQUE)
INSERT INTO @ids
SELECT DISTINCT VehicleID
FROM VehiclePositions

SELECT vp.*
FROM VehiclePositions vp
WHERE Timestamp = (SELECT MAX(Timestamp) FROM VehiclePositions vp2
                   WHERE vp2.VehicleID = vp.VehicleID)
AND VehicleID IN (SELECT DISTINCT id FROM @ids)
(The VehicleID IN... clause is there because SQL Server doesn't seem to implement skip-scan optimisations. It still comes up with a fairly non-optimal query plan that visits the index twice, but at least it doesn't execute in linear time.)
Is there a way to make SQL Server run the window function query intelligently?
I'm using SQL Server 2014...
Help will be appreciated

What I would do:
SELECT *
FROM
    (SELECT MAX(Timestamp) AS maxtime, VehicleID
     FROM VehiclePositions
     GROUP BY VehicleID) AS maxed
    INNER JOIN
    (SELECT Id, VehicleID, Timestamp, PositionX, PositionY,
            PositionZ, Speed, Heading
     FROM VehiclePositions) AS vals
        ON maxed.maxtime = vals.Timestamp
        AND maxed.VehicleID = vals.VehicleID
To my knowledge, you can't get around your index getting scanned twice.
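A single-statement variation on the asker's workaround is CROSS APPLY over the distinct vehicle list. This is a sketch, assuming the idx_VehicPosition (VehicleID, Timestamp) index from the question is in place: deriving the distinct list still scans that index once, but each vehicle's latest row is then fetched with a single seek:

-- One seek per vehicle for the latest row, instead of ranking every row.
SELECT ca.*
FROM (SELECT DISTINCT VehicleID FROM VehiclePositions) AS v
CROSS APPLY (SELECT TOP (1) *
             FROM VehiclePositions vp
             WHERE vp.VehicleID = v.VehicleID
             ORDER BY vp.Timestamp DESC) AS ca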

As long as you are selecting all vehicles from the table and selecting all columns (or at least columns that are not in your index), I would expect the table scan to keep popping up.
In many cases, that will actually be the most efficient query plan. Only if you have many rows per vehicle (several pages' worth) might a seek strategy be faster.
If you do have a lot of rows per vehicle, you might consider partitioning your table on Timestamp...
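For illustration, partitioning on Timestamp would start with a partition function and scheme along these lines; the monthly boundary values and object names here are placeholders, not taken from the question:

-- Placeholder monthly boundaries; pick ranges that match your data.
CREATE PARTITION FUNCTION pfPositionsByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('2019-01-01', '2019-02-01', '2019-03-01');

CREATE PARTITION SCHEME psPositionsByMonth
AS PARTITION pfPositionsByMonth ALL TO ([PRIMARY]);

-- The table or its indexes are then created/rebuilt ON psPositionsByMonth(Timestamp).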

You can filter the results of a window function directly using QUALIFY, as follows:
select *
from VehiclePositions
qualify row_number() over (partition by VehicleID order by Timestamp desc) = 1
(Note that QUALIFY is supported in databases such as Teradata, Snowflake, and Databricks, but not in SQL Server, where the derived-table form from the question is still required.)

Related

Select latest row on duplicate values while transferring table?

I have a live logging table that saves my values frequently.
My plan is to take those values and put them into a temporary table with
SELECT * INTO #temp from Block
From there, I guess, my Block table is empty and the logger can keep on logging new values.
The next step is that I want to save them in an existing table. I wanted to use
INSERT INTO TABLENAME (COLUMN1, COLUMN2, ...) SELECT COLUMN1, COLUMN2, ... FROM #temp
The problem is that the #temp table has duplicate primary keys, and I only want to store the last ID.
I have tried DISTINCT, but it didn't work, and I could not get ROW_COUNT to work either. Are there any ideas on how I should do it? I wish to make it with as few reads as possible.
Also, in the future I plan to send them to another database; how do I do that on SQL Server? I guess it's something like FROM [table] in [database]?
I couldn't get the blocks to copy. But here goes:
create TABLE Product_log (
Grade char(64),
block_ID char(64) PRIMARY KEY NOT NULL,
Density char(64),
BatchNumber char(64) NOT NULL,
BlockDateID Datetime
);
That is the table I want to store the data in, and I do not wish to have duplicate IDs there. The problem is that, while logging, I get duplicates, since I log on change. Let's say the batch ID is 1 and it becomes 2 while logging: I will get the same block ID twice, once with batch number 1 and once with 2. How do I pick the latter?
Hope I explained enough for guidance. While logging, the rows look like this (every column name carries the SiemensTiaV15_s71200_ prefix, abbreviated here for readability):

id | BatchTester_NewBatchIDValue_VALUE | BatchTester_TestWriteValue_VALUE | BatchTester_TestWriteValue_TIMESTAMP | MainTank_Density_VALUE | MainTank_Grade_VALUE
1  | 00545                             | S0047782                         | 2020-06-09 11:18:44.583              | 0                      | xxxxx
2  | 00545                             | S0047783                         | 2020-06-09 11:18:45.800              | 0                      | xxxxx
Please use the query below:
select *
from
(select id,
        SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE,
        SiemensTiaV15_s71200_BatchTester_TestWriteValue_VALUE,
        SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP,
        SiemensTiaV15_s71200_MainTank_Density_VALUE,
        SiemensTiaV15_s71200_MainTank_Grade_VALUE,
        row_number() over (partition by SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE
                           order by SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP desc) as rnk
 from table_name) qry
where rnk = 1;
SELECT * INTO #temp FROM Block;

INSERT INTO Product_log (Grade, block_ID, Density, BatchNumber, BlockDateID)
-- Column names abbreviated as above; in #temp they carry the full SiemensTiaV15_s71200_ prefix.
select Grade_VALUE, TestWriteValue_VALUE, Density_VALUE,
       NewBatchIDValue_VALUE, TestWriteValue_TIMESTAMP
from
(select NewBatchIDValue_VALUE, TestWriteValue_VALUE,
        TestWriteValue_TIMESTAMP, Density_VALUE, Grade_VALUE,
        row_number() over
        (partition by NewBatchIDValue_VALUE order by
         TestWriteValue_TIMESTAMP desc) as rnk
 from #temp) qry
where rnk = 1;
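Regarding the follow-up about sending the rows to another database: on the same instance you can use three-part names; across servers you would need a linked server and four-part names (Server.Database.dbo.Table). A sketch with a placeholder database name:

-- OtherDatabase is a placeholder; the target table must already exist there.
INSERT INTO OtherDatabase.dbo.Product_log (Grade, block_ID, Density, BatchNumber, BlockDateID)
SELECT Grade, block_ID, Density, BatchNumber, BlockDateID
FROM dbo.Product_log;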

Retrieve data from query

I have a table with 6 columns, where one column is Id (bigint, primary key) and one is CreatedDate (datetime), and it has more than one million rows.
Retrieving data from this table using the below query takes more than 1 minute:
select * from MyTable where CreatedDate between '2019-05-01' and '2019-05-30'
I also tried the query below, but it likewise takes more than 1 minute:
declare @minId bigint, @maxId bigint
select @minId = min(Id) from MyTable where CreatedDate >= '2019-05-01'
select @maxId = max(id) from MyTable where CreatedDate <= '2019-05-30'
select @minId, @maxId
select * from MyTable where Id between @minId and @maxId
It has only one index (Id - primary key), and I assume adding index to CreatedDate may affect insert/update operations.
I want to join this result to another table to get some report data to display in a grid, but when executing this query time out occurs.
How can I retrieve data quickly?
Creating an index on CreatedDate will help retrieval, though it will have some side effects on insert.
Avoid selecting all columns with the '*' wildcard, unless you intend to use them all. Selecting redundant columns may result in unnecessary performance degradation.
Try to create the following index:
CREATE NONCLUSTERED INDEX [IX_CreatedDate_ID]
ON dbo.YourTable
(CreatedDate, ID)
GO
Pay attention to the order of the index (CreatedDate, ID): CreatedDate is the first column in the index, and that is very important. When you use WHERE CreatedDate BETWEEN '2019-05-01' AND '2019-05-30', your query plan will then have an index seek.
And your query should look like this:
SELECT CreatedDate, ID
FROM MyTable
WHERE CreatedDate BETWEEN '2019-05-01' AND '2019-05-30'
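If you need more columns than CreatedDate and ID, you can keep the seek and still avoid key lookups by adding an INCLUDE list. A sketch; the included names below are placeholders, since the question doesn't list the table's other columns:

CREATE NONCLUSTERED INDEX [IX_CreatedDate_Covering]
ON dbo.MyTable (CreatedDate)
INCLUDE (Col1, Col2, Col3) -- placeholder column names
GO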
Try it using a stored procedure. A stored procedure is ready-to-go, precompiled code, which can take less time to fetch data.
Sample stored procedure code (a minimal sketch, assuming the MyTable schema and date range from the question):
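-- Procedure and parameter names are illustrative.
CREATE PROCEDURE dbo.GetRowsByCreatedDate
    @From DATETIME,
    @To DATETIME
AS
BEGIN
    SET NOCOUNT ON;
    -- Return only the columns you need rather than *
    SELECT CreatedDate, ID
    FROM dbo.MyTable
    WHERE CreatedDate BETWEEN @From AND @To;
END
GO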

TSQL - extract data to table/view to speed up query

I use this statement to create a list for Excel:
SELECT DISTINCT Year, Version
FROM myView
WHERE id <> 'old'
ORDER BY Year DESC, Version DESC
The problem is that the execution time is over 30s because of the almost 2 million rows.
The result has only around 1000 rows.
What are my options to extract only those two columns in order to speed up the execution time? I also need to make sure that inserts to the underlying table are recognized.
Do I need a new table to copy the values from the view? And a trigger to manage the updates?
Thank you
So, presumably there's a table with id, Year, and Version underlying your view. Given this (trivial) example:
CREATE TABLE myTable ([id] varchar(10), [Year] int, [Version] int);
Just create an index on that table that matches the way you're querying your data. Given your query of:
SELECT DISTINCT Year, Version
FROM myView
WHERE id <> 'old'
ORDER BY Year DESC, Version DESC
This index matches the WHERE and ORDER BY clauses of your query and should give you all the performance you need:
IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[myTable]') AND name = N'IX_YearVersion_Filtered')
DROP INDEX [IX_YearVersion_Filtered] ON [dbo].[myTable] WITH ( ONLINE = OFF )
GO
CREATE NONCLUSTERED INDEX [IX_YearVersion_Filtered] ON [dbo].[myTable]
(
[Year] DESC,
[Version] DESC
)
WHERE ([id]<>'old')
GO
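If changing indexes on the base table is not an option, another route that satisfies the "inserts must be recognized" requirement without triggers is an indexed view, which SQL Server maintains automatically. A sketch against the trivial table above (indexed views require SCHEMABINDING and a COUNT_BIG(*) alongside GROUP BY):

CREATE VIEW dbo.vYearVersion
WITH SCHEMABINDING
AS
SELECT [Year], [Version], COUNT_BIG(*) AS RowCnt -- required for indexed views with GROUP BY
FROM dbo.myTable
WHERE [id] <> 'old'
GROUP BY [Year], [Version]
GO
CREATE UNIQUE CLUSTERED INDEX IX_vYearVersion
ON dbo.vYearVersion ([Year], [Version])
GO

On editions other than Enterprise, query the view with WITH (NOEXPAND) to make sure the index is used.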
with cte_x as
(
    SELECT Year, Version
    FROM myView
    WHERE id not in ('old')
    GROUP BY Year, Version
)
SELECT DISTINCT Year, Version
FROM cte_x
ORDER BY Year DESC, Version DESC

SQL Server order by clause without using top etc

I have the following SQL view:
CREATE VIEW [dbo].[VW_ScanData]
AS
SELECT
top 10 ID,
Chip_ID,
[IPAddress] As FilterKey,
[DateTime]
FROM
TBL_ScanData WITH(NOLOCK)
ORDER BY ID DESC
GO
The idea is that this returns the 10 most recent entries. I have been told to use a filter key to check recent entries per IP address.
The problem is that, as it stands above, it will return the top 10 entries and then remove all the ones that don't match the filter key, which means in some cases it will not return anything.
I want it to return the 10 most recent entries of the given IP Address (Filter key).
I have tried removing 'top 10', but the view will then not accept the ORDER BY clause, meaning it will not necessarily give the most recent entries.
As said, I need to use a filter key to comply with the rest of the framework of the project.
I would recommend that you do not bake concerns like row limits, ordering, and lock hints into a view, as this will limit the usefulness / reusability of the view to different consumers. Instead, leave it up to the caller to decide on such concerns, which can be applied retrospectively when using the view.
If you remove the row limit from the view, filter and row limit can then be done from the caller:
SELECT TOP 10 *
FROM [dbo].[VW_ScanData]
WHERE FilterKey = 'FOO'
ORDER BY ID DESC;
That said, the view then doesn't really add any value beyond selecting from the table directly, other than the aliasing of IPAddress:
CREATE VIEW [dbo].[VW_ScanData]
AS
SELECT
ID,
Chip_ID,
[IPAddress] As FilterKey,
[DateTime]
FROM
TBL_ScanData
GO
Edit
Other options available to you are a stored procedure or a table-valued user-defined function. The latter will allow you to bake in all the concerns you require, and the filter key can be passed as a parameter to the function:
CREATE FUNCTION [dbo].[FN_ScanData](@FilterKey VARCHAR(50))
RETURNS @Result TABLE
(
ID INT,
Chip_ID INT,
FilterKey VARCHAR(50),
[DateTime] DATETIME
)
AS
BEGIN
INSERT INTO @Result
SELECT
top 10 ID,
Chip_ID,
[IPAddress] As FilterKey,
[DateTime]
FROM
TBL_ScanData WITH(NOLOCK) -- This will bite you!
WHERE
[IPAddress] = @FilterKey
ORDER BY ID DESC
RETURN
END
Which you can then call like so ('FOO' is your filter key):
SELECT *
FROM [dbo].[FN_ScanData]('FOO');
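If you can, prefer an inline table-valued function over the multi-statement one above: the optimizer expands an inline function into the calling query, which usually estimates and performs better. The same logic in inline form (the function name is illustrative):

CREATE FUNCTION [dbo].[FN_ScanDataInline](@FilterKey VARCHAR(50))
RETURNS TABLE
AS
RETURN
(
    SELECT TOP 10
        ID,
        Chip_ID,
        [IPAddress] AS FilterKey,
        [DateTime]
    FROM TBL_ScanData
    WHERE [IPAddress] = @FilterKey
    ORDER BY ID DESC
)
GO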
This select gets the last 10 entries per FilterKey:
select id, Chip_ID, FilterKey, [DateTime]
FROM (SELECT ID,
             Chip_ID,
             [IPAddress] As FilterKey, -- alias the table column; FilterKey only exists in the view
             [DateTime],
             ROW_NUMBER() OVER (Partition By [IPAddress] Order BY ID DESC) AS RN
      FROM TBL_ScanData WITH(NOLOCK)) AS x
WHERE RN <= 10

Optimal performing query for latest record for each N

Here is the scenario I find myself in.
I have a reasonably big table that I need to query the latest records from. Here is the create for the essential columns for the query:
CREATE TABLE [dbo].[ChannelValue](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[UpdateRecord] [bit] NOT NULL,
[VehicleID] [int] NOT NULL,
[UnitID] [int] NOT NULL,
[RecordInsert] [datetime] NOT NULL,
[TimeStamp] [datetime] NOT NULL
) ON [PRIMARY]
GO
The ID column is a Primary Key and there is a non-Clustered index on VehicleID and TimeStamp
CREATE NONCLUSTERED INDEX [IX_ChannelValue_TimeStamp_VehicleID] ON [dbo].[ChannelValue]
(
[TimeStamp] ASC,
[VehicleID] ASC
)ON [PRIMARY]
GO
The table I'm working on to optimise my query is a little over 23 million rows, and that is only a tenth of the size the query will need to operate against.
I need to return the latest row for each VehicleID.
I've been looking through the responses to this question here on StackOverflow and I've done a fair bit of Googling and there seem to be 3 or 4 common ways of doing this on SQL Server 2005 and upwards.
So far the fastest method I've found is the following query:
SELECT cv.*
FROM ChannelValue cv
WHERE cv.TimeStamp = (
SELECT
MAX(TimeStamp)
FROM ChannelValue
WHERE ChannelValue.VehicleID = cv.VehicleID
)
With the current amount of data in the table it takes about 6 seconds to execute, which is within reasonable limits, but with the amount of data the table will contain in the live environment the query becomes too slow.
Looking at the execution plan my concern is around what SQL Server is doing to return the rows.
I cannot post the execution plan image because my reputation isn't high enough, but the index scan is reading every single row in the table, which is what slows the query down so much.
I've tried rewriting the query with several different methods including using the SQL 2005 Partition method like this:
WITH cte
AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC) AS seq
FROM ChannelValue
)
SELECT
VehicleID,
TimeStamp,
Col1
FROM cte
WHERE seq = 1
But the performance of that query is worse by quite a large margin.
I've tried re-structuring the query like this but the result speed and query execution plan is nearly identical:
SELECT cv.*
FROM (
SELECT VehicleID
,MAX(TimeStamp) AS [TimeStamp]
FROM ChannelValue
GROUP BY VehicleID
) AS [q]
INNER JOIN ChannelValue cv
ON cv.VehicleID = q.VehicleID
AND cv.TimeStamp = q.TimeStamp
I have some flexibility available to me around the table structure (although to a limited degree) so I can add indexes, indexed views and so forth or even additional tables to the database.
I would greatly appreciate any help at all here.
Edit: Added the link to the execution plan image.
Depends on your data (how many rows are there per group?) and your indexes.
See Optimizing TOP N Per Group Queries for some performance comparisons of 3 approaches.
In your case, with millions of rows for only a small number of vehicles, I would add an index on (VehicleID, TimeStamp) and do:
SELECT CA.*
FROM Vehicles V
CROSS APPLY (SELECT TOP 1 *
FROM ChannelValue CV
WHERE CV.VehicleID = V.VehicleID
ORDER BY TimeStamp DESC) CA
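For reference, the supporting index could be created like this (a sketch; the DESC key matches the ORDER BY, though an ascending key also works via a backward scan). It assumes a Vehicles table with one row per VehicleID, as used in the query above:

CREATE NONCLUSTERED INDEX IX_ChannelValue_VehicleID_TimeStamp
ON dbo.ChannelValue (VehicleID, [TimeStamp] DESC)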
If your records are inserted sequentially, replacing TimeStamp in your query with ID may make a difference.
As a side note, how many records is this returning? Your delay could be network overhead if you are getting hundreds of thousands of rows back.
Try this:
SELECT SequencedChannelValue.* -- Specify only the columns you need, excluding the SeqValue column
FROM
(
SELECT
ChannelValue.*, -- Specify only the columns you need
SeqValue = ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC)
FROM ChannelValue
) AS SequencedChannelValue
WHERE SequencedChannelValue.SeqValue = 1
A table or index scan is expected, because you're not filtering data in any way. You're asking for the latest TimeStamp for all VehicleIDs - the query engine HAS to look at every row to find the latest TimeStamp.
You can help it out by narrowing the number of columns being returned (don't use SELECT *), and by providing an index that consists of VehicleID + TimeStamp.
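Combining both suggestions, a covering sketch using the columns from the table definition in the question; the INCLUDE list lets the seek satisfy the query without key lookups:

CREATE NONCLUSTERED INDEX IX_ChannelValue_Covering
ON dbo.ChannelValue (VehicleID, [TimeStamp] DESC)
INCLUDE (UpdateRecord, UnitID, RecordInsert)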
