Select only the most recent datarows [duplicate] - sql-server

This question already has answers here:
Get top 1 row of each group
(19 answers)
Closed 1 year ago.
I have a table that takes multiple entries for specific products, you can create a sample like this:
CREATE TABLE test(
[coltimestamp] [datetime] NOT NULL,
[col2] [int] NOT NULL,
[col3] [int] NULL,
[col4] [int] NULL,
[col5] [int] NULL)
GO
Insert Into test
values ('2021-12-06 12:31:59.000',1,8,5321,1234),
('2021-12-06 12:31:59.000',7,8,4047,1111),
('2021-12-06 14:38:07.000',7,8,3521,1111),
('2021-12-06 12:31:59.000',10,8,3239,1234),
('2021-12-06 12:31:59.000',27,8,3804,1234),
('2021-12-06 14:38:07.000',27,8,3957,1234)
You can view col2 as a product number if you like.
What I need is a query for this kind of table that returns unique rows for col2; when col2 appears more than once, it must choose the row with the most recent timestamp.
In other words, I need the most recent entry for each product.
So in the sample the result will show two rows less: the rows with the older timestamps for col2 = 7 and col2 = 27 are removed.
Thanks in advance for your knowledge.

Assign a row number with ROW_NUMBER() to each row, partitioned by each col2 value, in descending order of timestamp:
;with cte as(
Select rn=row_number() over(partition by col2 order by coltimestamp desc),*
From test
)
Select * from cte
Where rn=1;
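An equivalent form that avoids the CTE, assuming SQL Server 2008 or later, is TOP (1) WITH TIES ordered by the same ROW_NUMBER() expression (a sketch against the sample `test` table above):

```sql
-- Returns the most recent row per col2: every row whose ROW_NUMBER()
-- evaluates to 1 ties for first place and is kept by WITH TIES.
SELECT TOP (1) WITH TIES *
FROM test
ORDER BY ROW_NUMBER() OVER (PARTITION BY col2 ORDER BY coltimestamp DESC);
```

Both forms produce the same result set; the CTE version is usually easier to extend when you need the row number in the output.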

Related

SQL Server 2016 Compare values from multiple columns in multiple rows in single table

USE dev_db
GO
CREATE TABLE T1_VALS
(
[SITE_ID] [int] NULL,
[LATITUDE] [numeric](10, 6) NULL,
[UNIQUE_ID] [int] NULL,
[COLLECT_RANK] [int] NULL,
[CREATED_RANK] [int] NULL,
[UNIQUE_ID_RANK] [int] NULL,
[UPDATE_FLAG] [int] NULL
)
GO
INSERT INTO T1_VALS
(SITE_ID, LATITUDE, UNIQUE_ID, COLLECT_RANK, CREATED_RANK, UNIQUE_ID_RANK)
VALUES
(207442, 40.900470, 59664, 1, 1, 1),
(207442, 40.900280, 61320, 1, 1, 2),
(204314, 40.245220, 48685, 1, 2, 2),
(204314, 40.245910, 59977, 1, 1, 1),
(202416, 39.449530, 9295, 1, 1, 2),
(202416, 39.449680, 62264, 1, 1, 1)
I generated the COLLECT_RANK and CREATED_RANK columns from two date columns (not shown here) and the UNIQUE_ID_RANK column from the UNIQUE_ID column, which is shown here.
I used a SELECT with an OVER clause and a ranking function to generate these columns. A _RANK value of 1 means the latest date or the greatest UNIQUE_ID value. I thought my solution would be pretty straightforward using these rank values via array and cursor processing, but I seem to have painted myself into a corner.
My problem: I need to choose the LATITUDE value and its UNIQUE_ID based upon the following business rules, and set the update value (1) for that record in its UPDATE_FLAG column.
Select the record w/most recent Collection Date (i.e. RANK value = 1) for a given SITE_ID. If multiple records exist w/same Collection Date (i.e. same RANK value), select the record w/most recent Created Date (RANK value =1) for a given SITE_ID. If multiple records exist w/same Created Date, select the record w/highest Unique ID for a given SITE_ID (i.e. RANK value = 1).
Your suggestions would be most appreciated.
I think you can use top and order by:
select top (1) t.*
from t1_vals t
order by collect_rank asc, created_rank asc, unique_id desc;
If you want this for sites, which might be what your question is asking, then use row_number():
select t1.*
from (select t.*,
row_number() over (partition by site_id order by collect_rank asc, created_rank asc, unique_id desc) as seqnum
from t1_vals t
) t1
where seqnum = 1;

Efficiently query for the latest version of a record using SQL

I need to query a table for the latest version of a record for all available dates (end of day time-series). The example below illustrates what I am trying to achieve.
My question is whether the table's design (primary key, etc.) and the LEFT OUTER JOIN query is accomplishing this goal in the most efficient manner.
CREATE TABLE [PriceHistory]
(
[RowID] [int] IDENTITY(1,1) NOT NULL,
[ItemIdentifier] [varchar](10) NOT NULL,
[EffectiveDate] [date] NOT NULL,
[Price] [decimal](12, 2) NOT NULL,
CONSTRAINT [PK_PriceHistory]
PRIMARY KEY CLUSTERED ([ItemIdentifier] ASC, [RowID] DESC, [EffectiveDate] ASC)
)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-15',5.50)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-16',5.75)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-16',6.25)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-17',6.05)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-18',6.85)
GO
SELECT
L.EffectiveDate, L.Price
FROM
[PriceHistory] L
LEFT OUTER JOIN
[PriceHistory] R ON L.ItemIdentifier = R.ItemIdentifier
AND L.EffectiveDate = R.EffectiveDate
AND L.RowID < R.RowID
WHERE
L.ItemIdentifier = 'ABC' and R.EffectiveDate is NULL
ORDER BY
L.EffectiveDate
Follow up: The table can contain thousands of ItemIdentifiers, each with decades' worth of price data. Historical versions of the data need to be preserved for audit reasons. Say I query the table and use the data in a report. I store @MRID = Max(RowID) at the time the report was generated. Now if the price for 'ABC' on '2016-03-16' is corrected at some later date, I can modify the query using @MRID and replicate the report that I ran earlier.
A slightly modified version of @SeanLange's answer will give you the last row per date, instead of per product:
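That replay idea can be sketched directly: filter on the stored RowID bound before deduplicating, so corrections inserted after the report ran are ignored. This is a sketch; the literal @MRID value is hypothetical and stands in for whatever Max(RowID) you stored at report time.

```sql
-- Hypothetical replay of an earlier report. @MRID holds the MAX(RowID)
-- captured when the report was generated; later corrections have a
-- larger RowID and are excluded by the WHERE clause.
DECLARE @MRID int = 12345;  -- value stored at report time (illustrative)

;WITH snap AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ItemIdentifier, EffectiveDate
                              ORDER BY RowID DESC) AS RowNum
    FROM PriceHistory
    WHERE RowID <= @MRID          -- snapshot bound
)
SELECT EffectiveDate, Price
FROM snap
WHERE ItemIdentifier = 'ABC' AND RowNum = 1
ORDER BY EffectiveDate;
```

Because RowID is an ever-increasing IDENTITY, `RowID <= @MRID` reconstructs exactly the set of rows visible when the report first ran.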
with sortedResults as
(
select *
, ROW_NUMBER() over(PARTITION by ItemIdentifier, EffectiveDate
ORDER by RowID desc) as RowNum
from PriceHistory
)
select ItemIdentifier, EffectiveDate, Price
from sortedResults
where RowNum = 1
order by 2
I assume you have more than 1 ItemIdentifier in your table. Your design is a bit problematic in that you are keeping versions of the data in your table. You can however do something like this quite easily to get the most recent one for each ItemIdentifier.
with sortedResults as
(
select *
, ROW_NUMBER() over(PARTITION by ItemIdentifier order by EffectiveDate desc) as RowNum
from PriceHistory
)
select *
from sortedResults
where RowNum = 1
Short answer, no.
You're hitting the same table twice, and possibly creating a looped table scan, depending on your existing indexes. In the best case, you're causing a looped index seek, and then throwing out most of the rows.
This would be the most efficient query for what you're asking.
SELECT
L.EffectiveDate,
L.Price
FROM
(
SELECT
L.EffectiveDate,
L.Price,
ROW_NUMBER() OVER (
PARTITION BY
L.ItemIdentifier,
L.EffectiveDate
ORDER BY RowID DESC ) RowNum
FROM [PriceHistory] L
WHERE L.ItemIdentifier = 'ABC'
) L
WHERE
L.RowNum = 1;

Performing a SUM when value could exist in one of a few different columns

Assume there is a simple table:
CREATE TABLE [dbo].[foo](
[foo_id] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
[val1_id] [int] NULL,
[val1_amount] [int] NULL,
[val2_id] [int] NULL,
[val2_amount] [int] NULL,
[val3_id] [int] NULL,
[val3_amount] [int] NULL,
[val4_id] [int] NULL,
[val4_amount] [int] NULL
) ON [PRIMARY]
And there is some other table that is:
CREATE TABLE [dbo].[val](
[val_id] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
[amount] [int] NOT NULL
) ON [PRIMARY]
The user of the application will select an entry from the table val, and the application will place val_id and amount in val?_id and val?_amount in a non-deterministic manner (yes, I know this isn't good, but I have to deal with it).
Is there a way to produce output that will group by the val_id from the foo table and SUM the amount values from the foo table, given that the val_id/amount may be stored in any of the val?_id/val?_amount columns? Note that if val_id is stored in val1_id, the amount will always be stored in val1_amount within that row, but the same val_id may appear in a different val?_id in a different row (but the amount will be in the corresponding column)
Well, after you smacked the guy that designed this, you can try to UNPIVOT your columns, and then simply SUM the val_amount resultant column. I'm not gonna actually use UNPIVOT, but I'm doing the same with CROSS APPLY:
SELECT x.Val_id, SUM(x.val_amount) Val_Amount
FROM dbo.foo t
CROSS APPLY
(
VALUES
(t.val1_id, t.val1_amount),
(t.val2_id, t.val2_amount),
(t.val3_id, t.val3_amount),
(t.val4_id, t.val4_amount)
) x (Val_id, val_amount)
GROUP BY x.Val_id;
Here is a sqlfiddle with a live demo for you to try.
I did it using the COALESCE function. I'm not clear why I had to do a nested query, but I couldn't get it to take it otherwise:
select id, sum(amount)
FROM
(SELECT COALESCE(val1_id , val2_id , val3_id , val4_id) id,
COALESCE(val1_amount , val2_amount , val3_amount , val4_amount) amount
from foo) coalesced_data
GROUP BY ID
http://sqlfiddle.com/#!3/2ef35/6
Is there a way to produce output that will group by the val_id from the foo table and SUM the amount values from the foo table, given that the val_id/amount may be stored in any of the val?_id/val?_amount columns?
Hell is other people's data. You might do something like this:
create view FOO as
select foo_id as id, val1_id as valueid, val1_amount as amt from foo
union all
select foo_id, val2_id, val2_amount from foo
union all
select foo_id, val3_id, val3_amount from foo
union all
select foo_id, val4_id, val4_amount from foo
If you have correctly stated that the value will only ever appear in one pair of (val_id, val_amount) columns, leaving the other 3 pairs blank, then this will work better than the currently accepted answer (@lamak's):
select id = coalesce(val1_id, val2_id, val3_id, val4_id),
val = sum(coalesce(val1_amount, val2_amount, val3_amount, val4_amount))
from foo
group by coalesce(val1_id, val2_id, val3_id, val4_id);
It is easy to understand and even though the COALESCE is repeated, it is really only evaluated once so there is no performance penalty. The query plan is estimated at 50% the cost of the other query.
The currently accepted answer may also produce a phantom NULL row, for example for the data in my SQL fiddle demo that does not contain anything in the 3rd pair (val3_amount).
SQL Fiddle demo
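If that phantom NULL row is a concern, one fix (a sketch, reusing the CROSS APPLY unpivot from the accepted answer) is to discard the unused column pairs before aggregating:

```sql
-- Same CROSS APPLY unpivot as the accepted answer, but the WHERE clause
-- drops (NULL, NULL) pairs so unused val?_id/val?_amount slots do not
-- produce a NULL group in the output.
SELECT x.Val_id, SUM(x.val_amount) AS Val_Amount
FROM dbo.foo t
CROSS APPLY
(
    VALUES (t.val1_id, t.val1_amount),
           (t.val2_id, t.val2_amount),
           (t.val3_id, t.val3_amount),
           (t.val4_id, t.val4_amount)
) x (Val_id, val_amount)
WHERE x.Val_id IS NOT NULL
GROUP BY x.Val_id;
```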

Copy Distinct Records Based on 3 Cols

I have loads of data in a table called Temp. This data contains duplicates.
Not entire rows, but the same data in 3 columns: HouseNo, DateofYear, TimeOfDay.
I want to copy only the distinct rows from Temp into another table, ThermData.
Basically, what I want to do is copy all rows from Temp to ThermData that are distinct on (HouseNo, DateofYear, TimeOfDay). Something like that.
I know DISTINCT alone can't do that, so I need an alternative.
Do help me out. I have tried lots of things but haven't managed to solve it.
Sample data. The repeated values look like this; I want to delete the duplicate rows based on the values of HouseNo, DateofYear, TimeOfDay:
HouseNo DateofYear TimeOfDay Count
102 10/1/2009 0:00:02 AM 2
102 10/1/2009 1:00:02 AM 2
102 10/1/2009 10:00:02 AM 2
Here is a Northwind example based on the Orders table.
There are duplicates based on the (EmployeeID , ShipCity , ShipCountry) columns.
If you only execute the code between these 2 lines:
/* Run everything below this line to show crux of the fix */
/* Run everything above this line to show crux of the fix */
you'll see how it works. Basically:
(1) You run a GROUP BY on the 3 columns of interest. (derived1Duplicates)
(2) Then you join back to the table using these 3 columns. (on ords.EmployeeID = derived1Duplicates.EmployeeID and ords.ShipCity = derived1Duplicates.ShipCity and ords.ShipCountry = derived1Duplicates.ShipCountry)
(3) Then for each group, you tag them with Cardinal numbers (1,2,3,4,etc) (using ROW_NUMBER())
(4) Then you keep the row in each group that has the cardinal number of "1". (where derived2DuplicatedEliminated.RowIDByGroupBy = 1)
Use Northwind
GO
declare @DestinationVariableTable table (
NotNeededButForFunRowIDByGroupBy int not null ,
NotNeededButForFunDuplicateCount int not null ,
[OrderID] [int] NOT NULL,
[CustomerID] [nchar](5) NULL,
[EmployeeID] [int] NULL,
[OrderDate] [datetime] NULL,
[RequiredDate] [datetime] NULL,
[ShippedDate] [datetime] NULL,
[ShipVia] [int] NULL,
[Freight] [money] NULL,
[ShipName] [nvarchar](40) NULL,
[ShipAddress] [nvarchar](60) NULL,
[ShipCity] [nvarchar](15) NULL,
[ShipRegion] [nvarchar](15) NULL,
[ShipPostalCode] [nvarchar](10) NULL,
[ShipCountry] [nvarchar](15) NULL
)
INSERT INTO @DestinationVariableTable (NotNeededButForFunRowIDByGroupBy , NotNeededButForFunDuplicateCount , OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry )
Select RowIDByGroupBy , MyDuplicateCount , OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry
From
(
/* Run everything below this line to show crux of the fix */
Select
RowIDByGroupBy = ROW_NUMBER() OVER(PARTITION BY ords.EmployeeID , ords.ShipCity , ords.ShipCountry ORDER BY ords.OrderID )
, derived1Duplicates.MyDuplicateCount
, ords.*
from
[dbo].[Orders] ords
join
(
select EmployeeID , ShipCity , ShipCountry , COUNT(*) as MyDuplicateCount from [dbo].[Orders] GROUP BY EmployeeID , ShipCity , ShipCountry /*HAVING COUNT(*) > 1*/
) as derived1Duplicates
on ords.EmployeeID = derived1Duplicates.EmployeeID and ords.ShipCity = derived1Duplicates.ShipCity and ords.ShipCountry = derived1Duplicates.ShipCountry
/* Run everything above this line to show crux of the fix */
)
as derived2DuplicatedEliminated
where derived2DuplicatedEliminated.RowIDByGroupBy = 1
select * from @DestinationVariableTable
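Applied directly to the table in the question (a sketch; the column lists for Temp and ThermData are assumptions, so adjust them to the real schemas), the same ROW_NUMBER() pattern copies one survivor per (HouseNo, DateofYear, TimeOfDay) group:

```sql
-- Keep one arbitrary row per (HouseNo, DateofYear, TimeOfDay) group and
-- copy the survivors into ThermData. The ORDER BY inside OVER() decides
-- which duplicate wins; replace it with a real tie-breaker if one exists.
;WITH deduped AS
(
    SELECT HouseNo, DateofYear, TimeOfDay,   -- add the remaining columns here
           ROW_NUMBER() OVER (PARTITION BY HouseNo, DateofYear, TimeOfDay
                              ORDER BY HouseNo) AS rn
    FROM Temp
)
INSERT INTO ThermData (HouseNo, DateofYear, TimeOfDay)  -- and here
SELECT HouseNo, DateofYear, TimeOfDay
FROM deduped
WHERE rn = 1;
```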

3 indexes or single index with 3 columns in sql server 2008?

I have a SQL query with a where clause like so:
Where ManufacturerID = @ManufacturerID
AND ItemID IN (SELECT ItemID FROM @T)
AND RelatedItemID IN (SELECT RelatedItemID FROM @T)
What would give me the best performance, or what is the proper way to do this: 3 indexes, one on each column, or a single index that includes all 3?
Here is a more complete view of the SP being run:
DECLARE @T TABLE (
[CategoryID] [int] NOT NULL,
[ManufacturerID] [int] NULL,
[ItemID] [varchar](100) NOT NULL,
[ItemName] [varchar](100) NULL,
[PhotoName] [varchar](150) NULL,
[ModifiedOn] [datetime] NULL,
[ModifiedBy] [varchar](50) NULL,
[IsDeleted] [bit] NOT NULL)
;WITH T As
(SELECT CategoryID, ManufacturerID, ItemID, ItemName, PhotoName, ModifiedOn, ModifiedBy, IsDeleted
FROM StagingCategoryItems
WHERE (ManufacturerID = @ManufacturerID)
EXCEPT
SELECT CategoryID, ManufacturerID, ItemID, ItemName, PhotoName, ModifiedOn, ModifiedBy, IsDeleted
FROM CategoryProducts
WHERE (ManufacturerID = @ManufacturerID)
)
INSERT INTO @T
SELECT *
FROM T
DELETE FROM CategoryProducts WHERE ManufacturerID = @ManufacturerID
AND ItemID IN (SELECT ItemID FROM @T)
AND CategoryID IN (SELECT CategoryID FROM @T)
INSERT INTO [CategoryProducts]
([CategoryID]
,[ManufacturerID]
,[ItemID]
,[ItemName]
,[PhotoName]
,[CreatedOn]
,[CreatedBy]
,[ModifiedOn]
,[ModifiedBy]
,[DeletedOn]
,[DeletedBy]
,[IsDeleted])
SELECT [CategoryID]
,[ManufacturerID]
,[ItemID]
,[ItemName]
,[PhotoName]
,[CreatedOn]
,[CreatedBy]
,[ModifiedOn]
,[ModifiedBy]
,[DeletedOn]
,[DeletedBy]
,[IsDeleted]
FROM [StagingCategoryItems]
WHERE ManufacturerID = @ManufacturerID
AND ItemID IN (SELECT ItemID FROM @T)
AND CategoryID IN (SELECT CategoryID FROM @T)
ItemID IN (SELECT ItemID FROM @T)
AND RelatedItemID IN (SELECT RelatedItemID FROM @T)
Now this is a very dangerous condition. It expresses the condition that the current ItemID is in @T and the RelatedItemID is also in @T, but note that they do not have to be on the same row in @T. To give an example, if @T contains:
ItemID RelatedItemId
1 2
3 4
and in your table you have a row like:
ItemID RelatedItemId
1 4
the WHERE condition will be TRUE. Are you sure this is the resolution you want?
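If the intent is that both values must come from the same row of the table variable, a correlated EXISTS expresses that. This is a sketch using the table and column names from the question's first snippet:

```sql
-- Unlike the pair of IN subqueries, EXISTS requires the matching ItemID
-- and RelatedItemID to appear together on the SAME row of @T.
SELECT cp.*
FROM CategoryProducts cp
WHERE cp.ManufacturerID = @ManufacturerID
  AND EXISTS (SELECT 1
              FROM @T t
              WHERE t.ItemID = cp.ItemID
                AND t.RelatedItemID = cp.RelatedItemID);
```

With the earlier example data, the row (ItemID = 1, RelatedItemId = 4) no longer matches, because no single row of @T contains both values.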
As for your original indexes question: unfortunately the answer is 'it depends'. A number of index combinations can be good, and exactly the same indexes can be bad, depending on your actual data. When approaching a question like yours you need to ask yourself: 'which condition is most restrictive, and how restrictive is it?'.
Say that ManufacturerID = @ManufacturerID restricts the candidate rows to about 10% (e.g. you have 10 distinct manufacturers), ItemID IN (SELECT ItemID FROM @T) restricts to about 100 rows on average, and the last condition does the same. Then even a single index on ItemID will be enough. Especially if it is the clustered index; but even as an NC index, you're talking about an average of 100 key lookups, which is small change.
But now let's say that ManufacturerID = @ManufacturerID restricts the candidate rows to about 10%, ItemID IN (SELECT ItemID FROM @T) restricts to about 5% of the total number of rows, and the last condition does the same, but the exact match of all three conditions is only .0001% of the rows. Now no single-column index would help; you need an index that includes all three. In what order? Excellent question.
I recommend you go over the General Index Design Guidelines.
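As an illustration of the column-order point (a sketch only; the index name is made up, and the right leading column depends on the selectivity numbers discussed above):

```sql
-- Composite index covering all three predicates. Leading with the
-- equality column (ManufacturerID) lets the seek narrow to one
-- manufacturer's range before the IN-list columns are probed.
CREATE NONCLUSTERED INDEX IX_CategoryProducts_Mfr_Item_Cat
    ON CategoryProducts (ManufacturerID, ItemID, CategoryID);
```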
A general rule with any SQL server (PostgreSQL, Oracle, MySQL...), not just Microsoft SQL Server, for performance questions: test under your workload, look at what the explain plan gives, and check whether the performance meets your requirements. Test a few options and see how each affects the explain plan and the performance (i.e. time to completion, in most cases). I find you don't need to know much about a database if you can prove it with really good testing. Not that know-how has no value, but all the know-how in the world seldom beats real-world tests.
One, since the other two are table variables.
