Find duplicate rows - keep one entry

I have a SQL Server table like this:
date : date
symbol : string
open : money
...
While collecting historical data, I may have accidentally added the same data for a given date more than once. I need to keep one of the rows, but any additional entries for the same symbol on the same date need to be deleted. For example, this is wrong (two entries for INTC on 2/2/2019):
1/31/2019 INTC 48.32
2/2/2019 INTC 49.51
2/2/2019 INTC 49.51
How do I automatically delete the duplicate rows for each symbol with a SQL script, while leaving the rows that have no duplicates alone?

You can use some CTE "magic":
WITH CTE AS(
SELECT [date], [Symbol], [open],
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
FROM YourTable
WHERE [date] = '20190202'
AND [Symbol] = 'INTC'
AND [open] = 49.51)
DELETE FROM CTE
WHERE RN > 1;
To DELETE every duplicate in the table, assuming a duplicate means two or more rows that share the same values for date, symbol and open, you can do:
WITH CTE AS(
SELECT [date], [Symbol], [open],
ROW_NUMBER() OVER (PARTITION BY [date], [Symbol], [open] ORDER BY (SELECT NULL)) AS RN
FROM YourTable)
DELETE FROM CTE
WHERE RN > 1;
If there should only ever be one entry per day (or per day and symbol, perhaps), enforce that with a UNIQUE constraint once the duplicates are gone:
ALTER TABLE YourTable ADD CONSTRAINT UK_date_symbol UNIQUE ([date],symbol);
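As a rough end-to-end sketch of the steps above, the following runs against a temporary table so no real data is touched. The table name #Prices is made up for illustration, the rows are the three from the question, and the partition uses date and symbol only (an assumption that matches the suggested constraint).

```sql
-- Hypothetical scratch table mirroring the question's columns.
CREATE TABLE #Prices ([date] date, [symbol] varchar(10), [open] money);

INSERT INTO #Prices ([date], [symbol], [open])
VALUES ('20190131', 'INTC', 48.32),
       ('20190202', 'INTC', 49.51),
       ('20190202', 'INTC', 49.51);  -- accidental duplicate

-- Number the rows within each (date, symbol) group;
-- anything numbered 2 or higher is a surplus copy.
WITH CTE AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY [date], [symbol]
                              ORDER BY (SELECT NULL)) AS RN
    FROM #Prices)
DELETE FROM CTE
WHERE RN > 1;

-- With the duplicates gone, a unique constraint can be added to block new ones.
-- (Left unnamed here because this is a temp table.)
ALTER TABLE #Prices ADD UNIQUE ([date], [symbol]);

-- Two rows remain: one for 2019-01-31 and one for 2019-02-02.
SELECT [date], [symbol], [open] FROM #Prices ORDER BY [date];

DROP TABLE #Prices;
```

Running the CTE as a SELECT first (replace the DELETE with SELECT * FROM CTE WHERE RN > 1 after adding the columns you want to see) is a cheap way to check which rows would go.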

Related

SQL Server 2008 find the prior to last timestamp record

I am trying to query a table in order to find the prior-to-last record. So far I have tried various solutions such as the code below, but I cannot make it work.
The fields I am trying to query are from the table [CMI_Industry_Workload].[dbo].[Gantt_Value] and are [ProjectName] and [TimeStamp].
The first field is a string field and the second one is a timestamp field.
select t.*
from (Select t.*
row_number() over (partition by [ProjectName] order by [Timestamp]) as seqnum,
count(*) over (partition by [ProjectName]) as cnt
from [CMI_Industry_Workload].[dbo].[Gantt_Value] t
) t
where seqnum in (1, cnt - 1, cnt);
So I expect to get not the most recent record but the one before it.
Thanks a lot
Gary

This should find the prior-to-last record in your table (assuming the data is ordered by the Timestamp field). It sorts the rows in descending order by Timestamp, skips (offsets) 1 row (the very last one) and fetches only one row, which is the prior-to-last one.
SELECT T.*
FROM [CMI_Industry_Workload].[dbo].[Gantt_Value] AS T
ORDER BY [Timestamp] DESC
OFFSET 1 ROWS
FETCH NEXT 1 ROWS ONLY;
The above works on SQL Server 2012 and later.
This is a version with ROW_NUMBER, which also works on 2008:
SELECT *
FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY [Timestamp] DESC) AS Seq
FROM [CMI_Industry_Workload].[dbo].[Gantt_Value]) AS T
WHERE T.Seq = 2;
This again orders by Timestamp descending and picks the 2nd row (the prior-to-last one).
Keep in mind that MSSQL 2008 is (or soon will be) out of support. I'd strongly encourage upgrading.
Update
Based on OP's comment, this must be the answer then:
SELECT *
FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY [ProjectName] ORDER BY [Timestamp] DESC) AS Seq
FROM [CMI_Industry_Workload].[dbo].[Gantt_Value]) AS T
WHERE T.Seq > 1;
Second update
If Timestamp values can be duplicated and you want to treat equal values as the same, you might want to use DENSE_RANK() instead of ROW_NUMBER(). It assigns the same sequence number to a row whose value matches the previous value within the sequence.
SELECT *
FROM ( SELECT *, DENSE_RANK() OVER (PARTITION BY [ProjectName] ORDER BY [Timestamp] DESC) AS Seq
FROM [CMI_Industry_Workload].[dbo].[Gantt_Value]) AS T
WHERE T.Seq > 1;
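To see how the two functions differ when timestamps tie, here is a small sketch on made-up data. The temp table #Gantt, its rows, and the datetime column type are assumptions for illustration only; the real Gantt_Value table may differ.

```sql
-- Hypothetical sample: project A has two rows sharing the latest timestamp.
CREATE TABLE #Gantt (ProjectName varchar(20), [Timestamp] datetime);

INSERT INTO #Gantt (ProjectName, [Timestamp])
VALUES ('A', '20190103 10:00'),
       ('A', '20190103 10:00'),  -- tie on the latest value
       ('A', '20190102 09:00'),
       ('A', '20190101 08:00');

SELECT ProjectName, [Timestamp],
       ROW_NUMBER() OVER (PARTITION BY ProjectName ORDER BY [Timestamp] DESC) AS RowNum,
       DENSE_RANK() OVER (PARTITION BY ProjectName ORDER BY [Timestamp] DESC) AS DenseRnk
FROM #Gantt
ORDER BY [Timestamp] DESC;

-- RowNum runs 1, 2, 3, 4, so "RowNum = 2" lands on the second copy of the latest timestamp.
-- DenseRnk runs 1, 1, 2, 3, so "DenseRnk = 2" lands on 2019-01-02, the value before the latest.

DROP TABLE #Gantt;
```

Which behaviour is wanted depends on whether a repeated timestamp should count as one record or two.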

How to select highest common value across groups

Suppose I have a set of data with 2 fields - Type and Date. I am interested in finding (if it exists) the max common date across the various types. Is this easier to do in SQL or LINQ?
Given the data below the result should be 2018-02-01 as this is the max common date for all types. If there is no such date then no data is returned.
Type, Date
---------
1,2018-03-01
1,2018-02-01
1,2018-01-01
2,2018-02-01
2,2018-05-01
2,2018-01-01
3,2018-01-01
3,2018-03-01
3,2018-02-01

You could use:
SELECT TOP 1 [Date], COUNT(*) OVER(PARTITION BY Date) AS cnt
FROM tab
ORDER BY cnt DESC, [Date] DESC

This'll work if you have an unlimited or indeterminable number of Types:
CREATE TABLE #Sample ([Type] int, [Date] date);
INSERT INTO #Sample
VALUES
(1,'20180301'),
(1,'20180201'),
(1,'20180101'),
(2,'20180201'),
(2,'20180501'),
(2,'20180101'),
(3,'20180101'),
(3,'20180301'),
(3,'20180201');
GO
WITH EntryCount AS(
SELECT [Type], [Date],
COUNT(*) OVER (PARTITION By [Date]) AS Entries
FROM #Sample)
SELECT MAX(EC.[Date])
FROM EntryCount EC
WHERE EC.Entries = (SELECT COUNT(DISTINCT sq.[Type]) FROM #Sample sq);
GO
DROP TABLE #Sample;
I'm not sure how quick it'll be, though.

Example
Select Top 1 [Date]
from YourTable
Group By [Date]
Order By count([Type]) desc,[Date] desc
Returns
2018-02-01
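Ordering by the count picks the most widely shared date even when some type lacks it. If the date must appear for every type, and no row should come back otherwise, one possible variant is to filter the groups with HAVING. This is only a sketch; the temp table #Sample is a stand-in loaded with the question's rows.

```sql
CREATE TABLE #Sample ([Type] int, [Date] date);

INSERT INTO #Sample ([Type], [Date])
VALUES (1, '20180301'), (1, '20180201'), (1, '20180101'),
       (2, '20180201'), (2, '20180501'), (2, '20180101'),
       (3, '20180101'), (3, '20180301'), (3, '20180201');

-- Keep only dates that appear for every distinct Type, then take the latest.
-- If no date is shared by all types, the HAVING filter leaves nothing and no row is returned.
SELECT TOP 1 [Date]
FROM #Sample
GROUP BY [Date]
HAVING COUNT(DISTINCT [Type]) = (SELECT COUNT(DISTINCT [Type]) FROM #Sample)
ORDER BY [Date] DESC;
-- With the question's data this returns 2018-02-01.

DROP TABLE #Sample;
```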

This is not going to be very efficient no matter how you slice it because you have to compare across three groups. Assuming you have 3 types you could use a self join. Something like this.
select MAX(yt.YourDate)
from YourTable yt
join YourTable yt2 on yt2.YourType = 2 and yt.YourDate = yt2.YourDate
join YourTable yt3 on yt3.YourType = 3 and yt.YourDate = yt3.YourDate
where yt.YourType = 1

T-SQL Duplicate Records

I am trying to delete every other record, since they are duplicates. My select query returns every other record as a duplicate (tblPoints.ptUser_ID is the unique id):
SELECT *, u.usMembershipID
FROM [ABCRewards].[dbo].[tblPoints]
inner join tblUsers u on u.User_ID = tblPoints.ptUser_ID
where ptUser_ID in (select user_id from tblusers where Client_ID = 8)
and ptCreateDate >= '3/9/2016'
and ptDesc = 'December Anniversary'

Usually duplicates getting returned by an INNER JOIN suggests an issue with the query but if you are certain that your join is correct then this would do it:
;WITH CTE
AS (SELECT *
, ROW_NUMBER() OVER(PARTITION BY t.ptUser_ID ORDER BY t.ptUser_ID) AS rn
FROM [ABCRewards].[dbo].[tblPoints] AS t)
/*Uncomment below to Review duplicates*/
--SELECT *
--FROM CTE
--WHERE rn > 1;
/*Uncomment below to Delete duplicates*/
--DELETE
--FROM CTE
--WHERE rn > 1;

When cleaning up duplicated data, I have always used the same query pattern, which deletes all the duplicates and keeps the one you want (the original, the most recent, whatever).
Just replace everything in [] with your table and fields.
[Field(s)ToDetectDuplications] : Put here the field(s) whose matching values mark rows as duplicates.
[Field(s)ToChooseWhichDupplicationIsKept ] : Put here the field(s) that decide which duplicate is kept, for example the one with the biggest value or the most recent one.
DELETE [YourTableName]
FROM [YourTableName]
INNER JOIN (SELECT [YourTablePrimaryKey],
I = ROW_NUMBER() OVER(PARTITION BY [Field(s)ToDetectDuplications] ORDER BY [Field(s)ToChooseWhichDupplicationIsKept ] DESC)
FROM [dbo].[YourTableName]) AS T ON [YourTableName].[YourTablePrimaryKey] = T.[YourTablePrimaryKey]
AND T.I > 1
I recommend looking at what will be deleted first. To do so, replace the DELETE statement with a SELECT, as below.
SELECT T.I,
[YourTableName].*
FROM [YourTableName]
INNER JOIN (SELECT [YourTablePrimaryKey],
I = ROW_NUMBER() OVER(PARTITION BY [Field(s)ToDetectDuplications] ORDER BY [Field(s)ToChooseWhichDupplicationIsKept ] DESC)
FROM [dbo].[YourTableName]) AS T ON [YourTableName].[YourTablePrimaryKey] = T.[YourTablePrimaryKey]
AND T.I > 1
Explanation:
Here we use ROW_NUMBER() with PARTITION BY and ORDER BY to detect duplicates. PARTITION BY groups rows that share the same values. Choose the partition fields so that each partition holds exactly one row when the data is correct; bad data then shows up as partitions with more than one row. ROW_NUMBER() numbers the rows within each partition, so any number greater than 1 means that partition contains a duplicate. ORDER BY tells ROW_NUMBER() in what order to assign the numbers. Row number 1 is kept; all others are deleted.
Example with OP's schema and specification
Here I have filled in the pattern with guesses about your database schema.
DECLARE @userID INT
SELECT @userID = 8
-- [YourTablePrimaryKey] is still a placeholder for the table's real key column
SELECT T.I,
[ABCRewards].[dbo].[tblPoints].*
FROM [ABCRewards].[dbo].[tblPoints]
INNER JOIN (SELECT P.[YourTablePrimaryKey],
I = ROW_NUMBER() OVER(PARTITION BY P.ptDesc, P.ptUser_ID ORDER BY P.ptCreateDate DESC)
FROM [ABCRewards].[dbo].[tblPoints] AS P
WHERE P.ptCreateDate >= '3/9/2016'
AND P.ptDesc = 'December Anniversary'
AND P.ptUser_ID = @userID
) AS T ON [ABCRewards].[dbo].[tblPoints].[YourTablePrimaryKey] = T.[YourTablePrimaryKey]
AND T.I > 1
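The pattern can be tried end to end on a throwaway table before pointing it at real data. In this sketch every name (#Points, PointID, UserID, Descr, CreateDate) is hypothetical and chosen only to resemble the question; the newest row per user and description is the one kept.

```sql
-- Hypothetical table: PointID is the primary key, UserID + Descr define a duplicate,
-- and the newest CreateDate is the row to keep.
CREATE TABLE #Points (PointID int PRIMARY KEY, UserID int, Descr varchar(50), CreateDate date);

INSERT INTO #Points (PointID, UserID, Descr, CreateDate)
VALUES (1, 8, 'December Anniversary', '20160310'),
       (2, 8, 'December Anniversary', '20160312'),  -- newest for user 8, kept
       (3, 9, 'December Anniversary', '20160311');  -- no duplicate, kept

-- Join back on the primary key and delete every row numbered above 1.
DELETE P
FROM #Points AS P
INNER JOIN (SELECT PointID,
                   I = ROW_NUMBER() OVER (PARTITION BY UserID, Descr ORDER BY CreateDate DESC)
            FROM #Points) AS T
        ON P.PointID = T.PointID
       AND T.I > 1;

SELECT PointID FROM #Points ORDER BY PointID;  -- PointID 2 and 3 remain

DROP TABLE #Points;
```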

Ordering by an expression in a partition by

I have some almost duplicate data in my database (duplicates based on these 5 columns: Date, Code, Expiry, TheType, Strike, there are many more columns but they won't be counted towards labeling a record a duplicate). I want to keep only one record in each case and the one I want to keep is the one whose mtm column is closest to its checkprice column (i.e. minimize abs(mtm-checkprice)). So I think the CTE below gets pretty close if I can just order the partition by that expression. The way I tried gives me the error Invalid column name 'diff'.
WITH CTE AS(
SELECT *, ABS(Mtm - checkprice) as diff,
RN = ROW_NUMBER()OVER(PARTITION BY Date, Strike, Mtm, /* ALL THE OTHER COLUMN NAMES */
ORDER BY diff DESC)
FROM FullStats
)
--DELETE FROM CTE WHERE RN > 1
SELECT * FROM CTE WHERE RN > 1
ORDER BY Date, Code, Expiry, TheType, Strike
Any ideas on how to rectify this?

Use the ABS(mtm-checkprice) in the ORDER BY of the ROW_NUMBER:
WITH CTE AS(
SELECT *, Diff = ABS(mtm-checkprice),
RN = ROW_NUMBER()OVER(PARTITION BY Date, Code, Expiry, TheType, Strike
ORDER BY ABS(mtm-checkprice) ASC)
FROM FullStats
)
--DELETE FROM CTE WHERE RN > 1
SELECT * FROM CTE WHERE RN > 1
ORDER BY Date, Code, Expiry, TheType, Strike
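You cannot reference the alias Diff inside ROW_NUMBER, because a column alias is not visible to other expressions in the same SELECT list; it is only available outside the CTE.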
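If repeating the expression is unappealing, one alternative (not part of the answer above) is to compute it once in the FROM clause with CROSS APPLY, which makes it an ordinary column that ROW_NUMBER can order by. The sketch below only previews the rows that would be removed; the temp table #FullStats, its reduced column list, and the sample values are assumptions.

```sql
-- Hypothetical cut-down version of FullStats with only the columns needed here.
CREATE TABLE #FullStats ([Date] date, Code varchar(10), Strike money, Mtm money, checkprice money);

INSERT INTO #FullStats ([Date], Code, Strike, Mtm, checkprice)
VALUES ('20190102', 'ABC', 100, 5.10, 5.00),   -- diff 0.10
       ('20190102', 'ABC', 100, 5.02, 5.00);   -- diff 0.02, closest, gets RN = 1

WITH CTE AS (
    SELECT f.*, d.diff,
           RN = ROW_NUMBER() OVER (PARTITION BY f.[Date], f.Code, f.Strike
                                   ORDER BY d.diff ASC)
    FROM #FullStats AS f
    -- d.diff comes from the FROM clause, so the window function can see it.
    CROSS APPLY (VALUES (ABS(f.Mtm - f.checkprice))) AS d(diff))
SELECT * FROM CTE WHERE RN > 1;  -- lists the row with Mtm = 5.10

DROP TABLE #FullStats;
```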

How to use row_number() in SQL Server

I want to update row data where the row_number of the column (p_id) is 1, but this syntax produces an error:
update app1
set p_id = 1
where Row_Number() = 1 over(p_id)

You can't use ROW_NUMBER() directly in a WHERE clause; you need to wrap it in, for example, a CTE (Common Table Expression):
;WITH DataToUpdate AS
(
SELECT
SomeID,
p_id,
ROW_NUMBER() OVER(ORDER BY .......) AS 'RowNum'
FROM
dbo.app1
)
UPDATE DataToUpdate
SET p_id = 1
WHERE
RowNum = 1
To use the ROW_NUMBER function, you also need at least an ORDER BY clause to define the order in which the rows are numbered.
From your question, it's not very clear what criteria (column) you want to order by to determine your ROW_NUMBER(), and it's also not clear what kind of column there is to uniquely identify a row (so that the UPDATE can be applied).

This will update only the first employee of each age. It could be used for lottery-type logic.
create table emp(name varchar(3),Age int, Salary int, IncentiveFlag bit)
insert into emp values('aaa',23,90000,0);
insert into emp values('bbb',22,50000,0);
insert into emp values('ccc',63,60000,0);
insert into emp values('ddd',53,50000,0);
insert into emp values('eee',23,80000,0);
insert into emp values('fff',53,50000,0);
insert into emp values('ggg',53,50000,0);
update A
set IncentiveFlag=1
from
(
Select row_number() over (partition by Age order by age ) AS SrNo,* from emp
)A
where A.SrNo=1
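Because the window above orders by the same column it partitions on, which row counts as "first" within an age is left to the engine. If a predictable pick is wanted, a tie-breaking ORDER BY can be used instead. The sketch below assumes the highest salary (then the name) should win; the temp table #emp is a reduced copy of the sample data.

```sql
CREATE TABLE #emp (name varchar(3), Age int, Salary int, IncentiveFlag bit);

INSERT INTO #emp (name, Age, Salary, IncentiveFlag)
VALUES ('aaa', 23, 90000, 0),
       ('eee', 23, 80000, 0),
       ('ddd', 53, 50000, 0),
       ('fff', 53, 50000, 0);

-- Rank employees within each age: highest salary first, name as the tie-breaker.
WITH Ranked AS (
    SELECT IncentiveFlag,
           ROW_NUMBER() OVER (PARTITION BY Age ORDER BY Salary DESC, name) AS SrNo
    FROM #emp)
UPDATE Ranked
SET IncentiveFlag = 1
WHERE SrNo = 1;

SELECT name FROM #emp WHERE IncentiveFlag = 1 ORDER BY name;  -- aaa and ddd

DROP TABLE #emp;
```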

To delete duplicates:
;WITH CTE(Name,Address1,Phone,RN)
AS
(
SELECT Name,Address1,Phone,
ROW_NUMBER() OVER(PARTITION BY Name ORDER BY Name) AS RN
FROM YourTable -- placeholder: the original omitted the table name
)
DELETE FROM CTE WHERE RN > 1
