Delete duplicate row with having

I want to remove duplicate rows in SQL Server. I found a query that finds them; how can I turn it into a DELETE statement that removes them?
(SELECT
DocumentNumber, LineNumber, SheetNumber, Unit, COUNT(*)
FROM
Lines
GROUP BY
DocumentNumber, LineNumber, SheetNumber, Unit
HAVING
COUNT(*) > 1)

I would use a CTE and a ranking function like ROW_NUMBER:
;WITH CTE AS
(
SELECT Id, DocumentNumber, LineNumber, SheetNumber, Unit,
RN = ROW_NUMBER() OVER (PARTITION BY DocumentNumber, LineNumber, SheetNumber, Unit
ORDER BY Id DESC) -- keeps the newest Id-row
FROM dbo.Lines
)
DELETE FROM CTE WHERE RN > 1
The great benefit is that it's easier to read and maintain, and it's easy to change the DELETE to a SELECT * FROM CTE to see what you're going to delete.
Modify the ORDER BY part to implement custom logic for which row to keep and which rows to delete. It can contain multiple columns (either ASC or DESC) or even conditional expressions (e.g. with CASE).
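As a minimal runnable sketch of the keep-the-newest-row idea, the snippet below uses Python's built-in sqlite3 module purely so it is self-contained. SQLite cannot delete through a CTE the way SQL Server can, so the ranking is moved into a derived table and the DELETE filters on Id; that form should also be valid T-SQL. The sample rows are made up, and SQLite 3.25 or later is assumed for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Lines (Id INTEGER PRIMARY KEY, DocumentNumber TEXT, "
    "LineNumber INTEGER, SheetNumber INTEGER, Unit TEXT)"
)
# Made-up sample data: Ids 1-3 are duplicates of each other, Id 4 is unique.
conn.executemany(
    "INSERT INTO Lines VALUES (?, ?, ?, ?, ?)",
    [
        (1, "D1", 10, 1, "A"),
        (2, "D1", 10, 1, "A"),
        (3, "D1", 10, 1, "A"),
        (4, "D2", 20, 1, "B"),
    ],
)

# Rank the rows of each duplicate group, newest Id first, and delete
# everything that is not ranked 1.
conn.execute(
    """
    DELETE FROM Lines
    WHERE Id IN (
        SELECT Id
        FROM (
            SELECT Id,
                   ROW_NUMBER() OVER (
                       PARTITION BY DocumentNumber, LineNumber, SheetNumber, Unit
                       ORDER BY Id DESC
                   ) AS RN
            FROM Lines
        ) AS Ranked
        WHERE RN > 1
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT Id FROM Lines ORDER BY Id")]
print(remaining_ids)
```

Changing ORDER BY Id DESC to ORDER BY Id would keep the oldest row of each group instead.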

Here are a few ways to do that:
DELETE L
FROM Lines L
INNER JOIN (SELECT
DocumentNumber,LineNumber,SheetNumber,Unit, COUNT(*) AS Cnt
FROM
Lines
GROUP BY
DocumentNumber,LineNumber,SheetNumber,Unit
HAVING
COUNT(*) > 1) D ON L.DocumentNumber = D.DocumentNumber AND L.LineNumber = D.LineNumber AND L.SheetNumber = D.SheetNumber AND L.Unit = D.Unit
Note that this removes every row in each duplicated group, including the copy you may want to keep. If you have a column like ID, you can also use a table variable, a CTE, or IN with a subquery to keep one row per group.

A more appropriate approach using EXISTS:
DELETE a
FROM lines a
WHERE EXISTS (SELECT 1
FROM lines b
WHERE a.documentnumber = b.documentnumber
AND a.linenumber = b.linenumber
AND a.sheetnumber = b.sheetnumber
AND a.unit = b.unit
HAVING Count(*) > 1)
Another approach using COUNT() OVER():
;WITH cte
AS (SELECT id,
documentnumber,
linenumber,
sheetnumber,
unit,
CNT = Count(1) OVER(partition BY documentnumber, linenumber, sheetnumber, unit)
FROM dbo.lines)
DELETE FROM cte
WHERE cnt > 1
Note: Both of my approaches delete all the records for a documentnumber, linenumber, sheetnumber, unit combination if that combination is duplicated, so no copy is kept.
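To make the difference between the two filters concrete, here is a small sketch (Python's sqlite3 with made-up rows, assuming SQLite 3.25+ for window functions) that computes both COUNT(*) OVER and ROW_NUMBER() for the same data and lists which Ids each predicate would flag for deletion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Lines (Id INTEGER PRIMARY KEY, DocumentNumber TEXT, "
    "LineNumber INTEGER, SheetNumber INTEGER, Unit TEXT)"
)
# Made-up sample data: Ids 1 and 2 are duplicates, Id 3 is unique.
conn.executemany(
    "INSERT INTO Lines VALUES (?, ?, ?, ?, ?)",
    [(1, "D1", 10, 1, "A"), (2, "D1", 10, 1, "A"), (3, "D2", 20, 1, "B")],
)

rows = conn.execute(
    """
    SELECT Id,
           COUNT(*) OVER (
               PARTITION BY DocumentNumber, LineNumber, SheetNumber, Unit
           ) AS Cnt,
           ROW_NUMBER() OVER (
               PARTITION BY DocumentNumber, LineNumber, SheetNumber, Unit
               ORDER BY Id
           ) AS RN
    FROM Lines
    ORDER BY Id
    """
).fetchall()

# "Cnt > 1" flags every member of a duplicated group.
flagged_by_count = [row_id for row_id, cnt, rn in rows if cnt > 1]
# "RN > 1" flags only the extra copies and leaves one row per group.
flagged_by_row_number = [row_id for row_id, cnt, rn in rows if rn > 1]

print(flagged_by_count, flagged_by_row_number)
```

If one copy of each group has to survive, the RN > 1 filter is the one to use.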

Related

Is there any way to sum duplicate rows when deleting duplicates using CTE?

I have a table that contains duplicated ItemId values. I am using a CTE to remove the duplicate records and keep a single record for each item. This works with the following query:
Create procedure sp_SumSameItems
as
begin
with cte as (select a.Id,a.ItemId,Qty, QtyPrice,
ROW_NUMBER() OVER(PARTITION by ItemId ORDER BY Id) AS rn from tblTest a)
delete x from tblTest x Join cte On x.Id = cte.Id where cte.rn > 1
end
The actual problem is that I want to sum Qty and QtyPrice before deleting the duplicate records. Where should I add the SUM function?

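You can't combine an UPDATE with a DELETE in one statement; you need to run the UPDATE first:
UPDATE t
SET t.Qty = (SELECT SUM(t1.Qty) FROM tblTest t1 WHERE t1.ItemId = t.ItemId),
t.QtyPrice = (SELECT SUM(t1.QtyPrice) FROM tblTest t1 WHERE t1.ItemId = t.ItemId)
FROM tblTest t;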

A CTE is valid for only one statement, so you will need to either run the CTE twice (once to sum, then once to delete), or put the result of the CTE in a temp table and use that temp table to sum and then delete records in the original table.
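The temp-table variant could look roughly like the sketch below. It runs on Python's sqlite3 with made-up rows, so the syntax is SQLite's (CREATE TEMP TABLE ... AS) rather than SQL Server's SELECT ... INTO #temp. The Totals table and its KeepId column are hypothetical names introduced for illustration, and keeping the lowest Id per item is an assumption that matches ORDER BY Id in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblTest (Id INTEGER PRIMARY KEY, ItemId INTEGER, "
    "Qty INTEGER, QtyPrice INTEGER)"
)
# Made-up sample data: item 100 appears twice, item 200 once.
conn.executemany(
    "INSERT INTO tblTest VALUES (?, ?, ?, ?)",
    [(1, 100, 2, 10), (2, 100, 3, 15), (3, 200, 1, 5)],
)

with conn:  # one transaction, so the sum and the delete succeed or fail together
    # Step 1: store the totals per item and the Id of the row to keep.
    conn.execute(
        """
        CREATE TEMP TABLE Totals AS
        SELECT ItemId,
               MIN(Id)       AS KeepId,
               SUM(Qty)      AS Qty,
               SUM(QtyPrice) AS QtyPrice
        FROM tblTest
        GROUP BY ItemId
        """
    )
    # Step 2: write the totals onto the row that will survive.
    conn.execute(
        """
        UPDATE tblTest
        SET Qty      = (SELECT Totals.Qty FROM Totals WHERE Totals.KeepId = tblTest.Id),
            QtyPrice = (SELECT Totals.QtyPrice FROM Totals WHERE Totals.KeepId = tblTest.Id)
        WHERE Id IN (SELECT KeepId FROM Totals)
        """
    )
    # Step 3: remove every other row.
    conn.execute("DELETE FROM tblTest WHERE Id NOT IN (SELECT KeepId FROM Totals)")

result = conn.execute(
    "SELECT Id, ItemId, Qty, QtyPrice FROM tblTest ORDER BY Id"
).fetchall()
print(result)
```

Doing the update before the delete matters: once the duplicates are gone, their quantities can no longer be summed.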

First update Qty and QtyPrice, then remove the duplicate records.
For example:
CREATE PROCEDURE Sp_sumsameitems
AS
BEGIN
-- Step 1: write the per-item totals onto every row of that item
WITH cte1
AS (SELECT a.itemid,
Sum(a.qty) Qty,
Sum(a.qtyprice) QtyPrice
FROM tbltest a
GROUP BY a.itemid)
UPDATE x
SET x.qty = c.qty,
x.qtyprice = c.qtyprice
FROM tbltest x
JOIN cte1 c
ON x.itemid = c.itemid;
-- Step 2: keep the row with the lowest id per item and delete the rest
WITH cte
AS (SELECT a.id,
a.itemid,
qty,
qtyprice,
Row_number()
OVER(
partition BY itemid
ORDER BY id) AS rn
FROM tbltest a)
DELETE x
FROM tbltest x
JOIN cte
ON x.id = cte.id
WHERE cte.rn > 1;
END

T-SQL Duplicate Records

I am trying to delete the duplicate records: every other record returned by my SELECT query is a duplicate. tblPoints.ptUser_ID is the unique id.
SELECT *, u.usMembershipID
FROM [ABCRewards].[dbo].[tblPoints]
inner join tblUsers u on u.User_ID = tblPoints.ptUser_ID
where ptUser_ID in (select user_id from tblusers where Client_ID = 8)
and ptCreateDate >= '3/9/2016'
and ptDesc = 'December Anniversary'

Duplicates returned by an INNER JOIN usually suggest an issue with the query, but if you are certain that your join is correct then this would do it:
;WITH CTE
AS (SELECT *
, ROW_NUMBER() OVER(PARTITION BY t.ptUser_ID ORDER BY t.ptUser_ID) AS rn
FROM [ABCRewards].[dbo].[tblPoints] AS t)
/*Uncomment below to Review duplicates*/
--SELECT *
--FROM CTE
--WHERE rn > 1;
/*Uncomment below to Delete duplicates*/
--DELETE
--FROM CTE
--WHERE rn > 1;

When cleaning up duplicated data, I have always used the same query pattern: it deletes all the duplicates and keeps the one you want (the original, the most recent, whatever).
Just replace everything in [] with your table and fields.
[Field(s)ToDetectDuplications]: Put here the field(s) whose values, when identical, mark rows as duplicates.
[Field(s)ToChooseWhichDupplicationIsKept ]: Put here the field(s) that decide which duplicate is kept, for example the one with the biggest value or the most recent one.
DELETE [YourTableName]
FROM [YourTableName]
INNER JOIN (SELECT [YourTablePrimaryKey],
I = ROW_NUMBER() OVER(PARTITION BY [Field(s)ToDetectDuplications] ORDER BY [Field(s)ToChooseWhichDupplicationIsKept ] DESC)
FROM [dbo].[YourTableName]) AS T ON [YourTableName].[YourTablePrimaryKey] = T.[YourTablePrimaryKey]
AND T.I > 1
I recommend having a look at what will be deleted first. To do so, just replace the DELETE with a SELECT, as below.
SELECT T.I,
[YourTableName].*
FROM [YourTableName]
INNER JOIN (SELECT [YourTablePrimaryKey],
I = ROW_NUMBER() OVER(PARTITION BY [Field(s)ToDetectDuplications] ORDER BY [Field(s)ToChooseWhichDupplicationIsKept ] DESC)
FROM [dbo].[YourTableName]) AS T ON [YourTableName].[YourTablePrimaryKey] = T.[YourTablePrimaryKey]
AND T.I > 1
Explanation:
Here we use ROW_NUMBER() with PARTITION BY and ORDER BY to detect duplicates. PARTITION BY groups rows that share the same values. Choose the partition fields so that each partition holds exactly one row when the data is correct; bad data then shows up as partitions with more than one row. ROW_NUMBER() numbers the rows within each partition, and the ORDER BY tells it in what order to assign the numbers. A number greater than 1 means the row is a duplicate within its partition. Number 1 is kept; all others are deleted.
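The preview-then-delete workflow described here can be tried end to end with the sketch below. It uses Python's sqlite3 (SQLite 3.25+ assumed) and a made-up tblPoints table; PointID is a hypothetical primary key standing in for [YourTablePrimaryKey], and the rows are invented. The same ranking query is used first to list what would be deleted and then to delete it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblPoints (PointID INTEGER PRIMARY KEY, ptUser_ID INTEGER, "
    "ptDesc TEXT, ptCreateDate TEXT)"
)
# Made-up sample data: user 8 received the same award twice.
conn.executemany(
    "INSERT INTO tblPoints VALUES (?, ?, ?, ?)",
    [
        (1, 8, "December Anniversary", "2016-03-09"),
        (2, 8, "December Anniversary", "2016-03-10"),
        (3, 9, "December Anniversary", "2016-03-09"),
    ],
)

# One partition per (user, description); the most recent row gets I = 1.
ranked = """
    SELECT PointID,
           ROW_NUMBER() OVER (
               PARTITION BY ptUser_ID, ptDesc
               ORDER BY ptCreateDate DESC
           ) AS I
    FROM tblPoints
"""

# Preview: rows that would be deleted, with their rank.
preview = conn.execute(
    f"SELECT PointID, I FROM ({ranked}) AS T WHERE I > 1 ORDER BY PointID"
).fetchall()

# Delete exactly the rows shown in the preview.
conn.execute(
    f"DELETE FROM tblPoints WHERE PointID IN "
    f"(SELECT PointID FROM ({ranked}) AS T WHERE I > 1)"
)
kept = [row[0] for row in conn.execute("SELECT PointID FROM tblPoints ORDER BY PointID")]
print(preview, kept)
```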
Example with the OP's schema and specification
Here I attempted to fill in the pattern with guesses I have made about your database schema.
DECLARE @userID INT
SELECT @userID = 8
SELECT T.I,
[ABCRewards].[dbo].[tblPoints].*
FROM [ABCRewards].[dbo].[tblPoints]
INNER JOIN (SELECT P.[YourTablePrimaryKey],
I = ROW_NUMBER() OVER(PARTITION BY P.ptDesc, P.ptUser_ID ORDER BY P.ptCreateDate DESC)
FROM [ABCRewards].[dbo].[tblPoints] AS P
WHERE P.ptCreateDate >= '3/9/2016'
AND P.ptDesc = 'December Anniversary'
AND P.ptUser_ID = @userID
) AS T ON [ABCRewards].[dbo].[tblPoints].[YourTablePrimaryKey] = T.[YourTablePrimaryKey]
AND T.I > 1

Deleting duplicates in a time series

I have a large set of measurements taken every 1 millisecond, stored in a SQL Server 2012 table. Whenever 3 or more consecutive rows have the same value, I would like to delete the middle duplicates. Highlighted values in this image of sample data are the ones that I want to delete. Is there a way to do this with a SQL query?

You can do this using a CTE and ROW_NUMBER:
WITH CteGroup AS(
SELECT *,
grp = ROW_NUMBER() OVER(ORDER BY MS) - ROW_NUMBER() OVER(PARTITION BY Value ORDER BY MS)
FROM YourTable
),
CteFinal AS(
SELECT *,
RN_FIRST = ROW_NUMBER() OVER(PARTITION BY grp, Value ORDER BY MS),
RN_LAST = ROW_NUMBER() OVER(PARTITION BY grp, Value ORDER BY MS DESC)
FROM CteGroup
)
DELETE
FROM CteFinal
WHERE
RN_FIRST > 1
AND RN_LAST > 1

I'm sure there must be a more efficient way to do this, but you could join the table to itself twice to find the previous and next value in the list, and then delete all of the entries where all three values are the same.
DELETE FROM tbl
WHERE ms IN
(
SELECT T.ms
FROM tbl T
INNER JOIN tbl T1 ON T.ms = T1.ms + 1
INNER JOIN tbl T2 ON T.ms = T2.ms - 1
WHERE T.value = T1.value AND T.value = T2.value
)
If the table is really big, I can see this blowing tempdb though.

Yes there is
select * from table group by table.field ->value

SQL Server: join on derived table that contains WITH clause?

I'd like to join on a subquery / derived table that contains a WITH clause (the WITH clause is necessary to filter on ROW_NUMBER() = 1). In Teradata something similar would work fine, but Teradata uses QUALIFY ROW_NUMBER() = 1 instead of a WITH clause.
Here is my attempt at this join:
-- want to join row with max StartDate on JobModelID
INNER JOIN (
WITH AllRuns AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY JobModelID ORDER BY StartDate DESC) AS RowNumber
FROM Runs
)
SELECT * FROM AllRuns WHERE RowNumber = 1
) Runs
ON JobModels.JobModelID = Runs.JobModelID
What am I doing wrong?

You could define multiple CTEs in a single WITH clause. Something like
;WITH AllRuns AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY JobModelID ORDER BY StartDate DESC) AS RowNumber
FROM Runs
),
Runs AS(
SELECT *
FROM AllRuns
WHERE RowNumber = 1
)
SELECT *
FROM ... INNER JOIN
Runs ON JobModels.JobModelID = Runs.JobModelID
For more detail on the usages/structure/rules see WITH common_table_expression (Transact-SQL)

Adding a join condition is probably less efficient, but usually works fine for me.
INNER JOIN (
SELECT *,
ROW_NUMBER() OVER
(PARTITION BY JobModelID
ORDER BY StartDate DESC) AS RowNumber
FROM Runs
) Runs
ON JobModels.JobModelID = Runs.JobModelID
AND Runs.RowNumber = 1
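A complete, runnable version of this derived-table join might look like the sketch below, using Python's sqlite3 (SQLite 3.25+ assumed) with made-up rows. The Name column on JobModels is a hypothetical addition so the output is readable; only JobModelID and StartDate come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JobModels (JobModelID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Runs (JobModelID INTEGER, StartDate TEXT)")
# Made-up sample data: job 1 has two runs, job 2 has one.
conn.executemany("INSERT INTO JobModels VALUES (?, ?)", [(1, "nightly"), (2, "weekly")])
conn.executemany(
    "INSERT INTO Runs VALUES (?, ?)",
    [(1, "2016-01-01"), (1, "2016-02-01"), (2, "2016-01-15")],
)

# The window function lives in the derived table; the RowNumber = 1 filter
# is applied in the join condition, so no WITH clause is needed.
latest_runs = conn.execute(
    """
    SELECT jm.JobModelID, jm.Name, r.StartDate
    FROM JobModels AS jm
    INNER JOIN (
        SELECT JobModelID,
               StartDate,
               ROW_NUMBER() OVER (
                   PARTITION BY JobModelID
                   ORDER BY StartDate DESC
               ) AS RowNumber
        FROM Runs
    ) AS r
        ON jm.JobModelID = r.JobModelID
       AND r.RowNumber = 1
    ORDER BY jm.JobModelID
    """
).fetchall()
print(latest_runs)
```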

select top 1 with a group by

I have two columns:
namecode name
050125 chris
050125 tof
050125 tof
050130 chris
050131 tof
I want to group by namecode, and return only the name with the most number of occurrences. In this instance, the result would be
050125 tof
050130 chris
050131 tof
This is with SQL Server 2000

I usually use ROW_NUMBER() to achieve this. Note that ROW_NUMBER() requires SQL Server 2005 or later, so it is not available on SQL Server 2000. Not sure how it performs against various data sets, but we haven't had any performance issues as a result of using ROW_NUMBER.
The PARTITION BY clause specifies which value to "group" the row numbers by, and the ORDER BY clause specifies how the records within each "group" should be sorted. So count the rows per NameCode and Name, partition the result by NameCode, order each partition by that count descending, and get all records with a Row Number of 1 (that is, the first record in each partition, which is the most frequent name).
SELECT
i.NameCode,
i.Name
FROM
(
SELECT
RowNumber = ROW_NUMBER() OVER (PARTITION BY t.NameCode ORDER BY COUNT(*) DESC),
t.NameCode,
t.Name
FROM
MyTable t
GROUP BY
t.NameCode,
t.Name
) i
WHERE
i.RowNumber = 1;
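To check a "most frequent name per namecode" query against the sample data from the question, here is a small sketch on Python's sqlite3 (SQLite 3.25+ assumed). It counts first in a derived table and ranks afterwards, which avoids mixing an aggregate into the window's ORDER BY. The secondary ORDER BY on name is an assumption added only to make ties deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (namecode TEXT, name TEXT)")
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [
        ("050125", "chris"),
        ("050125", "tof"),
        ("050125", "tof"),
        ("050130", "chris"),
        ("050131", "tof"),
    ],
)

most_common = conn.execute(
    """
    SELECT namecode, name
    FROM (
        SELECT namecode,
               name,
               ROW_NUMBER() OVER (
                   PARTITION BY namecode
                   ORDER BY cnt DESC, name   -- name only breaks ties
               ) AS rn
        FROM (
            SELECT namecode, name, COUNT(*) AS cnt
            FROM MyTable
            GROUP BY namecode, name
        ) AS counted
    ) AS ranked
    WHERE rn = 1
    ORDER BY namecode
    """
).fetchall()
print(most_common)
```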

select distinct namecode
, (
select top 1 name
from myTable i
where i.namecode = o.namecode
group by namecode, name
order by count(*) desc
) as name
from myTable o

SELECT max_table.namecode, count_table2.name
FROM
(SELECT namecode, MAX(count_name) AS max_count
FROM
(SELECT namecode, name, COUNT(name) AS count_name
FROM mytable
GROUP BY namecode, name) AS count_table1
GROUP BY namecode) AS max_table
INNER JOIN
(SELECT namecode, COUNT(name) AS count_name, name
FROM mytable
GROUP BY namecode, name) count_table2
ON max_table.namecode = count_table2.namecode AND
count_table2.count_name = max_table.max_count

I did not try but this should work,
select top 1 t2.* from (
select namecode, count(*) count from temp
group by namecode) t1 join temp t2 on t1.namecode = t2.namecode
order by t1.count desc

Here are two examples that you could use. The temp table version was more efficient than the view, but this was measured on a small data sample, so you would want to check your own statistics.
--Creating A View
GO
CREATE VIEW StateStoreSales AS
SELECT t.state,t.stor_id,t.stor_name,SUM(s.qty) 'TotalSales'
,ROW_NUMBER() OVER (PARTITION BY t.state ORDER BY SUM(s.qty) DESC) AS 'Rank'
FROM [dbo].[sales] s
JOIN [dbo].[stores] t ON (s.stor_id = t.stor_id)
GROUP BY t.state,t.stor_id,t.stor_name
GO
SELECT * FROM StateStoreSales
WHERE Rank <= 1
ORDER BY TotalSales Desc
DROP VIEW StateStoreSales
---Using a Temp Table
SELECT t.state,t.stor_id,t.stor_name,SUM(s.qty) 'TotalSales'
,ROW_NUMBER() OVER (PARTITION BY t.state ORDER BY SUM(s.qty) DESC) AS 'Rank' INTO #TEMP
FROM [dbo].[sales] s
JOIN [dbo].[stores] t ON (s.stor_id = t.stor_id)
GROUP BY t.state,t.stor_id,t.stor_name
SELECT * FROM #TEMP
WHERE Rank <= 1
ORDER BY TotalSales Desc
DROP TABLE #TEMP
