How can I remove duplicate rows? - sql-server
I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
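For illustration, here is what the full statement might look like against a hypothetical table keyed by a GUID column (MyGuidTable and MyGuidColumn are placeholder names; the round-trip through char(36) is needed because MIN does not accept uniqueidentifier directly):
DELETE MyGuidTable
FROM MyGuidTable
LEFT OUTER JOIN (
    SELECT CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn))) AS MyGuidColumn,
           Col1, Col2, Col3
    FROM MyGuidTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON
    MyGuidTable.MyGuidColumn = KeepRows.MyGuidColumn
WHERE
    KeepRows.MyGuidColumn IS NULL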
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order, for example, you could use ORDER BY RowID DESC, as in the sketch below.
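A minimal sketch of that variant, keeping the row with the highest RowID in each duplicate group:
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                       ORDER BY RowID DESC) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;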
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be:
- no useful index on the partitioning columns
- relatively few groups, with relatively many duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table, then TRUNCATE-ing the original and copying them back, to minimise logging compared to deleting a very high proportion of the rows. A sketch follows.
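A rough sketch of that minimal-logging route, using the question's schema (it assumes no foreign keys reference MyTable, since TRUNCATE requires that):
-- keep one row per group in a holding table
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO #KeepRows
FROM MyTable
GROUP BY Col1, Col2, Col3;

TRUNCATE TABLE MyTable;

-- copy the keepers back, preserving their original identity values
SET IDENTITY_INSERT MyTable ON;
INSERT INTO MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3 FROM #KeepRows;
SET IDENTITY_INSERT MyTable OFF;

DROP TABLE #KeepRows;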
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not nullable, SQL Server treats MAX(ID) as potentially
nullable. Because of the semantics of NOT IN against NULL, including the
clause below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, and ORDER BY in one query, and returns the duplicated column values with their counts.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
SQL Server:
delete t1
from tableName t1, tableName t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
Postgres:
delete
from tableName t1
using tableName t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
             ROW_NUMBER()
               OVER (
                 PARTITION BY col1, col2, col3
                 ORDER BY rowid DESC) [Row]
      FROM mytable) LU
WHERE [Row] > 1
This will delete duplicate rows, keeping only the first row (the lowest RowID) in each group:
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Reference: http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server
I would prefer a CTE for deleting duplicate rows from a SQL Server table; I strongly recommend this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery + HAVING COUNT(*) > 1 solution to the inner join because I found it easier to read, and it was very easy to turn into a SELECT statement to verify what would be deleted before running it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!
PS: when working on things like this I always use a transaction. This not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
It deleted 1M rows in a little more than 30 seconds from a table of 2M rows (50% duplicates).
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey,
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn2
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. It is easy to grasp and seems to be effective for most similar problems. The example targets SQL Server, but the concept carries over.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete the duplicate data?
First, add an identity column to that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate records, keeping the first row (lowest id) of each group:
DELETE FROM tblemp WHERE id NOT IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp table solution, and two MySQL examples.
Going forward, are you going to prevent duplicates at the database level, or from the application? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)
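At the database level that usually means a unique constraint or index on the columns that define a duplicate; a minimal sketch against the question's schema (note that the 900-byte index key limit mentioned in a later answer can rule this out for wide columns such as Col2):
ALTER TABLE MyTable
ADD CONSTRAINT UQ_MyTable_NoDupes UNIQUE (Col1, Col2, Col3);
-- any later INSERT of an existing (Col1, Col2, Col3) combination now fails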
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 ) -- note: only removes one copy from groups that have exactly two
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
Another way is to create a new table with the same fields and a unique index, then move all the data from the old table into it. SQL Server automatically ignores the duplicate values (there is also an option controlling what happens on a duplicate value: ignore, interrupt, and so on), so you end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data. A sketch follows.
Especially for larger tables you may use DTS (an SSIS package to import/export data) to transfer all the data rapidly to your new uniquely indexed table. For 7 million rows it takes just a few minutes.
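The "ignore" behaviour referred to above is the IGNORE_DUP_KEY index option. A sketch of the whole move, assuming a narrower duplicate key than the full row (Col2 at varchar(2048) is too wide to take part in the index key, so this illustration deduplicates on Col1 and Col3 only):
-- new table with the same fields (hypothetical name)
CREATE TABLE MyTableClean (
    RowID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1 varchar(20) NOT NULL,
    Col2 varchar(2048) NOT NULL,
    Col3 tinyint NOT NULL
);

-- duplicates are silently discarded on insert instead of raising an error
CREATE UNIQUE INDEX UX_MyTableClean ON MyTableClean (Col1, Col3)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO MyTableClean (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 FROM MyTable;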
Using the query below we can delete duplicate records based on a single column or on multiple columns. The query below deletes based on two columns; the table name is testing and the column names are empno and empname:
;WITH dups AS
(SELECT empno, empname,
        ROW_NUMBER() OVER (PARTITION BY empno, empname ORDER BY empno) AS [ItemNumber]
 FROM testing)
DELETE FROM dups WHERE ItemNumber > 1
Create a new blank table with the same structure.
Then execute a query like this (SELECT * is not allowed with GROUP BY, so the grouped columns are listed explicitly; this assumes they are the table's only columns):
INSERT INTO tc_category1
SELECT category_id, application_id
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT category_id, application_id
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this:
DELETE A
FROM MyTable A,
MyTable B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well, as it can be helpful and works in all versions of SQL Server:
Quite often there are only one or two duplicates, and their Ids and counts are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
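On SQL Server 2005 and later the same idea can be written with DELETE TOP, which avoids SET ROWCOUNT (deprecated for data-modification statements); @DuplicatedID is a placeholder for the known id:
DELETE TOP (1) -- or the known number of rows to remove
FROM myTable
WHERE RowId = @DuplicatedID;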
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005 an index key is limited to 900 bytes, and my varchar(2048) field blows past that.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE @cnt AS INT
SELECT @cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF @cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
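As an alternative to the trigger, uniqueness on a wide column can sometimes be enforced by indexing a hash of it instead. A sketch, assuming the column fits HASHBYTES' input limit (8,000 bytes prior to SQL Server 2016) and accepting that the index compares hash values, so a rare collision would be rejected as a false positive:
-- persisted computed column holding a SHA-1 hash of the wide column
ALTER TABLE stories
    ADD story_hash AS CAST(HASHBYTES('SHA1', story) AS binary(20)) PERSISTED;

CREATE UNIQUE INDEX UX_stories_story_hash ON stories (story_hash);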
DELETE T1
FROM
table_name T1
WHERE
T1.rowid > (
SELECT
min(T2.rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
Related
Is there a way to tell SQL Server to check the table for a duplicate before inserting each new row?
I tried using the SQL below to insert values from one table, importTable, into another table, POInvoicing. It appears that the way this query works is that it checks the POInvoicing table for any possible duplicates from the importTable and, for those entries that are not duplicates, inserts them into the table. The end result is SQL inserting duplicates that already exist in importTable.

Is there a way to tell SQL Server to check the table for a possible duplicate entry and, if there is none, add the next row; then check the table again for a duplicate before adding the row after that? I know this will be slower, but speed isn't an issue.

INSERT INTO POInvoicing (VendorID, InvoiceNo)
SELECT dbo.importTable.VendorID, dbo.importTable.InvoiceNo
FROM dbo.importTable
WHERE NOT EXISTS (SELECT VendorID, InvoiceNo
                  FROM POInvoicing
                  WHERE POInvoicing.VendorID = dbo.importTable.VendorID
                  AND POInvoicing.InvoiceNo = dbo.importTable.InvoiceNo)

This isn't exactly the functionality I was hoping for. What I want is for the query to insert a row into the table and then check for "duplicates" before inserting the next row. What constitutes a duplicate in the importTable is the combination of VendorID and InvoiceNo. There are about a dozen different columns in importTable and technically each row is distinct, so DISTINCT won't work here. I can't simply remove duplicates from the importTable for a couple of reasons not relevant to the question above (though I can provide them if necessary), so that method is out.
If you really don't care (or refuse to tell us) how you want to decide between two rows with the same VendorID and InvoiceNo values, you can pick an arbitrary row like this:

;WITH NewRows AS
(
  SELECT VendorID, InvoiceNo, InvoiceDate, /* ... other columns ... */
         rn = ROW_NUMBER() OVER (PARTITION BY VendorID, InvoiceNo ORDER BY (SELECT NULL))
  FROM dbo.importTable AS i
  WHERE NOT EXISTS (SELECT 1 FROM dbo.POInvoicing AS p
                    WHERE p.VendorID = i.VendorID
                    AND p.InvoiceNo = i.InvoiceNo)
)
INSERT dbo.POInvoicing(VendorID, InvoiceNo, InvoiceDate /* , ... other columns ... */)
SELECT VendorID, InvoiceNo, InvoiceDate /* , ... other columns */
FROM NewRows
WHERE rn = 1;

If you later decide there is a specific row you want in the case of duplicates, you can swap out (SELECT NULL) for something else. For example, to take the row with the latest invoice date:

OVER (PARTITION BY VendorID, InvoiceNo ORDER BY InvoiceDate DESC)

Again, I wasn't asking questions here to be annoying; it was to help you get the solution you need. If you want SQL Server to pick between two duplicates, you can either tell it how to pick, or you'll have to accept arbitrary / non-deterministic results. You should not jump the fence to looping / cursors just because the first thing you tried didn't work the way you wanted it to. Also, please always specify the schema and use sensible table aliases.
Add a primary key constraint or unique key constraint to your table to avoid duplicate data insertion. You can also use the DISTINCT keyword in your select query; duplicate rows can likewise be eliminated using GROUP BY or the ROW_NUMBER() function in SQL.

Using the DISTINCT keyword:

INSERT INTO POInvoicing (VendorID, InvoiceNo, InvoiceDate)
SELECT DISTINCT dbo.importTable.VendorID,
       dbo.importTable.InvoiceNo,
       dbo.importTable.InvoiceDate
FROM dbo.importTable
WHERE NOT EXISTS (SELECT VendorID, InvoiceNo
                  FROM POInvoicing
                  WHERE POInvoicing.VendorID = dbo.importTable.VendorID
                  AND POInvoicing.InvoiceNo = dbo.importTable.InvoiceNo)

Or try this INNER JOIN:

INSERT INTO POInvoicing (VendorID, InvoiceNo, InvoiceDate)
SELECT IM.VendorID, IM.InvoiceNo, IM.InvoiceDate
FROM dbo.importTable IM
INNER JOIN POInvoicing S
        ON S.VendorID <> IM.VendorID
       AND S.InvoiceNo <> IM.InvoiceNo
Copy whole row excluding identifier column
I'm trying to insert a new row into a table which is an exact copy of another row except for the identifier. Previously I didn't have this issue because there was an ID column which didn't fill automatically, so I could just copy a row like this:

INSERT INTO table1
SELECT * FROM table1 WHERE Id = 5

And then manually change the ID like this:

WITH tbl AS
(SELECT ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Id) AS RNr, Id
 FROM table1
 WHERE Id = 5)
UPDATE tbl
SET Id = (SELECT MAX(Id) FROM table1) + 1
WHERE RNr = 2

Recently the column Id was changed so that it no longer has to be filled manually (which I also like better). But this leads to the error that I obviously can't fill that column while IDENTITY_INSERT is false. Unfortunately, I don't have the right to use SET IDENTITY_INSERT IdentityTable ON/OFF before and after the statement, and I would like to avoid that anyway. I'm looking for a way to insert the whole row excluding the Id column without naming every other column in the INSERT and SELECT statements (there are quite a lot of them).
In the code below, one is added to the maximum value of the ID, so integrity is not violated:

insert into table1
select a.Top_ID, Column2, Column3, ColumnX
from table1 t1
outer apply (select max(id_column) + 1 as Top_ID from table1) as a
where t1.Id = 1
Okay, I found a way by using a temporary table:

SELECT * INTO #TmpTbl
FROM table1 WHERE Id = 5;

ALTER TABLE #TmpTbl
DROP COLUMN Id;

INSERT INTO table1
SELECT * FROM #TmpTbl;

DROP TABLE #TmpTbl
SQL Server: Run separate CTE for each record without cursor or function
Given table A with a column LocationID and many records: is it possible to fully run a CTE for each record, without using a cursor (while/fetch loop) or a function (via cross apply)?

I can't run a CTE from table A, because the CTE will go deep through a parent-child hierarchical table (ParentID, ChildID) to find all descendants of a specific type, for each LocationID of table A. It seems that if I do the CTE using table A, it will mix the children of all the LocationIDs in table A.

Basically I need to separately run a CTE for each LocationID of table A, and put the results in a table with LocationID and ChildID columns (the LocationIDs are the ones from table A and the ChildIDs are all descendants of a specific type found via the CTE).
This is your basic layout:

;with CTE AS
(
  select .......
)
select *
from CTE
cross apply (select distinct Location from TableA) a
where CTE.Location = a.Location

Some sample data and expected results would allow a better answer.
You could do something like this:

Declare @LocationID As Int

Select LocationID, 0 as Processed
Into #Temp_Table
From TableA

While Exists (Select Top 1 1 From #Temp_Table Where Processed = 0)
Begin
    Select Top 1 @LocationID = LocationID
    From #Temp_Table
    Where Processed = 0
    Order By LocationID

    /* Do your processing here */

    Update #Temp_Table
    Set Processed = 1
    Where LocationID = @LocationID
End

It's still RBAR but (in my environment, at least) it's way faster than a cursor.
I was able to find a solution. I just had to keep the original LocationID as a reference; then in the CTE results, which will include all possible records as it goes deep into the list, I apply the filter I need. Yes, all records are mixed in the results, but because the reference to the origin table's LocationID was kept (as OriginalParentID) I'm still able to retrieve it.

;WITH CTE AS
(
    --Original list of parents
    SELECT a.LocationID AS OriginalParentID, l.ParentID, l.ChildID, l.ChildType
    FROM TableA a
    INNER JOIN tblLocationHierarchy l ON l.ParentID = a.LocationID

    UNION ALL

    --Getting all descendants of the original list of parents
    SELECT CTE.OriginalParentID, l.ParentID, l.ChildID, l.ChildType
    FROM tblLocationHierarchy l
    INNER JOIN CTE ON CTE.ChildID = l.ParentID
)
SELECT OriginalParentID, ChildID
FROM CTE
--Filtering is done here
WHERE ChildType = ...
How to select previous record in sql server 2008
Hello everyone. I would like to ask how I can select the previous record in SQL Server 2008, as in the image below: if I am standing on "ACCOUNTING MANAGER" I would like to select "SELLER".
It's better to have a primary key in the table, so create a primary key in your table. Then:

select top 1 t1.*
from table1 t1, table1 t2
where t1.primaryKey = t2.primaryKey - 1
order by t1.primaryKey desc
You can try this. First of all you need to add an identity column:

ALTER TABLE Your_TableName
ADD AUTOID INT IDENTITY(1,1)

Then you need to find the row id of the record you are standing on right now (i.e. "ACCOUNTING MANAGER"):

declare @RowID INT
Set @RowID = (Select AUTOID from Your_TableName where JOBDESC = 'ACCOUNTING MANAGER')

And then:

select * from Your_TableName where AUTOID = (@RowID - 1)

@RowID - 1 if you want the previous record; @RowID + 1 if you want the next record.
Use a CTE: create a row number using the ROW_NUMBER function:

with temp as
(select *, row_number() over (order by [tooday] asc) as rn
 from tablename)
select t1.jobDESC
from temp t1
join temp t2 on t1.rn = t2.rn - 1
where t2.jobDESC = 'ACCOUNTING MANAGER'

Note: change tablename to your table name.