How to delete Duplicate records in snowflake database table - snowflake-cloud-data-platform

how to delete the duplicate records from snowflake table. Thanks
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange

Adding here a solution that doesn't recreate the table. This because recreating a table can break a lot of existing configurations and history.
Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't modify the original table
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction

If you have some primary key as such:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
as then
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key then you cannot delete that way. At which point a
CREATE TABLE new_table_name AS
SELECT id, name FROM (
SELECT id
,name
,ROW_NUMBER() OVER (PARTITION BY id, name) AS rn
FROM table_name
)
WHERE rn > 1
and then swap them
ALTER TABLE table_name SWAP WITH new_table_name

Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.

Snowflake does not have effective primary keys, their use is primarily with ERD tools.
Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.
It is possible to temporarily add a "is_duplicate" column, eg. numbering all the duplicates with the ROW_NUMBER() function, and then delete all records with "is_duplicate" > 1 and finally delete the utility column.
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance to add update timestamp columns in a Snowflake Data Warehouse.

this has been bothering me for some time as well. As snowflake has added support for qualify you can now create a dedupped table with a single statement without subselects:
CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;
Of course you are left with a new table and loose table history, primary keys, foreign keys and such.

Based on above ideas.....following query worked perfectly in my case.
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;

Your question boils down to: How can I delete one of two perfectly identical rows? . You can't. You can only do a DELETE FROM fruit where ID = 1 and Name = 'Apple';, then both rows will go away. Or you don't, and keep both.
For some databases, there are workarounds using internal rows, but there isn't any in snowflake, see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes, either, so your only option is to create a new table and swap.
Additional Note on Hans Henrik Eriksen's remark on the importance of update timestamps: This is a real help when the duplicates where added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);

simple UNION eliminate duplicates on use case of just all columns/no pks.
anyway problem should he solved as early on ingestion pipeline, and/or use scd etc.
Just a raw magic best way how to delete is wrong in principle, use scd with high resolution timestamp, solves any problem.
you want fix massive dups load ? then add column like batch id and remove all batch loaded records
Its like being healthy, you have 2 approaches:
eat a lot > get far > go-to a gym to burn it
eat well > have healthy life style and no need for gym.
So before discussing best gym, try change life style.
hope this helps, learn to do pressure upstream on data producers instead of living like jesus christ trying to clean up the mess of everyone.

The following solution is effective if you are looking at one or few columns as primary key references for the table.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]n = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;

This is inspired by #Felipe Hoffa's answer:
##create table with dupes and take the max id
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
)
##join back to the original table on the field excluding the ID in the duplicate table and delete.
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id

Not sure if people are still interested in this but I've used the below query which is more elegant and seems to have worked
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1

Related

Select latest row on duplicate values while transfering table?

I have a logging table that is live which saves my value to a table frequently.
My plan is to take those values and put them on a temporary table with
SELECT * INTO #temp from Block
From there I guess my block table is empty and the logger can keep on logging new values.
The next step is that I want to save them in a existing table. I wanted to use
INSERT INTO TABLENAME(COLUMN1,COLUMN2...) SELECT (COLUMN1,COLUMN2...) FROM #temp
The problem is that the #temp table has duplicates primary keys. And I only want to store the last ID.
I have tried DISTINCT but it didn't work. Could not get ROW_Count to work. Are there any ideas on how I should do it? I wish to make it with as few reads as possible.
Also, in the future I plan to send them to another database, how do I do that on SQL Server? I guess it's something like FROM Table [in databes]?
I couldn't get the blocks to copy. But here goes:
create TABLE Product_log (
Grade char(64),
block_ID char(64) PRIMARY KEY NOT NULL,
Density char(64),
BatchNumber char(64) NOT NULL,
BlockDateID Datetime
);
That is my table i want to store the data in. There I do not wish to have duplicates on the id. The problem is, while logging I get duplicates since I log on change. Lets say that the batchid is 1, if it becomes 2 while logging. I will get a blockid twice, both with batch number 1 and 2. How do I pick the latter?
Hope I explained enough for guidance. While logging they look like this:
id SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE SiemensTiaV15_s71200_BatchTester_TestWriteValue_VALUE SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP SiemensTiaV15_s71200_MainTank_Density_VALUE SiemensTiaV15_s71200_MainTank_Grade_VALUE
1 00545 S0047782 2020-06-09 11:18:44.583 0 xxxxx
2 00545 S0047783 2020-06-09 11:18:45.800 0 xxxxx
Please use below query,
select * from
(select id, SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE,SiemensTiaV15_s71200_BatchTester_TestWriteValue_VALUE, SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP, SiemensTiaV15_s71200_MainTank_Density_VALUE,SiemensTiaV15_s71200_MainTank_Grade_VALUE,
row_number() over (partition by SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE order by SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP desc) as rnk
from table_name) qry
where rnk=1;
INTO #temp FROM Block; INSERT INTO Product_log(Grade, block_ID, Density, BatchNumber, BlockDateID)
selct NewBatchIDValue_VALUE, TestWriteValue_VALUE, TestWriteValue_TIMESTAMP,
Density_VALUE, Grade_VALUE from
(select NewBatchIDValue_VALUE, TestWriteValue_VALUE,
TestWriteValue_TIMESTAMP, Density_VALUE, Grade_VALUE, row_number() over
(partition by BatchTester_NewBatchIDValue_VALUE order by
BatchTester_TestWriteValue_VALUE) as rnk from #temp) qry
where rnk = 1;

Create trigger to keep the latest record

I have a Product table which keeps on adding rows with product_id and price . It has millions of rows.
It has a product_id as Primary key like below.
CREATE TABLE ProductPrice(
product_id VARCHAR2(10),
prod_date DATE ,
price NUMBER(8,0) ,
PRIMARY KEY (product_id)
)
Now this has millions of rows and to get the latest price it get a lot of time.
So to manage the latest price, I have created another table which will keep only the latest price with same format.
CREATE TABLE ProductPriceLatest(
product_id VARCHAR2(10),
prod_date DATE ,
price NUMBER(8,0) ,
PRIMARY KEY (product_id)
)
And on every insert on original table, i will write a trigger which will update the row in this table.
But how can i get the newly inserted values inside the trigger body?
I have tried something like this:
CREATE OR REPLACE TRIGGER TRIG_HISTory
AFTER INSERT
on ProductPriceLatest
FOR EACH ROW
DECLARE
BEGIN
UPDATE latest_price
SET price = NEW.price ,
WHERE product_id = NEW.product_id ;
END;
Thanks in advance.
You need to use the :new keyword to differentiate with :old values. Also, better use AFTER trigger:
CREATE OR REPLACE TRIGGER TRIG_HISTORY
AFTER INSERT ON source_table_name
FOR EACH ROW
DECLARE
BEGIN
MERGE INTO dest_table_name d
USING (select :new.price p, :new.product_id p_id from dual) s
ON (d.product_id = s.p_id)
WHEN MATCHED THEN
UPDATE SET d.price = s.p
WHEN NOT MATCHED THEN
INSERT (price, product_id)
VALUES (s.p, s.p_id);
END;
Retrieving the latest price from your first table should be fast if you have the correct index. Building the correct index on your ProductPrice table is a far better solution to your problem than trying to maintain a separate table.
Your query to get the latest prices would look like this.
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
SELECT product_id, MAX(prod_date) latest_prod_date
FROM ProductPrice
GROUP BY product_id
) m ON p.product_id = m.product_id
AND p.prod_date = m.latest_prod_date
WHERE p.product_id = ????
This works because the subquery looks up the latest product date for each product. It then uses that information to find the right row in the table to show you.
If you create a compound index on (product_id, prod_date, price) this query will run almost miraculously fast. That's because the query planner can find the correct index item in O(log n) time or better.
You can make it into a view like this:
CREATE OR REPLACE VIEW ProductPriceLatest AS
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
SELECT product_id, MAX(prod_date) latest_prod_date
FROM ProductPrice
GROUP BY product_id
) m ON p.product_id = m.product_id
AND p.prod_date = m.latest_prod_date;
Then you can use the view like this:
SELECT * FROM ProductPriceLatest WHERE product_id = ???
and get the same high performance.
This is easier, less error-prone, and just as fast as creating a separate table and maintaining it. By the way, DBMS jargon for the table you propose to create is materialized view.

How to restrict duplicate record to insert into table in snowflake

I have created below table with primary key in snowflake and whenever i am trying to insert data into this table, it's allow duplicate records also.
How to restrict duplicate id ?
create table tab11(id int primary key not null,grade varchar(10));
insert into tab11 values(1,'A');
insert into tab11 values(1,'B');
select * from tab11;
Output: Inserted duplicate records.
ID GRADE
1 A
1 B
Snowflake allows you to identify a column as a Primary Key but it doesn't enforce uniqueness on them. From the documentation here:
Snowflake supports defining and maintaining constraints, but does not enforce them, except for NOT NULL constraints, which are always enforced.
A Primary Key in Snowflake is purely for informative purposes. I'm not from Snowflake, but I imagine that enforcing uniqueness in Primary Keys does not really align with how Snowflake stores data behind the scenes and it probably would impact insertion speed.
You may want to look at using a merge statement to handle what happens when a row with a duplicate PK arrives:
create table tab1(id int primary key not null, grade varchar(10));
insert into tab1 values(1, 'A');
-- Try merging values 1, and 'B': Nothing will be added
merge into tab1 using
(select * from (values (1, 'B')) x(id, grade)) tab2
on tab1.id = tab2.id
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
-- Try merging values 2, and 'B': New row added
merge into tab1 using
(select * from (values (2, 'B')) x(id, grade)) tab2
on tab1.id = tab2.id
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
-- If instead of ignoring dupes, we want to update:
merge into tab1 using
(select * from (values (1, 'F'), (2, 'F')) x(id, grade)) tab2
on tab1.id = tab2.id
when matched then update set tab1.grade = tab2.grade
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
For more complex merges, you may want to investigate using Snowflake streams (change data capture tables). In addition to the documentation, I have created a SQL script walk through of how to use a stream to keep a staging and prod table in sync:
https://snowflake.pavlik.us/index.php/2020/01/12/snowflake-streams-made-simple
You could try to use SEQUENCE to fit your requirement
https://docs.snowflake.net/manuals/user-guide/querying-sequences.html#using-sequences
Snowflake does NOT enforce unique constraints, hence you can only mitigate the issue by:
using a SEQUENCE to populate the column you want to be unique;
defining the column as NOT NULL (which is effectively enforced);
using a stored procedure where you can programmatically ensure no duplicates are introduced;
using a stored procedure (which could be run by scheduled Task possibly) to de-duplicate on a regular basis;
You will have to check for duplicates yourself during the insertion (within your INSERT query).
Greg Pavlik's answer using a MERGE query is one way to do it, but you can also achieve the same result with an INSERT query (if you don't plan on updating the existing rows -- if you do, use MERGE instead)
The idea is to insert with a SELECT that checks for the existence of those keys first, along with a window function to qualify the records and remove duplicates from the insert data itself. Here's an example:
INSERT INTO tab11
SELECT *
FROM (VALUES
(1,'A'),
(1,'B')
) AS t(id, grade)
-- Make sure the table doesn't already contain the IDs we're trying to insert
WHERE id NOT IN (
SELECT id FROM tab11
)
-- Make sure the data we're inserting doesn't contain duplicate IDs
-- If it does, only the first record will be inserted (based on the ORDER BY)
-- Ideally, we would want to order by a timestamp to select the latest record
QUALIFY ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY grade ASC
) = 1;
Alternatively, you can achieve the same result with a LEFT JOIN instead of a WHERE NOT IN (...) -- but it doesn't make a big difference unless your table is using a composite primary key (so that you can join on multiple keys).
INSERT INTO tab11
SELECT t.id, t.grade
FROM (VALUES
(1,'A'),
(1,'B')
) AS t(id, grade)
LEFT JOIN tab11
ON tab11.id = t.id
-- Insert only if no match is found in the join (i.e. ID doesn't exit)
WHERE tab11.id IS NULL
QUALIFY ROW_NUMBER() OVER (
PARTITION BY t.id
ORDER BY t.grade ASC
) = 1;
Side note: Snowflake is an OLAP database (as opposed to OLTP), and hence is designed for analytical queries & bulk operations (as opposed to operations on individual records). It's not a good idea to insert records one at a time in your table; instead, you should ingest data in bulk into a landing/staging table (possibly using Snowpipe), and use the data in that table to update your destination table (ideally using a table stream).
Snowflake documentation says it doesnt enforce the constraint.
https://docs.snowflake.com/en/sql-reference/constraints-overview.html
Instead of the load script process to fail, I would rather try and use merge. I have not used merge statements in snowflake yet. For other nosql databases, I have used merge statements instead of insert.

SQL Server: Find duplicates using group by and having count than delete them all but not first [duplicate]

I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col1, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer CTE for deleting duplicate rows from sql server table
strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
by keeping original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
I my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!
PS: when working on things like this I always use a transaction, this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But off course you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
it deleted 1M rows in little more than 30sec from a table of 2M (50% duplicates)
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn1
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one easy to grasp and seems to be effective for most of the similar problems. It is for SQL Server though but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why its hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two mysql examples.
In the future are you going to prevent it at a database level, or from an application perspective. I would suggest the database level because your database should be responsible for maintaining referential integrity, developers just will cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is Create a new table with same fields and with Unique Index. Then move all data from old table to new table. Automatically SQL SERVER ignore (there is also an option about what to do if there will be a duplicate value: ignore, interrupt or sth) duplicate values. So we have the same table without duplicate rows. If you don't want Unique Index, after the transfer data you can drop it.
Especially for larger tables you may use DTS (SSIS package to import/export data) in order to transfer all data rapidly to your new uniquely indexed table. For 7 million row it takes just a few minute.
By useing below query we can able to delete duplicate records based on the single column or multiple column. below query is deleting based on two columns. table name is: testing and column names empno,empname
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE #cnt AS INT
SELECT #cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF #cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1

How can I remove duplicate rows?

I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col1, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer CTE for deleting duplicate rows from sql server table
strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
by keeping original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
I my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!
PS: when working on things like this I always use a transaction, this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But off course you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
it deleted 1M rows in little more than 30sec from a table of 2M (50% duplicates)
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn1
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one easy to grasp and seems to be effective for most of the similar problems. It is for SQL Server though but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why its hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two mysql examples.
In the future are you going to prevent it at a database level, or from an application perspective. I would suggest the database level because your database should be responsible for maintaining referential integrity, developers just will cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is Create a new table with same fields and with Unique Index. Then move all data from old table to new table. Automatically SQL SERVER ignore (there is also an option about what to do if there will be a duplicate value: ignore, interrupt or sth) duplicate values. So we have the same table without duplicate rows. If you don't want Unique Index, after the transfer data you can drop it.
Especially for larger tables you may use DTS (SSIS package to import/export data) in order to transfer all data rapidly to your new uniquely indexed table. For 7 million row it takes just a few minute.
By useing below query we can able to delete duplicate records based on the single column or multiple column. below query is deleting based on two columns. table name is: testing and column names empno,empname
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE #cnt AS INT
SELECT #cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF #cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1

Resources