Related
I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col1, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer CTE for deleting duplicate rows from sql server table
strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
by keeping original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
I my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!
PS: when working on things like this I always use a transaction, this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But off course you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
it deleted 1M rows in little more than 30sec from a table of 2M (50% duplicates)
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn1
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one easy to grasp and seems to be effective for most of the similar problems. It is for SQL Server though but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why its hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two mysql examples.
In the future are you going to prevent it at a database level, or from an application perspective. I would suggest the database level because your database should be responsible for maintaining referential integrity, developers just will cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is Create a new table with same fields and with Unique Index. Then move all data from old table to new table. Automatically SQL SERVER ignore (there is also an option about what to do if there will be a duplicate value: ignore, interrupt or sth) duplicate values. So we have the same table without duplicate rows. If you don't want Unique Index, after the transfer data you can drop it.
Especially for larger tables you may use DTS (SSIS package to import/export data) in order to transfer all data rapidly to your new uniquely indexed table. For 7 million row it takes just a few minute.
By useing below query we can able to delete duplicate records based on the single column or multiple column. below query is deleting based on two columns. table name is: testing and column names empno,empname
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE #cnt AS INT
SELECT #cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF #cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
I need to find if two rows (one having the same id of the other +50000) are the same. Is there any way to make this query work?
select 1
from table1 c1,
table2 c2
where c1.id=c2.id+50000 and CHECKSUM(c1.*) = CHECKSUM(c2.*)
CHECKSUM() apparently does not accept "table.*" expressions. It accepts either "*" alone or list of columns, but I can't do that as this query needs to work also for other tables with other columns.
EDIT: I just realized that CHECKSUM() will not work as the value will always be different if the IDs are different....
The original question still holds out of curiosity.
Try something like this, it will work for most datatypes (not TEXT and some others):
SELECT 1
FROM
table1 c1
JOIN
table2 c2
ON
c1.id=c2.id+50000 and
EXISTS(SELECT c1.col1, col2, col3, col4 EXCEPT SELECT c2.col1, col2, col3, col4)
You can do it using derived tables:
SELECT
SUM(CASE
WHEN a.cs <> b.cs THEN 1
ELSE 0
END)
FROM (SELECT RowNumber, CHECKSUM(*) AS cs FROM #A) a
FULL OUTER JOIN (SELECT RowNumber, CHECKSUM(*) AS cs FROM #B) b
ON a.RowNumber=b.RowNumber;
This is an excerpt from a script I've written previously. I have not changed any of the object names to match your example. The result of this query is the number of differences between #A and #B where the RowNumber columns match.
To apply to your need, you can create two temporary tables, populating them from the originals, but replacing the ID column with a "RowNumber" column that matches between the rows you want to match (ie: c1.id=c2.id+50000). That way, you don't have mis-matched IDs to interfere with the CHECKSUM.
I need to get some items from database with top three comments for each item.
Now I have two stored procedures GetAllItems and GetTopThreeeCommentsByItemId.
In application I get 100 items and then in foreach loop I call GetTopThreeeCommentsByItemId procedure to get top three comments.
I know that this is bad from performance standpoint.
Is there some technique that allows to get this with one query?
I can use OUTER APPLY to get one top comment (if any) but I don't know how to get three.
Items {ItemId, Title, Description, Price etc.}
Comments {CommentId, ItemId etc.}
Sample data that I want to get
Item_1
-- comment_1
-- comment_2
-- comment_3
Item_2
-- comment_4
-- comment_5
One approach would be to use a CTE (Common Table Expression) if you're on SQL Server 2005 and newer (you aren't specific enough in that regard).
With this CTE, you can partition your data by some criteria - i.e. your ItemId - and have SQL Server number all your rows starting at 1 for each of those "partitions", ordered by some criteria.
So try something like this:
;WITH ItemsAndComments AS
(
SELECT
i.ItemId, i.Title, i.Description, i.Price,
c.CommentId, c.CommentText,
ROW_NUMBER() OVER(PARTITION BY i.ItemId ORDER BY c.CommentId) AS 'RowNum'
FROM
dbo.Items i
LEFT OUTER JOIN
dbo.Comments c ON c.ItemId = i.ItemId
WHERE
......
)
SELECT
ItemId, Title, Description, Price,
CommentId, CommentText
FROM
ItemsAndComments
WHERE
RowNum <= 3
Here, I am selecting up to three entries (i.e. comments) for each "partition" (i.e. for each item) - ordered by the CommentId.
Does that approach what you're looking for??
You can write a single stored procedure which calls GetAllItems and GetTopThreeeCommentsByItemId, takes results in temp tables and join those tables to produce the single resultset you need.
If you do not have a chance to use a stored procedure, you can still do the same by running a single SQL script from data access tier, which calls GetAllItems and GetTopThreeeCommentsByItemId and takes results into temp tables and join them later to return a single resultset.
This gets two elder brother using OUTER APPLY:
select m.*, elder.*
from Member m
outer apply
(
select top 2 ElderBirthDate = x.BirthDate, ElderFirstname = x.Firstname
from Member x
where x.BirthDate < m.BirthDate
order by x.BirthDate desc
) as elder
order by m.BirthDate, elder.ElderBirthDate desc
Source data:
create table Member
(
Firstname varchar(20) not null,
Lastname varchar(20) not null,
BirthDate date not null unique
);
insert into Member(Firstname,Lastname,Birthdate) values
('John','Lennon','Oct 9, 1940'),
('Paul','McCartney','June 8, 1942'),
('George','Harrison','February 25, 1943'),
('Ringo','Starr','July 7, 1940');
Output:
Firstname Lastname BirthDate ElderBirthDate ElderFirstname
-------------------- -------------------- ---------- -------------- --------------------
Ringo Starr 1940-07-07 NULL NULL
John Lennon 1940-10-09 1940-07-07 Ringo
Paul McCartney 1942-06-08 1940-10-09 John
Paul McCartney 1942-06-08 1940-07-07 Ringo
George Harrison 1943-02-25 1942-06-08 Paul
George Harrison 1943-02-25 1940-10-09 John
(6 row(s) affected)
Live test: http://www.sqlfiddle.com/#!3/19a63/2
marc's answer is better, just use OUTER APPLY if you need to query "near" entities (e.g. geospatial, elder brothers, nearest date to due date, etc) to the main entity.
Outer apply walkthrough: http://www.ienablemuch.com/2012/04/outer-apply-walkthrough.html
You might need DENSE_RANK instead of ROW_NUMBER/RANK though, as the criteria of a comment being a top could yield ties. TOP 1 could yield more than one, TOP 3 could yield more than three too. Example of that scenario(DENSE_RANK walkthrough): http://www.anicehumble.com/2012/03/postgresql-denserank.html
Its better that you select the statement by using the row_number statement and select the top 3 alone
select a.* from
(
Select *,row_number() over(partition by column)[dup]
) as a
where dup<=3
I want to learn how to combine two db tables which have no fields in common. I've checked UNION but MSDN says :
The following are basic rules for combining the result sets of two queries by using UNION:
The number and the order of the columns must be the same in all queries.
The data types must be compatible.
But I have no fields in common at all. All I want is to combine them in one table like a view.
So what should I do?
There are a number of ways to do this, depending on what you really want. With no common columns, you need to decide whether you want to introduce a common column or get the product.
Let's say you have the two tables:
parts: custs:
+----+----------+ +-----+------+
| id | desc | | id | name |
+----+----------+ +-----+------+
| 1 | Sprocket | | 100 | Bob |
| 2 | Flange | | 101 | Paul |
+----+----------+ +-----+------+
Forget the actual columns since you'd most likely have a customer/order/part relationship in this case; I've just used those columns to illustrate the ways to do it.
A cartesian product will match every row in the first table with every row in the second:
> select * from parts, custs;
id desc id name
-- ---- --- ----
1 Sprocket 101 Bob
1 Sprocket 102 Paul
2 Flange 101 Bob
2 Flange 102 Paul
That's probably not what you want since 1000 parts and 100 customers would result in 100,000 rows with lots of duplicated information.
Alternatively, you can use a union to just output the data, though not side-by-side (you'll need to make sure column types are compatible between the two selects, either by making the table columns compatible or coercing them in the select):
> select id as pid, desc, null as cid, null as name from parts
union
select null as pid, null as desc, id as cid, name from custs;
pid desc cid name
--- ---- --- ----
101 Bob
102 Paul
1 Sprocket
2 Flange
In some databases, you can use a rowid/rownum column or pseudo-column to match records side-by-side, such as:
id desc id name
-- ---- --- ----
1 Sprocket 101 Bob
2 Flange 101 Bob
The code would be something like:
select a.id, a.desc, b.id, b.name
from parts a, custs b
where a.rownum = b.rownum;
It's still like a cartesian product but the where clause limits how the rows are combined to form the results (so not a cartesian product at all, really).
I haven't tested that SQL for this since it's one of the limitations of my DBMS of choice, and rightly so, I don't believe it's ever needed in a properly thought-out schema. Since SQL doesn't guarantee the order in which it produces data, the matching can change every time you do the query unless you have a specific relationship or order by clause.
I think the ideal thing to do would be to add a column to both tables specifying what the relationship is. If there's no real relationship, then you probably have no business in trying to put them side-by-side with SQL.
If you just want them displayed side-by-side in a report or on a web page (two examples), the right tool to do that is whatever generates your report or web page, coupled with two independent SQL queries to get the two unrelated tables. For example, a two-column grid in BIRT (or Crystal or Jasper) each with a separate data table, or a HTML two column table (or CSS) each with a separate data table.
This is a very strange request, and almost certainly something you'd never want to do in a real-world application, but from a purely academic standpoint it's an interesting challenge. With SQL Server 2005 you could use common table expressions and the row_number() functions and join on that:
with OrderedFoos as (
select row_number() over (order by FooName) RowNum, *
from Foos (nolock)
),
OrderedBars as (
select row_number() over (order by BarName) RowNum, *
from Bars (nolock)
)
select *
from OrderedFoos f
full outer join OrderedBars u on u.RowNum = f.RowNum
This works, but it's supremely silly and I offer it only as a "community wiki" answer because I really wouldn't recommend it.
SELECT *
FROM table1, table2
This will join every row in table1 with table2 (the Cartesian product) returning all columns.
select
status_id,
status,
null as path,
null as Description
from
zmw_t_status
union
select
null,
null,
path as cid,
Description from zmw_t_path;
try:
select * from table 1 left join table2 as t on 1 = 1;
This will bring all the columns from both the table.
If the tables have no common fields then there is no way to combine the data in any meaningful view. You would more likely end up with a view that contains duplicated data from both tables.
To get a meaningful/useful view of the two tables, you normally need to determine an identifying field from each table that can then be used in the ON clause in a JOIN.
THen in your view:
SELECT T1.*, T2.* FROM T1 JOIN T2 ON T1.IDFIELD1 = T2.IDFIELD2
You mention no fields are "common", but although the identifying fields may not have the same name or even be the same data type, you could use the convert / cast functions to join them in some way.
why don't you use simple approach
SELECT distinct *
FROM
SUPPLIER full join
CUSTOMER on (
CUSTOMER.OID = SUPPLIER.OID
)
It gives you all columns from both tables and returns all records from customer and supplier if Customer has 3 records and supplier has 2 then supplier'll show NULL in all columns
Select
DISTINCT t1.col,t2col
From table1 t1, table2 t2
OR
Select
DISTINCT t1.col,t2col
From table1 t1
cross JOIN table2 t2
if its hug data , its take long time ..
SELECT t1.col table1col, t2.col table2col
FROM table1 t1
JOIN table2 t2 on t1.table1Id = x and t2.table2Id = y
Joining Non-Related Tables
Demo SQL Script
IF OBJECT_ID('Tempdb..#T1') IS NOT NULL DROP TABLE #T1;
CREATE TABLE #T1 (T1_Name VARCHAR(75));
INSERT INTO #T1 (T1_Name) VALUES ('Animal'),('Bat'),('Cat'),('Duet');
SELECT * FROM #T1;
IF OBJECT_ID('Tempdb..#T2') IS NOT NULL DROP TABLE #T2;
CREATE TABLE #T2 (T2_Class VARCHAR(10));
INSERT INTO #T2 (T2_Class) VALUES ('Z'),('T'),('H');
SELECT * FROM #T2;
To Join Non-Related Tables , we are going to introduce one common joining column of Serial Numbers like below.
SQL Script
SELECT T1.T1_Name,ISNULL(T2.T2_Class,'') AS T2_Class FROM
( SELECT T1_Name,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS S_NO FROM #T1) T1
LEFT JOIN
( SELECT T2_Class,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS S_NO FROM #T2) T2
ON t1.S_NO=T2.S_NO;
select * from this_table;
select distinct person from this_table
union select address as location from that_table
drop wrong_table from this_database;
Very hard when you have to do this with three select statments
I tried all proposed techniques up there but it's in-vain
Please see below script. please advice if you have alternative solution
select distinct x.best_Achiver_ever,y.Today_best_Achiver ,z.Most_Violator from
(SELECT Top(4) ROW_NUMBER() over (order by tl.username) AS conj, tl.
[username] + '-->' + str(count(*)) as best_Achiver_ever
FROM[TiketFollowup].[dbo].N_FCR_Tikect_Log_Archive tl
group by tl.username
order by count(*) desc) x
left outer join
(SELECT
Top(4) ROW_NUMBER() over (order by tl.username) as conj, tl.[username] + '-->' + str(count(*)) as Today_best_Achiver
FROM[TiketFollowup].[dbo].[N_FCR_Tikect_Log] tl
where convert(date, tl.stamp, 121) = convert(date,GETDATE(),121)
group by tl.username
order by count(*) desc) y
on x.conj=y.conj
left outer join
(
select ROW_NUMBER() over (order by count(*)) as conj,username+ '--> ' + str( count(dbo.IsViolated(stamp))) as Most_Violator from N_FCR_Ticket
where dbo.IsViolated(stamp) = 'violated' and convert(date,stamp, 121) < convert(date,GETDATE(),121)
group by username
order by count(*) desc) z
on x.conj = z.conj
Please try this query:
Combine two tables that have no common columns:
SELECT *
FROM table1
UNION
SELECT *
FROM table2
ORDER BY orderby ASC
I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col1, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer CTE for deleting duplicate rows from sql server table
strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
by keeping original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
I my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!
PS: when working on things like this I always use a transaction, this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But off course you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
it deleted 1M rows in little more than 30sec from a table of 2M (50% duplicates)
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn1
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one easy to grasp and seems to be effective for most of the similar problems. It is for SQL Server though but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why its hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two mysql examples.
In the future are you going to prevent it at a database level, or from an application perspective. I would suggest the database level because your database should be responsible for maintaining referential integrity, developers just will cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is Create a new table with same fields and with Unique Index. Then move all data from old table to new table. Automatically SQL SERVER ignore (there is also an option about what to do if there will be a duplicate value: ignore, interrupt or sth) duplicate values. So we have the same table without duplicate rows. If you don't want Unique Index, after the transfer data you can drop it.
Especially for larger tables you may use DTS (SSIS package to import/export data) in order to transfer all data rapidly to your new uniquely indexed table. For 7 million row it takes just a few minute.
By useing below query we can able to delete duplicate records based on the single column or multiple column. below query is deleting based on two columns. table name is: testing and column names empno,empname
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE #cnt AS INT
SELECT #cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF #cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1