How to find and delete all duplicates from SQL Server database

I'm new to SQL in general and I need to delete all duplicates in a given database.
For the moment, I use this DB to experiment with a few things.
The table currently looks like this:
I know I can find all duplicates using this query:
SELECT COUNT(*) AS NBR_DOUBLES, Name, Owner
FROM dbo.animals
GROUP BY Name, Owner
HAVING COUNT(*) > 1
but I'm having a lot of trouble finding a suitable, up-to-date solution that not only finds all the duplicates but also deletes them, leaving only one of each.
Thanks a lot for taking some of your time to help me.

;WITH numbered AS (
SELECT ROW_NUMBER() OVER(PARTITION BY Name, Owner ORDER BY Name, Owner) AS _dupe_num
FROM dbo.Animals
)
DELETE FROM numbered WHERE _dupe_num > 1;
This will delete all but one of each occurrence with the same Name & Owner. If you need it to be more specific, extend the PARTITION BY clause; to take the entire record into account, add all of your fields to it.
Which record is left behind is currently arbitrary, because the ORDER BY uses the same columns as the PARTITION BY and there seems to be no other field to order on.
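Before committing a destructive DELETE like this, it can help to see which rows it removes. A minimal sketch, assuming a throwaway temp table #Animals (hypothetical, standing in for dbo.Animals) with only the Name and Owner columns; the OUTPUT clause and the transaction are additions for illustration, not part of the answer above:

```sql
-- Hypothetical sample data; #Animals stands in for dbo.Animals.
CREATE TABLE #Animals (Name varchar(50), Owner varchar(50));
INSERT INTO #Animals (Name, Owner) VALUES
    ('Gremlin', 'Max'), ('Gremlin', 'Max'), ('Outch', 'Max');

BEGIN TRANSACTION;

WITH numbered AS (
    SELECT Name, Owner,
           ROW_NUMBER() OVER (PARTITION BY Name, Owner ORDER BY Name, Owner) AS _dupe_num
    FROM #Animals
)
DELETE FROM numbered
OUTPUT deleted.Name, deleted.Owner   -- returns each row that is removed
WHERE _dupe_num > 1;

SELECT Name, Owner FROM #Animals;    -- inspect what is left

ROLLBACK TRANSACTION;                -- switch to COMMIT once the result looks right
DROP TABLE #Animals;
```

With the ROLLBACK in place the table is unchanged afterwards, so the statement can be rerun while adjusting the PARTITION BY columns.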

What you want to do is use a projection that numbers each record within a given duplicate set. You can do that with a Windowing Function, like this:
SELECT Name, Owner
,Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
ORDER BY Name, Owner
This should give you results like this:
Name Owner RowNum
Ecstasy Sacha 1
Ecstasy Sacha 2
Ecstasy Sacha 3
Gremlin Max 1
Gremlin Max 2
Gremlin Max 3
Outch Max 1
Outch Max 2
Outch Max 3
Now you want to convert this to a DELETE statement that has a WHERE clause targeting rows with RowNum > 1. The way to use a windowing function with a DELETE is to first include the windowing function as part of a common table expression (CTE), like this:
WITH dupes AS
(
SELECT Name, Owner,
Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
)
DELETE FROM dupes WHERE RowNum > 1;
This will delete later duplicates, but leave row #1 for each group intact. The only trick now is to make sure row #1 is the correct row, since not all of your duplicates have the same values for the Birth or Death columns. This is the reason I included the Birth column in the windowing function, while other answers (so far) have not. You need to decide if you want to keep the oldest animal or the youngest, and optionally change the Birth order in the OVER clause to match your needs.
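To make the choice of the surviving row explicit, the ORDER BY inside the OVER clause can be reversed. A sketch, assuming a temp table #Animals (hypothetical) in which Birth is a date column, that keeps the row with the most recent Birth for each Name and Owner:

```sql
-- Hypothetical sample data; the Birth values are made up for illustration.
CREATE TABLE #Animals (Name varchar(50), Owner varchar(50), Birth date);
INSERT INTO #Animals (Name, Owner, Birth) VALUES
    ('Gremlin', 'Max', '2010-05-01'),
    ('Gremlin', 'Max', '2012-03-15'),
    ('Outch',   'Max', '2011-07-20');

WITH dupes AS
(
    SELECT Name, Owner, Birth,
           -- DESC puts the latest Birth first, so that row gets RowNum 1
           ROW_NUMBER() OVER (PARTITION BY Name, Owner ORDER BY Birth DESC) AS RowNum
    FROM #Animals
)
DELETE FROM dupes WHERE RowNum > 1;

-- Remaining rows: Gremlin/Max with Birth 2012-03-15, and Outch/Max.
SELECT Name, Owner, Birth FROM #Animals ORDER BY Name;
DROP TABLE #Animals;
```

Using ORDER BY Birth ASC instead would keep the earliest Birth. If two duplicates can share the same Birth, the survivor among them is still arbitrary unless another column is added as a tie-breaker.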

Use a CTE. Here is a sample:
Create table #Table1(Field1 varchar(100));
Insert into #Table1 values
('a'),('b'),('f'),('g'),('a'),('b');
Select * from #Table1;
WITH CTE AS(
SELECT Field1,
RN = ROW_NUMBER()OVER(PARTITION BY Field1 ORDER BY Field1)
FROM #Table1
)
--SELECT * FROM CTE WHERE RN > 1
DELETE FROM CTE WHERE RN > 1
What I am doing is numbering the rows. Rows that are duplicates on the PARTITION BY columns are numbered sequentially within their group; a row with no duplicate gets 1.
Then I delete the rows whose row number is greater than 1.
I won't spoon-feed you the full solution, so you will have to adapt the PARTITION BY columns to your table to get your output.
Output:
Select * from #Table1;
Field1
---------
a
b
f
g
a
b
/*with cte as (...) SELECT * FROM CTE;*/
Field1 RN
------- -----
a 1
a 2
b 1
b 2
f 1
g 1

If dbo.animals had an ID field, I believe you could use this:
DELETE FROM dbo.animals WHERE ID NOT IN
(
-- keep the lowest ID of each Name/Owner group, delete every other row
SELECT MIN(ID)
FROM dbo.animals
GROUP BY Name, Owner
)

Related

T-SQL: GROUP BY, but while keeping a non-grouped column (or re-joining it)?

I'm on SQL Server 2008, and having trouble querying an audit table the way I want to.
The table shows every time a new ID comes in, as well as every time an ID's Type changes:
Record # ID Type Date
1 ae08k M 2017-01-02:12:03
2 liei0 A 2017-01-02:12:04
3 ae08k C 2017-01-02:13:05
4 we808 A 2017-01-03:20:05
I'd kinda like to produce a snapshot of the status for each ID, at a certain date. My thought was something like this:
SELECT
ID
,max(date) AS Max
FROM
Table
WHERE
Date < 'whatever-my-cutoff-date-is-here'
GROUP BY
ID
But that loses the Type column. If I add the Type column to my GROUP BY, then I'd naturally get duplicate rows per ID, one for each type it had before the date.
So I was thinking of running a second version of the table (via a common table expression), and left joining that in to get the Type.
On my query above, all I have to join to are the ID & Date. Somehow if the dates are too close together, I end up with duplicate results (like say above, ae08k would show up once for each Type). That or I'm just super confused.
Basically all I ever do in SQL are left joins, group bys, and common table expressions (to then left join). What am I missing that I'd need in this situation...?

Use row_number()
select *
from ( select *
, row_number() over (partition by id order by date desc) as rn
from table
WHERE Date < 'whatever-my-cutoff-date-is-here'
) tt
where tt.rn = 1
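As a worked illustration of this approach on the sample rows from the question, here is a sketch. The temp table #Audit, the column name RecordNo, the datetime values and the cutoff are all stand-ins chosen for the example:

```sql
-- Sample rows from the question; table name and exact datetimes are assumptions.
CREATE TABLE #Audit (RecordNo int, ID varchar(10), [Type] char(1), [Date] datetime);
INSERT INTO #Audit (RecordNo, ID, [Type], [Date]) VALUES
    (1, 'ae08k', 'M', '2017-01-02T12:03:00'),
    (2, 'liei0', 'A', '2017-01-02T12:04:00'),
    (3, 'ae08k', 'C', '2017-01-02T13:05:00'),
    (4, 'we808', 'A', '2017-01-03T20:05:00');

SELECT tt.ID, tt.[Type], tt.[Date]
FROM (
    SELECT ID, [Type], [Date],
           -- newest row per ID gets rn = 1
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [Date] DESC) AS rn
    FROM #Audit
    WHERE [Date] < '2017-01-03T00:00:00'   -- hypothetical cutoff
) tt
WHERE tt.rn = 1;
-- Returns ae08k with Type C and liei0 with Type A; we808 falls after the cutoff.

DROP TABLE #Audit;
```

Because the row number is computed per ID, each ID appears at most once, which avoids the duplicate rows that joining back on ID and Date can produce.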

I'd kinda like know how many IDs are of each type, at a certain date.
Well, for that you use COUNT and GROUP BY on Type:
SELECT Type, COUNT(ID)
FROM Table
WHERE Date < 'whatever-your-cutoff-date-is-here'
GROUP BY Type
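If the count should reflect only each ID's most recent Type before the cutoff, rather than every audit row, ROW_NUMBER() can be applied first and the surviving rows counted. A sketch, assuming a temp table #Audit (hypothetical) loaded with the question's sample rows and a made-up cutoff:

```sql
-- Sample rows from the question; table name and exact datetimes are assumptions.
CREATE TABLE #Audit (RecordNo int, ID varchar(10), [Type] char(1), [Date] datetime);
INSERT INTO #Audit (RecordNo, ID, [Type], [Date]) VALUES
    (1, 'ae08k', 'M', '2017-01-02T12:03:00'),
    (2, 'liei0', 'A', '2017-01-02T12:04:00'),
    (3, 'ae08k', 'C', '2017-01-02T13:05:00'),
    (4, 'we808', 'A', '2017-01-03T20:05:00');

WITH latest AS (
    SELECT ID, [Type],
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [Date] DESC) AS rn
    FROM #Audit
    WHERE [Date] < '2017-01-03T00:00:00'   -- hypothetical cutoff
)
SELECT [Type], COUNT(*) AS IdCount
FROM latest
WHERE rn = 1            -- one row per ID: its latest Type before the cutoff
GROUP BY [Type];
-- Gives A = 1 (liei0) and C = 1 (ae08k); ae08k's earlier Type M is not counted.

DROP TABLE #Audit;
```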

Based on your comment under Zohar Peled's answer, you are probably looking for something like this:
; with cte as (select distinct ID from Table where Date < '$param')
select [data].*, [data2].[count]
from cte
cross apply
( select top 1 *
from Table
where Table.ID = cte.ID
and Table.Date < '$param'
order by Table.Date desc
) as [data]
cross apply
( select count(1) as [count]
from Table
where Table.ID = cte.ID
and Table.Date < '$param'
) as [data2]

SQL Server: select random rows with distinct id from table where id is not distinct

I have a simple table named Tickets with the following columns:
ticketId, userId
where ticketId is the primary key and userId is not unique.
A user can therefore have several tickets, each with a unique ticketId.
I'm struggling to find a solution to my problem, which is that I need to select 5 random tickets belonging to 5 unique userIds.
I know how to select the random tickets by using the following query:
SELECT TOP 5 *
FROM Tickets
ORDER BY RAND(CHECKSUM(*) * RAND())
Which returns something like:
Ticket id: UserId:
--------------------------
10 1
25 1
31 2
42 2
56 3
My question is: what do I need to add to the query so that it selects the random rows across distinct userIds and does not return more than one ticket per user?
Note that I need the most performant correct solution, since the table could eventually hold millions of rows.
Thanks in advance,
Christian
Edit:
The more tickets a user has, the higher the chances of selection. However it should still be randomly selected and not just select the user with the highest amount of tickets. Just like in a lottery.
In other words it should select 5 random rows between all rows, but ensure that the 5 rows have a unique userId.

Please try it like this, using NEWID():
Select k.UserId, u.TicketId
from
(
SELECT TOP 5 UserId
FROM Tickets
ORDER BY NEWID()
)k
CROSS APPLY
(
select top 1 TicketId
from Tickets T WHERE T.UserId = k.UserId
ORDER BY NEWID()
)u

Edit: As pointed out in the comments, this solution doesn't properly weight the users by number of tickets (so a user with 1000 tickets incorrectly has the same chance of winning as a user with 1 ticket). This was particularly dumb of me since I pointed out this problem on other answers.
Given that Steve now has his solution working, I think that is the better answer.
Original answer:
I think something like the following works:
SELECT top 5 ticketid, userid
FROM
(
SELECT ticketid, userid, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY NEWID()) as nid
FROM tickets
) a
WHERE nid = 1
ORDER BY NEWID()
Here's an sql fiddle to play around with it.
Credit where credit is due: I based this on Steve's solution which I don't think works correctly as written.

Something like the following I think.
Please note this code is untested, so please excuse any small syntax errors.
WITH randomised_tickets AS
(
SELECT
*
,ROW_NUMBER() OVER (ORDER BY NEWID() ASC) AS random_order
FROM Tickets
)
,ordered_winning_tickets AS
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY userId ORDER BY random_order ASC) AS user_win_order
FROM randomised_tickets
)
SELECT TOP 5
*
FROM
ordered_winning_tickets
WHERE
user_win_order = 1 --eliminate 2nd wins from the list
ORDER BY
random_order

You could try something like this, using ignore_dup_key on a temp table to eliminate duplicates for a user:
drop table if exists #WinningTickets
create table #WinningTickets(PickId int identity primary key, TicketId int, UserId int)
create unique index ix_unique_user on #WinningTickets(UserId) with (ignore_dup_key=on)
while ( select count(*) from #WinningTickets ) < 5
begin
insert into #WinningTickets
select top 10 TicketId, UserId
from Tickets
order by newid()
end
select top 5 *
from #WinningTickets
order by PickId
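One thing to watch with a loop like this: if the table holds fewer than 5 distinct users, the count never reaches 5 and the loop does not end. A sketch of one possible guard, assuming a small temp table #Tickets (hypothetical sample data) and helper variables @wanted and @available introduced only for this example:

```sql
-- #Tickets is hypothetical sample data standing in for the Tickets table.
CREATE TABLE #Tickets (TicketId int PRIMARY KEY, UserId int);
INSERT INTO #Tickets (TicketId, UserId) VALUES
    (10, 1), (25, 1), (31, 2), (42, 2), (56, 3);

CREATE TABLE #WinningTickets (PickId int IDENTITY PRIMARY KEY, TicketId int, UserId int);
CREATE UNIQUE INDEX ix_unique_user ON #WinningTickets (UserId) WITH (IGNORE_DUP_KEY = ON);

-- Cap the target at the number of distinct users that actually exist.
DECLARE @wanted int;
DECLARE @available int;
SET @wanted = 5;
SELECT @available = COUNT(DISTINCT UserId) FROM #Tickets;
IF @available < @wanted SET @wanted = @available;   -- only 3 users in this sample

WHILE (SELECT COUNT(*) FROM #WinningTickets) < @wanted
BEGIN
    -- Rows whose UserId is already present are silently discarded by the index.
    INSERT INTO #WinningTickets (TicketId, UserId)
    SELECT TOP 10 TicketId, UserId
    FROM #Tickets
    ORDER BY NEWID();
END;

SELECT TOP (5) PickId, TicketId, UserId
FROM #WinningTickets
ORDER BY PickId;

DROP TABLE #WinningTickets;
DROP TABLE #Tickets;
```

With this sample data the final SELECT returns three rows, one per user; which ticket each user ends up with varies from run to run.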

In T-SQL how to select only the top(not max) value in a group of record

I have some sample data as follows
Name Value Timestamp
a 23 2016/12/23 11:23
a 43 2016/12/23 12:55
b 12 2016/12/23 12:55
I want to select the latest value for a and b. When I used Last_Value, I used the following query
Select Name, Last_Value(Value) over (partition by Name order by timestamp) from table
This returned 2 rows for a, but I wanted it grouped so that I get only the last entered value for each name. So I had to use a subquery:
select x.Name,x.Value from (Select Name, Last_Value(Value) over (partition by Name order by timestamp) as Value from table) as x group by x.Name,x.Value
This again returns 2 records for a. I just wanted to do a group by and order by and, instead of selecting the max(), select the top record.
Can anybody tell me how to solve this problem?

One method doesn't use window functions:
select t.*
from table t
where t.timestamp = (select max(t2.timestamp) from table t2 where t2.name = t.name);
Otherwise, the subquery method is fine, although I would often use row_number() and conditional aggregation rather than last_value() (or first_value() with a descending order by).
Unfortunately, SQL Server does not support first_value() or last_value() as an aggregation function, only as a window function.
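The row_number() variant mentioned above could look roughly like this. It is a sketch that assumes a temp table #Readings (a hypothetical name) holding the sample data from the question, with the timestamp stored as a datetime:

```sql
-- Sample data from the question; #Readings is a stand-in table name.
CREATE TABLE #Readings (Name varchar(10), Value int, [Timestamp] datetime);
INSERT INTO #Readings (Name, Value, [Timestamp]) VALUES
    ('a', 23, '2016-12-23T11:23:00'),
    ('a', 43, '2016-12-23T12:55:00'),
    ('b', 12, '2016-12-23T12:55:00');

SELECT x.Name, x.Value
FROM (
    SELECT Name, Value,
           -- newest row per Name gets rn = 1
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY [Timestamp] DESC) AS rn
    FROM #Readings
) AS x
WHERE x.rn = 1;
-- Returns a = 43 and b = 12.

DROP TABLE #Readings;
```

Filtering on rn = 1 returns exactly one row per Name, so no GROUP BY is needed.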

Top Keyword Not Working in Access 2007

SELECT top 3 a.[CustID],a.[CustName],a.[ContactNo],a.[Address],[EmailID] ,
(select count(1) FROM tblCustomer x) as [RecordCount]
FROM tblCustomer a
where a.[CustID] NOT IN (
SELECT TOP 6 m.[CustID]
FROM tblCustomer m
Order by m.[CreatedOn] desc)
order by a.[CreatedOn] desc
I'm trying to get the top 3 results from the above query, but I'm getting a lot more than that:
Can someone correct the above query?

TOP in MS Access returns not just the requested number of rows but also every row that ties with the last one on the ORDER BY value. In this case you are ordering by a date, so if several rows share the same date, they will all be returned. If you need exactly three records, order by a unique field in addition to the required sort order. For example:
... order by a.[CreatedOn] desc, custid

Find max value and show corresponding value from different field in SQL server

I have a table with data about cities that includes their name, population and other fields irrelevant to my question.
ID Name Population
1 A 45667
2 B 123456
3 C 3005
4 D 13769
To find the max population is basic, but I need a resulting table that has the max population in one column, and the corresponding city's name in another column
Population Name
123456 B
I've looked through similar questions, but for some reason the answers look over-complicated. Is there a way to write the query in 1 or 2 lines?

There are several ways that this can be done:
A filter in the WHERE clause:
select id, name, population
from yourtable
where population in (select max(population)
from yourtable)
Or a subquery:
select id, name, population
from yourtable t1
inner join
(
select max(population) MaxPop
from yourtable
) t2
on t1.population = t2.maxpop;
Or you can use TOP WITH TIES. This will include any rows that have the same population value. If there can be no ties, then you can remove the WITH TIES:
select top 1 with ties id, name, population
from yourtable
order by population desc
Since you are using SQL Server you can also use ranking functions to get the result:
select id, name, population
from
(
select id, name, population,
row_number() over(order by population desc) rn
from yourtable
) src
where rn = 1
See SQL Fiddle with Demo of all.
As a side note on the ranking function, you might want to use dense_rank() instead of row_number(). Then in the event you have more than one city with the same population you will get both city names. (See Demo)
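To illustrate the dense_rank() variant with a tie, here is a sketch using the question's sample data in a temp table #Cities (a hypothetical name), plus one extra city E that is invented to create the tie:

```sql
-- Rows 1-4 are from the question; row 5 is a made-up city that ties with B.
CREATE TABLE #Cities (ID int, Name varchar(10), Population int);
INSERT INTO #Cities (ID, Name, Population) VALUES
    (1, 'A', 45667),
    (2, 'B', 123456),
    (3, 'C', 3005),
    (4, 'D', 13769),
    (5, 'E', 123456);

SELECT Population, Name
FROM (
    SELECT Name, Population,
           -- tied populations share the same rank, unlike row_number()
           DENSE_RANK() OVER (ORDER BY Population DESC) AS rnk
    FROM #Cities
) src
WHERE rnk = 1;
-- Returns both B and E with population 123456.

DROP TABLE #Cities;
```

With row_number() in place of DENSE_RANK(), only one of B and E would come back, and which one is not guaranteed.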
