detecting duplicates and removing them - sql-server

I've been trying to solve a problem in my database which is quite common but I couldn't find a solution so far and I hope you could help me with this.
I have a database with people and their associated addresses. My primary goal is to find out how many unique households are in there. For example, I want to count a family as one. So far a ran a query to display last_names and addresses which are more than one:
select Last_Name ,add_line1, count(*) from ##all_people
group by Last_Name,ADD_LINE1
having count(*) > 1
This shows me people with the same last_name and address but I need their IDs in order to remove them from my temptable.
Furthermore, I'd like to ask how it is possible to display only one record for each household.
This is the structure of my temptable:
ID First_name Last_Name add_line1
Thank you so much for your help!!!

to find duplicates, you can use Count() Over() and partition by the grouping you want.
select * from (
select Id, Last_Name ,add_line1, count(*) over (partition by Last_Name, add_line1) dupe_count from ##all_people
) t
where t.dupe_count > 1
to find the ones you want to delete, you can use Row_Number()
select * from (
select Id, Last_Name ,add_line1, row_number() over (partition by Last_Name, add_line1 order by ID) extras from ##all_people
) t
where t.extras > 1
use t.extras = 1 to see one row per grouping

You seem to have a lot of questions here...
My primary goal is to find out how many unique households are in there.
You can do this with a distinct count:
SELECT COUNT(DISTINCT Last_Name + add_line1)
FROM ##all_people
...but I need their IDs in order to remove them from my temptable
I think this is solved by the new count query.
Furthermore, I'd like to ask how it is possible to display only one record for each household.
Just use distinct last name and address:
SELECT DISTINCT last_name, add_line1
FROM ##all_people

Related

How to find the count of users who follow the same order sequence?

I'm having trouble thinking this through. I have about 3 million rows with columns: userID, Channel the user last came in from before purchase, and their order sequence. I would like to find a way to the count of users that follow the same order sequences based on the channel. Is there a certain function that can help me accomplish this?
Ex. TPA --> TPA --> Email
how many people follow this sequence?
You can get the sequences using string_agg():
select path, count(*) as num_users
from (select user_id,
string_agg(channel, '-->') within group (order by sequence) as path
from t
group b user_id
) u
group by path
order by num_users desc;
string_agg() is a relatively new function. In older versions of SQL Server, you would probably use XML functions.
If you just need to count them, you could first group them, and then perform a count. In fact, once grouped you can do whatever you like...
WITH dte as (
SELECT distinct
userid,
(
SELECT Channel+'>'
FROM MyTable t2
WHERE t2.userid= t.userid
FOR XML PATH('')
) Concatenated
FROM MyTable t
)
SELECT whatever
FROM dte
Hope this works out for you. :)

Same query giving different results

I am still new to working in databases, so please have patience with me. I have read through a number of similar questions, but none of them seem to be talking about the same issue I am facing.
Just a bit of info on what I am doing, I have a table filled with contact information, and some of the contacts are duplicated, but most of the duplicated rows have a truncated phone number, which makes that data useless.
I wrote the following query to search for the duplicates:
WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
SELECT
CID, Firstname, lastname, phone, email, LEN(phone) AS length,
ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
ORDER BY Firstname) AS dupcnt
FROM
[data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1
AND length <= 10
I assumed that this query would find all records that have duplicates based on the three columns that I have specified, and select any that have the dupcnt greater than 1, and a phone column with a length less than or equal to 10. But when I run the query more than once I get different result sets each execution. There must be some logic that I am missing here, but I am completely baffled by this. All of the columns are of varchar datatype, except for CID, which is int.
Instead of ROW_NUMBER() use COUNT(*), and remove the ORDER BY since that's not necessary with COUNT(*).
The way you have it now, you are chunking up records into similar groups/partitions of records by firstname/lastname/email. Then you are ORDERING each group/partition by firstname. Firstname is part of the partition, meaning every firstname in that group/partition is identical. You will get different results depending on how SQL Server fetches the results from storage (which record it found first is 1, what it found second is 2). Every time it fetches records (every time you run this sql) it may fetch each record from disk or cache at a different order.
Count(*) will return ALL duplicate rows
So instead:
COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt
Which will return the number of records that share the same firstname, lastname, and email. You then keep any record that is greater than 1.
ORDER BY Firstname is non-deterministic here as they all have the same Firstname from the partition by
If CID is unique you could use that for the order by but I suspect you really want count.
I believe you are getting different results with every run would be because (a) unless clearly specified in the query, you can assume nothing about the order in which SQL return data in a query, and (b) the only ordering criteria you provide is by FirstName, which is far less precise than your grouping (Firstname, lastname, email).
As for the query itself, as written it assumes that the first item found in a given partition contains a valid phone number. Without specifying the order, you cannot know this will be true… and what if all items in a given grouping have invalid phone numbers? Below is my stab at pulling out the data you're looking for, in a hopefully useful format.
WITH CTE -- Sorry, I'm lazy and generally don't list the columns
AS
(
SELECT
Firstname
,lastname
,phone
,count(*) HowMany -- How many in group
,sum(case len(phone) when 10 then 1 else 0 end) BadLength -- How many "bad" in group
from data.com_raw
group by
Firstname
,lastname
,phone
having count(*) <> sum(case len(phone) when 10 then 1 else 0 end)
and count(*) > 1 -- Remove this to find singletons with invalid phone numbers
)
select
cr.CID
,cr.Firstname
,cr.lastname
,case len(cr.phone) when 10 then '' else 'Bad' end) IsBad
,cr.phone
,cr.email
from data.com_raw cr
inner join CTE
on CTE.Firstname = cr.Firstname
and CTE.lastname = cr.lastname
and CTE.phone = cr.phone
order by
cr.CID
,cr.Firstname
,cr.lastname
,case len(cr.phone) when 10 then '' else 'Bad' end)
,cr.phone
(Yes, if there are no indexes to support this, you will end up with a table scan.)
SELECT Firstname, lastname,email, COUNT(*)
FROM [data.com_raw]
GROUP BY Firstname, lastname,email HAVING COUNT(*)>1
WHERE LEN(PHONE)<= 10

Choose 2 Random Values from 2 Separate Columns SQL

I am having to prepare test data to send to a 3rd party however I don't wish to send the customers real name nor do I wish to send their real date of birth.
I could solve the D.O.B issue by just randomly increasing the DOB by several years. However the name is different, is there anyway I can have a list of say 10 customer names and just choose a different Firstname and Surname each time.
I wish to mix and match the names however so it essentially randomly picks 1 firstname and then randomly picks a lastname and puts them together on the same line.
SELECT TOP 1 opde.first_name
FROM Table AS opde
ORDER BY NEWID()
This will return a random first name each time, but if I put the surname column in it will also return the matching surname, I don't want that I want a random surname from the list.
I tried doing this via a UNION but you can't do an ORDER BY NEWID() in the UNION.
Cheers.
I think this one might help...
WITH fn AS
(
SELECT TOP 1 opde.first_name
FROM Table AS opde
ORDER BY NEWID()
),
sn AS
(
SELECT TOP 1 opde.surname
FROM Table AS opde
ORDER BY NEWID()
)
SELECT first_name, surname
FROM fn
CROSS APPLY sn;
In the fn subquery you select a random first name. In the sn you do the same but with an surname.
With the cross apply you combine those two results
You may use a union of subqueries, each of which uses order by with NEWID:
SELECT first_name
FROM
(SELECT TOP 1 opde.first_name FROM Table AS opde ORDER BY NEWID()) t1
UNION ALL
SELECT first_name
FROM
(SELECT TOP 1 opde.first_name FROM Table AS opde ORDER BY NEWID()) t2;
Demo

SQL Server: select random rows with distinct id from table where id is not distinct

I have a simple table named Tickets with the following columns:
ticketId, userId
where ticketId is the primary key, UserId is not unique.
A user can therefore have several tickets, each with unique ticketId's.
I'm struggling to find a solution on my problem which is that I need to select 5 random tickets by 5 unique userId's.
I know how to select the random tickets by using the following query:
SELECT TOP 5 *
FROM Tickets
ORDER BY RAND(CHECKSUM(*) * RAND())
Which returns something like:
Ticket id: UserId:
--------------------------
10 1
25 1
31 2
42 2
56 3
My question is: what do I need to add to the query for it to select the random rows between distinct userId's so that it does not return more than one unique ticket for a user
Mind I need the most performance correct solution, since the table could potentially be filled with millions of rows in the long run.
Thanks in advance,
Christian
Edit:
The more tickets a user has, the higher the chances of selection. However it should still be randomly selected and not just select the user with the highest amount of tickets. Just like in a lottery.
In other words it should select 5 random rows between all rows, but ensure that the 5 rows have a unique userId.
Please try like this .... NEWID()
Select UserId
from
(
SELECT TOP 5 UserId
FROM Tickets
ORDER BY NEWID()
)k
CROSS APPLY
(
select top 1 TicketId
from Tickets T WHERE T.UserId = k.UserId
ORDER BY NEWID()
)u
Edit: As pointed out in the comments, this solution doesn't properly weight the users by number of tickets (so a user with 1000 tickets incorrectly has same change of winning as user with 1 ticket). This was particularly dumb of me since I pointed out this problem on other answers.
Given that Steve now has his solution working, I think that is the better answer.
Original answer:
I think something like the following works:
SELECT top 5 ticketid, userid
FROM
(
SELECT ticketid, userid, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY NEWID()) as nid
FROM tickets
) a
WHERE nid = 1
ORDER BY NEWID()
Here's an sql fiddle to play around with it.
Credit where credit is due: I based this on Steve's solution which I don't think works correctly as written.
Something like the following I think.
Please note this code is untested, so please excuse any small syntax errors.
WITH randomised_tickets AS
(
SELECT
*
,ROW_NUMBER() OVER (ORDER BY NEWID() ASC) AS random_order
FROM Tickets
)
,ordered_winning_tickets AS
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY userId ORDER BY random_order ASC) AS user_win_order
FROM randomised_tickets
)
SELECT TOP 5
*
FROM
ordered_winning_tickets
WHERE
user_win_order = 1 --eliminate 2nd wins from the list
ORDER BY
random_order
You could try something like this, using ignore_dup_key on a temp table to eliminate duplciates for a user:
drop table if exists #WinningTickets
create table #WinningTickets(PickId int identity primary key, TicketId int, UserId int)
create unique index ix_unique_user on #WinningTickets(UserId) with (ignore_dup_key=on)
while ( select count(*) from #WinningTickets ) < 5
begin
insert into #WinningTickets
select top 10 TicketId, UserId
from Tickets
order by newid()
end
select top 5 *
from #WinningTickets
order by PickId

How to find and delete all duplicates from SQL Server database

I'm new to SQL in general and I need to delete all duplicates in a given database.
For the moment, I use this DB to experiment some things.
The table currently looks like this :
I know I can find all duplicates using this query :
SELECT COUNT(*) AS NBR_DOUBLES, Name, Owner
FROM dbo.animals
GROUP BY Name, Owner
HAVING COUNT(*) > 1
but I have a lot of trouble finding an adapted and updated solution to not only find all the duplicates, but also delete them all, only leaving one of each.
Thanks a lot for taking some of your time to help me.
;WITH numbered AS (
SELECT ROW_NUMBER() OVER(PARTITION BY Name, Owner ORDER BY Name, Owner) AS _dupe_num
FROM dbo.Animals
)
DELETE FROM numbered WHERE _dupe_num > 1;
This will delete all but one of each occurance with the same Name & Owner, if you need it to be more specific you should extend the PARTITION BY clause. If you want it to take in account the entire record you should add all your fields.
The record left behind is currently random, since it seems you do not have any field to have any sort of ordering on.
What you want to do is use a projection that numbers each record within a given duplicate set. You can do that with a Windowing Function, like this:
SELECT Name, Owner
,Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
ORDER BY Name, Owner
This should give you results like this:
Name Owner RowNum
Ecstasy Sacha 1
Ecstasy Sacha 2
Ecstasy Sacha 3
Gremlin Max 1
Gremlin Max 2
Gremlin Max 3
Outch Max 1
Outch Max 2
Outch Max 3
Now you want to convert this to a DELETE statement that has a WHERE clause targeting rows with RowNum > 1. The way to use a windowing function with a DELETE is to first include the windowing function as part of a common table expression (CTE), like this:
WITH dupes AS
(
SELECT Name, Owner,
Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
)
DELETE FROM dupes WHERE RowNum > 1;
This will delete later duplicates, but leave row #1 for each group intact. The only trick now is to make sure row #1 is the correct row, since not all of your duplicates have the same values for the Birth or Death columns. This is the reason I included the Birth column in the windowing function, while other answers (so far) have not. You need to decide if you want to keep the oldest animal or the youngest, and optionally change the Birth order in the OVER clause to match your needs.
Use CTE. I will show you a sample :
Create table #Table1(Field1 varchar(100));
Insert into #Table1 values
('a'),('b'),('f'),('g'),('a'),('b');
Select * from #Table1;
WITH CTE AS(
SELECT Field1,
RN = ROW_NUMBER()OVER(PARTITION BY Field1 ORDER BY Field1)
FROM #Table1
)
--SELECT * FROM CTE WHERE RN > 1
DELETE FROM CTE WHERE RN > 1
What I am doing is, numbering the rows. If there are duplicates based on PARTITION BY columns, it will be numbered sequentially, else 1.
Then delete those records whose count is greater than 1.
I won't spoon feed you solution hence you will have to play with PARTITION BY to reach your output
output :
Select * from #Table1;
Field1
---------
a
b
f
g
a
b
/*with cte as (...) SELECT * FROM CTE;*/
Field1 RN
------- -----
a 1
a 2
b 1
b 2
f 1
g 1
if NBR_DOUBLES had an ID field, I believe you could use this;
DELETE FROM NBR_DOUBLES WHERE ID IN
(
SELECT MAX(ID)
FROM dbo.animals
GROUP BY Name, Owner
HAVING COUNT(*) > 1
)

Resources