Finding Duplicate Data in Oracle - database

I have a table with 500,000+ records, and fields for ID, first name, last name, and email address. What I'm trying to do is find rows where the first name AND last name are both duplicates (as in the same person has two separate IDs, email addresses, or whatever, they're in the table more than once). I think I know how to find the duplicates using GROUP BY, this is what I have:
SELECT first_name, last_name, COUNT(*)
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
The problem is that I need to then move the entire row with these duplicated names into a different table. Is there a way to find the duplicates and get the whole row? Or at least to get the IDs as well? I tried using a self-join, but got back more rows than were in the table to begin with. Would that be a better approach? Any help would be greatly appreciated.

The most effective way to remove duplicate rows is with a self-join:
DELETE FROM person_table a
WHERE a.rowid >
ANY (SELECT b.rowid
FROM person_table b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name);
This removes every duplicate, even when a name appears more than twice, and keeps the row with the lowest rowid in each group.
There is more on removing duplicates and differing methods here: http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm
Hope it helps...
EDIT: As per your comments, if you want to select all but one of the duplicates then
SELECT *
FROM person_table a
WHERE a.rowid >
ANY (SELECT b.rowid
FROM person_table b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name);

An index on (first_name, last_name) or on (last_name, first_name) would help:
SELECT t.*
FROM
person_table t
JOIN
( SELECT first_name, last_name
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
) dup
ON dup.last_name = t.last_name
AND dup.first_name = t.first_name
or:
SELECT t.*
FROM person_table t
WHERE EXISTS
( SELECT *
FROM person_table dup
WHERE dup.last_name = t.last_name
AND dup.first_name = t.first_name
AND dup.ID <> t.ID
)
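For the original goal of copying whole duplicate rows into another table, the two queries above can be tried end to end. This is only a sketch: it runs against SQLite through Python's sqlite3 module rather than Oracle, and the person_dupes table and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_table (id INTEGER PRIMARY KEY, first_name TEXT,
                               last_name TEXT, email TEXT);
    -- person_dupes is a hypothetical target table for the duplicated rows
    CREATE TABLE person_dupes (id INTEGER, first_name TEXT,
                               last_name TEXT, email TEXT);
    INSERT INTO person_table VALUES
        (1, 'Ann', 'Lee',  'ann@example.com'),
        (2, 'Ann', 'Lee',  'alee@example.com'),
        (3, 'Bob', 'Ray',  'bob@example.com'),
        (4, 'Ann', 'Lee',  'ann.lee@example.com'),
        (5, 'Cy',  'Dunn', 'cy@example.com');
""")

# Variant 1: join back to the grouped list of duplicated names.
JOIN_QUERY = """
    SELECT t.id
    FROM person_table t
    JOIN (SELECT first_name, last_name
          FROM person_table
          GROUP BY first_name, last_name
          HAVING COUNT(*) > 1) dup
      ON dup.first_name = t.first_name AND dup.last_name = t.last_name
    ORDER BY t.id
"""

# Variant 2: keep a row if another row with the same name but a different id exists.
EXISTS_FILTER = """
    FROM person_table t
    WHERE EXISTS (SELECT 1
                  FROM person_table dup
                  WHERE dup.first_name = t.first_name
                    AND dup.last_name = t.last_name
                    AND dup.id <> t.id)
"""

join_ids = [row[0] for row in conn.execute(JOIN_QUERY)]
exists_ids = [row[0] for row in conn.execute("SELECT t.id " + EXISTS_FILTER + " ORDER BY t.id")]

# Copy the full rows of every duplicated person into the second table.
conn.execute("INSERT INTO person_dupes SELECT t.* " + EXISTS_FILTER)
copied = conn.execute("SELECT COUNT(*) FROM person_dupes").fetchone()[0]
```

Both variants return every member of a duplicated group, including the row you may want to keep, so deciding which copy stays is a separate step.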

This will give you an ID you want to move/delete/etc. Note that it does not work if count(*) > 2, as you get only 1 ID (you could re-run your query for these cases).
SELECT max(ID), first_name, last_name, COUNT(*)
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
Edit: You can use COLLECT to get all IDs at once (but be careful, as you only want to move/delete all but one)

To add another option, I usually use this one to remove duplicates:
delete from person_table
where rowid in (select rid
from (select rowid rid, row_number() over
(partition by first_name,last_name order by rowid) rn
from person_table
)
where rn <> 1 )
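Here is the same row_number() idea as a runnable sketch. It uses SQLite via Python's sqlite3 module instead of Oracle (SQLite also exposes a rowid, and needs version 3.25 or later for window functions); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_table (id INTEGER PRIMARY KEY, first_name TEXT,
                               last_name TEXT, email TEXT);
    INSERT INTO person_table VALUES
        (1, 'Ann', 'Lee',  'ann@example.com'),
        (2, 'Ann', 'Lee',  'alee@example.com'),
        (3, 'Bob', 'Ray',  'bob@example.com'),
        (4, 'Ann', 'Lee',  'ann.lee@example.com'),
        (5, 'Cy',  'Dunn', 'cy@example.com');
""")

# Number the rows inside each (first_name, last_name) group by rowid and
# delete everything except the first row of each group.
conn.execute("""
    DELETE FROM person_table
    WHERE rowid IN (
        SELECT rid
        FROM (SELECT rowid AS rid,
                     ROW_NUMBER() OVER (PARTITION BY first_name, last_name
                                        ORDER BY rowid) AS rn
              FROM person_table)
        WHERE rn <> 1)
""")

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM person_table ORDER BY id")]
```

Changing the ORDER BY inside the window is what decides which copy survives, for example the newest row instead of the lowest rowid.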

Related

SQL Server Remove Duplicates

I have a table that Tracks Employees and the days they have spent in a policy. I don't generate this Data, it is dumped to our Server Daily.
The table looks like this:
My Goal is to get rid of the duplicates by keeping only the most recent Date.
In this example, if I run the query, I would like it to keep Rows 11 for Nicholas Morris and 14 for Tiana Sullivan.
Assumption: First name and Last Name combo are unique
So far,
This is what I have been doing:
select e.*
from Employees e
join (
Select FirstName, LastName
from Employees
group by FirstName, LastName
Having count(*) > 1) d
on d.FirstName = e.FirstName and d.LastName = e.LastName
This returns to me the rows that have duplicates and I have to manually search them and remove the ones I don't want to keep.
I am sure there is a better way of doing this
Thanks for your help
You can use a CTE and ROW_NUMBER() function to do it.
The query to get the data is:
SELECT ID, FirstName, LastName, ROW_NUMBER()
OVER (PARTITION BY FirstName, LastName ORDER BY DaysInPolicy DESC) AS Identifier
FROM
Employees
The query to remove duplicates is:
;WITH CTE AS (
SELECT ID, ROW_NUMBER()
OVER (PARTITION BY FirstName, LastName ORDER BY DaysInPolicy DESC) AS Identifier
FROM
Employees
)
DELETE E
FROM
Employees E
INNER JOIN CTE C ON C.ID = E.ID
WHERE
C.Identifier > 1
You could delete using an exists operator where you remove any row that has the same first and last name, but with a newer date:
DELETE e1 FROM employees e1
WHERE EXISTS (SELECT *
FROM employees e2
WHERE e1.FirstName = e2.FirstName AND
e1.LastName = e2.LastName AND
e1.DaysInPolicy < e2.DaysInPolicy)
Try this:
SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY LastName, FirstName ORDER BY DaysInPolicy DESC) AS RowNum
FROM Employees
) AS Emp
WHERE Emp.RowNum > 1
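One difference between the EXISTS approach and the ROW_NUMBER() approach shows up when two rows for the same person tie on DaysInPolicy. The sketch below compares them on invented data, using SQLite through Python's sqlite3 module rather than SQL Server; the tie-breaker on ID is my own assumption, not something from the question.

```python
import sqlite3

# Invented sample rows: (ID, FirstName, LastName, DaysInPolicy)
ROWS = [
    (1, "Nicholas", "Morris", 10),
    (2, "Nicholas", "Morris", 25),
    (3, "Tiana", "Sullivan", 40),
    (4, "Tiana", "Sullivan", 40),   # ties with row 3
    (5, "Omar", "Reid", 7),
]

DELETE_WITH_EXISTS = """
    DELETE FROM Employees
    WHERE EXISTS (SELECT 1
                  FROM Employees e2
                  WHERE e2.FirstName = Employees.FirstName
                    AND e2.LastName = Employees.LastName
                    AND e2.DaysInPolicy > Employees.DaysInPolicy)
"""

DELETE_WITH_ROW_NUMBER = """
    DELETE FROM Employees
    WHERE ID IN (
        SELECT ID
        FROM (SELECT ID,
                     ROW_NUMBER() OVER (PARTITION BY FirstName, LastName
                                        ORDER BY DaysInPolicy DESC, ID DESC) AS rn
              FROM Employees)
        WHERE rn > 1)
"""

def surviving_ids(delete_sql):
    """Load the sample rows into a fresh database, run the delete, return the IDs left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employees (ID INTEGER PRIMARY KEY, FirstName TEXT, "
                 "LastName TEXT, DaysInPolicy INTEGER)")
    conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?)", ROWS)
    conn.execute(delete_sql)
    return [row[0] for row in conn.execute("SELECT ID FROM Employees ORDER BY ID")]

after_exists = surviving_ids(DELETE_WITH_EXISTS)
after_row_number = surviving_ids(DELETE_WITH_ROW_NUMBER)
```

With EXISTS both tied rows survive, because neither has a strictly larger value than the other; ROW_NUMBER() always leaves exactly one row per name.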

Replace Values in Oracle DB with set of test data

I am currently working on a project where I have a copy of a production Oracle database with some production values that I want to replace with a set of test values (with the assumption that I have more production records than test ones and I will have duplicates).
Here is a sample of what I am looking to do:
Any suggestions would be greatly appreciated.
One approach would be to number the rows by their rowid, then use a modulo operation to match the original data to your test data table, like this:
merge into customer c
using (
with cte_customer as (
select rowid xrowid, mod(row_number() over (order by rowid)-1,(select count(*) from test_data))+1 rownumber
from customer
order by rowid
), cte_testdata as (
select row_number() over (order by rowid) rownumber, first_name, last_name, email
from test_data
order by rowid
)
select c.xrowid, t.last_name, t.first_name, t.email
from cte_customer c
left outer join cte_testdata t on (t.rownumber = c.rownumber)
order by c.xrowid
) u
on (c.rowid = u.xrowid)
when matched then update set
c.last_name = u.last_name,
c.first_name = u.first_name,
c.email = u.email
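The modulo step is the part that lets a small test set cover a larger production table, so here it is in isolation. This is a plain Python sketch of the arithmetic only, with made-up keys and test rows; assign_test_rows is a hypothetical helper, not part of the MERGE above.

```python
def assign_test_rows(prod_keys, test_rows):
    """Map each production key to a test row, cycling through the test rows.

    Mirrors mod(row_number - 1, count) + 1 from the SQL, written 0-based.
    """
    if not test_rows:
        raise ValueError("need at least one test row")
    n = len(test_rows)
    # sorted() plays the role of ORDER BY rowid: it fixes a stable numbering.
    return {key: test_rows[i % n] for i, key in enumerate(sorted(prod_keys))}

# Invented data: five production rows, two test rows.
prod_keys = ["AAAC", "AAAA", "AAAB", "AAAE", "AAAD"]
test_rows = [
    ("Test", "One", "t1@example.com"),
    ("Test", "Two", "t2@example.com"),
]

assigned = assign_test_rows(prod_keys, test_rows)
```

Every production row gets a test row, and the test rows repeat once the production numbering runs past the size of the test set.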

Unique data based on name,email from sql server

I want to get unique rows based on FirstName and EmailID. I tried adding DISTINCT across all the columns, but I still get duplicate rows. I also tried GROUP BY, but that failed with an error. I could use a subquery, but that would be slow. What is the best solution for the query below?
SELECT FirstName,LastName,FamilyName, EmailID,Phone,City,Country,CreatedOn,t.Type , ID
FROM Forms C JOIN Form_Type T
ON c.Form_TypeID = t.Form_TypeID
WHERE c.Form_TypeID = 1 AND DATEDIFF( "d", CreatedOn, GETDATE()) < 31
ORDER BY CreatedOn DESC
See if this works for you:
SELECT *
FROM (
SELECT FirstName,LastName,FamilyName, EmailID,Phone,City,Country,CreatedOn,t.Type , ID,
ROW_NUMBER() OVER (PARTITION BY FirstName ,EmailID ORDER BY CreatedOn DESC ) NewCol
FROM Forms C
JOIN Form_Type T ON c.Form_TypeID = t.Form_TypeID
WHERE c.Form_TypeID = 1
AND DATEDIFF("d", CreatedOn, GETDATE()) < 31
) t
WHERE NewCol = 1
I have added an extra column (NewCol) in the inner query. I am assuming that you want to display the most recent record (by CreatedOn) for each combination of FirstName and EmailID.
DISTINCT will not work in your case, as you want all the fields from the table. So you need to use a sub-query to create a list of distinct names/emails.
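You should be able to adapt the following example to your needs:
SELECT t1.[User], t1.EMail, t1.Address1, t1.Address2
FROM Table1 t1
INNER JOIN (SELECT DISTINCT [User], EMail FROM Table1) tmp ON t1.[User] = tmp.[User] AND t1.EMail = tmp.EMail
Using an INNER JOIN, this returns only rows from Table1 that are in table tmp. Table tmp is defined as the distinct combinations of User and EMail from Table1.
So what happens is: you create a distinct list of User and EMail from Table1, then you select all the entries from Table1 where User and EMail are in that list. Note that every row of Table1 matches one entry in that list, so this join on its own still returns every row. To get one row per combination, the subquery also has to pick a single row per pair, for example the MIN of a unique key column with GROUP BY, joined on that key.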
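To check what a join against a distinct list actually returns, here is a small sketch on invented data. It runs in SQLite through Python's sqlite3 module rather than SQL Server, and the id and user_name columns are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER PRIMARY KEY, user_name TEXT,
                         email TEXT, address1 TEXT);
    INSERT INTO Table1 VALUES
        (1, 'ann', 'a@example.com', '1 Main St'),
        (2, 'ann', 'a@example.com', '9 Oak Ave'),
        (3, 'bob', 'b@example.com', '2 Elm Rd'),
        (4, 'bob', 'b@example.com', '2 Elm Road');
""")

# Join every row to the distinct (user_name, email) pairs.
rows_via_distinct_join = conn.execute("""
    SELECT COUNT(*)
    FROM Table1 t1
    JOIN (SELECT DISTINCT user_name, email FROM Table1) tmp
      ON t1.user_name = tmp.user_name AND t1.email = tmp.email
""").fetchone()[0]

# Pick one id per pair in the subquery, then join on that id.
one_per_pair_ids = [row[0] for row in conn.execute("""
    SELECT t1.id
    FROM Table1 t1
    JOIN (SELECT MIN(id) AS keep_id
          FROM Table1
          GROUP BY user_name, email) k
      ON k.keep_id = t1.id
    ORDER BY t1.id
""")]
```

In this sketch the first join returns as many rows as the table holds, while the second returns a single row per pair.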

Select Maximum, using fields from 2 tables

I have a database Library, which has a lot of tables and we need 3 tables for query:
Table Librarians: ID, Surname;
Table StudentCard: ID, foreign key on table Librarians and other columns which we don't use
Table TeacherCard: ID, foreign key on table Librarians and other columns which we don't use
Query: select the surname of the librarian who issued the most books.
I know how to solve this when the data comes from only one table, e.g. TeacherCard:
SELECT TOP 1 WITH TIES
Librarians.LastName, MAX(Librarians.CountOfBooks) AS Books
FROM
(SELECT
L.LastName, COUNT(*) AS CountOfBooks
FROM Libs L, T_Cards T
WHERE T.Id_Lib IN (SELECT L.Id)
GROUP BY L.LastName) AS Librarians
GROUP BY
Librarians.LastName
ORDER BY
MAX(Librarians.CountOfBooks) DESC
GO
I don't know how to use data from TeacherCard and StudentCard at the same time.
Please help me write this query.
I found a working solution:
SELECT TOP 1 B.Name, B.CountOut
FROM
(SELECT A.Name, SUM(A.Count) AS CountOut
FROM
(SELECT Libs.LastName AS Name, COUNT(S_Cards.DateOut) AS [Count]
FROM Libs JOIN S_Cards ON S_Cards.Id_Lib = Libs.Id
GROUP BY Libs.LastName
UNION ALL
SELECT Libs.LastName AS Name, COUNT(T_Cards.DateOut) AS [Count]
FROM Libs JOIN T_Cards ON T_Cards.Id_Lib = Libs.Id
GROUP BY Libs.LastName) AS A
GROUP BY A.Name ) AS B
ORDER BY B.CountOut DESC
Here is another working answer:
SELECT TOP 2 LastName, COUNT (*) [count] FROM
(SELECT LastName FROM Libs L, S_Cards S
WHERE S.id_lib = L.id
UNION ALL
SELECT LastName FROM Libs L, T_Cards T
WHERE T.id_lib = L.id) As Res
GROUP By LastName
ORDER BY COUNT (*) DESC
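The UNION ALL idea can be checked on a tiny data set. The sketch below uses SQLite via Python's sqlite3 module, so LIMIT 1 stands in for TOP 1; the librarian names and card rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Libs    (Id INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE S_Cards (Id INTEGER PRIMARY KEY, Id_Lib INTEGER, DateOut TEXT);
    CREATE TABLE T_Cards (Id INTEGER PRIMARY KEY, Id_Lib INTEGER, DateOut TEXT);
    INSERT INTO Libs VALUES (1, 'Ivanov'), (2, 'Petrenko');
    -- Ivanov leads on student cards alone (2 vs 1) ...
    INSERT INTO S_Cards VALUES (1, 1, '2020-01-01'), (2, 1, '2020-01-02'),
                               (3, 2, '2020-01-03');
    -- ... but Petrenko leads once teacher cards are added (3 vs 2).
    INSERT INTO T_Cards VALUES (1, 2, '2020-01-04'), (2, 2, '2020-01-05');
""")

# Stack the issues from both card tables, then count per librarian.
top_librarian = conn.execute("""
    SELECT LastName, COUNT(*) AS cnt
    FROM (SELECT L.LastName FROM Libs L JOIN S_Cards S ON S.Id_Lib = L.Id
          UNION ALL
          SELECT L.LastName FROM Libs L JOIN T_Cards T ON T.Id_Lib = L.Id) AS res
    GROUP BY LastName
    ORDER BY cnt DESC
    LIMIT 1
""").fetchone()
```

UNION ALL matters here: plain UNION would collapse identical LastName rows before the count and give the wrong totals.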

Cleaning T-SQL tables

I'm new to T-SQL and trying to do some cleanup on some data imported from Excel into SQL Server.
I have made a batch import that imports the raw data to a staging table and now I want to clean it up.
I have the following tables
tblRawInput (my staging table):
Name, Name2, Name3, Phonenumber, Group
tblPeople:
PersonID (IDENTITY), Name, Phonenumber, GroupID
tblGroups:
GroupID (IDENTITY), Groupname
tblAltNames:
NameID (IDENTITY), Name, PersonID
The query should be able to split the data into the other tables, but not create a group if it already exists.
I am at a loss. Could anyone give me a pointer in the right direction.
When I do a SELECT INTO it creates multiple copies of the groups.
You can use a not exists clause to insert only new groups (Group is a reserved word, so it needs brackets):
insert into tblGroups
(GroupName)
select distinct ri.[Group]
from tblRawInput ri
where not exists
(
select *
from tblGroups g
where g.GroupName = ri.[Group]
)
After this, you can insert into tblPeople like this:
insert tblPeople
(Name, GroupID)
select ri.Name
, g.GroupID
from tblRawInput ri
-- Look up GroupID in the tblGroups table
join tblGroups g
on g.GroupName = ri.[Group]
You can do the alternate names along the same lines.
First, order matters in this case. Insert into the groups table first.
insert into tblGroups
(GroupName)
select distinct ri.[Group]
from tblRawInput ri
where not exists
(
select *
from tblGroups g
where g.GroupName = ri.[Group]
)
Then insert to the people table
Insert into tblPeople(Name, Phonenumber, GroupID)
Select ri.Name, ri.Phonenumber, g.GroupID
from tblRawInput ri
join tblGroups g
on g.GroupName = ri.[Group]
where not exists
(
select *
from tblPeople p
where p.Name = ri.Name
and p.Phonenumber = ri.Phonenumber
and p.groupId = g.groupid
)
Then insert the alternate names
Insert into tblAltNames (Name, PersonID)
Select Distinct ri.Name2, p.PersonID
from tblRawInput ri
join tblPeople p
on p.Name = ri.Name
where not exists
(
select *
from tblAltNames an
where an.Name = ri.Name2
)
Of course, all of this should be wrapped in a transaction and a try/catch block that rolls everything back if one of the inserts fails.
Possibly the second query should use the OUTPUT clause to get the PersonIDs that were inserted instead of the join. You don't really have anything good to join on here, because names are not unique.
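As a rough illustration of the "one transaction, roll back on failure" advice, here is a sketch in SQLite through Python's sqlite3 module, not T-SQL. The sample rows, the load_staging helper and its fail_after_groups switch are all invented for the example; in SQL Server the equivalent would be BEGIN TRANSACTION inside TRY/CATCH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblRawInput (Name TEXT, Name2 TEXT, Name3 TEXT,
                              Phonenumber TEXT, [Group] TEXT);
    CREATE TABLE tblGroups (GroupID INTEGER PRIMARY KEY, Groupname TEXT);
    CREATE TABLE tblPeople (PersonID INTEGER PRIMARY KEY, Name TEXT,
                            Phonenumber TEXT, GroupID INTEGER);
    INSERT INTO tblGroups (Groupname) VALUES ('Sales');   -- group that already exists
    INSERT INTO tblRawInput VALUES
        ('Ann', 'Annie', NULL, '555-0001', 'Sales'),
        ('Bob', NULL,    NULL, '555-0002', 'Ops'),
        ('Cy',  NULL,    NULL, '555-0003', 'Ops');
""")

INSERT_GROUPS = """
    INSERT INTO tblGroups (Groupname)
    SELECT DISTINCT ri.[Group]
    FROM tblRawInput ri
    WHERE NOT EXISTS (SELECT 1 FROM tblGroups g WHERE g.Groupname = ri.[Group])
"""
INSERT_PEOPLE = """
    INSERT INTO tblPeople (Name, Phonenumber, GroupID)
    SELECT ri.Name, ri.Phonenumber, g.GroupID
    FROM tblRawInput ri
    JOIN tblGroups g ON g.Groupname = ri.[Group]
    WHERE NOT EXISTS (SELECT 1 FROM tblPeople p
                      WHERE p.Name = ri.Name
                        AND p.Phonenumber = ri.Phonenumber
                        AND p.GroupID = g.GroupID)
"""

def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def load_staging(fail_after_groups=False):
    """Run both inserts as one unit; return False if the unit was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(INSERT_GROUPS)
            if fail_after_groups:
                raise RuntimeError("simulated failure between the inserts")
            conn.execute(INSERT_PEOPLE)
        return True
    except RuntimeError:
        return False

failed_ok = load_staging(fail_after_groups=True)
groups_after_failure = count("tblGroups")      # the 'Ops' insert was rolled back
loaded_ok = load_staging()
load_staging()                                 # re-running adds nothing new
groups_final, people_final = count("tblGroups"), count("tblPeople")
```

The NOT EXISTS guards make the load safe to re-run, and the transaction keeps a half-finished load from leaving groups behind without their people.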
