Going through tables and replacing IDs resulting from duplicates in dimension table

I have a dimension table, DimUsers, that unfortunately contains many duplicate records (see screenshot).
There are thousands of users, and 5 other tables reference the duplicated rows. I want to go through those 5 dependent tables, replace each "bad" UserID with the "good" one (circled in red), and then delete the records with the "bad" UserIDs.
What would be a good approach to this?
Here is the query I used to produce the screenshot above:
SELECT UserID
,userIds.FirstName
,userIds.LastName
,dupTable.Email
,dupTable.Username
,dupTable.DupCount
FROM dbo.DimUsers AS userIds
LEFT OUTER JOIN
(SELECT FirstName
,LastName
,Email
,UserName
,DupCount
FROM
(SELECT FirstName
,LastName
,UserName
,Email
,COUNT(*) AS DupCount -- we're finding duplications by matches on FirstName,
-- last name, UserName AND Email. All four fields must match
-- to find a dupe. More confidence from this.
FROM dbo.DimUsers
GROUP BY FirstName
,LastName
,UserName
,Email
HAVING COUNT(*) > 1) AS userTable -- any count more than 1 is a dupe
WHERE LastName NOT LIKE 'NULL' -- exclude entries with literally NULL names
AND FirstName NOT LIKE 'NULL'
)AS dupTable
ON dupTable.FirstName = userIds.FirstName -- to get the userIds of dupes, we LEFT JOIN the original table
AND dupTable.LastName = userIds.LastName -- on four fields to increase our confidence
AND dupTable.Email = userIds.Email
AND dupTable.Username = userIds.Username
WHERE DupCount IS NOT NULL -- ignore NULL dupcounts, these are not dupes

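This code should work. It is written for one dependent table (DimUsersChild1 here), but you can use the same logic to update the other 4 tables. The lowest UserID in each group of duplicates is kept as the "good" one.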
update t
set UserID = MinUserID.UserID
from
DimUsersChild1 t
inner join DimUsers on DimUsers.UserID = t.UserID
inner join (
select min(UserID) UserID, FirstName, LastName, UserName, Email
from DimUsers
group by
FirstName, LastName, UserName, Email
) MinUserID on
MinUserID.FirstName = DimUsers.FirstName and
MinUserID.LastName = DimUsers.LastName and
MinUserID.UserName = DimUsers.UserName and
MinUserID.Email = DimUsers.Email
select * from DimUsersChild1;
delete t1
from
DimUsers t
inner join DimUsers t1 on t1.FirstName = t.FirstName and
t1.LastName = t.LastName and
t1.UserName = t.UserName and
t1.Email = t.Email
where
t.UserID < t1.UserID
select * from DimUsers;
Here is a working demo
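With five referencing tables, it can be easier to compute the bad-to-good mapping once and reuse it. The following is a minimal sketch of that idea, not a tested migration. It assumes that the lowest UserID in each duplicate group is the one to keep (as in the code above) and that dbo.DimUsersChild1 stands in for each referencing table. #UserIdMap is a temporary table introduced only for this illustration.

```sql
BEGIN TRANSACTION;

-- Build the mapping once: every duplicate ("bad") UserID and the UserID to keep.
SELECT u.UserID AS BadUserID,
       k.GoodUserID
INTO   #UserIdMap                      -- hypothetical temp table
FROM   dbo.DimUsers AS u
JOIN  (SELECT FirstName, LastName, UserName, Email,
              MIN(UserID) AS GoodUserID   -- assumption: lowest ID is the "good" one
       FROM   dbo.DimUsers
       GROUP BY FirstName, LastName, UserName, Email
       HAVING COUNT(*) > 1) AS k
  ON   k.FirstName = u.FirstName
 AND   k.LastName  = u.LastName
 AND   k.UserName  = u.UserName
 AND   k.Email     = u.Email
WHERE  u.UserID <> k.GoodUserID;

-- Repoint one referencing table; repeat this statement for each of the others.
UPDATE c
SET    c.UserID = m.GoodUserID
FROM   dbo.DimUsersChild1 AS c         -- placeholder table name
JOIN   #UserIdMap AS m ON m.BadUserID = c.UserID;

-- Once nothing references the bad IDs, remove them from the dimension.
DELETE u
FROM   dbo.DimUsers AS u
JOIN   #UserIdMap AS m ON m.BadUserID = u.UserID;

COMMIT TRANSACTION;
DROP TABLE #UserIdMap;
```

Keeping the mapping in one place means all five updates and the final delete work from the same list of IDs, and the list can be inspected before anything is changed.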

Related

Unique data based on name,email from sql server

I want to get unique rows based on FirstName and EmailID. I tried adding DISTINCT to the select list, but I still get duplicate rows. I also tried GROUP BY, which failed with an error. I could use a subquery, but that will be slow. What is the best solution for the query below?
SELECT FirstName,LastName,FamilyName, EmailID,Phone,City,Country,CreatedOn,t.Type , ID
FROM Forms C JOIN Form_Type T
ON c.Form_TypeID = t.Form_TypeID
WHERE c.Form_TypeID = 1 AND DATEDIFF( "d", CreatedOn, GETDATE()) < 31
ORDER BY CreatedOn DESC

See if this works for you:
SELECT *
FROM (
SELECT FirstName,LastName,FamilyName, EmailID,Phone,City,Country,CreatedOn,t.Type , ID,
ROW_NUMBER() OVER (PARTITION BY FirstName ,EmailID ORDER BY CreatedOn DESC ) NewCol
FROM Forms C
JOIN Form_Type T ON c.Form_TypeID = t.Form_TypeID
WHERE c.Form_TypeID = 1
AND DATEDIFF("d", CreatedOn, GETDATE()) < 31
) t
WHERE NewCol = 1
I have added an extra column (NewCol) in the inner query. I am assuming that you want to display the most recent record (by CreatedOn) for each combination of FirstName and EmailID.
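To see what the ROW_NUMBER() filter does in isolation, here is a small self-contained sketch. The table variable @Forms and its three sample rows are made up for illustration and only mimic the columns that matter for the partitioning.

```sql
-- Hypothetical sample data: two submissions for Ann, one for Bob.
DECLARE @Forms TABLE (ID INT, FirstName VARCHAR(50), EmailID VARCHAR(100), CreatedOn DATETIME);

INSERT INTO @Forms (ID, FirstName, EmailID, CreatedOn)
VALUES (1, 'Ann', 'ann@example.com', '20240101'),
       (2, 'Ann', 'ann@example.com', '20240105'),
       (3, 'Bob', 'bob@example.com', '20240103');

-- Number the rows within each (FirstName, EmailID) group, newest first,
-- and keep only the first row of every group.
SELECT ID, FirstName, EmailID, CreatedOn
FROM (
    SELECT f.*,
           ROW_NUMBER() OVER (PARTITION BY FirstName, EmailID
                              ORDER BY CreatedOn DESC) AS rn
    FROM @Forms AS f
) AS x
WHERE rn = 1;
-- Returns the rows with ID 2 (Ann's latest) and ID 3 (Bob's only row).
```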

DISTINCT will not work in your case, because you want all the fields from the table. You need a sub-query that builds the list of distinct names/emails.
You should be able to adapt the following example to your needs:
SELECT t1.[User], t1.EMail, t1.Address1, t1.Address2
FROM Table1 t1
INNER JOIN (SELECT DISTINCT [User], EMail FROM Table1) tmp ON t1.[User] = tmp.[User] AND t1.EMail = tmp.EMail
With the INNER JOIN, this returns only rows from Table1 that are also in tmp, where tmp is the set of distinct combinations of User and EMail from Table1.
So what happens is: you create a distinct list of User and EMail from Table1, then select all the entries from Table1 whose User and EMail are in that list. Note that this join on its own still returns every matching row of Table1. To keep a single row per pair, the sub-query also has to pick one key value per pair (for example the MIN of a unique key column, with GROUP BY), and the join has to use that key.

Tough SQL Rank query

Scenario: primary user table plus a separate audit table that tracks changes to the user's name. Each time a user is added or part of their name is edited, we write a row to the audit table.
Trying to write a query that pulls the most immediate former name
select nsh.AuthEmail, nsh.UserID, nsh.name_lastnamefirst, t.FormerName, t.RankOrder
from (
Select
an.AuditNameID, nsh.AuthEmail, nsh.UserID, nsh.name_lastnamefirst,
FormerName = CASE WHEN RTRIM(an.LastName) <> RTRIM(nsh.LastName) OR RTRIM(an.FirstName) <> RTRIM(nsh.FirstName) OR RTRIM(an.Suffix) <> RTRIM(nsh.Suffix) OR RTRIM(an.MaidenName)<>RTRIM(nsh.MaidenName) THEN LTRIM(an.LastName + ' ' + an.Suffix + ', ' + an.FirstName + ' ' + ISNULL(an.MiddleName,''))
ELSE null
END,
RANK() over (partition by an.UserID order by an.AuditNameID DESC) RankOrder
From [dbo].[AuditName] an
INNER JOIN dbo.StudentPrograms p ON an.UserID = p.UserID
INNER JOIN dbo.NameScalarHelper nsh ON p.UserID = nsh.UserID
WHERE p.SiteProgramID = 139 AND p.IsActive =1
) t
RIGHT OUTER JOIN dbo.NameScalarHelper nsh ON nsh.UserID = t.UserID
where FormerName is not null
The problem is that I can't figure out how to return the row from the audit table whose rank is one below the top, because the top-ranked row is the current data. Any ideas are welcome.

Looking at your requirement, you basically need to return the latest name in your audit table that doesn't match the current one.
I think you could use OUTER APPLY to achieve that:
SELECT *
FROM [dbo].[StudentPrograms] AS SP
INNER JOIN [dbo].[NameScalarHelper] AS NSH
ON NSH.UserID = SP.UserID
OUTER APPLY (
SELECT TOP (1) *
FROM [dbo].[AuditName] AS AN
WHERE AN.UserID = SP.UserID
AND (
RTRIM(AN.LastName) <> RTRIM(NSH.LastName)
OR RTRIM(AN.FirstName) <> RTRIM(NSH.FirstName)
OR RTRIM(AN.Suffix) <> RTRIM(NSH.Suffix)
OR RTRIM(AN.MaidenName) <> RTRIM(NSH.MaidenName)
)
ORDER BY AuditNameID DESC
) AS AN
WHERE SP.SiteProgramID = 139
AND SP.IsActive = 1;
This will find the latest name in your audit table that doesn't match the current one.
By the way, I'd strongly suggest cleaning up your data and removing any leading/trailing spaces, so that you don't need LTRIM() or RTRIM() in your WHERE clause and SQL Server can make use of indexes. Read this article for more details. The query then becomes:
SELECT *
FROM [dbo].[StudentPrograms] AS SP
INNER JOIN [dbo].[NameScalarHelper] AS NSH
ON NSH.UserID = SP.UserID
OUTER APPLY (
SELECT TOP (1) *
FROM [dbo].[AuditName] AS AN
WHERE AN.UserID = SP.UserID
AND (
AN.LastName <> NSH.LastName
OR AN.FirstName <> NSH.FirstName
OR AN.Suffix <> NSH.Suffix
OR AN.MaidenName <> NSH.MaidenName
)
ORDER BY AuditNameID DESC
) AS AN
WHERE SP.SiteProgramID = 139
AND SP.IsActive = 1;
I was trying to understand the way you store data and replicated a tiny example:
DECLARE @User TABLE
(
UserID INT
, FirstName VARCHAR(50)
, LastName VARCHAR(50)
);
DECLARE @Audit TABLE
(
AuditID INT IDENTITY(1, 1)
, UserID INT
, FirstName VARCHAR(50)
, LastName VARCHAR(50)
);
-- 1) The user is created (and logged in the audit table)
INSERT INTO @User (UserID, FirstName, LastName)
VALUES (1, 'Ben', 'White');
INSERT INTO @Audit (UserID, FirstName, LastName)
VALUES (1, 'Ben', 'White');
SELECT *
FROM @User AS U
OUTER APPLY (
SELECT TOP (1) *
FROM @Audit AS A
WHERE A.UserID = U.UserID
AND (
A.FirstName <> U.FirstName
OR A.LastName <> U.LastName
)
ORDER BY A.AuditID DESC
) AS A;
-- 2) The last name is changed for the first time
UPDATE U
SET U.LastName = 'Whiter'
FROM @User AS U
WHERE U.UserID = 1;
INSERT INTO @Audit (UserID, FirstName, LastName)
VALUES (1, 'Ben', 'Whiter');
SELECT *
FROM @User AS U
OUTER APPLY (
SELECT TOP (1) *
FROM @Audit AS A
WHERE A.UserID = U.UserID
AND (
A.FirstName <> U.FirstName
OR A.LastName <> U.LastName
)
ORDER BY A.AuditID DESC
) AS A;
-- 3) The last name is changed for the second time
UPDATE U
SET U.LastName = 'Whitest'
FROM @User AS U
WHERE U.UserID = 1;
INSERT INTO @Audit (UserID, FirstName, LastName)
VALUES (1, 'Ben', 'Whitest');
SELECT *
FROM @User AS U
OUTER APPLY (
SELECT TOP (1) *
FROM @Audit AS A
WHERE A.UserID = U.UserID
AND (
A.FirstName <> U.FirstName
OR A.LastName <> U.LastName
)
ORDER BY A.AuditID DESC
) AS A;
-- 4) A second user is added
INSERT INTO @User (UserID, FirstName, LastName)
VALUES (2, 'Tom', 'Brooks');
INSERT INTO @Audit (UserID, FirstName, LastName)
VALUES (2, 'Tom', 'Brooks');
SELECT *
FROM @User AS U
OUTER APPLY (
SELECT TOP (1) *
FROM @Audit AS A
WHERE A.UserID = U.UserID
AND (
A.FirstName <> U.FirstName
OR A.LastName <> U.LastName
)
ORDER BY A.AuditID DESC
) AS A;
I assumed that when you create a user, you also add a record to the Audit table for consistency, and that each update is logged in the Audit table as well. Finally, I added one more user and ran the query again.
This is the output of each query:
The user was created:
UserID FirstName LastName AuditID UserID FirstName LastName
------ --------- -------- ------- ------ --------- --------
1      Ben       White    null    null   null      null
The user's last name was changed for the first time:
UserID FirstName LastName AuditID UserID FirstName LastName
------ --------- -------- ------- ------ --------- --------
1      Ben       Whiter   1       1      Ben       White
The user's last name was changed for the second time:
UserID FirstName LastName AuditID UserID FirstName LastName
------ --------- -------- ------- ------ --------- --------
1      Ben       Whitest  2       1      Ben       Whiter
A new user was added:
UserID FirstName LastName AuditID UserID FirstName LastName
------ --------- -------- ------- ------ --------- --------
1      Ben       Whitest  2       1      Ben       Whiter
2      Tom       Brooks   null    null   null      null
Everything else is just formatting, which you should not do in SQL Server; it belongs in the application layer.

Looking at your query, I'm guessing you are getting back all of a person's past names, since you have no filter on the RankOrder that you've created. Your current name should be 1 in the RankOrder, I assume, and so your most recent previous name would be ranked 2. A window function's alias cannot be referenced in the WHERE clause of the same SELECT, so add the filter in the outer query, like this:
select nsh.AuthEmail, nsh.UserID, nsh.name_lastnamefirst, t.FormerName, t.RankOrder
from (
Select
an.AuditNameID, nsh.AuthEmail, nsh.UserID, nsh.name_lastnamefirst,
FormerName = CASE WHEN RTRIM(an.LastName) <> RTRIM(nsh.LastName) OR RTRIM(an.FirstName) <> RTRIM(nsh.FirstName) OR RTRIM(an.Suffix) <> RTRIM(nsh.Suffix) OR RTRIM(an.MaidenName)<>RTRIM(nsh.MaidenName) THEN LTRIM(an.LastName + ' ' + an.Suffix + ', ' + an.FirstName + ' ' + ISNULL(an.MiddleName,''))
ELSE null
END,
RANK() over (partition by an.UserID order by an.AuditNameID DESC) RankOrder
From [dbo].[AuditName] an
INNER JOIN dbo.StudentPrograms p ON an.UserID = p.UserID
INNER JOIN dbo.NameScalarHelper nsh ON p.UserID = nsh.UserID
WHERE p.SiteProgramID = 139 AND p.IsActive =1
) t
RIGHT OUTER JOIN dbo.NameScalarHelper nsh ON nsh.UserID = t.UserID
where FormerName is not null and t.RankOrder = 2
Let me know if I am missing something.

Cleaning T-SQL tables

I'm new to T-SQL and trying to do some cleanup on some data imported from Excel into SQL Server.
I have made a batch import that imports the raw data to a staging table and now I want to clean it up.
I have the following tables
tblRawInput (my staging table):
Name, Name2, Name3, Phonenumber, Group
tblPeople:
PersonID (IDENTITY), Name, Phonenumber, GroupID
tblGroups:
GroupID (IDENTITY), Groupname
tblAltNames:
NameID (IDENTITY), Name, PersonID
The query should split the data into the other tables, but should not create a group if it already exists.
I am at a loss. Could anyone point me in the right direction?
When I do a SELECT INTO, it creates multiple copies of the groups.

You can use a NOT EXISTS clause to insert only new groups:
insert into tblGroups
(GroupName)
select distinct ri.[Group] -- Group is a reserved word, so it needs brackets
from tblRawInput ri
where not exists
(
select *
from tblGroups g
where g.GroupName = ri.[Group]
)
After this, you can insert into tblPeople like this:
insert tblPeople
(Name, GroupID)
select ri.Name
, g.GroupID
from tblRawInput ri
-- Look up GroupID in the groups table
join tblGroups g
on g.GroupName = ri.[Group]
You can do the alternate names along the same lines.

First, order matters in this case: insert into the groups table first.
insert into tblGroups
(GroupName)
select distinct ri.[Group] -- DISTINCT avoids one copy per raw row; Group is a reserved word
from tblRawInput ri
where not exists
(
select *
from tblGroups g
where g.GroupName = ri.[Group]
)
Then insert into the people table:
Insert into tblPeople(Name, Phonenumber, GroupID)
Select ri.Name, ri.Phonenumber, g.GroupID
from tblRawInput ri
join tblGroups g
on g.groupName = ri.[Group]
where not exists
(
select *
from tblPeople p
where p.Name = ri.Name
and p.Phonenumber = ri.Phonenumber
and p.groupId = g.groupid
)
Then insert the alternate names:
Insert into tblAltNames (Name, Personid)
Select Distinct ri.Name2, p.PersonID
from tblRawInput ri
join tblPeople p
on p.Name = ri.Name
where not exists
(
select *
from tblAltNames an
where an.Name = ri.Name2
)
Of course, all of this should be wrapped in a transaction and a TRY...CATCH block that rolls everything back if one of the inserts fails.
Possibly the second query should use the OUTPUT clause to capture the PersonIDs that were inserted instead of relying on the join. You don't really have anything good to join on here, because names are not unique.
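The transaction and TRY...CATCH wrapper mentioned above could look roughly like this. It is only a sketch: it shows the groups insert and leaves the other two inserts as a placeholder comment, and it relies on THROW, which is available in SQL Server 2012 and later.

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Step 1: add only the groups that do not exist yet.
    INSERT INTO tblGroups (Groupname)
    SELECT DISTINCT ri.[Group]
    FROM tblRawInput AS ri
    WHERE NOT EXISTS (SELECT 1
                      FROM tblGroups AS g
                      WHERE g.Groupname = ri.[Group]);

    -- Steps 2 and 3: the tblPeople and tblAltNames inserts go here.

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo every insert made so far if any statement failed.
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;  -- re-raise the original error to the caller
END CATCH;
```

Because all three inserts share one transaction, a failure in the last step does not leave groups or people behind without their alternate names.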

Recursive CTE with additional EXISTS conditions?

I have a situation where I need to check whether a given person is within a user/manager hierarchy.
The table has the following structure:
UserId
UserName
ManagerId
I have 2 IDs: a UserId (say 5) and a ManagerId (say 2). I need to know whether the manager with the given ID (2) is above the user with the given ID (5) in the hierarchy, directly or indirectly. For example, if
User 1 reports to user 2.
User 3 reports to user 1.
User 4 reports to user 3.
then for UserId = 4 and ManagerId = 1 the query has to return true.
So far I have written a query that returns the whole hierarchy:
WITH temp (level, UserName, UserId, ManagerId) AS
(
SELECT 1 AS level, EmployeeName, EmployeeId, BossId
FROM Employees
WHERE BossId IS NULL
UNION ALL
SELECT level+1 AS level, EmployeeName, EmployeeId, BossId
FROM Employees, temp
WHERE BossId = UserId
)
SELECT t.* from temp AS t
But now I don't know how to write a query that checks the condition described above :(
Thanks in advance for any help!

Find the user in the anchor part and walk your way back up the hierarchy. Then check the rows produced by the recursive query against the manager.
This will return the manager's row if one exists.
WITH temp AS
(
SELECT EmployeeName, EmployeeId, BossId
FROM Employees
WHERE EmployeeId = @UserID
UNION ALL
SELECT E.EmployeeName, E.EmployeeId, E.BossId
FROM Employees AS E
inner join temp AS T
ON E.EmployeeId = T.BossId
)
SELECT *
FROM temp
WHERE EmployeeId = @ManagerID
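If a plain true/false answer is wanted, the same upward walk can be wrapped in EXISTS. The sketch below is self-contained: the table variable @Employees is a stand-in loaded with the reporting lines from the question (1 reports to 2, 3 to 1, 4 to 3), and the employee names are made up.

```sql
-- Stand-in for the Employees table, filled with the question's example.
DECLARE @Employees TABLE (EmployeeId INT PRIMARY KEY, EmployeeName VARCHAR(50), BossId INT NULL);

INSERT INTO @Employees (EmployeeId, EmployeeName, BossId)
VALUES (2, 'User 2', NULL),
       (1, 'User 1', 2),
       (3, 'User 3', 1),
       (4, 'User 4', 3);

DECLARE @UserID INT = 4, @ManagerID INT = 1;

WITH chain AS
(
    -- Anchor: the user we start from.
    SELECT EmployeeId, BossId
    FROM @Employees
    WHERE EmployeeId = @UserID
    UNION ALL
    -- Recursive step: move one level up to the current row's boss.
    SELECT e.EmployeeId, e.BossId
    FROM @Employees AS e
    JOIN chain AS c ON e.EmployeeId = c.BossId
)
SELECT CASE WHEN EXISTS (SELECT 1
                         FROM chain
                         WHERE EmployeeId = @ManagerID
                           AND EmployeeId <> @UserID)  -- a user is not their own manager
            THEN 1 ELSE 0 END AS IsManager;
-- For @UserID = 4 and @ManagerID = 1 this returns IsManager = 1.
```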

This will return the BossID if that boss exists in the user's chain:
WITH BOSSES AS
(
SELECT BossID
FROM Employees
WHERE EmployeeID = @uID
UNION ALL
SELECT E.BossID
FROM Employees E
JOIN BOSSES B ON E.EmployeeID = B.BossID
)
SELECT *
FROM BOSSES
WHERE BossID = @bID

I've included the hierarchy of all levels with the CTE that you can then use to query. Using this hierarchy, you can see all the managers of a given employee in a delimited column (might be useful for other calculations).
Give this a try:
WITH cte (UserId, ManagerId, Level, Hierarchy) as (
SELECT EmployeeId, BossId, 0, CAST(EmployeeId as nvarchar(4000)) -- explicit length: plain nvarchar in CAST means nvarchar(30)
FROM Employees
WHERE BossId IS NULL
UNION ALL
SELECT EmployeeId, BossId, Level+1,
CAST(cte.Hierarchy + '-' + CAST(EmployeeId as nvarchar(4000)) as nvarchar(4000))
FROM Employees INNER JOIN cte ON Employees.BossId=cte.UserId
)
SELECT *
FROM cte
WHERE UserId = 4
AND '-' + Hierarchy LIKE '%-1-%'
And here is the Fiddle. I've used UserId = 4 and ManagerId = 1.
Good luck.

Finding Duplicate Data in Oracle

I have a table with 500,000+ records, and fields for ID, first name, last name, and email address. What I'm trying to do is find rows where the first name AND last name are both duplicates (as in the same person has two separate IDs, email addresses, or whatever, they're in the table more than once). I think I know how to find the duplicates using GROUP BY, this is what I have:
SELECT first_name, last_name, COUNT(*)
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
The problem is that I need to then move the entire row with these duplicated names into a different table. Is there a way to find the duplicates and get the whole row? Or at least to get the IDs as well? I tried using a self-join, but got back more rows than were in the table to begin with. Would that be a better approach? Any help would be greatly appreciated.

One way to remove duplicate rows is a correlated subquery on rowid (effectively a self-join):
DELETE FROM person_table a
WHERE a.rowid >
ANY (SELECT b.rowid
FROM person_table b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name);
This will remove all duplicates, even if a name appears in more than two rows; only the row with the lowest rowid in each group is kept.
There is more on removing duplicates and differing methods here: http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm
Hope it helps...
EDIT: As per your comments, if you want to select all but one of the duplicates, then use:
SELECT *
FROM person_table a
WHERE a.rowid >
ANY (SELECT b.rowid
FROM person_table b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name);

An index on (first_name, last_name) or on (last_name, first_name) would help:
SELECT t.*
FROM
person_table t
JOIN
( SELECT first_name, last_name
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
) dup
ON dup.last_name = t.last_name
AND dup.first_name = t.first_name
or:
SELECT t.*
FROM person_table t
WHERE EXISTS
( SELECT *
FROM person_table dup
WHERE dup.last_name = t.last_name
AND dup.first_name = t.first_name
AND dup.ID <> t.ID
)

This will give you an ID you want to move/delete/etc. Note that it does not work if count(*) > 2, as you get only 1 ID (you could re-run your query for these cases).
SELECT max(ID), first_name, last_name, COUNT(*)
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
Edit: You can use COLLECT to get all IDs at once (but be careful, as you only want to move/delete all but one)

To add another option, I usually use this one to remove duplicates:
delete from person_table
where rowid in (select rid
from (select rowid rid, row_number() over
(partition by first_name,last_name order by rowid) rn
from person_table
)
where rn <> 1 )
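Since the question asks to move the complete duplicated rows into another table, an analytic COUNT can fetch them in one pass. This is a sketch under assumptions: person_duplicates is a hypothetical target table with matching columns, and the column names id and email are guessed from the description in the question.

```sql
-- Copy every row whose (first_name, last_name) pair occurs more than once.
INSERT INTO person_duplicates (id, first_name, last_name, email)
SELECT id, first_name, last_name, email
FROM (
    SELECT p.*,
           COUNT(*) OVER (PARTITION BY first_name, last_name) AS cnt
    FROM person_table p
)
WHERE cnt > 1;
```

Unlike the GROUP BY query, the analytic form keeps every column of each row, so the IDs and email addresses come along without a second join.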
