Cleaning T-SQL tables

I'm new to T-SQL and trying to do some cleanup on some data imported from Excel into SQL Server.
I have made a batch import that imports the raw data to a staging table and now I want to clean it up.
I have the following tables
tblRawInput (my staging table):
Name, Name2, Name3, Phonenumber, Group
tblPeople:
PersonID (IDENTITY), Name, Phonenumber, GroupID
tblGroups:
GroupID (IDENTITY), Groupname
tblAltNames:
NameID (IDENTITY), Name, PersonID
The query should be able to split the data into the other tables, but not create a group if it already exists.
I am at a loss. Could anyone point me in the right direction?
When I do a SELECT INTO it creates multiple copies of the groups.

You can use a NOT EXISTS clause to insert only new groups. Group is a reserved word, so it needs brackets:
insert into tblGroups
(GroupName)
select distinct ri.[Group]
from tblRawInput ri
where not exists
(
select *
from tblGroups g
where g.GroupName = ri.[Group]
)
After this, you can insert into tblPeople like this:
insert tblPeople
(Name, GroupID)
select ri.Name
, g.GroupID
from tblRawInput ri
-- Look up GroupID in the tblGroups table
join tblGroups g
on g.GroupName = ri.[Group]
You can do the alternate names along the same lines.
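The two steps above can be tried end to end in a small sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so the example runs anywhere), with the table names from the question and made-up sample rows. The NOT EXISTS insert is executed twice to show that a second run adds no groups.

```python
import sqlite3

# SQLite stands in for SQL Server; sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblRawInput (Name TEXT, Phonenumber TEXT, [Group] TEXT);
    CREATE TABLE tblGroups (GroupID INTEGER PRIMARY KEY AUTOINCREMENT, GroupName TEXT);
    CREATE TABLE tblPeople (PersonID INTEGER PRIMARY KEY AUTOINCREMENT,
                            Name TEXT, Phonenumber TEXT, GroupID INTEGER);
    INSERT INTO tblRawInput VALUES
        ('Ann', '111', 'Sales'), ('Bob', '222', 'Sales'), ('Cy', '333', 'IT');
    INSERT INTO tblGroups (GroupName) VALUES ('IT');  -- a group that already exists
""")

# DISTINCT collapses repeated staging rows; NOT EXISTS skips groups already stored.
INSERT_NEW_GROUPS = """
    INSERT INTO tblGroups (GroupName)
    SELECT DISTINCT ri.[Group]
    FROM tblRawInput ri
    WHERE NOT EXISTS (SELECT 1 FROM tblGroups g WHERE g.GroupName = ri.[Group])
"""
conn.execute(INSERT_NEW_GROUPS)
conn.execute(INSERT_NEW_GROUPS)  # second run inserts nothing

# Look up each person's GroupID by joining on the group name.
conn.execute("""
    INSERT INTO tblPeople (Name, Phonenumber, GroupID)
    SELECT ri.Name, ri.Phonenumber, g.GroupID
    FROM tblRawInput ri
    JOIN tblGroups g ON g.GroupName = ri.[Group]
""")

group_names = [r[0] for r in conn.execute(
    "SELECT GroupName FROM tblGroups ORDER BY GroupName")]
people_count = conn.execute("SELECT COUNT(*) FROM tblPeople").fetchone()[0]
print(group_names, people_count)
```

Groups have to be loaded before people, because the people insert depends on the GroupID values generated in the first step.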

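First, order matters in this case. Insert into the groups table first; DISTINCT keeps a group that appears on several staging rows from being inserted more than once.
insert into tblGroups
(GroupName)
select distinct ri.[Group]
from tblRawInput ri
where not exists
(
select *
from tblGroups g
where g.GroupName = ri.[Group]
)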
Then insert into the people table:
Insert into tblPeople (Name, Phonenumber, GroupID)
Select ri.Name, ri.Phonenumber, g.GroupID
from tblRawInput ri
join tblGroups g
on g.GroupName = ri.[Group]
where not exists
(
select *
from tblPeople p
where p.Name = ri.Name
and p.Phonenumber = ri.Phonenumber
and p.GroupID = g.GroupID
)
Then get the alternate names:
Insert into tblAltNames (Name, PersonID)
Select Distinct ri.Name2, p.PersonID
from tblRawInput ri
join tblPeople p
on p.Name = ri.Name
where not exists
(
select *
from tblAltNames an
where an.Name = ri.Name2
)
Of course, all of this should be wrapped in a transaction and a TRY...CATCH block that rolls everything back if one of the inserts fails.
Possibly the second query should use the OUTPUT clause to get the inserted PersonIDs instead of relying on the join. You don't really have anything good to join on here, because names are not unique.

Related

SQL IN clause multiple columns and multiple value

This query works fine.
SELECT * FROM TABLE WHERE 330110042 IN (iItem01,iItem02,iItem03,iItem04,iItem05,iItem_1,iItem_2,iItem_3,iItem_4,iItem_5,iItem_6,iItem_7,iItem_8,iItem_9,iItem_10,iItem_11,iItem_12,iItem_13,iItem_14,iItem_15,iItem_16,iItem_17,iItem_18,iItem_19,iItem_20,iItem_21,iItem_22,iItem_23,iItem_24,iItem_25,iItem_26,iItem_27,iItem_28,iItem_29,iItem_30)
But this query doesn't work.
SELECT * FROM TABLE WHERE 330110042, 330110002, 330110002 IN (iItem01,iItem02,iItem03,iItem04,iItem05,iItem_1,iItem_2,iItem_3,iItem_4,iItem_5,iItem_6,iItem_7,iItem_8,iItem_9,iItem_10,iItem_11,iItem_12,iItem_13,iItem_14,iItem_15,iItem_16,iItem_17,iItem_18,iItem_19,iItem_20,iItem_21,iItem_22,iItem_23,iItem_24,iItem_25,iItem_26,iItem_27,iItem_28,iItem_29,iItem_30)
How can I make this work in SQL Server?

It's difficult to tell your exact goal here, but one possibility would be to turn the list of values into a table structure of its own. A Common Table Expression might work:
;WITH Ids AS
(
SELECT 330110042 AS Id
UNION ALL
SELECT 330110002
)
SELECT t.*
FROM [Table] t
INNER JOIN Ids i ON t.iItem01 = i.Id OR t.iItem02 = i.Id OR...
But, maybe a solution with UNPIVOT would be more elegant. I presume that your table has a primary key column called Id:
;WITH Unpivoted AS
(
SELECT Id, ColName, ColValue
FROM (SELECT Id, iItem01, iItem02, iItem03
FROM [Table] t) p
UNPIVOT
(ColValue FOR ColName IN (iItem01, iItem02, iItem03)) AS unpvt
)
SELECT t.*
FROM [Table] t
WHERE EXISTS (SELECT 1 FROM Unpivoted u
WHERE t.Id = u.Id
AND u.ColValue IN (330110042, 330110002))
Of course, you would add all the necessary columns. I added only the first three for this example.
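The unpivot idea can be sketched in a runnable form. This example uses Python's sqlite3 module, and since SQLite has no UNPIVOT operator it swaps in a UNION ALL over the columns, which produces the same (Id, ColValue) shape. The table name Items, the Id key, and the sample rows are assumptions for illustration, and only three of the item columns are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (Id INTEGER PRIMARY KEY,
                        iItem01 INTEGER, iItem02 INTEGER, iItem03 INTEGER);
    INSERT INTO Items VALUES
        (1, 330110042, 0, 0),
        (2, 0, 330110002, 0),
        (3, 7, 8, 9);
""")

wanted = [330110042, 330110002]
placeholders = ", ".join("?" for _ in wanted)  # one bound parameter per value

# UNION ALL turns the item columns into rows (a stand-in for T-SQL UNPIVOT),
# so a single IN list can be tested against every column at once.
query = f"""
    WITH Unpivoted AS (
        SELECT Id, iItem01 AS ColValue FROM Items
        UNION ALL SELECT Id, iItem02 FROM Items
        UNION ALL SELECT Id, iItem03 FROM Items
    )
    SELECT t.Id
    FROM Items t
    WHERE EXISTS (SELECT 1 FROM Unpivoted u
                  WHERE u.Id = t.Id AND u.ColValue IN ({placeholders}))
    ORDER BY t.Id
"""
matching_ids = [r[0] for r in conn.execute(query, wanted)]
print(matching_ids)
```

This returns rows where any of the values appears in any of the columns. If every value must be present in the same row, the EXISTS test would need to count distinct matches instead.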

Select Maximum, using fields from 2 tables

I have a database, Library, with many tables; three of them matter for this query:
Table Librarians: ID, Surname;
Table StudentCard: ID, a foreign key to Librarians, and other columns we don't use
Table TeacherCard: ID, a foreign key to Librarians, and other columns we don't use
Query: select the surname of the librarian who issued the most books.
I know how to solve it when the data comes from only one table, e.g. TeacherCard:
SELECT TOP 1 WITH TIES
Librarians.LastName, MAX(Librarians.CountOfBooks) AS Books
FROM
(SELECT
L.LastName, COUNT(*) AS CountOfBooks
FROM Libs L, T_Cards T
WHERE T.Id_Lib IN (SELECT L.Id)
GROUP BY L.LastName) AS Librarians
GROUP BY
Librarians.LastName
ORDER BY
MAX(Librarians.CountOfBooks) DESC
GO
I don't know how to use data from TeacherCard and StudentCard at the same time.
Please help me write this query.

I found a working solution:
SELECT TOP 1 B.Name, B.CountOut
FROM
(SELECT A.Name, SUM(A.Count) AS CountOut
FROM
(SELECT Libs.LastName AS Name, COUNT(S_Cards.DateOut) AS [Count]
FROM Libs JOIN S_Cards ON S_Cards.Id_Lib = Libs.Id
GROUP BY Libs.LastName
UNION ALL
SELECT Libs.LastName AS Name, COUNT(T_Cards.DateOut) AS [Count]
FROM Libs JOIN T_Cards ON T_Cards.Id_Lib = Libs.Id
GROUP BY Libs.LastName) AS A
GROUP BY A.Name ) AS B
ORDER BY B.CountOut DESC

Here is another answer that works:
SELECT TOP 1 LastName, COUNT (*) [count] FROM
(SELECT LastName FROM Libs L, S_Cards S
WHERE S.id_lib = L.id
UNION ALL
SELECT LastName FROM Libs L, T_Cards T
WHERE T.id_lib = L.id) As Res
GROUP By LastName
ORDER BY COUNT (*) DESC
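Both answers rest on the same idea: stack the two card tables with UNION ALL, then count per librarian. A minimal runnable sketch follows, using Python's sqlite3 module in place of SQL Server, so LIMIT 1 plays the role of TOP 1. The column layouts and sample rows are assumptions, and the grouping includes the librarian Id so that two librarians sharing a surname are not merged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Libs (Id INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE S_Cards (Id INTEGER PRIMARY KEY, Id_Lib INTEGER);
    CREATE TABLE T_Cards (Id INTEGER PRIMARY KEY, Id_Lib INTEGER);
    INSERT INTO Libs VALUES (1, 'Adams'), (2, 'Baker');
    INSERT INTO S_Cards (Id_Lib) VALUES (1), (2), (2);
    INSERT INTO T_Cards (Id_Lib) VALUES (1), (2);
""")

# UNION ALL keeps every card row from both tables (UNION would drop repeats),
# so COUNT(*) reflects student and teacher loans together.
top = conn.execute("""
    SELECT L.LastName, COUNT(*) AS Books
    FROM (SELECT Id_Lib FROM S_Cards
          UNION ALL
          SELECT Id_Lib FROM T_Cards) AS c
    JOIN Libs L ON L.Id = c.Id_Lib
    GROUP BY L.Id, L.LastName
    ORDER BY Books DESC
    LIMIT 1
""").fetchone()
print(top)
```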

Going through tables and replacing IDs resulting from duplicates in dimension table

I have a dimension Users table that unfortunately has a bunch of duplicate records. See screenshot.
I have thousands of users and 5 tables referencing the duplicates. I want to delete records with "bad" UserIDs. I want to go through the 5 dependencies and update bad UserIds with "good" (circled in red).
What would be a good approach to this?
Here's what I did to get the above screenshot...
SELECT UserID
,userIds.FirstName
,userIds.LastName
,dupTable.Email
,dupTable.Username
,dupTable.DupCount
FROM dbo.DimUsers AS userIds
LEFT OUTER JOIN
(SELECT FirstName
,LastName
,Email
,UserName
,DupCount
FROM
(SELECT FirstName
,LastName
,UserName
,Email
,COUNT(*) AS DupCount -- we're finding duplications by matches on FirstName,
-- last name, UserName AND Email. All four fields must match
-- to find a dupe. More confidence from this.
FROM dbo.DimUsers
GROUP BY FirstName
,LastName
,UserName
,Email
HAVING COUNT(*) > 1) AS userTable -- any count more than 1 is a dupe
WHERE LastName NOT LIKE 'NULL' -- exclude entries with literally NULL names
AND FirstName NOT LIKE 'NULL'
)AS dupTable
ON dupTable.FirstName = userIds.FirstName -- to get the userIds of dupes, we LEFT JOIN the original table
AND dupTable.LastName = userIds.LastName -- on four fields to increase our confidence
AND dupTable.Email = userIds.Email
AND dupTable.Username = userIds.Username
WHERE DupCount IS NOT NULL -- ignore NULL dupcounts, these are not dupes

This code should work. It is written for one dependency table, but you can use the same logic to update the other four tables.
update t
set UserID = MinUserID.UserID
from
DimUsersChild1 t
inner join DimUsers on DimUsers.UserID = t.UserID
inner join (
select min(UserID) UserID, FirstName, LastName, UserName, Email
from DimUsers
group by
FirstName, LastName, UserName, Email
) MinUserID on
MinUserID.FirstName = DimUsers.FirstName and
MinUserID.LastName = DimUsers.LastName and
MinUserID.UserName = DimUsers.UserName and
MinUserID.Email = DimUsers.Email
select * from DimUsersChild1;
delete t1
from
DimUsers t
inner join DimUsers t1 on t1.FirstName = t.FirstName and
t1.LastName = t.LastName and
t1.UserName = t.UserName and
t1.Email = t.Email
where
t.UserID < t1.UserID
select * from DimUsers;
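The remap-then-delete sequence can be checked on a tiny data set. The sketch below uses Python's sqlite3 module rather than SQL Server, so the UPDATE is written with a correlated subquery instead of UPDATE ... FROM. FactLogins is a hypothetical name for one of the five referencing tables, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimUsers (UserID INTEGER PRIMARY KEY, FirstName TEXT,
                           LastName TEXT, UserName TEXT, Email TEXT);
    CREATE TABLE FactLogins (LoginID INTEGER PRIMARY KEY, UserID INTEGER);
    INSERT INTO DimUsers VALUES
        (1, 'Ann', 'Lee', 'alee', 'a@x.org'),
        (2, 'Ann', 'Lee', 'alee', 'a@x.org'),  -- duplicate of user 1
        (3, 'Bob', 'Ray', 'bray', 'b@x.org');
    INSERT INTO FactLogins (UserID) VALUES (1), (2), (2), (3);
""")

with conn:  # one transaction: commit if both statements succeed, else roll back
    # Step 1: point every referencing row at the lowest UserID of its duplicate group.
    conn.execute("""
        UPDATE FactLogins
        SET UserID = (
            SELECT MIN(keep.UserID)
            FROM DimUsers dup
            JOIN DimUsers keep
              ON keep.FirstName = dup.FirstName AND keep.LastName = dup.LastName
             AND keep.UserName = dup.UserName AND keep.Email = dup.Email
            WHERE dup.UserID = FactLogins.UserID)
        WHERE UserID IN (SELECT UserID FROM DimUsers)
    """)
    # Step 2: only now is it safe to delete the non-surviving duplicates.
    conn.execute("""
        DELETE FROM DimUsers
        WHERE UserID NOT IN (SELECT MIN(UserID) FROM DimUsers
                             GROUP BY FirstName, LastName, UserName, Email)
    """)

login_users = [r[0] for r in conn.execute(
    "SELECT UserID FROM FactLogins ORDER BY LoginID")]
remaining = [r[0] for r in conn.execute(
    "SELECT UserID FROM DimUsers ORDER BY UserID")]
print(login_users, remaining)
```

Rows with NULL in any of the four matching columns never compare equal, so they are left alone by the update, just as in the join-based version.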

Recursive CTE with additional EXISTS conditions?

I have a situation where I need to be able to see if a given person is within a user/manager hierarchy.
The table has the following structure:
UserId
UserName
ManagerId
I have two IDs: a UserId (say 5) and a ManagerId (say 2). I need to know whether the manager with the given ID (2) is above the user with the given ID (5) in the hierarchy. For example, if
User 1 reports to user 2.
User 3 reports to user 1.
User 4 reports to user 3
the query has to show that for UserId = 4 and ManagerId = 1 the answer is true.
I've written a query that returns the whole hierarchy:
WITH temp (level, UserName, UserId, ManagerId) AS
(
SELECT 1 AS level, EmployeeName, EmployeeId, BossId
FROM Employees
WHERE BossId IS NULL
UNION ALL
SELECT level+1 AS level, EmployeeName, EmployeeId, BossId
FROM Employees, temp
WHERE BossId = UserId
)
SELECT t.* from temp AS t
But now I don't know how to get result query with above mentioned conditions :(
Thanks in advance for any help!

Find the user in the anchor and walk your way back up the hierarchy. Then check the rows produced by the recursive query against the manager.
This will return the manager's row if one exists (@UserID and @ManagerID are variables holding the two IDs).
WITH temp AS
(
SELECT EmployeeName, EmployeeId, BossId
FROM Employees
WHERE EmployeeId = @UserID
UNION ALL
SELECT E.EmployeeName, E.EmployeeId, E.BossId
FROM Employees AS E
inner join temp AS T
ON E.EmployeeId = T.BossId
)
SELECT *
FROM temp
WHERE EmployeeId = @ManagerID
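The walk-up approach can be tried with the sample hierarchy from the question. This sketch uses Python's sqlite3 module instead of SQL Server, so the CTE needs the RECURSIVE keyword and the two IDs are passed as bound parameters. The helper name is_manager_of is hypothetical, and unlike the query above it deliberately excludes the case where the user and the manager are the same person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (EmployeeId INTEGER PRIMARY KEY,
                            EmployeeName TEXT, BossId INTEGER);
    -- User 1 reports to 2, user 3 reports to 1, user 4 reports to 3.
    INSERT INTO Employees VALUES
        (2, 'Top', NULL), (1, 'One', 2), (3, 'Three', 1), (4, 'Four', 3);
""")

def is_manager_of(conn, manager_id, user_id):
    """Return True if manager_id appears anywhere above user_id in the chain."""
    row = conn.execute("""
        WITH RECURSIVE chain (EmployeeId, BossId) AS (
            -- Anchor: start at the user.
            SELECT EmployeeId, BossId FROM Employees WHERE EmployeeId = ?
            UNION ALL
            -- Recursive step: move to the current row's boss.
            SELECT e.EmployeeId, e.BossId
            FROM Employees e
            JOIN chain c ON e.EmployeeId = c.BossId
        )
        SELECT 1 FROM chain
        WHERE EmployeeId = ? AND EmployeeId <> ?
        LIMIT 1
    """, (user_id, manager_id, user_id)).fetchone()
    return row is not None

print(is_manager_of(conn, 1, 4), is_manager_of(conn, 4, 1))
```

Starting from the user keeps the recursion to a single chain of bosses, whereas starting from the root expands the whole tree before filtering.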

This will return the BossID if he or she exists:
WITH BOSSES AS
(
SELECT BossID
FROM Employees
WHERE EmployeeID = @uID
UNION ALL
SELECT E.BossID
FROM Employees E
JOIN BOSSES B ON E.EmployeeID = B.BossID
)
SELECT *
FROM BOSSES
WHERE BossID = @bID

I've included the hierarchy of all levels with the CTE that you can then use to query. Using this hierarchy, you can see all the managers of a given employee in a delimited column (might be useful for other calculations).
Give this a try:
WITH cte (UserId, ManagerId, Level, Hierarchy) as (
SELECT EmployeeId, BossId, 0, CAST(EmployeeId as nvarchar)
FROM Employee
WHERE BossId IS NULL
UNION ALL
SELECT EmployeeId, BossId, Level+1,
CAST(cte.Hierarchy + '-' + CAST(EmployeeId as nvarchar) as nvarchar)
FROM Employee INNER JOIN cte ON Employee.BossId=cte.UserId
)
SELECT *
FROM cte
WHERE UserId = 4
AND '-' + Hierarchy LIKE '%-1-%'
The query above uses UserId = 4 and ManagerId = 1.
Good luck.

Finding Duplicate Data in Oracle

I have a table with 500,000+ records, and fields for ID, first name, last name, and email address. What I'm trying to do is find rows where the first name AND last name are both duplicates (as in the same person has two separate IDs, email addresses, or whatever, they're in the table more than once). I think I know how to find the duplicates using GROUP BY, this is what I have:
SELECT first_name, last_name, COUNT(*)
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
The problem is that I need to then move the entire row with these duplicated names into a different table. Is there a way to find the duplicates and get the whole row? Or at least to get the IDs as well? I tried using a self-join, but got back more rows than were in the table to begin with. Would that be a better approach? Any help would be greatly appreciated.

One way to remove duplicate rows is a correlated subquery that compares rowids:
DELETE FROM person_table a
WHERE a.rowid >
ANY (SELECT b.rowid
FROM person_table b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name);
This removes every duplicate, however many copies there are, and keeps the row with the lowest rowid in each group.
There is more on removing duplicates and differing methods here: http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm
Hope it helps...
EDIT: As per your comments, if you want to select all but one of the duplicates then
SELECT *
FROM person_table a
WHERE a.rowid >
ANY (SELECT b.rowid
FROM person_table b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name);

An index on (first_name, last_name) or on (last_name, first_name) would help:
SELECT t.*
FROM
person_table t
JOIN
( SELECT first_name, last_name
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
) dup
ON dup.last_name = t.last_name
AND dup.first_name = t.first_name
or:
SELECT t.*
FROM person_table t
WHERE EXISTS
( SELECT *
FROM person_table dup
WHERE dup.last_name = t.last_name
AND dup.first_name = t.first_name
AND dup.ID <> t.ID
)
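Since the goal in the question is to move the complete duplicated rows into another table, the EXISTS query above can feed an INSERT ... SELECT followed by a DELETE. The sketch below runs that sequence with Python's sqlite3 module rather than Oracle. The target table person_dupes and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_table (id INTEGER PRIMARY KEY, first_name TEXT,
                               last_name TEXT, email TEXT);
    CREATE TABLE person_dupes (id INTEGER PRIMARY KEY, first_name TEXT,
                               last_name TEXT, email TEXT);
    INSERT INTO person_table VALUES
        (1, 'Ann', 'Lee', 'ann@x.org'),
        (2, 'Ann', 'Lee', 'alee@y.org'),  -- same name as id 1
        (3, 'Bob', 'Ray', 'bob@x.org'),
        (4, 'Ann', 'Ray', 'aray@x.org');  -- shares only one name part, not a dupe
""")

with conn:  # copy and delete in one transaction
    # Copy every row whose (first_name, last_name) also occurs under another id.
    conn.execute("""
        INSERT INTO person_dupes
        SELECT t.*
        FROM person_table t
        WHERE EXISTS (SELECT 1
                      FROM person_table dup
                      WHERE dup.last_name = t.last_name
                        AND dup.first_name = t.first_name
                        AND dup.id <> t.id)
    """)
    # Remove exactly the rows that were copied.
    conn.execute(
        "DELETE FROM person_table WHERE id IN (SELECT id FROM person_dupes)")

moved = [r[0] for r in conn.execute("SELECT id FROM person_dupes ORDER BY id")]
kept = [r[0] for r in conn.execute("SELECT id FROM person_table ORDER BY id")]
print(moved, kept)
```

This moves every copy of a duplicated name. To leave one copy behind, the DELETE would need to skip one id per name, for example the lowest.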

This will give you an ID you want to move/delete/etc. Note that it does not work if count(*) > 2, as you get only 1 ID (you could re-run your query for these cases).
SELECT max(ID), first_name, last_name, COUNT(*)
FROM person_table
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
Edit: You can use COLLECT to get all IDs at once (but be careful, as you only want to move/delete all but one)

To add another option, I usually use this one to remove duplicates:
delete from person_table
where rowid in (select rid
from (select rowid rid, row_number() over
(partition by first_name,last_name order by rowid) rn
from person_table
)
where rn <> 1 )
