SQL Remove almost duplicate rows - sql-server

I have a table that unfortunately contains bad data, and I'm trying to filter some of it out. I am sure that the LName, FName combination is unique, since the data set is small enough to verify.
LName  FName  Email
-----  -----  ------------------
Smith  Bob    bsmith@example.com
Smith  Bob    NULL
Doe    Jane   NULL
White  Don    dwhite@example.com
I would like to have the query results bring back the "duplicate" record that does not have a NULL email, yet still bring back a NULL Email when there is not a duplicate.
E.g.
Smith  Bob    bsmith@example.com
Doe    Jane   NULL
White  Don    dwhite@example.com
I think the solution is similar to "Sql, remove duplicate rows by value", but I don't really understand whether the asker's requirements are the same as mine.
Any suggestions?
Thanks

You can use the ROW_NUMBER() window function:
SELECT *
FROM (
    SELECT a.*, ROW_NUMBER() OVER (PARTITION BY LName, FName ORDER BY Email DESC) AS rnk
    FROM <YOUR_TABLE> a
) a
WHERE rnk = 1
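In SQL Server NULL sorts lowest, so ORDER BY Email DESC pushes the NULL e-mails behind any real address within each name partition, which is what makes rnk = 1 pick the non-NULL row. If you would rather not rely on that sort behaviour, here is a minimal variant (assuming your table is named YourTable) that makes the NULL handling explicit:
SELECT *
FROM (
    SELECT a.*,
           ROW_NUMBER() OVER (
               PARTITION BY LName, FName
               -- non-NULL e-mails first, then alphabetical as a tie-breaker
               ORDER BY CASE WHEN Email IS NULL THEN 1 ELSE 0 END, Email
           ) AS rnk
    FROM YourTable a
) a
WHERE rnk = 1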

This keeps one row per name and prefers the non-NULL e-mail, because MIN() ignores NULLs unless every value in the group is NULL:
SELECT lname
, fname
, MIN(email)
FROM YourTable
GROUP BY
lname
, fname
Test script
DECLARE @Test TABLE (
    LName VARCHAR(32)
    , FName VARCHAR(32)
    , Email VARCHAR(32)
)

INSERT INTO @Test
          SELECT 'Smith', 'Bob', 'bsmith@example.com'
UNION ALL SELECT 'Smith', 'Bob', NULL
UNION ALL SELECT 'Doe', 'Jane', NULL
UNION ALL SELECT 'White', 'Don', 'dwhite@example.com'

SELECT lname
    , fname
    , MIN(Email)
FROM @Test
GROUP BY
    lname
    , fname
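With the sample rows this returns (row order is not guaranteed without an ORDER BY):
lname  fname  (No column name)
Doe    Jane   NULL
Smith  Bob    bsmith@example.com
White  Don    dwhite@example.com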

Here is a relatively simple query that uses standard SQL and does just this:
SELECT *
FROM Person P
WHERE Email IS NOT NULL       -- take all people with a non-null e-mail
   OR (Email IS NULL          -- and all people with a null e-mail, as long as
       AND NOT EXISTS         -- there is no duplicate record of the same person
           (SELECT *          -- with a non-null e-mail
            FROM Person P2
            WHERE P2.LName = P.LName
              AND P2.FName = P.FName
              AND P2.Email IS NOT NULL))
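One caveat: if the same person somehow appears with two different non-NULL e-mails, this query keeps both rows, whereas the ROW_NUMBER() and MIN() solutions above keep exactly one.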

Since there are plenty of SQL solutions posted already, you may want to create a data fix to remove the bad data, then add the necessary constraints to prevent bad data from ever being inserted. Bad data in a database is a side effect of poor design.
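For example, a cleanup along these lines (only a sketch, assuming the table is named Person as in the previous answer) removes the redundant NULL-e-mail rows and then adds the constraint:
-- Delete NULL-e-mail rows that duplicate a row with a real e-mail
DELETE p
FROM Person p
WHERE p.Email IS NULL
  AND EXISTS (SELECT 1
              FROM Person p2
              WHERE p2.LName = p.LName
                AND p2.FName = p.FName
                AND p2.Email IS NOT NULL);

-- Prevent the same LName/FName pair from ever being inserted twice again
ALTER TABLE Person
    ADD CONSTRAINT UQ_Person_LName_FName UNIQUE (LName, FName);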

Related

SQL combine multiple records into one row

I have a table pulling user IDs and their personal and work e-mails. I'd like to have one line per ID showing both types of e-mail, but I can't figure out how to do that.
declare @t table(NPI int, email varchar(50), emailtype varchar(50))
insert into @t
values(1, 'john@home', 'personal'), (1, 'john@work', 'work');
This is the query I've written so far, which puts this on 2 separate rows:
select npi, case when emailtype = 'personal' then email end as personalemail,
    case when emailtype = 'work' then email end as workemail
from @t;
Current Output:
npi  personalemail  workemail
1    john@home      NULL
1    NULL           john@work
What I'd like to see is:
npi  personalemail  workemail
1    john@home      john@work
How can I do this?
This has been asked and answered around here about a million times; it is called conditional aggregation, or a crosstab. It is faster to write an answer than to find one. As an alternative you could use PIVOT, but I find the syntax a bit obtuse.
select NPI
    , max(case when emailtype = 'personal' then email end) as PersonalEmail
    , max(case when emailtype = 'work' then email end) as WorkEmail
from @t
group by NPI
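With the sample data this collapses the two rows into the single row asked for: NPI 1 with john@home as PersonalEmail and john@work as WorkEmail.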
Use PIVOT:
SELECT *
FROM @t
PIVOT
(
    MAX(email)
    FOR emailtype IN (personal, work)
) Q
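Note that PIVOT implicitly groups by every source column not named in the aggregate or the FOR clause, which is why selecting straight from @t works here: NPI is the only column left, so you get one row per NPI. If the table had extra columns (an id, say), you would first need a derived table that selects only NPI, email and emailtype.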

T-SQL prepare dynamic COALESCE

As shown in the attached screenshot, there are two tables:
Configuration:
Detail
Using the Configuration and Detail tables, I would like to populate the IdentificationType and IDerivedIdentification columns in the Detail table.
The following logic should be used while deriving the above columns:
The Configuration table holds an order of preference, which the user can change dynamically (i.e. if the country is Austria, the ID preference should be LEI, then TIN in case LEI is blank, then CONCAT if both are blank, with some other logic after that).
In the case of contact ID = 3 the country is BG, so LEI should be checked first; since it is NULL, CCPT = 456 will be picked.
I could have used COALESCE and CASE statements if hardcoding were allowed.
Can you please suggest an alternative approach?
Regards
Digant
Assuming that this is some horrendous data dump and you are trying to clean it up, here is some SQL to throw at it. :) I was able to capture your image text via Adobe Acrobat > Excel.
(I also built the schema for you at: http://sqlfiddle.com/#!6/8f404/12)
Firstly, the correct thing to do is to fix the glaring problem, which is the table structure. Assuming you can't, here's a solution.
What it does is unpivot the columns LEI, NIND, CCPT and TIN from the Detail table, as well as FirstPref, SecondPref and ThirdPref from the Configuration table. Doing this helps to normalize the data, although it costs you major performance if there are no plans to fix the data structure or you cannot.
After that, you simply join Detail.ContactId to DerivedTypes.ContactId, then DerivedPrefs.ISOCountryCode to Detail.CountryISOCountryCode, and DerivedTypes.IdentificationType to DerivedPrefs.IdentificationType.
If you use an inner join rather than the left join you can remove the RANK() function, but it will not show all ContactIds, only those that have a value in their LEI, NIND, CCPT or TIN columns. I think that's a better solution anyway, because why would you want to see an error mixed into a report? Write a separate report for those with no values in those columns.
Lastly, the TOP (1) WITH TIES lets you display one record per ContactId while still allowing the record with the error to display. Hope this helps.
CREATE TABLE Configuration
(ISOCountryCode varchar(2), CountryName varchar(8), FirstPref varchar(6), SecondPref varchar(6), ThirdPref varchar(6))
;
INSERT INTO Configuration
(ISOCountryCode, CountryName, FirstPref, SecondPref, ThirdPref)
VALUES
('AT', 'Austria', 'LEI', 'TIN', 'CONCAT'),
('BE', 'Belgium', 'LEI', 'NIND', 'CONCAT'),
('BG', 'Bulgaria', 'LEI', 'CCPT', 'CONCAT'),
('CY', 'Cyprus', 'LEI', 'NIND', 'CONCAT')
;
CREATE TABLE Detail
(ContactId int, FirstName varchar(1), LastName varchar(3), BirthDate varchar(4), CountryISOCountryCode varchar(2), Nationality varchar(2), LEI varchar(9), NIND varchar(9), CCPT varchar(9), TIN varchar(9))
;
INSERT INTO Detail
(ContactId, FirstName, LastName, BirthDate, CountryISOCountryCode, Nationality, LEI, NIND, CCPT, TIN)
VALUES
(1, 'A', 'DES', NULL, 'AT', 'AT', '123', '4345', NULL, NULL),
(2, 'B', 'DEG', NULL, 'BE', 'BE', NULL, '890', NULL, NULL),
(3, 'C', 'DEH', NULL, 'BG', 'BG', NULL, '123', '456', NULL),
(4, 'D', 'DEi', NULL, 'BG', 'BG', NULL, NULL, NULL, NULL)
;
SELECT TOP (1) WITH TIES Detail.ContactId,
    FirstName,
    LastName,
    BirthDate,
    CountryISOCountryCode,
    Nationality,
    LEI,
    NIND,
    CCPT,
    TIN,
    ISNULL(DerivedPrefs.IdentificationType, 'ERROR') IdentificationType,
    IDerivedIdentification,
    RANK() OVER (PARTITION BY Detail.ContactId ORDER BY
        CASE WHEN Pref = 'FirstPref' THEN 1
             WHEN Pref = 'SecondPref' THEN 2
             WHEN Pref = 'ThirdPref' THEN 3
             ELSE 99 END) AS PrefRank
FROM
Detail
LEFT JOIN
(
SELECT
ContactId,
LEI,
NIND,
CCPT,
TIN
FROM Detail
) DetailUNPVT
UNPIVOT
(IDerivedIdentification FOR IdentificationType IN
    (LEI, NIND, CCPT, TIN)
) AS DerivedTypes
ON DerivedTypes.ContactId = Detail.ContactId
LEFT JOIN
(
SELECT
ISOCountryCode,
CountryName,
FirstPref,
SecondPref,
ThirdPref
FROM
Configuration
) ConfigUNPIVOT
UNPIVOT
(IdentificationType FOR Pref IN
    (FirstPref, SecondPref, ThirdPref)
) AS DerivedPrefs
ON DerivedPrefs.ISOCountryCode = Detail.CountryISOCountryCode
AND DerivedTypes.IdentificationType = DerivedPrefs.IdentificationType
ORDER BY RANK() OVER (PARTITION BY Detail.ContactId ORDER BY
    CASE WHEN Pref = 'FirstPref' THEN 1
         WHEN Pref = 'SecondPref' THEN 2
         WHEN Pref = 'ThirdPref' THEN 3
         ELSE 99 END)
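As a side note, on SQL Server 2008 and later the same unpivoting can be written with CROSS APPLY (VALUES ...), which many find easier to read than UNPIVOT. A sketch of just the Detail side, under the same table and column names as above:
-- Unpivot the four ID columns of Detail into rows; filtering out NULLs
-- mirrors what UNPIVOT does implicitly.
SELECT d.ContactId, v.IdentificationType, v.IDerivedIdentification
FROM Detail d
CROSS APPLY (VALUES ('LEI',  d.LEI),
                    ('NIND', d.NIND),
                    ('CCPT', d.CCPT),
                    ('TIN',  d.TIN)) v (IdentificationType, IDerivedIdentification)
WHERE v.IDerivedIdentification IS NOT NULL;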

Using a CTE to split results across a CROSS APPLY

I have some data that I need to output as rows containing markup tags, which I'm doing inside a table valued function.
This has been working fine up to a point using code in the format below, using the search query to gather my data, and then inserting into my returned table using the output from results.
I now need to take a longer data field and split it up over a number of rows, and I'm at something of a loss as to how to achieve this.
I started with the idea that I wanted to use a CTE to process the data from my query, but I can't see a way to get the data from my search query into my CTE and from there into my results set.
I guess I can see an alternative way of doing this by creating another table valued function in the database that returns a results set if I feed it my comment_text column, but it seems like a waste to do it that way.
Does anyone see a route through to a solution?
Example "Real" Table:
DECLARE @Comments TABLE
(
    id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    comment_date DATETIME NOT NULL,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    comment_title VARCHAR(50) NOT NULL,
    comment_text CHAR(500)
);
Add Comment Rows:
INSERT INTO @Comments VALUES(CURRENT_TIMESTAMP, 'Bob', 'Example', 'Bob''s Comment', 'Text of Bob''s comment.');
INSERT INTO @Comments VALUES(CURRENT_TIMESTAMP, 'Alice', 'Example', 'Alice''s Comment', 'Text of Alice''s comment that is much longer and will need to be split over multiple rows.');
Format of returned results table:
DECLARE @return_table TABLE
(
    comment_date DATETIME,
    commenter_name VARCHAR(101),
    markup VARCHAR(100)
);
Naive query (can't run, because the variable comment_text in the SplitComment CTE can't be identified):
WITH SplitComment(note,start_idx) AS
(
SELECT '<Note>'+SUBSTRING(comment_text,0,50)+'</Note>', 0
UNION ALL
SELECT '<Text>'+SUBSTRING(note,start_idx,50)+'</Text>', start_idx+50 FROM SplitComment WHERE (start_idx+50) < LEN(note)
)
INSERT INTO @return_table
SELECT results.* FROM
(
SELECT
comment_date,
CAST(first_name+' '+last_name AS VARCHAR(101)) commenter,
comment_title,
comment_text
FROM @Comments
) AS search
CROSS APPLY
(
SELECT comment_date, commenter, '<title>'+comment_title+'</title>' markup
UNION ALL SELECT comment_date, commenter, SplitComment
) AS results;
SELECT * FROM @return_table;
Results (when the function is run without the CTE):
comment_date commenter_name markup
2017-07-07 11:53:57.240 Bob Example <title>Bob's Comment</title>
2017-07-07 11:53:57.240 Alice Example <title>Alice's Comment</title>
Ideally, I'd like to get one additional row for Bob's comment, and two rows for Alice's comment. Something like this:
comment_date commenter_name markup
2017-07-07 11:53:57.240 Bob Example <title>Bob's Comment</title>
2017-07-07 11:53:57.240 Bob Example <Note>Bob's Comment</Note>
2017-07-07 11:53:57.240 Alice Example <title>Alice's Comment</title>
2017-07-07 11:53:57.240 Alice Example <Note>Text of Alice's comment that is much longer and w</Note>
2017-07-07 11:53:57.240 Alice Example <Text>ill need to be split over multiple rows.</Text>
Maybe you are looking for something like this (it's a simplified version; I used only first_name and comment_date as the "identifier").
I tested it using this data and, for the moment, assumed a maximum length of 50 when splitting the text column.
Tip: change the comment_text datatype to VARCHAR(500)
DECLARE @Comments TABLE
(
    id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    comment_date DATETIME NOT NULL,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    comment_title VARCHAR(50) NOT NULL,
    comment_text VARCHAR(500)
);
INSERT INTO @Comments VALUES(CURRENT_TIMESTAMP, 'Bob', 'Example', 'Bob''s Comment', 'Text of Bob''s comment.');
INSERT INTO @Comments VALUES(CURRENT_TIMESTAMP, 'Alice', 'Example', 'Alice''s Comment'
    , 'Text of Alice''s comment that is much longer and will need to be split over multiple rows aaaaaa bbbbbb cccccc ddddddddddd eeeeeeeeeeee fffffffffffff ggggggggggggg.');
WITH CTE AS (
    SELECT comment_date, first_name,
           '<Note>' + CAST(SUBSTRING(comment_text, 1, 50) AS VARCHAR(500)) + '</Note>' AS comment_text,
           1 AS RN
    FROM @Comments
    UNION ALL
    SELECT A.comment_date, A.first_name,
           '<Text>' + CAST(SUBSTRING(A.comment_text, B.RN*50+1, 50) AS VARCHAR(500)) + '</Text>' AS comment_text,
           B.RN + 1 AS RN
    FROM @Comments A
    INNER JOIN CTE B ON A.comment_date = B.comment_date AND A.first_name = B.first_name
    WHERE LEN(A.comment_text) > B.RN*50+1
)
SELECT A.comment_date, A.first_name, '<title>' + comment_title + '</title>' AS markup
FROM @Comments A
UNION ALL
SELECT B.comment_date, B.first_name, B.comment_text AS markup
FROM CTE B;
Output:
comment_date first_name markup
2017-07-07 14:30:51.117 Bob <title>Bob's Comment</title>
2017-07-07 14:30:51.117 Alice <title>Alice's Comment</title>
2017-07-07 14:30:51.117 Bob <Note>Text of Bob's comment.</Note>
2017-07-07 14:30:51.117 Alice <Note>Text of Alice's comment that is much longer and wi</Note>
2017-07-07 14:30:51.117 Alice <Text>ll need to be split over multiple rows aaaaaa bbbb</Text>
2017-07-07 14:30:51.117 Alice <Text>bb cccccc ddddddddddd eeeeeeeeeeee fffffffffffff g</Text>
2017-07-07 14:30:51.117 Alice <Text>gggggggggggg.</Text>
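One caveat: SQL Server limits recursive CTEs to 100 recursion levels by default, so a comment needing more than about 100 slices would make this fail; appending OPTION (MAXRECURSION 0) to the statement removes that limit.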
Here's a solution that also allows sorting the result set.
It uses a recursive CTE to calculate the positions in the long text.
And by joining the table to the CTE, the text can be sliced up into rows.
with cte as
(
select id, 1 as lvl, len(comment_text) as posmax, 1 pos1, 50 limit
from @Comments
union all
select id, lvl + 1, posmax, iif(pos1+limit<posmax,pos1+limit,posmax), limit
from cte
where pos1+limit<posmax
)
, CTE2 AS
(
select id, 0 as lvl,
comment_date,
concat(first_name,' ',last_name) as commenter,
'<Title>'+rtrim(comment_title)+'</Title>' as markup
from @Comments
union all
select t.id, c.lvl,
comment_date,
concat(first_name,' ',last_name) as commenter_name,
concat(iif(lvl=1,'<Note>','<Text>'),substring(comment_text,pos1,limit),iif(lvl=1,'</Note>','</Text>')) as markup
from @Comments t
join cte c on c.id = t.id
)
select comment_date, commenter, markup
from CTE2
order by id, lvl;
Output:
comment_date commenter markup
----------------------- ------------ -------------------------------------
2017-07-07 15:06:31.293 Bob Example <Title>Bob's Comment</Title>
2017-07-07 15:06:31.293 Bob Example <Note>Text of Bob's comment.</Note>
2017-07-07 15:06:31.293 Alice Example <Title>Alice's Comment</Title>
2017-07-07 15:06:31.293 Alice Example <Note>Text of Alice's comment that is much longer and wi</Note>
2017-07-07 15:06:31.293 Alice Example <Text>ll need to be split over multiple rows.</Text>

SQL Server Group rows with multiple occurrences of GROUP BY columns

I am trying to summarize a dataset and get the minimum and maximum date for each group. However, a group can exist multiple times if there is a gap. Here is sample data:
CREATE TABLE temp (
id int,
FIRSTNAME nvarchar(50),
LASTNAME nvarchar(50),
STARTDATE datetime2(7),
ENDDATE datetime2(7)
)
INSERT into temp values(1,'JOHN','SMITH','2013-04-02','2013-05-31')
INSERT into temp values(2,'JOHN','SMITH','2013-05-31','2013-10-31')
INSERT into temp values(3,'JANE','DOE','2013-10-31','2016-07-19')
INSERT into temp values(4,'JANE','DOE','2016-07-19','2016-08-11')
INSERT into temp values(5,'JOHN','SMITH','2016-08-11','2017-02-01')
INSERT into temp values(6,'JOHN','SMITH','2017-02-01','9999-12-31')
I am looking to summarize the data as follows:
JOHN SMITH 2013-04-02 2013-10-31
JANE DOE 2013-10-31 2016-08-11
JOHN SMITH 2016-08-11 9999-12-31
A "group by" will combine the two John Smith records together with the incorrect min and max dates.
Any help is appreciated.
Thanks.
As JNevill pointed out, this is a classic Gaps and Islands problem. Below is one solution using Row_Number().
Select FirstName
,LastName
,StartDate=min(StartDate)
,EndDate =max(EndDate)
From (
Select *
,Grp = Row_Number() over (Order by ID) - Row_Number() over (Partition By FirstName,LastName Order by EndDate)
From Temp
) A
Group By FirstName,LastName,Grp
Order By min(StartDate)
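To see why the difference of the two ROW_NUMBER() values identifies each island: within a run of consecutive rows for the same person, both row numbers advance together, so their difference stays constant; when another person's rows intervene, the overall row number jumps ahead of the per-person one and the difference changes. With the sample data, rows 1-2 (John Smith) get Grp = 0, rows 3-4 (Jane Doe) get Grp = 2, and rows 5-6 (John Smith) get Grp = 2, so grouping by FirstName, LastName, Grp keeps the two John Smith periods apart.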
Please try the following...
SELECT firstName,
lastName,
MIN( startDate ) AS earliestStartDate,
MAX( endDate ) AS latestEndDate
FROM temp
GROUP BY firstName,
lastName;
This statement uses GROUP BY to group the records on firstName and lastName combinations. It then returns each group's firstName and lastName, together with the group's earliest startDate courtesy of the MIN() function and its latest endDate courtesy of the MAX() function.
If you have any questions or comments, then please feel free to post a Comment accordingly.

How to get records where a CSV column contains all the values of a filter

I have a table with departments tagged to users:
User | Department
user1 | IT,HR,House Keeping
user2 | HR,House Keeping
user3 | IT,Finance,HR,Maintainance
user4 | Finance,HR,House Keeping
user5 | IT,HR,Finance
I have created a stored procedure that takes a varchar(max) parameter as the filter (I merge it dynamically in C# code).
In the SP I create a temp table for the selected filters; e.g. if the user selects IT & Finance & HR,
I merge the string as IT##Finance##HR (in C#) and call the SP with this parameter.
In the SP I build the temp table as
FilterValue
IT
Finance
HR
Now the issue: how can I get the records that contain all the departments tagged with them
(users that are associated with all the values in the temp table), so as to get
User | Department
user3 | IT,Finance,HR,Maintainance
user5 | IT,HR,Finance
as output.
Please suggest an optimised way to achieve this filtering.
This design is beyond horrible -- you should really change this to use truly relational design with a dependent table.
That said, if you are not in a position to change the design, you can limp around the problem with XML, and it might give you OK performance.
Try something like this (replace @test with your table name as needed...). You won't need to even create your temp table -- this will jimmy your comma-delimited string around into XML, which you can then use XQuery against directly:
DECLARE @test TABLE (usr int, department varchar(1000))
insert into @test (usr, department)
values (1, 'IT,HR,House Keeping')
insert into @test (usr, department)
values (2, 'HR,House Keeping')
insert into @test (usr, department)
values (3, 'IT,Finance,HR,Maintainance')
insert into @test (usr, department)
values (4, 'Finance,HR,House Keeping')
insert into @test (usr, department)
values (5, 'IT,HR,Finance')
;WITH departments (usr, department, depts)
AS
(
SELECT usr, department, CAST(NULLIF('<department><dept>' + REPLACE(department, ',', '</dept><dept>') + '</dept></department>', '<department><dept></dept></department>') AS xml)
FROM @test
)
SELECT departments.usr, departments.department
FROM departments
WHERE departments.depts.exist('/department/dept[text()[1] eq "IT"]') = 1
AND departments.depts.exist('/department/dept[text()[1] eq "HR"]') = 1
AND departments.depts.exist('/department/dept[text()[1] eq "Finance"]') = 1
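With the sample data, this IT/HR/Finance filter returns usr 3 ('IT,Finance,HR,Maintainance') and usr 5 ('IT,HR,Finance'), matching the output the question asks for.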
I agree with others that your design is, um, not ideal, and given the fact that it may change, as per your comment, one is not too motivated to find a really fascinating solution for the present situation.
Still, you can have one that at least works correctly (I think) and meets the situation. Here's what I've come up with:
;
WITH
UserDepartment ([User], Department) AS (
SELECT 'user1', 'IT,HR,House Keeping' UNION ALL
SELECT 'user2', 'HR,House Keeping' UNION ALL
SELECT 'user3', 'IT,Finance,HR,Maintainance' UNION ALL
SELECT 'user4', 'Finance,HR,House Keeping' UNION ALL
SELECT 'user5', 'IT,HR,Finance'
),
Filter (FilterValue) AS (
SELECT 'IT' UNION ALL
SELECT 'Finance' UNION ALL
SELECT 'HR'
),
CSVSplit AS (
SELECT
ud.*,
--x.node.value('.', 'varchar(max)')
x.Value AS aDepartment
FROM UserDepartment ud
CROSS APPLY (SELECT * FROM dbo.Split(',', ud.Department)) x
)
SELECT
c.[User],
c.Department
FROM CSVSplit c
INNER JOIN Filter f ON c.aDepartment = f.FilterValue
GROUP BY
c.[User],
c.Department
HAVING
COUNT(*) = (SELECT COUNT(*) FROM Filter)
The first two CTEs are just sample tables; the rest of the query is the solution proper.
The CSVSplit CTE uses a Split function that splits a comma-separated list into a set of items and returns them as a table (a minimal sketch of such a function follows at the end of this answer). The entire CTE turns the row set of the form
----- ---------------------------
user1 department1,department2,...
... ...
into this:
----- -----------
user1 department1
user1 department2
... ...
The main SELECT joins the normalised row set with the filter table and selects rows where the number of matches exactly equals the number of items in the filter table. (Note: this assumes there are no duplicate department names within a single UserDepartment.Department value.)
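The Split function itself is not shown here; the following is a minimal sketch of one possible implementation matching the signature used above (delimiter first, then the list, returning a table with a Value column). It assumes the list items contain no XML-special characters; on SQL Server 2016+ the built-in STRING_SPLIT is an alternative, though its argument order differs and its output column is named value.
-- Minimal XML-based splitter matching dbo.Split(',', list) as used above
CREATE FUNCTION dbo.Split (@Delimiter CHAR(1), @List VARCHAR(MAX))
RETURNS TABLE
AS
RETURN
(
    SELECT Value = n.i.value('(./text())[1]', 'VARCHAR(4000)')
    FROM (SELECT CAST('<i>' + REPLACE(@List, @Delimiter, '</i><i>') + '</i>' AS XML) AS x) a
    CROSS APPLY a.x.nodes('/i') AS n (i)
);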
