SQL Server Group rows with multiple occurrences of Group BY columns - sql-server

I am trying to summarize a dataset and get the minimum and maximum date for each group. However, a group can exist multiple times if there is a gap. Here is sample data:
CREATE TABLE temp (
id int,
FIRSTNAME nvarchar(50),
LASTNAME nvarchar(50),
STARTDATE datetime2(7),
ENDDATE datetime2(7)
)
INSERT into temp values(1,'JOHN','SMITH','2013-04-02','2013-05-31')
INSERT into temp values(2,'JOHN','SMITH','2013-05-31','2013-10-31')
INSERT into temp values(3,'JANE','DOE','2013-10-31','2016-07-19')
INSERT into temp values(4,'JANE','DOE','2016-07-19','2016-08-11')
INSERT into temp values(5,'JOHN','SMITH','2016-08-11','2017-02-01')
INSERT into temp values(6,'JOHN','SMITH','2017-02-01','9999-12-31')
I am looking to summarize the data as follows:
JOHN SMITH 2013-04-02 2013-10-31
JANE DOE 2013-10-31 2016-08-11
JOHN SMITH 2016-08-11 9999-12-31
A "group by" will combine the two John Smith records together with the incorrect min and max dates.
Any help is appreciated.
Thanks.

As JNevill pointed out, this is a classic Gaps and Islands problem. Below is one solution using Row_Number().
Select FirstName
,LastName
,StartDate=min(StartDate)
,EndDate =max(EndDate)
From (
Select *
,Grp = Row_Number() over (Order by ID) - Row_Number() over (Partition By FirstName,LastName Order by EndDate)
From Temp
) A
Group By FirstName,LastName,Grp
Order By min(StartDate)
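To see why the difference of the two row numbers works as a group key, here is a minimal sketch that runs the same idea against the sample data. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later, for window functions) as a stand-in for SQL Server, and the table is renamed to the hypothetical periods for the demo.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; "periods" is a made-up table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE periods (id INTEGER, firstname TEXT, lastname TEXT,"
    " startdate TEXT, enddate TEXT)"
)
conn.executemany(
    "INSERT INTO periods VALUES (?, ?, ?, ?, ?)",
    [
        (1, "JOHN", "SMITH", "2013-04-02", "2013-05-31"),
        (2, "JOHN", "SMITH", "2013-05-31", "2013-10-31"),
        (3, "JANE", "DOE", "2013-10-31", "2016-07-19"),
        (4, "JANE", "DOE", "2016-07-19", "2016-08-11"),
        (5, "JOHN", "SMITH", "2016-08-11", "2017-02-01"),
        (6, "JOHN", "SMITH", "2017-02-01", "9999-12-31"),
    ],
)

# Step 1: overall row number minus per-person row number.
# The difference stays constant for consecutive rows of the same person.
grp_rows = conn.execute(
    """
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY id)
         - ROW_NUMBER() OVER (PARTITION BY firstname, lastname
                              ORDER BY enddate) AS grp
    FROM periods
    ORDER BY id
    """
).fetchall()

# Step 2: group by name plus that difference to get one row per island.
islands = conn.execute(
    """
    SELECT firstname, lastname, MIN(startdate), MAX(enddate)
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (ORDER BY id)
             - ROW_NUMBER() OVER (PARTITION BY firstname, lastname
                                  ORDER BY enddate) AS grp
        FROM periods
    ) AS a
    GROUP BY firstname, lastname, grp
    ORDER BY MIN(startdate)
    """
).fetchall()

for row in islands:
    print(row)
```

With this data the first two JOHN SMITH rows get grp 0 and the last two get grp 2, which is what keeps them apart in the final GROUP BY.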

Please try the following...
SELECT firstName,
lastName,
MIN( startDate ) AS earliestStartDate,
MAX( endDate ) AS latestEndDate
FROM temp
GROUP BY firstName,
lastName;
This statement uses the GROUP BY clause to group the records by firstName and lastName combination. It returns the firstName and lastName for each group, along with the earliest startDate for that group (via MIN()) and the latest endDate for that group (via MAX()). Note that it returns a single row per name, so with the sample data both JOHN SMITH periods are merged into one row (2013-04-02 to 9999-12-31) rather than being kept as separate islands.
If you have any questions or comments, then please feel free to post a Comment accordingly.

Related

Group by the maximum date and return the text from each grouped record

I am trying to return the row with the maximum date for every person. How can I filter this
Initials, Date, Text
DD, 1/1/2022, 123
DD, 1/1/2021, 456
DD, 1/1/2020, 789
KT, 1/1/2020, abc
KT, 1/1/2022, def
KT, 1/2/2021, ghi
so the results appear like this?
Initials, Date, Text
DD, 1/1/2022, 123
KT, 1/1/2022, def
I've tried the MAX date and grouping by the initials, but I am trying to return the text from the entry with the maximum date.
Thank you
There are several ways of doing it.
DDL (I used a temp table for testing purposes):
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL
DROP TABLE #tmp;
CREATE TABLE #tmp (
initials varchar(10),
date date,
text varchar(10)
);
INSERT INTO #tmp
VALUES
('DD', '2022-01-01', '123'),
('DD', '2021-01-01', '456'),
('DD', '2020-01-01', '789'),
('KT', '2020-01-01', 'abc'),
('KT', '2022-01-01', '123'),
('KT', '2021-01-02', 'ghi');
Version 1 (Subquery with MAX):
This version pulls unique initials and their MAX date, then links back to the original table on those initials and date to pull the text from the record.
SELECT t.initials, t.date, t.text
FROM #tmp AS t
INNER JOIN (
SELECT initials, MAX(date) AS max_date
FROM #tmp
GROUP BY initials
) AS sub
ON t.initials = sub.initials
AND t.date = sub.max_date
Version 2 (Subquery with ROW_NUMBER):
This version uses the window function ROW_NUMBER to number each row ordered by date descending. This will give the most recent date a row number of 1. Then pull the text from the records with row number = 1.
SELECT sub.initials, sub.date, sub.text
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY initials ORDER BY date DESC) AS row_num
FROM #tmp
) AS sub
WHERE sub.row_num = 1
Since the text on the max date is the same for both initials in your sample set, I added initials and date to the final result set for clarification.
You can also use CTEs instead of subqueries, which I prefer, but I didn't want to complicate matters.
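For reference, the CTE form of Version 2 could look roughly like the sketch below. It assumes Python's sqlite3 module (SQLite 3.25+) in place of SQL Server, a regular table named tmp instead of #tmp, and the CTE name ranked is made up for the example.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; "tmp" replaces the #tmp temp table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (initials TEXT, date TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO tmp VALUES (?, ?, ?)",
    [
        ("DD", "2022-01-01", "123"),
        ("DD", "2021-01-01", "456"),
        ("DD", "2020-01-01", "789"),
        ("KT", "2020-01-01", "abc"),
        ("KT", "2022-01-01", "123"),
        ("KT", "2021-01-02", "ghi"),
    ],
)

# The CTE numbers each person's rows from newest to oldest;
# the outer query keeps only the newest row per person.
latest = conn.execute(
    """
    WITH ranked AS (
        SELECT initials, date, text,
               ROW_NUMBER() OVER (PARTITION BY initials
                                  ORDER BY date DESC) AS row_num
        FROM tmp
    )
    SELECT initials, date, text
    FROM ranked
    WHERE row_num = 1
    ORDER BY initials
    """
).fetchall()

for row in latest:
    print(row)
```

The logic is identical to the subquery version; the CTE only moves the numbering step above the main SELECT.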

Combining rows with overlapping dates in T-SQL

I have some data similar to the below:
Base data
Student Start Date End Date Course
John 01-Jan-20 30-Sep-20 Business
John 01-Jan-20 30-Dec-20 Psychology
John 01-Oct-20 NULL Music
Jack 01-Feb-20 30-Sep-20 Business
Jack 01-Apr-20 30-Nov-20 Music
I want to transform the data so I have a row for each student, for each time period, with a concatenated list of courses, i.e.
Target output
Student Start Date End Date Course
John 01-Jan-20 30-Sep-20 Business, Psychology
John 01-Oct-20 30-Dec-20 Psychology, Music
John 01-Jan-21 NULL Music
Jack 01-Feb-20 31-Mar-20 Business
Jack 01-Apr-20 30-Sep-20 Business, Music
Jack 01-Oct-20 30-Nov-20 Music
I have a script that works if the dates are identical, using STUFF on the course field and grouping on student/dates (code below). But I can't work out how to handle the overlapping dates?
Select Student,
Courses =
STUFF((select ',' + course
from Table1 b
where a.student = b.student
for XML PATH('')
),1,1,''
)
from table1 a
Group by student
This is a little long-winded, as you need to get the groups for the dates. As the dates don't overlap, you then need to do a bit of elimination of some of the groupings too, so it takes a couple of sweeps.
I use CTEs to get the groups I need, and then use a subquery to string aggregate (on a more recent version of SQL Server you can use STRING_AGG and not need a second scan of the table). This ends up with this:
WITH YourTable AS(
SELECT *
FROM (VALUES('John',CONVERT(date,'01-Jan-20'),CONVERT(date,'30-Sep-20'),'Business'),
('John',CONVERT(date,'01-Jan-20'),CONVERT(date,'30-Dec-20'),'Psychology'),
('John',CONVERT(date,'01-Oct-20'),CONVERT(date,NULL),'Music'),
('Jack',CONVERT(date,'01-Feb-20'),CONVERT(date,'30-Sep-20'),'Business'),
('Jack',CONVERT(date,'01-Apr-20'),CONVERT(date,'30-Nov-20'),'Music'))V(Student,StartDate,EndDate,Course)),
Dates AS(
SELECT DISTINCT V.Student, V.[Date]
FROM YourTable YT
CROSS APPLY (VALUES(YT.Student,YT.StartDate),
(YT.Student,YT.EndDate)) V(Student,[Date])),
Islands AS(
SELECT *,
LEAD(ISNULL([Date],'99991231')) OVER (PARTITION BY Student ORDER BY ISNULL([Date],'99991231')) AS NextDate
FROM Dates
WHERE [Date] IS NOT NULL),
Groups AS(
SELECT I.Student,
I.Date AS StartDate,
CASE DATEPART(DAY,I.NextDate) WHEN 1 THEN DATEADD(DAY, -1, I.NextDate) ELSE I.NextDate END AS EndDate,
STUFF((SELECT ',' + YT.Course
FROM YourTable YT
WHERE YT.Student = I.Student
AND YT.StartDate <= I.[Date]
AND (YT.EndDate >= I.NextDate OR YT.EndDate IS NULL)
ORDER BY YT.Course
FOR XML PATH(''),TYPE).value('(./text())[1]','nvarchar(MAX)'),1,1,'') AS Courses
FROM Islands I)
SELECT Student,
StartDate,
EndDate,
Courses
FROM Groups
WHERE ([StartDate] != EndDate OR EndDate IS NULL)
AND Courses IS NOT NULL
ORDER BY Student DESC,
StartDate ASC;
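If it helps to see the splitting idea outside SQL, here is a rough Python sketch of a slightly different approach from the query above. It assumes end dates are inclusive and cuts a new segment on every start date and on the day after every end date; the function name split_periods is made up. Because of that assumption, John's last segment starts on 2020-12-31 rather than the 01-Jan-21 shown in the target output.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def split_periods(rows):
    """rows: (student, start, end_or_None, course); end dates are inclusive."""
    result = []
    for student in sorted({r[0] for r in rows}):
        courses = [r for r in rows if r[0] == student]
        # A segment begins on each start date and on the day after each end date.
        cuts = sorted(
            {c[1] for c in courses}
            | {c[2] + ONE_DAY for c in courses if c[2] is not None}
        )
        for i, seg_start in enumerate(cuts):
            # The last cut has no following cut, so that segment is open-ended.
            seg_end = cuts[i + 1] - ONE_DAY if i + 1 < len(cuts) else None
            # A course belongs to the segment if it is running on seg_start.
            active = sorted(
                c[3] for c in courses
                if c[1] <= seg_start and (c[2] is None or c[2] >= seg_start)
            )
            if active:
                result.append((student, seg_start, seg_end, ", ".join(active)))
    return result

data = [
    ("John", date(2020, 1, 1), date(2020, 9, 30), "Business"),
    ("John", date(2020, 1, 1), date(2020, 12, 30), "Psychology"),
    ("John", date(2020, 10, 1), None, "Music"),
    ("Jack", date(2020, 2, 1), date(2020, 9, 30), "Business"),
    ("Jack", date(2020, 4, 1), date(2020, 11, 30), "Music"),
]

segments = split_periods(data)
for seg in segments:
    print(seg)
```

Segments where no course is running (for example, after Jack's last course ends) are dropped, which plays the same role as the Courses IS NOT NULL filter in the query.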

Convert rows to columns in MS SQL

I'm looking for an efficient way to convert rows to columns in MS SQL server.
Example DB Table:
**ID PersonID Person201Code Person201Value**
1 1 CurrentIdNo 0556
2 1 FirstName Queency
3 1 LastName Sablan
The query result should be like this:
**CurrentIdNo FirstName LastName**
0556 Queency Sablan
I tried using PIVOT but it only returns NULL for the row values:
SELECT CurrentIdNo, FirstName, LastName
FROM
(
SELECT ID, PersonId, Person201Code, Person201Value
FROM HRPerson201
) src
PIVOT
(
MAX (ID)
FOR Person201Code in (CurrentIdNo, Firstname, LastName))
pvt;
How can I successfully convert rows to columns in MS SQL server?
Thanks!
Remove ID from the pivot source query and use Person201Value as the pivot aggregate instead of ID:
SELECT CurrentIdNo,
FirstName,
LastName
FROM (SELECT PersonId,
Person201Code,
Person201Value
FROM HRPerson201) src
PIVOT ( Max (Person201Value)
FOR Person201Code IN (CurrentIdNo,
Firstname,
LastName)) pvt;
SQLFIDDLE DEMO
SELECT *
FROM
(SELECT personid,Person201Code,Person201Value
FROM #pivot) Sales
PIVOT(max(Person201Value)
FOR Person201Code in (CurrentIdNo, Firstname, LastName))
AS PivotSales;
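As a cross-check, the same rows-to-columns result can be produced without PIVOT by using conditional aggregation (MAX over a CASE expression). This is a different technique from the answers above, sketched here with Python's sqlite3 module because SQLite has no PIVOT operator; the table contents are the three sample rows from the question.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server for this demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HRPerson201 (ID INTEGER, PersonID INTEGER,"
    " Person201Code TEXT, Person201Value TEXT)"
)
conn.executemany(
    "INSERT INTO HRPerson201 VALUES (?, ?, ?, ?)",
    [
        (1, 1, "CurrentIdNo", "0556"),
        (2, 1, "FirstName", "Queency"),
        (3, 1, "LastName", "Sablan"),
    ],
)

# One output row per PersonID; each CASE picks the value for a single code,
# and MAX collapses the NULLs produced by the non-matching rows.
pivoted = conn.execute(
    """
    SELECT PersonID,
           MAX(CASE WHEN Person201Code = 'CurrentIdNo'
                    THEN Person201Value END) AS CurrentIdNo,
           MAX(CASE WHEN Person201Code = 'FirstName'
                    THEN Person201Value END) AS FirstName,
           MAX(CASE WHEN Person201Code = 'LastName'
                    THEN Person201Value END) AS LastName
    FROM HRPerson201
    GROUP BY PersonID
    """
).fetchall()

print(pivoted)
```

Note that ID is not part of the grouping here either, for the same reason it has to be left out of the PIVOT source query: including it would put every source row in its own group.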

T-SQL group by from multiple years

I barely know how to ask this question aside from the specific example, so here goes:
We have an event registration table, and I want to match registrants that have registered for one of 4 events in each of the preceding 5 years.
The only way I can think of doing this is with verbose sub-queries, but performance-wise it's an absolute dog:
SELECT FirstName, LastName, EmailAddress
FROM RegTable
WHERE EventId IN (1,2,3,4)
AND EventYear = 2011
AND FirstName + LastName + DOB IN (SELECT FirstName + LastName + DOB FROM RegTable WHERE EventId IN (1,2,3,4) AND EventYear = 2012)
And so on for each year. Like I said, not very elegant or efficient.
Is there a simpler way?
You can do a GROUP BY with HAVING for the five preceding years, and then INTERSECT that with the current year's events:
SELECT FirstName, LastName, DOB
FROM RegTable
WHERE EventId IN (1,2,3,4)
AND EventYear IN (2011,2010,2009,2008,2007)
GROUP BY FirstName, LastName, DOB
HAVING COUNT(Distinct EventYear) = 5
INTERSECT
SELECT DISTINCT FirstName ,LastName ,DOB
FROM RegTable
WHERE EventId IN (1,2,3,4)
AND EventYear = 2012
The above query in action with sample data. SQL Fiddle
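Here is a small runnable sketch of that GROUP BY / HAVING / INTERSECT query, using Python's sqlite3 module in place of SQL Server. The three registrants and their dates of birth are invented test data: Ann registered every year from 2007 to 2012, Bob skipped 2012, and Cid skipped 2007.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; all rows below are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RegTable (FirstName TEXT, LastName TEXT, DOB TEXT,"
    " EventId INTEGER, EventYear INTEGER)"
)

rows = []
for year in range(2007, 2013):   # 2007..2012: qualifies
    rows.append(("Ann", "Lee", "1980-01-01", 1, year))
for year in range(2007, 2012):   # 2007..2011: missing the current year
    rows.append(("Bob", "Ray", "1975-05-05", 2, year))
for year in range(2008, 2013):   # 2008..2012: missing 2007
    rows.append(("Cid", "Fox", "1990-09-09", 3, year))
conn.executemany("INSERT INTO RegTable VALUES (?, ?, ?, ?, ?)", rows)

# First branch: people with a qualifying event in all five preceding years.
# Second branch: people with a qualifying event in the current year.
matches = conn.execute(
    """
    SELECT FirstName, LastName, DOB
    FROM RegTable
    WHERE EventId IN (1, 2, 3, 4)
      AND EventYear IN (2011, 2010, 2009, 2008, 2007)
    GROUP BY FirstName, LastName, DOB
    HAVING COUNT(DISTINCT EventYear) = 5
    INTERSECT
    SELECT DISTINCT FirstName, LastName, DOB
    FROM RegTable
    WHERE EventId IN (1, 2, 3, 4)
      AND EventYear = 2012
    """
).fetchall()

print(matches)
```

COUNT(DISTINCT EventYear) matters here: someone who registered for two events in the same year should still only count once for that year.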
I honestly didn't understand the question, so I am just rewriting your query (assuming that it does what you need):
SELECT RT2011.FirstName, RT2011.LastName, RT2011.EmailAddress
FROM
RegTable RT2011,
RegTable RT2012
WHERE RT2011.EventId IN (1,2,3,4)
AND RT2012.EventId IN (1,2,3,4)
AND RT2011.EventYear = 2011
AND RT2012.EventYear = 2012
AND RT2011.FirstName = RT2012.FirstName
AND RT2011.LastName = RT2012.LastName
AND RT2011.DOB = RT2012.DOB

SQL Remove almost duplicate rows

I have a table that unfortunately contains bad data and I'm trying to filter some of it out. I am sure that the LName, FName combination is unique since the data set is small enough to verify.
LName, FName, Email
----- ----- -----
Smith Bob bsmith@example.com
Smith Bob NULL
Doe Jane NULL
White Don dwhite@example.com
I would like to have the query results bring back the "duplicate" record that does not have a NULL email, yet still bring back a NULL Email when there is not a duplicate.
E.g.
Smith Bob bsmith@example.com
Doe Jane NULL
White Don dwhite@example.com
I think the solution is similar to Sql, remove duplicate rows by value, but I don't really understand if the asker's requirements are the same as mine.
Any suggestions?
Thanks
You can use ROW_NUMBER() analytic function:
SELECT *
FROM (
SELECT a.*, ROW_NUMBER() OVER(PARTITION BY LName, FName ORDER BY Email DESC) rnk
FROM <YOUR_TABLE> a
) a
WHERE RNK = 1
This drops the null rows if there are any non null values.
SELECT lname
, fname
, MIN(email)
FROM YourTable
GROUP BY
lname
, fname
Test script
DECLARE @Test TABLE (
LName VARCHAR(32)
, FName VARCHAR(32)
, Email VARCHAR(32)
)
INSERT INTO @Test
SELECT 'Smith', 'Bob', 'bsmith@example.com'
-- real NULLs, not the string 'NULL', so that MIN() skips them
UNION ALL SELECT 'Smith', 'Bob', NULL
UNION ALL SELECT 'Doe', 'Jane', NULL
UNION ALL SELECT 'White', 'Don', 'dwhite@example.com'
SELECT lname
, fname
, MIN(Email)
FROM @Test
GROUP BY
lname
, fname
Here is a relatively simple query that uses standard SQL and does just this:
SELECT * FROM Person P
WHERE Email IS NOT NULL OR -- Take all people with non-null e-mails
Email IS NULL AND -- and all people with null e-mails, as long as
NOT EXISTS -- there is no duplicate record of the same person
(SELECT * -- with a non-null e-mail
FROM Person P2
WHERE P2.LName=P.LName AND P2.FName=P.FName AND P2.Email IS NOT NULL)
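To check that the NOT EXISTS query and the MIN(email) approach agree on the sample data, here is a quick sketch using Python's sqlite3 module as a stand-in for SQL Server. The table name Person follows the query above; the ORDER BY is added only to make the comparison deterministic.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server for this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (LName TEXT, FName TEXT, Email TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [
        ("Smith", "Bob", "bsmith@example.com"),
        ("Smith", "Bob", None),   # Python None is stored as SQL NULL
        ("Doe", "Jane", None),
        ("White", "Don", "dwhite@example.com"),
    ],
)

# Keep rows with an e-mail, plus NULL rows that have no non-NULL twin.
via_not_exists = conn.execute(
    """
    SELECT LName, FName, Email
    FROM Person P
    WHERE Email IS NOT NULL
       OR Email IS NULL AND NOT EXISTS (
              SELECT 1
              FROM Person P2
              WHERE P2.LName = P.LName
                AND P2.FName = P.FName
                AND P2.Email IS NOT NULL)
    ORDER BY LName, FName
    """
).fetchall()

# MIN() ignores NULLs and returns NULL only when the whole group is NULL.
via_min = conn.execute(
    """
    SELECT LName, FName, MIN(Email)
    FROM Person
    GROUP BY LName, FName
    ORDER BY LName, FName
    """
).fetchall()

print(via_not_exists)
print(via_min)
```

The two only agree because each person has at most one non-NULL e-mail here. If a person had two different addresses, MIN would keep one of them while the NOT EXISTS query would return both rows.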
Since there are plenty of SQL solutions posted already, you may want to create a data fix to remove the bad data, then add the necessary constraints to prevent bad data from ever being inserted. Bad data in a database is a side effect of poor design.

Resources