Generate list of mismatched data element Combinations between two sources - sql-server

We receive data on a weekly and monthly basis with information regarding customers. We also sometimes have the same information stored from another source. The two sources sometimes provide contradictory information regarding customers.
How would I write a query which tells me the mismatched CustomerId and corresponding Vehicle? For example, CustomerId 947623 is associated with Kia in the vendor extract [Table 1] whereas we have the same customer stored as related to Hyundai [Table 2].
Table 1: Data received from the vendor.
CustomerId FirstName LastName Vehicle MiscColumns
027548     Jane      Doe      Honda   MiscData
947623     John      Smith    Kia     MiscData
549816     Erin      Woods    Chevy   MiscData
739232     Henry     Jackson  Ford    MiscData
Table 2: Internal data records
CustomerId FirstName LastName Vehicle MiscColumns
027548     Jane      Doe      Honda   MiscData
947623     John      Smith    Hyundai MiscData
549816     Erin      Woods    Chevy   MiscData
739232     Henry     Jackson  Ford    MiscData

Please try the following solution.
It works on SQL Server 2016 and later, because it relies on OPENJSON and FOR JSON.
SQL
-- DDL and sample data population, start
DECLARE @TableA TABLE (CustomerId CHAR(6) PRIMARY KEY, FirstName VARCHAR(100), LastName VARCHAR(100), Vehicle VARCHAR(100));
DECLARE @TableB TABLE (CustomerId CHAR(6) PRIMARY KEY, FirstName VARCHAR(100), LastName VARCHAR(100), Vehicle VARCHAR(100));
INSERT INTO @TableA (CustomerId, FirstName, LastName, Vehicle) VALUES
('027548', 'Jane', 'Doe', 'Honda'),
('947623', 'John', 'Smith', 'Kia'),
('549816', 'Erin', 'Woods', 'Chevy'),
('739232', 'Henry', 'Jackson', 'Ford');
INSERT INTO @TableB (CustomerId, FirstName, LastName, Vehicle) VALUES
('027548', 'Jane', 'Doe', 'Honda'),
('947623', 'John', 'Smith', 'Hyundai'),
('549816', 'Erin', 'Woods', 'Chevy'),
('739232', 'Henry', 'Jackson', 'Ford');
-- DDL and sample data population, end
-- Unpivot every column of both sources into (key, value) pairs via JSON,
-- then keep the (CustomerId, column) pairs whose values differ.
SELECT CustomerId
    ,[key] AS [column]
    ,Org_Value = MAX(CASE WHEN Src=1 THEN Value END)
    ,New_Value = MAX(CASE WHEN Src=2 THEN Value END)
FROM (
    SELECT Src=1
        ,CustomerId
        ,B.*
    FROM @TableA A
    CROSS APPLY ( SELECT [Key]
            ,COALESCE(Value, '') AS Value
        FROM OPENJSON( (SELECT A.* FOR JSON PATH, WITHOUT_ARRAY_WRAPPER, INCLUDE_NULL_VALUES))
    ) AS B
    UNION ALL
    SELECT Src=2
        ,CustomerId
        ,B.*
    FROM @TableB A
    CROSS APPLY ( SELECT [Key]
            ,COALESCE(Value, '') AS Value
        FROM OPENJSON( (SELECT A.* FOR JSON PATH, WITHOUT_ARRAY_WRAPPER, INCLUDE_NULL_VALUES))
    ) AS B
) AS A
GROUP BY CustomerId,[key]
HAVING MAX(CASE WHEN Src=1 THEN Value END)
    <> MAX(CASE WHEN Src=2 THEN Value END)
ORDER BY CustomerId,[key];
Output
CustomerId column  Org_Value New_Value
947623     Vehicle Kia       Hyundai
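When only one known column such as Vehicle has to be checked, a plain join may be enough. The following is a minimal sketch, assuming both sources share CustomerId as their key. The table variables @Vendor and @Internal and the output column names are illustrative and do not come from the answer above. The EXISTS ... EXCEPT predicate is used instead of <> so that a NULL on one side only still counts as a mismatch.

```sql
-- Illustrative stand-ins for Table 1 (vendor extract) and Table 2 (internal records).
DECLARE @Vendor   TABLE (CustomerId CHAR(6) PRIMARY KEY, Vehicle VARCHAR(100));
DECLARE @Internal TABLE (CustomerId CHAR(6) PRIMARY KEY, Vehicle VARCHAR(100));

INSERT INTO @Vendor (CustomerId, Vehicle) VALUES
('027548', 'Honda'), ('947623', 'Kia'), ('549816', 'Chevy'), ('739232', 'Ford');
INSERT INTO @Internal (CustomerId, Vehicle) VALUES
('027548', 'Honda'), ('947623', 'Hyundai'), ('549816', 'Chevy'), ('739232', 'Ford');

-- Pair each customer across the two sources and keep the rows whose Vehicle differs.
-- EXCEPT treats two NULLs as equal and NULL vs. a value as different.
SELECT v.CustomerId,
       v.Vehicle AS VendorVehicle,
       i.Vehicle AS InternalVehicle
FROM @Vendor AS v
JOIN @Internal AS i
    ON i.CustomerId = v.CustomerId
WHERE EXISTS (SELECT v.Vehicle EXCEPT SELECT i.Vehicle)
ORDER BY v.CustomerId;
```

With the sample rows this returns the single row for CustomerId 947623 (Kia vs. Hyundai). Customers present in only one source are not reported by an inner join; a FULL OUTER JOIN would be needed for that case.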

Related

Need help pulling specific type of ID when there are multiple IDs

I have a dataset where an employee with one SSN can have multiple employee IDs. In that situation I need to bring back only the record where the employee ID begins with '200'. In most situations there will be only one employee ID, or the employee ID is null (which is okay to bring back).
This is a sample dataset:
declare @t table(id int, name varchar(100), ssn int, eeid int)
insert into @t
values(1, 'John Smith', '55512', '2006544'),
(1, 'John Smith', '55512', '12345'),
(2, 'Bob Johnson', '55514', '200454'),
(3, 'Tom Smith', '44454', NULL),
(4, 'John Thompson', '45434', '204435'),
(4, 'John Thompson', '45434', '12353568')
The output should look like this:
Id Name SSN EEID
1 John Smith 55512 2006544
2 Bob Johnson 55514 200454
3 Tom Smith 44454 NULL
4 John Thompson 45434 204435
I tried playing with a Window function but got stuck. I tried using Rownum but it didn't give the correct result with 'John Smith.'
select *,
row_number() over (partition by ssn ORDER BY case when EEID like '200%'
then 1 end) AS ROWNUM
from @t
ROW_NUMBER, on its own, doesn't change the returned rows, it just "numbers" them. You'd either need to put the expression in a CTE/subquery and then filter to the first row in the WHERE, or use TOP and put the expression in the ORDER BY:
--CTE solution
WITH CTE AS(
SELECT ID,
[Name],
SSN,
EEID,
ROW_NUMBER() OVER (PARTITION BY SSN ORDER BY CASE WHEN eeid LIKE '200%' THEN 1 ELSE 2 END, eeid ASC) AS RN
FROM dbo.YourTable)
SELECT ID,
[Name],
SSN,
EEID
FROM CTE
WHERE RN = 1;
--ORDER BY and TOP solution
SELECT TOP (1) WITH TIES
ID,
[Name],
SSN,
EEID
FROM dbo.YourTable
ORDER BY ROW_NUMBER() OVER (PARTITION BY SSN ORDER BY CASE WHEN eeid LIKE '200%' THEN 1 ELSE 2 END, eeid);
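As a quick check, the CTE approach can be run against the sample rows from the question. This is only a sketch: it assumes the data sits in a table variable named @t, as in the question, rather than in dbo.YourTable.

```sql
-- Sample data from the question, loaded into a table variable.
DECLARE @t TABLE (id int, name varchar(100), ssn int, eeid int);
INSERT INTO @t (id, name, ssn, eeid) VALUES
(1, 'John Smith',    55512, 2006544),
(1, 'John Smith',    55512, 12345),
(2, 'Bob Johnson',   55514, 200454),
(3, 'Tom Smith',     44454, NULL),
(4, 'John Thompson', 45434, 204435),
(4, 'John Thompson', 45434, 12353568);

-- Number the rows per SSN: IDs starting with '200' sort first,
-- everything else (including NULL) sorts after, lowest eeid first.
WITH CTE AS (
    SELECT id, name, ssn, eeid,
           ROW_NUMBER() OVER (PARTITION BY ssn
                              ORDER BY CASE WHEN eeid LIKE '200%' THEN 1 ELSE 2 END,
                                       eeid) AS RN
    FROM @t
)
SELECT id, name, ssn, eeid
FROM CTE
WHERE RN = 1
ORDER BY id;
```

This returns one row per SSN. For John Thompson neither ID starts with '200', so the tie-breaker on eeid picks 204435, which matches the output requested in the question.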

SQL identify duplicate and update

I need help with the issue below. I have a customer table, CustA, with the columns custid, first name, surname, phone1, phone2 and lastupdateddate. This table has duplicate records. A record is considered a duplicate in CustA when
first name & surname & (phone1 or phone2) is duplicated
custid firstname surname phone1 phone2 lastupdateddate
1000 Sam Son 334566 NULL 1-jan-2016
1001 sam son NULL 334566 1-feb-2016
I have used a CTE for this scenario to partition by firstname, lastname, phone1, phone2 with a row number. But the OR condition on phone1 or phone2 remains a challenge in the CTE query. Please share your thoughts. I appreciate it.
Trick here is COALESCE
With cte as
(
select Count(*) over(partition by firstname, lastname, coalesce(phone1, phone2)) as cnt,*
From yourtable
)
Select * from CTE
Where cnt > 1
Though if it isn't the case that one is always NULL, you can use a CASE expression to ensure that the values are presented in a consistent order.
WITH cte
AS (SELECT COUNT(*)
OVER(
partition BY firstname,
lastname,
CASE WHEN phone1 < phone2 THEN phone1 ELSE phone2 END,
CASE WHEN phone1 < phone2 THEN phone2 ELSE phone1 END) AS cnt,
*
FROM yourtable)
SELECT *
FROM CTE
WHERE cnt > 1
This one will also give you the list of dupes (optional custid<>A.custid)
Declare @YourTable table (custid int,firstname varchar(50),surname varchar(50),phone1 varchar(25),phone2 varchar(25),lastupdate date)
Insert into @YourTable values
(1000,'Sam','Son' ,'334566',NULL ,'1-jan-2016'),
(1001,'sam','son' ,NULL ,'334566','1-feb-2016'),
(1003,'sam','son' ,NULL ,NULL ,'2-feb-2016'),
(1002,'Not','ADupe',NULL ,NULL ,'1-feb-2016')
-- For each row, build a comma-separated list of the other custids that match it
Select A.*
      ,B.Dupes
From @YourTable A
Cross Apply (Select Dupes=(Select Stuff((Select Distinct ',' + cast(custid as varchar(25))
                  From @YourTable
                  Where custid<>A.custid
                    and firstname=A.firstname
                    and surname =A.surname
                    and (IsNull(A.phone1,'') in (IsNull(phone1,''),IsNull(phone2,'')) or IsNull(A.phone2,'') in (IsNull(phone1,''),IsNull(phone2,'')) )
                  For XML Path ('')),1,1,'')
             )
            ) B
Where Dupes is not null
Returns
custid firstname surname phone1 phone2 lastupdate Dupes
1000 Sam Son 334566 NULL 2016-01-01 1001,1003
1001 sam son NULL 334566 2016-02-01 1000,1003
1003 sam son NULL NULL 2016-02-02 1000,1001

Grouping similar items recursively

I have been reading the following Microsoft article on recursive queries using CTEs and just can't seem to wrap my head around how to use one to group common items.
I have a table that contains the following columns:
ID
FirstName
LastName
DateOfBirth
BirthCountry
GroupID
What I need to do is start with the first person in the table and iterate through the table and find all the people that have the same (LastName and BirthCountry) or have the same (DateOfBirth and BirthCountry).
Now the tricky part is that I have to assign them the same GroupID and then, for each person in that GroupID, I need to see if anyone else has the same information and put them in the same GroupID as well.
I think I could do this with multiple cursors but it is getting tricky.
Here is sample data and output.
ID FirstName LastName DateOfBirth BirthCountry GroupID
----------- ---------- ---------- ----------- ------------ -----------
1 Jonh Doe 1983-01-01 Grand 100
2 Jack Stone 1976-06-08 Grand 100
3 Jane Doe 1982-02-08 Grand 100
4 Adam Wayne 1983-01-01 Grand 100
5 Kay Wayne 1976-06-08 Grand 100
6 Matt Knox 1983-01-01 Hay 101
John Doe and Jane Doe are in the same Group (100) because they have the same (LastName and BirthCountry).
Adam Wayne is in Group (100) because he has the same (BirthDate and BirthCountry) as John Doe.
Kay Wayne is in Group (100) because she has the same (LastName and BirthCountry) as Adam Wayne who is already in Group (100).
Matt Knox is in a new group (101) because he does not match anyone in previous groups.
Jack Stone is in a group (100) because he has the same (BirthDate and BirthCountry) as Kay Wayne who is already in Group (100).
Data scripts:
CREATE TABLE #Tbl(
ID INT,
FirstName VARCHAR(50),
LastName VARCHAR(50),
DateOfBirth DATE,
BirthCountry VARCHAR(50),
GroupID INT NULL
);
INSERT INTO #Tbl VALUES
(1, 'Jonh', 'Doe', '1983-01-01', 'Grand', NULL),
(2, 'Jack', 'Stone', '1976-06-08', 'Grand', NULL),
(3, 'Jane', 'Doe', '1982-02-08', 'Grand', NULL),
(4, 'Adam', 'Wayne', '1983-01-01', 'Grand', NULL),
(5, 'Kay', 'Wayne', '1976-06-08', 'Grand', NULL),
(6, 'Matt', 'Knox', '1983-01-01', 'Hay', NULL);
Here's what I came up with. I have rarely written recursive queries so it was some good practice for me. By the way Kay and Adam do not share a birth country in your sample data.
with data as (
select
LastName, DateOfBirth, BirthCountry,
row_number() over (order by LastName, DateOfBirth, BirthCountry) as grpNum
from T group by LastName, DateOfBirth, BirthCountry
), r as (
select
d.LastName, d.DateOfBirth, d.BirthCountry, d.grpNum,
cast('|' + cast(d.grpNum as varchar(8)) + '|' as varchar(1024)) as equ
from data as d
union all
select
d.LastName, d.DateOfBirth, d.BirthCountry, r.grpNum,
cast(r.equ + cast(d.grpNum as varchar(8)) + '|' as varchar(1024))
from r inner join data as d
on d.grpNum > r.grpNum
and charindex('|' + cast(d.grpNum as varchar(8)) + '|', r.equ) = 0
and (d.LastName = r.LastName or d.DateOfBirth = r.DateOfBirth)
and d.BirthCountry = r.BirthCountry
), g as (
select LastName, DateOfBirth, BirthCountry, min(grpNum) as grpNum
from r group by LastName, DateOfBirth, BirthCountry
)
select t.*, dense_rank() over (order by g.grpNum) + 100 as GroupID
from T as t
inner join g
on g.LastName = t.LastName
and g.DateOfBirth = t.DateOfBirth
and g.BirthCountry = t.BirthCountry
For the recursion to terminate it's necessary to keep track of the equivalences (via string concatenation) so that at each level it only needs to consider newly discovered equivalences (or connections, transitivities, etc.) Notice that I've avoided using the word group to avoid bleeding into the GROUP BY concept.
http://rextester.com/edit/TVRVZ10193
EDIT: I used an almost arbitrary numbering for the equivalences but if you wanted them to appear in a sequence based on the lowest ID with each block that's easy to do. Instead of using row_number() say min(ID) as grpNum presuming, of course, that IDs are unique.
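A different way to reach the same grouping, offered here as a sketch and not as the method of the answer above, is iterative label propagation. Every row starts in its own group, and a loop repeatedly pulls each row down to the smallest group label among the rows it matches, until nothing changes. The table variable @Tbl and the variable @changed are illustrative names, and the matching rule is the one stated in the question: same BirthCountry plus either the same LastName or the same DateOfBirth.

```sql
DECLARE @Tbl TABLE (
    ID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    DateOfBirth DATE,
    BirthCountry VARCHAR(50),
    GroupID INT NULL
);
INSERT INTO @Tbl (ID, FirstName, LastName, DateOfBirth, BirthCountry) VALUES
(1, 'Jonh', 'Doe',   '1983-01-01', 'Grand'),
(2, 'Jack', 'Stone', '1976-06-08', 'Grand'),
(3, 'Jane', 'Doe',   '1982-02-08', 'Grand'),
(4, 'Adam', 'Wayne', '1983-01-01', 'Grand'),
(5, 'Kay',  'Wayne', '1976-06-08', 'Grand'),
(6, 'Matt', 'Knox',  '1983-01-01', 'Hay');

-- Start with every row in its own group, labelled by its ID.
UPDATE @Tbl SET GroupID = ID;

DECLARE @changed INT = 1;
WHILE @changed > 0
BEGIN
    -- Move each row to the smallest label found among its direct matches.
    UPDATE a
    SET GroupID = m.MinGroup
    FROM @Tbl AS a
    CROSS APPLY (
        SELECT MIN(b.GroupID) AS MinGroup
        FROM @Tbl AS b
        WHERE b.BirthCountry = a.BirthCountry
          AND (b.LastName = a.LastName OR b.DateOfBirth = a.DateOfBirth)
    ) AS m
    WHERE m.MinGroup < a.GroupID;

    -- Stop once a full pass changes nothing.
    SET @changed = @@ROWCOUNT;
END;

-- Renumber the surviving labels so that groups start at 100.
SELECT ID, FirstName, LastName, DateOfBirth, BirthCountry,
       99 + DENSE_RANK() OVER (ORDER BY GroupID) AS GroupID
FROM @Tbl
ORDER BY ID;
```

On the sample data the loop settles after a few passes, with IDs 1 to 5 in group 100 and ID 6 in group 101. Each pass is set-based, but the number of passes grows with the longest chain of matches, so this is a trade-off between readability and speed on large tables.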
I assume GroupID is the output you want and that it starts from 100.
Even if GroupID comes from another table, that is not a problem.
First, sorry for my earlier "no cursor" comments. A cursor or other RBAR (row-by-agonizing-row) operation is required for this task. It has been a very long time since I met a requirement like this; it took me a while, and I used an RBAR operation.
If I manage to do it with a set-based method tomorrow, I will come back and edit this.
Most importantly, the RBAR approach makes the script easier to understand, and I think it will work for other sample data too.
Please also give feedback on the performance and on how it works with other sample data.
Also note that in my script the IDs do not have to be in sequence; I did this deliberately in order to test.
I use PRINT for debugging purposes; you can remove it.
SET NOCOUNT ON
DECLARE @Tbl TABLE(
    ID INT,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    DateOfBirth DATE,
    BirthCountry VARCHAR(50),
    GroupID INT NULL
);
INSERT INTO @Tbl VALUES
(1, 'Jonh', 'Doe', '1983-01-01', 'Grand', NULL) ,
(2, 'Jack', 'Stone', '1976-06-08', 'Grand', NULL),
(3, 'Jane', 'Doe', '1982-02-08', 'Grand', NULL),
(4, 'Adam', 'Wayne', '1983-01-01', 'Grand', NULL),
(5, 'Kay', 'Wayne', '1976-06-08', 'Grand', NULL),
(6, 'Matt', 'Knox', '1983-01-01', 'Hay', NULL),
(7, 'Jerry', 'Stone', '1976-06-08', 'Hay', NULL)
DECLARE @StartGroupid INT = 100
DECLARE @id INT
DECLARE @Groupid INT
DECLARE @Maxid INT
DECLARE @i INT = 1
DECLARE @MinGroupID int=@StartGroupid
DECLARE @MaxGroupID int=@StartGroupid
DECLARE @LastGroupID int
SELECT @maxid = max(id)
FROM @tbl
-- Visit the rows one at a time, in ID order
WHILE (@i <= @maxid)
BEGIN
    SELECT @id = id
        ,@Groupid = Groupid
    FROM @Tbl a
    WHERE id = @i
    if(@Groupid is not null and @Groupid<@MinGroupID)
        set @MinGroupID=@Groupid
    if(@Groupid is not null and @Groupid>@MaxGroupID)
        set @MaxGroupID=@Groupid
    if(@Groupid is not null)
        set @LastGroupID=@Groupid
    -- Assign a group to every row that matches the current row
    UPDATE A
    SET groupid =case
        when @id=1 and b.groupid is null then @StartGroupid
        when @id>1 and b.groupid is null then @MaxGroupID+1--(Select max(groupid)+1 from @tbl where id<@id)
        when @id>1 and b.groupid is not null then @MinGroupID --(Select min(groupid) from @tbl where id<@id)
        end
    FROM @Tbl A
    INNER JOIN @tbl B ON b.id = @ID
    WHERE (
        (
            a.BirthCountry = b.BirthCountry
            and a.DateOfBirth = b.dateofbirth
        )
        or (a.LastName = b.LastName and a.BirthCountry = b.BirthCountry)
        or (a.LastName = b.LastName and a.dateofbirth = b.dateofbirth)
    )
    --if(@id=7) --@id=2,@id=3 and so on (for debug)
    --break
    SET @i = @i + 1
    SET @ID = @I
END
SELECT *
FROM @Tbl
Alternate method, although it still returns 56,000 rows before the rownum=1 filter is applied. See if it works with other sample data, or whether you can optimize it further.
;with CTE as
(
    select a.ID,a.FirstName,a.LastName,a.DateOfBirth,a.BirthCountry
        ,@StartGroupid GroupID
        ,1 rn
    FROM @Tbl A where a.id=1
    UNION ALL
    Select a.ID,a.FirstName,a.LastName,a.DateOfBirth,a.BirthCountry
        ,case when ((a.BirthCountry = b.BirthCountry and a.DateOfBirth = b.dateofbirth)
            or (a.LastName = b.LastName and a.BirthCountry = b.BirthCountry)
            or (a.LastName = b.LastName and a.dateofbirth = b.dateofbirth)
            ) then b.groupid else b.groupid+1 end
        , b.rn+1
    FROM @tbl A
    inner join CTE B on a.id>1
    where b.rn<@Maxid
)
,CTE1 as
(select * ,row_number()over(partition by id order by groupid )rownum
from CTE )
select * from cte1
where rownum=1
Maybe you can run it in this way
SELECT *
FROM table_name
GROUP BY
FirstName,
LastName,
GroupID
HAVING COUNT(GroupID) >= 2
ORDER BY GroupID

Building a snapshot table from audit records

I have a Customer table with the following structure.
CustomerId Name Address Phone
1 Joe 123 Main NULL
I also have an Audit table that tracks changes to the Customer table.
Id Entity EntityId Field OldValue NewValue Type AuditDate
1 Customer 1 Name NULL Joe Add 2016-01-01
2 Customer 1 Phone NULL 567-54-3332 Add 2016-01-01
3 Customer 1 Address NULL 456 Centre Add 2016-01-01
4 Customer 1 Address 456 Centre 123 Main Edit 2016-01-02
5 Customer 1 Phone 567-54-3332 843-43-1230 Edit 2016-01-03
6 Customer 1 Phone 843-43-1230 NULL Delete 2016-01-04
I have a CustomerHistory reporting table that will be populated with a daily ETL job. It has the same fields as Customer Table with additional field SnapShotDate.
I need to write a query that takes the records in Audit table, transforms and inserts into CustomerHistory as seen below.
CustomerId Name Address Phone SnapShotDate
1 Joe 456 Centre 567-54-3332 2016-01-01
1 Joe 123 Main 567-54-3332 2016-01-02
1 Joe 123 Main 843-43-1230 2016-01-03
1 Joe 123 Main NULL 2016-01-04
I guess the solution would involve a self-join on Audit table or a recursive CTE. I would appreciate any help with developing this solution.
Note: Unfortunately, I do not have the option to use triggers or change the Audit table schema. Query performance is not a concern since this will be a nightly ETL process.
You can use below script.
DROP TABLE #tmp
CREATE TABLE #tmp (
id INT Identity
, EntityId INT
, NAME VARCHAR(10)
, Address VARCHAR(100)
, Phone VARCHAR(20)
, Type VARCHAR(10)
, SnapShotDate DATETIME
)
;with cte1 as (
select AuditDate, EntityId, Type, [Name], [Address], [Phone]
from
(select AuditDate, EntityId, Type, Field, NewValue from #Audit) p
pivot
(
max(NewValue)
for Field in ([Name], [Address], [Phone])
) as xx
)
insert into #tmp (EntityId, Name, Address, Phone, Type, SnapShotDate)
select EntityId, Name, Address, Phone, Type, AuditDate
from cte1
-- update NULL columns with the most recent non-NULL value for the same EntityId
update t
set Name = (select top 1 tp2.Name from #tmp tp2
            where tp2.EntityId = t.EntityId and tp2.Name is not null
            order by tp2.id desc)
from #tmp t
where t.Name is null
update t
set Address = (select top 1 tp2.Address from #tmp tp2
               where tp2.EntityId = t.EntityId and tp2.Address is not null
               order by tp2.id desc)
from #tmp t
where t.Address is null
update t
set Phone = (select top 1 tp2.Phone from #tmp tp2
             where tp2.EntityId = t.EntityId and tp2.Phone is not null
             order by tp2.id desc)
from #tmp t
where t.Phone is null
To Create Test Data, use below script
CREATE TABLE #Customer (
CustomerId INT
, NAME VARCHAR(10)
, Address VARCHAR(100)
, Phone VARCHAR(20)
)
INSERT INTO #Customer
VALUES (1, 'Joe', '123 Main', NULL)
CREATE TABLE #Audit (
Id INT
, Entity VARCHAR(50)
, EntityId INT
, Field VARCHAR(20)
, OldValue VARCHAR(100)
, NewValue VARCHAR(100)
, Type VARCHAR(10)
, AuditDate DATETIME
)
insert into #Audit values
(1, 'Customer', 1, 'Name' ,NULL ,'Joe' ,'Add' ,'2016-01-01'),
(2, 'Customer', 1, 'Phone' ,NULL ,'567-54-3332' ,'Add' ,'2016-01-01'),
(3, 'Customer', 1, 'Address' ,NULL ,'456 Centre' ,'Add' ,'2016-01-01'),
(4, 'Customer', 1, 'Address' ,'456 Centre' ,'123 Main' ,'Edit' ,'2016-01-02'),
(5, 'Customer', 1, 'Phone' ,'567-54-3332' ,'843-43-1230' ,'Edit' ,'2016-01-03'),
(6, 'Customer', 1, 'Phone' ,'843-43-1230' ,NULL ,'Delete' ,'2016-01-04'),
(7, 'Customer', 2, 'Name' ,NULL ,'Peter' ,'Add' ,'2016-01-01'),
(8, 'Customer', 2, 'Phone' ,NULL ,'111-222-3333' ,'Add' ,'2016-01-01'),
(8, 'Customer', 2, 'Address' ,NULL ,'Parthenia' ,'Add' ,'2016-01-01')
Result
EntityId Name Address Phone Type SnapShotDate
1 Joe 456 Centre 567-54-3332 Add 2016-01-01
1 Joe 123 Main 843-43-1230 Edit 2016-01-02
1 Joe 123 Main 843-43-1230 Edit 2016-01-03
1 Joe 123 Main 843-43-1230 Delete 2016-01-04
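To get the CustomerHistory rows exactly as listed in the question, where each snapshot carries the values as they stood on that day and a Delete leaves NULL, one option is to look up, per snapshot date and per field, the latest audit row on or before that date. The following is a sketch under that assumption. The table variable @Audit and the aliases are illustrative, and ties on the same AuditDate are broken by the higher Id.

```sql
DECLARE @Audit TABLE (
    Id INT, Entity VARCHAR(50), EntityId INT, Field VARCHAR(20),
    OldValue VARCHAR(100), NewValue VARCHAR(100), [Type] VARCHAR(10), AuditDate DATE
);
INSERT INTO @Audit VALUES
(1, 'Customer', 1, 'Name',    NULL,          'Joe',         'Add',    '2016-01-01'),
(2, 'Customer', 1, 'Phone',   NULL,          '567-54-3332', 'Add',    '2016-01-01'),
(3, 'Customer', 1, 'Address', NULL,          '456 Centre',  'Add',    '2016-01-01'),
(4, 'Customer', 1, 'Address', '456 Centre',  '123 Main',    'Edit',   '2016-01-02'),
(5, 'Customer', 1, 'Phone',   '567-54-3332', '843-43-1230', 'Edit',   '2016-01-03'),
(6, 'Customer', 1, 'Phone',   '843-43-1230', NULL,          'Delete', '2016-01-04');

-- One output row per customer and per day on which anything changed.
SELECT d.EntityId   AS CustomerId,
       n.NewValue   AS Name,
       a.NewValue   AS Address,
       p.NewValue   AS Phone,
       d.AuditDate  AS SnapShotDate
FROM (SELECT DISTINCT EntityId, AuditDate
      FROM @Audit
      WHERE Entity = 'Customer') AS d
-- For each field, take the NewValue of the latest audit row up to the snapshot date.
OUTER APPLY (SELECT TOP (1) x.NewValue FROM @Audit AS x
             WHERE x.Entity = 'Customer' AND x.EntityId = d.EntityId
               AND x.Field = 'Name' AND x.AuditDate <= d.AuditDate
             ORDER BY x.AuditDate DESC, x.Id DESC) AS n
OUTER APPLY (SELECT TOP (1) x.NewValue FROM @Audit AS x
             WHERE x.Entity = 'Customer' AND x.EntityId = d.EntityId
               AND x.Field = 'Address' AND x.AuditDate <= d.AuditDate
             ORDER BY x.AuditDate DESC, x.Id DESC) AS a
OUTER APPLY (SELECT TOP (1) x.NewValue FROM @Audit AS x
             WHERE x.Entity = 'Customer' AND x.EntityId = d.EntityId
               AND x.Field = 'Phone' AND x.AuditDate <= d.AuditDate
             ORDER BY x.AuditDate DESC, x.Id DESC) AS p
ORDER BY d.EntityId, d.AuditDate;
```

For customer 1 this keeps phone 567-54-3332 on 2016-01-02, switches to 843-43-1230 on 2016-01-03 and shows NULL on 2016-01-04, because the Delete row itself is the latest Phone entry on that date. Each additional tracked field needs its own OUTER APPLY.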

Date Range Intersection Splitting in SQL

I have a SQL Server 2005 database which contains a table called Memberships.
The table schema is:
PersonID int, Surname nvarchar(30), FirstName nvarchar(30), Description nvarchar(100), StartDate datetime, EndDate datetime
I'm currently working on a grid feature which shows a break-down of memberships by person. One of the requirements is to split membership rows where there is an intersection of date ranges. The intersection must be bound by the Surname and FirstName, ie splits only occur with membership records of the same Surname and FirstName.
Example table data:
18 Smith John Poker Club 01/01/2009 NULL
18 Smith John Library 05/01/2009 18/01/2009
18 Smith John Gym 10/01/2009 28/01/2009
26 Adams Jane Pilates 03/01/2009 16/02/2009
Expected result set:
18 Smith John Poker Club 01/01/2009 04/01/2009
18 Smith John Poker Club / Library 05/01/2009 09/01/2009
18 Smith John Poker Club / Library / Gym 10/01/2009 18/01/2009
18 Smith John Poker Club / Gym 19/01/2009 28/01/2009
18 Smith John Poker Club 29/01/2009 NULL
26 Adams Jane Pilates 03/01/2009 16/02/2009
Does anyone have any idea how I could write a stored procedure that will return a result set which has the break-down described above.
The problem you are going to have is that, as the data set grows, T-SQL solutions to this won't scale well. The below uses a series of temporary tables built on the fly to solve the problem. It splits each date range entry into its respective days using a numbers table. This is where it won't scale, primarily due to your open-ended NULL values, which appear to mean infinity, so you have to swap in a fixed date far into the future that limits the range of conversion to a feasible length of time. You could likely see better performance by building a table of days or a calendar table with appropriate indexing for optimized rendering of each day.
Once the ranges are split, the descriptions are merged using XML PATH so that each day in the range series has all of the descriptions listed for it. Row Numbering by PersonID and Date allows for the first and last row of each range to be found using two NOT EXISTS checks to find instances where a previous row doesn't exist for a matching PersonID and Description set, or where the next row doesn't exist for a matching PersonID and Description set.
This result set is then renumbered using ROW_NUMBER so that they can be paired up to build the final results.
/*
SET DATEFORMAT dmy
USE tempdb;
GO
CREATE TABLE Schedule
( PersonID int,
Surname nvarchar(30),
FirstName nvarchar(30),
Description nvarchar(100),
StartDate datetime,
EndDate datetime)
GO
INSERT INTO Schedule VALUES (18, 'Smith', 'John', 'Poker Club', '01/01/2009', NULL)
INSERT INTO Schedule VALUES (18, 'Smith', 'John', 'Library', '05/01/2009', '18/01/2009')
INSERT INTO Schedule VALUES (18, 'Smith', 'John', 'Gym', '10/01/2009', '28/01/2009')
INSERT INTO Schedule VALUES (26, 'Adams', 'Jane', 'Pilates', '03/01/2009', '16/02/2009')
GO
*/
SELECT
PersonID,
Description,
theDate
INTO #SplitRanges
FROM Schedule, (SELECT DATEADD(dd, number, '01/01/2008') AS theDate
FROM master..spt_values
WHERE type = N'P') AS DayTab
WHERE theDate >= StartDate
AND theDate <= isnull(EndDate, '31/12/2012')
SELECT
ROW_NUMBER() OVER (ORDER BY PersonID, theDate) AS rowid,
PersonID,
theDate,
STUFF((
SELECT '/' + Description
FROM #SplitRanges AS s
WHERE s.PersonID = sr.PersonID
AND s.theDate = sr.theDate
FOR XML PATH('')
), 1, 1,'') AS Descriptions
INTO #MergedDescriptions
FROM #SplitRanges AS sr
GROUP BY PersonID, theDate
SELECT
ROW_NUMBER() OVER (ORDER BY PersonID, theDate) AS ID,
*
INTO #InterimResults
FROM
(
SELECT *
FROM #MergedDescriptions AS t1
WHERE NOT EXISTS
(SELECT 1
FROM #MergedDescriptions AS t2
WHERE t1.PersonID = t2.PersonID
AND t1.RowID - 1 = t2.RowID
AND t1.Descriptions = t2.Descriptions)
UNION ALL
SELECT *
FROM #MergedDescriptions AS t1
WHERE NOT EXISTS
(SELECT 1
FROM #MergedDescriptions AS t2
WHERE t1.PersonID = t2.PersonID
AND t1.RowID = t2.RowID - 1
AND t1.Descriptions = t2.Descriptions)
) AS t
SELECT DISTINCT
PersonID,
Surname,
FirstName
INTO #DistinctPerson
FROM Schedule
SELECT
t1.PersonID,
dp.Surname,
dp.FirstName,
t1.Descriptions,
t1.theDate AS StartDate,
CASE
WHEN t2.theDate = '31/12/2012' THEN NULL
ELSE t2.theDate
END AS EndDate
FROM #DistinctPerson AS dp
JOIN #InterimResults AS t1
ON t1.PersonID = dp.PersonID
JOIN #InterimResults AS t2
ON t2.PersonID = t1.PersonID
AND t1.ID + 1 = t2.ID
AND t1.Descriptions = t2.Descriptions
DROP TABLE #SplitRanges
DROP TABLE #MergedDescriptions
DROP TABLE #DistinctPerson
DROP TABLE #InterimResults
/*
DROP TABLE Schedule
*/
The above solution will also handle gaps between additional Descriptions as well, so if you were to add another Description for PersonID 18 leaving a gap:
INSERT INTO Schedule VALUES (18, 'Smith', 'John', 'Gym', '10/02/2009', '28/02/2009')
It will fill the gap appropriately. As pointed out in the comments, you shouldn't have name information in this table, it should be normalized out to a Persons Table that can be JOIN'd to in the final result. I simulated this other table by using a SELECT DISTINCT to build a temp table to create that JOIN.
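Because the per-day expansion is what limits scaling, an alternative worth sketching is to work only with the boundary dates. This is a different technique from the one above, given under stated assumptions: dates are whole days, EndDate is inclusive, and a NULL EndDate means the membership is still open. The table variable @Memberships and the CTE names are illustrative. Every StartDate and every day after an EndDate is a cut point, consecutive cut points form segments, and each segment lists the memberships that cover it.

```sql
SET DATEFORMAT dmy;
DECLARE @Memberships TABLE (PersonID int, Description nvarchar(100),
                            StartDate datetime, EndDate datetime);
INSERT INTO @Memberships VALUES (18, 'Poker Club', '01/01/2009', NULL);
INSERT INTO @Memberships VALUES (18, 'Library',    '05/01/2009', '18/01/2009');
INSERT INTO @Memberships VALUES (18, 'Gym',        '10/01/2009', '28/01/2009');
INSERT INTO @Memberships VALUES (26, 'Pilates',    '03/01/2009', '16/02/2009');

WITH Points AS (
    -- Cut points: each start date, and the day after each closed end date.
    SELECT PersonID, StartDate AS Pt FROM @Memberships
    UNION
    SELECT PersonID, DATEADD(day, 1, EndDate) FROM @Memberships WHERE EndDate IS NOT NULL
), Numbered AS (
    SELECT PersonID, Pt,
           ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY Pt) AS rn
    FROM Points
), Segments AS (
    -- Pair each cut point with the next one; the last segment stays open (NULL end).
    SELECT a.PersonID, a.Pt AS SegStart, DATEADD(day, -1, b.Pt) AS SegEnd
    FROM Numbered AS a
    LEFT JOIN Numbered AS b
        ON b.PersonID = a.PersonID AND b.rn = a.rn + 1
)
SELECT s.PersonID, d.Descriptions, s.SegStart AS StartDate, s.SegEnd AS EndDate
FROM Segments AS s
CROSS APPLY (
    -- Concatenate the memberships that are active at the start of the segment.
    SELECT STUFF((SELECT ' / ' + m.Description
                  FROM @Memberships AS m
                  WHERE m.PersonID = s.PersonID
                    AND m.StartDate <= s.SegStart
                    AND (m.EndDate IS NULL OR m.EndDate >= s.SegStart)
                  ORDER BY m.StartDate
                  FOR XML PATH('')), 1, 3, '') AS Descriptions
) AS d
WHERE d.Descriptions IS NOT NULL   -- drop gaps where no membership is active
ORDER BY s.PersonID, s.SegStart;
```

On the sample rows this produces the five segments for PersonID 18 and the single Pilates row for PersonID 26 from the expected result set, without needing a placeholder far-future date. Surname and FirstName are left out here and could be joined back from a persons table. FOR XML PATH escapes characters such as & in descriptions, which would need extra handling.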
Try this
SET DATEFORMAT dmy
DECLARE @Membership TABLE(
PersonID int,
Surname nvarchar(16),
FirstName nvarchar(16),
Description nvarchar(16),
StartDate datetime,
EndDate datetime)
INSERT INTO @Membership VALUES (18, 'Smith', 'John', 'Poker Club', '01/01/2009', NULL)
INSERT INTO @Membership VALUES (18, 'Smith', 'John','Library', '05/01/2009', '18/01/2009')
INSERT INTO @Membership VALUES (18, 'Smith', 'John','Gym', '10/01/2009', '28/01/2009')
INSERT INTO @Membership VALUES (26, 'Adams', 'Jane','Pilates', '03/01/2009', '16/02/2009')
--Program Starts
declare @enddate datetime
--Handle the extreme condition when all the end dates are null (i.e. all the memberships for all members are in progress):
-- in such a case take an arbitrary date, e.g. '31/12/2009' here; otherwise add 1 more day to the highest end date
select @enddate = case when max(enddate) is null then '31/12/2009' else max(enddate) + 1 end from @Membership
--Fill the null end dates
; with fillNullEndDates_cte as
(
select
row_number() over(partition by PersonId order by PersonId) RowNum
,PersonId
,Surname
,FirstName
,Description
,StartDate
,isnull(EndDate,@enddate) EndDate
from @Membership
)
--Generate a date calendar
, generateCalender_cte as
(
select
1 as CalenderRows
,min(startdate) DateValue
from @Membership
union all
select
CalenderRows+1
,DateValue + 1
from generateCalender_cte
where DateValue + 1 <= @enddate
)
--Generate missing dates based on membership
,datesBasedOnMemberships_cte as
(
select
t.RowNum
,t.PersonId
,t.Surname
,t.FirstName
,t.Description
, d.DateValue
,d.CalenderRows
from generateCalender_cte d
join fillNullEndDates_cte t ON d.DateValue between t.startdate and t.enddate
)
--Generate description based on membership dates
, descriptionBasedOnMembershipDates_cte as
(
select
PersonID
,Surname
,FirstName
,stuff((
select '/' + Description
from datesBasedOnMemberships_cte d1
where d1.PersonID = d2.PersonID
and d1.DateValue = d2.DateValue
for xml path('')
), 1, 1,'') as Description
, DateValue
,CalenderRows
from datesBasedOnMemberships_cte d2
group by PersonID, Surname,FirstName,DateValue,CalenderRows
)
--Grouping based on membership dates
,groupByMembershipDates_cte as
(
select d.*,
CalenderRows - row_number() over(partition by Description order by PersonID, DateValue) AS [Group]
from descriptionBasedOnMembershipDates_cte d
)
select PersonId
,Surname
,FirstName
,Description
,convert(varchar(10), convert(datetime, min(DateValue)), 103) as StartDate
,case when max(DateValue)= @enddate then null else convert(varchar(10), convert(datetime, max(DateValue)), 103) end as EndDate
from groupByMembershipDates_cte
group by [Group],PersonId,Surname,FirstName,Description
order by PersonId,StartDate
option(maxrecursion 0)
[Only many, many years later.]
I created a stored procedure that will align and break segments by a partition within a single table, and then you can use those aligned breaks to pivot the description into a ragged column using a subquery and XML PATH.
See if the below helps:
Documentation: https://github.com/Quebe/SQL-Algorithms/blob/master/Temporal/Date%20Segment%20Manipulation/DateSegments_AlignWithinTable.md
Stored Procedure: https://github.com/Quebe/SQL-Algorithms/blob/master/Temporal/Date%20Segment%20Manipulation/DateSegments_AlignWithinTable.sql
For example, your call might look like:
EXEC dbo.DateSegments_AlignWithinTable
@tableName = 'tableName',
@keyFieldList = 'PersonID',
@nonKeyFieldList = 'Description',
@effectivveDateFieldName = 'StartDate',
@terminationDateFieldName = 'EndDate'
You will want to capture the result (which is a table) into another table or temporary table (assuming it is called "AlignedDataTable" in below example). Then, you can pivot using a subquery.
SELECT
PersonID, StartDate, EndDate,
SUBSTRING ((SELECT ',' + [Description] FROM AlignedDataTable AS innerTable
WHERE
innerTable.PersonID = AlignedDataTable.PersonID
AND (innerTable.StartDate = AlignedDataTable.StartDate)
AND (innerTable.EndDate = AlignedDataTable.EndDate)
ORDER BY id
FOR XML PATH ('')), 2, 999999999999999) AS IdList
FROM AlignedDataTable
GROUP BY PersonID, StartDate, EndDate
ORDER BY PersonID, StartDate
