Best method to merge two SQL tables together - sql-server

So I have two tables. One tracks a person's location, and the other holds the shifts of staff members.
Staff members have a staffId, location, start and end times, and cost of that shift.
People have an eventId, stayId, personId, location, start and end time. A person will have an event with multiple stays.
What I am attempting to do is mesh these two tables together, so that I can accurately report the cost of each location stay, based on the duration of that stay multiplied by the associated cost of staff covering that location at that time.
The issues I have are:
Location stays do not align with staff shifts. i.e. a person might be in location a between 1pm and 2pm, and four staff might be on shifts from 12:30 to 1:30, and two on from 1:30 till 5.
There are a lot of records.
Not all staff are paid the same
My current method is to expand both tables to have a record for every single minute. So a stay that is between 1pm and 2pm will have 60 records, and a staff shift that goes for 5 hours will have 300 records. I can then take all staff that are working on that location at that minute to get a minute value based on the cost of each staff member divided by the duration of their shift, and apply that value to the corresponding record in the other table.
Techniques used:
I create a table with 50,000 numbers, since some stays can be quite
long.
I take the staff table and join onto the numbers table to split each
shift. Then group it together based on location and minute, with a
staff count and minute cost.
The final step, and the one causing issues, is where I take the
location table, join onto numbers, and also onto the modified staff
table to produce a cost for that minute. I also count the number of
people in that location to account for staff covering multiple
people.
I'm finding this process extremely slow as you can imagine, since my person table has about 500 million records when expanded to the minute level, and the staff table has about 35 million when the same thing is done.
Can people suggest a better method for me to use?
Sample data:
Locations
| EventId | ID | Person | Loc | Start | End
| 1 | 987 | 123 | 1 | May, 20 2015 07:00:00 | May, 20 2015 08:00:00
| 1 | 374 | 123 | 4 | May, 20 2015 08:00:00 | May, 20 2015 10:00:00
| 1 | 184 | 123 | 3 | May, 20 2015 10:00:00 | May, 20 2015 11:00:00
| 1 | 798 | 123 | 8 | May, 20 2015 11:00:00 | May, 20 2015 12:00:00
Staff
| Loc | StaffID | Cost | Start | End
| 1 | 99 | 40 | May, 20 2015 04:00:00 | May, 20 2015 12:00:00
| 1 | 15 | 85 | May, 20 2015 03:00:00 | May, 20 2015 5:00:00
| 3 | 85 | 74 | May, 20 2015 18:00:00 | May, 20 2015 20:00:00
| 4 | 10 | 36 | May, 20 2015 06:00:00 | May, 20 2015 14:00:00
Result
| EventId | ID | Person | Loc | Start | End | Cost
| 1 | 987 | 123 | 1 | May, 20 2015 07:00:00 | May, 20 2015 08:00:00 | 45.50
| 1 | 374 | 123 | 4 | May, 20 2015 08:00:00 | May, 20 2015 10:00:00 | 81.20
| 1 | 184 | 123 | 3 | May, 20 2015 10:00:00 | May, 20 2015 11:00:00 | 95.00
| 1 | 798 | 123 | 8 | May, 20 2015 11:00:00 | May, 20 2015 12:00:00 | 14.75
SQL:
Numbers table
;WITH x AS
(
SELECT TOP (224) object_id FROM sys.all_objects
)
SELECT TOP (50000) n = ROW_NUMBER() OVER (ORDER BY x.object_id)
INTO #numbers
FROM x CROSS JOIN x AS y
ORDER BY n
Staff Table
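SELECT
Location,
ISNULL(SUM(ROUND(Cost/ CASE WHEN (DateDiff(MINUTE, StartDateTime, EndDateTime)) = 0 THEN 1 ELSE (DateDiff(MINUTE, StartDateTime, EndDateTime)) END, 5)),0) AS MinuteCost,
Count(Name) AS StaffCount,
RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, StartDateTime) + n.n -1, 0)
INTO #temp_StaffRoster
FROM dbo.StaffRoster
-- The join and grouping below are reconstructed from the description above; the posted snippet omitted them
INNER JOIN #numbers n ON n.n BETWEEN 1 AND DATEDIFF(MINUTE, StartDateTime, EndDateTime)
GROUP BY Location, DATEADD(MI, DATEDIFF(MI, 0, StartDateTime) + n.n -1, 0)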
Grouping together, and where help is needed I think
INSERT INTO dbo.FinalTable
SELECT [EventId]
,[Id]
,[Start]
,[End]
,event.[Location]
,SUM(ISNULL(MinuteCost,1)/ISNULL(PeopleCount, 1)) AS Cost
,AVG(ISNULL(StaffCount,1)) AS AvgStaff
FROM dbo.Events event WITH (NOLOCK)
INNER JOIN #numbers n ON n.n BETWEEN 0 AND DATEDIFF(MINUTE, Start, End)
LEFT OUTER JOIN #temp_StaffRoster staff WITH (NOLOCK) ON staff.Location= event.Location AND staff.RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
LEFT OUTER JOIN (SELECT [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0) AS Mins, COUNT(Id) as PeopleCount
FROM dbo.Events WITH (NOLOCK)
INNER JOIN #numbers n ON n.n BETWEEN 0 AND DATEDIFF(MINUTE, Start, End)
GROUP BY [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
) cap ON cap.Location = event.Location AND cap.Mins = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
GROUP BY [EventId]
,[Id]
,[Start]
,[End]
,event.[Location]
UPDATE
So I have two tables. One tracks a person's location, and the other holds the shifts of staff members with their cost. I am attempting to consolidate the two tables to calculate the cost of each location stay.
Here is my method:
;WITH stay AS
(
SELECT TOP 650000
StayId,
Location,
Start,
End
FROM stg_Stay
WHERE Location IS NOT NULL -- Some locations don't currently have a matching shift location
ORDER BY Location, ADTM
),
shift AS
(
SELECT TOP 36000000
Location,
ShiftMinute,
MinuteCost,
StaffCount
FROM stg_Shifts
ORDER BY Location, ShiftMinute
)
SELECT
[StayId],
SUM(MinuteCost) AS Cost,
AVG(StaffCount) AS StaffCount
INTO newTable
FROM stay S
CROSS APPLY (SELECT MinuteCost, StaffCount
FROM shift R
WHERE R.Location = S.Location
AND R.ShiftMinute BETWEEN S.Start AND S.End
) AS Shifts
GROUP BY [StayId]
This is where I'm at.
I've split the Shifts table into a minute by minute level since there is no clear alignment of shifts to stays.
stg_Stay contains more columns than needed for this operation. stg_Shifts is as shown.
Indexes used on stg_Shifts:
CREATE NONCLUSTERED INDEX IX_Shifts_Loc_Min
ON dbo.stg_Shifts (Location, ShiftMinute)
INCLUDE (MinuteCost, StaffCount);
on stg_Stay
CREATE INDEX IX_Stay_StayId ON dbo.stg_Stay (StayId);
CREATE CLUSTERED INDEX IX_Stay_Start_End_Loc ON dbo.stg_Stay (Location,Start,End);
Given that Shifts has ~36 million records and Stays has ~650k, what can I do to make this perform better?

Don't break the rows down by minute.
A staging table may help if it gives you a fast way to relate the two tables, i.e. by the overlapping interval.
SELECT *
FROM Locations l
OUTER APPLY -- Assumes a staff member is never rostered at two locations during the same period.
(
SELECT
CONVERT(decimal(14,2), SUM(CostPerMinute * OverlappedMinutes)) AS ActualCost,
COUNT(DISTINCT StaffId) AS StaffCount,
SUM(OverlappedMinutes) AS StaffMinutes
FROM
(
SELECT
*,
-- Calculate overlapped time in minutes
DATEDIFF(MINUTE,
CASE WHEN StartTime > l.StartTime THEN StartTime ELSE l.StartTime END, -- Get greatest start time
CASE WHEN EndTime > l.EndTime THEN l.EndTime ELSE EndTime END -- Get least end time
) AS OverlappedMinutes,
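Cost * 1.0 / NULLIF(DATEDIFF(MINUTE, StartTime, EndTime), 0) AS CostPerMinute -- * 1.0 avoids integer division when Cost is an INT; NULLIF guards zero-length shifts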
FROM Staff
WHERE LocationId = l.LocationId
AND StartTime <= l.EndTime AND l.StartTime <= EndTime -- Match with overlapped time
) data
) StaffInLoc
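The overlap arithmetic in the query above can be checked outside the database. The following is a minimal Python sketch, not T-SQL, using the sample rows from the question; the function and field names are made up for illustration, and it assumes a shift's cost is spread evenly over its minutes. Like the query, it does not divide the cost between several people sharing a location.

```python
from datetime import datetime


def overlap_minutes(a_start, a_end, b_start, b_end):
    """Minutes shared by two intervals; 0 when they do not overlap."""
    start = max(a_start, b_start)   # greatest start time
    end = min(a_end, b_end)         # least end time
    return max(0, int((end - start).total_seconds() // 60))


def stay_cost(stay, shifts):
    """Sum each overlapping shift's cost, prorated by the overlapped minutes."""
    total = 0.0
    for shift in shifts:
        if shift["loc"] != stay["loc"]:
            continue
        shift_minutes = int((shift["end"] - shift["start"]).total_seconds() // 60)
        if shift_minutes == 0:
            continue  # a zero-length shift carries no per-minute cost
        shared = overlap_minutes(stay["start"], stay["end"],
                                 shift["start"], shift["end"])
        total += shift["cost"] * shared / shift_minutes
    return total


def at(hour):
    """Helper: a time on the sample day, 20 May 2015."""
    return datetime(2015, 5, 20, hour)


shifts = [
    {"loc": 1, "staff": 99, "cost": 40, "start": at(4), "end": at(12)},
    {"loc": 1, "staff": 15, "cost": 85, "start": at(3), "end": at(5)},
    {"loc": 3, "staff": 85, "cost": 74, "start": at(18), "end": at(20)},
    {"loc": 4, "staff": 10, "cost": 36, "start": at(6), "end": at(14)},
]
stays = [
    {"id": 987, "loc": 1, "start": at(7), "end": at(8)},
    {"id": 374, "loc": 4, "start": at(8), "end": at(10)},
    {"id": 184, "loc": 3, "start": at(10), "end": at(11)},
    {"id": 798, "loc": 8, "start": at(11), "end": at(12)},
]

costs = {stay["id"]: stay_cost(stay, shifts) for stay in stays}
print(costs)  # {987: 5.0, 374: 9.0, 184: 0.0, 798: 0.0}
```

Each stay is compared only with the shifts it overlaps, so no per-minute expansion of either table is needed.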

Take the below with a grain of salt, since your naming is horrible.
Location should really be called Stay, as I guess Location is another table defining a single physical location.
Your Staff table is also badly named. Why not name it Shift? I would expect a Staff table to contain things like Name, Phone, etc., whereas a Shift table can contain multiple shifts for the same staff member.
Second, I think you're missing a relation between the two tables.
If you join Location and Staff only on location and overlapping date times, I don't think it makes a whole lot of sense for what you're trying to do. How do you know which staff member is at a given location at a given time? The only thing you can do with location and overlapping dates is assume that an entry in the Location table relates to every staff member who has a shift at that location within the timeframe. So treat the below more as inspiration for solving your problem and for finding overlapping datetime intervals, and less as an actual solution, since I think your data and model are in bad shape.
If I got it all wrong, please provide the primary keys and foreign keys on your tables and a better explanation.
Some dummy data
DROP TABLE dbo.Location
CREATE TABLE dbo.Location
(
StayId INT,
EventId INT,
PersonId INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 987 ,1 ,123 ,1 ,'2015-05-20T07:00:00','2015-05-20T08:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 374 ,1 ,123 ,4 ,'2015-05-20T08:00:00','2015-05-20T10:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 184 ,1 ,123 ,3 ,'2015-05-20T10:00:00','2015-05-20T11:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 798 ,1 ,123 ,8 ,'2015-05-20T11:00:00','2015-05-20T12:00:00')
DROP TABLE dbo.Staff
CREATE TABLE Staff
(
StaffId INT,
Cost INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 99 ,40 ,1 ,'2015-05-20T04:00:00','2015-05-20T12:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 15 ,85 ,1 ,'2015-05-20T03:00:00','2015-05-20T05:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 85 ,74 ,3 ,'2015-05-20T18:00:00','2015-05-20T20:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 10 ,36 ,4 ,'2015-05-20T06:00:00','2015-05-20T14:00:00')
Actual query
WITH OnLocation AS
(
SELECT
L.StayId, L.EventId, L.LocationId, L.PersonId, S.Cost
, IIF(L.StartTime > S.StartTime, L.StartTime, S.StartTime) AS OnLocationStartTime
, IIF(L.EndTime < S.EndTime, L.EndTime, S.EndTime) AS OnLocationEndTime
FROM dbo.Location L
LEFT JOIN dbo.Staff S
ON S.LocationId = L.LocationId -- TODO are you not missing a join condition on staffid
-- Detects any overlaps between stays and shifts
AND L.StartTime <= S.EndTime AND L.EndTime >= S.StartTime
)
SELECT
*
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) AS DurationMinutes
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) / 60.0 * Cost AS DurationCost
FROM OnLocation D
To get a summary, take the query and add a GROUP BY on whatever you want to summarize.

Related

Optimize insert of 1B+ rows

I work in an organization that has over 75,000 employees. In our payroll system, each employee has 32 unique banks which store things like Sick Time, Vacation Time, Banked Overtime, etc.
Here are the existing tables
Employee
(
Employee_key INT IDENTITY(1,1),
Lastname,
Firstname
)
Employee Key | Lastname | Firstname
-----------------------------------
100 | Smith | John
Bank
(
Bank_key INT IDENTITY(1,1),
Bank_name VARCHAR(50)
)
Bank_key | Bank_name
---------------------
100 | VACATION
Employee_balance
(
Employee_key INT, --FK to Employee
Bank_key INT, --FK to Bank
Bank_balance NUMERIC(10,5) -- Aggregate value of bank including future dated entries
)
Employee_key | Bank_Key | Bank_Balance
--------------------------------------
100 | 100 | 0
Employee_balance_trans
(
Employee_key INT, --FK to Employee
Bank_key INT, --FK to Bank
Trans_dt DATE, -- transaction date that affects the bank
Bank_delta NUMERIC(10,5)
)
Employee_key | Bank_key | Trans_dt | Bank_delta
--------------------------------------------------
100 | 100 | 20230701 | -8.0
100 | 100 | 20230801 | -8.0
100 | 100 | 20230901 | -8.0
100 | 100 | 20231001 | -8.0
100 | 100 | 20231101 | -8.0
This employee has 5 vacation days booked into the future, for a total of 40 hours. As of January 1, the employee had 40 hours in their vacation bank, but because the employee_balance table is net of all future dated entries, I have to do some SQL processing to get the value for a current date.
SELECT eb.employee_key,
eb.bank_key,
'2023-01-01',
eb.bank_balance - ISNULL(SUM(ebt.bank_delta), 0)
FROM employee a
INNER JOIN employee_balance eb ON eb.employee_key = a.employee_key
LEFT OUTER JOIN wfms.employee_balance_trans ebt ON ebt.bank_key = eb.bank_key
AND ebt.employee_key = eb.employee_key
AND ebt.trans_dt > '2023-01-01'
GROUP BY eb.employee_key, eb.bank_key, eb.bank_balance
Running this query using 2023-01-01 returns a bank value of 40 hours. Running the query on 2023-07-01 returns a value of 32 hours and so on. This query is fine for calculating a single employee balance. The problem starts when a manager of a department with 1000 employees wants to see a report showing the employee banks at the beginning and end of each month.
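The arithmetic behind those two figures can be reproduced with a small Python sketch. It is only an illustration of the calculation, using the five sample transactions above; the function name is hypothetical.

```python
from datetime import date


def balance_as_of(net_balance, transactions, as_of):
    """Net balance minus every delta dated strictly after `as_of`."""
    future = sum(delta for trans_dt, delta in transactions if trans_dt > as_of)
    return net_balance - future


# Five future-dated vacation days of -8.0 hours each, July through November 2023.
transactions = [(date(2023, month, 1), -8.0) for month in (7, 8, 9, 10, 11)]

# The stored bank balance is 0 because it is net of all future-dated entries.
january = balance_as_of(0.0, transactions, date(2023, 1, 1))
july = balance_as_of(0.0, transactions, date(2023, 7, 1))
print(january, july)  # 40.0 32.0
```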
I created a new table as follows:
Employee_bank_history
(
employee_key INT, --FK to employee
bank_key INT, --FK to bank
bank_date DATE,
bank_balance NUMERIC (10,5) -- Contains the bank balance as of the bank date
)
The table has a unique clustered index consisting of employee_key, bank_key and bank_date. The table is also populated every evening with a date range from December 31 2021 to Current Date. The start date gets reset every year, so there will be a maximum of 730 days worth of data. This means that at the maximum date range of 730 days, there will be almost 2 billion rows. (75,000 employees X 32 banks X 730 days.)
Currently, I am loading 950 million rows, and the following INSERT statement takes 30-45 minutes.
DECLARE @StartDate DATE = DATEFROMPARTS(DATEPART (YY, GETDATE())-2, 12, 31)
DECLARE @EndDate DATE = GETDATE()
;
WITH cte_bank_dates AS
(
SELECT [date]
FROM dim_date
WHERE [date] BETWEEN @StartDate AND @EndDate
)
INSERT INTO employee_bank_history
SELECT de.employee_key,
deb.bank_key,
cte.[date],
deb.bank_balance - ISNULL(SUM(feb.bank_delta), 0)
FROM employee de
INNER JOIN employee_balance deb ON deb.employee_key = de.employee_key
CROSS JOIN cte_bank_dates cte
LEFT OUTER JOIN employee_balance_trans feb ON feb.bank_key = deb.bank_key
AND feb.employee_key = deb.employee_key
AND feb.trans_dt > cte.[date]
GROUP BY de.employee_key, deb.bank_key, cte.[date], deb.bank_balance
OPTION (MAXRECURSION 0)
I use the CTE to get only the dates in the correct range. I need to have each date in the range, so that I know which future dates to exclude from the aggregate option. The resulting query to get bank balances as of a given date is blazing fast.
Today, I had my hands slapped and was told that the CROSS JOIN to the CTE was not needed and to optimize the query because it was slowing everything else down when it runs.
Leaving aside the fact that it will run overnight once in production, I'm left to wonder if there's a better way to populate this table for every employee, every bank and every date. The number of rows is unavoidable, as is the calculation to strip out future dated transactions from the employee bank balance.
Does anyone have any idea how I might make this faster, and less resource intensive on the server?

SQL Server - Convert record per day into date range (with gaps)

I have found a lot of questions and answers asking how to convert a date range to records per day, but I need the opposite and can't find anything yet.
So let's say I have this dataset:
User | Available
1 | 01-01-2019
1 | 02-01-2019
1 | 03-01-2019
1 | 04-01-2019
2 | 05-01-2019
2 | 06-01-2019
2 | 07-01-2019
2 | 10-01-2019
2 | 11-01-2019
2 | 12-01-2019
So we have user 1 who is available from 01/01/2019 to 04/01/2019. Then we have user 2 who is available from 05/01/2019 to 07/01/2019 and 10/01/2019 to 12/01/2019.
The result I am looking for should look like this:
User | Start | End
1 | 01-01-2019 | 04-01-2019
2 | 05-01-2019 | 07-01-2019
2 | 10-01-2019 | 12-01-2019
User 1 was fairly easy to calculate using min/max dates, but with the gaps of user 2, I am completely lost. Any suggestions?

I had to do this before too, and this is the solution I used. Basically, compute a row number partitioned by your grouping columns and ordered by date, and additionally calculate the number of days elapsed since a fixed anchor date (any hard-coded day will work).
The key here is that the row number always increases by 1, while the anchor difference increases by 1 only when the days are consecutive. The difference between the anchor difference and the row number therefore stays the same across a run of consecutive dates, so you can group by it and calculate min/max.
IF OBJECT_ID('tempdb..#Availabilities') IS NOT NULL
DROP TABLE #Availabilities
CREATE TABLE #Availabilities (
[User] INT,
Available DATE)
INSERT INTO #Availabilities
VALUES
(1, '2019-01-01'),
(1, '2019-01-02'),
(1, '2019-01-03'),
(1, '2019-01-04'),
(2, '2019-01-05'),
(2, '2019-01-06'),
(2, '2019-01-07'),
(2, '2019-01-10'),
(2, '2019-01-11'),
(2, '2019-01-12')
;WITH WindowFunctions AS
(
SELECT
A.[User],
A.Available,
AnchorDayDifference = DATEDIFF(DAY, '2018-01-01', A.Available),
RowNumber = ROW_NUMBER() OVER (PARTITION BY A.[User] ORDER BY A.Available)
FROM
#Availabilities AS A
)
SELECT
T.[User],
Start = MIN(T.Available),
[End] = MAX(T.Available)
FROM
WindowFunctions AS T
GROUP BY
T.[User],
T.AnchorDayDifference - T.RowNumber
Result:
User Start End
1 2019-01-01 2019-01-04
2 2019-01-05 2019-01-07
2 2019-01-10 2019-01-12
The WindowFunctions values are (with the resulting grouping difference added as the last column):
User Available AnchorDayDifference RowNumber GroupingRestResult
1 2019-01-01 365 1 364
1 2019-01-02 366 2 364
1 2019-01-03 367 3 364
1 2019-01-04 368 4 364
2 2019-01-05 369 1 368
2 2019-01-06 370 2 368
2 2019-01-07 371 3 368
2 2019-01-10 374 4 370
2 2019-01-11 375 5 370
2 2019-01-12 376 6 370
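The same idea can be sketched in Python to make the grouping key visible. This is an illustration rather than a translation of the query: the function name is made up, it uses the day ordinal as the anchor difference, and it assumes each user has no duplicate dates.

```python
from datetime import date
from itertools import groupby


def islands(rows):
    """rows: iterable of (user, day). Returns (user, start, end) per consecutive run."""
    result = []
    ordered = sorted(rows)
    for user, user_rows in groupby(ordered, key=lambda row: row[0]):
        # Day ordinal minus row number is constant within a run of consecutive days.
        keyed = [(day.toordinal() - row_number, day)
                 for row_number, (_, day) in enumerate(user_rows, start=1)]
        for _, run in groupby(keyed, key=lambda pair: pair[0]):
            days = [day for _, day in run]
            result.append((user, days[0], days[-1]))
    return result


rows = (
    [(1, date(2019, 1, d)) for d in (1, 2, 3, 4)]
    + [(2, date(2019, 1, d)) for d in (5, 6, 7, 10, 11, 12)]
)
ranges = islands(rows)
for user, start, end in ranges:
    print(user, start, end)
```

With the sample data this yields one range for user 1 and two for user 2, matching the result above.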

This is a common "Gaps and Islands" question. Provided you're on SQL Server 2012+ (and if you're not, it's time to upgrade), this gets you the result you're after:
USE Sandbox;
GO
WITH VTE AS(
SELECT V.[User],
CONVERT(date,Available,105) AS Available
FROM (VALUES(1,'01-01-2019'),
(1,'02-01-2019'),
(1,'03-01-2019'),
(1,'04-01-2019'),
(2,'05-01-2019'),
(2,'06-01-2019'),
(2,'07-01-2019'),
(2,'10-01-2019'),
(2,'11-01-2019'),
(2,'12-01-2019')) V([User],Available)),
Diffs AS(
SELECT V.[User],
V.Available,
DATEDIFF(DAY, LAG(V.Available,1,DATEADD(DAY, -1, V.Available)) OVER (PARTITION BY V.[User] ORDER BY V.Available), V.Available) AS Diff
FROM VTE V),
Groups AS(
SELECT D.[User],
D.Available,
COUNT(CASE WHEN D.Diff > 1 THEN 1 END) OVER (PARTITION BY D.[User] ORDER BY D.Available
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM Diffs D)
SELECT G.[User],
MIN(G.Available) AS [Start],
MAX(G.Available) AS [End]
FROM Groups G
GROUP BY G.[User],
G.Grp
ORDER BY G.[User],
[Start];
Leaving aside VTE ("Value Table Expression"), which only holds the sample data, the first CTE, Diffs, gets the difference in days between each row and the previous one. The second CTE, Groups, then puts the dates into groups (surprise, that), based on whether the difference was more than 1. Then we can use those groups to get a MIN and MAX for each group in the final SELECT.

I'm reading the data as consecutive MONTHS, not DAYS.
Example
Select [User]
,[Start] = min([Available])
,[End] = max([Available])
From (
Select *
,Grp = DateDiff(MONTH,'1900-01-01',[Available]) - Row_Number() over (Partition By [User] Order by [Available])
From YourTable
) A
Group By [User],[Grp]
Returns
User Start End
1 2019-01-01 2019-04-01
2 2019-05-01 2019-07-01
2 2019-10-01 2019-12-01

Running sum from a point

I have a forecast of change that I need to add on to actuals.
Example:
Date Group Count ActForc
Nov-15 GrpA 10 A
Dec-15 GrpA 12 A
Jan-16 GrpA -1 F
Feb-16 GrpA 2 F
What I would like to see is:
Date Group Count
Nov-15 GrpA 10
Dec-15 GrpA 12
Jan-16 GrpA 11
Feb-16 GrpA 13
All of the counting/running-sum queries I have seen assume that I want the sections kept separate, and give me ways to create sums for each section. What I want instead is to seed the sum for the second section with the final value from the first section and continue from that point, without disturbing the values from the second section.

If your forecasts always come at the end of the date range, you can also do this by nesting a few window functions. The inner query builds a helper column (Count2) that holds Count when the next row is 'F' and 0 otherwise. A running total over that column, used in place of Count on those same rows, gives the figure you want.
select
[date],
[group],
case when isnull(lead(ActForc) over (order by Date asc),ActForc) = 'F' then
sum(Count2) over (order by Date asc) else [Count] end,
[count],
ActForc
from (
select
[date],
[group],
case when isnull(lead(ActForc) over (order by Date asc),ActForc) = 'F' then [Count] else 0 end as Count2,
[count],
ActForc
from
table1
) X
This should perform better than any recursive CTEs / correlated subqueries because the data isn't read several times. If you have more groups, partitioning the window functions with the group should fix that.
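The intended result can be stated as a single pass over the rows, which the following Python sketch illustrates. It is not the query above; it assumes the rows are already ordered by date, that 'A' rows carry absolute values and 'F' rows carry changes, and that a forecast with no preceding actual starts from 0. The function name is hypothetical.

```python
def seeded_running_total(rows):
    """rows: list of (label, count, kind) ordered by date; kind is 'A' or 'F'."""
    output = []
    running = 0
    for label, count, kind in rows:
        if kind == "A":
            running = count           # an actual is reported as is and becomes the seed
        else:
            running = running + count  # a forecast is a change on top of the previous value
        output.append((label, running))
    return output


rows = [
    ("Nov-15", 10, "A"),
    ("Dec-15", 12, "A"),
    ("Jan-16", -1, "F"),
    ("Feb-16", 2, "F"),
]
totals = seeded_running_total(rows)
print(totals)  # [('Nov-15', 10), ('Dec-15', 12), ('Jan-16', 11), ('Feb-16', 13)]
```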

Try a recursive CTE.
First, create a subquery that adds a row id.
Then create the base case with rn = 1.
Finally, the recursive step calculates each next level.
WITH addID as (
SELECT [Date], [Group], [Count], [ActForc],
ROW_NUMBER() OVER ( ORDER BY [DATE]) as rn
FROM myTable
), cte_name ( [Date], [Group], [Count], [level] ) AS
(
SELECT [Date], [Group], [Count], 1 as [level]
FROM addID
WHERE rn = 1
UNION ALL
SELECT A.[Date],
A.[Group],
CASE WHEN [ActForc] = 'F' THEN C.[Count] + A.[Count]
ELSE A.[Count]
END AS [Count],
C.[level] + 1
FROM addID A
INNER JOIN cte_name C
ON A.rn = C.[level] + 1
)
SELECT *
FROM cte_name
OUTPUT
| Date | Group | Count | level |
|----------------------------|-------|-------|-------|
| November, 01 2015 00:00:00 | GrpA | 10 | 1 |
| December, 01 2015 00:00:00 | GrpA | 12 | 2 |
| January, 01 2016 00:00:00 | GrpA | 11 | 3 |
| February, 01 2016 00:00:00 | GrpA | 13 | 4 |

SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value

In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities.
I have a database view that unions data from all of these tables giving us the ActivityView that looks like this.
As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.
+----------------------+---------------+---------------------+-------------------+
| activity_client_id | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
| 112 | Interview | 2015-06-01 09:00 | 2015-06-01 11:00 |
| 112 | CV updating | 2015-06-01 09:30 | 2015-06-01 11:30 |
| 112 | Course | 2015-06-02 09:00 | 2015-06-02 16:00 |
| 112 | Interview | 2015-06-03 09:00 | 2015-06-03 10:00 |
+----------------------+---------------+---------------------+-------------------+
Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:
+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
| 112 | 2015-05-20 |
+-----------+---------------------+
I need to create a report that will show the following columns:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible.
So this report will tell us how long from sign up does it take each client to achieve this figure.
What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:
In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30.
For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.
So the report for our sample client would give the following result:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
| 112 | 2015-05-20 | 2015-06-02 |
+-----------+---------------------+--------------------------------------------+
So my question is how can I create the report using a select statement ?
I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table.
But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.
I am using SQL Server 2005.

with tbl as (
-- this generates the merged overlapping time ranges
select distinct
a.id
,(
select min(x.starttime)
from act x
where x.id=a.id and ( x.starttime between a.starttime and a.endtime
or a.starttime between x.starttime and x.endtime )
) start1
,(
select max(x.endtime)
from act x
where x.id=a.id and ( x.endtime between a.starttime and a.endtime
or a.endtime between x.starttime and x.endtime )
) end1
from act a
), tbl2 as
(
-- this will add minute and total minute column
select
*
,datediff(mi,t.start1,t.end1) mi
,(select sum(datediff(mi,x.start1,x.end1)) from tbl x where x.id=t.id and x.end1<=t.end1) totalmi
from tbl t
), tbl3 as
(
-- final query: the start time, and the end time at which 5 hours (300 minutes) is reached; NULL if 5 hours is not yet completed
select
t.id
,min(t.start1) starttime
,min(case when t.totalmi>=300 then t.end1 else null end) endtime
from tbl2 t
group by t.id
)
-- final result
select *
from tbl3
where endtime is not null

This is one way to do it:
;WITH CTErn AS (
SELECT activity_client_id, activity_type,
activity_start_date, activity_end_date,
ROW_NUMBER() OVER (PARTITION BY activity_client_id
ORDER BY activity_start_date) AS rn
FROM activities
),
CTEdiff AS (
SELECT c1.activity_client_id, c1.activity_type,
x.activity_start_date, c1.activity_end_date,
DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id
ORDER BY x.activity_start_date) AS seq
FROM CTErn AS c1
LEFT JOIN CTErn AS c2 ON c1.activity_client_id = c2.activity_client_id -- pair rows of the same client only
AND c1.rn = c2.rn + 1
CROSS APPLY (SELECT CASE
WHEN c1.activity_start_date < c2.activity_end_date
THEN c2.activity_end_date
ELSE c1.activity_start_date
END) x(activity_start_date)
)
SELECT TOP 1 client_id, client_sign_up_date, activity_start_date,
hoursOfActivicty
FROM CTEdiff AS c1
INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id
CROSS APPLY (SELECT SUM(diff) / 60.0
FROM CTEdiff AS c3
WHERE c3.activity_client_id = c1.activity_client_id -- sum within the same client
AND c3.seq <= c1.seq) x(hoursOfActivicty)
WHERE hoursOfActivicty >= 5
ORDER BY seq
Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.
The first CTE, i.e. CTErn, produces the following output:
client_id activity_type start_date end_date rn
112 Interview 2015-06-01 09:00 2015-06-01 11:00 1
112 CV updating 2015-06-01 09:30 2015-06-01 11:30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 4
The second CTE, i.e. CTEdiff, uses the above table expression in order to calculate the time difference for each record, taking into consideration any overlaps with the previous record:
client_id activity_type start_date end_date diff seq
112 Interview 2015-06-01 09:00 2015-06-01 11:00 120 1
112 CV updating 2015-06-01 11:00 2015-06-01 11:30 30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 420 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 60 4
The final query calculates the cumulative sum of the time differences and selects the first record at which the total reaches at least 5 hours of activity.
The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.
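For overlaps that are not that simple (for example an activity fully contained in another), the usual technique is to sort the intervals and merge them in one sweep before accumulating time. The following Python sketch shows that logic on the sample client; it is an illustration of the calculation, not something that runs in SQL Server 2005, and the function names are made up.

```python
from datetime import datetime, timedelta


def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs into disjoint, ordered intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def moment_goal_reached(intervals, goal=timedelta(hours=5)):
    """Moment at which merged activity time reaches `goal`, or None if it never does."""
    remaining = goal
    for start, end in merge_intervals(intervals):
        length = end - start
        if length >= remaining:
            return start + remaining
        remaining -= length
    return None


activities = [
    (datetime(2015, 6, 1, 9, 0), datetime(2015, 6, 1, 11, 0)),   # Interview
    (datetime(2015, 6, 1, 9, 30), datetime(2015, 6, 1, 11, 30)),  # CV updating
    (datetime(2015, 6, 2, 9, 0), datetime(2015, 6, 2, 16, 0)),   # Course
    (datetime(2015, 6, 3, 9, 0), datetime(2015, 6, 3, 10, 0)),   # Interview
]
reached = moment_goal_reached(activities)
print(reached)  # 2015-06-02 11:30:00
```

The first day contributes 150 merged minutes, so the remaining 150 minutes are completed 2.5 hours into the course on 2015-06-02, which agrees with the expected report date.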

A Geometric Approach
For another issue, I've taken a geometric approach to date
packing. Namely, I convert dates and times to a sql geometry
type and utilize geometry::UnionAggregate to merge the ranges.
I don't believe this will work in sql-server 2005. But your
problem was such an interesting puzzle that I wanted to see
whether the geometrical approach would work. So any future
users running into this problem that have access to a later
version can consider it.
Code Description
In 'numbers':
I build a table representing a sequence
Swap it out with your favorite way to make a numbers table.
For a union operation, you won't ever need more rows than in
your original table, so I just use it as the base to build it.
In 'mergeLines':
I convert the dates to floats and use those floats
to create geometrical points.
I then connect these points via STUnion and STEnvelope.
Finally, I merge all these lines via UnionAggregate. The resulting
'lines' geometry object might contain multiple lines, but if they
overlap, they turn into one line.
In 'redate':
I use the numbers CTE to extract the individual lines inside 'lines'.
I envelope the lines which here ensures that the lines are stored
only as its two endpoints.
I read the endpoint x values and convert them back to their time
representations (This is usually the end goal, but you need more).
I calculate the difference in minutes between activity start and
end dates (I do this first in seconds then divide by 60 for the
sake of a precision issue).
I calculate the cumulative sum of these minutes for each row.
In the outer query:
I align the previous cumulative minutes sum with each current row
I filter for the row where the 5hr goal was met but where the
previous minutes shows that the 5hr goal for the previous row
was not met.
I then calculate where in the current row's range the user has
met the 5 hours, to not only arrive at the date the five hour
goal was met, but the exact time.
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from #activities -- where I put your data
),
mergeLines as (
select activity_client_id,
lines = geometry::UnionAggregate(line)
from #activities
cross apply (select
startP = geometry::Point(convert(float,activity_start_date), 0, 0),
stopP = geometry::Point(convert(float,activity_end_date), 0, 0)
) pointify
cross apply (select line = startP.STUnion(stopP).STEnvelope()) lineify
group by activity_client_id
),
redate as (
select client_id = activity_client_id,
activities_start_date,
activities_end_date,
minutes,
rollingMinutes = sum(minutes) over(
partition by activity_client_id
order by activities_start_date
rows between unbounded preceding and current row
)
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
activities_start_date = convert(datetime, l.line.STPointN(1).STX),
activities_end_date = convert(datetime, l.line.STPointN(3).STX)
) unprepare
cross apply (select minutes =
round(datediff(s, activities_start_date, activities_end_date) / 60.0,0)
) duration
)
select client_id,
activities_start_date,
activities_end_date,
met_5hr_goal = dateadd(minute, (60 * 5) - prevRoll, activities_start_date)
from (
select *,
prevRoll = lag(rollingMinutes) over (
partition by client_id
order by rollingMinutes
)
from redate
) ranker
where rollingMinutes >= 60 * 5
and prevRoll < 60 * 5;

Find the min and max dates between multiple sets of dates

Given the following set of data, I'm trying to determine how I can select the start and end dates of the combined date ranges, when they intersect with each other.
For instance, for PartNum 115678, I would want my final result set to display the date ranges 2012/01/01 - 2012/01/19 (rows 1, 2 and 4 combined since the date ranges intersect) and 2012/02/01 - 2012/03/28 (row 3 since this one does not intersect with the range found previously).
For PartNum 213275, I would want to select the only row for that part, 2012/12/01 - 2013/01/01.
Edit:
I'm currently playing around with the following SQL statement, but it's not giving me exactly what I need.
with DistinctRanges as (
select distinct
ha1.PartNum "PartNum",
ha1.StartDt "StartDt",
ha2.EndDt "EndDt"
from dbo.HoldsAll ha1
inner join dbo.HoldsAll ha2
on ha1.PartNum = ha2.PartNum
where
ha1.StartDt <= ha2.EndDt
and ha2.StartDt <= ha1.EndDt
)
select
PartNum,
StartDt,
EndDt
from DistinctRanges
Here are the results of the query shown in the edit:

You're better off having a persisted Calendar table, but if you don't, the CTE below will create one ad hoc. The TOP(36600) part is enough to give you roughly 100 years' worth of dates from the pivot date ('20100101') on the same line.
MS SQL Server 2008 Schema Setup:
create table data (
partnum int,
startdt datetime,
enddt datetime,
age int
);
insert data select
12345, '20120101', '20120116', 15 union all select
12345, '20120115', '20120116', 1 union all select
12345, '20120201', '20120328', 56 union all select
12345, '20120113', '20120119', 6 union all select
88872, '20120201', '20130113', 43;
Query 1:
with Calendar(thedate) as (
select TOP(36600) dateadd(d,row_number() over (order by 1/0),'20100101')
from sys.columns a
cross join sys.columns b
cross join sys.columns c
), tmp as (
select partnum, thedate,
grouper = datediff(d, dense_rank() over (partition by partnum order by thedate), thedate)
from Calendar c
join data d on d.startdt <= c.thedate and c.thedate <= d.enddt
)
select partnum, min(thedate) startdt, max(thedate) enddt
from tmp
group by partnum, grouper
order by partnum, startdt
Results:
| PARTNUM | STARTDT | ENDDT |
------------------------------------------------------------------------------
| 12345 | January, 01 2012 00:00:00+0000 | January, 19 2012 00:00:00+0000 |
| 12345 | February, 01 2012 00:00:00+0000 | March, 28 2012 00:00:00+0000 |
| 88872 | February, 01 2012 00:00:00+0000 | January, 13 2013 00:00:00+0000 |
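To see why the calendar join works, the same steps can be sketched in Python: expand every range into its individual days, deduplicate them per part (the role of dense_rank), and group consecutive days by "day minus rank". This is an illustrative sketch with a made-up function name, using the schema-setup rows above. Like the query, it treats date ranges as inclusive and joins ranges that fall on adjacent days.

```python
from datetime import date, timedelta
from itertools import groupby


def merged_ranges(rows):
    """rows: iterable of (partnum, start, end), inclusive dates. Returns merged ranges."""
    days_by_part = {}
    for partnum, start, end in rows:
        days = days_by_part.setdefault(partnum, set())
        for offset in range((end - start).days + 1):
            days.add(start + timedelta(days=offset))  # the set removes duplicate days
    result = []
    for partnum in sorted(days_by_part):
        ordered = sorted(days_by_part[partnum])
        # Day ordinal minus rank stays constant across a run of consecutive days.
        keyed = [(day.toordinal() - rank, day)
                 for rank, day in enumerate(ordered, start=1)]
        for _, run in groupby(keyed, key=lambda pair: pair[0]):
            run_days = [day for _, day in run]
            result.append((partnum, run_days[0], run_days[-1]))
    return result


data = [
    (12345, date(2012, 1, 1), date(2012, 1, 16)),
    (12345, date(2012, 1, 15), date(2012, 1, 16)),
    (12345, date(2012, 2, 1), date(2012, 3, 28)),
    (12345, date(2012, 1, 13), date(2012, 1, 19)),
    (88872, date(2012, 2, 1), date(2013, 1, 13)),
]
combined = merged_ranges(data)
for partnum, start, end in combined:
    print(partnum, start, end)
```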