islands and gaps tsql - sql-server

I have been struggling with a problem that should be pretty simple actually but after a full week of reading, googling, experimenting and so on, my colleague and we cannot find the proper solution. :(
The problem: We have a table with two values:
an employeenumber (P_ID, int) <--- identification of employee
a date (starttime, datetime) <--- time employee checked in
We need to know what periods each employee has been working.
When two dates are less then #gap days apart, they belong to the same period
For each employee there can be multiple records for any given day but I just need to know which dates he worked, I am not interested in the time part
As soon as there is a gap > #gap days, the next date is considered the start of a new range
A range is at least 1 day (example: 21-9-2011 | 21-09-2011) but has no maximum length. (An employee checking in every #gap - 1 days should result in a period from the first day he checked in until today)
What we think we need are the islands in this table where the gap in days is greater than #variable (#gap = 30 means 30 days)
So an example:
SOURCETABLE:
P_ID | starttime
------|------------------
12121 | 24-03-2009 7:30
12121 | 24-03-2009 14:25
12345 | 27-06-2011 10:00
99999 | 01-05-2012 4:50
12345 | 27-06-2011 10:30
12345 | 28-06-2011 11:00
98765 | 13-04-2012 10:00
12345 | 21-07-2011 9:00
99999 | 03-05-2012 23:15
12345 | 21-09-2011 12:00
45454 | 12-07-2010 8:00
12345 | 21-09-2011 17:00
99999 | 06-05-2012 11:05
99999 | 20-05-2012 12:45
98765 | 26-04-2012 16:00
12345 | 07-07-2012 14:00
99999 | 01-06-2012 13:55
12345 | 13-08-2012 13:00
Now what I need as a result is:
PERIODS:
P_ID | Start | End
-------------------------------
12121 | 24-03-2009 | 24-03-2009
12345 | 27-06-2012 | 21-07-2012
12345 | 21-09-2012 | 21-09-2012
12345 | 07-07-2012 | (today) OR 13-08-2012 <-- (less than #gap days ago) OR (last date in table)
45454 | 12-07-2010 | 12-07-2010
45454 | 17-06-2012 | 17-06-2012
98765 | 13-04-2012 | 26-04-2012
99999 | 01-05-2012 | 01-06-2012
I hope this is clear this way, I already thank you for reading this far, it would be great if you could contribute!

I've done a rough script that should get you started. Haven't bothered refining the datetimes and the endpoint comparisons might need tweaking.
select
P_ID,
src.starttime,
endtime = case when src.starttime <> lst.starttime or lst.starttime < DATEADD(dd,-1 * #gap,GETDATE()) then lst.starttime else GETDATE() end,
frst.starttime,
lst.starttime
from #SOURCETABLE src
outer apply (select starttime = MIN(starttime) from #SOURCETABLE sub where src.p_id = sub.p_id and sub.starttime > DATEADD(dd,-1 * #gap,src.starttime)) frst
outer apply (select starttime = MAX(starttime) from #SOURCETABLE sub where src.p_id = sub.p_id and src.starttime > DATEADD(dd,-1 * #gap,sub.starttime)) lst
where src.starttime = frst.starttime
order by P_ID, src.starttime
I get the following output, which is a litle different to yours, but I think its ok:
P_ID starttime endtime starttime starttime
----------- ----------------------- ----------------------- ----------------------- -----------------------
12121 2009-03-24 07:30:00.000 2009-03-24 14:25:00.000 2009-03-24 07:30:00.000 2009-03-24 14:25:00.000
12345 2011-06-27 10:00:00.000 2011-07-21 09:00:00.000 2011-06-27 10:00:00.000 2011-07-21 09:00:00.000
12345 2011-09-21 12:00:00.000 2011-09-21 17:00:00.000 2011-09-21 12:00:00.000 2011-09-21 17:00:00.000
12345 2012-07-07 14:00:00.000 2012-07-07 14:00:00.000 2012-07-07 14:00:00.000 2012-07-07 14:00:00.000
12345 2012-08-13 13:00:00.000 2012-08-16 11:23:25.787 2012-08-13 13:00:00.000 2012-08-13 13:00:00.000
45454 2010-07-12 08:00:00.000 2010-07-12 08:00:00.000 2010-07-12 08:00:00.000 2010-07-12 08:00:00.000
98765 2012-04-13 10:00:00.000 2012-04-26 16:00:00.000 2012-04-13 10:00:00.000 2012-04-26 16:00:00.000
The last two output cols are the results of the outer apply sections, and are just there for debugging.
This is based on the following setup:
declare #gap int
set #gap = 30
set dateformat dmy
-----P_ID----|----starttime----
declare #SOURCETABLE table (P_ID int, starttime datetime)
insert #SourceTable values
(12121,'24-03-2009 7:30'),
(12121,'24-03-2009 14:25'),
(12345,'27-06-2011 10:00'),
(12345,'27-06-2011 10:30'),
(12345,'28-06-2011 11:00'),
(98765,'13-04-2012 10:00'),
(12345,'21-07-2011 9:00'),
(12345,'21-09-2011 12:00'),
(45454,'12-07-2010 8:00'),
(12345,'21-09-2011 17:00'),
(98765,'26-04-2012 16:00'),
(12345,'07-07-2012 14:00'),
(12345,'13-08-2012 13:00')
UPDATE: Slight rethink. Now uses a CTE to work out the gaps forwards and backwards from each item, then aggregates those:
--Get the gap between each starttime and the next and prev (use 999 to indicate non-closed intervals)
;WITH CTE_Gaps As (
select
p_id,
src.starttime,
nextgap = coalesce(DATEDIFF(dd,src.starttime,nxt.starttime),999), --Gap to the next entry
prevgap = coalesce(DATEDIFF(dd,prv.starttime,src.starttime),999), --Gap to the previous entry
isold = case when DATEDIFF(dd,src.starttime,getdate()) > #gap then 1 else 0 end --Is starttime more than gap days ago?
from
#SOURCETABLE src
cross apply (select starttime = MIN(starttime) from #SOURCETABLE sub where src.p_id = sub.p_id and sub.starttime > src.starttime) nxt
cross apply (select starttime = max(starttime) from #SOURCETABLE sub where src.p_id = sub.p_id and sub.starttime < src.starttime) prv
)
--select * from CTE_Gaps
select
p_id,
starttime = min(gap.starttime),
endtime = nxt.starttime
from
CTE_Gaps gap
--Find the next starttime where its gap to the next > #gap
cross apply (select starttime = MIN(sub.starttime) from CTE_Gaps sub where gap.p_id = sub.p_id and sub.starttime >= gap.starttime and sub.nextgap > #gap) nxt
group by P_ID, nxt.starttime
order by P_ID, nxt.starttime

Jon most definitively has shown us the right direction. Performance was horrible though (4million+ records in the database). And it looked like we were missing some information. With all that we learned from you we came up with the solution below. It uses elements of all the proposed answers and cycles through 3 temptables before finally spewing results but performance is good enough, as well as the data it generates.
declare #gap int
declare #Employee_id int
set #gap = 30
set dateformat dmy
--------------------------------------------------------------- #temp1 --------------------------------------------------
CREATE TABLE #temp1 ( EmployeeID int, starttime date)
INSERT INTO #temp1 ( EmployeeID, starttime)
select distinct ck.Employee_id,
cast(ck.starttime as date)
from SERVER1.DB1.dbo.checkins pd
inner join SERVER1.DB1.dbo.Team t on ck.team_id = t.id
where t.productive = 1
--------------------------------------------------------------- #temp2 --------------------------------------------------
create table #temp2 (ROWNR int, Employeeid int, ENDOFCHECKIN datetime, FIRSTCHECKIN datetime)
INSERT INTO #temp2
select Row_number() OVER (partition by EmployeeID ORDER BY t.prev) + 1 as ROWNR,
EmployeeID,
DATEADD(DAY, 1, t.Prev) AS start_gap,
DATEADD(DAY, 0, t.next) AS end_gap
from
(
select a.EmployeeID,
a.starttime as Prev,
(
select min(b.starttime)
from #temp1 as b
where starttime > a.starttime and b.EmployeeID = a.EmployeeID
) as Next
from #temp1 as a) as t
where datediff(day, prev, next ) > 30
group by EmployeeID,
t.Prev,
t.next
union -- add first known date for Employee
select 1 as ROWNR,
EmployeeID,
NULL,
min(starttime)
from #temp1 ct
group by ct.EmployeeID
--------------------------------------------------------------- #temp3 --------------------------------------------------
create table #temp3 (ROWNR int, Employeeid int, ENDOFCHECKIN datetime, STARTOFCHECKIN datetime)
INSERT INTO #temp3
select ROWNR,
Employeeid,
ENDOFCHECKIN,
FIRSTCHECKIN
from #temp2
union -- add last known date for Employee
select (select count(*) from #temp2 b where Employeeid = ct.Employeeid)+1 as ROWNR,
ct.Employeeid,
(select dateadd(d,1,max(starttime)) from #temp1 c where Employeeid = ct.Employeeid),
NULL
from #temp2 ct
group by ct.EmployeeID
---------------------------------------finally check our data-------------------------------------------------
select a1.Employeeid,
a1.STARTOFCHECKIN as STARTOFCHECKIN,
ENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN a1.ENDOFCHECKIN ELSE b1.ENDOFCHECKIN END,
year(a1.STARTOFCHECKIN) as JaarSTARTOFCHECKIN,
JaarENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN year(a1.ENDOFCHECKIN) ELSE year(b1.ENDOFCHECKIN) END,
Month(a1.STARTOFCHECKIN) as MaandSTARTOFCHECKIN,
MaandENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN month(a1.ENDOFCHECKIN) ELSE month(b1.ENDOFCHECKIN) END,
(year(a1.STARTOFCHECKIN)*100)+month(a1.STARTOFCHECKIN) as JaarMaandSTARTOFCHECKIN,
JaarMaandENDOFCHECKIN = CASE WHEN b1.ENDOFCHECKIN <= a1.STARTOFCHECKIN THEN (year(a1.ENDOFCHECKIN)*100)+month(a1.STARTOFCHECKIN) ELSE (year(b1.ENDOFCHECKIN)*100)+month(b1.ENDOFCHECKIN) END,
datediff(M,a1.STARTOFCHECKIN,b1.ENDOFCHECKIN) as MONTHSCHECKEDIN
from #temp3 a1
full outer join #temp3 b1 on a1.ROWNR = b1.ROWNR -1 and a1.Employeeid = b1.Employeeid
where not (a1.STARTOFCHECKIN is null AND b1.ENDOFCHECKIN is null)
order by a1.Employeeid, a1.STARTOFCHECKIN

Related

Cannot increment values in a T-SQL CTE

I have a case where I need to write a CTE ( at least this seems like the best approach) . I have almost everything I need in place but one last issue. I am using a CTE to generate many millions of a records and then I will insert them into a table. The data itself is almost irrelevant except for three columns. 2 date time columns and one character column.
The idea behind the CTE is this. I want one datetime field called Start and one int field called DataValue. I will have a variable which is the count of records I want to aim for and then another variable which is the number of times I want to repeat the datetime value. I don't think I need to explain the software this data represents but basically I need to have 16 rows where the Start value is the same and then after the 16th run I want to then add 15 minutes and then repeat. Effectively there will be events in 15 minute intervals and I will need X number of rows per 15 minute interval to represent those events.
This is my code
Declare #tot as int;
Declare #inter as int;
Set #tot = 26
Set #inter = 3;
WITH mycte(DataValue,start) AS
(
SELECT 1 DataValue, cast('01/01/2011 00:00:00' as datetime) as start
UNION all
if DataValue % #inter = 0
SELECT
DataValue + 1,
cast(DateAdd(minute,15,start) as datetime)
else
select
DataValue + ,
start
FROM mycte
WHERE DataValue + 1 <= #tot)
select
m.start,
m.start,
m.Datavalue%#inter
from mycte as m
option (maxrecursion 0);
I'll change the select statement into an insert statement once I get it working but the m.DataValue%#inter will make it repeat integer when inserting so the only thing I need is to figure out how to make the start be the same 16 times in a row and then increment
It seems that I cannot have an IF statement in the CTE but I am not sure how to accomplish that but what I was going to do was basically say if the DataValue%16 was 0 then increase the value of start.
In the end I should hopefully have something like this where in this case I only repeat it 4 times
+-----------+-------------------+
| DateValue | start |
+-----------+-------------------+
| 1 | 01/01/01 00:00:00 |
| 2 | 01/01/01 00:00:00 |
| 3 | 01/01/01 00:00:00 |
| 4 | 01/01/01 00:00:00 |
| 5 | 01/01/01 00:15:00 |
| 6 | 01/01/01 00:15:00 |
| 7 | 01/01/01 00:15:00 |
| 8 | 01/01/01 00:15:00 |
Is there another way to accomplish this without conditional statements?
You can use case when as below:
Declare #tot as int;
Declare #inter as int;
Set #tot = 26
Set #inter = 3;
WITH mycte(DataValue,start) AS
(
SELECT 1 DataValue, cast('01/01/2011 00:00:00' as datetime) as start
UNION all
SELECT DataValue+1 [Datavalue],
case when (DataValue % #inter) = 0 then cast(DateAdd(minute,15,start) as datetime) else [start] end [start]
FROM mycte
WHERE (DataValue + 1) <= #tot)
select
m.DataValue,
m.[start]
from mycte as m
option (maxrecursion 0);
This will give the below result
DataValue Start
========= =============
1 2011-01-01 00:00:00.000
2 2011-01-01 00:00:00.000
3 2011-01-01 00:00:00.000
4 2011-01-01 00:15:00.000
5 2011-01-01 00:15:00.000
6 2011-01-01 00:15:00.000
7 2011-01-01 00:30:00.000
8 2011-01-01 00:30:00.000
9 2011-01-01 00:30:00.000
10 2011-01-01 00:45:00.000
11 2011-01-01 00:45:00.000
12 2011-01-01 00:45:00.000
....
26 2011-01-01 02:00:00.000
And if you dont want to use case when you can use double recursive cte as below:-
WITH mycte(DataValue,start) AS
( --this recursive cte will generate the same record the number of #inter
SELECT 1 DataValue, cast('01/01/2011 00:00:00' as datetime) as start
UNION all
SELECT DataValue+1 [DataValue],[start]
FROM mycte
WHERE (DataValue + 1) <= #inter)
,Increments as (
-- this recursive cte will do the 15 additions
select * from mycte
union all
select DataValue+#inter [DataValue]
,DateAdd(minute,15,[start]) [start]
from Increments
WHERE (DataValue + 1) <= #tot
)
select
m.DataValue,
m.[start]
from Increments as m
order by DataValue
option (maxrecursion 0);
it will give the same results.
You can do this with a tally table and some basic math. I'm not sure if your total rows are #tot or should they be #tot * #inter. If so, you just need to change the TOP clause. If you need more rows, you just need to alter the tally table generation.
Declare #tot as int;
Declare #inter as int;
Set #tot = 26
Set #inter = 3;
WITH
E(n) AS(
SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
E2(n) AS(
SELECT a.n FROM E a, E b
),
E4(n) AS(
SELECT a.n FROM E2 a, E2 b
),
cteTally(n) AS(
SELECT TOP( #tot) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
FROM E4
)
SELECT n, DATEADD( MI, 15* ((n-1)/#inter), '20110101')
FROM cteTally;

T-SQL create multiply records from one records

I have a cost record and I would like to create N records from it.
The children records have some different parameters.
For example:
The parents record:
date | amount | duration
20170201 | 5000 | 5 months
The children records:
date | amount | duration
20170301 | 1000 | 1 months
20170401 | 1000 | 1 months
20170501 | 1000 | 1 months
20170601 | 1000 | 1 months
20170701 | 1000 | 1 months
How can I do this without iteration? Without cursor or while?
Following SQL CTE query could be used based on Abdul's solution
/*
Create Table PARENT (PARENT_DATE DATE, PARENT_AMOUNT DECIMAL(18,2),PARENT_MONTH INT)
INSERT INTO PARENT SELECT '20170201',5000 ,5
INSERT INTO PARENT SELECT '20180601',120 ,3
*/
;WITH CTE_CHILD
AS (
SELECT
Parent_Date,
Parent_Amount,
Parent_Month,
DateAdd(Month, 1, Parent_Date) as Child_Date,
Parent_Amount/Parent_Month AS Child_Amount,
1 AS Child_Duration
FROM Parent
UNION ALL
SELECT
Parent_Date,
Parent_Amount,
Parent_Month,
DateAdd(Month, 1, Child_Date) as Child_Date,
Child_Amount,
Child_Duration
FROM CTE_CHILD
WHERE
DateAdd(Month, 1, Child_Date) <= DateAdd(Month, Parent_Month, Parent_Date)
)
SELECT
Child_Date,
Child_Amount,
Child_Duration
FROM CTE_CHILD
assuming you have a table like below:
create table tblRecords ( date int, amount money, duration int);
insert into tblRecords values
(20170201,5000,5),
(20180101,9000,3);
you can use a query like below:
select
date= date + r*100
,amount= amount/duration
,duration =1
from tblRecords
cross apply
(
select top (select duration)
r= row_number() over(order by (select null))
from
sys.objects s1
cross join
sys.objects s2
) h
see working demo
One method is CTE.
DECLARE #PARENT AS TABLE
(PARENT_DATE DATE, PARENT_AMOUNT DECIMAL(18,2),PARENT_MONTH INT)
INSERT INTO #PARENT
SELECT '20170201',5000 ,5
;WITH CTE_CHILD
AS (
SELECT DATEADD(MONTH,1,PARENT_DATE) AS CHILD_DATE
,PARENT_AMOUNT/PARENT_MONTH AS CHILD_AMOUNT
,1 AS CHILD_DURATION
FROM #PARENT
WHERE DATEADD(MONTH,1,PARENT_DATE) <= DATEADD(MONTH,PARENT_MONTH,PARENT_DATE)
UNION ALL
SELECT DATEADD(MONTH,1,CHILD_DATE)
,PARENT_AMOUNT/PARENT_MONTH
,1
FROM CTE_CHILD
INNER JOIN #PARENT ON DATEADD(MONTH,1,CHILD_DATE) <= DATEADD(MONTH,PARENT_MONTH,PARENT_DATE)
)
SELECT * FROM CTE_CHILD
option (maxrecursion 0)
Output:-
CHILD_DATE CHILD_AMOUNT CHILD_DURATION
2017-03-01 1000.0000000000000 1
2017-04-01 1000.0000000000000 1
2017-05-01 1000.0000000000000 1
2017-06-01 1000.0000000000000 1
2017-07-01 1000.0000000000000 1

MSSQL: Create incremental row label per group

In my table, I have a primary key and a date. What I'd like to achieve is to have an incremental label based on whether or not there is a break between the dates - column Goal.
Now, below is an example. The break column was calculated using LEAD function (I thought it might help).
I am able to solve it using T-SQL, but this would be last resort. Nothing I tried has worked so far. I am using MSSQL 2014.
PK | Date | break | Goal |
-------------------------------
1 | 03/2017 | 0 | 1 |
1 | 04/2017 | 0 | 1 |
1 | 08/2017 | 1 | 2 |
1 | 09/2017 | 0 | 2 |
1 | 10/2017 | 0 | 2 |
1 | 02/2018 | 1 | 3 |
1 | 03/2018 | 0 | 3 |
Here is a code to reproduce this example:
CREATE TABLE #test
(
ConsumerId INT,
FullDate DATE,
Goal INT
)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-03-01',1)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-04-01',1)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-08-01',2)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-09-01',2)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-10-01',2)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2018-02-01',3)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2018-03-01',3)
SELECT ConsumerId,
FullDate,
CASE WHEN (datediff(month,
isnull(
LEAD (FullDate,1) OVER (PARTITION BY ConsumerId ORDER BY FullDate DESC),
FullDate),
FullDate) > 1)
THEN 1
ELSE 0
END AS break,
Goal
FROM #test
ORDER BY FullDate ASC
EDIT
This is apparently a famous problem "Islands and gaps" as pointed out in the comments. And Google offers many solutions as well as other questions here at SO.
Try this...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
),
cte_SmearGap AS (
SELECT
tg.ConsumerId, tg.FullDate,
GV = MAX(tg.Gap) OVER (PARTITION BY tg.ConsumerId ORDER BY tg.FullDate ROWS UNBOUNDED PRECEDING)
FROM
cte_TestGap tg
)
SELECT
sg.ConsumerId, sg.FullDate,
GroupValue = DENSE_RANK() OVER (PARTITION BY sg.ConsumerId ORDER BY sg.GV)
FROM
cte_SmearGap sg;
An explanation of the code an how it works...
The 1st query, in cte_TestGap, uses the LAG function along with ROW_NUMBER() function to mark the location of gap in the data. We can see that by breaking it out and looking at it's results...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
)
SELECT * FROM cte_TestGap;
cte_TestGap results...
ConsumerId FullDate Gap
----------- ---------- --------------------
1 2017-03-01 1
1 2017-04-01 0
1 2017-08-01 3
1 2017-09-01 0
1 2017-10-01 0
1 2018-02-01 6
1 2018-03-01 0
At this point we want the 0 values to take on the value of the preceding non-0 values, allowing them to be grouped together. This is done in the 2nd query (cte_SmearGap) using the MAX function with a "window frame". So if we look at the output of cte_SmearGap, we can see that...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
),
cte_SmearGap AS (
SELECT
tg.ConsumerId, tg.FullDate,
GV = MAX(tg.Gap) OVER (PARTITION BY tg.ConsumerId ORDER BY tg.FullDate ROWS UNBOUNDED PRECEDING)
FROM
cte_TestGap tg
)
SELECT * FROM cte_SmearGap;
cte_SmearGap results...
ConsumerId FullDate GV
----------- ---------- --------------------
1 2017-03-01 1
1 2017-04-01 1
1 2017-08-01 3
1 2017-09-01 3
1 2017-10-01 3
1 2018-02-01 6
1 2018-03-01 6
At this point All of the rows are in distinct groups... but... We'd like to have our group numbers in a contiguous sequence (1,2,3) as opposed to (1,3,6).
Of course that's easy enough to fix using the DENSE_Rank() function, which is what's happening in the final select...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
),
cte_SmearGap AS (
SELECT
tg.ConsumerId, tg.FullDate,
GV = MAX(tg.Gap) OVER (PARTITION BY tg.ConsumerId ORDER BY tg.FullDate ROWS UNBOUNDED PRECEDING)
FROM
cte_TestGap tg
)
SELECT
sg.ConsumerId, sg.FullDate,
GroupValue = DENSE_RANK() OVER (PARTITION BY sg.ConsumerId ORDER BY sg.GV)
FROM
cte_SmearGap sg;
The end result...
ConsumerId FullDate GroupValue
----------- ---------- --------------------
1 2017-03-01 1
1 2017-04-01 1
1 2017-08-01 2
1 2017-09-01 2
1 2017-10-01 2
1 2018-02-01 3
1 2018-03-01 3
The comment from David Browne was actually extremely useful. If you google "Islands and Gaps", there are many variations of the solution. Below is the one I liked the most.
In the end, I needed the Goal column to be able to group the dates into MIN/MAX. This solution skips this step and directly creates the aggregated range.
Here is the source.
SELECT MIN(FullDate) AS range_start,
MAX(FUllDate) AS range_end
FROM (
SELECT FullDate,
DATEADD(MM, -1 * ROW_NUMBER() OVER(ORDER BY FullDate), FullDate) AS grp
FROM #test
) a
GROUP BY a.grp
And the output:
range_start | range_end |
--------------------------
2017-03-01 | 2017-04-01 |
2017-08-01 | 2017-10-01 |
2018-02-01 | 2018-03-01 |

How to increment data by week in SQL

I am trying to find out the number of stores that had no sales for the last 4 weeks. Because of the way the grid is structured in DevExpress, I need to show the weeks as Wk1, Wk2, Wk3, Wk4 as columns (the data in these columns would be the store count of stores with zero sales).
Right now my data has the Sale Fiscal Week as a row. What I need it to look like is basically like:
Vendor | Store | City | State | Category | Wk1 | Wk2 | Wk3| Wk4
------------------------------------------------------------------------
Prairie Farms | #16141 | Adrian | MI | 2% Gallon | 1 | 0 | 1 | 1
Prairie Farms | #16141 | Adrian | MI | Whole Gallon | 1 | 1 | 0 | 0
instead of
Vendor | Store | City | State | Category | Sale Cal Wk | Sale Fiscal Wk
--------------------------------------------------------------------------
Prairie Farms | #16141 | Adrian | MI | 2% Gallon | 2015-10-23 | 38
Prairie Farms | #16141 | Adrian | MI | Whole Gallon | 2015-10-23 | 38
I have included my code which works so far. I have a count(store) function as 'ZeroStore' in the final select statement to indicate the store as a Zero Sales store.
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
IF 1=0
BEGIN
SET FMTONLY OFF
END
DECLARE #4WeeksAgo date
DECLARE #Today date
SET #4WeeksAgo = DATEADD(day,-28,getdate())
SET #Today = cast(GETDATE() as date)
DECLARE #maxweek as tinyint
SET #maxweek = case when (SELECT DGFiscalWeek
FROM scans.dbo.DimDate
WHERE CAST(Actualdate as DATE)=dateadd(dd,0,CAST(getdate() as DATE))) = 1 THEN 52 ELSE
(SELECT DGFiscalWeek FROM scans.dbo.DimDate WHERE CAST(Actualdate as DATE)=dateadd(dd,0,CAST(getdate() as DATE)))-1 END
DECLARE #LYear as int
SET #LYear = cast ((select max(FiscalYr) from Scans.dbo.Scans)-1 as varchar(6))
DECLARE #Year as int
SET #Year = #LYear+1
CREATE TABLE #Initialize
(
[Master Vendor] varchar (100),
[Dairy Vendor] varchar (100),
Store float,
City varchar (50),
State varchar(5),
District int,
Category varchar(20),
[Sale Cal Wk] date,
SaleYear int,
[Sale Fiscal Week] tinyint,
Units int,
Sales money,
Grade int
)
INSERT INTO #Initialize
SELECT DISTINCT
d.[Master Vendor], d.[Dairy Vendor], d.[Store],
City, State, District,
'2% Gallon',
CAST(MAX(ActualDate) AS DATE) AS [Sale Cal Wk],
DGFiscalYear,
DGFiscalWeek,
Units = 0,
Sales = 0,
Grade = 0
FROM
Scans.dbo.DGStores s
FULL JOIN
Scans.dbo.DimDate dd ON s.State <> 'XX'
FULL JOIN
Tableau.dbo.DollarGeneralDairyDistributors d ON d.Store = s.Store
WHERE
((DGFiscalYear = #LYear AND DGFiscalWeek >= #maxweek) OR
(DGFiscalYear = #Year AND DGFiscalWeek BETWEEN 1 AND #maxweek))
AND d.Store IN (SELECT Store
FROM DollarGeneralDairyDistributors)
GROUP BY
d.[Master Vendor], d.[Dairy Vendor], District, d.Store,
DGFiscalYear, DGFiscalWeek, City, State
INSERT INTO #Initialize
SELECT DISTINCT
d.[Master Vendor] ,
d.[Dairy Vendor] ,
d.[Store],
City,
[State],
District,
Category = 'Whole Gallon',
cast(max(ActualDate) as DATE) as [Sale Cal Wk],
DGFiscalYear,
DGFiscalWeek,
Units=0,
Sales=0,
Grade=0
FROM
Scans.dbo.DGStores s
FULL JOIN
Scans.dbo.DimDate dd
ON
s.State <> 'XX'
FULL JOIN
Tableau.dbo.DollarGeneralDairyDistributors d
ON
d.Store = s.Store
WHERE
((DGFiscalYear = #LYear AND DGFiscalWeek >= #maxweek) OR (DGFiscalYear = #Year AND DGFiscalWeek BETWEEN 1 AND #maxweek))
AND
d.Store IN
(SELECT Store
FROM DollarGeneralDairyDistributors)
GROUP BY d.[Master Vendor], d.[Dairy Vendor] ,District, d.Store, DGFiscalYear, DGFiscalWeek, City, State
CREATE TABLE #Update
(Store varchar(15),
Category varchar(25),
FiscalYr int,
FiscalWk tinyint,
Units int,
Sales money)
INSERT #Update
SELECT DISTINCT Store, Category, FiscalYr, FiscalWk,
isnull(sum(isnull(SumUnits,0)),0) as Units,
isnull(sum(isnull(SumSales,0)),0) as Sales
FROM scans.dbo.Scans sc
JOIN [Scans].[dbo].DollarGeneralDairyCategory c
ON sc.ItemSku = c.ItemSku
WHERE
FiscalYr >= #LYear
GROUP BY[Store], FiscalYr, FiscalWk, Category
UPDATE #Initialize
SET Units = u.Units,
Sales = u.Sales,
Grade=100
FROM #Update u
WHERE #Initialize.[Sale Fiscal Week] = u.FiscalWk AND #Initialize.SaleYear = u.FiscalYr
AND #Initialize.Store=u.Store AND #Initialize.Category = u.Category
SELECT *
, COUNT(store) AS 'ZeroStore'
FROM #Initialize
GROUP BY [Sale Cal Wk], [Master Vendor], [Dairy Vendor], Store, City, State, District, Category, SaleYear, [Sale Fiscal Week], Units, Sales, Grade
HAVING SUM( Units) = 0 AND [Sale Cal Wk] BETWEEN #4WeeksAgo AND #Today
drop table #Initialize,#Update
END
Thank you very much for any input or help.

Best method to merge two sql tables toegether

So I have two tables. One tracking a a persons location, and one that has the shifts of staff members.
Staff members have a staffId, location, start and end times, and cost of that shift.
People have an eventId, stayId, personId, location, start and end time. A person will have an event with multiple stays.
What I am attempting to do is mesh these two tables together, so that I can accurately report the cost of each location stay, based on the duration of that stay multiplied by the associated cost of staff covering that location at that time.
The issues I have are:
Location stays do not align with staff shifts. i.e. a person might be in location a between 1pm and 2pm, and four staff might be on shifts from 12:30 to 1:30, and two on from 1:30 till 5.
There are a lot of records.
Not all staff are paid the same
My current method is to expand both tables to have a record for every single minute. So a stay that is between 1pm and 2pm will have 60 records, and a staff shift that goes for 5 hours will have 300 records. I can then take all staff that are working on that location at that minute to get a minute value based on the cost of each staff member divided by the duration of their shift, and apply that value to the corresponding record in the other table.
Techniques used:
I create a table with 50,000 numbers, since some stays can be quite
long.
I take the staff table and join onto the numbers table to split each
shift. Then group it together based on location and minute, with a
staff count and minute cost.
The final step, and the one causing issues, is where I take the
location table, join onto numbers, and also onto the modified staff
table to produce a cost for that minute. I also count the number of
people in that location to account for staff covering multiple
people.
I'm finding this process extremely slow as you can imagine, since my person table has about 500 million records when expanded to the minute level, and the staff table has about 35 million when the same thing is done.
Can people suggest a better method for me to use?
Sample data:
Locations
| EventId | ID | Person | Loc | Start | End
| 1 | 987 | 123 | 1 | May, 20 2015 07:00:00 | May, 20 2015 08:00:00
| 1 | 374 | 123 | 4 | May, 20 2015 08:00:00 | May, 20 2015 10:00:00
| 1 | 184 | 123 | 3 | May, 20 2015 10:00:00 | May, 20 2015 11:00:00
| 1 | 798 | 123 | 8 | May, 20 2015 11:00:00 | May, 20 2015 12:00:00
Staff
| Loc | StaffID | Cost | Start | End
| 1 | 99 | 40 | May, 20 2015 04:00:00 | May, 20 2015 12:00:00
| 1 | 15 | 85 | May, 20 2015 03:00:00 | May, 20 2015 5:00:00
| 3 | 85 | 74 | May, 20 2015 18:00:00 | May, 20 2015 20:00:00
| 4 | 10 | 36 | May, 20 2015 06:00:00 | May, 20 2015 14:00:00
Result
| EventId | ID | Person | Loc | Start | End | Cost
| 1 | 987 | 123 | 1 | May, 20 2015 07:00:00 | May, 20 2015 08:00:00 | 45.50
| 1 | 374 | 123 | 4 | May, 20 2015 08:00:00 | May, 20 2015 10:00:00 | 81.20
| 1 | 184 | 123 | 3 | May, 20 2015 10:00:00 | May, 20 2015 11:00:00 | 95.00
| 1 | 798 | 123 | 8 | May, 20 2015 11:00:00 | May, 20 2015 12:00:00 | 14.75
SQL:
Numbers table
;WITH x AS
(
SELECT TOP (224) object_id FROM sys.all_objects
)
SELECT TOP (50000) n = ROW_NUMBER() OVER (ORDER BY x.object_id)
INTO #numbers
FROM x CROSS JOIN x AS y
ORDER BY n
Staff Table
SELECT
Location,
ISNULL(SUM(ROUND(Cost/ CASE WHEN (DateDiff(MINUTE, StartDateTime, EndDateTime)) = 0 THEN 1 ELSE (DateDiff(MINUTE, StartDateTime, EndDateTime)) END, 5)),0) AS MinuteCost,
Count(Name) AS StaffCount,
RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, StartDateTime) + n.n -1, 0)
INTO #temp_StaffRoster
FROM dbo.StaffRoster
Grouping together, and where help is needed I think
INSERT INTO dbo.FinalTable
SELECT [EventId]
,[Id]
,[Start]
,[End]
,event.[Location]
,SUM(ISNULL(MinuteCost,1)/ISNULL(PeopleCount, 1)) AS Cost
,AVG(ISNULL(StaffCount,1)) AS AvgStaff
FROM dbo.Events event WITH (NOLOCK)
INNER JOIN #numbers n ON n.n BETWEEN 0 AND DATEDIFF(MINUTE, Start, End)
LEFT OUTER JOIN #temp_StaffRoster staff WITH (NOLOCK) ON staff.Location= event.Location AND staff.RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
LEFT OUTER JOIN (SELECT [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0) AS Mins, COUNT(Id) as PeopleCount
FROM dbo.Events WITH (NOLOCK)
INNER JOIN #numbers n ON n.n BETWEEN 0 AND DATEDIFF(MINUTE, Start, End)
GROUP BY [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
) cap ON cap.Location= event.LocationAND cap.Mins = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
GROUP BY [EventId]
,[Id]
,[Start]
,[End]
,event.[Location]
UPDATE
So I have two tables. One tracking a a persons location, and one that has the shifts of staff members with their cost. I am attempting to consolidate the two tables to calculate the cost of each location stay.
Here is my method:
;;WITH stay AS
(
SELECT TOP 650000
StayId,
Location,
Start,
End
FROM stg_Stay
WHERE Loction IS NOT NULL -- Some locations don't currently have a matching shift location
ORDER BY Location, ADTM
),
shift AS
(
SELECT TOP 36000000
Location,
ShiftMinute,
MinuteCost,
StaffCount
FROM stg_Shifts
ORDER BY Location, ShiftMinute
)
SELECT
[StayId],
SUM(MinuteCost) AS Cost,
AVG(StaffCount) AS StaffCount
INTO newTable
FROM stay S
CROSS APPLY (SELECT MinuteCost, StaffCount
FROM shift R
WHERE R.Location = S.Location
AND R.ShiftMinute BETWEEN S.Start AND S.End
) AS Shifts
GROUP BY [StayId]
This is where I'm at.
I've split the Shifts table into a minute by minute level since there is no clear alignment of shifts to stays.
stg_Stay contains more columns than needed for this operation. stg_Shift is as shown.
Indexes used on stg_Shifts:
CREATE NONCLUSTERED INDEX IX_Shifts_Loc_Min
ON dbo.stg_Shifts (Location, ShiftMinute)
INCLUDE (MinuteCost, StaffCount);
on stg_Stay
CREATE INDEX IX_Stay_StayId ON dbo.stg_Stay (StayId);
CREATE CLUSTERED INDEX IX_Stay_Start_End_Loc ON dbo.stg_Stay (Location,Start,End);
Due to the fact that Shifts has ~36 million records and Stays has ~650k, what can I do to make this perform better?
Don't break down the rows by minutes.
Staging table may help if you can create fast relationship between them. i.e. the overlapped interval
SELECT *
FROM Locations l
OUTER APPLY -- Assume a staff won't appear in different location in the same period of time, of course.
(
SELECT
CONVERT(decimal(14,2), SUM(CostPerMinute * OverlappedMinutes)) AS ActualCost,
COUNT(DISTINCT StaffId) AS StaffCount,
SUM(OverlappedMinutes) AS StaffMinutes
FROM
(
SELECT
*,
-- Calculate overlapped time in minutes
DATEDIFF(MINUTE,
CASE WHEN StartTime > l.StartTime THEN StartTime ELSE l.StartTime END, -- Get greatest start time
CASE WHEN EndTime > l.EndTime THEN l.EndTime ELSE EndTime END -- Get least end time
) AS OverlappedMinutes,
Cost / DATEDIFF(MINUTE, StartTime, EndTime) AS CostPerMinute
FROM Staff
WHERE LocationId = l.LocationId
AND StartTime <= l.EndTime AND l.StartTime <= EndTime -- Match with overlapped time
) data
) StaffInLoc
SQL Fiddle
Take below with a grain of salt since your naming is horrible.
Location should really be a Stay as i guess location is another table defining an single physical location.
Your Staff table is also badly named. Why not name it Shift. I would expect a staff table to contain stuff like Name, Phone etc. Where a Shift table can contain multiple shifts for the same Staff etc.
Second i think your missing a relation between the two tables.
If you join Location and Staff only on Location and overlapping date times i don't think it would make a whole lot of sense for what your trying to do. How do you know which staff is at any location for a given time? Onlything you can do with location and overlapping dates is assume a entry is in the location table relates to every staff who have a shift at that location within the timeframe. So look at the below more as an inspiration to solving your problems and how to find overlapping datetime intervals and less like an actual solution to your problem since i think your data and model is in a bad shape.
If i got it all wrong please provide Primary Keys and Foreign Keys on your tables and a better explanation.
Some dummy data
DROP TABLE dbo.Location
CREATE TABLE dbo.Location
(
StayId INT,
EventId INT,
PersonId INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 987 ,1 ,123 ,1 ,'2015-05-20T07:00:00','2015-05-20T08:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 374 ,1 ,123 ,4 ,'2015-05-20T08:00:00','2015-05-20T10:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 184 ,1 ,123 ,3 ,'2015-05-20T10:00:00','2015-05-20T11:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 798 ,1 ,123 ,8 ,'2015-05-20T11:00:00','2015-05-20T12:00:00')
DROP TABLE dbo.Staff
CREATE TABLE Staff
(
StaffId INT,
Cost INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 99 ,40 ,1 ,'2015-05-20T04:00:00','2015-05-20T12:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 15 ,85 ,1 ,'2015-05-20T03:00:00','2015-05-20T05:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 85 ,74 ,3 ,'2015-05-20T18:00:00','2015-05-20T20:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 10 ,36 ,4 ,'2015-05-20T06:00:00','2015-05-20T14:00:00')
Actual query
WITH OnLocation AS
(
SELECT
L.StayId, L.EventId, L.LocationId, L.PersonId, S.Cost
, IIF(L.StartTime > S.StartTime, L.StartTime, S.StartTime) AS OnLocationStartTime
, IIF(L.EndTime < S.EndTime, L.EndTime, S.EndTime) AS OnLocationEndTime
FROM dbo.Location L
LEFT JOIN dbo.Staff S
ON S.LocationId = L.LocationId -- TODO are you not missing a join condition on staffid
-- Detects any overlaps between stays and shifts
AND L.StartTime <= S.EndTime AND L.EndTime >= S.StartTime
)
SELECT
*
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) AS DurationMinutes
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) / 60.0 * Cost AS DurationCost
FROM OnLocation D
To get a summary you can take the query and add a GROUP BY for whatever your wan't to summarize.

Resources