T-SQL recursion, date shifting based on previous iteration - sql-server

I have a data set that includes a customer, payment date, and the number of days they have paid for. I need to be calculate the coverage start/end dates that each payment is covering. This is difficult when a payment is made before the current coverage period ends.
The best way I've come up with to think about this would be a month to month cell phone plan where the customer may pay for a specified number of days at any point during a given month. The next covered period should always start the day after the previous covered period expires.
Here is the code sample using a temp table.
CREATE TABLE #Payments
(Customer_ID INTEGER,
Payment_Date DATE,
Days_Paid INTEGER);
INSERT INTO #Payments
VALUES (1,'2018-01-01',30);
INSERT INTO #Payments
VALUES (1,'2018-01-29',20);
INSERT INTO #Payments
VALUES (1,'2018-02-15',30);
INSERT INTO #Payments
VALUES (1,'2018-04-01',30);
I need to get the coverage start/end dates back.
The initial payment is made on 2018-01-01 and they paid for 30 days. That means they are covered until 2018-01-30 (Payment_Date + Paid_Days - 1 since the payment date is included as a covered day). However they made their next payment on 2018-01-29, so I need calculate the start date of the next coverage window, which in this case would be the previous Payment_Date + previous Paid_Days. In this case, coverage window 2 starts on 2018-02-01 and would extend through the 2018-02-19 since they only paid for 20 days on Payment_Date 2018-01-29.
The expected output is:
Customer_ID | Payment_Date | Days_Paid | Coverage_Start_Date | Coverage_End_Date
--------------------------------------------------------------------------------
1 | '2018-01-01'| 30 | '2018-01-01'| '2018-01-30'
1 | '2018-01-29'| 20 | '2018-01-31'| '2018-02-19'
1 | '2018-02-15'| 30 | '2018-02-20'| '2018-03-21'
1 | '2018-04-01'| 30 | '2018-04-01'| '2018-04-30'
Because the current record's coverage start date will depend of the previous record's coverage end date, I feel like this would be a good candidate for recursion, but I can't figure out how to do it.
I have a way to do this in a while loop, but I would like to complete it using a recursive CTE. I have also thought about simply adding up the Days_Paid and adding that to the first payment's start date, however this only works if a payment is made before the previous coverage has expired. In addition, I need to calculate the coverage start/end dates for each Payment_Date.
Finally, using LAG/LEAD functions doesn't appear to work because it does not consider the result of the previous iteration, only the current value of the previous record. Using LAG/LEAD, you get the correct answer for the 2nd payment record, but not the third.
Is there a way to do this with a recursive CTE?

NOTE: This is not a recursive solution, but it is set-based vs. your loop solution.
While trying to solve this recursively it hit me that this is essentially a "running totals" problem, and can be easily solved with window functions.
WITH runningTotal AS
(
SELECT p.*, SUM(Days_Paid) OVER(ORDER BY p.Payment_Date) AS runningTotalDays, MIN(Payment_Date) OVER(ORDER BY p.Payment_Date) startDate
FROM #Payments p
)
SELECT r.Customer_Id, r.Payment_Date,Days_Paid, COALESCE(DATEADD(DAY, LAG(runningTotalDays) OVER(ORDER BY r.Payment_Date) +1, startDate), startDate) AS Coverage_Start_Date, DATEADD(DAY, runningTotalDays, startDate) AS Coverage_End_Date
FROM runningTotal r
Each end date is the "running total" of all the previous Days_Paid added together. Using LAG to get the previous records end date+1 gets you the start date. The COALESCE is to handle the first record. For more than a single customer, you can PARTITION BY Customer_Id.

So of course, right after posting this I came across a similar question that was already answered.
Here's the link: Recursively retrieve LAG() value of previous record
Based on that solution, I was able construct the following solution to my own question.
The key here was adding the "prep_data" CTE which made the recursion problem much easier.
;WITH prep_data AS
(SELECT Customer_ID,
ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY Payment_Date) AS payment_seq_num,
Payment_Date,
Days_Paid,
Payment_Date as Coverage_Start_Date,
DATEADD(DAY,Days_Paid-1,Payment_Date) AS Coverage_End_Date
FROM #Payments),
recursion AS
(SELECT Customer_ID,
payment_seq_num,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM prep_data
WHERE payment_seq_num = 1
UNION ALL
SELECT r.Customer_ID,
p.payment_seq_num,
p.Payment_Date,
p.Days_Paid,
CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END AS Coverage_Start_Date,
DATEADD(DAY,p.Days_Paid-1,CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END) AS Coverage_End_Date
FROM recursion r
JOIN prep_data p ON r.payment_seq_num + 1 =p.payment_seq_num
)
SELECT Customer_ID,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM recursion
ORDER BY payment_seq_num;

Related

How to find difference between dates and find first purchase in an eCommerce database

I am using Microsoft SQL Server Management Studio. I am trying to measure the customer retention rate of an eCommerce site.
For this, I need four values:
customer_id
order_purchase_timestamp
age_by_month
first_purchase
The values of age_by_month and first_purchase are not in my database. I want to calculate them.
In my database, I have customer_id and order_purchase_timestamp.
The first_purchase should be the earliest instance of order_purchase_timestamp. I only want the month and year.
The age_by_month should be the difference of months from first_purchase to order_purchase_timestamp.
I only want to measure the retention of the customer for each month so if two purchases are made in the same month it shouldn't be shown.
the dates are between 2016-10-01 to 2018-09-30. it should be ordered by order_purchase_timestamp
An example
customer_id
order_purchase_timestamp
1
2016-09-04
2
2016-09-05
3
2016-09-05
3
2016-09-15
1
2016-10-04
to
customer_id
first_purchase
age_by_month
order_purchase_timestamp
1
2016-09
0
2016-09-04
2
2016-09
0
2016-09-05
3
2016-09
0
2016-09-05
1
2016-09
1
2016-10-04
What I have done
SELECT
customer_id, order_purchase_timestamp
FROM
orders
WHERE
(order_purchase_timestamp BETWEEN '2016-10-01' AND '2016-12-31')
OR (order_purchase_timestamp BETWEEN '2017-01-01' AND '2017-03-31')
OR (order_purchase_timestamp BETWEEN '2017-04-01' AND '2017-06-30')
OR (order_purchase_timestamp BETWEEN '2017-07-01' AND '2017-09-30')
OR (order_purchase_timestamp BETWEEN '2017-10-01' AND '2017-12-31')
OR (order_purchase_timestamp BETWEEN '2018-01-01' AND '2018-03-31')
OR (order_purchase_timestamp BETWEEN '2018-04-01' AND '2018-06-30')
OR (order_purchase_timestamp BETWEEN '2018-07-01' AND '2018-09-30')
ORDER BY
order_purchase_timestamp
Originally I was going to do it by quarters but I want to do it in months now.
The following approach is designed to be relatively easy to understand. There are other ways (e.g., windowed functions) that may be marginally more efficient; but this makes it easy to maintain at your current SQL skill level.
Note that the SQL commands below build on one another (so the answer is at the end). To follow along, here is a db<>fiddle with the working.
It's based around a simple query (which we'll use as a sub-query) that finds the first order_purchase_timestamp for each customer.
SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
The next thing is DATEDIFF to find the difference between 2 dates.
Then, you can use the above as a subquery to get the first date onto each row - then find the date difference e.g.,
SELECT orders.customer_id,
orders.order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, orders.order_purchase_timestamp) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
Note - DATEDIFF has a 'gotcha' that gets most people but is good for you - when comparing months, it ignores the day component e.g., if finding the difference in months, there is 0 difference in months between 1 Jan and 31 Jan. On the other hand, there will be a difference on 1 month between 31 Jan and 1 Feb. However, I think this is actually what you want!
The above, however, repeats when a customer has multiple purchases within the month (it has one row per purchase). Instead, we can GROUP BY to group by the month it's in, then only take the first purchase for that month.
A 'direct' approach to this would be to group on YEAR(orders.order_purchase_timestamp) AND MONTH(orders.order_purchase_timestamp). However, I use a little trick below - using EOMONTH which finds the last day of the month. EOMONTH returns the same date for any date in that month; therefore, we can group by that.
Finally, you can add the WHERE expression and ORDER BY to get the results you asked for (between the two dates)
SELECT orders.customer_id,
MIN(orders.order_purchase_timestamp) AS order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders AS orders_ref
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
WHERE orders.order_purchase_timestamp BETWEEN '20161001' AND '20180930'
GROUP BY orders.customer_id, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)
ORDER BY order_purchase_timestamp;
Results - note they are different from yours because you wanted the earliest date to be 1/10/2016.
customer_id order_purchase_timestamp first_purchase_date age_by_month
1 2016-10-04 00:00:00.000 2016-09-04 00:00:00.000 1
Edit: Because someone else will do it like this otherwise!
You can do this with a single read-through that will potentially run a little faster. It is also a bit shorter - but harder to understand imo.
The below uses windows functions to calculate both the customer's earliest purchase, and the earliest purchase for each month (and uses DISTINCT rather than a GROUP BY). With that, it just does the DATEDIFF to calculate the difference.
WITH monthly_orders AS
(SELECT DISTINCT orders.customer_id,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id, EOMONTH(orders.order_purchase_timestamp)) AS order_purchase_timestamp,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id) AS first_purchase_date
FROM orders)
SELECT *, DATEDIFF(month, first_purchase_date, order_purchase_timestamp) AS age_by_month
FROM monthly_orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';
Note however this has one difference in the results. If you have 2 orders in a month, and your lowest date filter is between the to (e.g., orders on 15/10 and 20/10, and your minimum date is 16/10) then the row won't be included as the earliest purchase in the month is outside the filter range.
Also beware with both of these and what type of date or datetime field you are using - if you have datetimes rather than just dates, BETWEEN '20161001' AND '20180930' is not the same as >= '20161001' AND < '20181001'
Here is short query that achieves all you want (descriptions of methods used are inline):
declare #test table (
customer_id int,
order_purchase_timestamp date
)
-- some test data
insert into #test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');
select
customer_id,
-- takes care of correct display of first_purchase
format(first_purchase, 'yyyy-MM') first_purchase,
-- used to get the difference in months
datediff(m, first_purchase, order_purchase_timestamp) age_by_month,
order_purchase_timestamp
from (
select
*,
-- window function used to find min value for given column within group
-- for each row
min(order_purchase_timestamp) over (partition by customer_id) first_purchase
from #test
) a

Group dates by considering look ahead days

I am using SQL Server 2017. I have a table Requests, to make things simple, there's only one column RequestDate. For example,
RquestDate
4/11
4/12
4/13
4/16
4/18
I need to group by RequestDate by considering look ahead days. If look ahead day is 0, the result should be the same as raw table.
If look ahead day is 1, it means when I look at 4/11, I need to check if 4/12 exists, if so, group 4/12 into 4/11.
The result is:
4/11 --it groups 4/12
4/13
4/16
4/18
If look ahead day is 2, when looking at 4/11, it groups 4/12, 4/13 into it.
The result is:
4/11 -- group 4/12 and 4/13.
4/16 -- group 4/18
So this problem is different from the typical gap and island problem. Because when group dates, there could be gap there, e.g, when look ahead day is 2, 4/16 groups 4/17 and 4/18.
I tried some ways but can't find a decent solution.
A recursive common table expression could work.
Select start request date using a min() function.
Use that same date as the grouping start date.
Step 1 and 2 make up the recursion anchor / start row.
Recursively go looking for the next request date. This date is higher than the previous date (r.RequestDate > c.RequestDate) and does not have another row
that follow the same criteria before it (not exists ... r2.RequestDate < r.RequestDate).
If the current request date (from step 3) falls within the look ahead interval length, then maintain the grouping start date (then c.RequestGroupDate), otherwise start a new group on the current request date (else r.RequestDate).
Step 3 and 4 make up the recursive part of the CTE.
After the recursion every request date as a corresponding request grouping date. The group by r.RequestGroupDate clause reduces the result output to the distinct values.
Sample data
create table Requests
(
RequestDate date
);
insert into Requests (RequestDate) values
('2021-04-11'),
('2021-04-12'),
('2021-04-13'),
('2021-04-16'),
('2021-04-18');
Solution
declare #lookAhead int = 1; -- look ahead days parameter
with rcte as
(
select min(r.RequestDate) as RequestDate,
min(r.RequestDate) as RequestGroupDate
from Requests r
union all
select r.RequestDate,
case
when datediff(day, c.RequestGroupDate, r.RequestDate) <= #lookAhead
then c.RequestGroupDate
else r.RequestDate
end
from rcte c
join Requests r
on r.RequestDate > c.RequestDate
where not exists ( select 'x'
from Requests r2
where r2.RequestDate > c.RequestDate
and r2.RequestDate < r.RequestDate )
)
select r.RequestGroupDate
from rcte r
group by r.RequestGroupDate;
Result
For #lookAhead = 1:
RequestGroupDate
----------------
2021-04-11
2021-04-13
2021-04-16
2021-04-18
For #lookahead = 2:
RequestGroupDate
----------------
2021-04-11
2021-04-16
Fiddle to see things in action.

Factoring public holidays in to a SQL code

Apologies if this is a simple one. I'm looking for some help with the following:
SELECT *
FROM (
SELECT TOP 7
RIGHT (CONVERT (VARCHAR, CompletedDate, 108), 8) AS Time,
WorkType
FROM Table
WHERE WorkType = 'WorkType1'
OR DATEPART (DW, CompletedDate) IN ('7','1')
AND WorkType = 'WorkType2'
ORDER BY CompletedDate DESC) Table
ORDER BY CompletedDate ASC
Multiple events run every day, and the above searches for the last one scheduled to run each day, and pulls the time from it for the past 7 days. This time marks the end of the day's events, and is the value I'm after.
Events run at a different order on weekends, so I search for a different WorkType. WorkType 1 is unique to weekdays. WorkType2 is run both at weekdays and weekends, however it is not the final event on a weekday, so I don't search for it then.
However, this kind of falls apart when public holidays such as bank holidays come into play, as they use the weekend timings. I still need to capture these times, but the above skips over them. If I were to remove or expand the DATEPART, I would end up with duplicate values for each day that don't mark the final job of the day.
What changes can I make to this to capture these lost holiday timings, without manually going through and checking every holiday? Is there a way that I can return a value for JobType2, if JobType1 does not appear on a day?
I suggest a materialized calendar table with one row per date along with the desired WorkType for that day. That will allow you to simply join on to the calendar table to determine the proper WorkType value without embedding the logic in the query itself.
With this table loaded with all dates for your reporting domain:
CREATE TABLE dbo.WorkTypeCalendar(
CalendarDate date NOT NULL
CONSTRAINT PK_Calendar PRIMARY KEY CLUSTERED
, WorkType varchar(10) NOT NULL
);
GO
The query can be refactored as below:
SELECT *
FROM ( SELECT TOP 7
RIGHT(CONVERT (varchar, CompletedDate, 108), 8) AS Time
, WorkType
FROM Table1 AS t
JOIN WorkTypeCalendar AS c ON t.WorkType = c.WorkType
AND t.CompletedDate >= c.CalendarDate
AND t.CompletedDate < DATEADD(DAY,
1,
c.CalendarDate)
ORDER BY CompletedDate DESC
) Table1
ORDER BY CompletedDate ASC
You also might consider making this a generalized utility calendar table. See http://www.dbdelta.com/calendar-table-and-datetime-functions/ for an complete example of such a table and script to load US holidays you can adjust for your needs and locale.

Query to create records between two dates

I have a table with the following fields (among others)
TagID
TagType
EventDate
EventType
EventType can be populated with "Passed Inspection", "Failed Inspection" or "Repaired" (there are actually many others, but simplifies to this for my issue)
Tags can go many months between a failed inspection and the ultimate repair... in this state they are deemed to be "awaiting repair". Tags are still inspected each month even after they have been identified as having failed. (and just to be clear, a “failed inspection” doesn’t mean the item being inspected doesn’t work at all… it still works, just not at 100% capacity…which is why we still do inspections on it).
I need to create a query that counts, by TagType, Month and Year the number of Tags that are awaiting repair. The end result table would look like this, for example
TagType EventMonth EventYear CountofTagID
xyz 1 2011 3
abc 1 2011 2
xyz 2 2011 2>>>>>>>>>>>>indicating a repair had been made since 1/2011
abc 2 2011 2
and so on
The "awaiting repair" status should be assessed on the last day of the month
This is totally baffling me...
One thought that I had was to develop a query that returned:
TagID,
TagType,
FailedInspectionDate, and
NextRepairDate,
then try and do something that stepped thru the months in between the two dates, but that seems wildly inefficient.
Any help would be much appreciated.
Update
A little more research, and a break from the problem to think about it differently gave me the following approach. I'm sure its not efficient or elegant, but it works. Comments to improve would be appreciated.
declare #counter int
declare #FirstRepair date
declare #CountMonths as int
set #FirstRepair = (<Select statement to find first repair across all records>)
set #CountMonths = (<select statement to find the number of months between the first repair across all records and today>)
--clear out the scratch table
delete from dbo.tblMonthEndDate
set #counter=0
while #counter <=#CountMonths --fill the scratch table with the date of the last day of every month from the #FirstRepair till today
begin
insert into dbo.tblMonthEndDate(monthenddate) select dbo.lastofmonth(dateadd(m,#counter, #FirstRepair))
set #counter = #counter+1
end
--set up a CTE to get a cross join between the scratch table and the view that has the associated first Failed Inspection and Repair
;with Drepairs_CTE (FacilityID, TagNumber, CompType, EventDate)
AS
(
SELECT dbo.vwDelayedRepairWithRepair.FacilityID, dbo.vwDelayedRepairWithRepair.TagNumber, dbo.vwDelayedRepairWithRepair.CompType,
dbo.tblMonthEndDate.MonthEndDate
FROM dbo.vwDelayedRepairWithRepair INNER JOIN
dbo.tblMonthEndDate ON dbo.vwDelayedRepairWithRepair.EventDate <= dbo.tblMonthEndDate.MonthEndDate AND
dbo.vwDelayedRepairWithRepair.RepairDate >= dbo.tblMonthEndDate.MonthEndDate
)
--use the CTE to build the final table I want
Select FacilityID, CompType, Count(TagNumber), MONTH(EventDate), YEAR(EventDate), 'zzz' as EventLabel
FROM Drepairs_CTE
GROUP BY FacilityID, CompType, MONTH(EventDate), YEAR(EventDate)`
Result set ultimately looks like this:
FacilityID CompType Count Month Year Label
1 xyz 2 1 2010 zzz
1 xyz 1 2 2010 zzz
1 xyz 1 7 2009 zzz
Here is a recursive CTE which generates table of last dates of months in interval starting with minimum date in repair table and ending with maximum date.
;with tableOfDates as (
-- First generation returns last day of month of first date in repair database
-- and maximum date
select dateadd (m, datediff (m, 0, min(eventDate)) + 1, 0) - 1 startDate,
max(eventDate) endDate
from vwDelayedRepairWithRepair
union all
-- Last day of next month
select dateadd (m, datediff (m, 0, startDate) + 2, 0) - 1,
endDate
from tableOfDates
where startDate <= endDate
)
select *
from tableOfDates
-- If you change the CTE,
-- Set this to reasonable number of months
-- to prevent recursion problems. 0 means no limit.
option (maxrecursion 0)
EndDate column from tableOfDates is to be ignored, as it serves as upper bound only. If you create UDF which returns all the dates in an interval, omit endDate in select list or remove it from CTE and replace with a parameter.
Sql Fiddle playground is here.

T-SQL - Getting most recent date and most recent future date

Assume the table of records below
ID Name AppointmentDate
-- -------- ---------------
1 Bob 1/1/2010
1 Bob 5/1/2010
2 Henry 5/1/2010
2 Henry 8/1/2011
3 John 8/1/2011
3 John 12/1/2011
I want to retrieve the most recent appointment date by person. So I need a query that will give the following result set.
1 Bob 5/1/2010 (5/1/2010 is most recent)
2 Henry 8/1/2011 (8/1/2011 is most recent)
3 John 8/1/2011 (has 2 future dates but 8/1/2011 is most recent)
Thanks!
Assuming that where you say "most recent" you mean "closest", as in "stored date is the fewest days away from the current date and we don't care if it's before or after the current date", then this should do it (trivial debugging might be required):
SELECT ID, Name, AppointmentDate
from (select
ID
,Name
,AppointmentDate
,row_number() over (partition by ID order by abs(datediff(dd, AppointmentDate, getdate()))) Ranking
from MyTable) xx
where Ranking = 1
This usese the row_number() function from SQL 2005 and up. The subquery "orders" the data as per the specifications, and the main query picks the best fit.
Note also that:
The search is based on the current date
We're only calculating difference in days, time (hours, minutes, etc.) is ignored
If two days are equidistant (say, 2 before and 2 after), we pick one randomly
All of which could be adjusted based on your final requirements.
(Phillip beat me to the punch, and windowing functions are an excellent choice. Here's an alternative approach:)
Assuming I correctly understand your requirement as getting the date closest to the present date, whether in the past or future, consider this query:
SELECT t.Name, t.AppointmentDate
FROM
(
SELECT Name, AppointmentDate, ABS(DATEDIFF(d, GETDATE(), AppointmentDate)) AS Distance
FROM Table
) t
JOIN
(
SELECT Name, MIN(ABS(DATEDIFF(d, GETDATE(), AppointmentDate))) AS MinDistance
FROM Table
GROUP BY Name
) d ON t.Name = d.Name AND t.Distance = d.MinDistance

Resources