T-SQL Grouping Dynamic Date Ranges - sql-server

Using MS SQL Server 2019
I have a set of recurring donation records. Each has a First Gift Date and a Last Gift Date associated with it. I need to add a GroupedID to these rows so that I can get the full date range, from the earliest FirstGiftDate to the latest LastGiftDate, as long as there is no break of more than 45 days between the recurring donations.
For example, Bob is a long-time supporter. His card has expired multiple times and he has always started new gifts within 45 days. All of his gifts need to be given a single grouped ID. On the other hand, June has been donating and her card expires. She doesn't give again for 6 months, but then continues to give after her card expires a second time. June's first gift should get its own "GroupedID" and the second and third should be grouped together. The grouping count should restart with each donor.
My initial attempt was to join the donation table back to itself aliased as D2. This did work to give me an indicator of which ones were within the 45 day mark, but I can't wrap my head around how to then link them. My only thought was to use LEAD and LAG to analyze each scenario and figure out the different combinations of LEAD and LAG values needed to catch each one, but that doesn't seem as reliable or scalable as I'd like it to be.
I appreciate any help anyone can give.
My code:
SELECT #Donation.*, D2.*
FROM #Donation
LEFT JOIN #Donation D2 ON #Donation.RecurringGiftID <> D2.RecurringGiftID
AND #Donation.Donor = D2.Donor
AND ABS(DATEDIFF(DAY, #Donation.FirstGiftDate, D2.LastGiftDate)) < 45
Table structure and sample data:
CREATE TABLE #Donation
(
RecurringGiftID int,
Donor nvarchar(25),
FirstGiftDate date,
LastGiftDate date
)
INSERT INTO #Donation
VALUES (1, 'Bob', '2017-02-15', '2018-07-01'),
(15, 'Bob', '2018-08-05', '2019-04-01'),
(32, 'Bob', '2019-04-15', '2022-06-15'),
(54, 'June', '2015-05-01', '2016-05-01'),
(96, 'June', '2016-12-15', '2018-02-01'),
(120, 'June', '2018-03-04', '2020-07-01')
Desired output:
RecurringGiftId Donor FirstGiftDate LastGiftDate GroupedID
1 Bob 2017-02-15 2018-07-01 1
15 Bob 2018-08-05 2019-04-01 1
32 Bob 2019-04-15 2022-06-15 1
54 June 2015-05-01 2016-05-01 1
96 June 2016-12-15 2018-02-01 2
120 June 2018-03-04 2020-07-01 2

Use LAG() to detect when the current row's FirstGiftDate is more than 45 days after the previous row's LastGiftDate, then take a cumulative sum of those flags to form the required GroupedID:
select *,
GroupedID = sum(g) over (partition by Donor order by FirstGiftDate)
from
(
select *,
g = case when datediff(day,
lag(LastGiftDate, 1, '19000101') over (partition by Donor
order by FirstGiftDate),
FirstGiftDate)
> 45
then 1
else 0
end
from #Donation
) d
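Since the question's end goal is the full date range per group, the same flag-and-sum idea can be followed by an aggregate. This is a minimal sketch, not part of the original answer: the table variable @Donation is a stand-in for #Donation, and the output column names RangeStart, RangeEnd and GiftCount are made up for illustration.

```sql
-- @Donation is a hypothetical stand-in for #Donation, loaded with the question's rows.
DECLARE @Donation table
(
    RecurringGiftID int,
    Donor nvarchar(25),
    FirstGiftDate date,
    LastGiftDate date
);

INSERT INTO @Donation
VALUES (1,   'Bob',  '2017-02-15', '2018-07-01'),
       (15,  'Bob',  '2018-08-05', '2019-04-01'),
       (32,  'Bob',  '2019-04-15', '2022-06-15'),
       (54,  'June', '2015-05-01', '2016-05-01'),
       (96,  'June', '2016-12-15', '2018-02-01'),
       (120, 'June', '2018-03-04', '2020-07-01');

WITH flagged AS
(
    -- g = 1 when the gap since the previous gift's end exceeds 45 days.
    -- The LAG default of 1900-01-01 makes the first row of each donor start group 1.
    SELECT *,
           CASE WHEN DATEDIFF(DAY,
                              LAG(LastGiftDate, 1, '19000101')
                                  OVER (PARTITION BY Donor ORDER BY FirstGiftDate),
                              FirstGiftDate) > 45
                THEN 1 ELSE 0
           END AS g
    FROM @Donation
),
grouped AS
(
    -- Running total of the flags gives the group number per donor.
    SELECT *,
           SUM(g) OVER (PARTITION BY Donor ORDER BY FirstGiftDate
                        ROWS UNBOUNDED PRECEDING) AS GroupedID
    FROM flagged
)
SELECT Donor,
       GroupedID,
       MIN(FirstGiftDate) AS RangeStart,
       MAX(LastGiftDate)  AS RangeEnd,
       COUNT(*)           AS GiftCount
FROM grouped
GROUP BY Donor, GroupedID
ORDER BY Donor, GroupedID;
-- Bob   1  2017-02-15  2022-06-15  3
-- June  1  2015-05-01  2016-05-01  1
-- June  2  2016-12-15  2020-07-01  2
```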

Related

How to collect all differences in rows between two periods?

I'm trying to see the difference between the two periods for a column.
For example, we see that sales decreased at the end of the month, and we need to see which products were not sold at the end of the month?
I can create SELECT to see quantity for each product for each period:
SELECT product_id, count(product_id) AS Count
FROM testDB
WHERE
sales_date IS NOT NULL
AND
delivery_date BETWEEN '2021-02-01 00:00:03.0000000' AND '2021-02-14 23:56:00.0000000'
GROUP BY
product_id
and the same SELECT with another period:
delivery_date BETWEEN '2021-02-14 00:00:03.0000000' AND '2021-02-28 23:56:00.0000000'
So, after these queries I see a list of 10 products with quantities for the first period and a list of 7 products with quantities for the second period. I can't get the difference between the lists returned by the two SELECTs. I tried to use != and NOT IN, but without any results.
I will be very grateful for your help. Thanks
Sorry for the confusion. I meant the difference between the two selects:
The result of the first one (for first period):
Product_ID Count
grapes. 100
lime. 13
lemon. 15
cherry. 222
blueberry. 123
banana. 1
apple. 123
watermelon 56
and second one (for second period):
Product_ID Count
grapes. 10
lime. 1
lemon. 10
cherry. 2
blueberry. 13
banana. 12
and I want to see the difference between these selects:
Product_ID Count
apple. 0
watermelon. 0
So we did not sell any apples and watermelons in second period.
SELECT product_id, count(product_id) AS Count,delivery_date-sales_date as DIFFERENCE
FROM testDB
WHERE
sales_date IS NOT NULL
AND
delivery_date BETWEEN '2021-02-01 00:00:03.0000000' AND '2021-02-14 23:56:00.0000000'
GROUP BY
product_id
This should work for getting the difference between the 2 period columns.
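To list the products that were sold in the first period but not in the second, one option is an anti-join with NOT EXISTS. The following is only a sketch under assumptions: the table variable @testDB and its three sample rows are invented, the column types are guessed from the question, and the periods are written as half-open ranges (which avoids the overlap on 2021-02-14 in the question's BETWEEN filters).

```sql
-- Hypothetical sample data; column names follow the question, types are assumed.
DECLARE @testDB table
(
    product_id varchar(20),
    sales_date datetime2,
    delivery_date datetime2
);

INSERT INTO @testDB
VALUES ('apple',  '2021-02-02', '2021-02-03'),
       ('grapes', '2021-02-02', '2021-02-03'),
       ('grapes', '2021-02-20', '2021-02-21');

SELECT p1.product_id, 0 AS [Count]
FROM
(
    -- Products sold in the first period.
    SELECT DISTINCT product_id
    FROM @testDB
    WHERE sales_date IS NOT NULL
      AND delivery_date >= '20210201' AND delivery_date < '20210215'
) AS p1
WHERE NOT EXISTS
(
    -- Keep only products with no sale in the second period.
    SELECT 1
    FROM @testDB AS p2
    WHERE p2.product_id = p1.product_id
      AND p2.sales_date IS NOT NULL
      AND p2.delivery_date >= '20210215' AND p2.delivery_date < '20210301'
);
-- apple  0
```

EXCEPT between the two product_id lists would give the same set; NOT EXISTS is used here because it makes it easy to add the constant 0 count column.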

Grouping and counting rows by value until next row differs by more than a specified value

The query I am trying to produce is very similar to this, but instead of counting where it changes, I need to count up to the point the difference exceeds a specific value.
I have tried setting a flag where the date difference is more than 14 from the previous, but that then gives 2 records - one for when the difference is greater than 14, and 1 for the rest. As the ID can appear multiple times with different dates, I cannot then group that result.
e.g.
Data:
ID Date
1A 2020-01-01
1A 2020-01-03
2B 2020-01-05
1A 2020-02-01
Result set to be:
ID Date Count
1A 2020-01-01 2
2B 2020-01-05 1
1A 2020-02-01 1
the criteria in this case being where the difference between the dates is more than 14 days
Tried:
SELECT [ID], [Date],
case when datediff(dd,lag([Date]) over (partition by ID order by [Date]),[Date])>14
then 1 else 0 end as CaseValue
FROM Table
where [Date]>'2020-02-01'
and
declare @CountValue bigint
SELECT [ID], [Date],
@CountValue=@CountValue+case when datediff(dd,lag([Date]) over (partition by ID order by [Date]),[Date])>14
then 1 else 0 end
FROM Table
where [Date]>'2020-02-01'
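The flag from the first attempt can be turned into a group number with a running SUM, after which an ordinary GROUP BY gives the count per run. A minimal sketch, assuming a table variable @Visits (a made-up name) holding the sample rows and no date filter:

```sql
-- @Visits is a hypothetical stand-in for the question's table.
DECLARE @Visits table (ID varchar(10), [Date] date);

INSERT INTO @Visits
VALUES ('1A', '2020-01-01'),
       ('1A', '2020-01-03'),
       ('2B', '2020-01-05'),
       ('1A', '2020-02-01');

WITH flagged AS
(
    -- 1 when this row is more than 14 days after the previous row for the same ID.
    -- The first row per ID has no previous row (LAG returns NULL), so it gets 0.
    SELECT ID, [Date],
           CASE WHEN DATEDIFF(DAY,
                              LAG([Date]) OVER (PARTITION BY ID ORDER BY [Date]),
                              [Date]) > 14
                THEN 1 ELSE 0
           END AS IsNewGroup
    FROM @Visits
),
grouped AS
(
    -- Running total of the flags: rows in the same run share a GroupNo.
    SELECT ID, [Date],
           SUM(IsNewGroup) OVER (PARTITION BY ID ORDER BY [Date]
                                 ROWS UNBOUNDED PRECEDING) AS GroupNo
    FROM flagged
)
SELECT ID, MIN([Date]) AS [Date], COUNT(*) AS [Count]
FROM grouped
GROUP BY ID, GroupNo
ORDER BY MIN([Date]);
-- 1A  2020-01-01  2
-- 2B  2020-01-05  1
-- 1A  2020-02-01  1
```

Grouping by both ID and GroupNo is what allows the same ID to appear more than once in the result.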

Populating a list of dates without a defined end date - SQL server

I have a list of accounts and their cost which changes every few days.
In this list I only have the start date every time the cost updates to a new one, but no column for the end date.
That means I need to populate a list of dates in which the end date for a specific account and cost is deduced from the next start date of the same account with a new cost.
More or less like that:
Account start date cost
one 1/1/2016 100$
two 1/1/2016 150$
one 4/1/2016 200$
two 3/1/2016 200$
And the result I need would be:
Account date cost
one 1/1/2016 100$
one 2/1/2016 100$
one 3/1/2016 100$
one 4/1/2016 200$
two 1/1/2016 150$
two 2/1/2016 150$
two 3/1/2016 200$
For example, if the cost changed in the middle of the month, then the sample data will only hold two records (one for each unique combination of account, start date and cost), while the results will hold 30 records with the cost for each and every day of the month (15 for the first cost and 15 for the second one). The costs are a given, and there is no need to calculate them (they are inserted manually).
Note the result contains more records because the sample data shows only a start date and an updated cost for that account, as of that date. While the results show the cost for every day of the month.
Any ideas?
Solution is a bit long.
I added an extra date for test purposes:
DECLARE @t table(account varchar(10), startdate date, cost int)
INSERT @t
values
('one','1/1/2016',100),('two','1/1/2016',150),
('one','1/4/2016',200),('two','1/3/2016',200),
('two','1/6/2016',500) -- extra row
;WITH CTE as
( SELECT
row_number() over (partition by account order by startdate) rn,
*
FROM @t
),N(N)AS
(
SELECT 1 FROM(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1))M(N)
),
tally(N) AS -- tally is limited to 1000 days
(
SELECT ROW_NUMBER()OVER(ORDER BY N.N) - 1 FROM N,N a,N b
),GROUPED as
(
SELECT
cte.account, cte.startdate, cte.cost, cte2.cost cost2, cte2.startdate enddate
FROM CTE
JOIN CTE CTE2
ON CTE.account = CTE2.account
and CTE.rn = CTE2.rn - 1
)
-- used DISTINCT to avoid overlapping dates
SELECT DISTINCT
CASE WHEN datediff(d, startdate,enddate) = N THEN cost2 ELSE cost END cost,
dateadd(d, N, startdate) startdate,
account
FROM grouped
JOIN tally
ON datediff(d, startdate,enddate) >= N
Result:
cost startdate account
100 2016-01-01 one
100 2016-01-02 one
100 2016-01-03 one
150 2016-01-01 two
150 2016-01-02 two
200 2016-01-03 two
200 2016-01-04 one
200 2016-01-04 two
200 2016-01-05 two
500 2016-01-06 two
Thank you @t-clausen.dk!
It didn't solve the problem completely, but did direct me in the correct way.
Eventually I used the LEAD function to generate an end date for every cost per account, and then I was able to populate a list of dates based on that idea.
Here's how I generate the end dates:
DECLARE @t table(account varchar(10), startdate date, cost int)
INSERT @t
values
('one','1/1/2016',100),('two','1/1/2016',150),
('one','1/4/2016',200),('two','1/3/2016',200),
('two','1/6/2016',500)
select account
,[startdate]
,DATEADD(DAY, -1, LEAD([Startdate], 1,'2100-01-01') OVER (PARTITION BY account ORDER BY [Startdate] ASC)) AS enddate
,cost
from @t
It returned the expected result:
account startdate enddate cost
one 2016-01-01 2016-01-03 100
one 2016-01-04 2099-12-31 200
two 2016-01-01 2016-01-02 150
two 2016-01-03 2016-01-05 200
two 2016-01-06 2099-12-31 500
Please note that I set the end date of current costs to be some date in the far future which means (for me) that they are currently active.
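The step from these ranges to one row per day is not shown above. One possible way to do it, sketched under assumptions: @ReportEnd is a hypothetical cutoff used instead of the far-future date so the output stays bounded, and the three-digit tally limits each range to 1000 days.

```sql
DECLARE @t table (account varchar(10), startdate date, cost int);

INSERT INTO @t
VALUES ('one', '20160101', 100), ('two', '20160101', 150),
       ('one', '20160104', 200), ('two', '20160103', 200),
       ('two', '20160106', 500);

-- Hypothetical exclusive cutoff: the last generated date is the day before this.
DECLARE @ReportEnd date = '20160107';

WITH ranges AS
(
    -- The next start date of the same account closes the current range.
    SELECT account, startdate, cost,
           LEAD(startdate, 1, @ReportEnd)
               OVER (PARTITION BY account ORDER BY startdate) AS nextstart
    FROM @t
),
digits(n) AS
(
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS v(n)
),
tally(n) AS
(
    -- 0..999
    SELECT a.n + 10 * b.n + 100 * c.n
    FROM digits AS a CROSS JOIN digits AS b CROSS JOIN digits AS c
)
SELECT r.account,
       DATEADD(DAY, t.n, r.startdate) AS [date],
       r.cost
FROM ranges AS r
JOIN tally AS t
  ON t.n < DATEDIFF(DAY, r.startdate, r.nextstart)  -- one row per day in the range
ORDER BY r.account, [date];
```

With the sample rows and this cutoff, each account gets one row per day from 2016-01-01 through 2016-01-06, carrying the cost that was in effect on that day.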

SQL Server: How to get a rolling sum over 3 days for different customers within same table

This is the input table:
Customer_ID Date Amount
1 4/11/2014 20
1 4/13/2014 10
1 4/14/2014 30
1 4/18/2014 25
2 5/15/2014 15
2 6/21/2014 25
2 6/22/2014 35
2 6/23/2014 10
There is information pertaining to multiple customers and I want to get a rolling sum across a 3 day window for each customer.
The solution should be as below:
Customer_ID Date Amount Rolling_3_Day_Sum
1 4/11/2014 20 20
1 4/13/2014 10 30
1 4/14/2014 30 40
1 4/18/2014 25 25
2 5/15/2014 15 15
2 6/21/2014 25 25
2 6/22/2014 35 60
2 6/23/2014 10 70
The biggest issue is that I don't have transactions for each day because of which the partition by row number doesn't work.
The closest example I found on SO was:
SQL Query for 7 Day Rolling Average in SQL Server
but even in that case there were transactions made every day, which accommodated the ROW_NUMBER()-based solutions.
The rownumber query is as follows:
select customer_id, Date, Amount,
Rolling_3_day_sum = CASE WHEN ROW_NUMBER() OVER (partition by customer_id ORDER BY Date) > 2
THEN SUM(Amount) OVER (partition by customer_id ORDER BY Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
END
from #tmp_taml9
order by customer_id
I was wondering if there is way to replace "BETWEEN 2 PRECEDING AND CURRENT ROW" by "BETWEEN [DATE - 2] and [DATE]"
One option would be to use a calendar table (or something similar) to get the complete range of dates and left join your table with that and use the row_number based solution.
Another option that might work (not sure about performance) would be to use an apply query like this:
select customer_id, Date, Amount, coalesce(Rolling_3_day_sum, Amount) Rolling_3_day_sum
from #tmp_taml9 t1
cross apply (
select sum(amount) Rolling_3_day_sum
from #tmp_taml9
where Customer_ID = t1.Customer_ID
and datediff(day, date, t1.date) <= 2 -- current day plus the 2 preceding days
and t1.Date >= date
) o
order by customer_id;
I suspect performance might not be great though.
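The calendar-table option mentioned above could look roughly like the following sketch. It assumes at most one transaction per customer per day, uses a table variable @tx (a made-up name) in place of #tmp_taml9, and generates the calendar on the fly from a 0..999 tally rather than from a real calendar table.

```sql
DECLARE @tx table (Customer_ID int, [Date] date, Amount int);

INSERT INTO @tx
VALUES (1, '20140411', 20), (1, '20140413', 10), (1, '20140414', 30), (1, '20140418', 25),
       (2, '20140515', 15), (2, '20140621', 25), (2, '20140622', 35), (2, '20140623', 10);

WITH bounds AS
(
    SELECT Customer_ID, MIN([Date]) AS FirstDate, MAX([Date]) AS LastDate
    FROM @tx
    GROUP BY Customer_ID
),
digits(n) AS
(
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS v(n)
),
tally(n) AS
(
    SELECT a.n + 10 * b.n + 100 * c.n
    FROM digits AS a CROSS JOIN digits AS b CROSS JOIN digits AS c
),
calendar AS
(
    -- Every day between each customer's first and last transaction.
    SELECT b.Customer_ID, DATEADD(DAY, t.n, b.FirstDate) AS [Date]
    FROM bounds AS b
    JOIN tally AS t ON t.n <= DATEDIFF(DAY, b.FirstDate, b.LastDate)
),
filled AS
(
    -- With no gaps in the dates, "2 preceding rows" means "2 preceding days".
    SELECT c.Customer_ID, c.[Date], x.Amount,
           SUM(COALESCE(x.Amount, 0))
               OVER (PARTITION BY c.Customer_ID ORDER BY c.[Date]
                     ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Rolling_3_Day_Sum
    FROM calendar AS c
    LEFT JOIN @tx AS x
      ON x.Customer_ID = c.Customer_ID AND x.[Date] = c.[Date]
)
SELECT Customer_ID, [Date], Amount, Rolling_3_Day_Sum
FROM filled
WHERE Amount IS NOT NULL   -- drop the filler days again
ORDER BY Customer_ID, [Date];
```

On the sample data this returns the rolling sums 20, 30, 40, 25 for customer 1 and 15, 25, 60, 70 for customer 2, matching the expected table in the question.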

SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value

In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities.
I have a database view that unions data from all of these tables giving us the ActivityView that looks like this.
As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.
+----------------------+---------------+---------------------+-------------------+
| activity_client_id | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
| 112 | Interview | 2015-06-01 09:00 | 2015-06-01 11:00 |
| 112 | CV updating | 2015-06-01 09:30 | 2015-06-01 11:30 |
| 112 | Course | 2015-06-02 09:00 | 2015-06-02 16:00 |
| 112 | Interview | 2015-06-03 09:00 | 2015-06-03 10:00 |
+----------------------+---------------+---------------------+-------------------+
Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:
+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
| 112 | 2015-05-20 |
+-----------+---------------------+
I need to create a report that will show the following columns:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible.
So this report will tell us how long from sign up does it take each client to achieve this figure.
What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:
In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30.
For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.
So the report for our sample client would give the following result:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
| 112 | 2015-05-20 | 2015-06-02 |
+-----------+---------------------+--------------------------------------------+
So my question is how can I create the report using a select statement ?
I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table.
But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.
I am using SQL Server 2005.
See SQL Fiddle here.
with tbl as (
-- this will generate the merged overlapping time ranges
select distinct
a.id
,(
select min(x.starttime)
from act x
where x.id=a.id and ( x.starttime between a.starttime and a.endtime
or a.starttime between x.starttime and x.endtime )
) start1
,(
select max(x.endtime)
from act x
where x.id=a.id and ( x.endtime between a.starttime and a.endtime
or a.endtime between x.starttime and x.endtime )
) end1
from act a
), tbl2 as
(
-- this will add minute and total minute column
select
*
,datediff(mi,t.start1,t.end1) mi
,(select sum(datediff(mi,x.start1,x.end1)) from tbl x where x.id=t.id and x.end1<=t.end1) totalmi
from tbl t
), tbl3 as
(
-- final CTE: start time, and the end time at which 5 hours (300 minutes) were reached, or null if 5 hours were never completed
select
t.id
,min(t.start1) starttime
,min(case when t.totalmi>=300 then t.end1 else null end) endtime
from tbl2 t
group by t.id
)
-- final result
select *
from tbl3
where endtime is not null
This is one way to do it:
;WITH CTErn AS (
SELECT activity_client_id, activity_type,
activity_start_date, activity_end_date,
ROW_NUMBER() OVER (PARTITION BY activity_client_id
ORDER BY activity_start_date) AS rn
FROM activities
),
CTEdiff AS (
SELECT c1.activity_client_id, c1.activity_type,
x.activity_start_date, c1.activity_end_date,
DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id
ORDER BY x.activity_start_date) AS seq
FROM CTErn AS c1
LEFT JOIN CTErn AS c2 ON c1.activity_client_id = c2.activity_client_id
AND c1.rn = c2.rn + 1
CROSS APPLY (SELECT CASE
WHEN c1.activity_start_date < c2.activity_end_date
THEN c2.activity_end_date
ELSE c1.activity_start_date
END) x(activity_start_date)
)
SELECT TOP 1 client_id, client_sign_up_date, activity_start_date,
hoursOfActivicty
FROM CTEdiff AS c1
INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id
CROSS APPLY (SELECT SUM(diff) / 60.0
FROM CTEdiff AS c3
WHERE c3.activity_client_id = c1.activity_client_id
AND c3.seq <= c1.seq) x(hoursOfActivicty)
WHERE hoursOfActivicty >= 5
ORDER BY seq
Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.
Demo here
The first CTE, i.e. CTErn, produces the following output:
client_id activity_type start_date end_date rn
112 Interview 2015-06-01 09:00 2015-06-01 11:00 1
112 CV updating 2015-06-01 09:30 2015-06-01 11:30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 4
The second CTE, i.e. CTEdiff, uses the above table expression to calculate the time difference for each record, taking into consideration any overlap with the previous record:
client_id activity_type start_date end_date diff seq
112 Interview 2015-06-01 09:00 2015-06-01 11:00 120 1
112 CV updating 2015-06-01 11:00 2015-06-01 11:30 30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 420 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 60 4
The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.
The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.
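For nested or chained overlaps, each activity can instead be clipped against the running maximum of all earlier end times, not just the previous row's end. The following is a sketch of that idea, not the answer's method: it assumes SQL Server 2012 or later (window frames are not available in 2005), and @activities is a hypothetical table variable holding the sample rows.

```sql
DECLARE @activities table
(
    activity_client_id int,
    activity_start_date datetime,
    activity_end_date datetime
);

INSERT INTO @activities
VALUES (112, '2015-06-01T09:00:00', '2015-06-01T11:00:00'),
       (112, '2015-06-01T09:30:00', '2015-06-01T11:30:00'),
       (112, '2015-06-02T09:00:00', '2015-06-02T16:00:00'),
       (112, '2015-06-03T09:00:00', '2015-06-03T10:00:00');

WITH ordered AS
(
    -- Latest end time among all earlier activities of the same client.
    SELECT activity_client_id, activity_start_date, activity_end_date,
           MAX(activity_end_date)
               OVER (PARTITION BY activity_client_id
                     ORDER BY activity_start_date, activity_end_date
                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_max_end
    FROM @activities
),
clipped AS
(
    -- Start counting only from the point not already covered by earlier activities.
    SELECT activity_client_id, activity_end_date,
           CASE WHEN prev_max_end > activity_start_date
                THEN prev_max_end ELSE activity_start_date
           END AS eff_start
    FROM ordered
),
measured AS
(
    -- Fully covered activities contribute 0 minutes.
    SELECT activity_client_id, eff_start,
           CASE WHEN activity_end_date > eff_start
                THEN DATEDIFF(MINUTE, eff_start, activity_end_date) ELSE 0
           END AS mins
    FROM clipped
),
running AS
(
    SELECT activity_client_id, eff_start, mins,
           SUM(mins) OVER (PARTITION BY activity_client_id
                           ORDER BY eff_start, mins
                           ROWS UNBOUNDED PRECEDING) AS total_mins
    FROM measured
)
-- The row in which the running total crosses 300 minutes.
SELECT activity_client_id,
       DATEADD(MINUTE, 300 - (total_mins - mins), eff_start) AS completed_5_hours_at
FROM running
WHERE total_mins >= 300
  AND total_mins - mins < 300;
-- 112  2015-06-02 11:30
```

Joining this result to the clients table on client_id would add client_sign_up_date for the report.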
A Geometric Approach
For another issue, I've taken a geometric approach to date
packing. Namely, I convert dates and times to a sql geometry
type and utilize geometry::UnionAggregate to merge the ranges.
I don't believe this will work in sql-server 2005. But your
problem was such an interesting puzzle that I wanted to see
whether the geometrical approach would work. So any future
users running into this problem that have access to a later
version can consider it.
Code Description
In 'numbers':
- I build a table representing a sequence.
- Swap it out with your favorite way to make a numbers table.
- For a union operation, you won't ever need more rows than in your original table, so I just use it as the base to build it.
In 'mergeLines':
- I convert the dates to floats and use those floats to create geometrical points.
- I then connect these points via STUnion and STEnvelope.
- Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.
In 'redate':
- I use the numbers CTE to extract the individual lines inside 'lines'.
- I envelope the lines, which here ensures that each line is stored only as its two endpoints.
- I read the endpoint x values and convert them back to their time representations (this is usually the end goal, but you need more).
- I calculate the difference in minutes between activity start and end dates (I do this first in seconds, then divide by 60, to avoid a precision issue).
- I calculate the cumulative sum of these minutes for each row.
In the outer query:
- I align the previous cumulative minutes sum with each current row.
- I filter for the row where the 5hr goal was met but where the previous minutes shows that the 5hr goal for the previous row was not met.
- I then calculate where in the current row's range the user has met the 5 hours, to not only arrive at the date the five hour goal was met, but the exact time.
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from #activities -- where I put your data
),
mergeLines as (
select activity_client_id,
lines = geometry::UnionAggregate(line)
from #activities
cross apply (select
startP = geometry::Point(convert(float,activity_start_date), 0, 0),
stopP = geometry::Point(convert(float,activity_end_date), 0, 0)
) pointify
cross apply (select line = startP.STUnion(stopP).STEnvelope()) lineify
group by activity_client_id
),
redate as (
select client_id = activity_client_id,
activities_start_date,
activities_end_date,
minutes,
rollingMinutes = sum(minutes) over(
partition by activity_client_id
order by activities_start_date
rows between unbounded preceding and current row
)
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
activities_start_date = convert(datetime, l.line.STPointN(1).STX),
activities_end_date = convert(datetime, l.line.STPointN(3).STX)
) unprepare
cross apply (select minutes =
round(datediff(s, activities_start_date, activities_end_date) / 60.0,0)
) duration
)
select client_id,
activities_start_date,
activities_end_date,
met_5hr_goal = dateadd(minute, (60 * 5) - prevRoll, activities_start_date)
from (
select *,
prevRoll = lag(rollingMinutes, 1, 0) over ( -- default 0 so the first range can qualify too
partition by client_id
order by rollingMinutes
)
from redate
) ranker
where rollingMinutes >= 60 * 5
and prevRoll < 60 * 5;
