I am trying to assign a group number to distinct groups of rows in a dataset whose data changes over time. The changing fields in my example are tran_seq, prog_id, deg_id, cur_id, and enroll_status. When any of those fields differ from the previous row, I need a new grouping number; when they are the same as the prior row, the grouping number should stay the same. When I try ROW_NUMBER(), RANK(), or DENSE_RANK(), I get increasing values within the same group (e.g. the first two rows in the example). I feel I need to ORDER BY start_date as this is temporal data.
+----+----------+---------+--------+--------+---------------+------------+------------+---------+
| | tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date | end_date | desired |
+----+----------+---------+--------+--------+---------------+------------+------------+---------+
| 1 | 1 | 6 | 9 | 3 | ENRL | 2004-08-22 | 2004-12-11 | 1 |
| 2 | 1 | 6 | 9 | 3 | ENRL | 2006-01-10 | 2006-05-06 | 1 |
| 3 | 1 | 6 | 9 | 59 | ENRL | 2006-08-29 | 2006-12-16 | 2 |
| 4 | 2 | 12 | 23 | 45 | ENRL | 2014-01-21 | 2014-05-16 | 3 |
| 5 | 2 | 12 | 23 | 45 | ENRL | 2014-08-18 | 2014-12-05 | 3 |
| 6 | 2 | 12 | 23 | 45 | LOAP | 2015-01-20 | 2015-05-15 | 4 |
| 7 | 2 | 12 | 23 | 45 | ENRL | 2015-08-25 | 2015-12-11 | 5 |
| 8 | 2 | 12 | 23 | 45 | LOAP | 2016-01-12 | 2016-05-06 | 6 |
| 9 | 2 | 12 | 23 | 45 | ENRL | 2016-05-16 | 2016-08-05 | 7 |
| 10 | 2 | 12 | 23 | 45 | LOAJ | 2016-08-23 | 2016-12-02 | 8 |
| 11 | 2 | 12 | 23 | 45 | ENRL | 2017-01-18 | 2017-05-05 | 9 |
| 12 | 2 | 12 | 23 | 45 | ENRL | 2018-01-17 | 2018-05-11 | 9 |
+----+----------+---------+--------+--------+---------------+------------+------------+---------+
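For illustration, my attempts looked something along these lines (simplified, and the table name is just a placeholder), which is exactly where the increasing values come from:

    -- Placeholder table name; this only mirrors the kind of attempt described above.
    -- Within one combination of values ROW_NUMBER keeps incrementing (the first two
    -- rows get 1 and 2), and recurring combinations (the later ENRL rows) are all
    -- lumped into the same partition, so it never yields one stable number per run.
    select *
          ,row_number() over (partition by tran_seq, prog_id, deg_id, cur_id, enroll_status
                              order by start_date) as attempt
    from enrollment_history;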
Once I have grouping numbers, I think I can group by those to get what I'm ultimately after: a timeline of different statuses with start dates and end dates. For the example data above, that would be:
+---+----------+---------+--------+--------+---------------+------------+------------+
| | tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date | end_date |
+---+----------+---------+--------+--------+---------------+------------+------------+
| 1 | 1 | 6 | 9 | 3 | ENRL | 2004-08-22 | 2006-05-06 |
| 2 | 1 | 6 | 9 | 59 | ENRL | 2006-08-29 | 2006-12-16 |
| 3 | 2 | 12 | 23 | 45 | ENRL | 2014-01-21 | 2014-12-05 |
| 4 | 2 | 12 | 23 | 45 | LOAP | 2015-01-20 | 2015-05-15 |
| 5 | 2 | 12 | 23 | 45 | ENRL | 2015-08-25 | 2015-12-11 |
| 6 | 2 | 12 | 23 | 45 | LOAP | 2016-01-12 | 2016-05-06 |
| 7 | 2 | 12 | 23 | 45 | ENRL | 2016-05-16 | 2016-08-05 |
| 8 | 2 | 12 | 23 | 45 | LOAJ | 2016-08-23 | 2016-12-02 |
| 9 | 2 | 12 | 23 | 45 | ENRL | 2017-01-18 | 2018-05-11 |
+---+----------+---------+--------+--------+---------------+------------+------------+
This is a classic XY problem, in that you are asking about an intermediate step towards a different solution rather than about the solution itself.
As you included your overall end goal as a bit of an addendum, however, here is how you can reach it without that intermediate step:
declare @t table(tran_seq int, prog_id int, deg_id int, cur_id int, enroll_status varchar(4), start_date date, end_date date, desired int)
insert into @t values
(1,6,9,3 ,'ENRL','2004-08-22','2004-12-11',1)
,(1,6,9,3 ,'ENRL','2006-01-10','2006-05-06',1)
,(1,6,9,59 ,'ENRL','2006-08-29','2006-12-16',2)
,(2,12,23,45,'ENRL','2014-01-21','2014-05-16',3)
,(2,12,23,45,'ENRL','2014-08-18','2014-12-05',3)
,(2,12,23,45,'LOAP','2015-01-20','2015-05-15',4)
,(2,12,23,45,'ENRL','2015-08-25','2015-12-11',5)
,(2,12,23,45,'LOAP','2016-01-12','2016-05-06',6)
,(2,12,23,45,'ENRL','2016-05-16','2016-08-05',7)
,(2,12,23,45,'LOAJ','2016-08-23','2016-12-02',8)
,(2,12,23,45,'ENRL','2017-01-18','2017-05-05',9)
,(2,12,23,45,'ENRL','2018-01-17','2018-05-11',9)
;
select tran_seq
,prog_id
,deg_id
,cur_id
,enroll_status
,min(start_date) as start_date
,max(end_date) as end_date
from(select *
-- the difference between an overall row number and a per-combination row number
-- is constant within each unbroken run of identical values, so it identifies each group
,row_number() over (order by end_date) - row_number() over (partition by tran_seq,prog_id,deg_id,cur_id,enroll_status order by end_date) as grp
from @t
) AS g
group by tran_seq
,prog_id
,deg_id
,cur_id
,enroll_status
,grp
order by start_date;
Output
+----------+---------+--------+--------+---------------+------------+------------+
| tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date | end_date |
+----------+---------+--------+--------+---------------+------------+------------+
| 1 | 6 | 9 | 3 | ENRL | 2004-08-22 | 2006-05-06 |
| 1 | 6 | 9 | 59 | ENRL | 2006-08-29 | 2006-12-16 |
| 2 | 12 | 23 | 45 | ENRL | 2014-01-21 | 2014-12-05 |
| 2 | 12 | 23 | 45 | LOAP | 2015-01-20 | 2015-05-15 |
| 2 | 12 | 23 | 45 | ENRL | 2015-08-25 | 2015-12-11 |
| 2 | 12 | 23 | 45 | LOAP | 2016-01-12 | 2016-05-06 |
| 2 | 12 | 23 | 45 | ENRL | 2016-05-16 | 2016-08-05 |
| 2 | 12 | 23 | 45 | LOAJ | 2016-08-23 | 2016-12-02 |
| 2 | 12 | 23 | 45 | ENRL | 2017-01-18 | 2018-05-11 |
+----------+---------+--------+--------+---------------+------------+------------+
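If you do still want the explicit grouping number from your question as an intermediate step, here is a minimal sketch of one way to get it, reusing the @t table variable above (it assumes SQL Server 2012+ for LAG/CONCAT, and that none of the tracked columns are NULL, since CONCAT folds NULLs into empty strings): flag each row where any tracked column changes from the previous row, then take a running sum of the flags.

    with flagged as
    (
        select *
              -- 1 when any tracked column differs from the previous row (or this is the first row), else 0
              ,case when concat(tran_seq,'|',prog_id,'|',deg_id,'|',cur_id,'|',enroll_status)
                         = lag(concat(tran_seq,'|',prog_id,'|',deg_id,'|',cur_id,'|',enroll_status))
                             over (order by start_date)
                    then 0
                    else 1
               end as new_grp
        from @t
    )
    select *
          -- running total of the change flags = your "desired" grouping number
          ,sum(new_grp) over (order by start_date rows unbounded preceding) as grouping_number
    from flagged
    order by start_date;

You could then GROUP BY that grouping_number with MIN(start_date) and MAX(end_date) to reach the same final output as above.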
My current issue is that I have a running balance: when one value falls below another, the running balance needs to reset. Not just reset, though, but also pick up another value as its new starting point and continue the balance from there.
Below is the table with data in it:
+-------------+--------+---------------------+-------------------+--------------+-------------+
| Tran_DateSK | Amount | Running_AccountFees | Overlimit_Balance | Restart_Calc | Actual_Calc |
+-------------+--------+---------------------+-------------------+--------------+-------------+
| 20200217 | 39 | 39 | 3867.76 | 0 | 39 |
| 20200217 | 50 | 89 | 3867.76 | 0 | 89 |
| 20200316 | 39 | 128 | 4735.52 | 0 | 128 |
| 20200316 | 50 | 178 | 4735.52 | 0 | 178 |
| 20200324 | 50 | 228 | 2685.52 | 0 | 228 |
| 20200330 | 50 | 278 | 49.52 | 1 | 49.52 |
| 20200415 | 39 | 317 | 49.52 | 1 | 49.52 |
| 20200515 | 39 | 356 | 3917.28 | 0 | 88.52 |
| 20200515 | 50 | 406 | 3917.28 | 0 | 138.52 |
| 20200519 | 50 | 456 | 3467.28 | 0 | 188.52 |
| 20200604 | 50 | 506 | 3017.28 | 0 | 238.52 |
| 20200609 | 50 | 556 | 2167.28 | 0 | 288.52 |
| 20200611 | 50 | 606 | 49.28 | 1 | 49.28 |
| 20200615 | 39 | 645 | 3917.04 | 0 | 88.28 |
| 20200615 | 50 | 695 | 3917.04 | 0 | 138.28 |
| 20200616 | 50 | 745 | 3017.04 | 0 | 188.28 |
| 20200616 | 50 | 795 | 3017.04 | 0 | 238.28 |
| 20200619 | 50 | 845 | 2567.04 | 0 | 288.28 |
| 20200624 | 50 | 895 | 47.04 | 1 | 47.04 |
| 20200715 | 39 | 934 | 47.04 | 1 | 47.04 |
+-------------+--------+---------------------+-------------------+--------------+-------------+
Actual_Calc is the desired outcome and Running_AccountFees is the issue.
Running_AccountFees is the running balance of Amount, and Overlimit_Balance is the test: we need to check that Running_AccountFees isn't greater than Overlimit_Balance.
If it is, take the Overlimit_Balance value as the new balance and start adding Amount on again from there. For example, on 20200330 the running fees would be 278 but Overlimit_Balance is only 49.52, so the balance resets to 49.52; it stays at 49.52 on 20200415 (49.52 + 39 would again exceed the 49.52 limit), and then the 39 charge on 20200515, against a limit of 3917.28, brings it to 88.52.
My query that produced this:
SELECT
[Transaction].ReportDateSK AS 'Tran_DateSK'
,[Transaction].AmountChange/100.00 AS 'Amount'
,SUM([Transaction].AmountChange/100.00)
OVER (PARTITION BY [Transaction].AccountSK
ORDER BY [Transaction].ReportDateSK
ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) AS 'Running_AccountFees'
,[Summary].Overlimit_Balance AS 'Overlimit_Balance'
,CASE
WHEN SUM([Transaction].AmountChange/100.00)
OVER (PARTITION BY [Transaction].AccountSK
ORDER BY [Transaction].ReportDateSK
ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) > [Summary].Overlimit_Balance
THEN 1
ELSE 0
END AS 'Restart_Calc'
,'' AS 'Actual_Calc'
FROM
Fact.[Transaction] [Transaction]
INNER JOIN Fact.AccountSummary [Summary] ON [Summary].DateSK = [Transaction].ReportDateSK
AND [Summary].AccountSK = [Transaction].AccountSK
AND [Summary].[Current] = 1
WHERE IsFeeTransaction = 1
AND [Transaction].AccountSK = 725
AND [Transaction].ReportDateSK BETWEEN 20200217 AND 20200730
I realised that the data in your question is essentially the source data, and I have been able to come up with the below. It isn't exactly pretty, but it produces the correct output. Explanations of how it works are in the comments:
declare @t table(Tran_DateSK int, Amount decimal(10,2), Running_AccountFees int, Overlimit_Balance decimal(10,2), Restart_Calc bit, Actual_Calc decimal(10,2));
insert into @t values
 (20200217,39,39,3867.76,0,39),(20200217,50,89,3867.76,0,89),(20200316,39,128,4735.52,0,128),(20200316,50,178,4735.52,0,178)
,(20200324,50,228,2685.52,0,228),(20200330,50,278,49.52,1,49.52),(20200415,39,317,49.52,1,49.52),(20200515,39,356,3917.28,0,88.52)
,(20200515,50,406,3917.28,0,138.52),(20200519,50,456,3467.28,0,188.52),(20200604,50,506,3017.28,0,238.52),(20200609,50,556,2167.28,0,288.52)
,(20200611,50,606,49.28,1,49.28),(20200615,39,645,3917.04,0,88.28),(20200615,50,695,3917.04,0,138.28),(20200616,50,745,3017.04,0,188.28)
,(20200616,50,795,3017.04,0,238.28),(20200619,50,845,2567.04,0,288.28),(20200624,50,895,47.04,1,47.04),(20200715,39,934,47.04,1,47.04);
with t as
(
select Tran_DateSK
,Amount
-- Check if the Running_AccountFees are over the Overlimit_Balance
,case when sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) > Overlimit_Balance
-- If so, check if the Running_AccountFees in the previous row were also over the Overlimit_Balance
then case when (sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) - Amount) > lag(Overlimit_Balance,1,0) over (order by Tran_DateSK,Amount,Overlimit_Balance)
then 0 -- and in those instances this means multiple Restart_Calcs in a row, so set the Amount to zero as we don't want to increase the fees when calculating the Actual_Calc
else Amount
end
else Amount
end as Amount_Adj
,sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) as Running_AccountFees
,lag(Overlimit_Balance,1,0) over (order by Tran_DateSK,Amount,Overlimit_Balance) as Prev_Overlimit_Balance
,Overlimit_Balance
,case when sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) > Overlimit_Balance
then 1
else 0
end as Restart_Calc
from @t
)
,b as
(
select *
,case when Running_AccountFees > Overlimit_Balance -- If this row is the first in a possible series of balance resets
and sum(Amount_Adj) over (order by Tran_DateSK,Amount,Overlimit_Balance rows between unbounded preceding and 1 preceding) <= Prev_Overlimit_Balance
then Overlimit_Balance -- Take the Overlimit_Balance and subtract the *Adjusted* Running_AccountFees
- sum(Amount_Adj) over (order by Tran_DateSK,Amount,Overlimit_Balance rows between unbounded preceding and 1 preceding)
- Amount_Adj
else 0
end as Reset_Bal
from t
)
select Tran_DateSK
,Amount
,Running_AccountFees
,Overlimit_Balance
,Restart_Calc
-- For each *Adjusted* Running_AccountFees, apply the most negative Reset_Bal value, as this will contain the entire amount that needs to be reset from the current *Adjusted* Running_AccountFees to get the correct Balance_Calc
,sum(Amount_Adj) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding)
+ min(Reset_Bal) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding)
as Balance_Calc
from b
order by Tran_DateSK;
Output
+-------------+--------+---------------------+-------------------+--------------+--------------+
| Tran_DateSK | Amount | Running_AccountFees | Overlimit_Balance | Restart_Calc | Balance_Calc |
+-------------+--------+---------------------+-------------------+--------------+--------------+
| 20200217 | 39.00 | 39.00 | 3867.76 | 0 | 39.00 |
| 20200217 | 50.00 | 89.00 | 3867.76 | 0 | 89.00 |
| 20200316 | 39.00 | 128.00 | 4735.52 | 0 | 128.00 |
| 20200316 | 50.00 | 178.00 | 4735.52 | 0 | 178.00 |
| 20200324 | 50.00 | 228.00 | 2685.52 | 0 | 228.00 |
| 20200330 | 50.00 | 278.00 | 49.52 | 1 | 49.52 |
| 20200415 | 39.00 | 317.00 | 49.52 | 1 | 49.52 |
| 20200515 | 39.00 | 356.00 | 3917.28 | 0 | 88.52 |
| 20200515 | 50.00 | 406.00 | 3917.28 | 0 | 138.52 |
| 20200519 | 50.00 | 456.00 | 3467.28 | 0 | 188.52 |
| 20200604 | 50.00 | 506.00 | 3017.28 | 0 | 238.52 |
| 20200609 | 50.00 | 556.00 | 2167.28 | 0 | 288.52 |
| 20200611 | 50.00 | 606.00 | 49.28 | 1 | 49.28 |
| 20200615 | 39.00 | 645.00 | 3917.04 | 0 | 88.28 |
| 20200615 | 50.00 | 695.00 | 3917.04 | 0 | 138.28 |
| 20200616 | 50.00 | 745.00 | 3017.04 | 0 | 188.28 |
| 20200616 | 50.00 | 795.00 | 3017.04 | 0 | 238.28 |
| 20200619 | 50.00 | 845.00 | 2567.04 | 0 | 288.28 |
| 20200624 | 50.00 | 895.00 | 47.04 | 1 | 47.04 |
| 20200715 | 39.00 | 934.00 | 47.04 | 1 | 47.04 |
+-------------+--------+---------------------+-------------------+--------------+--------------+
Here is the fiddle
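If the window-function gymnastics above feel too opaque, a recursive CTE is a more literal (if slower) way to express "carry the balance forward and reset it when it would exceed the limit". This is only a sketch against the @t variable above, assuming rows are walked in (Tran_DateSK, Amount, Overlimit_Balance) order, the same order the query above uses:

    with ordered as
    (
        select *
              ,row_number() over (order by Tran_DateSK, Amount, Overlimit_Balance) as rn
        from @t
    )
    ,calc as
    (
        -- anchor: first row, capped at its Overlimit_Balance just in case
        select rn, Tran_DateSK, Amount, Overlimit_Balance
              ,cast(case when Amount > Overlimit_Balance then Overlimit_Balance else Amount end as decimal(10,2)) as Balance_Calc
        from ordered
        where rn = 1

        union all

        -- each step: add the next Amount, but reset to that row's Overlimit_Balance when the limit is breached
        select o.rn, o.Tran_DateSK, o.Amount, o.Overlimit_Balance
              ,cast(case when c.Balance_Calc + o.Amount > o.Overlimit_Balance
                         then o.Overlimit_Balance
                         else c.Balance_Calc + o.Amount
                    end as decimal(10,2))
        from calc c
        join ordered o on o.rn = c.rn + 1
    )
    select Tran_DateSK, Amount, Overlimit_Balance, Balance_Calc
    from calc
    order by rn
    option (maxrecursion 0);

Recursive CTEs process one row per iteration, so this only makes sense for modest row counts per account, but it states the reset rule directly instead of reconstructing it from window frames, and for the sample data it reproduces the Actual_Calc column exactly.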
I have two tables:
TABLE_MAIN
+-----+----------+---------+
| id | name | phase |
+-----+----------+---------+
| 101 | Bolt | PHASE 1 |
| 102 | Nut | PHASE 1 |
| 103 | Screw | PHASE 2 |
| 104 | Hex BOLT | PHASE 2 |
| 105 | Rubber | PHASE 3 |
| 106 | Aluminum | PHASE 3 |
| 107 | Slate | PHASE 3 |
| 108 | Pen | PHASE 3 |
| 109 | Pencil | PHASE 3 |
+-----+----------+---------+
TABLE_ERROR
+-----+----------+---------+
| id | name | phase |
+-----+----------+---------+
| 101 | Bolt | PHASE 1 |
| 102 | Needle | PHASE 1 |
| 101 | Bolt | PHASE 3 |
| 102 | Needle | PHASE 3 |
| 104 | Bolt | PHASE 3 |
| 105 | Rubber | PHASE 3 |
| 105 | Plastic | PHASE 3 |
| 106 | Aluminum | PHASE 3 |
| 106 | Steel | PHASE 3 |
| 106 | Cooper | PHASE 3 |
+-----+----------+---------+
Now I'm trying to find, for each phase in table_main, the number of times the PHASE 3 IDs from table_error appear in table_main. If an ID repeats in table_error, each occurrence should be added to the count (e.g. id 106 appears three times in table_error for PHASE 3, so it contributes 3 to PHASE 3's count).
Expected
+---------+-----------------+-------+
| phase | already_present | total |
+---------+-----------------+-------+
| PHASE 1 | 2 | 8 |
| PHASE 2 | 1 | 8 |
| PHASE 3 | 5 | 8 |
+---------+-----------------+-------+
I have tried
SELECT phase, count(*) AS already_present, sum(count(*)) OVER () AS total
FROM table_main
WHERE id IN (
SELECT id
FROM table_error
WHERE phase = 'PHASE 3'
)
GROUP BY phase
But it is giving me,
+---------+-----------------+-------+
| phase | already_present | total |
+---------+-----------------+-------+
| PHASE 1 | 2 | 5 |
| PHASE 2 | 1 | 5 |
| PHASE 3 | 2 | 5 |
+---------+-----------------+-------+
Hope this helps you
select tm.phase, count(tm.id) AS already_present, SUM(count(tm.id)) OVER() as total
from table_main tm
inner join table_error te on te.id = tm.id and te.phase = 'PHASE 3'
group by tm.phase
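The reason your IN version comes up short is that IN only checks whether a matching id exists, so duplicate ids in table_error are collapsed and each table_main row is counted once; the inner join above instead produces one row per matching error row, which is what gets counted. If you also want phases with no matching ids to appear with a zero, a sketch of the same idea written as a correlated count with CROSS APPLY would be:

    select tm.phase
          ,sum(e.cnt) as already_present
          ,sum(sum(e.cnt)) over () as total
    from table_main tm
    cross apply (select count(*) as cnt          -- matches in table_error PHASE 3 for this id (0 when none)
                 from table_error te
                 where te.id = tm.id
                   and te.phase = 'PHASE 3') e
    group by tm.phase;

For your sample data this returns the same 2 / 1 / 5 with a total of 8.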