Lateral flatten two columns without repetition in snowflake - snowflake-cloud-data-platform

I have a query that groups by a two variables to get a total of another. In order to maintain my table structure for later computations I listagg() two other variables to save for the next stage of the query. However, when I attempt to do two later flatten's of the listagg() columns my data is repeated to many times.
Example: my_table
id | list1 | code| list2 | total
--------|-----------------|-----|----------|---
2434166 | 735,768,769,746 | 124 | 21,2,1,6 | 30
select
id,
list1_table.value::int as list1_val,
code,
list2.value::int as list2_val,
total
from my_table
lateral flatten(input=>split(list1, ',')) list1_table,
lateral flatten(input=>split(list2, ',')) list2_table
Result:
id | list1 | code| list2 | total
--------|-----------------|-----|----------|---
2434166 | 768 | 124 | 2 | 30
2434166 | 735 | 124 | 2 | 30
2434166 | 746 | 124 | 2 | 30
2434166 | 769 | 124 | 2 | 30
2434166 | 768 | 124 | 21 | 30
2434166 | 735 | 124 | 21 | 30
2434166 | 746 | 124 | 21 | 30
2434166 | 769 | 124 | 21 | 30
2434166 | 768 | 124 | 6 | 30
2434166 | 735 | 124 | 6 | 30
2434166 | 746 | 124 | 6 | 30
2434166 | 769 | 124 | 6 | 30
2434166 | 768 | 124 | 1 | 30
2434166 | 735 | 124 | 1 | 30
2434166 | 746 | 124 | 1 | 30
2434166 | 769 | 124 | 1 | 30
I understand what is going on but I'm just wonder how do I get my desired result:
id | list1 | code| list2 | total
--------|-----------------|-----|----------|---
2434166 | 768 | 124 | 2 | 30
2434166 | 735 | 124 | 21 | 30
2434166 | 746 | 124 | 6 | 30
2434166 | 769 | 124 | 1 | 30

As you noticed yourself, you want 4 records. There are 2 ways to do it, both exploit the index column produced by flatten, which represents the position of the produced value in the input (see the Flatten Documentation)
Using 2 flattens and index-selection
First way is to take the result of your query, and add these index column, here's an example:
select id,
list1_table.value::int as list1_val, list1_table.index as list1_index, code,
list2_table.value::int as list2_val, list2_table.index as list2_index, total
from my_table,
lateral flatten(input=>split(list1, ',')) list1_table,
lateral flatten(input=>split(list2, ',')) list2_table;
---------+-----------+-------------+------+-----------+-------------+-------+
ID | LIST1_VAL | LIST1_INDEX | CODE | LIST2_VAL | LIST2_INDEX | TOTAL |
---------+-----------+-------------+------+-----------+-------------+-------+
2434166 | 735 | 0 | 124 | 21 | 0 | 30 |
2434166 | 735 | 0 | 124 | 2 | 1 | 30 |
2434166 | 735 | 0 | 124 | 1 | 2 | 30 |
2434166 | 735 | 0 | 124 | 6 | 3 | 30 |
2434166 | 768 | 1 | 124 | 21 | 0 | 30 |
2434166 | 768 | 1 | 124 | 2 | 1 | 30 |
2434166 | 768 | 1 | 124 | 1 | 2 | 30 |
2434166 | 768 | 1 | 124 | 6 | 3 | 30 |
2434166 | 769 | 2 | 124 | 21 | 0 | 30 |
2434166 | 769 | 2 | 124 | 2 | 1 | 30 |
2434166 | 769 | 2 | 124 | 1 | 2 | 30 |
2434166 | 769 | 2 | 124 | 6 | 3 | 30 |
2434166 | 746 | 3 | 124 | 21 | 0 | 30 |
2434166 | 746 | 3 | 124 | 2 | 1 | 30 |
2434166 | 746 | 3 | 124 | 1 | 2 | 30 |
2434166 | 746 | 3 | 124 | 6 | 3 | 30 |
---------+-----------+-------------+------+-----------+-------------+-------+
As you can see, the rows you are interested are the ones with the same index.
So to get your result by selecting these rows after the lateral joins happen:
select id,
list1_table.value::int as list1_val, code,
list2_table.value::int as list2_val, total
from my_table,
lateral flatten(input=>split(list1, ',')) list1_table,
lateral flatten(input=>split(list2, ',')) list2_table
where list1_table.index = list2_table.index;
---------+-----------+------+-----------+-------+
ID | LIST1_VAL | CODE | LIST2_VAL | TOTAL |
---------+-----------+------+-----------+-------+
2434166 | 735 | 124 | 21 | 30 |
2434166 | 768 | 124 | 2 | 30 |
2434166 | 769 | 124 | 1 | 30 |
2434166 | 746 | 124 | 6 | 30 |
---------+-----------+------+-----------+-------+
Using 1 flatten + lookup-by-index
An easier, more efficient, and more flexible way (useful if you have multiple arrays like that or e.g. array indices are related but not 1-to-1) is to flatten only on one array, and then use the index of the produced elements to lookup values in other arrays.
Here's an example:
select id, list1_table.value::int as list1_val, code,
split(list2,',')[list1_table.index]::int as list2_val, -- array lookup here
total
from my_table, lateral flatten(input=>split(list1, ',')) list1_table;
---------+-----------+------+-----------+-------+
ID | LIST1_VAL | CODE | LIST2_VAL | TOTAL |
---------+-----------+------+-----------+-------+
2434166 | 735 | 124 | 21 | 30 |
2434166 | 768 | 124 | 2 | 30 |
2434166 | 769 | 124 | 1 | 30 |
2434166 | 746 | 124 | 6 | 30 |
---------+-----------+------+-----------+-------+
See how we simply use the index produced when flattening list1 to lookup the value from list2

To get elements from multiple arrays with the same index we could use array accessor:
SELECT t.id, t.code, t.total, s.ind,
STRTOK_TO_ARRAY(t.list1, ',')[s.ind]::int AS list1_val,
STRTOK_TO_ARRAY(t.list2, ',')[s.ind]::int AS list2_val
FROM t
,(SELECT ROW_NUMBER() OVER(ORDER BY seq4()) - 1 AS ind
FROM TABLE(GENERATOR(ROWCOUNT => 10))) s -- here up to 10 elements
WHERE list1_val IS NOT NULL
ORDER BY t.id, s.ind;
The idea is to generate tally numbers and then access the elements.
Sample data:
CREATE OR REPLACE TABLE t(id INT, list1 TEXT, code INT, list2 TEXT, total INT) AS
SELECT 6, '735,768,769,746', 124, '21,2,1,6', 30 UNION
SELECT 7, '1,2,3' , 1, '10,20,30', 50;

Related

How to efficiently self-join so the query does not time out?

| id1 | id2 | id3 | id4 | val1 | val2 | age | race | zip |
----------------------------------------------------------
| 11 | 222 | 333 | 44 | 789 | abc | 45 | AA |12345|
| 11 | 222 | 333 | 44 | 567 | def | 45 | AA |12345|
| 11 | 333 | 444 | 44 | 789 | xyz | 30 | AS |23456|
| 22 | 555 | 666 | 77 | 012 | abc | 38 | W |34567|
| 22 | 555 | 666 | 77 | 789 | GHI | 38 | W |34567|
| 34 | 333 | 777 | 99 | 012 | GHI | 75 | W |34567|
I want to get ALL rows for ID1, id2, id3, id4 where val1 is 789. So the output look similar to this:
| id1 | id2 | id3 | id4 | val1 | val2 | age | race | zip |
----------------------------------------------------------
| 11 | 222 | 333 | 44 | 789 | abc | 45 | AA |12345|
| 11 | 222 | 333 | 44 | 567 | def | 45 | AA |12345|
| 11 | 333 | 444 | 44 | 789 | xyz | 30 | AS |23456|
| 22 | 555 | 666 | 77 | 012 | abc | 38 | W |34567|
| 22 | 555 | 666 | 77 | 789 | GHI | 38 | W |34567|
I can get the self-join to work on a small dataset, however since the table I'm working is huge, I want an efficient solution that will not timeout. Here is the query I'm using:
select t1.*
from t t1
join t t2 on
t1.id1=t2.id1 and
t1.id2=t2.id2 and
t1.id3=t2.id3 and
t1.id4=t2.id4
where t1.val1='789'
Again, I want ALL rows of ID1 through ID4 as long as one of the VAL1 value is '789'.
You don't need to join to any tables - what you want is EXISTS:
Select *
From t t1
Where Exists (Select *
From t t2
Where t1.ID1 = t2.ID1
And t2.val1 = 789)

Grouping Data by Changing Status Over Time

I am trying to assign a group number to distinct groups of rows in a dataset that has changing data over time. The changing fields are tran_seq, prog_id, deg-id, cur_id, and enroll_status in my example. When any of those fields are different from the previous row, I need a new grouping number. When the fields are the same as the prior row, then the grouping number should stay the same. When I try ROW_NUMBER(), RANK(), or DENSE_RANK(), I get increasing values for the same group (e.g. the first 2 rows in example). I feel I need to ORDER BY start_date as it is temporal data.
+----+----------+---------+--------+--------+---------------+------------+------------+---------+
| | tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date | end_date | desired |
+----+----------+---------+--------+--------+---------------+------------+------------+---------+
| 1 | 1 | 6 | 9 | 3 | ENRL | 2004-08-22 | 2004-12-11 | 1 |
| 2 | 1 | 6 | 9 | 3 | ENRL | 2006-01-10 | 2006-05-06 | 1 |
| 3 | 1 | 6 | 9 | 59 | ENRL | 2006-08-29 | 2006-12-16 | 2 |
| 4 | 2 | 12 | 23 | 45 | ENRL | 2014-01-21 | 2014-05-16 | 3 |
| 5 | 2 | 12 | 23 | 45 | ENRL | 2014-08-18 | 2014-12-05 | 3 |
| 6 | 2 | 12 | 23 | 45 | LOAP | 2015-01-20 | 2015-05-15 | 4 |
| 7 | 2 | 12 | 23 | 45 | ENRL | 2015-08-25 | 2015-12-11 | 5 |
| 8 | 2 | 12 | 23 | 45 | LOAP | 2016-01-12 | 2016-05-06 | 6 |
| 9 | 2 | 12 | 23 | 45 | ENRL | 2016-05-16 | 2016-08-05 | 7 |
| 10 | 2 | 12 | 23 | 45 | LOAJ | 2016-08-23 | 2016-12-02 | 8 |
| 11 | 2 | 12 | 23 | 45 | ENRL | 2017-01-18 | 2017-05-05 | 9 |
| 12 | 2 | 12 | 23 | 45 | ENRL | 2018-01-17 | 2018-05-11 | 9 |
+----+----------+---------+--------+--------+---------------+------------+------------+---------+
Once I have grouping numbers, I think I can group by those to get what I'm ultimately after: a timeline of different statuses with start dates and end dates. For the example data above, that would be:
+---+----------+---------+--------+--------+---------------+------------+------------+
| | tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date | end_date |
+---+----------+---------+--------+--------+---------------+------------+------------+
| 1 | 1 | 6 | 9 | 3 | ENRL | 2004-08-22 | 2006-05-06 |
| 2 | 1 | 6 | 9 | 59 | ENRL | 2004-08-29 | 2006-12-16 |
| 3 | 2 | 12 | 23 | 45 | ENRL | 2014-01-21 | 2014-12-05 |
| 4 | 2 | 12 | 23 | 45 | LOAP | 2015-01-20 | 2015-05-15 |
| 5 | 2 | 12 | 23 | 45 | ENRL | 2015-08-25 | 2015-12-11 |
| 6 | 2 | 12 | 23 | 45 | LOAP | 2016-01-12 | 2016-05-06 |
| 7 | 2 | 12 | 23 | 45 | ENRL | 2016-05-16 | 2016-08-05 |
| 8 | 2 | 12 | 23 | 45 | LOAJ | 2016-08-23 | 2016-12-02 |
| 9 | 2 | 12 | 23 | 45 | ENRL | 2017-01-17 | 2018-05-06 |
+---+----------+---------+--------+--------+---------------+------------+------------+
This is a classic XY problem, in that you are asking for an intermediate step to a different solution, rather than asking about the solution itself.
As you included your overall end goal as a bit of an addendum however, here is how you can reach that without your intermediate step:
declare #t table(tran_seq int, prog_id int, deg_id int, cur_id int, enroll_status varchar(4), start_date date, end_date date, desired int)
insert into #t values
(1,6,9,3 ,'ENRL','2004-08-22','2004-12-11',1)
,(1,6,9,3 ,'ENRL','2006-01-10','2006-05-06',1)
,(1,6,9,59 ,'ENRL','2006-08-29','2006-12-16',2)
,(2,12,23,45,'ENRL','2014-01-21','2014-05-16',3)
,(2,12,23,45,'ENRL','2014-08-18','2014-12-05',3)
,(2,12,23,45,'LOAP','2015-01-20','2015-05-15',4)
,(2,12,23,45,'ENRL','2015-08-25','2015-12-11',5)
,(2,12,23,45,'LOAP','2016-01-12','2016-05-06',6)
,(2,12,23,45,'ENRL','2016-05-16','2016-08-05',7)
,(2,12,23,45,'LOAJ','2016-08-23','2016-12-02',8)
,(2,12,23,45,'ENRL','2017-01-18','2017-05-05',9)
,(2,12,23,45,'ENRL','2018-01-17','2018-05-11',9)
;
select tran_seq
,prog_id
,deg_id
,cur_id
,enroll_status
,min(start_date) as start_date
,max(end_date) as end_date
from(select *
,row_number() over (order by end_date) - row_number() over (partition by tran_seq,prog_id,deg_id,cur_id,enroll_status order by end_date) as grp
from #t
) AS g
group by tran_seq
,prog_id
,deg_id
,cur_id
,enroll_status
,grp
order by start_date;
Output
+----------+---------+--------+--------+---------------+------------+------------+
| tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date | end_date |
+----------+---------+--------+--------+---------------+------------+------------+
| 1 | 6 | 9 | 3 | ENRL | 2004-08-22 | 2006-05-06 |
| 1 | 6 | 9 | 59 | ENRL | 2006-08-29 | 2006-12-16 |
| 2 | 12 | 23 | 45 | ENRL | 2014-01-21 | 2014-12-05 |
| 2 | 12 | 23 | 45 | LOAP | 2015-01-20 | 2015-05-15 |
| 2 | 12 | 23 | 45 | ENRL | 2015-08-25 | 2015-12-11 |
| 2 | 12 | 23 | 45 | LOAP | 2016-01-12 | 2016-05-06 |
| 2 | 12 | 23 | 45 | ENRL | 2016-05-16 | 2016-08-05 |
| 2 | 12 | 23 | 45 | LOAJ | 2016-08-23 | 2016-12-02 |
| 2 | 12 | 23 | 45 | ENRL | 2017-01-18 | 2018-05-11 |
+----------+---------+--------+--------+---------------+------------+------------+

SQL Server Dynamic Resetting Running Balance

My current issue is that I have a running balance, where one value falls below another the running balance needs to reset. But not only reset, but also use a another value as its starting value and start the balance again.
Below is the table with data in it:
+-------------+--------+---------------------+-------------------+--------------+-------------+
| Tran_DateSK | Amount | Running_AccountFees | Overlimit_Balance | Restart_Calc | Actual_Calc |
+-------------+--------+---------------------+-------------------+--------------+-------------+
| 20200217 | 39 | 39 | 3867.76 | 0 | 39 |
| 20200217 | 50 | 89 | 3867.76 | 0 | 89 |
| 20200316 | 39 | 128 | 4735.52 | 0 | 128 |
| 20200316 | 50 | 178 | 4735.52 | 0 | 178 |
| 20200324 | 50 | 228 | 2685.52 | 0 | 228 |
| 20200330 | 50 | 278 | 49.52 | 1 | 49.52 |
| 20200415 | 39 | 317 | 49.52 | 1 | 49.52 |
| 20200515 | 39 | 356 | 3917.28 | 0 | 88.52 |
| 20200515 | 50 | 406 | 3917.28 | 0 | 138.52 |
| 20200519 | 50 | 456 | 3467.28 | 0 | 188.52 |
| 20200604 | 50 | 506 | 3017.28 | 0 | 238.52 |
| 20200609 | 50 | 556 | 2167.28 | 0 | 288.52 |
| 20200611 | 50 | 606 | 49.28 | 1 | 49.28 |
| 20200615 | 39 | 645 | 3917.04 | 0 | 88.28 |
| 20200615 | 50 | 695 | 3917.04 | 0 | 138.28 |
| 20200616 | 50 | 745 | 3017.04 | 0 | 188.28 |
| 20200616 | 50 | 795 | 3017.04 | 0 | 238.28 |
| 20200619 | 50 | 845 | 2567.04 | 0 | 288.28 |
| 20200624 | 50 | 895 | 47.04 | 1 | 47.04 |
| 20200715 | 39 | 934 | 47.04 | 1 | 47.04 |
+-------------+--------+---------------------+-------------------+--------------+-------------+
Actual Calc is the desired outcome and Running account fees is the issue.
Running account fees is the running balance of "Amount" and overlimit_balance is the test. We need to see that the running_accountfees isn't greater than over limit,
If it is, take overlimits value and start calculating again by adding amount on again.
My query that produced this:
SELECT
[Transaction].ReportDateSK AS 'Tran_DateSK'
,[Transaction].AmountChange/100.00 AS 'Amount'
,SUM([Transaction].AmountChange/100.00)
OVER (PARTITION BY [Transaction].AccountSK
ORDER BY [Transaction].ReportDateSK
ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) AS 'Running_AccountFees'
,[Summary].Overlimit_Balance AS 'Overlimit_Balance'
,CASE
WHEN SUM([Transaction].AmountChange/100.00)
OVER (PARTITION BY [Transaction].AccountSK
ORDER BY [Transaction].ReportDateSK
ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) > [Summary].Overlimit_Balance
THEN 1
ELSE 0
END AS 'Restart_Calc'
,'' AS 'Actual_Calc'
FROM
Fact.[Transaction] [Transaction]
INNER JOIN Fact.AccountSummary [Summary] ON [Summary].DateSK = [Transaction].ReportDateSK
AND [Summary].AccountSK = [Transaction].AccountSK
AND [Summary].[Current] = 1
WHERE IsFeeTransaction = 1
AND [Transaction].AccountSK = 725
AND [Transaction].ReportDateSK BETWEEN 20200217 AND 20200730
Realised that the data in your question is essentially the source data and have been able to come up with the below. It isn't exactly pretty but it provides the correct output. Explanations on how it works are in the comments:
declare #t table(Tran_DateSK int, Amount decimal(10,2), Running_AccountFees int, Overlimit_Balance decimal(10,2), Restart_Calc bit, Actual_Calc decimal(10,2));
insert into #t values(20200217,39,39,3867.76,0,39),(20200217,50,89,3867.76,0,89),(20200316,39,128,4735.52,0,128),(20200316,50,178,4735.52,0,178),(20200324,50,228,2685.52,0,228),(20200330,50,278,49.52,1,49.52),(20200415,39,317,49.52,1,49.52),(20200515,39,356,3917.28,0,88.52),(20200515,50,406,3917.28,0,138.52),(20200519,50,456,3467.28,0,188.52),(20200604,50,506,3017.28,0,238.52),(20200609,50,556,2167.28,0,288.52),(20200611,50,606,49.28,1,49.28),(20200615,39,645,3917.04,0,88.28),(20200615,50,695,3917.04,0,138.28),(20200616,50,745,3017.04,0,188.28),(20200616,50,795,3017.04,0,238.28),(20200619,50,845,2567.04,0,288.28),(20200624,50,895,47.04,1,47.04),(20200715,39,934,47.04,1,47.04);
with t as
(
select Tran_DateSK
,Amount
-- Check if the Running_AccountFees are over the Overlimit_Balance
,case when sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) > Overlimit_Balance
-- If so, check if the Running_AccountFees in the previous row were also over the Overlimit_Balance
then case when (sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) - Amount) > lag(Overlimit_Balance,1,0) over (order by Tran_DateSK,Amount,Overlimit_Balance)
then 0 -- and in those instances this means multiple Restart_Calcs in a row, so set the Amount to zero as we don't want to increase the fees when calculating the Actual_Calc
else Amount
end
else Amount
end as Amount_Adj
,sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) as Running_AccountFees
,lag(Overlimit_Balance,1,0) over (order by Tran_DateSK,Amount,Overlimit_Balance) as Prev_Overlimit_Balance
,Overlimit_Balance
,case when sum(Amount) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding) > Overlimit_Balance
then 1
else 0
end as Restart_Calc
from #t
)
,b as
(
select *
,case when Running_AccountFees > Overlimit_Balance -- If this row is the first in a possible series of balance resets
and sum(Amount_Adj) over (order by Tran_DateSK,Amount,Overlimit_Balance rows between unbounded preceding and 1 preceding) <= Prev_Overlimit_Balance
then Overlimit_Balance -- Take the Overlimit_Balance and subtract the *Adjusted* Running_AccountFees
- sum(Amount_Adj) over (order by Tran_DateSK,Amount,Overlimit_Balance rows between unbounded preceding and 1 preceding)
- Amount_Adj
else 0
end as Reset_Bal
from t
)
select Tran_DateSK
,Amount
,Running_AccountFees
,Overlimit_Balance
,Restart_Calc
-- For each *Adjusted* Running_AccountFees, apply the most negative Reset_Bal value, as this will contain the entire amount that needs to be reset from the current *Adjusted* Running_AccountFees to get the correct Balance_Calc
,sum(Amount_Adj) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding)
+ min(Reset_Bal) over (order by Tran_DateSK,Amount,Overlimit_Balance rows unbounded preceding)
as Balance_Calc
from b
order by Tran_DateSK;
Output
+-------------+--------+---------------------+-------------------+--------------+--------------+
| Tran_DateSK | Amount | Running_AccountFees | Overlimit_Balance | Restart_Calc | Balance_Calc |
+-------------+--------+---------------------+-------------------+--------------+--------------+
| 20200217 | 39.00 | 39.00 | 3867.76 | 0 | 39.00 |
| 20200217 | 50.00 | 89.00 | 3867.76 | 0 | 89.00 |
| 20200316 | 39.00 | 128.00 | 4735.52 | 0 | 128.00 |
| 20200316 | 50.00 | 178.00 | 4735.52 | 0 | 178.00 |
| 20200324 | 50.00 | 228.00 | 2685.52 | 0 | 228.00 |
| 20200330 | 50.00 | 278.00 | 49.52 | 1 | 49.52 |
| 20200415 | 39.00 | 317.00 | 49.52 | 1 | 49.52 |
| 20200515 | 39.00 | 356.00 | 3917.28 | 0 | 88.52 |
| 20200515 | 50.00 | 406.00 | 3917.28 | 0 | 138.52 |
| 20200519 | 50.00 | 456.00 | 3467.28 | 0 | 188.52 |
| 20200604 | 50.00 | 506.00 | 3017.28 | 0 | 238.52 |
| 20200609 | 50.00 | 556.00 | 2167.28 | 0 | 288.52 |
| 20200611 | 50.00 | 606.00 | 49.28 | 1 | 49.28 |
| 20200615 | 39.00 | 645.00 | 3917.04 | 0 | 88.28 |
| 20200615 | 50.00 | 695.00 | 3917.04 | 0 | 138.28 |
| 20200616 | 50.00 | 745.00 | 3017.04 | 0 | 188.28 |
| 20200616 | 50.00 | 795.00 | 3017.04 | 0 | 238.28 |
| 20200619 | 50.00 | 845.00 | 2567.04 | 0 | 288.28 |
| 20200624 | 50.00 | 895.00 | 47.04 | 1 | 47.04 |
| 20200715 | 39.00 | 934.00 | 47.04 | 1 | 47.04 |
+-------------+--------+---------------------+-------------------+--------------+--------------+

How to get the Total count without distinct?

Here is the fiddle
I have a table as
TABLE_MAIN
+-----+----------+---------+
| id | name | phase |
+-----+----------+---------+
| 101 | Bolt | PHASE 1 |
| 102 | Nut | PHASE 1 |
| 103 | Screw | PHASE 2 |
| 104 | Hex BOLT | PHASE 2 |
| 105 | Rubber | PHASE 3 |
| 106 | Aluminum | PHASE 3 |
| 107 | Slate | PHASE 3 |
| 108 | Pen | PHASE 3 |
| 109 | Pencil | PHASE 3 |
+-----+----------+---------+
TABLE_ERROR
+-----+----------+---------+
| id | name | phase |
+-----+----------+---------+
| 101 | Bolt | PHASE 1 |
| 102 | Needle | PHASE 1 |
| 101 | Bolt | PHASE 3 |
| 102 | Needle | PHASE 3 |
| 104 | Bolt | PHASE 3 |
| 105 | Rubber | PHASE 3 |
| 105 | Plastic | PHASE 3 |
| 106 | Aluminum | PHASE 3 |
| 106 | Steel | PHASE 3 |
| 106 | Cooper | PHASE 3 |
+-----+----------+---------+
Now I'm trying to find the number of times the ID of PHASE 3 in table_error appearing in table_main for each phase. If the ID is repeating, It should get added to the total count.
Expected
+---------+-----------------+-------+
| phase | already_present | total |
+---------+-----------------+-------+
| PHASE 1 | 2 | 8 |
| PHASE 2 | 1 | 8 |
| PHASE 3 | 5 | 8 |
+---------+-----------------+-------+
I have tried
SELECT phase, count(*) AS already_present, sum(count(*)) OVER () AS total
FROM table_main
WHERE id IN (
SELECT id
FROM table_error
WHERE phase = 'PHASE 3'
)
GROUP BY phase
But it is giving me,
+---------+-----------------+-------+
| phase | already_present | total |
+---------+-----------------+-------+
| PHASE 1 | 2 | 5 |
| PHASE 2 | 1 | 5 |
| PHASE 3 | 2 | 5 |
+---------+-----------------+-------+
Hope this helps you
select tm.phase, count(tm.id) AS already_present, SUM(count(tm.id)) OVER() as total
from table_main tm
inner join table_error te on te.id = tm.id and te.phase = 'PHASE 3'
group by tm.phase

Sum, Group by and Null

I'm dipping my toes into SQL. I have the following table
+------+----+------+------+-------+
| Type | ID | QTY | Rate | Name |
+------+----+------+------+-------+
| B | 1 | 1000 | 21 | Jack |
| B | 2 | 2000 | 12 | Kevin |
| B | 1 | 3000 | 24 | Jack |
| B | 1 | 1000 | 23 | Jack |
| B | 3 | 200 | 13 | Mary |
| B | 2 | 3000 | 12 | Kevin |
| B | 4 | 4000 | 44 | Chris |
| B | 4 | 5000 | 43 | Chris |
| B | 3 | 1000 | 26 | Mary |
+------+----+------+------+-------+
I don't know how I would leverage Sum and Group by to achieve the following result.
+------+----+------+------+-------+------------+
| Type | ID | QTY | Rate | Name | Sum of QTY |
+------+----+------+------+-------+------------+
| B | 1 | 1000 | 21 | Jack | 5000 |
| B | 1 | 3000 | 24 | Jack | Null |
| B | 1 | 1000 | 23 | Jack | Null |
| B | 2 | 3000 | 12 | Kevin | 5000 |
| B | 2 | 3000 | 12 | Kevin | Null |
| B | 3 | 200 | 13 | Mary | 1200 |
| B | 3 | 1000 | 26 | Mary | Null |
| B | 4 | 4000 | 44 | Chris | 9000 |
| B | 4 | 5000 | 43 | Chris | Null |
+------+----+------+------+-------+------------+
Any help is appreciated!
You can use window function :
select t.*,
(case when row_number() over (partition by type, id order by name) = 1
then sum(qty) over (partition by type, id order by name)
end) as Sum_of_QTY
from table t;

Resources