Replacing self joins by window functions - snowflake-cloud-data-platform

I am working with the following sample data:
dt | ship_id | audit_id | action
2022-01-02 | 1351 | id1 | destroy
2022-01-01 | 1351 | id1 | create
2021-12-12 | 3457 | id2 | create
2021-12-16 | 3457 | id2 | destroy
2021-12-28 | 3457 | id3 | create
To give some context: for a given ship_id and audit_id, an entry has to be created before it is destroyed, as defined by the action column. For example, ship_id=3457 with audit_id=id2 got created on Dec 12 and destroyed on Dec 16.
The goal is to get, for every dt (where the action is 'create'), how many audit_ids were created before it and how many audit_ids were destroyed before it.
Sample output:
dt | created_cnt | destroyed_cnt
2022-01-01 | 2 | 1
A possible approach using the self-join idea:
with cte as (
select
audit_id,
ship_id,
max(case when action = 'create' then dt end) as creation_time,
max(case when action = 'destroy' then dt end) as removal_time
from table
group by 1,2)
select
t1.creation_time as creation_date,
count(t2.audit_id) as created_cnt,
count(distinct case when t2.removal_time < t1.creation_time then t2.audit_id end) as destroyed_cnt
from cte as t1
left join cte as t2 on t1.creation_time > t2.creation_time
group by 1
order by 1 desc;
But due to the size of the table, this self-join is slowing things down. Is it possible to use some sort of window function here to replace the join? Help is appreciated.

Check this solution with over(order by dt rows between unbounded preceding and 1 preceding):
with data as (
select $1 dt, $2 ship, $3 audit, $4 action
from values('2022-01-02', 1, 'id1', 'destroy')
, ('2022-01-01', 1, 'id1', 'create')
, ('2021-12-12', 2, 'id2', 'create')
, ('2021-12-16', 2, 'id2', 'destroy')
, ('2021-12-28', 2, 'id3', 'create')
)
select dt
, sum(iff(action='create',1,0)) over(order by dt rows between unbounded preceding and 1 preceding) created_cnt
, sum(iff(action='destroy',1,0)) over(order by dt rows between unbounded preceding and 1 preceding) destroyed_cnt
from data

An alternative answer using PIVOT instead of IFF(). After the pivot, the 'create' and 'destroy' counts land in the third and fourth columns, which is why they are referenced positionally as $3 and $4 below. I would be interested to hear which approach scales best for your problem.
Code (ready to copy, paste and run):
with data as (
select $1 dt, $2 ship, $3 audit, $4 action
from values('2022-01-02', 1, 'id1', 'destroy')
, ('2022-01-01', 1, 'id1', 'create')
, ('2021-12-12', 2, 'id2', 'create')
, ('2021-12-16', 2, 'id2', 'destroy')
, ('2021-12-28', 2, 'id3', 'create')
)
select
dt
, sum($3) over (order by dt rows between unbounded preceding and 1 preceding) created_cnt
, sum($4) over (order by dt rows between unbounded preceding and 1 preceding) destroyed_cnt
from
data pivot ( count (audit) for action in ('create','destroy'));

To jiggle Filipe's answer a little, the sum(iff(action='create',1,0)) can be swapped for count_if(action='create'),
thus becoming:
with data as (
select $1 dt, $2 ship, $3 audit, $4 action
from values('2022-01-02', 1, 'id1', 'destroy')
,('2022-01-01', 1, 'id1', 'create')
,('2021-12-12', 2, 'id2', 'create')
,('2021-12-16', 2, 'id2', 'destroy')
,('2021-12-28', 2, 'id3', 'create')
)
select dt
,count_if(action='create') over (order by dt rows between unbounded preceding and 1 preceding) created_cnt
,count_if(action='destroy') over (order by dt rows between unbounded preceding and 1 preceding) destroyed_cnt
from data
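Since the question only wants a row per creation date, one more step is needed on top of any of the answers above. A minimal sketch (the counted CTE name and the final filter are my additions, not part of the original answers): compute the running counts first, then filter on action, so the 'destroy' rows still feed the window:
with data as (
select $1 dt, $2 ship, $3 audit, $4 action
from values('2022-01-02', 1, 'id1', 'destroy')
,('2022-01-01', 1, 'id1', 'create')
,('2021-12-12', 2, 'id2', 'create')
,('2021-12-16', 2, 'id2', 'destroy')
,('2021-12-28', 2, 'id3', 'create')
), counted as (
-- counted is my own CTE name; the running counts are the same as in the answers above
select dt
,action
,count_if(action='create') over (order by dt rows between unbounded preceding and 1 preceding) created_cnt
,count_if(action='destroy') over (order by dt rows between unbounded preceding and 1 preceding) destroyed_cnt
from data
)
-- keep only the creation dates, as in the question's sample output
select dt, created_cnt, destroyed_cnt
from counted
where action = 'create'
order by dt desc;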

Related

SQL Server, How to group rows that are near in time

I have a table that has a time value and a user id, and I want to group the rows by user id when they are near in time (less than 2 minutes between consecutive rows).
Here is an example:
CreatedAt | User ID
'16:01:01' | '01'
'16:02:20' | '01'
'16:03:20' | '01'
'16:04:20' | '01'
'16:05:20' | '02'
'16:06:20' | '02'
'16:07:20' | '02'
'16:08:20' | '02'
'16:14:02' | '02'
'16:15:01' | '02'
'16:20:02' | '03'
The result should be:
User ID = 01
'16:01:01'
'16:02:20'
'16:03:20'
'16:04:20'
User ID = 02
'16:05:20'
'16:06:20'
'16:07:20'
'16:08:20'
'16:14:02'
'16:15:01'
User ID = 03
'16:20:02'
I'm not even sure if it's doable in SQL, or whether I have to code it in the application (I have a few million rows in my database, so that would not be the most efficient way).
Thanks for your help.
This assigns a "Group Number" to the sets. however, not sure what this really achieves, but might help you achieve what you want on your presentation layer:
WITH VTE AS(
SELECT CONVERT(time(0), V.CreatedAt) AS CreatedAt, UserID
FROM (VALUES ('16:01:01','01'),
('16:02:20','01'),
('16:03:20','01'),
('16:04:20','01'),
('16:05:20','02'),
('16:06:20','02'),
('16:07:20','02'),
('16:08:20','02'),
('16:14:02','02'),
('16:15:01','02'),
('16:20:02','03')) V(CreatedAt, UserID)),
TimeDiff AS(
SELECT *,
CASE WHEN DATEDIFF(SECOND,LAG(CreatedAt,1,CreatedAt) OVER (PARTITION BY UserID ORDER BY CreatedAt ASC),CreatedAt) <= 120 THEN 1 ELSE 0 END AS Succession
FROM VTE)
SELECT TD.CreatedAt,
TD.UserID,
COUNT(CASE WHEN TD.Succession = 0 THEN 1 END) OVER (PARTITION BY UserID ORDER BY TD.CreatedAt
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS GroupNumber
FROM TimeDiff TD;
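If the goal is one row per burst of activity rather than a per-row group label, the GroupNumber can be aggregated further. A hedged follow-up sketch (the Grouped CTE and the GroupStart/GroupEnd/RowsInGroup names are my additions, not part of the answer above):
WITH VTE AS(
SELECT CONVERT(time(0), V.CreatedAt) AS CreatedAt, UserID
FROM (VALUES ('16:01:01','01'),
('16:02:20','01'),
('16:03:20','01'),
('16:04:20','01'),
('16:05:20','02'),
('16:06:20','02'),
('16:07:20','02'),
('16:08:20','02'),
('16:14:02','02'),
('16:15:01','02'),
('16:20:02','03')) V(CreatedAt, UserID)),
TimeDiff AS(
SELECT *,
CASE WHEN DATEDIFF(SECOND,LAG(CreatedAt,1,CreatedAt) OVER (PARTITION BY UserID ORDER BY CreatedAt ASC),CreatedAt) <= 120 THEN 1 ELSE 0 END AS Succession
FROM VTE),
Grouped AS(
-- same GroupNumber calculation as the answer above, kept as a CTE so it can be aggregated
SELECT TD.CreatedAt,
TD.UserID,
COUNT(CASE WHEN TD.Succession = 0 THEN 1 END) OVER (PARTITION BY UserID ORDER BY TD.CreatedAt
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS GroupNumber
FROM TimeDiff TD)
-- collapse each user's group into its time range for the presentation layer
SELECT UserID,
GroupNumber,
MIN(CreatedAt) AS GroupStart,
MAX(CreatedAt) AS GroupEnd,
COUNT(*) AS RowsInGroup
FROM Grouped
GROUP BY UserID, GroupNumber
ORDER BY UserID, GroupStart;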

Field equal 1 display

I am using SQL Server 2008. When duplicate orderno rows exist with activityCode 0 and 1, I would like to get only the row where activityCode equals 1.
Rows where the orderno's activityCode equals 0 should also be displayed, but only when that orderno has no row with activityCode equal to 1, i.e. when its activityCode is only ever 0. I hope this is clear and makes sense, but let me know if I need to provide more details. Thanks
--create table
create table po_v
(
orderno int,
amount decimal(10, 2),
activityCode int
)
--insert values
insert into po_v values
(170268, 2774.31, 0),
(17001988, 288.82, 0),
(17001988, 433.23, 1),
(170271, 3786, 1),
(170271, 8476, 0),
(170055, 34567, 0)
--Results
170268 | 2774.31 | 0
17001988 | 433.23 | 1
170271 | 3786 | 1
170055 | 34567 | 0
*****Updated*****
I have inserted two new records and updated the expected results. The data in the actual table has other numbers besides 0 and 1. The select statement below returns the correct ordernos, but I would like the other records for the orderno to be displayed as well. The partition only returns one record per orderno. If possible, I would like to see all records that share the same (highest) activityCode.
--insert values
insert into po_v values
(170271, 3799, 1),
(172525, 44445, 2)
--select statement
SELECT Orderno,
Amount,
Activitycode
FROM (SELECT orderno,
amount,
activitycode,
ROW_NUMBER()
OVER(
PARTITION BY orderno
ORDER BY activitycode DESC) AS dup
FROM Po_v)dt
WHERE dt.dup = 1
ORDER BY 1
--select statement results
170055 | 34567 | 0
170268 | 2774.31 | 0
170271 | 3786 | 1
172525 | 44445 | 2
17001988 | 433.23 | 1
--expected results
170055 | 34567 | 0
170268 | 2774.31 | 0
170271 | 3786 | 1
170271 | 3799 | 1
172525 | 44445 | 2
17001988 | 433.23 | 1
Not totally clear what you are trying to do here, but this returns the output you are expecting.
select orderno
, amount
, activityCode
from
(
select *
, RowNum = ROW_NUMBER() over(partition by orderno order by activityCode desc)
from po_v
) x
where x.RowNum = 1
---EDIT---
With the new details this is a very different question. As I understand it now, you want all rows that share the max activityCode for each orderno. You can do this pretty easily with a CTE.
with MyGroups as
(
select orderno
, Activitycode = max(activitycode)
from po_v
group by orderno
)
select *
from po_v p
join MyGroups g on g.orderno = p.orderno
and g.Activitycode = p.Activitycode
Try this
SELECT Orderno,
Amount,
Activitycode
FROM (SELECT orderno,
amount,
activitycode,
ROW_NUMBER()
OVER(
PARTITION BY orderno
ORDER BY activitycode DESC) AS dup
FROM Po_v)dt
WHERE dt.dup = 1
ORDER BY 1
Result
Orderno Amount Activitycode
------------------------------------
170055 34567.00 0
170268 2774.31 0
170271 3786.00 1
17001988 433.23 1
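A hedged alternative sketch (not from either answer above): swapping ROW_NUMBER() for RANK() keeps every row that ties on the highest activityCode per orderno, which matches the updated expected results, including both 170271 rows.
select orderno
, amount
, activityCode
from
(
-- RANK() (instead of ROW_NUMBER()) is my suggestion: tied rows all get rank 1
select *
, rnk = RANK() over(partition by orderno order by activityCode desc)
from po_v
) x
where x.rnk = 1
order by orderno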

MSSQL: Create incremental row label per group

In my table, I have a primary key and a date. What I'd like to achieve is an incremental label based on whether or not there is a break between the dates - the Goal column.
Below is an example. The break column was calculated using the LEAD function (I thought it might help).
I am able to solve it with procedural T-SQL, but that would be a last resort. Nothing I tried has worked so far. I am using MSSQL 2014.
PK | Date | break | Goal |
-------------------------------
1 | 03/2017 | 0 | 1 |
1 | 04/2017 | 0 | 1 |
1 | 08/2017 | 1 | 2 |
1 | 09/2017 | 0 | 2 |
1 | 10/2017 | 0 | 2 |
1 | 02/2018 | 1 | 3 |
1 | 03/2018 | 0 | 3 |
Here is a code to reproduce this example:
CREATE TABLE #test
(
ConsumerId INT,
FullDate DATE,
Goal INT
)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-03-01',1)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-04-01',1)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-08-01',2)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-09-01',2)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2017-10-01',2)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2018-02-01',3)
INSERT INTO #test (ConsumerId, FullDate, Goal) VALUES (1,'2018-03-01',3)
SELECT ConsumerId,
FullDate,
CASE WHEN (datediff(month,
isnull(
LEAD (FullDate,1) OVER (PARTITION BY ConsumerId ORDER BY FullDate DESC),
FullDate),
FullDate) > 1)
THEN 1
ELSE 0
END AS break,
Goal
FROM #test
ORDER BY FullDate ASC
EDIT
This is apparently a famous problem, "islands and gaps", as pointed out in the comments. Google offers many solutions, as well as other questions here on SO.
Try this...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
),
cte_SmearGap AS (
SELECT
tg.ConsumerId, tg.FullDate,
GV = MAX(tg.Gap) OVER (PARTITION BY tg.ConsumerId ORDER BY tg.FullDate ROWS UNBOUNDED PRECEDING)
FROM
cte_TestGap tg
)
SELECT
sg.ConsumerId, sg.FullDate,
GroupValue = DENSE_RANK() OVER (PARTITION BY sg.ConsumerId ORDER BY sg.GV)
FROM
cte_SmearGap sg;
An explanation of the code and how it works...
The 1st query, in cte_TestGap, uses the LAG function along with the ROW_NUMBER() function to mark the location of each gap in the data. We can see that by breaking it out and looking at its results...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
)
SELECT * FROM cte_TestGap;
cte_TestGap results...
ConsumerId FullDate Gap
----------- ---------- --------------------
1 2017-03-01 1
1 2017-04-01 0
1 2017-08-01 3
1 2017-09-01 0
1 2017-10-01 0
1 2018-02-01 6
1 2018-03-01 0
At this point we want each 0 value to take on the value of the preceding non-0 value, allowing the rows to be grouped together. This is done in the 2nd query (cte_SmearGap) using the MAX function with a "window frame". So if we look at the output of cte_SmearGap, we can see that...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
),
cte_SmearGap AS (
SELECT
tg.ConsumerId, tg.FullDate,
GV = MAX(tg.Gap) OVER (PARTITION BY tg.ConsumerId ORDER BY tg.FullDate ROWS UNBOUNDED PRECEDING)
FROM
cte_TestGap tg
)
SELECT * FROM cte_SmearGap;
cte_SmearGap results...
ConsumerId FullDate GV
----------- ---------- --------------------
1 2017-03-01 1
1 2017-04-01 1
1 2017-08-01 3
1 2017-09-01 3
1 2017-10-01 3
1 2018-02-01 6
1 2018-03-01 6
At this point all of the rows are in distinct groups... but we'd like to have our group numbers in a contiguous sequence (1, 2, 3) as opposed to (1, 3, 6).
Of course that's easy enough to fix using the DENSE_RANK() function, which is what's happening in the final select...
WITH
cte_TestGap AS (
SELECT
t.ConsumerId, t.FullDate,
Gap = CASE
WHEN DATEDIFF(mm, t.FullDate, LAG(t.FullDate, 1) OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)) = -1
THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY t.ConsumerId ORDER BY t.FullDate)
END
FROM
#test t
),
cte_SmearGap AS (
SELECT
tg.ConsumerId, tg.FullDate,
GV = MAX(tg.Gap) OVER (PARTITION BY tg.ConsumerId ORDER BY tg.FullDate ROWS UNBOUNDED PRECEDING)
FROM
cte_TestGap tg
)
SELECT
sg.ConsumerId, sg.FullDate,
GroupValue = DENSE_RANK() OVER (PARTITION BY sg.ConsumerId ORDER BY sg.GV)
FROM
cte_SmearGap sg;
The end result...
ConsumerId FullDate GroupValue
----------- ---------- --------------------
1 2017-03-01 1
1 2017-04-01 1
1 2017-08-01 2
1 2017-09-01 2
1 2017-10-01 2
1 2018-02-01 3
1 2018-03-01 3
The comment from David Browne was actually extremely useful. If you google "Islands and Gaps", there are many variations of the solution. Below is the one I liked the most.
In the end, I needed the Goal column to be able to group the dates into MIN/MAX ranges. This solution skips that step and directly creates the aggregated range.
Here is the source.
SELECT MIN(FullDate) AS range_start,
MAX(FUllDate) AS range_end
FROM (
SELECT FullDate,
DATEADD(MM, -1 * ROW_NUMBER() OVER(ORDER BY FullDate), FullDate) AS grp
FROM #test
) a
GROUP BY a.grp
And the output:
range_start | range_end |
--------------------------
2017-03-01 | 2017-04-01 |
2017-08-01 | 2017-10-01 |
2018-02-01 | 2018-03-01 |
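If the Goal label itself is still needed per row, a hedged sketch (my own combination of the grp trick above with DENSE_RANK(), not from the linked source):
SELECT ConsumerId,
FullDate,
-- contiguous group numbers per consumer: 1, 2, 3, ... matching the Goal column
DENSE_RANK() OVER (PARTITION BY ConsumerId ORDER BY grp) AS Goal
FROM (
SELECT ConsumerId,
FullDate,
-- same "date minus row number" trick as above: rows in one island share the same grp
DATEADD(MM, -1 * ROW_NUMBER() OVER (PARTITION BY ConsumerId ORDER BY FullDate), FullDate) AS grp
FROM #test
) a
ORDER BY ConsumerId, FullDate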

How can I get multiple columns values into single row in oracle?

My exact requirement is that if the output of the query 'select amount, quantity from temp_table where type = 5;' is:
amount | quantity
10 | 5
20 | 7
12 | 10
Then, the output should be displayed as:
amount1 | amount2 | amount3 | quantity1 | quantity2 | quantity3
10 | 20 | 12 | 5 | 7 | 10
A possible solution might be:
SELECT LISTAGG(amount, '|') WITHIN GROUP (order by amount)
|| LISTAGG(quantity, '|') WITHIN GROUP (order by amount) as result
FROM temp_table where type = 5;
*Bear in mind that the amount and quantity values in the result are separated by '|', the delimiter passed to LISTAGG(); you can change it to a space or anything else you like. You may also want to concatenate a separator between the two lists, e.g. || '|' ||.
Cheers
Use a PIVOT:
SELECT "1_AMOUNT" AS Amount1,
"2_AMOUNT" AS Amount2,
"3_AMOUNT" AS Amount3,
"4_AMOUNT" AS Amount4,
"5_AMOUNT" AS Amount5,
"1_QUANTITY" AS Quantity1,
"2_QUANTITY" AS Quantity2,
"3_QUANTITY" AS Quantity3,
"4_QUANTITY" AS Quantity4,
"5_QUANTITY" AS Quantity5
FROM ( SELECT amount, quantity, ROWNUM rn FROM temp_table WHERE type = 5 )
PIVOT ( MAX( amount ) AS amount,
MAX( quantity ) AS quantity
FOR rn IN ( 1, 2, 3, 4, 5 ) );
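A hedged variation on the PIVOT answer (my own tweak, not part of the answer above): ROWNUM over an unordered subquery is not guaranteed to follow any particular order, so ROW_NUMBER() with an explicit ORDER BY makes the pivot positions repeatable. Swap the ORDER BY column for whatever defines the order you want the values to appear in.
SELECT "1_AMOUNT" AS Amount1,
       "2_AMOUNT" AS Amount2,
       "3_AMOUNT" AS Amount3,
       "1_QUANTITY" AS Quantity1,
       "2_QUANTITY" AS Quantity2,
       "3_QUANTITY" AS Quantity3
FROM ( SELECT amount, quantity,
              -- assumption: ordering by amount; change to suit your desired column order
              ROW_NUMBER() OVER ( ORDER BY amount ) AS rn
       FROM temp_table
       WHERE type = 5 )
PIVOT ( MAX( amount ) AS amount,
        MAX( quantity ) AS quantity
        FOR rn IN ( 1, 2, 3 ) );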

How to get comparing records from the same table in SQL Server 2014?

I have a table that shows the entry and exit of items into the warehouse. Camera 1 and Camera 2 record the entry time and exit time, respectively, of each item. The cameras then classify the item as it enters and leaves the checkpoint with the help of lasers. E.g. big box: class 5, medium box: class 3, small box: class 2.
Sometimes the cameras' classifications don't match each other. E.g. the classification at entry can be medium box and on exit small box.
I need to find the number of transactions where the class didn't match for the same TransactionDetail, and then the percentage of those class mismatches against all transactions for a certain time range.
My table looks somewhat like this:
---------------------------------------------------------------------------
| AVDetailID | TransDetailID | AVClassID | CamID | CreatedDate |
---------------------------------------------------------------------------
| 20101522 | 54125478 | 5 | 1 | 2017-05-08 10:15:01:560|
| 20101523 | 54125478 | 5 | 2 | 2017-05-08 10:15:01:620|
| 20101524 | 54125479 | 3 | 1 | 2017-05-08 10:15:03:120|
| 20101525 | 54125479 | 2 | 2 | 2017-05-08 10:15:03:860|
| 20101526 | 54125480 | 4 | 1 | 2017-05-08 10:15:06:330|
| 20101527 | 54125480 | 4 | 2 | 2017-05-08 10:15:06:850|
---------------------------------------------------------------------------
So, in the above case the class changes from 3 to 2 between records 3 and 4. That is one transaction where the class changed. I need to get the percentage of all transactions where the class changed between the two cameras.
The code I've used so far is below. I just need to find a way to get a percentage of the total transactions.
DECLARE @MinDate DATE = '20170406',
@MaxDate DATE = '20170407';
SELECT COUNT(tdBefore.TransDetailID) TD
--,SUM((COUNT(*) OVER() / allRecords.[Count]) * 100) AS DiffPercent
FROM AVTransDetail AS tdBefore
INNER JOIN AVTransDetail AS tdAfter
ON tdBefore.TransDetailID = tdAfter.TransDetailID
AND tdBefore.CamID = 1
AND tdAfter.CamID = 2
CROSS APPLY
(
SELECT COUNT(*) AS [Count]
FROM AVTransDetail
WHERE tdBefore.CreatedDate >= @MinDate
AND tdAfter.CreatedDate <= @MaxDate
) AS allRecords
WHERE tdBefore.AVClassID <> tdAfter.AVClassID
AND tdBefore.CreatedDate >= @MinDate
AND tdAfter.CreatedDate <= @MaxDate
How do I create a column for percentage of total transactions?
This worked with your sample data.
DECLARE @MinDate DATETIME = '5/8/2017 12:00AM';
DECLARE @MaxDate DATETIME = '5/8/2017 11:59PM';
WITH cam1 AS (
SELECT TransDetailID,AVClassID
FROM AVTransDetail
WHERE CreatedDate BETWEEN @MinDate AND @MaxDate
AND
CamID = 1),
cam2 AS (
SELECT TransDetailID,AVClassID
FROM AVTransDetail
WHERE CreatedDate BETWEEN @MinDate AND @MaxDate
AND
CamID = 2)
SELECT COUNT(*)'Total',SUM(CASE WHEN c1.AVClassID = c2.AVClassID THEN 0 ELSE 1 END)'NonMatch',
SUM(CASE WHEN c1.AVClassID = c2.AVClassID THEN 0 ELSE 1 END) * 100.00/COUNT(*)'Percentage'
FROM cam1 c1
JOIN cam2 c2 ON c1.TransDetailID=c2.TransDetailID
Try the below SQL script.
First we use LAG to find the differences. Then we determine, for each transaction, whether there is a difference. And finally, we get the percentage.
DECLARE @MinDate DATE = '2017/04/06',
@MaxDate DATE = '2017/05/09';
SELECT count(*) AS TotalTransactions
,sum(Change) AS TransactionsWithChange
,(cast(sum(Change) AS FLOAT) / cast(count(*) AS FLOAT)) AS ChangePercent
FROM (
SELECT TransDetailID
,MAX(classChange) AS Change
FROM (
SELECT *
,LAG(AVClassID, 1, AVClassID) OVER (
PARTITION BY TransDetailID ORDER BY AVDetailID
) AS PrevClassId
,CASE
WHEN LAG(AVClassID, 1, AVClassID) OVER (
PARTITION BY TransDetailID ORDER BY AVDetailID
) != AVClassID
THEN 1
ELSE 0
END AS ClassChange
FROM AVTransDetail
where CreatedDate between @MinDate and @MaxDate
) AS CoreData
GROUP BY TransDetailID
) AS ChangeData
Hope this helps.
I added more sample rows to get a better result:
create table #trans (
AVDetailID int,
TransDetailID int,
AVClassID int,
CamID int,
CreatedDate datetime
)
insert into #trans values
( 20101522, 54125478, 5, 1, '2017-05-08 10:15:01:560'),
( 20101523, 54125478, 5, 2, '2017-05-08 10:15:01:620'),
( 20101524, 54125479, 3, 1, '2017-05-08 10:15:03:120'),
( 20101525, 54125479, 2, 2, '2017-05-08 10:15:03:860'),
( 20101526, 54125480, 4, 1, '2017-05-08 10:15:06:330'),
( 20101527, 54125480, 4, 2, '2017-05-08 10:15:06:850'),
( 20101528, 54125481, 4, 1, '2017-05-08 10:15:07:850'),
( 20101529, 54125481, 5, 2, '2017-05-08 10:15:09:850'),
( 20101530, 54125482, 4, 1, '2017-05-08 10:15:07:850'),
( 20101531, 54125482, 5, 3, '2017-05-08 10:15:09:850')
;with diff as (
-- select records that have different class
select CamID as Ent_CamID, count(*) diff_Count
from #trans ent
outer apply (
select top 1 AVClassID as x_AVClassID, CamID as x_CamID from #trans
where CreatedDate > ent.CreatedDate and TransDetailID = ent.TransDetailID
order by CamID, CreatedDate desc
) ext
where ent.AVClassID <> ext.x_AVClassID
group by ent.CamID, ext.x_CamID
union
select ext.x_CamID as Ext_CamID, count(*) diff_Count
from #trans ent
outer apply (
select top 1 AVClassID as x_AVClassID, CamID as x_CamID from #trans
where CreatedDate > ent.CreatedDate and TransDetailID = ent.TransDetailID
order by CamID, CreatedDate desc
) ext
where ent.AVClassID <> ext.x_AVClassID
group by ent.CamID, ext.x_CamID
)
, perc as (
select Ent_CamID as CamID, sum(diff_Count) Total_Error
, (select count(*)
from #trans where CamID = diff.Ent_CamID
group by CamID) AS Total_Capture
from diff
group by Ent_CamID
)
select CamID, Total_Error, Total_Capture, 100*(Total_Error)/Total_Capture Error_Percentage
from perc
Result:
CamID Total_Error Total_Capture Error_Percentage
1 3 5 60
2 2 4 50
3 1 1 100
