Correlated pattern definitions and Snowflake’s MATCH_RECOGNIZE


I need to find out when the Customer for an Order was changed and later reverted in the following data. An Order can be assigned to several different Customers before being reverted to the original Customer. Here is the raw data:
ORDER_NUM CUSTOMER LOAD_DATE
111 aaa 2023-02-09 04:49:41.335
111 bbb 2023-02-09 04:49:42.338
111 aaa 2023-02-09 04:49:43.278
222 aaa 2023-02-09 04:49:44.213
222 bbb 2023-02-09 04:49:45.254
333 aaa 2023-02-09 04:49:46.334
333 bbb 2023-02-09 04:49:47.101
333 ccc 2023-02-09 04:49:48.196
I developed the following MATCH_RECOGNIZE query in Oracle and it works:
select * from order_customer
match_recognize(
partition by order_number
order by load_date
one row per match
pattern (init modified+ reversed)
define
init as customer_id = customer_id,
modified as customer_id <> init.customer_id,
reversed as customer_id = init.customer_id
);
However, Snowflake does not currently seem to support correlated pattern definitions (such as init.customer_id) in the DEFINE clause of MATCH_RECOGNIZE. What is the best way to implement this in Snowflake?

This use case can be handled with FIRST_VALUE instead of a reference to init:
select * from order_customer
match_recognize(
partition by order_number
order by load_date
one row per match
pattern (init modified+ reversed)
define
init as customer_id = FIRST_VALUE(customer_id),
modified as customer_id <> FIRST_VALUE(customer_id),
reversed as customer_id = FIRST_VALUE(customer_id)
);
For input:
create or replace table order_customer (order_number number,
customer_id varchar(80),
load_date timestamp);
insert into order_customer values (111, 'aaa', CURRENT_TIMESTAMP);
insert into order_customer values (111, 'bbb', CURRENT_TIMESTAMP);
insert into order_customer values (111, 'aaa', CURRENT_TIMESTAMP);
insert into order_customer values (222, 'aaa', CURRENT_TIMESTAMP);
insert into order_customer values (222, 'bbb', CURRENT_TIMESTAMP);
insert into order_customer values (333, 'aaa', CURRENT_TIMESTAMP);
insert into order_customer values (333, 'bbb', CURRENT_TIMESTAMP);
insert into order_customer values (333, 'ccc', CURRENT_TIMESTAMP);
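Output: a single row for order_number 111, the only order whose customer changes and then reverts to the original.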
The pattern can also be simplified to (init modified+ init), which makes the separate reversed definition unnecessary:
select * from order_customer
match_recognize(
partition by order_number
order by load_date
one row per match
pattern (init modified+ init)
define
init as customer_id = FIRST_VALUE(customer_id),
modified as customer_id <> FIRST_VALUE(customer_id)
);
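To check what the pattern is expected to match, independently of any database, the logic can be restated in a few lines of Python. This is only a sketch under stated assumptions: find_reverts is a hypothetical helper, not part of Snowflake or Oracle, and it assumes the default behaviour of continuing after the last row of a match.

```python
from collections import defaultdict

def find_reverts(rows):
    """rows: iterable of (order_number, customer_id, load_date) tuples.

    Returns (order_number, customer_id, first_load_date, reverted_load_date)
    for every stretch that looks like: init, modified+, reversed.
    """
    by_order = defaultdict(list)
    for order_number, customer_id, load_date in sorted(rows, key=lambda r: (r[0], r[2])):
        by_order[order_number].append((customer_id, load_date))

    matches = []
    for order_number, events in by_order.items():
        i = 0
        while i < len(events):
            first = events[i][0]  # init: the customer the match starts with
            j = i + 1
            # modified+: consume rows whose customer differs from the first one
            while j < len(events) and events[j][0] != first:
                j += 1
            if j > i + 1 and j < len(events):
                # reversed: row j carries the original customer again
                matches.append((order_number, first, events[i][1], events[j][1]))
                i = j + 1  # assumption: continue after the last row of the match
            else:
                i += 1
    return matches

rows = [
    (111, "aaa", "2023-02-09 04:49:41.335"),
    (111, "bbb", "2023-02-09 04:49:42.338"),
    (111, "aaa", "2023-02-09 04:49:43.278"),
    (222, "aaa", "2023-02-09 04:49:44.213"),
    (222, "bbb", "2023-02-09 04:49:45.254"),
    (333, "aaa", "2023-02-09 04:49:46.334"),
    (333, "bbb", "2023-02-09 04:49:47.101"),
    (333, "ccc", "2023-02-09 04:49:48.196"),
]
print(find_reverts(rows))
```

With the sample data from the question, only order 111 is reported, because it is the only order that returns to its starting customer.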

Related

How to select changed columns

The Problem
I'm trying to detect and react to changes in a table where each update is being recorded as a new row with some values being the same as the original, some changed (the ones I want to detect) and some NULL values (not considered changed).
For example, given the following table MyData, and assuming the OrderNumber is the common value,
ID OrderNumber CustomerName PartNumber Qty Price OrderDate
1 123 Acme Corp. WG301 4 15.02 2020-01-02
2 456 Base Inc. AL337 7 20.15 2020-02-03
3 123 NULL WG301b 5 19.57 2020-01-02
If I execute the query for OrderNumber = 123 I would like the following data returned:
Column OldValue NewValue
ID 1 3
PartNumber WG301 WG301b
Qty 4 5
Price 15.02 19.57
Or possibly a single row result with only the changes filled, like this (however, I would strongly prefer the former format):
ID OrderNumber CustomerName PartNumber Qty Price OrderDate
3 NULL NULL WG301b 5 19.57 NULL
My Solution
I have not had a chance to test this, but I was considering writing the query with the following approach (pseudo-code):
select
NewOrNull(last.ID, prev.ID) as ID,
NewOrNull(last.OrderNumber, prev.OrderNumber) as OrderNumber,
NewOrNull(last.CustomerName, prev.CustomerName) as CustomerName,
...
from last row with OrderNumber = 123
join previous row where OrderNumber = 123
Here the function NewOrNull(lastVal, prevVal) returns NULL if the two values are equal or if lastVal is NULL, and returns lastVal otherwise.
Why I'm Looking for an Answer
I'm afraid that the ugly join, the number of calls to the function, and the procedural approach may make this approach not scalable. Before I start down the rabbit hole, I was wondering...
The Question
...are there any other approaches I should try, or any best practices to solving this specific type of problem?

I came up with a solution for the second (less preferred) format:
The Data
Using the following data:
INSERT INTO MyData
([ID], [OrderNumber], [CustomerName], [PartNumber], [Qty], [Price], [OrderDate])
VALUES
(1, 123, 'Acme Corp.', 'WG301', '4', '15.02', '2020-01-02'),
(2, 456, 'Base Inc.', 'AL337', '7', '20.15', '2020-02-03'),
(3, 123, NULL, 'WG301b', '5', '19.57', '2020-01-02'),
(4, 123, 'ACME Corp.', 'WG301b', NULL, NULL, '2020-01-02'),
(6, 456, 'Base Inc.', NULL, '7', '20.15', '2020-02-05');
The Function
This function returns the updated value if it has changed, otherwise NULL:
CREATE FUNCTION dbo.NewOrNull
(
    @newValue sql_variant,
    @oldValue sql_variant
)
RETURNS sql_variant
AS
BEGIN
    DECLARE @ret sql_variant
    SELECT @ret = CASE
        WHEN @newValue IS NULL THEN NULL
        WHEN @oldValue IS NULL THEN @newValue
        WHEN @newValue = @oldValue THEN NULL
        ELSE @newValue
    END
    RETURN @ret
END;
The Query
This query returns the history of changes for the given order number:
select dbo.NewOrNull(new.ID, old.ID) as ID,
dbo.NewOrNull(new.OrderNumber, old.OrderNumber) as OrderNumber,
dbo.NewOrNull(new.CustomerName, old.CustomerName) as CustomerName,
dbo.NewOrNull(new.PartNumber, old.PartNumber) as PartNumber,
dbo.NewOrNull(new.Qty, old.Qty) as Qty,
dbo.NewOrNull(new.Price, old.Price) as Price,
dbo.NewOrNull(new.OrderDate, old.OrderDate) as OrderDate
from MyData new
left join MyData old
on old.ID = (
select top 1 ID
from MyData pre
where pre.OrderNumber = new.OrderNumber
and pre.ID < new.ID
order by pre.ID desc
)
where new.OrderNumber = 123
The Result
ID OrderNumber CustomerName PartNumber Qty Price OrderDate
1 123 Acme Corp. WG301 4 15.02 2020-01-02
3 (null) (null) WG301b 5 19.57 (null)
4 (null) ACME Corp. (null) (null) (null) (null)
The Fiddle
Here's the SQL Fiddle that shows the whole thing in action.
http://sqlfiddle.com/#!18/b720f/5/0
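The first (preferred) format from the question, one row per changed column with its old and new value, is not covered by the query above. The comparison rule itself is small enough to sketch outside the database. The following Python is an illustration only: changed_columns is a hypothetical helper, and it assumes the two rows have already been fetched as dictionaries keyed by column name.

```python
def changed_columns(old_row, new_row):
    """Return (column, old_value, new_value) for every column that changed.

    A NULL (None) new value is not treated as a change, mirroring NewOrNull.
    """
    changes = []
    for column, new_value in new_row.items():
        old_value = old_row.get(column)
        if new_value is not None and new_value != old_value:
            changes.append((column, old_value, new_value))
    return changes

# Rows 1 and 3 from the question's MyData table (order number 123)
row_1 = {"ID": 1, "OrderNumber": 123, "CustomerName": "Acme Corp.",
         "PartNumber": "WG301", "Qty": 4, "Price": 15.02, "OrderDate": "2020-01-02"}
row_3 = {"ID": 3, "OrderNumber": 123, "CustomerName": None,
         "PartNumber": "WG301b", "Qty": 5, "Price": 19.57, "OrderDate": "2020-01-02"}

for column, old_value, new_value in changed_columns(row_1, row_3):
    print(column, old_value, new_value)
```

For these two rows the sketch reports ID, PartNumber, Qty and Price, which is the Column / OldValue / NewValue listing asked for in the question.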

Script to update thousands of SQL records with a unique ID

I've got a table named 'users' with 5800 records where the externalID is NULL. I want to add a unique externalID to each of these records in the following format: 'legacy_N', where N is some number between 1 and 5800.
This:
ID externalID Name AddedUTC
123 John 2019-09-19 15:14:11.837
634 Susan 2019-09-18 20:39:38.247
499 Lolita 2019-09-18 19:58:29.320
...
becomes this:
ID externalID Name AddedUTC
123 legacy_1 John 2019-09-19 15:14:11.837
634 legacy_2 Susan 2019-09-18 20:39:38.247
499 legacy_3 Lolita 2019-09-18 19:58:29.320
...
Is there any way I can do this programmatically? Or am I about to spend 8 hours updating these one by one?

Here's an example based on the data in your question:
-- Create a test table with test data in it
CREATE TABLE TestData([ID] int, [externalID] varchar(20), [Name] varchar(50), [AddedUTC] datetime);
INSERT INTO TestData ([ID], [externalID], [Name], [AddedUTC])
VALUES (123, NULL, 'John', '2019-09-19 15:14:11.837'),
(499, NULL, 'Lolita', '2019-09-18 19:58:29.320'),
(634, NULL, 'Susan', '2019-09-18 20:39:38.247');
-- Display the test data
SELECT * FROM TestData;
-- Do the update
WITH NumberedData AS (
SELECT ID, ROW_NUMBER() OVER (ORDER BY AddedUTC) AS RowNum
FROM TestData
WHERE externalID IS NULL -- only number the records that still need an ID
)
UPDATE td
SET [externalID] = 'legacy_' + CAST(nd.RowNum AS varchar(10))
FROM TestData td
INNER JOIN NumberedData nd ON td.ID = nd.ID;
-- Display updated data
SELECT * FROM TestData;
If you want to order by ID, then you can change the ROW_NUMBER clause to "OVER(ORDER BY ID)" instead.

apply query to each part of table individually

I need to run some query against each rowset in a table (Azure SQL):
ID CustomerID MsgTimestamp Msg
-------------------------------------------------
1 123 2017-01-01 10:00:00 Hello
2 123 2017-01-01 10:01:00 Hello again
3 123 2017-01-01 10:02:00 Can you help me with my order
4 123 2017-01-01 11:00:00 Are you still there
5 456 2017-01-01 10:07:00 Hey I'm a new customer
What I want to do is extract a "chat session" for every customer from the message records: if the gap between two consecutive messages from the same customer is less than 30 minutes, they belong to the same session. I need to record the start and end time of each session in a new table. In the example above, the start and end times of the first session for customer 123 are 10:00 and 10:02.
I know I can always use cursor and temp table to achieve that goal, but I'm thinking about utilizing any pre-built mechanism to reach better performance. Please kindly give me some input.

You can use window functions instead of a cursor. Something like this should work:
declare @t table (ID int, CustomerID int, MsgTimestamp datetime2(0), Msg nvarchar(100))
insert @t values
(1, 123, '2017-01-01 10:00:00', 'Hello'),
(2, 123, '2017-01-01 10:01:00', 'Hello again'),
(3, 123, '2017-01-01 10:02:00', 'Can you help me with my order'),
(4, 123, '2017-01-01 11:00:00', 'Are you still there'),
(5, 456, '2017-01-01 10:07:00', 'Hey I''m a new customer')
;with x as (
    -- g = 1 marks the first message of a session: the gap to the customer's previous message exceeds 30 minutes
    select *,
        case when datediff(minute,
                 lag(msgtimestamp, 1, '19000101') over(partition by customerid order by msgtimestamp),
                 msgtimestamp) > 30
             then 1 else 0 end as g
    from @t
),
y as (
    -- the running total of session starts per customer gives each row a session number
    select *, sum(g) over(partition by customerid order by msgtimestamp) as gg
    from x
)
select customerid, min(msgtimestamp) as SessionStart, max(msgtimestamp) as SessionEnd
from y
group by customerid, gg
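The same gap-based grouping can be sketched in Python to make the rule explicit: messages are handled per customer, and a new session starts whenever the gap to that customer's previous message is not below the threshold. This is an illustrative sketch only; sessions is a hypothetical helper, and it follows the question's wording ("less than 30 minutes"), so a gap of exactly 30 minutes starts a new session here.

```python
from datetime import datetime, timedelta
from itertools import groupby

def sessions(messages, gap=timedelta(minutes=30)):
    """messages: iterable of (customer_id, timestamp) pairs.

    Returns (customer_id, session_start, session_end) tuples.
    """
    result = []
    ordered = sorted(messages, key=lambda m: (m[0], m[1]))
    for customer_id, group in groupby(ordered, key=lambda m: m[0]):
        start = end = None
        for _, ts in group:
            if start is None:
                start = end = ts          # first message of this customer
            elif ts - end < gap:
                end = ts                  # still within the current session
            else:
                result.append((customer_id, start, end))
                start = end = ts          # gap too large: open a new session
        result.append((customer_id, start, end))
    return result

messages = [
    (123, datetime(2017, 1, 1, 10, 0)),
    (123, datetime(2017, 1, 1, 10, 1)),
    (123, datetime(2017, 1, 1, 10, 2)),
    (123, datetime(2017, 1, 1, 11, 0)),
    (456, datetime(2017, 1, 1, 10, 7)),
]
for row in sessions(messages):
    print(row)
```

Because the grouping is done per customer, a message from customer 456 arriving in the middle of customer 123's conversation does not split that conversation into two sessions.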

Converting SQL Rows into Columns

I'm having some difficulty understanding the best approach to get the following result set.
I have a result set (thousands of rows) that I want to transform from:
ID Question Answer
--- -------- --------
1 Business NULL
1 Job Other
1 Location UK
2 Business Legal
3 Location US
4 Location UK
To This:
ID Business Job Location
--- -------- --- --------
1 NULL Other UK
2 Legal NULL NULL
3 NULL NULL US
4 NULL NULL UK
I have been looking at SELF JOINS and PIVOT tables but wanted to understand the best method as I have not been able to achieve the desired output.
Thanks
Gary

If you want to use pivot, you can do it like this:
CREATE TABLE #Table1
([ID] int, [Question] varchar(8), [Answer] varchar(5))
;
INSERT INTO #Table1
([ID], [Question], [Answer])
VALUES
(1, 'Business', NULL),
(1, 'Job', 'Other'),
(1, 'Location', 'UK'),
(2, 'Business', 'Legal'),
(3, 'Location', 'US'),
(4, 'Location', 'UK')
;
select * from
(select * from #Table1) S
pivot (
max(Answer) for Question in (Business, Job, Location)
) P

Alternatively, use conditional aggregation:
select
id,
max(case when question='Business' then answer end) as Business,
max(case when question='Job' then answer end) as Job,
max(case when question='Location' then answer end) as Location
from #Table1
group by id
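What both queries compute can be shown with a small Python sketch: each (ID, Question, Answer) row is dropped into the cell for its ID and question, and cells that never receive a value stay NULL. The helper name pivot is hypothetical, and the sketch assumes at most one answer per ID and question, so the max() aggregation used in SQL is not needed.

```python
def pivot(rows, columns):
    """rows: iterable of (row_id, question, answer); columns: questions to keep."""
    table = {}
    for row_id, question, answer in rows:
        # One record per ID, pre-filled with None for every output column
        record = table.setdefault(row_id, dict.fromkeys(columns))
        if question in record:
            record[question] = answer
    return table

rows = [
    (1, "Business", None),
    (1, "Job", "Other"),
    (1, "Location", "UK"),
    (2, "Business", "Legal"),
    (3, "Location", "US"),
    (4, "Location", "UK"),
]
for row_id, record in pivot(rows, ["Business", "Job", "Location"]).items():
    print(row_id, record)
```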

Selecting newest entries with different types

I'm designing a table that will contain properties of objects that change over time.
CREATE TABLE [dbo].[ObjectProperties]
(
[Id] INT NOT NULL PRIMARY KEY IDENTITY,
[ObjectType] SMALLINT NOT NULL,
[Width] SMALLINT NOT NULL,
[Height] SMALLINT NOT NULL,
[Weight] SMALLINT NOT NULL
)
Let's say I have these ObjectTypes:
1 = Chair
2 = Table
And data for this table:
-- Id is an IDENTITY column, so it is generated automatically (1, 2, 3)
INSERT INTO [dbo].[ObjectProperties] ([ObjectType], [Width], [Height], [Weight]) VALUES (1, 50, 50, 1000)
INSERT INTO [dbo].[ObjectProperties] ([ObjectType], [Width], [Height], [Weight]) VALUES (2, 80, 40, 500)
INSERT INTO [dbo].[ObjectProperties] ([ObjectType], [Width], [Height], [Weight]) VALUES (1, 50, 50, 2000)
So, as you can see, I had a Chair object whose Weight was 1000, and then I changed the weight to 2000. In effect, I'm storing a modification history of object properties.
Now I want to select the newest data from this table for each object. I know how to select the newest data for one object type at a time:
SELECT TOP 1 * FROM [ObjectProperties] WHERE ObjectType = 1 ORDER BY Id DESC
But what if I want to select several objects with one query? For example:
SELECT ... * FROM [ObjectProperties] WHERE ObjectType IN (1, 2) ...
and receive the rows with Ids 2 and 3 (because 3 has newer properties for Chair than 1).

You can use a CTE with ROW_NUMBER ranking function:
WITH CTE AS(
SELECT *,
RN=ROW_NUMBER()OVER(PARTITION BY ObjectType ORDER BY ID DESC)
FROM [ObjectProperties] op
)
SELECT * FROM CTE WHERE RN = 1
AND ObjectType IN (1, 2)
ROW_NUMBER numbers the rows within each ObjectType partition in ID DESC order, so RN = 1 selects the row with the highest ID for each type. If you want to filter by certain object types or IDs, apply the appropriate WHERE clause, either in the CTE or in the outer SELECT.

A simple (admittedly crude) way is as follows:
select * from ObjectProperties where id in
(select max(id) from ObjectProperties group by objecttype)
This gives:
Id ObjectType Width Height Weight
----------- ---------- ------ ------ ------
2 2 80 40 500
3 1 50 50 2000
