I am analyzing user patterns in Snowflake for an ecommerce website. I would like to be able to match various patterns of a user flow (e.g. did they complete an order after viewing a specific page? did they complete an order after selecting an add to cart from a particular portion of the page? etc.).
Is it possible to calculate conversion rates for multiple patterns with MATCH_RECOGNIZE?
The data structure looks something like this:
CREATE TEMPORARY TABLE events_and_visits (
VISIT_ID bigint,
EVENT_ID bigint,
EVENT_NAME VARCHAR,
REFERENCE VARCHAR
);
INSERT INTO events_and_visits VALUES
(1, 1, 'productView', 'reco'),
(1, 2, 'Add To Cart', 'reco'),
(1, 3, 'Order Complete', NULL),
(2, 4, 'productView', 'reco'),
(3, 5, 'productView', 'reco'),
(3, 6, 'Add To Cart', 'merchant'),
(4, 7, 'productView', 'reco'),
(4, 8, 'productView', 'reco'),
(4, 9, 'Add To Cart', 'merchant'),
(4, 10, 'Order Complete', NULL);
My failed attempt:
SELECT *
FROM
events_and_visits MATCH_RECOGNIZE(
PARTITION BY visit_id
ORDER BY
event_id
MEASURES
match_number() AS match_number,
classifier() AS clf
ALL ROWS PER MATCH WITH UNMATCHED ROWS
PATTERN (
(product_rec_view + atc_merchant * | atc_rec *) * oc * --THIS IS SO F*****
)
DEFINE
product_rec_view AS (
event_name = 'productView'
AND reference = 'reco'
),
atc_rec AS (
event_name = 'Add To Cart'
AND reference = 'reco'
),
atc_merchant AS (
event_name = 'Add To Cart'
AND reference = 'merchant'
),
oc AS event_name = 'Order Complete'
);
The data insert can be simplified into one statement (and thus only needs one execution):
INSERT INTO events_and_visits VALUES
(1, 1, 'productView', 'reco'),
(1, 2, 'Add To Cart', 'reco'),
(1, 3, 'Order Complete', NULL),
(2, 4, 'productView', 'reco'),
(3, 5, 'productView', 'reco'),
(3, 6, 'Add To Cart', 'merchant'),
(4, 7, 'productView', 'reco'),
(4, 8, 'productView', 'reco'),
(4, 9, 'Add To Cart', 'merchant'),
(4, 10, 'Order Complete', NULL);
Both (product_rec_view+ (atc_merchant | atc_rec))? oc* and product_rec_view+ (atc_merchant | atc_rec) oc? give me what I think you want, but it's hard to fully understand your intent:
SELECT *
FROM
events_and_visits MATCH_RECOGNIZE(
PARTITION BY visit_id
ORDER BY
event_id
MEASURES
MATCH_SEQUENCE_NUMBER() AS mseq,
match_number() AS match_number,
classifier() AS clf
ALL ROWS PER MATCH WITH UNMATCHED ROWS
PATTERN (
(product_rec_view+ (atc_merchant | atc_rec))? oc*
--product_rec_view+ (atc_merchant | atc_rec) oc?
)
DEFINE
product_rec_view AS (
event_name = 'productView' AND reference = 'reco'
),
atc_rec AS (
event_name = 'Add To Cart' AND reference = 'reco'
),
atc_merchant AS (
event_name = 'Add To Cart' AND reference = 'merchant'
),
oc AS event_name = 'Order Complete'
)
ORDER BY 1,2;
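+----------+----------+----------------+-----------+------+--------------+------------------+
| VISIT_ID | EVENT_ID | EVENT_NAME     | REFERENCE | MSEQ | MATCH_NUMBER | CLF              |
|----------+----------+----------------+-----------+------+--------------+------------------|
|        1 |        1 | productView    | reco      |    1 |            1 | PRODUCT_REC_VIEW |
|        1 |        2 | Add To Cart    | reco      |    2 |            1 | ATC_REC          |
|        1 |        3 | Order Complete |           |    3 |            1 | OC               |
|        2 |        4 | productView    | reco      |    1 |            1 |                  |
|        3 |        5 | productView    | reco      |    1 |            1 | PRODUCT_REC_VIEW |
|        3 |        6 | Add To Cart    | merchant  |    2 |            1 | ATC_MERCHANT     |
|        4 |        7 | productView    | reco      |    1 |            1 | PRODUCT_REC_VIEW |
|        4 |        8 | productView    | reco      |    2 |            1 | PRODUCT_REC_VIEW |
|        4 |        9 | Add To Cart    | merchant  |    3 |            1 | ATC_MERCHANT     |
|        4 |       10 | Order Complete |           |    4 |            1 | OC               |
+----------+----------+----------------+-----------+------+--------------+------------------+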
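The answer above labels each row but stops short of the conversion rates the question asks for. One possible way to get there, as a rough sketch rather than a definitive solution, is to aggregate the classifier labels per visit and then divide. It assumes the events_and_visits table from the question already exists in the session, uses the second (commented-out) pattern from the answer, and treats a visit as converted if it contains an OC row anywhere in a match. The names matched, per_visit and the output columns are made up for illustration.

```sql
-- Sketch only: assumes the events_and_visits table from the question exists.
WITH matched AS (
    SELECT visit_id, clf
    FROM events_and_visits MATCH_RECOGNIZE (
        PARTITION BY visit_id
        ORDER BY event_id
        MEASURES
            CLASSIFIER() AS clf
        ALL ROWS PER MATCH
        PATTERN (product_rec_view+ (atc_merchant | atc_rec) oc?)
        DEFINE
            product_rec_view AS event_name = 'productView' AND reference = 'reco',
            atc_rec          AS event_name = 'Add To Cart' AND reference = 'reco',
            atc_merchant     AS event_name = 'Add To Cart' AND reference = 'merchant',
            oc               AS event_name = 'Order Complete'
    )
),
per_visit AS (
    -- One row per visit with 0/1 flags derived from the classifier labels.
    SELECT
        visit_id,
        MAX(IFF(clf = 'ATC_REC', 1, 0))      AS atc_from_reco,
        MAX(IFF(clf = 'ATC_MERCHANT', 1, 0)) AS atc_from_merchant,
        MAX(IFF(clf = 'OC', 1, 0))           AS ordered
    FROM matched
    GROUP BY visit_id
)
SELECT
    -- NULLIF avoids a division by zero when a pattern never occurs.
    SUM(atc_from_reco * ordered)     / NULLIF(SUM(atc_from_reco), 0)     AS reco_conversion_rate,
    SUM(atc_from_merchant * ordered) / NULLIF(SUM(atc_from_merchant), 0) AS merchant_conversion_rate
FROM per_visit;
```

On the ten sample rows this should come out as 1 for the reco add-to-cart path (visit 1) and 0.5 for the merchant path (visits 3 and 4, of which only 4 ordered). Visits with several matches are collapsed to one row, so this measures conversion per visit, not per match.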
I want to define variables both before and after a CTE, because some variables depend on the result of the CTE. For example:
SET(K,B) = (5,2);
with my_data(Key,Index,Value) as (
-- data table as cte
select * from values
(1, 3, 10),
(1, 5, 18),
(1, 14, 4),
(2, 2, 11),
(2, 13, 24),
(2, 29, 40)
)
SELECT VALUE + $K
FROM my_data
This example works perfectly. But this code:
SET(K,B) = (5,2);
with my_data(Key,Index,Value) as (
-- data table as cte
select * from values
(1, 3, 10 ),
(1, 5, 18 ),
(1, 14, 4 ),
(2, 2, 11 ),
(2, 13, 24),
(2, 29, 40)
)
SET AVG_VAL = (SELECT AVG(VALUE) FROM my_data);
SELECT VALUE + $AVG_VAL
FROM my_data
doesn't, because Snowflake gives me this error:
"SQL compilation error: syntax error line 34 at position 0 unexpected 'SET'."
Should I create a temporary table to store the result of this query (SELECT AVG(VALUE) FROM my_data) in it and then include/use this temporary table for future queries instead of a variable?
Your "CTE" is not a standalone "thing"; it only exists in the context of a SELECT.
Thus
WITH cte_x AS (...)
SELECT * FROM cte_x
is one SELECT which has a CTE attached to it.
Thus, for your variable assignment, the CTE has to be "IN" the parentheses:
with my_data(Key,Index,Value) as (
select * from values
(1, 3, 10 ),
(1, 5, 18 ),
(1, 14, 4 ),
(2, 2, 11 ),
(2, 13, 24),
(2, 29, 40)
)
SELECT AVG(VALUE) FROM my_data;
AVG(VALUE)
17.833333
Given that this is a discrete chunk of SQL, it can be captured into the variable:
set AVG_VAL = (
with my_data(Key,Index,Value) as (
select * from values
(1, 3, 10 ),
(1, 5, 18 ),
(1, 14, 4 ),
(2, 2, 11 ),
(2, 13, 24),
(2, 29, 40)
)
SELECT AVG(VALUE) FROM my_data
);
status
Statement executed successfully.
Now we can use that value:
select $AVG_VAL * 2;
$AVG_VAL * 2
35.666666
But the next query:
SELECT VALUE + $AVG_VAL
FROM my_data
002003 (42S02): SQL compilation error:
Object 'MY_DATA' does not exist or not authorized.
has no CTE called my_data, so the CTE needs to be included again:
with my_data(Key,Index,Value) as (
select * from values
(1, 3, 10 ),
(1, 5, 18 ),
(1, 14, 4 ),
(2, 2, 11 ),
(2, 13, 24),
(2, 29, 40)
)
SELECT VALUE + $AVG_VAL
FROM my_data
If you want a table that can be "used twice", you will need an actual table, at which point I would suggest a temporary table so that it only exists in the context of this session.
That is the nature of Pankaj's answer (either via a permanent or a temp table).
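The temporary-table route suggested above could look roughly like the following. This is a minimal sketch, not the only way to do it: it materializes the same six sample rows once, and the column aliases and the value_plus_avg name are chosen here for illustration.

```sql
-- Materialize the rows once so later statements in this session can reuse them.
CREATE OR REPLACE TEMPORARY TABLE my_data AS
SELECT column1 AS key, column2 AS index, column3 AS value
FROM VALUES
    (1, 3, 10),
    (1, 5, 18),
    (1, 14, 4),
    (2, 2, 11),
    (2, 13, 24),
    (2, 29, 40);

-- The scalar subquery runs once and its result is stored in a session variable.
SET avg_val = (SELECT AVG(value) FROM my_data);

-- Both the table and the variable are still available in the next statement.
SELECT value, value + $avg_val AS value_plus_avg
FROM my_data;
```

The temporary table is dropped when the session ends, so nothing has to be cleaned up afterwards.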
This can be done as follows -
select * from d2;
+-----+-----+
| ID1 | ID2 |
|-----+-----|
| 1 | 2 |
| 100 | 2 |
| 3 | 4 |
| 300 | 4 |
+-----+-----+
Setting variable -
set (var1) = (select sum(id2) from d2);
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
Using variable -
select id1+$var1 from d2;
+-----------+
| ID1+$VAR1 |
|-----------|
| 13 |
| 112 |
| 15 |
| 312 |
+-----------+
An alternative approach is to simply use the windowed AVG function:
with my_data(Key,Index,Value) as (
-- data table as cte
select * from values
(1, 3, 10),
(1, 5, 18),
(1, 14, 4),
(2, 2, 11),
(2, 13, 24),
(2, 29, 40)
)
SELECT VALUE, AVG(VALUE) OVER(),
VALUE + AVG(VALUE) OVER()
FROM my_data;
OVER() means that the window used to compute average spans over all rows.
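To make the role of the empty OVER() more concrete, here is a small sketch that puts it next to a partitioned window on the same sample rows. The aliases avg_all_rows and avg_per_key are invented for this illustration.

```sql
WITH my_data(Key, Index, Value) AS (
    SELECT * FROM VALUES
        (1, 3, 10),
        (1, 5, 18),
        (1, 14, 4),
        (2, 2, 11),
        (2, 13, 24),
        (2, 29, 40)
)
SELECT Key,
       Value,
       AVG(Value) OVER ()                 AS avg_all_rows,  -- one window covering all six rows
       AVG(Value) OVER (PARTITION BY Key) AS avg_per_key    -- one window per distinct Key
FROM my_data;
```

avg_all_rows is the same on every row (107 / 6), while avg_per_key changes with the Key (32 / 3 for Key 1, 25 for Key 2).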
I have made some example data that I modify in three steps. I can't do it in one step; maybe there is a clever way with some logic? I use Microsoft SQL Server.
This code generates the four base tables with example data, followed by the step-by-step queries I want to combine. The result at the end should have 8 entries:
Reference table:
CREATE TABLE ref
(
ID int NOT NULL,
NR int NOT NULL,
CONSTRAINT PK_ref PRIMARY KEY (ID, NR)
);
INSERT INTO ref
VALUES (1234, 223), (1234, 224), (1234, 225),
(1235, 123), (1235, 124), (1236, 540),
(1236, 541), (1237, 233), (1237, 234);
Con1 table:
CREATE TABLE con1
(
NR int NOT NULL,
flag int NOT NULL,
PRIMARY KEY (NR)
);
INSERT INTO con1
VALUES (123, 0), (124, 1), (125, 0),
(220, 0), (222, 0), (223, 0),
(224, 0), (225, 1), (300, 0),
(540, 1), (541, 1);
Con2 table:
CREATE TABLE con2
(
NR int NOT NULL,
ID int NOT NULL,
PRIMARY KEY (NR)
);
INSERT INTO con2
VALUES (123, 1235), (124, 1235), (125, 1243),
(220, 1296), (222, 1255), (223, 1234),
(224, 1234), (225, 1234), (300, 1267),
(540, 1236);
Info table:
CREATE TABLE info
(
NR int NOT NULL,
SNR int NOT NULL,
SSNR int NOT NULL,
Level int NOT NULL,
CONSTRAINT PK_info PRIMARY KEY (NR, SNR, SSNR)
);
INSERT INTO info
VALUES (123, 1, 1, 1), (123, 1, 2, 2),
(123, 1, 3, 2), (123, 2, 1, 1),
(123, 2, 2, 2), (123, 2, 3, 2),
(124, 1, 1, 1), (124, 1, 2, 2),
(124, 1, 3, 2), (125, 1, 1, 1),
(125, 1, 2, 2), (125, 1, 3, 2),
(125, 1, 4, 3), (125, 1, 5, 3),
(220, 1, 1, 1), (220, 1, 2, 2),
(223, 1, 1, 1), (223, 1, 2, 2),
(224, 1, 1, 1), (224, 1, 2, 2),
(224, 1, 3, 2), (225, 1, 1, 1),
(225, 1, 2, 2), (300, 1, 1, 1),
(300, 1, 2, 2), (300, 2, 1, 1),
(300, 2, 2, 2), (540, 1, 1, 1),
(541, 1, 1, 1);
Step #1:
SELECT *
FROM con1
INNER JOIN con2 ON con1.NR = con2.NR
WHERE con1.flag = 1
Step #2:
SELECT ref.*
FROM ref
INNER JOIN step1 ON ref.ID = step1.ID
Step #3:
SELECT *
FROM step2
INNER JOIN info ON step2.NR = info.NR
WHERE info.Level = 1
I tried some different ways but always get too many resulting rows.
The result should look like this:
ID    NR   Level  SNR  SSNR
1234  223  1      1    1
1234  224  1      1    1
1234  225  1      1    1
1235  123  1      1    1
1235  123  1      2    1
1235  124  1      1    1
1236  540  1      1    1
1236  541  1      1    1
It should be all entries from info with Level = 1,
excluding:
all NR that do not occur in the intersection of con1 and con2
all NR that con1 lists with flag = 0
but including:
all excluded NR that share an ID (according to ref) with any NR that was not excluded
The result has the same columns as info, plus the ID from ref that matches each NR.
You can do this easily with Common Table Expressions:
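-- step1: IDs that have at least one flagged NR present in both con1 and con2.
-- Only the ID is selected: Select * would return NR twice, which SQL Server rejects inside a CTE.
with step1 As (
Select Distinct con2.ID
From con1
Inner Join con2 On con1.NR = con2.NR
Where con1.flag = 1
), step2 As (
-- step2: every NR that ref lists under one of those IDs
Select ref.ID, ref.NR
From ref
Inner Join step1 On ref.ID = step1.ID
)
Select step2.ID, step2.NR, info.Level, info.SNR, info.SSNR
From step2
Inner Join info On step2.NR = info.NR
Where info.Level = 1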
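Since the question mentions getting too many rows, it may also help to see the same three steps written without joining to the flagged rows at all. The sketch below is one possible variant, not the original answer's method: it moves steps 1 and 2 into an IN subquery, so an ID that has several flagged NR values still appears only once. It assumes the four tables from the question exist, and the table aliases are chosen here for illustration.

```sql
SELECT r.ID, r.NR, i.Level, i.SNR, i.SSNR
FROM ref AS r
INNER JOIN info AS i ON i.NR = r.NR
WHERE i.Level = 1
  -- IN only tests membership, so duplicate IDs in the subquery cannot multiply rows.
  AND r.ID IN (
        SELECT c2.ID
        FROM con1 AS c1
        INNER JOIN con2 AS c2 ON c2.NR = c1.NR
        WHERE c1.flag = 1
      )
ORDER BY r.ID, r.NR, i.SNR;
```

With the sample data the subquery yields the IDs 1234, 1235 and 1236, and the outer query returns the 8 rows listed in the question.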
I have 2 tables @Claims and @ClaimsActivity:
Query:
declare @Claims table (ClaimID int)
insert into @Claims
values (6070), (6080)
declare @ClaimsActivity table
(
Activityid int,
ClaimID int,
Activity int,
ActivityDate datetime,
ClaimStatus int
)
insert into @ClaimsActivity
values (1, 6070, 0, '2017-11-05 20:23:16.640', 0),
(3, 6070, 6, '2017-11-06 13:50:28.203', 0),
(4, 6070, 9, '2017-11-07 13:39:28.410', 0),
(5, 6070, 10, '2017-11-07 13:40:49.980', 0),
(7, 6070, 8, '2017-11-07 15:46:18.367', 1),
(8, 6070, 8, '2017-11-07 16:50:49.543', 1),
(9, 6070, 9, '2017-11-07 16:50:54.733', 0),
(10, 6070, 4, '2017-11-07 16:55:22.135', 0),
(11, 6070, 6, '2017-11-08 18:32:15.101', 0),
(12, 6080, 0, '2017-11-12 11:15:17.199', 0),
(13, 6080, 8, '2017-11-13 09:12:23.203', 1)
select *
from @Claims
select *
from @ClaimsActivity
order by ActivityDate
I need to add 2 columns based on data in @ClaimsActivity: IsReopened and DateReopened
The logic is:
If the last ClaimStatus (based on ActivityDate) = 1, then IsReopened = 0.
But if the last ClaimStatus = 0, then it needs to check whether any Activity = 9 (Claim Reopened).
If any Activity = 9, then IsReopened should be 1 and DateReopened should be the last date when the claim was reopened.
I have already produced the column Status of Claim, but I also need IsReopened and DateReopened:
select
Claimid,
isnull((select top 1
case when al.ClaimStatus = 1
then 'Closed'
else 'Open'
end
from
@ClaimsActivity al
where
C.ClaimID = al.ClaimID
order by
al.ActivityDate desc), 'Open') as 'Status of Claim',
NULL as 'isReopen',
NULL as 'DateReopened'
from
@Claims c
There are many different ways you can accomplish this, but here is an example using CROSS APPLY and OUTER APPLY:
SELECT
ClaimID,
CASE WHEN tmp.IsOpen = 1 THEN 'Open' ELSE 'Closed' END AS 'Status of Claim',
CASE WHEN tmp.IsOpen = 1 AND lastReopen.Activityid IS NOT NULL THEN 1 ELSE 0 END AS 'isReopen',
lastReopen.ActivityDate AS 'DateReopened'
FROM @Claims c
CROSS APPLY (
SELECT ISNULL((
SELECT TOP 1 CASE WHEN al.ClaimStatus = 1 THEN 0 ELSE 1 END
FROM @ClaimsActivity al
WHERE c.ClaimID = al.ClaimID
ORDER BY al.ActivityDate DESC
), 1) AS IsOpen
) tmp
OUTER APPLY (
SELECT TOP 1
al.Activityid,
al.ActivityDate
FROM @ClaimsActivity al
WHERE c.ClaimID = al.ClaimID AND al.Activity = 9
ORDER BY al.ActivityDate DESC
) lastReopen
The CROSS APPLY is just used to produce a column that tells us whether a claim is open or closed, which we can reuse throughout the rest of the query.
The OUTER APPLY is used to grab the last "reopen" activity for each claim, of which you want the date.
I can't attest to the performance of this query, but it should at least give you the correct results.
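As one of the "many different ways" mentioned above, the same logic can also be sketched with ROW_NUMBER() and conditional aggregation instead of APPLY. This is an illustrative variant under two assumptions: DateReopened is only reported while the claim is currently open, and a claim without any activity counts as open (as in the question's ISNULL fallback). The names ranked, summary, LastStatus and LastReopenDate are invented here. The table variables are declared again because they only live for one batch.

```sql
DECLARE @Claims TABLE (ClaimID int);
INSERT INTO @Claims VALUES (6070), (6080);

DECLARE @ClaimsActivity TABLE
    (Activityid int, ClaimID int, Activity int, ActivityDate datetime, ClaimStatus int);
INSERT INTO @ClaimsActivity VALUES
    (1, 6070, 0, '2017-11-05 20:23:16.640', 0),
    (3, 6070, 6, '2017-11-06 13:50:28.203', 0),
    (4, 6070, 9, '2017-11-07 13:39:28.410', 0),
    (5, 6070, 10, '2017-11-07 13:40:49.980', 0),
    (7, 6070, 8, '2017-11-07 15:46:18.367', 1),
    (8, 6070, 8, '2017-11-07 16:50:49.543', 1),
    (9, 6070, 9, '2017-11-07 16:50:54.733', 0),
    (10, 6070, 4, '2017-11-07 16:55:22.135', 0),
    (11, 6070, 6, '2017-11-08 18:32:15.101', 0),
    (12, 6080, 0, '2017-11-12 11:15:17.199', 0),
    (13, 6080, 8, '2017-11-13 09:12:23.203', 1);

WITH ranked AS (
    -- rn = 1 marks the most recent activity of each claim.
    SELECT ClaimID, Activity, ActivityDate, ClaimStatus,
           ROW_NUMBER() OVER (PARTITION BY ClaimID ORDER BY ActivityDate DESC) AS rn
    FROM @ClaimsActivity
), summary AS (
    SELECT ClaimID,
           MAX(CASE WHEN rn = 1 THEN ClaimStatus END)       AS LastStatus,
           MAX(CASE WHEN Activity = 9 THEN ActivityDate END) AS LastReopenDate
    FROM ranked
    GROUP BY ClaimID
)
SELECT c.ClaimID,
       CASE WHEN s.LastStatus = 1 THEN 'Closed' ELSE 'Open' END AS [Status of Claim],
       CASE WHEN ISNULL(s.LastStatus, 0) = 0 AND s.LastReopenDate IS NOT NULL
            THEN 1 ELSE 0 END                                   AS IsReopened,
       CASE WHEN ISNULL(s.LastStatus, 0) = 0
            THEN s.LastReopenDate END                           AS DateReopened
FROM @Claims AS c
LEFT JOIN summary AS s ON s.ClaimID = c.ClaimID;  -- LEFT JOIN keeps claims without activity
```

For the sample data, claim 6070 should come out as Open with IsReopened = 1 and the reopen date of activity 9 on 2017-11-07 16:50:54.733, and claim 6080 as Closed with IsReopened = 0 and a NULL date.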
First, I apologize if the title won't make sense but below is the detailed scenario.
Say I have a document_revision table:
id document_id phase_id user_id
1 1 3 1
2 1 2 1
3 1 1 1
4 2 3 2
5 2 2 2
where phase_id is: transcribe = 3; proof = 2; and submit = 1.
I would like to write a query that filters the revision records so that a proof phase is disregarded if the same user did both the transcribe and the proof. So the output would be:
id document_id phase_id user_id
1 1 3 1
3 1 1 1
4 2 3 2
I've been struggling for hours figuring out a query for this but no luck so far.
Assuming you only want the phase 3 for any case where a user_id was involved in phase 2 and 3, then one way you could do this is with ROW_NUMBER(), e.g.:
DECLARE @T TABLE (ID INT IDENTITY(1, 1), Document_ID INT, Phase_ID INT, [User_ID] INT);
INSERT @T (Document_ID, Phase_ID, [User_ID]) VALUES
(1, 1, 1), (1, 2, 1), (1, 3, 1), (2, 3, 2), (2, 2, 2), (3, 1, 1), (3, 2, 1), (3, 3, 2);
SELECT ID, Document_ID, Phase_ID, [User_ID]
FROM
(
SELECT *, RN = ROW_NUMBER() OVER (PARTITION BY Document_ID, [User_ID], CASE WHEN Phase_ID IN (2, 3) THEN 2 ELSE Phase_ID END ORDER BY Phase_ID DESC)
FROM @T
) AS T
WHERE RN = 1;
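The ROW_NUMBER() query above works because the CASE expression puts phases 2 and 3 into the same partition. To see that, the sketch below exposes the partition key and the row number as ordinary columns, using only the five rows from the question. The column name PhaseBucket is invented for this illustration.

```sql
DECLARE @T TABLE (ID INT IDENTITY(1, 1), Document_ID INT, Phase_ID INT, [User_ID] INT);
INSERT @T (Document_ID, Phase_ID, [User_ID]) VALUES
    (1, 3, 1), (1, 2, 1), (1, 1, 1), (2, 3, 2), (2, 2, 2);

SELECT ID, Document_ID, Phase_ID, [User_ID],
       -- Phases 2 (proof) and 3 (transcribe) share bucket 2; phase 1 keeps its own bucket.
       CASE WHEN Phase_ID IN (2, 3) THEN 2 ELSE Phase_ID END AS PhaseBucket,
       -- Within a bucket the highest phase is numbered first, so transcribe beats proof.
       ROW_NUMBER() OVER (
           PARTITION BY Document_ID, [User_ID],
                        CASE WHEN Phase_ID IN (2, 3) THEN 2 ELSE Phase_ID END
           ORDER BY Phase_ID DESC) AS RN
FROM @T
ORDER BY Document_ID, Phase_ID DESC;
```

Both proof rows end up with RN = 2 because a transcribe row by the same user exists for the same document, so the WHERE RN = 1 filter drops exactly those two rows.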
Alternatively, using NOT EXISTS:
DECLARE @document_revision TABLE (
id INT IDENTITY(1,1),
document_id INT,
phase_id INT,
user_id INT
);
INSERT INTO @document_revision
(document_id, phase_id, user_id)
VALUES
(1, 3, 1),
(1, 2, 1),
(1, 1, 1),
(2, 3, 2),
(2, 2, 2),
-- To test a scenario where there is a proof and a submit with no transcribe phases and same document
(3, 2, 3),
(3, 1, 3),
-- To test a scenario where there is a transcribe and a submit with no proof phases and same document
(4, 3, 4),
(4, 1, 4),
-- To test a scenario where there is a proof with no transcribe phase (for document_id 5) but different document and same user as above
(5, 2, 4);
SELECT dr.id
, dr.document_id
, dr.phase_id
, dr.user_id
FROM @document_revision AS dr
WHERE NOT EXISTS ( SELECT 1
FROM @document_revision AS temp
-- Same user
WHERE temp.user_id = dr.user_id
-- Same document
AND temp.document_id = dr.document_id
-- To check if there is already a transcribe phase_id with the same user_id and document_id
AND temp.phase_id = 3
-- Only proof rows (phase_id = 2) are candidates for exclusion
AND dr.phase_id = 2 )
Results:
id document_id phase_id user_id
1 1 3 1
3 1 1 1
4 2 3 2
6 3 2 3
7 3 1 3
8 4 3 4
9 4 1 4
10 5 2 4
The SELECT query does not work when I use a variable in MSSQL 2014
My schema is:
CREATE TABLE product
(idproduct int, name varchar(50), description varchar(50), tax decimal(18,0))
INSERT INTO product
(idproduct, name, description,tax)
VALUES
(1, 'abc', 'This is abc',10),
(2, 'xyz', 'This is xyz',20),
(3, 'pqr', 'This is pqr',15)
CREATE TABLE product_storage
(idstorage int,idproduct int,added datetime, quantity int, price decimal(18,0))
INSERT INTO product_storage
(idstorage,idproduct, added, quantity,price)
VALUES
(1, 1, 2010-01-01,0,10.0),
(2, 1, 2010-01-02,0,11.0),
(3, 1, 2010-01-03,10,12.0),
(4, 2, 2010-01-04,0,12.0),
(5, 2, 2010-01-05,10,11.0),
(6, 2, 2010-01-06,10,13.0),
(7, 3, 2010-01-07,10,14.0),
(8, 3, 2010-01-07,10,16.0),
(9, 3, 2010-01-09,10,13.0)
and I am executing the command below:
declare @price1 varchar(10)
SELECT p.idproduct, p.name, p.tax,
[@price1]=(SELECT top 1 s.price
FROM product_storage s
WHERE s.idproduct=p.idproduct AND s.quantity > 0
ORDER BY s.added ASC),
(@price1 * (1 + tax/100)) AS [price_with_tax]
FROM product p
;
This does not work in MSSQL. Please help me out.
For details, see http://sqlfiddle.com/#!6/91ec2/296
My query does work in MySQL.
For details, see http://sqlfiddle.com/#!9/a71b8/1
Try this query:
SELECT
p.idproduct
, p.name
, p.tax
, (t1.price * (1 + tax/100)) AS [price_with_tax]
FROM product p
inner join
(
SELECT ROW_NUMBER() over (PARTITION by s.idproduct order by s.added ASC) as linha, s.idproduct, s.price
FROM product_storage s
WHERE s.quantity > 0
) as t1
on t1.idproduct = p.idproduct and t1.linha = 1
Try it like this:
Explanation: You cannot use a variable "on the fly", but you can do a row-by-row calculation in an APPLY...
SELECT p.idproduct, p.name, p.tax,
Price.price1,
(price1 * (1 + tax/100)) AS [price_with_tax]
FROM product p
CROSS APPLY (SELECT top 1 s.price
FROM product_storage s
WHERE s.idproduct=p.idproduct AND s.quantity > 0
ORDER BY s.added ASC) AS Price(price1)
;
EDIT: Your Fiddle uses a bad literal date format, try this:
INSERT INTO product_storage
(idstorage,idproduct, added, quantity,price)
VALUES
(1, 1, '20100101',0,10.0),
(2, 1, '20100102',0,11.0),
(3, 1, '20100103',10,12.0),
(4, 2, '20100104',0,12.0),
(5, 2, '20100105',10,11.0),
(6, 2, '20100106',10,13.0),
(7, 3, '20100107',10,14.0),
(8, 3, '20100108',10,16.0),
(9, 3, '20100109',10,13.0)
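To illustrate why the unquoted dates in the fiddle are a problem: without quotes, SQL Server reads 2010-01-01 as integer subtraction, not as a date. The short sketch below compares the two forms; the column aliases are invented for this illustration.

```sql
SELECT 2010-01-01                    AS unquoted_value,       -- integer arithmetic: 2010 - 1 - 1 = 2008
       CONVERT(datetime, 2010-01-01) AS unquoted_as_datetime, -- 2008 days after 1900-01-01
       CONVERT(datetime, '20100101') AS quoted_as_datetime;   -- 1 January 2010
```

The unquoted form is still accepted because an int converts implicitly to datetime as a day offset from 1900-01-01, which is why the original INSERT raised no error but stored the wrong dates.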
Here is the corrected schema for SQL Server; the query runs perfectly, as Shnugo replied.
VALUES
(1, 1, convert(datetime,'2010-01-01'),0,10.0),
(2, 1, convert(datetime,'2010-01-02'),0,11.0),
(3, 1, convert(datetime,'2010-01-03'),10,12.0),
(4, 2, convert(datetime,'2010-01-04'),0,12.0),
(5, 2, convert(datetime,'2010-01-05'),10,11.0),
(6, 2, convert(datetime,'2010-01-06'),10,13.0),
(7, 3, convert(datetime,'2010-01-07'),10,14.0),
(8, 3, convert(datetime,'2010-01-07'),10,16.0),
(9, 3, convert(datetime,'2010-01-09'),10,13.0)