FIRST() and LAST() for MATCH_RECOGNIZE - snowflake-cloud-data-platform

We are analyzing streaming Twitter data to find users who are posting similar (almost identical) tweets over and over. I am using MATCH_RECOGNIZE for this. It is able to find the pattern, but I am not able to get the FIRST() and LAST() values correctly. Here is a sample dataset:
I am using the following Query:
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
This correctly identifies the users that are posting the same tweets over and over, but I am not able to get the first tweet and the last tweet in the matching sequence.

There are multiple things at play here.
The first is that only one row actually triggers a match (the LAG-based SIMILAR condition can never be true on the first row of the partition), so FIRST and LAST both point at the second row of your data. This can be seen by changing to ALL ROWS PER MATCH:
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa')
)
SELECT
*
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ALL ROWS PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID   TWEETID   TWEET   MATCH_NUMBER   FIRST_TWEET   LAST_TWEET   FIRST_TWEET_ID   LAST_TWEET_ID   NUM_OF_TWEETS
elena    2         aaaa    1              aaaa          aaaa         2                2               1
If you change to a pattern that catches the first row as well as the lagging rows:
ALL ROWS PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
you now match both the first and the later rows:
USERID   TWEETID   TWEET   MATCH_NUMBER   FIRST_TWEET   LAST_TWEET   FIRST_TWEET_ID   LAST_TWEET_ID   NUM_OF_TWEETS
elena    1         aaa     1              aaa           aaa          1                1               1
elena    2         aaaa    1              aaa           aaaa         1                2               2
Now if we expand our test a little more, with four rows of data:
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('elena', 3, 'aaa'),
('elena', 4, 'aaaa')
)
USERID   TWEETID   TWEET   MATCH_NUMBER   FIRST_TWEET   LAST_TWEET   FIRST_TWEET_ID   LAST_TWEET_ID   NUM_OF_TWEETS
elena    1         aaa     1              aaa           aaa          1                1               1
elena    2         aaaa    1              aaa           aaaa         1                2               2
elena    3         aaa     1              aaa           aaa          1                3               3
elena    4         aaaa    1              aaa           aaaa         1                4               4
we see those values are not double-registering.
BUT we also see that while the first ID is correct for all rows, the last is scoped to the current row of the match, so it is not the value after all matching rows as you were hoping.
If we flip back to ONE ROW PER MATCH, however, we do get the results we are expecting:
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('scott', 3, 'aaaa'),
('eva', 4, 'bbbb'),
('eva', 5, 'bbbbb'),
('amy', 4, 'eeee'),
('amy', 5, 'zzzz')
)
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID   NUM_OF_TWEETS   FIRST_TWEET   LAST_TWEET   FIRST_TWEET_ID   LAST_TWEET_ID
elena    2               aaa           aaaa         1                2
eva      2               bbbb          bbbbb        4                5
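Worth noting (my observation, not part of the original answer): scott and amy drop out automatically here. The pattern SIMILAR_before SIMILAR_after+ needs at least two rows per match, so a user with a single tweet (scott) or with tweets that are not similar to each other (amy) never produces a match and never appears in the ONE ROW PER MATCH output.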

Naturally, I was working on this while Simeon was attacking the same problem. I ran into similar issues: the measure logic applies to the window frame of the match, so you have to account for certain functions only operating on the matched rows, which is how the first row gets missed.
I did an old-school approach, nesting views to incrementally address the problem.
Both solve the problem at hand - and while I like the use of MATCH_RECOGNIZE in the provided answer (it's more elegant as a single query), it may be difficult for others to understand.
--
-- create test table
--
create or replace table tweets (
userid varchar,
tweetid integer,
tweet varchar
);
--
-- create test data
--
insert into
tweets (userid, tweetid, tweet)
values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('scott', 3, 'aaaa'),
('eva', 4, 'bbbb'),
('eva', 5, 'bbbbb'),
('amy', 4, 'eeee'),
('amy', 5, 'zzzz');
--
-- Baseline view showing matching tweets by user
--
CREATE OR REPLACE VIEW MATCHES AS (
SELECT
T1.USERID,
T1.TWEETID AS TWEETID,
T2.TWEETID AS MATCHING_TWEETID
FROM
TWEETS T1,
TWEETS T2
WHERE
T1.USERID = T2.USERID
AND JAROWINKLER_SIMILARITY(T1.TWEET, T2.TWEET) > 90
);
--
-- create a view of non-repeating tweets
--
create or replace view single_tweets as (
select
userid,
tweetid,
count(*) as num_tweets
from
matches
group by
userid,
tweetid
having
count(*) = 1
);
select * from single_tweets;
--
-- Create a view of only repeating tweets by tweetid
--
create or replace view repeating_tweets as (
select
userid,
tweetid,
matching_tweetid
from
matches
where
(userid, tweetid) not in (
select
userid, tweetid
from
single_tweets
)
and (userid,tweetid) not in (
select
userid, tweetid
from
matches
where
matching_tweetid < tweetid
)
order by
tweetid,
matching_tweetid
);
--
-- only report repeating tweets
--
select
t.userid,
min(t.tweet) as FIRST_TWEET,
max(t.tweet) as LAST_TWEET,
min(t.tweetid) as FIRST_TWEETID,
max(t.tweetid) as LAST_TWEETID,
count(rt.matching_tweetid) as num_tweets
from
tweets t,
repeating_tweets rt
where
t.userid = rt.userid
and t.tweetid = rt.matching_tweetid
group by
t.userid,
rt.tweetid;
Results:
USERID FIRST_TWEET LAST_TWEET FIRST_TWEETID LAST_TWEETID NUM_TWEETS
eva bbbb bbbbb 4 5 2
elena aaa aaaa 1 2 2
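For comparison, the same output for this small dataset can also come from a single self-join aggregate. This is only a sketch under the same assumptions the views make (first/last are picked by value, and each user has at most one chain of similar tweets), not a general replacement:
--
-- single-query alternative (sketch)
--
select
    t1.userid,
    min(t1.tweet)                  as first_tweet,
    max(t2.tweet)                  as last_tweet,
    min(t1.tweetid)                as first_tweetid,
    max(t2.tweetid)                as last_tweetid,
    count(distinct t2.tweetid) + 1 as num_tweets   -- +1 adds back the earliest tweet of the chain
from tweets t1
join tweets t2
  on t1.userid = t2.userid
 and t1.tweetid < t2.tweetid
 and jarowinkler_similarity(t1.tweet, t2.tweet) > 90
group by t1.userid;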

Related

Group records based on time interval starting from timestamp of first record in each group

Struggling with this; need to group records within a specific time interval starting from the first timestamp (FREEZE_TIME) - but the first record outside the first group is the starting point for the time interval for the next group and so on. Expected result, THAW_COUNT, is the count of all groups for a PARENT_SAMPLE_ID. So for table:
SAMPLE_ID   FREEZE_TIME           PARENT_SAMPLE_ID
1           null                  null
2           2015-11-27 10:23:10   1
3           2015-11-27 10:59:23   1
4           2015-11-27 11:05:43   1
5           2015-11-27 12:53:48   1
6           2015-11-27 13:42:25   1
I would like to get a result of:
PARENT_SAMPLE_ID   THAW_COUNT
1                  2
So sample_ids 2, 3 and 4 should be in the same group, and sample_ids 5 and 6 in the next group.
I have tried something like:
with SampleList as
(
select PARENT_SAMPLE_ID, FREEZE_TIME,
ROW_NUMBER() OVER (partition by PARENT_SAMPLE_ID order by FREEZE_TIME asc) RN
from
SAMPLE
)
,
FirstSample as
(
select PARENT_SAMPLE_ID, FREEZE_TIME
from SampleList
where RN = 1
)
,
SelectedSample as
(
select
s.PARENT_SAMPLE_ID,
ABS(DATEDIFF(MINUTE, s.FREEZE_TIME, sFirst.FREEZE_TIME))/60 DiffToFirst
from SampleList s
inner join FirstSample sFirst ON s.PARENT_SAMPLE_ID = sFirst.PARENT_SAMPLE_ID
group by s.PARENT_SAMPLE_ID, ABS(DATEDIFF(MINUTE, s.FREEZE_TIME, sFirst.FREEZE_TIME))/60
)
select PARENT_SAMPLE_ID, count(*) THAW_COUNT
from SelectedSample
group by PARENT_SAMPLE_ID
But this will return a THAW_COUNT of 3, because sample_ids 5 and 6 end up in different groups: the grouping is based on hour intervals measured from the freeze time of sample_id 2 only. How do I get the grouping for group 2 to start from the first record outside the first group (sample_id 5), and so on?
This can be treated as a gaps-and-islands problem. Using some window functions to check counts, and using LAG to look at the "previous" row, we can solve this. If you have multiple values for SAMPLE_ID you will want to add some partitioning.
create table #Something
(
SAMPLE_ID int
, FREEZE_TIME datetime
, PARENT_SAMPLE_ID int
)
insert #Something
select 1, null, null union all
select 2, '2015-11-27 10:23:10', 1 union all
select 3, '2015-11-27 10:59:23', 1 union all
select 4, '2015-11-27 11:05:43', 1 union all
select 5, '2015-11-27 12:53:48', 1 union all
select 6, '2015-11-27 13:42:25', 1;
with MyGroups as
(
select *
, GroupNum = count(IsNewGroup) over (order by FREEZE_TIME rows unbounded preceding)
from
(
select *
, IsNewGroup = case when LAG(FREEZE_TIME, 1, '') over(order by FREEZE_TIME) < dateadd(hour, -1, FREEZE_TIME) then 1 end
from #Something
) x
)
select coalesce(PARENT_SAMPLE_ID, SAMPLE_ID)
, count(distinct GroupNum)
from MyGroups
group by coalesce(PARENT_SAMPLE_ID, SAMPLE_ID)
drop table #Something

Find duplicate rows and show only the earliest

I have the following table:
respid, uploadtime
I need a query that will show all the records where respid is duplicated, except the latest one (by upload time).
Example:
4 2014-01-01
4 2014-06-01
4 2015-01-01
4 2015-06-01
4 2016-01-01
In this case the query should return four records (the latest is 4 2016-01-01).
Thank you very much.
Use ROW_NUMBER:
WITH cte AS (
SELECT respid, uploadtime,
ROW_NUMBER() OVER (PARTITION BY respid ORDER BY uploadtime DESC) rn
FROM yourTable
)
SELECT respid, uploadtime
FROM cte
WHERE rn > 1
ORDER BY respid, uploadtime;
The logic here is to return all records except those with row number 1, which are the latest records in each respid group.
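As a follow-up (not asked for in the question, just a common next step): if the goal were to remove those older duplicates instead of listing them, the same CTE can drive a DELETE in SQL Server, since deleting through the CTE affects the underlying table:
WITH cte AS (
    SELECT respid, uploadtime,
           ROW_NUMBER() OVER (PARTITION BY respid ORDER BY uploadtime DESC) rn
    FROM yourTable
)
DELETE FROM cte   -- removes every row except the latest per respid
WHERE rn > 1;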
If I interpreted your question correctly, then you want to see all records where respid occurs multiple times, but exclude the last duplicate.
Translating this to SQL could sound like "show all records that have a later record for the same respid". That is exactly what the solution below does. It says that for every row in the result, a later record with the same respid must exist.
Sample data
declare #MyTable table
(
respid int,
uploadtime date
);
insert into #MyTable (respid, uploadtime) values
(4, '2014-01-01'),
(4, '2014-06-01'),
(4, '2015-01-01'),
(4, '2015-06-01'),
(4, '2016-01-01'), --> last duplicate of respid=4, not part of result
(5, '2020-01-01'); --> has no duplicate, not part of result
Solution
select mt.respid, mt.uploadtime
from #MyTable mt
where exists ( select top 1 'x'
from #MyTable mt2
where mt2.respid = mt.respid
and mt2.uploadtime > mt.uploadtime );
Result
respid uploadtime
----------- ----------
4 2014-01-01
4 2014-06-01
4 2015-01-01
4 2015-06-01

Select rows with same id but different value in another column - SQL Server

I have a table that shows the chain of custody of a piece of evidence as it is being analyzed. The evidence may just be analyzed at one location, but can also be transferred to different labs to be analyzed. I am trying to write a query that returns the Case Number if it is analyzed in different labs.
Here is an example of the data: CaseNumber 1 starts off in the Chemistry Lab, then transferred to DNA lab, then transferred back to Chemistry Lab. Case 2 is just associated in the Chem lab (do not need to see this case in the query results).
ID CaseNumber Lab ActionDate Action
-------------------------------------------------------------------
1 1 Chem 1/1/2019 case created
2 1 Chem 1/2/2019 container created
3 1 DNA 2/1/2019 container routed to DNA
4 1 DNA 2/3/2019 evidence analyzed
5 1 Chem 2/3/2019 edit route
6 2 Chem 2/4/2019 create case
7 2 Chem 2/5/2019 analyze evidence
Here is what I have so far. This returns casenumber and unique lab but I would like to incorporate the ActionDate somehow to show the actual date it was routed to the new location.
SELECT DISTINCT
casenumber, lab
FROM
summary
WHERE
casenumber IN (SELECT a.casenumber
FROM summary a
JOIN summary b on b.casenumber = a.casenumber AND b.lab <> a.lab)
AND actiondate BETWEEN '2019-01-02' AND '2019-01-04'
ORDER BY
casenumber
I expect the results of the query to look like the following. I would like to see the first entry per Lab (since that is the date it was actually routed to the new location)
ID CaseNumber Lab ActionDate Action
-------------------------------------------------------------------
1 1 Chem 1/1/2019 case created
3 1 DNA 2/1/2019 container routed to DNA
5 1 Chem 2/3/2019 container routed to Chem
There are plenty of ways to accomplish this; here is one of them. Note this will only work on SQL Server 2012+.
create table #Something
(
ID int
, CaseNumber int
, Lab varchar(10)
, ActionDate date
, Action varchar(50)
)
insert #Something values
(1, 1, 'Chem', '1/1/2019', 'case created')
, (2, 1, 'Chem', '1/2/2019', 'container created')
, (3, 1, 'DNA', '2/1/2019', 'container routed to DNA')
, (4, 1, 'DNA', '2/3/2019', 'evidence analyzed')
, (5, 1, 'Chem', '2/3/2019', 'edit route')
, (6, 2, 'Chem', '2/4/2019', 'create case')
, (7, 2, 'Chem', '2/5/2019', 'analyze evidence')
;
with SortedResults as
(
select s.ID
, s.CaseNumber
, s.Lab
, s.ActionDate
, Action = case when LAG(Lab, 1) over(partition by CaseNumber order by ActionDate) is null then Action
when LAG(Lab, 1) over(partition by CaseNumber order by ActionDate) = Lab then NULL
else 'container routed to ' + Lab end
from #Something s
)
select *
from SortedResults
where Action > ''
order by CaseNumber, ActionDate
drop table #Something

SQL - Query log from Users table

Based on the following example (it is a "QueryLog" table; this table stores interactions between a user and two different products, N and R):
Id Date UserID Product
--------------------------------------------------
0 2013-06-09 14:50:24.000 100 N
1 2013-06-09 15:27:23.000 100 N
2 2013-06-09 15:29:23.000 100 N
3 2013-06-17 15:31:23.000 100 N
4 2013-06-17 15:32:23.000 100 N
5 2014-05-19 15:30:23.000 250 N
6 2014-07-19 15:27:23.000 250 N
7 2014-07-19 15:27:23.000 333 R
8 2014-08-19 15:27:23.000 333 R
Expected results :
Count
-----
1
(Only UserID 250 is inside my criteria)
If a user interacts 10 times with the product in only one month, they are not inside my criteria.
To sum up, I am looking for:
the number of distinct users that had interactions with product N in more than one month (whatever the number of interactions that user may have had during a single month).
This is the code I've tried:
select distinct v.UserID, v.mois , v.annee
from
(select c.UserID, c.mois, c.annee, COUNT(c.UserID) as frequence
from
(
SELECT
datepart(month,[DATE]) as mois,
datepart(YEAR,[DATE]) as annee ,
Username,
UserID,
Product
FROM QueryLog
where Product = 'N'
) c
group by c.UserID, c.annee, c.mois
) v
group by v.UserID, v.mois, v.annee
try this:
DECLARE #YourTable table (Id int, [Date] datetime, UserID int, Product char(1))
INSERT INTO #YourTable VALUES (0,'2013-06-09 14:50:24',100 ,'N')
,(1,'2013-06-09 15:27:23',100 ,'N')
,(2,'2013-06-09 15:29:23',100 ,'N')
,(3,'2013-06-17 15:31:23',100 ,'N')
,(4,'2013-06-17 15:32:23',100 ,'N')
,(5,'2014-05-19 15:30:23',250 ,'N')
,(6,'2014-07-19 15:27:23',250 ,'N')
,(7,'2014-07-19 15:27:23',333 ,'R')
,(8,'2014-08-19 15:27:23',333 ,'R')
;WITH MultiMonthUsers AS
(
select
UserID
FROM (select
UserID
FROM #YourTable
WHERE product='N'
GROUP BY UserID, YEAR([Date]),MONTH([Date])
)dt2
GROUP BY UserID
HAVING COUNT(*)>1
)
SELECT COUNT(*) FROM MultiMonthUsers
Depending on the number of rows and the indexes available, this can run slowly: grouping on YEAR([Date]), MONTH([Date]) prevents any index usage on [Date].
I think this will do it, but I need a better dataset to test with:
SELECT COUNT(*)
FROM (
--roll all month/user records into single row
SELECT UserID, datediff(month, 0, [date]) As MonthGroup
FROM QueryLog
WHERE Product='N'
GROUP BY datediff(month, 0, [date]), UserId
) t
-- look for users with multiple rows
GROUP BY UserID
HAVING COUNT(UserID) > 1
Seems like there should be a way to roll this up further, to avoid the need for the nested select.
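One possible roll-up, as a sketch: counting distinct month buckets in a HAVING clause removes the inner month/user level, although the outer derived table is still needed to count the qualifying users.
SELECT COUNT(*) AS MultiMonthUsers
FROM (
    SELECT UserID
    FROM QueryLog
    WHERE Product = 'N'
    GROUP BY UserID
    HAVING COUNT(DISTINCT datediff(month, 0, [Date])) > 1  -- active in more than one calendar month
) t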

Tsql group by clause with exceptions

I have a problem with a query.
This is the data (order by Timestamp):
Data
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract the distinct values and the first occurrence of the date. The exception here is that I need to group them only if they are not interrupted by a new value in that timeframe.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I am sure there is an easier way to do it, I just can't think of it. Could anyone help?
This is what I started with, and could probably work from. It is a query that should locate when a value changes.
> SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
> d1.Value <> d2.Value
It probably could be done with a good use of the ROW_NUMBER clause, but I can't manage it.
Sample data:
create table #T (ID int, Value int, Timestamp date)
insert into #T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
select *,ROW_NUMBER() OVER (ORDER By TimeStamp) as rn --TODO - specific columns better than *
from #T
), Firsts as (
select
ov1.* --TODO - specific columns better than *
from
OrderedValues ov1
left join
OrderedValues ov2
on
ov1.Value = ov2.Value and
ov1.rn = ov2.rn + 1
where
ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If they are, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second CTE simply finds rows where there isn't an immediately preceding row with the same Value.
Result:
ID Value Timestamp rn
----------- ----------- ---------- --------------------
1 0 2001-01-01 1
3 1 2003-01-01 3
5 0 2005-01-01 5
6 2 2006-01-01 6
You can order by rn if you need the results in this specific order.
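Since the question mentions ROW_NUMBER and window functions: on SQL Server 2012+ the same "first row of each run" filter can also be written with LAG. A minimal sketch against the #T sample above:
;with Flagged as (
    select *,
           LAG(Value) over (order by Timestamp) as PrevValue
    from #T
)
select ID, Value, Timestamp
from Flagged
where PrevValue is null      -- very first row
   or PrevValue <> Value     -- value changed compared to the previous row
order by Timestamp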
