Altering a QUALIFY with an additional criterion - snowflake-cloud-data-platform

In Snowflake I have this original query which, for a given consumer_ID, produces a list of unique store IDs.
SELECT
t.consumer_id
, t.business_id
, t.store_id
, t.campaign_id
FROM campaigns_mini AS t
QUALIFY ROW_NUMBER() OVER (PARTITION BY t.consumer_id, t.store_id ORDER BY t.campaign_id) = 1
The original purpose was to provide a list that does not duplicate store_id for a given consumer_id. Suppose now I also need to ensure this list does not duplicate business_id as well for a given consumer_ID. Is there an easy way to modify the above?

SELECT
t.consumer_id
, t.business_id
, t.store_id
, t.campaign_id
FROM campaigns_mini AS t
QUALIFY ROW_NUMBER() OVER
(PARTITION BY t.consumer_id
,t.store_id
,t.business_id
ORDER BY t.campaign_id) = 1
The partition by clause forms windows by the combination of all the expressions in the clause.
This will deduplicate by the combination of consumer_id, store_id, and business_id. If this is not what you need, please update with sample input and output to clarify.

So if I make up some data:
WITH campaigns_mini(consumer_id, business_id, store_id, campaign_id) as (
select * from values
(1,10,100,1000),
(1,10,100,1001),
(1,10,101,1002),
(2,20,200,2000)
)
and use your existing SQL:
SELECT
t.consumer_id
,t.business_id
,t.store_id
,t.campaign_id
FROM campaigns_mini AS t
QUALIFY ROW_NUMBER() OVER (PARTITION BY t.consumer_id, t.store_id ORDER BY t.campaign_id) = 1
I get
CONSUMER_ID |BUSINESS_ID |STORE_ID |CAMPAIGN_ID
1 |10 |101 |1002
1 |10 |100 |1000
2 |20 |200 |2000
So the store is not repeated for the consumer, but as you note, you don't want the business repeated either.
If we change to using business_id instead of store_id, we get fewer rows:
SELECT
t.consumer_id
,t.business_id
,t.store_id
,t.campaign_id
FROM campaigns_mini AS t
QUALIFY ROW_NUMBER() OVER (PARTITION BY t.consumer_id, t.business_id ORDER BY t.campaign_id) = 1
ORDER BY 1;
CONSUMER_ID |BUSINESS_ID |STORE_ID |CAMPAIGN_ID
1 |10 |100 |1000
2 |20 |200 |2000
So if we want "no repeating business_id AND no repeating store_id", the QUALIFY that Greg proposed will not help, because it keeps the first row for each distinct combination of consumer, business, and store:
QUALIFY ROW_NUMBER() OVER (PARTITION BY t.consumer_id, t.business_id, t.store_id ORDER BY t.campaign_id) = 1
which gives:
CONSUMER_ID |BUSINESS_ID |STORE_ID |CAMPAIGN_ID
1 |10 |100 |1000
1 |10 |101 |1002
2 |20 |200 |2000
So the next idea is: why not keep only the rows that come first in both partitionings?
QUALIFY ROW_NUMBER() OVER (PARTITION BY t.consumer_id, t.store_id ORDER BY t.campaign_id) = 1
AND ROW_NUMBER() OVER (PARTITION BY t.consumer_id, t.business_id ORDER BY t.campaign_id) = 1
which for this data works!
CONSUMER_ID |BUSINESS_ID |STORE_ID |CAMPAIGN_ID
1 |10 |100 |1000
2 |20 |200 |2000
but then for this data:
WITH campaigns_mini(consumer_id, business_id, store_id, campaign_id) as (
select * from values
(1,10,100,1000),
(1,10,101,1001),
(1,20,101,1002)
)
there is only one row with business 20, and it is for store 101, but the first row for store 101 is campaign 1001, which in turn is not the first row for business 10, so both of those rows are discarded:
CONSUMER_ID |BUSINESS_ID |STORE_ID |CAMPAIGN_ID
1 |10 |100 |1000
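To see why both rows for store 101 drop out, it can help to look at the two row numbers side by side. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Snowflake (SQLite has no QUALIFY and needs version 3.25 or later for window functions), and the aliases store_rn and business_rn are made up for this example.

```python
import sqlite3

# Second sample data set from the answer:
# (consumer_id, business_id, store_id, campaign_id)
ROWS = [(1, 10, 100, 1000), (1, 10, 101, 1001), (1, 20, 101, 1002)]


def row_numbers(rows):
    """Return (campaign_id, store_rn, business_rn) for every input row."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE campaigns_mini "
        "(consumer_id, business_id, store_id, campaign_id)"
    )
    con.executemany("INSERT INTO campaigns_mini VALUES (?, ?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT campaign_id,
               ROW_NUMBER() OVER (PARTITION BY consumer_id, store_id
                                  ORDER BY campaign_id) AS store_rn,
               ROW_NUMBER() OVER (PARTITION BY consumer_id, business_id
                                  ORDER BY campaign_id) AS business_rn
        FROM campaigns_mini
        ORDER BY campaign_id
        """
    ).fetchall()
    con.close()
    return result


numbered = row_numbers(ROWS)
# Equivalent of "QUALIFY store_rn = 1 AND business_rn = 1"
survivors = [c for c, store_rn, business_rn in numbered
             if store_rn == 1 and business_rn == 1]
```

Campaign 1001 is first for its store but second for its business, and campaign 1002 is the reverse, so neither satisfies both conditions.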
So if we use two layers to do the prune, for this data:
select * from (
SELECT
t.consumer_id
,t.business_id
,t.store_id
,t.campaign_id
FROM campaigns_mini AS t
QUALIFY ROW_NUMBER() OVER (PARTITION BY t.consumer_id, t.business_id ORDER BY t.campaign_id) = 1
)
QUALIFY ROW_NUMBER() OVER (PARTITION BY consumer_id, store_id ORDER BY campaign_id) = 1
works:
CONSUMER_ID |BUSINESS_ID |STORE_ID |CAMPAIGN_ID
1 |10 |100 |1000
1 |20 |101 |1002
But if you flip the order of those two QUALIFY steps, you are back to just one row.
So, as a general problem, it cannot be safely solved for all data cases with this pattern.
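The order dependence is easy to reproduce. The following is a minimal sketch, again using Python's sqlite3 module in place of Snowflake (so each QUALIFY becomes a ROW_NUMBER column filtered in an outer query); the function name two_layer_prune and the CTE names are hypothetical.

```python
import sqlite3

# (consumer_id, business_id, store_id, campaign_id)
ROWS = [(1, 10, 100, 1000), (1, 10, 101, 1001), (1, 20, 101, 1002)]


def two_layer_prune(rows, inner_key, outer_key):
    """Deduplicate on inner_key first, then on outer_key.

    Returns the surviving campaign_ids in ascending order.
    """
    allowed = {"business_id", "store_id"}
    if inner_key not in allowed or outer_key not in allowed:
        raise ValueError("keys must be 'business_id' or 'store_id'")
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE campaigns_mini "
        "(consumer_id, business_id, store_id, campaign_id)"
    )
    con.executemany("INSERT INTO campaigns_mini VALUES (?, ?, ?, ?)", rows)
    query = f"""
        WITH inner_ranked AS (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY consumer_id, {inner_key}
                                         ORDER BY campaign_id) AS rn
            FROM campaigns_mini
        ),
        inner_kept AS (
            SELECT consumer_id, business_id, store_id, campaign_id
            FROM inner_ranked
            WHERE rn = 1
        ),
        outer_ranked AS (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY consumer_id, {outer_key}
                                         ORDER BY campaign_id) AS rn
            FROM inner_kept
        )
        SELECT campaign_id FROM outer_ranked WHERE rn = 1 ORDER BY campaign_id
    """
    result = [row[0] for row in con.execute(query).fetchall()]
    con.close()
    return result


business_first = two_layer_prune(ROWS, "business_id", "store_id")
store_first = two_layer_prune(ROWS, "store_id", "business_id")
```

With this data, pruning by business first keeps two campaigns, while pruning by store first keeps only one, which matches the behaviour described above.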

Related

Is it possible to change how row_number inserts the values?

Currently I'm doing this:
select
ProductID = ProductID = ROW_NUMBER() OVER (PARTITION BY PRODUCTID ORDER BY PRODUCtID),
TransactionDate,
TransactionAmount
from ProductsSales
order by ProductID
The results are like this:
ProductID |TransactionDate |TransactionAmount
1 |2022-11-06 |30
2 |2022-11-12 |30
3 |2022-11-28 |30
2 |2022-11-03 |10
3 |2022-11-10 |10
4 |2022-11-15 |10
3 |2022-11-02 |50
The duplicated IDs are being numbered sequentially, but what I need is this:
ProductID |TransactionDate |TransactionAmount
1 |2022-11-06 |30
1.1 |2022-11-12 |30
1.2 |2022-11-28 |30
2 |2022-11-03 |10
2.1 |2022-11-10 |10
2.2 |2022-11-15 |10
3 |2022-11-02 |50
Is this possible?
Assuming your PRODUCTID field is numeric already, then this should work:
WITH _ProductIdSorted AS
(
SELECT
CONCAT
(
PRODUCTID,
'.',
ROW_NUMBER() OVER (PARTITION BY PRODUCTID ORDER BY TransactionDate) - 1
) AS ProductId,
TransactionDate,
TransactionAmount
FROM ProductsSales
)
SELECT
REPLACE(ProductId, '.0', '') AS ProductId,
TransactionDate,
TransactionAmount
FROM _ProductIdSorted;
By the way, like the ORDER BY clause in your query, the one in my answer is a nondeterministic sort: rows that tie on the sort key can come back in any order. Based on your post, though, the order of the rows within each partition doesn't seem to matter to you.
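For readers who want to try the idea outside SQL Server, here is a small sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later). It assumes the underlying table holds ProductID values 1, 1, 1, 2, 2, 2, 3 with the dates from the desired output, and it uses a CASE expression instead of the REPLACE step; the names suffixed_ids, base_id and rn are invented for the example.

```python
import sqlite3

# Assumed source rows: (ProductID, TransactionDate, TransactionAmount)
SALES = [
    (1, "2022-11-06", 30), (1, "2022-11-12", 30), (1, "2022-11-28", 30),
    (2, "2022-11-03", 10), (2, "2022-11-10", 10), (2, "2022-11-15", 10),
    (3, "2022-11-02", 50),
]


def suffixed_ids(rows):
    """Label the first row per product 'N' and later rows 'N.1', 'N.2', ..."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE ProductsSales "
        "(ProductID INTEGER, TransactionDate TEXT, TransactionAmount INTEGER)"
    )
    con.executemany("INSERT INTO ProductsSales VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        WITH numbered AS (
            SELECT ProductID AS base_id, TransactionDate, TransactionAmount,
                   ROW_NUMBER() OVER (PARTITION BY ProductID
                                      ORDER BY TransactionDate) AS rn
            FROM ProductsSales
        )
        SELECT CASE WHEN rn = 1 THEN CAST(base_id AS TEXT)
                    ELSE base_id || '.' || (rn - 1)
               END AS ProductID,
               TransactionDate,
               TransactionAmount
        FROM numbered
        ORDER BY base_id, rn
        """
    ).fetchall()
    con.close()
    return result


labels = [row[0] for row in suffixed_ids(SALES)]
```

Note that the resulting ProductID is text rather than a number, just as with CONCAT in the answer above.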

Write Query That Consider Date Interval

I have a table that contains customer transactions.
I need to find customers that had at least 2 transactions with amount >= 20000 within three consecutive days in each month.
For example, if today is 2022/03/12, I should gather the transactions from 2022/02/13 to 2022/03/12, then check whether a customer had at least 2 transactions with amount >= 20000 within three consecutive days.
For example, consider the table below:
Id |CustomerId |Transactiondate |Amount
1 |1 |2022-01-01 |50000
2 |2 |2022-02-01 |20000
3 |3 |2022-03-05 |30000
4 |3 |2022-03-07 |40000
5 |2 |2022-03-07 |20000
6 |4 |2022-03-07 |30000
7 |4 |2022-03-07 |30000
The output should be: CustomerId = 3 and CustomerId = 4.
I wrote a query that finds these customers for one specific day, but I don't know how to find them over a whole month without using a loop.
The query for a specific day is:
WITH cte AS (SELECT customerid, amount, TransactionDate, DATEADD(day, -2, TransactionDate) AS PrevDate
FROM [Transaction]
WHERE TransactionDate = '2022-03-12')
Select CustomerId,Count(*)
From Cte
Where
TransactionDate>=Prevdate and TransactionDate<=TransactionDate
And Amount>=20000
Group By CustomerId
Having count(*)>=2
Hi, there are many ways to achieve this.
I think the easiest (though maybe not the best-performing) is to use the LAG function:
WITH lagged_days AS (
SELECT
ISNULL(LAG(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id),
LEAD(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id)) lagged_dt
,*
FROM [Transaction]
), valid_cust_base as (
SELECT
*
FROM lagged_days
WHERE DATEPART(MONTH, lagged_dt) = DATEPART(MONTH, Transactiondate)
AND ABS(DATEDIFF(day, Transactiondate, lagged_dt)) <= 3
AND Amount >= 20000
)
SELECT
CustomerID
FROM valid_cust_base
GROUP BY CustomerID
HAVING COUNT(*) >= 2
First I created a lagged TransactionDate per customer (I assume that id is incremental). Then I selected only the transactions within the same month, with amount >= 20000, and where the date difference between transactions is less than 4 days. Then I just select the customers who had more than 1 such transaction.
With LAG, the first value per customer is always missing, but you still need to be able to say that the 1st and 2nd transactions are within 3 days of each other. That's why I replace the first NULL value with LEAD. It doesn't matter whether you use:
ISNULL(LAG(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id),
LEAD(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id)) lagged_dt
OR
ISNULL(LEAD(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id),
LAG(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id)) lagged_dt
The main goal is to have, for each transaction, the closest neighbouring TransactionDate.
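As a way to experiment with the LAG idea on the sample table, here is a rough sketch in Python's sqlite3 module (SQLite 3.25 or later). It is a variation rather than the exact query above: it filters on the date window and the amount before applying LAG, orders by date instead of id, and treats "three consecutive days" as a gap of at most 2 days, which is an assumption about the requirement. The function and parameter names are hypothetical.

```python
import sqlite3

# Sample rows from the question: (Id, CustomerId, TransactionDate, Amount)
TRANSACTIONS = [
    (1, 1, "2022-01-01", 50000),
    (2, 2, "2022-02-01", 20000),
    (3, 3, "2022-03-05", 30000),
    (4, 3, "2022-03-07", 40000),
    (5, 2, "2022-03-07", 20000),
    (6, 4, "2022-03-07", 30000),
    (7, 4, "2022-03-07", 30000),
]


def customers_with_close_pairs(rows, start, end,
                               min_amount=20000, max_gap_days=2):
    """Customers with two qualifying transactions at most max_gap_days apart."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE transactions "
        "(Id INTEGER, CustomerId INTEGER, TransactionDate TEXT, Amount INTEGER)"
    )
    con.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
    query = """
        WITH qualifying AS (
            -- WHERE runs before the window function, so LAG only sees
            -- transactions that are inside the window and large enough.
            SELECT CustomerId, TransactionDate,
                   LAG(TransactionDate) OVER (PARTITION BY CustomerId
                                              ORDER BY TransactionDate, Id)
                       AS prev_date
            FROM transactions
            WHERE Amount >= ? AND TransactionDate BETWEEN ? AND ?
        )
        SELECT DISTINCT CustomerId
        FROM qualifying
        WHERE prev_date IS NOT NULL
          AND julianday(TransactionDate) - julianday(prev_date) <= ?
        ORDER BY CustomerId
    """
    params = (min_amount, start, end, max_gap_days)
    result = [row[0] for row in con.execute(query, params).fetchall()]
    con.close()
    return result


matches = customers_with_close_pairs(TRANSACTIONS, "2022-02-13", "2022-03-12")
```

Because consecutive qualifying transactions are compared after sorting by date, any pair within the gap implies an adjacent pair within the gap, so checking only the previous row is enough here.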

How to Sum (MAX values) from different value groups in same column SQL Server

I have a table like this:
Date |Consec_Days
2015-01-01 |1
2015-01-03 |1
2015-01-06 |1
2015-01-07 |2
2015-01-09 |1
2015-01-12 |1
2015-01-13 |2
2015-01-14 |3
2015-01-17 |1
I need to Sum the max value (days) for each of the consecutive groupings where Consec_Days are > 1. So the correct result would be 5 days.
This is a type of gaps-and-islands problem.
There are many solutions; here is one simple one:
1. Get the start points of each group using LAG
2. Calculate a grouping ID using a windowed conditional count
3. Group by that ID, take the maximum of each group, and sum those maxima
WITH StartPoints AS (
SELECT *,
IsStart = CASE WHEN LAG(Consec_Days) OVER (ORDER BY Date) = 1 THEN 1 END
FROM YourTable t
),
Groupings AS (
SELECT *,
GroupId = COUNT(IsStart) OVER (ORDER BY Date)
FROM StartPoints
WHERE Consec_Days > 1
)
SELECT
SUM(MaxDays)
FROM (
-- longest run length reached within each group
SELECT MAX(Consec_Days) AS MaxDays
FROM Groupings
GROUP BY
GroupId
) AS g;
db<>fiddle
with cte as (
select Consec_Days,
coalesce(lead(Consec_Days) over (order by Date), 1) as next
from YourTable
)
select sum(Consec_Days)
from cte
where Consec_Days <> 1 and next = 1
db<>fiddle
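To check the expected result of 5 on the sample data, the following sketch runs an island-based query through Python's sqlite3 module (SQLite 3.25 or later). It is a variation on the approaches above, not a copy of either: it assumes every run of consecutive days starts with Consec_Days = 1, and the names sum_of_island_maxima, IslandId and Longest are made up for the example.

```python
import sqlite3

# Sample rows from the question: (Date, Consec_Days)
SAMPLE = [
    ("2015-01-01", 1), ("2015-01-03", 1), ("2015-01-06", 1),
    ("2015-01-07", 2), ("2015-01-09", 1), ("2015-01-12", 1),
    ("2015-01-13", 2), ("2015-01-14", 3), ("2015-01-17", 1),
]


def sum_of_island_maxima(rows):
    """Sum the longest Consec_Days of every run that is longer than one day."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE YourTable (Date TEXT, Consec_Days INTEGER)")
    con.executemany("INSERT INTO YourTable VALUES (?, ?)", rows)
    total = con.execute(
        """
        WITH Islands AS (
            -- Each row with Consec_Days = 1 opens a new island, so a
            -- running count of those rows works as an island id.
            SELECT Consec_Days,
                   SUM(CASE WHEN Consec_Days = 1 THEN 1 ELSE 0 END)
                       OVER (ORDER BY Date) AS IslandId
            FROM YourTable
        ),
        IslandMax AS (
            SELECT IslandId, MAX(Consec_Days) AS Longest
            FROM Islands
            GROUP BY IslandId
        )
        SELECT SUM(Longest) FROM IslandMax WHERE Longest > 1
        """
    ).fetchone()[0]
    con.close()
    return total


total_days = sum_of_island_maxima(SAMPLE)
```

On the sample data the two qualifying islands have maxima 2 and 3. If no island is longer than one day, the query returns NULL, which Python receives as None.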

Filter two consecutive row numbers in sql server

I am using this code to find duplicates
Code:
select * from (
select donrId,
donrFirstName,
donrLastName,
donrBirthDate,
ROW_NUMBER() over (
partition by donrFirstName,
donrBirthDate order by donrLastName
) as SequenceNumber
from donors ) as dd
where dd.SequenceNumber > 1
order by donrId
Problem:
I can't filter the partitioned result set on two consecutive numbers, e.g. 1 and 2.
Desired Result:
donrFirstName |donrLastName |donrBirthDate |SequenceNumber
---------------------------------------------------------------
king |kong |25/05/2017 |1
king |kong |25/05/2017 |2
Your query will return only the records with sequence number > 1. To return all records starting with the number 1 you can use COUNT(*) window function, like this:
SELECT
donrId, donrFirstName, donrLastName, donrBirthDate, SequenceNumber
FROM
(SELECT
donrId, donrFirstName, donrLastName, donrBirthDate,
ROW_NUMBER() OVER (PARTITION BY donrFirstName, donrBirthDate ORDER BY donrLastName) AS SequenceNumber,
COUNT(*) OVER (PARTITION BY donrFirstName, donrBirthDate) AS cnt
FROM
donors) AS dd
WHERE
dd.cnt > 1
ORDER BY
donrId
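A quick way to see the effect of the cnt filter is to run the same shape of query on a tiny table. The sketch below uses Python's sqlite3 module (SQLite 3.25 or later) with three made-up donor rows; the data, the ISO date format, and the extra donrId tie-breaker in the ORDER BY are assumptions added for the example.

```python
import sqlite3

# Hypothetical rows: (donrId, donrFirstName, donrLastName, donrBirthDate)
DONORS = [
    (1, "king", "kong", "2017-05-25"),
    (2, "king", "kong", "2017-05-25"),
    (3, "queen", "bee", "2001-01-01"),
]


def duplicate_donors(rows):
    """Return (donrId, SequenceNumber) for every member of a duplicate group."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE donors "
        "(donrId INTEGER, donrFirstName TEXT, donrLastName TEXT, "
        "donrBirthDate TEXT)"
    )
    con.executemany("INSERT INTO donors VALUES (?, ?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT donrId, SequenceNumber
        FROM (
            SELECT donrId,
                   ROW_NUMBER() OVER (PARTITION BY donrFirstName, donrBirthDate
                                      ORDER BY donrLastName, donrId)
                       AS SequenceNumber,
                   COUNT(*) OVER (PARTITION BY donrFirstName, donrBirthDate)
                       AS cnt
            FROM donors
        ) AS dd
        WHERE dd.cnt > 1
        ORDER BY donrId
        """
    ).fetchall()
    con.close()
    return result


duplicates = duplicate_donors(DONORS)
```

Filtering on cnt rather than on SequenceNumber keeps row 1 of each duplicate group as well as the later rows, while groups with a single member are dropped.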

Query to SELECT non-repeating values from table

I have a table structured as below:
ID Name RunDate
10001 Item 1 12/09/2013 02:11:47
10002 Item 2 12/09/2013 01:13:25
10001 Item 1 12/09/2013 01:11:37
10007 Item 7 12/08/2013 11:02:04
10001 Item 1 12/08/2013 10:25:00
My problem is that this table is sent by e-mail to a distribution group, and it makes the e-mail very big because the table has hundreds of rows. What I want is to show only one record per distinct ID, the one with the most recent RunDate.
ID Name RunDate
10001 Item 1 12/09/2013 02:11:47
10002 Item 2 12/09/2013 01:13:25
10007 Item 7 12/08/2013 11:02:04
Any idea how I can do this? I'm not very good with aggregates, and I've used DISTINCT but it always messes up my query.
Thanks!
Group by the values that should be distinct and use max() to get the most current date
select id, name, max(rundate) as rundate
from your_table
group by id, name
This is more flexible because it doesn't require grouping by all columns:
;WITH x AS
(
SELECT ID, Name, RunDate, /* other columns, */
rn = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RunDate DESC)
FROM dbo.TableName
)
SELECT ID, Name, RunDate /* , other columns */
FROM x
WHERE rn = 1
ORDER BY ID;
(Name doesn't really need to be grouped, and in fact shouldn't even be in this table. Also, the next follow-up question to the GROUP BY solution is almost always, "How do I add <column x> and <column y> to the output, if they have different values and can't be added to the GROUP BY?")
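The ROW_NUMBER pattern can be tried on the sample rows with Python's sqlite3 module (SQLite 3.25 or later). This is only a sketch: the table name TableName is taken from the answer, but storing RunDate as ISO-formatted text (so that string ordering matches chronological ordering) is an assumption made for the example.

```python
import sqlite3

# Sample rows from the question, with RunDate rewritten in ISO format
RUNS = [
    (10001, "Item 1", "2013-12-09 02:11:47"),
    (10002, "Item 2", "2013-12-09 01:13:25"),
    (10001, "Item 1", "2013-12-09 01:11:37"),
    (10007, "Item 7", "2013-12-08 11:02:04"),
    (10001, "Item 1", "2013-12-08 10:25:00"),
]


def latest_per_id(rows):
    """Return one row per ID, the one with the most recent RunDate."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TableName (ID INTEGER, Name TEXT, RunDate TEXT)")
    con.executemany("INSERT INTO TableName VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        WITH x AS (
            SELECT ID, Name, RunDate,
                   ROW_NUMBER() OVER (PARTITION BY ID
                                      ORDER BY RunDate DESC) AS rn
            FROM TableName
        )
        SELECT ID, Name, RunDate
        FROM x
        WHERE rn = 1
        ORDER BY ID
        """
    ).fetchall()
    con.close()
    return result


latest = latest_per_id(RUNS)
```

Any additional columns can be added to both SELECT lists without touching the PARTITION BY, which is the flexibility the answer refers to.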
