How do you select a number of random rows from different AgeGroups? - snowflake-cloud-data-platform

I am trying to write a for loop in Python that connects to Snowflake, since Snowflake does not support loops.
I want to select a number of random rows from different AgeGroups, e.g. 1500 rows from AgeGroup "30-40", 1200 rows from AgeGroup "40-50", and 875 rows from AgeGroup "50-60".
Any ideas how to do it or an alternative method for a loop in Snowflake?

Have you looked at Snowflake's Stored Procedures? They are written in JavaScript and would allow you to loop natively in Snowflake:
https://docs.snowflake.net/manuals/sql-reference/stored-procedures-overview.html

What do you mean by "Snowflake doesn't have loops"? SQL has "loops" if you can find them...
The following query does what you asked for:
WITH POPULATION AS ( /* 10,000 persons with random age 0-100 */
SELECT 'Person ' || SEQ2() ID, ABS(RANDOM()) % 100 AGE
FROM TABLE(GENERATOR(ROWCOUNT => 10000))
)
SELECT
ID,
AGE,
CASE
WHEN AGE < 30 THEN '0-30'
WHEN AGE < 40 THEN '30-40'
WHEN AGE < 50 THEN '40-50'
WHEN AGE < 60 THEN '50-60'
ELSE '60-100'
END AGE_GROUP,
ROW_NUMBER() OVER (PARTITION BY AGE_GROUP ORDER BY RANDOM()) DRAW_ORDER
FROM POPULATION
QUALIFY DRAW_ORDER <= DECODE(AGE_GROUP, '30-40', 1500, '40-50', 1200, '50-60', 875, 0);
Addendum:
As pointed out by waldente, a simpler and more efficient way is to use SAMPLE:
WITH
POPULATION_30_40 AS (SELECT * FROM POPULATION WHERE AGE >= 30 AND AGE < 40),
POPULATION_40_50 AS (SELECT * FROM POPULATION WHERE AGE >= 40 AND AGE < 50),
POPULATION_50_60 AS (SELECT * FROM POPULATION WHERE AGE >= 50 AND AGE < 60)
SELECT * FROM POPULATION_30_40 SAMPLE(1500 ROWS) UNION ALL
SELECT * FROM POPULATION_40_50 SAMPLE(1200 ROWS) UNION ALL
SELECT * FROM POPULATION_50_60 SAMPLE(875 ROWS)
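Since the question asked about driving this from Python, here is a minimal sketch that builds the SAMPLE query from a dict of quotas instead of looping over queries. The function name build_sample_query is hypothetical, the table and column names (POPULATION, AGE) are taken from the example above, and the generated SQL has not been run against Snowflake here, so treat it as a starting point only.

```python
def build_sample_query(quotas, table="POPULATION", age_col="AGE"):
    """Build one UNION ALL query with a SAMPLE (n ROWS) branch per age group.

    quotas maps a label such as "30-40" to the number of rows wanted.
    table and age_col are assumed names taken from the example above.
    """
    branches = []
    for label, n_rows in quotas.items():
        # "30-40" means 30 <= age < 40, matching the CTEs in the answer.
        low, high = (int(part) for part in label.split("-"))
        branches.append(
            f"SELECT * FROM (SELECT * FROM {table} "
            f"WHERE {age_col} >= {low} AND {age_col} < {high}) "
            f"SAMPLE ({int(n_rows)} ROWS)"
        )
    return "\nUNION ALL\n".join(branches)


quotas = {"30-40": 1500, "40-50": 1200, "50-60": 875}
query = build_sample_query(quotas)
print(query)
```

The resulting string can then be passed to whatever Snowflake client you already use, so the database is queried once rather than once per group.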

If you want to draw n random samples from each group you could create a subquery containing a row number that is randomly distributed within each group, and then select the top n rows from each group.
If you have a table like this:
USER DATE
1 2018-11-04
1 2018-11-04
1 2018-12-07
1 2018-10-09
1 2018-10-09
1 2018-11-07
1 2018-11-09
1 2018-11-09
2 2019-11-02
2 2019-10-02
2 2019-11-03
2 2019-11-06
3 2019-11-10
3 2019-11-13
3 2019-11-15
This query could be used to return two random rows for User 2 and 3, and 3 random rows for user 1:
SELECT User, Date
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY User ORDER BY RANDOM()) as random_row
FROM Users)
WHERE
(User = 3 AND random_row < 3) OR
(User = 2 AND random_row < 3) OR
(User = 1 AND random_row < 4);
So in your case, partition on and filter by age_group instead of User.
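As a rough illustration of the same idea applied to age groups, the sketch below runs the ROW_NUMBER() approach on an in-memory SQLite database (not Snowflake; it needs SQLite 3.25 or later for window functions). The people and quotas tables, and the small quota numbers, are made up for the example. Keeping the quotas in a table avoids one OR branch per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, age_group TEXT)")
# Hypothetical data: 50 people in each of three age groups.
rows = [(g,) for g in ("30-40", "40-50", "50-60") for _ in range(50)]
conn.executemany("INSERT INTO people (age_group) VALUES (?)", rows)

# Hypothetical quotas, stored as a table so they can be joined instead of hard-coded.
conn.execute("CREATE TABLE quotas (age_group TEXT PRIMARY KEY, n INTEGER)")
conn.executemany("INSERT INTO quotas VALUES (?, ?)",
                 [("30-40", 15), ("40-50", 12), ("50-60", 8)])

query = """
SELECT p.id, p.age_group
FROM (
    SELECT id, age_group,
           ROW_NUMBER() OVER (PARTITION BY age_group ORDER BY RANDOM()) AS random_row
    FROM people
) AS p
JOIN quotas AS q ON q.age_group = p.age_group
WHERE p.random_row <= q.n
"""
sample = conn.execute(query).fetchall()

# Count how many rows were drawn from each group.
counts = {}
for _id, group in sample:
    counts[group] = counts.get(group, 0) + 1
print(counts)
```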

Snowflake supports random and deterministic table sampling. For example, this returns a sample of a table in which each row has a 10% probability of being included:
SELECT * FROM testtable SAMPLE (10);
https://docs.snowflake.net/manuals/sql-reference/constructs/sample.html

Related

nested case when with count and math expression

how can I achieve this result below?
id id_status rate
25 X 62.5%
15 Y 37.5%
having tried this
SELECT
COUNT(tab.id) AS id,
tab.status AS id_status,
CASE
WHEN tab.status = 'X' THEN (25/40) * 100 -- this is where I'm stuck (40 = total of ids)
WHEN tab.status = 'Y' THEN 100 - ((25/40) * 100)
END AS rate
FROM table AS tab
WHERE tab.status in ('X', 'Y')
GROUP BY ROLLUP (tab.status)
You can use a window function to get total count
select count(tab.id) as id,
tab.status as id_status,
200.0 * count(tab.id) / sum(count(*)) over(order by status rows between unbounded preceding and unbounded following) as rate
from your_table tab
where tab.status in ('X', 'Y')
group by rollup(tab.status)
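Note the explicit window frame: when a window has an ORDER BY, the default frame is generally RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which would give a running total rather than the grand total. The factor is 200 rather than 100 because ROLLUP adds a total row, which doubles the windowed sum.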
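For a quick check of the percent-of-total idea without ROLLUP, here is a small sketch against an in-memory SQLite database (SQLite 3.25 or later). It aggregates in a subquery and then divides by SUM(n) OVER (), so the factor is a plain 100. The table name tab and the 25/15 split are taken from the question; everything else is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER PRIMARY KEY, status TEXT)")
# 25 rows with status X and 15 with status Y, as in the expected output.
conn.executemany("INSERT INTO tab (status) VALUES (?)",
                 [("X",)] * 25 + [("Y",)] * 15)

query = """
SELECT status AS id_status,
       n AS id,
       100.0 * n / SUM(n) OVER () AS rate
FROM (SELECT status, COUNT(id) AS n
      FROM tab
      WHERE status IN ('X', 'Y')
      GROUP BY status)
ORDER BY status
"""
result = conn.execute(query).fetchall()
print(result)
```

An empty OVER () spans every row of the aggregated result, so no explicit frame is needed in this variant.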

Displaying all columns in SQL and also sum of columns with same ID in the last Repeating row

I have 2 tables
OrderDetails:
Id Name type Quantity
------------------------------------------
2009 john a 10
2009 john a 20
2010 sam b 25
2011 sam c 50
2012 sam d 30
ValueDetails:
Id Value
-------------------
2009 300
2010 500
2011 200
2012 100
I need to get an output which displays the data as such :
Id Name type Quantity Price
-------------------------------------------------
2009 john a 10
2009 john a 20 9000
2010 sam b 25
2011 sam c 50
2012 sam d 30 25500
The price is calculated by Value x Quantity and the sum of the values is displayed in the last repeating row of the given Name.
I tried to use sum and group by but I get only two rows. I need to display all 5 rows. How can I write this query?
You can use Row_Number with max of Row_Number to get this formatted sum
;with cte as (
select od.*, sm= sum( od.Quantity*vd.value ) over (partition by Name),
RowN = row_number() over(partition by Name order by od.id)
from #yourOrderDetails od
inner join #yourValueDetails vd
on od.Id = vd.Id
)
select Id, Name, Type, Quantity,
case when max(RowN) over(partition by Name) = row_number() over(partition by Name order by Id)
then sm else null end as ActualSum
from cte
Your input tables:
create table #yourOrderDetails (Id int, Name varchar(20), type varchar(2), Quantity int)
insert into #yourOrderDetails (Id, Name, type, Quantity) values
(2009 ,'john','a', 10 )
,(2009 ,'john','a', 20 ) ,(2010 ,'sam ','b', 25 )
,(2011 ,'sam ','c', 50 ) ,(2012 ,'sam ','d', 30 )
create table #yourValueDetails(Id int, Value Int)
insert into #yourValueDetails(Id, value) values
( 2009 , 300 ) ,( 2010 , 500 )
,( 2011 , 200 ) ,( 2012 , 100 )
SELECT a.ID,
a.Name,
a.Type,
a.quantity,
price = (a.quantity * b.Value)
FROM OrderDetails a LEFT JOIN
ValueDetails b on a.id = b.id
This will put the price on every row. If you want to do a SUM by Id,Name and Type it's not going to show the individual records like you show them above. If you want to put a SUM on one of the lines that share the same Id, Name and Type then you'd need a rule to figure out which one and then you could probably use a CASE statement to decide on which line you want to show the SUM total.
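One possible rule is "show the total on the last row of each Name". The sketch below tries that on an in-memory SQLite database (SQLite 3.25 or later), using the sample data from the question. The seq column is a hypothetical insertion-order key added here because the two 2009 rows are otherwise indistinguishable; in a real table you would need some column that defines "last".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OrderDetails (seq INTEGER PRIMARY KEY, Id INT, Name TEXT, type TEXT, Quantity INT);
INSERT INTO OrderDetails (Id, Name, type, Quantity) VALUES
  (2009, 'john', 'a', 10), (2009, 'john', 'a', 20),
  (2010, 'sam', 'b', 25), (2011, 'sam', 'c', 50), (2012, 'sam', 'd', 30);
CREATE TABLE ValueDetails (Id INT, Value INT);
INSERT INTO ValueDetails VALUES (2009, 300), (2010, 500), (2011, 200), (2012, 100);
""")

query = """
SELECT od.Id, od.Name, od.type, od.Quantity,
       -- Only the last row per Name (highest seq) carries the total.
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY od.Name ORDER BY od.seq DESC) = 1
            THEN SUM(od.Quantity * vd.Value) OVER (PARTITION BY od.Name)
       END AS Price
FROM OrderDetails AS od
JOIN ValueDetails AS vd ON vd.Id = od.Id
ORDER BY od.seq
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```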

How to group by on consecutive values in SQL

I have a table in SQL Server 2014 with sample data as follows.
WK_NUM | NET_SPRD_LCL
10 0
11 1500
12 3600
13 3800
14 4000
I am trying to code a bonus structure at work where I need to group on WK_NUM. So, if I see NET_SPRD_LCL > 3500 for two consecutive WK_NUMs WHERE WK_NUM < 27, I need to output 2000. In this example, since NET_SPRD_LCL for WK_NUM 12 and 13 are both greater than 3500, the SQL should output 2000 and exit. So, it should ignore the fact that WK_NUM 13 and 14 also satisfy the condition that NET_SPRD_LCL > 3500.
I would appreciate any help with this.
Assuming "consecutive" means week numbers with no gaps (1, 2, 3, 4, 5, ...) and NOT merely increasing ones (1, 3, 5, 8, 12, ...),
then, if you don't need to know which pair of consecutive records it was:
Select case when exists
(Select * from table f
join table n
on n.Wk_Num = f.Wk_Num + 1
and n.NET_SPRD_LCL > 3500
and f.NET_SPRD_LCL > 3500
and n.Wk_Num < 27)
then 2000 else null end
If you do need to identify the pair of records, then:
Select f.wk_Num firstWorkNbr, f.NET_SPRD_LCL firstNetSpread,
n.wk_Num nextWorkNbr, n.NET_SPRD_LCL nextNetSpread
from table f
join table n
on n.Wk_Num = f.Wk_Num + 1
and n.NET_SPRD_LCL > 3500
and f.NET_SPRD_LCL > 3500
and n.Wk_Num < 27
Where not exists
(Select * from table f0
join table n0
on n0.Wk_Num = f0.Wk_Num + 1
and n0.NET_SPRD_LCL > 3500
and f0.NET_SPRD_LCL > 3500
where f0.Wk_Num < f.Wk_Num)
On the other hand, if "consecutive" simply means the next higher week number (gaps allowed), it's a bit harder. You need a subquery to determine the next record:
Select case when exists
(Select * from table f
join table n
on n.Wk_Num = (Select Min(Wk_Num) from table
Where Wk_Num > f.Wk_Num)
and n.NET_SPRD_LCL > 3500
and f.NET_SPRD_LCL > 3500
and n.Wk_Num < 27)
then 2000 else null end
and if you need to fetch the data for the specific first pair of records that qualify (the 2000 at the end is unnecessary since if there is no qualifying pair nothing will be returned.)
Select f.wk_Num firstWorkNbr, f.NET_SPRD_LCL firstNetSpread,
n.wk_Num nextWorkNbr, n.NET_SPRD_LCL nextNetSpread, 2000 outValue
from table f
join table n
on n.Wk_Num = (Select Min(Wk_Num) from table
Where Wk_Num > f.Wk_Num)
and n.NET_SPRD_LCL > 3500
and f.NET_SPRD_LCL > 3500
and n.Wk_Num < 27
Where not exists
(Select * from table f0
join table n0
on n0.Wk_Num = (Select Min(Wk_Num) from table
Where Wk_Num > f0.Wk_Num)
and n0.NET_SPRD_LCL > 3500
and f0.NET_SPRD_LCL > 3500
where f0.Wk_Num < f.Wk_Num)
First of all, when you say you want your query to 'output' and 'exit', it makes me think you are approaching t-sql as a procedural language, which it is not. Good t-sql queries are nearly always set based.
In any case, before the query, let me add what is helpful for others to work with the data to build queries:
CREATE TABLE #t (WK_NUM INT, NET_SPRD_LCL INT);
INSERT INTO #t VALUES
(10, 0),
(11, 1500),
(12, 3600),
(13, 3800),
(14, 4000);
You say you are using SQL Server 2014, which means you have relevant window functions at your disposal. The one I am using (LAG) will have superior performance to using subqueries, which, if you insist on using, can be greatly improved by using TOP (1) with ORDER BY and an appropriate index instead of using a MIN function over the whole dataset. With tiny amounts of data you won't notice a difference, but on a real business system it will be obvious.
Adjusted to provide the 2000 bonus on the correct line after OP's clarification:
WITH cteTemp AS
(
SELECT WK_NUM
, thisValue = NET_SPRD_LCL
, lastValue = LAG(NET_SPRD_LCL) OVER(ORDER BY WK_NUM)
FROM #t
WHERE WK_NUM < 27
)
, cteBonusWeek AS
(
SELECT TOP (1)
WK_NUM
, bonus = 2000
FROM cteTemp
WHERE thisValue > 3500 AND lastValue > 3500
ORDER BY WK_NUM
)
SELECT t.WK_NUM
, t.NET_SPRD_LCL
, bonus = COALESCE(b.bonus, 0)
FROM #t AS t
LEFT JOIN cteBonusWeek AS b
ON b.WK_NUM = t.WK_NUM;
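To try the LAG idea outside SQL Server, here is a small sketch on an in-memory SQLite database (SQLite 3.25 or later) with the sample data from the question. It returns only the first qualifying week rather than every row. The extra last_week = WK_NUM - 1 check is an addition not present in the query above: it guards against gaps in the week numbers, under the assumption that "consecutive" means adjacent week numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (WK_NUM INT, NET_SPRD_LCL INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(10, 0), (11, 1500), (12, 3600), (13, 3800), (14, 4000)])

query = """
WITH lagged AS (
    SELECT WK_NUM,
           NET_SPRD_LCL AS this_value,
           LAG(NET_SPRD_LCL) OVER (ORDER BY WK_NUM) AS last_value,
           LAG(WK_NUM) OVER (ORDER BY WK_NUM) AS last_week
    FROM t
    WHERE WK_NUM < 27
)
SELECT MIN(WK_NUM) AS bonus_week   -- earliest week that completes a qualifying pair
FROM lagged
WHERE this_value > 3500
  AND last_value > 3500
  AND last_week = WK_NUM - 1       -- assumption: the pair must be adjacent weeks
"""
bonus_week = conn.execute(query).fetchone()[0]
bonus = 2000 if bonus_week is not None else 0
print(bonus_week, bonus)
```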

SQL Server: How to get a rolling sum over 3 days for different customers within same table

This is the input table:
Customer_ID Date Amount
1 4/11/2014 20
1 4/13/2014 10
1 4/14/2014 30
1 4/18/2014 25
2 5/15/2014 15
2 6/21/2014 25
2 6/22/2014 35
2 6/23/2014 10
There is information pertaining to multiple customers and I want to get a rolling sum across a 3 day window for each customer.
The solution should be as below:
Customer_ID Date Amount Rolling_3_Day_Sum
1 4/11/2014 20 20
1 4/13/2014 10 30
1 4/14/2014 30 40
1 4/18/2014 25 25
2 5/15/2014 15 15
2 6/21/2014 25 25
2 6/22/2014 35 60
2 6/23/2014 10 70
The biggest issue is that I don't have transactions for each day because of which the partition by row number doesn't work.
The closest example I found on SO was:
SQL Query for 7 Day Rolling Average in SQL Server
but even in that case there were transactions every day, which accommodated the ROW_NUMBER()-based solutions.
The rownumber query is as follows:
select customer_id, Date, Amount,
Rolling_3_day_sum = CASE WHEN ROW_NUMBER() OVER (partition by customer_id ORDER BY Date) > 2
THEN SUM(Amount) OVER (partition by customer_id ORDER BY Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
END
from #tmp_taml9
order by customer_id
I was wondering if there is way to replace "BETWEEN 2 PRECEDING AND CURRENT ROW" by "BETWEEN [DATE - 2] and [DATE]"
One option would be to use a calendar table (or something similar) to get the complete range of dates and left join your table with that and use the row_number based solution.
Another option that might work (not sure about performance) would be to use an apply query like this:
select customer_id, Date, Amount, coalesce(Rolling_3_day_sum, Amount) Rolling_3_day_sum
from #tmp_taml9 t1
cross apply (
select sum(amount) Rolling_3_day_sum
from #tmp_taml9
where Customer_ID = t1.Customer_ID
and datediff(day, date, t1.date) <= 2
and t1.Date >= date
) o
order by customer_id;
I suspect performance might not be great though.
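As a sanity check of the date-window logic, the sketch below reproduces the expected output from the question on an in-memory SQLite database. It uses a correlated subquery in place of CROSS APPLY and julianday() in place of DATEDIFF, both of which are SQLite substitutions rather than SQL Server syntax. The table name tx is made up, and the dates are rewritten in ISO format so SQLite can parse them. A 3-day window means the current date and the two days before it, so the difference must be between 0 and 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (Customer_ID INT, Date TEXT, Amount INT)")
conn.executemany("INSERT INTO tx VALUES (?, ?, ?)", [
    (1, "2014-04-11", 20), (1, "2014-04-13", 10),
    (1, "2014-04-14", 30), (1, "2014-04-18", 25),
    (2, "2014-05-15", 15), (2, "2014-06-21", 25),
    (2, "2014-06-22", 35), (2, "2014-06-23", 10),
])

query = """
SELECT t1.Customer_ID, t1.Date, t1.Amount,
       (SELECT SUM(t2.Amount)
        FROM tx AS t2
        WHERE t2.Customer_ID = t1.Customer_ID
          -- rows dated today, yesterday, or the day before
          AND julianday(t1.Date) - julianday(t2.Date) BETWEEN 0 AND 2
       ) AS Rolling_3_Day_Sum
FROM tx AS t1
ORDER BY t1.Customer_ID, t1.Date
"""
result = conn.execute(query).fetchall()
rolling = [row[3] for row in result]
print(rolling)
```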

Find and replace rows with similar value in one column in Oracle SQL

I want to find the rows which are similar to each other, and replace them with a new row. My table looks like this:
OrderID | Price | Minimum Number | Maximum Number | Volume
1 45 2 10 250
2 46 2 10 250
3 60 2 10 250
"Similar" in this context means that the rows that have same Maximum Number, Minimum Number, and Volume. Prices can be different, but the difference can be at most 2.
In this example, orders with OrderID of 1 and 2 are similar, but 3 is not (since even if it has same Minimum Number, Maximum Number, and Volume, its price is not within 2 units from orders 1 and 2).
Then, I want orders 1 and 2 to be replaced by a new order, let's say OrderID 4, which has the same Minimum Number and Maximum Number. Its Volume has to be the sum of the volumes of the orders it is replacing. Its price can be the Price of any of the previous orders that will be deleted in the output table (45 or 46 in this example). So, the output for the example above would be:
OrderID | Price | Minimum Number | Maximum Number | Volume
4 45 2 10 500
3 60 2 10 250
Here is a way to do this in SQL Server 2012 or Oracle. The idea is to use lag() to find where groups should begin and end and then aggregate.
select min(OrderID) as OrderID, min(price) as price, MinimumNumber, MaximumNumber, sum(Volume) as Volume
from (select t.*,
sum(case when prev_price < price - 2 then 1 else 0 end) over
(partition by MinimumNumber, MaximumNumber, Volume order by price) as grp
from (select t.*,
lag(price) over (partition by MinimumNumber, MaximumNumber, Volume
order by price
) as prev_price
from table t
) t
) t
group by grp, MinimumNumber, MaximumNumber, Volume;
The only issue is the setting of the id. I'm not sure what the exact rule is for that.
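Here is a small sketch of the lag-then-cumulative-sum idea on an in-memory SQLite database (SQLite 3.25 or later, not Oracle), using the three sample orders. The table name orders is made up, and OrderID is added to the ORDER BY only to make ties deterministic. It keeps the smallest OrderID of each merged group instead of issuing a new one, since the rule for new ids is not specified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    OrderID INT, Price INT, MinimumNumber INT, MaximumNumber INT, Volume INT)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
                 [(1, 45, 2, 10, 250), (2, 46, 2, 10, 250), (3, 60, 2, 10, 250)])

query = """
SELECT MIN(OrderID), MIN(Price), MinimumNumber, MaximumNumber, SUM(Volume)
FROM (
    SELECT o.*,
           -- Running count of price jumps larger than 2: each jump starts a new group.
           SUM(CASE WHEN prev_price < Price - 2 THEN 1 ELSE 0 END) OVER (
               PARTITION BY MinimumNumber, MaximumNumber, Volume
               ORDER BY Price, OrderID
           ) AS grp
    FROM (
        SELECT orders.*,
               LAG(Price) OVER (
                   PARTITION BY MinimumNumber, MaximumNumber, Volume
                   ORDER BY Price, OrderID
               ) AS prev_price
        FROM orders
    ) AS o
)
GROUP BY MinimumNumber, MaximumNumber, Volume, grp
ORDER BY MIN(Price)
"""
result = conn.execute(query).fetchall()
print(result)
```

Note that this compares each price only with the previous one, so a chain such as 45, 47, 49 ends up in one group even though 45 and 49 differ by more than 2.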
