What is the best way to calculate proportions in Snowflake

Suppose I have some kind of discrete variable, let's say a string, and I want to know the proportion of occurrences of each value of the string. Is there a recommended way to do this in Snowflake?

Snowflake supports RATIO_TO_REPORT:
Returns the ratio of a value within a group to the sum of the values within the group
SELECT C_SALUTATION,
RATIO_TO_REPORT(COUNT(*)) OVER() AS ratio
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF100TCL".CUSTOMER
GROUP BY C_SALUTATION;

I don't know if there is a way to do this that is recommended for Snowflake in particular, but in my experience the standard way is to use a window function. For example:
select C_SALUTATION as title, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as proportion
from "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF100TCL".CUSTOMER
group by C_SALUTATION;
TITLE PROPORTION
Ms. 11.676401
Mr. 16.591405
Miss 11.680596
Sir 16.586719
NULL 3.501119
Mrs. 11.682914
Dr. 28.280846
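
The same idea can be tried without a Snowflake account. The following is a minimal sketch in Python using the standard-library sqlite3 module (window functions need SQLite 3.25 or newer); the `customer` table and its four rows are made up for illustration and are not the TPC-DS sample data. It counts per value in a CTE first and then divides by the windowed total, which should be equivalent to the single-query form above.

```python
import sqlite3

# Hypothetical in-memory table standing in for the CUSTOMER sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_salutation TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?)",
    [("Dr.",), ("Dr.",), ("Mr.",), (None,)],
)

# Count rows per value, then divide each count by the grand total.
# SUM(n) OVER () is a window over all groups, so every row sees the same total.
rows = conn.execute(
    """
    WITH counts AS (
        SELECT c_salutation AS title, COUNT(*) AS n
        FROM customer
        GROUP BY c_salutation
    )
    SELECT title, n * 100.0 / SUM(n) OVER () AS proportion
    FROM counts
    ORDER BY proportion DESC, title
    """
).fetchall()

for title, proportion in rows:
    print(title, proportion)
```

Note that NULL is treated as its own group, just as in the Snowflake output above.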

Related

How to write where clause to consider the earlier date given 2 dates?

Control table:
ControlID, Date1, Date2, Date3
Sale table:
ID, ControlID, SaleDate
I want to get the sales from Date1 to which ever date is earlier amongst Date2 and Date3.
SELECT *
FROM SALE S
JOIN CONTROL C ON S.CONTROLID=C.ID
WHERE S.SALEDATE>=C.DATE1 AND S.SALEDATE<EARLIER(DATE2, DATE3)
What is the correct way to write the EARLIER(DATE2, DATE3) logic? For example - implement this as a new scalar function?
Or maybe:
AND S.SALEDATE<C.DATE2 AND S.SALEDATE<C.DATE3

LEAST may be available to you (it was added in SQL Server 2022):
SELECT *
FROM SALE S
JOIN CONTROL C ON S.CONTROLID=C.ID
WHERE S.SALEDATE>=C.DATE1 AND S.SALEDATE<LEAST(DATE2, DATE3)
otherwise try
WHERE S.SALEDATE>=C.DATE1 AND S.SALEDATE< DATE2 AND S.SALEDATE< DATE3

In order to be able to SARG saledate, I'd use a case:
WHERE S.SALEDATE>=C.DATE1
AND S.SALEDATE< CASE WHEN DATE2<DATE3 THEN DATE2 ELSE DATE3 END
An alternative which is probably less performant but arguably easier to read (and capable of accepting more values) is to make a "mini" unpivoted subquery:
WHERE S.SALEDATE>=C.DATE1
AND S.SALEDATE< (select min(datex) from (values (date2),(date3)) as t1(datex))
Post comment addendum:
If you have one less_than condition and one greater_than condition, an index can be used once to satisfy both. For ease's sake, let's say your dates are integers (they actually are anyway). Let's say date1=5, date2=20, date3=17.
If you use my case solution (or you are lucky enough to be able to use earlier), then the engine will:
Calculate once that starting point is date1=5 and ending point= case/earlier(20,17)=17
If there is an index on sale.saledate, it will index seek to 5 and 17. This could be very fast if [sale] is a large table.
Now that it quickly found the starting and ending point, it returns all possible rows on the output/next operator
If you use AND S.SALEDATE<C.DATE2 AND S.SALEDATE<C.DATE3, what will probably happen is that it starts quickly at 5 like before, but then creates an expression, aliased as something like expr01, which includes both of these conditions. It will then evaluate this expr01 beginning at 5 and stopping not at 17, but at the end of the table.
This involves some speculation on my part, which is why it would be helpful for you to run both versions and share the plans via pastetheplan.com.
Note: It is highly probable that either 1) such an index does not exist, or 2) the optimizer wouldn't use it, or 3) your query is fast anyway, which makes all this analysis partly a waste of time, much like the XKCD comic suggests.
However, even in these cases, understanding the points and creating a good programming habit of SARGable queries is just good business.
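
To check the filtering logic itself (not the index behaviour, which depends on the engine), here is a small sketch in Python with the standard-library sqlite3 module. The table contents are invented and reuse the 5 / 20 / 17 example as January dates; the join column `controlid` on both tables is an assumption based on the column lists in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE control (controlid INTEGER, date1 TEXT, date2 TEXT, date3 TEXT);
    CREATE TABLE sale (id INTEGER, controlid INTEGER, saledate TEXT);
    -- date1 = 5th, date2 = 20th, date3 = 17th, so the window should end at the 17th
    INSERT INTO control VALUES (1, '2024-01-05', '2024-01-20', '2024-01-17');
    INSERT INTO sale VALUES
        (1, 1, '2024-01-04'),
        (2, 1, '2024-01-05'),
        (3, 1, '2024-01-16'),
        (4, 1, '2024-01-17'),
        (5, 1, '2024-01-19');
    """
)

# CASE picks the earlier of date2 and date3 once per control row,
# leaving saledate alone on the left-hand side of each comparison.
rows = conn.execute(
    """
    SELECT s.id
    FROM sale s
    JOIN control c ON s.controlid = c.controlid
    WHERE s.saledate >= c.date1
      AND s.saledate < CASE WHEN c.date2 < c.date3 THEN c.date2 ELSE c.date3 END
    ORDER BY s.id
    """
).fetchall()
sale_ids = [r[0] for r in rows]
print(sale_ids)
```

Only the sales on the 5th and the 16th qualify: the upper bound is exclusive, so the sale on the 17th is left out.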

Is there a way to sum an entire quantity in SQL with unique values

I am trying to get a total of both the ItemDetail.Quantity column and the ItemDetail.NetPrice column. For the sake of example, let's say the quantities listed for the individual items are 5, 2, and 4 respectively. I am wondering if there is a way to display the quantity as 11 for one single ItemGroup.ItemGroupName.
The query I am using is listed below
select Location.LocationName, ItemDetail.DOB, SUM (ItemDetail.Quantity) as "Quantity",
ItemGroup.ItemGroupName, SUM (ItemDetail.NetPrice)
from ItemDetail
Join ItemGroupMember
on ItemDetail.ItemID = ItemGroupMember.ItemID
Join ItemGroup
on ItemGroupMember.ItemGroupID = ItemGroup.ItemGroupID
Join Location
on ItemDetail.LocationID = Location.LocationID
Inner Join Item
on ItemDetail.ItemID = Item.ItemID
where ItemGroup.ItemGroupID = '78' and DOB = '11/20/2019'
GROUP BY Location.LocationName, ItemDetail.DOB, Item.ItemName,
ItemDetail.NetPrice, ItemGroup.ItemGroupName

If you are using SQL Server 2012 or later, you can use SUM as a window function (with an OVER clause) to display the details and aggregates in the same query.
SUM(SalesYTD) OVER (ORDER BY DATEPART(yy, ModifiedDate))
Link:
https://learn.microsoft.com/en-us/sql/t-sql/functions/sum-transact-sql?view=sql-server-ver15

We can't be certain without seeing sample data. But I suspect you need to remove some fields from your GROUP BY clause -- probably Item.ItemName and ItemDetail.NetPrice.
Generally, you won't GROUP BY a column that you are applying an aggregate function to in the SELECT -- as in SUM(ItemDetail.NetPrice). And it is not very common, in my experience, to GROUP BY columns that aren't included in the SELECT list - as you are doing with Item.ItemName.
I think you need to go back to basics and read about what GROUP BY does.
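As a rough illustration of that point, here is a sketch in Python using the standard-library sqlite3 module. The single flattened `item_detail` table, its column names, and the prices are hypothetical; only the quantities 5, 2, and 4 come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item_detail (item_name TEXT, group_name TEXT, quantity INTEGER, net_price REAL);
    INSERT INTO item_detail VALUES
        ('item A', 'group 78', 5, 10.0),
        ('item B', 'group 78', 2, 4.5),
        ('item C', 'group 78', 4, 8.0);
    """
)

# Grouping by item name and price as well yields one row per item.
per_item = conn.execute(
    "SELECT group_name, SUM(quantity) FROM item_detail "
    "GROUP BY group_name, item_name, net_price ORDER BY item_name"
).fetchall()

# Grouping only by the group name collapses everything into a single total.
per_group = conn.execute(
    "SELECT group_name, SUM(quantity), SUM(net_price) FROM item_detail "
    "GROUP BY group_name"
).fetchall()

print(per_item)
print(per_group)
```

With the extra columns in GROUP BY the quantities stay at 5, 2, and 4; with only the group name they add up to 11.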

First of all welcome to the overflow...
Second: The answer is going to be "It depends"
Any time you aggregate data you will need to Group by the other fields in the query, and you have that in the query. The gotcha is what happens when data is spread across multiple locations.
My suggestion is to rethink your problem and see if you really need these other fields in the query. This will depend on what the person using the data really wants to know.
Do they need to know how many of item X there are, or do they really need to know that item X is spread out over three sites?
You might find you are better off with two smaller queries.

calculate average time based on cell value

OK, this is pretty simple, but I'm drawing a blank and can't even think of the right combination of words to search for the answer.
I have a T-SQL table with start and end time, task, as well as a new/repeat flag.
I want to pull the average duration between start and end, both when the record is new and when it is a repeat. I'll be grouping on the task.
My result would look like Task - NewDurationAverage - RepeatDurationAverage.
Cheers in advance.

Your query should be something like this:
SELECT NewTasks.TaskId, NewDurationAverage, RepeatDurationAverage FROM
(SELECT TaskId, AVG(DATEDIFF(hh, TaskStart, TaskEnd)) as NewDurationAverage
FROM Task WHERE IsNew=1 GROUP BY TaskId) NewTasks
LEFT OUTER JOIN
(SELECT TaskId, AVG(DATEDIFF(hh, TaskStart, TaskEnd)) as RepeatDurationAverage
FROM Task WHERE IsRepeat=1 GROUP BY TaskId) RepeatTasks
ON NewTasks.TaskId=RepeatTasks.TaskId

You need to follow the steps below:
Find the difference between the start and the end date/time columns, for example using the DATEDIFF function
Perform the AVG on the calculated value
Convert the result into whatever format you want
Depending on your needs, you can make DATEDIFF return the time difference in the desired unit (days, minutes, nanoseconds, etc.). So you have to decide how precise the results should be (a smaller unit gives a more precise average).
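
The steps above can be sketched in a single pass with conditional aggregation, which is a different technique from joining two subqueries. This is a minimal example in Python with the standard-library sqlite3 module; SQLite has no DATEDIFF, so the difference is computed from epoch seconds via strftime('%s', ...) instead. The `task` table, its column names, and the `is_repeat` flag encoding (0 = new, 1 = repeat) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task (task TEXT, task_start TEXT, task_end TEXT, is_repeat INTEGER);
    INSERT INTO task VALUES
        ('install', '2024-01-01 08:00:00', '2024-01-01 10:00:00', 0),
        ('install', '2024-01-02 08:00:00', '2024-01-02 12:00:00', 0),
        ('install', '2024-01-03 08:00:00', '2024-01-03 09:00:00', 1),
        ('repair',  '2024-01-01 08:00:00', '2024-01-01 08:30:00', 1);
    """
)

# Step 1: duration in minutes per row (inner query).
# Step 2: AVG per task, split by flag; CASE without ELSE yields NULL,
#         and AVG ignores NULLs, so each average only sees its own rows.
rows = conn.execute(
    """
    SELECT task,
           AVG(CASE WHEN is_repeat = 0 THEN minutes END) AS new_avg,
           AVG(CASE WHEN is_repeat = 1 THEN minutes END) AS repeat_avg
    FROM (
        SELECT task, is_repeat,
               (CAST(strftime('%s', task_end) AS INTEGER)
                - CAST(strftime('%s', task_start) AS INTEGER)) / 60.0 AS minutes
        FROM task
    )
    GROUP BY task
    ORDER BY task
    """
).fetchall()

for row in rows:
    print(row)
```

A task with no rows of one kind gets NULL (None in Python) for that average rather than being dropped from the result.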

How to know how many times a value appears in the database, for all values?

This can be applied to tags, to comments, ...
My question is how to obtain, for example, the top 20 repetitions for an attribute in a table.
I mean, I know I can count a selected attribute with mysql_num_rows, but how do I do it for all of them?
For example,
Most popular colors:
Red -100000
blue -5000
white -200
and so on...
if anyone can give me a clue.. thanks!

SELECT `name`, count(*) as `count` FROM `colors`
GROUP BY `name`
ORDER BY `count` DESC

You want to do this computation on the database, not in your application. Usually, a query of the following form should be fine:
SELECT color_id, COUNT(product_id) AS number
FROM products
GROUP BY color_id
ORDER BY number DESC
LIMIT 20
It will be faster this way, as only the value-count data will be sent from the database to the application. Also, if you have indices set up correctly (on color_id, for instance), things will be smoother.
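
For a quick way to try this outside MySQL, here is a sketch in Python with the standard-library sqlite3 module. The rows are invented, `color_id` holds a colour name here only to keep the output readable, and the limit is 2 instead of 20 because the sample is tiny.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, color_id TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "red"), (2, "red"), (3, "red"), (4, "blue"), (5, "blue"), (6, "white")],
)

# The database does the counting and sorting; only the top rows come back.
top = conn.execute(
    """
    SELECT color_id, COUNT(product_id) AS number
    FROM products
    GROUP BY color_id
    ORDER BY number DESC
    LIMIT 2
    """
).fetchall()

print(top)
```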

Windowing function: Finding max value from LAG()

I'm currently working my way through the exam study book, Querying Microsoft SQL Server 2012. I've been learning SQL over the last few months and I am currently looking over windowing functions. I came to this application question and it got me thinking about another question, which I'll list below:
So in the columns diffprev and diffnext it only lists the difference between the previous and the next value. How could I list the maximum difference between subsequent values across all of the rows (partitioned by custid)? So just scanning the table, I see that in custid 1's history, the greatest difference between subsequent rows is $548. Then for custid 2, the greatest difference is $390.95. I could see these values appearing in a maxdiff column across all the rows pertaining to the partition.
Thank you for aiding my studying!

If you're just looking for the value, this should work:
with cte as (
select custid, val - lag(val)
over (partition by custid order by orderdate, orderid) as prevVal
from Sales.OrderValues
)
select custid, max(abs(prevVal)) as maxdiff
from cte
group by custid
If you want the details of the rows that attain that maximum, it'll be a bit more work.
Bonus tip - pictures of text are the worst. You're more likely to get help if the people helping don't need to type your code out. Even better though would be a fully functioning example (complete with table definitions and sample data) so we can verify against your data!
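
If the maximum should appear in a maxdiff column on every row of the partition, as described in the question, one possible approach is a second window function over the LAG result. This is a sketch in Python with the standard-library sqlite3 module (SQLite 3.25 or newer); the `order_values` table and its rows are made up and are not the Sales.OrderValues data from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE order_values (custid INTEGER, orderid INTEGER, orderdate TEXT, val REAL);
    INSERT INTO order_values VALUES
        (1, 101, '2024-01-01', 100.0),
        (1, 102, '2024-01-05', 250.0),
        (1, 103, '2024-01-09', 50.0),
        (2, 201, '2024-01-02', 80.0),
        (2, 202, '2024-01-03', 90.0);
    """
)

# Window functions cannot be nested, so LAG is computed in a CTE first
# and MAX ... OVER is applied to its result in the outer query.
rows = conn.execute(
    """
    WITH diffs AS (
        SELECT custid, orderid, val,
               val - LAG(val) OVER (PARTITION BY custid
                                    ORDER BY orderdate, orderid) AS diffprev
        FROM order_values
    )
    SELECT custid, orderid, val, diffprev,
           MAX(ABS(diffprev)) OVER (PARTITION BY custid) AS maxdiff
    FROM diffs
    ORDER BY custid, orderid
    """
).fetchall()

for row in rows:
    print(row)
```

The first order of each customer has no previous row, so its diffprev is NULL; MAX skips those NULLs when computing maxdiff.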
