Snowflake Moving Averages

I'm having issues creating a 4-week moving average calculation in Snowflake. The goal is to have a table with all the distinct weeks and descriptions for each product, and the volume and amounts paid for them.
Here is the code that I'm using:
SELECT
    DATE_TRUNC(WEEK, TO_DATE(YR || '-' || MTH || '-' || DT, 'YYYY-MM-DD')) AS Week,
    DESCRIPTION AS "DESC",
    AVG(C.VOL) OVER (PARTITION BY DESC ORDER BY Week ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS TXN,
    AVG(C.PRNLC) OVER (PARTITION BY C11.DMADES ORDER BY Week ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS PRN
FROM C
GROUP BY Week, "DESC"
I keep getting an error:
[CAST(C.VOL_17 AS NUMBER(38,3))] is not a valid group by expression
The code works fine if I remove the averages and window functions, but then I'd have to go out to other tools and create the 4-week moving average I'd expect to see there. I have tried removing the grouping and using DISTINCT instead, but it still returns one line for each record. This goes beyond my expertise, I'm afraid; I'm not really sure how CAST comes into play here. Is there another way to accomplish what I want? Do you see what I'm doing wrong here?
Thank you!
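One way to get there (a sketch only, untested; table and column names are taken from the question): aggregate to one row per week and description in a subquery first, then apply the moving average over those weekly rows in an outer query, so the window functions never have to appear alongside the GROUP BY.
SELECT
    Week,
    "DESC",
    -- 4-week moving averages over the weekly totals
    AVG(WEEK_VOL) OVER (PARTITION BY "DESC" ORDER BY Week
                        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS TXN,
    AVG(WEEK_PRN) OVER (PARTITION BY "DESC" ORDER BY Week
                        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS PRN
FROM (
    -- one row per week/description; SUM() is an assumption, use whatever
    -- per-week aggregate the report actually needs
    SELECT
        DATE_TRUNC(WEEK, TO_DATE(YR || '-' || MTH || '-' || DT, 'YYYY-MM-DD')) AS Week,
        DESCRIPTION AS "DESC",
        SUM(VOL)   AS WEEK_VOL,
        SUM(PRNLC) AS WEEK_PRN
    FROM C
    GROUP BY 1, 2
) w;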

Related

Use result of previous row as start value for next line

I'm working on an MRP simulation in which I have to subtract demand quantities from, or add supply quantities to, the available stock, and I hope you can be of support. Below is the result I want to achieve.
I have 1 value for stock = 22 and a lot of values for future demand/supply on specific dates.
Part      Stock                         Demand/Supply qty   Demand/Supply Date   Result
1000680   22                            -1                  2023-01-01           21
1000680   21 (what I want to achieve)   -15                 2023-01-02           6  (expected outcome)
1000680   6  (what I want to achieve)   +10                 2023-01-03           16 (expected outcome)
I'm still on the SQL learning curve. I started by adding row numbers to the lines to make sure the sequence is correct:
select
    part,
    rownum = ROW_NUMBER() OVER (ORDER BY part, mrp_due_date),
    current_stock_qty,
    demand_supply_qty,
    current_stock_qty - demand_supply_qty as new_stock_qty, -- if demand
    current_stock_qty + demand_supply_qty as new_stock_qty, -- if supply
    mrp_due_date
from #base
Then I tried the LAG function to derive the previous row's new_stock_qty at each date, but this only worked for the first line.
So I probably need a loop to first calculate stock minus demand and then use the result as the new stock.
I have looked through similar questions asked on this site, but I find it difficult to define my solution based on that information.
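For what it's worth, a running-total window function should get there without a loop. A minimal sketch (untested; column names are assumed from the query above, and current_stock_qty is assumed to hold the starting stock of 22 on every row): since demand is stored negative and supply positive, adding a running SUM of demand_supply_qty to the starting stock yields the projected stock after each transaction.
select
    part,
    mrp_due_date,
    demand_supply_qty,
    -- starting stock + running total of all demand/supply up to and
    -- including this row = stock after this transaction
    current_stock_qty
        + sum(demand_supply_qty) over (
              partition by part
              order by mrp_due_date
              rows between unbounded preceding and current row) as new_stock_qty
from #base
With the sample data this gives 22 - 1 = 21, then 22 - 16 = 6, then 22 - 6 = 16, matching the expected outcome.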

Matching patterns based on query result in SQL

I have a SQL query which returns three columns: contact ID, number of participations, and year of participation.
Based on this query result, I need to look for a pattern: anyone who attended a certain number of times (the same number every year) over the years.
For example, 2 times every year, or 3 times every year, for 2 years or more consecutively (and not a different number of times in each year).
From the sample below, the contacts I would be interested in pulling are 1008637, 1009256, 1010306 & 1011263.
Please let me know how to achieve this.
Please see image for sample data.
You would need to aggregate twice: once to get the number of participations per year, and then to check the count condition across years.
select id
from (select id, year, count(*) as num_participations
      from tbl
      group by id, year
     ) t
group by id
-- every year on record for the id must have exactly 2 participations
having count(*) = count(distinct case when num_participations = 2 then year end)
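The HAVING above is hard-wired to exactly 2 participations per year. If you want the general pattern from the question (any constant number of times per year, for 2 or more consecutive years), a possible generalization is the sketch below; untested, same derived table as above:
select id
from (select id, year, count(*) as num_participations
      from tbl
      group by id, year
     ) t
group by id
having count(*) >= 2                                       -- at least two years
   and min(num_participations) = max(num_participations)   -- same count every year
   and max(year) - min(year) = count(*) - 1                -- the years are consecutive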

Generating Calculated Fields by Time Period in Report Builder 3.0

I've spent weeks building a massive view, followed by a massive report in SSRS/Report Builder 3.0, that shows master inventory levels, purchasing levels, stock outages, etc. in fine form. The end result of this report is that I'll have every one of our stock items listed as rows, collapsibly grouped by vendor. As columns, I have time values, currently grouped by year and then by month (I'd like to add seasons/quarters here down the road, since we're a very seasonally oriented business, but that can wait right now).
Right now, the report shows the following columns for each month, per item (row): Quantity Shipped, Quantity Received in Stock, Stock Levels at the end of the Period, Avg Time Spent Out of Stock in the Period, and the Number of Outages within the period.
So, what I need now in order to finish up this beast is the estimated daily usage. To get this, I want to take the period length given by the column groupings and subtract Sum(DaysOutofStock). The result would be Days in Stock, by which I'd then divide the totals shipped in the period. To boil this down, I just want to know that on days when the item is in stock, we're moving X on average.
So what I need is the period length, as a value to use in a function. Is there any way to get this automatically? Is it hidden somewhere in the function menu? Is there a function or expression I can use that will look at the month or grouping or whatever of the column and tell me that the month I'm in has X days?
My underlying data has massive gaps, especially earlier in the company's life, so there's no easy way to derive it from there (at least, that I can think of). Please help! I'm so close to finally finishing this project! Thanks!
If you need any more info or details, I'll be happy to provide.
Assuming you can do it in SQL code, the month in your report has the following number of days:
SELECT [Year], [Month],
       DATEDIFF(dd,
                CAST(CAST([Year] AS VARCHAR) + '-' + CAST([Month] AS VARCHAR) + '-01' AS DATETIME),
                DATEADD(mm, 1,
                        CAST(CAST([Year] AS VARCHAR) + '-' + CAST([Month] AS VARCHAR) + '-01' AS DATETIME))) AS DaysDifference
FROM MyTable
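As a side note, if the database behind the report is SQL Server 2012 or later, EOMONTH shortens this; a sketch assuming the same MyTable with [Year] and [Month] columns:
SELECT [Year], [Month],
       -- DAY() of the month's last date = number of days in that month
       DAY(EOMONTH(DATEFROMPARTS([Year], [Month], 1))) AS DaysInMonth
FROM MyTable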

Sum report item in a column group (SSRS 2008)

I have the following payroll table in an SSRS 2008 (R2) report:
The dataset returns labor transactions consisting of the following fields:
STARTDATE
STARTTIME
FINISHDATE
FINISHTIME
REGULARHRS (difference between finish and start)
REFWO (such as "Travel", "Holiday", "Work", etc and is used to sort into the categories shown in the table above)
TIMEWORKED (0/1 flag that indicates whether or not it counts towards "Time Worked" category)
I have a column grouped on STARTDATE so that it displays each day of the week (our weeks go Mon through Sun). Everything down to "Unpaid Time Off" is a simple expression (usually just in the format Sum(IIF(something,A,B))) in both the daily column and the weekly (Totals) column. In the "Interim Regular" box (for the day grouping), I have the following expression:
=IIF(Weekday(Fields!startdate.Value)=1
OR Weekday(Fields!startdate.Value)=7
OR ReportItems!Holiday.Value>0,
0,
ReportItems!TimeWorked.Value-ReportItems!Holiday.Value-ReportItems!Bereave.Value)
Basically what I'm doing is saying: if STARTDATE is a Saturday, Sunday, or holiday, the regular hours are 0, since they would fall into OT1.5 (overtime, time and a half); otherwise I calculate the regular hours worked by subtracting Holiday time and Bereavement time from Time Worked (since both are included as part of Time Worked). This part works great! But when I try to sum up the total for the week using =Sum(ReportItems!InterimRegularDaily.Value), it tells me that I can't use an aggregate on a report item, and that aggregate functions can be used only on report items contained in page headers and footers. I've done extensive googling to see if there is a solution, but everything seems to involve writing custom functions. I can't believe it would be THAT hard to simply sum up a group calculation in an outer group!
Any and all help would be greatly appreciated!!
Thanks!
You can add a scope to your Sum() expression to reference the whole dataset or a group:
'Returns the sum of the whole DataSet
=Sum(Fields!TestField.Value, "YourDataSetName")
'Returns the sum of a group
=Sum(Fields!TestField.Value, "YourGroupName")
You can also stack them:
=Sum(Sum(Fields!TestField.Value, "Level3GroupName"), "Level2GroupName")

How to find the one hour period with the most datapoints?

I have a database table with hundreds of thousands of forum posts, and I would like to find out what hour-long period contains the most number of posts.
I could crawl forward one minute at a time, keeping an array of timestamps and keeping track of what hour had the most in it, but I feel like there is a much better way to do this. I will be running this operation on a year of posts so checking every minute in a year seems pretty awful.
Ideally there would be a way to do this inside a single database query.
Given a table Minutes filled with every minute of the year you are interested in, and a table Posts with a Time column:
select top 1 Minutes.Time, count(Posts.Time)
from Minutes
left join Posts
  on Posts.Time >= Minutes.Time
 and Posts.Time < dateadd(hour, 1, Minutes.Time)
group by Minutes.Time
order by count(Posts.Time) desc
To generate the Minutes table, you can use a function like ufn_GenerateIntegers.
Then the query becomes
select top 5 Minutes.Time, count(Posts.Time)
from (select dateadd(minute, IntValue, '2008-01-01') as Time
      from ufn_GenerateIntegers(525600)) Minutes
left join Posts
  on Posts.Time >= Minutes.Time
 and Posts.Time < dateadd(hour, 1, Minutes.Time)
group by Minutes.Time
order by count(Posts.Time) desc
I just did a test run with about 5000 random posts and it took 16 seconds on my machine. So, not trivial, but not ridiculous for the occasional one-off query. Fortunately, this is a data point you can calculate once a day or even once a month and cache, if you want to display the value frequently.
Take a look at lassevk's improvement.
Binning will work if you want to look at intervals such as 10:00 - 11:00. However, if you had a sudden flurry of interest from 10:30 - 11:30, it would be split across two bins, and hence could be hidden by a smaller number of hits that happened to fit entirely within a single clock hour.
The only way to avoid this problem is to generate a list sorted by time and step through it. Something like this:
max = 0; maxTime = 0
for each $item in the list:
    push $item onto queue
    while head of queue is more than an hour before $item:
        drop queue head
    if queue.count > max then max = queue.count; maxTime = $item.time
That way you only need to hold a 1 hour window in memory rather than the whole list.
Treat the timestamp of every post as the start of such an hour, and count all other posts that fall within that hour, including the post that started it. Sort the resulting hours in descending order by the number of posts in each of them.
Having done that, you'll find the topmost single "hour" that has the most posts in it, but note that this period of time might not be exactly one hour long; it might be shorter (but never longer).
To get a "prettier" period, you can calculate how long it really is, subtract that from one hour, divide the difference by two, and adjust the start of the period back by that amount and the end forward by the same (for example, if the posts span 40 minutes, move the start back 10 minutes and the end forward 10 minutes); this will "center" the posts inside the hour. The adjustment is guaranteed not to pull in any new posts, so the count is still valid: if posts sat close enough to be included once you expanded the period to one hour, then an earlier point would have had "the most posts" in it instead of the one you picked.
If this is an SQL question, you can reuse the SQL that Josh posted here, just replace the Minutes table with another link to your posts table.
Another method you can use is a sliding window.
First sort all the posts according to the timestamp. Keep track of posts using a list; a linked list could be used for this.
Now, for each post, add it to the end of the list. Then, for each post from the start of the list, if that post is more than one hour before the post you just added, remove it from the list.
After doing that 2-step operation for a single new post in the list, check if the number of posts in the list is more than a previous maximum, and if it is, either make a copy of the list or at least store the post you just added.
After you're finished, you've got the "copy of the list" with the most posts in an hour, or you got the post that is the end of a 1-hour window that contains the most posts.
Pseudo-code:
initialize posts-window-list to empty list
for each post in sorted-posts-list:
    add post to end of posts-window-list
    for each other-post from start of posts-window-list:
        if other-post is more than one hour older than post, remove it
        otherwise, end this inner loop
    if number of posts in list is more than previous maximum:
        make copy of list, this is the new maximum
This worked on a small test MS-SQL database.
SELECT TOP 1 id, date_entered,
       (SELECT COUNT(*)
        FROM dbo.notes AS n2
        WHERE n2.date_entered >= n.date_entered
          AND n2.date_entered < DATEADD(hh, 1, n.date_entered)) AS num
FROM dbo.notes n
ORDER BY num DESC
This is not very efficient; it checks an hour-long window starting at each post.
For MySQL:
SELECT ID, f.Date,
       (SELECT COUNT(*)
        FROM Forum AS f2
        WHERE f2.Date >= f.Date
          AND f2.Date < DATE_ADD(f.Date, INTERVAL 1 HOUR)) AS num
FROM Forum AS f
ORDER BY num DESC -- descending, so the busiest hour comes first
LIMIT 0, 1
Here's a slight variation on the other Josh's implementation. It forgoes the intermediate table and uses a self join instead, looking for any posts within an hour of each post.
-- each post is treated as the end of a candidate one-hour window
select top 1 posts.DateCreated, count(posts.DateCreated),
       min(Minutes.DateCreated) as MinPostDate,
       max(Minutes.DateCreated) as MaxPostDate
from posts Minutes
left join posts
  on posts.DateCreated >= Minutes.DateCreated
 and posts.DateCreated < dateadd(hour, 1, Minutes.DateCreated)
group by posts.DateCreated
order by count(posts.DateCreated) desc
From a performance perspective, on a table with only 6 rows his method, which used the function to generate the intermediate table, took 16 seconds, vs. this one, which was subsecond.
I'm not positive whether this approach could miss a valid timeframe, since the windows are based on the offset of each post.
This results in an O(n) database query, and an O(n) greatest time search, for a total complexity of O(2n) (which, of course, is still O(n)):
Use a COUNT command with GROUP BY in SQL, which will 'bin' items for you in minute increments.
So you'd run the count query on this table:
time
1
2
4
3
3
2
4
1
3
2
And it would return:
1 2
2 3
3 3
4 2
By counting each item.
I suspect you can do the same thing with your table, and bin them by the minute, then run an algorithm on that.
SELECT customer_name, COUNT(DISTINCT city) as "Distinct Cities"
FROM customers
GROUP BY customer_name;
From this tutorial on count: http://www.techonthenet.com/sql/count.php (near the end).
Here is a similar page from MySQL's manual: http://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
So if you have a table with a datetime in it (to the minute, allowing binning to happen by minutes):
datetime (yyyymmddhhmm)
200901121435
200901121538
200901121435
200901121538
200901121435
200901121538
200901121538
200901121435
200901121435
200901121538
200901121435
200901121435
Then the SQL
SELECT datetime, COUNT(*) AS "Count"
FROM post
GROUP BY datetime;
should return
200901121435 7
200901121538 5
You will still need to post-process this, but the hard work of grouping and counting is done, and it will only result in just over 500k rows per year (60 minutes × 24 hours × 365 days = 525,600).
The post processing would be:
Start at time T = first post time.
Set greatestTime = T.
Sum all counts between T and T + one hour --> currentHourCount and greatestHourCount.
While records exist past T + one hour:
    Increment T by one minute.
    While the first element is prior to time T, subtract it.
    While the last element is before time T + one hour, add it.
    If currentHourCount > greatestHourCount then
        greatestHourCount = currentHourCount
        greatestTime = T
end while
-Adam
This will do it.
SELECT A.DateOfEvent AS HourBegin,
       DATEADD(hh, 1, A.DateOfEvent) AS HourEnd,
       COUNT(*) AS NumEventsPerHour
FROM tEvents AS A
JOIN tEvents AS B
  ON B.DateOfEvent >= A.DateOfEvent
 AND B.DateOfEvent < DATEADD(hh, 1, A.DateOfEvent)
GROUP BY A.DateOfEvent
If MySQL:
-- the first 13 characters of 'YYYY-MM-DD HH:MM:SS' identify the clock hour
select substr(timestamp, 1, 13) as hour, count(*) as count
from forum_posts
group by hour
order by count desc
limit 1;
Edit: not sure if the original question means any possible 60-minute period.
If using MySQL:
SELECT DATE(postDate), HOUR(postDate), COUNT(*) AS n
FROM posts
GROUP BY DATE(postDate), HOUR(postDate)
ORDER BY n DESC
LIMIT 1
SELECT DATEPART(hour, PostDateTime) AS HourOfDay,
COUNT(*) AS ForumPosts
FROM Posts
GROUP BY DATEPART(hour, PostDateTime)
