View AVG Task Execution Time in Snowflake

I run the following task in Snowflake to see which queries are candidates for inefficiency improvements:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history())
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
However, I frequently see the seconds a query took to run change from day to day. How can I modify this query in Snowflake to get all the historical runs from task history and average their seconds to get a fuller picture with less variance?
The documentation says it pulls the last 7 days but in practice it is only pulling the last 2 days based on the output's scheduled_time (each of my tasks run every 12 hours). I'd like to get the average seconds each task took over the last 30 days and sort them.

"The documentation says it pulls the last 7 days but in practice it is only pulling the last 2 days based on the output's scheduled_time (each of my tasks run every 12 hours)."
Task history
RESULT_LIMIT => integer
A number specifying the maximum number of rows returned by the function.
Default: 100.
To get more rows, the RESULT_LIMIT parameter should be set explicitly:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history(RESULT_LIMIT =>10000))
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
ACCOUNT_USAGE.TASK_HISTORY provides data for the last 365 days.
SELECT datediff(second,scheduled_time,query_start_time) as second, *
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
ORDER BY datediff(second,scheduled_time,query_start_time) DESC;
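For the 30-day averages the question asks for, a minimal sketch on top of that view, assuming the NAME column in ACCOUNT_USAGE.TASK_HISTORY identifies each task (the metric is the same scheduled-to-start gap used in the original query):
-- Sketch: average the seconds per task over the last 30 days and sort.
SELECT name,
       AVG(DATEDIFF(second, scheduled_time, query_start_time)) AS avg_seconds
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
  AND scheduled_time > DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY name
ORDER BY avg_seconds DESC;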

Related

Cron Script to execute a job every 14 days from a given date in specific time zone

I want to execute a job in CRON every 14 days from a specific date and time zone.
For example, from June 24th, every 14 days, in the CST time zone.
Run job every fortnight
The easy way
The easiest way to do this is simply to create the task to run every 14 days from when you want it to first run like:
CREATE TASK mytask_fortnightly
WAREHOUSE = general
SCHEDULE = '20160 MINUTE'
AS
SELECT 'Hello world'
How it works
There are 60 minutes in an hour, 24 hours in a day, and 14 days in a fortnight, so that's 60 × 24 × 14 = 20,160 minutes.
Caveat
The above solution does not run the task every fortnight from a given date/time, but rather every fortnight from when the task is created.
Even though this is the simplest method, it does require you to create the task at exactly the desired next scheduled time.
As a workaround however, you can create a one-shot task to do that for you the very first time at the exact correct date/time. This means you don't have to remember to be awake / alert / present to do it manually yourself, and you can clean up the creation task afterwards.
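For example, a sketch of that one-shot bootstrap (names are hypothetical, and since a task runs a single SQL statement, the CREATE TASK DDL is assumed to be wrapped in a stored procedure):
-- Hypothetical one-shot task: fires at 00:00 on June 24th (America/Chicago,
-- to match the CST requirement) and calls a procedure assumed to contain
-- the fortnightly CREATE TASK DDL shown above. Drop it after the first run.
CREATE TASK bootstrap_fortnightly
  WAREHOUSE = general
  SCHEDULE = 'USING CRON 0 0 24 6 * America/Chicago'
AS
  CALL create_mytask_fortnightly();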
The harder way
Other solutions require you to create a task that runs every Thursday (since 2021-06-24 is/was a Thursday, each subsequent Thursday will be either the off-week or the fortnight week)
e.g. SCHEDULE = 'USING CRON 0 0 * * THU'
Then you add logic to it to determine whether the current week is the correct fortnight.
This method also incurs execution cost in the off-week, just to determine whether it's the correct week.
Javascript SP
In JavaScript you can determine whether it's the correct week by subtracting the start date from the current date; if the difference is not a multiple of 14 days, use that as a condition to short-circuit the SP.
// Milliseconds elapsed since the anchor date (parsed as UTC midnight)
const deltaMs = (new Date()) - (new Date('2021-06-24'));
// Truncate to whole days (86,400,000 ms per day)
const deltaDays = Math.trunc(deltaMs / 86400000);
// Run only on exact multiples of 14 days
const run = deltaDays % 14 === 0;
if (!run) return;
// ... continue to do what you want.
SQL
You can also check whether it's the fortnight week using the following SQL condition in a WHERE clause, or in IFF / CASE expressions.
DATEDIFF('day', '2021-06-24', CURRENT_DATE) % 14 = 0
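Putting the pieces together, a sketch of the Thursday task gated by that check (the my_log table and the UTC time zone are illustrative; Snowflake's CRON schedule expects a time zone):
-- Hypothetical fortnightly task: runs every Thursday, but the WHERE clause
-- makes the insert a no-op in the off-week.
CREATE TASK mytask_fortnightly
  WAREHOUSE = general
  SCHEDULE = 'USING CRON 0 0 * * THU UTC'
AS
  INSERT INTO my_log
  SELECT 'Hello world'
  FROM VALUES (1)
  WHERE DATEDIFF('day', '2021-06-24', CURRENT_DATE) % 14 = 0;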

Show second to last item in Influx query (or ignore last)

I am using Grafana to show the number of entries added to the database every minute, and I would like to display the most recent fully counted value.
If I give the following command:
SELECT count("value") FROM "numSv" GROUP BY time(1m)
I get results like:
1615904700000000000 60
1615904760000000000 60
1615904820000000000 60
1615904880000000000 60
1615904940000000000 36
Grafana is going to display the last entry, which is still in the process of counting. How can I display the n[-1] entry, which has been fully counted?
Otherwise, how do I ask Influx to give me the same results excluding the last dataset?
P.S.: Using WHERE time > now() - 60s, etc... doesn't work.
Use "magic" Grafana time range math and select dashboard time range from now-1m/m to now-1m/m. That generates an absolute time range, which refers to last fully counted minute. Query is then standard with $timeFilter Grafana macro:
SELECT count("value") FROM "numSv" WHERE $timeFilter

Current Record Calculated Field on Previous Record Calculated Value

In this example, there are 5 periods of actual balances and the implied depreciation rates. Starting in Period 6, the Balance needs to be calculated as the previous period's balance ($8,177,481) multiplied by (1 + the current period depreciation rate of -1.50%), and so on. I've heard of recursive CTEs, but I am not familiar with them.
Period  DeprRate  Balance      Comment
1        0.00%    $10,000,000  Actual Values
2       -1.62%    $9,838,000   Actual Values
3       -7.41%    $9,109,004   Actual Values
4       -8.00%    $8,380,284   Actual Values
5       -2.42%    $8,177,481   Actual Values
6       -1.50%    null         should be $8,177,481 * (1 - .015)
7       -1.50%    null         should be Pd 6 calc balance * (1 - .015)
8       -5.73%    null         should be Pd 7 calc balance * (1 - .0573)
9       -4.13%    null         should be Pd 8 calc balance * (1 - .0413)
10      -1.50%    null         should be Pd 9 calc balance * (1 - .015)
CREATE TABLE Table1
([Period] int, [DeprRate] float, Balance float)
;
INSERT INTO Table1
([Period], [DeprRate], Balance)
VALUES
(1,0,10000000),
(2,-0.0162,9838000),
(3,-0.0741,9109004.2),
(4,-0.08,8380283.864),
(5,-0.0242,8177480.9944912),
(6,-0.015,null),
(7,-0.015,null),
(8,-0.0573,null),
(9,-0.0413,null),
(10,-0.015,null)
"This seems relatively easy, but can't get it."
Yes, it is. Did you follow these steps?
"I have 10 periods of actual balances and the implied depreciation rates."
Step 1: Create a table (Table_1) and populate it with these values.
" Starting in Period 11, need the Balance to be calculated on previous period balance * the current period depreciation rate."
Step 2: Create a query that calculates the new balances from the values of the previous table, execute it, and insert the results into a new table (Table_2).
" Period 11 isn't difficult if that's all that was needed by using lag. Problem is Period 12-20 need to be calculating current period balance on previous period calculated balance multiplied by the current period depreciation rate."
Step 3: Two options here. One is a recursive query, as 'Vinit' commented (a sketch follows below); the other (easier) option is to repeat Step 2 and append to Table_2.
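A minimal sketch of the recursive-CTE option, using the Table1 sample data above; it anchors on Period 5 (the last actual balance) and multiplies forward by (1 + DeprRate), which for a rate of -1.50% is the same as (1 - .015):
WITH cte AS (
    -- Anchor: the last period with an actual balance
    SELECT [Period], [DeprRate], CAST(Balance AS float) AS Balance
    FROM Table1
    WHERE [Period] = 5
    UNION ALL
    -- Recurse: next period's balance = previous balance * (1 + rate)
    SELECT t.[Period], t.[DeprRate], cte.Balance * (1 + t.[DeprRate])
    FROM Table1 t
    JOIN cte ON t.[Period] = cte.[Period] + 1
)
SELECT [Period], [DeprRate], Balance
FROM cte
ORDER BY [Period];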
=======
Knowledge sharing / value addition to your question: depreciation is an accounting concept, usually taken into account either at year end (closing of the books) or at the end of an asset's life. The concept is tricky because at least two different calculations usually have to be performed to satisfy both tax compliance and management accounting requirements, and additional calculations may be needed for each type of asset just to determine the best possible option.
Though you did not include a date column in your sample data, you should write the script to calculate and populate the depreciated values as of a particular date. You can also arrange to execute this script via a trigger or through a job agent (scheduling).
Hope this helps.

Cumulative Sum - Choosing Portions of Hierarchy

I have a bit of an interesting problem.
I require a cumulative sum over a set built from pieces of a Time dimension. The Time dimension is based on hours and minutes; it begins at hour 0, minute 0 and ends at hour 23, minute 59.
What I need to do is slice out portions, say 09:30 AM - 04:00 PM or 04:30 PM - 09:30 AM, and I need these values in order to perform my cumulative sums. I'm hoping someone can suggest a means of doing this with standard MDX. If not, is my only alternative to write my own stored procedure that forms my period-to-date set extraction using the logic described above?
Thanks in advance!
You can create a secondary hierarchy in your time dimension with only the hour level and filter the query with it.
[Time].[Calendar] -> the hierarchy with year, months, day and hours level
[Time].[Hour] -> the 'new' hierarchy with only the hours level, e.g. 09:30 AM.
Then you can write an MDX query adding your criteria as a filter:
SELECT
my axis...
WHERE ( SELECT { [Time].[Hour].[09:30 AM]:[Time].[Hour].[04:00 PM] } on 0 FROM [MyCube] )
You can also create a new dimension instead of a hierarchy; the difference is in the autoexists behaviour and in performance.

How to find the one hour period with the most datapoints?

I have a database table with hundreds of thousands of forum posts, and I would like to find out what hour-long period contains the most number of posts.
I could crawl forward one minute at a time, keeping an array of timestamps and keeping track of what hour had the most in it, but I feel like there is a much better way to do this. I will be running this operation on a year of posts so checking every minute in a year seems pretty awful.
Ideally there would be a way to do this inside a single database query.
Given a table Minutes filled with every minute in the year you are interested in, and a table Posts with a Time column:
select top 1 minutes.time, count (posts.time)
from Minutes
left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count (posts.time) desc
To solve generating the minutes table, you can use a function like ufn_GenerateIntegers.
Then the query becomes
select top 5 minutes.time, count (posts.time)
from (select dateadd(minute, IntValue, '2008-01-01') as Time from ufn_GenerateIntegers(525600)) Minutes
left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count(posts.time) desc
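If a helper like ufn_GenerateIntegers isn't available, a recursive CTE can generate the minutes inline. A sketch in SQL Server syntax (MAXRECURSION must be lifted to cover a full year of 525,600 minutes; it is not fast, but it is self-contained):
WITH Minutes AS (
    -- Anchor at the first minute of the year, then add one minute per step
    SELECT CAST('2008-01-01' AS datetime) AS Time
    UNION ALL
    SELECT DATEADD(minute, 1, Time)
    FROM Minutes
    WHERE Time < DATEADD(minute, 525599, CAST('2008-01-01' AS datetime))
)
SELECT TOP 5 Minutes.Time, COUNT(posts.time) AS num
FROM Minutes
LEFT JOIN posts ON posts.time >= Minutes.Time
               AND posts.time < DATEADD(hour, 1, Minutes.Time)
GROUP BY Minutes.Time
ORDER BY num DESC
OPTION (MAXRECURSION 0);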
I just did a test run with about 5,000 random posts and it took 16 seconds on my machine. So, not trivial, but not ridiculous for the occasional one-off query. Fortunately, this is a data point you can calculate once a day, or even once a month, and cache if you want to display the value frequently.
Take a look at lassevk's improvement.
Binning will work if you want to look at intervals such as 10:00 - 11:00. However, if you had a sudden flurry of interest from 10:30 - 11:30, it will be split across two bins, and hence may be hidden by a smaller number of hits that happened to fit entirely within a single clock hour.
The only way to avoid this problem is to generate a list sorted by time and step through it. Something like this:
max = 0; maxTime = 0
for each $item in the list:
    push $item onto queue
    while head of queue is more than an hour before $item:
        drop queue head
    if queue.count > max then max = queue.count; maxTime = $item.time
That way you only need to hold a 1 hour window in memory rather than the whole list.
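The same sliding-window idea can be written directly in SQL dialects that support RANGE frames with interval offsets (e.g. PostgreSQL 11+); a sketch, reusing the posts table and time column from the earlier examples:
-- For each post, count all posts in the hour ending at that post;
-- the top row is the densest one-hour window.
SELECT time AS window_end,
       COUNT(*) OVER (
           ORDER BY time
           RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
       ) AS posts_in_hour
FROM posts
ORDER BY posts_in_hour DESC
LIMIT 1;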
Treat the timestamp of every post as the start of such an hour, and count all other posts that fall within that hour, including the post that started it. Sort the resulting hours in descending order by the number of posts in each of them.
Having done that, you'll find the topmost single "hour" that has the most posts in it, but this period of time might not be exactly one hour long, it might be shorter (but never longer).
To get a "prettier" period, you can calculate how long it really is, divide by two, and adjust the start of the period back by that amount and the end forward by the same amount; this "centers" the posts inside the hour. The adjustment is guaranteed not to include any new posts, so the count is still valid. If posts were close enough to suddenly be included in the period after you expanded it to one hour, then an earlier point would have had "the most posts" in it instead of the one you picked.
If this is an SQL question, you can reuse the SQL that Josh posted here, just replace the Minutes table with another link to your posts table.
Another method you can use is to use a sliding window.
First sort all the posts by timestamp. Keep track of posts using a list; a linked list could be used for this.
Now, for each post, add it to the end of the list. Then, for each post from the start of the list, if that post is more than one hour before the post you just added, remove it from the list.
After doing that 2-step operation for a single new post in the list, check if the number of posts in the list is more than a previous maximum, and if it is, either make a copy of the list or at least store the post you just added.
After you're finished, you've got the "copy of the list" with the most posts in an hour, or you've got the post that ends the 1-hour window containing the most posts.
Pseudo-code:
initialize posts-window-list to empty list
for each post in sorted-posts-list:
    add post to end of posts-window-list
    for each other-post from start of posts-window-list:
        if other-post is more than one hour older than post, remove it
        otherwise, end this inner loop
    if number of posts in list is more than previous maximum:
        make copy of list, this is the new maximum
This worked on a small test MS-SQL database.
SELECT TOP 1 id, date_entered,
(SELECT COUNT(*)
FROM dbo.notes AS n2
WHERE n2.date_entered >= n.date_entered
AND n2.date_entered < Dateadd(hh, 1, n.date_entered)) AS num
FROM dbo.notes n
ORDER BY num DESC
This is not very efficient; it checks an hour-long window starting at each post.
For MYSQL
SELECT ID,f.Date, (SELECT COUNT(*)
FROM Forum AS f2
WHERE f2.Date >= f.Date AND f2.Date < Date_ADD(f.Date, INTERVAL 1 HOUR)) As num
FROM Forum AS f
ORDER BY num DESC
LIMIT 0,1
Here's a slight variation on the other Josh's implementation; this forgoes the intermediate table and instead self-joins the posts table, looking for any posts within an hour of each post.
select top 1 posts.DateCreated, count (posts.datecreated),
min(minutes.DateCreated) as MinPostDate,
max(minutes.datecreated) as MaxPostDate
from posts Minutes
left join posts on posts.datecreated >= minutes.DateCreated
AND posts.datecreated < dateadd(hour, 1, Minutes.DateCreated)
group by posts.DateCreated
order by count(posts.datecreated) desc
From a performance perspective, on a table with only 6 rows, his method (which used the function to generate the intermediate table) took 16 seconds, versus this one, which was subsecond.
I'm not positive whether this approach could miss a valid timeframe, since the windows are anchored to the offset of each post.
This results in an O(n) database query, and an O(n) greatest time search, for a total complexity of O(2n) (which, of course, is still O(n)):
Use SQL's COUNT with GROUP BY, which will 'bin' items for you in minute increments.
So you'd run the count query on this table:
time
1
2
4
3
3
2
4
1
3
2
And it would return:
1 2
2 3
3 3
4 2
By counting each item.
I suspect you can do the same thing with your table, and bin them by the minute, then run an algorithm on that.
SELECT customer_name, COUNT(DISTINCT city) as "Distinct Cities"
FROM customers
GROUP BY customer_name;
From this tutorial on count: http://www.techonthenet.com/sql/count.php (near the end).
Here is a similar page from MySQL's manual: http://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
So if you have a table with a timedate in it (to the minute, allowing binning to happen by minutes):
datetime (yyyymmddhhmm)
200901121435
200901121538
200901121435
200901121538
200901121435
200901121538
200901121538
200901121435
200901121435
200901121538
200901121435
200901121435
Then the SQL
SELECT datetime, COUNT(*) AS num
FROM post
GROUP BY datetime;
should return
200901121435 7
200901121538 5
You will still need to post-process this, but the hard work of grouping and counting is done, and it results in just over 500k rows per year (60 minutes × 24 hours × 365 days = 525,600).
The post processing would be:
Start at time T = first post time
Set greatestTime = T
Sum all counts between T and T + one hour --> currentHourCount and greatestHourCount
While records exist past T + one hour:
    Increment T by one minute
    While the first element is prior to time T, subtract it
    While the last element is before time T + one hour, add it
    If currentHourCount > greatestHourCount then
        greatestHourCount = currentHourCount
        greatestTime = T
End while
-Adam
This will do it.
SELECT A.DateOfEvent AS HourBegin, DATEADD(hh, 1, A.DateOfEvent) AS HourEnd,
       COUNT(*) AS NumEventsPerHour
FROM tEvents AS A
JOIN tEvents AS B
  ON B.DateOfEvent >= A.DateOfEvent AND B.DateOfEvent < DATEADD(hh, 1, A.DateOfEvent)
GROUP BY A.DateOfEvent
ORDER BY NumEventsPerHour DESC
If mysql:
select substr(timestamp, 1, 13) as hour, count(*) as count from forum_posts group by hour order by count desc limit 1;
edit: not sure if original question means any possible 60-minute period
If using MySQL:
SELECT DATE(postDate), HOUR(postDate), COUNT(*) AS n
FROM posts
GROUP BY DATE(postDate), HOUR(postDate)
ORDER BY n DESC
LIMIT 1
SELECT DATEPART(hour, PostDateTime) AS HourOfDay,
COUNT(*) AS ForumPosts
FROM Posts
GROUP BY DATEPART(hour, PostDateTime)
ORDER BY ForumPosts DESC
