I have a database table with hundreds of thousands of forum posts, and I would like to find out which hour-long period contains the greatest number of posts.
I could crawl forward one minute at a time, keeping an array of timestamps and tracking which hour had the most in it, but I feel like there is a much better way to do this. I will be running this operation on a year of posts, so checking every minute in a year seems pretty awful.
Ideally there would be a way to do this inside a single database query.
Given a table called Minutes filled with every minute in the year you are interested in, and a table Posts with a Time column:
select top 1 Minutes.Time, count(Posts.Time)
from Minutes
left join Posts on Posts.Time >= Minutes.Time and Posts.Time < dateadd(hour, 1, Minutes.Time)
group by Minutes.Time
order by count(Posts.Time) desc
To generate the Minutes table, you can use a function like ufn_GenerateIntegers.
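If you don't have such a function available, a recursive CTE can stand in for it; a minimal sketch in T-SQL (ufn_GenerateIntegers itself isn't shown in this answer):

-- Generate one integer per minute of a year (525,600 of them), then turn
-- each into a minute-of-year timestamp.
with Integers as (
    select 0 as IntValue
    union all
    select IntValue + 1 from Integers where IntValue < 525599
)
select dateadd(minute, IntValue, '2008-01-01') as Time
from Integers
option (maxrecursion 0);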
Then the query becomes:
select top 5 Minutes.Time, count(Posts.Time)
from (select dateadd(minute, IntValue, '2008-01-01') as Time
      from ufn_GenerateIntegers(525600)) Minutes
left join Posts on Posts.Time >= Minutes.Time and Posts.Time < dateadd(hour, 1, Minutes.Time)
group by Minutes.Time
order by count(Posts.Time) desc
I just did a test run with about 5,000 random posts and it took 16 seconds on my machine. So, not trivial, but not ridiculous for the occasional one-off query. Fortunately, this is a data point you can calculate once a day or even once a month and cache if you want to display the value frequently.
Take a look at lassevk's improvement.
Binning will work if you want to look at intervals such as 10:00 - 11:00. However, if you had a sudden flurry of interest from 10:30 - 11:30, then it will be split across two bins, and hence may be hidden by a smaller number of hits that happened to fit entirely within a single clock hour.
The only way to avoid this problem is to generate a list sorted by time and step through it. Something like this:
max = 0; maxTime = 0
for each $item in the list:
    push $item onto queue
    while head of queue is more than an hour before $item:
        drop queue head
    if queue.count > max then max = queue.count; maxTime = $item.time
That way you only need to hold a 1 hour window in memory rather than the whole list.
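If you'd rather stay in the database, dialects with window frames can express this same sliding window directly. A sketch in MySQL 8 syntax, assuming a posts table with a time column:

-- For each post, count the posts in the hour ending at it, then keep the
-- busiest such window.
SELECT time,
       COUNT(*) OVER (
           ORDER BY time
           RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW
       ) AS posts_in_hour
FROM posts
ORDER BY posts_in_hour DESC
LIMIT 1;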
Treat the timestamp of every post as the start of such an hour, and count all other posts that fall within that hour, including the post that started it. Sort the resulting hours in descending order by the number of posts in each of them.
Having done that, you'll find the topmost single "hour" that has the most posts in it, but this period of time might not be exactly one hour long; it might be shorter (but never longer).
To get a "prettier" period, you can calculate how long it really is, divide by two, and adjust the start of the period back by that amount and the end forward, this will "center" the posts inside the hour. This adjustment is guaranteed to not include any new posts, so the count is still valid. If posts are close enough to suddenly be included in the period after you have expanded it to one hour, then an earlier point would've had "the most posts" in it instead of the one you picked.
If this is an SQL question, you can reuse the SQL that Josh posted here, just replace the Minutes table with another link to your posts table.
Another method you can use is a sliding window.
First sort all the posts according to the timestamp. Keep track of posts using a list; a linked list works well for this.
Now, for each post, add it to the end of the list. Then, for each post from the start of the list, if that post is more than one hour before the post you just added, remove it from the list.
After doing that 2-step operation for each new post added to the list, check if the number of posts in the list is more than the previous maximum, and if it is, either make a copy of the list or at least store the post you just added.
After you're finished, you've got the "copy of the list" with the most posts in an hour, or at least the post that marks the end of a 1-hour window containing the most posts.
Pseudo-code:
initialize posts-window-list to empty list
for each post in sorted-posts-list:
    add post to end of posts-window-list
    for each other-post from start of posts-window-list:
        if other-post is more than one hour older than post:
            remove it from posts-window-list
        otherwise:
            end this inner loop
    if number of posts in posts-window-list is more than previous maximum:
        make copy of list; this is the new maximum
This worked on a small test MS-SQL database.
SELECT TOP 1 id, date_entered,
       (SELECT COUNT(*)
        FROM dbo.notes AS n2
        WHERE n2.date_entered >= n.date_entered
          AND n2.date_entered < DATEADD(hh, 1, n.date_entered)) AS num
FROM dbo.notes AS n
ORDER BY num DESC
This is not very efficient; it checks an hour-long window starting at each post.
For MySQL:
SELECT ID, f.Date, (SELECT COUNT(*)
    FROM Forum AS f2
    WHERE f2.Date >= f.Date AND f2.Date < DATE_ADD(f.Date, INTERVAL 1 HOUR)) AS num
FROM Forum AS f
ORDER BY num DESC
LIMIT 0,1
Here's a slight variation on the other Josh's implementation. This forgoes the intermediate table and instead self-joins Posts, looking for any posts within an hour of each post.
-- Each row of w anchors an hour-long window [w.DateCreated, w.DateCreated + 1 hour).
-- Grouping by p.DateCreated counts how many of those windows contain post p,
-- which equals the number of posts in the hour ending at p.
select top 1 p.DateCreated, count(p.DateCreated) as NumPosts,
       min(w.DateCreated) as MinPostDate,
       max(w.DateCreated) as MaxPostDate
from posts w
left join posts p on p.DateCreated >= w.DateCreated
    and p.DateCreated < dateadd(hour, 1, w.DateCreated)
group by p.DateCreated
order by count(p.DateCreated) desc
From a performance perspective, on a table with only 6 rows, his method (which used the function to generate the intermediate table) took 16 seconds versus this one, which was subsecond.
I'm not positive whether this could miss a valid timeframe, since the windows are anchored to the posts themselves. It shouldn't: as lassevk's answer observes, a densest window can always be shifted until its boundary coincides with a post without losing any posts.
This results in an O(n) database query, and an O(n) greatest time search, for a total complexity of O(2n) (which, of course, is still O(n)):
Use COUNT with GROUP BY in SQL, which will 'bin' items for you in minute increments.
So you'd run the count query on this table:
time
1
2
4
3
3
2
4
1
3
2
And it would return:
1 2
2 3
3 3
4 2
By counting each item.
I suspect you can do the same thing with your table, and bin them by the minute, then run an algorithm on that.
SELECT customer_name, COUNT(DISTINCT city) as "Distinct Cities"
FROM customers
GROUP BY customer_name;
From this tutorial on count: http://www.techonthenet.com/sql/count.php (near the end).
Here is a similar page from MySQL's manual: http://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
So if you have a table with a datetime in it (to the minute, allowing binning to happen by minutes):
datetime (yyyymmddhhmm)
200901121435
200901121538
200901121435
200901121538
200901121435
200901121538
200901121538
200901121435
200901121435
200901121538
200901121435
200901121435
Then the SQL
SELECT datetime, COUNT(*) AS "Count"
FROM post
GROUP BY datetime;
should return
200901121435 7
200901121538 5
You will still need to post-process this, but the hard work of grouping and counting is done, and it will only result in just over 500k rows per year (60 minutes × 24 hours × 365 days = 525,600).
The post-processing would be:

Start at time T = first post time.
Set greatestTime = T.
Sum all counts between T and T + one hour into currentHourCount; set greatestHourCount = currentHourCount.
While records exist past T + one hour:
    Increment T by one minute.
    While the first element in the window is prior to time T, subtract its count from currentHourCount.
    While the next element is before T + one hour, add its count to currentHourCount.
    If currentHourCount > greatestHourCount then:
        greatestHourCount = currentHourCount
        greatestTime = T
End while.
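If your dialect supports window frames, that rolling sum can also stay in SQL. A sketch in MySQL 8 syntax, assuming the binned counts land in a (hypothetical) minute_counts(minute_start, cnt) table:

-- Sum each minute bin together with the 59 minutes that follow it, then
-- keep the hour-long window with the highest total.
SELECT minute_start,
       SUM(cnt) OVER (
           ORDER BY minute_start
           RANGE BETWEEN CURRENT ROW AND INTERVAL 59 MINUTE FOLLOWING
       ) AS hour_count
FROM minute_counts
ORDER BY hour_count DESC
LIMIT 1;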
-Adam
This will do it.
SELECT A.DateOfEvent AS HourBegin, DATEADD(hh, 1, A.DateOfEvent) AS HourEnd, COUNT(*) AS NumEventsPerHour
FROM tEvents AS A
JOIN tEvents AS B
    ON B.DateOfEvent >= A.DateOfEvent AND B.DateOfEvent < DATEADD(hh, 1, A.DateOfEvent)
GROUP BY A.DateOfEvent
ORDER BY NumEventsPerHour DESC
If MySQL:
select substr( timestamp, 1, 13 ) as hour, count(*) as count from forum_posts group by hour order by count desc limit 1;
Edit: not sure if the original question means any possible 60-minute period.
If using MySQL:
SELECT DATE(postDate), HOUR(postDate), COUNT(*) AS n
FROM posts
GROUP BY DATE(postDate), HOUR(postDate)
ORDER BY n DESC
LIMIT 1
SELECT DATEPART(hour, PostDateTime) AS HourOfDay,
COUNT(*) AS ForumPosts
FROM Posts
GROUP BY DATEPART(hour, PostDateTime)
Related
I'm working on an MRP simulation in which I have to subtract demand quantities from, or add supply quantities to, the available stock, and I hope you can be of support. Below is the result I want to achieve.
I have 1 value for stock = 22 and a lot of values for future demand/supply on specific dates.
Part     Stock                         Demand/Supply qty   Demand/Supply Date   Result
1000680  22                            -1                  2023-01-01           21
1000680  21 (what I want to achieve)   -15                 2023-01-02           6 (expected outcome)
1000680  6 (what I want to achieve)    +10                 2023-01-03           16 (expected outcome)
I'm still on the SQL learning curve. I started by adding row numbers to the lines to make sure that the sequence is correct:
select
    part,
    rownum = ROW_NUMBER() OVER (ORDER BY part, mrp_due_date),
    current_stock_qty,
    demand_supply_qty,
    current_stock_qty - demand_supply_qty as new_stock_qty, -- if demand
    current_stock_qty + demand_supply_qty as new_stock_qty, -- if supply
    mrp_due_date
from #base
Then I tried the LAG function to derive the previous row's new_stock_qty at each date, but this only worked for the first line.
So I probably need a loop to first calculate stock minus demand and then use the result as the new stock.
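For what it's worth, a running total with a window function would avoid the loop entirely. A minimal sketch (assuming, as in the sample above, that demand quantities are stored negative and supply quantities positive, and that current_stock_qty carries the same starting stock on every row of a part):

select
    part,
    mrp_due_date,
    demand_supply_qty,
    -- starting stock plus the running sum of all signed demand/supply
    -- quantities up to and including this row
    current_stock_qty
        + sum(demand_supply_qty) over (
              partition by part
              order by mrp_due_date
              rows between unbounded preceding and current row
          ) as new_stock_qty
from #base;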
I have looked through similar questions asked on this site, but I find it difficult to define my solution based on that information.
I am trying to calculate the average of durations from the last 40 days for different IDs.
Example: I have 40 days, and for each day, IDs from 1-20; each ID has a start date and end date in HH:MI:SS.
My code is a cursor which fetches the last 40 days; then, in a second loop, I select all the IDs for that day. Then I go through every ID for that day and select its start and end date, calculating the duration. So far so good. But how do I calculate the average of the durations for the IDs over the last 40 days?
The idea is simple: take the durations for one ID (over the last 40 days), add them together, and divide them by 40. And then do the same for all IDs. My plan was to make a 2D array, putting all IDs in the first dimension and the durations in the second, adding the values for one ID together. Then I would have all the durations for one ID added together and could read the value from the array. But I am kinda stuck on that idea.
I also wonder if there is a better solution.
Thanks for any help!
From my point of view, you don't need loops or PL/SQL; just calculate the average:
select id,
avg(end_date - start_date)
from your_table
where start_date >= trunc(sysdate) - 40
group by id
A drawback might be what you said: that you stored dates as hh:mi:ss. What does that mean? That you stored them as strings? If so, that's most probably a bad idea; dates (as Oracle doesn't have a separate datatype for time) should be stored in DATE columns.
If you really have to work with strings, then convert them to dates:
avg(to_date(end_date, 'hh:mi:ss') - to_date(start_date, 'hh:mi:ss'))
Also, you'll then have to have another DATE datatype column which is capable of saying what "last 40 days" actually means.
The result (the average) will be the number of days between these values. Then you can format it to look prettier, if you want.
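For example, since the average is a fraction of a day, one way to render it as hours, minutes, and seconds (a sketch using the same your_table columns, assuming durations stay under 24 hours):

select id,
       -- add the day-fraction to a truncated date, then format the time part
       to_char(trunc(sysdate) + avg(end_date - start_date), 'hh24:mi:ss') as avg_duration
from your_table
where start_date >= trunc(sysdate) - 40
group by id;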
I run the following task in Snowflake to see which queries are candidates for inefficiency improvements:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history())
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
However, I frequently see the seconds a query took to run change from day to day. How can I modify this query in Snowflake to get all the historical runs from task history and average their seconds to get a fuller picture with less variance?
The documentation says it pulls the last 7 days but in practice it is only pulling the last 2 days based on the output's scheduled_time (each of my tasks run every 12 hours). I'd like to get the average seconds each task took over the last 30 days and sort them.
The documentation says it pulls the last 7 days but in practice it is only pulling the last 2 days based on the output's scheduled_time (each of my tasks run every 12 hours).
Task history
RESULT_LIMIT => integer
A number specifying the maximum number of rows returned by the function.
Default: 100.
To get more rows, RESULT_LIMIT should be defined:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history(RESULT_LIMIT =>10000))
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
ACCOUNT_USAGE.TASK_HISTORY provides data for the last 365 days.
SELECT datediff(second,scheduled_time,query_start_time) as second, *
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
ORDER BY datediff(second,scheduled_time,query_start_time) DESC;
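To then get the average seconds per task over the last 30 days and sort by it, as the question asks, something like this should work (grouping on the view's NAME column):

-- Average scheduling-to-start delay per task over the last 30 days.
SELECT name,
       AVG(DATEDIFF(second, scheduled_time, query_start_time)) AS avg_seconds
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
  AND scheduled_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY name
ORDER BY avg_seconds DESC;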
I have a SQL query which returns three columns: Contact IDs, Number of participations, and Year of participation.
Based on this query result, I need to look for a pattern: whether anyone attended the same number of times every year over the years. For example, 2 times every year, or 3 times every year, for 2 or more consecutive years (and not a different number of times in each year).
From the sample below, the contacts I would be interested in pulling are 1008637, 1009256, 1010306 & 1011263.
Please let me know how to achieve this.
Please see image for sample data.
You would need to aggregate twice: once to get the number of participations per year, and then to check the count condition across the years.
select id
from (select id, year, count(*) as num_participations
      from tbl
      group by id, year
     ) t
group by id
having count(*) = count(distinct case when num_participations = 2 then year end)
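Note that the 2 is hardcoded above. If you want "the same number every year" without fixing that number in advance, a variation along these lines should work (it checks for at least two distinct years, though not strict consecutiveness):

select id
from (select id, year, count(*) as num_participations
      from tbl
      group by id, year
     ) t
group by id
-- same count in every year, across at least two years
having min(num_participations) = max(num_participations)
   and count(distinct year) >= 2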
I have Billing Rate History table. We are required to report billing rate changes to clients, so I compare this period's (month) billing rate to that of last period. If there is a delta, I report that. At the close of the routine, I record current rates for comparison next month. It works fine, but there is a challenge...
If employee Tom Thumb works for the client in question this month, but did not work last month, there is nothing against which to compare. Or, if Tom worked Overtime this month, but only straight time last month, I have no overtime rate against which to compare.
I'm trying to find a way to walk backward by Period till I find a valid > 0 rate for comparison.
So, let's say we are billing Period 201403.
Tom has a straight time bill rate of 54.04 for Period 201403.
He has an overtime rate of 81.06 for that same Period.
Now, I look at Period 201402 for his straight time and overtime rate. If there is no delta, I move on to the next employee.
But what if Tom has no ST or OT rate in Period 201402? I need to walk backward to 201401, 201312, etc. till I find the rate he had the last time he worked for this client.
I've read that it is not a good practice to use loops on a DB. What's the best practice for accomplishing what I need to accomplish?
You should be able to do this with a CTE, a self-join, and a ranking function. Something like:
with cte as (
    select *, row_number() over (partition by employeeid order by periodid) as [rn]
    from BillingRateHistory -- hypothetical name; use your rate history table
)
select [curr].employeeid, [curr].billrate, [last].billrate
from cte as [curr]
left join cte as [last]
    on [curr].employeeid = [last].employeeid
    and [curr].rn = [last].rn + 1
You'll have to jigger this to fit your actual table structure (the table name in the CTE above is a placeholder), but this should get you pretty close. Feel free to add a where clause (e.g. where [curr].billrate <> [last].billrate).
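One more thought on the walking-backward requirement: if some periods have a zero or missing rate, filtering them out inside the CTE makes the rn + 1 join skip straight back to the last period with a valid rate, which is the behavior you described:

with cte as (
    select *, row_number() over (partition by employeeid order by periodid) as [rn]
    from BillingRateHistory -- hypothetical table name, as above
    where billrate > 0      -- only rank periods that have a valid rate
)
select [curr].employeeid, [curr].billrate, [last].billrate
from cte as [curr]
left join cte as [last]
    on [curr].employeeid = [last].employeeid
    and [curr].rn = [last].rn + 1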