SUM OVER PARTITION BY not giving expected result - sql-server

I have a set of data which shows minutes used by a "child contract" on a daily basis. I need a running total of the minutes consumed by each child contract so that we can keep track.
This is the code I used.
Select
"combined"."id",
"combined"."ContractCustomer",
"combined"."ContractCustomerChild",
"combined"."Minutes",
SUM( "combined"."Minutes") OVER (PARTITION BY "combined"."ContractCustomerChild" ORDER BY "combined"."id") As "CustomerMinutes",
FROM "dbo"."combined"
I expect the sum to restart for every child contract; however, it doesn't seem to behave that way.
This is the result I got:
ID CustomerContractChild Minutes CustomerMinutes
1 20150101+C1 1000 1000
2 20150101+C1 2000 3000
3 20150101+c2 2500 5500
This is what I expect
ID CustomerContractChild Minutes CustomerMinutes
1 20150101+C1 1000 1000
2 20150101+C1 2000 3000
3 20150101+c2 2500 2500
What did I do wrong?

I have run this and get the exact output you required. It is the same as yours, except that it omits the ContractCustomer column (you don't show it in your results), uses the column name CustomerContractChild shown in your results instead of the ContractCustomerChild used in your code, and removes the trailing comma before FROM.
Select
"combined".ID,
"combined"."CustomerContractChild",
"combined"."Minutes",
SUM( "combined"."Minutes") OVER (PARTITION BY "combined"."CustomerContractChild" ORDER BY "combined"."id") As "CustomerMinutes"
FROM "dbo"."combined"
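As a side note, the default window frame when ORDER BY is present is RANGE UNBOUNDED PRECEDING, which lumps together ties in the ORDER BY column. If you want the running-total frame to be explicit, you can spell it out; a sketch of the same query, assuming id is unique within each child contract:
SELECT
"combined"."ID",
"combined"."CustomerContractChild",
"combined"."Minutes",
SUM("combined"."Minutes") OVER (
    PARTITION BY "combined"."CustomerContractChild"
    ORDER BY "combined"."ID"
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS "CustomerMinutes"
FROM "dbo"."combined"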

Thank you for the responses. My eyes must have been playing tricks on me due to sorting, etc. I reviewed the results and they look correct. I will close this for now. I'll sample a few more results and follow up if they turn out wrong.

Related

View AVG Task Execution Time in Snowflake

I run the following task in Snowflake to see which queries are candidates for inefficiency improvements:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history())
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
However, I frequently see the seconds a query took to run change from day to day. How can I modify this query in Snowflake to get all the historical runs from task history and average their seconds to get a fuller picture with less variance?
The documentation says it pulls the last 7 days, but in practice it is only pulling the last 2 days based on the output's scheduled_time (each of my tasks runs every 12 hours). I'd like to get the average seconds each task took over the last 30 days and sort by that.
The function is only returning about 2 days of runs because of its default result limit. From the task_history documentation:
RESULT_LIMIT => integer
A number specifying the maximum number of rows returned by the function.
Default: 100.
To get more rows, RESULT_LIMIT should be set explicitly:
select datediff(second,scheduled_time,query_start_time) as second, *
from table(information_schema.task_history(RESULT_LIMIT =>10000))
where state != 'SCHEDULED'
order by datediff(second,scheduled_time,query_start_time) desc;
Alternatively, ACCOUNT_USAGE.TASK_HISTORY provides data for the last 365 days:
SELECT datediff(second,scheduled_time,query_start_time) as second, *
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
ORDER BY datediff(second,scheduled_time,query_start_time) DESC;
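To get the 30-day average per task and sort by it, you can aggregate over the ACCOUNT_USAGE view; a sketch, assuming the task name lives in the NAME column of ACCOUNT_USAGE.TASK_HISTORY (note the view lags real time slightly):
SELECT name,
       AVG(datediff(second, scheduled_time, query_start_time)) AS avg_seconds
FROM snowflake.account_usage.task_history
WHERE state != 'SCHEDULED'
  AND scheduled_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY name
ORDER BY avg_seconds DESC;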

Calculate working days between two dates: null value returned

I'm trying to figure out the number of working days between two dates. The table (dfDates) is laid out as follows:
Key  StartDateKey  EndDateKey
1    20171227      20180104
2    20171227      20171229
I have another table (dfDimDate) with all the relevant date keys and whether the date key is a working day or not:
DateKey   WorkDayFlag
20171227  1
20171228  1
20171229  1
20171230  0
20171231  0
20180101  0
20180102  1
20180103  1
20180104  1
I'm expecting a result as so:
Key  WorkingDays
1    6
2    3
So far (I realise this isn't complete to get me the above result), I've written this:
workingdays = []
for i in range(0, len(dfDates)):
    value = dfDimDate.filter((dfDimDate.DateKey >= dfDates.collect()[i][1]) & (dfDimDate.DateKey <= df.collect()[i][2])).agg({'WorkDayFlag': 'sum'})
    workingdays.append(value.collect())
However, only null values are being returned. Also, I've noticed this is very slow; it ran for 54 seconds before it errored.
I think I understand what the error is about but I'm not sure how to fix it. Also, I'm not sure how to optimise the command so it runs faster. I'm looking for a solution in pyspark or spark SQL (whichever is easiest).
Many thanks,
Carolina
Edit: The error below was resolved thanks to a suggestion from @samkart, who said to put the agg after the filter:
AnalysisException: Resolved attribute(s) DateKey#17075 missing from sum(WorkDayFlag)#22142L in operator !Filter ((DateKey#17075 <= 20171228) AND (DateKey#17075 >= 20171227)).;
A possible and simple solution:
from pyspark.sql import functions as F
dfDates \
.join(dfDimDate, dfDimDate.DateKey.between(dfDates.StartDateKey, dfDates.EndDateKey)) \
.groupBy(dfDates.Key) \
.agg(F.sum(dfDimDate.WorkDayFlag).alias('WorkingDays'))
That is, first join the two datasets so that each dfDates row is linked to every dfDimDate row in its range (dfDates.StartDateKey <= dfDimDate.DateKey <= dfDates.EndDateKey).
Then group the joined dataset by Key and sum WorkDayFlag to count the working days in each range.
In the solution you proposed, you perform the calculation directly on the driver, so you are not taking advantage of the parallelism that Spark offers. This should be avoided when possible, especially for large datasets.
Apart from that, you call collect repeatedly inside the for-loop, even for the same data, which slows things down further.
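If Spark SQL is preferred, an equivalent sketch (assuming both DataFrames have been registered as temporary views, e.g. via dfDates.createOrReplaceTempView('dfDates') and dfDimDate.createOrReplaceTempView('dfDimDate')):
-- one output row per Key, summing the working-day flags in its date range
SELECT d.Key,
       SUM(w.WorkDayFlag) AS WorkingDays
FROM dfDates d
JOIN dfDimDate w
  ON w.DateKey BETWEEN d.StartDateKey AND d.EndDateKey
GROUP BY d.Key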

Querying in Google BigQuery is slow

I have a table which contains nearly a million rows. Searching for a single value in it takes 5 seconds, and for around 500 it takes 15 seconds. This is quite a long time. Please let me know how I can optimize the query.
My query is:
select a,b,c,d from table where a in ('a1','a2')
Job id : stable-apogee-119006:job_ClLDIUSdDLYA6tC2jfC5GxBXmv0
I'm not sure what you mean by "500 it takes 15 secs", but I ran some tests against our database trying to simulate what you are running, and I got results similar to yours
(my query is slower than yours as it has a join operation, but still, here we go):
SELECT
a.fv fv,
a.v v,
a.sku sku,
a.pp pp from(
SELECT
fullvisitorid fv,
visitid v,
hits.product.productsku sku,
hits.page.pagepath pp
FROM (TABLE_DATE_RANGE([40663402.ga_sessions_], DATE_ADD(CURRENT_DATE(), -3, 'day'), DATE_ADD(CURRENT_DATE(), -3, 'day')))
WHERE
1 = 1 ) a
JOIN EACH (
SELECT
fullvisitorid fv,
FROM (TABLE_DATE_RANGE([40663402.ga_sessions_], DATE_ADD(CURRENT_DATE(), -3, 'day'), DATE_ADD(CURRENT_DATE(), -3, 'day')))
GROUP EACH BY
fv
LIMIT
1 ) b
ON
a.fv = b.fv
Querying for just one day and bringing just one fullvisitor took BQ roughly 5 secs to process 1.7 GBs.
And when I ran the same query for the last month and removed the LIMIT operator, it took ~10s to process ~56 GB of data (around 33 million rows).
This is insanely fast.
So you might have to evaluate your project specs. If 5 seconds is still too much for you, then maybe you'll need to find some other strategy in your architecture that suits you best.
BigQuery does take seconds to process its demands, but it is also able to process hundreds of gigabytes in those same seconds.
If your project's data consumption is expected to grow and you start processing millions of rows, then you should evaluate whether waiting a few seconds is still acceptable in your application.
Other than that, as far as your query goes, I don't think there's much optimization left to improve its performance.
(P.S.: I decided to run it for 100 days and it processed around 100 GB in 14s.)

Table Scan very high "actual rows" when filter placed on different table

I have a query, that I did not write, that takes 2.5 minutes to run. I am trying to optimize it without being able to modify the underlying tables, i.e. no new indexes can be added.
During my optimization troubleshooting I commented out a filter, and all of a sudden my query ran in .5 seconds. I have messed with the formatting and placement of that filter, and if it is there the query takes 2.5 minutes; without it, .5 seconds. The biggest problem is that the filter is not on the table that is being table-scanned (which has over 300k records); it is on a table with 300 records.
The "Actual Execution Plans" of the 0:0:0.5 and 0:2:30 runs are identical, down to the exact percentage costs of all steps:
(Execution plan screenshot)
The only difference is that, for the table-scanned table, the "Actual Number of Rows" in the 2.5-minute query shows 3.7 million rows even though the table only has 300k rows, whereas the .5-second query shows an Actual Number of Rows of 2,063. The filter is actually placed on the FS_EDIPartner table, which has only 300 rows.
With the filter I get the correct 51 records, but it takes 2.5 minutes to return. Without the filter I get duplication, so I get 2,796 rows, but it takes only half a second to return.
I cannot figure out why adding the filter to a table with 300 rows and a correct index causes the table scan of a different table to show such a significant difference in actual number of rows. I am even running the table-scanned table as a sub-query to filter its records down from 300k to 17k prior to the join. Here is the actual query in its current state; sorry, the tables don't make a lot of sense, and I could not reproduce this behavior with test data.
SELECT dbo.FS_ARInvoiceHeader.CustomerID
, dbo.FS_EDIPartner.PartnerID
, dbo.FS_ARInvoiceHeader.InvoiceNumber
, dbo.FS_ARInvoiceHeader.InvoiceDate
, dbo.FS_ARInvoiceHeader.InvoiceType
, dbo.FS_ARInvoiceHeader.CONumber
, dbo.FS_EDIPartner.InternalTransactionSetCode
, docs.DocumentName
, dbo.FS_ARInvoiceHeader.InvoiceStatus
FROM dbo.FS_ARInvoiceHeader
INNER JOIN dbo.FS_EDIPartner ON dbo.FS_ARInvoiceHeader.CustomerID = dbo.FS_EDIPartner.CustomerID
LEFT JOIN (Select DocumentName
FROM GentranDatabase.dbo.ZNW_Documents
WHERE DATEADD(SECOND,TimeCreated,'1970-1-1') > '2016-06-01'
AND TransactionSetID = '810') docs on dbo.FS_ARInvoiceHeader.InvoiceNumber = docs.DocumentName COLLATE Latin1_General_BIN
WHERE docs.DocumentName IS NULL
AND dbo.FS_ARInvoiceHeader.InvoiceType = 'I'
AND dbo.FS_ARInvoiceHeader.InvoiceStatus <> 'Y'
--AND (dbo.FS_EDIPartner.InternalTransactionSetCode = '810')
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'CB%'))
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'DM%'))
AND InvoiceDate > '2016-06-01'
The commented-out line in the WHERE clause is the culprit; uncommenting it causes the 2.5-minute run.
It could be that the table statistics have gotten out of whack. Statistics include the row counts that the optimizer uses to choose the best query plan. Try running this and then running the query again:
EXEC sp_updatestats
Using @Jeremy's comment as a guideline, and realizing that the Actual Number of Rows was not my problem but rather the number of executions, I figured out that the Hash Match plan ran in .5 seconds while the Nested Loops plan took 2.5 minutes. Trying to force the Hash Match using LEFT HASH JOIN was inconsistent depending on what the other filters were set to; changing dates sometimes took it from .5 seconds to 30 seconds. So forcing the hash join (which is highly discouraged anyway) wasn't a good solution. Finally I resorted to moving the poorly performing view into a stored procedure, splitting out both of the tables that were related to the poor performance into table variables, and then joining those table variables. This gave the most consistently good performance. On average the SP returns in less than 1 second, which is far better than the 2.5 minutes it started at.
@Jeremy gets the credit, but since his was a comment rather than an answer, I thought I would document what was actually done in case someone else stumbles across this later.
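For reference, here is a minimal sketch of that table-variable pattern; the column lists are trimmed and the data types are assumptions, not taken from the real schema:
-- Pre-filter each side once into its own table variable (types are assumed).
DECLARE @Invoices TABLE (
    CustomerID    varchar(20),
    InvoiceNumber varchar(50),
    InvoiceDate   datetime
);
DECLARE @Partners TABLE (
    CustomerID varchar(20),
    PartnerID  varchar(20)
);

INSERT INTO @Invoices (CustomerID, InvoiceNumber, InvoiceDate)
SELECT CustomerID, InvoiceNumber, InvoiceDate
FROM dbo.FS_ARInvoiceHeader
WHERE InvoiceType = 'I'
  AND InvoiceStatus <> 'Y'
  AND InvoiceDate > '2016-06-01';

INSERT INTO @Partners (CustomerID, PartnerID)
SELECT CustomerID, PartnerID
FROM dbo.FS_EDIPartner
WHERE InternalTransactionSetCode = '810';

-- Join the two small, already-filtered sets.
SELECT i.CustomerID, p.PartnerID, i.InvoiceNumber, i.InvoiceDate
FROM @Invoices i
INNER JOIN @Partners p ON i.CustomerID = p.CustomerID;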

How to find the one hour period with the most datapoints?

I have a database table with hundreds of thousands of forum posts, and I would like to find out which hour-long period contains the greatest number of posts.
I could crawl forward one minute at a time, keeping an array of timestamps and tracking which hour had the most in it, but I feel like there is a much better way to do this. I will be running this operation on a year of posts, so checking every minute in a year seems pretty awful.
Ideally there would be a way to do this inside a single database query.
Given a table Minutes filled with every minute of the year you are interested in, and a table Posts with a Time column:
select top 1 minutes.time, count (posts.time)
from Minutes
left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count (posts.time) desc
To generate the Minutes table, you can use a function like ufn_GenerateIntegers.
Then the query becomes
select top 5 minutes.time, count (posts.time)
from (select dateadd(minute, IntValue, '2008-01-01') as Time from ufn_GenerateIntegers(525600)) Minutes
left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count(posts.time) desc
I just did a test run with about 5000 random posts and it took 16 seconds on my machine. So, not trivial, but not ridiculous for the occasional one-off query. Fortunately, this is a data point you can calculate once a day or even once a month and cache if you want to display the value frequently.
Take a look at lassevk's improvement.
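If ufn_GenerateIntegers isn't available, the Minutes table can also be materialized with a recursive CTE; a sketch for SQL Server (OPTION (MAXRECURSION 0) lifts the default 100-level recursion cap):
;WITH Nums AS (
    SELECT 0 AS n
    UNION ALL
    SELECT n + 1 FROM Nums WHERE n < 525599  -- one row per minute of a 365-day year
)
SELECT DATEADD(minute, n, '2008-01-01') AS [Time]
INTO Minutes
FROM Nums
OPTION (MAXRECURSION 0);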
Binning will work if you want to look at intervals such as 10:00 - 11:00. However, if you had a sudden flurry of interest from 10:30 - 11:30, it would be split across two bins, and hence may be hidden by a smaller number of hits that happened to fit entirely within a single clock hour.
The only way to avoid this problem is to generate a list sorted by time and step through it. Something like this:
max = 0; maxTime = 0
for each $item in the list:
    push $item onto queue
    while head of queue is more than an hour before $item:
        drop queue head
    if queue.count > max then max = queue.count; maxTime = $item.time
That way you only need to hold a 1 hour window in memory rather than the whole list.
Treat the timestamp of every post as the start of such an hour, and count all other posts that fall within that hour, including the post that started it. Sort the resulting hours in descending order by the number of posts in each of them.
Having done that, you'll find the topmost single "hour" that has the most posts in it, but note that this period of time might not be exactly one hour long; it might be shorter (but never longer).
To get a "prettier" period, you can calculate how long it really is, divide that by two, and adjust the start of the period back by that amount and the end forward; this will "center" the posts inside the hour. The adjustment is guaranteed not to include any new posts, so the count is still valid. If posts were close enough to suddenly be included in the period after you expanded it to one hour, then an earlier point would have had "the most posts" in it instead of the one you picked.
If this is an SQL question, you can reuse the SQL that Josh posted here; just replace the Minutes table with another reference to your Posts table.
Another method you can use is to use a sliding window.
First sort all the posts according to the timestamp. Keep track of posts using a list, a linked list could be used for this.
Now, for each post, add it to the end of the list. Then, for each post from the start of the list, if that post is more than one hour before the post you just added, remove it from the list.
After doing that 2-step operation for a single new post in the list, check if the number of posts in the list is more than a previous maximum, and if it is, either make a copy of the list or at least store the post you just added.
After you're finished, you've got the "copy of the list" with the most posts in an hour, or you've got the post that is the end of a 1-hour window that contains the most posts. (A single-query SQL version of the same idea is sketched after the pseudo-code below.)
Pseudo-code:
initialize posts-window-list to empty list
for each post in sorted-posts-list:
    add post to end of posts-window-list
    for each other-post from start of posts-window-list:
        if other-post is more than one hour older than post, remove it
        otherwise, end this inner loop
    if number of posts in list is more than previous maximum:
        make copy of list, this is the new maximum
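On databases that support range-based window frames (PostgreSQL 11+, for example; the syntax shown is PostgreSQL), the same sliding-window idea can be expressed as a single query; a sketch, assuming a posts table with a post_time timestamp column:
SELECT post_time AS window_end,
       -- posts in the hour ending at this post, including the post itself
       COUNT(*) OVER (
           ORDER BY post_time
           RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
       ) AS posts_in_window
FROM posts
ORDER BY posts_in_window DESC
LIMIT 1;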
This worked on a small test MS-SQL database.
SELECT TOP 1 id, date_entered,
(SELECT COUNT(*)
FROM dbo.notes AS n2
WHERE n2.date_entered >= n.date_entered
AND n2.date_entered < Dateadd(hh, 1, n.date_entered)) AS num
FROM dbo.notes n
ORDER BY num DESC
This is not very efficient; it checks a one-hour window starting from each post.
For MySQL:
SELECT ID,f.Date, (SELECT COUNT(*)
FROM Forum AS f2
WHERE f2.Date >= f.Date AND f2.Date < Date_ADD(f.Date, INTERVAL 1 HOUR)) As num
FROM Forum AS f
ORDER BY num DESC
LIMIT 0,1
Here's a slight variation on the other Josh's implementation. It forgoes the intermediate table and instead self-joins the posts table, looking for any posts within an hour of each post.
select top 1 posts.DateCreated, count (posts.datecreated),
min(minutes.DateCreated) as MinPostDate,
max(minutes.datecreated) as MaxPostDate
from posts Minutes
left join posts on posts.datecreated >= minutes.DateCreated
AND posts.datecreated < dateadd(hour, 1, Minutes.DateCreated)
group by posts.DateCreated
order by count(posts.datecreated) desc
From a performance perspective, on a table with only 6 rows, his method (which uses the function to generate the intermediate table) took 16 seconds, versus this one, which was sub-second.
I'm not positive whether this approach could miss a valid timeframe, since the timespan is based on the offset of each post.
This results in an O(n) database query, and an O(n) greatest time search, for a total complexity of O(2n) (which, of course, is still O(n)):
Use a COUNT query with GROUP BY in SQL, which will 'bin' items for you in minute increments.
So you'd run the count query on this table:
time
1
2
4
3
3
2
4
1
3
2
And it would return:
1 2
2 3
3 3
4 2
By counting each item.
I suspect you can do the same thing with your table, and bin them by the minute, then run an algorithm on that.
SELECT customer_name, COUNT(DISTINCT city) as "Distinct Cities"
FROM customers
GROUP BY customer_name;
From this tutorial on count: http://www.techonthenet.com/sql/count.php (near the end).
Here is a similar page from MySQL's manual: http://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
So if you have a table with a datetime in it (to the minute, allowing binning to happen by minute):
datetime (yyyymmddhhmm)
200901121435
200901121538
200901121435
200901121538
200901121435
200901121538
200901121538
200901121435
200901121435
200901121538
200901121435
200901121435
Then the SQL
SELECT datetime, COUNT(*) as "Count"
FROM post
GROUP BY datetime;
should return
200901121435 7
200901121538 5
You will still need to post-process this, but the hard work of grouping and counting is done, and it will only produce just over 500k rows per year (60 minutes x 24 hours x 365 days).
The post-processing would be:
Start at time T = first post time.
Set greatestTime = T
Sum all counts between T and T + one hour --> currentHourCount and greatestHourCount
While records exist past T + one hour:
    Increment T by one minute.
    While the first element is prior to time T, subtract it
    While the last element is before time T + one hour, add it
    If currentHourCount > greatestHourCount then
        greatestHourCount = currentHourCount
        greatestTime = T
end while
-Adam
This will do it.
SELECT A.DateOfEvent AS HourBegin, DATEADD(hh, 1, A.DateOfEvent) AS HourEnd, COUNT(*) AS NumEventsPerHour
FROM tEvents AS A
JOIN tEvents AS B
ON B.DateOfEvent >= A.DateOfEvent AND B.DateOfEvent < DATEADD(hh, 1, A.DateOfEvent)
GROUP BY A.DateOfEvent
If MySQL:
select substr( timestamp, 1, 16 ) as hour, count(*) as count from forum_posts group by hour order by count desc limit 1;
Edit: I'm not sure if the original question means any possible 60-minute period.
If using MySQL:
SELECT DATE(postDate), HOUR(postDate), COUNT(*) AS n
FROM posts
GROUP BY DATE(postDate), HOUR(postDate)
ORDER BY n DESC
LIMIT 1
SELECT DATEPART(hour, PostDateTime) AS HourOfDay,
COUNT(*) AS ForumPosts
FROM Posts
GROUP BY DATEPART(hour, PostDateTime)
