Average for loop in loop PL / SQL - arrays

I am trying to calculate the average of durations from the last 40 days for different IDs.
Example: I have 40 days and for each day IDs from 1-20 and each ID has a start date and end date in HH:MI:SS.
My code is a cursor which fetches the last 40 days; then I made a second for loop. In this one I select all the IDs from that day. Then I go through every ID for that day, select the start and end date, and calculate the duration. So far so good. But how do I calculate the average of the durations for each ID over the last 40 days?
The idea is simple: take the durations for one ID (over the last 40 days), add them together, and divide by 40. And then do the same for all IDs. My plan was to make a 2D array, putting all IDs in the first array, then putting the durations in the second array and adding the values for one ID together. Then I would have all the durations for one ID added together and could get the value from the array. But I am kinda stuck on that idea.
I also wonder if there is a better solution.
Thanks for any help!

From my point of view, you don't need loops or PL/SQL - just calculate the average:
select id,
avg(end_date - start_date)
from your_table
where start_date >= trunc(sysdate) - 40
group by id
The drawback might be what you said - that you stored the dates as hh:mi:ss. What does that mean? That you stored them as strings? If so, that is most probably a bad idea; dates (as Oracle doesn't have a separate datatype for time) should be stored in DATE datatype columns.
If you really have to work with strings, then convert them to dates:
avg(to_date(end_date, 'hh:mi:ss') - to_date(start_date, 'hh:mi:ss'))
Also, you'll then have to have another DATE datatype column that determines what "last 40 days" actually means.
The result (the average) will be the number of days between these values. You can then format it more prettily if you want.
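Since the whole computation is a single grouped aggregate, the approach is easy to sanity-check outside Oracle. Here is a minimal sketch using SQLite from Python (the table name, columns, and sample rows are assumptions mirroring the query above; julianday() stands in for Oracle's native DATE subtraction):

```python
import sqlite3

# Hypothetical table mirroring the question: one row per (day, id) with
# full start/end timestamps. Names and data are made up for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE your_table (id INTEGER, start_date TEXT, end_date TEXT)")
con.executemany("INSERT INTO your_table VALUES (?, ?, ?)", [
    (1, "2024-01-01 08:00:00", "2024-01-01 09:30:00"),   # 90 minutes
    (1, "2024-01-02 08:00:00", "2024-01-02 08:30:00"),   # 30 minutes
    (2, "2024-01-01 10:00:00", "2024-01-01 10:15:00"),   # 15 minutes
])

# SQLite's julianday() yields fractional day numbers, so the difference -
# like Oracle's DATE subtraction - is a number of days.
avg_days = con.execute("""
    SELECT id, AVG(julianday(end_date) - julianday(start_date))
    FROM your_table
    GROUP BY id
""").fetchall()
for id_, days in avg_days:
    print(id_, round(days * 24 * 60), "minutes on average")
```

Multiplying the fractional-day average by 24*60 is one way to make the result readable, as the answer suggests.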

Related

Is there a way to consolidate multiple formulas into one

Hi all:
I am trying to design a shared worksheet that measures salespeople's performance over a period of time. In addition to calculating # of units, sales price, and profit, I am trying to calculate how many new accounts were sold in the month (ideally, I'd like to be able to change the timeframe so I can calculate larger time periods like quarter, year, etc.).
In essence, I want to find out if a customer was sold to in the 12 months before the present month, and if not, that I will see the customer number and the salesperson who sold them.
So far, I was able to calculate that by adding three columns that each calculate a part of the process (see screenshot below):
Column H (SoldLastYear) - Shows customers that were sold in the year before this current month: =IF(AND(B2>=(TODAY()-365),B2<(TODAY()-DAY(TODAY())+1)),D2,"")
Column I (SoldNow) - Shows the customers that were sold this month, and if they are NOT found in column H, show "New Cust": =IFNA(IF(B2>TODAY()-DAY(TODAY()),VLOOKUP(D2,H:H,1,FALSE),""),"New Cust")
Column J (NewCust) - If Column I shows "New Cust", show me the customer number: =IF(I2="New Cust",D2,"")
Column K (SalesName) - if Column I shows "New Cust", show me the salesperson name: =IF(I2="New Cust",C2,"")
Does anyone have an idea how I can make this more efficient? Could an array formula work here, or will it be stuck in a loop since it's referring to other lines in the same column?
Any help would be appreciated!!
EDIT: Here is what I'm trying to achieve:
Instead of:
having Column H showing me what was sold in the 12 months before the 1st day of the current month (for today's date: 8/1/19-7/31/20);
Having Column I showing me what was sold in August 2020; and
Column I searching column H to see if that customer was sold in the timeframe specified in Column H
I want to have one column that does all three: One column that flags all sales made for the last 12 months from the beginning of the current month (so, 8/1/19 to 8/27/20), then compares sales made in current month (august) with the sales made before it, and lets me know the first time a customer shows up in current month IF it doesn't appear in the 12 months prior --> finds the new customers after a dormant period of 12 months.
I'm really just trying to find a way to make the formula better and less resource-consuming. With a large dataset, the three columns (copied a few times for different timeframes) really slow down Excel...
Here is an example of the end result (screenshot: "Example of final product").
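The three helper columns really compute one filter - "sold this month, but not in the 12 months before the 1st" - so the logic can be prototyped in one step. A minimal pandas sketch of that rule (column names, the toy rows, and the fixed "today" are all assumptions):

```python
import pandas as pd

# Toy sales log standing in for columns B (date), C (salesperson), D (customer).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2019-09-15", "2020-08-03", "2020-08-10"]),
    "salesperson": ["Ann", "Ann", "Bob"],
    "customer": ["C1", "C2", "C1"],
})
today = pd.Timestamp("2020-08-27")
month_start = today.replace(day=1)

this_month = sales[sales["date"] >= month_start]
prior_12m = sales[(sales["date"] >= month_start - pd.DateOffset(months=12))
                  & (sales["date"] < month_start)]

# "New" customers: sold this month, absent from the prior 12 months.
new_cust = this_month[~this_month["customer"].isin(prior_12m["customer"])]
print(new_cust[["customer", "salesperson"]])
```

In this toy data, C1 was sold in September 2019 (inside the dormant window), so only C2 comes out as new.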

I need to provide a tabulated output of this month, month+1, month+2, and month+3 derived from numerous tables within the one sheet

The History:
I have a data set that refreshes every Monday morning, adding last week's values to a growing tally until there are 52 weeks in the data set (9 separate cohorts), across 38 different departments.
I have built a power query to filter the department and compiled tables for each cohort, limiting the data to the last 17 weeks, and using Excel forecast modelling then set up each table to forecast 16 weeks ahead.
Because the week-beginning (WB) dates keep changing, I can't hard-code the result table to cells within each cohort table.
My result table needs to show the current month, month+1, month+2, and month+3 forecast values, as per the highest date closest to or equal to EOM, and I need this to be automated, hence a formula.
PS: an added complexity is that the date/value columns are adjacent for the last 17 weeks but separated for the future 16 weeks of data in each table. The structure is exactly the same across all 9 cohort forecast tables.
My Question:
Am I best to use a nested EOM formula, or VLOOKUP(MAX) based on the cohort_forecast_table image link below?
Because the current month needs to be current I have created a cell using =NOW().
I then complete a VLOOKUP within each cell in the master table that references the data in each sub-table, using MAX and EOMONTH for the current month, then month+1, month+2, month+3, etc.
In a simplified broken down solution:
Date array = 'D3:D35'
Volume array = 'E3:E35'
End of current month formula cell B3: =MAX(($D$3:$D$35<EOMONTH(D1,0))*D3:D35)
Call for result in cell C3:
'=VLOOKUP(B3,Dates:Volumes,2,FALSE)'
I think this will work for me and thank you all...
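The VLOOKUP(MAX(...)) idea - look up the value at the latest date on or before the end of the target month - can also be sanity-checked outside Excel. A small pandas sketch (the table layout, names, and numbers are assumptions; it uses <= end-of-month where the formula above uses a strict <):

```python
import pandas as pd

# Toy forecast table standing in for one cohort's date/volume arrays.
table = pd.DataFrame({
    "week_beginning": pd.to_datetime(["2024-05-06", "2024-05-13",
                                      "2024-05-27", "2024-06-03"]),
    "volume": [10, 12, 15, 11],
})

def month_end_value(df, asof):
    """Volume at the latest week_beginning on or before the end of asof's
    month - the VLOOKUP(MAX(...)) idea from the question."""
    eom = (asof + pd.offsets.MonthEnd(0)).normalize()
    eligible = df[df["week_beginning"] <= eom]
    return eligible.loc[eligible["week_beginning"].idxmax(), "volume"]

print(month_end_value(table, pd.Timestamp("2024-05-15")))   # latest May row
```

Calling it with dates in consecutive months gives the month, month+1, month+2, ... values of the result table.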

Iterating over a MultiIndex - a groupby.value_counts() object iterates only over values, not over the original date index

I want to know the percentage of males in the ER (emergency room) during days that I defined as overcrowded.
I have a DF named eda with rows representing each entry to the ER. One column states whether the entry occurred on an overcrowded day (1 means overcrowded) and another column states the gender of the person who entered.
So far I managed to get a series with the overcrowded days as index and a sub-index representing gender, with the number of entries for that gender.
I used this code:
eda[eda.over_crowd==1].groupby(eda[eda.over_crowd==1].index.date).gender.value_counts()
and got the following result:
My question is: what is the most 'pandas-ian' way to get the percentage of males/females in general? Or, how do I continue from the point where I stopped?
As can be seen at the bottom of the screenshot, when I iterate over the elements, each value is the male or female count consecutively. I want to iterate over dates so I could write a cleaner loop that produces another column of male percentage.
I found a pretty elegant solution. I'm sure there are more, but maybe it can help someone else.
So I defined a multi-index series with all dates and counts of females and males, then used .loc to operate on each count of all dates to get the percentage of males on each day. Finally, I just extract only the days for which over_crowd == 1.
temp=eda.groupby(eda.index.date).gender.value_counts()
crowding['male_percent']=np.divide(100*temp.loc[:,1],temp.loc[:,2]+temp.loc[:,1])
crowding.male_percent[crowding.over_crowd==1]
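An alternative to the manual divide: value_counts(normalize=True) returns the per-day shares directly. A small runnable sketch with toy data (using the same 1 = male, 2 = female coding as the solution above; the frame and its values are made up):

```python
import pandas as pd

# Toy ER log: datetime index, gender code (1 = male, 2 = female), crowding flag.
idx = pd.to_datetime(["2020-01-01 08:00", "2020-01-01 09:00", "2020-01-01 10:00",
                      "2020-01-02 08:00", "2020-01-02 09:00"])
eda = pd.DataFrame({"gender": [1, 1, 2, 2, 1],
                    "over_crowd": [1, 1, 1, 0, 0]}, index=idx)

# normalize=True turns the counts into within-day proportions, so no
# explicit division is needed; .xs picks the male rows off the MultiIndex.
shares = eda.groupby(eda.index.date).gender.value_counts(normalize=True)
male_percent = 100 * shares.xs(1, level="gender")
print(male_percent)
```

From here, filtering to the overcrowded days works exactly as in the answer's last line.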

Excel, counting within a FORMULATEXT on an array

Imagine rows A2:A11 = customer names, columns B1:AE1 = days of the month.
To make it easy:
Daily, we tally customers' purchases (quantities) and separate them with a + to get that day's total purchase. Example: on the 2nd day of the month (C2:C5):
Abe =44+54+10
John =22+10+40
Sara =40
Mary=10+10
Also, we need to count the total "sales cases" of the whole day (in the above example it is 3+3+1+2 = 9) to show in the last row of the day (B12 in this example).
The logic is something like
=SUMPRODUCT(LEN(FORMULATEXT(C2:C5))-LEN(SUBSTITUTE(FORMULATEXT(C2:C5),"+","")))
But I'm getting #N/A.
Reminder: when there are no "+" signs and the value is more than zero, it should count as 1.
help?
There is a trick to do this, though it might end up a bit convoluted to achieve the end result you want.
First, define a new name in Name Manager (in the Formula menu bar)
Name: FormulaText
Refers to: =GET.CELL(6,OFFSET(INDIRECT("RC",FALSE),0,-1))
Now, if you have a formula of =10+20+30 in cell B3, enter =FormulaText in cell C3 and you will get the text version of the formula.
You can now count + symbols in that formula using =LEN(C3)-LEN(SUBSTITUTE(C3,"+",""))
In your specific circumstances, I would offset all of this to the right of your spreadsheet, say by 35 columns, in which case you will need to change the definition of FormulaText accordingly.
A fair bit of set-up work, but the result should work automagically.
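The counting rule itself (number of + signs, plus one for any non-empty cell) is easy to prototype on the plain formula strings that FORMULATEXT or the GET.CELL name would hand back; a minimal Python sketch with made-up inputs:

```python
# Count "sales cases" in one day's formula text, per the rule in the question:
# '=44+54+10' -> 3, '=40' -> 1, an empty cell -> 0.
def sales_cases(formula: str) -> int:
    body = formula.lstrip("=").strip()
    if not body:
        return 0
    return body.count("+") + 1

day = ["=44+54+10", "=22+10+40", "=40", "=10+10"]   # Abe, John, Sara, Mary
print(sum(sales_cases(f) for f in day))             # the day's total: 9
```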

How to find the one hour period with the most datapoints?

I have a database table with hundreds of thousands of forum posts, and I would like to find out which hour-long period contains the greatest number of posts.
I could crawl forward one minute at a time, keeping an array of timestamps and keeping track of what hour had the most in it, but I feel like there is a much better way to do this. I will be running this operation on a year of posts so checking every minute in a year seems pretty awful.
Ideally there would be a way to do this inside a single database query.
Given a table Minutes filled with every minute of the year you are interested in, and a table Posts with a Time column:
select top 1 minutes.time, count (posts.time)
from Minutes
left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count (posts.time) desc
To solve generating the minutes table, you can use a function like ufn_GenerateIntegers.
Then the query becomes
select top 5 minutes.time, count (posts.time)
from (select dateadd(minute, IntValue, '2008-01-01') as Time from ufn_GenerateIntegers(525600)) Minutes
left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count(posts.time) desc
I just did a test run with about 5000 random posts and it took 16 seconds on my machine. So, not trivial, but not ridiculous for the occasional one-off query. Fortunately, this is a data point you can calculate once a day or even once a month and cache if you want to display the value frequently.
Take a look at lassevk's improvement.
Binning will work if you want to look at intervals such as 10:00 - 11:00. However, if you had a sudden flurry of interest from 10:30 - 11:30, then it will be split across two bins, and hence may be hidden by a smaller number of hits that happened to fit entirely within a single clock hour.
The only way to avoid this problem is to generate a list sorted by time and step through it. Something like this:
max = 0; maxTime = 0
for each $item in the list:
    push $item onto queue
    while head of queue is more than an hour before $item:
        drop queue head
    if queue.count > max then max = queue.count; maxTime = $item.time
That way you only need to hold a 1 hour window in memory rather than the whole list.
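A runnable version of that pseudocode, using a deque as the hour-long window (a sketch that assumes the timestamps are already sorted ascending):

```python
from collections import deque
from datetime import datetime, timedelta

def busiest_hour(timestamps):
    """Return (count, end_time) of the densest 1-hour window; `timestamps`
    must be sorted ascending."""
    window = deque()
    best_count, best_end = 0, None
    for t in timestamps:
        window.append(t)
        # Drop everything more than an hour before the newest item.
        while window[0] <= t - timedelta(hours=1):
            window.popleft()
        if len(window) > best_count:
            best_count, best_end = len(window), t
    return best_count, best_end

posts = [datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30),
         datetime(2024, 1, 1, 10, 59), datetime(2024, 1, 1, 12, 0)]
print(busiest_hour(posts))
```

The deque only ever holds one hour's worth of posts, matching the memory bound described above.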
Treat the timestamp of every post as the start of such an hour, and count all other posts that fall within that hour, including the post that started it. Sort the resulting hours in descending order by the number of posts in each of them.
Having done that, you'll find the topmost single "hour" that has the most posts in it, but this period of time might not be exactly one hour long, it might be shorter (but never longer).
To get a "prettier" period, you can calculate how long it really is, divide by two, and adjust the start of the period back by that amount and the end forward; this will "center" the posts inside the hour. This adjustment is guaranteed not to include any new posts, so the count is still valid. If posts were close enough to suddenly be included in the period after you expanded it to one hour, then an earlier point would have had "the most posts" in it instead of the one you picked.
If this is an SQL question, you can reuse the SQL that Josh posted here, just replace the Minutes table with another link to your posts table.
Another method you can use is to use a sliding window.
First sort all the posts according to the timestamp. Keep track of posts using a list, a linked list could be used for this.
Now, for each post, add it to the end of the list. Then, for each post from the start of the list, if that post is more than one hour before the post you just added, remove it from the list.
After doing that 2-step operation for a single new post in the list, check if the number of posts in the list is more than a previous maximum, and if it is, either make a copy of the list or at least store the post you just added.
After you're finished, you've got the "copy of the list" with the most posts in an hour, or you got the post that is the end of a 1-hour window that contains the most posts.
Pseudo-code:
initialize posts-window-list to empty list
for each post in sorted-posts-list:
    add post to end of posts-window-list
    for each other-post from start of posts-window-list:
        if other-post is more than one hour older than post, remove it
        otherwise, end this inner loop
    if number of posts in list is more than previous maximum:
        make copy of list, this is the new maximum
This worked on a small test MS-SQL database.
SELECT TOP 1 id, date_entered,
(SELECT COUNT(*)
FROM dbo.notes AS n2
WHERE n2.date_entered >= n.date_entered
AND n2.date_entered < Dateadd(hh, 1, n.date_entered)) AS num
FROM dbo.notes n
ORDER BY num DESC
This is not very efficient; it checks an hour-long window starting from each post.
For MYSQL
SELECT ID,f.Date, (SELECT COUNT(*)
FROM Forum AS f2
WHERE f2.Date >= f.Date AND f2.Date < Date_ADD(f.Date, INTERVAL 1 HOUR)) As num
FROM Forum AS f
ORDER BY num DESC
LIMIT 0,1
Here's a slight variation on the other Josh's implementation; this forgoes the intermediate table and uses a self-join, looking for any posts within an hour of each post.
select top 1 posts.DateCreated, count (posts.datecreated),
min(minutes.DateCreated) as MinPostDate,
max(minutes.datecreated) as MaxPostDate
from posts Minutes
left join posts on posts.datecreated >= minutes.DateCreated
AND posts.datecreated < dateadd(hour, 1, Minutes.DateCreated)
group by posts.DateCreated
order by count(posts.datecreated) desc
From a performance perspective, on a table with only 6 rows, his method (which used the function to generate the intermediate table) took 16 seconds vs. this one, which was sub-second.
I'm not positive whether this could miss a valid timeframe, since the timespan is based on the offset of each post.
This results in an O(n) database query, and an O(n) greatest time search, for a total complexity of O(2n) (which, of course, is still O(n)):
Use a count distinct command in SQL which will 'bin' items for you in minute increments.
So you'd run the count query on this table:
time
1
2
4
3
3
2
4
1
3
2
And it would return:
1 2
2 3
3 3
4 2
By counting each item.
I suspect you can do the same thing with your table, and bin them by the minute, then run an algorithm on that.
SELECT customer_name, COUNT(DISTINCT city) as "Distinct Cities"
FROM customers
GROUP BY customer_name;
From this tutorial on count: http://www.techonthenet.com/sql/count.php (near the end).
Here is a similar page from MySQL's manual: http://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
So if you have a table with a timedate in it (to the minute, allowing binning to happen by minutes):
datetime (yyyymmddhhmm)
200901121435
200901121538
200901121435
200901121538
200901121435
200901121538
200901121538
200901121435
200901121435
200901121538
200901121435
200901121435
Then the SQL
SELECT datetime, COUNT(*) as "Count"
FROM post
GROUP BY datetime;
should return
200901121435 7
200901121538 5
You will still need to post-process this, but the hard work of grouping and counting is done, and it will only result in just over 500k rows per year (60 minutes × 24 hours × 365 days = 525,600).
The post processing would be:
Start at time T = first post time
Set greatestTime = T
Sum all counts between T and T + one hour --> currentHourCount and greatestHourCount
While records exist past T + one hour:
    Increment T by one minute
    While the first element is prior to time T, subtract it
    While the last element is before time T + one hour, add it
    If currentHourCount > greatestHourCount then
        greatestHourCount = currentHourCount
        greatestTime = T
End while
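A simplified sketch of that post-processing in Python. For readability it re-sums each 60-minute window rather than incrementally adding/subtracting at the edges as the pseudocode does, so it is O(n x 60) instead of O(n); the bin values are made up:

```python
from datetime import datetime, timedelta

def busiest_hour_from_bins(bins):
    """bins: dict mapping minute-truncated datetimes to post counts.
    Returns (window_start, count) for the densest 60-minute window."""
    start, end = min(bins), max(bins)
    best_start, best_count = start, 0
    t = start
    while t <= end:
        # Total posts in the hour [t, t + 60 minutes).
        window = sum(bins.get(t + timedelta(minutes=m), 0) for m in range(60))
        if window > best_count:
            best_count, best_start = window, t
        t += timedelta(minutes=1)
    return best_start, best_count

bins = {datetime(2024, 1, 1, 10, 0): 3,
        datetime(2024, 1, 1, 10, 30): 2,
        datetime(2024, 1, 1, 11, 30): 4}
print(busiest_hour_from_bins(bins))
```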
-Adam
This will do it.
SELECT A.DateOfEvent AS HourBegin,
DATEADD(hh, 1, A.DateOfEvent) AS HourEnd,
COUNT(*) AS NumEventsPerHour
FROM tEvents AS A
JOIN tEvents AS B
ON B.DateOfEvent >= A.DateOfEvent
AND B.DateOfEvent < DATEADD(hh, 1, A.DateOfEvent)
GROUP BY A.DateOfEvent
If MySQL:
select substr( timestamp, 1, 13 ) as hour, count(*) as count from forum_posts group by hour order by count desc limit 1;
Edit: not sure if the original question means any possible 60-minute period.
If using MySQL:
SELECT DATE(postDate), HOUR(postDate), COUNT(*) AS n
FROM posts
GROUP BY DATE(postDate), HOUR(postDate)
ORDER BY n DESC
LIMIT 1
SELECT DATEPART(hour, PostDateTime) AS HourOfDay,
COUNT(*) AS ForumPosts
FROM Posts
GROUP BY DATEPART(hour, PostDateTime)
