I am researching a topic on inflation and currently have the problem that part of my code takes over 20 hours to complete because I have a nested loop.
The code looks like this:
# Mean of column k of topictimeseries between the previous and the current
# survey date of row i in microdata_sce
avg_timeseries <- function(i, k) {
  start_date <- microdata_sce$survey_date[i]
  end_date <- microdata_sce$prev_survey[i]
  mean(subset(topictimeseries, survey_date <= start_date & survey_date >= end_date)[[k]])
}
start.time <- Sys.time()
for (i in 1:nrow(microdata_sce)) {
  for (k in 2:ncol(topictimeseries)) {
    microdata_sce[[i, paste(k - 1, 'topic', sep = "_")]] <- avg_timeseries(i, k)
  }
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
What do I want/what do I have:
I have a panel dataset (microdata_sce) with a total of 130,000 observations. Each observation has two dates: the survey date and the date of the respondent's previous survey. For each column of the time series data frame (topictimeseries, 70 columns), I want to calculate the average of the values that fall between those two dates. Each average should then be attached to the microdata_sce data set as a new column.
My idea was to loop through each row, take the survey data, create a subset from which the average is calculated. I loop through the k different time series.
Unfortunately, the code takes an eternity. Is there a way to speed this up using apply?
Thank you!
Kind regards
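One possible way to avoid re-subsetting the time series for every row and every column is to sort it by date once, build cumulative sums, and then look up each window with a binary search. The sketch below illustrates the idea in plain Python on toy data; the function name window_means and the sample values are made up for illustration, and it assumes the time series dates are sorted and free of missing values.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def window_means(dates, values, windows):
    """Mean of `values` over each inclusive [lo, hi] date window.

    `dates` must be sorted ascending and `values[j]` belongs to `dates[j]`.
    Returns None for a window that contains no observations.
    """
    # prefix[j] holds sum(values[:j]), so any window sum is one subtraction
    prefix = [0.0] + list(accumulate(values))
    means = []
    for lo, hi in windows:
        a = bisect_left(dates, lo)   # first index with date >= lo
        b = bisect_right(dates, hi)  # one past the last index with date <= hi
        means.append((prefix[b] - prefix[a]) / (b - a) if b > a else None)
    return means


# Toy data (hypothetical): ISO date strings sort chronologically
dates = ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]
topic_1 = [1.0, 2.0, 3.0, 4.0]
# (previous survey date, survey date) pairs
windows = [
    ("2020-01-02", "2020-01-04"),
    ("2020-01-01", "2020-01-01"),
    ("2021-01-01", "2021-01-02"),
]
result = window_means(dates, topic_1, windows)
print(result)  # [3.0, 1.0, None]
```

With this layout the cost per window no longer depends on the length of the time series, apart from the binary search. The same idea should carry over to R with cumsum() and findInterval(), although that translation is not shown here.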
Related
I want to calculate the number of overlapping days within multiple date ranges. For example, in the sample data below, there are 167 overlapping days: first from 07jan to 04apr and second from 30may to 15aug.
start end
01jan2000 04apr2000
30may2000 15aug2000
07jan2000 31dec2000
This is fairly crude but gets the job done. Essentially, you:
1. Reshape the data to be in long format, which is usually a good idea when working with panel data in Stata
2. Fill in gaps between the start and end of each spell
3. Keep dates that occur more than once
4. Count the distinct values of dates
clear
/* Fake Data */
input str9(start end)
"01jan2000" "04apr2000"
"30may2000" "15aug2000"
"07jan2000" "31dec2000"
end
foreach var of varlist start end {
    gen d = date(`var', "DMY")
    drop `var'
    gen `var' = d
    format %td `var'
    drop d
}
/* Count Overlapping Days */
rename (start end) date=
gen spell = _n
reshape long date, i(spell) j(range) string
drop range
xtset spell date, delta(1 day)
tsfill
bys date: keep if _N>1
/* distinct is a user-written command (ssc install distinct) */
distinct date
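For readers who want to check the logic outside Stata, the same approach (expand every spell into its individual days, keep the days that occur more than once, count them) can be sketched in Python. This is only an illustration under the assumption that month abbreviations are parsed in an English locale; the function name overlapping_days is made up.

```python
from collections import Counter
from datetime import datetime, timedelta


def overlapping_days(ranges, fmt="%d%b%Y"):
    """Count calendar days covered by more than one inclusive date range."""
    counts = Counter()
    for start, end in ranges:
        day = datetime.strptime(start, fmt).date()
        last = datetime.strptime(end, fmt).date()
        while day <= last:  # fill in every day of the spell
            counts[day] += 1
            day += timedelta(days=1)
    # keep only the days that appear in at least two spells
    return sum(1 for n in counts.values() if n > 1)


ranges = [
    ("01jan2000", "04apr2000"),
    ("30may2000", "15aug2000"),
    ("07jan2000", "31dec2000"),
]
n_overlap = overlapping_days(ranges)
print(n_overlap)  # 167
```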
I have second-by-second data for channels A, B, and C as shown below (this just shows the first 6 rows):
date A B C
1 2020-03-06 09:55:42 224.3763 222.3763 226.3763
2 2020-03-06 09:55:43 224.2221 222.2221 226.2221
3 2020-03-06 09:55:44 224.2239 222.2239 226.2239
4 2020-03-06 09:55:45 224.2044 222.2044 226.2044
5 2020-03-06 09:55:46 224.2397 222.2397 226.2397
6 2020-03-06 09:55:47 224.3690 222.3690 226.3690
I would like to be able to extract multiple 5-minute averages for columns A, B and C based off time. Is there a way to do this where I would only need to type in the starting time period, rather than having to type the start AND end times for each time period I want to extract? Essentially, I want to be able to type the start time and have my code calculate and extract the average for the successive 5 minutes.
I was previously using the 'timeAverage' function from the 'openair' package to obtain 1-minute averages for the entire data set. I then created a vector with the start times and used the 'subset' function to extract the 1-minute averages I was interested in.
library(openair)
df.avg <- timeAverage(df, avg.time = "min", statistic = "mean")
cond.1.time <- c(
  '2020-03-06 10:09:00',
  '2020-03-06 10:13:00',
  '2020-03-06 10:18:00'
) #enter start times
library(dplyr)
df.cond.1.avg <- subset(df.avg,
date %in% cond.1.time) #filter data based off vector
df.cond.1.avg <- as.data.frame(df.cond.1.avg) #tibble to df
However, this approach will not work for 5-minute averages since not all of the time frames I am interested in begin in 5 minute increments of each other. Also, my previous approach forced me to only use 1 minute averages that start at the top of the minute.
I need to be able to extract 5-minute averages scattered randomly throughout the day. These are not rolling averages. I will need to extract approximately thirty 5-minute averages per day so being able to only type in the start date would be key.
Thank you!
Using the dplyr and tidyr libraries, you can filter the data down to the interval of interest and then average each column.
It is probably not the most efficient approach, but it should do what you need.
library(dplyr)
library(tidyr)
data <- data.frame(date = seq(as.POSIXct("2020-02-01 01:01:01"),
as.POSIXct("2020-02-01 20:01:10"),
by = "sec"),
A = rnorm(68410),
B = rnorm(68410),
C = rnorm(68410))
meanMinutes <- function(data, start, interval){
  # Interval in minutes
  start <- as.POSIXct(start)
  end <- start + 60*interval
  # Both endpoints are included in the window
  filterData <- dplyr::filter(data, date <= end, date >= start)
  date_start <- filterData$date[1]
  meanData <- filterData %>%
    tidyr::gather(key = "param", value = "value", A:C) %>%
    dplyr::group_by(param) %>%
    dplyr::summarise(value = mean(value, na.rm = TRUE)) %>%
    tidyr::spread(key = "param", value = "value")
  return(cbind(date_start, meanData))
}
For one date
meanMinutes(data, "2020-02-01 07:03:11", 5)
Result:
date_start A B C
1 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.1304691
For multiple dates:
dates <- c("2020-02-01 02:53:41", "2020-02-01 05:23:14",
"2020-02-01 07:03:11", "2020-02-01 19:10:45")
do.call(rbind, lapply(dates, function(x) meanMinutes(data, x, 5)))
Result:
date_start A B C
1 2020-02-01 02:53:41 -0.001929374 -0.03807152 0.06072332
2 2020-02-01 05:23:14 0.009494321 -0.05911055 -0.02698245
3 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.13046909
4 2020-02-01 19:10:45 -0.123574816 -0.02373881 0.05997007
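The same "type only the start time" idea can be sketched without any packages. The Python example below is an illustration on synthetic one-per-second data, not a port of the R function: the name window_average and the generated values are assumptions, and unlike the filter above it uses a half-open window (start included, end excluded), so a 5-minute window holds exactly 300 seconds.

```python
from datetime import datetime, timedelta
from statistics import fmean


def window_average(rows, start, minutes=5):
    """Average every non-date column over [start, start + minutes)."""
    t0 = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    t1 = t0 + timedelta(minutes=minutes)
    picked = [r for r in rows if t0 <= r["date"] < t1]
    if not picked:
        return None  # no observations in this window
    cols = [c for c in picked[0] if c != "date"]
    return {"start": start, **{c: fmean(r[c] for r in picked) for c in cols}}


# Synthetic data (hypothetical): one row per second for 20 minutes
base = datetime(2020, 3, 6, 9, 55, 42)
rows = [
    {"date": base + timedelta(seconds=s),
     "A": float(s), "B": float(s) - 2.0, "C": float(s) + 2.0}
    for s in range(1200)
]

# Only the start times are typed in; the end of each window is derived
starts = ["2020-03-06 09:55:42", "2020-03-06 10:03:17"]
averages = [window_average(rows, s) for s in starts]
for row in averages:
    print(row)
```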
How can I check whether a dimension exists on an axis in an MDX statement?
I need to check how many time units (days, weeks, months...) exist on axis 1 and use that count to calculate a measure. Here is an example of what should happen for different sets of dimensions:
days -> [Measures].[A] = [Measures].[B] / number of members in axis 1, from only date dimension (365)
months -> [Measures].[A] = [Measures].[B] / number of members in axis 1, from only date dimension (12)
months, product group -> [Measures].[A] = [Measures].[B] / number of members in axis 1, from only date dimension (12)
So dimensions other than the date dimension shouldn't affect the calculation. I only need the count of members from the [Date] dimension.
A simple example is counting of days:
With
Member [Measures].[Members on rows] AS
Axis(1).Count
Select
Non Empty [Measures].[Members on rows] on columns,
Non Empty [Date].[Day].[Day].Members on rows
From [Sales]
Where [Date].[Month].[Month].&[201701]
But this only gives you the row count; it cannot tell you which hierarchies are on the axis. You can also check whether the member count in the current context equals the member count of the whole attribute:
Count(existing [Date].[Day].[Day].Members) = Count([Date].[Day].[Day].Members)
If it returns true, it most likely means that the [Date].[Day] hierarchy is not filtered in your report.
I feel like I'm making this harder than it should be. I'm trying to display a sum for month C but this sum must include totals for months A, B & C. Then I need to do the same thing for Month D which includes totals for months B, C & D. Once I have this figured out I need to break it down by individual accounts but that part shouldn't be too difficult.
I have a date table to call on but it doesn't have month start or end dates which seems to be causing my difficulty.
So the solution to the above issue is to use a CTE (Common Table Expression) that identifies the date range accepted for each time period, and to join on that range.
Select *
FROM A
LEFT JOIN B ON CASE WHEN a.DateID >= b.PeriodStart AND a.DateID <= b.PeriodEnd THEN 1 ELSE 0 END = 1
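To make the intended result concrete, here is a small Python sketch of the trailing three-month total described in the question (month C = A + B + C, month D = B + C + D). It is not a translation of the SQL; the function name, the (year, month, amount) row layout and the sample figures are assumptions made for illustration.

```python
from collections import defaultdict


def trailing_three_month_totals(rows):
    """rows: iterable of (year, month, amount) tuples.

    Returns {(year, month): total of that month plus the two months before it}.
    """
    monthly = defaultdict(float)
    for year, month, amount in rows:
        # a running month index makes "previous month" work across year ends
        monthly[year * 12 + (month - 1)] += amount

    totals = {}
    for idx in sorted(monthly):
        total = sum(monthly.get(idx - back, 0.0) for back in range(3))
        totals[(idx // 12, idx % 12 + 1)] = total
    return totals


rows = [
    (2016, 11, 10.0),
    (2016, 12, 20.0),
    (2017, 1, 30.0),
    (2017, 1, 5.0),
    (2017, 2, 40.0),
]
totals = trailing_three_month_totals(rows)
print(totals[(2017, 1)])  # 65.0  (Nov + Dec + Jan)
print(totals[(2017, 2)])  # 95.0  (Dec + Jan + Feb)
```

Breaking the totals down by account would mean adding the account to the dictionary key.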
I have a large array with daily data from 1926 to 2012. I want to find out how many observations are in each year (it varies from year-to-year). I have a column vector which has the dates in the form of:
19290101
19290102
.
.
.
One year here is going to be July through June of the next year.
So 19630701 to 19640630
I would like to use this vector to find the number of days in each year. I need the number of observations to use as inputs into a regression.
I can't tell whether the dates are stored numerically or as a string of characters; I'll assume they're numbers. What I suggest is to convert each value to its year and then use hist to count the number of dates in each year. So try something like this:
year = floor(date/10000);
obs_per_year = hist(year,1926:2012);
This will give you a vector holding the number of observations in each calendar year, starting from 1926. Note that it does not implement the July-through-June year described in the question.
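Series of years starting July 1st:
bin = datenum(1926:2012,7,1);
Bin your vector of dates within each year with bin(1) <= x < bin(2), bin(2) <= x < bin(3), ... This assumes dates holds serial date numbers; values in yyyymmdd form need converting first, for example with datenum(num2str(dates),'yyyymmdd').
count = histc(dates,bin);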
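The July-through-June binning can also be done directly on the yyyymmdd integers, without converting them to date numbers. The following Python sketch shows the arithmetic on a handful of made-up dates; the function name july_june_year is hypothetical, and each year is labelled by the calendar year in which it starts.

```python
from collections import Counter


def july_june_year(yyyymmdd):
    """Label a yyyymmdd integer with the July-June year it falls in.

    19630701 through 19640630 all map to 1963.
    """
    year = yyyymmdd // 10000
    month = (yyyymmdd // 100) % 100
    return year if month >= 7 else year - 1


# Made-up sample dates around two year boundaries
dates = [19630630, 19630701, 19631231, 19640101, 19640630, 19640701]
counts = Counter(july_june_year(d) for d in dates)
print(sorted(counts.items()))  # [(1962, 1), (1963, 4), (1964, 1)]
```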