I am trying to design a function that will calculate 30-day rolling volatility.
I have a file with 3 columns: a date, and the daily returns for 2 stocks.
How can I do this? I am having trouble summing the first 30 entries to get my vol.
Edit:
So it will read a CSV file with 3 columns: a date and the daily returns for the two stocks.
daily.ret = read.csv("abc.csv")
e.g.
date       stock1 stock2
01/01/2000 0.01   0.02
and so on, with years of data. I want to calculate the rolling 30-day annualised vol.
This is my function:
calc_30day_vol = function(daily.ret)
{
  a1 = daily.ret$stock1^2              # squared daily returns, stock 1
  a2 = daily.ret$stock2^2              # squared daily returns, stock 2
  n = length(a1)
  approx_days_in_year = n / 10         # rough guess, assuming ~10 years of data
  vol_1 = rep(NA_real_, n)             # entries 1-29 stay NA
  vol_2 = rep(NA_real_, n)
  for (i in 1 : (n - 29))
  {
    j = i + 29                         # last day of the 30-day window
    # sum the 30 squared returns, scale up to a year, take the square root
    vol_1[j] = sqrt( (approx_days_in_year / 30) * sum(a1[i:j]) )
    vol_2[j] = sqrt( (approx_days_in_year / 30) * sum(a2[i:j]) )
  }
  data.frame(vol_1, vol_2)
}
So a1 and a2 are the squared daily returns from the file, which are needed to calculate vol. Entries 1-29 of vol_1 and vol_2 stay empty since the first full 30-day window only ends on day 30. I am trying to sum the squared daily returns over the first 30 entries (with sum(), since these are plain vectors rather than rows of a matrix), and then move the window down one index per iteration:
days 1-30, then days 2-31, then days 3-32, and so on, which is why I have defined "j".
I'm new to R, so apologies if this sounds rather silly.
This should get you started.
First, I create some data that look like what you describe:
library(quantmod)
getSymbols(c("SPY", "DIA"), src='yahoo')
m <- merge(ROC(Ad(SPY)), ROC(Ad(DIA)), all=FALSE)[-1, ]
dat <- data.frame(date=format(index(m), "%m/%d/%Y"), coredata(m))
tmpfile <- tempfile()
write.csv(dat, file=tmpfile, row.names=FALSE)
Now I have a csv with data in your very specific format.
Use read.zoo to read the csv, then convert to an xts object (there are lots of ways to read data into R; see the R Data Import/Export manual):
r <- as.xts(read.zoo(tmpfile, sep=",", header=TRUE, format="%m/%d/%Y"))
# each column of r has daily log returns for a stock price series
# use `apply` to apply a function to each column.
vols.mat <- apply(r, 2, function(x) {
#use rolling 30 day window to calculate standard deviation.
#annualize by multiplying by square root of time
runSD(x, n=30) * sqrt(252)
})
#`apply` returns a `matrix`; `reclass` to `xts`
vols.xts <- reclass(vols.mat, r) #class as `xts` using attributes of `r`
tail(vols.xts)
# SPY.Adjusted DIA.Adjusted
#2012-06-22 0.1775730 0.1608266
#2012-06-25 0.1832145 0.1640912
#2012-06-26 0.1813581 0.1621459
#2012-06-27 0.1825636 0.1629997
#2012-06-28 0.1824120 0.1630481
#2012-06-29 0.1898351 0.1689990
#Clean-up
unlink(tmpfile)
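If you would rather reproduce the estimator from your own function (the square root of the annualised mean of the squared returns) instead of runSD, zoo's rollapplyr can apply it over a right-aligned 30-day window. A minimal sketch, assuming the xts object r built above and 252 trading days per year:
vols.q <- rollapplyr(r, width=30, FUN=function(x) sqrt((252/30) * sum(x^2)))
#the first 29 rows are NA; this and the runSD approach should track each other closely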
Related
I am researching a topic on inflation and currently have the problem that part of my code takes over 20 hours to complete because I have a nested loop.
The code looks like this:
avg_timeseries <- function(i, k){
  start_date <- microdata_sce$survey_date[i]
  end_date <- microdata_sce$prev_survey[i]
  mean(subset(topictimeseries, survey_date <= start_date & survey_date >= end_date)[[k]])
}
start.time <- Sys.time()
for (i in 1:nrow(microdata_sce)){
  for (k in 2:ncol(topictimeseries)){
    microdata_sce[[i, paste(k-1, 'topic', sep="_")]] <- avg_timeseries(i, k)
  }
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
What do I want/what do I have:
I have a panel dataset (microdata_sce) with a total of 130,000 observations. Each userid has a start date and an end date, which are the survey date and the previous survey date respectively. For every row I want to take both dates and calculate the average between the two dates separately for each column of the time series dataframe (topictimeseries, 70 columns). Afterwards, each of these averages is attached to the microdata_sce dataset as a column.
My idea was to loop through each row, take the survey dates, and create a subset from which the average is calculated; within that, I loop through the k different time series.
Unfortunately, the code takes an eternity. Is there a way to speed this up using apply?
Thank you!
Kind regards
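One way to speed this up considerably is to drop the 70 separate subset() calls per row and compute all column means in a single pass per row. A minimal, untested sketch, assuming column 1 of topictimeseries is survey_date and the remaining columns are numeric:
topic_cols <- names(topictimeseries)[-1]   # assumes col 1 is survey_date
avg_mat <- t(vapply(seq_len(nrow(microdata_sce)), function(i) {
  in_window <- topictimeseries$survey_date <= microdata_sce$survey_date[i] &
               topictimeseries$survey_date >= microdata_sce$prev_survey[i]
  colMeans(topictimeseries[in_window, topic_cols, drop = FALSE])
}, numeric(length(topic_cols))))
microdata_sce[paste(seq_along(topic_cols), "topic", sep = "_")] <- avg_mat
This still loops over the 130,000 rows, but subsetting once per row instead of 70 times should cut the runtime by a large factor.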
I have second-by-second data for channels A, B, and C as shown below (this just shows the first 6 rows):
date A B C
1 2020-03-06 09:55:42 224.3763 222.3763 226.3763
2 2020-03-06 09:55:43 224.2221 222.2221 226.2221
3 2020-03-06 09:55:44 224.2239 222.2239 226.2239
4 2020-03-06 09:55:45 224.2044 222.2044 226.2044
5 2020-03-06 09:55:46 224.2397 222.2397 226.2397
6 2020-03-06 09:55:47 224.3690 222.3690 226.3690
I would like to be able to extract multiple 5-minute averages for columns A, B and C based off time. Is there a way to do this where I would only need to type in the starting time period, rather than having to type the start AND end times for each time period I want to extract? Essentially, I want to be able to type the start time and have my code calculate and extract the average for the successive 5 minutes.
I was previously using the 'time.average' function from the 'openair' package to obtain 1-minute averages for the entire data set. I then created a vector with the start times and then used the 'subset' function' to extract the 1 minute averages I was interested in.
library(openair)
df.avg <- timeAverage(df, avg.time = "min", statistic = "mean")
cond.1.time <- c(
  '2020-03-06 10:09:00',
  '2020-03-06 10:13:00',
  '2020-03-06 10:18:00'
) #enter start times
library(dplyr)
df.cond.1.avg <- subset(df.avg,
date %in% cond.1.time) #filter data based off vector
df.cond.1.avg <- as.data.frame(df.cond.1.avg) #tibble to df
However, this approach will not work for 5-minute averages since not all of the time frames I am interested in begin in 5 minute increments of each other. Also, my previous approach forced me to only use 1 minute averages that start at the top of the minute.
I need to be able to extract 5-minute averages scattered randomly throughout the day. These are not rolling averages. I will need to extract approximately thirty 5-minute averages per day so being able to only type in the start date would be key.
Thank you!
Using the dplyr and tidyr libraries, the interval to be averaged can be selected by filtering on the dates and then taking the mean.
It may not be the most efficient approach, but it can help you.
library(dplyr)
library(tidyr)
data <- data.frame(date = seq(as.POSIXct("2020-02-01 01:01:01"),
as.POSIXct("2020-02-01 20:01:10"),
by = "sec"),
A = rnorm(68410),
B = rnorm(68410),
C = rnorm(68410))
meanMinutes <- function(data, start, interval){
# Interval in minutes
start <- as.POSIXct(start)
end <- start + 60*interval
filterData <- dplyr::filter(data, date <= end, date >= start)
date_start <- filterData$date[1]
meanData <- filterData %>%
tidyr::gather(key = "param", value = "value", A:C) %>%
dplyr::group_by(param) %>%
dplyr::summarise(value = mean(value, na.rm = T)) %>%
tidyr::spread(key = "param", value = "value")
return(cbind(date_start, meanData))
}
For one date
meanMinutes(data, "2020-02-01 07:03:11", 5)
Result:
date_start A B C
1 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.1304691
For multiple dates:
dates <- c("2020-02-01 02:53:41", "2020-02-01 05:23:14",
"2020-02-01 07:03:11", "2020-02-01 19:10:45")
do.call(rbind, lapply(dates, function(x) meanMinutes(data, x, 5)))
Result:
date_start A B C
1 2020-02-01 02:53:41 -0.001929374 -0.03807152 0.06072332
2 2020-02-01 05:23:14 0.009494321 -0.05911055 -0.02698245
3 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.13046909
4 2020-02-01 19:10:45 -0.123574816 -0.02373881 0.05997007
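As a side note, with dplyr 1.0 or later the gather/spread reshaping can be dropped entirely; a sketch under the same assumed data and columns as above:
meanMinutes2 <- function(data, start, interval){
  # Interval in minutes
  start <- as.POSIXct(start)
  end <- start + 60*interval
  data %>%
    dplyr::filter(date >= start, date <= end) %>%
    dplyr::summarise(date_start = min(date),
                     dplyr::across(A:C, ~ mean(.x, na.rm = TRUE)))
}
do.call(rbind, lapply(dates, function(x) meanMinutes2(data, x, 5)))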
I have 3 arrays of size 803500*1 with the following details:
Rid: It can contain any number
RidID: It contains elements from 1 to 184 in random order. Each element appears multiple times.
r: It contains elements 0,1,2,...12. All elements (except zero) appear nearly 3400 to 3700 times at random indices in this array.
The following may be useful for generating sample data:
Rid = rand(803500,1);
RidID = randi(184,803500,1);
r = randi(13,803500,1)-1; %This may not be a good sample for r as per previously mentioned details?
What I want to do?
I want to calculate the sum of those entries of Rid which correspond to each positive unique entry of r and each unique entry of RidID.
This may be clearer with the code which I wrote for this problem:
RNum = numel(unique(RidID));
RSum = ones(RNum,12); %Preallocating for better speed
for i=1:12
RperM = r ==i;
for j = 1:RNum
RSum(j,i) = sum(Rid(RperM & (RidID==j)));
end
end
Issue:
My code works, but it takes 5 seconds on average on my computer, and I have to do this calculation nearly a thousand times. If this time can be reduced from 5 seconds to at least half of that, I'll be very happy. But how do I optimize this? I don't mind whether it is improved through vectorization or a better-written loop.
I am using MATLAB R2017b.
You can use accumarray:
u = unique(RidID);
A = accumarray([RidID r+1], Rid); % sum Rid over each (RidID, r) pair; r+1 maps r=0 to column 1
RSum = A(u, 2:13);                % drop the r==0 column, keeping r = 1..12
This is slower than accumarray as suggested by rahnema, but using findgroups and splitapply may save memory.
In your example, there may be thousands of zero-valued elements in the resulting matrix, where a combination of RidID and r does not occur. In this case a stacked result would be more memory efficient, like so:
RidID | r | Rid_sum
-------------------------
1 | 1 | 100
2 | 1 | 200
4 | 2 | 85
...
This can be achieved with the following code:
[ID, rn, RidIDn] = findgroups(r,RidID); % Get unique combo ID for 'r' and 'RidID'
RSum = splitapply( @sum, Rid, ID ); % Sum for each ID
output = table( RidIDn, rn, RSum ); % Nicely formatted table output
% Get rid of elements where r == 0
output( output.rn == 0, : ) = [];
You could convert this to the same output as the accumarray method, but it's already a slower method...
% Convert to 'unstacked' 2D matrix (optional)
RSum = full( sparse( output.RidIDn, output.rn, output.RSum ) );
Let's say I have daily data for a 30-year period in a matrix. To keep it simple, assume it has only 1 column and 10957 rows, one per day over the 30 years, starting in 2010. I want to find the max value for every year, so the output will be 1 column and 30 rows. Is there any automated way to program this in MATLAB? Currently I am doing it manually:
%for the first year
max(RAINFALL(1:365));
.
.
%for the 30th year
max(RAINFALL(10593:10957));
It is exhausting to do this manually, and I have quite a few similar data sets. I used the code below to calculate the mean and standard deviation for the 30 years. I tried to modify it to work for the task above, but I couldn't succeed. I hope someone can modify the code or suggest a new way to do it.
data = rand(32872,100); % replace with your data matrix
[nDays,nData] = size(data);
% let MATLAB construct the vector of dates and worry about things like leap
% year.
dayFirst = datenum(2010,1,1);
dayStamp = dayFirst:(dayFirst + nDays - 1);
dayVec = datevec(dayStamp);
year = dayVec(:,1);
uniqueYear = unique(year);
K = length(uniqueYear);
a = nan(1,K);
b = nan(1,K);
for k = 1:K
% use logical indexing to pick out the year
currentYear = year == uniqueYear(k);
a(k) = mean2(data(currentYear,:));
b(k) = std2(data(currentYear,:));
end
One possible approach:
Create a column containing the year of each data value, using datenum and datevec to take care of leap years.
Find the maximum for each year, with accumarray.
Code:
%// Example data:
RAINFALL = rand(10957,1); %// one column
start_year = 2010; %// data starts on January 1st of this year
%// Computations:
[year, ~] = datevec(datenum(start_year,1,1) + (0:size(RAINFALL,1)-1)); %// step 1
result = accumarray(year - start_year + 1, RAINFALL, [], @max); %// step 2
As a bonus: if you replace @max in step 2 with either @mean or @std, guess what you get... much simpler than your code.
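For instance, the yearly mean and standard deviation from your longer loop reduce to one line each (a sketch reusing the variables above):
yearly_mean = accumarray(year - start_year + 1, RAINFALL, [], @mean);
yearly_std  = accumarray(year - start_year + 1, RAINFALL, [], @std);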
This may help you:
RAINFALL = rand(1,10957); % - your data here
firstYear = 2010;
numberOfYears = 30;
cum = 0; % - cumulative day offset
yearlyData = zeros(1,numberOfYears); % - preallocation; not strictly necessary
for i = 1 : numberOfYears
yearLength = datenum(firstYear+i,1,1) - datenum(firstYear + i - 1,1,1);
yearlyData(i) = max(RAINFALL(1 + cum : yearLength + cum));
cum = cum + yearLength;
end
I have coded this partly, but am not sure about it, since what I get is only partial data.
I have a 4D matrix with dimensions xV(6,24,63,15), meaning xV(min,hour,day,customer): the data is collected every 10 minutes for 63 days for 15 customers, which is why the first dimension has 6 rows (one per 10-minute interval).
What I want is to collect the data for, let's say, Monday of every week and use it for a plot. There are 63/7 = 9 Mondays, each with 24 hours, where each hour has 6 data points (every 10 minutes). For each of those 10-minute slots on each Monday I want a new matrix, so I can take the mean of it and plot it. Is this possible?
I have come this far, but with no luck:
n = 0;
while (n < 24)
    n = n + 1;
    m = 0; % reset the minute counter for each hour
    while (m < 6)
        m = m + 1;
        % all 63 days for this (min, hour) slot; squeeze drops the singleton dims
        Va(:,m) = squeeze(xV(m,n,1:63,1)); % (min,hour,day,customer)
        Vb(:,m) = squeeze(xV(m,n,1:63,1));
        Vc(:,m) = squeeze(xV(m,n,1:63,1));
    end
end
The file: xV.mat
Thanks again for the help.
firstMonday = 1; %// index of first Monday. 1 if first day is a Monday
result = xV(:,:,firstMonday:7:end,:);
This gives a 6x24x9x15 matrix containing only Mondays. To average over all Mondays, use
squeeze(mean(result,3)) %// mean along 3rd dim. Size is 6x24x15
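From there, a plot of the mean Monday profile for one customer is straightforward. A sketch, assuming customer 1 (column-major reshaping puts the six 10-minute values of each hour in chronological order):
mondayMean = squeeze(mean(result, 3));        %// 6x24x15: mean over the 9 Mondays
profile1 = reshape(mondayMean(:,:,1), [], 1); %// 144 ten-minute values across the day
plot((0:143)/6, profile1);                    %// x-axis in hours
xlabel('hour of day'); ylabel('mean of xV');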