Select max value from huge matrix of 30 years daily data - arrays

Let's say I have daily data covering a 30-year period in a matrix. To keep it simple, assume it has only 1 column and 10957 rows, one per day over the 30 years. The data start in 2010. I want to find the max value for every year, so that the output is 1 column and 30 rows. Is there an automated way to do this in MATLAB? Currently I'm doing it manually, like this:
%for the first year
max(RAINFALL(1:365));
.
.
%for the 30th year
max(RAINFALL(10593:10957));
It is exhausting to do this manually, and I have quite a few similar data sets. I used the code below to calculate the mean and standard deviation for the 30 years. I tried to modify it for the task above, but I couldn't get it to work. I hope someone can modify the code or suggest a new approach.
data = rand(32872,100); % replace with your data matrix
[nDays,nData] = size(data);
% let MATLAB construct the vector of dates and worry about things like leap
% year.
dayFirst = datenum(2010,1,1);
dayStamp = dayFirst:(dayFirst + nDays - 1);
dayVec = datevec(dayStamp);
year = dayVec(:,1);
uniqueYear = unique(year);
K = length(uniqueYear);
a = nan(1,K);
b = nan(1,K);
for k = 1:K
% use logical indexing to pick out the year
currentYear = year == uniqueYear(k);
a(k) = mean2(data(currentYear,:));
b(k) = std2(data(currentYear,:));
end

One possible approach:
Create a column containing the year of each data value, using datenum and datevec to take care of leap years.
Find the maximum for each year, with accumarray.
Code:
%// Example data:
RAINFALL = rand(10957,1); %// one column
start_year = 2010; %// data starts on January 1st of this year
%// Computations:
[year, ~] = datevec(datenum(start_year,1,1) + (0:size(RAINFALL,1)-1)); %// step 1
result = accumarray(year.'-start_year+1, RAINFALL.', [], @max); %// step 2
As a bonus: if you replace @max in step 2 with either @mean or @std, guess what you get... much simpler than your code.
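For readers who want to check the grouping logic outside MATLAB, here is a minimal Python sketch of the same idea: label each day with its calendar year, then keep the largest value per label. The function name `yearly_max` and the synthetic data are assumptions for illustration only, not part of the answer above.

```python
from datetime import date, timedelta

def yearly_max(values, start_year):
    """Return {year: max of that year's values}; values[0] is Jan 1 of start_year."""
    first_day = date(start_year, 1, 1)
    result = {}
    for offset, value in enumerate(values):
        # The calendar arithmetic takes care of leap years.
        year = (first_day + timedelta(days=offset)).year
        if year not in result or value > result[year]:
            result[year] = value
    return result

# Hypothetical data: each value equals its day index, so every year's
# maximum is simply the index of that year's last day.
rainfall = list(range(10957))
maxima = yearly_max(rainfall, 2010)
print(len(maxima), maxima[2010], maxima[2012], maxima[2039])
```

With this synthetic input the leap year 2012 ends one index later than a fixed 365-day split would predict, which is exactly the case the manual slicing in the question gets wrong.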

This may help You:
RAINFALL = rand(1,10957); % - Your data here
firstYear = 2010;
numberOfYears = 4;
cum = 0; % - cumulative factor
yearlyData = zeros(1,numberOfYears); % - preallocation; this isn't really necessary
for i = 1 : numberOfYears
yearLength = datenum(firstYear+i,1,1) - datenum(firstYear + i - 1,1,1);
yearlyData(i) = max(RAINFALL(1 + cum : yearLength + cum));
cum = cum + yearLength;
end

Related

How to change from annual year end to annual mid year in SAS

I currently work in SAS and utilise arrays in this way:
Data Test;
input Payment2018-Payment2021;
datalines;
10 10 10 10
20 20 20 20
30 30 30 30
;
run;
As I understand it, this implicitly assumes a boundary, either the start of the year or the end of the year (please correct me if I'm wrong).
So, if this is June data and payments are set to increase by 50% every 9 months, I'm looking for a way for my code to recognise that my years run from the end of June to the next end of June.
For example, if I wanted to say
Data Payment_Pct;
set test;
lastpayrise = "31Jul2018";
array payment:
array Pay_Inc(2018:2021) Pay_Inc: ;
Pay_Inc2018 = 0;
Pay_Inc2019 = 2; /*2 because there are two increments in 2019*/
Pay_Inc2020 = 1;
Pay_Inc2021 = 1;
do I = 2018 to 2021;
if i = year(pay_inc) then payrise(i) * 50% * Pay_Inc(i);
end;
run;
It's all well and good for me to do this manually for one entry, but for my uni project I'll need the algorithm to work these out by itself. I am currently reading up on intck, but any help would be appreciated!
P.S. It would be great to have an algorithm that creates the following
Pay_Inc2019 Pay_Inc2020 Pay_Inc2021
1 2 1
Or, it would be great to know how SAS sets up the array for 2018:2021: does it assume end of year, or can you set it to mid-year?
Regarding input Payment2018-Payment2021; there is no automatic assumption of yearness or calendaring. The numbers 2018 and 2021 are the bounds of a numbered range list:
In a numbered range list, you can begin with any number and end with any number as long as you do not violate the rules for user-supplied names and the numbers are consecutive.
The meaning of the numbers 2018 to 2021 is up to the programmer. You state the variables correspond to the June payment in the numbered year.
You would have to iterate a date using 9-month steps and increment a counter based on the year in which the date falls.
Sample code
Dynamically adapts to the variable names that are arrayed.
data _null_;
array payments payment2018-payment2021;
array Pay_Incs pay_inc2018-pay_inc2021; * must be same range numbers as payments;
* obtain variable names of first and last element in the payments array;
lower_varname = vname(payments(1));
upper_varname = vname(payments(dim(payments)));
* determine position of the range name numbers in those variable names;
lower_year_position = prxmatch('/\d+\s*$/', lower_varname);
upper_year_position = prxmatch('/\d+\s*$/', upper_varname);
* extract range name numbers from the variable names;
lower_year = input(substr(lower_varname,lower_year_position),12.);
upper_year = input(substr(upper_varname,upper_year_position),12.);
* prepare iteration of a date over the years that should be the name range numbers;
date = mdy(06,01,lower_year); * june 1 of year corresponding to first variable in array;
format date yymmdd10.;
do _n_ = 1 by 1; * repurpose _n_ for an infinite do loop with interior leave;
* increment by 9-months;
date = intnx('month', date, 9);
year = year(date);
if year > upper_year then leave;
* increment counter for year in which iterating date falls within;
Pay_Incs( year - lower_year + 1 ) + 1;
end;
put Pay_Incs(*)=;
run;
Increment counter notes
There is a lot to unpack in this statement
Pay_Incs( year - lower_year + 1 ) + 1;
+ 1 at the end of the statement increments the addressed array element by 1, and is the syntax for the SUM Statement
variable + expression The sum statement is equivalent to using the SUM function and the RETAIN statement, as shown here:
retain variable 0;
variable=sum(variable,expression);
year - lower_year + 1 computes the array base-1 index, 1..N, that addresses the corresponding variable in the named range list pay_inc<lower_year>-pay_inc<upper_year>
Pay_Incs( <computed index> ) selects the variable of the SUM statement
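The counting logic can also be sketched in Python, which makes it easy to check by hand. This assumes, as in the SAS step, that the first date is 1 June of the first year and that only the stepped dates (not the starting date) count as increments. The helper names `add_months` and `count_increments` are hypothetical.

```python
from datetime import date

def add_months(d, months):
    """Shift a first-of-month date forward by a whole number of months."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

def count_increments(first_year, last_year, step_months=9, start_month=6):
    """Count how many stepped dates fall in each calendar year."""
    counts = {year: 0 for year in range(first_year, last_year + 1)}
    current = date(first_year, start_month, 1)
    while True:
        current = add_months(current, step_months)
        if current.year > last_year:
            break
        counts[current.year] += 1
    return counts

pay_inc = count_increments(2018, 2021)
print(pay_inc)
```

Starting from June 2018, the stepped dates are March 2019, December 2019, September 2020 and June 2021, so 2019 receives two increments and 2020 and 2021 one each, matching the values hard-coded in the question.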
This is a wonderful use case of the intnx() function. intnx() will be your best friend when it comes to aligning dates.
In the traditional calendar, the year starts on 01JAN. In your calendar, the year starts on 01JUN, five months later. We want to shift our date so that the year starts on 01JUN, which is what the shifted interval 'year.6' (a year beginning in the sixth month) does. This will allow you to take the year part of the date and determine what year you are on in the new calendar.
data want;
format current_cal_year
current_new_year year4.
;
current_cal_year = intnx('year', '01JUN2018'd, 0, 'B');
current_new_year = intnx('year.6', '01JUN2018'd, 1, 'B');
run;
Note that we shifted current_new_year by one year. To illustrate why, let's see what happens if we don't shift it by one year.
data want;
format current_cal_year
current_new_year year4.
;
current_cal_year = intnx('year', '01JUN2018'd, 0, 'B');
current_new_year = intnx('year.6', '01JUN2018'd, 0, 'B');
run;
current_new_year shows 2018, but we really are in 2019. For 5 months out of the year, this value will be correct. From June-December, the year value will be incorrect. By shifting it one year, we will always have the correct year associated with this date value. Look at it with different months of the year and you will see that the year part remains correct throughout time.
data want;
format cal_month date9.
cal_year
new_year year4.
;
do i = 0 to 24;
cal_month = intnx('month', '01JAN2016'd, i, 'B');
cal_year = intnx('year', cal_month, 0, 'B');
new_year = intnx('year.6', cal_month, 1, 'B');
year_not_same = (year(cal_year) NE year(new_year) );
output;
end;
drop i;
run;
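The same labelling rule can be written down directly, which may help when checking the SAS output. This Python sketch assumes the convention used above: a year that runs June to May is named after the calendar year in which it ends. The function name `fiscal_year` is hypothetical.

```python
from datetime import date

def fiscal_year(d, start_month=6):
    """Label a date by the calendar year in which its June-to-May year ends."""
    return d.year + 1 if d.month >= start_month else d.year

labels = [(m, fiscal_year(date(2018, m, 1))) for m in range(1, 13)]
print(labels)
```

January to May 2018 stay in 2018, while June to December 2018 already belong to 2019.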

Stata year-quarter for loop

I am trying to create a loop in Stata. I fit a model on the data with year <= yridx and quarter <= qtridx, then predict one step ahead. That is, the model is fit on all time points up to the current loop position, while the prediction is made for the next quarter, out of sample. So my question is: how do I handle the case where yridx = 2000 and qtridx = 4, so that the look-ahead quarter inside the loop is year = 2001 and qtr = 1?
foreach yridx of numlist 2000/2012 {
forvalues qtridx = 1/4 {
regress Y X if year <= yridx and qtr <= qtridx
predict
}
}
It sounds as if it would be much easier to work in terms of quarterly dates. Here is one of several ways to do it.
gen qdate = yq(year, qtr)
forval m = `=yq(2000,1)'/`=yq(2012, 4)' {
regress Y X if qdate <= `m'
predict <whatever>
}
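The reason a single quarterly index removes the year/quarter rollover problem can be seen in a few lines of Python. This sketch mirrors Stata's convention that quarterly dates count quarters from 1960q1 = 0. The inverse helper `from_yq` is a hypothetical name added for illustration.

```python
def yq(year, quarter):
    """Quarterly index with 1960q1 = 0, mirroring Stata's yq() convention."""
    return (year - 1960) * 4 + (quarter - 1)

def from_yq(index):
    """Inverse mapping: quarterly index back to (year, quarter)."""
    years, q0 = divmod(index, 4)
    return years + 1960, q0 + 1

m = yq(2000, 4)
next_quarter = from_yq(m + 1)
print(m, next_quarter)
```

Adding 1 to the index for 2000q4 lands on 2001q1 with no special casing, so the look-ahead quarter is always just `m + 1`.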

Converting days since Jan.1 1900 to today's date

typedef struct dbdatetime
{ // Internal representation of DATETIME data type
LONG dtdays; // No of days since Jan-1-1900 (maybe negative)
ULONG dttime; // No. of 1/300ths of a second since midnight
} DBDATETIME;
I am trying to convert this struct into today's date. I don't suspect the time will give me much trouble but I am having problems with the logic of converting the total number of days to the proper month and day.
Ex. Friday, November 7th, 2014 is 41948 days.
You can divide by 365.2425 and add 1900 to get the current year, but how would you get the proper month and day?
Does C have anything built in to handle this? I am not a C programmer by trade.
There's nothing in the C standard directly to handle this, but if you are willing to write OS specific code, or can import libraries like boost::date_time, this is the best option. Don't attempt to handle it yourself unless you are okay with edge cases being wrong. Dates and times are notoriously difficult to get right.
Here are the docs for date_time which can do arithmetic on dates, including "add N days to 1/1/1900". http://www.boost.org/doc/libs/1_56_0/doc/html/date_time/gregorian.html#date_time.gregorian.date_duration
date d(1900, Jan, 1);
d += days(dtdays);
EDIT: OP can't use boost, but I'll leave this here in case a future visitor could use the info.
As you said, you can divide the number of days by 365.2425 to get an approximate estimate of the year. Once you have that, use the number of years and leap years between 1900 and that year to calculate the number of days up to the start of that year.
You need to make sure that the logic for detecting leap years is sound. Once you know which years are leap years, you'll be fine. For example, 1900 is NOT a leap year, even though it is divisible by 4, because century years are leap years only when divisible by 400.
Once you have that, the remaining days belong to the current year. Use the leap year logic again to determine whether the current year is a leap year or not. From then on, it's just a matter of counting days for each month.
C++ is out of the question, thanks for all the input guys.
I did find this algorithm that seems to work so far but am testing it now.
int l = nSerialDate + 68569 + 2415019;
int n = int(( 4 * l ) / 146097);
l = l - int(( 146097 * n + 3 ) / 4);
int i = int(( 4000 * ( l + 1 ) ) / 1461001);
l = l - int(( 1461 * i ) / 4) + 31;
int j = int(( 80 * l ) / 2447);
nDay = l - int(( 2447 * j ) / 80);
l = int(j / 11);
nMonth = j + 2 - ( 12 * l );
nYear = 100 * ( n - 49 ) + i + l;
Trying to find some theory behind it now.
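For what it's worth, the constants look like the Fliegel and Van Flandern Julian-day-to-Gregorian conversion. Below is a Python sketch of the same arithmetic. It assumes the input is a zero-based count of days since 1900-01-01, so the Julian Day Number offset becomes 2415021 rather than the 2415019 used for the serial date above. The function name is hypothetical.

```python
def days_since_1900_to_ymd(dtdays):
    """Convert days since 1900-01-01 (day 0) to a (year, month, day) tuple."""
    # Assumption: 2415021 is the Julian Day Number of 1900-01-01.
    l = dtdays + 68569 + 2415021
    n = (4 * l) // 146097            # 146097 days per 400-year cycle
    l = l - (146097 * n + 3) // 4
    i = (4000 * (l + 1)) // 1461001
    l = l - (1461 * i) // 4 + 31     # 1461 days per 4-year cycle
    j = (80 * l) // 2447
    day = l - (2447 * j) // 80
    l = j // 11
    month = j + 2 - 12 * l
    year = 100 * (n - 49) + i + l
    return year, month, day

print(days_since_1900_to_ymd(41948))
```

All intermediate values stay positive for dates in this range, so Python's floor division behaves like the truncating integer division in the C version.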
Thanks for all of the input everyone, here is what I ended up doing.
Convert dtdays to seconds, then subtract the number of seconds between Jan-1-1900 (the serial date epoch) and Jan-1-1970 (the UNIX epoch), which is 2208988800. This leaves us with UNIX time.
Convert dttime into seconds (for SQL Server and Sybase DATETIME the ticks are 1/300 of a second, so divide by 300) and add it to the total.
Convert the UNIX time to a tm struct using gmtime().
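Those three steps can be checked quickly in Python, whose `time.gmtime` wraps the same C facility. This is only a sketch under two assumptions: dtdays is zero-based from 1900-01-01, and dttime counts 1/300-second ticks. The function and constant names are hypothetical.

```python
import time
from datetime import date, timedelta

SECONDS_1900_TO_1970 = 2208988800  # offset quoted in the post above
TICKS_PER_SECOND = 300             # assumption: dttime counts 1/300 s ticks

def dbdatetime_to_utc(dtdays, dttime):
    """Return a struct_time (UTC) for a (dtdays, dttime) pair."""
    unix_seconds = (dtdays * 86400 - SECONDS_1900_TO_1970
                    + dttime // TICKS_PER_SECOND)
    return time.gmtime(unix_seconds)

tm = dbdatetime_to_utc(41948, 0)
print(tm.tm_year, tm.tm_mon, tm.tm_mday, tm.tm_wday)

# Cross-check with plain calendar arithmetic.
check = date(1900, 1, 1) + timedelta(days=41948)
print(check)
```

Both routes agree on the example from the question. Note that dates before 1970 give a negative UNIX time, which gmtime() does not accept on every platform.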

Finding values of same time every week. code provided, data provided. 4D-matrix

I have partly coded this, but I am not sure about it, since I only get partial data.
I have a 4D matrix with dimensions xV(6,24,63,15), meaning xV(min,hour,day,customer). The data were collected every 10 minutes for 63 days for 15 customers.
That is why the first dimension has 6 rows: one per 10-minute interval in an hour.
What I want is to collect the data for, say, Monday of every week and use it for a plot.
That is, there are 63/7 = 9 Mondays, each with 24 hours, and each hour has 6 data points (one every 10 minutes). For each 10-minute slot of each hour, I want a new matrix holding the values from all Mondays, so I can take the mean and plot it.
Is this possible?
I have come this far, but no luck:
n = 0;
m = 0;
while(n<24)
n = n + 1;
while(m<6)
m = m + 1;
Va(:,m) = x(m,n,1:63,1); %(min,hour,day,line)
Vb(:,m) = x(m,n,1:63,1);
Vc(:,m) = x(m,n,1:63,1);
end
end
the file: xV.mat
thanks again for help
firstMonday = 1; %// index of first Monday. 1 if first day is a Monday
result = xV(:,:,firstMonday:7:end,:);
This gives a 6x24x9x15 matrix containing only Mondays. To average over all Mondays, use
squeeze(mean(result,3)) %// mean along 3rd dim. Size is 6x24x15
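The stride-7 indexing is the whole trick, and it can be illustrated on a single 10-minute slot in Python. This sketch assumes day 0 is a Monday (zero-based, unlike MATLAB's 1-based firstMonday) and uses made-up readings.

```python
days = 63
first_monday = 0  # zero-based index; assumes the first day is a Monday
monday_idx = list(range(first_monday, days, 7))

# Hypothetical readings: one value per day for a single 10-minute slot.
readings = [float(d) for d in range(days)]
monday_mean = sum(readings[d] for d in monday_idx) / len(monday_idx)
print(monday_idx, monday_mean)
```

The index list has 9 entries, one per week, which corresponds to the size-9 third dimension of the result above.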

how to calculate rolling volatility

I am trying to design a function that will calculate 30 day rolling volatility.
I have a file with 3 columns: date, and daily returns for 2 stocks.
How can I do this? I have a problem in summing the first 30 entries to get my vol.
Edit:
So it will read an excel file, with 3 columns: a date, and daily returns.
daily.ret = read.csv("abc.csv")
e.g. date stock1 stock2
01/01/2000 0.01 0.02
etc etc, with years of data. I want to calculate rolling 30 day annualised vol.
This is my function:
calc_30day_vol = function()
{
stock1 = abc$stock1^2
stock2 = abc$stock1^2
j = 30
approx_days_in_year = length(abc$stock1)/10
vol_1 = 1: length(a1)
vol_2 = 1: length(a2)
for (i in 1 : length(a1))
{
vol_1[j] = sqrt( (approx_days_in_year / 30 ) * rowSums(a1[i:j])
vol_2[j] = sqrt( (approx_days_in_year / 30 ) * rowSums(a2[i:j])
j = j + 1
}
}
So stock1, and stock 2 are the squared daily returns from the excel file, needed to calculate vol. Entries 1-30 for vol_1 and vol_2 are empty since we are calculating 30 day vol. I am trying to use the rowSums function to sum the squared daily returns for the first 30 entries, and then move down the index for each iteration.
So from day 1-30, day 2-31, day 3-32, etc, hence why I have defined "j".
I'm new at R, so apologies if this sounds rather silly.
This should get you started.
First I have to create some data that look like you describe
library(quantmod)
getSymbols(c("SPY", "DIA"), src='yahoo')
m <- merge(ROC(Ad(SPY)), ROC(Ad(DIA)), all=FALSE)[-1, ]
dat <- data.frame(date=format(index(m), "%m/%d/%Y"), coredata(m))
tmpfile <- tempfile()
write.csv(dat, file=tmpfile, row.names=FALSE)
Now I have a csv with data in your very specific format.
Use read.zoo to read csv and then convert to an xts object (there are lots of ways to read data into R. See R Data Import/Export)
r <- as.xts(read.zoo(tmpfile, sep=",", header=TRUE, format="%m/%d/%Y"))
# each column of r has daily log returns for a stock price series
# use `apply` to apply a function to each column.
vols.mat <- apply(r, 2, function(x) {
#use rolling 30 day window to calculate standard deviation.
#annualize by multiplying by square root of time
runSD(x, n=30) * sqrt(252)
})
#`apply` returns a `matrix`; `reclass` to `xts`
vols.xts <- reclass(vols.mat, r) #class as `xts` using attributes of `r`
tail(vols.xts)
# SPY.Adjusted DIA.Adjusted
#2012-06-22 0.1775730 0.1608266
#2012-06-25 0.1832145 0.1640912
#2012-06-26 0.1813581 0.1621459
#2012-06-27 0.1825636 0.1629997
#2012-06-28 0.1824120 0.1630481
#2012-06-29 0.1898351 0.1689990
#Clean-up
unlink(tmpfile)
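To make explicit what the rolling calculation does, here is a small standard-library Python sketch of a 30-observation rolling sample standard deviation, annualised with sqrt(252). The function name and the toy return series are assumptions for illustration. This is not a reimplementation of runSD itself.

```python
import math
import statistics

def rolling_annualised_vol(returns, window=30, periods_per_year=252):
    """Rolling sample standard deviation scaled by sqrt(periods_per_year).

    The first window - 1 entries are None because the window is incomplete.
    """
    out = [None] * len(returns)
    for end in range(window, len(returns) + 1):
        sample = returns[end - window:end]
        out[end - 1] = statistics.stdev(sample) * math.sqrt(periods_per_year)
    return out

# Toy series: returns alternate between +1% and -1%.
returns = [0.01 if k % 2 == 0 else -0.01 for k in range(40)]
vols = rolling_annualised_vol(returns)
print(vols[28], vols[29])
```

The window slides one observation at a time (entries 1-30, 2-31, and so on), which is the indexing the question was trying to build by hand with `j`.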

Resources