I am trying to create a loop in Stata. I fit a model on the data up to and including a given year and quarter, then predict one step ahead. That is, the model is fit on all time points up to the loop's current period, while the prediction is made out of sample for the next quarter. My question is how to handle the rollover: when yridx = 2000 and qtridx = 4, the next quarter for the look-ahead should be year = 2001 and qtr = 1.
foreach yridx of numlist 2000/2012 {
forvalues qtridx = 1/4 {
regress Y X if year <= `yridx' & qtr <= `qtridx'
predict
}
}
It sounds as if it would be much easier to work in terms of quarterly dates. Here is one of several ways to do it.
gen qdate = yq(year, qtr)
forval m = `=yq(2000,1)'/`=yq(2012, 4)' {
regress Y X if qdate <= `m'
predict <whatever>
}
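The reason a single quarterly index helps is that the year rollover disappears: Stata's yq() counts quarters since 1960q1, so adding 1 to 2000q4 gives 2001q1 with no special case. Below is a minimal sketch of that arithmetic in Python rather than Stata; the helper names yq and to_year_quarter are made up for illustration.

```python
def yq(year, quarter):
    """Quarters elapsed since 1960q1, mirroring the convention of Stata's yq()."""
    return (year - 1960) * 4 + (quarter - 1)


def to_year_quarter(qdate):
    """Inverse of yq(): recover (year, quarter) from the quarterly index."""
    years_since_1960, quarter0 = divmod(qdate, 4)
    return 1960 + years_since_1960, quarter0 + 1


last_fit = yq(2000, 4)                          # last period used for fitting
next_period = to_year_quarter(last_fit + 1)     # period to predict
print(next_period)  # (2001, 1)
```

With this index, "fit up to m, predict m + 1" is a single comparison and a single addition.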
I am researching a topic on inflation and currently have the problem that part of my code takes over 20 hours to complete because I have a nested loop.
The code looks like this:
# mean of column k of topictimeseries over the date window of row i
avg_timeseries <- function(i, k){
start_date <- microdata_sce$survey_date[i]
end_date <- microdata_sce$prev_survey[i]
mean(subset(topictimeseries, survey_date <= start_date & survey_date >= end_date)[[k]])
}
start.time <- Sys.time()
for (i in 1:nrow(microdata_sce)){
for(k in 2:ncol(topictimeseries)){
microdata_sce[[i, paste(k-1, 'topic', sep="_")]] <- avg_timeseries(i, k)
}
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
What do I want/what do I have:
I have a panel dataset (microdata_sce) with a total of 130,000 observations. Each userid has a start date and an end date, which are the survey date and the last survey date respectively. I want to take both dates and calculate the average between the two dates separately for each column in the timeseries dataframe (70 columns). Afterwards, the average between the two dates is to be attached to the microdata_sce data set as a column.
My idea was to loop through each row, take the survey data, create a subset from which the average is calculated. I loop through the k different time series.
Unfortunately, the code takes an eternity. Is there a way to speed this up using apply?
Thank you!
Kind regards
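One common way to avoid re-subsetting the time series for every row is to sort it by date once, build cumulative sums, and locate each window with a binary search. This is a swapped-in technique, not something from the post, and it is sketched in Python rather than R. The function name window_means and the toy data are hypothetical, and missing values are not handled.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def window_means(dates, values, windows):
    """Mean of values whose date lies in [lo, hi], for each (lo, hi) window.

    dates must be sorted ascending; values[i] belongs to dates[i].
    """
    prefix = [0.0] + list(accumulate(values))   # prefix[j] = sum of values[:j]
    means = []
    for lo, hi in windows:
        a = bisect_left(dates, lo)     # first index with date >= lo
        b = bisect_right(dates, hi)    # one past the last index with date <= hi
        means.append((prefix[b] - prefix[a]) / (b - a) if b > a else float("nan"))
    return means


# Toy data: dates as plain day numbers, one time-series column.
dates = [1, 2, 3, 4, 5, 6]
topic = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
print(window_means(dates, topic, [(2, 4), (5, 6)]))  # [30.0, 55.0]
```

Each window then costs two binary searches instead of a full scan, and the function is called once per time-series column.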
I currently work in SAS and utilise arrays in this way:
Data Test;
input Payment2018-Payment2021;
datalines;
10 10 10 10
20 20 20 20
30 30 30 30
;
run;
In my opinion this automatically assumes a boundary, either the start of the year or the end of the year (please correct me if I'm wrong).
So, if I wanted to say that this is June data and that payments increase by 50% every 9 months, I'm looking for a way for my code to recognise that my years run from the end of June to the end of the next June.
For example, if I wanted to say
Data Payment_Pct;
set test;
lastpayrise = "31Jul2018";
array payment:
array Pay_Inc(2018:2021) Pay_Inc: ;
Pay_Inc2018 = 0;
Pay_Inc2019 = 2; /*2 because there are two increments in 2019*/
Pay_Inc2020 = 1;
Pay_Inc2021 = 1;
do I = 2018 to 2021;
if i = year(pay_inc) then payrise(i) * 50% * Pay_Inc(i);
end;
run;
It's all well and good for me to do this manually for one entry, but for my uni project I'll need the algorithm to work these out by itself. I am currently reading up on intck, but any help would be appreciated!
P.s. It would be great to have an algorithm that creates the following
Pay_Inc2019 Pay_Inc2020 Pay_Inc2021
1 2 1
Or, it would be great to know how SAS sets up the array for 2018:2021: does it assume end of year, or can you set it to mid-year?
Regarding input Payment2018-Payment2021; there is no automatic assumption of yearness or calendaring. The numbers 2018 and 2021 are the bounds of a numbered range list:
In a numbered range list, you can begin with any number and end with any number as long as you do not violate the rules for user-supplied names and the numbers are consecutive.
The meaning of the numbers 2018 to 2021 is up to the programmer. You state the variables correspond to the June payment in the numbered year.
You would have to iterate a date using 9-month steps and increment a counter based on the year in which the date falls.
Sample code
Dynamically adapts to the variable names that are arrayed.
data _null_;
array payments payment2018-payment2021;
array Pay_Incs pay_inc2018-pay_inc2021; * must be same range numbers as payments;
* obtain variable names of first and last element in the payments array;
lower_varname = vname(payments(1));
upper_varname = vname(payments(dim(payments)));
* determine position of the range name numbers in those variable names;
lower_year_position = prxmatch('/\d+\s*$/', lower_varname);
upper_year_position = prxmatch('/\d+\s*$/', upper_varname);
* extract range name numbers from the variable names;
lower_year = input(substr(lower_varname,lower_year_position),12.);
upper_year = input(substr(upper_varname,upper_year_position),12.);
* prepare iteration of a date over the years that should be the name range numbers;
date = mdy(06,01,lower_year); * june 1 of year corresponding to first variable in array;
format date yymmdd10.;
do _n_ = 1 by 1; * repurpose _n_ for an infinite do loop with interior leave;
* increment by 9-months;
date = intnx('month', date, 9);
year = year(date);
if year > upper_year then leave;
* increment counter for year in which iterating date falls within;
Pay_Incs( year - lower_year + 1 ) + 1;
end;
put Pay_Incs(*)=;
run;
Increment counter notes
There is a lot to unpack in this statement:
Pay_Incs( year - lower_year + 1 ) + 1;
+ 1 at the end of the statement increments the addressed array element by 1; this is the syntax of the SUM statement, variable + expression. From the SAS documentation:
The sum statement is equivalent to using the SUM function and the RETAIN statement, as shown here:
retain variable 0;
variable=sum(variable,expression);
year - lower_year + 1 computes the base-1 array index, 1..N, that addresses the corresponding variable in the named range list pay_inc<lower_year>-pay_inc<upper_year>.
Pay_Incs( <computed index> ) selects the variable of the SUM statement.
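The counting logic itself is independent of SAS. As a cross-check, here is a minimal Python sketch of the same idea: start at June of the first year, step 9 months at a time, and count which calendar year each step lands in. The function name count_increments and its defaults are illustrative assumptions, not part of the SAS program.

```python
def count_increments(first_year, last_year, start_month=6, step_months=9):
    """Count how many 9-month step dates fall in each calendar year."""
    counts = {year: 0 for year in range(first_year, last_year + 1)}
    # Work in whole months so that adding 9 months is plain integer arithmetic.
    month_index = first_year * 12 + (start_month - 1)
    while True:
        month_index += step_months
        year = month_index // 12
        if year > last_year:
            break
        counts[year] += 1
    return counts


print(count_increments(2018, 2021))
# {2018: 0, 2019: 2, 2020: 1, 2021: 1}
```

The steps land in March 2019, December 2019, September 2020 and June 2021, which gives the counts 0, 2, 1, 1 that the question assigns by hand.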
This is a wonderful use case of the intnx() function. intnx() will be your best friend when it comes to aligning dates.
In the traditional calendar, the year starts on 01JAN. In your calendar, the year starts on 01JUN, the sixth month, which is what the .6 in the 'year.6' interval specifies. We want to shift the year boundary so that the year starts on 01JUN. This will allow you to take the year part of the aligned date and determine which year you are in under the new calendar.
data want;
format current_cal_year
current_new_year year4.
;
current_cal_year = intnx('year', '01JUN2018'd, 0, 'B');
current_new_year = intnx('year.6', '01JUN2018'd, 1, 'B');
run;
Note that we shifted current_new_year by one year. To illustrate why, let's see what happens if we don't shift it by one year.
data want;
format current_cal_year
current_new_year year4.
;
current_cal_year = intnx('year', '01JUN2018'd, 0, 'B');
current_new_year = intnx('year.6', '01JUN2018'd, 0, 'B');
run;
current_new_year shows 2018, but under the new calendar we really are in 2019. Without the shift, the year part is the calendar year in which the June-based year began, so it is one less than the year we want. By shifting it one year, we will always have the correct year associated with this date value. Look at it with different months of the year and you will see that the year part remains correct throughout time.
data want;
format cal_month date9.
cal_year
new_year year4.
;
do i = 0 to 24;
cal_month = intnx('month', '01JAN2016'd, i, 'B');
cal_year = intnx('year', cal_month, 0, 'B'); * start of the calendar year containing cal_month;
new_year = intnx('year.6', cal_month, 1, 'B'); * start of the next June-based year;
year_not_same = (year(cal_year) NE year(new_year) );
output;
end;
drop i;
run;
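The rule behind the one-year shift can be stated without intnx(): a date from June onward belongs to the year that ends the following May. A minimal Python sketch, assuming the June-to-May year is labelled by the calendar year in which it ends (the function name june_based_year is made up for illustration):

```python
from datetime import date


def june_based_year(d):
    """Label of the June-to-May year containing d, named after the year it ends in."""
    return d.year + 1 if d.month >= 6 else d.year


for d in (date(2018, 5, 31), date(2018, 6, 1), date(2018, 12, 31), date(2019, 1, 1)):
    # calendar year next to the June-based year
    print(d.isoformat(), d.year, june_based_year(d))
```

The two year columns differ from June through December and agree from January through May.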
Could anyone help with the translation of the following Stata code? I need this code for further analysis in SPSS.
if year<1990 {
bysort country year ID: egen sum080=sum(PY080g)
gen hydisp=(HY020+sum080)*HY025
}
else gen hydisp=HY020*HY025
I tried to solve the problem with the following SPSS code:
DO IF year<1990.
SORT CASES BY country year ID.
COMPUTE sum080 = SUM(PY080g).
COMPUTE hydisp=(HY020+sum080)*HY025.
ELSE.
COMPUTE hydisp=HY020*HY025.
END IF.
EXECUTE.
But this code appears to be wrong. Do you have any idea how to resolve the problem?
This particular use of egen in Stata can be replicated in SPSS by using the AGGREGATE command. Using Nick Cox's revised Stata code:
bysort country year ID: egen sum080 = sum(PY080g)
gen hydisp = (HY020 + sum080) * HY025 if year < 1990
replace hydisp = HY020 * HY025 if year >= 1990
An equivalent set of code in SPSS would be:
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK = country year ID
/sum080 = SUM(PY080g).
DO IF Year < 1990.
COMPUTE hydisp = (HY020+sum080)*HY025.
ELSE.
COMPUTE hydisp = HY020*HY025.
END IF.
This is in no sense an answer on SPSS code, but it makes a point that would not go well in a comment.
The Stata code
if year < 1990 {
bysort country year ID: egen sum080=sum(PY080g)
gen hydisp=(HY020+sum080)*HY025
}
else gen hydisp=HY020*HY025
would get interpreted as
if year[1] < 1990 {
bysort country year ID: egen sum080=sum(PY080g)
gen hydisp=(HY020+sum080)*HY025
}
else gen hydisp=HY020*HY025
i.e. the branching is on the value of year in the first observation (case, record). The if command and the if qualifier are quite different constructs. It seems much more likely that the code desired is something like
bysort country year ID: egen sum080 = sum(PY080g)
gen hydisp = (HY020 + sum080) * HY025 if year < 1990
replace hydisp = HY020 * HY025 if year >= 1990
or
bysort country year ID: egen sum080 = sum(PY080g)
gen hydisp = cond(year < 1990, (HY020 + sum080) * HY025, HY020 * HY025)
The OP's comment that the code appears to be wrong is a poor problem report. What is wrong precisely? It may be nothing more than inability to replicate the results gained in Stata, which would not be surprising as the Stata code is almost certainly not what is intended. It seems unlikely that the first observation is special, but rather that the calculation should be carried out for all observations according to the value of year.
Detail: sum() as an egen function is undocumented in favour of total(), but the syntax remains legal.
Detail: The Stata code here would not be considered a loop just because there is a tacit loop over observations.
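To make the difference between the two readings concrete, here is a small Python sketch on made-up rows (the data and the helper hydisp are hypothetical; only the variable names come from the question). One version decides once from the first row, as the if command does; the other tests year on every row, as the if qualifier or cond() does.

```python
from collections import defaultdict

rows = [
    {"country": "A", "year": 1995, "ID": 1, "PY080g": 5.0, "HY020": 100.0, "HY025": 1.0},
    {"country": "A", "year": 1985, "ID": 2, "PY080g": 3.0, "HY020": 200.0, "HY025": 1.0},
    {"country": "A", "year": 1985, "ID": 2, "PY080g": 4.0, "HY020": 200.0, "HY025": 1.0},
]

# Group totals, the analogue of: bysort country year ID: egen sum080 = total(PY080g)
sum080 = defaultdict(float)
for r in rows:
    sum080[(r["country"], r["year"], r["ID"])] += r["PY080g"]


def hydisp(r, add_sum):
    """Disposable income for one row, with or without the group total added."""
    extra = sum080[(r["country"], r["year"], r["ID"])] if add_sum else 0.0
    return (r["HY020"] + extra) * r["HY025"]


# "if command" reading: a single decision, taken from the first row only.
first_row_branch = [hydisp(r, rows[0]["year"] < 1990) for r in rows]

# "if qualifier" / cond() reading: the test is evaluated for every row.
per_row = [hydisp(r, r["year"] < 1990) for r in rows]

print(first_row_branch)  # [100.0, 200.0, 200.0]
print(per_row)           # [100.0, 207.0, 207.0]
```

Because the first row here has year 1995, the first version never adds the group total, even for the 1985 rows.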
Let's say I have daily data for a 30-year period in a matrix. To keep it simple, assume it has only 1 column and 10957 rows, one per day over the 30 years. The data start in 2010. I want to find the max value for every year, so that the output has 1 column and 30 rows. Is there an automated way to do this in MATLAB? Currently I'm doing it manually, like this:
%for the first year
max(RAINFALL(1:365));
.
.
%for the 30th year
max(RAINFALL(10593:10957));
It is exhausting to do this manually, and I have quite a few similar data sets. I used the code below to calculate the mean and standard deviation for the 30 years. I tried to modify the code for the task above, but I couldn't get it to work. I hope someone can modify the code or suggest a new approach.
data = rand(32872,100); % replace with your data matrix
[nDays,nData] = size(data);
% let MATLAB construct the vector of dates and worry about things like leap
% year.
dayFirst = datenum(2010,1,1);
dayStamp = dayFirst:(dayFirst + nDays - 1);
dayVec = datevec(dayStamp);
year = dayVec(:,1);
uniqueYear = unique(year);
K = length(uniqueYear);
a = nan(1,K);
b = nan(1,K);
for k = 1:K
% use logical indexing to pick out the year
currentYear = year == uniqueYear(k);
a(k) = mean2(data(currentYear,:));
b(k) = std2(data(currentYear,:));
end
One possible approach:
Create a column containing the year of each data value, using datenum and datevec to take care of leap years.
Find the maximum for each year, with accumarray.
Code:
%// Example data:
RAINFALL = rand(10957,1); %// one column
start_year = 2010; %// data starts on January 1st of this year
%// Computations:
[year, ~] = datevec(datenum(start_year,1,1) + (0:size(RAINFALL,1)-1)); %// step 1
result = accumarray(year.'-start_year+1, RAINFALL.', [], @max); %// step 2
As a bonus: if you replace @max in step 2 with either @mean or @std, guess what you get... much simpler than your code.
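For readers who want to see the grouping step spelled out, the same two steps (label each day with its year, then reduce per year) can be sketched in plain Python. This is an illustration of the idea, not a translation of accumarray; the function name yearly_max and the synthetic data are assumptions.

```python
from datetime import date, timedelta


def yearly_max(values, start_year):
    """Max per calendar year for daily values starting on 1 January of start_year."""
    first_day = date(start_year, 1, 1)
    result = {}
    for offset, value in enumerate(values):
        # The calendar handles leap years, as datenum/datevec do in MATLAB.
        year = (first_day + timedelta(days=offset)).year
        if year not in result or value > result[year]:
            result[year] = value
    return result


rainfall = [float(i % 400) for i in range(10957)]   # synthetic stand-in data
maxima = yearly_max(rainfall, 2010)
print(len(maxima), min(maxima), max(maxima))  # 30 2010 2039
```

10957 days starting on 1 January 2010 cover exactly the 30 years 2010 to 2039, seven of which are leap years.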
This may help you:
RAINFALL = rand(1,10957); % - Your data here
firstYear = 2010;
numberOfYears = 30; % - 10957 days from 1 Jan 2010 cover 30 years
cum = 0; % - cumulative number of days already processed
yearlyData = zeros(1,numberOfYears); % - preallocation; not strictly necessary
for i = 1 : numberOfYears
yearLength = datenum(firstYear+i,1,1) - datenum(firstYear + i - 1,1,1);
yearlyData(i) = max(RAINFALL(1 + cum : yearLength + cum));
cum = cum + yearLength;
end
I have coded this partly, but I am not sure it is right, since what I get is only partial data.
I have a 4D matrix with dimensions xV(6,24,63,15), meaning xV(min,hour,day,customer). The data is collected every 10 minutes for 63 days for 15 customers.
That is why the first dimension has 6 rows, one per 10-minute interval.
What I want is to collect the data for, say, every Monday and use it for a plot.
That is, there are 63/7 = 9 Mondays, each with 24 hours, and each hour has 6 readings (one every 10 minutes). For each 10-minute slot of each hour, I want the values across all Mondays in a new matrix, so I can take the mean and plot it.
Is this possible?
I have come this far, but no luck:
n = 0;
m = 0;
while(n<24)
n = n + 1;
while(m<6)
m = m + 1;
Va(:,m) = x(m,n,1:63,1); %(min,hour,day,line)
Vb(:,m) = x(m,n,1:63,1);
Vc(:,m) = x(m,n,1:63,1);
end
end
the file: xV.mat
thanks again for help
firstMonday = 1; %// index of first Monday. 1 if first day is a Monday
result = xV(:,:,firstMonday:7:end,:);
This gives a 6x24x9x15 matrix containing only Mondays. To average over all Mondays, use
squeeze(mean(result,3)) %// mean along 3rd dim. Size is 6x24x15
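The same "take every 7th day, then average over the selected days" idea can be sketched in plain Python for a single customer. Note two assumptions that differ from the MATLAB answer: indices are 0-based, and the nested list is laid out as days[d][h][m] rather than xV(min,hour,day,customer). The function name weekday_mean is made up for illustration.

```python
def weekday_mean(days, first_index=0, period=7):
    """Elementwise mean over every `period`-th day, starting at first_index.

    days[d][h][m] holds the reading for day d, hour h, 10-minute slot m.
    """
    picked = days[first_index::period]
    n_hours, n_slots = len(picked[0]), len(picked[0][0])
    return [
        [sum(day[h][m] for day in picked) / len(picked) for m in range(n_slots)]
        for h in range(n_hours)
    ]


# Toy data for one customer: 63 days x 24 hours x 6 slots, value = day number.
days = [[[float(d)] * 6 for _ in range(24)] for d in range(63)]
profile = weekday_mean(days)
print(len(days[0::7]), len(profile), len(profile[0]), profile[0][0])
# 9 24 6 28.0
```

The selected days are 0, 7, ..., 56, so with this toy data every entry of the 24-by-6 profile equals their mean, 28.0.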