Forecasting in gretl

Consider the following gretl script (hansl):
open bjg.gdt
arima 1 1 0 ; 2 1 0 ; g
series fitted = $yhat
g1 <- gnuplot g fitted --with-lines --time-series --output=display
What I want to do next is to make a forecast for, let's say, 24 steps ahead, that is, from Jan 1961 to Dec 1962. I believe the fifth line should be something like
fcast [options] --plot=display
Which options should I use here? I have tried several combinations, but none was successful.

After further experimentation, here is the solution:
open bjg.gdt
arima 1 1 0 ; 2 1 0 ; g
series fitted = $yhat
g1 <- gnuplot g fitted --with-lines --time-series --output=display
dataset addobs 24
g2 <- fcast --dynamic --out-of-sample --plot=display


Array Dimension

I am currently learning to use Keras in R. When I run the command
dim(mnist$train$x)
I get the following output (output 1):
[1] 60000 28 28
which means that there are 60000 matrices, each of dimension 28*28.
Now, when I create an array through R code with the same dimensions, I use
test <- array(28*28*60000, dim=c(28,28,60000))
where the inner layers are specified first, and upon using the statement dim(test) I get this output (output 2)
[1] 28 28 60000
Both notations represent the same array in a different format. Is it possible to get the output in the second case in the output 1 format?
Do you just want output 2 in output 1 format? You mean something like this:
test <- array(28*28*60000, dim=c(28,28,60000))
d <- dim(test)
newdim <- c(d[length(d)], d[-length(d)])
newdim will be 60000 28 28
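Note that newdim only reorders the dimension vector; if you also want the data rearranged so that the trial index comes first, aperm() permutes the array itself. A minimal sketch (using a small third dimension so it runs quickly):
# permute the array so the last dimension becomes the first,
# matching the 60000 x 28 x 28 layout reported by Keras
test <- array(seq_len(28*28*5), dim = c(28, 28, 5))  # small stand-in
d <- length(dim(test))
reordered <- aperm(test, c(d, seq_len(d - 1)))
dim(reordered)
# [1] 5 28 28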

How to get the best subset for a multinomial regression in R?

I am a new R user and I'm fitting a multinomial regression (i.e. a logistic regression whose response variable has more than 2 classes) with the function vglm in R. My dataset has 11 continuous predictors and 1 categorical response variable with 3 classes.
I want to find the best subset of predictors for my regression, but I don't know how to do it. Is there a function for this, or must I do it manually? The functions for linear models don't seem suitable.
I have tried the bestglm function, but its results don't seem appropriate for a multinomial regression.
I have also tried a shrinkage method, glmnet, which fits the lasso. It keeps all the variables in the model, yet the multinomial regression using vglm reports some of those variables as insignificant.
I've searched a lot on the Internet, including this website, but haven't found a good answer, so I'm asking here because I really need help with this.
Thanks
There are a few basic steps involved to get what you want:
define the model grid of all potential predictor combinations
run the model on all potential combinations of predictors
use a criterion (or a set of multiple criteria) to select the best subset of predictors
The model grid can be defined with the following function:
# define model grid for best subset regression
# defines which predictors are on/off; all combinations presented
model.grid <- function(n){
  n.list <- rep(list(0:1), n)
  expand.grid(n.list)
}
For example, with 4 variables we get 2^n = 16 combinations. A value of 1 indicates the model predictor is on and a value of zero indicates the predictor is off:
model.grid(4)
Var1 Var2 Var3 Var4
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 1 1 0 0
5 0 0 1 0
6 1 0 1 0
7 0 1 1 0
8 1 1 1 0
9 0 0 0 1
10 1 0 0 1
11 0 1 0 1
12 1 1 0 1
13 0 0 1 1
14 1 0 1 1
15 0 1 1 1
16 1 1 1 1
I provide another function below that will run all model combinations. It will also create a sorted dataframe table that ranks the different model fits using 5 criteria. The predictor combo at the top of the table is the "best" subset given the training data and the predictors supplied:
# function for best subset regression
# ranks predictor combos using 5 selection criteria
best.subset <- function(y, x.vars, data){
  # y: character string naming the dependent variable
  # x.vars: character vector with the names of the predictors
  # data: training data containing observations of y and x.vars
  require(dplyr)
  require(purrr)
  require(magrittr)
  require(forecast)

  length(x.vars) %>%
    model.grid %>%
    apply(1, function(x) which(x > 0, arr.ind = TRUE)) %>%
    map(function(x) x.vars[x]) %>%
    .[2:dim(model.grid(length(x.vars)))[1]] %>%
    map(function(x) tslm(paste0(y, " ~ ", paste(x, collapse = "+")), data = data)) %>%
    map(function(x) CV(x)) %>%
    do.call(rbind, .) %>%
    cbind(model.grid(length(x.vars))[-1, ], .) %>%
    arrange(., AICc)
}
You'll see the tslm() function is specified; others could be used, such as vglm(), etc. Simply swap in the model function you want.
The function requires 4 installed packages. It simply configures the data and uses the map() function to iterate across all model combinations (i.e. no for loop). The forecast package then supplies the cross-validation function CV(), which provides the 5 metrics or selection criteria used to rank the predictor subsets.
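Since the question concerns a multinomial response, here is a minimal sketch of the same idea with nnet::multinom() swapped in, ranking subsets by AIC alone (forecast::CV() only applies to linear models, so the 5-criteria table is not available). best.subset.multinom is a hypothetical name; this is an illustration, not tested code from the answer above:
library(nnet)
library(purrr)
library(dplyr)

best.subset.multinom <- function(y, x.vars, data) {
  grid <- model.grid(length(x.vars))[-1, , drop = FALSE]  # drop the all-off model
  fits <- apply(grid, 1, function(row) {
    rhs <- paste(x.vars[row > 0], collapse = " + ")
    # fit a multinomial logit for this predictor subset
    multinom(as.formula(paste(y, "~", rhs)), data = data, trace = FALSE)
  })
  # attach the AIC of each fit to its on/off row and sort, best first
  cbind(grid, AIC = map_dbl(fits, AIC)) %>% arrange(AIC)
}
A call such as best.subset.multinom("response", predictors, mydata) then returns the on/off grid sorted by AIC.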
Here is an application example lifted from the book "Forecasting: Principles and Practice". The example also uses data from the book, which is found in the fpp2 package.
library(fpp2)
# test the function
y <- "Consumption"
x.vars <- c("Income", "Production", "Unemployment", "Savings")
best.subset(y, x.vars, uschange)
The resulting table, which is sorted on the AICc metric, is shown below. The best subset minimizes the value of the metrics (CV, AIC, AICc, and BIC), maximizes adjusted R-squared and is found at the top of the list:
Var1 Var2 Var3 Var4 CV AIC AICc BIC AdjR2
1 1 1 1 1 0.1163 -409.3 -408.8 -389.9 0.74859
2 1 0 1 1 0.1160 -408.1 -407.8 -391.9 0.74564
3 1 1 0 1 0.1179 -407.5 -407.1 -391.3 0.74478
4 1 0 0 1 0.1287 -388.7 -388.5 -375.8 0.71640
5 1 1 1 0 0.2777 -243.2 -242.8 -227.0 0.38554
6 1 0 1 0 0.2831 -237.9 -237.7 -225.0 0.36477
7 1 1 0 0 0.2886 -236.1 -235.9 -223.2 0.35862
8 0 1 1 1 0.2927 -234.4 -234.0 -218.2 0.35597
9 0 1 0 1 0.3002 -228.9 -228.7 -216.0 0.33350
10 0 1 1 0 0.3028 -226.3 -226.1 -213.4 0.32401
11 0 0 1 1 0.3058 -224.6 -224.4 -211.7 0.31775
12 0 1 0 0 0.3137 -219.6 -219.5 -209.9 0.29576
13 0 0 1 0 0.3138 -217.7 -217.5 -208.0 0.28838
14 1 0 0 0 0.3722 -185.4 -185.3 -175.7 0.15448
15 0 0 0 1 0.4138 -164.1 -164.0 -154.4 0.05246
Only 15 predictor combinations are profiled in the output, since the combination with all predictors off has been dropped. Looking at the table, the best subset is the one with all predictors on. However, the second row uses only 3 of the 4 variables and its performance is roughly the same. Also note that after row 4 the model results begin to degrade. That's because Income and Savings appear to be the key drivers of Consumption: as these two variables are dropped from the predictors, model performance falls off significantly.
The custom function appears to be reliable, since the results presented here match those of the referenced book.
A good day to you.

Nested loop with increasing number of elements

I have a longitudinal dataset of 18 time periods. For reasons not to be discussed here, this dataset is in the wide shape, not the long one. More precisely, time-varying variables carry an alphabetic prefix which identifies the time period they belong to. For the sake of this question, consider a quantity of interest called pay. This variable is denoted apay in the first period, bpay in the second, and so on, up to rpay.
Importantly, different observations have missing values in this variable in different periods, in an unpredictable way. In consequence, running a panel over the full number of periods will reduce my number of observations considerably. Hence, I would like to know precisely how many observations a panel of a given length will have. To evaluate this, I want to create variables that, for each period and each number of consecutive periods, count how many respondents have the variable observed over that time sequence. For example, I want the variable b_count_2 to count how many observations have nonmissing pay in both the first period and the second. This can be achieved with something like this:
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
Now, since I want to do this automatically, this has to be in a loop. Moreover, there are different numbers of sequences for each period. For example, for the third period there are two sequences (those with pay in periods 2 and 3, and those with pay in periods 1, 2 and 3). Thus, the number of variables to create is 1+2+3+4+...+17 = 153. This variability has to be reflected in the loop. I propose code below, but there are bits that are wrong, or of which I'm unsure, as highlighted in the comments.
local list b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local counter = 1 // counter to update; reflects sequence length
while `counter' > 0 { // loop over sequence lengths
gen _`var'_counter_`counter' = 0 // generate variable with counter
if `var'pay != . { // HERE IS PROBLEM 1. NEED TO MAKE THIS TO CHECK CONDITIONS WITH INCREASING NUMBER OF ELEMENTS
recode _`var'_counter_`counter' (0 = 1) // IM NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
local counter = `counter' - 1 // update counter to look for a longer sequence in the next iteration
}
}
local counter = `counter' + 1 // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
}
An example of the result of the above code (if right) is the following. Consider a dataset of five observations, for four periods (denoted a, b, c, and d):
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
where 1 means value is observed in that period, and . is not. The objective of the code is to create 1+2+3=6 new variables such that the new dataset is:
Obs a b c d b_count_2 c_count_2 c_count_3 d_count_2 d_count_3 d_count_4
1 1 1 . 1 1 0 0 0 0 0
2 1 1 . . 1 0 0 0 0 0
3 . . 1 1 0 0 0 1 0 0
4 . 1 1 . 0 1 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1
Now, why is this helpful? Well, because now I can run a set of summarize commands to get a very nice description of the dataset. The code to print this information in one go would be something like this:
local list a b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local list `var'_counter_* // group of sequence variables for each period
foreach var2 of local list { // loop over each element of the list
quietly sum `var'_counter_`var2' if `var'_counter_`var2' == 1 // sum the number of individuals with value = 1 with sequence of length var2 in period var
di as text "Wave `var' has a sequence of length `var2' with " as result r(N) as text " observations." // print result
}
}
For the above example, this produces the following output:
"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."
This gives me a nice summary of the trade-offs I'm having between a wider panel and a longer panel.
If you insist on doing this with data in wide form, it is very inefficient to create extra variables just to count patterns of missing values. You can create a single string variable that contains the pattern for each observation. Then, it's just a matter of extracting from this pattern variable what you are looking for (i.e. patterns of consecutive periods up to the current wave). You can then loop over lengths of the matching patterns and do counts. Something like:
* create some fake data
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}
* build the pattern of missing data
gen pattern = ""
foreach pre in a b c d e f g {
    qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}
list
qui foreach pre in b c d e f g {
    noi dis "{hline 80}" _n as res "Wave `pre'"
    // the longest substring without a space up to the wave
    gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
    noi tab temp
    // loop over the various substring lengths, from 2 to max length
    gen len = length(temp)
    sum len, meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        count if length(temp) >= `i'
        noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
    drop temp len
}
If you are open to working in long form, then here is how you would identify spells with contiguous data and how to loop to get the info you want (the data setup is exactly the same as above):
* create some fake data in wide form
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}
* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string
* identify spells of contiguous periods
egen wavegroup = group(wave), label
tsset id wavegroup
tsspell, cond(pay < .)
drop if mi(pay)
foreach pre in b c d e f g {
    dis "{hline 80}" _n as res "Wave `pre'"
    sum _seq if wave == "`pre'", meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        qui count if _seq >= `i' & wave == "`pre'"
        dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
}
I echo @Dimitriy V. Masterov in genuine puzzlement that you are using this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as yours, working with it in Stata is at best awkward and at worst impracticable.
First, note specifically that
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
will only ever be evaluated in terms of the first observation, i.e. as if you had coded
if apay[1] != . & bpay[1] != .
This is documented here. Even if it is what you want, it is not usually a pattern for others to follow.
Second, and more generally, I haven't tried to understand all of the details of your code, as what I see is the creation of a vast number of variables even for tiny datasets as in your sketch. For a series T periods long, you would create a triangular number [(T - 1)T]/2 of new variables; in your example (17 x 18)/2 = 153. If someone had series 100 periods long, they would need 4950 new variables.
Note that because of the first point just made, these new variables would pertain with your strategy only to individual variables like pay and individual panels. Presumably that limitation to individual panels could be fixed, but the main idea seems singularly ill-advised in many ways. In a nutshell, what strategy do you have to work with these hundreds or thousands of new variables except writing yet more nested loops?
Your main need seems to be to identify spells of non-missing and missing values. There is easy machinery for this long since developed. General principles are discussed in this paper and an implementation is downloadable from SSC as tsspell.
On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That's entirely equivalent to the long-standing requests here for an MCVE.
Despite all that advice, I would start by looking at the Stata command xtdescribe and associated xt tools already available to you. These tools do require a long data shape, which reshape will provide for you.
Let me add another answer based on the example now added to the question.
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
The aim of this answer is not to provide what the OP asks but to indicate how many simple tools are available to look at patterns of non-missing and missing values, none of which entail the creation of large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of those tools require a reshape long.
. clear
. input a b c d
a b c d
1. 1 1 . 1
2. 1 1 . .
3. . . 1 1
4. . 1 1 .
5. 1 1 1 1
6. end
. rename (a b c d) (y1 y2 y3 y4)
. gen id = _n
. reshape long y, i(id) j(time)
(note: j = 1 2 3 4)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 5 -> 20
Number of variables 5 -> 3
j variable (4 values) -> time
xij variables:
y1 y2 ... y4 -> y
-----------------------------------------------------------------------------
. xtset id time
panel variable: id (strongly balanced)
time variable: time, 1 to 4
delta: 1 unit
. preserve
. drop if missing(y)
(7 observations deleted)
. xtdescribe
id: 1, 2, ..., 5 n = 5
time: 1, 2, ..., 4 T = 4
Delta(time) = 1 unit
Span(time) = 4 periods
(id*time uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
2 2 2 2 3 4 4
Freq. Percent Cum. | Pattern
---------------------------+---------
1 20.00 20.00 | ..11
1 20.00 40.00 | .11.
1 20.00 60.00 | 11..
1 20.00 80.00 | 11.1
1 20.00 100.00 | 1111
---------------------------+---------
5 100.00 | XXXX
* ssc inst xtpatternvar
. xtpatternvar, gen(pattern)
* ssc inst groups
. groups pattern
+------------------------------------+
| pattern Freq. Percent % <= |
|------------------------------------|
| ..11 2 15.38 15.38 |
| .11. 2 15.38 30.77 |
| 11.. 2 15.38 46.15 |
| 11.1 3 23.08 69.23 |
| 1111 4 30.77 100.00 |
+------------------------------------+
. restore
. egen nmissing = total(missing(y)), by(time)
. tabdisp time, c(nmissing)
----------------------
time | nmissing
----------+-----------
1 | 2
2 | 1
3 | 2
4 | 2
----------------------

Calculating mean over an array of lists in R

I have an array built to accept the outputs of a modelling package:
M <- array(list(NULL), c(trials,3))
where trials is a number that will generate circa 50 sets of data.
From a sampling loop, I am inserting a specific aspect of the outputs. The output from the modelling package looks a little like this:
Mt$effects
c_name effect Other
1 DPC_I 0.0818277549 0
2 DPR_I 0.0150814475 0
3 DPA_I 0.0405341027 0
4 DR_I 0.1255416311 0
5 (etc.)
And I am inserting it into my array via a loop:
for (x in 1:trials) {
  Mt <- run_model(params)
  M[[x,3]] <- Mt$effects
}
The object now looks as follows
M[,3]
[[1]]
c_name effect Other
1 DPC_I 0.0818277549 0
2 DPR_I 0.0150814475 0
3 DPA_I 0.0405341027 0
4 DR_I 0.1255416311 0
5 (etc.)
[[2]]
c_name effect Other
1 DPC_I 0.0717384637 0
2 DPR_I 0.0190812375 0
3 DPA_I 0.0856456427 0
4 DR_I 0.2330002551 0
5 (etc.)
[[3]]
And so on (up to 50 elements).
What I want to do is calculate an average (and sd) of effect, grouped by each c_name, across these 50 trial runs, but I'm unable to extract the data into a single dataframe (for example) so that I can run a ddply summarise across them.
I have tried various combinations of rbind, cbind and unlist, but I just can't understand how to correctly lift this data out of the sequential elements. I note also that any reference to .names results in NULL.
Any solution would be most appreciated!
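For what it's worth, a minimal sketch of one approach, assuming each M[[x, 3]] holds a data frame with columns c_name and effect as shown above:
library(dplyr)
# stack the per-trial data frames stored in column 3 of M,
# tagging each with its trial index
combined <- bind_rows(M[, 3], .id = "trial")
# mean and sd of effect for each c_name across the trials
result <- combined %>%
  group_by(c_name) %>%
  summarise(mean_effect = mean(effect), sd_effect = sd(effect))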

J and L-systems

I'm going to create a program that can generate strings from L-system grammars.
Aristid Lindenmayer's original L-system for modelling the growth of algae is:
variables : A B
constants : none
axiom : A
rules : (A → AB), (B → A)
which produces:
iteration | resulting model
0 | A
1 | AB
2 | ABA
3 | ABAAB
4 | ABAABABA
5 | ABAABABAABAAB
which I naively implemented in J like this:
algae =: 1&algae : (([: ; (('AB'"0)`('A'"0) #. ('AB' i. ]))&.>"0)^:[) "1 0 1
(i.6) ([;algae)"1 0 1 'A'
┌─┬─────────────┐
│0│A │
├─┼─────────────┤
│1│AB │
├─┼─────────────┤
│2│ABA │
├─┼─────────────┤
│3│ABAAB │
├─┼─────────────┤
│4│ABAABABA │
├─┼─────────────┤
│5│ABAABABAABAAB│
└─┴─────────────┘
Step-by-step illustration:
('AB' i. ]) 'ABAAB' NB. determine indices of productions for each variable
0 1 0 0 1
'AB'"0`('A'"0)#.('AB' i. ])"0 'ABAAB' NB. apply corresponding productions
AB
A
AB
AB
A
'AB'"0`('A'"0)#.('AB' i. ])&.>"0 'ABAAB' NB. the same &.> to avoid filling
┌──┬─┬──┬──┬─┐
│AB│A│AB│AB│A│
└──┴─┴──┴──┴─┘
NB. finally ; and use ^: to iterate
By analogy, here is the result of the 4th iteration of the L-system that generates the Thue–Morse sequence:
4 (([: ; (0 1"0)`(1 0"0)#.(0 1 i. ])&.>"0)^:[) 0
0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0
That is the best I can do so far. I believe the boxing–unboxing method is inefficient here. This is the first time I've missed linked lists in J: it's much harder to code grammars without them.
What I'm really thinking about is:
a) constructing a list of gerunds of the functions that build the final string (in my examples those functions are constants like 'AB'"0, but for tree modelling they would be turtle-graphics commands) and evoking (`:6) it,
or something that I am able to code:
b) constructing a string containing a legal J sentence that builds the final string and executing (".) it.
But I'm not sure whether such programs would be efficient.
Can you show me a better approach, please?
Any hints, as well as comments about a) and b), are highly appreciated!
The following will pad the rectangular array with spaces:
L=: rplc&('A';'AB';'B';'A')
L^:(<6) 'A'
A
AB
ABA
ABAAB
ABAABABA
ABAABABAABAAB
Or if you don't want padding:
L&.>^:(<6) <'A'
┌─┬──┬───┬─────┬────────┬─────────────┐
│A│AB│ABA│ABAAB│ABAABABA│ABAABABAABAAB│
└─┴──┴───┴─────┴────────┴─────────────┘
Obviously you'll want to inspect rplc / stringreplace to see what is happening under the covers.
You can use complex values in the left argument of # to expand an array without boxing.
For this particular L-system, I'd probably skip the gerunds and use a temporary substitution:
to =: 2 : 'n (I.m=y) } y' NB. replace n with m in y
ins =: 2 : '(1 j. m=y) #!.n y' NB. insert n after each m in y
L =: [: 'c'to'A' [: 'A'ins'B' [: 'B'to'c' ]
Then:
L^:(<6) 'A'
A
AB
ABA
ABAAB
ABAABABA
ABAABABAABAAB
Here's a more general approach that simplifies the code by using numbers and a gerund composed of constant functions:
'x'-.~"1 'xAB'{~([:,(0:`(1:,2:)`1:)#.]"0)^:(<6) 1
A
AB
ABA
ABAAB
ABAABABA
ABAABABAABAAB
The A's and B's are filled in at the end for display. There's no boxing here because I use 0 as a null value. The nulls get scattered around quite a bit, but the -.~"1 removes them. It does pad all the resulting strings with nulls on the right. If you don't want that, you can use <#-.~"1 to box the results instead:
'x'<#-.~"1 'xAB'{~([:,(0:`(1:,2:)`1:)#.]"0)^:(<6) 1
┌─┬──┬───┬─────┬────────┬─────────────┐
│A│AB│ABA│ABAAB│ABAABABA│ABAABABAABAAB│
└─┴──┴───┴─────┴────────┴─────────────┘
