How to get the best subset for a multinomial regression in R?

I am a new R user fitting a multinomial regression (i.e. logistic regression with a response variable that has more than 2 classes) with the function vglm in R. My dataset has 11 continuous predictors and 1 categorical response variable with 3 classes.
I want to find the best subset of predictors for my regression, but I don't know how to do it. Is there a function for this, or must I do it manually? The functions for linear models don't seem suitable.
I have tried the bestglm function, but its results don't seem suitable for a multinomial regression.
I have also tried a shrinkage method, glmnet, which is related to the lasso. It keeps all the variables in the model; on the other hand, the multinomial regression fit with vglm reports some variables as insignificant.
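(For reference, a multinomial lasso fit with glmnet looks roughly like the sketch below; x and y are hypothetical stand-ins for my predictor matrix and 3-class response.)
library(glmnet)
# hypothetical stand-ins: x is the numeric matrix of the 11 predictors,
# y is the 3-class factor response
cv.fit <- cv.glmnet(x, y, family = "multinomial", type.measure = "class")
# coefficients at the most regularized lambda within 1 SE of the
# minimum cross-validated misclassification error
coef(cv.fit, s = "lambda.1se")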
I've searched a lot on the Internet, including this website, but haven't found a good answer, so I'm asking here because I really need help with this.
Thanks

There are a few basic steps involved in getting what you want:
define the model grid of all potential predictor combinations
run a model for every combination of predictors
use a criterion (or a set of criteria) to select the best subset of predictors
The model grid can be defined with the following function:
# define model grid for best subset regression
# defines which predictors are on/off; all combinations presented
model.grid <- function(n) {
  n.list <- rep(list(0:1), n)
  expand.grid(n.list)
}
For example, with 4 variables we get 2^n = 16 combinations. A value of 1 indicates the model predictor is on and a value of 0 indicates the predictor is off:
model.grid(4)
Var1 Var2 Var3 Var4
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 1 1 0 0
5 0 0 1 0
6 1 0 1 0
7 0 1 1 0
8 1 1 1 0
9 0 0 0 1
10 1 0 0 1
11 0 1 0 1
12 1 1 0 1
13 0 0 1 1
14 1 0 1 1
15 0 1 1 1
16 1 1 1 1
I provide another function below that will run all model combinations. It also creates a sorted data frame that ranks the different model fits using 5 criteria. The predictor combination at the top of the table is the "best" subset given the training data and the predictors supplied:
# function for best subset regression
# ranks predictor combos using 5 selection criteria
best.subset <- function(y, x.vars, data){
  # y      character string naming the dependent variable
  # x.vars character vector with the names of the predictors
  # data   training data containing the y and x.vars observations
  require(dplyr)
  require(purrr)
  require(magrittr)
  require(forecast)
  length(x.vars) %>%
    model.grid %>%
    apply(1, function(x) which(x > 0, arr.ind = TRUE)) %>%
    map(function(x) x.vars[x]) %>%
    .[2:dim(model.grid(length(x.vars)))[1]] %>%
    map(function(x) tslm(paste0(y, " ~ ", paste(x, collapse = "+")), data = data)) %>%
    map(function(x) CV(x)) %>%
    do.call(rbind, .) %>%
    cbind(model.grid(length(x.vars))[-1, ], .) %>%
    arrange(., AICc)
}
You'll see that the tslm() function is specified; others, such as vglm(), could be used instead. Simply swap in the model function you want; a sketch of that swap follows below.
The function requires 4 installed packages. It simply configures the data and uses the map() function to iterate across all model combinations (i.e. no for loop). The forecast package supplies the cross-validation function CV(), which computes the 5 metrics or selection criteria used to rank the predictor subsets.
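For the original multinomial question, a minimal sketch of that swap (not part of the function above; it assumes model.grid() as defined earlier, a hypothetical data frame dat whose response is a 3-class factor, and it ranks by AIC alone rather than the 5 criteria) could look like this:
library(VGAM)
# rank every predictor subset of a multinomial model by AIC
# y: response name; x.vars: predictor names; dat: training data (assumed names)
rank.multinom <- function(y, x.vars, dat) {
  grid <- model.grid(length(x.vars))[-1, , drop = FALSE]  # drop the empty model
  aics <- apply(grid, 1, function(row) {
    f <- as.formula(paste(y, "~", paste(x.vars[row == 1], collapse = "+")))
    AIC(vglm(f, family = multinomial, data = dat))
  })
  cbind(grid, AIC = aics)[order(aics), ]
}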
Here is an application example lifted from the book "Forecasting: Principles and Practice." The example also uses data from the book, found in the fpp2 package.
library(fpp2)
# test the function
y <- "Consumption"
x.vars <- c("Income", "Production", "Unemployment", "Savings")
best.subset(y, x.vars, uschange)
The resulting table, sorted on the AICc metric, is shown below. The best subset minimizes the value of the metrics (CV, AIC, AICc, and BIC), maximizes adjusted R-squared, and is found at the top of the list:
Var1 Var2 Var3 Var4 CV AIC AICc BIC AdjR2
1 1 1 1 1 0.1163 -409.3 -408.8 -389.9 0.74859
2 1 0 1 1 0.1160 -408.1 -407.8 -391.9 0.74564
3 1 1 0 1 0.1179 -407.5 -407.1 -391.3 0.74478
4 1 0 0 1 0.1287 -388.7 -388.5 -375.8 0.71640
5 1 1 1 0 0.2777 -243.2 -242.8 -227.0 0.38554
6 1 0 1 0 0.2831 -237.9 -237.7 -225.0 0.36477
7 1 1 0 0 0.2886 -236.1 -235.9 -223.2 0.35862
8 0 1 1 1 0.2927 -234.4 -234.0 -218.2 0.35597
9 0 1 0 1 0.3002 -228.9 -228.7 -216.0 0.33350
10 0 1 1 0 0.3028 -226.3 -226.1 -213.4 0.32401
11 0 0 1 1 0.3058 -224.6 -224.4 -211.7 0.31775
12 0 1 0 0 0.3137 -219.6 -219.5 -209.9 0.29576
13 0 0 1 0 0.3138 -217.7 -217.5 -208.0 0.28838
14 1 0 0 0 0.3722 -185.4 -185.3 -175.7 0.15448
15 0 0 0 1 0.4138 -164.1 -164.0 -154.4 0.05246
Only 15 predictor combinations are profiled in the output, since the combination with all predictors off has been dropped. Looking at the table, the best subset is the one with all predictors on. However, the second row uses only 3 of the 4 variables and performs roughly the same. Also note that after row 4 the results begin to degrade. That's because Income and Savings appear to be the key drivers of Consumption; as these two variables are dropped from the predictors, model performance drops significantly.
The performance of the custom function is solid, since the results presented here match those of the referenced book.
A good day to you.

Related

Matlab One Hot Encoding - convert column with categoricals into several columns of logicals

CONTEXT
I have a large number of columns with categoricals, all with different, unrankable choices. To make my life easier for analysis, I'd like to take each of them and convert it to several columns with logicals. For example:
1 GENRE
2 Pop
3 Classical
4 Jazz
...would turn into...
1 Pop Classical Jazz
2 1 0 0
3 0 1 0
4 0 0 1
PROBLEM
I've tried using ind2vec, but it only works with numeric or logical data. I've also come across this, but am not sure it works with categoricals. What is the right function to use in this case?
If you want to convert from a categorical vector to a logical array, you can use the unique function to generate column indices, then perform your encoding using any of the options from this related question:
% Sample data:
data = categorical({'Pop'; 'Classical'; 'Jazz'; 'Pop'; 'Pop'; 'Jazz'});
% Get unique categories and create indices:
[genre, ~, index] = unique(data)
genre =
Classical
Jazz
Pop
index =
3
1
2
3
3
2
% Create logical matrix:
mat = logical(accumarray([(1:numel(index)).' index], 1))
mat =
6×3 logical array
0 0 1
1 0 0
0 1 0
0 0 1
0 0 1
0 1 0
ind2vec does work with cell arrays of strings, and you can call the cellstr function to get one.
This code may help (I only changed a little from this):
data = categorical({'Pop'; 'Classical'; 'Jazz';});
GENRE = cellstr(data); % change categorical data into cell strings
[~, loc] = ismember(GENRE, unique(GENRE)); % index of each genre into the sorted unique categories
genre = ind2vec(loc')'; % one-hot encode the indices (sparse matrix)
Gen = full(genre); % convert the sparse matrix to a full one
array2table(Gen, 'VariableNames', unique(GENRE))
Running this code returns:
ans =
Classical Jazz Pop
_________ ____ ___
0 0 1
1 0 0
0 1 0
You can call unique(GENRE) to check the categories (as cell strings). Meanwhile, logical(Gen) (or logical(full(genre))) contains the columns of logicals that you need.
P.S. The categorical type might be faster than cell strings, but the ind2vec function doesn't work with it; unique and accumarray might be better.

Calculating mean over an array of lists in R

I have an array built to accept the outputs of a modelling package:
M <- array(list(NULL), c(trials,3))
where trials is a number that will generate around 50 sets of data.
From a sampling loop, I am inserting a specific aspect of the outputs. The output from the modelling package looks a little like this:
Mt$effects
c_name effect Other
1 DPC_I 0.0818277549 0
2 DPR_I 0.0150814475 0
3 DPA_I 0.0405341027 0
4 DR_I 0.1255416311 0
5 (etc.)
And I am inserting it into my array via a loop
for (x in 1:trials) {
  Mt <- run_model(params)
  M[[x, 3]] <- Mt$effects
}
The object now looks as follows
M[,3]
[[1]]
c_name effect Other
1 DPC_I 0.0818277549 0
2 DPR_I 0.0150814475 0
3 DPA_I 0.0405341027 0
4 DR_I 0.1255416311 0
5 (etc.)
[[2]]
c_name effect Other
1 DPC_I 0.0717384637 0
2 DPR_I 0.0190812375 0
3 DPA_I 0.0856456427 0
4 DR_I 0.2330002551 0
5 (etc.)
[[3]]
And so on (up to 50 elements).
What I want to do is calculate an average (and sd) of effect, grouped by c_name, across each of these 50 trial runs, but I'm unable to extract the data into a single data frame (for example) so that I can run a ddply summarise across them.
I have tried various combinations of rbind, cbind, and unlist, but I just can't understand how to correctly lift this data out of the sequential elements. I note also that any reference to .names returns NULL.
Any solution would be most appreciated!
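One way to get there, as a minimal sketch in base R (this assumes each element of M[, 3] is a data frame with c_name and effect columns, as shown above):
# stack the ~50 effects data frames into one
results <- do.call(rbind, M[, 3])
# mean and sd of effect, grouped by c_name
aggregate(effect ~ c_name, data = results,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))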

Average of dynamic row range

I have a table of rows which consist of zeros and numbers like this:
A B C D E F G H I J K L M N
0 0 0 4 3 1 0 1 0 2 0 0 0 0
0 1 0 1 4 0 0 0 0 0 1 0 0 0
9 5 7 9 10 7 2 3 6 4 4 0 1 0
I want to calculate an average of the numbers, including zeros, but starting from the first nonzero value, and put it into a column after the table's end. E.g. in the first row the first nonzero value is 4, so the average is 11/11; for the second it is 7/13; for the last, 67/14.
How could I do this using Excel formulas? Probably OFFSET with a nested IF?
This still needs to be entered as an array formula (ctrl-shift-enter) but it isn't volatile:
=AVERAGE(INDEX(($A2:$O2),MATCH(TRUE,$A2:$O2<>0,0)):$O2)
or, depending on location:
=AVERAGE(INDEX(($A2:$O2);MATCH(TRUE;$A2:$O2<>0;0)):$O2)
The sum is the same no matter how many 0's you include, so all you need to worry about is what to divide it by, which you could determine using nested IFs, or take a cue from this: https://superuser.com/questions/671435/excel-formula-to-get-first-non-zero-value-in-row-and-return-column-header
Thank you, Scott Hunter, for the good reference.
I solved the problem using a huge formula, and I think it's a bit awkward. Here it is:
=AVERAGE(INDIRECT(CELL("address";INDEX(A2:O2;MATCH(TRUE;INDEX(A2:O2<>0;;);0)));TRUE):O2)

Which clustering technique should I use?

I have a data matrix, given below.
It is a user access matrix: each row represents a user, and each column represents a page category visited by that user.
0 8 1 0 0 8 0 0 0 0 0 0 0 11 2 2 0
1 0 7 0 0 0 0 0 1 1 0 0 0 0 0 0 1
1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0
6 1 0 0 0 2 6 0 0 0 0 1 0 0 0 0 0
5 3 2 0 2 0 0 0 0 0 1 0 0 0 1 0 0
2 3 0 1 0 1 0 0 0 0 0 1 0 3 0 0 0
9 0 1 1 0 0 5 0 0 0 1 2 0 0 0 0 0
5 1 4 0 0 0 1 0 0 2 0 0 0 9 0 0 0
5 5 0 2 0 1 0 0 0 0 1 1 0 0 0 0 0
1 2 0 0 2 3 3 0 0 1 1 0 0 0 4 0 0
0 1 0 1 0 2 0 0 1 0 0 0 0 2 0 0 0
5 4 0 0 1 0 0 0 0 0 1 0 0 2 0 0 0
0 0 0 2 0 0 2 12 1 0 0 0 2 0 0 0 0
6 1 0 0 0 0 58 15 7 0 1 0 0 0 0 0 0
1 0 2 0 0 1 1 0 0 0 2 0 0 0 0 0 0
I need to apply a biclustering technique to it.
This biclustering technique first generates user clusters, then page clusters, and then combines both to generate biclusters.
Now I am confused about which clustering technique I should use for this purpose.
The best technique will generate coherent biclusters from this matrix.
Here is a summary of several clustering algorithms that can help to answer the question
"which clustering technique i should use?"
There is no objectively "correct" clustering algorithm (Ref).
Clustering algorithms can be categorized based on their "cluster model". An algorithm designed for one kind of model will generally fail on a different kind. For example, k-means cannot find non-convex clusters; it can only find roughly circular clusters.
Therefore, understanding these "cluster models" is the key to choosing among the various clustering algorithms and methods. Typical cluster models include:
[1] Connectivity models: Builds models based on distance connectivity. Eg hierarchical clustering. Used when we need different partitioning based on tree cut height. R function: hclust in stats package.
[2] Centroid models: Builds models by representing each cluster by a single mean vector. Used when we need crisp partitioning (as opposed to fuzzy clustering described later). R function: kmeans in stats package.
[3] Distribution models: Builds models based on statistical distributions such as multivariate normal distributions used by the expectation-maximization algorithm. Used when cluster shapes can be arbitrary unlike k-means which assumes circular clusters. R function: emcluster in the emcluster package.
[4] Density models: Builds models based on clusters as connected dense regions in the data space. Eg DBSCAN and OPTICS. Used when cluster shapes can be arbitrary, unlike k-means, which assumes circular clusters. R function dbscan in package dbscan.
[5] Subspace models: Builds models based on both cluster members and relevant attributes. Eg biclustering (also known as co-clustering or two-mode-clustering). Used when simultaneous row and column clustering is needed. R function biclust in biclust package.
[6] Group models: Builds models based on the grouping information. Eg collaborative filtering (recommender algorithm). R function Recommender in recommenderlab package.
[7] Graph-based models: Builds models based on cliques. Community structure detection algorithms try to find dense subgraphs in directed or undirected graphs. Eg R function cluster_walktrap in igraph package.
[8] Kohonen Self-Organizing Feature Map: Builds models based on neural network. R function som in the kohonen package.
[9] Spectral Clustering: Builds models based on non-convex cluster structure, or when a measure of the center is not a suitable description of the complete cluster. R function specc in the kernlab package.
[10] Subspace clustering: For high-dimensional data, distance functions can be problematic, so the cluster models include the relevant attributes for each cluster. Eg the hddc function in the R package HDclassif.
[11] Sequence clustering: Groups sequences that are related. Eg the rBlast package.
[12] Affinity propagation: Builds models based on message passing between data points. It does not require the number of clusters to be determined before running the algorithm. It is better than k-means for certain computer vision and computational biology tasks, e.g. clustering pictures of human faces and identifying regulated transcripts (Ref). R package APCluster.
[13] Stream clustering: Builds models based on data that arrive continuously such as telephone records, financial transactions etc. Eg R package BIRCH [https://cran.r-project.org/src/contrib/Archive/birch/]
[14] Document clustering (or text clustering): Builds models based on SVD. It is used in topic extraction. Eg Carrot [http://search.carrot2.org] is an open source search results clustering engine which can cluster documents into thematic categories.
[15] Latent class model: It relates a set of observed multivariate variables to a set of latent variables. LCA may be used in collaborative filtering. R function Recommender in recommenderlab package has collaborative filtering functionality.
[16] Biclustering: Used to simultaneously cluster rows and columns of two-mode data. Eg R function biclust in package biclust (see the sketch after this list).
[17] Soft clustering (fuzzy clustering): Each object belongs to each cluster to a certain degree. Eg the Fclust function in the fclust package.
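Since the question asks about biclustering a user access matrix, a minimal sketch with the biclust package (not part of the original answer; the toy data and the minr/minc/number parameters are illustrative) might look like this:
library(biclust)
# toy user access matrix: rows = users, columns = page categories
set.seed(1)
m <- matrix(sample(0:10, 15 * 17, replace = TRUE), nrow = 15)
# binarize (visited vs. not visited), then run the Bimax algorithm to
# find blocks of users that visit the same page categories;
# minr/minc set the minimum bicluster size, number caps how many are kept
bc <- biclust(binarize(m, threshold = 0), method = BCBimax(),
              minr = 2, minc = 2, number = 10)
summary(bc)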
You cannot tell which clustering algorithm is best just by looking at the matrix. You must try different algorithms (maybe k-means, Bayes, nearest-neighbor, or whatever your library has). Perform cross-validation (clustering is just a type of categorization where you assign users to cluster centers) and evaluate the results. You could even plot them in a chart. Then make a decision. No decision will be perfect; you will always have errors. And the result depends on what you expect. Maybe a result with more errors will look better from your personal point of view.
Have you tried any algorithm yet?

3+ dimensional truth table in APL

I would like to enumerate all the combinations (tuples of values) of 3 or more finite-valued variables which satisfy a given condition. In math notation: all tuples (a, b, …) with a ∈ 1…NA, b ∈ 1…NB, … such that a given predicate P(a, b, …) holds.
For example (inspired by Project Euler problem 9), take a ← ⍳3, b ← ⍳4, c ← ⍳5 and the condition a ≤ b ≤ c.
The truth tables for two variables at a time are easy enough:
a ∘.≤ b
1 1 1 1
0 1 1 1
0 0 1 1
b ∘.≤ c
1 1 1 1 1
0 1 1 1 1
0 0 1 1 1
0 0 0 1 1
After much head-scratching, I managed to combine them, by computing the ∧ of every 4-valued row of the former with each 4-valued column of the latter, and disclosing (⊃) on the correct axis, between 1 and 2:
⎕← tt ← ⊃[1.5] (⊂[2] a ∘.≤ b) ∘.∧ (⊂[1] b ∘.≤ c)
1 1 1 1 1
0 1 1 1 1
0 0 1 1 1
0 0 0 1 1
0 0 0 0 0
0 1 1 1 1
0 0 1 1 1
0 0 0 1 1
0 0 0 0 0
0 0 0 0 0
0 0 1 1 1
0 0 0 1 1
Then I could use its ravel to filter all possible tuples of values:
⊃ (,tt) / , a ∘., b ∘., c
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
1 2 2
1 2 3
...
3 3 5
3 4 4
3 4 5
Is this the best approach to this particular class of problems in APL?
Is there an easier or faster formula for this example, or for the general case?
More generally, comparing my (naïve?) array approach above to traditional scalar languages, I can see that I'm translating each loop into an additional dimension: 3 nested loops become a 3-rank truth table:
for c in 1..NC:
for b in 1..min(c, NB):
for a in 1..min(b, NA):
collect (a,b,c)
But in a scalar language one can effect optimizations along the way, for example breaking loops as soon as possible, or choosing the loop boundaries dynamically. In this case I don't even need to test for a ≤ b ≤ c, because it's implicit in the loop boundaries.
In this example both approaches have O(N³) complexity, so their runtime will only differ by a factor. But I'm wondering: how could I write the array solution in a more optimized way, if I needed to do so?
Are there any good books or online resources that address algorithmic issues or best practices in APL?
Here's an alternative approach. I'm not sure if it would run faster.
Following your algorithm for scalar languages, the possible values of c are
⎕IO←0
c←1+⍳NC
In the inner loops the values for b and a are
b←1+⍳¨NB⌊c
a←1+⍳¨¨NA⌊b
If we combine those
r←(⊂¨¨¨a,¨¨¨b),¨¨¨c
we get a nested array of (a,b,c) triplets which can be flattened and rearranged in a matrix
r←∊r
(((⍴r)÷3),3)⍴r
ADD:
Morten Kromberg sent me the following solution. On Dyalog APL it's ~ 30 times more efficient than the one above:
⎕IO←1
AddDim←{0≡⍵:⍪⍳⍺ ⋄ n←0⌈⍺-x←¯1+⊢/⍵ ⋄ (n⌿⍵),∊x+⍳¨n}
TTable←{⊃AddDim/⌽0,⍵}
TTable 3 4 5
