Logistic Model Error: 'Singular matrix' with a highly correlated categorical dummy

Similar to the question here:
If one of the dummies of a categorical variable has a high VIF (multicollinearity), I would assume it should not be removed from the predictor list.
But statsmodels' logistic regression then fails with a 'Singular matrix' error. What should I do when this happens?
Possible solutions: 1. remove all the dummies of this categorical variable; 2. remove only the high-VIF dummy, which leaves the categorical variable missing one subcategory.
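For what it's worth, option 2 is what standard reference (treatment) coding already does: with an intercept in the model, a k-level categorical variable can contribute only k-1 dummies, otherwise the design matrix is singular. A minimal sketch of that fix, assuming pandas and statsmodels (the column names city and y are hypothetical):

import pandas as pd
import statsmodels.api as sm

# Hypothetical data: 'city' is categorical, 'y' is the binary response.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "y":    [0,   1,   0,   1,   0,   1,   1,   0],
})

# drop_first=True keeps k-1 dummies per categorical variable, so the
# design matrix with an intercept is no longer singular.
X = sm.add_constant(pd.get_dummies(df["city"], drop_first=True, dtype=float))
model = sm.Logit(df["y"], X).fit()
print(model.params)

The dropped level becomes the reference category, so no information is lost; the remaining coefficients are interpreted relative to it.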
Thanks!

Related

Elimination of need to retrain models in "Shapley Sampling Values"

While reading the paper "A Unified Approach to Interpreting Model Predictions" by Lundberg and Lee (https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf), on page 3 I see:
Shapley sampling values are meant to explain any model by: (1) applying sampling approximations
to Equation 4, and (2) approximating the effect of removing a variable from the model by integrating
over samples from the training dataset. This eliminates the need to retrain the model and allows fewer
than 2^|F| differences to be computed. Since the explanation model form of Shapley sampling values
is the same as that for Shapley regression values, it is also an additive feature attribution method.
My question is: how does sampling from the training dataset eliminate the need to retrain models? It is not obvious to me and I cannot think of a mathematical proof. Any reference or explanation would be greatly appreciated. My internet searches have been unsuccessful. Thank you.
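Not a proof, just a sketch of the substitution being described: instead of retraining a submodel on each feature subset S, the method approximates it by E[f(x_S, X_not_S)], filling the removed features with values drawn from training rows and averaging the original model's predictions. A rough numpy illustration, where the names predict, background, and in_S are all hypothetical:

import numpy as np

def f_subset(predict, x, in_S, background):
    # Approximate the "model trained only on features S" at point x by
    # E[f(x_S, X_not_S)]: copy background (training) rows, overwrite the
    # features in S with x's values, and average the predictions of the
    # ORIGINAL model, so no submodel is ever retrained.
    X = background.copy()
    X[:, in_S] = x[in_S]          # hold the retained features fixed at x
    return predict(X).mean()      # integrate out the removed features

# Toy usage with a linear stand-in for the trained model:
rng = np.random.default_rng(0)
background = rng.normal(size=(100, 3))                 # stand-in for training data
predict = lambda X: X @ np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -1.2, 0.8])
print(f_subset(predict, x, in_S=[0, 2], background=background))

Because every subset evaluation reuses the same trained f and only predictions are recomputed, no retraining is needed; the sampling additionally means not all 2^|F| subsets have to be visited.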

How to do logistic partial least squares using ordinal explanatory variables

This is a general question without code.
My data frame consists of a binary response variable and ordinal predictor variables (Likert-type scale). I want to do partial least squares by extracting the most relevant components from the predictors (first stage) and using those as the new predictors in a logit model, the second stage (since my response is binary).
So far, the package plsRglm seems the most applicable, since it allows a logit in the second stage. The challenge is that plsRglm does not seem to have provision for ordinal factor variables. If you know the plsRglm package, could you please suggest how to handle ordinal factor variables?
Or could you suggest another package that solves this problem?
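plsRglm is an R package, but the two-stage idea itself can be sketched in Python with scikit-learn, treating the ordinal Likert codes as numeric scores (that numeric treatment is an assumption, and is exactly the modelling choice being asked about; the data here are made up):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 10)).astype(float)   # Likert 1-5 treated as numeric
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 6).astype(int)  # toy binary response

# Stage 1: extract a few PLS components from the predictors.
pls = PLSRegression(n_components=3).fit(X, y)
T = pls.transform(X)                                   # component scores

# Stage 2: logistic regression on the extracted components.
logit = LogisticRegression().fit(T, y)
print(logit.score(T, y))

An alternative to the numeric-score assumption is to expand each ordinal variable into cumulative ("thermometer") dummies before stage 1.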
Thanks

Regarding logistic regression

#Luke, thanks for your comments. I tried googling it but couldn't solve the issue. Here are more details about the question, with the dataset attached: "Perform a logistic regression that regresses the column MEDVBIN on CRIM, RM, NOX, DIS and AGE." You can see the model I used in the attachment (dataset with model).
ValueError: endog has evaluated to an array with multiple columns that has shape (506, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
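That ValueError usually means the response column is stored as a string or bool, so the formula interface expands it into two indicator columns, hence the (506, 2) shape. A minimal sketch of the usual fix, recoding the response to a single 0/1 column before fitting (the data and the 'high' label below are hypothetical stand-ins for the attachment):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the attached 506-row dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(506, 5)),
                  columns=["CRIM", "RM", "NOX", "DIS", "AGE"])
df["MEDVBIN"] = rng.choice(["low", "high"], size=506)  # string response reproduces the error

# Recode the response to one 0/1 integer column; a non-numeric endog
# is what triggers the "(506, 2)" ValueError.
df["MEDVBIN"] = (df["MEDVBIN"] == "high").astype(int)

model = smf.logit("MEDVBIN ~ CRIM + RM + NOX + DIS + AGE", data=df).fit()
print(model.summary())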
Looking forward to your kind help.

SPSS logistic regression - Exp(B) displays reciprocal of my categorical vars?

I'm performing logistic regression with SPSS, and Exp(B) is showing the reciprocal of what I'd like: e.g., where I'd like it to display, say, 2.0, Exp(B) is listed as 0.5. My variables are all categorical, so the coding is arbitrary.
I know I can recode variables, but I'm wondering if there's a simple setting in one of the dialogs to display reciprocals or recode on the fly. If possible, I'd like to do it through the UI rather than command-line input.
If you're using the LOGISTIC REGRESSION procedure (Analyze > Regression > Binary Logistic in the menus), clicking the Categorical button lets you specify predictor variables as categorical and choose the type of contrast coding for each one. As long as the variables of interest are binary, or the contrasts you want use either the first or the last level of the variable as the reference category, you can specify everything in that dialog box to get what you want.
If a variable has more than two levels and you want to use a category other than the first or the last as the reference, you'd have to paste the command from the dialogs and add the sequential number of the desired category to the CONTRAST subcommand for that predictor. For example, if you have a three-category variable named X and you want to compare the first and third categories against the second, you'd edit it to read
/CONTRAST (X)=Indicator(2)
or
/CONTRAST (X)=Simple(2)
depending on the type of contrasts specified in the dialogs (these two produce the same results for these contrasts as long as X is not also part of an interaction term in the model, differing only in how the constant or intercept is represented).
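For reference, the edited subcommand sits inside the pasted command roughly like this (Y and X are placeholder variable names):

LOGISTIC REGRESSION VARIABLES Y
  /METHOD=ENTER X
  /CONTRAST (X)=Indicator(2).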

How to use pulp to generate variables and constraints for a sparse matrix?

Hi there,
I am new to pulp. I learned it from some examples I found online. These examples were very helpful, and I can now write simple models myself, but I still find it difficult to build complex models, especially ones with sparse matrices.
Could you please post some complex examples with sparse matrices and complex constraints? I want to learn how to create only the necessary variables, instead of a blanket definition such as y = LpVariable.dicts("y", (Factorys, Customers), 0, 1, LpBinary).
I have another question: what happens if I simply use y = LpVariable.dicts("y", (Factorys, Customers), 0, 1, LpBinary) to define the variables, where most of them appear in neither the objective function nor the constraints, and I add constraints that explicitly set those unused variables to 0? Is pulp able to identify such useless variables and remove them first, and then run an integer programming algorithm (such as B&B or B&C) on the reduced problem? If so, it looks like the "set useless variables to 0" approach would not slow down the solution at all. Am I right?
This may help:
http://www.stuartmitchell.com/journal/2012/2/3/my-top-n-tips-for-python-coding-in-optimisation-1.html
In particular, first generate a sparse set of (factory, customer) pairs:
factories_customers = [(f, c) for f in factories for c in customers
                       if <insert your condition here>]
Then use
y = LpVariable.dicts("y", factories_customers, 0, 1, LpBinary)
Pulp itself does not remove "useless" variables and constraints, so the model build time will be long. However, the solution algorithms (CBC by default) contain pre-solve routines that will remove those variables.
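A minimal runnable sketch of that sparse-index pattern; the factories, customers, and allowed arcs are made-up placeholders, and the objective and constraints are only illustrative:

from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum

factories = ["F1", "F2"]
customers = ["C1", "C2", "C3"]
allowed = {("F1", "C1"), ("F1", "C2"), ("F2", "C3")}   # hypothetical sparsity pattern

# Create variables only for the arcs that can actually be used.
arcs = [(f, c) for f in factories for c in customers if (f, c) in allowed]
y = LpVariable.dicts("y", arcs, 0, 1, LpBinary)

prob = LpProblem("sparse_assignment", LpMinimize)
prob += lpSum(y[a] for a in arcs)                       # placeholder objective
for c in customers:
    serving = [y[(f, c)] for f in factories if (f, c) in allowed]
    if serving:                                         # skip unreachable customers
        prob += lpSum(serving) == 1
prob.solve()                                            # bundled CBC applies its own pre-solve

Because y is keyed by the sparse arcs list rather than the full cartesian product, no "set it to 0" constraints are needed in the first place.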
