Regarding logistic regression

@Luke, thanks for your comments. I tried googling it, but I couldn't solve the issue. Here are more details about the question, with the dataset attached: perform a logistic regression that regresses the column MEDVBIN on CRIM, RM, NOX, DIS and AGE. You can see the model I used in the attachment. [attachment: dataset with model]
ValueError: endog has evaluated to an array with multiple columns that has shape (506, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
Looking for your kind help.

Related

Elimination of need to retrain models in "Shapley Sampling Values"

While reading the paper "A Unified Approach to Interpreting Model Predictions" by Lundberg and Lee (https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf), on page 3 I see:
Shapley sampling values are meant to explain any model by: (1) applying sampling approximations to Equation 4, and (2) approximating the effect of removing a variable from the model by integrating over samples from the training dataset. This eliminates the need to retrain the model and allows fewer than $2^{|F|}$ differences to be computed. Since the explanation model form of Shapley sampling values is the same as that for Shapley regression values, it is also an additive feature attribution method.
My question is: how does sampling from the training dataset eliminate the need to retrain models? It is not obvious to me and I cannot think of a mathematical proof. Any reference or explanation would be greatly appreciated. My internet searches have been unsuccessful. Thank you.
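For reference, a sketch of the idea; the notation follows the paper's Equation 4, and the Monte Carlo form below is an assumed reading of its sampling step. Equation 4 is the classic Shapley value, which in principle needs a model $f_S$ retrained on every feature subset $S$:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}\left[f_{S \cup \{i\}}\left(x_{S \cup \{i\}}\right) - f_S(x_S)\right]$$

Shapley sampling values avoid the retraining by replacing each $f_S(x_S)$ with the full model $f$, trained once on all features, averaged over background samples $z^{(m)}$ drawn from the training data for the missing features $\bar{S}$:

$$f_S(x_S) \approx \mathbb{E}\left[f(x) \mid x_S\right] \approx \frac{1}{M} \sum_{m=1}^{M} f\left(x_S,\, z^{(m)}_{\bar{S}}\right)$$

So a single model is evaluated repeatedly instead of training $2^{|F|}$ separate models: the expectation over training samples stands in for what a retrained model without the dropped features would predict.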

How to do logistic partial least squares using ordinal explanatory variables

This is a general question without code.
My data frame consists of a binary response variable and ordinal predictor variables (Likert-type scale). I want to do partial least squares by retrieving the most relevant components from the predictor variables (1st stage) and using those as the new predictors for a logit model in the 2nd stage (since my response is binary).
So far, the package plsRglm seems the most applicable, since it allows a logit in the second stage. The challenge is that plsRglm does not seem to have provision for ordinal factor variables. If you know the plsRglm package, could you please suggest how to handle ordinal factor variables?
Or could you suggest another package that solves this problem?
Thanks
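For what it's worth, a minimal sketch of one common workaround: code the ordered factors as numeric scores before handing them to plsRglm. This assumes the Likert steps are roughly equally spaced; the data frame df and columns q1..q3 are hypothetical.

library(plsRglm)

# Hypothetical data: binary response y, ordered-factor predictors q1..q3
df <- data.frame(
  y  = rbinom(100, 1, 0.5),
  q1 = factor(sample(1:5, 100, replace = TRUE), ordered = TRUE),
  q2 = factor(sample(1:5, 100, replace = TRUE), ordered = TRUE),
  q3 = factor(sample(1:5, 100, replace = TRUE), ordered = TRUE)
)

# Ordered factors -> integer scores (assumes equal spacing between levels)
X <- data.frame(lapply(df[, c("q1", "q2", "q3")], as.numeric))

# 1st stage: PLS components; 2nd stage: logit link
fit <- plsRglm(dataY = df$y, dataX = X, nt = 2, modele = "pls-glm-logistic")
summary(fit)

An alternative that keeps more of the ordinal structure is to expand the factors yourself, e.g. via model.matrix() (ordered factors get polynomial contrasts by default), and pass the resulting numeric matrix as dataX.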

Individual P-values in Logistic Regression

I ran a logistic regression with about 10 variables (in R), and some of them have high p-values (>0.05). Should we follow the elimination techniques used in multiple linear regression to remove insignificant variables, or is the method different in logistic regression?
I'm new to this field so please pardon me if this question sounds silly.
Thank you.
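For what it's worth, one common check before dropping a term in a logistic model is a likelihood-ratio test rather than the Wald p-value alone. A minimal sketch in R; the model on the built-in mtcars data is just for illustration:

# Hypothetical logistic model: transmission type vs. horsepower and weight
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Likelihood-ratio test for dropping each term in turn
drop1(fit, test = "Chisq")

Whether to drop anything at all is a separate question: if a variable is in the model for subject-matter reasons, many would keep it regardless of its p-value.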

Logistic Model Error: Singular matrix while having highly correlated categorical dummy

Similar to the question here:
If one of the dummies of a categorical variable has a high VIF (multicollinearity), I would assume it should not be removed from the predictor list.
But logistic regression in statsmodels raises a 'Singular matrix' error. What should I do when this happens?
Possible solutions: 1. Remove all the dummies of this categorical variable; 2. Remove only the high-VIF dummy, which leaves the categorical variable missing one subcategory.
Thanks!

Error converting to stm after tf-idf weighting

For several dfms, I have no problem converting them to stm/lda/topicmodels format. However, if I weight the dfms with dfm_tfidf() before converting, I get the following error:
Error in convert.dfm(users_dfm, to = "stm") : cannot convert a non-count dfm to a topic model format
Any idea why this might be? I've tried different weighting schemes for both term and document frequency (to try and make the weighted dfm a 'count' dfm), but I keep getting the error.
So, this works:
users_dfm <- dfm(users_tokens)
users_stm <- convert(users_dfm, to = "stm")
But this doesn't:
users_dfm <- dfm(users_tokens)
weighted_dfm <- dfm_tfidf(users_dfm)            # weights are no longer integer counts
users_stm <- convert(weighted_dfm, to = "stm")  # fails: non-count dfm
Thanks!
This is because topic models require counts as input: that is the nature of the statistical distribution assumed by the latent Dirichlet allocation model. tf-idf weighting turns the dfm into non-integer values, which are not valid input for stm (or any other topic model).
So in short, don't weight your dfm before using it with a topic model.
You should also note that conversion of a dfm to the stm format is not strictly required, since stm::stm() can take a dfm object directly as an input.
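For example, a minimal sketch of the direct route (K = 10 topics is an arbitrary choice here):

library(quanteda)
library(stm)

users_dfm <- dfm(users_tokens)       # unweighted counts
users_fit <- stm(users_dfm, K = 10)  # stm() accepts a quanteda dfm directly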
