Error in Logistic Regression and Linear SVM

I am getting a warning when fitting logistic regression and a linear SVM:
ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
My training data has shape (720, 720) and my test data has shape (80, 720).
How can I resolve this?
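This warning means lbfgs hit its iteration cap (scikit-learn's default max_iter=100) before converging, which is common when there are as many features as samples. Below is a minimal sketch of the two usual remedies, on synthetic data of the same shape (the variable names are illustrative, not from the question): standardize the features and raise max_iter.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the 720 x 720 training set
x_train, y_train = make_classification(n_samples=720, n_features=720, random_state=0)

# scaling usually lets lbfgs converge much sooner; max_iter lifts the cap
# (LinearSVC from sklearn.svm takes the same treatment via its own max_iter)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
clf.fit(x_train, y_train)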

Related

emcee MCMC Sampling... Small set of walkers not converging?

I'm running MCMC on a simulation where I know the true parameter values. Suppose I use 100 walkers and 10,000 steps in the chain. The bulk of the walkers converge to the correct parameter values, but ~3 walkers (out of the 100) just trail off into the distance and never converge. I'm not sure what could be at play here, other than a potential error on my end.
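Without seeing the setup it's hard to say, but stray walkers are often ones initialized in a flat or near-zero-probability region of the posterior. A hedged sketch, on a toy Gaussian target, of two checks one might run with emcee (every name and threshold below is illustrative):
import numpy as np
import emcee

def log_prob(theta):
    # toy unimodal target: standard Gaussian
    return -0.5 * np.sum(theta ** 2)

ndim, nwalkers, nsteps = 3, 100, 2000   # the question used 10,000 steps
p0 = np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, nsteps)

# walkers with near-zero acceptance are often stuck far from the mode
print(np.where(sampler.acceptance_fraction < 0.05)[0])

# get_chain() is (nsteps, nwalkers, ndim): compare each walker's
# late-time mean to the ensemble median to flag the stragglers
chain = sampler.get_chain(discard=nsteps // 2)
walker_means = chain.mean(axis=0)
spread = np.abs(walker_means - np.median(walker_means, axis=0))
# loose absolute threshold; on this healthy toy run nothing is flagged
print(np.where(spread.max(axis=1) > 5.0)[0])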

What is the purpose of the logit function? At what stage of the model-building process is it used?

We have two prominent functions (or, we could say, equations) in logistic regression:
The logistic regression function.
The logit function.
I would like to know:
Which of these equations is used in the logistic regression model-building process?
At what stage of the model-building process is each of these equations used?
I know that the logit function is used to transform probability values (which range between 0 and 1) into real values (which range from -Inf to +Inf). I would like to know the real purpose of the logit function in the logistic regression modeling process.
Here are a few queries directly related to the purpose of the logit function in logistic regression modeling:
1. Was the logit function (i.e. the logit equation LN(P/(1-P))) derived from the logistic regression equation, or is it the other way around?
2. What is the purpose of the logit equation within the logistic regression equation? How is the logit function used in the logistic regression algorithm? The reason for asking this will become clear after going through points 3 and 4.
3. Upon building a logistic regression model, we get model coefficients. When we substitute these coefficients and the respective predictor values into the logistic regression equation, we get the probability of the default class (the same values returned by predict()). Does this mean that the estimated coefficient values are determined from the probability values (computed using the logistic regression equation, not the logit equation), which are fed into the likelihood function to check whether it is maximized? If this understanding is correct, then where is the logit function used in the entire model-building process?
4. Assume that the logit function is used neither during model building nor during prediction. If that is the case, why do we give importance to the logit function, which maps probability values to real values (ranging between -Inf and +Inf)?
5. Where exactly is the logit function used in the entire logistic regression model-building process? Is it while estimating the model coefficients? (See the sketch below.)
6. Are the coefficient estimates that we see upon running summary(lr_model) determined using the linear form of the logistic regression equation (the logit equation) or the actual logistic regression equation?
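One way to see where the logit sits in a fitted model (a sketch with illustrative synthetic data, not the asker's): the fitted model is linear on the logit scale, so the linear predictor X·coef + intercept equals the logit of the probability returned by predict_proba.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
lr = LogisticRegression().fit(X, y)

p = lr.predict_proba(X)[:, 1]                       # logistic equation output
linear = X @ lr.coef_.ravel() + lr.intercept_[0]    # linear (logit-scale) part
print(np.allclose(np.log(p / (1 - p)), linear))     # True: logit(p) is linear in X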
What is the purpose of the logit function?
The purpose of the logit function is to map the (0, 1) interval onto the whole real line.
Mathematically, logit(p) = LN(p/(1-p)) takes a probability p between 0 and 1 and returns a value anywhere in (-inf, +inf).
Sigmoid and softmax do exactly the opposite: they map (-inf, +inf) back into (0, 1).
This is why in machine learning we may apply the logit before the sigmoid and softmax functions: they are inverses of each other, so they match perfectly.
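A quick numerical check of that inverse relationship (a sketch; the definitions below are the standard ones, not tied to any particular library):
import numpy as np

def logit(p):
    # (0, 1) -> (-inf, +inf)
    return np.log(p / (1 - p))

def sigmoid(x):
    # (-inf, +inf) -> (0, 1)
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))           # [-2.197  0.     2.197]
print(sigmoid(logit(p)))  # recovers [0.1  0.5  0.9]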

Logistic regression coefficients

I am doing logistic regression with 3 attributes. Given my data set, I expect all coefficients to be positive, but the fit gives me both positive and negative coefficients. Is it possible to get all positive coefficients using logistic regression?
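Ordinary maximum-likelihood logistic regression places no sign constraint on the coefficients, so mixed signs are expected whenever the data pull that way. If nonnegative coefficients are genuinely required, one option (a hedged sketch on synthetic data; everything below is illustrative, not the asker's setup) is to minimize the negative log-likelihood under bound constraints:
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    z = X @ w
    # standard logistic NLL: sum of log(1 + exp(z)) - y*z, written stably
    return np.sum(np.logaddexp(0, z) - y * z)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, 0.5, 2.0]) + rng.normal(size=200) > 0).astype(float)

# L-BFGS-B with a lower bound of 0 on every coefficient
res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y),
               method="L-BFGS-B", bounds=[(0.0, None)] * 3)
print(res.x)    # all entries >= 0 by construction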

Can we change the default cut-off (0.5) used by logistic regression, and not just while calculating the classification error?

We know that the workflow of logistic regression is that it first obtains probabilities from the fitted equation and then applies a default cut-off for classification.
So, I want to know whether it is possible to change the default cut-off value (0.5) to 0.75, as per my requirement. If yes, can someone help me with the code in R, Python, or SAS? If no, can someone provide relevant proof?
In my search for an answer, I found the following:
1.) We can find the optimal cut-off value that gives the best possible accuracy and build the confusion matrix accordingly.
R code to find the optimal cut-off and build the confusion matrix:
library(InformationValue)
optCutOff <- optimalCutoff(testData$ABOVE50K, predicted)[1]
confusionMatrix(testData$ABOVE50K, predicted, threshold = optCutOff)
Misclassification error:
misClassError(testData$ABOVE50K, predicted, threshold = optCutOff)
Note: we see that the cut-off value is changed while calculating the confusion matrix, but not while building the model. Can someone help me with this?
Reference link: http://r-statistics.co/Logistic-Regression-With-R.html
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(x_train, y_train)
We first use
lr.predict_proba(x_test)
to get the probability of each class: for example, the first column is the probability of y=0 and the second column is the probability of y=1.
# the probability of being y=1
prob1 = lr.predict_proba(x_test)[:, 1]
If we use 0.25 as the cut-off value, then we predict as below:
predicted = [1 if p > 0.25 else 0 for p in prob1]
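And for the 0.75 cut-off the question asks about, only the threshold changes:
predicted = [1 if p > 0.75 else 0 for p in prob1]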

Logistic Regression with Gradient Descent on large data

I have a training set with about 300,000 examples and about 50-60 features, and it's a multiclass problem with about 7 classes. I have a logistic regression function that finds the parameters by gradient descent until convergence. My gradient descent algorithm computes the parameters in matrix form, as that is faster than looping over them separately.
Ex (in R, where h(X) is the vector of hypothesis outputs):
P <- P - learning_rate * (t(X) %*% (h(X) - Y))
For small training data it's quite fast and gives correct values with the maximum iterations set to around 1,000,000. But with this much training data it's extremely slow: around 500 iterations take 18 minutes, and at that point the cost is still high and it does not predict the class correctly.
I know I should implement feature selection or feature scaling, and I can't use the packages provided. The language used is R. How do I go about implementing feature selection or scaling without using any library packages?
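For reference, here is the same matrix-form update as a runnable sketch (binary case, synthetic data; Python is used for illustration even though the question is in R, and dividing the gradient by the number of examples is my own assumption to keep the step size stable at this scale):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(300_000, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

P = np.zeros(50)    # parameters
rate = 0.5          # learning rate
for _ in range(100):
    grad = X.T @ (sigmoid(X @ P) - y) / len(y)   # one vectorized step
    P -= rate * grad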
According to the link, you can use either Z-score normalization or min-max scaling. Min-max scaling maps the data onto the [0, 1] range, while Z-score normalization centers each feature at zero with unit variance. Z-score normalization is calculated as z = (x - mean(x)) / sd(x), and min-max scaling as x' = (x - min(x)) / (max(x) - min(x)).
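A sketch of both scalers without ML packages (shown in Python/NumPy to match the other snippets here; each line translates directly to base R with mean(), sd(), min(), max()):
import numpy as np

def z_score(X):
    # center each column at 0 with unit variance: (x - mean) / sd
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    # map each column onto [0, 1]: (x - min) / (max - min)
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(min_max(X))   # columns span [0, 1]
print(z_score(X))   # columns have mean 0, sd 1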
