Can we change the default cut-off (0.5) used by logistic regression itself, and not only while calculating the classification error? - logistic-regression

We know that the workflow of logistic regression is that it first computes a probability from the fitted equation and then applies a default cut-off for classification.
So, I want to know if it is possible to change the default cutoff value (0.5) to 0.75 as per my requirement. If yes, can someone help me with the code in R, Python, or SAS? If no, can someone explain why, with relevant references?
In my process of finding the answer to this query, I found that:
1.) We can find the optimal cutoff value that gives the best possible accuracy and build the confusion matrix accordingly.
R code to find the optimal cutoff and build the confusion matrix:
library(InformationValue)
# cutoff that minimizes the misclassification error on the test data
optCutOff <- optimalCutoff(testData$ABOVE50K, predicted)[1]
# confusion matrix evaluated at that cutoff
confusionMatrix(testData$ABOVE50K, predicted, threshold = optCutOff)
Misclassification error:
misClassError(testData$ABOVE50K, predicted, threshold = optCutOff)
Note: we see that the cutoff value is changed while calculating the confusion matrix, but not while building the model. Can someone help me with this?
Reference link: http://r-statistics.co/Logistic-Regression-With-R.html

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
We first use
lr.predict_proba(x_test)
to get the probability of each class: the first column is the probability of y=0 and the second column is the probability of y=1.
# the probability of being y=1
prob1 = lr.predict_proba(x_test)[:, 1]
If we use 0.25 as the cutoff value, then we predict as below:
predicted = [1 if p > 0.25 else 0 for p in prob1]
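Putting that together for the 0.75 cutoff asked about above, here is a minimal end-to-end sketch in Python. The synthetic dataset is a made-up stand-in for the question's data; only the thresholding step is the point.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# hypothetical data standing in for the question's dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)

# the model only estimates probabilities; the cutoff is applied afterwards
prob1 = lr.predict_proba(x_test)[:, 1]    # P(y = 1)
predicted = (prob1 > 0.75).astype(int)    # custom 0.75 cutoff instead of 0.5
print(confusion_matrix(y_test, predicted))

Note that no refitting is needed: the fitted model is unchanged, and only the decision rule applied to its probabilities moves from 0.5 to 0.75.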

Related

Estimating vertical asymptotes via numerical integration

I am integrating a differential equation with the second-order Runge-Kutta (RK2) method in order to obtain an approximate solution y_n(t), where n is a varying initial parameter. Then, for a sample of n's in a chosen range, I want to find the first root of y_n(t) by recording the first change of sign: the result is a function R(n), the first root of y_n(t), which has a vertical asymptote at some value of n.
My question is then: how can I estimate the value of the vertical asymptote?
I do not see any way of doing that for the moment. I have thought about checking that for small changes of n the value of R(n) should not change too much at each step, but that is not true, especially if the growth is exponential or worse. I have also thought about just choosing an "upper bound", meaning that if R(n) is greater than some limit then I would approximate the vertical asymptote by that value of n, but I do not like this solution because I have no control over the error I am making.
Are there any more clever ideas?
Any comment or answer is much appreciated, and let me know if I can explain myself more clearly!
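For concreteness, here is a minimal Python sketch of the procedure described above: RK2 (midpoint) integration plus detection of the first sign change, with linear interpolation inside the step. The toy ODE y' = -1 with y(0) = 1/(1-n) is a made-up stand-in whose first root is exactly R(n) = 1/(1-n), so its vertical asymptote sits at n = 1.

def rk2_first_root(f, y0, n, t_max=200.0, h=1e-3):
    # integrate y' = f(t, y, n) with the midpoint (RK2) method and
    # return R(n): the first t at which y changes sign
    t, y = 0.0, y0
    while t < t_max:
        k1 = f(t, y, n)
        k2 = f(t + 0.5 * h, y + 0.5 * h * k1, n)
        y_next = y + h * k2
        if y * y_next < 0:                   # first change of sign
            return t + h * y / (y - y_next)  # linear interpolation in the step
        t, y = t + h, y_next
    return float("inf")                      # no root found up to t_max

# R(n) grows without bound as n approaches the asymptote at 1
for n in (0.5, 0.8, 0.9, 0.99):
    print(n, rk2_first_root(lambda t, y, n: -1.0, 1.0 / (1.0 - n), n))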

Numerical integration in Fortran with infinity as one of the limits

I am asked to normalize a probability distribution P = A x^2 e^(-x) on [0, infinity) by finding the value of A. I know algorithms for computing integrals numerically, but how do I deal with one of the limits being infinity?
The only way I have been able to solve this problem with some accuracy (full accuracy, in fact) is by doing some math first, in order to obtain a Taylor series that represents the integral.
I have been looking for my sample code, but I can't find it. I'll edit my post if I get a working solution.
The basic idea is to compute all the derivatives of the function exp(-(x*x)) and use the coefficients to derive the integral's series, by dividing each coefficient by one more than the corresponding exponent of x. That gives the Taylor series of the integral. (I recommend using the unnormalized version described above to get simple numeric coefficients, then adjusting the result by multiplying by the proper constants.) The resulting Taylor series has good convergence, giving precise values at full precision. Direct quadrature, by contrast, requires a lot of subdivision, and you cannot divide an unbounded interval into a finite number of intervals that are all finite.
I'll edit this post if I find the code I wrote (so stay tuned, and don't change the channel :) )
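As a point of comparison with the series approach above, here is a minimal sketch (in Python for brevity, though the question asks about Fortran) of another standard way to handle the infinite limit: the substitution x = t/(1-t) maps [0, infinity) onto [0, 1), after which an ordinary composite rule applies. For the question's P = A x^2 e^(-x), the exact integral is Gamma(3) = 2, so A = 1/2.

import math

def integral_zero_to_inf(f, n=4000):
    # integrate f over [0, inf) via x = t/(1-t) and the composite
    # midpoint rule, which never evaluates at the singular point t = 1
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        x = t / (1.0 - t)
        total += f(x) / (1.0 - t) ** 2 * h   # dx = dt / (1 - t)^2
    return total

val = integral_zero_to_inf(lambda x: x * x * math.exp(-x))
print(val, 1.0 / val)   # ~2.0, so A ~ 0.5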

How to find the roots of a function whose analytical form is not known, but which is available as a tabulated set of values?

To find the roots of a function, we can generally use the bisection method or Newton's method. For a function f(x), this is possible only when we have an analytical expression for the x-dependence of f(x).
I am trying to find the roots of such a function where I don't know the exact form of the function; rather, I have tabulated data for the values of f(x) at each value of x in a particular range. I am writing my program in C, and I am using a for-loop to calculate f(x) for each value of x by solving a non-linear equation with the bisection method and tabulating the data. Now I need to find the roots of the function f(x).
Can anyone help me with any suitable method or algorithm for the problem?
Thanks in advance!
You know from where the sign changes that a root has to be between two points.
Take several nearby points, put a polynomial through them, and then solve for the root of that polynomial using Newton's method.
From your description it looks like you should be able to calculate your function at this new point. If so, then I would suggest that you calculate the value at this point, add the two nearest neighbors, calculate a parabola and solve for the root of that. If your function is smooth and has a non-zero derivative at the root, this step will make your estimate of the root several orders of magnitude more accurate.
(You can repeat again for even more accuracy. But the increased accuracy at this point may be on par with the numerical errors in your estimate of the value of the function.)
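A minimal numpy sketch of those two steps (the question's program is in C, but the logic carries over directly); xs and fs are the tabulated points, and f is whatever routine fills the table:

import numpy as np

def refine_root(f, xs, fs):
    # 1. locate a bracket from the first sign change in the table
    i = int(np.flatnonzero(np.sign(fs[:-1]) * np.sign(fs[1:]) < 0)[0])
    # 2. first estimate: linear interpolation across the bracket
    x0 = xs[i] - fs[i] * (xs[i + 1] - xs[i]) / (fs[i + 1] - fs[i])
    # 3. evaluate f at the estimate, fit a parabola through it and the
    #    bracket endpoints, and take the parabola's root in the bracket
    px = np.array([xs[i], x0, xs[i + 1]])
    py = np.array([fs[i], f(x0), fs[i + 1]])
    roots = np.roots(np.polyfit(px, py, 2))
    real = roots[np.isreal(roots)].real
    inside = real[(real >= xs[i]) & (real <= xs[i + 1])]
    return float(inside[0]) if inside.size else float(x0)

# toy check: the tabulated cosine's root near pi/2 is recovered
xs = np.linspace(0.0, 3.0, 7)
print(refine_root(np.cos, xs, np.cos(xs)))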

How is the range of the last layer of a Neural Network determined when using ReLU

I'm relatively new to Neural Networks.
At the moment I am trying to program a neural network for simple image recognition of the digits 0 to 9.
The activation function I'm aiming for is ReLU (rectified linear unit).
With the sigmoid function it is pretty clear how you can determine a probability for a certain case in the end (because it's between 0 and 1).
But as far as I understand it, with the ReLU we don't have these limitations, but can get any value as a sum of previous "neurons" in the end.
So how is this commonly solved?
Do I just take the biggest of all values and call that probability 100%?
Do I sum up all values and treat that sum as the 100% mark?
Or is there another approach I can't see at the moment?
I hope my question is understandable.
Thanks in advance for taking the time, looking at my question.
You can't use the ReLU function as the output function for classification tasks because, as you mentioned, its range can't represent a probability between 0 and 1. That's why it is used only for regression tasks and hidden layers.
For binary classification, you have to use an output function with range between 0 and 1, such as the sigmoid. In your case, with multiple classes, you would need its multidimensional extension, the softmax function.
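For illustration, a minimal sketch of softmax, which turns the unbounded outputs of a ReLU-fed last layer into a probability distribution over the classes (the logits below are hypothetical):

import numpy as np

def softmax(z):
    z = z - np.max(z)   # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])   # raw outputs of the last layer
probs = softmax(logits)
print(probs, probs.sum())             # non-negative entries that sum to 1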

Generate pseudo sample of population given probabilities

I would like to generate pseudo data that conforms to the distribution of actual sampled data. I am looking for an efficient and accurate method in C/Obj-C for iPhone development. Currently, the occurrence of each of 60 different categories in 1000 sampled events has been assigned a probability (0-1). I want to generate 1000 new events which conform to the same probabilities.
Clarification {
I have a categorical distribution of set {1,2,...,60}. I understand that samples from this distribution will conform to the probabilities of each category. Therefore I need to take 1000 samples from this distribution. I have determined (thanks to answers so far) that I need to:
1.) Normalize this distribution by summing the values and dividing each by the sum.
2.) Order them.
3.) Create a CDF by replacing each value with the sum of all previous values.
Then I can generate a uniform random number between 0 and 1, and find the greatest number in the CDF whose value is less than or equal to the number just chosen, and return the category corresponding to this CDF value.
}
Q1. Is this the correct way to solve the problem?
Q2. The caveat still holds that I'm using NSDecimals to store the category probabilities. Are there any libraries available or functions in Cocoa or Math.h, etc. that I can use to do this simply? I'm open to trying new libraries, currently only have Core-Plot and the standard Cocoa libraries in this project. Thanks.
Your problem description is unclear. But it sounds like you're looking for inverse transform sampling.
Basically, you first need to generate a cumulative distribution function (CDF) corresponding to your original data; call it F(x). You then generate uniform random data in the range 0 to 1, and transform it using the inverse CDF, i.e. F^(-1)(x).
Here's my suggestion. This assumes that when you say "normalized probability" you mean the sum of the probability of all types is 1. (If not, you'll need to rescale so that's the case.)
Make up some order for your 60 types. (Say, alphabetic.)
Generate a random number between 0 and 1. (Call it your "target".)
Create an accumulator, initially at 0.
Loop through your 60 types. For each type:
Add the probability of that type of event to your accumulator.
If your accumulator is >= your target, generate an event of that type and stop.
If you do that 1000 times, I believe you'll get the distribution you're looking for.
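That loop, as a short sketch (in Python for readability; the question targets C/Obj-C, but the translation is mechanical). The three categories and their probabilities are hypothetical stand-ins for the 60 real ones, and are assumed to already sum to 1:

import random

probs = [("cat", 0.2), ("dog", 0.5), ("bird", 0.3)]

def sample_event(probs):
    target = random.random()    # uniform in [0, 1)
    acc = 0.0
    for category, p in probs:
        acc += p
        if acc >= target:
            return category
    return probs[-1][0]         # guard against floating-point round-off

events = [sample_event(probs) for _ in range(1000)]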
