Which performance metrics (F1 Score, ROC AUC, PRC, MCC Score) can help me assess my model's performance on an imbalanced dataset? - logistic-regression

I am working on a text classification problem with a highly imbalanced dataset and cannot decide between performance metrics. I cannot figure out which metric would be the wise choice among these four measures (ROC AUC, PRC AUC, F1, MCC). Also, each can be computed in more than one way, e.g. from thresholded class labels or from predicted probabilities.
Since they all give different results, how do I work out which performance metric is the best fit for my model, so that I can get insight into whether the model is actually working or not?
Dataset Information:
class 1 - 98%
class 0 - 2%
I have applied various performance metrics for Logistic Regression and got these results:
Accuracy - 0.9824
Precision - 0.9807
Recall - 0.9824
F1 score - 0.9813
ROC AUC score - 0.6151
ROC AUC score (using predict_proba method on the classifier) - 0.9902
PRC AUC score - 0.9655
MCC score - 0.6021
I have also calculated the F1 score for each class separately:
F1 score for 1 - 0.9910
F1 score for 0 - 0.6021
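The gap between the two ROC AUC values comes from what is fed to the metric: hard 0/1 predictions give only one operating point, while predicted probabilities use every possible threshold. A minimal base-R sketch on simulated data (everything below is hypothetical, not the actual pipeline) illustrates this, together with MCC and per-class F1 computed from the same confusion matrix:
set.seed(1)
n     <- 1000
y     <- rbinom(n, 1, 0.98)                           # ~98% class 1, ~2% class 0
score <- ifelse(y == 1, rnorm(n, 1.5), rnorm(n, 0))   # fake model scores
preds <- as.integer(score > 0.75)                     # hard labels at one threshold
# Rank-based (Mann-Whitney) AUC: P(random positive outranks a random negative)
auc <- function(s, label) {
  r  <- rank(s)                                       # average ranks handle ties
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(preds, y)   # AUC from thresholded 0/1 predictions: lower
auc(score, y)   # AUC from continuous scores/probabilities: higher
# Confusion-matrix counts (as doubles), then MCC and per-class F1
tp <- as.numeric(sum(preds == 1 & y == 1)); tn <- as.numeric(sum(preds == 0 & y == 0))
fp <- as.numeric(sum(preds == 1 & y == 0)); fn <- as.numeric(sum(preds == 0 & y == 1))
mcc  <- (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
f1_1 <- 2 * tp / (2 * tp + fp + fn)   # F1 with class 1 as the positive class
f1_0 <- 2 * tn / (2 * tn + fn + fp)   # F1 with class 0 as the positive class
c(mcc = mcc, f1_class1 = f1_1, f1_class0 = f1_0)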

Related

How can I estimate an overall intercept in a cumulative link mixed model?

I am testing a cumulative link mixed model, and I want to estimate an overall intercept for the model.
The outcome of interest has 4 categories, so the model has 3 logits each with a unique intercept (threshold coefficient).
The model is tested in R with the ordinal package using the clmm function. I included a random intercept, a random slope, and a cross-level interaction.
The model looks like this:
comp.model.fit <- clmm(competence_ordinal ~ competence.state.lag1 + tantrum_dur.state.lag1 +
                         NEUROw1.c + tantrum_dur.state.lag1:NEUROw1.c +
                         (1 + tantrum_dur.state.lag1 | ID_nummer),
                       data = data, Hess = TRUE, na.action = na.exclude)
Results of the fitted model showed that the cross-level interaction is significant, so I would like to find the region of significance of the simple slope, i.e. the specific values of the moderator (NEUROw1.c) at which the slope of the regression of the outcome (competence_ordinal) on the focal predictor (tantrum_dur.state.lag1) transitions from non-significance to significance.
To compute a test for simple slopes I need an estimate for the intercept, however, in this type of model an overall intercept is not identified alongside the threshold coefficients.
Therefore, my question is how can I estimate an overall intercept?
Is there a way to constrain the first threshold to zero to be able to estimate the intercept?

Logistic Regression with Gradient Descent on large data

I have a training set with about 300000 examples and about 50-60 features; it is also a multiclass problem with about 7 classes. I have written a logistic regression function that estimates the parameters using gradient descent. My gradient descent update works in matrix form, since that is faster than updating each parameter separately in loops.
Ex :
P <- P - LearningRate * ( t(X) %*% ( h(X) - Y ) )
For small training data it is quite fast and gives correct values with the maximum number of iterations set to around 1,000,000. With this much training data, however, it is extremely slow: around 500 iterations take 18 minutes, and after that many iterations the cost is still high and the model does not predict the classes correctly.
I know I should probably implement feature selection or feature scaling, and I cannot use the available packages. The language used is R. How do I go about implementing feature selection or scaling without using any library packages?
According to the link, you can use either Z-score normalization or the min-max scaling method. Min-max scaling maps the data to the [0, 1] range, while Z-score normalization centres each feature at zero with unit variance. Z-score normalization is calculated as
z = (x - mean(x)) / sd(x)
The min-max scaling method is calculated as:
x_scaled = (x - min(x)) / (max(x) - min(x))
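A minimal base-R sketch (no packages) applying both scalings column-wise to a design matrix; the matrix below is just a placeholder for your own feature matrix:
# Placeholder design matrix; replace with your own data.
set.seed(42)
X <- matrix(rnorm(1000 * 5), ncol = 5)
# Z-score normalization: centre each column at 0 with unit variance.
z_score <- function(x) (x - mean(x)) / sd(x)
# Min-max scaling: map each column to the [0, 1] range.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
X_z  <- apply(X, 2, z_score)
X_mm <- apply(X, 2, min_max)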

Multiple combinations (ex drug-ADR) with the same unique case ID

I am quite new to R statistics, and I hope you can help me. I have tried finding the answer to my question by searching the forum and so on, and I apologize in advance if my question is trivial or stupid.
I have spent the last month collecting my first dataset, and it is now ready to be analyzed. I have also spent some time learning the most basic functions of R.
My dataset deals with adverse drug reaction reports. Each report may contain several suspect drugs and several adverse reactions, so a case can contain several drug-adverse reaction (drug-ADR) combinations. Some cases contain just one combination and others contain several.
And now my question is: How do I make calculations that are “case-specific”?
I want to calculate a Completeness Score for the percentage of completed data fields for each drug-ADR combination, and then I would like to calculate the average for the entire case/report.
I want to calculate a Completeness Score (C) for each drug-ADR combination, expressed as:
C = (1 - P1) x (1 - P2) x (1 - P3) x ... x (1 - Pn)
, where Pi refers to the penalty deducted if data field i is not complete (e.g. 0.50 for 50%). If the information is not missing, the penalty is 0, so the maximum score is 1. n is the number of parameters/variables.
Ultimately I want to calculate an overall Completeness Score for the whole case/report, calculated as the average over its drug-ADR combinations:
C_case = (C1 + C2 + ... + Cm) / m
, where Cj is the score of the j-th drug-ADR combination and m is the total number of drug-ADR combinations in the full report.
Can anyone help me?
Thank you for your attention! I will be very grateful for any help I can get.
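A minimal base-R sketch of both calculations (the data frame and column names below are hypothetical, chosen only to illustrate the idea):
# One row per drug-ADR combination; penalties already coded
# (0 = field complete, e.g. 0.5 = field missing).
reports <- data.frame(
  case_id  = c(1,   1,   2, 2,   2),
  pen_dose = c(0,   0.5, 0, 0,   0.5),
  pen_date = c(0,   0,   0, 0.5, 0.5),
  pen_outc = c(0.5, 0,   0, 0,   0)
)
penalty_cols <- c("pen_dose", "pen_date", "pen_outc")
# Completeness score per drug-ADR combination: product of (1 - penalty)
reports$C <- apply(1 - reports[, penalty_cols], 1, prod)
# Case-level score: average of the combination scores within each case
aggregate(C ~ case_id, data = reports, FUN = mean)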

Correctness of logistic regression in Vowpal Wabbit?

I have started using Vowpal Wabbit for logistic regression; however, I am unable to reproduce the results it gives. Perhaps there is some undocumented "magic" it does, but has anyone been able to replicate / verify / check the calculations for logistic regression?
For example, with the simple data below, we aim to model the way age predicts label. It is obvious there is a strong relationship: as age increases, the probability of observing 1 increases.
As a simple unit test, I used the 12 rows of data below:
age label
20 0
25 0
30 0
35 0
40 0
50 0
60 1
65 0
70 1
75 1
77 1
80 1
Now, performing a logistic regression on this dataset using R, SPSS or even by hand produces a model that looks like L = 0.2294*age - 14.08. So if I substitute the age and apply the inverse logit transform prob = 1/(1 + exp(-L)), I obtain predicted probabilities ranging from 0.0001 for the first row to 0.9864 for the last row, as reasonably expected.
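For reference, a minimal R sketch that should reproduce this batch fit on the 12 rows above:
age   <- c(20, 25, 30, 35, 40, 50, 60, 65, 70, 75, 77, 80)
label <- c( 0,  0,  0,  0,  0,  0,  1,  0,  1,  1,  1,  1)
fit <- glm(label ~ age, family = binomial)
coef(fit)                        # approximately intercept -14.08 and slope 0.2294 for age
predict(fit, type = "response")  # fitted probabilities from roughly 0.0001 up to about 0.99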
If I plug in the same data in Vowpal Wabbit,
-1 'P1 |f age:20
-1 'P2 |f age:25
-1 'P3 |f age:30
-1 'P4 |f age:35
-1 'P5 |f age:40
-1 'P6 |f age:50
1 'P7 |f age:60
-1 'P8 |f age:65
1 'P9 |f age:70
1 'P10 |f age:75
1 'P11 |f age:77
1 'P12 |f age:80
And then perform a logistic regression using
vw -d data.txt -f demo_model.vw --loss_function logistic --invert_hash aaa
(command line consistent with "How to perform logistic regression using vowpal wabbit on very imbalanced dataset"), I obtain the model L = -0.00094*age - 0.03857, which is very different.
The predicted values obtained using -r or -p further confirm this. The resulting probabilities end up nearly all the same, for example 0.4857 for age=20, and 0.4716 for age=80, which is extremely off.
I have noticed this inconsistency with larger datasets too. In what sense is Vowpal Wabbit carrying out the logistic regression differently, and how are the results to be interpreted?
This is a common misunderstanding of vowpal wabbit.
One cannot compare batch learning with online learning.
vowpal wabbit is not a batch learner. It is an online learner. Online learners learn by looking at examples one at a time and slightly adjusting the weights of the model as they go.
There are advantages and disadvantages to online learning. The downside is that convergence to the final model is slow/gradual. The learner doesn't do a "perfect" job at extracting information from each example, because the process is iterative. Convergence on a final result is deliberately restrained/slow. This can make online learners appear weak on tiny data-sets like the above.
There are several upsides though:
Online learners don't need to load the full data into memory (they work by examining one example at a time and adjusting the model based on the real-time observed per-example loss) so they can scale easily to billions of examples. A 2011 paper by 4 Yahoo! researchers describes how vowpal wabbit was used to learn from a tera (10^12) feature data-set in 1 hour on 1k nodes. Users regularly use vw to learn from billions of examples data-sets on their desktops and laptops.
Online learning is adaptive and can track changes in conditions over time, so it can learn from non-stationary data, like learning against an adaptive adversary.
Learning introspection: one can observe loss convergence rates while training and identify specific issues, and even gain significant insights from specific data-set examples or features.
Online learners can learn in an incremental fashion so users can intermix labeled and unlabeled examples to keep learning while predicting at the same time.
The estimated error, even during training, is always "out-of-sample" which is a good estimate of the test error. There's no need to split the data into train and test subsets or perform N-way cross-validation. The next (yet unseen) example is always used as a hold-out. This is a tremendous advantage over batch methods from the operational aspect. It greatly simplifies the typical machine-learning process. In addition, as long as you don't run multiple-passes over the data, it serves as a great over-fitting avoidance mechanism.
Online learners are very sensitive to example order. The worst possible order for an online learner is when classes are clustered together (all, or almost all, -1s appear first, followed by all 1s) like the example above does. So the first thing to do to get better results from an online learner like vowpal wabbit, is to uniformly shuffle the 1s and -1s (or simply order by time, as the examples typically appear in real-life).
OK now what?
Q: Is there any way to produce a reasonable model in the sense that it gives reasonable predictions on small data when using an online learner?
A: Yes, there is!
You can emulate what a batch learner does more closely, by taking two simple steps:
Uniformly shuffle 1 and -1 examples.
Run multiple passes over the data to give the learner a chance to converge
Caveat: if you run multiple passes until error goes to 0, there's a danger of over-fitting. The online learner has perfectly learned your examples, but it may not generalize well to unseen data.
The second issue here is that the predictions vw gives are not logistic-function transformed (this is unfortunate). They are akin to standard deviations from the middle point (truncated at [-50, 50]). You need to pipe the predictions via utl/logistic (in the source tree) to get signed probabilities. Note that these signed probabilities are in the range [-1, +1] rather than [0, 1]. You may use logistic -0 instead of logistic to map them to a [0, 1] range.
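For intuition, a tiny R sketch of the two mappings (assuming the usual logistic function; the signed [-1, +1] form appears to be 2*sigmoid(x) - 1, i.e. tanh(x/2)):
raw    <- c(-3, -1, 0, 1, 3)              # hypothetical raw vw predictions
signed <- 2 / (1 + exp(-raw)) - 1         # what piping through `logistic` yields: [-1, +1]
prob   <- 1 / (1 + exp(-raw))             # what `logistic -0` yields: [0, 1]
round(cbind(raw, signed, prob), 3)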
So given the above, here's a recipe that should give you more expected results:
# Train:
vw train.vw -c --passes 1000 -f model.vw --loss_function logistic --holdout_off
# Predict on train set (just as a sanity check) using the just generated model:
vw -t -i model.vw train.vw -p /dev/stdout | logistic | sort -tP -n -k 2
Giving this more expected result on your data-set:
-0.95674145247658 P1
-0.930208359811439 P2
-0.888329575506748 P3
-0.823617739247262 P4
-0.726830630992614 P5
-0.405323815830325 P6
0.0618902961794472 P7
0.298575998150221 P8
0.503468453150847 P9
0.663996516371277 P10
0.715480084449868 P11
0.780212725426778 P12
You could make the results more/less polarized (closer to 1 on the older ages and closer to -1 on the younger) by increasing/decreasing the number of passes. You may also be interested in the following options for training:
--max_prediction <arg> sets the max prediction to <arg>
--min_prediction <arg> sets the min prediction to <arg>
-l <arg> set learning rate to <arg>
For example, by increasing the learning rate from the default 0.5 to a large number (e.g. 10), you can force vw to converge much faster when training on small data-sets, thus requiring fewer passes to get there.
Update
As of mid 2014, vw no longer requires the external logistic utility to map predictions back to [0,1] range. A new --link logistic option maps predictions to the logistic function [0, 1] range. Similarly --link glf1 maps predictions to a generalized logistic function [-1, 1] range.

SPSS creating a loop for a multiple regression over several variables

For my master's thesis I have to use SPSS to analyse my data. I thought I would not have to deal with very difficult statistical issues, which is still true as far as the concepts of my analysis go. But the problem is that in order to create my dependent variable I need to use the syntax editor (programming in general), and I have no experience in this area at all. I hope you can help me in the process of creating my syntax.
I have in total approximately 900 companies with 6 yearly observations each. For all of these companies I need the predicted values of the following company-specific regression:
Y = β1*X1 + β2*X2 + β3*X3 + error
(I know the βs will very likely not be significant, but this is nothing to worry about in my thesis; it will be mentioned in the limitations, though.)
So far my data are ordered in the following way
COMPANY YEAR X1 X2 X3
1 2002
2 2002
1 2003
2 2003
But I could easily change the order, e.g. in
1
1
2
2 etc.
OK, let's say I have rearranged the data: what I need now is for SPSS to compute the company-specific βs and return the output in one column (the predicted values, i.e. those βs multiplied by the specific X values in each row). So I guess what I need is a loop that runs a multiple linear regression over the 6 rows of each of the 939 companies, am I right?
As I said I have no experience at all, so every hint is valuable for me.
Thank you in advance,
Janina.
Bear in mind that with only six observations per company and three (or 4 if you also have a constant term) coefficients to estimate, the coefficient estimates are likely to be very imprecise. You might want to consider whether companies can be pooled at least in part.
You can use SPLIT FILE to estimate the regressions specific for each company, example below. Note that one would likely want to consider other panel data models, and assess whether there is autocorrelation in the residuals. (This is IMO a useful approach though for exploratory analysis of multi-level models.)
The example declares a new dataset to pipe the regression estimates to (see the OUTFILE subcommand on REGRESSION) and suppresses the other tables (with 900+ tables much of the time is spent rendering the output). If you need other statistics either omit the OMS that suppresses the tables, or tweak it to only show the tables you want. (You can use OMS to pipe other results to other datasets as well.)
************************************************************.
*Making Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP #Comp = 1 to 1000.
  COMPUTE #R1 = RV.NORMAL(10,2).
  COMPUTE #R2 = RV.NORMAL(-3,1).
  COMPUTE #R3 = RV.NORMAL(0,5).
  LOOP Year = 2003 to 2008.
    COMPUTE Company = #Comp.
    COMPUTE Rand1 = #R1.
    COMPUTE Rand2 = #R2.
    COMPUTE Rand3 = #R3.
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Companies.
COMPUTE x1 = RV.NORMAL(0,1).
COMPUTE x2 = RV.NORMAL(0,1).
COMPUTE x3 = RV.NORMAL(0,1).
COMPUTE y = Rand1*x1 + Rand2*x2 + Rand3*x3 + RV.NORMAL(0,1).
FORMATS Company Year (F4.0).
*Now sorting cases by Company and Year, then using SPLIT file to estimate
*the regression.
SORT CASES BY Company Year.
*Declare new set and have OMS suppress the other results.
DATASET DECLARE CoeffTable.
OMS
/SELECT TABLES
/IF COMMANDS = ['Regression']
/DESTINATION VIEWER = NO.
*Now split file to get the coefficients.
SPLIT FILE BY Company.
REGRESSION
/DEPENDENT y
/METHOD=ENTER x1 x2 x3
/SAVE PRED (CompSpePred)
/OUTFILE = COVB ('CoeffTable').
SPLIT FILE OFF.
OMSEND.
************************************************************.
