SMOTE before train test split, SAME AUC as testing set accuracy - auc

Mates, I wanna apply SMOTE to all of my training and testing data, is it inappropriate.
Does it cause the same value in my testing set accuracy as well as my AUC value?

Related

Logistic Regression with Gradient Descent on large data

I have a training set with about 300000 examples and about 50-60 features and also it's a multiclass with about 7 classes. I have my logistic regression function that finds out the convergence of the parameters using gradient descent. My gradient descent algorithm, finds the parameters in matrix form as it's faster in matrix form than doing separately and linearly in loops.
Ex :
Matrix(P) <- Matrix(P) - LearningRate( T(Matrix(X)) * ( Matrix(h(X)) -Matrix(Y) ) )
For small training data, it's quite fast and gives correct values with maximum iterations to be around 1000000, but with that much training data, it's extremely slow, that with around 500 iterations it takes 18 minutes, but with that much iterations in gradient descent, the cost is still high and it does not predict the class correctly.
I know, I should implement maybe feature selection, or feature scaling and I can't use the packages provided. Language used is R. How do I go about implementing feature selection or scaling without using any library packages.
According to link, you can use either Z-score normalization or min-max scaling method. Both methods scale the data to [0,1] range. Z-score normalization is calculated as
Min-max scaling method is calculated as:

How is the hold out set chosen in vowpal wabbit

I am using vowpal wabbit for logistic regression. I came to know that vowpal wabbit selects a hold out set for validation from the given training data. Is this set chosen randomly. I have a very unbalanced dataset with 100 +ve examples and 1000 -ve examples. I want to know given this training data, how vowpal wabbit selects the hold out examples?
How do I assign more weights to the +ve examples
By default each 10th example is used for holdout (you can change it with --holdout_period,
see https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments#holdout-options).
This means the model trained with holdout evaluation on is trained only on 90% of the training data.
This may result in slightly worse accuracy.
On the other hand, it allows you to use --early_terminate (which is set to 3 passes by default),
which makes it easier to reduce the risk of overtraining caused by too many training passes.
Note that by default holdout evaluation is on, only if multiple passes are used (VW uses progressive validation loss otherwise).
As for the second question, you can add importance weight to the positive examples. The default importance weight is 1. See https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format

Data Mining KNN Classifier

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting whether a customer will buy a mobile home insurance policy. S/he tried kNN classifier with different number of neighbours (k=1,2,3,4,5). S/he got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that the analyst decided to deploy kNN with k=1. Was it a good choice? How would you select an optimal number of neighbours in this case?
It is not a good idea to select a parameter of a prediction algorithm using the whole training set as the result will be biased towards this particular training set and has no information about generalization performance (i.e. performance towards unseen cases). You should apply a cross-validation technique e.g. 10-fold cross-validation to select the best K (i.e. K with largest F-value) within a range.
This involves splitting your training data in 10 equal parts retain 9 parts for training and 1 for validation. Iterate such that each part has been left out for validation. If you take enough folds this will allow you as well to obtain statistics of the F-value and then you can test whether these values for different K values are statistically significant.
See e.g. also:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm
The subtlety here however is that there is likely a dependency between the number of data points for prediction and the K-value. So If you apply cross-validation you use 9/10 of the training set for training...Not sure whether any research has been performed on this and how to correct for that in the final training set. Anyway most software packages just use the abovementioned techniques e.g. see SPSS in the link.
A solution is to use leave-one-out cross-validation (each data samples is left out once for testing) in that case you have N-1 training samples(the original training set has N).

How to know that the Backpropagation can train successfully or not?

I have an AI project, which uses a Backpropagation neural network.
It is training for about 1 hour, and it has trained 60-70 inputs from all 100 inputs. I mean, 60-70 inputs are correct in the condition of Backpropagation. (the number of trained inputs is moving between 60 and 70).
And currently, more than 10000 epochs are completed, and each epoch is taking almost 0.5 seconds.
How to know if the neural network can be trained successfully if I leave it for a long time? (or it can't train better?)
Check out my answer to this question: whats is the difference between train, validation and test set, in neural networks?
You should use 3 sets of data:
Training
Validation
Testing
The Validation data set tells you when you should stop (as I said in the other answer):
The validation data set is used to minimize overfitting. You're not
adjusting the weights of the network with this data set, you're just
verifying that any increase in accuracy over the training data set
actually yields an increase in accuracy over a data set that has not
been shown to the network before, or at least the network hasn't
trained on it (i.e. validation data set). If the accuracy over the
training data set increases, but the accuracy over then validation
data set stays the same or decreases, then you're overfitting your
neural network and you should stop training.
A good method for validation is to use 10-fold (k-fold) cross-validation. Additionally, there are specific "strategies" for splitting your data set into training, validation and testing. It's somewhat of a science in itself, so you should read up on that too.
Update
Regarding your comment on the error, I would point you to some resources which can give you a better understanding of neural networks (it's kinda math heavy, but see below for more info):
http://www.colinfahey.com/neural_network_with_back_propagation_learning/neural_network_with_back_propagation_learning_en.html
http://www.willamette.edu/~gorr/classes/cs449/linear2.html
Section 5.9 of Colin Fahey article describes it best:
Backward error propagation formula:
The error values at the neural network outputs are computed using the following formula:
Error = (Output - Desired); // Derived from: Output = Desired + Error;
The error accumulation in a neuron body is adjusted according to the output of the neuron body and the output error (specified by links connected to the neuron body).
Each output error value contributes to the error accumulator in the following manner:
ErrorAccumulator += Output * (1 - Output) * OutputError;

How to convert the output of an artificial neural network into probabilities?

I've read about neural network a little while ago and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways :
1) You get one output neuron. It it's value is > 0.5 the events is likely true, if it's value is <=0.5 the event is likely to be false.
2) You get two output neurons, if the value of the first is > than the value of the second the event is likely true and vice versa.
In these case, the ANN tells you if an event is likely true or likely false. It does not tell how likely it is.
Is there a way to convert this value to some odds or to directly get odds out of the ANN. I'd like to get an output like "The event has a 84% probability to be true"
Once a NN has been trained, for eg. using backprogation as mentioned in the question (whereby the backprogation logic has "nudged" the weights in ways that minimize the error function) the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classifying purposes.
Whereby the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightfoward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of the input x weight for that neuron. This value is then fed to an activation function which purpose's is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated, in particular it needn't be linear, but whatever its shape, typically sigmoid, it operate in the same fashion: figuring out where the activation fits on the curve, and if applicable, above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses - plural) with a percentage value, finds an easy answer: you bet it can, its output(s) is real-valued (if anything in need of normalizing) before we convert it to a discrete value (a boolean or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... How and Where do I get "my percentages"?... All depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values in the 0-1 range and in a fashion that the sum of all percentages "add up" to 1. In its simplest form, the activation function can be used to normalize the value and the weights of the input to the output layer can be used as factors to ensure the "add up" to 1 question (provided that these weights are indeed so normalized themselves).
Et voilĂ !
Claritication: (following Mathieu's note)
One doesn't need to change anything in the way the Neural Network itself works; the only thing needed is to somehow "hook into" the logic of output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as previously, neither its training nor recognition logic are altered, the inputs to the NN stay the same, as do the connections between various layers etc. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' output etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more output neurons), as with classifiers for example: If all output neurons have the same activation function, the percentage for a given neuron is that of its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case by case situation because the distinct activation functions may be indicative of a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
What you can do is to use a sigmoid transfer function on the output layer nodes (that accepts data ranges (-inf,inf) and outputs a value in [-1,1]).
Then by using the 1-of-n output encoding (one node for each class), you can map the range [-1,1] to [0,1] and use it as probability for each class value (note that this works naturally for more than just two classes).
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs a range from 0 to 1. This would tend to be the case if the transfer function (or output function) in both the preceding stage and providing the final output is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will but repairs are possible. Moreover unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid and weights and activations that range positive and negative (due to the symmetry of this model). Another factor is the prevalence of the class - if it is 50% then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (on backpropogation) and constrain it from going out of the range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing. Ideally the bias of the predictor to positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs bear markets, e.g. credit worthiness of people applying for loans vs people who fail to make loan payments) - calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and the halved difference between them can also be. It is like flipping the negative class neuron and averaging. The differences can also give rise to a probability of significance estimate (using the T-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guestimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0, but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve - as the threshold changes for the output of a single neuron or the difference between two competing neurons, we plot the results true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, as what isn't really a positive is a negative). Then you scan the ROC curve (polyline) point by point (each time the gradient changes) sample by sample and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those that are represented in the calibration set - and in fact any bad points in the ROC curve, represented by deconvexities (dents) can be smoothed over by the convex hull - probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round and although it could be used repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
I will be very prudent in interpreting the outputs of a neural networks (in fact any machine learning classifier) as a probability. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data, we have to infer it. For my experience I din't advice anyone to interpret directly the outputs as probabilities.
did you try prof. Hinton's suggestion of training the network with softmax activation function and cross entropy error?
as an example create a three layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
then train them with cross entropy error softmax transfer with your favourite optimizer stochastic descent/iprop plus/ grad descent. After training the output neurons should be normalized to sum of 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. Shark Machine Learning framework does provide Softmax feature through combining two models. And prof. Hinton an excellent online course # http://coursera.com regarding the details.
I can remember I saw an example of Neural network trained with back propagation to approximate the probability of an outcome in the book Introduction to the theory of neural computation (hertz krogh palmer). I think the key to the example was a special learning rule so that you didn't have to convert the output of a unit to probability, but instead you got automatically the probability as output.
If you have the opportunity, try to check that book.
(by the way, "boltzman machines", although less famous, are neural networks designed specifically to learn probability distributions, you may want to check them as well)
When using ANN for 2-class classification and logistic sigmoid activation function is used in the output layer, the output values could be interpreted as probabilities.
So if you choosing between 2 classes, you train using 1-of-C encoding, where 2 ANN outputs will have training values (1,0) and (0,1) for each of classes respectively.
To get probability of first class in percent, just multiply first ANN output to 100. To get probability of other class use the second output.
This could be generalized for multi-class classification using softmax activation function.
You can read more, including proofs of probabilistic interpretation here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.

Resources