Variational Autoencoders (VAE) are quite a heaving concept themselves. Non-surprisingly most post, comments and tutorials focus on the theory and architecture, but most also fail to address the topic of data scaling. While experimenting with VAEs I have come across a (to me) surprising read flag that the way the data is scaled into an VAE is very important and I could not put my head around it what is the explanation.
To visualize the following issue descripting access the Notebook here: https://github.com/mobias17/VAE-Input-Scaling/blob/master/VAE%20Input%20Scaling.ipynb
Let’s assume the goal is to reconstruct a sine wave (e.g. a sound wave) by a VAE. When I feed the standardized data through the model, it is only able to approximate values between -1 and 1. Obviously, the quick answer would be to normalize the data. Still, this leads to the following questions:
1) What is the rational that the VAE can only approximate values between -1 and 1? (is it the gaussian reparameterization, vanishing gradients?)
2) Is there a way to overcome this boundary (model changes)?
3) What is the best practice to scale data for a VAE? Should the data be normalized over the std dev?
Results showing Sutputs are between -1 & 1
Variational autoencoders can approximate values in any range. The problem here is with this particular model's architecture.
The decoder of this VAE uses as the last layer keras.layers.LSTM.
This layer's default activation function is tanh, and the tanh function outputs values in the range (-1,1). This is why the model cannot generate values outside that range.
However, if we change the activation function to, say, linear, replacing
decoder_mean = LSTM(input_dim, return_sequences=True)
with
decoder_mean = LSTM(input_dim, return_sequences=True, activation=None)
the VAE now can approximate the data. This is the result I got after training for 100 epochs.
The general recommendation is to ensure that the data you are trying to approximate lies in the range of the function you are using to approximate it, either by scaling the data or choosing a more expressive function.
Related
I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the Training set is also around 60% (whereas over-fitting typically means you have a very low training error and high test error),
and I have a very large amount of data examples (don't look at the figure that's only a toy figure I uplaoded).
Can anybody help me understand why adding an extra hidden layer gives
me this drop in performances on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid functions assuming values between 0 and 1, L(s) = 1 / 1 + exp(-s)
I am using early stopping (after 40000 iterations of backprop) as a criteria to stop the learning. I know it is not the best way to stop but I thought that it would ok for such a simple classification task, if you believe this is the main reason I'm not converging I I might implement some better criteria.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x) :
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally using the value of the function itself, as f'(x) = f(x)(1 - f(x)). This function has a nonzero value for x near zero, but quickly goes to zero as |x| gets large :
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the first derivative in such networks can become small very rapidly if the weights in the network fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative becomes exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
However, there are some solutions that can help ! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest solution to approximate using some second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using :
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually get even better than this by using "Nesterov's Accelerated Gradient" :
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first and second order derivatives of your objective function with respect to the parameters. However, the second order derivative, called the Hessian matrix, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.
I have a set of training data, each item in this set consists of 4 numerical values and 1 nominal-value which is the name of the method that these values have been calculated with. (There are 8 methods)
I'm training a Neural Network with these. To get rid of the nominal-value I simply assigned a value from 1 to 8 to each method and used one input to pass it to Neural Network and 4 other inputs for numerical-values. It is sort of working, but the result is not as amazing as I want.
So my question is could it be because of this simple assignment of numbers to nominal-values? or maybe it is because of mixing two different categories of inputs which are not really at the same level (numbers and method types)
As a general note, a better way for coding nominal values would be a binary vector. In your case, in addition to the 4 continuous-valued inputs, you'd have 8 binary input neurons, where only one is activated (1) and the other 7 are inactive.
The way you did it implies an artificial relationship between the computation methods, which is almost certainly an artifact. For example, 1 and 2 are numerically (and from your network's point of view!) nearer than 1 and 8. But are the methods nr. 1 and 2 really more similar, or related, than the methods 1 and 8?
Since you don't provide much detail, my answer can't be very specific.
Generally speaking neural networks tend to perform worse when coding nominal values as numeric values since the transformation will impose a (probably) false ordering on the variables. Mixing inputs with very varied levels also tend to worsen the performance.
However, given the little information provided here there is no way of telling if this is the reason that the networks performance is "not as amazing" as you want. It could just as well be the case that you don't have enough training data, or that your training data contains a lot of noise. Perhaps you need to pre-scale your data, perhaps there is an error in your network code, perhaps you have chosen ill-suited values of constants for your learning algorithm...
The reasons a neural network doesn't perform as expected are many and diverse (on of them beeing unreasonably high expectations). Without much more information there is no way of knowing what the problem is in your case.
Mapping categories to numerical values is not a good practice in statistics. Especially in the case of neural networks. Bear in mind that neural networks tend to map similar inputs to similar outputs. If you map category A to 1 and category B to 2 (both as inputs), the NN will try to output similar values for both categories, even if they have nothing to do with each other.
A sparser representation is preferred. If you have 4 categories, map them like this:
A -> 0001
B -> 0010
etc
Take a look at the "Subject: How should categories be encoded?" in this link:
ftp://ftp.sas.com/pub/neural/FAQ2.html#A_cat
The previous answers are right - do not map nominal values into arbitrary numeric ones. However, if the attribute has an ordinal nature ("Low", "Medium", High" for example), you can replace the nominal values by ascending numeric values. Note that this may not be the optimal solution - since there is no guarantee for example that "High"=3 by the nature of your data. Instead, use one-hot bit encoding as suggested.
The reason for this is that a neural network is very similar to regression in the sense that multiple numeric values go through some kind of an aggregating function - but this happens multiple times. Each input is also multiplied by a weight.
So when you enter a numeric value, it undergoes a series of mathematical manipulations that adjusts its weights in the network. So if you use numeric values for non-nomial data - nominal values that were mapped to closer numeric values will be treated about the same in the best case, in the worst case - it can harm your model.
Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that it making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test only multiple different data sets (in case I randomly upon the most magical train/test pa
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, bSV=64 to nBSV=72, bSV=0.)
((weird)) using the default RBF kernel (vice linear -- i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care that: make sure you're not repeating data (say two frames of the same video fall on different folds), you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.
How can I set Neural Networks so they accept and output a continuous range of values instead of a discrete ones?
From what I recall from doing a Neural Network class a couple of years ago, the activation function would be a sigmoid, which yields a value between 0 and 1. If I want my neural network to yield a real valued scalar, what should I do? I thought maybe if I wanted a value between 0 and 10 I could just multiply the value by 10? What if I have negative values? Is this what people usually do or is there any other way? What about the input?
Thanks
Much of the work in the field of neuroevolution involves using neural networks with continuous inputs and outputs.
There are several common approaches:
One node per value
Linear activation functions - as others have noted, you can use non-sigmoid activation functions on output nodes if you are concerned about the limited range of sigmoid functions. However, this can cause your output to become arbitrarily large, which can cause problems during training.
Sigmoid activation functions - simply scaling sigmoid output (or shifting and scaling, if you want negative values) is a common approach in neuroevolution. However, it is worth making sure that your sigmoid function isn't too steep: a steep activation function means that the "useful" range of values is small, which forces network weights to be small. (This is mainly an issue with genetic algorithms, which use a fixed weight modification strategy that doesn't work well when small weights are desired.)
(source: natekohl.net)
(source: natekohl.net)
Multiple nodes per value - spreading a single continuous value over multiple nodes is a common strategy for representing continuous inputs. It has the benefit of providing more "features" for a network to play with, at the cost of increasing network size.
Binning - spread a single input over multiple nodes (e.g. RBF networks, where each node is a basis function with a different center that will be partially activated by the input). You get some of the benefits of discrete inputs without losing a smooth representation.
Binary representation - divide a single continuous value into 2N chunks, then feed that value into the network as a binary pattern to N nodes. This approach is compact, but kind of brittle and results in input that changes in a non-continuous manner.
There are no rules which require the output ( * ) to be any particular function. In fact we typically need to add some arithmetic operations at the end of the function per-se implemented in a given node, in order to scale and otherwise coerce the output to a particular form.
The advantage of working with all-or-nothing outputs and/or 0.0 to 1.0 normalized output is that it makes things more easily tractable, and also avoid issues of overflowing and such.
( * ) "Output" can be understood here as either the ouptut a given node (neuron) within the network or that of the network as a whole.
As indicated by Mark Bessey the input [to the network as a whole] and the output [of the network] typically receive some filtering/conversion. As hinted in this response and in Mark's comment, it may be preferable to have normalized/standard nodes in the "hidden" layers of the network, and apply some normalization/conversion/discretization as required for the input and/or for the output of the network; Such practice is however only a matter of practicality rather than an imperative requirement of Neural Networks in general.
You will typically need to do some filtering (level conversion, etc) on both the input and the output. Obviously, filtering the input will change the internal state, so some consideration needs to be given to not losing the signal you're trying to train on.
I've read about neural network a little while ago and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways :
1) You get one output neuron. It it's value is > 0.5 the events is likely true, if it's value is <=0.5 the event is likely to be false.
2) You get two output neurons, if the value of the first is > than the value of the second the event is likely true and vice versa.
In these case, the ANN tells you if an event is likely true or likely false. It does not tell how likely it is.
Is there a way to convert this value to some odds or to directly get odds out of the ANN. I'd like to get an output like "The event has a 84% probability to be true"
Once a NN has been trained, for eg. using backprogation as mentioned in the question (whereby the backprogation logic has "nudged" the weights in ways that minimize the error function) the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classifying purposes.
Whereby the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightfoward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of the input x weight for that neuron. This value is then fed to an activation function which purpose's is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated, in particular it needn't be linear, but whatever its shape, typically sigmoid, it operate in the same fashion: figuring out where the activation fits on the curve, and if applicable, above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses - plural) with a percentage value, finds an easy answer: you bet it can, its output(s) is real-valued (if anything in need of normalizing) before we convert it to a discrete value (a boolean or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... How and Where do I get "my percentages"?... All depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values in the 0-1 range and in a fashion that the sum of all percentages "add up" to 1. In its simplest form, the activation function can be used to normalize the value and the weights of the input to the output layer can be used as factors to ensure the "add up" to 1 question (provided that these weights are indeed so normalized themselves).
Et voilà!
Claritication: (following Mathieu's note)
One doesn't need to change anything in the way the Neural Network itself works; the only thing needed is to somehow "hook into" the logic of output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as previously, neither its training nor recognition logic are altered, the inputs to the NN stay the same, as do the connections between various layers etc. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' output etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more output neurons), as with classifiers for example: If all output neurons have the same activation function, the percentage for a given neuron is that of its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case by case situation because the distinct activation functions may be indicative of a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
What you can do is to use a sigmoid transfer function on the output layer nodes (that accepts data ranges (-inf,inf) and outputs a value in [-1,1]).
Then by using the 1-of-n output encoding (one node for each class), you can map the range [-1,1] to [0,1] and use it as probability for each class value (note that this works naturally for more than just two classes).
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs a range from 0 to 1. This would tend to be the case if the transfer function (or output function) in both the preceding stage and providing the final output is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will but repairs are possible. Moreover unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid and weights and activations that range positive and negative (due to the symmetry of this model). Another factor is the prevalence of the class - if it is 50% then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (on backpropogation) and constrain it from going out of the range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing. Ideally the bias of the predictor to positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs bear markets, e.g. credit worthiness of people applying for loans vs people who fail to make loan payments) - calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and the halved difference between them can also be. It is like flipping the negative class neuron and averaging. The differences can also give rise to a probability of significance estimate (using the T-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guestimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0, but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve - as the threshold changes for the output of a single neuron or the difference between two competing neurons, we plot the results true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, as what isn't really a positive is a negative). Then you scan the ROC curve (polyline) point by point (each time the gradient changes) sample by sample and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those that are represented in the calibration set - and in fact any bad points in the ROC curve, represented by deconvexities (dents) can be smoothed over by the convex hull - probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round and although it could be used repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
I will be very prudent in interpreting the outputs of a neural networks (in fact any machine learning classifier) as a probability. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data, we have to infer it. For my experience I din't advice anyone to interpret directly the outputs as probabilities.
did you try prof. Hinton's suggestion of training the network with softmax activation function and cross entropy error?
as an example create a three layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
then train them with cross entropy error softmax transfer with your favourite optimizer stochastic descent/iprop plus/ grad descent. After training the output neurons should be normalized to sum of 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. Shark Machine Learning framework does provide Softmax feature through combining two models. And prof. Hinton an excellent online course # http://coursera.com regarding the details.
I can remember I saw an example of Neural network trained with back propagation to approximate the probability of an outcome in the book Introduction to the theory of neural computation (hertz krogh palmer). I think the key to the example was a special learning rule so that you didn't have to convert the output of a unit to probability, but instead you got automatically the probability as output.
If you have the opportunity, try to check that book.
(by the way, "boltzman machines", although less famous, are neural networks designed specifically to learn probability distributions, you may want to check them as well)
When using ANN for 2-class classification and logistic sigmoid activation function is used in the output layer, the output values could be interpreted as probabilities.
So if you choosing between 2 classes, you train using 1-of-C encoding, where 2 ANN outputs will have training values (1,0) and (0,1) for each of classes respectively.
To get probability of first class in percent, just multiply first ANN output to 100. To get probability of other class use the second output.
This could be generalized for multi-class classification using softmax activation function.
You can read more, including proofs of probabilistic interpretation here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.