I'm trying to train an ANN (I use this library: http://leenissen.dk/fann/ ) and the results are somewhat puzzling: if I run the trained network on the same data used for training, the output is not what is specified in the training set, but some seemingly random number.
For example, the first entry in the training file is something like
88.757004 88.757004 104.487999 138.156006 100.556000 86.309998 86.788002
1
with the first line being the input values and the second line being the desired value of the output neuron. But when I feed the exact same data to the trained network, I get different results on each training attempt, and they are quite different from 1, e.g.:
Max epochs 500000. Desired error: 0.0010000000.
Epochs 1. Current error: 0.0686412785. Bit fail 24.
Epochs 842. Current error: 0.0008697828. Bit fail 0.
my test result -4052122560819626000.000000
and then on another attempt:
Max epochs 500000. Desired error: 0.0010000000.
Epochs 1. Current error: 0.0610717005. Bit fail 24.
Epochs 472. Current error: 0.0009952184. Bit fail 0.
my test result -0.001642
I realize that the training set size may be inadequate (I only have about 100 input/output pairs so far), but shouldn't at least the training data trigger the right output value? The same code works fine for the "getting started" XOR function described on the FANN website (I've already used up my 1 link limit).
Short answer: No
Longer answer (but possibly not as correct):
1st: a training run only moves the weights of the neurons towards a position where they make the output match the training data. After some/many iterations the output should be close to the expected output, if the neural network is up to the task, which brings me to
2nd: Not every neural network works for every problem. For a single neuron it is pretty easy to come up with a simple function that cannot be approximated by that neuron. Though not as easy to see, the same kind of limit applies to every neural network. In such cases your results will very likely look like random numbers. Edit after comment: In many cases this can be fixed by adding neurons to the network.
3rd: actually the first point is a strength of a neural network, because it allows the network to handle outliers nicely.
4th: I blame 3 for my lacking understanding of music. It just doesn't fit my brain ;-)
No, if you get your ANN to work perfectly on the training data, you either have a really easy problem or you're overfitting.
I am using a recurrent neural network for time series prediction with LSTM as the activation function. The inputs are sequence datasets, with the output being the next datum after the input sequence. I have hundreds of inputs, one hidden layer of equal size, and a single output in the output layer. However much I train, the result is always much higher than the actual value (with other functions too), shown respectively by green and blue below. What is the solution?
It seems that LSTM is not suited for this kind of pattern. Softmax works well.
Well,
I am confused really.
I have a simple ordered list of features, i.e. all the letters and a few symbols, each counting how many times that character occurs in a string.
The resulting feature set is as follows:
numberOf_a
numberOf_b
...
numberOf_Z
numberOf_.
numberOf_,
I have a test sample of 65 values, and the MLP can get 46 correct.
Now if I change the order of the features to a random order, train with the same data, and evaluate the same values, I get a different number of correct predictions, e.g. 49.
Results are consistent (the same order will yield the same accuracy) but the accuracy changes between random orders.
The question is, is this supposed to happen? I cannot see how this is backed up by the theory. Am I missing something big here?
PS. I am using WEKA's implementation of the MLP
I'm not familiar with the WEKA implementation of the MLP but that doesn't seem like something that should be happening with a neural network algorithm.
It almost seems like it's getting stuck in some sort of local minimum. The algorithm may be initializing the weights of the individual neurons the same way every time. Changing the parameter order might then cause the algorithm to arrive at the same answer for a certain parameter order each time, dependent on the initial parameter order. The "local minimum" might be determined by the algorithm only going through a certain number of iterations each time.
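To make the point about initialization concrete, here is a minimal sketch in plain Python. It assumes the weights are drawn from a fixed seed on every run (an assumption for illustration, not something confirmed about WEKA's MLP), and the feature values and the `initial_weights` helper are made up. With a fixed seed, reordering the columns pairs each feature with a different starting weight, which amounts to starting training from a different point.

```python
import random

def initial_weights(n_inputs, seed=0):
    """Draw one starting weight per input from a fixed seed (hypothetical init scheme)."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]

def pre_activation(features, weights):
    """Weighted sum a single neuron computes before its activation function."""
    return sum(f * w for f, w in zip(features, weights))

features = [3.0, 0.0, 7.0, 1.0]                    # e.g. counts of four characters
reordered = [features[i] for i in (2, 0, 3, 1)]    # same data, different column order

weights = initial_weights(len(features))           # identical on every run (fixed seed)

# Same order, same seed: the starting point is reproducible.
same_order_twice = pre_activation(features, weights) == pre_activation(features, weights)
# Different order, same seed: the neuron starts from a different value.
differs_after_reorder = pre_activation(features, weights) != pre_activation(reordered, weights)
```

This would be consistent with the observation that a given order always yields the same accuracy while different orders yield different accuracies.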
I have built my first neural network in Python, and I've been playing around with a few datasets; it's going well so far!
I have a quick question regarding modelling events with multiple outcomes:
Say I wish to train a network to tell me the probability of each runner winning a 100m sprint. I would give the network all of the relevant data regarding each runner, and the number of outputs would be equal to the number of runners in the race.
My question is, using a sigmoid function, how can I ensure the sum of the outputs will be equal to 1.0? Will the network naturally learn to do this, or will I have to somehow make this happen explicitly? If so, how would I go about doing this?
Many Thanks.
The output from your neural network will approach 1. I don't think it will actually get to 1.
You actually don't need to see which output is equal to 1. Once you've trained your network up to a specific error level, when you present the inputs, just look for the maximum output in your output layer. For example, let's say your output layer presents the following output: [0.0001, 0.00023, 0.0041, 0.99999412, 0.0012, 0.0002], then the runner that won the race is runner number 4.
So yes, your network will "learn" to produce 1, but it won't exactly be 1. This is why you train to within a certain error rate. I recently created a neural network to recognize handwritten digits, and this is the method that I used. In my output layer, I have a vector with 10 components. The first component represents 0, and the last component represents 9. So when I present a 4 to the network, I expect the output vector to look like [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]. Of course, it's not what I get exactly, but it's what I train the network to provide. So to find which digit it is, I simply check to see which component has the highest output or score.
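Picking the component with the highest score can be sketched in a few lines of Python. The helper name `predicted_class` and the noisy digit vector are made up for illustration; the race vector is the one from the example above.

```python
def predicted_class(outputs):
    """Return the 0-based index of the largest output."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

race_outputs = [0.0001, 0.00023, 0.0041, 0.99999412, 0.0012, 0.0002]
winner = predicted_class(race_outputs) + 1   # runners are numbered from 1 in the text

# A made-up, imperfect output for a handwritten 4 (component 0 represents digit 0).
digit_outputs = [0.02, 0.01, 0.05, 0.03, 0.91, 0.04, 0.02, 0.01, 0.06, 0.08]
digit = predicted_class(digit_outputs)
```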
Now in your second question, I believe you're asking how the network would learn to provide the correct answer? To do this, you need to provide your network with some training data and train it until the output is under a certain error threshold. So what you need is a set of data that contains the inputs and the correct output. Initially your neural network will be set up with random weights (there are some algorithms that help you select better weights to minimize training time, but that's a little more advanced). Next you need a way to tell the neural network to learn from the data provided. So basically you give the data to the neural network and it provides an output, which is highly likely to be wrong. Then you compare that data with the expected (correct) output and you tell the neural network to update its weights so that it gets closer to the correct answer. You do this over and over again until the error is below a certain threshold.
The easiest way to do this is to implement the stochastic backpropagation algorithm. In this algorithm, you calculate the error between the actual output of the neural network and the expected output. Then you backpropagate the error from the output layer all the way back to the weights of the hidden layer, adjusting the weights as you go. Then you repeat this process until the error that you calculate is below a certain threshold. So during each step, you're getting closer and closer to your solution.
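As a rough illustration of that loop, here is a minimal sketch of per-sample (stochastic) backpropagation in plain Python. Everything specific in it is an assumption chosen for the example: the toy data set (logical OR), one hidden layer of three sigmoid units, the learning rate, the seed, and the error threshold. It is not the linked Java implementation, and it leaves out momentum and annealing.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Return (hidden activations, output); the last weight in each row is a bias."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1]) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out[:-1], hidden)) + w_out[-1])
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, n_hidden=3, rate=0.5, max_epochs=2000, target_error=0.01, seed=1):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Start from small random weights.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    initial_error = mean_squared_error(data, w_hidden, w_out)
    samples = list(data)
    for _ in range(max_epochs):
        rng.shuffle(samples)                      # present samples in random order
        for x, target in samples:
            hidden, out = forward(x, w_hidden, w_out)
            # Output delta: error times the sigmoid derivative out * (1 - out).
            delta_out = (out - target) * out * (1.0 - out)
            # Hidden deltas: output delta sent back through the (old) output weights.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Adjust output-layer weights and bias.
            for j in range(n_hidden):
                w_out[j] -= rate * delta_out * hidden[j]
            w_out[-1] -= rate * delta_out
            # Adjust hidden-layer weights and biases.
            for j in range(n_hidden):
                for i in range(n_in):
                    w_hidden[j][i] -= rate * delta_hidden[j] * x[i]
                w_hidden[j][-1] -= rate * delta_hidden[j]
        # Stop once the error is below the chosen threshold.
        if mean_squared_error(data, w_hidden, w_out) < target_error:
            break
    return w_hidden, w_out, initial_error

OR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w_hidden, w_out, initial_error = train(OR_DATA)
final_error = mean_squared_error(OR_DATA, w_hidden, w_out)
```

The outputs approach 0 and 1 without reaching them exactly, which is why training stops at an error threshold rather than at zero error.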
You can use the algorithm described here. There is a decent amount of math involved, so be prepared for that! If you want to see an example of an implementation of this algorithm, you can take a look at this Java code that I have on github. The code uses momentum and a simple form of simulated annealing as well, but the standard backpropagation algorithm should be easily discernible. The Wikipedia article on backpropagation has a link to an implementation of the backpropagation algorithm in Python.
You're probably not going to understand the algorithm immediately; expect to spend some time understanding it and working through some of the math. I sat down with a pencil and paper as I was coding, and that's how I eventually understood what was going on.
Here are a few resources that should help you understand backpropagation a little better:
The learning process: backpropagation
Error backpropagation
If you want some more resources, you can also take a look at my answer here.
Basically you want a function of multiple real numbers that converts those real numbers into probabilities (each between 0 and 1, summing to 1). You can do this easily by post-processing the output of your network.
Your network gives you real numbers r1, r2, ..., rn that increase with the probability that each runner wins the race.
Then compute exp(r1), exp(r2), ..., and sum them up as ers = exp(r1) + exp(r2) + ... + exp(rn). Then the probability that the first racer wins is exp(r1) / ers.
This is one use of the Boltzmann distribution (the same normalization is commonly called the softmax function): http://en.wikipedia.org/wiki/Boltzmann_distribution
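The post-processing step described in this answer can be sketched as follows. The scores are made-up stand-ins for the network outputs r1, r2, r3, and subtracting the maximum before exponentiating is an added numerical-stability detail that does not change the result.

```python
import math

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    shift = max(scores)                      # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)                        # this is "ers" in the text, up to the shift
    return [e / total for e in exps]

runner_scores = [2.0, 1.0, 0.1]              # hypothetical network outputs r1, r2, r3
win_probabilities = softmax(runner_scores)
```

A higher score always maps to a higher probability, so the ranking of the runners is unchanged; only the scale is normalized.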
Your network should work around that and learn it naturally eventually.
To make the network learn that a little faster, here's what springs to mind first:
add an additional output called 'sum' (summing all the other output neurons). If you want all the output neurons to be in a separate layer, just add a layer of outputs: the first numRunners outputs each connect only to the corresponding neuron in the previous layer, and the last, (numRunners+1)-th neuron connects to all the neurons in the previous layer, with its weights fixed to 1
the training set would contain 0-1 vectors for each runner (did-did not run), and the "expected" result would be a 0-1 vector 00..00001000..01 first 1 marking the runner that won the race, last 1 marking the "sum" of "probabilities"
for the unknown races, the network would try to predict which runner would win. Since the outputs have continuous values (more-or-less :D) they can be read as "the certainty of the network that the runner would win the race" -- which is what you're looking for
Even without the additional sum neuron, this is the rough description of the way the training data should be arranged.
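For illustration, the "expected" vector described above could be built like this. The helper `race_target` and its parameters are hypothetical names, and the trailing slot is the optional 'sum' output.

```python
def race_target(n_runners, winner_index, with_sum=True):
    """Expected output: 1 for the winner, 0 elsewhere, plus an optional trailing 'sum' slot set to 1."""
    target = [0] * n_runners
    target[winner_index] = 1
    if with_sum:
        target.append(1)   # the 'sum' of the per-runner outputs should be 1
    return target

with_sum_slot = race_target(5, 2)
without_sum_slot = race_target(3, 0, with_sum=False)
```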
I have an AI project, which uses a Backpropagation neural network.
It has been training for about 1 hour, and it has learned 60-70 of the 100 inputs. I mean, 60-70 inputs come out correct under the backpropagation criterion (the number of learned inputs fluctuates between 60 and 70).
And currently, more than 10000 epochs are completed, and each epoch is taking almost 0.5 seconds.
How can I know whether the neural network will train successfully if I leave it running for a long time (or whether it can't get any better)?
Check out my answer to this question: whats is the difference between train, validation and test set, in neural networks?
You should use 3 sets of data:
Training
Validation
Testing
The Validation data set tells you when you should stop (as I said in the other answer):
The validation data set is used to minimize overfitting. You're not
adjusting the weights of the network with this data set, you're just
verifying that any increase in accuracy over the training data set
actually yields an increase in accuracy over a data set that has not
been shown to the network before, or at least the network hasn't
trained on it (i.e. validation data set). If the accuracy over the
training data set increases, but the accuracy over the validation
data set stays the same or decreases, then you're overfitting your
neural network and you should stop training.
A good method for validation is to use 10-fold (k-fold) cross-validation. Additionally, there are specific "strategies" for splitting your data set into training, validation and testing. It's somewhat of a science in itself, so you should read up on that too.
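The stopping rule from the quote can be sketched as a small function over per-epoch validation accuracies. The `patience` parameter (how many epochs without improvement to tolerate) and the accuracy values are assumptions added for the example; the quoted text does not prescribe a specific number.

```python
def early_stopping_epoch(validation_accuracies, patience=3):
    """Return the epoch with the best validation accuracy, scanning until the
    accuracy has failed to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_accuracy = float("-inf")
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break   # validation accuracy has stalled: further training likely overfits
    return best_epoch

# Made-up validation accuracies, one per epoch.
validation_accuracies = [0.60, 0.69, 0.75, 0.78, 0.77, 0.76, 0.74, 0.79]
stop_at = early_stopping_epoch(validation_accuracies)
```

In practice you would keep a copy of the weights from the best epoch and restore them when training stops.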
Update
Regarding your comment on the error, I would point you to some resources which can give you a better understanding of neural networks (it's kinda math heavy, but see below for more info):
http://www.colinfahey.com/neural_network_with_back_propagation_learning/neural_network_with_back_propagation_learning_en.html
http://www.willamette.edu/~gorr/classes/cs449/linear2.html
Section 5.9 of Colin Fahey's article describes it best:
Backward error propagation formula:
The error values at the neural network outputs are computed using the following formula:
Error = (Output - Desired); // Derived from: Output = Desired + Error;
The error accumulation in a neuron body is adjusted according to the output of the neuron body and the output error (specified by links connected to the neuron body).
Each output error value contributes to the error accumulator in the following manner:
ErrorAccumulator += Output * (1 - Output) * OutputError;
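The two quoted formulas translate directly into Python. This is only a sketch for a single output neuron with made-up numbers; the function names are hypothetical, and the reading of Output * (1 - Output) as the derivative of the logistic sigmoid assumes that activation function is used.

```python
def output_error(output, desired):
    """Error = Output - Desired, as in the quoted formula."""
    return output - desired

def accumulate_error(error_accumulator, output, output_error_value):
    """ErrorAccumulator += Output * (1 - Output) * OutputError.
    Output * (1 - Output) is the derivative of the logistic sigmoid at that output."""
    return error_accumulator + output * (1.0 - output) * output_error_value

err = output_error(0.8, 1.0)             # roughly -0.2: the output is too low
acc = accumulate_error(0.0, 0.8, err)    # roughly 0.8 * 0.2 * -0.2 = -0.032
```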
Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test only multiple different data sets (in case I randomly upon the most magical train/test pa
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, bSV=64 to nBSV=72, bSV=0.)
((weird)) using the default RBF kernel (vice linear -- i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should make sure that you're not repeating data (say, two frames of the same video falling in different folds), that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results showing that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
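For the first idea, one way to rule out leakage across individuals is to split by person rather than by frame, since a random per-instance split can put frames of the same person on both sides. A minimal sketch in plain Python, assuming you have a per-frame person identifier (the `frame_person` list, the function name, and the 20% test fraction are all hypothetical):

```python
import random

def split_by_individual(person_ids, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) such that no person appears in both."""
    people = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, round(len(people) * test_fraction))
    test_people = set(people[:n_test])
    train_idx = [i for i, p in enumerate(person_ids) if p not in test_people]
    test_idx = [i for i, p in enumerate(person_ids) if p in test_people]
    return train_idx, test_idx

# Hypothetical labels saying which person each frame shows.
frame_person = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p2"]
train_idx, test_idx = split_by_individual(frame_person)
```

If accuracy drops sharply with a split like this, the classifier was probably recognizing individuals rather than expressions.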
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.
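As a rough stand-in for method #1 (this is not classregtree or rpart, just a depth-one check written for illustration), the sketch below looks for any single feature whose values separate the two classes with one threshold. The data and the function name are made up.

```python
def separating_features(rows, labels):
    """Return indices of features where a single threshold splits the two classes perfectly."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("binary labels expected")
    found = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == classes[0]]
        b = [r[j] for r, y in zip(rows, labels) if y == classes[1]]
        # Perfect separation: every value of one class lies below every value of the other.
        if max(a) < min(b) or max(b) < min(a):
            found.append(j)
    return found

# Made-up data: only the third feature (index 2) separates the classes on its own.
rows = [[0.2, 5.0, 1.0], [0.9, 6.0, 1.1], [0.4, 4.0, 3.0], [0.7, 5.5, 3.2]]
labels = [-1, -1, 1, 1]
suspects = separating_features(rows, labels)
```

Any feature this reports is worth inspecting by hand, since it may encode the label or an artifact of how the data was collected.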