Backpropagation overall error chart with very small slope... Is this normal?

I'm training a neural network with the backpropagation algorithm, and this is the chart of the overall error:
(I'm calculating the overall error with the formula from http://www.colinfahey.com/neural_network_with_back_propagation_learning/neural_network_with_back_propagation_learning_en.html, part 6.3: overall training error.)
I fitted a power trendline, and extrapolating it shows that at epoch 13,000 the overall error would still be about 0.2.
Isn't that too high?
Is this chart normal? It seems the training process will take too long... right? What should I do? Isn't there a faster way?
EDIT: My neural network has one hidden layer with 200 neurons, and my input and output layers have 10-12 neurons each. My problem is clustering characters: it clusters Persian characters into groups using supervised training.

So you are using an ANN with 200 input nodes and 10-12 nodes in the hidden layer. What activation function, if any, are you using for your hidden layer and output layer?
Is this the standard backpropagation training algorithm, and what training function are you using?
Each type of training function affects the speed of training and, in some cases, the ability to generalise: you don't want to train against your data so hard that your neural network is only good for your training data.
So ideally you want decent training data, which could be a subsample of your real data, say 15%.
You could train your network using a conjugate gradient based algorithm:
http://www.mathworks.co.uk/help/toolbox/nnet/ug/bss331l-1.html#bss331l-2
This will train your network quickly.
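Outside MATLAB, the same idea can be sketched with SciPy's general-purpose conjugate gradient optimizer; the network shape and training data below are made up for illustration:

import numpy as np
from scipy.optimize import minimize

# Made-up problem: 100 samples, 4 inputs, 12 hidden tanh units, 2 sigmoid outputs.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 4))
T = rng.uniform(0, 1, size=(100, 2))
n_in, n_hid, n_out = 4, 12, 2

def unpack(w):
    split = n_in * n_hid
    return w[:split].reshape(n_in, n_hid), w[split:].reshape(n_hid, n_out)

def loss(w):
    W1, W2 = unpack(w)
    H = np.tanh(X @ W1)                  # hidden layer
    Y = 1.0 / (1.0 + np.exp(-(H @ W2)))  # output layer
    return 0.5 * np.mean((Y - T) ** 2)   # overall training error (MSE)

w0 = rng.normal(scale=0.1, size=n_in * n_hid + n_hid * n_out)
result = minimize(loss, w0, method="CG")  # conjugate gradient minimisation
print("final overall error:", result.fun)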
10-12 nodes may not be ideal for your data; you can try changing the number in blocks of 5, or add another layer. In general, more layers will improve your network's ability to classify the problem, but will also increase the computational complexity and hence slow down training.
Presumably these 10-12 nodes are 'features' you are trying to classify?
If so, you may wish to normalise them: rescale each to between 0 and 1, or -1 and 1, depending on your activation function (e.g. tan-sigmoid produces values in the range -1 to +1):
http://www.heatonresearch.com/node/706
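For example, a minimal rescaling sketch (the feature values are made up; each column is rescaled to -1..1 for a tanh-style activation):

import numpy as np

# Made-up feature matrix: rows are samples, columns are features.
X = np.array([[170.0, 12.0],
              [182.0, 45.0],
              [158.0, 30.0]])

# Min-max rescale each column to [-1, 1], suitable for a tanh activation.
lo, hi = X.min(axis=0), X.max(axis=0)
X_scaled = 2.0 * (X - lo) / (hi - lo) - 1.0
print(X_scaled)  # each column now spans -1 .. +1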
You may also train a neural network to identify the ideal number of nodes you should have in your hidden layer.

Related

Is dimensionality reduction reversible?

I have implemented a dimensionality reduction algorithm using ENCOG that takes a dataset (call it A) with multiple features and reduces it to a dataset (B) with only one feature (I need that for time series analysis).
Now my question is: given a value from B, predicted by the time series analysis, can I convert it back to two dimensions like in the A dataset?
No, dimensionality reduction is not reversible in general. It loses information.
Dimensionality reduction (compression of information) is reversible in auto-encoders. An auto-encoder is a regular neural network with a bottleneck layer in the middle: you have, for instance, 20 inputs in the first layer, 10 neurons in the middle layer, and again 20 neurons in the last layer. When you train such a network, you force it to compress the information into 10 neurons and then decompress it again, minimising the error at the last layer (the desired output vector equals the input vector). When you train such a network (in the linear case) with the well-known backpropagation algorithm, it performs PCA, Principal Component Analysis. PCA returns uncorrelated features and is not very powerful.
By using a more sophisticated algorithm to train the auto-encoder, you can make it perform nonlinear ICA, Independent Component Analysis. ICA returns statistically independent features. This training algorithm searches for low-complexity neural networks with high generalization capability; as a byproduct of the regularization you get ICA.
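For illustration, here is a minimal numpy sketch of a 20-10-20 auto-encoder trained with plain backpropagation to reproduce its input; the data, sizes and learning rate are made up:

import numpy as np

# Made-up data: 200 samples with 20 features each.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 20))

n_in, n_mid = 20, 10  # 20 -> 10 -> 20 bottleneck
W1 = rng.normal(scale=0.1, size=(n_in, n_mid))
W2 = rng.normal(scale=0.1, size=(n_mid, n_in))
lr = 0.1

for epoch in range(1000):
    H = np.tanh(X @ W1)   # compress to 10 features
    Y = H @ W2            # linear reconstruction layer
    E = Y - X             # desired output vector equals the input vector
    # Backpropagate the reconstruction error through both layers.
    dW2 = H.T @ E / len(X)
    dH = (E @ W2.T) * (1.0 - H ** 2)  # tanh derivative
    dW1 = X.T @ dH / len(X)
    W2 -= lr * dW2
    W1 -= lr * dW1

code = np.tanh(X @ W1)  # the 10-dimensional compressed representation
print("reconstruction MSE:", np.mean((code @ W2 - X) ** 2))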

How to determine optimum hidden layers and neurons based on inputs and outputs in a NN?

I'm referring mostly to this paper: http://clgiles.ist.psu.edu/papers/UMD-CS-TR-3617.what.size.neural.net.to.use.pdf
Current Setup:
I'm currently trying to port the neural-genetic AI solution that I have lying around into a multi-purpose multi-agent tool. So, for example, it should work as an AI in a game engine for moving entities around and letting them shoot at and destroy the enemy (so e.g. 4 inputs like distance x, y and angle x, y, and 2 outputs like accelerate left/right).
The state so far is that I'm using the same number of genomes as there are agents, to determine the fittest agents. The fittest 20% of agents are combined with each other (zz, zw genomes selected) and each pair creates 2 babies for the new population. The rest of the new population per new generation is selected randomly from the old population, including the fittest paired with an unfit genome.
That works pretty well to prime the AI; after generation 50-100 it is pretty much unbeatable by a human in a Breakout clone and in a little tank game where you can shoot and move around.
As I had the idea of using one evolution population for each type of agent, the question now is whether it is possible to determine the number of hidden layers and the number of neurons per hidden layer generically.
My setup for the tank game is 4 inputs, 3 outputs and 1 hidden layer with 12 neurons, which worked best (around 50 generations to get really strong).
My setup for the Breakout game is 6 inputs, 2 outputs and 2 hidden layers with 12 neurons, which seems to work best.
Done Research:
So, back to the paper: on page 32 you can see that more neurons per hidden layer of course need more priming time, but the more neurons there are in between, the better the chances of fitting the function without noise.
I currently prime my AI using only the fitness increase from successfully doing better than the last try.
So in the tank game that means it successfully shot the other tank (wounding it 4 times is better; then the enemy is dead) and won the round.
In the Breakout game it's similar: I have a paddle that the AI can move around, and it can collect points. "Getting shot", i.e. negative treatment here, is failing to catch the ball. So the potentially noisy mapping is 2 output values (move-left, move-right) that depend on 4 input values (ball x, y, degx, degy).
Questions:
So, what kind of calculation for the number of hidden layers and the number of neurons do you think would be a good tradeoff, so that noise doesn't kill the genome evolution?
What is the minimum number of agents before you can say "it evolves further"? My current training setup always has around 50 agents learning in parallel (so they basically simulate 50 games in parallel behind the scenes).
In sum, for most problems one could probably get decent performance (even without a second optimization step) by setting the hidden-layer configuration using just two rules: (i) the number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neuron counts of the input and output layers (e.g. for the tank setup above, 4 inputs and 3 outputs would suggest one hidden layer of 3-4 neurons).
-doug
In short: it's an ongoing area of research. Most ANNs (all that I know of) using numerous neurons and hidden layers don't set a static number of either; instead they use algorithms that continuously modify these values, usually constructing and destroying neurons as the outputs converge or diverge.
Since it sounds like you're already using some evolutionary computing, consider looking into Andrew Turner's work on CGPANN; I remember it getting pretty decent improvements on benchmarks similar to your work.

How to determine the threshold for neuron firings in neural networks?

I have a simple task: classify people into the MAN or WOMAN category by their height and hair length, using a neural network. I want to teach it the pattern with some examples and then use it to classify on its own.
I have a basic understanding of neural networks but would really need some help here.
I know that each neuron divides the space into two subareas; basically, that is why P = w0 + w1*x1 + w2*x2 + ... + wn*xn is used here (the weights just move the line, if we consider the geometric representation).
I understand that each epoch should modify the weights to get closer to the correct result, yet I have never programmed it and have no idea how to start.
How should I proceed? That is: how can I determine the threshold, and how should I deal with the inputs?
This is not homework, but rather a task for anyone interested. I am, and I would like to understand it.
Looks like you are dealing with a simple perceptron with a threshold activation function. Have a look at this question. Since you ARE using a bias weight (w0), you would set the threshold to 0.
You then simply take the output P of your network and compare it to 0; e.g. output class 1 if P < 0 and class 2 if P > 0. You could treat the case P = 0 as "indistinct".
For learning the weights you need to apply the delta learning rule, which can be implemented very easily. But be careful: a perceptron with a simple threshold activation function can only be correct if your data are linearly separable. If you have more complex data you will need a multilayer perceptron and a nonlinear activation function such as the logistic sigmoid function.
Have a look at Geoffrey Hinton's Coursera course, lecture 2, for details.
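As an illustration of the pieces above (bias weight, threshold at 0, delta rule), here is a minimal perceptron sketch on made-up height/hair-length data, rescaled to -1..1:

import numpy as np

# Made-up training data: [height_cm, hair_length_cm] -> class (-1 = man, +1 = woman).
X = np.array([[180.0, 5.0], [175.0, 10.0], [165.0, 40.0], [160.0, 35.0]])
t = np.array([-1, -1, 1, 1])

# Rescale each feature column to [-1, 1] so no single input dominates.
X = 2 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) - 1

w = np.zeros(3)  # w[0] is the bias weight, so the firing threshold is 0
lr = 0.1

for epoch in range(100):
    for x, target in zip(X, t):
        xb = np.concatenate(([1.0], x))  # prepend the constant bias input
        y = 1 if w @ xb > 0 else -1      # threshold activation at 0
        w += lr * (target - y) * xb      # delta rule: update only on mistakes

print("weights:", w)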
I've been working with machine learning lately (though I'm not an expert), and you should look at the Accord.NET framework. It contains all the common machine learning algorithms out of the box, so it's easy to take an existing sample and modify it instead of starting from scratch. Also, the developer of the framework is very helpful in the forum available on the same page.
With the available samples you may also discover something better than a neural network, like the kernel support vector machine. If you stick with the neural network, have fun modifying all the different variables; through trial and error you will understand how it works.
Have fun!
Since you said:
I know that each neuron divides the area to two subareas
&
weights are just moving the line if we consider geometric representation
I think you want to use a perceptron or an ADALINE neural network. These networks can only classify linearly separable patterns. Since your input data is complicated, it's better to use a multilayer nonlinear neural network (my suggestion is a two-layer network with the tanh activation function). To train such networks you should use the backpropagation algorithm.
For answering to
how should I deal with the inputs?
I need to know more details about the inputs (e.g.: are they just height and hair length, or is there more? What are their ranges and your resolution? etc.).
If you're dealing with just height and hair length, I suggest dividing heights and lengths into classes (for example 160cm-165cm, 165cm-170cm, etc.) and assigning an on/off input neuron to each class. Then put one hidden layer after all the height classes and another hidden layer after all the hair-length classes (tanh activation function). The number of neurons in these two hidden layers is determined by the number of training cases.
Then take the outputs of these two hidden layers and send them to an aggregation layer with 1 output neuron.
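A minimal sketch of the suggested on/off binning for the height input (the bin edges are just an example):

import numpy as np

# Hypothetical 5 cm height bins; each bin becomes one on/off input neuron.
height_bins = [(150, 155), (155, 160), (160, 165), (165, 170), (170, 175), (175, 180)]

def encode_height(height_cm):
    # One on/off (1/0) value per bin; exactly one bin is active.
    return np.array([1.0 if lo <= height_cm < hi else 0.0 for lo, hi in height_bins])

print(encode_height(167.0))  # -> [0. 0. 0. 1. 0. 0.]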

Neural network inputs

I don't really know how to word this question so bear with me..
Let's say I am developing a neural network for rating each runner in an athletics race. I give the neural network information about each runner, e.g. win %, days since last run, etc.
My question is: in this case, where the neural network is rating runners, can I give the network an input like the race weather? E.g. 1.00 for hot, 2.00 for cold, 3.00 for OK?
The reason I am asking: the greater the output of the neural network, the better the runner. So the higher the win % input, the higher the rating. If I give the neural network inputs where a greater value doesn't necessarily mean a better runner, will the network be able to understand and use/interpret such an input?
Please let me know if the question doesn't make sense!
Neural nets can correctly model irrelevant inputs (by assigning them low weights) and inputs that are inversely related to the desired output (by assigning negative weights). Neural nets do better with inputs that vary continuously, so your example of 1.00 for hot, 2.00 for cold, 3.00 for OK is not ideal: better would be 0.00 for hot, 1.00 for OK, 2.00 for cold, so that the values follow the natural ordering.
In situations such as a country code, where there is no real continuous relationship, the best encoding (from a convergence standpoint) is a set of boolean attributes (isArgentina, isAustralia, ..., isZambia). Even without that, though, a neural net ought to be able to model an input of discrete values (i.e., if countries were relevant and you encoded them as numbers, a neural net should eventually be able to learn that, say, 87 (Kenya) is correlated with high performance). In such a situation it might just take more hidden nodes or a longer training period.
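As a small illustration of both encodings (the ordered weather scale and the boolean country attributes; the category lists below are made up):

import numpy as np

# Ordered encoding for weather: hot < OK < cold on one continuum.
weather_scale = {"hot": 0.0, "ok": 1.0, "cold": 2.0}

# Boolean one-hot encoding for a category with no natural order (made-up list).
countries = ["argentina", "australia", "kenya", "zambia"]

def encode(weather, country):
    one_hot = [1.0 if country == c else 0.0 for c in countries]
    return np.array([weather_scale[weather]] + one_hot)

print(encode("ok", "kenya"))  # -> [1. 0. 0. 1. 0.]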
The whole point of neural nets is to use them in situations where simple statistical analysis is difficult, so I disagree with the other answer that says that you should pre-judge your data.
What a neural network does is map relationships between inputs and outputs. This means you have to have some kind of objective for your neural network. Examples of such objectives could be "predict the winner", "predict how fast each runner will be" or "predict the complete results of a race". Which of these examples is plausible for you to attempt depends, of course, on what data you have available.
If you have a large dataset (say a few hundred races per runner) in which the resulting time and all predictor variables (including weather) are recorded, and you establish that there is a relationship between weather and an individual runner's performance, a neural network would very well be able to map such a relationship, even if the relationship differs for each individual runner.
Examples of good weather variables to record would be sun intensity (W/m2), headwind (m/s) and temperature (deg C). Each runner's performance could then be modeled using these variables, and the neural network used to predict a runner's performance (note that this approach requires one neural network per runner).

How to know whether backpropagation can train successfully or not?

I have an AI project which uses a backpropagation neural network.
It has been training for about 1 hour, and 60-70 of all 100 inputs have been trained; I mean, 60-70 inputs are correct by the backpropagation stopping condition (the number of correctly trained inputs fluctuates between 60 and 70).
Currently more than 10,000 epochs are complete, and each epoch takes almost 0.5 seconds.
How can I know whether the neural network will train successfully if I leave it running for a long time, or whether it cannot do any better?
Check out my answer to this question: what is the difference between train, validation and test sets in neural networks?
You should use 3 sets of data:
Training
Validation
Testing
The Validation data set tells you when you should stop (as I said in the other answer):
The validation data set is used to minimize overfitting. You're not adjusting the weights of the network with this data set; you're just verifying that any increase in accuracy over the training data set actually yields an increase in accuracy over a data set the network has not been shown before, or at least hasn't trained on (i.e. the validation data set). If the accuracy over the training data set increases, but the accuracy over the validation data set stays the same or decreases, then you're overfitting your neural network and you should stop training.
A good validation method is 10-fold (k-fold) cross-validation. Additionally, there are specific strategies for splitting your data set into training, validation and testing sets; it's somewhat of a science in itself, so you should read up on that too.
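That stopping rule can be sketched as a small loop; note that train_one_epoch, accuracy and the net object here are hypothetical stand-ins for your own training code:

# Hypothetical stand-ins: net, train_one_epoch(net, data) and
# accuracy(net, data) represent your own network and training code.
best_val_acc = 0.0
stale_epochs = 0
patience = 20  # illustrative: how long to tolerate no validation improvement

while stale_epochs < patience:
    train_one_epoch(net, training_set)       # weights adjusted here only
    val_acc = accuracy(net, validation_set)  # weights NOT adjusted here
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_weights = net.copy_weights()    # snapshot the best state so far
        stale_epochs = 0
    else:
        stale_epochs += 1                    # validation flat or decreasing

net.restore_weights(best_weights)  # keep the best validated weights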
Update
Regarding your comment on the error, I would point you to some resources which can give you a better understanding of neural networks (it's kinda math heavy, but see below for more info):
http://www.colinfahey.com/neural_network_with_back_propagation_learning/neural_network_with_back_propagation_learning_en.html
http://www.willamette.edu/~gorr/classes/cs449/linear2.html
Section 5.9 of Colin Fahey's article describes it best:
Backward error propagation formula:
The error values at the neural network outputs are computed using the following formula:
Error = (Output - Desired); // Derived from: Output = Desired + Error;
The error accumulation in a neuron body is adjusted according to the output of the neuron body and the output error (specified by links connected to the neuron body).
Each output error value contributes to the error accumulator in the following manner:
ErrorAccumulator += Output * (1 - Output) * OutputError;
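In vectorized form, that accumulation for a sigmoid output layer looks roughly like this (numpy; the output and desired vectors are illustrative):

import numpy as np

# Illustrative sigmoid outputs and desired targets for one training example.
output = np.array([0.80, 0.30, 0.95])
desired = np.array([1.00, 0.00, 1.00])

output_error = output - desired  # Error = (Output - Desired)
# The sigmoid derivative is Output * (1 - Output), so each output's
# contribution to its error accumulator is:
error_accumulator = output * (1 - output) * output_error
print(error_accumulator)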
