Interpretation of my neural network output

I am using a NN with the activation function:
F = 1 / ( 1 + e^(-4.9*S) )
where S is the sum of the inputs.
The network has 1 output node, which is interpreted as the state of a motor.
The motor has 3 states: 1. clockwise motion, 2. counterclockwise motion, 3. locked.
The question is: how should I interpret the output?
Is it correct to say, for example:
if ( output > 0.8 ) then clockwise motion
if ( 0.2 <= output <= 0.8 ) then locked
if ( output < 0.2 ) then counterclockwise motion
I mean, is it correct to interpret the output as having 3 states? Does a single node have the power to represent 3 states, or must I have 3 different nodes for the 3 states?
Another way to ask this: does a value between 0.2 and 0.8 mean anything, or is it just undecided?
Another related question: can a single output node represent the angle of a motor? For example 0 -> 0 degrees, 0.5 -> 180 degrees, 1 -> 360 degrees ...

That depends entirely on your neural network. For the one you described, I would say a middle value could represent the third state, or it could represent a 'confused neural network'.
Thus, I would recommend having three outputs. If, for whatever reason, none of them fires, or more than one fires, you know something is broken.
Yes, you could have a neural network output a continuous variable, but it would require somewhat careful tuning, and probably a linear activation function for at least the last layer.
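Here is a minimal sketch of that three-output decoding, assuming sigmoid outputs in a NumPy array; the 0.5 firing threshold and the state order are my own illustrative choices:

import numpy as np

STATES = ["clockwise", "counterclockwise", "locked"]

def decode(outputs, threshold=0.5):
    # Map three sigmoid outputs to a motor state; complain if the
    # network is "confused" (none or several outputs fire).
    fired = outputs > threshold
    if fired.sum() != 1:
        raise ValueError("broken/confused output: %s" % outputs)
    return STATES[int(np.argmax(outputs))]

print(decode(np.array([0.9, 0.1, 0.2])))  # -> clockwise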

I agree with zabediah49 that it sounds more sensible to have three outputs, one for each state. If the states are mutually exclusive, and it sounds like they are, I would even consider a softmax output instead of a sigmoid.
-Ø
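For reference, a small sketch of a softmax readout over three mutually exclusive states (the logit values are made up for illustration):

import numpy as np

def softmax(z):
    z = z - z.max()          # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])  # one pre-activation per motor state
probs = softmax(logits)              # probabilities summing to 1.0
state = int(np.argmax(probs))        # index of the most probable state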

Related

How to represent 4 states as input for an artificial neural network?

The title might have been a bit unclear. Sadly enough, I could not come up with a better one.
So the problem is the following. There is an array of a fixed size where each position can have 4 states: empty, blocked, positive (a limited value between 0 and 1) and negative (a limited value between 0 and 1, though this could be 0 to -1). By limited, I mean that the value can only take the form 0.1, 0.2, ..., 1.0, and each value also occurs only once. Depending on how the array is filled, I would like to predict what the next version of the array will look like.
I tried to represent each position in the array as one input node, but I could not figure out how to do this (how to represent all four states as one number). It should also be noted that the maximum amount of each state is known. So rather than having each node represent an index in the array, I could have each node represent a state (blocked, -1.0, -0.9, ..., 0.9, 1.0) and then use the index at which that state occurs as the input value for that node.
Which way is more practical or efficient for a neural network?
By the way, it is a neural network with one input layer, one hidden layer and one output layer.
I would suggest you start by using 3 neurons for each cell in your array:
One which tells the network if the cell is empty. You can just use the values 0 for empty and 1 for non-empty (just an example, other values should work as well, just make sure to be consistent).
One which does the exact same thing, but it tells the network if the cell is blocked. (again 0 and 1 as possible inputs)
And finally a neuron to receive the positive/negative value. You can try various things here. You should probably first try setting the value if there is one, and setting the input to 0 if there is none. This might confuse the network a little though, since it can't see the difference between 0 = null and 0 = 0. If that doesn't work, you could map all your input values to between 0 and 1 (in your case just add 1 and then divide by 2 before passing the value to the network) and input -1 if the cell is blocked/empty.
If you can't get this to work, first play around with the parameters for a while: change the number of hidden layers and neurons, and vary the learning rate and the momentum during training.
If these inputs turn out not to work well for your task, I would suggest you do what you mentioned before: one neuron for every possible state of each cell. This method will take longer to train, but it should definitely work (if not, your task is simply too complex for the network, or you need more training time).
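A minimal sketch of the first suggestion, three inputs per cell; the state names and the rescaled-value variant (with -1 as the "no value" input) are illustrative assumptions:

def encode_cell(state, value=0.0):
    # Returns [non_empty, blocked, scaled_value] for one array cell.
    if state == "empty":
        return [0.0, 0.0, -1.0]
    if state == "blocked":
        return [1.0, 1.0, -1.0]
    # positive/negative values in [-1, 1] are mapped into [0, 1]
    return [1.0, 0.0, (value + 1.0) / 2.0]

# Build the full input vector by concatenating the cell encodings:
inputs = []
for state, value in [("empty", 0.0), ("blocked", 0.0), ("positive", 0.4)]:
    inputs.extend(encode_cell(state, value))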

How come random weight initialization is better than just using 0 as weights in an ANN?

In a trained neural net the weight distribution will fall close around zero. So it makes sense to me to initialize all weights to zero. However, there are methods such as random assignment from -1 to 1 and Nguyen-Widrow that outperform zero initialization. How come these random methods are better than just using zero?
Activation & learning:
In addition to the things cr0ss said, in a normal MLP (for example) the activation of layer n+1 is the dot product of the output of layer n and the weights between layers n and n+1. So basically you get this equation for the activation a of neuron i in layer n:
a_i(n) = b_i(n) + SUM over j of ( w_ij * o_j(n-1) )
where w_ij is the weight of the connection from neuron j (parent layer n-1) to the current neuron i (current layer n), o_j is the output of neuron j (parent layer) and b_i is the bias of the current neuron i in the current layer.
It is easy to see that initializing the weights with zero would practically "deactivate" them, because the weights times the output of the parent layer would equal zero, so (in the first learning steps) your input data would not be recognized; the data would be neglected totally.
So in the first epochs the learning would only have the data supplied by the bias.
This would obviously make learning much harder for the network and greatly enlarge the number of epochs needed to learn.
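A toy demonstration of that effect (the shapes and values are arbitrary): with all-zero weights the layer's activation ignores the input entirely, and only the bias gets through.

import numpy as np

x = np.array([0.3, 0.8, 0.5])   # output of the parent layer
W = np.zeros((3, 4))            # zero-initialized weights
b = np.full(4, 0.1)             # biases of the current layer

a = x @ W + b                   # activation of the current layer
print(a)                        # [0.1 0.1 0.1 0.1] -- the input had no effect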
Initialization should be optimized for your problem:
Initializing your weights with random floats drawn from -1 <= w <= 1 is the most typical initialization, because overall (if you do not analyze the problem/domain you are working on) this guarantees that some weights are relatively good right from the start. Besides, with a fixed initialization the neurons all co-adapt to each other, whereas random initialization breaks this symmetry and ensures better learning.
However, -1 <= w <= 1 is not the optimal initialization for every problem. For example: biological neural networks do not have negative outputs, so weights should be positive when you try to imitate biological networks. Furthermore, e.g. in image processing, most neurons either have a fairly high output or send nearly nothing. Considering this, it is often a good idea to initialize the weights between something like 0.2 <= w <= 1; sometimes even 0.5 <= w <= 2 has shown good results (e.g. on dark images).
So the number of epochs needed to learn a problem properly depends not only on the layers, their connectivity, the transfer functions, the learning rules and so on, but also on the initialization of your weights.
You should try several configurations. In most situations you can figure out what is adequate (like higher, positive weights for processing dark images).
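A small sketch of such a problem-specific uniform initializer (the shapes and ranges are examples):

import numpy as np

def init_weights(shape, low=-1.0, high=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    return rng.uniform(low, high, size=shape)

W_default = init_weights((64, 32))            # the typical -1 <= w <= 1
W_dark    = init_weights((64, 32), 0.2, 1.0)  # positive-only, e.g. for dark images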
Reading the Nguyen article, I'd say it is because when you assign weights from -1 to 1, you are already defining a "direction" for each weight, and the network only has to learn whether that direction is correct and what its magnitude should be.
If you assign all the weights to zero (in an MLP neural network), you don't know which direction a weight might go. Zero is a neutral number.
Therefore, if you assign a small value to each node's weight, the network will learn faster.
Read the "Picking initial weights to speed training" section of the article. It states:
First, the elements of Wi are assigned values from a uniform random distribution between -1 and 1 so that its direction is random. Next, we adjust the magnitude of the weight vectors Wi, so that each hidden node is linear over only a small interval.
Hope it helps.
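A hedged sketch of that procedure as it is commonly described (the 0.7 scale factor follows the usual statement of Nguyen-Widrow; treat the details as an approximation, not the paper's exact algorithm):

import numpy as np

def nguyen_widrow(n_inputs, n_hidden, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_inputs))  # random directions
    beta = 0.7 * n_hidden ** (1.0 / n_inputs)              # magnitude scale factor
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = beta * W / norms            # each hidden node linear over a small interval
    b = rng.uniform(-beta, beta, size=n_hidden)
    return W, b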

Q-Learning in combination with neural-networks (rewarding understanding)

As far as I understand, it's possible to replace a look-up table for Q-values (state-action-pair evaluation) with a neural network that estimates these state-action pairs. I programmed a small library which is able to propagate and backpropagate through a self-built neural network in order to learn the desired target values for a given input.
I also found this site while googling (and, as it felt to me, googling through the whole web): http://www.cs.indiana.edu/~gasser/Salsa/nn.html where Q-learning combined with a neural network is briefly explained.
For each action there's an extra output neuron, and the activation value of one of these output "units" tells me the estimated Q-value. (One question: is the activation value the same as the "output" of the neuron, or something different?)
I used the standard sigmoid function as the activation function, so the range of the function values x is
0 < x < 1
So I thought my target value should always be from 0.0 to 1.0. Question: is that understanding correct? Or did I misunderstand something about it?
If yes, the following problem arises:
The equation for calculating the target reward / new Q-value is:
q(s,a) = q(s,a) + learningrate * (reward + discountfactor * q'(s,a) - q(s,a))
So how do I apply this equation to get the right target for the neural network, if the targets should be from 0.0 to 1.0?
How do I calculate good reward values? Is moving toward the aim worth more than going away from it? (more +reward when nearing the aim than -reward for a bigger distance to the aim?)
I think there are some misunderstandings on my part. I hope you can help me answer these questions. Thank you very much!
Using a neural network to store Q-values is a good extension of table lookup. It makes it possible to use Q-learning when the state space is continuous.
input layer:   s1   s2   s3  ...   (current state)
                \    |    /   (fully connected)
output layer:  a1    a2    a3
               0.1   0.2   0.9
Suppose you have 3 actions available. The above shows the outputs of the neural network for the current state under the learned weights. So you know a3 is the best action to go with.
Now the questions you have:
One question: Is the activation value the same as the "output" of the neuron or something different?
Yes, I think so. In the linked page, the author says:
Some of the units may also be designated output units; their activations represent the network's response.
So I thought, my target value should always be from 0.0 to 1.0 -> Question: Is that point of my understanding correct? Or did I missunderstand something about that?
If you choose sigmoid as your activation function, your output will certainly be from 0.0 to 1.0. There are different choices of activation functions, e.g., here. Sigmoid is one of the most popular choices though. I think the output value being from 0.0 to 1.0 is not a problem here: if at the current time you have only two available actions and Q(s,a1) = 0.1, Q(s,a2) = 0.9, you know that action a2 is much better than a1 with respect to the Q-value.
So how do I perform this equation to get the right target for the neural network, if targets should be from 0.0 to 1.0?! How do I calculate good reward-values?
I am not sure about this, but you can try clamping the new target Q-value to be between 0.0 and 1.0, i.e.,
q(s,a) = min(max(0.0, q(s,a) + learningrate * (reward + discountfactor * q'(s,a) - q(s,a))), 1.0)
Try some experiments to find proper reward values.
Is moving toward the aim more worth it, than going away from it? (more +reward when nearing the aim than -reward for bigger distance to aim?)
Normally you should give more reward when the agent is close to the aim if you use the classical update equation, so that the new Q-value gets increased.
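A small sketch of that clamped update (all names and numbers here are illustrative):

def q_target(q_sa, reward, q_next, learning_rate=0.1, discount=0.9):
    # classical Q-learning update, then clamped into the sigmoid's range
    updated = q_sa + learning_rate * (reward + discount * q_next - q_sa)
    return min(max(0.0, updated), 1.0)

# The clamped value becomes the training target for the output neuron
# corresponding to the chosen action:
target = q_target(q_sa=0.4, reward=0.5, q_next=0.9)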

Using genetic programming to estimate probability

I would like to use a genetic program (GP) to estimate the probability of an 'outcome' of an 'event'. To train the NN I am using a genetic algorithm.
So, in my database I have many events, with each event containing many possible outcomes.
I will give the gp a set of input variables that relate to each outcome in each event.
My question is: what should the fitness function in the GP be?
For instance, right now I am giving the GP a set of input data (outcome input variables) and a set of target data (1 if the outcome DID occur, 0 if the outcome DIDN'T occur), with the fitness function being the mean squared error of the outputs and targets. I then take the sum of the outputs over all outcomes, and divide each output by that sum (to give the "probability"). However, I know for sure that this is not the right way to be doing this.
For clarity, this is how I am CURRENTLY doing this:
I would like to estimate the probability of 5 different outcomes occurring in an event:
Outcome 1 - inputs = [0.1, 0.2, 0.1, 0.4]
Outcome 2 - inputs = [0.1, 0.3, 0.1, 0.3]
Outcome 3 - inputs = [0.5, 0.6, 0.2, 0.1]
Outcome 4 - inputs = [0.9, 0.2, 0.1, 0.3]
Outcome 5 - inputs = [0.9, 0.2, 0.9, 0.2]
I will then calculate the gp output for each input:
Outcome 1 - output = 0.1
Outcome 2 - output = 0.7
Outcome 3 - output = 0.2
Outcome 4 - output = 0.4
Outcome 5 - output = 0.4
The sum of the outputs over all outcomes in this event would be 1.80. I would then calculate the "probability" of each outcome by dividing its output by the sum:
Outcome 1 - p = 0.056
Outcome 2 - p = 0.389
Outcome 3 - p = 0.111
Outcome 4 - p = 0.222
Outcome 5 - p = 0.222
Before you start: I know that these aren't real probabilities, and that this approach does not work! I just put it here to help you understand what I am trying to achieve.
Can anyone give me some pointers on how I can estimate the probability of each outcome? (Also, please note my maths is not great.)
Many thanks
I understand the first part of your question: what you described is a classification problem. You're learning whether your inputs relate to an outcome being observed (1) or not (0).
There are difficulties with the second part though. If I understand you correctly, you take the raw GP output for a certain row of inputs (e.g. 0.7) and treat it as a probability. You said yourself that this doesn't work. In GP you can do classification by introducing a threshold value that splits the classes: if the output is bigger than, say, 0.3 the outcome should be 1; if it's smaller, it should be 0. This threshold isn't necessarily 0.5 (again, it's just a number, not a probability).
I think if you want to obtain a probability you should attempt to learn multiple models that all explain your classification problem well. I don't expect you have a model that explains your data perfectly, and if you had one you wouldn't want a probability anyway. You can bag these models together (create an ensemble), and for each outcome you can observe how many models predicted 1 and how many predicted 0. The number of models that predicted 1 divided by the total number of models can then be interpreted as the probability that this outcome will be observed. If the models are all equally good you can forget about weighing between them; if they differ in quality you could of course factor that into your decision, since models with lower quality on their training set are less likely to contribute to a good estimate.
So in summary, you should attempt to apply GP e.g. 10 times and then use all 10 models on the training set to calculate their estimates (0 or 1). However, don't restrict yourself to GP only; there are many classification algorithms that can give good results.
As a side note, I'm part of the development team of a software package called HeuristicLab, which runs under Windows and with which you can run GP and create such ensembles. The software is open source.
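A sketch of that ensemble idea (the models list, per-model thresholds, and vote counting stand in for your trained GP runs):

def ensemble_probability(models, thresholds, inputs):
    # Each model's raw output is thresholded into a 0/1 class vote;
    # the fraction of 1-votes is read as the outcome's probability.
    votes = [1 if model(inputs) > t else 0
             for model, t in zip(models, thresholds)]
    return sum(votes) / len(votes)

# e.g. 10 GP runs of which 7 predict 1  ->  probability 0.7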
AI is all about complex algorithms. Think about it: the downside is very often that these algorithms become black boxes. So the drawback of algorithms such as NNs and GAs is that they are inherently opaque. That is what you want if you want a car to drive itself. On the other hand, it means you need tools to look into the black box.
What I'm saying is that a GA is probably not what you want to solve your problem. If you want to solve AI types of problems, you first have to know how to use standard techniques, such as regression, LDA etc.
So, combining an NN and a GA is usually a bad sign, because you are stacking one black box on another. I believe this is bad design. An NN and a GA are nothing other than non-linear optimizers. I would suggest you look at principal component analysis (PCA), SVD and linear classifiers first (see Wikipedia). Once you can solve simple statistical problems, move on to more complex ones. Check out the great textbook by Russell/Norvig, and read some of their source code.
To answer these questions one really has to look at the dataset extensively. If you are working on a small problem, define the probabilities etc., and you might get an answer here. Perhaps check out Bayesian statistics as well. This should get you started, I believe.

Determining Bias for Neural network Perceptrons?

One thing I don't quite understand, as I begin to learn about neural networks, is what a "bias" should initially be set to.
I understand that the perceptron calculates its output based on:
P * W + b > 0
and then you can apply a learning rule based on b = b + [ G - O ], where G is the correct output and O is the actual output (1 or 0), to calculate a new bias. But what about the initial bias? I don't really understand how it is calculated, or what initial value should be used besides just "guessing". Is there any kind of formula for this?
Pardon me if I'm mistaken on anything; I'm still learning the whole neural network idea before I implement my own (crappy) one.
The same goes for the learning rate... I mean, most books and such just kind of "pick one" for μ.
The short answer is, it depends...
In most cases (I believe) you can just treat the bias like any other weight (so it might get initialised to some small random value), and it will then get updated as you train your network. The idea is that all the biases and weights will end up converging on some useful set of values.
However, you can also set the weights manually (with no training) to get some special behaviours: for example, you can use the bias to make a perceptron behave like a logic gate (assume binary inputs X1 and X2 are either 0 or 1, and the activation function is scaled to give an output of 0 or 1).
OR gate: W1=1, W2=1, Bias=0
AND gate: W1=1, W2=1, Bias=-1
You can solve the classic XOR problem by using AND and OR as the first layer in a multilayer network and feeding them into a third perceptron with W1=3 (from the OR gate), W2=-2 (from the AND gate) and Bias=-2, like this:
(Note: these values will be different if your activation function is scaled to -1/+1, i.e. a SGN function.)
As to how to set the learning rate, that also depends(!), but I think something like 0.01 is usually recommended. Basically you want the system to learn as quickly as possible, but not so quickly that the weights fail to converge properly.
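A quick sketch checking those gate weights, assuming a step activation that outputs 1 when P * W + b > 0 and 0 otherwise:

def perceptron(inputs, weights, bias):
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else 0

def xor(x1, x2):
    o = perceptron([x1, x2], [1, 1], 0)     # OR gate
    a = perceptron([x1, x2], [1, 1], -1)    # AND gate
    return perceptron([o, a], [3, -2], -2)  # combine: OR but not AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor(x1, x2))        # prints 0, 1, 1, 0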
Since Richard has already answered the greater part of the question, I'll only elaborate on the learning rate. From what I've read (and it's working for me), there is a very simple formula that you can use to update the learning rate for each iteration k, and it is:
learningRate_k = constant/k
Here the 0th iteration is obviously excluded, since you'd be dividing by zero. The constant can be whatever you want it to be (except 0, of course, since that would not make any sense :D), but the easiest choice is naturally 1, so you get
learningRate_k = 1/k
The resulting series obeys two basic rules:
lim_(t->inf) SUM from k=1 to t (learningRate_k) = inf
lim_(t->inf) SUM from k=1 to t (learningRate_k^2) < inf
Note that the convergence of your perceptron is directly connected to this learning-rate series. It starts big (for k=1 you get 1/1=1) and gets smaller and smaller with every update of your perceptron, since, as in real life, when you encounter something new you learn a lot at the beginning, but later on you learn less and less.
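A trivial sketch of that schedule inside a training loop (the loop body is a placeholder for your own update step):

constant = 1.0
for k in range(1, 101):        # iterations start at 1 to avoid division by zero
    learning_rate = constant / k
    # ... perform one perceptron weight update using this learning_rate ...
    # early rates are large (1.0, 0.5, 0.33, ...), later ones shrink toward 0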
