Determining Bias for Neural network Perceptrons? - artificial-intelligence

This is one thing in my beginning of understand neural networks is I don't quite understand what to initially set a "bias" at?
I understand the Perceptron calculates it's output based on:
P * W + b > 0
and then you could calculate a learning pattern based on b = b + [ G - O ] where G is the Correct Output, and O is the actual Output (1 or 0) to calculate a new bias...but what about an initial bias.....I don't really understand how this is calculated, or what initial value should be used besides just "guessing", is there any type of formula for this?
Pardon if Im mistaken on anything, Im still learning the whole Neural network idea before I implement my own (crappy) one.
The same goes for learning rate.....I mean most books and such just kinda "pick one" for μ.

The short answer is, it depends...
In most cases (I believe) you can just treat the bias just like any other weight (so it might get initialised to some small random value), and it will get updated as you train your network. The idea is that all the biases and weights will end up converging on some useful set of values.
However, you can also set the weights manually (with no training) to get some special behaviours: for example, you can use the bias to make a perceptron behave like a logic gate (assume binary inputs X1 and X2 are either 0 or 1, and the activation function is scaled to give an output of 0 or 1).
OR gate: W1=1, W2=1, Bias=0
AND gate: W1=1, W2=1, Bias=-1
You can solve the classic XOR problem by using AND and OR as the first layer in a multilayer network, and feed them into a third perceptron with W1=3 (from the OR gate), W2=-2 (from the AND gate) and Bias=-2, like this:
(Note: these values will be different if your activation function is scaled to -1/+1, ie a SGN function)
As to how to set the learning rate, that also depends(!) but I think usually something like 0.01 is recommended. Basically you want the system to learn as quickly as possible, but not so quickly that the weights fail to converge properly.

Since #Richard has already answered the greater part of the question I'll only elaborate on the learning rate. From what I've read (and it's working) there is a very simple formula that you can use in order to update the learning rate for each iteration k and it is:
learningRate_k = constant/k
Here obviously the 0th iteration is excluded since you'll be dividing by zero. The constant can be whatever you want it to be (except 0 of course since it will not be making any sense :D) but the easiest is naturally 1 so you get
learningRate_k = 1/k
The resulting series obeys two basic rules:
lim_(t->inf) SUM from k=1 to t (learningRate_k) = inf
lim_(t->inf) SUM from k=1 to t (learningRate_k^2) < inf
Note that the convergence of your perceptron is directly connected to the learning rate series. It starts big (for k=1 you get 1/1=1) and gets smaller and smaller with each and every update of your perceptron since - as in real life - when you encounter something new at the beginning you learn a lot but later on you learn less and less.

Related

How come random weight initiation is better then just using 0 as weights in ANN?

In a trained neural net the weight distribution will fall close around zero. So it makes sense for me to initiate all weights to zero. However there are methods such as random assignment for -1 to 1 and Nguyen-Widrow that outperformes zero initiation. How come these random methods are better then just using zero?
Activation & learning:
Additionally to the things cr0ss said, in a normal MLP (for example) the activation of layer n+1 is the dot product of the output of layer n and the weights between layer n and n + 1...so basically you get this equation for the activation a of neuron i in layer n:
Where w is the weight of the connection between neuron j (parent layer n-1) to current neuron i (current layer n), o is the output of neuron j (parent layer) and b is the bias of current neuron i in the current layer.
It is easy to see initializing weights with zero would practically "deactivate" the weights because weights by output of parent layer would equal zero, therefore (in the first learning steps) your input data would not be recognized, the data would be negclected totally.
So the learning would only have the data supplied by the bias in the first epochs.
This would obviously render the learning more challenging for the network and enlarge the needed epochs to learn heavily.
Initialization should be optimized for your problem:
Initializing your weights with a distribution of random floats with -1 <= w <= 1 is the most typical initialization, because overall (if you do not analyze your problem / domain you are working on) this guarantees some weights to be relatively good right from the start. Besides, other neurons co-adapting to each other happens faster with fixed initialization and random initialization ensures better learning.
However -1 <= w <= 1 for initialization is not optimal for every problem. For example: biological neural networks do not have negative outputs, so weights should be positive when you try to imitate biological networks. Furthermore, e.g. in image processing, most neurons have either a fairly high output or send nearly nothing. Considering this, it is often a good idea to initialize weights between something like 0.2 <= w <= 1, sometimes even 0.5 <= w <= 2 showed good results (e.g. in dark images).
So the needed epochs to learn a problem properly is not only dependent on the layers, their connectivity, the transfer functions and learning rules and so on but also to the initialization of your weights.
You should try several configurations. In most situations you can figure out what solutions are adequate (like higher, positive weights for processing dark images).
Reading the Nguyen article, I'd say it is because when you assign the weight from -1 to 1, you are already defining a "direction" for the weight, and it will learn if the direction is correct and it's magnitude to go or not the other way.
If you assign all the weights to zero (in a MLP neural network), you don't know which direction it might go to. Zero is a neutral number.
Therefore, if you assign a small value to the node's weight, the network will learn faster.
Read Picking initial weights to speed training section of the article. It states:
First, the elements of Wi are assigned values from a uniform random distributation between -1 and 1 so that its direction is random. Next, we adjust the magnitude of the weight vectors Wi, so that each hidden node is linear over only a small interval.
Hope it helps.

How should perceptron work with 0 inputs (for AND/OR)

I am trying to make a simple perceptron to perform logical AND but I do not know how to solve the 0 input issue.
weight+=error*learning_rate*input
Also when input is 0, no matter what the error was, the weight will not change.
And one more question, in general, when training the perceptron, can I repeat the examples for both sets (lets say have one for 0 and one for 1) or do they need to be different?
That is an interesting and very important insight. That's why you usually should have a bias in a neural network.
Imagine the decision surface of a perceptron as a line of the form
y = w*x+b
When you remove the b (bias) from the equation, you will only be able to learn lines that pass (0, 0).

Fitting an unknown curve [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
There are some related questions that I've come across (like this, this, this, and this) but they all deal with fitting data to a known curve. Is there a way to fit given data to an unknown curve? By which I mean, given some data the algorithm will give me a fit which is one function or a sum of functions. I'm programming in C, but I'm at a complete loss on how to use the gsl package to do this. I'm open to using anything that can (ideally) be piped through C. But any help on what direction I should look will be greatly appreciated.
EDIT: This is basically experimental (physics) data that I've collected, so the data will have some trend modified by additive gaussian distributed noise. In general the trend will be non-linear, so I guess that a linear regression fitting method will be unsuitable. As for the ordering, the data is time-ordered, so the curve necessarily has to be fit in that order.
You might be looking for polynomial interpolation, in the field of numerical analysis.
In polynomial interpolation - given a set of points (x,y) - you are trying to find the best polynom that fits these points. One way to do it is using Newton interpolation, which is fairly easy to program.
The field of numerical analysis and interpolations in specifics is widely studied, and you might be able to get some nice upper bound to the error of the polynom.
Note however, because you are looking for a polynom that best fits your data, and the function is not really a polynom - the scale of the error when getting far from your initial training set blasts off.
Also note, your data set is finite, and there are inifnite number (actually, non-enumerable infinity) of functions that can fit the data (exactly or approximately) - so which one out of these is the best might be specific to what you actually are trying to achieve.
If you are looking for a model to fit your data, note that linear regression and polynomial interpolations are at the opposite ends of the scale: polynomial interpolation might be an overfitting to a model, while a linear regression might be underfitting it, what exactly should be used is case specific and varies from one application to the other.
Simple polynomial interpolation example:
Let's say we have (0,1),(1,2),(3,10) as our data.
The table1 we get using newton method is:
0 | 1 | |
1 | 2 | (2-1)/(1-0)=1 |
3 | 9 | (10-2)/(3-1)=4 | (4-1)/(3-0)=1
Now, the polynom we get is the "diagonal" that ends with the last element:
1 + 1*(x-0) + 1*(x-0)(x-1) = 1 + x + x^2 - x = x^2 +1
(and that is a perfect fit indeed to the data we used)
(1) The table is recursively created: The first 2 columns are the x,y values - and each next column is based on the prior one. It is really easy to implement once you get it, the full explanation is in the wikipedia page for newton interpolation.
Another alternative is using linear regression, but multi-dimensional.
The trick here is to artificially generate extra dimensions. You can do so by simply implying some functions on the original data set. A common usage is doing it to generate polynoms to match the data, so in here the function you imply is f(x) = x^i for all i < k (where k is the degree of the polynom you want to get).
For example, the data set (0,2),(2,3) with k = 3 you will get extra 2 dimnsions, and your data set will be: (0,2,4,8),(2,3,9,27).
The linear-regression algorithm will find the values a_0,a_1,...,a_k for the polynom p(x) = a_0 + a_1*x + ... + a_k * x^k that minimized the error for each point in the data comparing to the predicted model (the value of p(x)).
Now, the problem is - when you start increasing the dimension - you are moving from underfitting (of 1 dimensional linear regression) to overfitting (when k==n, you effectively getting polynomial interpolation).
To "chose" what is the best k value - you can use cross-validation, and chose the k that minimized the error according to your cross-validation.
Note that this process can be fully automated, all you need is to iteratively check all k values in the desired range1, and chose the model with the k that minimized the error according to the cross-validation.
(1) The range could be [1,n] - though it will probably be way too time consuming, I'd go for [1,sqrt(n)] or even [1,log(n)] - but it is just a hunch.
You might want to use (Fast) Fourier Transforms to convert data to frequency domain.
With the result of the transform (a set of amplitudes and phases and frequencies) even the most twisted set of data can be represented by several functions (harmonics) of the form:
r * cos(f * t - p)
where r is the harmonic amplitude, f is the frequency an p the phase.
Finally, the unknonwn data curve is the sum of all harmonics.
I have done this in R (you have some examples of it) but I believe C has enough tools to manage it. It is also possible to pipe C and R but don't know much about it. This might be of help.
This method is really good for large chunks of data because it has complexities of:
1) decompose data with Fast Fourier Transforms (FTT) = O(n log n)
2) built the function with the resulting components = O(n)

TD(λ) in Delphi/Pascal (Temporal Difference Learning)

I have an artificial neural network which plays Tic-Tac-Toe - but it is not complete yet.
What I have yet:
the reward array "R[t]" with integer values for every timestep or move "t" (1=player A wins, 0=draw, -1=player B wins)
The input values are correctly propagated through the network.
the formula for adjusting the weights:
What is missing:
the TD learning: I still need a procedure which "backpropagates" the network's errors using the TD(λ) algorithm.
But I don't really understand this algorithm.
My approach so far ...
The trace decay parameter λ should be "0.1" as distal states should not get that much of the reward.
The learning rate is "0.5" in both layers (input and hidden).
It's a case of delayed reward: The reward remains "0" until the game ends. Then the reward becomes "1" for the first player's win, "-1" for the second player's win or "0" in case of a draw.
My questions:
How and when do you calculate the net's error (TD error)?
How can you implement the "backpropagation" of the error?
How are the weights adjusted using TD(λ)?
Thank you so much in advance :)
If you're serious about making this work, then understanding TD-lambda would be very helpful. Sutton and Barto's book, "Reinforcement Learning" is available for free in HTML format and covers this algorithm in detail. Basically, what TD-lambda does is create a mapping between a game state and the expected reward at the game's end. As games are played, states that are more likely to lead to winning states tend to get higher expected reward values.
For a simple game like tic-tac-toe, you're better off starting with a tabular mapping (just track an expected reward value for every possible game state). Then once you've got that working, you can try using a NN for the mapping instead. But I would suggest trying a separate, simpler NN project first...
I have been confused about this too, but I believe this is the way it works:
Starting from the end node, you check R, (output received) and E, (output expected). If E = R, it's fine, and you have no changes to make.
If E != R, you see how far off it was, based on thresholds and whatnot, and then shift the weights or threshold up or down a bit. Then, based on the new weights, you go back in, and guess whether or not it was too high, or too low, and repeat, with a weaker effect.
I've never really tried this algorithm, but that's basically the version of the idea as I understand it.
As far as I remember you do the training with a known result set - so you calculate the output for a known input and subtract your known output value from that - that is the error.
Then you use the error to correct the net - for a single layer NN adjusted with the delta rule I know that an epsilon of 0.5 is too high - something like 0.1 is better - slower but better. With backpropagation it is a bit more advanced - but as far as I remember the math equation description of a NN is complex and hard to understand - it is not that complicated.
take a look at
http://www.codeproject.com/KB/recipes/BP.aspx
or google for "backpropagation c" - it is probably easier to understand in code.

Determining which inputs to weigh in an evolutionary algorithm

I once wrote a Tetris AI that played Tetris quite well. The algorithm I used (described in this paper) is a two-step process.
In the first step, the programmer decides to track inputs that are "interesting" to the problem. In Tetris we might be interested in tracking how many gaps there are in a row because minimizing gaps could help place future pieces more easily. Another might be the average column height because it may be a bad idea to take risks if you're about to lose.
The second step is determining weights associated with each input. This is the part where I used a genetic algorithm. Any learning algorithm will do here, as long as the weights are adjusted over time based on the results. The idea is to let the computer decide how the input relates to the solution.
Using these inputs and their weights we can determine the value of taking any action. For example, if putting the straight line shape all the way in the right column will eliminate the gaps of 4 different rows, then this action could get a very high score if its weight is high. Likewise, laying it flat on top might actually cause gaps and so that action gets a low score.
I've always wondered if there's a way to apply a learning algorithm to the first step, where we find "interesting" potential inputs. It seems possible to write an algorithm where the computer first learns what inputs might be useful, then applies learning to weigh those inputs. Has anything been done like this before? Is it already being used in any AI applications?
In neural networks, you can select 'interesting' potential inputs by finding the ones that have the strongest correlation, positive or negative, with the classifications you're training for. I imagine you can do similarly in other contexts.
I think I might approach the problem you're describing by feeding more primitive data to a learning algorithm. For instance, a tetris game state may be described by the list of occupied cells. A string of bits describing this information would be a suitable input to that stage of the learning algorithm. actually training on that is still challenging; how do you know whether those are useful results. I suppose you could roll the whole algorithm into a single blob, where the algorithm is fed with the successive states of play and the output would just be the block placements, with higher scoring algorithms selected for future generations.
Another choice might be to use a large corpus of plays from other sources; such as recorded plays from human players or a hand-crafted ai, and select the algorithms who's outputs bear a strong correlation to some interesting fact or another from the future play, such as the score earned over the next 10 moves.
Yes, there is a way.
If you choose M selected features there are 2^M subsets, so there is a lot to look at.
I would to the following:
For each subset S
run your code to optimize the weights W
save S and the corresponding W
Then for each pair S-W, you can run G games for each pair and save the score L for each one. Now you have a table like this:
feature1 feature2 feature3 featureM subset_code game_number scoreL
1 0 1 1 S1 1 10500
1 0 1 1 S1 2 6230
...
0 1 1 0 S2 G + 1 30120
0 1 1 0 S2 G + 2 25900
Now you can run some component selection algorithm (PCA for example) and decide which features are worth to explain scoreL.
A tip: When running the code to optimize W, seed the random number generator, so that each different 'evolving brain' is tested against the same piece sequence.
I hope it helps in something!

Resources