How to apply gradient descent on the weights of a neural network? - c

Considering a neural network with two hidden layers. In this case we have three matrices of weights. Lets say I'm starting the training. In the first round I'll set random values for all weights of the three matrices. If this is correct I have two questions about:
1- Should I do the training from the input layer to the right or otherwise?
2- In the second round of the trainging I have to apply the gradient descent on the weights. Should I apply on all weights of all matrices an after that calculate the error or apply it weight by weight checking if the error has decreased to go to the next weight and so on to finally go to the next training round?

You need to be familiar with forward propagation and the backward propagation. In a neural network, first you initialize weights randomly. Then you predict the y value(let's say y_pred) according to the training set values(X_train). For each X_train sample you have y_train which is the true output(we say ground truth) for the training sample. Then you calculate a loss value according to the loss function, for simplicity let's say loss=y_pred-y_train (This is no the actual loss function it is a bit more complex than that). This is the forward propagation in short.
So you get the loss then you calculate the how much you need to change the weights in order to train your neural network in the next iteration. For this we use gradient descent algorithm. You calculate new weights using the loss value you get. This is the backward propagation in short.
You redo this steps multiple times and you will improve your weights from random to trained weights.

Related

How come random weight initiation is better then just using 0 as weights in ANN?

In a trained neural net the weight distribution will fall close around zero. So it makes sense for me to initiate all weights to zero. However there are methods such as random assignment for -1 to 1 and Nguyen-Widrow that outperformes zero initiation. How come these random methods are better then just using zero?
Activation & learning:
Additionally to the things cr0ss said, in a normal MLP (for example) the activation of layer n+1 is the dot product of the output of layer n and the weights between layer n and n + 1...so basically you get this equation for the activation a of neuron i in layer n:
Where w is the weight of the connection between neuron j (parent layer n-1) to current neuron i (current layer n), o is the output of neuron j (parent layer) and b is the bias of current neuron i in the current layer.
It is easy to see initializing weights with zero would practically "deactivate" the weights because weights by output of parent layer would equal zero, therefore (in the first learning steps) your input data would not be recognized, the data would be negclected totally.
So the learning would only have the data supplied by the bias in the first epochs.
This would obviously render the learning more challenging for the network and enlarge the needed epochs to learn heavily.
Initialization should be optimized for your problem:
Initializing your weights with a distribution of random floats with -1 <= w <= 1 is the most typical initialization, because overall (if you do not analyze your problem / domain you are working on) this guarantees some weights to be relatively good right from the start. Besides, other neurons co-adapting to each other happens faster with fixed initialization and random initialization ensures better learning.
However -1 <= w <= 1 for initialization is not optimal for every problem. For example: biological neural networks do not have negative outputs, so weights should be positive when you try to imitate biological networks. Furthermore, e.g. in image processing, most neurons have either a fairly high output or send nearly nothing. Considering this, it is often a good idea to initialize weights between something like 0.2 <= w <= 1, sometimes even 0.5 <= w <= 2 showed good results (e.g. in dark images).
So the needed epochs to learn a problem properly is not only dependent on the layers, their connectivity, the transfer functions and learning rules and so on but also to the initialization of your weights.
You should try several configurations. In most situations you can figure out what solutions are adequate (like higher, positive weights for processing dark images).
Reading the Nguyen article, I'd say it is because when you assign the weight from -1 to 1, you are already defining a "direction" for the weight, and it will learn if the direction is correct and it's magnitude to go or not the other way.
If you assign all the weights to zero (in a MLP neural network), you don't know which direction it might go to. Zero is a neutral number.
Therefore, if you assign a small value to the node's weight, the network will learn faster.
Read Picking initial weights to speed training section of the article. It states:
First, the elements of Wi are assigned values from a uniform random distributation between -1 and 1 so that its direction is random. Next, we adjust the magnitude of the weight vectors Wi, so that each hidden node is linear over only a small interval.
Hope it helps.

Is dimensionality reduction reversible?

I have implemented a dimentionality reduction algorithm using ENCOG, that takes a dataset (call it A) with multiple features and reduces it to a dataset (B) with only one feature (I need that for time series analisys).
Now my question is, I have a value from B - predicted by the time series analysis, can I convert it back to two dimensions like in the A dataset?
No, dimensionality reduction is not reversible in general. It loses information.
Dimensionality reduction (compression of information) is reversible in auto-encoders. Auto-encoder is regular neural network with bottleneck layer in the middle. You have for instance 20 inputs in the first layer, 10 neurons in the middle layer and again 20 neurons in the last layer. When you train such network you force it to compress information to 10 neurons and then uncompress again minimizing error in the last layer(desired output vector equals input vector). When you use well known Back-propagation algorithm to train such network it performs PCA - Principal Component Analysis. PCA returns uncorrelated features. It's not very powerful.
By using more sophisticated algorithm to train auto-encoder you can make it perform nonlinear ICA - Independent Component Analysis. ICA returns statistically independent features. This training algorithm searches for low complexity neural networks with high generalization capability. As a byproduct of regularization you get ICA.

Neural Network training method

I've been studying Neural Networks lately. I'll explain my goal: i'm trying to teach monsters to walk, stand, basically perform actions that "reward" them (maximize the fitness function).
The NN receives sensor inputs, and outputs muscle activity. The problem gets down to training the weights and biases of the neurons.
My problem is that i'm not sure if i'm doing things right, and with neural networks i can make a mistake and never know about it. So i'll explain what i'm doing in general, and if you spot a mistake please correct me!
1) I create a neural network with neurons that use hyperbolic tangent transfer function.
2) Create a population of random "Chromosomes", each containing an array of doubles as genes(the weights and biases in the NN), the length of the array being amount of weights and biases in the NN. The genes have a lower and upper limit, usually [-2,2] in which their random value is generated in initialization and mutation.
For each generation:
3) For each chromosome, I update the NN weights and test the monster for about 5000 frames. Every 10 frames, network outputs are generated with sensor input. The outputs are double values normalized to [0,1] and they control "muscles" (springs) in the body by changing their neutral length, according to that value. Fitness value is calculated.
4) Perform Genetic Algorithm operators- first create cross overs with ~0.4 probability, then mutate with ~0.1 probability, depending on chromosome length. Mutation randomizes the gene to a value between some lower and upper limit. Elitism - two best solutions are left unchanged for the next generation.
Repeat until generations>maxGenerations or max fitness is reached.
I'm not sure about a few things in my code: should there be a limit for weights and biases? if yes, it constricts the potential results the NN could achieve. If no, then how do i initialize values, and mutate? I'm afraid that adding a random value as mutation will get stuck in local optima, like hill climbing. No limit will reduce the amount of parameters i need to consider when initializing the whole thing, which is nice!
Is hyperbolic tangent a good choice? why or why not?
Do i have to normalize inputs sensor data? if yes, between what values?
Also i'm not sure if i'm doing a mistake by outputting a double value for flexing instead of binary- higher than 0.5 is flex, less is release, could be an option, when now i'm just using the value as flex amount.
Don't consider bugs in my code as reasons for bad results, because i checked many times and implemented XOR that worked perfectly.
I would greatly appreciate any help, thank you!
I assume you are referring to Feed Forward Neural Networks, ie, forward connected layers of neurons.
It's ok to use hyperbolic tangent or a sigmoid function. Just make sure they are continuous and derivable in their domain. Else the learning algorithm (gradient descent) might not feedback correctly the error back into first layers.
You should normalize each input to either a range such as [-1,+1] or [-std,+std] using zscore. Therefore, the values of your inputs will have a similar weight in the decision function.
You do not specify the targets of your outputs, if they are discrete or floating point.
I wonder, as FFNN are supervised, with what data are you training your algorithm?

How to convert the output of an artificial neural network into probabilities?

I've read about neural network a little while ago and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways :
1) You get one output neuron. It it's value is > 0.5 the events is likely true, if it's value is <=0.5 the event is likely to be false.
2) You get two output neurons, if the value of the first is > than the value of the second the event is likely true and vice versa.
In these case, the ANN tells you if an event is likely true or likely false. It does not tell how likely it is.
Is there a way to convert this value to some odds or to directly get odds out of the ANN. I'd like to get an output like "The event has a 84% probability to be true"
Once a NN has been trained, for eg. using backprogation as mentioned in the question (whereby the backprogation logic has "nudged" the weights in ways that minimize the error function) the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classifying purposes.
Whereby the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightfoward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of the input x weight for that neuron. This value is then fed to an activation function which purpose's is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated, in particular it needn't be linear, but whatever its shape, typically sigmoid, it operate in the same fashion: figuring out where the activation fits on the curve, and if applicable, above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses - plural) with a percentage value, finds an easy answer: you bet it can, its output(s) is real-valued (if anything in need of normalizing) before we convert it to a discrete value (a boolean or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... How and Where do I get "my percentages"?... All depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values in the 0-1 range and in a fashion that the sum of all percentages "add up" to 1. In its simplest form, the activation function can be used to normalize the value and the weights of the input to the output layer can be used as factors to ensure the "add up" to 1 question (provided that these weights are indeed so normalized themselves).
Et voilĂ !
Claritication: (following Mathieu's note)
One doesn't need to change anything in the way the Neural Network itself works; the only thing needed is to somehow "hook into" the logic of output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as previously, neither its training nor recognition logic are altered, the inputs to the NN stay the same, as do the connections between various layers etc. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' output etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more output neurons), as with classifiers for example: If all output neurons have the same activation function, the percentage for a given neuron is that of its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case by case situation because the distinct activation functions may be indicative of a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
What you can do is to use a sigmoid transfer function on the output layer nodes (that accepts data ranges (-inf,inf) and outputs a value in [-1,1]).
Then by using the 1-of-n output encoding (one node for each class), you can map the range [-1,1] to [0,1] and use it as probability for each class value (note that this works naturally for more than just two classes).
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs a range from 0 to 1. This would tend to be the case if the transfer function (or output function) in both the preceding stage and providing the final output is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will but repairs are possible. Moreover unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid and weights and activations that range positive and negative (due to the symmetry of this model). Another factor is the prevalence of the class - if it is 50% then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (on backpropogation) and constrain it from going out of the range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing. Ideally the bias of the predictor to positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs bear markets, e.g. credit worthiness of people applying for loans vs people who fail to make loan payments) - calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and the halved difference between them can also be. It is like flipping the negative class neuron and averaging. The differences can also give rise to a probability of significance estimate (using the T-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guestimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0, but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve - as the threshold changes for the output of a single neuron or the difference between two competing neurons, we plot the results true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, as what isn't really a positive is a negative). Then you scan the ROC curve (polyline) point by point (each time the gradient changes) sample by sample and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those that are represented in the calibration set - and in fact any bad points in the ROC curve, represented by deconvexities (dents) can be smoothed over by the convex hull - probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round and although it could be used repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
I will be very prudent in interpreting the outputs of a neural networks (in fact any machine learning classifier) as a probability. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data, we have to infer it. For my experience I din't advice anyone to interpret directly the outputs as probabilities.
did you try prof. Hinton's suggestion of training the network with softmax activation function and cross entropy error?
as an example create a three layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
then train them with cross entropy error softmax transfer with your favourite optimizer stochastic descent/iprop plus/ grad descent. After training the output neurons should be normalized to sum of 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. Shark Machine Learning framework does provide Softmax feature through combining two models. And prof. Hinton an excellent online course # http://coursera.com regarding the details.
I can remember I saw an example of Neural network trained with back propagation to approximate the probability of an outcome in the book Introduction to the theory of neural computation (hertz krogh palmer). I think the key to the example was a special learning rule so that you didn't have to convert the output of a unit to probability, but instead you got automatically the probability as output.
If you have the opportunity, try to check that book.
(by the way, "boltzman machines", although less famous, are neural networks designed specifically to learn probability distributions, you may want to check them as well)
When using ANN for 2-class classification and logistic sigmoid activation function is used in the output layer, the output values could be interpreted as probabilities.
So if you choosing between 2 classes, you train using 1-of-C encoding, where 2 ANN outputs will have training values (1,0) and (0,1) for each of classes respectively.
To get probability of first class in percent, just multiply first ANN output to 100. To get probability of other class use the second output.
This could be generalized for multi-class classification using softmax activation function.
You can read more, including proofs of probabilistic interpretation here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.

How to program a neural network for chess?

I want to program a chess engine which learns to make good moves and win against other players. I've already coded a representation of the chess board and a function which outputs all possible moves. So I only need an evaluation function which says how good a given situation of the board is. Therefore, I would like to use an artificial neural network which should then evaluate a given position. The output should be a numerical value. The higher the value is, the better is the position for the white player.
My approach is to build a network of 385 neurons: There are six unique chess pieces and 64 fields on the board. So for every field we take 6 neurons (1 for every piece). If there is a white piece, the input value is 1. If there is a black piece, the value is -1. And if there is no piece of that sort on that field, the value is 0. In addition to that there should be 1 neuron for the player to move. If it is White's turn, the input value is 1 and if it's Black's turn, the value is -1.
I think that configuration of the neural network is quite good. But the main part is missing: How can I implement this neural network into a coding language (e.g. Delphi)? I think the weights for each neuron should be the same in the beginning. Depending on the result of a match, the weights should then be adjusted. But how? I think I should let 2 computer players (both using my engine) play against each other. If White wins, Black gets the feedback that its weights aren't good.
So it would be great if you could help me implementing the neural network into a coding language (best would be Delphi, otherwise pseudo-code). Thanks in advance!
In case somebody randomly finds this page. Given what we know now, what the OP proposes is almost certainly possible. In fact we managed to do it for a game with much larger state space - Go ( https://deepmind.com/research/case-studies/alphago-the-story-so-far ).
I don't see why you can't have a neural net for a static evaluator if you also do some classic mini-max lookahead with alpha-beta pruning. Lots of Chess engines use minimax with a braindead static evaluator that just adds up the pieces or something; it doesn't matter so much if you have enough levels of minimax. I don't know how much of an improvement the net would make but there's little to lose. Training it would be tricky though. I'd suggest using an engine that looks ahead many moves (and takes loads of CPU etc) to train the evaluator for an engine that looks ahead fewer moves. That way you end up with an engine that doesn't take as much CPU (hopefully).
Edit: I wrote the above in 2010, and now in 2020 Stockfish NNUE has done it. "The network is optimized and trained on the [classical Stockfish] evaluations of millions of positions at moderate search depth" and then used as a static evaluator, and in their initial tests they got an 80-elo improvement when using this static evaluator instead of their previous one (or, equivalently, the same elo with a little less CPU time). So yes it does work, and you don't even have to train the network at high search depth as I originally suggested: moderate search depth is enough, but the key is to use many millions of positions.
Been there, done that. Since there is no continuity in your problem (the value of a position is not closely related to an other position with only 1 change in the value of one input), there is very little chance a NN would work. And it never did in my experiments.
I would rather see a simulated annealing system with an ad-hoc heuristic (of which there are plenty out there) to evaluate the value of the position...
However, if you are set on using a NN, is is relatively easy to represent. A general NN is simply a graph, with each node being a neuron. Each neuron has a current activation value, and a transition formula to compute the next activation value, based on input values, i.e. activation values of all the nodes that have a link to it.
A more classical NN, that is with an input layer, an output layer, identical neurons for each layer, and no time-dependency, can thus be represented by an array of input nodes, an array of output nodes, and a linked graph of nodes connecting those. Each node possesses a current activation value, and a list of nodes it forwards to. Computing the output value is simply setting the activations of the input neurons to the input values, and iterating through each subsequent layer in turn, computing the activation values from the previous layer using the transition formula. When you have reached the last (output) layer, you have your result.
It is possible, but not trivial by any means.
https://erikbern.com/2014/11/29/deep-learning-for-chess/
To train his evaluation function, he utilized a lot of computing power to do so.
To summarize generally, you could go about it as follows. Your evaluation function is a feedforward NN. Let the matrix computations lead to a scalar output valuing how good the move is. The input vector for the network is the board state represented by all the pieces on the board so say white pawn is 1, white knight is 2... and empty space is 0. An example board state input vector is simply a sequence of 0-12's. This evaluation can be trained using grandmaster games (available at a fics database for example) for many games, minimizing loss between what the current parameters say is the highest valuation and what move the grandmasters made (which should have the highest valuation). This of course assumes that the grandmaster moves are correct and optimal.
What you need to train a ANN is either something like backpropagation learning or some form of a genetic algorithm. But chess is such an complex game that it is unlikly that a simple ANN will learn to play it - even more if the learning process is unsupervised.
Further, your question does not say anything about the number of layers. You want to use 385 input neurons to encode the current situation. But how do you want to decide what to do? On neuron per field? Highest excitation wins? But there is often more than one possible move.
Further you will need several hidden layers - the functions that can be represented with an input and an output layer without hidden layer are really limited.
So I do not want to prevent you from trying it, but chances for a successful implemenation and training within say one year or so a practically zero.
I tried to build and train an ANN to play Tic-tac-toe when I was 16 years or so ... and I failed. I would suggest to try such an simple game first.
The main problem I see here is one of training. You say you want your ANN to take the current board position and evaluate how good it is for a player. (I assume you will take every possible move for a player, apply it to the current board state, evaluate via the ANN and then take the one with the highest output - ie: hill climbing)
Your options as I see them are:
Develop some heuristic function to evaluate the board state and train the network off that. But that begs the question of why use an ANN at all, when you could just use your heuristic.
Use some statistical measure such as "How many games were won by white or black from this board configuration?", which would give you a fitness value between white or black. The difficulty with that is the amount of training data required for the size of your problem space.
With the second option you could always feed it board sequences from grandmaster games and hope there is enough coverage for the ANN to develop a solution.
Due to the complexity of the problem I'd want to throw the largest network (ie: lots of internal nodes) at it as I could without slowing down the training too much.
Your input algorithm is sound - all positions, all pieces, and both players are accounted for. You may need an input layer for every past state of the gameboard, so that past events are used as input again.
The output layer should (in some form) give the piece to move, and the location to move to.
Write a genetic algorithm using a connectome which contains all neuron weights and synapse strengths, and begin multiple separated gene pools with a large number of connectomes in each.
Make them play one another, keep the best handful, crossover and mutate the best connectomes to repopulate the pool.
Read blondie24 : http://www.amazon.co.uk/Blondie24-Playing-Kaufmann-Artificial-Intelligence/dp/1558607838.
It deals with checkers instead of chess but the principles are the same.
Came here to say what Silas said. Using a minimax algorithm, you can expect to be able to look ahead N moves. Using Alpha-beta pruning, you can expand that to theoretically 2*N moves, but more realistically 3*N/4 moves. Neural networks are really appropriate here.
Perhaps though a genetic algorithm could be used.

Resources