PyBrain RNN prediction failure - artificial-intelligence

I am using a recurrent neural network for time series prediction, with LSTM units in the hidden layer. The inputs are sequence datasets, with the output being the next datum after the input sequence. I have hundreds of inputs, one hidden layer of equal size, and a single output in the output layer. However much I train, the predicted result is always much higher than the actual value (this happens with other activation functions too), shown respectively by green and blue below. What is the solution?

It seems that LSTM is not suited for this kind of pattern. Softmax works well.

Related

Multiple Output Neural Network

I have built my first neural network in Python, and I've been playing around with a few datasets; it's going well so far!
I have a quick question regarding modelling events with multiple outcomes:
Say I wish to train a network to tell me the probability of each runner winning a 100m sprint. I would give the network all of the relevant data regarding each runner, and the number of outputs would be equal to the number of runners in the race.
My question is: using a sigmoid function, how can I ensure the sum of the outputs will be equal to 1.0? Will the network naturally learn to do this, or will I have to somehow make this happen explicitly? If so, how would I go about doing this?
Many thanks.
The output from your neural network will approach 1. I don't think it will actually get to 1.
You actually don't need to see which output is equal to 1. Once you've trained your network up to a specific error level, when you present the inputs, just look for the maximum output in your output layer. For example, let's say your output layer presents the following output: [0.0001, 0.00023, 0.0041, 0.99999412, 0.0012, 0.0002]; then the runner that won the race is runner number 4.
So yes, your network will "learn" to produce 1, but it won't exactly be 1. This is why you train to within a certain error rate. I recently created a neural network to recognize handwritten digits, and this is the method that I used. In my output layer, I have a vector with 10 components. The first component represents 0, and the last component represents 9. So when I present a 4 to the network, I expect the output vector to look like [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]. Of course, it's not what I get exactly, but it's what I train the network to provide. So to find which digit it is, I simply check to see which component has the highest output or score.
Now for your second question: I believe you're asking how the network learns to provide the correct answer. To do this, you need to provide your network with some training data and train it until the output is under a certain error threshold. So what you need is a set of data that contains the inputs and the correct output. Initially your neural network will be set up with random weights (there are some algorithms that help you select better weights to minimize training time, but that's a little more advanced). Next you need a way to tell the neural network to learn from the data provided. So basically you give the data to the neural network and it provides an output, which is highly likely to be wrong. Then you compare that output with the expected (correct) output and you tell the neural network to update its weights so that it gets closer to the correct answer. You do this over and over again until the error is below a certain threshold.
The easiest way to do this is to implement the stochastic backpropagation algorithm. In this algorithm, you calculate the error between the actual output of the neural network and the expected output. Then you backpropagate the error from the output layer back through the weights of the hidden layer, adjusting the weights as you go. Then you repeat this process until the error that you calculate is below a certain threshold. So with each step, you get closer and closer to your solution.
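For concreteness, here is a minimal numpy sketch of one stochastic backpropagation step for a network with a single sigmoid hidden layer and squared error; the names, shapes and learning rate are illustrative assumptions, not a definitive implementation:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, target, W1, W2, lr=0.1):
        # forward pass
        h = sigmoid(W1 @ x)          # hidden activations
        y = sigmoid(W2 @ h)          # output activations
        # backward pass: delta terms for squared error with sigmoid units
        delta_out = (y - target) * y * (1 - y)
        delta_hid = (W2.T @ delta_out) * h * (1 - h)
        # gradient-descent update on both weight matrices
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, x)
        return W1, W2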
You can use the algorithm described here. There is a decent amount of math involved, so be prepared for that! If you want to see an example of an implementation of this algorithm, you can take a look at this Java code that I have on github. The code uses momentum and a simple form of simulated annealing as well, but the standard backpropagation algorithm should be easily discernible. The Wikipedia article on backpropagation has a link to an implementation of the backpropagation algorithm in Python.
You're probably not going to understand the algorithm immediately; expect to spend some time understanding it and working through some of the math. I sat down with a pencil and paper as I was coding, and that's how I eventually understood what was going on.
Here are a few resources that should help you understand backpropagation a little better:
The learning process: backpropagation
Error backpropagation
If you want some more resources, you can also take a look at my answer here.
Basically you want a function of multiple real numbers that converts those real numbers into probabilities (each between 0 and 1, summing to 1). You can do this easily by post-processing the output of your network.
Your network gives you real numbers r1, r2, ..., rn that increase with the probability of each runner winning the race.
Then compute exp(r1), exp(r2), ..., and sum them up: ers = exp(r1) + exp(r2) + ... + exp(rn). The probability that the first racer wins is then exp(r1) / ers.
This is one use of the Boltzmann distribution. http://en.wikipedia.org/wiki/Boltzmann_distribution
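A minimal sketch of this post-processing step (the raw outputs are made up for illustration):

    import numpy as np

    def softmax(r):
        # subtract the max before exponentiating for numerical stability;
        # the result is non-negative and sums to 1
        e = np.exp(r - np.max(r))
        return e / e.sum()

    # e.g. raw network outputs for four runners
    softmax(np.array([2.0, 1.0, 0.5, -1.0]))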
Your network should work around that and learn it naturally eventually.
To make the network learn that a little faster, here's what springs to mind first:
add an additional output called 'sum' (summing all the other output neurons) -- if you want all the output neurons to be in a separate layer, just add a layer of outputs: the first numRunners outputs connect to the corresponding neurons in the previous layer, and the (numRunners+1)-th neuron connects to all the neurons from the previous layer, with its weights fixed to 1
the training set would contain 0-1 vectors for each runner (did/did not run), and the "expected" result would be a 0-1 vector 00..00001000..01, the first 1 marking the runner that won the race and the last 1 marking the "sum" of "probabilities" (see the example after this list)
for the unknown races, the network would try to predict which runner would win. Since the outputs have continuous values (more or less :D) they can be read as "the certainty of the network that the runner would win the race" -- which is what you're looking for
Even without the additional sum neuron, this is the rough description of the way the training data should be arranged.
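A hypothetical illustration of that target layout, for a race with four runners where runner 2 won:

    # one 0/1 slot per runner, plus the trailing "sum" slot fixed at 1
    num_runners = 4
    target = [0] * num_runners + [1]   # last slot is the extra "sum" output
    target[1] = 1                      # runner 2 (index 1) won -> [0, 1, 0, 0, 1]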

Continuous output in Neural Networks

How can I set up neural networks so they accept and output a continuous range of values instead of discrete ones?
From what I recall from doing a Neural Network class a couple of years ago, the activation function would be a sigmoid, which yields a value between 0 and 1. If I want my neural network to yield a real valued scalar, what should I do? I thought maybe if I wanted a value between 0 and 10 I could just multiply the value by 10? What if I have negative values? Is this what people usually do or is there any other way? What about the input?
Thanks
Much of the work in the field of neuroevolution involves using neural networks with continuous inputs and outputs.
There are several common approaches:
One node per value
Linear activation functions - as others have noted, you can use non-sigmoid activation functions on output nodes if you are concerned about the limited range of sigmoid functions. However, this can cause your output to become arbitrarily large, which can cause problems during training.
Sigmoid activation functions - simply scaling sigmoid output (or shifting and scaling, if you want negative values) is a common approach in neuroevolution; a sketch of this scaling appears after this list. However, it is worth making sure that your sigmoid function isn't too steep: a steep activation function means that the "useful" range of values is small, which forces network weights to be small. (This is mainly an issue with genetic algorithms, which use a fixed weight-modification strategy that doesn't work well when small weights are desired.)
(Figures omitted: plots of sigmoid activation functions of varying steepness; source: natekohl.net)
Multiple nodes per value - spreading a single continuous value over multiple nodes is a common strategy for representing continuous inputs. It has the benefit of providing more "features" for a network to play with, at the cost of increasing network size.
Binning - spread a single input over multiple nodes (e.g. RBF networks, where each node is a basis function with a different center that will be partially activated by the input). You get some of the benefits of discrete inputs without losing a smooth representation.
Binary representation - divide a single continuous value into 2^N chunks, then feed that value into the network as a binary pattern to N nodes (see the sketch after this list). This approach is compact, but kind of brittle, and results in input that changes in a non-continuous manner.
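As a minimal sketch of the sigmoid scaling and the binary representation mentioned above (the ranges and bit counts are illustrative assumptions):

    import numpy as np

    def scaled_sigmoid(z, lo=-10.0, hi=10.0):
        # shift and scale the (0, 1) sigmoid range onto (lo, hi),
        # e.g. (-10, 10) to allow negative outputs
        return lo + (hi - lo) / (1.0 + np.exp(-z))

    def binary_inputs(x, lo, hi, n_bits):
        # map x in [lo, hi] onto one of 2**n_bits chunks, then present the
        # chunk index as a binary pattern across n_bits input nodes
        chunk = int((x - lo) / (hi - lo) * (2 ** n_bits - 1))
        return [(chunk >> i) & 1 for i in range(n_bits)]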
There are no rules which require the output ( * ) to be any particular function. In fact, we typically need to add some arithmetic operations at the end of the function per se implemented in a given node, in order to scale and otherwise coerce the output to a particular form.
The advantage of working with all-or-nothing outputs and/or 0.0-to-1.0 normalized outputs is that it makes things more easily tractable, and also avoids issues of overflow and such.
( * ) "Output" can be understood here as either the output of a given node (neuron) within the network or that of the network as a whole.
As indicated by Mark Bessey, the input [to the network as a whole] and the output [of the network] typically receive some filtering/conversion. As hinted in this response and in Mark's comment, it may be preferable to have normalized/standard nodes in the "hidden" layers of the network, and apply some normalization/conversion/discretization as required for the input and/or the output of the network; such practice is, however, only a matter of practicality rather than an imperative requirement of neural networks in general.
You will typically need to do some filtering (level conversion, etc) on both the input and the output. Obviously, filtering the input will change the internal state, so some consideration needs to be given to not losing the signal you're trying to train on.

How are neural networks used when the number of inputs could be variable?

All the examples I have seen of neural networks are for a fixed set of inputs, which works well for images and fixed-length data. How do you deal with variable-length data such as sentences, queries or source code? Is there a way to encode variable-length data into fixed-length inputs and still get the generalization properties of neural networks?
I have been there, and I faced this problem.
ANNs were made for a fixed feature-vector length, and so are many other classifiers such as KNN, SVM, Bayesian classifiers, etc.
That is, the input layer should be well defined and not varied; this is a design problem.
However, some researchers opt for adding zeros to fill the missing gap. I personally think that this is not a good solution, because those zeros (unreal values) will affect the weights that the net converges to. In addition, there might be a real signal that ends with zeros.
The ANN is not the only classifier; there are more, some even better, such as the random forest. This classifier is considered among the best by researchers. It uses a small number of random features, creating hundreds of decision trees using bootstrapping and bagging; this might work well. The number of chosen features is normally the square root of the feature-vector size. Those features are random. Each decision tree converges to a solution, and using majority rules the most likely class is then chosen.
Another solution is to use dynamic time warping (DTW), or even better, hidden Markov models (HMMs).
Another solution is interpolation: interpolate (compensate for missing values along the small signal) all the small signals to be the same size as the largest signal. Interpolation methods include, but are not limited to, averaging, B-spline, cubic, and so on.
Another solution is to use a feature-extraction method to obtain the best features (the most distinctive ones), this time making them fixed-size. Those methods include PCA, LDA, etc.
Another solution is to use feature selection (normally after feature extraction), an easy way to select the features that give the best accuracy.
That's all for now. If none of those work for you, please contact me.
You would usually extract features from the data and feed those to the network. It is not advisable to take just some data and feed it to the net. In practice, pre-processing and choosing the right features will decide your success and the performance of the neural net. Unfortunately, IMHO it takes experience to develop a sense for that, and it's nothing one can learn from a book.
Summing up: "Garbage in, garbage out"
Some problems could be solved by a recurrent neural network.
For example, it is good for calculating parity over a sequence of inputs.
The recurrent neural network for calculating parity would have just one input feature.
The bits could be fed into it over time. Its output is also fed back to the hidden layer.
That allows it to learn parity with just two hidden units.
A normal feed-forward two-layer neural network would require 2**sequence_length hidden units to represent the parity. This limitation holds for any architecture with just 2 layers (e.g., SVM).
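As a hand-weighted illustration (not a trained net) of that two-hidden-unit recurrent parity network, with the previous output fed back alongside the current input bit:

    def parity(bits):
        step = lambda z: 1 if z > 0 else 0   # threshold unit
        p = 0                                # recurrent state: parity so far
        for x in bits:
            h1 = step(p + x - 0.5)           # hidden unit 1: OR(p, x)
            h2 = step(p + x - 1.5)           # hidden unit 2: AND(p, x)
            p = step(h1 - h2 - 0.5)          # output: XOR(p, x), fed back
        return p

    parity([1, 0, 1, 1])   # -> 1 (odd number of ones)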
I guess one way to do it is to add a temporal component to the input (a recurrent neural net) and stream the input to the net a chunk at a time, basically creating the neural network equivalent of a lexer and parser. This would allow the input to be quite large, but would have the disadvantage that there would not necessarily be a stop symbol to separate different sequences of input from each other (the equivalent of a period in sentences).
To use a neural net on images of different sizes, the images themselves are often cropped and scaled up or down to better fit the input of the network. I know that doesn't really answer your question, but perhaps something similar would be possible with other types of input, using some sort of transformation function on the input?
I'm not entirely sure, but I'd say: use the maximum number of inputs (e.g. for words, let's say no word will be longer than 45 characters, the longest word found in a dictionary according to Wikipedia), and if a shorter word is encountered, set the other inputs to a whitespace character.
Or with binary data, set it to 0. The only problem with this approach is if an input filled with whitespace characters/zeros/whatever collides with a valid full-length input (not so much a problem with words as it is with numbers).

How to convert the output of an artificial neural network into probabilities?

I read about neural networks a little while ago and I understand how an ANN (especially a multilayer perceptron that learns via backpropagation) can learn to classify an event as true or false.
I think there are two ways:
1) You get one output neuron. If its value is > 0.5 the event is likely true; if its value is <= 0.5 the event is likely false.
2) You get two output neurons; if the value of the first is greater than the value of the second, the event is likely true, and vice versa.
In either case, the ANN tells you if an event is likely true or likely false. It does not tell you how likely it is.
Is there a way to convert this value to some odds, or to directly get odds out of the ANN? I'd like to get an output like "The event has an 84% probability of being true".
Once a NN has been trained, e.g. using backpropagation as mentioned in the question (whereby the backpropagation logic has "nudged" the weights in ways that minimize the error function), the weights associated with all individual inputs ("outside" inputs or intra-NN inputs) are fixed. The NN can then be used for classifying purposes.
While the math (and the "options") during the learning phase can get a bit thick, it is relatively simple and straightforward when operating as a classifier. The main algorithm is to compute an activation value for each neuron, as the sum of input x weight for that neuron. This value is then fed to an activation function whose purpose is to normalize it and convert it to a boolean (in typical cases, as some networks do not have an all-or-nothing rule for some of their layers). The activation function can be more complex than you indicated; in particular it needn't be linear, but whatever its shape, typically sigmoid, it operates in the same fashion: figuring out where the activation fits on the curve, and if applicable, above or below a threshold. The basic algorithm then processes all neurons at a given layer before proceeding to the next.
With this in mind, the question of using the perceptron's ability to qualify its guess (or indeed guesses, plural) with a percentage value finds an easy answer: you bet it can. Its output(s) is real-valued (if anything in need of normalizing) before we convert it to a discrete value (a boolean or a category ID in the case of several categories), using the activation functions and the threshold/comparison methods described in the question.
So... how and where do I get "my percentages"? It all depends on the NN implementation, and more importantly, the implementation dictates the type of normalization functions that can be used to bring activation values into the 0-1 range, in such a fashion that the sum of all percentages "adds up" to 1. In its simplest form, the activation function can be used to normalize the value, and the weights of the input to the output layer can be used as factors to ensure the "add up to 1" requirement (provided that those weights are indeed so normalized themselves).
Et voilà!
Clarification (following Mathieu's note):
One doesn't need to change anything in the way the neural network itself works; the only thing needed is to somehow "hook into" the logic of the output neurons to access the [real-valued] activation value they computed, or, possibly better, to access the real-valued output of the activation function, prior to its boolean conversion (which is typically based on a threshold value or on some stochastic function).
In other words, the NN works as previously, neither its training nor recognition logic are altered, the inputs to the NN stay the same, as do the connections between various layers etc. We only get a copy of the real-valued activation of the neurons in the output layer, and we use this to compute a percentage. The actual formula for the percentage calculation depends on the nature of the activation value and its associated function (its scale, its range relative to other neurons' output etc.).
Here are a few simple cases (taken from the question's suggested output rules)
1) If there is a single output neuron: the ratio of the value provided by the activation function relative to the range of that function should do.
2) If there are two (or more output neurons), as with classifiers for example: If all output neurons have the same activation function, the percentage for a given neuron is that of its activation function value divided by the sum of all activation function values. If the activation functions vary, it becomes a case by case situation because the distinct activation functions may be indicative of a purposeful desire to give more weight to some of the neurons, and the percentage should respect this.
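As a minimal sketch of case 2 (identical activation functions on every output neuron), with made-up activation values:

    def to_percentages(activations):
        # each percentage is that neuron's activation-function value
        # divided by the sum of all activation-function values
        total = sum(activations)
        return [100.0 * a / total for a in activations]

    to_percentages([0.05, 0.70, 0.25])   # -> [5.0, 70.0, 25.0]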
What you can do is use a sigmoid transfer function on the output layer nodes, one that accepts inputs in (-inf, inf) and outputs a value in [-1, 1] (i.e. a tanh-style sigmoid).
Then by using the 1-of-n output encoding (one node for each class), you can map the range [-1, 1] to [0, 1] and use it as a probability for each class value (note that this works naturally for more than just two classes).
The activation value of a single output neuron is a linearly weighted sum, and may be directly interpreted as an approximate probability if the network is trained to give outputs in the range 0 to 1. This would tend to be the case if the transfer function (or output function) in both the preceding stage and the one providing the final output is in the 0 to 1 range too (typically the sigmoidal logistic function). However, there is no guarantee that it will be, although repairs are possible. Moreover, unless the sigmoids are logistic and the weights are constrained to be positive and sum to 1, it is unlikely. Generally a neural network will train in a more balanced way using the tanh sigmoid and weights and activations that range positive and negative (due to the symmetry of this model).
Another factor is the prevalence of the class - if it is 50% then a 0.5 threshold is likely to be effective for logistic and a 0.0 threshold for tanh. The sigmoid is designed to push things towards the centre of the range (on backpropagation) and constrain them from going out of the range (in feedforward). The significance of the performance (with respect to the Bernoulli distribution) can also be interpreted as a probability that the neuron is making real predictions rather than guessing.
Ideally, the bias of the predictor towards positives should match the prevalence of positives in the real world (which may vary at different times and places, e.g. bull vs bear markets, or the creditworthiness of people applying for loans vs people who fail to make loan payments) - calibrating to probabilities has the advantage that any desired bias can be set easily.
If you have two neurons for two classes, each can be interpreted independently as above, and the halved difference between them can also be. It is like flipping the negative class neuron and averaging. The differences can also give rise to a probability of significance estimate (using the T-test).
The Brier score and its Murphy decomposition give a more direct estimate of the probability that an average answer is correct, while Informedness gives the probability the classifier is making an informed decision rather than a guess, ROC AUC gives the probability a positive class will be ranked higher than a negative class (by a positive predictor), and Kappa will give a similar number that matches Informedness when prevalence = bias.
What you normally want is both a significance probability for the overall classifier (to ensure that you are playing on a real field, and not in an imaginary framework of guestimates) and a probability estimate for a specific example. There are various ways to calibrate, including doing a regression (linear or nonlinear) versus probability and using its inverse function to remap to a more accurate probability estimate. This can be seen by the Brier score improving, with the calibration component reducing towards 0, but the discrimination component remaining the same, as should ROC AUC and Informedness (Kappa is subject to bias and may worsen).
A simple non-linear way to calibrate to probabilities is to use the ROC curve - as the threshold changes for the output of a single neuron or the difference between two competing neurons, we plot the resulting true and false positive rates on a ROC curve (the false and true negative rates are naturally the complements, as what isn't really a positive is a negative). Then you scan the ROC curve (polyline) point by point (each time the gradient changes), sample by sample, and the proportion of positive samples gives you a probability estimate for positives corresponding to the neural threshold that produced that point. Values between points on the curve can be linearly interpolated between those that are represented in the calibration set - and in fact any bad points on the ROC curve, represented by deconvexities (dents), can be smoothed over by the convex hull, probabilistically interpolating between the endpoints of the hull segment. Flach and Wu propose a technique that actually flips the segment, but this depends on information being used the wrong way round, and although it could be used repeatedly for arbitrary improvement on the calibration set, it will be increasingly unlikely to generalize to a test situation.
(I came here looking for papers I'd seen ages ago on these ROC-based approaches - so this is from memory and without these lost references.)
I would be very prudent in interpreting the outputs of a neural network (in fact, of any machine learning classifier) as a probability. The machine is trained to discriminate between classes, not to estimate the probability density. In fact, we don't have this information in the data; we have to infer it. From my experience, I wouldn't advise anyone to interpret the outputs directly as probabilities.
Did you try Prof. Hinton's suggestion of training the network with a softmax activation function and cross-entropy error?
As an example, create a three-layer network with the following:
linear neurons [ number of features ]
sigmoid neurons [ 3 x number of features ]
linear neurons [ number of classes ]
Then train it with cross-entropy error and a softmax transfer function, using your favourite optimizer (stochastic gradient descent, iRprop+, gradient descent). After training, the output neurons should be normalized to sum to 1.
Please see http://en.wikipedia.org/wiki/Softmax_activation_function for details. The Shark Machine Learning framework provides the softmax feature through combining two models, and Prof. Hinton has an excellent online course at http://coursera.com covering the details.
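As a minimal numpy sketch of one training step for that setup, using plain stochastic gradient descent rather than iRprop+ (layer sizes, names and the learning rate are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())           # stabilized exponentials
        return e / e.sum()

    def train_step(x, t, W1, W2, lr=0.01):
        # forward: linear inputs -> sigmoid hidden layer -> softmax outputs
        h = sigmoid(W1 @ x)
        y = softmax(W2 @ h)
        # softmax + cross-entropy yields the simple output gradient (y - t)
        delta_out = y - t
        delta_hid = (W2.T @ delta_out) * h * (1 - h)
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, x)
        return W1, W2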
I remember seeing an example of a neural network trained with backpropagation to approximate the probability of an outcome in the book Introduction to the Theory of Neural Computation (Hertz, Krogh, Palmer). I think the key to the example was a special learning rule, so that you didn't have to convert the output of a unit to a probability but instead got the probability as output automatically.
If you have the opportunity, try to check that book.
(By the way, "Boltzmann machines", although less famous, are neural networks designed specifically to learn probability distributions; you may want to check them out as well.)
When an ANN is used for 2-class classification and a logistic sigmoid activation function is used in the output layer, the output values can be interpreted as probabilities.
So if you are choosing between 2 classes, you train using 1-of-C encoding, where the 2 ANN outputs will have training values of (1,0) and (0,1) for each class respectively.
To get the probability of the first class as a percentage, just multiply the first ANN output by 100. To get the probability of the other class, use the second output.
This could be generalized for multi-class classification using softmax activation function.
You can read more, including proofs of the probabilistic interpretation, here:
[1] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.

How to program a neural network for chess?

I want to program a chess engine which learns to make good moves and win against other players. I've already coded a representation of the chess board and a function which outputs all possible moves. So I only need an evaluation function which says how good a given situation of the board is. Therefore, I would like to use an artificial neural network which should then evaluate a given position. The output should be a numerical value. The higher the value is, the better is the position for the white player.
My approach is to build a network of 385 neurons: There are six unique chess pieces and 64 fields on the board. So for every field we take 6 neurons (1 for every piece). If there is a white piece, the input value is 1. If there is a black piece, the value is -1. And if there is no piece of that sort on that field, the value is 0. In addition to that there should be 1 neuron for the player to move. If it is White's turn, the input value is 1 and if it's Black's turn, the value is -1.
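For concreteness, here is a minimal sketch of this 385-input encoding in Python; the board representation used here, a dict from square index 0-63 to a (piece_type 0-5, is_white) pair, is an assumption for illustration:

    def encode_board(board, white_to_move):
        inputs = [0.0] * (6 * 64 + 1)        # 6 piece types x 64 fields, +1
        for square, (piece_type, is_white) in board.items():
            inputs[square * 6 + piece_type] = 1.0 if is_white else -1.0
        inputs[384] = 1.0 if white_to_move else -1.0   # side to move
        return inputs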
I think that configuration of the neural network is quite good. But the main part is missing: how can I implement this neural network in a programming language (e.g. Delphi)? I think the weights for each neuron should be the same in the beginning. Depending on the result of a match, the weights should then be adjusted. But how? I think I should let 2 computer players (both using my engine) play against each other. If White wins, Black gets the feedback that its weights aren't good.
So it would be great if you could help me implement the neural network in a programming language (best would be Delphi, otherwise pseudo-code). Thanks in advance!
In case somebody randomly finds this page: given what we know now, what the OP proposes is almost certainly possible. In fact, we managed to do it for a game with a much larger state space - Go ( https://deepmind.com/research/case-studies/alphago-the-story-so-far ).
I don't see why you can't have a neural net for a static evaluator if you also do some classic mini-max lookahead with alpha-beta pruning. Lots of Chess engines use minimax with a braindead static evaluator that just adds up the pieces or something; it doesn't matter so much if you have enough levels of minimax. I don't know how much of an improvement the net would make but there's little to lose. Training it would be tricky though. I'd suggest using an engine that looks ahead many moves (and takes loads of CPU etc) to train the evaluator for an engine that looks ahead fewer moves. That way you end up with an engine that doesn't take as much CPU (hopefully).
Edit: I wrote the above in 2010, and now in 2020 Stockfish NNUE has done it. "The network is optimized and trained on the [classical Stockfish] evaluations of millions of positions at moderate search depth" and then used as a static evaluator, and in their initial tests they got an 80-elo improvement when using this static evaluator instead of their previous one (or, equivalently, the same elo with a little less CPU time). So yes it does work, and you don't even have to train the network at high search depth as I originally suggested: moderate search depth is enough, but the key is to use many millions of positions.
Been there, done that. Since there is no continuity in your problem (the value of a position is not closely related to that of another position with only one change in the value of one input), there is very little chance a NN would work. And it never did in my experiments.
I would rather see a simulated annealing system with an ad-hoc heuristic (of which there are plenty out there) to evaluate the value of the position...
However, if you are set on using a NN, it is relatively easy to represent. A general NN is simply a graph, with each node being a neuron. Each neuron has a current activation value, and a transition formula to compute the next activation value, based on input values, i.e. the activation values of all the nodes that have a link to it.
A more classical NN, that is with an input layer, an output layer, identical neurons for each layer, and no time-dependency, can thus be represented by an array of input nodes, an array of output nodes, and a linked graph of nodes connecting those. Each node possesses a current activation value, and a list of nodes it forwards to. Computing the output value is simply setting the activations of the input neurons to the input values, and iterating through each subsequent layer in turn, computing the activation values from the previous layer using the transition formula. When you have reached the last (output) layer, you have your result.
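As a minimal sketch of that layered computation (the use of numpy, sigmoid units and a plain list of weight matrices are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(inputs, weights):
        # `weights` holds one matrix per layer transition; each step applies
        # the transition formula to the previous layer's activations
        activations = np.asarray(inputs, dtype=float)
        for W in weights:
            activations = sigmoid(W @ activations)
        return activations   # activations of the output layer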
It is possible, but not trivial by any means.
https://erikbern.com/2014/11/29/deep-learning-for-chess/
To train his evaluation function, he utilized a lot of computing power.
To summarize generally, you could go about it as follows. Your evaluation function is a feedforward NN. Let the matrix computations lead to a scalar output valuing how good the move is. The input vector for the network is the board state represented by all the pieces on the board, so say a white pawn is 1, a white knight is 2, ... and an empty space is 0. An example board-state input vector is simply a sequence of 0-12's. This evaluation function can be trained on grandmaster games (available at a FICS database, for example) over many games, minimizing the loss between what the current parameters say is the best move and the move the grandmasters actually made (which should have the highest valuation). This of course assumes that the grandmaster moves are correct and optimal.
What you need to train an ANN is either something like backpropagation learning or some form of a genetic algorithm. But chess is such a complex game that it is unlikely that a simple ANN will learn to play it - even more so if the learning process is unsupervised.
Further, your question does not say anything about the number of layers. You want to use 385 input neurons to encode the current situation. But how do you want to decide what to do? One neuron per field? Highest excitation wins? But there is often more than one possible move.
Further, you will need several hidden layers - the functions that can be represented with an input and an output layer and no hidden layer are really limited.
So I do not want to prevent you from trying it, but the chances of a successful implementation and training within, say, one year or so are practically zero.
I tried to build and train an ANN to play Tic-tac-toe when I was 16 years old or so... and I failed. I would suggest trying such a simple game first.
The main problem I see here is one of training. You say you want your ANN to take the current board position and evaluate how good it is for a player. (I assume you will take every possible move for a player, apply it to the current board state, evaluate each via the ANN and then take the one with the highest output, i.e. hill climbing.)
Your options as I see them are:
Develop some heuristic function to evaluate the board state and train the network off that. But that begs the question of why use an ANN at all, when you could just use your heuristic.
Use some statistical measure such as "How many games were won by white or black from this board configuration?", which would give you a fitness value between white or black. The difficulty with that is the amount of training data required for the size of your problem space.
With the second option you could always feed it board sequences from grandmaster games and hope there is enough coverage for the ANN to develop a solution.
Due to the complexity of the problem I'd want to throw the largest network (ie: lots of internal nodes) at it as I could without slowing down the training too much.
Your input algorithm is sound - all positions, all pieces, and both players are accounted for. You may need an input layer for every past state of the gameboard, so that past events are used as input again.
The output layer should (in some form) give the piece to move, and the location to move to.
Write a genetic algorithm using a connectome which contains all neuron weights and synapse strengths, and begin multiple separated gene pools with a large number of connectomes in each.
Make them play one another, keep the best handful, crossover and mutate the best connectomes to repopulate the pool.
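As a rough sketch of that loop (play_match, crossover and mutate are hypothetical hooks into your own engine and genetic operators; play_match is assumed to return 0 if the first connectome wins and 1 otherwise):

    import random

    def evolve(pool, generations=100, keep=4):
        for _ in range(generations):
            wins = [0] * len(pool)
            for i in range(len(pool)):               # round-robin tournament
                for j in range(i + 1, len(pool)):
                    result = play_match(pool[i], pool[j])
                    wins[i if result == 0 else j] += 1
            # keep the best handful of connectomes...
            order = sorted(range(len(pool)), key=lambda k: -wins[k])
            best = [pool[k] for k in order[:keep]]
            # ...then crossover and mutate them to repopulate the pool
            children = [mutate(crossover(random.choice(best), random.choice(best)))
                        for _ in range(len(pool) - keep)]
            pool = best + children
        return pool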
Read Blondie24: http://www.amazon.co.uk/Blondie24-Playing-Kaufmann-Artificial-Intelligence/dp/1558607838.
It deals with checkers instead of chess but the principles are the same.
Came here to say what Silas said. Using a minimax algorithm, you can expect to be able to look ahead N moves. Using alpha-beta pruning, you can expand that to theoretically 2*N moves, but more realistically 3*N/4 moves. Neural networks are not really appropriate here.
Perhaps, though, a genetic algorithm could be used.
