The correctness of neural networks - artificial-intelligence

I have asked other AI folk this question, but I haven't really been given an answer that satisfied me.
For anyone else that has programmed an artificial neural network before, how do you test for its correctness?
I guess, another way to put it is, how does one debug the code behind a neural network?

With neural networks, you generally take an untrained network and train it on a given set of data so that it responds the way you expect. Here's the deal: usually you're training it to a certain confidence level for your inputs. Generally (and again, this is just generally; your mileage may vary), you cannot get a neural network to always provide the right answer; rather, you get an estimate of the right answer, to within a confidence range. You estimate that range from how the network performs on the data you trained and validated it with.
The question arises as to why you would want to use neural networks if you cannot be certain that the conclusion they come to is verifiably correct. The answer is that a trained network can produce high-confidence approximate answers quickly for certain classes of problems where a verifiably correct solution is expensive or impractical to compute (for NP-complete problems, for instance, no polynomial-time exact algorithm is known). In layman's terms, neural networks can give useful answers to problems that are hard to solve exactly, but you can only be a certain percentage confident that you have the right answer. You can estimate that confidence from the training and validation regimen; how high it gets depends on the problem and the data.

Correctness is a funny concept in most of "soft computing." The best I can tell you is: "a neural network is correct when it consistently satisfies the parameters of its design." You establish this by training it with data, then verifying with other data, and having a feedback loop in the middle which lets you know if the neural network is functioning appropriately.
This is of course the case only for neural networks that are large enough that a direct proof of correctness is not possible. It is possible to prove that a neural network is correct through analysis if you are attempting to build a neural network that learns XOR or something similar, but for that class of problem an ANN is seldom necessary.
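For a network as small as XOR, "proof through analysis" can be as simple as checking every possible input. The following is a minimal sketch, assuming a hand-built 2-2-1 network with step activations; the weights are illustrative choices, not taken from any answer here.

```python
from itertools import product

def step(x):
    """Threshold activation: fires when the weighted sum is positive."""
    return 1 if x > 0 else 0

def xor_net(a, b):
    # Hidden unit 1 computes OR, hidden unit 2 computes AND (hand-set weights).
    h_or = step(a + b - 0.5)
    h_and = step(a + b - 1.5)
    # The output fires for "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)

def verify():
    """Exhaustively compare the network with the XOR truth table."""
    return all(xor_net(a, b) == (a ^ b) for a, b in product((0, 1), repeat=2))
```

With only four possible inputs the check is complete; for a network with real-valued inputs no such enumeration exists, which is why the train/verify loop is used instead.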

You're opening up a bigger can of worms here than you might expect.
NNs are perhaps best thought of as universal function approximators, by the way, which may help you in thinking about this stuff.
Anyway, there is nothing special about NNs in terms of your question; the problem applies to any sort of learning algorithm.
The confidence you have in the results it is giving is going to rely on both the quantity and the quality (often harder to determine) of the training data that you have.
If you're really interested in this stuff, you may want to read up a bit on the problems of overtraining, and ensemble methods (bagging, boosting, etc.).
The real problem is that you usually aren't interested in the "correctness" (or rather, quality) of an answer on an input that you've already seen; you care about predicting the quality of the answer on an input you haven't seen yet. This is a much more difficult problem. Typical approaches, then, involve "holding back" some of your labelled data (i.e. the stuff you know the "correct" answer for) and testing your trained system against that. It gets subtle, though, when you start considering that you may not have enough data, or it may be biased, etc. So there are many researchers who basically spend all of their time thinking about these sorts of issues!

I've worked on projects where there is test data as well as training data, so you know the expected outputs for a set of inputs the NN hasn't seen.
One common way of analysing the result of any classifier is an ROC curve; an introduction to the statistics of classifiers and ROC curves can be found at Interpreting Diagnostic Tests.
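To make the ROC idea concrete, here is a small sketch in plain Python. It assumes the classifier emits a score per test example and that labels are 1 (positive) or 0 (negative); the scores and labels below are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep the decision threshold from the highest score down to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier outputs on six held-out examples.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

An area of 1.0 would mean every positive example outscored every negative one, while 0.5 is what random guessing gives.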

I'm a complete amateur in this field, but don't you use a pre-determined set of data you know is correct?

I don't believe there is a single correct answer but there are well-proven probabilistic or statistical methods that can provide reassurance. The statistical methods are usually referred to as Resampling.
One method that I can recommend is the Jackknife.
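As a rough illustration of the jackknife, the sketch below recomputes a statistic with each observation left out in turn and uses the spread of those estimates as a standard error. The `outcomes` list is hypothetical data (1 for a correct prediction, 0 for a wrong one), and `accuracy` is just one possible statistic.

```python
import math

def jackknife(data, estimator):
    """Leave-one-out jackknife: return (mean of estimates, standard error)."""
    n = len(data)
    # Recompute the statistic n times, each time leaving one observation out.
    estimates = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    centre = sum(estimates) / n
    variance = (n - 1) / n * sum((e - centre) ** 2 for e in estimates)
    return centre, math.sqrt(variance)

def accuracy(outcomes):
    """Fraction of correct predictions; 1 marks a correct answer, 0 a wrong one."""
    return sum(outcomes) / len(outcomes)

# Hypothetical per-example results of a trained network on held-out data.
outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
estimate, std_error = jackknife(outcomes, accuracy)
```

The standard error gives a feel for how much the measured accuracy would move if the test set had been slightly different.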

My teacher always said his rule of thumb was to train the NN with 80% of your data and validate it with the other 20%. And, of course, make sure that data set is as comprehensive as you need.
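A minimal sketch of that 80/20 rule of thumb, assuming the data is a list of (input, label) pairs; the function name and the fixed seed are illustrative choices.

```python
import random

def train_validation_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data set: ten (input, label) pairs.
samples = [(x, x % 2) for x in range(10)]
train, validation = train_validation_split(samples)
```

Shuffling before splitting matters when the data is stored in some order (for example sorted by class), since otherwise the validation part may not be representative.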

If you want to find out whether the backpropagation of the network is correct, there is an easy way.
Since you calculate the derivative of the error landscape, you can check numerically whether your implementation is correct. You will calculate the derivative of the error with respect to a specific weight, ∂E/∂w. You can show that
∂E/∂w = (E(w + e) - E(w - e)) / (2 * e) + O(e^2).
(Bishop, Pattern Recognition and Machine Learning, p. 246)
Essentially, you evaluate the error slightly to the left of the weight, evaluate it slightly to the right of the weight, and check whether the numerical gradient matches your analytical gradient.
(Here's an implementation: http://github.com/bayerj/arac/raw/9f5b225d6293974f8adfc5f20dfc6439cc1bed35/src/cpp/utilities/utilities.cpp)
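The check described above can be sketched in a few lines of Python. This is not the linked implementation; it assumes a single sigmoid neuron with squared error as a stand-in for a full network, and all names are illustrative.

```python
import math

def error(weights, x, target):
    """Squared error of a single sigmoid neuron (stand-in for a full network)."""
    activation = sum(w * xi for w, xi in zip(weights, x))
    output = 1.0 / (1.0 + math.exp(-activation))
    return 0.5 * (output - target) ** 2

def analytic_gradient(weights, x, target):
    """Gradient obtained by the chain rule, as backpropagation would compute it."""
    activation = sum(w * xi for w, xi in zip(weights, x))
    output = 1.0 / (1.0 + math.exp(-activation))
    delta = (output - target) * output * (1.0 - output)
    return [delta * xi for xi in x]

def numeric_gradient(weights, x, target, eps=1e-5):
    """Central-difference estimate (E(w + e) - E(w - e)) / (2e), one weight at a time."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += eps
        down = list(weights)
        down[i] -= eps
        grads.append((error(up, x, target) - error(down, x, target)) / (2 * eps))
    return grads

weights, x, target = [0.3, -0.2, 0.5], [1.0, 2.0, -1.0], 1.0
max_diff = max(abs(a - n) for a, n in zip(analytic_gradient(weights, x, target),
                                          numeric_gradient(weights, x, target)))
```

If `max_diff` is not tiny (several orders of magnitude below the gradient itself), the analytic gradient code is the first place to look for a bug.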

To me, there is probably only one value that takes extra effort to verify: the gradient computed by backpropagation. I think Bayer's answer is the commonly used and suggested approach. You need to write extra code for this, but it is all forward-propagation matrix multiplication, which is easy to write and verify.
There are some other issues which will prevent you from getting the best answer, for example:
The cost function of an NN is not convex, so gradient descent is not guaranteed to find the global optimum.
Over/under fitting
Not choosing the "right" features/model
etc
However, I think these are beyond the scope of programming bugs.

Related

How a neural network can recognize more patterns at once - how does it work?

I am a beginner when it comes to NNs. I understand the basics, but I am not sure about the following. Let's consider a handwriting recognition network. I understand you can train a network to recognize a pattern, i.e. the weights are set appropriately. But if the network is trained to recognize "A", how could it then recognize "B", which would certainly require the weights to be set differently?
Or does the network only search for the one letter it is currently trained on? I hope I made myself clear: I am basically trying to understand how a trained network can recognize various characters if the weights get mixed when training for all of them.
When a neural network is being trained, the network is searching for a set of weights which, when combined with the training inputs, will yield the expected outputs.
One of the key settings in a neural network is the learning rate. Essentially, it controls how far each training example moves the weights, and therefore how much of the previously acquired information is kept.
It is important that this value be neither too high (if memory serves, setting it to 1 would mean that the weights are changed by taking into consideration only the current training case) nor too low (setting it to zero means that no weight change is made at all). In either case, the neural network would never converge.
When training for handwriting, as far as I know, the training set involves various letters written in various forms. That being said, although neural networks tend to fare better than other AI approaches when there are variations in the input, there are always limitations.
EDIT:
As per your question, assuming that you are dealing with a back propagation neural network, at each layer you apply an activation function and pass the result of the current layer to the next.
The extra bit comes during training, where you compare the result you have with the result you want. This is where you apply the back propagation algorithm to amend the weights, and this is where the learning rate comes in.
As you have mentioned in your comment, the weights will be changed; however, the value of the learning rate determines how much the weights change. Usually, you want them to change relatively slowly so that they converge, which is why you keep the learning rate relatively low. If you have a very high learning rate, the current training example will, as you are saying, undo the improvements made by the previous ones.
The way you can look at it is that, while training, the neural network is searching for a set of weights which, given its training inputs, yields the expected results. So basically, you are looking for one set of weights which satisfies all your training cases at once.
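To see how one set of weights can serve several letters at once, here is a small sketch. It assumes a single-layer softmax network with one output per letter, trained on three made-up 3x3 bitmaps; a real handwriting recognizer would be far larger, and every name and number here is illustrative.

```python
import math

# Three 3x3 bitmaps, flattened row by row (made-up stand-ins for letters).
PATTERNS = {
    "T": [1, 1, 1, 0, 1, 0, 0, 1, 0],
    "L": [1, 0, 0, 1, 0, 0, 1, 1, 1],
    "X": [1, 0, 1, 0, 1, 0, 1, 0, 1],
}
CLASSES = list(PATTERNS)

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, pixels):
    """One output per class; every output sees the same input pixels."""
    inputs = pixels + [1]  # the trailing 1 is the bias input
    return softmax([sum(w * v for w, v in zip(row, inputs)) for row in weights])

def train(epochs=500, rate=0.1):
    weights = [[0.0] * 10 for _ in CLASSES]
    for _ in range(epochs):
        # Every epoch visits all letters, so no letter overwrites the others.
        for target_index, name in enumerate(CLASSES):
            pixels = PATTERNS[name]
            probs = forward(weights, pixels)
            for k, row in enumerate(weights):
                err = probs[k] - (1.0 if k == target_index else 0.0)
                for j, v in enumerate(pixels + [1]):
                    row[j] -= rate * err * v
    return weights

def classify(weights, pixels):
    probs = forward(weights, pixels)
    return CLASSES[probs.index(max(probs))]

weights = train()
```

The key point is that the letters are interleaved during training with a small learning rate, so the weights settle on values that work for all of them instead of being reset for each one.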

ANN: How to correctly pick initial weights to avoid local minima?

In backpropagation training, during gradient descent down the error surface, a network with a large number of neurons in the hidden layer can get stuck in a local minimum. I have read that reinitializing the weights to random numbers will, in all cases, eventually avoid this problem.
This means that there always IS a set of "correct" initial weight values. (Is this safe to assume?)
I need to find or make an algorithm that finds them.
I have tried googling the algorithm, tried devising it myself but to no avail. Can anyone propose a solution? Perhaps a name of algorithm that I can search for?
Note: this is a regular feed-forward 3-layer burrito :)
Note: I know attempts have been made to use GAs for that purpose, but that requires re-training the network on each iteration which is time costly when it gets large enough.
Thanks in advance.
There is never a guarantee that you will not get stuck in a local optimum, sadly. Unless you can prove certain properties about the function you are trying to optimize, local optima exist and hill-climbing methods will fall prey to them. (And typically, if you can prove the things you need to prove, you can also select a better tool than a neural network.)
One classic technique is to gradually reduce the learning rate, then increase it and slowly draw it down, again, several times. Raising the learning rate reduces the stability of the algorithm, but gives the algorithm the ability to jump out of a local optimum. This is closely related to simulated annealing.
I am surprised that Google has not helped you here, as this is a topic with many published papers. Try terms like "local minima" and "local minima problem" in conjunction with neural networks and backpropagation. You should see many references to improved backprop methods.
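The "reduce, then raise, then reduce again" learning rate idea mentioned above can be sketched as a simple schedule. The exponential decay shape, the cycle length, and the two bounds are assumptions made for illustration; the answer does not prescribe specific values.

```python
def restart_schedule(step, cycle_length=100, high=0.5, low=0.01):
    """Learning rate that decays from `high` towards `low`, then jumps back up each cycle."""
    position = (step % cycle_length) / cycle_length  # 0.0 at the start of a cycle
    return high * (low / high) ** position

# Three full cycles of the schedule.
rates = [restart_schedule(s) for s in range(300)]
```

Inside a training loop, the weight update would use `restart_schedule(step)` in place of a constant learning rate; the periodic jump back to `high` is what gives the search a chance to leave a local optimum.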

How do I pick a good representation for a board game tactic for a genetic algorithm?

For my bachelor's thesis I want to write a genetic algorithm that learns to play the game of Stratego (if you don't know this game, it's probably safe to assume I said chess). I haven't ever before done actual AI projects, so it's an eye-opener to see how little I actually know of implementing things.
The thing I'm stuck with is coming up with a good representation for an actual strategy. I'm probably making some thinking error, but some problems I encounter:
I don't assume you would have a representation containing a lot of transitions between board positions, since that would just be bruteforcing it, right?
What could branches of a decision tree look like? Any representation I come up with don't have interchangeable branches... If I were to use a bit string, which is apparently also common, what would the bits represent?
Do I assign scores to the distance between certain pieces? How would I represent that?
I think I ought to know these things after three+ years of study, so I feel pretty stupid - this must look like I have no clue at all. Still, any help or tips on what to Google would be appreciated!
I think you could define a decision model and then try to optimize the parameters of that model. You can also create multi-stage decision models. I once did something similar for solving a dynamic dial-a-ride problem (paper here) by modeling it as a two-stage linear decision problem. To give you an example, you could:
For each of your figures decide which one is to move next. Each figure is characterized by certain features derived from its position on the board, e.g. ability to make a score, danger, protecting x other figures, and so on. Each of these features can be combined (e.g. in a linear model, through a neural network, through a symbolic expression tree, a decision tree, ...) and give you a rank on which figure to act next with.
Acting with the figure you selected. Again there are a certain number of actions that can be taken, each has certain features. Again you can combine and rank them and one action will have the highest priority. This is the one you choose to perform.
The features you extract can be very simple or insanely complex, it's up to what you think will work best vs what takes how long to compute.
To evaluate and improve the quality of your decision model you can then simulate these decisions in several games against opponents and train the parameters of the model that combines these features to rank the moves (e.g. using a GA). This way you tune the model to win as many games as possible against the specified opponents. You can test the generality of that model by playing against opponents it has not seen before.
As Mathew Hall just said, you can use GP for this (if your model is a complex rule), but this is just one kind of model. In my case a linear combination of the features did very well.
Btw, if you're interested, we also have software for heuristic optimization which provides you with GA, GP and that stuff. It's called HeuristicLab. It's GPL and open source, and comes with a GUI (Windows). We have a how-to on evaluating the fitness function in an external program (data exchange using protocol buffers), so you can work on your simulation and your decision model and let the algorithms present in HeuristicLab optimize your parameters.
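The two-stage linear decision model described in this answer could look roughly like the sketch below. All feature names, values, and weights are hypothetical; in practice the weights are the parameters a GA would tune by playing simulated games.

```python
def score(features, weights):
    """Linear decision model: weighted sum of a candidate's features."""
    return sum(weights[name] * value for name, value in features.items())

def best_candidate(candidates, weights):
    """Return the name of the candidate with the highest score."""
    return max(candidates, key=lambda name: score(candidates[name], weights))

# Hypothetical feature names and values; the weights are what a GA would tune.
piece_weights = {"can_capture": 2.0, "in_danger": 1.5, "protects": 0.5}
pieces = {
    "scout":   {"can_capture": 0.0, "in_danger": 1.0, "protects": 0.0},
    "captain": {"can_capture": 1.0, "in_danger": 0.0, "protects": 2.0},
}
action_weights = {"captures": 3.0, "exposes_piece": -2.0}
actions = {
    "captain": {
        "attack_left": {"captures": 1.0, "exposes_piece": 1.0},
        "retreat":     {"captures": 0.0, "exposes_piece": 0.0},
    },
    "scout": {"advance": {"captures": 0.0, "exposes_piece": 1.0}},
}

piece = best_candidate(pieces, piece_weights)            # stage 1: which piece moves
action = best_candidate(actions[piece], action_weights)  # stage 2: what it does
```

The genome for the GA is then just the flat list of weight values, which makes crossover and mutation straightforward.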
Vincent,
First, don't feel stupid. You've been (I infer) studying basic computer science for three years; now you're applying those basic techniques to something pretty specialized-- a particular application (Stratego) in a narrow field (artificial intelligence.)
Second, make sure your advisor fully understands the rules of Stratego. Stratego is played on a larger board, with more pieces (and more types of pieces) than chess. This gives it a vastly larger space of legal positions, and a vastly larger space of legal moves. It is also a game of hidden information, increasing the difficulty yet again. Your advisor may want to limit the scope of the project, e.g., concentrate on a variant with full observation. I don't know why you think this is simpler, except that the moves of the pieces are a little simpler.
Third, I think the right thing to do at first is to take a look at how games in general are handled in the field of AI. Russell and Norvig, chapters 3 (for general background) and 5 (for two player games) are pretty accessible and well-written. You'll see two basic ideas: One, that you're basically performing a huge search in a tree looking for a win, and two, that for any non-trivial game, the trees are too large, so you search to a certain depth and then cop out with a "board evaluation function" and look for one of those. I think your third bullet point is in this vein.
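As a toy illustration of "search to a certain depth, then fall back on an evaluation function", here is a depth-limited minimax sketch. It assumes a much simpler game than Stratego: a pile of stones, each player removes one to three, and whoever takes the last stone wins. The `evaluate` function is a deliberately uninformative placeholder for the part a GA or neural network would supply.

```python
def evaluate(pile):
    """Placeholder board evaluation: 0.0 means 'no idea who is ahead'."""
    return 0.0

def minimax(pile, depth, maximizing):
    """Value of a position for the maximizing player, searched `depth` plies deep."""
    if pile == 0:
        # The previous player took the last stone and won.
        return -1.0 if maximizing else 1.0
    if depth == 0:
        return evaluate(pile)  # cut the search off and fall back on the heuristic
    values = [minimax(pile - take, depth - 1, not maximizing)
              for take in (1, 2, 3) if take <= pile]
    return max(values) if maximizing else min(values)
```

For example, `minimax(5, 10, True)` searches deep enough to find a forced win, whereas `minimax(9, 2, True)` runs out of depth and returns whatever `evaluate` says, which is exactly where a learned evaluation function earns its keep.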
The board evaluation function is the magic, and probably a good candidate for using either a genetic algorithm, or a genetic program, either of which might be used in conjunction with a neural network. The basic idea is that you are trying to design (or evolve, actually) a function that takes as input a board position, and outputs a single number. Large numbers correspond to strong positions, and small numbers to weak positions. There is a famous paper by Chellapilla and Fogel showing how to do this for a game of Checkers:
http://library.natural-selection.com/Library/1999/Evolving_NN_Checkers.pdf
I think that's a great paper, tying three great strands of AI together: Adversarial search, genetic algorithms, and neural networks. It should give you some inspiration about how to represent your board, how to think about board evaluations, etc.
Be warned, though, that what you're trying to do is substantially more complex than Chellapilla and Fogel's work. That's okay-- it's 13 years later, after all, and you'll be at this for a while. You're still going to have a problem representing the board, because the AI player has imperfect knowledge of its opponent's state; initially, nothing is known but positions, but eventually as pieces are eliminated in conflict, one can start using First Order Logic or related techniques to start narrowing down individual pieces, and possibly even probabilistic methods to infer information about the whole set. (Some of these may be beyond the scope of an undergrad project.)
The fact that you are having problems coming up with a representation for an actual strategy is not that surprising. In fact, I would argue that it is the most challenging part of what you are attempting. Unfortunately, I haven't heard of Stratego, so, being a bit lazy, I am going to assume you said chess.
The trouble is that a chess strategy is rather a complex thing. You suggest in your question storing lots of transitions between board positions in the GA, but chess has astronomically many positions, and more possible games than there are atoms in the observable universe, so this is clearly not going to work very well. What you will likely need to do is encode in the GA a series of weights/parameters attached to something that takes in the board position and fires out a move; I believe this is what you are hinting at in your second suggestion.
Probably the simplest suggestion would be to use some sort of generic function approximation like a neural network; Perceptrons or Radial Basis Functions are two possibilities. You can encode weights for the various nodes into the GA, although there are other fairly sound ways to train a neural network, see Backpropagation. You could perhaps encode the network structure instead/as well, this also has the advantage that I am pretty sure a fair amount of research has been done into developing neural networks with a genetic algorithm so you wouldn't be starting completely from scratch.
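A bare-bones sketch of encoding weights in a GA follows. It assumes a real-valued genome, tournament selection, uniform crossover, Gaussian mutation, and elitism; the toy `fitness` function (closeness to a hidden target vector) merely stands in for "how well a network with these weights plays", which would be far more expensive to evaluate.

```python
import random

def evolve(fitness, n_weights, pop_size=20, generations=30, seed=0):
    """Evolve a weight vector; returns (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_weights)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            # Uniform crossover followed by occasional Gaussian mutation.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [w + rng.gauss(0, 0.1) if rng.random() < 0.2 else w for w in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Toy fitness: closeness to a hidden target, standing in for games won.
TARGET = [0.5, -0.25, 0.75]

def fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))

best, history = evolve(fitness, n_weights=3)
```

Because of elitism the best fitness never gets worse from one generation to the next, which makes progress easy to monitor.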
You still need to come up with how you are going to present the board to the neural network and interpret the result from it. With chess especially, you would have to take note that a lot of moves will be illegal. It would be very beneficial if you could encode the board and interpret the result such that only legal moves are presented. I would suggest implementing the mechanics of the system and then playing around with different board representations to see what gives good results. A few off-the-top-of-my-head ideas to get you started could be the following, although I am not really convinced any of them are especially great ways to do this:
A string covering all 64 squares one after another, with a number representing what is present in each square. Most obvious, but probably a rather bad representation, as a lot of work will be required to filter out illegal moves.
A string covering all 64 squares one after another, with a number representing what can move to each square. This has the advantage of embodying the covering concept of chess, where you want to gain as much coverage of the board with your pieces as possible, but it still has problems with illegal moves and with telling friendly pieces from enemy ones.
A string covering all 32 pieces one after another, with a number representing the square each piece is on.
In general though I would suggest that chess is rather a complex game to start with, I think it will be rather hard to get something playing to standard which is noticeably better than random. I don't know if Stratego is any simpler, but I would strongly suggest you opt for a fairly simple game. This will let you focus on getting the mechanics of the implementation correct and the representation of the game state.
Anyway hope that is of some help to you.
EDIT: As a quick addition it is worth looking into how standard chess AI's work, I believe most use some sort of Minimax system.
When you say "tactic", do you mean you want the GA to give you a general algorithm to play the game (i.e. evolve an AI) or do you want the game to use a GA to search the space of possible moves to generate a move at each turn?
If you want to do the former, then look into using Genetic programming (GP). You could try to use it to produce the best AI you can for a fixed tree size. JGAP already comes with support for GP as well. See the JGAP Robocode example for an instance of this. This approach does mean you need a domain specific language for a Stratego AI, so you'll need to think carefully how you expose the board and pieces to it.
Using GP means your fitness function can just be how well the AI does at a fixed number of pre-programmed games, but that requires a good AI player to start with (or a very patient human).
@DonAndre's answer is absolutely correct for movement. In general, problems involving state-based decisions are hard to model with GAs, requiring some form of GP (either explicit or, as @DonAndre suggested, trees that are essentially declarative programs).
A general Stratego player seems to me quite challenging, but if you have a reasonable Stratego playing program, "Setting up your Stratego board" would be an excellent GA problem. The initial positions of your pieces would be the phenotype and the outcome of the external Stratego-playing code would be the fitness. It is intuitively likely that random setups would be disadvantaged versus setups that have a few "good ideas" and that small "good ideas" could be combined into fitter-and-fitter setups.
...
On the general problem of what a decision tree might look like: I kept finding it hard to come up with a small enough example, but consider the case where you are evaluating whether to attack a same-ranked piece (which, IIRC, destroys both you and the other piece?):
double locationNeed = aVeryComplexDecisionTree();
if (thatRank == thisRank) {
    double sacrificeWillingness = SACRIFICE_GENETIC_BASE; // assume range 0.0 - 1.0
    double sacrificeNeed = anotherComplexTree(); // 0.0 - 1.0
    double sacrificeInContext = sacrificeNeed * SACRIFICE_NEED_GENETIC_DISCOUNT; // 0.0 - 1.0
    // Assumption: the discounted need is weighed against the piece's base willingness.
    if (sacrificeInContext > 1.0 - sacrificeWillingness) {
        // ...OK, this piece is "willing" to sacrifice itself
    }
}
One way or the other, the basic idea is that you'd still have a lot of coding of Stratego-play, you'd just be seeking places where you could insert parameters that would change the outcome. Here I had the idea of a "base" disposition to sacrifice itself (presumably higher in common pieces) and a "discount" genetically-determined parameter that would weight whether the piece would "accept or reject" the need for a sacrifice.

Backpropagation issues

I have a couple of questions about how to code the backpropagation algorithm of neural networks:
The topology of my networks is an input layer, hidden layer and output layer. Both the hidden layer and output layer have sigmoid functions.
First of all, should I use the bias? To where should I connect the bias in my network? Should I put one bias unit per layer in both the hidden layer and output layer? What about the input layer?
In this link, they define the last delta as the input - output and they backpropagate the deltas as can be seen in the figure. They hold a table to put all the deltas before actually propagating the errors in a feedforward fashion. Is this a departure from the standard backpropagation algorithm?
Should I decrease the learning factor over time?
In case anyone knows, is Resilient Propagation an online or batch learning technique?
Thanks
edit: One more thing. In the following picture, d f1(e) / de, assuming I'm using the sigmoid function, is f1(e) * [1- f1(e)], right?
It varies. Personally, I don't see much of a reason for bias, but I haven't studied NNs enough to actually make a valid case for or against them. I'd try it both ways and compare the results.
That's correct. Backpropagation involves calculating the deltas first, and then propagating them across the network.
Yes. The learning factor should be decreased over time. However, with BP, you can hit local, incorrect plateaus, so sometimes, around the 500th iteration, it makes sense to reset the learning factor to the initial rate.
I can't answer that; I've never heard anything about RP.
Your question needs to be specified a bit more thoroughly... What is your need? Generalization or memorization? Are you anticipating a complex pattern matching data set, or a continuous-domain input-output relationship? Here are my $0.02:
I would suggest you leave a bias neuron in just in case you need it. If it is deemed unnecessary by the NN, training should drive the weights to negligible values. It will connect to every neuron in the layer up ahead, but is not connected to from any neuron in the preceding layer.
The equation looks like standard backprop as far as I can tell.
It is hard to generalize whether your learning rate needs to be decreased over time. The behaviour is highly data-dependent. The smaller your learning rate, the more stable your training will be. However, it can be painfully slow, especially if you're running it in a scripting language like I did once upon a time.
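Resilient backprop (Rprop, available in MATLAB) is normally used as a batch technique: it adapts each weight's step size from the sign of the gradient accumulated over the whole training set, so purely online (per-sample) use tends to work poorly.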
I'd just like to add that you might want to consider alternative activation functions if possible. The sigmoid function doesn't always give the best results...
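On the sigmoid derivative asked about in the question's edit, and the bias connection discussed in the answers, a short numerical sketch may help. It assumes the standard logistic sigmoid; the function names are illustrative.

```python
import math

def sigmoid(e):
    return 1.0 / (1.0 + math.exp(-e))

def sigmoid_derivative(e):
    """d f(e) / de for the sigmoid, written as f(e) * (1 - f(e))."""
    f = sigmoid(e)
    return f * (1.0 - f)

def neuron_output(inputs, weights, bias_weight):
    """Sigmoid neuron; the bias acts like a weight on a constant input of 1."""
    e = sum(w * x for w, x in zip(weights, inputs)) + bias_weight * 1.0
    return sigmoid(e)

# Compare the closed form with a central finite difference at a few points.
h = 1e-6
max_gap = max(abs(sigmoid_derivative(e) - (sigmoid(e + h) - sigmoid(e - h)) / (2 * h))
              for e in (-2.0, 0.0, 0.5, 3.0))
```

Treating the bias as one more weight attached to a constant input means it is trained by exactly the same update rule as every other weight, with no special case in the backpropagation code.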

Brain modelling

Just wondering, since we've reached 1 teraflop per PC, yet we are still not able to model an insect's brain.
Has anyone seen a decent implementation of a self-learning, self-developing neural network?
I saw an interesting experiment mapping the physical neural layout of a rat's brain to a digital neural network, with weighting modelled on the neuron chemistry of each component, taken using MRI and other techniques. Quite interesting. (New Scientist or Focus, 2 issues ago?)
IBM Blue Brain comes to mind
http://news.bbc.co.uk/1/hi/sci/tech/8012496.stm
The problem is computation power, as you rightly point out. But for a sequence of stimuli to a neural network, the range of calculations tends to grow exponentially as each stimulus encounters deeper nested nodes. Any complex weighting algorithm means that the time spent at each node can get expensive. Domain-specific neural maps tend to be quicker because they are specialized. Brains in mammals have many general paths, making it harder to teach them, and harder for a computer to model a real mammal brain in a given space/time.
Real brains also have tons of cross-talk, like static (some people think this is where creativity or original thought stems from). Brains also don't learn using 'direct' stimulus/reward; they use past experience of unrelated matters to create their own learning. Recreating the neurons in a computational space is one thing; creating accurate learning is another. Never mind the dopamine (octopamine in insects) and other neurological chemicals.
Imagine giving a digital brain LSD or anti-depressants, as a real simulation. Awesome. That would be a complex simulation, I suspect.
I think you're kind of making the assumption that our idea of how neural networks work is a good model for the brain at a large-scale level; I'm not sure that is a good assumption. Hell, not too many years ago we didn't think the glial cells were important to mental functions, and it was the idea for a long time that there is no neurogenesis after the brain matures.
On the other hand, neural networks do seem to handle some apparently complex functions pretty well.
So, here's a little puzzle question for you: how many teraflops or petaflops do you think a human brain's computation represents?
Jeff Hawkins would say that a neural net is a poor approximation of a brain. His "On Intelligence" is a terrific read.
Yup: OpenCog is working on it.
It's the structure. Even if we had computers today with the same or higher performance than a human brain (there are different predictions of when we'll get there, but there are still a few years to go), we would still need to program them. And while we know a lot about the brain today, there are still many, many more things we do not know. And these aren't just details, but large areas that are not understood at all.
Focusing only on the Tera-/Peta-FLOPS is like looking only at megapixels with digital cameras: it focuses on only one value when there are many factors involved (and there are a few more of those in a brain than in a camera). I also believe that many of the estimates just how many FLOPS would be needed to simulate a brain are way off - but that's a different discussion altogether.
Just wondering, we've reached 1 teraflop per PC, and we are still not able to model an insect's brain. Has anyone seen a decent implementation of a self-learning, self-developing neural network?
We can already model brains. The question these days, is how fast, and how accurate.
In the beginning, there was effort expended on trying to find the most abstract representation of neurons with the least amount of physical properties needed.
This led to the invention of the perceptron at Cornell University, which is a very simple model indeed. In fact, it may have been too simple: the famous MIT AI professor Marvin Minsky, together with Seymour Papert, published the book Perceptrons (1969), which showed that a single-layer perceptron cannot learn XOR (a basic logic gate that could be emulated by every computer we have today). The result was widely read as a limitation of neural networks in general, even though multi-layer networks can represent XOR, and it plunged neural network research into the dark ages for at least 10 years.
While probably not as impressive as many would like, there are learning networks that are already in existence that can do visual and speech learning and recognition.
And even though we have faster CPUs, a CPU is still not the same as a neuron. Neurons in our brain are, at the very least, parallel adder units. So imagine 100 billion simulated human neurons, each summing its inputs and sending its output over some 100 trillion connections, with a "clock" of about 20 Hz. The amount of computation going on here far exceeds the petaflops of processing power we have, especially when our CPUs are mostly serial instead of parallel.
In 2007, they simulated the equivalent of a half mouse brain for 10 seconds at half the actual speed: http://news.bbc.co.uk/1/hi/technology/6600965.stm
There is a worm named C. elegans whose anatomy is completely known to us. Every cell is mapped out and every neuron is well studied. This worm has an interesting property: it moves or grows only towards regions at the temperature in which it was raised. Here is link to the paper. The paper implements this property with a neuronal model. Some students have also built a robot that, using this neuronal model, follows only the dark regions in an area with different shades of light. This work could have been done using other methods as well, but this method is more noise resilient, as shown by the paper linked above.

Resources