Bayesian Network: Independence and Conditional Independence - artificial-intelligence

I'm having some misunderstandings concerning Bayesian networks. My main confusions are independence and conditional independence!
If, e.g., I have to calculate
P(Burglary|JohnCalls),
is it P(Burglary|JohnCalls) = P(Burglary), because I'm seeing that Burglary is independent of JohnCalls?

Burglary is independent of JohnCalls given Alarm. So P(B|A,J) = P(B|A).
Explaining the example
The idea is that John can only tell you that there is an alarm. But if you already know that there is an alarm, then the phone call from John tells you nothing new about the possibility of a burglary. Yes, you know that John heard the alarm, but that's not what you're interested in when asking about Burglary.
Conditional Independence
In school, you've probably learned about unconditional independence, which holds when P(A,B) = P(A)*P(B), or equivalently P(A|B) = P(A). Unconditional independence makes things easy to calculate but happens pretty rarely - inside a belief network, unconditionally independent nodes would be unconnected.
Conditional independence, on the other hand, is a bit more complicated but happens more often. It means that two events become independent of each other once another, "separating" fact is learned.
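To see this numerically, here is a minimal Python sketch that brute-force enumerates the joint distribution of the classic burglary network and checks that P(B|A,J) = P(B|A) while P(B|J) is not P(B). The CPT numbers are the usual textbook values (Russell & Norvig); treat the exact figures as assumptions.

    from itertools import product

    # Textbook CPTs for the burglary network; the exact numbers are assumptions.
    P_B = 0.001                      # P(Burglary)
    P_E = 0.002                      # P(Earthquake)
    P_A = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
    P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls | Alarm)

    def joint(b, e, a, j):
        """P(B=b, E=e, A=a, J=j) via the chain rule along the network."""
        p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
        p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
        p *= P_J[a] if j else 1 - P_J[a]
        return p

    def prob(query, **evidence):
        """P(query=True | evidence), by brute-force enumeration."""
        num = den = 0.0
        for b, e, a, j in product([True, False], repeat=4):
            world = {"B": b, "E": e, "A": a, "J": j}
            if any(world[k] != v for k, v in evidence.items()):
                continue
            den += joint(b, e, a, j)
            if world[query]:
                num += joint(b, e, a, j)
        return num / den

    print(prob("B", A=True, J=True))  # P(B | A, J)
    print(prob("B", A=True))          # P(B | A): same value, J adds nothing
    print(prob("B", J=True))          # P(B | J): much larger than P(B)!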

Related

How do I pick a good representation for a board game tactic for a genetic algorithm?

For my bachelor's thesis I want to write a genetic algorithm that learns to play the game of Stratego (if you don't know this game, it's probably safe to assume I said chess). I have never done an actual AI project before, so it's an eye-opener to see how little I actually know about implementing things.
The thing I'm stuck with is coming up with a good representation for an actual strategy. I'm probably making some thinking error, but some problems I encounter:
I don't assume you would have a representation containing a lot of transitions between board positions, since that would just be brute-forcing it, right?
What could the branches of a decision tree look like? Any representation I come up with doesn't have interchangeable branches... If I were to use a bit string, which is apparently also common, what would the bits represent?
Do I assign scores to the distance between certain pieces? How would I represent that?
I think I ought to know these things after three+ years of study, so I feel pretty stupid - this must look like I have no clue at all. Still, any help or tips on what to Google would be appreciated!
I think you could define a decision model and then try to optimize the parameters of that model. You can also create multi-stage decision models. I once did something similar for solving a dynamic dial-a-ride problem (paper here) by modeling it as a two-stage linear decision problem. To give you an example, you could:
For each of your pieces, decide which one is to move next. Each piece is characterized by certain features derived from its position on the board, e.g., ability to score, danger, protecting x other pieces, and so on. These features can be combined (e.g., in a linear model, through a neural network, through a symbolic expression tree, a decision tree, ...) to give you a ranking of which piece to act with next.
Act with the piece you selected. Again, there is a certain number of actions that can be taken, each with certain features. Again you can combine and rank them, and one action will have the highest priority; this is the one you choose to perform.
The features you extract can be very simple or insanely complex; it's a tradeoff between what you think will work best and how long it takes to compute.
To evaluate and improve the quality of your decision model you can then simulate these decisions in several games against opponents and train the parameters of the model that combines these features to rank the moves (e.g. using a GA). This way you tune the model to win as many games as possible against the specified opponents. You can test the generality of that model by playing against opponents it has not seen before.
As Mathew Hall just said, you can use GP for this (if your model is a complex rule), but that is just one kind of model. In my case a weighted linear combination of the features did very well.
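As a concrete sketch of such a model in Python: each candidate move gets a feature vector, the genome is just the weight vector, and the highest-scoring move is played. All names here (extract_features and so on) are hypothetical stand-ins, not code from any particular engine.

    import random

    N_FEATURES = 4  # e.g. scoring chance, danger, pieces protected, mobility

    def extract_features(board, move):
        # Hypothetical stand-in: a real engine would compute these numbers
        # from the position; here we just return something deterministic.
        rng = random.Random(str((board, move)))
        return [rng.random() for _ in range(N_FEATURES)]

    def choose_move(board, legal_moves, weights):
        """Rank candidate moves by a linear combination of their features."""
        def score(move):
            feats = extract_features(board, move)
            return sum(w * f for w, f in zip(weights, feats))
        return max(legal_moves, key=score)

    # The GA only has to evolve the weight vector; fitness = games won with it.
    def random_genome():
        return [random.uniform(-1.0, 1.0) for _ in range(N_FEATURES)]

    print(choose_move("start", ["a4", "e4", "Nf3"], random_genome()))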
Btw, if you're interested, we've also got software for heuristic optimization which provides you with GA, GP and that sort of thing. It's called HeuristicLab. It's GPL and open source, and comes with a GUI (Windows). We have a howto on evaluating the fitness function in an external program (data exchange using protocol buffers), so you can work on your simulation and your decision model and let the algorithms in HeuristicLab optimize your parameters.
Vincent,
First, don't feel stupid. You've been (I infer) studying basic computer science for three years; now you're applying those basic techniques to something pretty specialized -- a particular application (Stratego) in a narrow field (artificial intelligence).
Second, make sure your advisor fully understands the rules of Stratego. Stratego is played on a larger board, with more pieces (and more types of pieces) than chess. This gives it a vastly larger space of legal positions, and a vastly larger space of legal moves. It is also a game of hidden information, increasing the difficulty yet again. Your advisor may want to limit the scope of the project, e.g., concentrate on a variant with full observation. I don't know why you think this is simpler, except that the moves of the pieces are a little simpler.
Third, I think the right thing to do at first is to take a look at how games in general are handled in the field of AI. Russell and Norvig, chapters 3 (for general background) and 5 (for two player games) are pretty accessible and well-written. You'll see two basic ideas: One, that you're basically performing a huge search in a tree looking for a win, and two, that for any non-trivial game, the trees are too large, so you search to a certain depth and then cop out with a "board evaluation function" and look for one of those. I think your third bullet point is in this vein.
The board evaluation function is the magic, and probably a good candidate for using either a genetic algorithm, or a genetic program, either of which might be used in conjunction with a neural network. The basic idea is that you are trying to design (or evolve, actually) a function that takes as input a board position, and outputs a single number. Large numbers correspond to strong positions, and small numbers to weak positions. There is a famous paper by Chellapilla and Fogel showing how to do this for a game of Checkers:
http://library.natural-selection.com/Library/1999/Evolving_NN_Checkers.pdf
I think that's a great paper, tying three great strands of AI together: Adversarial search, genetic algorithms, and neural networks. It should give you some inspiration about how to represent your board, how to think about board evaluations, etc.
Be warned, though, that what you're trying to do is substantially more complex than Chellapilla and Fogel's work. That's okay-- it's 13 years later, after all, and you'll be at this for a while. You're still going to have a problem representing the board, because the AI player has imperfect knowledge of its opponent's state; initially, nothing is known but positions, but eventually as pieces are eliminated in conflict, one can start using First Order Logic or related techniques to start narrowing down individual pieces, and possibly even probabilistic methods to infer information about the whole set. (Some of these may be beyond the scope of an undergrad project.)
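To make the evaluation-function idea concrete, a minimal sketch in Python: the function maps a position to one number through weighted features, and the GA (or GP) evolves the weights. The feature names are hypothetical; a real Stratego evaluator would compute them from the actual position.

    # Hypothetical feature extractors; a real engine computes these from the board.
    def material_balance(board): return board.get("material", 0.0)
    def mobility(board):         return board.get("mobility", 0.0)
    def info_advantage(board):   return board.get("info", 0.0)  # what we know of them

    def evaluate(board, weights):
        """Map a position to a single number: large = strong, small = weak."""
        features = [material_balance(board), mobility(board), info_advantage(board)]
        return sum(w * f for w, f in zip(weights, features))

    # A genome is just the weight vector the GA evolves:
    genome = [1.0, 0.3, 0.5]
    print(evaluate({"material": 2.0, "mobility": -1.0, "info": 0.5}, genome))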
The fact that you are having problems coming up with a representation for an actual strategy is not that surprising. In fact, I would argue that it is the most challenging part of what you are attempting. Unfortunately, I haven't heard of Stratego, so, being a bit lazy, I am going to assume you said chess.
The trouble is that a chess strategy is rather a complex thing. You suggest in your question a representation containing lots of transitions between board positions in the GA, but the game tree of chess has more nodes than there are atoms in the universe, so this is clearly not going to work very well. What you will likely need to do is encode in the GA a series of weights/parameters that are attached to something that takes in the board position and fires out a move; I believe this is what you are hinting at in your second suggestion.
Probably the simplest suggestion would be to use some sort of generic function approximation like a neural network; perceptrons or radial basis function networks are two possibilities. You can encode the weights of the various nodes into the GA, although there are other fairly sound ways to train a neural network; see backpropagation. You could perhaps encode the network structure instead, or as well; this also has the advantage that a fair amount of research has been done into developing neural networks with genetic algorithms, so you wouldn't be starting completely from scratch.
You still need to come up with how you are going to present the board to the neural network and interpret the result from it. In particular, with chess you would have to take note that a lot of moves will be illegal. It would be very beneficial if you could encode the board and interpret the result such that only legal moves are presented. I would suggest implementing the mechanics of the system and then playing around with different board representations to see what gives good results. A few top-of-the-head ideas to get you started (a small encoding sketch follows the list), although I am not really convinced any of them are especially great ways to do this:
A bit string with all 64 squares one after another, with a number representing what is present in each square. Most obvious, but probably a rather bad representation, as a lot of work will be required to filter out illegal moves.
A bit string with all 64 squares one after another, with a number representing what can move to each square. This has the advantage of embodying the covering concept of chess, where you want to gain as much coverage of the board with your pieces as possible, but it still has problems with illegal moves and with distinguishing friendly/enemy pieces.
A bit string with all 32 pieces one after another, with a number representing the location of each piece on the board.
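As a tiny sketch of the first encoding above in Python (64 squares, one code per square); the piece codes are arbitrary assumptions:

    # Hypothetical piece codes: 0 = empty, 1-6 = own pieces, 7-12 = enemy pieces.
    EMPTY, MY_PAWN, THEIR_PAWN = 0, 1, 7   # ...and so on for the other types

    def encode_board(squares):
        """Flatten an 8x8 board (nested lists of piece codes) into 64 numbers."""
        return [code for rank in squares for code in rank]

    board = [[EMPTY] * 8 for _ in range(8)]
    board[1] = [MY_PAWN] * 8      # a rank of our pawns
    board[6] = [THEIR_PAWN] * 8   # a rank of enemy pawns
    vector = encode_board(board)
    assert len(vector) == 64      # this vector is what the network/GA consumes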
In general, though, I would suggest that chess is rather a complex game to start with; I think it will be rather hard to get something playing to a standard noticeably better than random. I don't know if Stratego is any simpler, but I would strongly suggest you opt for a fairly simple game. This will let you focus on getting the mechanics of the implementation and the representation of the game state correct.
Anyway hope that is of some help to you.
EDIT: As a quick addition, it is worth looking into how standard chess AIs work; I believe most use some sort of minimax system.
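For reference, a bare-bones minimax in Python (no alpha-beta pruning); legal_moves, apply, and evaluate are assumed to be supplied by your game engine:

    def minimax(state, depth, maximizing, game):
        """Search to a fixed depth, then fall back on the evaluation function."""
        moves = game.legal_moves(state)
        if depth == 0 or not moves:
            return game.evaluate(state)
        if maximizing:
            return max(minimax(game.apply(state, m), depth - 1, False, game)
                       for m in moves)
        return min(minimax(game.apply(state, m), depth - 1, True, game)
                   for m in moves)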
When you say "tactic", do you mean you want the GA to give you a general algorithm to play the game (i.e. evolve an AI) or do you want the game to use a GA to search the space of possible moves to generate a move at each turn?
If you want to do the former, then look into using Genetic programming (GP). You could try to use it to produce the best AI you can for a fixed tree size. JGAP already comes with support for GP as well. See the JGAP Robocode example for an instance of this. This approach does mean you need a domain specific language for a Stratego AI, so you'll need to think carefully how you expose the board and pieces to it.
Using GP means your fitness function can just be how well the AI does at a fixed number of pre-programmed games, but that requires a good AI player to start with (or a very patient human).
@DonAndre's answer is absolutely correct for movement. In general, problems involving state-based decisions are hard to model with GAs, requiring some form of GP (either explicit or, as @DonAndre suggested, trees that are essentially declarative programs).
A general Stratego player seems to me quite challenging, but if you have a reasonable Stratego playing program, "Setting up your Stratego board" would be an excellent GA problem. The initial positions of your pieces would be the phenotype and the outcome of the external Stratego-playing code would be the fitness. It is intuitively likely that random setups would be disadvantaged versus setups that have a few "good ideas" and that small "good ideas" could be combined into fitter-and-fitter setups.
...
On the general problem of what a decision tree might look like: even trying to come up with a simple example, I kept finding it hard to make one small enough, but maybe consider the case where you are evaluating whether to attack a same-ranked piece (which, IIRC, destroys both your piece and the other piece?):
    double locationNeed = aVeryComplexDecisionTree();   // unused in this fragment
    if (thatRank == thisRank) {
        double sacrificeWillingness = SACRIFICE_GENETIC_BASE;   // genetic parameter, assume range 0.0 - 1.0
        double sacrificeNeed = anotherComplexTree();            // 0.0 - 1.0
        double sacrificeInContext = sacrificeNeed * SACRIFICE_NEED_GENETIC_DISCOUNT; // 0.0 - 1.0
        // Note: comparing sacrificeInContext against sacrificeNeed itself would always
        // be false for a discount below 1.0; weigh the discounted need against the
        // piece's genetic willingness instead (an assumption about the intent here).
        if (sacrificeInContext > 1.0 - sacrificeWillingness) {
            // ...OK, this piece is "willing" to sacrifice itself
        }
    }
One way or the other, the basic idea is that you'd still have a lot of coding of Stratego play; you'd just be seeking places where you could insert parameters that would change the outcome. Here I had the idea of a "base" disposition to sacrifice itself (presumably higher in common pieces) and a "discount" genetically determined parameter that would weight whether the piece "accepts or rejects" the need for a sacrifice.

Artificial Intelligence: If there is such a thing as a "sequential-decision" task, what does a "non-sequential decision" task look like?

I have heard that neural networks are very good when implementing solutions to sequential-decision tasks. However, I assume that the qualifier "sequential" exists because there must likewise be a "non-sequential" realm. Just wondering if there are some standard, canonical examples people know off the top of their heads. Thanks :)
Driving is a good example of a sequential task: You are always deciding whether to steer straight, left, or right. It is sequential because there is some uncertainty in the vehicle; you might steer straight but the car goes a little bit to the right. When you notice this, you correct by steering a little to the left. There would be no way to make all the decisions ahead of time, close your eyes, and drive without crashing.
Planning a path from A to B is an example of a non-sequential task (e.g., Google maps). You know the map, you know the start & goal, and you know the road network. You can make all decisions at once (not sequential) and complete the task (finding a path).
More generally, you'll be making sequential decisions any time you have some uncertainty: you observe something, then react appropriately. From a machine learning perspective, this is the difference between supervised learning and online learning.
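A toy illustration of the difference in Python: the open-loop (non-sequential) driver commits to every decision up front and drifts off the road, while the closed-loop (sequential) driver observes its position and corrects each step. The drift model is invented purely for the example.

    import random

    def drive(steps, controller):
        """The car should stay near position 0; each step it drifts randomly."""
        pos = 0.0
        for t in range(steps):
            pos += random.uniform(-0.2, 0.2)  # uncertainty in the vehicle
            pos += controller(t, pos)         # the steering decision
        return abs(pos)

    open_loop = lambda t, pos: 0.0            # all decisions made ahead of time
    closed_loop = lambda t, pos: -0.5 * pos   # observe the drift, then correct

    print("open-loop drift:  ", drive(1000, open_loop))
    print("closed-loop drift:", drive(1000, closed_loop))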
Recognizing a picture of your mother. Deciding what you're going to have for breakfast. Coming up with examples of non-sequential decision tasks.

Best way to automate testing of AI algorithms?

I'm wondering how people test artificial intelligence algorithms in an automated fashion.
One example would be the Turing Test - say there were a number of submissions for a contest. Is there any conceivable way to score candidates in an automated fashion, other than just having humans test them out?
I've also seen some data sets (obscured images of numbers/letters, groups of photos, etc.) that can be fed in and learned over time. What good resources are out there for this?
One challenge I see: you don't want an algorithm that tailors itself to the test data over time, since you are trying to see how well it does in the general case. Are there any techniques to ensure it doesn't do this? Such as giving it a random test each time, or averaging its results over a bunch of random tests.
Basically, given a bunch of algorithms, I want some automated process to feed it data and see how well it "learned" it or can predict new stuff it hasn't seen yet.
This is a complex topic - good AI algorithms are generally the ones that generalize well to "unseen" data. The simplest method is to have two datasets: a training set and an evaluation set used for measuring performance. But generally you want to "tune" your algorithm, so you may want three datasets: one for learning, one for tuning, and one for evaluation. What counts as tuning depends on your algorithm, but a typical example is a model with a few hyper-parameters (for example, parameters of your Bayesian prior, under the Bayesian view of learning) that you would like to tune on a separate dataset. The learning procedure will already have set values for the ordinary parameters (or maybe you hardcoded them), but having enough data lets you tune the hyper-parameters separately.
As for making those separate datasets, there are many ways to do so, for example by dividing the data you have available into subsets used for different purposes. There is a tradeoff to be made because you want as much data as possible for training, but you want enough data for evaluation too (assuming you are in the design phase of your new algorithm/product).
A standard method to do so in a systematic way from a known dataset is cross validation.
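A minimal sketch of k-fold cross-validation in plain Python; train and score stand in for whatever your own algorithm provides:

    def k_fold_cv(data, k, train, score):
        """Average the evaluation score over k held-out folds."""
        folds = [data[i::k] for i in range(k)]
        total = 0.0
        for i in range(k):
            held_out = folds[i]
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            model = train(training)           # fit on everything except fold i
            total += score(model, held_out)   # evaluate on the unseen fold
        return total / k

    # e.g.: mean_accuracy = k_fold_cv(samples, 5, my_train_fn, my_score_fn)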
Generally when it comes to this sort of thing you have two datasets - one large "training set" which you use to build and tune the algorithm, and a separate smaller "probe set" that you use to evaluate its performance.
@Anon has the right of things - training and what I'll call validation sets. That noted, the bits and pieces I see about developments in this field point at two things:
Bayesian Classifiers: there's something like this probably filtering your email. In short, you train the algorithm to make a probabilistic decision about whether a particular item is part of a group or not (e.g. spam and ham); a tiny sketch follows this list.
Multiple Classifiers: this is the approach that the winning group in the Netflix challenge took, whereby it's not about optimizing one particular algorithm (e.g. Bayesian methods, genetic programming, neural networks, etc.) but about combining several to get a better result.
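A minimal naive Bayes sketch in Python, to show the spam/ham idea from the first item (Laplace smoothing, log-probabilities; the training data format is an assumption):

    import math
    from collections import Counter

    def train_nb(docs):
        """docs: list of (word_list, label) pairs; returns the counts needed."""
        word_counts = {}           # per-label word frequencies
        label_counts = Counter()
        for words, label in docs:
            label_counts[label] += 1
            word_counts.setdefault(label, Counter()).update(words)
        return word_counts, label_counts

    def classify(words, word_counts, label_counts):
        """Pick the label with the highest (Laplace-smoothed) log-probability."""
        vocab = len({w for c in word_counts.values() for w in c})
        best, best_lp = None, -math.inf
        for label, n in label_counts.items():
            lp = math.log(n / sum(label_counts.values()))   # the prior
            total = sum(word_counts[label].values())
            for w in words:
                lp += math.log((word_counts[label][w] + 1) / (total + vocab))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

    wc, lc = train_nb([(["cheap", "pills"], "spam"), (["meeting", "notes"], "ham")])
    print(classify(["cheap", "meeting"], wc, lc))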
As for data sets Weka has several available. I haven't explored other libraries for data sets, but mloss.org appears to be a good resource. Finally data.gov offers a lot of sets that provide some interesting opportunities.
Training datasets and test sets are very common for k-means and other clustering algorithms, but to have something that's artificially intelligent without supervised learning (which is what having a training set means), you are building a "brain", so to speak, based on, for example:
In chess: all possible future states reachable from the current game state.
In most AI learning (reinforcement learning) you have a problem where the "agent" is trained by playing the game over and over. Basically, you ascribe a value to every state. Then you assign an expected value to each possible action in a state.
So say you have S states and A actions per state (although you might have more possible moves in one state and fewer in another); then you want to figure out the most valuable states to be in, and the most valuable actions to take.
In order to figure out the value of states and their corresponding actions, you have to iterate through the game. Probabilistically, a certain sequence of states will lead to victory or defeat, and you gradually learn which states lead to failure and are "bad states". You also learn which ones are more likely to lead to victory, and these are subsequently "good" states. Each gets a mathematical value associated with it, usually an expected reward.
Reward from second-last state to a winning state: +10
Reward if entering a losing state: -10
So a state that gives a negative reward passes negative reward backwards: to the state that led to it, then to the state before that, and so on.
Eventually, you have a mapping of expected reward based on which state you're in, and based on which action you take. You eventually find the "optimal" sequence of steps to take. This is often referred to as an optimal policy.
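A bare-bones sketch of this kind of value backup, as tabular Q-learning in Python; the environment interface (reset/step) is an assumption, not any particular library's API:

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes, alpha=0.1, gamma=0.9, eps=0.1):
        """Learn the expected reward Q[state][action] by replaying the game."""
        Q = defaultdict(lambda: {a: 0.0 for a in actions})
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # explore sometimes; otherwise take the currently best action
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(Q[state], key=Q[state].get)
                next_state, reward, done = env.step(a)  # assumed interface
                best_next = max(Q[next_state].values())
                # this update is how reward propagates backwards through states
                Q[state][a] += alpha * (reward + gamma * best_next - Q[state][a])
                state = next_state
        return Q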
Conversely, the ordinary courses of action you step through while deriving the optimal policy are simply called policies; in Q-learning you are always implementing some "policy".
Usually, the way of determining the reward is the interesting part. Suppose I reward you for each state transition that does not lead to failure. Then the value of walking through the states until termination is however many increments I made, i.e., however many state transitions occurred.
If certain states have extremely low value, then loss is easy to avoid, because almost all bad states get avoided.
However, you don't want to discourage discovery of new, potentially more efficient paths that stray from the one route known to work. So you want to reward and punish the agent in such a way as to ensure "victory", or "keeping the pole balanced", or whatever, for as long as possible; but you don't want it stuck at local maxima and minima because failure is so painful that no new, unexplored routes get tried. (There are many approaches in addition to this one.)
So when you ask "how do you test AI algorithms?", the best part is that the testing itself is how many of these "algorithms" are constructed. The algorithm is designed to test a certain course of action (policy). It's much more complicated than
"turn left every half mile"
it's more like
"turn left every half mile if I have turned right 3 times and then turned left 2 times and had a quarter in my left pocket to pay fare... etc etc"
It's very precise.
So the testing is usually actually how the AI is being programmed. Most models are just probabilistic representations of what is probably good and probably bad. Calculating every possible state is easier for computers (we thought!) because they can focus on one task for very long periods of time, and how much they remember is exactly as much as the RAM you have. However, we learn by affecting neurons in a probabilistic manner, which is why the memristor is such a great discovery -- it's just like a neuron!
You should look at neural networks; they're mind-blowing. The first time I read about making a "brain" out of a matrix of fake neuron synaptic connections -- a brain that can "remember" -- it basically rocked my universe.
A.I. research is mostly probabilistic because we don't know how to make "thinking"; we just know how to imitate our own inner learning process of try, try again.

Implementing crossover in genetic programming [closed]

I'm writing a genetic programming (GP) system (in C but that's a minor detail). I've read a lot of the literature (Koza, Poli, Langdon, Banzhaf, Brameier, et al) but there are some implementation details I've never seen explained. For example:
I'm using a steady state population rather than a generational approach, primarily to use all of the computer's memory rather than reserve half for the interim population.
Q1. In GP, as opposed to GA, when you perform crossover you select two parents but do you create one child or two, or is that a free choice you have?
Q2. In steady state GP, as opposed to a generational system, what members of the population do the children created by crossover replace? This is what I haven't seen discussed. Is it the two parents, or is it two other, randomly-selected members? I can understand if it's the latter, and that you might use negative tournament selection to choose members to replace, but would that not create premature convergence? (After a crossover event the population contains the two original parents plus two children of those parents, and two other random members get removed. Elitism is inherent.)
Q3. Is there a Web forum or mailing list focused on GP? Oddly I haven't found one. Yahoo's GP group is used almost exclusively for announcements, the Poli/Langdon Field Guide forum is almost silent, and GP discussions on general/game programming sites like gamedev.net are very basic.
Thanks for any help you can provide!
Firstly, relax.
There are no "correct" methods in GP. GP is more art than science. Try lots of schemes and pick the ones that work best.
Q1: 1, 2, or many. You choose.
Q2: Replace 1, 2, or all. Or try some elitism.
Q3: You probably won't find forums discussing these questions, because there are no right/best answers. Sorry.
PS. In my research, crossover never really performed well...
If you can read Python, you may want to take a look at Pyevolve. I am mainly involved in it on the GA side, but it has support for GP as well. Maybe you can get some hints there.
Q1 is your choice, but single child would probably be more common. Every time you do the lottery selection of parents, you're applying selection pressure, which is what you want.
Q2: Negative tournament selection is exactly the right approach. Yes, losing low-fitness members of the population causes rapid convergence initially, but once your population gets into the hard-to-search part of the solution space, it won't be as cut-and-dried which ones lose the tournament / lottery. What you do have to beware of is stagnation of the gene pool; I suggest monitoring the entropy of the genome to track its heterogeneity. "elitism is inherent" -- Well, yeah, that's the point! ;-)
Q3: comp.ai.genetic is probably your best bet. Sometimes the topic is picked up in game development fora, like on Gamasutra.
P.S. Genetic programming in C?!? How are you assuring the viability of the offspring? Doing genetic programming in a non-homoiconic language is a real challenge.
Check out MetaOptimize.com for your stacky needs.
As Ray says, it's mostly up to you, but typically in a steady-state setup you would only create a single offspring.
Again you have options. I wouldn't replace the parents. If they've been picked as parents based on their fitness you could be eliminating some of the fittest members of the population. Easiest is just to randomly pick an individual to be replaced. Alternatively, you could replace the least fit individual, but that can lead to premature convergence. Another option is to use the same selection strategy that you use to choose parents but use the inverse fitness so that it favours less fit individuals.
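A sketch of that last option in Python (an inverse tournament, so less fit individuals are more likely to be replaced); fitness is assumed to be a function you already have:

    import random

    def replace_index(population, fitness, tournament_size=3):
        """Inverse tournament: the least fit of a random sample gets replaced."""
        contenders = random.sample(range(len(population)), tournament_size)
        return min(contenders, key=lambda i: fitness(population[i]))

    # Steady-state step: the child overwrites the loser of the inverse tournament.
    # population[replace_index(population, fitness)] = child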
You could try comp.ai.genetic on USENET (and Google Groups).
It sounds like some of your questions are not necessarily specific to genetic programming; if that's true, you might have some luck asking the folks over at the NEAT Users Group.
They primarily discuss the Neuroevolution of Augmenting Topologies (or NEAT) algorithm, which is a genetic algorithm used to evolve neural networks. But topics like elitism and crossover strategies are pretty general, and can apply to both GA and GP algorithms.
Otherwise, as Dan and Ray have said, a lot of these decisions are made after experimentation with one's particular software and domain. Try applying your algorithm to different problems and pay attention to how it behaves -- after a while, you'll probably develop an intuition for what works and what doesn't.
I would create an unlimited number of offspring, but only on the basis of success, and let older members of the population die. Lack of fitness can also lead to early death. This just seems to follow a natural order.
Q1. In GP, as opposed to GA, when you perform crossover you select two parents but do you create one child or two, or is that a free choice you have?
Yes, it's your choice; but generally it's not advisable to create many individuals with the same parents, because the variation among individuals created by the same parents is very limited, and that could cost processing speed and memory better spent on other individuals showing different trends and behaviors that require analysis. (Creating more individuals is not a problem, though, if the evolution process is close to reaching its endpoint.)
Q2. In steady state GP...
It is advisable to replace individuals based on the ranking provided by the fitness function you have adopted.

The correctness of neural networks

I have asked other AI folk this question, but I haven't really been given an answer that satisfied me.
For anyone else that has programmed an artificial neural network before, how do you test for its correctness?
I guess, another way to put it is, how does one debug the code behind a neural network?
With neural networks, generally what is happening is that you are taking an untrained neural network and training it up using a given set of data, so that it responds in the way you expect. Here's the deal: usually, you're training it up to a certain confidence level for your inputs. Generally (and again, this is just generally; your mileage may vary), you cannot get neural networks to always provide the right answer; rather, you are getting an estimation of the right answer, to within a confidence range. You know that confidence range by how you have trained the network.
The question arises as to why you would want to use neural networks if you cannot be certain that the conclusion they come to is verifiably correct; the answer is that neural networks can arrive at high-confidence answers for certain classes of problems (including NP-complete problems) quickly, whereas exact solutions of NP-complete problems are only known to be obtainable in super-polynomial time (it is only checking a proposed solution that takes polynomial time). In layman's terms, neural networks can "solve" problems that normal computation struggles with; but you can only be a certain percentage confident that you have the right answer. You can determine that confidence by the training regimen, and can usually make sure that you will have very high (e.g., 99.9%) confidence.
Correctness is a funny concept in most of "soft computing." The best I can tell you is: "a neural network is correct when it consistently satisfies the parameters of its design." You do this by training it with data, and then verifying with other data, with a feedback loop in the middle which lets you know whether the neural network is functioning appropriately.
This is of course the case only for neural networks that are large enough that a direct proof of correctness is not possible. It is possible to prove that a neural network is correct through analysis if you are attempting to build one that learns XOR or something similar, but for that class of problem an ANN is seldom necessary.
You're opening up a bigger can of worms here than you might expect.
NNs are perhaps best thought of as universal function approximators, by the way, which may help you in thinking about this stuff.
Anyway, there is nothing special about NNs in terms of your question; the problem applies to any sort of learning algorithm.
The confidence you have in the results it is giving is going to rely on both the quantity and the quality (often harder to determine) of the training data that you have.
If you're really interested in this stuff, you may want to read up a bit on the problems of overtraining, and ensemble methods (bagging, boosting, etc.).
The real problem is that you usually aren't actually interested in the "correctness" (cf quality) of an answer on a given input that you've already seen, rather you care about predicting the quality of answer on an input you haven't seen yet. This is a much more difficult problem. Typical approaches then, involve "holding back" some of your training data (i.e. the stuff you know the "correct" answer for) and testing your trained system against that. It gets subtle though, when you start considering that you may not have enough data, or it may be biased, etc. So there are many researchers who basically spend all of their time thinking about these sort of issues!
I've worked on projects where there is test data as well as training data, so you know the expected outputs for a set of inputs the NN hasn't seen.
One common way of analysing the result of any classifier is use of an ROC curve; an introduction to the statistics of classifiers and ROC curves can be found at Interpreting Diagnostic Tests
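If you work in Python, scikit-learn can compute the curve directly; a minimal sketch (the labels and scores here are made up):

    from sklearn.metrics import roc_curve, auc

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # known test-set labels
    y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]   # classifier outputs

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", auc(fpr, tpr))  # 1.0 = perfect, 0.5 = random guessing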
I'm a complete amateur in this field, but don't you use a pre-determined set of data you know is correct?
I don't believe there is a single correct answer but there are well-proven probabilistic or statistical methods that can provide reassurance. The statistical methods are usually referred to as Resampling.
One method that I can recommend is the Jackknife.
My teacher always said his rule of thumb was to train the NN with 80% of your data and validate it with the other 20%. And, of course, make sure that data set is as comprehensive as you need.
If you want to find out whether the backpropagation of the network is correct, there is an easy way.
Since you calculate the derivative of the error landscape, you can check whether your implementation is correct numerically. You will calculate the derivative of the error with respect to a specific weight, ∂E/∂w. You can show that
∂E/∂w = (E(w + ε) − E(w − ε)) / (2ε) + O(ε²).
(Bishop, Pattern Recognition and Machine Learning, p. 246)
Essentially, you evaluate the error slightly to the left of the weight, evaluate it slightly to the right, and check that the numerical gradient matches your analytical gradient.
(Here's an implementation: http://github.com/bayerj/arac/raw/9f5b225d6293974f8adfc5f20dfc6439cc1bed35/src/cpp/utilities/utilities.cpp)
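In Python the same check looks roughly like this (a sketch; error(weights) is assumed to be your network's loss as a function of a flat weight vector, and grad(weights) your backpropagated gradient):

    def check_gradient(error, grad, weights, eps=1e-6, tol=1e-5):
        """Compare the analytic gradient against central finite differences."""
        analytic = grad(weights)
        for i in range(len(weights)):
            w_plus, w_minus = weights.copy(), weights.copy()
            w_plus[i] += eps
            w_minus[i] -= eps
            numeric = (error(w_plus) - error(w_minus)) / (2 * eps)
            assert abs(numeric - analytic[i]) < tol, f"gradient wrong at w[{i}]"

    # e.g.: check_gradient(net.loss, net.backprop_gradient, net.weights.copy())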
To me, there is probably only one value that takes extra effort to verify: the gradient of the backpropagation. I think Bayer's answer is what is commonly used and suggested. You need to write extra code for this, but it is all forward-propagation matrix multiplications, which are easy to write and verify.
There are some other issues which will prevent you from getting the best answer, for example:
The cost function of an NN is not convex, so gradient descent is not guaranteed to find the global optimum.
Over/underfitting
Not choosing the "right" features/model
etc
However, I think these are beyond the scope of a programming bug.

Resources