I read a little about ANNs and Markov processes. Can someone please help me understand where exactly Markov processes fit in with ANNs and genetic algorithms? Or simply, what could be the role of Markov processes in this scenario?
Thanks a lot
The accepted answer is correct, but I just wanted to add a couple of details.
A Markov process is any system that goes through a series of states randomly in such a way that if you know the current state, you can predict the likelihood of each possible next state. A common example is the weather; if it's sunny now, you can predict that it is likely to be sunny later, regardless of the previous weather.
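To make the weather example concrete, here is a minimal Python sketch of a two-state Markov chain; the states and probabilities are made up purely for illustration.

```python
import random

# Each row gives the probability of the next state given the current state
# (rows sum to 1). Only the current state matters: the Markov property.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample the next state using only the current state."""
    states = list(transitions[current].keys())
    probs = list(transitions[current].values())
    return random.choices(states, weights=probs, k=1)[0]

state = "sunny"
for _ in range(7):
    state = next_state(state)
    print(state)
```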
A genetic algorithm is one that begins by generating a bunch of arbitrary random solutions to a given problem. It then checks these solutions to see how good they are. The 'bad' solutions are discarded, the 'good' solutions are kept and combined together to form (hopefully) better solutions, just like successful members of a species breeding a new generation. In theory, repeating this process will give better and better solutions until you eventually have an optimal one.
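If it helps to see that generate/evaluate/select/breed loop in code, here is a minimal sketch; the bit-string fitness function, population size, and rates are arbitrary choices for illustration only.

```python
import random

# Toy genetic algorithm: evolve a bit-string toward all ones.
GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.05

def fitness(genome):
    return sum(genome)  # 'good' solutions have more ones

def crossover(a, b):
    cut = random.randint(1, GENOME_LEN - 1)
    return a[:cut] + b[cut:]

def mutate(genome):
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # keep the better half, discard the rest
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]
    # breed survivors (with a little mutation) to refill the population
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children

print(max(map(fitness, population)))  # approaches GENOME_LEN over the generations
```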
As you can see, they aren't algorithmically related. However, genetic algorithms are often used to generate Hidden Markov Models, for example here. The basic idea is that an HMM is initialized with random weights, a 'training set' of related Markov processes is run through it, and the weights are adjusted to give the members of the training set the highest probability of occurring. This is often done in speech recognition software.
Markov process and Artificial Neural Networks are completely different concepts.
Markov processes describe events that follow a certain statistical property, in the same way that the words "Gaussian" or "random" describe a set of events in terms of their statistical properties.
An Artificial Neural Network is an algorithm that helps you solve problems, and is not really related to Markov processes. You could be thinking of Hidden Markov Models, which are also an algorithm. HMMs assume the underlying system is a Markov process with hidden states.
Related
I want to develop a RISK board game, which will include an AI for computer players. Moreover, I read two articles, this and this, about it, and I realised that I must learn about Monte Carlo simulation and Markov chain techniques. I thought I had to use these techniques together, but I guess they are different techniques, each relevant to calculating probabilities about state transitions.
So, could anyone explain the important differences between them, and their advantages and disadvantages?
Finally, which would you prefer if you were implementing an AI for a RISK game?
Here you can find simple, already-determined probabilities for the outcomes of a battle in the RISK board game, and the brute-force algorithm used. There is a tree diagram which specifies all possible states. Should I use Monte Carlo or a Markov chain on this tree?
Okay, so I skimmed the articles to get a sense of what they were doing. Here is my overview of the terms you asked about:
A Markov Chain is simply a model of how your system moves from state to state. Developing a Markov model from scratch can sometimes be difficult, but once you have one in hand, they're relatively easy to use, and relatively easy to understand. The basic idea is that your game will have certain states associated with it; that as part of the game, you will move from state to state; and, critically, that this movement from state to state happens based on probability and that you know these probabilities.
Once you have that information, you can represent it all as a graph, where nodes are states and arcs between states (labelled with probabilities) are transitions. You can also represent it as a matrix that satisfies certain constraints, or with several other, more exotic data structures.
The short article is actually all about the Markov Chain approach, but-- and this is important-- it is using that approach only as a quick means of estimating what will happen if army A attacks a territory with army B defending it. It's a nice use of the technique, but it is not an AI for Risk, it is merely a module in the AI helping to figure out the likely results of an attack.
Monte Carlo techniques, by contrast, are estimators. Once you have a model of something, whether a Markov model or any other, you often find yourself in the position of wanting to estimate something about it. (Often it happens to be something that you can, if you squint hard enough, put into the form of an integral of something that happens to be very hard to work with in that format.) Monte Carlo techniques just sample randomly and aggregate the results into an estimate of what's going to happen.
Monte Carlo techniques are not, in my opinion, AI techniques. They are a very general purpose technique that happen to be useful for AI, but they are not AI per se. (You can say the same thing about Markov models, but the statement is weaker because Markov models are so extremely useful for planning in AI, that entire philosophies of AI are built around the technique. Markov models are used elsewhere, though, as well.)
So that is what they are. You also asked which one I would use if I had to implement a Risk AI. Well, neither of those is going to be sufficient. Monte Carlo, as I said, is not an AI technique, it's a general math tool. And Markov Models, while they could in theory represent the entirety of a game of Risk, are going to end up being very unwieldy: You would need to represent every state of the game, meaning every possible configuration of armies in territories and every possible configuration of cards in hands, etc. (I am glossing over many details, here: There are a lot of other difficulties with this approach.)
The core of Wolf's thesis is neither the Markov approach nor the Monte Carlo approach; it is actually what he describes as the evaluation function. This is the heart of the AI problem: how to figure out what action is best. The Monte Carlo method in Blatt's paper describes a way to figure out what the expected result of an action is, but that is not the same as figuring out what action is best. Moreover, there's a low-key statement in Wolf's thesis about look-ahead being hard to perform in Risk because the game trees become so very large, which is what led him (I think) to focus so much on the evaluation function.
So my real advice would be this: Read up on search-tree approaches, like minimax, alpha-beta pruning and especially expecti-minimax. You can find good treatments of these early in Russell and Norvig, or even on Wikipedia. Try to understand why these techniques work in general, but are cumbersome for Risk. That will lead you to some discussion of board evaluation techniques. Then go back and look at Wolf's thesis, focusing on his action evaluation function. And finally, focus on the way he tries to automatically learn a good evaluation function.
That is a lot of work. But Risk is not an easy game to develop an AI for.
(If you just want to figure out the expected results of a given attack, though, I'd say go for Monte Carlo. It's clean, very intuitive to understand, and very easy to implement. The only difficulty -- and it's not a big one -- is making sure you run enough trials to get a good result.)
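As a rough sketch of what that looks like, here is a Monte Carlo estimate of a single Risk dice exchange under the usual simplified rules (three attacker dice vs. two defender dice, ties go to the defender); run enough trials and the averages settle down.

```python
import random

def exchange_losses(attack_dice=3, defend_dice=2):
    """One simplified Risk dice exchange: highest dice are compared pairwise,
    defender wins ties. Returns (attacker_losses, defender_losses)."""
    atk = sorted((random.randint(1, 6) for _ in range(attack_dice)), reverse=True)
    dfn = sorted((random.randint(1, 6) for _ in range(defend_dice)), reverse=True)
    a_loss = d_loss = 0
    for a, d in zip(atk, dfn):
        if a > d:
            d_loss += 1
        else:
            a_loss += 1
    return a_loss, d_loss

# Monte Carlo: repeat the random experiment many times and average the results.
trials = 100_000
a_total = d_total = 0
for _ in range(trials):
    a, d = exchange_losses()
    a_total += a
    d_total += d

print("expected attacker losses per exchange:", a_total / trials)
print("expected defender losses per exchange:", d_total / trials)
```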
Markov chains are simply a set of transitions and their probabilities, assuming no memory of past events.
Monte Carlo simulations are repeated samplings of random walks over a set of probabilities.
You can use both together by using a Markov chain to model your probabilities and then a Monte Carlo simulation to examine the expected outcomes.
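A tiny sketch of that combination, with a hand-made transition table keyed by (attackers, defenders); the states and probabilities are invented here, but in practice you would compute them from the dice rules.

```python
import random

# Markov chain model of a small battle; absorbing states have no transitions.
chain = {
    (2, 1): {(2, 0): 0.6, (1, 1): 0.4},
    (1, 1): {(1, 0): 0.4, (0, 1): 0.6},
    (2, 0): {}, (1, 0): {}, (0, 1): {},
}

def rollout(state):
    """Monte Carlo: walk the chain until an absorbing state is reached."""
    while chain[state]:
        nexts, probs = zip(*chain[state].items())
        state = random.choices(nexts, weights=probs)[0]
    return state

trials = 50_000
wins = sum(1 for _ in range(trials) if rollout((2, 1))[1] == 0)  # defenders wiped out
print("estimated P(attacker wins):", wins / trials)
```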
For Risk I don't think I would use Markov chains because I don't see an advantage. A Monte Carlo analysis of your possible moves could help you come up with a pretty good AI in conjunction with a suitable fitness function.
I know this glossed over a lot but I hope it helped.
I tried Google and found little that I could understand.
I understand Markov chains to a very basic level: it's a mathematical model that only depends on the previous input to change states... so, sort of an FSM with weighted random chances instead of different criteria?
I've heard that you can use them to generate semi-intelligent nonsense, given sentences of existing words to use as a dictionary of kinds.
I can't think of search terms to find this, so can anyone link me or explain how I could produce something that gives a semi-intelligent answer? (If you asked it about pie, it would not start going on about the Vietnam War it had heard about.)
I plan on:
Having this bot idle in IRC channels for a bit
Strip any usernames out of the string and store as sentences or whatever
Over time, use this as the basis for the above.
Yes, a Markov chain is a finite-state machine with probabilistic state transitions. To generate random text with a simple, first-order Markov chain:
Collect bigram (adjacent word pair) statistics from a corpus (collection of text).
Make a Markov chain with one state per word. Reserve a special state for end-of-text.
The probability of jumping from state/word x to y is the probability of the word y immediately following x, estimated from the relative bigram frequencies in the training corpus.
Start with a random word x (perhaps determined by how often that word occurs as the first word of a sentence in the corpus). Then pick a state/word y to jump to randomly, taking into account the probability of y following x (the state transition probability). Repeat until you hit end-of-text.
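A minimal sketch of those steps in Python, keeping a list of followers per word so the bigram frequencies automatically act as transition weights; the toy corpus is obviously far too small to produce anything sensible.

```python
import random
from collections import defaultdict

END = "<end>"

def train(corpus):
    """Collect bigram statistics: for each word, the words that followed it."""
    chain = defaultdict(list)
    for sentence in corpus:
        words = sentence.split() + [END]
        for x, y in zip(words, words[1:]):
            chain[x].append(y)
    return chain

def generate(chain, start):
    """Random walk over the chain until the end-of-text state is reached."""
    word, output = start, [start]
    while word != END:
        word = random.choice(chain[word])  # duplicates in the list act as weights
        if word != END:
            output.append(word)
    return " ".join(output)

corpus = ["the cat sat on the mat", "the dog sat on the log"]
chain = train(corpus)
print(generate(chain, "the"))
```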
If you want to get something semi-intelligent out of this, then your best shot is to train it on lots of carefully collected texts. The "lots" part makes it produce proper sentences (or plausible IRC speak) with high probability; the "carefully collected" part means you control what it talks about. Introducing higher-order Markov chains also helps in both areas, but requires more storage for the necessary statistics. You may also look into things like statistical smoothing.
However, having your IRC bot actually respond to what is said to it takes a lot more than Markov chains. It may be done by doing text categorization (aka topic spotting) on what is said, then picking a domain-specific Markov chain for text generation. Naïve Bayes is a popular model for topic spotting.
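If you go down the topic-spotting road, the Naive Bayes part is only a few lines with scikit-learn; a hedged sketch, with invented toy utterances and labels standing in for whatever data you collect from IRC.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training data: a few utterances labelled by topic.
texts = ["I baked an apple pie", "pie crust recipe",
         "the war ended in 1975", "troops in vietnam"]
topics = ["food", "food", "history", "history"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, topics)

# Pick the domain-specific Markov chain based on the predicted topic.
print(model.predict(["do you like pumpkin pie"]))  # likely ['food']
```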
Kernighan and Pike in The Practice of Programming explore various implementation strategies for Markov chain algorithms. These, and natural language generation in general, are covered in great depth by Jurafsky and Martin, Speech and Language Processing.
You want to look for Ian Barber's Text Generation articles (phpir.com). Unfortunately the site is down or offline. I have a copy of his text and can send it to you.
It seems to me you are trying multiple things at the same time:
extracting words/sentences by idling in IRC
building a knowledge base
listening to some chat, parsing keywords
generate some sentence regarding keywords
Those are basically very different tasks. Markov models are often used for machine learning. I don't see much learning in your tasks though.
larsmans' answer shows how you generate sentences from word-based Markov models. You can also train the weights to favor the word pairs that other IRC users used. Nonetheless, this will not generate keyword-related sentences, because building/refining a Markov model is not the same as "driving" it.
You might try hidden Markov models (HMMs), where the visible output is the keywords and the hidden states are made from those word pairs. You could then favor sentences more appropriate to specific keywords dynamically.
I'm wondering how people test artificial intelligence algorithms in an automated fashion.
One example would be for the Turing Test - say there were a number of submissions for a contest. Is there any conceivable way to score candidates in an automated fashion, other than just having humans test them out?
I've also seen some data sets (obscured images of numbers/letters, groups of photos, etc.) that can be fed in and learned over time. What good resources are out there for this?
One challenge I see: you don't want an algorithm that tailors itself to the test data over time, since you are trying to see how well it does in the general case. Are there any techniques to ensure it doesn't do this? Such as giving it a random test each time, or averaging its results over a bunch of random tests.
Basically, given a bunch of algorithms, I want some automated process to feed it data and see how well it "learned" it or can predict new stuff it hasn't seen yet.
This is a complex topic - good AI algorithms are generally the ones which can generalize well to "unseen" data. The simplest method is to have two datasets: a training set and an evaluation set used for measuring performance. But generally, you want to "tune" your algorithm, so you may want 3 datasets: one for learning, one for tuning, and one for evaluation. What counts as tuning depends on your algorithm, but a typical example is a model with a few hyper-parameters (for example, parameters of your Bayesian prior under the Bayesian view of learning) that you would like to tune on a separate dataset. The learning procedure would already have set values for them (or maybe you hardcoded their values), but having enough data may help so that you can tune them separately.
As for making those separate datasets, there are many ways to do so, for example by dividing the data you have available into subsets used for different purposes. There is a tradeoff to be made because you want as much data as possible for training, but you want enough data for evaluation too (assuming you are in the design phase of your new algorithm/product).
A standard method to do so in a systematic way from a known dataset is cross validation.
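As a concrete sketch of the hold-out plus cross-validation idea (scikit-learn and its iris dataset are just convenient stand-ins for whatever model and data you actually have):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold back an evaluation set the algorithm never sees during training/tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# Cross-validation on the training portion gives a less noisy estimate
# (useful for tuning) without ever touching the held-out test set.
print("cv accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```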
Generally when it comes to this sort of thing you have two datasets - one large "training set" which you use to build and tune the algorithm, and a separate smaller "probe set" that you use to evaluate its performance.
#Anon has the right of things - training and what I'll call validation sets. That noted, the bits and pieces I see about developments in this field point at two things:
Bayesian Classifiers: there's something like this probably filtering your email. In short, you train the algorithm to make a probabilistic decision about whether a particular item is part of a group or not (e.g. spam and ham).
Multiple Classifiers: this is the approach that the winning group in the Netflix challenge took, whereby it's not about optimizing one particular algorithm (e.g. Bayesian, Genetic Programming, Neural Networks, etc.) but about combining several to get a better result.
As for data sets Weka has several available. I haven't explored other libraries for data sets, but mloss.org appears to be a good resource. Finally data.gov offers a lot of sets that provide some interesting opportunities.
Training data sets and test sets are very common for K-means and other clustering algorithms, but to have something that's artificially intelligent without supervised learning (which means having a training set) you are building a "brain", so to speak, based on:
In chess: all future states reachable from the current game state.
In most AI learning (reinforcement learning) you have a problem where the "agent" is trained by playing the game over and over. Basically you ascribe a value to every state. Then you assign an expected value to each possible action in a state.
So say you have S states and A actions per state (although you might have more possible moves in one state and fewer in another); then you want to figure out the most valuable states to be in, and the most valuable actions to take.
In order to figure out the value of states and their corresponding actions, you have to iterate through the game. Probabilistically, a certain sequence of states will lead to victory or defeat, and basically you learn which states lead to failure and are "bad" states. You also learn which ones are more likely to lead to victory, and these are subsequently "good" states. Each gets a mathematical value associated with it, usually as an expected reward.
Reward from second-last state to a winning state: +10
Reward if entering a losing state: -10
So states that give negative rewards then propagate that negative value backwards: to the state that led to the second-last state, then to the state before that, and so on.
Eventually, you have a mapping of expected reward based on which state you're in, and based on which action you take. You eventually find the "optimal" sequence of steps to take. This is often referred to as an optimal policy.
Conversely, the ordinary courses of action you step through while deriving the optimal policy are simply called policies; in Q-learning you are always following some "policy".
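A hedged sketch of that learn-by-playing loop as tabular Q-learning on a toy problem; the corridor world, rewards, and learning-rate numbers are all invented for illustration.

```python
import random
from collections import defaultdict

# Toy 1-D corridor: states 0..4, start at 2, state 4 is a win (+10), state 0 a loss (-10).
ACTIONS = [-1, +1]          # step left or right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
Q = defaultdict(float)      # Q[(state, action)] = expected reward

def step(state, action):
    nxt = state + action
    if nxt == 4:
        return nxt, 10, True
    if nxt == 0:
        return nxt, -10, True
    return nxt, 0, False

for _ in range(2000):                       # play the game over and over
    state, done = 2, False
    while not done:
        if random.random() < EPSILON:       # occasionally explore new routes
            action = random.choice(ACTIONS)
        else:                               # otherwise follow the current policy
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        best_next = 0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# The learned policy: the best action in each non-terminal state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, 4)})
```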
Usually, determining the reward is the interesting part. Suppose I reward you for each state transition that does not lead to failure. Then the value of walking through the states until termination is however many increments I made, i.e. however many state transitions I had.
If certain states have extremely low value, then loss is easy to avoid, because almost all bad states are avoided.
However, you don't want to discourage the discovery of new, potentially more efficient paths that don't follow the one route you already know works. So you want to reward and punish the agent in such a way as to ensure "victory", or "keeping the pole balanced", or whatever, for as long as possible; but if failure is too painful, no new, unexplored routes will be tried, and you get stuck at local maxima and minima. (There are many approaches in addition to this one.)
So when you ask "how do you test AI algorithms", the best part is that the testing itself is how many "algorithms" are constructed. The algorithm is designed to test a certain course of action (policy). It's much more complicated than
"turn left every half mile"
it's more like
"turn left every half mile if I have turned right 3 times and then turned left 2 times and had a quarter in my left pocket to pay fare... etc etc"
It's very precise.
So the testing is usually actually how the A.I. is being programmed. Most models are just probabilistic representations of what is probably good and probably bad. Calculating every possible state is easier for computers (we thought!) because they can focus on one task for very long periods of time and how much they remember is exactly how much RAM you have. However, we learn by affecting neurons in a probabilistic manner, which is why the memristor is such a great discovery -- it's just like a neuron!
You should look at Neural Networks, it's mindblowing. The first time I read about making a "brain" out of a matrix of fake-neuron synaptic connections... A brain that can "remember" basically rocked my universe.
A.I. research is mostly probabilistic because we don't know how to make "thinking"; we just know how to imitate our own inner learning process of try, try again.
I just watched a Google tech talk video covering "Polyworld" (found here) and they talk about breeding two neural networks together to form offspring. My question is, how would one go about combining two neural networks? They seem so different that any attempt to combine them would simply form a third, totally unrelated network. Perhaps I'm missing something, but I don't see a good way to take the positive aspects of two separate neural networks and combine them into a single one. If anyone could elaborate on this process, I'd appreciate it.
Neither response so far is true to the nature of Polyworld!...
They both describe a typical Genetic Algorithm (GA) application. While GA incorporates some of the elements found in Polyworld (breeding, selection), GA also implies some form of "objective" criteria aimed at guiding evolution towards [relatively] specific goals.
Polyworld, on the other hand is a framework for Artificial Life (ALife). With ALife, the survival of individual creatures and their ability to pass their genes to other generations, is not directed so much by their ability to satisfy a particular "fitness function", but instead it is tied to various broader, non-goal-oriented, criteria, such as the ability of the individual to feed itself in ways commensurate with its size and its metabolism, its ability to avoid predators, its ability to find mating partners and also various doses of luck and randomness.
Polyworld's model of the creatures and their world is relatively fixed. For example, they all have access to (though may elect not to use) various basic sensors (for color, for shape...) and various actuators ("devices" to eat, to mate, to turn, to move...), and these basic sensory and motor functions do not evolve (as they may in nature, for example when creatures find ways to become sensitive to heat or to sounds, and/or find ways of moving that are different from the original motion primitives, etc.).
On the other hand, the brain of the creatures has structure and connections which are both the product of the creature's genetic make-up ("stuff" from its ancestors) and of its own experience. For example, the main algorithm used to determine the strength of connections between neurons uses Hebbian logic (i.e. fire together, wire together) during the lifetime of the creature (mostly early on, I'm guessing, as the algorithm often has a "cooling" factor which minimizes its ability to change things in a big way as time goes by). It is unclear whether the model includes some form of Lamarckian evolution, whereby some of the high-level behaviors are [directly] passed on through the genes, rather than being [possibly] relearnt with each generation (on the indirect basis of some genetically passed structure).
The salient difference between ALife and GA (and there are others!) is that with ALife, the focus is on observing and fostering, in non-directed ways, emergent behaviors -- whatever they may be -- such as, for example, when some creatures evolve a makeup which prompts them to wait near piles of green food for dark green creatures to kill, or when some creatures start collaborating with one another, for example by seeking each other's presence for purposes other than mating, etc. With GA, the focus is on a particular behavior of the program being evolved. For example, the goal may be to have the program recognize edges in a video image, and therefore evolution is favored in this specific direction. Individual programs which perform this task better (as measured by some "fitness function") are favored with regard to evolution.
Another less obvious but important difference regards the way creatures (or programs, in the case of GA) reproduce themselves. With ALife, individual creatures find their own mating partners, at random at first, although after some time they may learn to reproduce only with creatures exhibiting a particular attribute or behavior. With GA, on the other hand, "sex" is left to the GA framework itself, which chooses, for example, to preferentially cross-breed individuals (and clones thereof) which score well on the fitness function (always leaving room for some randomness, lest the search stay stuck at some local maximum, but the point is that the GA framework decides mostly who has sex with whom)...
Having clarified this, we can return to the OP's original question...
... how would one go about combining two neural networks? They seem so different that any attempt to combine them would simply form a third, totally unrelated network. ...I don't see a good way to take the positive aspects of two separate neural networks and combine them into a single one...
The "genetic makeup" of a particular creature affects parameters such as the size of the creature, its color and such. It also includes parameters associated with the brain, in particular its structure: the number of neurons, the existence of connection from various sensors (eg. does the creature see the Blue color very well ?) the existence of connections towards various actuators (eg. does the creature use its light?). The specific connections between neurons and the relative strength of these may also be passed in the genes, if only to serve as initial values, to be quickly changed during brain learning phase.
By taking two creatures, we [nature!] can select in a more or less random fashion, which parameter come from the first creature and which come from the other creature (as well as a few novel "mutations" which come from neither parents). For example if the "father" had many connections with red color sensor, but the mother didn't the offspring may look like the father in this area, but also get his mother's 4 neuron-layers structure rather than father's 6 neuron-layers structure.
The interest of doing so is to discover new capabilities from the individuals; in the example above, the creature may now better detect red colored predators, and also process info more quickly in its slightly simpler brain (compared with the father's). Not all offspring are better equipped than their parents, such weaker individuals, may disappear in short order (or possibly and luckily survive long enough, to provide, say, their fancy way of moving and evading predators, even though their parent made them blind or too big or whatever... The key thing again: is not to be so worried about immediate usefulness of a particular trait, only to see it play in the long term.
They wouldn't really be breeding two neural networks together. Presumably they have a variety of genetic algorithm that produces a particular neural network structure given a particular sequence of "genes". They would start with a population of gene sequences, produce their characteristic neural networks, and then expose each of these networks to the same training regimen. Presumably, some of these networks would respond to the training better than some others (i.e. they would be more easily "trainable" to achieve the desired behavior). They would then take the genetic sequences that produced the best "trainees", cross-breed them with each other, produce their characteristic neural networks, which would then be exposed to the same training regimen. Presumably, some of these neural networks in the second generation would be even more trainable than those from the first generation. These would become the parents of the third generation, and so on and so forth.
The neural networks in this case probably aren't arbitrary trees. They are probably networks with a constant structure, i.e. the same nodes and connections, so 'breeding' them would involve 'averaging' the weights of nodes. You could average the weights for each pair of corresponding nodes in the two nets to produce the 'offspring' net. Or you could use a more complicated function that depends on ever-wider sets of neighboring nodes; the possibilities are vast.
My answer is incomplete if the assumption about the fixed structure is false or unwarranted.
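Under that fixed-structure assumption, 'breeding' can be as simple as the following sketch: plain weight averaging plus a little mutation. (Polyworld itself does something richer at the genome level; the layer sizes and mutation scale here are invented.)

```python
import numpy as np

def breed(parent_a, parent_b, mutation_sigma=0.01):
    """Produce an offspring from two networks with identical structure,
    represented here as lists of weight matrices, one per layer."""
    child = []
    for wa, wb in zip(parent_a, parent_b):
        w = (wa + wb) / 2.0                                   # average corresponding weights
        w += np.random.normal(0.0, mutation_sigma, w.shape)   # small random mutation
        child.append(w)
    return child

# Two toy parents: 3-4-2 networks with the same shapes but different weights.
shapes = [(3, 4), (4, 2)]
parent_a = [np.random.randn(*s) for s in shapes]
parent_b = [np.random.randn(*s) for s in shapes]
offspring = breed(parent_a, parent_b)
print([w.shape for w in offspring])
```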
I have asked other AI folk this question, but I haven't really been given an answer that satisfied me.
For anyone else that has programmed an artificial neural network before, how do you test for its correctness?
I guess, another way to put it is, how does one debug the code behind a neural network?
With neural networks, generally what is happening is that you are taking an untrained neural network and training it up using a given set of data, so that it responds in the way you expect. Here's the deal: usually, you're training it up to a certain confidence level for your inputs. Generally (and again, this is just generally; your mileage may vary), you cannot get neural networks to always provide the right answer; rather, you are getting an estimate of the right answer, within a confidence range. You know that confidence range from how you have trained the network.
The question arises as to why you would want to use neural networks if you cannot be certain that the conclusion they come to is verifiably correct; the answer is that neural networks can arrive at high-confidence approximate answers for certain classes of hard problems (including NP-complete problems) very quickly, whereas verifiably correct solutions to NP-complete problems are not known to be obtainable in polynomial time. In layman's terms, neural networks can "solve" problems that normal computation can't solve efficiently; but you can only be a certain percentage confident that you have the right answer. You can determine that confidence by the training regimen, and can usually make sure that you will have at least 99.9% confidence.
Correctness is a funny concept in most of "soft computing." The best I can tell you is: "a neural network is correct when it consistently satisfies the parameters of its design." You do this by training it with data, and then verifying with other data, with a feedback loop in the middle which lets you know whether the neural network is functioning appropriately.
This is of course only the case for neural networks that are large enough that a direct proof of correctness is not possible. It is possible to prove that a neural network is correct through analysis if you are attempting to build one that learns XOR or something similar, but for that class of problem an ANN is seldom necessary.
You're opening up a bigger can of worms here than you might expect.
NNs are perhaps best thought of as universal function approximators, by the way, which may help you in thinking about this stuff.
Anyway, there is nothing special about NNs in terms of your question; the problem applies to any sort of learning algorithm.
The confidence you have in the results it is giving is going to rely on both the quantity and the quality (often harder to determine) of the training data that you have.
If you're really interested in this stuff, you may want to read up a bit on the problems of overtraining, and ensemble methods (bagging, boosting, etc.).
The real problem is that you usually aren't actually interested in the "correctness" (cf quality) of an answer on a given input that you've already seen, rather you care about predicting the quality of answer on an input you haven't seen yet. This is a much more difficult problem. Typical approaches then, involve "holding back" some of your training data (i.e. the stuff you know the "correct" answer for) and testing your trained system against that. It gets subtle though, when you start considering that you may not have enough data, or it may be biased, etc. So there are many researchers who basically spend all of their time thinking about these sort of issues!
I've worked on projects where there is test data as well as training data, so you know the expected outputs for a set of inputs the NN hasn't seen.
One common way of analysing the result of any classifier is the use of an ROC curve; an introduction to the statistics of classifiers and ROC curves can be found at Interpreting Diagnostic Tests.
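For what it's worth, computing an ROC curve and its AUC is only a few lines with scikit-learn; the dataset and classifier below are stand-ins, assuming any binary classifier that outputs scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # classifier confidence for the positive class

# fpr/tpr pairs trace the ROC curve; plot them, or just summarize with the AUC.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC on unseen test data:", roc_auc_score(y_test, scores))
```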
I'm a complete amateur in this field, but don't you use a pre-determined set of data you know is correct?
I don't believe there is a single correct answer but there are well-proven probabilistic or statistical methods that can provide reassurance. The statistical methods are usually referred to as Resampling.
One method that I can recommend is the Jackknife.
My teacher always said his rule of thumb was to train the NN with 80% of your data and validate it with the other 20%. And, of course, make sure that data set is as comprehensive as you need.
If you want to find out whether the backpropagation of the network is correct, there is an easy way.
Since you calculate the derivative of the error landscape, you can check numerically whether your implementation is correct. You will calculate the derivative of the error with respect to a specific weight, ∂E/∂w. You can show that
∂E/∂w = (E(w + e) - E(w - e)) / (2 * e) + O(e^2).
(Bishop, Pattern Recognition and Machine Learning, p. 246)
Essentially, you evaluate the error slightly to the left of the weight, evaluate it slightly to the right of the weight, and check whether the numerical gradient is the same as your analytical gradient.
(Here's an implementation: http://github.com/bayerj/arac/raw/9f5b225d6293974f8adfc5f20dfc6439cc1bed35/src/cpp/utilities/utilities.cpp)
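A short sketch of that check in NumPy, with a toy error function standing in for your network's error; the helper name and tolerance are my own choices.

```python
import numpy as np

def numerical_gradient(error, weights, eps=1e-6):
    """Central-difference estimate of dE/dw for each weight,
    to compare against the gradient your backprop code computes."""
    grad = np.zeros_like(weights)
    for i in range(weights.size):
        w_plus, w_minus = weights.copy(), weights.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (error(w_plus) - error(w_minus)) / (2 * eps)
    return grad

# Toy check: E(w) = 0.5 * ||w||^2, whose analytical gradient is simply w.
w = np.random.randn(3, 2)
assert np.allclose(numerical_gradient(lambda v: 0.5 * np.sum(v ** 2), w), w, atol=1e-4)
```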
To me there is probably only one value that takes extra effort to verify: the gradient of the backpropagation. I think bayer's answer is the commonly used and suggested approach. You need to write extra code for this, but it is all forward-propagation matrix multiplication, which is easy to write and verify.
There are some other issues which will prevent you from getting the best answer, for example:
The cost function of an NN is not convex, so gradient descent is not guaranteed to find the global optimum.
Over/under fitting
Not choosing the "right" features/model
etc
However, I think they are beyond the scope of a programming bug.