I am teaching an agent to get out of a maze, collecting all the apples on its way, using Q-learning.
I read that it is possible to use a fixed epsilon, or to choose an initial epsilon and decay it as time passes.
I couldn't find the advantages and disadvantages of each approach, and I would love some help understanding which one I should use.
Thanks!
I'm going to assume you're referring to epsilon as in "epsilon-greedy exploration". The goal of this parameter is to control how much your agent trusts its current policy. With a large epsilon value, your agent will tend to ignore its policy and choose a random action. This exploration is often a good idea when your policy is rather weak, especially at the beginning of training. People often decay epsilon as time passes to reflect that their policy is getting better and better, so they want to exploit rather than explore.
There is no right way to pick epsilon, or its decay rate, for every problem. The best way is probably to try out different values.
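For concreteness, here is a minimal sketch of both options in Python; the constants (EPS_START, EPS_MIN, DECAY) are illustrative values of my own, and a fixed epsilon is simply the same loop without the decay line:

    import random

    EPS_START, EPS_MIN, DECAY = 1.0, 0.05, 0.995

    def epsilon_greedy(q_values, actions, epsilon):
        # Pick a random action with probability epsilon, else the greedy one.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q_values[a])

    epsilon = EPS_START
    for episode in range(1000):
        # ... run one episode, choosing actions via epsilon_greedy(...) ...
        epsilon = max(EPS_MIN, epsilon * DECAY)  # anneal toward exploitation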
The book 'Reinforcement Learning: An Introduction' by Sutton and Barto mentions the following about non-stationary RL problems:
"we often encounter reinforcement learning problems that are effectively nonstationary. In such cases, it makes sense to weight recent rewards more heavily than long-past ones." (see https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node20.html)
I am not absolutely convinced by this. For example, an explorer agent whose task is to find the exit of a maze might actually lose because it made a wrong choice in the distant past. Could you please explain, in simple terms, why it makes sense to weight recent rewards more heavily?
If the problem is non-stationary, then past experience is increasingly out of date and should be given lower weight. That way, if an explorer makes a mistake in the distant past, the mistake is overwritten by more recent experience.
The text explicitly refers to nonstationary problems. In such problems, the MDP characteristics change: for example, the environment can change, and therefore the transition matrix or the reward function might be different. In this case, a reward collected in the past might no longer be significant.
In your example, the MDP is stationary, because the maze never changes, so your statement is correct. If (for example) the exit of the maze changed according to some law you do not know, then it would make sense to weight recent rewards more heavily (for example, if the reward is the Manhattan distance from the agent's position to the exit).
In general, dealing with nonstationary MDPs is very complex, because usually you don't know how the characteristics change (in the example above, you don't know how the exit location changes). Conversely, if you know the law determining how the environment changes, you should include it in the MDP model.
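To make the distinction concrete, here is a small sketch in Python contrasting the two update rules from the Sutton and Barto chapter linked above (the alpha value is an illustrative choice of mine):

    # Sample-average update: every past reward is weighted equally.
    # With alpha = 1/n, new rewards barely move a long-lived estimate.
    def sample_average_update(q, reward, n):
        return q + (reward - q) / n

    # Constant step-size update: old rewards decay geometrically, so the
    # estimate tracks a nonstationary environment as it changes.
    def constant_step_update(q, reward, alpha=0.1):
        return q + alpha * (reward - q)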
In backpropagation training, during gradient descent down the error surface, a network with a large number of neurons in the hidden layer can get stuck in a local minimum. I have read that reinitializing the weights to random numbers will, in all cases, eventually avoid this problem.
This means that there always IS a set of "correct" initial weight values. (Is this safe to assume?)
I need to find or make an algorithm that finds them.
I have tried googling for the algorithm and tried devising it myself, but to no avail. Can anyone propose a solution? Perhaps the name of an algorithm that I can search for?
Note: this is a regular feed-forward 3-layer burrito :)
Note: I know attempts have been made to use GAs for that purpose, but that requires re-training the network on each iteration, which is costly in time once the network gets large enough.
Thanks in advance.
There is never a guarantee that you will not get stuck in a local optimum, sadly. Unless you can prove certain properties about the function you are trying to optimize, local optima exist and hill-climbing methods will fall prey to them. (And typically, if you can prove the things you need to prove, you can also select a better tool than a neural network.)
One classic technique is to gradually reduce the learning rate, then increase it and slowly draw it down again, several times. Raising the learning rate reduces the stability of the algorithm, but gives it the ability to jump out of a local optimum. This is closely related to simulated annealing.
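As a rough sketch of such a schedule in Python (cosine-shaped decay within each cycle; all constants are illustrative choices of mine):

    import math

    # Learning rate decays within each cycle, then jumps back up at the
    # start of the next cycle, giving the algorithm repeated chances to
    # escape local optima (akin to warm restarts / simulated annealing).
    def cyclical_lr(step, base_lr=0.5, min_lr=0.01, cycle_len=1000):
        phase = (step % cycle_len) / cycle_len  # 0 -> 1 within a cycle
        return min_lr + (base_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * phase))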
I am surprised that Google has not helped you here, as this is a topic with many published papers: try terms like "local minima" and "local minima problem" in conjunction with "neural networks" and "backpropagation". You should see many references to improved backprop methods.
I've finished implementing my ANN in C++, but I'm stuck on the values of lambda, eta, and alpha that give the best results. I don't know if there is a rule or range that will give good training results.
The dataset has 4000 samples.
There are 15 hidden neurons.
Can anyone please help me understand how to choose the best values of lambda, eta, and alpha?
Thank you very much
Obviously, the exact optimal values are completely problem-dependent, but I suggest using commonly-applied values (Murray Smith makes some suggestions in "Neural Networks for Statistical Modeling") and not tinkering with them.
Even today, the complaint is sometimes made that neural networks are difficult to work with because of all of the experimentation needed to optimize parameters like eta, etc. In most situations, though, the idea is to discover a good approximation, not the optimal one. Assuming that one hasn't chosen insane parameter values, then the only experimentation needed is adjustment of the size of the hidden layer.
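If it helps, the one experiment that paragraph suggests can be sketched in a few lines of Python; train_and_score here is a hypothetical helper that trains a network with the given hidden layer size and returns its score on a held-out set:

    # Try a handful of hidden layer sizes, keep the best (sizes are illustrative).
    best_size, best_score = None, float("-inf")
    for hidden_size in (5, 10, 15, 20, 30):
        score = train_and_score(hidden_size)  # hypothetical training helper
        if score > best_score:
            best_size, best_score = hidden_size, score
    print("best hidden layer size:", best_size)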
I'm wondering how people test artificial intelligence algorithms in an automated fashion.
One example would be the Turing Test: say there were a number of submissions for a contest. Is there any conceivable way to score candidates in an automated fashion, other than just having humans test them out?
I've also seen some data sets (obscured images of numbers/letters, groups of photos, etc.) that can be fed in and learned over time. What good resources are out there for this?
One challenge I see: you don't want an algorithm that tailors itself to the test data over time, since you are trying to see how well it does in the general case. Are there any techniques to ensure it doesn't do this, such as giving it a random test each time, or averaging its results over a bunch of random tests?
Basically, given a bunch of algorithms, I want some automated process to feed it data and see how well it "learned" it or can predict new stuff it hasn't seen yet.
This is a complex topic. Good AI algorithms are generally the ones that can generalize well to "unseen" data. The simplest method is to have two datasets: a training set and an evaluation set used for measuring performance. But generally you want to "tune" your algorithm, so you may want 3 datasets: one for learning, one for tuning, and one for evaluation. What tuning means depends on your algorithm, but a typical example is a model with a few hyper-parameters (for example, parameters of your Bayesian prior under the Bayesian view of learning) that you would like to tune on a separate dataset. The learning procedure would already have set a value for them (or maybe you hardcoded their values), but having enough data may help so that you can tune them separately.
As for making those separate datasets, there are many ways to do so, for example by dividing the data you have available into subsets used for different purposes. There is a tradeoff to be made because you want as much data as possible for training, but you want enough data for evaluation too (assuming you are in the design phase of your new algorithm/product).
A standard way to do this systematically from a known dataset is cross-validation.
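For illustration, here is a minimal sketch in Python of the three-dataset split described above; the 60/20/20 proportions are a common convention I've assumed, not a rule:

    import random

    # Split a dataset into train / tune / evaluate subsets.
    def three_way_split(data, seed=0):
        rng = random.Random(seed)
        shuffled = data[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        train = shuffled[: int(0.6 * n)]
        tune = shuffled[int(0.6 * n): int(0.8 * n)]
        test = shuffled[int(0.8 * n):]
        return train, tune, test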
Generally when it comes to this sort of thing you have two datasets - one large "training set" which you use to build and tune the algorithm, and a separate smaller "probe set" that you use to evaluate its performance.
#Anon has the right of it: training sets and what I'll call validation sets. That noted, the bits and pieces I see about developments in this field point at two things:
Bayesian Classifiers: there's something like this probably filtering your email. In short, you train the algorithm to make a probabilistic decision about whether a particular item is part of a group or not (e.g. spam or ham).
Multiple Classifiers: this is the approach that the winning group in the Netflix challenge took, whereby it's not about optimizing one particular algorithm (e.g. Bayesian, Genetic Programming, Neural Networks, etc.), but about combining several to get a better result.
As for data sets, Weka has several available. I haven't explored other libraries for data sets, but mloss.org appears to be a good resource. Finally, data.gov offers a lot of sets that provide some interesting opportunities.
Training data sets and test sets are very common for K-means and other clustering algorithms, but to have something that's artificially intelligent without supervised learning (which means without a training set), you are building a "brain", so to speak, based on, for example:
In chess: all possible future states reachable from the current game state.
In most AI learning (reinforcement learning), you have a problem where the "agent" is trained by playing the game over and over. Basically, you ascribe a value to every state. Then you assign an expected value to each possible action in a state.
So say you have S states and A actions per state (although you might have more possible moves in one state and fewer in another); then you want to figure out the most valuable states to be in, and the most valuable actions to take.
In order to figure out the value of states and their corresponding actions, you have to play through the game repeatedly. Probabilistically, a certain sequence of states will lead to victory or defeat, and basically you learn which states lead to failure and are "bad" states. You also learn which ones are more likely to lead to victory, and these are subsequently "good" states. Each gets a mathematical value associated with it, usually an expected reward.
Reward from second-last state to a winning state: +10
Reward if entering a losing state: -10
So states that give negative rewards propagate those negative rewards backwards: to the state that led to the second-last state, then to the state that led to the third-last state, and so on.
Eventually, you have a mapping of expected reward based on which state you're in and which action you take. You eventually find the "optimal" sequence of steps to take. This is often referred to as an optimal policy.
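As a minimal sketch of that backward propagation, here is the tabular Q-learning update in Python (the ALPHA and GAMMA values are illustrative choices of mine, not canonical):

    from collections import defaultdict

    # Tabular Q-learning update: propagates rewards like the +10/-10 above
    # backwards through the states and actions that led to them.
    ALPHA, GAMMA = 0.1, 0.9     # learning rate, discount factor
    Q = defaultdict(float)      # maps (state, action) -> expected reward

    def q_update(state, action, reward, next_state, next_actions):
        best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])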
Conversely, the ordinary courses of action that you step through while deriving the optimal policy are simply called policies, and with Q-learning you are always following some "policy".
Usually, the way you determine the reward is the interesting part. Suppose I reward you for each state transition that does not lead to failure. Then the value of walking through all the states until termination is however many increments I made, i.e. however many state transitions occurred.
If certain states have extremely low value, then loss is easy to avoid, because almost all bad states are avoided.
However, you don't want to discourage the discovery of new, potentially more efficient paths that don't follow the one route you know works. So you want to reward and punish the agent in such a way as to ensure "victory" or "keeping the pole balanced" or whatever for as long as possible; but if failure is too painful, no new, unexplored routes will be tried, and you will be stuck at local maxima and minima of efficiency. (There are many approaches in addition to this one.)
So when you ask "how do you test AI algorithms?", the best part is that the testing itself is how many of these "algorithms" are constructed. The algorithm is designed to test a certain course of action (policy). It's much more complicated than
"turn left every half mile"
it's more like
"turn left every half mile if I have turned right 3 times and then turned left 2 times and had a quarter in my left pocket to pay fare... etc etc"
It's very precise.
So the testing is usually actually how the A.I. is being programmed. Most models are just probabilistic representations of what is probably good and probably bad. Calculating every possible state is easier for computers (we thought!) because they can focus on one task for very long periods of time, and how much they remember is exactly how much RAM you have. However, we learn by affecting neurons in a probabilistic manner, which is why the memristor is such a great discovery: it's just like a neuron!
You should look at neural networks; they're mind-blowing. The first time I read about making a "brain" out of a matrix of fake neuron synaptic connections, a brain that can "remember", it basically rocked my universe.
A.I. research is mostly probabilistic because we don't know how to make "thinking"; we just know how to imitate our own inner learning process of try, try again.
I remember when I was in college, we went over some problem where there was a smart agent on a grid of squares, and it had to clean the squares. It was awarded points for cleaning and deducted points for moving. It had to refuel every now and then, and at the end it got a final score based on how many squares on the grid were dirty or clean.
I'm trying to study that problem, since it was very interesting when I saw it in college, but I cannot find anything on Wikipedia or anywhere else online. Is there a specific name for this problem that you know of? Or maybe it was just something my teacher came up with for the class.
I've been searching for "AI cleaning agent" and similar things, but I don't find anything; maybe it has some other name.
If you know where I can find more information about this problem I would appreciate it. Thanks.
Perhaps a "stigmergy" approach is closely related to your problem. There is a starting point here, and you can find more by searching for "dead ants" and "robots" on Google Scholar.
Basically: instead of modelling a precise strategy, you work toward a probabilistic approach. Ants (probably) collect their dead by piling them up according to a simple rule such as "if there is a pile of dead ants there, I bring this corpse to it; otherwise, I'll make a new pile". You can start by simplifying your 'cleaning' situation that way, and see where you go.
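As a toy sketch of that rule in Python (the saturation constant k is an assumption of mine, not part of any published model):

    import random

    # An agent carrying a load drops it with probability that grows with the
    # size of the nearby pile, so piles reinforce themselves (stigmergy).
    def maybe_drop(local_pile_size, k=3):
        # k controls how quickly the drop probability saturates (illustrative)
        p_drop = local_pile_size / (local_pile_size + k)
        return random.random() < p_drop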
Also, I think another suitable approach could be modelled with a Genetic Algorithm, using a carefully chosen combination of fitness functions such as:
the final number of 'clean' tiles
the number of steps taken by the robot
Of course, if a robot 'dies' of starvation, it automatically removes itself from the gene pool, a la Darwin Awards :)
You could start by modelling a very, very simple genotype that is 'decoded' into a behaviour; a sketch follows below. Consider using a simple GA such as this one by Inman Harvey, then assign to each gene either a part of the strategy or a complete behaviour. E.g.: if gene A is set to 1, then the robot will wander randomly; if gene B is also set to 1, then it will give priority to self-charging unless there are dirty tiles within distance X. Or use floats and model probabilities. Your mileage may vary, but I can assure you it will be fun :)
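For instance, a toy decode step matching those hypothetical gene assignments might look like this in Python (the gene positions and behaviour names are my own illustration):

    # Decode a bitstring genotype into a behaviour table: each bit
    # switches one hand-designed behaviour on or off.
    def decode(genotype):
        return {
            "wander_randomly": genotype[0] == 1,    # "gene A"
            "prefer_recharging": genotype[1] == 1,  # "gene B"
        }

    # e.g. decode([1, 0]) -> a robot that wanders but ignores its battery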
The problem is reminiscent of Shakey, although there's cleaning involved (which is more like the Roomba, a device that can also be programmed to perform these very tasks).
If the "problem space" (or room) is small enough, you can solve for an optimal solution using a simple A*-based search, but likely it won't be, since that won't leave for very interesting problems.
The machine learning approach suggested here using genetic algorithms is an interesting one. Given the problem domain, you would only have one "rule" (a move-to action, since "clean" can be eliminated by implicitly cleaning any dirty square you move to), so your learner would essentially be learning how to move around an environment. The problem there would be to build a learner that can adapt to any given floor plan, instead of just becoming proficient at cleaning one very specific space.
Whatever approach you take, I'd also consider a further meta-reasoning step if the problem sets are big enough: use a partitioning approach to divide the floor into separate areas, and then conquer them one at a time.
Can you use techniques to create data for "offline" use? In that case, I'd even consider creating a "database" of optimal routes for cleaning certain floor spaces (1x1 up to, say, 5x5) that includes all possible start and end squares. This is similar to the "endgame databases" that game AIs use to effectively "solve" games once they reach a certain depth (cf. Chinook).
This problem reminds me of this. A similar problem is briefly mentioned in the book Complexity as an example of a genetic algorithm. Those versions are simplified, though; they don't take fuel consumption into account.