Partially Observable Markov Decision Process Optimal Value function - artificial-intelligence

I understand how belief states are updated in a POMDP. But in the Policy and Value Function section of http://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process I could not figure out how to calculate the value of V*(τ(b,a,o)) when finding the optimal value function V*(b). I have read a lot of resources on the internet, but none explain how to calculate this clearly. Can someone provide me with a mathematically solved example with all the calculations, or a mathematically clear explanation?

You should check out this tutorial on POMDPs:
http://cs.brown.edu/research/ai/pomdp/tutorial/index.html
It includes a section about Value Iteration, which can be used to find an optimal policy/value function.

I try to use the same notation in this answer as Wikipedia.
First I repeat the value function as stated on Wikipedia:
V*(b) = max_a [ r(b,a) + γ * Σ_o O(o|b,a) * V*(τ(b,a,o)) ]
V*(b) is the value function with the belief b as parameter. b contains the probability of every state s, and these probabilities sum up to 1:
Σ_s b(s) = 1
r(b,a) is the reward for belief b and action a, which has to be calculated from the belief over each state using the original reward function R(s,a), the reward for being in state s and having done action a:
r(b,a) = Σ_s b(s) * R(s,a)
We can also write the observation function O in terms of states instead of the belief b:
O(o|b,a) = Σ_s' O(o|s',a) * Σ_s T(s'|s,a) * b(s)
This is the probability of observing o given a belief b and action a. Note that O and T are probability functions.
Finally, the function τ(b,a,o) gives the new belief state b' = τ(b,a,o) given the previous belief b, action a and observation o. Per state s' we can calculate the new probability:
b'(s') = [ O(o|s',a) * Σ_s T(s'|s,a) * b(s) ] / O(o|b,a)
Now the new belief b' can be used to evaluate the recursive term V*(τ(b,a,o)).
The optimal value function can be approximated using, for example, Value Iteration, which applies dynamic programming: the value function is updated iteratively until the difference between successive iterations is smaller than a small value ε.
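To make the recursion concrete, here is a minimal Python sketch (not taken from the Wikipedia article or the tutorial): it implements O(o|b,a), τ(b,a,o) and a finite-horizon version of V*(b) by brute-force lookahead on a tiny, made-up two-state POMDP. All the T, O and R numbers below are illustrative assumptions, not a real model.

# Tiny, made-up two-state POMDP: the tables are illustrative only.
S = [0, 1]          # states
A = [0, 1]          # actions
Obs = [0, 1]        # observations
gamma = 0.9

# T[a][s][s'] = P(s' | s, a)
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]
# O[a][s'][o] = P(o | s', a)
O = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.5, 0.5], [0.5, 0.5]]]
# R[s][a] = immediate reward for being in s and doing a
R = [[1.0, 0.0], [0.0, 2.0]]

def obs_prob(b, a, o):
    # P(o | b, a) = sum_s' O(o|s',a) * sum_s T(s'|s,a) * b(s)
    return sum(O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in S) for s2 in S)

def tau(b, a, o):
    # new belief b'(s') after doing a in belief b and observing o
    norm = obs_prob(b, a, o)
    return [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in S) / norm for s2 in S]

def value(b, horizon):
    # finite-horizon V*(b): max over a of r(b,a) + gamma * sum_o P(o|b,a) * V*(tau(b,a,o))
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in A:
        r_ba = sum(b[s] * R[s][a] for s in S)            # r(b, a)
        future = sum(obs_prob(b, a, o) * value(tau(b, a, o), horizon - 1) for o in Obs)
        best = max(best, r_ba + gamma * future)
    return best

print(value([0.5, 0.5], horizon=3))

Value Iteration for POMDPs does essentially this computation more cleverly (representing the value function with α-vectors instead of enumerating beliefs), but the recursion V*(τ(b,a,o)) is exactly the one above.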
There is a lot more information on POMDPs, for example:
Sebastian Thrun, Wolfram Burgard, and Dieter Fox. 2005. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press.
A brief introduction to reinforcement learning
A POMDP Tutorial
Reinforcement Learning and Markov Decision Processes

Related

The purpose of using Q-Learning algorithm

What is the point of using Q-Learning? I have used some example code that represents a 2D board with a pawn moving on it. At the right end of the board there is a goal which we want to reach. After the algorithm completes, I have a Q-table with values assigned to every state-action pair. Is it all about getting this Q-table to see which state-action pairs (i.e. which actions are best in specific states) are the most useful? That's how I understand it right now. Am I right?
Is it all about getting this Q-table to see which state-action pairs (i.e. which actions are best in specific states) are the most useful?
Yep! That's pretty much it. Given a finite state space and sufficient exploration, Q-learning is guaranteed to eventually learn the optimal policy. Once an optimal policy is reached (also known as convergence), every time the agent is in a given state s, it looks in its Q-table for the action a with the highest Q-value for that (s, a) pair.
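A tiny Python sketch of what "looking in the Q-table" means once learning has converged; the table values and the left/right actions here are made up for illustration:

# made-up Q-table for a 3-cell board with the goal to the right
Q = {
    (0, "left"): 0.1, (0, "right"): 0.7,
    (1, "left"): 0.2, (1, "right"): 0.9,
    (2, "left"): 0.3, (2, "right"): 1.0,   # state 2 is next to the goal
}

def best_action(state, actions=("left", "right")):
    # the greedy policy just reads off the action with the highest Q-value for this state
    return max(actions, key=lambda a: Q[(state, a)])

print([best_action(s) for s in range(3)])   # ['right', 'right', 'right']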

Neural Network and Temporal Difference Learning

I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.
- Where does the prediction value V_t come from? And subsequently, how do we get V_(t+1)?
- What exactly is getting back-propagated when TD is used with a neural net? That is, where does the error that gets back-propagated come from?
The backward and forward views can be confusing, but when you are dealing with something simple like a game-playing program, things are actually pretty simple in practice. I'm not looking at the reference you're using, so let me just provide a general overview.
Suppose I have a function approximator like a neural network, and that it has two functions, train and test, for training on a particular output and predicting the outcome of a state (or the outcome of taking an action in a given state).
Suppose I have a trace of play from playing a game, where I used the predict method to tell me what move to make at each point and suppose that I lose at the end of the game (V=0). Suppose my states are s_1, s_2, s_3...s_n.
The Monte Carlo approach says that I train my function approximator (e.g. my neural network) on each of the states in the trace using the trace and the final score. So, given this trace, you would do something like call:
train(s_n, 0)
train(s_n-1, 0)
...
train(s_1, 0).
That is, I'm asking every state to predict the final outcome of the trace.
The dynamic programming approach says that I train based on the result of the next state. So my training would be something like
train(s_n, 0)
train(s_n-1, test(s_n))
...
train(s_1, test(s_2)).
That is, I'm asking the function approximator to predict what the next state predicts, where the last state predicts the final outcome from the trace.
TD learning mixes between the two of these, where λ=1 corresponds to the first case (Monte Carlo) and λ=0 corresponds to the second case (dynamic programming). Suppose that we use λ=0.5. Then our training would be:
train(s_n, 0)
train(s_n-1, 0.5*0 + 0.5*test(s_n))
train(s_n-2, 0.25*0 + 0.25*test(s_n) + 0.5*test(s_n-1))
...
Now, what I've written here isn't completely correct, because you don't actually re-test the approximator at each step. Instead you just start with a prediction value (V = 0 in our example) and then update it for training the next step with the next predicted value: V = λ·V + (1-λ)·test(s_i).
This learns much faster than the Monte Carlo and dynamic programming methods, because you aren't asking the algorithm to learn such extreme values (ignoring either the current prediction or the final outcome).
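Here is a minimal Python sketch of that backward pass (my own illustration, with the answer's test function passed in as a stand-in): it returns, for each state in the trace, the target you would hand to train, using exactly the blend target = λ·target + (1-λ)·test(s_i) described above.

def td_lambda_targets(states, final_outcome, test, lam):
    # walk the trace backwards, blending the running target with each prediction
    targets = {}
    target = final_outcome                    # the last state trains toward the final result
    for i in range(len(states) - 1, -1, -1):
        targets[states[i]] = target           # this is what train(states[i], ...) would receive
        target = lam * target + (1 - lam) * test(states[i])
    return targets

# usage with a dummy predictor that always guesses 0.5, for a lost game (final outcome 0)
trace = ["s1", "s2", "s3", "s4"]
print(td_lambda_targets(trace, final_outcome=0.0, test=lambda s: 0.5, lam=0.5))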

What is the difference between Q-learning and SARSA?

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms.
According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (in state s for action a, at timestep t), i.e. Q(st, at), can be updated as follows
Q(st, at) = Q(st, at) + α*(rt + γ*Q(st+1, at+1) - Q(st, at))
On the other hand, the update step for the Q-learning algorithm is the following
Q(st, at) = Q(st, at) + α*(rt + γ*maxa Q(st+1, a) - Q(st, at))
which can also be written as
Q(st, at) = (1 - α) * Q(st, at) + α * (rt + γ*maxa Q(st+1, a))
where γ (gamma) is the discount factor and rt is the reward received from the environment at timestep t.
Is the difference between these two algorithms the fact that SARSA only looks up the next policy value while Q-learning looks up the next maximum policy value?
TLDR (and my own answer)
Thanks to all those answering this question since I first asked it. I've made a github repo playing with Q-Learning and have empirically understood what the difference is. It all comes down to how you select your next best action, which from an algorithmic standpoint can be a mean, max or best action depending on how you choose to implement it.
The other main difference is when this selection is happening (e.g., online vs offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with a RL toy problem is probably the best way to understand the differences.
One last important note is that both Sutton & Barto and Wikipedia often have mixed, confusing or wrong formulaic representations with regards to the next state's best/max action and reward: the r(t+1) in those formulas is in fact r(t).
When I was learning this part, I found it very confusing too, so I put together the two pseudo-codes from R. Sutton and A. G. Barto hoping to make the difference clearer.
The parts where the two algorithms actually differ are where A' is chosen and how Q is updated, as summarized in the table below; the more detailed difference is explained afterwards.
TL;DR:
| | SARSA | Q-learning |
|:-----------:|:-----:|:----------:|
| Choosing A' | π | π |
| Updating Q | π | μ |
where π is an ε-greedy policy (ε > 0, with exploration) and μ is a greedy policy (ε = 0, no exploration).
Q-learning uses different policies for choosing the next action A' and for updating Q: it follows the behaviour policy π while updating towards the greedy policy μ, so it is an off-policy algorithm.
In contrast, SARSA uses π all the time, hence it is an on-policy algorithm.
More detailed explanation:
The most important difference between the two is how Q is updated after each action. SARSA uses the Q-value of A' exactly as drawn from the ε-greedy policy. In contrast, Q-learning uses the maximum Q-value over all possible actions for the next step, which looks like following a greedy policy with ε = 0, i.e. no exploration in this part.
However, when actually taking an action, Q-learning still uses the action taken from an ε-greedy policy. This is why "Choose A ..." is inside the repeat loop.
Following the loop logic in Q-learning, A' is still from the ε-greedy policy.
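To see the two loops side by side, here is a minimal Python sketch on a made-up chain environment (hyper-parameters and environment are my own illustrative assumptions, not the book's example): the only lines that differ are where A' comes from and which value is used to bootstrap the update.

import random

N, GOAL = 5, 4
ACTIONS = [-1, +1]
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(Q, s):
    # behaviour policy pi: mostly greedy, random with probability eps
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(episodes=500):
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        a = eps_greedy(Q, s)                       # choose A before the loop
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2)                 # choose A' from the same eps-greedy policy
            target = 0.0 if done else Q[(s2, a2)]  # ...and bootstrap with that A'
            Q[(s, a)] += alpha * (r + gamma * target - Q[(s, a)])
            s, a = s2, a2
    return Q

def q_learning(episodes=500):
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = eps_greedy(Q, s)                   # behaviour is still eps-greedy
            s2, r, done = step(s, a)
            target = 0.0 if done else max(Q[(s2, a2)] for a2 in ACTIONS)  # greedy target
            Q[(s, a)] += alpha * (r + gamma * target - Q[(s, a)])
            s = s2
    return Q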
Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning does it relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-Learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear approximation.
In practical terms, under an ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum next-state action value, while SARSA computes the difference between Q(s,a) and a target that is, in expectation, a weighted sum of the average next-state action value and the maximum:
Q-Learning target: maxa Q(st+1, a)
SARSA target (in expectation): ε·meana Q(st+1, a) + (1-ε)·maxa Q(st+1, a)
What is the difference mathematically?
As is already described in most other answers, the difference between the two updates mathematically is indeed that, when updating the Q-value for a state-action pair (St, At):
Sarsa uses the behaviour policy (meaning, the policy used by the agent to generate experience in the environment, which is typically epsilon-greedy) to select an additional action At+1, and then uses Q(St+1, At+1) (discounted by gamma) as expected future returns in the computation of the update target.
Q-learning does not use the behaviour policy to select an additional action At+1. Instead, it estimates the expected future returns in the update rule as maxA Q(St+1, A). The max operator used here can be viewed as "following" the completely greedy policy. The agent is not actually following the greedy policy though; it only says, in the update rule, "suppose that I would start following the greedy policy from now on, what would my expected future returns be then?".
What does this mean intuitively?
As mentioned in other answers, the difference described above means, using technical terminology, that Sarsa is an on-policy learning algorithm, and Q-learning is an off-policy learning algorithm.
In the limit (given an infinite amount of time to generate experience and learn), and under some additional assumptions, this means that Sarsa and Q-learning converge to different solutions / "optimal" policies:
Sarsa will converge to a solution that is optimal under the assumption that we keep following the same policy that was used to generate the experience. This will often be a policy with some element of (rather "stupid") randomness, like epsilon-greedy, because otherwise we are unable to guarantee that we'll converge to anything at all.
Q-Learning will converge to a solution that is optimal under the assumption that, after generating experience and training, we switch over to the greedy policy.
When to use which algorithm?
An algorithm like Sarsa is typically preferable in situations where we care about the agent's performance during the process of learning / generating experience. Consider, for example, that the agent is an expensive robot that will break if it falls down a cliff. We'd rather not have it fall down too often during the learning process, because it is expensive. Therefore, we care about its performance during the learning process. However, we also know that we need it to act randomly sometimes (e.g. epsilon-greedy). This means that it is highly dangerous for the robot to be walking alongside the cliff, because it may decide to act randomly (with probability epsilon) and fall down. So, we'd prefer it to quickly learn that it's dangerous to be close to the cliff; even if a greedy policy would be able to walk right alongside it without falling, we know that we're following an epsilon-greedy policy with randomness, and we care about optimizing our performance given that we know that we'll be stupid sometimes. This is a situation where Sarsa would be preferable.
An algorithm like Q-learning would be preferable in situations where we do not care about the agent's performance during the training process, but we just want it to learn an optimal greedy policy that we'll switch to eventually. Consider, for example, that we play a few practice games (where we don't mind losing due to randomness sometimes), and afterwards play an important tournament (where we'll stop learning and switch over from epsilon-greedy to the greedy policy). This is where Q-learning would be better.
There's an index mistake in your formula for Q-Learning. From page 148 of Sutton and Barto's book:
Q(st,at) <-- Q(st,at) + alpha * [r(t+1) + gamma * max Q(st+1,a) - Q(st,at)]
The typo is in the argument of the max: the indexes are st+1 and a, while in your question they are st+1 and at+1 (those are correct for SARSA).
Hope this helps a bit.
In Q-Learning
This is yours:
Q-Learning: Q(St,At) = Q(St,At) + a [ R(t+1) + discount * max Q(St+1,At) - Q(St,At) ]
which should be changed to
Q-Learning: Q(St,At) = Q(St,At) + a [ R(t+1) + discount * max Q(St+1,a) - Q(St,At) ]
As you said, you have to find the maximum Q-value for the update equation by varying a; then you will have a new Q(St,At). Be careful: the a that gives you the maximum Q-value is not necessarily the next action you take. At this stage you only know the next state (St+1), and before going to the next round you update St to St+1 (St <-- St+1).
For each loop:
choose At from St using the Q-values
take At and observe Rt+1 and St+1
update the Q-value using the equation above
St <-- St+1
until St is terminal
The only difference between SARSA and Q-learning is that SARSA takes the next action based on the current policy, while Q-learning takes the action with the maximum utility in the next state.
I haven't read any book on this; I'm just describing the implication as I see it.
Q-learning focuses only on the grid of (state, action) values.
SARSA focuses on the state-to-state transition: it observes the actions taken in s and s', and then updates the corresponding entry in the grid.
Both SARSA and Q-learning agents follow an ε-greedy policy to interact with the environment.
The SARSA agent updates its Q-function using the next timestep's Q-value with whatever action the policy provides (mostly still greedy, but a random action is also accepted). The policy being executed and the policy being updated towards are the same.
The Q-learning agent updates its Q-function using only the action that brings the maximum next-state Q-value (totally greedy with respect to the Q-function). The policy being executed and the policy being updated towards are different.
Hence, SARSA is on-policy and Q-learning is off-policy.

TD(λ) in Delphi/Pascal (Temporal Difference Learning)

I have an artificial neural network which plays Tic-Tac-Toe - but it is not complete yet.
What I have so far:
the reward array "R[t]" with integer values for every timestep or move "t" (1 = player A wins, 0 = draw, -1 = player B wins)
The input values are correctly propagated through the network.
the formula for adjusting the weights
What is missing:
the TD learning: I still need a procedure which "backpropagates" the network's errors using the TD(λ) algorithm.
But I don't really understand this algorithm.
My approach so far ...
The trace decay parameter λ should be "0.1" as distal states should not get that much of the reward.
The learning rate is "0.5" in both layers (input and hidden).
It's a case of delayed reward: The reward remains "0" until the game ends. Then the reward becomes "1" for the first player's win, "-1" for the second player's win or "0" in case of a draw.
My questions:
How and when do you calculate the net's error (TD error)?
How can you implement the "backpropagation" of the error?
How are the weights adjusted using TD(λ)?
Thank you so much in advance :)
If you're serious about making this work, then understanding TD-lambda would be very helpful. Sutton and Barto's book, "Reinforcement Learning" is available for free in HTML format and covers this algorithm in detail. Basically, what TD-lambda does is create a mapping between a game state and the expected reward at the game's end. As games are played, states that are more likely to lead to winning states tend to get higher expected reward values.
For a simple game like tic-tac-toe, you're better off starting with a tabular mapping (just track an expected reward value for every possible game state). Then once you've got that working, you can try using a NN for the mapping instead. But I would suggest trying a separate, simpler NN project first...
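As a starting point for that tabular mapping, here is a minimal Python sketch (the board encoding, learning rate and reward convention are my own illustrative assumptions): it keeps a table of expected rewards per board state and applies a simple TD(0) update after each game. Eligibility traces (the λ part) and the neural network can be layered on later, but this already propagates the final result back toward earlier positions over many games.

from collections import defaultdict

V = defaultdict(float)   # board state (e.g. a string like "XO.X.O...") -> expected reward

def td_update_from_game(states, final_reward, alpha=0.1):
    # states: the board positions seen during one game, in order; final_reward: 1, 0 or -1
    V[states[-1]] = final_reward               # the terminal position gets the actual outcome
    for i in range(len(states) - 2, -1, -1):   # walk backwards through the game
        s, s_next = states[i], states[i + 1]
        V[s] += alpha * (V[s_next] - V[s])     # TD(0): nudge toward the successor's value

# usage: a made-up sequence of boards from one game that X won
game = [".........", "X........", "X...O....", "XX..O....", "XX..O..O.", "XXX.O..O."]
td_update_from_game(game, final_reward=1)
print(V[game[0]])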
I have been confused about this too, but I believe this is the way it works:
Starting from the end node, you check R (the output received) and E (the output expected). If E = R, it's fine and you have no changes to make.
If E != R, you see how far off it was, based on thresholds and whatnot, and then shift the weights or thresholds up or down a bit. Then, based on the new weights, you go back one step, guess whether it was too high or too low, and repeat with a weaker effect.
I've never really tried this algorithm, but that's basically the idea as I understand it.
As far as I remember, you do the training with a known result set: you calculate the output for a known input and subtract the known output value from it; that is the error.
Then you use the error to correct the net. For a single-layer NN adjusted with the delta rule, I know that an epsilon (learning rate) of 0.5 is too high; something like 0.1 is better - slower, but better. With backpropagation it is a bit more advanced, but as far as I remember the mathematical description of a NN looks complex and hard to understand, while in practice it is not that complicated.
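As a rough illustration of that single-layer delta rule (my own toy example, not the code behind the link below): each weight is nudged by learning_rate * error * input, where the error is the known output minus the net's output.

import random

def train_delta_rule(samples, n_inputs, learning_rate=0.1, epochs=100):
    # samples: list of (input_vector, known_output) pairs
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = sum(wi * xi for wi, xi in zip(w, x)) + bias
            error = target - output                       # known output minus net output
            w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
            bias += learning_rate * error
    return w, bias

# toy usage: learn the linear relation y = 2*x1 - x2
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
print(train_delta_rule(data, n_inputs=2))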
take a look at
http://www.codeproject.com/KB/recipes/BP.aspx
or google for "backpropagation c" - it is probably easier to understand in code.

Variable elimination in Bayes Net

We studied variable elimination recently, and the teacher emphasized that it is the Bayesian network that makes variable elimination more efficient.
I am a bit confused about this: why is this the case?
Hope you guys can give me some idea. Many thanks.
Robert
Bayesian networks make variable elimination more efficient because of the conditional independence assumptions built into them.
Specifically, imagine having the joint distribution P(a,b,c,d) and wanting to know the marginal P(a). If you knew nothing about conditional independence, you could calculate this by summing out over b, c and d. If these have k-ary domains, you need to do O(k^3) operations.
On the other hand, assume you have a chain-structured Bayes net in which D is the root, C is a child of D, B is a child of C, and A is a child of B. Then you can rewrite the joint as P(a|b)P(b|c)P(c|d)P(d) and push each of the three summations as far to the right as possible: P(a) = Σ_b P(a|b) Σ_c P(b|c) Σ_d P(c|d)P(d). To compute P(a), you can precompute the factor f1(c) = Σ_d P(c|d)P(d) and store it; likewise you can precompute f2(b) = Σ_c P(b|c) f1(c) and store that.
In this way, you end up doing O(k^(w*+1)) work, where w* is the induced width of the elimination ordering (one less than the number of variables in the largest intermediate factor). In this chain every factor involves at most two variables, so we do O(k^2) work, which is also the size of the largest table we must keep in memory. Note this is better than our original O(k^3) result, and the gap only grows with more variables.
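Here is a minimal Python sketch of that chain example with made-up binary CPTs (D → C → B → A): the naive marginal sums over the full joint, while the eliminated version caches the two intermediate factors f1(c) and f2(b) described above, so each step only ever touches two variables.

k = 2
vals = range(k)

# made-up (normalized) conditional probability tables, keyed (child_value, parent_value)
P_d = {0: 0.6, 1: 0.4}
P_c_given_d = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # P(c|d)
P_b_given_c = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}   # P(b|c)
P_a_given_b = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.9}   # P(a|b)

def marginal_a_naive(a):
    # sum over the full joint: O(k^3) terms per value of a
    return sum(P_a_given_b[(a, b)] * P_b_given_c[(b, c)] * P_c_given_d[(c, d)] * P_d[d]
               for b in vals for c in vals for d in vals)

def marginal_a_eliminated(a):
    # eliminate D, then C: each intermediate factor mentions only two variables
    f1 = {c: sum(P_c_given_d[(c, d)] * P_d[d] for d in vals) for c in vals}    # f1(c)
    f2 = {b: sum(P_b_given_c[(b, c)] * f1[c] for c in vals) for b in vals}     # f2(b)
    return sum(P_a_given_b[(a, b)] * f2[b] for b in vals)

print(marginal_a_naive(1), marginal_a_eliminated(1))   # both print the same marginal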
In short, the conditional independence of a BN allows you to marginalize out variables more efficiently. Another explanation of this can be found at http://www.cs.uiuc.edu/class/sp08/cs440/notes/varElimLec.pdf.
I think it's because a variable that can be eliminated easily is one on which only a single other variable depends. In a Bayes net these are easy to find, because they are nodes with a single child.
