Intuition behind policy iteration on a grid world - artificial-intelligence

For an assignment, I am supposed to come up with an MDP agent that uses policy iteration and value iteration, and compare its performance with the utility value of a state.
How does an MDP agent, given that it knows the transition probabilities and rewards, know which action to take?
From my understanding, an MDP agent will perform policy iteration and, given a policy, calculate the rewards it gains while reaching the termination state. This policy is developed from the value iteration algorithm.
Can someone provide some intuition for how policy iteration works?

Assuming you have already seen what the policy iteration and value iteration algorithms are, the agent simply builds the new policy by selecting the action with the highest value for each state.
The value of an action is the sum, over all possible next states for that action, of the probability of reaching that next state times (the value of the next state plus the reward of the transition).
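For concreteness, here is a minimal Python sketch of that policy-improvement step. The dictionary layout of the transition model P, the name greedy_policy, and the toy two-state MDP are illustrative assumptions (a standard discount factor gamma is also added):
def greedy_policy(P, V, gamma=0.9):
    """Pick, in each state, the action with the highest expected value:
    sum over next states s2 of  P(s2|s,a) * (reward + gamma * V[s2])."""
    policy = {}
    for s, actions in P.items():
        if not actions:          # terminal state: nothing to choose
            continue
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in transitions)
             for a, transitions in actions.items()}
        policy[s] = max(q, key=q.get)   # action with the highest value
    return policy

# Hypothetical two-state fragment: "go" reaches terminal state 1 with
# probability 0.8 and reward +1, otherwise stays in state 0 with reward 0.
P = {0: {"go":   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
         "stay": [(1.0, 0, 0.0)]},
     1: {}}                       # terminal state
V = {0: 0.5, 1: 0.0}
print(greedy_policy(P, V))        # {0: 'go'}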

Related

(SPSS) Assign values to remaining time point based on value on another variable, and looping for each case

I am currently working on analyzing a within-subject dataset with 8 time-ordered assessment points for each subject.
The variables of interest in this example are ID, time point, and accident.
I want to create two variables, accident_intercept and accident_slope, based on the value of accident at a particular time point.
For the accident_intercept variable, once a participant has indicated the occurrence of an accident (i.e., accident = 1) at a specific time point, I want the values for that time point and the remaining time points to be 1.
For the accident_slope variable, once a participant has indicated the occurrence of an accident (i.e., accident = 1) at a specific time point, I want the value at that time point to be 0 and then to count up by 1 at each remaining time point until the last one, for each subject.
The main challenge here is that the process stated above needs to be repeated/looped for each participant, who occupies 8 rows of data.
Please see how the newly created variables should look:
I have looked into the documentation for different SPSS syntax, such as LOOP and the LAG/LEAD functions. I also tried to break my task into different components and google each one. However, I have not made any progress :)
I would be really grateful for any help and direction you can provide.
Here is one way to do what you need using aggregate to calculate "accident time":
* Record the time point at which the accident occurred.
if accident=1 accidentTime=TimePoint.
* Spread that accident time to all 8 rows of the same ID.
aggregate out=* mode=addvariables overwrite=yes /break=ID/accidentTime=max(accidentTime).
* From the accident onward, the intercept is 1 and the slope counts up from 0.
if TimePoint>=accidentTime Accident_Intercept=1.
if TimePoint>=accidentTime Accident_Slope=TimePoint-accidentTime.
* Replace missing values (time points before the accident) with 0.
recode Accident_Slope accidentTime (miss=0).
Here is another approach using the lag function:
* Flag the accident row, then carry the flag forward within each ID.
compute Accident_Intercept=0.
if accident=1 Accident_Intercept=1.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Intercept=1.
* Once the flag has been carried forward, count up from 0 after the accident row.
compute Accident_Slope=0.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Slope=lag(Accident_Slope)+1.
exe.
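For comparison, here is a rough pandas sketch of the same logic in Python; the column names mirror the question, while the example data and the helper name add_accident_vars are made up:
import pandas as pd

# Hypothetical data: 2 subjects x 8 time points each
df = pd.DataFrame({
    "ID": [1] * 8 + [2] * 8,
    "TimePoint": list(range(1, 9)) * 2,
    "accident": [0, 0, 0, 1, 0, 0, 0, 0,   0, 0, 0, 0, 0, 1, 0, 0],
})

def add_accident_vars(g):
    hit = g.loc[g["accident"] == 1, "TimePoint"]
    if hit.empty:                      # subject never had an accident
        g["accident_intercept"] = 0
        g["accident_slope"] = 0
        return g
    t0 = hit.min()                     # first accident time point
    g["accident_intercept"] = (g["TimePoint"] >= t0).astype(int)   # 1 from the accident onward
    g["accident_slope"] = (g["TimePoint"] - t0).clip(lower=0)      # 0 at the accident, then 1, 2, ...
    return g

df = df.groupby("ID", group_keys=False).apply(add_accident_vars)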

The purpose of using Q-Learning algorithm

What is the point of using Q-learning? I have used some example code that represents a 2D board with a pawn moving on it. At the right end of the board there is a goal which we want to reach. After the algorithm completes, I have a Q-table with values assigned to every state-action pair. Is it all about getting this Q-table to see which state-action pairs (i.e., which actions are the best for specific states) are the most useful? That's how I understand it right now. Am I right?
Is it all about getting this Q-table to see which state-action pairs (i.e., which actions are the best for specific states) are the most useful?
Yep! That's pretty much it. Given a finite state space (and the usual conditions on exploration and the learning rate), Q-learning is guaranteed to eventually learn the optimal policy. Once an optimal policy is reached (also known as convergence), every time the agent is in a given state s, it looks in its Q-table for the action a with the highest Q-value for that (s, a) pair.
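As a minimal sketch of what that looks like in code, assuming a tabular Q indexed as Q[state, action] (the array shape and function names here are illustrative, not from the question's example code):
import numpy as np

n_states, n_actions = 12, 4              # e.g., a small 2D board flattened into 12 cells
Q = np.zeros((n_states, n_actions))      # filled in by Q-learning updates during training

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update for an observed transition (s, a, r, s_next)."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def greedy_action(Q, s):
    """After convergence: in state s, act on the entry with the highest Q-value."""
    return int(np.argmax(Q[s]))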

Weight assignment to define an objective function

I have a set of jobs with execution times (C1, C2, ..., Cn) and deadlines (D1, D2, ..., Dn). Each job will complete its execution in some time, i.e., its
response time (R1, R2, ..., Rn). However, there is a possibility that not every job will complete its execution before its deadline. So I define a variable called slack for each job, i.e., (S1, S2, ..., Sn). Slack is basically the difference between the deadline and the response time of a job, i.e.,
S1=D1-R1
S2=D2-R2, .. and so on
I have a set of slacks [S1,S2,S3,...Sn]. These slacks can be positive or negative depending on the deadline and completion time of tasks, i.e., D and R.
The problem is that I need to define weights (W) for each job (or slack) such that a job with negative slack (i.e., R > D, a job that misses its deadline) has more weight than the jobs with positive slack. Based on these weights and slacks, I need to define an objective function that can be used to maximize the slack.
The problem doesn't seem to be a difficult one. However, I couldn't find a solution. Some help in this regard is much appreciated.
Thanks
This can often be done easily with variable splitting:
splus(i) - smin(i) = d(i) - r(i)
splus(i) ≥ 0, smin(i) ≥ 0
If we then put a term in the objective so that we are minimizing:
sum(i, w1 * splus(i) + w2 * smin(i) )
this will work fine: as long as both weights are positive, the minimization itself drives one of splus(i) and smin(i) to zero, so we don't need to add the complementarity condition splus(i)*smin(i)=0 explicitly.
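As a small illustration of the splitting trick, here is a sketch only: the numbers are made up, the response times are treated as fixed data even though in a real scheduling model they would come out of other constraints, and SciPy's linprog is used simply because its default bounds already enforce splus, smin >= 0:
import numpy as np
from scipy.optimize import linprog

d = np.array([10.0, 8.0, 6.0])    # deadlines D_i (illustrative)
r = np.array([7.0, 9.0, 6.0])     # response times R_i (illustrative)
n = len(d)
w1, w2 = 1.0, 10.0                # missed deadlines (smin) weighted much more heavily

# Variables x = [splus_1..n, smin_1..n]; minimize w1*splus + w2*smin
c = np.concatenate([w1 * np.ones(n), w2 * np.ones(n)])

# Equality constraints: splus_i - smin_i = D_i - R_i
A_eq = np.hstack([np.eye(n), -np.eye(n)])
b_eq = d - r

res = linprog(c, A_eq=A_eq, b_eq=b_eq)   # default bounds: all variables >= 0
splus, smin = res.x[:n], res.x[n:]
print(splus)   # [3. 0. 0.]  -> positive slack
print(smin)    # [0. 1. 0.]  -> job 2 misses its deadline by 1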

How to handle terminal nodes in Monte Carlo Tree Search?

When my tree has gotten deep enough that terminal nodes are starting to be selected, I would have assumed that I should just perform a zero-move "playout" from it and back-propagate the results, but the IEEE survey of MCTS methods indicates that the selection step should be finding the "most urgent expandable node" and I can't find any counterexamples elsewhere. Am I supposed to be excluding them somehow? What's the right thing to do here?
If you actually reach a terminal node in the selection phase, you'd kind of skip expansion and play-out (they'd no longer make sense) and straight up backpropagate the value of that terminal node.
From the paper you linked, this is not clear from page 6, but it is clear in Algorithm 2 on page 9. In that pseudocode, the TreePolicy() function will end up returning a terminal node v. When the state of this node is then passed into the DefaultPolicy() function, that function will directly return the reward (the condition of that function's while-loop will never be satisfied).
It also makes sense that this is what you'd want to do if you have a good intuitive understanding of the algorithm, and want it to be able to guarantee optimal estimates of values given an infinite amount of processing time. With an infinite amount of processing time (an infinite number of simulations), you'll want to back up values from the "best" terminal states infinitely often, so that the averaged values from backups in nodes closer to the root also converge to those best leaf-node values in the limit.
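Here is a minimal sketch of one MCTS iteration with that terminal-node handling, for a single-agent setting. The Node class and the state interface (is_terminal, legal_actions, step, terminal_value) are assumptions for illustration, not the survey's pseudocode, and the sign flip needed for two-player games is omitted:
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.total = {}, 0, 0.0

    def ucb_child(self, c=1.4):
        return max(self.children.values(),
                   key=lambda ch: ch.total / ch.visits
                                  + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts_iteration(root):
    node = root
    # Selection: descend while the node is non-terminal and fully expanded
    while not node.state.is_terminal() and len(node.children) == len(node.state.legal_actions()):
        node = node.ucb_child()

    if node.state.is_terminal():
        # Terminal node reached during selection: skip expansion and play-out,
        # back up its exact value directly
        value = node.state.terminal_value()
    else:
        # Expansion: add one untried child, then a random play-out from it
        untried = [a for a in node.state.legal_actions() if a not in node.children]
        a = random.choice(untried)
        child = Node(node.state.step(a), parent=node)
        node.children[a] = child
        node = child
        state = node.state
        while not state.is_terminal():
            state = state.step(random.choice(state.legal_actions()))
        value = state.terminal_value()

    # Backpropagation
    while node is not None:
        node.visits += 1
        node.total += value
        node = node.parent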

What is the difference between Q-learning and SARSA?

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms.
According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (in state s and action a, at timestep t), i.e. Q(st, at), can be updated as follows
Q(st, at) = Q(st, at) + α*(rt + γ*Q(st+1, at+1) - Q(st, at))
On the other hand, the update step for the Q-learning algorithm is the following
Q(st, at) = Q(st, at) + α*(rt + γ*max_a Q(st+1, a) - Q(st, at))
which can also be written as
Q(st, at) = (1 - α) * Q(st, at) + α * (rt + γ*max_a Q(st+1, a))
where γ (gamma) is the discount factor and rt is the reward received from the environment at timestep t.
Is the difference between these two algorithms the fact that SARSA only looks up the next policy value while Q-learning looks up the next maximum policy value?
TLDR (and my own answer)
Thanks to all those answering this question since I first asked it. I've made a GitHub repo playing with Q-learning and empirically understood what the difference is. It all amounts to how you select your next best action, which from an algorithmic standpoint can be a mean, max or best action, depending on how you choose to implement it.
The other main difference is when this selection happens (e.g., online vs. offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences.
One last important note is that both Sutton & Barto and Wikipedia often have mixed, confusing or wrong formulaic representations with regard to the next-state best/max action and reward: r(t+1) is in fact r(t).
When I was learning this part, I found it very confusing too, so I put together the two pseudo-codes from R. Sutton and A. G. Barto hoping to make the difference clearer.
Blue boxes highlight the part where the two algorithms actually differ. Numbers highlight the more detailed difference to be explained later.
TL;DR:
| | SARSA | Q-learning |
|:-----------:|:-----:|:----------:|
| Choosing A' | π | π |
| Updating Q | π | μ |
where π is an ε-greedy policy (i.e. ε > 0, with exploration) and μ is a greedy policy (i.e. ε == 0, no exploration).
Q-learning uses different policies for choosing the next action A' and for updating Q. In other words, it tries to evaluate π while following another policy μ, so it is an off-policy algorithm.
In contrast, SARSA uses π all the time, hence it is an on-policy algorithm.
More detailed explanation:
The most important difference between the two is how Q is updated after each action. SARSA uses Q(S', A') with A' drawn from its ε-greedy policy exactly. In contrast, Q-learning uses the maximum Q' over all possible actions for the next step. This makes it look like it follows a greedy policy with ε = 0, i.e. no exploration in this part.
However, when actually taking an action, Q-learning still uses the action drawn from an ε-greedy policy. This is why "Choose A ..." is inside the repeat loop.
Following the loop logic in Q-learning, the next action actually taken is still drawn from the ε-greedy policy.
Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning does so relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-Learning tends to converge a little more slowly, but has the capability to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear approximation.
In practical terms, under the ε-greedy policy, Q-Learning backs up towards the maximum next-state action value, while SARSA backs up towards a target whose expected value is a weighted sum of the average action value and the maximum:
Q-Learning target: max_a Q(st+1, a)
SARSA target (in expectation): ε·mean_a Q(st+1, a) + (1-ε)·max_a Q(st+1, a)
What is the difference mathematically?
As is already described in most other answers, the difference between the two updates mathematically is indeed that, when updating the Q-value for a state-action pair (St, At):
Sarsa uses the behaviour policy (meaning, the policy used by the agent to generate experience in the environment, which is typically epsilon-greedy) to select an additional action At+1, and then uses Q(St+1, At+1) (discounted by gamma) as expected future returns in the computation of the update target.
Q-learning does not use the behaviour policy to select an additional action At+1. Instead, it estimates the expected future returns in the update rule as maxA Q(St+1, A). The max operator used here can be viewed as "following" the completely greedy policy. The agent is not actually following the greedy policy though; it only says, in the update rule, "suppose that I would start following the greedy policy from now on, what would my expected future returns be then?".
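As a small code sketch of exactly that difference (assuming a tabular Q[state, action] array and an ε-greedy behaviour policy; the function names are illustrative, not Sutton & Barto's pseudocode):
import numpy as np

def eps_greedy(Q, s, eps=0.1):
    """Behaviour policy used by both algorithms to act in the environment."""
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstraps from the action a_next actually selected by eps_greedy
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstraps from the greedy value, whatever action is taken next
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])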
What does this mean intuitively?
As mentioned in other answers, the difference described above means, using technical terminology, that Sarsa is an on-policy learning algorithm, and Q-learning is an off-policy learning algorithm.
In the limit (given an infinite amount of time to generate experience and learn), and under some additional assumptions, this means that Sarsa and Q-learning converge to different solutions / "optimal" policies:
Sarsa will converge to a solution that is optimal under the assumption that we keep following the same policy that was used to generate the experience. This will often be a policy with some element of (rather "stupid") randomness, like epsilon-greedy, because otherwise we are unable to guarantee that we'll converge to anything at all.
Q-Learning will converge to a solution that is optimal under the assumption that, after generating experience and training, we switch over to the greedy policy.
When to use which algorithm?
An algorithm like Sarsa is typically preferable in situations where we care about the agent's performance during the process of learning / generating experience. Consider, for example, that the agent is an expensive robot that will break if it falls down a cliff. We'd rather not have it fall down too often during the learning process, because it is expensive. Therefore, we care about its performance during the learning process. However, we also know that we need it to act randomly sometimes (e.g. epsilon-greedy). This means that it is highly dangerous for the robot to be walking alongside the cliff, because it may decide to act randomly (with probability epsilon) and fall down. So, we'd prefer it to quickly learn that it's dangerous to be close to the cliff; even if a greedy policy would be able to walk right alongside it without falling, we know that we're following an epsilon-greedy policy with randomness, and we care about optimizing our performance given that we know that we'll be stupid sometimes. This is a situation where Sarsa would be preferable.
An algorithm like Q-learning would be preferable in situations where we do not care about the agent's performance during the training process, but we just want it to learn an optimal greedy policy that we'll switch to eventually. Consider, for example, that we play a few practice games (where we don't mind losing due to randomness sometimes), and afterwards play an important tournament (where we'll stop learning and switch over from epsilon-greedy to the greedy policy). This is where Q-learning would be better.
There's an index mistake in your formula for Q-Learning.
See page 148 of Sutton and Barto's book:
Q(st,at) <-- Q(st,at) + alpha * [r(t+1) + gamma * max_a Q(st+1,a) - Q(st,at)]
The typo is in the argument of the max: the indexes there are st+1 and a, while in your question they are st+1 and at+1 (those are correct for SARSA).
Hope this helps a bit.
In Q-Learning
This is your version:
Q-Learning: Q(St,At) = Q(St,At) + a [ R(t+1) + discount * max Q(St+1,At) - Q(St,At) ]
should be changed to
Q-Learning: Q(St,At) = Q(St,At) + a [ R(t+1) + discount * max Q(St+1,a) - Q(St,At) ]
As you said, you have to find the maximum Q-value for the update equation by varying a; then you will have a new Q(St,At). Be careful: the a that gives you the maximum Q-value is not necessarily the next action. At this stage, you only know the next state (St+1), and before going to the next round, you update St with St+1 (St <-- St+1).
For each step of the episode:
choose At from St using the Q-values (e.g., ε-greedy)
take At and observe Rt+1 and St+1
update the Q-value using the equation above
St <-- St+1
until St is terminal
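A compact Python version of that loop, as a sketch only: the tabular Q[state, action] array and the env object with reset()/step() returning (next state, reward, done) are assumptions for illustration. Note that the ε-greedy action choice sits inside the loop, while the update uses the max over a:
import numpy as np

def run_q_learning_episode(env, Q, eps=0.1, alpha=0.1, gamma=0.9):
    s = env.reset()
    done = False
    while not done:
        # choose At from St using the Q-values (eps-greedy)
        a = np.random.randint(Q.shape[1]) if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)            # take At, observe Rt+1 and St+1
        # update with the max over a, not with the action taken next
        Q[s, a] += alpha * (r + gamma * (0.0 if done else np.max(Q[s_next])) - Q[s, a])
        s = s_next                               # St <-- St+1
    return Q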
The only difference between SARSA and Q-learning is that SARSA takes the next action based on the current policy, while Q-learning takes the action with the maximum utility of the next state.
I didn't read any book on this, I just see the implications of them:
Q-learning just focuses on the action values (the state-action grid).
SARSA just focuses on the state-to-state transitions: it observes the actions taken in s and s' and then updates the grid accordingly.
Both SARSA and Q-learning agents follow an ε-greedy policy to interact with the environment.
A SARSA agent updates its Q-function using the next timestep's Q-value for whatever action the policy provides (mostly still the greedy action, but a random action is also accepted). The policy being executed and the policy being updated towards are the same.
A Q-learning agent updates its Q-function using only the action that brings the maximum next-state Q-value (totally greedy with respect to the policy). The policy being executed and the policy being updated towards are different.
Hence, SARSA is on-policy and Q-learning is off-policy.
