Could anybody help explain how the following value function was generated? The problem and solution are attached; I just don't understand how the solution was derived. Thank you!
Since no one else has taken a stab at it, I'll present my understanding of the problem. (Disclaimer: I'm not an expert on reinforcement learning, and I'm posting this as an answer because it's too long for a comment.)
Think of it this way: when starting at, for example, node d, a random walker has a 50% chance of jumping to either node e or node a. Each such jump multiplies the reward (r) by y (gamma in the picture). You keep jumping around until you reach the target node (f in this case), at which point you collect the reward r.
If I've understood correctly, the two smaller 3x2 squares represent the expected values of reward when starting from each node. Now, it's obvious why in the first 3x2 square every node has a value of 100: because y = 1, the reward never decreases. You can just keep jumping around until you eventually end up in the reward node, and gather the reward of r=100.
However, in the second 3x2 square, each additional jump multiplies the reward by 0.9. So, to get the expected value of the reward when starting from node c, you sum together the rewards from the different paths, each multiplied by its probability. Going from c to f has a 50% chance and takes 1 jump, so you get r = 0.5*0.9^0*100 = 50. Then there's the path c-b-e-f: 0.5*(1/3)*(1/3)*0.9^2*100 = 4.5. Then there's c-b-c-f: 0.9^2*0.5^2*(1/3)^1*100 = 6.75. You keep going this way until the reward from the path you're examining is insignificantly small, and sum together the rewards from all the paths. This should give you the value of the corresponding node, that is, 50+6.75+4.5+... = 76.
I guess the programmatic way of doing it would be to use a modified DFS/BFS to explore all paths of length N or less and sum together the rewards from those paths (N chosen so that 0.9^N is small).
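To make that concrete, here is a rough C sketch of that path-summing DFS. The figure isn't reproduced here, so the adjacency list below is only a guess at the graph's wiring; swap in the real edges, reward, and gamma from the problem.

#include <stdio.h>

#define NODES  6            /* nodes a..f mapped to indices 0..5           */
#define TARGET 5            /* node f, where the reward is collected       */
#define GAMMA  0.9
#define REWARD 100.0
#define EPS    1e-6         /* stop following a path once its contribution */
                            /* to the sum falls below this                 */

/* Hypothetical wiring -- replace with the edges from the actual figure. */
static const int neighbors[NODES][3] = {
    {1, 3},                 /* a: b, d    */
    {0, 2, 4},              /* b: a, c, e */
    {1, 5},                 /* c: b, f    */
    {0, 4},                 /* d: a, e    */
    {1, 3, 5},              /* e: b, d, f */
    {0}                     /* f: absorbing, never expanded */
};
static const int degree[NODES] = {2, 3, 2, 2, 3, 0};

/* Sum, over every path ending in TARGET, probability * gamma^(jumps-1) * reward.
 * 'node' is the node just jumped to, 'prob' the probability of the path so far,
 * 'discount' the accumulated gamma factor. */
static double walk(int node, double prob, double discount)
{
    if (node == TARGET)
        return prob * discount * REWARD;
    if (prob * discount * REWARD < EPS)
        return 0.0;                          /* contribution negligible: prune */
    double sum = 0.0;
    for (int j = 0; j < degree[node]; j++)
        sum += walk(neighbors[node][j], prob / degree[node], discount * GAMMA);
    return sum;
}

int main(void)
{
    const char names[] = "abcdef";
    for (int s = 0; s < NODES; s++) {
        double v = (s == TARGET) ? REWARD : 0.0;
        for (int j = 0; j < degree[s]; j++)
            v += walk(neighbors[s][j], 1.0 / degree[s], 1.0);
        printf("V(%c) = %.2f\n", names[s], v);
    }
    return 0;
}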
Again, take this with a grain of salt; I'm not an expert on reinforcement learning.
So I am implementing the A* algorithm in C. Here's the procedure.
I am using a priority queue (backed by an array) for all the open nodes. Since I'll have duplicate distances, i.e. more than one node with the same distance/priority, while inserting a node into the PQ, if the parent of the inserted node has the same priority, I still swap them, so that my newest entry stays on top (or as high as possible) and I keep following a particular direction. Also, on removal, when I swap the topmost element with the last one, if the swapped last element has the same priority as one of its children, it again gets swapped toward the bottom. (I am not sure whether this affects anything.)
Now the problem: say I have a 100*100 matrix in which I am moving, and I have obstacles from (0,20) to (15,20) of the 2D array. For a starting position of (2,2) and an ending position of (16,20), I get a straight path, i.e. first go all the way right, then go down till 15, then move one right, and I am done.
But if the start is (2,2) and the end is (12,78), i.e. the points are separated by the obstacles and the path has to go around them, I still go via (16,20), and my path after (16,20) is still straight, but my path up to (16,20) is zigzag: I go some distance straight down, then some right, then down, then right, and so on, eventually reaching (16,20) and going straight after that.
Why this zigzag path for the first half of the distance? What can I do to make sure my path is straight, as it is when my destination is (16,20) rather than (12,78)?
Thanks.
void findPath(array[ROW][COLUMN], sourceX, sourceY, destX, destY) {
    PQ pq[SIZE];
    int x, y;
    insert(pq, sourceX, sourceY);
    while (!empty(pq)) {
        remove(pq);                        /* pop the best (lowest priority value) node */
        if (removedIsDestination)
            break;                         /* path found */
        insertAdjacent(pq, x, y, destX, destY);
    }
}

void insert(PQ pq[SIZE], element) {
    ++sizeOfPQ;
    pq[sizeOfPQ] = element;                /* place the new element at the end */
    int i = sizeOfPQ;
    while (i > 0) {                        /* sift up */
        if (pq[i].priority <= pq[(i-1)/2].priority) {
            swapWithParent;                /* ties also bubble up, so the newest entry stays high */
            i = (i-1)/2;
        } else {
            break;
        }
    }
}
You should change your scoring part. Right now you calculate absolute distance. Instead, calculate the minimum move distance. If you count each move as one, then if you are at (x,y) and going to (dX,dY), that would be
distance moved + (max(x,dX) - min(x,dX) + max(y,dY) - min(y,dY))
A lower value is considered a better score (higher priority).
This heuristic is a guess at how many moves it would take if there was nothing in the way.
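In C, that basic score could look roughly like this (function and parameter names are just illustrative):

#include <stdlib.h>

/* Moves already made plus the fewest moves still needed if nothing were in
 * the way (the max/min expression above is just |x - dX| + |y - dY|).
 * Lower values are better. */
int score(int movesSoFar, int x, int y, int dX, int dY)
{
    return movesSoFar + abs(x - dX) + abs(y - dY);
}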
The nice thing about the heuristic is you can change it to get the results you want, for example if you prefer to move in a straight line as you suggest, then you can make this change:
= distance moved + (max(x,dX) - min(x,dX) + max(y,dY) - min(y,dY))
+ (1 if this is a turn from the last move)
This will cause you to "find" solutions which tend to go in the same direction.
If you want to FORCE as few turns as possible:
= distance moved + (max(x,dX) - min(x,dX) + max(y,dY) - min(y,dY))
+ (1 times the number of turns made)
This is what is nice about A* -- the heuristic will inform the search -- you will still always find a solution, but if there is more than one you can influence where you look first -- this makes it good for simulating AI behavior.
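As a sketch (again with illustrative names), the two variants above differ only in what the turn term counts:

#include <stdlib.h>

/* Variant 1: add 1 only if this particular move changes direction. */
int scoreTurnMove(int movesSoFar, int x, int y, int dX, int dY, int isTurn)
{
    return movesSoFar + abs(x - dX) + abs(y - dY) + (isTurn ? 1 : 0);
}

/* Variant 2: add a penalty for every turn made so far on this path.
 * turnWeight = 1 gives a mild preference for straight paths; a larger
 * weight (like the 25 used later in this answer) forces few turns. */
int scoreTurnCount(int movesSoFar, int x, int y, int dX, int dY,
                   int turnsSoFar, int turnWeight)
{
    return movesSoFar + abs(x - dX) + abs(y - dY) + turnWeight * turnsSoFar;
}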
Doubt: How are the first and second ways of calculating different from each other?
The first one puts a lower priority on a move that is a turn. The second one puts a lower priority on a path with more turns. In some cases (e.g., the first turn) the value will be the same, but overall the second one will pick paths that have as few turns as possible, where the first one might not.
Also, regarding "1 if this is a turn from the last move": say I have the source at the top left and the destination at the bottom right. My path would normally be right, right, right, ..., down, down, down, .... According to "1 if this is a turn from the last move", when I change from right to down, will I add 1?
Yes
Won't that make the total value larger, so the priority for down will decrease?
Yes, exactly. You want to avoid looking first at choices that have a turn in them. This gives them lower priority, and your algorithm will investigate other options with higher priority first -- exactly what you want.
Or does "1 if this is a turn from the last move" mean when I move to a cell that is not abutting the cell previously worked on? Thanks
No, I don't understand this question -- I don't think it makes sense in this context. All moves have to abut the previous cell; diagonal moves are not allowed.
Though, I'd really appreciate it if you could tell me one instance where the first and second methods give different answers. Thanks a lot. :)
Not so easy without seeing the details of your algorithm, but the following might work:
The red areas are blocks. The green path is what I would expect the first heuristic to do: it locally tries to find the least turn. The blue path is the least-turn solution. Note that how far apart the red areas are, and the details of your algorithm, influence whether this will work. As written above, an extra turn only costs 1 in the heuristic. So, if you want to be sure this will work, change the heuristic like this:
= distance moved + (max(x,dX) - min(x,dX) + max(y,dY) - min(y,dY))
+ (25 times the number of turns made)
Where 25 is bigger than the distance to get past the 2nd turn in the green path. (Thus after the 2nd turn the blue path will be searched.)
I am trying to find the optimal solution to a Sliding Block Puzzle of any length using the A* algorithm.
The Sliding Block Puzzle is a game with white (W) and black (B) tiles arranged on a linear game board with a single empty space (-). Given the initial state of the board, the aim of the game is to arrange the tiles into a target pattern.
For example my current state on the board is BBW-WWB and I have to achieve BBB-WWW state.
Tiles can move in these ways :
1. slide into an adjacent empty space with a cost of 1.
2. hop over another tile into the empty space with a cost of 1.
3. hop over 2 tiles into the empty space with a cost of 2.
I have everything implemented, but I am not sure about the heuristic function. For each misplaced tile in the current state, it computes the shortest distance (minimal cost) to the closest same-colored tile position in the goal state, and sums these distances.
Considering the current state BWB-W and goal state BB-WW, the heuristic function gives me a result of 3 (minimal distances: B=0 + W=2 + B=1 + W=0). But the actual cost of reaching the goal is not 3 but 2 (move the misplaced W => cost 1, then the misplaced B => cost 1).
My question is: should I compute the minimal distance this way and not worry about the overestimation, or should I divide it by 2? According to the ways tiles can move, one tile can cover twice the distance for the same cost (see moves 1 and 2).
I tried both versions. While the divided distance gives a better final path cost for the goal that is reached, it visits more nodes and therefore takes more time than the undivided one. What is the proper way to compute it? Which one should I use?
It is not obvious to me what an admissible heuristic function for this problem looks like, so I won't commit to saying, "Use the divided by two function." But I will tell you that the naive function you came up with is not admissible, and therefore will not give you good performance. In order for A* to work properly, the heuristic used must be admissible; in order to be admissible, the heuristic must absolutely always give an optimistic estimate. This one doesn't, for exactly the reason you highlight in your example.
(Although now that I think about it, dividing by two does seem like a reasonable way to force admissibility. I'm just not going to commit to it.)
Your heuristic is not admissible, so your A* is not guaranteed to find the optimal answer every time. An admissible heuristic must never overestimate the cost.
A better heuristic than dividing your heuristic cost by 3 would be: instead of adding the distance D of each letter to its final position, add ceil(D/2). This way, a letter 1 or 2 away gets a value of 1, a letter 3 or 4 away gets a value of 2, and so on.
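A sketch of that heuristic in C, assuming the board is a plain string such as "BBW-WWB" (the representation is an assumption, not taken from the question's code):

#include <stdlib.h>
#include <string.h>

/* For each tile, take the distance D to the nearest same-colored slot in the
 * goal and charge ceil(D/2): a single move covers at most 2 positions for
 * cost 1, and 3 positions for cost 2, so this never overestimates a tile's
 * own cost. */
int heuristic(const char *state, const char *goal)
{
    int n = (int)strlen(state);
    int h = 0;
    for (int i = 0; i < n; i++) {
        if (state[i] == '-')
            continue;                          /* the blank costs nothing */
        int best = n;
        for (int j = 0; j < n; j++)            /* nearest matching goal slot */
            if (goal[j] == state[i] && abs(i - j) < best)
                best = abs(i - j);
        h += (best + 1) / 2;                   /* ceil(best / 2) */
    }
    return h;
}

On the BWB-W example above this returns 1 + 1 = 2, which matches the actual cost.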
I'm looking for a fast solution for the following problem:
I have a fixed point (let's say the upper right on the white measurement line) and need to find the closest point on a curve made of equally spaced points (the lower curve). Additionally, I do this for every point on the upper curve to draw the distances between the curves with different colours (three levels: below minimum [red], between minimum and maximum [orange] and above maximum [green]).
My current solution is a tradeoff: I take the fixed point, iterate through an arbitrary interval (e.g. 50 units to the left and right of the fixed point) and calculate the distance of each pair. This saves some CPU power, but it is neither elegant nor accurate, since I could miss a minimum distance outside my chosen interval.
Any proposals for a faster algorithm?
Edit: Equally spaced means all points have the same distance on the x-axis; this is true for both curves. Also, I do not need to interpolate between the points; that would be too time consuming.
Rather than an arbitrary distance, you could perhaps iterate until "out of range".
In your example, suppose you start with the point on the upper curve at the top-right of your line. Dropping vertically downwards from there, you get a distance of (by my eye) about 200um.
Now you can move right from here testing points until the horizontal distance is 200um. Beyond that, it's impossible to get a distance less than 200um.
Moving left, the distance goes down until you find the 150um minimum, then starts rising again. Once you're 150um to the left of your upper point, again, it's impossible to beat the minimum you've found.
If you'd gone left first, you wouldn't have had to go so far right, so as an optimization either follow the direction in which the distance falls, or else work out from the middle in both directions at once.
I don't know how many um 50 units is, so this might be slower or faster than what you have. It does avoid the risk of missing a lower value, though.
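Here is a rough C sketch of that bounded scan, assuming the lower-curve samples are stored in arrays sorted by x and that startIndex is the sample nearest in x to the fixed point; all names are illustrative:

#include <math.h>
#include <stddef.h>

/* Walk outwards from startIndex in both directions and stop as soon as the
 * horizontal distance alone already exceeds the best full distance found,
 * since no farther point can beat it. */
double closestDistance(double fx, double fy,
                       const double xs[], const double ys[], size_t n,
                       size_t startIndex)
{
    double best = hypot(xs[startIndex] - fx, ys[startIndex] - fy);

    /* walk right */
    for (size_t i = startIndex + 1; i < n && xs[i] - fx < best; i++) {
        double d = hypot(xs[i] - fx, ys[i] - fy);
        if (d < best) best = d;
    }
    /* walk left */
    for (size_t i = startIndex; i-- > 0 && fx - xs[i] < best; ) {
        double d = hypot(xs[i] - fx, ys[i] - fy);
        if (d < best) best = d;
    }
    return best;
}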
Since you're doing lots of tests against the same set of points on the lower curve, you can probably improve on this by ignoring the fact that the points form a curve at all. Stick them all in a k-d tree or similar, and search that repeatedly. It's called a nearest neighbor search.
It may help to identify this problem as a nearest neighbour search problem. That link includes a good discussion about the various algorithms that are used for this. If you are OK with using C++ rather than straight C, ANN looks like a good library for this.
It also looks as though this question has been asked before.
We can label the top curve y=t(x) and the bottom curve y=b(x). Label the closest-function x_b=c(x_t). We know that the closest-function is weakly monotone non-decreasing as two shortest paths never cross each other.
If you know that the distance function d(x_t,x_b) has only one local minimum for every fixed x_t (this happens if the curve is "smooth enough"), then you can save time by "walking" the curve:
- start with x_t=0, x_b=0
- while x_t <= x_max
-- find the closest x_b by local search
(increment x_b while the distance is decreasing)
-- add {x_t, x_b} to the result set
-- increment x_t
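A small C sketch of that walk, assuming both curves are stored as arrays of samples (tx/ty for the top curve, bx/by for the bottom one); since the closest index never moves left as x_t grows, x_b only ever advances:

#include <math.h>
#include <stddef.h>

/* result[i] receives the index of the bottom-curve point closest to top
 * point i. Relies on the single-local-minimum assumption stated above. */
void walkCurves(const double tx[], const double ty[],
                const double bx[], const double by[],
                size_t nTop, size_t nBottom, size_t result[])
{
    size_t xb = 0;
    for (size_t xt = 0; xt < nTop; xt++) {
        double d = hypot(bx[xb] - tx[xt], by[xb] - ty[xt]);
        while (xb + 1 < nBottom) {                 /* increment x_b while the */
            double dn = hypot(bx[xb + 1] - tx[xt], /* distance is decreasing  */
                              by[xb + 1] - ty[xt]);
            if (dn >= d)
                break;
            d = dn;
            xb++;
        }
        result[xt] = xb;
    }
}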
If you expect x_b to be smooth enough, but you cannot assume that and you want an exact result, walk the curve in both directions. Where the results agree, they are correct. Where they disagree, run a complete search between the two results (the leftmost and the rightmost local minima). Sample the "ambiguous block" in such an order (binary division) as to allow the most pruning due to the monotonicity.
As a middle ground:
Walk the curve in both directions. If the results disagree, choose between the two. If you can guarantee at most two local minima for each fixed x_t, this produces the optimal solution. There are still some pathological cases where the optimal solution is not found: those contain a local minimum that is flanked by two other local minima that are both worse than it. I dare say it is uncommon to find a case where the solution is far from optimal (assuming smooth y=b(x)).
I have an artificial neural network which plays Tic-Tac-Toe - but it is not complete yet.
What I have so far:
the reward array "R[t]" with integer values for every timestep or move "t" (1=player A wins, 0=draw, -1=player B wins)
The input values are correctly propagated through the network.
the formula for adjusting the weights:
What is missing:
the TD learning: I still need a procedure which "backpropagates" the network's errors using the TD(λ) algorithm.
But I don't really understand this algorithm.
My approach so far ...
The trace decay parameter λ should be "0.1" as distal states should not get that much of the reward.
The learning rate is "0.5" in both layers (input and hidden).
It's a case of delayed reward: The reward remains "0" until the game ends. Then the reward becomes "1" for the first player's win, "-1" for the second player's win or "0" in case of a draw.
My questions:
How and when do you calculate the net's error (TD error)?
How can you implement the "backpropagation" of the error?
How are the weights adjusted using TD(λ)?
Thank you so much in advance :)
If you're serious about making this work, then understanding TD-lambda would be very helpful. Sutton and Barto's book, "Reinforcement Learning" is available for free in HTML format and covers this algorithm in detail. Basically, what TD-lambda does is create a mapping between a game state and the expected reward at the game's end. As games are played, states that are more likely to lead to winning states tend to get higher expected reward values.
For a simple game like tic-tac-toe, you're better off starting with a tabular mapping (just track an expected reward value for every possible game state). Then once you've got that working, you can try using a NN for the mapping instead. But I would suggest trying a separate, simpler NN project first...
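For the tabular version, the core TD(λ) update is small. Here is a minimal C sketch (not the NN version); it assumes board states are encoded as integers in [0, 3^9), and the α/γ/λ values are only examples:

#include <string.h>

#define NUM_STATES 19683          /* 3^9 possible tic-tac-toe boards        */
#define ALPHA      0.1            /* learning rate (example value)          */
#define GAMMA      1.0            /* no discounting within a game           */
#define LAMBDA     0.1            /* trace decay, as the question suggests  */

static double V[NUM_STATES];      /* expected end-of-game reward per state  */
static double e[NUM_STATES];      /* eligibility traces                     */

void startEpisode(void)
{
    memset(e, 0, sizeof e);       /* traces are cleared at the start of every game */
}

/* Call once per observed transition. reward is 0 during the game; on the
 * final move pass +1 / 0 / -1 and nextValue = 0 (terminal state). */
void tdUpdate(int state, double reward, double nextValue)
{
    double delta = reward + GAMMA * nextValue - V[state];    /* the TD error */
    e[state] += 1.0;                     /* mark this state as recently visited */
    for (int s = 0; s < NUM_STATES; s++) {
        V[s] += ALPHA * delta * e[s];    /* push credit back to earlier states  */
        e[s] *= GAMMA * LAMBDA;          /* older states get exponentially less */
    }
}

If you later swap the table for a network, the traces become per-weight (each trace decays by γλ and accumulates the gradient of the output with respect to that weight), and the weights move by α·δ·trace instead of the table entries.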
I have been confused about this too, but I believe this is the way it works:
Starting from the end node, you check R (the output received) and E (the output expected). If E = R, it's fine, and you have no changes to make.
If E != R, you see how far off it was, based on thresholds and whatnot, and then shift the weights or thresholds up or down a bit. Then, based on the new weights, you go back in, check whether the output is now too high or too low, and repeat, with a weaker effect each time.
I've never really tried this algorithm, but that's basically the version of the idea as I understand it.
As far as I remember, you do the training with a known result set - you calculate the output for a known input and subtract your known output value from that - that is the error.
Then you use the error to correct the net. For a single-layer NN adjusted with the delta rule, I know that an epsilon of 0.5 is too high - something like 0.1 is better - slower, but better. With backpropagation it is a bit more advanced, but as far as I remember, the mathematical description of a NN looks complex and hard to understand - it is not that complicated.
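For the single-layer case, the delta-rule update mentioned here is only a few lines; a sketch (illustrative names, epsilon as discussed above):

/* Nudge each weight in proportion to its input and to the error
 * (known target output minus the output the net actually produced). */
void deltaRuleUpdate(double weights[], const double inputs[], int n,
                     double target, double output, double epsilon)
{
    double error = target - output;
    for (int i = 0; i < n; i++)
        weights[i] += epsilon * error * inputs[i];
}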
take a look at
http://www.codeproject.com/KB/recipes/BP.aspx
or google for "backpropagation c" - it is probably easier to understand in code.
When searching in a tree, my understanding of uniform cost search is that for a given node A with child nodes B, C, D and associated costs of (10, 5, 7), my algorithm will choose C, as it has the lowest cost. After expanding C, I see nodes E, F, G with costs of (40, 50, 60). It will choose E (40), as it has the minimum value among those three.
Now, isn't it just the same as doing a Greedy-Search, where you always choose what seems to be the best action?
Also, when defining costs from going from certain nodes to others, should we consider the whole cost from the beginning of the tree to the current node, or just the cost itself from going from node n to node n'?
Thanks
Nope. Your understanding isn't quite right.
The next node to be visited in case of uniform-cost-search would be D, as that has the lowest total cost from the root (7, as opposed to 40+5=45).
Greedy Search doesn't go back up the tree - it picks the lowest value and commits to that. Uniform-Cost will pick the lowest total cost from the entire tree.
In a uniform cost search you always consider all unvisited nodes you have seen so far, not just those that are connected to the node you just looked at. So in your example, after choosing C, you would find that visiting E has a total cost of 40 + 5 = 45, which is higher than the cost of starting again from the root and visiting D, which has a cost of 7. So you would visit D next.
The difference between them is that the Greedy picks the node with the lowest heuristic value while the UCS picks the node with the lowest action cost. Consider the following graph:
If you run both algorithms, you'll get:
UCS
Picks: S (cost 0), B (cost 1), A (cost 2), D (cost 3), C (cost 5), G (cost 7)
Answer: S->A->D->G
Greedy:
*supposing it chooses A instead of B; A and B have the same heuristic value
Picks: S , A (h = 3), C (h = 1), G (h = 0)
Answer: S->A->C->G
So, it's important to differentiate the action cost to get to the node from the heuristic value, which is a piece of information that is added to the node, based on the understanding of the problem definition.
Greedy search (for most of this answer, think of greedy best-first search when I say greedy search) is an informed search algorithm, which means the function that is evaluated to choose which node to expand has the form of f(n) = h(n), where h is the heuristic function for a given node n that returns the estimated value from this node n to a goal state. If you're trying to travel to a place, one example of a heuristic function is one that returns the estimated distance from node n to your destination.
Uniform-cost search, on the other hand, is an uninformed search algorithm, also known as a blind search strategy. This means that the value of the function f for a given node n, f(n), for uninformed search algorithms, takes into consideration g(n), the total action cost from the root node to the node n, that is, the path cost. It doesn't have any information about the problem apart from the problem description, so that's all it can know. You don't have any information that can help you decide how close one node is to a goal state, only to the root node. You can watch the nodes expanding here (Animation of the Uniform Cost Algorithm) and see how the cost from node n to the root is used to choose which nodes to expand.
Greedy search, just like any greedy algorithm, takes locally optimal solutions and uses a function that returns an estimated value from a given node n to the goal state. You can watch the nodes expanding here (Greedy Best First Search | Quick Explanation with Visualization) and see how the return of the heuristic function from node n to the goal state is used to choose which nodes to expand.
By the way, sometimes the path chosen by greedy search is not a global optimum. In the example in the video, for example, node A is never expanded because there are always nodes with smaller values of h(n). But what if A has such a high value, and the values for the next nodes are very small and therefore a global optimum? That can happen. A bad heuristic function can cause this. Getting stuck in a loop is also possible. A*, which is also a best-first search algorithm, fixes this by making use of both the path cost (which implies knowing nodes already visited) and a heuristic function, that is, f(n) = g(n) + h(n).
It's possible that up to this point, it's still not clear to you HOW uniform-cost search knows there is another path that looks better locally but not globally. It should become clear after noting that if all actions have the same cost, uniform-cost search is the same thing as breadth-first search (BFS). It would expand nodes just like BFS.
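To make the contrast concrete, the three strategies differ only in the evaluation function that orders the frontier (g(n) = path cost from the root, h(n) = heuristic estimate to the goal); a tiny sketch:

/* Greedy best-first search: order the frontier by the heuristic alone. */
double fGreedy(double g, double h)      { (void)g; return h; }

/* Uniform-cost search: order by the path cost alone (blind search). */
double fUniformCost(double g, double h) { (void)h; return g; }

/* A*: order by the sum of both. */
double fAStar(double g, double h)       { return g + h; }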
UCS cares about history,
Greedy does not.
In your example, after expanding C, the next node would be D according to UCS, because of that history: UCS can't forget the past, and it remembers that the total cost of D is much lower than that of E.
Don't be greedy. Be UCS, and if going back is really a better choice, don't be afraid to go back!