How to handle terminal nodes in Monte Carlo Tree Search? - artificial-intelligence

When my tree has gotten deep enough that terminal nodes are starting to be selected, I would have assumed that I should just perform a zero-move "playout" from it and back-propagate the results, but the IEEE survey of MCTS methods indicates that the selection step should be finding the "most urgent expandable node" and I can't find any counterexamples elsewhere. Am I supposed to be excluding them somehow? What's the right thing to do here?

If you actually reach a terminal node in the selection phase, you'd kind of skip expansion and play-out (they'd no longer make sense) and straight up backpropagate the value of that terminal node.
From the paper you linked, this is not clear from page 6, but it is clear in Algorithm 2 on page 9. In that pseudocode, the TreePolicy() function will end up returning a terminal node v. When the state of this node is then passed into the DefaultPolicy() function, that function will directly return the reward (the condition of that function's while-loop will never be satisfied).
It also makes sense that this is what you'd want to do if you have a good intuitive understanding of the algorithm, and want it to be able to guarantee optimal estimates of values given an infinite amount of processing time. With an infinite amount of processing time (infinite number of simulations), you'll want to backup values from the ''best'' terminal states infinitely often, so that the averaged values from backups in nodes closer to the root also converge to those best leaf node values in the limit.

Related

Is best first search optimal and complete?

I have some doubts regarding best first search algorithm. The pseudocode that I have is the following:
best first search pseudocode
First doubt: is it complete? I have read that it is not because it can enter in a dead end, but I don't know when can happen, because if the algorithm chooses a node that has not more neighbours it does not get stucked in it because this node is remove from the open list and in the next iteration the following node of the open list is treated and the search continues.
Second doubt: is it optimal? I thought that if it is visiting the nodes closer to the goal along the search process, then the solution would be the shortest, but it is not in that way and I do not know the reason for that and therefore, the reason that makes this algorithm not optimal.
The heuristic I was using is the straight line distance between two points.
Thanks for your help!!
Of course, if heuristic function underestimates the costs, best first search is not optimal. In fact, even if your heuristic function is exactly right, best first search is never guaranteed to be optimal. Here is a counter example. Consider the following graph:
The green numbers are the actual costs and the red numbers are the exact heuristic function. Let's try to find a path from node S to node G.
Best first search would give you S->A->G following the heuristic function. However, if you look at the graph closer, you would see that the path S->B->C->G has lower cost of 5 instead of 6. Thus, this is an example of best first search performing suboptimal under perfect heuristic function.
In general case best first search algorithm is complete as in worst case scenario it will search the whole space (worst option). Now, it should be also optimal - given the heuristic function is admissible - meaning it does not overestimate the cost of the path from any of the nodes to goal. (It also needs to be consistent - that means that it adheres to triangle inequality, if it is not then the algorithm would not be complete - as it could enter a cycle)
Checking your algorithm I do not see how the heuristic function is calculated. Also I do not see there is calculated the cost of the path to get to the particular node.
So, it needs to calculate the actual cost of the path to reach a particular node and then it needs to add a heuristics estimate of the cost of the path from the node towards goal.
The formula is f(n)=g(n)+h(n) where g(n) is the cost of the path to reach the node and h(n) is the heuristics estimating the cost of the cheapest path from n to the goal.
Check the implementation of A* algorithm which is an example of best first search on path planning.
TLDR In best first search, you need to calculate the cost of a node as a sum of the cost of the path to get to that node and the heuristic function that estimate the cost of the path from that node to the goal. If the heuristic function will be admissible and consistent the algorithm will be optimal and complete.

Monte Carlo tree search - handling game ending nodes

I have implemented a MCTS for a 4 player game which is working well, but I'm not sure I understand expansion when the game ending move is in the actual Tree rather than in the rollout.
At the start the game game winning/losing positions are only found in the rollout and I understand how to score these and propagate them back up the tree. But as the game progresses, I eventually find a leaf node, chosen by UCB1 that cannot be expanded as it is a losing position with no possible move allowed, so there is nothing to expand, nor is there a game to 'rollout'. At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However when I look at the visit stats this node gets revisited thousands of time, so obviously UCB1 'chooses' to visit this node many times, but really this is a bit of a waste, should I be back-propagating something other than a single win for these 'always win' nodes?
I've had a good Google search for this and cant really find much mention of it, so am I misunderstanding something or missing something obvious, none of the 'standard' MCTS tutorials/algorithms even mention game ending nodes in the tree as special cases, so I'm worried I've misunderstood something fundamental.
At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However when I look at the visit stats this node gets revisited thousands of time, so obviously UCB1 'chooses' to visit this node many times, but really this is a bit of a waste, should I be back-propagating something other than a single win for these 'always win' nodes?
No, what you're currently already doing is correct.
MCTS essentially evaluates the value of a node as the average of the outcomes of all paths you have run through that node. In reality, we are generally interested in minimax-style evaluations.
For MCTS' average-based evaluations to become equal to minimax-evaluations in the limit (after an infinite amount of time), we rely on the Selection phase (e.g. UCB1) to send so many simulations (= Selection + Play-out phases) down the path(s) that would be optimal according to minimax evaluations that the average evaluations also tend, in the limit, to the minimax evaluations.
Suppose, for example, that there is a winning node directly below the root node. This is an extreme example of your situation, where the terminal node is already reached in the Selection phase, and no Play-out is required afterwards. The minimax evaluation of the root node would be a win, since we can directly get to a win in one step. This means we want the average-based scoring of MCTS to also become very close to a winning evaluation for the root node. This means that we want the Selection phase to send the vast majority of simulations immediately down into this node. If e.g. 99% of all simulations immediately go to this winning node from the root node, the average evaluation of the root node will also become very close to a win, and that's exactly what we need.
This answer is only about the implementation of basic UCT (MCTS with UCB1 for Selection). For more sophisticated modifications to that basic MCTS implementation related to the question, see manlio's answer
none of the 'standard' MCTS tutorials/algorithms even mention game ending nodes in the tree as special cases
There are MCTS variants able to prove the game theoretical value of a position.
MCTS-Solver is (quite) well known: the backpropagation and selection steps are modified for this variant, as well as the procedure for choosing the final move
to play.
Terminal win and loss positions occurring in the tree are handled differently and a special provision is taken when backing such proven values up the tree.
You can take a look at:
Monte-Carlo Tree Search Solver by Mark H. M. Winands, Yngvi Björnsson, Jahn Takeshi Saito (part of the Lecture Notes in Computer Science book series volume 5131)
for details.
so I'm worried I've misunderstood something fundamental.
Although in the long run MCTS equipped with the UCT formula is able to converge to the game-theoretical value, basic MCTS is unable to prove the game-theoretical value.

Number of simulation per node in Monte Carlo tree search

In the mcts algorithm described in Wikipedia, it performs exactly one playout(simulation) in each node selection. Now, I am experimenting this algorithm in a simple connect-k game. I wonder, in practice, do we perform more playouts to reduce the variance?
I tried the original algorithm with exactly one random playout (non-biased). The result is bad compared to my heuristic search with alpha-beta pruning. It converges very slowly. When I perform 500 playouts instead, the noise is a lot less. However, each node simulation is too slow for the algorithm to explore other parts of the tree in the given time hence missing the most critical move sometimes.
I then added the AMAF (in particular with RAVE transition) heuristic to the basic MCTS. I don't notice too much difference with 500 playouts perhaps because the variance is already low. I haven't analyzed the result with 1 playout yet.
Could anyone give me any insights?
Typically, you'd do exactly one play-out per selection step. However, subsequent selection steps can go through the same node multiple times.
Consider, for example, a case where there are only two moves available in the root node. If you then run, let's say, 10,000 complete iterations of MCTS (where one iteration = Selection + Expansion + Play-out + Backpropagation), each of the two nodes below the root node will get selected roughly 5,000 times (or maybe one gets selected 9,000 times and the other 1,000 times if the first is clearly a better option than the seocnd, but still, both get selected more than once).
Does this match what you are currently doing in your implementation? If not, try providing some code that you currently have so that we can see where it goes wrong. But if this is how you implemented it (which is how it should be), then there should be no problems with doing only one play-out per selection step

How to apply Iterative Deepening Depth First Search(IDDFS) on Graphs

I tried to apply IDDFS on this graph by first making it in tree form and the result was this :
At level 1: d,e,p
At level 2: d,b,e,c,e,h,r,p,q
At level 3: d,b,a,e,h,c,a,e,h,q,p,r,f,p,q
At level 4: d,b,a,e,h,p,q,c,a,e,h,q,p,q,r,f,c,GOAL
I am confused about those repeated nodes in the path, can we eliminate them or they will appear in the final path ?
Is this the correct approach of traversing the graph to reach the GOAL ? And how we come to know which node to visit next in graph(e.g as in tree we start from left to right).
And what would be the path if we apply DFS and BFS on same graph ?
Will there be any difference in DFS result and IDDFS ? It seems to be similar
Yes, you can and SHOULD get rid of repeated nodes when you implement DFS, by keeping track of which nodes you've already visited. If you don't, your code won't terminate when it finds a cycle. Don't forget to clear the set of visited nodes with each new level. So leave out the visited nodes from your listing, unless it's important to include nodes that are being considered but not re-visited.
If you write out the expansion for BFS and DFS, you'll see that IDDFS starts out looking like BFS and ends up looking more and more like DFS the more you crank up the level. When level = length-of-longest-path, voila, you get DFS, which is not surprising, since IDDFS is DFS, only with paths cut off at a given number; in that particular case, the number has no effect, because there aren't any paths long enough to be cut off.
The order in a graph is not well defined. You choose one order or another yourself. If you choose a next node at random, you get non-deterministic algorithms. If you choose them, dunno, alphabetically, then you get some determinism. Sometimes the distinction doesn't matter, but determinism is good for debugging your code, etc. Now, when you do this exercise, you do it to see patterns, so it's best to leave out the randomness.
Your question really does look like homework. ;)

Find linearity in a graph

I am doing automation for a project and the results I get is in the form of a graph wherein I take the performance results.
Now the performance results which I take is generally at a straight line from the graph.
For example lets say the results from the graph in a List could be like this:
10, 30,90,100, 150,200,250,300,350,400,450,800,1000,1500,2000,2010,2006,2004,2000,1900,1800,1700, 1600,1000,500,400,0.
As you see the performance of the device starts increasing and then at a certain point it remains linear and with failures it starts dropping.
The point I want to take is the linear line.
As you can see in the list of numbers we see that from (2000,2010,2006,2004,2000) there is some kind of a linear line.
I am not asking for any code or Algorithm to solve this....I do not need an answer. If anyone can just give me a hint or a little clue I will try to do the rest.
Do you mean constant or linear?
If you mean linear:
Why not take the differences of adjacent values and search for a sequence that stays close to constant?
If you mean constant:
Why not take the differences of adjacent values and search for a sequence that stays close to 0?
First decide on the absolute or relative tolerance you can handle, that decides what is a straight line.
Then iterate trough the array checking the value of a point with the next point, if they are within tolerance, continue iterating until you get a point that is not and store those points. They represent a straight line.
This solution is very simple, not perfect and takes O(n) time.

Resources