Number of simulations per node in Monte Carlo tree search

In the MCTS algorithm described on Wikipedia, exactly one playout (simulation) is performed for each node selection. I am now experimenting with this algorithm in a simple connect-k game. In practice, do we perform more playouts to reduce the variance?
I tried the original algorithm with exactly one random (unbiased) playout. The result is bad compared to my heuristic search with alpha-beta pruning: it converges very slowly. When I perform 500 playouts instead, there is a lot less noise. However, each node simulation is then too slow for the algorithm to explore other parts of the tree in the given time, so it sometimes misses the most critical move.
I then added the AMAF heuristic (in particular, with the RAVE transition) to the basic MCTS. I don't notice much difference with 500 playouts, perhaps because the variance is already low. I haven't analyzed the result with 1 playout yet.
Could anyone give me any insights?

Typically, you'd do exactly one play-out per selection step. However, subsequent selection steps can go through the same node multiple times.
Consider, for example, a case where there are only two moves available in the root node. If you then run, let's say, 10,000 complete iterations of MCTS (where one iteration = Selection + Expansion + Play-out + Backpropagation), each of the two nodes below the root node will get selected roughly 5,000 times (or maybe one gets selected 9,000 times and the other 1,000 times if the first is clearly a better option than the second, but still, both get selected more than once).
Does this match what you are currently doing in your implementation? If not, try providing some code that you currently have so that we can see where it goes wrong. But if this is how you implemented it (which is how it should be), then there should be no problems with doing only one play-out per selection step.
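To make the two-move example concrete, here is a minimal sketch. It is not a full MCTS: the root's children are treated as leaves, and each play-out is modelled as a biased coin flip. The function name run_root_uct, the win probabilities, and the exploration constant are all assumptions made for illustration.

```python
import math
import random


def run_root_uct(win_probs, iterations, c=math.sqrt(2), seed=0):
    """Run MCTS-style iterations on a root whose children are leaves.

    Every iteration selects one child with UCB1, performs exactly ONE
    random play-out from it, and backpropagates the result.
    Returns (visits, wins) per child.
    """
    rng = random.Random(seed)
    visits = [0] * len(win_probs)
    wins = [0.0] * len(win_probs)

    def ucb1(i, total):
        mean = wins[i] / visits[i]
        return mean + c * math.sqrt(math.log(total) / visits[i])

    for t in range(1, iterations + 1):
        # Selection: visit every child once, then maximise UCB1.
        untried = [i for i, n in enumerate(visits) if n == 0]
        if untried:
            child = untried[0]
        else:
            child = max(range(len(visits)), key=lambda i: ucb1(i, t))
        # Play-out: one simulated game (hypothetical coin-flip model).
        reward = 1.0 if rng.random() < win_probs[child] else 0.0
        # Backpropagation: update the statistics of the selected child.
        visits[child] += 1
        wins[child] += reward
    return visits, wins


if __name__ == "__main__":
    visits, wins = run_root_uct([0.6, 0.4], iterations=10_000)
    print("visits per child:", visits)
```

With these assumed probabilities, both children are selected many times and the stronger one receives most of the visits, even though each iteration performs only a single play-out.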

Related

What does a worst case of Big-Omega(n) mean?

If Big-Omega is a lower bound, what does it mean to have a worst-case time complexity of Big-Omega(n)?
From the book "Data Structures and Algorithms in Python" by Michael T. Goodrich:
Consider a dynamic array that doubles its size when the number of elements reaches its capacity.
This is from the book:
"we fully explored the append method. In the worst case, it requires Ω(n) time because the underlying array is resized, but it uses O(1) time in the amortized sense"
The parameterized version, pop(k), removes the element that is at index k < n of a list, shifting all subsequent elements leftward to fill the gap that results from the removal. The efficiency of this operation is O(n−k), as the amount of shifting depends upon the choice of index k. Note well that this implies that pop(0) is the most expensive call, using Ω(n) time.
How does "Ω(n)" describe the most expensive time?
The expression inside the parentheses is the number of operations you must do to actually carry out the operation, always expressed as a function of the number of items you are dealing with. You never worry about just how hard those operations are, only the total number of them.
If the array is full and has to be resized you need to copy all the elements into the new array. One operation per item in the array, thus an O(n) runtime. However, most of the time you just do one operation for an O(1) runtime.
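As a rough illustration of the resize cost, the sketch below counts element copies per append for a toy doubling array. DynamicArray and append_costs are hypothetical names, not the book's code. The single most expensive append copies every existing element, so its cost grows at least linearly with n (the Ω(n) statement in the quote), while the total over all appends stays below 2n copies (the amortized O(1) statement).

```python
class DynamicArray:
    """Toy dynamic array that doubles its capacity when it is full."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._data = [None] * self._capacity

    def append(self, value):
        """Append value; return how many elements this call had to copy."""
        copied = 0
        if self._size == self._capacity:
            # Resize: allocate twice the space and copy every element over.
            new_data = [None] * (2 * self._capacity)
            for i in range(self._size):
                new_data[i] = self._data[i]
                copied += 1
            self._data = new_data
            self._capacity *= 2
        self._data[self._size] = value
        self._size += 1
        return copied


def append_costs(n):
    """Return the per-call copy counts for n consecutive appends."""
    arr = DynamicArray()
    return [arr.append(i) for i in range(n)]


if __name__ == "__main__":
    costs = append_costs(1025)
    print("worst single append:", max(costs))
    print("total copies:", sum(costs))
```

Most calls copy nothing; only the calls that trigger a resize are expensive, and the last of those copies all the elements stored so far.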
Common values are:
O(1): One operation only, such as adding it to the list when the list isn't full.
O(log n): This typically occurs when you have a binary search or the like to find your target. Note that the base of the log isn't specified as the difference is just a constant and you always ignore constants.
O(n): One operation per item in your dataset. For example, unsorted search.
O(n log n): Commonly seen in good sort routines where you have to process every item but can divide and conquer as you go.
O(n^2): Usually encountered when you must consider every interaction of two items in your dataset and have no way to organize it. For example a routine I wrote long ago to find near-duplicate pictures. (Exact duplicates would be handled by making a dictionary of hashes and testing whether the hash existed and thus be O(n)--the two passes is a constant and discarded, you wouldn't say O(2n).)
O(n^3): By the time you're getting this high you consider it very carefully. Now you're looking at three-way interactions of items in your dataset.
Higher orders can exist but you need to consider carefully what it's going to do. I have shipped production code that was O(n^8) but with very heavy pruning of paths and even then it took 12 hours to run. Had the nature of the data not been conducive to such pruning I wouldn't have written it at all--the code would still be running.
You will occasionally encounter even nastier stuff which needs careful consideration of whether it's going to be tolerable or not. For large datasets they're impossible:
O(2^n): Real world example: Attempting to prune paths so as to retain a minimum spanning tree--I computed all possible trees and kept the cheapest. Several experiments showed n never going above 10, I thought I was ok--until a different seed produced n = 22. I rewrote the routine for not-always-perfect answer that was O(n^2) instead.
O(n!): I don't know any examples. It blows up horribly fast.

Monte Carlo tree search - handling game ending nodes

I have implemented an MCTS for a 4-player game which is working well, but I'm not sure I understand expansion when the game-ending move is in the actual tree rather than in the rollout.
At the start of the game, winning/losing positions are only found in the rollout, and I understand how to score these and propagate them back up the tree. But as the game progresses, I eventually find a leaf node, chosen by UCB1, that cannot be expanded as it is a losing position with no possible move allowed, so there is nothing to expand, nor is there a game to 'rollout'. At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However when I look at the visit stats this node gets revisited thousands of times, so obviously UCB1 'chooses' to visit this node many times, but really this is a bit of a waste, should I be back-propagating something other than a single win for these 'always win' nodes?
I've had a good Google search for this and can't really find much mention of it, so am I misunderstanding something or missing something obvious? None of the 'standard' MCTS tutorials/algorithms even mention game ending nodes in the tree as special cases, so I'm worried I've misunderstood something fundamental.
At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However when I look at the visit stats this node gets revisited thousands of times, so obviously UCB1 'chooses' to visit this node many times, but really this is a bit of a waste, should I be back-propagating something other than a single win for these 'always win' nodes?
No, what you're currently already doing is correct.
MCTS essentially evaluates the value of a node as the average of the outcomes of all paths you have run through that node. In reality, we are generally interested in minimax-style evaluations.
For MCTS' average-based evaluations to become equal to minimax-evaluations in the limit (after an infinite amount of time), we rely on the Selection phase (e.g. UCB1) to send so many simulations (= Selection + Play-out phases) down the path(s) that would be optimal according to minimax evaluations that the average evaluations also tend, in the limit, to the minimax evaluations.
Suppose, for example, that there is a winning node directly below the root node. This is an extreme example of your situation, where the terminal node is already reached in the Selection phase, and no Play-out is required afterwards. The minimax evaluation of the root node would be a win, since we can directly get to a win in one step. This means we want the average-based scoring of MCTS to also become very close to a winning evaluation for the root node. This means that we want the Selection phase to send the vast majority of simulations immediately down into this node. If e.g. 99% of all simulations immediately go to this winning node from the root node, the average evaluation of the root node will also become very close to a win, and that's exactly what we need.
This answer is only about the implementation of basic UCT (MCTS with UCB1 for Selection). For more sophisticated modifications to that basic MCTS implementation related to the question, see manlio's answer.
none of the 'standard' MCTS tutorials/algorithms even mention game ending nodes in the tree as special cases
There are MCTS variants able to prove the game theoretical value of a position.
MCTS-Solver is (quite) well known: the backpropagation and selection steps are modified for this variant, as well as the procedure for choosing the final move to play.
Terminal win and loss positions occurring in the tree are handled differently and a special provision is taken when backing such proven values up the tree.
You can take a look at:
Monte-Carlo Tree Search Solver by Mark H. M. Winands, Yngvi Björnsson, Jahn Takeshi Saito (part of the Lecture Notes in Computer Science book series volume 5131)
for details.
so I'm worried I've misunderstood something fundamental.
Although in the long run MCTS equipped with the UCT formula is able to converge to the game-theoretical value, basic MCTS is unable to prove the game-theoretical value.

How to handle terminal nodes in Monte Carlo Tree Search?

When my tree has gotten deep enough that terminal nodes are starting to be selected, I would have assumed that I should just perform a zero-move "playout" from it and back-propagate the results, but the IEEE survey of MCTS methods indicates that the selection step should be finding the "most urgent expandable node" and I can't find any counterexamples elsewhere. Am I supposed to be excluding them somehow? What's the right thing to do here?
If you actually reach a terminal node in the selection phase, you'd kind of skip expansion and play-out (they'd no longer make sense) and straight up backpropagate the value of that terminal node.
In the paper you linked, this is not clear on page 6, but it is clear in Algorithm 2 on page 9. In that pseudocode, the TreePolicy() function will end up returning a terminal node v. When the state of this node is then passed into the DefaultPolicy() function, that function will directly return the reward (the condition of that function's while-loop will never be satisfied).
It also makes sense that this is what you'd want to do if you have a good intuitive understanding of the algorithm, and want it to be able to guarantee optimal estimates of values given an infinite amount of processing time. With an infinite amount of processing time (infinite number of simulations), you'll want to back up values from the "best" terminal states infinitely often, so that the averaged values from backups in nodes closer to the root also converge to those best leaf node values in the limit.
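The behaviour described above can be sketched as follows. This is a minimal UCT loop loosely shaped like the TreePolicy/DefaultPolicy split, not a transcription of the survey's pseudocode. The game is an assumed toy: one pile of stones, each player takes 1 or 2, and whoever takes the last stone wins. All names (GameNode, tree_policy, default_policy, uct_search) are hypothetical.

```python
import math
import random


class GameNode:
    def __init__(self, stones, player, parent=None):
        self.stones = stones      # stones left in the pile
        self.player = player      # player to move here (0 or 1)
        self.parent = parent
        self.children = {}        # move -> GameNode
        self.visits = 0
        self.value = 0.0          # wins for the player who moved INTO this node

    def moves(self):
        return [m for m in (1, 2) if m <= self.stones]

    def is_terminal(self):
        return self.stones == 0


def tree_policy(node, c):
    """Descend until a new child is expanded or a terminal node is reached."""
    while not node.is_terminal():
        untried = [m for m in node.moves() if m not in node.children]
        if untried:
            move = untried[0]
            child = GameNode(node.stones - move, 1 - node.player, parent=node)
            node.children[move] = child
            return child
        parent = node
        node = max(
            parent.children.values(),
            key=lambda ch: ch.value / ch.visits
            + c * math.sqrt(math.log(parent.visits) / ch.visits),
        )
    return node  # terminal: nothing to expand


def default_policy(stones, player, rng):
    """Random play-out. For a terminal state the loop body never runs."""
    while stones > 0:
        stones -= rng.choice([m for m in (1, 2) if m <= stones])
        player = 1 - player
    return 1 - player  # the player who took the last stone wins


def uct_search(stones, iterations, c=1.4, seed=0):
    rng = random.Random(seed)
    root = GameNode(stones, player=0)
    for _ in range(iterations):
        leaf = tree_policy(root, c)
        winner = default_policy(leaf.stones, leaf.player, rng)
        node = leaf
        while node is not None:  # backpropagation
            node.visits += 1
            if winner == 1 - node.player:
                node.value += 1.0
            node = node.parent
    return root


if __name__ == "__main__":
    root = uct_search(stones=2, iterations=1000)
    for move, child in sorted(root.children.items()):
        print("move", move, "visits", child.visits, "wins", child.value)
```

With 2 stones, taking both is an immediate win, so that child is terminal and is reached in the selection phase. It still gets revisited on most iterations, and each visit simply backpropagates its known result, which is the intended behaviour for basic UCT.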

What to Do when Monte Carlo Tree Search Hits Memory Limit

I have recently taken an interest in Monte Carlo tree search applied to games.
I have read several papers, but I mainly use "Monte-Carlo Tree Search", a PhD thesis by G. Chaslot, as I find it makes the basics of Monte Carlo tree search easier to understand.
I have tried to code it and got stuck on one problem. The algorithm expands one node into the game tree for every simulation. This quickly becomes a memory problem. I have skimmed the thesis, but it doesn't seem to explain what the technique should do when it hits a memory limit.
Can you suggest what the technique should do when it hits a memory limit?
You can see the thesis here:
http://www.unimaas.nl/games/files/phd/Chaslot_thesis.pdf
One very effective approach is to grow the tree more slowly. That is, instead of expanding the tree every time you reach a leaf node, you expand it once it has at least k visits. This will significantly slow the growth of the tree, and often does not reduce performance. I was told by one of the authors of the Fuego Go program that he tried the approach, and it worked well in practice.
This idea was originally described in this paper:
Remi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In Computers and games, pages 72–83. Springer, 2007.
It was also used in:
Max Roschke and Nathan Sturtevant. UCT Enhancements in Chinese Checkers Using an Endgame Database, IJCAI Workshop on Computer Games, 2013.
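The delayed-expansion idea can be sketched as follows. This is an illustration under several assumptions, not code from the cited papers: the "game" is a uniform tree with a fixed branching factor, a play-out is replaced by a random reward, and a node expands all of its children at once. The names TreeNode, count_nodes, and search are hypothetical; min_visits plays the role of k.

```python
import math
import random


class TreeNode:
    def __init__(self, depth):
        self.depth = depth
        self.children = []
        self.visits = 0
        self.total = 0.0


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)


def ucb1(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def search(iterations, min_visits, branching=3, max_depth=8, seed=0):
    rng = random.Random(seed)
    root = TreeNode(0)
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: follow UCB1 down to a leaf of the stored tree.
        while node.children:
            node = max(node.children, key=lambda ch, p=node: ucb1(p, ch))
            path.append(node)
        # Delayed expansion: grow the tree only once the leaf has
        # already been visited at least min_visits times.
        if node.depth < max_depth and node.visits >= min_visits:
            node.children = [TreeNode(node.depth + 1) for _ in range(branching)]
            node = rng.choice(node.children)
            path.append(node)
        reward = rng.random()  # stand-in for a real play-out result
        for visited in path:   # backpropagation
            visited.visits += 1
            visited.total += reward
    return root


if __name__ == "__main__":
    for k in (0, 5, 20):
        print("k =", k, "nodes stored:", count_nodes(search(2000, k)))
```

For the same number of iterations, a larger threshold stores far fewer nodes. Whether playing strength suffers depends on the game, so the threshold is something to tune experimentally.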
The paper Memory Bounded Monte Carlo Tree Search evaluates a variety of solutions for this problem:
Stopping: you stop the algorithm when you hit your memory limit
Stunting: you stop growing the tree when you hit your memory limit (but keep updating it)
Ensemble: you keep your result and restart the search from an empty tree when you hit your memory limit (fusing the results at the end)
Flattening: when you hit your memory limit you get rid of all the nodes except the root and its direct children and restart the search from this new basis
Garbage collection: when you hit your memory limit, you remove all nodes that have not been visited a given number of times
Recycling: when you add a node, you delete the node that has not been visited for the longest time
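As one example from the list above, garbage collection might look roughly like the sketch below. It is not taken from the paper: StatNode, tree_size, and garbage_collect are hypothetical names, and the threshold is an assumed parameter. Since a child in MCTS is never visited more often than its parent, dropping a whole subtree whose top node is below the threshold removes only nodes that are themselves below it.

```python
class StatNode:
    """Bare tree node holding only what pruning needs."""

    def __init__(self, visits=0):
        self.visits = visits
        self.children = []


def tree_size(node):
    return 1 + sum(tree_size(child) for child in node.children)


def garbage_collect(root, min_visits):
    """Drop every subtree whose top node has fewer than min_visits visits.

    The root is always kept. Returns the number of nodes removed.
    """
    removed = 0
    stack = [root]
    while stack:
        node = stack.pop()
        kept = []
        for child in node.children:
            if child.visits < min_visits:
                removed += tree_size(child)  # whole subtree is discarded
            else:
                kept.append(child)
                stack.append(child)
        node.children = kept
    return removed


if __name__ == "__main__":
    root = StatNode(100)
    a, b = StatNode(60), StatNode(3)
    root.children = [a, b]
    a.children = [StatNode(40), StatNode(2)]
    b.children = [StatNode(1)]
    print("removed:", garbage_collect(root, min_visits=5))
    print("remaining:", tree_size(root))
```

After pruning, a node that lost children has to be treated as expandable again, otherwise the search can never re-explore the moves that were dropped.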
You can throw away all nodes whose number of visits is smaller than some threshold and which were not visited recently (measured in playouts since the last visit).
That's a quick but not very efficient solution.
It's better to implement progressive widening too.

Genetic Algorithm Sudoku - optimizing mutation

I am in the process of writing a genetic algorithm to solve Sudoku puzzles and was hoping for some input. The algorithm solves puzzles occasionally (about 1 out of 10 times on the same puzzle with max 1,000,000 iterations) and I am trying to get a little input about mutation rates, repopulation, and splicing. Any input is greatly appreciated as this is brand new to me and I feel like I am not doing things 100% correctly.
A quick overview of the algorithm
Fitness Function
Counts the number of unique values among the numbers 1 through 9 in each column, row, and 3*3 sub box. For each of these subsets, the count of unique values is divided by 9, resulting in a floating-point value between 0 and 1. The sum of these values is divided by 27, providing a total fitness value ranging between 0 and 1. 1 indicates a solved puzzle.
Population Size:
100
Selection:
Roulette Method. Each node is randomly selected, where nodes containing higher fitness values have a slightly better chance of selection.
Reproduction:
Two randomly selected chromosomes/boards swap a randomly selected subset (row, column, or 3*3 subset). The selection of the subset (which row, column, or box) is random. The resulting boards are introduced into the population.
Reproduction Rate: 12% of population per cycle
There are six reproductions per iteration resulting in 12 new chromosomes per cycle of the algorithm.
Mutation: mutation occurs at a rate of 2 percent of population after 10 iterations of no improvement of highest fitness.
Listed below are the three mutation methods which have varying weights of selection probability.
1: Swap randomly selected numbers. The method selects two random numbers and swaps them throughout the board. This method seems to have the greatest impact on growth early in the algorithm's growth pattern. 25% chance of selection
2: Introduce random changes: Randomly select two cells and change their values. This method seems to help keep the algorithm from converging. 65% chance of selection
3: Count the number of each value in the board. A solved board contains a count of 9 of each number between 1 and 9. This method takes any number that occurs less than 9 times and randomly swaps it with a number that occurs more than 9 times. This seems to have a positive impact on the algorithm but is only used sparingly. 10% chance of selection
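For reference, the fitness function described in the overview can be sketched like this. The board representation (a 9x9 list of lists of integers) and the name sudoku_fitness are assumptions; the asker's actual code is not shown.

```python
def sudoku_fitness(board):
    """Return a fitness in [0, 1] for a 9x9 board; 1.0 means solved.

    For each of the 27 units (9 rows, 9 columns, 9 boxes) the number of
    distinct values in 1..9 is divided by 9; the 27 results are averaged.
    """
    units = [list(row) for row in board]                               # rows
    units += [[board[r][c] for r in range(9)] for c in range(9)]       # columns
    for box_r in range(0, 9, 3):                                       # 3x3 boxes
        for box_c in range(0, 9, 3):
            units.append(
                [board[r][c]
                 for r in range(box_r, box_r + 3)
                 for c in range(box_c, box_c + 3)]
            )
    scores = [len({v for v in unit if 1 <= v <= 9}) / 9 for unit in units]
    return sum(scores) / 27


if __name__ == "__main__":
    # A valid solved grid built from a standard shifting pattern.
    solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
              for r in range(9)]
    print(sudoku_fitness(solved))
```

Note that this measure gives every unit the same weight, so a board can score close to 1 while still needing several coordinated changes to become valid.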
My main question is at what rate should I apply the mutation method. It seems that as I increase mutation I have faster initial results. However as the result approaches a correct result, I think the higher rate of change is introducing too many bad chromosomes and genes into the population. However, with the lower rate of change the algorithm seems to converge too early.
One last question is whether there is a better approach to mutation.
You can anneal the mutation rate over time to get the sort of convergence behavior you're describing. But I actually think there are probably bigger gains to be had by modifying other parts of your algorithm.
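One simple way to anneal the mutation rate is a geometric decay from a starting rate to a final rate. This is only a sketch: the function name annealed_mutation_rate and the start and end values are assumptions, and the right schedule has to be found by experiment.

```python
def annealed_mutation_rate(generation, total_generations, start=0.10, end=0.01):
    """Decay the mutation rate geometrically from `start` to `end`.

    The default rates are placeholder values, not recommendations.
    """
    progress = min(max(generation / total_generations, 0.0), 1.0)
    return start * (end / start) ** progress


if __name__ == "__main__":
    for gen in (0, 250, 500, 750, 1000):
        print(gen, round(annealed_mutation_rate(gen, 1000), 4))
```

Early generations then mutate heavily, which helps exploration, while later generations make fewer disruptive changes to nearly solved boards.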
Roulette wheel selection applies a very high degree of selection pressure in general. It tends to cause a pretty rapid loss of diversity fairly early in the process. Binary tournament selection is usually a better place to start experimenting. It's a more gradual form of pressure, and just as importantly, it's much better controlled.
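Binary tournament selection is short enough to show in full. In this sketch the name tournament_select and the parallel population/fitnesses lists are assumptions about how the data is stored.

```python
import random


def tournament_select(population, fitnesses, rng, k=2):
    """Draw k distinct individuals uniformly at random; return the fittest.

    k=2 gives binary tournament selection; a larger k raises the pressure.
    """
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]


if __name__ == "__main__":
    rng = random.Random(0)
    population = ["a", "b", "c", "d"]
    fitnesses = [0.1, 0.9, 0.5, 0.3]
    picks = [tournament_select(population, fitnesses, rng) for _ in range(1000)]
    print({name: picks.count(name) for name in population})
```

Selection depends only on the rank of the two contenders, not on the size of the fitness gap, which is why the pressure is easier to control than with a roulette wheel.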
With a less aggressive selection mechanism, you can afford to produce more offspring, since you don't have to worry about producing so many near-copies of the best one or two individuals. Rather than 12% of the population producing offspring (possibly fewer because of repetition of parents in the mating pool), I'd go with 100%. You don't necessarily need to literally make sure every parent participates, but just generate the same number of offspring as you have parents.
Some form of mild elitism will probably then be helpful so that you don't lose good parents. Maybe keep the best 2-5 individuals from the parent population if they're better than the worst 2-5 offspring.
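A mild form of elitism along those lines could look like the following sketch. The function name apply_elitism and the pairing rule (best parent against worst offspring, second best against second worst, and so on) are assumptions about one reasonable way to do it.

```python
def apply_elitism(parents, offspring, fitness, n_elite=2):
    """Build the next generation from the offspring, keeping elite parents.

    The n_elite best parents replace the n_elite worst offspring, but
    only where the parent is strictly fitter than the offspring it replaces.
    """
    elite = sorted(parents, key=fitness, reverse=True)[:n_elite]
    next_gen = sorted(offspring, key=fitness)  # worst offspring first
    for i, parent in enumerate(elite):
        if i < len(next_gen) and fitness(parent) > fitness(next_gen[i]):
            next_gen[i] = parent
    return next_gen


if __name__ == "__main__":
    # Toy individuals whose fitness is simply their own value.
    print(apply_elitism([9, 1, 8], [2, 7, 3], fitness=lambda x: x))
```

The population size stays constant, and the best fitness seen so far can never decrease from one generation to the next.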
With elitism, you can use a bit higher mutation rate. All three of your operators seem useful. (Note that #3 is actually a form of local search embedded in your genetic algorithm. That's often a huge win in terms of performance. You could in fact extend #3 into a much more sophisticated method that looped until it couldn't figure out how to make any further improvements.)
I don't see an obvious better/worse set of weights for your three mutation operators. I think at that point, you're firmly within the realm of experimental parameter tuning. Another idea is to inject a bit of knowledge into the process and, for example, say that early on in the process, you choose between them randomly. Later, as the algorithm is converging, favor the mutation operators that you think are more likely to help finish "almost-solved" boards.
I once made a fairly competent Sudoku solver using a GA. I blogged about the details (including different representations and mutation operators) here:
http://fakeguido.blogspot.com/2010/05/solving-sudoku-with-genetic-algorithms.html
