Monte Carlo tree search - handling game ending nodes - artificial-intelligence

I have implemented an MCTS for a 4 player game which is working well, but I'm not sure I understand expansion when the game ending move is in the actual Tree rather than in the rollout.
At the start of the game, winning/losing positions are only found in the rollout, and I understand how to score these and propagate them back up the tree. But as the game progresses, I eventually find a leaf node, chosen by UCB1, that cannot be expanded as it is a losing position with no possible move allowed, so there is nothing to expand, nor is there a game to 'rollout'. At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However, when I look at the visit stats this node gets revisited thousands of times, so obviously UCB1 'chooses' to visit this node many times, but really this is a bit of a waste. Should I be back-propagating something other than a single win for these 'always win' nodes?
I've had a good Google search for this and can't really find much mention of it, so am I misunderstanding something or missing something obvious? None of the 'standard' MCTS tutorials/algorithms even mention game ending nodes in the tree as special cases, so I'm worried I've misunderstood something fundamental.

At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However, when I look at the visit stats this node gets revisited thousands of times, so obviously UCB1 'chooses' to visit this node many times, but really this is a bit of a waste. Should I be back-propagating something other than a single win for these 'always win' nodes?
No, what you're currently already doing is correct.
MCTS essentially evaluates the value of a node as the average of the outcomes of all paths you have run through that node. In reality, we are generally interested in minimax-style evaluations.
For MCTS' average-based evaluations to become equal to minimax-evaluations in the limit (after an infinite amount of time), we rely on the Selection phase (e.g. UCB1) to send so many simulations (= Selection + Play-out phases) down the path(s) that would be optimal according to minimax evaluations that the average evaluations also tend, in the limit, to the minimax evaluations.
Suppose, for example, that there is a winning node directly below the root node. This is an extreme example of your situation, where the terminal node is already reached in the Selection phase, and no Play-out is required afterwards. The minimax evaluation of the root node would be a win, since we can directly get to a win in one step. This means we want the average-based scoring of MCTS to also become very close to a winning evaluation for the root node. This means that we want the Selection phase to send the vast majority of simulations immediately down into this node. If e.g. 99% of all simulations immediately go to this winning node from the root node, the average evaluation of the root node will also become very close to a win, and that's exactly what we need.
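To make the root example above concrete, here is a minimal sketch of UCB1 choosing between two children of the root. The setup is an assumption for illustration only: child 0 is a terminal win (reward always 1.0) and child 1 is a hypothetical noisy move that wins 40% of its play-outs. The names `ucb1` and `simulate_root` are made up for this sketch.

```python
import math
import random


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score of a child; unvisited children are tried first."""
    if visits == 0:
        return math.inf
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def simulate_root(iterations, seed=0):
    """Run UCB1 selection at a root with two children.

    Child 0 is a terminal win (assumed reward 1.0 every time).
    Child 1 is a hypothetical noisy move winning 40% of its play-outs.
    Returns the visit counts and the root's average value.
    """
    rng = random.Random(seed)
    rewards = [0.0, 0.0]
    visits = [0, 0]
    for n in range(1, iterations + 1):
        scores = [ucb1(rewards[i], visits[i], n) for i in range(2)]
        i = scores.index(max(scores))
        if i == 0:
            reward = 1.0  # terminal node: no play-out needed
        else:
            reward = 1.0 if rng.random() < 0.4 else 0.0
        rewards[i] += reward
        visits[i] += 1
    return visits, sum(rewards) / iterations
```

Calling `simulate_root(10000)` should show the terminal child absorbing the vast majority of the visits, which is what pulls the root's average towards a win. Revisiting the terminal node is therefore not wasted effort in plain UCT.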
This answer is only about the implementation of basic UCT (MCTS with UCB1 for Selection). For more sophisticated modifications to that basic MCTS implementation related to the question, see manlio's answer.

none of the 'standard' MCTS tutorials/algorithms even mention game ending nodes in the tree as special cases
There are MCTS variants able to prove the game theoretical value of a position.
MCTS-Solver is (quite) well known: the backpropagation and selection steps are modified for this variant, as well as the procedure for choosing the final move to play.
Terminal win and loss positions occurring in the tree are handled differently and a special provision is taken when backing such proven values up the tree.
You can take a look at:
Monte-Carlo Tree Search Solver by Mark H. M. Winands, Yngvi Björnsson, Jahn Takeshi Saito (part of the Lecture Notes in Computer Science book series volume 5131)
for details.
so I'm worried I've misunderstood something fundamental.
Although in the long run MCTS equipped with the UCT formula is able to converge to the game-theoretical value, basic MCTS is unable to prove the game-theoretical value.
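As a rough illustration of what "backing proven values up the tree" can mean, here is a sketch of one common rule for a two-player, alternating-move game. It is not taken from the paper: the `Node` class, the `proven` field and `update_proven` are hypothetical names, and a 4 player game like the one in the question would need a different rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WIN, LOSS = "win", "loss"


@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    fully_expanded: bool = False
    # Proven result for the player to move at this node, or None if unknown.
    proven: Optional[str] = None


def update_proven(node):
    """Derive a proven value for `node` from its children (two-player assumption).

    - If any child is a proven LOSS for the opponent (who moves there),
      the player to move here can pick it, so this node is a proven WIN.
    - If every possible child is a proven WIN for the opponent,
      this node is a proven LOSS.
    """
    if not node.children:
        return node.proven
    if any(child.proven == LOSS for child in node.children):
        node.proven = WIN
    elif node.fully_expanded and all(child.proven == WIN for child in node.children):
        node.proven = LOSS
    return node.proven
```

Once a node is proven, a solver-style search no longer needs to send simulations through it, which is how such variants avoid the thousands of repeat visits described in the question.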


How to handle terminal nodes in Monte Carlo Tree Search?

When my tree has gotten deep enough that terminal nodes are starting to be selected, I would have assumed that I should just perform a zero-move "playout" from it and back-propagate the results, but the IEEE survey of MCTS methods indicates that the selection step should be finding the "most urgent expandable node" and I can't find any counterexamples elsewhere. Am I supposed to be excluding them somehow? What's the right thing to do here?
If you actually reach a terminal node in the selection phase, you'd kind of skip expansion and play-out (they'd no longer make sense) and straight up backpropagate the value of that terminal node.
From the paper you linked, this is not clear from page 6, but it is clear in Algorithm 2 on page 9. In that pseudocode, the TreePolicy() function will end up returning a terminal node v. When the state of this node is then passed into the DefaultPolicy() function, that function will directly return the reward (the condition of that function's while-loop will never be satisfied).
It also makes sense that this is what you'd want to do if you have a good intuitive understanding of the algorithm, and want it to be able to guarantee optimal estimates of values given an infinite amount of processing time. With an infinite amount of processing time (infinite number of simulations), you'll want to backup values from the ''best'' terminal states infinitely often, so that the averaged values from backups in nodes closer to the root also converge to those best leaf node values in the limit.
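Here is a small sketch of that behaviour on a toy game, loosely following the TreePolicy / DefaultPolicy split. The game is an assumption chosen only to keep the code short: two players alternately take 1 or 2 stones, and whoever takes the last stone wins. All names are illustrative.

```python
import math
import random


class Node:
    def __init__(self, stones, to_move, parent=None):
        self.stones = stones
        self.to_move = to_move          # player (0 or 1) to move in this state
        self.parent = parent
        self.children = []
        self.untried = [m for m in (1, 2) if m <= stones]
        self.visits = 0
        self.wins = 0.0                 # wins for the player who moved INTO this node

    def is_terminal(self):
        return self.stones == 0


def tree_policy(node):
    """Descend with UCB1; expand one child, or return a terminal node as-is."""
    while not node.is_terminal():
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, 1 - node.to_move, parent=node)
            node.children.append(child)
            return child
        log_n = math.log(node.visits)
        node = max(node.children,
                   key=lambda ch: ch.wins / ch.visits + math.sqrt(2 * log_n / ch.visits))
    return node


def default_policy(stones, to_move, rng):
    """Random play-out; for a terminal state the loop body never runs."""
    while stones > 0:
        stones -= rng.choice([m for m in (1, 2) if m <= stones])
        to_move = 1 - to_move
    return 1 - to_move                  # winner = player who took the last stone


def backup(node, winner):
    while node is not None:
        node.visits += 1
        if winner != node.to_move:      # the player who moved into this node won
            node.wins += 1
        node = node.parent


def mcts(stones, iterations, seed=0):
    rng = random.Random(seed)
    root = Node(stones, 0)
    for _ in range(iterations):
        leaf = tree_policy(root)
        backup(leaf, default_policy(leaf.stones, leaf.to_move, rng))
    return root
```

When `tree_policy` returns a terminal node, `default_policy` falls straight through its loop and the terminal result is backed up directly, so no special case is needed. Terminal nodes on good lines end up being visited many times.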

Number of simulation per node in Monte Carlo tree search

The MCTS algorithm described on Wikipedia performs exactly one playout (simulation) per node selection. I am now experimenting with this algorithm on a simple connect-k game. I wonder, in practice, do we perform more playouts to reduce the variance?
I tried the original algorithm with exactly one random playout (non-biased). The result is bad compared to my heuristic search with alpha-beta pruning. It converges very slowly. When I perform 500 playouts instead, the noise is a lot less. However, each node simulation is too slow for the algorithm to explore other parts of the tree in the given time hence missing the most critical move sometimes.
I then added the AMAF (in particular with RAVE transition) heuristic to the basic MCTS. I don't notice too much difference with 500 playouts perhaps because the variance is already low. I haven't analyzed the result with 1 playout yet.
Could anyone give me any insights?
Typically, you'd do exactly one play-out per selection step. However, subsequent selection steps can go through the same node multiple times.
Consider, for example, a case where there are only two moves available in the root node. If you then run, let's say, 10,000 complete iterations of MCTS (where one iteration = Selection + Expansion + Play-out + Backpropagation), each of the two nodes below the root node will get selected roughly 5,000 times (or maybe one gets selected 9,000 times and the other 1,000 times if the first is clearly a better option than the second, but still, both get selected more than once).
Does this match what you are currently doing in your implementation? If not, try providing some code that you currently have so that we can see where it goes wrong. But if this is how you implemented it (which is how it should be), then there should be no problems with doing only one play-out per selection step.

minimax: what happens if min plays not optimal

The description of the minimax algorithm says that both players have to play optimally for the algorithm to be optimal. Intuitively this is understandable. But could anyone make concrete, or prove, what happens if MIN does not play optimally?
The definition of "optimal" is that you play so as to minimize the "score" (or whatever you measure) of your opponent's optimal answer, which is defined by the play that minimizes the score of your optimal answer and so forth.
Thus, by definition, if you don't play optimal, your opponent has at least one path that will give him a higher score than his best score if you played optimal.
One way to find out what is optimal is to brute force the entire game tree. For less than trivial problems you can use alpha-beta search, which guarantees optimum without needing to search the entire tree. If your tree is still too complex, you need a heuristic that estimates what the score of a "position" is and halts at a certain depth.
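As a small illustration of the difference between brute force and alpha-beta, here is a sketch on a toy game tree written as nested lists, where numbers are leaf scores from MAX's point of view. The tree shape and the optional `visited` list are assumptions made for the example.

```python
import math


def minimax(node, maximizing):
    """Brute-force minimax over a nested-list tree."""
    if isinstance(node, (int, float)):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Same result as minimax, but skips branches that cannot matter."""
    if isinstance(node, (int, float)):
        if visited is not None:
            visited.append(node)    # record which leaves were actually evaluated
        return node
    value = -math.inf if maximizing else math.inf
    for child in node:
        score = alphabeta(child, not maximizing, alpha, beta, visited)
        if maximizing:
            value = max(value, score)
            alpha = max(alpha, value)
        else:
            value = min(value, score)
            beta = min(beta, value)
        if alpha >= beta:           # the other player would never allow this line
            break
    return value


tree = [[3, 5], [2, 9], [0, 7]]
seen = []
result = alphabeta(tree, True, visited=seen)
```

On this tree both functions return the same value, while `seen` shows that alpha-beta evaluated fewer leaves than the six that brute force looks at.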
Was that understandable?
I was having problems with that precise question.
When you think about it for a bit you will get the idea that the minimax graph contains ALL possible games including the bad games. So if a player plays a sub optimal game then that game is part of the tree - but has been discarded in favor of a better game.
It's similar to alpha-beta. I was getting stuck on what happens if I sacrifice some pieces intentionally to create space and then make a winning move through the gap, i.e. there is a better move further down the tree.
With alpha-beta, let's say a sequence of losing moves followed by a killer move is in fact in the tree. In that case the alpha and beta act as a window filter "a < x < b" and would have discarded it if YOU had a better game. You can see it in alpha-beta if you imagine putting a +/- infinity into a pruned branch to see what happens.
In any case both algorithms recalculate every move, so that if a player plays a suboptimal game then that will open up branches of the graph that are better for the opponent.
rinse repeat.
Consider a MIN node whose children are terminal nodes. If MIN plays suboptimally, then the value of the node is greater than or equal to the value it would have if MIN played optimally. Hence, the value of the MAX node that is the MIN node’s parent can only be increased. This argument can be extended by a simple induction all the way to the root. If the suboptimal play by MIN is predictable, then one can do better than a minimax strategy. For example, if MIN always falls for a certain kind of trap and loses, then setting the trap guarantees a win even if there is actually a devastating response for MIN.
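The argument above can be checked on a tiny example. This is a sketch on a made-up two-ply tree (MAX moves at the root, MIN replies, leaves are scores for MAX); the tree and the two MIN policies are assumptions for illustration.

```python
def minimax(node, maximizing):
    """Minimax over a nested-list tree with integer leaves."""
    if isinstance(node, int):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def outcome(tree, max_choice, min_policy):
    """Score when MAX picks branch `max_choice` and MIN replies via `min_policy`."""
    return min_policy(tree[max_choice])


tree = [[3, 5], [9, 2], [0, 7]]

optimal_min = min                           # MIN picks the lowest leaf
first_leaf_min = lambda leaves: leaves[0]   # a predictable, careless MIN

# MAX's minimax move: the branch with the best worst case.
minimax_branch = max(range(len(tree)), key=lambda i: min(tree[i]))

# If MAX knows MIN always takes the first leaf, MAX can set a "trap" instead.
trap_branch = max(range(len(tree)), key=lambda i: first_leaf_min(tree[i]))
```

Here the minimax branch guarantees 3, and any non-optimal reply by MIN on that branch can only give MAX more. The trap branch scores 9 against the careless MIN, even though an optimal MIN would answer it with 2, which is the "devastating response" case described above.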

Pathfinding Algorithm For 2 Pacmans

I'm trying to implement Pacman. It works fine, but so far, the ghosts aren't using any pathfinding, but instead just decide randomly on each path junction which path to take. So you can imagine that it isn't really difficult for Pacman to win the game ;)
So I read a little bit about path finding algorithms in Pacman and here on SO I found a really good answer: Pathfinding Algorithm For Pacman
The answers are referring to
This is all fine, but in my implementation of Pacman, there are two Pacmans which are played by two different players. So I wonder how to adapt the pathfinding algorithms, so that the ghosts are not always chasing one player.
Any thoughts on how to modify the algorithm so that the ghosts are more or less equally fair to both players?
I think the easiest strategy is to make each ghost chase the player closest to it. Proximity can be calculated using Manhattan distance (there was a link to it in the pathfinding question) or Euclidean distance or by a path length to the players. The last option means that you will have to compute paths to both players. Try all these options and choose one to your taste.
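A minimal sketch of the "chase the closest player" rule using Manhattan distance, assuming positions are `(row, column)` tuples on a grid. The function names are made up for the example.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, column) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_player(ghost, players):
    """Return the player position closest to the ghost."""
    return min(players, key=lambda player: manhattan(ghost, player))
```

Swapping `manhattan` for a Euclidean distance or a real path length only changes the `key` function, so it is cheap to try all three options.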
Also, on a side note. All people answering the pathfinding question didn't mention Dijkstra's algorithm which is even slower than BFS :) but allows to search all shortest paths only once. That is, if you implement A* or BFS and have n ghosts you will make at least n pathfinding queries. With Dijkstra you can do it only once starting from the player. But it all depends. If your game field is too large, Dijkstra is not the best choice. Try, experiment and maybe it'll suit you.
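Here is a sketch of the "search once, starting from the player" idea. It assumes every step costs the same, in which case Dijkstra's algorithm reduces to a plain breadth-first flood fill. The grid format (`"#"` for walls) and the function names are assumptions for the example.

```python
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def distance_map(grid, start):
    """Shortest step count from `start` to every reachable cell (unit step costs)."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def next_step(ghost, dist):
    """Move the ghost to the neighbouring cell closest to the player."""
    r, c = ghost
    options = [(r + dr, c + dc) for dr, dc in MOVES if (r + dr, c + dc) in dist]
    return min(options, key=dist.get, default=ghost)


grid = [
    "....",
    ".##.",
    "....",
]
dist_to_player = distance_map(grid, (0, 0))
```

Every ghost can then call `next_step` with the same map, so there is one search per player per tick instead of one per ghost. With two Pacmans you would build one map for each of them.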
(Haven't looked but) I'm guessing that all the ghost algorithms base their behaviour on the relative positions of the ghost and 'the player' - well, simply have each ghost change its mind about which of the two players it uses as 'the player' in its algorithm, every so often.
Determining what exactly "every so often" means here is going to be a question for playtesting - should it be on a fixed schedule? Vary per ghost? Vary based on the relative proximity of the two players? Randomly - on a uniform / Poisson / other distribution?
There are as you can see many possibilities. Bear in mind that you want to avoid both behaviour which is 'too good' and behaviour which is 'too stupid'...
If you can query the distance and direction to any one Pacman from any one Ghost and also the number of Ghosts (and which Ghosts) are currently chasing any one Pacman, you should be able to make a pretty good and simple AI with some creativity.
I think you keep the pathfinding algorithms described on this web page you mentioned. That will make the game feel more true to the original. The only problem then is to determine how many ghosts chase a particular Pacman. I think this behavior should include scenarios where all of the ghosts are chasing one player. So, an algorithm is needed to determine if 1, 2, 3, or 4 ghosts are chasing a player. The algorithm could be based on the point difference between the players. So, the player in the lead would get chased by more ghosts. The algorithm should probably factor in the number of lives left for the player. So, if the player in the lead has fewer lives, the algorithm should delay increasing the number of ghosts chasing the player in the lead. The frequency of change in the number of ghosts chasing a player should also not happen too often. If a ghost changes the player being chased too much, then the ghost will seem to not really be chasing either. Just like the web page mentioned, getting a good behavior is going to take some experimentation. I think keeping it simple at first is key because sometimes complex looking behavior can be achieved by using a few simple rules. Good luck and I would love to see what you come up with. Please post a link when you get done!
I don't know if this coincides with your notion of "fairness", but I imagine one would like to prevent the case where one player happened to be the closer target to all 4 ghosts and so they end up ganging up on him and following him around, never again to chase the other player. This would be a possible result of the rule to have the ghost always follow the closest player.
You might consider first allocating fairly 2 ghosts to player 1 and 2 other ghosts to player 2, and then have them chase their targets (and reassigning this every so often). Although, if I were a ghost in the real world I wouldn't care if all my friends and I were ganging up on one pacman.
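One possible way to do such an even split, as a sketch: sort the ghosts by how much closer they are to one player than to the other, and give each player half. The Manhattan metric and the names used are assumptions for the example.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign_targets(ghosts, player_a, player_b):
    """Split the ghosts evenly between two players.

    Ghosts that are relatively closer to player_a come first in the ordering,
    so the first half chases player_a and the rest chase player_b.
    """
    order = sorted(ghosts,
                   key=lambda g: manhattan(g, player_a) - manhattan(g, player_b))
    half = len(order) // 2
    return {g: (player_a if i < half else player_b) for i, g in enumerate(order)}
```

Calling this again every few seconds gives the "reassigning every so often" behaviour, while guaranteeing that no player is ever chased by all four ghosts.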
Instead of BFS or Dijkstra, I would use depth first search to depth 3 or 4, using Cartesian distance between your ghost and the Pacman at the leaves of this search tree and picking the value of the best leaf up to the root. For a small lookahead, it would be faster and easier to code compared to BFS and Dijkstra. Depth limited search should give you pretty intelligent behavior for your ghosts, assuming your gameboard does not have spiraling corridors where the number of moves required to escape the spiral is greater than 3 or 4. It also means the running time of the algorithm doesn't increase with larger and larger boards as does BFS and Dijkstra, again assuming you don't have spiraling corridors.
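A sketch of that depth-limited search, assuming a grid of strings with `"#"` as walls and using straight-line (Euclidean) distance at the leaves. The function names and the default depth of 3 are illustrative.

```python
import math

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def neighbours(grid, pos):
    """Yield the walkable cells next to `pos`."""
    r, c = pos
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
            yield (nr, nc)


def leaf_value(grid, pos, target, depth):
    """Best (smallest) straight-line distance reachable within `depth` more moves."""
    if depth == 0 or pos == target:
        return math.dist(pos, target)
    return min((leaf_value(grid, n, target, depth - 1) for n in neighbours(grid, pos)),
               default=math.dist(pos, target))


def best_step(grid, ghost, target, depth=3):
    """Pick the first move of the most promising depth-limited line."""
    return min(neighbours(grid, ghost),
               key=lambda n: leaf_value(grid, n, target, depth - 1),
               default=ghost)
```

The cost depends only on the depth and the branching factor, not on the board size, which matches the trade-off described above.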

How to program a neural network for chess?

I want to program a chess engine which learns to make good moves and win against other players. I've already coded a representation of the chess board and a function which outputs all possible moves. So I only need an evaluation function which says how good a given situation of the board is. Therefore, I would like to use an artificial neural network which should then evaluate a given position. The output should be a numerical value. The higher the value is, the better is the position for the white player.
My approach is to build a network of 385 neurons: There are six unique chess pieces and 64 fields on the board. So for every field we take 6 neurons (1 for every piece). If there is a white piece, the input value is 1. If there is a black piece, the value is -1. And if there is no piece of that sort on that field, the value is 0. In addition to that there should be 1 neuron for the player to move. If it is White's turn, the input value is 1 and if it's Black's turn, the value is -1.
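The input encoding described above could look like this sketch. The board representation (a dict from square index 0..63 to a piece letter and a colour of +1 or -1) is an assumption made for the example; the question does not say how the board is stored.

```python
PIECES = "PNBRQK"  # pawn, knight, bishop, rook, queen, king


def encode(board, white_to_move):
    """Turn a position into the 385 network inputs (64 squares * 6 pieces + 1).

    `board` is assumed to map a square index (0..63) to (piece_letter, colour),
    with colour +1 for White and -1 for Black.
    """
    inputs = [0.0] * (64 * 6 + 1)
    for square, (piece, colour) in board.items():
        inputs[square * 6 + PIECES.index(piece)] = float(colour)
    inputs[-1] = 1.0 if white_to_move else -1.0   # side to move
    return inputs
```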
I think that configuration of the neural network is quite good. But the main part is missing: How can I implement this neural network into a coding language (e.g. Delphi)? I think the weights for each neuron should be the same in the beginning. Depending on the result of a match, the weights should then be adjusted. But how? I think I should let 2 computer players (both using my engine) play against each other. If White wins, Black gets the feedback that its weights aren't good.
So it would be great if you could help me implementing the neural network into a coding language (best would be Delphi, otherwise pseudo-code). Thanks in advance!
In case somebody randomly finds this page. Given what we know now, what the OP proposes is almost certainly possible. In fact we managed to do it for a game with a much larger state space: Go.
I don't see why you can't have a neural net for a static evaluator if you also do some classic mini-max lookahead with alpha-beta pruning. Lots of Chess engines use minimax with a braindead static evaluator that just adds up the pieces or something; it doesn't matter so much if you have enough levels of minimax. I don't know how much of an improvement the net would make but there's little to lose. Training it would be tricky though. I'd suggest using an engine that looks ahead many moves (and takes loads of CPU etc) to train the evaluator for an engine that looks ahead fewer moves. That way you end up with an engine that doesn't take as much CPU (hopefully).
Edit: I wrote the above in 2010, and now in 2020 Stockfish NNUE has done it. "The network is optimized and trained on the [classical Stockfish] evaluations of millions of positions at moderate search depth" and then used as a static evaluator, and in their initial tests they got an 80-elo improvement when using this static evaluator instead of their previous one (or, equivalently, the same elo with a little less CPU time). So yes it does work, and you don't even have to train the network at high search depth as I originally suggested: moderate search depth is enough, but the key is to use many millions of positions.
Been there, done that. Since there is no continuity in your problem (the value of a position is not closely related to another position with only 1 change in the value of one input), there is very little chance a NN would work. And it never did in my experiments.
I would rather see a simulated annealing system with an ad-hoc heuristic (of which there are plenty out there) to evaluate the value of the position...
However, if you are set on using a NN, it is relatively easy to represent. A general NN is simply a graph, with each node being a neuron. Each neuron has a current activation value, and a transition formula to compute the next activation value, based on input values, i.e. activation values of all the nodes that have a link to it.
A more classical NN, that is with an input layer, an output layer, identical neurons for each layer, and no time-dependency, can thus be represented by an array of input nodes, an array of output nodes, and a linked graph of nodes connecting those. Each node possesses a current activation value, and a list of nodes it forwards to. Computing the output value is simply setting the activations of the input neurons to the input values, and iterating through each subsequent layer in turn, computing the activation values from the previous layer using the transition formula. When you have reached the last (output) layer, you have your result.
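The layer-by-layer computation described above can be sketched in a few lines. This assumes fully connected layers stored as `(weights, biases)` pairs and `tanh` as the transition formula; both are choices made for the example, not something the answer prescribes.

```python
import math


def forward(layers, inputs):
    """Compute the output of a layered feed-forward network.

    `layers` is a list of (weights, biases) pairs; `weights` holds one row of
    incoming weights per neuron in that layer.
    """
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            math.tanh(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


# A toy network: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0]], [0.0]),
]
output = forward(layers, [0.5, 0.5])
```

For the chess evaluator, the input list would be the 385 values from the question and the last layer would have a single neuron giving the position score.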
It is possible, but not trivial by any means.
To train his evaluation function, he utilized a lot of computing power to do so.
To summarize generally, you could go about it as follows. Your evaluation function is a feedforward NN. Let the matrix computations lead to a scalar output valuing how good the move is. The input vector for the network is the board state represented by all the pieces on the board so say white pawn is 1, white knight is 2... and empty space is 0. An example board state input vector is simply a sequence of 0-12's. This evaluation can be trained using grandmaster games (available at a fics database for example) for many games, minimizing loss between what the current parameters say is the highest valuation and what move the grandmasters made (which should have the highest valuation). This of course assumes that the grandmaster moves are correct and optimal.
What you need to train an ANN is either something like backpropagation learning or some form of a genetic algorithm. But chess is such a complex game that it is unlikely that a simple ANN will learn to play it, even more so if the learning process is unsupervised.
Further, your question does not say anything about the number of layers. You want to use 385 input neurons to encode the current situation. But how do you want to decide what to do? One neuron per field? Highest excitation wins? But there is often more than one possible move.
Further you will need several hidden layers - the functions that can be represented with an input and an output layer without hidden layer are really limited.
So I do not want to prevent you from trying it, but the chances of a successful implementation and training within, say, one year or so are practically zero.
I tried to build and train an ANN to play Tic-tac-toe when I was 16 years or so ... and I failed. I would suggest trying such a simple game first.
The main problem I see here is one of training. You say you want your ANN to take the current board position and evaluate how good it is for a player. (I assume you will take every possible move for a player, apply it to the current board state, evaluate via the ANN and then take the one with the highest output - ie: hill climbing)
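The one-ply selection described in the parentheses above can be sketched as follows. `apply_move` and `evaluate` are hypothetical callables standing in for the engine's move generator and the trained network.

```python
def choose_move(position, legal_moves, apply_move, evaluate, white_to_move=True):
    """Pick the move whose resulting position the evaluator likes best.

    `evaluate` is assumed to return higher values for positions that are better
    for White, so Black picks the move with the lowest score.
    """
    sign = 1 if white_to_move else -1
    return max(legal_moves, key=lambda move: sign * evaluate(apply_move(position, move)))
```

This looks only one move ahead, so in practice it would usually be combined with some search on top of the evaluator.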
Your options as I see them are:
Develop some heuristic function to evaluate the board state and train the network off that. But that begs the question of why use an ANN at all, when you could just use your heuristic.
Use some statistical measure such as "How many games were won by white or black from this board configuration?", which would give you a fitness value between white or black. The difficulty with that is the amount of training data required for the size of your problem space.
With the second option you could always feed it board sequences from grandmaster games and hope there is enough coverage for the ANN to develop a solution.
Due to the complexity of the problem I'd want to throw the largest network (ie: lots of internal nodes) at it as I could without slowing down the training too much.
Your input algorithm is sound - all positions, all pieces, and both players are accounted for. You may need an input layer for every past state of the gameboard, so that past events are used as input again.
The output layer should (in some form) give the piece to move, and the location to move to.
Write a genetic algorithm using a connectome which contains all neuron weights and synapse strengths, and begin multiple separated gene pools with a large number of connectomes in each.
Make them play one another, keep the best handful, crossover and mutate the best connectomes to repopulate the pool.
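A sketch of that keep-the-best, crossover and mutate loop for a single gene pool. The `fitness` function is a stand-in supplied by the caller (in the chess setting it would come from games played between connectomes), and the pool size, mutation rate and other parameters are arbitrary assumptions.

```python
import random


def evolve(fitness, genome_length, pool_size=20, keep=4, generations=30, seed=0):
    """Evolve a pool of weight vectors ("connectomes") and return the best one.

    Also returns the best fitness seen at the start of each generation.
    """
    rng = random.Random(seed)
    pool = [[rng.uniform(-1, 1) for _ in range(genome_length)]
            for _ in range(pool_size)]
    history = []
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        history.append(fitness(pool[0]))
        elite = pool[:keep]                       # keep the best handful unchanged
        children = []
        while len(elite) + len(children) < pool_size:
            mother, father = rng.sample(elite, 2)
            # Uniform crossover: each weight comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: occasionally nudge a weight.
            child = [w + rng.gauss(0, 0.1) if rng.random() < 0.2 else w
                     for w in child]
            children.append(child)
        pool = elite + children
    pool.sort(key=fitness, reverse=True)
    history.append(fitness(pool[0]))
    return pool[0], history
```

Because the elite are carried over unchanged, the best fitness never gets worse from one generation to the next. Running several calls with different seeds would give the separated gene pools mentioned above.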
Read blondie24 :
It deals with checkers instead of chess but the principles are the same.
Came here to say what Silas said. Using a minimax algorithm, you can expect to be able to look ahead N moves. Using alpha-beta pruning, you can expand that to theoretically 2*N moves, but more realistically 4*N/3 moves. Neural networks are really appropriate here.
Perhaps though a genetic algorithm could be used.
