In Russell and Norvig (third edition), they give the following definition of the minimax value of a node in a game tree (zero-sum, perfect information, deterministic):
The minimax value of a node is the utility (for MAX) of being in the corresponding state, assuming that both players play optimally from there to the end of the game.
The only thing is that, in their setup of a game, the utility of a node is only defined for terminal nodes, so how should one understand the utility of a general node? Thanks.
Utility is defined for non-terminal nodes too. The difference is that for terminal nodes the utility is given directly by the game's utility function (which they call UTILITY), while for non-terminal nodes it is computed by the minimax algorithm. The minimax value (or utility) of a non-terminal node is either the maximum or the minimum of the minimax values of its children (depending on whose move it is). The minimax value of the root is the utility of the outcome you will reach if both you and your opponent make optimal moves from there on out.
There's a worked-out example here that should make it clearer.
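To make the recursion concrete, here is a minimal Python sketch of that definition (the Node attributes is_terminal, utility, children, and is_max_node are hypothetical, not from the book):

    def minimax_value(node):
        # Minimax value: the terminal UTILITY, or the max/min of the
        # children's minimax values, depending on whose move it is.
        if node.is_terminal:
            return node.utility  # UTILITY is only defined here
        values = [minimax_value(child) for child in node.children]
        return max(values) if node.is_max_node else min(values)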
Each node should represent a gamestate, given a set of actions by each player.
Utility should be defined for each gamestate, and thereby, each node. It should represent how favorable a gamestate is for a player.
Minimax tree nodes are computed every other layer. That is to say, I don't evaluate the state of the game directly after my move, but instead after each time my opponent(s) make a move.
For a two-player game:
I have X possible moves.
For each of my X possible moves, there is a gamestate. We do not need the utility of these gamestates.
For each of those X gamestates, my opponent has Y possible moves.
For each of those Y possible moves, there is another gamestate. We need the utilities of these gamestates (see the sketch below).
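A rough sketch of that two-ply scheme, assuming hypothetical helpers my_moves, opponent_moves, apply_move, and utility:

    def best_move_two_ply(state):
        # Evaluate every other layer: score each of my moves by the
        # utility of the gamestates reached after the opponent's reply.
        best_move, best_score = None, float("-inf")
        for my_move in my_moves(state):               # my X possible moves
            after_mine = apply_move(state, my_move)   # utility not needed here
            # assume the opponent picks the reply that is worst for me
            score = min(utility(apply_move(after_mine, reply))
                        for reply in opponent_moves(after_mine))
            if score > best_score:
                best_move, best_score = my_move, score
        return best_move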
I have implemented an MCTS for a 4-player game, which is working well, but I'm not sure I understand expansion when the game-ending move is in the actual tree rather than in the rollout.
At the start of the game, winning/losing positions are only found in the rollout, and I understand how to score these and propagate them back up the tree. But as the game progresses, I eventually find a leaf node, chosen by UCB1, that cannot be expanded, as it is a losing position with no possible move allowed, so there is nothing to expand, nor is there a game to 'rollout'. At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However, when I look at the visit stats, this node gets revisited thousands of times, so obviously UCB1 'chooses' to visit this node many times. But really this is a bit of a waste; should I be back-propagating something other than a single win for these 'always win' nodes?
I've had a good Google search for this and can't really find much mention of it, so am I misunderstanding something or missing something obvious? None of the 'standard' MCTS tutorials/algorithms even mention game-ending nodes in the tree as special cases, so I'm worried I've misunderstood something fundamental.
At the moment I just score this as a 'win' for the last remaining player and backpropagate a win for them.
However, when I look at the visit stats, this node gets revisited thousands of times, so obviously UCB1 'chooses' to visit this node many times. But really this is a bit of a waste; should I be back-propagating something other than a single win for these 'always win' nodes?
No, what you're currently already doing is correct.
MCTS essentially estimates the value of a node as the average of the outcomes of all simulations that have run through that node. In reality, though, we are generally interested in minimax-style evaluations.
For MCTS' average-based evaluations to become equal to minimax evaluations in the limit (after an infinite amount of time), we rely on the Selection phase (e.g. UCB1) to send so many simulations (= Selection + Play-out phases) down the path(s) that are optimal according to minimax that the average evaluations also tend to the minimax evaluations in the limit.
Suppose, for example, that there is a winning node directly below the root node. This is an extreme example of your situation, where the terminal node is already reached in the Selection phase, and no Play-out is required afterwards. The minimax evaluation of the root node would be a win, since we can get to a win in one step. This means we want the average-based scoring of MCTS to also become very close to a winning evaluation for the root node, which in turn means we want the Selection phase to send the vast majority of simulations immediately down into this winning node. If, say, 99% of all simulations immediately go from the root node to this winning node, the average evaluation of the root node will also become very close to a win, and that is exactly what we need.
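For reference, the UCB1 score used in that Selection phase can be sketched like this (c is an exploration constant, commonly around sqrt(2)):

    import math

    def ucb1(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
        # Average outcome plus an exploration bonus that shrinks as the
        # child is visited more often relative to its parent.
        if child_visits == 0:
            return float("inf")  # always try unvisited children first
        exploitation = child_wins / child_visits
        exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
        return exploitation + exploration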
This answer is only about the implementation of basic UCT (MCTS with UCB1 for Selection). For more sophisticated modifications of that basic MCTS implementation that are relevant to the question, see manlio's answer.
none of the 'standard' MCTS tutorials/algorithms even mention game-ending nodes in the tree as special cases
There are MCTS variants able to prove the game-theoretical value of a position.
MCTS-Solver is (quite) well known: the backpropagation and selection steps are modified for this variant, as well as the procedure for choosing the final move to play.
Terminal win and loss positions occurring in the tree are handled differently and a special provision is taken when backing such proven values up the tree.
You can take a look at:
Monte-Carlo Tree Search Solver by Mark H. M. Winands, Yngvi Björnsson, and Jahn Takeshi Saito (Lecture Notes in Computer Science, volume 5131)
for details.
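The core proving rule can be sketched roughly as follows (a simplification of the paper's method; the proven field is hypothetical and holds "win", "loss", or None from the perspective of the player to move at each node):

    def update_proven_status(node):
        # A child's value is from the child's mover's perspective, so a
        # child that is a proven loss for the opponent is a proven win here.
        if any(child.proven == "loss" for child in node.children):
            node.proven = "win"    # one reply that leaves the opponent lost
        elif all(child.proven == "win" for child in node.children):
            node.proven = "loss"   # every reply hands the opponent a win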
so I'm worried I've misunderstood something fundamental.
Although in the long run MCTS equipped with the UCT formula is able to converge to the game-theoretical value, basic MCTS cannot prove that value.
When my tree has gotten deep enough that terminal nodes are starting to be selected, I would have assumed that I should just perform a zero-move "playout" from such a node and back-propagate the result, but the IEEE survey of MCTS methods indicates that the selection step should find the "most urgent expandable node", and I can't find any counterexamples elsewhere. Am I supposed to be excluding them somehow? What's the right thing to do here?
If you actually reach a terminal node in the selection phase, you'd kind of skip expansion and play-out (they'd no longer make sense) and straight up backpropagate the value of that terminal node.
In the paper you linked, this is not clear on page 6, but it is clear in Algorithm 2 on page 9. In that pseudocode, the TreePolicy() function will end up returning a terminal node v. When the state of this node is then passed into the DefaultPolicy() function, that function will directly return the reward (the condition of that function's while-loop will never be satisfied).
It also makes sense that this is what you'd want to do if you have a good intuitive understanding of the algorithm and want it to be able to guarantee optimal estimates of values given an infinite amount of processing time. With an infinite amount of processing time (an infinite number of simulations), you'll want to back up values from the "best" terminal states infinitely often, so that the averaged values from backups in nodes closer to the root also converge to those best leaf-node values in the limit.
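Putting it together, one iteration then looks something like this (select, expand, rollout, and backpropagate are hypothetical helpers standing in for the usual four phases):

    def mcts_iteration(root):
        node = select(root)               # tree policy, e.g. UCB1
        if node.is_terminal:
            # no expansion or play-out makes sense here: back up the
            # exact terminal value (the zero-move 'playout')
            reward = node.terminal_value
        else:
            node = expand(node)
            reward = rollout(node)        # default policy from the new node
        backpropagate(node, reward)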
In the minimax algorithm, the first player plays optimally, which means it wants to maximise its score, and the second player tries to minimise the first player's chances of winning. Does this mean that the second player also plays optimally to win the game? Does trying to choose some path in order to minimise the first player's chances of winning also mean trying to win?
I am actually trying to solve this task from TopCoder: EllysCandyGame. I wonder whether we can apply the minimax algorithm here. That statement "both to play optimally" really confuses me, and I would like some advice on how to deal with this type of problem, if there is some general idea.
Yes, you can use the minimax algorithm here.
The problem statement says that the winner of the game is "the girl who has more candies at the end of the game." So one reasonable scoring function you could use is the difference in the number of candies held by the first and second player.
Does this mean that the second player also plays optimally to win the game?
Yes. When you are evaluating a MIN level, the MIN player will always choose the path with the lowest score for the MAX player.
Note: both the MIN and MAX levels can be implemented with the same code, if you evaluate every node from the perspective of the player making the move in that round, and convert scores between levels. If the score is a difference in number of candies, you could simply negate it between levels.
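A sketch of that single-code-path (negamax) formulation, with hypothetical helpers game_over, candy_difference, legal_moves, and make_move; the score is always from the perspective of the player to move:

    def negamax(state):
        # Each level maximizes its own score and negates the child values,
        # so MIN and MAX share one code path.
        if game_over(state):
            # candy difference from the current player's perspective
            return candy_difference(state)
        return max(-negamax(make_move(state, m)) for m in legal_moves(state))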
Does trying to choose some path in order to minimise the first player's chances of winning also mean trying to win?
Yes. The second player is trying to minimize the first player's score. A reasonable scoring function will give the first player a lower score for a loss than for a tie.
I wonder whether we can apply the minimax algorithm here.
Yes. If I've read the problem correctly, the number of levels will be equal to the number of boxes. If there's no limit on the number of boxes, you'll need to use an n-move lookahead, evaluating nodes in the minimax tree to a maximum depth.
Properties of the game:
At each point, there is a limited, well-defined number of moves (picking one of the non-empty boxes)
The game ends after a finite number of moves (when all boxes are empty)
As a result, the search tree consists of a finite number of leaves. You are right that, by applying Minimax, you can find the best move.
Note that you only have to evaluate the game at the final positions (when there are no more moves left). At that point, there are only three results: The first player won, the second player won, or it is a draw.
Note that the standard Minimax algorithm has nothing to do with probabilities. The result of the Minimax algorithm determines the perfect play for both sides (assuming that both sides make no mistakes).
By the way, if you need to improve the search algorithm, a safe and simple optimization is to apply alpha-beta pruning, sketched below.
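A minimal alpha-beta sketch over the same kind of tree (game_over, final_score, legal_moves, and make_move are hypothetical helpers; the score is from MAX's perspective):

    def alphabeta(state, alpha, beta, maximizing):
        # Same result as plain minimax, but branches that cannot change
        # the final decision are cut off.
        if game_over(state):
            return final_score(state)
        if maximizing:
            value = float("-inf")
            for m in legal_moves(state):
                value = max(value, alphabeta(make_move(state, m), alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:
                    break  # MIN already has a better option elsewhere
            return value
        value = float("inf")
        for m in legal_moves(state):
            value = min(value, alphabeta(make_move(state, m), alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break  # MAX already has a better option elsewhere
        return value

    # initial call: alphabeta(start_state, float("-inf"), float("inf"), True)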
The description of the minimax algorithm says that both players have to play optimally for the algorithm to be optimal. Intuitively that is understandable. But could anyone make this concrete, or prove what happens if MIN does not play optimally?
Thanks
The definition of "optimal" is that you play so as to minimize the "score" (or whatever you measure) of your opponent's optimal answer, which is defined by the play that minimizes the score of your optimal answer and so forth.
Thus, by definition, if you don't play optimally, your opponent has at least one path that will give him a higher score than his best score had you played optimally.
One way to find out what is optimal is to brute-force the entire game tree. For anything beyond trivial problems you can use alpha-beta search, which guarantees the optimum without needing to search the entire tree. If your tree is still too complex, you need a heuristic that estimates the score of a "position" and halts the search at a certain depth.
Was that understandable?
I was having problems with that precise question.
When you think about it for a bit, you will get the idea that the minimax graph contains ALL possible games, including the bad games. So if a player plays a sub-optimal game, then that game is part of the tree - it has simply been discarded in favor of a better game.
It's similar to alpha-beta. I was getting stuck on what happens if I sacrifice some pieces intentionally to create space and then make a winning move through the gap, i.e. there is a better move further down the tree.
With alpha-beta - let's say a sequence of losing moves followed by a killer move is in fact in the tree - alpha and beta act as a window filter (a < x < b) and would have discarded that line if YOU had a better game. You can see this in alpha-beta if you imagine putting +/- infinity into a pruned branch to see what happens.
In any case, both algorithms recalculate at every move, so if a player plays a sub-optimal game, that will open up branches of the graph that are better for the opponent.
Rinse, repeat.
Consider a MIN node whose children are terminal nodes. If MIN plays suboptimally, then the value of the node is greater than or equal to the value it would have if MIN played optimally. Hence, the value of the MAX node that is the MIN node’s parent can only be increased. This argument can be extended by a simple induction all the way to the root. If the suboptimal play by MIN is predictable, then one can do better than a minimax strategy. For example, if MIN always falls for a certain kind of trap and loses, then setting the trap guarantees a win even if there is actually a devastating response for MIN.
Source: https://www.studocu.com/en-us/document/university-of-oregon/introduction-to-artificial-intelligence/assignments/solution-2-past-exam-questions-on-computer-information-system/1052571/view
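The same argument in symbols: at a MIN node with child values $v_1, \dots, v_k$, optimal play by MIN yields $\min_i v_i$, while any suboptimal choice $j$ satisfies $v_j \ge \min_i v_i$. The parent MAX node takes a maximum over such values, so its value can only increase, and by induction the same holds all the way up to the root.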
Hi
I'm confused about how you can determine the utility functions with a minimax search.
Explain it with any game that you can use a minimax search with.
Basically, I am asking how you determine the utility functions.
Cheers
The utility value is just some arbitrary value that the player receives when arriving at a certain state in the game. For instance, in Tic-tac-toe, your utility function could simply be 1 for a win, 0 for a tie, or -1 for a loss.
Running minimax on this would, at best, find a set of actions that results in 1 (a win).
Another example would be chess (not that you can feasibly run minimax on a game of chess). Say your utility function comes from a number based on the value of the piece you captured or lost.
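As a concrete sketch of the Tic-tac-toe case above (winner is a hypothetical helper returning 'X', 'O', or None):

    def utility(state, player="X"):
        # Terminal utility from player's perspective:
        # +1 for a win, -1 for a loss, 0 for a tie.
        w = winner(state)
        if w is None:
            return 0
        return 1 if w == player else -1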
Determining the utility value of a move at a certain state has to do with the experience of the programmer and his/her knowledge of the game.
Utility values at a terminal state are fairly easy to determine. In Tic-tac-toe, for instance, a terminal state for player X is when the Xs are aligned diagonally, vertically, or horizontally. Any move that creates such a state creates a terminal state, and you can write a function that checks for that. If it is a terminal state, the function returns a 1 or -1.
If your player agent is player X and after player X's move it determines that player O will win, then the function returns a -1. The function returns a 1 if it determines that it is its own winning move.
If all cells are occupied with the last possible move and nobody has won, then the function returns a zero.
This is at terminal states only. It is also critical to evaluate intermediate states because, even in a 3x3 game, there are lots of combinations to consider. If you include symmetrical moves, there are 9! possible move sequences in Tic-tac-toe. For those intermediate cases, you need to come up with an evaluation function that returns a score for each state as it relates to other states.
Suppose that I assign terminal-state values of 810, 0, and -810. For each game, the score would be 810 / (# of moves). So if I reach a terminal state in 6 moves, the score would be 810/6 = 135; in 9 moves, the score would be 90. An evaluation function fashioned this way favors moves that reach a terminal state faster. However, it still evaluates down to a leaf node. We need to be able to evaluate before reaching a leaf node, and this could also be part of an evaluation function.
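That scheme might look like this (a hypothetical helper; result is +1, 0, or -1, and moves_used counts the moves taken to reach the terminal state):

    def depth_weighted_score(result, moves_used):
        # Scale the terminal values 810/0/-810 by how quickly the game
        # ended: faster wins and slower losses both score higher.
        return result * 810 / moves_used  # a win in 6 moves -> 135, in 9 -> 90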
Suppose that, in the game below, player 1 is X, so X moves next. The following are the legal moves (row, column) for X:
(1) 0,0
(2) 0,2
(3) 2,0
(4) 2,1
(5) 2,2
| |O| |
|O|X|X|
| | | |
The utility value for each move should favor the best moves.
The best moves, in this case, are either (2) or (5). So an evaluation function would assign a utility value of, for instance, 81 to each of those. Move (4) is the worst possible move for the X player (and would also guarantee that you lose the game against an intelligent player), so the function would assign a value of -9 to that move. Moves (1) and (3), while not ideal, will not make you lose, so we might assign a 1.
So when minimax evaluates those 5 moves, because your player X is MAX, the choice would be either (2) or (5).
If we focus on options (2) and (5), the game will reach a terminal state two moves after these. So, in reality, the evaluation function should look 2 moves ahead of the current legal moves to return the utility values. (This strategy follows the lines of depth-limited search, where your function evaluates at a certain depth and produces a utility value without reaching a leaf node, or terminal state.)
Now I'll circle back to my first statement. The utility value will be determined by an evaluation function coded per the programmer's knowledge of the game.
Hopefully, I'm not confusing you...