In the minimax algorithm, the first player plays optimally, which means it wants to maximise its score, while the second player tries to minimise the first player's chances of winning. Does this mean that the second player also plays optimally to win the game? Does trying to choose some path in order to minimise the first player's chances of winning also mean trying to win?
I am actually trying to solve this task from TopCoder: EllysCandyGame. I wonder whether we can apply the minimax algorithm here. The statement "both to play optimally" really confuses me, and I would like some advice on how to deal with this type of problem, if there is some general idea.
Yes, you can use the minimax algorithm here.
The problem statement says that the winner of the game is "the girl who has more candies at the end of the game." So one reasonable scoring function you could use is the difference in the number of candies held by the first and second player.
Does this mean that the second player also plays optimally to win the game?
Yes. When you are evaluating a MIN level, the MIN player will always choose the path with the lowest score for the MAX player.
Note: both the MIN and MAX levels can be implemented with the same code, if you evaluate every node from the perspective of the player making the move in that round, and convert scores between levels. If the score is a difference in number of candies, you could simply negate it between levels.
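For illustration, here is a minimal negamax-style sketch in Python. The move rule (take every candy in one non-empty box) is a simplification I'm assuming for the sketch, not the real EllysCandyGame rules; those would go into apply_move.

```python
def apply_move(boxes, mover_total, i):
    # Hypothetical simplified move: take all candies from box i.
    new_boxes = list(boxes)
    gained = new_boxes[i]
    new_boxes[i] = 0
    return tuple(new_boxes), mover_total + gained

def negamax(boxes, my_total, opp_total):
    """Best achievable (my candies - opponent's candies) for the player to move."""
    moves = [i for i, c in enumerate(boxes) if c > 0]
    if not moves:                       # terminal position: score the difference
        return my_total - opp_total
    best = float("-inf")
    for i in moves:
        new_boxes, new_total = apply_move(boxes, my_total, i)
        # The opponent moves next; their best score difference is ours, negated.
        best = max(best, -negamax(new_boxes, opp_total, new_total))
    return best

# negamax((2, 5, 3), 0, 0) > 0 means the first player wins under optimal play.
```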
Does trying to choose some path in order to minimize the first player's chances of winning also mean trying to win?
Yes. The second player is trying to minimize the first player's score. A reasonable scoring function will give the first player a lower score for a loss than a tie.
I wonder whether we can apply the minimax algorithm here.
Yes. If I've read the problem correctly, the number of levels will be equal to the number of boxes. If there's no limit on the number of boxes, you'll need to use an n-move lookahead, evaluating nodes in the minimax tree to a maximum depth.
Properties of the game:
At each point, there is a limited, well-defined set of moves (picking one of the non-empty boxes)
The game ends after a finite number of moves (when all boxes are empty)
As a result, the search tree consists of a finite number of leaves. You are right that by applying Minimax, you can find the best move.
Note that you only have to evaluate the game at the final positions (when there are no more moves left). At that point, there are only three results: The first player won, the second player won, or it is a draw.
Note that the standard Minimax algorithm has nothing to do with probabilities. The result of the Minimax algorithm determines perfect play for both sides (assuming that both sides make no mistakes).
By the way, if you need to improve the search algorithm, a safe and simple optimization is to apply alpha-beta pruning.
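As a sketch, here is a minimal alpha-beta version in Python over a simplified candy game (one move = take all candies from one non-empty box; the real rules would replace the move generation):

```python
import math

def alphabeta(boxes, totals, player, alpha, beta):
    """Alpha-beta over a simplified candy game: totals[0] is the first
    player's candies, and the score is totals[0] - totals[1]."""
    moves = [i for i, c in enumerate(boxes) if c > 0]
    if not moves:                       # terminal: all boxes are empty
        return totals[0] - totals[1]
    if player == 0:                     # MAX node
        value = -math.inf
        for i in moves:
            nb = boxes[:i] + (0,) + boxes[i + 1:]
            nt = (totals[0] + boxes[i], totals[1])
            value = max(value, alphabeta(nb, nt, 1, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:           # MIN has a better option elsewhere: prune
                break
        return value
    else:                               # MIN node
        value = math.inf
        for i in moves:
            nb = boxes[:i] + (0,) + boxes[i + 1:]
            nt = (totals[0], totals[1] + boxes[i])
            value = min(value, alphabeta(nb, nt, 0, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:           # MAX has a better option elsewhere: prune
                break
        return value

# alphabeta((2, 5, 3), (0, 0), 0, -math.inf, math.inf) -> score under optimal play
```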
I'm trying to solve a variant of 2048 with Monte Carlo Tree Search. I found that UCT could be a good way to get some trade-off between exploration and exploitation.
My only issue is that all the versions I've seen assume that the score is a win percentage. How can I adapt it to a game where the score is the value of the board at the final state, and thus ranges from 1 to MAX rather than being a win or a loss?
I could normalize the score for the constant c by dividing by MAX, but then it would overweight exploration in the early stages of the game (since you get low average scores there) and overweight exploitation in the late stages.
Indeed, most of the literature assumes your games are either lost or won and awards a score of 0 or 1, which turns into a win ratio when averaged over the number of games played. The exploration parameter C is then usually set to sqrt(2), which is optimal for UCB1 in bandit problems.
To find out what a good C is in general you have to step back a bit and see what the UCT is really doing. If one node in your tree had an exceptionally bad score in the one rollout it had then exploitation says you should never choose it again. But you've only played that node once, so it might have just been bad luck. To acknowledge this you give that node a bonus. How much? Enough to make it a viable choice even if its average score is the lowest possible and some other node has the highest average score possible. Because with enough plays it might turn out that the one rollout your bad node had was indeed a fluke, and the node actually turns out to be pretty reliable with good scores. Of course, if you get more bad scores then it will likely not be bad luck so it won't deserve more rollouts.
So with scores ranging from 0 to 1, a C of sqrt(2) is a good value. If your game has a maximum achievable score, then you can normalize your scores by dividing by the max, forcing them into the 0-1 range to suit a C of sqrt(2). Alternatively, you don't normalize the scores but multiply C by your maximum score. The effect is the same: the UCT exploration bonus is large enough to give your underdog nodes some rollouts and a chance to prove themselves.
There is an alternative way of setting C dynamically that has given me good results. As you play, you keep track of the highest and lowest scores you've ever seen in each node (and its subtree). This is the range of scores possible, and it gives you a hint of how big C should be in order to give under-explored underdog nodes a fair chance. Every time I descend into the tree and pick a new root, I adjust C to be sqrt(2) times the score range for the new root. In addition, as rollouts complete and their scores turn out to be a new highest or lowest score, I adjust C in the same way. By continually adjusting C this way, both as you play and as you pick a new root, you keep C as large as it needs to be to converge, but as small as it can be to converge fast. Note that the minimum score is as important as the max: if every rollout yields at minimum a certain score, then C won't need to overcome it. Only the difference between max and min matters.
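For concreteness, here is a minimal sketch of that selection step in Python; the per-node statistics layout (visits, total_score, lo, hi) is my own assumption for the sketch, not from any particular MCTS library:

```python
import math

def uct_select(children):
    """Pick the child with the highest UCT score, scaling C by the score
    range seen so far (a sketch of the dynamic-C idea described above).
    Each child is assumed to be a dict with keys: visits, total_score,
    lo, hi -- lo/hi being the lowest/highest rollout scores in its subtree."""
    visited = [c for c in children if c["visits"] > 0]
    if not visited:
        return children[0]              # nothing explored yet: any child will do
    parent_visits = sum(c["visits"] for c in children)
    lo = min(c["lo"] for c in visited)
    hi = max(c["hi"] for c in visited)
    c_param = math.sqrt(2) * (hi - lo)  # C tracks the observed score range

    def uct(c):
        if c["visits"] == 0:
            return math.inf             # force at least one rollout per child
        mean = c["total_score"] / c["visits"]
        return mean + c_param * math.sqrt(math.log(parent_visits) / c["visits"])

    return max(children, key=uct)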
I've been trying to apply MCTS to card games. Basically, I need a formula, or a modification of the UCB formula, that works best when selecting which node to proceed with.
The problem is that the card games are not win/loss games; they have a score distribution at each node, like 158:102 for example. We have two teams, so it is basically a 2-player game. The games I'm testing are constant-sum games (number of tricks, or some score from the taken tricks, and so on).
Let's say the maximum sum of team A's and team B's scores is 260 at each leaf. Then I search for the best move from the root, and the first one I try gives me an average of 250 after 10 tries. I have 3 more possible moves that have never been tested. Because 250 is so close to the maximum score, the regret factor for trying another move is very high. But what is the mathematically proven optimal formula for choosing a move when you have:
Xm - average score for move m
Nm - number of tries for move m
MAX - maximum score that can be made
MIN - minimum score that can be made
Obviously, the more you try the same move, the more you want to try the other moves; but the closer you are to the maximum score, the less you want to try the others. What is the best mathematical way to choose a move based on these factors Xm, Nm, MAX, MIN?
Your problem is clearly an exploration problem, and the issue is that with the Upper Confidence Bound (UCB), the exploration cannot be tuned directly. This can be solved by adding an exploration constant.
The Upper Confidence Bound (UCB) is calculated as follows:

$$UCB(s,a) = V(s,a) + \sqrt{\frac{2 \ln n(s)}{n(s,a)}}$$

with $V$ being the value function (the expected score) which you are trying to optimize, $s$ the state you are in (the cards in the hands), and $a$ the action (playing a card, for example). $n(s)$ is the number of times state $s$ has been visited in the Monte Carlo simulations, and $n(s,a)$ the same for the combination of $s$ and action $a$.

The left part, $V(s,a)$, is used to exploit knowledge of the previously obtained scores, and the right part adds a value to encourage exploration. However, there is no way to increase or decrease this exploration term directly, and that is what the Upper Confidence Bounds for Trees (UCT) variant addresses:

$$UCT(s,a) = V(s,a) + 2 C_p \sqrt{\frac{2 \ln n(s)}{n(s,a)}}$$

Here $C_p > 0$ is the exploration constant, which can be used to tune the exploration. It was shown that

$$C_p = \frac{1}{\sqrt{2}}$$

satisfies Hoeffding's inequality if the rewards (scores) are between 0 and 1, i.e. in $[0,1]$.
Silver & Veness propose Cp = Rhi - Rlo, with Rhi being the highest value returned using Cp = 0, and Rlo the lowest value seen during the rollouts (i.e. when you choose actions randomly because no value function has been calculated yet).
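As a small sketch, assuming the statistics defined above, the UCT value with a Silver & Veness style Cp could be computed like this in Python:

```python
import math

def uct_value(v_sa, n_s, n_sa, r_hi, r_lo):
    """UCT score of action a in state s, using Cp = Rhi - Rlo.
    v_sa, n_s and n_sa are the statistics defined in the text above;
    r_hi/r_lo are the highest/lowest returns seen during rollouts."""
    cp = r_hi - r_lo
    return v_sa + 2 * cp * math.sqrt(2 * math.log(n_s) / n_sa)
```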
References:
Browne, C., Powley, E. J., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., & Colton, S. (2012). A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1–43.
Silver, D., & Veness, J. (2010). Monte-Carlo Planning in Large POMDPs. Advances in Neural Information Processing Systems, 1–9.
The idea is simple. The function takes one argument, the number of players. It generates the pairings where each player is matched against every other one (screenshot included). If the player count is even, the number of rounds equals players - 1; otherwise, it equals players.
I've noticed that the best way to do the pairing is to rotate the order of the numbers (source).
I can't find any solution that makes it work with an odd player count. Any suggestions are welcome, as I really need this algorithm working ASAP. It looks simple and shouldn't take much code, so that's not the issue; I just need a tip.
If you have an odd number of players, add a dummy player. Whoever plays the dummy player in a given round doesn't compete in that round.
You can even see that in your example image, where player 6 is the dummy. The left table is obtained by skipping all matches against number 6.
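For completeness, here is a minimal sketch of the circle method with the dummy-player trick in Python (the player numbering and output format are my own choices):

```python
def round_robin(n):
    """Return a list of rounds; each round is a list of (player, player) pairs.
    Players are numbered 1..n. With odd n, a dummy player 0 is added, and
    whoever is paired with 0 sits out that round (the pair is dropped)."""
    players = list(range(1, n + 1))
    if n % 2 == 1:
        players.append(0)               # dummy player for the bye
    m = len(players)
    rounds = []
    for _ in range(m - 1):
        pairs = [(players[i], players[m - 1 - i]) for i in range(m // 2)]
        rounds.append([p for p in pairs if 0 not in p])   # drop the bye match
        # Circle method: keep the first player fixed, rotate the rest.
        players = [players[0]] + [players[-1]] + players[1:-1]
    return rounds

# round_robin(5) -> 5 rounds, each with 2 matches and one player sitting out.
```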
I was challenged by a coworker to create a Tic Tac Toe game AI that plays five-in-a-row games (not the traditional 3). My initial thought is to create a "scoreboard", i.e. every cell in the game gets a score between 0 and infinity. The AI finds shapes, determines which positions hold how much value, and assigns scores to the cells. In the end, the highest-scoring cell is the choice.
Is there a better way to approach this problem?
5x5 Tic-Tac-Toe might still be small enough to solve directly, depending on your time constraints, if you're clever about the board symmetries. Oddly enough, I just wrote a description of the general technique last night, for this question:
How to code simple AI for a windows phone board game?
If not, that's still a good starting point. The next most obvious thing to me would be to change the board evaluation function and search only as deep in the tree as is feasible within your time constraints. The idea is that you, as a human, might have some ideas about what strong and weak positions are. So, as a guess: we know five in a row wins, so assign an X win a score of +5 and an O win a score of -5. One way to win is to get four in a row first, so if X has four in a row, that might be worth +4, and if O has four in a row, that might be worth -4. The point is that if you can't search all the way down the tree, you search as far as you can with the minimax technique, confident that you're working your way toward a strong position.
That board eval function is only an example. Coming up with a good board evaluation function can be tricky, and the one I described misses some obvious details.
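To make the shape of such a function concrete, here is a toy sketch in Python that scores each run of identical pieces quadratically. The board representation and weights are assumptions for illustration, and like the description above it deliberately ignores details such as open versus blocked ends:

```python
def evaluate(board):
    """Toy evaluation for five-in-a-row: +len**2 for each of X's runs,
    -len**2 for each of O's. board is a square grid of 'X', 'O', '.'
    given as a list of strings."""
    n = len(board)
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]   # the four line directions
    score = 0
    for r in range(n):
        for c in range(n):
            piece = board[r][c]
            if piece == '.':
                continue
            for dr, dc in directions:
                # Count only runs starting here, so each run is scored once.
                pr, pc = r - dr, c - dc
                if 0 <= pr < n and 0 <= pc < n and board[pr][pc] == piece:
                    continue
                length = 0
                rr, cc = r, c
                while 0 <= rr < n and 0 <= cc < n and board[rr][cc] == piece:
                    length += 1
                    rr += dr
                    cc += dc
                score += length ** 2 if piece == 'X' else -(length ** 2)
    return score

# evaluate(['XX...', '.....', '..O..', '.....', '.....']) -> 4 - 1 = 3
```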
Another thing to try is to use a genetic algorithm and neural networks to evolve the board evaluation function. Now the idea is to feed board positions into neural networks, which do the board evaluation, and let them play according to the technique described above, tournament style. Then, after the tournament rounds, new neural networks are created (through the genetic algorithm) from the winners, and the losers are eliminated. The board evaluation function evolves naturally.
The description of the minimax algorithm says that both players have to play optimally for the algorithm to be optimal. Intuitively that is understandable. But could anyone make this concrete, or prove what happens if MIN does not play optimally?
Thanks.
The definition of "optimal" is that you play so as to minimize the "score" (or whatever you measure) of your opponent's optimal answer, which is defined by the play that minimizes the score of your optimal answer and so forth.
Thus, by definition, if you don't play optimally, your opponent has at least one path that will give him a higher score than his best score had you played optimally.
One way to find out what is optimal is to brute-force the entire game tree. For anything beyond trivial problems, you can use alpha-beta search, which guarantees finding the optimum without needing to search the entire tree. If your tree is still too complex, you need a heuristic that estimates the score of a "position" and halts at a certain depth.
Was that understandable?
I was having problems with that precise question.
When you think about it for a bit, you will get the idea that the minimax graph contains ALL possible games, including the bad games. So if a player plays a suboptimal game, then that game is part of the tree, but it has been discarded in favor of a better game.
It's similar to alpha-beta. I was getting stuck on what happens if I sacrifice some pieces intentionally to create space and then make a winning move through the gap, i.e. there is a better move further down the tree.
With alpha-beta, let's say a sequence of losing moves followed by a killer move is in fact in the tree; but in that case the alpha and beta act as a window filter (a < x < b) and would have discarded it if YOU had a better game. You can see this in alpha-beta if you imagine putting a +/- infinity into a pruned branch to see what happens.
In any case, both algorithms recalculate on every move, so if a player plays a suboptimal game, that will open up branches of the graph that are better for the opponent.
Rinse, repeat.
Consider a MIN node whose children are terminal nodes. If MIN plays suboptimally, then the value of the node is greater than or equal to the value it would have if MIN played optimally. Hence, the value of the MAX node that is the MIN node’s parent can only be increased. This argument can be extended by a simple induction all the way to the root. If the suboptimal play by MIN is predictable, then one can do better than a minimax strategy. For example, if MIN always falls for a certain kind of trap and loses, then setting the trap guarantees a win even if there is actually a devastating response for MIN.
Source: https://www.studocu.com/en-us/document/university-of-oregon/introduction-to-artificial-intelligence/assignments/solution-2-past-exam-questions-on-computer-information-system/1052571/view