MCTS UCT with a scoring system - artificial-intelligence

I'm trying to solve a variant of 2048 with Monte-Carlo Tree Search. I found that UCT could be a good way to get a trade-off between exploration and exploitation.
My only issue is that all the versions I've seen assume the score is a win percentage. How can I adapt it to a game where the score is the value of the board in the final state, so it ranges from 1 to MAX rather than being a win or a loss?
I could normalize the score against the constant C by dividing by MAX, but then exploration would be overweighted early in the game (when average scores are low) and exploitation overweighted late in the game.

Indeed, most of the literature assumes games are either won or lost and awards a score of 1 or 0, which turns into a win ratio when averaged over the number of games played. The exploration parameter C is then usually set to sqrt(2), which is optimal for UCB in bandit problems.
To find out what a good C is in general you have to step back a bit and see what the UCT is really doing. If one node in your tree had an exceptionally bad score in the one rollout it had then exploitation says you should never choose it again. But you've only played that node once, so it might have just been bad luck. To acknowledge this you give that node a bonus. How much? Enough to make it a viable choice even if its average score is the lowest possible and some other node has the highest average score possible. Because with enough plays it might turn out that the one rollout your bad node had was indeed a fluke, and the node actually turns out to be pretty reliable with good scores. Of course, if you get more bad scores then it will likely not be bad luck so it won't deserve more rollouts.
So with scores ranging from 0 to 1, a C of sqrt(2) is a good value. If your game has a maximum achievable score then you can normalize your scores by dividing by that maximum, forcing them into the 0-1 range to suit a C of sqrt(2). Alternatively you don't normalize the scores but multiply C by your maximum score. The effect is the same: the UCT exploration bonus is large enough to give your underdog nodes some rollouts and a chance to prove themselves.
There is an alternative way of setting C dynamically that has given me good results. As you play, you keep track of the highest and lowest scores you've ever seen in each node (and subtree). This is the range of scores possible, and it gives you a hint of how big C should be in order to give poorly explored underdog nodes a fair chance. Every time I descend into the tree and pick a new root, I adjust C to be sqrt(2) times the score range of the new root. In addition, whenever a rollout's score turns out to be a new highest or lowest score, I adjust C in the same way. By continually adjusting C this way, both as rollouts complete and as you pick a new root, you keep C as large as it needs to be to converge but as small as it can be to converge quickly. Note that the minimum score is as important as the maximum: if every rollout yields at least a certain score then C doesn't need to overcome it. Only the difference between max and min matters.
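A rough sketch of how that bookkeeping might look (the Node fields and the best_child helper are just illustrative names, not anything standard):

import math

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_score = 0.0
        self.score_min = float("inf")   # lowest rollout score seen in this subtree
        self.score_max = float("-inf")  # highest rollout score seen in this subtree

    def update(self, score):
        """Back up one rollout result and widen the observed score range."""
        self.visits += 1
        self.total_score += score
        self.score_min = min(self.score_min, score)
        self.score_max = max(self.score_max, score)

def best_child(node):
    """UCT selection with C scaled by the score range seen under this node.

    Assumes every child has been visited at least once (untried children
    are normally expanded before selection kicks in).
    """
    c = math.sqrt(2) * (node.score_max - node.score_min)  # dynamic exploration constant
    def uct(child):
        exploit = child.total_score / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=uct)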

Related

Monte Carlo Tree Search for card games like Belot and Bridge, and so on

I've been trying to apply MCTS to card games. Basically, I need a formula, or a modification of the UCB formula, that works best for selecting which node to proceed with.
The problem is that these card games are not win/loss games; each node has a score distribution, like 158:102 for example. There are 2 teams, so it is basically a 2-player game. The games I'm testing are constant-sum games (number of tricks, or some score derived from the tricks taken, and so on).
Let's say the maximum combined score of teamA and teamB is 260 at each leaf. I search for the best move from the root, and the first one I try gives an average of 250 after 10 tries. I have 3 more possible moves that have never been tested. Because 250 is so close to the maximum score, the regret of testing another move is very high. But what is the mathematically proven optimal formula that tells you which move to choose when you have:
Xm - average score for move m
Nm - number of tries for move m
MAX - maximum score that can be made
MIN - minimum score that can be made
Obviously, the more you try the same move, the more you want to try the other moves, but the closer you are to the maximum score, the less you want to try the others. What is the best mathematical way to choose a move based on these factors Xm, Nm, MAX, and MIN?
Yours is clearly an exploration problem, and the issue is that with the plain Upper Confidence Bound (UCB) the amount of exploration cannot be tuned directly. This can be solved by adding an exploration constant.
The Upper Confidence Bound (UCB) is calculated as follows:
UCB(s, a) = V(s, a) + sqrt(2 * ln n(s) / n(s, a))
with V being the value function (expected score) you are trying to optimize, s the state you are in (the cards in the hands), and a the action (playing a card, for example). n(s) is the number of times state s has been visited in the Monte Carlo simulations, and n(s, a) the same for the combination of s and action a.
The left part, V(s, a), exploits knowledge of the previously obtained scores, while the right part adds a bonus that encourages exploration. However, there is no way to increase or decrease this exploration bonus, which is what the Upper Confidence Bounds for Trees (UCT) formula adds:
UCT(s, a) = V(s, a) + 2 * Cp * sqrt(2 * ln n(s) / n(s, a))
Here Cp > 0 is the exploration constant, which can be used to tune the exploration. It was shown that
Cp = 1 / sqrt(2)
satisfies Hoeffding's inequality when the rewards (scores) are between 0 and 1 (in [0, 1]).
Silver & Veness propose Cp = Rhi - Rlo, with Rhi being the highest value returned using Cp = 0, and Rlo the lowest value observed during the rollouts (i.e. when you choose actions randomly because no value function has been calculated yet).
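Written with the question's symbols, a selection rule might look like the sketch below. This is just the UCT formula above with Cp scaled by the score range, not the proven optimum the question asks for, and the function names are illustrative:

import math

def uct_value(Xm, Nm, N, MAX, MIN):
    """UCT value of move m, written with the question's symbols.

    Xm: average score of move m, Nm: number of tries of move m,
    N: total tries at the parent node, MAX/MIN: bounds on the score.
    Cp = 1/sqrt(2) is the constant for rewards in [0, 1]; multiplying it
    by (MAX - MIN) plays the same role as normalizing the scores.
    """
    Cp = (MAX - MIN) / math.sqrt(2)
    return Xm + 2 * Cp * math.sqrt(2 * math.log(N) / Nm)

def choose_move(stats, MAX, MIN):
    """stats maps each move to (Xm, Nm); untried moves are tried first."""
    untried = [m for m, (_, Nm) in stats.items() if Nm == 0]
    if untried:
        return untried[0]
    N = sum(Nm for _, Nm in stats.values())
    return max(stats, key=lambda m: uct_value(*stats[m], N, MAX, MIN))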
Reference:
Cameron Browne, Edward J. Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis and Simon Colton. A Survey of Monte Carlo Tree Search Methods. IEEE Trans. Comp. Intell. AI Games, 4(1):1–43, 2012.
Silver, D., & Veness, J. (2010). Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems, 1–9.

Does the min player in the minimax algorithm play optimally?

In the minimax algorithm, the first player plays optimally, which means it wants to maximise its score, while the second player tries to minimise the first player's chances of winning. Does this mean that the second player also plays optimally to win the game? Does choosing a path that minimises the first player's chances of winning also mean trying to win?
I am actually trying to solve this task from TopCoder: EllysCandyGame. I wonder whether we can apply the minimax algorithm here. The statement "both to play optimally" really confuses me, and I would like some advice on how to deal with this type of problem, if there is a general idea.
Yes, you can use the minimax algorithm here.
The problem statement says that the winner of the game is "the girl who has more candies at the end of the game." So one reasonable scoring function you could use is the difference in the number of candies held by the first and second player.
Does this mean that the second player also plays optimally to win the game?
Yes. When you are evaluating a MIN level, the MIN player will always choose the path with the lowest score for the MAX player.
Note: both the MIN and MAX levels can be implemented with the same code, if you evaluate every node from the perspective of the player making the move in that round, and convert scores between levels. If the score is a difference in number of candies, you could simply negate it between levels.
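A rough negamax sketch of that idea. The box-taking rules here are deliberately simplified stand-ins (each player just takes all the candy in one non-empty box), not the actual EllysCandyGame rules; the point is only to show the structure:

def negamax(boxes, my_candies, opp_candies):
    """Best candy difference the player to move can force.

    Every level is scored from the viewpoint of the player to move and
    negated on the way up, so MIN and MAX share one code path.
    """
    if not any(boxes):                          # no moves left: evaluate the final position
        return my_candies - opp_candies
    best = float("-inf")
    for i, candy in enumerate(boxes):
        if candy == 0:
            continue
        rest = boxes[:i] + (0,) + boxes[i + 1:]
        # the opponent moves next, so negate the value of their best reply
        best = max(best, -negamax(rest, opp_candies, my_candies + candy))
    return best

# With boxes (3, 1, 2) optimal play gives the first player 3 + 1 = 4 and
# the second player 2, so this prints 2.
print(negamax((3, 1, 2), 0, 0))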
Trying to choose some path in order to minimize the first player's chances of winning also means trying to win?
Yes. The second player is trying to minimize the first player's score. A reasonable scoring function will give the first player a lower score for a loss than a tie.
I wonder whether we can apply the minimax algorithm here.
Yes. If I've read the problem correctly, the number of levels will be equal to the number of boxes. If there's no limit on the number of boxes, you'll need to use an n-move lookahead, evaluating nodes in the minimax tree to a maximum depth.
Properties of the game:
At each point, there are a limited, well defined number of moves (picking one of the non-empty boxes)
The game ends after a finite number of moves (when all boxes are empty)
As a result, the search tree consists of a finite number of leaves. You are right that by applying Minimax, you can find the best move.
Note that you only have to evaluate the game at the final positions (when there are no more moves left). At that point, there are only three results: The first player won, the second player won, or it is a draw.
Note that the standard Minimax algorithm has nothing to do with probabilities. The result of the Minimax algorithm determines perfect play for both sides (assuming that neither side makes a mistake).
By the way, if you need to improve the search, a safe and simple optimization is to apply alpha-beta pruning.

Minimax: what happens if MIN does not play optimally?

The description of the minimax algorithm says that both players have to play optimally for the algorithm to be optimal. Intuitively that is understandable. But could anyone make it concrete, or prove, what happens if MIN does not play optimally?
Thanks
The definition of "optimal" is that you play so as to minimize the "score" (or whatever you measure) of your opponent's optimal answer, which is defined by the play that minimizes the score of your optimal answer and so forth.
Thus, by definition, if you don't play optimally, your opponent has at least one path that will give him a higher score than his best score if you had played optimally.
One way to find out what is optimal is to brute-force the entire game tree. For anything beyond trivial problems you can use alpha-beta search, which guarantees the optimum without needing to search the entire tree. If your tree is still too complex, you need a heuristic that estimates the score of a "position" and halts the search at a certain depth.
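A compact sketch of that last step, depth-limited alpha-beta with a heuristic evaluation (the children and evaluate callbacks are placeholders you would supply for your own game):

import math

def alphabeta(state, depth, alpha, beta, maximizing, children, evaluate):
    """Depth-limited alpha-beta search.

    children(state) returns the list of successor positions and
    evaluate(state) estimates a position from MAX's point of view.
    Branches that cannot influence the result under optimal play are pruned.
    """
    succ = children(state)
    if depth == 0 or not succ:
        return evaluate(state)
    if maximizing:
        value = -math.inf
        for s in succ:
            value = max(value, alphabeta(s, depth - 1, alpha, beta, False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:   # MIN already has a better alternative higher up
                break
        return value
    else:
        value = math.inf
        for s in succ:
            value = min(value, alphabeta(s, depth - 1, alpha, beta, True, children, evaluate))
            beta = min(beta, value)
            if beta <= alpha:   # MAX already has a better alternative higher up
                break
        return value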
Was that understandable?
I was having problems with that precise question.
When you think about it for a bit you will get the idea that the minimax graph contains ALL possible games, including the bad ones. So if a player plays a suboptimal game, that game is part of the tree, but it has been discarded in favor of a better game.
It's similar to alpha-beta. I was getting stuck on what happens if I sacrifice some pieces intentionally to create space and then make a winning move through the gap, i.e. there is a better move further down the tree.
With alpha-beta, let's say a sequence of losing moves followed by a killer move is in fact in the tree. In that case alpha and beta act as a window filter, a < x < b, and would have discarded it if YOU had a better game. You can see this in alpha-beta if you imagine putting +/- infinity into a pruned branch to see what happens.
In any case, both algorithms recalculate every move, so if a player plays a suboptimal game, that will open up branches of the graph that are better for the opponent.
Rinse, repeat.
Consider a MIN node whose children are terminal nodes. If MIN plays suboptimally, then the value of the node is greater than or equal to the value it would have if MIN played optimally. Hence, the value of the MAX node that is the MIN node’s parent can only be increased. This argument can be extended by a simple induction all the way to the root. If the suboptimal play by MIN is predictable, then one can do better than a minimax strategy. For example, if MIN always falls for a certain kind of trap and loses, then setting the trap guarantees a win even if there is actually a devastating response for MIN.
Source: https://www.studocu.com/en-us/document/university-of-oregon/introduction-to-artificial-intelligence/assignments/solution-2-past-exam-questions-on-computer-information-system/1052571/view

Determining which inputs to weigh in an evolutionary algorithm

I once wrote a Tetris AI that played Tetris quite well. The algorithm I used (described in this paper) is a two-step process.
In the first step, the programmer decides to track inputs that are "interesting" to the problem. In Tetris we might be interested in tracking how many gaps there are in a row because minimizing gaps could help place future pieces more easily. Another might be the average column height because it may be a bad idea to take risks if you're about to lose.
The second step is determining weights associated with each input. This is the part where I used a genetic algorithm. Any learning algorithm will do here, as long as the weights are adjusted over time based on the results. The idea is to let the computer decide how the input relates to the solution.
Using these inputs and their weights we can determine the value of taking any action. For example, if putting the straight line shape all the way in the right column will eliminate the gaps of 4 different rows, then this action could get a very high score if its weight is high. Likewise, laying it flat on top might actually cause gaps and so that action gets a low score.
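To illustrate that second step, here is a small evaluation sketch with two made-up features and learned weights. The feature choices and helper names are only for the example; the real feature set and weights would come from the process described in the paper:

def holes(board):
    """Count empty cells with at least one filled cell above them."""
    count = 0
    for col in zip(*board):                  # board is a list of rows, top row first
        seen_block = False
        for cell in col:
            if cell:
                seen_block = True
            elif seen_block:
                count += 1
    return count

def average_height(board):
    heights = []
    for col in zip(*board):
        filled = [i for i, cell in enumerate(col) if cell]
        heights.append(len(col) - filled[0] if filled else 0)
    return sum(heights) / len(heights)

def evaluate(board, weights):
    """Score a candidate placement; the weights are what the GA evolves."""
    features = [holes(board), average_height(board)]
    return sum(w * f for w, f in zip(weights, features))

# Pick the placement whose resulting board scores highest under the learned weights:
# best = max(candidate_boards, key=lambda b: evaluate(b, weights))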
I've always wondered if there's a way to apply a learning algorithm to the first step, where we find "interesting" potential inputs. It seems possible to write an algorithm where the computer first learns what inputs might be useful, then applies learning to weigh those inputs. Has anything been done like this before? Is it already being used in any AI applications?
In neural networks, you can select 'interesting' potential inputs by finding the ones that have the strongest correlation, positive or negative, with the classifications you're training for. I imagine you can do similarly in other contexts.
I think I might approach the problem you're describing by feeding more primitive data to a learning algorithm. For instance, a Tetris game state may be described by the list of occupied cells, and a string of bits describing this information would be a suitable input to that stage of the learning algorithm. Actually training on that is still challenging, though: how do you know whether the results are useful? I suppose you could roll the whole algorithm into a single blob, where the algorithm is fed the successive states of play and the output is just the block placements, with higher-scoring algorithms selected for future generations.
Another choice might be to use a large corpus of plays from other sources, such as recorded plays from human players or a hand-crafted AI, and select the algorithms whose outputs bear a strong correlation to some interesting fact or other about the future play, such as the score earned over the next 10 moves.
Yes, there is a way.
If you choose M candidate features there are 2^M subsets, so there is a lot to look at.
I would do the following:
For each subset S:
    run your code to optimize the weights W
    save S and the corresponding W
Then for each S-W pair, you can run G games and save the score L for each one. Now you have a table like this:
feature1  feature2  feature3  featureM  subset_code  game_number  scoreL
1         0         1         1         S1           1            10500
1         0         1         1         S1           2             6230
...
0         1         1         0         S2           G + 1        30120
0         1         1         0         S2           G + 2        25900
Now you can run some component selection algorithm (PCA, for example) and decide which features are worth keeping to explain scoreL.
A tip: When running the code to optimize W, seed the random number generator, so that each different 'evolving brain' is tested against the same piece sequence.
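A rough sketch of that loop (optimize_weights and play_game stand in for your own GA and game code):

from itertools import combinations
import random

def score_feature_subsets(features, optimize_weights, play_game,
                          games_per_subset=20, seed=42):
    """Build the score table described above, one row per game.

    optimize_weights(subset) and play_game(subset, weights, rng) are
    placeholders; each row is (subset, game_number, score).
    """
    rows = []
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            weights = optimize_weights(subset)
            rng = random.Random(seed)   # same piece sequence for every subset, per the tip above
            for g in range(games_per_subset):
                rows.append((subset, g + 1, play_game(subset, weights, rng)))
    return rows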
I hope this helps!

How to detect aggressiveness of a player in a game?

I am writing a game for gameboy advance and I am implementing basic AI in the form of a binary search tree. There is 1 human player and 1 computer player. I need to figure out a way of telling how aggressive the human player is being in attacking the computer. The human will push a button to attack (and must be in a certain radius of the computer), so my first thought was to see how big the number of attacks was compared to the number of iterations my main while loop had gone through. This seems like a poor way to do it though, because that number will depend on the frame rate, which can fluctuate. Does anyone have an idea for a better way to do this?
Why not just keep a count of the number of moments the human player can attack the computer player? I mean, you obviously have to check to see if he's in range, so why not check whether he's in range, and if he is, increment the could_have_attacked counter. Then if he actually does attack, increment an attacked counter. Then you can compare how often the human player attacks to how often he has the opportunity.
Alternatively, compare how often he is in range to how often he is not in range, or more generally keep track of the average distance between the players. More aggressive players will probably stay a lot closer to their opponents than passive players.
Your best bet will probably be a combination of different methods. Keep several different measurements, and try several different ways of weighing the methods against each other, and use extensive playtesting to pick which one is better.
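A small sketch of the counter idea, measuring time rather than loop iterations so the frame rate doesn't matter (the class and method names are illustrative):

class AggressionTracker:
    """Tracks aggression by elapsed time, not main-loop iterations."""

    def __init__(self):
        self.total_time = 0.0      # seconds of play observed
        self.in_range_time = 0.0   # seconds the player could have attacked
        self.attacks = 0           # attack button presses while in range

    def update(self, dt, in_range, attack_pressed):
        """Call once per frame; dt is the frame's elapsed time in seconds."""
        self.total_time += dt
        if in_range:
            self.in_range_time += dt
            if attack_pressed:
                self.attacks += 1

    def attack_rate(self):
        """Attacks per second of opportunity; higher means more aggressive."""
        return self.attacks / self.in_range_time if self.in_range_time else 0.0

    def proximity_ratio(self):
        """Fraction of time spent within attack range."""
        return self.in_range_time / self.total_time if self.total_time else 0.0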
You have to define what aggressive is. For example, you could say aggressive is a player that pursues the AI, so you would check whether he is in range more often than he is out of range. Or you could say aggressive is a player that attacks every chance he gets, so you would check the ratio of successful attacks to the amount of time the player has to attack. Or you could say an aggressive player will attack when he is injured. Once you define what "aggressive" means in your game world, you can establish key performance indicators of when the player falls under that definition.
You can use a pseudorandom generator for that. Say you have an rnd() function that returns a value from 0 to 1, where a greater value corresponds to higher aggressiveness. You introduce a coefficient from 0 to 1. Each time you need to determine whether the character should attack, you call rnd() and compare the value with the coefficient. If it's greater, attack; otherwise don't.
I thought of something along the lines of counting the average time between combats and counting the average rate of fire while in combat. To do so, you would need 5 variables like this:
float avgTimeWithoutCombat;     // running average of peaceful time between combats
float currentTimeWithoutCombat; // time elapsed since the last combat ended
float avgRateOfFire;            // average shots per second during combat
int   shotsFired;               // shots fired in the current combat
float combatTimeElapsed;        // duration of the current combat
As long as the player does not fight, you count the "peaceful" time in currentTimeWithoutCombat. When he fires, add it to avgTimeWithoutCombat and divide it by 2 to get the average.
Every time the player attacks, increase the shotsFired count. Also, keep counting the duration of the combat. In the main loop, check whether a certain amount of time without any attacks has elapsed. If so, the combat has ended and you can calculate avgRateOfFire.
Now you have two values to measure the aggressiveness of the player:
avgTimeWithoutCombat: The bigger this number, the less aggressive the player is.
avgRateOfFire: The bigger this number, the more aggressive the player.
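Putting that bookkeeping together, a rough sketch of the update logic (the combat timeout is a value you would tune, and the averaging mirrors the simple halving scheme described above):

class CombatStats:
    """Bookkeeping for the two aggressiveness measures described above."""

    def __init__(self, combat_timeout=3.0):
        self.combat_timeout = combat_timeout        # seconds without shots that end a combat
        self.avg_time_without_combat = 0.0
        self.current_time_without_combat = 0.0
        self.avg_rate_of_fire = 0.0
        self.shots_fired = 0
        self.combat_time_elapsed = 0.0
        self.time_since_last_shot = None            # None means "not in combat"

    def update(self, dt, fired):
        """Call once per frame; dt is the frame time in seconds."""
        in_combat = self.time_since_last_shot is not None
        if fired:
            if not in_combat:
                # a peaceful stretch just ended: fold it into the running average
                self.avg_time_without_combat = (
                    self.avg_time_without_combat + self.current_time_without_combat) / 2
                self.current_time_without_combat = 0.0
                self.combat_time_elapsed = 0.0
                self.shots_fired = 0
            self.shots_fired += 1
            self.time_since_last_shot = 0.0
        elif in_combat:
            self.time_since_last_shot += dt

        if self.time_since_last_shot is not None:
            self.combat_time_elapsed += dt
            if self.time_since_last_shot >= self.combat_timeout:
                # no attacks for a while: the combat has ended
                rate = self.shots_fired / self.combat_time_elapsed
                self.avg_rate_of_fire = (self.avg_rate_of_fire + rate) / 2
                self.time_since_last_shot = None
        else:
            self.current_time_without_combat += dt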
