So, I have been playing around with minimax trees to create a simple computer player for a two-person board game. I understand the basics of the algorithm, but there is a case that eludes my turkey-infused brain ... what happens when MIN can win in two steps?
For example, assume a connect4/tic-tac-toe type game where only one of the two players can own a square. How do I get MAX to occupy a square solely to prevent MIN from getting it?
Let's try a simplified example (shown in nifty ASCII art) where the options are Left and Right. Assume the tree is too big to traverse all the way to the terminal states, so intermediate values are calculated by a heuristic function (marked with * below). -INF is a terminal state where MIN wins.
          MAX (a)
          /     \
         A       B
        /         \
    MIN (b)     MIN (c)
     /  \        /  \
    A    B      A    B
   /      \    /      \
-INF      *5  *22     *20
MIN is going to pick action A in state (b) for a score of -INF
MIN is going to pick action B in state (c) for a score of +20
MAX is going to pick action B in state (a) for a score of +20
The problem - of course - is that if MAX picks B then MIN will perform action A (since that square is still available) and thus MIN will win. I need to get MAX to realize the value of picking action A in state (a) to prevent MIN from getting -INF in the next move.
I could put a bunch of tests into the code to check whether MIN can win, but it seems to me that the algorithm should take care of this. I think I am missing a piece in how the value is determined with regard to MAX, and that is causing this.
(Edited to clarify)
Each node in the minimax tree is a complete game state. When a player picks an action, the game moves to that state, restricting the actions of both players (there is no way to pick another action from another branch). So in your example, if player MAX picks action B in state (a), the game is now in state (c). The only two choices for the MIN player at this point are A (22) and B (20). The depth of the tree does not matter; the MAX and MIN players will always pick their best action from the current state of the game.
For the game of tic-tac-toe, each state needs to be a complete board (a feasible one, of course). For example, the first level will be each possible place X could put their marker. Then each child of those states will be each possible place O can put theirs, given the parent state (where X placed), etc.
Heuristics are helpful when you cannot represent the entire game tree (e.g., chess), but they don't change how the minimax tree is used.
I think the problem is with the heuristic function. As you say, if MAX chooses B in state (a),
MIN will perform action A (since that square is still available) and thus MIN will win
but in the tree you mark this state with *22, not -INF as it should be (MIN wins).
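To make that concrete, here is a minimal depth-limited minimax sketch in Python. The Game interface (legal_moves, apply, is_won, is_terminal, heuristic) is a hypothetical placeholder, not anything from the question. The key point is that the evaluation function must check for terminal wins and losses before falling back to the positional heuristic, so a leaf like the one marked *22 gets scored -INF:

```python
import math

def evaluate(game, state):
    # Terminal checks must run BEFORE the positional heuristic; the bug in
    # the question is that a state where MIN has already won was scored *22.
    if game.is_won(state, "MAX"):
        return math.inf
    if game.is_won(state, "MIN"):
        return -math.inf
    return game.heuristic(state)

def minimax(game, state, depth, maximizing):
    if depth == 0 or game.is_terminal(state):
        return evaluate(game, state)
    player = "MAX" if maximizing else "MIN"
    values = [minimax(game, game.apply(state, move, player),
                      depth - 1, not maximizing)
              for move in game.legal_moves(state, player)]
    return max(values) if maximizing else min(values)
```

With that fix, the branch under (c)'s action A propagates -INF up to MIN, MIN's value for (c) becomes -INF, and MAX at (a) correctly prefers action A.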
I've been trying to apply MCTS to card games. Basically, I need a formula, or a modification of the UCB formula, that works best when selecting which node to proceed with.
The problem is, these card games are not win/loss games; they have a score distribution at each node, like 158:102 for example. We have 2 teams, so basically it is a 2-player game. The games I'm testing are constant-sum games (number of tricks, or some score from the taken tricks, and so on).
Let's say the maximum sum of team A's and team B's scores is 260 at each leaf. I search for the best move from the root, and the first one I try gives an average of 250 after 10 tries. I have 3 more possible moves that have never been tested. Because 250 is so close to the maximum score, the regret factor for testing another move is very high. But what is the (ideally mathematically proven) optimal formula that tells you which move to choose when you have:
Xm - average score for move m
Nm - number of tries for move m
MAX - maximum score that can be made
MIN - minimum score that can be made
Obviously, the more you try the same move, the more you want to try the other moves; but the closer you are to the maximum score, the less you want to try the others. What is the best mathematical way to choose a move based on these factors Xm, Nm, MAX, MIN?
Your problem is obviously an exploration problem, and the issue is that with the Upper Confidence Bound (UCB), the exploration cannot be tuned directly. This can be solved by adding an exploration constant.
The Upper Confidence Bound (UCB) is calculated as follows:

$$UCB(s, a) = V(s, a) + \sqrt{\frac{2 \ln n(s)}{n(s, a)}}$$
with $V$ being the value function (expected score) you are trying to optimize, $s$ the state you are in (the cards in the hands), and $a$ the action (playing a card, for example). $n(s)$ is the number of times state $s$ has been visited in the Monte Carlo simulations, and $n(s, a)$ the same for the combination of $s$ and action $a$.
The left part ($V(s, a)$) exploits knowledge of the previously obtained scores, while the right part adds a bonus that encourages exploration. However, there is no way to increase or decrease this exploration bonus, and that is what the Upper Confidence Bounds for Trees (UCT) variant addresses:

$$UCT(s, a) = V(s, a) + 2 C_p \sqrt{\frac{\ln n(s)}{n(s, a)}}$$
Here $C_p > 0$ is the exploration constant, which can be used to tune the exploration. It was shown that

$$C_p = \frac{1}{\sqrt{2}}$$

satisfies Hoeffding's inequality when the rewards (scores) lie in $[0, 1]$.
Silver & Veness propose $C_p = R_{hi} - R_{lo}$, with $R_{hi}$ being the highest value returned using $C_p = 0$ and $R_{lo}$ the lowest value observed during the rollouts (i.e., when you choose actions randomly because no value function has been computed yet).
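As a rough illustration, UCT action selection might look like this in Python. The stats structure (a hypothetical mapping from action to total score and visit count) is my assumption, not something from the papers above:

```python
import math

def uct_score(v_sa, n_s, n_sa, cp):
    # V(s,a) exploitation term plus the Cp-weighted exploration bonus.
    if n_sa == 0:
        return math.inf  # untested actions are always tried at least once
    return v_sa + 2.0 * cp * math.sqrt(math.log(n_s) / n_sa)

def select_action(stats, cp):
    # stats: hypothetical dict mapping action -> (total_score, visit_count)
    n_s = sum(visits for (_, visits) in stats.values())
    def score(action):
        total, visits = stats[action]
        mean = total / visits if visits else 0.0
        return uct_score(mean, n_s, visits, cp)
    return max(stats, key=score)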
References:
Browne, C., Powley, E. J., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., & Colton, S. (2012). A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1–43.
Silver, D., & Veness, J. (2010). Monte-Carlo Planning in Large POMDPs. Advances in Neural Information Processing Systems, 1–9.
Could anybody help explain how the following value function was generated? The problem and solution are attached; I just don't know how the solution was derived. Thank you!
STILL NEED HELP WITH THIS!!!
Since no one else has taken a stab at it, I'll present my understanding of the problem (disclaimer: I'm not an expert on reinforcement learning, and I'm posting this as an answer because it's too long for a comment).
Think of it this way: when starting at, for example, node d, a random walker has a 50% chance of jumping to either node e or node a. Each jump multiplies the reward (r) by the discount factor y (gamma in the picture). You continue jumping around until you reach the target node (f in this case), at which point you collect the reward r.
If I've understood correctly, the two smaller 3x2 grids represent the expected reward when starting from each node. Now it's obvious why every node in the first 3x2 grid has a value of 100: because y = 1, the reward never decreases. You can just keep jumping around until you eventually end up at the reward node and collect the reward of r = 100.
However, in the second 3x2 grid, every jump multiplies the reward by 0.9. So, to get the expected reward when starting from square c, you sum the rewards from the different paths, each multiplied by its probability. Going from c to f has a 50% chance and takes 1 jump, so you get r = 0.5 * 0.9^0 * 100 = 50. Then there's the path c-b-e-f: 0.5 * (1/3) * (1/3) * 0.9^2 * 100 = 4.5. Then there's c-b-c-f: 0.9^2 * 0.5^2 * (1/3)^1 * 100 = 6.75. You keep going this way until the reward from the path you're examining is insignificantly small, and sum the rewards from all the paths. This should give you the result for the corresponding node, that is, 50 + 6.75 + 4.5 + ... = 76.
I guess the programmatic way of doing this would be to use a modified DFS/BFS to explore all paths of length N or less and sum the rewards from those paths (with N chosen so that 0.9^N is negligibly small).
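Here's a small sketch of that idea under my own assumptions (the adjacency map and node names are hypothetical; I don't have the picture). It sums P(path) * gamma^(jumps-1) * reward over all paths up to a depth cutoff, matching the hand computation above, where the first jump is undiscounted:

```python
from functools import lru_cache

def expected_reward(neighbors, start, target, gamma=0.9, reward=100.0,
                    max_depth=200):
    # neighbors: hypothetical dict mapping node -> list of adjacent nodes.
    @lru_cache(maxsize=None)
    def walk(node, depth):
        if depth == max_depth:              # truncate once gamma^depth is tiny
            return 0.0
        p = 1.0 / len(neighbors[node])      # uniform random walker
        total = 0.0
        for nxt in neighbors[node]:
            if nxt == target:
                total += p * reward          # reward collected on arrival
            else:
                total += p * gamma * walk(nxt, depth + 1)
        return total
    return walk(start, 0)
```

The memoization turns the naive path enumeration into dynamic programming; it computes the same sum, and it is essentially how value iteration would evaluate these states.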
Again, take this with a grain of salt; I'm not an expert on reinforcement learning.
I am trying to find the optimal solution to a Sliding Block Puzzle of any length using the A* algorithm.
The Sliding Block Puzzle is a game with white (W) and black (B) tiles arranged on a linear board with a single empty space (-). Given the initial state of the board, the aim of the game is to arrange the tiles into a target pattern.
For example, my current state on the board is BBW-WWB and I have to achieve the BBB-WWW state.
Tiles can move in these ways:
1. slide into an adjacent empty space with a cost of 1.
2. hop over another tile into the empty space with a cost of 1.
3. hop over 2 tiles into the empty space with a cost of 2.
I have everything implemented, but I am not sure about the heuristic function. It computes, for each misplaced tile in the current state, the shortest distance (minimal cost) to the closest same-color tile in the goal state.
Considering the current state BWB-W and the goal state BB-WW, the heuristic function gives me a result of 3 (by minimal distances: B=0 + W=2 + B=1 + W=0). But the actual cost of reaching the goal is not 3 but 2 (move the misplaced W => cost 1, then the misplaced B => cost 1).
My question is: should I compute the minimal distance this way and not worry about the overestimation, or should I divide it by 2? Given the ways tiles can move, a tile can cover twice the distance for the same cost (see moves 1 and 2).
I tried both versions. While the divided distance gives a better final path cost to the goal, it visits more nodes => takes more time than the undivided one. What is the proper way to compute it? Which one should I use?
It is not obvious to me what an admissible heuristic function for this problem looks like, so I won't commit to saying, "Use the divided-by-two function." But I will tell you that the naive function you came up with is not admissible, and therefore will not reliably give you optimal answers. For A* to work properly, the heuristic used must be admissible; to be admissible, the heuristic must absolutely always give an optimistic estimate. Yours doesn't, for exactly the reason you highlight in your example.
(Although now that I think about it, dividing by two does seem like a reasonable way to force admissibility. I'm just not going to commit to it.)
Your heuristic is not admissible, so your A* is not guaranteed to find the optimal answer every time. An admissible heuristic must never overestimate the cost.
A better heuristic than dividing your distance sum by 2 would be: instead of adding the distance D of each letter to its final position, add ceil(D/2). This way, a letter 1 or 2 away gets a value of 1, a letter 3 or 4 away gets a value of 2, and so on.
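A quick sketch of that heuristic in Python (my own illustration; `state` and `goal` are strings like in the question, with '-' as the blank):

```python
import math

def heuristic(state, goal):
    # Sum ceil(D / 2) over tiles, where D is the distance from each tile to
    # the nearest goal cell of the same color. A single move covers at most
    # 2 positions per unit of cost, so this never overestimates.
    total = 0
    for i, tile in enumerate(state):
        if tile == '-':
            continue
        d = min(abs(i - j) for j, g in enumerate(goal) if g == tile)
        total += math.ceil(d / 2)
    return total
```

On the example above, heuristic('BWB-W', 'BB-WW') gives ceil(2/2) + ceil(1/2) = 2, which matches the true cost, and it dominates the divide-by-two version while staying admissible.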
First, I must say this is not homework or anything like that; it is a problem from a game called freeciv.
OK, in the game we have 'n' cities, usually (8-12); each city can have a maximum of 'k' trade routes, usually (4), and those trade routes need to span a distance of 'd' or more (8 Manhattan tiles).
The problem consists of finding the k*n trade routes with maximum (or minimum) total distance. Obviously this can be solved with a brute-force algorithm, but it is really slow when the player has more than 10 cities because the program has to make many iterations. I tried to solve it using graph theory, but I am not an expert in it; I even asked some of my teachers, and none could explain a smart algorithm to me. So I am not here for the exact solution, just the idea or the steps to analyze this.
The problem has two parts:
Calculate pair-wise distances between the cities
Select which pairs should become trade routes
I don't think the first part can be calculated faster than O(n·t), where t is the number of tiles, as each run of Dijkstra's algorithm gives you the distances from one city to all other cities. However, if I understand correctly, the distance between two cities never changes and is symmetric. So whenever a new city is built, you just need to run Dijkstra's algorithm from it and cache the distances.
For the second part, I would expect a greedy algorithm to work: order all pairs of cities by suitability and, at each step, pick the first pair that does not violate the constraint of k routes per city (see the sketch below). I am not sure whether it can be proven correct (the proof, if it exists, should be similar to the one for Kruskal's minimal spanning tree algorithm), but I suspect it will work fine in practice even if it turns out not to work in theory (I haven't tried to prove or disprove it; that's up to you).
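A sketch of that greedy selection (my own illustration; `dist` is assumed to be the cached distance map, keyed by city pair):

```python
from itertools import combinations

def pick_routes(cities, dist, k, longest=True):
    # Order all city pairs by suitability (here: distance), then greedily
    # accept each pair while neither endpoint already has k routes.
    pairs = sorted(combinations(cities, 2),
                   key=lambda pair: dist[pair], reverse=longest)
    degree = {city: 0 for city in cities}
    routes = []
    for a, b in pairs:
        if degree[a] < k and degree[b] < k:
            routes.append((a, b))
            degree[a] += 1
            degree[b] += 1
    return routes
```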
Continuing from Jan Hudec's approach:
Init Stage:
Let's say you have N cities (c1, c2, ..., cN). Build a list of connections in which each entry has the format (cX, cY, Distance) (with X < Y; this takes N^2/2 time), ordered so that the least desirable connections come first (ascending by distance if you want maximum distances, descending if you want minimum), since those are the ones that will be deleted. You should also keep an array/list holding the number of connections per city (cZ = W), initialized to N-1 for each city, because everything is connected at the beginning.
Iterations:
Iterate over the list of connections:
for each (cX, cY, D), if the connection counts (in the connection-number array) of both cX and cY are greater than k, delete (cX, cY, D) from the connection list and also decrease the values for cX and cY in the connection array by one.
In the end, you'll have the connection list with the values you wished for. A rough sketch follows below.
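The same idea expressed as code (again just a sketch; `dist` keyed by city pair is my assumption):

```python
from itertools import combinations

def prune_routes(cities, dist, k, maximize=True):
    # Start fully connected (each city has N-1 connections) and delete
    # connections, least desirable first, while both endpoints exceed k.
    order = sorted(combinations(cities, 2),
                   key=lambda pair: dist[pair], reverse=not maximize)
    degree = {city: len(cities) - 1 for city in cities}
    kept = []
    for a, b in order:
        if degree[a] > k and degree[b] > k:
            degree[a] -= 1      # delete this connection
            degree[b] -= 1
        else:
            kept.append((a, b))
    return kept
```

Note that a connection survives whenever either endpoint is already down to k, exactly as described above, so some cities can end up keeping more than k candidates for you to choose among.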
Hi
I'm confused about how you determine the utility functions used with a minimax search.
Explain it with any game that you can use a minimax search with.
Basically, I am asking: how do you determine the utility functions?
Cheers
The utility value is just some arbitrary value that the player receives when arriving at a certain state in the game. For instance, in Tic-tac-toe, your utility function could simply be 1 for a win, 0 for a tie, or -1 for a loss.
Running minimax on this would, at best, find a set of actions that results in 1 (a win).
Another example would be chess (not that you can feasibly run minimax on a full game of chess). Say your utility function is a number based on the value of the pieces you captured or lost.
Determining the utility value of a move at a certain state has to do with the experience of the programmer and his/her knowledge of the game.
Utility values at a terminal state are fairly easy to determine. In Tic-tac-toe, for instance, a terminal state for player X is when three Xs are aligned diagonally, vertically, or horizontally. Any move that creates such a state is a move to a terminal state, and you can write a function that checks for that. If it is a terminal state, the function returns 1 or -1.
If your agent is player X and, after player X's move, it determines that player O will win, then the function returns -1. The function returns 1 if it determines that its own move is a winning move.
If all cells are occupied after the last possible move and nobody has won, then the function returns zero.
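A minimal terminal check along those lines might look like this (a sketch; `board` is assumed to be a 3x3 list of 'X', 'O', or None):

```python
# All eight winning lines as lists of (row, col) coordinates.
LINES = ([[(r, c) for c in range(3)] for r in range(3)] +                # rows
         [[(r, c) for r in range(3)] for c in range(3)] +                # columns
         [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]])   # diagonals

def terminal_utility(board):
    # Returns 1/-1 for an X/O win, 0 for a full drawn board, None otherwise.
    for line in LINES:
        cells = [board[r][c] for r, c in line]
        if cells == ['X', 'X', 'X']:
            return 1
        if cells == ['O', 'O', 'O']:
            return -1
    if all(cell is not None for row in board for cell in row):
        return 0
    return None
```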
That covers terminal states only, but it is critical to evaluate intermediate states too because, even in a 3x3 game, there are lots of combinations to consider (counting symmetrical moves, there are 9! possible move sequences in Tic-tac-toe). For those intermediate cases, you need to come up with an evaluation function that returns a score for each state as it relates to the other states.
Suppose that I assign terminal-state values of 810, 0, and -810. For each game, the score would be 810 / (# of moves). So if I reach a terminal state in 6 moves, the score is 810/6 = 135; in 9 moves, it is 90. An evaluation function fashioned this way favors moves that reach a terminal state faster. However, it still only evaluates at leaf nodes. We need to evaluate before reaching a leaf node as well, and that can also be part of the evaluation function.
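For illustration, the 810 / (# of moves) scaling might look like this, building on the terminal_utility sketch above (the non-terminal branch is deliberately left open, since that is where a positional estimate or deeper search would go):

```python
def evaluate(board, moves_played):
    # Scale terminal utilities by game length so faster wins score higher:
    # a win in 6 moves scores 810/6 = 135, a win in 9 moves scores 90.
    result = terminal_utility(board)
    if result is None:
        return None   # non-terminal: search deeper or use another estimate
    return (810 // moves_played) * result if result else 0
```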
Suppose that, in the game below, player 1 is X, so X moves next. The following are the legal moves (row, column) for X:
(1) 0,0
(2) 0,2
(3) 2,0
(4) 2,1
(5) 2,2
| |O| |
|O|X|X|
| | | |
The utility value for each move should favor the best moves.
The best moves, in this case, are either (2) or (5), so an evaluation function might assign a utility value of 81 to each of those. Move (4) is the worst possible move for the X player (it would also guarantee that you lose the game against an intelligent player), so the function would assign it a value of -9. Moves (1) and (3), while not ideal, will not make you lose, so we might assign them a 1.
So when minimax evaluates those 5 moves, because your player X is MAX, the choice would be either (2) or (5).
If we focus on options (2) or (5), the game will reach a terminal state two moves after either one. So, in reality, the evaluation function should look 2 moves ahead of the current legal moves to return the utility values. (This strategy follows the lines of depth-limited search, where your function evaluates at a certain depth and produces a utility value without reaching a leaf node, i.e., a terminal state.)
Now I'll circle back to my first statement. The utility value will be determined by an evaluation function coded per the programmer's knowledge of the game.
Hopefully, I'm not confusing you...