How to use MinMax trees with Q-Learning?
I want to implement a Q-Learning connect four agent and heard that adding MinMax trees into it helps.
Q-learning is a temporal difference (TD) learning algorithm: for every possible state (board), it learns the value of the available actions (moves). On its own, however, it is not well suited to Minimax, because the Minimax algorithm needs an evaluation function that returns the value of a position, not the value of an action at that position.
However, temporal difference methods can be used to learn such an evaluation function. Most notably, Gerald Tesauro used the TD(λ) ("TD lambda") algorithm to create TD-Gammon, a human-competitive Backgammon playing program. He wrote an article describing the approach, which you can find here.
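To make the idea concrete, here is a minimal sketch of a TD(λ) update for a linear state-value function, in the spirit of (but not taken from) Tesauro's setup; the feature function and the hyperparameters are illustrative assumptions:

```python
import numpy as np

# Minimal TD(lambda) sketch for a linear value function V(s) = w . phi(s).
# phi, alpha, gamma and lam are illustrative assumptions, not Tesauro's values.
def td_lambda_episode(states, rewards, w, phi, alpha=0.01, gamma=1.0, lam=0.7):
    trace = np.zeros_like(w)                        # eligibility trace
    for t in range(len(states) - 1):
        v = w @ phi(states[t])
        v_next = w @ phi(states[t + 1])
        delta = rewards[t] + gamma * v_next - v     # TD error for this transition
        trace = gamma * lam * trace + phi(states[t])
        w = w + alpha * delta * trace               # credit earlier states via the trace
    return w
```

After training, `w @ phi(board)` gives exactly the kind of position evaluation a Minimax search needs at its leaves.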
TD(λ) was later extended to TDLeaf(λ), specifically to better deal with Minimax searches. TDLeaf(λ) has been used, for example, in the chess program KnightCap. You can read about TDLeaf in this paper.
Minimax allows you to look a number of moves into the future and play so as to maximize your chances of winning within that horizon. This is good for Connect 4, where a game can end at almost any moment and the number of moves available at each turn is not very large. Q-learning would provide you with a value function to guide the Minimax search.
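As a rough illustration of that combination, here is a sketch of a depth-limited Minimax that falls back on a learned value function at the search horizon; `is_terminal`, `terminal_score`, `legal_moves` and `apply_move` are hypothetical helpers you would write for Connect 4:

```python
def minimax(board, depth, maximizing, value_fn):
    # is_terminal, terminal_score, legal_moves and apply_move are assumed helpers.
    if is_terminal(board):
        return terminal_score(board)        # exact win/draw/loss score
    if depth == 0:
        return value_fn(board)              # learned evaluation at the horizon
    children = (minimax(apply_move(board, m), depth - 1, not maximizing, value_fn)
                for m in legal_moves(board))
    return max(children) if maximizing else min(children)
```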
Littman combined Minimax with Q-learning, proposing the Minimax-Q learning algorithm in his famous and pioneering paper "Markov games as a framework for multi-agent reinforcement learning". His work is on zero-sum games in multi-agent settings. Hu & Wellman later extended it to develop Nash Q-learning, which you can find here.
Related
I just read about Q-learning and I'm not sure if I understand it correctly. All the examples I've seen are rat-in-a-maze scenarios, where the rat must move towards the cheese and the cheese doesn't move.
I'm just wondering if it's possible to do Q-learning in a situation where both the mouse and the cheese move (so one agent chases and the other runs away).
If Q-learning doesn't work in that situation, do we have any other algorithms (greedy or non-greedy) that work?
Also, is there a formal/academic name for this situation? I'd like to search for papers that talk about it, but I can't find its formal/academic name.
Thank you so much!
Standard RL algorithms train a single agent to learn a policy. In problems that involve multiple actors, such as a mouse and a cheese, one actor (the mouse) would learn a policy using an RL algorithm and the other actor (the cheese) would be guided by some AI that is not RL. If both the mouse and the cheese are RL agents, then you're looking at multi-agent RL. Here is a nice framework for it: https://github.com/PettingZoo-Team/PettingZoo/
Q-learning is probably the most popular RL technique for beginners, but in its basic tabular form it can only solve fairly simple problems with a discrete state space, such as a 2D maze. It is not very effective on problems with a continuous state space, even simple ones such as Cartpole: it might solve them, but it would take much longer than other RL methods. Q-learning combined with a neural network, however, can be very powerful, as demonstrated by methods such as the deep Q-network (DQN) and double DQN.
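For reference, a minimal tabular Q-learning loop looks roughly like the sketch below; the environment interface (`reset()` returning a hashable state, `step(action)` returning `(next_state, reward, done)`) is an assumption for illustration, not a specific library's API:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                                          # Q[(state, action)]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)                     # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])  # exploit
            next_state, reward, done = env.step(action)             # assumed interface
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```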
I have developed a simple player for the game of Hex based on Monte Carlo Tree Search. Now I want to extend it using RAVE (Rapid Action Value Estimation) and LGR (Last Good Reply). The articles are here and here.
I was wondering if anyone here has used any of these methods to improve the tree search performance and could help me understand it?
I also want to know why these algorithms are called AMAF (All Moves As First) heuristics.
In Monte Carlo Tree Search for games, which draws on reinforcement learning ideas, there are two common ways of back-propagating simulation results: UCT and AMAF.
The UCT method back-propagates only along the path traversed during the selection phase: only the nodes actually visited during selection are updated, exactly at their own states. In AMAF, by contrast, the moves played during the roll-out phase are also recorded, and in the back-propagation phase they are credited to the nodes along the selection path as well, without regard to the state in which each move was played.
UCT gives a very precise, local estimate of a (state, action) pair's value, but it converges slowly. The AMAF heuristic, on the other hand, converges very quickly, but its (state, action) estimates are too general to be fully reliable.
We can get the benefits of both strategies by blending the two values, with the weight on AMAF decreasing as a node accumulates visits:
a * UCT + (1 - a) * AMAF
This is the RAVE (Rapid Action Value Estimation) heuristic.
Last Good Reply is AMAF-based and can also benefit from RAVE. Its general idea is that during the playout phase, whenever a reply to a particular opponent move turns out to be successful, we store that reply and try it again the next time the same opponent move appears in later playouts.
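A rough sketch of how the blended value might be computed during selection is shown below; the statistics stored on each node and the β schedule are assumptions (the schedule is one common choice), not the only way to do it:

```python
import math

def rave_value(node, c=1.4, rave_bias=300.0):
    """Blend UCT and AMAF statistics for one child node (sketch).

    Assumes node carries: visits, wins (normal back-propagation),
    amaf_visits, amaf_wins (AMAF back-propagation), and parent.visits,
    and that node.visits > 0 (unvisited children are handled separately).
    """
    uct = node.wins / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)
    amaf = node.amaf_wins / node.amaf_visits if node.amaf_visits else 0.0
    beta = math.sqrt(rave_bias / (3 * node.visits + rave_bias))  # AMAF weight shrinks with visits
    return (1 - beta) * uct + beta * amaf
```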
I need to implement an intelligent agent to play the Abalone game; for this kind of game, the best way to proceed seems to be a min-max strategy with alpha-beta pruning.
I have already implemented a naive search algorithm that uses min-max with pruning.
My problem is...
How to generate the nodes of the tree where perform the search?
I have no idea of the right way to do this, nor how to assign the weight to each node.
For generating the tree nodes: You need to implement a method that returns a collection of all possible legal moves given the current board position and the player whose turn it is. All these moves will become children of the node representing the current board position. Repeat this to generate the game tree until memory is exhausted ;) or, rather, until you reach a reasonable tree depth.
For alpha-beta search you also need an evaluation function which calculates the weight for each board position/node. You can do some research or think about such a function yourself, maybe considering the number of stones still on the board. However, a bad evaluation function can seriously screw up your results, so take care and run a lot of tests.
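For orientation, here is a sketch of how the move generator and the evaluation function plug into alpha-beta; `generate_moves`, `apply_move` and `evaluate` are hypothetical helpers you would write for Abalone:

```python
def alphabeta(board, depth, alpha, beta, maximizing):
    # generate_moves, apply_move and evaluate are assumed helpers.
    moves = generate_moves(board, maximizing)
    if depth == 0 or not moves:
        return evaluate(board)                # e.g. stones on the board, cohesion
    if maximizing:
        best = float("-inf")
        for m in moves:
            best = max(best, alphabeta(apply_move(board, m), depth - 1, alpha, beta, False))
            alpha = max(alpha, best)
            if beta <= alpha:
                break                         # beta cut-off
        return best
    best = float("inf")
    for m in moves:
        best = min(best, alphabeta(apply_move(board, m), depth - 1, alpha, beta, True))
        beta = min(beta, best)
        if beta <= alpha:
            break                             # alpha cut-off
    return best
```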
If you have trouble coming up with a reasonable evaluation function, I recommend you take a look into Monte-Carlo techniques such as UCT.
http://en.wikipedia.org/wiki/Monte_Carlo_tree_search
These tackle the problem using a probabilistic approach and have some nice advantages over alpha-beta. Also they don't require an evaluation function so you could skip this step.
Good luck!
I have published two libraries for move generation in Abalone. You didn't mention the programming language used for your search implementation, but you can easily port the functions.
For C++, https://sourceforge.net/projects/abnet/
For Python, https://gitlab.com/peer.sommerlund/haliotis
For an evaluation function, distance between all your marbles, or distance to their gravity center (same thing), works nicely. Tino Werner used this with a twist for his program that won ICGA 2003.
For understanding distance when using hex coordinates, I can recommend Amit Patel's page: https://www.redblobgames.com/grids/hexagons/
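As an illustration of that kind of evaluation term, here is a sketch that scores the cohesion of one player's marbles as the negated total distance to their gravity centre, assuming marbles are stored as cube coordinates (x + y + z = 0) in the style of that page; the representation and the lack of weighting are assumptions:

```python
def cube_distance(a, b):
    # Hex-grid distance between two cube coordinates.
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]), abs(a[2] - b[2]))

def cohesion(marbles):
    """Negated sum of distances from each marble to the marbles' gravity centre.

    marbles: list of (x, y, z) cube coordinates. The centre may be fractional,
    so the distance is approximate, which is fine for an evaluation term.
    """
    n = len(marbles)
    center = tuple(sum(m[i] for m in marbles) / n for i in range(3))
    return -sum(cube_distance(m, center) for m in marbles)
```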
I need to implement a Minesweeper solver. I have started implementing a rule-based agent.
I have implemented certain rules, and I have a heuristic function for choosing the best-matching rule for the cell currently being treated (using information about its surrounding cells). For each chosen cell, the agent can decide, for each of the 8 surrounding cells, whether to open it, mark it, or do nothing. In other words, at the moment the agent gets some revealed cell as input and decides what to do with its neighbouring cells, but it does not yet know how to decide which cell to treat next.
My question is: what algorithm should I implement for deciding which cell to treat next?
Suppose that, for the first move, the agent reveals a corner cell (or some other cell, according to some rule for the first move). What should it do after that?
I understand that I need to implement some kind of search. I know many search algorithms (BFS, DFS, A*, and others); that is not the problem, I just do not understand how I can use these searches here.
I need to implement it following the principles of Artificial Intelligence: A Modern Approach.
BFS, DFS, and A* are probably not appropriate here. Those algorithms are good if you are trying to plan out a course of action when you have complete knowledge of the world. In Minesweeper, you don't have such knowledge.
Instead, I would suggest trying to use some of the logical inference techniques from Section III of the book, particularly using SAT or the techniques from Chapter 10. This will let you draw conclusions about where the mines are using facts like "one of the following eight squares is a mine" and "exactly two of the following eight squares are mines". Doing this at each step will help you identify where the mines are, or realize that you must guess before continuing.
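One very simple, brute-force way to implement that kind of inference is to enumerate the mine assignments over the unknown frontier cells that satisfy every constraint; the data layout below is an assumption for illustration, and the enumeration is only feasible while the frontier is small:

```python
from itertools import combinations

def certain_cells(constraints, frontier):
    """constraints: list of (cells, count) facts, e.g. "exactly 2 of these cells are mines",
    where every cell mentioned is an unrevealed, unflagged cell in `frontier`.
    Returns (certain_mines, certain_safe)."""
    frontier = sorted(frontier)
    consistent = []
    for k in range(len(frontier) + 1):
        for mines in combinations(frontier, k):
            mines = set(mines)
            if all(len(mines & set(cells)) == count for cells, count in constraints):
                consistent.append(mines)
    if not consistent:
        return set(), set()
    certain_mines = set.intersection(*consistent)          # mined in every consistent world
    certain_safe = set(frontier) - set.union(*consistent)  # mined in no consistent world
    return certain_mines, certain_safe
```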
Hope this helps!
I ported this (with a bit of help). Here is the link to it working: http://robertleeplummerjr.github.io/smartSweepers.js/ . Here is the project: https://github.com/robertleeplummerjr/smartSweepers.js
Have fun!
I'm creating a logic game based on the Fox and Hounds game. The player plays the fox and the AI plays the hounds. As far as I can see, I managed to make the AI perfect, so it never loses. Leaving it as such would not be much fun for human players.
Now I have to dumb down the AI so a human can win, but I'm not sure how. The current AI logic is based on pattern matching: if I introduce random moves that take the board out of the pattern space, the AI would most probably play dumb for the rest of the game.
I'm also thinking about removing a set of patterns, so it would seem as if the AI does not know that "trick", but this way players could find a way to beat the computer using the same moves every time.
Any ideas how to dumb down the AI in such a way that it does not go from "genius" to "completely dumb" in a single move?
We used MinMax as the AI algorithm for our game and implemented the AI levels by setting a different search depth for each level.
I ended up creating a couple of quasi-smart pattern plays (as a 10-year-old might play), so it is not completely dumb, and then I pick one or two of those at random before the game starts. This way the game is always beatable, but the player does not know how (i.e. he cannot use the same strategy to always win, he has to explore for a weak spot first).
If your game is a zero-sum one, the MiniMax algorithm with an alpha-beta optimization is a good choice. You can create difficulty levels by making the search stop when the algorithm reaches a certain depth.
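As a sketch of what that might look like in code (the level-to-depth numbers are arbitrary, and `search` / `apply_move` are placeholders for your existing depth-limited alpha-beta and move application):

```python
# Arbitrary example mapping of difficulty levels to search depth.
DIFFICULTY_DEPTH = {"easy": 1, "medium": 3, "hard": 6}

def choose_move(board, legal_moves, difficulty, search, apply_move):
    depth = DIFFICULTY_DEPTH[difficulty]
    # search(child, depth) is assumed to return the value of the resulting
    # position from the moving player's point of view.
    return max(legal_moves, key=lambda m: search(apply_move(board, m), depth))
```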
Since I'm not an expert game developer and my AI is actually too dumb at the moment, I'd make a sort of 'learning AI': say you keep all your patterns (regexes) disabled, and you enable each one once the player has used it.
I think this is a zero-sum game, and you can use the MinMax algorithm to solve it instead of pattern matching; that way, by controlling the search depth, you can control the level of expertise of the agent.
On the other hand, you can use A* search to determine the best move for a given fox/hound, and by choosing different heuristics you can control the effectiveness of the agent.