How to reward an agent's actions in self-play adversarial game reinforcement learning? - artificial-intelligence

I am a beginner in this field and I am trying to implement an agent that plays an adversarial game like chess. I created two agents that share the same neural network and experience buffer. At every step, the neural network is updated by both agents (with the feature order swapped).
Does my self-play approach make sense? And if it does, how should I reward the agents' behavior?
More concretely, given this sequence:
(0) state -> (1) agent0 action -> (2) reward -> (3) state -> (4) agent1 action -> (5) reward -> (6) state
Is the next state of agent0 after (1) state (3) or state (6)? And is the corresponding reward (2), (5), or something else (for example, (2) - (5))?

Here is my understanding. You start with two randomly initialized networks, and you do not share the network between the agents. Then you train one of them, and only that one. When the network under training begins to show some progress, you stop the match and make a copy of the improved network to serve as the new opponent. Then you continue. This way you bootstrap both yourself and your opponents: you are not trying to learn against an expert, but against someone who is more or less your equal.
Above I've described saving only one copy of your previous self, but you can of course keep the last 10 snapshots and play against all of them.
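In pseudocode, that loop looks roughly like the Python sketch below. Note that play_game (which plays one full game, updates the learner from its own experience, and returns +1 or -1) and net.copy() are hypothetical helpers standing in for your own engine and network code; only the snapshot/opponent-pool logic is the point.

import random

def self_play_training(net, n_games=10000, pool_size=10, eval_window=100):
    opponents = [net.copy()]            # frozen copy of the randomly initialized net
    wins = 0
    for game in range(1, n_games + 1):
        opponent = random.choice(opponents)                  # play any stored snapshot
        result = play_game(learner=net, opponent=opponent)   # hypothetical: trains net, returns +1/-1
        wins += result > 0
        if game % eval_window == 0:
            if wins / eval_window > 0.6:            # the learner clearly beats its old selves
                opponents.append(net.copy())        # freeze it as a new opponent
                opponents = opponents[-pool_size:]  # keep only the last few snapshots
            wins = 0
    return net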

Related

How can I predict a user generated distribution by learning from previous distributions

I am trying to program a prediction algorithm that predicts the distribution of marbles among 4 cups based on the previous user input. But I have no idea where to start or which techniques could be used to solve this.
Example:
There are 4 cups numbered from 0 to 3, and the user receives x marbles which he distributes among those cups. Each round the user receives another amount of marbles (or the same amount) and, before
the user distributes them, the algorithm tries to predict the distribution based on the user's previous inputs. After that the user “corrects” it. The goal is that the user does not have to correct anything, because the
algorithm predicts the correct distribution. However, the pattern by which the user distributes the marbles can change, and the algorithm has to adapt.
This is the simplest design of the problem, and it is already not trivial to solve. It gets exponentially more complex when the marbles have additional properties that can be used for the distribution.
For example, they could have a color and a weight.
So, for example, how does the algorithm learn that the user (most of the time) puts marbles of the same color in one cup, that cup 2 is (most of the time) empty, and that the rest are equally distributed?
So in my head the algorithm has to do something like this:
Search for a pattern after the user's distribution is done. Those patterns can be the amount of marbles per cup, the weight per cup, or anything else.
If a pattern is found, a predefined value (a weight) is added to that pattern.
If a previously found pattern is not found again, a predefined value has to be subtracted from that pattern.
When the algorithm has to predict, all patterns with a predefined weight have to be applied.
I am not sure if I am missing something, how I would implement something like this, or in which area I should look for answers.
First of all, bear in mind that human behavior does not always follow a pattern. If the user distributes these marbles randomly, it will be hard to predict the next move!
But if there IS a pattern in the distributions, you might create a prediction using an algorithm such as a neural network or a decision tree.
For example:
// dataset1
// the weights and colors of 10 marbles
let dataset1 = [4,3,1,1,2,3,4,5,3,4,7,6,4,4,2,4,1,6,1,2]
// cups1
// the cups distribution of the above marbles
let labels1 = [2,1,0,2,1,1,3,1,0,1]
Now you can train an algorithm, for example a neural network or a decision tree.
This isn't real code, just an example of how it could work.
let net = new NeuralNet()
net.train(dataset1, labels1)
After training with lots of data (at least hundreds of these datasets), you can give the network a new dataset, and it will give you a prediction of the cup distribution:
let newMarbleSet = [...]
let prediction = net.predict(newMarbleSet)
It's up to you what you want to do with this prediction.
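If you want to experiment with the decision-tree idea in real code, here is a minimal sketch in Python with scikit-learn. Reshaping the flat pseudocode list into one (weight, color) row per marble is my own assumption about how those 20 numbers are laid out.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Assumed layout: two features per marble, interleaved as [w1, c1, w2, c2, ...]
flat = [4, 3, 1, 1, 2, 3, 4, 5, 3, 4, 7, 6, 4, 4, 2, 4, 1, 6, 1, 2]
X = np.array(flat).reshape(10, 2)             # one row per marble: (weight, color)
y = np.array([2, 1, 0, 2, 1, 1, 3, 1, 0, 1])  # cup each marble went into

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)

new_marbles = np.array([[4, 3], [1, 6]])      # two unseen marbles
print(clf.predict(new_marbles))               # predicted cup for each marble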

How to undo multiple steps in a Tic-Tac-Toe game in C

I want to program a version of the Tic-Tac-Toe game in C, in which we have an n × n board whose size is chosen by the user, and the loser is the first player to get n X's or O's in a row, in a column, etc.
One of the requirements is to let the players undo multiple steps, that is, to go back to the board state as it was a couple of steps ago by entering a negative odd number.
For example, if player 1 enters '-3' as the row index, the game needs to revert to how it was 3 steps before (provided at least 3 steps have already been made in the game), show the board, and give the turn to player 2.
Any idea how I could write such a function, or at least a tip on how to start programming it?
Thanks!
An alternative to undoing a step is to instead keep a history (e.g. a linked list) of all activity. To "undo", replay the list minus its last entry. This is less efficient, yet less complicated and less error-prone than reversing complicated steps, which may be present in more advanced games. Sometimes "backing up" is fraught with subtle complexities where the prior state is not truly fully restored. This is akin to the suggestions from @Tom Karzes and @David C. Rankin.
As a bonus, you have a game record.
Consider your game has events like:
Default set-up.
Select board size n.
Select X or O to go first.
Make a move.
Rotate the view.
Save the game.
Restore a game.
Undo the last command <-- now one can undo more than just the last "move".
etc.
Normally you would have a stack onto which you push each new move; when you want to undo a move, you just pop it from the stack and apply the reverse move to your game model.
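To make the data flow concrete, here is a small sketch of that stack idea (written in Python for brevity, even though the question asks for C; the same structure maps directly onto an array-backed stack of move structs there).

def make_board(n):
    return [['.'] * n for _ in range(n)]

def apply_move(board, history, row, col, player):
    board[row][col] = player
    history.append((row, col))          # push the move

def undo_moves(board, history, k):
    # Undo the last k moves by popping them and clearing the squares.
    for _ in range(min(k, len(history))):
        row, col = history.pop()        # pop the most recent move
        board[row][col] = '.'           # reverse it on the game model

board, history = make_board(3), []
apply_move(board, history, 0, 0, 'X')
apply_move(board, history, 1, 1, 'O')
apply_move(board, history, 2, 2, 'X')
undo_moves(board, history, 3)           # e.g. the player entered -3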

The purpose of using Q-Learning algorithm

What is the point of using Q-learning? I have used an example code that represents a 2D board with a pawn moving on it. At the right end of the board there is a goal which we want to reach. After the algorithm completes, I have a Q-table with values assigned to every state-action pair. Is it all about getting this Q-table to see which state-action pairs (i.e. which actions are best in specific states) are the most useful? That's how I understand it right now. Am I right?
Is it all about getting this Q-table to see which state-action pairs (which actions are the best in case of specific states) are the most useful?
Yep! That's pretty much it. Given a finite state space, Q-learning is guaranteed to eventually learn the optimal policy. Once the optimal policy is reached (also known as convergence), every time the agent is in a given state s, it looks in its Q-table for the action a with the highest Q-value for that (s, a) pair.
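For concreteness, here is a tiny tabular Q-learning example on a one-dimensional "walk right to the goal" board, in the spirit of the pawn example; the board size, reward scheme, and hyperparameters are made up purely for illustration.

import random

N = 6                              # cells 0..5, the goal is cell 5
ACTIONS = (-1, +1)                 # step left / step right
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def choose_action(s):
    # Epsilon-greedy over the Q-table, breaking ties randomly.
    if random.random() < eps:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

for episode in range(500):
    s = 0
    while s != N - 1:
        a = choose_action(s)
        s_next = min(max(s + a, 0), N - 1)           # the pawn stays on the board
        r = 1.0 if s_next == N - 1 else 0.0          # reward only at the goal
        target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])    # the Q-learning update
        s = s_next

# Once the table has converged, acting is just a lookup:
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)}
print(policy)   # every state maps to +1: always step toward the goal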

Neural Network and Temporal Difference Learning

I have read a few papers and lectures on temporal difference learning (some as they pertain to neural nets, such as the Sutton tutorial on TD-Gammon), but I am having a difficult time understanding the equations, which leads me to my questions.
-Where does the prediction value V_t come from? And subsequently, how do we get V_(t+1)?
-What exactly is getting back propagated when TD is used with a neural net? That is, where does the error that gets back propagated come from when using TD?
The backward and forward views can be confusing, but when you are dealing with something simple like a game-playing program, things are actually pretty simple in practice. I'm not looking at the reference you're using, so let me just provide a general overview.
Suppose I have a function approximator like a neural network, and that it has two functions, train and predict (written as test in the pseudocode below), for training on a particular output and for predicting the outcome of a state. (Or the outcome of taking an action in a given state.)
Suppose I have a trace of play from playing a game, where I used the predict method to tell me what move to make at each point and suppose that I lose at the end of the game (V=0). Suppose my states are s_1, s_2, s_3...s_n.
The Monte Carlo approach says that I train my function approximator (e.g. my neural network) on each of the states in the trace, using the final score as the target. So, given this trace, you would do something like:
train(s_n, 0)
train(s_n-1, 0)
...
train(s_1, 0).
That is, I'm asking every state to predict the final outcome of the trace.
The dynamic programming approach says that I train based on the result of the next state. So my training would be something like
train(s_n, 0)
train(s_n-1, test(s_n))
...
train(s_1, test(s_2)).
That is, I'm asking the function approximator to predict what the next state predicts, where the last state predicts the final outcome from the trace.
TD learning mixes the two, where λ=1 corresponds to the first case (Monte Carlo) and λ=0 corresponds to the second case (dynamic programming). Suppose that we use λ=0.5. Then our training would be:
train(s_n, 0)
train(s_n-1, 0.5*0 + 0.5*test(s_n))
train(s_n-2, 0.25*0 + 0.25*test(s_n) + 0.5*test(s_n-1))
...
Now, what I've written here isn't completely correct, because you don't actually re-evaluate all of those test terms at each step. Instead you just start with a target value (V = 0 in our example) and, after each training call, update it with the prediction of the state you just trained, V = λ·V + (1-λ)·test(s_i), and use that as the target for the previous state in the trace.
This learns much faster than the Monte Carlo and dynamic programming methods because you aren't asking the algorithm to learn such extreme targets (ignoring the current prediction entirely, or ignoring the final outcome entirely).
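Putting the whole recipe together, a compact sketch of that backward pass over one finished game might look like this; the approximator with its train(state, target) and test(state) methods is the same hypothetical object used in the pseudocode above, not a real library API.

def td_lambda_targets(approximator, states, final_outcome, lam=0.5):
    # Train on a finished game, walking backwards from the last state.
    # The target for each state mixes the final outcome (lam -> 1 gives pure
    # Monte Carlo) with the prediction of the following state (lam -> 0 gives
    # the one-step, dynamic-programming-style bootstrap).
    v = final_outcome                    # e.g. 0 for a loss, 1 for a win
    for s in reversed(states):           # s_n, s_(n-1), ..., s_1
        approximator.train(s, v)         # the error back-propagated here is v - test(s)
        v = lam * v + (1 - lam) * approximator.test(s)   # target for the previous state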

How to program a neural network for chess?

I want to program a chess engine which learns to make good moves and win against other players. I've already coded a representation of the chess board and a function which outputs all possible moves. So I only need an evaluation function which says how good a given situation of the board is. Therefore, I would like to use an artificial neural network which should then evaluate a given position. The output should be a numerical value. The higher the value, the better the position is for the white player.
My approach is to build a network of 385 neurons: There are six unique chess pieces and 64 fields on the board. So for every field we take 6 neurons (1 for every piece). If there is a white piece, the input value is 1. If there is a black piece, the value is -1. And if there is no piece of that sort on that field, the value is 0. In addition to that there should be 1 neuron for the player to move. If it is White's turn, the input value is 1 and if it's Black's turn, the value is -1.
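As a concrete sketch of this encoding (in Python rather than Delphi, and assuming an 8x8 array of one-character strings, uppercase for White, lowercase for Black, and '' for an empty square, which is just one possible board representation):

import numpy as np

PIECES = ['P', 'N', 'B', 'R', 'Q', 'K']   # the six unique piece types

def encode_position(board, white_to_move):
    # Build the 385-element input vector described above.
    x = np.zeros(385, dtype=np.float32)
    for square in range(64):
        piece = board[square // 8][square % 8]
        if piece:
            idx = square * 6 + PIECES.index(piece.upper())
            x[idx] = 1.0 if piece.isupper() else -1.0   # +1 for White, -1 for Black
    x[384] = 1.0 if white_to_move else -1.0             # side to move
    return x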
I think that configuration of the neural network is quite good. But the main part is missing: how can I implement this neural network in a programming language (e.g. Delphi)? I think the weights of each neuron should be the same in the beginning. Depending on the result of a match, the weights should then be adjusted. But how? I think I should let 2 computer players (both using my engine) play against each other. If White wins, Black gets the feedback that its weights aren't good.
So it would be great if you could help me implement the neural network in a programming language (Delphi would be best, otherwise pseudo-code). Thanks in advance!
In case somebody randomly finds this page: given what we know now, what the OP proposes is almost certainly possible. In fact we managed to do it for a game with a much larger state space, Go (https://deepmind.com/research/case-studies/alphago-the-story-so-far).
I don't see why you can't have a neural net for a static evaluator if you also do some classic mini-max lookahead with alpha-beta pruning. Lots of Chess engines use minimax with a braindead static evaluator that just adds up the pieces or something; it doesn't matter so much if you have enough levels of minimax. I don't know how much of an improvement the net would make but there's little to lose. Training it would be tricky though. I'd suggest using an engine that looks ahead many moves (and takes loads of CPU etc) to train the evaluator for an engine that looks ahead fewer moves. That way you end up with an engine that doesn't take as much CPU (hopefully).
Edit: I wrote the above in 2010, and now in 2020 Stockfish NNUE has done it. "The network is optimized and trained on the [classical Stockfish] evaluations of millions of positions at moderate search depth" and then used as a static evaluator, and in their initial tests they got an 80-elo improvement when using this static evaluator instead of their previous one (or, equivalently, the same elo with a little less CPU time). So yes it does work, and you don't even have to train the network at high search depth as I originally suggested: moderate search depth is enough, but the key is to use many millions of positions.
Been there, done that. Since there is no continuity in your problem (the value of a position is not closely related to that of another position that differs only in the value of one input), there is very little chance an NN would work. And it never did in my experiments.
I would rather see a simulated annealing system with an ad-hoc heuristic (of which there are plenty out there) to evaluate the value of the position...
However, if you are set on using an NN, it is relatively easy to represent. A general NN is simply a graph, with each node being a neuron. Each neuron has a current activation value and a transition formula to compute the next activation value, based on its input values, i.e. the activation values of all the nodes that have a link to it.
A more classical NN, that is, one with an input layer, an output layer, identical neurons in each layer, and no time dependency, can thus be represented by an array of input nodes, an array of output nodes, and a linked graph of nodes connecting them. Each node possesses a current activation value and a list of nodes it forwards to. Computing the output value is simply a matter of setting the activations of the input neurons to the input values and then iterating through each subsequent layer in turn, computing its activation values from the previous layer using the transition formula. When you have reached the last (output) layer, you have your result.
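As a minimal sketch of that layered representation and forward pass (in Python/NumPy; the tanh transition formula and the layer sizes are arbitrary choices for the example):

import numpy as np

def forward(x, layers):
    # layers is a list of (weight_matrix, bias_vector) pairs, one per layer.
    a = np.asarray(x, dtype=np.float32)     # activations of the input layer
    for W, b in layers:
        a = np.tanh(W @ a + b)              # transition formula for the next layer
    return a                                # activations of the output layer

# Example: a 385-input net with one hidden layer of 64 units and 1 output.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((64, 385)) * 0.01, np.zeros(64)),
          (rng.standard_normal((1, 64)) * 0.01, np.zeros(1))]
score = forward(np.zeros(385), layers)      # evaluation of a (dummy, all-zero) input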
It is possible, but not trivial by any means.
https://erikbern.com/2014/11/29/deep-learning-for-chess/
To train his evaluation function, he used a lot of computing power.
To summarize generally, you could go about it as follows. Your evaluation function is a feedforward NN. Let the matrix computations lead to a scalar output valuing how good the move is. The input vector for the network is the board state, represented by all the pieces on the board: say a white pawn is 1, a white knight is 2, ..., and an empty square is 0. An example board-state input vector is then simply a sequence of values from 0 to 12. This evaluation function can be trained on grandmaster games (available, for example, in a FICS database), over many games, minimizing the loss between what the current parameters say is the highest-valued move and the move the grandmasters actually made (which should have the highest valuation). This of course assumes that the grandmaster moves are correct and optimal.
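One way to write down that loss as a sketch (the evaluate function and the board representations are placeholders, and treating move selection as a softmax over the evaluations of all legal successor positions is just one reasonable formulation):

import numpy as np

def grandmaster_move_loss(evaluate, candidate_positions, chosen_index):
    # Cross-entropy loss pushing the evaluation of the grandmaster's chosen
    # move above the evaluations of all other legal moves from the position.
    scores = np.array([evaluate(p) for p in candidate_positions])
    scores -= scores.max()                      # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[chosen_index])         # minimal when the chosen move scores highest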
What you need to train an ANN is either something like backpropagation learning or some form of a genetic algorithm. But chess is such a complex game that it is unlikely that a simple ANN will learn to play it, even more so if the learning process is unsupervised.
Further, your question does not say anything about the number of layers. You want to use 385 input neurons to encode the current situation. But how do you want to decide what to do? One neuron per field? Highest excitation wins? But there is often more than one possible move.
Further, you will need several hidden layers; the functions that can be represented with an input and an output layer but no hidden layer are really limited.
So I do not want to keep you from trying it, but the chances of a successful implementation and training within, say, one year are practically zero.
I tried to build and train an ANN to play Tic-tac-toe when I was 16 or so... and I failed. I would suggest trying such a simple game first.
The main problem I see here is one of training. You say you want your ANN to take the current board position and evaluate how good it is for a player. (I assume you will take every possible move for a player, apply it to the current board state, evaluate via the ANN and then take the one with the highest output - ie: hill climbing)
Your options as I see them are:
Develop some heuristic function to evaluate the board state and train the network off that. But that begs the question of why use an ANN at all, when you could just use your heuristic.
Use some statistical measure such as "How many games were won by white or black from this board configuration?", which would give you a fitness value between white or black. The difficulty with that is the amount of training data required for the size of your problem space.
With the second option you could always feed it board sequences from grandmaster games and hope there is enough coverage for the ANN to develop a solution.
Due to the complexity of the problem I'd want to throw the largest network (ie: lots of internal nodes) at it as I could without slowing down the training too much.
Your input algorithm is sound - all positions, all pieces, and both players are accounted for. You may need an input layer for every past state of the gameboard, so that past events are used as input again.
The output layer should (in some form) give the piece to move, and the location to move to.
Write a genetic algorithm using a connectome which contains all neuron weights and synapse strengths, and begin multiple separated gene pools with a large number of connectomes in each.
Make them play one another, keep the best handful, then cross over and mutate the best connectomes to repopulate the pool.
Read Blondie24: http://www.amazon.co.uk/Blondie24-Playing-Kaufmann-Artificial-Intelligence/dp/1558607838.
It deals with checkers instead of chess but the principles are the same.
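A toy version of that evolve-play-select-crossover-mutate loop over flat weight vectors ("connectomes") might look like the sketch below; fitness, which makes the candidates play one another and returns one score per connectome, is a placeholder for your game engine.

import numpy as np

rng = np.random.default_rng(0)

def evolve(pool, fitness, keep=4, mutation_scale=0.05, generations=100):
    # pool is a list of flat weight vectors (numpy arrays) of equal shape.
    pool = list(pool)
    for _ in range(generations):
        scores = fitness(pool)                               # one score per connectome
        best = [pool[i] for i in np.argsort(scores)[-keep:]] # keep the best handful
        children = []
        while len(best) + len(children) < len(pool):
            a, b = rng.choice(len(best), size=2, replace=False)
            mask = rng.random(best[a].shape) < 0.5           # uniform crossover
            child = np.where(mask, best[a], best[b])
            child = child + rng.normal(0.0, mutation_scale, child.shape)  # mutation
            children.append(child)
        pool = best + children                               # repopulate the gene pool
    return pool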
Came here to say what Silas said. Using a minimax algorithm, you can expect to be able to look ahead N moves. Using Alpha-beta pruning, you can expand that to theoretically 2*N moves, but more realistically 3*N/4 moves. Neural networks are really appropriate here.
Perhaps though a genetic algorithm could be used.

Resources