Introduction
Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. We will call a program (or the component of a program) that learns a learner.
Finally, the training experience should be similar to the real-world experience over which the performance P is measured; in other words, the training experience should represent the real world. We say the distribution of training examples should follow a distribution similar to that of the test (real-world) examples. Tic Tac Toe is a very simple example, so in our scenario this should not be a difficult problem. But if we have a more complicated learning task (like the checkers task described in chapter 1 of [1]), this can be a serious issue. If a checkers learner just plays against itself in the training phase, it might never encounter some important board states which it would need in the real world. In such a case we say the distribution of training examples is not fully representative of the distribution of the real-world examples. In practice, it is often necessary to learn from a distribution of examples that is not fully representative. It is important to understand that mastering one distribution does not necessarily lead to good performance over some other distribution. It is also important to know that most of modern machine learning theory is based on the assumption that the distribution of the training examples is similar to the distribution of the test examples. This assumption is necessary for the learner's ability to learn, but we have to keep in mind that in practice it is often violated.

For our learner we have decided that our system will train by playing games against itself. So now we have to define what type of knowledge will be learned. Our Tic Tac Toe system can generate every legal move from any board state, so our system has to learn how to choose the best move from among the legal moves. These legal moves represent a search space, and we need the best search strategy. We call a function that chooses the best move for a given board state the target function. For Tic Tac Toe we define the target function this way: V : B -> R, where B is the set of legal board states and V maps each board state to a numeric score in R. Better board states get a higher score, worse board states a lower score. So in our scenario, our learner has to learn this target function. To select the best move from a board state, the learner has to generate all possible successor board states and use V to choose the best board state (and thereby the best move); a sketch of this selection step is given below.

Most real-world problems are too complex to learn V exactly. In general we are looking for an approximation of the target function, which we call V̂. There are many options for V̂. For the Tic Tac Toe system we choose a linear combination:

V̂(b) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5 + w6 x6   (1)

where w0 through w6 are weights and the xi are so-called features:

x1: number of blue marks
x2: number of red marks
x3: number of two in one row/column/diagonal (blue)
x4: number of two in one row/column/diagonal (red)
x5: number of blue in winning position
x6: number of red in winning position

With this target function our learner just has to adjust the weights; this is our whole learning task. The weights determine the importance of each feature. So we complete our definition of the Tic Tac Toe learning task:

Definition:
Task T: playing Tic Tac Toe
Performance P: percent of games won against opponents
Training experience E: playing practice games against itself
Target function: V : B -> R
Target function representation: V̂(b) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5 + w6 x6   (2)
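To make this representation concrete, here is a minimal sketch of the features, of V̂, and of the move-selection step in Python. It is not the implementation from [3]: the board encoding (a 3x3 list holding 'B' for blue, 'R' for red, or None for empty), the assumption that the learner plays blue, and all function names are illustrative assumptions.

LINES = [
    [(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)], [(2, 0), (2, 1), (2, 2)],  # rows
    [(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)], [(0, 2), (1, 2), (2, 2)],  # columns
    [(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)],                            # diagonals
]

def features(board):
    """Return [x1, ..., x6] for a board state b."""
    marks = [board[r][c] for r in range(3) for c in range(3)]
    x1 = marks.count('B')                                  # x1: number of blue marks
    x2 = marks.count('R')                                  # x2: number of red marks
    x3 = x4 = x5 = x6 = 0
    for line in LINES:
        vals = [board[r][c] for r, c in line]
        if vals.count('B') == 2 and vals.count(None) == 1:
            x3 += 1                                        # x3: two blue in a row/column/diagonal
        if vals.count('R') == 2 and vals.count(None) == 1:
            x4 += 1                                        # x4: two red in a row/column/diagonal
        if vals.count('B') == 3:
            x5 += 1                                        # x5: blue in winning position
        if vals.count('R') == 3:
            x6 += 1                                        # x6: red in winning position
    return [x1, x2, x3, x4, x5, x6]

def v_hat(board, weights):
    """Linear approximation V̂(b) = w0 + w1*x1 + ... + w6*x6, with weights = [w0, ..., w6]."""
    xs = features(board)
    return weights[0] + sum(w * x for w, x in zip(weights[1:], xs))

def best_move(board, weights, mark='B'):
    """Generate all successor board states and return the (move, board) pair with the highest V̂."""
    successors = []
    for r in range(3):
        for c in range(3):
            if board[r][c] is None:
                nxt = [row[:] for row in board]            # copy the board and place the mark
                nxt[r][c] = mark
                successors.append(((r, c), nxt))
    return max(successors, key=lambda move_and_board: v_hat(move_and_board[1], weights))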
Note: With this definition we have reduced the learning of a Tic Tac Toe strategy to the adjustment of the weights (w0 through w6) in the target function representation. We have decided that our learner will train by playing against itself, so the only information the learner is able to access is whether a game was won or lost. We have said that the learner has to learn to choose the best move. Therefore the learner has to store every board state of a game and assign a score to each board state; the board with the best score represents the best move. It is very simple to assign a score to the last board state (the board at the end of the game): if the game was won, we assign +100; if it was lost, we assign -100. So our next challenge is to assign a score to the intermediate board states. If a game was lost, it does not mean that every intermediate board state was bad. It is, for example, possible that a game was played perfectly and just the last move was fatally bad. In [1] a very surprising solution for this problem is presented:

Vtrain(b) ← V̂(Successor(b))   (3)

Here V̂ is the current approximation of V, Successor(b) denotes the next board state following b, and Vtrain(b) is the training value of the board b. So, to summarize, we use the current estimate of the successor board state of b to calculate the training score of the board state b.
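As a small illustration of equation (3) — again a hedged sketch rather than the code from [3] — the training values of one finished game could be computed as follows, where history is assumed to be the list of board states the learner saw (in order), outcome is +100, -100 or 0, and v_hat is the sketch from above.

def training_values(history, outcome, weights):
    """Assign a training value Vtrain to every board state of one finished game."""
    targets = []
    for i, board in enumerate(history):
        if i == len(history) - 1:
            targets.append(outcome)                        # final board: the true game outcome
        else:
            # Vtrain(b) <- V̂(Successor(b)): score b with the current estimate of its successor
            targets.append(v_hat(history[i + 1], weights))
    return targets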
The last thing we need is a learning algorithm to adjust the weights. We decide to use the least mean squares (LMS) training rule. The LMS training rule helps us to minimize the error between the training values and the values of the current approximation: it adjusts the weights by a small amount in the direction that reduces the error. Note that there are other algorithms, but for our problem this one is sufficient. Now we can design the Tic Tac Toe system. The learner

- plays against itself,
- calculates the features xi of every board state,
- calculates the score of every board state using the features,
- uses the current weights to choose the current best move,
- calculates the training scores for the boards (using the successor board state); if the game was won, it sets the last training score to +100, if the game was lost to -100, and if the game was a tie to 0,
- for each training score adjusts the weights using:

wi ← wi + η (Vtrain(b) - V̂(b)) xi   (4)
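In code, the update of equation (4) could look as follows; this is a sketch under the same assumptions as before, with eta standing for the constant η explained next.

def lms_update(weights, board, v_train, eta=0.1):
    """One LMS step: wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi."""
    xs = [1.0] + features(board)               # x0 = 1 so that w0 is updated like the other weights
    error = v_train - v_hat(board, weights)
    for i in range(len(weights)):
        weights[i] += eta * error * xs[i]      # a small step in the direction that reduces the error
    return weights

# One training round over a finished self-play game could then be:
#   for board, target in zip(history, training_values(history, outcome, weights)):
#       lms_update(weights, board, target)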
In equation (4), η is a small constant (e.g. 0.1), and Vtrain(b) - V̂(b) is the error; we can see that we change the weights so as to reduce the error on the training examples. The implementation of the Tic Tac Toe system contains three classes:

1. Game: represents one game
2. TicTacToeSimpleOpponent: a Tic Tac Toe player who makes his moves randomly
3. TicTacToeLearner: uses the LMS training rule to learn to play Tic Tac Toe (the training partner is TicTacToeSimpleOpponent)

After some experiments with the parameters (number of training loops, ...) I got the following results: if two TicTacToeSimpleOpponents play against each other, the TicTacToeSimpleOpponent who starts wins around 59% of the games and loses 29%. A TicTacToeLearner that trains for 70 rounds against a TicTacToeSimpleOpponent wins about 70% of the following games against TicTacToeSimpleOpponents. If we increase the number of training rounds to 500, it wins approximately 86% of the games and loses only 6% of them. We can increase the training rounds to 1500, but the result is only a marginal increase in the number of games won.
3 Exercises
Our LMS training rule is a stochastic gradient-descent search. We will now try to prove this (though I am not sure whether the proof is correct). We have to show that the LMS training rule alters the weights in proportion to -∂E/∂wi, where the squared error E is

E = (Vtrain(b) - V̂(b))²

With the linear representation

V̂(b) = w0 + w1 x1 + ... + w6 x6

and the LMS rule

wi ← wi + η (Vtrain(b) - V̂(b)) xi

the partial derivative of E with respect to wi is

∂E/∂wi = 2 (Vtrain(b) - V̂(b)) · ∂/∂wi (Vtrain(b) - w0 - w1 x1 - ... - w6 x6)
       = -2 (Vtrain(b) - V̂(b)) xi   (with x0 = 1 for the constant weight w0)
tbc ...
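As a quick numerical sanity check of this claim, one can compare the analytic gradient -2 (Vtrain(b) - V̂(b)) xi with a finite-difference estimate of ∂E/∂wi. The sketch below reuses the hypothetical features/v_hat helpers from above, so it is an illustration rather than part of the original code.

def error_gradient(weights, board, v_train):
    """Analytic gradient dE/dwi = -2 * (Vtrain(b) - V̂(b)) * xi, with x0 = 1."""
    xs = [1.0] + features(board)
    err = v_train - v_hat(board, weights)
    return [-2.0 * err * x for x in xs]

def error_gradient_numeric(weights, board, v_train, h=1e-6):
    """Central finite-difference estimate of dE/dwi, for comparison."""
    def squared_error(w):
        return (v_train - v_hat(board, w)) ** 2
    grad = []
    for i in range(len(weights)):
        w_plus, w_minus = list(weights), list(weights)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((squared_error(w_plus) - squared_error(w_minus)) / (2 * h))
    return grad

Because the LMS step η (Vtrain(b) - V̂(b)) xi equals -(η/2) ∂E/∂wi, each update moves the weights a small step against the gradient of the squared error.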
References
[1] Mitchell, Tom M. (1997). Machine Learning. ISBN 0-07-115467-1.
[2] Tic Tac Toe, http://en.wikipedia.org/wiki/Tic_tac_toe
[3] Source code, http://code.google.com/p/mindthegap/