
Question on states / state values in Q-Learning (and RL in general)

Started by January 22, 2009 10:17 AM
3 comments, last by cossie 15 years, 9 months ago
Hi, I'm trying to get to grips with the basics of RL, and I've been doing my best to work through Sutton and Barto's "Reinforcement Learning: An Introduction" online book. But I'm having a bit of difficulty with states.

One of the examples they go into is Tic-Tac-Toe (it's more a comparison of RL vs. search vs. evolutionary algorithms, but I've been trying to work through it). Are all the states at the start of the game "empty" states because they are neither an 'X' nor an 'O'? And could a state be considered a position on the board? And each state has a Q-value for each possible action that could be taken from that state, right? Does that mean that in a game like Tic-Tac-Toe a possible Q-value table would be something like:

State | Possible Action 
 (0,0)| Place X / Leave Blank 
 (1,1)| Place X / Leave Blank
 (1,2)| Place X / Leave Blank
 (1,3)| ...
  ... | ...
or is it:

State | Possible Action 
 (0,0)| (1,1), (1,2), (1,3), (2,1), (2,2), ...
 (1,1)| (1,2), (1,3), (2,1) ...
 (1,2)| (1,1), (1,3), (2,1) ...
 (1,3)| ......basically all the empty squares but the current one
And so (I apologise if I make a mess of my attempt at an explanation; I try hard but I'm not that smart!), if 'X' plays first, it would be in the state (0,0) (an initial state, i.e. not being on the board), and then it selects an action ("place an X in (1,1)"), which leads to:

   1  2  3
1 _X|__|__
2 __|__|__
3   |  |
Would it be correct to say that the new state (s') for X is now (1,1)? And because they haven't won or lost, is the reward zero when updating the Q-value? The next move is then made by 'O':

   1  2  3
1 _X|__|__ 
2 __|_O|__
3   |  | 
Do the states for 'O' contain information about the 'X' positions, or is that sort of thing taken care of because 'O' will either make an illegal move (trying to place an 'O' in a square already occupied by an 'X') or it will have lost the game, and through the reward value (-1) it will be probabilistically less likely to choose a losing move in future games? I'd be really grateful for some advice on this, and I'd also like to thank anyone who manages to read this humongous post!
The state is the entire state of the board, not the state of an individual square. There is only one starting state, which is the empty board. There are 9 valid moves from this state.
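To make that concrete, here's a minimal sketch (not from the thread; the representation and names are my own choices) of what a Q-table keyed on the whole board looks like in Python. The board is a tuple of 9 cells with 0 = empty, 1 = X, 2 = O, and an action is just the index of the empty square to play in:

```python
import random
from collections import defaultdict

# The state is the whole 3x3 board, flattened to a tuple of 9 cells.
EMPTY_BOARD = (0,) * 9  # 0 = empty, 1 = X, 2 = O

def legal_actions(state):
    """An action is the index of an empty square to play in."""
    return [i for i, cell in enumerate(state) if cell == 0]

# One Q-value per (state, action) pair; unvisited pairs default to 0.0.
Q = defaultdict(float)

state = EMPTY_BOARD
action = random.choice(legal_actions(state))          # X picks a square
next_state = state[:action] + (1,) + state[action + 1:]

print(len(legal_actions(EMPTY_BOARD)))  # 9 valid moves from the start state
print(Q[(state, action)])               # 0.0 until learning updates it
```

Because the state is the full board, 'O' never needs a separate record of where the 'X's are: every X is already part of the state key.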
Thanks for your reply Vorpy, I can see I had a hole in my fundamental understanding. It's actually raised another question, this time about the number of possible Q-values that would be needed.

I understand a Q-value to be a value associated with a state-action pair, but I'm unsure how to calculate the number of values you'd need for a game of tic-tac-toe.

I've never been great with permutations, so I'd appreciate it if someone's kind enough to help me out a bit more.

Thanks,

Cossie
Each square has 3 possible values and there are nine squares. There are 3^9 or 19683 permutations. Many of these are not reachable in an actual game (for example, boards where both X and O have 3 in a row).

Since the number isn't that high, you could probably just make a table of that size and you'll just have entries that are never visited. Or use a hash table/map/associative container of your choice.
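If you want to see how many of the 19683 permutations actually come up, a quick sketch (my own code, not from the thread) can breadth-first search from the empty board, alternating X and O and stopping at wins and full boards:

```python
def winner(b):
    """Return 1 or 2 if that player has three in a row, else 0."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for i, j, k in lines:
        if b[i] != 0 and b[i] == b[j] == b[k]:
            return b[i]
    return 0

def reachable_states():
    """All board states reachable by legal alternating play."""
    start = (0,) * 9
    seen = {start}
    frontier = [start]
    while frontier:
        nxt = []
        for b in frontier:
            if winner(b) or 0 not in b:
                continue  # terminal state: no successors
            player = 1 if b.count(1) == b.count(2) else 2  # whose turn
            for i in range(9):
                if b[i] == 0:
                    nb = b[:i] + (player,) + b[i + 1:]
                    if nb not in seen:
                        seen.add(nb)
                        nxt.append(nb)
        frontier = nxt
    return seen

print(len(reachable_states()))  # 5478, far fewer than 3**9 == 19683
```

So only 5478 of the 19683 permutations can ever appear in a real game, which is why a full-size table simply leaves the rest untouched.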

You can encode the entire state of a tic-tac-toe board into a 16-bit integer. A simple way to do this is to assign empty, X, and O the values 0, 1, and 2 (in any order you'd like), then set an accumulator to 0 and, for each square on the board, multiply the accumulated value by 3 and add the value of that square. Or you can encode the entire board into a 32-bit integer by simply using 2 independent bits to store the state of each square.
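That base-3 scheme looks like this in code (a sketch; the function names are mine):

```python
def encode(board):
    """Pack a tuple of 9 cells (0=empty, 1=X, 2=O) into one integer."""
    acc = 0
    for cell in board:
        acc = acc * 3 + cell
    return acc  # at most 3**9 - 1 == 19682, so it fits in 16 bits

def decode(code):
    """Invert encode(): unpack the integer back into a 9-cell tuple."""
    cells = []
    for _ in range(9):
        cells.append(code % 3)
        code //= 3
    return tuple(reversed(cells))

board = (1, 0, 0, 0, 2, 0, 0, 0, 0)  # X top-left, O in the centre
assert decode(encode(board)) == board
print(encode((2,) * 9))  # 19682, the largest possible code
```

The encoded integer makes a compact key for an array or hash table of Q-values.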

There are far fewer possible states if you account for symmetry. With symmetry, there are only 3 states after the first move (corner, side, and center). But if the goal here is to find weaknesses in an imperfect player, then you wouldn't want to exploit symmetry because the imperfect player may have asymmetric weaknesses.
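The symmetry idea can be sketched like this (my own code and names): the 3x3 board has 8 symmetries (4 rotations, each optionally mirrored), and mapping every board to a canonical representative lets symmetric states share one table entry.

```python
def rotate(board):
    """Rotate the 3x3 board (tuple of 9) 90 degrees clockwise."""
    return tuple(board[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))

def reflect(board):
    """Mirror the board horizontally."""
    return tuple(board[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))

def symmetries(board):
    """All 8 symmetric variants: 4 rotations, each plus its mirror."""
    out = []
    b = board
    for _ in range(4):
        out.append(b)
        out.append(reflect(b))
        b = rotate(b)
    return out

def canonical(board):
    """Pick one fixed representative of the symmetry class."""
    return min(symmetries(board))

# The distinct first moves for X collapse to corner, side, and centre.
firsts = {canonical((0,) * i + (1,) + (0,) * (8 - i)) for i in range(9)}
print(len(firsts))  # 3
```

As noted above, though, you'd skip this if you want to probe an imperfect opponent's asymmetric weaknesses.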
Thanks a million Vorpy. I had never considered the fact there may be states that are not reachable.

