UCT + Policy

Started by June 19, 2020 02:29 AM
5 comments, last by alvaro 4 years, 5 months ago

Hello,

I want to add heuristics to an MCTS implementation but I still want MCTS to “take over” and make the final decision. Say I have a function policy() that returns a value between 0.0 and 1.0 depending on the strength of the move as determined by heuristics. Is this modification to the UCT formula correct?

UCT = move.rewards/move.visits + exploration_rate * policy(move) * sqrt(log(totalSiblingVisits) / move.visits)

That depends entirely on what UCT is supposed to be; try to provide enough context when asking questions.
I'll go as far as guessing that MCTS means Monte Carlo Tree Search and that you are computing some kind of weight or priority for a possible move, but specific techniques are something you need to define and explain properly.

Regardless of the application, though, taking the square root of a logarithm is highly suspect (the logarithm can be negative unless you restrict its argument to be greater than 1, as in this case), and dividing the logarithm of a visit count by another visit count doesn't seem dimensionally correct. You are also hiding one constant in the choice of logarithm base and another in the choice of using a square root: make them explicit.
I'd expect this formula to behave nicely for a specific range of totalSiblingVisits and move.visits values, but not in general.

Omae Wa Mou Shindeiru

Thanks for your reply, and apologies for the confusion. I've asked questions about MCTS/UCT and discussed them extensively on this forum in the past, albeit a few years ago, which is why I assumed members would be familiar with the topic.

UCT (Upper Confidence bound applied to Trees) is the name for MCTS with UCB (Upper Confidence Bound) as the selection function, but in papers MCTS and UCT are usually used synonymously. It has had success in Go and chess through AlphaZero, and in some hidden-information card games through the ISMCTS variant. Personally, I have an implementation for a card game.

This is the original UCB formula as used in MCTS:

UCB = move.rewards/move.visits + exploration_rate * sqrt(log(totalSiblingVisits) / move.visits)

This formula is used to balance exploration and exploitation (i.e. it treats node selection as a multi-armed bandit problem) in the MCTS tree. In AlphaZero, they use pUCT (UCT with a policy):

pUCT = move.rewards/move.visits + exploration_rate * p(move) * sqrt(log(totalSiblingVisits) / move.visits)

p() in the above formula is provided by a neural network. If my understanding is correct, the policy is supposed to guide the selection function initially. Once there are enough iterations, the real value of a node is established and the node is visited more often if it is highly valued ("MCTS takes over", as in my original post).
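To make sure I'm reading the two formulas right, here is a minimal selection sketch of how I understand them in code. The Node struct, its field names and the function names are just placeholders I made up; children with move.visits == 0 are assumed to be expanded/handled before these are called.

#include <cmath>
#include <vector>

// Hypothetical node layout, purely for illustration.
struct Node {
    double rewards = 0.0; // sum of playout rewards for this move
    double visits = 0.0;  // number of times this move has been tried (assumed > 0 here)
    double policy = 0.0;  // heuristic prior p(move), assumed to be in [0, 1]
};

// Plain UCB selection, as in the first formula above.
int select_ucb(const std::vector<Node>& children, double totalSiblingVisits,
               double exploration_rate) {
    int best = 0;
    double best_score = -1e300;
    for (int i = 0; i < (int)children.size(); ++i) {
        const Node& c = children[i];
        double score = c.rewards / c.visits +
                       exploration_rate *
                       std::sqrt(std::log(totalSiblingVisits) / c.visits);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}

// The pUCT-style variant from the second formula: the prior scales the exploration term.
int select_puct(const std::vector<Node>& children, double totalSiblingVisits,
                double exploration_rate) {
    int best = 0;
    double best_score = -1e300;
    for (int i = 0; i < (int)children.size(); ++i) {
        const Node& c = children[i];
        double score = c.rewards / c.visits +
                       exploration_rate * c.policy *
                       std::sqrt(std::log(totalSiblingVisits) / c.visits);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}

The only difference between the two is the p(move) factor on the exploration term.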

What I want to achieve is a policy function built from hand-tweaked heuristics, not a neural network. It seems that I just need to swap out p(move) for my heuristics. Is it correct that p() should return values in the range 0.0 to 1.0?

Please correct me if I misunderstood anything. Thanks!

I haven't done anything with MCTS in many years, but I would have thought the way to use a policy heuristic is to use it to initialize counts to some fake values that will make the first few playouts through this node approximately follow the given probability distribution. I don't have a formula ready to give you, but I can try to write one for you if you need it.

EDIT: The fact that pUCT is guided by the heuristic even after millions of visits seems to bother others as well: https://github.com/LeelaChessZero/lc0/issues/743

alvaro said:
the way to use a policy heuristic is to use it to initialize counts to some fake values that will make the first few playouts through this node approximately follow the given probability distribution

That makes a lot of sense. This is how I implemented it:

(move.rewards+p(move)*10)/move.visits + exploration_rate * sqrt(log(totalSiblingVisits) / move.visits)

I fake the rewards of the node at the start, so the node is visited more initially. p() returns a value between 0.0 and 1.0. What do you think?
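In code, reusing the hypothetical Node struct (and the <cmath> include) from my sketch above, the scoring function would be something like this; the 10 is just the fake-reward weight from the formula, picked by hand:

// The formula above as a scoring function: a fixed fake reward of p(move) * 10
// is mixed into the exploitation term, so its influence fades as real visits accumulate.
double score_with_fake_reward(const Node& c, double totalSiblingVisits,
                              double exploration_rate) {
    return (c.rewards + c.policy * 10.0) / c.visits +
           exploration_rate * std::sqrt(std::log(totalSiblingVisits) / c.visits);
}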

What I had in mind is something a bit different. Let's figure out how often UCT visits each move, then tweak things so that we initially get the probability distribution we want. To do this, assume the result of every playout happens to be exactly the reward you already have for that move (so you are not learning anything about which move is better): how often will UCT pick each move? We can take all the UCT scores to be about equal. Picking a move and increasing its visit count will then lower its UCT score by some amount; I used a derivative to figure out approximately how much. The inverse of that derivative is proportional to the frequency with which the move will be picked, and by my computations that means the probability of picking a move is proportional to move.visits^(3/2).
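Spelling out the derivative step (writing n for move.visits, N for totalSiblingVisits and c for exploration_rate, and ignoring the slow growth of log(N)):

d/dn [ c * sqrt(log(N) / n) ] = -(c/2) * sqrt(log(N)) * n^(-3/2)

So one extra visit lowers a move's UCT score by an amount roughly proportional to n^(-3/2), which means that to keep all the scores level a move has to be picked with frequency proportional to n^(3/2). Requiring that frequency to match p(move) and solving for n gives n proportional to p(move)^(2/3), which is where the pow(p(move), 2/3) in the next paragraph comes from.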

This suggests the following initialization procedure: make move.visits = K * pow(p(move), 2/3) for some K that controls how much memory of the policy we'll have, and make totalSiblingVisits the sum of those counts. Then initialize the rewards to -exploration_rate * sqrt(log(totalSiblingVisits - 1) / move.visits), or whatever other initialization you need so that at the initial stage all moves have a similar UCT score.

UCT initialized this way seems to indeed follow the desired probabilities initially.

#include <iostream>
#include <cmath>

// Toy example with 10 moves; scores[i] stands in for the move.rewards/move.visits term.
double probabilities[10];
double scores[10];
double visits[10];
double total_visits;

// UCT score of move i, with the exploration rate taken to be 1.
double UCT_score(int i) {
  return scores[i] + std::sqrt(std::log(total_visits) / visits[i]);
}

// Return the index of the move with the highest UCT score.
int pick_UCT_move() {
  double best = UCT_score(0);
  int result = 0;
  for (int i = 1; i < 10; ++i) {
    double score = UCT_score(i);
    if (score > best) {
      best = score;
      result = i;
    }
  }
  return result;
}

int main() {
  for (int i = 0; i < 10; ++i)
    probabilities[i] = i < 5 ? 0.14 : 0.06; // Just an example

  // Seed the visit counts with the prior: visits[i] proportional to p^(2/3), with K = 1000.
  total_visits = 0.0;
  for (int i = 0; i < 10; ++i) {
    scores[i] = 0.0;
    visits[i] = 1000.0 * std::pow(probabilities[i], 2.0/3.0);
    total_visits += visits[i];
  }
  
  // Offset the scores so that every move starts with a UCT score of exactly 0.
  for (int i = 0; i < 10; ++i)
    scores[i] = -UCT_score(i);
  
  // Pick 100 moves without adding any new reward information;
  // the sequence of picks should roughly follow probabilities[].
  for (int n = 0; n < 100; ++n) {
    int move = pick_UCT_move();
    std::cout << move << '\n';
    visits[move] += 1.0;
    total_visits += 1.0; // keep the parent count in sync, as real UCT would
  }
}

This topic is closed to new replies.