Thanks for your reply, and apologies for the confusion. I've asked questions about MCTS/UCT and discussed it extensively on this forum in the past, albeit a few years ago, which is why I assumed members would be familiar with the topic.
UCT (Upper Confidence bounds applied to Trees) is the name for MCTS with UCB (Upper Confidence Bound) as the selection function, but in papers MCTS and UCT are usually used synonymously. It has had success in Go and chess through AlphaZero, and in some hidden-information card games through the ISMCTS variant. Personally, I have an implementation for a card game.
This is the original UCB formula as used in MCTS:
UCB = move.rewards/move.visits + exploration_rate * sqrt(log(totalSiblingVisits) / move.visits)
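In code, selection with this formula looks roughly like the sketch below (Python; node.children, child.rewards and child.visits are placeholder names of my own, not tied to any particular library):

import math

def ucb_select(node, exploration_rate=1.4):
    # Pick the child move with the highest UCB score.
    total_sibling_visits = sum(child.visits for child in node.children)
    def ucb(child):
        if child.visits == 0:
            return float("inf")  # always try unvisited moves first
        exploit = child.rewards / child.visits
        explore = exploration_rate * math.sqrt(math.log(total_sibling_visits) / child.visits)
        return exploit + explore
    return max(node.children, key=ucb)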
The formula balances exploration and exploitation of the moves at a node (i.e. each node is treated as a multi-armed bandit). In AlphaZero, selection uses pUCT (UCT with a policy prior):
pUCT = move.rewards/move.visits + exploration_rate * p(move) * sqrt(totalSiblingVisits) / (1 + move.visits)
p() in the above formula is the prior probability of a move, provided by a neural network. If my understanding is correct, the policy is supposed to guide the selection function initially; once enough iterations have run, the value estimate from the simulations dominates, and high-valued nodes get visited more ("MCTS takes over", as in my original post).
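For reference, here is how I understand that selection step in code (a rough Python sketch with the same placeholder field names as above; policy is whatever function supplies p(move), a network or anything else):

import math

def puct_select(node, policy, exploration_rate=1.5):
    # policy(move) returns the prior p(move) for that move.
    total_sibling_visits = sum(child.visits for child in node.children)
    def puct(child):
        q = child.rewards / child.visits if child.visits > 0 else 0.0
        u = exploration_rate * policy(child.move) * math.sqrt(total_sibling_visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=puct)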
What I want to achieve is a policy function built from hand-tweaked heuristics rather than a neural network. It seems that I just need to swap out p(move) for my heuristics. Is it correct that the values of p() range from 0.0 to 1.0?
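If p() does behave like a probability over the moves at a node, my plan would be roughly the sketch below, with made-up card-game features standing in for the real heuristics and the scores normalized so the priors sum to 1 like a network's softmax output:

def heuristic_policy(moves):
    # Hand-tweaked priors for the legal moves at a node.
    # The move attributes below are invented purely for illustration.
    def raw_score(move):
        score = 1.0                      # baseline so every legal move keeps some prior
        if move.is_trump:
            score += 2.0
        if move.wins_current_trick:
            score += 3.0
        return score
    scores = {move: raw_score(move) for move in moves}
    total = sum(scores.values())
    return {move: s / total for move, s in scores.items()}  # priors in (0, 1], summing to 1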
Please correct me if I misunderstood anything. Thanks!