[Computer-go] Exploration formulas for UCT

Brian Sheppard sheppardco at aol.com
Sun Jan 2 05:50:27 PST 2011


Let me preface my answer by saying that in no way does Pebbles' formula
represent a thoroughly researched effort. Actually, I have cobbled together
a lot of ideas from theory and practice, and tweaked the rules without
extensive testing.

Looking at my code, it seems like I have 10x as many comments about possible
alternatives and variations as about actual tests.

Pebbles uses

	(1-beta) * qUCT + beta * qRAVE;

Beta is set according to Silver's thesis.
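As a sketch of what that schedule looks like (this is not Pebbles' actual
code; the bias constant b is an assumed tuning parameter), the minimum-MSE
schedule from Silver's thesis can be written as:

```cpp
#include <cmath>

// Minimum-MSE beta schedule from Silver's thesis (sketch only).
// n     = number of real (UCT) trials of the move
// nRave = number of RAVE samples for the move
// b     = RAVE bias constant (an assumed tuning parameter)
double raveBeta(double n, double nRave, double b = 0.05) {
    return nRave / (n + nRave + 4.0 * n * nRave * b * b);
}

// Mixed value as in the post: (1-beta) * qUCT + beta * qRAVE.
double mixedValue(double qUCT, double qRAVE, double n, double nRave) {
    double beta = raveBeta(n, nRave);
    return (1.0 - beta) * qUCT + beta * qRAVE;
}
```

With no real trials the mixed value is pure qRAVE (beta = 1), and beta
decays toward 0 as real trials accumulate.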

Both qUCT and qRAVE incorporate exploration terms. My current implementation
uses the exploration term from the Beta-Distribution formula, from
https://webdisk.lclark.edu/drake/publications/BetaDistribution.pdf. This
differs from the standard UCT exploration policy in that it is theoretically
correct for binary data.

For the record, I did not see a significant change in overall strength when
using the beta-distribution formula for exploration. There were cases where
it was better, and it is a better theoretical match, and that was enough for
me to switch.
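For intuition, here is a rough stand-in for such a bound (not Drake's exact
formula, which uses the Beta quantile): a normal approximation to the
Beta(wins+1, losses+1) posterior over a move's win probability. The function
name and the confidence multiplier z are my own assumptions:

```cpp
#include <cmath>

// Illustrative upper confidence bound from the Beta(w+1, l+1) posterior.
// The exact Beta quantile requires an incomplete beta function, which the
// standard library lacks, so a normal approximation stands in for it here.
// Note: unlike the exact quantile, this approximation CAN exceed 1.0.
double betaUpperBound(double wins, double losses, double z = 2.0) {
    double a = wins + 1.0, b = losses + 1.0;
    double mean = a / (a + b);
    double var  = (a * b) / ((a + b) * (a + b) * (a + b + 1.0));
    return mean + z * std::sqrt(var);
}
```

The bound is wide for lightly sampled moves and tightens toward the
observed win rate as trials accumulate.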

I have not tested dropping the exploration terms from the qRAVE estimate.

Pebbles initializes RAVE data with wins and losses based upon patterns. The
weights for the patterns are not ELO-based. Instead, they are slowly tweaked
to optimize competitive results. I have not tried ELO (or any other a priori
tuning method) because I don't trust the model. There is no reason why the
optimal weights for the purpose of searching should match the frequency at
which moves are played in games.

Pebbles' patterns are nothing special: 3x3, atari, CFG distance,...

Looking at my code, I notice many special cases directed at limiting
behavior. For example, the following snippet is important:

	// Preserves the consistency of MCTS in the sense that it will
	// eventually search everything even if qRAVE seriously
	// underestimates qUCT.
	// Verified by Olivier Teytaud.
	//
	if (qRAVE < 0.1) {
		qRAVE = 0.1;
	}

I don't know about the optimality of the constant 0.1, nor whether this
catches all failure modes, but this line must be part of the implementation
or else I see serious failures that cannot be overcome by any reasonable
search.

Another type of limit is the following:

	// Limit the bounds, since it is impossible to do more than win.
	// Also, note the assumption that every move gets one loss, and that
	// moves with more trials are better. Allocating trials to moves that
	// have been as successful as possible reduces the dependency on the
	// exploration constant in the early life of a node. (This is a good
	// thing.)
	//
	if (metric > t / (t + 0.5)) {
		metric = t / (t + 0.5);
	}

This type of limit is in several places. It isn't asymptotically necessary,
or anything, but it has the effect of bounding the optimism of the
estimates. I put such code in because with an exploration term, qUCT and
qRAVE can exceed 1.0.

Now, there is nothing wrong with exploration terms that exceed 1.0, in the
sense that if you continue to invest trials then eventually those estimates
will come down. However, trials are at a premium in Pebbles because I don't
have much hardware, so I am very particular about what happens in the early
lifecycle of a node. My finding was that Pebbles got better results with
these limits. Your mileage may vary.

It is possible that this is unnecessary, because exploration terms based on
the beta distribution should not produce confident bounds larger than 1.0.

Hope this helps.

Brian



