[Computer-go] Value network that doesn't want to learn.
Brian Sheppard
sheppardco at aol.com
Fri Jun 23 05:32:45 PDT 2017
>... my value network was trained to tell me the game is balanced at the beginning...
:-)
The best training policy is to select positions that correct errors.
I used the policies below to train a backgammon NN. Together, they reduced the expected loss of the network by 50% (cut the error rate in half):
- Select training positions from the program's own games.
- Can be self-play or versus an opponent.
- Best is to have a broad panel of opponents.
- Beneficial to bootstrap with pro games, but then add ONLY training examples from program's own games.
- Train only on the moves made by the winner of the game.
- Very important for deterministic games!
- Note that the winner can be either your program or the opponent.
- If your program wins then training reinforces good behavior; if opponent wins then training corrects bad behavior.
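The winner-only filter above can be sketched as follows. This is a minimal illustration, not Brian's actual code; the `Game` structure (a `winner` field and a list of `(side_to_move, position)` records) is a hypothetical stand-in for whatever your game recorder produces.

```python
def winner_training_positions(game):
    """Yield (position, outcome) pairs for moves made by the game's winner.

    Assumes `game` exposes `winner` (+1 or -1) and `moves`, a list of
    (side_to_move, position) tuples recorded during play. These names
    are illustrative, not from the original post.
    """
    for side, position in game.moves:
        if side == game.winner:
            # Reinforces good behavior when our program won;
            # corrects bad behavior when the opponent won.
            yield position, game.winner
```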
- Per game, you should aim to get only a few training examples (about 3 in backgammon; maybe 10 in Go?). Use two policies:
- Select positions where the static evaluation of a position is significantly different from a deep search
- Select positions where the move selected by a deep search did not have the highest static evaluation. (And in this case you have two training positions, which differ by the move chosen.)
- Of course, you are selecting examples where you did as badly as possible.
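The two selection policies can be sketched as below. This is an assumption-laden illustration: `static_eval`, `deep_search`, the position API (`legal_moves()`, `play()`), and the disagreement threshold are all hypothetical stand-ins for your engine's own routines.

```python
def select_training_examples(positions, static_eval, deep_search,
                             threshold=0.1):
    """Return positions where static evaluation disagrees with deep search.

    Policy 1: |static eval - search value| exceeds `threshold`.
    Policy 2: the move chosen by search is not the move with the highest
              static evaluation; both resulting positions are kept.
    """
    examples = []
    for pos in positions:
        search_value, search_move = deep_search(pos)
        # Policy 1: static eval is significantly off.
        if abs(static_eval(pos) - search_value) > threshold:
            examples.append(pos)
        moves = pos.legal_moves()
        if not moves:
            continue  # terminal position: no move-choice disagreement
        # Policy 2: search's move differs from the statically-best move.
        static_best = max(moves, key=lambda m: static_eval(pos.play(m)))
        if search_move != static_best:
            # Two training positions, differing only by the move chosen.
            examples.append(pos.play(search_move))
            examples.append(pos.play(static_best))
    return examples
```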
- The training value of the position is the result of a deep search.
- This is equivalent to "temporal difference learning", but accelerated by the depth of the search.
- Periodically refresh the training evaluations as your search and evaluation improve.
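The labeling and refresh steps above can be sketched as follows, assuming `deep_search` is a function mapping a position to its search value (a hypothetical stand-in for your engine's search):

```python
def label_positions(positions, deep_search):
    """Label each selected position with its deep-search value.

    Using the search result rather than the final game outcome is the
    "accelerated temporal difference learning" described above: the target
    already incorporates search-depth worth of lookahead.
    """
    return [(pos, deep_search(pos)) for pos in positions]

def refresh_labels(dataset, deep_search):
    """Re-label stored positions as search/eval improve over time."""
    return [(pos, deep_search(pos)) for pos, _ in dataset]
```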
These policies actively seek out cases where your evaluation function has some weakness, so training is definitely focused on improving results in the distribution of positions that your program will actually face.
You will need about 30 training examples for every free parameter in your NN. You can do the math on how many games that will take. It is inevitable: you will train your NN based on blitz games.
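That math can be done in a few lines. The parameter counts below are illustrative, not a recommendation; the only numbers taken from the text are ~30 examples per free parameter and a few examples per game.

```python
def games_needed(free_parameters, examples_per_game,
                 examples_per_parameter=30):
    """Games required to harvest enough training examples for the network."""
    total_examples = free_parameters * examples_per_parameter
    # Round up: you can't play a fraction of a game.
    return -(-total_examples // examples_per_game)

# e.g. a (hypothetical) 1M-parameter value network at ~10 examples per
# Go game needs 30M examples, i.e. 3M games -- hence blitz self-play.
```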
Good luck!