[Computer-go] The heuristic "last good reply"
Peter Drake
drake at lclark.edu
Tue Jan 25 13:48:15 PST 2011
I'm all for a learning policy, if you can figure out how to do it. :-)
Peter Drake
http://www.lclark.edu/~drake/
On Jan 25, 2011, at 11:31 AM, Aja wrote:
> Hi Professor Drake,
>
> I will try with more playouts. Thanks for the reminder.
>
> Let me give an example to illustrate my view that the default policy
> should also be included in the learning. Suppose there are several
> decisive life-and-death or semeai situations in a position; the tree
> search cannot reach and clarify every one of them.
>
> In this example, Black's L2 and L4 cause White to answer at L3 to
> capture, following the default policy (which is completely bad). Black
> may then learn quickly through "last good reply" to give atari
> immediately and kill White's whole group to win. The problem is that
> White cannot learn the correct answer H1 or H2, because the reply is
> fixed in the default policy.
>
> In the playouts, the configuration of such a big semeai is likely to
> stay very similar. Such evaluation bias is exactly the kind of issue we
> can fix by learning. By considering probability, I can fix this problem
> by increasing the probability of the "last good reply" H1 or H2 in the
> playout policy, without the tree's aid.
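>
> To make "considering probability" concrete, here is a rough sketch of
> what I mean (an illustration only, not Erica's actual code; the bonus
> value and the move/weight representation are placeholders): the stored
> reply gets extra weight when the playout move is sampled, instead of
> being played outright.
>
>     import random
>
>     LGR_BONUS = 10.0  # how strongly to favour the stored reply (a guess)
>
>     def sample_playout_move(candidates, default_weights, last_good_reply):
>         # candidates: the legal moves; default_weights: their weights
>         # under the default policy.
>         weights = list(default_weights)
>         for i, move in enumerate(candidates):
>             if move == last_good_reply:
>                 weights[i] *= LGR_BONUS  # boost, but do not force, the reply
>         # Weighted random choice over the adjusted distribution.
>         return random.choices(candidates, weights=weights, k=1)[0]
>
> This way White can still discover H1 or H2 in the playouts, even though
> the deterministic answer at L3 would otherwise always be played.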
>
> Every program's playout implementation is somewhat different. But I
> think that excluding the default policy from the learning might limit
> the full power of "last good reply".
>
> Aja
>
> ----- Original Message -----
> From: Peter Drake
> To: computer-go at dvandva.org
> Sent: Wednesday, January 26, 2011 2:27 AM
> Subject: Re: [Computer-go] The heuristic "last good reply"
>
> On Jan 25, 2011, at 10:19 AM, Aja wrote:
>
>> Dear all,
>>
>> Today I tried Professor Drake's "last good reply" in Erica. So far I
>> have got at most 20-30 Elo from it.
>>
>> I tested by self-play, with 3000 playouts/move on 19x19. The number of
>> playouts might be too small, but I would like to test with more
>> playouts only IF the playing strength is not weaker with 3000 playouts.
>
> Yes -- the smallest experiments in the paper were with 8k playouts
> per move. There may not be time to fill up the reply tables with
> only 3k.
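>
> In case a sketch of the bookkeeping helps: after each playout, every
> move made by the winning side is stored as the reply to the move just
> before it, and later playouts consult that table before falling back to
> the default policy. (This is only an illustration of the idea, not our
> exact code; the table and move representation are simplified.)
>
>     # Last-good-reply table, keyed by (color to move, opponent's previous move).
>     reply_table = {}
>
>     def record_playout(moves, winner):
>         # moves: the playout sequence; Black plays moves[0], moves[2], ...
>         for i in range(1, len(moves)):
>             mover = 'W' if i % 2 == 1 else 'B'
>             if mover == winner:
>                 reply_table[(winner, moves[i - 1])] = moves[i]
>
>     def stored_reply(color, prev_move, is_legal):
>         # Return the remembered reply if one exists and is still legal,
>         # otherwise None so the caller falls back to the default policy.
>         reply = reply_table.get((color, prev_move))
>         if reply is not None and is_legal(reply):
>             return reply
>         return None
>
> With only 3k playouts per move, relatively few (previous move, reply)
> pairs get written before the table is consulted, which is why the
> effect may be small.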
>
>> From these preliminary experiments with 3000 playouts, I have some
>> observations:
>>
>> 1. In Erica, it's better to consider probability for this heuristic.
>>
>> 2. In Prof. Drake's implementation, there is a weakness in the
>> learning. I think the main problem is that for a reply which is played
>> deterministically by the default policy, there is no room to learn a
>> new reply. For example, if "save by capture" produces a lost game,
>> then in the next simulation it will still play "save by capture" by
>> the default policy. If I am wrong on this point, I am glad to be
>> corrected by anyone.
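>>
>> To show where I think the problem is, here is the move-selection order
>> I have in mind (my own sketch of how I understand it, not necessarily
>> Prof. Drake's exact code; reply_table, default_policy and the state
>> interface are placeholders):
>>
>>     import random
>>
>>     def choose_playout_move(prev_move, color, state,
>>                             reply_table, default_policy):
>>         # 1. A learned "last good reply", if one is stored and still legal.
>>         reply = reply_table.get((color, prev_move))
>>         if reply is not None and state.is_legal(reply):
>>             return reply
>>         # 2. Otherwise the deterministic default policy, e.g. "save by capture".
>>         move = default_policy(state)
>>         if move is not None:
>>             return move
>>         # 3. Otherwise a uniformly random legal move.
>>         return random.choice(state.legal_moves())
>>
>> When the previous simulation was lost, nothing new is stored for this
>> context, so step 2 deterministically plays the same bad move again.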
>
> This is true, but only if the previous move (or previous two moves)
> come up again in exactly the same board configuration. When the
> configuration is exactly the same, we are probably still in the
> search tree, which overrides the policy. If we are beyond the tree,
> the configuration is almost always different.
>
>> 3. This heuristic has the potential to perform better in Erica. I hope
>> this brief result will encourage other authors to try it.
>
> It's reassuring to see that you got some strength improvement out of
> it!
>
> Thanks,
>
> Peter Drake
> http://www.lclark.edu/~drake/
>
>
>
>
>
> _______________________________________________
> Computer-go mailing list
> Computer-go at dvandva.org
> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
> <default_policy.sgf>