[Computer-go] The heuristic "last good reply"
Peter Drake
drake at lclark.edu
Tue Jan 25 13:48:15 PST 2011
I'm all for a learning policy, if you can figure out how to do it. :-)
Peter Drake
http://www.lclark.edu/~drake/
On Jan 25, 2011, at 11:31 AM, Aja wrote:
> Hi Professor Drake,
>
> I will try with more playouts. Thanks for the reminder.
>
> Let me give an example to illustrate my view that the default policy
> should also be included in the learning. Suppose there are several
> decisive life-and-death or semeai situations in a position; the tree
> search cannot reach and clarify every one of them.
>
> In this example, Black's L2 and L4 cause White to answer at L3 to
> capture, following the default policy (which is completely bad). Black
> may then learn quickly through "last good reply" to give atari
> immediately and kill White's whole group to win. The problem is that
> White cannot learn the correct answer H1 or H2, because the reply is
> fixed in the default policy.
>
> In the playouts, the configuration of such a big semeai is likely to
> stay very similar. Such evaluation bias is exactly the kind of issue we
> can fix by learning. By considering probability, I can fix this problem
> by increasing the probability of the "last good reply" H1 or H2 in the
> playout policy, without the tree's aid.
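>
> To make "considering probability" concrete, here is a rough sketch of
> what I mean (an illustration only, not Erica's actual code; the bonus
> value and the move/weight representation are placeholders): the stored
> reply gets extra weight when the playout move is sampled, instead of
> being played outright.
>
>     import random
>
>     LGR_BONUS = 10.0  # how strongly to favour the stored reply (a guess)
>
>     def sample_playout_move(candidates, default_weights, last_good_reply):
>         # candidates: the legal moves; default_weights: their weights
>         # under the default policy.
>         weights = list(default_weights)
>         for i, move in enumerate(candidates):
>             if move == last_good_reply:
>                 weights[i] *= LGR_BONUS  # boost, but do not force, the reply
>         # Weighted random choice over the adjusted distribution.
>         return random.choices(candidates, weights=weights, k=1)[0]
>
> This way White can still discover H1 or H2 in the playouts, even though
> the deterministic answer at L3 would otherwise always be played.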
>
> Every program's playout implementation is somewhat different. But I
> think that excluding the default policy from the learning might limit
> the full power of "last good reply".
>
> Aja
>
> ----- Original Message -----
> From: Peter Drake
> To: computer-go at dvandva.org
> Sent: Wednesday, January 26, 2011 2:27 AM
> Subject: Re: [Computer-go] The heuristic "last good reply"
>
> On Jan 25, 2011, at 10:19 AM, Aja wrote:
>
>> Dear all,
>>
>> Today I tried Professor Drake's "last good reply" in Erica. So far I
>> have got at most 20-30 Elo from it.
>>
>> I tested by self-play, with 3000 playouts/move on 19x19. The number of
>> playouts might be too small, but I would like to test with more
>> playouts only IF the playing strength is not weaker with 3000 playouts.
>
> Yes -- the smallest experiments in the paper were with 8k playouts
> per move. There may not be time to fill up the reply tables with
> only 3k.
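>
> In case a sketch of the bookkeeping helps: after each playout, every
> move made by the winning side is stored as the reply to the move just
> before it, and later playouts consult that table before falling back to
> the default policy. (This is only an illustration of the idea, not our
> exact code; the table and move representation are simplified.)
>
>     # Last-good-reply table, keyed by (color to move, opponent's previous move).
>     reply_table = {}
>
>     def record_playout(moves, winner):
>         # moves: the playout sequence; Black plays moves[0], moves[2], ...
>         for i in range(1, len(moves)):
>             mover = 'W' if i % 2 == 1 else 'B'
>             if mover == winner:
>                 reply_table[(winner, moves[i - 1])] = moves[i]
>
>     def stored_reply(color, prev_move, is_legal):
>         # Return the remembered reply if one exists and is still legal,
>         # otherwise None so the caller falls back to the default policy.
>         reply = reply_table.get((color, prev_move))
>         if reply is not None and is_legal(reply):
>             return reply
>         return None
>
> With only 3k playouts per move, relatively few (previous move, reply)
> pairs get written before the table is consulted, which is why the
> effect may be small.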
>
>> From these preliminary experiments with 3000 playouts, I have some
>> observations:
>>
>> 1. In Erica, it's better to consider probability for this heuristic.
>>
>> 2. In Prof. Drake's implementation, there is a weakness in the
>> learning. I think the main problem is that for a reply which is played
>> deterministically by the default policy, there is no room to learn a
>> new reply. For example, if "save by capture" produces a lost game,
>> then in the next simulation it will still play "save by capture" by
>> the default policy. If I am wrong on this point, I am glad to be
>> corrected by anyone.
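>>
>> To show where I think the problem is, here is the move-selection order
>> I have in mind (my own sketch of how I understand it, not necessarily
>> Prof. Drake's exact code; reply_table, default_policy and the state
>> interface are placeholders):
>>
>>     import random
>>
>>     def choose_playout_move(prev_move, color, state,
>>                             reply_table, default_policy):
>>         # 1. A learned "last good reply", if one is stored and still legal.
>>         reply = reply_table.get((color, prev_move))
>>         if reply is not None and state.is_legal(reply):
>>             return reply
>>         # 2. Otherwise the deterministic default policy, e.g. "save by capture".
>>         move = default_policy(state)
>>         if move is not None:
>>             return move
>>         # 3. Otherwise a uniformly random legal move.
>>         return random.choice(state.legal_moves())
>>
>> When the previous simulation was lost, nothing new is stored for this
>> context, so step 2 deterministically plays the same bad move again.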
>
> This is true, but only if the previous move (or previous two moves)
> come up again in exactly the same board configuration. When the
> configuration is exactly the same, we are probably still in the
> search tree, which overrides the policy. If we are beyond the tree,
> the configuration is almost always different.
>
>> 3. This heuristic has the potential to perform better in Erica. I hope
>> this brief result will encourage other authors to try it.
>
> It's reassuring to see that you got some strength improvement out of
> it!
>
> Thanks,
>
> Peter Drake
> http://www.lclark.edu/~drake/
>
>
>
>
>
> _______________________________________________
> Computer-go mailing list
> Computer-go at dvandva.org
> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
> <default_policy.sgf>