[Computer-go] Understanding statistics for benchmarking
Urban Hafner
contact at urbanhafner.com
Tue Nov 3 01:47:49 PST 2015
Thank you Remi!
So the 85.5% +/- 2.5 reported by GoGui would be 85.5% +/- 5 at 95% confidence,
and 85.5% +/- 7.5 at three standard deviations (roughly 99.7%). Correct?
And thanks for the table. I think that’s good enough for now. I’ve figured out
how to calculate the standard deviation myself (it’s easy), and with those two
tools together I can see that 200 games is a bit on the low end. :) I had
expected as much, but it’s good to know for sure.
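As a rough sketch of that calculation (the helper name `win_rate_interval` is invented here; this assumes the +/- reported by gogui-twogtp is one binomial standard deviation of the win rate, as Rémi explains below):

```python
# Sketch: binomial standard error of a win rate, and the resulting
# confidence intervals. Assumes each game is an independent Bernoulli trial.
import math

def win_rate_interval(wins, games, z=1.0):
    """Return (win rate, half-width), where half-width = z standard errors."""
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)  # binomial standard error
    return p, z * se

# First run: 171/200 wins -> 85.5% +/- ~2.5 (one standard deviation)
p1, sd1 = win_rate_interval(171, 200)
# Second run: 158/200 wins -> 79.0% +/- ~2.9
p2, sd2 = win_rate_interval(158, 200)

# 95% confidence intervals: ~1.96 standard deviations on each side.
_, ci1 = win_rate_interval(171, 200, z=1.96)
_, ci2 = win_rate_interval(158, 200, z=1.96)

# At the 95% level the two intervals do overlap, so the 13-win gap
# between the runs could plausibly be noise.
intervals_overlap = (p1 - ci1) < (p2 + ci2)
```

Run with the numbers from the thread, this reproduces gogui-twogtp’s +/- 2.5 and +/- 2.9, and shows the two 95% intervals overlapping.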
Urban
On Tue, Nov 3, 2015 at 9:46 AM, Rémi Coulom <remi.coulom at free.fr> wrote:
> The intervals given by gogui are the standard deviation, not the usual 95%
> confidence intervals.
>
> For 95% confidence intervals, you have to multiply the standard deviation
> by two.
>
> And you still have the 5% chance of not being inside the interval, so you
> can still get the occasional non-overlapping intervals.
>
> Likelihood of superiority is an interesting statistical tool:
> https://chessprogramming.wikispaces.com/LOS+Table
>
> For more advanced tools for deciding when to stop testing, there is SPRT:
> http://www.open-chess.org/viewtopic.php?f=5&t=2477
> https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
>
> Rémi
>
>
> On 11/03/2015 09:38 AM, Urban Hafner wrote:
>
>> So,
>>
>> I’m currently running 200 games against GnuGo to see if a change to my
>> program made a difference. But I now wonder if that’s enough games as I ran
>> the same benchmark with the same code (but a different compiler version)
>> and received different results:
>>
>> 85.5% wins (171 games of 200) the first time (+/- 2.5 according to
>> gogui-twogtp)
>> 79.0% wins (158 games of 200) the second time (+/- 2.9 according to
>> gogui-twogtp)
>>
>> These results would make me believe that the difference is significant
>> (the intervals don’t overlap), but then the real difference is only 13
>> wins …
>>
>> My statistics knowledge is sketchy at best but assuming that what
>> gogui-twogtp calculates is the 95% confidence interval (I’m pretty sure I’m
>> mixing terms here) it could well be that the difference between the two
>> runs above is just random.
>>
>> So, this leads me to two questions:
>>
>> 1. How many games do you normally run to test if a change is significant
>> “enough”?
>> 2. Any good resources on how to calculate these statistics (i.e. if I
>> wanted to find the error margin for a 99% confidence interval)?
>>
>> Urban
>> --
>> Blog: http://bettong.net/
>> Twitter: https://twitter.com/ujh
>> Homepage: http://www.urbanhafner.com/
>>
>>
>> _______________________________________________
>> Computer-go mailing list
>> Computer-go at computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
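The two tools Rémi points to can be sketched as follows. The helper names `los_two_runs` and `sprt` are invented; the LOS sketch uses a two-proportion z-test (normal approximation), and the SPRT sketch assumes a plain Bernoulli win/loss model with no draws:

```python
# Sketch: likelihood of superiority (LOS) between two test runs, and a
# sequential probability ratio test (SPRT) for deciding when to stop.
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def los_two_runs(w1, n1, w2, n2):
    """Probability that the first version is genuinely stronger,
    from a two-proportion z-test (normal approximation)."""
    p1, p2 = w1 / n1, w2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return norm_cdf((p1 - p2) / se)

def sprt(results, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on a win/loss sequence.
    H0: win rate <= p0; H1: win rate >= p1 (p1 > p0).
    results: iterable of booleans (True = win).
    Returns 'H1', 'H0', or 'continue' if neither bound was hit."""
    upper = math.log((1 - beta) / alpha)   # accept H1 above this
    lower = math.log(beta / (1 - alpha))   # accept H0 below this
    llr = 0.0
    for won in results:
        llr += math.log(p1 / p0) if won else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return 'H1'
        if llr <= lower:
            return 'H0'
    return 'continue'

# With the thread's numbers, the LOS of the first run over the second
# comes out a bit above 95% -- suggestive, but short of certainty.
los = los_two_runs(171, 200, 158, 200)
```

The SPRT has the practical advantage that it stops the match as soon as the evidence is strong enough, instead of committing to a fixed 200 games up front.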
--
Blog: http://bettong.net/
Twitter: https://twitter.com/ujh
Homepage: http://www.urbanhafner.com/