[Computer-go] Understanding statistics for benchmarking
Kahn Jonas
jonas.kahn at math.u-psud.fr
Tue Nov 3 05:17:09 PST 2015
On Tue, 3 Nov 2015, Urban Hafner wrote:
> Thank you Remi!
> So the 85.5% +/- 2.5 reported by GoGui would be 85.5% +/- 5 at 95% confidence and 85.5% +/- 7.5 at 99.7%.
> Correct?
Correct.
But significance does not require that the intervals fail to overlap:
that criterion is stricter than necessary.
You may instead divide those intervals by $\sqrt{2}$ before testing whether
they overlap (valid in the large-sample limit, of course, but the whole
discussion so far has implicitly been in that limit anyway).
The factor $\sqrt{2}$ applies when both runs consist of the same number
of games, as in your example.
More precisely, you are computing a confidence interval on the
difference of expectations. You would need a few corrections to be
perfectly rigorous, but that should be enough for your needs.
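The sqrt(2)-corrected overlap test above can be sketched as follows (a minimal sketch under the normal approximation, using the two runs reported in the thread; variable names are the editor's own):

```python
import math

def ci_halfwidth(wins, games, z=2.0):
    """Half-width of the z-based confidence interval for a win rate
    (z = 2 gives roughly the 95% interval, per Remi's rule of thumb)."""
    p = wins / games
    return z * math.sqrt(p * (1 - p) / games)

# The two runs from the thread: 171/200 and 158/200 wins.
p1, p2 = 171 / 200, 158 / 200
h1 = ci_halfwidth(171, 200)
h2 = ci_halfwidth(158, 200)

# Naive overlap test (too conservative, as Jonas notes):
overlap_naive = abs(p1 - p2) < h1 + h2

# Jonas's correction: with equal numbers of games, shrink the
# intervals by sqrt(2) before testing for overlap.
overlap_corrected = abs(p1 - p2) < (h1 + h2) / math.sqrt(2)
```

Even with the correction, the intervals from the two 200-game runs still overlap, so the observed difference is consistent with noise.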
Jonas
> And thanks for the table. I think that’s good enough for now. I’ve now figured out how
> to calculate the std. deviation myself (it is easy) and with those two tools together
> I can now see that 200 games is a bit on the low end. :) I had expected as much but
> it’s good to know for sure.
>
> Urban
>
> On Tue, Nov 3, 2015 at 9:46 AM, Rémi Coulom <remi.coulom at free.fr> wrote:
> The intervals given by gogui are the standard deviation, not the usual 95%
> confidence intervals.
>
> For 95% confidence intervals, you have to multiply the standard deviation
> by two.
>
> And you still have the 5% chance of not being inside the interval, so you
> can still get the occasional non-overlapping intervals.
>
> Likelihood of superiority is an interesting statistical tool:
> https://chessprogramming.wikispaces.com/LOS+Table
>
> For more advanced tools for deciding when to stop testing, there is SPRT:
> http://www.open-chess.org/viewtopic.php?f=5&t=2477
> https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
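The two tools Rémi links can be sketched briefly (a hedged sketch: the LOS formula is the usual normal approximation with draws ignored, and the SPRT is Wald's standard test for a Bernoulli win rate; the example thresholds and counts are illustrative, not from the thread):

```python
import math

def los(wins, losses):
    """Likelihood of superiority: probability that the match winner is
    genuinely the stronger program (normal approximation, draws ignored)."""
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

def sprt(wins, losses, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a Bernoulli win rate,
    H0: p = p0 versus H1: p = p1. Returns 'H1', 'H0', or 'continue'
    (meaning: keep playing games)."""
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    if llr >= math.log((1 - beta) / alpha):
        return "H1"
    if llr <= math.log(beta / (1 - alpha)):
        return "H0"
    return "continue"
```

For example, `sprt(171, 29, 0.5, 0.55)` accepts H1 immediately, while `sprt(5, 5, 0.5, 0.55)` tells you to keep playing; the point of the SPRT is that it lets you stop a match as soon as the evidence is strong enough, rather than fixing the number of games in advance.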
>
> Rémi
>
> On 11/03/2015 09:38 AM, Urban Hafner wrote:
> So,
>
> I’m currently running 200 games against GnuGo to see if a change to
> my program made a difference. But I now wonder if that’s enough
> games as I ran the same benchmark with the same code (but a
> different compiler version) and received different results:
>
> 85.5% wins (171 games of 200) the first time (+/- 2.5 according to
> gogui-twogtp)
> 79.0% wins (158 games of 200) the second time (+/- 2.9 according to
> gogui-twogtp)
>
> Looking at these results would make me believe that the difference
> is significant (the intervals don’t overlap) but then the real
> difference is only 13 wins …
>
> My statistics knowledge is sketchy at best but assuming that what
> gogui-twogtp calculates is the 95% confidence interval (I’m pretty
> sure I’m mixing terms here) it could well be that the difference
> between the two runs above is just random.
>
> So, this leads me to two questions:
>
> 1. How many games do you normally run to test if a change is
> significant “enough”?
> 2. Any good resources on how to calculate these statistics (i.e. if
> I wanted to find the error margin for a 99% confidence interval)?
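For question 2, the error margin at any confidence level is the standard deviation of the estimated win rate times the matching normal quantile (a sketch under the normal approximation; 1.96 and 2.576 are the standard 95% and 99% quantiles):

```python
import math

def margin(wins, games, z):
    """z times the standard deviation of the estimated win rate."""
    p = wins / games
    return z * math.sqrt(p * (1 - p) / games)

# First run from the thread: 171 wins out of 200 games.
m95 = margin(171, 200, 1.96)   # 95% confidence margin
m99 = margin(171, 200, 2.576)  # 99% confidence margin
```

Since the margin shrinks only like 1/sqrt(games), quadrupling the number of games is needed to halve it.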
>
> Urban
> --
> Blog: http://bettong.net/
> Twitter: https://twitter.com/ujh
> Homepage: http://www.urbanhafner.com/
>
>
> _______________________________________________
> Computer-go mailing list
> Computer-go at computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
>