[Computer-go] Understanding statistics for benchmarking
Kahn Jonas
jonas.kahn at math.u-psud.fr
Tue Nov 3 05:17:09 PST 2015
On Tue, 3 Nov 2015, Urban Hafner wrote:
> Thank you Remi!
> So the 85.5% +/- 2.5 reported by GoGui would be 85.5% +/- 5 at 95% confidence and 85.5% +/- 7.5 at 99.7%.
> Correct?
Correct.
But significance does not require that the intervals fail to overlap:
that criterion is stricter than necessary.
You may instead divide those intervals by $\sqrt{2}$ before testing whether
they overlap (valid in the large-sample limit, of course, but the whole
discussion so far has implicitly been in that limit anyway).
The factor $\sqrt{2}$ applies when both runs consist of the same number
of games, as in your example.
More precisely, you are computing a confidence interval on the
difference of expectations. You would need a few corrections to be
perfectly rigorous, but that should be enough for your needs.
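The sqrt(2)-corrected overlap test above can be sketched as follows (a minimal sketch under the normal approximation, using the two runs reported in the thread; variable names are the editor's own):

```python
import math

def ci_halfwidth(wins, games, z=2.0):
    """Half-width of the z-based confidence interval for a win rate
    (z = 2 gives roughly the 95% interval, per Remi's rule of thumb)."""
    p = wins / games
    return z * math.sqrt(p * (1 - p) / games)

# The two runs from the thread: 171/200 and 158/200 wins.
p1, p2 = 171 / 200, 158 / 200
h1 = ci_halfwidth(171, 200)
h2 = ci_halfwidth(158, 200)

# Naive overlap test (too conservative, as Jonas notes):
overlap_naive = abs(p1 - p2) < h1 + h2

# Jonas's correction: with equal numbers of games, shrink the
# intervals by sqrt(2) before testing for overlap.
overlap_corrected = abs(p1 - p2) < (h1 + h2) / math.sqrt(2)
```

Even with the correction, the intervals from the two 200-game runs still overlap, so the observed difference is consistent with noise.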
Jonas
> And thanks for the table. I think that’s good enough for now. I’ve now figured out how
> to calculate the std. deviation myself (it is easy) and with those two tools together
> I can now see that 200 games is a bit on the low end. :) I had expected as much but
> it’s good to know for sure.
>
> Urban
>
> On Tue, Nov 3, 2015 at 9:46 AM, Rémi Coulom <remi.coulom at free.fr> wrote:
> The intervals given by gogui are the standard deviation, not the usual 95%
> confidence intervals.
>
> For 95% confidence intervals, you have to multiply the standard deviation
> by two.
>
> And you still have the 5% chance of not being inside the interval, so you
> can still get the occasional non-overlapping intervals.
>
> Likelihood of superiority is an interesting statistical tool:
> https://chessprogramming.wikispaces.com/LOS+Table
>
> For more advanced tools for deciding when to stop testing, there is SPRT:
> http://www.open-chess.org/viewtopic.php?f=5&t=2477
> https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
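The two tools Rémi links can be sketched briefly (a hedged sketch: the LOS formula is the usual normal approximation with draws ignored, and the SPRT is Wald's standard test for a Bernoulli win rate; the example thresholds and counts are illustrative, not from the thread):

```python
import math

def los(wins, losses):
    """Likelihood of superiority: probability that the match winner is
    genuinely the stronger program (normal approximation, draws ignored)."""
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

def sprt(wins, losses, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a Bernoulli win rate,
    H0: p = p0 versus H1: p = p1. Returns 'H1', 'H0', or 'continue'
    (meaning: keep playing games)."""
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    if llr >= math.log((1 - beta) / alpha):
        return "H1"
    if llr <= math.log(beta / (1 - alpha)):
        return "H0"
    return "continue"
```

For example, `sprt(171, 29, 0.5, 0.55)` accepts H1 immediately, while `sprt(5, 5, 0.5, 0.55)` tells you to keep playing; the point of the SPRT is that it lets you stop a match as soon as the evidence is strong enough, rather than fixing the number of games in advance.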
>
> Rémi
>
> On 11/03/2015 09:38 AM, Urban Hafner wrote:
> So,
>
> I’m currently running 200 games against GnuGo to see if a change to
> my program made a difference. But I now wonder if that’s enough
> games as I ran the same benchmark with the same code (but a
> different compiler version) and received different results:
>
> 85.5% wins (171 games of 200) the first time (+/- 2.5 according to
> gogui-twogtp)
> 79.0% wins (158 games of 200) the second time (+/- 2.9 according to
> gogui-twogtp)
>
> Looking at these results would make me believe that the difference
> is significant (the intervals don’t overlap) but then the real
> difference is only 13 wins …
>
> My statistics knowledge is sketchy at best but assuming that what
> gogui-twogtp calculates is the 95% confidence interval (I’m pretty
> sure I’m mixing terms here) it could well be that the difference
> between the two runs above is just random.
>
> So, this leads me to two questions:
>
> 1. How many games do you normally run to test if a change is
> significant “enough”?
> 2. Any good resources on how to calculate these statistics (i.e. if
> I wanted to find the error margin for a 99% confidence interval)?
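For question 2, the error margin at any confidence level is the standard deviation of the estimated win rate times the matching normal quantile (a sketch under the normal approximation; 1.96 and 2.576 are the standard 95% and 99% quantiles):

```python
import math

def margin(wins, games, z):
    """z times the standard deviation of the estimated win rate."""
    p = wins / games
    return z * math.sqrt(p * (1 - p) / games)

# First run from the thread: 171 wins out of 200 games.
m95 = margin(171, 200, 1.96)   # 95% confidence margin
m99 = margin(171, 200, 2.576)  # 99% confidence margin
```

Since the margin shrinks only like 1/sqrt(games), quadrupling the number of games is needed to halve it.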
>
> Urban
> --
> Blog: http://bettong.net/
> Twitter: https://twitter.com/ujh
> Homepage: http://www.urbanhafner.com/
>
>
> _______________________________________________
> Computer-go mailing list
> Computer-go at computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
>