[Computer-go] Understanding statistics for benchmarking

Tue Nov 3 07:02:46 PST 2015

Here's Orego's Java code for this (attached as Significance.java).

It involves a "two-tailed test for difference of proportions".
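
The attachment itself is not reproduced in the archive. As a rough illustration of what a two-tailed test for difference of proportions looks like, here is a minimal sketch. It is not Orego's actual code: the class and method names are hypothetical, it assumes a pooled standard error with a normal approximation, and it uses the Abramowitz-Stegun 7.1.26 polynomial for erf because the Java standard library has none.

```java
/** Hypothetical sketch of a two-tailed two-proportion z-test; not Orego's code. */
final class ProportionTestSketch {

    /** z statistic for the difference of two win rates, using the pooled win rate. */
    public static double zScore(int winsA, int gamesA, int winsB, int gamesB) {
        double rateA = (double) winsA / gamesA;
        double rateB = (double) winsB / gamesB;
        double pooled = (double) (winsA + winsB) / (gamesA + gamesB);
        // Undefined (NaN) if pooled is exactly 0 or 1; a real tool should guard that.
        double standardError =
                Math.sqrt(pooled * (1 - pooled) * (1.0 / gamesA + 1.0 / gamesB));
        return (rateA - rateB) / standardError;
    }

    /** Standard normal CDF via an erf approximation (absolute error about 1.5e-7). */
    public static double normalCdf(double z) {
        double x = Math.abs(z) / Math.sqrt(2);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        double erf = 1.0 - poly * Math.exp(-x * x);
        return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
    }

    /** Probability of a difference at least this large, in either direction, by chance. */
    public static double twoTailedPValue(int winsA, int gamesA, int winsB, int gamesB) {
        double z = zScore(winsA, gamesA, winsB, gamesB);
        return 2 * (1 - normalCdf(Math.abs(z)));
    }

    public static void main(String[] args) {
        // Example: condition A wins 550 of 1000 games, condition B wins 500 of 1000.
        System.out.printf("z = %.3f, p = %.4f%n",
                zScore(550, 1000, 500, 1000),
                twoTailedPValue(550, 1000, 500, 1000));
    }
}
```

A p-value below 0.05 would conventionally be read as a significant difference between the two conditions, provided the number of games was fixed before the experiment started.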

I usually run 500-1000 games in each condition. (The exact number depends
on the hardware available at the time.)
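
For a sense of what that many games buys: under the normal approximation, and assuming a win rate near 50%, the 95% margin on a single condition's win rate is roughly +/-4.4% at 500 games and +/-3.1% at 1000. The sketch below shows the arithmetic; the class and method names are hypothetical.

```java
/** Hypothetical sketch: 95% margin of error on a win rate after a fixed number of games. */
final class WinRateMarginSketch {

    /** Standard error of an observed win rate over independent games. */
    public static double standardError(double winRate, int games) {
        return Math.sqrt(winRate * (1 - winRate) / games);
    }

    /** Approximate 95% half-width: 1.96 standard errors (normal approximation). */
    public static double margin95(double winRate, int games) {
        return 1.96 * standardError(winRate, games);
    }

    public static void main(String[] args) {
        for (int games : new int[] {500, 1000}) {
            System.out.printf("%d games: +/- %.1f%%%n", games, 100 * margin95(0.5, games));
        }
    }
}
```

Halving the margin takes four times as many games, since the standard error shrinks with the square root of the game count.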

On Tue, Nov 3, 2015 at 5:50 AM, Urban Hafner <contact at urbanhafner.com>
wrote:

> Yes, I noticed that too. But luckily that's the one thing I didn't even
> consider doing. Running the same number of games feels like the most
> natural thing to do anyway.
>
> Sent from my iPhone
>
> > On 03.11.2015 at 14:22, Petr Baudis <pasky at ucw.cz> wrote:
> >
> >> On Tue, Nov 03, 2015 at 09:46:00AM +0100, Rémi Coulom wrote:
> >> The intervals given by gogui are the standard deviation, not the usual
> >> 95% confidence intervals.
> >>
> >> For 95% confidence intervals, you have to multiply the standard
> >> deviation by two.
> >>
> >> And you still have the 5% chance of not being inside the interval, so
> >> you can still get the occasional non-overlapping intervals.
> >>
> >> Likelihood of superiority is an interesting statistical tool:
> >> https://chessprogramming.wikispaces.com/LOS+Table
> >>
> >> For more advanced tools for deciding when to stop testing, there is SPRT:
> >> http://www.open-chess.org/viewtopic.php?f=5&t=2477
> >> https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
> >
> > An important corollary to this (noted on this list every few years)
> > is that in the most naive scenario where your statistical test is just
> > SD-based overlap after N games, you should fix your N number of games
> > in advance and not rig it by terminating out of schedule.  If you look
> > at the progress of your playtesting often, you could spot a few moments
> > where the intervals do not overlap, even if in the long run they
> > typically would.
> >
> > (The situation is a bit dire if you have limited computing resources.
> > I admit that sometimes I didn't follow the above myself in less formal
> > exploratory experiments, but at least I tried to look only
> > "infrequently", e.g. single check every few hours, only at "round"
> > numbers of playouts, etc.  I hope it's not a grave sin.)
> >
> > --
> >                Petr Baudis
> >    If you have good ideas, good data and fast computers,
> >    you can do almost anything. -- Geoffrey Hinton
> > _______________________________________________
> > Computer-go mailing list
> > Computer-go at computer-go.org
> > http://computer-go.org/mailman/listinfo/computer-go
>

-- 
Peter Drake
https://sites.google.com/a/lclark.edu/drake/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Significance.java
Type: application/octet-stream
Size: 2095 bytes
Desc: not available
URL: <http://computer-go.org/pipermail/computer-go/attachments/20151103/18b1c149/attachment.obj>