[Computer-go] Understanding statistics for benchmarking
Petr Baudis
pasky at ucw.cz
Tue Nov 3 05:22:01 PST 2015
On Tue, Nov 03, 2015 at 09:46:00AM +0100, Rémi Coulom wrote:
> The intervals given by gogui are the standard deviation, not the usual 95%
> confidence intervals.
>
> For 95% confidence intervals, you have to multiply the standard deviation by
> two.
>
> And you still have the 5% chance of not being inside the interval, so you
> can still get the occasional non-overlapping intervals.
>
> Likelihood of superiority is an interesting statistical tool:
> https://chessprogramming.wikispaces.com/LOS+Table
>
> For more advanced tools for deciding when to stop testing, there is SPRT:
> http://www.open-chess.org/viewtopic.php?f=5&t=2477
> https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
An important corollary to this (noted on this list every few years)
is that in the most naive scenario, where your statistical test is just
SD-based overlap after N games, you should fix the number of games N
in advance and not rig the test by stopping off schedule. If you look
at the progress of your playtesting often, you could spot a few moments
where the intervals do not overlap, even if in the long run they
typically would.
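
To illustrate the effect, here is a minimal simulation sketch in Python.
It is not gogui's code; the function names, the 1000-game match length and
the "win rate +/- 2 standard errors excludes 50%" criterion are my own
assumptions standing in for the naive SD-based test. Both programs are
equally strong, so every "significant" result is a false alarm.

```python
import math
import random

def interval_excludes_even(wins, games, z=2.0):
    """True if win rate +/- z standard errors does not contain 0.5."""
    rate = wins / games
    stderr = math.sqrt(rate * (1.0 - rate) / games)
    return abs(rate - 0.5) > z * stderr

def false_positive_rate(total_games, check_every, trials, rng):
    """Fraction of equal-strength matches declared 'significant'.

    A match stops at the first check where the interval excludes 0.5.
    check_every == total_games means a single look at the pre-fixed N.
    """
    hits = 0
    for _ in range(trials):
        wins = 0
        for game in range(1, total_games + 1):
            wins += rng.random() < 0.5  # fair coin: no real strength difference
            if game % check_every == 0 and interval_excludes_even(wins, game):
                hits += 1
                break
    return hits / trials

rng = random.Random(2015)  # fixed seed so the run is repeatable
fixed = false_positive_rate(1000, 1000, 1000, rng)
peeking = false_positive_rate(1000, 10, 1000, rng)
print(f"single look at N=1000: {fixed:.3f}")
print(f"look every 10 games:   {peeking:.3f}")
```

With a single look the false alarm rate should land near the nominal 5%;
checking every 10 games and stopping at the first non-overlap pushes it
far above that, which is exactly the rigging described above.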
(The situation is a bit dire if you have limited computing resources.
I admit that sometimes I didn't follow the above myself in less formal
exploratory experiments, but at least I tried to look only
"infrequently", e.g. a single check every few hours, only at "round"
numbers of playouts, etc. I hope it's not a grave sin.)
--
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton