[Computer-go] Standard Computer Go Datasets - Proposal

Fri Nov 13 03:48:00 PST 2015

I think that once you start calculating Zobrist hashes and scraping
features yourself, you will end up with a never-ending variety of datasets.

I would prefer datasets of whole, high-quality games without SGF errors,
perhaps cleaned of identifying information. Parsing an SGF is already
trivial. I personally divide them by:

- Handicap used or not
- Komi normal (5.5 to 7.5) or not; this disqualifies some older games
- Rules used
- Board size
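
As a rough illustration of the split above, here is a minimal Python
sketch. It is not a real SGF parser: it assumes the relevant properties
(HA, KM, RU, SZ) appear as plain tags in the file and simply takes the
first occurrence of each, so a tag-like string inside a comment would
fool it. The function name classify_sgf, the returned keys, and the
fallback to 19 when SZ is missing are my own choices for the example.

```python
import re

# Naive matcher for the four SGF properties we care about, e.g. SZ[19].
_PROP = re.compile(r"\b(HA|KM|RU|SZ)\[([^\]]*)\]")


def classify_sgf(sgf_text):
    """Return the four dividing criteria for one SGF game record."""
    props = {}
    for key, value in _PROP.findall(sgf_text):
        # Keep only the first occurrence of each property.
        props.setdefault(key, value.strip())

    try:
        handicap = int(props.get("HA", "0"))
    except ValueError:
        handicap = 0

    try:
        komi = float(props["KM"])
    except (KeyError, ValueError):
        komi = None  # missing or unparsable komi counts as "not normal"

    try:
        # SZ may be "cols:rows" for rectangular boards; take the first number.
        size = int(props.get("SZ", "19").split(":")[0])
    except ValueError:
        size = 19  # assumption: fall back to the usual Go board size

    return {
        "handicap": handicap > 0,
        "normal_komi": komi is not None and 5.5 <= komi <= 7.5,
        "rules": props.get("RU", "").lower() or "unknown",
        "board_size": size,
    }
```

Calling classify_sgf on the text of each file and grouping by the
returned values would give the buckets listed above.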

In the spirit of providing more raw information rather than very specific
pre-extracted features, it would also be interesting to have the
playing times, although I don't know where you'd get them from.

You'd be an angel if you could provide a large dataset of matches played
under Chinese rules, especially on board sizes other than 19x19.

It would of course also have to be completely free for any use. I
personally only use the KGS 6d+ games and a collection of 70k pro games
whose origin I don't know. GoGoD is proprietary. :)

Gonçalo F.

On 11/13/2015 08:39 AM, Josef Moudrik wrote:
> Hello List,
>
> There has been some debate in science about making the research more
> reproducible and open. Recently, I have been thinking about making a
> standard public fixed dataset of Go games, mainly to ease comparison of
> different methods, to make results more reproducible and maybe free the
> authors of the burden of composing a dataset. I think that the current
> practice can be improved a lot.
>
> Since the success of this endeavor crucially depends on how many authors
> use the dataset, I would like to ask You (potential authors) a few
> questions:
>
> 1) Would this be welcomed and used? Would You personally use it? (Am I not
> reinventing the wheel?)
>
> 2) What parameters should the dataset have? The number of dataset variants
> (if any) should be in my opinion kept at bare minimum to reduce
> "fragmentation".
>
> 2a) Size: My current view is that at least 2 sizes are necessary: small
> (1000-2000 games?) and large dataset (50000-60000 games).
> 2b) Strength & year span: Currently I am thinking about including modern
> professional games only (1970-2015)
>
> 3) Do you have any other comments, requirements for the dataset and ideas?
>
>
> Thanks for Your attention,
> Kind regards
> Josef Moudrik