Freestyle Chess Versus Computers Alone

A brand-new version of my 2010 study, relevant to the book Average is Over by Tyler Cowen.


"Freestyle" chess allows humans unrestricted use of computers during games. The team is commonly called a "centaur". The largest-scale Freestyle tournaments were sponsored by the PAL Group in Abu Dhabi and implemented by Computer-Schach und Spiele on the PlayChess server run by the German company ChessBase. There were eight PAL/CSS events played from 2005 to 2008. This page has links to ChessBase.com articles on all of them.

My older study used the chess program Rybka 3, released in August 2008, to analyze games from the PAL/CSS events, together with games of computers playing alone in the World Computer Chess Championship (WCCC) and other events. I also included the championships of the International Correspondence Chess Federation (ICCF), in which computer use is not forbidden but is perceived, even by finalists, as a shades-of-grey matter.

The PAL/CSS tournaments were played with time controls ranging from 60 minutes plus a 15-second increment per move, to 90 minutes plus a 30-second increment. The former would be called "semi-rapid", but the latter is close to those used in high-level human tournaments. The computer tournaments covered here have ranged from semi-rapid (CCT) to 105 minutes plus 15-second increments (WCCC), while the CEGT matches highlighted here have used the onetime human world championship time control of 120 minutes for the first 40 moves, 60 for the next 20, and 30 minutes for the rest of the game. Correspondence chess of course allows nearly unlimited time by comparison.

Data

The new study uses Stockfish 4, slightly modified as detailed below, run to depth 19 in single-PV mode. Stockfish 4 was released on 8/20/2013. Data points need at least 100 moves for inclusion.
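For concreteness, here is a minimal sketch of how such fixed-depth, single-PV analysis can be driven over the UCI protocol, incorporating the hash-clearing and single-thread settings discussed in the Commentary below. The engine path, helper names, and test position are my own illustrative choices, not the study's actual tooling.

```python
# Minimal sketch, not the actual apparatus: drive a UCI engine binary to a
# fixed depth in single-PV mode.  Paths and helper names are illustrative.
import subprocess

ENGINE_PATH = "./stockfish"   # assumed location of the engine binary

def send(engine, cmd):
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

def wait_for(engine, token):
    lines = []
    while True:
        line = engine.stdout.readline().strip()
        lines.append(line)
        if line.startswith(token):
            return lines

def analyse_fen(engine, fen, depth=19):
    # Single thread and single PV, with the hash cleared before each search,
    # so that repeated runs produce identical logs (see the Commentary).
    send(engine, "setoption name Threads value 1")
    send(engine, "setoption name MultiPV value 1")
    send(engine, "setoption name Clear Hash")   # a button-type option in Stockfish
    send(engine, "ucinewgame")
    send(engine, "isready"); wait_for(engine, "readyok")
    send(engine, "position fen " + fen)
    send(engine, "go depth " + str(depth))
    return wait_for(engine, "bestmove")          # all "info" lines plus the best move

engine = subprocess.Popen([ENGINE_PATH], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True, bufsize=1)
send(engine, "uci"); wait_for(engine, "uciok")

start_fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(analyse_fen(engine, start_fen)[-1])        # e.g. "bestmove e2e4 ..."
send(engine, "quit")
```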

The data now includes the entire set of PAL/CSS Freestyle tournaments, plus Freestyle competitions for some of the world's top players staged in Leon, Spain, in 1998--2002. In all there are:

This makes 4,374 total data points. CEGT runs 50-game matches (sometimes 44 or 30 games) as part of larger tournaments. For better comparison with all the other events, these matches were broken into 10-game segments, labeled a, b, c, d, e in the filenames. TTC stands for a "tournament time control" of 40-in-2, then 20-in-1, and finally G/30. REP stands for the 40-moves-in-2-hours time control repeated for every block of 40 moves; some matches instead repeat blocks of 40 moves in 400 minutes.
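As a small illustration of that segmentation (with a made-up file-naming pattern that only mimics the scheme described above):

```python
# Illustrative only: break a long CEGT match into 10-game segments labeled
# a, b, c, ... as in the filenames described above.  The naming is made up.
import string

def segment_match(games, segment_size=10):
    """Split a list of games into fixed-size segments with letter labels."""
    segments = {}
    for i in range(0, len(games), segment_size):
        label = string.ascii_lowercase[i // segment_size]
        segments[label] = games[i:i + segment_size]
    return segments

match = ["game%02d" % n for n in range(1, 51)]       # a 50-game match
for label, seg in segment_match(match).items():
    print("cegt_match_" + label, len(seg), "games")  # a..e, 10 games each
```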

The three quality measures are (MM) the percentage of moves matching the analyzer's choice, (AE, r3) the average error per move as judged by the analyzer, in raw terms, and (AE, sc3) the same with the error scaled in proportion to the overall evaluation of the position. The last reflects that human players make markedly more raw error when they are ahead or behind by a pawn or so, even half a pawn, than when the game is even. This phenomenon shows up only slightly less with computers. It may reflect perceiving differences in proportion to overall value (a la studies by Kahneman and Tversky) and/or rational attitudes toward risk, but it may also just reflect greater variance between depths in unbalanced chess positions. The "3" means positions with one side ahead by over 3.00 (colloquially, three pawns) are thrown out.
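A sketch of how the three measures can be computed from per-move records is below. The record fields and the particular scaling function are placeholders of my own; the real scaling is the one defined in the paper with Macieja and Haworth cited later.

```python
# Sketch of the three quality measures from per-move records.  Each record is
# assumed to hold the played move, the analyzer's preferred move, the analyzer's
# evaluations (in pawns, from the mover's view) of both, and the position's
# overall evaluation.  The scaling function here is a simple stand-in; the real
# one is defined in the Regan-Macieja-Haworth paper.
from dataclasses import dataclass

@dataclass
class MoveRecord:
    played: str
    best: str
    eval_best: float      # analyzer's value of its preferred move
    eval_played: float    # analyzer's value of the move actually played
    position_eval: float  # overall evaluation of the position

def quality_measures(records, cutoff=3.00):
    kept = [r for r in records if abs(r.position_eval) <= cutoff]
    if not kept:
        return None
    mm = sum(r.played == r.best for r in kept) / len(kept)
    raw_errors = [max(0.0, r.eval_best - r.eval_played) for r in kept]
    # Stand-in scaling: damp the error in proportion to how unbalanced the
    # position already is (1 + |eval|); purely illustrative.
    scaled_errors = [e / (1.0 + abs(r.position_eval))
                     for e, r in zip(raw_errors, kept)]
    return {"MM": mm,
            "AE_r3": sum(raw_errors) / len(kept),
            "AE_sc3": sum(scaled_errors) / len(kept)}
```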


Results

There are 850 PAL/CSS data points. By happenstance, the numbers in parentheses above for international computer competitions in 2005--2008, not counting the regional MPPS, AUSNCC, and MASPV events, give 425 performances, which is exactly half. And the following is an unbiased way to obtain 424 CEGT data points:

PAL/CSS Freestyle Compared With Computers in 2005--2008

Sorted By Move-Match % to Stockfish 4.

Sorted By Raw Error as judged by Stockfish 4.

Sorted By Scaled Error as judged by Stockfish 4.

Freestyle has an absolute lock on the move-matching measure, with 8-of-10, 42-of-50, and 85-of-100, but does not quite break even in the other measures. The reason is the dominance of CEGT results in the raw and scaled average-error measures, over those from all other world computer events in 2005--2008:

CEGT Versus All Other Major Computer Events in 2005--2008

Sorted By Move-Match % to Stockfish 4.

Sorted By Raw Error as judged by Stockfish 4.

Sorted By Scaled Error as judged by Stockfish 4.

Freestyle Compared With All Computer Events to 2008

Sorted By Move-Match % to Stockfish 4.

Sorted By Raw Error as judged by Stockfish 4.

Sorted By Scaled Error as judged by Stockfish 4.


Comparison to Computers, All Years

Sorted By Move-Match % to Stockfish 4.

Sorted By Raw Error as judged by Stockfish 4.

Sorted By Scaled Error as judged by Stockfish 4.

When the past 5 years of programs are included, PAL/CSS retains its lock on the move-matching measure, but moves toward numerical parity in the others.

All Results, Including Correspondence

Sorted By Move-Match % to Stockfish 4.

Sorted By Raw Error as judged by Stockfish 4.

Sorted By Scaled Error as judged by Stockfish 4.


All Computer-Only Results

Sorted By Move-Match % to Stockfish 4.

Sorted By Raw Error as judged by Stockfish 4.

Sorted By Scaled Error as judged by Stockfish 4.


The scaling is the same as that used for Rybka, and its rationale is described in my paper with Bartlomiej Macieja and Guy Haworth. It will change for Stockfish, because Stockfish tends to give evaluations about 50-60% higher than Rybka does for the same position. I am currently normalizing all engines to a common metric of value, namely the average percentage score for the player to move over a large reference dataset of games.
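The idea of the normalization can be sketched as follows; the logistic shape and the per-engine constants are placeholders I've chosen for illustration, whereas the study fits the actual mapping from the reference dataset.

```python
# Illustrative sketch of normalizing engine evaluations to a common scale of
# expected percentage score for the side to move.  The logistic shape and the
# per-engine constants below are placeholders; the study fits the mapping from
# a large reference dataset of games rather than assuming these numbers.
import math

def expected_score(eval_pawns, scale):
    """Map an evaluation (in pawns) to an expected score in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-eval_pawns / scale))

# Hypothetical per-engine scale constants reflecting that Stockfish's
# evaluations run higher than Rybka's for the same positions.
SCALES = {"Rybka 3": 1.0, "Stockfish 4": 1.55}

for engine, s in SCALES.items():
    print(engine, round(expected_score(1.00, s), 3))  # value of "+1.00" per engine
```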

Results will be plotted and analyzed further.

For the record, here are the Top 200 entries of my current omnibus master files of every human and computer performance that I've analyzed with Rybka and Houdini, ranked by raw error and ranked by scaled error. These do not have as many computer-computer and ICCF data points, since the omnibus files have other purposes besides a study of Freestyle. Actually you get a bonus half-dozen: there are interesting reasons why I cut both files off at 206. The ranking by MM is not given because some humans crash the party, which is a sensitive matter. The files are similar to what Cowen saw while the book was being written.

Commentary

My computer analyzer takes much less time than the players in all of these games, rarely more than 20 seconds per move on just one thread of an x64 PC. Yet it is still sensitive to the overall quality of play, in the way a net can still be used to compare particle density in bodies of water even though most particles are smaller than the openings in the net. (See also this justification by Matej Guid, Aritz Pérez, and Ivan Bratko of their results using weaker chess programs pre-2008.)

The stronger caveat about my previous study was that many of the PAL/CSS teams were using earlier versions of Rybka, which could bias results obtained via Rybka 3. Hence I am now re-creating the study using the just-released Stockfish 4 engine, which my tests show to be appreciably stronger than Stockfish 3. At the time of the PAL/CSS events, this engine existed only as its ancestor, Glaurung.

If anything, the fact that I'm including major computer chess competitions from the 5 years since the last PAL/CSS event ended creates bias against the evaluation of Freestyle. In Martin Thoresen's 2013 New TCEC tournament, what was basically Stockfish 3 lost to top-rated Houdini 3 in the final. For my analysis, Stockfish 4 was modified (only) by changing its GrainSize parameter from 4 to 1, and by making it clear its hash table before every fixed-depth search. Both of these changes make the program slightly weaker in play, but having evaluations to the maximum supported precision gives more information, while clearing hash (and using a single core thread) makes the analysis logs themselves scientifically reproducible.
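To illustrate why the finer grain matters, under my understanding that GrainSize rounds the reported evaluation to a multiple of that many internal units, here is a toy comparison; the rounding rule shown is a simplification of what the engine actually does internally.

```python
# Toy illustration: with a grain of 4, nearby evaluations collapse onto the
# same reported value, while a grain of 1 keeps full precision.  The values
# here are arbitrary internal evaluation units, not centipawns.
def round_to_grain(value, grain):
    return (value // grain) * grain

for v in (37, 38, 39, 40, 41):
    print(v, "->", round_to_grain(v, 4), "(grain 4)  vs ", round_to_grain(v, 1), "(grain 1)")
```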