{\huge The New Chess World Champion}

Available for $59.96, or---for free? \vspace{.25in}

Larry Kaufman is a Grandmaster of chess and has been part of the development teams behind two champion computer chess programs, Rybka and Komodo. I have known him from chess tournaments since the 1970s. He earned the title of International Master (IM) from the World Chess Federation in 1980, a year before I did. He earned his GM title in 2008 by dint of winning the World Senior Chess Championship, in a tie with GM Mihai Suba.

Today we salute Komodo for winning the 7th Thoresen Chess Engines Competition (TCEC), which some regard as the de facto world computer chess championship. Larry first worked on the Rybka chess program with its originator, Vasik Rajlich. In 2008 his manner of tuning the evaluation function by field testing helped create a big leap in strength from version 2.3.1 to version 3 of Rybka. Hence Rybka 3 became the clear choice of master engine for my statistical model.

After two years of silicon hiatus, Larry teamed with Komodo's originator, the late Don Dailey, and they brought out their first commercial version in 2011, by which time it had at least pulled even with Rybka. Since then computer chess has seen a triumvirate joined by the commercial engine Houdini by Robert Houdart, and the free open-source engine Stockfish by Tord Romstad, Marco Costalba, and Joona Kiiski. The current CCRL rating list shows their most recent public versions atop the list:

\includegraphics[width=4in]{CCRLtop4.png}

The last column is CCRL's confidence in superiority over the next engine on the list. This was borne out by all four engines surviving to the TCEC 7 semifinal stage and finishing in that order. Previously Houdini ruled TCEC through mid-2013, then Komodo won TCEC 5 over Stockfish a year ago, and Stockfish got revenge in TCEC 6 last May. The current final between the latest versions was a squeaker, with Komodo ahead by one point until getting two late wins to finish 33.5--30.5.

Human Depth of Games

Larry and I have had closely similar chess ratings for four decades. However, in the last game we played he gave me odds of rook and bishop and beat me handily. Then he told me that the world champion could probably beat him giving the same odds.

This was not Western chess, where I would be pretty confident of beating anyone given just an extra bishop. It was Japanese chess, called Shogi. Shogi has no piece with the power of a queen, and the armies have just one rook and bishop each, so the odds I received were maximal. The main difference from Western chess is that captured pieces become their taker's property and can be ``paratrooped'' back into the game. This prevents the odds receiver from winning by attrition through exchanges, as happens in chess, and throws the burden of attack onto the side with the material lead.

It also makes Shogi deeper than chess in a way that can be defined mathematically. Say two players are a class unit apart if the stronger expects to score 75% against the weaker in a match. In chess, this corresponds to a difference of roughly 200 points in the standard Elo rating system. László Mér\H{o}, in his 1990 book \emph{Ways of Thinking}, called the number of class units from a typical beginning adult player to the human world champion the depth of a game.
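
For concreteness, the standard Elo expectation formula for a player rated $latex {D}&fg=000000$ points above an opponent pins down the size of a class unit:

$latex {\displaystyle E = \frac{1}{1 + 10^{-D/400}}, \qquad E = 0.75 \;\Longrightarrow\; D = 400\log_{10} 3 \approx 191,}&fg=000000$

which is the figure usually rounded to 200.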

Tic-tac-toe may have a depth of 1: if you assume a beginner knows to block an immediate threat of three-in-a-row but plays randomly otherwise, then you can score over 75% by taking a corner when you go first and grifting a few games when you go second. Another school-recess game, dots-and-boxes, is evidently deeper. We don't know its depth for sure because it doesn't have a rating system and championship format like chess does.

Chess ratings go all the way down to the formal floor of 100 among scholastic players, but I concur with the estimate of Elo 600 for a young-adult beginner, which I saw in a discussion of Mér\H{o}'s book in the 1990s but did not preserve. This gave chess a depth of 11 class units up to 2800, which was world champion Garry Kasparov's rating in 1990. If I recall correctly, checkers ($latex {8 \times 8}&fg=000000$) and backgammon had depth 10 while bridge tied chess at 11, but Shogi scored 14 and yet was dwarfed by Japan's main head game, Go, at 25.
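
In these units the arithmetic behind that depth figure is simply $latex {(2800 - 600)/200 = 11}&fg=000000$ class units from beginner to world champion.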

Game Depth and Moore's Law

One weakness in the notion of depth is dependence on how contests are aggregated. For tennis, should we apply 75% expectation to games, pairs of games, sets, or matches---and how many games in a set or sets in a match? Another is that any game can be ``reduced'' to depth 1 by flipping a coin: if heads, the game is played as usual; if tails, a second coin flip decides the outcome. Then nobody ever has more than 75% expectation. I regard both as beside the point for most board games, but both become nontrivial for games like backgammon that involve chance and are played with match stakes.

Although it is coming on 18 years since Deep Blue beat Kasparov, humans are still barely fending off computers at shogi, while we retain some breathing room at Go. Since depth 14 translates to Elo 3400 on the chess scale, while Komodo 8 is scraping 3300 on several chess rating lists, this feels about right.

Ten years ago, each doubling of speed was thought to add 50 Elo points to strength. Now the estimate is closer to 30. Under the double-in-2-years version of Moore's Law, using an average of 50 Elo gained per doubling since Kasparov was beaten, one gets 450 Elo over 18 years, which again checks out.
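
In round numbers: 18 years at one doubling every 2 years gives 9 doublings, and $latex {9 \times 50 = 450}&fg=000000$ Elo; added to Kasparov's roughly 2800--2820 of 1997, that lands in the neighborhood of Komodo's 3300 today.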

To be sure, the gains in computer chess have come from better algorithms, not just speed, and include nonlinear jumps, so Go should not count on a cushion of (25 - 14)*9 = 99 years. That Moore's Law has recently gone more toward processor core density than raw CPU speed, while chess programs gain only about 60--65% from each doubling of cores, adds further offsetting factors. Nevertheless, Mér\H{o}'s notion of depth in games gives a new angle on computational progress.

Patterns and Singularity

Can we design games to be more human-friendly and computer-resistant? Both Shogi and Go are overtly more complex than chess by certain measures of numbers of possible positions and typical lengths of games. There are versions of chess and Shogi with bigger boards, which Kaufman once compared in an informative forum post. However, I agree with various people's opinions that raw complexity taxes carbon and silicon equally while these factors do the most to sway between them:

Chess scores poorly for brains by all three: Tactics are often short firecracker strings that can dazzle human players but are now trivial for computers. Common advice for beginning players is to flash-study tactics. Chess teacher Dan Heisman once said:

It's not that chess is 99% tactics, it's just that tactics takes up 99% of your time.

I equate the ``tactical'' nature of a chess position with the degree to which the best move stands apart from any other, which relates in turn to the effective branching factor. My statistical model puts a definite percentage on it: not ``99%'' but a hefty 53.3% in typical human games. Computers playing each other tend to reach positions for which the figure is a lower 49%, which I interpret as greater freedom of strategy. The difference from 53.3% is statistically significant, but it's still almost 50%.
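
Here is a minimal sketch of how such a percentage could be tallied under a simplistic criterion: ask how often the engine's top move stands apart from the second-best by a fixed evaluation gap. The half-pawn threshold and the input format are illustrative assumptions, not my model's actual definition.

# Illustrative sketch: the fraction of positions in which the best move
# "stands apart" from the alternatives by a clear evaluation margin.
# The 0.5-pawn gap and the input format are assumptions for illustration.

def tactical_fraction(positions, gap=0.5):
    """positions: list of lists of engine evaluations (in pawns) for the
    legal moves in each position, in any order."""
    standout = 0
    for evals in positions:
        if len(evals) < 2:
            standout += 1                  # a forced move trivially stands apart
            continue
        best, second = sorted(evals, reverse=True)[:2]
        if best - second >= gap:
            standout += 1
    return standout / len(positions)

# Example: the first and third positions have one clearly best move.
sample = [[1.2, 0.3, 0.1], [0.20, 0.15, 0.10], [2.0, -1.0]]
print(tactical_fraction(sample))           # prints 0.6666666666666666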

The new game Arimaa achieves a greatly higher branching factor, and with it longer tactical horizons, through ``moves'' that consist of four separate motions. Human players are still comfortably defending a $12,000 prize with six years to run. Keeping closer to standard chess, my best effort involves doubling up the pawns, which are the game's strategic ``soul.''

More broadly, I agree with the contention of the BBC/National Geographic documentary ``My Brilliant Brain'' featuring Susan Polgar that human skill consists largely of pattern matching against previously seen positions, as Dick and I discussed in this post. Neither Komodo nor Stockfish includes a pattern classifier. Does there exist a succinct machine-learning style distinguisher that always steers to advantageous positions? If so, then more than the fall of Garry or the current desperate defenses of Mt. Shogi and Mt. Go, we may speak of the ``Singularity'' invading our brains.

Skill and Rating Horizons

Related to diminishing returns from higher speed is the high incidence of draws in top-level chess, well over 50%. This compares to a rate under 2% for shogi, as also noted by Kaufman, while Go tournaments include rules that make draws impossible.

Computers are not immune: only 11 of the 64 games in the TCEC final were decisive, and all of them were won by White. Some games became rooftop swordfights in which human players would surely topple over the parapet, but the windows of the engines' own analyses 15 or more moves ahead on the official site often showed guy-wires locking both into a safely drawn outcome. In game 59, Komodo played for twenty moves a full rook down but felt no urgency to push its compensating pawns further and in truth was never close to losing. Only 28 of the 72 games in the penultimate stage were decisive, and only 6 of those were won by Black.

If a human 2800 player can achieve just one draw out of 50 games against an engine, then by the mathematics of the rating system, the engine's rating cannot go over 3600. To reach 4000 against humans, a program would have to concede a draw but once in 500 games.
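
The Elo arithmetic behind these ceilings: one draw in 50 games is an expected score of $latex {0.5/50 = 0.01}&fg=000000$ per game for the human, so the rating gap $latex {D}&fg=000000$ satisfies

$latex {\displaystyle \frac{1}{1 + 10^{D/400}} = 0.01 \;\Longrightarrow\; D = 400\log_{10} 99 \approx 800,}&fg=000000$

which caps the engine near $latex {2800 + 800 = 3600}&fg=000000$. A 1200-point gap would drive the human's expected score down to $latex {1/(1+10^{3}) \approx 0.001}&fg=000000$, i.e., one draw per 500 games.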

This logic applies with even greater force to programs playing each other. Perhaps Komodo and Stockfish are already strong enough to hold a draw at least once in 5 games against any player. Then the ceiling for any player---without needing to define a ``perfect'' player---is under 3700.
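
The same calculation with the engines taken at roughly 3300: holding a draw once in 5 games means an expected score of at least $latex {0.5/5 = 0.1}&fg=000000$, so any opponent's edge is at most $latex {400\log_{10} 9 \approx 382}&fg=000000$ points, for a ceiling of about $latex {3300 + 382 \approx 3680}&fg=000000$.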

A Little Game Theory

Even for games with negligible draw rate, one can still infer an effective rating horizon by considering players that randomize their move choices. If there is no succinct perfect deterministic strategy for chess, then any succinct deterministic strategy $latex {S}&fg=000000$ can be beaten by an all-knowing adversary able to steer to positions on which $latex {S}&fg=000000$ fails. A feasible randomized $latex {S'}&fg=000000$, however, may still win (or draw) $latex {p}&fg=000000$ percent of the time against any adversary. From $latex {p}&fg=000000$ and a rating measure for $latex {S'}&fg=000000$ one can infer a maximum possible Elo rating for any adversary, at least in matches against $latex {S'}&fg=000000$.
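
In Elo terms the inference is direct. If $latex {S'}&fg=000000$ guarantees an expected score of at least $latex {s}&fg=000000$ per game against every adversary (counting draws as half), then every adversary scores at most $latex {1 - s}&fg=000000$ against $latex {S'}&fg=000000$, so

$latex {\displaystyle \mathrm{Rating}(\mathrm{adversary}) \;\leq\; \mathrm{Rating}(S') + 400\log_{10}\frac{1-s}{s}.}&fg=000000$

For instance, $latex {s = 0.1}&fg=000000$ caps any adversary at about 382 points above $latex {S'}&fg=000000$, matching the draw-based calculation above.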

My statistical chess model is set up to provide exactly this kind of rating estimation. It defines an ensemble over skill parameters $latex {\vec{z}}&fg=000000$ of probability distributions $latex {P(\vec{z})}&fg=000000$ over all possible moves in a given position. Given a set of positions and the sequence $latex {\vec{m}}&fg=000000$ of played moves, it estimates a rating for the player of those moves by regression to find $latex {\hat{z}}&fg=000000$ that minimizes some distance measure between $latex {P(\hat{z})}&fg=000000$ and $latex {\vec{m}}&fg=000000$. This general prediction task is now the subject of a Kaggle competition through March 23.
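
As a toy illustration of the fitting step only, here is a sketch that collapses the vector $latex {\vec{z}}&fg=000000$ to a single ``sensitivity'' scalar and assumes a softmax link from engine evaluations to move probabilities; both simplifications, and the input format, are stand-ins rather than the model's actual definitions.

import math

# Toy sketch of the regression idea: choose the skill parameter z whose
# predicted move distribution best fits the moves actually played.
# The softmax link and the single scalar z are illustrative assumptions.

def move_probs(evals, z):
    """Map engine evaluations (in pawns) of the legal moves to probabilities;
    smaller z concentrates probability on the better moves."""
    weights = [math.exp(e / z) for e in evals]
    total = sum(weights)
    return [w / total for w in weights]

def neg_log_likelihood(games, z):
    # games: list of (evals_of_legal_moves, index_of_move_played)
    return -sum(math.log(move_probs(evals, z)[played]) for evals, played in games)

def fit_z(games):
    grid = [0.05 * k for k in range(1, 41)]          # crude grid search
    return min(grid, key=lambda z: neg_log_likelihood(games, z))

# Two toy positions: the first player chose the top move, the second did not.
games = [([0.4, 0.1, -0.3], 0), ([0.0, -0.1, -0.8], 1)]
print(fit_z(games))

A fitted parameter would then be mapped to an Elo rating, which is where the curve described next comes in.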

The mapping from $latex {\hat{z}}&fg=000000$ to Elo ratings uses a normalized curve of the total numerical inferiority of a player's moves as judged by a chess engine. The curve with Rybka 3 as the judge crosses the y-axis at Elo 3570, which becomes the skill horizon. Adding the Houdini engine to the mix suggested an even lower horizon, under 3500. It is possible that the good linear fit to my human training sets in the range Elo 2000--2700 is really a segment of a logistic curve that merely looks linear there; extending the range lower, as I complete the replacement of Rybka by the newer engines, could curve the horizon upward. A separate discussion of diminishing returns to strength, with someone who was also corresponding with Kaufman, suggests a horizon in the 4000--4500 range.

Regardless, 3300 is plenty high, almost 450 points better than any brain alive, and you can have it at your fingertips. On a laptop or smartphone it would run maybe just 100--250 points lower. It will cost under $60 for the champion Komodo, while Stockfish costs nothing at all. There is more pause for thought here, but for another time.

Open Problems

Can Misses or Misters resist the transistors? And is there a sensible notion of ``depth'' in an academic field, measuring distance to the frontier in terms of productivity-doubling units?