A Chess Firewall at Zero?
Humans halve their blunder rate when infinitesimally ahead, but why?
Lucius Seneca was a Roman playwright, philosopher, and statesman of the first century. He is called ``the Younger'' because his father Marcus Seneca was also a famous writer. His elder brother Lucius Gallio appears as a judge in the Book of Acts. Besides many quotations from his work, Seneca is famous for one he is not known to have said:
``To err is human.''
Lest we cluck at human error in pinning down ancient quotations, the source for the following updated version is also unknown---even with our legions of computers by which to track it:
``To err is human, but to really screw things up requires a computer.''
Today I report a phenomenon about human error that is magnified by the deeper search of today's computers, and that I believe arises from their interaction with complexity properties of chess.
I have previously reported some phenomena that my student Tamal Biswas and I believe owe primarily to human psychology. This one I believe is different---but I noticed it only a week ago, so who knows. It is that the proportion of large errors by human players in positions where computers judge them to be a tiny fraction of a pawn ahead is under half the rate in positions where the player is judged ever so slightly behind.
The full version of what Seneca actually wrote---or rather didn't write---is even more interesting in the original Latin:
``Errare humanum est, perseverare autem diabolicum, et tertia non datur.''
This means: ``To err is human; to persevere in error is of the devil; and no third possibility is granted.'' The phrase tertium non datur names the Law of Excluded Middle in logic, which here says that either Seneca wrote the line or he didn't, with no third possibility. We will say more about this law shortly. Amid disputes about whether human behavior and its measurement follow primarily ``rational'' or ``psychological'' lines, we open a third possibility: ``complexitarian.''
State of Computer Chess
Here are some important things to know about chess and computer chess programs (called ``engines'').
One upshot is that depth of cogitation is solidly quantifiable in the chess setting. We have previously posted about our papers giving evidence of its connection to human thinking and error. The new phenomenon leans on this connection but we will argue that it has a different explanation.
The Phenomenon
My training sets include all recorded games in the years 2010--2014 between players rated within 10 points of the same century or half-century milepost of the Elo rating system. They range all the way from Elo 1050 to Elo 2800+, that is, from beginning adult tournament-level players to the world champion and his closest challengers. This ``milepost set'' has over 1.15 million positions from 18,702 games, counting occurrences of the same position in different games but skipping the first 8 turns for both players. Each position was analyzed to depth at least 19 by both the newly-released version 7 of the number-two ranked Stockfish and last month's version 9.3 of Komodo, using a multitude of single threads (for reproducibility) on the supercomputing cluster of the University at Buffalo Center for Computational Research.
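For readers who want to poke at a position or two themselves, the per-position query is easy to script. Here is a minimal sketch using the open-source python-chess library; it is not my actual analysis code, and the engine binary path and the fixed depth limit are placeholders for a local setup.

\begin{verbatim}
# Minimal sketch (not the actual pipeline): score one position to a fixed
# depth with a UCI engine via the python-chess library.  The binary path
# "stockfish" and the depth limit are assumptions about a local setup.
import chess
import chess.engine

def eval_at_depth(fen, depth=19, engine_path="stockfish"):
    """Return the engine's value for the side to move, in pawns."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        # Scores come back in centipawns from the mover's point of view;
        # forced mates are mapped to a large finite value.
        cp = info["score"].pov(board.turn).score(mate_score=10000)
    finally:
        engine.quit()
    return cp / 100.0

if __name__ == "__main__":
    print(eval_at_depth(chess.Board().fen()))
\end{verbatim}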
The value of a position is the same as the value of the best move(s) in the position. In multi-line mode we can get the value of the played move directly, while in single-line mode we can, if needed, take the value of the next position. A value of +1.50 or more is commonly labeled a ``decisive advantage'' by chess software (though there are many exceptions), whereas values between -0.30 and +0.30 are considered grades of equality. Accordingly, let's call any move that drops the value by 1.50 or more a ``blunder.'' We will tabulate ranges within -0.40 to +0.40, where a blunder matters most. A value of exactly 0.00 gets a range to itself.
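In code, these conventions boil down to a threshold test and a binning rule. The following sketch is only illustrative; in particular the exact bin boundaries used for the tables are an assumption.

\begin{verbatim}
# Illustrative sketch of the scoring conventions above; the exact bin
# boundaries used for the tables are an assumption on my part.
import math

BLUNDER_THRESHOLD = 1.50   # drop in value, in pawns, that counts as a blunder

def is_blunder(best_value, played_value, threshold=BLUNDER_THRESHOLD):
    """The position's value is the value of its best move; a played move
    that drops the value by 1.50 or more is counted as a blunder."""
    return (best_value - played_value) >= threshold

def value_bin(value, width=0.10):
    """Group positions by value.  Exactly 0.00 gets a bin of its own;
    other values fall into 0.10-wide intervals."""
    if value == 0.0:
        return (0.0, 0.0)
    lo = math.floor(round(value / width, 9)) * width
    return (round(lo, 2), round(lo + width, 2))
\end{verbatim}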
The final wrinkle is that we will use the engines' highest-depth values to group the positions, but distinguish cases where the human player's move was regarded as a blunder at depth 5, 10, 15, and 20. The last includes some depth-19 values. Doing so distinguishes immediate blunders like hanging your queen from subtle mistakes that require a lot of depth to expose. The data divides pretty neatly into thirds: ``amateur'' players rated under 2000 (387,024 positions from 6,187 games), the 2000--2250 range (394,186 from 6,380), and 2300 and above (372,587 from 6,035). The depth-5 numbers for Stockfish 7 are buggy for reasons I demonstrated in a chess forum post; responses seem to confirm a recent bug whose impact lessens at depths 9 and higher. The tables are for single-line mode (the multi-line mode results are similar, including the Stockfish anomaly), and positions are analyzed sans game history so that only repetitions downstream in the search are detected.
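The depth-wise bookkeeping can be sketched the same way. Given hypothetical per-depth values for the best move and the played move, we simply record the depths at which the played move already qualifies as a blunder:

\begin{verbatim}
# Hypothetical data layout: per-depth values (in pawns) for the best move
# and the played move.  The 1.50 cutoff and the probe depths follow the text.
def blunder_depth_flags(best_by_depth, played_by_depth,
                        depths=(5, 10, 15, 20), threshold=1.50):
    """Map each probe depth to True if the played move already loses
    1.50 or more at that depth."""
    return {d: (best_by_depth[d] - played_by_depth[d]) >= threshold
            for d in depths if d in best_by_depth and d in played_by_depth}

# A hung queen shows up even at depth 5; a subtle mistake only much deeper:
print(blunder_depth_flags({5: 0.20, 10: 0.25, 15: 0.30, 20: 0.30},
                          {5: 0.15, 10: 0.10, 15: -0.40, 20: -1.60}))
# -> {5: False, 10: False, 15: False, 20: True}
\end{verbatim}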
The point is not that weaker players make more large mistakes, but rather the ``Red Sea'' effect at 0.00. It becomes more pronounced at the greatest depths and for the strongest players. Here are the figures for players rated 2600 and above, from 90,045 positions in 1,335 games. They are tabulated in smaller intervals of 0.10 within the range -0.30 to +0.30, all labeled ``equality'' by some chess software.
Although Stockfish and Komodo have differences in their evaluation scales---happily less pronounced than they were 1 and 2 years ago---they agree that the world's elite made 6x more large errors when on the lower side of equality. This raises the question not only of why, but of whether it is a real human phenomenon.
What Can Explain It?
There seems to be no possible rational explanation---indeed the sharp change at 0.00 counters the rational hypothesis discussed in this paper and toward the end of my 2014 TEDxBuffalo talk. But it is also hard to sustain a psychological explanation. If being slightly behind puts one off one's game, why is the ratio accentuated for the top players? That this training set includes only games between evenly matched players minimizes any variance from overconfidence or ``fear factors.''
My proposed explanation leans on the observed engine behavior with repetitions and 0.00 values. Spot-checks affirm that the wide 0.00 bin includes many positions where the armies trade blows but fall into a repeating sequence from which neither side can veer without cost. If the machine judges your position worth +0.01 or more, then it places you discretely above this no-man's zone. Any notionally big slip you make still has a good chance of being caught in this large attraction basin, and hence of being charged as only a small error by the computer, whereas a slip from 0.00 or below has no safety net and falls right down.
This hypothesis comes with one large falsifiable prediction and perhaps a few others regarding games played by computers and computer-human teams as opposed to humans alone. My entire model uses no feature of chess apart from the move values supplied by engines at progressively increasing depths of search. It thus transfers to any alternating-decision game in which sufficiently authoritative move values can be computed. Some games like Go rule out repeating sequences, while others like Shogi allow them but rarely end in draws.
For any strategy game that disallows repetitions and/or has negligible draw frequency, there will be no similar ``firewall at 0.00'' effect on the rate of human mistakes.
Go and Shogi are just as hard as chess, so this hypothesis does not pertain to worst-case complexity. Rather it addresses heuristically complex ``most-case'' behavior. It also highlights a general risk factor when using computers as judges: notional human errors are necessarily being evaluated by machines but the machines are evidently not doing so with human-relevant accuracy.
How to Handle It?
This last issue is not theoretical or didactic for me. In applying my statistical model to allegations of cheating I need to project distributions of errors by unaided human players of all skill levels accurately. I have been aware of ramped-up error below and away from 0.00 since 2008, when I devised a pleasingly symmetric way to smooth it out: Take a differential $d\mu(x)$ that has value $1$ at $x = 0.00$ but tapers to $0$ symmetrically. It says that the marginal value of a centipawn is the same whether you are ahead or behind the same amount but lessens when the (dis)advantage is large. When a player makes an error of raw magnitude $e$ in a position of value $v$, charge not $e$ but rather the integral of $d\mu(x)$ from $x = v - e$ to $x = v$. The metric can be weighted to make $0.00$ effectively as wide as neighboring intervals.
The idea is that if you blunder when, say, 0.50 ahead, the integral will go through the fattest part of the metric and so charge near face value. But when 0.50 behind, it will go through the thinner region $-0.50$ to $-0.50 - e$ and so record a lesser value for my analyzer. This ``metric correction'' balanced and sharpened my system in ways I could verify by trials such as those I described on this blog in 2011.
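To make the arithmetic concrete, here is a numerical sketch with one simple taper that has value $1$ at $x = 0.00$ and falls off symmetrically. It is meant only to illustrate the integral; the particular taper $1/(1 + (x/c)^2)$ and the constant $c$ are placeholder choices.

\begin{verbatim}
# Numerical sketch of the metric correction.  The taper 1/(1 + (x/c)^2),
# with antiderivative c*atan(x/c), is an illustrative choice only; it has
# value 1 at x = 0 and tapers off symmetrically, as the text requires.
import math

def adjusted_error(v, e, c=1.0):
    """Charge an error of raw size e, made from a position of value v,
    as the integral of the taper from v - e to v."""
    F = lambda x: c * math.atan(x / c)    # antiderivative of 1/(1+(x/c)^2)
    return F(v) - F(v - e)

# A 1.50 slip from +0.50 runs through the fat middle of the metric,
# the same slip from -0.50 through its thinner left tail:
print(adjusted_error(+0.50, 1.50))   # roughly 1.25
print(adjusted_error(-0.50, 1.50))   # roughly 0.64
\end{verbatim}

With $c = 1.00$ the two charges come out near 1.25 and 0.64, matching the intuition that the same raw slip is discounted when it starts from behind.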
Privately I've indulged an analogy to how relativity ``corrects'' the raw figures from Newtonian physics. But with today's chess programs cannonballing to greater depths, I can tell, when carrying over my previous calibrations to the earlier Rybka and Houdini engines, that this method is being strained. Now having many times the amount of data I previously took on commodity PCs has made the quantum jump at 0.00 so clear as to need special attention. Making an ad hoc change in my model's coefficients for positions of non-positive value still feels theoretically uncomfortable, but this may be what the data is dictating.
Open Problems
How can you explain the phenomenon? How should regard for phenomenology---in both experimental and philosophical senses---influence the manner of handling it?