First principle: The testing must be realistic for the activity being modeled. In the case of cheating with computer chess engine(s), the tests must cover the timeframe and search depths attainable during the actual games. Since standard time controls give each player 2 hrs. thru move 40, 1 hr. more thru move 60, then 30 minutes for the rest of the game, games typically last 4--7 hours, so one test of one engine on one game on similar hardware should have a similar length. This already indicates that unless a super-computer is available and deviation from the same engine running on standard PC hardware available to cheaters is known to be minimal, one cannot get a "30-minute verdict" by pushing a button, as would be needed for a ruling between same-day rounds of a Swiss System event. (Our data files do, however, show enough correlation between matches on low and high ply depths that a 30-minute run can be used as a basis for deciding to investigate further.)
This also rules out using the built-in ChessBase/Fritz-GUI analysis modes,
since these run backwards from the end of the game---and preserve
evaluations of later moves in hash while evaluating earlier ones.
Such "hindsight bias" is fine for retrospectively analyzing one's own games,
but the only player whose cheating might be modeled this way is Merlin!
Hence before detailing the methodology
we need to consider various ways to cheat with computers in
(over-the-board) chess, as well as the data we need to collect.
Required Data
Second principle: The significance of matches depends on the evaluations of alternative moves.
We hold that the evaluations must be objectively obtained. Taking players' styles into account is objective only if quantified by a computer analysis of their games. Many engines themselves have user-selectable parameters that are represented as changing the program's "style". Other engines recommend using their default parameter settings, such as Fritz 9 here. If the settings chosen by the programs' creators during top-level computer competitions differ from the defaults, we may need background information on which was used or tests on both or on a spectrum of settings. Then to test an allegation of cheating with a particular engine E, we need:
The two requirements are logically separate, and as argued next, need
to be separate in realistic testing.
Types of Cheating to Model
The key points are (1) whether the cheating player has active partners, (2) whether he/she can initiate communication with partner(s), (3) whether cheating occurs for all-or-most moves of a game (after opening book lines are left) or just a few "key moves", and (4) whether the engine(s) are run in single-line mode or multi-line mode. Single-line mode (SLM) maximizes the search depth per unit time of the computer's preferred move, but curtails the evaluation of alternatives---which in the standard GUIs aren't even displayed. Multi-line mode (MLM) guarantees full evaluation of the top N moves where the user sets N (most MLM runs here set N = 10). Both modes use incremental search, giving preferred move(s) and evaluations at the end of the (d-1)-ply round (which UCI engines mark as "d" or "d/d") before proceeding to the d-ply round.
"Computer In A Shoe" cheating, alleged for the "John von Neumann" incident at the 1993 World Open, would probably use SLM and simple move-entry by the player. Testing methodology would be the same as for "RMT" below.
"Closet" cheating involves the player viewing the engine screen, which seems to necessitate going from the board to a secret place. Since SLM cheating can be implemented more simply and with less risk, the only attraction seems to be the player using MLM to compare various ideas in his/her mind, and as a "sanity check" against blunders. This is how I originally modeled the Elista WC accusations, and is one of several reasons those data files are in MLM.
Key-move cheating could begin with a signal by the player to a confederate in the audience---note that 1-way glass would not stop this. To minimize communication, the agreement could be to do this only when a position has a unique salient critical move (typified by a Bxh7+ sacrifice), and the player's query is "does it work?". A yes/no reply could be rendered audibly amid general noise.
This is said to be the main worry at top-level events, and would present small sample sizes to statistical methods. Moreover, absent eyewitness or video evidence of alleged signaling, identifying ``key moves'' would require subjectivity by a strong player. One constructive idea would be to tell the player only the number N of alleged assisted moves in a game and ask em to identify which N moves they would most likely be---a perfect match would be chancier than a "police lineup" but would also substantiate the importance of the N junctures. Otherwise, this site does not have any "magic bullet" for combatting key-move cheating.
Remote move transmission (RMT) involves
engine operator(s) transmitting moves to the player, either electronically
or by signals that carry more than simple yes/no information.
This can be employed only on key moves selected by the
operator, but human factors including the greater risk of physical
detection point toward exploiting the advantage for long stretches of moves.
All major allegations in 2006 and 2007 have involved long-stretch
RMT, which is what this site is best equipped to detect.
RMT Mode Cheating
Our best guess of how RMT cheating would work is that when it is the cheating player P's turn to move, the operators run in single-line mode (SLM), because it reaches the highest ply-depth the fastest. Then after P makes a move m_P, they may enter m_P and switch to multi-line mode (MLM) while the opponent O is thinking, setting as many lines as seems reasonable to consider. This guarantees that whatever move m_O is made by O, the engine(s) being run will already have searched m_O to a substantial ply depth d. Upon m_O being entered, the engine(s) will reconstruct the search to depth d almost instantly from the hash table, thus gaining a jump on the next decision. We guess that the operators would enter move m_O first, thus staying in MLM for a few seconds to see which options for P are identified as reasonable in the first few seconds, before clicking on the "Reduce Lines" button (standardly bearing a minus sign) to go back into SLM.
Although the Guid-Bratko study considered a depth of 12 ply to be the "gold standard" for evaluating world champions, consensus gives that no more than a 2700 rating. Major engines on 2+Ghz laptop computers (especially dual-core) can regularly reach the 15-17 ply range in SLM within the 2-3 minute average timeframe for a move at standard tournament time controls, even in complicated middlegame positions, as our data files show. Quad-core laptops do not yet exist; a quad-core desktop PC or multi-processor machine might be employed but might not be conveniently transportable within the range of securer short-range communication.
Embellishments of this basic RMT mode might involve the operators
stepping several moves ahead into lines that look critical, then
stepping back to the present move in order to reflect the
subline's evaluation in the current hash table. However, all such
behavior is a complicating factor, and for maintaining a simple standardized
procedure we guess that such deviations would be minimal.
A record of time taken by P for each move might be used to determine
how long to extend the testing for that move---but one might find that this
already scales with the time taken for the testing engine to reach
a given target ply depth. (Unaccountably in our opinion,
timing data seems not
to have been made available for recent major events.)
When the move m_P and/or m_O are "obvious", the engine(s) would be
expected to reach a target ply depth quickly, and transmitting the
chosen move then rather than carrying on for 2-3 minutes would save P
valuable time overall.
Thus we believe that testing to a uniform target ply depth d---perhaps
upping d automatically after Queens or other major pieces are traded---is
most likely relevant, besides its
advantages of objectivity and
prospectively being "scriptable" via extensions to the UCI standard.
We describe such a method based on the above estimate of cheating
procedure.
The "Zig-Zag" Testing Method
The lone operating difference is that we switch to MLM before entering the accused player P's move m_P, rather than after entering m_P. This is needed because the statistical methods require high-ply evaluations of major alternatives to m_P, which SLM does not provide. We postulate that spending about 4*t time in MLM before entering m_P has similar effect to spending t time in MLM after entering m_P. Here is the resulting procedure to test a game G:
The arbitrary choice dmin = 11 is made because lower ply depths are believed to have strength below GM level, making them ignorable, and often come too quickly in MLM to be recorded! The choice of recording L = 10 lines in MLM is a compromise. The statistical methods formally involve the evals of all alternatives, but having the 9 next-closest evals is usually a fine approximation. Computing, say, 30 lines in MLM would not only slow the test by a factor of 3, but would often burn extra time as engines seek optimum mates after disastrous moves. This said, a manual (or scripted!) tester must be prepared to use a lower L on moves with fewer than 9 non-disaster replies, e.g. when a recapture is forced or the King is in check and piece-hanging interpositions should be skipped.
The first main alternative is to do one run entirely in SLM to record only the SLM data---perhaps still switching to MLM after entering moves m_P for 2-3 or so minutes to ape the supposed cheating procedure---and then do separate run(s) in MLM. MLM runs by multiple engines might provide a more-robust "jury consensus" on which to base the conversion from evaluations to (prior) probabilities. If multiple engines are tested in SLM, a "jury" of corresponding MLM runs may be available in any case.
The second main alternative is to use only an MLM run, to high depth d (not d_L), taking the MLM evals both to determine whether the player's move "matches" the engine and to provide the statistical basis. This was done for the Elista WC testing. One advantage is that recording MLM data (say, 10 lines) at every ply depth helps identify "swings" in move evaluations, which can be figured into more-refined formulas for estimating prior probabilities of moves being played. Disadvantages, however, are: (1) which moves match in MLM versus SLM appear to differ "more than expected", (2) SLM is held more likely to be the cheater's mode of choice, (3) running MLM to full depth d takes scads of time!, and (4) clipping each ply depth currently requires constant human attention. For reasons (1--2), only the SLM data is used to say which moves "match" at ply depths up to d, while only MLM data is used to judge the significance of matches and non-matches.
The main advantage of "zig-zagging" between SLM and MLM is that all data is obtained in 1 run, while preserving relevance to how the engine might be operated during the opponent's thinking time. Saving the 5-10 hours needed for alternative MLM runs is substantial, especially when they require manual operation.
The main disadvantage is an artifact of hashing when d_L < d that we call the "leader effect". The SLM evaluation e_m of the preferred move m at the final depth d is used for m at all ply levels in the subsequent MLM, since the critical lines for m are preserved in hash. Occasionally, this is done also for moves m' that were tied with m or just displaced by m in the final round of the SLM computation. Thus the MLM at d_L will have the depth-d_L evals for the other 9 moves, but e_m rather than the depth-d_L eval for move m. This inconsistency can skew the differences ("deltas") between m and the other moves, particularly when greater depth tends to show greater advantage (or disadvantage) in move m---hence the name "leader" effect. Moreover, if a move m' != m was displaced only at SLM depth d (i.e., had a late swing in value), the MLM at depth d_L may still show m' with a higher value. This accounts for some cases in the log files where move m is a "full match" at every ply depth, but the "delta" from the MLM is negative. Since the statistical methods employ "smoothing", we expect the resulting skewing to be tolerable. Moreover, both kinds of observed skewing lessen the significance values obtained, thus making statistical "positives" more difficult to get and hence erring on the side of caution. In light of the substantial time savings in gathering data, we employ the zig-zag method.
This covers the gathering and presentation of the basic data. Rules for statistical analysis, and for drawing conclusions according to recognized significance levels, will be detailed in our main papers.