Chess Cheating Testing---Rationale and Methodology

Testing Rationale and Methodology

First principle: The testing must be realistic for the activity being modeled. In the case of cheating with computer chess engine(s), the tests must cover the timeframe and search depths attainable during the actual games. Since standard time controls give each player 2 hrs. thru move 40, 1 hr. more thru move 60, then 30 minutes for the rest of the game, games typically last 4--7 hours, so one test of one engine on one game on similar hardware should have a similar length. This already indicates that unless a super-computer is available and deviation from the same engine running on standard PC hardware available to cheaters is known to be minimal, one cannot get a "30-minute verdict" by pushing a button, as would be needed for a ruling between same-day rounds of a Swiss System event. (Our data files do, however, show enough correlation between matches on low and high ply depths that a 30-minute run can be used as a basis for deciding to investigate further.)

This also rules out using the built-in ChessBase/Fritz-GUI analysis modes, since these run backwards from the end of the game---and preserve evaluations of later moves in hash while evaluating earlier ones. Such "hindsight bias" is fine for retrospectively analyzing one's own games, but the only player whose cheating might be modeled this way is Merlin! Hence before detailing the methodology we need to consider various ways to cheat with computers in (over-the-board) chess, as well as the data we need to collect.

Required Data

Second principle: The significance of matches depends on the evaluations of alternative moves.

We hold that the evaluations must be objectively obtained. Taking players' styles into account is objective only if quantified by a computer analysis of their games. Many engines themselves have user-selectable parameters that are represented as changing the program's "style". Other engines recommend using their default parameter settings, such as Fritz 9 here. If the settings chosen by the programs' creators during top-level computer competitions differ from the defaults, we may need background information on which was used or tests on both or on a spectrum of settings. Then to test an allegation of cheating with a particular engine E, we need:

One or more runs of E to "realistic" ply depths, to being tested, to determine the fact of whether a played move "matches" E (at various times/ply-depths); and
One or more runs by E and/or other engines to "authoritative" ply depths, to compare values of significant alternatives to the played move.

The two requirements are logically separate, and as argued next, need to be separate in realistic testing.

Types of Cheating to Model

The key points are (1) whether the cheating player has active partners, (2) whether he/she can initiate communication with partner(s), (3) whether cheating occurs for all-or-most moves of a game (after opening book lines are left) or just a few "key moves", and (4) whether the engine(s) are run in single-line mode or multi-line mode. Single-line mode (SLM) maximizes the search depth per unit time of the computer's preferred move, but curtails the evaluation of alternatives---which in the standard GUIs aren't even displayed. Multi-line mode (MLM) guarantees full evaluation of the top N moves where the user sets N (most MLM runs here set N = 10). Both modes use incremental search, giving preferred move(s) and evaluations at the end of the (d-1)-ply round (which UCI engines mark as "d" or "d/d") before proceeding to the d-ply round.

Solo Cheating

"Computer In A Shoe" cheating, alleged for the "John von Neumann" incident at the 1993 World Open, would probably use SLM and simple move-entry by the player. Testing methodology would be the same as for "RMT" below.

"Closet" cheating involves the player viewing the engine screen, which seems to necessitate going from the board to a secret place. Since SLM cheating can be implemented more simply and with less risk, the only attraction seems to be the player using MLM to compare various ideas in his/her mind, and as a "sanity check" against blunders. This is how I originally modeled the Elista WC accusations, and is one of several reasons those data files are in MLM.

Accomplice Cheating

Key-move cheating could begin with a signal by the player to a confederate in the audience---note that 1-way glass would not stop this. To minimize communication, the agreement could be to do this only when a position has a unique salient critical move (typified by a Bxh7+ sacrifice), and the player's query is "does it work?". A yes/no reply could be rendered audibly amid general noise.

This is said to be the main worry at top-level events, and would present small sample sizes to statistical methods. Moreover, absent eyewitness or video evidence of alleged signaling, identifying ``key moves'' would require subjectivity by a strong player. One constructive idea would be to tell the player only the number N of alleged assisted moves in a game and ask em to identify which N moves they would most likely be---a perfect match would be chancier than a "police lineup" but would also substantiate the importance of the N junctures. Otherwise, this site does not have any "magic bullet" for combatting key-move cheating.

Remote move transmission (RMT) involves engine operator(s) transmitting moves to the player, either electronically or by signals that carry more than simple yes/no information. This can be employed only on key moves selected by the operator, but human factors including the greater risk of physical detection point toward exploiting the advantage for long stretches of moves. All major allegations in 2006 and 2007 have involved long-stretch RMT, which is what this site is best equipped to detect.

RMT Mode Cheating

Our best guess of how RMT cheating would work is that when it is the cheating player P's turn to move, the operators run in single-line mode (SLM), because it reaches the highest ply-depth the fastest. Then after P makes a move m_P, they may enter m_P and switch to multi-line mode (MLM) while the opponent O is thinking, setting as many lines as seems reasonable to consider. This guarantees that whatever move m_O is made by O, the engine(s) being run will already have searched m_O to a substantial ply depth d. Upon m_O being entered, the engine(s) will reconstruct the search to depth d almost instantly from the hash table, thus gaining a jump on the next decision. We guess that the operators would enter move m_O first, thus staying in MLM for a few seconds to see which options for P are identified as reasonable in the first few seconds, before clicking on the "Reduce Lines" button (standardly bearing a minus sign) to go back into SLM.

Although the Guid-Bratko study considered a depth of 12 ply to be the "gold standard" for evaluating world champions, consensus gives that no more than a 2700 rating. Major engines on 2+Ghz laptop computers (especially dual-core) can regularly reach the 15-17 ply range in SLM within the 2-3 minute average timeframe for a move at standard tournament time controls, even in complicated middlegame positions, as our data files show. Quad-core laptops do not yet exist; a quad-core desktop PC or multi-processor machine might be employed but might not be conveniently transportable within the range of securer short-range communication.

Embellishments of this basic RMT mode might involve the operators stepping several moves ahead into lines that look critical, then stepping back to the present move in order to reflect the subline's evaluation in the current hash table. However, all such behavior is a complicating factor, and for maintaining a simple standardized procedure we guess that such deviations would be minimal. A record of time taken by P for each move might be used to determine how long to extend the testing for that move---but one might find that this already scales with the time taken for the testing engine to reach a given target ply depth. (Unaccountably in our opinion, timing data seems not to have been made available for recent major events.) When the move m_P and/or m_O are "obvious", the engine(s) would be expected to reach a target ply depth quickly, and transmitting the chosen move then rather than carrying on for 2-3 minutes would save P valuable time overall. Thus we believe that testing to a uniform target ply depth d---perhaps upping d automatically after Queens or other major pieces are traded---is most likely relevant, besides its advantages of objectivity and prospectively being "scriptable" via extensions to the UCI standard. We describe such a method based on the above estimate of cheating procedure.

The "Zig-Zag" Testing Method

The lone operating difference is that we switch to MLM before entering the accused player P's move m_P, rather than after entering m_P. This is needed because the statistical methods require high-ply evaluations of major alternatives to m_P, which SLM does not provide. We postulate that spending about 4*t time in MLM before entering m_P has similar effect to spending t time in MLM after entering m_P. Here is the resulting procedure to test a game G:

Identify the first move m_P by the player P that is not in "book" contemporary to game G; often this is determined by the testing engine itself.
Choose d₀ so that in the middlegame positions following move m, the testing engine on comparable hardware completes the depth d₀ round in SLM within 2-3 minutes.
Set d = d₀ + 1, i.e. going 1 ply further than the supposed cheater's target. Fix d_min = 11 (figuring no ply depth below that is relevant), and L = 10 (the number of lines to use in MLM). Then set d_L = the depth typically reachable in 8-12 minutes in MLM. It is OK to have d_L < d provided d_L >= d-2.
Go to the player's move m' prior to m, without entering m', and run SLM to depth d. Then run MLM to depth d_L. Record the data but do not use it---this step is to fill the hash table so as to approximate the presumed operators' engine's memory content at move m_P in the actual game.
Staying in MLM, enter m' and the opponent's reply. Clip the data at the end of depth d_min (i.e. at "d_min+1/d_min+1" in most engines' UCI display), and quickly change to SLM.
Run in SLM to the end of depth d (i.e., to "d+1/d+1"), clip data, and quickly change to MLM.
Clip MLM data at the end of depth d_L. It is AOK if round d_L' = d_L+1 has already finished the top 1 or 2 lines---these are likely in hash from the SLM and so updated almost instantly. Optionally, clip more MLM data after s = more lines of round d_L' have finished, labeling the clip "d_L'/__ + s", where either d_L' itself or the maximum extension depth e can be filled in the __. And/or go to rounds deeper than d_L', since doing so should not have a deleterious effect on the deeper SLM for the next move.
Staying in MLM, step ahead 2 ply and goto step 5. (Or if testing both players in the same run, step ahead just 1 ply.)
It is OK to stop before doing any move m. To resume, go to step 4 for the move m' prior to m, filling up hash before testing move m.
If some lines in an MLM step are disastrous, it is OK to click "-" to reduce the number of lines---this can avert substantial engine slowdown when it tries to find optimal mates in disaster lines.
After trades that cause depth d itself to be readily completeable in under 3 minutes, figure that operators would extend their search accordingly and increment d by 1 or 2 as appropriate. (Always going in SLM at least 1 ply beyond the "2-3 min. depth d₀" already builds-in some robustness for this and longer-thinking situations that might arise during the game.) Note in logfiles when d has been increased.

The arbitrary choice d_min = 11 is made because lower ply depths are believed to have strength below GM level, making them ignorable, and often come too quickly in MLM to be recorded! The choice of recording L = 10 lines in MLM is a compromise. The statistical methods formally involve the evals of all alternatives, but having the 9 next-closest evals is usually a fine approximation. Computing, say, 30 lines in MLM would not only slow the test by a factor of 3, but would often burn extra time as engines seek optimum mates after disastrous moves. This said, a manual (or scripted!) tester must be prepared to use a lower L on moves with fewer than 9 non-disaster replies, e.g. when a recapture is forced or the King is in check and piece-hanging interpositions should be skipped.

Alternative Testing Methods

The first main alternative is to do one run entirely in SLM to record only the SLM data---perhaps still switching to MLM after entering moves m_P for 2-3 or so minutes to ape the supposed cheating procedure---and then do separate run(s) in MLM. MLM runs by multiple engines might provide a more-robust "jury consensus" on which to base the conversion from evaluations to (prior) probabilities. If multiple engines are tested in SLM, a "jury" of corresponding MLM runs may be available in any case.

The second main alternative is to use only an MLM run, to high depth d (not d_L), taking the MLM evals both to determine whether the player's move "matches" the engine and to provide the statistical basis. This was done for the Elista WC testing. One advantage is that recording MLM data (say, 10 lines) at every ply depth helps identify "swings" in move evaluations, which can be figured into more-refined formulas for estimating prior probabilities of moves being played. Disadvantages, however, are: (1) which moves match in MLM versus SLM appear to differ "more than expected", (2) SLM is held more likely to be the cheater's mode of choice, (3) running MLM to full depth d takes scads of time!, and (4) clipping each ply depth currently requires constant human attention. For reasons (1--2), only the SLM data is used to say which moves "match" at ply depths up to d, while only MLM data is used to judge the significance of matches and non-matches.

Appraisal

The main advantage of "zig-zagging" between SLM and MLM is that all data is obtained in 1 run, while preserving relevance to how the engine might be operated during the opponent's thinking time. Saving the 5-10 hours needed for alternative MLM runs is substantial, especially when they require manual operation.

The main disadvantage is an artifact of hashing when d_L < d that we call the "leader effect". The SLM evaluation e_m of the preferred move m at the final depth d is used for m at all ply levels in the subsequent MLM, since the critical lines for m are preserved in hash. Occasionally, this is done also for moves m' that were tied with m or just displaced by m in the final round of the SLM computation. Thus the MLM at d_L will have the depth-d_L evals for the other 9 moves, but e_m rather than the depth-d_L eval for move m. This inconsistency can skew the differences ("deltas") between m and the other moves, particularly when greater depth tends to show greater advantage (or disadvantage) in move m---hence the name "leader" effect. Moreover, if a move m' != m was displaced only at SLM depth d (i.e., had a late swing in value), the MLM at depth d_L may still show m' with a higher value. This accounts for some cases in the log files where move m is a "full match" at every ply depth, but the "delta" from the MLM is negative. Since the statistical methods employ "smoothing", we expect the resulting skewing to be tolerable. Moreover, both kinds of observed skewing lessen the significance values obtained, thus making statistical "positives" more difficult to get and hence erring on the side of caution. In light of the substantial time savings in gathering data, we employ the zig-zag method.

Detailed Directions and Log File Format

Remaining Methodology

This covers the gathering and presentation of the basic data. Rules for statistical analysis, and for drawing conclusions according to recognized significance levels, will be detailed in our main papers.