This page begins with testing a specific allegation that Eugen Varshavsky cheated with confederates transmitting moves during the 2006 World Open, and specifically that this cheating occurred during his Round 7 upset of GM Ilya Smirin. (I had thought the allegation included his having a receiver in his hat, but none was found on inspection.) Varshavsky at the time had a USCF rating of 2160, which compared to Smirin's rating above 2600 would leave Varshavsky expecting to score under 5% in the long run, especially with the Black pieces as in this game (a quick check of this figure appears below). This topic, accusation, and game were featured in the 4/8/07 NY Times Chess Column (now behind the subscriber-only curtain) by Dylan Loeb McClain. That column also reported on the March 2007 Chess Life story by Jon Jacobs on anti-cheating efforts (including this site). Primary sources on this particular controversy include:
The 4/8 McClain column was also linked in this 4/8/07 post in the Susan Polgar chess blog. For human-computer similarity testing, the operative words in McClain's article are:
"...After Varshavsky won, Larry Christiansen, a grandmaster, found that the last 25 moves matched those chosen by a commercially available computer program called Shredder. With no copy of Shredder, I compared Black's moves with those suggested by Fritz 9. From 14 ... a5 (the first move that varied from what had previously been played) to the end, Fritz agreed with 34 out of 44 moves."
(This continues the uniform pattern that results of scientific experiments are reported in the chess world with no provision of data, methodology, logs, reports, or anything else to permit reproducibility of tests by others. These scientific fundamentals are overlooked amid the need for due process, with persons directly named and reputations involved. Neither the ply depth of testing, nor the mode of testing (single-line or multi-line, or the "retrograde" game-analysis modes in the Fritz GUI itself), nor even the version of Shredder, has been given by any source I've seen on this story, which is still reverberating after 9+ months. This site attempts to remedy these omissions: you can dispute my methods, but at least they're reviewable!)
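As a quick sanity check on the "under 5%" figure quoted at the top, here is the standard Elo expected-score formula in a few lines of Python. The raw formula gives roughly 4-7% over this rating gap; landing under 5% presumably also reflects the Black pieces and the USCF-vs-FIDE scale difference, neither of which the bare formula captures.

    # Standard Elo expectation: E = 1 / (1 + 10^((R_opp - R_player)/400)).
    def expected_score(r_player, r_opponent):
        return 1.0 / (1.0 + 10.0 ** ((r_opponent - r_player) / 400.0))

    print(expected_score(2160, 2600))   # ~0.074
    print(expected_score(2160, 2700))   # ~0.043, if Smirin is nearer 2700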
Long test file of
Smirin-Varshavsky, 0-1, round 7 of the 2006 World Open, with
Shredder 9.1 and the basic methodology used on the
Corus 2007 testing page.
Tabulated results in
this file.
These results show 23 of the last 25 moves matching, and 17 of the
previous 19 at some high ply depth as well, plus 1 tie.
Testing with Shredder 10, which was released 6 weeks prior to the game,
is in progress. Larry Christiansen tells me he used Shredder
Classic/Solid, a prior version, running for about a minute per
move to (at least) ply depth 10. Queries and testing are also in
progress on whether differences between Shredder versions are as
large as those between Fritz 9 and 10 shown in Corus 2007
rounds 2 and 3 here.
The lone move that no one reports as a match to "Shredder"
is 29...Rf8?!, which appears to give away Black's advantage and is
not near the top 10. However, the long test file
has a materially relevant
hypothetical explanation even for this non-match: in single-line
mode, the move 29...Qa7 is initially preferred, until at 16 ply
Shredder (9.1) uncovers a "surprising" big swing to White's advantage,
after 2-1/2 minutes of running on my 2GHz laptop. It then takes
Shredder almost 25 minutes to resolve the resulting confusion
about the best move, which is too long for real-time advice to a
player (assuming the confederates' hardware was not greatly superior
to my laptop). So they may have had to say "play a move" as matters
shook out.
Bottom line: The results substantially confirm the
testing by GM Christiansen and the above reports of it (except for
commenters in the Chess Ninja items ascribing the "25 matches"
to Fritz 9), and further indicate a consistent narrative of
cheating during the entire game.
Most to the point, many of the matches are in close
situations, in contrast to Topalov-Kramnik Elista 2006 game 2.
Our statistical calculations, when finalized, should show
high information gain, and we expect the results to meet court
standards of statistical evidence of improbability under the null
hypothesis (of no cheating); a minimal sketch of the kind of tail
computation involved appears below.
In other words, if this is not a "smoking gun", nothing is...
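For concreteness, here is a minimal sketch of the binomial tail computation such a claim rests on. The baseline probability p0 that an unassisted master's move matches the engine is the crux: the values below are placeholder assumptions, not figures established on this page, and the finalized calculations will have to estimate p0 from data.

    from math import comb

    def binom_tail(n, k, p):
        # P(X >= k) for X ~ Binomial(n, p): the chance of k or more
        # engine matches in n moves arising without consultation.
        return sum(comb(n, j) * p**j * (1 - p)**(n - j)
                   for j in range(k, n + 1))

    # 23 matches in the last 25 moves, under two placeholder baselines:
    print(binom_tail(25, 23, 0.55))   # ~7e-5
    print(binom_tail(25, 23, 0.65))   # ~2e-3 -- very sensitive to p0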
Long test with Fritz 9,
with tabulated results
here. There is close but not perfect
agreement with the match rate reported by McClain.
Inspection of the results leads us to believe that the formal
statistical testing will show both significant evidence of
collusion with strong programs in general and a significant
difference between this and the Shredder (9.1)
test results.
(NEW, 5/24/07)
Long test file of Bartholomew-Varshavsky,
Round 5 from the same tournament, with tabulated results
in this file.
These results also show a high information gain (one way to
formalize that notion is sketched below), with only 4
clear non-matches out of 48 moves and 28 significant matches
(plus 9 matches on clearly forced moves and 7 unclear/partial matches).
Note that this test was conducted entirely after the conclusions
on this page from the game with Smirin were written, and hence
independently confirms the preliminary finding of significant
evidence of collusion.
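As one way to see what "information gain" could mean here, the sketch below tallies a log-likelihood ratio, in bits, between a consultation hypothesis H1 and an independent-play hypothesis H0. The match probabilities p1 and p0 are illustrative assumptions only, and the forced and unclear moves are simply left out of the tally.

    from math import log2

    def evidence_bits(n_match, n_miss, p1, p0):
        # Log-likelihood ratio of H1 (consulting the engine, match
        # probability p1) over H0 (independent play, match probability p0),
        # treating the tested moves as independent trials.
        return n_match * log2(p1 / p0) + n_miss * log2((1 - p1) / (1 - p0))

    # 28 significant matches vs. 4 clear non-matches, as tabulated above;
    # p1 = 0.90 and p0 = 0.55 are made-up values for illustration.
    print(evidence_bits(28, 4, 0.90, 0.55))   # ~11 bits in favor of H1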
Current working consensus in the chess world, reflected by
National Director Steve Immitt at the end
of (the move-match section of) the March 2007 Chess Life article, is
that match-rate statistics must be accompanied by some other primary
evidence---such as physical or eyewitness evidence.
This site supports this policy.
One temporary reason is the preliminary
state of both the theory and the gathering of necessary data.
A second, permanent, reason is
Littlewood's Law.
Here this "Law" says that if you play 1,000 games, chances are at
least one of them will match a given engine in a way that in isolation
would be deemed to have a less than 1-in-a-thousand chance of happening
without collusion. Hence other factors that in court cases go under
the headings of "motive" and
"probable cause" must be brought into play. In this case
the game was distinguished by being against a top player in a
big-money event. Some evidence of odd behavior is given in the
Chess Ninja blog comments linked above, but nothing physically concrete.
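The arithmetic behind this reading of Littlewood's Law is a one-liner: even with a per-game false-positive chance of 1 in 1,000, a thousand independent games give roughly a 63% chance that at least one fires.

    p_single = 1.0 / 1000                       # per-game false-positive rate
    p_at_least_one = 1 - (1 - p_single) ** 1000
    print(p_at_least_one)                       # ~0.632, about 1 - 1/e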
Finally, a public appeal for help doing the testing in a
scientifically rigorous manner. The only automated/scriptable modes of
testing currently provided for chess engines (in the Fritz/ChessBase GUI)
work in reverse from the end of the game, and they preserve hash
evaluations of later positions in the main line that tangibly affect
evaluations of the position currently being tested, in ways not
available to prospective cheaters.
In the March 2007 Chess Life cover story
I am quoted as requesting greater
"scriptability" of commercial chess engines (as could be
provided in a revision to the
UCI standard), but until then, realistic tests
require manual operation
over an hours-long timeframe similar to that of the actual game and
activity they are modeling; a sketch of the kind of harness I mean
appears below. Tough for one busy prof to do, but
lovers of chess who have the discipline to do science faithfully can
really help out.
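To make concrete what such a script could look like, here is a minimal sketch of a forward-replay harness over the bare UCI protocol. It re-searches each position from a cold state ("ucinewgame", which engines generally treat as a signal to discard prior search information), so no evaluations of later game positions leak backward. The engine path, depth, and move list are placeholders.

    import subprocess

    ENGINE = "./engine"   # placeholder path to any UCI engine binary

    def send(p, cmd):
        p.stdin.write(cmd + "\n"); p.stdin.flush()

    def wait_for(p, token):
        while True:
            line = p.stdout.readline().strip()
            if line.startswith(token):
                return line

    def engine_choice(p, moves_so_far, depth):
        # Search the position afresh, so that (unlike the GUIs' reverse
        # modes) no hash entries from later positions influence the result.
        send(p, "ucinewgame")
        send(p, "isready"); wait_for(p, "readyok")
        send(p, "position startpos moves " + " ".join(moves_so_far))
        send(p, "go depth %d" % depth)
        return wait_for(p, "bestmove").split()[1]

    p = subprocess.Popen([ENGINE], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, text=True)
    send(p, "uci"); wait_for(p, "uciok")

    game = ["d2d4", "g8f6"]           # ...full game in coordinate notation
    for i in range(1, len(game), 2):  # test the suspected (Black) player's moves
        pick = engine_choice(p, game[:i], depth=12)
        print("move %d: played %s, engine %s, match=%s"
              % (i, game[i], pick, pick == game[i]))
    send(p, "quit")

A time-based limit ("go movetime 60000" for a minute per move) would model the reported testing conditions more closely than a fixed depth.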