The tests of Game 6 and of the Sukhovsky-Mihalevsky control game use matches during the 13-ply search, which finishes in 1-3 minutes on my 2GHz laptop. At the time I believed the only possible narrative for alleged cheating was that Kramnik kept a "Pocket Fritz" in the bathroom that somehow managed to evade electronic detection---or a tiny tablet-PC with no more horsepower than my laptop. Such a device could not be expected to go further given the game's time limit---even if Kramnik took 10 minutes or more on a move, the narrative by Topalov's team had him going in and out for only 1-2 minutes at a time (to try another variation on later visits?). But respondents in the Rybka forums, on Susan Polgar's blog, and in comments elsewhere reported that the intended "Faraday shield" over the playing area was not provided, and then Kramnik's "fear of a planted bug" letter changed the narrative to one of communicating with a team of helpers using more powerful hardware. Hence I took the Game 3 "acid test" out to 15 ply and beyond.
Going out to greater ply depth is needed for a different reason anyway. The crude "Hamming metric," which only counts matches and non-matches, ignores the difference between a forced move and a case where the computer's pick was followed amid a large number of plausible alternatives. Clearly the latter is a more significant match than the former! Accounting for this requires formulating a notion of a-priori probability for the choice of a move, for which the program itself, run at "3,000 Elo" strength, is the most objective referee. Technically this requires evaluating all the legal moves, but evaluating the top 10 is a reasonable compromise, since that usually catches all the ideas (my logs note a couple of exceptions) and since truncating the contributions of moves beyond the top 10 should usually cause negligible error. Modeling this kind of human-agent "fidelity" becomes a general AI problem, and I am sounding out experts in this field for further work.
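As a rough illustration only---not the actual model, which remains to be worked out---one could convert the engine's top-10 evaluations into a-priori probabilities with a softmax (Gibbs) weighting, where a hypothetical "temperature" parameter stands in for player strength. The weight of a match is then the information content of the matched move, so a forced move counts for little while a match among many near-equal alternatives counts for much more:

```python
import math

def move_priors(evals_cp, temp=100.0):
    """Convert engine evaluations (centipawns, best move first) into
    a-priori move probabilities via a softmax/Gibbs weighting.
    `temp` is a hypothetical 'skill temperature' parameter, an
    illustrative assumption rather than a calibrated constant."""
    weights = [math.exp(e / temp) for e in evals_cp]
    total = sum(weights)
    return [w / total for w in weights]

def match_weight(evals_cp, temp=100.0):
    """Information content (in bits) of the played move matching the
    engine's top choice: -log2 of that move's prior probability.
    A forced move (huge gap to the alternatives) carries almost no
    weight; a match amid close alternatives carries much more."""
    return -math.log2(move_priors(evals_cp, temp)[0])
```

For example, with evaluations like [+20, -480, -510] the first move is essentially forced and its match weight is near zero bits, whereas [+20, +10, 0, -10] spreads the probability and a match is worth well over a bit.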
Finally for now, an important point I see is that the tests need to be flexible, but this currently also makes them extremely time-intensive to do and difficult to automate. Fritz 9 has several substantial features (under Tools--Analysis) called Full Analysis..., Blunder Check..., and Deep Position Analysis..., but I found them all too constrained for the situation being modeled here, and moreover they do not provide output with the needed detail. My partner Jason Buczyna, far more experienced than I in computer chess, also expressed privately "...If only all of this could be automated..." at the end of an e-mail on Oct. 11. My need to observe (at least) 10 moves in full (figuring errors beyond 10 will be minimal) also runs counter to how chess engines usually conduct analyses---showing just the top line. (It has irked me not to get the benefits of both the one-line and multi-line views simultaneously.) Thus I stood over Fritz 9 (with baseball games and then ESPN in the background) to "Clip Analysis" by hand at the 12/12...15/15 and further junctures, compiling those logs by hand. In endgames, for instance, 12/12 was so quick as to seem meaningless, so I "bent the box". But this gave the collateral benefit that I was observing Fritz 9 myself, and, being a strong enough player to spot some of the developments and ideas that supposed cheaters would see, I could try to imitate both the "in-bathroom Pocket Fritz" and "by ceiling cable to a team of analysts" modes of cheating. I left half-unread on my nightstand a birthday present titled A Madman Dreams of Turing Machines---because I was spending the same night hours observing one, quite sane :-). The flexibility makes me feel confident in predicting that further tests, provided they use Fritz 9's default parameters, a reasonable hash size (at least 128MB, say), and enough overlap with my ply-depth window of significance, will stay within the bounds of variation considered here, with statistically preponderant probability.
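On the automation wish: engines that speak the standard UCI protocol (Fritz 9's own GUI does not expose this to scripts, but UCI engines generally do) can be driven to produce exactly the multi-line, fixed-depth output needed here. The sketch below is a hypothetical illustration of that approach, not an existing tool; the engine path and helper names are assumptions, and mate scores are skipped for simplicity:

```python
import subprocess

def parse_info_line(line):
    """Extract (multipv_rank, score_cp, first_move) from a UCI 'info'
    line, or None if the needed fields are absent (e.g. mate scores
    or non-analysis lines)."""
    parts = line.split()
    if "multipv" not in parts or "score" not in parts or "pv" not in parts:
        return None
    rank = int(parts[parts.index("multipv") + 1])
    si = parts.index("score")
    if parts[si + 1] != "cp":  # skip 'score mate N' lines for simplicity
        return None
    score = int(parts[si + 2])
    move = parts[parts.index("pv") + 1]
    return (rank, score, move)

def analyze_multipv(engine_path, fen, depth=13, multipv=10):
    """Ask a UCI engine for its top `multipv` lines at a fixed depth,
    returning {rank: (score_cp, first_move)} from the final iteration.
    `engine_path` points at any UCI engine binary (an assumption)."""
    eng = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE, text=True)
    def send(cmd):
        eng.stdin.write(cmd + "\n")
        eng.stdin.flush()
    send("uci")
    send(f"setoption name MultiPV value {multipv}")
    send(f"position fen {fen}")
    send(f"go depth {depth}")
    lines = {}
    for out in eng.stdout:
        parsed = parse_info_line(out)
        if parsed is not None:
            rank, score, move = parsed
            lines[rank] = (score, move)  # later depths overwrite earlier
        if out.startswith("bestmove"):
            break
    send("quit")
    return lines
```

Run over every position of a game, this would yield the 10-move evaluation logs compiled by hand above, with no baseball games required in the background.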
This all bears informative contrast to two computer applications whose needs appear to be the opposite, and for which the ability to automate is the main point: