Response to a comment with 20+ thoughtful questions by RichT---some of my answers make the matters even more thoughtful. Primary source for details I don't go into here is my January 2013 open letter and report on the Ivanov case. His Qs are in italics, my replies in Roman.
There are many questions raised by the use of a statistical method of looking for cheaters. At this time, I think it is highly prone to both type 1 and type 2 errors, meaning it will falsely accuse some players of cheating while providing a solid defense for those who are cheating.
The false accusations will arise because there is a bias towards accusing those who play better than their current rating, which is anyone who is talented and improving. Someone like Magnus Carlsen, for example, at many points during his career, and even a young Regan.
As noted in my presentation to FIDE last October, "young Regan" tied with GM Gennadi Sosonko for the 3rd-highest matching in the entire series of famous Lone Pine tournaments. But my games were so tactical/forcing that my model projects only a slightly smaller matching figure for a 2400 player, just over it for my 2510 performance, and well over for a budding 2600 player. So on the last score my second IM norm was an underperformance. I have not yet done my games from years before 1976, but have been exhaustive for certain other young guns in the 2700+ club. My formal reports have given slack for this factor, besides the null-hypothesis requirement of using the player's rating after the big event.
The defense will arise because it seems too easy to beat the system when players know how they are being judged. Just make a few moves that are sub-optimal but not blunders, that make you play like you are 300 points better than you are, and you will win more often than you should but not show up anywhere near 5 sigma. As your rating goes up, you keep improving the moves you choose to stay one step ahead of the field, eventually making GM then super-GM, lauded as a prodigy but a cheater nonetheless, and all you have to do is point to the low z score to rebut any of your accusers. How hard is it to play like a 2000 when you are a 1700? Just tell the engine to play like a 2000.
The latent effect of "educating cheaters" is acknowledged, but the point is to make things harder for them. The larger point is in my FIDE talk's title whereby I emphasized "Analytics"---which includes distinguishing the way computers currently reduce their strength from human play at that level.
That said, there are a number of problems with Regan's methodology that should make us pause even more before using the z score to accuse (or not accuse) players.
1. You can't combine moves played over time when a player's rating changed without accounting for it, and even then you would need to know a player's true current rating to create the z score. Not only do players improve quickly early in their careers, but they also have dips late in their careers or when they get rusty or haven't studied for a while. They can also get better or worse based on what's going on in their personal lives, or have health issues, jet lag, time-zone changes, etc.
The testing is on a per-event basis. Z-scores from consecutive events can have different rating settings if warranted, and are combined using Fisher/Stouffer's Rule. Regarding the last points, the only time my system has been requested for a case of sandbagging, it revealed that the alleged sandbaggers played at a 3000+ level while losing almost all their games.
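For concreteness, Stouffer's rule for combining independent z-scores can be sketched as follows (the event z-scores here are made-up numbers for illustration, not from any actual case):

```python
# Stouffer's rule: the sum of k independent standard-normal z-scores,
# divided by sqrt(k), is again standard normal under the null hypothesis.
from math import sqrt

def stouffer_combine(z_scores):
    """Combine independent per-event z-scores into one overall z-score."""
    k = len(z_scores)
    return sum(z_scores) / sqrt(k)

event_z = [1.8, 2.1, 1.5]           # hypothetical z-scores from three events
overall = stouffer_combine(event_z)
print(round(overall, 3))            # 3.118
```

Note that three individually unremarkable events can combine into a much stronger overall signal, which is the point of testing per-event and then aggregating.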
2. The article incorrectly states that GMs make as many mistakes as low-rated players. This is obviously not true. But it also makes it particularly difficult to create z scores for GMs, since they play an optimal or near-optimal move almost every time. That's why draws are so common in GM-vs-GM games. How do you catch a GM in this situation?
It does not say that---it says both GMs and lower players make more mistakes relative to engines, and that players at all levels make 60--90% more error (the word was meant to be singular indicating an aggregate quantity) when the engine says they are 0.50 ahead or behind than when it says even-steven (a mysterious phenomenon addressed in my latest paper). GM games have received highest emphasis in my modeling, and it is important to me that ancillary results such as the IPR's pass "sniff tests"---for which see my full chart of a year's worth of supertournaments. What I am aware of having trouble with is doing IPR's and z-scores for human-computer tandems ("Freestyle").
3. Does the choice of engine change the z score? Has anyone tested engine agreement/disagreement?
The theory is actually engine-neutral, caring only about strength; my formal tests still use Rybka 3, which is definitely the "wrong" engine. Engines agree much less in their own games than when analyzing human games, as covered in the final long section of the above paper. I am still converting from Rybka to a Houdini-Komodo-Stockfish troika; results so far indicate little difference (where I have run hundreds of the same player-event performances through both Rybka 3 and Houdini 3-or-4, the Houdini percentiles are lower by about 0.3%, which I ascribe to Houdini being slightly stronger and hence slightly further away from the human players). The draft FIDE ACC proposal calls for doing a second full test with a different engine.
4. How long does one run the engine to come up with each move's evaluation? In a tournament the player can't let the engine run too long or they'll lose on time, so cheaters may end up making sub-optimal moves when measured against engines that have a lot of time to calculate.
My work uses fixed-depth standards for definiteness (e.g., independence from the underlying hardware's speed) and scientific reproducibility. The screening test averages 10--15 minutes on one processor core for a typical 40-move game; the full test can run 4--8 hours on one core per game. What you say has been an evident factor in several cases.
5. If the engine is run for just a short time during the tournament, one engine's move order may greatly impact what it shows at a given point in time. For example, one engine may look at pawn moves first and another pawn moves last.
The theory and large-scale empirical testing treat factors at this level as a wash. In specific cases some "analytical" factors of a more general nature have been noted in my reports, and a good defense could always raise more. The issue of how long to run an engine, and "confirmation bias" of a player calling a "match" if a move ever shows as top at any depth, arose in my turning aside an incorrect complaint while a European open last summer was in progress.
6. What about players who are very good most of the time and then blunder often enough to lose a lot of games and rating points? They may look like cheaters because they come close to making optimal moves a large percentage of the time.
My model has two parameters, "s" for sensitivity and "c" for consistency; the latter governs the propensity for large errors and is targeted at exactly this.
7. The article states the IPM [IPR] is not as reliable as the z score for a small set of games. But a low number of games creates just as low a confidence level in the z score.
The distinction is between a z-score expressing confidence in a measurement, as for the IPR's, and a z-score giving a statistical test of a null hypothesis. My reports have stated that comparing the IPR's measurement error bars against the player's pre- or post-event rating is improper; wide as those error bars are, z-scores imputed from them are inflated relative to the actual statistical tests.
8. Low-rated players may still memorize some book openings, and if the moves that are examined include the opening then they will look like they're cheating when they're not.
The tests exclude all initial moves previously made by 2300+ players in ChessBase Big 2013 augmented by TWIC since then---OpeningMaster.com uses 2300 as the cutoff for their select opening database. The shift in scores from excluding book has thus far been less than the effect of reducing the sample size, which directly lowers z-scores. Of course there is preparation beyond book, but as you go below super-GM level it is progressively more offset by players not knowing book, while even the top players sometimes forget their prep. (Even when players are out of book before move 9, turns 1--8 are excluded; beginning at turn 9 is arbitrary but IMHO best, and reflects choices made by the TCEC competition and by other researchers.)
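The exclusion rule can be sketched as follows (hypothetical data structures standing in for the actual book database):

```python
# Sketch of the move-exclusion rule: turns 1-8 are skipped
# unconditionally, and later moves are skipped while the game still
# matches a known 2300+ reference line.
def analyzed_moves(game_moves, book_lines):
    """Return (turn, move) pairs kept for analysis.
    game_moves: list of moves in order; book_lines: set of move-sequence
    tuples previously played by 2300+ players (a stand-in here)."""
    kept = []
    for i, move in enumerate(game_moves, start=1):
        if i <= 8:
            continue                        # turns 1-8 always excluded
        if tuple(game_moves[:i]) in book_lines:
            continue                        # still in known book
        kept.append((i, move))
    return kept

moves = [f"m{i}" for i in range(1, 13)]     # a hypothetical 12-move game
book = {tuple(moves[:9])}                   # book line running through turn 9
print(len(analyzed_moves(moves, book)))     # 3 (turns 10-12 kept)
```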
9. Unless the game is recorded live one must analyze the moves after the fact, when it is too late to prove anything by other means such as searching for devices or watching what the player is doing.
That's what the separation into "screening" and "full" tests is for in the FIDE-ACP ACC proposal. The screening test can potentially run au courant (in near-real time), though in major-tournament trials I've run it overnight (Euro time).
10. How do you deal with notation errors?
Good question! In the Feller case I recognized one, and also communicated that I thought someone else's reconstruction was incorrect as well, but the other player (in France not at Khanty) turned down requests to correct the error. This contributed to my unwillingness to publicize the result from the tournament in question, which mainly owed to the score still being well below the threshold for results with no physical or behavioral evidence, as enunciated here and here.
The larger problem---especially recently with the Iasi Open and several "whispers" this year---has been getting the games in the first place. Organizer(s) of the 2013 Neckar Open turned down requests for the last-round game Kotainy-Bai, from which Jens Kotainy earned a GM norm, even though it would have been an important test for the science.
11. Often the best move is part of a multi-move tactic. Looking at each move in isolation won't capture this.
One place this factor certainly shows up is as an effective reduction in my sample size, from N total moves to N' < N units of independence over all the decisions. The dependence is sparse when considered among all N-choose-2 pairs of moves.
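As an illustration of how dependence shrinks the sample (using an assumed toy correlation structure, not the model's actual one, in which only consecutive decisions are correlated):

```python
# Toy calculation (my numbers, not the model's): sparse dependence
# between consecutive move decisions reduces the effective sample
# size N' below the raw move count N.
def effective_sample_size(n, rho):
    """N' for a sum of n unit-variance terms where only consecutive
    pairs are correlated with coefficient rho (all other pairs
    independent, so the dependence is sparse among N-choose-2 pairs)."""
    var_sum = n + 2 * (n - 1) * rho     # variance of the sum
    return n * n / var_sum              # N' such that var_sum == n^2 / N'

print(round(effective_sample_size(200, 0.2), 1))   # 143.1
```

So even a modest per-pair correlation trims a 200-move sample to roughly 143 independent units, while the sparseness keeps the loss from being worse.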
12. Because tactics are often built over several moves, different engines may not see the tactics the same way. What would happen if you have 3 engines, A, B and C with C the weakest, and C plays a combination of A on move 1, B on move 2, C on move 3, A on move 4, and so on. Who would win? It may in fact be C. The implications of this affect how one evaluates a cheater who doesn't use the engine on every move.
Over large enough sample sizes, the expectation is for such factors largely to even out and offset. One important fact is that because the full test is done in a different analysis mode (often with a different engine), it irons out such occurrences. This is also a form of "regression to the mean", and has been a telltale of honest performances.
The issue of a person not using the engine on every move is of course thornier. I can however say that a person using different engines on different moves won't gain any real cover.
13. From what we can tell from the article, Regan's algorithm treats all moves equally, even though in some [turns] the move choice has a much larger bearing on the outcome.
My program is set up to handle arbitrary move-weighting schemes (the effect on mean and variance is nicely linear), but I have not (yet) seen a compelling reason to eschew unit weighting. The important units as mentioned above are independent decisions. The equations of the model already minimize the impact of an obviously forced move.
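The "nicely linear" effect can be sketched as follows (hypothetical numbers; the match probabilities here stand in for the model's per-move projections):

```python
# Sketch of a weighted match-count z-score: the projected mean and
# variance are linear in the per-move statistics, so arbitrary
# weighting schemes slot straight in.
from math import sqrt

def weighted_z(matches, probs, weights):
    """z-score for a weighted sum of Bernoulli 'engine-match' indicators.
    probs[i] = projected probability of matching on move i (hypothetical
    stand-ins for the model's projections)."""
    mean = sum(w * p for w, p in zip(weights, probs))
    var  = sum(w * w * p * (1 - p) for w, p in zip(weights, probs))
    score = sum(w * m for w, m in zip(weights, matches))
    return (score - mean) / sqrt(var)

z = weighted_z([1, 0, 1], [0.5, 0.4, 0.6], [1.0, 1.0, 1.0])
print(round(z, 3))   # 0.585
```

With unit weights this reduces to the plain match-count test; up-weighting a move simply scales its contribution to both the mean and the variance.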
Up-weighting "critical" decisions would call for an innate measure of criticality or complexity, as opposed to getting human judgment for each move. One measure I've tried is akin to entropy, which (in this case) increases when there are more moves whose values are close to that of the best move. The problem is that the entropy measure is maximized in positions that are lifeless, where "anything draws". Thus we need a notion of complexity that is more structured. Ironically this is where papers on "Intelligent Design" would tend to crop up in searches again. (My positions on ID are (1) it is fundamentally a skeptical position and should be taught as part of a short unit on the role of skepticism within science, and (2) that unless and until we are able to prove some, any non-trivial lower bound on computational complexity, we really have no firm idea of the computing potential of information-accruing processes, and should refrain from making large-scale claims about them.)
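The entropy-like measure mentioned above might be sketched like this (hypothetical engine evaluations; the softmax scale is an assumed tuning constant):

```python
# Entropy of a softmax-style distribution over move values: highest
# when many moves have values close to the best move's value -- which
# is exactly why it peaks in lifeless "anything draws" positions.
from math import log, exp

def position_entropy(evals, scale=0.5):
    """Shannon entropy over the legal moves' evaluations (in pawns).
    'scale' is an assumed constant controlling how sharply worse
    moves are discounted."""
    best = max(evals)
    weights = [exp((e - best) / scale) for e in evals]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * log(p) for p in probs)

# Many near-equal moves ("anything draws") vs. one clear best move:
print(position_entropy([0.10, 0.08, 0.09, 0.07])
      > position_entropy([0.10, -1.2, -1.5, -2.0]))   # True
```

The sketch makes the flaw concrete: a dead-drawn position with four equal moves scores near the maximum, so entropy alone cannot separate "complex" from merely "lifeless".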
14. We probably care a lot more about cheating at high levels than low levels, in much the same way we care more about Major League Baseball than Little League. Yes, we care about both, but not equally. But it is precisely at the highest levels that these types of statistical algorithms are the least useful.
The flip side is that if a 1700-player uses a 3000-level source---or is unable to distinguish a 2500-level source from a 2000-level source as you "recommend" above---then the statistical results are going to be a whole lot clearer than for a 2650 player using a 3000-level source. (But the system did deliver multiple supporting positives on one such player.)
15. What if someone cheats for 20 moves and then gets into time trouble and blunders, or at least does not play the optimal move? They will look like they haven't cheated over the course of the game.
This was a major consideration in one of my reports last year, involving a game in which check-and-fork-a-Rook was overlooked by both players twice at moves 38--40. The four win-to-loss flips racked up the maximum error per move under my scaling scheme, and made the difference between a clear supporting positive and a borderline result. My policy forbids excluding such moves unless there is a clear report of what happened---as there has been in other cases. Cyril Marzolo's confession in the Feller case made clear that they lost one game because the scheme was taking too long relative to the clock past Move 45 or so, and that could have informed my work. In general, identifying properties of sub-sequences of moves is where this becomes "Analytics Not Just Statistics."
16. Do you create a cumulative score over many games, over one game, or look at each move individually to decide that someone is playing well above their level?
Again, testing is on a per-event basis. Even the empirical testing of my confidence intervals on tens of thousands of simulated performances (the buzzword is "bootstrap" re-sampling) used 9-game units.
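The unit-resampling idea can be sketched like this (illustrative event scores, not actual test data):

```python
# Bootstrap re-sampling in whole-event units: resample 9-game events
# with replacement to estimate empirically how a statistic varies,
# rather than relying only on the model's theoretical variance.
import random
from math import sqrt

random.seed(0)

def bootstrap_stddev(event_scores, trials=10000):
    """Estimate the std. deviation of the mean per-event score by
    resampling events (9-game units) with replacement."""
    n = len(event_scores)
    means = []
    for _ in range(trials):
        sample = [random.choice(event_scores) for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / trials
    return sqrt(sum((m - mu) ** 2 for m in means) / trials)

scores = [5.0, 4.5, 6.0, 5.5, 4.0]   # hypothetical scores from five 9-game events
print(bootstrap_stddev(scores))
```

Resampling whole events rather than individual games preserves the within-event dependence between a player's moves, which is the reason for the 9-game units.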
17. The algorithm doesn't take time control into account.
My results on Melody Amber Rapids, the somewhat faster World Rapid Championship, and the World Blitz Championships, show that the strongest players lose about 200 Elo at Ambers, 275 in World Rapid, and 575 at World Blitz. From this I estimate that Game/60 loses about 75 for them, which I mentioned to some at FIDE who were considering voting to allow rating some forms of G/60 last October in Tallinn. A further interpolation then puts the differences between various "Standard" time controls under 30 Elo, which is less than the slack standardly used in the tests. Hence I have not prioritized segregating games by the exact time control used.
18. The algorithm doesn't take opponent level into account. Players will often let the rating differential affect their move choices, whether that is sensible or not.
This is a major factor, the one clear bias in my training data that will take a lot of work to root out. My training data uses games between players of nearly equal rating, each within ±10 (sometimes ±15 or ±20) of an Elo century mark up to 2700, as here. This gives absolutely spot-on results for the aggregate of all World Championship matches since 1972 (that is, IPR == average rating to within ±2 Elo, when weighted by # of moves), but regularly gives round-robins an aggregate IPR 15--25 Elo lower than the tourney average. Other results this year (including the above new paper) support the explanation that games between unequally-rated players are "choppier".
This is, however, overtly a conservative bias---that is, it tends against false positives in the cheating tests, because the parameters used for a given rating such as 2600 come from the less-"choppy" training data.
19. How accurate is the player's current rating? Is there some measure of confidence associated with ratings?
Yes---see Glicko. The slack used in my tests has been more than the radius of a "glob" for a player with a lot of games.
20. Ultimately, this sophisticated-looking algorithm amounts to the same thing as comparing current rating to performance over a specified set of games. Ratings are based purely on results: we don't give a player more points for crushing their opponent versus maintaining a one-pawn advantage. And that's OK, because taking material and positions into account is fraught with peril. But the only difference between Regan's method and using performance minus rating is that Regan looks at each move individually. By summing them up across many moves, which must be done, good moves followed by bad ones come out roughly the same as the final outcome, by and large, averaged over time and players. This is because playing similarly to the engine is a sign of a stronger player, and the probability calculated is the probability that a player of level X was capable of a performance of level X+Y. Given that the Elo system was created such that a 400-point rating difference equates to the higher-rated player having an expectation of 9 points out of 10 in a match with the lower-rated player, we can use simple probability theory to calculate the probability that a player with rating X will get a particular score against other players with average rating W, and convert that probability to a z score. One simply has to use a binomial distribution with probability .1 of a player winning against a 400-point deficit, a .01 chance of winning with an 800-point deficit, etc. Though of course this method has the same disadvantage of requiring us to come up with an accurate current rating when the player may be improving or declining at a rapid pace.
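For reference, the binomial calculation the comment proposes (the commenter's results-only scheme, not the move-based model under discussion) could be sketched as:

```python
# The comment's proposal: take the standard Elo expectation formula
# (which gives roughly a 0.1 win probability at a 400-point deficit),
# treat each game as a Bernoulli trial, and test the observed score
# against it with a normal approximation to the binomial.
from math import sqrt

def elo_expected_score(rating, opp_rating):
    """Standard Elo expected score per game."""
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

def score_z(rating, opp_rating, games, score):
    """Normal approximation to the binomial test on the total score."""
    p = elo_expected_score(rating, opp_rating)
    mean = games * p
    sd = sqrt(games * p * (1 - p))
    return (score - mean) / sd

# A 1700-rated player scoring 7/9 against 2000-rated opposition:
print(round(score_z(1700, 2000, 9, 7), 2))   # 5.25
```

The sketch also shows the scheme's weakness: its sample size is the number of games (9 here), whereas the move-based model draws on every analyzed move.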
On what basis, compared to the details in my papers, is all this asserted? One major difference is that the number of games is a tiny sample compared to the number of moves in those games. Even for a 9-game event the number of moves is not great, but the model understands that. If you take 50 games over five or six events, that's still a small sample of games, but those games can be expected to have about 1,500 analyzed moves, which is a fine sample. (This is why I haven't regarded the recent Kaggle competitions on rating systems as bearing on my work---they all predict from results of games alone.)
Regarding the relation to performance---both raw score and Tournament Performance Rating---look at the bottom of my chart of supertourneys from 9/12 to 9/13. Among tailenders, Ivan Sokolov with 3/13 at Wijk Tata A and Jon Ludvig Hammer with 1.5/9 at Norway Chess 2013 get IPR's under 2400, but Gawain Jones with 1.5/8 in London gets a decent 2560, and Erwin L'Ami with 4/13 at Wijk gets 2715, above his rating! Unfortunately for L'Ami, his opponents played at 2850---my current work is trying to parse whether L'Ami was "unlucky" or simply didn't set much "challenge". (I'd be interested to know whether a GM analyst would find this reflected in the games.) At any rate, this is enough to indicate that the model is not just echoing the score or performance rating.
I don't have any good answers but I hope these points will give people pause about blindly applying statistics and engines to this problem. Doing low tech things like watching people for suspicious activity is necessary, though unfortunately not sufficient. I would like it if TDs could at least have a way to quickly check who is playing above their level and use that as an early warning system to watch those players more carefully.
Indeed. The quick-check is called the Single-PV test in the article, and the "screening test" in the FIDE-ACP ACC document.