{\huge Littlewood's Law}
Why it may take a `miracle' to catch some cheaters \vspace{.3in}
John Littlewood really existed. He appeared as second author on so many papers with Godfrey Hardy that some believed him to be a fictional appendage. He kept on writing papers for a quarter century after Hardy's death in 1947 and lived into his 90's, passing away in 1977.
Today I wish to discuss ``Littlewood's Law'' and its relevance to judging the incidence of cheating---at chess and in general.
Littlewood's law has the informal statement,
Everyone witnesses a miracle every month.
Littlewood defined an event as a `miracle' when one could reasonably say the odds against it were a million to one or higher. The logic of the ``Law'' works as follows: during our waking hours, say eight alert hours a day, we register events at a rate of roughly one per second, which comes to about 30,000 events a day. A million events therefore accumulate in roughly 35 days, so an event with million-to-one odds against should be witnessed about once a month.
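A few lines of Python make the back-of-the-envelope count concrete; the one-event-per-second rate and eight alert hours per day are only the rough assumptions just stated.

\begin{verbatim}
# Back-of-the-envelope arithmetic behind Littlewood's Law.
EVENTS_PER_SECOND = 1          # rough rate of distinct events a person notices
ALERT_HOURS_PER_DAY = 8        # hours per day spent alert and observing
MIRACLE_ODDS = 1_000_000       # a 'miracle' has million-to-one odds against

events_per_day = EVENTS_PER_SECOND * ALERT_HOURS_PER_DAY * 3600
days_per_miracle = MIRACLE_ODDS / events_per_day

print(f"{events_per_day} events per day")                  # 28800
print(f"one 'miracle' every {days_per_miracle:.0f} days")  # about 35 days
\end{verbatim}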
I have elaborated---really belabored---this more than Littlewood himself did. Littlewood's point was just to argue that it is not unusual to see some unusual happenings. This conclusion does not require rigor. For chess cheating, however, I need to show where and when this reasoning carries through rigorously. My point will still be the same as Littlewood's.
Chess Cheating
Chess is a game of complete information. There are no cards to hide that might be palmed, switched, or played illegally, no dice that could be loaded. So how is it possible to cheat at chess? Alas, the complete information can be conveyed to a computer, and thanks to the exponential increase in computer power and smarter chess-playing algorithms, consumer hardware can play better than any human. Hence cheating in chess is possible, and unfortunately this year it seems to have become common.
Some players have been caught looking at PDA's, or with a hearing device, or receiving signals, or with a computer-in-a-shoe. Recently the top-rated player in a tournament in Dortmund, Germany, was disqualified after his cellphone was found to emit code-like vibrations even when seemingly switched off. But in other widely-suspected cases, there is no solid physical or observational evidence at all. What can one do?
One can ramp up prevention, trying to make it impossible for a player to have direct or even indirect access to a computer. The trouble is that truly effective measures are too draconian or expensive to be used in large tournaments, while lighter ones can be evaded by clever schemers. Tournament officials have become more watchful, and yet cheating seems to be growing.
I have been working for years on a method whose beauty is that it requires no direct observation of the player, no monitoring during play, no draconian measures. As a negative statement, its goal is simple: test whether the accused player's moves correlate ``too closely'' with the moves of strong chess programs. It comes, however, from positive research questions: How closely do the moves of players of various rating levels correlate with those of our computer superiors? And how can we thereby measure skill based directly on the move decisions they make, instead of the sometimes-capricious results of games? I have developed a skill-rating model and program that does statistical cheating detection as a by-product. It gives plausible results even on the strongest human players and on computers themselves. The one person who confounds it, however, is John Littlewood---not the late well-known British player of that name, but the mathematician.
Having to Break the `Law'
Testing for cheating from the game record is non-invasive, inexpensive, and can be used after the fact. It doesn't involve body searches or RF jamming, both of which might break local laws. However, it runs into a serious issue, one directly connected to Littlewood's insight: What if the accused player's correlation with the program is simply a Littlewood miracle? This is the central question for the remainder of this note.
The issue would be unavoidable even if I were not part of a ten-person committee created by the World Chess Federation (FIDE) and the Association of Chess Professionals (ACP) to combat cheating. Opposing players test games with computers all the time, in a scattershot manner, and they have not been shy about alleging machinations based on their observations. I have dealt with even more cases that my data say are clearly crying wolf; three of them are in the public record. The problem occurs when the numbers are not so clear.
My work has been criticized for its putative inability to detect players who might cheat on only a few `critical moves' per game, keeping purposeful deviation within the bounds ascribed to chance. If the numbers gave a reading of ``insignificant'' in such a case, well that would be the end of it. The problem comes in, and Littlewood's Law strikes back, when the reading is some ways beyond the standard ``two-sigma'' threshold of significance.
I will take some sections for details and analogies, but here is the bottom line, and it applies not just to my program but to many uses of evidentiary statistics to compute odds:
The computed odds are not the odds that the player cheated; they are best estimates of how often a policy of applying sanctions on the basis of such statistical results must expect to be in error.
That's the frequency Littlewood is talking about.
Some Details
Based on the rating of a player $latex {P}&fg=000000$, and on whether the positions faced tended to have few or many reasonable choices, my program generates projected probabilities for each move. Given any set $latex {C}&fg=000000$ of distinguished moves, one in each of those positions, my program thus computes both an expected number $latex {m_C}&fg=000000$ of times that $latex {P}&fg=000000$ will choose the move in $latex {C}&fg=000000$, and a standard deviation $latex {\sigma_C}&fg=000000$ for that statistic. The latter is based on viewing each turn as an independent Bernoulli trial over the probabilities generated for each possible move in that position, but includes an empirically-tested adjustment for dependence between consecutive moves that might be parts of a single strategy---a kind of ``sparse dependence.''
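The actual probability model and its dependence adjustment are not spelled out here, but the independent-Bernoulli part of the computation is easy to sketch; the per-move probabilities below are made up purely for illustration.

\begin{verbatim}
import math

# Hypothetical projected probabilities that player P plays the distinguished
# move at each of several turns.
probs = [0.55, 0.72, 0.40, 0.85, 0.63, 0.50, 0.77, 0.58]

# Treating each turn as an independent Bernoulli trial:
m_C = sum(probs)                              # expected number of matches
var_C = sum(p * (1.0 - p) for p in probs)     # Bernoulli variance per turn
sigma_C = math.sqrt(var_C)                    # standard deviation
# (The real model adds an empirically tested adjustment for dependence
#  between consecutive moves, which is omitted here.)

print(f"m_C = {m_C:.2f}, sigma_C = {sigma_C:.2f}")
\end{verbatim}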
Once $latex {m_C}&fg=000000$ and $latex {\sigma_C}&fg=000000$ are in hand, the actual number $latex {p_C}&fg=000000$ of moves in $latex {C}&fg=000000$ that $latex {P}&fg=000000$ played generates a z-score:
$latex \displaystyle z = \frac{p_C - m_C}{\sigma_C},&fg=000000$
which is measured in units of standard deviations or ``sigmas.'' For any value of $latex {z}&fg=000000$, the theory of the normal distribution yields (``one-sided'') odds of a deviation of $latex {z}&fg=000000$ or greater, which one may readily look up in a table or get from an applet. For example, the value $latex {z = 2.326}&fg=000000$
corresponds to odds of 1-in-100.
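Such conversions are easy to check from the standard normal tail; here is a minimal sketch (the 2.50 and 3.10 values will come up later).

\begin{verbatim}
import math

def one_sided_odds(z):
    """Return N such that a deviation of z or more has probability about 1 in N."""
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))   # one-sided normal tail probability
    return 1.0 / tail

for z in (2.326, 2.50, 3.10):
    print(f"z = {z:.3f}  ->  about 1 in {one_sided_odds(z):,.0f}")
# z = 2.326 -> about 1 in 100; z = 2.50 -> about 1 in 161; z = 3.10 -> about 1 in 1,033
\end{verbatim}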
What does this mean? It means that among 100 such independent events, you should expect about one to show this much deviation purely by chance. And if you have $latex {1,000,000X}&fg=000000$ events, then expect $latex {X}&fg=000000$-many `miracles.' How can we interpret this? Eighteen months ago I wrote a page to explain this in terms of golf, in which a hole-in-one is a `minor miracle.' But now I'll reference a simpler game for analogy: Marbles.
Let's Not Lose Our Marbles
Suppose you have a bag of 100 marbles of different grades of brilliance. If you reach into the bag and pick one, and get one that is shinier than the mean by 2.326 sigmas, you can count yourself lucky.
But if you spill them onto a white carpet, and pick one quickly, chances are you'll notice the shiniest one. By Littlewood's Law it's better-than-even that it will be shinier than average by 2.326 sigmas. With 160 marbles, you should count yourself unlucky not to find such a shiny one.
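Those two claims are just the complement rule applied to a 1-in-100 event; a quick check, under the simplifying assumption that the marbles' brightnesses are independent draws:

\begin{verbatim}
# Chance that the shiniest of n marbles exceeds the 2.326-sigma (1-in-100) mark,
# assuming each marble independently has a 1-in-100 chance of being that shiny.
p = 0.01
for n in (100, 160):
    at_least_one = 1.0 - (1.0 - p) ** n
    print(f"n = {n}: {at_least_one:.0%} chance of at least one such marble")
# n = 100: about 63%;  n = 160: about 80%
\end{verbatim}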
If the marbles come from a big factory that fills its bags in a highly uniform manner, you can be pretty sure that your expectations apply to any bag. Or maybe the company makes bags with different ``ratings'' for brightness---then you can suppose a consistent mean for bags of a given rating.
In my analogy, the marbles are players---of various chess Elo ratings---and the bags are tournaments. Round-robin tournaments rarely have more than a dozen players nowadays (sixteen used to be the norm), but Opens have 50, 100, sometimes even 1,000 players. There are also perhaps 100 sizable tournaments going on worldwide in any given week or weekend, especially in the northern summer. About 20--40 of them are prominent enough to make a roundup called The Week In Chess, which England's Mark Crowther has offered single-handedly for free for coming on 20 years. But before we deal with combinations of numbers, let's try a simpler illustration.
A Quiz on the Law
About 25--50 or so of the world's top players are regularly invited to lucrative round-robin ``super-tournaments.'' The rest frequent Open tournaments, usually receiving modest appearance fees as well as the chance to win prize money, and the next several hundred players can also make a decent living by supplementing these competitions with coaching and writing and various other activities. Many do more playing, others more writing or teaching or officiating.
Say you are one of those several hundred, with a career going back 25 years since becoming an internationally ranked player. Now let's run my cheating test on every one of your performances.
Suppose we do not find a single performance on which we would be 99% confident in isolation that you were cheating. Which of the following conclusions is supported materially by this result?
We won't keep you in suspense. The answer is 3. Those earning livable money from Open prizes pretty much have to play more than once a month. Playing once every two months won't cut it, but even that makes 150 tournaments. By Littlewood's Law, from 150 performances we have fairly high likelihood of seeing a 1-in-100 deviation on the plus side---and one on the down side.
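The same complement-rule arithmetic, for a hypothetical career of 150 tested performances:

\begin{verbatim}
# Chance that at least one of 150 honest performances hits the 1-in-100 level,
# on the plus side alone (and likewise, separately, on the minus side).
p, n = 0.01, 150
print(f"{1.0 - (1.0 - p) ** n:.0%}")   # about 78%
\end{verbatim}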
Players and Floods
Now suppose we have a tournament of 100 players, and we select some to test for cheating. We might select the first prize winner, or the top three or all those in a tie for first, but chances are they were among the highest rated players to begin with. Instead, as we glance over the tournament crosstable we may notice a remarkably high Plus Figure, like +250. This means the player had enough wins and draws to meet the expectation of a player rated 250 points higher. The player may still have finished in the bottom half of the tournament with more losses than wins, but if the player was among the lowest rated that could still be a highly ``plus'' performance and even win a so-called rating class prize. Some players are even alleged to lose on purpose in one tournament so that their rating will be below the cutoff for a class prize in the next, a practice called ``sandbagging.''
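For readers unfamiliar with performance ratings, here is a rough sketch of how a Plus Figure like +250 can arise. It uses the standard logistic Elo expectation curve, which only approximates FIDE's official table-based calculation, and the ratings and score are hypothetical.

\begin{verbatim}
import math

def performance_rating(avg_opp_rating, score, games):
    """Logistic-curve approximation to performance rating."""
    p = min(max(score / games, 0.01), 0.99)   # clamp so the inverse stays finite
    return avg_opp_rating + 400.0 * math.log10(p / (1.0 - p))

own_rating = 2050                              # hypothetical lower-rated player
perf = performance_rating(avg_opp_rating=2300, score=4.5, games=9)
print(f"performance ~ {perf:.0f}, plus figure ~ +{perf - own_rating:.0f}")
# performance ~ 2300, plus figure ~ +250: an even score against stronger opposition
\end{verbatim}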
So we test our Plus Player, and we get a z-score of 2.50, meaning odds of 160-to-1 against it happening ``by chance,'' and certainly a significant statistical outcome by civil convention. Then what do we do? The answer is: absolutely nothing. Chances are that player was just the shiny marble.
In any one weekly file of games from TWIC, there are typically 1,000--1,500 players, more in summer. Since many tournaments are split across two weeks of TWIC, this translates to about 1,000 player-performances per week. Now if we hear someone did ``suspiciously well,'' and the test obtains a z-score of 3.10, say, what then? That's 1,000-1 odds, highly unlikely in isolation. However, by Littlewood's Law, one of those 1,000 players was bound to have an unusually fine weekend by the test's measures. Since the measures correlate with the results of games, this is most likely to be the player who caused the buzz.
I've proposed thinking about odds in units of ``weeks of TWIC,'' each worth about 1,000 performances. Thus if the test obtains a z-score of 4.00, for 31,600--1 odds, that's about 7 months of TWIC. A score of 4.50, for just under 300,000--1, is about 6 years of TWIC. And 4.75, the closest round z-score to million-to-one odds, is about 20 years of TWIC. A performance with that high a deviation can be regarded the way insurance companies think of a ``twenty-year flood.'' And 5.00 sigmas, which physicists use as their threshold of confidence in discovery, is a 60-year flood in chess.
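Taking the rough figure of 1,000 tested performances per week of TWIC, the translations quoted above can be reproduced as follows; the per-week count is only an estimate.

\begin{verbatim}
import math

PERFORMANCES_PER_WEEK = 1000   # rough count of player-performances per week of TWIC

def one_in_n(z):
    """One-sided odds: a deviation of z or more happens about once in this many tries."""
    return 1.0 / (0.5 * math.erfc(z / math.sqrt(2.0)))

for z in (4.00, 4.50, 4.75, 5.00):
    n = one_in_n(z)
    weeks = n / PERFORMANCES_PER_WEEK
    print(f"z = {z:.2f}: about 1 in {n:,.0f}  ~  {weeks:.0f} weeks  ~  {weeks / 52:.1f} years of TWIC")
# z = 4.00: ~32 weeks (about 7 months); 4.50: ~6 years; 4.75: ~19 years; 5.00: ~67 years
\end{verbatim}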
It Takes a Miracle
Vetting the accuracy of my model is a separate issue. I have conducted empirical tests of the kind sketched here to calibrate its internal error bars, and I have comparison data from most of the top-level history of chess by which to sanity-check its conclusions. It is also possible that someone will devise a better model. So let's take the solidity of the statistical evidence as given.
I am proposing a 5.00-sigma standard in chess for statistical evidence to rise above the caveats of selection that go with Littlewood's Law. This might be shaded to 4.75, giving exactly Littlewood's quantification of ``miracle,'' something we'd expect to see in chess only once in two decades. Or perhaps to 4.50, which my post ``Thirteen Sigma'' noted is the end-product criterion for ``Six Sigma'' in industry.
In the case that occasioned my open letter to ACP, my test gave results well north of 5.00 for a single tournament. I was actually shocked to see such numbers tumble out, because in every previous case the z-scores were around 3.00 or somewhat beyond. I have also contrasted this case with three other performances considered `miraculous' in chess terms that don't even register significant deviations on my quick filtering test.
Littlewood's Law says a single result of the lower 3.00-ish kind must be ignored, unless there is something completely distinct from game results or move-matching percentages that determines the selection---such as physical or observational evidence of cheating. But when the result is 5.00, from one or possibly a combination of tournaments in a short time span, I think it is beyond the caveats of the `Law' and some action needs to be taken.
The action from a world body need not include a formal finding of cheating---that can be left as a consideration in local-authority action for recovery of prize money, for instance. One contention of my ``Thirteen Sigma'' post is that somewhere between z = 4.50 and z = 13.00 there needs to be a threshold, agreed to by the society of chessplayers as a condition of the privilege of competing, such that exceeding it brings a sanction. Perhaps this can align with the developing understanding of the rules and procedures of blood-level readings in cycling.
Open Problems
What does it take to be able to regard statistical evidence as primary in possible cheating cases, rather than the supporting role reserved when the z-score is under 4.00? What should the threshold be?