Thirteen Sigma

Monitoring fabrication in industry and chess

Bill Smith joined Motorola as a quality control engineer in 1986. He coined the term Six Sigma to express, in technical terms, a goal of vastly reducing the fault rate of manufactured components. Part of the task was improving not only the monitoring of quality but also the resolution of testing devices and statistical tools, so that they could make reliable projections in units of faults per million rather than per thousand. The resulting empowerment of Motorola's engineers created such a verifiable improvement that Motorola received the Malcolm Baldrige National Quality Award in 1988.

Today I want to talk about the meaning of high-sigma confidence in areas where the results may not be verifiable.

``Six Sigma'' refers of course to the normal distribution curve, whose major properties were established by Carl Friedrich Gauss. Gauss and others discovered that deviations in scientific measurements followed this distribution, and the Central Limit Theorem provided an explanation of its universality. Thus magnitudes of deviations of many kinds can be expressed as multiples of the standard deviation $latex {\sigma}&fg=000000$ of this distribution, which then estimates the frequency of deviations that size or larger. The goal in manufacturing is to make the process so reliable that its $latex {\sigma}&fg=000000$ is below $latex {1/6}&fg=000000$ of the magnitude of deviation that would cause a component to fail at point of creation. When only one side of deviations matters, this puts the failure rate below the tail-error function value $latex {Q(6)}&fg=000000$, which is just under one part per billion. By the end of assembly the tolerance is relaxed to $latex {Q(4.5)}&fg=000000$---the convention allows the long-run process mean to drift by $latex {1.5\sigma}&fg=000000$---so it is really ``Four Point Five Sigma'' that sets the end-product goal of fewer than 3.4 failures per million.
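As a quick check on these tail values, here is a minimal Python sketch (not part of any Six-Sigma toolkit) that evaluates $latex {Q(x) = 1 - \Phi(x)}&fg=000000$ using the standard library's complementary error function:

```python
import math

def q(x: float) -> float:
    """Upper tail Q(x) = 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

print(f"Q(6.0) = {q(6.0):.3e}")   # ~9.87e-10: just under one part per billion
print(f"Q(4.5) = {q(4.5):.3e}")   # ~3.40e-06: about 3.4 parts per million
```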

Six-Sigma programs spread quickly, and have evolved a martial-arts mythos. Six-Sigma organizations award officially-certified Green Belts, Yellow Belts, Black Belts, Brown Belts, and Master Black Belts. They also have a ``Champion'' designation. I wonder if they are given according to how many Sigmas one achieves, 8 being greater than 7, which is greater than the basic 6 (or rather, 4.5). If so then I should apply, because last week I achieved a whopping 13 Sigmas of confidence from my own software process (or rather, 11.3).

Chess Cheating Developments

Last month I was named to a 10-person joint commission of the World Chess Federation (FIDE) and the Association of Chess Professionals (ACP) to combat cheating with computers in human chess events. Discussions have gone into full swing this month, working toward drafting concrete proposals at the FIDE General Assembly in Tallinn, Estonia, the first week of October. I am on the committee because my statistical model of human decision-making at chess answers a need voiced by many commentators, including British ``dean of chess'' Leonard Barden as quoted here.

I have, however, been even busier with a welter of actual cases, reporting on four to the full committee on Thursday. One concerned accusations made in public last week by Uzbek grandmaster Anton Filippov about the second-place finisher in a World Cup regional qualifier he won in Kyrgyzstan last month; my results do not support his allegations. Our committee is equally concerned about due-diligence requirements for complaints and about curbing careless allegations, such as two made against Austrian players in May's European Individual Championship. A second case connects to our deliberations on the highly sensitive matter of searching players, as was done also to Borislav Ivanov during the Zadar Open tournament last December. A third is a private case in which I find similar odds as with Ivanov, but the fourth concerns the fixing of an entire tournament, and I report it here.

Add to this a teen caught consulting an Android chess app in a toilet cubicle in April and a 12-year-old caught reading his phone in June, plus some cases I've heard of only second-hand, and it is all scary and sad. It is also highly stressful having my statistics be the only `regular' evidence in several current cases---all of them cases in which other players made accusations based on unscientific testing before my work came on the scene. Previously, as with the case of Sébastien Feller (which ended for truth purposes with an accomplice's confession last year), my results supported clear physical or observational evidence. But these cases show deviations beyond the pale of selection-effect caveats, while the following story is on another plane.

Unquiet Flows the Don

Over all my playing years I've heard nonspecific rumors of rigged tournaments. Besides prizes and qualifying spots for championship competitions, a motive can be achieving a so-called title norm. The titles of FIDE Master (FM), International Master (IM), and Grandmaster (GM) are FIDE's green, brown, and black belts, and to earn them one must score a designated number of points according to the strength category of the tournament. I scored two IM norms in early 1977, but they covered only 23 of the 24 required total games, and achieving my third norm took until 1980. The higher titles bring financial benefit along with prestige. However, until now my results on the few specific rumors had been inconclusive.

The Don Cup 2010 International was held three years ago in Azov, Russia, as a 12-player round-robin. The average Elo rating of 2395 made it a ``Category 6'' event, with 7 points from 11 games needed for the IM norm and 8.5 for the GM norm. It was prominent enough to have its 66 games published in the weekly TWIC roundup, and they are also downloadable from FIDE's own website. Half the field scored 7 or higher. Two tailenders lost every game except for a draw with each other and one other draw, while a third player beat only those two, made one other draw, and lost his remaining eight games.

My informant suspected various kinds of ``sandbagging'': throwing games in the current event, or carrying an artificially inflated Elo rating from previous fixed events so as to raise the tournament's category. He noted that some of the tailenders now have ratings 300 points below what they were then. Hence I thought to test for deviations downward. I first took the 21 games involving the bottom two, with their 19 losses, and ran the procedure for computing their ``Intrinsic Performance Rating'' (IPR), which is detailed in a new paper whose final version will be presented at the IEEE CIG 2013 conference next month. I wondered whether getting significantly high error with an IPR under 2000 would really constitute evidence of ``unreasonably poor'' play, but even the oddly positive results of my preliminary ``quick test'' did not prepare me for the enormity of the printout of the full test:

IPR = 2925.

When I included the moves made by their opponents in the 21 games, my program gave 3008. This is well above the ratings of the strongest human players, but in the range typical for computer programs before Rybka 3 (my mainstay) emerged in 2008. Moreover my program gave about $latex {4.5\sigma}&fg=000000$ confidence that players with their 2300 ratings would not show so many agreements with Rybka 3.
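For readers curious what such a confidence figure means mechanically, here is a simplified sketch of a one-sided test on engine-agreement counts. It is only an illustration under assumed numbers, not my actual model, which projects a separate matching probability for every position rather than one flat rate:

```python
import math

def agreement_z(n_moves: int, n_matches: int, p_expected: float) -> float:
    """One-sided z-score for n_matches agreements with the engine over
    n_moves analyzed moves, versus an assumed expected rate p_expected.
    Simplified binomial model; the real test weights each move separately."""
    mean = n_moves * p_expected
    sd = math.sqrt(n_moves * p_expected * (1.0 - p_expected))
    return (n_matches - mean) / sd

# Purely illustrative numbers, not the tournament's actual counts:
print(f"z = {agreement_z(n_moves=500, n_matches=330, p_expected=0.55):.2f}")
```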

That was from the losers. I wondered what the winners' games would look like, so I took the 3 days needed to run all my cores on the other 45 games.

Sigmas Amok

Running all 66 games created a sample of almost 4,000 analyzed moves---after excluding turns 1--8 of any game, so-called ``repetition moves,'' and positions where one side has a crushing advantage. Most cases involving single players have comprised 9 games totaling about 250 analyzed moves, barely one-fifth of the sample size recommended for a reliable poll. This sample was effectively 132 games, since it covered both sides of each game.

Hence the baseline $latex {\sigma}&fg=000000$ value was only about $latex {\sqrt{9/132} \approx 1/4}&fg=000000$ the size I usually get. This lent extra heft to the 2880 IPR for the whole tournament, higher than for any human tournament I've recorded except 2904 for the 4-player Bilbao Grand Slam Final in 2010. When I took out the 6th and 7th place finishers, the IPR jumped to 2997. This is despite some games having blunders and ending before move 20, while others drag on through many moves, discarded by my analyzer, where most humans would have given up long ago.
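The scaling factor itself is simple arithmetic: the standard error of a mean shrinks with the square root of the sample size, so relative to a typical 9-game case the 132 game-sides give roughly:

```python
import math

# Standard error of a mean scales as 1/sqrt(n), so relative to a typical
# 9-game case the 132 game-sides shrink the baseline sigma by about:
print(f"sqrt(9/132) = {math.sqrt(9 / 132):.2f}")   # ~0.26, i.e. roughly 1/4
```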

The IPR does not come with a formal statement of unlikelihood, so I ran my Rybka-agreement test for that purpose. My program last Saturday printed the $latex {\sigma}&fg=000000$ multiplier (which for the normal distribution is called a {\it z}-score) needed for 2400-rated players to produce this much computer concordance:

$latex {z = 13.0011}&fg=000000$.

The last two digits are not significant---they owe to my global use of a 4-place C++ format specifier---but they show that the ``13'' is not rounded up. For reasons described earlier on this blog I divide by 1.15 to report an ``adjusted z-score,'' which allows for lack of full independence between moves and other modeling error. This yields the aforementioned 11.3. But I've tested that policy only for $latex {z \leq 4}&fg=000000$; beyond that I have no idea except thinking that dividing $latex {z}&fg=000000$ by a fixed factor should be mathematically conservative.

There it is: $latex {13\sigma}&fg=000000$ internal confidence in a fabrication process---here one that was used to manufacture games that were not actually played. The corresponding tail probability is about $latex {6.15 \times 10^{-39}}&fg=000000$, meaning odds of about

1-in-163,000,000,000,000,000,000,000,000,000,000,000,000.
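The conversion from the printed z-score to these odds, and the 1.15 adjustment mentioned above, can be checked with the same tail function as before; a minimal sketch:

```python
import math

def q(x: float) -> float:
    """Upper tail Q(x) = 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

z = 13.0011
print(f"Q(z)       ~ {q(z):.3e}")        # ~6.1e-39
print(f"1-in odds  ~ {1.0 / q(z):.3e}")  # ~1.6e38, i.e. 163 followed by 36 zeros
print(f"adjusted z = {z / 1.15:.1f}")    # ~11.3 after dividing by 1.15
```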

I don't know whether any physics experiment for a yes/no predicate has ever claimed $latex {13\sigma}&fg=000000$ confidence---for comparison, $latex {5\sigma}&fg=000000$ sufficed for the Higgs boson. However, this still raised for me a question I have understandably been posed on the anti-cheating committee:

Is it a proof?

And here is the difference from Six-Sigma: an industrial process can be verified by later automated testing of the millions of items, but a one-shot predicate often cannot be.

It Shines Like Truth

The German word for probability, Wahrscheinlichkeit, has the great feature of literally meaning ``the quality of shining like truth.'' The root of our own word, by contrast, is the Latin proba, meaning ``test'' or ``proof.'' Truth or proof, can it be either?

In this case I did not have to wait long for more-than-probability. Another member of our committee noticed by searching his million-game database that:

Six of the sixty-six games are move-by-move identical with games played in the 2008 World Computer Chess Championship.

For example, three games given as won by one player are identical with Rybka's 28-move win over the program Jonny and two losses in 50 and 44 moves by the program Falcon to Sjeng and HIARCS, except one move is missing from the last. One of his victims has three lost games, while another player has two wins and another two losses. Indeed the six games are curiously close to an all-play-all cluster.

I verified this against my own collection of over 11,000 major computer-played games, tolerating differences of up to 8 moves, and was surprised to find just the same six identities, no more. So where do the other 60 games come from? My program's confidence in their computer origin is no less, but perhaps someone actually took the trouble to generate them fresh by playing two chess programs against each other?
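For the curious, here is a minimal sketch of the kind of fuzzy matching involved. It is not my actual tool; it assumes games are stored simply as lists of SAN moves and flags database games that differ from a target in at most a given number of moves:

```python
def move_difference(game_a: list[str], game_b: list[str]) -> int:
    """Count positions where two move lists disagree, plus any length mismatch
    (a crude positional comparison, not a full edit distance)."""
    diffs = sum(1 for a, b in zip(game_a, game_b) if a != b)
    return diffs + abs(len(game_a) - len(game_b))

def near_matches(target: list[str], database: dict[str, list[str]], tol: int = 8) -> list[str]:
    """Return IDs of database games differing from the target by at most tol moves."""
    return [game_id for game_id, moves in database.items()
            if move_difference(target, moves) <= tol]

# Hypothetical usage with stand-in data:
db = {"WCCC-2008-Rybka-Jonny": ["e4", "c5", "Nf3", "d6", "d4", "cxd4"]}
print(near_matches(["e4", "c5", "Nf3", "e6", "d4", "cxd4"], db))
```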

I am expanding the search to match my database of over 200,000 human games per year against the 11,000 computer games, but each year is taking a day. A trial partial search of 2012 turned up a game in a junior tournament identical to the 38-move draw between Garry Kasparov and IBM's Deep Blue in game 3 of their first match in 1996, but it appears to be nothing more than a children's joke.

Open Problems

Six identical games may amount to six smoking-gunshots, but why don't six sigmas, or thirteen?