{\huge Benford's Law and Baseball}

What distributions follow the knock of opportunity?

\vspace{.3in}

Ted Hill is a Professor Emeritus in the Mathematics Department at Georgia Tech, and has two other affiliations. He graduated from West Point in 1966 where he roomed with General Wesley Clark, then served in Vietnam, and now maintains a website on his academy class. His other Georgia Tech site has interesting and vigorous personal material, and some fascinating mathematics projects with applications. He is arguably the world's expert on Benford's Law, along with Arno Berger of Alberta. The ``law''---or phenomenon---is that many tables of numbers drawn from real-life data are skewed to favor $latex {1,2,3}&fg=000000$ as the leading non-zero digit at the expense of higher ones.

Today I wish to probe the boundaries of this law, and argue for a new case of the law whose explanation seems particularly simple.

Simon Newcomb first noticed in books of logarithms that the pages for numbers beginning with $latex {1,2,3}&fg=000000$ had more human wear-and-tear than the others, and he derived the mathematical formula for such a distribution. I love the ``low-tech'' detection method: look for pages that are worn---we would do that very differently today.

In his 1881 two-page paper, Newcomb computed the frequencies of the first digit in base 10, and also of the second (which can be zero):

\includegraphics[width=2in]{NewcombTable.png}

Frank Benford, however, was the first to observe the phenomenon in large-scale data---his 23-page paper published in 1938 observed 20,229 data points. Curiously they did not include the famous example of the first digits of heights of hills and mountains. The surprise is that regardless of the units of measurement (such as feet or meters) or the numerical base---provided the base is substantially less than the ratio of the largest value to the smallest---$latex {1,2,3}&fg=000000$ occur as leading nonzero digits markedly more often than the others.

Our questions are when do distributions follow this law, and what does it mean when data doesn't?

Derivations and Explanations

The shared insight between Newcomb and Benford is that many data sets are really ratios of two quantities whose exponents are uniformly distributed. The unseen denominator is the choice of units. Represent the ratio in base $latex {b}&fg=000000$ as $latex {b^x/b^y = b^{x-y}}&fg=000000$. Newcomb's own insight was that adding or subtracting an integer in the exponent only shifts the `decimal' point in base $latex {b}&fg=000000$, and does not affect the identity of the leading nonzero digit. Hence only the circular difference of $latex {x-y}&fg=000000$ modulo 1 matters. Provided $latex {x}&fg=000000$ and $latex {y}&fg=000000$ are drawn uniformly from at least a couple of go-rounds of the circle---meaning that the data spreads over a couple of powers of $latex {b}&fg=000000$---the difference $latex {x-y}&fg=000000$ is also nearly uniformly distributed modulo 1, even conditioned on $latex {x > y}&fg=000000$. Thus the distribution is given by

$latex \displaystyle B: [0,1) \rightarrow [1,b):\quad B(z) = b^z;\quad\text{so~} \Pr_B([c,d]) = \log_b(\frac{d}{c}). &fg=000000$
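
As a sanity check on this derivation, here is a minimal simulation sketch in Python (not part of any of the cited papers; the spread of three powers of the base is an arbitrary choice of mine): it draws the exponents $latex {x}&fg=000000$ and $latex {y}&fg=000000$ uniformly, forms $latex {b^{x-y}}&fg=000000$, and tallies leading digits against the formula above.

\begin{verbatim}
import math
import random
from collections import Counter

def leading_digit(value, base=10):
    """Leading nonzero digit of a positive number in the given base."""
    z = math.log(value, base) % 1.0   # shifting the 'decimal' point drops out here
    return int(base ** z)             # base**z lies in [1, base)

def simulate_ratios(trials=100_000, base=10, spread=3, seed=1):
    """Draw exponents x, y uniformly over `spread` powers of the base and
    tally the leading digit of base**(x - y)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        x = rng.uniform(0, spread)
        y = rng.uniform(0, spread)
        counts[leading_digit(base ** (x - y), base)] += 1
    return counts

if __name__ == "__main__":
    counts = simulate_ratios()
    total = sum(counts.values())
    for d in range(1, 10):
        print(f"digit {d}: simulated {counts[d] / total:.3f}"
              f"  predicted {math.log10(1 + 1 / d):.3f}")
\end{verbatim}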

We can identify intervals $latex {[a,a+1)}&fg=000000$ of the range with the leading digit $latex {a}&fg=000000$. If $latex {a+1 \approx b^{1/2}}&fg=000000$, then roughly half of the probability is on the digits $latex {1}&fg=000000$ through $latex {a}&fg=000000$. This is in fact a defining property of Benford's Law:

A data set follows Benford's Law if in any base $latex {b}&fg=000000$ that is sufficiently small relative to the spread of the data, about half of the data points have leading digit between $latex {1}&fg=000000$ and $latex {\sqrt{b}-1}&fg=000000$.

For base 10, with $latex {\sqrt{10} = 3.16\dots}&fg=000000$, this implies a little under half the probability should be on $latex {1}&fg=000000$ and $latex {2}&fg=000000$, and Newcomb's table shows $latex {47.7\%}&fg=000000$.
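
The defining property is equally easy to check numerically. In this small sketch (mine, not from the sources), for each base $latex {b}&fg=000000$ we take the digit $latex {a}&fg=000000$ with $latex {a+1}&fg=000000$ closest to $latex {\sqrt{b}}&fg=000000$ and ask how much probability the continuous distribution $latex {B}&fg=000000$ puts on digits $latex {1}&fg=000000$ through $latex {a}&fg=000000$; base 10 recovers the $latex {47.7\%}&fg=000000$ figure.

\begin{verbatim}
import math

def benford_cdf(a, base):
    """Probability under Benford's Law that the leading digit is between 1 and a."""
    return math.log(a + 1, base)

# For each base, take the digit a with a+1 closest to sqrt(base) and see how
# close digits 1..a come to carrying half of the probability.
for base in (10, 16, 64, 100):
    a = round(math.sqrt(base)) - 1
    print(f"base {base:3d}: digits 1..{a} carry {benford_cdf(a, base):.3f}")
\end{verbatim}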

It follows that Benford's Law is scale-invariant in the sense of units not mattering---the only requirement is values being spread over a couple of powers of the base. This should not be confused with the idea of focusing on subsets of the range, such as 100 to 999 versus 1,000 to 9,999, or of identifying the numerical base with the unit of measurement, as either case can violate the requirement. Scale invariance can be axiomatized so that the above continuous version $latex {B(z)}&fg=000000$ is the only distribution that satisfies it.
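
Scale invariance can likewise be seen directly in simulation. The following sketch (my own; the particular conversion factors are arbitrary stand-ins for changes of units) rescales data drawn from $latex {B}&fg=000000$ by several constants and shows that the leading-digit frequencies barely move.

\begin{verbatim}
import math
import random
from collections import Counter

def leading_digit(value):
    """Leading nonzero digit of a positive number in base 10."""
    return int(10 ** (math.log10(value) % 1.0))

def digit_frequencies(values):
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

rng = random.Random(7)
# Data from the continuous distribution B above, spread over four powers of 10.
data = [10 ** rng.uniform(0, 4) for _ in range(100_000)]

# Rescaling by any unit-conversion factor leaves the leading-digit
# frequencies essentially unchanged.
for scale in (1.0, 0.3048, 2.54, 1609.34):
    print(scale, [f"{f:.3f}" for f in digit_frequencies([scale * v for v in data])])
\end{verbatim}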

Hill's two 1995 papers rigorously proving the base/scale invariance, and deriving Benford's Law for certain processes of selecting a distribution and then choosing from it, have been credited as ``explaining'' the law. This goes also for an extension to certain mixtures of uniform distributions by \'Elise Janvresse and Thierry de la Rue, and a post three years ago by Terry Tao. However, Hill and Berger warned last year that there is

No Simple Explanation In Sight For [the] Mathematical Gem.

The kind of explanation I seek would help in cases where the law almost-but-not-quite applies, to recognize where and why it fails to hold. We start with a crude idea that gets the skew right, but not necessarily the distribution.

An Opportunistic Explanation

The explanation that I first heard, which is also listed first by Wikipedia here, is that Benford's Law results from exponential growth processes. Picture mountains growing as land is pushed up until the process stops. Then we could say the skew aspect of Benford's Law holds ``because''

the opportunity to stop growing at 1,xxx feet always comes before the opportunity to stop growing at 2,xxx feet, which comes before 3,xxx feet, and so on.

Whether the numbers conform depends on how the ``stopping probability'' $latex {p}&fg=000000$ behaves at various times. Assuming independence in all unit time intervals, is $latex {p}&fg=000000$ constant? Or is it lower for smaller values when there is more ``momentum'' of growth? It is beyond my scope to derive conditions on $latex {p}&fg=000000$ here, except to note that (optimal) stopping theory is both Hill's second research area after Benford's Law and one of the grand challenges of Constraint Programming according to a talk by Barry O'Sullivan, which I heard at the AAAI-2012 conference in Toronto this past week. Instead I wish to consider a simple kind of stopping process.
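
To make the opportunity idea concrete, here is a toy simulation sketch for the simplest case, in which $latex {p}&fg=000000$ is constant and growth is by a fixed factor each time step; both are assumptions of mine, not claims about real mountains. The stopped heights come out skewed toward low leading digits, roughly along Benford lines.

\begin{verbatim}
import math
import random
from collections import Counter

def leading_digit(value):
    """Leading nonzero digit of a positive number in base 10."""
    return int(10 ** (math.log10(value) % 1.0))

def grow_until_stopped(rng, p=0.01, growth=1.1, start=100.0):
    """Multiply the height by `growth` each time step; stop with probability p per step."""
    height = start
    while rng.random() > p:
        height *= growth
    return height

rng = random.Random(42)
counts = Counter(leading_digit(grow_until_stopped(rng)) for _ in range(50_000))
total = sum(counts.values())
for d in range(1, 10):
    print(f"digit {d}: simulated {counts[d] / total:.3f}"
          f"  Benford {math.log10(1 + 1 / d):.3f}")
\end{verbatim}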

Take Me Out Of The Ballgame

One of the categories in Benford's original table of statistics is labeled ``Am. League,'' but this is not elaborated in his 1938 paper and I do not know what it refers to. I do know, however, that it cannot mean the statistic that first caught my eye when I suspected a data bug in an online fantasy baseball league I was playing in eight years ago. It concerns whether a baseball pitcher logs $latex {X+1/3}&fg=000000$ or $latex {X+2/3}&fg=000000$ innings, where $latex {X}&fg=000000$ is a whole number.

In the great majority of cases, a starting pitcher pitches a whole number $latex {X}&fg=000000$ of innings, owing to the structure of the game of baseball. For relief pitchers, meaning anyone who enters the game after the starter has been taken out, cases where a reliever enters at the beginning of an inning and pitches exactly that inning form a plurality. However, for both starters and relievers, the split of the remaining cases between $latex {X+1/3}&fg=000000$ and $latex {X+2/3}&fg=000000$ innings might seem to be ``completely random.'' It is not.

After not finding a convenient way to gather the data for these cases online, I went ``low-tech'' myself and scanned by hand the box scores printed by the local Buffalo newspaper from last Monday through today (Sunday). This missed some late West Coast games but gave an unbiased selection of about 100 games. Here is what I found:

Within games there is a strong correlation between a starter leaving after $latex {X+1/3}&fg=000000$ innings and a reliever who is able to finish the inning being credited with $latex {Y+2/3}&fg=000000$, and vice-versa. Still, largely thanks to five instances where two relievers finished an inning with $latex {1/3}&fg=000000$ inning each, plus some ``walkoff wins'' where the losing pitcher had recorded just one out, the $latex {Y+1/3}&fg=000000$ relievers turned a deficit of 10 among the starters into a surplus of 7. A week gives only a small amount of data, but this is enough to be suggestive. My explanation for this skew is:

The opportunity for the manager to take a pitcher out of the game after $latex {X+1/3}&fg=000000$ innings always comes before the opportunity to do so after $latex {X+2/3}&fg=000000$ innings, for all whole-number values of $latex {X}&fg=000000$.

A countervailing factor is that $latex {X+2/3}&fg=000000$ gives the batting team more time to put runners on base and get the starting pitcher ``in trouble,'' so that the manager feels a need to take him out. However, runners on base are a greater problem with one out than with two, and I suspect all such factors are lower-order than the sequential-opportunity explanation.
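
A toy model makes the arithmetic of the skew plain. This is my own sketch, not real baseball data, and the single removal probability $latex {q}&fg=000000$ at each out is an assumption. Because the first opportunity always comes first, $latex {X+1/3}&fg=000000$ occurs with probability $latex {q}&fg=000000$ while $latex {X+2/3}&fg=000000$ occurs only with probability $latex {(1-q)q}&fg=000000$.

\begin{verbatim}
import random
from collections import Counter

def final_fraction(rng, q=0.25):
    """Toy model of the last (partial) inning of a pitcher's outing.

    After each of the first two outs the manager removes the pitcher with
    probability q; if neither opportunity is taken, the pitcher finishes
    the inning and his total is a whole number of innings."""
    if rng.random() < q:      # first opportunity: X + 1/3
        return "X+1/3"
    if rng.random() < q:      # second opportunity: X + 2/3
        return "X+2/3"
    return "whole"

rng = random.Random(2012)
print(Counter(final_fraction(rng) for _ in range(10_000)))
# X+1/3 occurs with probability q, X+2/3 only with (1-q)*q.
\end{verbatim}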

A Different Test, and a Bug

In a different test, on Thursday July 19 I looked up total stats for all pitchers using the ``7 days'' option in my Yahoo! fantasy league. This summed multiple games for some starters and many relievers, but the phenomenon still held: 25-20 in favor of $latex {X+1/3}&fg=000000$ over $latex {X+2/3}&fg=000000$ for starters, 56-45 for relievers.

Over a full season I would expect the effect in cumulative stats to weaken, much as with Benford's Law for digits after the leading one. The effect I saw in August 2004, however, definitely involved one-day stats.

In a fantasy league run for MLB.com by SportingNews.com, I noticed one morning that the reported change in the total of innings pitched by my players was $latex {1/3}&fg=000000$ less than the total shown on my team page for the previous day. Curious, I summed my team's total for every day of the season and found a bigger discrepancy. I found similar effects for two other teams in my twelve-team league. Here is what I believe the explanation is.

The interface displayed one-third of an inning as $latex {.3}&fg=000000$ and two-thirds as $latex {.7}&fg=000000$. Clearly the programming had a routine to display values that way, and perhaps it was even applied to round the daily totals. What I believe is that the daily team totals were being summed as numbers of the form $latex {X.3}&fg=000000$ and $latex {X.7}&fg=000000$ (besides $latex {X.0}&fg=000000$ in whole-number instances) to make the season team total, and that this was then rounded to display as $latex {Y.3}&fg=000000$ or $latex {Y.7}&fg=000000$ or $latex {Y.0}&fg=000000$. Of course this is silly, but it explains what I saw: the Benford-esque plurality $latex {R}&fg=000000$ of $latex {X.3}&fg=000000$ day-totals over $latex {X.7}&fg=000000$ ones would accumulate an error of about $latex {-0.033R}&fg=000000$ in the grand total, and over a matter of weeks $latex {R}&fg=000000$ would grow large enough for rounding or truncation to yield a smaller value.
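
Here is a small sketch of how such a drift would accumulate; the daily mix of fractions is made up, and only the $latex {.3}&fg=000000$/$latex {.7}&fg=000000$ display convention comes from what I saw. Summing the displayed values instead of exact thirds loses about $latex {0.033}&fg=000000$ per excess $latex {.3}&fg=000000$ day, which is the $latex {-0.033R}&fg=000000$ effect described above.

\begin{verbatim}
import random
from fractions import Fraction

DISPLAY = {Fraction(0): 0.0, Fraction(1, 3): 0.3, Fraction(2, 3): 0.7}

rng = random.Random(2004)
exact_total = Fraction(0)
summed_display = 0.0

for day in range(60):                   # a couple of months of daily team totals
    whole = rng.randint(5, 12)
    # made-up skew: more X.3 days than X.7 days, per the plurality R above
    frac = rng.choices(list(DISPLAY), weights=[5, 4, 3])[0]
    exact_total += whole + frac
    summed_display += whole + DISPLAY[frac]   # summing the displayed .3/.7 values

print("exact season total    :", float(exact_total))
print("sum of displayed days :", round(summed_display, 1))
print("accumulated drift     :", round(summed_display - float(exact_total), 2))
\end{verbatim}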

I brought this to the attention of the game's tech-support, and received a reply that acknowledged it as a bug, but said it was limited to the display---that actual team totals used to enforce season-long innings-pitched quotas and compute other stats were not affected. I was not convinced, and thought to check it by the more-arduous task of computing my team's ERA (earned run average) and WHIP (walks plus hits per inning pitched) statistics manually, but I realized the two-place precision by which they were displayed would not be enough to identify the discrepancy.

I considered pressing this further, with dreams of getting a minute on NPR's ``Science Friday'' or somesuch, but I was trying to finish various things before the start of term. I did not find mention of this on an independent forum. Upon realizing that even if I was right, the reality behind the glittering ``MLB.com'' label would probably be no more than some young programmer taking a bad shortcut, I let it drop.

Open Problems

Is this a valid instance of Benford's Law, or of a cruder principle that aligns with it? How far does opportunity-for-stopping go as an explanation?

Was I right about the fantasy-baseball bug?

Does the distribution of evaluations of chess positions given by chess programs, standardly in units of hundredths of a Pawn, follow Benford's Law? I've logged millions of such evaluations, and they seem to follow a distribution flatter than Benford but more skewed than half of a bell curve. This may be thrown off by the fact that $latex {0.00}&fg=000000$ (meaning dead equality or a game immediately drawn) has over ten times as many data points as any other value for the Rybka program, and apparently over twenty times as many for another program called Stockfish. A future post may elaborate on this.
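
For anyone who wants to try this on their own engine logs, here is a generic checker sketch (the evaluations at the end are made up purely to show the calling convention); it drops the over-represented $latex {0.00}&fg=000000$ values and compares the remaining leading digits to Benford.

\begin{verbatim}
import math
from collections import Counter

def leading_digit_centipawns(evaluation):
    """Leading nonzero digit of |evaluation|, with evaluations given in Pawns
    at two-decimal (centipawn) precision.  Returns None for 0.00, which is
    hugely over-represented in engine logs."""
    cp = round(abs(evaluation) * 100)
    return int(str(cp)[0]) if cp else None

def compare_to_benford(evaluations):
    counts = Counter(d for d in map(leading_digit_centipawns, evaluations) if d)
    total = sum(counts.values())
    for d in range(1, 10):
        print(f"digit {d}: observed {counts[d] / total:.3f}"
              f"  Benford {math.log10(1 + 1 / d):.3f}")

# made-up evaluations, in Pawns, purely to show the calling convention
compare_to_benford([0.00, 0.15, -0.23, 0.31, 1.27, -0.08, 0.42, 0.56, -2.10, 0.19])
\end{verbatim}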