Benford’s law and SciPy

Imagine you picked up a dictionary and found that the pages with A’s were dirty and the Z’s were clean. In between there was a gradual transition with the pages becoming cleaner as you progressed through the alphabet. You might conclude that people have been looking up a lot of words that begin with letters near the beginning of the alphabet and not many near the end.

That’s what Simon Newcomb did in 1881, only he was looking at tables of logarithms. He concluded that people were most interested in looking up the logarithms of numbers that began with 1 and progressively less interested in logarithms of numbers beginning with larger digits. This sounds absolutely bizarre, but he was right. The pattern he described has been repeatedly observed and is called Benford’s law. (Benford re-discovered the same principle in 1938, and per Stigler’s law, Newcomb’s observation was named after Benford.)

Benford’s law predicts that for data sets such as collections of physical constants, about 30% of the numbers will begin with 1 down to about 5% starting with 8 or 9. To be precise, it says the leading digit will be d with probability log10(1 + 1/d). For a good explanation of Benford’s law, see TAOCP volume 2.
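
Here’s a quick way to see the predicted probabilities, computing log10(1 + 1/d) for each leading digit:

from math import log10

# Benford's predicted probability of each leading digit 1 through 9
for d in range(1, 10):
    print(d, round(log10(1 + 1/d), 3))

# 1 0.301, 2 0.176, 3 0.125, 4 0.097, 5 0.079, 6 0.067, 7 0.058, 8 0.051, 9 0.046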

A couple of days ago I blogged about using SciPy’s collection of physical constants to look for values that were approximately factorials. Let’s look at that set of constants again and see whether the most significant digits of these constants follow Benford’s law.

Here’s a bar chart comparing the actual number of constants starting with each digit to the results we would expect from Benford’s law.

Here’s the code that was used to create the data for the chart.

from math import log10, floor
from scipy.constants import codata

def most_significant_digit(x):
    e = floor(log10(x))
    return int(x*10**-e)

# count how many constants have each leading digit
count = [0]*10
d = codata.physical_constants
for c in d:
    (value, unit, uncertainty) = d[c]
    x = abs(value)
    count[most_significant_digit(x)] += 1
total = sum(count)

# expected number of each leading digit per Benford's law
benford = [total*log10(1 + 1./i) for i in range(1, 10)]

The chart itself was produced using matplotlib, starting with this sample code.
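
Here’s a minimal sketch of how such a comparison chart could be drawn with matplotlib, using the count and benford lists computed above (an illustration rather than the original sample code):

import numpy as np
import matplotlib.pyplot as plt

digits = np.arange(1, 10)
width = 0.4

# observed counts (index 0 of count is unused) next to Benford's predictions
plt.bar(digits - width/2, count[1:], width, label="Observed")
plt.bar(digits + width/2, benford, width, label="Benford's law")
plt.xticks(digits)
plt.xlabel("Leading digit")
plt.ylabel("Number of constants")
plt.legend()
plt.show()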

The actual counts we see in scipy.constants line up fairly well with the predictions from Benford’s law. The results are much closer to Benford’s prediction than to the uniform distribution that you might have expected before hearing of Benford’s law.

Update: See the next post for an explanation of why factorials also follow Benford’s law.

Software knowledge shelf life

In my experience, software knowledge has a longer useful shelf life in the Unix world than in the Microsoft world. (In this post Unix is a shorthand for Unix and Linux.)

A pro-Microsoft explanation would say that Microsoft is more progressive, always improving their APIs and tools, and that Unix is stagnant.

A pro-Unix explanation would say that Unix got a lot of things right the first time, that it is more stable, and that Microsoft’s technology turn-over is more churn than progress.

Pick your explanation. But for better or worse, change comes slower on the Unix side. And when it comes, it’s less disruptive.

At least that’s how it seems to me. Although I’ve used Windows and Unix, I’ve done different kinds of work on the two platforms. Maybe the pace of change relates more to the task than the operating system. Also, I have more experience with Windows and so perhaps I’m more aware of the changes there. But nearly everything I knew about Unix 20 years ago is still useful, and much of what I knew about Windows 10 years ago is not.

Typesetting “C#” in LaTeX

How do you refer to the C# programming language in LaTeX? Simply typing C# doesn’t work because # is a special character in LaTeX. You could type C\#. That works, but it looks a little odd. The number sign is too big and too low.

What about using a musical sharp sign, i.e. C$\sharp$? That also looks a little odd. Even though the language is pronounced “C sharp,” it’s usually written with a number sign, not a sharp.

Let’s look at recommended ways of typesetting C++ to see whether that helps. The top answer to this question on TeX Stack Exchange is to define a new command as follows:

\newcommand{\CC}{C\nolinebreak\hspace{-.05em}\raisebox{.4ex}{\tiny\bf +}\nolinebreak\hspace{-.10em}\raisebox{.4ex}{\tiny\bf +}}

This does several things. It prevents line breaks between the constituent characters, and it adjusts the plus signs in four ways:

  • Draws them in closer
  • Makes them smaller
  • Raises them
  • Makes them bold

The result is what we’re subconsciously accustomed to seeing in print.

Here’s an analogous command for C#.

\newcommand{\CS}{C\nolinebreak\hspace{-.05em}\raisebox{.6ex}{\tiny\bf \#}}

And here’s the output. The number sign is a little too small.

To make the number sign a little larger, replace \tiny with \scriptsize.
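
Making that substitution, the definition becomes

\newcommand{\CS}{C\nolinebreak\hspace{-.05em}\raisebox{.6ex}{\scriptsize\bf \#}}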

Physical constants and factorials

The previous post mentioned that Avogadro’s constant is approximately 24!. Are there other physical constants that are nearly factorials?

I searched SciPy’s collection of physical constants looking for values that are either nearly factorials or nearly reciprocals of factorials.

The best example is the “classical electron radius” r_e, which is 2.818 × 10⁻¹⁵ m, and 1/17! = 2.811 × 10⁻¹⁵.

Also, the “Hartree-Hertz relationship” E_h/h equals 6.58 × 10¹⁵ and 18! = 6.4 × 10¹⁵. (E_h is the Hartree energy and h is Planck’s constant.)

Here’s the Python code I used to discover these relationships.

from scipy.special import gammaln
from math import log, factorial
from scipy.optimize import brenth
from scipy.constants import codata

def inverse_factorial(x):
    # Find r such that gammaln(r) = log(x)
    # So gamma(r) = x and (r-1)! = x
    r = brenth(lambda t: gammaln(t) - log(x), 1.0, 100.0)
    return r-1

d = codata.physical_constants
for c in d:
    (value, unit, uncertainty) = d[c]
    x = abs(value)
    if x < 1.0: x = 1.0/x
    r = inverse_factorial(x)
    n = int(round(r))
    # Use n > 6 to weed out uninteresting values.
    if abs(r - n) < 0.01 and n > 6:
        fact = factorial(n)
        if abs(value) < 1.0:
            fact = 1.0/fact
        print(c, n, value, fact)

Avogadro’s number

Avogadro’s number NA is the number of atoms in 12 grams of carbon-12. It’s about 6.02 × 10²³. (Update: Avogadro’s number is now exactly NA = 6.02214076 × 10²³ mol⁻¹ by definition. More details here.)

Here are a few fun coincidences with Avogadro’s number.

  1. NA is approximately 24! (i.e., 24 factorial.)
  2. The mass of the earth is approximately 10 NA kilograms.
  3. The number of stars in the observable universe is 0.5 NA.

The first observation comes from here. I forget where I first heard the second. The third comes from Andrew Dalke in the comments below, verified by WolframAlpha.
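
Here’s a quick check of the first observation in Python:

from math import factorial

NA = 6.02214076e23           # Avogadro's number
print(factorial(24))         # 620448401733239439360000, about 6.2e23
print(factorial(24) / NA)    # about 1.03, so 24! is within roughly 3% of NA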

For more constants that approximately equal factorials, see the next post.

Moby Dick and the tautochrone

The tautochrone is a curve such that a ball rolling down the curve takes the same amount of time to reach the bottom, no matter where along the curve it starts. (The name comes from the Greek tauto for same and chrono for time.) It doesn’t sound like such a curve should be possible, because balls starting farther up the curve have farther to travel. However, balls starting higher also have more potential energy, so they travel farther but faster. See the video below for a demonstration.

[The video is entitled “brachistochrone race” rather than “tautochrone race.” The brachistochrone problem is to find the curve of fastest descent. But its solution is the same curve as the tautochrone. So different problems, same solution.]

I first heard of the tautochrone as a differential equation problem to find its equation. But someone could run into it in an American literature class.
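
For the record, the curve is an arc of a cycloid. With a the radius of the generating circle and y measured downward from the starting cusp, one parametrization is x(θ) = a(θ – sin θ), y(θ) = a(1 – cos θ) for 0 ≤ θ ≤ π, and the time for an idealized bead to slide without friction from any point on the arc down to the lowest point works out to π√(a/g), the same for every starting position.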

Clifford Pickover’s new book The Physics Book has a chapter on the tautochrone. (In this book, “chapters” are only two pages: one page of prose and one full-page illustration.) Pickover points out a passage in Moby Dick that discusses a bowl called a try-pot that is shaped like a tautochrone in the radial direction.

[The try-pot] is a place also for profound mathematical meditation. It was in the left hand try-pot of the Pequod, with the soapstone diligently circling round me, that I was first indirectly struck by the remarkable fact, that in geometry all bodies gliding along the cycloid, my soapstone for example, will descend from any point in precisely the same time.

A brief note on Moore’s law

One variation on Moore’s law says that computer performance doubles every 18 months. Another says performance improves by a factor of 100 every decade. The two are equivalent, though the latter sounds more impressive.

Exponential growth doesn’t mesh well with human intuition. Scott Berkun said “Technological progress is overestimated in the short term, underestimated in the long term.” This is probably not limited to technology but true of exponential growth in general.

(Why are the two statements of Moore’s law equivalent? The first says performance grows like exp( log(2) t / 1.5 ) and the second says it grows like exp( log(100) t / 10 ). And log(2) / 1.5 = 0.462 and log(100)/10 = 0.461.)
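
A quick check of that arithmetic in Python:

from math import log

print(log(2) / 1.5)     # roughly 0.462
print(log(100) / 10)    # roughly 0.461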

Related post: Moore’s law and software bloat

The most powerful people are right

From Seth Roberts:

If you ignore data, the answer to every hard question is the same: the most powerful people are right. That way lies stagnation (problems build up unsolved because powerful people prefer the status quo) and collapse (when the problems become overwhelming).

Science is far more political than I had imagined before starting a career in science. Data trumps politics eventually, but it may take a long time.

Rational right triangles

Suppose you have a right triangle and every side has rational length. What can you say about the area? For example, is it possible that such a triangle could have area 1?

It turns out the answer is no, and here’s a proof. Suppose you had a right triangle with sides of length a, b, and c with c being the length of the hypotenuse. And suppose a, b, and c are all rational numbers.

Start with the equation

(a² – b²)² = (a² + b²)² – 4a²b²

Suppose the area of the triangle is 1. Then ab/2 = 1 and so ab = 2. Use this and the Pythagorean theorem to rewrite the right side of the equation above. Now we have

(a² – b²)² = c⁴ – 16

But since 16 = 2⁴, this expresses the difference of two fourth powers as a square, contradicting a theorem of Fermat: in rational numbers, the difference of two nonzero fourth powers cannot be a nonzero square. (The square on the left is nonzero, since a = b together with ab = 2 would force a = √2, which is irrational.)

So a rational right triangle cannot have area 1. Also, it cannot have area r2 for any rational number r. (If it did, you could divide each side by r and have a rational triangle with area 1.)

Now things are about to get a lot deeper.

What values can the area of a rational right triangle take on? Euler called such numbers congruent. Determining easily testable criteria for congruent numbers is an open problem. However, if the Birch and Swinnerton-Dyer conjecture is correct, then an algorithm of Jerrold Tunnell resolves the congruent number problem. (Incidentally, the Clay Mathematics Institute has placed a $1,000,000 bounty on the Birch and Swinnerton-Dyer conjecture.)
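
For example, 6 is a congruent number because the 3-4-5 right triangle has area 6, and 5 is congruent as well, realized by the right triangle with rational sides 3/2, 20/3, and 41/6. Here’s a small check using exact rational arithmetic:

from fractions import Fraction as F

# A rational right triangle with area 5: legs 3/2 and 20/3, hypotenuse 41/6
a, b, c = F(3, 2), F(20, 3), F(41, 6)

assert a**2 + b**2 == c**2   # the sides satisfy the Pythagorean theorem exactly
assert a * b / 2 == 5        # and the area is exactly 5
print("5 is a congruent number")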

What if instead of limiting the problem to rational right triangles we considered all triangles with rational sides? See this paper for such a generalization.