Benford’s law and SciPy

Imagine you picked up a dictionary and found that the pages with A’s were dirty and the Z’s were clean. In between there was a gradual transition with the pages becoming cleaner as you progressed through the alphabet. You might conclude that people have been looking up a lot of words that begin with letters near the beginning of the alphabet and not many near the end.

That’s what Simon Newcomb did in 1881, only he was looking at tables of logarithms. He concluded that people were most interested in looking up the logarithms of numbers that began with 1 and progressively less interested in logarithms of numbers beginning with larger digits. This sounds absolutely bizarre, but he was right. The pattern he described has been repeatedly observed and is called Benford’s law. (Benford re-discovered the same principle in 1938, and per Stigler’s law, Newcomb’s observation was named after Benford.)

Benford’s law predicts that for data sets such as collections of physical constants, about 30% of the numbers will begin with 1, falling off to about 5% beginning with 8 or 9. To be precise, it says the leading digit will be d with probability log10(1 + 1/d). For a good explanation of Benford’s law, see TAOCP volume 2.
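
For example, the predicted proportions are easy to tabulate (a small illustration, not part of the original post):

from math import log10

# Benford's law: P(leading digit = d) = log10(1 + 1/d)
for d in range(1, 10):
    print("%d: %.1f%%" % (d, 100*log10(1 + 1.0/d)))
# prints 1: 30.1%, 2: 17.6%, ..., 8: 5.1%, 9: 4.6%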

A couple days ago I blogged about using SciPy’s collection of physical constants to look for values that were approximately factorials. Let’s look at that set of constants again and see whether the most significant digits of these constants follow Benford’s law.

Here’s a bar chart comparing the actual number of constants starting with each digit to the results we would expect from Benford’s law.

Here’s the code that was used to create the data for the chart.

from math import log10, floor
from scipy.constants import codata

def most_significant_digit(x):
    e = floor(log10(x))
    return int(x*10**-e)

# count how many constants have each leading digit
count = [0]*10
d = codata.physical_constants
for c in d:
    (value, unit, uncertainty) = d[c]
    x = abs(value)
    count[ most_significant_digit(x) ] += 1
total = sum(count)

# expected number of each leading digit per Benford's law
benford = [total*log10(1 + 1./i) for i in range(1, 10)]

The chart itself was produced using matplotlib, starting with this sample code.
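
That sample code is no longer linked here, but a minimal sketch of such a grouped bar chart, assuming the count and benford lists from the code above, might look like this:

import matplotlib.pyplot as plt
from numpy import arange

digits = arange(1, 10)
plt.bar(digits - 0.2, count[1:], width=0.4, align="center", label="actual")
plt.bar(digits + 0.2, benford, width=0.4, align="center", label="Benford")
plt.xlabel("leading digit")
plt.ylabel("number of constants")
plt.legend()
plt.show()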

The actual counts we see in scipy.constants line up fairly well with the predictions from Benford’s law. The results are much closer to Benford’s prediction than to the uniform distribution that you might have expected before hearing of Benford’s law.
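
One rough way to quantify “fairly well,” not done in the original post, is a chi-square goodness-of-fit test against the Benford frequencies:

from scipy.stats import chisquare

# compare observed leading-digit counts to the Benford expected counts
statistic, p_value = chisquare(count[1:], f_exp=benford)
print(statistic, p_value)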

Update: See the next post for an explanation of why factorials also follow Benford’s law.


Physical constants and factorials

The previous post mentioned that Avogadro’s constant is approximately 24!. Are there other physical constants that are nearly factorials?

I searched SciPy’s collection of physical constants looking for values that are either nearly factorials or nearly reciprocals of factorials.

The best example is the “classical electron radius” rₑ which is 2.818 × 10⁻¹⁵ m and 1/17! = 2.811 × 10⁻¹⁵.

Also, the “Hartree-Hertz relationship” Eₕ/h equals 6.58 × 10¹⁵ and 18! = 6.4 × 10¹⁵. (Eₕ is the Hartree energy and h is Planck’s constant.)
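
Both approximations are easy to check directly (a quick illustration, not from the original post):

from math import factorial

print(1.0/factorial(17))      # 2.811e-15, vs. the classical electron radius 2.818e-15 m
print(float(factorial(18)))   # 6.40e15, vs. the hartree-hertz relationship 6.58e15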

Here’s the Python code I used to discover these relationships.

from scipy.special import gammaln
from math import log, factorial
from scipy.optimize import brenth
from scipy.constants import codata

def inverse_factorial(x):
    # Find r such that gammaln(r) = log(x)
    # So gamma(r) = x and (r-1)! = x
    r = brenth(lambda t: gammaln(t) - log(x), 1.0, 100.0)
    return r-1

d = codata.physical_constants
for c in d:
    (value, unit, uncertainty) = d[c]
    x = abs(value)
    # search on the reciprocal for constants less than 1
    if x < 1.0: x = 1.0/x
    r = inverse_factorial(x)
    n = int(round(r))
    # Use n > 6 to weed out uninteresting values.
    if abs(r - n) < 0.01 and n > 6:
        fact = factorial(n)
        if value < 1.0:
            fact = 1.0/fact
        print c, n, value, fact
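
As a quick sanity check, continuing from the code above, feeding inverse_factorial an exact factorial should return something very close to an integer:

print(inverse_factorial(factorial(17)))   # very nearly 17.0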

Python for high performance computing

William Scullin’s talk from PyCon 2011: Python for high performance computing.

At least in our shop [Argonne National Laboratory] we have three accepted languages for scientific computing. In this order they are C/C++, Fortran in all its dialects, and Python. You’ll notice the absolute and total lack of Ruby, Perl, Java.

If you’re interested in Python and HPC, check out SciPyTip.

Ruby, Python, and Science

David Jacobs has written a long blog post Ruby is beautiful (but I’m moving to Python). [Update: link no longer available.] Here’s my summary.

Ruby is much better than Java, but the Ruby community is too focused on web development and the language has no scientific library. Python has a lot of the same advantages as Ruby, is used for more than web programming, and has SciPy.

Update: There is now a fledgling SciRuby project.


Bug in SciPy’s erf function

Last night I produced the plot below and was very surprised at the jagged spike. I knew the curve should be smooth and strictly increasing.

My first thought was that there must be a numerical accuracy problem in my code, but it turns out there’s a bug in SciPy version 0.8.0b1. I started to report it, but I saw there were similar bug reports and one such report was marked as closed, so presumably the fix will appear in the next release.

The problem is that SciPy’s erf function is inaccurate for arguments with imaginary part near 5.8. For example, Mathematica computes erf(1.0 + 5.7i) as -4.5717×10¹² + 1.04767×10¹² i. SciPy computes the same value as -4.4370×10¹² + 1.3652×10¹² i. The imaginary component is off by about 30%.
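
For what it’s worth, the mpmath library, if you happen to have it installed, also evaluates erf for complex arguments and makes a handy independent cross-check:

from mpmath import mp, erf as mp_erf

mp.dps = 20                    # work to 20 significant digits
print(mp_erf(1.0 + 5.7j))      # compare against the two values quoted above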

Here is the code that produced the plot.

from scipy.special import erf
from numpy import linspace, exp, sqrt
import matplotlib.pyplot as plt

def g(y):
    z = (1 + 1j*y) / sqrt(2)
    temp = exp(z*z)*(1 - erf(z))
    u, v = temp.real, temp.imag
    return -v / u

x = linspace(0, 10, 101)
plt.plot(x, g(x))
plt.show()

Replacing Mathematica with Python

Everything I do regularly in Mathematica can be done in Python. Even though Mathematica has a mind-boggling amount of functionality, I only use a tiny proportion of it. I skimmed through some of my Mathematica files to see what functions I use and then looked for Python counterparts. I found I use less of Mathematica than I imagined.

The core mathematical functions I need are in SciPy. The plotting features are in matplotlib. The SymPy library appears to have the symbolic functionality I need, though I’m not as sure about this one.

I don’t have much experience with the Python libraries listed above. I haven’t used SymPy at all; I’ve only browsed its web site. Maybe I’ll find I’d rather work in Mathematica, particularly when I’m just trying out ideas. But I want to experiment with using Python for more tasks.

As I’ve blogged about before, I’d like to consolidate my tools. I started using Emacs again because I was frustrated with using a different editor for every kind of file. One of the things I find promising about Python is that I may be able to do more in Python and reduce the number of programming languages I use regularly.

Related posts:

Languages that are easy to pick back up
Where the Unix philosophy breaks down

SciPy and NumPy for .NET

Travis Oliphant announced this morning at the SciPy 2010 conference that Microsoft is partnering with Enthought to produce a version of NumPy and SciPy for .NET. NumPy and SciPy are Python libraries for scientific computing. Oliphant is the president of Enthought and the original developer of NumPy.

It is possible to call NumPy and SciPy from IronPython now by using Ironclad. However, going through Ironclad can be inefficient. The new libraries will enable efficient access to NumPy and SciPy from .NET languages and in particular from IronPython.

Here is the official press release from Enthought. [Update: press release no longer available.]

 

Using py2exe with SciPy

py2exe is a program that takes Python code and produces a Windows executable that can run on computers that do not have Python installed. My focus here is in using py2exe on Python code that depends on SciPy.

py2exe is itself a Python program, and its latest version is built for Python 2.6. The code I want it to compile is written with Python 2.5 because the latest version of SciPy depends on Python 2.5. How do I tell py2exe that my code uses an earlier version of Python?

It took me a while to realize this, but py2exe has two version numbers: one for the version of py2exe and one for the version of Python. The key was to download and install py2exe-0.6.9.win32-py2.5.exe. This is the latest version of py2exe (version 0.6.9) for an earlier version of Python (version 2.5). The py2exe software runs in the same version of Python as the code it is compiling.

I tested py2exe on a script helloscipy.py as follows:

from scipy.special import gamma
print "Gamma(1/2) = ", gamma(0.5)

The instructions given in the py2exe tutorial don’t quite work because of the dependence on SciPy, but the site gives an example of how to modify the setup.py file to specify the dependence on SciPy. Here’s where things get a little murky. The instructions don’t say exactly how to modify the setup.py file, but they give strong hints. I know my code depends on scipy.special, but I don’t know what further dependencies I might have. Here’s the setup file I used.

from distutils.core import setup
import py2exe

excludes = []
includes = ["scipy.special"]

opts = {
    "py2exe": {
        "includes":includes,
        "excludes":excludes
    }
}

setup(console=['helloscipy.py'], options=opts)

This worked. The output listed 94 additional SciPy dependencies that I might need to include, some of which were clearly not needed. I was pretty sure, for example, that my program did not need email.Utils. Apparently I didn’t need any of these because the code worked fine.

It’s not clear just what files need to be distributed along with the .exe file that py2exe produces. py2exe creates a dist directory with the .exe file as well as other files that you might need, primarily .dll and .pyd files. Many of these were obviously unnecessary. I knew, for example, that my little command line program did not depend on Tk graphics. I deleted those files and the code ran as expected.


Python code for computing distribution parameters from percentiles

A few days ago I wrote a post on finding parameters so that a probability distribution satisfies two percentile conditions. Since then I’ve written Python code to carry out the calculations described in that article and the accompanying technical report.

The article is Finding probability distribution parameters from percentiles posted on CodeProject. The article comes with Python source code and some commentary. The article shows how SciPy and the functools module make it possible for the code to be very succinct.


Probability distribution parameterizations in SciPy

Parameterizations are the bane of statistical software. One of the most common errors is to assume that one software package uses the same parameterization as another package. For example, some packages specify the exponential distribution in terms of the mean but others use the rate.

Python’s SciPy library has a somewhat unusual approach to parameterization with some advantages. SciPy makes every continuous distribution a location-scale family, even those distributions that typically do not have a location or scale parameter. This eliminates, for example, the question of whether an exponential distribution is parameterized by its mean or its rate. There is no mean or rate parameter per se. But there is a scale parameter, which happens to also be the mean.

Some methods on distribution classes have unusual names. For example, the inverse CDF, often called the quantile function, is ppf for “percent point function.” The complementary CDF, or CCDF, is called sf for “survival function.” (Survival function is not an unusual name, though my preference would have been ccdf since that would make the API more symmetric.)
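
Here’s a small illustration of these conventions, including the scale-equals-mean point from the previous paragraph:

from scipy.stats import norm, expon

print(norm.ppf(0.975))          # inverse CDF: about 1.96
print(norm.sf(1.96))            # complementary CDF: about 0.025
print(expon(scale=5).mean())    # the scale parameter of expon is its mean: 5.0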

Discrete distributions in SciPy do not have a scale parameter. Also, instead of a pdf method, discrete distributions have a pmf method: continuous distributions have a probability density function while discrete distributions have a probability mass function.

One surprise with SciPy distributions is that the SciPy implementation of the lognormal distribution does not correspond to the definition I’m more familiar with unless the location is 0. In order to be consistent with other continuous distributions, SciPy shifts the PDF argument x whereas I believe it is more common to shift log(x). This isn’t just a difference in parameterization. It actually amounts to different distributions.
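
Concretely, with the location left at 0, SciPy’s lognorm with shape parameter s = σ and scale = exp(μ) matches the more familiar definition in which log(x) is normal with mean μ and standard deviation σ. A small sketch of the correspondence:

from scipy.stats import lognorm
from numpy import exp, log, sqrt, pi

mu, sigma, x = 1.5, 0.75, 2.0

# SciPy's parameterization: shape s = sigma, scale = exp(mu), loc = 0
scipy_pdf = lognorm.pdf(x, sigma, loc=0, scale=exp(mu))

# textbook lognormal density where log(x) ~ N(mu, sigma^2)
textbook_pdf = exp(-(log(x) - mu)**2 / (2*sigma**2)) / (x * sigma * sqrt(2*pi))

print(scipy_pdf)      # the two values agree
print(textbook_pdf)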

For more details, see these notes on distributions in SciPy. See also these notes on distributions in R and in Mathematica for comparison.


IEEE floating point arithmetic in Python

Sometimes a number is not a number. Numeric data types represent real numbers in a computer fairly well most of the time, but sometimes the abstraction leaks. The sum of two numeric types is always a numeric type, but the result might be a special bit pattern that says overflow occurred. Similarly, the ratio of two numeric types is a numeric type, but that type might be a special type that says the result is not a number.

The IEEE 754 standard dictates how floating point numbers work. I’ve talked about IEEE exceptions in C++ before. This post is the Python counterpart. Python’s floating point types are implemented in terms of C’s double type  and so the C++ notes describe what’s going on at a low level. However, Python creates a higher level abstraction for floating point numbers. (Python also has arbitrary precision integers, which we will discuss at the end of this post.)

There are two kinds of exceptional floating point values: infinities and NaNs. Infinite values are represented by inf and can be positive or negative. A NaN, not a number, is represented by nan. Let x = 10²⁰⁰. Then x² will overflow because 10⁴⁰⁰ is too big to fit inside a C double. (To understand just why, see Anatomy of a floating point number.) In the following code, y will contain a positive infinity.

x = 1e200; y = x*x

If you’re running Python 3.0 and you print y, you’ll see inf. If you’re running an earlier version of Python, the result may depend on your operating system. On Windows, you’ll see 1.#INF but on Linux you’ll see inf. Now keep the previous value of y and run the following code.

z = y; z /= y

Since z = y/y, you might think z should be 1. But since y was infinite, it doesn’t work that way. There’s no meaningful way to assign a numeric value to the ratio of infinite values and so z contains a NaN. (You’d have to know “how they got there” so you could take limits.) So if you print z you’d see nan or 1.#IND depending on your version of Python and your operating system.

The way you test for inf and nan values depends on your version of Python. In Python 3.0, you can use the functions math.isinf and math.isnan respectively. Earlier versions of Python do not have these functions. However, the SciPy library has corresponding functions scipy.isinf and scipy.isnan.

What if you want to deliberately create an inf or a nan? In Python 3.0, you can use float('inf') or float('nan'). In earlier versions of Python you can use scipy.inf and scipy.nan if you have SciPy installed.

IronPython does not yet support Python 3.0, nor does it support SciPy directly. However, you can use SciPy with IronPython by using Ironclad from Resolver Systems. If you don’t need a general numerical library but just want functions like isinf and isnan you can create your own.


def isnan(x): return type(x) is float and x != x
def isinf(x): inf = 1e5000; return x == inf or x == -inf

The isnan function above looks odd. Why would x != x ever be true? According to the IEEE standard, NaNs don’t equal anything, even each other. (See comments on the function IsFinite here for more explanation.) The isinf function is really a dirty hack but it works.
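
A quick demonstration of both functions, building on the overflow example earlier in the post:

y = 1e200 * 1e200        # overflows to positive infinity
z = y / y                # inf / inf gives a NaN

print(isinf(y))          # True
print(isnan(z))          # True
print(z == z)            # False: a NaN does not even equal itself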

To wrap things up, we should talk a little about integers in Python. Although Python floating point numbers are essentially C floating point numbers, Python integers are not C integers. Python integers have arbitrary precision, and so we can sometimes avoid problems with overflow by working with integers. For example, if we had defined x as 10**200 in the example above, x would be an integer and so would y = x*x, and y would not overflow; a Python integer can hold 10⁴⁰⁰ with no problem. We’re OK as long as we keep producing integer results, but we could run into trouble if we do anything that produces a non-integer result. For example,

x = 10**200; y = (x + 0.5)*x

would cause y to be inf, and

x = 10**200; y = x*x + 0.5

would throw an OverflowError exception.

Related posts:

Floating point numbers are a leaky abstraction
Anatomy of a floating point number
Overflow and loss of precision

Probability distributions in SciPy

Here are some notes on how to work with probability distributions using the SciPy numerical library for Python.

Functions related to probability distributions are located in scipy.stats. The general pattern is

scipy.stats.<distribution family>.<function>

There are 81 supported continuous distribution families and 12 discrete distribution families. Some distributions have obvious names: gamma, cauchy, t, f, etc. The only possible surprise is that all distributions begin with a lower-case letter, even those corresponding to a proper name (e.g. Cauchy). Other distribution names are less obvious: expon for the exponential, chi2 for chi-squared distribution, etc.

Each distribution supports several functions. The density and cumulative distribution functions are pdf and cdf respectively. (Discrete distributions use pmf rather than pdf.) One surprise here is that the inverse CDF function is called ppf for “percent point function.” I’d never heard that terminology and would have expected something like “quantile.”

Example: scipy.stats.beta.cdf(0.1, 2, 3) evaluates the CDF of a beta(2, 3) random variable at 0.1.

Random values are generated using rvs which takes an optional size argument. The size is set to 1 by default.

Example: scipy.stats.norm.rvs(2, 3) generates a random sample from a normal (Gaussian) random variable with mean 2 and standard deviation 3. The function call scipy.stats.norm.rvs(2, 3, size = 10) returns an array of 10 samples from the same distribution.

The command line help() facility does not document the distribution parameterizations, but the external documentation does. Most distributions are parameterized in terms of location and scale. This means, for example, that the exponential distribution is parameterized in terms of its mean, not its rate. Somewhat surprisingly, the exponential distribution has a location parameter. This means, for example, that scipy.stats.expon.pdf(x, 7) evaluates at x the PDF of an exponential distribution with location 7. This is not what I expected. I assumed there would be no location parameter and that the second argument, 7, would be the mean (scale). Instead, the location was set to 7 and the scale was left at its default value 1. Writing scipy.stats.expon.pdf(x, scale=7) would have given the expected result because the default location value is 0.
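
To make the pitfall concrete, here is roughly what the difference looks like (a small sketch):

from scipy.stats import expon

x = 8.0

# The positional second argument sets the *location*, not the mean:
# this is a standard exponential shifted to start at 7.
print(expon.pdf(x, 7))           # exp(-(8 - 7)), about 0.368

# To get an exponential with mean 7, pass scale=7 explicitly.
print(expon.pdf(x, scale=7))     # (1/7) * exp(-8/7), about 0.046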

SciPy also provides constructors for objects representing random variables.

Example: x = scipy.stats.norm(3, 1); x.cdf(2.7) returns the same value as scipy.stats.norm.cdf(2.7, 3, 1).

Constructing objects representing random variables encapsulates the differences between distributions in the constructors. For example, some distributions take more parameters than others and so their object constructors require more arguments. But once a distribution object is created, its PDF, for example, can be called with a single argument. This makes it easier to write code that takes a general distribution object as an argument.
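
For example, a small function like the one below (tail_probability is just a hypothetical helper, not from the post) works with any frozen distribution object, no matter how many parameters the underlying family required:

from scipy import stats

def tail_probability(dist, x):
    # dist can be any frozen distribution: normal, gamma, beta, ...
    return dist.sf(x)

print(tail_probability(stats.norm(3, 1), 2.7))
print(tail_probability(stats.gamma(2, scale=3), 2.7))
print(tail_probability(stats.beta(2, 3), 0.9))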

Related posts:

Numerical computing in IronPython with IronClad
Stand-alone error function erf(x)

Numerical computing in IronPython with Ironclad

In a previous post, I discuss my difficulties calling some Python modules from IronPython. In particular I wanted to call SciPy from IronPython and couldn’t. The discussion following that post brought up Ironclad as a possible solution. I wanted to learn more about Ironclad, and so I invited William Reade to write a guest post about the project. I want to thank William for responding to my request with a very helpful article. — John


Hi! My name’s William Reade, and I’ve spent the last year or so working on Ironclad, an open-source project which helps IronPython to inter-operate better with CPython. Michael Foord recently introduced  me to our host John, who kindly offered me the opportunity to write a bit about  my work and, er, how well it works. So, here I am.

To give you a little bit of context, I’ve been working at Resolver Systems for several years now; our main product, Resolver  One, is a spreadsheet with very tight IronPython integration. We like to describe  it as a “Pythonic spreadsheet”, and that’s clearly a concept that people like.  However, when people think of a “Pythonic spreadsheet”, they apparently expect it  to work with popular Python libraries — such as NumPy and SciPy — and we found that IronPython’s incompatibility put us at a serious disadvantage. And, for some reason, nobody seemed very keen to  solve the problem for us, so we had to do it ourselves.

The purpose of Ironclad is to allow you to use Python C extensions (of which there are many) from inside IronPython without recompiling anything. The secret purpose  has always been to get NumPy working in Resolver One, and in release 1.4 we finally  achieved this goal. Although the integration is still alpha level, you can import  and use NumPy inside the spreadsheet grid and user code: you can see a screencast  about the integration here.

However, while Resolver One is a great tool, you aren’t required to use it to get the benefits: Ironclad has been developed completely separately, has no external  dependencies, and is available under an open source license. If you consider  yourself adequately teased, keep reading for a discussion of what Ironclad actually  does, what it enables you to do, and where it’s headed.

As you may know, Python is written in C and IronPython is written in C#. While IronPython is an excellent implementation of Python, it works very differently under the hood, and it certainly doesn’t have anything resembling Python’s API for writing C extensions. However, Ironclad can work around this problem by loading a stub DLL into an IronPython process which impersonates the real python25.dll, and hence allows us to intercept the CPython API calls. We can then ensure that the appropriate things happen in response to those calls … except that we use IronPython objects instead of CPython ones.

So long as we wrap IronPython objects for consumption by CPython, and vice versa, the two systems can coexist and inter-operate quite happily. Of course, the mix of  deterministic and non-deterministic garbage collection makes it a little tricky [1] to  ensure that unreferenced objects — and only unreferenced objects — die in a  timely manner, and there are a number of other dark corners, but I’ve done enough  work to confidently state that the problem is “just” complex and fiddly. While it’s  not the sort of project that will ever be finished, it hopefully is the sort that  can be useful without being perfect.

The upshot of my recent work is that you can now download Ironclad, type ‘import ironclad; import scipy‘ in an IronPython console, and it will Just Work [2]. I am  programmer, hear me roar!

Hundreds of tests now pass in both NumPy and SciPy, and I hope that some of you will be inspired to test it against your own requirements. For example, the Gaussian error function has been mentioned a few times on this blog (and, crucially, I have a  vague idea of what it actually is), and I can demonstrate that scipy.special.erf works perfectly under Ironclad:

C:\dev\ironclad-head\build>ipy
IronPython 2.0 (2.0.0.0) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> import ironclad
>>> from scipy.special import erf
Detected scipy import
  faking out numpy._import_tools.PackageLoader
Detected numpy import
  faking out modules: mmap, nosetester, parser
>>> erf(0)
0.0
>>> erf(0.1)
0.1124629160182849
>>> erf(1)
0.84270079294971478
>>> erf(10)
1.0

Numerical integration also seems to work pretty well, even for tricky cases (note that the quad function returns a tuple of (result, error)):

>>> from scipy.integrate import quad
>>> quad(erf, 0, 1)
(0.48606495811225592, 5.3964050795968879e-015)
>>> quad(erf, -1, 1)
(0.0, 1.0746071094349994e-014)
>>> from scipy import inf
>>> quad(erf, -inf, inf)
(0.0, 0.0)
>>> quad(erf, 0, inf) # ok, this one is probably more of a 'stupid' case
Warning: The integral is probably divergent, or slowly convergent.
(-1.564189583542768, 3.2898350710297564e-010)

And, while this exposes a little import-order wart, we can re-implement erf in terms of the normal CDF, and see that we get pretty similar results:

>>> from scipy import misc # shouldn't really be necessary - sorry :)
>>> from scipy.stats.distributions import norm
>>> import numpy as np
>>> def my_erf(x):
...   y = norm.cdf(x * np.sqrt(2))
...   return (2 * y) - 1
...
>>> my_erf(0.1)
0.11246291601828484
>>> my_erf(1)
0.84270079294971501
>>> quad(my_erf, 0, 1)
(0.48606495811225597, 5.3964050795968887e-015)
>>> quad(my_erf, -inf, inf)
(2.8756927650058737e-016, 6.1925307417506635e-016)

I also know that it’s possible to run through the whole Tentative NumPy Tutorial [3] with identical output on  CPython and IronPython [4], and the SciPy tutorial appears to work equally well in both  environments [5]. In short, if you’re trying to do scientific computing with  IronPython, Ironclad is now probably mature enough to let you get significant value  out of SciPy/NumPy.

However, I can’t claim that everything is rosy: Ironclad has a number of flaws which may impact you.

  • It won’t currently work outside Windows, and it won’t work in 64-bit processes.  However, NumPy itself doesn’t yet take advantage of 64-bit Windows. I’ll start work  on this  as soon as it’s practical; for now, it should be possible to run in 32-bit  mode without problems.
  • Performance is generally poor compared to CPython. In many places it’s only a matter of a few errant microseconds — and we’ve seen NumPy integration deliver some great performance benefits for Resolver One — but in pathological cases it’s  worse by many orders of magnitude. This is another area where I would really like  to hear back from users with examples of what needs to be faster.
  • Unicode data doesn’t work, and I don’t plan to work on this problem because it’ll disappear when IronPython catches up to Python 3000. At that point both systems will have Unicode strings only, instead of the current situation where I would have to map one string type on one side to two string types on the other.
  • NumPy’s distutils and f2py subpackages don’t currently work at all, and nor do memory-mapped files.
  • Plenty of other CPython extensions work, to a greater or lesser extent, but lots won’t even import.

However, just about every problem with Ironclad is fixable, at least in theory: if you need it to do something that it can’t, please talk to me about it (or even send me a patch!).

Footnotes

[1] CPython uses reference counting to track objects’ states, and deletes them deterministically the moment they become unreferenced, while .NET uses a more advanced garbage collection strategy which unfortunately leads to non-deterministic finalization.

[2] Assuming you have the directories containing the ironclad, numpy and scipy packages already on your sys.path, at any rate. I personally just install  everything for Python 2.5, and have added the CPython install’s ‘Dlls‘ and ‘lib/site-packages‘ subdirectories to my IRONPYTHONPATH.

[3] Apart from the matplotlib/pylab bit, but even that should be workable with a little extra setup if you don’t mind using a non-interactive back-end.

[4] Modulo PRNG output, of course.

[5] That is to say, not very well at all, but at least they go wrong in similar ways.