Estimating the exponent of discrete power law data

Suppose you have data from a discrete power law with exponent α. That is, the probability of an outcome n is proportional to n. How can you recover α?

A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That does a very poor job [1]. The discrete case needs its own estimator.

To illustrate this, we start by generating 5,000 samples from a discrete power law with exponent 3.

   import numpy.random

   alpha = 3
   n = 5000
   x = numpy.random.zipf(alpha, n)

The continuous MLE is very simple to implement:

    alpha_hat = 1 + n / sum(log(x))

Unfortunately, it gives an estimate of 6.87 for alpha, though we know it should be around 3.

The MLE for the discrete power law distribution satisfies

\frac{\zeta'(\hat{\alpha})}{\zeta(\hat{\alpha})} = -\frac{1}{n} \sum_{i-1}^n \log x_i

Here ζ is the Riemann zeta function, and xi are the samples. Note that the left side of the equation is the derivative of log ζ, or what is sometimes called the logarithmic derivative.

There are three minor obstacles to finding the estimator using Python. First, SciPy doesn’t implement the Riemann zeta function ζ(x) per se. It implements a generalization, the Hurwitz zeta function, ζ(x, q). Here we just need to set q to 1 to get the Riemann zeta function.

Second, SciPy doesn’t implement the derivative of zeta. We don’t need much accuracy, so it’s easy enough to implement our own. See an earlier post for an explanation of the implementation below.

Finally, we don’t have an explicit equation for our estimator. But we can easily solve for it using the bisection algorithm. (Bisect is slow but reliable. We’re not in a hurry, so we might as use something reliable.)

    from scipy import log
    from scipy.special import zeta
    from scipy.optimize import bisect 

    xmin = 1

    def log_zeta(x):
        return log(zeta(x, 1))

    def log_deriv_zeta(x):
        h = 1e-5
        return (log_zeta(x+h) - log_zeta(x-h))/(2*h)

    t = -sum( log(x/xmin) )/n
    def objective(x):
        return log_deriv_zeta(x) - t

    a, b = 1.01, 10
    alpha_hat = bisect(objective, a, b, xtol=1e-6)

We have assumed that our data follow a power law immediately from n = 1. In practice, power laws generally fit better after the first few elements. The code above works for the more general case if you set xmin to be the point at which power law behavior kicks in.

The bisection method above searches for a value of the power law exponent between 1.01 and 10, which is somewhat arbitrary. However, power law exponents are very often between 2 and 3 and seldom too far outside that range.

The code gives an estimate of α equal to 2.969, very near the true value of 3, and much better than the naive estimate of 6.87.

Of course in real applications you don’t know the correct result before you begin, so you use something like a confidence interval to give you an idea how much uncertainty remains in your estimate.

The following equation [2] gives a value of σ from a normal approximation to the distribution of our estimator.

\sigma = \frac{1}{\sqrt{n\left[ \frac{\zeta''(\hat{\alpha}, x_{min})}{\zeta(\hat{\alpha}, x_{min})} - \left(\frac{\zeta'(\hat{\alpha}, x_{min})}{\zeta(\hat{\alpha}, x_{min})}\right)^2\right]}}

So an approximate 95% confidence interval would be the point estimate +/- 2σ.

    from scipy.special import zeta
    from scipy import sqrt

    def zeta_prime(x, xmin=1):
        h = 1e-5
        return (zeta(x+h, xmin) - zeta(x-h, xmin))/(2*h)

    def zeta_double_prime(x, xmin=1):
        h = 1e-5
        return (zeta(x+h, xmin) -2*zeta(x,xmin) + zeta(x-h, xmin))/h**2

    def sigma(n, alpha_hat, xmin=1):
        z = zeta(alpha_hat, xmin)
        temp = zeta_double_prime(alpha_hat, xmin)/z
        temp -= (zeta_prime(alpha_hat, xmin)/z)**2
        return 1/sqrt(n*temp)

    print( sigma(n, alpha_hat) )

Here we use a finite difference approximation for the second derivative of zeta, an extension of the idea used above for the first derivative. We don’t need high accuracy approximations of the derivatives since statistical error will be larger than the approximation error.

In the example above, we have α = 2.969 and σ = 0.0334, so a 95% confidence interval would be [2.902, 3.036].

Click to find out more about consulting for statistical computing


* * *

[1] Using the continuous MLE with discrete data is not so bad when the minimum output xmin is moderately large. But here, where xmin = 1 it’s terrible.

[2] Equation 3.6 from Power-law distributions in empirical data by Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman.


Numerical differentiation

Today I needed to the derivative of the zeta function. SciPy implements the zeta function, but not its derivative, so I needed to write my own version.

The most obvious way to approximate a derivative would be to simply stick a small step size into the definition of derivative:

f’(x) ≈ (f(x+h) – f(x)) / h

However, we could do much better using

f’(x) ≈ (f(x+h) – f(x-h)) / 2h

To see why, expand f(x) in a power series:

f(x + h) = f(x) + h f‘(x) + h2 f”(x)/2 + O(h3)

A little rearrangement shows that the error in the one-sided difference, the first approximation above, is O(h). Now if you replace h with –h and do a little algebra you can also show that the two-sided difference is O(h2). When h is small, h2 is very small, so the two-sided version will be more accurate for sufficiently small h.

So how small should h be? The smaller the better, in theory. In computer arithmetic, you lose precision whenever you subtract two nearly equal numbers. The more bits two numbers share, the more bits of precision you may lose in the subtraction. In my application, h = 10-5 works well: the precision after the subtraction in the numerator is comparable to the precision of the (two-sided) finite difference approximation. The following code was adequate for my purposes.

    from scipy.special import zeta

    def zeta_prime(x):
        h = 1e-5
        return (zeta(x+h,1) - zeta(x-h,1))/(2*h)

The zeta function in SciPy is Hurwitz zeta function, a generalization of the Riemann zeta function. Setting the second argument to 1 gives the Riemann zeta function.

There’s a variation on the method above that works for real-valued functions that extend to a complex analytic function. In that case you can use the complex step differentiation trick to use

Im( f(x+ih)/h )

to approximate the derivative. It amounts to the two-sided finite difference above, except you don’t need to have a computer carry out the subtraction, and so you save some precision. Why’s that? When x is real, xih and xih are complex conjugates, and f(x – ih) is the conjugate of f(x + ih), i.e. conjugation and function application commute in this setting. So (f(x+ih) – f(x-ih)) is twice the imaginary part of f(x + ih).

SciPy implements complex versions many special functions, but unfortunately not the zeta function.

Click to find out more about consulting for numerical computing


Anthony Scopatz on xonsh and shells in general

Anthony Scopatz did an interview for Podcast.__init__ recently talking about xonsh, a command shell that blends Python and some traditions from bash. One line from the interview jumped out at me:

… thinking very critically about what shells get used for and what they’re actually good at and what they’re not good at.

I’ve wondered about this but never reached any satisfying conclusions. I was curious to hear Anthony’s ideas, so I asked him for another interview. (I interviewed Anthony and his co-author Katy Huff regarding their book Effective Computation in Physics.

* * *

JC: If your shell speaks your programming language, then what else does it need to do?

AS: It’s an interesting question. People have tried to use Python as a shell for years and years and they came up with a bunch of different potential solutions, but none of them quite worked because the language wasn’t built around that idea. It ended up being more verbose than people want from a shell. The main purpose of the shell, in my opinion, is to run other code and to glue things together. Python does that really well for libraries and functions, but it doesn’t do that so well for executables. Bash deals with executables really well, but it’s terrible for dealing with even simple conditional logic. Like a lot of people, I wanted something that would do all these things simultaneously and do them all well. But you quickly end up where many traditional computer science people are not willing to go: context-sensitive parsing. It’s something they teach you to be afraid of in school .

JC: But you do it all the time. How can you get away from it?

AS: You can’t, but people want to avoid it in their core languages. The major programming languages keep it out. You’ll find it quarantined to domain-specific languages where the damage is small.

JC: So you have something in mind like Perl? There the behavior of a function can depend entirely on whether it’s being used in a scalar context or an array context.

AS: That’s right. Perl does some of this. The language Forth is completely built around this. It’s all context-sensitive.

You brought up something interesting [in a previous email] about the overlap between shells and editors. Those things are completely separate in my mind, but for a lot of people they get merged very quickly. For instance, Emacs has the ability to run a shell inside the editor, and people use that all the time.

JC: The way I work is that I start something at the command line, then it gets a little complicated, and I switch over to writing a script and regret not having done that sooner. I especially do that with something like R. This is just going to be a few quick calculations, so I’ll do it right from the REPL. Then things get more complicated …

AS: IPython sorta has that too, the old IPython readline shell. You just wanted to do something simple that bash couldn’t do quickly or easily, so you open up the IPython command line. Inevitably it ends up taking more lines than you wanted it to.  That is part of why the Jupyter notebook is so great.

JC: One thing I noticed about PowerShell was that system administrators were ecstatic when it came out and would say how much they loved the command line. Then Microsoft put out this ISE, sort of an IDE for PowerShell, and everyone moved there. So they’re not really using the command line anymore. They’re excited about PowerShell as a programming language, not as an interactive shell per se.

In Bruce Payette’s PowerShell book he fields questions asking why PowerShell did something some way they find odd and his answer is always “Because it’s a shell.”

AS: Do you have any examples?

JC: For example, functions don’t use parentheses around their arguments or commas between their arguments because that’s not what people expect from a shell. You expect to type something like ls, not ls() with parentheses at the end. There were more subtle examples than this, but they’re not fresh on my mind.

AS: That’s where I think that tools like Python plumbum are lacking. It’s an all-Python environment, so you have to use Python syntax even when it’s cumbersome. It prevents you from having to import subprocess and worry about that all the time, but it doesn’t do much more than that.

JC: When you were writing xonsh, where there times you wished you could change the Python language? Or things you’d do differently in the shell if you weren’t aiming for 100% Python compatibility?

AS: That’s interesting. Python is deceptively simple. It has a lot of little pieces to it. It’s very natural and intuitive to use, but re-implementing the parser for Python was more work than I expected. There are a lot of little gotchas in the parser. I spent a lot of time on tuples and function argument grouping. The way they’re handled looks very similar but they’re handled completely differently for no reason that’s readily apparent.

There’s also this ambiguity between Python commands and shell commands if you’re trying to do both simultaneously, and that’s frustrating. That’s the hard part, figuring out when you’re in a subprocess and when you’re in Python mode.

JC: It’s hard for you as an implementer, but hopefully users can be blissfully ignorant of the issues and it just does what they expect.

I guess you’re walking a fine line, because as soon as you say you want the shell to infer what people mean, you start getting into the kinds of complications you have in Perl where things depend so heavily on context, and that sort of thing is contrary to the spirit of Python.

AS:  Yeah, exactly! After going through this exercise, there is one thing I’d like to change about Python. Python is white space-sensitive at the beginning of a line, but not after the first non-white space character. For example, you can put as many spaces around a binary operator as you like, or none at all. That’s really, really frustrating. If you enforced PEP 8, requiring exactly one white space around every binary operator, you’d be able to resolve these currently ambiguous cases between subprocess mode and Python mode very naturally. But I can’t imagine a world in which people would agree to this.

JC: What shell would you use if you weren’t using xonsh?

AS: I probably would use bash. Fish is really nice in some ways, and things like zsh have nice features too. What I used to do is go back and forth between working in an IPython shell and a bash shell, and between those two I could pretty much get the job done.

JC: Do you use Emacs?

AS: No, I don’t use Emacs or Vim or any of those editors. I use an editor I wrote, kinda like nano. I’ve used Emacs and Vim, but they got in my way too much, so I wanted something else. This is sort of the same thing as xonsh; I want my tools to get out of my way. I want the barrier to entry to doing what I want to be basically zero. You can spend years and years becoming a master of some of these tools and then you’re really effective, but I want to just open up the editor and start typing text. The same thing with the shell. I just want to open it up and get to work and not have to keep going back to the documentation.

Distance to Mars

The distance between the Earth and Mars depends on their relative positions in their orbits and varies quite a bit over time. This post will show how to compute the approximate distance over time. We’re primarily interested in Earth and Mars, though this shows how to calculate the distance between any two planets.

The planets have elliptical orbits with the sun at one focus, but these ellipses are nearly circles centered at the sun. We’ll assume the orbits are perfectly circular and lie in the same plane. (Now that Pluto is not classified as a planet, we can say without qualification that the planets have nearly circular orbits. Pluto’s orbit is much more elliptical than any of the planets.)

We can work in astronomical units (AUs) so that the distance from the Earth to the sun is 1. We can also work in units of years so that the period is also 1. Then we could describe the position of the Earth at time t as exp(2πit).

Mars has a larger orbit and a longer period. By Kepler’s third law, the size of the orbit and the period are related: the square of the period is proportional to the cube of the radius. Because we’re working in AUs and years, the proportionality constant is 1. If we denote the radius of Mars’ orbit by r, then its orbit can be described by

r exp(2πi (r-3/2 t ))

Here we pick our initial time so that at t = 0 the two planets are aligned.

The distance between the planets is just the absolute value of the difference between their positions:

| exp(2πit) – r exp(2πi (r-3/2 t)) |

The following code computes and plots the distance from Earth to Mars over time.

from scipy import exp, pi, absolute, linspace
import matplotlib.pyplot as plt

    def earth(t):
        return exp(2*pi*1j*t)

    def mars(t):
        r = 1.524 # semi-major axis of Mars orbit in AU
        return r*exp(2*pi*1j*(r**-1.5*t))

    def distance(t):
        return absolute(earth(t) - mars(t))

    x = linspace(0, 20, 1000)
    plt.plot(x, distance(x))
    plt.xlabel("Time in years")
    plt.ylabel("Distance in AU")
    plt.ylim(0, 3)

And the output looks like this:

Notice that the distance varies from about 0.5 to about 2.5. That’s because the radius of Mars’ orbit is about 1.5 AU. So when the planets are exactly in phase, they are 0.5 AU apart and when they’re exactly out of phase they are 2.5 AU apart. In other words the distance ranges from 1.5 – 1 to 1.5 + 1.

The distance function seems to be periodic with period about 2 years. We can do a little calculation by hand to show that is the case and find the period exactly.

The distance squared is the distance times its complex conjugate. If we let ω = -3/2 then the distance squared is

d2(t) = (exp(2πit) – r exp(2πiωt)) (exp(-2πit) – r exp(-2πiωt))

which simplifies to

1 + r2 – 2r cos(2π(1 – ω)t)

and so the (squared) distance is periodic with period 1/(1 – ω) = 2.13.

Notice that the plot of distance looks more angular at the minima and more rounded near the maxima. Said another way, the distance changes more rapidly when the planets leave their nearest approach than their furthest approach. You can prove this by taking square root of d2(t) and computing its derivative.

Let f(t) = 1 + r2 – 2r cos(2π(1 – ω)t). By the chain rule, the derivative of the square root of  f(t) is 1/2  f(t)-1/2 f‘(t). Near a maximum or a minimum, f‘(t) takes on the same values. But the term f(t)-1/2 is largest when f(t) is smallest and vice versa because of the negative exponent.

Related links:

Julia for Python programmers

One of my clients is writing software in Julia so I’m picking up the language. I looked at Julia briefly when it first came out but haven’t used it for work. My memory of the language was that it was almost a dialect of Python. Now that I’m looking at it a little closer, I can see more differences, though the most basic language syntax is more like Python than any other language I’m familiar with.

Here are a few scattered notes on Julia, especially on how it differs from Python.

  • Array indices in Julia start from 1, like Fortran and R, and unlike any recent language that I know of.
  • Like Python and many other scripting languages, Julia uses # for one-line comments. It also adds #= and =# for multi-line comments, like /* and */ in C.
  • By convention, names of functions that modify their first argument end in !. This is not enforced.
  • Blocks are indented as in Python, but there is no colon at the end of the first line, and there must be an end statement to close the block.
  • Julia uses elseif as in Perl, not elif as in Python [1].
  • Julia uses square brackets to declare a dictionary. Keys and values are separated with =>, as in Perl, rather than with colons, as in Python.
  • Julia, like Python 3, returns 2.5 when given 5/2. Julia has a // division operator, but it returns a rational number rather than an integer.
  • The number 3 + 4i would be written 3 + 4im in Julia and 3 + 4j in Python.
  • Strings are contained in double quotes and characters in single quotes, as in C. Python does not distinguish between characters and strings, and uses single and double quotes interchangeably.
  • Julia uses function to define a function, similar to JavaScript and R, where Python uses def.
  • You can access the last element of an array with end, not with -1 as in Perl and Python.

* * *

[1] Actually, Perl uses elsif, as pointed out in the comments below. I can’t remember when to use else if, elseif, elsif, and elif.

Effective Computation in Physics

Earlier this week I had a chance to talk with Anthony Scopatz and Katy Huff about their new book, Effective Computation in Physics.

JC: Thanks for giving me a copy of the book when we were at SciPy 2015. It’s a nice book. It’s about a lot more than computational physics.

KH: Right. If you think of it as physical science in general, that’s the group we’re trying to target.

JC: Targeting physical science more than life science?

KH: Yes. You can see that more in the data structures we cover which are very float-based rather than things like strings and locations.

AS: To second that, I’d say that all the examples are coming from the physical sciences. The deep examples, like in the parallelism chapter, are most relevant to physicists.

JC: Even in life sciences, there’s a lot more than sequences of base pairs.

KH: Right. A lot of people have asked what chapters they should skip. It’s probable that ecologists or social scientists are not going to be interested in the chapter about HDF5. But the rest of the book, more or less, could be useful to them.

JC: I was impressed that there’s a lot of scattered stuff that you need to know that you’ve brought into one place. This would be a great book to hand a beginning grad student.

KH: That was a big motivation for writing the book. Anthony’s a professor now and I’m applying to be a professor and I can’t spend all my time ramping students up to be useful researchers. I’d rather say “Here’s a book. It’s yours. Come to me if it’s not in the index.”

JC: And they’d rather have a book that they could access any time than have to come to you.  Are you thinking of doing a second edition as things change over time?

AS: It’s on the table to do a second edition eventually. Katy and I will have the opportunity if the book is profitable and the material becomes out of date. O’Reilly could ask someone else to write a second edition, but they would ask us first.

JC: Presumably putting out a second edition would not be as much work as creating the first one.

KH: I sure hope not!

AS: There’s a lot of stuff that’s not in this book. Greg Wilson jokingly asked us when Volume 2 would come out. There may be a need for a more intermediate book that extends the topics.

KH: And maybe targets languages other than Python where you’re going to have to deal with configuring and building, installing and linking libraries, that kind of stuff. I’d like to cover more of that, but Python doesn’t have that problem!

JC: You may sell a lot of books when the school year starts.

KH: Anthony and I both have plans for courses based around this book. Hopefully students will find it helpful.

JC: Maybe someone else is planning the same thing. It would be nice if they told you.

AS: A couple people have approached us about doing exactly that. Something I’d like to see is for people teaching courses around it to pull their curriculum together.

JC: Is there a web site for the book, other than an errata page at the publisher?

KH: Sure, there’s Anthony put that up.

AS: There’s also a GitHub repo, physics-codes. That’s where you can find code for all the examples, and that should be kept up to date. We also have a YouTube channel.

JC: When did y’all start writing the book?

AS: It was April or May last year when we finally started writing. There was a proposal cycle six or seven months before that. Katy and I were simultaneously talking to O’Reilly, so that worked out well.

KH: In a sense, the book process started for me in graduate school with The Hacker Within and Software Carpentry. A lot of the flows in the book come from the outlines of Hacker Within tutorials and Software Carpentry tutorials years ago.

AS: On that note, what happened for me, I took those tutorials and turned them into a masters course for AIMS, African Institute for Mathematical Sciences. At the end I thought it would be nice if this were a book. It didn’t occur to me that there was a book’s worth of material until the end of the course at AIMS. I owe a great debt to AIMS in that way.

JC: Is there something else you’d like to say about the book that we haven’t talked about?

KH: I think it would be a fun exercise for someone to try to determine which half of the chapters I wrote and which Anthony wrote. Maybe using some sort of clustering algorithm or pun detection. If anyone wants to do that sort of analysis, I’d love to see if you guess right. Open competition. Free beer from Katy if you can figure out which half. We split the work in half, but it’s really mixed around.  People who know us well will probably know that Anthony’s chapters have a high density of puns.

AS: I think the main point that I would like to see come across is that the book is useful to a broader audience outside the physical sciences. Even for people who are not scientists themselves, it’s useful to describe the mindset of physical scientists to software developers or managers. That communication protocol kinda goes both ways, though I didn’t expect that when we started out.

JC: I appreciate that it’s one book. Obviously it won’t cover everything you need to know. But it’s not like here’s a book on Linux, here’s a book on git, here are several books on Python. And some of the material in here isn’t in any book.

KH: Like licensing. Anthony had the idea to add the chapter on licensing. We get asked all the time “Which license do you use? And why?” It’s confusing, and you can get it really wrong.

* * *

Check out Effective Computation in Physics. It’s about more than physics. It’s a lot of what you need to know to get started with scientific computing in Python, all in one place.


For daily tips on Python and scientific computing, follow @SciPyTip on Twitter.

Scipytip twitter icon

Numerators of harmonic numbers

Harmonic numbers

The nth harmonic number, Hn, is the sum of the reciprocals of the integers up to and including n. For example,

H4 = 1 + 1/2 + 1/3 + 1/4 = 25/12.

Here’s a curious fact about harmonic numbers, known as Wolstenholme’s theorem:

For a prime p > 3, the numerator of Hp-1 is divisible by p2.

The example above shows this for p = 5. In that case, the numerator is not just divisible by p2, it is p2, though this doesn’t hold in general. For example, H10 = 7381/2520. The numerator 7381 is divisible by 112 = 121, but it’s not equal to 121.

Generalized harmonic numbers

The generalized harmonic numbers Hn,m are the sums of the reciprocals of the first n positive integers, each raised to the power m. Wolstenholme’s theorem also says something about these numbers too:

For a prime p > 3, the numerator of Hp-1,2 is divisible by p.

For example, H4,2 = 205/144, and the numerator is clearly divisible by 5.

Computing with Python

You can play with harmonic numbers and generalized harmonic numbers in Python using SymPy. Start with the import statement

from sympy.functions.combinatorial.numbers import harmonic

Then you can get the nth harmonic number with harmonic(n) and generalized harmonic numbers with harmonic(n, m).

To extract the numerators, you can use the method as_numer_denom to turn the fractions into (numerator, denominator) pairs. For example, you can create a list of the numerators of the first 10 harmonic numbers with

[harmonic(n).as_numer_denom()[0] for n in range(10)]

What about 0?

You might notice that harmonic(0) returns 0, as it should. The sum defining the harmonic numbers is empty in this case, and empty sums are defined to be zero.


Scientific computing in Python

Scientific computing in Python is expanding and maturing rapidly. Last week at the SciPy 2015 conference there were about twice as many people as when I’d last gone to the conference in 2013.

You can get some idea of the rapid develop of the scientific Python stack and its future direction by watching the final keynote of the conference by Jake VanderPlas.

I used Python for a while before I discovered that there were so many Python libraries for scientific computing. At the time I was considering learning Ruby or some other scripting language, but I committed to Python when I found out that Python has far more libraries for the kind of work I do than other languages do. It felt like I’d discovered a secret hoard of code. I expect it would be easier today to discover the scientific Python stack. (It really is becoming more of an integrated stack, not just a collection of isolated libraries. This is one of the themes in the keynote above.)

When people ask me why I use Python, rather than languages like Matlab or R, my response is that I do a mixture of mathematical programming and general programming. I’d rather do mathematics in a general programming language than do general programming in a mathematical language.

One of the drawbacks of Python, relative to C++ and related languages, is speed. This is a problem in languages like R as well. However, with Python there are ways to speed up code without completely rewriting it, such as Cython and Numba. The only reliable way I’ve found to make R much faster, is to rewrite it in another language.

Another drawback of Python until recently was that data manipulation and exploration were not as convenient as one would hope. However, that has changed due to developments such as Pandas, initiated by Wes McKinney. For more on how that came to be and where it’s going, see his keynote from the second day of SciPy 2015.

It’s not clear why Python has become the language of choice for so many people in scientific computing. Maybe if people like Travis Oliphant had decided to use some other language for scientific programming years ado, we’d all be using that language now. Python wasn’t intended to be a scientific programming language. And as Jake VanderPlas points out in his keynote, Python still is not a scientific programming language, but the foundation for a scientific programming stack. Maybe Python’s strength is that it’s not a scientific language. It has drawn more computer scientists to contribute to the core language than it would have if it had been more of a domain-specific language.

* * *

If you’d like help moving to the Python stack, please let me know.

Mystery curve

This afternoon I got a review copy of the book Creating Symmetry: The Artful Mathematics of Wallpaper Patterns. Here’s a striking curves from near the beginning of the book, one that the author calls the “mystery curve.”

The curve is the plot of exp(it) – exp(6it)/2 + i exp(-14it)/3 with t running from 0 to 2π.

Here’s Python code to draw the curve.

import matplotlib.pyplot as plt
from numpy import pi, exp, real, imag, linspace

def f(t):
    return exp(1j*t) - exp(6j*t)/2 + 1j*exp(-14j*t)/3

t = linspace(0, 2*pi, 1000)

plt.plot(real(f(t)), imag(f(t)))

# These two lines make the aspect ratio square
fig = plt.gcf()

Maybe there’s a more direct way to plot curves in the complex plane rather than taking real and imaginary parts.

Updated code for the aspect ratio per Janne’s suggestion in the comments.

Related posts:

Several people have been making fun visualizations that generalize the example above.

Brent Yorgey has written two posts, one choosing frequencies randomly and another that animates the path of a particle along the curve and shows how the frequency components each contribute to the motion.

Mike Croucher developed a Jupyter notebook that lets you vary the frequency components with sliders.

John Golden created visualizations in GeoGerba here and here.

Jennifer Silverman showed how these curves are related to decorative patterns that popular in the 1960’s. She also created a coloring book and a video.

Dan Anderson accused me of nerd sniping him and created this visualization.

Rotating PDF pages with Python

Yesterday I got a review copy of Automate the Boring Stuff with Python. It explains, among other things, how to manipulate PDFs from Python. This morning I needed to rotate some pages in a PDF, so I decided to try out the method in the book.

The sample code uses PyPDF2. I’m using Conda for my Python environment, and PyPDF2 isn’t directly available for Conda. I searched Binstar with

binstar search -t conda pypdf2

The first hit was from JimInCO, so I installed PyPDF2 with

conda install -c pypdf2

I scanned a few pages from a book to PDF, turning the book around every other page, so half the pages in the PDF were upside down. I needed a script to rotate the even numbered pages. The script counts pages from 0, so it rotates the odd numbered pages from its perspective.

import PyPDF2

pdf_in = open('original.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_in)
pdf_writer = PyPDF2.PdfFileWriter()

for pagenum in range(pdf_reader.numPages):
    page = pdf_reader.getPage(pagenum)
    if pagenum % 2:

pdf_out = open('rotated.pdf', 'wb')

It worked as advertised on the first try.

Random walks and the arcsine law

Suppose you stand at 0 and flip a fair coin. If the coin comes up heads, you take a step to the right. Otherwise you take a step to the left. How much of the time will you spend to the right of where you started?

As the number of steps N goes to infinity, the probability that the proportion of your time in positive territory is less than x approaches 2 arcsin(√x)/π. The arcsine term gives this rule its name, the arcsine law.

Here’s a little Python script to illustrate the arcsine law.

import random
from numpy import arcsin, pi, sqrt

def step():
    u = random.random()
    return 1 if u < 0.5 else -1

M = 1000 # outer loop
N = 1000 # inner loop

x = 0.3 # Use any 0 < x < 1 you'd like. 
outer_count = 0 
for _ in range(M): 
    n = 0 
    position= 0 
    inner_count = 0 
    for __ in range(N): 
        position += step() 
    if position > 0:
        inner_count += 1
    if inner_count/N < x:
        outer_count += 1

print (outer_count/M)
print (2*arcsin(sqrt(x))/pi)

Playing with continued fractions and Khinchin’s constant

Take a real number x and expand it as a continued fraction. Compute the geometric mean of the first n coefficients.

Aleksandr Khinchin proved that for almost all real numbers x, as n → ∞ the geometric means converge. Not only that, they converge to the same constant, known as Khinchin’s constant, 2.685452001…. (“Almost all” here mean in the sense of measure theory: the set of real numbers that are exceptions to Khinchin’s theorem has measure zero.)

To get an idea how fast this convergence is, let’s start by looking at the continued fraction expansion of π. In Sage, we can type


to get the continued fraction coefficient

[3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1, 15, 3]

for π to 100 decimal places. The geometric mean of these coefficients is 2.84777288486, which only matches Khinchin’s constant to 1 significant figure.

Let’s try choosing random numbers and working with more decimal places.

There may be a more direct way to find geometric means in Sage, but here’s a function I wrote. It leaves off any leading zeros that would cause the geometric mean to be zero.

from numpy import exp, mean, log
def geometric_mean(x):
    return exp( mean([log(k) for k in x if k > 0]) )

Now let’s find 10 random numbers to 1,000 decimal places.

for _ in range(10):
    r = RealField(1000).random_element(0,1)

This produced


Three of these agree with Khinchin’s constant to two significant figures but the rest agree only to one. Apparently the convergence is very slow.

If we go back to π, this time looking out 10,000 decimal places, we get a little closer:


produces 2.67104567579, which differs from Khinchin’s constant by about 0.5%.

Python resources

Each Wednesday I post a list of some of the resources on this site. This week: Python notes.

See also blog posts tagged Python, SciPy, and SymPy.

Last week: Special functions

Next week: Probability resources

For daily tips on Python and scientific computing, follow @SciPyTip on Twitter.

Scipytip twitter icon

Counting primitive bit strings

A string of bits is called primitive if it is not the repetition of several copies of a smaller string of bits. For example, the 101101 is not primitive because it can be broken down into two copies of the string 101. In Python notation, you could produce 101101 by "101"*2. The string 11001101, on the other hand, is primitive. (It contains substrings that are not primitive, but the string as a whole cannot be factored into multiple copies of a single string.)

For a given n, let’s count how many primitive bit strings there are of length n. Call this f(n). There are 2n bit strings of length n, and f(n) of these are primitive. For example, there are f(12) primitive bit strings of length 12. The strings that are not primitive are made of copies of primitive strings: two copies of a primitive string of length 6, three copies of a primitive string of length 4, etc. This says

 2^{12} = f(12) + f(6) + f(4) + f(3) + f(2) + f(1)

and in general

2^n = \sum_{d \mid n} f(d)

Here the sum is over all positive integers d that divide n.

Unfortunately this formula is backward. It gives is a formula for something well known, 2n, as a sum of things we’re trying to calculate. The Möbius inversion formula is just what we need to turn this formula around so that the new thing is on the left and sums of old things are on the right. It tells us that

f(n) = \sum_{d \mid n} \mu\left(\frac{n}{d}\right) 2^d

where μ is the Möbius function.

We could compute f(n) with Python as follows:

from sympy.ntheory import mobius, divisors

def num_primitive(n):
    return sum( [mobius(n/d)*2**d for d in divisors(n)] )

The latest version of SymPy, version 0.7.6, comes with a function mobius for computing the Möbius function. If you’re using an earlier version of SymPy, you can roll your own mobius function:

from sympy.ntheory import factorint

def mobius(n):
    exponents = factorint(n).values()
    lenexp = len(exponents)
    m = 0 if lenexp == 0 else max(exponents)
    return 0 if m > 1 else (-1)**lenexp

The version of mobius that comes with SymPy 0.7.6 may be more efficient. It could, for example, stop the factorization process early if it discovers a square factor.

Related: Applied number theory

Ellipsoid surface area

How much difference does the earth’s equatorial bulge make in its surface area?

To first approximation, the earth is a sphere. The next step in sophistication is to model the earth as an ellipsoid.

The surface area of an ellipsoid with semi-axes abc is

A = 2\pi \left( c^2 + \frac{ab}{\sin\phi} \left( E(\phi, k) \sin^2\phi + F(\phi, k) \cos^2 \phi\right)\right)




m = k^2 = \frac{a^2(b^2 - c^2)}{b^2(a^2 - c^2)}

The functions E and F are incomplete elliptic integrals

 F(\phi, k) = \int_0^\phi \frac{d\theta}{\sqrt{1 - k^2 \sin^2\theta}}


E(\phi, k) = \int_0^\phi \sqrt{1 - k^2 \sin^2\theta}\,d\theta

implemented in SciPy as ellipeinc and ellipkinc. Note that the SciPy functions take m as their second argument rather its square root k.

For the earth, a = b and so m = 1.

The following Python code computes the ratio of earth’s surface area as an ellipsoid to its area as a sphere.

from scipy import pi, sin, cos, arccos
from scipy.special import ellipkinc, ellipeinc

# values in meters based on GRS 80
equatorial_radius = 6378137
polar_radius = 6356752.314140347

a = b = equatorial_radius
c = polar_radius

phi = arccos(c/a)
# in general, m = (a**2 * (b**2 - c**2)) / (b**2 * (a**2 - c**2))
m = 1

temp = ellipeinc(phi, m)*sin(phi)**2 + ellipkinc(phi, m)*cos(phi)**2
ellipsoid_area = 2*pi*(c**2 + a*b*temp/sin(phi))

# sphere with radius equal to average of polar and equatorial
r = 0.5*(a+c)
sphere_area = 4*pi*r**2


This shows that the ellipsoid model leads to 0.112% more surface area relative to a sphere.

Source: See equation 19.33.2 here.

Update: It was suggested in the comments that it would be better to compare the ellipsoid area to that of a sphere of the same volume. So instead of using the average of the polar and equatorial radii, one would take the geometric mean of the polar radius and two copies of the equatorial radius. Using that radius, the ellipsoid has 0.0002% more area than the sphere.

For daily posts on analysis, follow @AnalysisFact on Twitter.

AnalysisFact twitter icon