Moments of mixtures

I needed to compute the higher moments of a mixture distribution for a project I’m working on. I’m writing up the code here in case anyone else finds this useful. (And in case I find it useful in the future.) I’ll include the central moments first. From there it’s easy to compute skewness and kurtosis.

Suppose X is a mixture of n random variables Xi with weights wi, non-negative numbers adding to 1. Then the jth central moment of X is given by

E[(X - \mu)^j] = \sum_{i=1}^n \sum_{k=0}^j {j \choose k} (\mu_i - \mu)^{j-k} w_i E[(X_i- \mu_i)^k]

where μi is the mean of Xi and μ = Σi wi μi is the mean of the mixture X itself.

In my particular application, I’m interested in a mixture of normals and so the code below computes the moments for a mixture of normals. It could easily be modified for other distributions.

from scipy.special import comb, factorial2

def mixture_central_moment(mixture, moment):

    '''Compute a central moment of a mixture of normal rvs.
    mixture is a list of (mu, sigma, weight) triples.
    moment is the central moment to compute.'''

    mix_mean = sum( [w*m for (m, s, w) in mixture] )

    mixture_moment = 0.0
    for triple in mixture:
        mu, sigma, weight = triple
        for k in range(moment+1):
            prod = comb(moment, k) * (mu-mix_mean)**(moment-k)
            prod *= weight*normal_central_moment(sigma, k)
            mixture_moment += prod

    return mixture_moment


def normal_central_moment(sigma, moment):

    '''Central moments of a normal distribution'''

    if moment == 0:
        return 1.0
    elif moment % 2 == 1:
        # Odd central moments of a normal vanish by symmetry.
        return 0.0
    else:
        # If Z is a std normal and n is a positive even number, E(Z^n) == (n-1)!!
        # So E (sigma Z)^n == sigma^n (n-1)!!
        return sigma**moment * factorial2(moment-1, exact=True)
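
As a quick sanity check, for a normal distribution with sigma = 2 the second central moment should be sigma^2 = 4 and the fourth should be 3 sigma^4 = 48:

# Sanity check: known central moments of a Normal(0, 2)
assert normal_central_moment(2.0, 2) == 4.0
assert normal_central_moment(2.0, 4) == 48.0  # 3 * 2**4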

Once we have code for central moments, it’s simple to add code for computing skewness and kurtosis.

def mixture_skew(mixture):

    variance = mixture_central_moment(mixture, 2)
    third = mixture_central_moment(mixture, 3)
    return third / variance**(1.5)

def mixture_kurtosis(mixture):

    variance = mixture_central_moment(mixture, 2)
    fourth = mixture_central_moment(mixture, 4)
    # Excess kurtosis: zero for a normal distribution
    return fourth / variance**2 - 3.0

Here’s an example of how the code might be used.

# Test on a mixture of 30% Normal(-2, 1) and 70% Normal(1, 3)
mixture = [(-2, 1, 0.3), (1, 3, 0.7)]

print "Skewness = ", mixture_skew(mixture)
print "Kurtosis = ", mixture_kurtosis(mixture)

Related post: General formula for normal moments

New introduction to SciPy

The Python stack for scientific computing is more modular than, say, R or Mathematica. Python is a general-purpose programming language that has libraries for scientific computing. R and Mathematica are statistical and mathematical programming languages that have general-purpose features. The Python approach has its advantages — I’d rather do math in a general language than do general programming in a mathematical language — but it takes longer to learn. The components of the Python stack work well together, but someone new to Python has to discover which components they’ll need.

Several books have come out recently to help someone learn Python and the components for numerical computing. The latest is Learning SciPy for Numerical and Scientific Computing by Francisco J. Blanco-Silva.

This book covers the things you’d expect, including SciPy, NumPy, and Matplotlib. The one notable omission may be IPython. But no book can cover everything, and since IPython is an environment more than a library, it makes some sense to leave it out.

In addition to the usual topics, the book includes several important topics that are not as commonly covered. For example, it devotes a good amount of space to special functions and integration in a chapter on numerical analysis. I share the author’s assessment that this is “one of the most interesting chapters in the book.”

There are three chapters on more specific applications: signal processing, data mining, and computational geometry. These chapters give an introduction to their topics as well as how to carry out computations in SciPy.

The final chapter of the book is on integration with other programming languages.

Learning SciPy for Numerical and Scientific Computing covers the material you need to get going. It’s easy to read, and still fairly small: 150 pages total, about 130 pages of content. This is the right size for such a book in my opinion. There’s plenty of online documentation for more details, but it helps to have a good overview such as this book before diving into reference material.

Use typewriter font for code inside prose

There’s a useful tradition of using a typewriter font, or more generally some monospaced font, for bits of code sprinkled in prose. The practice is analogous to using italic to mark, for example, a French mot dropped into an English paragraph. In HTML, the code tag marks content as software code, which a browser typically will render in a typewriter font.

Here’s a sentence from a new article on Python at Netflix that could benefit from a few code tags.

These features (and more) have led to increasingly pervasive use of Python in everything from small tools using boto to talk to AWS, to storing information with python-memcached and pycassa, managing processes with Envoy, polling restful APIs to large applications with requests, providing web interfaces with CherryPy and Bottle, and crunching data with scipy.

Here’s the same sentence with some code tags.

These features (and more) have led to increasingly pervasive use of Python in everything from small tools using <code>boto</code> to talk to AWS, to storing information with <code>python-memcached</code> and <code>pycassa</code>, managing processes with <code>Envoy</code>, polling restful APIs to large applications with <code>requests</code>, providing web interfaces with <code>CherryPy</code> and <code>Bottle</code>, and crunching data with <code>scipy</code>.

It’s especially helpful to let the reader know that packages like <code>requests</code> are indeed packages. It helps to clarify, for example, whether Wes McKinney has been stress testing pandas or <code>pandas</code>. That way we know whether to inform animal protection authorities or to download a new version of a library.

Wallpaper and phase portraits

Suppose you want to create a background image that tiles well. You’d like it to be periodic horizontally and vertically so that there are no obvious jumps when the image repeats.

Functions like sine and cosine are periodic along the real line. But if you want to make a two-dimensional image by extending the sine function to the complex plane, the result is not periodic along the imaginary axis; it grows exponentially.

There are functions that are periodic horizontally and vertically. If you restrict your attention to functions that are analytic except at poles, these doubly-periodic functions are elliptic functions, a class of functions with remarkable properties. See this post if you’re interested in the details. Here we’re just after pretty wallpaper. I’ll give Python code for creating the wallpaper.

Here I’ll take a particular elliptic function, sn(z), one of the Jacobi elliptic functions and somewhat analogous to the sine function, and use its phase portrait. Phase portraits use hue to encode the phase of a complex number, the θ value when a complex number is written in polar coordinates. The brightness of the color indicates the magnitude, the r value in polar coordinates.

Here’s the plot of sn(z, 0.2). (The sn function takes a parameter m that I arbitrarily chose as 0.2.) The plot shows two periods, horizontally and vertically. I included two periods so you could more easily see how it repeats. If you wanted to use this image as wallpaper, you could use 1/4 of the image, one period in each direction, to get by with a smaller image.

[Figure: phase portrait of sn(z, 0.2) - 0.2]

Here’s the Python code that was used to create the image.

from mpmath import cplot, ellipfun, ellipk
sn = ellipfun('sn')
m = 0.2
x = 4*ellipk(m) # horizontal period
y = 2*ellipk(1-m) # vertical period
cplot(lambda z: sn(z, m) - 0.2, [0, 2*x], [0, 2*y], points = 100000)

I subtracted 0.2 from sn just to shift the color a little. Adding a positive number shifts the color toward red. Subtracting a positive number shifts the color toward blue. You could also multiply by some constant to increase or decrease the brightness.

You could also play around with other elliptic functions, described in the mpmath documentation here. And you can find more on cplot here. For example, you could supply your own function for how phase is mapped to color. The saturated colors used by default are good for mathematical visualization, but more subtle colors could be better for aesthetics.
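
For example, here’s a sketch of a custom coloring function. According to the mpmath documentation, the color argument to cplot should be a function that takes a complex value and returns an (R, G, B) tuple of floats between 0 and 1; the particular pastel scheme below is just an illustration.

# Softer coloring: hue from the phase, fixed lightness and saturation
from colorsys import hls_to_rgb
from mpmath import arg, pi

def pastel(z):
    h = float(arg(z)) / (2*float(pi)) + 0.5  # map phase to [0, 1]
    return hls_to_rgb(h, 0.7, 0.6)

cplot(lambda z: sn(z, m) - 0.2, [0, 2*x], [0, 2*y], points=100000, color=pastel)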

If you make some interesting images, leave a comment with a link to your image and a description of how you made it.

Exact chaos

Pick a number x between 0 and 1. Then repeatedly replace x with 4x(1-x). For almost all starting values of x, the result exhibits chaos. Two people could play this game with starting values very close together, and eventually their sequences will diverge.

It’s somewhat surprising that the iterative process described above can be written down in closed form. Starting from a value x0, the value after n iterations is

\sin^2( 2^n \arcsin( \sqrt{x_0} ) )

(To see why, write x_0 = \sin^2 \theta and note that 4x(1-x) = 4 \sin^2\theta \cos^2\theta = \sin^2 2\theta, so each iteration simply doubles \theta.)

Now suppose two people start with the same initial value. One repeatedly applies 4x(1-x) and the other uses the formula above. If both carried out their calculations exactly, both would produce the same output at every step. But what if both used a computer?

The two approaches correspond to the Python functions f and g below. Because both functions are executed in finite precision arithmetic, both have errors, but they have different errors. Suppose we want to look at the difference between the two functions as we increase n.

from numpy import arcsin, sin, sqrt, linspace
from matplotlib import pyplot as plt

def f(x0, n):
    x = x0
    for _ in range(n):
        x = 4*x*(1-x)
    return x

def g(x0, n):
    return sin(2.0**n * arcsin(sqrt(x0)))**2

n = 40
x = linspace(0, 1, 100)
plt.plot(x, f(x, n) - g(x, n))
plt.ylim(-1, 1)
plt.show()

When we run the code, nothing exciting happens. The difference is a flat line.

Next we increase n to 45 and we start to see the methods diverge.

The divergence is large when n is 50.

And the two functions are nearly completely uncorrelated when n is 55.

Update

So which function is more accurate, f or g? As noted in the comments, the two functions have different kinds of numerical errors. The former accumulates floating point rounding error at each iteration. The latter shifts noisy bits into significance by multiplying by 2^n. Apparently both have about the same overall error, though they have different distributions of error.

I recomputed g using 100-digit precision with mpmath and used the results as the standard to evaluate the output of f and g in ordinary precision. Here’s a plot of the errors when n = 45, first with f

and then with g.

The average absolute errors of f and g are 0.0024 and 0.0015 respectively.
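
Here’s a sketch of how that comparison can be done, reusing the functions f and g defined above. (The precision, grid, and value of n follow the description above; the exact error values you get may differ slightly.)

# Use the closed form at 100-digit precision as the standard of comparison
import numpy as np
from mpmath import mp, mpf, asin as mpasin, sin as mpsin, sqrt as mpsqrt

mp.dps = 100  # 100 significant digits

def g_exact(x0, n):
    return float(mpsin(mpf(2)**n * mpasin(mpsqrt(mpf(x0))))**2)

n = 45
xs = np.linspace(0, 1, 100)
truth = np.array([g_exact(x, n) for x in xs])
print("mean |error| in f:", np.abs(f(xs, n) - truth).mean())
print("mean |error| in g:", np.abs(g(xs, n) - truth).mean())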

Python animation for mechanical vibrations

Stéfan van der Walt wrote some Python code to animate the system described in yesterday’s post on mechanical vibrations.

Stéfan posted his code on GitHub. It currently illustrates undamped free vibrations, but could be modified to work with damped or driven vibrations.

The code includes an option to save the output as an mp4 file. You must have ffmpeg installed to save a matplotlib animation as mp4.
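
For a feel of what such an animation involves, here’s a minimal matplotlib sketch of undamped free vibration, x(t) = A cos(ωt). This is a bare-bones illustration, not Stéfan’s code; see his repository for the real thing.

# Minimal animation of x(t) = A cos(omega t) with FuncAnimation (sketch)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

A, omega = 1.0, 2.0                  # arbitrary amplitude and frequency
t = np.linspace(0, 10, 500)

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(-1.5, 1.5)
line, = ax.plot([], [])

def update(frame):
    # Draw the displacement curve up to the current frame
    line.set_data(t[:frame], A*np.cos(omega*t[:frame]))
    return line,

ani = FuncAnimation(fig, update, frames=len(t), interval=20, blit=True)
ani.save("vibration.mp4", fps=30)    # requires ffmpeg
plt.show()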

Continued fractions with Sage

My previous post looked at continued fractions and rational approximations for e and gave a little Python code.  I found out later there’s a more direct way to do this in Python using Sage.

At its simplest, the function continued_fraction takes a real number and returns a truncated continued fraction representation. For example, continued_fraction(e) returns

[2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, 10, 1, 1, 12, 1, 1]

Optionally, you can also specify the number of bits of precision in the real argument and the number of terms desired.

By calling the convergents method on the return value of continued_fraction(e) you can find a sequence of rational approximations based on the continued fraction. For example,

print(continued_fraction(e).convergents())

produces

[2, 3, 8/3, 11/4, 19/7, 87/32, 106/39, 193/71, 1264/465,
1457/536, 2721/1001, 23225/8544, 25946/9545, 49171/18089,
517656/190435, 566827/208524, 1084483/398959,
13580623/4996032, 14665106/5394991, 28245729/10391023].

To get higher precision output, you need higher precision input. For example, you could pass in

RealField(200)(e)

rather than simply e to tell Sage that you’d like to use the 200-bit representation of e rather than the default precision.
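
For example, a Sage session might look something like this (a sketch; the slice is just to keep the output short):

# Sage: a higher-precision value of e gives more continued fraction terms
cf = continued_fraction(RealField(200)(e))
print(cf)
print(cf.convergents()[:8])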

Rational approximations to e

This morning Dave Richeson posted a humorous fake proof that depends on the famous approximation 22/7 for pi. It occurred to me that nearly everyone knows a decent rational approximation to pi. Some people may know more. But hardly anyone, myself included, knows a decent rational approximation for e.

Another approximation for pi is 355/113. I like this approximation because it’s easy to remember: take the sequence 113355, split it in the middle, and make it into a fraction. It’s accurate to six decimal places, which is sufficient for most practical applications.

The approximations 22/7 and 355/113 are part of the sequence of approximations coming from the continued fraction approximation for pi. So to come up with rational approximations for e, I turned to its continued fraction representation.

The best analog of the approximation 22/7 for pi may be the approximation 19/7 for e. Obviously the denominators are the same, and the accuracy of the two approximations is roughly comparable.

Here’s how you can make your own rational approximations for e. Find the coefficients in the continued fraction for e, for example here. You can turn this into a sequence of approximations by using the following Python code:

from math import e

e_frac = [2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8]

def display(n, d, exact):
    print(n, d, n/d, n/d - exact)

def approx(a, exact):
    # initialize the recurrence
    n0 = a[0]
    d0 = 1
    n1 = a[0]*a[1] + 1
    d1 = a[1]

    display(n0, d0, exact)
    display(n1, d1, exact)

    for x in a[2:]:
        n = x*n1 + n0 # numerator
        d = x*d1 + d0 # denominator
        display(n, d, exact)
        n1, n0 = n, n1
        d1, d0 = d, d1

approx(e_frac, e)

This will print the numerator, denominator, value, and error for each approximation. You could include more terms in the continued fraction for e if you’d like. Here are some of the results: 19/7, 87/32, 106/39, etc. Unfortunately it doesn’t look like there are any approximations as memorable as 355/113 for pi.

You could also use the code to create rational approximations to other numbers if you’d like. For example, you can find the continued fraction expansion for pi here and use the code above to find rational approximations for pi.

Update: There’s a more direct way to find continued fractions and rational approximations in Python using Sage. See the next post.

Footnote: The continued fraction coefficients for e have a nice pattern:
… 1, 1, 4, 1, 1, 6, 1, 1, 8, … 1, 1, 2k, 1, 1, 2k+2, …


Python / Emacs setup

When I got a new computer a few days ago, I installed the latest version of Emacs, 24.2, and broke my Python environment. I decided to re-evaluate my environment and start over. I asked a question on the Python Google+ group and here’s a summary of the Emacs packages recommended.

  • python-mode
  • rainbow-mode, rainbow-delimiters-mode
  • flymake-mode (hooked up to flymake-pyflakes)
  • linum-on
  • jedi
  • pycheckers + pyflakes
  • rope
  • electric-pair
  • show-paren
  • auto-complete
  • yasnippet

What recommendations do you have for packages? Links to relevant articles?

It might be better to leave your comments on Google+ rather than here so the answers will be in one place.

Narcissus prime in Python

I’ve been looking back on some of my blog posts that included Mathematica code to see whether I could rewrite them using Python. For example, I rewrote my code for finding sonnet primes in Python a few days ago. Next I wanted to try testing the Narcissus prime.

Futility Closet describes the Narcissus prime as follows:

Repeat the string 1808010808 1560 times, and tack a 1 on the end. The resulting 15601-digit number is prime, and because it’s a palindrome made up of the digits 1, 8, and 0, it remains prime when read backward, upside down, or in a mirror.

My Mathematica code for verifying this claim is posted here. Here’s Python code to do the same thing:

from sympy.ntheory import isprime
isprime(int("1808010808"*1560 + "1"))

This does indeed return True. However, the Mathematica code ran for about 2 minutes and the SymPy code took 17.5 hours, about 500 times longer.
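
If you’d like to time the test yourself, here’s a minimal sketch. (Timings will of course vary with your machine and SymPy version.)

import time
from sympy.ntheory import isprime

start = time.perf_counter()
result = isprime(int("1808010808"*1560 + "1"))
print(result, "in", time.perf_counter() - start, "seconds")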

Sonnet primes in Python

A while back I wrote about sonnet primes, primes of the form ababcdcdefefgg where the letters a through g represent digits and a is not zero. The name comes from the rhyme scheme of an English (Shakespearean) sonnet.

In the original post I gave Mathematica code to find all sonnet primes. This post shows how to do it in Python.

from sympy.ntheory import isprime
from itertools import permutations

def number(t):
    # turn a tuple into a number
    return (10100000000000*t[0] + 1010000000000*t[1]
            +    1010000000*t[2] +     101000000*t[3]
            +        101000*t[4] +         10100*t[5]
            +            11*t[6])

sonnet_numbers = (number(t) for t in 
    permutations(range(10), 7) if t[0] != 0)

sonnet_primes = list(filter(isprime, sonnet_numbers))  # list() because filter is lazy in Python 3

Finding 2013 in pi

My youngest daughter asked me this morning whether you can find the number 2013 in the digits of pi. I said it must be possible, then wrote the following Python code to find where 2013 first appears.

from mpmath import mp
mp.dps = 100000
digits = str(mp.pi)[2:]
digits.find('2013')

I use the multiprecision math package mpmath to get pi to 100,000 significant figures. I save this as a string and cut off the “3.” at the beginning to leave just the string of digits after the decimal point.

The code returns 6275, so “2013” appears in the 6275th position in the string of digits. However, we usually count decimal places starting from 1 but count positions in a string starting from 0, so 2013 starts in the 6276th decimal place of pi.

So π = 3.14159…2013… where the first “…” represents 6,270 digits not shown.
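
As a quick check on the indexing, the slice of the digit string starting at 0-based position 6275 should be exactly the string we searched for:

# Positions 6275 through 6278 (0-based) should hold '2013'
assert digits[6275:6279] == '2013'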

* * *

Now we jump off into deeper mathematical water.

For some purposes, the digits of pi are random. The digits are obviously not random — there are algorithms for calculating them — and yet they behave randomly, and random is as random does.

If the digits of pi were random, then we could almost certainly find any sequence we want if we look long enough. Can we find any finite sequence of digits in the decimal expansion of pi? I would assume so, but I don’t know whether that has been proven.

You might expect that not only can you find 2013 in pi, but that if you split the digits of pi into blocks of 4, then 2013 and every other particular block would occur with the same frequency in the limit. That is, one would expect that the expansion of pi is uniform in base 10,000.

More generally, you might conjecture that pi is a normal number, i.e. that the digits of pi are uniformly distributed in every base. This has not been proven. In fact, no one has proved that any of the familiar mathematical constants, such as pi, e, or √2, is normal [reference]. However, we do know that almost all numbers are normal. That is, the set of non-normal numbers has Lebesgue measure zero.

Basics of Sweave and Pweave

Sweave is a tool for embedding R code in a LaTeX file. Pweave is an analogous tool for Python. By putting your code in your document, rather than pasting in the results of running your code somewhere else, results are automatically recomputed when inputs change. This is especially useful with graphs: rather than including an image in your document, you include the code to create the image.

To use either Sweave or Pweave, you create a LaTeX file and include source code inside. A code block begins with <<>>= and ends with @ on a line by itself. By default, code blocks appear in the LaTeX output. You can start a code block with <<echo=FALSE>>= to execute code without echoing its source. In Pweave you can also use <% and %> to mark a code block that executes but does not echo. You might want to do this at the top of a file, for example, for import statements.

Sweave echoes code like the R command line, with > for the command prompt. Pweave does not display the Python >>> command line prompt by default, though it will if you use the option term=TRUE at the start of your code block.

In Sweave, you can use \Sexpr to inline a little bit of R code. For example, $x = \Sexpr{sqrt(2)}$ will produce x = 1.414…. You can also use \Sexpr to reference variables defined in previous code blocks. The Pweave analog uses <%= and %>. The previous example would be $x = <%= sqrt(2) %>$.

You can include a figure in Sweave or Pweave by beginning a code block with <<fig=TRUE, echo=FALSE>>= or with echo=TRUE if you want to display the code that produces the figure. With Sweave you don’t need to do anything else with your file. With Pweave you need to add \usepackage{graphicx} at the top.

To process an Sweave file foo.Rnw, run Sweave("foo.Rnw") from the R command prompt. To process a Pweave file foo.Pnw, run Pweave -f tex foo.Pnw from the shell. Either way you get a LaTeX file that you can then compile to a PDF.

Here are sample Sweave and Pweave files. First Sweave:

\documentclass{article}
\begin{document}

Invisible code that sets the value of the variable $a$.

<<echo=FALSE>>=
a <- 3.14
@

Visible code that sets $b$ and squares it.

<<bear, echo=TRUE>>=
b <- 3.15
b*b
@

Calling R inline: $\sqrt{2} = \Sexpr{sqrt(2)}$

Recalling the variable $a$ set above: $a = \Sexpr{a}$.

Here's a figure:

<<fig=TRUE, echo=FALSE>>=
x <- seq(0, 6*pi, length=200)
plot(x, sin(x))
@

\end{document}

And now Pweave:

\documentclass{article}
\usepackage{graphicx}
\begin{document}

<%
import matplotlib.pyplot as plt
from numpy import pi, linspace, sqrt, sin
%>

Invisible code that sets the value of the variable $a$.

<<echo=FALSE>>=
a = 3.14
@

Visible code that sets $b$ and squares it.

<<term=TRUE>>=
b = 3.15
print(b*b)
@

Calling Python inline: $\sqrt{2} = <%= sqrt(2) %>$

Recalling the variable $a$ set above: $a = <%= a %>$.

Here's a figure:

<<fig=TRUE, echo=FALSE>>=
x = linspace(0, 6*pi, 200)
plt.plot(x, sin(x))
@

\end{document}


Three new Python books

This post reviews three Python books that have come out recently:

SciPy and NumPy by Eli Bressert is the smallest book I’ve seen from O’Reilly, aside from books in their pocket guide series. The SciPy and NumPy libraries are huge, and it can be hard to know where to start. This book gives a good, brisk overview. In addition to SciPy and NumPy, it also gives a brief introduction to SciKits, in particular scikit-learn for machine learning and scikit-image for image processing.

(Eli told me that he is working on supplementary material for the book. Everyone who bought the book electronically will automatically receive the new material when it is available.)

Python for Kids by Jason R. Briggs is an introduction to programming aimed at kids. It starts with an introduction to Python and moves on to developing a simple game. It seems to me that kids would find the book interesting. It’s about seven times longer than the SciPy and NumPy book. It moves at a slow pace, has many illustrations, and has a casual tone.

NumPy Cookbook by Ivan Idris contains around 70 small recipes, about three pages each. Many of these are about NumPy itself, but the book covers much more than its title would imply. Out of 10 chapters, four are strictly about NumPy. The first chapter of the book is about IPython. Another chapter is about “connecting NumPy with the rest of the world,” i.e. interfacing with Java, R, Matlab, and cloud services. Two chapters are devoted to profiling, debugging, and optimizing performance. There is a chapter on quality assurance (static analysis, unit testing, and documentation). And the final chapter is about Scikits and Pandas.

Winston Churchill, Bessie Braddock, and Python

Last night I was talking with someone about the pros and cons of various programming languages and frameworks for data analysis. One of the pros of Python is its elegance. The primary con is that it can be slow.

The conversation reminded me of an apocryphal exchange between Winston Churchill and Bessie Braddock.

Braddock: Winston, you are drunk.

Churchill: Yes I am. And you, Bessie, are ugly. But I shall be sober in the morning, and you will still be ugly.

Python can be slow, though there are ways to improve its performance. But ugly code is just ugly, and there’s nothing you can do about it.