How can a statistician help a lawyer?

I’ll be presenting at a webinar on Wednesday, December 13 at 1:00 PM Eastern. The title of the presentation is “Seven questions a statistician and answer for an attorney.”

I will discuss, among other things, when common sense applies and when correct analysis can be counter-intuitive. There will be ample time at the end of the presentation for Q & A.

If you’re interested in attending, you can register here.

ACEDS association of certified e-discovery specialists and LexInsight

Moment generating functions and connections to other things

This post relates moment generating functions to the Laplace transform and to exponential generating functions. It also brings in connections to the z-transform and the Fourier transform.

Thanks to Brian Borchers who suggested the subject of this post in a comment on a previous post on transforms and convolutions.

Moment generating functions

The moment generating function (MGF) of a random variable X is defined as the expected value of exp(tX). By the so-called rule of the unconscious statistician we have

M_X(t) \equiv \mathrm{E}[e^{tX}] = \int_{-\infty}^\infty e^{tx} f_X(x)\, dx

where fX is the probability density function of the random variable X. The function MX is called the moment generating function of X because it’s nth derivative, evaluated at 0, gives the nth moment of X, i.e. the expected value of Xn.

Laplace transforms

If we flip the sign on t in the integral above, we have the two-sided Laplace transform of fX. That is, the moment generating function of X at t is the two-sided Laplace transform of fX at –t. If the density function is zero for negative values, then the two-sided Laplace transform reduces to the more common (one-sided) Laplace transform.

Exponential generating functions

Since the derivatives of MX at zero are the moments of X, the power series for MX is the exponential generating function for the moments. We have

M_X(t) = m_0 + m_1t + \frac{m_2}{2}t^2 + \frac{m_3}{3!} t^3 + \cdots

where mn is the nth moment of X.

Other generating functions

This terminology needs a little explanation since we’re using “generating function” two or three different ways. The “moment generating function” is the function defined above and only appears in probability. In combinatorics, the (ordinary) generating function of a sequence is the power series whose coefficient of xn is the nth term of the sequence. The exponential generating function is similar, except that each term is divided by n!. This is called the exponential generating series because it looks like the power series for the exponential function. Indeed, the exponential function is the exponential generating function for the sequence of all 1’s.

The equation above shows that MX is the exponential generating function for mn and the ordinary generating function for mn/n!.

If a random variable Y is defined on the integers, then the (ordinary) generating function for the sequence Prob(Yn) is called, naturally enough, the probability generating function for Y.

The z-transform of a sequence, common in electrical engineering, is the (ordinary) generating function of the sequence, but with x replaced with 1/z.

Characteristic functions

The characteristic function of a random variable is a variation on the moment generating function. Rather than use the expected value of tX, it uses the expected value of itX. This means the characteristic function of a random variable is the Fourier transform of its density function.

Characteristic functions are easier to work with than moment generating functions. We haven’t talked about when moment generating functions exist, but it’s clear from the integral above that the right tail of the density has to go to zero faster than ex, which isn’t the case for fat-tailed distributions. That’s not a problem for the characteristic function because the Fourier transform exists for any density function. This is another example of how complex variables simplify problems.

Shannon wavelet

The Shannon wavelet has an interesting plot:

Shannon wavelet

Given the complexity of the plot, the function definition is surprisingly simple:

\frac{1}{\pi t} (\sin 2\pi t - \sin\pi t)

The Fourier transform is even simpler: it’s the indicator function of [-2π, -π] ∪ [π, 2π], i.e. the function that is 1 on the intervals [-2π, -π] and [π, 2π] but zero everywhere else.

The Shannon wavelet is orthogonal to integer translates of itself. This isn’t obvious in the time domain, but it’s easy to prove in the frequency domain using Parseval’s theorem.

Here’s a plot of the original wavelet and the wavelet shifted to the left by 3:

Two Shannon wavelets

And here’s a plot of the product of the two wavelets. It’s plausible that the areas above and below the x-axis cancel out each other, and indeed they do.

Product of two Shannon wavelets

Related post: Sinc and Jinc integrals

Transforms and Convolutions

There are many theorems of the form

{\cal T}(f*g) = {\cal T}(f){\cal T}(g)

where f and g are functions, T is an integral transform, and * is a kind of convolution. In words, the transform of a convolution is the product of transforms.

When the transformation changes, the notion of convolution changes.

Here are three examples.

Fourier transform and convolution

With the Fourier transform defined as

{\cal T}(f)(s) = \int_{-\infty}^\infty \exp(-isx) f(x)\, dx

convolution is defined as

(f* g)(x) = \int_{-\infty}^\infty f(x - y) g(y)\, dy

Note: There are many minor variations on the definition of the Fourier transform. See these notes.

Laplace transform and convolution

With the Laplace transform defined as

{\cal T}(f)(s) = \int_0^\infty \exp(-sx) f(x)\, dx

convolution is defined as

(f* g)(x) = \int_0^x f(x - y) g(y)\, dy

Mellin transform and convolution

With the Mellin transform defined as

{\cal T}(f)(s) = \int_0^\infty x^{s-1} f(x)\, dx

convolution is defined as

(f* g)(x) = \int_0^\infty f(y) \, g\left( \frac{x}{y} \right)\, \frac{dy}{y}

Related posts

Gamma function partial sums

Last week I wrote about Jentzsch’s theorem. It says that if the power series of function has a finite radius of convergence, the set of zeros of the partial sums of the series will cluster around and fill in the boundary of convergence.

This post will look at the power series for the gamma function centered at 1 and will use a different plotting style.

Here’s what the contours look like with 12 terms:

Contour plot for 12th partial sum of gamma function

The radius of convergence is 1 because the gamma function has a singularity at 0. (The radius of convergence is the distance from the center of the power series to the nearest singularity.) Contour lines correspond to constant phase. The gamma function is real for real arguments, and so you can see the real axis running through the middle of the plot because real numbers have zero phase. The contour lines meet at zeros, which you can see are near a circle of radius 1 centered at z = 1.

Here’s what the contour plot looks like with 30 terms:

Contour plot for 30th partial sum of gamma function

And here’s what it looks like with just 5 terms:

Contour plot for 5th partial sum of gamma function

Here’s the Mathematica code that was used to create the images.

    P[x_] = Normal[Series[1.0 Gamma[x], {x, 1, 12}]]
    ContourPlot[
        Arg[P[x + I y]], 
        {x, -1, 3}, 
        {y, -2, 2}, 
        ColorFunction -> "PearlColors", 
        Contours -> 20
    ]

The 1.0 in front of the call to Gamma is important. It tells Mathematica to numerically evaluate the coefficients of the power series. Otherwise Mathematica will find the coefficients symbolically and the plots will take forever.

Similarly, the call to Normal tells Mathematica not to carry around a symbolic error term O(x - 1)13.

Hypergeometric functions are key

From Orthogonal Polynomials and Special Functions by Richard Askey:

At first the results we needed were in the literature but after a while we ran out of known results and had to learn something about special functions. This was a very unsettling experience for there were very few places to go to really learn about special functions. At least that is what we thought. Actually there were many, but the typical American graduate education which we had did not include anything about hypergeometric functions. And hypergeometric functions are the key to this subject, as I have found out after many years of fighting them.

Emphasis added.

Askey’s book was written in 1975, and he was describing his experience from ten years before that. Special functions, and in particular hypergeometric functions, went from being common knowledge among mathematicians at the beginning of the 20th century to being arcane by mid century.

I learned little about special functions and nothing about hypergeometric functions as a graduate student. I first ran into hypergeometric functions reading in Concrete Mathematics how they are used in combinatorics and in calculating sums in closed form. Then when I started working in statistics I found that they are everywhere.

Hypergeometric functions are very useful, but not often taught anymore. Like a lot of useful mathematics, they fall between two stools. They’re considered too advanced or arcane for the undergraduate curriculum, and not a hot enough area of research to be part of the graduate curriculum.

Related posts:

Distribution of Fibonacci numbers mod m

The last digits of Fibonacci numbers repeat with period 60. This is something I’ve written about before.

The 61st Fibonacci number is 2504730781961. The 62nd is 4052739537881. Since these end in 1 and 1, the 63rd Fibonacci number must end in 2, etc. and so the pattern starts over.

It’s not obvious that the cycle should have length 60, but it is fairly easy to see that there must be a cycle.

In any base, the last digits must cycle. The length of these cycles varies erratically:

In this post I want to as a different question: how often do Fibonacci numbers take on each of the possible last digits in each base? In other words, how are the Fibonacci numbers distributed mod m for various values of m?

I haven’t tried to prove any theorems. I’ve just run some examples using the following Python code.

    def histogram(base):

        prev = 1
        F = 2
    
        counts = [0]*base
        counts[F] += 1
    
        while F != 1 or prev != 1:
            temp = F
            F = (F + prev) % base
            counts[F] += 1    
            prev = temp

        return counts

In base 10, the last digits of Fibonacci numbers have period 60. Do all digits appear in the cycle? Do they all appear equally often?

Yes and no. Here are the results for base 10:

    >>> histogram(10) 
    >>> [4, 8, 4, 8, 4, 8, 4, 8, 4, 8]

Each even digits appears 4 times and each odd digit appears 8 times.

What happens if we look at the last two digits, i.e. if we look at Fibonacci numbers mod 100?

    >>> histogram(100)
    >>> [2, 6, 2, 2, …, 2, 6, 2, 2]

Each two-digit number n appears six times if n = 1 mod 4. Otherwise it appears two times.

What about the last three digits? Now we see some zeros. For example, no Fibonacci number is congruent to 4 or 6 mod 1000. (Or mod 8.)

    >>> histogram(1000)
    >>> [2, 3, 2, 1, 0, 3, 0, 1, …, 2, 3, 2, 1, 0, 3, 0, 1]

Here’s something curious. The Fibonacci numbers are exactly evenly distributed mod 5.

    >>> histogram(5)
    >>> [4, 4, 4, 4, 4]

The same is apparently true for all powers of 5. Not only are all the frequencies the same, they’re all 4’s. I’ve tested this for powers of 5 up to 510. And conversely, it seems the Fibonacci numbers are not evenly distributed mod m unless m is a power of 5. I’ve tested this for m up to 100,000.

Conjecture: The Fibonacci numbers are uniformly distributed mod m if and only if m is a power of 5.

Update: The conjecture is correct, and was proved in 1972 by Niederreiter.

 

A circle of zeros: Jentzsch’s theorem

Take a function that has a power series with a finite radius of convergence. Then the zeros of the partial sums will be dense around the boundary of convergence. That is Jentzsch’s theorem.

Here are a couple plots to visualize Jentzsch’s theorem using the plotting scheme described in this post.

First, we take the function f(z) = 1/(1 + z²). This function has singularities as i and –i, and so the radius of convergence is 1. These singularities correspond to the two white dots in the plot below.

1/(1 + x^2)

Now let’s look at the Taylor series for f.

f(z) = 1 – z2 + z4z6 + z8 – …

The series only converges inside the unit circle, but the partial sums are just polynomials and so are defined everywhere. Here we plot the partial sum up to z20.

1 - z^2 + z^4 - ... + z^20

The black dots around the unit circle are zeros. The color quickly lightens as you move away from the unit circle because the function values grow quickly. Inside the unit circle, the two graphs match well, though of course outside they are very different.

Here’s another example. This time we’ll look at the gamma function. The power series centered at 1 has radius 1 because the function has a pole at 0.

Here’s a plot of the gamma function. Note the white dot which is the singularity at 0.

gamma function

And here’s a plot of the first 20 terms of the Taylor series centered at 1.

Taylor series for gamma

Orthogonal polynomials and the beta distribution

This post shows a connection between three families of orthogonal polynomials—Legendre, Chebyshev, and Jacobi—and the beta distribution.

Legendre, Chebyshev, and Jacobi polynomials

A family of polynomials Pk is orthogonal over the interval [-1, 1] with respect to a weight w(x) if

\int_{-1}^1 P_m(x) P_n(x) w(x) \, dx = 0

whenever mn.

If w(x) = 1, we get the Legendre polynomials.

If w(x) = (1 – x²)-1/2 we get the Chebyshev polynomials.

These are both special cases of the Jacobi polynomials which have weight w(x) = (1- x)α (1 + x)β. Legendre polynomials correspond to α = β = 0, and Chebyshev polynomials correspond to α = β = -1/2.

Connection to beta distribution

The weight function for Jacobi polynomials is a rescaling of the density function of a beta distribution. The change of variables x = 1 – 2u shows

\int_{-1}^1 f(x) (1-x)^\alpha (1+x)^\beta \, dx = 2^{\alpha + \beta + 1}\int_0^1 f(1-2u) u^\alpha (1-u)^\beta \,du

The right side is proportional to the expected value of f(1 – 2X) where X is a random variable with a beta(α + 1, β+1) distribution. So for fixed α and β, if mn and X has a beta(α + 1, β+1) distribution, then the expected value of Pm(1 – 2X) Pn(1 – 2X) is zero.

While we’re at it, we’ll briefly mention two other connections between orthogonal polynomials and probability: Laguerre polynomials and Hermite polynomials.

Laguerre polynomials

The Laguerre polynomials are orthogonal over the interval [0, ∞) with weight w(x) = xα exp(-x), which is proportional to the density of a gamma random variable with shape α+1 and scale 1.

Hermite polynomials

There are two minor variations on the Hermite polynomials, depending on whether you take the weight to be exp(-x²) or exp(-x²/2). These are sometimes known as the physicist’s Hermite polynomials and the probabilist’s Hermite polynomials. Naturally we’re interested in the latter. The probabilist’s Hermite polynomials are orthogonal over (-∞, ∞) with the standard normal (Gaussian) density as the weight.

Related posts

Runge phenomena

I’ve mentioned the Runge phenomenon in a couple posts before. Here I’m going to go into a little more detail.

First of all, the “Runge” here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with cowabunga, not with sponge.

Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function f(x) = 1/(1 + x²) on the interval [-5, 5], or equivalently, and more convenient here, the function f(x) = 1/(1 + 25x²) on the interval [-1, 1]. Here’s an example with 16 interpolation nodes.

Runge's example

Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255]  on the imaginary axis. For more details, see [2].

The function in Runge’s example has a singularity at 0.2i, which is inside the football. Linear interpolation at evenly spaced points would converge for the function f(x) = 1/(1 + x²) since the singularity at i is outside the football.

Runge's example

For another example, consider the function f(x) = exp(- 1/x²) , defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there’s a small indication of trouble at the ends.

Interpolating exp(-1/x^2)

With 28 interpolation points in the plot below, the lack of convergence is clear.

Interpolating exp(-1/x^2)

The problem is not polynomial interpolation per se but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities effects the rate of convergence but not whether the interpolants converge.

RelatedHelp with interpolation

***

[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.

[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Twenty questions and conditional probability

The previous post compared bits of information to answers in a game of Twenty Questions.

The optimal strategy for playing Twenty Questions is for each question to split the remaining possibilities in half. There are a couple ways to justify this strategy: mixmax and average.

The minmax approach is to minimize the worse thing that could happen. The worst thing that could happen after asking a question is for your subject to be in the larger half of possibilities. For example, asking whether someone is left-handed is not a good strategy: the worst case scenario is “no,” in which case you’ve only narrowed your possibilities by 10%. The best mixmax approach is to split the population in half with each question.

The best average approach is also to split the possibilities in half each time. With the handedness example, you learn more if the answer is “yes,” but there’s only a 10% change that the answer is yes. There’s a 10% chance of gaining 3.322 bits of information, but a 90% chance of only gaining 0.152 bits. So the expected number of bits, the entropy, is 0.469 bits. Entropy is maximized when all outcomes are equally likely. That means you can’t learn more than 1 bit of information on average from a yes/no question, and you learn the most when both possibilities are equally likely.

Now suppose you want to ask about height and sex. As in the previous post, we assume men and women’s heights are normally distributed with means 70 and 64 inches respectively, and both have standard deviation 3 inches.

If you ask whether a person is taller than 67 inches, you split a mixed population of men and women in half. You will learn 1 bit of information from this question, but you’ve put yourself in a suboptimal position for the next question. A height of at least 67 inches half of the adult population in general, but it selects a majority of men and a minority of women. And as we discussed above, uneven splits are suboptimal, in the worst case and on average.

If you’re going to ask about height and sex, ask about sex first. If the person is female, ask next whether her height is above 64 inches. But if the person is male, ask whether his height is above 70 inches. That is, you want to split the population evenly at each step conditioning on your previous answer. A cutoff of 67 inches is optimal unconditionally, but suboptimal if you condition on sex.

The optimal strategy for Twenty Questions is to ask a question with probability of 1/2 being true, conditional on all previous data. You might get lucky with uneven conditional splits, but on average, and in the worst case, you won’t do as well.

Handedness, introversion, height, blood type, and PII

I’ve had data privacy on my mind a lot lately because I’ve been doing some consulting projects in that arena.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness caries -log2(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately follow a normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you’d get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmative.

***

[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10-15. This is an example of where the normal distribution assumption breaks down in the extremes.

Pareto distribution and Benford’s law

The Pareto probability distribution has density

f(x) = \frac{a}{x^{a+1}}

for x ≥ 1 where a > 0 is a shape parameter. The Pareto distribution and the Pareto principle (i.e. “80-20” rule) are named after the same person, the Italian economist Vilfredo Pareto.

Samples from a Pareto distribution obey Benford’s law in the limit as the parameter a goes to zero. That is, the smaller the parameter a, the more closely the distribution of the first digits of the samples come to following the distribution known as Benford’s law.

Here’s an illustration of this comparing the distribution of 1,000 random samples from a Pareto distribution with shape a = 1 and shape a = 0.2 with the counts expected under Benford’s law.

Distribution of leading digits of Pareto samples in base 10

Note that this has nothing to do with base 10 per se. If we look at the leading digits as expressed in any other base, such as base 16 below, we see the same pattern.

Distribution of leading digits of Pareto samples in base 16

More posts on Benford’s law


More posts on Pareto

Random number generation posts

Random number generation is typically a two step process: first generate a uniformly distributed value, then transform that value to have the desired distribution. The former is the hard part, but also the part more likely to have been done for you in a library. The latter is relatively easy in principle, though some distributions are hard to (efficiently) sample from.

Here are some posts on testing a uniform RNG.

Here’s a book chapter I wrote on testing the transformation of a uniform RNG into some other distribution.

A few posts on manipulating a random number generator.

And finally, a post on a cryptographically secure random number generator.

Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from scipy.integrate import quad
    from scipy.stats import beta as beta
    from scipy import log2

    def infogain(post_a, post_b, prior_a, prior_b):

        p = beta(post_a, post_b).pdf
        q = beta(prior_a, prior_b).pdf

        (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
        return info

This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so your expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
    print( infogain(3, 8, 3, 7) )

The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.