Posts tagged as:

Probability and Statistics

Incredibly simple approximation

by John on June 29, 2009

Suppose you need to find the slope of a line through a set of data. You can get a surprisingly good approximation by simply fitting a line to the first and last points. This is known as “Bancroft’s rule.” It seems too good to be true. Of course just fitting a line to just two points is not as good as using all the data, but unless you have a fairly large amount of data, it’s not too much worse either. It’s good enough for a quick estimate.

Just how good is this estimate compared to using all the data? We’ll look at the technical details of an example below.

Suppose you have a regression model y = α + βx + ε where ε is random noise. Suppose ε is normally distributed with mean 0 and variance σ2. Let b be the least squares estimator of β. The variance in b is

\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}

Now suppose we have observations yi corresponding to xi = 0, 1, 2, …, 2n. The average value of x is n, and the denominator in the expression for the variance of the slope estimator is 2(12 + 22 + 32 + … + n2) = (2n3 + 3n2 + n)/3. If we just use the data at x = 0 and x = 2n, the denominator is (0 - n)2 + (2n - n)2 = 2n2.

If we divide the estimator variance based on Bancroft’s rule by the estimator variance using all the data, the σ2 terms cancel and we are left with n/3 + 1/2 + 1/6n. So Bancroft’s rule increases the variance in the estimate for the slope by roughly n/3 compared to using all the data. Thus it increases the confidence interval by roughly the square root of n/3. So if you had 12 data points, the confidence interval would be about twice as wide. Said another way, the estimate based on all the data is only twice as good as the estimate based on just the first and last points.

Related posts:

Probability mistake can give a good approximation
Rolling dice for normal samples
Approximate problems and approximate solutions
Comparing two ways to fit a line to data

{ 1 comment }

If you run into someone on the street, the probability that the other person shares your birthday is 1/365. If you run into five people, the probability that at least one of them shares your birthday is 5/365, right?

The answer 5/365 is quite accurate, but not exactly correct. It’s good enough for this problem since there are other practical difficulties besides the quality of the approximation. For example, the problem implicitly assumes there are 365 days in a year, i.e. that no one is ever born on Leap Day.

Now think about a similar problem. Suppose the chance of rain is 40% each day for the next three days. Does that mean there is a 120% chance that it will rain at least one of the next three days? That can’t be. In fact, the chance of some rain over the next three days is 78.4%.

The following rule appears to be correct for birthdays but not for predicting rain.

If the probability of a success in one attempt is p, then the probability of at least one success in n attempts is np.

Why does this rule hold sometimes and not at other times? If the probability of success on each attempt is p, the probability of failure on each attempt is (1-p). The probability of n failures in a row is (1-p)n and so the probability of at least one success is 1 - (1-p)n. That’s the right way to approach the birthday example and the rain example. In the birthday example, p = 1/365 and so the probability of running into at least one person in five who shares your birthday is 1 - (364/365)5 = 0.013624. And 5/365 = 0.013699 is a very good approximation. In the rain example, p = 0.4 and the probability of at least one day of rain out of the next three is 1 - 0.63 = 0.784. The difference between the birthday example and the rain example is the size of p. The following equation, based on the binomial theorem, explains why.

1 - (1-p)^n = np - {n \choose 2}p^2 + {n \choose 3} p^3 - {n \choose 4}p^4 + \cdots

When we use np as our approximation, we’re ignoring the terms involving p2 and higher powers of p. When p is small, higher powers of p are very small and can be ignored. That’s why the approximation worked well for p = 1/365. But when p is large, say p = 0.4, the error is large; the terms involving higher powers of p are important in that case. Notice also that the size of n matters. The birthday example breaks down when n is large. If you run into 400 people, it is likely that one of them will share your birthday, but far from certain. The probability in that case is about 2/3,  not 400/365.

When p and n are both small, the probability of at least one success out of n tries is approximately np. We can say more. Because the first term the approximation drops from the equation above has a negative sign, our approximation is also an upper bound. This says np slightly over-estimates the probability.

Now how small do p and n have to be? If you calculate the approximation np and get a small answer, then it’s a good answer. Why? The error in the np approximation is roughly n(n-1)p2/2, which is less than (np)2. And if np is small, (np)2 is very small.

See Sales tax included for a similar discussion. That also post looks at a common mistake and explains why it makes a good approximation under some circumstances and not under others.

{ 3 comments }

John Tukey’s median of medians

by John on June 23, 2009

Yesterday I got an email from Jestin Abraham asking a question about Tukey’s “median of medians” paper from 1978. (The full title is “The Ninther, a Technique for Low-Effort Robust (Resistant) Location in Large Samples.”) Jestin thought I might be familiar with the paper since I’ve written about Tukey several times, but I’d never heard of it.

Tukey’s “ninther” or “median of medians” procedure is quite simple. Understanding the problem he was trying to solve is a little more difficult.

Suppose you are given nine data points: y1, y2, …, y9. Let yA be the median of the first three samples, yB the median of the next three samples, and yC the median of the last three samples. The “ninther” of the data set is the median of yA, yB, and yC, hence the “median of medians.” If the data were sorted, the ninther would simply be the median, but in general it will not be.

For example, suppose your data are 3, 1, 4, 4, 5, 9, 9, 8, 2.  Then

yA = median( 3, 1, 4 ) = 3
yB = median( 4, 5, 9 ) = 5
yC = median( 9, 8, 2 ) = 8

and so the ninther is median( 3, 5, 8 ) = 5. The median is 4.

That’s Tukey’s solution, so what was his problem? First of all, he’s trying to find an estimate for the central value of a large data set. Assume the data come from a symmetric distribution so that the mean equals the median. He’s looking for a robust estimator of the mean, an estimator resistant to the influence of outliers. That’s why he’s using an estimator that is more like the median than the mean.

Why not just use the median? Computing the sample median requires storing all data points and then sorting them to pick the middle value. Tukey wants to do his computation in one pass without storing the data. Also, he wants to do as few comparisons and as few arithmetic operations as possible. His ninther procedure uses no arithmetic operations and only order comparisons. He shows that it uses only about 1.1 comparisons per data point on average and 1.33 comparisons per data point in the worst case.

How well does Tukey’s ninther perform? He shows that if the data come from a normal distribution, the ninther has about 55% efficiency relative to the sample mean. That is, the variances of his estimates are a little less than twice the variances of estimates using the sample mean. But the purpose of robust statistics is efficient estimation in case the data do not come from a normal distribution but from a distribution with thicker tails. The relative efficiency of the ninther improves when data do come from distributions with thicker tails.

Where do large data sets come in? So far we’ve only talked about analyzing data sets with nine points. Tukey’s idea was to use the ninther in conjunction with the median. For some large number M, you could estimate the central value of 9M data points by applying the ninther to groups of 9 points and take the median of the M ninthers. This still requires computing the median of M points, but the memory requirement has been reduced by a factor of 9. Also, the sorting time has been reduced by more than a factor of 9 since sorting n points takes time proportional to n log n.

For even larger data sets, Tukey recommended breaking the data in to sets of 81 points and computing the ninther of the ninthers. Then 81M data points could be processed by storing and sorting M values.

Tukey gave M = 1,000,000 as an example of what he called an “impractically large data set.” I suppose finding the median of 81 million data points was impractical in 1978, though it’s a trivial problem today. Perhaps Tukey’s ninther is sill useful for an embedded device with extremely limited resources that must process enormous amounts of data.

Other posts on robust statistics:

Canonical examples from robust statistics
Efficiency of median versus mean

Other posts on John Tukey:

Approximate problems and approximate solutions
John Tukey and Aristotle
Tukey tallying
When discoveries stay discovered

{ 4 comments }

Classical statistics in a nutshell

by John on May 4, 2009

Here’s another quote from Anthony O’Hagan’s book Bayesian Inference.

All classical inference statements … are probability statements about x given θ, phrased so as to appear to be probability statements about θ.

Emphasis in the original.

Related posts:

Four reasons to use Bayesian inference
Four pillars of Bayesian statistics

{ 0 comments }

Four reasons to use Bayesian inference

by John on April 28, 2009

The following is a direct quote from Anthony O’Hagan’s book Bayesian Inference. I’ve edited the quote only to enumerate the points.

Why should one use Bayesian inference, as opposed to classical inference? There are various answers. Broadly speaking, some of the arguments in favour of the the Bayesian approach are that it is

  1. fundamentally sound,
  2. very flexible,
  3. produces clear and direct inferences,
  4. makes use of all available information.

I’ll elaborate briefly on each of O’Hagan’s points.

Bayesian inference has a solid philosophical foundation. It is consistent with certain axioms of rational inference. Non-Bayesian systems of inference, such as fuzzy logic, must violate one or more of these axioms; their conclusions are rationally satisfying to the extent that they approximate Bayesian inference.

Bayesian inference is at the same time rigid and flexible. It is rigid in the sense that all inference follows the same form: set up a likelihood and a prior, then calculate the posterior by conditioning on observed data via Bayes theorem. But this rigidity channels creativity into useful directions. It provides a template for setting up complex models when necessary.

Frequentist inferences are awkward to explain. For example, confidence intervals and p-values are tedious to define rigorously. Most consumers of confidence intervals and p-values do not know what they mean and implicitly assume Bayesian interpretations. The difference is not simply pedantic. Particularly with regard to p-values, the common understanding can be grossly inaccurate. By contrast, Bayesian counterparts are simple to define and interpret. Bayesian credible intervals are exactly what most people think confidence intervals are. And a Bayesian hypotheses test simply compares the probability of each hypothesis via Bayes factors.

Sometimes the necessity of specifying prior distributions is seen as a drawback to Bayesian inference. On the other hand, the ability to specify prior distributions means that more information can be incorporated in an inference. See Musicians, drunks, and Oliver Cromwell for a colorful illustration from Jim Berger on the need to incorporate prior information.

Related posts:

Four pillars of Bayesian statistics
Bayesian statistics is misnamed
What is a confidence interval?
Why most published research results are false

{ 3 comments }

Bayesian statistics is misnamed

by John on April 20, 2009

I’m teaching an introduction to Bayesian statistics. My first thought was to start with Bayes theorem, as many introductions do. But this isn’t the right starting point. Bayes’ theorem is an indispensable tool for Bayesian statistics, but it is not the foundational principle. The foundational principle of Bayesian statistics is the decision to represent uncertainty by probabilities. Unknown parameters have probability distributions that represent the uncertainty in our knowledge of their values.

Once you decide to use probabilities to express parameter uncertainty, you inevitably run into the need for Bayes theorem to work with these probabilities. Bayes theorem is applied constantly in Bayesian statistics, and that is why the field takes its name from the theorem’s author, Reverend Thomas Bayes (1702-1761). But “Bayesian” doesn’t describe Bayesian statistics quite the same way that “Frequentist” described frequentist statistics. The term “frequentist” gets to the heart of how frequentist statistics interprets probability. But “Bayesian” refers to a Bayes theorem, a computational tool for carrying out probability calculations in Bayesian statistics. If frequentist statistics were analogously named, it might be called “Bernoullian statistics” after Jacob Bernoulli’s law of large numbers.

The term “Bayesian” statistics might imply that frequentist statisticians dispute Bayes’ theorem. That is not the case. Bayes’ theorem is a simple mathematical result. What people dispute is the interpretation of the probabilites that Bayesians want to stick into Bayes’ theorem.

I don’t have a better name for Bayesian statistics. Even if I did, the name “Bayesian” is firmly established. It’s certainly easier to say “Bayesian statistics” than to say “that school of statistics that represents uncertainty in unknown parameters by probabilities,” even though the latter is accurate.

Related posts:

Four pillars of Bayesian statistics
What a probability means
Plausible reasoning
The probability that Shakespeare wrote a play

{ 5 comments }

Four pillars of Bayesian statistics

by John on April 7, 2009

Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:

  1. Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yard stick. You always know something before you do an experiment.
  2. Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
  3. Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
  4. No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.

This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.

For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?

Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why only consider biased vs. unbiased rather, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.

Why minimize expected mean squared error? Efficiency in classical statistics is typically measured by expected mean squared error. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.

How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.

Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.

There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)

Related posts:

Probability and information
What is a confidence interval?

{ 8 comments }

Canonical examples from robust statistics

by John on March 11, 2009

The following definition of robust statistics comes from P. J. Huber’s book Robust Statistics.

… any statistical procedure should possess the following desirable features:

  1. It should have a reasonably good (optimal or nearly optimal) efficiency at the assumed model.
  2. It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly. …
  3. Somewhat larger deviations from the model should not cause a catastrophe.

Classical statistics focuses on the first of Huber’s points, producing methods that are optimal subject to some criteria. This post looks at the canonical examples used to illustrate Huber’s second and third points.

[click to continue...]

{ 0 comments }

Two common ways to estimate the center of a set of data are the sample mean and the sample median. The sample mean is sometimes more efficient, but the sample median is always more robust. (I’m going to cut to the chase first, then go back and define basic terms like “median” and “robust” below.)

When the data come from distributions with thick tails, the sample median is more efficient. When the data come from distributions with a thin tail, like the normal, the sample mean is more efficient. The Student-t distribution illustrates both since it goes from having thick tails to having thinner tails as the degrees of freedom, denoted ν, increase.

When ν = 1, the Student-t is a Cauchy distribution and the sample mean wanders around without converging to anything, though the sample median behaves well. As ν increases, the Student-t becomes more like the normal and the relative efficiency of the sample median decreases.

Here is a plot of the asymptotic relative efficiency (ARE) of the median compared to the mean for samples from a Student-t distribution as a function of the degrees of freedom ν. The vertical axis is ARE and the horizontal axis is ν.

The curve crosses the top horizontal line at 4.67879. For values of ν less than that cutoff, the median is more efficient. For larger values of ν, the mean is more efficient. As ν gets larger, the relative efficiency of the median approaches the corresponding relative efficiency for the normal, 2/π = 0.63662, indicated by the bottom horizontal line.

Backing up

The sample mean is just the average of the sample values. The median is the middle value when the data are sorted.

Since data have random noise, statistics based on the data are also random. Statistics are generally less random than the data they’re computed from, but they’re still random. If you were to compute the mean, for example, many times, you’d get a different result each time. The estimates bounce around. But there are multiple ways of estimating the same thing, and some ways give estimates bounce around less than others. These are said to be more efficient. If your data come from a normal example, the sample median bounces around about 25% more than the sample mean. (The variance of the estimates is about 57% greater, so the standard deviations are about 25% greater.)

But what if you’re wrong? What if you think the data are coming from a normal distribution but they’re not. Maybe they’re coming from another distribution, say a Student-t. Or maybe they’re coming from a mixture of normals. Say 99% of the time the samples come from a normal distribution with one mean, but 1% of the time they come from a normal distribution with another mean. Now what happens? That is the question robustness is concerned with.

Say you have 100 data points, and one of them is replaced with ∞. What happens to the average? It becomes infinite. What happens to the median? Either not much or nothing at all depending on which data point was changed. The sample median is more robust than the mean because it is more resilient to this kind of change.

Asymptotic relative efficiency (ARE) is a way of measuring how much statistics bounce around as the amount of data increases. If I take n data points and look at √n times the difference between my estimator and the thing I’m estimating, often that becomes approximately normally distributed as n increases. If I do that for two different estimators, I can take the ratio of the variances of the normal distributions that this process produces for each. That’s the asymptotic relative efficiency.

Often efficiency and robustness are in tension and you have to decide how much efficiency you’re willing to trade off for how much robustness. ARE gives you a way of measuring the loss in efficiency if you’re right about the distribution of the data but choose a more robust, more cautious estimator. Of course if you’re significantly wrong about the distribution of the data (and often you are!) then you’re better off with the more robust estimator.

{ 1 comment }

Finding the shortest interval with given mass

by John on February 23, 2009

Here’s an elegant little theorem applied in statistics but useful more generally. Suppose you have a density function f(x) with one hump. Suppose a and b are two points on opposite sides of the hump with f(a) = f(b). Then [a, b] is the shortest interval with its mass. That is, any other interval of length b-a will have less mass than the interval [a, b]. (Here the “mass” of an interval is just the integral of f(x) over that interval.)

Suppose we want to find the shortest interval that has a given mass k. Start by imagining a horizontal line sitting on top of the graph of f(x).

graph of gamma(4,1) pdf

Now lower this horizontal line so that it intersects the graph in two places.

graph of gamma(4,1) pdf cut at height 0.1

Draw vertical lines down from these two points of intersection to find their x-coordinates.

graph showing how to find the x coordinates of the two intersection points

In this example, the two x-coordinates are about 1.30 and 5.77. So the interval [1.30, 5.77] is the shortest interval with its mass. In other words, no other interval of length 4.47 can contain more mass than this interval does.

We can find the shortest interval of mass k by lowering this horizontal line until the interval it defines has mass k. The lower the horizontal line, the greater the mass. So for any given mass less than the total mass f(x) assigns, there is a unique height of the horizontal line that defines an interval with that mass.

This procedure could be used to find the shortest confidence interval or the shortest Bayesian credible interval. In that case the “mass” is probability, and the task is to find the shortest interval containing a specified probability. The theorem says that the shortest confidence interval or credible interval has equal probability density at each end of the interval.

A proof of this theorem is given in Statistical Inference, chapter 9. Technically, f(x) must be unimodal and positive with finite integral. A homework exercise in the same chapter outlines a simpler proof using the additional assumption that f(x) is continuous.

Related post: What is a confidence interval?

{ 0 comments }

The data may not contain the answer

by John on February 18, 2009

Mark Reid sent me a link to a couple quotes by John Tukey that I had not seen before. First,

To statisticians, hubris should mean the kind of pride that fosters and inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again.

Also,

The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

Here are some more posts about John Tukey:

Approximate problems and approximate solutions
Innovation IV
Tukey tallying
How to linearize data for regression

{ 2 comments }

Here’s an easy error to fall into in statistics. Suppose I have n samples from a normal(μ, σ2) distribution, say n = 16, and σ is unknown. What is the distribution of the average of the samples? A common mistake is to say Student-t: if σ is known, the sample mean has a normal distribution, otherwise it has a t distribution.

But that’s wrong. Your ignorance of σ does not change the distribution of the data. There’s no spooky quantum effect that changes the data based on your knowledge. A linear combination of independent normal random variables is another normal random variable, so the sample mean has a normal distribution, whether or not you know its variance. Your knowledge or ignorance of σ doesn’t change the distribution of the data; it changes what you’re likely to want to do regarding the data. When the variance is unknown, you use procedures involving the sample variance rather than the distribution variance. This doesn’t change the distribution of the data but it changes the distribution you construct (implicitly) in your analysis of the data.

{ 3 comments }

Sums of uniform random values

by John on February 12, 2009

Suppose you have uniform random samples from the interval [0, 1].  If you add a large number of such samples together, the sum has an approximately normal distribution according to the central limit theorem. But how many do you have to add together to get a good approximation to a normal? Well, two is clearly not enough. Here’s what the density looks like for the sum of two uniform values.

But look what happens when we add three uniforms together. If you’re very familiar with the normal distribution and have a good eye, you might be able to tell that the density is a little flat at the top.

Here’s a graph that shows the distribution for a sum of four uniforms. The dotted curve is the graph of the normal distribution with the same mean and variance as the sum of the four uniforms. If the two graphs were not on top of each other, hardly anyone could tell which was the normal and which was the sum of uniforms.

Note that this fast convergence to a normal distribution is a special property of uniform random variables. The sum of four exponential random variables, for example, does not look nearly so normal.

Here’s the analytic expression for the density of the sum of n uniform random variables that was used to produce these graphs.

\frac{1}{(n-1)!} \sum_{k=0}^n (-1)^k {n \choose k} (x-k)_+^{n-1}

Here the notation z+ stands for the positive part of z, i.e. the expression z+ equals z if z is positive and equals 0 otherwise.

According to Feller’s classic book, the density result above was discovered by Lagrange. Feller gives a series of exercises leading up to this result. First he gives the distribution function for the sum of n-sided dice and asks the reader to prove the result. Then the reader is asked to take the limit as the number of sides on each dice goes to infinity to derive the result for uniform random variables.

John Venier left a comment to a previous post about the following method for generating a standard normal: add 12 uniform random variables and subtract 6. Note that this is not just any normal distribution but a standard normal, i.e. mean 0 and variance 1. Since the variance of a single uniform random variable is 1/12, adding 12 such values makes the variance 1. How good is the generator he describes? The maximum difference between the CDF of his generator and the CDF of a standard normal is about 0.00234.

Related post:  Rolling dice for normal samples

{ 4 comments }

Rolling dice for normal samples

by John on February 10, 2009

Roll five dice and use the sum to simulate samples from a normal distribution. This is an idea I ran across in Teaching Statistics by Andrew Gelman and Deborah Nolan. The book mentions this idea in the context of a classroom demonstration where students are to simulate IQ scores.

I was surprised that just five dice would do a decent job.  So just how good a job does the dice generator do?

To find out, I looked at the difference between the cumulative densities of the dice generator and an exact normal generator. When you roll a single six-sided die, the outcomes have mean 3.5 and variance 35/12, and so the corresponding mean and variance for rolling 5 dice is 5 times greater. By the central limit theorem, the sum of the five rolls should have approximately the same distribution as a normal random variable with the same mean and variance. So what we want to do is compare the distribution of the sum of the five dice rolls to a Normal(17.5, 3.8192).

Now how do we get the distribution of the sum of the five dice? For one die it’s easy: the numbers 1 through 6 all have probability 1/6 of showing up. The sum of two dice is harder, but not too hard to brute force. But if you’re looking at five or more dice, you need to be a little clever.

By using generating functions, you can see that the probability of getting a sum of k spots on the five dice is the coefficient of xk in the expansion of

(x + x2 + x3 + x4 + x5 + x6)5 / 65

In Mathematica code, the probability density for the sum of the five dice is

pmf = Table[
  6^-5 Coefficient[(x + x^2 + x^3 + x^4 + x^5 + x^6)^5, x^i],
  {i, 30}]

To get the distribution function, we need to create a new list whose nth item is the sum of the first n items in the list above. Here’s the Mathematica code to do that.

cdf = Rest[FoldList[Plus, 0, pmf]]

The FoldList repeatedly applies the function supplied as its first argument, in this case Plus, starting with the second argument, 0. The Rest function removes the extra 0 that FoldList adds to the front of the list. (Like cdr for Lisp fans.)

Here’s the Mathematica code to produce the corresponding normal distribution.

norm = Table[
    CDF[NormalDistribution[3.5*5, Sqrt[5*35/12]], i],
    {i, 30}]

Here’s what the difference between the dice distribution and the normal distribution looks like:

plot of cdf - norm

The maximum error is about 5%. It’s not an industrial quality random number generator, but it’s perfectly adequate for classroom demonstrations.

Now what would happen if we rolled more than five dice? We’d expect a better approximation. In fact, we can be more precise. The central limit theorem suggests that as we roll n dice, the maximum error will be proportionate to 1/√ n when n is large. So if we roll four times as many dice, we’d expect the error to go down by a factor of two. In fact, that’s just what we see. When we go from 5 dice to 20 dice, the maximum error goes from 0.0525 to 0.0262.

{ 11 comments }

What is a confidence interval?

by John on February 3, 2009

At first glance, a confidence interval is simple. If we say [3, 4] is a 95% confidence interval for a parameter θ, then there’s a 95% chance that θ is between 3 and 4. That explanation is not correct, but it works better in practice than in theory.

If you’re a Bayesian, the explanation above is correct if you change the terminology from “confidence” interval to “credible” interval. But if you’re a frequentist, you can’t make probability statements about parameters.

Confidence intervals take some delicate explanation. I took a look at Andrew Gelman and Deborah Nolan’s book Teaching Statistics: a bag of tricks,  to see what they had to say about teaching confidence intervals. They begin their section on the topic by saying “Confidence intervals are complicated …” That made me feel better. Some folks with more experience teaching statistics also find this challenging to teach. And according to The Lady Testing Tea, confidence intervals were controversial when they were first introduced.

From a frequentist perspective, confidence intervals are random, parameters are not, exactly the opposite of what everyone naturally thinks. You can’t talk about the probability that θ is in an interval because θ isn’t random. But in that case, what good is a confidence interval? As L. J. Savage once said,

The only use I know for a confidence interval is to have confidence in it.

In practice, people don’t go too wrong using the popular but technically incorrect notion of a confidence interval. Frequentist confidence intervals often approximate Bayesian credibility intervals; the frequentist approach is more useful in practice than in theory.

It’s interesting to see a sort of détente between frequentist and Bayesian statisticians. Some frequentists say that the Bayesian interpretation of statistics is nonsense, but the methods these crazy Bayesians come up with often have good frequentist properties. And some Bayesians say that frequentist methods, such as confidence intervals, are useful because they can come up with results that often approximate Bayesian results.

{ 0 comments }