From the category archives:

Statistics

Does gaining weight make you taller?

by John on March 12, 2010

In his autobiography, The Pleasures of Statistics, Frederick Mosteller gives an amusing example of why observational studies are no substitute for doing experiments.

We are all familiar with the idea that we can estimate height in male adults from their weight. … But not one of us believes that adding 20 pounds by eating and minimizing exercise will add an inch to our height.

The problem is not simply that the direction of causality is backward; it’s that we cannot use a static description to predict what will happen if we change something.

Although regression situations may give one the illusion of finding out what would happen if we changed something, in the absence of an experiment they merely offer guesses.

He summarizes his point by quoting George Box:

To find out what happens to a system when you interfere with it, you have to interfere with it (and not just passively observe it).

Remember this next time you hear claims such as every dollar spent on X saves so many dollars spent on Y. Or every minute spent exercising increases your life expectancy by so many minutes. Or every time you do some activity you increase or decrease your risk of cancer by so much. First of all, these kinds of statements are linear extrapolations on situations that are not linear. Second, they may be observations that do not describe what will happen when you change something. They may be no more true than the idea that gaining weight makes you taller.

Here’s an example of how observation and intervention differ. Lottery winners often go bankrupt within a couple years of receiving their prize. If you suddenly make someone a millionaire, they’re not a typical millionaire.

Related posts:

Numerator-only data
Randomized trials of parachute use


Numerator-only data

by John on March 11, 2010

I learned a useful new phrase today: numerator-only data. This is data without anything to compare it to, no denominator. I ran across the term in Frederick Mosteller’s autobiography. He illustrates the problem with the following old joke.

“Why do the white horses eat more than the black horses?”
“Don’t know. Why?”
“Because we have ten times as many white horses as black horses.”

Numerator-only data is data that leaves you asking “compared to what?” If I tell you the NASDAQ stock index closed at 2368 today, is that good or bad? The number by itself means nothing. Is that up or down compared to last week? Last year? If I tell you, for example, that the record high value was 5047, that gives you a denominator to compare it to.


Does lightning prefer metal or wood?

by John on March 5, 2010

The video below features a demonstration that lightning is as likely to strike wood as metal.

I want to focus on one line from the video. After showing simulated lightning strikes that hit a wooden rod five times and a copper rod five times, the narrator says

It’s five all, proof that metal does not attract lightning.

No, such an experiment would prove no such thing. I imagine the researchers conducted a much larger experiment and selected a representative sample. And I’m willing to accept their conclusion that metal does not attract lightning. But I would not accept such a conclusion from an experiment with 10 samples. What the experiment proves is that, under their experimental conditions, lightning will sometimes strike wood even when a metal rod is nearby.

I have two complementary criticisms of this made-for-video science.

  1. The results could easily happen if their conclusion were not true.
  2. The results could easily not have happened if their conclusion were true.

Suppose in reality, lightning will not always strike the metal rod, but will prefer the metal. Suppose in the long run, lightning will strike the metal rod 60% of the time. It would not be unusual in that case to do an experiment with 10 strikes and find that half or more of the strikes hit wood.
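To put a rough number on this, here is a minimal sketch that treats each strike as an independent trial with a 60% chance of hitting metal. Independence is an assumption of this illustration, and the helper name `prob_at_least` is made up for the example.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Assumed scenario: metal is hit 60% of the time, so wood is hit 40% of the
# time, independently on each of 10 strikes.
p_wood = 0.4
p_half_or_more_wood = prob_at_least(5, 10, p_wood)
print(f"P(at least 5 of 10 strikes hit wood) = {p_half_or_more_wood:.3f}")
```

Under these assumptions the probability is a little over one in three, so a result with half or more wood strikes would be unremarkable even if lightning did favor metal.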

Now suppose the researchers are exactly correct. In the long run, lightning has no preference for one rod or the other. What would viewers have thought if they showed a clip of 10 strikes, of which 6 hit metal and 4 hit wood? Many would have howled in protest. If lightning really had no preference for metal, the result should have been an even split, right? This is an example of the Law of Small Numbers. People underestimate the variability of small samples.

If the probability of lightning striking each rod is 50%, then in a sequence of experiments each containing 10 strikes, most will not have an exact 5-5 split. If you flip 10 fair coins, the most likely outcome is a 5-5 split, but this will happen only about 1/4 of the time. It’s more likely that you’ll get near a 5-5 split, sometimes with more heads and sometimes with more tails.
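The coin-flip figure can be checked exactly with a few lines of arithmetic. This sketch uses exact fractions; the function name `split_probability` and the choice of 4 to 6 heads as a "near" split are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def split_probability(heads, n):
    """Exact probability of `heads` heads in n flips of a fair coin."""
    return Fraction(comb(n, heads), 2**n)

# Exact 5-5 split in 10 flips.
exact_5_5 = split_probability(5, 10)

# A "near" split, taken here to mean 4, 5, or 6 heads.
near_split = sum(split_probability(k, 10) for k in (4, 5, 6))

print("exact 5-5:", exact_5_5, float(exact_5_5))
print("4 to 6 heads:", near_split, float(near_split))
```

The exact split has probability 252/1024, just under one quarter, while landing somewhere from 4 to 6 heads is considerably more likely.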

The exact 5-5 split in the video is good showmanship, but it’s misleading science.

Related posts:

Law of small numbers
Example of the law of small numbers
Law of medium numbers


p-values are inconsistent

by John on March 3, 2010

If there’s evidence that an animal is a bear, you’d think there’s even more evidence that it’s a mammal. It turns out that p-values fail this common sense criterion as a measure of evidence.

I just ran across a paper by Mark Schervish [1] that contains a criticism of p-values I had not seen before. p-values are commonly used as measures of evidence despite the protests of many statisticians. It seems reasonable that a measure of evidence would have the following property. If a hypothesis H implies another hypothesis H’, then evidence in favor of H’ should be at least as great as evidence in favor of H.

Here’s one of the examples from Schervish’s paper. Suppose data come from a normal distribution with variance 1 and unknown mean μ. Let H be the hypothesis that μ is contained in the interval (-0.5, 0.5). Let H’ be the hypothesis that μ is contained in the interval (-0.82, 0.52). Then suppose you observe x = 2.18. The p-value for H is 0.0502 and the p-value for H’ is 0.0498. This says there is more evidence to support the hypothesis H that μ is in the smaller interval than there is to support the hypothesis H’ that μ is in the larger interval. If we adopt α = 0.05 as the cutoff for significance, we would reject the hypothesis that -0.82 < μ < 0.52 but accept the hypothesis that -0.5 < μ < 0.5. We’re willing to accept that we’ve found a bear, but doubtful that we’ve found a mammal.
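The two p-values can be reproduced with a short calculation. The sketch below assumes the p-value for an interval hypothesis a < μ < b, given a single observation x from N(μ, 1), is Φ(−(d + h)) + Φ(−(d − h)), where d is the distance from x to the interval midpoint and h is the interval half-width. That formula is a reading of the setup rather than something spelled out in the text above, but it does reproduce the quoted figures. The function names are hypothetical.

```python
from math import erfc, sqrt

def normal_cdf(z):
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2.0))

def interval_p_value(x, a, b):
    """p-value for H: a < mu < b from one observation x ~ N(mu, 1).

    Assumed formula: Phi(-(d + h)) + Phi(-(d - h)), with d the distance from
    x to the interval midpoint and h the interval half-width.
    """
    midpoint = 0.5 * (a + b)
    half_width = 0.5 * (b - a)
    d = abs(x - midpoint)
    return normal_cdf(-(d + half_width)) + normal_cdf(-(d - half_width))

x = 2.18
p_H = interval_p_value(x, -0.5, 0.5)          # smaller interval
p_H_prime = interval_p_value(x, -0.82, 0.52)  # larger interval containing it
print(f"p-value for H  = {p_H:.4f}")
print(f"p-value for H' = {p_H_prime:.4f}")
```

The smaller interval gets the larger p-value, which is the inconsistency being described: the two values fall on opposite sides of 0.05.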

[1] Mark J. Schervish. “P values: What They Are and What They Are Not.” The American Statistician, August 1996, Vol. 50, No. 3.

Update: I added the details of the p-value calculation here.

Related posts:

How loud is the evidence
The cult of significance testing
Most published research results are false


The more active a research area is, the less reliable its results are.

John Ioannidis suggested popular areas of research publish a greater proportion of false results in his paper Why most published research findings are false. Of course popular areas produce more results, and so they will naturally produce more false results. But Ioannidis is saying that they also produce a greater proportion of false results.

Now Thomas Pfeiffer and Robert Hoffmann have produced empirical support for Ioannidis’s theory in the paper Large-Scale Assessment of the Effect of Popularity on the Reliability of Research. Pfeiffer and Hoffmann review two reasons why popular areas have more false results.

First, in highly competitive fields there might be stronger incentives to ‘‘manufacture’’ positive results by, for example, modifying data or statistical tests until formal statistical significance is obtained. This leads to inflated error rates for individual findings: actual error probabilities are larger than those given in the publications. … The second effect results from multiple independent testing of the same hypotheses by competing research groups. The more often a hypothesis is tested, the more likely a positive result is obtained and published even if the hypothesis is false.

In other words,

  1. In a popular area there’s more temptation to fiddle with the data or analysis until you get what you expect.
  2. The more people who test an idea, the more likely someone is going to find data in support of it by chance.

The authors produce evidence of the two effects above in the context of papers written about protein interactions in yeast. They conclude that “The second effect is about 10 times larger than the first one.”
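The second effect is easy to quantify in an idealized setting. The sketch below assumes every group tests the same false hypothesis independently at the same significance level α; neither assumption comes from the paper, and the function name is made up for illustration.

```python
def prob_some_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests of a false hypothesis
    comes out "significant" at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 5, 14, 50):
    print(f"{n:3d} independent tests -> {prob_some_false_positive(n):.3f}")
```

Under these assumptions, with α = 0.05 it takes only 14 independent attempts before a spurious positive result somewhere is more likely than not.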

Related posts:

Why microarray conclusions are so often wrong
Using Photoshop on experimental results
Irreproducible analysis
Make up your own rules of probability


Parameterizations are the bane of statistical software. One of the most common errors is to assume that one software package uses the same parameterization as another package. For example, some packages specify the exponential distribution in terms of the mean but others use the rate.
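As a small illustration of the mean-versus-rate confusion, here are the two parameterizations of the exponential density written out by hand. The function names are hypothetical and do not correspond to any particular package.

```python
from math import exp

def exp_pdf_rate(x, rate):
    """Exponential density parameterized by rate: rate * exp(-rate * x)."""
    return rate * exp(-rate * x) if x >= 0 else 0.0

def exp_pdf_mean(x, mean):
    """Exponential density parameterized by mean: exp(-x / mean) / mean."""
    return exp(-x / mean) / mean if x >= 0 else 0.0

x = 1.0
# Same distribution, described two ways: rate 2 is the same as mean 1/2.
print(exp_pdf_rate(x, rate=2.0), exp_pdf_mean(x, mean=0.5))

# The mistake: passing a rate where a mean is expected changes the answer.
print(exp_pdf_rate(x, rate=2.0), exp_pdf_mean(x, mean=2.0))
```

The same number passed as the single parameter describes two different distributions, and neither call raises an error to warn you.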


Parameters and percentiles

by John on January 31, 2010

The doctor says 10% of patients respond within 30 days of treatment and 80% respond within 90 days of treatment. Now go turn that into a probability distribution. That’s a common task in Bayesian statistics, capturing expert opinion in a mathematical form to create a prior distribution.

Graph of gamma density with 10th percentile at 30 and 80th percentile at 90

Things would be easier if you could ask subject matter experts to express their opinions in statistical terms. You could ask “If you were to represent your belief as a gamma distribution, what would the shape and scale parameters be?” But that’s ridiculous. Even if they understood the question, it’s unlikely they’d give an accurate answer. It’s easier to think in terms of percentiles.

Asking for mean and variance is not much better than asking for shape and scale, especially for a non-symmetric distribution such as a survival curve. Anyone who knows what variance is probably thinks about it in terms of a normal distribution. Asking for mean and variance encourages someone to think about a symmetric distribution.

So once you have specified a couple percentiles, such as the example this post started with, can you find parameters that meet these requirements? If you can’t meet both requirements, how close can you come to satisfying them? Does it depend on how far apart the percentiles are? The answers to these questions depend on the distribution family. Obviously you can’t satisfy two requirements with a one-parameter distribution in general. If you have two requirements and two parameters, at least it’s feasible that both can be satisfied.

If you have a random variable X whose distribution depends on two parameters, when can you find parameter values so that Prob(X ≤ x1) = p1 and Prob(X ≤ x2) = p2? For starters, if x1 is less than x2 then p1 must be less than p2. For example, the probability of a variable being less than 5 cannot be bigger than the probability of being less than 6. For some common distributions, the only requirement is that the x’s and p’s be in a consistent order.

For a location-scale family, such as the normal or Cauchy distributions, you can always find a location and scale parameter to satisfy two percentile conditions. In fact, there’s a simple expression for the parameters. The location parameter is given by

\frac{x_1 F^{-1}(p_2) - x_2 F^{-1}(p_1)}{F^{-1}(p_2) - F^{-1}(p_1)}

and the scale parameter is given by

\frac{x_2 - x_1}{F^{-1}(p_2) - F^{-1}(p_1)}

where F(x) is the CDF of the representative of the family with location 0 and scale 1.
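The two expressions translate directly into code. This sketch applies them to a normal distribution using the percentile numbers from the start of the post (10% by day 30, 80% by day 90). Fitting a normal to those numbers is only for illustration, since the post's graph uses a gamma. The function name is hypothetical.

```python
from statistics import NormalDist

def location_scale_from_percentiles(x1, p1, x2, p2, inv_cdf):
    """Solve Prob(X <= x1) = p1 and Prob(X <= x2) = p2 for a location-scale family.

    inv_cdf is F^{-1} for the family member with location 0 and scale 1.
    """
    z1, z2 = inv_cdf(p1), inv_cdf(p2)
    location = (x1 * z2 - x2 * z1) / (z2 - z1)
    scale = (x2 - x1) / (z2 - z1)
    return location, scale

std_normal = NormalDist()  # location 0, scale 1
mu, sigma = location_scale_from_percentiles(30, 0.10, 90, 0.80, std_normal.inv_cdf)

# Check the fit by evaluating the CDF at the two requested points.
fitted = NormalDist(mu, sigma)
print(f"location = {mu:.3f}, scale = {sigma:.3f}")
print(f"CDF(30) = {fitted.cdf(30):.4f}, CDF(90) = {fitted.cdf(90):.4f}")
```

Passing the inverse CDF as an argument keeps the solver generic: swapping in the standard Cauchy quantile function would fit a Cauchy instead.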

The shape and scale parameters of a Weibull distribution can also be found in closed form. For a gamma distribution, parameters to satisfy the percentile requirements always exist. The parameters are easy to determine numerically but there is no simple expression for them.
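For the Weibull case, one way to get the closed form is to take logarithms of the CDF twice. The sketch below assumes the parameterization F(x) = 1 − exp(−(x/scale)^shape), which is one common convention but not the only one. The function names are hypothetical.

```python
from math import exp, log

def weibull_from_percentiles(x1, p1, x2, p2):
    """Shape and scale so that F(x1) = p1 and F(x2) = p2.

    Assumes F(x) = 1 - exp(-(x / scale)**shape), 0 < x1 < x2, 0 < p1 < p2 < 1.
    Derivation: log(-log(1 - p)) = shape * (log(x) - log(scale)).
    """
    c1 = log(-log(1.0 - p1))
    c2 = log(-log(1.0 - p2))
    shape = (c2 - c1) / (log(x2) - log(x1))
    scale = x1 / (-log(1.0 - p1)) ** (1.0 / shape)
    return shape, scale

def weibull_cdf(x, shape, scale):
    return 1.0 - exp(-((x / scale) ** shape))

shape, scale = weibull_from_percentiles(30, 0.10, 90, 0.80)
print(f"shape = {shape:.4f}, scale = {scale:.4f}")
print(weibull_cdf(30, shape, scale), weibull_cdf(90, shape, scale))
```

The gamma case has no analogous shortcut; it needs a numerical root finder, which is omitted here.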

For more details, see Determining distribution parameters from quantiles. See also the ParameterSolver software.

Update: I posted an article on CodeProject with Python code for computing the parameters described here.

Related posts:

Biostatistics software
Diagram of distribution relationships
How to calculate percentiles in memory-bound applications


Statisticians take themselves too seriously

by John on January 28, 2010

I suppose most people take themselves too seriously, but I’ve been thinking specifically about how statisticians take themselves too seriously.

The fundamental task of statistics is making decisions in the presence of uncertainty, and that’s hard. You have to make all kinds of simplifying assumptions and arbitrary choices to get anywhere. But after a while you lose sight of these decisions. Or you justify your decisions after the fact, making a virtue out of a necessity. After you’ve worked on a problem long enough, it’s nearly impossible to say “Of course, our whole way of thinking about this might have been wrong from the beginning.”

My concern is not so much “creative” statistics but rather uncreative statistics, rote application of established methods. Statistics is extremely conventional. But a procedure is not objective just because it is conventional.  An arbitrary choice made 80 years ago is still an arbitrary choice.

I’ve taken myself too seriously at times in regard to statistical matters; it’s easy to get caught up in your model. But I’m reminded of a talk I heard one time in which the speaker listed a number of embarrassing things that people used to believe. He was not making a smug comparison of how sophisticated we are now compared to our ignorant ancestors. Instead, his point was that we too may be mistaken. He exhorted everyone to look in a mirror and say “I may be wrong. I may be very, very wrong.”

Related posts:

The IOT test
Bike shed arguments
The data may not contain the answer
Problems versus dilemmas
Approximate problems and approximate solutions


Estimating reporting rates

by John on January 25, 2010

Suppose the police department in your community reported an average of 10 burglaries per month. You could take that at face value and assume there are 10 burglaries per month. But maybe there are 20 burglaries a month but only half are reported.  How could you tell?

Here’s a similar problem.  Suppose you gave away an electronic book.  You stated that people are free to distribute it however they want, but that you’d appreciate feedback.  How could you estimate the number of people who have read the book?  If you get email from 100 readers, you know at least 100 people have read it, but maybe 10,000 people have read it and only 1% sent email.  How can you estimate the number of readers and the percentage who send email at the same time?


Biostatistics software

by John on January 13, 2010

The M. D. Anderson Cancer Center Department of Biostatistics has a software download site listing software developed by the department over many years.

The home page of the download site allows you to see all products sorted by date or by name. This page also allows search. A new page lets you see the software organized by tags.


How the central limit theorem began

by John on January 5, 2010

The Central Limit Theorem says that if you average enough independent copies of a random variable, the result has a nearly normal (Gaussian) distribution. Of course that’s a very rough statement of the theorem. What are the precise requirements of the theorem? That question took two centuries to resolve. You can see the final answer here.

The first version of the Central Limit Theorem appeared in 1733, but necessary and sufficient conditions weren’t known until 1935. I won’t recap the entire history here. I just want to comment briefly on how the Central Limit Theorem began and how different the historical order of events was from the typical order of presentation.

A typical probability course might proceed as follows.

  1. Define the normal distribution.
  2. State and prove a special case of the Central Limit Theorem.
  3. Present the normal approximation to the binomial as a corollary.

This is the opposite of the historical order of events.

Abraham de Moivre discovered he could approximate binomial distribution probabilities using the integral of exp(-x^2) and proved an early version of the Central Limit Theorem in 1733. At the time, there was no name given to his integral. Only later did anyone think of exp(-x^2) as the density of a probability distribution. De Moivre certainly didn’t use the term “Gaussian” since Gauss was born 44 years after de Moivre’s initial discovery. De Moivre also didn’t call his result the “Central Limit Theorem.” George Pólya gave the theorem that name in 1920 as it was approaching its final form.
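To see the kind of approximation involved, here is a small sketch comparing exact binomial probabilities for a fair coin with the bell-curve approximation. It is written in modern notation, not de Moivre's own formulation, and the function names are made up for this example.

```python
from math import comb, exp, pi, sqrt

def binomial_half_pmf(k, n):
    """Exact P(X = k) for X ~ Binomial(n, 1/2)."""
    return comb(n, k) / 2**n

def bell_curve_approx(k, n):
    """Normal-density approximation with mean n/2 and variance n/4."""
    variance = n / 4.0
    return exp(-((k - n / 2.0) ** 2) / (2.0 * variance)) / sqrt(2.0 * pi * variance)

n = 100
for k in (40, 45, 50, 55, 60):
    exact = binomial_half_pmf(k, n)
    approx = bell_curve_approx(k, n)
    print(f"k = {k}: exact = {exact:.6f}, approx = {approx:.6f}")
```

With n = 100 the two columns agree closely, which is the practical payoff of the approximation: no large binomial coefficients to compute.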

For more details, see The Life and Times of the Central Limit Theorem by William Adams.

Related links:

Sums of uniform random variables
Quantifying the error in the central limit theorem
Three central limit theorems


A beta-like distribution

by John on November 24, 2009

I just stumbled across a distribution that approximates the beta distribution but is easier to work with in some ways. It’s called the Kumaraswamy distribution. Apparently it came out of hydrology. The graph below plots the density of the distribution for various parameters. If you’re familiar with the beta distribution, these curves will look very familiar.

Density plots of the Kumaraswamy distribution

The PDF for the Kumaraswamy distribution K(a, b) is

f(x | a, b) = a b x^{a-1} (1 - x^a)^{b-1}

and the CDF is

F(x | a, b) = 1 - (1 - x^a)^b.

The most convenient feature of the Kumaraswamy distribution is that its CDF has a simple form. (The CDF for a beta distribution cannot be reduced to elementary functions unless its parameters are integers.)  Also, the CDF is easy to invert. That means you can generate a random sample from a K(a, b) distribution by first generating a uniform random value u and then returning

F^{-1}(u) = (1 - (1 - u)^{1/b})^{1/a}.
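The inversion method described above takes only a few lines. This is a minimal sketch using Python's standard `random` module; the function names are hypothetical.

```python
import random

def kumaraswamy_cdf(x, a, b):
    """CDF of K(a, b) on [0, 1]."""
    return 1.0 - (1.0 - x**a) ** b

def kumaraswamy_inv_cdf(u, a, b):
    """Inverse of the CDF above, for 0 <= u < 1."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def kumaraswamy_sample(a, b, rng=random):
    """Draw one K(a, b) variate by inverting a uniform random value."""
    return kumaraswamy_inv_cdf(rng.random(), a, b)

rng = random.Random(42)  # seeded only so the example is repeatable
draws = [kumaraswamy_sample(2.0, 3.0, rng) for _ in range(5)]
print(draws)
```

Doing the same for a beta distribution would require a numerical inverse of the incomplete beta function, which is where the convenience of the Kumaraswamy form shows.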

If you’re going to use a Kumaraswamy distribution to approximate a beta distribution, the question immediately arises of how to find parameters to get a good approximation. That is, if you have a beta(α, β) distribution that you want to approximate with a K(a, b) distribution, how do you pick a and b?

My first thought was to match moments. That is, pick a and b so that K(a, b) has the same mean and variance as beta(α, β). That may work well, but it would have to be done numerically.

Since the beta(α, β) density is proportional to x^{α-1} (1 - x)^{β-1} and the K(a, b) density is proportional to x^{a-1} (1 - x^a)^{b-1}, it seems reasonable to set a = α. But how do you pick b? The modes of the two distributions have simple forms and so you could pick b to match modes:

mode K(a, b) = ((a - 1)/(ab - 1))^{1/a} = mode beta(α, β) = (α - 1)/(α + β - 2).
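Solving the mode-matching equation for b gives b = (1 + (a - 1)/m^a)/a, where m is the beta mode. The sketch below implements that rearrangement. The function name is hypothetical, and the formula only makes sense when α ≠ 1 and the stationary point m lies strictly between 0 and 1.

```python
from math import sqrt

def kumaraswamy_params_matching_mode(alpha, beta):
    """Pick a = alpha, then choose b so the K(a, b) stationary point equals
    the beta(alpha, beta) stationary point m = (alpha - 1)/(alpha + beta - 2).

    From m**a = (a - 1)/(a*b - 1) it follows that b = (1 + (a - 1)/m**a) / a.
    Assumes alpha != 1 and 0 < m < 1.
    """
    a = alpha
    m = (alpha - 1.0) / (alpha + beta - 2.0)
    b = (1.0 + (a - 1.0) / m**a) / a
    return a, b

print(kumaraswamy_params_matching_mode(5.0, 3.0))  # b should equal 251/40
print(kumaraswamy_params_matching_mode(0.5, 0.5), 2.0 - sqrt(2.0))
```

For α and β both below 1 the matched point is the minimum of a U-shaped density rather than a mode, but the same algebra applies.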

Update: I experimented with the method above, and it’s OK, but not great. Here’s an example comparing a beta(1/2, 1/2) density with a K(1/2, 2 – √2) density.

comparing two u-shaped densities

Here the K density matches the beta density not at the mode but at the minimum. The blue curve, the curve on top, is the beta density.

Here’s another example, this time comparing a beta(5, 3) density and a K(5, 251/40) density.

comparing K and beta densities

Again the beta density is the blue curve, on top at the mode.

Maybe the algorithm I suggested for picking parameters is not very good, but I suspect the optimal parameters are not much better. Rather than saying that the Kumaraswamy distribution approximates the beta distribution, I’d say that the Kumaraswamy distribution is capable of assuming roughly the same shapes as the beta distribution. If the only reason you’re using a beta distribution is to get a certain density shape, the Kumaraswamy distribution would be a reasonable alternative. But if you need to approximate a beta distribution closely, it may not work well enough.


In The Flaw of Averages, Sam Savage uses the illustration of shaking a ladder to explain why someone would want to use Monte Carlo simulation. Before climbing a ladder, most people shake the ladder a little to make sure it’s sturdy.

When you position a ladder and climb it immediately, you’re saying that you’re satisfied that the ladder’s position is safe. You’re assuming it will stay in the position you placed it. But when you shake the ladder, you’re testing how it will behave at a variety of nearby positions. If the ladder remains sturdy, you have more confidence that accidental motions while you’re on the ladder are not likely to cause an accident. When you stick a single number into a model, you’re acting like someone climbing a ladder immediately. When you stick in a series of random inputs, it’s like you’re shaking the ladder.

The latest INFORMS podcast features an interview with Sam Savage (audio). He mentions the ladder analogy and adds an observation that I don’t recall seeing in his book. One criticism of Monte Carlo methods is that the validity of your results depends on the validity of your input distributions. It’s better to have realistic input distributions, but it’s better to perform a simulation with an incorrect distribution than to not try random input at all. The distribution of random forces likely to result from working while standing on a ladder is different from the distribution of forces from shaking the ladder with your hands. But that doesn’t mean it isn’t a good idea to shake the ladder anyway.
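Here is a toy version of "shaking the ladder." Everything in it is invented for illustration: a hypothetical model where sales are capped by a capacity of 100 units, and demand is assumed uniform between 50 and 150. None of it comes from Savage's book or the podcast.

```python
import random

def sales(demand, capacity=100.0):
    """Hypothetical model: you cannot sell more than you can produce."""
    return min(demand, capacity)

def simulate_mean_sales(n_trials=100_000, seed=0):
    """Average the model output over random demand (assumed uniform on 50..150)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += sales(rng.uniform(50.0, 150.0))
    return total / n_trials

# Climbing immediately: plug the average demand into the model.
single_number_answer = sales(100.0)

# Shaking the ladder: run the model over a range of plausible inputs.
monte_carlo_answer = simulate_mean_sales()

print("model at average demand:", single_number_answer)
print("average of model over random demand:", round(monte_carlo_answer, 2))
```

Even with a crude input distribution, the simulation shows that the model evaluated at the average input overstates the average outcome, which the single-number run cannot reveal.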

Related post:

Mortgages, banks, and Jensen’s inequality


Student-t as a mixture of normals

by John on October 30, 2009

You can express a Student-t distribution as a continuous mixture of normal distributions. Some properties of the t distribution are easier to prove in this form. Here are notes with details.
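One standard way to write the mixture, which may differ in form from the linked notes, is: draw a precision from a gamma distribution with shape ν/2 and rate ν/2, then draw a normal with that precision. The sketch below samples that way and checks the sample variance against the known value ν/(ν − 2). The function name is hypothetical.

```python
import random

def student_t_via_mixture(nu, rng):
    """One draw from a t distribution with nu degrees of freedom, as a normal
    whose precision is itself random.

    precision ~ Gamma(shape = nu/2, rate = nu/2).
    random.gammavariate takes shape and *scale*, so scale = 2/nu.
    """
    precision = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return rng.gauss(0.0, 1.0) / precision**0.5

rng = random.Random(0)
nu = 10
samples = [student_t_via_mixture(nu, rng) for _ in range(200_000)]

mean = sum(samples) / len(samples)
sample_variance = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
print(f"sample variance = {sample_variance:.3f}, theory = {nu / (nu - 2):.3f}")
```

Note the rate-versus-scale conversion in the gamma call: a small instance of the parameterization trap that statistical software is known for.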

I ran across this tidbit reading Bayesian Data Analysis by Gelman et al.

Related post: Beer, Wine, and Statistics (origin of the Student-t distribution)


Normal tail probability bounds

by John on October 22, 2009

Here are some notes on upper and lower bounds on the probability P(Z > t) for a standard normal random variable Z. I wrote up these notes to settle an issue that came up in a probability class I’m teaching. It’s surprising that there are simple functions that provide efficient bounds on the normal distribution function.
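As an illustration, one classical pair of bounds for t > 0 is t/(t^2 + 1) · φ(t) < P(Z > t) < φ(t)/t, where φ is the standard normal density. Whether the linked notes use exactly this pair is an assumption. The sketch below checks the bounds numerically against the tail probability computed from `math.erfc`; the function names are hypothetical.

```python
from math import erfc, exp, pi, sqrt

def normal_pdf(t):
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)

def normal_tail(t):
    """P(Z > t) for a standard normal Z."""
    return 0.5 * erfc(t / sqrt(2.0))

def tail_bounds(t):
    """Classical lower and upper bounds on P(Z > t), valid for t > 0."""
    phi = normal_pdf(t)
    return t / (t * t + 1.0) * phi, phi / t

for t in (0.5, 1.0, 2.0, 3.0, 5.0):
    lower, upper = tail_bounds(t)
    print(f"t = {t}: {lower:.3e} < {normal_tail(t):.3e} < {upper:.3e}")
```

The two bounds squeeze together as t grows, so they are most useful far out in the tail.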
