Posts tagged as:

Probability and Statistics

Does gaining weight make you taller?

by John on March 12, 2010

In his autobiography, The Pleasures of Statistics, Frederick Mosteller gives an amusing example of why observational studies are no substitute for doing experiments.

We are all familiar with the idea that we can estimate height in male adults from their weight. … But not one of us believes that adding 20 pounds by eating and minimizing exercise will add an inch to our height.

The problem is not simply that the direction of causality is backward, it’s that we cannot use a static description to predict what will happen if we change something.

Although regression situations may give one the illusion of finding out what would happen if we changed something, in the absence of an experiment they merely offer guesses.

He summarizes his point by quoting George Box:

To find out what happens to a system when you interfere with it, you have to interfere with it (and not just passively observe it).

Remember this next time you hear claims such as every dollar spent on X saves so many dollars spent on Y. Or every minute spent exercising increases your life expectancy by so many minutes. Or every time you do some activity you increase or decrease your risk of cancer by so much. First of all, these kinds of statements are linear extrapolations on situations that are not linear. Second, they may be observations that do not describe what will happen when you change something. They may be no more true than the idea that gaining weight makes you taller.

Here’s an example of how observation and intervention differ. Lottery winners often go bankrupt within a couple years of receiving their prize. If you suddenly make someone a millionaire, they’re not a typical millionaire.

Related posts:

Numerator-only data
Randomized trials of parachute use

{ 3 comments }

Numerator-only data

by John on March 11, 2010

I learned a useful new phrase today: numerator-only data. This is data without anything to compare it to, no denominator. I ran across the term in Frederick Mosteller’s autobiography. He illustrates the problem with the following old joke.

“Why do the white horses eat more than the black horses?”
“Don’t know. Why?”
“Because we have ten times as many white horses as black horses.”

Numerator-only data is data that leaves you asking “compared to what?” If I tell you the NASDAQ stock index closed at 2368 today, is that good or bad? The number by itself means nothing. Is that up or down compared to last week? Last year? If I tell you, for example, that the record high value was 5047, that gives you a denominator to compare it to.

{ 5 comments }

Does lightning prefer metal or wood?

by John on March 5, 2010

The video below features a demonstration that lightning is as likely to strike wood as metal.

I want to focus on one line from the video. After showing simulated lightning strikes that hit a wooden rod five times and a copper rod five times, the narrator says

It’s five all, proof that metal does not attract lightning.

No, such an experiment would prove no such thing. I imagine the researchers conducted a much larger experiment and selected a representative sample. And I’m willing to accept their conclusion that metal does not attract lightning. But I would not accept such a conclusion from an experiment with 10 samples. What the experiment proves is that, under their experimental conditions, lightning will sometimes strike wood even when a metal rod is nearby.

I have two complementary criticisms of this made-for-video science.

  1. The results could easily happen if their conclusion were not true.
  2. The results could easily not have happened if their conclusion were true.

Suppose in reality, lightning will not always strike the metal rod, but will prefer the metal. Suppose in the long run, lightning will strike the metal rod 60% of the time. It would not be unusual in that case to do an experiment with 10 strikes and find that half or more of the strikes hit wood.
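To put a number on "not unusual," here is a minimal sketch that sums binomial probabilities. It assumes the strikes are independent and that each one hits wood with probability 0.4 (the complement of the 60% figure above). The function name `prob_at_least` is mine, for illustration only.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by summing the exact probabilities."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Assumption: independent strikes, each hitting wood with probability 0.4.
p_wood_half_or_more = prob_at_least(5, 10, 0.4)
print(round(p_wood_half_or_more, 3))
```

Under these assumptions, half or more of the 10 strikes hit wood a bit more than a third of the time.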

Now suppose the researchers are exactly correct. In the long run, lightning has no preference for one rod or the other. What would viewers have thought if they showed a clip of 10 strikes, of which 6 hit metal and 4 hit wood? Many would have howled in protest. If lightning really had no preference for metal, the result should have been an even split, right? This is an example of the Law of Small Numbers. People underestimate the variability of small samples.

If the probability of lightning striking each rod is 50%, then in a sequence of experiments each containing 10 strikes, most will not have an exact 5-5 split. If you flip 10 fair coins, the most likely outcome is a 5-5 split, but this will happen only about 1/4 of the time. It’s more likely that you’ll get near a 5-5 split, sometimes with more heads and sometimes with more tails.
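The coin-flip figures are easy to check directly. This is a small sketch, assuming fair and independent flips; `split_probability` is a name made up for this example.

```python
from math import comb

def split_probability(heads, flips):
    """Probability of exactly `heads` heads in `flips` fair coin flips."""
    return comb(flips, heads) / 2**flips

# Exact 5-5 split: 252 of the 1024 equally likely outcomes.
exact_split = split_probability(5, 10)

# "Near" a 5-5 split, taken here (an arbitrary choice) to mean 4, 5, or 6 heads.
near_split = sum(split_probability(h, 10) for h in (4, 5, 6))

print(exact_split, near_split)
```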

The exact 5-5 split in the video is good showmanship, but it’s misleading science.

Related posts:

Law of small numbers
Example of the law of small numbers
Law of medium numbers

{ 2 comments }

p-values are inconsistent

by John on March 3, 2010

If there’s evidence that an animal is a bear, you’d think there’s even more evidence that it’s a mammal. It turns out that p-values fail this common sense criterion as a measure of evidence.

I just ran across a paper of Mark Schervish [1] that contains a criticism of p-values I had not seen before. p-values are commonly used as measures of evidence despite the protests of many statisticians. It seems reasonable that a measure of evidence would have the following property. If a hypothesis H implies another hypothesis H’, then evidence in favor of H’ should be at least as great as evidence in favor of H.

Here’s one of the examples from Schervish’s paper. Suppose data come from a normal distribution with variance 1 and unknown mean μ. Let H be the hypothesis that μ is contained in the interval (-0.5, 0.5). Let H’ be the hypothesis that μ is contained in the interval (-0.82, 0.52). Then suppose you observe x = 2.18. The p-value for H is 0.0502 and the p-value for H’ is 0.0498. This says there is more evidence to support the hypothesis H that μ is in the smaller interval than there is to support the hypothesis H’ that μ is in the larger interval. If we adopt α = 0.05 as the cutoff for significance, we would reject the hypothesis that -0.82 < μ < 0.52 but accept the hypothesis that -0.5 < μ < 0.5. We’re willing to accept that we’ve found a bear, but doubtful that we’ve found a mammal.
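Here is a sketch of how the two p-values can be reproduced. The formula is my reconstruction, not something quoted from the paper: it assumes a single observation x from N(μ, 1), a test that rejects when x is far from the midpoint of the interval, and a p-value evaluated at an endpoint of the interval, where the rejection probability under the hypothesis is largest. The names `normal_cdf` and `interval_p_value` are illustrative.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def interval_p_value(x, a, b):
    """p-value for H: a <= mu <= b given one observation x ~ N(mu, 1).

    Assumption: reject when |x - midpoint| is large, and evaluate the
    tail probability at the endpoint mu = b (mu = a gives the same value
    by symmetry).
    """
    mid = (a + b) / 2.0
    d = abs(x - mid)
    return normal_cdf(mid - d - b) + 1.0 - normal_cdf(mid + d - b)

p_H = interval_p_value(2.18, -0.5, 0.5)          # smaller interval
p_H_prime = interval_p_value(2.18, -0.82, 0.52)  # larger interval
print(round(p_H, 4), round(p_H_prime, 4))
```

With x = 2.18 the smaller interval gets the larger p-value, which is the inconsistency described above.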

[1] Mark J. Schervish. “P values: What They Are and What They Are Not.” The American Statistician, August 1996, Vol. 50, No. 3.

Update: I added the details of the p-value calculation here.

Related posts:

How loud is the evidence
The cult of significance testing
Most published research results are false

{ 19 comments }

Something like a random sequence but …

by John on February 24, 2010

When people ask for a random sequence, they’re often disappointed with what they get.

Random sequences clump more than most folks expect. For graphical applications, quasi-random sequences may be more appropriate. These sequences are “more random than random” in the sense that they behave more like what some folks expect from randomness. They jitter around like a random sequence, but they don’t clump as much.

Researchers conducting clinical trials are dismayed when a randomized trial puts several patients in a row on the same treatment. They want to assign patients one at a time to one of two treatments with equal probability, but they also want the allocation to work out evenly. This is like saying you want to flip a coin 100 times, and you also want to get exactly 50 heads and 50 tails. You can’t guarantee both, but there are effective compromises.

One approach is to randomize in blocks. For example, you could randomize in blocks of 10 patients by taking a sequence of 5 A’s and 5 B’s and randomly permuting the 10 letters. This guarantees that the allocations will be balanced, but some outcomes will be predictable. At a minimum, the last assignment in each block is always predictable: you assign whatever is left. Assignments could be even more predictable: if you give n A’s in a row in a block of 2n, you know the last n assignments will be all B’s.
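Block randomization as described above can be sketched in a few lines. This is only an illustration, not code from any trial software; the function name `block_randomize` and its arguments are made up for the example.

```python
import random

def block_randomize(num_blocks, block_size=10, rng=random):
    """Return a list of 'A'/'B' assignments, balanced within each block."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even to balance two treatments")
    assignments = []
    for _ in range(num_blocks):
        # Half A's and half B's, randomly permuted.
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)
        assignments.extend(block)
    return assignments

schedule = block_randomize(num_blocks=3)
print("".join(schedule))
```

Every block of 10 in the output has exactly 5 A’s and 5 B’s, which is also why the end of each block can be guessed by anyone who has been keeping count.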

Another approach is to “encourage” balance rather than enforce it. When you’ve given more A’s than B’s you could increase the probability of assigning a B. The greater the imbalance, the more heavily you bias the randomization probability in favor of the treatment that has been assigned less. This is a sort of compromise between equal randomization and block randomization. All assignments are random, though some assignments may be more predictable than others. Large imbalances are less likely than with equal randomization, but more likely than with block randomization. You can tune how aggressively the method responds to imbalances in order to make the method more like equal randomization or more like block randomization.
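One way to “encourage” balance is sketched below. The logistic rule and the `strength` tuning parameter are hypothetical choices made for illustration, not a particular published design. With `strength` set to 0 the method reduces to equal randomization, and larger values push it toward the behavior of block randomization.

```python
import math
import random

def prob_assign_a(n_a, n_b, strength=0.5):
    """Probability of assigning A next, given counts so far (hypothetical rule)."""
    z = strength * (n_a - n_b)
    z = max(-50.0, min(50.0, z))  # clamp to keep exp() well within range
    return 1.0 / (1.0 + math.exp(z))

def biased_coin_allocation(num_patients, strength=0.5, rng=random):
    """Assign patients one at a time, favoring the under-assigned treatment."""
    n_a = n_b = 0
    assignments = []
    for _ in range(num_patients):
        if rng.random() < prob_assign_a(n_a, n_b, strength):
            assignments.append("A")
            n_a += 1
        else:
            assignments.append("B")
            n_b += 1
    return assignments

allocation = biased_coin_allocation(100)
print(allocation.count("A"), allocation.count("B"))
```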

No approach to randomization will satisfy everyone because there are conflicting requirements. Randomization is a dilemma to be managed rather than a problem to be solved.

Related posts:

Quasi-random sequences in art and integration
Three ways of tuning an adaptively randomized trial
Population drift
Galen and clinical trials

{ 0 comments }

Random improvisation subjects

by John on February 23, 2010

Destination ImagiNation is a non-profit organization that encourages student creativity. This is my family’s first year to participate in DI and it has been a lot of fun. One of the things that impresses me most about DI is that they have strict rules limiting adult input.

This weekend I was an appraiser at a DI competition for an improvisation challenge. Teams could prepare for the overall format of the challenge, but some elements of the challenge were randomly selected on the day of the competition. This year the improvisations centered around endangered things. Teams were given a list of 10 endangered things ahead of time, but they wouldn’t know which thing would be theirs until just before they had to perform. Some of the things on the list were endangered animals, such as the giant panda. There were also other things in danger of disappearing, such as the VHS tape. The students also had to use a randomly chosen stock character and had to include a character with a randomly chosen “unimpressive superpower.”

There were 13 teams in the elementary division. What would you expect from 13 teams randomly selecting 10 endangered things? Obviously some endangered thing has to be chosen at least twice. Would you expect every item on the list to be chosen at least once? How often do you expect the most common item would be chosen?

In our case, three teams were assigned “glaciers” and five were assigned “the landline telephone.” The other items were assigned once or not at all. (No one was assigned “the Yiddish language”. Too bad. I really wanted to see what the students would do with that one.)

Is there reason to suspect that the assignments were not random? How likely is it that, in a competition of 13 teams, five or more teams would be given the same subject? How likely is it that every subject would be used at least once? See an explanation here. Make a guess before looking at my answer.

Here’s some Python code you could use to simulate the selection of endangered things.

from random import random

num_reps     = 100000 # number of simulation repetitions
num_subjects = 10     # number of endangered things
num_teams    = 13     # number of teams competing

def maxperday():
    # Assign each team a random subject and return the largest tally.
    tally = [0] * num_subjects
    for i in range(num_teams):
        subject = int(random()*num_subjects)
        tally[subject] += 1
    return max(tally)

# Estimate the probability that some subject goes to five or more teams.
total = 0
for rep in range(num_reps):
    if maxperday() >= 5:
        total += 1
print(total / num_reps)
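The other question, whether every subject gets used at least once, can be answered exactly rather than by simulation. The sketch below uses inclusion-exclusion and assumes each team's subject is chosen independently and uniformly. The function name `prob_all_chosen` is mine. Run it only after you've made your guess.

```python
from math import comb

def prob_all_chosen(num_subjects, num_teams):
    """Probability that every subject is assigned to at least one team.

    Inclusion-exclusion over the set of subjects that are never chosen.
    """
    n = num_subjects
    return sum((-1)**k * comb(n, k) * ((n - k) / n)**num_teams
               for k in range(n + 1))

p_all = prob_all_chosen(10, 13)
print(p_all)
```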

{ 4 comments }

The more active a research area is, the less reliable its results are.

John Ioannidis suggested popular areas of research publish a greater proportion of false results in his paper Why most published research findings are false. Of course popular areas produce more results, and so they will naturally produce more false results. But Ioannidis is saying that they also produce a greater proportion of false results.

Now Thomas Pfeiffer and Robert Hoffmann have produced empirical support for Ioannidis’s theory in the paper Large-Scale Assessment of the Effect of Popularity on the Reliability of Research. Pfeiffer and Hoffmann review two reasons why popular areas have more false results.

First, in highly competitive fields there might be stronger incentives to “manufacture” positive results by, for example, modifying data or statistical tests until formal statistical significance is obtained. This leads to inflated error rates for individual findings: actual error probabilities are larger than those given in the publications. … The second effect results from multiple independent testing of the same hypotheses by competing research groups. The more often a hypothesis is tested, the more likely a positive result is obtained and published even if the hypothesis is false.

In other words,

  1. In a popular area there’s more temptation to fiddle with the data or analysis until you get what you expect.
  2. The more people who test an idea, the more likely someone is going to find data in support of it by chance.
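The second effect is simple arithmetic. Here is a rough sketch, assuming every group tests the same false hypothesis independently at the same significance level α. The 0.05 default and the function name are illustrative assumptions, not figures from the paper.

```python
def prob_some_false_positive(num_groups, alpha=0.05):
    """Chance that at least one of several independent tests of a false
    hypothesis comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_groups

for groups in (1, 5, 14, 50):
    print(groups, round(prob_some_false_positive(groups), 3))
```

Under these assumptions, once 14 groups have tried, it is more likely than not that at least one has a positive result to publish.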

The authors produce evidence of the two effects above in the context of papers written about protein interactions in yeast. They conclude that “The second effect is about 10 times larger than the first one.”

Related posts:

Why microarray conclusions are so often wrong
Using Photoshop on experimental results
Irreproducible analysis
Make up your own rules of probability

{ 3 comments }

Statistical functions in Excel

by John on February 17, 2010

Depending on your expectations, you may have different reactions to the statistical function support in Excel. If you expect anything similar to a statistical package, you’ll be sorely disappointed. But if you think of Excel as a spreadsheet for everybody that sometimes lets you do statistical tasks right there without having to open up a statistical package, you’ll be pleased.

I was looking into the functions in Excel 2007 while preparing for a class I taught yesterday. I wanted to emphasize that certain functions are everywhere, not only in mathematical packages like Mathematica and R, but also in Python and even Excel.

Excel’s set of functions is inconsistent, both in the functionality provided and in the names it uses. Having an asymmetric API makes it harder to remember what is available and how to use it. On the other hand, the most commonly needed functions are available. The functions are individually reasonable even though they do not fit together into a simple pattern.

For details, see my notes Probability distributions in Excel 2007.

I discovered along the way that Excel has a GAMMALN function to compute the logarithm of the Gamma function Γ(x). This is a very useful function to have, even more useful than the Gamma function itself for reasons explained here.
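Python offers the same function as `math.lgamma`. The sketch below shows one reason, as I understand it, that the log is handy: the Gamma function itself overflows floating point for fairly modest arguments, while its logarithm stays comfortably in range. The helper `log_binomial` is a made-up name for illustration.

```python
from math import exp, gamma, lgamma

def log_binomial(n, k):
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# Small case: exponentiating recovers C(10, 5) = 252 up to rounding.
print(exp(log_binomial(10, 5)))

# Large case: the log is an ordinary float even though C(n, k) is astronomical.
print(log_binomial(10**6, 5 * 10**5))

# By contrast, gamma() overflows for arguments of only a few hundred.
try:
    gamma(200.0)
except OverflowError:
    print("gamma(200) overflows")
```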

Related links:

Comparison of data analysis packages from Brendan O’Connor

R, Excel, and the Windows clipboard (good tips in the comments)

{ 8 comments }

Parameterizations are the bane of statistical software. One of the most common errors is to assume that one software package uses the same parameterization as another package. For example, some packages specify the exponential distribution in terms of the mean but others use the rate.
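Python's standard library is a handy illustration: `random.expovariate` takes the rate, that is, one over the mean. The wrapper below is a small sketch of how you might make the parameterization explicit in your own code. The name `sample_exponential_by_mean` is hypothetical.

```python
import random

def sample_exponential_by_mean(mean, rng=random):
    """Draw an exponential sample specified by its mean.

    random.expovariate expects the RATE (1/mean), so convert explicitly.
    """
    return rng.expovariate(1.0 / mean)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_exponential_by_mean(30.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

Passing 30 directly to `expovariate` would instead give samples with a mean of 1/30, exactly the kind of silent error described above.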

{ 4 comments }

Parameters and percentiles

by John on January 31, 2010

The doctor says 10% of patients respond within 30 days of treatment and 80% respond within 90 days of treatment. Now go turn that into a probability distribution. That’s a common task in Bayesian statistics, capturing expert opinion in a mathematical form to create a prior distribution.

Graph of gamma density with 10th percentile at 30 and 80th percentile at 90

Things would be easier if you could ask subject matter experts to express their opinions in statistical terms. You could ask “If you were to represent your belief as a gamma distribution, what would the shape and scale parameters be?” But that’s ridiculous. Even if they understood the question, it’s unlikely they’d give an accurate answer. It’s easier to think in terms of percentiles.

Asking for mean and variance is not much better than asking for shape and scale, especially for a non-symmetric distribution such as a survival curve. Anyone who knows what variance is probably thinks about it in terms of a normal distribution. Asking for mean and variance encourages someone to think about a symmetric distribution.

So once you have specified a couple percentiles, such as the example this post started with, can you find parameters that meet these requirements? If you can’t meet both requirements, how close can you come to satisfying them? Does it depend on how far apart the percentiles are? The answers to these questions depend on the distribution family. Obviously you can’t satisfy two requirements with a one-parameter distribution in general. If you have two requirements and two parameters, at least it’s feasible that both can be satisfied.

If you have a random variable X whose distribution depends on two parameters, when can you find parameter values so that Prob(X ≤ x1) = p1 and Prob(X ≤ x2) = p2? For starters, if x1 is less than x2 then p1 must be less than p2. For example, the probability of a variable being less than 5 cannot be bigger than the probability of being less than 6. For some common distributions, the only requirement is this requirement that the x’s and p’s be in a consistent order.

For a location-scale family, such as the normal or Cauchy distributions, you can always find a location and scale parameter to satisfy two percentile conditions. In fact, there’s a simple expression for the parameters. The location parameter is given by

\frac{x_1 F^{-1}(p_2) - x_2 F^{-1}(p_1)}{F^{-1}(p_2) - F^{-1}(p_1)}

and the scale parameter is given by

\frac{x_2 - x_1}{F^{-1}(p_2) - F^{-1}(p_1)}

where F(x) is the CDF of the member of the family with location 0 and scale 1.
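The two expressions above translate directly into code. This sketch uses the normal distribution as the location-scale family, with `statistics.NormalDist().inv_cdf` playing the role of F⁻¹. The function name is an illustrative choice, and plugging the doctor's numbers into a normal distribution is only for demonstration.

```python
from statistics import NormalDist

def location_scale_from_percentiles(x1, p1, x2, p2,
                                    inv_cdf=NormalDist().inv_cdf):
    """Solve Prob(X <= x1) = p1 and Prob(X <= x2) = p2 for location and scale.

    inv_cdf is F^{-1} for the family member with location 0 and scale 1.
    """
    z1, z2 = inv_cdf(p1), inv_cdf(p2)
    location = (x1 * z2 - x2 * z1) / (z2 - z1)
    scale = (x2 - x1) / (z2 - z1)
    return location, scale

# 10th percentile at 30 and 80th percentile at 90, fit with a normal.
mu, sigma = location_scale_from_percentiles(30, 0.10, 90, 0.80)
fitted = NormalDist(mu, sigma)
print(mu, sigma, fitted.cdf(30), fitted.cdf(90))
```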

The shape and scale parameters of a Weibull distribution can also be found in closed form. For a gamma distribution, parameters to satisfy the percentile requirements always exist. The parameters are easy to determine numerically but there is no simple expression for them.
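For the Weibull case, here is a sketch of the closed form. It assumes the parameterization with CDF 1 − exp(−(x/scale)^shape), which is one common convention but not the only one. Taking logs twice turns the two percentile conditions into two linear equations. The function names are illustrative.

```python
from math import exp, log

def weibull_from_percentiles(x1, p1, x2, p2):
    """Shape and scale for a Weibull with CDF 1 - exp(-(x/scale)**shape).

    Uses log(-log(1 - p)) = shape * (log(x) - log(scale)) at both points.
    """
    y1 = log(-log(1.0 - p1))
    y2 = log(-log(1.0 - p2))
    shape = (y2 - y1) / (log(x2) - log(x1))
    scale = x1 / (-log(1.0 - p1)) ** (1.0 / shape)
    return shape, scale

def weibull_cdf(x, shape, scale):
    return 1.0 - exp(-((x / scale) ** shape))

# 10% respond within 30 days, 80% within 90 days.
shape, scale = weibull_from_percentiles(30, 0.10, 90, 0.80)
print(shape, scale, weibull_cdf(30, shape, scale), weibull_cdf(90, shape, scale))
```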

For more details, see Determining distribution parameters from quantiles. See also the ParameterSolver software.

Update: I posted an article on CodeProject with Python code for computing the parameters described here.

Related posts:

Biostatistics software
Diagram of distribution relationships
How to calculate percentiles in memory-bound applications

{ 1 comment }

Statisticians take themselves too seriously

by John on January 28, 2010

I suppose most people take themselves too seriously, but I’ve been thinking specifically about how statisticians take themselves too seriously.

The fundamental task of statistics is making decisions in the presence of uncertainty, and that’s hard. You have to make all kinds of simplifying assumptions and arbitrary choices to get anywhere. But after a while you lose sight of these decisions. Or you justify your decisions after the fact, making a virtue out of a necessity. After you’ve worked on a problem long enough, it’s nearly impossible to say “Of course, our whole way of thinking about this might have been wrong from the beginning.”

My concern is not so much “creative” statistics but rather uncreative statistics, rote application of established methods. Statistics is extremely conventional. But a procedure is not objective just because it is conventional.  An arbitrary choice made 80 years ago is still an arbitrary choice.

I’ve taken myself too seriously at times in regard to statistical matters; it’s easy to get caught up in your model. But I’m reminded of a talk I heard one time in which the speaker listed a number of embarrassing things that people used to believe. He was not making a smug comparison of how sophisticated we are now compared to our ignorant ancestors. Instead, his point was that we too may be mistaken. He exhorted everyone to look in a mirror and say “I may be wrong. I may be very, very wrong.”

Related posts:

The IOT test
Bike shed arguments
The data may not contain the answer
Problems versus dilemmas
Approximate problems and approximate solutions

{ 4 comments }

Estimating reporting rates

by John on January 25, 2010

Suppose the police department in your community reported an average of 10 burglaries per month. You could take that at face value and assume there are 10 burglaries per month. But maybe there are 20 burglaries a month but only half are reported.  How could you tell?

Here’s a similar problem.  Suppose you gave away an electronic book.  You stated that people are free to distribute it however they want, but that you’d appreciate feedback.  How could you estimate the number of people who have read the book?  If you get email from 100 readers, you know at least 100 people have read it, but maybe 10,000 people have read it and only 1% sent email.  How can you estimate the number of readers and the percentage who send email at the same time?

{ 8 comments }

Biostatistics software

by John on January 13, 2010

The M. D. Anderson Cancer Center Department of Biostatistics has a software download site listing software developed by the department over many years.

The home page of the download site allows you to see all products sorted by date or by name. This page also allows search. A new page lets you see the software organized by tags.

{ 1 comment }

How the central limit theorem began

by John on January 5, 2010

The Central Limit Theorem says that if you average enough independent copies of a random variable, the result has a nearly normal (Gaussian) distribution. Of course that’s a very rough statement of the theorem. What are the precise requirements of the theorem? That question took two centuries to resolve. You can see the final answer here.

The first version of the Central Limit Theorem appeared in 1733, but necessary and sufficient conditions weren’t known until 1935. I won’t recap the entire history here. I just want to comment briefly on how the Central Limit Theorem began and how different the historical order of events was from the typical order of presentation.

A typical probability course might proceed as follows.

  1. Define the normal distribution.
  2. State and prove a special case of the Central Limit Theorem.
  3. Present the normal approximation to the binomial as a corollary.

This is the opposite of the historical order of events.

Abraham de Moivre discovered he could approximate binomial distribution probabilities using the integral of exp(-x²) and proved an early version of the Central Limit Theorem in 1733. At the time, there was no name given to his integral. Only later did anyone think of exp(-x²) as the density of a probability distribution. De Moivre certainly didn’t use the term “Gaussian” since Gauss was born 44 years after de Moivre’s initial discovery. De Moivre also didn’t call his result the “Central Limit Theorem.” George Pólya gave the theorem that name in 1920 as it was approaching its final form.
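To see the kind of approximation de Moivre found, here is a small sketch in modern notation (not his) comparing an exact binomial probability with the value of the matching normal density. The function names and the choice of n = 100 are mine.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(k, n, p=0.5):
    """Exact probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_approx(k, n, p=0.5):
    """Normal density with the binomial's mean and variance, evaluated at k."""
    mean = n * p
    var = n * p * (1 - p)
    return exp(-(k - mean)**2 / (2 * var)) / sqrt(2 * pi * var)

for k in (40, 45, 50, 55, 60):
    print(k, binomial_pmf(k, 100), normal_approx(k, 100))
```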

For more details, see The Life and Times of the Central Limit Theorem.

The Life and Times of the Central Limit Theorem by William Adams

Related links:

Sums of uniform random variables
Quantifying the error in the central limit theorem
Three central limit theorems

{ 1 comment }

A case for robust Bayesian priors

by John on November 30, 2009

A paper I wrote with Jairo Fúquene and Luis Pericchi is now available online.

A Case for Robust Bayesian Priors with Applications to Clinical Trials
Jairo Fúquene, John Cook, and Luis Pericchi
Bayesian Analysis (2009) 4, Number 4, pp. 817–846.

{ 0 comments }