Posts tagged as:

Probability and Statistics

A beta-like distribution

by John on November 24, 2009

I just stumbled across a distribution that approximates the beta distribution but is easier to work with in some ways. It’s called the Kumaraswamy distribution. Apparently it came out of hydrology. The graph below plots the density of the distribution for various parameters. If you’re familiar with the beta distribution, these curves will look very familiar.

Density plots of the Kumaraswamy distribution

The PDF for the Kumaraswamy distribution K(a, b) is

f(x | a, b) = abxa-1(1 – xa)b-1

and the CDF is

F(x | a, b) = 1 – (1 – xa)b.

The most convenient feature of the Kumaraswamy distribution is that its CDF has a simple form. (The CDF for a beta distribution cannot be reduced to elementary functions unless its parameters are integers.)  Also, the CDF is easy to invert. That means you can generate a random sample from a K(a, b) distribution by first generating a uniform random value u and then returning

F-1(u) = (1 – (1 – u)1/b)1/a.

If you’re going to use a Kumaraswamy distribution to approximate a beta distribution, the question immediately arises of how to find parameters to get a good approximation. That is, if you have a beta(α, β) distribution that you want to approximate with a K(a, b) distribution, how do you pick a and b?

My first thought was to match moments. That is, pick a and b so that K(a, b) has the same mean and variance as beta(α, β). That may work well, but it would have to be done numerically.

Since the beta(α, β) density is proportional to xα (1-x)β-1 and the K(a, b) distribution is proportional to xa(1 – xa)b, it seems reasonable to set a = α. But how do you pick b? The modes of the two distributions have simple forms and so you could pick b to match modes:

mode K(a, b) = ((a – 1)/(ab – 1))1/a = mode beta(α, β) = (α – 1)/(α + β – 2).

Update: I experimented with the method above, and it’s OK, but not great. Here’s an example comparing a beta(1/2, 1/2) density with a K(1/2, 2 – √2) density.

comparing two u-shaped densities

Here the K density matches the beta density not at the mode but at the minimum. The blue curve, the curve on top, is the beta density.

Here’s another example, this time comparing a beta(5, 3) density and a K(5, 251/40) density.

comparing K and beta densities

Again the beta density is the blue curve, on top at the mode.

Maybe the algorithm I suggested for picking parameters is not very good, but I suspect the optimal parameters are not much better. Rather than saying that the Kumaraswamy distribution approximates the beta distribution, I’d say that the Kumaraswamy distribution is capable of assuming roughly the same shapes as the beta distribution. If the only reason you’re using a beta distribution is to get a certain density shape, the Kumaraswamy distribution would be a reasonable alternative. But if you need to approximate a beta distribution closely, it may not work well enough.

{ 5 comments }

Random inequalities IX: new tech report

by John on November 20, 2009

Just posted: Exact calculation of inequality probabilities. This report summarizes previous results for calculating P(X > Y) where X and Y are random variables.

Previous posts on random inequalities:

Introduction
Analytical results
Numerical results
Cauchy distributions
Beta distributions
Gamma distributions
Three or more random variables
Folded normals

{ 6 comments }

Counterfeit coins and rare diseases

by John on November 12, 2009

Here’s a puzzle I saw a long time ago that came to mind recently.

You have a bag of 27 coins. One of these coins is counterfeit and the rest are genuine. The genuine coins all weigh exactly the same, but the counterfeit coin is lighter. You have a simple balance. How can you find the counterfeit coin while using the scale the least number of times?

The surprising answer is that the false coin can be found while only using the scales only three times. Here’s how. Put nine coins on each side of the balance. If one side is lighter, the counterfeit is on that side; otherwise, it is one of the nine not on the scales. Now that you’ve narrowed it down to nine coins, apply the same idea recursively by putting three of the suspect coins on each side of the balance. The false coin is now either on the lighter side if the scales do not balance or one of the three remaining coins if the scales do balance. Now apply the same idea one last time to find which of the remaining three coins is the counterfeit. In general, you can find one counterfeit in 3n coins by using the scales n times.

The counterfeit coin problem came to mind when I was picking out homework problems for a probability class and ran into the following (problem 4.56 here):

A large number, N = mk, of people are subject to a blood test. This can be administered in two ways.

  1. Each person can be tested separately. In this case N tests are required.
  2. The blood samples of k people can be pooled and analyzed together. If the test is negative, this one test suffices for k people. If the test is positive, each of the k people must be tested separately, and, in all, k+1 test are required for the k people.

Suppose each person being tested has the disease with probability p. If the disease is rare, i.e. p is sufficiently small, the second approach will be more efficient. Consider the extremes. If p = 0, the first approach takes mk tests and the second approach takes only m tests. At the other extreme, if p = 1, the first approach still takes mk tests but the second approach now takes m(k+1) tests.

The homework problem asks for the expected number of tests used with each approach as a function of p for fixed k. Alternatively, you could assume that you always use the second method but need to find the optimal value of k. (This includes the possibility that k=1, which is equivalent to using the first method.)

I’d be curious to know whether these algorithms have names. I suspect computer scientists have given the coin testing algorithm a name. I also suspect the idea of pooling blood samples has several names, possibly one name when it is used in literally testing blood samples and other names when the same idea is applied to analogous testing problems.

{ 5 comments }

Yet another view of the negative binomial

by John on November 3, 2009

A while back I wrote a post on three views of the negative binomial distribution. This post adds a fourth view.

One of the shortcomings of the Poisson distribution is that its variance exactly equals its mean. It is common in practice for the variance of count data to be larger than the mean, so it’s natural to look for a distribution like the Poisson but with larger variance. We start with a Poisson random variable X with mean λ, but then we make λ itself random and suppose that λ comes from a gamma(α, β) distribution. Then the marginal distribution on X is a negative binomial distribution with parameters r = α and p = 1/(β + 1).

The previous post said that the negative binomial is useful because it has more variance than the Poisson. The derivation above explains why the negative binomial should have more variance than the Poisson.

For details, see updated notes on the negative binomial distribution.

Related links:

Three views of the negative binomial distribution
Student-t as a mixture of normals
Diagram of distribution relationships
General binomial coefficients

{ 1 comment }

Student-t as a mixture of normals

by John on October 30, 2009

You can express a Student-t distribution as a continuous mixture of normal distributions. Some properties of the t distribution are easier to prove in this form. Here are notes with details.

I ran across this tidbit reading Bayesian Data Analysis by Gelman et al.

Related post: Beer, Wine, and Statistics (origin of the Student-t distribution)

{ 1 comment }

Normal tail probability bounds

by John on October 22, 2009

Here are some notes on upper and lower bounds on the probability P(Z > t) for a standard normal random random variable Z. I wrote up these notes to settle a issue that came up in a probability class I’m teaching. It’s surprising that there are simple functions that provide efficient bounds on the normal distribution function.

{ 0 comments }

Achievement is not normal

by John on September 29, 2009

Angela Duckworth gave a 90-second talk entitled Why Achievement Isn’t Normal.

She’s using the term “normal” in the sense of the normal (Gaussian) distribution, the bell curve. With normally distributed attributes, such as height, most people are near the middle and very few are far from the middle. Also, the distribution is symmetric: as many people are likely to be above the middle as below.

Achievement is not like that in many fields. The highest achievers achieve far more than average. The best programmers may be 100 times more productive than average programmers. The wealthiest people have orders of magnitude more wealth than average. Best selling authors far outsell average authors.

Angela Duckworth says achievement is not normal, it’s log-normal. The log-normal distribution is skewed to the right. It has a long tail, meaning that values far from the mean are fairly common. The idea of using a long-tailed distribution makes sense, but I don’t understand the justification for the log-normal distribution in particular given in the video. This is not to disparage the speaker. No one can give a detailed derivation of a statistical distribution in 90 seconds. I’ll give a plausibility argument below. If you’re not interested in the math, just scroll down to the graph at the bottom.

The factors that contribute to achievement are often multiplicative. That is, advantages multiply rather than add. If your first book is a success, more people will give your second book a chance. Your readership doesn’t simply add, as if each book were written by a different person. Instead, your audience compounds. Web sites with more inbound links get a higher search engine rank. More people find these sites because of their ranking, and so more people link to them, and the ranking goes up. Skills like communication and organization don’t just contribute additively as they would on a report card; they are multipliers that amplify your effectiveness in other areas.

The log-normal distribution has two parameters: μ and σ. These look like the mean and standard deviation parameters, but they are not the mean and and standard deviation of the log-normal. If X is a log-normal(μ , σ) random variable, then log(X) has a normal(μ, σ) distribution. The parameters μ and σ are not the mean and standard deviation of X but of log(X).

The product of two log-normal distributions is log-normal because the sum of two normal distributions is normal. So if the contributions to achievement are multiplicative, log-normal distributions will be convenient to model achievement.

I said earlier that log-normal distributions are skewed. I’ve got something of a circular argument if I start with the assumption that the factors that contribute to achievement are skewed and then conclude that achievement is skewed. But log-normal distributions have varying degrees of skewness. When σ is small, the distribution is approximately normal. So you could start with individual factors that have a nearly normal distribution, modeled by a log-normal distribution. Then you can show that as you multiply these together, you get a distribution more skewed than it’s inputs.

Suppose you have n random variables that have a log-normal(1, σ) distribution. Their product will have a log-normal(n, √n σ) distribution. As n increases, the distribution of the product becomes more skewed. Here is an example. The following graph shows the density of a log-normal(1, 0.2) distribution.

plot of log-normal(1, 0.2) density

Here is the distribution of the product of nine independent copies of the above distribution, a log-normal(9, 0.6) distribution.

plot of log-normal(9, 0.6) density

So even though the original distribution is symmetric and concentrated near the middle, the product of nine independent copies has a long tail to the right.

Related posts:

Small advantages show up in the extremes
Variation in male and female Olympic performance: Part 1, Part 2
Evaluate people at their best or at their worst?

{ 5 comments }

The negative binomial distribution is interesting because it illustrates a common progression of statistical thinking. My aim here is to tell a story, not to give details; the details are available here. The following gives a progression of three perspectives.

First view: Counting

The origin of the negative binomial is very concrete. It is unfortunate that the name makes the distribution seem more abstract than it is. (What could possibly be negative about a binomial distribution? Sounds like abstract nonsense.)

Suppose you have decided to practice basketball free throws. You’ve decided to practice until you have made 20 free throws. If your probability of making a single free throw is p, how many shots will you have to attempt before you make your goal of 20 successes? Obviously you’ll need at least 20 attempts, but you might need a lot more. What is the expected number of attempts you would need? What’s the probability that you’ll need more than 50 attempts? These questions could be answered by using a negative binomial distribution. A negative binomial probability distribution with parameters r and p gives the probabilities of various numbers of failures before the rth success when each attempt has probability of success p.

Second view: Cheap generalization

After writing down the probability mass function for the negative binomial distribution as described above, somebody noticed that the number r didn’t necessarily have to be an integer. The distribution was motivated by integer values of r, counting the number of failures before the rth success, but the resulting formula makes sense even when r is not an integer. It doesn’t make sense to wait for 2.87 successes; you can’t interpret the formula as counting events unless r is an integer, but the formula is still mathematically valid.

The probability mass function involves a binomial coefficient. These coefficients were first developed for integer arguments but later extended to real and even complex arguments. See these notes for definitions and these notes for how to calculate the general coefficients. The probability mass function can be written most compactly when one of the binomial coefficient has a negative argument. See page two of these notes for an explanation. There’s no intuitive explanation of the negative argument. It’s just a consequence of some algebra.

What’s the point in using non-integer values of r? Just because we can? No, there are practical reasons, and that leads to our third view.

Third view: Modeling overdispersion

Next we take the distribution above and forget where it came from. It was motivated by counting successes and failures, but now we forget about that and imagine the distribution falling from the sky in its general form described above. What properties does it have?

The negative binomial distribution turns out to have a very useful property. It can be seen as a generalization of the Poisson distribution. (See this distribution chart. Click on the dashed arrow between the negative binomial and Poisson boxes.)

The Poisson is the simplest distribution for modeling count data. It is in some sense a very natural distribution and it has nice theoretical properties. However, the Poisson distribution has one severe limitation: its variance is equal to its mean. There is no way to increase the variance without increasing the mean. Unfortunately, in many data sets the variance is larger than the mean. That’s where the negative binomial comes in. When modeling count data, first try the simplest thing that might work, the Poisson. If that doesn’t work, try the next simplest thing, negative binomial.

When viewing the negative binomial this way, a generalization of the Poisson, it helps to use a new parameterization. The parameters r and p are no longer directly important. For example, if we have empirical data with mean 20.1 and variance 34.7, we would naturally be interested in the negative binomial distribution with this mean and variance. We would like a parametrization that reflects more directly the mean and variance and one that makes the connection with the Poisson more transparent. That is indeed possible, and is described in these notes.

Update: Here’s a new post giving a fourth view of the negative binomial distribution — a continuous mixture of Poisson distributions. This view explains why the negative binomial is related to the Poisson and yet has greater variance.

Related links:

Notes on the negative binomial distribution
General binomial coefficients
Diagram of distribution relationships
Upper and lower bounds on binomial coefficients

{ 4 comments }

Make up your own rules of probability

by John on September 18, 2009

Keith Baggerly and Kevin Coombes just wrote a paper about the analysis errors they commonly see in bioinformatics articles. From the abstract:

One theme that emerges is that the most common errors are simple (e.g. row or column offsets); conversely, it is our experience that the most simple errors are common.

The full title of the article by Keith Baggerly and Kevin Coombes is “Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology.” The article will appear in the next issue of Annals of Applied Statistics and is available here. The key phrase in the title is forensic bioinformatics: reverse engineering statistical analysis of bioinformatics data. The authors give five case studies of data analyses that cannot be reproduced and infer what analysis actually was carried out.

One of the more egregious errors came from the creative application of probability. One paper uses innovative probability results such as

P(ABCD) = P(A) + P(B) + P(C) + P(D) – P(A) P(B) P(C) P(D)

and

P(AB) = max( P(A), P(B) ).

Baggerly and Coombes were remarkably understated in their criticism: “None of these rules are standard.” In less diplomatic language, the rules are wrong.

To be fair, Baggerly and Coombes point out

These rules are not explicitly stated in the methods; we inferred them either from formulae embedded in Excel files … or from exploratory data analysis …

So, the authors didn’t state false theorems; they just used them. And nobody would have noticed if Baggerly and Coombes had not tried to reproduce their results.

Related posts:

Irreproducible analysis
Highlights from Reproducible Ideas
Reproducible Ideas blog winding down

{ 6 comments }

A woman nearly became violent in a math class I taught several years ago.  I was going over homework problems and she wanted to know whether a certain problem was a “permutation” or a “combination.”  She knew how to solve two kinds of problems and was irritated when I told her that her homework problem didn’t fall into either of her two categories.

She insisted that I tell her which of the two techniques would solve the problem and nearly lost control when I repeated that neither would work. The rest of the students and I were shocked. A little nervous laughter broke the tension and we resumed going over the homework.

The angry student had implicitly come to believe that if a counting problem contains two numbers, n and k, there are only two possible answers: P(n, k) and C(n, k). Here P(n, k) = n!/(n-k)! and C(n, k) = P(n, k)/k!. She was not alone. Students commonly believe this, and for good reason: most homework problems can be solved this way. For example, a club with 12 members can elect five distinct officers in P(12, 5) ways and they can select a committee of five members in C(12, 5) ways. It’s easy for an instructor or textbook author to think of dozens of homework problems in these patterns and unintentionally imply that these are these are the only possibilities. However, the following problem contains the numbers 12 and 5 but the solution is neither P(12, 5) nor C(12, 5).

Suppose you have a class of 12 students. Each student will receive one of five letter grades: A, B, C, D, or F. At the end of the course, you tally up how many students received each grade. How many different ways could the tally turn out?  For example, one possibility would be all A’s.  Another would be three A’s, four B’s, four C’s, no D’s, and one F.

The grade tally problem is representative of a class of problems that come up fairly often in application but that lie just outside what students typically learn. The general solution is written up in these notes on counting selections with replacement. The notes include the famous “stars and bars” explanation by William Feller.

By the way,  there are 1,820 possible grade tallies for 12 students and five grades.

Related links:

Binomial coefficients
Four uncommon but handy notations
How to compute binomial coefficients
Richard Stanley’s 12-fold way

{ 1 comment }

The IOT test

by John on August 31, 2009

In his book Flaw of Averages, Sam Savage describes the IOT test of statistical significance.

Joe Berkson, a statistician at the Mayo Clinic, developed his own criterion, which he termed the IOT Test, or Inter Ocular Trauma Test, requiring a graph that hit you between the eyes.

{ 1 comment }

Mortgages, banks, and Jensen’s inequality

by John on August 26, 2009

Sam Savage’s new book Flaw of Averages has a brilliantly simple explanation of why volatility in the housing market caused such problems for banks recently. When housing prices drop, more people default on their mortgages and obviously that hurts banks. But banks are also in trouble when the housing market is volatile, even if on average house prices are good.

Suppose there’s no change in the housing market on average. Prices go up in some areas and down in other areas. As long as the ups and downs average out, there should be no change in bank profits, right? Wrong!

When the housing market goes up a little bit, mortgage defaults drop a little bit and bank profits go up a little bit. But when the market goes down a little bit, defaults go up more than just a little bit, and bank profits go down more than a little bit. There is much more down-side potential than up-side potential. Say 95% of homeowners pay their mortgages. Then a good housing market can only improve repayments by 5%. But a bad housing market could decrease repayments by much more.

In mathematical terminology, the bank profits are a concave function of house prices. Jensen’s inequality says that if f() is a concave function (say bank profits) and X is a random variable (say, house prices) then the average of f(X) is less than f(average of X). Average profit is less than the profit from the average.

The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty by Sam Savage

Related posts:

Convex optimization
Log-concave functions

{ 2 comments }

Power laws and the generalized CLT

by John on August 19, 2009

Here’s an expert from a recent ACM Ubiquity interview with David Alderson that begs a few questions.

Actually, they [power laws] aren’t special at all. They can arise as natural consequences of aggregation of high variance data. You know from statistics that the Central Limit Theorem says distributions of data with limited variability tend to follow the Normal (bell-shaped, or Gaussian) curve. There is a less well-known version of the theorem that shows aggregation of high (or infinite) variance data leads to power laws. Thus, the bell curve is normal for low-variance data and the power law curve is normal for high-variance data. In many cases, I don’t think anything deeper than that is going on.

In this post I will explain the theory I believe Alderson is alluding to in his informal remarks. I’ll also explain some restrictions necessary for this theory to hold.

[click to continue...]

{ 8 comments }

Physical explanation of median

by John on August 12, 2009

This post from Statpics shows how to visualize the median of a set of numbers via a necklace draped over a pulley.

{ 0 comments }

Probability distributions in SciPy

by John on July 20, 2009

Here are some notes on how to work with probability distributions using the SciPy numerical library for Python.

Functions related to probability distributions are located in scipy.stats. The general pattern is

scipy.stats.<distribution family>.<function>

There are 81 supported continuous distribution families and 12 discrete distribution families. Some distributions have obvious names: gamma, cauchy, t, f, etc. The only possible surprise is that all distributions begin with a lower-case letter, even those corresponding to a proper name (e.g. Cauchy). Other distribution names are less obvious: expon for the exponential, chi2 for chi-squared distribution, etc.

Each distribution supports several functions. The density and cumulative distribution functions are pdf and cdf respectively. (Discrete distributions use pmf rather than pdf.) One surprise here is that the inverse CDF function is called ppf for “percentage point function.” I’d never heard that terminology and would have expected something like “quantile.”

Example: scipy.stats.beta.cdf(0.1, 2, 3) evaluates the CDF of a beta(2, 3) random variable at 0.1.

Random values are generated using rvs which takes an optional size argument. The size is set to 1 by default.

Example: scipy.stats.norm.rvs(2, 3) generates a random sample from a normal (Gaussian) random variable with mean 2 and standard deviation 3. The function call scipy.stats.norm.rvs(2, 3, size = 10) returns an array of 10 samples from the same distribution.

The command line help() facility does not document the distribution parameterizations, but the external documentation does. Most distributions are parameterized in terms of location and scale. This means, for example, that the exponential distribution is parameterized in terms of its mean, not its rate. Somewhat surprisingly, the exponential distribution has a location parameter. This means, for example, that scipy.stats.expon.pdf(x, 7) evaluates at x the PDF of an exponential distribution with location 7. This is not what I expected. I assumed there would be no location parameter and that the second argument, 7, would be the mean (scale). Instead, the location was set to 7 and the scale was left at its default value 1. Writing scipy.stats.expon.pdf(x, scale=7) would have given the expected result because the default location value is 0.

SciPy also provides constructors for objects representing random variables.

Example: x = scipy.stats.norm(3, 1); x.cdf(2.7) returns the same value as scipy.stats.norm.cdf(2.7, 3, 1).

Constructing objects representing random variables encapsulates the differences between distributions in the constructors. For example, some distributions take more parameters than others and so their object constructors require more arguments. But once a distribution object is created, its PDF, for example, can be called with a single argument. This makes it easier to write code that takes a general distribution object as an argument.

Related posts:

Numerical computing in IronPython with IronClad
Stand-alone error function erf(x)

{ 7 comments }