Small probabilities add, big ones don’t

A video has been making the rounds in which a well-known professor [1] says that if something has a 20% probability of happening in one attempt, then it has a 40% chance of happening in two attempts, a 60% chance in happening in three attempts, etc.

This is wrong, but it’s a common mistake. And one reason it’s common is that a variation on the mistake is approximately correct, which we will explain shortly.

It’s obvious the reasoning in the opening paragraph is wrong when you extend it to five, or especially six, attempts. Are you certain to succeed after five attempts? What does it even mean that you have a 120% chance of success after six attempts?!

But let’s reduce the probabilities in the opening paragraph. If there’s a 2% chance of success on your first attempt, is there a 4% chance of success in two attempts and a 6% chance of success in three attempts? Yes, approximately.

Two attempts

Here’s is the correct formula for the probability of an event happening in two tries.

P(A \cup B) = P(A) + P(B) - P(A\cap B)

In words, the probability of A or B happening equals the probability of A happening, plus the probability of B happening, minus the probability of A and B both happening. The last term is is a correction term. Without it, you’re counting some possibilities twice.

So if the probability of success on each attempt is 0.02, the probability of success on two attempts is

0.02 + 0.02 − 0.0004 = 0.0396 ≈ 0.04.

When the probabilities of A and B are each small, the probability of A and B both happening is an order of magnitude smaller, assuming independence [2]. The smaller the probabilities of A and B, the less the correction term matters.

If the probability of success on each attempt is 0.2, now the probability of success after two attempts is 0.36. Simply adding probabilities and neglecting the correction term is incorrect, but not terribly far from correct in this case.

Three attempts

When you consider more attempts, things get more complicated. The probability of success after three attempts is given by

\begin{align*} P(A \cup B \cup C) &= P(A) + P(B) + P(C) \\ &- P(A\cap B) - P(B \cap C) - P(A \cap C) \\ &+ P(A \cap B \cap C) \end{align*}

as I discuss here. Adding the probabilities of success separately over-estimates the correct probability. So you correct by subtracting the probabilities of pairs of successes. But then this is over-corrects, because you need to add back in the probability of three successes.

If A, B, and C all have a 20% probability, the probability of A or B or C happening is 48.8%, not 60%, again assuming independence.

The error from naively adding probabilities increases when the number of probabilities increase.

n attempts

Now let’s look at the general case. Suppose your probability of success on each attempt is p. Then your probability of failure on each independent attempt is 1 − p. The probability of at least one success out of n attempts is the complement of the probability of all failures, i.e.

1 - (1 - p)^n

When p is small, and when n is small, we can approximate this by np. That’s why naively adding probabilities works when the probabilities are small and there aren’t many of them. Here’s a way to say this precisely using the binomial theorem.

\begin{align*} 1 - (1 - p)^n &= 1 - \left(1 - np + {n \choose 2}p^2 - {n \choose 3}p^3 - \cdots\right ) \\ &= np + {\cal O}(p^2) \end{align*}

The exact probability is np plus (n − 1) terms that involve higher powers of p. When p and n are sufficiently small, these terms can be ignored.

 

[1] I’m deliberately not saying who. My point here is not to rub his nose in his mistake. This post will be online long after the particular video has been forgotten.

[2] Assuming A and B are independent. This is not always the case, and wrongly assuming independence can have disastrous consequences as I discuss here, but that’s a topic for another day.

Logistic regression quick takes

This post is a series of quick thoughts related to logistic regression. It starts with this article on moving between logit and probability scales.

***

Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.

***

The article cited above is entitled “Rough approximations to move between logit and probability scales.” Here is a paragraph from the article giving its motivation.

When working in this space, it’s helpful to have some approximations to move between the logit and probability scales. (As an analogy, it is helpful to know that for a normal distribution, the interval ± 2 standard deviations around the mean covers about 95% of the distribution, while ± 3 standard deviations covers about 99%.)

Here are half the results from the post; the other half follow by symmetry.

    |-------+-------|
    |  prob | logit |
    |-------+-------|
    | 0.500 |     0 |
    | 0.750 |     1 |
    | 0.900 |     2 |
    | 0.995 |     3 |
    | 0.999 |     7 |
    |-------+-------|

Zero on the logit scale corresponds exactly to a probability of 0.5. The other values are approximate.

When I say the rest of the table follows by symmetry, I’m alluding to the fact that

logit(1 − p) = − logit(p).

So, for example, because logit(0.999) ≈ 7, logit(0.001) ≈ −7.

***

The post reminded me of the decibel scale. As I wrote in this post, “It’s a curious and convenient fact that many decibel values are close to integers.”

  • 3 dB ≈ 2
  • 6 dB ≈ 4
  • 7 dB ≈ 5
  • 9 dB ≈ 8

I was curious whether the logit / probability approximations were as accurate as these decibel approximations. Alas, they are not. They are rough approximations, as advertised in the title, but still useful.

***

The post also reminded me of a comment by Andrew Gelman and Jennifer Hill on why  natural logs are natural for regression.

 

MCMC and the coupon collector problem

Bob Carpenter wrote today about how Markov chains cannot thoroughly cover high-dimensional spaces, and that they do not need to. That’s kinda the point of Monte Carlo integration. If you want systematic coverage, you can/must sample systematically, and that’s impractical in high dimensions.

Bob gives the example that if you want to get one integration point in each quadrant of a 20-dimensional space, you need a million samples. (220 to be precise.) But you may be able to get sufficiently accurate results with a lot less than a million samples.

If you wanted to be certain to have one sample in each quadrant, you could sample at (±1, ±1, ±1, …, ±1). But if for some weird reason you wanted to sample randomly and hit each quadrant, you have a coupon collector problem. The expected number of samples to hit each of N cells with uniform [1] random sampling with replacement is

N( log(N) + γ )

where γ is Euler’s constant. So if N = 220, the expected number of draws would be over 15 million.

Related posts

[1] We’re assuming each cell is sampled independently with equal probability each time. Markov chain Monte Carlo is more complicated than that because the draws are not independent.

A surprising result about surprise index

Surprise index

Warren Weaver [1] introduced what he called the surprise index to quantify how surprising an event is. At first it might seem that the probability of an event is enough for this purpose: the lower the probability of an event, the more surprise when it occurs. But Weaver’s notion is more subtle than this.

Let X be a discrete random variable taking non-negative integer values such that

\text{Prob}(X = i) = p_i

Then the surprise index of the ith event is defined as

S_i = \frac{\sum_{j=0}^\infty p_j^2 }{p_i}

Note that if X takes on values 0, 1, 2, … N−1 all with equal probability 1/N, then Si = 1, independent of N. If N is very large, each outcome is rare but not surprising: because all events are equally rare, no specific event is surprising.

Now let X be the number of legs a human selected at random has. Then p2 ≈ 1, and so the numerator in the definition of Si is approximately 1 and S2 is approximately 1, but Si is large for any value of i ≠ 2.

The hard part of calculating the surprise index is computing the sum in the numerator. This is the same calculation that occurs in many contexts: Friedman’s index of coincidence, collision entropy in physics, Renyi entropy in information theory, etc.

Poisson surprise index

Weaver comments that he tried calculating his surprise index for Poisson and binomial random variables and had to admit defeat. As he colorfully says in a footnote:

I have spent a few hours trying to discover that someone else had summed these series and spent substantially more trying to do it myself; I can only report failure, and a conviction that it is a dreadfully sticky mess.

A few years later, however, R. M. Redheffer [2] was able to solve the Poisson case. His derivation is extremely terse. Redheffer starts with the generating function for the Poisson

e^{-\lambda} e^{\lambda x} = \sum p_mx^m

and then says

Let x = eiθ; then eiθ; multiply; integrate from 0 to 2π and simplify slightly to obtain

\sum p_m^2 = (e^{2\lambda}/\pi) \int_0^\pi e^{2\lambda \cos \theta}\, d\theta

The integral on the right is recognized as the zero-order Bessel function …

Redheffer then “recognizes” an expression involving a Bessel function. Redheffer acknowledges in a footnote at a colleague M. V. Cerrillo was responsible for recognizing the Bessel function.

It is surprising that the problem Weaver describes as a “dreadfully sticky mess” has a simple solution. It is also surprising that a Bessel function would pop up in this context. Bessel functions arise frequently in solving differential equations but not that often in probability and statistics.

Unpacking Redheffer’s derivation

When Redheffer says “Let x = eiθ; then eiθ; multiply; integrate from 0 to 2π” he means that we should evaluate both sides of the equation for the Poisson generating function equation at these two values of x, multiply the results, and average the both sides over the interval [0, 2π].

On the right hand side this means calculating

\frac{1}{2\pi} \int_0^{2\pi}\left(\sum_{m=0}^\infty p_m \exp(im\theta)\right) \left(\sum_{n=0}^\infty p_n \exp(-in\theta)\right)\, d\theta

This reduces to

\sum_{m=0}^\infty p_m^2

because

alt=

i.e. the integral evaluates to 1 when m = n but otherwise equals zero.

On the left hand side we have

\begin{align*} \frac{1}{2\pi} \int_0^{2\pi} \exp(-2\lambda) \exp(\lambda(e^{i\theta} + e^{-i\theta})) \, d\theta &= \frac{1}{2\pi} \int_0^{2\pi} \exp(-2\lambda) \exp(2 \cos \theta) \, d\theta \\ &= \frac{e^{-2\lambda}}{\pi} \int_0^{\pi} \exp(2 \cos \theta) \, d\theta \end{align*}

Cerrillo’s contribution was to recognize the integral as the Bessel function J0 evaluated at -2iλ or equivalently the modified Bessel function I0 evaluated at -2λ. This follows directly from equations 9.1.18 and 9.6.16 in Abramowitz and Stegun.

Putting it all together we have

\frac{\pi}{e^{2\lambda}}\sum_{m=0}^\infty p_m^2 = \int_0^{\pi} \exp(2 \cos \theta) \, d\theta = J_0(-2i\lambda ) = I_0(-2\lambda )

Using the asymptotic properties of I0 Redheffer notes that for large values of λ,

\sum_{m=0}^\infty p_m^2 \sim \frac{1}{2\sqrt{\pi\lambda}}

[1] Warren Weaver, “Probability, rarity, interest, and surprise,” The Scientific Monthly, Vol 67 (1948), p. 390.

[2] R. M. Redheffer. A Note on the Surprise Index. The Annals of Mathematical Statistics, Mar., 1951, Vol. 22, No. 1 pp. 128ndash;130.

Estimating an author’s vocabulary

How would you estimate the size of an author’s vocabulary? Suppose you have a analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have know words he never used in writing, or that at least not in works you have access to.

Brainerd [1] suggested the following estimator based on a Markov chain model of language. The estimated vocabulary is the number N satisfying the equation

\sum_{j=0}^{x-1}\left(1 - \frac{j}{N}\right)^{-1} = n

The left side is a decreasing function of N, so you could solve the equation by finding a values of N that make the sum smaller and larger than n, then use a bisection algorithm.

We can see that the model is qualitatively reasonable. If every word is unique, i.e. x = n, then the solution is N = ∞. If you haven’t seen any repetition, you the author could keep writing new words indefinitely. As the amount of repetition increases, the estimate of N decreases.

Brainerd’s model is simple, but it tends to underestimate vocabulary. More complicated models might do a better job.

Problems analogous to estimating vocabulary size come up in other applications. For example, an ecologist might want to estimate the number of new species left to be found based on the number of species seen so far. In my work in data privacy I occasionally have to estimate diversity in a population based on diversity in a sample. Both of these examples are analogous to estimating potential new words based on the words you’ve seen.

[1] Brainerd, B. On the relation between types and tokes in literary text, J. Appl. Prob. 9, pp. 507-5

How likely is a random variable to be far from its center?

There are many answers to the question in the title: How likely is a random variable to be far from its center?

The answers depend on how much you’re willing to assume about your random variable. The more you can assume, the stronger your conclusion. The answers also depend on what you mean by “center,” such as whether you have in mind the mean or the mode.

Chebyshev’s inequality says that the probability of a random variable X taking on a value more than k standard deviations from its mean is less than 1/k². This of course assumes that X has a mean and a standard deviation.

If we assume further that X is unimodal, and k ≥ √(8/3), then the conclusion of Chebyshev’s inequality can be strengthened to saying that the probability of X being more than k standard deviations from its mean is less than 4/9k². This is the Vysochanskiĭ-Petunin inequality. More on this inequality here.

If k ≤ √(8/3) the Vysochanskiĭ-Petunin inequality says the probability of X being more than k standard deviations from its mean is less than

4/3k² − 1/3.

Gauss’ inequality is similar to the Vysochanskiĭ-Petunin inequality. It also assumes X is unimodal, and for convenience we will assume the mode is at zero (otherwise look at Y = Xm where m is the mode of X). Gauss’ inequality bounds the probability of X being more than k standard deviations away from its mode rather than its mean.

Let τ² be the expected value of X². If the mean value of X is zero then τ² = σ² and the equations below are similar to the Vysochanskiĭ-Petunin inequality. But Gauss does not require that X has mean zero.

Gauss’ inequality says that

P(|X| > kτ) ≤ 4/9k²

if if k ≥ √(4/3) and

P(|X| > kτ) ≤ 1 − k/(√3 τ)

otherwise.

Gauss’ inequality is stronger than the Vysochanskiĭ-Petunin inequality when X has zero mean, but it is also assuming more, namely that the mean and mode are equal.

Related post: Strengthening Markov’s inequality with conditional probability.

Beta inequality symmetries

I was thinking about the work I did when I worked in biostatistics at MD Anderson. This work was practical rather than mathematically elegant, useful in its time but not of long-term interest. However, one result came out of this work that I would call elegant, and that was a symmetry I found.

Let X be a beta(a, b) random variable and let Y be a beta(c, d) random variable. Let g(a, b, c, d) be the probability that a sample from X is larger than a sample from Y.

g(a, b, c, d) = Prob(X > Y)

This function often appeared in the inner loop of a simulation and so we spent thousands of CPU-hours computing its values. I looked for ways to evaluate this function more quickly, and along the way I found a symmetry.

The function I call g was studied by W. R. Thompson in 1933 [1]. Thompson noted two symmetries:

g(a, b, c, d) = 1 − g(c, d, a, b)

and

g(a, b, c, d) = g(d, c, b, a)

I found an additional symmetry:

g(a, b, c, d) = g(d, b, c, a).

The only reference to this result in a journal article as far as I know is a paper I wrote with Saralees Nadarajah [2]. That paper cites an MD Anderson technical report which is no longer online, but I saved a copy here.

Related posts

[1] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, Volume 25, Issue 4. pp. 285 – 294.

[2] John D. Cook and Saralees Nadarajah. Stochastic Inequality Probabilities for Adaptively Randomized Clinical Trials. Biometrical Journal. 48 (2006) pp 256–365.

 

US Census area hierarchy

Some kinds US Census geographic areas nest into a tidy hierarchy, but others do not. Here’s a brief overview of both.

Hierarchical entities

The orderly hierarchy is

  • nation
  • region
  • division
  • state
  • county
  • census tract
  • block group
  • census block.

All cleanly nested.

There are four regions: West, Midwest, Northeast, and South. Each region splits into two or three divisions. Each state falls within one division.

States are divided into counties, counties are divided into census tracts, census tracts are divided into block groups, block groups are divided into census blocks.

The number of entities at each level of the hierarchy is approximately geometrically increasing as show in the following log-scale bar graph. The number of entities at each level was based on the 2010 census.

1 nation, 4 regions, 9 divisions, 50 states, 3143 counties, 74134 tracts, 220742 block groups, 11166336 blocks

Organic entities

The Postal Service cares how mail delivery points are clustered and doesn’t care so much for legal and administrative borders. For this reason, zip codes represent mail delivery points and not geographic regions per se.

Zip codes are subject to change, and can straddle county or even state lines. Zip codes make sense from the US Postal Service, and they are their codes after all. But the temptation to use zip codes as geographic areas is so great that it is often done, even though it results in occasional strange behavior.

The US Census does not use zip codes per se, but rather Zip Code Tabulation Areas (ZCTAs), which basically correspond to zip codes, but rationalize them a bit from the perspective of geography.

Urban Areas and Core Based Statistical Areas (CBSAs) are based around cities, and since for practical purposes a city can spread across state lines, Urban Areas and CBSAs can cross state lines. The organic growth of a city may cross state lines, even if legally the city is divided into two cities.

Census blocks

Census blocks accommodate both the hierarchical and organic geographic areas. That is, a census block is contained entirely within a block group (which is contained within a census tract, which is contained within a county, …) but also within a single ZCTA and a single CBSA.

Related posts

Zero-Concentrated Differential Privacy

Differential privacy can be rigid and overly conservative in practice, and so finding ways to relax pure differential privacy while retaining its benefits is an active area of research. Two approaches to doing this are concentrated differential privacy [1] and Rényi differential privacy [3].

Concentrated differential privacy was used in reporting results from the 2020 US Census. Specifically, zero-concentrated differential privacy with Gaussian noise.

Differential privacy quantifies the potential impact of an individual’s participation or lack of participation in a database and seeks to bound the difference. The original proposal for differential privacy and the approaches discussed here differ in how they measure the difference an individual can make. Both concentrated differential privacy (CDP) and Rényi differential privacy (RDP) use Rényi divergence, though they use it in different ways.

In [3], Mirinov discusses the similarities and differences regarding CDP and RDP. (I changed Mirnov’s reference numbers to the reference numbers used here.)

The closely related work by Dwork and Rothblum [1], followed by Bun and Steinke [2], explore privacy definitions—Concentrated Differential Privacy and zero-Concentrated Differential Privacy—that are framed using the language of, respectively, subgaussian tails and the Rényi divergence. The main difference between our approaches is that both Concentrated and zero-Concentrated DP require a linear bound on all positive moments of a privacy loss variable. In contrast, our definition applies to one moment at a time. Although less restrictive, it allows for more accurate numerical analyses.

(α, ε)-RDP fixes values of α and ε and requires that the Rényi divergence of order α between a randomized mechanism M applied to two adjacent databases, databases that differ by the data on one individual, is bounded by ε.

D_\alpha(M(x) || M(x')) \leq \varepsilon

Zero-concentrated differential privacy (zCDP) with parameters ε and ρ requires that the Rényi divergence is bounded by ε + ρα for all α in (1, ∞).

D_\alpha(M(x) || M(x')) \leq \varepsilon + \rho\alpha

The pros and cons of zCDP and RDP are complicated. For more details, see the references below.

Related posts

[1] Cynthia Dwork and Guy Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.

[2] Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. arXiv:1605.02065 [cs.CR], 2017

[3] Ilya Mironov. Renyi Differential Privacy. arXiv:1702.07476 [cs.CR]

Using dimensional analysis to check probability calculations

Probability density functions are independent of physical units. The normal distribution, for example, works just as well when describing weights or times. But sticking in units anyway is useful.

Normal distribution example

Suppose you’re trying to remember the probability density function for the normal distribution. Is the correct form

\frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma} \right )

or

\frac{1}{\sqrt{2\pi}\sigma^2} \exp\left( -\frac{(x-\mu)^2}{2\sigma} \right )

or

\frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma} \right )

or maybe some other variation?

Suppose the distribution represents heights. (More on that here, here, and here.) The argument to an exponential function must be dimensionless, so the numerator and denominator in the exp() argument must have the same units. Since x has units of length, μ must have units of length. So the numerator has units of length squared, and the denominator must also have units of length squared.

The standard deviation of a quantity has the same units as the quantity. If you have a set of dollar amounts, then the standard deviation of the set is also a dollar amount, not dollars squared or square root of dollars or anything else. The first expression above cannot be right because the numerator has units of length squared but the denominator has units of length.

Next let’s think about how we use a density function. Densities exist to be integrated. This is an important: Densities are not probabilities. They are things you integrate to produce probabilities. So when you integrate the expression above you should get a probability, not a length or any other dimensional quantity. When you integrate with respect to x, the integral has a dx term. Since x has units of length, dx has units of length. That means the density must have units of length−1, i.e. length inverse.

The second and third expressions look very similar. The difference is whether σ² is inside or outside of the square root. If it is inside, the density has units of length−1, but if it is outside then the density has units of length−2. If we integrate the former dx we get a dimensionless quantity, but if we integrate the latter then we get a quantity with dimensions of inverse length. So the second expression cannot be right.

We can’t prove from dimensional analysis alone that the third expression is correct, but we can say that it’s plausible, and in fact it is correct.

Gamma distribution example

Let’s look at one more example, the gamma distribution density with shape parameter α and scale parameter β.

\frac{1}{\beta^\alpha\Gamma(\alpha)} x^{\alpha-1} \exp\left(-\frac{x}{\beta} \right)

There’s a lot going on here. Are we sure the expression above is correct?

First of all, the argument to an exponential function must be dimensionless, so β must have the same units as x. If x is in Euros, β must be in Euros. If x is in light-years, β better be in light-years. If x is in amps, β is in amps.

This is always the case with a scaling parameter, for any probability distribution: the scaling parameter has the same units as the independent variable.

Now what about the shape parameter α? It appears inside a gamma function, so it better be dimensionless.

If x has dimension of length, so do β and dx. The terms in front of the exponential have dimensions of length−1, which checks out: when we integrate the gamma density dx we get a dimensionless quantity. The units of length cancel out. If we assume x is volume in liters, then β and dx also have units of liters, and the density has units of per liter. When we integrate we get a probability, not a quantity in liters or any other units.

Related posts