QR Codes and Percolation

Percolation theory looks at problems such as the probability of being able to traverse some region with random obstacles. It is motivated by problems such as modeling the flow of a fluid in a porous medium.

Here’s a percolation problem for QR codes: What is the probability that there is a path from one side of a QR code to the opposite side? How far across a QR code would you expect to be able to go? For example, the QR code below was generated from my contact information. It’s not possible to go from one side to the other, and the red line shows what I believe is the deepest path into the code from a side.

This could make an interesting programming exercise. A simple version would be to start with a file of bits representing a particular QR code and find the deepest path into the corresponding image.
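
Here is a minimal sketch of that simple version. It assumes the QR code is given as a list of rows of 0s and 1s, that a path moves between cells sharing an edge, that paths travel through the 0 (light) cells, and that the path enters from the left side; these conventions are choices made for illustration, not part of any QR standard.

from collections import deque

def max_depth(grid):
    """How many columns deep a path through 0-cells can reach,
    starting from the left edge. Returns -1 if the left edge is blocked."""
    rows, cols = len(grid), len(grid[0])
    queue = deque((r, 0) for r in range(rows) if grid[r][0] == 0)
    visited = set(queue)
    deepest = -1
    while queue:
        r, c = queue.popleft()
        deepest = max(deepest, c)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in visited:
                visited.add((nr, nc))
                queue.append((nr, nc))
    return deepest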

The next step up would be to generate simplified QR codes, requiring certain bits to be set, such as the patterns in three of the four corners that allow a QR reader to orient itself.

The next step in sophistication would be to implement the actual QR encoding algorithm, including its error correction encoding, then use this to encode random data.

(Because of the error correction used by QR codes, you could scan the image above and your QR reader would ignore the red path. It would even work if a fairly large portion of the image were missing because the error correction introduces a lot of redundancy.)

Why not statistics

Jordan Ellenberg’s parents were both statisticians. In his interview with Strongly Connected Components Jordan explains why he went into mathematics rather than statistics.

I tried. I tried to learn some statistics actually when I was younger and it’s a beautiful subject. But at the time I think I found the shakiness of the philosophical underpinnings were too scary for me. I felt a little nauseated all the time. Math is much more comfortable. You know where you stand. You know what’s proved and what’s not. It doesn’t have the quite same ethical and moral dimension that statistics has. I was never able to get comfortable with it the way my parents were.

Bayes factors vs p-values

Bayesian analysis and Frequentist analysis often lead to the same conclusions by different routes. But sometimes the two forms of analysis lead to starkly different conclusions.

The following illustration of this difference comes from a talk by Luis Pericchi last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is p = 0.5. The alternative hypothesis was that p is not 0.5. There were N = 104,490,000 bits emitted during the experiment, and s = 52,263,471 were 1’s. The p-value, the probability of an imbalance this large or larger under the assumption that p = 0.5, is 0.0003. Such a tiny p-value would be regarded as extremely strong evidence in favor of ESP given the way p-values are commonly interpreted.

The Bayes factor, however, is 18.7, meaning the data are about 19 times more likely under the null hypothesis than under the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).

So given the data and assumptions in this example, the Frequentist concludes there is very strong evidence for ESP while the Bayesian concludes there is strong evidence against ESP.

The following Python code shows how one might calculate the p-value and Bayes factor.

from scipy.stats import binom
from numpy import log, exp
from scipy.special import betaln

N = 104490000
s = 52263471

# sf is the survival function, i.e. complementary cdf
# ccdf multiplied by 2 because we're doing a two-sided test
print("p-value: ", 2*binom.sf(s, N, 0.5))

# Compute the log of the Bayes factor to avoid underflow.
logbf = N*log(0.5) - betaln(s+0.5, N-s+0.5) + betaln(0.5, 0.5)
print("Bayes factor: ", exp(logbf))

Fitting a triangular distribution

Sometimes you only need a rough fit to some data and a triangular distribution will do. As the name implies, this is a distribution whose density function graph is a triangle. The triangle is determined by its base, running between points a and b, and a point c somewhere in between where the altitude intersects the base. (c is called the foot of the altitude.) The height of the triangle is whatever it needs to be for the area to equal 1 since we want the triangle to be a probability density.

One way to fit a triangular distribution to data starts by choosing the base. You could pick a and b to be the smallest and largest possible values, if these values are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.

How do you pick c? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean

(a + b + c)/3

so you could simply solve for c to match the sample mean: c = 3x̄ – a – b, where x̄ is the sample mean.

Another approach would be to pick c so that the resulting distribution has the same median as the data. This approach is more interesting because it cannot always be done.

Suppose your sample median is m. You can always find a point c so that half the area of the triangle lies to the left of a vertical line drawn through m. However, this might require the foot c to fall to the left or the right of the base [a, b]. In that case the resulting triangle is obtuse, and so its sides do not form the graph of a function.

For the triangle to give us the graph of a density function, c must be in the interval [a, b]. Such a density has a median in the range

[b – (b – a)/√2, a + (b – a)/√2].

If the sample median m is in this range, then we can solve for c so that the distribution has median m. The solution is

c = b – 2(b – m)² / (b – a)

if m < (a + b)/2 and

c = a + 2(m – a)² / (b – a)

otherwise.
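
Here is a minimal sketch of the median-matching fit in Python. It takes a and b from the data if they are not supplied, and returns None when the sample median falls outside the feasible range above.

from math import sqrt
from statistics import median

def fit_triangular_median(data, a=None, b=None):
    """Foot c of a triangular density on [a, b] whose median matches
    the sample median, or None if no such c exists in [a, b]."""
    a = min(data) if a is None else a
    b = max(data) if b is None else b
    m = median(data)
    # Feasible range of the median for a triangular density on [a, b]
    if not (b - (b - a)/sqrt(2) <= m <= a + (b - a)/sqrt(2)):
        return None
    if m < (a + b)/2:
        return b - 2*(b - m)**2 / (b - a)
    else:
        return a + 2*(m - a)**2 / (b - a)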

Extremely small probabilities

One objection to modeling adult heights with a normal distribution is that heights are obviously positive while a normal random variable can be negative. However, by this model negative heights are astronomically unlikely. I’ll explain below how one can take “astronomically” literally in this context.

A common model says that men’s and women’s heights are normally distributed with means of 70 and 64 inches respectively, both with a standard deviation of 3 inches. A woman with negative height would be 21.33 standard deviations below the mean, and a man with negative height would be 23.33 standard deviations below the mean. These events have probability 3 × 10⁻¹⁰¹ and 10⁻¹²⁰ respectively. Or to write them out in full

0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003

and

0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001.

As I mentioned on Twitter yesterday, if you’re worried about probabilities that require scientific notation to write down, you’ve probably exceeded the resolution of your model. I imagine most probability models are good to two or three decimal places at most. When model probabilities are extremely small, factors outside the model become more important than ones inside.

According to Wolfram Alpha, there are around 10⁸⁰ atoms in the universe. So picking one particular atom at random from all atoms in the universe would be on the order of a billion trillion times more likely than running into a woman with negative height. Of course negative heights are not just unlikely, they’re impossible. As you travel from the mean out into the tails, the first problem you encounter with the normal approximation is not that the probability of negative heights is over-estimated, but that the probability of extremely short and extremely tall people is under-estimated. There exist people whose heights would be impossibly unlikely according to this normal approximation. See examples here.

Probabilities such as those above have no practical value, but it’s interesting to see how you’d compute them anyway. You could find the probability of a man having negative height by typing pnorm(-23.33) into R or scipy.stats.norm.cdf(-23.33) into Python. Without relying on such software, you could use the bounds

\frac{x}{\sqrt{2\pi}(x^2 + 1)} \exp(-x^2/2) < \Phi^c(x) < \frac{1}{\sqrt{2\pi}\,x} \exp(-x^2/2)

with x equal to 21.33 and 23.33, since Φ(–x) = Φᶜ(x). For a proof of these bounds and tighter bounds see these notes.
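
As a quick check, here is a short Python sketch comparing the two bounds with the tail probability SciPy computes, using the z-scores from the height example above.

from numpy import exp, sqrt, pi
from scipy.stats import norm

for z in (21.33, 23.33):
    tail = norm.sf(z)  # upper tail probability, equal to Phi(-z)
    lower = z/(sqrt(2*pi)*(z**2 + 1)) * exp(-z**2/2)
    upper = 1/(sqrt(2*pi)*z) * exp(-z**2/2)
    print(lower, tail, upper)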

Finding the best dose

In a dose-finding clinical trial, you have a small number of doses to test, and you hope to find the one with the best response. Here “best” may mean most effective, least toxic, closest to a target toxicity, some combination of criteria, etc.

Since your goal is to find the best dose, it seems natural to compare dose-finding methods by how often they find the best dose.  This is what is most often done in the clinical trial literature. But this seemingly natural criterion is actually artificial.

Suppose a trial is testing doses of 100, 200, 300, and 400 milligrams of some new drug. Suppose further that on some scale of goodness, these doses score 0.1, 0.2, 0.5, and 0.51. (Of course these goodness scores are unknown; the point of the trial is to estimate them. But you might make up some values for simulation, pretending with half your brain that these are the true values and pretending with the other half that you don’t know what they are.)

Now suppose you’re evaluating two clinical trial designs, running simulations to see how each performs. The first design picks the 400 mg dose, the best dose, 20% of the time and picks the 300 mg dose, the second best dose, 50% of the time. The second design picks each dose with equal probability. The latter design picks the best dose more often, but it picks a good dose less often.

In this scenario, the two largest doses are essentially equally good; it hardly matters how often a method distinguishes between them. The first method picks one of the two good doses 70% of the time while the second method picks one of the two good doses only 50% of the time.

This example was exaggerated to make a point: obviously it doesn’t matter how often a method can pick the better of two very similar doses, not when it very often picks a bad dose. But there are less obvious situations that are quantitatively different but qualitatively the same.

The goal is actually to find a good dose. Finding the absolute best dose is impossible. The most you could hope for is that a method finds with high probability the best of the four arbitrarily chosen doses under consideration. Maybe the best dose is 350 mg, 843 mg, or some other dose not under consideration.

A simple way to make evaluating dose-finding methods less arbitrary would be to estimate the benefit to patients. Finding the best dose is only a matter of curiosity in itself unless you consider how that information is used. Knowing the best dose is important because you want to treat future patients as effectively as you can. (And patients in the trial itself as well, if it is an adaptive trial.)

Suppose the measure of goodness in the scenario above is probability of successful treatment and that 1,000 patients will be treated at the dose level picked by the trial. Under the first design, there’s a 20% chance that 51% of the future patients will be treated successfully, and a 50% chance that 50% will be. The expected number of successful treatments from the two best doses is 352. Under the second design, the corresponding number is 252.5.

(To simplify the example above, I didn’t say how often the first design picks each of the two lowest doses. But the first design will result in at least 382 expected successes and the second design 327.5.)
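
Here is a small sketch of that arithmetic. The goodness scores and the 1,000-patient horizon come from the example above. How the first design splits its remaining 30% between the two lowest doses was left unspecified, so putting all of it on the lowest dose reproduces the “at least” figure.

goodness = [0.10, 0.20, 0.50, 0.51]  # success probability at each dose
horizon = 1000                       # assumed number of future patients

# Probability of selecting each dose, worst case for the first design
design1 = [0.30, 0.00, 0.50, 0.20]
design2 = [0.25, 0.25, 0.25, 0.25]

def expected_successes(selection):
    return horizon * sum(p*g for p, g in zip(selection, goodness))

print(expected_successes(design1))  # about 382
print(expected_successes(design2))  # about 327.5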

You never know how many future patients will be treated according to the outcome of a clinical trial, but there must be some implicit estimate. If this estimate is zero, the trial is not worth conducting. In the example given here, the estimate of 1,000 future patients is irrelevant: the future patient horizon cancels out in a comparison of the two methods. The patient horizon matters when you want to include the benefit to patients in the trial itself. The patient horizon serves as a way to weigh the interests of current versus future patients, an ethically difficult comparison usually left implicit.

Random walks and the arcsine law

Suppose you stand at 0 and flip a fair coin. If the coin comes up heads, you take a step to the right. Otherwise you take a step to the left. How much of the time will you spend to the right of where you started?

As the number of steps N goes to infinity, the probability that the proportion of your time in positive territory is less than x approaches 2 arcsin(√x)/π. The arcsine term gives this rule its name, the arcsine law.

Here’s a little Python script to illustrate the arcsine law.

import random
from numpy import arcsin, pi, sqrt

def step():
    u = random.random()
    return 1 if u < 0.5 else -1

M = 1000  # number of random walks
N = 1000  # number of steps per walk

x = 0.3   # Use any 0 < x < 1 you'd like.

outer_count = 0
for _ in range(M):
    position = 0
    inner_count = 0  # steps spent in positive territory
    for __ in range(N):
        position += step()
        if position > 0:
            inner_count += 1
    if inner_count/N < x:
        outer_count += 1

# Simulated probability vs. the arcsine law prediction
print(outer_count/M)
print(2*arcsin(sqrt(x))/pi)

Miscellaneous math resources

Every Wednesday I’ve been pointing out various resources on my web site. So far they’ve all been web pages, but the following are all PDF files.

Probability and statistics:

Other math:

See also journal articles and technical reports.

Last week: Probability approximations

Next week: Code Project articles

Probability approximations

This week’s resource post lists notes on probability approximations.

Do we even need probability approximations anymore? They’re not as necessary for numerical computation as they once were, but they remain vital for understanding the behavior of probability distributions and for theoretical calculations.

Textbooks often leave out details such as quantifying the error when discussing approximations. The following pages are notes I wrote to fill in some of these details when I was teaching.

See also blog posts tagged Probability and statistics and the Twitter account ProbFact.

Last week: Numerical computing resources

Next week: Miscellaneous math notes

More data, less accuracy

Statistical methods should do better with more data. That’s essentially what the technical term “consistency” means. But with improper numerical techniques, the numerical error can increase with more data, overshadowing the decreasing statistical error.

There are three ways Bayesian posterior probability calculations can degrade with more data:

  1. Polynomial approximation
  2. Missing the spike
  3. Underflow

Elementary numerical integration algorithms, such as Gaussian quadrature, are based on polynomial approximations. The method aims to exactly integrate a polynomial that approximates the integrand. But likelihood functions are not approximately polynomial, and they become less like polynomials when they contain more data. They become more like a normal density, asymptotically flat in the tails, something no polynomial can do. With better integration techniques, the integration accuracy will improve with more data rather than degrade.

With more data, the posterior distribution becomes more concentrated. This means that a naive approach to integration might entirely miss the part of the integrand where nearly all the mass is concentrated. You need to make sure your integration method is putting its effort where the action is. Fortunately, it’s easy to estimate where the mode should be.

The third problem is that software calculating the likelihood function can underflow with even a moderate amount of data. The usual solution is to work with the logarithm of the likelihood function, but with numerical integration the solution isn’t quite that simple. You need to integrate the likelihood function itself, not its logarithm. I describe how to deal with this situation in Avoiding underflow in Bayesian computations.
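
The following sketch illustrates the idea with a made-up binomial likelihood; the data values are hypothetical. The trick is to factor the peak of the log likelihood out of the integrand before exponentiating, and to tell the quadrature routine where the mass is concentrated.

import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

n, k = 10_000, 5_321  # hypothetical data: k successes in n trials

def log_likelihood(theta):
    # Log of the binomial likelihood; the likelihood itself underflows
    # far from its peak once n is large.
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1) \
        + k*np.log(theta) + (n - k)*np.log(1 - theta)

mode = k/n                   # where nearly all the mass is
peak = log_likelihood(mode)

# Integrate exp(log L - peak), which stays in floating point range,
# and break the interval at the mode so quad puts effort near the spike.
scaled, _ = quad(lambda t: np.exp(log_likelihood(t) - peak), 0, 1,
                 points=[mode])

log_integral = peak + np.log(scaled)  # log of the integral of the likelihood
print(log_integral)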

* * *

If you’d like help with statistical computation, let’s talk.

Probability resources

Each Wednesday I post a list of notes on some topic. This week it’s probability.

See also posts tagged probability and statistics and the Twitter account ProbFact.

Last week: Python resources

Next week: Regular expression resources

After a coin comes up heads 10 times

Suppose you’ve seen a coin come up heads 10 times in a row. What do you believe is likely to happen next? Three common responses:

  1. Heads
  2. Tails
  3. Equal probability of heads or tails.

Each is reasonable in its own context. The last answer is correct assuming the flips are independent and heads and tails are equally likely.

But as I argued here, if you see nothing but heads, you have reason to question the assumption that the coin is fair. So there’s some justification for the first answer.

The reasoning behind the second answer is that tails are “due.” This isn’t true if you’re looking at independent flips of a fair coin, but it could be reasonable in other settings, such as sampling without replacement.

Say there are a number of coins on a table, covered by a cloth. A fixed number are on the table heads up, and a fixed number tails up. You reach under the cloth and slide a coin out. Every head you pull out increases the chances that the next coin will be tails. If there were an equal number of heads and tails under the cloth to begin with, then after pulling out 10 heads, tails are indeed more likely next time.
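
A tiny sketch of this setting, with made-up numbers: say 20 coins start heads up and 20 start tails up under the cloth.

from fractions import Fraction

heads_up, tails_up = 20, 20  # assumed starting counts under the cloth
heads_drawn = 10             # suppose the first 10 coins drawn were all heads

p_next_tails = Fraction(tails_up, (heads_up - heads_drawn) + tails_up)
print(p_next_tails)          # 2/3, so tails really are more likely next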

Related post: Long runs

First two impressions of statistics

When I was a postdoc I asked a statistician a few questions and he gave me an overview of his subject. (My area was PDEs; I knew nothing about statistics.) I remember two things that he said.

  1. A big part of being a statistician is knowing what to do when your assumptions aren’t met, because they’re never exactly met.
  2. A lot of statisticians think time series analysis is voodoo, and he was inclined to agree with them.

Blue Bonnet Bayes

Blue Bonnet™ used to run commercials with the jingle “Everything’s better with Blue Bonnet on it.” Maybe they still do.

Perhaps in reaction to knee-jerk antipathy toward Bayesian methods, some statisticians have adopted knee-jerk enthusiasm for Bayesian methods. Everything’s better with Bayesian analysis on it. Bayes makes it better, like a little dab of margarine on a dry piece of bread.

There’s much that I prefer about the Bayesian approach to statistics. Sometimes it’s the only way to go. But Bayes-for-the-sake-of-Bayes can expend a great deal of effort, by human and computer, to arrive at a conclusion that could have been reached far more easily by other means.

Related: Bayes isn’t magic
