Beta approximation to binomial

It is well-known that you can approximate a binomial distribution with a normal distribution. Of course there are a few provisos …

[Robin Williams imitating William F. Buckley]

It is also well-known that you can approximate a beta distribution with a normal distribution.

This means you could directly approximate a binomial distribution with a beta distribution. This is a trivial consequence of the two other approximation results, but I don’t recall seeing this mentioned anywhere.

Why would you want to approximate a binomial distribution with a beta? The normal distribution is better known than the beta, and the normal approximation is motivated by the central limit theorem. However, approximating a binomial distribution by a beta makes the connection to Bayesian statistics clearer.

Let’s look back at a post I wrote yesterday. There I argued that the common interpretation of a confidence interval, while unjustified by the theory that produced it, could be justified by appealing to Bayesian statistics because a frequentist confidence interval, in practice, is approximately a Bayesian credible interval.

In that post I give an example of estimating a proportion p based on a survey with 127 positive responses out of 400 persons surveyed. The confidence interval given in that post implicitly used a normal approximation to a binomial. This is done so often that it typically goes unnoticed, and it is justified for large samples when p is not too close to 0 or 1.

Binomial distributions with large n are difficult to work with and it is more convenient to work with continuous distributions. Instead of the normal approximation, we could have used a beta approximation. This has nothing to do with Bayesian statistics yet. We could introduce the beta distribution simply as an alternative to the normal distribution.

The estimated rate is a binomial count scaled by n = 400; it has mean p = 127/400 and variance p(1-p)/n.

We could compare this to a beta distribution with the same mean and variance. I worked out here how to solve for the beta distribution parameters that give a specified mean and variance. If we do that with the mean and variance above we get a = 126.7 and b = 272.3. We could then find a 95% confidence interval by finding the 2.5 and 97.5 percentiles for a beta(126.7, 272.3) distribution. When we do, we get a confidence interval of (0.2728, 0.3639), very nearly what we got in the earlier post using a normal approximation.
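
Here’s that calculation as a short Python sketch. The moment-matching step just solves the beta mean and variance formulas for the two parameters.

    from scipy import stats

    n, x = 400, 127
    m = x / n                    # mean of the estimated proportion
    v = m * (1 - m) / n          # variance of the estimated proportion

    # Moment matching: find the beta(a, b) with the same mean and variance
    s = m * (1 - m) / v - 1      # a + b
    a, b = m * s, (1 - m) * s
    print(a, b)                  # approximately 126.7 and 272.3

    # 2.5 and 97.5 percentiles of the matched beta distribution
    print(stats.beta.interval(0.95, a, b))   # approximately the (0.2728, 0.3639) above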

At this point we have been doing frequentist statistics, using a beta distribution as a computational convenience. Now let’s put on our Bayesian hats. Starting with a uniform, i.e. beta(1, 1), prior, we get a posterior distribution on the proportion we’re estimating which is a beta(128, 274) distribution.

If you plot the density functions of a beta(126.7, 272.3) and a beta(128, 274) you’ll see that they agree to within the thickness of a line.
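
If you’d like to see that for yourself:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    x = np.linspace(0.2, 0.45, 500)
    plt.plot(x, stats.beta.pdf(x, 126.7, 272.3), label="moment-matched beta(126.7, 272.3)")
    plt.plot(x, stats.beta.pdf(x, 128, 274), "--", label="posterior beta(128, 274)")
    plt.legend()
    plt.show()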


Can you have confidence in a confidence interval?

“The only use I know for a confidence interval is to have confidence in it.” — L. J. Savage

Can you have confidence in a confidence interval? In practice, yes. In theory, no.

If you have a 95% confidence interval for a parameter θ, can you be 95% sure that θ is in that interval? Sorta.

Frequentist theory

According to frequentist theory, parameters such as θ are fixed constants, though unknown. Probability statements about θ are forbidden. Here’s an interval I. What is the probability that θ is in I? Well, it’s 1 if θ is in the interval and 0 if it is not.

In theory, the probability associated with a confidence interval says something about the process used to create the interval, not about the parameter being estimated. Our θ is an immovable rock. Confidence intervals come and go, some containing θ and others not, but we cannot make probability statements about θ because θ is not a random variable.

Here’s an example of a perfectly valid and perfectly useless way to construct a 95% confidence interval. Take an icosahedron, draw an X on one face, and leave the other faces unmarked. Roll this 20-sided die, and if the X comes up on top, return the empty interval. Otherwise return the entire real line.

The resulting interval, either ø or (-∞, ∞), is a 95% confidence interval. The interval is the result of a process which will contain the parameter θ 95% of the time.

Now suppose I give you the empty set as my confidence interval. What is the probability now that θ is in the empty interval? Zero. What if I give you the real line as my confidence interval. What is the probability that θ is in the interval? One. The probability is either zero or one, but in no case is it 0.95. The probability that a given interval produced this way contains θ is never 95%. But before I hand you a particular result, the probability that the interval will be one that contains θ is 0.95.

Bayesian theory

Confidence intervals are better in practice than the example above. And importantly, frequentist confidence intervals are usually approximately Bayesian credible intervals.

In Bayesian statistics you can make probability statements about parameters. Once again θ is some unknown parameter. How might we express our uncertainty about the value of θ? Probability! Frequentist statistics represents some forms of uncertainty by probability but not others. Bayesian statistics uses probability to model all forms of uncertainty.

Example

Suppose I want to know what percentage of artists are left handed and I survey 400 artists. I find that 127 of the artists surveyed were southpaws. A 95% confidence interval, using the most common approach rather than the pathological approach above, is given by

\left(\hat{p} - z_{1-\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}, \hat{p} + z_{1-\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} \right)

This results in a confidence interval of (0.272, 0.363).
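
Here’s the arithmetic in Python, in case you’d like to check it.

    from scipy import stats

    n, successes = 400, 127
    p_hat = successes / n
    z = stats.norm.ppf(0.975)            # z_{1 - alpha/2} for alpha = 0.05

    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    print(p_hat - half_width, p_hat + half_width)    # approximately (0.272, 0.363)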

Now suppose we redo our analysis using a Bayesian approach. Say we start with a uniform prior on θ. Then the posterior distribution on θ will have a beta(128, 274) distribution.

Now we can say in clear conscience that there is a 94% posterior probability that θ is in the interval (0.272, 0.363).

There are a couple predictable objections at this point. First, we didn’t get exactly 95%. No, we didn’t. But we got very close.

Second, the posterior probability depends on the prior probability. However, it doesn’t depend much on the prior. Suppose you said “I’m pretty sure most people are right handed, maybe 9 out of 10, so I’m going to start with a beta(1, 9) prior.” If so, you would compute the probability of θ being in the interval (0.272, 0.363) to be 0.948. Your a priori knowledge led you to have a little more confidence a posteriori.
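
Here’s a sketch of where the 0.948 comes from: with a beta(1, 9) prior and 127 successes out of 400, the posterior is beta(128, 282), and we ask how much of its mass falls in the frequentist interval.

    from scipy import stats

    successes, failures = 127, 400 - 127
    prior_a, prior_b = 1, 9                     # the beta(1, 9) prior
    post = stats.beta(prior_a + successes, prior_b + failures)

    lo, hi = 0.272, 0.363                       # the frequentist confidence interval
    print(post.cdf(hi) - post.cdf(lo))          # approximately 0.948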

Commentary

The way nearly everyone interprets a frequentist confidence interval is not justified by frequentist theory. And yet it can be justified by saying if you were to treat it as a Bayesian credible interval, you’d get nearly the same result.

You can often justify an informal understanding of frequentist statistics on Bayesian grounds. Note, however, that a Bayesian interpretation would not rescue the 95% confidence interval that returns either the empty set or the real line.

Often frequentist and Bayesian analyses reach approximately the same conclusions. A Bayesian can view frequentist techniques as convenient ways to produce approximately correct Bayesian results. And a frequentist can justify using a Bayesian procedure because the procedure has good frequentist properties.

There are times when frequentist and Bayesian results are incompatible, and in that case the Bayesian results typically make more sense in my opinion. But very often other considerations, such as model uncertainty, are much more substantial than the difference between a frequentist and Bayesian analysis.


First names and Bayes’ theorem

Is the woman in this photograph more likely to be named Esther or Caitlin?

[black and white photo of an elderly woman]

Yesterday Mark Jason Dominus wrote about statistics on first names in the US from 1960 to 2021. For each year and state, the data tell how many boys and girls were given each name.

Reading the data “forward” you could ask, for example, how common was it for girls born in 1960 to be named Paula. Reading the data “backward” you could ask how likely it is that a woman named Paula was born in 1960.

We do this intuitively. When you hear a name like Blanche you may think of an elderly woman because the name is uncommon now (in the US) but was more common in the past. Sometimes we get a bimodal distribution. Olivia, for example, has made a comeback. If you hear that a female is named Olivia, she’s probably not middle aged. She’s more likely to be in school or retired than to be a soccer mom.

Bayes’ theorem tells us how to turn probabilities around. We could go through Mark’s data and compute the probabilities in reverse. We could quantify the probability that a woman named Paula was born in the 1960s, for example, by adding up the probabilities that she was born in 1960, 1961, …, 1969.

Bayes’ theorem says

P(age | name) = P(name | age) P(age) / P(name).

Here the vertical bar separates the thing we want the probability of from the thing we assume. P(age | name), for example, is the probability of a particular age, given a particular name.

There is no bar in the probability in the denominator above. P(name) is the overall probability of a particular name, regardless of age.
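
Here’s a toy version of the calculation. The counts below are made up purely for illustration; the real numbers would come from the name data Mark compiled.

    # counts[year][name] = number of girls given that name (made-up numbers)
    counts = {
        1960: {"Paula": 900, "Olivia": 100, "other": 99_000},
        1990: {"Paula": 200, "Olivia": 400, "other": 99_400},
        2020: {"Paula": 50,  "Olivia": 1500, "other": 98_450},
    }
    grand_total = sum(sum(names.values()) for names in counts.values())

    def p_year_given_name(name, year):
        p_name_given_year = counts[year][name] / sum(counts[year].values())
        p_year = sum(counts[year].values()) / grand_total
        p_name = sum(c[name] for c in counts.values()) / grand_total
        return p_name_given_year * p_year / p_name     # Bayes' theorem

    print(p_year_given_name("Paula", 1960))    # most Paulas in this toy data set are older
    print(p_year_given_name("Olivia", 2020))   # most Olivias are young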

People very often get probabilities backward; they need Bayes theorem to turn them around. A particular case of this is the prosecutor’s fallacy. In a court of law, the probability of a bit of evidence given that someone is guilty is irrelevant. What matters is the probability that they are guilty given the evidence.

In a paternity case, we don’t need to know the probability of someone having a particular genetic marker given that a certain man is or is not their father. We want to know the probability that a certain man is the father, given that someone has a particular genetic marker. The former probability is not what we’re after, but it is useful information to stick into Bayes’ theorem.


Photo by Todd Cravens on Unsplash.

A Bayesian approach to pricing

Suppose you want to determine how to price a product and you initially don’t know what the market is willing to pay. This post outlines some of the things you might think about, and how Bayesian modeling might help.

This post is not the final word on the subject, or even my final word on the subject. It is essentially a reply to a friend’s question turned into a blog post rather than an email.

Prior information

You must have some idea, however vague, what the market value of your product is. If you had absolutely no idea what a product is worth, you wouldn’t be considering it as a business opportunity.

There is always prior information. As a former colleague would say, when you want to measure the distance to the moon, you know not to pick up a yard stick. Whenever you do an experiment, something motivated you to do the experiment.

This is an ideal application of Bayesian statistics because you have valuable prior information before you have data. Until you have a moderate amount of data, your prior information may be more useful than your data.

Some people will say you should only act on data, not on subjective prior knowledge, but this is impossible. When you offer your product for the first time, you have no data. All you have to go on is prior information. You could hide your prior information in the design of an experiment rather than making it explicit in a prior distribution, but it’s still there.

Model

Assume the market price for some product can be modeled by a random variable X that depends on some parameter θ. I’m not saying that the price is random in any philosophical sense, only that it is useful to model it as random. More on this line of thinking here.

By modeling market price as a random variable rather than a single number, we’re acknowledging that it has some fuzziness to it. Different customers are willing to pay different amounts for the same product. Maybe the prices they’re willing to pay are tightly distributed around some center, or maybe there’s substantial variance.

When we make a sale, or fail to make a sale, we learn something about θ. But notice that we don’t observe X per se; we observe whether a particular sample from X was above or below the offer price. You’re not conducting a survey where you ask “What is the highest price p that you’d be willing to pay?” and get a candid answer. You make an offer x, and it is either accepted or rejected. You observe whether x < p or x > p.

This means the likelihood function is similar to what you’d see in modeling survival data, but a little different. When someone dies, you fully observe their survival time. But if you follow up with someone and they’re still alive, you only know a lower bound on their survival time, not the survival time itself. We say the data is censored because we haven’t yet observed everything we want to know.

Survival data is usually asymmetric, censored on one side but not the other. You could have two-sided censoring, but that’s less common.

With pricing, your data is always censored in both directions: every observation gives you either a lower bound or an upper bound on what someone would have been willing to pay, never the exact amount.

After each offer and response, you can update your estimate of θ. Each interaction gives you a better idea of the distribution on θ.
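
Here’s a minimal sketch of what that updating could look like. It assumes, just for illustration, that log willingness-to-pay is normal with a known spread σ and an unknown center θ, and it tracks the posterior on θ over a grid. The model, the prior, and the offers below are all made up.

    import numpy as np
    from scipy import stats

    sigma = 0.5                                        # assumed spread of log willingness-to-pay
    theta = np.linspace(np.log(10), np.log(1000), 2000)            # grid of candidate theta values
    posterior = stats.norm.pdf(theta, loc=np.log(100), scale=1.0)  # prior: centered near $100
    posterior /= posterior.sum()

    def update(posterior, offer, accepted):
        """Update on one censored observation: an accepted offer means WTP >= offer."""
        p_accept = stats.norm.sf(np.log(offer), loc=theta, scale=sigma)
        likelihood = p_accept if accepted else 1.0 - p_accept
        posterior = posterior * likelihood
        return posterior / posterior.sum()

    for offer, accepted in [(80, True), (120, True), (200, False), (150, False)]:
        posterior = update(posterior, offer, accepted)

    print(np.exp(np.sum(theta * posterior)))           # rough point estimate of the median price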

Pricing

Now suppose after numerous observations you’re moderately confident in your knowledge of θ. Now what?

One response is “Well then you charge what the market is most likely to bear.” That’s kind of a simplistic optimization. It implicitly assumes you’re OK with a 50% chance of a sale going through. Maybe your business is struggling and you don’t have many leads. Then you want a higher conversion rate. Or maybe you’re doing well, have plenty of leads, and are OK with a low conversion rate. This is especially the case if the distribution on market price has a lot of variance; if the variance is low it makes more sense to think of “the” price as if it were a single number.

So far I have implicitly assumed that the only consequence of asking too much is a lost sale. But if you ask for too much, you might lose future sales, even if you get the current sale. I’ve also assumed that customers always prefer lower prices. That’s not the case. Asking too little for a product can hurt your credibility, for example.

Estimating willingness to pay is complicated, and determining what to do once you’ve made that estimate is complicated as well. This post is just a sketch of the thought process a company might go through.


Universal confidence interval

Here’s a way to find a 95% confidence interval for any parameter θ.

  • With probability 0.95, return the real line.
  • With probability 0.05, return the empty set.

Clearly 95% of the time this procedure will return an interval that contains θ.
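
If you’d like to see the coverage for yourself, a few lines of simulation will do:

    import random

    trials = 100_000
    # The interval contains theta exactly when we return the real line,
    # which happens with probability 0.95, no matter what theta is.
    covered = sum(random.random() > 0.05 for _ in range(trials))
    print(covered / trials)    # approximately 0.95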

This example shows the difference between a confidence interval and a credible interval.

A 95% credible interval is an interval such that the probability is 95% that the parameter is in the interval. That’s what people think a 95% confidence interval is, but it’s not.

Suppose I give you a confidence interval using the procedure above. The probability that θ is in the interval is 1 if I return the real line and 0 if I return the empty set. In either case, the interval that I give you tells you absolutely nothing about θ.

But if I give you a 95% credible interval (a, b), then given the model and the data that went into it, the probability is 95% that θ is in the interval (a, b).

Confidence intervals are more useful in practice than in theory because they often approximately correspond to a credible interval under a reasonable model.

Credible intervals depend on your modeling assumptions. So do confidence intervals.

Confidence interval widths

Suppose you do N trials of something that can succeed or fail. After your experiment you want to present a point estimate and a confidence interval. Or if you’re a Bayesian, you want to present a posterior mean and a credible interval. The numerical results hardly differ, though the two interpretations differ.

If you got half successes, you will report a confidence interval centered around 0.5. The more unbalanced your results, the narrower your confidence interval will be. That is, the confidence interval will be narrowest if you had no successes and widest if you had half successes.

What can we say about how the width of your confidence interval varies as a function of your point estimate p? That question is the subject of this post [1].

Frequentist confidence interval width

We’ll start with an example, where N = 1,000,000. We let our number of observed successes n vary from 0 to 500,000 and plot the width of a 95% confidence interval as a function of n on a log-log plot. (By symmetry we only need to go up to N/2. If you have more than half successes, reverse your definitions of success and failure.)

The result is almost a perfectly straight line, with the exception of a tiny curve at the end. This says the log of the confidence interval width is a linear function of the log of n, or in other words, the dependence of confidence interval width on n follows a power law.

Bayesian credible interval width

The plot above measures frequentist confidence intervals, using the Wilson score interval with continuity correction. We can repeat our experiment, putting on our Bayesian hat, and compute credible intervals using a Jeffreys prior, that is, a beta(0.5, 0.5) prior.

The results are indistinguishable. The confidence interval widths differ after the sixth decimal place.
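
Here’s a sketch of the comparison using plain scipy. For simplicity it uses the Wilson score interval without the continuity correction, so the numbers won’t exactly match the plot described above.

    import numpy as np
    from scipy import stats

    N = 1_000_000
    z = stats.norm.ppf(0.975)

    def wilson_width(n, N):
        # Wilson score interval, no continuity correction
        p = n / N
        half = z / (1 + z**2 / N) * np.sqrt(p * (1 - p) / N + z**2 / (4 * N**2))
        return 2 * half

    def jeffreys_width(n, N):
        # equal-tailed credible interval under a Jeffreys, i.e. beta(0.5, 0.5), prior
        lo, hi = stats.beta.interval(0.95, n + 0.5, N - n + 0.5)
        return hi - lo

    for n in [100, 10_000, 500_000]:
        print(n, wilson_width(n, N), jeffreys_width(n, N))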

In this example, there’s very little quantitative difference between a frequentist confidence interval and a Bayesian credible interval. The difference is primarily one of interpretation. The Bayesian interpretation makes sense and the frequentist interpretation doesn’t [2].

Power law

If the logs of two things are related linearly [3], then the things themselves are related by a power law. So if confidence/credible interval width varies as a power law with the point estimate p, what is that power law?

The plots above suggest that to a good approximation, if we let w be the credible interval length,

log w = m log p + b

for some slope m and intercept b.

We can estimate the slope by choosing p = 10^-1 and p = 10^-5. This gives us m = 0.4925 and b = -5.6116. There are theoretical reasons to expect that m should be 0.5, so we’ll round 0.4925 up to 0.5 both for convenience and for theory.

So

log w = 0.5 log p – 5.6116.

and so

w = 0.00366 √p

Let’s see how well this works by comparing it to the exact interval length. (I’m using Jeffreys’ credible interval, not that it matters.)
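
Something like the following makes the comparison, again with N = 1,000,000 and the constant 0.00366 fitted above.

    import numpy as np
    from scipy import stats

    N = 1_000_000

    def jeffreys_width(p):
        n = round(p * N)
        lo, hi = stats.beta.interval(0.95, n + 0.5, N - n + 0.5)
        return hi - lo

    for p in [0.001, 0.01, 0.1, 0.3, 0.5]:
        print(p, jeffreys_width(p), 0.00366 * np.sqrt(p))    # exact width vs. power law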

The power law approximation works well when the estimated proportion p is small, say less than 0.2. For larger p the normal approximation works well.

We could have guessed that the confidence interval width was proportional to √(p(1-p)) based on the normal approximation/central limit theorem. But the normal approximation is not uniformly good over the whole range of p.


[1] Standard notation would put a little hat on top of the p but HTML doesn’t make this possible without inserting images into text. I hope you don’t mind if I take my hat off.

[2] This quip is a little snarky, but it’s true. When I would ask graduate students to describe what a confidence interval is, they would basically give me the definition of a credible interval. Most people who calculate frequentist confidence intervals think they’re calculating Bayesian credible intervals, though they don’t realize it. The definition of confidence interval is unnatural and convoluted. But confidence intervals work better in practice than in theory because they approximate credible intervals.

L. J. Savage once said

“The only use I know for a confidence interval is to have confidence in it.”

In other words, the value of a confidence interval is that you can often get away with interpreting it as a credible interval.

[3] Unfortunately the word “linear” is used in two inconsistent ways. Technically we should say “affine” when describing a function y = mx + b, and I wish we would. But that word isn’t commonly known. People call this relation linear because the graph is a line. Which makes sense, but it’s inconsistent with the use of “linear” as in “linear algebra.”

 

Scaled Beta2 distribution

I recently ran across a new probability distribution called the “Scaled Beta2” or “SBeta2” for short in [1].

For positive argument x and for positive parameters p, q, and b, its density is

\frac{\Gamma(p+q)}{b\,\Gamma(p) \, \Gamma(q)}\, \frac{\left(\frac{x}{b}\right)^{p-1}}{\left(\left(\frac{x}{b}\right) + 1 \right)^{p+q}}

This is a heavy-tailed distribution. For large x, the probability density is O(x^(-q-1)), the same as a Student-t distribution with q degrees of freedom.
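
Here’s the density in code, with a quick numerical check that it integrates to 1. The function name is mine, not from [1].

    import numpy as np
    from scipy.special import gammaln
    from scipy.integrate import quad

    def sbeta2_pdf(x, p, q, b):
        """Density of the Scaled Beta2 distribution given above."""
        logc = gammaln(p + q) - gammaln(p) - gammaln(q) - np.log(b)
        return np.exp(logc) * (x / b) ** (p - 1) / (x / b + 1) ** (p + q)

    # Sanity check: the density integrates to 1
    print(quad(sbeta2_pdf, 0, np.inf, args=(2, 3, 10))[0])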

The authors in [1] point out several ways this distribution can be useful in application. It can approximate a number of commonly-used prior distributions while offering advantages in terms of robustness and analytical tractability.


[1] The Scaled Beta2 Distribution as a Robust Prior for Scales. María-Eglée Pérez, Luis Raúl Pericchi, Isabel Cristina Ramírez. Available here.

Predictive probability for large populations

Suppose you want to estimate the number of patients who respond to some treatment in a large population of size N and what you have is data on a small sample of size n.

The usual way of going about this calculates the proportion of responses in the small sample, then pretends that this is the precisely known proportion, and applies it to the population as a whole. This approach underestimates the uncertainty in the final estimate because it ignores the uncertainty in the initial estimate.

A more sophisticated approach creates a distribution representing what is known about the proportion of successes based on the small sample and any prior information. Then it estimates the posterior predictive probability of outcomes in the larger population based on this distribution. For more background on predictive probability, see my post on uncertainty in a probability.

This post will look at ways to compute predictive probabilities using the asymptotic approximation presented in the previous post.

Suppose your estimate on the probability of success, based on your small sample and prior information, has a Beta(α, β) distribution. Then the predictive probability of s successes and f failures is

 {s + f \choose s} \frac{B(s + \alpha, f + \beta)}{B(\alpha, \beta)}

The beta functions in the numerator and denominator are hard to understand and hard to compute, so we would like to replace them with something easier to understand and easier to compute.

We begin by expanding all the terms involving s and f in terms of gamma functions.

\frac{1}{B(\alpha,\beta)} \frac{\Gamma(s+f+1)}{\Gamma(s+1)\, \Gamma(f+1)} \frac{\Gamma(s+\alpha)\, \Gamma(f+\beta)}{\Gamma(s+f+\alpha+\beta)}

By rearranging the terms we can see three examples of the kind of gamma function ratios estimated in the previous post.

\frac{1}{B(\alpha,\beta)} \frac{\Gamma(s+\alpha)}{\Gamma(s+1)} \frac{\Gamma(f+\beta)}{\Gamma(f+1)} \frac{\Gamma(s+f+1)}{\Gamma(s+f+\alpha+\beta)}

Now we can apply the simplest approximation from the previous post to get the following.

\frac{s^{\alpha-1} f^{\beta-1} (s+f)^{1 - \alpha-\beta}}{B(\alpha, \beta)}

We can make this approximation more symmetric by expanding B(α, β) in terms of factorials (assuming α and β are positive integers):

 \frac{(\alpha + \beta-1)!}{(\alpha-1)!\, (\beta-1)!} \,\,\frac{s^{\alpha-1} f^{\beta-1}}{(s+f)^{\alpha+\beta-1}}

We saw in the previous post that we can make our asymptotic approximation much more accurate by shifting the arguments a bit, and the following expression, while a little more complicated, should be more accurate.

\frac{(\alpha + \beta-1)!}{(\alpha-1)!\, (\beta-1)!} \,\, \frac{ \left(s + \tfrac{\alpha}{2}\right)^{\alpha-1} \left(f + \tfrac{\beta}{2}\right)^{\beta-1} }{ \left(s +f + \tfrac{\alpha+\beta}{2}\right)^{\alpha+\beta-1} }

This should be accurate for sufficiently large s and f, but I haven’t tested it numerically and wouldn’t recommend using it without doing some testing first.
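
Here’s one way to start that testing: evaluate the exact predictive probability via log gamma functions and compare it to the shifted approximation, with integer α and β so the factorials make sense.

    from math import lgamma, log, exp

    def log_exact(s, f, a, b):
        # log of C(s + f, s) * B(s + a, f + b) / B(a, b)
        logB = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
        return (lgamma(s + f + 1) - lgamma(s + 1) - lgamma(f + 1)
                + logB(s + a, f + b) - logB(a, b))

    def log_approx(s, f, a, b):
        # log of the shifted approximation above
        logC = lgamma(a + b) - lgamma(a) - lgamma(b)   # log of (a + b - 1)! / ((a - 1)! (b - 1)!)
        return (logC + (a - 1) * log(s + a / 2) + (b - 1) * log(f + b / 2)
                - (a + b - 1) * log(s + f + (a + b) / 2))

    for s, f in [(10, 20), (100, 200), (1000, 2000)]:
        print(s, f, exp(log_exact(s, f, 3, 7)), exp(log_approx(s, f, 3, 7)))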

Automatic data reweighting

Suppose you are designing an autonomous system that will gather data and adapt its behavior to that data.

At first you face the so-called cold-start problem. You don’t have any data when you first turn the system on, and yet the system needs to do something before it has accumulated data. So you prime the pump by having the system act at first on what you believe to be true prior to seeing any data.

Now you face a problem. You initially let the system operate on assumptions rather than data out of necessity, but you’d like to go by data rather than assumptions once you have enough data. Not just some data, but enough data. Once you have a single data point, you have some data, but you can hardly expect a system to act reasonably based on one datum.

Rather than abruptly moving from prior assumptions to empirical data, you’d like the system to gradually transition from reliance on the former to reliance on the latter, weaning the system off initial assumptions as it becomes more reliant on new data.

The delicate part is how to manage this transition. How often should you adjust the relative weight of prior assumptions and empirical data? And how should you determine what weights to use? Should you set the weight given to the prior assumptions to zero at some point, or should you let the weight asymptotically approach zero?

Fortunately, there is a general theory of how to design such systems. It’s called Bayesian statistics. The design issues raised above are all handled automatically by Bayes theorem. You start with a prior distribution summarizing what you believe to be true before collecting new data. Then as new data arrive, the magic of Bayes theorem adjusts this distribution, putting more emphasis on the empirical data and less on the prior. The effect of the prior gradually fades away as more data become available.
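
The simplest concrete example is a beta-binomial model: the posterior mean is a weighted average of the prior mean and the observed success rate, and the weight on the prior shrinks as data accumulate. The prior and the data below are made up.

    # prior belief: success rate around 20% (made-up numbers)
    a, b = 4, 16
    print("prior mean:", a / (a + b))

    successes, trials = 0, 0
    for new_successes, new_trials in [(8, 10), (7, 10), (9, 10), (80, 100)]:
        successes += new_successes
        trials += new_trials
        post_mean = (a + successes) / (a + b + trials)
        weight_on_prior = (a + b) / (a + b + trials)   # how much the prior still matters
        print(trials, round(post_mean, 3), round(weight_on_prior, 3))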


Dose finding ≠ dose escalation

You’ll often hear Phase I dose-finding trials referred to as dose escalation studies. This is because simple dose-finding methods can only explore in one direction: they can only escalate.

Three-plus-three rule

The most common dose finding method is the 3+3 rule. There are countless variations on this theme, but the basic idea is that you give a dose of an experimental drug to three people. If all three are OK, you go up a dose next time. If two out of three are OK, you give that dose again. If only one out of three is OK, you stop [1].
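
In code, the simplified rule described above amounts to something like the following; real 3+3 protocols add de-escalation rules, cohort expansion, and other details.

    def three_plus_three_step(num_ok_out_of_three):
        """Decision after a cohort of three patients, per the simplified rule above."""
        if num_ok_out_of_three == 3:
            return "escalate to the next dose"
        elif num_ok_out_of_three == 2:
            return "repeat the current dose"
        else:
            return "stop"

    print(three_plus_three_step(2))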

Deterministic thinking

The 3+3 algorithm implicitly assumes deterministic thinking, at least in part. The assumption is that if three out of three patients respond well, we know the dose is safe [2].

If you increase the dose level and the next three patients experience adverse events, you stop the trial. Why? Because you know that the new dose is dangerous, and you know the previous dose was safe. You can only escalate because you assume you have complete knowledge based on three samples.

But if we treat three patients at a particular dose level and none have an adverse reaction we do not know for certain that the dose level is safe, though we may have sufficient confidence in its safety to try the next dose level. Similarly, if we treat three patients at a dose and all have an adverse reaction, we do not know for certain that the dose is toxic.

Bayesian dose-finding

A Bayesian dose-finding method estimates toxicity probabilities given the data available. It might decide at one point that a dose appears safe, then reverse its decision later based on more data. Similarly, it may reverse an initial assessment that a dose is unsafe.

A dose-finding method based on posterior probabilities of toxicity is not strictly a dose escalation method because it can explore in two directions. It may decide that the next dose level to explore is higher or lower than the current level.
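
Here’s a toy sketch of that idea: put a beta prior on the toxicity probability at each dose, update it with the observed toxicities, and look at the posterior probability that each dose exceeds a target toxicity rate. The priors, target, and data are all made up, and this illustrates the general idea rather than any particular published design.

    from scipy import stats

    target = 0.25                       # maximum acceptable toxicity rate (made up)
    prior_a, prior_b = 1, 3             # beta(1, 3) prior on toxicity at each dose

    # Made-up results so far: (toxicities, patients treated) at each dose in mg
    observed = {100: (0, 3), 200: (2, 3), 300: (0, 0)}

    for dose, (tox, n) in observed.items():
        post = stats.beta(prior_a + tox, prior_b + n - tox)
        print(dose, "P(toxicity rate > target) =", round(post.sf(target), 2))

    # A design would use these posterior probabilities to pick the next dose,
    # which in this toy example would mean dropping back down to 100 mg.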

Starting at the lowest dose

In Phase I studies of chemotherapeutics, you conventionally start at the lowest dose. This makes sense. These are toxic agents, and you naturally want to start at a dose you have reason to believe isn’t too toxic. (NB: I say “too toxic” because chemotherapy is toxic. You hope that it’s toxic to a tumor without being too toxic for the patient host.)

But on closer inspection maybe you shouldn’t start at the lowest dose. Suppose you want to test 100 mg, 200 mg, and 300 mg of some agent. Then 100 mg is the lowest dose, and it’s ethical to start at 100 mg. Now what if we add a dose of 50 mg to the possibilities? Did the 100 mg dose suddenly become unethical as a starting dose?

If you have reason to believe that 100 mg is a tolerable dose, why not start with that dose, even if you add a lower dose in case you’re wrong? This makes sense if you think of dose-finding, but not if you think only in terms of dose escalation. If you can only escalate, then it’s impossible to ever give a dose below the starting dose.


[1] I have heard, but I haven’t been able to confirm, that the 3+3 method has its origin in a method proposed by John Tukey during WWII for testing bombs. When testing a mechanical system, like a bomb, there is much less uncertainty than when testing a drug in a human. In a mechanical setting, you may have a lot more confidence from three samples than you would in a medical setting.

[2] How do you explain the situation where one out of three has an adverse reaction? Is the dose safe or not? Here you naturally switch to probabilistic thinking because deterministic thinking leads to a contradiction.