Sigmas and Student

I saw something yesterday saying that the Japanese bond market had experienced a six standard deviation move. This brought to mind a post I’d written eight years ago.

All probability statements depend on a model. And if your probability model says an event was six standard deviations from the mean, it’s more likely that your model is wrong than that you’ve actually seen something that rare. I expand on this idea here.

How likely is it that a sample from a random variable will be six standard deviations from its mean? If you have in mind a normal (Gaussian) distribution, as most people do, then the probability is on the order of 1 chance in a billion. Six sigma events are not common for any distribution, but they’re not unheard of for distributions with heavy tails.

Let X be a random variable with a Student t distribution and ν degrees of freedom. When ν is small, i.e. no more than 2, the tails of X are so fat that the standard deviation doesn’t exist. As ν → ∞ the Student t distribution approaches the normal distribution. So in some sense this distribution interpolates between fat tails and thin tails.

What is the probability that X takes on a value more than six standard deviations above its mean of 0? That is, what does the function

f(ν) = Prob(X > 6σ)

look like as a function of ν where σ² = ν/(ν − 2) is the variance of X?

As you’d expect, the limit of f(ν) as ν → ∞ is the probability of a six-sigma event for a normal distribution, around 10⁻⁹ as mentioned above. Here’s a plot of f(ν) for ν > 3. Notice that the vertical axis is on a log scale: the probability decreases roughly exponentially.
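
Here’s a minimal sketch of how one might compute f(ν) with SciPy; this is not necessarily the code behind the plot.

import numpy as np
from scipy.stats import t, norm

def f(nu):
    # P(X > 6 sigma) for a Student t variable with nu > 2 degrees of freedom,
    # where sigma^2 = nu/(nu - 2) is its variance
    sigma = np.sqrt(nu / (nu - 2))
    return t.sf(6 * sigma, df=nu)

print(f(2.5), f(10), f(100))
print(norm.sf(6))  # normal limit, about 1e-9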

What you might not expect is that f(ν) isn’t monotone. It rises to a maximum value before it decays exponentially. In hindsight this makes sense. As ν → 2+ the variance becomes infinite, and the probability of being infinitely far from the mean is 0. Here’s a plot of f(ν) between 2 and 3.

So six sigma probabilities for a Student t distribution rise from 0 up to a maximum of around 10⁻³, then decrease exponentially, asymptotically approaching a value around 10⁻⁹.

Stylometry

I was reading an article this morning that mentioned a stylometric analysis of a controversial paragraph written by the Roman historian Flavius Josephus. I’ve written several posts that could be called stylometry, or at least adjacent to it, but I haven’t used that word. Here are some posts that touch on the statistical analysis of a text or of an author.

Trying to fit exponential data

The first difficulty in trying to fit an exponential curve to data is that the data may not follow an exponential curve. Nothing grows exponentially forever. Eventually growth slows down. The simplest way growth can slow down is to follow a logistic curve, but fitting a logistic curve has its own problems, as detailed in the previous post.

Suppose you are convinced that whatever you want to model follows an exponential curve, at least over the time scale you’re interested in. This is easier to fit than a logistic curve: if you take the logarithm of the data, you have a linear regression problem. Linear regression is numerically well-behaved and has been thoroughly explored.
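
For example, here is a minimal sketch of the log-transform approach using NumPy. The data are made up for illustration.

import numpy as np

# made-up data, roughly y = 2 exp(t)
t = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 5.9, 16.2, 44.8, 120.3, 331.0])

# fit a line to log(y): the slope estimates the growth rate b,
# the intercept estimates log(a) in y = a exp(b t)
b, log_a = np.polyfit(t, np.log(y), 1)
print(np.exp(log_a), b)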

There is a catch, however. When you extrapolate a linear regression, your uncertainty region flares out as you go from observed data to predictions based on that data: your uncertainty grows linearly. But remember that we’re not working with the data per se; we’re working with the logarithm of the data. So on the original scale, the uncertainty flares out exponentially.

Weighting an average to minimize variance

Suppose you have $100 to invest in two independent assets, A and B, and you want to minimize volatility. Suppose A is more volatile than B. Then putting all your money on A would be the worst thing to do, but putting all your money on B would not be the best thing to do.

The optimal allocation would be some mix of A and B, with more (but not all) going to B. We will formalize this problem and determine the optimal allocation, then generalize the problem to more assets.

Two variables

Let X and Y be two independent random variables with finite variance and assume at least one of X and Y is not constant. We want to find t that minimizes

\text{Var}[tX + (1-t)Y]

subject to the constraint 0 ≤ t ≤ 1. Because X and Y are independent,

\text{Var}[tX + (1-t)Y] = t^2 \text{Var}[X] + (1-t)^2 \text{Var}[Y]

Taking the derivative with respect to t gives 2t Var[X] − 2(1 − t) Var[Y], and setting this to zero shows that

t = \frac{\text{Var}[Y]}{\text{Var}[X] + \text{Var}[Y]}

So the smaller the variance on Y, the less we allocate to X. If Y is constant, we allocate nothing to X and go all in on Y.  If X and Y have equal variance, we allocate an equal amount to each. If X has twice the variance of Y, we allocate 1/3 to X and 2/3 to Y.

Multiple variables

Now suppose we have n independent random variables X_i for i running from 1 to n, and at least one of the variables is not constant. Then we want to minimize

\text{Var}\left[ \sum_{i=1}^n t_i X_i \right] = \sum_{i=1}^n t_i^2 \text{Var}[X_i]

subject to the constraint

\sum_{i=1}^n t_i = 1

and all t_i non-negative. We can solve this optimization problem with Lagrange multipliers and find that

t_i \text{Var}[X_i] = t_j \text{Var}[X_j]

for all 1 ≤ i, j ≤ n. These n − 1 equations, along with the constraint that the t_i sum to 1, give us a system of equations whose solution is

t_i = \frac{\prod_{j \ne i} \text{Var}[X_j]}{\sum_{i = 1}^n \prod_{j \ne i} \text{Var}[X_j]}
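
Here’s a short sketch of the solution in Python, assuming none of the variables is constant. In that case all variances are positive and the weights are proportional to the reciprocals of the variances, which is equivalent to the product formula above.

import numpy as np

def min_variance_weights(variances):
    # t_i proportional to 1/Var[X_i], assuming all variances are positive
    w = 1.0 / np.asarray(variances, dtype=float)
    return w / w.sum()

print(min_variance_weights([2, 1]))  # allocate 1/3 to X and 2/3 to Y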

Incidentally, the denominator has a name: the (n − 1)st elementary symmetric polynomial in n variables. More on this in the next post.

Distribution of correlation

One of the more subtle ideas to convey in an introductory statistics class is that statistics have distributions.

Students implicitly think that when you calculate a statistic on a data set, say the mean, you then have THE mean. But if your data are (modeled as) samples from a random variable, then anything you compute from those samples, such as the mean, is also a random variable. A useful statistic is not as random as the data, i.e. it has smaller variance, but it’s still random.

A couple of days ago I wrote about Fisher’s transform for making the distribution of sample correlations closer to normal. This post will make that more concrete.

Preliminaries

We’ll need to bring in a few Python libraries. While we’re at it, let’s set the random number generator seed so the results will be reproducible.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew

np.random.seed(20251020)

Correlated RNG

Next, we’ll need a way to generate correlated random samples, specifying the correlation ρ and the sample size N.

def gen_correlated_samples(rho, N):
    mean = [0, 0]
    cov = [
        [1, rho],
        [rho, 1]
    ]
    return np.random.multivariate_normal(mean, cov, size=N)

Calculating correlation

Once we generate correlated pairs, we need to calculate their correlation, or more precisely their linear (Pearson) correlation. To do this we’ll compute the sample correlation matrix, the empirical counterpart to the covariance matrix specified in the generator code above (which, since both variances are 1, is also the correlation matrix). The correlation coefficient is the off-diagonal element of this matrix.

def pearsonr(X):
    correlation_matrix = np.corrcoef(X[:,0], X[:,1])
    return correlation_matrix[0, 1]

Simulation

Now we’re ready to do our simulation.

M = 10000
rs = np.zeros(M)
for i in range(M):
    X = gen_correlated_samples(0.9, 100)
    rs[i] = pearsonr(X)

Notice that there are two levels of sampling. We’re generating random samples of size 100 and computing their correlation; that’s sampling our underlying data. And we’re repeating the process of computing the correlation 10,000 times; that’s sampling the correlation.

Untransformed distribution

Next we view the distribution of the correlation values.

plt.hist(rs, bins=int(np.sqrt(M)))
plt.show()
plt.close()

This gives the following plot.

It’s strongly skewed to the left, which we can quantify by calculating the skewness.

print(skew(rs))

This tells us the skewness is −0.616. A normal distribution has skewness 0. The negative sign tells us the direction of the skew.

Transformed distribution

Now let’s apply the Fisher transformation and see how it makes the distribution much closer to normal.

xformed = np.arctanh(rs)
plt.hist(xformed, bins=int(np.sqrt(M)))
plt.show()
plt.close()
print(skew(xformed))

This produces the plot below and prints a skewness value of −0.0415.

Small correlation example

We said before that when the correlation ρ is near zero, the Fisher transformation is less necessary. Here’s an example where ρ = 0.1. It’s not visibly different from a normal distribution, and the skewness is −0.1044.
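
Here’s one way to generate that example, reusing the functions defined above. (The exact skewness value depends on the state of the random number generator.)

rs_small = np.zeros(M)
for i in range(M):
    X = gen_correlated_samples(0.1, 100)
    rs_small[i] = pearsonr(X)
plt.hist(rs_small, bins=int(np.sqrt(M)))
plt.show()
plt.close()
print(skew(rs_small))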

Observation and conjecture

In our two examples, the skewness was approximately −ρ. Was that a coincidence, or does that hold more generally? We can test this with the following code.

def skewness(rho):
    rs = np.zeros(M)
    for i in range(M):
        X = gen_correlated_samples(rho, 100)
        rs[i] = pearsonr(X)
    return skew(rs)
    
rhos = np.linspace(-1, 1, 100)
ks = [skewness(rho) for rho in rhos]
plt.plot(rhos, ks)
plt.plot(rhos, -rhos, "--", color="gray")
plt.show()

Here’s the resulting plot.

It looks like the skewness is not exactly −ρ, but −cρ for some c < 1. Maybe c depends on the inner sample size, in our case 100. But it sure looks like skewness is at least approximately proportional to ρ. Maybe this is a well-known result, but I haven’t seen it before.

Distribution of coordinates on a sphere

Almost Sure posted an interesting fact on X:

If a point (x, y, z) is chosen at random uniformly on the unit sphere, then x, y, and z each have the uniform distribution on [−1, 1] and zero correlations (but not independent!) This follows from Archimedes’ “On the Sphere and Cylinder” published in 225BC.

Archimedes effectively showed that projecting a sphere to a cylinder wrapping round its equator preserves area, so projection of (x,y,z) is uniform on the cylinder. Used in Lambert cylindrical equal area projection.

In this post I want to empirically demonstrate the distribution of the coordinates, their lack of linear correlation, and their dependence.

The usual way to generate random points on a sphere is to generate three samples from a standard normal distribution and normalize by dividing by their Euclidean norm. So to generate a list of N points on a sphere, I generate an N × 3 matrix of normal samples and divide each row by its norm.

import numpy as np
from scipy.stats import norm

N = 10000
# Generate N points on a sphere
M = norm.rvs(size=(N, 3))
M /= np.linalg.norm(M, axis=1)[:, np.newaxis]

Next, I find the correlation of each column with the others.

from scipy.stats import pearsonr

x = M[:, 0]
y = M[:, 1]
z = M[:, 2]
cxy, _ = pearsonr(x, y)
cyz, _ = pearsonr(y, z)
cxz, _ = pearsonr(x, z)
print(cxy, cyz, cxz)

When I ran this I got correlations of -0.0023, 0.0047, and -0.0177. For reasons explained in the previous post, I’d expect values in the interval [−0.02, 0.02] if the columns were independent, so the correlations are consistent with that hypothesis.

To demonstrate that each column is uniformly distributed on [−1, 1] you could look at histograms.

import matplotlib.pyplot as plt

plt.hist(x, bins=int(N**0.5))
plt.show()
plt.hist(y, bins=int(N**0.5))
plt.show()
plt.hist(z, bins=int(N**0.5))
plt.show()

I’ll spare you the plots, but they look like what you’d expect. If you wanted to do something more analytical than visual, you could compute something like the Kolmogorov-Smirnov statistic.
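
For example, here’s a sketch of a Kolmogorov–Smirnov test of the x coordinates against the uniform distribution on [−1, 1].

from scipy.stats import kstest, uniform

# A large p-value is consistent with x being uniform on [-1, 1]
print(kstest(x, uniform(loc=-1, scale=2).cdf))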

The coordinates are not independent, though they have zero linear correlation. Maybe a test for nonlinear dependence will work. I tried Spearman’s rank correlation and Kendall’s tau, and neither detected the dependence. But distance correlation did.

from dcor import distance_correlation

cxy = distance_correlation(x, y)
cyz = distance_correlation(y, z)
cxz = distance_correlation(x, z)
print(cxy, cyz, cxz)

Each of the distance correlations was approximately 0.11, substantially greater than zero, implying dependence between the coordinates.

Inferring sample size from confidence interval

The previous post reported that a study found a 95% confidence interval for the area of the Mandelbrot set to be 1.506484 ± 0.000004. What was the sample size used to come to that conclusion?

A 95% confidence interval for a proportion is given by

\hat{p} \pm 1.96 \sqrt{\frac{\hat{p} \,(1 - \hat{p})}{n}}

and so if a confidence interval of width w is centered at the proportion estimate p hat, then we can solve

\frac{w}{2} = 1.96 \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}

to find

n = 15.37 \frac{\hat{p}(1 - \hat{p})}{w^2}

Now in the example at the top of the post, we’re randomly sampling points from the square [−2, 2] × [−2, 2] and counting how many land inside the Mandelbrot set. The square has area 16, so p hat equals 1.506484 / 16. The width of the confidence interval for the area is 8 × 10⁻⁶. This means the width of the confidence interval for the proportion is 5 × 10⁻⁷, and so n = 5.244 × 10¹².
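
Here is that arithmetic as a quick check in Python.

p_hat = 1.506484 / 16   # estimated proportion of the square inside the set
w = 8e-6 / 16           # width of the confidence interval for the proportion
n = (2 * 1.96 / w)**2 * p_hat * (1 - p_hat)
print(f"{n:.3e}")       # about 5.24e12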

Doubts regarding Mandelbrot area

The reasoning above is correct for inferring sample size from a confidence interval, but the specific application to the Mandelbrot set is suspect. Did someone really do over a trillion iterations? Apparently not.

As far as I can tell, this was the source of the estimate above. It’s not based on random samples but rather on regular grids of samples, treated as if they were random samples, and then extrapolated to get the estimate above. The result may be correct, but it’s not a simple Monte Carlo estimate. I haven’t looked at the page carefully, but I suspect the reported confidence interval is too small; error bars tend to flare out under extrapolation.

You can’t have everything you want: beta edition

The beta distribution is a conjugate prior for a binomial likelihood function, so it makes posterior probability calculations trivial: you simply add your data to the distribution parameters. If you start with a beta(α, β) prior distribution on a proportion θ, then observe s successes and f failures, the posterior distribution on θ is

beta(α + s, β + f).

There’s an integration going on in the background, but you don’t have to think about it. If you used any other prior, you’d have to calculate an integral, and it wouldn’t have a simple closed form.
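
As a small illustration, here’s the update in Python. The prior parameters and the data are made up.

from scipy.stats import beta

a, b = 2, 3      # hypothetical beta(2, 3) prior
s, f = 7, 3      # hypothetical data: 7 successes, 3 failures
posterior = beta(a + s, b + f)
print(posterior.mean())  # posterior mean of theta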

The sum α + β of the prior distribution parameters is interpreted as the number of observations’ worth of information contained in the prior. For example, a beta(0.5, 0.5) prior is non-informative, having only as much information as one observation.

I was thinking about all this yesterday because I was working on a project where the client believes that a proportion θ is around 0.9. A reasonable choice of prior would be beta(0.9, 0.1). Such a distribution has mean 0.9 and is non-informative, and it would be fine for the use I had in mind.

Non-informative beta priors work well in practice, but they’re a little unsettling if you plot them. You’ll notice that the beta(0.9, 0.1) density has singularities at both ends of the interval [0, 1]. The singularity on the right end isn’t so bad, but it’s curious that a distribution designed to have mean 0.9 has infinite density at 0.

If you went further and used a completely non-informative, improper beta(0, 0) prior, you’d have strong singularities at both ends of the interval. And yet this isn’t a problem in practice. Statistical operations that are not expressed in Bayesian terms are often equivalent to a Bayesian analysis with an improper prior like this.

This raises two questions. Why must a non-informative prior put infinite density somewhere, and why is this not a problem?

Beta densities with small parameters have singularities because that’s just the limitation of the beta family. If you were using a strictly subjective Bayesian framework, you would have to say that such densities don’t match anyone’s subjective priors. But the beta distribution is very convenient to use, as explained at the top of this post, and so everyone makes a little ideological compromise.

As for why singular priors are not a problem, it boils down to the difference between density and mass. A beta(0.9, 0.1) distribution has infinite density at 0, but it doesn’t put much mass at 0. If you integrate the density over a small interval, you get a small probability. The singularity on the left is hard to see in the plot above, almost a vertical line that overlaps the vertical axis. So despite the curve going vertical, there isn’t much area under that part of the curve.
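
To make that concrete, here’s a quick check of how little mass the beta(0.9, 0.1) distribution puts near its singularity at 0. The cutoff 0.01 is arbitrary.

from scipy.stats import beta

# probability mass in [0, 0.01], despite the infinite density at 0
print(beta(0.9, 0.1).cdf(0.01))  # well under 1%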

You can’t always get what you want. But sometimes you get what you need.

The Rise and Fall of Bayesian Statistics

At one time Bayesian statistics was not just a minority approach, it was considered controversial or fringe. When I was in grad school, someone confided in me that he was a closet Bayesian. He thought the Bayesian approach to statistics made sense, but didn’t want to jeopardize his career by saying so publicly.

Then somewhere along the way, maybe 20 years ago or so, Bayesian analysis not only became acceptable, it became hot. People would throw around the term Bayesian much like they throw around AI now.

During the Bayesian heyday, someone said that you’d know Bayes won when people would quit putting the word “Bayesian” in the title of their papers. That happened. I’m not sure when, but maybe around 2013? That was the year I went out on my own as a consultant. I thought maybe I could cash in on some of the hype over Bayesian statistics, but the hype had already subsided by then.

It’s strange that Bayes was ever scandalous, or that it was ever sexy. It’s just math [1]. You’d never look askance at someone for studying Banach algebras, nor would you treat them like a celebrity.

Bayesian statistics hasn’t fallen, but the hype around it has. The utility of Bayesian statistics has improved as the theory and its software tools have matured. In fact, it has matured to the point that people don’t feel the need to emphasize that their work is Bayesian.

[1] Statistics isn’t exactly math, but that’s a topic for another day.

Analyzing the Federalist Papers

The Federalist Papers, a collection of 85 essays published anonymously between 1787 and 1788, were one of the first subjects for natural language processing aided by a computer. Because the papers were anonymous, people were naturally curious who wrote each of the essays. Early on it was determined that the authors were Alexander Hamilton, James Madison, and John Jay, but the authorship of individual essays wasn’t known.

In 1944, Douglass Adair conjectured the authorship of each essay, and twenty years later Frederick Mosteller and David Wallace confirmed Adair’s conclusions by Bayesian analysis. Mosteller and Wallace used a computer to carry out their statistical calculations, but they did not have an electronic version of the text.

They physically chopped a printed copy of the text into individual words and counted them. Mosteller recounted in his autobiography that until working on The Federalist Papers, he had underestimated how hard it is to count a large number of things, especially little pieces of paper that could be scattered by a draft.

I’m not familiar with how Mosteller and Wallace did their analysis, but I presume they formed a prior distribution on the frequency of various words in writings known to be by Hamilton, Madison, and Jay, then computed the posterior probability of authorship by each author for each essay.

The authorship of the papers was summarized in the song “Non-Stop” from the musical Hamilton:

The plan was to write a total of twenty-five essays, the work divided evenly among the three men. In the end, they wrote eighty-five essays in the span of six months. John Jay got sick after writing five. James Madison wrote twenty-nine. Hamilton wrote the other fifty-one!

Yesterday I wrote about the TF-IDF statistic for the importance of words in a corpus of documents. In that post I used the books of the Bible as my corpus. Today I wanted to reuse the code I wrote for that post by applying it to The Federalist Papers.
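
The code from that earlier post isn’t shown here, but a minimal sketch of the idea using scikit-learn’s TfidfVectorizer on a placeholder corpus might look like the following. The scores it produces won’t match the numbers below, which come from the original code.

from sklearn.feature_extraction.text import TfidfVectorizer

# placeholder corpus; in practice each string would be the full text of an essay
essays = [
    "the union of the states guards against domestic faction and insurrection",
    "a well constructed union breaks and controls the violence of faction",
    "the judiciary and the executive derive their powers from the people",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(essays)
words = vectorizer.get_feature_names_out()

# top-scoring words for the second "essay"
row = tfidf[1].toarray().ravel()
for i in row.argsort()[::-1][:5]:
    print(f"{words[i]}: {row[i]:.4f}")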

Federalist No. 10 is the best known essay in the collection. Here are the words with the highest TF-IDF scores from that essay.

faction: 0.0084
majority: 0.0047
democracy: 0.0044
controlling: 0.0044
parties: 0.0039
republic: 0.0036
cure: 0.0035
factious: 0.0035
property: 0.0033
faculties: 0.0033

I skimmed a list of the most important words in the essays by Madison and Hamilton and noticed that Madison’s list had several words from classical literature: Achaeans, Athens, Draco, Lycurgus, Sparta, etc. There were only a couple of classical references in Hamilton’s top words: Lysander and Pericles. I noticed that “debt” was important to Hamilton.

You can find the list of top 10 words in each essay here.