Uncertainty in a probability

Suppose you did a pilot study with 10 subjects and found a treatment was effective in 7 out of the 10 subjects.

With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.

Now what would you estimate the probability to be that the treatment is effective in the next two subjects? You might say 0.49, and that would be correct if we knew that the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.

If the probability of success is p, then the probability of s successes and f failures in the next s + f subjects is given by

{s+f \choose s} p^s (1-p)^f

But if our probability of success has some uncertainty and we assume it has a beta(a, b) distribution, then the predictive probability of s successes and f failures is given by

{s+f \choose s} \frac{B(a+s, b+f)}{B(a,b)}


where

B(x, y) = \frac{\Gamma(x)\, \Gamma(y)}{\Gamma(x+y)}

In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?

We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability calculates the probability of outcomes under all possible values of the probability, then creates a weighted average, weighing each probability of success by the probability of that value. The differences corresponding to probability above and below 0.7 approximately balance out, but the former carry a little more weight and so we get roughly what we did before.

If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.

If our initial experiment had shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on the point estimate, but still different.
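The numbers above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `log_beta`, `predictive_prob`, and `binomial_prob` are made up for illustration. It works with logarithms of gamma functions so that larger beta parameters do not overflow.

```python
from math import comb, exp, lgamma

def log_beta(x, y):
    """Log of the beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def predictive_prob(s, f, a, b):
    """Probability of s successes and f failures when p ~ beta(a, b)."""
    return comb(s + f, s) * exp(log_beta(a + s, b + f) - log_beta(a, b))

def binomial_prob(s, f, p):
    """Probability of s successes and f failures when p is known exactly."""
    return comb(s + f, s) * p**s * (1 - p)**f

print(binomial_prob(2, 0, 0.7))       # point estimate
print(predictive_prob(2, 0, 7, 3))    # pilot study: 7 of 10
print(predictive_prob(2, 0, 70, 30))  # larger study: 70 of 100
```

The same functions can be evaluated for every s from 0 to 100 (with f = 100 - s) to compare the two distributions over the next 100 subjects.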

The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.7 and using predictive probability with a beta(7,3) distribution.

So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.


Münchausen numbers

Baron Münchausen


The number 3435 has the following curious property:

3435 = 3^3 + 4^4 + 3^3 + 5^5.

It is called a Münchausen number, an allusion to the fictional Baron Münchausen. When each digit is raised to its own power and summed, you get the original number back. The only other Münchausen number is 1.

At least in base 10. You could look at Münchausen numbers in other bases. If you write out a number n in base b, raise each of its “digits” to its own power, take the sum, and get n back, you have a Münchausen number in base b. For example 28 is a Münchausen number in base 9 because

28_{ten} = 31_{nine} = 3^3 + 1^1

Daan van Berkel proved that there are only finitely many Münchausen numbers in any given base. In fact, he shows that a Münchausen number in base b cannot be greater than 2b^b, and so you could do a brute-force search to find all the Münchausen numbers in any base.

The upper bound 2b^b grows very quickly with b and so brute force becomes impractical for large b. If you wanted to find all the hexadecimal Münchausen numbers you’d have to search 2*16^16 = 36,893,488,147,419,103,232 numbers. How could you do this more efficiently?
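One possible answer, sketched here as an assumption rather than as van Berkel's method: the sum of digits raised to their own powers depends only on which digits appear, not on their order. So instead of testing every number up to the bound, you can loop over multisets of digits, compute the sum, and check whether the digits of that sum are the multiset you started with. The function names are made up for illustration, and the code relies on Python's convention that 0**0 is 1.

```python
from itertools import combinations_with_replacement

def digits(n, base):
    """Digits of a positive integer n in the given base, most significant first."""
    ds = []
    while n > 0:
        n, r = divmod(n, base)
        ds.append(r)
    return ds[::-1]

def is_munchausen(n, base=10):
    """Direct check of the definition for a single number."""
    return n == sum(d**d for d in digits(n, base))

def munchausen_numbers(base):
    """Search over multisets of digits rather than over all numbers."""
    powers = [d**d for d in range(base)]              # note 0**0 == 1 in Python
    max_len = len(digits(2 * base**base, base))       # digit count of the bound 2b^b
    found = set()
    for k in range(1, max_len + 1):
        # Each combo is a non-decreasing tuple of digits, i.e. a multiset.
        for combo in combinations_with_replacement(range(base), k):
            n = sum(powers[d] for d in combo)
            if sorted(digits(n, base)) == list(combo):
                found.add(n)
    return sorted(found)

print(munchausen_numbers(10))
```

The number of multisets grows far more slowly than 2b^b, though it still grows quickly with the base, so this is a sketch of the idea rather than a tuned search.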

Beta reduction: The difference typing makes

Beta reduction is essentially function application. If you have a function described by what it does to x and apply it to an argument t, you rewrite the xs as ts. The formal definition of β-reduction is more complicated than this in order to account for free versus bound variables, but this informal description is sufficient for this blog post. We will first show that β-reduction holds some surprises, then explain how these surprises go away when you add typing.

Suppose you have an expression (λx.x + 2)y, which means apply to y the function that takes its argument and adds 2. Then β-reduction rewrites this expression as y + 2. In a more complicated expression, you might be able to apply β-reduction several times. When you do apply β-reduction several times, does the process always stop? And if you apply β-reduction to parts of an expression in a different order, can you get different results?

Failure to normalize

You might reasonably expect that if you apply β-reduction enough times you eventually get an expression you can’t reduce any further. Au contraire!

Consider the expression (λx.xx) (λx.xx). Beta reduction says to replace each x in the body of the first function, xx, with the argument, which is the second copy of (λx.xx). But when we do that, we get the original expression (λx.xx) (λx.xx) back. Beta reduction gets stuck in an infinite loop.

Next consider the expression L = (λx.xxy) (λx.xxy). Applying β-reduction the first time gives (λx.xxy) (λx.xxy) y or Ly. Applying β-reduction again yields Lyy. Beta “reduction” doesn’t reduce the expression at all but makes it bigger.

The technical term for what we’ve seen is that β-reduction is not normalizing. A rewrite system is strongly normalizing if applying the rules in any order eventually terminates. It’s weakly normalizing if there’s at least some sequence of applying the rules that terminates. Beta reduction is neither strongly nor weakly normalizing in the context of (untyped) lambda calculus.
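A rough analogue of the expression (λx.xx) (λx.xx) can be written in Python. This is only an illustration, not a faithful model of β-reduction: Python evaluates arguments eagerly and gives up when its recursion limit is reached, whereas the rewriting described above simply never ends.

```python
omega = lambda x: x(x)   # Python analogue of λx.xx

try:
    omega(omega)          # analogue of (λx.xx) (λx.xx)
    terminated = True
except RecursionError:
    # Each call immediately makes the same call again, so the interpreter
    # hits its recursion limit instead of producing a value.
    terminated = False

print(terminated)
```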

Types to the rescue

In simply typed lambda calculus, we assign types to every variable, and functions have to take the right type of argument. This additional structure prevents examples such as those above that fail to normalize. If x is a function that takes an argument of type A and returns a value of type B then you can’t apply x to itself. This is because x takes something of type A, not something of type function from A to B. You can prove that not only does this rule out specific examples like those above, it rules out all possible examples that would prevent β-reduction from terminating.

To summarize, β-reduction is not normalizing, not even weakly, in the context of untyped lambda calculus, but it is strongly normalizing in the context of simply typed lambda calculus.


Although β-reduction is not normalizing for untyped lambda calculus, the Church-Rosser theorem says it is confluent. That is, if an expression P can be transformed by β-reduction two different ways into expressions M and N, then there is an expression T such that both M and N can be reduced to T. This implies that if β-reduction does terminate, then it terminates in a unique expression (up to α-conversion, i.e. renaming bound variables). Simply typed lambda calculus is confluent as well, and so in that context we can say β-reduction always terminates in a unique expression (again up to α-conversion).

Less likely to get half, more likely to get near half

I was catching up on Engines of our Ingenuity episodes this evening when the following line jumped out at me:

If I flip a coin a million times, I’m virtually certain to get 50 percent heads and 50 percent tails.

Depending on how you understand that line, it’s either imprecise or false. The more times you flip the coin, the more likely you are to get nearly half heads and half tails, but the less likely you are to get exactly half of each. I assume Dr. Lienhard knows this and that by “50 percent” he meant “nearly half.”

Let’s make the fuzzy statements above more quantitative. Suppose we flip a coin 2n times for some large number n. Then a calculation using Stirling’s approximation shows that the probability of n heads and n tails is approximately

\frac{1}{\sqrt{\pi n}}

which goes to zero as n goes to infinity. If you flip a coin a million times, there’s less than one chance in a thousand that you’d get exactly half heads.

Next, let’s quantify the statement that nearly half the tosses are likely to be heads. The normal approximation to the binomial tells us that for large n, the number of heads out of 2n tosses is approximately distributed like a normal distribution with the same mean and variance, i.e. mean n and variance n/2. The proportion of heads is thus approximately normal with mean 1/2 and variance 1/(8n). This means the standard deviation is 1/√(8n). So, for example, about 95% of the time the proportion of heads will be 1/2 plus or minus 2/√(8n). As n goes to infinity, the width of this interval goes to 0. Alternatively, we could pick some fixed interval around 1/2 and show that the probability of the proportion of heads being outside that interval goes to 0.
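Both claims can be checked numerically for a million tosses. The sketch below, with illustrative function names, computes the exact probability of n heads in 2n tosses, which is the binomial coefficient C(2n, n) divided by 2^(2n), using log-gamma to avoid enormous integers. It compares that to the Stirling estimate 1/√(πn) and also prints the two-standard-deviation half-width 2/√(8n).

```python
from math import exp, lgamma, log, pi, sqrt

def prob_exactly_half(n):
    """Exact probability of n heads in 2n fair tosses, computed via logs."""
    log_p = lgamma(2*n + 1) - 2*lgamma(n + 1) - 2*n*log(2)
    return exp(log_p)

def stirling_estimate(n):
    """Approximation 1/sqrt(pi n) to the probability of exactly n heads."""
    return 1 / sqrt(pi * n)

def two_sigma_halfwidth(n):
    """Two standard deviations of the proportion of heads in 2n tosses."""
    return 2 / sqrt(8 * n)

n = 500_000  # 2n = one million tosses
print(prob_exactly_half(n), stirling_estimate(n), two_sigma_halfwidth(n))
```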

Insufficient statistics

Experience with the normal distribution makes people think all distributions have (useful) sufficient statistics [1]. If you have data from a normal distribution, then the sufficient statistics are the sample mean and sample variance. These statistics are “sufficient” in that the entire data set isn’t any more informative than those two statistics. They effectively condense the data for you. (This is conditional on knowing the data come from a normal. More on that shortly.)
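To make the idea concrete, here is a small sketch with made-up data sets. The two samples have the same size, mean, and variance but different values, and the normal log-likelihood comes out the same for both, whatever parameters you plug in. In that sense, under a normal model, the extra detail in the data carries no further information about the parameters.

```python
from math import log, pi, sqrt

def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. normal(mu, sigma) data."""
    n = len(data)
    squared_error = sum((x - mu)**2 for x in data)
    return -0.5 * n * log(2 * pi * sigma**2) - squared_error / (2 * sigma**2)

data1 = [1.0, 2.0, 3.0, 4.0, 5.0]
r = sqrt(5)
# Hypothetical second sample: same size, mean (3), and sum of squared
# deviations (10) as data1, but clearly not the same numbers.
data2 = [3 - r, 3.0, 3.0, 3.0, 3 + r]

for mu, sigma in [(0.0, 1.0), (3.0, 2.0), (-1.5, 0.7)]:
    print(normal_log_likelihood(data1, mu, sigma),
          normal_log_likelihood(data2, mu, sigma))
```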

With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.

Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.

Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, conditional on the normality assumption, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.


[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.

Reversing WYSIWYG

The other day I found myself saying that I preferred org-mode files to Jupyter notebooks because with org-mode, what you see is what you get. Then I realized I was using “what you see is what you get” (WYSIWYG) in exactly the opposite of the usual sense. Jupyter notebooks are WYSIWYG in the same sense that a Word document is. You work with a nicely formatted file and the changes you make are immediately reflected visually.

The source file for a Jupyter notebook is a JSON document containing formatting instructions, encoded images, and the notebook content. Looking at a notebook file in a text editor is analogous to unzipping a Word document and looking at the XML inside. Here’s a sample:

    "    if (with_grid_):\n",
    "        plt.grid (which=\"both\",ls=\"-\",color=\"0.65\")\n",
    "    plt.show()    "
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   "outputs": [
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAawAAAESCAYAAA...

It’s hard to diff two Jupyter notebooks because content changes don’t map simply to changes in the underlying file.

We can look a little closer at WYSIWYG and ask what you see where and what you get where. Word documents and Jupyter notebooks are visually WYSIWYG: what you see in the interactive environment (browser) is what you get visually in the final product. Which is very convenient, except for version control and comparing changes.

Org-mode files are functionally WYSIWYG: what you see in the interactive environment (editor) is what you get functionally in the final code.

You could say that HTML and LaTeX are functionally or causally WYSIWYG whereas Word is visually WYSIWYG. Word is visually WYSIWYG in the sense that what you see on your monitor is essentially what you’ll see coming out of a printer. But Word doesn’t always let you see what’s causing your document to look as it does. Why can’t I delete this? Why does it insist on putting that here? HTML and LaTeX work at a lower level, for better and for worse. You may not be able to anticipate how things will look while editing the source, e.g. where a line will break, but cause and effect are transparent.

Everybody wants WYSIWYG, but they have different priorities regarding what they want to see and are concerned about different aspects of what they get.

Floating point: between blissful ignorance and unnecessary fear

Most programmers are at one extreme or another when it comes to floating point arithmetic. Some are blissfully ignorant that anything can go wrong, while others believe that danger lurks around every corner when using floating point.

The limitations of floating point arithmetic are something to be aware of, and ignoring these limitations can cause problems, like crashing airplanes. On the other hand, floating point arithmetic is usually far more reliable than the application it is being used in.

It’s well known that if you multiply two floating point numbers, a and b, then divide by b, you might not get exactly a back. Your result might differ from a by one part in ten quadrillion (10^16). To put this in perspective, suppose you have a rectangle approximately one kilometer on each side. Someone tells you the exact area and the exact length of one side. If you solve for the length of the missing side using standard (IEEE 754) floating point arithmetic, your result could be off by as much as the width of a helium atom.
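Here is a small experiment along those lines, a sketch with an arbitrary seed and arbitrary ranges. It multiplies random pairs of doubles, divides by the second factor, and records how often the first factor fails to come back exactly and how large the relative error gets.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable
trials = 100_000
mismatches = 0
worst = 0.0  # largest relative error seen

for _ in range(trials):
    a = random.uniform(1, 1000)
    b = random.uniform(1, 1000)
    recovered = (a * b) / b
    if recovered != a:
        mismatches += 1
        worst = max(worst, abs(recovered - a) / a)

print(mismatches, worst)
```

The multiplication and the division each round once, so the relative error in the recovered value stays within a few multiples of 10^-16.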

Most of the problems attributed to “round off error” are actually approximation error. As Nick Trefethen put it,

If rounding errors vanished, 90% of numerical analysis would remain.

Still, despite the extreme precision of floating point, in some contexts rounding error is a practical problem. You have to learn in what context floating point precision matters and how to deal with its limitations. This is no different than anything else in computing.

For example, most of the time you can imagine that your computer has an unlimited amount of memory, though sometimes you can’t. It’s common for a computer to have enough memory to hold the entire text of Wikipedia (currently around 12 gigabytes). This is usually far more memory than you need, and yet for some problems it’s not enough.


Proofs and programs

Here’s an interesting quote comparing writing proofs and writing programs:

Building proofs and programs are very similar activities, but there is one important difference: when looking for a proof it is often enough to find one, however complex it is. On the other hand, not all programs satisfying a specification are alike: even if the eventual result is the same, efficient programs must be preferred. The idea that the details of proofs do not matter—usually called proof irrelevance—clearly justifies letting the computer search for proofs …

The quote is saying “Any proof will do, but among programs that do the same job, more efficient programs are better.” And this is certainly true, depending on what you want from a proof.

Proofs serve two main purposes: to establish that a proposition is true, and to show why it is true. If you’re only interested in the existence of a proof, then yes, any proof will do. But proofs have a sort of pedagogical efficiency just as programs have an execution efficiency. Given two proofs of the same proposition, the more enlightening proof is better by that criterion.

I assume the authors of the above quote would not disagree because they say it is often enough to find one, implying that for some purposes simpler proofs are better. And existence does come first: a complex proof that exists is better than an elegant proof that does not exist! Once you have a proof you may want to look for a simpler proof. Or not, depending on your purposes. Maybe you have an elegant proof that you’re convinced must be essentially true, even if you doubt some details. In that case, you might want to complement it with a machine-verifiable proof, no matter how complex.

Source: Interactive Theorem Proving and Program Development
Coq’Art: The Calculus of Inductive Constructions

ETAOIN SHRDLU and all that

Statistics can be useful, even if its idealizations fall apart on close inspection.

For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters “ETAOIN SHRDLU” comes from the days of Linotype when letters were arranged in that order, in decreasing order of frequency. Sometimes you’d see ETAOIN SHRDLU in print, just as you might see “QWERTY” today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)

One letter has worn off my keyboard


So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn’t the same as its frequency in English. And if you’d picked the novel Gadsby instead of A Tale of Two Cities you’d get very different results since that book was written without using a single letter E.
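Counting letters in one particular text is the easy part. Here is a minimal sketch, where the function name is made up and a single sentence stands in for a full text. The result describes only that sample, which is the point of the discussion here.

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter A-Z in text, most common first."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return {letter: count / total for letter, count in counts.most_common()}

# A short stand-in for a full text such as a novel.
sample = "It was the best of times, it was the worst of times"
for letter, freq in list(letter_frequencies(sample).items())[:5]:
    print(letter, round(freq, 3))
```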

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That’s a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it’s still not English.

What might we mean by “English” when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I’ve altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn’t a fixed thing. When we say “English” as a whole we have in mind some idealization that strictly speaking doesn’t exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not some particular English corpus, we can’t answer that to too many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter’s frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can’t expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can’t say with certainty that the most common letter in a particular sample must correspond to E, the next to T, etc.

By the way, Peter Norvig’s analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.


What is calculus?

When people ask me what calculus is, my usual answer is “the mathematics of change,” studying things that change continually. Algebra is essentially static, studying things frozen in time and space. Calculus studies things that move, shapes that bend, etc. Algebra deals with things that are exact and consequently can be fragile. Calculus deals with approximation and is consequently more robust.

I’m happier with the paragraph above if you replace “calculus” with “analysis.” Analysis certainly seeks to understand and model things that change continually, but calculus per se is the mechanism of analysis.

I used to think it oddly formal for people to say “differential and integral calculus.” Is there any other kind? Well yes, yes there is, though I didn’t know that at the time. A calculus is a system of rules for computing things. Differential and integral calculus is a system of rules for calculating derivatives and integrals. Lately I’ve thought about other calculi more than differential calculus: propositional calculus, lambda calculus, calculus of inductive constructions, etc.

In my first career I taught (differential and integral) calculus and was frustrated with students who would learn how to calculate derivatives but never understood what a derivative was or what it was good for. In some sense though, they got to the core of what a calculus is. It would be better if they knew what they were calculating and how to apply it, but they still learn something valuable by knowing how to carry out the manipulations. A computer science major, for example, who gets through (differential) calculus knowing how to calculate derivatives without knowing what they are is in a good position to understand lambda calculus later.

Big Logic

As systems get larger and more complex, we need new tools to test whether these systems are correctly specified and implemented. These tools may not be new per se, but they may be applied with new urgency.

Dimensional analysis is a well-established method of error detection. Simply checking that you’re not doing something like adding things that aren’t meaningful to add is surprisingly useful for detecting errors. For example, if you’re computing an area, does your result have units of area? This seems like a ridiculously basic question to ask, and yet it is more effective than it would seem.

Type theory and category theory are extensions of the idea of dimensional analysis. Simply asking of a function “Exactly what kind of thing does it take in? And what sort of thing does it put out?” is a powerful way to find errors. And in practice it may be hard to get answers to these questions. When you’re working with a system that wasn’t designed to make such things explicit and clear, it may be that nobody knows for certain, or people believe they know but have incompatible ideas. Type theory and category theory can do much more than nudge you to define functions well, but even that much is a good start.

Specifying types is more powerful than it seems. With dependent types, you can incorporate so much information in the type system that showing that an instance of a type exists is equivalent to proving a theorem. This is a consequence of the Curry-Howard correspondence (“Propositions as types”) and is the basis for proof assistants like Coq.

I’d like to suggest the term Big Logic for the application of logic on a large scale, using logic to prove properties of systems that are too complex for a typical human mind to thoroughly understand. It’s well known that systems have gotten larger, more connected, and more complex. It’s not as well known how much the tools of Big Logic have improved, making formal verification practical for more projects than it used to be.

Duality in spherical trigonometry

This evening I ran across an unexpected reference to spherical trigonometry: Thomas Hales’ lecture on lessons learned from the formal proof of the Kepler conjecture. He mentions at one point a lemma that was awkward to prove in its original form, but that became trivial when he looked at its spherical dual.

The sides of a spherical triangle are formed by great circular arcs through the vertices. Since the sides are portions of a circle, they can be measured as angles. So in spherical trig you have this interesting interplay of two kinds of angles: the angles formed at the intersections of the sides, and the angles describing the sides themselves.

Here’s how you form the dual of a spherical triangle. Suppose the vertices of the triangle are A, B, and C. Think of the arc connecting A and B as an equator, and let C′ be the corresponding pole that lies on the same side of the arc as the original triangle ABC. Do the analogous process to find the points A′ and B′. The triangle A′B′C′ is the dual of the triangle ABC. (This idea goes back to the Persian mathematician Abu Nasr Mansur circa 1000 AD.)

The sides in A′B′C′ are the supplementary angles of the corresponding intersection angles in ABC, and the intersection angles in A′B′C′ are the supplementary angles of the corresponding sides in ABC.
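The supplementary relationship can be checked numerically. The sketch below represents points on the unit sphere as 3-vectors, builds the dual triangle from cross products, and adds a side of one triangle to the corresponding angle of the other. If the relations above hold, each sum should come out close to π. The helper names and the particular triangle are arbitrary choices for illustration.

```python
from math import acos, sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def unit(u):
    r = sqrt(dot(u, u))
    return tuple(a / r for a in u)

def arc(u, v):
    """Length of the great-circle arc between unit vectors u and v."""
    return acos(max(-1.0, min(1.0, dot(u, v))))

def vertex_angle(p, q, r):
    """Angle at vertex p between the great-circle arcs pq and pr."""
    return arc(unit(cross(p, q)), unit(cross(p, r)))

def pole(p, q, r):
    """Pole of the great circle through p and q on the same side as r."""
    n = unit(cross(p, q))
    return n if dot(n, r) > 0 else tuple(-a for a in n)

def dual(A, B, C):
    """Vertices of the dual triangle, opposite A, B, and C respectively."""
    return pole(B, C, A), pole(C, A, B), pole(A, B, C)

A, B, C = unit((1, 0.2, 0.1)), unit((0.1, 1, 0.3)), unit((0.2, 0.1, 1))
A2, B2, C2 = dual(A, B, C)
print(arc(A2, B2) + vertex_angle(C, A, B))    # dual side plus original angle at C
print(vertex_angle(C2, A2, B2) + arc(A, B))   # dual angle plus original side AB
```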

In his paper “Duality in the formulas of spherical trigonometry,” published in American Mathematical Monthly in 1909, W. A. Granville gives the following duality principle:

If the sides of a spherical triangle be denoted by Roman letters a, b, c and the supplements of the corresponding opposite angles by the Greek letters α, β, γ, then from any given formula involving any of these six parts, we may write down a dual formula simply by interchanging the corresponding Greek and Roman letters.

Related: Notes on Spherical Trigonometry

Primitive recursive functions and enumerable sets

The set of primitive recursive (PR) functions is the smallest set of functions of several integer arguments satisfying five axioms:

  1. Constant functions are PR.
  2. The function that picks the ith element of a list of n arguments is PR.
  3. The successor function S(n) = n+1 is PR.
  4. PR functions are closed under composition.
  5. PR functions are closed under primitive recursion.

The last axiom obviously gives PR functions their name. So what is primitive recursion? Given a PR function f that takes k arguments, and another PR function g that takes k+2 arguments, the primitive recursion of f and g is a function h of k+1 arguments satisfying two properties:

  1. h(0, x1, …, xk) = f(x1, …, xk)
  2. h(S(y), x1, …, xk) = g(y, h(y, x1, …, xk), x1, …, xk)
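The two properties translate directly into code. Here is a sketch in Python, where `primitive_recursion` is a made-up helper that builds h from f and g, with addition and multiplication as examples. The lambdas stand in for functions that would strictly be assembled from projections, constants, the successor function, and composition.

```python
def successor(n):
    return n + 1

def primitive_recursion(f, g):
    """Return h with h(0, *xs) = f(*xs) and h(y+1, *xs) = g(y, h(y, *xs), *xs)."""
    def h(y, *xs):
        value = f(*xs)                 # value of h(0, *xs)
        for i in range(y):
            value = g(i, value, *xs)   # value of h(i+1, *xs)
        return value
    return h

# add(y, x) = x + y: start at x, apply the successor y times.
add = primitive_recursion(lambda x: x, lambda y, prev, x: successor(prev))

# multiply(y, x) = x * y: start at 0, add x a total of y times.
multiply = primitive_recursion(lambda x: 0, lambda y, prev, x: add(x, prev))

print(add(3, 4), multiply(3, 4))
```

The loop runs exactly y times, a bound known before it starts. That fixed bound is what separates primitive recursion from general recursion, where a search may run for an unbounded number of steps.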

Not every computable function is primitive recursive. For example, Ackermann’s function is a general recursive function, but not a primitive recursive function. General recursive functions are Turing complete. Turing machines, lambda calculus, and general recursive functions are all equivalent models of computation, but the first two are better known.

For this post, the main thing about general recursive functions is that, as the name implies, they are more general than primitive recursive functions.

Now we switch from functions to sets. The characteristic function of a set A is the function that is 1 for elements of A and zero everywhere else. In other areas of math, there is a sort of duality between functions and sets via characteristic functions. For example, the indicator function of a measurable set is a measurable function. And the indicator function of a convex set is a convex function. But in recursive functions, there’s an unexpected wrinkle in this analogy.

A set of integers is recursively enumerable if it is either empty or the image of a general recursive function. But there’s a theorem, due to Alonzo Church, that a non-empty recursively enumerable set is actually the image of a primitive recursive function. So although general recursive functions are more general, you can’t tell that from looking at function images. For example, although the Ackermann function is not primitive recursive, there is a primitive recursive function with the same image.

Some ways linear algebra is different in infinite dimensions

There’s no notion of continuity in linear algebra per se. It’s not part of the definition of a vector space. But a finite dimensional vector space over the reals is isomorphic to a Euclidean space of the same dimension, and so we usually think of such spaces as Euclidean. (We’re only going to consider real vector spaces in this post.) And there we have a notion of distance, a norm, and hence a topology and a way to say whether a function is continuous.


In finite dimensional Euclidean space, linear functions are continuous. You can put a different norm on a Euclidean space than the one it naturally comes with, but all norms give rise to the same topology and hence the same continuous functions. (This is useful in numerical analysis where you’d like to look at a variety of norms. The norms give different analytical results, but they’re all topologically equivalent.)

In an infinite dimensional normed space, linear functions are not necessarily continuous. If the dimension of a space is only a trillion, all linear functions are continuous, but when you jump from high dimension to infinite dimension, you can have discontinuous linear functions. But if you look at this more carefully, there isn’t a really sudden change.

If a linear function is discontinuous, its finite dimensional approximations are continuous, but the degree of continuity is degrading as dimension increases. For example, suppose a linear function stretches the nth basis vector by a factor of n. The bigger n gets, the more the function stretches in the nth dimension. As long as n is bounded, this is continuous, but in a sense it is less continuous as n increases. The fact that the infinite dimensional version is discontinuous tells you that the finite dimensional versions, while technically continuous, scale poorly with dimension. (See practical continuity for more discussion along these lines.)
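The stretching example is easy to play with in finite dimensions. In this sketch, with illustrative helper names, the function scales the nth coordinate by n. Applying it to the last basis vector in each dimension shows the ratio of output norm to input norm growing with the dimension.

```python
from math import sqrt

def stretch(v):
    """Scale the nth coordinate (1-indexed) of v by a factor of n."""
    return [(i + 1) * x for i, x in enumerate(v)]

def norm(v):
    """Euclidean norm."""
    return sqrt(sum(x * x for x in v))

def basis_vector(n, dim):
    """The nth standard basis vector (1-indexed) in dim dimensions."""
    return [1.0 if i == n - 1 else 0.0 for i in range(dim)]

for dim in (10, 1000, 100000):
    e = basis_vector(dim, dim)
    print(dim, norm(stretch(e)) / norm(e))
```

In each fixed dimension this ratio is bounded, so the map is continuous there. No single bound works for every dimension at once, which is the finite-dimensional shadow of the discontinuity.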


A Banach space is a complete normed linear space. Finite dimensional normed spaces are always complete (i.e. every Cauchy sequence in the space converges to a point in the space) but this might not happen in infinite dimensions.

Duals and double duals

In basic linear algebra, the dual of a vector space V is the space of linear functionals on V, i.e. the set of linear maps from V to the reals. This space is denoted V*. If V has dimension n, V* has dimension n, and all n-dimensional spaces are isomorphic, so the distinction between a space and its dual seems pedantic. But in general a Banach space and its dual are not isomorphic and so it’s easier to tell them apart.

The second dual of a vector space, V**, is the dual of the dual space. In finite dimensional spaces, V** is naturally isomorphic to V. In Banach spaces, V is isomorphic to a subset of V**. And even when V is isomorphic to V**, it might not be naturally isomorphic to V**. (Here “natural” means natural in the category theory sense of natural transformations.)