Applying probability to non-random things

Probability has surprising uses, including applications to things that absolutely are not random. I’ve touched on this a few times. For example, I’ve commented on how arguments about whether something is really random are often moot: Random is as random does.

This post will take non-random uses for probability in a different direction. We’ll start with discussion of probability and end up proving a classical analysis result.

Let p be the probability of success on some experiment and let Xn count the number of successes in n independent trials. The expected value of Xn is np, and so the expected value of Xn/n is p.

Now let be a continuous function on [0, 1] and think about f(Xn/n). As n increases, the distribution of Xn/n clusters more and more tightly around p, and so f(Xn/n) clusters more and more around f(p).

Let’s illustrate this with a simulation. Let p = 0.6, n = 100, and let f(x) = cos(x).

Here’s a histogram of random samples from Xn/n.

Histogram of samples from Bin(100, 0.6)

And here’s a histogram of random samples from f(Xn/n). Note that it’s centered near cos(p) = 0.825.

Histogram of samples from cos( Bin(100, 0.6) )


As n gets large, the expected value of f(Xn/n) converges to f(p). (You could make all this all rigorous, but I’ll leave out the details to focus on the intuitive idea.)

Now write out the expected value of f(Xn/n).

E\left[ f\left(\frac{X_n}{n} \right ) \right ] = \sum_{k=0}^n f\left(\frac{k}{n} \right ) {n \choose k} p^k(1-p)^{n-k}

Up to this point we’ve thought of p as a fixed probability, and we’ve arrived at an approximation result guided by probabilistic thinking. But the expression above is approximately f(p) regardless of how we think of it. So now let’s think of p varying over the interval [0, 1].

E[ f(Xn/n) ] is an nth degree polynomial in p that is close to f(p). In fact, as n goes to infinity, E[ f(Xn/n) ] converges uniformly to f(p).

So for an arbitrary continuous function f, we’ve constructed a sequence of polynomials that converge uniformly to f. In other words, we’ve proved the Weierstrass approximation theorem! The Weierstrass approximation theorem says that there exists a sequence of polynomials converging to any continuous function f on a bounded interval, and we’ve explicitly constructed such a sequence. The polynomials E[ f(Xn/n) ] are known as the Bernstein polynomials associated with f.

This is one of my favorite proofs. When I saw this in college, it felt like the professor was pulling a rabbit out of a hat. We’re talking about probability when all of a sudden, out pops a result that has nothing to do with randomness.

Now let’s see an example of the Bernstein polynomials approximating a continuous function. Again we’ll let f(x) = cos(x). If we plot the Bernstein polynomials and cosine on the same plot, the approximation is good enough that it’s hard to see what’s what. So instead we’ll plot the differences, cosine minus the Bernstein approximations.

Error in approximating cosine by Bernstein polynomials

As the degree of the Bernstein polynomials increase, the error decreases. Also, note the vertical scale. Cosine of zero is 1, and so the errors here are small relative to the size of the function being approximated.

Misplacing a continent

There are many conventions for describing points on a sphere. For example, does latitude zero refer to the North Pole or the equator? Mathematicians tend to prefer the former and geoscientists the latter. There are also varying conventions for longitude.

Volker Michel describes this clash of conventions colorfully in his book on constructive approximation.

Many mathematicians have faced weird jigsaw puzzles with misplaced continents after using a data set from a geoscientist. If you ever get such figures, too, or if you are, for example, desperately searching South America in a data set but cannot find it, remember the remark you have just read to solve your problem.

Related posts:

Quaint supercomputers

The latest episode of Star Trek Discovery (S1E4) uses the word “supercomputer” a few times. This sounds jarring. The word has become less common in contemporary usage, and seems even more out of place in a work of fiction set more than two centuries in the future.

According to Google’s Ngram Viewer, the term “supercomputer” peaked in 1990.

Google Ngram Viewer results for supercomputer

(The term “cluster” is far more common, but it is mostly a non-technical term. It’s used in connection with grapes, for example, much more often than with computers.)

Years ago you’d hear someone say a problem would take a “big computer,” but these days you’re more likely to hear someone say a problem is going to take “a lot of computing power.” Hearing that a problem is going to require a “big computer” sounds as dated as saying something would take “a big generator” rather than saying it would take a lot of electricity.

Like electricity, computing power has been commoditized. We tend to think in terms of the total amount of computing resources needed, measured in, say, CPU-hours or number of low-level operations. We don’t think first about what configuration of machines would deliver these resources any more than we’d first think about what machines it would take to deliver a certain quantity of electrical power.

There are still supercomputers and problems that require them, though an increasing number of computationally intense projects do not require such specialized hardware.

Something that bothers me about deep neural nets

Overfitting happens when a model does too good a job of matching a particular data set and so does a poor job on new data. The way traditional statistical models address the danger of overfitting is to limit the number of parameters. For example, you might fit a straight line (two parameters) to 100 data points, rather than using a 99-degree polynomial that could match the input data exactly and probably do a terrible job on new data. You find the best fit you can to a model that doesn’t have enough flexibility to match the data too closely.

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

It a nutshell, the disturbing thing is that you have a negative criteria for what constitutes a good solution: one that’s not too close to optimal. But there are a lot of ways to not be too close to optimal. In practice, you experiment until you find an optimally suboptimal solution, i.e. the intentionally suboptimal fit that performs the best in validation.


Exponential sums make pretty pictures

Exponential sums are a specialized area of math that studies series with terms that are complex exponentials. Estimating such sums is delicate work. General estimation techniques are ham-fisted compared to what is possible with techniques specialized for these particular sums. Exponential sums are closely related to Fourier analysis and number theory.

Exponential sums also make pretty pictures. If you make a scatter plot of the sequence of partial sums you can get surprising shapes. This is related to the trickiness of estimating such sums: the partial sums don’t simply monotonically converge to a limit.

The exponential sum page at UNSW suggests playing around with polynomials with dates in the denominator. If we take that suggestion with today’s date, we get the curve below:

f(n) = n/10 + n**2/7 + n**3/17

These are the partial sums of exp(2πi f(n)) where f(n) = n/10 + n²/7 + n³/17.

[Update: You can get an image each day for the current day’s date here.]

Here’s the code that produced the image.

    import matplotlib.pyplot as plt
    from numpy import array, pi, exp, log

    N = 12000
    def f(n):
        return n/10 + n**2/7 + n**3/17 

    z = array( [exp( 2*pi*1j*f(n) ) for n in range(3, N+3)] )
    z = z.cumsum()

    plt.plot(z.real, z.imag, color='#333399')

If we use logarithms, we get interesting spirals. Here f(n) = log(n)4.1.

f(n) = log(n)**4.1

And we can mix polynomials with logarithms. Here f(n) = log(n) + n²/100.

f(n) = log(n) + n**2/100

In this last image, I reduced the number of points from 12,000 to 1200. With a large number of points the spiral nature dominates and you don’t see the swirls along the spiral as clearly.


No critical point between two peaks

If a function of one variable has two local maxima, it must have a local minimum in between.

What about a function of two variables? If it has two local maxima, does it need to have a local minimum? No, it could have a saddle point in between, a point that is a local minimum in one direction but a local maximum in another direction. But even more surprising, it need not even have a saddle point. A function of two variables could have two local maxima and no other critical points! Here’s an example:

f(x, y) = – (x² – 1)² – (x²yx – 1)²

It’s clear that the function is zero at (-1, 0) and (1, 2), and that the function is negative otherwise. So it clearly has two local maxima. You can write out the partial derivatives with respect to x and y and see that the only place they’re both zero is at the two local maxima.

Here’s a plot of the function:

Plot3D[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}]

And here’s a contour plot:

ContourPlot[f[x, y], {x, -1.5, 1.5}, {y, -0.5, 2.5}, Contours -> 50]

The two maxima are in in the bright patches in the lower left and upper right.

You might be thinking that if you walk between two peaks, you’ve got to go down in between. And that’s true. If you walk in a straight line between (-1, 0) and (1, 2), you’ll run into a local minimum around (0.2316, 1.2316). But that’s only a local minimum along your path. It’s not a local minimum or saddle point of the function in a neighborhood of that point.

I found this example in the book Single Digits.

Clean obfuscated code

One way to obfuscate code is clever use of arcane programming language syntax. Hackers are able to write completely unrecognizable code by exploiting dark corners of programming techniques and languages. Some of these attempts are quite impressive.

But it’s also possible to write clean source code that is nevertheless obfuscated. For example, it’s not at all obvious what the following code computes.

def f(x):
    return 4*x*(1-x)

def g(x):
    return x**3.5 * (1-x)**2.5

sum = 0
x = 1/7.0
N = 1000000

for _ in range(N):
    x = f(x)
    sum += g(x)


The key to unraveling this mystery is a post I wrote a few days ago that shows the x‘s in this code cover the unit interval like random samples from a beta(0.5, 0.5) distribution. That means the code is finding the expected value of g(X) where X has a beta(0.5, 0.5) distribution. The following computes this expected value.

 \int_0^1 g(x) \, dF_X &=& \frac{1}{\pi} \int_0^1 x^{3.5} (1-x)^{2.5} x^{-0.5} (1-x)^{-0.5} \, dx \\ &=& \frac{1}{\pi} \int_0^1 x^3 (1-x)^2\, dx \\ &=& \frac{1}{\pi} \frac{\Gamma(4) \Gamma(3)}{\pi \Gamma(7)} \\ &=&\frac{1}{60\pi}

The output of the program matches 1/60π to six decimal places.

Related post: What does this code do?

How many musical scales are there?

How many musical scales are there? That’s not a simple question. It depends on how you define “scale.”

For this post, I’ll only consider scales starting on C. That is, I’ll only consider changing the intervals between notes, not changing the starting note. Also, I’ll only consider subsets of the common chromatic scale; this post won’t get into dividing the octave into more or less than 12 intervals.

First of all we have the major scale — C D E F G A B C — and the “natural” minor scale: A B C D E F G A. The word “natural” suggests there are other minor scales. More on that later.

Then we have the classical modes: Dorian, Phrygian, Lydian, Mixolydian, Aeolian, and Locrian. These have the same intervals as taking the notes of the C major scale and starting on D, E, F, G, A, and B respectively. For example, Dorian has the same intervals as D E F G A B C D. Since we said we’d start everything on C, the Dorian mode would be C D E♭ F G A B♭ C. The Aeloian mode is the same as the natural minor scale.

The harmonic minor scale adds a new wrinkle: C D E♭ F G A♭ B C. Notice that A♭ and B are three half steps apart. In all the scales above, notes were either a half step or a whole step apart. Do we want to consider scales that have such large intervals? It seems desirable to include the harmonic minor scale. But what about this: C E♭ G♭ A C. Is that a scale? Most musicians would think of that as a chord or arpeggio rather than a scale. (It’s a diminished seventh chord. And it would be more common to write the A as a B♭♭.)

We might try to put some restriction on the definition of a scale so that the harmonic minor scale is included and the diminished seventh arpeggio is excluded. Here’s what I settled on. For the purposes of this post, I’ll say that a scale is an ascending sequence of eight notes with two restrictions: the first and last are an octave apart, and no two consecutive notes are more than three half steps apart. This will include modes mentioned above, and the harmonic minor scale, but will exclude the diminished seventh arpeggio. (It also excludes pentatonic scales, which we may or may not want to include.)

One way to enumerate all possible scales would be to start with the chromatic scale and decide which notes to keep. Write out the notes C, C♯, D, … , B, C and write a ‘1’ underneath a note if you want to keep it and a ‘0’ otherwise. We have to start and end on C, so we only need to specify which of the 11 notes in the middle we are keeping. That means we can describe any potential scale as an 11-bit binary number. That’s what I did to carry out an exhaustive search for scales with a little program.

There are 266 scales that meet the criteria listed here. I’ve listed all of them on another page. Some of these scales have names and some don’t. I’ve noted some names as shown below. I imagine there are others that have names that I have not labeled. I’d appreciate your help filling these in.

| Scale number | Notes                 | Name              |
|          693 | C D  E  F# G  A  B  C | Lydian mode       |
|          725 | C D  E  F  G  A  B  C | Major             |
|          726 | C D  E  F  G  A  Bb C | Mixolydian mode   |
|          825 | C D  Eb F# G  Ab B  C | Hungarian minor   |
|          826 | C D  Eb F# G  Ab Bb C | Ukrainian Dorian  |
|          854 | C D  Eb F  G  A  Bb C | Dorian mode       |
|          858 | C D  Eb F  G  Ab Bb C | Natural minor     |
|         1235 | C Db E  F  G  Ab B  C | Double harmonic   |
|         1242 | C Db E  F  G  Ab Bb C | Phrygian dominant |
|         1257 | C Db E  F  Gb Ab B  C | Persian           |
|         1370 | C Db Eb F  G  Ab Bb C | Phrygian mode     |
|         1386 | C Db Eb F  Gb Ab Bb C | Locrian mode      |

Related posts:

Toxic pairs, re-identification, and information theory

Database fields can combine in subtle ways. For example, nationality is not usually enough to identify anyone. Neither is religion. But the combination of nationality and religion can be surprisingly informative.

Information content of nationality

How much information is contained in nationality? That depends on exactly how you define nations versus territories etc., but for this blog post I’ll take this Wikipedia table for my raw data. You can calculate that nationality has entropy of 5.26 bits. That is, on average, nationality is slightly more informative than asking five independent yes/no questions. (See this post for how to calculate information content.)

Entropy measures expected information content. Knowing that someone is from India (population 1.3 billion) carries only 2.50 bits of information. Knowing that someone is from Vatican City (population 800) carries 23.16 bits of information.

One way to reduce the re-identification risk of PII (personally identifiable information) such as nationality is to combine small categories. Suppose we lump all countries with a population under one million into “other.” Then we go from 240 categories down to 160. This hardly makes any difference to the entropy: it drops from 5.26 bits to 5.25 bits. But the information content for the smallest country on the list is now 8.80 bits rather than 23.16.

Information content of religion

What about religion? This is also subtle to define, but again I’ll use Wikipedia for my data. Using these numbers, we get an entropy of 2.65 bits. The largest religion, Christianity, has an information content 1.67 bits. The smallest religion on the list, Rastafari, has an information content of 13.29 bits.

Joint information content

So if nationality carries 5.25 bits of information and religion 2.65 bits, how much information does the combination of nationality and religion carry? At least 5.25 bits, but no more than 5.25 + 2.65 = 7.9 bits on average. For two random variables X and Y, the joint entropy H(X, Y) satisfies

max( H(X), H(Y) ) ≤ H(X, Y) ≤ H(X) + H(Y)

where H(X) and H(Y) are the entropy of X and Y respectively.

Computing the joint entropy exactly would require getting into the joint distribution of nationality and religion. I’d rather not get into this calculation in detail, except to discuss possible toxic pairs. On average, the information content of the combination of nationality and religion is no more than the sum of the information content of each separately. But particular combinations can be highly informative.

For example, there are not a lot of Jews living in predominantly Muslim countries. According to one source, there are at least five Jews living in Iraq. Other sources put the estimate as “less than 10.” (There are zero Jews living in Libya.)

Knowing that someone is a Christian living in Mexico, for example, would not be highly informative. But knowing someone is a Jew living in Iraq would be extremely informative.

More information

Chaos and the beta distribution

Iteration of the quadratic function f(x) = 4x(1-x) is a famous example in chaos theory. Here’s what the first few iterations look like, starting with 1/√3. (There’s nothing special about that starting point; any point that doesn’t iterate to exactly zero will do.)

quadratic iterator first 100 values

The values appear to bounce all over the place. Let’s look at a histogram of the sequence.


Indeed the points do bounce all over the unit interval, though they more often bounce near one of the ends.

Does that distribution look familiar? You might recognize it from Bayesian statistics. It’s a beta distribution. It’s symmetric, so the two beta distribution parameters are equal. There’s a vertical asymptote on each end, so the parameters are less than 1. In fact, it’s a beta(1/2, 1/2) distribution. It comes up, for example, as the Jeffreys prior for Bernoulli trials.

The graph below adds the beta(1/2, 1/2) density to the histogram to show how well it fits.

histogram with beta density

We have to be careful in describing the relation between the iterations and the beta distribution. The iterates are ergodic, not random. The sequence of points does not simulate independent draws from any distribution: points near 0.5 are always sent to points near 1, points near 1 are always sent to points near 0, etc. But in aggregate, the points do fill out the unit interval like samples from a beta distribution. In the limit, the proportion of points in a region equals the probability of that region under a beta(1/2, 1/2) distribution.

Here’s the Python code that produced the graphs above.

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

def quadratic(x):
    return 4*x*(1-x)

N = 100000
x = np.empty(N)

# arbitary irrational starting point
x[0] = 1/np.sqrt(3) 
for i in range(1, N):
    x[i] = quadratic( x[i-1] )

plt.xlabel("iteration index")

t = np.linspace(0, 1, 100)
plt.hist(x, bins=t, normed=True)

plt.plot(t, beta(0.5,0.5).pdf(t), linewidth=3)
plt.legend(["beta(1/2, 1/2)"])

Related posts

Cellular automata with random initial conditions

The previous post looked at a particular cellular automaton, the so-called Rule 90. When started with a single pixel turned on, it draws a Sierpinski triangle. With random starting pixels, it draws a semi-random pattern that retains features like the Sierpinski triangle.

There are only 256 possible elementary cellular automata, so it’s practical to plot them all. I won’t list all the images here—you can find them all here—but I will give a few examples to show the variety of patterns they produce. As in the previous post, we imagine our grid rolled up into a cylinder, i.e. we’ll wrap around if necessary to find pixels diagonally up to the left and right.

rule 8 with random initial conditions
rule 18 with random initial conditions
rule 29 with random initial conditions
rule 30 with random initial conditions
rule 108 with random initial conditions
rule 129 with random initial conditions

As we discussed in the previous post, the number of a rule comes from what value it assigns to each of eight possible cellular states, turned into a binary number. So it’s plausible that binary numbers with more 1’s correspond to more black pixels. This is roughly true, though the graph below shows that the situation is more complex than that.

automata pixel density as a function of 1 bits in rule

Sierpinski triangle strikes again

A couple months ago I wrote about how a simple random process gives rise to the Sierpinski triangle. Draw an equilateral triangle and pick a random point in the plane. Repeatedly pick a triangle vertex at random and move half way from the current position to that vertex. The result converges to a Sierpinksi triangle. This post will show another way to arrive at the same pattern using cellular automata.

Imagine an infinite grid of graph paper and fill in a few squares in one row. The squares in the subsequent rows are filled in or not depending on the state of the square’s upstairs neighbors. For elementary cellular automata, the state of a square depends on the square directly above and the two squares diagonally above on each side. In matrix notation, the state of the (ij) element depends on elements (i-1, j-1), (i-1, j), and (i-1, j+1).

There are 256 possible elementary cellular automata, and here’s how you can number them. The states of three consecutive cells determine a three-bit binary number. An automaton is determined by what bit it assigned to each of the eight possible three-bit states, so an automaton corresponds to an 8-bit number. In this post we’re interested in Rule 90. In binary, 90 is written 01011010, and the table below spells out the rule in detail.

000 -> 0
001 -> 1
010 -> 0
011 -> 1
100 -> 1
101 -> 0
110 -> 1
111 -> 0

If we start with a single square filled in (i.e. set to 1) then this is the graph we get, i.e. the Sierpenski triangle:

Rule 90 with one initial bit set

This pattern depends critically on our initial conditions. Roughly speaking, it seems that if you start with regular initial conditions you’ll get regular results. If you start with random initial conditions, you’ll get random-looking results as shown below.


Rule 90 with random initial conditions

We see the same empty triangles as before, but they’re much smaller and appear scattered throughout.

In order to create a rectangular image, I wrapped the edges: the upper left neighbor of a point on the left edge is the right-most square on the row above, and similarly for the right edge. You could think of this as wrapping our graph paper into a cylinder.


A cryptographically secure random number generator

A random number generator can have excellent statistical properties and yet not be suited for use in cryptography. I’ve written a few posts to demonstrate this. For example, this post shows how to discover the seed of an LCG random number generator.

This is not possible with a secure random number generator. Or more precisely, it is not practical. It may be theoretically possible, but doing so requires solving a problem currently believed to be extremely time-consuming. (Lots of weasel words here. That’s the nature of cryptography. Security often depends on the assumption that a problem is as hard to solve as experts generally believe it is.)

Blum Blum Shub algorithm

The Blum Blum Shub algorithm for generating random bits rests on the assumption that a certain number theory problem, the quadratic residuosity problem, is hard to solve. The algorithm is simple. Let M = pq where p and q are large primes, both congruent to 3 mod 4. Pick a seed x0 between 1 and M and relatively prime to M. Now for n > 0, set

xn+1 = xn² mod M

and return the least significant bit of xn+1. (Yes, that’s a lot of work for just one bit. If you don’t need cryptographic security, there are much faster random number generators.)

Python implementation

Here’s some Python code to illustrate using the generator. The code is intended to be clear, not efficient.

Given two large (not necessarily prime) numbers x and y, the code below finds primes p and q for the algorithm and checks that the seed is OK to use.

    import sympy

    # super secret large numbers
    x = 3*10**200
    y = 4*10**200
    seed = 5*10**300

    def next_usable_prime(x):
        # Find the next prime congruent to 3 mod 4 following x.
        p = sympy.nextprime(x)
        while (p % 4 != 3):
            p = sympy.nextprime(p)
        return p

    p = next_usable_prime(x)
    q = next_usable_prime(y)
    M = p*q

    assert(1 < seed < M)
    assert(seed % p != 0)
    assert(seed % q != 0)

There’s a little bit of a chicken-and-egg problem here: how do you pick x, y, and seed? Well, you could use a cryptographically secure random number generator ….

Now let’s generate a long string of bits:

# Number of random numbers to generate
N = 100000     

x = seed
bit_string = ""
for _ in range(N):
    x = x*x % M
    b = x % 2
    bit_string += str(b)


I did not test the output thoroughly; I didn’t use anything like DIEHARDER or PractRand as in previous posts, but just ran a couple simple tests described here.

First I look at the balance of 0’s and 1’s.

    Number of 1's: 50171
    Expected: 49683.7 to 50316.2

Then the longest run. (See this post for a discussion of expected run length.)

    Run length: 16
    Expected: 12.7 to 18.5

Nothing unusual here.

The Blums

The first two authors of Blum Blum Shub are Lenore and Manuel Blum. The third author is Michael Shub.

I had a chance to meet the Blums at the Heidelberg Laureate Forum in 2014. Manuel Blum gave a talk that year on mental cryptography that I blogged about here and followed up here. He and his wife Lenore were very pleasant to talk with.