Cellular automata with random initial conditions

The previous post looked at a particular cellular automaton, the so-called Rule 90. When started with a single pixel turned on, it draws a Sierpinski triangle. With random starting pixels, it draws a semi-random pattern that retains features like the Sierpinski triangle.

There are only 256 possible elementary cellular automata, so it’s practical to plot them all. I won’t list all the images here—you can find them all here—but I will give a few examples to show the variety of patterns they produce. As in the previous post, we imagine our grid rolled up into a cylinder, i.e. we’ll wrap around if necessary to find pixels diagonally up to the left and right.

rule 8 with random initial conditions
rule 18 with random initial conditions
rule 29 with random initial conditions
rule 30 with random initial conditions
rule 108 with random initial conditions
rule 129 with random initial conditions

As we discussed in the previous post, the number of a rule comes from what value it assigns to each of eight possible cellular states, turned into a binary number. So it’s plausible that binary numbers with more 1’s correspond to more black pixels. This is roughly true, though the graph below shows that the situation is more complex than that.

automata pixel density as a function of 1 bits in rule

Adding Laplace or Gaussian noise to database for privacy

pixelated face

In the previous two posts we looked at a randomization scheme for protecting the privacy of a binary response. This post will look briefly at adding noise to continuous or unbounded data. I like to keep the posts here fairly short, but this topic is fairly technical. To keep it short I’ll omit some of the details and give more of an intuitive overview.

Differential privacy

The idea of differential privacy is to guarantee bounds on how much information may be revealed by someone’s participation in a database. These bounds are described by two numbers, ε (epsilon) and δ (delta). We’re primarily interested in the multiplicative bound described by ε. This number is roughly the number of bits of information an analyst might gain regarding an individual. (The multiplicative bound is exp(ε) and so ε, the natural log of the multiplicative bound, would be the information measure, though technically in nats rather than bits since we’re using natural logs rather than logs base 2.)

The δ term is added to the multiplicative bound. Ideally δ is 0, that is, we’d prefer (ε, 0)-differential privacy, but sometimes we have to settle for (ε, δ)-differential privacy. Roughly speaking, the δ term represents the possibility that a few individuals may stand to lose more privacy than the rest, that the multiplicative bound doesn’t apply to everyone. If δ is very small, this risk is very small.

Laplace mechanism

The Laplace distribution is also known as the double exponential distribution because its distribution function looks like the exponential distribution function with a copy reflected about the y-axis; these two exponential curves join at the origin to create a sort of circus tent shape. The absolute value of a Laplace random variable is an exponential random variable.

Why are we interested this particular distribution? Because we’re interested in multiplicative bounds, and so it’s not too surprising that exponential distributions might make the calculations work out because of the way the exponential scales multiplicatively.

The Laplace mechanism adds Laplacian-distributed noise to a function. If Δf is the sensitivity of a function f, a measure of how revealing the function might be, then adding Laplace noise with scale Δf/ε preserves (ε 0)-differential privacy.

Technically, Δf is the l1 sensitivity. We need this detail because the results for Gaussian noise involve l2 sensitivity. This is just a matter of whether we measure sensitivity by the l1 (sum of absolute values) norm or the l2 (root sum of squares) norm.

Gaussian mechanism

The Gaussian mechanism protects privacy by adding randomness with a more familiar normal (Gaussian) distribution. Here the results are a little messier. Let ε be strictly between 0 and 1 and pick δ > 0. Then the Gaussian mechanism is (ε, δ)-differentially private provided the scale of the Gaussian noise satisfies

\sigma \geq \sqrt{2 \log(1.25/\delta)}\,\, \frac{\Delta_2 f}{\varepsilon}

It’s not surprising that the l2 norm appears in this context since the normal distribution and l2 norm are closely related. It’s also not surprising that a δ term appears: the Laplace distribution is ideally suited to multiplicative bounds but the normal distribution is not.

Related

Previous related posts:

Quantifying the information content of personal data

It can be surprisingly easy to identify someone from data that’s not directly identifiable. One commonly cited result is that the combination of birth date, zip code, and sex is enough to identify most people. This post will look at how to quantify the amount of information contained in such data.

If the answer to a question has probability p, then it contains -log2 p bits of information. Knowing someone’s sex gives you 1 bit of information because -log2(1/2) = 1.

Knowing whether someone can roll their tongue could give you more or less information than knowing their sex. Estimates vary, but say 75% can roll their tongue. Then knowing that someone can roll their tongue gives you 0.415 bits of information, but knowing that they cannot roll their tongue gives you 2 bits of information.

On average, knowing someone’s tongue rolling ability gives you less information than knowing their sex. The average amount of information, or entropy, is

0.75(-log2 0.75) + 0.25(-log2 0.25) = 0.81.

Entropy is maximized when all outcomes are equally likely. But for identifiability, we’re concerned with maximum information as well as average information.

Knowing someone’s zip code gives you a variable amount of information, less for densely populated zip codes and more for sparsely populated zip codes. An average zip code contains about 7,500 people. If we assume a US population of 326,000,000, this means a typical zip code would give us about 15.4 bits of information.

The Safe Harbor provisions of US HIPAA regulations let you use the first three digits of someone’s zip code except when this would represent less than 20,000 people, as it would in several sparsely populated areas. Knowing that an American lives in a region of 20,000 people would give you 14 bits of information about that person.

Birth dates are complicated because age distribution is uneven. Knowing that someone’s birth date was over a century ago is highly informative, much more so than knowing it was a couple decades ago. That’s why the Safe Harbor provisions do not allow including age, much less birth date, for people over 90.

Birthdays are simpler than birth dates. Birthdays are not perfectly evenly distributed throughout the year, but they’re close enough for our purposes. If we ignore leap years, a birthday contains -log2(1/365) or about 8.5 bits of information. If we consider leap years, knowing someone was born on a leap day gives us two extra bits of information.

Independent information is additive. I don’t expect there’s much correlation between sex, geographical region, and birthday, so you could add up the bits from each of these information sources. So if you know someone’s sex, their zip code (assuming 7,500 people), and their birthday (not a leap day), then you have 25 bits of information, which may be enough to identify them.

This post didn’t consider correlated information. For example, suppose you know someone’s zip code and primary language. Those two pieces of information together don’t provide as much information as the sum of the information they provide separately because language and location are correlated. I may discuss the information content of correlated information in a future post.

RelatedHIPAA de-identification

Negative correlation introduced by success

Suppose you measure people on two independent attributes, X and Y, and take those for whom X+Y is above some threshold. Then even though X and Y are uncorrelated in the full population, they will be negatively correlated in your sample.

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. As the sum of the two attributes increases, the correlation between the two attributes becomes more negative.

from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [ corr( x[z > low], y[z > low] ) for low in span ]

plt.plot( span, c )
plt.xlabel( "minimum sum" )
plt.ylabel( "correlation coefficient" )
plt.show()

Width of mixture PDFs

Suppose you draw samples from two populations, one of which has a wider probability distribution than the other. How does the width of the distribution of the combined sample vary as you change the proportions of the two populations?

The extremes are easy. If you pick only from one population, then the resulting distribution will be exactly as wide as the distribution of that population. But what about in the middle? If you pick from both populations with equal probability, will the width resulting distribution be approximately the average of the widths of the two populations separately?

To make things more specific, we’ll draw from two populations: Cauchy and normal. With probability η we will sample from a Cauchy distribution with scale γ and with probability (1-η) we will sample from a normal distribution with scale σ. The resulting combined distribution is a mixture, known in spectroscopy as a pseudo-Voigt distribution. In that field, the Cauchy distribution is usually called the Lorentz distribution.

(Why”pseudo”? A Voigt random variable is the sum of a Cauchy and a normal random variable. Its PDF is a convolution of a Cauchy and a normal PDF. A pseudo-Voigt random variable is the mixture of a Cauchy and a normal random variable. Its PDF is a convex combination of the PDFs of a Cauchy and a normal PDF. In fact, the convex combination coefficients are η and (1-η) mentioned above. Convex combinations are easier to work with than convolutions, at least in some contexts, and the pseudo-Voigt distribution is a convenient approximation to the Voigt distribution.)

We will measure the width of distributions by full width at half maximum (FWHM). In other words, we’ll measure how far apart the two points are where the distribution takes on half of its maximum value.

It’s not hard to calculate that the FWHM for a Cauchy distribution with scale 2γ and the FWHM for a normal distribution with scale σ is 2 √(2 log 2) σ.

If we have a convex combination of Cauchy and normal distributions, we’d expect the FWHM to be at least roughly the same convex combination of the FWHM of the separate distributions, i.e. we’d expect the FWHM of our mixture to be

2ηγ + 2(1 – η)√(2 log 2)σ.

How close is that guess to being correct? It has to be exactly correct when η is 0 or 1, but how well does it do in the middle? Here are a few experiments fixing γ = 1 and varying σ.

Testing RNGs with PractRand

PractRand is a random number generator test suite, somewhat like the DIEHARDER and NIST tests I’ve written about before, but more demanding.

Rather than running to completion, it runs until it a test fails with an infinitesimally small p-value. It runs all tests at a given sample size, then doubles the sample and runs the tests again.

32-bit generators

LCG

A while back I wrote about looking for an RNG that would fail the NIST test suite and being surprised that a simple LCG (linear congruential generator) did fairly well. PractRand, however, dismisses this generator with extreme prejudice:

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x4a992b2c
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x4a992b2c
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+0,13-3,T) R=+115128 p = 0 FAIL !!!!!!!!
BCFN(2+1,13-3,T) R=+105892 p = 0 FAIL !!!!!!!!
...
[Low1/32]FPF-14+6/16:(8,14-9) R= +25.8 p = 1.5e-16 FAIL
[Low1/32]FPF-14+6/16:(9,14-10) R= +15.5 p = 8.2e-9 very suspicious
[Low1/32]FPF-14+6/16:(10,14-11) R= +12.9 p = 1.8e-6 unusual
[Low1/32]FPF-14+6/16:all R=+380.4 p = 8.2e-337 FAIL !!!!!!!
[Low1/32]FPF-14+6/16:all2 R=+33954 p = 0 FAIL !!!!!!!!
[Low1/32]FPF-14+6/16:cross R= +33.6 p = 6.4e-25 FAIL !!
...and 9 test result(s) without anomalies

I don’t recall the last time I saw a p-value of exactly zero. Presumably the p-value was so small that it underflowed to zero.

MWC

Another 32 bit generator, George Marsaglia’s MWC generator, also fails, but not as spectacularly as LCG.

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x187edf12
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x187edf12
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
DC6-9x1Bytes-1 R= +6.3 p = 2.2e-3 unusual
Gap-16:A R= +33.3 p = 1.6e-28 FAIL !!!
Gap-16:B R= +13.6 p = 7.9e-12 FAIL
...and 105 test result(s) without anomalies

64-bit generators

Next let’s see how some well-regarded 64-bit random number generators do. We’ll look at xorshift128+ and xoroshir0128+ by Sebastiano Vigna and David Blackman, the famous Mersenne Twister, and PCG by Melissa O’Neill.

The numbers generated by xhoroshir0128+ and xorshift128+ are not random in the least significant bit and so the PractRand tests end quickly. The authors claim that xoroshiro128+ “passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests.” I’ve only run PractRand with its default settings; I have not tried getting it to keep running the rest of the tests.

A lack of randomness in the least significant bits is inconsequential if you’re converting the outputs to floating point numbers, say as part of the process of generating Gaussian random values. For other uses, such as reducing the outputs modulo a small number, a lack of randomness in the least significant bits would be disastrous. (Here numerical analysis and number theory have opposite ideas regarding which parts of a number are most “significant.”)

At the time of writing, both Mersenne Twister and PCG have gone through 256 GB of generated values and are still running. I intend to add more results tomorrow.

Update: Mersenne Twister failed a couple of tests with 512 GB of input. I stopped the tests after PCG passed 16 TB.

xoroshiro128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xe15bf63c
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xe15bf63c
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

xorshift128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x8c7c5173
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x8c7c5173
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

Mersenne Twister

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x300dab94
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x300dab94
length= 256 megabytes (2^28 bytes), time= 3.7 seconds
no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 megabytes (2^29 bytes), time= 7.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+2,13-2,T) R= -7.0 p =1-1.2e-3 unusual
[Low1/64]BCFN(2+2,13-6,T) R= -5.7 p =1-1.0e-3 unusual
...and 167 test result(s) without anomalies

...

rng=RNG_stdin64, seed=0x300dab94
length= 256 gigabytes (2^38 bytes), time= 3857 seconds
  no anomalies in 265 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 gigabytes (2^39 bytes), time= 8142 seconds
 Test Name Raw Processed Evaluation
 BRank(12):24K(1) R=+99759 p~= 0 FAIL !!!!!!!!
 [Low16/64]BRank(12):16K(1) R= +1165 p~= 1.3e-351 FAIL !!!!!!!
 ...and 274 test result(s) without anomalies

PCG

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x82f88431
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x82f88431
length= 128 megabytes (2^27 bytes), time= 2.0 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]DC6-9x1Bytes-1           R=  +6.6  p =  1.6e-3   unusual
  ...and 147 test result(s) without anomalies

rng=RNG_stdin64, seed=0x82f88431
length= 256 megabytes (2^28 bytes), time= 4.7 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x82f88431
length= 512 megabytes (2^29 bytes), time= 9.5 seconds
  no anomalies in 169 test result(s)

...

rng=RNG_stdin64, seed=0x82f88431
length= 16 terabytes (2^44 bytes), time= 254943 seconds
  no anomalies in 329 test result(s)

Random walk on quaternions

The previous post was a riff on a tweet asking what you’d get if you extracted all the i‘s, j‘s, and k‘s from Finnegans Wake and multiplied them as quaternions. This post is a probabilistic variation on the previous one.

If you randomly select a piece of English prose, extract the i‘s, j‘s, and k‘s, and multiply them together as quaternions, what are you likely to get?

The probability that a letter in this sequence is an i is about 91.5%. There’s a 6.5% chance it’s a k, and a 2% chance it’s a j. (Derived from here.) We’ll assume the probabilities of each letter appearing next are independent.

You could think of the process multiplying all the i‘s, j‘s, and k‘s together as a random walk on the unit quaternions, an example of a Markov chain. Start at 1. At each step, multiply your current state by i with probability 0.915, by j with probability 0.02, and by k with probability 0.065.

After the first step, you’re most likely at i. You could be at j or k, but nowhere else. After the second step, you’re most likely at -1, though you could be anywhere except at 1. For the first several steps you’re much more likely to be at some places than others. But after 50 steps, you’re about equally likely to be at any of the eight possible values.

Bayesian methods at Bletchley Park

From Nick Patterson’s interview on Talking Machines:

GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptoanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an extreme minority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statistics works. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.

Alan Turing

Discrete example of concentration of measure

The previous post looked at a continuous example of concentration of measure. As you move away from a thin band around the equator, the remaining area in the rest of the sphere decreases as an exponential function of the dimension and the distance from the equator. This post will show a very similar result for discrete sequences.

Suppose you have a sequence X1, X2, …, Xn of n random variables that each take on the values {-1, 1} with equal probability. You could think of this as a random walk: you start at 0 and take a sequence of steps either to the left or the right.

Let Sn = X1 + X2 + … + Xn be the sum of the sequence. The expected value of Sn is 0 by symmetry, but it could be as large as n or as small as –n. We want to look at how large |Sn| is likely to be relative to the sequence length n.

Here’s the analytical bound:

P\left( \frac{|S_n|}{n} \geq r\right) \leq 2 \exp\left( - \frac{nr^2}{2}\right )

(If you’re reading this via email, you probably can’t see the equation. Here’s why and how to fix it.)

Here is a simulation in Python to illustrate the bound.

    from random import randint
    import numpy as np
    import matplotlib.pyplot as plt

    n = 400  # random walk length
    N = 10000 # number of repeated walks

    reps = np.empty(N)

    for i in range(N):
        random_walk = [2*randint(0, 1) - 1 for _ in range(n)]
        reps[i] = abs(sum(random_walk)) / n
    
    plt.hist(reps, bins=41)
    plt.xlabel("$|S_n|/n$")
    plt.show()

And here are the results.

DIEHARDER random number generator test results for PCG and MWC

A few days ago I wrote about testing the PCG random number generator using the DIEHARDER test suite. In this post I’ll go into a little more background on this random number generator test suite. I’ll also show that like M. E. O’Neill’s PCG (“permuted congruential generator”), George Marsaglia’s MWC (“multiply with carry”) generator does quite well.

This is not to say that MWC is the best generator for every purpose, but any shortcomings of MWC are not apparent from DIEHARDER. The PCG family of generators, for example, is apparently superior to MWC, but you couldn’t necessarily conclude that from these tests.

Unless your application demands more of a random number generator than these tests do, MWC is probably adequate for your application. I wouldn’t recommend it for cryptography or for high-dimensional integration by darts, but it would be fine for many common applications.

DIEHARDER test suite

George Marsaglia developed the DIEHARD battery of tests in 1995. Physics professor Robert G. Brown later refined and extended Marsaglia’s original test suite to create the DIEHARDER suite. (The name of Marsaglia’s battery of tests was a pun on the Diehard car battery. Brown continued the pun tradition by naming his suite after Die Harder, the sequel to the movie Die Hard.) The DIEHARDER suite is open source. It is designed to be at least as rigorous as the original DIEHARD suite and in some cases more rigorous.

There are 31 distinct kinds of tests in the DIEHARDER suite, but some of these are run multiple times. In total, 114 tests are run.

Marsaglia’s MWC

The main strength of Marsaglia’s MWC algorithm is that it is very short. The heart of the code is only three lines:

    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    return (m_z << 16) + m_w;

You can find a full implementation of a random number generator class based in MWC here.

The heart of PCG is also very short, but a bit more mysterious.

    rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1);
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));

(These are the 64-bit state versions of MWC and PCG. Both have versions based on larger state.)

Because these generators require little code, they’d be relatively easy to step into with a debugger, compared to other RNGs such as Mersenne Twister that require more code and more state.

Test results

Out of the 114 DIEHARDER tests run on MWC, all but three returned a pass, and the rest returned a weak pass.

A few weak passes are to be expected. The difference between pass, weak pass, and failure is whether a p-value falls below a certain threshold. Theory says that ideally p-values would uniformly distributed, and so one would fall outside the region for a strong pass now and then.

Rather than counting strong and weak passes, let’s look at the p-values themselves. We’d expect these to be uniformly distributed. Let’s see if they are.

Here are the p-values reported by the DIEHARDER tests for MWC:

Histogram of p-values for MWC

Here are the corresponding values for PCG:

Histogram of p-values for PCG

Neither test has too many small p-values, no more than we’d expect. This is normally what we’re concerned about. Too many small p-values would indicate that the generated samples are showing behavior that would be rare for truly random input.

But both sets of test results have a surprising number of large p-values. Not sure what to make of that. I suspect it says more about the DIEHARDER test suite than the random number generators being tested.

Update: I went back to look at some results from Mersenne Twister to see if this pattern of large p-values persists there. It does, and in fact the p-values are even more biased toward the high end for Mersenne Twister.

Histogram of Mersenne Twister p-values

One thing I noticed is that the large p-values are disproportionately coming from some of the same tests each time. In particular, the repetitions of thests_serial test have an unexpectedly high number of large p-values for each generator.

The chaos game and the Sierpinski triangle

The chaos game is played as follows. Pick a starting point at random. Then at each subsequent step, pick a triangle vertex at random and move half way from the current position to that vertex.

The result looks like a fractal called the Sierpinski triangle or Sierpinski gasket.

Here’s an example:

Unbiased chaos game results

If the random number generation is biased, the resulting triangle will show it. In the image below, the lower left corner was chosen with probability 1/2, the top with probability 1/3, and the right corner with probability 1/6.

Biased chaos game results

Update: Here’s an animated version that lets you watch the process in action.

animated gif

Here’s Python code to play the chaos game yourself.

from scipy import sqrt, zeros
import matplotlib.pyplot as plt
from random import random, randint

def midpoint(p, q):
    return (0.5*(p[0] + q[0]), 0.5*(p[1] + q[1]))

# Three corners of an equilateral triangle
corner = [(0, 0), (0.5, sqrt(3)/2), (1, 0)]

N = 1000
x = zeros(N)
y = zeros(N)

x[0] = random()
y[0] = random()
for i in range(1, N):
    k = randint(0, 2) # random triangle vertex
    x[i], y[i] = midpoint( corner[k], (x[i-1], y[i-1]) )
    
plt.scatter(x, y)
plt.show()

 

Update 2: Peter Norvig posted some Python code with variations on the game presented here, generalizing a triangle to other shapes. If you try the analogous procedure with a square, you simply get a square filled with random dots.

However, you can get what you might expect, the square analog of the Sierpinski triangle, the product of a Cantor set with itself, if you make a couple modifications. First, pick a side at random, not a corner. Second, move 1/3 of the way toward the chosen side, not 1/2 way.

Here’s what I got with these changes:

Chaos game for a square

Source: Chaos and Fractals

Testing the PCG random number generator

M. E. O’Neill’s PCG family of random number generators looks very promising. It appears to have excellent statistical and cryptographic properties. And it takes remarkably little code to implement. (PCG stands for Permuted Congruential Generator.)

The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out by testing it with the DIEHARDER test suite (Robert G. Brown’s extension of George Marsaglia’s DIEHARD test suite) and the NIST Statistical Test Suite. I used what the generator’s website calls the “minimal C implementation.”

The preprint of the journal article is dated 2015 but apparently hasn’t been published yet.

Update: See the very informative note by the author of PCG in the comments below.

For the NIST test suite, I generated 10,000,000 bits and divided them into 10 streams.

For the DIEHARDER test suite, I generated 800,000,000 unsigned 32-bit integers. (DIEHARDER requires a lot of random numbers as input.)

For both test suites I used the seed (state) 20170707105851 and sequence constant (inc) 42.

The PCG generator did well on all the NIST tests. For every test, at least least 9 out of 10 streams passed. The test authors say you should expect at least 8 out of 10 streams to pass.

Here’s an excerpt from the results. You can find the full results here.

--------------------------------------------------------
 C1  C2  C3 ...  C10  P-VALUE  PROPORTION  STATISTICAL TEST
--------------------------------------------------------
  2   0   2        0  0.213309     10/10   Frequency
  0   0   1        3  0.534146     10/10   BlockFrequency
  3   0   0        0  0.350485     10/10   CumulativeSums
  1   1   0        2  0.350485     10/10   CumulativeSums
  0   2   2        1  0.911413     10/10   Runs
  0   0   1        1  0.534146     10/10   LongestRun
  0   1   2        0  0.739918     10/10   Rank
  0   4   0        0  0.122325     10/10   FFT
  1   0   0        1  0.000439     10/10   NonOverlappingTemplate
  ...              
  2   1   0        0  0.350485      9/10   NonOverlappingTemplate
  0   2   1        0  0.739918     10/10   OverlappingTemplate
  1   1   0        2  0.911413     10/10   Universal
  1   1   0        0  0.017912     10/10   ApproximateEntropy
  1   0   1        1     ----       3/4    RandomExcursions
  ...              
  0   0   0        1     ----       4/4    RandomExcursions
  2   0   0        0     ----       4/4    RandomExcursionsVariant
  ...              
  0   0   3        0     ----       4/4    RandomExcursionsVariant
  1   2   3        0  0.350485      9/10   Serial
  1   1   1        0  0.739918     10/10   Serial
  1   2   0        0  0.911413     10/10   LinearComplexity

...

The DIEHARDER suite has 31 kinds tests, some of which are run many times, making a total of 114 tests. Out of the 114 tests, two returned a weak pass for the PCG input and all the rest passed. A few weak passes are to be expected from running so many tests and so this isn’t a strike against the generator. In fact, it might be suspicious if no tests returned a weak pass.

Here’s an edited version of the results. The full results are here.

#=============================================================================#
        test_name   |ntup| tsamples |psamples|  p-value |Assessment
#=============================================================================#
   diehard_birthdays|   0|       100|     100|0.46682782|  PASSED
      diehard_operm5|   0|   1000000|     100|0.83602120|  PASSED
  diehard_rank_32x32|   0|     40000|     100|0.11092547|  PASSED
    diehard_rank_6x8|   0|    100000|     100|0.78938803|  PASSED
   diehard_bitstream|   0|   2097152|     100|0.81624396|  PASSED
        diehard_opso|   0|   2097152|     100|0.95589325|  PASSED
        diehard_oqso|   0|   2097152|     100|0.86171368|  PASSED
         diehard_dna|   0|   2097152|     100|0.24812341|  PASSED
diehard_count_1s_str|   0|    256000|     100|0.75417270|  PASSED
diehard_count_1s_byt|   0|    256000|     100|0.25725000|  PASSED
 diehard_parking_lot|   0|     12000|     100|0.59288414|  PASSED
    diehard_2dsphere|   2|      8000|     100|0.79652706|  PASSED
    diehard_3dsphere|   3|      4000|     100|0.14978100|  PASSED
     diehard_squeeze|   0|    100000|     100|0.35356584|  PASSED
        diehard_sums|   0|       100|     100|0.04522121|  PASSED
        diehard_runs|   0|    100000|     100|0.39739835|  PASSED
        diehard_runs|   0|    100000|     100|0.99128296|  PASSED
       diehard_craps|   0|    200000|     100|0.64934221|  PASSED
       diehard_craps|   0|    200000|     100|0.27352733|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.10570816|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.00267789|   WEAK
         sts_monobit|   1|    100000|     100|0.98166534|  PASSED
            sts_runs|   2|    100000|     100|0.05017630|  PASSED
          sts_serial|   1|    100000|     100|0.95153782|  PASSED
...
          sts_serial|  16|    100000|     100|0.59342390|  PASSED
         rgb_bitdist|   1|    100000|     100|0.50763759|  PASSED
...
         rgb_bitdist|  12|    100000|     100|0.98576422|  PASSED
rgb_minimum_distance|   2|     10000|    1000|0.23378443|  PASSED
...
rgb_minimum_distance|   5|     10000|    1000|0.13215367|  PASSED
    rgb_permutations|   2|    100000|     100|0.54142546|  PASSED
...
    rgb_permutations|   5|    100000|     100|0.96040216|  PASSED
      rgb_lagged_sum|   0|   1000000|     100|0.66587166|  PASSED
...
      rgb_lagged_sum|  31|   1000000|     100|0.00183752|   WEAK
      rgb_lagged_sum|  32|   1000000|     100|0.13582393|  PASSED
     rgb_kstest_test|   0|     10000|    1000|0.74708548|  PASSED
     dab_bytedistrib|   0|  51200000|       1|0.30789191|  PASSED
             dab_dct| 256|     50000|       1|0.89665788|  PASSED
        dab_filltree|  32|  15000000|       1|0.67278231|  PASSED
        dab_filltree|  32|  15000000|       1|0.35348003|  PASSED
       dab_filltree2|   0|   5000000|       1|0.18749029|  PASSED
       dab_filltree2|   1|   5000000|       1|0.92600020|  PASSED

Simple random number generator does surprisingly well

I was running the NIST statistical test suite recently. I wanted an example of a random number generator where the tests failed, and so I used a simple generator, a linear congruence generator. But to my surprise, the generator passed nearly all the tests, even though some more sophisticated generators failed some of the same tests.

This post will implement a couple of the simplest tests in Python and show that the generator does surprisingly well.

The linear congruential generator used here starts with an arbitrary seed, then at each step produces a new number by multiplying the previous number by a constant and taking the remainder by 231 – 1. The multiplier constant was chosen to be one of the multipliers recommended in [1].

We’ll need a couple math functions:

from math import sqrt, log

and we need to define the constants for our generator.

# Linear congruence generator (LCG) constants
z = 20170705   # seed
a = 742938285  # multiplier
e = 31         # will need this later
m = 2**e -1    # modulus

Next we form a long string of 0’s and 1’s using our generator

# Number of random numbers to generate
N = 100000     

# Format to print bits, padding with 0's on the left if needed
formatstr = "0" + str(e) + "b"

bit_string = ""
for _ in range(N):
    z = a*z % m # LCG
    bit_string += format(z, formatstr)

Next we run a couple tests. First, we count the number of 1’s in our string of bits. We expect about half the bits to be 1’s. We can quantify “about” as within two standard deviations.

def count_ones(string):
    ones = 0
    for i in range(len(string)):
        if string[i] == '1':
            ones += 1
    return ones

ones = count_ones(bit_string)
expected = e*N/2
sd = sqrt(0.25*N)
print( "Number of 1's: {}".format(ones) )
print( "Expected: {} to {}".format(expected - 2*sd, expected + 2*sd) )

The results are nothing unusual:

Number of 1's: 1550199
Expected: 1549683.8 to 1550316.2

Next we look at the length of the longest runs on 1’s. I’ve written before about the probability of long runs and the code below uses a couple results from that post.

def runs(string):
    max_run = 0
    current_run = 0
    for i in range(len(string)):
        if string[i] == '1':
            current_run += 1
        else:
            current_run = 0
        max_run = max(max_run, current_run)
    return max_run

runlength = runs(bit_string)
expected = -log(0.5*e*N)/log(0.5)
sd = 1/log(2)
print( "Run length: {}".format(runlength) )
print( "Expected: {} to {}".format(expected - 2*sd, expected + 2*sd) )

Again the results are nothing unusual:

Run length: 19
Expected: 17.7 to 23.4

Simple random number generators are adequate for many uses. Some applications, such as high dimensional integration and cryptography, require more sophisticated generators, but sometimes its convenient and sufficient to use something simple. For example, code using the LCG generator above would be easier to debug than code using the Mersenne Twister. The entire state of the LCG is a single number, whereas the Mersenne Twister maintains an internal state of 312 numbers.

One obvious limitation of the LCG used here is that it couldn’t possibly produce more than  231 – 1 values before repeating itself. Since the state only depends on the last value, every time it comes to a given output, the next output will be whatever the next output was the previous time. In fact, [1] shows that it does produce 231 – 1 values before cycling. If the multiplier were not chosen carefully it could have a shorter period.

So our LCG has a period of about two billion values. That’s a lot if you’re writing a little game, for example. But it’s not enough for many scientific applications.

* * *

[1] George S. Fishman and Louis R. Moore III, An exhaustive analysis of multiplicative congruential random number generators with modulus 231 – 1, SIAM Journal of Scientific and Statistical Computing, Vol. 7, no. 1, January 1986.

Effective sample size for MCMC

In applications we’d like to draw independent random samples from complicated probability distributions, often the posterior distribution on parameters in a Bayesian analysis. Most of the time this is impractical.

MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a catch: the samples are not independent. This lack of independence means that all the familiar theory on convergence of sums of random variables goes out the window.

There’s not much theory to guide assessing the convergence of sums of MCMC samples, but there are heuristics. One of these is effective sample size (ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples. You might want to say, for example, that 1,000 samples from a certain Markov chain are worth about as much as 80 independent samples because the MCMC samples are highly correlated. Or you might want to say that 1,000 samples from a different Markov chain are worth about as much as 300 independent samples because although the MCMC samples are dependent, they’re weakly correlated.

Here’s the definition of ESS:

\mbox{ESS} = \frac{n}{1 + 2\sum_{k=1}^\infty \rho(k)}

where n is the number of samples and ρ(k) is the correlation at lag k.

This behaves well in the extremes. If your samples are independent, your effective samples size equals the actual sample size. If the correlation at lag k decreases extremely slowly, so slowly that the sum in the denominator diverges, your effective sample size is zero.

Any reasonable Markov chain is between the extremes. Zero lag correlation is too much to hope for, but ideally the correlations die off fast enough that the sum in the denominator not only converges but also isn’t a terribly large value.

I’m not sure who first proposed this definition of ESS. There’s a reference to it in Handbook of Markov Chain Monte Carlo where the authors cite a paper [1] in which Radford Neal mentions it. Neal cites B. D. Ripley [2].

Click to learn more about Bayesian statistics consulting

 

[1] Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Robert E. Kass, Bradley P. Carlin, Andrew Gelman and Radford M. Neal. The American Statistician. Vol. 52, No. 2 (May, 1998), pp. 93-100

[2] Stochlastic Simulation, B. D. Ripley, 1987.

Why do linear prediction confidence regions flare out?

Suppose you’re tracking some object based on its initial position x0 and initial velocity v0. The initial position and initial velocity are estimated from normal distributions with standard deviations σx and σv. (To keep things simple, let’s assume our object is moving in only one dimension and that the distributions around initial position and velocity are independent.)

The confidence region for the object flares out over time, something like the bell of a trumpet.

Why does the region get larger? Because there’s uncertainty in the velocity, and the velocity gets multiplied by elapsed time.

Why isn’t the confidence region a cone? Because that would ignore the uncertainty in the initial position. The result would be too small.

Why isn’t the confidence region a truncated cone? That’s not a bad approximation, though it’s a bit too large. If we ignore probability for a moment and treat confidence intervals as deterministic limits, then we get a truncated cone. For example, suppose assume position and velocity are each within two standard deviations of their estimates. Then we’d estimate position to be between x0 – 2σx + (v0 – 2σv) t on the low end and x0 + 2σx + (v0 + 2σv) t on the high end. This is only an approximation because we’ve ignored probability, and it’s pessimistic because it assumes extreme error values for both estimates at the same time.

So what is the confidence region? It’s some where between the cone and the truncated cone.

The position xt v is the sum of two random variables. The first has variance σx² and the second has variance t² σv². Variances of independent random variables add, so the standard deviation for the sum is

√(σx² + t² σv²) = t √(σx² / t² + σv²)

Note that as t increases, the latter approaches t σv from above. Ignoring the uncertainty in initial position underestimates standard deviation, but the relative error decreases as t increases.

For large t, a confidence interval for position at time t is approximately proportional to t, so the width of the confidence intervals over time look like a cone. But from small t, the dependence on t is less linear and more curved.