Adding Laplace or Gaussian noise to a database for privacy


In the previous two posts we looked at a randomization scheme for protecting the privacy of a binary response. This post will look briefly at adding noise to continuous or unbounded data. I like to keep the posts here fairly short, but this topic is rather technical. To keep the post short I’ll omit some of the details and give more of an intuitive overview.

Differential privacy

The idea of differential privacy is to guarantee bounds on how much information may be revealed by someone’s participation in a database. These bounds are described by two numbers, ε (epsilon) and δ (delta). We’re primarily interested in the multiplicative bound described by ε. This number is roughly the number of bits of information an analyst might gain regarding an individual. (The multiplicative bound is exp(ε) and so ε, the natural log of the multiplicative bound, would be the information measure, though technically in nats rather than bits since we’re using natural logs rather than logs base 2.)

The δ term is added to the multiplicative bound. Ideally δ is 0, that is, we’d prefer (ε, 0)-differential privacy, but sometimes we have to settle for (ε, δ)-differential privacy. Roughly speaking, the δ term represents the possibility that a few individuals may stand to lose more privacy than the rest, that the multiplicative bound doesn’t apply to everyone. If δ is very small, this risk is very small.

Laplace mechanism

The Laplace distribution is also known as the double exponential distribution because its density looks like the exponential density with a copy reflected about the y-axis; these two exponential curves join at the origin to create a sort of circus tent shape. The absolute value of a Laplace random variable is an exponential random variable.

Why are we interested in this particular distribution? Because we’re interested in multiplicative bounds, and so it’s not too surprising that exponential distributions make the calculations work out, because of the way the exponential function scales multiplicatively.

The Laplace mechanism adds Laplacian-distributed noise to a function. If Δf is the sensitivity of a function f, a measure of how revealing the function might be, then adding Laplace noise with scale Δf/ε preserves (ε, 0)-differential privacy.

Technically, Δf is the l1 sensitivity. We need this detail because the results for Gaussian noise involve l2 sensitivity. This is just a matter of whether we measure sensitivity by the l1 (sum of absolute values) norm or the l2 (root sum of squares) norm.
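To make the mechanism concrete, here is a minimal sketch in Python (using NumPy) of what adding Laplace noise to a counting query might look like. The function and parameter names are illustrative, not from any particular privacy library, and a counting query is assumed to have ℓ1 sensitivity 1.

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
        # Add Laplace noise with scale sensitivity/epsilon, which preserves
        # (epsilon, 0)-differential privacy when `sensitivity` is the l1 sensitivity.
        rng = np.random.default_rng() if rng is None else rng
        return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # A counting query ("how many records satisfy X?") has sensitivity 1,
    # since one person's participation changes the count by at most 1.
    print(laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.1))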

Gaussian mechanism

The Gaussian mechanism protects privacy by adding randomness with a more familiar normal (Gaussian) distribution. Here the results are a little messier. Let ε be strictly between 0 and 1 and pick δ > 0. Then the Gaussian mechanism is (ε, δ)-differentially private provided the scale of the Gaussian noise satisfies

\sigma \geq \sqrt{2 \log(1.25/\delta)}\,\, \frac{\Delta_2 f}{\varepsilon}

It’s not surprising that the l2 norm appears in this context since the normal distribution and l2 norm are closely related. It’s also not surprising that a δ term appears: the Laplace distribution is ideally suited to multiplicative bounds but the normal distribution is not.
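A corresponding sketch for the Gaussian mechanism, using the bound above; again the names are illustrative and this is not production privacy code.

    import numpy as np

    def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta, rng=None):
        # Requires 0 < epsilon < 1 and delta > 0; the noise scale comes from the bound above.
        rng = np.random.default_rng() if rng is None else rng
        sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
        return true_value + rng.normal(loc=0.0, scale=sigma)

    print(gaussian_mechanism(true_value=42, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5))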


Quantifying privacy loss in a statistical database


In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off.

More general randomized response

Suppose we have a binary response to some question as a field in our database. With probability t we leave the value alone. Otherwise we replace the answer with the result of a fair coin toss. In the previous post, what we now call t was implicitly equal to 1/2. The value recorded in the database could have come from a coin toss and so the value is not definitive. And yet it does contain some information. The posterior probability that the original answer was 1 (“yes”) is higher if a 1 is recorded. We did this calculation for t = 1/2 last time, and here we’ll look at the result for general t.

If t = 0, the recorded result is always random. The field contains no private information, but it is also statistically useless. At the opposite extreme, t = 1, the recorded result is pure private information and statistically useful. The closer t is to 0, the more privacy we have, and the closer t is to 1, the more useful the data is. We’ll quantify this privacy/utility trade-off below.

Privacy loss

You can go through an exercise in applying Bayes’ theorem as in the previous post to show that the probability that the original response is 1, given that the recorded response is 1, is

\frac{(t+1) p}{2tp - t + 1}

where p is the overall probability of a true response of 1.

The privacy loss associated with an observation of 1 is the gain in information due to that observation. Before knowing that a particular response was 1, our estimate that the true response was 1 would be p; not having any individual data, we use the group mean. But after observing a recorded response of 1, the posterior probability is the expression above. The information gain is the log base 2 of the ratio of these values:

\log_2 \left( \frac{(t+1) p}{2tp - t + 1} \middle/ \ p \right) = \log_2\left( \frac{(t+1)}{2tp - t + 1} \right)

When t = 0, the privacy loss is 0. When t = 1, the loss is -log2(p) bits, i.e. the entire information contained in the response. When t = 1/2, the loss is log2(3/(2p + 1)) bits.
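As a sanity check on the formula, here’s a short script that computes the posterior probability and the corresponding information gain for a few values of t, assuming for illustration that p = 0.3.

    from math import log2

    def posterior(p, t):
        # P(true response = 1 | recorded response = 1)
        return (t + 1) * p / (2 * t * p - t + 1)

    def privacy_loss_bits(p, t):
        # Information gain, in bits, from observing a recorded 1
        return log2(posterior(p, t) / p)

    p = 0.3  # assumed overall rate of true "yes" responses
    for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
        print(t, posterior(p, t), privacy_loss_bits(p, t))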

Privacy / utility trade-off

We’ve looked at the privacy cost of setting t to various values. What are the statistical costs? Why not make t as small as possible? Well, 0 is a possible value of t, corresponding to complete loss of statistical utility. So we’d expect that small positive values of t make it harder to estimate p.

Each recorded response is a 1 with probability tp + (1 – t)/2. Suppose there are N database records and let S be the sum of the recorded values. Then our estimator for p is

\hat{p} = \frac{\frac{S}{N} - \frac{1-t}{2}}{t}

The variance of this estimator is proportional to 1/t², and so the width of our confidence intervals for p is proportional to 1/t. Note that the larger N is, the smaller we can afford to make t.
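Here’s a quick simulation sketch, with made-up values of p and N, showing the estimator at work and how its accuracy degrades as t shrinks.

    import numpy as np

    def randomized_response(true_values, t, rng):
        # With probability t keep the true value, otherwise record a fair coin toss
        keep = rng.random(len(true_values)) < t
        coin = rng.integers(0, 2, len(true_values))
        return np.where(keep, true_values, coin)

    def estimate_p(recorded, t):
        return (recorded.mean() - (1 - t) / 2) / t

    rng = np.random.default_rng(20170816)
    p, N = 0.3, 100_000
    true_values = (rng.random(N) < p).astype(int)

    for t in [0.1, 0.5, 0.9]:
        recorded = randomized_response(true_values, t, rng)
        print(t, estimate_p(recorded, t))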


Next up: Adding Laplace or Gaussian noise and differential privacy

Pascal’s triangle and Fermat’s little theorem

I was listening to My Favorite Theorem when Jordan Ellenberg said something in passing about proving Fermat’s little theorem from Pascal’s triangle. I wasn’t familiar with that, and fortunately Evelyn Lamb wasn’t either and so she asked him to explain.

Fermat’s little theorem says that for any prime p, then for any integer a,

aᵖ = a (mod p).

That is, aᵖ and a have the same remainder when you divide by p. Jordan Ellenberg picked the special case of a = 2 as his favorite theorem for the purpose of the podcast. And it’s this special case that can be proved from Pascal’s triangle.

The pth row of Pascal’s triangle consists of the coefficients of (x + y)ᵖ when expanded using the binomial theorem. By setting x = y = 1, you can see that the numbers in the row must add up to 2ᵖ. Also, all the numbers in the row are divisible by p except for the 1’s on each end. So the remainder when 2ᵖ is divided by p is 2.

It’s easy to observe that the numbers in the pth row, except for the ends, are divisible by p. For example, when p = 5, the numbers are 1, 5, 10, 10, 5, 1. When p = 7, the numbers are 1, 7, 21, 35, 35, 21, 7, 1.

To prove this you have to show that the binomial coefficient C(p, k) is divisible by p when 0 < k < p. When you expand the binomial coefficient into factorials, you see that there’s a factor of p on top, and nothing can cancel it out because p is prime and the numbers in the denominator are less than p.
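Here’s a bit of code to check both facts numerically: the interior entries of row p are divisible by p, and 2ᵖ leaves a remainder of 2 when divided by p (for odd primes p).

    from math import comb

    def check_row(p):
        row = [comb(p, k) for k in range(p + 1)]                # row p of Pascal's triangle
        interior_divisible = all(c % p == 0 for c in row[1:-1])
        return interior_divisible, sum(row) % p                 # sum(row) equals 2**p

    for p in [3, 5, 7, 11, 13]:
        print(p, check_row(p))   # prints (True, 2) for each odd prime p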

By the way, you can form an analogous proof for the general case of Fermat’s little theorem by expanding

(1 + 1 + 1 + … + 1)ᵖ

where the sum has a terms, using the multinomial generalization of the binomial theorem.

Making a problem easier by making it harder

In the oral exam for my PhD, my advisor asked me a question about a differential equation. I don’t recall the question, but I remember the interaction that followed.

I was stuck, and my advisor countered by saying “Let me ask you a harder question.” I was still stuck, and so he said “Let me ask you an even harder question.” Then I got it.

By “harder” he meant “more general.” He started with a concrete problem, then made it progressively more abstract until I recognized it. His follow-up questions were logically harder but psychologically easier.

This incident came to mind when I ran across an example in Lawrence Evans’ control theory course notes. He uses the example to illustrate a piece of what he calls mathematical wisdom:

It is sometimes easier to solve a problem by embedding it within a larger class of problems and then solving the larger class all at once.

The problem is to evaluate the integral of the sinc function:

\int_0^\infty \frac{\sin(x)}{x}\, dx

He does so by introducing the more general problem of evaluating the function

I(\alpha) = \int_0^\infty \exp(-\alpha x) \frac{\sin(x)}{x}\, dx

which reduces to the sinc integral when α = 0.

We can find the derivative of I(α) by differentiating under the integral sign and integrating by parts twice.

\begin{aligned}
I'(\alpha) &= \int_0^\infty \frac{\partial}{\partial \alpha} \exp(-\alpha x) \frac{\sin(x)}{x}\, dx \\
&= -\int_0^\infty \exp(-\alpha x) \sin(x) \, dx \\
&= - \frac{1}{1 + \alpha^2}
\end{aligned}

Therefore

I(\alpha) = - \arctan(\alpha) + C

As α goes to infinity, I(α) goes to zero, and so C = π/2 and I(0) = π/2.

Incidentally, note that instead of computing an integral in order to solve a differential equation as one often does, we introduced a differential equation in order to compute an integral.
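As a quick check (not part of the original argument), SymPy evaluates both the sinc integral and the parametrized integral I(α):

    from sympy import symbols, sin, exp, integrate, oo, pi, atan

    x, alpha = symbols('x alpha', positive=True)

    # The sinc integral itself; prints pi/2
    print(integrate(sin(x)/x, (x, 0, oo)))

    # The parametrized integral I(alpha); its value at alpha = 1
    # should match pi/2 - arctan(1) = pi/4
    I_alpha = integrate(exp(-alpha*x)*sin(x)/x, (x, 0, oo))
    print(I_alpha.subs(alpha, 1).evalf(), (pi/2 - atan(1)).evalf())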

Quantifying the information content of personal data

It can be surprisingly easy to identify someone from data that’s not directly identifiable. One commonly cited result is that the combination of birth date, zip code, and sex is enough to identify most people. This post will look at how to quantify the amount of information contained in such data.

If the answer to a question has probability p, then it contains -log2 p bits of information. Knowing someone’s sex gives you 1 bit of information because -log2(1/2) = 1.

Knowing whether someone can roll their tongue could give you more or less information than knowing their sex. Estimates vary, but say 75% can roll their tongue. Then knowing that someone can roll their tongue gives you 0.415 bits of information, but knowing that they cannot roll their tongue gives you 2 bits of information.

On average, knowing someone’s tongue rolling ability gives you less information than knowing their sex. The average amount of information, or entropy, is

0.75(-log2 0.75) + 0.25(-log2 0.25) = 0.81.

Entropy is maximized when all outcomes are equally likely. But for identifiability, we’re concerned with maximum information as well as average information.

Knowing someone’s zip code gives you a variable amount of information, less for densely populated zip codes and more for sparsely populated zip codes. An average zip code contains about 7,500 people. If we assume a US population of 326,000,000, this means a typical zip code would give us about 15.4 bits of information.

The Safe Harbor provisions of US HIPAA regulations let you use the first three digits of someone’s zip code except when this would represent less than 20,000 people, as it would in several sparsely populated areas. Knowing that an American lives in a region of 20,000 people would give you 14 bits of information about that person.

Birth dates are complicated because age distribution is uneven. Knowing that someone’s birth date was over a century ago is highly informative, much more so than knowing it was a couple decades ago. That’s why the Safe Harbor provisions do not allow including age, much less birth date, for people over 90.

Birthdays are simpler than birth dates. Birthdays are not perfectly evenly distributed throughout the year, but they’re close enough for our purposes. If we ignore leap years, a birthday contains -log2(1/365) or about 8.5 bits of information. If we consider leap years, knowing someone was born on a leap day gives us two extra bits of information.

Independent information is additive. I don’t expect there’s much correlation between sex, geographical region, and birthday, so you could add up the bits from each of these information sources. So if you know someone’s sex, their zip code (assuming 7,500 people), and their birthday (not a leap day), then you have 25 bits of information, which may be enough to identify them.
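The arithmetic above is easy to reproduce; here’s a sketch using the same assumed figures (7,500 people per zip code and a population of 326,000,000).

    from math import log2

    us_population = 326_000_000

    bits_sex      = -log2(1/2)                   # 1 bit
    bits_zip      = log2(us_population / 7_500)  # about 15.4 bits
    bits_birthday = -log2(1/365)                 # about 8.5 bits

    print(bits_sex + bits_zip + bits_birthday)   # about 25 bits

    # Entropy of tongue-rolling ability (75% / 25% split)
    print(0.75 * (-log2(0.75)) + 0.25 * (-log2(0.25)))  # about 0.81 bits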

This post didn’t consider correlated information. For example, suppose you know someone’s zip code and primary language. Those two pieces of information together don’t provide as much information as the sum of the information they provide separately because language and location are correlated. I may discuss the information content of correlated information in a future post.

Related: HIPAA de-identification

Highly cited theorems

Some theorems are cited far more often than others. These are not the most striking theorems, not the most advanced or most elegant, but ones that are extraordinarily useful.

I first noticed this when taking complex analysis where the Cauchy integral formula comes up over and over. When I first saw the formula I thought it was surprising, but certainly didn’t think “I bet we’re going to use this all the time.” The Cauchy integral formula was discovered after many of the results that textbooks now prove using it. Mathematicians realized over time that they could organize a class in complex variables more efficiently by proving the Cauchy integral formula as early as possible, then use it to prove much of the rest of the syllabus.

In functional analysis, it’s the Hahn-Banach theorem. This initially unimpressive theorem turns out to be the workhorse of functional analysis. Reading through a book on functional analysis you’ll see “By the Hahn-Banach theorem …” so often that you start to think “Really, that again? What does it have to do here?”

In category theory, it’s the Yoneda lemma. The most common four-word phrase in category theory must be “by the Yoneda lemma.” Not only is it the most cited theorem in category theory, it may be the only highly cited theorem in category theory.

The most cited theorem in machine learning is probably Bayes’ theorem, but I’m not sure Bayes’ theorem looms as large in ML as the previous theorems do in their fields.

Every area of math has theorems that come up more often than others, such as the central limit theorem in probability and the dominated convergence theorem in real analysis, but I can’t think of any theorems that come up as frequently as Hahn-Banach and Yoneda do in their areas.

As with people, there are theorems that attract attention and theorems that get the job done. These categories may overlap, but often they don’t.

 

Width of mixture PDFs

Suppose you draw samples from two populations, one of which has a wider probability distribution than the other. How does the width of the distribution of the combined sample vary as you change the proportions of the two populations?

The extremes are easy. If you pick only from one population, then the resulting distribution will be exactly as wide as the distribution of that population. But what about in the middle? If you pick from both populations with equal probability, will the width of the resulting distribution be approximately the average of the widths of the two populations separately?

To make things more specific, we’ll draw from two populations: Cauchy and normal. With probability η we will sample from a Cauchy distribution with scale γ and with probability (1-η) we will sample from a normal distribution with scale σ. The resulting combined distribution is a mixture, known in spectroscopy as a pseudo-Voigt distribution. In that field, the Cauchy distribution is usually called the Lorentz distribution.

(Why “pseudo”? A Voigt random variable is the sum of a Cauchy and a normal random variable, so its PDF is the convolution of a Cauchy PDF and a normal PDF. A pseudo-Voigt random variable is a mixture of a Cauchy and a normal random variable, so its PDF is a convex combination of a Cauchy PDF and a normal PDF. In fact, the convex combination coefficients are the η and (1-η) mentioned above. Convex combinations are easier to work with than convolutions, at least in some contexts, and the pseudo-Voigt distribution is a convenient approximation to the Voigt distribution.)

We will measure the width of distributions by full width at half maximum (FWHM). In other words, we’ll measure how far apart the two points are where the distribution takes on half of its maximum value.

It’s not hard to calculate that the FWHM for a Cauchy distribution with scale γ is 2γ, and the FWHM for a normal distribution with scale σ is 2 √(2 log 2) σ.

If we have a convex combination of Cauchy and normal distributions, we’d expect the FWHM to be at least roughly the same convex combination of the FWHM of the separate distributions, i.e. we’d expect the FWHM of our mixture to be

2ηγ + 2(1 – η)√(2 log 2)σ.

How close is that guess to being correct? It has to be exactly correct when η is 0 or 1, but how well does it do in the middle? Here are a few experiments fixing γ = 1 and varying σ.
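Here’s a sketch of how such an experiment might go, computing the FWHM of the mixture numerically on a grid (using SciPy’s Cauchy and normal densities) and comparing it to the convex-combination guess:

    import numpy as np
    from scipy.stats import cauchy, norm

    def fwhm(ys, xs):
        # Numerical full width at half maximum of a unimodal density sampled on a grid
        half = ys.max() / 2
        above = xs[ys >= half]
        return above.max() - above.min()

    gamma = 1.0
    xs = np.linspace(-50, 50, 200_001)

    for sigma in [0.5, 1.0, 2.0]:
        for eta in [0.25, 0.5, 0.75]:
            mix = eta * cauchy.pdf(xs, scale=gamma) + (1 - eta) * norm.pdf(xs, scale=sigma)
            guess = 2 * eta * gamma + 2 * (1 - eta) * np.sqrt(2 * np.log(2)) * sigma
            print(sigma, eta, fwhm(mix, xs), guess)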

Defining the Fourier transform on LCA groups

My previous post said that all the familiar variations on Fourier transforms—Fourier series analysis and synthesis, Fourier transforms on the real line, discrete Fourier transforms, etc.—can be unified into a single theory. They’re all instances of a Fourier transform on a locally compact Abelian (LCA) group. The difference between them is the underlying group.

Given an LCA group G, the Fourier transform takes a function on G and returns a function on the dual group of G. We said this much last time, but we didn’t define the dual group; we just stated examples. We also didn’t say just how you define a Fourier transform in this general setting.

Characters and dual groups

Before we can define a dual group, we have to define group homomorphisms. A homomorphism between two groups G and H is a function h between the groups that preserves the group structure. Suppose the group operation is denoted by addition on G and by multiplication on H (as it will be in our application). Then saying that h preserves the group structure means

h(x + y) = h(x) h(y)

for all x and y in G.

Next, let T be the unit circle, i.e. complex numbers with absolute value 1. T is a group with respect to multiplication. (Why T for circle? This is a common notation, anticipating generalization to toruses in all dimensions. A circle is a one-dimensional torus.)

Now a character on G is a continuous homomorphism from G to T. The set of all characters on G is the dual group of G. Call this group Γ. If G is an LCA group, then so is Γ.

Integration

The classical Fourier transform is defined by an integral. To define the Fourier transform on a group we have to have a way to do integration on that group. And there’s a theorem that says we can always do that. For every LCA group, there exists a Haar measure μ, and this measure is nice enough to develop our theory. This measure is essentially unique: Any two Haar measures on the same LCA group must be proportional to each other. In other words, the measure is unique up to multiplying by a constant.

On a discrete group—for our purposes, think of the integers and the integers mod m—Haar measure is just counting; the measure of a set is the number of things in the set. And integration with respect to this measure is summation.

Fourier transform defined

Let f be a function in L¹(G), i.e. an absolutely integrable function on G. Then the Fourier transform of f is a function on Γ defined by

\hat{f}(\gamma) = \int_G f(x)\, \gamma(-x) \, d\mu

What does this have to do with the classical Fourier transform? The classical Fourier transform takes a function of time and returns a function of frequency. The correspondence between the classical Fourier transform and the abstract Fourier transform is to associate the frequency ω with the character that takes x to the value exp(iωx).

There are multiple slightly different conventions for the classical Fourier transform cataloged here. These correspond to different constant multiples in the choice of measure on G and Γ, i.e. whether to divide by or multiply by √(2π), and in the correspondence between frequencies and characters, whether ω corresponds to exp(±iωx) or exp(±2πiωx).
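To make this concrete, here’s a small sketch for the group of integers mod m. The characters are γ_k(x) = exp(2πikx/m), Haar measure is counting measure, and the abstract definition above reproduces the usual DFT convention used by NumPy.

    import numpy as np

    m = 8
    rng = np.random.default_rng(0)
    f = rng.standard_normal(m)     # a function on the group Z_m

    # Abstract Fourier transform: fhat(gamma_k) = sum over x of f(x) gamma_k(-x),
    # where gamma_k(x) = exp(2 pi i k x / m) and the measure is counting measure.
    x = np.arange(m)
    fhat = np.array([np.sum(f * np.exp(-2j * np.pi * k * x / m)) for k in range(m)])

    print(np.allclose(fhat, np.fft.fft(f)))   # True: same as the usual DFT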

Unified theory of Fourier transforms

You can take a periodic function and analyze it into its Fourier coefficients, or use the Fourier coefficients in a sum to synthesize a periodic function. You can take the Fourier transform of a function defined on the whole real line and get another such function. And you can compute the discrete Fourier transform via the FFT algorithm.

Is there a general theory that unifies all these related but different things? Why yes, yes there is.

Groups

Everything in the opening paragraph is simply a Fourier transform, each in a different context. And the contexts correspond to groups. Specifically, locally compact Abelian groups.

Some of these groups are easier to see than others. Clearly the real numbers with addition form a group: the sum of two real numbers is a real number, etc. But where are the groups in the other contexts?

You can think of a periodic function as a function on a circle; the function values have to agree at both ends of an interval, so you might as well think of those two points as the same point, i.e. join them to make a circle. Shifting along an interval, wrapping around if necessary, corresponds to a rotation of the circle, and rotations form a group. So analyzing a periodic function into a set of Fourier coefficients is a Fourier transform on the circle.

You can think of a set of Fourier coefficients as a function on the integers, mapping n to the nth coefficient. Synthesizing a set of Fourier coefficients into a periodic function is a Fourier transform on the group of integers.

What about a discrete Fourier transform (DFT)? If you have a function sampled at m points, you could think of those points as the group of integers mod m. Your sampled points constitute a function on the integers mod m, and the DFT is a Fourier transform on that group.

Note that the DFT is a Fourier transform in its own right. It’s not an approximation per se, though it’s nearly always used as part of an approximation process. You can start with a continuous function, approximate it by a finite set of samples, compute the DFT of these samples, and the result will give you an approximation to the Fourier transform of the original continuous function.

What about functions of several variables? These are functions on groups too. A function of two real variables, for example, is a function on R², which is a group with (vector) addition.

Dual groups

A Fourier transform takes a function defined on a group and returns a function defined on the dual of that group. I go into exactly what a dual group is in my next post, but for now, just note that the Fourier transform takes a function defined on one group and returns a function defined on another group.

The dual of the circle is the integers, and vice versa. That’s why the Fourier transform of a function on the circle is an infinite set of Fourier coefficients, which we think of as a function on the integers. The Fourier transform of the function on the integers, i.e. a set of Fourier coefficients, is a function on the circle, i.e. a periodic function.

The dual group of the real numbers is the real numbers again. That’s why the Fourier transform of a function on the real line is another function on the real line.

The integers mod m is also its own dual group. So the DFT takes a set of m numbers and returns a set of m numbers.

Locally compact Abelian (LCA) groups

What do locally compact and Abelian mean? And why do we make these assumptions?

Let’s start with Abelian. This just means that the group operation is commutative. When we’re adding real numbers, or composing rotations of a circle, these operations are commutative.

Local compactness is a more technical requirement. The circle is compact, and so are the integers mod m. But if we restricted our attention to compact groups, that would leave out the integers and the real numbers. These spaces are not compact, but they’re locally compact, and that’s enough for the theory to go through.

It turns out that LCA groups are a sort of theoretical sweet spot. Assuming groups are LCA is general enough to include the examples we care about the most, but it’s not so general that the theory becomes harder and the results less powerful.

More connections

This post relates Fourier series (analysis and synthesis) to Fourier transforms (on the real line) by saying they’re both special cases of Fourier analysis on LCA groups. There are a couple other ways to connect Fourier series and Fourier transforms.

You can take the Fourier transform (not Fourier series) of a periodic function two ways: by restricting it to one period and defining it to be zero everywhere else, or by letting it repeat forever across the real line and taking the Fourier transform in the sense of generalized functions. You can read more about these two approaches in this post.

Predicting when an RNG will output a given value

A few days ago I wrote about how to pick the seed of a simple random number generator so that a desired output came n values later. The number n was fixed and we varied the seed. In this post, the seed will be fixed and we’ll solve for n. In other words, we ask when a pseudorandom sequence will produce a given value.

In principle you could just run the RNG until you get the output you’re looking for, but we’ll assume such a brute force approach is not feasible or at least not fast enough.

If an LCG (linear congruential generator) has seed z, multiplier a, and modulus m, then the nth output is aⁿz reduced mod m. So our task is to solve

x = aⁿz mod m

for n. If we forget for a moment that we’re working with integers mod m, we see that the solution is

n = logₐ(x / z)

We can actually make this work if we interpret division by z to mean multiplication by the inverse of z mod m and if we interpret the logarithm to be a discrete logarithm. For more on discrete logarithms and one algorithm for computing them, see this post.

In an earlier post I used a = 742938285 and m = 2³¹ – 1 = 2147483647. We set n = 100 and solved for z to make the 100th output equal to 20170816, the date of that post. It turned out that z = 1898888478.

Now let’s set the seed z = 1898888478 and ask when the LCG sequence will take on the value x = 20170816. Of course we know that n will turn out to be 100, but let’s pretend we don’t know that. We’ll write a little Python script to find n.

I expect there’s a simple way to compute modular inverses using SymPy, but I haven’t found it, so I used some code from StackOverflow.

The following code produces n = 100, as expected.

from sympy.ntheory import discrete_log

# Extended Euclidean algorithm: returns (g, x, y) with a*x + b*y = g = gcd(a, b)
def egcd(a, b):
    if a == 0:
        return (b, 0, 1)
    else:
        g, y, x = egcd(b % a, a)
        return (g, x - (b // a) * y, y)

# Modular inverse of a mod m, via the extended Euclidean algorithm
def modinv(a, m):
    g, x, y = egcd(a, m)
    if g != 1:
        raise Exception('modular inverse does not exist')
    else:
        return x % m

a = 742938285
z = 1898888478
m = 2**31 - 1
x = 20170816
zinv = modinv(z, m)

# Solve a**n = x * z^(-1) (mod m) for n, i.e. take a discrete logarithm base a
n = discrete_log(m, x*zinv, a)
print(n)

Reverse engineering the seed of a linear congruential generator

The previous post gave an example of manipulating the seed of a random number generator to produce a desired result. This post will do something similar for a different generator.

A couple times I’ve used the following LCG (linear congruential random number generator) in examples. An LCG starts with an initial value of z and updates z at each step by multiplying by a constant a and taking the remainder by m. The particular LCG I’ve used has a = 742938285 and m = 2³¹ – 1 = 2147483647.

(I used these parameters because they have maximum range, i.e. every positive integer less than m appears eventually, and because it was one of the LCGs recommended in an article years ago. That is, it’s very good as far as 32-bit LCGs go, which isn’t very far. An earlier post shows how it quickly fails the PractRand test suite.)

Let’s pick the seed so that the 100th output of the generator is today’s date in ISO form: 20170816.

We need to solve

a¹⁰⁰z = 20170816 mod 2147483647.

By reducing a¹⁰⁰ mod 2147483647 we find we need to solve

160159497 z = 20170816 mod 2147483647

and the solution is z = 1898888478. (See How to solve linear congruences.)
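For what it’s worth, here’s a sketch of how you might carry out that computation directly in Python, using the three-argument pow for the modular inverse (available in Python 3.8 and later):

    a = 742938285
    m = 2**31 - 1
    x = 20170816

    a100 = pow(a, 100, m)           # a^100 mod m
    z = x * pow(a100, -1, m) % m    # solve a^100 z = x (mod m)
    print(z)                        # 1898888478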

The following Python code verifies that the solution works.

    a = 742938285
    z = 1898888478
    m = 2**31 - 1

    # Iterate the LCG 100 times and print the 100th output
    for _ in range(100):
        z = a*z % m
    print(z)

Update: In this post, I kept n = 100 fixed and solved for the seed to give a specified output n steps later. In a follow up post I do the opposite, fixing the seed and solving for n.

Testing RNGs with PractRand

PractRand is a random number generator test suite, somewhat like the DIEHARDER and NIST tests I’ve written about before, but more demanding.

Rather than running to completion, it runs until a test fails with an infinitesimally small p-value. It runs all tests at a given sample size, then doubles the sample and runs the tests again.

32-bit generators

LCG

A while back I wrote about looking for an RNG that would fail the NIST test suite and being surprised that a simple LCG (linear congruential generator) did fairly well. PractRand, however, dismisses this generator with extreme prejudice:

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x4a992b2c
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x4a992b2c
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+0,13-3,T) R=+115128 p = 0 FAIL !!!!!!!!
BCFN(2+1,13-3,T) R=+105892 p = 0 FAIL !!!!!!!!
...
[Low1/32]FPF-14+6/16:(8,14-9) R= +25.8 p = 1.5e-16 FAIL
[Low1/32]FPF-14+6/16:(9,14-10) R= +15.5 p = 8.2e-9 very suspicious
[Low1/32]FPF-14+6/16:(10,14-11) R= +12.9 p = 1.8e-6 unusual
[Low1/32]FPF-14+6/16:all R=+380.4 p = 8.2e-337 FAIL !!!!!!!
[Low1/32]FPF-14+6/16:all2 R=+33954 p = 0 FAIL !!!!!!!!
[Low1/32]FPF-14+6/16:cross R= +33.6 p = 6.4e-25 FAIL !!
...and 9 test result(s) without anomalies

I don’t recall the last time I saw a p-value of exactly zero. Presumably the p-value was so small that it underflowed to zero.

MWC

Another 32 bit generator, George Marsaglia’s MWC generator, also fails, but not as spectacularly as LCG.

RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x187edf12
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x187edf12
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
Test Name Raw Processed Evaluation
DC6-9x1Bytes-1 R= +6.3 p = 2.2e-3 unusual
Gap-16:A R= +33.3 p = 1.6e-28 FAIL !!!
Gap-16:B R= +13.6 p = 7.9e-12 FAIL
...and 105 test result(s) without anomalies

64-bit generators

Next let’s see how some well-regarded 64-bit random number generators do. We’ll look at xorshift128+ and xoroshiro128+ by Sebastiano Vigna and David Blackman, the famous Mersenne Twister, and PCG by Melissa O’Neill.

The numbers generated by xoroshiro128+ and xorshift128+ are not random in the least significant bit and so the PractRand tests end quickly. The authors claim that xoroshiro128+ “passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests.” I’ve only run PractRand with its default settings; I have not tried getting it to keep running the rest of the tests.

A lack of randomness in the least significant bits is inconsequential if you’re converting the outputs to floating point numbers, say as part of the process of generating Gaussian random values. For other uses, such as reducing the outputs modulo a small number, a lack of randomness in the least significant bits would be disastrous. (Here numerical analysis and number theory have opposite ideas regarding which parts of a number are most “significant.”)

At the time of writing, both Mersenne Twister and PCG have gone through 256 GB of generated values and are still running. I intend to add more results tomorrow.

Update: Mersenne Twister failed a couple of tests with 512 GB of input. I stopped the tests after PCG passed 16 TB.

xoroshiro128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xe15bf63c
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xe15bf63c
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

xorshift128+

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x8c7c5173
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x8c7c5173
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
Test Name Raw Processed Evaluation
[Low1/64]BRank(12):256(2) R= +3748 p~= 3e-1129 FAIL !!!!!!!!
[Low1/64]BRank(12):384(1) R= +5405 p~= 3e-1628 FAIL !!!!!!!!
...and 146 test result(s) without anomalies

Mersenne Twister

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x300dab94
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x300dab94
length= 256 megabytes (2^28 bytes), time= 3.7 seconds
no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 megabytes (2^29 bytes), time= 7.9 seconds
Test Name Raw Processed Evaluation
BCFN(2+2,13-2,T) R= -7.0 p =1-1.2e-3 unusual
[Low1/64]BCFN(2+2,13-6,T) R= -5.7 p =1-1.0e-3 unusual
...and 167 test result(s) without anomalies

...

rng=RNG_stdin64, seed=0x300dab94
length= 256 gigabytes (2^38 bytes), time= 3857 seconds
  no anomalies in 265 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 gigabytes (2^39 bytes), time= 8142 seconds
 Test Name Raw Processed Evaluation
 BRank(12):24K(1) R=+99759 p~= 0 FAIL !!!!!!!!
 [Low16/64]BRank(12):16K(1) R= +1165 p~= 1.3e-351 FAIL !!!!!!!
 ...and 274 test result(s) without anomalies

PCG

RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x82f88431
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x82f88431
length= 128 megabytes (2^27 bytes), time= 2.0 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]DC6-9x1Bytes-1           R=  +6.6  p =  1.6e-3   unusual
  ...and 147 test result(s) without anomalies

rng=RNG_stdin64, seed=0x82f88431
length= 256 megabytes (2^28 bytes), time= 4.7 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x82f88431
length= 512 megabytes (2^29 bytes), time= 9.5 seconds
  no anomalies in 169 test result(s)

...

rng=RNG_stdin64, seed=0x82f88431
length= 16 terabytes (2^44 bytes), time= 254943 seconds
  no anomalies in 329 test result(s)

Random walk on quaternions

The previous post was a riff on a tweet asking what you’d get if you extracted all the i‘s, j‘s, and k‘s from Finnegans Wake and multiplied them as quaternions. This post is a probabilistic variation on the previous one.

If you randomly select a piece of English prose, extract the i‘s, j‘s, and k‘s, and multiply them together as quaternions, what are you likely to get?

The probability that a letter in this sequence is an i is about 91.5%. There’s a 6.5% chance it’s a k, and a 2% chance it’s a j. (Derived from here.) We’ll assume the probabilities of each letter appearing next are independent.

You could think of the process of multiplying all the i‘s, j‘s, and k‘s together as a random walk on the unit quaternions, an example of a Markov chain. Start at 1. At each step, multiply your current state by i with probability 0.915, by j with probability 0.02, and by k with probability 0.065.

After the first step, you’re most likely at i. You could be at j or k, but nowhere else. After the second step, you’re most likely at -1, though you could be anywhere except at 1. For the first several steps you’re much more likely to be at some places than others. But after 50 steps, you’re about equally likely to be at any of the eight possible values.
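Here’s a sketch that propagates the exact distribution over the eight elements of Q step by step rather than simulating, so you can check how close the probabilities are to 1/8 after 50 steps.

    # Elements of Q are written as (sign, unit) with unit in {'1', 'i', 'j', 'k'}.
    # Table for right-multiplication of a unit by i, j, or k.
    right_mult = {
        ('1', 'i'): (1, 'i'),  ('1', 'j'): (1, 'j'),  ('1', 'k'): (1, 'k'),
        ('i', 'i'): (-1, '1'), ('i', 'j'): (1, 'k'),  ('i', 'k'): (-1, 'j'),
        ('j', 'i'): (-1, 'k'), ('j', 'j'): (-1, '1'), ('j', 'k'): (1, 'i'),
        ('k', 'i'): (1, 'j'),  ('k', 'j'): (-1, 'i'), ('k', 'k'): (-1, '1'),
    }
    step_probs = {'i': 0.915, 'j': 0.02, 'k': 0.065}

    # Start at 1 with probability 1 and propagate the distribution 50 steps.
    dist = {(s, u): 0.0 for s in (1, -1) for u in '1ijk'}
    dist[(1, '1')] = 1.0

    for _ in range(50):
        new = {state: 0.0 for state in dist}
        for (sign, unit), prob in dist.items():
            for step, p in step_probs.items():
                s, u = right_mult[(unit, step)]
                new[(sign * s, u)] += prob * p
        dist = new

    for state, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(state, round(prob, 4))   # compare with the uniform value 1/8 = 0.125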

Wolfram Alpha, Finnegans Wake, and Quaternions


I stumbled on a Twitter account yesterday called Wolfram|Alpha Can’t. It posts bizarre queries that Wolfram Alpha can’t answer. Here’s one that caught my eye.

Suppose you did extract all the i‘s, j‘s, and k‘s from James Joyce’s novel Finnegans Wake. How would you answer the question above?

You could initialize an accumulator to 1 and then march through the list, updating the accumulator by multiplying it by the next element. But is there a more efficient way?

Quaternion multiplication is not commutative, i.e. the order in which you multiply things matters. So it would not be enough to have a count of how many times each letter appears. Is there any sort of useful summary of the data short of carrying out the whole multiplication? In other words, could you scan the list while doing something other than quaternion multiplication, something faster to compute? Something analogous to sufficient statistics.

We’re carrying out multiplications in the group Q of unit quaternions, a group with eight elements: ±1, ±i, ±j, ±k. But the input to the question about Finnegans Wake only involves three of these elements. Could that be exploited for some slight efficiency?

How would you best implement quaternion multiplication? Of course the answer depends on your environment and what you mean by “best.”

Note that we don’t actually need to implement quaternion multiplication in general, though that would be sufficient. All we really need is multiplication in the group Q.

You could implement multiplication by a table lookup. You could use an 8 × 3 table; the left side of our multiplication could be anything in Q, but the right side can only be i, j, or k. You could represent quaternions as a list of four numbers—coefficients of 1, i, j, and k—and write rules for multiplying these. You could also represent quaternions as real 4 × 4 matrices or as complex 2 × 2 matrices.
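For example, here’s a sketch of one of the representations mentioned above: unit quaternions as 2 × 2 complex matrices, with the product over a string of letters computed by a reduce. The input string here is made up, not the letters of Finnegans Wake.

    import numpy as np
    from functools import reduce

    # 1, i, j, k as 2 x 2 complex matrices; matrix multiplication then
    # implements quaternion multiplication.
    one = np.eye(2, dtype=complex)
    qi = np.array([[1j, 0], [0, -1j]])
    qj = np.array([[0, 1], [-1, 0]], dtype=complex)
    qk = np.array([[0, 1j], [1j, 0]])
    units = {'i': qi, 'j': qj, 'k': qk}

    def product(letters):
        # Multiply out a sequence of i's, j's, and k's, left to right
        return reduce(np.matmul, (units[c] for c in letters), one)

    result = product("ijkiik")   # a made-up example string
    names = [('1', one), ('-1', -one), ('i', qi), ('-i', -qi),
             ('j', qj), ('-j', -qj), ('k', qk), ('-k', -qk)]
    print([name for name, mat in names if np.allclose(result, mat)])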

If you have an interesting solution, please share it in a comment below. It could be interesting by any one of several criteria: fast, short, cryptic, amusing, etc.

Update: See follow up post, Random walk on quaternions


The cross polytope

There are five regular solids in three dimensions:

  • tetrahedron
  • octahedron
  • hexahedron (cube)
  • dodecahedron
  • icosahedron.

I give a proof here that these are the only five.

The first three of these regular solids generalize to all dimensions, and these generalizations are the only regular solids in dimensions 5 and higher. (There are six regular solids in dimension 4.)

I’ve mentioned generalizations of the cube, the hypercube, lately. I suppose you could call the generalization of an octahedron a “hyperoctahedron” by analogy with the hypercube, though I’ve never heard anybody use that term. Instead, the most common name is cross polytope.

This post will focus on the cross polytope. In particular, we’re going to look at the relative volume of a ball inside a cross polytope.

The cross polytope in n dimensions is the convex hull of all n-dimensional vectors that are ±1 in one coordinate and 0 in all the rest. It is the “plus or minus” part that gives the cross polytope its name, i.e. the vertices are in pairs across the origin.

In analysis, the cross polytope is the unit ball in ℓ1 (“little ell one”), the set of points (x1, x2, …, xn) such that

|x1| + |x2| + … + |xn| ≤ 1.

The ℓ1 norm, and hence the ℓ1 ball, comes up frequently in compressed sensing and in sparse regression.

In recent blog posts we’ve looked at how the relative volume of a ball inscribed in a hypercube drops quickly as dimension increases. What about the cross polytope? The relative volume of a ball inscribed in a cross polytope decreases rapidly with dimension as well. But does it decrease faster or slower than the relative volume of a ball inscribed in a hypercube? To answer this, we need to compute

\left.\frac{\mbox{vol ball in cross poly}}{\mbox{vol cross poly}}\middle/\frac{\mbox{vol ball in hypercube}}{\mbox{vol hypercube}}\right.

Let’s gather what we need to evaluate this. We need the volume of a ball of radius r in n dimensions, and as mentioned before this is

V = \frac{\pi^{\frac{n}{2}} r^n}{\Gamma\left(\frac{n}{2} + 1\right)}

A ball sitting inside an n-dimensional unit cross polytope will have radius 1/√n. This is because if n positive numbers sum to 1, the sum of their squares is minimized by making them all equal, and the point (1/n, 1/n, …, 1/n) has norm 1/√n. A ball inside a unit hypercube will have radius 1/2.

The cross polytope has volume 2ⁿ / n! and the hypercube has volume 1.

Putting this all together, the relative volume of a ball in a cross polytope divided by the relative volume of a ball inside a hypercube is

\left. \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{\sqrt{n}}\right)^n } { \frac{2^n}{n!} } \middle/ \frac{ \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \left(\frac{1}{2}\right)^n } { 1 } \right.

which fortunately reduces to just

\frac{n!}{n^{n/2}}

But how do we compare n! and n^(n/2)? That’s a job for Stirling’s approximation. It tells us that for large n, the ratio is approximately

\sqrt{2\pi n}\, n^{n/2}e^{-n}

and so the ratio diverges for large n, i.e. the ball in the cross polytope takes up increasingly more relative volume than the ball in the hypercube.

Looking back at just the relative volume of the ball inside the cross polytope, and applying Stirling’s approximation again, we see that the relative volume of the ball inside the cross polytope is approximately

\sqrt{2}\left( \frac{\pi}{2e} \right )^{n/2}

and so the relative volume decreases geometrically as n increases, decreasing much slower than the relative volume of a ball in a hypercube.
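Here’s a short numerical check of these formulas and asymptotics, a sketch using SciPy’s log-gamma function to avoid overflow.

    import numpy as np
    from scipy.special import gammaln

    def log_ball_volume(n, r):
        # log volume of an n-dimensional ball of radius r
        return (n / 2) * np.log(np.pi) + n * np.log(r) - gammaln(n / 2 + 1)

    for n in [2, 5, 10, 20, 50]:
        # relative volume of the inscribed ball in the cross polytope and in the hypercube
        log_rel_cross = log_ball_volume(n, 1 / np.sqrt(n)) - (n * np.log(2) - gammaln(n + 1))
        log_rel_cube = log_ball_volume(n, 0.5)

        ratio = np.exp(log_rel_cross - log_rel_cube)       # should equal n!/n^(n/2)
        exact = np.exp(gammaln(n + 1) - (n / 2) * np.log(n))
        stirling = np.sqrt(2 * np.pi * n) * n**(n / 2) * np.exp(-n)

        rel_cross = np.exp(log_rel_cross)                  # compare with sqrt(2) (pi/2e)^(n/2)
        approx_rel = np.sqrt(2) * (np.pi / (2 * np.e))**(n / 2)

        print(n, ratio, exact, stirling, rel_cross, approx_rel)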