Database anonymization for testing

How do you create a database for testing that is like your production database? It depends on in what way you want the test database to be “like” the production one.

Replacing sensitive data

Companies often use an old version of their production database for testing. But what if the production database has sensitive information that software developers and testers should not have access to?

You can’t completely remove customer phone numbers from the database, for example, if your software handles customer phone numbers. You have to replace sensitive data with modified data. The question becomes how to modify it. Three approaches would be

  1. Use the original data.
  2. Generate completely new artificial data.
  3. Use the real data as a guide to generating new data.

We’ll assume the first option is off the table and consider the pros and cons of the other two options.

For example, suppose you collect customer ages. You could replace customer age with a random two-digit number. That’s fine as far as making sure that forms can display two-digit numbers. But maybe the age values matter. Maybe you want your fictional customers in the test database to have the same age distribution as your real customers. Or maybe you want your fictional customer ages to be correlated with other attributes so that you don’t have 11 year-old retirees or 98 year-old clients who can’t legally purchase alcohol.
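Here’s a minimal sketch of the two replacement strategies for ages, using Python. The DataFrame and its "age" column are hypothetical stand-ins for a production table.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng()
    # Hypothetical stand-in for a production table with an "age" column
    real = pd.DataFrame({"age": [34, 41, 41, 52, 67, 67, 67, 72]})

    # Option 2: completely artificial data, uniform over two-digit ages
    random_ages = rng.integers(10, 100, size=len(real))

    # Option 3: artificial ages drawn from the empirical distribution of the real ages
    realistic_ages = rng.choice(real["age"].to_numpy(), size=len(real), replace=True)

The second option preserves the marginal distribution of ages but not correlations with other fields; matching joint distributions takes more work, as discussed below.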

Random vs realistic

There are pros and cons to having a realistic test database. A database filled with randomly generated data is likely to find more bugs, but a realistic database is likely to find more important bugs.

Randomly generated data may contain combinations that have yet to occur in the production data, combinations that will cause an error when they do come up in production. Maybe you’ve never sold your product to someone in Louisiana, and there’s a latent bug that will show up the first time someone from Louisiana does order. (For example, Louisiana retains vestiges of French law that make it different from all other states.)

On the other hand, randomly generated data may not find the bugs that affect the most customers. You might want the values in your test database to be distributed similarly to the values in real data so that bugs come up in testing with roughly the same frequency as in production. In that case, you probably want the joint distributions to match and not just the unconditional distributions. If you just match the latter, you could run into oddities such as a large number of teenage retirees as mentioned above.

So do you want a random test database or a realistic test database? Maybe both. It depends on your purposes and priorities. You might want to start by testing against a realistic database so that you first find the bugs that are likely to affect the largest number of customers. Then maybe you switch to a randomized database that is more effective at flushing out problems with edge cases.

How to make a realistic test database

So how would you go about creating a realistic test database that protects customer privacy? The answer depends on several factors. First of all, it depends on what aspects of the real data you want to preserve. Maybe verisimilitude is more important for some fields than others. Once you decide what aspects you want your test database to approximate, how well do you need to approximate them? If you want to do valid statistical analysis on the test database, you may need something sophisticated like differential privacy. But if you just want moderately realistic test cases, you can do something much simpler.

Finally, you have to address your privacy-utility trade-off. What kinds of privacy protection are you ethically and legally obligated to provide? For example, is your data considered PHI under HIPAA regulation? Once your privacy obligations are clear, you look for ways to maximize your utility subject to these privacy constraints.

If you’d like help with this process, let’s talk. I can help you determine what your obligations are and how best to meet them while meeting your business objectives.

Recent exponential sums

The exponential sum of the day draws a line between consecutive partial sums of

\sum_{n=0}^N \exp\left( 2\pi i \left( \frac{n}{m} + \frac{n^2}{d} + \frac{n^3}{y} \right ) \right )

where m, d, and y are the current month, day, and two-digit year. The four most recent images show how different these plots can be.

exponential sums

These images are from 10/30/17, 10/31/17, 11/1/17, and 11/2/17.
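Here’s a short Python sketch, assuming NumPy and Matplotlib, that produces this kind of plot for a given date by connecting consecutive partial sums in the complex plane.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_exponential_sum(m, d, y):
        # One full period of the partial sums (see footnote [1])
        period = np.lcm.reduce([m, d, y])
        n = np.arange(period + 1)
        terms = np.exp(2j * np.pi * (n/m + n**2/d + n**3/y))
        partial_sums = np.cumsum(terms)
        plt.plot(partial_sums.real, partial_sums.imag)
        plt.axis("equal")
        plt.show()

    plot_exponential_sum(10, 31, 17)   # the plot for 10/31/17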

Consecutive dates often produce very different images for a couple reasons. First, consecutive integers are relatively prime. From a number theoretic perspective, 30 and 31 are very different, for example. (This touches on the motivation for p-adic numbers: put a different metric on integers, one based on their prime factorization.)

The other reason consecutive dates produce qualitatively different images is that you might roll over a month or year, such as going from October (10) to November (11). You’ll see a big change when we roll over from 2017 to 2018.

The partial sums are periodic [1] with period lcm(m, d, y). The image for 10/31/17 above has the most points because 10, 31, and 17 are relatively prime. It’s also true that 11, 2, and 17 are relatively prime, but these are smaller numbers.

You could think of the month, day, and year components of the sum as three different gears. The sums repeat when all three gears return to their initial positions. In the image for yesterday, 11/1/17, the middle gear is effectively not there.

[1] The terms of the sums are periodic. The partial sums are periodic if the full sum is zero. So if the partial sums are periodic, the lcm is a period.

Yogi Berra meets Pafnuty Chebyshev

I just got an evaluation copy of The Best Writing on Mathematics 2017. My favorite chapter was Inverse Yogiisms by Lloyd N. Trefethen.

Trefethen gives several famous Yogi Berra quotes and concludes that

Yogiisms are statements that, if taken literally, are meaningless or contradictory or nonsensical or tautological—yet nevertheless convey something true.

An inverse yogiism is the opposite,

[a] statement that is literally true, yet conveys something false.

What a great way to frame a chapter! Now that I’ve heard the phrase, I’m trying to think of inverse yogiisms. Nothing particular has come to mind yet, but I feel like there must be lots of things that fit that description. Trefethen comes up with three inverse yogiisms, and my favorite is the middle one: Faber’s theorem on polynomial interpolation.

Faber’s theorem is a non-convergence result for interpolants of continuous functions. Trefethen quotes several numerical analysis textbooks that comment on Faber’s theorem in a way that implies an overly pessimistic interpretation. Faber’s theorem is true for continuous functions in general, but if the function f  being interpolated is smooth, or even just Lipschitz continuous, the theorem doesn’t hold. In particular, Chebyshev interpolation produces a sequence of polynomials converging to f.

A few years ago I wrote a blog post that shows a famous example due to Carle Runge that if you interpolate f(x) = 1/(1 + x²) over [-5, 5] with evenly spaced nodes, the sequence of interpolating polynomials diverges. In other words, adding more interpolation points makes the fit worse.

Here’s the result of fitting a 16th degree polynomial to f  at evenly spaced nodes.

graph of f(x) and p16(x)

The error near the ends is terrible, though the fit does improve in the middle. If instead of using evenly spaced nodes you use the roots of Chebyshev polynomials, the interpolating polynomials do in fact converge, and converge quickly. If the kth derivative of f has bounded variation, then the error in interpolating f at n points is O(n^−k).
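Here’s a small NumPy sketch of the comparison: fit polynomials to Runge’s function at evenly spaced nodes and at Chebyshev nodes and compare the maximum errors. The degrees chosen are arbitrary.

    import numpy as np

    f = lambda x: 1/(1 + x**2)
    x_fine = np.linspace(-5, 5, 1000)

    for n in [9, 17, 33]:
        # Evenly spaced nodes
        x_even = np.linspace(-5, 5, n)
        p_even = np.polynomial.Polynomial.fit(x_even, f(x_even), n - 1)

        # Chebyshev nodes: roots of the degree-n Chebyshev polynomial, scaled to [-5, 5]
        k = np.arange(n)
        x_cheb = 5*np.cos((2*k + 1)*np.pi/(2*n))
        p_cheb = np.polynomial.Polynomial.fit(x_cheb, f(x_cheb), n - 1)

        err_even = np.max(np.abs(f(x_fine) - p_even(x_fine)))
        err_cheb = np.max(np.abs(f(x_fine) - p_cheb(x_fine)))
        print(n, err_even, err_cheb)

The error at evenly spaced nodes grows with the degree while the error at Chebyshev nodes shrinks.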

The most disliked programming language

According to this post from Stack Overflow, Perl is the most disliked programming language.

I have fond memories of writing Perl, though it’s been a long time since I used it. I mostly wrote scripts for file munging, the task it does best, and never had to maintain someone else’s Perl code. Under different circumstances I probably would have had less favorable memories.

Perl is a very large, expressive language. That’s a positive if you’re working alone but a negative if working with others. Individuals can carve out their favorite subsets of Perl and ignore the rest, but two people may carve out different subsets. You may personally avoid some feature, but you have to learn it anyway if your colleague uses it. Also, in a large language there’s greater chance that you’ll accidentally use a feature you didn’t intend to. For example, in Perl you might use an array in a scalar context. This works, but not as you’d expect if you didn’t intend to do it.

I suspect that people who like large languages like C++ and Common Lisp are more inclined to like Perl, while people who prefer small languages like C and Scheme have opposite inclinations.


Poisson distribution and prime numbers

Let ω(n) be the number of distinct prime factors of n. A theorem of Landau says that for randomly selected positive integers less than N, ω(n) − 1 has approximately a Poisson(log log N) distribution; the approximation holds in the limit as N goes to infinity.

Apparently N has to be extremely large before the result approximately holds. I ran an experiment for N = 10,000,000 and the fit is not great.

Here’s the data:

|----+----------+-----------|
|  ω | Observed | Predicted |
|----+----------+-----------|
|  1 |   665134 |  620420.6 |
|  2 |  2536837 | 1724733.8 |
|  3 |  3642766 | 2397330.6 |
|  4 |  2389433 | 2221480.4 |
|  5 |   691209 | 1543897.1 |
|  6 |    72902 |  858389.0 |
|  7 |     1716 |  397712.0 |
|  8 |        1 |  157945.2 |
|  9 |        0 |   54884.8 |
| 10 |        0 |   16952.9 |
|----+----------+-----------|

And here’s a plot:

bar chart of actual vs predicted

I found the above theorem in these notes. It’s possible I’ve misunderstood something or there’s an error in the notes. I haven’t found a more formal reference on the theorem yet.

Update: According to the results in the comments below, it seems that the notes I cited may be wrong. The notes say “distinct prime factors”, i.e. ω(n), while numerical results suggest they meant to say Ω(n), the number of prime factors counted with multiplicity. I verified the numbers given below. Here’s a plot.

 

Here’s the Python code I used to generate the table. (To match the code for the revised graph, replace omega, which calculates ω(n), with the SymPy function primeomega, which calculates Ω(n).)

from sympy import factorint
from numpy import empty, log, log2
from scipy.stats import poisson

N = 10000000

def omega(n):
    return len(factorint(n))

table = empty(N, int)
table[0] = table[1] = 0
for n in range(2, N):
    table[n] = omega(n)

# upper bound on the largest value of omega(n) for n < N.
u = int(log2(N))

for n in range(1, u+1):
    print(n, len(table[table==n]), poisson(log(log(N))).pmf(n-1)*N)


US Counties and power laws

Yesterday I heard that the county I live in, Harris County, is the 3rd largest in the United States. (In population. It’s nowhere near the largest in area.) Somehow I’ve lived here a couple decades without knowing that. Houston is the 4th largest city in the US, so it’s no shock that Harris County is the 3rd largest county, but I hadn’t thought about it.

I knew that city populations followed a power law, so I wanted to see if county populations do too. I thought they would, but counties are a little different than cities. For example, cities grow and shrink over time but counties typically do not.

To cut to the chase, county populations do indeed follow a power law. They have the telltale straight-line graph on a log-log plot, at least for the largest counties. The line starts to curve down at some point and later drops precipitously. That’s typical. When you hear that something “follows a power law” that means it approximately has a power law distribution over some range. Nothing has exactly a power law distribution, or any other ideal distribution for that matter. But some things follow a power law distribution more closely and over a longer range than others.

Log-log plot of US county populations

Even though Los Angeles County (10.1 million) is the largest by far, it doesn’t stick out on a log scale. Its population compared to Cook County (5.2 million) and Harris County (4.6 million) is unremarkable for a power law.
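If you’d like to make a plot like the one above, here’s a rough sketch. It assumes you have county populations in a CSV file with a population column; the file name and column name are placeholders.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file of county populations, one row per county
    pops = pd.read_csv("county_populations.csv")["population"]
    pops = pops.sort_values(ascending=False).to_numpy()
    ranks = np.arange(1, len(pops) + 1)

    plt.loglog(ranks, pops)
    plt.xlabel("rank")
    plt.ylabel("population")
    plt.show()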

Paul Klee meets Perry the Platypus

I was playing around with something in Mathematica and one of the images that came out of it surprised me.

filter contour plot

It’s a contour plot for the system function of a low pass filter.

    H[z_] := 0.05634*(1 + 1/z)*(1 - 1.0166/z + 1/z^2) /
            ((1 - 0.683/z)*(1 - 1.4461/z + 0.7957/z^2))
    ContourPlot[ Arg[H[Exp[I (x + I y)]]], 
                 {x, -1, 1}, {y, -1, 1}, 
                 ColorFunction -> "StarryNightColors"]

It looks sorta like a cross between Paul Klee’s painting Senecio

Senecio by Paul Klee, 1922

and Perry the Platypus from Phineas and Ferb.

Perry the Platypus from Phineas and Ferb

Differential equations and recurrence relations

Series solutions to differential equations can be grubby or elegant, depending on your perspective.

Power series solutions

At one level, there’s nothing profound going on. To find a series solution to a differential equation, assume the solution has a power series, stick the series into the equation, and solve for the coefficients. The mechanics are tedious, but you just follow your nose.

But why should this work? Why should the solution have a power series? And why should it be possible to find the coefficients? After all, there are infinitely many of them!

As for why the solution should have a power series, sometimes it doesn’t. But there is a complete theory to tell you when the solution should have a power series or not. And when it doesn’t have a power series, it may have a more general series solution.

Recurrence relations

The tedious steps in the middle of the solution process may obscure what’s happening between the first and last step; the hard part in the middle grabs our attention. But the big picture is that the series solution process transforms a differential equation for a function y(x) into a recurrence relation for the coefficients in the power series for y. We turn a continuous problem into a discrete problem, a hard problem into an easy problem.

To look at an example, consider Airy’s equation:

y'' − xy = 0.

I won’t go into all the details of the solution process—my claim above is that these details obscure the transformation I want to emphasize—but skip from the first to the last step.

If you assume y has a power series at 0 with coefficients an, then we find that the coefficients satisfy the recurrence relation

(n + 2)(n + 1) a_{n+2} = a_{n-1}

for n ≥ 1. Initial conditions determine a0 and a1, and it turns out a2 must be 0. The recurrence relation shows how these three coefficients determine all the other coefficients.
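Here’s a quick numerical sanity check of the recurrence: build a truncated power series from the coefficients it produces, seeded with the initial conditions for the Airy function Ai, and compare against SciPy’s implementation. The truncation length and evaluation point are arbitrary.

    import numpy as np
    from scipy.special import airy

    a = np.zeros(40)
    a[0] = airy(0)[0]   # Ai(0), the initial condition a_0
    a[1] = airy(0)[1]   # Ai'(0), the initial condition a_1
    a[2] = 0.0          # forced by the differential equation

    # The recurrence (n + 2)(n + 1) a_{n+2} = a_{n-1} for n >= 1
    for n in range(1, len(a) - 2):
        a[n + 2] = a[n - 1] / ((n + 2)*(n + 1))

    x = 1.0
    series_value = sum(a[k] * x**k for k in range(len(a)))
    print(series_value, airy(x)[0])   # the two values agree closely for small x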

Holonomic functions and sequences

There’s something interesting going on here, a sort of functor mapping differential equations to recurrence relations. If we restrict ourselves to differential equations and recurrence relations with polynomial coefficients, we get into the realm of holonomic functions.

A holonomic function is one that satisfies a linear differential equation with polynomial coefficients. A holonomic sequence is a sequence that satisfies a linear recurrence relation with polynomial coefficients. Holonomic functions are the generating functions for holonomic sequences.

Holonomic functions are interesting for a variety of reasons. They’re a large enough class of functions to include many functions of interest, like most special functions from analysis and a lot of generating functions from combinatorics. And yet they’re a small enough class to have nice properties, such as closure under addition and multiplication.

Holonomic functions are determined by a finite list of numbers, the coefficients of the polynomials in their defining differential equation. This means they’re really useful in computer algebra because common operations ultimately map one finite set of numbers to another.

 

Finding numbers in pi

You can find any integer you want as a substring of the digits in π. (Probably. See footnote for details.) So you could encode a number by reporting where it appears.

If you want to encode a single digit, the best you can do is break even: it takes at least one digit to specify the location of another digit. For example, 9 is the 5th digit of π. But 7 doesn’t appear until the 13th digit. In that case it would take 2 digits to encode 1 digit.

What if we work in another base b? Nothing changes as long as we also describe positions in base b. But there’s a possibility of compression if we look for digits of base b but describe their position using base p numbers where p < b. For example, 15 is the 2nd base-100 “digit” in π.

Blocks of digits

For the rest of this post we’ll describe blocks of k digits by their position as a base 10 number. That is, we’ll use p = 10 and b = 10^k in the notation above.

There are ten 2-digit numbers that can be described by a 1-digit position in π: 14, 15, 32, …, 92. There are 57 more 2-digit numbers that can be described by a 2-digit position. The other 33 2-digit numbers require 3 digits to describe their position.

So if we encode a 2-digit number by its position in pairs of digits in π, there’s a 1/10 chance that we’ll only need 1 digit, i.e. a 10% chance of compression. The chance that we’ll break even by needing two digits is 57/100, and the chance that we’ll need three digits, i.e. more digits than what we’re encoding, is 33/100. On average, we expect to use 2.23 digits to encode 2 digits.

If we look at 3-digit numbers, there are 10 that can be described by a 1-digit position, 83 that require 2 digits, 543 that require 3 digits, and 364 that require 4 digits. So on average we’d need 3.352 digits to encode 3 digits. Our “compression” scheme makes things about 8% larger on average.

With 4-digit numbers, we need up to 5 digits to describe the position of each, and on average we need 4.2611 digits.

With 5-digit numbers, we need at least 7 digits to describe each position. As I write this, my program is still running. The positions of 99,994 five-digit numbers can be described by numbers with at most 6 digits. The remaining 6 need at least 7 digits. Assuming they need exactly seven digits, we need an average of 5.2623 digits to encode 5-digit numbers by their position in π.
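Here’s a rough Python sketch of the kind of tally described above, using mpmath for the digits of π. It looks at non-overlapping k-digit blocks and 1-based positions, which is one reasonable reading of the scheme; a different convention would change the counts slightly.

    from mpmath import mp

    mp.dps = 20010
    digits = mp.nstr(mp.pi, 20000)[2:]    # decimal digits of pi after the "3."

    def first_position(block, k):
        """1-based position of `block` among consecutive k-digit groups of pi."""
        for i in range(0, len(digits) - k + 1, k):
            if digits[i:i+k] == block:
                return i // k + 1
        return None

    k = 2
    lengths = []
    for m in range(10**k):
        pos = first_position(str(m).zfill(k), k)
        if pos is not None:
            lengths.append(len(str(pos)))

    # average number of digits needed to encode a k-digit number by its position
    print(sum(lengths) / len(lengths))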

Compression scheme efficiency

If we assume the numbers we wish to encode are uniformly distributed, the representation as a location in π will take more digits than the number itself, on average. But all compression algorithms expand their contents on average if by “average” you mean picking all possible inputs with the same probability. Uniform randomness is not compressible. It takes n bits to describe n random bits.

Compression methods are practical because their inputs are not completely random. For example, English prose compresses nicely because it’s not random.

If we had some reason to suspect that a number came from a block of digits in π, and one not too far out, then the scheme here could be a form of compression. The possibility of efficient compression comes from an assumption about our input.

Extensions

You could extend the ideas in this post theoretically or empirically, i.e. you could write a theorem or a program.

Suppose you look at blocks of k base-b digits whose position is described as a base-b number. The case b = 2 seems particularly interesting. What is the compression efficiency of the method as k varies? What is it for particular k, and what is it in the limit as k goes to infinity?

You could look at this empirically for digits of π, or some other number, or you could try to prove it theoretically for digits of a normal number.

Footnote on normal numbers

It might not be true that you can find any integer in the digits of π, though we know it’s true for numbers of the size we’re considering here. You can find any number in the digits of π if π is a “normal number.” Roughly speaking, a number is normal if every pattern appears in its digits with the frequency you’d expect. That is, 1/10 of the digits in its decimal expansion will be 7’s, for example. Or if you group the digits in pairs, thinking of the number as being written in base 100, 1/100 of the pairs will be 42’s, etc.

Almost all numbers are normal, so in a sense, the probability that π is normal is 1. Even though almost all numbers are normal, nobody has been able to prove that any familiar number is normal: π, e, √2, etc.

It’s possible that every integer appears in the decimal expansion of π, but not necessarily with the expected probability. This would be a weaker condition than normality. Maybe this has been proven for π, but I don’t know. It would be strange if every number appeared in the digits of π but with some bias.

Empirically testing the Chowla conjecture

Terry Tao’s most recent blog post looks at the Chowla conjecture theoretically. This post looks at the same conjecture empirically using Python. (Which is much easier!)

The Liouville function λ(n) is (−1)^Ω(n) where Ω(n) is the number of prime factors of n counted with multiplicity. So, for example, Ω(9) = 2 because even though 9 has only one distinct prime factor, that factor counts twice.

Given some set of k strides h1, h2, …, hk, define

f(n) = λ(n + h1) λ(n + h2) … λ(n + hk).

The Chowla conjecture says that the average of the first N values of f(n) tends to zero as N goes to infinity. This has been proven for k = 1 but not for larger k.

If f(n) acts like a Bernoulli random variable, our experiments might increase our confidence in the conjecture, but they wouldn’t prove anything. Unexpected results wouldn’t prove anything either, but a departure from random behavior might help find a counterexample if the conjecture is false.

We’re going to be evaluating the Liouville function repeatedly at the same arguments, so it will save some compute time if we tabulate its values. This will also let us use some compact Python notation to average f. We’ll need to tabulate the function up to N + hk.

In the code below, maxstride is an upper bound on the strides hk we may look at. SymPy has a function primeomega that calculates Ω(n) so we might as well use it. If you wanted to use a very large value of N, you might want to fill the array of Liouville function values using a more direct approach that avoids all the factoring implicit in calls to primeomega.

    from sympy import primeomega
    from numpy import empty

    N = 10000
    maxstride = 100

    liouville = empty(N + maxstride)
    liouville[0] = 1
    for n in range(1, len(liouville)):
        liouville[n] = (-1)**primeomega(n)

The following code looks at the Chowla conjecture for h1 = 0 and h2 ranging over 1, 2, …, 99.

    average = empty(maxstride-1)
    for s in range(1, maxstride):
        average[s-1] = (liouville[0:N] * liouville[s:N+s]).sum()/N

If the Liouville function is distributed like a random variable taking on -1 and 1 with equal probability, we’d expect the averages to vary like samples from a normal distribution with mean 0 and variance 1/(2N).

    print(average.mean())
    print(average.std())
    print( (2*N)**-0.5 )

This returns

    0.00141
    0.00851
    0.00707

and so the means are indeed near zero, and the standard deviation of the samples is about what we’d expect.

What if we replace Ω(n), the number of prime factors with multiplicity, with ω(n), the number of distinct prime factors? The averages above are still small, but the sample variance is about twice as big as before. I’ve seen this pattern with several different large values of N, so I suspect there’s something going on here.

(I didn’t see a function in SymPy corresponding to ω(n), but you can create your own with len(factorint(n)).)

Time series analysis vs DSP terminology

Time series analysis and digital signal processing are closely related. Unfortunately, the two fields use different terms to refer to the same things.

Suppose you have a sequence of inputs x[n] and a sequence of outputs y[n] for integers n.

Moving average / FIR

If each output depends on a linear combination of a finite number of previous inputs

y[n] = b0 x[n] + b1 x[n-1] + … + bq x[n-q]

then time series analysis would call this a moving average (MA) model of order q, provided b0 = 1. Note that this might not really be an average, i.e. the b‘s are not necessarily positive and don’t necessarily sum to 1.

Digital signal processing would call this a finite impulse response (FIR) filter of order q.

Autoregressive / IIR

If each output depends on a linear combination of a finite number of previous outputs

y[n] = a1 y[n-1] + … + ap y[n-p]

then time series analysis would call this an autoregressive (AR) model of order p.

Digital signal processing would call this an infinite impulse response (IIR) filter of order p. 

Sometimes you’ll see the opposite sign convention on the a‘s.

ARMA / IIR

If each output depends on a linear combination of a finite number of previous inputs and outputs

y[n] = b0 x[n] + b1 x[n-1] + … + bq x[n-q] + a1 y[n-1] + … + ap y[n-p]

then time series analysis would call this an autoregressive moving average (ARMA) model of order (p, q), i.e. p AR terms and q MA terms.

Digital signal processing would call this an infinite impulse response (IIR) filter with q feedforward coefficients and p feedback coefficients. Also, as above, you may see the opposite sign convention on the a‘s.
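One way to keep the conventions straight is to compare a direct implementation of the ARMA difference equation with SciPy’s lfilter, which follows the DSP convention and so needs the feedback coefficients negated. A small sketch with arbitrary coefficients:

    import numpy as np
    from scipy.signal import lfilter

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000)

    b = [1.0, 0.5]    # feedforward (MA) coefficients b0, b1
    a1 = 0.3          # feedback (AR) coefficient in the time series convention above

    # Direct implementation of y[n] = b0 x[n] + b1 x[n-1] + a1 y[n-1]
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = b[0]*x[n]
        if n >= 1:
            y[n] += b[1]*x[n-1] + a1*y[n-1]

    # Same filter via lfilter: note the sign flip on the feedback coefficient
    y2 = lfilter(b, [1.0, -a1], x)
    print(np.allclose(y, y2))   # True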

ARMA notation

Box and Jenkins use a‘s for input and z‘s for output. We’ll stick with x‘s and y’s to make the comparison to DSP easier.

Using the backward shift operator B that takes a sample at n to the sample at n-1, the ARMA system can be written

φ(B) y[n] = θ(B) x[n]

where φ and θ are polynomials

φ(B) = 1 − φ1B − φ2B² − … − φpB^p

and

θ(B) = 1 − θ1B − θ2B² − … − θqB^q.

System function notation

In DSP, filters are described by their system function, the z-transform of the impulse response. In this notation (as in Oppenheim and Schafer, for example) we have

H(z) = \frac{\sum_{k=0}^q b_k z^{-k}}{1 - \sum_{k=1}^p a_k z^{-k}}

The φk in Box and Jenkins correspond to the ak in Oppenheim and Schafer. The θk correspond to the (negative) bk.

The system function H(z) corresponds to θ(1/z) / φ(1/z).


How to eliminate the first order term from a second order ODE

Authors will often say that “without loss of generality” they will assume that a differential equation has no first order derivative term. They’ll explain that there’s no need to consider

y'' + p(x) y' + q(x) y = 0

because a change of variables can turn the above equation into one of the form

y'' + q(x) y = 0

While this is true, the change of variables is seldom spelled out. I’ll give the change of variables explicitly here in case this is helpful to someone in the future. Define u(x) and r(x) by

u(x) = \exp\left( \frac{1}{2} \int^x p(t)\,dt\right ) y(x)

and

r(x) = q(x) - \frac{1}{2}p'(x) - \frac{1}{4}p(x)^2

With this change of variables

u'' + r(x) u = 0

Proof: Calculate u'' + ru and use the fact that y satisfies the original differential equation. The calculation is tedious but routine.

Example: Suppose we start with

xy'' + y' + x^3 y = 0

Then dividing by x we get

y'' + \frac{1}{x}y' + x^2 y = 0

Now applying the change of variables above gives

u'' + \left(x^2 + \frac{1}{4x^2}\right)u = 0

and our original y is u/√x.
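The “tedious but routine” calculation can be delegated to a computer algebra system. Here’s a SymPy sketch that checks the example above: it forms u'' + ru, substitutes the original equation for y'', and simplifies to zero.

    from sympy import symbols, Function, sqrt, simplify

    x = symbols('x', positive=True)
    y = Function('y')

    p = 1/x
    q = x**2
    r = q - p.diff(x)/2 - p**2/4          # r(x) = x^2 + 1/(4x^2)

    u = sqrt(x)*y(x)                      # exp((1/2) * integral of 1/t dt) = sqrt(x)
    residual = u.diff(x, 2) + r*u

    # Use y'' = -(p y' + q y) from the original equation
    residual = residual.subs(y(x).diff(x, 2), -(p*y(x).diff(x) + q*y(x)))
    print(simplify(residual))             # prints 0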

Common words that have a technical meaning in math

Mathematical writing is the opposite of business writing in at least one respect. Math uses common words as technical terms, whereas business coins technical terms to refer to common ideas.

There are a few math terms I use fairly often and implicitly assume readers understand. Perhaps the most surprising is almost as in “almost everywhere.” My previous post, for example, talks about something being true for “almost all x.”

The term “almost” sounds vague but it actually has a precise technical meaning. A statement is true almost everywhere, or holds for almost all x, if the set of points where it doesn’t hold has measure zero.

For example, almost all real numbers are irrational. There are infinitely many rational numbers, and so there are a lot of exceptions to the statement “all real numbers are irrational,” but the set of exceptions has measure zero [1].

In common parlance, you might use ball and sphere interchangeably, but in math they’re different. In a normed vector space, the set of all points of norm no more than r is the ball of radius r. The set of all points with norm exactly r is the sphere of radius r. A sphere is the surface of a ball.

The word smooth typically means “infinitely differentiable,” or depending on context, differentiable as many times as you need. Often there’s no practical loss of generality in assuming something is infinitely differentiable when all you really need is, say, three derivatives [2]. For example, a manifold whose charts are once differentiable can always be altered slightly to be infinitely differentiable.

The words regular and normal are used throughout mathematics as technical terms, and their meaning changes completely depending on context. For example, in topology regular and normal are two kinds of separation axioms. They tell you whether a topology has enough open sets to separate a point from a closed set or separate two closed sets from each other.

When I use normal I’m most often talking about a normal (i.e. Gaussian) probability distribution. I don’t think I use regular as a technical term that often, but when I do it probably means something like smooth, but more precise. A regularity result in differential equations, for example, tells you what sort of desirable properties a solution has: whether it’s a classical solution or only a weak solution, whether it’s continuous or differentiable, etc.

While I’m giving a sort of reader’s guide to my terminology, log always refers to natural log and trig functions are always in radians unless noted otherwise. More on that here.

* * *

The footnotes below are much more technical than the text above.

[1] Here’s a proof that any countable set of points has measure zero. Pick any ε > 0. Put an open interval of width ε/2 around the first point, an interval of width ε/4 around the second point, an interval of width ε/8 around the third point, etc. This covers the countable set of points with a cover of total measure at most ε, and since ε was arbitrary, the set of points must have measure 0.

The irrational numbers are uncountable, but that’s not why they have positive measure. A countable set has measure zero, but a set of measure zero may be uncountable. For example, the Cantor set is uncountable but has measure zero. Or to be more precise, I should say the standard Cantor set has measure zero. There are other Cantor sets, i.e. sets homeomorphic to the standard Cantor set, that have positive measure. This shows that “measure zero” is not a topological property.

[2] I said above that often it doesn’t matter how many times you can differentiate a function, but partial differential equations are an exception to that rule. There you’ll often care exactly how many (generalized) derivatives a solution has. And you’ll obsess over exactly which powers of the function or its derivatives are integrable. The reason is that a large part of the theory revolves around embedding theorems, whether this function space embeds in that function space. The number of derivatives a function has and the precise exponents p for the Lebesgue spaces they live in matter a great deal. Existence and uniqueness of solutions can hang on such fine details.

A uniformly distributed sequence

If you take the fractional parts of the set of numbers {n cos nx :  integer n > 0} the result is uniformly distributed for almost all x. That is, in the limit, the number of times the sequence visits a subinterval of [0, 1] is proportional to the length of the interval. (Clearly it’s not true for all x: take x = 0, for instance. Or any rational number times π.)

The proof requires some delicate work with Fourier analysis that I’ll not repeat here. If you’re interested in the proof, see Theorem 4.4 of Uniform Distribution of Sequences.

This is a surprising result: why should such a strange sequence be uniformly distributed? Let’s look at a histogram to see whether the theorem is plausible.

 

Histogram for an = n

OK. Looks plausible.
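Here’s a small sketch of the experiment behind a histogram like this, using an arbitrary “generic” value of x.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.sqrt(2)                          # an arbitrary "typical" value of x
    n = np.arange(1, 100001)
    fractional = (n * np.cos(n*x)) % 1.0    # fractional parts in [0, 1)

    plt.hist(fractional, bins=50, density=True)
    plt.title("Fractional parts of n cos(nx) for x = sqrt(2)")
    plt.show()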

But there’s a generalization that’s even more surprising. Let an be any increasing sequence of integers. Then the fractional parts of an cos anx are uniformly distributed for almost all x.

Any increasing sequence of integers? Like the prime numbers, for example? Apparently so. The result produces a very similar histogram.

Histogram for an = nth prime

But this can’t hold for just any sequence. Surely you could pick an integer sequence to thwart the theorem. Pick an x, then let an be the increasing sequence of integers n for which the fractional part of n cos nx is less than 0.5. Then the fractional parts of an cos anx aren’t uniformly distributed because they never visit the right half of the unit interval!

Where’s the contradiction? The theorem doesn’t start by saying “For almost all x ….” It starts by saying “For any increasing sequence an ….” That is, you don’t get to pick x first. You pick the sequence first, then you get a statement that is true for almost all x. The theorem is true for every increasing sequence of integers, but the exceptional set of x‘s may be different for each integer sequence.

Applying probability to non-random things

Probability has surprising uses, including applications to things that absolutely are not random. I’ve touched on this a few times. For example, I’ve commented on how arguments about whether something is really random are often moot: Random is as random does.

This post will take non-random uses for probability in a different direction. We’ll start with discussion of probability and end up proving a classical analysis result.

Let p be the probability of success on some experiment and let Xn count the number of successes in n independent trials. The expected value of Xn is np, and so the expected value of Xn/n is p.

Now let f be a continuous function on [0, 1] and think about f(Xn/n). As n increases, the distribution of Xn/n clusters more and more tightly around p, and so f(Xn/n) clusters more and more around f(p).

Let’s illustrate this with a simulation. Let p = 0.6, n = 100, and let f(x) = cos(x).

Here’s a histogram of random samples from Xn/n.

Histogram of samples from Bin(100, 0.6)

And here’s a histogram of random samples from f(Xn/n). Note that it’s centered near cos(p) = 0.825.

Histogram of samples from cos( Bin(100, 0.6) )
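Here’s a sketch of the simulation, with the same parameters as above.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    p, n, samples = 0.6, 100, 10000

    X = rng.binomial(n, p, size=samples)   # X_n: number of successes in n trials
    ratios = X / n                         # X_n / n clusters around p
    values = np.cos(ratios)                # f(X_n / n) clusters around cos(p)

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(ratios, bins=30)
    ax1.set_title("X_n / n")
    ax2.hist(values, bins=30)
    ax2.set_title("cos(X_n / n)")
    plt.show()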

 

As n gets large, the expected value of f(Xn/n) converges to f(p). (You could make all this all rigorous, but I’ll leave out the details to focus on the intuitive idea.)

Now write out the expected value of f(Xn/n).

E\left[ f\left(\frac{X_n}{n} \right ) \right ] = \sum_{k=0}^n f\left(\frac{k}{n} \right ) {n \choose k} p^k(1-p)^{n-k}

Up to this point we’ve thought of p as a fixed probability, and we’ve arrived at an approximation result guided by probabilistic thinking. But the expression above is approximately f(p) regardless of how we think of it. So now let’s think of p varying over the interval [0, 1].

E[ f(Xn/n) ] is an nth degree polynomial in p that is close to f(p). In fact, as n goes to infinity, E[ f(Xn/n) ] converges uniformly to f(p).

So for an arbitrary continuous function f, we’ve constructed a sequence of polynomials that converge uniformly to f. In other words, we’ve proved the Weierstrass approximation theorem! The Weierstrass approximation theorem says that there exists a sequence of polynomials converging to any continuous function f on a bounded interval, and we’ve explicitly constructed such a sequence. The polynomials E[ f(Xn/n) ] are known as the Bernstein polynomials associated with f.

This is one of my favorite proofs. When I saw this in college, it felt like the professor was pulling a rabbit out of a hat. We’re talking about probability when all of a sudden, out pops a result that has nothing to do with randomness.

Now let’s see an example of the Bernstein polynomials approximating a continuous function. Again we’ll let f(x) = cos(x). If we plot the Bernstein polynomials and cosine on the same plot, the approximation is good enough that it’s hard to see what’s what. So instead we’ll plot the differences, cosine minus the Bernstein approximations.

Error in approximating cosine by Bernstein polynomials

As the degree of the Bernstein polynomials increases, the error decreases. Also, note the vertical scale. Cosine of zero is 1, and so the errors here are small relative to the size of the function being approximated.
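Here’s a short sketch of how such an error plot could be produced, evaluating E[ f(Xn/n) ] directly from the binomial pmf. The degrees shown are arbitrary.

    import numpy as np
    from scipy.stats import binom
    import matplotlib.pyplot as plt

    def bernstein(f, n, ps):
        """Degree-n Bernstein polynomial of f, i.e. E[f(X_n/n)] with X_n ~ Bin(n, p)."""
        k = np.arange(n + 1)
        return np.array([np.sum(f(k/n) * binom.pmf(k, n, p)) for p in ps])

    ps = np.linspace(0, 1, 200)
    for n in [5, 10, 20, 40]:
        plt.plot(ps, np.cos(ps) - bernstein(np.cos, n, ps), label=f"n = {n}")
    plt.legend()
    plt.title("cos(p) minus its Bernstein approximations")
    plt.show()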