Random sampling to save money

I was stunned when my client said that a database query that I asked them to run would cost the company $100,000 per year. I had framed my question in the most natural way, not thinking that at the company’s scale it would be worth spending some time thinking about the query.

Things have somewhat come full circle. When computers were new, computer time was expensive and programmer time was dirt cheap by comparison. Someone might be scolded for using computer time to do what a programmer could do. Now of course programmer time is expensive and computer time is cheap. Usually.

But when dealing with gargantuan data sets, the cost of computer time might matter a great deal, especially when renting services from a cloud provider. Maybe you can find a more clever way to run an enormous query. Or even better, maybe you can find a way to avoid running an enormous query.

One way to save time and money is to base decisions on random samples rather than exhaustive queries. This requires a little preparation. How big a sample do you need? How exactly are you going to take a sample? What kind of uncertainty do you have in your result when you’re done? You can afford to think about these questions a long time if it saves tens of thousands of dollars per year.
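As a toy illustration, here is a hedged sketch in Python. The population, the condition being counted, and the sample size are all made up; the point is that a modest random sample yields an estimate whose uncertainty you can quantify before deciding whether the full query is worth the cost.

    import numpy as np

    rng = np.random.default_rng()
    # Hypothetical stand-in for a huge table: which records satisfy some condition?
    population = rng.random(10_000_000) < 0.3

    n = 10_000                                        # sample size (made up)
    idx = rng.choice(population.size, size=n, replace=False)
    p_hat = population[idx].mean()                    # estimated proportion
    margin = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)  # ~95% normal-approximation half-width
    print(f"estimated proportion: {p_hat:.3f} +/- {margin:.3f}")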

Aesthetic uses of Latin squares

We think we like randomness in design, but we don’t exactly. People like things that are sorta random, but not too random. When you literally scatter things randomly, they look too clumped [1].

There are many ways around this problem, variations on randomness that people find more aesthetically pleasing. One of these ways is random Latin squares.

A Latin square of size n is an n × n grid filled with n different things (numbers, letters, colors, etc.) such that each kind of thing appears exactly once in each row and in each column. For example, a 26 × 26 grid filled with the letters of the English alphabet is a Latin square if every row and every column contains the entire alphabet; no letters repeated and no letters missing.

A random Latin square is an arrangement that is random but subject to the constraint that it be a Latin square. Random Latin squares look nice because they’re somewhat random but they avoid certain kinds of repetition.

Suppose you have five different colors of tile and you need to lay down the tiles in a 10 × 20 grid. You want to lay down the tiles “randomly” but you don’t want too many of any one color to appear in a small area.

One way to do this would be to divide the 10 × 20 space into a grid of grids. Each subgrid is a 5 × 5 Latin square. Here’s an example.

The image above is a 2 × 4 grid of 5 × 5 grids, and each of the 5 × 5 grids is a Latin square chosen at random.

Here’s another example with five different colors and different random Latin squares.

If you look carefully, you’ll spot a few places where tiles of the same color touch on an edge. This cannot happen within a Latin square, but it can happen where two different Latin squares are next to each other. So the randomization scheme described here allows for a few tiles of the same color to touch but not many.
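Here is a rough sketch of how one might generate such a tiling in Python. Shuffling the rows, columns, and symbols of a cyclic square does not sample uniformly from all Latin squares, but each result is a valid Latin square and is random enough for decorative purposes.

    import numpy as np

    def random_latin_square(n, rng):
        # Start from the cyclic square L[i, j] = (i + j) mod n, then shuffle
        # rows, columns, and symbols. Each shuffle preserves the Latin property.
        L = (np.arange(n)[:, None] + np.arange(n)[None, :]) % n
        L = L[rng.permutation(n), :][:, rng.permutation(n)]
        return rng.permutation(n)[L]

    rng = np.random.default_rng()
    # A 2 x 4 arrangement of 5 x 5 Latin squares gives a 10 x 20 grid of 5 colors.
    tiles = np.block([[random_latin_square(5, rng) for _ in range(4)]
                      for _ in range(2)])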

I expect randomized Latin squares are used this way fairly often. You’d never know, and that’s kinda the point. Someone would have to stare at a pattern like this a long time before they realized that horizontal or vertical segments of five do not repeat a color.


[1] One way to detect fraud is to look for fake randomness. When people try to make something look random, they usually fail and leave statistically detectable fingerprints.

Yule statistics Y and Q

I recently wrote about the Yule-Simon distribution. The same Yule, George Udny Yule, is also known for the statistics Yule’s Y and Yule’s Q. The former is also known as the coefficient of colligation, and the latter is also known as the Yule coefficient of association.

Both measure how strongly two binary variables are associated. Given a 2 × 2 contingency table

\begin{tabular} {|c|c|} \hline $a$ & $b$ \\ \hline $c$ & $d$ \\ \hline \end{tabular}

with non-negative entries, Yule’s Y is defined by

Y = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}

and Yule’s Q is defined by

Q = \frac{ad - bc}{ad + bc}

Both essentially measure how much bigger ad is than bc but are weighted differently.

The algebraic properties of these two statistics may be more interesting than their statistical properties. For starters, both Y and Q produce values in the interval (-1, 1), and each can be expressed in terms of the other:

\begin{align*} Q &= \frac{2Y}{1 + Y^2} \\ Y &= \frac{1 - \sqrt{1 - Q^2}}{Q} \end{align*}

There’s some simple but interesting algebra going on here. Define

f(z) = \frac{1-z}{1+z}

This simple function comes up surprisingly often. It’s a Möbius transformation, and it comes up in diverse applications such as designing antennas and mental math shortcuts. More on that here.

If we define

z = \frac{bc}{ad}

and

W_p = f(z^p)

then W generalizes both Q and Y: setting p = 1 gives us Q and setting p = 1/2 gives us Y.
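Here’s a quick numerical check of that claim, with a made-up 2 × 2 table, confirming that W_1 recovers Q and W_{1/2} recovers Y. The table entries are arbitrary; this is only a sanity-check sketch.

    import numpy as np

    def f(z):
        return (1 - z) / (1 + z)

    a, b, c, d = 12.0, 5.0, 7.0, 20.0     # arbitrary 2 x 2 table
    z = b*c / (a*d)

    Q = (a*d - b*c) / (a*d + b*c)
    Y = (np.sqrt(a*d) - np.sqrt(b*c)) / (np.sqrt(a*d) + np.sqrt(b*c))

    print(np.isclose(f(z), Q))        # W_1 = Q
    print(np.isclose(f(z**0.5), Y))   # W_{1/2} = Y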

Given the value of W with subscript p, we could easily solve for the value of W with another subscript q, analogous to solving Q for Y and Y for Q above.

W_p = f(z^p) \xrightarrow{\;f\;} z^p \xrightarrow{\;(\,\cdot\,)^{q/p}\;} z^q \xrightarrow{\;f\;} f(z^q) = W_q

If you’re expecting f⁻¹ rather than f over the first arrow, you’re right, but f is its own inverse so we could just write f instead.


Comparing three discrete power laws

Yesterday I wrote about the zeta distribution and the Yule-Simon distribution and said that they, along with the Zipf distribution, are all discrete power laws. This post fills in detail for that statement.

The probability mass functions for the zeta, Zipf, and Yule-Simon distributions are as follows.

\begin{align*} f_\zeta(k; s) &= k^{-s} / \zeta(s) \\ f_y(k; \rho) &= \rho B(k, \rho+1) \\ f_z(k; N, s) &= k^{-s} / H_{N,s} \end{align*}

Here the subscripts ζ, y, and z stand for zeta, Yule, and Zipf respectively. The distribution parameters follow after the semicolon.

Comparing Zipf and zeta

The Zipf distribution is only defined on the first N positive integers. The normalizing constant for the Zipf distribution is the Nth generalized harmonic number with exponent s.

H_{N,s} = \sum_{k=1}^N k^{-s}

As N goes to infinity, H_{N,s} converges to ζ(s); this is the definition of the ζ function. So the Zipf and zeta distributions are asymptotically equal.

Comparing zeta and Yule

This post showed that for large k,

f_y(k; \rho) \approx \rho \Gamma(\rho + 1) \frac{1}{(k+\rho)^{\rho + 1}}

That is, f_y(k; ρ) is proportional to a power of k, except that k is shifted by an amount ρ.

To compare the zeta and Yule distributions we’ll need to compare zeta with s+1 to Yule with ρ in order to make the exponents agree, and we’ll need to shift the zeta distribution by s+1.

When we plot the ratio of the pmfs, we see that the two distributions are asymptotically proportional as k gets large.

In this plot s = ρ = 1.5.

The ratio is approaching a constant, and in fact the limit is

\frac{1}{ \rho \,\Gamma(\rho + 1)\,\zeta(s+1)}

based on the ratio of the proportionality constants.
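Here is a hedged sketch that checks this limit numerically, using the pmf formulas above and assuming scipy is available.

    import numpy as np
    from scipy.special import beta, gamma, zeta

    s = rho = 1.5
    k = np.arange(1, 100001)
    f_zeta = k**-(s + 1) / zeta(s + 1)      # zeta pmf with parameter s + 1
    f_yule = rho * beta(k, rho + 1)         # Yule-Simon pmf with parameter rho
    ratio = f_zeta / f_yule

    limit = 1 / (rho * gamma(rho + 1) * zeta(s + 1))
    print(ratio[-1], limit)                 # both are approximately 0.374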

Comment

Note that the three distributions are asymptotically proportional, given the necessary shift in k, but in different ways. The Zipf distribution converges to the zeta distribution, for every k, as N goes to infinity. The zeta and Yule distributions are asymptotically proportional as k goes to infinity. So one proportion is asymptotic in a parameter N and one is asymptotic in an argument k.

The zeta distribution

The previous post on the Yule-Simon distribution mentioned the zeta distribution at the end. This is a power law distribution on the positive integers with normalizing constant given by the Riemann zeta function. That is, the zeta distribution has probability mass function

f(k; s) = k^{-s} / \zeta(s)

where k is a positive integer and s > 1 is a fixed parameter.

For s > 1, the zeta function is defined as the sum of the positive integers to the power negative s, so ζ(s) is essentially defined as the normalizing constant of the zeta distribution.

I wanted to make a couple comments about this. First, it shows that the zeta function appears in applications outside of number theory. Second, when working with the zeta distribution, it would be useful to have an estimate for ζ(s).

Zeta function bounds

The integral test in calculus is typically presented as a way to test whether an infinite sum converges. This is a shame. In analysis as in statistics, estimation is better than testing. Testing throws away continuous information and replaces it with a single bit.

Let f(x) be a decreasing function; we have in mind f(x) = 1/x^s. Then for any n,

\int_{n+1}^\infty f(x)\, dx \leq \sum_{k = n+1}^\infty f(k) \leq \int_{n}^\infty f(x)\, dx

To estimate a sum over k, we sum the first n terms directly and apply the inequalities above for the rest. Typically n will be small, maybe 0 or 1. A larger value of n will give more accurate bounds but at the expense of a more complicated expression.

When f(x) = 1/x^s and n is at least 1, we have

\frac{(n+1)^{1-s}}{s-1} \leq \sum_{k = n+1}^\infty f(k) \leq \frac{n^{1-s}}{s-1}

This says

\sum_{k=1}^n k^{-s} + \frac{(n+1)^{1-s}}{s-1} \leq \zeta(s) \leq \sum_{k=1}^n k^{-s} + \frac{n^{1-s}}{s-1}

This gives a good lower bound but a bad upper bound when we choose n = 1.

But it gives a much better upper bound when we choose n = 2.
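Here is a hedged sketch comparing the bounds for n = 1 and n = 2 to the true value, assuming scipy is available and picking s = 1.5 arbitrarily.

    from scipy.special import zeta

    def zeta_bounds(s, n):
        # Sum the first n terms directly and bound the tail by the integrals above.
        head = sum(k**-s for k in range(1, n + 1))
        lower = head + (n + 1)**(1 - s) / (s - 1)
        upper = head + n**(1 - s) / (s - 1)
        return lower, upper

    s = 1.5
    print(zeta_bounds(s, 1))   # (2.414..., 3.0)
    print(zeta_bounds(s, 2))   # (2.508..., 2.767...)
    print(zeta(s))             # 2.612...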


Yule-Simon distribution

The Yule-Simon distribution, named after Udny Yule and Herbert Simon, is a discrete probability distribution with pmf

f(k; \rho) = \rho B(k, \rho + 1)

The semicolon in f(k; ρ) suggests that we think of f as a function of k, with ρ a fixed parameter. Writing the distribution this way shows its connection to the beta function, but for our purposes it will be helpful to expand the beta function using

B(x, y) = \frac{\Gamma(x) \, \Gamma(y)}{\Gamma(x+y)}

and so

\begin{align*} f(k; \rho) &= \rho B(k, \rho + 1) \\ &= \rho \Gamma(\rho + 1) \,\,\frac{\Gamma(k)}{\Gamma(k + \rho + 1)} \\ &= \rho \Gamma(\rho + 1) \,\, \frac{1}{(k + \rho)^{\underline{\rho + 1}}} \end{align*}

Ignore the first part of the last line, ρ Γ(ρ + 1), because it doesn’t involve k. It helps to ignore proportionality constants in probability densities when they’re not necessary. What’s left is the (ρ + 1) falling power of k + ρ.
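The three expressions above are equal, as the following hedged check confirms numerically (scipy assumed, ρ chosen arbitrarily). The falling power of k + ρ to the ρ + 1 equals Γ(k + ρ + 1)/Γ(k), which scipy exposes as the Pochhammer symbol poch(k, ρ + 1).

    import numpy as np
    from scipy.special import beta, gamma, poch

    rho = 2.5
    k = np.arange(1, 6)

    pmf_beta    = rho * beta(k, rho + 1)
    pmf_gammas  = rho * gamma(rho + 1) * gamma(k) / gamma(k + rho + 1)
    pmf_falling = rho * gamma(rho + 1) / poch(k, rho + 1)

    print(np.allclose(pmf_beta, pmf_gammas))    # True
    print(np.allclose(pmf_beta, pmf_falling))   # True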

For large values of k, the falling power term is asymptotically equal to k^{ρ+1}. To see this, let k = 1000 and ρ = 3. Then we’re saying that the ratio of

1003 × 1002 × 1001 × 1000

to

1000 × 1000 × 1000 × 1000

is approximately 1, and the ratio converges to 1 as k increases.

This says that the Yule-Simon distribution is a power law in the tails, just like the Zipf distribution and the zeta distribution. Details of the comparison between these three distributions are given here.


Probability of a magical permutation

Take a permutation of the numbers 1 through n² and lay out the elements of the permutation in a square. We will call a permutation a magic permutation if the corresponding square is a magic square. What is the probability that a permutation is a magic permutation? That is, if you fill a grid randomly with the numbers 1 through n², how likely are you to get a magic square?

For example, we could generate a random permutation in Python and see whether it forms a magic square.

>>> import numpy as np
>>> np.random.permutation(9).reshape(3,3)
array([[4, 7, 3],
       [2, 6, 5],
       [8, 0, 1]])

The first row adds up to 14 and the second row adds up to 13, so this isn’t a magic square. (The code fills the square with 0 through 8 rather than 1 through 9, but adding 1 to every entry doesn’t change whether the square is magic.) We could write a script, like the sketch below, to do this over and over and check whether the result is a magic square.
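Here is a rough sketch of such a script. The helper below checks rows, columns, and both diagonals against a common sum; for n = 3 you would expect only a couple dozen hits per million trials.

    import numpy as np

    def is_magic(sq):
        target = sq[0].sum()
        sums = list(sq.sum(axis=0)) + list(sq.sum(axis=1))
        sums += [np.trace(sq), np.trace(np.fliplr(sq))]
        return all(s == target for s in sums)

    rng = np.random.default_rng()
    n, trials = 3, 1_000_000
    hits = sum(is_magic(rng.permutation(n*n).reshape(n, n)) for _ in range(trials))
    print(hits / trials)   # compare to 8/9! = 2.2e-5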

The exact number of magic squares of size n is known for n up to 5, and we have Monte Carlo estimates for larger values of n.

The number of unique magic squares, modulo rotations and reflections, of size 1 through 5 is

1, 0, 1, 880, 275305224.

For n > 2 the total number of magic squares, counting rotations and reflections as different squares, is 8 times larger than the numbers above. This is because the group of rotations and reflections of a square, D4, has 8 elements.

The probability that a permutation of the numbers 1 through 9, arranged in a square, gives a magic square is 8/9!. The corresponding probability for the numbers 1 through 16 is 8 × 880/16!, and for the numbers 1 through 25 we have 8 × 275305224/25!.

    |---+-------------|
    | n | Prob(magic) |
    |---+-------------|
    | 3 | 2.20459e-05 |
    | 4 | 3.36475e-10 |
    | 5 | 1.41990e-16 |
    |---+-------------|

It looks like the exponents are in roughly a linear progression, so maybe you could fit a line fairly well to the points on a logarithmic scale. In fact, empirical studies suggest that the probability that a permutation of the first n² positive integers is magic decreases a little faster than exponentially.

We could fit a linear regression to the logs of the numbers above to come up with an estimate for the result for n = 6. We expect the estimate could be pretty good, and likely an upper bound on the correct answer. Let’s see what actually happens with a little R code.

    > x <- c(3,4,5)
    > y <- 8*c(1/factorial(9), 880/factorial(16), 275305224/factorial(25))
    > m <- lm(log(y) ~ x)
    > predict(m, data.frame(x=c(6)))
    1
    -48.77694

This says we’d estimate that the natural log of the probability that a permutation of the first 6² positive integers is magic is -48.77694. So the estimate of the probability itself is

exp(-48.77694) = 6.55306 × 10⁻²²

We don’t know the number of magic squares of size n = 6, but the number of distinct squares has been estimated [1] at

(0.17745 ± 0.00016) × 10²⁰

and so the total number including rotations and reflections would be 8 times higher. This says we’d expect our probability to be around

8 × 0.17745 × 10²⁰ / 36! ≈ 3.8162 × 10⁻²²

So our estimate is off by about a factor of 2, and as predicted it does give an upper bound.

Looking back at our regression model, the slope is -12.88 and the intercept is 28.53. This suggests that an upper bound of the probability of a permutation of size n² being magical is

exp(28.53 - 12.88n).

In closing I’d like to point out that the estimate for n = 6 that we’re referencing above did not come from simply permuting 36 integers over and over and counting how many of the permutations correspond to magic squares. That would take far too long. It’s quite likely that after the first billion billion tries you would not have seen a magic permutation. To estimate such a small number to four significant figures requires a more sophisticated Monte Carlo procedure.


[1] See OEIS sequence A006052

Degree of magic

A square grid of distinct integers is a magic square if all its rows, columns, and full diagonals have the same sum. Otherwise it is not a magic square.

Now suppose we fill a square grid with samples from a continuous random variable. The probability that the entries are distinct is 1, but the probability that the square is magic is 0.

We could make this more interesting by asking how close to magic the square is, using a continuous measure of degree of magic, rather than simply asking whether the square is magic or not.

If we have an n by n square of real numbers, we could take the n row sums, n column sums, and 2 diagonal sums as data and look at their range or variance. These statistics would be zero for a magic square and small for a nearly magic square.

If each element of the square is a sample from a standard normal random variable, what is the distribution of these statistics? How much does the distribution on the numbers matter?
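Here is a hedged sketch of one such experiment, using the variance of the row, column, and diagonal sums as the statistic; the square size and number of samples are arbitrary.

    import numpy as np

    def magic_deviation(sq):
        # Variance of the n row sums, n column sums, and 2 diagonal sums.
        sums = np.concatenate([sq.sum(axis=0), sq.sum(axis=1),
                               [np.trace(sq), np.trace(np.fliplr(sq))]])
        return sums.var()

    rng = np.random.default_rng()
    n, reps = 4, 10_000
    samples = [magic_deviation(rng.standard_normal((n, n))) for _ in range(reps)]
    print(np.mean(samples), np.std(samples))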

Gamma distribution tail probability approximation

This post approximates the tail probability of a gamma random variable using the heuristic given in the previous post.

The gamma distribution

Start with the integral defining Γ(a). Divide the integrand by Γ(a) so that it integrates to 1. This makes the integrand into a probability density, and the resulting probability distribution is called the gamma distribution with shape a. To evaluate the probability of this distribution’s tail, we need to integrate

\frac{1}{\Gamma(a)} \int_x^\infty t^{a-1} \exp(-t)\, dt

This distribution has mean and variance a, and mode a-1.

We’re interested in approximating the integral above for moderately large x relative to a.

Approximation

To apply the 1/e heuristic we need to find a value of t such that

t^{a-1} \exp(-t) = x^{a-1} \exp(-x)/e = x^{a-1} \exp(-x - 1)

This equation would have to be solved numerically, but let’s try something very simple first and see how it works. Let’s assume the t^{a-1} term doesn’t change as much as the exp(-t) term between x and x + 1. This says an approximate solution to our equation for t is simply

t = x + 1

and so the base of our rectangle runs from x to x + 1, i.e. has width 1. So the area of our rectangle is simply its height. This gives us

\frac{1}{\Gamma(a)} \int_x^\infty t^{a-1} \exp(-t)\, dt \approx \frac{x^{a-1}\exp(-x)}{\Gamma(a)}

In other words, the tail probability (the complementary CDF) is approximately the PDF. That’s all nice and simple, but how good is it? Here’s a plot for a = 5.
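Here is a hedged sketch of how one might produce such a comparison, assuming scipy is available.

    import numpy as np
    from scipy.special import gamma as gamma_fn
    from scipy.stats import gamma as gamma_dist

    a = 5
    x = np.linspace(1, 20, 200)
    approx = x**(a - 1) * np.exp(-x) / gamma_fn(a)   # rectangle approximation
    exact = gamma_dist(a).sf(x)                      # exact tail probability
    print(np.max(np.abs(exact - approx)))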

So is this good or not? It’s not a question of whether it’s good but of when: it’s a poor approximation for some parameters but a very good approximation for others. The approximation improves when a is closer to 1 and when x is larger.

We were kinda sloppy in solving for t above. Would it help if we had a more accurate value of t? No, that doesn’t make much difference, and it actually makes things a little worse. The important issue is to identify over what range of a and x the approximation works well.

Asymptotic series

It turns out that our crude integration estimate happens to correspond to the first term in an asymptotic series for our integral. If X is a gamma random variable with shape a then we have the following asymptotic series for the tail probability.

P(X > x) = \frac{x^{a-1}e^{-x}}{\Gamma(a)}\left( 1 + \frac{a-1}{x} + \frac{(a-1)(a-2)}{x^2} + \cdots\right)

So if we truncate the portion inside the parentheses to just 1, we have the approximation above. The approximation error from truncating an asymptotic series is approximately the first term that was left out.

So if x is much larger than a, the approximation error is small. Since the expected value of X is a, the tail probabilities correspond to values of x larger than a anyway. But if x is only a little larger than a, the approximation above won’t be very good, and adding more terms in the series won’t help either.
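As a hedged sanity check, assuming scipy is available and with parameters chosen arbitrarily, here is the effect of keeping one, two, or three terms of the series when x is a few times larger than a.

    import numpy as np
    from scipy.special import gamma as gamma_fn
    from scipy.stats import gamma as gamma_dist

    a, x = 5, 20.0
    lead = x**(a - 1) * np.exp(-x) / gamma_fn(a)
    terms = [1, (a - 1)/x, (a - 1)*(a - 2)/x**2]
    for m in range(1, 4):
        print(lead * sum(terms[:m]))    # partial sums of the series
    print(gamma_dist(a).sf(x))          # exact value for comparison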

Here’s a contour plot of the absolute error:

The deep blue area below the diagonal of the graph shows the error is smallest when x > a. In 3D, the plot has a high plateau in the top left and a valley in the lower right, with a rapid transition between the two, suggested by the density of contour lines near the diagonal.

Epilogue

In the previous post I said that the next post would give an example where the 1/e heuristic doesn’t work well. This post has done that. It’s also given an example when the heuristic works very well. Both are the same example but with different parameters. When x >> a the method works well, and the bigger x is relative to a, the better it works.

 

Simple normal distribution tail estimate

A very common task in probability and statistics is to estimate the probability of a normal random variable taking on a value larger than a given value x. By shifting and scaling we can assume our normal random variable has mean 0 and variance 1. This means we need to approximate the integral

 \frac{1}{\sqrt{2\pi}}\int_x^\infty \exp(-t^2/2)\, dt

We are interested in positive values of x, especially moderately large values of x.

We will estimate the integral above using the FWHM trick. This technique approximates the area under a curve by the area of a single rectangle.

The height of the rectangle is the maximum value of the integrand. The width is the distance between two points where the integrand takes on half its maximum value. The FWHM technique takes its name from the base of this rectangle: Full Width between points of Half Maximum.

In our example, the integrand is strictly decreasing over the region of integration. This means the maximum occurs at the beginning, at x, and there’s only one place where the integrand takes on half its maximum value. So the base of our rectangle runs from x to the point t where the integrand drops by a half.

To find t we need to solve

 \exp(-t^2/2) = \exp(-x^2/2)/2

to get

t = \sqrt{x^2 + 2 \log 2}

It follows that

\frac{1}{\sqrt{2\pi}}\int_x^\infty \exp(-t^2/2)\, dt \approx \frac{1}{\sqrt{2\pi}} \left(\sqrt{x^2 + 2 \log 2} - x\right) \exp(-x^2/2)

Note that the base of our rectangle, t − x, is on the order of 1/x, and so the estimate above is roughly similar to the rigorous bounds on normal tail probabilities given here.

So how good is the approximation? We can compute the error with the following Python code.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    x = np.linspace(1, 5, 100)

    # FWHM approximation to the normal tail probability
    approx = (2*np.pi)**-0.5 * np.exp(-x**2/2)
    approx *= ((x*x + 2*np.log(2))**0.5 - x)

    # Exact tail probability (survival function)
    exact = norm.sf(x)
    error = exact - approx

Here’s a plot of the error:

So the approximation error is small, and gets smaller as x increases. Depending on context, the error may be small enough.

However, Gaussian tail probabilities decrease rapidly, so we should ask whether the approximation error is small relative to the exact value. So we plot

    relative = error/exact

Here the results are not as impressive.

The relative error is on the order of 20%, and it actually increases as x increases. This tells us our approximation is not an asymptotic approximation. Still, it’s not bad for such a crude derivation.
