Conspicuously missing data

I was working on a report for a client this afternoon when I remembered this comic from Spiked Math.

Waitress: Does everyone want a beer? Logician 1: I don't know. Logician 2: I don't know. Logician 3: Yes!

I needed to illustrate the point that revealing information about one person or group can reveal information on other people or other groups. If you give your genetic information to a company, for example, you also give that company (and every entity they share your data with) information about your relatives.

This comic doesn’t illustrate the point I had in mind, but it does illustrate a related point. The third logician didn’t reveal the preferences of the first two, though it looks like that at first. Actually, the first two implicitly reported their own preferences.

If the first logician did not want a beer, he or she could have said “No” to the question “Does everyone want a beer?” Answering this question with “I don’t know” is tantamount to answering the question “Do you want a beer?” with “Yes.” What appears to be a non-committal answer is a definite answer on closer examination.

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

A naive approach to removing sensitive data can be about as effective as bleeping out profanity: it’s not hard to infer what was removed.


Order statistics for normal distributions

The last couple posts have touched on order statistics for normal random variables. I wrote the posts quickly and didn’t go into much detail, and more detail would be useful.

Given two integers n ≥ r ≥ 1, define E(r, n) to be the expected value of the rth order statistic of n samples from a standard normal random variable. That is, if we were to take n samples and sort them in increasing order, the expected value of the rth smallest sample is E(r, n).

We can compute E(r, n) exactly by

E(r, n) = \frac{n!}{(r-1)!\, (n-r)!} \int_{-\infty}^\infty x\, \Phi(x)^{r-1}\, (1 - \Phi(x))^{n-r}\, \phi(x) \, dx

where φ and Φ are the PDF and CDF of a standard normal random variable respectively. We can evaluate E(r, n) by computing this integral numerically.
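Here is a sketch of that numerical evaluation in Python; this is my own code using SciPy, not code from the original posts.

    # Sketch (mine): evaluate E(r, n) by numerical integration
    from math import factorial
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def E(r, n):
        c = factorial(n) / (factorial(r - 1) * factorial(n - r))
        f = lambda x: x * norm.cdf(x)**(r - 1) * (1 - norm.cdf(x))**(n - r) * norm.pdf(x)
        return c * quad(f, -np.inf, np.inf)[0]

    print(E(5, 5), E(1, 5))  # expected max and min of 5 samples, about +/- 1.163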

The previous posts have used

dn = E(n, n) – E(1, n) = 2 E(n, n).

The second equality above follows by symmetry.

We can compute dn in Mathematica as follows.

    (* standard normal CDF and PDF *)
    Phi[x_] := (1 + Erf[x/Sqrt[2]])/2
    phi[x_] := Exp[-x^2/2]/Sqrt[2 Pi]
    (* d[n] = 2 E(n, n), the expected range of n standard normal samples *)
    d[n_] := 2 n NIntegrate[
        x Phi[x]^(n - 1) phi[x], {x, -Infinity, Infinity}]

And we can reproduce the table here by

    Table[1/d[n], {n, 2, 10}]

Finally, we can see how dn behaves for large n by calling

    ListPlot[Table[d[n], {n, 1, 1000}]]

to produce the following graph.

See the next post for approximations to dn.

Range trick for larger sample sizes

The previous post showed that the standard deviation of a sample of size n can be well estimated by multiplying the sample range by a constant dn that depends on n.

The method works well for relatively small n. This may sound strange: typically statistical methods work better for large samples, not small ones. And indeed this method would work better for larger samples. However, we’re not so much interested in the efficiency of the method per se as in its efficiency relative to the standard way of estimating standard deviation. For small samples, neither method is very accurate, and the two appear to work about equally well.

If we want to use the range method for larger n, there are a couple of questions: how well does the method work, and how do we calculate dn?

Simple extension

Ashley Kanter left a comment on the previous post saying

dn = 3 log n^0.75 (where the log is base 10) seems to perform quite well even for larger n.

Ashley doesn’t say where this came from. Maybe it’s an empirical fit.

The constant dn is the expected value of the range of n samples from a standard normal random variable. You could find a (complicated) expression for this and then find a simpler expression using an asymptotic approximation. Maybe I’ll try that later, but for now I need to wrap up this post and move on to client work.

Note that

3 log10(n^0.75) = 0.977 loge(n)

and so we could use

dn = log n

where log is natural log. This seems like something that might fall out of an asymptotic approximation. Maybe Ashley empirically discovered the first term of a series approximation.

Update: See this post for a more detailed exploration of how well log n, square root of n, and another method approximate dn.

Update 2: Ashley Kanter’s approximation was supposed to be 3 (log10 n)^0.75 rather than 3 log10(n^0.75), and it is a very good approximation. This is also addressed in the link in the first update.
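Here’s a quick comparison of the exact dn, computed by numerical integration, with the two approximations above. This is my own sketch, not code from the comment thread.

    # Sketch (mine): exact d_n vs. the approximations 3 (log10 n)^0.75 and log n
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def d(n):
        # d_n = 2 E(n, n) = 2 n * integral of x Phi(x)^(n-1) phi(x) dx
        f = lambda x: x * norm.cdf(x)**(n - 1) * norm.pdf(x)
        return 2 * n * quad(f, -np.inf, np.inf)[0]

    for n in [10, 20, 50, 100]:
        print(n, d(n), 3*np.log10(n)**0.75, np.log(n))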

Simulation

Here’s some Python code to try things out.

    import numpy as np
    from scipy.stats import norm

    np.random.seed(20220309)

    n = 20
    for _ in range(5):
        x = norm.rvs(size=n)               # n samples from a standard normal
        w = x.max() - x.min()              # sample range
        print(x.std(ddof=1), w/np.log(n))  # sample std dev, range/log(n) estimate

And here are the results.

    |   std | w/d_n |
    |-------+-------|
    | 0.930 | 1.340 |
    | 0.919 | 1.104 |
    | 0.999 | 1.270 |
    | 0.735 | 0.963 |
    | 0.956 | 1.175 |

It seems the range method has more variance, though notice in the fourth row that the standard estimate can occasionally wander pretty far from the theoretical value of 1 as well.

We get similar results when we increase n to 50.

    |   std | w/d_n |
    |-------+-------|
    | 0.926 | 1.077 |
    | 0.889 | 1.001 |
    | 0.982 | 1.276 |
    | 1.038 | 1.340 |
    | 1.025 | 1.209 |

Not-so-simple extension

There are ways to improve the range method, if by “improve” you mean make it more accurate. One way is to divide the sample into random partitions, apply the method to each partition, and average the results. If you’re going to do this, partitions of size 8 are optimal [1]. However, the main benefit of the range method [2] is its simplicity.
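Here is a rough sketch of that idea; this is my own illustration using 1/d8 ≈ 0.351 for groups of size 8, not the exact procedure from [1].

    # Rough sketch (mine), not the exact procedure in [1]: split the sample
    # into random groups of 8, estimate sigma from each group's range using
    # 1/d_8 = 0.351, and average the estimates.
    import numpy as np

    def grouped_range_estimate(x, group_size=8, seed=None):
        rng = np.random.default_rng(seed)
        x = rng.permutation(x)
        k = len(x) // group_size                  # full groups; leftovers dropped
        groups = x[:k*group_size].reshape(k, group_size)
        ranges = groups.max(axis=1) - groups.min(axis=1)
        return 0.351 * ranges.mean()

    x = np.random.default_rng(20220309).normal(size=200)
    print(grouped_range_estimate(x, seed=0), x.std(ddof=1))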


[1] F. E. Grubbs and C. L. Weaver (1947). The best unbiased estimate of a population standard deviation based on group ranges. Journal of the American Statistical Association 42, pp. 224–241.

[2] The main advantage now is its simplicity. When the method was discovered, it reduced manual calculation, and so it could have been worthwhile to make it a little more complicated, as long as the calculation effort was still less than that of the standard method.

Estimating standard deviation from range

Suppose you have a small number of samples, say between 2 and 10, and you’d like to estimate the standard deviation σ of the population these samples came from. Of course you could compute the sample standard deviation, but there is a simple and robust alternative.

Let W be the range of our samples, the difference between the largest and smallest value. Think “w” for “width.” Then

W / dn

is an unbiased estimator of σ where the constants dn can be looked up in a table [1].

    |  n | 1/d_n |
    |----+-------|
    |  2 | 0.886 |
    |  3 | 0.591 |
    |  4 | 0.486 |
    |  5 | 0.430 |
    |  6 | 0.395 |
    |  7 | 0.370 |
    |  8 | 0.351 |
    |  9 | 0.337 |
    | 10 | 0.325 |

The values dn in the table were calculated from the expected value of W/σ for normal random variables, but the method may be used on data that do not come from a normal distribution.

Let’s try this out with a little Python code. First we’ll take samples from a standard normal distribution, so the population standard deviation is 1. We’ll draw ten samples at a time, five times, and estimate the standard deviation two ways each time: by the method above and by the sample standard deviation.

    from scipy.stats import norm, gamma

    for _ in range(5):
        x = norm.rvs(size=10)              # ten samples from a standard normal
        w = x.max() - x.min()              # sample range
        print(w*0.325, x.std(ddof=1))      # range estimate (1/d_10 = 0.325), sample std dev

Here’s the output:

    | w/d_n |   std |
    |-------+-------|
    | 1.174 | 1.434 |
    | 1.205 | 1.480 |
    | 1.173 | 0.987 |
    | 1.154 | 1.277 |
    | 0.921 | 1.083 |

Just from this example it seems the range method does about as well as the sample standard deviation.

For a non-normal example, let’s repeat our exercise using a gamma distribution with shape 4, which has standard deviation 2.
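Here is roughly what the code for this experiment looks like. This is my own sketch; I’m assuming unit scale, so the standard deviation is √4 = 2, and without a seed it won’t reproduce the exact numbers below.

    # Sketch (mine) of the gamma version of the experiment, assuming unit scale
    from scipy.stats import gamma

    for _ in range(5):
        x = gamma.rvs(4, size=10)          # shape 4, scale 1, so sigma = 2
        w = x.max() - x.min()              # sample range
        print(w*0.325, x.std(ddof=1))      # range estimate, sample std dev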

    | w/d_n |   std |
    |-------+-------|
    | 2.009 | 1.827 |
    | 1.474 | 1.416 |
    | 1.898 | 2.032 |
    | 2.346 | 2.252 |
    | 2.566 | 2.213 |

Once again, it seems both methods do about equally well. In both examples the uncertainty due to the small sample size is more important than the difference between the two methods.

Update: To calculate dn for other values of n, see this post.

[1] Source: H. A. David. Order Statistics. John Wiley and Sons, 1970.

Find log normal parameters for given mean and variance

Earlier today I needed to solve for log normal parameters that yield a given mean and variance. I’m going to save the calculation here in case I need it in the future or in case a reader needs it. The derivation is simple, but in the heat of the moment I’d rather look it up and keep going with my train of thought.

NB: The parameters μ and σ² of a log normal distribution are not the mean and variance of the distribution; they are the mean and variance of its log.

If m is the mean and v is the variance then

\begin{align*} m &= \exp(\mu + \sigma^2/2) \\ v &= (\exp(\sigma^2) - 1) \exp(2\mu + \sigma^2) \end{align*}

Notice that the square of m, namely exp(2μ + σ²), is the second factor in the expression for v.

Then

\frac{v}{m^2} = \exp(\sigma^2) -1

and so

\sigma^2 = \log\left(\frac{v}{m^2} + 1 \right)

and once you have σ² you can find μ by

\mu = \log m - \sigma^2/2

Here’s Python code to implement the above.

    from numpy import log

    def solve_for_log_normal_parameters(mean, variance):
        sigma2 = log(variance/mean**2 + 1)
        mu = log(mean) - sigma2/2
        return (mu, sigma2)

And here’s a little test of the code above.

    from numpy import exp
    from scipy.stats import lognorm

    mean = 3.4
    variance = 5.6

    mu, sigma2 = solve_for_log_normal_parameters(mean, variance)

    X = lognorm(scale=exp(mu), s=sigma2**0.5)
    assert abs(mean - X.mean()) < 1e-10
    assert abs(variance - X.var()) < 1e-10


Information in a discretized normal

Here’s a problem that came up in my work today.

Suppose you have a normal random variable X and you observe X in n discrete bins. The first bin is the left α tail, the last bin is the right α tail, and the range between the two tails is divided evenly into n-2 intervals. How many bits of information are in such an observation?

For example, assume adult male heights are normally distributed with mean 70 inches and standard deviation 3 inches. How much information is there in an observation of height, rounded down to the nearest inch, if we put all heights less than 64 inches in a bin and all heights greater than or equal to 76 inches in a bin?

Here α = 0.025 because that’s roughly the probability that a normal random variable is more than 2 standard deviations below its mean (and likewise roughly the probability that it is more than 2 standard deviations above).

We have n = 14 because we have the intervals (-∞, 64), [64, 65), [65, 66), … [75, 76), and [76, ∞).

We know that with n bins we have at most log2(n) bits of information. That would correspond to n bins of equal probability. In other words, we’d get the maximum information if we spaced our bins evenly in terms of percentiles rather than in terms of inches.

So how does the information content vary as a function of the number of bins, and how close is it to the maximum information for n bins?
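Here’s a sketch of the calculation behind the plots; the code, including the function name bits, is my own.

    # Sketch (mine): information (entropy in bits) from observing a normal
    # r.v. in n bins: two alpha tails plus n-2 equal-width bins in between.
    import numpy as np
    from scipy.stats import norm

    def bits(n, alpha=0.025):
        lo, hi = norm.ppf(alpha), norm.ppf(1 - alpha)
        edges = np.concatenate(([-np.inf], np.linspace(lo, hi, n - 1), [np.inf]))
        p = np.diff(norm.cdf(edges))       # probability of each bin
        return -np.sum(p * np.log2(p))

    for n in [8, 16, 36, 64]:
        print(n, np.log2(n) - bits(n))     # suboptimality in bits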

Here’s a plot, with α = 0.025.

The suboptimality of our bins, i.e. the difference between log2(n) and the information we get from n bins, drops quickly at first as n increases, then appears to plateau.

Next let’s look at just the suboptimality and increase our range of n.

This shows that the suboptimality does not approach an asymptote but instead starts to increase. The minimum at n = 36 is 0.16 bits.

Swanson’s rule of thumb

Swanson’s rule of thumb [1] says that the mean of a moderately skewed probability distribution can be approximated by the weighted average of the 10th, 50th, and 90th percentile, with weights 0.3, 0.4, and 0.3 respectively. Because it is based on percentiles, the rule is robust to outliers. Swanson’s rule is used in the oil and gas industry, but I don’t believe it’s widely known outside of that sector.

This post will apply Swanson’s rule to three distributions to see how well it does. I’ll plot the bias in Swanson’s rule and the skewness for three distribution families as the shape parameters vary.

First, here’s a plot for the gamma distribution with shape parameter k varying from 0.5 to 10.

The skewness of a gamma distribution is 2/√k. So unless k is close to 0, the skewness of a gamma distribution is small and the error in Swanson’s rule is very small.
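As a quick spot check, here’s Swanson’s rule applied to a gamma distribution with shape 4, whose mean is exactly 4. This code is my own sketch, not from the paper.

    # Spot check (mine): Swanson's rule for gamma(4); the exact mean is 4
    from scipy.stats import gamma

    def swanson(dist):
        return 0.3*dist.ppf(0.1) + 0.4*dist.ppf(0.5) + 0.3*dist.ppf(0.9)

    dist = gamma(4)
    print(swanson(dist), dist.mean())      # about 3.997 vs 4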

For the log normal I let the shape parameter σ vary from 0.01 to 1; I didn’t use larger parameters because the skewness of the log normal increases very rapidly as a function of σ. Over the range I plotted, the percentage relative error in Swanson’s rule is roughly equal to the skewness.

The skewness of a Pareto distribution is undefined unless the shape parameter is greater than 3. The plot below is based on shape parameters between 3.1 and 4. The skewness decreases as the shape parameter increases, asymptotically approaching 2.
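Here’s a small check of that asymptotic claim, using SciPy’s built-in moments rather than the closed-form skewness; the choice of shape parameters is mine.

    # Spot check (mine): Pareto skewness decreases toward 2 as the shape grows
    from scipy.stats import pareto

    for b in [3.1, 3.5, 4, 10, 100]:
        print(b, float(pareto(b).stats(moments='s')))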

Based on these examples, Swanson’s rule does indeed work pretty well when skewness is small. It works well for the gamma distribution in general because skewness is always small (unless the shape parameter k is tiny). But the log normal distribution shows that the assumption of small skewness is important.

For the Pareto distribution, Swanson’s rule works well when the shape parameter is large and the skewness is small. But this was kind of a stress test for Swanson’s rule because often the Pareto distribution is used when the skewness is large if not infinite.

 

[1] A. Hurst, G. C. Brown, and R. I. Swanson. Swanson’s 30-40-30 rule. AAPG Bulletin, December 2000.

New R book

Five years ago I recommended the book Learning Base R. Here’s the last paragraph of my review:

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t start there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R per se. This is the book I wish I could have handed my programmers who had to pick up R.

Now there’s a second edition. The author has gone through the book and made countless changes, many of them small updates that might be unnoticeable. Here are some of the changes that would be more noticeable.

  1. There are 265 new exercises.
  2. Chapter 26 (statistics) and Chapter 28 (packages) got a complete overhaul.
  3. Dozens of new functions are introduced (either in the body of the text or through exercises).
  4. New sections include the switch function (in Chapter 13 on relational operators), algorithm development (in Chapter 23 on iteration), analysis of variance (in Chapter 26 on statistics), and time series analysis (in Chapter 26 on statistics).

I was impressed with the first edition, and the new edition promises to be even better.

The book is available at Barnes & Noble and Amazon.

Yule statistics Y and Q

I recently wrote about the Yule-Simon distribution. The same Yule, George Udny Yule, is also known for the statistics Yule’s Y and Yule’s Q. The former is also known as the coefficient of colligation, and the latter is also known as the Yule coefficient of association.

Both measure the association between two binary variables. Given a 2 × 2 contingency table

\begin{tabular} {|c|c|} \hline $a$ & $b$ \\ \hline $c$ & $d$ \\ \hline \end{tabular}

with non-negative entries, Yule’s Y is defined by

Y = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}

and Yule’s Q is defined by

Q = \frac{ad - bc}{ad + bc}

Both essentially measure how much bigger ad is than bc but are weighted differently.

The algebraic properties of these two statistics may be more interesting than their statistical properties. For starters, both Y and Q produce values in the interval (-1, 1), and each can be computed from the other:

\begin{align*} Q &= \frac{2Y}{1 + Y^2} \\ Y &= \frac{1 - \sqrt{1 - Q^2}}{Q} \end{align*}
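Here’s a quick numerical check of these relations; the table entries are arbitrary values I chose for illustration.

    # Check (mine) of the definitions and the Q <-> Y relations above
    from math import sqrt

    a, b, c, d = 10, 3, 2, 15              # arbitrary 2x2 table

    Y = (sqrt(a*d) - sqrt(b*c)) / (sqrt(a*d) + sqrt(b*c))
    Q = (a*d - b*c) / (a*d + b*c)

    assert abs(Q - 2*Y/(1 + Y**2)) < 1e-12
    assert abs(Y - (1 - sqrt(1 - Q**2))/Q) < 1e-12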

There’s some simple but interesting algebra going on here. Define

f(z) = \frac{1-z}{1+z}

This simple function comes up surprisingly often. It’s a Möbius transformation, and it comes up in diverse applications such as designing antennas and mental math shortcuts. More on that here.

If we define

z = \frac{bc}{ad}

and

W_p = f(z^p)

then W generalizes both Q and Y: setting p = 1 gives us Q and setting p = 1/2 gives us Y.

Given the value of W with subscript p, we could easily solve for the value of W with another subscript q, analogous to solving for Q in terms of Y and for Y in terms of Q above.

W_p = f(z^p) \xrightarrow{\,f\,} z^p \xrightarrow{\,x^{q/p}\,} z^q \xrightarrow{\,f\,} f(z^q) = W_q

If you’re expecting f⁻¹ rather than f over the first arrow, you’re right, but f is its own inverse, so we could just write f instead.


Comparing three discrete power laws

Yesterday I wrote about the zeta distribution and the Yule-Simon distribution and said that they, along with the Zipf distribution, are all discrete power laws. This post fills in detail for that statement.

The probability mass functions for the zeta, Zipf, and Yule-Simon distributions are as follows.

\begin{align*} f_\zeta(k; s) &= k^{-s} / \zeta(s) \\ f_y(k; \rho) &= \rho B(k, \rho+1) \\ f_z(k; N, s) &= k^{-s} / H_{N,s} \end{align*}

Here the subscripts ζ, y, and z stand for zeta, Yule, and Zipf respectively. The distribution parameters follow after the semicolon.

Comparing Zipf and zeta

The Zipf distribution is only defined on the first N positive integers. The normalizing constant for the Zipf distribution is the Nth generalized harmonic number with exponent s.

H_{N,s} = \sum_{k=1}^N k^{-s}

As N goes to infinity, HN,s converges to ζ(s); this is the definition of the ζ function. So the Zipf and zeta distributions are asymptotically equal.
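Here’s a small numerical illustration of that convergence; the choice s = 2 is mine.

    # Quick check (mine) that H_{N,s} approaches zeta(s); here s = 2
    from scipy.special import zeta

    s = 2
    for N in [10, 100, 1000, 10000]:
        H = sum(k**-s for k in range(1, N + 1))
        print(N, H, zeta(s))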

Comparing zeta and Yule

This post showed that for large k,

f_y(k; \rho) \approx \rho \Gamma(\rho + 1) \frac{1}{(k+\rho)^{\rho + 1}}

That is, fy(k, ρ) is proportional to a power of k, except k is shifted by an amount ρ.

To compare the zeta and Yule distributions we’ll need to compare zeta with s+1 to Yule with ρ in order to make the exponents agree, and we’ll need to shift the zeta distribution by s+1.

When we plot the ratio of the pmfs, we see that the distributions agree in the limit as k gets large.

In this plot s = ρ = 1.5.

The ratio is approaching a constant, and in fact the limit is

\frac{1}{ \rho \,\Gamma(\rho + 1)\,\zeta(s+1)}

based on the ratio of the proportionality constants.
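Here’s a numerical spot check of that limit; this is my own sketch, with s = ρ = 1.5 as in the plot above.

    # Spot check (mine): ratio of zeta(s+1) pmf to Yule(rho) pmf vs. its limit
    from scipy.special import beta, gamma, zeta

    s = rho = 1.5
    limit = 1 / (rho * gamma(rho + 1) * zeta(s + 1))

    for k in [10, 100, 1000, 10000]:
        f_zeta = k**-(s + 1) / zeta(s + 1)
        f_yule = rho * beta(k, rho + 1)
        print(k, f_zeta / f_yule, limit)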

Comment

Note that the three distributions are asymptotically proportional, given the necessary shift in k, but in different ways. The Zipf distribution converges to the zeta distribution, for every k, as N goes to infinity. The zeta and Yule distributions are asymptotically proportional as k goes to infinity. So one proportion is asymptotic in a parameter N and one is asymptotic in an argument k.