What’s “differential” about differential privacy?

Interest in differential privacy is growing rapidly. As evidence of this, here’s the result of a Google Ngram search [1] on “differential privacy.”

Graph rapidly rising from 2000 to 2019

When I first mentioned differential privacy to consulting leads a few years ago, not many had heard of it. Now most are familiar with the term, though they may not be clear on what it means.

The term “differential privacy” is kinda mysterious, particularly the “differential” part. Are you taking the derivative of some privacy function?!

In math and statistics, the adjective “differential” usually has something to do with calculus. Differential geometry applies calculus to geometry problems. Differential equations are equations involving the derivatives of the function you’re after.

But differential privacy is different. There is a connection to differential calculus, but it’s further up the etymological tree. Derivatives are limits of difference quotients, and that’s the connection. Differential privacy is about differences, but without any limits. In a nutshell, a system is differentially private if it hardly makes any difference whether your data is in the system or not.

The loose phrase “it hardly makes any difference” can be quantified in terms of probability distributions. You can find a brief explanation of how this works here.
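
To give a flavor of what that quantification looks like, here is a minimal sketch in Python of the Laplace mechanism, one standard way to achieve ε-differential privacy for counting queries. The function and parameter names are illustrative, not taken from any particular library.

    import numpy as np

    def laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1, rng=None):
        """Release a noisy count satisfying epsilon-differential privacy.

        Adding or removing one person's record changes a counting query by at
        most `sensitivity`, so Laplace noise with scale sensitivity/epsilon
        guarantees the output distribution changes by at most a factor of
        exp(epsilon), i.e. it hardly makes any difference.
        """
        if rng is None:
            rng = np.random.default_rng()
        return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # The released values are distributed nearly the same whether or not
    # your record is included.
    print(laplace_mechanism(1000))  # database with your record
    print(laplace_mechanism(999))   # database without it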

Differential privacy is an elegant and defensible solution to the problem of protecting personal privacy while also preserving the usefulness of data in the aggregate. But it’s also challenging to understand and implement. If you’re interested in exploring differential privacy for your business, let’s talk.

More differential privacy posts

[1] You can see the search here. I’ve chopped off the vertical axis labels because they’re virtually meaningless. The proportion of occurrences of “differential privacy” among all two-word phrases is extremely small. But the rate of growth in the proportion is large.

50 sigma events for t distributions

I had a recent discussion with someone concerning a 50 sigma event, and that conversation prompted this post.

When people count “sigmas” they usually have normal distributions in mind. And six-sigma events are so rare for normal random variables that it’s more likely the phenomenon under consideration doesn’t have a normal distribution. As I explain here,

If you see a six-sigma event, that’s evidence that it wasn’t really a six-sigma event.

If six-sigma events are hard to believe, 50-sigma events are absurd. But not if you change the distribution assumption.

Chebyshev’s inequality says that if a random variable X has mean μ and variance σ², then

Prob( |X – μ| ≥ kσ) ≤ 1/k².

Not only that, the inequality above is tight: it is possible to construct a random variable X such that the probability of it being 50 standard deviations away from its mean equals exactly 1/2500. However, this requires constructing a custom random variable just to make equality hold. No distribution that you run into in the wild will have such a high probability of being that far from its mean.
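
Here’s a quick numerical check of that construction, using the standard three-point distribution that makes Chebyshev’s inequality an equality: X equals μ with probability 1 − 1/k² and μ ± kσ with probability 1/(2k²) each.

    import numpy as np

    k, mu, sigma = 50, 0.0, 1.0
    values = np.array([mu - k * sigma, mu, mu + k * sigma])
    probs = np.array([1 / (2 * k**2), 1 - 1 / k**2, 1 / (2 * k**2)])

    mean = np.dot(probs, values)
    variance = np.dot(probs, (values - mean) ** 2)
    tail = probs[[0, 2]].sum()  # P(|X - mu| >= k*sigma)

    print(mean, variance, tail)  # 0.0, 1.0, and 1/2500 = 0.0004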

I wondered what the probability of a 50-sigma event would be for a random variable with such a fat tail that it barely has finite variance. I chose to look at Student t distributions because the variance of a Student t distribution with ν degrees of freedom is ν/(ν – 2), so the variance blows up as ν decreases toward 2. The smaller ν is, the fatter the tails.

The probability of a t random variable with 3 degrees of freedom being 50 sigmas from its mean is about 3.4 × 10⁻⁶. The corresponding probability for 30 degrees of freedom is 6.7 × 10⁻³¹. So the former is rare but plausible, while the latter is so small as to be hard to believe.

This suggests that 50-sigma events are more likely with low degrees of freedom. And that’s sorta true, but not entirely. For 2.001 degrees of freedom, the probability of a 50-sigma event is 2 × 10⁻⁷, i.e. the probability is smaller for ν = 2.001 than for ν = 3.
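
Here’s a sketch of the calculation, assuming SciPy is available. It computes the two-sided probability of a Student t random variable with ν > 2 degrees of freedom being at least 50 standard deviations from its mean.

    import numpy as np
    from scipy.stats import t

    def prob_50_sigma(nu):
        """Two-sided probability that a t(nu) variable, nu > 2, is at least
        50 standard deviations from its mean."""
        sigma = np.sqrt(nu / (nu - 2))       # standard deviation of t(nu)
        return 2 * t.sf(50 * sigma, df=nu)   # sf = 1 - cdf

    for nu in [2.001, 2.25, 3, 30]:
        print(nu, prob_50_sigma(nu))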

Here’s a plot of the probability of 50-sigma events for a t distribution with ν degrees of freedom.

Apparently the probability peaks somewhere around ν = 2.25 then decreases exponentially as ν increases.

This isn’t what I expected, but it makes sense after the fact. As ν increases, the tails get thinner, but σ also gets smaller. We have two competing effects. For ν less than 2.25 the size of σ matters more. For larger values of ν, the thinness of tails matters more.

Power law principles are not about power laws

Much of what you’ll read about power laws in popular literature is not mathematically accurate, but still useful.

A lot of probability distributions besides power laws look approximately linear on a log-log plot, particularly over part of their range. The usual conclusion from this observation is that much of the talk about power laws is rubbish. But you could take this the other way around and say that all the talk about power laws applies more generally to other distributions!

Most appeals to power laws aren’t about power laws per se. For example, someone may say that supposedly rare events are not as rare as is commonly believed, because power laws. The conclusion may be correct even if the explanation isn’t. We routinely underestimate the probability of extreme events because probability distribution tails are often heavier than we suppose. Power laws are an example of heavy-tailed distributions, but other heavy-tailed distributions can thwart our intuition just as effectively as power laws.

When people say “power law” in this context, substitute “heavy-tailed distribution” in your mind and see whether everything still makes sense. Often it will.

Similarly, when you hear you should go for the Pareto optimum “because power laws,” interpret this generously. When someone appeals to the Pareto principle (a.k.a. the 80-20 rule) they probably don’t expect exactly 80% of returns to come from 20% of efforts. What matters is that the return on effort is not uniformly distributed. In fact, returns are often very unevenly distributed. Some actions accomplish far more than others. This is true when returns have a power law distribution, but certainly isn’t limited to that case.

Fat tails and diminishing returns are a lot to think about, and power laws help us think about them. Certainly power laws are used when they’re not a very good fit, but this is equally true of normal distributions. Both power laws and normal distributions may be useful tools of thought even when they’re not very accurate models.

Power laws are often useful as cautionary tales. “We don’t know that this situation follows a power law. In fact we don’t have much of a clue what distribution it follows. But what if it follows a power law? Then what?”

The difference between power laws and other heavy-tailed distributions matters more when you’re trying to make quantitative predictions. But here be dragons. It’s likely that data that appears to follow a power law distribution will only follow it so far and then depart. But the same is true of any other distribution you care to fit.

Related posts

Missing data

Missing data throws a monkey wrench into otherwise elegant plans. Yesterday’s post on genetic sequence data illustrates this point. DNA sequences consist of four bases, but we need to make provision for storing a fifth value for unknowns. If you know there’s a base in a particular position, but you don’t know what its value is, it’s important to record this unknown value to avoid throwing off the alignment of the sequence.

There are endless debates over how to handle missing data because missing data is a dilemma to be managed rather than a problem to be solved. (See Problems vs Dilemmas.)

It’s simply a fact of life that data will be incomplete. The debate stems from how to represent and handle missingness. Maybe the lowest level of a software application represents missing data while the highest level works only with complete data. At what level the missing values are removed, and how they are removed, depends very much on context.

A naive approach to missing data is to not allow it. We’ve all used software that demands that we enter a value for some field whether a value exists or not. Maybe you have to enter a middle name, even though you don’t have a middle name. Or maybe you have to enter your grandfather’s name even though you don’t know his name.

Note that the two examples above illustrate two kinds of missing data: one kind does not exist, while the other certainly exists but is unknown. In practice there are entire taxonomies of missing data. Is it unknown or non-existent? If it is unknown, why is it unknown? If it does not exist, why doesn’t it?

There can be information in missing information. For example, suppose a clinical trial tracks how long people survive after a given treatment. You won’t have complete data until everyone in the study has died. In the meantime, their date of death is missing. If someone’s date of death is missing because they’re still alive, that’s information: you know they’ve survived at least until the current point in time. If someone’s date of death is missing because they were lost to follow-up, i.e. they dropped out of the study and you lost contact with them, that’s different.

The simplest approach to missing data is to throw it away. That can be acceptable in some circumstances, particularly if the amount of missing data is small. But simply discarding incomplete records can be disastrous. In wide data, data with many different fields per subject, maybe none of your data is complete. Maybe there are many columns and every row is missing something in at least one column.
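
As a made-up illustration of how badly this can go with wide data, suppose each of 50 fields is missing only 5% of the time, independently. Then only about 0.95^50 ≈ 8% of rows are complete, and dropping incomplete rows throws out over 90% of the data.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n_rows, n_cols = 1000, 50

    # Each field is missing only 5% of the time...
    data = rng.normal(size=(n_rows, n_cols))
    mask = rng.random((n_rows, n_cols)) < 0.05
    df = pd.DataFrame(np.where(mask, np.nan, data))

    # ...but with 50 columns, hardly any row is complete.
    print(df.dropna().shape[0] / n_rows)  # roughly 0.95**50, i.e. about 8%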

Throwing away incomplete data can be inefficient or misleading. In the survival study example above, throwing out missing data would give you a very pessimistic assessment of the treatment. The people who lived the longest would be excluded precisely because they’re still living! Your analysis would be based only on those who died shortly after treatment.
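
A small simulation with made-up numbers makes the point: exponentially distributed survival times with a true mean of 24 months, observed in a study that ends after 12 months.

    import numpy as np

    rng = np.random.default_rng(0)
    true_mean = 24.0                      # true mean survival in months (made up)
    times = rng.exponential(true_mean, size=10_000)

    study_length = 12.0                   # follow-up ends after 12 months
    observed_deaths = times[times <= study_length]

    # Discarding everyone still alive at the end of the study keeps only
    # the early deaths, giving a badly pessimistic estimate of survival.
    print(times.mean())            # close to 24
    print(observed_deaths.mean())  # around 5 or 6, much too pessimistic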

Analysis of data with missing values is a world unto itself. It seems paradoxical at first to devise ways to squeeze information out of data that isn’t there. But there are many ways to do just that, each with pros and cons. There are subtle ways to infer the missing values, while also accounting for the fact that these values have been inferred. If done poorly, this can increase bias, but if done well it decreases bias.

Analysis techniques that account for missing data are more complicated than techniques that do not. But they are worth the effort if throwing away missing data would leave you with too little data or give you misleading results. If you’re not concerned about the former, perhaps you should be concerned about the latter. The bias introduced by discarding incomplete data could be hard to foresee until you’ve analyzed the data properly accounting for missing values.

Initial letter frequency

I needed to know the frequencies of letters at the beginning of words for a project. The overall frequency of letters, wherever they appear in a word, is well known. Frequencies of initial letters are not so easy to find, so I did a little experiment.

I downloaded the Canterbury Corpus and looked at the frequency of initial letters in a couple of the files in the corpus. I first tried a different approach, then realized a shell one-liner [1] would be simpler and less error-prone.

cat alice29.txt | lc | grep -o '\b[a-z]' | sort | uniq -c | sort -rn

This shows that the letters in descending order of frequency at the beginning of a word are t, a, s, …, j, x, z.
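
If you’d rather not use a shell pipeline, here’s a rough Python equivalent of the one-liner above. It assumes alice29.txt is in the working directory and mirrors the regular expression \b[a-z].

    import re
    from collections import Counter

    with open("alice29.txt") as f:
        text = f.read().lower()

    # Count the lowercase letter at the start of each word, most common first.
    counts = Counter(re.findall(r"\b[a-z]", text))
    for letter, count in counts.most_common():
        print(count, letter)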

The file alice29.txt is the text of Alice’s Adventures in Wonderland. Then for comparison I ran the same script on another file, lcet10.txt, a lengthy report from a workshop on electronic texts.

This technical report’s initial letter frequencies order the alphabet t, a, o, …, y, z, x. So the two files agree on the two most common initial letters but differ starting with the third.

I made the following plot to visualize how the frequencies differ. The horizontal axis is sorted by overall letter frequency (based on the Google corpus summarized here).

I expected the initial letter frequencies to differ from overall letter frequencies, but I did not expect the two corpora to differ.

Apparently initial letter frequencies vary more across corpora than overall letter frequencies. The following plot shows the overall letter frequencies for both corpora, with the horizontal axis again sorted by the frequency in the Google corpus.

Here the two corpora essentially agree with each other and with the Google corpus. The tech report ranks letters in essentially the same order as the Google corpus because the orange dashed line is mostly decreasing, though there is a curious kink in the graph at c.

Related posts

[1] The lc function converts its input to lower case. See this post for how to install and use the function.

Random drug screening

Suppose in a company of N employees, m are chosen randomly for drug screening. In two independent screenings, what is the probability that someone will be picked both times? It may be unlikely that any given individual will be picked twice, yet very likely that someone will be.

Imagine m employees being given a red ticket representing the first screening, and m being given a blue ticket representing the second screening. The tickets can be passed out in

{N \choose m}^2

different ways. Of these, the number of ways the tickets could be passed out so that no one has both a red and a blue ticket is

{N \choose m} {N-m \choose m}

because you can first pass out the red tickets, then choose m of the employees who did not get a red ticket for the blue tickets. And so the probability that no one will be picked twice, the probability that nobody holds both a red and a blue ticket, is

\frac{ {N-m \choose m} }{ {N \choose m} }

Now let’s plug in some numbers. Suppose a company has 100 employees and 20 are randomly screened each time. In two screenings, there is only a 0.7% chance that no one will be tested twice. Said another way, there’s a 99.3% chance that at least one person will be screened twice. Any given individual has a 20% chance of being selected each time, and so a 4% chance of being picked twice.

A variation on this problem is to compute the expected number of overlaps between the two screenings. Each of the m people picked in the first screening has probability m/N of also being picked in the second, so the expected overlap is m²/N. With N = 100 and m = 20, we expect four people to be tested twice.
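
Here’s a short sketch of these calculations in Python.

    from math import comb

    N, m = 100, 20

    # Probability that no one is screened in both rounds
    p_no_overlap = comb(N - m, m) / comb(N, m)
    print(p_no_overlap)      # about 0.0066
    print(1 - p_no_overlap)  # about 0.993

    # Any given individual is picked twice with probability (m/N)^2
    print((m / N) ** 2)      # 0.04

    # Expected number of people picked in both rounds
    print(m**2 / N)          # 4.0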

By the way, what if over half the employees are tested each time? For example, if a company of 100 people tests 60 people each time, it’s certain to test somebody twice. But the derivation above still works. The general definition of binomial coefficients takes care of this because the numerator will be zero if m is larger than half N. The number of ways to choose 60 things from a set of 40 things, for instance, is zero.

Universal confidence interval

Here’s a way to find a 95% confidence interval for any parameter θ.

  • With probability 0.95, return the real line.
  • With probability 0.05, return the empty set.

Clearly 95% of the time this procedure will return an interval that contains θ.
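
A quick simulation confirms the (useless) coverage guarantee; the value of θ is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 3.7          # the unknown parameter; its value doesn't matter
    trials = 100_000

    # The real line always contains theta; the empty set never does.
    covered = (rng.random(trials) < 0.95).sum()
    print(covered / trials)  # about 0.95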

This example shows the difference between a confidence interval and a credible interval.

A 95% credible interval is an interval such that the probability is 95% that the parameter is in the interval. That’s what people think a 95% confidence interval is, but it’s not.

Suppose I give you a confidence interval using the procedure above. The probability that θ is in the interval is 1 if I return the real line and 0 if I return the empty set. In either case, the interval that I give you tells you absolutely nothing about θ.

But if I give you a 95% credible interval (a, b), then given the model and the data that went into it, the probability is 95% that θ is in the interval (a, b).

Confidence intervals are more useful in practice than in theory because they often approximately correspond to a credible interval under a reasonable model.

Credible intervals depend on your modeling assumptions. So do confidence intervals.

Confidence interval widths

Suppose you do N trials of something that can succeed or fail. After your experiment you want to present a point estimate and a confidence interval. Or if you’re a Bayesian, you want to present a posterior mean and a credible interval. The numerical results hardly differ, though the interpretations do.

If you got half successes, you will report a confidence interval centered around 0.5. The more unbalanced your results were, the smaller your confidence interval will be. That is, the confidence interval will be smallest if you had no successes and widest if you had half successes.

What can we say about how the width of your confidence interval varies as a function of your point estimate p? That question is the subject of this post [1].

Frequentist confidence interval width

We’ll start with an example where N = 1,000,000. We let our number of observed successes n vary from 0 to 500,000 and plot the width of a 95% confidence interval as a function of n on a log-log plot. (By symmetry we only need to look up to N/2. If you have more than half successes, reverse your definitions of success and failure.)

The result is almost a perfectly straight line, with the exception of a tiny curve at the end. This says the log of the confidence interval width is a linear function of the log of n, or in other words, the dependence of confidence interval width on n follows a power law.

Bayesian credible interval width

The plot above shows frequentist confidence interval widths, computed using the Wilson score method with continuity correction. We can repeat our experiment, putting on our Bayesian hat, and compute credible intervals using a Jeffreys prior, that is, a beta(0.5, 0.5) prior.

The results are indistinguishable. The confidence interval widths differ after the sixth decimal place.
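
Here’s a sketch of the comparison. For simplicity it uses the plain Wilson score interval, without the continuity correction used above, so its numbers will differ slightly from those in the plot, but the close agreement with the Jeffreys interval is the same.

    import numpy as np
    from scipy import stats

    N = 1_000_000
    z = stats.norm.ppf(0.975)  # two-sided 95%

    def wilson_width(n, N):
        """Width of the plain Wilson score interval (no continuity correction)."""
        p_hat = n / N
        half = (z / (1 + z**2 / N)) * np.sqrt(p_hat * (1 - p_hat) / N + z**2 / (4 * N**2))
        return 2 * half

    def jeffreys_width(n, N):
        """Width of the 95% credible interval with a Jeffreys beta(0.5, 0.5) prior."""
        lo = stats.beta.ppf(0.025, n + 0.5, N - n + 0.5)
        hi = stats.beta.ppf(0.975, n + 0.5, N - n + 0.5)
        return hi - lo

    for n in [100, 10_000, 500_000]:
        print(n, wilson_width(n, N), jeffreys_width(n, N))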

In this example, there’s very little quantitative difference between a frequentist confidence interval and a Bayesian credible interval. The difference is primarily one of interpretation. The Bayesian interpretation makes sense and the frequentist interpretation doesn’t [2].

Power law

If the logs of two things are related linearly [3], then the things themselves are related by a power law. So if confidence/credible interval width varies as a power law with the point estimate p, what is that power law?

The plots above suggest that to a good approximation, if we let w be the credible interval length,

log w = m log p + b

for some slope m and intercept b.

We can estimate the slope by choosing p = 10⁻¹ and p = 10⁻⁵. This gives us m = 0.4925 and b = -5.6116. There are theoretical reasons to expect that m should be 0.5, so we’ll round 0.4925 up to 0.5 both for convenience and for theory.

So

log w = 0.5 log p – 5.6116.

and so

w = 0.00366 √p

Let’s see how well this works by comparing it to the exact interval length. (I’m using Jeffreys’ credible interval, not that it matters.)
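
Here’s a sketch of that comparison, computing the exact Jeffreys interval width, the power law approximation above, and the usual normal approximation width 2 × 1.96 √(p(1 − p)/N).

    import numpy as np
    from scipy import stats

    N = 1_000_000

    def jeffreys_width(p, N):
        """95% Jeffreys credible interval width for n = p*N successes out of N."""
        n = p * N
        lo = stats.beta.ppf(0.025, n + 0.5, N - n + 0.5)
        hi = stats.beta.ppf(0.975, n + 0.5, N - n + 0.5)
        return hi - lo

    for p in [0.001, 0.01, 0.1, 0.3, 0.5]:
        exact = jeffreys_width(p, N)
        power_law = 0.00366 * np.sqrt(p)              # the approximation above
        normal = 2 * 1.96 * np.sqrt(p * (1 - p) / N)  # normal approximation
        print(p, exact, power_law, normal)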

The power law approximation works well when the estimated proportion p is small, say less than 0.2. For larger p the normal approximation works well.

We could have guessed that the confidence interval width was proportional to √(p(1 − p)) based on the normal approximation/central limit theorem. But the normal approximation is not uniformly good over the whole range of p.

Related posts

[1] Standard notation would put a little hat on top of the p but HTML doesn’t make this possible without inserting images into text. I hope you don’t mind if I take my hat off.

[2] This quip is a little snarky, but it’s true. When I would ask graduate students to describe what a confidence interval is, they would basically give me the definition of a credible interval. Most people who calculate frequentist confidence intervals think they’re calculating Bayesian credible intervals, though they don’t realize it. The definition of confidence interval is unnatural and convoluted. But confidence intervals work better in practice than in theory because they approximate credible intervals.

L. J. Savage once said

“The only use I know for a confidence interval is to have confidence in it.”

In other words, the value of a confidence interval is that you can often get away with interpreting it as a credible interval.

[3] Unfortunately the word “linear” is used in two inconsistent ways. Technically we should say “affine” when describing a function y = mx + b, and I wish we would. But that word isn’t commonly known. People call this relation linear because the graph is a line. Which makes sense, but it’s inconsistent with the use of “linear” as in “linear algebra.”


Computing normal probabilities with a simple calculator

If you need to calculate Φ(x), the CDF of a standard normal random variable, but don’t have Φ on your calculator, you can use the approximation [1]

Φ(x) ≈ 0.5 + 0.5*tanh(0.8 x).

If you don’t have a tanh function on your calculator, you can use

tanh(0.8x) = (exp(1.6x) – 1) / (exp(1.6x) + 1).

This saves a few keystrokes over computing tanh directly from its definition [2].

Example

For example, suppose we want to compute Φ(0.42) using bc, the basic calculator on Unix systems. bc only comes with a few math functions, but it has a function e for exponential. Our calculation would look something like this.

    > bc -lq
    t = e(1.6*0.42); (1 + (t-1)/(t+1))/2
    .66195084790657784948

The exact value of Φ(0.42) is 0.66275….
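
The same calculation in Python, using SciPy for the exact value:

    from math import tanh
    from scipy.stats import norm

    x = 0.42
    approx = 0.5 + 0.5 * tanh(0.8 * x)  # the approximation from [1]
    exact = norm.cdf(x)

    print(approx)  # 0.66195...
    print(exact)   # 0.66275...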

Plots

It’s hard to see the difference between Φ(x) and the approximation above.

Comparing Phi and its approximation

A plot of the approximation error shows that the error is greatest in [-2, -1] and [1, 2], but the error is especially small for x near 0 and x far from 0.

Error in approximating Phi

Related posts

[1] Anthony C. Robin. A Quick Approximation to the Normal Integral. The Mathematical Gazette, Vol. 81, No. 490 (Mar., 1997), pp. 95-96

[2] Proof: tanh(x) is defined to be (exp(x) – exp(–x)) / (exp(x) + exp(–x)). Multiply top and bottom by exp(x) to get (exp(2x) – 1) / (exp(2x) + 1).

Scaled Beta2 distribution

I recently ran across a new probability distribution called the “Scaled Beta2” or “SBeta2” for short in [1].

For positive argument x and for positive parameters p, q, and b, its density is

\frac{\Gamma(p+q)}{b\,\Gamma(p) \, \Gamma(q)}\, \frac{\left(\frac{x}{b}\right)^{p-1}}{\left(\left(\frac{x}{b}\right) + 1 \right)^{p+q}}

This is a heavy-tailed distribution. For large x, the probability density is O(x^(-q-1)), the same as a Student-t distribution with q degrees of freedom.
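
The SBeta2 appears to be a beta prime distribution scaled by b, which gives an easy way to compute with it. Here’s a sketch checking the density above against SciPy’s betaprime with scale b; the parameter values are arbitrary.

    import numpy as np
    from scipy import stats
    from scipy.special import gammaln

    def sbeta2_pdf(x, p, q, b):
        """Density of the Scaled Beta2 distribution as given above."""
        log_const = gammaln(p + q) - gammaln(p) - gammaln(q) - np.log(b)
        return np.exp(log_const + (p - 1) * np.log(x / b) - (p + q) * np.log(x / b + 1))

    p, q, b = 2.0, 3.0, 5.0
    x = np.linspace(0.1, 50, 5)
    print(sbeta2_pdf(x, p, q, b))
    print(stats.betaprime.pdf(x, p, q, scale=b))  # should match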

The authors in [1] point out several ways this distribution can be useful in application. It can approximate a number of commonly-used prior distributions while offering advantages in terms of robustness and analytical tractability.

Related posts

[1] The Scaled Beta2 Distribution as a Robust Prior for Scales. María-Eglée Pérez, Luis Raúl Pericchi, Isabel Cristina Ramírez. Available here.