What is differential privacy?

Differential privacy is a strong form of privacy protection with a solid mathematical definition.

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not. This intuitive idea can be made precise as follows.

blurry crowd

Queries and algorithms

First of all, differential privacy is something that applies to queries, not to databases. It could apply to a database if your query is to select everything in the database, but typically you want to run far more specific queries. Differential privacy adds noise to protect privacy, and the more intrusive the query, the more noise must be added, so typically you want queries to be more narrow than “Tell me everything about everything.”

Generally the differential privacy literature speaks of algorithms rather than queries. The two are the same if you have a general idea of what a query is, but using the word algorithm emphasizes that the query need not be a simple SQL query.

Opt-out and neighboring databases

To quantify the idea of “whether your information is included or not” we look at two versions of a database, D and D‘, that differ by one row, which you can think of as the row containing your data. Our formalism is symmetric, so we do not specify which database, D or D‘, is the one which contains your row and which is missing your row.

We should mention some fine print here. The paragraph above implicitly assumes that your database is just a simple table. In general we can speak of neighboring databases. In the case of a more complex database design, the notion of what it means for two databases to be neighboring is more complex. Differential privacy can be defined whenever a notion of neighboring databases can be defined, so you could, for example, consider differential privacy for a non-relational (“No SQL”) database. But for the simple case of a single table, two databases are neighboring if they differ by a row.

However, there’s a subtlety regarding what it means even for two tables to “differ by one row.” Unbound differential privacy assumes one row has been removed from one of the databases. Bound differential privacy assumes the two databases have the same number of rows, but the data in one row has been changed. In practice the difference between bound and unbound differential privacy doesn’t often matter.

Quantifying difference

We said it “makes little difference” whether an individual’s data is included or not. How to we quantify that statement? Differential privacy is a stochastic procedure, so we need to measure difference in terms of probability.

Whenever mathematicians want to speak of two things being close together, we often use ε as our symbol. While in principle ε could be large, the choice of symbol implies that you should think of it as a quantity you can make as small as you like (though there may be some cost that increases as ε decreases).

We’re going to look at ratios of probabilities, so we want these ratios to be near 1. This means we’ll work with eε = exp(ε) rather than ε itself. A convenient property that falls out of the Taylor series for the exponential function is that if ε is small then exp(ε) is approximate 1+ε. For example, exp(0.05) = 1.0513, and so if the ratio of two probabilities is bounded by exp(0.05), the probabilities are roughly within 5% of each other.

Definition of differential privacy

Now we can finally state our definition. An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D‘,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

Update: See a follow-on post about Rényi differential privacy, a generalization of differential privacy that measures the difference between output probability distributions using information theory rather than bounding ratios. This includes the definition above a special case.

Help with differential privacy

If you’d like help understanding and implementing differential privacy, let’s talk.

Related posts

Biometric security and hypothesis testing

A few weeks ago I wrote about how there are many ways to summarize the operating characteristics of a test. The most basic terms are accuracy, precision, and recall, but there are many others. Nobody uses all of them. Each application area has their own jargon.

Biometric security has its own lingo, and it doesn’t match any of the terms in the list I gave before.

facial scan authentication

False Acceptance Rate

Biometric security uses False Acceptance Rate (FAR) for the proportion of times a system grants access to an unauthorized person. In statistical terms, FAR is Type II error. Also known as False Match Rate (FRM).

False Rejection Rate

False Rejection Rate (FRR) is the proportion of times a biometric system fails to grant access to an authorized person. In statistical terms, FRR is Type I error. FAR is also known as False Non Match Rate (FNMR).

Crossover Error Rate

One way to summarize the operating characteristics of a biometric security system is to look at the Crossover Error Rate (CER), also known as the Equal Error Rate (EER). The system has parameters that can be tuned to adjust the FAR and FRR. Adjust these to the point where the FAR and FRR are equal. When the two are equal, their common value is the CER or EER.

The CER gives a way to compare systems. The smaller the CER the better. A smaller CER value means it’s possible to tune the system so that both the Type I and Type II error rates are smaller than they would be for another system.

Loss Function

CER is kind of a strange metric. Everyone agrees that you wouldn’t want to calibrate a system so that FAR = FRR. In security applications, FAR (unauthorized access) is worse than FRR (authorized user locked out). The former could be a disaster while the latter is an inconvenience. Of course there could be a context where the consequences of FAR and FRR are equal, or that FRR is worse, but that’s not usually the case.

A better approach would be to specify a loss function (or its negative, a utility function). If unauthorized access is K times more costly than locking out an authorized user, then you might want to know at what point K * FAR = FRR or your minimum expected loss [1] over the range of tuning parameters. The better system for you, in your application, is the one corresponding to your value of K.

Since everyone has a different value of K, it is easier to just use K = 1, even though everyone’s value of K is likely to be much larger than 1. Unfortunately this often happens in decision theory. When people can’t agree on a realistic loss function, they standardize on a mathematically convenient implicit loss function that nobody would consciously choose.

If everyone had different values of K near 1, the CER metric might be robust, i.e. it might often make the right comparison between two different systems even though the criteria is wrong. But since K is probably much larger than 1 for most people, it’s questionable that CER would rank two systems the same way people would if they could use their own value of K.

***

[1] These are not the same thing. To compute expected loss you’d need to take into account the frequency of friendly and unfriendly access attempts. In a physically secure location, friends may attempt to log in much more often than foes. On a public web site the opposite is more likely to be true.

Computing extreme normal tail probabilities

Let me say up front that relying on the normal distribution as an accurate model of extreme events is foolish under most circumstances. The main reason to calculate the probability of, say, a 40 sigma event is to show how absurd it is to talk about 40 sigma events. See my previous post on six-sigma events for an explanation.

For this post I’ll be looking at two-tailed events, the probability that a normal random variable is either less than –kσ or greater than kσ. If you’re only interested in one of these two probabilities, divide by 2. Also, since the results are independent of σ, let’s assume σ = 1 for convenience.

The following Python code will print a table of k-sigma event probabilities.

    from scipy.stats import norm

    for k in range(1, 40):
        print(k, 2*norm.cdf(-k))

This shows, for example, that a “25 sigma event,” something I’ve heard people talk about with a straight face, has a probability of 6 × 10-138.

The code above reports that a 38 or 39 sigma event has probability 0, exactly 0. That’s because the actual value, while not zero, is so close to zero that floating point precision can’t tell the difference. This happens around 10-308.

What if, despite all the signs warning hic sunt dracones you want to compute even smaller probabilities? Then for one thing you’ll need to switch over to log scale in order for the results to be representable as floating point numbers.

Exactly computing these extreme probabilities is challenging, but there are convenient upper and lower bounds on the probabilities that can be derived from Abramowitz and Stegun, equation 7.1.13 with a change of variables. See notes here.

We can use these bounds to compute upper and lower bounds on the base 10 logs of the tail probabilities.

    from scipy import log, sqrt, pi

    def core(t, c):
        x = 2*sqrt(2/pi)/(t + sqrt(t**2 + c))
        ln_p = -0.5*t**2 + log(x)
        return ln_p/log(10)

    def log10_upper(t):
        return core(t, 8/pi)

    def log10_lower(t):
        return core(t, 4)

This tells us that the log base 10 of the probability of a normal random variable being more than 38 standard deviations away from its mean is between -315.53968 and -315.53979. The upper and lower bounds agree to about seven significant figures, and the accuracy only improves as k gets larger. So for large arguments, we can use either the upper or lower bound as an accurate approximation.

The code above was used to compute this table of tail probabilities for k = 1 to 100 standard deviations.

Six sigma events

I saw on Twitter this afternoon a paraphrase of a quote from Nassim Taleb to the effect that if you see a six-sigma event, that’s evidence that it wasn’t really a six-sigma event.

What does that mean? Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about “six sigma” in the context of other distributions [1]. Here’s a table to indicate the odds against a k-sigma event for various k.

        |-------+-----------------|
        | Sigma | Odds            |
        |-------+-----------------|
        |     1 | 2 : 1           |
        |     2 | 21 : 1          |
        |     3 | 370 : 1         |
        |     4 | 16,000 : 1      |
        |     5 | 1,700,000 : 1   |
        |     6 | 500,000,000 : 1 |
        |-------+-----------------|

If you see something that according to your assumptions should happen twice in a billion tries, maybe you’ve seen something extraordinarily rare, or maybe your assumptions were wrong. Taleb’s comment suggests the latter is more likely.

Bayes rule and Bayes factors

You could formalize this with Bayes rule. For example, suppose you’re 99% sure the thing you’re looking at has a normal distribution with variance 1, but you’re willing to concede there’s a 1% chance that what you’re looking at has a fat-tailed distribution, say a Student t distribution with 10 degrees of freedom, rescaled to also have variance 1.

normal distribution vs t with 10 dof

It’s hard to tell the two distributions apart, especially in the tails. But although both are small in the tails, the normal is relatively much smaller.

Now suppose you’ve seen an observation greater than 6. The Bayes factor in favor of the t distribution hypothesis is 272. This means that even though before seeing any data you thought the odds were 99 to 1 in favor of the data coming from a normal distribution, after seeing such a large observation you would put the odds at 272 to 1 in favor of the t distribution.

If you allow a small possibility that your assumption of a normal distribution is wrong (see Cromwell’s rule) then seeing an extreme event will radically change your mind. You don’t have to think the fat-tailed distribution is equally likely, just a possibility. If you did think a priori that both possibilities were equally likely, the posterior odds for the t distribution would be 27,000 to 1.

In this example we’re comparing the normal distribution to a very specific and somewhat arbitrary alternative. Our alternative was just an example. You could have picked a wide variety of alternatives that would have given a qualitatively similar result, reversing your a priori confidence in a normal model.

By the way, a t distribution with 10 degrees of freedom is not a very fat-tailed distribution. It has fatter tails than a normal for sure, but not nearly as fat as a Cauchy, which corresponds to a t with only one degree of freedom. If we had used a distribution with a heavier tail, the posterior odds in favor of that distribution would have been higher.

***

[1] A six-sigma event isn’t that rare unless your probability distribution is normal. By Markov’s inequality, the probability is less than 1/36 for any distribution. The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.

Robustness and tests for equal variance

The two-sample t-test is a way to test whether two data sets come from distributions with the same mean. I wrote a few days ago about how the test performs under ideal circumstances, as well as less than ideal circumstances.

This is an analogous post for testing whether two data sets come from distributions with the same variance. Statistics texts books often present the F-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both data sets come from normal distributions.

Sensitivity and robustness

Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it’s important to know how procedures perform when assumptions don’t hold exactly. Since the F-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the F-test to perform terribly under simulation, relative to its recommended alternatives Bartlett’s test and Levene’s test. That’s not exactly what I found.

Simulation design

For my simulations I selected 35 samples from each of two distributions. I selected significance levels for the F-test, Bartlett’s test, and Levene’s test so that each would have roughly a 5% error rate under a null scenario, both sets of data coming from the same distribution, and a 20% error rate under an alternative scenario.

I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the F-test. Then I used the same designs for data coming from a fat-tailed distribution to see how well each of the tests performed.

For the normal null scenario, both data sets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario I used normal distributions with standard deviations 15 and 25.

Normal distribution calibration

Here are the results from the normal distribution simulations.

|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.0390 |  0.1863 |
| Bartlett |  0.04 | 0.0396 |  0.1906 |
| Levene   |  0.06 | 0.0439 |  0.2607 |
|----------+-------+--------+---------|

Here the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.

The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had larger Type II error than the other tests when calibrated to have the same Type I error.

To calibrate the operating characteristics, I used alpha levels 0.15, 0.04, and 0.05 respectively for the F, Bartlett, and Levene tests.

Fat-tail simulation results

Next I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both data sets were drawn from a Student t distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.

|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.2417 |  0.2852 |
| Bartlett |  0.04 | 0.2165 |  0.2859 |
| Levene   |  0.06 | 0.0448 |  0.4537 |
|----------+-------+--------+---------|

The operating characteristics degraded when drawing samples from a fat-tailed distribution, t with 4 degrees of freedom, but they didn’t degrade uniformly.

Compared to the F-test, the Bartlett test had slightly better Type I error and the same Type II error.

The Levene test, had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but had a higher Type II error.

Conclusion: The F-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett’s test doesn’t seem much better in these particular scenarios. Levene’s test, however, does perform better than the F-test, depending on the relative importance you place on Type I and Type II error.

Related posts

Two-sample t-test and robustness

A two-sample t-test is intended to determine whether there’s evidence that two samples have come from distributions with different means. The test assumes that both samples come from normal distributions.

Robust to non-normality, not to asymmetry

It is fairly well known that the t-test is robust to departures from a normal distribution, as long as the actual distribution is symmetric. That is, the test works more or less as advertised as long as the distribution is symmetric like a normal distribution, but it may not work as expected if the distribution is asymmetric.

This post will explore the robustness of the t-test via simulation. How far can you be from a normal distribution and still do OK? Can you have any distribution as long as it’s symmetric? Does a little asymmetry ruin everything? If something does go wrong, how does it go wrong?

Experiment design

For the purposes of this post, we’ll compare the null hypothesis that two groups both have mean 100 to the alternative hypothesis that one group has mean 100 and the other has mean 110. We’ll assume both distributions have a standard deviation of 15. There is a variation on the two-sample t-test that does not assume equal variance, but for simplicity we will keep the variance the same in both groups.

We’ll do a typical design, 0.05 significance and 0.80 power. That is, under the null hypothesis, that is if the two groups do come from the same distribution, we expect the test to wrongly conclude that the two distributions are different 5% of the time. And under the alternative hypothesis, when the groups do come from distributions with means 100 and 110, we expect the test to conclude that the two means are indeed different 80% of the time.

Under these assumptions, you’d need a sample of 36 in each group.

Python simulation code

Here’s the code we’ll use for our simulation.

    from scipy.stats import norm, t, gamma, uniform, ttest_ind

    num_per_group = 36

    def simulate_trials(num_trials, group1, group2):
        num_reject = 0
        for _ in range(num_trials):
            a = group1.rvs(num_per_group)
            b = group2.rvs(num_per_group)
            stat, pvalue = ttest_ind(a, b)
            if pvalue < 0.05:
                num_reject += 1
        return(num_reject)

Normal simulation

Normal distributions

Let’s see how the two-sample t-test works under ideal conditions by simulating from the normal distributions that the method assumes.

First we simulate from the null, i.e. we draw the data for both groups from the same distribution

    n1 = norm(100, 15)
    n2 = norm(100, 15)
    print( simulate_trials(1000, n1, n2) )

Out of 1000 experiment simulations, the method rejected the null 54 times, very close to the expected number of 50 based on 0.05 significance.

Next we simulate under the alternative, i.e. we draw data from different distributions.

    n1 = norm(100, 15)
    n2 = norm(110, 15)
    print( simulate_trials(1000, n1, n2) )

This time the method rejected the null 804 times in 1000 simulations, very close to the expected 80% based on the power.

Gamma simulation

Gamma distributions

Next we draw data from a gamma distribution. First we draw both groups from a gamma distribution with mean 100 and standard deviation 15. This requires a shape parameter of 44.44 and a scale of 2.25.

    g1 = gamma(a = 44.44, scale=2.25)
    g2 = gamma(a = 44.44, scale=2.25)
    print( simulate_trials(1000, g1, g2) )

Under the null simulation, we rejected the null 64 times out of 1000 simulations, higher than expected, but not that remarkable.

Under the alternative simulation, we pick the second gamma distribution to have mean 110 and standard deviation 15, which requires shape 53.77 and scale 2.025.

    g1 = gamma(a = 44.44, scale=2.25)
    g2 = gamma(a = 53.77, scale=2.045)
    print( simulate_trials(1000, g1, g2) )

We rejected the null 805 times in 1000 simulations, in line with what you’d expect from the power.

Gamma distributions

When the gamma distribution has such large shape parameters, the distributions are approximately normal, though slightly asymmetric. To increase the asymmetry, we’ll use a couple gamma distributions with smaller mean but shifted over to the right. Under the null, we create two gamma distributions with mean 10 and standard deviation 15, then shift them to the right by 90.

    g1 = gamma(a = 6.67, scale=1.5, loc=90)
    g2 = gamma(a = 6.67, scale=1.5, loc=90)
    print( simulate_trials(1000, g1, g2) )

Under this simulation we reject the null 56 times out of 1000 simulations, in line with what you’d expect.

For the alternative simulation, we pick the second distribution to have a mean of 20 and standard deviation 15, and then shift it to the right by 90 so tht it has mean 110. This distribution has quite a long tail.

    g1 = gamma(a = 6.67, scale=1.5, loc=90)
    g2 = gamma(a = 1.28, scale=11.25, loc=90)
    print( simulate_trials(1000, g1, g2) )

This time we rejected the null 499 times in 1000 simulations. This is a serious departure from what’s expected. Under the alternative hypothesis, we reject the null 50% of the time rather than the 80% we’d expect. If higher mean is better, this means that half the time you’d fail to conclude that the better group really is better.

Uniform distribution simulation

Next we use uniform random samples from the interval [74, 126]. This is a symmetric distribution with mean 100 and standard deviation 15. When we draw both groups from this distribution we rejected the null 48 times, in line with what we’d expect.

    u1 = uniform(loc=74, scale=52)
    u2 = uniform(loc=74, scale=52)
    print( simulate_trials(1000, u1, u2) )

If we move the second distribution over by 10, drawing from [84, 136], we rejected the null 790 times out of 1000, again in line with what you’d get from a normal distribution.

    u1 = uniform(loc=74, scale=52)
    u2 = uniform(loc=84, scale=52)
    print( simulate_trials(1000, u1, u2) )

In this case, we’ve made a big departure from normality and the test still worked as expected. But that’s not always the case, as in the t(3) distribution below.

Student-t simulation (fat tails)

Finally we simulate from a Student-t distribution. This is a symmetric distribution but it has fat tails relative to the normal distribution.

t distributions with 6 dof

First, we simulate from a t distribution with 6 degrees of freedom and scale 12.25, making the standard deviation 15. We shift the location of the distribution by 100 to make the mean 100.

    t1 = t(df=6, loc=100, scale=12.25)
    t2 = t(df=6, loc=100, scale=12.25)
    print( simulate_trials(1000, t1, t2) )

When both groups come from this distribution, we rejected the null 46 times. When we shifted the second distribution to have mean 110, we rejected the null 777 times out of 1000.

    t1 = t(df=6, loc=100, scale=12.25)
    t2 = t(df=6, loc=110, scale=12.25)
    print( simulate_trials(1000, t1, t2) )

In both cases, the results are in line what we’d expect given a 5% significance level and 80% power.

t distributions with 3 dof

A t distribution with 6 degrees of freedom has a heavy tail compared to a normal. Let’s try making the tail even heavier by using 3 degrees of freedom. We make the scale 15 to keep the standard deviation at 15.

    t1 = t(df=3, loc=100, scale=15)
    t2 = t(df=3, loc=100, scale=15)
    print( simulate_trials(1000, t1, t2) )

When we draw both samples from the same distribution, we rejected the null 37 times out of 1000, less than the 50 times we’d expect.

    t1 = t(df=3, loc=100, scale=15)
    t2 = t(df=3, loc=110, scale=15)
    print( simulate_trials(1000, t1, t2) )

When we shift the second distribution over to have mean 110, we rejected the null 463 times out of 1000, far less than the 80% rejection you’d expect if the data were normally distributed.

Just looking at a plot of the PDFs, a t(3) looks much closer to a normal distribution than a uniform distribution does. But the tail behavior is different. The tails of the uniform are as thin as you can get—they’re zero!—while the t(3) has heavy tails.

These two examples show that you can replace the normal distribution with a moderately heavy tailed symmetric distribution, but don’t overdo it. When the data some from a heavy tailed distribution, even one that is symmetric, the two-sample t-test may not have the operating characteristics you’d expect.

Robust statistics

P. J. Huber gives three desiderata for a statistical method in his book Robust Statistics:

  1. It should have a reasonably good (optimal or nearly optimal) efficiency at the assumed model.
  2. It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly.
  3. Somewhat larger deviations from the model should not cause a catastrophe.

Robust statistical methods are those that strive for these properties. These are not objective properties per se though there are objective ways to try to quantify robustness in various contexts.

(In Nassim Taleb’s tricotomy of fragile, robust, and anti-fragile, robustness is in the middle. Something is fragile if it is harmed by volatility, and anti-fragile if it benefits from volatility. Robustness is in the middle. It is not unduly harmed by volatility, but does not benefit from it either. In the context of statistical methods, volatility is departure from modeling assumptions. Hardly anything benefits from modeling assumptions being violated, but some methods are harmed by such violations more or less than others.)

Here are several blog posts I’ve written about robustness:

Probit regression

The previous post looked at how probability predictions from a logistic regression model vary as a function of the fitted parameters. This post goes through the same exercise for probit regression and compares the two kinds of nonlinear regression.

Generalized linear models and link functions

Logistic and probit regression are minor variations on a theme. They are both forms of generalized linear models (GLM), but with different link functions.

In logistic regression, the link function is

\mbox{logit}(p) = \log\left( \frac{p}{1-p} \right)

whereas in probit regression the link function is

\Phi^{-1}(p)

where Φ is the standard normal distribution CDF, i.e.

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp(-t^2/2)\, dt

The probability prediction from a generalized linear model is the inverse of the link function applied to a linear model. In logistic regression we had

p(x) = \text{logit}^{-1}(a + bx) = \frac{1}{1 + \exp(-a -bx)}

and in probit regression we have

p(x) = \text{probit}^{-1}(a + bx) = \Phi(a + bx).

Comparing logistic and probit curves

As a rule of thumb, probit regression coefficients are roughly equal to logisitic regression coefficients divided by 1.6. [1] To see this, here’s a plot of the logistic curve from the previous post, which used a = 3 and b = 4, and a plot of the probit curve where a = 3/1.6 and b = 4/1.6.

logistic and rescaled probit curves

And because the curves are so close together, we’ll also plot their difference to see a little better how they differ.

Difference between rescaled probit and logitistic curves

If the curves are so similar, why do we have both? Because one may be easier to work with in some context and the other in other contexts.

Sensitivity to parameters

In this post we’ll look at how

p(x, a, b) = \Phi(a + bx)

varies with each of its parameters.

As with the logistic curve before, the probit curve has zero curvature when x = –a/b, which corresponds to p = 1/2.

For the logistic curve, the slope at p = 1/2 was simply b, whereas here with the probit curve the slope at the flattest part is b/√(2π).

The rate of change of p with respect to a is

\frac{\partial p}{\partial a} = \frac{\exp\left( -\frac{1}{2} (a + bx)^2 \right)}{\sqrt{2\pi}}

To find where this is maximizes we find where

\frac{\partial^2 p}{\partial x\, \partial a} = \frac{-b \exp\left( -\frac{1}{2} (a + bx)^2 \right)(a + bx)}{\sqrt{2\pi}}

is zero, which is again at x = –a/b. By plugging this back in we find the maximum rate of change of p with respect to a is 1/√(2π).

As before, the dependence on b is a little more complicated than the dependence on a. At the middle of the curve, i.e. when p = 1/2, and so x = –a/b, we have

\left.\frac{\partial p}{\partial b}\right|_{x = -a/b} = -\frac{a}{b\sqrt{2\pi}}

The places where the sensitivity to b are maximized are a little easier to find this time.

\frac{\partial^2 p}{\partial x \, \partial b} = -\frac{(b^2x^2 + abx -1) \exp\left(-\frac{1}{2}(a + bx)^2 \right)}{\sqrt{2\pi}}

and so the places most sensitive to a change in b can be found by solving a quadratic equation, and it works out that

x = \frac{-a \pm \sqrt{a^2 + 4}}{2b}

Related posts

[1] Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press 2007.

Sensitivity of logistic regression prediction on coefficients

The output of a logistic regression model is a function that predicts the probability of an event as a function of the input parameter. This post will only look at a simple logistic regression model with one predictor, but similar analysis applies to multiple regression with several predictors.

p(x) = \frac{1}{1 + \exp(-a + bx)}

Here’s a plot of such a curve when a = 3 and b = 4.

Flattest part

The curvature of the logistic curve is small at both extremes. As x comes in from negative infinity, the curvature increases, then decreases to zero, then increases again, then decreases as x goes to positive infinity. We quantified this statement in another post where we calculate the curvature. The curvature is zero at the point where the second derivative of p

p''(x) = \frac{b^2 \exp(a + bx)\left(\exp(a +bx) -1\right)}{(1 + \exp(a + bx))^3}

is zero, which occurs when x = –a/b. At that point p = 1/2, so the curve is flattest where the probability crosses 1/2. In the graph above, this happens at x = -0.75.

A little calculation shows that the slope at the flattest part of the logistic curve is simply b.

Sensitivity to parameters

Now how much does the probability prediction p(x) change as the parameter a changes? We now need to consider p as a function of three variables, i.e. we need to consider a and b as additional variables. The marginal change in p in response to a change in a is the partial derivative of p with respect to a.

To know where this is maximized with respect to x, we take the partial derivative of the above expression with respect to x

\frac{\partial^2 p}{\partial x\, \partial a} = \frac{b(\exp(a + bx) - 1) \exp(a + bx)}{(1 + \exp(a + bx))^3}

which is zero when  x = –a/b, the same place where the logistic curve is flattest. And the partial of p with respect to a at that point is simply 1/4, independent of b. So a small change Δa results in a change of approximately Δa/4 at the flattest part of the logistic curve and results in less change elsewhere.

What about the dependence on b? That’s more complicated. The rate of change of p with respect to b is

\frac{\partial p}{\partial b} = \frac{\exp(a + bx) x }{(1 + \exp(a + bx))^2}

and this is maximized where

\frac{\partial^2 p}{\partial x \partial b} = 0

which in turn requires solving a nonlinear equation. This is easy to do numerically in a specific case, but not easy to work with analytically in general.

However, we can easily say how p changes with b near the point x = –a/b. This is not where the partial of p with respect to b is maximized, but it’s a place of interest because it has come up two times above. At that point the derivative of p with respect to b is –a/4b. So if a and b have the same sign, then a small increase in b will result in a small decrease in p and vice versa.

Obesity index: Measuring the fatness of probability distribution tails

A probability distribution is called “fat tailed” if its probability density goes to zero slowly. Slowly relative to what? That is often implicit and left up to context, but generally speaking the exponential distribution is the dividing line. Probability densities that decay faster than the exponential distribution are called “thin” or “light,” and densities that decay slower are called “thick”, “heavy,” or “fat,” or more technically “subexponential.” The distinction is important because fat-tailed distributions tend to defy our intuition.

One surprising property of heavy-tailed (subexponential) distributions is the single big jump principle. Roughly speaking, most of the contribution to the sum of several heavy-tailed random variables comes from the largest of the samples. To be more specific, let “several” = 4 for reasons that’ll be apparent soon, though the result is true for any n. As x goes to infinity, the probability that

X1 + X2 + X3 + X4

is larger than a given x is asymptotically equal the probability that

max(X1, X2, X3, X4)

is larger than the same x.

The idea behind the obesity index [1] is to turn the theorem above around, making it an empirical measure of how thick a distribution’s tail is. If you draw four samples from a random variable and sort them, the obesity index is the probability that that the sum of the max and min, X1 + X4, is greater than the sum of the middle samples, X2 + X3.

The obesity index could be defined for any distribution, but it only measures what the name implies for right-tailed distributions. For any symmetric distribution, the obesity index is exactly 1/2. A Cauchy distribution is heavy-tailed, but it has two equally heavy tails, and so its obesity index is the same as the normal distribution, which has two light tails.

Note that location and scale parameters have no effect on the obesity index; shifting and scaling effect all the X values the same, so it doesn’t change the probability that X1 + X4 is greater than X2 + X3.

To get an idea of the obesity index in action, we’ll look at the normal, exponential, and Cauchy distributions, since these are the canonical examples of thin, medium, and thick tailed distributions. But for reasons explained above, we’ll actually look at the folded normal and folded Cauchy distributions, i.e. we’ll take their absolute values to create right-tailed distributions.

To calculate the obesity index exactly you’d need to do analytical calculations with order statistics. We’ll simulate the obesity index because that’s easier. It’s also more in the spirit of calculating the obesity index from data.

    from scipy.stats import norm, expon, cauchy

    def simulate_obesity(dist, N):
        data = abs(dist.rvs(size=(N,4)))
        count = 0
        for row in range(N):
            X = sorted(data[row])
            if X[0] + X[3] > X[1] + X[2]:
                count += 1
        return count/N

    for dist in [norm, expon, cauchy]:
        print( simulate_obesity(dist, 10000) )

When I ran the Python code above, I got

    0.6692
    0.7519
    0.8396

This ranks the three distributions in the anticipated order of tail thickness.

Note that the code above takes the absolute value of the random samples. This lets us pass in ordinary (unfolded) versions of the normal and Cauchy distributions, and its redundant for any distribution like the exponential that’s already positive-valued.

[I found out after writing this blog post that SciPy now has foldnorm and foldcauchy, but they don’t seem to work like I expect.]

Let’s try it on a few more distributions. Lognormal is between exponential and Cauchy in thickness. A Pareto distribution with parameter b goes to zero like x-1-b and so we expect a Pareto distribution to have a smaller obesity index than Cauchy when b is greater than 1, and a larger index when b is less than one. Once again the simulation results are what we’d expect.

The code

    for dist in [lognorm, pareto(2), pareto(0.5)]:
        print( simulate_obesity(dist, 10000) )

returns

    0.7766
    0.8242
    0.9249

By this measure, lognormal is just a little heavier than exponential. Pareto(2) comes in lighter than Cauchy, but not by much, and Pareto(0.5) comes in heavier.

Since the obesity index is a probability, it will always return a value between 0 and 1. Maybe it would be easier to interpret if we did something like take the logit transform of the index to spread the values out more. Then the distinctions between Pareto distributions of different orders, for example, might match intuition better.

[1] Roger M. Cooke et al. Fat-Tailed Distributions: Data, Diagnostics and Dependence. Wiley, 2014.