Robust statistics

P. J. Huber gives three desiderata for a statistical method in his book Robust Statistics:

  1. It should have a reasonably good (optimal or nearly optimal) efficiency at the assumed model.
  2. It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly.
  3. Somewhat larger deviations from the model should not cause a catastrophe.

Robust statistical methods are those that strive for these properties. The desiderata are not objective criteria per se, though there are objective ways to quantify robustness in particular contexts.

(In Nassim Taleb’s trichotomy of fragile, robust, and anti-fragile, robustness is in the middle. Something is fragile if it is harmed by volatility, and anti-fragile if it benefits from volatility. Something robust is not unduly harmed by volatility, but does not benefit from it either. In the context of statistical methods, volatility means departure from modeling assumptions. Hardly anything benefits from modeling assumptions being violated, but some methods are harmed by such violations more than others.)

Here are several blog posts I’ve written about robustness:

Probit regression

The previous post looked at how probability predictions from a logistic regression model vary as a function of the fitted parameters. This post goes through the same exercise for probit regression and compares the two kinds of nonlinear regression.

Generalized linear models and link functions

Logistic and probit regression are minor variations on a theme. They are both forms of generalized linear models (GLM), but with different link functions.

In logistic regression, the link function is

\mbox{logit}(p) = \log\left( \frac{p}{1-p} \right)

whereas in probit regression the link function is

\Phi^{-1}(p)

where Φ is the standard normal distribution CDF, i.e.

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp(-t^2/2)\, dt

The probability prediction from a generalized linear model is the inverse of the link function applied to a linear model. In logistic regression we had

p(x) = \text{logit}^{-1}(a + bx) = \frac{1}{1 + \exp(-a -bx)}

and in probit regression we have

p(x) = \text{probit}^{-1}(a + bx) = \Phi(a + bx).
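To make this concrete, here’s a short Python sketch of the two predictions; the only assumption beyond the formulas above is the choice of SciPy for the normal CDF.

    import numpy as np
    from scipy.stats import norm

    def logistic_prediction(x, a, b):
        # inverse logit link applied to the linear predictor a + bx
        return 1/(1 + np.exp(-(a + b*x)))

    def probit_prediction(x, a, b):
        # inverse probit link, i.e. the standard normal CDF, applied to a + bx
        return norm.cdf(a + b*x)

    print( logistic_prediction(0.5, 3, 4) )   # about 0.993
    print( probit_prediction(0.5, 3, 4) )     # about 0.9999997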

Comparing logistic and probit curves

As a rule of thumb, probit regression coefficients are roughly equal to logistic regression coefficients divided by 1.6. [1] To see this, here’s a plot of the logistic curve from the previous post, which used a = 3 and b = 4, and a plot of the probit curve where a = 3/1.6 and b = 4/1.6.

logistic and rescaled probit curves

And because the curves are so close together, we’ll also plot their difference to see a little better how they differ.

Difference between rescaled probit and logistic curves
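Here’s a rough matplotlib sketch along the lines of the two plots above; the plotting range is my choice.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    x = np.linspace(-3, 1.5, 500)
    logistic = 1/(1 + np.exp(-(3 + 4*x)))     # a = 3, b = 4
    probit = norm.cdf((3 + 4*x)/1.6)          # same coefficients divided by 1.6

    fig, (ax1, ax2) = plt.subplots(2, 1)
    ax1.plot(x, logistic, label="logistic")
    ax1.plot(x, probit, label="rescaled probit")
    ax1.legend()
    ax2.plot(x, probit - logistic)
    ax2.set_title("difference between the curves")
    plt.tight_layout()
    plt.show()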

If the curves are so similar, why do we have both? Because one may be easier to work with in some contexts and the other in others.

Sensitivity to parameters

In this post we’ll look at how

p(x, a, b) = \Phi(a + bx)

varies with each of its parameters.

As with the logistic curve before, the probit curve has zero curvature when x = –a/b, which corresponds to p = 1/2.

For the logistic curve, the slope at p = 1/2 was simply b/4, whereas here with the probit curve the slope at the flattest part is b/√(2π). The ratio of these two slopes, 4/√(2π) ≈ 1.6, is where the rescaling factor above comes from.

The rate of change of p with respect to a is

\frac{\partial p}{\partial a} = \frac{\exp\left( -\frac{1}{2} (a + bx)^2 \right)}{\sqrt{2\pi}}

To find where this is maximized, we find where

\frac{\partial^2 p}{\partial x\, \partial a} = \frac{-b \exp\left( -\frac{1}{2} (a + bx)^2 \right)(a + bx)}{\sqrt{2\pi}}

is zero, which is again at x = –a/b. By plugging this back in we find the maximum rate of change of p with respect to a is 1/√(2π).

As before, the dependence on b is a little more complicated than the dependence on a. At the middle of the curve, i.e. when p = 1/2, and so x = –a/b, we have

\left.\frac{\partial p}{\partial b}\right|_{x = -a/b} = -\frac{a}{b\sqrt{2\pi}}

The places where the sensitivity to b is maximized are a little easier to find this time.

\frac{\partial^2 p}{\partial x \, \partial b} = -\frac{(b^2x^2 + abx -1) \exp\left(-\frac{1}{2}(a + bx)^2 \right)}{\sqrt{2\pi}}

and so the places most sensitive to a change in b can be found by solving a quadratic equation, and it works out that

x = \frac{-a \pm \sqrt{a^2 + 4}}{2b}
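These calculations are easy to check with a computer algebra system. Here’s a SymPy sketch that verifies the critical points and the slope at the flattest part.

    from sympy import symbols, erf, sqrt, diff, solve, simplify

    x, a, b = symbols('x a b', real=True)
    p = (1 + erf((a + b*x)/sqrt(2)))/2            # Phi(a + bx), the probit prediction

    print( solve(diff(p, a, x), x) )              # [-a/b], where dp/da is largest
    print( solve(diff(p, b, x), x) )              # the two roots of b^2 x^2 + abx - 1 = 0
    print( simplify(diff(p, x).subs(x, -a/b)) )   # b/sqrt(2*pi), the slope at the flattest part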

Related posts

[1] Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press 2007.

Sensitivity of logistic regression prediction on coefficients

The output of a logistic regression model is a function that predicts the probability of an event as a function of the predictor variable. This post will only look at a simple logistic regression model with one predictor, but a similar analysis applies to multiple regression with several predictors.

p(x) = \frac{1}{1 + \exp(-a - bx)}

Here’s a plot of such a curve when a = 3 and b = 4.

Flattest part

The curvature of the logistic curve is small at both extremes. As x comes in from negative infinity, the curvature increases, then decreases to zero, then increases again, then decreases as x goes to positive infinity. We quantified this statement in another post where we calculated the curvature. The curvature is zero at the point where the second derivative of p

p''(x) = \frac{b^2 \exp(a + bx)\left(1 - \exp(a + bx)\right)}{(1 + \exp(a + bx))^3}

is zero, which occurs when x = –a/b. At that point p = 1/2, so the curve is flattest where the probability crosses 1/2. In the graph above, this happens at x = -0.75.

A little calculation shows that the slope at the flattest part of the logistic curve is simply b/4.

Sensitivity to parameters

Now how much does the probability prediction p(x) change as the parameter a changes? We now need to consider p as a function of three variables, p(x, a, b), i.e. we treat a and b as variables as well. The marginal change in p in response to a change in a is the partial derivative of p with respect to a,

\frac{\partial p}{\partial a} = \frac{\exp(a + bx)}{(1 + \exp(a + bx))^2}

To know where this is maximized with respect to x, we take the partial derivative of the above expression with respect to x

\frac{\partial^2 p}{\partial x\, \partial a} = \frac{b \exp(a + bx)\left(1 - \exp(a + bx)\right)}{(1 + \exp(a + bx))^3}

which is zero when x = –a/b, the same place where the logistic curve is flattest. And the partial of p with respect to a at that point is simply 1/4, independent of b. So a small change Δa results in a change of approximately Δa/4 at the flattest part of the logistic curve and results in less change elsewhere.

What about the dependence on b? That’s more complicated. The rate of change of p with respect to b is

\frac{\partial p}{\partial b} = \frac{\exp(a + bx) x }{(1 + \exp(a + bx))^2}

and this is maximized where

\frac{\partial^2 p}{\partial x \partial b} = 0

which in turn requires solving a nonlinear equation. This is easy to do numerically in a specific case, but not easy to work with analytically in general.

However, we can easily say how p changes with b near the point x = –a/b. This is not where the partial of p with respect to b is maximized, but it’s a place of interest because it has come up two times above. At that point the derivative of p with respect to b is –a/4b. So if a and b have the same sign, then a small increase in b will result in a small decrease in p and vice versa.
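A similar SymPy sketch confirms the values above for the logistic curve.

    from sympy import symbols, exp, diff, solve, simplify

    x, a, b = symbols('x a b', real=True)
    p = 1/(1 + exp(-(a + b*x)))                   # logistic prediction

    print( solve(diff(p, x, 2), x) )              # [-a/b], where the curve is flattest
    print( simplify(diff(p, x).subs(x, -a/b)) )   # b/4, the slope there
    print( simplify(diff(p, a).subs(x, -a/b)) )   # 1/4
    print( simplify(diff(p, b).subs(x, -a/b)) )   # -a/(4b)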

Obesity index: Measuring the fatness of probability distribution tails

A probability distribution is called “fat tailed” if its probability density goes to zero slowly. Slowly relative to what? That is often implicit and left up to context, but generally speaking the exponential distribution is the dividing line. Probability densities that decay faster than the exponential distribution are called “thin” or “light,” and densities that decay slower are called “thick”, “heavy,” or “fat,” or more technically “subexponential.” The distinction is important because fat-tailed distributions tend to defy our intuition.

One surprising property of heavy-tailed (subexponential) distributions is the single big jump principle. Roughly speaking, most of the contribution to the sum of several heavy-tailed random variables comes from the largest of the samples. To be more specific, let “several” = 4 for reasons that’ll be apparent soon, though the result is true for any n. As x goes to infinity, the probability that

X1 + X2 + X3 + X4

is larger than a given x is asymptotically equal to the probability that

max(X1, X2, X3, X4)

is larger than the same x.
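Here’s a quick Monte Carlo illustration of the single big jump principle. The choice of a Pareto distribution and the particular threshold are mine, not part of the principle.

    import numpy as np

    rng = np.random.default_rng(1234)
    # Pareto samples with tail index 1, a heavy-tailed distribution
    samples = rng.pareto(1.0, size=(1_000_000, 4)) + 1

    threshold = 1000
    p_sum = np.mean(samples.sum(axis=1) > threshold)
    p_max = np.mean(samples.max(axis=1) > threshold)
    print( p_sum, p_max )   # roughly equal, and closer as the threshold grows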

The idea behind the obesity index [1] is to turn the theorem above around, making it an empirical measure of how thick a distribution’s tail is. If you draw four samples from a random variable and sort them, the obesity index is the probability that the sum of the min and max, X1 + X4, is greater than the sum of the middle samples, X2 + X3.

The obesity index could be defined for any distribution, but it only measures what the name implies for right-tailed distributions. For any symmetric distribution, the obesity index is exactly 1/2. A Cauchy distribution is heavy-tailed, but it has two equally heavy tails, and so its obesity index is the same as the normal distribution, which has two light tails.

Note that location and scale parameters have no effect on the obesity index; shifting and scaling affect all the X values the same way, so they don’t change the probability that X1 + X4 is greater than X2 + X3.

To get an idea of the obesity index in action, we’ll look at the normal, exponential, and Cauchy distributions, since these are the canonical examples of thin, medium, and thick tailed distributions. But for reasons explained above, we’ll actually look at the folded normal and folded Cauchy distributions, i.e. we’ll take their absolute values to create right-tailed distributions.

To calculate the obesity index exactly you’d need to do analytical calculations with order statistics. We’ll simulate the obesity index because that’s easier. It’s also more in the spirit of calculating the obesity index from data.

    from scipy.stats import norm, expon, cauchy

    def simulate_obesity(dist, N):
        # Draw N sets of 4 samples, taking absolute values to fold the
        # distribution over to the right half line.
        data = abs(dist.rvs(size=(N,4)))
        count = 0
        for row in range(N):
            X = sorted(data[row])    # X[0] is the smallest sample, X[3] the largest
            if X[0] + X[3] > X[1] + X[2]:
                count += 1
        return count/N               # estimated probability that min + max > middle two

    for dist in [norm, expon, cauchy]:
        print( simulate_obesity(dist, 10000) )

When I ran the Python code above, I got

    0.6692
    0.7519
    0.8396

This ranks the three distributions in the anticipated order of tail thickness.

Note that the code above takes the absolute value of the random samples. This lets us pass in ordinary (unfolded) versions of the normal and Cauchy distributions, and it’s redundant for any distribution, like the exponential, that’s already positive-valued.

[I found out after writing this blog post that SciPy now has foldnorm and foldcauchy, but they don’t seem to work like I expect.]

Let’s try it on a few more distributions. Lognormal is between exponential and Cauchy in thickness. A Pareto distribution with parameter b has a density that goes to zero like x^(-1-b), and so we expect a Pareto distribution to have a smaller obesity index than Cauchy when b is greater than 1, and a larger index when b is less than 1. Once again the simulation results are what we’d expect.

The code

    from scipy.stats import lognorm, pareto
    # lognorm requires a shape parameter; s = 1 (a standard lognormal) is assumed here
    for dist in [lognorm(1), pareto(2), pareto(0.5)]:
        print( simulate_obesity(dist, 10000) )

returns

    0.7766
    0.8242
    0.9249

By this measure, lognormal is just a little heavier than exponential. Pareto(2) comes in lighter than Cauchy, but not by much, and Pareto(0.5) comes in heavier.

Since the obesity index is a probability, it will always return a value between 0 and 1. Maybe it would be easier to interpret if we did something like take the logit transform of the index to spread the values out more. Then the distinctions between Pareto distributions of different orders, for example, might match intuition better.
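For example, a logit transform of the simulated values above would look something like this.

    import numpy as np

    def logit(p):
        return np.log(p/(1 - p))

    # simulated obesity indexes from above:
    # normal, exponential, Cauchy, lognormal, Pareto(2), Pareto(0.5)
    indexes = np.array([0.6692, 0.7519, 0.8396, 0.7766, 0.8242, 0.9249])
    print( logit(indexes) )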

[1] Roger M. Cooke et al. Fat-Tailed Distributions: Data, Diagnostics and Dependence. Wiley, 2014.

Categorical Data Analysis

Categorical data analysis could mean a couple different things. One is analyzing data that falls into unordered categories (e.g. red, green, and blue) rather than numerical values (e.g. height in centimeters).

Another is using category theory to assist with the analysis of data. Here “category” means something more sophisticated than a list of items you might choose from in a drop-down menu. Instead we’re talking about applied category theory.

So we have ((categorical data) analysis) and (categorical (data analysis)), i.e. analyzing categorical data and categorically analyzing data. The former is far, far more common.

I ran across Alan Agresti’s classic book the other day in a used book store. The image below is from the third (2012) edition. The book store had the 1st (1990) edition with a more austere cover.

I bought Agresti’s book because it’s a good reference to have. But I was a little disappointed. My first thought was that someone had written a book on category theory and statistics, which is not the case, as far as I know.

The main reference for category theory and statistics is Peter McCullagh’s 2002 paper What is a statistical model? That paper raised a lot of interesting ideas, but the statistics community did not take McCullagh’s bait.

commutative diagram for statistical models

Maybe this just wasn’t a fruitful idea. I suspect it is a fruitful idea, but the number of people available to develop it, conversant in both statistics and category theory, is very small. I’ve seen category theory used in mathematical modeling more generally, but not in statistics per se.

At its most basic, category theory asks you to be explicit about the domain and range (codomain) of functions. It would be very helpful if statisticians merely did this. Statistical notation is notoriously bad at revealing where a function goes from and to, or even when a function is a function. Just 0th level category theory, defining categories, would be useful. Maybe it would be useful to go on to identifying limits or adjoints, but simply being explicit about “from” and “to” would be a good start.

Category theory is far too abstract to completely carry out a statistical analysis. But it can prompt you to ask questions that check whether your model has any inconsistencies you hadn’t noticed. The idea of a “categorical error” doesn’t differ that much moving from its philosophical meaning under Aristotle to its mathematical meaning under MacLane. Nor does the idea of something being “natural.” One of the primary motivations for creating category theory was to come up with a rigorous definition of what it means for something in math to be “natural.”

Hypothesis testing vs estimation

I was looking at my daughter’s statistics homework recently, and there were a pair of questions about testing the level of lead in drinking water. One question concerned testing whether the water was safe, and the other concerned testing whether the water was unsafe.

There’s something bizarre, even embarrassing, about this. You want to do two things: estimate the amount of lead, and decide what to do in response. But instead of simply doing just that, you do this arcane dance of choosing two hypotheses, one natural and one arbitrary, and treating the two asymmetrically, depending on which one you call the null and which you call the alternative. This asymmetry is the reason you make a distinction between testing whether the water is safe and testing whether it is unsafe.

It’s a weird tangle of estimation and decision making. The decision-making rules implicit in the procedure are not at all transparent. And even though you are testing the level of lead, you’re doing so indirectly.

The Bayesian approach to the problem is much easier to understand. You estimate the probability distribution for the concentration of lead based on all available information. You can plot this distribution and show it to civil engineers, politicians, or anybody else who needs to make a decision. Non-statisticians are much more likely to understand such a plot than the nuances of null and alternative hypotheses, significance, power, and whether you’re testing for safety versus testing for non-safety. (Statisticians are more likely to understand estimation as well.)

In the homework problems, the allowable level of lead was 15 ppm. After obtaining the posterior distribution on the concentration of lead, you could simply estimate the probability that the concentration is above 15 ppm. But you could also calculate the probability that the concentration lies in any other range you’re interested in.
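For example, if you have samples from the posterior distribution of the concentration, however they were computed, the estimate is a one-liner. The samples below are a hypothetical stand-in, not real data.

    import numpy as np

    # hypothetical posterior samples of lead concentration (ppm), e.g. from an MCMC run
    posterior = np.random.default_rng(0).gamma(shape=9, scale=1.5, size=10_000)

    print( np.mean(posterior > 15) )                      # probability the concentration exceeds 15 ppm
    print( np.mean((posterior > 10) & (posterior < 20)) ) # probability it lies in any other range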

Classical statistics does not allow such probability calculations. Even a confidence interval, something that looks like a probability statement about the concentration of lead, is actually a probability statement about the statistical process being used and not a probability statement about lead concentration per se.

Generalized normal distribution and kurtosis

The generalized normal distribution adds an extra parameter β to the normal (Gaussian) distribution. The probability density function for the generalized normal distribution is

\frac{\beta}{2\sigma \Gamma\left(\frac{1}{\beta}\right )} \exp\left(-\left|\frac{x-\mu}{\sigma} \right|^\beta \right)

Here the location parameter μ is the mean, but the scaling factor σ is not the standard deviation; the variance is σ² Γ(3/β) / Γ(1/β).

For small values of the shape parameter β, the distribution is more sharply pointed in the middle and decays more slowly in the tails. We say the tails are “thick” or “heavy.” When β = 1 the generalized normal distribution reduces to the Laplace distribution.

Here are examples with μ = 0 and σ = 1.

The normal distribution is a special case corresponding to β = 2. Large values of β make the distribution flatter on top and thinner (lighter) in the tails. Again μ = 0 and σ = 1 in the plots below.

thick-tailed generalized normal densities
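SciPy implements this family as gennorm, with the same shape parameter β, so plots like the ones described above take just a few lines. The particular β values here are my choices.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gennorm

    x = np.linspace(-3, 3, 500)
    for beta in [0.5, 1, 2, 4, 8]:
        plt.plot(x, gennorm.pdf(x, beta), label=f"β = {beta}")
    plt.legend()
    plt.title("Generalized normal densities with μ = 0, σ = 1")
    plt.show()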

One way to measure the thickness of probability distribution tails is kurtosis. The normal distribution has kurtosis equal to 3. Smaller values of kurtosis correspond to thinner tails and larger values to thicker tails.

There’s a common misunderstanding that kurtosis measures how pointy the distribution is in the middle. Often that’s the case, and in fact that’s the case for the generalized normal distribution. But it’s not true in general. It’s possible for a distribution to be flat on top and have heavy tails or pointy on top and have thin tails.

Distributions with thinner tails than the normal are called “platykurtic” and distributions with thicker tails than the normal are called “leptokurtic.” The names were based on the misunderstanding mentioned above. The platy– prefix means broad, but it’s not the tails that are broader, it’s the middle. Similarly, the lepto– prefix means “thin”, referring to being pointy in the middle. But leptokurtic distributions have thicker tails!

The kurtosis of the generalized normal distribution is given by

\frac{ \Gamma\left( \frac{5}{\beta} \right ) \Gamma\left( \frac{1}{\beta} \right ) }{\Gamma\left(\frac{3}{\beta}\right)^2}

We can use that to visualize how the kurtosis varies as a function of the shape parameter β.

The Laplace distribution (β = 1) has kurtosis 6 and the normal distribution (β = 2) has kurtosis 3.

You can use the fact that Γ(x) ~ 1/x for small x to show that in the limit as β goes to infinity, the kurtosis approaches 9/5.
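Here’s that formula in code, using scipy.special.gamma.

    from scipy.special import gamma

    def gen_normal_kurtosis(beta):
        return gamma(5/beta)*gamma(1/beta) / gamma(3/beta)**2

    print( gen_normal_kurtosis(1) )     # 6, the Laplace distribution
    print( gen_normal_kurtosis(2) )     # 3, the normal distribution
    print( gen_normal_kurtosis(100) )   # close to the limiting value 9/5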

Related post: Computing skewness and kurtosis in one pass

Making sense of a probability problem in the WSJ

Wall Street Journal clipping

Someone wrote to me the other day asking if I could explain a probability example from the Wall Street Journal. (“Proving Investment Success Takes Time,” Spencer Jakab, November 25, 2017.)

Victor Haghani … and two colleagues told several hundred acquaintances who worked in finance that they would flip two coins, one that was normal and the other that was weighted so it came up heads 60% of the time. They asked the people how many flips it would take them to figure out, with a 95% confidence level, which one was the 60% coin. Told to give a “quick guess,” nearly a third said fewer than 10 flips, while the median response was 40. The correct answer is 143.

The anecdote is correct in spirit: it takes longer to discover the better of two options than most people suppose. But it’s jarring to read that the answer is precisely 143 when the question hasn’t been stated clearly.

How many flips would it take to figure out which coin is better with a 95% confidence level? For starters, the answer would have to be a distribution, not a single number. You might quickly come to the right conclusion. You might quickly come to the wrong conclusion. You might flip coins for a long time and never come to a conclusion. Maybe there is a way of formulating the problem so that the expected value of the distribution is 143.

How are you to go about flipping the coins? Do you flip both of them, or just flip one coin? For example, you might flip both coins until you are confident that one is better, and conclude that the better one is the one that was designed to come up heads 60% of the time. Or you could just flip one of them and test the hypothesis Prob(heads) = 0.5 versus the alternative Prob(heads) = 0.6. Or maybe you flip one coin two times for every one time you flip the other. Etc.

What do you mean by “95% confidence level”? Is this a frequentist confidence interval? And do you compute the (Bayesian) predictive probability of arriving at such a confidence level? Are you computing the (Bayesian) posterior model probabilities of two models, one in which the first coin has probability of heads 0.5 and the second has probability 0.6 versus the opposite model?

Do you assume that you know a priori that one coin has probability of heads 0.5 and the other 0.6, or do you not assume this and just want to find the coin with higher probability of heads, and evaluate such a model when in fact the probabilities of heads are as stated?

Are you conducting an experiment with a predetermined sample size of 143? Or are you continuously monitoring the data, stopping when you reach your conclusion?

I leave it as an exercise to the reader to implement the various alternatives suggested above and see whether one of them produces 143 as a result. (I did a back-of-the-envelope calculation that suggests there is one.) So the first question is to reverse engineer which problem statement the article was based on. The second question is to decide which problem formulation you believe would be most appropriate in the context of the article.
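Here’s a sketch of one of the Bayesian formulations above: flip a single coin, known a priori to be either fair or 60% heads with equal probability, update the posterior model probabilities after each flip, and stop when one model reaches 95%. This is just one possible formulation; I’m not claiming it reproduces the 143 figure.

    import numpy as np

    def flips_to_decision(p_true, rng, p_fair=0.5, p_biased=0.6, cutoff=0.95, max_flips=10_000):
        post = 0.5    # prior probability that this is the biased coin
        for n in range(1, max_flips + 1):
            heads = rng.random() < p_true
            like_biased = p_biased if heads else 1 - p_biased
            like_fair = p_fair if heads else 1 - p_fair
            post = post*like_biased / (post*like_biased + (1 - post)*like_fair)
            if post > cutoff or post < 1 - cutoff:
                return n
        return max_flips

    rng = np.random.default_rng(42)
    stops = [flips_to_decision(0.6, rng) for _ in range(10_000)]
    print( np.mean(stops), np.median(stops) )   # a distribution of stopping times, not a single number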

How can a statistician help a lawyer?

I’ll be presenting at a webinar on Wednesday, December 13 at 1:00 PM Eastern. The title of the presentation is “Seven questions a statistician can answer for an attorney.”

I will discuss, among other things, when common sense applies and when correct analysis can be counter-intuitive. There will be ample time at the end of the presentation for Q & A.

If you’re interested in attending, you can register here.

ACEDS association of certified e-discovery specialists and LexInsight

Handedness, introversion, height, blood type, and PII

I’ve had data privacy on my mind a lot lately because I’ve been doing some consulting projects in that arena.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness carries -log2(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately follow a normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you’d get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered in the affirmative.
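Here’s the same arithmetic in code, using the probabilities estimated above.

    from math import log2

    # probabilities from the estimates above
    probabilities = {
        "left-handed": 0.10,
        "introvert": 0.5,
        "blood type O+": 0.37,
        "over 6 ft 5 in": 0.0005,
    }

    bits = {trait: -log2(p) for trait, p in probabilities.items()}
    for trait, b in bits.items():
        print( f"{trait}: {b:.1f} bits" )
    print( f"total: {sum(bits.values()):.1f} bits" )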

***

[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10-15. This is an example of where the normal distribution assumption breaks down in the extremes.