Big data and the law

computer law

Excerpt from the new book Big Data of Complex Networks:

Big Data and data protection law provide for a number of mutual conflicts: from the perspective of Big Data analytics, a strict application of data protection law as we know it today would set an immediate end to most Big Data applications. From the perspective of the law, Big Data is either a big threat … or a major challenge for international and national lawmakers to adopt today’s data protection laws to the latest technological and economic developments.

Emphasis added.

The author of the chapter on legal matters is Swiss and writes primarily in a European context, though all countries face similar problems.

I’m not a lawyer, though I sometimes work with lawyers, and sometimes help companies with the statistical aspects of HIPAA law. But as a layman the observation above sounds reasonable to me, that strict application of the law could bring many applications to a halt, for better and for worse.

In my opinion the regulations around HIPAA and de-identification are mostly reasonable. The things it prohibits mostly should be prohibited. And it has a common sense provision in the form of expert determination. If your data uses fall outside the regulation’s specific recommendations but don’t endanger privacy, you can have an expert can certify that this is the case.

Subjectivity in statistics

Andrew Gelman on subjectivity in statistics:

Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

People worry about prior distributions, not because they’re subjective, but because they’re explicitly subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are implicitly subjective.

In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.

If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.

Interim analysis, futility monitoring, and predictive probability

An interim analysis of a clinical trial is an unusual analysis. At the end of the trial you want to estimate how well some treatment X works. For example, you want to how likely is it that treatment X works better than the control treatment Y. But in the middle of the trial you want to know something more subtle.

It’s possible that treatment X is doing so poorly that you want to end the trial without going any further. It’s also possible that X is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with futility. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments and Y are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.

Maybe treatment X is doing a little better than Y, but not so much better that you can declare with confidence that X is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.

Futility analysis is more about resources than ethics. If X is doing poorly, ethics might dictate that you stop giving X to patients so you stop early. If X is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if X is doing so-so, there’s usually not an ethical reason to stop, unless X is worse than Y on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.

Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think you’re estimate regard effectiveness in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the trial, not of the treatment per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.

This is why predictive probability is natural for interim analysis. You’re trying to predict outcomes, not parameters. (This is subtle: you’re trying to estimate the probability of outcomes that lead to certain estimates of parameters, namely those that allow you to reach a conclusion with pre-specified significance.)

Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a p-value threshhold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (Bayesian) predictive probability of meeting your frequentist criteria!

Click to learn more about Bayesian statistics consulting

Gentle introduction to R

The R language is closely tied to statistics. It’s ancestor was named S, because it was a language for Statistics. The open source descendant could have been named ‘T’, but its creators chose to call it’R.’

Most people learn R as they learn statistics: Here’s a statistical concept, and here’s how you can compute it in R. Statisticians aren’t that interested in the R language itself but see it as connective tissue between commands that are their primary interest.

This works for statisticians, but it makes the language hard for non-statisticians to approach. Years ago I managed a group of programmers who supported statisticians. At the time, there were no books for learning R without concurrently learning statistics. This created quite a barrier to entry for programmers whose immediate concern was not the statistical content of an R program.

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t start there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R per se. This is the book I wish I could have handed my programmers who had to pick up R.

Uncertainty in a probability

Suppose you did a pilot study with 10 subjects and found a treatment was effective in 7 out of the 10 subjects.

With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.

Now what would you estimate the probability to be that the treatment is effective in the next two subjects? You might say 0.49, and that would be correct if we knew that the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.

If the probability of success is p, then the probability of s successes and f failures in the next sf subjects is given by

{s+f \choose s} p^s (1-p)^f

But if our probability of success has some uncertainty and we assume it has a beta(ab) distribution, then the predictive probability of s successes and f failures is given by

{s+f \choose s} \frac{B(a+s, b+f)}{B(a,b)}


B(x, y) = \frac{\Gamma(x)\, \Gamma(y)}{\Gamma(x+y)}

In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?

We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability calculates the probability of outcomes under all possible values of the probability, then creates a weighted average, weighing each probability of success by the probability of that value. The differences corresponding to probability above and below 0.7 approximately balance out, but the former carry a little more weight and so we get roughly what we did before.

If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.

If our initial experiment has shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on point estimate, but still different.

The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.3 and using predictive probability with a beta(7,3) distribution.

So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.

Click to learn more about Bayesian statistics consulting

Insufficient statistics

Experience with the normal distribution makes people think all distributions have (useful) sufficient statistics [1]. If you have data from a normal distribution, then the sufficient statistics are the sample mean and sample variance. These statistics are “sufficient” in that the entire data set isn’t any more informative than those two statistics. They effectively condense the data for you. (This is conditional on knowing the data come from a normal. More on that shortly.)

With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.

Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.

Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, conditional on the normality assumption, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.


[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.

ETAOIN SHRDLU and all that

Statistics can be useful, even if it’s idealizations fall apart on close inspection.

For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters “ETAOIN SHRDLU” comes from the days of Linotype when letters were arranged in that order, in decreasing order of frequency. Sometimes you’d see ETAOIN SHRDLU in print, just as you might see “QWERTY” today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)

One letter has worn off my keyboard

One letter has worn off my keyboard

So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn’t the same as it’s frequency in English. And if you’d picked the novel Gadsby instead of A Tale of Two Cities you’d get very different results since that book was written without using a single letter E.

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That’s a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it’s still not English.

What might we mean by “English” when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I’ve altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn’t a fixed thing. When we say “English” as a whole we have in mind some idealization that strictly speaking doesn’t exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not some particular English corpus, we can’t answer that to too many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter’s frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can’t expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can’t say with certainty that the most common letter in a particular sample must correspond to E, the next to T, etc.

By the way, Peter Norvig’s analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.


Mittag-Leffler function and probability distribution

The Mittag-Leffler function is a generalization of the exponential function. Since k!= Γ(k + 1), we can write the exponential function’s power series as

\exp(x) = \sum_{k=0}^\infty \frac{x^k}{\Gamma(k+1)}

and we can generalize this to the Mittag=Leffler function

E_{\alpha, \beta}(x) = \sum_{k=0}^\infty \frac{x^k}{\Gamma(\alpha k+\beta)}

which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,

E_{2,1}(x) = \cosh(\sqrt{x})


E_{1/2, 1}(x) = \exp(x^2) \, \mbox{erfc}(-x)

where erfc(x) is the complementary error function.


Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).

The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function Eα is Eα, 1.

Mittag-Leffler probability distributions

Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.

Continuous Mittag-Leffler distribution

The exponential function exp(-x) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density (PDF) function is f(x) = exp(-x) and whose distribution function (CDF) is F(x) = 1 – exp(-x). The Mittag-Leffler distribution has CDF is 1 – Eα(-xα) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. [1]

Discrete Mittag-Leffler distribution

The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the kth term as the probability mass for k. That is,

P(X = k) = \frac{1}{\exp(\lambda)} \frac{\lambda^k}{k!}

The analogous discrete Mittag-Leffler distribution [2] has probability mass function

P(X = k) = \frac{1}{E_{\alpha, \beta}(\lambda)} \frac{\lambda^k}{\Gamma(\alpha k + \beta)}.

Fractional differential equations

In addition to probability and statistics, the the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential distribution in classical calculus. Just as the solution to the simply differential equation

\frac{d}{dx} f(x) = a\, f(x)

is exp(ax), for 0 < μ < 1, the soltuion to the fractional differential equation

\frac{d^\mu}{dx^\mu} f(x) = a\, f(x)

is axμ-1 Eμ, μ(a xμ). Note that this reduces to exp(ax) when μ = 1. [3]


[1] Gwo Dong Lin. Journal of Statistical Planning and Inference 74 (1998) 1–9, On the Mittag–Leffler distributions

[2] Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1

[3] Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.

Sparsely populated zip codes

The dormitory I lived in as an undergraduate had its own five-digit zip code at one time. It was rumored to be the largest dorm in the US, or maybe the largest west of the Mississippi, or something like that. There were about 3,000 of us living there. Although the dorm had enough people to justify its own zip code—some zip codes have far fewer people—zip code boundaries were redraw so that the dorm shares its zip code with other areas.

Some zip code are so sparsely populated that people living in these areas are relatively easy to identify if you have other data. The so-called Safe Harbor provision of HIPAA (Health Insurance Portability and Accountability Act) says that it’s usually OK to include the first three digits of someone’s zip code in de-identified data. But there are 17 areas so thinly populated that even listing the first three digits of their zip code is considered too much of an identification risk. These are areas such that the first three digits of the zip code are:

  • 036
  • 059
  • 063
  • 102
  • 203
  • 556
  • 692
  • 790
  • 821
  • 823
  • 830
  • 831
  • 878
  • 879
  • 884
  • 890
  • 893

This list could change over time. These are the regions that currently contain fewer than 20,000 people, the criterion given in the HIPAA regulations.

Knowing that someone is part of an area containing 20,000 people hardly identifies them. The concern is that in combination with other information, zip code data is more informative in these areas.

Related post: Bayesian clinical trials in one zip code

Need help with HIPAA de-identification?

Cepstrum, quefrency, and pitch

John Tukey

John Tukey coined many terms that have passed into common use, such as bit (a shortening of binary digit) and software. Other terms he coined are well known within their niche: boxplot, ANOVA, rootogram, etc. Some of his terms, such as jackknife and vacuum cleaner, were not new words per se but common words he gave a technical meaning to.

Cepstrum is an anagram of spectrum. It involves an unusual use of power spectra, and is roughly analogous to making anagrams of a word. A related term, one we will get to shortly, is quefrency, an anagram of frequency. Some people pronounce the ‘c’ in cepstrum hard (like ‘k’) and some pronounce it soft (like ‘s’).

Let’s go back to an example from my post on guitar distortion. Here’s a note played with a fairly large amount of distortion:


And here is its power spectrum:

single note with distortion

There’s a lot going on in the spectrum, but the peaks are very regularly spaced. As I mentioned in the post on the sound of a leaf blower, this is the fingerprint of a sound with a definite pitch. Spikes in the spectrum alone don’t indicate a definite pitch if they are irregularly spaced.

The peaks are fairly periodic. How to you find periodic patterns in a signal? Fourier transform! But if you simply take the Fourier transform of a Fourier transform, you essentially get the original signal back. The key to the cepstrum is to do something else between the two Fourier transforms.

The cepstrum starts by taking the Fourier transform, then the magnitude, then the logarithm, and then the inverse Fourier transform.

When we take the magnitude, we throw away phase information, which we don’t need in this context. Taking the log of the magnitude is essentially what you do when you compute sound pressure level. Some define the cepstrum using the magnitude of the Fourier transform and some the magnitude squared. Squaring only introduces a multiple of 2 once we take logs, so it doesn’t effect the location of peaks, only their amplitude.

Taking the logarithm compresses the peaks, bringing them all into roughly the same range, making the sequence of peaks roughly periodic.

When we take the inverse Fourier transform, we now have something like a frequency, but inverted. This is what Tukey called quefrency.

Looking at the guitar power spectrum above, we see a sequence of peaks spaced 440 Hz apart. When we take the inverse Fourier transform of this, we’re looking at a sort of frequency of a frequency, what Tukey calls quefrency. The quefrency scale is inverted: sounds with a high frequency fundamental have overtones that are far apart on the frequency domain, so the sequence of the overtone peaks has low frequency.

Here’s the plot of the cepstrum for the guitar sample.

electric guitar cepstrum

There’s a big peak at 109 on the quefrency scale. The audio clip was recorded at 48000 samples per second, so the 109 on the quefrency scale corresponds to a frequency of 48000/109 = 440 Hz. The second peak is at quefrency 215, which corresponds to 48000/215 = 223 Hz. The second peak corresponds to the perceived pitch of the note, A3, and the first peak corresponds to its first harmonic, A4. (Remember the quefrency scale is inverted relative to the frequency scale.)

I cheated a little bit in the plot above. The very highest peaks are at 0. They are so large that they make it hard to see the peaks we’re most interested in. These low quefrency peaks correspond to very high frequency noise, near the edge of the audible spectrum or beyond.

Click to learn more about consulting help with signal processing

Continuum between anecdote and data

The difference between anecdotal evidence and data is overstated. People often have in mind this dividing line where observations on one side are worthless and observations on the other side are trustworthy. But there’s no such dividing line. Observations are data, but some observations are more valuable than others, and there’s a continuum of value.

Rib eye steak

I believe rib eye steaks are better for you than rat poison. My basis for that belief is anecdotal evidence. People who have eaten rib eye steaks have fared better than people who have eaten rat poison. I don’t have exact numbers on that, but I’m pretty sure it’s true. I have more confidence in that than in any clinical trial conclusion.

Hearsay evidence about food isn’t very valuable, per observation, but since millions of people have eaten steak for thousands of years, the cumulative weight of evidence is pretty good that steak is harmless if not good for you. The number of people who have eaten rat poison is much smaller, but given the large effect size, there’s ample reason to suspect that eating rat poison is a bad idea.

Now suppose you want to get more specific and determine whether rib eye steaks are good for you in particular. (I wouldn’t suggest trying rat poison.) Suppose you’ve noticed that you feel better after eating a steak. Is that an anecdote or data? What if you look back through your diary and noticed that every mention of eating steak lately has been followed by some remark about feeling better than usual. Is that data? What if you decide to flip a coin each day for the next month and eat steak if the coin comes up heads and tofu otherwise. Each of these steps is an improvement, but there’s no magical line you cross between anecdote and data.

Suppose you’re destructively testing the strength of concrete samples. There are better and worse ways to conduct such experiments, but each sample gives you valuable data. If you test 10 samples and they all withstand two tons of force per square inch, you have good reason to believe the concrete the samples were taken from can withstand such force. But if you test a drug on 10 patients, you can’t have the same confidence that the drug is effective. Human subjects are more complicated than concrete samples. Concrete samples aren’t subject to placebo effects. Also, cause and effect are more clear for concrete. If you apply a load and the sample breaks, you can assume the load caused the failure. If you treat a human for a disease and they recover, you can’t be as sure that the treatment caused the recovery. That doesn’t mean medical observations aren’t data.

Carefully collected observations in one area may be less statistically valuable than anecdotal observations in another. Observations are never ideal. There’s always some degree of bias, effects that can’t be controlled, etc. There’s no quantum leap between useless anecdotes and perfectly informative data. Some data are easy to draw inference from, but data that’s harder to understand doesn’t fail to be data.

Click to learn more about Bayesian statistics consulting

The empty middle: why no one is average

In 1945, a Cleveland newspaper held a contest to find the woman whose measurements were closest to average. This average was based on a study of 15,000 women by Dr. Robert Dickinson and embodied in a statue called Norma by Abram Belskie. Out of 3,864 contestants, no one was average on all nine factors, and fewer than 40 were close to average on five factors. The story of Norma and the Cleveland contest is told in Todd Rose’s book The End of Average.

People are not completely described by a handful of numbers. We’re much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I’ll explain why that’s true in general, then look back at the Norma example.

General theory

Suppose you have N points, each described by n independent, standard normal random variables. That is, each point has the form (x1, x2, x2, …, xn) where each xi is independent with a normal distribution with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect that most points are piled up near the origin (0, 0, 0, …, 0). In fact most points are in spherical shell around the origin. Specifically, as n becomes larger, most of the points will be in a thin shell with distance √n from the origin. (More details here.)

Simulated contest

In the contest above, n = 9, and so we expect most contestants to be about a distance of 3 from average when we normalize each of the factors being measured, i.e. we subtract the mean so that each factor has mean 0, and we divide each by its standard deviation so the standard deviation is 1 on each factor.

We’ve made several simplifying assumptions. For example, we’ve assumed independence, though presumably some of the factors measured in the contest were correlated. There’s also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we’ll run with our simplified model just to see how it behaves in a simulation.

import numpy as np

# Winning critera: minimum Euclidean distance
def euclidean_norm(x):
    return np.linalg.norm(x)

# Winning criteria: min-max
def max_norm(x):
    return max(abs(x))

n = 9
N = 3864

# Simulated normalized measurements of contestants 
M = np.random.normal(size=(N, n))

euclid = np.empty(N)
maxdev = np.empty(N)
for i in range(N):
    euclid[i] = euclidean_norm(M[i,:])
    maxdev[i] = max_norm(M[i,:])

w1 = euclid.argmin()
w2 = maxdev.argmin()

print( M[w1,:] )
print( euclidean_norm(M[w1,:]) )
print( M[w2,:] )
print( max_norm(M[w2,:]) )

There are two different winners, depending on how we decide the winner. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were

[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]

corresponding to a Euclidean distance of 0.7808.

If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were

[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]

with the largest deviation being the last, 0.4593.

By either measure, the contestant closest to the average deviated significantly from the average in at least one dimension.

* * *

For daily posts on probability, follow @ProbFact on Twitter.

ProbFact twitter icon

Bayesian and nonlinear

Someone said years ago that you’ll know Bayesian statistics has become mainstream when people no longer put “Bayesian” in the titles of their papers. That day has come. While the Bayesian approach is still the preferred approach of a minority of statisticians, it’s no longer a novelty. If you want people to find your paper interesting, the substance needs to be interesting. A Bayesian approach isn’t remarkable alone.

You could say the same about nonlinear differential equations. Differential equations are so often nonlinear that the “nonlinear” qualifier isn’t always necessary to say explicitly. Just as a Bayesian analysis isn’t interesting just because it’s Bayesian, a differential equation isn’t necessarily interesting just because it’s nonlinear.

The analogy between Bayesian statistics and nonlinear differential equations breaks down though. Nonlinear equations are intrinsically more interesting than linear ones. But it’s no longer remarkable to solve a nonlinear differential equation numerically.

When an adjective becomes the default, it drops off and the previous default now requires an adjective. Terms like “electronic” and “digital” are fading from use. If you say you’re going to mail someone something, the default assumption is usually that you are going to email it. What used to be simply “mail” is now “snail mail.” Digital signal processing is starting to sound quaint. The abbreviation DSP is still in common use, but digital signal processing is simply signal processing. Now non-digital signal processing requires a qualifier, i.e. analog.

There was no term for Frequentist statistics when it was utterly dominant. Now of course there is. (Some people use the term “classical,” but that’s an odd term given that Bayesian analysis is older.) The term linear has been around a long time. Even when nearly all analysis was linear, people were aware that linearity was a necessary simplification.

Related posts:

Improving on Chebyshev’s inequality

Chebyshev’s inequality says that the probability of a random variable being more than k standard deviations away from its mean is less than 1/k2. In symbols,

P(|X - E(X)| \geq k\sigma) \leq \frac{1}{k^2}

This inequality is very general, but also very weak. It assumes very little about the random variable X but it also gives a loose bound. If we assume slightly more, namely that X has a unimodal distribution, then we have a tighter bound, the Vysochanskiĭ-Petunin inequality.

P(|X - E(X)| \geq k\sigma) \leq \frac{4}{9k^2}

However, the Vysochanskiĭ-Petunin inequality does require that k be larger than √(8/3). In exchange for the assumption of unimodality and the restriction on k we get to reduce our upper bound by more than half.

While tighter than Chebyshev’s inequality, the stronger inequality is still very general. We can usually do much better if we can say more about the distribution family. For example, suppose X has a uniform distribution. What is the probability that X is more than two standard deviations from its mean? Zero, because two standard deviations puts you outside the interval the uniform is defined on!

Among familiar distributions, when is the Vysochanskiĭ-Petunin inequality most accurate? That depends, of course, on what distributions you consider familiar, and what value of k you use. Let’s look at normal, exponential, and Pareto. These were chosen because they have thin, medium, and thick tails. We’ll also throw in the double exponential, because it has the same tail thickness as exponential but is symmetric. We’ll let k be 2 and 3.

Distribution familyP(|X – E(X)| > 2σ)V-P estimateP(|X – E(X)| > 3σ)V-P estimate
Double exponential0.05910.11110.01440.0484

A normal random variable is more than 2 standard deviations away from its mean with probability 0.0455, compared to the Vysochanskiĭ-Petunin bound of 1/9 = 0.1111. A normal random variable is more than 3 standard deviations away from its mean with probability 0.0027, compared to the bound of 4/81 = 0.0484.

An exponential random variable with mean μ also has standard deviation μ, so the only way it could be more than 2μ from its mean is to be 3μ from 0. So an exponential is more that 2 standard deviations from its mean with probability exp(-3) = 0.0498, and more than 3 standard deviations with probability exp(-4) = 0.0183.

We’ll set the minimum value of our Pareto random variable to be 1. As with the exponential, the Pareto cannot be 2 standard deviations less than its mean, so we look at the probability of it being more than 2 greater than its mean. The shape parameter α must be bigger than 2 for for the variance to exist. The probability of our random variable being more than k standard deviations away from its mean works out to ((α-1)/((k-1)α))α and is largest as α converges down toward 2. The limiting values for k equal to 2 and 3 are 1/36 = 0.0277 and 1/64 = 0.0156 respectively. Of our examples, the Pareto distribution comes closest to the Vysochanskiĭ-Petunin bounds, but doesn’t come that close.

The double exponential, also know as Laplace, has the highest probability of any of our examples of being two standard deviations from its mean, but this probability is still less than half of the Vysochanskiĭ-Petunin bound. The limit of the Pareto distribution has the highest probability of being three standard deviations from its mean, but stil less than one-third of the Vysochanskiĭ-Petunin bound.

Generic bounds are useful, especially in theoretical calculations, but it’s usually possible to do much better with specific distributions.

More inequality posts:

For daily posts on probability, follow @ProbFact on Twitter.

ProbFact twitter icon