# Improving on Chebyshev’s inequality

Chebyshev’s inequality says that the probability of a random variable being more than k standard deviations away from its mean is less than 1/k2. In symbols,

This inequality is very general, but also very weak. It assumes very little about the random variable X but it also gives a loose bound. If we assume slightly more, namely that X has a unimodal distribution, then we have a tighter bound, the Vysochanskiĭ-Petunin inequality.

However, the Vysochanskiĭ-Petunin inequality does require that k be larger than √(8/3). In exchange for the assumption of unimodality and the restriction on k we get to reduce our upper bound by more than half.

While tighter than Chebyshev’s inequality, the stronger inequality is still very general. We can usually do much better if we can say more about the distribution family. For example, suppose X has a uniform distribution. What is the probability that X is more than two standard deviations from its mean? Zero, because two standard deviations puts you outside the interval the uniform is defined on!

Among familiar distributions, when is the Vysochanskiĭ-Petunin inequality most accurate? That depends, of course, on what distributions you consider familiar, and what value of k you use. Let’s look at normal, exponential, and Pareto. These were chosen because they have thin, medium, and thick tails. We’ll also throw in the double exponential, because it has the same tail thickness as exponential but is symmetric. We’ll let k be 2 and 3.

Distribution familyP(|X – E(X)| > 2σ)V-P estimateP(|X – E(X)| > 3σ)V-P estimate
Uniform0.00000.11110.00000.0484
Normal0.04550.11110.00270.0484
Exponential0.04980.11110.01830.0484
Pareto0.02770.11110.01560.0484
Double exponential0.05910.11110.01440.0484

A normal random variable is more than 2 standard deviations away from its mean with probability 0.0455, compared to the Vysochanskiĭ-Petunin bound of 1/9 = 0.1111. A normal random variable is more than 3 standard deviations away from its mean with probability 0.0027, compared to the bound of 4/81 = 0.0484.

An exponential random variable with mean μ also has standard deviation μ, so the only way it could be more than 2μ from its mean is to be 3μ from 0. So an exponential is more that 2 standard deviations from its mean with probability exp(-3) = 0.0498, and more than 3 standard deviations with probability exp(-4) = 0.0183.

We’ll set the minimum value of our Pareto random variable to be 1. As with the exponential, the Pareto cannot be 2 standard deviations less than its mean, so we look at the probability of it being more than 2 greater than its mean. The shape parameter α must be bigger than 2 for for the variance to exist. The probability of our random variable being more than k standard deviations away from its mean works out to ((α-1)/((k-1)α))α and is largest as α converges down toward 2. The limiting values for k equal to 2 and 3 are 1/36 = 0.0277 and 1/64 = 0.0156 respectively. Of our examples, the Pareto distribution comes closest to the Vysochanskiĭ-Petunin bounds, but doesn’t come that close.

The double exponential, also know as Laplace, has the highest probability of any of our examples of being two standard deviations from its mean, but this probability is still less than half of the Vysochanskiĭ-Petunin bound. The limit of the Pareto distribution has the highest probability of being three standard deviations from its mean, but stil less than one-third of the Vysochanskiĭ-Petunin bound.

Generic bounds are useful, especially in theoretical calculations, but it’s usually possible to do much better with specific distributions.

More inequality posts:

# Trends and Opportunities in Data Analysis

Andy Warhol said “In the future, everyone will be world-famous for 15 minutes.” Here’s my 15 seconds of fame, a soundbite from the IBM Insight conference last year.

My comments start at 1:30. In a nutshell, I predict that data analytics will work its way down from large companies to small companies.

# Connection between hypergeometric distribution and series

What’s the connection between the hypergeometric distributions, hypergeometric functions, and hypergeometric series?

The hypergeometric distribution is a probability distribution with parameters NM, and n. Suppose you have an urn containing N balls, M red and the rest, N – M blue and you select n balls at a time. The hypergeometric distribution gives the probability of selecting k red balls.

The probability generating function for a discrete distribution is the series formed by summing over the probability of an outcome k and xk. So the probability generating function for a hypergeometric distribution is given by

The summation is over all integers, but the terms are only non-zero for k between 0 and M inclusive. (This may be more general than the definition of binomial coefficients you’ve seen before. If so, see these notes on the general definition of binomial coefficients.)

It turns out that f is a hypergeometric function of x because it is can be written as a hypergeometric series. (Strictly speaking,  f is a constant multiple of a hypergeometric function. More on that in a moment.)

A hypergeometric function is defined by a pattern in its power series coefficients. The hypergeometric function F(a, bcx) has a the power series

where (n)k is the kth rising power of n. It’s a sort of opposite of factorial. Start with n and multiply consecutive increasing integers for k terms. (n)0 is an empty product, so it is 1. (n)1 = n, (n)2 = n(n+1), etc.

If the ratio of the k+1st term to the kth term in a power series is a polynomial in k, then the series is a (multiple of) a hypergeometric series, and you can read the parameters of the hypergeometric series off the polynomial. This ratio for our probability generating function works out to be

and so the corresponding hypergeometric function is F(-M, –nN – M – n + 1; x). The constant term of a hypergeometric function is always 1, so evaluating our probability generating function at 0 tells us what the constant is multiplying F(-M, –nN – M – n + 1; x). Now

and so

The hypergeometric series above gives the original hypergeometric function as defined by Gauss, and may be the most common form in application. But the definition has been extended to have any number of rising powers in the numerator and denominator of the coefficients. The classical hypergeometric function of Gauss is denoted 2F1 because it has two falling powers on top and one on bottom. In general, the hypergeometric function pFq has p rising powers in the denominator and q rising powers in the denominator.

The CDF of a hypergeometric distribution turns out to be a more general hypergeometric function:

where a = 1, bk+1-M, ck+1-n, d = k+2, and eN+k+2-Mn.

Thanks to Jan Galkowski for suggesting this topic via a comment on an earlier post, Hypergeometric bootstrapping.

* * *

# Reproducible randomized controlled trials

“Reproducible” and “randomized” don’t seem to go together. If something was unpredictable the first time, shouldn’t it be unpredictable if you start over and run it again? As is often the case, we want incompatible things.

But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to be random, and why would we want it to be reproducible?

One of the purposes in randomized experiments is the hope of scattering complicating factors evenly between two groups. For example, one way to test two drugs on a 1000 people would be to gather 1000 people and give the first drug to all the men and the second to all the women. But maybe a person’s sex has something to do with how the drug acts. If we randomize between two groups, it’s likely that about the same number of men and women will be in each group.

The example of sex as a factor is oversimplified because there’s reason to suspect a priori that sex might make a difference in how a drug performs. The bigger problem is that factors we can’t anticipate or control may matter, and we’d like them scattered evenly between the two treatment groups. If we knew what the factors were, we could assure that they’re evenly split between the groups. The hope is that randomization will do that for us with things we’re unaware of. For this purpose we don’t need a process that is “truly random,” whatever that means, but a process that matches our expectations of how randomness should behave. So a pseudorandom number generator (PRNG) is fine. No need, for example, to randomize using some physical source of randomness like radioactive decay.

Another purpose in randomization is for the assignments to be unpredictable. We want a physician, for example, to enroll patients on a clinical trial without knowing what treatment they will receive. Otherwise there could be a bias, presumably unconscious, against assigning patients with poor prognosis if the physicians know the next treatment be the one they hope or believe is better. Note here that the randomization only has to be unpredictable from the perspective of the people participating in and conducting the trial. The assignments could be predictable, in principle, by someone not involved in the study.

And why would you want an randomization assignments to be reproducible? One reason would be to test whether randomization software is working correctly. Another might be to satisfy a regulatory agency or some other oversight group. Still another reason might be to defend your randomization in a law suit. A physical random number generator, such as using the time down to the millisecond at which the randomization is conducted would achieve random assignments and unpredictability, but not reproducibility.

Computer algorithms for generating random numbers (technically pseudo-random numbers) can achieve reproducibility, practically random allocation, and unpredictability. The randomization outcomes are predictable, and hence reproducible, to someone with access to the random number generator and its state, but unpredictable in practice to those involved in the trial. The internal state of the random number generator has to be saved between assignments and passed back into the randomization software each time.

Random number generators such as the Mersenne Twister have good statistical properties, but they also carry a large amount of state. The random number generator described here has very small state, 64 bits, and so storing and returning the state is simple. If you needed to generate a trillion random samples, Mersenne Twitster would be preferable, but since RCTs usually have less than a trillion subjects, the RNG in the article is perfectly fine. I have run the Die Harder random number generator quality tests on this generator and it performs quite well.

Image by Ilmicrofono Oggiono, licensed under Creative Commons

# Random number generator seed mistakes

## Long run or broken software?

I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You might argue, for example, that the chances of flipping five heads in a row would be (1/2)5 = 1/32, but that underestimates the probability because a run could start at any time. The chances that the first five flips are heads would indeed be 1/32. But the probability of seeing five heads in a row any time during a series of flips is higher.

Most of the times that I’ve been asked to look at randomization software that “couldn’t be right,” the software was fine. But in this case, there wasn’t simply a long run of random results that happened to be equal. The results were truly constant. At least for some users. Some users would get different results from time to time, but others would get the same result every single time.

The problem turned out to be how the software set the seed in its random number generator. When the program started up it asked the user “Enter a number.” No indication of what kind of number or for what purpose. This number, unbeknownst to the user, was being used as the random number generator seed. Some users would enter the same number every time, and get the same randomization result, every time. Others would use more whimsy in selecting numbers and get varied output.

How do you seed a random number generator in a situation like this? A better solution would be to seed the generator with the current time, though that has drawbacks too. I write about that in another post.

## Seeding many independent processes

A more subtle problem I’ve seen with random number generator seeding is spawning multiple processes that each generate random numbers. In a well-intentioned attempt to give each process a unique seed, the developers ended up virtually assuring that many of the processes would have exactly the same seed.

If you parallelize a huge simulation by spreading out multiple copies, but two of the processes use the same seed, then their results will be identical. Throwing out the redundant simulation would reduce your number of samples, but not noticing and keeping the redundant output would be worse because it would cause you to underestimate the amount of variation.

To avoid duplicate seeds, the developers used a random number generator to assign the RNG seeds for each process. Sounds reasonable. Randomly assigned RNG seeds give you even more random goodness. Except they don’t.

The developers had run into a variation on the famous birthday problem. In a room of 23 people, there’s a 50% chance that two people share the same birthday. And with 50 people, the chances go up to 97%. It’s not certain that two people will have the same birthday until you have 367 people in the room, but the chances approach 1 faster than you might think.

Applying the analog of the birthday problem to the RNG seeds explains why the project was launching processes with the same seed. Suppose you seed each process with an unsigned 16-bit integer. That means there are 65,536 possible seeds. Now suppose you launch 1,000 processes. With 65 times as many possible seeds as processes, surely every process should get its own seed, right? Not at all. There’s a 99.95% chance that two processes will have the same seed.

In this case it would have been better to seed each process with sequential seeds: give the first process seed 1, the second seed 2, etc. The seeds don’t have to be random; they just have to be unique. If you’re using a good random number generator, the outputs of 1,000 processes seeded with 1, 2, 3, …, 1000 will be independent.

# MCMC burn-in

In Markov Chain Monte Carlo (MCMC), it’s common to throw out the first few states of a Markov chain, maybe the first 100 or the first 1000. People say they do this so the chain has had a chance to “burn in.” But this explanation by itself doesn’t make sense. It may be good to throw away a few samples, but it could betray a lack of understanding to say a chain has “burned in.”

A Markov chain has no memory. That’s its defining characteristic: its future behavior depends solely on where it is, not how it got there. So if you “burn in” a thousand samples, your future calculations are absolutely no different than if you had started where there first thousand samples left off. Also, any point you start at is a point you might return to, or at least return arbitrarily close to again.

So why burn in? To enter a high probability region, a place where the states of the Markov chain are more representative of the distribution you’re sampling. When someone says a chain has “burned in,” that’s fine if they mean “has entered a high probability region.” And why do you want to enter such a region? Because you’re going to average some function of your samples:

The result will be correct as n → ∞, but you’re going to stop after some finite n. When n is small, and your samples are in a low probability region, the average on the right might be a poor approximation to the expectation on the left.

The idea of burn-in is that you can start your MCMC procedure at some point chosen for convenience, one which might be out in the weeds, but then after a few iterations you’ll be in a high probability region. However, you don’t know that this will happen. It probably will happen, eventually, by definition: a random process spends most of its time where it spends most of its time! It is possible, though unlikely, that you could be in a lower probability region at the end of your burn-in period than at the beginning. Or maybe your chain is slowly moving toward a higher probability region, but you’re still not close at the end of your burn-in.

If you know where a high probability region is, just start there. Then you’ve “burned in” immediately. However, with a very complicated problem you might not know where a high probability region is. So you hope that a few steps of your chain will land you in a high probability region. And maybe it will. But if you understand your problem so poorly that you have no idea where the probability is concentrated, you’re going to have a hard time evaluating your results.

# Introduction to MCMC

Markov Chain Monte Carlo (MCMC) is a technique for getting your work done when Monte Carlo won’t work.

The problem is finding the expected value of f(X) where X is some random variable. If you can draw independent samples xi from X, the solution is simple:

When it’s possible to draw these independent samples, the sum above is well understood. It’s easy to estimate the error after n samples, or to turn it the other way around, estimate what size n you need so that the error is probably below a desired threshold.

MCMC is a way of making the approximation above work even though it’s not practical to draw independent random samples from X. In Bayesian statistics, X is the posterior distribution of your parameters and you want to find the expected value of some function of these parameters. The problem is that this posterior distribution is typically complicated, high-dimensional, and unique to your problem (unless you have a simple conjugate model). You don’t know how to draw independent samples from X, but there are standard ways to construct a Markov chain whose samples make the approximation above work. The samples are not independent but in the limit the set of samples has the same distribution as X.

MCMC is either simple or mysterious, depending on your perspective. It’s simple to write code that should work, eventually, in theory. Writing efficient code is another matter. And above all, knowing when your answer is good enough is tricky. If the samples were independent, the Central Limit Theorem would tell you how the sum should behave. But since the samples are dependent, all bets are off. Almost all we know is that the average on the right side converges to the expectation on the left side. Aside from toy problems there’s very little theory to tell you when your average is sufficiently close to what you want to compute.

Because there’s not much theory, there’s a lot of superstition and folklore. As with other areas, there’s some truth in MCMC superstition and folklore, but also some error and nonsense.

In some ways MCMC is truly marvelous. It’s disappointing that there isn’t more solid theory around convergence, but it lets you take a stab at problems that would be utterly intractable otherwise. In theory you can’t know when it’s time to stop, and in practice you can fool yourself into thinking you’ve seen convergence when you haven’t, such as when you have a bimodal distribution and you’ve only been sampling from one mode. But it’s also common in practice to have some confidence that a calculation has converged because the results are qualitatively correct: this value should be approximately this other value, this should be less than that, etc.

Bayesian statistics is older than Frequentist statistics, but it didn’t take off until statisticians discovered MCMC in the 1980s, after physicists discovered it in the 1950s. Before that time, people might dismiss Bayesian statistics as being interesting but impossible to compute. But thanks to Markov Chain Monte Carlo, and Moore’s law, many problems are numerically tractable that were not before.

Since MCMC gives a universal approach for solving certain kinds of problems, some people equate the algorithm with the problem. That is, they see MCMC as the solution rather than an algorithm to compute the solution. They forget what they were ultimately after. It’s sometimes possible to compute posterior probabilities more quickly and more accurately by other means, such as numerical integration.

# Big p, Little n

Statisticians use n to denote the number of subjects in a data set and p to denote nearly everything else. You’re supposed to know from context what each p means.

In the phrase “big n, little p” the symbol p means the number of measurements per subject. Traditional data sets are “big n, little p” because you have far more subjects than measurements per subject. For example, maybe you measure 10 things about 1000 patients.

Big data sets, such as those coming out of bioinformatics, are often “big p, little n.” For example, maybe you measure 20,000 biomarkers on 50 patients. This turns classical statistics sideways, literally and figuratively, literally in the sense that a “big p, little n” data set looks like the transpose of a “big n, little p” data set.

From the vantage point of a traditional statistician, “big p, little n” data sets give you very little to work with. If n is small, it doesn’t matter how big p is. In the example above, n = 50, not a big data set. But the biologist will say “What do you mean it’s not a big data set? I’ve given you 1,000,000 measurements!”

So how to you take advantage of large p even though n is small? That’s the big question. It summarizes the research program of many people in statistics and machine learning. There’s no general answer, at least not yet, though progress is being made in specific applications.

Related post: Nomenclatural abomination

# The longer it has taken, the longer it will take

Suppose project completion time follows a Pareto (power law) distribution with parameter α. That is, for t > 1, the probability that completion time is bigger than t is t. (We start out time at t = 1 because that makes the calculations a little simpler.)

Now suppose we know that a project has lasted until t0 so far. Then the expected finish time is αt0/(α-1) and so the expected additional time is t0/(α-1). Note that both are proportional to t0. So the longer it has taken, the longer it will take. If the project is running late, you can expect the time remaining to be even more than the expected time before the project started. The finish line is moving away from you!

For example, suppose α = 2 (in applications of power laws, α is often between 1 and 3) and you’re measuring time in years. When the project starts at t = 1, it is expected to take one year, until t = 2. Now suppose you’re starting the second year and the project isn’t done. Now it’s expected to finish at t = 4, two more years. When you started, the project was supposed to take a year. One year later, it has taken a year, and should be expected to take two more years. I said “should be expected” rather than “is expected” because no one would believe such an estimate. (Ever heard of the Big Dig? Or other megaprojects?)

Note that we have computed the conditional probability given only the time it has taken so far, and no other information. If you know more, for example maybe you know that some specific pieces have been completed, then you should use that information.

This is related to the Lindy effect. The longer a cultural artifact has been around, the longer it is expected to last into the future.

* * *

This is what the book Social Media Mining calls the Big Data Paradox:

Social media data is undoubtedly big. However, when we zoom into individuals for whom, for example, we would like to make relevant recommendations, we often have little data for each specific individual. We have to exploit the characteristics of social media and use its multidimensional, multisource, and multisite data to aggregate information with sufficient statistics for effective mining.

… enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.

Big data doesn’t always tell us directly what we’d like to know. It may give us a gargantuan amount of slightly related data, from which we may be able to tease out what we want.

Related post: New data, not just bigger data

# Estimating the exponent of discrete power law data

Suppose you have data from a discrete power law with exponent α. That is, the probability of an outcome n is proportional to n. How can you recover α?

A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That does a very poor job [1]. The discrete case needs its own estimator.

To illustrate this, we start by generating 5,000 samples from a discrete power law with exponent 3.

   import numpy.random

alpha = 3
n = 5000
x = numpy.random.zipf(alpha, n)


The continuous MLE is very simple to implement:

    alpha_hat = 1 + n / sum(log(x))


Unfortunately, it gives an estimate of 6.87 for alpha, though we know it should be around 3.

The MLE for the discrete power law distribution satisfies

Here ζ is the Riemann zeta function, and xi are the samples. Note that the left side of the equation is the derivative of log ζ, or what is sometimes called the logarithmic derivative.

There are three minor obstacles to finding the estimator using Python. First, SciPy doesn’t implement the Riemann zeta function ζ(x) per se. It implements a generalization, the Hurwitz zeta function, ζ(x, q). Here we just need to set q to 1 to get the Riemann zeta function.

Second, SciPy doesn’t implement the derivative of zeta. We don’t need much accuracy, so it’s easy enough to implement our own. See an earlier post for an explanation of the implementation below.

Finally, we don’t have an explicit equation for our estimator. But we can easily solve for it using the bisection algorithm. (Bisect is slow but reliable. We’re not in a hurry, so we might as use something reliable.)

    from scipy import log
from scipy.special import zeta
from scipy.optimize import bisect

xmin = 1

def log_zeta(x):
return log(zeta(x, 1))

def log_deriv_zeta(x):
h = 1e-5
return (log_zeta(x+h) - log_zeta(x-h))/(2*h)

t = -sum( log(x/xmin) )/n
def objective(x):
return log_deriv_zeta(x) - t

a, b = 1.01, 10
alpha_hat = bisect(objective, a, b, xtol=1e-6)
print(alpha_hat)


We have assumed that our data follow a power law immediately from n = 1. In practice, power laws generally fit better after the first few elements. The code above works for the more general case if you set xmin to be the point at which power law behavior kicks in.

The bisection method above searches for a value of the power law exponent between 1.01 and 10, which is somewhat arbitrary. However, power law exponents are very often between 2 and 3 and seldom too far outside that range.

The code gives an estimate of α equal to 2.969, very near the true value of 3, and much better than the naive estimate of 6.87.

Of course in real applications you don’t know the correct result before you begin, so you use something like a confidence interval to give you an idea how much uncertainty remains in your estimate.

The following equation [2] gives a value of σ from a normal approximation to the distribution of our estimator.

So an approximate 95% confidence interval would be the point estimate +/- 2σ.

    from scipy.special import zeta
from scipy import sqrt

def zeta_prime(x, xmin=1):
h = 1e-5
return (zeta(x+h, xmin) - zeta(x-h, xmin))/(2*h)

def zeta_double_prime(x, xmin=1):
h = 1e-5
return (zeta(x+h, xmin) -2*zeta(x,xmin) + zeta(x-h, xmin))/h**2

def sigma(n, alpha_hat, xmin=1):
z = zeta(alpha_hat, xmin)
temp = zeta_double_prime(alpha_hat, xmin)/z
temp -= (zeta_prime(alpha_hat, xmin)/z)**2
return 1/sqrt(n*temp)

print( sigma(n, alpha_hat) )


Here we use a finite difference approximation for the second derivative of zeta, an extension of the idea used above for the first derivative. We don’t need high accuracy approximations of the derivatives since statistical error will be larger than the approximation error.

In the example above, we have α = 2.969 and σ = 0.0334, so a 95% confidence interval would be [2.902, 3.036].

* * *

[1] Using the continuous MLE with discrete data is not so bad when the minimum output xmin is moderately large. But here, where xmin = 1 it’s terrible.

[2] Equation 3.6 from Power-law distributions in empirical data by Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman.

# Skin in the game for observational studies

The article Deming, data and observational studies by S. Stanley Young and Alan Karr opens with

Any claim coming from an observational study is most likely to be wrong.

They back up this assertion with data about observational studies later contradicted by prospective studies.

Much has been said lately about the assertion that most published results are false, particularly observational studies in medicine, and I won’t rehash that discussion here. Instead I want to cut to the process Young and Karr propose for improving the quality of observational studies. They summarize their proposal as follows.

The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set.

They then describe an eight-step process in detail. One step is that cleaning the data and dividing it into a modelling set and a holdout set would be done by different people than the modelling and analysis. They then explain why this would lead to more truthful publications.

The holdout set is the key. Both the author and the journal know there is a sword of Damocles over their heads. Both stand to be embarrassed if the holdout set does not support the original claims of the author.

* * *

The full title of the article is Deming, data and observational studies: A process out of control and needing fixing. It appeared in the September 2011 issue of Significance.

Update: The article can be found here.

# Balancing profit and learning in A/B testing

A/B testing, or split testing, is commonly used in web marketing to decide which of two design options performs better. If you have so many visitors to a site that the number of visitors used in a test is negligible, conventional randomization schemes are the way to go. They’re simple and effective.

But if you have less traffic so that the number of visitors involved in a test is appreciable, you might be concerned with possible lost revenue during the test itself. The point of A/B testing is to improve profitability after the test, not during the test. If you also want to consider profitability during the test, you might want to consider more alternatives.

My experience with testing comes from a context where the stakes are higher than improving conversion on web sites: treating cancer patients. You want to find out which treatments performed better for the sake of future patients, those who were treated after the randomized trial. But you also want to treat the participants in the clinical trial effectively. Two ways we would do that are early stopping rules and adaptive randomization. Both practices are applicable to A/B testing web pages.

A conventional clinical trial might take a few hundred patients and randomize half to one treatment and half to another. But if one treatment appears to be much more effective, at some point it becomes unconscionable to keep assigning the less effective treatment. So you stop the experiment early. You might want to do the same with web designs. If you planned to show two variations of a page to 500 visitors each, but after 100 visitors it’s obvious which version is performing better, you’d like to stop the test and show everyone the better page. On the other hand, if you have so many visitors that you’re not concerned with what happens to the 1000 visitors in the test, just let the test run to completion.

Another approach is to compromise between equal randomization and early stopping. Suppose A is performing better than B, but not so much better that you’re willing to stop and declare A the winner. You might keep randomizing, but increase the probability that the test will assign A. If A really is better, more visitors will see the better page. But if you’re wrong and B is really better, you may still discover this because some visitors are still seeing B. If B keeps performing better, the tide will turn and the test will prefer it. This is called adaptive randomization. The more evidence there is that one version is better, the higher the probability that you’ll show people that version.

One way to use adaptive randomization is variable experiment sizes. Instead of deciding a test size in advance, you test until you’re satisfied that you’ve found a winner. That may require fewer visitors than a conventional A/B test. It may also require more, but only when there’s a good reason to. The test may go into overtime, so to speak, because the two versions are performing similarly, in which case you’d like to keep testing longer to find which is better.

It’s easy to fall into thinking that the winner of a test will be used forever, whether you’re testing web pages or cancer treatments. But this isn’t the case. The winner will eventually be tested against something else, maybe very soon. This means that you might want to put a little more emphasis on the performance during the test and not just performance after the test, because there may not be much opportunity for performance after the test.

# A rose by any other name: Data science etc.

I help people make decisions in the face of uncertainty. Sounds interesting.

I’m a data scientist. Not sure what that means, but it sounds cool.

I study machine learning. Hmm. Maybe interesting, maybe a little ominous.

I’m into big data. Exciting or passé, depending on how many times you’ve heard the term.

Even though each of these descriptions makes a different impression, they’re all essentially the same thing. You could throw in a few more terms too, like artificial intelligence, inferential science, decision theory, or inverse probability.

There are distinctions. These terms don’t entirely overlap, but the overlap is huge. They all have to do with taking data and making an inference.

“Decision-making under uncertainty” emphasizes that you never have complete data, and yet you need to make decisions anyway. “Decision theory” emphasizes that the whole point of analyzing data is to do something as a result, and suggests that focusing directly on the decision itself, rather than proxies along the way, is the best way to do this.

“Data science” stresses that there is more to the process of making inferences than what falls under the traditional heading of “statistics.” Statistics has never been only about “the grotesque phenomenon generally known as mathematical statistics,” as Francis Anscombe described it. Things like data cleaning and visualization have always been part of the practice of statistics, though not the theory of statistics. Data science also emphasizes the role of computation. Some say a data scientist is a statistician who can program. Some say data science is statistics on a Mac.

Despite the hype around the term data science, it’s growing on me. It has its drawbacks, but so does every other name.

Machine learning, like decision theory, emphasizes the ultimate goal of doing something with data rather than creating an accurate model of the process that generates the data. If you can create such a model, so much the better. But it may not be necessary to have a great model in order to accomplish what you originally set out to do. “Naive Bayes,” for example, is a classification algorithm that is admittedly naive. It knowingly makes a gross simplification, assuming events are independent that we know are certainly not independent, and yet it often works well enough.

“Big data” is a big can of worms. It is often concerned with data sets that are indeed big, but it also implies other things, such as the way the data become available, as a real time stream rather than as a complete static set. See Erik Meijer’s Big data cube. And that’s just when the term “big data” is used in some fairly meaningful way. It’s also used so broadly as to be meaningless.

The term “statistics” literally means the mathematics of the interests of states, as in governments, because these were the first applications of statistics. So while “statistics” may be the most established and perhaps most respectable term discussed here, it’s not great. As I remarked here, “The term statistics would be equivalent to governmentistics, a historically accurate but otherwise useless term.” Statistics emphasizes probability models and mathematical rigor more than other variations on data analysis do. Statisticians criticize machine learning folks for being sloppy. Machine learning folks criticize statisticians for being too conservative, or for being too focused on description and not focused enough on prediction.

Bayesian statistics is much older than what is now sometimes called “classical” statistics. It was essential dormant during the first half of the 20th century before experiencing a renaissance in the second half of the century. Bayesian statistics was originally called “inverse probability” for good reason. Probability theory takes the probabilities of events as given and makes inferences about possible outcomes. Bayesian statistics does the inverse, taking data as given and inferring the probabilities that lead to the data. All statistics does something like this, but Bayesian statistics is consistent in forming all inference directly as probabilities. Frequetist (“classical”) statistics also infers probabilities, but the results, things like p-values and confidence intervals, are not the probabilities of what most people think they are. See Anthony O’Hagan’s description here.

Data analysis has gone by many names over time, sometimes with meaningful distinctions and sometimes not. Often people make a distinction without a difference.

# Data analysis vs statistics

John Tukey preferred the term “data analysis” over “statistics.” In his paper Data Anaysis, Computation and Mathematics, he explains why.

My title speaks of “data analysis” not “statistics”, and of “computation” not “computing science”; it does not speak of “mathematics”, but only last. Why? …

My brother-in-squared-law, Francis J. Anscombe has commented on my use of “data analysis” in the following words:

Whereas the content of Tukey’s remarks is always worth pondering, some of his terminology is hard to take. He seems to identify “statistics” with the grotesque phenomenon generally known as “mathematical statistics”, and finds it necessary to replace “statistical analysis” with “data analysis.”

(Tukey calls Anscombe his “brother-in-squared-law” because Anscombe was a fellow statistician as well as his brother-in-law. At first I thought Tukey had said “brother-in-law-squared”, which could mean his brother-in-law’s brother-in-law, but I suppose it was a pun on the role of least-square methods in statistics.)

Tukey later says

I … shall stick to this attitude today, and shall continue to use the words “data analysis”, in part to indicate that we can take probability seriously, or leave it alone, as may from time to time be appropriate or necessary.

It seems Tukey was reserving the term “statistics” for that portion of data analysis which is rigorously based on probability.