I believe rib eye steaks are better for you than rat poison. My basis for that belief is anecdotal evidence. People who have eaten rib eye steaks have fared better than people who have eaten rat poison. I don’t have exact numbers on that, but I’m pretty sure it’s true. I have more confidence in that than in any clinical trial conclusion.

Hearsay evidence about food isn’t very valuable, per observation, but since millions of people have eaten steak for thousands of years, the cumulative weight of evidence is pretty good that steak is harmless if not good for you. The number of people who have eaten rat poison is much smaller, but given the large effect size, there’s ample reason to suspect that eating rat poison is a bad idea.

Now suppose you want to get more specific and determine whether rib eye steaks are good for *you* in particular. (I wouldn’t suggest trying rat poison.) Suppose you’ve noticed that you feel better after eating a steak. Is that an anecdote or data? What if you look back through your diary and notice that every mention of eating steak lately has been followed by some remark about feeling better than usual? Is that data? What if you decide to flip a coin each day for the next month and eat steak if the coin comes up heads and tofu otherwise? Each of these steps is an improvement, but there’s no magical line you cross between anecdote and data.

Suppose you’re destructively testing the strength of concrete samples. There are better and worse ways to conduct such experiments, but each sample gives you valuable data. If you test 10 samples and they all withstand two tons of force per square inch, you have good reason to believe the concrete the samples were taken from can withstand such force. But if you test a drug on 10 patients, you can’t have the same confidence that the drug is effective. Human subjects are more complicated than concrete samples. Concrete samples aren’t subject to placebo effects. Also, cause and effect are more clear for concrete. If you apply a load and the sample breaks, you can assume the load caused the failure. If you treat a human for a disease and they recover, you can’t be as sure that the treatment caused the recovery. That doesn’t mean medical observations aren’t data.

Carefully collected observations in one area may be less statistically valuable than anecdotal observations in another. Observations are never ideal. There’s always some degree of bias, effects that can’t be controlled, etc. There’s no quantum leap between useless anecdotes and perfectly informative data. Some data are easy to draw inference from, but data that’s harder to understand doesn’t fail to be data.

* * *

People are not completely described by a handful of numbers. We’re much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I’ll explain why that’s true in general, then look back at the Norma example.

Suppose you have *N* points, each described by *n* independent, standard normal random variables. That is, each point has the form (*x*_{1}, *x*_{2}, *x*_{3}, …, *x*_{n}) where each *x*_{i} is independent with a normal distribution with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect that most points are piled up near the origin (0, 0, 0, …, 0). In fact most points are in a spherical shell around the origin. Specifically, as *n* becomes larger, most of the points will be in a thin shell at distance √*n* from the origin. (More details here.)
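To make the shell claim concrete, here is a minimal sketch that draws standard normal points and looks at their distances from the origin. The function name, the dimension of 100, the sample size, and the seed are all illustrative assumptions, not part of the original analysis.

```python
import math
import random

def distances_from_origin(num_points, dim, seed=0):
    """Euclidean norms of num_points standard normal vectors in dim dimensions."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(num_points)
    ]

dim = 100  # illustrative choice; larger values make the shell relatively thinner
d = distances_from_origin(2000, dim)
mean_d = sum(d) / len(d)

print(f"sqrt(n)       = {math.sqrt(dim):.2f}")
print(f"mean distance = {mean_d:.2f}")
print(f"min, max      = {min(d):.2f}, {max(d):.2f}")
```

If the shell picture is right, the minimum distance should be far from zero: none of the simulated points should land anywhere near the origin even though every coordinate has mean 0.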

In the contest above, *n* = 9, and so we expect most contestants to be about a distance of 3 from average when we normalize each of the factors being measured, i.e. we subtract the mean so that each factor has mean 0, and we divide each by its standard deviation so the standard deviation is 1 on each factor.

We’ve made several simplifying assumptions. For example, we’ve assumed independence, though presumably some of the factors measured in the contest were correlated. There’s also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we’ll run with our simplified model just to see how it behaves in a simulation.

    import numpy as np

    # Winning criterion: minimum Euclidean distance
    def euclidean_norm(x):
        return np.linalg.norm(x)

    # Winning criterion: min-max
    def max_norm(x):
        return max(abs(x))

    n = 9
    N = 3864

    # Simulated normalized measurements of contestants
    M = np.random.normal(size=(N, n))

    euclid = np.empty(N)
    maxdev = np.empty(N)
    for i in range(N):
        euclid[i] = euclidean_norm(M[i,:])
        maxdev[i] = max_norm(M[i,:])

    w1 = euclid.argmin()
    w2 = maxdev.argmin()

    print( M[w1,:] )
    print( euclidean_norm(M[w1,:]) )
    print( M[w2,:] )
    print( max_norm(M[w2,:]) )

There are two different winners, depending on how we decide the winner. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were

[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]

corresponding to a Euclidean distance of 0.7808.

If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were

[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]

with the largest deviation being the last, 0.4593.

By either measure, the contestant closest to the average deviated significantly from the average in at least one dimension.

* * *

For daily posts on probability, follow @ProbFact on Twitter.

* * *

You could say the same about nonlinear differential equations. Differential equations are so often nonlinear that the “nonlinear” qualifier isn’t always necessary to say explicitly. Just as a Bayesian analysis isn’t interesting just because it’s Bayesian, a differential equation isn’t necessarily interesting just because it’s nonlinear.

The analogy between Bayesian statistics and nonlinear differential equations breaks down though. Nonlinear equations are intrinsically more interesting than linear ones. But it’s no longer remarkable to solve a nonlinear differential equation numerically.

When an adjective becomes the default, it drops off and the previous default now requires an adjective. Terms like “electronic” and “digital” are fading from use. If you say you’re going to mail someone something, the default assumption is usually that you are going to email it. What used to be simply “mail” is now “snail mail.” Digital signal processing is starting to sound quaint. The abbreviation DSP is still in common use, but digital signal processing is simply signal processing. Now non-digital signal processing requires a qualifier, i.e. analog.

There was no term for Frequentist statistics when it was utterly dominant. Now of course there is. (Some people use the term “classical,” but that’s an odd term given that Bayesian analysis is older.) The term linear has been around a long time. Even when nearly all analysis was linear, people were aware that linearity was a necessary simplification.


This inequality is very general, but also very weak. It assumes very little about the random variable *X* but it also gives a loose bound. If we assume slightly more, namely that *X* has a **unimodal** distribution, then we have a tighter bound, the Vysochanskiĭ-Petunin inequality:

$$P(|X - \mathrm{E}(X)| \ge k\sigma) \le \frac{4}{9k^2}$$

However, the Vysochanskiĭ-Petunin inequality does require that *k* be larger than √(8/3). In exchange for the assumption of unimodality and the restriction on *k* we get to reduce our upper bound by more than half.

While tighter than Chebyshev’s inequality, the stronger inequality is still very general. We can usually do much better if we can say more about the distribution family. For example, suppose *X* has a uniform distribution. What is the probability that *X* is more than two standard deviations from its mean? Zero, because two standard deviations puts you outside the interval the uniform is defined on!

Among familiar distributions, when is the Vysochanskiĭ-Petunin inequality most accurate? That depends, of course, on what distributions you consider familiar, and what value of *k* you use. Let’s look at normal, exponential, and Pareto. These were chosen because they have thin, medium, and thick tails. We’ll also throw in the double exponential, because it has the same tail thickness as exponential but is symmetric. We’ll let *k* be 2 and 3.

Distribution family | P(\|X – E(X)\| > 2σ) | V-P estimate | P(\|X – E(X)\| > 3σ) | V-P estimate |
---|---|---|---|---|
Uniform | 0.0000 | 0.1111 | 0.0000 | 0.0494 |
Normal | 0.0455 | 0.1111 | 0.0027 | 0.0494 |
Exponential | 0.0498 | 0.1111 | 0.0183 | 0.0494 |
Pareto | 0.0277 | 0.1111 | 0.0156 | 0.0494 |
Double exponential | 0.0591 | 0.1111 | 0.0144 | 0.0494 |

A normal random variable is more than 2 standard deviations away from its mean with probability 0.0455, compared to the Vysochanskiĭ-Petunin bound of 1/9 = 0.1111. A normal random variable is more than 3 standard deviations away from its mean with probability 0.0027, compared to the bound of 4/81 = 0.0494.

An exponential random variable with mean μ also has standard deviation μ, so the only way it could be more than 2μ from its mean is to be more than 3μ from 0. So an exponential is more than 2 standard deviations from its mean with probability exp(-3) = 0.0498, and more than 3 standard deviations with probability exp(-4) = 0.0183.

We’ll set the minimum value of our Pareto random variable to be 1. As with the exponential, the Pareto cannot be 2 standard deviations less than its mean, so we look at the probability of it being more than 2 standard deviations greater than its mean. The shape parameter α must be bigger than 2 for the variance to exist. The probability of our random variable being more than *k* standard deviations away from its mean works out to ((α-1)/((*k*-1)α))^{α} and is largest as α converges down toward 2. The limiting values for *k* equal to 2 and 3 are 1/36 = 0.0277 and 1/64 = 0.0156 respectively. Even in this limit, the Pareto probabilities stay well below the Vysochanskiĭ-Petunin bounds.

The double exponential, also known as Laplace, has the highest probability of any of our examples of being two standard deviations from its mean, but this probability is still only a little more than half of the Vysochanskiĭ-Petunin bound. The exponential has the highest probability of being three standard deviations from its mean, but still only a little more than one-third of the Vysochanskiĭ-Petunin bound.
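The normal, exponential, and double exponential rows of the table can be checked with nothing but the standard library. This is a sketch under the closed-form tail expressions noted in the comments; the function names are my own, and the Pareto row is left out.

```python
import math

def vp_bound(k):
    """Vysochanskii-Petunin bound 4/(9 k^2), which requires k > sqrt(8/3)."""
    if k <= math.sqrt(8 / 3):
        raise ValueError("the bound requires k > sqrt(8/3)")
    return 4 / (9 * k * k)

def normal_tail(k):
    # P(|Z| > k) for a standard normal Z
    return math.erfc(k / math.sqrt(2))

def exponential_tail(k):
    # mean = sd = mu, so for k >= 1 only the upper tail contributes:
    # P(X > (k + 1) mu) = exp(-(k + 1))
    return math.exp(-(k + 1))

def laplace_tail(k):
    # a Laplace with scale b has sd = b sqrt(2), so P(|X| > k sd) = exp(-k sqrt(2))
    return math.exp(-k * math.sqrt(2))

for k in (2, 3):
    print(
        f"k = {k}: normal {normal_tail(k):.4f}, "
        f"exponential {exponential_tail(k):.4f}, "
        f"Laplace {laplace_tail(k):.4f}, "
        f"V-P bound {vp_bound(k):.4f}"
    )
```

Printing the exact tail next to the bound for each *k* makes it easy to see how much slack the generic inequality leaves for these families.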

Generic bounds are useful, especially in theoretical calculations, but it’s usually possible to do much better with specific distributions.


* * *

My comments start at 1:30. In a nutshell, I predict that data analytics will work its way down from large companies to small companies.

* * *

The hypergeometric distribution is a probability distribution with parameters *N*, *M*, and *n*. Suppose you have an urn containing *N* balls, *M* red and the rest, *N* – *M*, blue, and you select *n* balls at a time. The hypergeometric distribution gives the probability of selecting *k* red balls.

The **probability generating function** for a discrete distribution is the series formed by summing, over all outcomes *k*, the probability of *k* times *x*^{k}. So the probability generating function for a hypergeometric distribution is given by

$$f(x) = \sum_k \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}} x^k$$

The summation is over all integers, but the terms are only non-zero for *k* between 0 and *M* inclusive. (This may be more general than the definition of binomial coefficients you’ve seen before. If so, see these notes on the general definition of binomial coefficients.)

It turns out that *f* is a **hypergeometric function** of *x* because it can be written as a **hypergeometric series**. (Strictly speaking, *f* is a constant multiple of a hypergeometric function. More on that in a moment.)

A hypergeometric function is defined by a pattern in its power series coefficients. The hypergeometric function *F*(*a*, *b*; *c*; *x*) has the power series

$$F(a, b; c; x) = \sum_{k=0}^\infty \frac{(a)_k (b)_k}{(c)_k} \frac{x^k}{k!}$$

where (*n*)_{k} is the *k*th rising power of *n*. It’s a sort of opposite of factorial. Start with *n* and multiply consecutive *increasing* integers for *k* terms. (*n*)_{0} is an empty product, so it is 1. (*n*)_{1} = *n*, (*n*)_{2} = *n*(*n*+1), etc.

If the ratio of the *k*+1st term to the *k*th term in a power series is a rational function of *k*, then the series is a (multiple of) a hypergeometric series, and you can read the parameters of the hypergeometric series off the rational function. This ratio for our probability generating function works out to be

$$\frac{(k - M)(k - n)}{(k + N - M - n + 1)(k + 1)} \, x$$

and so the corresponding hypergeometric function is *F*(–*M*, –*n*; *N* – *M* – *n* + 1; *x*). The constant term of a hypergeometric function is always 1, so evaluating our probability generating function at 0 tells us what the constant multiplying *F*(–*M*, –*n*; *N* – *M* – *n* + 1; *x*) is. Now

$$f(0) = \frac{\binom{N-M}{n}}{\binom{N}{n}}$$

and so

$$f(x) = \frac{\binom{N-M}{n}}{\binom{N}{n}} \, F(-M, -n; N - M - n + 1; x)$$

The hypergeometric series above gives the original hypergeometric function as defined by Gauss, and may be the most common form in application. But the definition has been extended to have any number of rising powers in the numerator and denominator of the coefficients. The classical hypergeometric function of Gauss is denoted _{2}*F*_{1} because it has two rising powers on top and one on bottom. In general, the hypergeometric function _{p}*F*_{q} has *p* rising powers in the numerator and *q* rising powers in the denominator.
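The claim that the probability generating function is a constant multiple of *F*(–*M*, –*n*; *N* – *M* – *n* + 1; *x*) can be checked in exact rational arithmetic. The sketch below is only a sanity check: the helper names and the urn sizes are illustrative, and it assumes *n* ≤ *N* – *M* so that the constant term is non-zero.

```python
from fractions import Fraction
from math import comb

def rising(a, k):
    """k-th rising power (a)_k = a(a+1)...(a+k-1); the empty product is 1."""
    result = Fraction(1)
    for j in range(k):
        result *= a + j
    return result

def hypergeometric_series(a, b, c, x, num_terms):
    """Partial sum of F(a, b; c; x) = sum_k (a)_k (b)_k / (c)_k * x^k / k!."""
    total = Fraction(0)
    k_factorial = 1
    for k in range(num_terms):
        if k > 0:
            k_factorial *= k
        total += rising(a, k) * rising(b, k) / (rising(c, k) * k_factorial) * x**k
    return total

def pgf_direct(N, M, n, x):
    """Sum of P(k red balls) * x^k straight from the binomial coefficients."""
    return sum(
        Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n)) * x**k
        for k in range(min(M, n) + 1)
    )

def pgf_via_series(N, M, n, x):
    """Constant f(0) times F(-M, -n; N-M-n+1; x); assumes n <= N - M."""
    constant = Fraction(comb(N - M, n), comb(N, n))
    # The series terminates: (-M)_k and (-n)_k vanish once k exceeds M or n.
    return constant * hypergeometric_series(-M, -n, N - M - n + 1, x, min(M, n) + 1)

N, M, n = 20, 7, 5   # illustrative urn: 20 balls, 7 red, draw 5
x = Fraction(1, 3)

print(pgf_direct(N, M, n, x) == pgf_via_series(N, M, n, x))
print(pgf_direct(N, M, n, Fraction(1)))  # probabilities should sum to 1
```

Using `Fraction` rather than floating point means the two sides can be compared with `==` instead of a tolerance.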

The CDF of a hypergeometric distribution turns out to be a more general hypergeometric function:

where *a* = 1, *b* = *k*+1-*M*, *c* = *k*+1-*n*, *d = k*+2, and *e* = *N*+*k*+2-*M*–*n*.

Thanks to Jan Galkowski for suggesting this topic via a comment on an earlier post, Hypergeometric bootstrapping.


* * *

“Reproducible” and “randomized” don’t seem to go together. If something was unpredictable the first time, shouldn’t it be unpredictable if you start over and run it again? As is often the case, we want incompatible things.

But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to be random, and why would we want it to be reproducible?

**One of the purposes** in randomized experiments is the hope of scattering complicating factors evenly between two groups. For example, one way to test two drugs on 1000 people would be to gather 1000 people and give the first drug to all the men and the second to all the women. But maybe a person’s sex has something to do with how the drug acts. If we randomize between two groups, it’s likely that about the same number of men and women will be in each group.

The example of sex as a factor is oversimplified because there’s reason to suspect *a priori* that sex might make a difference in how a drug performs. The bigger problem is that factors we can’t anticipate or control may matter, and we’d like them scattered evenly between the two treatment groups. If we knew what the factors were, we could assure that they’re evenly split between the groups. The hope is that randomization will do that for us with things we’re unaware of. For this purpose we don’t need a process that is “truly random,” whatever that means, but a process that matches our expectations of how randomness should behave. So a pseudorandom number generator (PRNG) is fine. No need, for example, to randomize using some physical source of randomness like radioactive decay.

**Another purpose** in randomization is for the assignments to be unpredictable. We want a physician, for example, to enroll patients on a clinical trial without knowing what treatment they will receive. Otherwise there could be a bias, presumably unconscious, against assigning patients with poor prognosis if the physicians know the next treatment will be the one they hope or believe is better. Note here that the randomization only has to be unpredictable from the perspective of the people participating in and conducting the trial. The assignments could be predictable, in principle, by someone *not* involved in the study.

And why would you want randomization assignments to be **reproducible**? One reason would be to test whether randomization software is working correctly. Another might be to satisfy a regulatory agency or some other oversight group. Still another reason might be to defend your randomization in a lawsuit. A physical random number generator, such as using the time down to the millisecond at which the randomization is conducted, would achieve random assignments and unpredictability, but not reproducibility.

Computer algorithms for generating random numbers (technically pseudo-random numbers) can achieve reproducibility, practically random allocation, and unpredictability. The randomization outcomes are predictable, and hence reproducible, to someone with access to the random number generator and its state, but unpredictable in practice to those involved in the trial. The internal state of the random number generator has to be saved between assignments and passed back into the randomization software each time.
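Here is a minimal sketch of the save-the-state-and-pass-it-back pattern, assuming Python’s built-in `random` module as the generator. The function names, the seed, and the equal allocation between two arms labeled A and B are all hypothetical choices for illustration, not a description of any particular trial software.

```python
import random

def next_assignment(state=None, seed=20240101):
    """Return (treatment, new_state); pass the saved state back in on each call.

    With state=None the generator starts from the (hypothetical) study seed.
    """
    rng = random.Random(seed)
    if state is not None:
        rng.setstate(state)          # resume exactly where the last call stopped
    treatment = "A" if rng.random() < 0.5 else "B"
    return treatment, rng.getstate()

def run_trial(num_patients):
    """Randomize patients one at a time, storing the state between assignments."""
    state = None
    assignments = []
    for _ in range(num_patients):
        treatment, state = next_assignment(state)
        assignments.append(treatment)
    return assignments

first_run = run_trial(20)
second_run = run_trial(20)
print("".join(first_run))
print(first_run == second_run)   # same seed and state handling, same assignments
```

Someone holding the seed and state can replay every assignment, while someone enrolling patients without that access sees what looks like a coin flip.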

Random number generators such as the Mersenne Twister have good statistical properties, but they also carry a large amount of state. The random number generator described here has very small state, 64 bits, and so storing and returning the state is simple. If you needed to generate a trillion random samples, Mersenne Twister would be preferable, but since RCTs usually have fewer than a trillion subjects, the RNG in the article is perfectly fine. I have run the Dieharder random number generator quality tests on this generator and it performs quite well.

Image by Ilmicrofono Oggiono, licensed under Creative Commons

* * *

I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You might argue, for example, that the chances of flipping five heads in a row would be (1/2)^{5} = 1/32, but that underestimates the probability because a run could start at any time. The chances that the *first* five flips are heads would indeed be 1/32. But the probability of seeing five heads in a row any time during a series of flips is higher.
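To put numbers on how much higher, the following sketch computes the probability of at least one run of five heads somewhere in a sequence of fair coin flips. It tracks the length of the current streak with a small dynamic program; the function name and the flip counts printed are my own choices.

```python
def prob_run_of_heads(num_flips, run_length=5):
    """Probability that a fair coin shows at least `run_length` consecutive
    heads somewhere in `num_flips` flips."""
    # state[j] = probability that no qualifying run has occurred yet and the
    # sequence so far ends in exactly j consecutive heads
    state = [0.0] * run_length
    state[0] = 1.0
    hit = 0.0
    for _ in range(num_flips):
        new_state = [0.0] * run_length
        new_state[0] = 0.5 * sum(state)          # tails resets the streak
        for j in range(run_length - 1):
            new_state[j + 1] = 0.5 * state[j]    # heads extends the streak
        hit += 0.5 * state[run_length - 1]       # heads completes the run
        state = new_state
    return hit

for flips in (5, 20, 50, 100):
    print(flips, round(prob_run_of_heads(flips), 4))
```

With exactly five flips the function returns 1/32, matching the naive calculation, and the probability grows from there as the sequence gets longer.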

Most of the times that I’ve been asked to look at randomization software that “couldn’t be right,” the software was fine. But in this case, there wasn’t simply a long run of random results that happened to be equal. The results were truly constant. At least for some users. Some users would get different results from time to time, but others would get the same result every single time.

The problem turned out to be how the software set the seed in its random number generator. When the program started up it asked the user “Enter a number.” No indication of what kind of number or for what purpose. This number, unbeknownst to the user, was being used as the random number generator seed. Some users would enter the same number every time, and get the same randomization result, every time. Others would use more whimsy in selecting numbers and get varied output.

How do you seed a random number generator in a situation like this? A better solution would be to seed the generator with the current time, though that has drawbacks too. I write about that in another post.

A more subtle problem I’ve seen with random number generator seeding is spawning multiple processes that each generate random numbers. In a well-intentioned attempt to give each process a unique seed, the developers ended up virtually assuring that many of the processes would have exactly the same seed.

If you parallelize a huge simulation by spreading out multiple copies, but two of the processes use the same seed, then their results will be identical. Throwing out the redundant simulation would reduce your number of samples, but not noticing and keeping the redundant output would be worse because it would cause you to underestimate the amount of variation.

To avoid duplicate seeds, the developers used a random number generator to assign the RNG seeds for each process. Sounds reasonable. Randomly assigned RNG seeds give you even more random goodness. Except they don’t.

The developers had run into a variation on the famous birthday problem. In a room of 23 people, there’s a 50% chance that two people share the same birthday. And with 50 people, the chances go up to 97%. It’s not *certain* that two people will have the same birthday until you have 367 people in the room, but the chances approach 1 faster than you might think.

Applying the analog of the birthday problem to the RNG seeds explains why the project was launching processes with the same seed. Suppose you seed each process with an unsigned 16-bit integer. That means there are 65,536 possible seeds. Now suppose you launch 1,000 processes. With 65 times as many possible seeds as processes, surely every process should get its own seed, right? Not at all. There’s a 99.95% chance that two processes will have the same seed.
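The birthday and seed figures come from the same calculation: the chance that a set of independent uniform draws are not all distinct. A short sketch, with a function name of my own choosing:

```python
def collision_probability(num_draws, num_values):
    """Probability that at least two of `num_draws` independent uniform draws
    from `num_values` equally likely values coincide."""
    p_all_distinct = 1.0
    for i in range(num_draws):
        # the (i+1)st draw must avoid the i values already taken
        p_all_distinct *= (num_values - i) / num_values
    return 1.0 - p_all_distinct

print(f"23 people, 365 days:        {collision_probability(23, 365):.4f}")
print(f"50 people, 365 days:        {collision_probability(50, 365):.4f}")
print(f"1000 processes, 2^16 seeds: {collision_probability(1000, 2**16):.4f}")
```

The intuition that 65 times as many seeds as processes is plenty fails because what matters is the number of *pairs* of processes, roughly half a million here, not the number of processes.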

In this case it would have been better to seed each process with *sequential* seeds: give the first process seed 1, the second seed 2, etc. The *seeds* don’t have to be random; they just have to be unique. If you’re using a good random number generator, the outputs of 1,000 processes seeded with 1, 2, 3, …, 1000 will be independent.

* * *

A Markov chain has no memory. That’s its defining characteristic: its future behavior depends solely on where it is, not how it got there. So if you “burn in” a thousand samples, your future calculations are absolutely no different than if you had started where the first thousand samples left off. Also, any point you start at is a point you might return to, or at least return arbitrarily close to again.

So why burn in? To enter a high probability region, a place where the states of the Markov chain are more representative of the distribution you’re sampling. When someone says a chain has “burned in,” that’s fine if they mean “has entered a high probability region.” And why do you want to enter such a region? Because you’re going to average some function of your samples:

$$\mathrm{E}[f(X)] \approx \frac{1}{n} \sum_{i=1}^n f(x_i)$$

The result will be correct as *n* → ∞, but you’re going to stop after some finite *n*. When *n* is small, and your samples are in a low probability region, the average on the right might be a poor approximation to the expectation on the left.

The idea of burn-in is that you can start your MCMC procedure at some point chosen for convenience, one which might be out in the weeds, but then after a few iterations you’ll be in a high probability region. However, you don’t know that this will happen. It *probably* will happen, eventually, by definition: a random process spends most of its time where it spends most of its time! It is possible, though unlikely, that you could be in a lower probability region at the end of your burn-in period than at the beginning. Or maybe your chain is slowly moving toward a higher probability region, but you’re still not close at the end of your burn-in.

If you know where a high probability region is, just start there. Then you’ve “burned in” immediately. However, with a very complicated problem you might not know where a high probability region is. So you hope that a few steps of your chain will land you in a high probability region. And maybe it will. But if you understand your problem so poorly that you have no idea where the probability is concentrated, you’re going to have a hard time evaluating your results.

* * *

The problem is finding the expected value of *f*(*X*) where *X* is some random variable. If you can draw independent samples *x*_{i} from *X*, the solution is simple:

$$\mathrm{E}[f(X)] \approx \frac{1}{n} \sum_{i=1}^n f(x_i)$$

When it’s possible to draw these independent samples, the sum above is well understood. It’s easy to estimate the error after *n* samples, or to turn it the other way around, estimate what size *n* you need so that the error is probably below a desired threshold.
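For the independent case, both the estimate and its error bar fall out of a few lines. This sketch estimates E[*X*²] for a standard normal *X*, where the exact answer is 1; the function name, the test function, the seed, and the sample size are assumptions made for illustration.

```python
import math
import random

def monte_carlo_mean(f, sampler, n):
    """Estimate E[f(X)] from n independent draws, with a CLT-based standard error."""
    values = [f(sampler()) for _ in range(n)]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

rng = random.Random(42)
estimate, std_err = monte_carlo_mean(
    lambda x: x * x,          # f(x) = x^2
    lambda: rng.gauss(0, 1),  # independent standard normal draws
    100_000,
)
print(f"E[X^2] is about {estimate:.3f} +/- {2 * std_err:.3f}")
```

Because the standard error shrinks like 1/√*n*, halving the error bar takes about four times as many samples, which is the calculation you would turn around to choose *n* for a desired accuracy.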

MCMC is a way of making the approximation above work even though it’s not practical to draw independent random samples from *X*. In Bayesian statistics, *X* is the posterior distribution of your parameters and you want to find the expected value of some function of these parameters. The problem is that this posterior distribution is typically complicated, high-dimensional, and unique to your problem (unless you have a simple conjugate model). You don’t know how to draw independent samples from *X*, but there are standard ways to construct a **Markov chain** whose samples make the approximation above work. The samples are **not independent** but in the limit the set of samples has the same distribution as *X*.
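As one example of such a construction, here is a minimal random-walk Metropolis sampler. It is a sketch rather than anything you would use in production: the target is a standard normal chosen so the answer is known, and the step size, chain length, starting point, and seed are arbitrary assumptions.

```python
import math
import random

def metropolis(log_density, x0, num_samples, step=2.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x, logp = x0, log_density(x0)
    samples = []
    for _ in range(num_samples):
        proposal = x + rng.uniform(-step, step)   # symmetric proposal
        logp_new = log_density(proposal)
        # accept with probability min(1, p(proposal) / p(x))
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        samples.append(x)                         # on rejection, x is repeated
    return samples

def log_std_normal(x):
    return -0.5 * x * x   # an unnormalized log density is enough

samples = metropolis(log_std_normal, x0=0.0, num_samples=50_000)
second_moment = sum(s * s for s in samples) / len(samples)
print(f"MCMC estimate of E[X^2]: {second_moment:.3f} (exact value is 1)")
```

Each sample is either the previous one or a small perturbation of it, so consecutive samples are strongly correlated, yet the long-run average still approaches the expectation.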

MCMC is either simple or mysterious, depending on your perspective. It’s simple in that it’s easy to write code that should work, eventually. Writing efficient code is another matter. And above all, knowing when your answer is good enough is tricky. If the samples were independent, the Central Limit Theorem would tell you how the sum should behave. But since the samples are **dependent**, all bets are off. Almost all we know is that the average on the right side converges to the expectation on the left side. Aside from toy problems there’s very little theory to tell you when your average is sufficiently close to what you want to compute.

Because there’s not much theory, there’s a lot of superstition and folklore. As with other areas, there’s truth in MCMC superstition and folklore, but also some error and nonsense.

In some ways MCMC is truly marvelous. It’s disappointing that there isn’t more solid theory around convergence, but it lets you take a stab at problems that would be utterly intractable otherwise. In theory you can’t know when it’s time to stop, and in practice you can fool yourself into thinking you’ve seen convergence when you haven’t, such as when you have a bimodal distribution and you’ve only been sampling from one mode. But it’s also common in practice to have some confidence a calculation has converged because the results are qualitatively correct: this value should be approximately this other value, this should be less than that, etc.

Bayesian statistics is older than Frequentist statistics, but it didn’t take off until statisticians discovered MCMC in the 1980s, after physicists discovered it in the 1950s. Before that time, people might dismiss Bayesian statistics as being interesting but impossible to compute. But thanks to Markov Chain Monte Carlo, and Moore’s law, many problems are numerically tractable that were not before.

Since MCMC gives a universal approach for solving certain kinds of problems, some people equate the algorithm with the problem. That is, they see MCMC as the solution rather than an algorithm to compute the solution. They forget what they were ultimately after. It’s sometimes possible to compute posterior probabilities more quickly and more accurately by other means, such as numerical integration.

* * *

In the phrase “big *n*, little *p*” the symbol *p* means the number of measurements per subject. Traditional data sets are “big *n*, little *p*” because you have far more subjects than measurements per subject. For example, maybe you measure 10 things about 1000 patients.

Big data sets, such as those coming out of bioinformatics, are often “big *p*, little *n*.” For example, maybe you measure 20,000 biomarkers on 50 patients. This turns classical statistics sideways, literally and figuratively, literally in the sense that a “big *p*, little *n*” data set looks like the transpose of a “big *n*, little *p*” data set.

From the vantage point of a traditional statistician, “big *p*, little *n*” data sets give you very little to work with. If *n* is small, it doesn’t matter how big *p* is. In the example above, *n* = 50, not a big data set. But the biologist will say “What do you mean it’s not a big data set? I’ve given you 1,000,000 measurements!”

So how do you take advantage of large *p* even though *n* is small? That’s the big question. It summarizes the research program of many people in statistics and machine learning. There’s no general answer, at least not yet, though progress is being made in specific applications.

**Related post**: Nomenclatural abomination

For daily tips on data science, follow @DataSciFact on Twitter.

* * *

Now suppose we know that a project has lasted until *t*_{0} so far. Then the expected finish time is α*t*_{0}/(α-1) and so the expected additional time is *t*_{0}/(α-1). Note that both are proportional to *t*_{0}. So the longer it has taken, the longer it will take. If the project is running late, you can expect the time remaining to be even more than the expected time before the project started. The finish line is moving away from you!

For example, suppose α = 2 (in applications of power laws, α is often between 1 and 3) and you’re measuring time in years. When the project starts at *t* = 1, it is expected to take one year, until *t* = 2. Now suppose you’re starting the second year and the project isn’t done. Now it’s expected to finish at *t* = 4, two more years. When you started, the project was supposed to take a year. One year later, it has taken a year, and should be expected to take two more years. I said “should be expected” rather than “is expected” because no one would believe such an estimate. (Ever heard of the Big Dig? Or other megaprojects?)
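The arithmetic in this example can be reproduced directly from the two formulas above. The function names are my own, and the sketch uses nothing beyond the expected finish time α*t*_{0}/(α-1).

```python
def expected_finish(t0, alpha):
    """Expected finish time given the project has already lasted until t0."""
    if alpha <= 1:
        raise ValueError("the expected value is finite only for alpha > 1")
    return alpha * t0 / (alpha - 1)

def expected_additional(t0, alpha):
    """Expected time still remaining, t0 / (alpha - 1)."""
    return expected_finish(t0, alpha) - t0

alpha = 2
for t0 in (1, 2, 4):
    print(
        f"lasted until t = {t0}: expected finish at t = {expected_finish(t0, alpha):g}, "
        f"{expected_additional(t0, alpha):g} more years"
    )
```

Each time the project survives to its previously expected finish time, the expected time remaining doubles in this α = 2 example.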

Note that we have computed the conditional expectation given only the time it has taken so far, and *no other information*. If you know more, for example maybe you know that some specific pieces have been completed, then you should use that information.

This is related to the Lindy effect. The longer a cultural artifact has been around, the longer it is expected to last into the future.


* * *

Social media data is undoubtedly big. However, when we zoom into individuals for whom, for example, we would like to make relevant recommendations, we often have little data for each specific individual. We have to exploit the characteristics of social media and use its multidimensional, multisource, and multisite data to aggregate information with sufficient statistics for effective mining.

Brad Efron said something similar:

… enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.

Big data doesn’t always tell us directly what we’d like to know. It may give us a gargantuan amount of slightly related data, from which we may be able to tease out what we want.

**Related post**: New data, not just bigger data

A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That does a very poor job [1]. The discrete case needs its own estimator.

To illustrate this, we start by generating 5,000 samples from a discrete power law with exponent 3.

    import numpy.random

    alpha = 3
    n = 5000
    x = numpy.random.zipf(alpha, n)

The continuous MLE is very simple to implement:

    alpha_hat = 1 + n / numpy.log(x).sum()

Unfortunately, it gives an estimate of 6.87 for alpha, though we know it should be around 3.

The MLE for the discrete power law distribution satisfies

$$\frac{\zeta'(\hat{\alpha})}{\zeta(\hat{\alpha})} = -\frac{1}{n} \sum_{i=1}^n \log x_i$$

Here ζ is the Riemann zeta function, and *x*_{i} are the samples. Note that the left side of the equation is the derivative of log ζ, or what is sometimes called the logarithmic derivative.

There are three minor obstacles to finding the estimator using Python. First, SciPy doesn’t implement the Riemann zeta function ζ(*x*) per se. It implements a generalization, the Hurwitz zeta function, ζ(*x*, *q*). Here we just need to set *q* to 1 to get the Riemann zeta function.

Second, SciPy doesn’t implement the derivative of zeta. We don’t need much accuracy, so it’s easy enough to implement our own. See an earlier post for an explanation of the implementation below.

Finally, we don’t have an explicit equation for our estimator. But we can easily solve for it using the bisection algorithm. (Bisect is slow but reliable. We’re not in a hurry, so we might as well use something reliable.)

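    from numpy import log
    from scipy.special import zeta
    from scipy.optimize import bisect

    xmin = 1

    def log_zeta(x):
        # Hurwitz zeta with q = xmin; q = 1 gives the Riemann zeta function
        return log(zeta(x, xmin))

    def log_deriv_zeta(x):
        # central difference approximation to the derivative of log zeta
        h = 1e-5
        return (log_zeta(x+h) - log_zeta(x-h))/(2*h)

    # right side of the MLE equation; x and n are the samples generated above
    t = -sum(log(x))/n

    def objective(x):
        return log_deriv_zeta(x) - t

    a, b = 1.01, 10
    alpha_hat = bisect(objective, a, b, xtol=1e-6)
    print(alpha_hat)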

We have assumed that our data follow a power law immediately from *x* = 1. In practice, power laws generally fit better after the first few elements. The code above works for the more general case if you set `xmin` to be the point at which power law behavior kicks in and keep only the samples at or above that point.

The bisection method above searches for a value of the power law exponent between 1.01 and 10, which is somewhat arbitrary. However, power law exponents are very often between 2 and 3 and seldom too far outside that range.

The code gives an estimate of α equal to 2.969, very near the true value of 3, and much better than the naive estimate of 6.87.

Of course in real applications you don’t know the correct result before you begin, so you use something like a confidence interval to give you an idea how much uncertainty remains in your estimate.

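The following equation [2] gives a value of σ from a normal approximation to the distribution of our estimator.

σ = 1 / √( *n* [ ζ″(α̂, *x*_{min}) / ζ(α̂, *x*_{min}) – (ζ′(α̂, *x*_{min}) / ζ(α̂, *x*_{min}))^{2} ] )

So an approximate 95% confidence interval would be the point estimate ± 2σ.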

    from scipy.special import zeta
    from numpy import sqrt

    def zeta_prime(x, xmin=1):
        # central difference approximation to the first derivative
        h = 1e-5
        return (zeta(x+h, xmin) - zeta(x-h, xmin))/(2*h)

    def zeta_double_prime(x, xmin=1):
        # central difference approximation to the second derivative
        h = 1e-5
        return (zeta(x+h, xmin) - 2*zeta(x, xmin) + zeta(x-h, xmin))/h**2

    def sigma(n, alpha_hat, xmin=1):
        z = zeta(alpha_hat, xmin)
        temp = zeta_double_prime(alpha_hat, xmin)/z
        temp -= (zeta_prime(alpha_hat, xmin)/z)**2
        return 1/sqrt(n*temp)

    print( sigma(n, alpha_hat) )

Here we use a finite difference approximation for the second derivative of zeta, an extension of the idea used above for the first derivative. We don’t need high accuracy approximations of the derivatives since statistical error will be larger than the approximation error.

In the example above, we have α = 2.969 and σ = 0.0334, so a 95% confidence interval would be [2.902, 3.036].

* * *

[1] Using the continuous MLE with discrete data is not so bad when the minimum output *x*_{min} is moderately large. But here, where *x*_{min} = 1, it’s terrible.

[2] Equation 3.6 from Power-law distributions in empirical data by Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman.

Any claim coming from an observational study is most likely to be wrong.

They back up this assertion with data about observational studies later contradicted by prospective studies.

Much has been said lately about the assertion that most published results are false, particularly observational studies in medicine, and I won’t rehash that discussion here. Instead I want to cut to the process Young and Karr propose for improving the quality of observational studies. They summarize their proposal as follows.

The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set.

They then describe an eight-step process in detail. One step is that cleaning the data and dividing it into a modelling set and a holdout set would be done by different people than the modelling and analysis. They then explain why this would lead to more truthful publications.

The holdout set is the key. Both the author and the journal know there is a sword of Damocles over their heads. Both stand to be embarrassed if the holdout set does not support the original claims of the author.
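The technical half of the proposal, splitting the data, is easy to sketch. The following is only an illustration under my own assumptions: the function name, the 50/50 split, and the seed are made up, and the data are taken to be rows of a NumPy array. Young and Karr's process also specifies who does the split and when the holdout results are revealed, which no code can enforce.

```python
import numpy as np

def split_modelling_holdout(data, holdout_fraction=0.5, seed=2011):
    """Randomly partition the rows of data into a modelling set and a holdout set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))             # shuffle the row indices
    n_holdout = int(round(holdout_fraction * len(data)))
    holdout = data[idx[:n_holdout]]              # set aside, not examined during modelling
    modelling = data[idx[n_holdout:]]            # used to develop the analysis
    return modelling, holdout

records = np.arange(20).reshape(10, 2)           # ten made-up records
modelling, holdout = split_modelling_holdout(records)
print(len(modelling), len(holdout))
```

The point of fixing the split up front is that the analysis developed on the modelling set can later be rerun, unchanged, on the holdout set.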

* * *

The full title of the article is *Deming, data and observational studies: A process out of control and needing fixing*. It appeared in the September 2011 issue of *Significance*.

Update: The article can be found here.

But if you have less traffic so that the number of visitors involved in a test is appreciable, you might be concerned with possible lost revenue during the test itself. The point of A/B testing is to improve profitability *after* the test, not during the test. If you also want to consider profitability *during* the test, you might want to consider more alternatives.

My experience with testing comes from a context where the stakes are higher than improving conversion on web sites: treating cancer patients. You want to find out which treatments perform better for the sake of future patients, those who will be treated after the randomized trial. But you also want to treat the participants in the clinical trial effectively. Two ways we would do that are early stopping rules and adaptive randomization. Both practices are applicable to A/B testing web pages.

A conventional clinical trial might take a few hundred patients and randomize half to one treatment and half to another. But if one treatment appears to be much more effective, at some point it becomes unconscionable to keep assigning the less effective treatment. So you stop the experiment early. You might want to do the same with web designs. If you planned to show two variations of a page to 500 visitors each, but after 100 visitors it’s obvious which version is performing better, you’d like to stop the test and show everyone the better page. On the other hand, if you have so many visitors that you’re not concerned with what happens to the 1000 visitors in the test, just let the test run to completion.

Another approach is to compromise between equal randomization and early stopping. Suppose A is performing better than B, but not so much better that you’re willing to stop and declare A the winner. You might keep randomizing, but increase the probability that the test will assign A. If A really is better, more visitors will see the better page. But if you’re wrong and B is really better, you may still discover this because some visitors are still seeing B. If B keeps performing better, the tide will turn and the test will prefer it. This is called adaptive randomization. The more evidence there is that one version is better, the higher the probability that you’ll show people that version.
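Here is a minimal sketch of adaptive randomization, under assumptions of my own choosing: each visitor either converts or doesn't, each page's conversion rate gets an independent Beta(1, 1) prior, and the probability of showing A is the posterior probability that A is better than B, estimated by simulation. The function names and conversion rates are made up for illustration. Real designs usually temper the assignment probability and add stopping rules.

```python
import numpy as np

def prob_a_beats_b(succ, fail, rng, draws=2000):
    """Monte Carlo estimate of P(rate_A > rate_B) under Beta(1, 1) priors."""
    a = rng.beta(1 + succ[0], 1 + fail[0], draws)
    b = rng.beta(1 + succ[1], 1 + fail[1], draws)
    return float(np.mean(a > b))

def adaptive_test(true_rates, visitors, seed=0):
    """Simulate an A/B test that favors the version that looks better so far."""
    rng = np.random.default_rng(seed)
    succ = [0, 0]   # conversions observed for A and B
    fail = [0, 0]   # non-conversions observed for A and B
    for _ in range(visitors):
        p_a = prob_a_beats_b(succ, fail, rng)
        arm = 0 if rng.random() < p_a else 1    # show A with probability p_a
        if rng.random() < true_rates[arm]:      # simulate whether the visitor converts
            succ[arm] += 1
        else:
            fail[arm] += 1
    shown = [succ[0] + fail[0], succ[1] + fail[1]]
    return shown, succ

# Hypothetical conversion rates: A converts 30% of visitors, B converts 10%.
shown, succ = adaptive_test(true_rates=(0.30, 0.10), visitors=1000)
print("visitors shown A and B:", shown)
print("conversions for A and B:", succ)
```

Because B never has probability exactly zero of being shown, the test can still recover if early results happen to favor the wrong page.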

One way to use adaptive randomization is variable experiment sizes. Instead of deciding a test size in advance, you test until you’re satisfied that you’ve found a winner. That may require fewer visitors than a conventional A/B test. It may also require more, but only when there’s a good reason to. The test may go into overtime, so to speak, because the two versions are performing similarly, in which case you’d like to keep testing longer to find which is better.

It’s easy to fall into thinking that the winner of a test will be used forever, whether you’re testing web pages or cancer treatments. But this isn’t the case. The winner will eventually be tested against something else, maybe very soon. This means that you might want to put a little more emphasis on the performance *during* the test and not just performance *after* the test, because there may not be much opportunity for performance after the test.

*I’m a data scientist*. Not sure what that means, but it sounds cool.

*I study machine learning*. Hmm. Maybe interesting, maybe a little ominous.

*I’m into big data*. Exciting or passé, depending on how many times you’ve heard the term.

Even though each of these descriptions makes a different impression, they’re all essentially the same thing. You could throw in a few more terms too, like artificial intelligence, inferential science, decision theory, or inverse probability.

There are distinctions. These terms don’t entirely overlap, but the overlap is huge. They all have to do with taking data and making an inference.

“Decision-making under uncertainty” emphasizes that you never have complete data, and yet you need to make decisions anyway. “Decision theory” emphasizes that the whole point of analyzing data is to do something as a result, and suggests that focusing directly on the decision itself, rather than proxies along the way, is the best way to do this.

“Data science” stresses that there is more to the process of making inferences than what falls under the traditional heading of “statistics.” Statistics has never been only about “the grotesque phenomenon generally known as mathematical statistics,” as Francis Anscombe described it. Things like data cleaning and visualization have always been part of the practice of statistics, though not the theory of statistics. Data science also emphasizes the role of computation. Some say a data scientist is a statistician who can program. Some say data science is statistics on a Mac.

Despite the hype around the term data science, it’s growing on me. It has its drawbacks, but so does every other name.

Machine learning, like decision theory, emphasizes the ultimate goal of doing something with data rather than creating an accurate model of the process that generates the data. If you can create such a model, so much the better. But it may not be necessary to have a great model in order to accomplish what you originally set out to do. “Naive Bayes,” for example, is a classification algorithm that is admittedly naive. It knowingly makes a gross simplification, assuming events are independent that we know are certainly not independent, and yet it often works well enough.

“Big data” is a big can of worms. It is often concerned with data sets that are indeed big, but it also implies other things, such as the way the data become available, as a real time stream rather than as a complete static set. See Erik Meijer’s Big data cube. And that’s just when the term “big data” is used in some fairly meaningful way. It’s also used so broadly as to be meaningless.

The term “statistics” literally means the mathematics of the interests of states, as in governments, because these were the first applications of statistics. So while “statistics” may be the most established and perhaps most respectable term discussed here, it’s not great. As I remarked here, “The term *statistics* would be equivalent to *governmentistics*, a historically accurate but otherwise useless term.” Statistics emphasizes probability models and mathematical rigor more than other variations on data analysis do. Statisticians criticize machine learning folks for being sloppy. Machine learning folks criticize statisticians for being too conservative, or for being too focused on description and not focused enough on prediction.

Bayesian statistics is much older than what is now sometimes called “classical” statistics. It was essentially dormant during the first half of the 20th century before experiencing a renaissance in the second half of the century. Bayesian statistics was originally called “inverse probability” for good reason. Probability theory takes the probabilities of events as given and makes inferences about possible outcomes. Bayesian statistics does the inverse, taking data as given and inferring the probabilities that lead to the data. All statistics does something like this, but Bayesian statistics is consistent in forming all inference directly as probabilities. Frequentist (“classical”) statistics also infers probabilities, but the results, things like *p*-values and confidence intervals, are not the probabilities of what most people think they are. See Anthony O’Hagan’s description here.

Data analysis has gone by many names over time, sometimes with meaningful distinctions and sometimes not. Often people make a distinction without a difference.

My title speaks of “data analysis” not “statistics”, and of “computation” not “computing science”; it does not speak of “mathematics”, but only last. Why? …

My brother-in-squared-law, Francis J. Anscombe has commented on my use of “data analysis” in the following words:

Whereas the content of Tukey’s remarks is always worth pondering, some of his terminology is hard to take. He seems to identify “statistics” with the grotesque phenomenon generally known as “mathematical statistics”, and finds it necessary to replace “statistical analysis” with “data analysis.”

(Tukey calls Anscombe his “brother-in-squared-law” because Anscombe was a fellow statistician as well as his brother-in-law. At first I thought Tukey had said “brother-in-law-squared”, which could mean his brother-in-law’s brother-in-law, but I suppose it was a pun on the role of least-square methods in statistics.)

Tukey later says

I … shall stick to this attitude today, and shall continue to use the words “data analysis”, in part to indicate that we can take probability seriously, or leave it alone, as may from time to time be appropriate or necessary.

It seems Tukey was reserving the term “statistics” for that portion of data analysis which is rigorously based on probability.

Other methods, such as fuzzy logic, may be useful, though they must violate common sense (at least as defined by Cox’s theorem) under some circumstances. They may still be useful when they provide approximately the results that probability would have provided, with less effort, and stay away from edge cases that deviate too far from common sense.

There are various kinds of uncertainty, principally epistemic uncertainty (lack of knowledge) and aleatory uncertainty (randomness), and various philosophies for how to apply probability. One advantage to the Bayesian approach is that it handles epistemic and aleatory uncertainty in a unified way.

Blog posts related to quantifying uncertainty:

- How loud is the evidence?
- The law of small numbers
- Example of the law of small numbers
- Laws of large numbers and small numbers
- Plausible reasoning
- What is a confidence interval?
- Learning is not the same as gaining information
- What a probability means
- Irrelevant uncertainty
- Probability and information
- False positives for medical papers
- False positives for medical tests
- Most published research results are false
- Determining distribution parameters from quantiles
- Fitting a triangular distribution
- Musicians, drunks, and Oliver Cromwell

I tried. I tried to learn some statistics actually when I was younger and it’s a beautiful subject. But at the time I think I found the shakiness of the philosophical underpinnings were too scary for me. I felt a little nauseated all the time. Math is much more comfortable. You know where you stand. You know what’s proved and what’s not. It doesn’t have quite the same ethical and moral dimension that statistics has. I was never able to get comfortable with it the way my parents were.

The following illustration of this difference comes from a talk by Luis Pericchi last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is *p* = 0.5. The alternative hypothesis was that *p* is not 0.5. There were *N* = 104,490,000 bits emitted during the experiment, and *s* = 52,263,471 were 1’s. The *p*-value, the probability of an imbalance this large or larger under the assumption that *p* = 0.5, is 0.0003. Such a tiny *p*-value would be regarded as extremely strong evidence in favor of ESP given the way *p*-values are commonly interpreted.

The Bayes factor, however, is 18.7, meaning that the null hypothesis appears to be about 19 times more likely than the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).

So given the data and assumptions in this example, the Frequentist concludes there is very strong evidence **for** ESP while the Bayesian concludes there is strong evidence **against** ESP.

The following Python code shows how one might calculate the *p*-value and Bayes factor.

    from scipy.stats import binom
    from numpy import log, exp
    from scipy.special import betaln

    N = 104490000
    s = 52263471

    # sf is the survival function, i.e. complementary cdf
    # ccdf multiplied by 2 because we're doing a two-sided test
    print("p-value: ", 2*binom.sf(s, N, 0.5))

    # Compute the log of the Bayes factor to avoid underflow.
    logbf = N*log(0.5) - betaln(s+0.5, N-s+0.5) + betaln(0.5, 0.5)
    print("Bayes factor: ", exp(logbf))

Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)

The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)

Similarly, the term *data scientist* is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”

The big deal about data science isn’t data but the science of drawing *inferences* from the data. *Inference science* would be a better term, in my opinion, but that term hasn’t taken off.

*Data science* could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title *data scientist* is rightfully associated with people who have better computational skills than statisticians typically have.

While the term *data science* isn’t perfect, there’s little to recommend the term *statistics* other than that it is well established. The root of *statistics* is *state*, as in a *government*. This is because statistics was first applied to the concerns of bureaucracies. The term *statistics* would be equivalent to *governmentistics*, a historically accurate but otherwise useless term.

So a request like “Please send me the data from your experiment” becomes “Please send me the measurements from your experiment.” Same thing.

But rousing statements about the power of data become banal or even ridiculous. For example, here’s an article from Forbes after substituting *measurements* for *data*:

The Hottest Jobs In IT: Training Tomorrow’s Measurements Scientists

If you thought good plumbers and electricians were hard to find, try getting hold of a measurements scientist. The rapid growth of big measurements and analytics for use within businesses has created a huge demand for people capable of extracting knowledge from measurements.

…

Some of the top positions in demand include business intelligence analysts, measurements architects, measurements warehouse analysts and measurements scientists, Reed says. “We believe the demand for measurements expertise will continue to grow as more companies look for ways to capitalize on this information,” he says.

…

One way to fit a triangular distribution to data would be to set *a* to the minimum value and *b* to the maximum value. You could pick *a* and *b* to be the smallest and largest *possible* values, if these values are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.

How do you pick *c*? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean

(*a* + *b* + *c*)/3

so you could simply solve for *c* to match the sample mean.

Another approach would be to pick *c* so that the resulting distribution has the same *median* as the data. This approach is more interesting because it cannot always be done.

Suppose your sample median is *m*. You can always find a point *c* so that half the area of the triangle lies to the left of a vertical line drawn through *m*. However, this might require the foot *c* to be to the left or the right of the base [*a*, *b*]. In that case the resulting triangle is obtuse, and so the sides of the triangle do not form the graph of a function.

For the triangle to give us the graph of a density function, *c* must be in the interval [*a*, *b*]. Such a density has a median in the range

[*b* – (*b* – *a*)/√2, *a* + (*b* – *a*)/√2].

If the sample median *m* is in this range, then we can solve for *c* so that the distribution has median *m*. The solution is

*c* = *b* – 2(*b* – *m*)^{2} / (*b* – *a*)

if *m* < (*a* + *b*)/2 and

*c* = *a* + 2(*a* – *m*)^{2} / (*b* – *a*)

otherwise.
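The formulas above translate directly into code. This is a small sketch with function names of my own invention. `mode_from_median` returns `None` when the sample median falls outside the feasible range, and `triangular_median` is included only to check the result. Note that `mode_from_mean` does not check whether the resulting *c* lands in [*a*, *b*].

```python
from math import sqrt

def triangular_median(a, b, c):
    """Median of the triangular distribution on [a, b] with mode c."""
    if c >= (a + b) / 2:
        return a + sqrt((b - a) * (c - a) / 2)
    return b - sqrt((b - a) * (b - c) / 2)

def mode_from_mean(a, b, mean):
    """Solve (a + b + c)/3 = mean for c. The result may fall outside [a, b]."""
    return 3 * mean - a - b

def mode_from_median(a, b, m):
    """Return c so the triangular density on [a, b] has median m, or None if impossible."""
    lo = b - (b - a) / sqrt(2)    # median when c = a
    hi = a + (b - a) / sqrt(2)    # median when c = b
    if not lo <= m <= hi:
        return None
    if m < (a + b) / 2:
        return b - 2 * (b - m) ** 2 / (b - a)
    return a + 2 * (a - m) ** 2 / (b - a)

a, b = 0.0, 10.0
c = mode_from_median(a, b, 4.0)
print(c, triangular_median(a, b, c))   # the second number should reproduce the median 4
print(mode_from_median(a, b, 1.0))     # median too close to a, so no valid c
```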

* * *

For daily tips on data science, follow @DataSciFact on Twitter.

Since your goal is to find the best dose, it seems natural to compare dose-finding methods by how often they find the best dose. This is what is most often done in the clinical trial literature. But this seemingly natural criterion is actually artificial.

Suppose a trial is testing doses of 100, 200, 300, and 400 milligrams of some new drug. Suppose further that on some scale of goodness, these doses rank 0.1, 0.2, 0.5, and 0.51. (Of course these goodness scores are unknown; the point of the trial is to estimate them. But you might make up some values for simulation, pretending with half your brain that these are the true values and pretending with the other half that you don’t know what they are.)

Now suppose you’re evaluating two clinical trial designs, running simulations to see how each performs. The first design picks the 400 mg dose, the best dose, 20% of the time and picks the 300 mg dose, the second best dose, 50% of the time. The second design picks each dose with equal probability. The latter design picks the *best* dose more often, but it picks a *good* dose less often.

In this scenario, the two largest doses are essentially equally good; it hardly matters how often a method distinguishes between them. The first method picks one of the two good doses 70% of the time while the second method picks one of the two good doses only 50% of the time.

This example was exaggerated to make a point: obviously it doesn’t matter how often a method can pick the better of two very similar doses, not when it very often picks a bad dose. But there are less obvious situations that are quantitatively different but qualitatively the same.

The goal is actually to find a good dose. Finding the absolute best dose is impossible. The most you could hope for is that a method finds with high probability the best of the four *arbitrarily chosen doses* under consideration. Maybe the *best* dose is 350 mg, 843 mg, or some other dose not under consideration.

A simple way to make evaluating dose-finding methods less arbitrary would be to estimate the benefit to patients. Finding the best dose is only a matter of curiosity in itself unless you consider how that information is used. Knowing the best dose is important because you want to treat future patients as effectively as you can. (And patients in the trial itself as well, if it is an adaptive trial.)

Suppose the measure of goodness in the scenario above is probability of successful treatment and that 1,000 patients will be treated at the dose level picked by the trial. Under the first design, there’s a 20% chance that 51% of the future patients will be treated successfully, and a 50% chance that 50% will be. The expected number of successful treatments from the two best doses is 352. Under the second design, the corresponding number is 252.5.

(To simplify the example above, I didn’t say how often the first design picks each of the two lowest doses. But the first design will result in at least 382 expected successes and the second design 327.5.)
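The arithmetic behind these numbers is short enough to check in a few lines. The dictionary and function names below are mine. The lower bound for the first design assumes the remaining 30% of its picks all go to the worst dose, 100 mg.

```python
# Made-up "true" success probabilities for each dose in mg, as in the scenario above
success_prob = {100: 0.10, 200: 0.20, 300: 0.50, 400: 0.51}
future_patients = 1000

def expected_successes(pick_prob):
    """Expected number of successes given the probability of picking each dose."""
    return future_patients * sum(p * success_prob[d] for d, p in pick_prob.items())

# Contribution from the two best doses only
design1_top = expected_successes({300: 0.50, 400: 0.20})
design2_top = expected_successes({300: 0.25, 400: 0.25})

# Worst case for design 1: the remaining 30% of picks all go to the 100 mg dose
design1_worst = expected_successes({100: 0.30, 300: 0.50, 400: 0.20})

# Design 2 picks each of the four doses 25% of the time
design2_all = expected_successes({d: 0.25 for d in success_prob})

print(design1_top, design2_top, design1_worst, design2_all)
```

Up to floating point rounding, these reproduce the figures 352, 252.5, 382, and 327.5 quoted above.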

You never know how many future patients will be treated according to the outcome of a clinical trial, but there must be some implicit estimate. If this estimate is zero, the trial is not worth conducting. In the example given here, the estimate of 1,000 future patients is irrelevant: the future patient horizon cancels out in a comparison of the two methods. The patient horizon matters when you want to include the benefit to patients in the trial itself. The patient horizon serves as a way to weigh the interests of current versus future patients, an ethically difficult comparison usually left implicit.

Probability and statistics:

- How to test a random number generator
- Predictive probabilities for normal outcomes
- Predictive probability interim analysis
- Relating two definitions of expectation
- Illustrating the error in the delta method
- Relating the error function erf and Φ
- Inverse gamma distribution
- Negative binomial distribution
- Upper and lower bounds for the normal distribution function
- Canonical example of Bayes’ theorem in detail
- Functions of regular variation
- Student-t as a mixture of normals

Other math:

- Chebyshev polynomials
- Richard Stanley’s twelvefold way (combinatorics)
- Hypergeometric functions
- Outline of Laplace transforms
- Navier-Stokes equations
- Picking the step size for numerical ODEs
- Orthogonal polynomials
- Multi-index notation
- The *pqr* theorem for seminorms

See also journal articles and technical reports.

**Last week**: Probability approximations

**Next week**: Code Project articles

Do we even need probability approximations anymore? They’re not as necessary for numerical computation as they once were, but they remain vital for understanding the behavior of probability distributions and for theoretical calculations.

Textbooks often leave out details such as quantifying the error when discussing approximations. The following pages are notes I wrote to fill in some of these details when I was teaching.

- Error in the normal approximation to the binomial distribution
- Error in the normal approximation to the gamma distribution
- Error in the normal approximation to the Poisson distribution
- Error in the normal approximation to the t distribution
- Error in the Poisson approximation to the binomial distribution
- Error in the normal approximation to the beta distribution
- Camp-Paulson normal approximation to the binomial distribution
- Diagram of probability distribution relationships
- Relative error in normal approximations

See also blog posts tagged Probability and statistics and the Twitter account ProbFact.

**Last week**: Numerical computing resources

**Next week**: Miscellaneous math notes

There are three ways Bayesian posterior probability calculations can degrade with more data:

- Polynomial approximation
- Missing the spike
- Underflow

Elementary numerical integration algorithms, such as Gaussian quadrature, are based on polynomial approximations. The method aims to exactly integrate a polynomial that approximates the integrand. But likelihood functions are not approximately polynomial, and they become less like polynomials when they contain more data. They become more like a normal density, asymptotically flat in the tails, something no polynomial can do. With better integration techniques, the integration accuracy will *improve* with more data rather than degrade.

With more data, the posterior distribution becomes more concentrated. This means that a naive approach to integration might entirely miss the part of the integrand where nearly all the mass is concentrated. You need to make sure your integration method is putting its effort where the action is. Fortunately, it’s easy to estimate where the mode should be.

The third problem is that software calculating the likelihood function can underflow with even a moderate amount of data. The usual solution is to work with the logarithm of the likelihood function, but with numerical integration the solution isn’t quite that simple. You need to integrate the likelihood function itself, not its logarithm. I describe how to deal with this situation in Avoiding underflow in Bayesian computations.
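As a rough illustration of the last two points, here is one common way to handle them, not necessarily the approach in the post just mentioned. The sketch assumes binomial data with a uniform prior. It places the integration grid around the mode, and it subtracts the maximum of the log likelihood before exponentiating so the largest value being integrated is 1. The function names, grid width, and number of points are arbitrary choices of mine.

```python
import numpy as np

def log_likelihood(theta, s, n):
    """Binomial log likelihood, dropping the constant binomial coefficient."""
    return s * np.log(theta) + (n - s) * np.log1p(-theta)

def trapezoid(y, x):
    """Plain trapezoid rule on the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def posterior_mean(s, n, width=10, points=2001):
    """Posterior mean of theta under a uniform prior, integrating a shifted likelihood."""
    mode = s / n
    sd = np.sqrt(mode * (1 - mode) / n)        # rough scale of the spike
    lo = max(mode - width * sd, 1e-12)
    hi = min(mode + width * sd, 1 - 1e-12)
    theta = np.linspace(lo, hi, points)        # grid placed where the mass is
    ll = log_likelihood(theta, s, n)
    w = np.exp(ll - ll.max())                  # largest value is exp(0) = 1
    return trapezoid(theta * w, theta) / trapezoid(w, theta)

s, n = 6000, 10000
print(np.exp(log_likelihood(s / n, s, n)))     # the unshifted likelihood underflows
print(posterior_mean(s, n))                    # should be close to (s + 1)/(n + 2)
```

The shift cancels in the ratio of the two integrals, so there is nothing to add back at the end. If you needed the normalizing constant itself, you would carry the subtracted maximum along on the log scale.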

- R language for programmers
- Default arguments and lazy evaluation in R
- Distributions in R
- Moving data between R and Excel via the clipboard
- Sweave: First steps toward reproducible analyses
- Troubleshooting Sweave
- Regular expressions in R

See also posts tagged Rstats.

I started the Twitter account RLangTip and handed it over to the folks at Revolution Analytics.

**Last week**: Emacs resources

**Next week**: C++ resources

They will follow a Poisson distribution with an average of two per day. (Times are truncated to multiples of 5 minutes because my scheduling software requires that.)
