The following illustration of this difference comes from a talk by Luis Pericchi last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is *p* = 0.5. The alternative hypothesis was that *p* is not 0.5. There were *N* = 104,490,000 bits emitted during the experiment, and *s* = 52,263,471 were 1’s. The *p*-value, the probability of an imbalance this large or larger under the assumption that *p* = 0.5, is 0.0003. Such a tiny *p*-value would be regarded as extremely strong evidence in favor of ESP given the way *p*-values are commonly interpreted.

The Bayes factor, however, is 18.7, meaning that the null hypothesis appears to be about 19 times more likely than the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).

So given the data and assumptions in this example, the Frequentist concludes there is strong evidence **for** ESP while the Bayesian concludes there is substantial evidence **against** ESP.

The following Python code shows how one might calculate the *p*-value and Bayes factor.

```python
from math import exp, log

from scipy.special import betaln
from scipy.stats import binom

N = 104490000
s = 52263471

# sf is the survival function, i.e. the complementary cdf: sf(k) = P(X > k).
# Use s - 1 so the tail includes s itself, and multiply by 2
# because we're doing a two-sided test.
print("p-value: ", 2 * binom.sf(s - 1, N, 0.5))

# Compute the log of the Bayes factor to avoid underflow.
# The betaln(0.5, 0.5) term is the normalizing constant of the Beta(0.5, 0.5) prior.
logbf = N * log(0.5) - betaln(s + 0.5, N - s + 0.5) + betaln(0.5, 0.5)
print("Bayes factor: ", exp(logbf))
```

Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)

The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)

Similarly, the term *data scientist* is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”

The big deal about data science isn’t data but the science of drawing *inferences* from the data. *Inference science* would be a better term, in my opinion, but that term hasn’t taken off.

*Data science* could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title *data scientist* is rightfully associated with people who have better computational skills than statisticians typically have.

While the term *data science* isn’t perfect, there’s little to recommend the term *statistics* other than that it is well established. The root of *statistics* is *state*, as in a *government*. This is because statistics was first applied to the concerns of bureaucracies. The term *statistics* would be equivalent to *governmentistics*, a historically accurate but otherwise useless term.

So a request like “Please send me the data from your experiment” becomes “Please send me the measurements from your experiment.” Same thing.

But rousing statements about the power of data become banal or even ridiculous. For example, here’s an article from Forbes after substituting *measurements* for *data*:


The Hottest Jobs In IT: Training Tomorrow’s Measurements Scientists

If you thought good plumbers and electricians were hard to find, try getting hold of a measurements scientist. The rapid growth of big measurements and analytics for use within businesses has created a huge demand for people capable of extracting knowledge from measurements.

…

Some of the top positions in demand include business intelligence analysts, measurements architects, measurements warehouse analysts and measurements scientists, Reed says. “We believe the demand for measurements expertise will continue to grow as more companies look for ways to capitalize on this information,” he says.

…

One way to fit a triangular distribution to data would be to set *a* to the minimum value and *b* to the maximum value. You could pick *a* and *b* to be the smallest and largest *possible* values, if these values are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.

How do you pick *c*? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean

(*a* + *b* + *c*)/3

so you could simply solve for *c* to match the sample mean.

Another approach would be to pick *c* so that the resulting distribution has the same *median* as the data. This approach is more interesting because it cannot always be done.

Suppose your sample median is *m*. You can always find a point *c* so that half the area of the triangle lies to the left of a vertical line drawn through *m*. However, this might require the foot *c* to be to the left or the right of the base [*a*, *b*]. In that case the resulting triangle is obtuse, and so its sides do not form the graph of a function.

For the triangle to give us the graph of a density function, *c* must be in the interval [*a*, *b*]. Such a density has a median in the range

[*b* – (*b* – *a*)/√2, *a* + (*b* – *a*)/√2].

If the sample median *m* is in this range, then we can solve for *c* so that the distribution has median *m*. The solution is

*c* = *b* – 2(*b* – *m*)^{2} / (*b* – *a*)

if *m* < (*a* + *b*)/2 and

*c* = *a* + 2(*a* – *m*)^{2} / (*b* – *a*)

otherwise.
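Here is a minimal sketch of the median-matching recipe in Python. The function names `triangular_median` and `mode_from_median` and the example interval [0, 10] are my own choices for illustration. The first function computes the median of a triangular density from its distribution function, so it serves as a check on the second.

```python
import math

def triangular_median(a, b, c):
    """Median of a triangular density on [a, b] with mode c."""
    if c >= (a + b) / 2:
        return a + math.sqrt((b - a) * (c - a) / 2)
    return b - math.sqrt((b - a) * (b - c) / 2)

def mode_from_median(a, b, m):
    """Return the mode c that gives median m, or None if no c in [a, b] works."""
    lo = b - (b - a) / math.sqrt(2)
    hi = a + (b - a) / math.sqrt(2)
    if not lo <= m <= hi:
        return None
    if m < (a + b) / 2:
        return b - 2 * (b - m) ** 2 / (b - a)
    return a + 2 * (a - m) ** 2 / (b - a)

a, b = 0.0, 10.0  # hypothetical interval
for m in (3.5, 5.0, 6.2, 8.0):
    c = mode_from_median(a, b, m)
    if c is None:
        print(f"m = {m}: no triangular density on [{a}, {b}] has this median")
    else:
        check = triangular_median(a, b, c)
        print(f"m = {m}: c = {c:.4f}, median of fitted density = {check:.4f}")
```

The last value, *m* = 8, falls outside the feasible range for this interval, so the function reports that no fit exists rather than returning a *c* outside [*a*, *b*].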

Since your goal is to find the best dose, it seems natural to compare dose-finding methods by how often they find the best dose. This is what is most often done in the clinical trial literature. But this seemingly natural criterion is actually artificial.

Suppose a trial is testing doses of 100, 200, 300, and 400 milligrams of some new drug. Suppose further that on some scale of goodness, these doses rank 0.1, 0.2, 0.5, and 0.51. (Of course these goodness scores are unknown; the point of the trial is to estimate them. But you might make up some values for simulation, pretending with half your brain that these are the true values and pretending with the other half that you don’t know what they are.)

Now suppose you’re evaluating two clinical trial designs, running simulations to see how each performs. The first design picks the 400 mg dose, the best dose, 20% of the time and picks the 300 mg dose, the second best dose, 50% of the time. The second design picks each dose with equal probability. The latter design picks the *best* dose more often, but it picks a *good* dose less often.

In this scenario, the two largest doses are essentially equally good; it hardly matters how often a method distinguishes between them. The first method picks one of the two good doses 70% of the time while the second method picks one of the two good doses only 50% of the time.

This example was exaggerated to make a point: obviously it doesn’t matter how often a method can pick the better of two very similar doses, not when it very often picks a bad dose. But there are less obvious situations that are quantitatively different but qualitatively the same.

The goal is actually to find a good dose. Finding the absolute best dose is impossible. The most you could hope for is that a method finds with high probability the best of the four *arbitrarily chosen doses* under consideration. Maybe the *best* dose is 350 mg, 843 mg, or some other dose not under consideration.

A simple way to make evaluating dose-finding methods less arbitrary would be to estimate the benefit to patients. Finding the best dose is only a matter of curiosity in itself unless you consider how that information is used. Knowing the best dose is important because you want to treat future patients as effectively as you can. (And patients in the trial itself as well, if it is an adaptive trial.)

Suppose the measure of goodness in the scenario above is probability of successful treatment and that 1,000 patients will be treated at the dose level picked by the trial. Under the first design, there’s a 20% chance that 51% of the future patients will be treated successfully, and a 50% chance that 50% will be. The expected number of successful treatments from the two best doses is 352. Under the second design, the corresponding number is 252.5.

(To simplify the example above, I didn’t say how often the first design picks each of the two lowest doses. But the first design will result in at least 382 expected successes and the second design 327.5.)
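The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the names are made up, and for the first design I assume the worst case, namely that the remaining 30% of selections all go to the 100 mg dose, which is what gives the lower bound.

```python
# Probability of successful treatment at each dose (mg), from the scenario above.
SUCCESS_PROB = {100: 0.10, 200: 0.20, 300: 0.50, 400: 0.51}
HORIZON = 1000  # future patients treated at the selected dose

def expected_successes(pick_prob, doses=None):
    """Expected successes, optionally counting only the listed doses."""
    if doses is None:
        doses = pick_prob.keys()
    return HORIZON * sum(pick_prob[d] * SUCCESS_PROB[d] for d in doses)

# Assumption: the unspecified 30% for the first design all lands on 100 mg.
design1 = {100: 0.30, 200: 0.00, 300: 0.50, 400: 0.20}
design2 = {dose: 0.25 for dose in SUCCESS_PROB}

for name, design in (("first", design1), ("second", design2)):
    top_two = expected_successes(design, [300, 400])
    overall = expected_successes(design)
    print(f"{name} design: {top_two:.1f} from the two best doses, {overall:.1f} overall")
```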

You never know how many future patients will be treated according to the outcome of a clinical trial, but there must be some implicit estimate. If this estimate is zero, the trial is not worth conducting. In the example given here, the estimate of 1,000 future patients is irrelevant: the future patient horizon cancels out in a comparison of the two methods. The patient horizon matters when you want to include the benefit to patients in the trial itself. The patient horizon serves as a way to weigh the interests of current versus future patients, an ethically difficult comparison usually left implicit.

Probability and statistics:

- How to test a random number generator
- Predictive probabilities for normal outcomes
- One-arm binary predictive probability
- Relating two definitions of expectation
- Illustrating the error in the delta method
- Relating the error function erf and Φ
- Inverse gamma distribution
- Negative binomial distribution
- Upper and lower bounds for the normal distribution function
- Canonical example of Bayes’ theorem in detail
- Functions of regular variation
- Student-t as a mixture of normals

Other math:

- Chebyshev polynomials
- Richard Stanley’s twelvefold way (combinatorics)
- Hypergeometric functions
- Outline of Laplace transforms
- Navier-Stokes equations
- Picking the step size for numerical ODEs
- Orthogonal polynomials
- Multi-index notation
- The *pqr* theorem for seminorms

See also journal articles and technical reports.

**Last week**: Probability approximations

**Next week**: Code Project articles

Do we even need probability approximations anymore? They’re not as necessary for numerical computation as they once were, but they remain vital for understanding the behavior of probability distributions and for theoretical calculations.

Textbooks often leave out details such as quantifying the error when discussing approximations. The following pages are notes I wrote to fill in some of these details when I was teaching.

- Error in the normal approximation to the binomial distribution
- Error in the normal approximation to the gamma distribution
- Error in the normal approximation to the Poisson distribution
- Error in the normal approximation to the t distribution
- Error in the Poisson approximation to the binomial distribution
- Error in the normal approximation to the beta distribution
- Camp-Paulson normal approximation to the binomial distribution
- Diagram of probability distribution relationships
- Relative error in normal approximations

See also blog posts tagged Probability and statistics and the Twitter account ProbFact.

**Last week**: Numerical computing resources

**Next week**: Miscellaneous math notes

There are three ways Bayesian posterior probability calculations can degrade with more data:

- Polynomial approximation
- Missing the spike
- Underflow

Elementary numerical integration algorithms, such as Gaussian quadrature, are based on polynomial approximations. The method aims to exactly integrate a polynomial that approximates the integrand. But likelihood functions are not approximately polynomial, and they become less like polynomials when they contain more data. They become more like a normal density, asymptotically flat in the tails, something no polynomial can do. With better integration techniques, the integration accuracy will *improve* with more data rather than degrade.

With more data, the posterior distribution becomes more concentrated. This means that a naive approach to integration might entirely miss the part of the integrand where nearly all the mass is concentrated. You need to make sure your integration method is putting its effort where the action is. Fortunately, it’s easy to estimate where the mode should be.

The third problem is that software calculating the likelihood function can underflow with even a moderate amount of data. The usual solution is to work with the logarithm of the likelihood function, but with numerical integration the solution isn’t quite that simple. You need to integrate the likelihood function itself, not its logarithm. I describe how to deal with this situation in Avoiding underflow in Bayesian computations.
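As a rough illustration of one common remedy, which may or may not match the linked article in its details, you can subtract the maximum of the log likelihood before exponentiating, integrate the rescaled function, and add the maximum back on the log scale. The sketch below assumes a binomial likelihood with a made-up data set and uses a plain midpoint rule; the function names are mine.

```python
import math

def log_likelihood(theta, successes, failures):
    return successes * math.log(theta) + failures * math.log1p(-theta)

def log_integral(successes, failures, num_points=200_001):
    """Log of the integral of the likelihood over (0, 1), by the midpoint rule."""
    h = 1.0 / num_points
    logs = [log_likelihood((i + 0.5) * h, successes, failures)
            for i in range(num_points)]
    peak = max(logs)
    # After subtracting the peak, the largest term is exp(0) = 1,
    # so the sum cannot underflow to zero.
    total = sum(math.exp(v - peak) for v in logs)
    return peak + math.log(total * h)

s, f = 6000, 4000  # hypothetical data: 6000 successes, 4000 failures

# Evaluating the likelihood directly underflows, even at its mode.
print("naive likelihood at the mode:", math.exp(log_likelihood(0.6, s, f)))

# The rescaled integral is fine. For this likelihood the exact answer is
# the log of the beta function B(s + 1, f + 1), which gives a check.
exact = math.lgamma(s + 1) + math.lgamma(f + 1) - math.lgamma(s + f + 2)
print("log integral, numerical:", log_integral(s, f))
print("log integral, exact:    ", exact)
```

A uniform grid is wasteful here since nearly all the mass sits close to 0.6. A more careful routine would concentrate its points around the mode, which is the second problem in the list above.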

***

If you’d like help with statistical computation, let’s talk.

- R language for programmers
- Default arguments and lazy evaluation in R
- Distributions in R
- Moving data between R and Excel via the clipboard
- Sweave: First steps toward reproducible analyses
- Troubleshooting Sweave
- Regular expressions in R

See also posts tagged Rstats.

I started the Twitter account RLangTip and handed it over to the folks at Revolution Analytics.

**Last week**: Emacs resources

**Next week**: C++ resources

They will follow a Poisson distribution with an average of two per day. (Times are truncated to multiples of 5 minutes because my scheduling software requires that.)


- A big part of being a statistician is knowing what to do when your assumptions aren’t met, because they’re never exactly met.
- A lot of statisticians think time series analysis is voodoo, and he was inclined to agree with them.

Perhaps in reaction to knee-jerk antipathy toward Bayesian methods, some statisticians have adopted knee-jerk enthusiasm for Bayesian methods. Everything’s better with Bayesian analysis on it. Bayes makes it better, like a little dab of margarine on a dry piece of bread.

There’s much that I prefer about the Bayesian approach to statistics. Sometimes it’s the only way to go. But Bayes-for-the-sake-of-Bayes can expend a great deal of effort, by human and computer, to arrive at a conclusion that could have been reached far more easily by other means.

**Related**: Bayes isn’t magic

Image via Gallery of Graphic Design

This is a variation on a problem I’ve blogged about before. As I pointed out there, we can assume without loss of generality that the samples come from the unit interval. Then the sample range has a beta(*n* – 1, 2) distribution. So the probability that the sample range is greater than a value *c* is

1 – *n* *c*^{*n* – 1} + (*n* – 1) *c*^{*n*}.

Setting *c* = 0.9, here’s a plot of the probability that the sample range contains at least 90% of the population range, as a function of sample size.

The answer to the question at the top of the post is 16 or 17. These two values of *n* yield probabilities 0.485 and 0.518 respectively. This means that a fairly small sample is likely to give you a fairly good estimate of the range.
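The tail probability of a beta(*n* – 1, 2) random variable has a simple closed form, so the numbers above are easy to check. The following sketch evaluates it for a few sample sizes; the function name is my own.

```python
def prob_range_exceeds(n, c):
    """P(sample range > c) for n uniform samples on [0, 1].

    This is the upper tail of a beta(n - 1, 2) distribution,
    whose distribution function is n c^(n-1) - (n - 1) c^n.
    """
    return 1 - n * c ** (n - 1) + (n - 1) * c ** n

for n in (5, 10, 16, 17, 30):
    print(n, round(prob_range_exceeds(n, 0.9), 3))
```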

For several years I’ve thought about the interplay of statistics and common sense. Probability is more abstract than physical properties like length or color, and so common sense is more often misguided in the context of probability than in visual perception. In probability and statistics, the analogs of optical illusions are usually called paradoxes: St. Petersburg paradox, Simpson’s paradox, Lindley’s paradox, etc. These paradoxes show that common sense can be seriously wrong, without having to consider contrived examples. Instances of Simpson’s paradox, for example, pop up regularly in application.

Some physicists say that you should always have an order-of-magnitude idea of what a result will be before you calculate it. This implies a belief that such estimates are usually possible, and that they provide a sanity check for calculations. And that’s true in physics, at least in mechanics. In probability, however, it is quite common for even an expert’s intuition to be way off. Calculations are more likely to find errors in common sense than the other way around.

Nevertheless, common sense is vitally important in statistics. Attempts to minimize the need for common sense can lead to nonsense. You need common sense to formulate a statistical model and to interpret inferences from that model. Statistics is a layer of exact calculation sandwiched between necessarily subjective formulation and interpretation. Even though common sense can go badly wrong with probability, it can also do quite well in some contexts. Common sense is necessary to map probability theory to applications and to evaluate how well that map works.

**Update**: Use promo code KeenCon-JohnCook to get 75% off registration.

The problem is more interesting when the interval is unknown. You may be trying to estimate the end points of the interval by taking the max and min of the samples you’ve drawn. But in fact we might as well assume the interval is [0, 1] because the probability of a new sample falling within the previous sample range does not depend on the interval. The location and scale of the interval cancel out when calculating the probability.

Suppose we’ve taken *n* samples so far. The range of these samples is the difference between the 1st and the *n*th order statistics, and for a uniform distribution this difference has a beta(*n*-1, 2) distribution. Since a beta(*a*, *b*) distribution has mean *a*/(*a*+*b*), the expected value of the sample range from *n* samples is (*n*-1)/(*n*+1). This is also the probability that the next sample, or any particular future sample, will lie within the range of the samples seen so far.

If you’re trying to estimate the size of the total interval, this says that after *n* samples, the probability that the next sample will give you any new information is 2/(*n*+1). This is because we only learn something when a sample is less than the minimum so far or greater than the maximum so far.
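A quick simulation makes the (*n* – 1)/(*n* + 1) result easy to believe. This is just a sanity check under the stated assumption of uniform samples on [0, 1]; the function name, trial count, and seed are arbitrary.

```python
import random

def prob_next_in_range(n, trials=50_000, seed=0):
    """Estimate P(next uniform sample lies within the range of n earlier ones)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        if min(xs) <= rng.random() <= max(xs):
            hits += 1
    return hits / trials

for n in (2, 5, 10):
    print(n, "simulated:", prob_next_in_range(n), "exact:", (n - 1) / (n + 1))
```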


Which side is correct depends on what’s out there waiting to be discovered, which of course we don’t know. We can only guess. Timid research is rational if you believe there are only marginal improvements that are likely to be discovered.

Sample size increases quickly as the size of the effect you’re trying to find decreases. To establish small differences in effect, you need very large trials.
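To put rough numbers on this, here is a sketch using the textbook normal-approximation formula for comparing two means, roughly 2(*z*_{α/2} + *z*_{β})² σ²/δ² patients per arm. The function name and the defaults (two-sided 5% significance, 80% power, unit standard deviation) are assumptions I picked for illustration.

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sigma=1.0, alpha=0.05, power=0.8):
    """Approximate patients per arm needed to detect a difference in means of delta."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z * sigma / delta) ** 2)

for delta in (1.0, 0.5, 0.1, 0.05):
    print(f"effect size {delta}: about {n_per_arm(delta)} patients per arm")
```

Since the sample size goes like 1/δ², cutting the effect size by a factor of 10 multiplies the required sample size by about 100.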

If you think there are only small improvements on the status quo available to explore, you’ll explore each of the possibilities very carefully. On the other hand, if you think there’s a miracle drug in the pipeline waiting to be discovered, you’ll be willing to risk falsely rejecting small improvements along the way in order to get to the big improvement.

Suppose there are 500 drugs waiting to be tested. All of these are only 10% effective except for one that is 100% effective. You could quickly find the winner by giving each candidate to one patient. For every drug whose patient responded, repeat the process until only one drug is left. One strike and you’re out. You’re likely to find the winner in three rounds, treating fewer than 600 patients. But if all the drugs are 10% effective except one that’s 11% effective, you’d need hundreds of trials with thousands of patients each.
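Here is a small simulation of the one-strike scheme, as a sketch. It assumes that each dud responds independently with probability 0.1, that the winner always responds, and that screening stops as soon as no duds remain. The function name `screen` and the seeds are made up for the illustration.

```python
import random

def screen(num_drugs=500, dud_rate=0.10, seed=0):
    """One-strike screening. Returns (rounds, patients treated)."""
    rng = random.Random(seed)
    duds = num_drugs - 1  # the winner always responds, so it always survives
    rounds = patients = 0
    while duds > 0:
        rounds += 1
        patients += duds + 1  # one patient per surviving drug, winner included
        duds = sum(rng.random() < dud_rate for _ in range(duds))
    return rounds, patients

results = [screen(seed=i) for i in range(200)]
mean_rounds = sum(r for r, _ in results) / len(results)
mean_patients = sum(p for _, p in results) / len(results)
print("average rounds:", mean_rounds)
print("average patients treated:", mean_patients)
```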

The best research strategy depends on what you believe is out there to be found. People who know nothing about cancer often believe we could find a cure soon if we just spend a little more money on research. Experts are less sanguine, except when they’re asking for money.

You could read this aloud as “the mean of the mean is the mean.” More explicitly, it says that the expected value of the average of some number of samples from some distribution is equal to the expected value of the distribution itself. The shorter reading is confusing since “mean” refers to three different things in the same sentence. In reverse order, these are:

- The mean of the distribution, defined by an integral.
- The sample mean, calculated by averaging samples from the distribution.
- The mean of the sample mean as a random variable.

The hypothesis of this theorem is that the underlying distribution **has** a mean. Let’s see where things break down if the distribution does not have a mean.

It’s tempting to say that the Cauchy distribution has mean 0. Or some might want to say that the mean is infinite. But if we take any value to be the mean of a Cauchy distribution — 0, ∞, 42, etc. — then the theorem above would be false. The mean of *n* samples from a Cauchy has the same distribution as the original Cauchy! The variability does not decrease with *n*, as it would with samples from a normal, for example. The sample mean doesn’t converge to any value as *n* increases. It just keeps wandering around with the same distribution, no matter how large the sample. That’s because the mean of the Cauchy distribution simply doesn’t exist.
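You can watch this happen in a simulation. The sketch below draws many sample means and reports their interquartile range, once for Cauchy samples and once for normal samples. The helper names, the seed, and the choice of interquartile range as the measure of spread are my own; I use the interquartile range because the variance of a Cauchy does not exist either.

```python
import math
import random

rng = random.Random(0)

def cauchy():
    # Inverse-CDF method for a standard Cauchy random variable.
    return math.tan(math.pi * (rng.random() - 0.5))

def normal():
    return rng.gauss(0, 1)

def iqr_of_sample_means(draw, n, reps=2000):
    """Interquartile range of the mean of n samples, estimated from reps replications."""
    means = sorted(sum(draw() for _ in range(n)) / n for _ in range(reps))
    return means[3 * reps // 4] - means[reps // 4]

for n in (1, 10, 100):
    print(f"n = {n:3d}:",
          "Cauchy IQR =", round(iqr_of_sample_means(cauchy, n), 2),
          " normal IQR =", round(iqr_of_sample_means(normal, n), 2))
```

The spread of the normal sample means shrinks like 1/√*n*, while the spread of the Cauchy sample means stays near 2, the interquartile range of a single standard Cauchy sample.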

Statistics is in many ways much more useful for most students than calculus. The problem is, to teach it *well* is extraordinarily difficult. It’s very easy to teach a horrible statistics class where you spit back the definitions of mean and median. But you become dangerous because you think you know something about data when in fact it’s kind of subtle.

A little knowledge is a dangerous thing, more so for statistics than calculus.

This reminds me of a quote by Stephen Senn:

Statistics: A subject which most statisticians find difficult but in which nearly all physicians are expert.

**Related**: Elementary statistics book recommendation


The terminology used throughout this document *enormously overloads* the symbol *p*(). That is, we are using, in each line of this discussion, the function *p*() to mean something different; its meaning is set by the letters used in its arguments. That is a nomenclatural abomination. I apologize, and encourage my readers to do things that aren’t so ambiguous (like maybe add informative subscripts), but it is so standard in our business that I won’t change (for now).

I found this terribly confusing when I started doing statistics. The meaning is not explicit in the notation but implicit in the conventions surrounding its use, conventions that were foreign to me since I was trained in mathematics and came to statistics later. When I would use letters like *f* and *g* for functions collaborators would say “I don’t know what you’re talking about.” Neither did I understand what they were talking about since they used one letter for everything.

This morning I thought about what Eric said when I saw a little snow. Last Tuesday was predicted to see ice and schools all over the Houston area closed. As it turned out, there was only a tiny amount of ice and the streets were clear. This morning there actually is snow and ice in the area, though not much, and the schools are all open. (There’s snow out in Cypress where I live, but I don’t think there is in Houston proper.)


The problem with big data is that it’s difficult to analyze it when the data is stored in many different ways. How do you analyze data that is distributed across relational database management systems (RDBMS), XML flat-file databases, text-based log files, and binary format storage systems?

If data are in disparate file formats, that’s a pain. And from an IT perspective that may be as far as the difficulty goes. **But why would data be in multiple formats?** Because it’s different kinds of data! That’s the bigger difficulty.

It’s conceivable, for example, that a scientific study would collect the exact same kinds of data at two locations, under as similar conditions as possible, but one site put their data in a relational database and the other put it in XML files. More likely the differences go deeper. Maybe you have lab results for patients stored in a relational database and their phone records stored in flat files. How do you meaningfully combine lab results and phone records in a single analysis? That’s a much harder problem than converting storage formats.


However, a more fundamental point has been lost. At the core of Ioannidis’ paper is the assertion that **the proportion of true hypotheses under investigation matters**. In terms of Bayes’ theorem, the *posterior* probability of a result being correct depends on the *prior* probability of the result being correct. This prior probability is vitally important, and it varies from field to field.

In a field where it is hard to come up with good hypotheses to investigate, most researchers will be testing false hypotheses, and most of their positive results will be coincidences. In another field where people have a good idea what ought to be true before doing an experiment, most researchers will be testing true hypotheses and most positive results will be correct.
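A back-of-the-envelope calculation shows how strongly the prior matters. The sketch below applies Bayes’ theorem to a positive result, assuming every study has 80% power and a 5% false positive rate. Those two numbers and the function name are my assumptions for illustration, not figures from Ioannidis’ paper.

```python
def prob_positive_is_true(prior, power=0.8, alpha=0.05):
    """Posterior probability that a hypothesis is true given a positive result."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.01, 0.1, 0.5, 0.9):
    print(f"prior {prior:4.2f} -> posterior {prob_positive_is_true(prior):.3f}")
```

With these assumed error rates, a field where only 1% of tested hypotheses are true sees most of its positive results turn out false, while a field where half are true sees the large majority of its positive results hold up.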

For example, it’s very difficult to come up with a better cancer treatment. Drugs that kill cancer in a petri dish or in animal models usually don’t work in humans. One reason is that these drugs may cause too much collateral damage to healthy tissue. Another reason is that treating human tumors is more complex than treating artificially induced tumors in lab animals. Of all cancer treatments that appear to be an improvement in early trials, very few end up receiving regulatory approval and changing clinical practice.

A greater proportion of physics hypotheses are correct because physics has powerful theories to guide the selection of experiments. Experimental physics often succeeds because it has good support from theoretical physics. Cancer research is more empirical because there is little reliable predictive theory. This means that a published result in physics is more likely to be true than a published result in oncology.

Whether “most” published results are false depends on context. The proportion of false results varies across fields. It is high in some areas and low in others.

… a non-informative prior is a placeholder: you can use the non-informative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. …

At first this may sound like tweaking your analysis until you get the conclusion you want. It’s like the old joke about consultants: the client asks what 2+2 equals and the consultant counters by asking the client what he wants it to equal. But that’s not what Andrew is recommending.

A prior distribution cannot strictly be non-informative, but there are common intuitive notions of what it means to be non-informative. It may be helpful to substitute “convenient” or “innocuous” for “non-informative.” My take on Andrew’s advice is something like this.

Start with a prior distribution that’s easy to use and that nobody is going to give you grief for using. Maybe the prior doesn’t make much difference. But if your convenient/innocuous prior leads to too vague a conclusion, go back and use a more realistic prior, one that requires more effort or risks more criticism.

It’s odd that realistic priors can be more controversial than unrealistic priors, but that’s been my experience. It’s OK to be unrealistic as long as you’re conventional.

***

Statistics seems to be a difficult subject for mathematicians, perhaps because its elusive and wide-ranging character mitigates against the traditional theorem-proof method of presentation. It may come as some comfort then that statistics is also a difficult subject for statisticians.


Although such derivations are attractive, they don’t apply that often, and they’re suspect when they do apply. There’s often some effect that keeps the prerequisite conditions from being satisfied in practice, so the derivation doesn’t lead to the right result.

The Poisson may be the best example of this. It’s easy to argue that certain count data have a Poisson distribution, and yet empirically the Poisson doesn’t fit so well because, for example, you have a mixture of two populations with different rates rather than one homogeneous population. (Sums of independent Poisson random variables have a Poisson distribution. Mixtures of Poisson distributions don’t.)
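One symptom of a mixture is overdispersion: a Poisson distribution has variance equal to its mean, but a mixture of two Poissons with different rates has variance larger than its mean. The sketch below checks this numerically for a hypothetical 50/50 mixture of rates 2 and 10, compared with a single Poisson having the same mean. The rates, weights, and function names are made up for the example.

```python
import math

def poisson_pmf(k, rate):
    # Computed on the log scale so large k does not overflow the factorial.
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))

def mean_and_variance(pmf, kmax=200):
    """Mean and variance of a count distribution, truncating the sum at kmax."""
    mean = sum(k * pmf(k) for k in range(kmax))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(kmax))
    return mean, var

def single(k):
    return poisson_pmf(k, 6.0)

def mixture(k):
    return 0.5 * poisson_pmf(k, 2.0) + 0.5 * poisson_pmf(k, 10.0)

print("single Poisson (mean, variance):", mean_and_variance(single))
print("mixture        (mean, variance):", mean_and_variance(mixture))
```

Both distributions have mean 6, but the mixture has a much larger variance, so a Poisson fit to the pooled counts would be too narrow.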

The best scenario is when a theoretical derivation agrees with empirical analysis. Theory suggests the distribution should be X, and our analysis confirms that. Hurray! The theoretical and empirical strengthen each other’s claims.

Theoretical derivations can be useful even when they disagree with empirical analysis. The theoretical distribution forms a sort of baseline, and you can focus on how the data deviate from that baseline.


In God we trust, all others bring data. — William Edwards Deming

The footnote to the quote is better than the quote:

On the Web, this quote has been widely attributed to both Deming and Robert W. Hayden; however Professor Hayden told us that he can claim no credit for this quote, and ironically *we could find no “data” confirming that Deming actually said this.*

Emphasis added.

The fact that so many people attributed the quote to Deming *is* evidence that Deming in fact said it. It’s not conclusive: popular attributions can certainly be wrong. But it is evidence.

Another piece of evidence for the authenticity of the quote is the slightly awkward phrasing “all others bring data.” The quote is often stated in the form “all others must bring data.” The latter is better, which lends credibility to the former: a plausible explanation for why the more awkward version survives would be that it is what someone, maybe Deming, actually said.

The inconclusive evidence in support of Deming being the source of the quote is actually representative of the kind of data people are likely to bring someone like Deming.


Perl has the slogan “There’s more than one way to do it,” abbreviated TMTOWTDI and pronounced “tim toady.” Perl prides itself on variety.

Python takes the opposite approach. The Zen of Python says “There should be one — and preferably only one — obvious way to do it.” Python prides itself on consistency.

Frequentist statistics has a variety of approaches and criteria for various problems. Bayesian critics call this “adhockery.”

Bayesian statistics has one way to do everything: write down a likelihood function and prior distribution, then add data and compute a posterior distribution. This is sometimes called “turning the Bayesian crank.”

But the title was actually “Statistics for People Who (*Think They*) Hate Statistics” which is far less interesting.