There’s a theorem in statistics that says

E(X̄) = E(X)

where X̄ is the average of some number of independent samples of a random variable X.
You could read this aloud as “the mean of the mean is the mean.” More explicitly, it says that the expected value of the average of some number of samples from some distribution is equal to the expected value of the distribution itself. The shorter reading is confusing since “mean” refers to three different things in the same sentence. In reverse order, these are:
- The mean of the distribution, defined by an integral.
- The sample mean, calculated by averaging samples from the distribution.
- The mean of the sample mean as a random variable.
The hypothesis of this theorem is that the underlying distribution has a mean. Let’s see where things break down if the distribution does not have a mean.
It’s tempting to say that the Cauchy distribution has mean 0. Or some might want to say that the mean is infinite. But if we take any value to be the mean of a Cauchy distribution — 0, ∞, 42, etc. — then the theorem above would be false. The mean of n samples from a Cauchy has the same distribution as the original Cauchy! The variability does not decrease with n, as it would with samples from a normal, for example. The sample mean doesn’t converge to any value as n increases. It just keeps wandering around with the same distribution, no matter how large the sample. That’s because the mean of the Cauchy distribution simply doesn’t exist.
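A quick simulation makes this vivid. The sketch below (NumPy; the seed, sample sizes, and replication count are arbitrary) compares the spread of sample means from a Cauchy and from a normal distribution. It uses the interquartile range as the measure of spread, since the Cauchy has no variance either:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
reps = 5_000  # number of sample means to draw per sample size

for n in (10, 1000):
    # Each row is one sample of size n; average across each row.
    cauchy_means = rng.standard_cauchy((reps, n)).mean(axis=1)
    normal_means = rng.standard_normal((reps, n)).mean(axis=1)

    # Interquartile range: a robust measure of spread.
    q75, q25 = np.percentile(cauchy_means, [75, 25])
    cauchy_iqr = q75 - q25
    q75, q25 = np.percentile(normal_means, [75, 25])
    normal_iqr = q75 - q25
    print(f"n={n:5d}  Cauchy IQR={cauchy_iqr:.2f}  normal IQR={normal_iqr:.4f}")
```

The normal IQR shrinks roughly like 1/√n, while the Cauchy IQR stays near 2, the IQR of the original standard Cauchy, no matter how large n is.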
Russ Roberts had this to say about the proposal to replace the calculus requirement with a statistics requirement for students.
Statistics is in many ways much more useful for most students than calculus. The problem is, to teach it well is extraordinarily difficult. It’s very easy to teach a horrible statistics class where you spit back the definitions of mean and median. But you become dangerous because you think you know something about data when in fact it’s kind of subtle.
A little knowledge is a dangerous thing, more so for statistics than calculus.
This reminds me of a quote by Stephen Senn:
Statistics: A subject which most statisticians find difficult but in which nearly all physicians are expert.
Related: Elementary statistics book recommendation
David Hogg calls conventional statistical notation a “nomenclatural abomination”:
The terminology used throughout this document enormously overloads the symbol p(). That is, we are using, in each line of this discussion, the function p() to mean something different; its meaning is set by the letters used in its arguments. That is a nomenclatural abomination. I apologize, and encourage my readers to do things that aren’t so ambiguous (like maybe add informative subscripts), but it is so standard in our business that I won’t change (for now).
I found this terribly confusing when I started doing statistics. The meaning is not explicit in the notation but implicit in the conventions surrounding its use, conventions that were foreign to me since I was trained in mathematics and came to statistics later. When I would use letters like f and g for functions, collaborators would say “I don’t know what you’re talking about.” Neither did I understand what they were talking about, since they used one letter for everything.
Why would anyone care about what the weather was predicted to be once you know what the weather actually was? Because people make decisions based in part on weather predictions, not just weather. Eric Floehr of ForecastWatch told me that people are starting to realize this and are increasingly interested in his historical prediction data.
This morning I thought about what Eric said when I saw a little snow. Last Tuesday was predicted to see ice and schools all over the Houston area closed. As it turned out, there was only a tiny amount of ice and the streets were clear. This morning there actually is snow and ice in the area, though not much, and the schools are all open. (There’s snow out in Cypress where I live, but I don’t think there is in Houston proper.)
Aftermath of last Tuesday’s storm
I have a quibble with the following paragraph from Introducing Windows Azure for IT Professionals:
The problem with big data is that it’s difficult to analyze it when the data is stored in many different ways. How do you analyze data that is distributed across relational database management systems (RDBMS), XML flat-file databases, text-based log files, and binary format storage systems?
If data are in disparate file formats, that’s a pain. And from an IT perspective that may be as far as the difficulty goes. But why would data be in multiple formats? Because it’s different kinds of data! That’s the bigger difficulty.
It’s conceivable, for example, that a scientific study would collect the exact same kinds of data at two locations, under as similar conditions as possible, but one site put their data in a relational database and the other put it in XML files. More likely the differences go deeper. Maybe you have lab results for patients stored in a relational database and their phone records stored in flat files. How do you meaningfully combine lab results and phone records in a single analysis? That’s a much harder problem than converting storage formats.
John Ioannidis stirred up a healthy debate when he published Why Most Published Research Findings Are False. Unfortunately, most of the discussion has been over whether the word “most” is correct, i.e. whether the proportion of false results is more or less than 50 percent. At least there is more awareness that some published results are false and that it would be good to have some estimate of the proportion.
However, a more fundamental point has been lost. At the core of Ioannidis’ paper is the assertion that the proportion of true hypotheses under investigation matters. In terms of Bayes’ theorem, the posterior probability of a result being correct depends on the prior probability of the result being correct. This prior probability is vitally important, and it varies from field to field.
In a field where it is hard to come up with good hypotheses to investigate, most researchers will be testing false hypotheses, and most of their positive results will be coincidences. In another field where people have a good idea what ought to be true before doing an experiment, most researchers will be testing true hypotheses and most positive results will be correct.
For example, it’s very difficult to come up with a better cancer treatment. Drugs that kill cancer in a petri dish or in animal models usually don’t work in humans. One reason is that these drugs may cause too much collateral damage to healthy tissue. Another reason is that treating human tumors is more complex than treating artificially induced tumors in lab animals. Of all cancer treatments that appear to be an improvement in early trials, very few end up receiving regulatory approval and changing clinical practice.
A greater proportion of physics hypotheses are correct because physics has powerful theories to guide the selection of experiments. Experimental physics often succeeds because it has good support from theoretical physics. Cancer research is more empirical because there is little reliable predictive theory. This means that a published result in physics is more likely to be true than a published result in oncology.
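The core of the argument reduces to a two-line Bayes’ theorem calculation for the probability that a positive finding is true. In the sketch below, the power and false positive rate are conventional illustrative values, not estimates for any particular field:

```python
def ppv(prior, power=0.8, alpha=0.05):
    """Probability a positive result is true, by Bayes' theorem.

    prior: proportion of tested hypotheses that are actually true
    power: P(positive result | hypothesis true)
    alpha: P(positive result | hypothesis false)
    """
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# A field where good hypotheses are scarce vs. one guided by strong theory.
hard_field = ppv(prior=0.1)  # 0.64: over a third of positives are false
easy_field = ppv(prior=0.5)  # about 0.94: most positives are correct
```

Same test, same data quality; only the prior proportion of true hypotheses differs, and the reliability of a positive result changes dramatically.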
Whether “most” published results are false depends on context. The proportion of false results varies across fields. It is high in some areas and low in others.
Many people have drawn Venn diagrams to locate machine learning and related ideas in the intellectual landscape. Drew Conway’s diagram may have been the first. It has at least been frequently referenced.
By this classification, Hector Cuesta’s new book Practical Data Analysis is located toward the “hacking skills” corner of the diagram. No single book can cover everything, and this one emphasizes practical software knowledge more than mathematical theory or details of a particular problem domain.
The biggest strength of the book may be that it brings together in one place information on tools that are used together but whose documentation is scattered. The book is a great source of sample code. The source code is available on GitHub, though it’s more understandable in the context of the book.
Much of the book uses Python and related modules and tools including:
It also uses D3.js (with JSON, CSS, HTML, …), MongoDB (with MapReduce, Mongo Shell, PyMongo, …), and miscellaneous other tools and APIs.
There’s a lot of material here in 360 pages, making it a useful reference.
Andrew Gelman has some interesting comments on non-informative priors this morning. Rather than thinking of the prior as a static thing, think of it as a way to prime the pump.
… a non-informative prior is a placeholder: you can use the non-informative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. …
At first this may sound like tweaking your analysis until you get the conclusion you want. It’s like the old joke about consultants: the client asks what 2+2 equals and the consultant counters by asking the client what he wants it to equal. But that’s not what Andrew is recommending.
A prior distribution cannot strictly be non-informative, but there are common intuitive notions of what it means to be non-informative. It may be helpful to substitute “convenient” or “innocuous” for “non-informative.” My take on Andrew’s advice is something like this.
Start with a prior distribution that’s easy to use and that nobody is going to give you grief for using. Maybe the prior doesn’t make much difference. But if your convenient/innocuous prior leads to too vague a conclusion, go back and use a more realistic prior, one that requires more effort or risks more criticism.
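Here is a minimal sketch of that workflow, using a conjugate beta-binomial model with made-up numbers. The flat prior plays the role of the convenient placeholder; the informative prior is the one you go back and justify:

```python
# Conjugate beta-binomial: prior Beta(a, b), data s successes in n trials,
# posterior Beta(a + s, b + n - s). All numbers here are illustrative.
def posterior(a, b, s, n):
    a_post, b_post = a + s, b + n - s
    mean = a_post / (a_post + b_post)
    # Variance of a Beta distribution with the posterior parameters
    var = a_post * b_post / ((a_post + b_post)**2 * (a_post + b_post + 1))
    return mean, var

s, n = 3, 10  # a small data set

# Start with the convenient flat prior Beta(1, 1).
flat_mean, flat_var = posterior(1, 1, s, n)

# If the posterior is too vague, go back and encode real prior knowledge,
# e.g. earlier studies suggesting a rate near 0.5.
informed_mean, informed_var = posterior(10, 10, s, n)
```

The informative prior yields a tighter posterior, at the cost of having to defend where Beta(10, 10) came from.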
It’s odd that realistic priors can be more controversial than unrealistic priors, but that’s been my experience. It’s OK to be unrealistic as long as you’re conventional.
Consulting in Bayesian analysis
From Controversies in the Foundations of Statistics by Bradley Efron:
Statistics seems to be a difficult subject for mathematicians, perhaps because its elusive and wide-ranging character mitigates against the traditional theorem-proof method of presentation. It may come as some comfort then that statistics is also a difficult subject for statisticians.
Sometimes you can derive a probability distribution from a list of properties it must have. For example, there are several properties that lead inevitably to the normal distribution or the Poisson distribution.
Although such derivations are attractive, they don’t apply that often, and they’re suspect when they do apply. There’s often some effect that keeps the prerequisite conditions from being satisfied in practice, so the derivation doesn’t lead to the right result.
The Poisson may be the best example of this. It’s easy to argue that certain count data have a Poisson distribution, and yet empirically the Poisson doesn’t fit so well because, for example, you have a mixture of two populations with different rates rather than one homogeneous population. (Averages of Poisson distributions have a Poisson distribution. Mixtures of Poisson distributions don’t.)
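A small simulation (arbitrary rates and seed) shows the overdispersion: for a homogeneous Poisson sample the variance-to-mean ratio is near 1, but for a mixture of two rates it is well above 1:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# Homogeneous population: Poisson counts have variance equal to the mean.
pure = rng.poisson(6.0, n)

# Mixture of two subpopulations with different rates (average rate also 6).
rates = rng.choice([2.0, 10.0], size=n)
mixed = rng.poisson(rates)

pure_ratio = pure.var() / pure.mean()    # close to 1
mixed_ratio = mixed.var() / mixed.mean() # well above 1: overdispersion
```

A goodness-of-fit test on the mixed counts would reject the Poisson model even though each subpopulation is exactly Poisson.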
The best scenario is when a theoretical derivation agrees with empirical analysis. Theory suggests the distribution should be X, and our analysis confirms that. Hurray! The theoretical and empirical strengthen each other’s claims.
Theoretical derivations can be useful even when they disagree with empirical analysis. The theoretical distribution forms a sort of baseline, and you can focus on how the data deviate from that baseline.
The preface to Elements of Statistical Learning opens with the popular quote
In God we trust, all others bring data. — William Edwards Deming
The footnote to the quote is better than the quote:
On the Web, this quote has been widely attributed to both Deming and Robert W. Hayden; however Professor Hayden told us that he can claim no credit for this quote, and ironically we could find no “data” confirming that Deming actually said this.
The fact that so many people attributed the quote to Deming is evidence that Deming in fact said it. It’s not conclusive: popular attributions can certainly be wrong. But it is evidence.
Another piece of evidence for the authenticity of the quote is the slightly awkward phrasing “all others bring data.” The quote is often stated in the form “all others must bring data.” The latter is better, which lends credibility to the former: a plausible explanation for why the more awkward version survives would be that it is what someone, maybe Deming, actually said.
The inconclusive evidence in support of Deming being the source of the quote is actually representative of the kind of data people are likely to bring someone like Deming.
Bayesian statistics is to Python as frequentist statistics is to Perl.
Perl has the slogan “There’s more than one way to do it,” abbreviated TMTOWTDI and pronounced “tim toady.” Perl prides itself on variety.
Python takes the opposite approach. The Zen of Python says “There should be one — and preferably only one — obvious way to do it.” Python prides itself on consistency.
Frequentist statistics has a variety of approaches and criteria for various problems. Bayesian critics call this “adhockery.”
Bayesian statistics has one way to do everything: write down a likelihood function and prior distribution, then add data and compute a posterior distribution. This is sometimes called “turning the Bayesian crank.”
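Here is the crank turned on a toy problem, estimating a coin’s bias θ by grid approximation. The flat prior and the data (7 heads in 10 tosses) are illustrative choices:

```python
import numpy as np

# Grid of parameter values for the coin's bias theta.
theta = np.linspace(0, 1, 1001)
prior = np.ones_like(theta)              # flat prior (illustrative)
likelihood = theta**7 * (1 - theta)**3   # binomial likelihood: 7 of 10 heads

# Turn the crank: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()             # normalize (discrete approximation)

posterior_mean = (theta * posterior).sum()
# The conjugate analysis gives Beta(8, 4), whose mean is 8/12 = 2/3,
# so the grid estimate should land very close to 2/3.
```

The same three steps — prior, likelihood, normalize — apply whether the model is a coin flip or something far more elaborate; only the computation gets harder.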
I caught a glimpse of a book in a library this morning and thought the title was “Statistics for People Who Think.” Sounds like a great book!
But the title was actually “Statistics for People Who (Think They) Hate Statistics” which is far less interesting.
From John Tukey’s Sunset Salvo:
Our suffering sinuses are now frequently relieved by antihistamines. Our suffering philosophy — whether implicit or explicit — of data analysis, or of statistics, or of science and technology needs to be far more frequently relieved by antihubrisines.
To the Greeks hubris meant the kind of pride that would be punished by the gods. To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others.
Tukey then lists several antihubrisines. The first is this:
The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
If you compute the standard deviation of a data set by directly implementing the definition, you’ll need to pass through the data twice: once to find the mean, then a second time to accumulate the squared differences from the mean. But there is an equivalent algorithm that requires only one pass and that is more accurate than the most direct method. You can find the code for implementing it here.
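The one-pass method is commonly known as Welford’s algorithm. Here is a sketch of it (an illustration, not the code linked above):

```python
import math

def one_pass_std(xs):
    """Welford's one-pass algorithm for the sample standard deviation."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the old mean and the new mean
    return math.sqrt(m2 / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]
s = one_pass_std(data)  # matches the two-pass definition
```

The update `delta * (x - mean)` mixes the deviation from the old mean with the deviation from the updated mean, which is what makes the accumulation both single-pass and numerically stable.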
You can also compute the higher sample moments in one pass. I’ve extended the previous code to compute skewness and kurtosis in one pass as well.
The new code also lets you split your data, say to process it in parallel on different threads, and then combine the statistics, in the spirit of map-reduce.
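Combining summaries works because the count, mean, and sum of squared deviations of two partitions can be merged exactly. A sketch of the merge step (an illustration, not the posted code):

```python
def moments(xs):
    """Return (n, mean, M2), where M2 is the sum of squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean)**2 for x in xs)
    return n, mean, m2

def combine(a, b):
    """Merge the (n, mean, M2) summaries of two data partitions."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta**2 * na * nb / n
    return n, mean, m2

data = [2, 4, 4, 4, 5, 5, 7, 9]
whole = moments(data)
merged = combine(moments(data[:3]), moments(data[3:]))
# merged agrees with whole, so partitions can be processed independently
```

Since `combine` is associative, partitions can be summarized on separate threads or machines and then reduced pairwise, exactly the map-reduce pattern.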
Lastly, I’ve posted analogous code for simple linear regression.