From the category archives:

Statistics

The universal solvent of statistics

by John on February 1, 2012

Andrew Gelman just posted an interesting article on the philosophy of Bayesian statistics. Here’s my favorite passage.

This reminds me of a standard question that Don Rubin … asks in virtually any situation: “What would you do if you had all the data?” For me, that “what would you do” question is one of the universal solvents of statistics.

Emphasis added.

I had not heard Don Rubin’s question before, but I think I’ll be asking it often. It reminds me of Alice’s famous dialog with the Cheshire Cat:

“Would you tell me, please, which way I ought to go from here?”

“That depends a good deal on where you want to get to,” said the Cat.

“I don’t much care where–” said Alice.

“Then it doesn’t matter which way you go,” said the Cat.

Cheshire Cat

Related post: Irrelevant uncertainty

{ 6 comments }

Interpreting statistics

by John on January 13, 2012

From Matt Briggs:

I challenge you to find me in any published statistical analysis, outside of an introductory textbook, a confidence interval given the correct interpretation. If you can find even one instance where the [frequentist] confidence interval is not interpreted as a [Bayesian] credible interval, then I will eat your hat.

Most statistical analysis is carried out by people who do not interpret their results correctly. They carry out frequentist procedures and then give the results a Bayesian interpretation. This is not simply a violation of an academic taboo. It means that people generally underestimate the uncertainty in their conclusions.

Related posts:

Most published research results are false
Classical statistics in a nutshell

{ 16 comments }

R in Action

by John on January 2, 2012

No Starch Press sent me a copy of The Art of R Programming last Fall and I wrote a review of it here. Then a couple weeks ago, Manning sent me a copy of R in Action. Here I’ll give a quick comparison of the two books, then focus specifically on R in Action.

[click to continue...]

{ 3 comments }

Type R error

by John on December 9, 2011

Andrew Gelman added a couple more types of error to the standard repertoire of type I and type II errors. He suggests using type S error to describe a result that gets a sign backward, reporting that A is bigger than B when in fact B is bigger than A. He also suggests using type M error for results that get the magnitude of a result wrong.

Maybe we could add to this list type R for reification error: treating an abstraction as if it were real, forgetting that a model is a model and stretching it beyond its limits.

Related links:

Just an approximation
Floating point error is the least of my worries
The Pretense of Knowledge

{ 7 comments }

Amputating reality

by John on December 1, 2011

“If you just rely on one model, you tend to amputate reality to make it fit your model.” — David Brooks

Related post: Advantages of crude models

{ 6 comments }

Bad logic, but good statistics

by John on November 28, 2011

Ad hominem arguments are bad logic, but good (Bayesian) statistics. A statement isn’t necessarily false because it comes from an unreliable source, though it is more likely to be false.

Some people are much more likely to know what they’re talking about than others, depending on context. You’re more likely to get good medical advice from a doctor than from an accountant, though the former may be wrong and the latter may be right. (Actors are not likely to know what they’re talking about when giving advice regarding anything but acting, though that doesn’t stop them.)

Ad hominem guesses are a reasonable way to construct a prior, but the prior needs to be updated with data. Given no other data, the doctor is more likely to know medicine than the accountant is. Assuming a priori that both are equally likely to be correct may be “fair,” but it’s not reasonable. However, as you gather data on the accuracy of each, you could change your mind. The posterior distribution could persuade you that you’ve been talking to a quack doctor or an accountant who is unusually knowledgeable of medicine.

Related post: Musicians, drunks, and Oliver Cromwell

{ 5 comments }

Career advice regarding tools

by John on November 21, 2011

J. D. Long wearing a Panama and smoking a Dominican

A few weeks ago, J. D. Long gave some interesting advice in a Google+ discussion. He starts out

Lunch today with an analyst 13 years my junior made me think about things I wish I had known about the technical analytical profession when I was 25. Here’s some things that popped into my head:

The entire list is worth reading, but I want to focus on two things he said about tools.

  • Use tools you don’t have to ask permission to install (i.e. open source).
  • Dependence on tools that are closed license and un-scriptable will limit the scope of problems you can solve. (i.e. Excel) Use them, but build your core skills on more portable & scalable technologies.

I would have disagreed a few years ago, but now I think this is good advice.

In the late 90’s I used mostly Microsoft tools. That was a good time to be a Microsoft developer. Windows was on the rise; Unix and Mac OS were on the ropes. Desktop applications were the norm and were easier to write on Windows. Open source software was hard to install and hard to use. People who used open source software often did so for ideological reasons, not because it made their work easier.

Of course times have changed. Mac recovered from its near death experience. Unix didn’t, but it has been resurrected as Linux. The web made it easier to write cross-platform software. And above all, open source software has matured. The open source community is more positive, focused on promoting good software rather than trying to give some corporation a stick in the eye.

Now the advantages of open source are clearer. There’s not the same hidden cost in frustration that there was a few years ago. Now I would say yes, it’s a great advantage to use tools you can install whenever and wherever you want, without having to go through a purchasing bureaucracy.

It’s interesting that JD equates open source with scriptability. Open source software often is scriptable, not because it’s open source, but because of the Unix aesthetic that pervades the open source community. Closed source software is often not scriptable, not because it’s closed source, but because it is often written for consumers who value usability over composability. Commercial server-side products may be scriptable. If I were to restate JD’s advice on this point, I’d say to keep composability in mind and don’t just think about usability.

I appreciate JD’s attitude toward applications such as Excel. He’s not saying you should never defile your conscience by opening Excel. Some tasks are incredibly easy in Excel. The danger comes from pushing the tool into territory where other tools are better. There are still some in the open source community who believe that opening Excel is a sin, but I’m much more in agreement with the people who say, for example, that Excel isn’t the best tool for statistical analysis.

Portability is funny. In the early days of computing, there were no dominant players, and portability was important (and difficult). Then for a while, portability didn’t matter if you were content with only running on the 95% of the world’s computers that ran Windows. Now portability is important again. Windows still has a huge market share on the desktop, but the desktop itself is losing market share.

And portability matters for more than consumer operating systems. JD mentions portability and scalability in one breath. You may want to move code between operating systems to scale up (e.g. to run on a cluster) or to scale down (e.g. to run on a mobile device).

There’s also the aspect of career portability. You want to master tools that you can take with you from job to job. I would be leery of building a career around a small company’s proprietary tools. If I were in that situation, I’d learn something else on the side that’s more portable.

In closing, I’ll give the rest of JD’s career advice without commentary. These points could make interesting fodder for future blog posts.

  • Be a profit center, not a cost center.
  • Use tools you don’t have to ask permission to install (i.e. open source).
  • Dependence on tools that are closed license and un-scriptable will limit the scope of problems you can solve. (i.e. Excel) Use them, but build your core skills on more portable & scalable technologies.
  • Learn basic database tools.
  • Learn a programming language.
  • Your internal job description may say, “Analyst” but get something else on your business cards. Analyst is so vague as to be meaningless. My external title is currently “Sr. Risk Economist.” I like the term “Data Scientist” for now. I expect that term will be meaningless in 5 years.
  • Large organizations do not properly appreciate agile and smart analytic types. Time at large firms should be seen as subsidized learning. Learn lots, but get out.
  • Ensure you can explain any of your projects to your wife or non-technical friends. It’s good practice for board meetings later in your career.
  • Be sure you know the handful of things that you can do better than most anyone else. Add something to that list every year. Make sure you can explain these things to non techies.
  • Be a profit center, not a cost center. At least be as close to the profit center as possible. The chief analyst for the sales SVP is closer to the profit center than an IT analyst supporting billing operations.
  • Get really good at asking questions so you understand problems before you start solving them.
  • Yes, that bit about being a profit center not a cost center is in there twice. It should probably be in there 5 times.

{ 13 comments }

A Bayesian view of Amazon Resellers

by John on September 27, 2011

I was buying a used book through Amazon this evening. Three resellers offered the book at essentially the same price. Here were their ratings:

  • 94% positive out of 85,193 reviews
  • 98% positive out of 20,785 reviews
  • 99% positive out of 840 reviews

Which reseller is likely to give the best service? Before you assume it’s the seller with the highest percentage of positive reviews, consider the following simpler scenario.

Suppose one reseller has 90 positive reviews out of 100. The other reseller has two reviews, both positive. You could say one has 90% approval and the other has 100% approval, so the one with 100% approval is better. But this doesn’t take into consideration that there’s much more data on one than the other. You can have some confidence that 90% of the first reseller’s customers are satisfied. You don’t really know about the other because you have only two data points.

A Bayesian view of the problem naturally incorporates the amount of data as well as its average. Let θA be the probability of a customer being satisfied with company A’s service. Let θB be the corresponding probability for company B. Suppose before we see any reviews we think all ratings are equally likely. That is, we start with a uniform prior distribution θA and θB. A uniform distribution is the same as a beta(1, 1) distribution.

After observing 90 positive reviews and 10 negative reviews, our posterior estimate on θA has a beta(91, 11) distribution. After observing 2 positive reviews, our posterior estimate on θB has a beta(3, 1) distribution. The probability that a sample from θA is bigger than a sample from θB is 0.713. That is, there’s a good chance you’d get better service from the reseller with the lower average approval rating.

beta(91,11) versus beta(3,1)

Now back to our original question. Which of the three resellers is most likely to satisfy a customer?

Assume a uniform prior on θX, θY, and θZ, the probabilities of good service for each reseller. The posterior distributions on these variables have distributions beta(80082, 5113), beta(20370, 417), and beta(833, 9).

These beta distributions have such large parameters that we can approximate them by normal distributions with the same mean and variance. (A beta(a, b) random variable has mean a/(a+b) and variance ab/((a+b)2(a+b+1)).) The variable with the most variance, θZ, has standard deviation 0.003. The other variables have even smaller standard deviation. So the three distributions are highly concentrated at their mean values with practically non-overlapping support. And so a sample from θX or θY is unlikely to be higher than a sample from θZ.

In general, going by averages alone works when you have a lot of customer reviews. But when you have a small number of reviews, going by averages alone could be misleading.

Thanks to Charles McCreary for suggesting the xkcd comic.

Related links:

Inequality Calculator
Calculating random inequalities
Exact calculation of beta inequalities

{ 28 comments }

Big data and humility

by John on September 22, 2011

One of the challenges with big data is to properly estimate your uncertainty. Often “big data” means a huge amount of data that isn’t exactly what you want.

As an example, suppose you have data on how a drug acts in monkeys and you want to infer how the drug acts in humans. There are two sources of uncertainty:

  1. How well do we really know the effects in monkeys?
  2. How well do these results translate to humans?

The former can be quantified, and so we focus on that, but the latter may be more important. There’s a strong temptation to believe that big data regarding one situation tells us more than it does about an analogous situation.

I’ve seen people reason as follows. We don’t really know how results translate from monkeys to humans (or from one chemical to a related chemical, from one market to an analogous market, etc.). We have a moderate amount of data on monkeys and we’ll decimate it and use that as if it were human data, say in order to come up with a prior distribution.

Down-weighting by a fixed ratio, such as 10 to 1, is misleading. If you had 10x as much data on monkeys, would you as much about effects in humans as if the original smaller data set were collected on people? What if you suddenly had “big data” involving every monkey on the planet. More data on monkeys drives down your uncertainty about monkeys, but does nothing to lower your uncertainty regarding how monkey results translate to humans.

At some point, more data about analogous cases reaches diminishing return and you can’t go further without data about what you really want to know. Collecting more and more data about how a drug works in adults won’t help you learn how it works in children. At some point, you need to treat children. Terabytes of analogous data may not be as valuable as kilobytes of highly relevant data.

Related posts:

Big data is not enough
Does gaining weight make you taller?

{ 10 comments }

Bayes isn’t magic

by John on September 6, 2011

If a study is completely infeasible using traditional statistical methods, Bayesian methods are probably not going to rescue it. Bayesian methods can’t squeeze blood out of a turnip.

The Bayesian approach to statistics has real advantages, but sometimes these advantages are oversold. Bayesian statistics is still statistics, not magic.

{ 7 comments }

Markov chains don’t converge

by John on August 10, 2011

I often hear people often say they’re using a burn-in period in MCMC to run a Markov chain until it converges. But Markov chains don’t converge, at least not the Markov chains that are useful in MCMC. These Markov chains wander around forever exploring the domain they’re sampling from. Any point that makes a “bad” starting point for MCMC is a point you might reach by burn-in.

Not only that, Markov chains can’t remember how they got where they are. That’s their defining property. So if your burn-in period ends at a point x, the chain will perform exactly as if you had simply started at x.

When someone says a Markov chain has converged, they mean that the chain has entered a high-probability region. I’ll explain in a moment why that’s desirable. But the belief/hope is that a burn-in period will put a Markov chain in a high-probability region. And it probably will, but there are a couple reasons why this isn’t necessarily the best thing to do.

  1. Burn-in may be ineffective. You could use use optimization to be certain that you’re starting in such a region. Burn-in offers no such assurance. See Burn-in and other MCMC folklore.
  2. Burn-in may be inefficient. Casting aside worries that burn-in may not do what you want, it can be an inefficient way to find a high-probability region. MCMC isn’t designed to optimize a density function but rather to sample from it.

Why use burn-in? MCMC practitioners often don’t know how to do optimization, and in any case the corresponding optimization problem may be difficult. Also, if you’ve got the MCMC code in hand, it’s convenient to use it to find a starting point as well as for sampling.

So why does it matter whether you start your Markov chain in a high-probability region? In the limit, it doesn’t matter. But since you’re averaging some function of some finite number of samples, your average will be a better approximation if you start at a typical point in the density you’re sampling. If you start at a low probability location, your average may be more biased.

Samples from Markov chains don’t converge, but averages of functions applied to these samples may converge. When someone says a Markov chain has converged, they mean they’re at a point where the average of a finite number of function applications will be a better approximation of the thing they want to compute than if they’d started at a low probability point.

It’s not just a matter of imprecise language when people say a Markov chain has converged. It sometimes betrays a misunderstanding of how Markov chains work.

{ 13 comments }

The single big jump principle

by John on August 9, 2011

Suppose you’re playing a game where you take 10 steps of a random size. Here are two variations on the game. Which will give you a better chance of ending up far from where you started?

  1. You take your steps one at a time, starting each new step from where the last one took you.
  2. You return to the starting point after each step. After 10 turns, you go back to where your largest step took you.

Assume all steps are in the same direction. Then you’re always better off under the first rule. The sum of all your steps will be bigger than the maximum since the maximum is one of the terms in the sum.

However, depending on the probability distribution on your random steps, you may do almost as well under the second rule, taking the maximum of your steps.

If the distribution on your step size is heavy-tailed (technically, subexponential) then the maximum and the sum have the same asymptotic distribution. That is, as x goes to infinity,

Pr( X1 + X2 + … + Xn > x ) ~ Pr( max(X1, X2, … , Xn) > x )

As x gets larger, the relative advantage of taking the sum rather than the maximum goes to zero. This is known as “the principle of the single big jump.” If you’re looking to make a lot of progress, adding up typical jumps isn’t likely to get you there. You need one big jump. Said another way, your total progress is about as good as the progress on your best shot.

Before you draw a life lesson from this, note that this is only true for heavy-tailed distributions. It’s not true at all if your jumps have a thin-tailed distribution. But sometimes the payoffs in life’s games are heavy-tailed.

What distributions are subexponential? Any heavy-tailed distribution you’re likely to have heard of: Cauchy, Lévy, Weibull (with shape < 1), log-normal, etc.

To illustrate the single jump principle, let X1 and X2 be standard Cauchy random variables. The difference between Pr( X1 + X2 > x ) and Pr( max(X1 , X2) > x ) becomes impossible to see for x much bigger than 3.

Here’s a plot of the ratio of the sum to the maximum.

The ratio tends to 1 in the limit as predicted by the single big jump principle.

I chose to use a Cauchy distribution to simplify the calculations. (The sum of two Cauchy random variables is another Cauchy. See distribution relationships.) In this case the maximum is actually larger than the sum because the Cauchy can be positive or negative. But it is still the case that the two tails converge as x increases.

Here’s the Sage notebook that I used to create the graphs.

More on the single big jump here.

{ 10 comments }

Math superhero in training

by John on July 27, 2011

Steve Yegge has a new project. He’s in training to become a math superhero. Or at least a sidekick. He said that math/stat folks superheros and he wants to join them.

In his presentation at OSCON Data 2011 on Monday, Yegge said that all the hard problems require math and statistics. So he’s quitting his job at Google to study math in hopes that he can solve big problems three to five years from now.

His enthusiasm for math is naive and inspiring. I gather from some of the articles he’s written that he’s an original thinker and a hard worker. It’ll be interesting to see what he does.

Update: As pointed out in the comments, Steve Yegge clarified on his blog that he is not quitting Google, only his project at Google.

{ 11 comments }

How to fit an elephant

by John on June 21, 2011

John von Neumann famously said

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

By this he meant that one should not be impressed when a complex model fits a data set well. With enough parameters, you can fit any data set.

It turns out you can literally fit an elephant with four parameters if you allow the parameters to be complex numbers.

I mentioned von Neumann’s quote on StatFact last week and Piotr Zolnierczuk replied with reference to a paper explaining how to fit an elephant:

“Drawing an elephant with four complex parameters” by Jurgen Mayer, Khaled Khairy, and Jonathon Howard,  Am. J. Phys. 78, 648 (2010), DOI:10.1119/1.3254017.

Piotr also sent me the following Python code he’d written to implement the method in the paper. This code produced the image above.

"""
Author: Piotr A. Zolnierczuk (zolnierczukp at ornl dot gov)

Based on a paper by:
Drawing an elephant with four complex parameters
Jurgen Mayer, Khaled Khairy, and Jonathon Howard,
Am. J. Phys. 78, 648 (2010), DOI:10.1119/1.3254017
"""
import numpy as np
import pylab

# elephant parameters
p1, p2, p3, p4 = (50 - 30j, 18 +  8j, 12 - 10j, -14 - 60j )
p5 = 40 + 20j # eyepiece

def fourier(t, C):
    f = np.zeros(t.shape)
    A, B = C.real, C.imag
    for k in range(len(C)):
        f = f + A[k]*np.cos(k*t) + B[k]*np.sin(k*t)
    return f

def elephant(t, p1, p2, p3, p4, p5):
    npar = 6
    Cx = np.zeros((npar,), dtype='complex')
    Cy = np.zeros((npar,), dtype='complex')

    Cx[1] = p1.real*1j
    Cx[2] = p2.real*1j
    Cx[3] = p3.real
    Cx[5] = p4.real

    Cy[1] = p4.imag + p1.imag*1j
    Cy[2] = p2.imag*1j
    Cy[3] = p3.imag*1j

    x = np.append(fourier(t,Cx), [-p5.imag])
    y = np.append(fourier(t,Cy), [p5.imag])

    return x,y

x, y = elephant(np.linspace(0,2*np.pi,1000), p1, p2, p3, p4, p5)
pylab.plot(y,-x,'.')
pylab.show()

Related posts:

Advantages of crude models
Occam’s razor and Bayes theorem

{ 8 comments }

Impure math

by John on June 15, 2011

When Samuel Hansen said in his interview “You’re not a pure mathematician” I agreed without thinking, but later the statement bothered me a little. I know what he meant: considering the two categories of pure math and applied math, you’d put yourself in the latter category. Which is true.

But the term “pure” math can be misleading, as if everyone else does impure math. Applied math is not an alternative to theoretical math. Applied mathematicians prove theorems etc. We work on applications in addition to doing what is expected of pure mathematicians. The difference between pure and applied math is motivation, not content. Applied math is motivated by direct application to non-mathematical problems. Pure math seeks to advance math for its own sake. Both are important.

Statistics uses the terms “theoretical” and “applied” rather than “pure” and “applied.” Math doesn’t use “theoretical” as an antithesis to “applied” because applied math is theoretical. But unlike math, being “applied” in statistics does mean you’re often (too often?) excused from proving theorems. The first time I was a coauthor on a statistics paper I was surprised to find out you could publish with just simulation results and no theorems. This happens in applied math as well, but not nearly as often as it does in applied statistics.

On the other hand, when I hear the term “applied statistics” I want to ask “Is there any other kind?” Statistics is applied (and theoretical!) though some statisticians work more directly on applications than others. As Andrew Gelman quips, the difference between theoretical and applied statisticians is that

The theoretical statistician uses x, the applied statistician uses y (because we reserve x for predictors).

I assume that statement wasn’t meant to be taken literally, but I agree with the sentiment that the distinction between theoretical and applied statistics can be exaggerated. I’d say the same applies to pure and applied math.

{ 9 comments }