What distribution does my data have?

“Which distribution describes my data?” Variations on that question pop up regularly on various online forums. Sometimes the person asking the question is looking for a goodness of fit test but doesn’t know the jargon “goodness of fit.” But more often they have something else in mind. They’re thinking of some list of familiar, named distribution families — normal, gamma, Poisson, etc. — and want to know which distribution from this list best fits their data. So the real question is something like the following:

Which distribution from the well-known families of probability distributions fits my data best?

Statistics classes can give the impression that there is a short list of probability distribution families, say the list in the index of text book for that class, and that something from one of those families will always fit any data set. This impression starts to seem absurd when stated explicitly. It raises two questions.

  1. What exactly is the list of well-known distributions?
  2. Why should a distribution from this list fit your data?

As for the first question, there is some consensus as to what the well-known distributions are. The distribution families in this diagram would make a good start. But the question of which distributions are “well known” is a sociological question, not a mathematical one. There’s nothing intrinsic to a distribution that makes it well-known. For example, most statisticians would consider the Kumaraswamy distribution obscure and the beta distribution well-known, even though the two are analytically similar.

You could argue that the canonical set of distributions is somewhat natural by a chain of relations. The normal distribution is certainly natural due to the central limit theorem. The chi-squared distribution is natural because the square of a normal random variable has a chi-squared distribution. The F distribution is related to the ratio of chi-squared variables, so perhaps it ought to be included. And so on and so forth. But each link in the chain is a little weaker than the previous. Also, why this chain of relationships and not some other?

Alternatively, you could argue that the distributions that made the canon are there because they have been found useful in practice. And so they have.  But had people been interested in different problems, a somewhat different set of distributions would have been found useful.

Now on to the second question: Why should a famous distribution fit a particular data set?

Suppose a police artist asked a witness which U. S. president a criminal most closely resembled. The witness might respond

Well, she didn’t look much like any of them, but if I have to pick one, I’d pick John Adams.

The U. S. presidents form a convenient set of faces. You can find posters of their faces in many classrooms. The U. S. presidents are historically significant, but a police artist would do better to pick a different set of faces as a first pass in making a sketch.

I’m not saying it is unreasonable to want to fit a famous distribution to your data. Given two distributions that fit the data equally well, go with the more famous distribution. This is a sort of celebrity version of Occam’s razor. It’s convenient to use distributions that other people recognize. Famous distributions often have nice mathematical properties and widely available software implementations. But the list of famous distributions can form a Procrustean bed that we force our data to fit.

The extreme of Procrustean statistics is a list of well-known distributions with only one item: the normal distribution. Researchers often apply a normal distribution where it doesn’t fit at all. More dangerously, experienced statisticians can assume a normal distribution when the lack of fit isn’t obvious. If you implicitly assume a normal distribution, then any data point that doesn’t fit the distribution is an outlier. Throw out the outliers and the normal distribution fits well! Nassim Taleb calls the normal distribution the “Great Intellectual Fraud” in his book The Black Swan because people so often assume the distribution fits when it does not.

Timing software

When measuring how long it takes to execute a program, people often report the best time out of several runs with the same input. That seems odd at first. Why report the best time? Why not report the average time?

Operating systems have more to do than just run your program. The time that elapses between the start and finish of your program may vary because of other demands on the operating system’s resources. By taking the best time, you (presumably) get an idea how much time it takes to run the program itself with minimal overhead from other operating system activity.

The best-time approach is appropriate when comparing the efficiency of two programs: compare the best time for Program A to the best time of Program B. But if you want to estimate how long you can expect wait for Program A itself to run, not comparing the efficiency of Program A to anything else, then averaging the run times is reasonable.

(Everything above assumes constant input. This post is not about analyzing execution times with varying inputs. For example, you might use best-case timing to compare two sorting algorithms by sorting the exact same list several times. The only thing changing from run to run is the state of the operating system, not the input. Of course you might also want to measure the average run time with multiple randomly generated data sets, but that’s a different matter.)

Emacs command to add HTML tags

A while back I asked Jason Fruit how to add HTML tags to text in Emacs, something like the format painter in Microsoft applications. He said he didn’t know of anything and wrote the following macro for me.

(defun tag-word-or-region (tag)
    "Surround current word or region with a given tag."
    (interactive "sEnter tag (without <>): ")
    (let (pos1 pos2 bds start-tag end-tag)
        (setq start-tag (concat "<" tag ">"))
        (setq end-tag (concat "</" tag ">"))
        (if (and transient-mark-mode mark-active)
            (progn
                (goto-char (region-end))
                (insert end-tag)
                (goto-char (region-beginning))
                (insert start-tag))
            (progn
                (setq bds (bounds-of-thing-at-point 'symbol))
                (goto-char (cdr bds))
                (insert end-tag)
                 (goto-char (car bds))
                 (insert start-tag)))))

I added the following line to my .emacs file to bind Jason’s macro to the key sequence C-x t.

(global-set-key "C-xt" 'tag-word-or-region)

To add a tag to a single word, place the cursor before or in the word and execute the command. To tag a block of text, select the text first then execute the command.

Related posts:

Emacs adventures
Giving Emacs another try

The Stone Age didn’t end because we ran out of stones

According to Richard Sears, the world hit peak oil in 1985 in the sense that oil accounted for 50% of world energy in 1985 and the percentage has been declining since then. By that same measure, we hit peak coal in the 1920’s and peak wood in the 1820’s. Sears summarizes

For 200 years we have been systematically de-carbonizing our energy system.

We didn’t stop using wood as our primary energy source because we ran out of trees; we moved on to something better better. Sears believes we are now in the process of transitioning from oil to renewable energy sources, and not because we’re running out of oil. He concludes his presentation by saying

… the Stone Age ended, not because we ran out of stones. It’s ideas, it’s innovation, it’s technology that will end the Age of Oil before we run out of oil.

A tale of two espresso machines

This post tells the story of two espresso machines: one in Los Angeles and one in Brenham, Texas. But it’s more about deciding what you do and do not want to control.

***

In his book Made by Hand, Mark Frauenfelder chronicles his quest to make great espresso at his home in Los Angeles. He reviews some of the tricks to make good espresso from a relatively inexpensive espresso machine. (The machine he describes, a Rancilio Silvia, was around $500 when Frauenfelder bought it. That’s a lot more than a Mr. Coffee, but it’s cheap compared to espresso machines that cost over $10,000.) The problem with inexpensive espresso machines is that they have poor temperature. The water temperature can vary as much as 40 °F. Frauenfelder hacked his espresso machine by replacing its controller with a more sophisticated proportional-integral-derivative controller or PID.

(Small changes in brewing temperature can have a large impact on coffee taste. This is because coffee is chemically complex. Brewing at different temperatures extracts these chemicals in different proportions. See this Scientific American article for details.)

The context of Frauenfelder’s story is his book on doing things yourself, not to save money, but to have more control of your environment. He describes his adventures from growing vegetables to making musical instruments for the joy of doing so.

***

Roger Sessions tells a very different story about why he does not have his own espresso machine in The ObjectWatch Newsletter. He describes why he drives 10 miles every day to the Starbucks in Brenham. He says he saves money, even though he spends more on gasoline than the price of his doppio macchiato. For starters, Sessions estimates it would take nearly four years to pay off his hardware investment. Then he lists the things that could go wrong:

  • Something might break down.
  • Something could short out his electrical system.
  • The roaster could burn his house down.
  • He might not be able to use the equipment well.

The context of Sessions’ story is the benefits of software as a service. Even when it appears to be more economical to create and host your own software, you may save money by paying someone else to offer that software as a service. Sessions pays Starbucks to make his espresso for him because they not only make the capital investment in hardware, they’re also responsible for operation and maintenance. (Update: See Roger Sessions’ comment below.)

***

Sessions makes as strong a case for not owning an espresso machine as Frauenfelder makes for owning one. Frauenfelder speaks of the confidence and joy that comes from having detailed knowledge and control of the things around you. Sessions speaks of the hassle and expense of being responsible for the things around you. Both have valid points. I’m sure there are areas in which Frauenfelder is happy for someone else to take responsibility, and areas in which Sessions enjoys fine-grained control. But they disagree which approach is preferable when it comes to making espresso.

One of the things that prompted me to buy Mark Frauenfelder’s book was an interview I heard on a podcast. He said that he was attracted to do-it-yourself projects after becoming editor of Make Magazine. The spirit of the writers rubbed off. At one point in the interview he says DIY is not about saving money; in fact, the DIY approach will often cost more money. I appreciated this comment since many DIY enthusiasts justify their projects with dubious financial arguments rather than simply say that they enjoy what they’re doing and that the extra expense is worth it to them. I suspect Frauenfelder may agree with Sessions on the economics of espresso making, though they have different perspectives of the non-monetary benefits.  And although Sessions only argues monetary benefits, he also has non-monetary benefits to visiting Starbucks. I imagine he enjoys getting out of his house, seeing familiar faces, etc.

***

Posts quoting Mark Frauenfelder:

Endless preparation
Getting women to smoke

Posts quoting Roger Sessions:

Three quotes on software development