Multiple comparisons

Multiple comparisons present a conundrum in classical statistics. The options seem to be:

  1. do nothing and tolerate a high false positive rate (see the simulation sketch after this list)
  2. be extremely conservative and tolerate a high false negative rate
  3. do something ad hoc between the extremes
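
To make the trade-off concrete, here is a quick simulation sketch of my own (not from the paper; it assumes NumPy and SciPy). It estimates the chance of at least one false positive across 20 tests of true null hypotheses, first with no correction (option 1) and then with a Bonferroni correction (option 2).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_tests, alpha, n_sims = 20, 0.05, 10_000

    any_uncorrected = 0
    any_bonferroni = 0
    for _ in range(n_sims):
        # Twenty two-sided z-tests on pure noise, so every rejection is a false positive.
        p = 2 * stats.norm.sf(np.abs(rng.standard_normal(n_tests)))
        any_uncorrected += (p < alpha).any()
        any_bonferroni += (p < alpha / n_tests).any()

    print("P(at least one false positive), no correction:", any_uncorrected / n_sims)
    print("P(at least one false positive), Bonferroni:   ", any_bonferroni / n_sims)

The first number should come out near 1 − 0.95²⁰ ≈ 0.64; the Bonferroni number should be near 0.05, the price being much less power to detect effects that are really there.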

A new paper by Andrew Gelman, Jennifer Hill, and Masanao Yajima opens with “The problem of multiple comparisons can disappear when viewed from a Bayesian perspective.” I would clarify that the resolution comes not from the Bayesian perspective per se but from the Bayesian hierarchical perspective.

See this blog post for a link to the article “Why we (usually) don’t have to worry about multiple comparisons” and to a presentation by the same title.

What's better about small companies?

Popular business writers often say flat organizations are better than hierarchical organizations, and small businesses are better than big businesses. By “better” they usually mean more creative, nimble, fun, and ultimately profitable. But they don’t often try to explain why small and flat is better than big and hierarchical. They support their argument with examples of big sluggish companies and small agile companies, but that’s as far as they go.

Paul Graham posted a new essay called You Weren’t Meant to Have a Boss in which he also argues for small and flat over big and hierarchical. However, his line of reasoning is fresh. I haven’t decided what I think of his points, but as usual his writing is creative and thought-provoking.

Update: See Jeff Atwood’s comments, Paul Graham’s Participatory Narcissism.

Simple unit tests

After you’ve read a few books or articles on unit testing, the advice becomes repetitive. But today I heard someone who had a few new things to say. Gerard Meszaros made these points in an interview on the OOPSLA 2007 podcast, Episode 11.

Test code should be much simpler than production code for three reasons.

  1. Unit tests should not contain branching logic. Each test should test one path through the production code. If a unit test has branching logic, it’s doing too much, attempting to test more than one path. (See the sketch below.)
  2. Unit tests are the safety net for changes to production code, but there is no such safety net for the tests themselves. Therefore tests should be written simply the first time rather than simplified later through refactoring.
  3. Unit tests are not subject to the same constraints as production code. They can be slow, and they only have to work in isolation. Brute force is more acceptable in tests than in production code.

(Meszaros made points 1 and 2 directly. Point 3 is my interpolation.)
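
For example, a unit test with no branching might look like the following sketch. The function parse_price and its behavior are hypothetical, invented here just to show a single straight-line path.

    import unittest

    # Hypothetical production function, stubbed here so the example is self-contained.
    def parse_price(text):
        """Parse a string like '$1,234.56' into a float."""
        return float(text.replace("$", "").replace(",", ""))

    class ParsePriceTest(unittest.TestCase):
        def test_parses_dollar_amount_with_commas(self):
            # One path, no ifs or loops: set up the input, call the code, check the result.
            self.assertEqual(parse_price("$1,234.56"), 1234.56)

    if __name__ == "__main__":
        unittest.main()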

A well-tested project will have at least as much test code as production code. The conclusion too many people draw is that unit testing therefore doubles the cost of a project. One reason this is not true is that test code is easier to write than production code, for the reasons listed above. Or rather, test code can be easier to write, if the project uses test-driven development. Retrofitting tests to code that wasn’t designed to be testable is hard work indeed.

Plausible reasoning

If Socrates is probably a man, he’s probably mortal.

How do you extend classical logic to reason with uncertain propositions, such as the statement above? Suppose we agree to represent degrees of plausibility with real numbers, larger numbers indicating greater plausibility. If we also agree to a few axioms to quantify what we mean by consistency and common sense, there is a unique system that satisfies the axioms. The derivation is tedious and not well suited to a blog posting, so I’ll cut to the chase: given certain axioms, the inevitable system for plausible reasoning is probability theory.
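
As a sketch of how the resulting rules quantify the opening statement about Socrates (the 0.9 below is an arbitrary plausibility chosen only for illustration):

    % The sum and product rules applied to the Socrates example.
    % Assume "all men are mortal", so P(mortal | man) = 1, and take P(man) = 0.9.
    \[
      P(\mathrm{mortal})
        = P(\mathrm{mortal} \mid \mathrm{man})\,P(\mathrm{man})
        + P(\mathrm{mortal} \mid \mathrm{not\ man})\,P(\mathrm{not\ man})
        \ge 1 \times 0.9 = 0.9 .
    \]

So if Socrates is probably a man, he is at least as probably mortal.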

There are two important implications of this result. First, it is possible to develop probability theory with no reference to sets. This renders much of the controversy about the interpretation of probability moot. Instead of arguing about what a probability can and cannot represent, one could concede the point. “We won’t use probabilities to represent uncertain information. We’ll use ‘plausibilities’ instead, derived from rules of common sense reasoning. And by the way, the resulting theory is identical to probability theory.”

The other important implication is that all other systems of plausible reasoning — fuzzy logic, neural networks, artificial intelligence, etc. — must either lead to the same conclusions as probability theory, or violate one of the axioms used to derive probability theory.

See the first two chapters of Probability Theory by E. T. Jaynes (ISBN 0521592712) for a full development. It’s interesting to note that the seminal paper in this area came out over 60 years ago. (Richard Cox, “Probability, frequency, and reasonable expectation”, 1946.)

Second homes

Designing good surveys is hard work. Andrew Gelman posted an example of unintended consequences in survey design yesterday. A survey question asked “How many people do you know who have a second home?” Apparently some respondents thought the question was asking about folks who own vacation homes while others thought the question referred to immigrants.

Conceptual integrity

How do you maintain conceptual integrity when multiple people contribute to a project?

Fred Brooks, author of the software engineering classic The Mythical Man-Month, gave a talk at OOPSLA 2007 entitled Collaboration and Telecollaboration in Design (audio here). In his talk, Brooks discusses the importance of conceptual integrity. Great products have conceptual integrity and are nearly always the fruit of one or at most two minds. Products reflect their creators, and products designed by committees have multiple personalities. How do you maintain conceptual integrity when scale and complexity demand the participation of many people? Listen to Fred Brooks for some ideas.

Error function and the normal distribution

The error function erf(x) and the normal distribution function Φ(x) are essentially the same function. The former is more common in math, the latter in statistics. I often have to convert between the two.

It’s a simple exercise to move between erf(x) and Φ(x), but it’s tedious and error-prone, especially when you throw in variations on these two functions such as their complements and inverses. Some time ago I got sufficiently frustrated to write up the various relationships in a LaTeX file for future reference. I was using this file yesterday and thought I should post it as a PDF file in case it could save someone else time and errors.
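
For reference, the basic identity connecting the two is Φ(x) = (1 + erf(x/√2))/2. Here is a quick numerical check, assuming SciPy’s norm.cdf for Φ:

    # Check the identity Φ(x) = (1 + erf(x/√2))/2 at a few points.
    from math import erf, sqrt
    from scipy.stats import norm   # norm.cdf is the standard normal CDF Φ

    for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
        phi_from_erf = 0.5 * (1.0 + erf(x / sqrt(2.0)))
        print(x, phi_from_erf, norm.cdf(x))   # the last two columns agree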

What is the cosine of a matrix?

How would you define the cosine of a matrix? If you’re trying to think of a triangle whose sides are matrices, you’re not going to get there. Think of power series. If a matrix A is square, you can stick it into the power series for cosine and call the sum the cosine of A.

cos(A) = I − A²/2! + A⁴/4! − A⁶/6! + ⋯

For example, you can compute the cosine of a 2×2 matrix by plugging it into this series; a numerical sketch follows below.

This only works for square matrices. Otherwise the powers of A are not defined.
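
Here is a minimal numerical sketch of the definition, assuming NumPy and SciPy; the particular matrix is arbitrary. It sums the series directly and compares the result with SciPy’s built-in matrix cosine.

    import numpy as np
    from scipy.linalg import cosm

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])   # any square matrix will do

    def cos_series(A, terms=30):
        """Sum cos(A) = I - A^2/2! + A^4/4! - ... by building each term from the previous one."""
        result = np.eye(A.shape[0])
        term = np.eye(A.shape[0])
        for n in range(1, terms):
            term = -term @ A @ A / ((2 * n - 1) * (2 * n))
            result += term
        return result

    print(cos_series(A))
    print(cosm(A))   # SciPy's matrix cosine gives the same answer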

The power series converges and has many of the properties you’d expect. However, the usual trig identities may or may not apply. For example,

cos(A + B) = cos(A) cos(B) − sin(A) sin(B)

holds if the matrices A and B commute, i.e. AB = BA. To see why commutativity is needed, imagine trying to prove the sum identity above. You’d stick A + B into the power series and do some algebra to rearrange terms to get the terms on the right side of the equation. Along the way you’ll encounter terms like A² + AB + BA + B², which you’d like to factor into (A + B)², but you can’t justify that step unless A and B commute.

Is cosine still periodic in this context? Yes, in the sense that cos(A + 2πI) = cos(A). This is because the diagonal matrix 2πI commutes with every matrix A and so the sum identity above holds.

Why would you want to define the cosine of a matrix? One application of analytic functions of a matrix is solving systems of differential equations. Any homogeneous linear system of ODEs with constant coefficients, of any order, can be rewritten in the form x′ = Ax where x is a vector of functions and A is a square matrix. Then the solution is x(t) = exp(tA) x(0). And cos(At) is a solution to x″ + A²x = 0, just as in calculus.
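
Here is a numerical check of that last claim, again a sketch assuming SciPy and an arbitrary matrix: a finite-difference second derivative of x(t) = cos(At) x(0) should match −A²x(t).

    import numpy as np
    from scipy.linalg import cosm

    A = np.array([[0.0, 1.0],
                  [-2.0, 0.0]])   # arbitrary example matrix
    x0 = np.array([1.0, 0.0])
    t, h = 0.7, 1e-4

    x = lambda s: cosm(A * s) @ x0   # candidate solution x(t) = cos(At) x(0)
    x_second_deriv = (x(t + h) - 2 * x(t) + x(t - h)) / h**2

    print(x_second_deriv)    # approximately equal to ...
    print(-A @ A @ x(t))     # ... -A^2 x(t), so x'' + A^2 x ≈ 0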

Four characterizations of the normal distribution

The normal distribution pops up everywhere in statistics. Contrary to popular belief, the name does not come from “normal” as in “conventional.” Instead, the term comes from a detail in a proof by Gauss, discussed below, in which he showed that two things were perpendicular in a sense.

(The word “normal” originally meant “at a right angle,” going back to the Latin word normalis for a carpenter’s square. Later the word took on the metaphorical meaning of something in line with custom. Mathematicians sometimes use “normal” in the original sense of being orthogonal.)

The mistaken etymology persists because the normal distribution is conventional. Statisticians often assume anything random has a normal distribution by default. While this assumption is not always justified, it often works remarkably well. This post gives four lines of reasoning that lead naturally to the normal distribution.

1) The earliest characterization of the normal distribution is the central limit theorem, going back to Abraham de Moivre. Roughly speaking, this theorem says that if you average enough independent random variables together, even if they’re not normally distributed, in the limit their average is normal. But this justification for assuming normal distributions everywhere has a couple of problems. First, the convergence in the central limit theorem may be slow, depending on what is being averaged. Second, if you relax the hypotheses of the central limit theorem, other stable distributions with thicker tails also satisfy a sort of central limit theorem. The characterizations given below are more satisfying because they do not rely on limit theorems.
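
Here is a quick simulation sketch of my own, assuming NumPy: averages of skewed exponential samples become less skewed, i.e. more normal-looking, as more values go into each average.

    import numpy as np

    rng = np.random.default_rng(1)
    for n in (1, 5, 50):
        # 100,000 averages, each of n exponential(1) samples (mean 1, standard deviation 1).
        means = rng.exponential(size=(100_000, n)).mean(axis=1)
        z = (means - 1.0) * np.sqrt(n)        # standardize the averages
        print(n, round(np.mean(z**3), 2))     # sample skewness, roughly 2/sqrt(n), heads toward 0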

2) The astronomer William Herschel discovered the simplest characterization of the normal. He wanted to characterize the errors in astronomical measurements. He assumed (1) the distribution of errors in the x and y directions must be independent, and (2) the distribution of errors must be independent of angle when expressed in polar coordinates. These are very natural assumptions for an astronomer, and the only solution is a product of the same normal distribution in x and y. James Clerk Maxwell came up with an analogous derivation in three dimensions when modeling gas dynamics.
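
Here is the key step of Herschel’s argument in sketch form: independence gives a product of identical densities, and symmetry under rotation forces that product to depend only on the radius, which leaves only the Gaussian.

    % Herschel's two assumptions: independence gives p(x, y) = f(x) f(y);
    % rotational symmetry says p(x, y) is a function of x^2 + y^2 alone.
    % Taking logs, log f(x) + log f(y) must depend only on x^2 + y^2,
    % which forces log f to be quadratic, i.e. f is a normal density.
    \[
      f(x)\, f(y) = g(x^2 + y^2)
      \quad \Longrightarrow \quad
      f(x) \propto e^{-x^2 / 2\sigma^2} .
    \]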

3) Carl Friedrich Gauss came up with the characterization of the normal distribution that caused it to be called the “Gaussian” distribution. There are two strategies for estimating the mean of a random variable from a sample: the arithmetic mean of the samples, and the maximum likelihood estimate. Only for the normal distribution do these always coincide.
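
To spell out the easy direction as a worked equation (the converse, that this property forces the distribution to be normal, is the substance of Gauss’s result):

    % For normal data the log likelihood in the location parameter mu is,
    % up to constants, -\sum_i (x_i - \mu)^2 / (2\sigma^2). Maximizing it:
    \[
      \frac{d}{d\mu} \sum_{i=1}^n (x_i - \mu)^2
        = -2 \sum_{i=1}^n (x_i - \mu) = 0
      \quad \Longrightarrow \quad
      \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i ,
    \]
    % so the maximum likelihood estimate is the arithmetic mean.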

4) The final characterization listed here is in terms of entropy. For a specified mean and variance, the probability density with the greatest entropy (least information) is the normal distribution. I don’t know who discovered this result, but I read it in C. R. Rao’s book. Perhaps it’s his result. If anyone knows, please let me know and I’ll update this post. For advocates of maximum entropy this is the most important characterization of the normal distribution.
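
Stated as a formula, the claim is that among all densities with a given mean μ and variance σ², the entropy H(p) = −∫ p log p is maximized by the normal density:

    \[
      p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2} .
    \]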

Related post: How the Central Limit Theorem began

In praise of tedious proofs

The book Out of Their Minds quotes Leslie Lamport on proofs:

The proofs have been carried out to an excruciating level of detail … The reader may feel that we have given long, tedious proofs of obvious assertions. However, what he has not seen are the many equally obvious assertions that we discovered to be wrong only by trying to write similarly long, tedious proofs.

See Lamport’s paper How to Write a Proof. See also Complementary validation.