Programming the last mile

In any programming project there comes a point where the programming ends and manual processes begin. That boundary is where problems occur, particularly for reproducibility.

Before you can build a software project, there are always things you need to know in addition to having all the source code. And usually at least one of those things isn’t documented. Statistical analyses are perhaps worse. Software projects typically yield their secrets after a moderate amount of trial and error; statistical analyses may remain inscrutable forever.

The solution to reproducibility problems is to automate more of the manual steps. It is becoming more common for programmers to realize the need for one-click builds. (See Pragmatic Project Automation for a good discussion of why and how to do this. Here’s a one-page summary of the book.) Progress is slower on the statistical side, but a few people have discovered the need for reproducible analysis.
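To make "automate more of the manual steps" concrete, here is a minimal sketch in Python of an analysis whose every step, from raw data to report text, runs from one entry point. The data, the function names, and the choice to drop rows with a missing value are all made up for illustration; the point is only that the cleaning step someone might do by hand in a spreadsheet is written down as code.

```python
import csv
import io
import statistics

# Hypothetical raw data; in a real project this would be read from a file.
RAW_CSV = """group,value
a,1.0
a,3.0
b,10.0
b,
b,14.0
"""

def load(text):
    """Step 1: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Step 2: drop rows with a missing value (a step often done by hand)."""
    return [
        {"group": r["group"], "value": float(r["value"])}
        for r in rows
        if r["value"].strip()
    ]

def summarize(rows):
    """Step 3: compute the mean of each group."""
    groups = {}
    for r in rows:
        groups.setdefault(r["group"], []).append(r["value"])
    return {g: statistics.mean(v) for g, v in sorted(groups.items())}

def report(summary):
    """Step 4: format the summary as the text that goes into the write-up."""
    return "\n".join(f"{g}: mean = {m:.2f}" for g, m in summary.items())

def run_all(text=RAW_CSV):
    """The 'one click': every step from raw data to report, in order."""
    return report(summarize(clean(load(text))))

if __name__ == "__main__":
    print(run_all())
```

Because the decision to drop incomplete rows lives in `clean`, someone rerunning the analysis later does not have to rediscover it by trial and error.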

It’s all a question of how much of a problem should be solved with code. Programming has to stop at some point, but we often stop too soon. We stop when it’s easier to do the remaining steps by hand, but we’re often short-sighted in our idea of “easier”. We mean easier for me to do by hand this time. We don’t think about someone else needing to do the task, or the need for someone (maybe ourselves) to do the task repeatedly. And we don’t think of the possible debugging/reverse-engineering effort in the future.

I’ve tried to come up with a name for the discipline of including more work in the programming portion of problem solving. “Extreme programming” has already been used for something else. Maybe “turnkey programming” would do; it doesn’t have much of a ring to it, but it sorta captures the idea.

Literate programming and statistics

Sweave, mentioned in my previous post, is a tool for literate programming. Donald Knuth invented literate programming and gives this description of the technique in his book of the same name:

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.

Knuth says the quality of his code went up dramatically when he started using literate programming. When he published the source code for TeX as a literate program and a book, he was so confident in the quality of the code that he offered cash rewards for bug reports, doubling the amount of the reward with each edition. In one edition, he goes so far as to say “I believe that the final bug in TeX was discovered and removed on November 27, 1985.” Even though TeX is a large program, this was not an idle boast. A few errors were discovered after 1985, but only after generations of Stanford students studied the source code carefully and multitudes of users around the world put TeX through its paces.

While literate programming is a fantastic idea, it has failed to gain a substantial following. And yet Sweave might catch on even though literate programming in general has not.

In most software development, documentation is an afterthought. When push comes to shove, developers are rewarded for putting buttons on a screen, not for writing documentation. Software documentation can be extremely valuable, but it’s most valuable to someone other than the author. And the benefit of the documentation may only be realized years after it was written.

But statisticians are rewarded for writing documents. In a statistical analysis, the document is the deliverable. The benefits of literate programming for a statistician are more personal and more immediate. Statistical analyses are often re-run, with just enough time between runs for the previous work to be completely flushed from memory. Data is corrected or augmented, papers come back from review with requests for changes, etc. Statisticians have more self-interest in making their work reproducible than do programmers.

Patrick McPhee gives this analysis for why literate programming has not caught on.

Without wanting to be elitist, the thing that will prevent literate programming from becoming a mainstream method is that it requires thought and discipline. The mainstream is established by people who want fast results while using roughly the same methods that everyone else seems to be using, and literate programming is never going to have that kind of appeal. This doesn’t take away from its usefulness as an approach.

But statisticians are more free to make individual technology choices than programmers are. Programmers typically work in large teams and have to use the same tools as their colleagues. Statisticians often work alone. And since they deliver documents rather than code, statisticians are free to use Sweave without their colleagues’ knowledge or consent. I doubt whether a large portion of statisticians will ever be attracted to literate programming, but technological minorities can thrive more easily in statistics than in mainstream software development.

Irreproducible analysis

Journals and granting agencies are prodding scientists to make their data public. Once the data is public, other scientists can verify the conclusions. Or at least that’s how it’s supposed to work. In practice, it can be extremely difficult or impossible to reproduce someone else’s results. I’m not talking here about reproducing experiments, but simply reproducing the statistical analysis of experiments.

It’s understandable that many experiments are not practical to reproduce: the replicator needs the same resources as the original experimenter, and so expensive experiments are seldom reproduced. But in principle the analysis of an experiment’s data should be repeatable by anyone with a computer. And yet this is very often not possible.

Published analyses of complex data sets, such as microarray experiments, are seldom exactly reproducible. Authors inevitably leave out some detail of how they got their numbers. In a complex analysis, it’s difficult to remember everything that was done. And even if authors were meticulous about documenting every step of the analysis, journals do not want to publish such great detail. Often an article provides enough clues that a persistent statistician can approximately reproduce the conclusions. But sometimes the analysis is opaque or just plain wrong.

I attended a talk yesterday where Keith Baggerly explained the extraordinary steps he and his colleagues went through in an attempt to reproduce the results in a medical article published last year by Potti et al. He called this process “forensic bioinformatics,” attempting to reconstruct the process that led to the published conclusions. He showed how he could reproduce parts of the results in the article in question by, among other things, reversing the labels on some of the groups. (For details, see “Microarrays: retracing steps” by Kevin Coombes, Jing Wang, and Keith Baggerly in Nature Medicine, November 2007, pp 1276-1277.)

While they were able to reverse-engineer many of the mistakes in the paper, some remain a mystery. In any case, they claim that the results of the paper are just wrong. They conclude “The idea … is exciting. Our analysis, however, suggests that it did not work here.”

The authors of the original article replied that there were a few errors but that these have been fixed and they didn’t affect the conclusions anyway. Baggerly and his colleagues disagree. So is this just a standoff with two sides pointing fingers at each other saying the other guys are wrong? No. There’s an important asymmetry between the two sides: the original analysis is opaque but the critical analysis is transparent. Baggerly and company have written code to carry out every tiny step of their analysis and made the Sweave code available for anyone to download. In other words, they didn’t just publish their paper, they published code to write their paper.

Sweave is a program that lets authors mix prose (LaTeX) with code (R) in a single file. Users do not directly paste numbers and graphs into a paper. Instead, they embed the code to produce the numbers and graphs, and Sweave replaces the code with the results of running the code. (Sweave embeds R inside LaTeX the way CGI embeds Perl inside HTML.) Sweave doesn’t guarantee reproducibility, but it is a first step.
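The idea of embedding code and letting a tool substitute the results can be imitated in a few lines of Python. This is only a toy sketch of the principle, not how Sweave works internally: the `{{ ... }}` marker, the `weave` and `render` functions, and the sample data are all invented for the illustration, and real Sweave files use R inside LaTeX with their own chunk syntax.

```python
import re
import statistics

# Hypothetical data for the illustration.
data = [4.1, 5.3, 4.8, 5.0]

# A document whose numbers are written as code, not pasted in.
# The {{ ... }} marker is made up for this sketch; it is not Sweave syntax.
TEMPLATE = (
    "We observed {{ len(data) }} samples "
    "with mean {{ round(statistics.mean(data), 2) }}."
)

def weave(template, namespace):
    """Replace each {{ expression }} with the result of evaluating it."""
    def evaluate(match):
        # eval is tolerable here only because the template is our own text.
        return str(eval(match.group(1), namespace))
    return re.sub(r"\{\{\s*(.*?)\s*\}\}", evaluate, template)

def render(values):
    """Weave the template against a given data set."""
    return weave(TEMPLATE, {"data": values, "statistics": statistics})

if __name__ == "__main__":
    print(render(data))
```

If the data is corrected, rerunning `render` updates every number in the sentence; nothing has to be recomputed and pasted in by hand, which is where transcription errors and forgotten steps tend to creep in.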