The blog of John D. Cook
Beautiful Testing is available for pre-order at Amazon. Proceeds from the book will go to Nothing But Nets, a project to distribute anti-malaria bed nets. I contributed a chapter on how to test random number generators.

{ 1 comment }
The broken windows theory says that cracking down on petty crime reduces more serious crime. The name comes from the explanation that if a building has a few broken windows, it invites vandals to break more windows and eventually burn down the building. Turned around, this suggests that punishing vandalism could lead to a reduction in violent crime. Rudy Giuliani is perhaps the most visible proponent of the theory. His first initiative as mayor of New York was to go after turnstile jumpers and squeegeemen as a way of reducing crime in the city. Crime rates dropped dramatically during his tenure.

In the book Pragmatic Thinking and Learning, Andy Hunt applies the broken windows theory to software development.
Known problems (such as bugs in code, bad process in an organization, poor interfaces, or lame management) that are uncorrected have a debilitating, viral effect that ends up causing even more damage.
I’ll add a couple of my pet peeves to Andy Hunt’s list.
The first is compiler warnings. I can’t understand why some programmers are totally comfortable with their code having dozens of compiler warnings. They’ll say “Oh yeah, I know about that. It’s not a problem.” But then when a warning shows up that is trying to tell them something important, the message gets lost in the noise. My advice: Just fix the code. In very exceptional situations, explicitly turn off the warning.
The second is similar. Many programmers blithely ignore run-time exceptions that are written to an event log. As with compiler warnings, they argue that these exceptions are not really a problem. My advice: If it’s not really a problem, then don’t log it. Otherwise, fix it.
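Here is a minimal sketch of the advice on warnings, using Python's standard warnings module. Python has no compile step in the C sense, so this is only an analogy: every warning is promoted to an error, and the one warning we have decided to live with is turned off explicitly and narrowly. The function names and the warning text are hypothetical, made up for illustration.

```python
import warnings


def legacy_sum(values):
    """Hypothetical function that emits a warning we have decided to tolerate."""
    warnings.warn("legacy_sum is deprecated", DeprecationWarning)
    return sum(values)


def run_strict():
    """Treat every warning as an error, except one that is silenced on purpose."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # any warning now raises an exception
        # The very exceptional case: ignore exactly one known warning.
        warnings.filterwarnings(
            "ignore",
            message="legacy_sum is deprecated",
            category=DeprecationWarning,
        )
        return legacy_sum([1, 2, 3])


def unexpected_warning_is_fatal():
    """An unlisted warning cannot get lost in the noise: it raises."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            warnings.warn("something new and important", UserWarning)
        except UserWarning:
            return True
    return False
```

The point of the design is that the suppression is visible in the code and covers one message only, so a new warning still stops the program.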
{ 0 comments }
Michael Feathers wrote one of my favorite books on unit testing: Working Effectively with Legacy Code. Some books on unit testing just give abstract platitudes. Feathers’s book wrestles with the hard, messy problem of retrofitting unit tests to existing code.
The .NET Rocks podcast had an interview with Michael Feathers recently. The whole interview is worth listening to, but here I’ll just recap a couple things he said about refactoring that I thought were insightful. First, most people agree that you need to have unit tests in place before you can do much refactoring. The unit tests give you the confidence to refactor without worrying that you’ll break something in the process and not know that you broke it. But Feathers adds that you might have to do some light refactoring before you can put the unit tests in place to allow more aggressive refactoring.
The second thing he mentioned about refactoring was the technique called “scratch refactoring.” With this approach, you refactor quickly without worrying about whether you are introducing bugs in order to see where you want to go. But then you completely throw away those changes and refactor carefully. Sometimes you need to do a dry run first to see what patterns emerge and determine where you want to go.
Both of these observations are ways to break out of a chicken-and-egg cycle, needing to refactor before you can refactor.
{ 1 comment }
Daniel Lemire wrote a blog post this morning that ties together a couple themes previously discussed here.
Most published math papers contain errors, and yet there have been surprisingly few “major screw-ups” as defined by Mark Dominus. Daniel Lemire’s post quotes Doron Zeilberger on why these frequent errors are often benign.
Most mathematical papers are leaves in the web of knowledge, that no one reads, or will ever use to prove something else. The results that are used again and again are mostly lemmas, that while a priori non-trivial, once known, their proof is transparent. (Zeilberger’s Opinion 91)
Those papers that are “branches” rather than “leaves” receive more scrutiny and are more likely to be correct.
Zeilberger says lemmas get reused more than theorems. This dovetails with Mandelbrot’s observation mentioned a few weeks ago.
Many creative minds overrate their most baroque works, and underrate the simple ones. When history reverses such judgments, prolific writers come to be best remembered as authors of “lemmas,” of propositions they had felt “too simple” in themselves and had to be published solely as preludes to forgotten theorems.
There are obvious analogies to software. Software that many people use has fewer bugs than software that few people use, just as theorems that people build on have fewer bugs than “leaves in the web of knowledge.” Useful subroutines and libraries are more likely to be reused than complete programs. And as Donald Knuth pointed out, re-editable code is better than black-box reusable code.
Everybody knows that software has bugs, but not everyone realizes how buggy theorems are. Bugs in software are more obvious because paper doesn’t abort. Proofs and programs are complementary forms of validation. Attempting to prove the correctness of an algorithm certainly reduces the chances of a bug, but proofs are fallible as well. As Knuth once said, “Beware of bugs in the above code; I have only proved it correct, not tried it.” Not only can programs benefit from being more proof-like, proofs can benefit from being more program-like.
{ 2 comments }
I’ve never written a line of Ruby, but I find Ruby on Rails fascinating. From all reports, the Rails framework lets you develop a web site much faster than you could using other tools, provided you can live with its limitations. Rails emphasizes consistency and simplicity, deliberately leaving out support for some contingencies.
I listened to an interview last night with Ruby developer Glenn Vanderburg. Here’s an excerpt that I found insightful.
In the Java world, the APIs and libraries … tend to be extremely thorough in trying to solve the entire problem that they are addressing and [are] somewhat complicated and difficult to use. Rails, in particular, takes exactly the opposite philosophy … Rails tries to solve the 90% of the problem that everybody has and that can be solved with 10% of the code. And it punts on that last 10%. And I think that’s the right decision, because the most complicated, odd, corner cases of these problems tend to be the things that can be solved by the team in a specific and rather simple way for one application. But if you try to solve them in a completely general way that everybody can use, it leads to these really complicated APIs and complicated underpinnings as well.
The point is not to pick on Java. I believe similar remarks apply to Microsoft’s libraries, or the libraries of any organization under pressure to be all things to all people. The Ruby on Rails community is a small, voluntary association that can turn away people who don’t like their way of doing things.
At first it sounds unprofessional to develop a software library that provides anything less than a thorough solution to the problem it addresses. And in some contexts that is true, though every library has to leave something out. But in other contexts, it makes sense to leave out the edge cases that users can easily handle in their particular context. What is an edge case to a library developer may be bread and butter to a particular set of users. (Of course the library provider should document explicitly just what part of the problem their code does and does not solve.)
Suppose that for some problem you really can write the code that is sufficient for 90% of the user base with 10% of the effort of solving the entire problem. That means a full solution is 10 times more expensive to build than a 90% solution.
Now think about quality. The full solution will have far more bugs. For starters, the extra code required for the full solution will have a higher density of bugs because it deals with trickier problems. Furthermore, it will have far fewer users per line of code — only 10% of the community cares about it in the first place, and of that 10%, they all care about different portions. With fewer users per line of code, this extra code will have more unreported bugs. And when users do report bugs in this code, the bugs will be a lower priority to fix because they impact fewer people.
So in this hypothetical example, the full solution costs an order of magnitude more to develop and has maybe two orders of magnitude more bugs.
{ 2 comments }
In the interview with Charles Petzold I mentioned in my previous post, Petzold talks about the sharp decline in programming book sales. At one time, nearly every Windows programmer owned a copy of Petzold’s first book, especially in its earlier editions. But he said that now only 4,000 people have purchased his recent 3D programming book.
Programming book sales have plummeted, not because there is any less to learn, but because there is too much to learn. Developers don’t want to take the time to thoroughly learn any technology they suspect will become obsolete in a couple years, especially if it’s only one of many technologies they have to use. So they plunge ahead using tools they have never systematically studied. And when they get stuck, they Google for help and hope someone else has blogged about their specific problem.
Companies have cut back on training at the same time that they’re expecting more from software. So programmers do the best they can. They jump in and write code without really understanding what they’re doing. They guess and see what works. And when things don’t work, they Google for help. It’s the most effective thing to do in the short term. In the longer term it piles up technical debt that leads to a quality disaster or a maintenance quagmire.
{ 3 comments }
I had a conversation yesterday with someone who said he needed to hire a computer scientist. I replied that actually he needed to hire someone who could program, and that not all computer scientists could program. He disagreed, but I stood by my statement. I’ve known too many people with computer science degrees, even advanced degrees, who were ineffective software developers. Of course I’ve also known people with computer science degrees, especially advanced degrees, who were terrific software developers. The most I’ll say is that programming ability is positively correlated with computer science achievement.
The conversation turned to what it means to say someone can program. My proposed definition was someone who could write large programs that have a high probability of being correct. Joel Spolsky wrote a good book last year called Smart and Gets Things Done about recruiting great programmers. I agree with looking for someone who is “smart and gets things done,” but “writes large correct programs” may be easier to explain. The two ideas overlap a great deal.
People who are not professional programmers often don’t realize how the difficulty of writing software increases with size. Many people who wrote 100-line programs in college imagine that they could write 1,000-line programs if they worked at it 10 times longer. Or even worse, they imagine they could write 10,000-line programs if they worked 100 times longer. It doesn’t work that way. Most people who can write a 100-line program could never finish a 10,000-line program no matter how long they worked on it. They would simply drown in complexity. One of the marks of a professional programmer is knowing how to organize software so that the complexity remains manageable as the size increases. Even among professionals there are large differences in ability. The programmers who can effectively manage 100,000-line projects are in a different league than those who can manage 10,000-line projects.
(When I talk about a program that is so many lines long, I mean a program that needs to be about that long. It’s no achievement to write 1,000 lines of code for a problem that would be reasonable to solve in 10.)
Writing large buggy programs is hard. To say a program is buggy is to imply that it is at least of sufficient quality to approximate what it’s supposed to do much of the time. For example, you wouldn’t say that Notepad is a buggy web browser. A program has got to display web pages at least occasionally to be called a buggy browser.
Writing large correct programs is much harder. It’s even impossible, depending on what you mean by “large” and “correct.” No large program is completely bug-free, but some large programs have a very small probability of failure. The best programmers can think of a dozen ways to solve any problem, and they choose the way they believe has the best chance of being implemented correctly. Or they choose the way that is most likely to make an error obvious if it does occur. They know that software needs to be tested and they design their software to make it easier to test.
If you ask an amateur whether their program is correct, they are likely to be offended. They’ll tell you that of course it’s correct because they were careful when they wrote it. If you ask a professional the same question, they may tell you that their program probably has bugs, but then go on to tell you how they’ve tested it and what logging facilities are in place to help debug errors when they show up later.
{ 1 comment }
Modern operating systems are huge, and their size comes at a cost. When I worry out loud about the size of operating systems (or applications, or programming languages) I often get the response “What do you care? If you don’t like the new features, just don’t use them.” The objection seems to be that you don’t pay for what you don’t use. But you do. Every feature comes at some cost. Every feature is a potential source of instability. Every feature takes up developer resources and computer resources. Often the extra cost is worth it for the extra benefit, but not always. And costs can be more subtle than benefits.
Suppose a developer has a great idea for a new feature. He’s so excited that he puts in voluntary overtime to develop his feature, so the cost of his extra contribution is zero. Or is it? Not unless his enthusiasm spills over to everyone else involved so that they volunteer overtime as well. The testers, tech writers, and others who now have more work to do because of this feature are unlikely to be as excited as the developer. What was a labor of love for the developer is just plain labor for everyone else. So the new feature now takes a little time away from everything else that needs to be documented, tested, and otherwise managed, diluting overall quality.
This post was prompted by a discussion with Codewiz in the comments to his post about his woes recovering from operating system problems. Along the way he mentioned a remarkably stable FreeBSD server he had and attributed its stability to the fact that he never installed any GUI on the box. Lest anyone think that only the Unix world would create a minimalist operating system, take a look at Windows Server Core. Microsoft also realizes that the features that aren’t there can’t cause problems.
{ 2 comments }
Yesterday I added a blog to the ReproducibleResearch.org web site. You can visit the site here or subscribe via RSS.
I’d like a couple people to join me in writing this blog, and I would greatly appreciate suggestions, guest posts, etc. If you’re interested, please send a note to contribute at the domain name.
{ 0 comments }
Phil Haack has a great article on unit test boundaries. A unit test must not touch the file system, interact with a database, or communicate across a network. Tests that break these rules are necessary, but they’re not unit tests. With some hard thought, the code with external interactions can be isolated and reduced. This applies to both production and test code.
As with most practices related to test-driven development, the primary benefit of unit test boundaries is the improvement in the design of the code being tested. If your unit test boundaries are hard to enforce, your production code may have architectural boundary problems. Refactoring the production code to make it easier to test will make the code better.
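Here is a minimal sketch of what isolating an external interaction might look like in Python. It is not taken from Phil Haack’s article; the function names are hypothetical. The logic works on any iterable of text lines, and a thin wrapper is the only code that touches the file system. The unit test feeds the logic an in-memory stream and never opens a file.

```python
import io


def count_nonblank_lines(stream):
    """Pure logic: accepts any iterable of text lines, no file system needed."""
    return sum(1 for line in stream if line.strip())


def count_nonblank_lines_in_file(path):
    """Thin wrapper that owns the external interaction (the file system).

    Exercising this function belongs in an integration test, not a unit test.
    """
    with open(path, encoding="utf-8") as f:
        return count_nonblank_lines(f)


def test_count_nonblank_lines():
    """Unit test: stays inside the boundary by using an in-memory stream."""
    fake_file = io.StringIO("alpha\n\n   \nbeta\n")
    assert count_nonblank_lines(fake_file) == 2
```

Notice that making the code testable also improved its design: the counting logic no longer cares where its lines come from.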
{ 0 comments }
Software engineers typically use the term “horizontal scalability” to mean throwing servers at a problem. A web site scales horizontally if you can handle increasing traffic simply by adding more servers to a server farm. I think of horizontal scalability differently: as the ability to scale as the number of projects increases, rather than as the performance demands on a single project increase. My biggest challenges have come from managing lots of small projects, more projects than developers.
I’ve seen countless books and articles about how to scale a single project, but I don’t remember ever seeing anything written about scaling the number of projects. It sounds easy to manage independent projects: if the projects are for different clients and they have different developers, just let each one go their own way. But there are two problems. One is a single developer maintaining an accumulation of his or her own projects, and the other is the ability (or more important, the inability) of peers to maintain each other’s projects. Projects that were independent during development become dependent in maintenance because they are maintained at the same time by the same people. Consistency across projects didn’t seem necessary during development, but then in maintenance you look back and wish there had been more consistency.
Maintenance becomes a tractor pull. Robert Martin describes a software tractor pull in his essay The Tortoise and the Hare:
Have you ever been to a tractor pull? Imagine a huge arena filled with mud and churned up soil. Huge tractors tie themselves up to devices of torture and try to pull them across the arena. The devices get harder to pull the farther they go. They are inclined planes with wheels on the rear and a wide shoe at the front that sits squarely on the ground. There is a huge weight at the rear that is attached to a mechanism that drags the weight up the inclined plane and over the shoe as the wheels turn. This steadily increases the weight over the shoe until the friction overcomes the ability of the tractor.
Writing software is like a tractor pull. You start out fast without a lot of friction. Productivity is high, and you get a lot done. But the more you write the harder it gets to write more. The weight is being dragged up over the shoe. The more you write the more the mess builds. Productivity slows. Overtime increases. Teams grow larger. More and more code is piled up over the shoe, and the development team grinds to a halt unable to pull the huge mass of code any farther through the mud.
Robert Martin had in mind a single project slowing down over time, but I believe his analogy applies even better to maintenance of multiple projects.
To scale your number of projects you’ve got to enforce consistency before there’s an immediate need for it. But there you face several dangers. Enforcing apparently unnecessary consistency could make you appear arbitrary and damage morale. And you’ll make some wrong decisions. You’ve got to have a lot of experience to predict what sort of policies you’ll wish in the future that you had enforced. These issues are challenging when scaling a single project, but they are more challenging when scaling across smaller projects because you don’t get feedback as quickly. On a single large project, you may feel the pain of a bad decision quickly, but with multiple small projects you may not feel the pain until much later.
Quality is critical when scaling the number of projects. Each project needs to be better than seems necessary. When you look at a single project in isolation, maybe it’s acceptable to have one bug report a month. But then when you have an accumulation of such projects, you’ll get bug reports every day. And the cost per bug fix goes up over time because developers can most easily fix bugs in the code freshest in their minds. Fixing a bug in an old project that no one wants to think about anymore will be unpleasant and expensive.
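The arithmetic behind that accumulation can be sketched in a few lines of Python. The function name, the 30-day month, and the project counts are assumptions for illustration; the only figure from the text is one bug report per project per month.

```python
def expected_reports_per_day(num_projects,
                             reports_per_project_per_month=1.0,
                             days_per_month=30):
    """Average daily bug reports across a portfolio of similar projects.

    Assumes every project has the same report rate, which is a simplification.
    """
    return num_projects * reports_per_project_per_month / days_per_month


if __name__ == "__main__":
    for n in (1, 10, 30, 60):
        print(n, "projects:", expected_reports_per_day(n), "reports per day")
```

Under these assumptions, a rate that looks harmless for one project turns into about one report a day once 30 such projects are in maintenance.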
Scaling your number of projects requires more discipline than scaling a single project because feedback takes longer. Although scaling single projects gets far more attention, I suspect a lot of people are struggling with scaling their number of projects.
{ 0 comments }
Here’s a quote from a recent blog post from Tom Peters:
You will be remembered in the long haul for the quality of your work, not the quantity of your work—the quantity part is just your defective ego talking—no one evaluates Picasso based on the number of paintings he churned out.
{ 2 comments }
William Gosset discovered the t-distribution while working for the Guinness brewing company. Because his employer prevented employees from publishing papers, Gosset published his research under the pseudonym Student. That’s why his distribution is often called Student’s t-distribution.
This story is fairly well known. It often appears in the footnotes of statistics textbooks. However, I don’t think many people realize why it’s not surprising that fundamental statistical research should come from a brewery, and why we don’t hear of statistical research coming out of wineries.
Beer makers pride themselves on consistency while wine makers pride themselves on variety. That’s why you’ll never hear beer fans talk about a “good year” the way wine connoisseurs do. Because they value consistency, beer makers invest more in extensive statistical quality control than wine makers do.
{ 2 comments }
What is an acceptable probability of finding bug parts in a box of cereal? You can’t say zero. As the acceptable probability goes to zero, the price of a box of cereal goes to infinity. In practice, the FDA sets very small but non-zero limits on the probability of finding bug parts in food. This is unsettling at first, but there’s no rational way around it.
What is an acceptable probability of finding bugs in your software? Again, you can’t say zero. The cost increases without bound as the quality requirements increase. In my previous post, I wrote about the extraordinary quality procedures for writing software for space probes. And yet even these projects have to tolerate some non-zero probability of error. It’s not worthwhile to spend 10 billion dollars to prevent a bug in a billion dollar mission.
Bugs are a fact of life. We can insist that they are unacceptable or we can pretend they don’t exist, but neither approach is constructive. It’s better to focus on the probability of running into bugs and consequences of running into bugs.
Not all bugs have the same consequences. It’s distasteful to find a piece of a roach leg in your can of green beans, but it’s not the end of the world. Toxic microscopic bugs are more serious. Along the same lines, a software bug that causes incorrect hyphenation is hardly the same as a bug that causes a plane crash. To get an idea of the potential economic cost of running into a bug, and therefore the resources worthwhile to detect and fix it, multiply the probability by the consequences.
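As a rough sketch of that multiplication in Python: the bug names, probabilities, and dollar figures below are invented purely for illustration, not data from any real project.

```python
def expected_cost(probability, consequence_cost):
    """Expected cost of a bug: probability of hitting it times what it costs."""
    return probability * consequence_cost


def rank_by_expected_cost(bugs):
    """Return bug names sorted from highest to lowest expected cost.

    `bugs` maps a name to a (probability, consequence_cost) pair.
    """
    return sorted(bugs, key=lambda name: expected_cost(*bugs[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical numbers: a common but cheap bug versus a rare but costly one.
    bugs = {
        "incorrect hyphenation": (0.05, 100.0),
        "corrupted customer data": (0.0001, 5_000_000.0),
    }
    for name in rank_by_expected_cost(bugs):
        print(name, expected_cost(*bugs[name]))
```

With these made-up numbers the rare bug ranks first, even though users run into it far less often, because its consequences are so much larger.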
How do you estimate the probabilities of software bugs? The same way you estimate the probability of bugs in food: by conducting experiments and analyzing data. Some people find this very hard to accept. They understand that testing is necessary in the physical world, but they think software is entirely different and must be proven correct in some mathematical sense. They object that computer programs are complex systems, too complex to test. Computer programs are complex, but human bodies are far more complex, and yet we conduct tests on human subjects all the time to estimate different probabilities, such as the probabilities of drug toxicity.
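One simple way to turn test results into a probability statement, sketched here in Python, is an upper confidence bound on the failure probability when every one of n test runs has passed. This is a standard binomial calculation, not something specific to any tool, and it assumes the runs are independent and representative of real use, which is a strong assumption. The function name is hypothetical.

```python
def upper_bound_failure_probability(num_trials, confidence=0.95):
    """Upper confidence bound on failure probability after num_trials passes.

    If the true failure probability is p, the chance of seeing no failures in
    n independent runs is (1 - p)**n. Setting that equal to 1 - confidence
    and solving for p gives the bound below.
    """
    if num_trials <= 0:
        raise ValueError("num_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / num_trials)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, "clean runs:", upper_bound_failure_probability(n))
```

At 95% confidence the bound is close to 3/n, the so-called rule of three. A thousand clean runs do not show the failure probability is zero, only that it is probably below about 0.3%.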
Another objection to software testing is that it can only test paths through the software that are actually taken, not all potential paths. That’s true, but the most important data when estimating the probability of running into a bug is data from people using the software under normal conditions. A bug that you never run into has no consequences.
But what about people using software in unanticipated ways? I certainly find it frustrating when I uncover bugs when I use a program in an atypical way. But this is not terribly different from physical systems. Bridges may fail when they’re subject to loads they weren’t designed for. There is a difference, however. Most software is designed to permit far more uses than can be tested, whereas there’s less of a gap in physical systems between what is permissible and what is testable. Unit testing helps. If every component of a software system works correctly in isolation, it is more likely, though not certain, that the components will work correctly together in a new situation. Still, there’s no getting around the fact that the best tested uses are the most likely to succeed.
{ 0 comments }
I’ve run into the same theme in very different contexts lately: people ignore data from crashes.
FlowingData has an article today claiming that, contrary to popular belief, some parts of an airplane are safer than others. According to the article, pundits routinely claim that all seats are equally safe even though data show that the probability of surviving a plane crash varies from 49% in the front of the aircraft up to 69% in the rear.
Also today, Coding Horror published its second article on software crashes. See Crashing Responsibly and Twitter: How Not To Crash Responsibly. Many applications don’t collect data from crashes, and those that do don’t always make good use of it.
Finally, Scott Shane’s book The Illusions of Entrepreneurship examines small business crashes. Entrepreneurs, investors, and policy makers often make decisions based on myths that are soundly refuted by data.
{ 0 comments }