Probability approximations

When I took my first probability course, it seemed like there were an infinite number of approximation theorems to learn, all mysterious. Looking back, there were probably only two or three, and they don’t need to be mysterious.

For example, under the right circumstances you can approximate a Binomial(n, p) well with a Normal(np, np(1-p)). While the relationship between the parameters in these two distributions is obvious to the initiated, it’s not at all obvious to a beginner. It seems much clearer to say that a Binomial can be approximated by a Normal with the same mean and variance. After all, a distribution that doesn’t get the mean and variance correct doesn’t sound like a very promising approximation.
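To see how well this works, here’s a quick numerical check, just a sketch assuming Python with SciPy; the particular n and p are only for illustration.

    # Approximate Binomial(n, p) by a Normal with the same mean and variance.
    from scipy.stats import binom, norm

    n, p = 100, 0.3
    mean, variance = n * p, n * p * (1 - p)
    approx = norm(loc=mean, scale=variance ** 0.5)

    # Compare a few cumulative probabilities, using a continuity correction.
    for k in [20, 25, 30, 35, 40]:
        exact = binom.cdf(k, n, p)
        approximate = approx.cdf(k + 0.5)  # continuity correction
        print(f"P(X <= {k}): exact {exact:.4f}, normal approx {approximate:.4f}")

The agreement is close near the middle of the distribution and, as usual with this approximation, worse out in the tails.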

Taking it a step further, a good teacher could guide a class to discover this approximation themselves. This would take more time than simply stating the result and working an example or two, but the difference in understanding would be immense. And if you’re not going to take the time to aim for understanding, what’s the point in covering approximation theorems at all? They’re not used that often for computation anymore. In my opinion, the only reason to go over them is the insight they provide.

Bugs in food and software

What is an acceptable probability of finding bug parts in a box of cereal? You can’t say zero. As the acceptable probability goes to zero, the price of a box of cereal goes to infinity. The same is true of software: driving the probability of a bug to zero drives the cost of the software to infinity. Some of the most carefully developed code in the world is the software for space probes. And yet even these projects have to tolerate some non-zero probability of error. It’s not worthwhile to spend 10 billion dollars to prevent a bug in a billion dollar mission.

Bugs are a fact of life. We can insist that they are unacceptable or we can pretend they don’t exist, but neither approach is constructive. It’s better to focus on the probability of running into bugs and on the consequences of doing so.

Not all bugs have the same consequences. It’s distasteful to find a piece of a roach leg in your can of green beans, but it’s not the end of the world. Toxic microscopic bugs are more serious. Along the same lines, a software bug that causes incorrect hyphenation is hardly the same as a bug that causes a plane crash. To get an idea of the potential economic cost of running into a bug, and therefore the resources worthwhile to detect and fix it, multiply the probability by the consequences.

How do you estimate the probabilities of software bugs? The same way you estimate the probability of bugs in food: by conducting experiments and analyzing data. Some people find this very hard to accept. They understand that testing is necessary in the physical world, but they think software is entirely different and must be proven correct in some mathematical sense. They object that computer programs are complex systems, too complex to test. Computer programs are complex, but human bodies are far more complex, and yet we conduct tests on human subjects all the time to estimate different probabilities, such as the probabilities of drug toxicity.

Another objection to software testing is that it can only test paths through the software that are actually taken, not all potential paths. That’s true, but the most important data when estimating the probability of running into a bug is data from people using the software under normal conditions. A bug that you never run into has no consequences.

But what about people using software in unanticipated ways? I certainly find it frustrating to uncover bugs when I use a program in an atypical way. But this is not terribly different from physical systems. Bridges may fail when they’re subject to loads they weren’t designed for. There is a difference, however. Most software is designed to permit far more uses than can be tested, whereas there’s less of a gap in physical systems between what is permissible and what is testable. Unit testing helps. If every component of a software system works correctly in isolation, it is more likely, though not certain, that the components will work correctly together in a new situation. Still, there’s no getting around the fact that the best-tested uses are the most likely to succeed.

Software in Space

The latest episode of Software Engineering Radio has an interview with Hans-Joachim Popp of DLR, the German Aerospace Center. A bug in the software embedded in a space probe could cost years of lost time and billions of dollars. These folks have to write solid code.

The interview gives some details of the unusual practices DLR uses to produce such high quality code. For one, Popp said that his organization writes an average of 12 lines of test code for every line of production code. They also pair junior and senior developers. The junior developer writes all the code, and the senior developer picks it apart.

Such extreme attention to quality doesn’t come cheap. Popp said that they produce about 0.6 lines of (production?) code per hour.

Small effective sample size does not mean uninformative

Today I talked to a doctor about the design of a randomized clinical trial that would use a Bayesian monitoring rule. The probability of response on each arm would be modeled as a binomial with a beta prior. Simple conjugate model. The historical response rate in this disease is only 5%, and so the doctor had chosen a beta(0.1, 1.9) prior so that the prior mean matched the historical response rate.

For beta distributions, the sum of the two parameters is called the effective sample size. There is a simple and natural sense in which a beta(a, b) distribution contains as much information as a + b observations: under the conjugate updating rule described below, each observation adds one to either a or b, so the prior carries as much weight as a + b previous observations. By this criterion, the beta(0.1, 1.9) distribution is not very informative: it only has as much influence as two observations. However, viewed in another light, a beta(0.1, 1.9) distribution is highly informative.

This trial was designed to stop when the posterior probability is more than 0.999 that one treatment is more effective than the other. That’s an unusually high standard of evidence for stopping a trial (a cutoff of 0.99 or smaller would be much more common), and yet a trial could stop after only six patients. If X is the probability of response on one arm and Y is the probability of response on the other, after three failures on the first treatment and three successes on the other, Pr(Y > X) > 0.999.

The explanation for how the trial could stop so early is that the prior distribution is, in an odd sense, highly informative. The trial starts with a strong assumption that each treatment is ineffective. This assumption is somewhat justified by experience, and yet a beta(0.1, 1.9) distribution doesn’t fully capture the investigator’s prior belief.

(Once at least one response has been observed, the beta(0.1, 1.9) prior becomes essentially uninformative. But until then, in this context, the prior is informative.)

A problem with a beta prior is that there is no way to specify the mean at 0.05 without also placing a large proportion of the probability mass below 0.05. The beta prior places too little probability on better outcomes that might reasonably happen. I imagine a more diffuse prior with mode 0.05 rather than mean 0.05 would better describe the prior beliefs regarding the treatments.
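As a quick check of that claim, here’s a short sketch, assuming Python with SciPy:

    from scipy.stats import beta

    prior = beta(0.1, 1.9)
    print(prior.mean())     # 0.05, matching the historical response rate
    print(prior.cdf(0.05))  # most of the prior mass falls at or below 0.05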

The beta prior is convenient because Bayes’ theorem takes a very simple form in this case: starting from a beta(a, b) prior and observing s successes and f failures, the posterior distribution is beta(a+s, b+f). But a prior that is less convenient algebraically could be more robust and better at representing prior information.
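Here’s a minimal sketch of that update, assuming Python with SciPy; the success and failure counts are made up purely for illustration.

    from scipy.stats import beta

    def beta_binomial_update(a, b, successes, failures):
        """Posterior for a binomial probability with a beta(a, b) prior."""
        return beta(a + successes, b + failures)

    # Prior with effective sample size a + b = 2 and mean a / (a + b) = 0.05.
    prior_a, prior_b = 0.1, 1.9

    # Hypothetical data: 1 response out of 10 patients.
    posterior = beta_binomial_update(prior_a, prior_b, successes=1, failures=9)
    print(posterior.mean())  # posterior mean is (a + s) / (a + b + s + f)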

A more basic observation is that “informative” and “uninformative” depend on context. This is part of what motivated Jeffreys to look for prior distributions that were equally (un)informative under a set of transformations. But Jeffreys’ approach isn’t the final answer. As far as I know, there’s no universally acceptable resolution to this dilemma.

Related: Adaptive clinical trial design

C++ templates may reduce memory footprint

One of the complaints about C++ templates is that they can cause code bloat. But Scott Meyers pointed out in an interview that some people are using templates in embedded systems applications because templates result in smaller code.

C++ compilers only generate code for the template methods that are actually used in an application, so it’s possible that code using templates will produce a smaller executable than code written in a more traditional object-oriented style.

Greek letters and math symbols in (X)HTML

It’s not hard to use Greek letters and math symbols in (X)HTML, but apparently it’s not common knowledge either. Many pages insert little image files every time they need a special character. Such web pages look a little like ransom notes with letters cut from multiple sources. Sometimes this is necessary but often it can be avoided.

I’ve posted a couple pages on using Greek letters and math symbols in HTML, XML, XHTML, TeX, and Unicode. I included TeX because it’s the lingua franca for math typography, and I included Unicode because the X(HT)ML representation of symbols is closely related to Unicode.

The notes give charts for encoding Greek letters and some of the most common math symbols. They explain how HTML and XHTML differ in this context and also discuss browser compatibility issues.

Solving the problem with Visual SourceSafe and time zones

If you use Microsoft Visual SourceSafe (VSS) with developers in more than one time zone, you may be in for an unpleasant surprise.

VSS uses the local time on each developer’s machine as the time of a check-in or check-out. If every developer is in the same time zone and their clocks are synchronized to a common reference, say an NTP server, then no problems result. But what if one developer is in a different time zone? Say one developer is in Boston and one in Houston. The Boston developer checks in a file at 2:00 PM Eastern time, then 10 minutes later the Houston developer checks out the file, quickly makes a change, and checks the file back in at 2:20 Eastern time, which is 1:20 local Central time. VSS now says the latest version of the file is the one made at 2:00 Eastern time. When the Houston developer looks at VSS, the Boston developer’s changes appear to have been made 40 minutes in the future!

This has been a problem with VSS for all versions prior to VSS 2005, and is still a problem in the most recent version by default. Starting with the 2005 version, you can configure VSS 2005 to use “server local time.” This means all transactions will use the time on the server where the VSS repository is located. The time is stored internally as UTC (GMT) but displayed to each user according to his own time zone. In the example above, the server would record the Boston check-in as 7:00 PM UTC and the Houston check-in as 7:20 PM UTC. The Boston user would see the check-ins as happening at 2:00 and 2:20 Eastern time, and the Houston user would see the check-ins as happening at 1:00 and 1:20 Central time. Importantly, everyone agrees which check-in occurred first.
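The underlying issue is just timestamp comparison. Here’s a small sketch in Python (nothing to do with VSS itself; the date is arbitrary) showing why comparing times converted to UTC gives a consistent order while comparing raw local wall-clock times does not.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    eastern = ZoneInfo("America/New_York")
    central = ZoneInfo("America/Chicago")

    # The example above, on an arbitrary winter date so that
    # Eastern = UTC-5 and Central = UTC-6.
    boston_checkin = datetime(2008, 1, 15, 14, 0, tzinfo=eastern)    # 2:00 PM Eastern
    houston_checkin = datetime(2008, 1, 15, 13, 20, tzinfo=central)  # 1:20 PM Central

    # Comparing raw wall-clock times puts the check-ins in the wrong order...
    print(boston_checkin.replace(tzinfo=None) < houston_checkin.replace(tzinfo=None))  # False

    # ...but converting both to UTC, which is effectively what server local
    # time does, recovers the true order: Boston first, then Houston.
    print(boston_checkin.astimezone(timezone.utc) < houston_checkin.astimezone(timezone.utc))  # True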

A more subtle version of the problem can occur even if all users are in the same time zone but have not synchronized their clocks. This is a good reason to use server local time even if everyone works in the same city.

Although it is possible to set server local time in VSS 2005, it still uses client local time by default, presumably for backward compatibility. You have to turn on server local time by opening the VSS Administrator tool and clicking on Tools/Options and going to the Time Zone tab.

[Screenshot: SourceSafe Options dialog for time zones]

Microsoft has written about this problem at http://support.microsoft.com/kb/931804. Note that the solution only applies to Visual SourceSafe 2005 and later.

When you set the time zone, you will get dire warnings discouraging you from doing so.

To avoid unintended data loss, do not change the time zone for a Visual SourceSafe database after it has been established and is being used.

You may have to let VSS sit idle for as many hours as the number of time zones your developers span to let everything synchronize.

Identical twins are not genetically identical

Researchers recently discovered that identical twins are not genetically identical after all. They differ in the copy numbers of their genes. They have the same genes, but each may have different numbers of copies of certain genes.

Source: “Copy That” by Charles Q. Choi, Scientific American, May 2008.

You have a website?

I was talking to my wife about my website last night. One of my daughters interrupted with “You have a website?!” Then one of her sisters put things in perspective. “Yeah, but it doesn’t have any games.”