Aging with grace

Bill’s comment on my previous post reminded me of a book I read a few years ago, Aging With Grace by David Snowdon. The author describes what he learned about aging and especially about Alzheimer’s disease by studying a community of nuns. (Nuns make ideal subjects for epidemiological studies. They have very similar lifestyles, and so a number of confounding variables are reduced. Also, nuns keep excellent records.) The book is a pleasant mixture of science and human interest stories.

Snowdon says in his book that it is nearly impossible to accurately diagnose the extent of Alzheimer’s disease in a patient without an autopsy. Some nuns in the study who were believed to have advanced Alzheimer’s disease in fact did not. Others who were mentally sharp until they died were discovered on autopsy to have suffered extensive damage from the disease. (Snowdon tells the story of one nun in particular who was believed to be senile but who was actually quite witty. She was hard of hearing and reluctant to talk. Few people had the patience to carry on a conversation with her, but Snowdon drew her out.)

Nuns who had greater vocabulary and verbal skill earlier in their lives (as measured by essays the nuns wrote upon entering their order) and those who remained mentally active (for example, those who were teachers) fared better as they aged. They may have had more redundant mental pathways so that as Alzheimer’s disease knocked out pathways at random, enough pathways survived to allow these women to communicate well.

Brain plasticity

Today’s Big Ideas podcast carried a lecture by Norman Doidge on neuroplasticity, the recently-discovered ability of the brain to rewire itself. Doidge relates several amazing stories of people who have recovered from severe strokes or other brain injuries by developing detours around the damaged areas. Hearing of people who have had the persistence to re-learn how to use an arm or leg inspires me to not give up so easily when I face comparatively trivial challenges.

Doidge gives several explanations for why it has taken so long to discover neuroplasticity. Until very recently, scientific orthodoxy has held that neuroplasticity is impossible. Patients were told they’d never be able, for example, to use their left arm again. This became a self-fulfilling prognosis as most patients would not work to do what they were told would be impossible. But what about patients who ignored medical advice and were able to recover lost functionality? Why did that not persuade scientists that neuroplasticity was possible? The patient’s recovery was interpreted as evidence that the brain damage must not have been as extensive as initially believed, since the alternative was known to be impossible.

Wine, Beer, and Statistics

William Gosset discovered the t-distribution while working for the Guinness brewing company. Because his employer prevented employees from publishing papers, Gosset published his research under the pseudonym Student. That’s why his distribution is often called Student’s t-distribution.

This story is fairly well known. It often appears in the footnotes of statistics textbooks. However, I don’t think many people realize why it’s not surprising that fundamental statistical research should come from a brewery, and why we don’t hear of statistical research coming out of wineries.

Beer makers pride themselves on consistency while wine makers pride themselves on variety. That’s why you’ll never hear beer fans talk about a “good year” the way wine connoisseurs do. Because they value consistency, beer makers invest more in extensive statistical quality control than wine makers do.

Two definitions of expectation

In an introductory probability class, the expected value of a random variable X is defined as

E(X) = \int_{-\infty}^{\infty} x \, f_X(x) \, dx

where f_X is the probability density function of X. I’ll call this the analytical definition.

In a more advanced class the expected value of X is defined as

E(X) = \int_\Omega X \, dP

where (Ω, P) is a probability space. I’ll call this the measure-theoretic definition. It’s not obvious that these two definitions are equivalent. They may even seem contradictory unless you look closely: they’re integrating different functions over different spaces.

If for some odd reason you learned the measure-theoretic definition first, you could see the analytical definition as a theorem. But if, like most people, you learn the analytical definition first, the measure-theoretic version is quite mysterious. When you take an advanced course and look at the details previously swept under the rug, probability looks like an entirely different subject, unrelated to your introductory course. The definition of expectation is just one concept among many that take some work to resolve.

I’ve written a couple pages of notes that bridge the gap between the two definitions of expectation and show that they are equivalent.
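For readers who want the outline, here is a sketch of the standard argument connecting the two definitions. It is the usual textbook route and not necessarily the one taken in the notes; the symbol \mu_X (the distribution of X on the real line) is notation introduced here for illustration.

```latex
% \mu_X denotes the distribution (pushforward measure) of X on the real line
\mu_X(B) = P(X \in B) \quad \text{for Borel sets } B \subseteq \mathbb{R}

% Change of variables: integrate over the real line instead of \Omega
E(X) = \int_\Omega X \, dP = \int_{\mathbb{R}} x \, d\mu_X(x)

% If X has a density f_X, i.e. \mu_X(B) = \int_B f_X(x) \, dx, then
\int_{\mathbb{R}} x \, d\mu_X(x) = \int_{-\infty}^{\infty} x \, f_X(x) \, dx
```

The first equality moves the integral from Ω to the real line, and the second uses the density. That is why the two definitions integrate different functions over different spaces and still agree.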

Why computer scientists count from zero

In my previous post, cohort assignments in clinical trials, I mentioned in passing how you could calculate cohort numbers from accrual numbers if the world were simpler than it really is.

Suppose you want to treat patients in groups of 3. If you count patients and cohorts starting from 1, then patients 1, 2, and 3 are in cohort 1. Patients 4, 5, and 6 are in cohort 2. Patients 7, 8, and 9 are in cohort 3, etc. In general patient n is in cohort 1 + ⌊(n-1)/3⌋.

If you start counting patients and cohorts from 0, then patients 0, 1, and 2 are in cohort 0. Patients 3, 4, and 5 are in cohort 1. Patients 6, 7, and 8 are in cohort 2, etc. In general patient n is in cohort ⌊n/3⌋.

These kinds of calculations, common in computer science, are often simpler when you start counting from 0. If you want to divide things (patients, memory locations, etc.) into groups of size k, the nth item is in group ⌊n/k⌋. In C notation, integer division truncates to an integer and so the expression is even simpler: n/k.
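Here is a small Python sketch of the two counting schemes. The function names are made up for illustration, and Python's `//` operator plays the role of the floor brackets for non-negative numbers.

```python
def cohort_one_based(n: int, k: int = 3) -> int:
    """Cohort of patient n when patients and cohorts are both counted from 1."""
    return 1 + (n - 1) // k


def cohort_zero_based(n: int, k: int = 3) -> int:
    """Cohort of patient n when patients and cohorts are both counted from 0."""
    return n // k


# Patients 1..9 counted from 1, then patients 0..8 counted from 0.
print([cohort_one_based(n) for n in range(1, 10)])   # [1, 1, 1, 2, 2, 2, 3, 3, 3]
print([cohort_zero_based(n) for n in range(0, 9)])   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The zero-based version needs no offsets going in or coming out, which is the whole appeal.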

Counting centuries is confusing because we count from 1. That’s why the 1900’s were the 20th century etc. If we called the century immediately following the birth of Christ the 0th century, then the 1900’s would be the 19th century.

Because computer scientists usually count from 0, most programming languages also count from zero. Fortran and Visual Basic are notable exceptions.

The vast majority of humanity finds counting from 0 unnatural and so there is a conflict between how software producers and consumers count. Demanding that average users learn to count from zero is absurd. So the programmer must either use one-based counting internally, and risk confusing his peers, or use zero-based counting internally, and risk forgetting to do a conversion for input or output. I prefer the latter. The worst option is to vacillate between the two approaches.

Cohort assignments in clinical trials

Cohorts are very simple in theory but messy in practice. In a clinical trial, a cohort is a group of patients who receive the same treatment. For example, in dose-finding trials, it is very common to treat patients in groups of three. I’ll stick with cohorts of three just to be concrete, though nothing here depends particularly on this choice of cohort size.

If we number patients in the order in which they arrive, patients 1, 2, and 3 would be the first cohort. Patients 4, 5, and 6 would be the second cohort, etc. If it were always that simple, we could determine which cohort a patient belongs to based on their accrual number alone. To calculate a patient’s cohort number, subtract 1 from their accrual number, divide by 3, throw away any remainder, and add 1. In math symbols, the cohort number for patient #n would be 1 + ⌊(n-1)/3⌋. (See the next post.)

Here’s an example of why that won’t work. Suppose you treat patients 1, 2, and 3, then discover that patient #2 was not eligible for the trial after all. (This happens regularly.) Now a 4th patient enters the trial. What cohort are they in? If patient #4 arrived after you discovered that patient #2 was ineligible, you could put patient #4 in the first cohort, essentially taking patient #2’s place. But if patient #4 arrived before you discovered that patient #2 was ineligible, then patient #4 would receive the treatment assigned to the second cohort; the first cohort would have a hole in it and only contain two patients. You could treat patient #5 with the treatment of the first cohort to try to patch the hole, but that’s more confusing. It gets even worse if you’re on to the third or fourth cohort before discovering a gap in the first cohort.

In addition to patients being removed from a trial due to ineligibility, patients can remove themselves from a trial at any time.

There are numerous other ways the naïve view of cohorts can fail. A doctor may decide to give the same treatment to only two consecutive patients, or to four consecutive patients, letting medical judgment override the dose assignment algorithm for a particular patient. A mistake could cause a patient to receive the dose intended for another cohort. Researchers may be unable to access the software needed to make the dose assignment for a new cohort and so they give a new patient the dose from the previous cohort.

Cohort assignments can become so tangled that it is simply not possible to look at an ordered list of patients and their treatments after the fact and determine how the patients were grouped into cohorts. Cohort assignment is to some extent a mental construct, an expression of how the researcher thought about the patients, rather than an objective grouping.

Monitoring legacy code that fails silently

Clift Norris and I just posted an article on CodeProject entitled Monitoring Unreliable Scheduled Tasks about some software Clift wrote to resolve problems we had calling some legacy software that would fail silently. His software adds from the outside monitoring and logging functions that better software would have provided on the inside.

The monitoring and logging software, called RunAndWait, kicks off a child process and waits a specified amount of time for the process to complete. If the child does not complete in time, a list of people is notified by email. The software also checks return codes and writes all its activity to a log.
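To make the idea concrete, here is a minimal Python sketch of the same pattern. It is not the RunAndWait code from the article. The function name `run_and_wait`, the `notify` callback (standing in for the email step), and the choice to return `None` on a timeout are all assumptions made for illustration.

```python
import logging
import subprocess
import sys
from typing import Callable, Optional, Sequence

log = logging.getLogger("run_and_wait_sketch")


def run_and_wait(
    command: Sequence[str],
    timeout_seconds: float,
    notify: Callable[[str], None],
) -> Optional[int]:
    """Run a child process, wait up to timeout_seconds, and log what happened.

    Returns the child's return code, or None if it did not finish in time.
    `notify` is a hypothetical stand-in for emailing a list of people.
    """
    log.info("starting %s", list(command))
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        message = f"{list(command)} did not finish within {timeout_seconds} seconds"
        log.error(message)
        notify(message)
        return None
    # Record the return code so a failure is never silent.
    log.info("%s exited with return code %d", list(command), result.returncode)
    return result.returncode


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # A trivial child process; print() stands in for the email notification.
    run_and_wait([sys.executable, "-c", "print('nightly job')"], 30.0, print)
```

The point of the design is that the wrapper, not the legacy program, is responsible for noticing that something went wrong.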

RunAndWait is a simple program, but it has proven very useful over the last year and a half since it was written. We use RunAndWait in combination with PowerShell for scheduling our nightly processes to interact with the legacy system. Since PowerShell has verbose error reporting, calling RunAndWait from PowerShell rather than from cmd.exe gives additional protection against possible silent failures.

Fasting may reduce chemotherapy side-effects

Chemotherapy harms cancer cells as well as normal cells. Chemotherapy is designed to be more harmful to cancer cells than to normal cells, but the damage to normal cells can be brutal.

New studies suggest that fasting prior to receiving chemotherapy may reduce the number of normal cells harmed by the treatment. Fasting may put normal cells in a defensive mode that increases their resistance to chemical attack.

Saving energy by tolerating mistakes

Computer chips can use significantly less energy if they don’t have to be correct all the time. That’s the idea behind PCMOS — probabilistic complementary metal-oxide semiconductor technology. Here’s an excerpt from Technology Review’s article on PCMOS.

[Inventor Krishna] Palem’s idea is to lower the operating voltage of parts of a chip—specifically, the logic circuits that calculate the least significant bits, such as the 3 in the number 21,693. The resulting decrease in signal-to-noise ratio means those circuits would occasionally arrive at the wrong answer, but engineers can calculate the probability of getting the right answer for any specific voltage. “Relaxing the probability of correctness even a little bit can produce significant savings in energy,” Palem says.

In applications such as video processing, a small probability of error would not make a noticeable difference. It would be an interesting exercise to separate those parts of a system that require accuracy and those that tolerate error. For example, a cell phone might use high-accuracy chips for dialing phone numbers but low-accuracy chips for controlling the display in order to extend battery life.

Why functional programming hasn’t taken off

Bjarne Stroustrup made a comment in an interview about functional programming. He said advocates of functional programming have been in charge of computer science departments for 30 years now, and yet functional programming has hardly been used outside academia. Maybe it’s because it’s not practical, at least in its purest form.

I’ve heard other people say that functional programming is the way to go, but most programmers aren’t smart enough to work that way and it’s too hard for the ones who are smart enough to go against the herd. But there are too many brilliant maverick programmers out there to make such a condescending explanation plausible. Stroustrup’s explanation makes more sense.

Let me quickly address some objections.

  • Yes, there have been very successful functional programming projects.
  • Yes, procedural programming languages are adding support for functional programming.
  • Yes, the rise of multi-core processors is driving the search for ways to make concurrent programming easier, and functional programming has a lot to offer.

I fully expect there will be more functional programming in the future, but it will be part of a multi-paradigm approach. On the continuum between pure imperative programming and pure functional programming, development will move toward the functional end, but not all the way. A multi-paradigm approach could be a jumbled mess, but it doesn’t have to be. One could clearly delineate which parts of a code base are purely functional (say, because they need to run concurrently) and which are not (say, for efficiency). The problem of how to mix functional and procedural programming styles well seems interesting and tractable.

[Stroustrup’s remark came from an OnSoftware podcast. I’ve listened to several of his podcasts with OnSoftware lately but I don’t remember which one contained his comment about functional programming.]

Probability approximations

When I took my first probability course, it seemed like there were an infinite number of approximation theorems to learn, all mysterious. Looking back, there were probably only two or three, and they don’t need to be mysterious.

For example, under the right circumstances you can approximate a Binomial(n, p) well with a Normal(np, np(1-p)). While the relationship between the parameters in these two distributions is obvious to the initiated, it’s not at all obvious to a beginner. It seems much clearer to say that a Binomial can be approximated by a Normal with the same mean and variance. After all, a distribution that doesn’t get the mean and variance correct doesn’t sound like a very promising approximation.
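A quick way to see the "same mean and variance" recipe at work is to compare the two distributions numerically. The sketch below uses only the Python standard library. The function names and the example values n = 100, p = 0.3 are chosen for illustration, and the half-unit continuity correction is a common refinement that goes beyond what is described above.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k: int, n: int, p: float) -> float:
    """Approximate P(X <= k) using a normal with the same mean and variance."""
    mean = n * p
    variance = n * p * (1 - p)
    # The + 0.5 is a continuity correction (an added refinement, see above).
    return NormalDist(mean, sqrt(variance)).cdf(k + 0.5)


n, p = 100, 0.3
for k in (20, 25, 30, 35, 40):
    exact = binomial_cdf(k, n, p)
    approx = normal_approx_cdf(k, n, p)
    print(f"k={k}: exact={exact:.4f}  normal approx={approx:.4f}")
```

Trying a small n or a p near 0 or 1 shows what "the right circumstances" means: the agreement gets visibly worse.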

Taking it a step further, a good teacher could guide a class to discover this approximation themselves. This would take more time than simply stating the result and working an example or two, but the difference in understanding would be immense. And if you’re not going to take the time to aim for understanding, what’s the point in covering approximation theorems at all? They’re not used that often for computation anymore. In my opinion, the only reason to go over them is the insight they provide.

Bugs in food and software

What is an acceptable probability of finding bug parts in a box of cereal? You can’t say zero. As the acceptable probability goes to zero, the price of a box of cereal goes to infinity. In practice, the FDA sets very small but non-zero limits on the probability of finding bug parts in food. This is unsettling at first, but there’s no rational way around it.

What is an acceptable probability of finding bugs in your software? Again, you can’t say zero. The cost increases without bound as the quality requirements increase. In my previous post, I wrote about the extraordinary quality procedures for writing software for space probes. And yet even these projects have to tolerate some non-zero probability of error. It’s not worthwhile to spend 10 billion dollars to prevent a bug in a billion dollar mission.

Bugs are a fact of life. We can insist that they are unacceptable or we can pretend they don’t exist, but neither approach is constructive. It’s better to focus on the probability of running into bugs and consequences of running into bugs.

Not all bugs have the same consequences. It’s distasteful to find a piece of a roach leg in your can of green beans, but it’s not the end of the world. Toxic microscopic bugs are more serious. Along the same lines, a software bug that causes incorrect hyphenation is hardly the same as a bug that causes a plane crash. To get an idea of the potential economic cost of running into a bug, and therefore the resources worthwhile to detect and fix it, multiply the probability by the consequences.

How do you estimate the probabilities of software bugs? The same way you estimate the probability of bugs in food: by conducting experiments and analyzing data. Some people find this very hard to accept. They understand that testing is necessary in the physical world, but they think software is entirely different and must be proven correct in some mathematical sense. They object that computer programs are complex systems, too complex to test. Computer programs are complex, but human bodies are far more complex, and yet we conduct tests on human subjects all the time to estimate different probabilities, such as the probabilities of drug toxicity.

Another objection to software testing is that it can only test paths through the software that are actually taken, not all potential paths. That’s true, but the most important data when estimating the probability of running into a bug is data from people using the software under normal conditions. A bug that you never run into has no consequences.

But what about people using software in unanticipated ways? I certainly find it frustrating when I uncover bugs when I use a program in an atypical way. But this is not terribly different from physical systems. Bridges may fail when they’re subject to loads they weren’t designed for. There is a difference, however. Most software is designed to permit far more uses than can be tested, whereas there’s less of a gap in physical systems between what is permissible and what is testable. Unit testing helps. If every component of a software system works correctly in isolation, it is more likely, though not certain, that the components will work correctly together in a new situation. Still, there’s no getting around the fact that the best tested uses are the most likely to succeed.

Software in Space

The latest episode of Software Engineering Radio has an interview with Hans-Joachim Popp of the German aerospace company DLR. A bug in the software embedded in a space probe could cost years of lost time and billions of dollars. These folks have to write solid code.

The interview gives some details of the unusual practices DLR uses to produce such high quality code. For one, Popp said that his company writes an average of 12 lines of test code for every line of production code. They also pair junior and senior developers. The junior developer writes all the code, and the senior developer picks it apart.

Such extreme attention to quality doesn’t come cheap. Popp said that they produce about 0.6 lines of (production?) code per hour.

Small effective sample size does not mean uninformative

Today I talked to a doctor about the design of a randomized clinical trial that would use a Bayesian monitoring rule. The probability of response on each arm would be modeled as a binomial with a beta prior. Simple conjugate model. The historical response rate in this disease is only 5%, and so the doctor had chosen a beta(0.1, 1.9) prior so that the prior mean matched the historical response rate.

For beta distributions, the sum of the two parameters is called the effective sample size. There is a simple and natural explanation for why a beta(a, b) distribution is said to contain as much information as a+b data observations. By this criterion, the beta(0.1, 1.9) distribution is not very informative: it only has as much influence as two observations. However, viewed in another light, a beta(0.1, 1.9) distribution is highly informative.
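One common way to see the a+b interpretation, which may or may not be the explanation the author has in mind, is to look at the posterior mean. It relies on the standard conjugate update: a beta(a, b) prior followed by s successes and f failures gives a beta(a+s, b+f) posterior. The symbol n = s + f is introduced here for the number of observations.

```latex
% Posterior mean after s successes and f failures, with n = s + f
\frac{a + s}{a + b + n}
  = \frac{a + b}{a + b + n} \cdot \frac{a}{a + b}
  + \frac{n}{a + b + n} \cdot \frac{s}{n}
```

The posterior mean is a weighted average of the prior mean a/(a+b) and the observed response rate s/n, with weights proportional to a+b and n. In that sense the prior counts for as much as a+b observations.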

This trial was designed to stop when the posterior probability is more than 0.999 that one treatment is more effective than the other. That’s an unusually high standard of evidence for stopping a trial (a cutoff of 0.99 or smaller would be much more common), and yet a trial could stop after only six patients. If X is the probability of response on one arm and Y is the probability of response on the other, after three failures on the first treatment and three successes on the other, Pr(Y > X) > 0.999.

The explanation for how the trial could stop so early is that the prior distribution is, in an odd sense, highly informative. The trial starts with a strong assumption that each treatment is ineffective. This assumption is somewhat justified by experience, and yet a beta(0.1, 1.9) distribution doesn’t fully capture the investigator’s prior belief.

(Once at least one response has been observed, the beta(0.1, 1.9) prior becomes essentially uninformative. But until then, in this context, the prior is informative.)

A problem with a beta prior is that there is no way to specify the mean at 0.05 without also placing a large proportion of the probability mass below 0.05. The beta prior places too little probability on better outcomes that might reasonably happen. I imagine a more diffuse prior with mode 0.05 rather than mean 0.05 would better describe the prior beliefs regarding the treatments.

The beta prior is convenient because Bayes’ theorem takes a very simple form in this case: starting from a beta(a, b) prior and observing s successes and f failures, the posterior distribution is beta(a+s, b+f). But a prior less convenient algebraically could be more robust and better at representing prior information.
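The conjugate update is simple enough to write in a few lines of Python. This is only an illustration of the formula above applied to the six-patient scenario, with hypothetical function names, and not the software used for the trial.

```python
from typing import Tuple


def update_beta(a: float, b: float, successes: int, failures: int) -> Tuple[float, float]:
    """Posterior beta parameters after observing binomial data."""
    return a + successes, b + failures


def beta_mean(a: float, b: float) -> float:
    """Mean of a beta(a, b) distribution."""
    return a / (a + b)


prior = (0.1, 1.9)  # prior mean matches the 5% historical response rate

# Three failures on one arm, three successes on the other.
arm_x = update_beta(*prior, successes=0, failures=3)
arm_y = update_beta(*prior, successes=3, failures=0)

for name, (a, b) in (("prior", prior), ("arm X", arm_x), ("arm Y", arm_y)):
    print(f"{name}: beta({a:.1f}, {b:.1f}), mean {beta_mean(a, b):.3f}")
```

With a prior worth only two observations, three patients per arm are enough to move the posterior means far apart.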

A more basic observation is that “informative” and “uninformative” depend on context. This is part of what motivated Jeffreys to look for prior distributions that were equally (un)informative under a set of transformations. But Jeffreys’ approach isn’t the final answer. As far as I know, there’s no universally acceptable resolution to this dilemma.