Chris Wiggins gave an excellent talk at Rice this afternoon on data science at the New York Times. In the Q&A afterward, someone asked how you would set up a machine learning algorithm where you’re trying to optimize for outcomes and for information.
Here’s how I’ve approached this dilemma in the past. Information and outcomes are not directly comparable, so any objective function combining the two is going to add incommensurate things. One way around this is to put a value not on information per se but on what you’re going to do with the information.
In the context of a clinical trial, you want to treat patients in the trial effectively, and you want a high probability of picking the best treatment at the end. You can’t add patients and probabilities meaningfully. But why do you want to know which treatment is best? So you can treat future patients effectively. The value of knowing which treatment is best is the increase in the expected number of future successful treatments. This gives you a meaningful objective: maximize the expected number of effective treatments, of patients in the trial and future patients treated as a result of the trial.
The hard part is guessing what will be done with the information learned in the trial. Of course this isn’t exactly known, but it’s the key to estimating the value of the information. If nobody will be treated any differently in the future no matter how the trial turns out—and unfortunately this may be the case—then the trial should be optimized strictly for the benefit of the people in the trial. On the other hand, if a trial will determine the standard of care for thousands of future patients, then there is more value in picking the best treatment.
An interim analysis of a clinical trial is an unusual analysis. At the end of the trial you want to estimate how well some treatment X works. For example, you want to how likely is it that treatment X works better than the control treatment Y. But in the middle of the trial you want to know something more subtle.
It’s possible that treatment X is doing so poorly that you want to end the trial without going any further. It’s also possible that X is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with futility. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments X and Y are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.
Maybe treatment X is doing a little better than Y, but not so much better that you can declare with confidence that X is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.
Futility analysis is more about resources than ethics. If X is doing poorly, ethics might dictate that you stop giving X to patients so you stop early. If X is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if X is doing so-so, there’s usually not an ethical reason to stop, unless X is worse than Y on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.
Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think you’re estimate regard effectiveness in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the trial, not of the treatment per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.
This is why predictive probability is natural for interim analysis. You’re trying to predict outcomes, not parameters. (This is subtle: you’re trying to estimate the probability of outcomes that lead to certain estimates of parameters, namely those that allow you to reach a conclusion with pre-specified significance.)
Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a p-value threshhold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (Bayesian) predictive probability of meeting your frequentist criteria!
We’re about to see a lot of new, powerful, inexpensive medical devices come out. And to my surprise, I’ve contributed to a few of them.
Growing compute power and shrinking sensors open up possibilities we’re only beginning to explore. Even when the things we want to observe elude direct measurement, we may be able to infer them from other things that we can now measure accurately, inexpensively, and in high volume.
In order to infer what you’d like to measure from what you can measure, you need a mathematical model. Or if you’d like to make predictions about the future from data collected in the past, you need a model. And that’s where I come in. Several companies have hired me to help them create medical devices by working on mathematical models. These might be statistical models, differential equations, or a combination of the two. I can’t say much about the projects I’ve worked on, at least not yet. I hope that I’ll be able to say more once the products come to market.
I started my career doing mathematical modeling (partial differential equations) but wasn’t that interested in statistics or medical applications. Then through an unexpected turn of events, I ended up spending a dozen years working in the biostatistics department of the world’s largest cancer center.
After leaving MD Anderson and starting my consultancy, several companies have approached me for help with mathematical problems associated with their idea for a medical device. These are ideal projects because they combine my earlier experience in mathematical modeling with my more recent experience with medical applications.
If you have an idea for a medical device, or know someone who does, let’s talk. I’d like to help.
This afternoon I’m giving a talk at the Houston INFORMS chapter entitled “Bayesian adaptive clinical trials: promise and pitfalls.”
When I started working in adaptive clinical trials, I was very excited about the potential of such methods. The clinical trial methods most commonly used are very crude, and there’s plenty of room for improvement.
Over time I became concerned about overly complex methods, methods which were good for academic publication but may not be best for patients. Such methods are extremely time-consuming to develop and may not perform as well in practice as simpler methods.
There’s a great deal of opportunity between the extremes, methods that are more sophisticated than the status quo without being unnecessarily complex.
Bayesian methods for designing clinical trials have become more common, and yet these Bayesian designs are almost always evaluated by frequentist criteria. For example, a trial may be designed to stop early 95% of the time under some bad scenario and stop no more than 20% of the time under some good scenario.
These criteria are arbitrary, since the “good” and “bad” scenarios are arbitrary, and because the stopping probability requirements of 95% and 20% are arbitrary. Still, there’s an idea in lurking in the background that in every design there must be something that is shown to happen no more than 5% of the time.
It takes a great deal of effort to design Bayesian methods with desired frequentist properties. It’s an inverse problem, searching for the parameters in a high-dimensional design space, usually via lengthy simulation, that cause the method to satisfy some criteria. Of course frequentist methods satisfy frequentist criteria by design and so meet these criteria with far less effort. It’s rare to see the tables turned, evaluating frequentist methods by Bayesian criteria.
Sometimes the effort to beat frequentist designs at their own game is futile because the frequentist designs are optimal by their own criteria. More often, however, the Bayesian and frequentist methods being compared are not direct competitors but only analogs. The aim in this case is to match the frequentist method’s operating characteristics by one criterion while doing better by a new criterion.
Sometimes a Bayesian method can be shown to have better frequentist operating characteristics than its frequentist counterpart. This puts dogmatic frequentists in the awkward position of admitting that what they see as an unjustified approach to statistics has nevertheless produced a superior product. Some anti-Bayesians are fine with this, happy to have a procedure with better frequentist properties, even though it happened to be discovered via a process they view as illegitimate.
“Reproducible” and “randomized” don’t seem to go together. If something was unpredictable the first time, shouldn’t it be unpredictable if you start over and run it again? As is often the case, we want incompatible things.
But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to be random, and why would we want it to be reproducible?
One of the purposes in randomized experiments is the hope of scattering complicating factors evenly between two groups. For example, one way to test two drugs on a 1000 people would be to gather 1000 people and give the first drug to all the men and the second to all the women. But maybe a person’s sex has something to do with how the drug acts. If we randomize between two groups, it’s likely that about the same number of men and women will be in each group.
The example of sex as a factor is oversimplified because there’s reason to suspect a priori that sex might make a difference in how a drug performs. The bigger problem is that factors we can’t anticipate or control may matter, and we’d like them scattered evenly between the two treatment groups. If we knew what the factors were, we could assure that they’re evenly split between the groups. The hope is that randomization will do that for us with things we’re unaware of. For this purpose we don’t need a process that is “truly random,” whatever that means, but a process that matches our expectations of how randomness should behave. So a pseudorandom number generator (PRNG) is fine. No need, for example, to randomize using some physical source of randomness like radioactive decay.
Another purpose in randomization is for the assignments to be unpredictable. We want a physician, for example, to enroll patients on a clinical trial without knowing what treatment they will receive. Otherwise there could be a bias, presumably unconscious, against assigning patients with poor prognosis if the physicians know the next treatment be the one they hope or believe is better. Note here that the randomization only has to be unpredictable from the perspective of the people participating in and conducting the trial. The assignments could be predictable, in principle, by someone not involved in the study.
And why would you want an randomization assignments to be reproducible? One reason would be to test whether randomization software is working correctly. Another might be to satisfy a regulatory agency or some other oversight group. Still another reason might be to defend your randomization in a law suit. A physical random number generator, such as using the time down to the millisecond at which the randomization is conducted would achieve random assignments and unpredictability, but not reproducibility.
Computer algorithms for generating random numbers (technically pseudo-random numbers) can achieve reproducibility, practically random allocation, and unpredictability. The randomization outcomes are predictable, and hence reproducible, to someone with access to the random number generator and its state, but unpredictable in practice to those involved in the trial. The internal state of the random number generator has to be saved between assignments and passed back into the randomization software each time.
Random number generators such as the Mersenne Twister have good statistical properties, but they also carry a large amount of state. The random number generator described here has very small state, 64 bits, and so storing and returning the state is simple. If you needed to generate a trillion random samples, Mersenne Twitster would be preferable, but since RCTs usually have less than a trillion subjects, the RNG in the article is perfectly fine. I have run the Die Harder random number generator quality tests on this generator and it performs quite well.
The article Deming, data and observational studies by S. Stanley Young and Alan Karr opens with
Any claim coming from an observational study is most likely to be wrong.
They back up this assertion with data about observational studies later contradicted by prospective studies.
Much has been said lately about the assertion that most published results are false, particularly observational studies in medicine, and I won’t rehash that discussion here. Instead I want to cut to the process Young and Karr propose for improving the quality of observational studies. They summarize their proposal as follows.
The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set.
They then describe an eight-step process in detail. One step is that cleaning the data and dividing it into a modelling set and a holdout set would be done by different people than the modelling and analysis. They then explain why this would lead to more truthful publications.
The holdout set is the key. Both the author and the journal know there is a sword of Damocles over their heads. Both stand to be embarrassed if the holdout set does not support the original claims of the author.
* * *
The full title of the article is Deming, data and observational studies: A process out of control and needing fixing. It appeared in the September 2011 issue of Significance.
Arguments over the difference between statistics and machine learning are often pointless. There is a huge overlap between the two approaches to analyzing data, sometimes obscured by differences in vocabulary. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.
This distinction is valid in broad strokes, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.
Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model per se is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.
Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers model as independent events known not to be independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.
There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence effect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption effects both trials that should and should not stop, and the model predictions are reduced to a binary decision.
In a dose-finding clinical trial, you have a small number of doses to test, and you hope find the one with the best response. Here “best” may mean most effective, least toxic, closest to a target toxicity, some combination of criteria, etc.
Since your goal is to find the best dose, it seems natural to compare dose-finding methods by how often they find the best dose. This is what is most often done in the clinical trial literature. But this seemingly natural criterion is actually artificial.
Suppose a trial is testing doses of 100, 200, 300, and 400 milligrams of some new drug. Suppose further that on some scale of goodness, these doses rank 0.1, 0.2, 0.5, and 0.51. (Of course these goodness scores are unknown; the point of the trial is to estimate them. But you might make up some values for simulation, pretending with half your brain that these are the true values and pretending with the other half that you don’t know what they are.)
Now suppose you’re evaluating two clinical trial designs, running simulations to see how each performs. The first design picks the 400 mg dose, the best dose, 20% of the time and picks the 300 mg dose, the second best dose, 50% of the time. The second design picks each dose with equal probability. The latter design picks the best dose more often, but it picks a good dose less often.
In this scenario, the two largest doses are essentially equally good; it hardly matters how often a method distinguishes between them. The first method picks one of the two good doses 70% of the time while the second method picks one of the two good doses only 50% of the time.
This example was exaggerated to make a point: obviously it doesn’t matter how often a method can pick the better of two very similar doses, not when it very often picks a bad dose. But there are less obvious situations that are quantitatively different but qualitatively the same.
The goal is actually to find a good dose. Finding the absolute best dose is impossible. The most you could hope for is that a method finds with high probability the best of the four arbitrarily chosen doses under consideration. Maybe the best dose is 350 mg, 843 mg, or some other dose not under consideration.
A simple way to make evaluating dose-finding methods less arbitrary would be to estimate the benefit to patients. Finding the best dose is only a matter of curiosity in itself unless you consider how that information is used. Knowing the best dose is important because you want to treat future patients as effectively as you can. (And patients in the trial itself as well, if it is an adaptive trial.)
Suppose the measure of goodness in the scenario above is probability of successful treatment and that 1,000 patients will be treated at the dose level picked by the trial. Under the first design, there’s a 20% chance that 51% of the future patients will be treated successfully, and a 50% chance that 50% will be. The expected number of successful treatments from the two best doses is 352. Under the second design, the corresponding number is 252.5.
(To simplify the example above, I didn’t say how often the first design picks each of the two lowest doses. But the first design will result in at least 382 expected successes and the second design 327.5.)
You never know how many future patients will be treated according to the outcome of a clinical trial, but there must be some implicit estimate. If this estimate is zero, the trial is not worth conducting. In the example given here, the estimate of 1,000 future patients is irrelevant: the future patient horizon cancels out in a comparison of the two methods. The patient horizon matters when you want to include the benefit to patients in the trial itself. The patient horizon serves as a way to weigh the interests of current versus future patients, an ethically difficult comparison usually left implicit.
Dose-finding trials of chemotherapy agents look for the MTD: maximum tolerated dose. The idea is to give patients as much chemotherapy as they can tolerate, hoping to do maximum damage to tumors without doing too much damage to patients.
But “maximum tolerated dose” implies a degree of personalization that rarely exists in clinical trials. Phase I chemotherapy trials don’t try to find the maximum dose that any particular patient can tolerate. They try to find a dose that is toxic to a certain percentage of the trial participants, say 30%. (This rate may seem high, but it’s typical. It’s not far from the toxicity rate implicit in the so-called 3+3 rule or from the explicit rate given in many CRM (“continual reassessment method”) designs.)
It’s tempting to think of “30% toxicity rate” as meaning that each patient experiences a 30% toxic reaction. But that’s not what it means. It means that each patient has a 30% chance of a toxicity, however toxicity is defined in a particular trial. If toxicity were defined as kidney failure, for example, then 30% toxicity rate means that each patient has a 30% probability of kidney failure, not that they should expect a 30% reduction in kidney function.
My interest in the Anil Potti scandal started when my former colleagues could not reproduce the analysis in one of Potti’s papers. (Actually, they did reproduce the analysis, at great effort, in the sense of forensically determining the erroneous steps that were carried out.) Two years ago, the story was on 60 Minutes. The straw that broke the camel’s back was not bad science but résumé padding.
It looks like the story is a matter of fraud rather than sloppiness. This is unfortunate because sloppiness is much more pervasive than fraud, and this could have made a great case study of bad analysis. However, one could look at it as a case study in how good analysis (by the folks at MD Anderson) can uncover fraud.
Now there’s a new development in the Potti saga. The latest issue of The Cancer Letter contains letters by whistle-blower Bradford Perez who warned officials at Duke about problems with Potti’s research.
When you sort data and look at which sample falls in a particular position, that’s called order statistics. For example, you might want to know the smallest, largest, or middle value.
Order statistics are robust in a sense. The median of a sample, for example, is a very robust measure of central tendency. If Bill Gates walks into a room with a large number of people, the mean wealth jumps tremendously but the median hardly budges.
But order statistics are not robust in this sense: the identity of the sample in any given position can be very sensitive to perturbation. Suppose a room has an odd number of people so that someone has the median wealth. When Bill Gates and Warren Buffett walk into the room later, the value of the median income may not change much, but the person corresponding to that income will change.
One way to evaluate machine learning algorithms is by how often they pick the right winner in some sense. For example, dose-finding algorithms are often evaluated on how often they pick the best dose from a set of doses being tested. This can be a terrible criteria, causing researchers to be mislead by a particular set of simulation scenarios. It’s more important how often an algorithm makes a good choice than how often it makes the best choice.
Suppose five drugs are being tested. Two are nearly equally effective, and three are much less effective. A good experimental design will lead to picking one of the two good drugs most of the time. But if the best drug is only slightly better than the next best, it’s too much to expect any design to pick the best drug with high probability. In this case it’s better to measure the expected utility of a decision rather than how often a design makes the best decision.
The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.
An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.
If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when it’s assumptions are flagrantly violated.
With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high dimensional space, and it is impossible to adequately explore a high dimensional space with a small number of points. Even if these scenarios were chosen at random—which would be an improvement over manually selecting scenarios that present a method in the best light—how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.
Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after its too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.
This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs — not the data in the experiment but the experimental design itself—each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.
Suppose you’ve written a program that randomly assigns test subjects to one of two treatments, A or B, with equal probability. The researcher using your program calls you to tell you that your software is broken because it has assigned treatment A to seven subjects in a row.
You might argue that the probability of seven A’s in a row is 1/2^7 or about 0.008. Not impossible, but pretty small. Maybe the software is broken.
But this line of reasoning grossly underestimates the probability of a run of 7 identical assignments. If someone asked the probability that the next 7 assignments would all be A’s, then 1/2^7 would be the right answer. But that’s not the same as asking whether an experiment is likely to see a run of length 7 because the run could start any time, not just on the next assignment. Also, the phone didn’t ring out of the blue: it rang precisely because there had just been a run.
Suppose you have a coin that has probability of heads p and you flip this coin n times. A rule of thumb says that the expected length of the longest run of heads is about
provided that n(1-p) is much larger than 1.
So in a trial of n = 200 subjects with p = 0.5, you’d expect the longest run of heads to be about seven in a row. When p is larger than 0.5, the longest expected run will be longer. For example, if p = 0.6, you’d expect a run of about 9.
The standard deviation of the longest run length is roughly 1/log(1/p), independent of n. For coin flips with equal probability of heads or tails, this says an approximate 95% confidence interval would be about 3 either side of the point estimate. So for 200 tosses of a fair coin, you’d expect the longest run of heads to be about 7 ± 3, or between 4 and 10.
The following Python code gives an estimate of the probability that the longest run is between a and b inclusive, based on an extreme value distribution.
What if you were interested in the longest run of head or tails? With a fair coin, this just adds 1 to the estimates above. To see this, consider a success to be when consecutive coins turn up the same way. This new sequence has the same expected run lengths, but a run of length m in this sequence corresponds to a run of length m + 1 in the original sequence.
For more details, see “The Surprising Predictability of Long Runs” by Mark F. Schilling, Mathematics Magazine 85 (2012), number 2, pages 141–149.