Managing biological data

Jon Udell’s latest Interviews with Innovators podcast features Randall Julian of Indigo BioSystems. I found this episode particularly interesting because it deals with issues I have some experience with.

The problems in managing biological data begin with how to store the raw experiment data. As Julian says

… without buying into all the hype around semantic web and so on, you would argue that a flexible schema makes more sense in a knowledge gathering or knowledge generation context than a fixed schema does.

So you need something less rigid than a relational database and something with more structure than a set of Excel spreadsheets. That’s not easy, and I don’t know whether anyone has come up with an optimal solution yet. Julian said that he has seen many attempts to put vast amounts of biological data into a rigid relational database schema but hasn’t seen this approach succeed yet. My experience has been similar.

Representing raw experimental data isn’t enough. In fact, that’s the easy part. As Jon Udell comments during the interview

It’s easy to represent data. It’s hard to represent the experiment.

That is, the data must come with ample context to make sense of the data. Julian comments that without this context, the data may as well be a list of zip codes. And not only must you capture experimental context, you must describe the analysis done to the data. (See, for example, this post about researchers making up their own rules of probability.)

Julian comments on how electronic data management is not nearly as common as someone unfamiliar with medical informatics might expect.

So right now maybe 50% of the clinical trials in the world are done using electronic data capture technology. … that’s the thing that maybe people don’t understand about health care and the life sciences in general is that there is still a huge amount of paper out there.

Part of the reason for so much paper goes back to the belief that one must choose between highly normalized relational data stores and unstructured files. Given a choice between inflexible bureaucracy and chaos, many people choose chaos. It may work about as well, and it’s much cheaper to implement. I’ve seen both extremes. I’ve also been part of a project using a flexible but structured approach that worked quite well.

Related posts:

Posts on reproducibility
Problems versus dilemmas
Blogging about reproducible research

Bayesian clinical trials in one zip code

I recently ran across this quote from Mithat Gönen of Memorial Sloan-Kettering Cancer Center:

While there are certainly some at other centers, the bulk of applied Bayesian clinical trial design in this country is largely confined to a single zip code.

from “Bayesian clinical trials: no more excuses,” Clinical Trials 2009; 6: 203.

The zip code Gönen alludes to is 77030, the zip code of M. D. Anderson Cancer Center. I can’t say how much activity there is elsewhere, but certainly we design and conduct a lot of Bayesian clinical trials at MDACC.

Related posts:

Cartoon guide to cancer research
Stopping trials of ineffective drugs sooner
Three ways of tuning an adaptively randomized clinical trial
Population drift

Off to Puerto Rico

I’m leaving today for San Juan. I’m giving a couple talks at a conference on clinical trials.

Puerto Rico is beautiful. (I want to say a “lovely island,” but then the song America from West Side Story gets stuck in my head.) Here are a couple photos from my last visit.

Science versus medicine

Before I started working for a cancer center, I was not aware of the tension between science and medicine. Popular perception is that the two go together hand in glove, but that’s not always true.

Physicians are trained to use their subjective judgment and to be decisive. And for good reason: making a fairly good decision quickly is often better than making the best decision eventually. But scientists must be tentative, withhold judgment, and follow protocols.

Sometimes physician-scientists can reconcile their two roles, but sometimes they have to choose to wear one hat or the other at different times.

The physician-scientist tension is just one facet of the constant tension between treating each patient effectively and learning how to treat future patients more effectively. Sometimes the interests of current patients and future patients coincide completely, but not always.

This ethical tension is part of what makes biostatistics a separate field of statistics. In manufacturing, for example, you don’t need to balance the interests of current light bulbs and future light bulbs. If you need to destroy 1,000 light bulbs to find out how to make better bulbs in the future, no big deal. But different rules apply when experimenting on people. Clinical trials will often use statistical designs that sacrifice some statistical power in order to protect the people participating in the trial. Ethical constraints make biostatistics interesting.

Sometimes it's right under your nose

Neptune was discovered in 1846. But Galileo’s notebooks describe a “star” he saw on 28 December 1612 and 2 January 1613 that we now know was Neptune. Galileo even noticed that his star was in a slightly different location for his two observations, but he chalked the difference up to observational error.

The men who discovered Neptune were not the first to see it; they were the first to realize what they were looking at.

Voyager 2 photo of Neptune via Wikipedia

Drug looks promising, come back in 30 years

The most recent 60-Second Science podcast summarizes a paper in Science magazine reporting that the average interval between a drug being deemed “promising” and the publication of the first paper showing clinical effectiveness is 24 years.

Note that the publication of a paper saying a drug is clinically effective is a far cry from regulatory approval. Many new drugs that look like an improvement after a phase II trial turn out to be no better than existing treatments, and those that really are better take years to achieve regulatory approval.

See also

False positives for medical papers
Most published research results are false

Random inequalities VII: three or more variables

The previous posts in this series have looked at P(X > Y), the probability that a sample from a random variable X is greater than a sample from an independent random variable Y. In applications, X and Y have different distributions but come from the same distribution family.

Sometimes applications require computing P(X > max(Y, Z)). For example, an adaptively randomized trial of three treatments may be designed to assign a treatment with probability equal to the probability that that treatment has the best response. In a trial with a binary outcome, the variables X, Y, and Z may be beta random variables representing the probability of response. In a trial with a time-to-event outcome, the variables might be gamma random variables representing survival time.
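For the binary-outcome case described above, the randomization probabilities are easy to estimate by simulation. The sketch below is a hypothetical illustration, not the production software: the response counts and the uniform prior are made-up assumptions.

```python
# Hypothetical sketch: adaptive randomization probabilities for three arms
# with binary outcomes. The data and the uniform prior are made up.
import numpy as np

rng = np.random.default_rng(42)

def randomization_probs(arm_data, n_samples=100_000):
    """Estimate P(arm i has the highest response rate) by Monte Carlo.

    arm_data: list of (successes, failures) per arm; the posterior for
    each arm is beta(1 + successes, 1 + failures) under a uniform prior.
    """
    # One column of posterior draws per arm
    draws = np.column_stack([
        rng.beta(1 + s, 1 + f, n_samples) for s, f in arm_data
    ])
    # Fraction of draws in which each arm had the highest response rate
    best = np.argmax(draws, axis=1)
    return np.bincount(best, minlength=len(arm_data)) / n_samples

probs = randomization_probs([(7, 3), (5, 5), (2, 8)])
```

The resulting vector sums to one and can be used directly as the assignment probabilities for the next patient.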

Sometimes we’re interested in the opposite inequality, P(X < min(Y,Z)). This would be the case if we thought in terms of failures rather than responses, or wanted to minimize the time to a desirable event rather than maximizing the time to an undesirable event.

The maximum and minimum inequalities are related by the following equation:

P(X < min(Y,Z)) = P(X > max(Y, Z)) + 1 – P(X > Y) – P(X > Z).

These inequalities are used for safety monitoring rules as well as to determine randomization probabilities. In a trial seeking to maximize responses, a treatment arm X might be dropped if P(X > max(Y,Z)) becomes too small.
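The identity relating the min and max probabilities actually holds event by event, not just in expectation: for continuous variables, each sample point satisfies the corresponding indicator identity. So a simulation check should agree to within rounding error. Here is a quick sketch with arbitrary gamma parameters:

```python
# Monte Carlo check of
#   P(X < min(Y,Z)) = P(X > max(Y,Z)) + 1 - P(X > Y) - P(X > Z)
# using arbitrary (made-up) gamma parameters.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.gamma(2.0, 1.5, n)
y = rng.gamma(3.0, 1.0, n)
z = rng.gamma(2.5, 1.2, n)

p_min = np.mean(x < np.minimum(y, z))
p_max = np.mean(x > np.maximum(y, z))
p_xy = np.mean(x > y)
p_xz = np.mean(x > z)

lhs = p_min
rhs = p_max + 1 - p_xy - p_xz
```

Because the identity holds sample by sample, lhs and rhs agree essentially exactly, not merely up to Monte Carlo error.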

In principle one could design an adaptively randomized trial with n treatment arms for any n ≥ 2 based on P(X1 > max(X2, …, Xn)). In practice, the most common value of n by far is 2. Sometimes n is 3. I’m not familiar with an adaptively randomized trial with more than three arms. I’ve heard of an adaptively randomized trial that was designed with five arms, but I don’t believe the trial ran.

Computing P(X1 > max(X2, …, Xn)) by numerical integration becomes more difficult as n increases. For large n, simulation may be more efficient than integration. Computing P(X1 > max(X2, …, Xn)) for gamma random variables with n = 3 was unacceptably slow in a previous version of our adaptive randomization software. The search for a faster algorithm led to this paper: Numerical Evaluation of Gamma Inequalities.
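For independent variables the probability reduces to a one-dimensional integral, P(X1 > max(X2, …, Xn)) = ∫ f1(x) F2(x) ⋯ Fn(x) dx, where fi and Fi are the density and CDF of Xi. A rough sketch of this approach using scipy quadrature (the parameters are arbitrary, and no claim of efficiency is made):

```python
# Compute P(X1 > max(X2, X3)) for independent gamma variables by
# integrating the density of X1 against the CDFs of the others.
# Shape/scale parameters below are made up for illustration.
import numpy as np
from scipy import stats, integrate

dists = [stats.gamma(sh, scale=sc)
         for sh, sc in [(2.0, 1.0), (2.5, 1.0), (3.0, 1.0)]]

def p_first_is_max(dists):
    """P(X1 > max of the rest) for a list of independent frozen distributions."""
    f = dists[0].pdf
    Fs = [d.cdf for d in dists[1:]]
    integrand = lambda x: f(x) * np.prod([F(x) for F in Fs], axis=0)
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

p = p_first_is_max(dists)
```

A useful sanity check: rotating which arm plays the role of X1 gives three probabilities that must sum to one.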

Previous posts on random inequalities:

Analytical results
Numerical results
Cauchy distributions
Beta distributions
Gamma distributions

Random inequalities VI: gamma distributions

This post looks at computing P(X > Y) where X and Y are gamma random variables. These inequalities are central to the Thall-Wooten method of monitoring single-arm clinical trials with time-to-event outcomes. They also are central to adaptively randomized clinical trials with time-to-event outcomes.

When X and Y are gamma random variables, P(X > Y) can be computed in terms of the incomplete beta function. Suppose X has shape αX and scale βX, and Y has shape αY and scale βY. Then βX Y/(βX Y + βY X) has a beta(αY, αX) distribution. (This result is well known in the special case where both scale parameters equal 1. I wrote up the more general result here, but I don’t imagine I was the first to stumble on the generalization.) It follows that

P(X > Y) = P(B < βX/(βX + βY))

where B is a beta(αY, αX) random variable.
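A minimal sketch of this reduction, assuming scipy (whose gamma distribution uses the same shape/scale convention); the function name is made up, and the usage value was checked against the exponential special case:

```python
# Hypothetical helper, not from any paper: P(X > Y) for independent
# gamma variables, computed via the beta CDF (regularized incomplete
# beta function). Uses scipy's shape/scale parameterization.
from scipy import stats

def prob_gamma_gt(alpha_x, beta_x, alpha_y, beta_y):
    """P(X > Y), X ~ gamma(shape=alpha_x, scale=beta_x),
    Y ~ gamma(shape=alpha_y, scale=beta_y), X and Y independent."""
    return stats.beta.cdf(beta_x / (beta_x + beta_y), alpha_y, alpha_x)

p = prob_gamma_gt(2.0, 1.5, 3.0, 1.0)  # ≈ 0.475
```

For exponentials (both shapes 1) this recovers the familiar closed form: with scales 1 and 2, P(X > Y) = 1/3.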

For more details, see Numerical evaluation of gamma inequalities.

Previous posts on random inequalities:

Analytical results
Numerical results
Cauchy distributions
Beta distributions

Stopping trials of ineffective drugs earlier

Valen Johnson and I recently posted a working paper on a method for stopping trials of ineffective drugs earlier. For Bayesians, we argue that our method is more consistently Bayesian than other methods in common use. For frequentists, we show that our method has better frequentist operating characteristics than the most commonly used safety monitoring method.

The paper looks at binary and time-to-event trials. The results are most dramatic for the time-to-event analog of the Thall-Simon method, the Thall-Wooten method, as shown below.

This graph plots the probability of concluding that an experimental treatment is inferior when simulating from true mean survival times ranging from 2 to 12 months. The trial is designed to test a null hypothesis of 6 months mean survival against an alternative hypothesis of 8 months mean survival. When the true mean survival time is less than the alternative hypothesis of 8 months, the Bayes factor design is more likely to stop early. And when the true mean survival time is greater than the alternative hypothesis, the Bayes factor method is less likely to stop early.

The Bayes factor method also outperforms the Thall-Simon method for monitoring single-arm trials with binary outcomes. The Bayes factor method stops more often when it should and less often when it should not. However, the difference in operating characteristics is not as pronounced as in the time-to-event case.

The paper also compares the Bayes factor method to the frequentist mainstay, the Simon two-stage design. Because the Bayes factor method uses continuous monitoring, the method is able to use fewer patients while maintaining the type I and type II error rates of the Simon design as illustrated in the graph below.

Graph: Bayes factor vs. Simon two-stage designs

The graph above plots the number of patients used in a trial testing a null hypothesis of a 0.2 response rate against an alternative of a 0.4 response rate. Design 8 is the Bayes factor method advocated in the paper. Designs 7a and 7b are variations on the Simon two-stage design. The horizontal axis gives the true probabilities of response. We simulated true probabilities of response varying from 0 to 1 in increments of 0.05. The vertical axis gives the number of patients treated before the trial was stopped. When the true probability of response is less than the alternative hypothesis, the Bayes factor method treats fewer patients. When the true probability of response is better than the alternative hypothesis, the Bayes factor method treats slightly more patients.

Design 7a is the strict interpretation of the Simon method: one interim look at the data and another analysis at the end of the trial. Design 7b is the Simon method as implemented in practice, stopping when the criteria for continuing cannot be met at the next analysis. (For example, if the design says to stop if there are three or fewer responses out of the first 15 patients, then the method would stop after the 12th patient if there have been no responses.) In either case, the Bayes factor method uses fewer patients. The rejection probability curves, not shown here, show that the Bayes factor method matches (actually, slightly improves upon) the type I and type II error rates for the Simon two-stage design.

Random inequalities V: beta distributions

I’ve put a lot of effort into writing software for evaluating random inequality probabilities with beta distributions because such inequalities come up quite often in application. For example, beta inequalities are at the heart of the Thall-Simon method for monitoring single-arm trials and adaptively randomized trials with binary endpoints.

It’s not easy to evaluate P(X > Y) accurately and efficiently when X and Y are independent beta random variables. I’ve seen several attempts that were either inaccurate or slow, including a few attempts on my part. Efficiency is important because this calculation is often in the inner loop of a simulation study. Part of the difficulty is that the calculation depends on four parameters and no single algorithm will work well for all parameter combinations.

Let g(a, b, c, d) equal P(X > Y) where X ~ beta(a, b) and Y ~ beta(c, d). Then the function g has several symmetries.

  • g(a, b, c, d) = 1 – g(c, d, a, b)
  • g(a, b, c, d) = g(d, c, b, a)
  • g(a, b, c, d) = g(d, b, c, a)

The first two relations were published by W. R. Thompson in 1933, but as far as I know the third relation first appeared in this technical report in 2003.
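The three symmetries are easy to spot-check numerically, computing g by direct integration of the density of X against the CDF of Y. The parameters below are arbitrary test values:

```python
# Numerical spot-check of the symmetries of g(a, b, c, d) = P(X > Y),
# X ~ beta(a, b), Y ~ beta(c, d), independent. Test parameters are made up.
from scipy import stats, integrate

def g(a, b, c, d):
    """P(X > Y) via the integral of f_X(x) * F_Y(x) over [0, 1]."""
    integrand = lambda x: stats.beta.pdf(x, a, b) * stats.beta.cdf(x, c, d)
    val, _ = integrate.quad(integrand, 0, 1)
    return val

a, b, c, d = 2.3, 1.7, 3.1, 1.4
```

For example, with (a, b) = (1, 1) the first variable is uniform and g(1, 1, c, d) = d/(c + d), which the third symmetry maps to g(d, 1, c, 1), a probability with the same closed form.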

For special values of the parameters, the function g(a, b, c, d) can be computed in closed form. Some of these special cases are when

  • one of the four parameters is an integer
  • a + b + c + d = 1
  • a + b = c + d = 1.

The function g(a, b, c, d) also satisfies several recurrence relations that make it possible to bootstrap the latter two special cases into more results. Define the beta function B(a, b) as Γ(a)Γ(b)/Γ(a + b) and define h(a, b, c, d) as B(a+c, b+d)/( B(a, b) B(c, d) ). Then the following recurrence relations hold.

  • g(a+1, b, c, d) = g(a, b, c, d) + h(a, b, c, d)/a
  • g(a, b+1, c, d) = g(a, b, c, d) – h(a, b, c, d)/b
  • g(a, b, c+1, d) = g(a, b, c, d) – h(a, b, c, d)/c
  • g(a, b, c, d+1) = g(a, b, c, d) + h(a, b, c, d)/d
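These recurrences can also be verified numerically, again computing g by direct integration and h from log-gamma values for stability. The parameters are arbitrary test values:

```python
# Spot-check of the recurrence relations for g(a, b, c, d) = P(X > Y),
# X ~ beta(a, b), Y ~ beta(c, d). Test parameters are made up.
from math import lgamma, exp
from scipy import stats, integrate

def g(a, b, c, d):
    """P(X > Y) via the integral of f_X(x) * F_Y(x) over [0, 1]."""
    integrand = lambda x: stats.beta.pdf(x, a, b) * stats.beta.cdf(x, c, d)
    return integrate.quad(integrand, 0, 1)[0]

def logB(a, b):
    """log of the beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def h(a, b, c, d):
    """B(a+c, b+d) / (B(a, b) B(c, d)), computed via logs."""
    return exp(logB(a + c, b + d) - logB(a, b) - logB(c, d))

a, b, c, d = 1.5, 2.0, 2.5, 1.8
```

Note that the signs match intuition: increasing a or d shifts X up or Y down stochastically, so g increases; increasing b or c does the opposite.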

For more information about beta inequalities, see these papers:

Numerical computation of stochastic inequality probabilities
Exact calculation of beta inequalities

Previous posts on random inequalities:

Analytical results
Numerical results
Cauchy distributions