Low-tech transparency

I recently received two large data files from a client, with names like foo.xlsx and foo.csv. Presumably these are redundant; the latter is probably an export of the former. I did a spot check and that seems to be the case.

Then I had a bright idea: use pandas to make sure the two files are the same. It’s an elegant solution: import both files as data frames, then use the compare() function to verify that they’re the same.

Except it didn’t work. I got a series of mysterious and/or misleading messages as I tried to track down the source of the problem, playing whack-a-mole with the data. There could be any number of reasons why compare() might not work on imported data: character encodings, inferred data types, etc.

So I used brute force. I exported the Excel file as CSV and compared the text files. This is low-tech, but transparent. It’s easier to compare text files than to plumb the depths of pandas.
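Here is a rough sketch of the same text-level comparison in Python, using the standard library's difflib in place of diff. The two strings stand in for the CSV exported from Excel and the CSV the client sent; their contents and the file names are made up for illustration.

```python
import difflib

def text_diff(a_text, b_text, a_name="export.csv", b_name="foo.csv"):
    """Return unified-diff lines between two CSV files given as text."""
    return list(difflib.unified_diff(
        a_text.splitlines(), b_text.splitlines(),
        fromfile=a_name, tofile=b_name, lineterm="",
    ))

# Made-up contents: the second file writes 150.0 as 150.
exported = "id,weight\n007,150.0\n012,162.5\n"
received = "id,weight\n007,150\n012,162.5\n"

for line in text_diff(exported, received):
    print(line)
```

Every mismatch shows up as a pair of lines you can read directly, which is the transparency that the data frame comparison lacked.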

One of the problems was that the data contained heights, such as 5'9". This causes problems with quoting, whether you enclose strings in single or double quotes. A couple quick sed one-liners resolved most of the mismatches. (Though not all. Of course it couldn’t be that simple …)
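To illustrate the quoting issue, here is a small sketch that uses Python's csv module rather than sed. It shows how a height such as 5'9" is written under the module's default conventions, and it compares parsed fields instead of raw text so that differences in quoting alone do not register as mismatches. The sample rows are made up.

```python
import csv
import io

def to_csv_line(row):
    """Serialize one row with the csv module's default (Excel-style) quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")

def parse_csv(text):
    """Parse CSV text into a list of rows, each a list of strings."""
    return list(csv.reader(io.StringIO(text)))

# The embedded double quote is doubled and the whole field is wrapped in quotes.
line = to_csv_line(["Alice", "5'9\""])
print(line)

# Two spellings of the same data: one quotes only when needed, one quotes everything.
a = 'name,height\nAlice,"5\'9"""\n'
b = '"name","height"\n"Alice","5\'9"""\n'
print(a == b)                        # the raw text differs
print(parse_csv(a) == parse_csv(b))  # the parsed fields agree
```

Comparing parsed fields will not catch everything, but it separates genuine data differences from differences in how two programs chose to quote the same value.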

It’s easier to work with data in a high-level environment like pandas. But it’s also handy to be able to use low-level tools like diff and sed for troubleshooting.

I suppose someone could write a book on how to import CSV files. If all goes well, it’s one line of code. Then there are a handful of patterns that handle the majority of remaining cases. Then there’s the long tail of unique pathologies. As Tolstoy would say, happy data sets are all alike, but unhappy data sets are each unhappy in their own way.

Data pathology


This post is an expansion of something I wrote on Twitter:

Data scientists often complain that the bulk of their work is data cleaning.

But if you see data cleaning as the work, not just an obstacle to the work, it can be interesting.

You could think of it as data pathology, a kind of analysis before the intended analysis.

Anything you have to do before you can get down to what you thought you would be doing can be irritating. And if you bid a project not anticipating preliminary difficulties, it can be costly as well.

But if you anticipate problems with dirty data, these problems can be fun to solve. They may offer more variety than the “real work.” These problems bring up some interesting questions.

  • How are the data actually represented and why?
  • Can you automate the cleaning process, or at least automate some of it?
  • Is the corruption in the data informative, i.e. does it reveal something in addition to the data itself?

Courses in mathematical statistics will not prepare you to answer any of these questions. Beginning data scientists may find this sort of forensic work irritating because they aren’t prepared for it. Maybe it’s a job for regular expressions rather than regression.

Or maybe they can do it well but would rather get on to something else.

Or maybe this sort of data pathology work would be more interesting with a change of attitude, seeing it as important work that requires experience and provides opportunities for creativity.

Predictive probability for large populations

Suppose you want to estimate the number of patients who respond to some treatment in a large population of size N and what you have is data on a small sample of size n.

The usual way of going about this calculates the proportion of responses in the small sample, then pretends that this is the precisely known proportion, and applies it to the population as a whole. This approach underestimates the uncertainty in the final estimate because it ignores the uncertainty in the initial estimate.

A more sophisticated approach creates a distribution representing what is known about the proportion of successes based on the small sample and any prior information. Then it estimates the posterior predictive probability of outcomes in the larger population based on this distribution. For more background on predictive probability, see my post on uncertainty in a probability.

This post will look at ways to compute predictive probabilities using the asymptotic approximation presented in the previous post.

Suppose your estimate of the probability of success, based on your small sample and prior information, has a Beta(α, β) distribution. Then the predictive probability of s successes and f failures is

 {s + f \choose s} \frac{B(s + \alpha, f + \beta)}{B(\alpha, \beta)}
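For reference, the expression above can be evaluated directly by working with logarithms of gamma functions, which avoids overflow for large s and f. This is a minimal sketch using only the standard library; the function names are my own.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_prob(s, f, alpha, beta):
    """Probability of s successes and f failures under a Beta(alpha, beta) prior."""
    log_binom = lgamma(s + f + 1) - lgamma(s + 1) - lgamma(f + 1)
    return exp(log_binom + log_beta(s + alpha, f + beta) - log_beta(alpha, beta))

print(predictive_prob(400, 600, 2, 3))
```

Two sanity checks: with a uniform prior, α = β = 1, every count from 0 to N is equally likely, so the result is 1/(N + 1); and for any prior the probabilities over s = 0, …, N sum to 1.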

The beta functions in the numerator and denominator are hard to understand and hard to compute, so we would like to replace them with something easier to understand and easier to compute.

We begin by expanding all the terms involving s and f in terms of gamma functions.

\frac{1}{B(\alpha,\beta)} \frac{\Gamma(s+f+1)}{\Gamma(s+1)\, \Gamma(f+1)} \frac{\Gamma(s+\alpha)\, \Gamma(f+\beta)}{\Gamma(s+f+\alpha+\beta)}

By rearranging the terms we can see three examples of the kind of gamma function ratios estimated in the previous post.

\frac{1}{B(\alpha,\beta)} \frac{\Gamma(s+\alpha)}{\Gamma(s+1)} \frac{\Gamma(f+\beta)}{\Gamma(f+1)} \frac{\Gamma(s+f+1)}{\Gamma(s+f+\alpha+\beta)}
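For readers without the previous post at hand: the simplest estimate of this kind is, I assume, the leading-order one for large x,

\frac{\Gamma(x+a)}{\Gamma(x+b)} \approx x^{a-b}

Applied to each of the three ratios, it gives

\frac{\Gamma(s+\alpha)}{\Gamma(s+1)} \approx s^{\alpha-1}, \qquad \frac{\Gamma(f+\beta)}{\Gamma(f+1)} \approx f^{\beta-1}, \qquad \frac{\Gamma(s+f+1)}{\Gamma(s+f+\alpha+\beta)} \approx (s+f)^{1-\alpha-\beta}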

Now we can apply the simplest approximation from the previous post to get the following.

\frac{s^{\alpha-1} f^{\beta-1} (s +f)^{1 - \alpha-\beta}}{B(\alpha, \beta)}

We can make this approximation more symmetric by expanding the beta function B(α, β) in terms of factorials:

 \frac{(\alpha + \beta-1)!}{(\alpha-1)!\, (\beta-1)!} \,\,\frac{s^{\alpha-1} f^{\beta-1}}{(s+f)^{\alpha+\beta-1}}

We saw in the previous post that we can make our asymptotic approximation much more accurate by shifting the arguments a bit, and the following expression, while a little more complicated, should be more accurate.

\frac{(\alpha + \beta-1)!}{(\alpha-1)!\, (\beta-1)!} \,\, \frac{ \left(s + \tfrac{\alpha}{2}\right)^{\alpha-1} \left(f + \tfrac{\beta}{2}\right)^{\beta-1} }{ \left(s +f + \tfrac{\alpha+\beta}{2}\right)^{\alpha+\beta-1} }

This should be accurate for sufficiently large s and f, but I haven’t tested it numerically and wouldn’t recommend using it without doing some testing first.
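One way to do that testing is to compare both approximations against the exact expression computed with log-gamma functions. The sketch below does this for a single, arbitrarily chosen case; it is a starting point for checking, not a validation. The function names and the test values are my own.

```python
from math import exp, lgamma

def exact(s, f, a, b):
    """Exact predictive probability via log-gamma functions."""
    log_beta = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    log_binom = lgamma(s + f + 1) - lgamma(s + 1) - lgamma(f + 1)
    return exp(log_binom + log_beta(s + a, f + b) - log_beta(a, b))

def inv_beta(a, b):
    """1 / B(a, b), the factorial-style constant in front."""
    return exp(lgamma(a + b) - lgamma(a) - lgamma(b))

def simple_approx(s, f, a, b):
    """Approximation without shifted arguments."""
    return inv_beta(a, b) * s**(a - 1) * f**(b - 1) / (s + f)**(a + b - 1)

def shifted_approx(s, f, a, b):
    """Approximation with arguments shifted by a/2, b/2 and (a+b)/2."""
    num = (s + a / 2)**(a - 1) * (f + b / 2)**(b - 1)
    return inv_beta(a, b) * num / (s + f + (a + b) / 2)**(a + b - 1)

def rel_error(approx, s, f, a, b):
    truth = exact(s, f, a, b)
    return abs(approx(s, f, a, b) - truth) / truth

case = (400, 600, 2, 3)  # s, f, alpha, beta: an arbitrary test case
print("simple :", rel_error(simple_approx, *case))
print("shifted:", rel_error(shifted_approx, *case))
```

A fuller test would sweep over small s or f and over non-integer α and β, since that is where an asymptotic approximation is most likely to break down.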

Good news from Pfizer and Moderna

Both Pfizer and Moderna have announced recently that their SARS-CoV-2 vaccine candidates reduce the rate of infection by over 90% in the active group compared to the control (placebo) group.

That’s great news. The vaccines may turn out to be less than 90% effective when all is said and done, but even so they’re likely to be far more effective than expected.

But there’s other good news that might be overlooked: the subjects in the control groups did well too, though not as well as in the active groups.

The infection rate was around 0.4% in the Pfizer control group and around 0.6% in the Moderna control group.

There were 11 severe cases of COVID in the Moderna trial, out of 30,000 subjects, all in the control group.

There were 0 severe cases of COVID in the Pfizer trial in either group, out of 43,000 subjects.
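To put the 90% figure next to the control group rates, here is the arithmetic as a short sketch. It takes efficacy to mean one minus the ratio of infection rates, uses the approximate control rates quoted above, and treats 90% as a lower bound, so the printed values are upper bounds on the active group rates implied by these numbers, not reported trial results.

```python
def efficacy(rate_active, rate_control):
    """Relative reduction in infection rate: 1 - active / control."""
    return 1.0 - rate_active / rate_control

def implied_active_rate(rate_control, eff):
    """Infection rate in the active group implied by a given efficacy."""
    return rate_control * (1.0 - eff)

# Approximate control-group rates from the text above.
for name, control in [("Pfizer", 0.004), ("Moderna", 0.006)]:
    print(name, implied_active_rate(control, 0.90))
```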


Real-time analytics

There’s an ancient saying “Whom the gods would destroy they first make mad.” (Mad as in crazy, not mad as in angry.) I wrote a variation of this on Twitter:

Whom the gods would destroy, they first give real-time analytics.

Having more up-to-date information is only valuable up to a point. Past that point, you’re more likely to be distracted by noise. The closer you look at anything, the more irregularities you see, and the more likely you are to over-steer [1].

I don’t mean to imply that the noise isn’t real. (More on that here.) But there’s a temptation to pay more attention to the small variations you don’t understand than the larger trends you believe you do understand.

I became aware of this effect when simulating Bayesian clinical trial designs. The more often you check your stopping rule, the more often you will stop [2]. You want to monitor a trial often enough to shut it down, or at least pause it, if things change for the worse. But monitoring too often can cause you to stop when you don’t want to.
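The effect is easy to reproduce in a toy simulation. The sketch below is not one of the trial designs I simulated; it assumes a deliberately simple model in which outcomes are normal with known unit variance, the prior is flat, the true effect is zero, and the rule stops when the posterior probability of a positive mean exceeds 0.975. All of the parameter values are arbitrary.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def rule_fires(n_patients, looks, threshold, rng):
    """Simulate one trial with no true effect; report whether the rule ever fires."""
    check_at = {round(n_patients * k / looks) for k in range(1, looks + 1)}
    total, fired = 0.0, False
    for i in range(1, n_patients + 1):
        total += rng.gauss(0.0, 1.0)  # mean-zero outcome: there is nothing to find
        if i in check_at:
            # Posterior P(mean > 0) under a flat prior and known unit variance.
            if STD_NORMAL.cdf(total / i ** 0.5) > threshold:
                fired = True  # keep drawing so every schedule sees the same data
    return fired

def stopping_rate(looks, n_patients=200, threshold=0.975, n_trials=1000, seed=1):
    """Fraction of simulated null trials in which the stopping rule fires."""
    rng = random.Random(seed)
    hits = sum(rule_fires(n_patients, looks, threshold, rng) for _ in range(n_trials))
    return hits / n_trials

for looks in (1, 5, 20):
    print(looks, stopping_rate(looks))
```

Because the same seed is used for each schedule and the look times are nested, the trials that stop with one look are a subset of those that stop with five, which are a subset of those that stop with twenty. The only thing that changes is how often you check.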

Flatter than glass

A long time ago I wrote about the graph below.

The graph looks awfully jagged, until you look at the vertical scale. The curve represents the numerical difference between two functions that are exactly equal in theory. As I explain in that post, the curve is literally smoother than glass, and certainly flatter than a pancake.


[1] See The Logic of Failure for a discussion of how over-steering is a common factor in disasters such as the Chernobyl nuclear failure.

[2] Bayesians are loath to talk about things like α-spending, but when you’re looking at stopping frequencies, frequentist phenomena pop up.

Understanding statistical error

A simple linear regression model has the form

y = μ + βx + ε.

This means that the output variable y is a linear function of the input variable x, plus some error term ε that is randomly distributed.

There’s a common misunderstanding over whose error the error term is. A naive view is that the world really is linear, that

y = μ + βx

is some underlying Platonic reality, and that the only reason that we don’t measure exactly that linear part is that we as observers have made some sort of error, that the fault is in the real world rather than in the model.

No, reality is what it is, and it’s our model that is in error. Some aspect of reality may indeed have a structure that is approximately linear (over some range, under some conditions), but when we truncate reality to only that linear approximation, we introduce some error. This error may be tolerable—and thankfully it often is—but the error is ours, not the world’s.
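A small sketch makes the point concrete. Here the "world" is y = x², observed with no measurement error at all, and we fit the linear model by ordinary least squares. The residuals are entirely the model's doing. The choice of x² and of the grid is arbitrary.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = mu + beta * x; returns (mu, beta)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta

xs = [i / 100 for i in range(101)]
ys = [x * x for x in xs]  # observed exactly: no noise, no observer error

mu, beta = fit_line(xs, ys)
residuals = [y - (mu + beta * x) for x, y in zip(xs, ys)]

print("mu, beta:", mu, beta)
print("largest residual:", max(abs(r) for r in residuals))
```

The residuals average out to zero, as least squares guarantees, but they are not random: they are the part of reality the straight line left out.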

COVID19 mortality per capita by state

Here’s a silly graph by Richard West with a serious point. States with longer names tend to have higher COVID-19 mortality. Of course no one believes there’s anything about the length of a state’s name that should impact the health of its residents. The correlation is real, but it’s a coincidence.

The variation between mortality in different states is really large. Something caused that, though not the length of the names. But here’s the kicker: you may come up with an explanation that’s much more plausible than length of name, and be just as wrong. Discovering causation is hard work, much harder than looking for correlations.

Randomization audit


“How would you go about drawing a random sample?”

I thought that was kind of a silly question. I was in my first probability class in college, and the professor started the course with this. You just take a sample, right?

As with many things in life, it gets more complicated when you get down to business. Random sample of what? Sampled according to what distribution? What kind of constraints are there?

How would you select a random sample of your employees in a way that you could demonstrate was not designed to include any particular employee? How can you not only be fair, but convince people who may not understand probability that you were fair?
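As one illustration of what a demonstrable selection could look like, the sketch below ranks employees by a hash of a publicly announced seed combined with each ID and takes the first k. Anyone with the seed and the list can re-run it and get the same answer. This is a generic technique offered as an assumption-laden example, not a description of any procedure I have designed for a client; it is only as credible as the process for fixing the seed, and the IDs and seed here are made up.

```python
import hashlib

def auditable_sample(ids, k, public_seed):
    """Pick k ids by ranking each one on a hash of the seed and the id."""
    def rank(emp_id):
        digest = hashlib.sha256(f"{public_seed}:{emp_id}".encode("utf-8"))
        return digest.hexdigest()
    return sorted(set(ids), key=rank)[:k]

employees = [f"E{n:04d}" for n in range(1, 501)]  # made-up ID numbers
chosen = auditable_sample(employees, 10, "seed announced before the draw")
print(chosen)
```

The result does not depend on the order in which the list is supplied, which removes one way of quietly steering the outcome.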

When there appear to be patterns in your random selections, how can you decide whether there has been an error or whether the results that you didn’t expect actually should have been expected?

How do you sample data that isn’t simply a list but instead has some sort of structure? Maybe you have data split into tables in a relational database. Maybe you have data that naturally forms a network. How can you sample from a relational database or a network in a way that respects the underlying structure?

How would you decide whether the software you’re using for randomization is adequate for your purposes? If it is adequate, are you using it the right way?

These are all questions that I’ve answered for clients. Sometimes they want me to develop randomization procedures. Sometimes they want me to audit the procedures they’re already using. They not only want good procedures, they want procedures that can stand up to critical review.

If you’d like me to design or audit randomization procedures for your company, let’s talk. You will get the assurance that you’re doing the right thing, and the ability to demonstrate that you’re doing the right thing.

CRM consulting gig

This morning I had someone from a pharmaceutical company call me with questions about conducting a CRM dose-finding trial and I mentioned it to my wife.

Then this afternoon she was reading a book in which there was a dialog between husband and wife including this sentence:

He launched into a technical explanation of his current consulting gig—something about a CRM implementation.

You can’t make this kind of thing up. A few hours before reading this line, my wife had exactly this conversation. However, I doubt the author and I had the same thing in mind.

In my mind, CRM stands for Continual Reassessment Method, a Bayesian method for Phase I clinical trials, especially in oncology. We ran a lot of CRM trials while I worked in biostatistics at MD Anderson Cancer Center.

For most people, presumably including the author of the book quoted above, CRM stands for Customer Relationship Management software.

Like my fictional counterpart, I know a few things about CRM implementation, but it’s a different kind of CRM.


Simple clinical trial of four COVID-19 treatments

A story came out in Science yesterday saying the World Health Organization is launching a trial of what it believes are the four most promising treatments for COVID-19 (a.k.a. SARS-CoV-2, novel coronavirus, etc.)

The four treatment arms will be

  • Remdesivir
  • Chloroquine and hydroxychloroquine
  • Ritonavir + lopinavir
  • Ritonavir + lopinavir + interferon beta

plus standard of care as a control arm.

I find the design of this trial interesting. Clinical trials are often complex and slow. Given a choice in a crisis between ponderously designing the perfect clinical trial and flying by the seat of their pants, health officials would rightly choose the latter. On the other hand, it would obviously be good to know which of the proposed treatments is most effective. So this trial has to be a compromise.

The WHO realizes that the last thing front-line healthcare workers want right now is the added workload of conducting a typical clinical trial. So this trial, named SOLIDARITY, will be very simple to run. According to the Science article,

When a person with a confirmed case of COVID-19 is deemed eligible, the physician can enter the patient’s data into a WHO website, including any underlying condition that could change the course of the disease, such as diabetes or HIV infection. The participant has to sign an informed consent form that is scanned and sent to WHO electronically. After the physician states which drugs are available at his or her hospital, the website will randomize the patient to one of the drugs available or to the local standard care for COVID-19.

… Physicians will record the day the patient left the hospital or died, the duration of the hospital stay, and whether the patient required oxygen or ventilation, she says. “That’s all.”

That may sound a little complicated, but by clinical trial standards the SOLIDARITY trial is shockingly simple. Normally you would have countless detailed case report forms, adverse event reporting, etc.

The statistics of the trial will be simple on the front end but complicated on the back end. There’s no sophisticated algorithm assigning treatments, just a randomization between available treatment options, including standard of care. I don’t see how you could do anything else, but this will create headaches for the analysis.
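The front end could be as simple as the following sketch. It is a guess at the logic based on the description quoted above, not the WHO's actual system, and it assumes equal allocation among the available options, which the article does not state. The arm labels are my own shorthand.

```python
import random

TRIAL_ARMS = [
    "remdesivir",
    "chloroquine/hydroxychloroquine",
    "ritonavir + lopinavir",
    "ritonavir + lopinavir + interferon beta",
]
CONTROL = "standard of care"

def randomize(available, rng=random):
    """Assign a patient to one locally available arm or to standard of care."""
    options = [arm for arm in TRIAL_ARMS if arm in available] + [CONTROL]
    return rng.choice(options)  # equal allocation is an assumption

# A hypothetical site that stocks only two of the four treatments.
site_stock = {"remdesivir", "ritonavir + lopinavir"}
print(randomize(site_stock))
```

The analysis headaches are visible even here: the set of options, and so the probability of landing in any one arm, depends on what each site happens to have on hand that day.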

Patients are randomized to available treatments—what else could you do? [1]—which means the treatment options vary by site and over time. The control arm, standard of care, also varies by site and could change over time as well.  Also, this trial is not double-blind. This is a trial optimized for the convenience of frontline workers, not for the convenience of statisticians.

The SOLIDARITY trial will be adaptive in the sense that a DSMB will look at interim results and decide whether to drop treatment arms that appear to be under-performing. Ideally there would be objective algorithms for making these decisions, carefully designed and simulated in advance, but there’s no time for that. Better to start learning immediately than to spend six months painstakingly designing a trial. Even if we could somehow go back in time and start the design process six months ago, there could very well be contingencies that the designers couldn’t anticipate.

The SOLIDARITY trial is an expedient compromise, introducing a measure of scientific rigor when there isn’t time to be as rigorous as we’d like.

Update: The hydroxychloroquine arm was dropped from the trial because a paper in the Lancet reported that the drug was neither safe nor effective. However, it now appears that the data used in the Lancet paper were fraudulent.

Update: The Lancet retracted the paper in question on June 4, 2020.


[1] You could limit the trial to sites that have all four treatment options available, cutting off most potential sources of data. The data would not be representative of the world at large and accrual would be slow. Or you could wait until all four treatments were distributed to clinics around the world, but there’s no telling how long that would take.