COVID-19 mortality per capita by state

Here’s a silly graph by Richard West with a serious point. States with longer names tend to have higher COVID-19 mortality. Of course no one believes there’s anything about the length of a state’s name that should impact the health of its residents. The correlation is real, but it’s a coincidence.

The variation in mortality between states is really large. Something caused that variation, though not the length of the names. But here’s the kicker: you may come up with an explanation that’s much more plausible than length of name, and be just as wrong. Discovering causation is hard work, much harder than looking for correlations.

Randomization audit

“How would you go about drawing a random sample?”

I thought that was kind of a silly question. I was in my first probability class in college, and the professor started the course with this. You just take a sample, right?

As with many things in life, it gets more complicated when you get down to business. Random sample of what? Sampled according to what distribution? What kind of constraints are there?

How would you select a random sample of your employees in a way that you could demonstrate was not designed to include any particular employee? How can you not only be fair, but convince people who may not understand probability that you were fair?

When there appear to be patterns in your random selections, how can you decide whether there has been an error or whether the results that you didn’t expect actually should have been expected?

How do you sample data that isn’t simply a list but instead has some sort of structure? Maybe you have data split into tables in a relational database. Maybe you have data that naturally forms a network. How can you sample from a relational database or a network in a way that respects the underlying structure?

How would you decide whether the software you’re using for randomization is adequate for your purposes? If it is adequate, are you using it the right way?

These are all questions that I’ve answered for clients. Sometimes they want me to develop randomization procedures. Sometimes they want me to audit the procedures they’re already using. They not only want good procedures, they want procedures that can stand up to critical review.

If you’d like me to design or audit randomization procedures for your company, let’s talk. You will get the assurance that you’re doing the right thing, and the ability to demonstrate that you’re doing the right thing.

CRM consulting gig

This morning I had someone from a pharmaceutical company call me with questions about conducting a CRM dose-finding trial and I mentioned it to my wife.

Then this afternoon she was reading a book in which there was a dialog between husband and wife including this sentence:

He launched into a technical explanation of his current consulting gig—something about a CRM implementation.

You can’t make this kind of thing up. A few hours before reading this line, my wife had exactly this conversation. However, I doubt the author and I had the same thing in mind.

In my mind, CRM stands for Continual Reassessment Method, a Bayesian method for Phase I clinical trials, especially in oncology. We ran a lot of CRM trials while I worked in biostatistics at MD Anderson Cancer Center.

For most people, presumably including the author of the book quoted above, CRM stands for Customer Relationship Management software.

Like my fictional counterpart, I know a few things about CRM implementation, but it’s a different kind of CRM.

More clinical trial posts

Simple clinical trial of four COVID-19 treatments

A story came out in Science yesterday saying the World Health Organization is launching a trial of what it believes are the four most promising treatments for COVID-19 (a.k.a. SARS-CoV-2, novel coronavirus, etc.)

The four treatment arms will be

  • Remdesivir
  • Chloroquine and hydroxychloroquine
  • Ritonavir + lopinavir
  • Ritonavir + lopinavir + interferon beta

plus standard of care as a control arm.

I find the design of this trial interesting. Clinical trials are often complex and slow. Given a choice in a crisis between ponderously designing the perfect clinical trial and flying by the seat of their pants, health officials would rightly choose the latter. On the other hand, it would obviously be good to know which of the proposed treatments is most effective. So this trial has to be a compromise.

The WHO realizes that the last thing front-line healthcare workers want right now is the added workload of conducting a typical clinical trial. So this trial, named SOLIDARITY, will be very simple to run. According to the Science article,

When a person with a confirmed case of COVID-19 is deemed eligible, the physician can enter the patient’s data into a WHO website, including any underlying condition that could change the course of the disease, such as diabetes or HIV infection. The participant has to sign an informed consent form that is scanned and sent to WHO electronically. After the physician states which drugs are available at his or her hospital, the website will randomize the patient to one of the drugs available or to the local standard care for COVID-19.

… Physicians will record the day the patient left the hospital or died, the duration of the hospital stay, and whether the patient required oxygen or ventilation, she says. “That’s all.”

That may sound a little complicated, but by clinical trial standards the SOLIDARITY trial is shockingly simple. Normally you would have countless detailed case report forms, adverse event reporting, etc.

The statistics of the trial will be simple on the front end but complicated on the back end. There’s no sophisticated algorithm assigning treatments, just a randomization between available treatment options, including standard of care. I don’t see how you could do anything else, but this will create headaches for the analysis.

Patients are randomized to available treatments—what else could you do? [1]—which means the treatment options vary by site and over time. The control arm, standard of care, also varies by site and could change over time as well.  Also, this trial is not double-blind. This is a trial optimized for the convenience of frontline workers, not for the convenience of statisticians.

The SOLIDARITY trial will be adaptive in the sense that a DSMB will look at interim results and decide whether to drop treatment arms that appear to be under-performing. Ideally there would be objective algorithms for making these decisions, carefully designed and simulated in advance, but there’s no time for that. Better to start learning immediately than to spend six months painstakingly designing a trial. Even if we could somehow go back in time and start the design process six months ago, there could very well be contingencies that the designers couldn’t anticipate.

The SOLIDARITY trial is an expedient compromise, introducing a measure of scientific rigor when there isn’t time to be as rigorous as we’d like.

Update: The hydroxychloroquine arm was dropped from the trial because a paper in the Lancet reported that the drug was neither safe nor effective. However, it now appears that the data used in the Lancet paper were fraudulent.

Update: The Lancet retracted the paper in question on June 4, 2020.

More clinical trial posts

[1] You could limit the trial to sites that have all four treatment options available, cutting off most potential sources of data. The data would not be representative of the world at large and accrual would be slow. Or you could wait until all four treatments were distributed to clinics around the world, but there’s no telling how long that would take.

Automatic data reweighting

Suppose you are designing an autonomous system that will gather data and adapt its behavior to that data.

At first you face the so-called cold-start problem. You don’t have any data when you first turn the system on, and yet the system needs to do something before it has accumulated data. So you prime the pump by having the system act at first on what you believe to be true prior to seeing any data.

Now you face a problem. You initially let the system operate on assumptions rather than data out of necessity, but you’d like to go by data rather than assumptions once you have enough data. Not just some data, but enough data. Once you have a single data point, you have some data, but you can hardly expect a system to act reasonably based on one datum.

Rather than abruptly moving from prior assumptions to empirical data, you’d like the system to gradually transition from reliance on the former to reliance on the latter, weaning the system off initial assumptions as it becomes more reliant on new data.

The delicate part is how to manage this transition. How often should you adjust the relative weight of prior assumptions and empirical data? And how should you determine what weights to use? Should you set the weight given to the prior assumptions to zero at some point, or should you let the weight asymptotically approach zero?

Fortunately, there is a general theory of how to design such systems. It’s called Bayesian statistics. The design issues raised above are all handled automatically by Bayes theorem. You start with a prior distribution summarizing what you believe to be true before collecting new data. Then as new data arrive, the magic of Bayes theorem adjusts this distribution, putting more emphasis on the empirical data and less on the prior. The effect of the prior gradually fades away as more data become available.
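
Here’s a minimal sketch of this behavior using a beta-binomial model. The prior (a Beta(8, 2) distribution with mean 0.8) and the data stream (made-up observations with a success rate of 0.3) are only for illustration; the point is that the posterior mean moves from the prior’s guess toward the data’s answer with no hand-tuned weighting schedule.

prior_a, prior_b = 8, 2                       # hypothetical prior: success probability around 0.8
data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] * 10    # 100 made-up observations, success rate 0.3

a, b = prior_a, prior_b
for i, x in enumerate(data, start=1):
    a += x       # Bayes theorem for this model amounts to adding
    b += 1 - x   # successes and failures to the prior counts
    if i in (1, 10, 100):
        print(i, a / (a + b))   # posterior mean after i observations

The posterior mean is a weighted average of the prior mean and the observed proportion, and the weight shifts from the former to the latter as observations accumulate.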

Related posts

Sufficient statistic paradox

A sufficient statistic summarizes a set of data. If the data come from a known distribution with an unknown parameter, then the sufficient statistic carries as much information as the full data [0]. For example, given a set of n coin flips, the number of heads and the number of flips are sufficient statistics. More on sufficient statistics here.
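
For instance, here’s a small sketch with made-up flip sequences. Two different sequences with the same number of heads give identical likelihoods for every value of the heads probability, so inference about the coin depends on the data only through the count of heads and the number of flips.

def likelihood(flips, p):
    # probability of this exact 0/1 sequence if heads has probability p
    heads = sum(flips)
    return p**heads * (1 - p)**(len(flips) - heads)

seq1 = [1, 1, 1, 0, 0]   # three heads out of five
seq2 = [0, 1, 0, 1, 1]   # a different sequence with the same counts

for p in [0.3, 0.5, 0.7]:
    print(p, likelihood(seq1, p), likelihood(seq2, p))   # identical pairs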

There is a theorem by Koopman, Pitman, and Darmois that essentially says that useful sufficient statistics exist only if the data come from a class of probability distributions known as exponential families [1]. This leads to what Persi Diaconis [2] labeled a paradox.

Exponential families are quite a restricted family of measures. Modern statistics deals with far richer classes of probabilities. This suggests a kind of paradox. If statistics is to be of any real use it must provide ways of boiling down great masses of data to a few humanly interpretable numbers. The Koopman-Pitman-Darmois theorem suggests this is impossible unless nature follows highly specialized laws which no one really believes.

Diaconis suggests two ways around the paradox. First, the theorem he refers to concerns sufficient statistics of a fixed size; it doesn’t apply if the summary size varies with the data size. Second, and more importantly, the theorem says nothing about a summary containing approximately as much information as the full data.

***

[0] Students may naively take this to mean that all you need is sufficient statistics. No, it says that if you know the distribution the data come from then all you need is the sufficient statistics. You cannot test whether a model fits given sufficient statistics that assume the model. For example, mean and variance are sufficient statistics assuming data come from a normal distribution. But knowing only the mean and variance, you can’t assess whether a normal distribution model fits the data.
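
As a quick made-up illustration, the two samples below have essentially the same mean and variance, so the sufficient statistics for a normal model cannot distinguish them, even though one of them could not possibly have come from a normal distribution.

import numpy as np

rng = np.random.default_rng(1234)                          # arbitrary seed
normal_sample = rng.normal(size=100_000)                   # mean 0, variance 1
two_point_sample = rng.choice([-1.0, 1.0], size=100_000)   # also mean 0, variance 1

for sample in (normal_sample, two_point_sample):
    print(sample.mean(), sample.var(ddof=1))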

[1] This does not mean the data have to come from the normal distribution. The exponential family of distributions includes the normal (Gaussian) distribution, but other distributions as well.

[2] Persi Diaconis. Sufficiency as Statistical Symmetry. Proceedings of the AMS Centennial Symposium. August 8–12, 1988.

What is a privacy budget?

The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t matter for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Some metaphors are dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can choose how you spend it.

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)
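
Here’s a minimal sketch of that bookkeeping, assuming a Laplace mechanism for counting queries. The class and its interface below are hypothetical, not any particular library’s API.

import numpy as np

class BudgetedCounter:
    # toy ε-differentially private counting queries with a budget and a result cache
    def __init__(self, data, total_budget):
        self.data = data
        self.remaining = total_budget
        self.cache = {}

    def count(self, name, predicate, epsilon):
        if name in self.cache:               # repeated question: same answer, no charge
            return self.cache[name]
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for x in self.data if predicate(x))
        # Laplace noise with scale 1/ε makes a counting query ε-differentially private
        answer = true_count + np.random.laplace(scale=1/epsilon)
        self.cache[name] = answer
        return answer

db = BudgetedCounter(range(1000), total_budget=1.0)
print(db.count("over 500", lambda x: x > 500, 0.25))
print(db.count("over 500", lambda x: x > 500, 0.25))   # cached: identical answer, no charge
print(db.remaining)                                    # 0.75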

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. The former simplifies privacy budget accounting due to simple composition rules, and makes privacy budgets stretch further by leaning away from worst-case analysis a bit and leaning toward average-case analysis. RDP depends on a tuning parameter, with ε-differential privacy as a limiting case, so one can control how much RDP acts like ε-differential privacy by adjusting that parameter.

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions you like, and get sufficiently accurate answers, without worrying about the privacy budget.

More mathematical privacy posts

Fat tails and the t test

Suppose you want to test whether something you’re doing is having any effect. You take a few measurements and you compute the average. The average is different than what it would be if what you’re doing had no effect, but is the difference significant? That is, how likely is it that you might see the same change in the average, or even a greater change, if what you’re doing actually had no effect and the difference is due to random effects?

The most common way to address this question is the one-sample t test. “One sample” doesn’t mean that you’re only taking one measurement. It means that you’re taking a set of measurements, a sample, from one thing. You’re not comparing measurements from two different things.

The t test assumes that the data are coming from some source with a normal (Gaussian) distribution. The Gaussian distribution has thin tails, i.e. the probability of seeing a value far from the mean drops precipitously as you move further out. What if the data are actually coming from a distribution with heavier tails, i.e. a distribution where the probability of being far from the mean drops slowly?

With fat tailed data, the t test loses power. That is, it is less likely to reject the null hypothesis, the hypothesis that the mean hasn’t changed, when it should. First we will demonstrate by simulation that this is the case, then we’ll explain why this is to be expected from theory.

Simulation

We will repeatedly draw a sample of 20 values from a distribution with mean 0.8 and test whether the mean of that distribution is not zero by seeing whether the t test produces a p-value less than the conventional cutoff of 0.05. We will increase the thickness of the distribution tails and see what that does to our power, i.e. the probability of correctly rejecting the hypothesis that the mean is zero.

We will fatten the tails of our distribution by generating samples from a Student t distribution and decreasing the number of degrees of freedom: as degrees of freedom go down, the weight of the tail goes up.

With a large number of degrees of freedom, the t distribution is approximately normal. As the number of degrees of freedom decreases, the tails get fatter. With one degree of freedom, the t distribution is a Cauchy distribution.

Here’s our Python code:

from scipy.stats import t, ttest_1samp

n = 20    # sample size
N = 1000  # number of simulated experiments per degrees-of-freedom setting

for df in [100, 30, 10, 5, 4, 3, 2, 1]:
    rejections = 0
    for _ in range(N):
        # sample of size n centered at 0.8 with t-distributed noise
        y = 0.8 + t.rvs(df, size=n)
        stat, p = ttest_1samp(y, 0)
        if p < 0.05:
            rejections += 1
    print(df, rejections/N)

And here’s the output:

100 0.917
 30 0.921 
 10 0.873 
  5 0.757  
  4 0.700    
  3 0.628  
  2 0.449  
  1 0.137  

When the degrees of freedom are high, we reject the null about 90% of the time, even for degrees of freedom as small as 10. But with one degree of freedom, i.e. when we’re sampling from a Cauchy distribution, we only reject the null around 14% of the time.

Theory

Why do fatter tails lower the power of the t test? The t statistic is

\frac{\bar{y} - \mu_0}{s / \sqrt{n}}

where y bar is the sample average, μ0 is the mean under the null hypothesis (μ0 = 0 in our example), s is the sample standard deviation, and n is the sample size.

As distributions become fatter in the tails, the sample standard deviation increases. This means the denominator in the t statistic gets larger and so the t statistic gets smaller. The smaller the t statistic, the greater the probability that the absolute value of a t random variable is greater than the statistic, and so the larger the p-value.
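
Here’s a quick sketch, along the lines of the simulation above, showing how the typical sample standard deviation grows and the typical t statistic shrinks as the tails get fatter. Medians are reported because with very few degrees of freedom the averages themselves are unstable.

import numpy as np
from scipy.stats import t

n, N = 20, 10_000
for df in [100, 10, 3, 1]:
    y = 0.8 + t.rvs(df, size=(N, n))         # N samples of size n
    s = y.std(axis=1, ddof=1)                # sample standard deviations
    tstat = y.mean(axis=1) / (s / np.sqrt(n))
    print(df, np.median(s), np.median(tstat))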

t statistic, t distribution, t test

There are a lot of t’s floating around in this post. I’ll finish by clarifying what the various t things are.

The t statistic is the thing we compute from our data, given by the expression above. It is called a t statistic because if the hypotheses of the test are satisfied, this statistic has a t distribution with n-1 degrees of freedom. The t test is a hypothesis test based on the t statistic and its distribution. So the t statistic, the t distribution, and the t test are all closely related.

The t family of probability distributions is a convenient example of a family of distributions whose tails get heavier or lighter depending on a parameter. That’s why in the simulation we drew samples from a t distribution. We didn’t need to, but it was convenient. We would get similar results if we sampled from some other distribution whose tails get thicker, and so variance increases, as we vary some parameter.

More probability posts

Testing Rupert Miller’s suspicion

I was reading Rupert Miller’s book Beyond ANOVA when I ran across this line:

I never use the Kolmogorov-Smirnov test (or one of its cousins) or the χ² test as a preliminary test of normality. … I have a feeling they are more likely to detect irregularities in the middle of the distribution than in the tails.

Rupert wrote these words in 1986 when it would have been difficult to test his hunch. Now it’s easy, and so I wrote up a little simulation to test whether his feeling was justified. I’m sure this has been done before, but it’s easy (now—it would not have been in 1986) and so I wanted to do it myself.

I’ll compare the Kolmogorov-Smirnov test, a popular test for goodness-of-fit, with the Shapiro-Wilk test that Miller preferred. I’ll run each test 10,000 times on non-normal data and count how often each test produces a p-value less than 0.05.

To produce departures from normality in the tails, I’ll look at samples from a Student t distribution. This distribution has one parameter, the number of degrees of freedom. The fewer degrees of freedom, the thicker the tails and so the further from normality in the tails.

Then I’ll look at a mixture of a normal and uniform distribution. This will have thin tails like a normal distribution, but will be flatter in the middle.

If Miller was right, we should expect the Shapiro-Wilk test to be more sensitive for fat-tailed t distributions, and the K-S test to be more sensitive for mixtures.

First we import some library functions we’ll need and define our two random sample generators.

from numpy import where
from scipy.stats import *

def mixture(p, size=100):
    # each point is uniform with probability p and standard normal otherwise
    u = uniform.rvs(size=size)
    v = uniform.rvs(size=size)
    n = norm.rvs(size=size)
    x = where(u < p, v, n)
    return x

def fat_tail(df, size=100):
    return t.rvs(df, size=size)

Next is the heart of the code. It takes in a sample generator and compares the two tests, Kolmogorov-Smirnov and Shapiro-Wilk, on 10,000 samples of 100 points each. It returns what proportion of the time each test detected the anomaly at the 0.05 level.

def test(generator, parameter):

    ks_count = 0
    sw_count = 0

    N = 10_000
    for _ in range(N):
        x = generator(parameter, 100)

        stat, p = kstest(x, "norm")
        if p < 0.05:
            ks_count += 1
    
        stat, p = shapiro(x)
        if p < 0.05:
            sw_count += 1
    
    return (ks_count/N, sw_count/N)

Finally, we call the test runner with a variety of distributions.

for df in [100, 10, 5, 2]:
    print(test(fat_tail, df))

for p in [0.05, 0.10, 0.15, 0.2]:
    print(test(mixture, p))

Note that the t distribution with 100 degrees of freedom is essentially normal, at least as far as a sample of 100 points can tell, and so we should expect both tests to report a lack of fit around 5% of the time since we’re using 0.05 as our cutoff.

Here’s what we get for the fat-tailed samples.

(0.0483, 0.0554)
(0.0565, 0.2277)
(0.1207, 0.8799)
(0.8718, 1.0000)   

So with 100 degrees of freedom, we do indeed reject the null hypothesis of normality about 5% of the time. As the degrees of freedom decrease, and the fatness of the tails increases, both tests reject the null hypothesis of normality more often. However, in each case the Shapiro-Wilk test picks up on the non-normality more often than the K-S test, about four times as often with 10 degrees of freedom and about seven times as often with 5 degrees of freedom. So Miller was right about the tails.

Now for the middle. Here’s what we get for mixture distributions.

(0.0731, 0.0677)
(0.1258, 0.1051)
(0.2471, 0.1876)
(0.4067, 0.3041)

We would expect both goodness-of-fit tests to increase their rejection rates as the mixture probability goes up, i.e. as we sample from the uniform distribution more often. And that is what we see. But the K-S test outperforms the S-W test each time. Both tests have rejection rates that increase with the mixture probability, but the rejection rates increase faster for the K-S test. Miller wins again.

More posts on hypothesis testing

National Drug Code (NDC)

The US Food and Drug Administration tracks drugs using an identifier called the NDC or National Drug Code. It is described as a 10-digit code, but it may be more helpful to think of it as a 12-character code.

An NDC contains 10 digits, separated into three segments by two dashes. The three segments are the labeler code, product code, and package code. The FDA assigns the labeler codes to companies, and each company assigns its own product and package codes.

Format

The segments are of variable length and so the dashes are significant. The labeler code could be 4 or 5 digits. The product code could be 3 or 4 digits, and the package code could be 1 or 2 digits. The total number of digits must be 10, so there are three possible combinations:

  • 4-4-2
  • 5-3-2
  • 5-4-1.

There’s no way to look at just the digits and know how to separate them into three segments. My previous post looked at self-punctuating codes. The digits of NDC codes are not self-punctuating because they require the dashes. The digit combinations are supposed to be unique, but you can’t tell how to parse a set of digits from the digits alone.
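
Here’s a small sketch of that ambiguity. The helper function below is just for illustration: given only the ten digits, there are up to three ways to insert the dashes.

def possible_formats(digits):
    # split a 10-digit string according to the three allowed segment patterns
    assert len(digits) == 10 and digits.isdigit()
    patterns = [(4, 4, 2), (5, 3, 2), (5, 4, 1)]
    return [f"{digits[:a]}-{digits[a:a+b]}-{digits[a+b:]}" for a, b, _ in patterns]

print(possible_formats("1234567890"))
# ['1234-5678-90', '12345-678-90', '12345-6789-0']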

Statistics

I downloaded the NDC data from the FDA to verify whether the codes work as documented, and to see the relative frequency of various formats.

(The data change daily, so you may get different results if you do try this yourself.)

Format

All the codes were 12 characters long, and all had the documented format as verified by the regular expression [1]

    \d{4,5}-\d{3,4}-\d{1,2}

Uniqueness exception

I found one exception to the rule that the sequence of digits should be unique. The command

    sed "s/-//g" ndc.txt | sort | uniq -d

returned 2950090777.

The set of NDC codes contained both 29500-907-77 and 29500-9077-7.

Distribution

About 60% of the codes had the form 5-3-2. About 30% had the form 5-4-1, and the remaining 10% had the form 4-4-2.

There were a total of 252,355 NDC codes with 6,532 different labelers (companies).

There were 9,448 NDC codes associated with the most prolific labeler. The 1,424 least prolific labelers had only one NDC code. In Pareto-like fashion, the top 20% of labelers accounted for about 90% of the codes.
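
Here’s a minimal sketch of how one might tabulate the formats, assuming a file like the ndc.txt used with the sed command above, with one dashed code per line. (The actual FDA download has more columns, so it would need some preprocessing first.)

from collections import Counter

counts = Counter()
with open("ndc.txt") as f:
    for line in f:
        code = line.strip()
        counts["-".join(str(len(s)) for s in code.split("-"))] += 1

total = sum(counts.values())
for fmt, count in counts.most_common():
    print(fmt, count, f"{100 * count / total:.1f}%")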

Related posts

[1] Programming languages like Python or Perl will recognize this regular expression, but by default grep does not support \d for digits. The GNU implementation of grep with the -P option will. It will also understand notation like {4,5} to mean a pattern is repeated 4 or 5 times, with or without -P, but I don’t think other implementations of grep necessarily will.