Statistics | John D. Cook

A Bayesian approach to proving you’re human

John — Thu, 28 Mar 2024 14:47:00 +0000

I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human.

The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans can tell the difference between real bird sounds and synthesized bird-like sounds. And we’re generally good at recognizing bird sounds even against a background of competing sounds. But some of these were ambiguous, and I had real birds chirping outside my window while I was doing the captcha.

You have to do 20 of these tests, and apparently you have to get all 20 right. I didn’t. So I tried again. On the last test I accidentally clicked the start-over button rather than the submit button. I wasn’t willing to listen to another 20 triples of audio clips, so I switched over to the visual captcha tests.

These kinds of tests could be made less annoying and more secure by using a Bayesian approach.

Suppose someone solves 19 out of 20 puzzles correctly. You require 20 out of 20, so you have them start over. When you do, you’re throwing away information. You require 20 more puzzles, despite the fact that they only missed one. And if a bot had solved say 8 out of 20 puzzles, you’d let it pass if it passes the next 20.

If you wipe your memory after every round of 20 puzzles, and allow unlimited do-overs, then any sufficiently persistent entity will almost certainly pass eventually.

Bayesian statistics reflects common sense. After someone (or something) has correctly solved 19 out of 20 puzzles designed to be hard for machines to solve, your conviction that this entity is human is higher than if they/it had solved 8 out of 20 correctly. You don’t need as much additional evidence in the first case as in the latter to be sufficiently convinced.

Here’s how a Bayesian captcha could work. You start out with some distribution on your probability θ that an entity is human, say a uniform distribution. You present a sequence of puzzles, recalculating your posterior distribution after each puzzle, until the posterior probability that this entity is human crosses some upper threshold, say 0.95, or some lower threshold, say 0.50. If the upper threshold is crossed, you decide the entity is likely human. If the lower threshold is crossed, you decide the entity is likely not human.

If solving 20 out of 20 puzzles correctly crosses your threshold of human detection, then after solving 19 out 20 correctly your posterior probability of humanity is close to the upper threshold and would only require a few more puzzles. And if an entity solved 8 out of 20 puzzles correctly, that may cross your lower threshold. If not, maybe only a few more puzzles would be necessary to reject the entity as non-human.

When I worked at MD Anderson Cancer Center we applied this approach to adaptive clinical trials. A clinical trial might stop early because of particularly good results or particularly bad results. Clinical trials are expensive, both in terms of human costs and financial costs. Rejecting poor treatments quickly, and sending promising treatments on to the next stage quickly, is both humane and economical.

The post A Bayesian approach to proving you’re human first appeared on John D. Cook.

Estimating an author’s vocabulary

John — Fri, 08 Mar 2024 03:07:07 +0000

How would you estimate the size of an author’s vocabulary? Suppose you have a analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have know words he never used in writing, or that at least not in works you have access to.

Brainerd [1] suggested the following estimator based on a Markov chain model of language. The estimated vocabulary is the number N satisfying the equation

The left side is a decreasing function of N, so you could solve the equation by finding a values of N that make the sum smaller and larger than n, then use a bisection algorithm.

We can see that the model is qualitatively reasonable. If every word is unique, i.e. x = n, then the solution is N = ∞. If you haven’t seen any repetition, you the author could keep writing new words indefinitely. As the amount of repetition increases, the estimate of N decreases.

Brainerd’s model is simple, but it tends to underestimate vocabulary. More complicated models might do a better job.

Problems analogous to estimating vocabulary size come up in other applications. For example, an ecologist might want to estimate the number of new species left to be found based on the number of species seen so far. In my work in data privacy I occasionally have to estimate diversity in a population based on diversity in a sample. Both of these examples are analogous to estimating potential new words based on the words you’ve seen.

[1] Brainerd, B. On the relation between types and tokes in literary text, J. Appl. Prob. 9, pp. 507-5

The post Estimating an author’s vocabulary first appeared on John D. Cook.

Detecting the language of encrypted text

John — Thu, 07 Mar 2024 15:52:48 +0000

Imagine you are a code breaker living a century ago. You’ve intercepted a message, and you go through your bag of tricks, starting with the simplest techniques first. Maybe the message has been encrypted using a simple substitution cipher, so you start with that.

Simple substitution ciphers can be broken by frequency analysis: the most common letter probably corresponds to E, the next most common letter probably corresponds to T, etc. But that’s only for English prose. Maybe the message was composed in French. Or maybe it was composed in Japanese, then transliterated into the Latin alphabet so it could be transmitted via Morse code. You’d like to know what language the message was written in before you try identifying letters via their frequency.

William Friedman’s idea was to compute a statistic, what he dubbed the index of coincidence, to infer the probable language of the source. Since this statistic only depends on symbol frequencies, it gives the same value whether computed on clear text or text encrypted with simple substitution. It also gives the same value if the text has been run through a transposition cipher as well.

(Classical cryptanalysis techniques, such as computing the index of coincidence, are completely useless against modern cryptography. And yet ideas from classical cryptanalysis are still useful for other applications. Here’s an example that came up in a consulting project recently.)

As I mentioned at the top of the post, you’d try breaking the simplest encryption first. If the index of coincidence is lower than you’d expect for a natural language, you might suspect that the message has been encrypted using polyalphabetic substitution. That is, instead of using one substitution alphabet for every letter, maybe the message has been encrypted using a cycle of n different alphabets, such as the Vigenère cypher.

How would you break such a cipher? First, you’d like to know what n is. How would you do that? By trial and error. Try splitting the text into groups of letters according to their position mod n, then compute the index of coincidence again for each group. If the index statistics are much larger when n = 7, you’re probably looking a message encrypted with a key of length 7.

The source language would still leave its signature. If the message was encrypted by cycling through seven scrambled alphabets, each group of seven letters would most likely have the statistical distribution of the language used in the clear text.

Friedman’s index of coincidence, published in 1922, was one statistic that could be computed based on letter frequencies, one that worked well in practice, but you could try other statistics, and presumably people did. The index of coincidence is essentially Rényi entropy with parameter α = 2. You might try different values of α.

If the approach above doesn’t work, you might suspect that the text was not encrypted one letter at a time, even using multiple alphabets. Maybe pairs of letters were encrypted, as in the Playfair cipher. You could test this hypothesis by looking that the frequencies of pairs of letters in the encrypted text, calculating an index of coincidence (or some other statistic) based on pairs of letters.

Here again letter pair frequencies may suggest the original language. It might not distinguish Spanish from Portuguese, for example, but it would distinguish Japanese written in Roman letters from English.

The post Detecting the language of encrypted text first appeared on John D. Cook.

Uncovering names masked with stars

John — Fri, 23 Feb 2024 19:34:27 +0000

Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.”

Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and there’s a 44% chance they’re named John. If you know this person is male, there’s a 63% chance they’re name is John.

If you know a man’s name has the form J***, his name isn’t necessarily John, though that’s the most likely possibility. There’s a 8% chance his name is Jack and a 6% chance his name is Joel.

All these numbers depend on the data set you’re looking at, but these are roughly accurate numbers for looking at any representative sample of American names.

Some names stand out more than others. If I tell you someone’s name is E********, there’s a 90% chance the name is Elizabeth.

If I tell you someone’s name is B*****, there’s a 77% chance this person is female, but it’s harder to guess which name is hers. The most likely possibility is Brenda, but there are several other possibilities that are fairly likely: Bonnie, Brooke, Brandy, etc.

We could go through a similar exercise with last names. You can probably guess who S**** is, though C***** is not so clear.

In short, replacing letters with stars doesn’t do much to conceal someone’s name. It usually doesn’t let you infer someone’s name with certainty, but it definitely improves your chances of guessing correctly. If you have a few good guesses as to someone’s name, and some good guesses on a handful of other attributes, together you have a good chance of identifying someone.

The post Uncovering names masked with stars first appeared on John D. Cook.

When is less data less private?

John — Thu, 22 Feb 2024 14:05:17 +0000

If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right?

This seems very simple, and it mostly is, but there are a couple subtleties.

A common measure in data privacy is k-anonymity. The idea is that if at least k individuals in a data set share some set of data values, and k is large enough, then the privacy of those individuals is protected.

Now suppose you randomly select a single record from a database that was deemed deidentified because it satisfied k-anonymity with k = 10. Now your new dataset, consisting of only one record, is k-anonymous with k = 1: every record is unique because there’s only one record. But how is this person’s data any less private that it was before?

Note that I said above that you selected a record at random. If you selected the row using information that you know but which isn’t in the database, you might have implicitly added information. But if you select a subset of data, using only information explicit in that data, you haven’t added information.

Here’s where k-anonymity breaks down. The important measure is k-anonymity in the general population, not k-anonymity in a data set, unless you know that someone is in the data set.

If you find someone named John Cook in a data set, you probably haven’t found my information, even if there is only one person by that name in the data set. My name may or may not be common in that particular data set, but my name is common in general.

The number of times a combination of data fields gives a lower bound on how often the combination appears in general, so k-anonymity in a data set is a good sign for privacy, but the lack of k-anonymity is not necessarily a bad sign. The latter could just be an artifact of having a small data set.

The post When is less data less private? first appeared on John D. Cook.

How likely is a random variable to be far from its center?

John — Tue, 06 Feb 2024 14:33:55 +0000

There are many answers to the question in the title: How likely is a random variable to be far from its center?

The answers depend on how much you’re willing to assume about your random variable. The more you can assume, the stronger your conclusion. The answers also depend on what you mean by “center,” such as whether you have in mind the mean or the mode.

Chebyshev’s inequality says that the probability of a random variable X taking on a value more than k standard deviations from its mean is less than 1/k². This of course assumes that X has a mean and a standard deviation.

If we assume further that X is unimodal, and k ≥ √(8/3), then the conclusion of Chebyshev’s inequality can be strengthened to saying that the probability of X being more than k standard deviations from its mean is less than 4/9k². This is the Vysochanskiĭ-Petunin inequality. More on this inequality here.

If k ≤ √(8/3) the Vysochanskiĭ-Petunin inequality says the probability of X being more than k standard deviations from its mean is less than

4/3k² − 1/3.

Gauss’ inequality is similar to the Vysochanskiĭ-Petunin inequality. It also assumes X is unimodal, and for convenience we will assume the mode is at zero (otherwise look at Y = X – m where m is the mode of X). Gauss’ inequality bounds the probability of X being more than k standard deviations away from its mode rather than its mean.

Let τ² be the expected value of X². If the mean value of X is zero then τ² = σ² and the equations below are similar to the Vysochanskiĭ-Petunin inequality. But Gauss does not require that X has mean zero.

Gauss’ inequality says that

P(|X| > kτ) ≤ 4/9k²

if if k ≥ √(4/3) and

P(|X| > kτ) ≤ 1 − k/(√3 τ)

otherwise.

Gauss’ inequality is stronger than the Vysochanskiĭ-Petunin inequality when X has zero mean, but it is also assuming more, namely that the mean and mode are equal.

The post How likely is a random variable to be far from its center? first appeared on John D. Cook.

Two-digit zip codes

John — Fri, 02 Feb 2024 02:13:39 +0000

It’s common to truncate US zip codes to the first three digits for privacy reasons. Truncating to the first two digits is less common, but occurs in some data sets.

HIPAA Safe Harbor requires sparse 3-digit zip codes to be suppressed; even when rolled up to three digits some regions are still sparsely populated.

How sparse can a two-digit zip code region be? Empty. The population of all zip codes starting with 09 is zero. That’s because all zip codes of the form 09XXX are overseas military bases and not anyone’s permanent residence.

Aside from that, the least populated 2-digit zip code is 69 with a population of about 175,000.

The most populated 2-digit zip code is 33 with a population of about 10,500,000.

So the ratio of the largest population to the smallest non-zero population is about 60.

What if we roll up all the way to 1-digit zip codes? Now things are more evenly distributed. The smallest region is 5 with a population of about 17.5M and the largest is 9 with a population of about 54M. A ratio of about 3.

The post Two-digit zip codes first appeared on John D. Cook.

Beta inequality symmetries

John — Sat, 20 Jan 2024 13:45:08 +0000

I was thinking about the work I did when I worked in biostatistics at MD Anderson. This work was practical rather than mathematically elegant, useful in its time but not of long-term interest. However, one result came out of this work that I would call elegant, and that was a symmetry I found.

Let X be a beta(a, b) random variable and let Y be a beta(c, d) random variable. Let g(a, b, c, d) be the probability that a sample from X is larger than a sample from Y.

g(a, b, c, d) = Prob(X > Y)

This function often appeared in the inner loop of a simulation and so we spent thousands of CPU-hours computing its values. I looked for ways to evaluate this function more quickly, and along the way I found a symmetry.

The function I call g was studied by W. R. Thompson in 1933 [1]. Thompson noted two symmetries:

g(a, b, c, d) = 1 − g(c, d, a, b)

and

g(a, b, c, d) = g(d, c, b, a)

I found an additional symmetry:

g(a, b, c, d) = g(d, b, c, a).

The only reference to this result in a journal article as far as I know is a paper I wrote with Saralees Nadarajah [2]. That paper cites an MD Anderson technical report which is no longer online, but I saved a copy here.

[1] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, Volume 25, Issue 4. pp. 285 – 294.

[2] John D. Cook and Saralees Nadarajah. Stochastic Inequality Probabilities for Adaptively Randomized Clinical Trials. Biometrical Journal. 48 (2006) pp 256–365.

The post Beta inequality symmetries first appeared on John D. Cook.

The Five Safes data privacy framework

John — Fri, 05 Jan 2024 13:55:30 +0000

The Five Safes decision framework was created a couple decades ago by Felix Ritchie at the UK Office for National Statistics. It is a framework for evaluating the safe use of confidential data, particularly by government agencies. You can find a description of the Five Safes, for example, in NIST SP 800-188.

The Five Safes are

Safe projects
Safe people
Safe settings
Safe data
Safe outputs

Safe projects asks whether the use of the data is appropriate. It doesn’t matter how safe the access controls and so forth are if the project itself is inappropriate.

Safe people asks whether the users be trusted to use the data in an appropriate manner. For health care data, for example, one could ask whether users have had HIPAA training.

Safe settings applies to physical access. Does the facility hosting the data limit unauthorised access?

Safe data asks about statistical disclosure control, whether the data itself poses a disclosure risk. For example, have the data been adequately deidentified?

Safe outputs asks whether the output of the project poses a privacy risk.

Various approaches to data privacy have different trade-offs between the Five Safes. Differential privacy focuses on safe outputs. There are mathematical guarantees that the outputs satisfy a certain definition of privacy. The data itself is regarded as unsafe, and so it is important that the people and settings are safe.

HIPAA expert determination focuses on safe data. Often there is a sort of firewall with data considered safe on one side for one set of reasons (patient consent, a BAA contract, etc.) and considered safe on the other side of the wall because the data itself is safe, i.e. properly deidentified.

Safe Harbor is unrelated to the Five Safes. Safe Harbor is a provision under the HIPAA Privacy Rule for deeming certain data safe. Whether the Safe Harbor rules actually result in safe data depends on context. Data may comply with the letter of the law appealing to Safe Harbor, and yet not protect individuals in the data from being identified.

If you would like help evaluating the privacy aspects of a data analysis project, let’s talk.

The post The Five Safes data privacy framework first appeared on John D. Cook.

Database reconstruction attacks

John — Thu, 23 Nov 2023 02:19:54 +0000

In 2018, three researchers from the US Census Bureau published a paper entitled “Understanding Database Reconstruction Attacks on Public Data.” [1] The article showed that private data on many individuals could be reverse engineered from public data.

As I wrote about a few days ago, census blocks are at the bottom of the US Census Bureau’s hierarchy of geographical entities. On average a census block may contain about 40 people, but a block may contain only one person.

In hindsight it seems fairly obvious that data reported at the census block level is vulnerable to re-identification, and yet this doesn’t seem to have been noticed before around 2000. There were some privacy measures in place before then, but it wasn’t clear that these methods were insufficient to protect privacy.

You can think of each fact about each person as a variable and each reported statistic as an equation. When the number of equations is comparable to the number of variables, it’s possible that the system of equations has a unique solution. (We know a priori that there exists at least one solution, assuming the reported statistics were correctly computed.)

It’s not quite as simple as that, though that is roughly the idea in [1]. The data collected in the census is binary or integer data, which makes database reconstruction easier. Ages, for example, are integers, and typically integers less than 100.

One of the techniques the Census Bureau previously used in an attempt to protect individual privacy was a sort of small cell rule, a rule to not report statistics based on three or fewer individuals. This may or may not help. In the example given in [1], there are 7 people in a hypothetical census block, of whom 4 are adults and an unreported number are minors. Determining the number of minors is left as an exercise for the reader.

The set of equations is more complicated than a set of linear equations. The inference problem is a matter of logic programming or constraint satisfaction. Missing data is not always as trivial to reconstruct as in the preceding paragraph, but missing data can still convey partial information. The very fact that the data is missing tells you something.

The discrete nature of the data makes the solution process both harder and easier. It makes things harder in the sense of requiring a more complicated solution algorithm, but it makes things easier in the sense of increasing the likelihood that the equations have a unique solution.

This is why the Census Bureau embraced differential privacy for the 2020 census. They had no choice but to do something substantially different than they had done in the past once it became apparent that their previous approach failed rather badly at protecting confidentiality.

[1] Simson Garfinkel, John M. Abowd, Christain Martindale. Understanding Database Reconstruction Attacks on Public Data. ACM Quque, October 2018. The article was also published in Communications of the ACM in March 2019.

The post Database reconstruction attacks first appeared on John D. Cook.

Differentially private stochastic gradient descent

John — Wed, 08 Nov 2023 18:33:21 +0000

Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent.

Gradient descent

We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. Then the gradient ∇f(x) points in the direction of greatest increase in f, and its negative −∇f(x) points in the direction of greatest decrease. So you can decrease the function f by moving in the direction −∇f(x). You can turn this observation into an algorithm for minimizing f. Find the gradient, take a step in the opposite direction. Rinse, lather, and repeat. The gradient descent method is also called steepest descent because locally you’re always moving in the direction of steepest descent.

Locally is the key word here. In some finite neighborhood of x, −∇f(x) points in a direction that will decrease the value of f. How large is this neighborhood? Hard to say. How long a step should you take? Also hard to say.

If your function f is strictly convex, then there is a global minimum, and the gradient descent algorithm will converge to it.

Stochastic gradient descent

What could go wrong? If your objective function f is strictly convex, convergence is guaranteed, but convergence may be slow. The method is always locally optimal, but optimal may mean inside a tiny little neighborhood of where you start each iteration.

Gradient descent can meander through a valley, a nearly flat spot of the objective function, taking tiny steps back and forth and not making much progress toward finding the low point.

Another problem is that gradient descent can get stuck in a local minimum. If f is strictly convex then there are no local minima, just one global minimum, but stochastic gradient descent is used to minimize functions that are not convex and that may have many local minima.

So to get unstuck, either from being stuck at a local minimum or from a flat spot, stochastic gradient descent adds randomness.

The objective functions used in training neural networks have many variables, maybe millions or billions, and these functions are far from convex.

Differentially private stochastic gradient descent

Now suppose you’re doing machine learning on sensitive data, such as medical records. You want to train a model on the data, but you don’t want the model to memorize and leak personally identifiable information (PII) or protected health information (PHI).

If you were simply querying the data, rather than training a network on it, you could apply differential privacy to queries, adding an amount of noise to each query result calibrated to be just enough randomness to preserve privacy. But training a neural network is far more complicated that running a SQL query.

The core idea of differential privacy is that any individual’s presence or absence from a database must not make much difference, in a rigorously quantified sense. Any outcome that happened with someone in the database could plausibly have happened without that person being in the database. Stochastic gradient descent is already a randomized algorithm, and variations on the algorithm can also provide differential privacy guarantees. See [1] for a seminal paper and [2] for a later refinement.

[1] Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate. Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

[2] Abadi et al. Deep Learning with Differential Privacy. arXiv:1607.00133 [stat.ML]

The post Differentially private stochastic gradient descent first appeared on John D. Cook.

Using classical statistics to avoid regulatory burden

John — Wed, 01 Nov 2023 16:14:12 +0000

On June 29 this year I said on Twitter that companies would start avoiding AI to avoid regulation.

I followed that up with an article Three advantages of non-AI models. The third advantage I listed was

Statistical models are not subject to legislation hastily written in response to recent improvements in AI. The chances that such legislation will have unintended consequences are roughly 100%.

Fast forward four months and we now have a long, highly-detailed executive order, Executive Order 14110, effecting all things related to artificial intelligence. Here’s an excerpt:

… the Secretary [of Commerce] shall require compliance with these reporting requirements for: any model that was trained using a quantity of computing power greater than 10²⁶ integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10²³ integer or floating-point operations; and any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 10²⁰ integer or floating-point operations per second for training AI.

If a classical model can do what you need, you are not subject to any regulations that will flow out of the executive order above, not if these regulations use definitions similar to those in the executive order.

How many floating point operations does it take to train, say, a logistic regression model? It depends on the complexity of the model and the amount of data fed into the model, but it’s not 10²⁰ flops.

Can you replace an AI model with something more classical like a logistic regression model or a Bayesian hierarchical model? Very often. I wouldn’t try to compete with Midjourney for image generation that way, but classical models can work very well on many problems. These models are much simpler—maybe a dozen parameters rather than a billion parameters—and so are much better understood (and so there is less fear of such models that leads to regulation).

I had a client that was using some complicated models to predict biological outcomes. I replaced their previous models with a classical logistic regression model and got better results. The company was so impressed with the improvement that they filed a patent on my model.

If you’d like to discuss whether I could help your company replace a complicated AI model with a simpler statistical model, let’s talk.

The post Using classical statistics to avoid regulatory burden first appeared on John D. Cook.

Identifiers depend on context

John — Mon, 30 Oct 2023 23:06:40 +0000

Can you tell who someone is from their telephone number? That’s kinda the point of telephone numbers, to let you contact someone. And indeed telephone number is one the 18 identifiers under HIPAA Safe Harbor.

But whether any piece of information allows you to identify someone depends on context. If you don’t have access to a phone, or a phone book, or any electronic counterpart of a phone book, then a phone number doesn’t allow you to identify anyone. But once you can call a phone number or enter it into a search engine, then the phone number is identifiable. Maybe.

What if the number belongs to a burner phone? Then it would be harder to learn the identity of the person who owns the number, but not impossible. Maybe you couldn’t learn anything about the owner, but law enforcement officials could. Again identifiability depends on context.

An obvious identifier like a phone number might not be an identifier in some special circumstance. And an apparently useless bit of information might reveal someone’s identity in another circumstance.

HIPAA’s Safe Harbor Rule tries to say apart from context what kinds of data are identifiable. But if you read the Safe Harbor Rule carefully you’ll notice it isn’t so context-free as it seems. The last item in the list of 18 items to remove is “any other unique identifying number, characteristic, or code.” What might be an identifying characteristic? That depends on context.

The post Identifiers depend on context first appeared on John D. Cook.

V-statistics

John — Wed, 19 Jul 2023 13:07:28 +0000

A few days ago I wrote about U-statistics, statistics which can be expressed as the average of a symmetric function over all combinations of elements of a set. V-statistics can be written as an average of over all products of elements of a set.

Let S be a statistical sample of size n and let h be a symmetric function of r elements. The average of h over all subsets of S with r elements is a U-statistic. The average of h over the Cartesian product of S with itself r times

is a V-statistic.

As in the previous post, let h(x, y) = (x − y)²/2. We can illustrate the V-statistic associated with h with Python code as before.

    import numpy as np
    from itertools import product

    def var(xs):
        n = len(xs)
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in product(xs, repeat=2)) / n**2

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs))
    print(var(xs))

This time, however, we iterate over product rather than over combinations. Note also that at the bottom of the code we print

   np.var(xs)

rather than

   np.var(xs, ddof=1)

This means our code here is computing the population variance, not the sample variance. We could make this more explicit by supplying the default value of ddof.

   np.var(xs, ddof=0)

The point of V-statistics is not to calculate them as above, but that they could be calculated as above. Knowing that a statistic is an average of a symmetric function is theoretically advantageous, but computing a statistic this way would be inefficient.

U-statistics are averages of a function h over all subsamples of S of size r without replacement. V-statistics are averages of h over all subsamples of size r with replacement. The difference between sampling with or without replacement goes away as n increases, and so V-statistics have the same asymptotic properties as U-statistics.

The post V-statistics first appeared on John D. Cook.

Moments of Tukey’s g-and-h distribution

John — Mon, 17 Jul 2023 18:16:15 +0000

John Tukey developed his so-called g-and-h distribution to be very flexible, having a wide variety of possible values of skewness and kurtosis. Although the reason for the distribution’s existence is its range of possible skewness and values, calculating the skewness and kurtosis of the distribution is not simple.

Definition

Let φ be the function of one variable and four parameters defined by

A random variable Y has a g-and-h distribution if it has the same distribution as φ(Z; a, b, g, h) where Z is a standard normal random variable. Said another way, if Y has a g-and-h distribution then the transformation φ^-1 makes the data normal.

The a and b parameters are for location and scale. The name of the distribution comes from the parameters g and h that control skewness and kurtosis respectively.

The transformation φ is invertible but φ^-1 does not have a closed-form; φ^-1 must be computed numerically. It follows that the density function for Y does not have a closed form either.

Special cases

The g distribution is the g-and-h distribution with h = 0. It generalizes the log normal distribution.

The limit of the g-and-h distribution as g does to 0 is the h distribution.

If g and h are both zero we get the normal distribution.

Calculating skewness and kurtosis

The following method of computing the moments of Y comes from [1].

Define f by

Then the raw moments of Y are given by

Skewness is the 3rd centralized moment and kurtosis is the 4th centralized moment. Equations for finding centralized moments from raw moments are given here.

[1] James B. McDonald and Patrick Turley. Distributional Characteristics: Just a Few More Moments. The American Statistician, Vol. 65, No. 2 (May 2011), pp. 96–103

The post Moments of Tukey’s g-and-h distribution first appeared on John D. Cook.

Symmetric functions and U-statistics

John — Sun, 16 Jul 2023 17:56:52 +0000

A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle

a + b + c
ab + bc + ac
abc

are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.

This post looks briefly at symmetric functions in the context of statistics.

Let h be a symmetric function of r variables and suppose we have a set S of n numbers where n ≥ r. If we average h over all subsets of size r drawn from S then the result is another symmetric function, called a U-statistic. The “U” stands for unbiased.

If h(x) = x then the corresponding U-statistic is the sample mean.

If h(x, y) = (x − y)²/2 then the corresponding U-function is the sample variance. Note that this is the sample variance, not the population variance. You could see this as a justification for why sample variance as an n−1 in the denominator while the corresponding term for population variance has an n.

Here is some Python code that demonstrates that the average of (x − y)²/2 over all pairs in a sample is indeed the sample variance.

    import numpy as np
    from itertools import combinations

    def var(xs):
        n = len(xs)
        bin = n*(n-1)/2    
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in combinations(xs, 2)) / bin

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs, ddof=1))
    print(var(xs))

Note the ddof term that causes NumPy to compute the sample variance rather than the population variance.

Many statistics can be formulated as U-statistics, and so numerous properties of such statistics are corollaries general results about U-statistics. For example U-statistics are asymptotically normal, and so sample variance is asymptotically normal.

The post Symmetric functions and U-statistics first appeared on John D. Cook.

Eccentricity of bivariate normal level sets

John — Mon, 03 Jul 2023 12:00:07 +0000

Suppose you have a bivariate normal distribution with correlation ρ where

Then the level sets of the density function are ellipses, and the eccentricity e of the ellipses is related to the correlation ρ by

Plots

For example, suppose ρ = 0.8.

Here’s a plot of the density function f(x, y).

And here is a plot of the contours, the level sets obtained by slicing the plot of the density horizontally.

Eccentricity and aspect ratio

The equation above says that since ρ = 0.8 the eccentricity e should be √(1.6/1.8) = 0.9428.

As I’ve written about several times, eccentricity is a little hard to interpret. It’s a number between 0 and 1, with 0 being a circle and 1 being a degenerate ellipse collapsed to a line segment. Since 0.9428 is closer to 1 than to 0, you might think that an ellipse with such a large value of e is very far from circular. But this is not the case.

Eccentric literally means off-center. Eccentricity measures how far off-center the foci of an ellipse are. This is related to how round or thin an ellipse is, but only indirectly. It is true that such a large value of e means that the foci are close to the ends of the ellipse. In fact, the foci are located at 94.28% of the distance from the center to the end of the semimajor axes. But the aspect ratio is more moderate.

It’s easier to imagine the shape of an ellipse from the aspect ratio than from the eccentricity. Aspect ratio r is related to eccentricity e by

Using the expression above for eccentricity e as a function of correlation ρ we have

This says our aspect ratio should be 1/3. So although our ellipse is highly eccentric—the foci are near the ends—it isn’t extraordinarily thin. It’s three times longer than it is wide, which matches the contour plot above.

The post Eccentricity of bivariate normal level sets first appeared on John D. Cook.

Babies and the beta-binomial distribution

John — Mon, 05 Jun 2023 16:05:16 +0000

About half of children are boys and half are girls, but that doesn’t mean that every couple is equally likely to have a boy or a girl each time they conceive a child. And evidence suggests that indeed the probability of conceiving a girl varies per couple.

I will simplify things for this post and look at a hypothetical situation abstracting away the complications of biology. This post fills in the technical details of a thread I posted on Twitter this morning.

Suppose the probability p that a couple will have a baby girl has a some distribution centered at 0.5 and symmetric about that point. Then half of all births on the planet will be girls, but that doesn’t mean that a particular couple is equally likely to have a boy or a girl.

How could you tell the difference empirically? You couldn’t if every family had one child. But suppose you studied all families with four children, for example. You’d expect 1 in 16 such families to have all boys, and 1 in 16 families to have all girls. If the proportions are higher than that, and they are, then that suggests that the distribution on p, the probability of a couple having a girl, is not constant.

Suppose the probability of a couple having girls has a beta(a, b) distribution. We would expect a and b to be approximately equal, since about half of babies are girls, and we’d expect a and b to be large, i.e. for the distribution be fairly concentrated around 1/2. For example, here’s a plot with a = b = 100.

Then the probability distribution for the number of girls in a family of n children is given by a beta-binomial distribution with parameters n, a, and b. That is, the probability of x girls in a family of size n is given by

The mean of this distribution is na/(a+b). And so if a = b then the mean is n/2, half girls and half boys.

But the variance is more interesting. The variance is

The variance of a binomial, corresponding to a constant p, is np(1-p). In the equation above, p corresponds to a/(a+b), and (1-p) corresponds to b/(a+b). And there’s an extra term,

which is larger than 1 when n > 1. This says a beta binomial random variable always has more variance than the corresponding binomial distribution with the same mean.

Now suppose a family has had n children, with g girls and n – g boys. Then the posterior predictive probability of a girl on the next birth is

If g = n/2 then this probability is 1/2. But if g > n/2 then the probability is greater than 1/2. And the smaller a and b are, the more the probability exceeds 1/2.

The binomial model is the limit of the beta-binomial model as a and b go to infinity (proportionately). In the limit, the probability above equals a/(a+b), independent of g and n.

The post Babies and the beta-binomial distribution first appeared on John D. Cook.

Can you have confidence in a confidence interval?

John — Sun, 23 Apr 2023 18:04:29 +0000

“The only use I know for a confidence interval is to have confidence in it.” — L. J. Savage

Can you have confidence in a confidence interval? In practice, yes. In theory, no.

If you have a 95% confidence interval for a parameter θ, can you be 95% sure that θ is in that interval? Sorta.

Frequentist theory

According to frequentist theory, parameters such as θ are fixed constants, though unknown. Probability statements about θ are forbidden. Here’s an interval I. What is the probability that θ is in I? Well, it’s 1 if θ is in the interval and 0 if it is not.

In theory, the probability associated with a confidence interval says something about the process used to create the interval, not about the parameter being estimated. Our θ is an immovable rock. Confidence intervals come and go, some containing θ and others not, but we cannot make probability statements about θ because θ is not a random variable.

Here’s an example of a perfectly valid and perfectly useless way to construct a 95% confidence interval. Take an icosahedron, draw an X on one face, and leave the other faces unmarked. Roll this 20-sided die, and if the X comes up on top, return the empty interval. Otherwise return the entire real line.

The resulting interval, either ø or (-∞, ∞), is a 95% confidence interval. The interval is the result of a process which will contain the parameter θ 95% of the time.

Now suppose I give you the empty set as my confidence interval. What is the probability now that θ is in the empty interval? Zero. What if I give you the real line as my confidence interval. What is the probability that θ is in the interval? One. The probability is either zero or one, but in no case is it 0.95. The probability that a given interval produced this way contains θ is never 95%. But before I hand you a particular result, the probability that the interval will be one that contains θ is 0.95.

Bayesian theory

Confidence intervals are better in practice than the example above. And importantly, frequentist confidence intervals are usually approximately Bayesian credible intervals.

In Bayesian statistics you can make probability statements about parameters. Once again θ is some unknown parameter. How might we express our uncertainty about the value of θ? Probability! Frequentist statistics represents some forms of uncertainty by probability but not others. Bayesian statistics uses probability to model all forms of uncertainty.

Example

Suppose I want to know what percentage of artists are left handed and I survey 400 artists. I find that 127 of artists surveyed were southpaws. A 95% confidence interval, using the most common approach rather than the pathological approach above, is given by

This results in a confidence interval of (0.272, 0.363).

Now suppose we redo our analysis using a Bayesian approach. Say we start with a uniform prior on θ. Then the posterior distribution on θ will have a beta(128, 264) distribution.

Now we can say in clear conscience that there is a 94% posterior probability that θ is in the interval (0.272, 0.363).

There are a couple predictable objections at this point. First, we didn’t get exactly 95%. No, we didn’t. But we got very close.

Second, the posterior probability depends on the prior probability. However, it doesn’t depend much on the prior. Suppose you said “I’m pretty sure most people are right handed, maybe 9 out of 10, so I’m going to start with a beta(1, 9) prior.” If so, you would compute the probability of θ being in the interval (0.272, 0.373) to be 0.948. Your a priori knowledge led you to have a little more confidence a posteriori.

Commentary

The way nearly everyone interprets a frequentist confidence interval is not justified by frequentist theory. And yet it can be justified by saying if you were to treat it as a Bayesian credible interval, you’d get nearly the same result.

You can often justify an informal understanding of frequentist statistics on Bayesian grounds. Note, however, that a Bayesian interpretation would not rescue the 95% confidence interval that returns either the empty set or the real line.

Often frequentist and Bayesian analyses reach approximately the same conclusions. A Bayesian can view frequentist techniques as convenient ways to produce approximately correct Bayesian results. And a frequentist can justify using a Bayesian procedure because the procedure has good frequentist properties.

There are times when frequentist and Bayesian results are incompatible, and in that case the Bayesian results typically make more sense in my opinion. But very often other considerations, such as model uncertainty, are much more substantial than the difference between a frequentist and Bayesian analysis.

The post Can you have confidence in a confidence interval? first appeared on John D. Cook.

Privacy and tomography

John — Sat, 01 Apr 2023 22:19:49 +0000

I ran across the image below [1] when I was searching for something else, and it made me think of a few issues in data privacy.

The green lights cut across the man’s body like tomographic imaging. No one of these green lines would be identifiable, but maybe the combination of the lines is.

We can tell it’s a young man in the image, someone in good shape. But is this image identifiable?

It’s identifiable to the photographer. It’s identifiable to the person in the photo.

If hundreds of men were photographed under similar circumstances, and we knew who those men were, we could probably determine which of them is in this particular photo.

The identifiability of the photo depends on context. The HIPAA Privacy Rule acknowledges that the risk of identification of individuals in a data set depends on who is receiving the data and what information the recipient could reasonably combine the data with. The NSA could probably tell who the person in the photo is; I cannot.

Making photographs like this available is not a good idea if you want to protect privacy, even if its debatable whether the person in the photo can be identified. Other people photographed under the same circumstances might be easier to identify.

Differential privacy takes a more conservative approach. Differential privacy advocates argue that deidentification procedures should not depend on what the recipient knows. We can never be certain what another person does or doesn’t know, or what they might come to know in the future.

Conventional approaches to privacy consider some data more identifiable than other data. Your phone number and your lab test results from a particular time and place are both unique, but the former is readily available public information and the latter is not. Differential privacy purists would say both kinds of data should be treated identically. A Bayesian approach might be to say we need to multiply each risk by a prior probability: the probability of an intruder looking up your phone number is substantially greater than the probability of the intruder accessing your medical records.

Some differential privacy practitioners make compromises, agreeing that not all data is equally sensitive. Purists decry this as irresponsible. But if the only alternatives are pure differential privacy and traditional approaches, many clients will choose the latter, even though an impure approach to differential privacy would have provided better privacy protection.

[1] Photo by Caspian Dahlström on Unsplash

The post Privacy and tomography first appeared on John D. Cook.

Privacy implications of hashing data

John — Wed, 29 Mar 2023 11:06:30 +0000

Cryptographic hash functions are also known as one-way functions because given an input x, one can easily compute its hashed value f(x), but it is impractical to recover x from knowing f(x).

However, if we know that x comes from a small universe of possible values, our one-way function can effectively become a two-way function, i.e. it may be possible to start with f(x) and recover x using a rainbow table attack.

It’s possible to defend against a rainbow attack yet still leak information about the probability distribution of values of x given f(x).

For privacy protection, hashing is better than not hashing. Keyed hashing is better than unkeyed hashing. But even keyed hashing can leak information.

Rainbow tables

Suppose a data set contains SHA-256 hashed values of US Social Security Numbers (SSNs). Since SSNs have 10 digits, there are 10 billion possible SSNs. It would be possible to hash all possible SSNs and create a lookup table, known as a rainbow table. There are three things that make the rainbow table attack possible in this example:

The range of possible inputs is known and relatively small.
The hashing algorithm is known.
The hashing algorithm can be computed quickly.

There are a couple ways to thwart a rainbow table attack. Assuming we have no control of (1) above, we can alter (2) or (3).

Keyed hashing

A way to alter (2) is to use a keyed hash algorithm. For example, if we XOR the SSNs with a key before applying SHA-256, an attacker cannot construct a rainbow table without knowing the key. The attacker may know the core hash algorithm we are using, but they do not know the entire algorithm because the key is part of the algorithm.

Expensive hashing

A way to alter (3) is to use hashing algorithms that are expensive to compute, such as Argon2. The idea is that such hash functions can be computed quickly enough for legitimate use cases, but not for brute-force attacks.

The time required to create a rainbow table is the product of the time to compute a single hash and the size of the set of possible inputs. If it took an hour to compute a hash value, but there are only 10 possible values, then an attacker could compute a rainbow table in 10 hours.

Leaking attribute probabilities

Now suppose we’re using a keyed hashing algorithm. Maybe we’re paranoid and use an expensive hashing algorithm on top of that. What can go wrong now?

If we’re hashing values that each appear one time then we’re OK. If we apply a keyed hash to primary keys in a database, such as patient ID, then the hashed ID does not reveal the patient ID. But if we hash attributes associated with that patient ID, things are different.

Frequency analysis

Suppose US state is an attribute in your database, and you hash this value. No matter how secure and how expensive your hash algorithm is, each state will have a unique hash value. If the database contains a geographically representative sample of the US population, then the hashed state value that appears most often is probably the most populous state, i.e. California. The second most common hashed state value probably corresponds to Texas.

Things are fuzzier on the low end. The hashed state value appearing least often may not correspond to Wyoming, but it very likely does not correspond to Florida, for example.

In short, you can infer the state values using the same kind of frequency analysis you’d use to solve a simple substitution cipher in a cryptogram puzzle. The larger the data set, the more closely the empirical order will likely align with the population order.

Maybe this is OK?

Frequency analysis makes it possible to infer (with some uncertainty) the most common values. But the most common values are also the least informative. Knowing someone is from California narrows down their identity less than knowing that they are from Wyoming.

Frequency analysis is less specific for less common values. It might not tell you with much confidence that someone is from Wyoming, but it might tell you with some confidence that they come from a low-population state. However, since there are several low-population states, knowing someone is from such a state, without knowing which particular state, isn’t so informative.

Data privacy depends on context. Knowing what state someone is likely from may or may not be a problem depending on what other information is available.

The post Privacy implications of hashing data first appeared on John D. Cook.

Burr distribution

John — Tue, 14 Feb 2023 15:26:50 +0000

Irving Burr came up with a set of twelve probability distributions known as Burr I, Burr II, …, Burr XII. The last of these is by far the best known, and so the Burr XII distribution is often referred to simply as the Burr distribution [1]. See the next post for the rest of the Burr distributions.

Cumulative density functions (CDFs) of probability distributions don’t always have a closed form, but it’s convenient when they do. And the CDF of the Burr XII distribution has a particularly nice closed form:

You can also add a scale parameter the usual way.

The flexibility of the Burr distribution isn’t apparent from looking at the CDF. It’s obviously a two-parameter family of distributions, but not all two-parameter families are equally flexible. Flexibility is a fuzzy concept, and there are different notions of flexibility for different applications. But one way of measuring flexibility is the range of (skewness, kurtosis) values the distribution family can have.

The diagram below [2] shows that the Burr distribution can take on a wide variety of skewness and kurtosis combinations, and that for a given value of skewness, the Burr distribution has the largest kurtosis of common probability distribution families. (Note that the skewness axis increases from top to bottom.)

The green area at the bottom of the rainbow belongs to the Burr distribution. The yellow area is a subset of the green area.

[1] In economics the Burr Type XII distribution is also known as the Singh-Maddala distribution.

[2] Vargo, Pasupathy, and Leemis. Moment-Ratio Diagrams for Univariate Distributions. Journal of Quality Technology. Vol. 42, No. 3, July 2010. The article contains a grayscale version of the image, and the color version is available via supplementary material online.

The post Burr distribution first appeared on John D. Cook.

Maxwell-Boltzmann and Gamma

John — Sat, 07 Jan 2023 17:25:42 +0000

When I shared an image from the previous post on Twitter, someone who goes by the handle Nonetheless made the astute observation that image looked like the Maxwell-Boltzmann distribution. That made me wonder what 1/Γ(x) would be like turned into a probability distribution, and whether it would be approximately like the Maxwell-Boltzmann distribution.

(Here I’m looking at something related to what Nonetheless said, but different. He was looking at my plot of the error in approximating 1/Γ(x) by partial products, but I’m looking at 1/Γ(x) itself.)

Making probability distributions

You can make any non-negative function with a finite integral into a probability density function by dividing it by its integral. So we can define a probability density function

where

and so c = 0.356154. (The integral cannot be computed in closed form and so has to be evaluated numerically.)

Here’s a plot of f(x).

Note that we’re doing something kind of odd here. It’s common to see the gamma function in the definition of probability distributions, but the function is always evaluated at distribution parameters, not at the free variable x. We’ll give an example of this shortly.

Maxwell-Boltzmann distribution

The Maxwell-Boltzmann distribution, sometimes called just the Maxwell distribution, is used in statistical mechanics to give a density function for particle velocities. In statistical terminology, the Maxwell-Boltzmann distribution is a chi distribution [1] with three degrees of freedom and a scale parameter σ that depends on physical parameters.

The density function for a Maxwell-Boltzmann distribution with k degrees of freedom and scale parameter σ has the following equation.

Think of this as x^k-1 exp(-x² / 2σ²) multiplied by whatever you have to multiply it by to make it integrate to 1; most of the complexity in the definition is in the proportionality constant. And as mentioned above, the gamma function appears in the proportionality constant.

For the Maxwell-Boltzmann distribution, k = 3.

Lining up distributions

Now suppose we want to compare the distribution we’ve created out of 1/Γ(x) with the Maxwell-Boltzman distribution. One way of to do this would be to align the two distributions to have their peaks at the same place. The mode of a chi random variable with k degrees of freedom and scale σ is

and so for a given k we can solve for σ to put the mode where we’d like.

For positive values, the minimum of the gamma function, and hence the maximum of its reciprocal, occurs at 1.46163. As with the integral above, this has to be computed numerically.

For the Maxwell-Boltzmann distribution, i.e. when k = 3, we don’t get a very good fit.

The tails on the chi distribution are too thin. If we try again with k = 2 we get a much better fit.

You get an even better fit with k = 1.86. The optimal value of k would depend on your criteria for optimality, but k = 1.86 looks like a good value just eyeballing it.

[1] This is a chi distribution, not the far more common chi-square distribution. The first time you see this is looks like a typo. If X is a random variable with a chi random variable with k degrees of freedom then X² has a chi-square distribution with k degrees of freedom.

The post Maxwell-Boltzmann and Gamma first appeared on John D. Cook.

Surprisingly not that surprising

John — Fri, 16 Dec 2022 15:03:43 +0000

World record marathon times have been falling in increments of roughly 30 seconds, each new record shaving roughly 30 seconds off the previous record. If someone were to set a new record, taking 20 seconds off the previous record, this would be exciting, but not suspicious. If someone were to take 5 minutes off the previous record, that would be suspicious.

One way to quantify how surprising a new record is would be to divide its margin of improvement over the previous margin of improvement. That is, given a new record y, and previous records y₁ and y₂, we can calculate an index of surprise by

r = (y – y₁) / (y₁ – y₂)

In [1] the authors analyze this statistic and extensions that take into account more than just the latest two records under technical assumptions I won’t get into here.

A p-value for the statistic R is given by

Prob(R > r) = 2/(r + 2).

You could think of this as a scale of surprise, with 0 being impossibly surprising and 1 being completely unremarkable.

There are multiple reasons to take this statistic with a grain of salt. It is an idealization based on assumptions that may not even approximately hold in a particular setting. And yet it is at least a useful rule of thumb.

The current marathon record beat the previous record by by 30 seconds. The previous margin of improvement was 78 seconds. This gives a value of r equal to 0.385 and a corresponding p-value of 0.84. This says the current record is impressive but statistically unremarkable. An improvement of 5 minutes, i.e. 300 seconds, would result in a p-value of 0.17, which is notable but not hard evidence cheating. [2]

The assumptions in [1] do not apply to marathon times, and may not apply to many situations where the statistic above nevertheless is a useful rule of thumb. The ideas in the paper could form the basis of a more appropriate analysis customized for a particular application.

Reports of a new record in any context are usually over-hyped. The rule of thumb above gives a way to gauge for yourself whether you should share the report’s excitement. You shouldn’t read too much into it, like any rule of thumb, but it at least gives a basis for deciding whether something deserves closer attention.

Big correlations and big interactions

John — Sat, 26 Nov 2022 17:07:29 +0000

An outcome cannot be highly correlated with a large number of independent predictors.

This observation has been called the piranha problem. Predictors are compared to piranha fish. If you have a lot of big piranhas in a small pond, they start eating each other. If you have a lot of strong predictors, they predict each other.

In [1] the authors quantify the piranha effect several ways. I’ll just quote the first one here. See the paper for several other theorems and commentary on their implications.

If X₁, …, X_p, y are real-valued random variables with finite non-zero variance, then

So if the left side is large, either because p is large or because some of the correlations are large, then the right side is also large, and so the sum of the interaction terms is large.

[1]. The piranha problem: large effects swimming in a small pond. Available on arxiv.

The post Big correlations and big interactions first appeared on John D. Cook.

Design of experiments and design theory

John — Mon, 21 Nov 2022 18:04:40 +0000

Design of experiments is a branch of statistics, and design theory is a branch of combinatorics, and yet they overlap quite a bit.

It’s hard to say precisely what design theory is, but it’s consider with whether objects can be arranged in certain ways, and if so how many ways this can be done. Design theory is pure mathematics, but it is of interest to people working in ares of applied mathematics such as coding theory and statistics.

Here’s a recap of posts I’ve written recently related to design of experiments and design theory.

Design of Experiments

A few weeks ago I wrote about fractional factorial design. Then later I wrote about response surface models. Then a diagram from central composite design, a popular design in response surface methodology, was one the diagrams in a post I wrote about visually similar diagrams from separate areas of application.

I wrote two posts about pitfalls with A/B testing. One shows how play-the-winner sequential testing has the same problems as Condorcet’s voter paradox, with the order of the tests potentially determining the final winner. More seriously, A/B testing cannot detect interaction effects which may be critical.

ANSI and Military Standards

There are several civilian and military standards related to design of experiments. The first of these was MIL-STD-105. The US military has retired this standard in favor of the civilian standard ASQ/ANSI Z1.4 which is virtually identical.

Similarly, the US military standard MIL-STD-414 was replaced by the very similar civilian standard ASQ/ANSI Z1.9. This post looks at the mean-range method for estimating variation which these two standards reference.

Design Theory

I wrote a couple posts on Room squares, one on Room squares in general and one on Thomas Room’s original design now known as a Room square. Room squares are used in tournament designs.

I wrote a couple posts about Costas arrays, an introduction and a post on creating Costas arrays in Mathematica.

Latin Squares

Latin squares and Greco-Latin squares a part of design theory and a part of design of experiments. Here are several posts on Latin and Greco-Latin squares.

The post Design of experiments and design theory first appeared on John D. Cook.

Three diagrams

John — Wed, 09 Nov 2022 12:30:17 +0000

This post will give examples of three similar diagrams that occur in three dissimilar areas: design of experiments, finite difference methods for PDEs, and numerical integration.

Central Composite Design (CCD)

The most popular design for fitting a second-order response surface is the central composite design or CCD. When there are two factors being tested, the combinations of the two factors are chosen according to the following diagram, with the value at the center being used for several replications, say four.

The variables are coded so that the corners of the square have both coordinates equal to ±1. The points outside the square, the so-called axial points, have one coordinate equal to 0 and another equal to ±√2.

Finite difference method

The star-like arrangement of points with the middle being weighted more heavily is reminiscent of the five-point finite difference template for the Laplacian. Notice that the center point is weighted by a factor of -4.

In the finite difference method, this pattern of grid points carries over to a pattern of matrix coefficients.

Integrating over a disk

The CCD grid looks even more like a numerical integration pattern for a function f defined over a disk of radius h. See A&S equation 25.4.61.

The center of the circle has weight w = 1/6. The four points on the circle are located at (±h, 0) and (0, ±h) and have weight w = 1/24. The remaining points are located at (±h/2, ±h/2) and have weight w = 1/6.

The post Three diagrams first appeared on John D. Cook.

Another problem with A/B testing: interaction effects

John — Mon, 31 Oct 2022 16:27:01 +0000

The previous post looked at a paradox with A/B testing: your final result may depend heavily on the order of your tests. This post looks at another problem with A/B testing: the inability to find interaction effects.

Suppose you’re debating between putting a photo of a car or a truck on your website, and you’re debating between whether the vehicle should be red or blue. You decide to use A/B testing, so you test whether customers prefer a red truck or a blue truck. They prefer the blue truck. Then you test whether customers prefer a blue truck or a blue car. They prefer the blue truck.

Maybe customers would prefer a red car best of all, but you didn’t test that option. By testing vehicle type and color separately, you didn’t learn about the interaction of vehicle type and color. As Andrew Gelman and Jennifer Hill put it [1],

Interactions can be important. In practice, inputs that have large main effects also tend to have large interactions with other inputs. (However, small main effects do not preclude the possibility of large interactions.)

Notice that sample size is not the issue. Suppose you tested the red truck against the blue truck with 1000 users and found that 88.2% preferred the blue truck. You can be quite confident that users prefer the blue truck to the red truck. Suppose you also used 1000 users to test the blue truck against the blue car and this time 73.5% preferred the blue truck. Again you can be confident in your results. But you failed to learn something that you might have learned if you’d split 100 users between four options: red truck, blue truck, red car, blue car.

Experiment size

This is an example of a factorial design, testing all combinations of the factors involved. Factorial designs seem impractical because the number of combinations can grow very quickly as the number of factors increases. But if it’s not practical to test all combinations of 10 factors, for example, that doesn’t mean that it’s impractical to test all combinations of two factors, as in the example above. It is often practical to use a full factorial design for a moderate number of factors, and to use a fractional factorial design with more factors.

If you only test one factor at a time, you’re betting that interaction effects don’t matter. Maybe you’re right, and you can optimize your design by optimizing each variable separately. But if you’re wrong, you won’t know.

Agility

The advantage of A/B tests is that they can often be done rapidly. Blue or red? Blue. Car or truck? Truck. Done. Now let’s test something else.

If the only options were between a rapid succession of tests of one factor at a time or one big, complicated statistical test of everything, speed might win. But there’s another possibility: a rapid succession of slightly more sophisticated tests.

Suppose you have 9 factors that you’re interested in, and you understandably don’t want to test several replications of 2⁹ = 512 possibilities. You might start out with a (fractional) factorial design of 5 of the factors. Say that only one of these factors seems to make much difference, no matter what you pair it with. Next you do another experiment testing 5 factors at a time, the winner of the first experiment and the 4 factors you haven’t tested yet. This lets you do two small experiments rather than one big one.

Note that in this example you’re assuming that the factors that didn’t matter in the first experiment wouldn’t have important interactions with the factors in the second experiment. And your assumption might be wrong. But you’re making an educated guess, based on data from the first experiment. This is less than ideal, but it’s better than the alternative of testing every factor one at a time, assuming that no interactions matter. Assuming that some interactions don’t matter, based on data, is better than making a blanket assumption that no interactions matter, based on no data.

Testing more than one factor at a time can be efficient for screening as well as for finding interactions. It can help you narrow in on the variables you need to test more thoroughly.

[1] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

The post Another problem with A/B testing: interaction effects first appeared on John D. Cook.

ASQ/ANSI Z1.4 sampling procedures

John — Tue, 25 Oct 2022 18:49:01 +0000

I mentioned the other day that the US military standard MIL-STD-105 for statistical sampling procedures lives on in the ASQ/ANSI standard Z1.4. The Department of Defense cancelled their own standard in 1995 in favor of adopting civilian standards, in particular ASQ/ANSI Z1.4.

There are two main differences between military standard and its replacement. First, the military standard is free and the ASNI standard costs $199. Second, the ASNI standard has better typography. Otherwise the two standards are remarkably similar.

For example, the screen shot from MIL-STD-105 that I posted the other day appears verbatim in ASQ/ASNI Z1.4, except with better typesetting. The table even has the same name: “Table II-B Single sampling plans for tightened inspection (Master table).” Since the former is public domain and the latter is copyrighted, I’ll repeat my screenshot of the former.

Everything I said about the substance of the military standard applies to the ASNI standard. The two give objective, checklist-friendly statistical sampling plans and acceptance criteria. The biggest strength and biggest weakness of these plans is the lack of nuance. One could create more sophisticated statistical designs that are more powerful, but then you lose the simplicity of a more regimented approach.

A company could choose to go down both paths, using more informative statistical models for internal use and reporting results from the tests dictated by standards. For example, a company could create a statistical model that takes more information into account and use it to assess the predictive probability of passing the tests required by sampling standards.

Sharing data without letting it go

John — Mon, 24 Oct 2022 20:09:51 +0000

Suppose two companies would like to share data, but they’d also each like to retain ownership of their own data. They’d like to enable querying as if each company had given the other all its data, without actually letting go of its data.

Maybe the two companies are competitors who want to collaborate for a particular project. Or maybe the companies each have data that they are not legally allowed to share with the other. Maybe one company is interested in buying (the data of) the other and would like to have some sort of preview of what they may be buying.

Differential privacy makes this possible, and can be useful even if privacy is not an issue. The two companies have data on inanimate widgets, not persons, and yet they have privacy-like concerns. They don’t want to hand over row-level data about their widgets, and yet they both want to be able to pose questions about the combined widget data. The situation is analogous to being concerned about the “privacy” of the widgets.

Both companies would deposit data with a trusted third party, and gain access to this data via an API that implements differential privacy. Such APIs let users pose queries but do not allow either side to request row-level data.

How is this possible? What if one party poses a query that unexpectedly turns out to be asking for row-level data? For example, maybe someone asks for the average net worth of customers in Alaska, assuming there are hundreds of such customers, but the data only contains one customer in Alaska. What was intended to be an aggregate query turns out to be a row-level query.

Differential privacy handles this sort of thing automatically. It adds an amount of random noise to each query in proportion to how sensitive the query is. If you ask for what amounts to data about an individual person (or individual widget) the system will add enough noise to the result to prevent revealing row-level information. (The system may also refuse to answer the query; this is done more often in practice than in theory.) But if you ask a question that reveals very little information about a single database row, the amount of noise added will be negligible.

The degree of collaboration can be limited up front by setting a privacy budget for each company. (Again, we may not necessarily be talking about the privacy of people. We may be looking at analogous protections on units of information about physical things, such as results of destructive testing of expensive hardware.)

Someone could estimate at the start of the collaboration how large the privacy budget would need to be to allow both companies to satisfy their objectives without being so large as to risk giving away intellectual property that the parties do not wish to exchange. This budget would be spent over the course of the project. When the parties exhaust their privacy budgets, they can discuss whether to allow each other more query budget.

This arrangement allows both parties the ability to ask questions of the combined data as if they had exchanged data. However, neither party has given up control of its data. They have given each other some level of information inferred from the combined data, but neither gives a single row of data to the other.

What’s “differential” about differential privacy?
Secret sharing with encryption
Homomorphic encryption
Data privacy consulting

Sharing data without letting it go

John D. Cook

Statistics | John D. Cook

A Bayesian approach to proving you’re human

Related posts

Estimating an author’s vocabulary

Detecting the language of encrypted text

Uncovering names masked with stars

Related posts

When is less data less private?

How likely is a random variable to be far from its center?

Two-digit zip codes

Related posts

Beta inequality symmetries

Related posts

The Five Safes data privacy framework

Database reconstruction attacks

Related posts

Differentially private stochastic gradient descent

Gradient descent

Stochastic gradient descent

Differentially private stochastic gradient descent

Related posts

Using classical statistics to avoid regulatory burden

Identifiers depend on context

V-statistics

Related posts

Moments of Tukey’s g-and-h distribution

Definition

Special cases

Calculating skewness and kurtosis

Related posts

Symmetric functions and U-statistics

Eccentricity of bivariate normal level sets

Plots

Eccentricity and aspect ratio

Related posts

Babies and the beta-binomial distribution

Related posts

Can you have confidence in a confidence interval?

Frequentist theory

Bayesian theory

Example

Commentary

Related posts

Privacy and tomography

Privacy implications of hashing data

Rainbow tables

Keyed hashing

Expensive hashing

Leaking attribute probabilities

Frequency analysis

Maybe this is OK?

Related posts

Burr distribution

Maxwell-Boltzmann and Gamma

Making probability distributions

Maxwell-Boltzmann distribution

Lining up distributions

Surprisingly not that surprising

More rule of thumb posts

Big correlations and big interactions

Related posts

Design of experiments and design theory

Design of Experiments

ANSI and Military Standards

Design Theory

Latin Squares

Three diagrams

Central Composite Design (CCD)

Finite difference method

Integrating over a disk

Related posts

Another problem with A/B testing: interaction effects

Experiment size

Agility

Related posts

ASQ/ANSI Z1.4 sampling procedures

Related links

Sharing data without letting it go

Related posts