Using classical statistics to avoid regulatory burden

On June 29 this year I said on Twitter that companies would start avoiding AI to avoid regulation.

Companies are advertising that their products contain AI. Soon companies may advertise that their products are AI-free and thus exempt from AI regulations.

I followed that up with an article Three advantages of non-AI models. The third advantage I listed was

Statistical models are not subject to legislation hastily written in response to recent improvements in AI. The chances that such legislation will have unintended consequences are roughly 100%.

Fast forward four months and we now have a long, highly-detailed executive order, Executive Order 14110, affecting all things related to artificial intelligence. Here’s an excerpt:

… the Secretary [of Commerce] shall require compliance with these reporting requirements for: any model that was trained using a quantity of computing power greater than 10²⁶ integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10²³ integer or floating-point operations; and any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 10²⁰ integer or floating-point operations per second for training AI.

If a classical model can do what you need, you are not subject to any regulations that flow out of the executive order above, at least not if those regulations use definitions similar to the ones in the order itself.

How many floating point operations does it take to train, say, a logistic regression model? It depends on the complexity of the model and the amount of data fed into the model, but it’s not 10²⁰ flops.
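For a rough sense of scale, here's a back-of-the-envelope estimate. The data sizes below are hypothetical and deliberately generous.

```python
# Rough FLOP count for training a logistic regression model with gradient
# descent: each epoch costs on the order of a few times rows * features
# floating-point operations.

rows     = 10_000_000    # hypothetical: ten million observations
features = 1_000         # hypothetical: a thousand predictors
epochs   = 100           # passes over the data

flops = 4 * rows * features * epochs
print(f"{flops:.0e}")    # about 4e+12, more than a dozen orders of magnitude below 10^26
```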

Can you replace an AI model with something more classical, like a logistic regression model or a Bayesian hierarchical model? Very often. I wouldn’t try to compete with Midjourney for image generation that way, but classical models can work very well on many problems. These models are much simpler, perhaps a dozen parameters rather than a billion, and so they are much better understood and inspire less of the fear that leads to regulation.

I had a client that was using some complicated models to predict biological outcomes. I replaced their previous models with a classical logistic regression model and got better results. The company was so impressed with the improvement that they filed a patent on my model.

If you’d like to discuss whether I could help your company replace a complicated AI model with a simpler statistical model, let’s talk.

Differential entropy and privacy

Differential entropy is the continuous analog of Shannon entropy. Given a random variable X with density function f_X, the differential entropy of X, denoted h(X), is defined as

h(X) = -\int f_X(x) \log_2 f_X(x)\, dx

where the integration is over the support of f_X. You may see differential entropy defined using a logarithm to a different base, which changes h(X) by a constant factor.

In [1] the authors defined the privacy of a random variable X, denoted Π(X), as 2 raised to the power h(X).

\Pi(X) = 2^{h(X)}

This post will only look at “privacy” as defined above. Obviously the authors chose the name because of its application to privacy in the colloquial sense, but here we will just look at the mathematics.

Location and scale

It follows directly from the definitions above that location parameters do not affect privacy, and scale parameters change privacy linearly. That is, for σ > 0,

\Pi(\sigma X + \mu) = \sigma \,\Pi(X)
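To see why, note that if Y = σX + μ with σ > 0, then f_Y(y) = f_X((y − μ)/σ)/σ, and a change of variables in the entropy integral gives

h(\sigma X + \mu) = h(X) + \log_2 \sigma

Raising 2 to both sides gives the scaling rule above.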

If we divide by the standard deviation before (or after) computing privacy then we have a dimensionless quantity. Otherwise there’s more privacy in measuring a quantity in centimeters than in measuring it in inches, which is odd since both contain the same information.

Examples

If X is uniformly distributed on an interval of length a, then h(X) = log₂ a and Π(X) = a.

The privacy of a standard normal random variable Z is √(2πe) and so the privacy of a normal random variable with mean μ and variance σ² is σ√(2πe).

The privacy of a standard exponential random variable is e, so the privacy of an exponential with rate λ is e/λ.
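Here's a quick numerical check of these examples, along with the Cauchy example from the Bounds section below. It's a sketch using SciPy, and the function name is mine. Since SciPy reports differential entropy in nats, Π(X) = 2^h(X) with h in bits works out to the exponential of the entropy SciPy returns.

```python
import numpy as np
from scipy import stats

def privacy(dist):
    """Privacy = 2^h(X) with h in bits, i.e. exp of differential entropy in nats."""
    return np.exp(dist.entropy())

a, mu, sigma, lam = 4.0, 10.0, 2.0, 3.0   # example parameters

print(privacy(stats.uniform(0, a)), a)                               # uniform on an interval of length a
print(privacy(stats.norm(mu, sigma)), sigma*np.sqrt(2*np.pi*np.e))   # normal with mean mu, sd sigma
print(privacy(stats.expon(scale=1/lam)), np.e/lam)                   # exponential with rate lambda
print(privacy(stats.cauchy(scale=sigma)), 4*np.pi*sigma)             # Cauchy with scale sigma (see Bounds below)
```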

Bounds

A well-known theorem says that for given variance, differential entropy is maximized by a normal random variable. This means that the privacy of a random variable with variance σ² is bounded above by σ√(2πe).

The privacy of a Cauchy random variable with scale σ is 4πσ, which is greater than σ√(2πe). This does not contradict the statement above because the scale parameter of a Cauchy random variable is not its standard deviation. A Cauchy random variable does not have a standard deviation.


[1] Agrawal D., Aggarwal C. C. On the Design and Quantification of Privacy-Preserving Data Mining Algorithms, ACM PODS Conference, 2002. (Yes, the first author’s name contains one g and the second author’s name contains two.)

Best of N series

A couple days ago I wrote about the likelihood of the better team winning a best-of-five or best-of-seven series. That is, if the probability of X winning a game against Y is p > ½, how likely is it that X will win a majority of 5 games or a majority of 7 games?

This assumes the probability of winning each game is fixed and that each game is independent. Actual sports series are more complicated than that.

I was thinking about the baseball playoffs, and so I chose series of 5 and 7. But some tennis fans asked me to add series of 3. And others asked me to add series of more than 7.

As reported in the earlier post, the probability of X winning a series of N games is

\sum_{n > N-n} \binom{N}{n} p^n q^{N-n}
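Here's a direct way to compute that sum, a minimal sketch; the function name is mine.

```python
from math import comb

def series_win_prob(p, N):
    """Probability that a team with per-game win probability p wins
    a majority of a best-of-N series, N odd."""
    q = 1 - p
    return sum(comb(N, n) * p**n * q**(N - n) for n in range(N//2 + 1, N + 1))

print(series_win_prob(0.6, 7))   # about 0.71
```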

Here is a plot for series of 1, 3, 5, …, 29 games.

[Plot: probability of the better team winning a best-of-N series for N = 1, 3, 5, …, 29]

The blue diagonal line is the probability of winning each game, i.e. winning a 1-game series.

As you play longer series, the probability of the better team winning increases. Notice that you get the most lift from the diagonal line when playing a 3-game series, the orange line above. You get more lift going from 3 games to 5 games, from the orange line to the green line, though not as much. And as observed in the earlier post, you don’t get much improvement going from 5 games to 7 games, going from green to red.

Even with a series of 29 games, there’s a decent chance that the better team may not win the series, unless the better team is much better. The race is not always to the swift, nor the battle to the strong.

As you keep increasing the number of games N, the slope of the line in the middle becomes steeper, but slowly. You hit diminishing returns immediately, with each additional pair of games making less and less difference. Still, the curve is getting steeper. Here’s the curve for a series of 499 games.

[Plot: probability of the better team winning a best-of-499 series]

The slope of the curve in the middle is increasing in proportion to the square root of the number of games.
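Here's a quick sketch (my own back-of-the-envelope, not from the original post) of where the square root comes from. Write N = 2k + 1. The probability of winning the series is the probability that a binomial(N, p) random variable is at least k + 1, and differentiating that tail probability with respect to p gives

\frac{d}{dp} \sum_{n > N-n} \binom{N}{n} p^n q^{N-n} = (2k+1)\binom{2k}{k}(pq)^k

At p = q = 1/2 the right-hand side is (2k + 1) times the central binomial coefficient divided by 4^k, which by Stirling’s approximation is roughly √(2N/π).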



Best-of-five versus Best-of-seven

Suppose that when Team X and Team Y play, the probability that X will win a single game is p and the probability that Y will win is q = 1 − p.

What is the probability that X will win the majority of a series of N games for some odd number N?

We know intuitively that the team more likely to win each game is more likely to win the series. But how much more likely?

We also know intuitively that the more games in the series, the more likely the better team will win. But how much more likely?

The probability of X winning a series of N games is

\sum_{n > N-n} \binom{N}{n} p^n q^{N-n}

It doesn’t matter that maybe not all games are played: some games may not be played precisely because it makes no difference whether they are played. For example, if one team wins the first three in a best-of-five series, the last two games are not played, but for our purposes we may imagine that they were played.

Let’s plot the probability of X winning a best-of-five series and a best-of-seven series.

The better team is more likely to win a series than a game. The probability of the better team winning a single game would be a diagonal line running from (0, 0) to (1, 1). The curves for a series are below this line to the left of 0.5 and above this line to the right of 0.5. But the difference between a best-of-five and a best-of-seven series is small.

Here’s another plot looking at the difference in probabilities, probability of winning best-of-seven minus probability of winning best-of-five.

The maximum difference is between 3% and 4%.
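Here's a small check of that claim, reusing the series formula above; the function name is mine.

```python
from math import comb

def series_win_prob(p, N):
    """Probability of winning a majority of a best-of-N series, N odd."""
    return sum(comb(N, n) * p**n * (1 - p)**(N - n) for n in range(N//2 + 1, N + 1))

# maximum over p of P(win best of 7) - P(win best of 5)
diff = max(series_win_prob(p/1000, 7) - series_win_prob(p/1000, 5) for p in range(1001))
print(diff)   # roughly 0.037
```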

This assumes the probability of winning a game is constant and that games are independent. This is not exactly the case in, for example, the World Series in which human factors make things more complicated.


U statistics and a new paper by Terence Tao

Terence Tao has a new paper out that relates to a couple things I’ve written about recently.

Elementary symmetric polynomials came up when developing the general equations for tangent sum and hyperbolic tangent sum. The latter post goes into more detail.

Before that, means of symmetric functions, not necessarily elementary polynomials or even polynomials, came up in the context of U-statistics.

Now combine these and you have means of elementary symmetric polynomials. This is what Tao just wrote about. Here is his blog post announcing his paper.

There are several inequalities satisfied by means of elementary symmetric polynomials, named after Maclaurin and Newton. If the inequalities don’t go all the way back to these two men, presumably they are a direct continuation of work they started.

The Maclaurin and Newton inequalities require their arguments to be non-negative; Tao allows arguments that may be negative.

U-statistics are not necessarily the means of elementary symmetric polynomials. But for those U-statistics that are, Tao’s new paper may imply new results. That is, Tao’s new paper has implications for statistics.

Consecutive coupon collector problem

Coupon collector problem

Suppose you have a bag of balls labeled 1 through 1,000. You draw balls one at a time and put them back after each draw. How many draws would you have to make before you’ve seen every ball at least once?

This is the coupon collector problem with N = 1000, and the expected number of draws is

N H_N

where

H_N = 1 + 1/2 + 1/3 + … + 1/N

is the Nth harmonic number.

As N increases, H_N approaches log(N) + γ where γ = 0.577… is the Euler-Mascheroni constant, and so the expected time for the coupon collector problem is approximately

N (log(N) + γ).
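For example, with N = 1,000 the exact expectation and the approximation are nearly identical. A quick check:

```python
import math

N = 1000
H_N = sum(1/k for k in range(1, N + 1))       # Nth harmonic number
exact  = N * H_N                              # exact expected number of draws
approx = N * (math.log(N) + 0.5772156649)     # N (log N + gamma)
print(exact, approx)                          # both about 7485
```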

Consecutive draws

Now suppose that instead of drawing single items, you draw blocks of consecutive items. For example, suppose the 1,000 balls are arranged in a circle. You pick a random starting point on the circle, then scoop up 10 consecutive balls, then put them back. Now how long would it take to see everything?

By choosing consecutive balls, you make it harder for a single ball to be a hold out. Filling in the holes becomes easier.

Bucketed problem

Now suppose the 1,000 balls are placed in 100 buckets and the buckets are arranged in a circle. Now instead of choosing 10 consecutive balls, you choose a bucket of 10 balls. Now you have a new coupon collector problem with N = 100.

This is like the problem above, except you are constraining your starting point to be a multiple of the scoop size n = 10.

Upper and lower bounds

I’ll use the word “scoop” to mean a selection of n balls at a time to avoid possible confusion over drawing individual balls or groups of balls.

If you scoop n balls at a time by making n independent draws, then you just have the original coupon collector problem, with the expected time divided by n.

If you scoop up n consecutively numbered balls each time, you reduce the expected time to see everything at least once. But your scoops can still overlap. For example, maybe you selected 13 through 22 on one draw, and 19 through 28 on the next.

In the bucketed problem, you reduce the expected time even further. Now your scoops will not partially overlap. (But they may entirely overlap, so it’s not clear that this reduces the total time.)

It would seem that we have sandwiched our problem between two other problems we have the solution to. The longest expected time would be if our scoop is made of n independent draws. Then the expected number of scoops is

N H_N / n.

The shortest time is the bucketed problem in which the expected number of scoops is

(N/n) H_{N/n}.

It seems the problem of scooping n consecutive balls, with no constraint on the starting point, would have expected time somewhere between these two bounds. I say “it seems” because I haven’t proven anything here, just given plausibility arguments.

By the way, we can see how much bucketing reduces the expected time by using the log approximation above. With n independent draws each time, the expected number of scoops is roughly

(N/n) log(N)

whereas with the bucketed problem the expected number of scoops is roughly

(N/n) log(N/n).

Expected number of scoops

I searched a bit on this topic, and I found many problems with titles like “A variation on the coupon collector problem,” but none of the papers I found considered the variation I’ve written about here. If you work out the expected number of scoops, or find a paper where someone has worked this out, please let me know.

The continuous analog seems like an easier problem, and one that would provide a good approximation. Suppose you have a circle of circumference N and randomly place arcs of length n on the circle. What is the expected time until the circle is covered? I imagine this problem has been worked out many times and may even have a name.

Update: Thanks to Monte for posting links to the solution to the continuous problem in the comments below.

Simulation results

When N = 1000 and n = 10, the upper and lower bounds work out to 748 and 518.

When I simulated the consecutive coupon collector problem I got an average of 675 scoops, a little more than the average of the upper and lower bounds.
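Here's a sketch of the kind of simulation described above. This is my reconstruction, not necessarily the code behind the reported average.

```python
import random

def average_scoops(N=1000, n=10, trials=1000, seed=42):
    """Average number of scoops of n consecutive balls (random start on a
    circle of N balls) needed until every ball has been seen at least once."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = [False] * N
        remaining = N
        scoops = 0
        while remaining > 0:
            start = rng.randrange(N)
            scoops += 1
            for i in range(n):
                j = (start + i) % N
                if not seen[j]:
                    seen[j] = True
                    remaining -= 1
        total += scoops
    return total / trials

print(average_scoops())   # the post reports an average of about 675
```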

 

Hyperellipsoid surface area

Dimension 2

The equation for the perimeter of an ellipse is

p = 4aE(e^2)

where a is the semimajor axis, e is eccentricity, and E is a special function. The equation is simple, in the sense that it has few terms, but it is not elementary, because it depends on an advanced function, the complete elliptic integral of the second kind.

However, there is an approximation for the perimeter that is both simple and elementary:

p \approx 2 \pi \left(\frac{a^{3/2} + b^{3/2}}{2} \right )^{2/3}
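Here's a quick comparison of the exact perimeter and the approximation, using SciPy's complete elliptic integral. The axis lengths are arbitrary example values.

```python
import numpy as np
from scipy.special import ellipe     # complete elliptic integral of the second kind E(m)

a, b = 2.0, 1.0                      # semimajor and semiminor axes (example values)
e2 = 1 - (b/a)**2                    # eccentricity squared
exact  = 4 * a * ellipe(e2)          # p = 4 a E(e^2)
approx = 2 * np.pi * ((a**1.5 + b**1.5) / 2) ** (2/3)
print(exact, approx)                 # the two agree to within a small fraction of a percent
```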

Dimension 3

The generalization of an ellipse to three dimensions is an ellipsoid. The surface area of an ellipsoid is neither simple nor elementary. The surface area S is given by

\begin{align*} \varphi &= \arccos\left(\frac{c}{a} \right) \\ k^2 &= \frac{a^2\left(b^2 - c^2\right)}{b^2\left(a^2 - c^2\right)} \\ S &= 2\pi c^2 + \frac{2\pi ab}{\sin(\varphi)}\left(E(\varphi, k)\,\sin^2(\varphi) + F(\varphi, k)\,\cos^2(\varphi)\right) \end{align*}

where E is the incomplete elliptic integral of the second kind and F is the incomplete elliptic integral of the first kind.

However, once again there is an approximation that is simple and elementary. The surface area is approximately

S \approx 4\pi \left( \frac{(ab)^p + (bc)^p + (ac)^p}{3} \right )^{1/p}
where p = 1.6075.
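Here's a sketch comparing the exact formula with the approximation, using SciPy's incomplete elliptic integrals. The helper names and axis values are mine, and the exact formula assumes a ≥ b ≥ c.

```python
import numpy as np
from scipy.special import ellipkinc, ellipeinc   # incomplete elliptic integrals F(phi, m), E(phi, m)

def surface_area_exact(a, b, c):
    """Exact ellipsoid surface area via the formula above; requires a >= b >= c > 0."""
    phi = np.arccos(c / a)
    m = a**2 * (b**2 - c**2) / (b**2 * (a**2 - c**2))   # m = k^2
    s = np.sin(phi)
    return 2*np.pi*c**2 + (2*np.pi*a*b/s) * (ellipeinc(phi, m)*s**2 + ellipkinc(phi, m)*np.cos(phi)**2)

def surface_area_approx(a, b, c, p=1.6075):
    """Approximation above: 4 pi times the p-mean of the pairwise products of semiaxes."""
    return 4*np.pi*(((a*b)**p + (b*c)**p + (a*c)**p) / 3)**(1/p)

print(surface_area_exact(3, 2, 1), surface_area_approx(3, 2, 1))   # differ by a fraction of a percent
```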

Notice the similarities between the approximation for the perimeter of an ellipse and the approximation for the area of an ellipsoid. The former is the perimeter of a unit circle times a kind of mean of the axes. The latter is the area of a unit sphere times a kind of mean of the products of pairs of axes. The former uses a p-mean with p = 1.5 and the latter uses a p-mean with p = 1.6075. More on such means here.

Dimensions 4 and higher

The complexity of expressions for the surface area of an ellipsoid apparently increases with dimension. The expressions get worse for hyperellipsoids, i.e. n-ellipsoids for n > 3. You can find such expressions in [1]. More on that in just a minute.

Conjecture

It is natural to conjecture, based on the approximations above, that the surface area of an n-ellipsoid is the area of a unit sphere in dimension n times the p-mean of all products of n-1 semiaxes for some value of p.

For example, the surface area of an ellipsoid in 4 dimensions might be approximately

S \approx 2\pi^2 \left( \frac{(abc)^p + (abd)^p + (acd)^p + (bcd)^p}{4} \right )^{1/p}

for some value of p.

Why this form? Permutations of the axes do not change the surface area, so we’d expect permutations not to affect the approximation either. (More here.) Also, we’d expect from dimensional analysis for the formula to involve products of n-1 terms since the result gives (n-1)-dimensional volume.

Surely I’m not the first to suggest this. However, I don’t know what work has been done along these lines.

Exact results

In [1] the author gives some very complicated but general expressions for the surface area of a hyperellipsoid. The simplest of his expressions involves probability:

S = n \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n+1}{2} \right)} \text{E}\left(\sqrt{q_1^2 X_1^2 + q_2^2 X_2^2 + \cdots + q_n^2 X_n^2 }\right)

where the Xs are independent normal random variables with mean 0 and variance 1/2.

At first it may look like this can be simplified. The sum of normal random variables is a normal random variable. But the squares of normal random variables are not normal, they’re gamma random variables. The sum of gamma random variables is a gamma random variable, but that’s only if the variables have the same scale parameter, and these do not unless all the semiaxes, the qs, are the same.

You could use the formula above as a way to approximate S via Monte Carlo simulation. You could also use asymptotic results from probability to get approximate formulas valid for large n.

 

[1] Igor Rivin. Surface area and other measures of ellipsoids. Advances in Applied Mathematics 39 (2007) 409–427

Justifiable sample size

One of the most common things a statistician is asked to do is compute a sample size. There are well known formulas for this, so why isn’t calculating a sample size trivial?

As with most things in statistics, plugging numbers into a formula is not the hard part. The hard part is deciding what numbers to plug in, which in turn depends on understanding the context of the study. What are you trying to learn? What are the constraints? What do you know a priori?

In my experience, sample size calculation is very different in a scientific setting versus a legal setting.

In a scientific setting, sample size is often determined by budget. This isn’t done explicitly. There is a negotiation between the statistician and the researcher that starts out with talk of power and error rates, but the assumptions are adjusted until the sample size works out to something the researcher can afford.

In a legal setting, you can’t get away with statistical casuistry as easily because you have to defend your choices. (In theory researchers have to defend themselves too, but that’s a topic for another time.)

Opposing counsel or a judge may ask how you came up with the sample size you did. The difficulty here may be more expository than mathematical, i.e. the difficulty lies in explaining subtle concepts, not in carrying out calculations. A statistically defensible study design is no good unless there is someone there who can defend it.

One reason statistics is challenging to explain to laymen is that there are multiple levels of uncertainty. Suppose you want to determine the defect rate of some manufacturing process. You want to quantify the uncertainty in the quality of the output. But you also want to quantify your uncertainty about your estimate of uncertainty. For example, you may estimate the defect rate at 5%, but how sure are you that the defect rate is indeed 5%? How likely is it that the defect rate could be 10% or greater?
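For example, here is one way to put numbers on both levels of uncertainty, a Bayesian sketch with a Jeffreys prior; the inspection numbers are hypothetical.

```python
from scipy.stats import beta

defects, items = 10, 200                                 # hypothetical: 5% observed defect rate
posterior = beta(0.5 + defects, 0.5 + items - defects)   # Jeffreys prior, beta(1/2, 1/2)

print(posterior.interval(0.95))    # 95% credible interval for the true defect rate
print(1 - posterior.cdf(0.10))     # probability the defect rate is 10% or greater
```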

When there are multiple contexts of uncertainty, these contexts get confused. For example, variations on the following dialog come up repeatedly.

“Are you saying the quality rate is 95%?”

“No, I’m saying that I’m 95% confident of my estimate of the quality rate.”

Probability is subtle and there’s no getting around it.


First time seeing a rare event

Suppose you’ve been monitoring a rare event for a long time, then you see your first occurrence on the Nth observation. Now what would you say about the event’s probability?

For example, suppose you’re wondering whether dogs ever have two tails. You observe thousands of dogs and never see two tails. But then you see a dog with two tails. Now what can you say about the probability of dogs having two tails? It’s certainly not zero.

We’ll first look at the case of 0 successes out of N trials then look at the case of 1 success out of N trials.

Zero successes

If you’re observing a binary event and you’ve seen no successes out of N trials your point estimate of the probability of your event is 0. You can’t have any confidence in the relative accuracy of your estimate: if the true probability is positive, no matter how small, then the relative error in your estimate is infinite.

But you can have a great deal of confidence in its absolute accuracy. When you’re looking for a binary event and you have not seen any instances in N trials for large N, then a 95% confidence interval for the event’s probability is approximately [0, 3/N]. This is the statistical rule of three. This is a robust estimate, one you could derive from either a frequentist or Bayesian perspective.
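Where does the 3/N come from? The upper endpoint of the interval is the largest value of p for which seeing zero successes in N trials is still plausible at the 5% level, i.e. the solution to

(1 - p)^N = 0.05

and since −log(0.05) ≈ 3, for large N this gives p ≈ 3/N.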

Note that the confidence interval [0, 3/N] is exceptionally narrow. When observing a moderate mix of successes and failures the width of the confidence interval is on the order of 1/√N, not 1/N.

First success

After seeing your first success, your point estimate jumps from 0 to 1/N, an infinite relative increase. What happens to your confidence interval?

If we use Jeffreys’ beta(1/2, 1/2) prior, then the posterior distribution after seeing 1 success and N − 1 failures is a beta(3/2, N − 1/2). Now an approximate 95% confidence interval is

[0.1/N, 4.7/N]

So compared to the case of seeing zero successes, seeing one success makes our confidence interval about 50% wider and shifts it to the right by 0.1/N.

So if you’ve seen 100,000 dogs and only 1 had two tails, you could estimate that a 95% confidence interval for the probability of a dog having two tails is

[10⁻⁶, 4.7 × 10⁻⁵].

If we run the exact numbers we get

[1.07 × 10⁻⁶, 4.67 × 10⁻⁵].
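Here's how those exact numbers can be computed.

```python
from scipy.stats import beta

N = 100_000                               # dogs observed, one of them with two tails
posterior = beta(1.5, N - 0.5)            # Jeffreys prior updated with 1 success, N - 1 failures
print(posterior.ppf([0.025, 0.975]))      # approximately [1.07e-06, 4.67e-05]
```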


Can the chi squared test detect fake primes?


This morning I wrote about Dan Piponi’s fake prime function. This evening I thought about it again and wondered whether the chi-squared test could tell the difference between the distribution of digits in real primes and fake primes.

If the distributions were obviously different, this would be apparent from looking at histograms. When distributions are more similar, visual inspection is not as reliable as a test like chi-squared. For small samples, we can be overly critical of plausible variations. For large samples, statistical tests can detect differences our eyes cannot.

When data fall into a number of buckets, with a moderate number of items expected to fall in each bucket, the chi squared test attempts to measure how the actual counts in each bucket compare to the expected counts.

This is a two-sided test. For example, suppose you expect 12% of items to fall in bucket A and 88% to fall in bucket B. Now suppose you test 1,000 items. It would be suspicious if only 50 items landed in bucket A since we’d expect around 120. On the other hand, getting exactly 120 items would be suspicious too. Getting exactly the expected number is unexpected!

Let’s look at the primes, genuine and fake, less than N = 200. We’ll take the distribution of digits in the list of primes as the expected values and compare to the distribution of the digits in the fake primes.
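Here's a sketch of the kind of code that runs this comparison. The helper functions are mine, and I'm assuming the fake primes come from a Cramér-style random model that declares n prime with probability 1/log n; the details may differ from the code behind the numbers reported below and from Piponi's original function.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(20231101)

def real_primes(N):
    """Sieve of Eratosthenes: primes below N."""
    sieve = np.ones(N, dtype=bool)
    sieve[:2] = False
    for i in range(2, int(N**0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = False
    return np.nonzero(sieve)[0]

def fake_primes(N):
    """Declare each n >= 2 below N 'prime' with probability 1/log n."""
    ns = np.arange(2, N)
    return ns[rng.random(len(ns)) < 1/np.log(ns)]

def digit_counts(numbers):
    """Counts of each decimal digit 0-9 over all the numbers."""
    counts = np.zeros(10)
    for number in numbers:
        for d in str(int(number)):
            counts[int(d)] += 1
    return counts

N = 1000
observed = digit_counts(fake_primes(N))
expected = digit_counts(real_primes(N))
expected *= observed.sum() / expected.sum()   # scale expected counts to the same total

stat, p = chisquare(observed, expected)       # 9 degrees of freedom
print(stat, p)
```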

When I ran this experiment, I got a chi-squared value of 7.77. This is an unremarkable value for a sample from a chi-squared distribution with 9 degrees of freedom. (There are ten digits, but only nine degrees of freedom: if you rule out nine possibilities, the remaining digit is determined with certainty.)

The p-value in this case, the probability of seeing a value as large as the one we saw or larger, is 0.557.

Next I increased N to 1,000 and ran the experiment again. Now I got a chi-squared value of 19.08, with a corresponding p-value of 0.024. When I set N to 10,000 I got a chi-squared value of 18.19, with a corresponding p-value of 0.033.

When I used N = 100,000 I got a chi-squared value of 130.26, corresponding to a p-value of 10⁻²³. Finally, when I used N = 1,000,000 I got a chi-squared value of 984.7 and a p-value of 3.4 × 10⁻²⁰⁶.

In short, the chi-squared test needs a fair amount of data to tell that fake primes are fake. The distribution of digits for samples of fake primes less than a thousand or so is plausibly the same as that of actual primes, as far as the test can distinguish. But the chi-squared values get implausibly large for fake primes up to 100,000.