Gamma gamma gamma!

There are several things in math and statistics named gamma. Three examples are

  • the gamma function
  • the gamma constant
  • the gamma distribution

This post will show how these are related. We’ll also look at the incomplete gamma function which connects with all the above.

The gamma function

The gamma function is the most important function not usually found on a calculator. It’s the first “advanced” function you’re likely to learn about. You might see it in passing in a calculus class, in a homework problem on integration by parts, but usually there’s not much emphasis on it. But it comes up a lot in applications.

\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt

You can think of the gamma function as a way to extend factorial to non-integer values. For non-negative integers n, Γ(n + 1) = n!.

(Why is the argument n + 1 to Γ rather than n? There are a number of reasons, historical and practical. Short answer: some formulas turn out simpler if we define Γ that way.)
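
Here’s a quick sanity check in Python, using the standard library’s gamma function:

    from math import gamma, factorial

    for n in range(6):
        # gamma(n + 1) equals n! for non-negative integers n
        print(n, gamma(n + 1), factorial(n))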

The gamma constant

The gamma constant, a.k.a. Euler’s constant or the Euler-Mascheroni constant, is defined as the asymptotic difference between harmonic numbers and logarithms. That is,

\gamma = \lim_{n\to\infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}- \log n\right)

The constant γ comes up fairly often in applications. But what does it have to do with the gamma function? There’s a reason the constant and the function are both denoted by the same Greek letter. One connection is that the gamma constant appears in the product formula for the gamma function.

\frac{1}{\Gamma(x)} = x e^{\gamma x} \prod_{r=1}^\infty \left(1 + \frac{x}{r} \right) e^{-x/r}

If we take the logarithm of this formula and differentiate, we find that

\frac{\Gamma'(1)}{\Gamma(1)} = -\gamma
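
The left side above is the digamma function ψ evaluated at 1, so this says ψ(1) = −γ. A quick numerical check using SciPy’s digamma and NumPy’s built-in value of γ:

    from scipy.special import digamma
    import numpy as np

    print(digamma(1))       # -0.5772156649015329
    print(-np.euler_gamma)  # same value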

The gamma distribution

If you take the integrand defining the gamma function and turn it into a probability distribution by normalizing it to integrate to 1, you get the gamma distribution. That is, a gamma random variable with shape k has probability density function (PDF) given by

f(x) = \frac{1}{\Gamma(k)} x^{k-1} e^{-x}

More generally you could add a scaling parameter to the gamma distribution in the usual way. You can imagine the scaling parameter present here but set to 1 to make things simpler.
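
As a small check, SciPy’s gamma distribution with shape k and the default scale of 1 should match the PDF above:

    import numpy as np
    from scipy.stats import gamma as gamma_dist
    from scipy.special import gamma as gamma_fn

    k, x = 2.5, 1.7
    print(gamma_dist.pdf(x, k))                 # SciPy's gamma PDF
    print(x**(k-1) * np.exp(-x) / gamma_fn(k))  # the formula above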

The incomplete gamma function

The incomplete gamma function relates to everything above. It’s like the (complete) gamma function, except the range of integration is finite. So it’s now a function of two variables, the extra variable being the limit of integration.

\gamma(s, x)= \int_0^x t^{s-1} e^{-t} \, dt

(Note that now x appears in the limit of integration, not the exponent of t. This notation is inconsistent with the definition of the (complete) gamma function but it’s conventional.)

It uses a lower case gamma for its notation, like the gamma constant, and is a generalization of the gamma function. It’s also essentially the cumulative distribution function of the gamma distribution. That is, the CDF of a gamma random variable with shape s is γ(s, x) / Γ(s).

The function γ(s, x) / Γ(s) is called the regularized incomplete gamma function. Sometimes the distinction between the regularized and unregularized versions is not explicit. For example, in Python, the function gammainc does not compute the incomplete gamma function per se but the regularized incomplete gamma function. This makes sense because the latter is often more convenient to work with numerically.
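
A quick illustration of both facts:

    from scipy.special import gammainc
    from scipy.stats import gamma

    s, x = 3.0, 2.5
    print(gammainc(s, x))   # regularized incomplete gamma function
    print(gamma.cdf(x, s))  # gamma CDF with shape s, same value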


What is differential privacy?

Differential privacy is a strong form of privacy protection with a solid mathematical definition.

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not. This intuitive idea can be made precise as follows.


Queries and algorithms

First of all, differential privacy is something that applies to queries, not to databases. It could apply to a database if your query is to select everything in the database, but typically you want to run far more specific queries. Differential privacy adds noise to protect privacy, and the broader the query, the more noise must be added, so typically you want queries to be more narrow than “Tell me everything about everything.”

Generally the differential privacy literature speaks of algorithms rather than queries. The two are the same if you have a general idea of what a query is, but using the word algorithm emphasizes that the query need not be a simple SQL query.

Opt-out and neighboring databases

To quantify the idea of “whether your information is included or not” we look at two versions of a database, D and D′, that differ by one row, which you can think of as the row containing your data. Our formalism is symmetric, so we do not specify which of D and D′ contains your row and which is missing it.

We should mention some fine print here. The paragraph above implicitly assumes that your database is just a simple table. In general we can speak of neighboring databases. In the case of a more complex database design, the notion of what it means for two databases to be neighboring is more complex. Differential privacy can be defined whenever a notion of neighboring databases can be defined, so you could, for example, consider differential privacy for a non-relational (“No SQL”) database. But for the simple case of a single table, two databases are neighboring if they differ by a row.

However, there’s a subtlety regarding what it means even for two tables to “differ by one row.” Unbounded differential privacy assumes one row has been removed from one of the databases. Bounded differential privacy assumes the two databases have the same number of rows, but the data in one row has been changed. In practice the difference between bounded and unbounded differential privacy doesn’t often matter.

Quantifying difference

We said it “makes little difference” whether an individual’s data is included or not. How do we quantify that statement? Differential privacy is a stochastic procedure, so we need to measure difference in terms of probability.

Whenever mathematicians want to speak of two things being close together, we often use ε as our symbol. While in principle ε could be large, the choice of symbol implies that you should think of it as a quantity you can make as small as you like (though there may be some cost that increases as ε decreases).

We’re going to look at ratios of probabilities, so we want these ratios to be near 1. This means we’ll work with exp(ε) rather than ε itself. A convenient property that falls out of the Taylor series for the exponential function is that if ε is small then exp(ε) is approximately 1 + ε. For example, exp(0.05) = 1.0513, and so if the ratio of two probabilities is bounded by exp(0.05), the probabilities are roughly within 5% of each other.

Definition of differential privacy

Now we can finally state our definition. An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D′,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.
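
To make this concrete, here is a minimal sketch of the classic way to achieve ε-differential privacy for a counting query: add Laplace noise with scale 1/ε. The function name and data layout here are hypothetical, for illustration only.

    import numpy as np

    def private_count(rows, predicate, epsilon):
        # A counting query has sensitivity 1: adding or removing one row
        # changes the count by at most 1. Laplace noise with scale
        # 1/epsilon then satisfies epsilon-differential privacy.
        true_count = sum(predicate(row) for row in rows)
        return true_count + np.random.laplace(scale=1/epsilon)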


Earth mover distance and t-closeness


There’s an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it’s like a lot of other records.

k-anonymity

The idea of k-anonymity is that every database record appears at least k times when you restrict your attention to quasi-identifiers [1]. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity. Even when k is large, k-anonymity might prevent re-identification but still suffer from attribute disclosure.

Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.

Or going the other way around, if you already know something that stands out about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a background knowledge attack.

One way to address this shortcoming is l-diversity. This post won’t go into l-diversity because it’s an intermediate step to where we want to go, which is t-closeness.

t-closeness

The idea of t-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The “t” comes from requiring that the distributions be no more than a distance t apart in a sense that we’ll define below [2]. If the sensitive data in a group doesn’t stand out, this thwarts the homogeneity attack and the background knowledge attack.

Earth mover distance

When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what “far” means? That is, how do we measure the distance between two distributions?

There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Leibler divergence, though that’s not what we’ll use here. Instead, t-closeness uses the so-called earth mover distance (EMD), also known as the Wasserstein metric.

The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.

Suppose your data is some ordered response, 1 through 5. Suppose distribution X has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions Y and Z are the same except they have 80% of their probability mass at 2 and at 5 respectively. By some measures, X is equally far from Y and Z, but the earth mover distance would say that X is closer to Y than to Z, which is more appropriate in our setting.

We can calculate the EMD for the example above. To transform X into Y, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform X into Z we need to move the same amount of mass, but we need to move it 4x further, and so the EMD is 3.
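
SciPy can compute this one-dimensional earth mover distance directly. A quick check of the example above:

    from scipy.stats import wasserstein_distance

    values = [1, 2, 3, 4, 5]
    X = [0.80, 0.05, 0.05, 0.05, 0.05]
    Y = [0.05, 0.80, 0.05, 0.05, 0.05]
    Z = [0.05, 0.05, 0.05, 0.05, 0.80]

    print(wasserstein_distance(values, values, X, Y))  # 0.75
    print(wasserstein_distance(values, values, X, Z))  # 3.0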


[1] What exactly is a quasi-identifier? That’s hard to say precisely. An identifier is something that clearly identifies an individual. A phone number, for example, is an identifier. A quasi-identifier is something that helps narrow down who a person is, but isn’t an identifier. For example, someone’s birthday would be a quasi-identifier. But in general “quasi-identifier” is not a precisely defined concept.

[2] It’s common in math to use a variable name as an adjective, as with k-anonymity, l-diversity, and t-closeness. This is unfortunate because it isn’t descriptive and locks in a variable naming convention. Someone used the variable k to count the number of redundant database records, and the name stuck. Similarly the l in l-diversity counts something.

As a sidenote, the variable names here follow the old FORTRAN convention that variables i through n represent integers by default, and that all other variables represent real (floating point) numbers by default. The t in t-closeness is a continuous measure, so the t is a real value, while k and l are integers.

Biometric security and hypothesis testing

A few weeks ago I wrote about how there are many ways to summarize the operating characteristics of a test. The most basic terms are accuracy, precision, and recall, but there are many others. Nobody uses all of them. Each application area has its own jargon.

Biometric security has its own lingo, and it doesn’t match any of the terms in the list I gave before.


False Acceptance Rate

Biometric security uses False Acceptance Rate (FAR) for the proportion of times a system grants access to an unauthorized person. In statistical terms, FAR is Type II error. FAR is also known as False Match Rate (FMR).

False Rejection Rate

False Rejection Rate (FRR) is the proportion of times a biometric system fails to grant access to an authorized person. In statistical terms, FRR is Type I error. FRR is also known as False Non-Match Rate (FNMR).

Crossover Error Rate

One way to summarize the operating characteristics of a biometric security system is to look at the Crossover Error Rate (CER), also known as the Equal Error Rate (EER). The system has parameters that can be tuned to adjust the FAR and FRR. Adjust these to the point where the FAR and FRR are equal. When the two are equal, their common value is the CER or EER.

The CER gives a way to compare systems. The smaller the CER the better. A smaller CER value means it’s possible to tune the system so that both the Type I and Type II error rates are smaller than they would be for another system.

Loss Function

CER is kind of a strange metric. Everyone agrees that you wouldn’t want to calibrate a system so that FAR = FRR. In security applications, FAR (unauthorized access) is worse than FRR (authorized user locked out). The former could be a disaster while the latter is an inconvenience. Of course there could be a context where the consequences of FAR and FRR are equal, or that FRR is worse, but that’s not usually the case.

A better approach would be to specify a loss function (or its negative, a utility function). If unauthorized access is K times more costly than locking out an authorized user, then you might want to know at what point K × FAR = FRR, or where your expected loss [1] is minimized over the range of tuning parameters. The better system for you, in your application, is the one corresponding to your value of K.
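
Here’s a sketch of what that tuning might look like, assuming you have arrays of match scores for genuine and impostor attempts; the names here are hypothetical.

    import numpy as np

    def tune_threshold(genuine, impostor, K=10.0):
        # sweep the decision threshold and return the one where
        # K * FAR is closest to FRR
        ts = np.linspace(impostor.min(), genuine.max(), 1000)
        far = np.array([(impostor >= t).mean() for t in ts])  # Type II
        frr = np.array([(genuine < t).mean() for t in ts])    # Type I
        return ts[np.argmin(np.abs(K*far - frr))]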

Since everyone has a different value of K, it is easier to just use K = 1, even though everyone’s value of K is likely to be much larger than 1. Unfortunately this often happens in decision theory. When people can’t agree on a realistic loss function, they standardize on a mathematically convenient implicit loss function that nobody would consciously choose.

If everyone had different values of K near 1, the CER metric might be robust, i.e. it might often make the right comparison between two systems even though the criterion is wrong. But since K is probably much larger than 1 for most people, it’s questionable whether CER would rank two systems the same way people would if they could use their own value of K.

***

[1] These are not the same thing. To compute expected loss you’d need to take into account the frequency of friendly and unfriendly access attempts. In a physically secure location, friends may attempt to log in much more often than foes. On a public web site the opposite is more likely to be true.

Probability of winning the World Series


Suppose you have two baseball teams, A and B, playing in the World Series. If you like, say A stands for Houston Astros and B for Milwaukee Brewers. Suppose that in each game the probability that A wins is p, and the probability of A losing is q = 1 – p. What is the probability that A will win the series?

The World Series is a best-of-seven series, so the first team to win 4 games wins the series. Once one team wins four games there’s no point in playing the rest of the games because the series winner has been determined.

At least four games will be played, so if you win the series, you win on the 4th, 5th, 6th, or 7th game.

The probability of A winning the series after the fourth game is simply p⁴.

The probability of A winning after the fifth game is 4p⁴q because A must have lost one game, and it could be any one of the first four games.

The probability of A winning after the sixth game is 10p⁴q² because A must have lost two of the first five games, and there are 10 ways to choose two items from a set of five.

Finally, the probability of A winning after the seventh game is 20p⁴q³ because A must have lost three of the first six games, and there are 20 ways to choose three items from a set of six.

The probability of winning the World Series is the sum of the probabilities of winning after 4, 5, 6, and 7 games which is

p⁴(1 + 4q + 10q² + 20q³)

Here’s a plot:

[Plot: probability of winning the World Series as a function of the probability of winning one game]

Obviously, the more likely you are to win each game, the more likely you are to win the series. But it’s not a straight line because the better team is more likely to win the series than to win any given game.

Now if you only wanted to compute the probability of winning the series, not the probability of winning after different numbers of games, you could pretend that all the games are played, even though some may be unnecessary to determine the winner. Then we compute the probability that a Binomial(7, p) random variable takes on a value greater than or equal to 4, which is

35p⁴q³ + 21p⁵q² + 7p⁶q + p⁷

While this looks very different than the expression we worked out above, they’re actually the same. If you substitute (1 − p) for q and expand everything out, you’ll see the two expressions agree.
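
A quick symbolic check that the two expressions agree:

    from sympy import symbols, simplify

    p = symbols("p")
    q = 1 - p
    by_game  = p**4 * (1 + 4*q + 10*q**2 + 20*q**3)
    binomial = 35*p**4*q**3 + 21*p**5*q**2 + 7*p**6*q + p**7
    print(simplify(by_game - binomial))  # prints 0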

Accuracy, precision, and recall

I posted the definitions of accuracy, precision, and recall on @BasicStatistics this afternoon. I think the tweet was popular because people find these terms hard to remember and they liked a succinct cheat sheet.

There seems to be no end of related definitions, and multiple names for the same definitions.

Precision is also known as positive predictive value (PPV) and recall is also known as sensitivity, hit rate, and true positive rate (TPR).

Not mentioned in the tweet above are specificity (a.k.a. selectivity and true negative rate or TNR), negative predictive value (NPV), miss rate (a.k.a. false negative rate or FNR), fall-out (a.k.a. false positive rate or FPR), false discovery rate (FDR), and false omission rate (FOR).

How many terms are possible? There are four basic ingredients: TP, FP, TN, and FN. (You may see these arranged into a 2 by 2 grid called a confusion matrix.) So if each term may or may not be included in a sum in the numerator and denominator, that’s 16 possible numerators and 16 denominators, for a total of 256 possible terms to remember. Some of these are redundant, such as one (a.k.a. ONE), given by TP/TP, FP/FP, etc. If we insist that the numerator and denominator be different, that eliminates 16 possibilities, and we’re down to a more manageable 240 definitions. And if we rule out terms that are the reciprocals of other terms, we’re down to only 120 definitions to memorize.

But wait! We’ve assumed every term occurs with a coefficient of either 0 or 1. But the F1 score has a couple coefficients of 2:

F1 = 2TP / (2TP + FP + FN)

If we allow coefficients of 0, 1, or 2, and rule out redundancies and reciprocals …

This has been something of a joke. Accuracy, precision, and recall are useful terms, though I think positive predictive value and true positive rate are easier to remember than precision and recall respectively. But the proliferation of terms is ridiculous. I honestly wonder how many terms for similar ideas are in use somewhere. I imagine there are scores of possibilities, each with some subdiscipline that thinks it is very important, and few people use more than three or four of the possible terms.

Here are the terms I’ve collected so far. If you know of other terms or other names for the same terms, please leave a comment and we’ll see how long this table goes.

Accuracy: (TP + TN) / (FP + TP + FN + TN)
True positive rate (TPR): TP / (TP + FN)
Sensitivity: see TPR
Recall: see TPR
Hit rate: see TPR
Probability of detection: see TPR
True negative rate (TNR): TN / (TN + FP)
Specificity: see TNR
Selectivity: see TNR
Positive predictive value (PPV): TP / (TP + FP)
Precision: see PPV
Negative predictive value (NPV): TN / (TN + FN)
False negative rate (FNR): FN / (FN + TP)
Miss rate: see FNR
False positive rate (FPR): FP / (FP + TN)
Fall-out: see FPR
False discovery rate (FDR): FP / (FP + TP)
False omission rate (FOR): FN / (FN + TN)
F1 score: 2TP / (2TP + FP + FN)
F-score: see F1 score
F-measure: see F1 score
Sørensen–Dice coefficient: see F1 score
Dice similarity coefficient (DSC): see F1 score
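
If you’d rather compute these than memorize them, here’s a small reference function covering the table above:

    def metrics(TP, FP, TN, FN):
        return {
            "accuracy": (TP + TN) / (TP + FP + TN + FN),
            "TPR (recall, sensitivity)": TP / (TP + FN),
            "TNR (specificity)": TN / (TN + FP),
            "PPV (precision)": TP / (TP + FP),
            "NPV": TN / (TN + FN),
            "FNR (miss rate)": FN / (FN + TP),
            "FPR (fall-out)": FP / (FP + TN),
            "FDR": FP / (FP + TP),
            "FOR": FN / (FN + TN),
            "F1": 2*TP / (2*TP + FP + FN),
        }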

Pi primes

I was reading a recent blog post by Evelyn Lamb where she mentioned in passing that 314159 is a prime number and that made me curious how many such primes there are.

Let’s look at numbers formed from the digits of π to see which ones are prime.

Obviously 3 and 31 are prime. 314 is even. 3141 is divisible by 9 because its digits sum to 9, and 31415 is clearly divisible by 5. And now we know that 314159 is prime. What’s the next prime in the sequence? Here’s a little Python code to find out.

    from sympy import pi, isprime

    M = 1000
    # the first M+1 digits of pi as a string: "3" followed by M decimals
    digits = "3" + str(pi.evalf(M+1))[2:]
    for i in range(1, M+1):
        n = int(digits[:i])  # number formed from the first i digits
        if isprime(n):
            print(n)

This looks at numbers formed from the first digit up to the thousandth digit in the representation of π. The only other prime it finds is

31415926535897932384626433832795028841.

After running the Python code above, I wondered whether this sequence is in OEIS, and indeed it is: sequence A005042. The OEIS says the next number in the sequence is 16,208 digits long.

Expected number of primes

I expected to find more primes out of the first 1,000 digits of π and was surprised that the next prime is so long.

How many primes would we expect if we look at random numbers of length 1 through 1000? (The numbers formed from the digits of π are not random, but I want to compare them to random selections.)

From the prime number theorem, the probability that a d-digit number is prime is approximately 1/(d log 10). So if we pick numbers of length 1, 2, 3, …, N, the number of primes we’d expect to find is

\frac{1}{\log 10}\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots +\frac{1}{N} \right ) = \frac{H_N}{\log 10}

where H_N is the Nth harmonic number. Harmonic numbers grow very slowly, and so this gives us an idea why the number of primes we find is so small. In particular, it says that if we look out 1,000 digits, we’d expect to find 3.25 primes. And so finding 4 primes isn’t surprising.

By the way, H_N is roughly log N, so the expected number of primes is roughly log N / log 10, or simply log₁₀ N. [1]
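
A two-line check of that number:

    from math import log

    H = sum(1/d for d in range(1, 1001))  # H_1000
    print(H / log(10))                    # about 3.25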

Other numbers

To see if our prediction holds, let’s try a few more irrational numbers.

First, let’s look at e. Then we get these four prime numbers:

2
271
2718281
2718281828459045235360287471352662497757247093699959574966967627724076630353547594571

Next, let’s try the square root of 2. Then we get these three prime numbers:

1414213562373095048801688724209698078569671875376948073
1414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641
141421356237309504880168872420969807856967187537694807317667973799073247846210703885038753432764157273501384623091229702492483605585073721264412149709993583141322266592750559275579995050115278206057147010955997160597027453459

And finally, sin(2018) gives us two primes:

89
890078059172214077866908781516358161827277734319970800477363723960779818810729974998238099333154892009885458973321640841442042739723513347225471340149039460391024770670777037177784685508982613866838293832904866761046669825265135653209394367093522291768886320360545920897111922035021827790766895274726280695701753189405911185099453319400331979946988526902527520607778873012543871813339138882571549068827579760491442082336321409598206489746377828352552073997340874649330911864108558685738003182332758299767812686413014214285003827180623455181083706822164258525158371971877396069605035317998430446771215296978812833752733

We might expect to find more primes when the leading digit is small because each time we select a number with d digits we’re looking lower in the range where primes are more dense. That is consistent with the small number of examples in this post.


[1] When I write “log” without a subscript, I always mean natural log, log base e. If you’re not accustomed to this, see why this is conventional.

Distribution of zeros of the Riemann zeta

A recent video by Quanta Magazine says that the eigenvalues of random matrices and the zeros of the Riemann zeta function appear to have the same distribution.

I assume by “random matrices” the video is referring specifically to Gaussian orthogonal ensembles. [Update: See the first comment below for more details. You need some assumptions on the distribution—fat tails are out—but you don’t need to assume a GOE.]

By zeros of the Riemann zeta function, they mean the imaginary parts of the zeros in the critical strip. You can download the first 100,000 zeros of the Riemann zeta function here. So, for example, the first zero is 14.134725142, which actually means 0.5 + 14.134725142 i.

Here’s the histogram of random matrix eigenvalues from my previous post:

And here’s a histogram of the spacing between the first two thousand zeros of the Riemann zeta function:

The distribution of random matrix eigenvalues is known exactly. The corresponding result for the zeros of the Riemann zeta function is conjectural, but the evidence is persuasive. Here’s a plot using the first 100,000 zeros.
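
If you’d rather compute zeros than download them, mpmath can produce them, though slowly for large indices. A sketch for the spacings:

    from mpmath import zetazero

    # imaginary parts of the first 100 nontrivial zeros
    zeros = [zetazero(n).imag for n in range(1, 101)]
    spacings = [float(b - a) for a, b in zip(zeros, zeros[1:])]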

How long does it take to find large primes?

Descriptions of cryptography algorithms often start with something like “Let p be a large prime” or maybe more specifically “Let p be a 256 bit prime.” How hard is it to find large prime numbers?

Let’s look at the question in base b, so we can address binary and base 10 at the same time. The prime number theorem says that for large n, the number of primes less than bⁿ is approximately

bⁿ / (n log b)

and so the probability that an n-digit number base b is prime is roughly 1/(n log b). If you select an odd n-digit number, you double your chances.

If you generate odd n-digit numbers at random, how long would it be until you find a prime? You might get lucky and find one on your first try. Or you might have extraordinarily bad luck and find a composite every time until you give up.

The number of tries until you find a prime is a geometric random variable with

p = 2 / (n log b).

The expected value is 1 / p.

So, for example, if you wanted to find a 256-bit prime, you would need to test on average 256 log(2) / 2 = 89 candidates. And if you wanted to find a 100-digit prime, you’d need to test on average 100 log(10) / 2 = 115 candidates.

You could get more accurate bounds by using a refined variation of the prime number theorem. Also, we implicitly looked at the problem of finding a number with at most n digits rather than exactly n digits. Neither of these would make much difference to our final result. To back up this claim, let’s compare our estimate to a simulation that actually looks for primes.

[Histogram: number of candidates tested for each 100-digit prime found]

The histogram above was based on finding 10,000 primes, each with 100 digits, and counting how many candidates were evaluated each time. The plot matches a geometric distribution well. The mean from the simulation was 113.5, close to the 115 estimated above.
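
Here’s a sketch of that simulation, scaled down to 100 primes since testing 100-digit numbers takes a while:

    import random
    from sympy import isprime

    def tries_until_prime(digits=100):
        tries = 0
        while True:
            tries += 1
            # random odd number with the given number of digits
            n = random.randrange(10**(digits-1) + 1, 10**digits, 2)
            if isprime(n):
                return tries

    counts = [tries_until_prime() for _ in range(100)]
    print(sum(counts) / len(counts))  # should be near 115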

For cryptographic applications, you often don’t want just any prime. You might want a safe prime or a prime satisfying some other property, in which case you’ll need to test more primes. But your probability of success will likely still be geometrically distributed.

It’s plausible that for large p, the events “p is prime” and “2p + 1 is prime” are independent, at least approximately, and a numerical experiment suggests that’s true. I generated 10,000 primes, each 100 digits long, and 48 were safe primes, about what you’d expect from theory. It is not known whether the number of safe primes is infinite, but safe primes with hundreds of thousands of digits have been found, so apparently they’re infinite enough for application.


The other butterfly effect


The original butterfly effect

The butterfly effect is the semi-serious claim that a butterfly flapping its wings can cause a tornado half way around the world. It’s a poetic way of saying that some systems show sensitive dependence on initial conditions, that the slightest change now can make an enormous difference later. Often this comes up in the context of nonlinear, chaotic systems but it isn’t limited to that. I give an example here of a linear differential equation whose solutions start out essentially the same but eventually diverge completely.

Once you think about these things for a while, you start to see nonlinearity and potential butterfly effects everywhere. There are tipping points everywhere waiting to be tipped!

The other butterfly effect

But a butterfly flapping its wings usually has no effect, even in sensitive or chaotic systems. You might even say especially in sensitive or chaotic systems.

Sensitive systems are not always and everywhere sensitive to everything. They are sensitive in particular ways under particular circumstances, and can otherwise be quite resistant to influence. Not only may a system be insensitive to butterflies, it may even be relatively insensitive to raging bulls. The raging bull may have little more long-term effect than a butterfly. This is what I’m calling the other butterfly effect.

Steering complex systems

In some ways chaotic systems are less sensitive to change than other systems. And this can be a good thing. Resistance to control also means resistance to unwanted tampering. Chaotic or stochastic systems can have a sort of self-healing property. They are more stable than more orderly systems, though in a different way.

The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome. Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all. We should evaluate what aspects can be steered and how well they can be steered. The most effective intervention may not come from tweaking the inputs but from changing the structure of the system.


The cold start problem

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to create adaptive clinical trial designs, such as trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Thanks to the magic of Bayes theorem, you don’t have to have an ad hoc rule that says “Treat the first patient this way, then turn on the Bayesian machine to determine how to treat the next patient.” No, you use Bayes theorem from beginning to end. There’s no need to handle the first patient differently because expert opinion is already there, captured in the form of prior distributions (and the structure of the probability model).

Each patient is treated according to all information available at the time. At first all available information is prior information. After you have data on one patient, most of the information you have is still prior information, but Bayes’ theorem updates this prior information with your lone observation. As more data becomes available, the Bayesian machine incorporates it all, automatically shifting weight away from the prior and toward the data.
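
Our dose-finding models were more involved than this, but the prior-to-posterior mechanics are easy to see in a toy conjugate example; the numbers here are made up for illustration.

    from scipy.stats import beta

    a, b = 2, 8  # Beta prior encoding expert opinion: response rate near 20%
    s = f = 0    # successes and failures observed so far

    print(beta(a, b).mean())  # before any data, decisions rest on the prior
    for outcome in [1, 0, 0, 1, 0]:  # hypothetical patient outcomes
        s += outcome
        f += 1 - outcome
        # the posterior mean drifts from the prior toward the observed rate
        print(beta(a + s, b + f).mean())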

The cold start problem for business applications is easier than the cold start problem for clinical trials. First of all, most business applications don’t have the potential to cost people their lives. Second, business applications typically have fewer competing criteria to balance.

What if you’re not sure where to draw your prior information? Bayes can handle that too. You can use Bayesian model selection or Bayesian model averaging to determine which source (or weighting of sources) best fits the new data as it comes in.

Once you’ve decided to use a Bayesian approach, there’s still plenty of work to do, but the Bayesian approach provides scaffolding for that work, a framework for moving forward.

Distribution of carries

Suppose you add two long numbers together. As you work your way toward the result, adding digits from right to left, sometimes you carry a 1 and sometimes you don’t. How often do you carry 1?

Now suppose you add three numbers at a time. Now your carry might be 0, 1, or 2. How often does each appear? How do things change if you’re not working in base 10?

John Holte goes into this problem in detail in [1]. We’ll only look at his first result here.

Suppose we’re adding m numbers together in base b, with each digit being uniformly distributed. Then at each step of the addition process, the probability distribution of the amount we carry out only depends on the amount we carried in from the previous step. In other words, the carries form a Markov chain!

This means that we can do more than describe the distribution of the amounts carried. We can specify the transition probabilities for carries, i.e. the distribution of the amount carried out, conditional on the amount carried in.

Examples

When we add two base ten numbers, m = 2 and b = 10, we have the transition matrix

\begin{pmatrix} 0.55 & 0.45 \\ 0.45 & 0.55 \end{pmatrix}

This means that on average, we’re as likely to carry 0 as to carry 1. But more specifically, there’s a 55% chance that we’ll carry a 1 if we carried a 1 on the previous step, and also a 55% chance that we won’t have a carry this time if we didn’t have one last time.

If we add three numbers, things get a little more interesting. Now the transition matrix is

\begin{pmatrix} 0.220 & 0.660 & 0.120 \\ 0.165 & 0.670 & 0.165 \\ 0.120 & 0.660 & 0.220 \end{pmatrix}

This gives the probabilities of each out carry for each in carry. We can compute the long run distribution from the matrix above to find that when adding three long numbers, we’ll carry 0 or 2 with probability 1/6 each, and carry 1 with probability 2/3.

When adding three binary numbers, the transition matrix is fairly different than for base 10

\begin{pmatrix} 0.500 & 0.500 & 0.000 \\ 0.125 & 0.750 & 0.125 \\ 0.000 & 0.500 & 0.500 \\ \end{pmatrix}

but the long run distributions are the same: carry a 1 two thirds of the time, and split the remaining probability equally between carrying 0 and 2.

The transition matrices don’t change much for larger bases; base 100 doesn’t look much different than base 10, for example.

Finally, let’s do an example adding five numbers in base 10. The transition matrix is

\begin{pmatrix} 0.020 & 0.305 & 0.533 & 0.140 & 0.003 \\ 0.013 & 0.259 & 0.548 & 0.176 & 0.005 \\ 0.008 & 0.216 & 0.553 & 0.216 & 0.008 \\ 0.005 & 0.176 & 0.548 & 0.260 & 0.013 \\ 0.003 & 0.140 & 0.533 & 0.305 & 0.020 \\ \end{pmatrix}

and the long run distribution is 1/120, 26/120, 66/120, 26/120, and 1/120 for carries of 0 through 4.

General case

In general, when adding m numbers in base b, Holte gives the transition probabilities as

\pi_{ij} = b^{-m} \sum_{r=0}^{j-\lfloor i/b\rfloor}(-1)^r {m+1\choose r}{m - i -1 + b(j-r+1) \choose m}

Here is some Python code to compute the transition matrices.

from numpy import array
from numpy.linalg import matrix_power
from scipy.special import comb

# probability of carrying j out, given a carry of i in,
# when adding m numbers in base b
def pi(i, j, b, m):
    s = 0
    for r in range(0, j - i//b + 1):
        s += (-1)**r * comb(m+1, r) * comb(m-i-1 + b*(j-r+1), m)
    return s/b**m

# m-by-m matrix of transition probabilities between carries 0 through m-1
def transition_matrix(b, m):
    P = [ [pi(i, j, b, m) for j in range(m)]
          for i in range(m) ]
    return array(P)

You can find the long run probabilities by raising the transition matrix to a large power.
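
For example, using the code above:

P = transition_matrix(10, 3)
print(matrix_power(P, 100)[0])  # approximately [1/6, 2/3, 1/6]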


[1] John M. Holte. Carries, Combinatorics, and an Amazing Matrix. The American Mathematical Monthly, Vol. 104, No. 2 (Feb. 1997), pp. 138–149.

Distribution of eigenvalues for symmetric Gaussian matrix

Symmetric Gaussian matrices

The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an n by n matrix A with values drawn from a standard normal distribution and let M be the average of A and its transpose, i.e. M = ½(A + Aᵀ). The eigenvalues will all be real because M is symmetric.

This is called a “Gaussian Orthogonal Ensemble” or GOE. The term is standard but a little misleading because such matrices may not be orthogonal.

Eigenvalue distribution

The joint probability distribution for the eigenvalues of M has three terms: a constant term that we will ignore, an exponential term, and a product term. (Source)

p(x_1, x_2, \ldots, x_n) \propto \exp\left(-\frac{1}{2} \sum_{j=1}^nx_j^2 \right ) \prod_{j < k} |x_j - x_k|

The exponential term is the same as in a multivariate normal distribution. This says the probability density drops off quickly as you go away from the origin, i.e. it’s rare for eigenvalues to be too big. The product term multiplies the distances between each pair of eigenvalues. This says it’s also rare for eigenvalues to be very close together.

(The missing constant to turn the expression above from a proportionality to an equation is whatever it has to be for the right side to integrate to 1. When trying to qualitatively understand a probability density, it usually helps to ignore proportionality constants. They are determined by the rest of the density expression, and they’re often complicated.)

If eigenvalues are neither tightly clumped together, nor too far apart, we’d expect that the distance between them has a distribution with a hump away from zero, and a tail that decays quickly. We will demonstrate this with a simulation, then give an exact distribution.

Python simulation

The following Python code simulates 2 by 2 Gaussian matrices.

    import matplotlib.pyplot as plt
    import numpy as np
    
    n = 2
    reps = 1000
    
    diffs = np.zeros(reps)
    for r in range(reps):
        A = np.random.normal(scale=n**-0.5, size=(n,n)) 
        M = 0.5*(A + A.T)
        w = np.linalg.eigvalsh(M)
        diffs[r] = abs(w[1] - w[0])
    
    plt.hist(diffs, bins=int(reps**0.5))
    plt.show()

This produced the following histogram:

The exact probability distribution is p(s) = s exp(-s²/4)/2. This result is known as “Wigner’s surmise.”
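
As a quick check, we can overlay this density on a normalized version of the histogram, reusing diffs and reps from the code above:

    s = np.linspace(0, 4, 100)
    plt.hist(diffs, bins=int(reps**0.5), density=True)
    plt.plot(s, 0.5*s*np.exp(-s**2/4))
    plt.show()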

Biased random number generation

Melissa O’Neill has a new post on generating random numbers from a given range. She gives the example of wanting to pick a card from a deck of 52 by first generating a 32-bit random integer, then taking the remainder when dividing by 52. There’s a slight bias because 2³² is not a multiple of 52.

Since 2³² = 82595524 × 52 + 48, there are 82595525 ways to generate the numbers 0 through 47, but only 82595524 ways to generate the numbers 48 through 51. As Melissa points out in her post, the bias here is small, but the bias increases linearly with the size of our “deck.” To clarify, it is the relative bias that increases, not the absolute bias.

Suppose you want to generate a number between 0 and M, where M is less than 2³² and not a power of 2. There will be 1 + ⌊2³²/M⌋ ways to generate a 0, but ⌊2³²/M⌋ ways to generate M − 1. The difference in the probability of generating 0 vs generating M − 1 is 1/2³², independent of M. However, the ratio of the two probabilities is 1 + 1/⌊2³²/M⌋, or approximately 1 + M/2³².

As M increases, both the favored and unfavored outcomes become increasingly rare, but the ratio of their respective probabilities approaches 2.
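
The bias is easy to see if we shrink the problem enough to enumerate every generator output, say a 16-bit generator dealing a 52-card deck:

    from collections import Counter

    # 2**16 = 1260*52 + 16, so remainders 0-15 occur one extra time
    counts = Counter(x % 52 for x in range(2**16))
    print(counts[0], counts[51])  # 1261 vs 1260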

Whether this makes any practical difference depends on your context. In general, the need for random number generator quality increases with the volume of random numbers needed.

Under conventional assumptions, the sample size necessary to detect a difference between two very small probabilities p₁ and p₂ is approximately

8(p₁ + p₂)/(p₁ − p₂)²

and so in the card example, we would have to deal roughly 6 × 10¹⁸ cards to detect the bias between one of the more likely cards and one of the less likely cards.


Attribution based on tail probabilities

If all you know about a person is that he or she is around 5′ 7″, it’s a toss-up whether this person is male or female. If you know someone is over 6′ tall, they’re probably male. If you hear they are over 7′ tall, they’re almost certainly male. This is a consequence of heights having a thin-tailed probability distribution. Thin, medium, and thick tails all behave differently for the attribution problem, as we will show below.

Thin tails: normal

Suppose you have an observation that either came from X or Y, and a priori you believe that both are equally probable. Assume X and Y are both normally distributed with standard deviation 1, but that the mean of X is 0 and the mean of Y is 1. The probability you assign to X and Y after seeing data will change, depending on what you see. The larger the value you observe, the more sure you can be that it came from Y.

[Plot: posterior probability of Y with normal distributions]

This plot shows the posterior probability that an observation came from Y, given that we know the observation is greater than the value on the horizontal axis.

Suppose I’ve seen the exact value, and it’s 4. All I tell you is that it’s bigger than 2. Then you would say it probably came from Y. When I go back and tell you that in fact it’s bigger than 3, you would be more sure it came from Y. The more information I give you, the more convinced you are. This isn’t the case with other probability distributions.

Thick tails: Cauchy

Now let’s suppose X and Y have a Cauchy distribution with unit scale, with X having mode 0 and Y having mode 1. The plot below shows how likely it is that our observation came from Y given a lower bound on the observation.

[Plot: posterior probability of Y assuming Cauchy distributions]

We are most confident that our data came from Y when we know that our data is greater than 1. But the larger our lower bound is, the further we look out in the tails, the less confident we are! If we know, for example, that our data is at least 5, then we still think that it’s more likely that it came from Y than from X, but we’re not as sure.

As above, suppose I’ve seen the data value but only tell you lower bounds on its value.  Suppose I see a value of 4, but only tell you it’s positive. When I come back and tell you that the value is bigger than 1, your confidence goes up that the sample came from Y. But as I give you more information, telling you that the sample is bigger than 2, then bigger than 3, your confidence that it came from Y goes down, just the opposite of the normal case.

Explanation

What accounts for the very different behavior?

In the normal example, seeing a value of 5 or more from Y is unlikely, but seeing a value so large from X is very unlikely. Both tails are getting smaller as you move to the right, but in relative terms, the tail of X is getting thinner much faster than the tail of Y.

In the Cauchy example, the value of both tails gets smaller as you move to the right, but the relative difference between the two tails is decreasing. Seeing a value greater than 10, say, from Y is unlikely, but it would only be slightly less likely from X.

Medium tails

In between thin tails and thick tails are medium tails. The tails of the Laplace (double exponential) distribution decay exponentially. Exponential tails are often considered the boundary between thin tails and thick tails, or between super-exponential and sub-exponential tails.

Suppose you have two Laplace random variables with the same scale, with X centered at 0 and Y centered at 1. What is the posterior probability that a sample came from Y rather than X, given that it’s at least some value z > 1? It’s constant! Specifically, it’s e/(1 + e).
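
A numerical check with SciPy, using unit scale as above:

    from math import e
    from scipy.stats import laplace

    X = laplace(loc=0)
    Y = laplace(loc=1)
    for z in [2, 5, 10]:
        print(Y.sf(z) / (X.sf(z) + Y.sf(z)))  # constant in z
    print(e / (1 + e))  # 0.7310585786300049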

In the normal and Cauchy examples, it didn’t really matter that one distribution was centered at 0 and the other at 1. We’d get the same qualitative behavior no matter what the shift between the two distributions. The limit would tend to 1 for the normal distribution and 1/2 for the Cauchy. The constant value of the posterior probability with the Laplace example depends on the size of the shift between the two.

We’ve assumed that X and Y are equally likely a priori. The limiting value in the normal case does not depend on the prior probabilities of X and Y, as long as they’re both positive. The prior probabilities will affect the limiting values in the Cauchy and Laplace cases.

Technical details

For anyone who wants a more precise formulation of the examples above, let B be a non-degenerate Bernoulli random variable and define Z = BX + (1 − B)Y. We’re computing the conditional probability Pr(B = 0 | Z > z) using Bayes’ theorem. If X and Y are normally distributed, the limit of Pr(B = 0 | Z > z) as z goes to infinity is 1. If X and Y are Cauchy distributed, the limit is the unconditional probability Pr(B = 0).

In the normal case, as z goes to infinity, the distribution of B carries no useful information.

In the Cauchy case, as z goes to infinity, the observation z carries no useful information.