Unstructured data is an oxymoron


Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

Non-tabular data could mean anything other than tabular data, but in practice it often means text. It could also mean data with a graph structure or some other structure.

More productive discussions

My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.

Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

Free text fields

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most in a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple of queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]


[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.

Can I have the last four digits of your social?


Imagine this conversation.

“Could you tell me your social security number?”

“Absolutely not! That’s private.”

“OK, how about just the last four digits?”

“Oh, OK. That’s fine.”

When I was in college, professors would post grades by the last four digits of student social security numbers. Now that seems incredibly naive, but no one objected at the time. Using these four digits rather than names would keep your grades private from the laziest observers, but not from anyone willing to put out a little effort.

There’s a widespread belief in the US that your social security number is a deep secret, and that telling someone your social security number gives them power over you akin to a fairy telling someone his true name. On the other hand, we also believe that telling someone just the last four digits of your SSN is harmless. Both are wrong. It’s not that hard to find someone’s full SSN, and revealing the last four digits gives someone a lot of information to use in identifying you.

In an earlier post I looked at how easily most people could be identified by the combination of birth date, sex, and zip code. We’ll use the analytical results from that post to look at how easily someone could be identified by their birthday, state, and the last four digits of their SSN [1]. Note that the previous post used birth date, i.e. including year, where here we only look at birth day, i.e. month and day but no year. Note also that there’s nothing special about social security numbers for our purposes. The last four digits of your phone number would provide just as much information.

If you know someone lives in Wyoming, and you know their birthday and the last four digits of their SSN, you can uniquely identify them 85% of the time, and in an additional 7% of cases you can narrow down the possibilities to just two people. In Texas, by contrast, the chances of a birthday and four-digit ID being unique are 0.03%. The chances of narrowing the possibilities to two people are larger but still only 0.1%.

Here are results for a few states. Note that even though Texas has between two and three times the population of Ohio, it’s over 100x harder to uniquely identify someone with the information discussed here.

|-----------+------------+--------+--------|
| State     | Population | Unique |  Pairs |
|-----------+------------+--------+--------|
| Texas     | 30,000,000 |  0.03% |  0.11% |
| Ohio      | 12,000,000 |  3.73% |  6.14% |
| Tennessee |  6,700,000 | 15.95% | 14.64% |
| Wyoming   |    600,000 | 84.84% |  6.97% |
|-----------+------------+--------+--------|
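The numbers in the table can be reproduced, to within rounding, by a Poisson approximation. This is my reconstruction, not necessarily how the table was computed: assume birthdays and four-digit endings are independent and uniform over B = 365 × 10⁴ equally likely combinations, so a state with population N puts an average of λ = N/B people on each combination. Then the chance a given person is unique is about e^(−λ), and the expected number of pairs divided by the population is λe^(−λ)/2, matching the Pairs column.

    from math import exp

    def identification_odds(population, bins=365 * 10**4):
        # average number of people per (birthday, last four digits) combination
        lam = population / bins
        unique = exp(-lam)           # chance no one else shares your combination
        pairs = lam * exp(-lam) / 2  # expected pairs per person
        return unique, pairs

    states = [("Texas", 30_000_000), ("Ohio", 12_000_000),
              ("Tennessee", 6_700_000), ("Wyoming", 600_000)]
    for state, pop in states:
        u, p = identification_odds(pop)
        print(f"{state:9} {100*u:6.2f}% {100*p:6.2f}%")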


[1] In that post we made the dubious simplifying assumption that birth dates were uniformly distributed from 0 to 78 years. This assumption is not accurate, but it was good enough to prove the point that it’s easier to identify people than you might think. Here our assumptions are better founded. Birthdays are nearly uniformly distributed, though there are some slight irregularities. The last four digits of social security numbers are uniformly distributed, though the first digits are correlated with the state.

Rényi Differential Privacy

Differential privacy, specifically ε-differential privacy, gives strong privacy guarantees, but it can be overly cautious by focusing on worst-case scenarios. The generalization (ε, δ)-differential privacy was introduced to make ε-differential privacy more flexible.

Rényi differential privacy (RDP) is a new generalization of ε-differential privacy by Ilya Mironov that is comparable to the (ε, δ) version but has several advantages. For instance, RDP is easier to interpret than (ε, δ)-DP and composes more simply.

Rényi divergence

My previous post discussed Rényi entropy. Rényi divergence is to Rényi entropy what Kullback-Leibler divergence is to Shannon entropy.

Given two random variables X and Y that take on n possible values each with positive probabilities pᵢ and qᵢ respectively, the Rényi divergence of X from Y is defined by

D_\alpha(X || Y) = \frac{1}{\alpha -1} \log\left( \sum_{i=1}^n \frac{p_i^\alpha}{q_i^{\alpha -1}}\right)

for α > 0 and not equal to 1. The definition extends to α = 0, 1, or ∞ by taking limits.

As α converges to 1, Dα converges to the Kullback-Leibler divergence.
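Here’s a quick numerical illustration of the definition and the limit, with two small distributions chosen arbitrarily:

    import numpy as np

    def renyi_divergence(p, q, alpha):
        # Rényi divergence of order alpha (alpha > 0, alpha != 1)
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.log(np.sum(p**alpha / q**(alpha - 1))) / (alpha - 1)

    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]

    kl = np.sum([a * np.log(a / b) for a, b in zip(p, q)])
    print(renyi_divergence(p, q, 1.001))  # close to the KL divergence
    print(kl)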

Rényi differential privacy

A couple weeks ago I wrote an introduction to differential privacy that started by saying

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not.

The post develops precisely what it means for a query to be differentially private. The bottom line (literally!) was

An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D’,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

It turns out that this definition is equivalent to saying the Rényi divergence of order ∞ between A(D) and A(D′) is at most ε. That is,

D_\alpha\big(A(D) || A(D')\big) \leq \varepsilon

where α = ∞. The idea of Rényi differential privacy (RDP) is to allow the possibility that α could be finite [1]. That is, an algorithm is (α, ε)-RDP if the Rényi divergence of order α between its outputs on any two neighboring databases is no more than ε.

One of the nice properties of RDP is how simply algorithms compose: the composition of an (α, ε₁)-RDP algorithm and an (α, ε₂)-RDP algorithm is an (α, ε₁ + ε₂)-RDP algorithm. This leads to simple privacy budget accounting.

Comparison to (ε, δ)-differential privacy

The composition of ε-DP algorithms is simple: just substitute α = ∞ in the result above for composing RDP algorithms. However, the resulting bound may be overly pessimistic.

The composition of (ε, δ)-DP algorithms is not as simple: both ε and δ change. With RDP, the order α does not change, and the ε values simply add.

Mironov proves in [1] that every RDP algorithm is also an (ε, δ)-DP algorithm. Specifically, Proposition 3 says

If f is an (α, ε)-RDP mechanism, it also satisfies

\left(\varepsilon - \frac{\log \delta}{\alpha - 1}, \delta\right)

differential privacy for any 0 < δ < 1.
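In code, this conversion is one line. A minimal sketch, using the proposition above with an arbitrary example mechanism:

    from math import log

    def rdp_to_dp(alpha, eps, delta):
        # (alpha, eps)-RDP implies (eps - log(delta)/(alpha - 1), delta)-DP
        # for any 0 < delta < 1 (Proposition 3 in Mironov's paper)
        return eps - log(delta) / (alpha - 1), delta

    print(rdp_to_dp(alpha=10, eps=0.5, delta=1e-6))  # about (2.04, 1e-06)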

This tells us that Rényi differential privacy gives us privacy guarantees that are somewhere between ε-differential privacy and (ε, δ)-differential privacy.


[1] Ilya Mironov. Rényi Differential Privacy. arXiv:1702.07476

Gamma gamma gamma!

There are several things in math and statistics named gamma. Three examples are

  • the gamma function
  • the gamma constant
  • the gamma distribution

This post will show how these are related. We’ll also look at the incomplete gamma function which connects with all the above.

The gamma function

The gamma function is the most important function not usually found on a calculator. It’s the first “advanced” function you’re likely to learn about. You might see it in passing in a calculus class, in a homework problem on integration by parts, but usually there’s not much emphasis on it. But it comes up a lot in applications.

\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt

You can think of the gamma function as a way to extend factorial to non-integer values. For non-negative integers n, Γ(n + 1) = n!.

(Why is the argument n + 1 to Γ rather than n? There are a number of reasons, historical and practical. Short answer: some formulas turn out simpler if we define Γ that way.)
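A quick numerical check of the relation Γ(n + 1) = n!:

    from math import factorial
    from scipy.special import gamma

    for n in range(6):
        print(n, factorial(n), gamma(n + 1))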

The gamma constant

The gamma constant, a.k.a. Euler’s constant or the Euler-Mascheroni constant, is defined as the asymptotic difference between harmonic numbers and logarithms. That is,

\gamma = \lim_{n\to\infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}- \log n\right)

The constant γ comes up fairly often in applications. But what does it have to do with the gamma function? There’s a reason the constant and the function are both named by the same Greek letter. One connection is that the gamma constant is part of the product formula for the gamma function.

\frac{1}{\Gamma(x)} = x e^{\gamma x} \prod_{r=1}^\infty \left(1 + \frac{x}{r} \right) e^{-x/r}

If we take the logarithm of this formula and differentiate, we find that

\frac{\Gamma'(1)}{\Gamma(1)} = -\gamma
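The left side is the digamma function ψ evaluated at 1, so we can verify this numerically:

    from numpy import euler_gamma
    from scipy.special import digamma

    print(digamma(1.0))   # -0.5772156649015329
    print(-euler_gamma)   # -0.5772156649015329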

The gamma distribution

If you take the integrand defining the gamma function and turn it into a probability distribution by normalizing it to integrate to 1, you get the gamma distribution. That is, a gamma random variable with shape k has probability density function (PDF) given by

f(x) = \frac{1}{\Gamma(k)} x^{k-1} e^{-x}

More generally you could add a scaling parameter to the gamma distribution in the usual way. You could imagine the scaling parameter present here but set to 1 to make things simpler.

The incomplete gamma function

The incomplete gamma function relates to everything above. It’s like the (complete) gamma function, except the range of integration is finite. So it’s now a function of two variables, the extra variable being the limit of integration.

\gamma(s, x)= \int_0^x t^{s-1} e^{-t} \, dt

(Note that now x appears in the limit of integration, not the exponent of t. This notation is inconsistent with the definition of the (complete) gamma function but it’s conventional.)

It uses a lower case gamma for its notation, like the gamma constant, and is a generalization of the gamma function. It’s also essentially the cumulative distribution function of the gamma distribution. That is, the CDF of a gamma random variable with shape s is γ(s, x) / Γ(s).

The function γ(s, x) / Γ(s) is called the regularized incomplete gamma function. Sometimes the distinction between the regularized and unregularized versions is not explicit. For example, in Python, the function gammainc does not compute the incomplete gamma function per se but the regularized incomplete gamma function. This makes sense because the latter is often more convenient to work with numerically.
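Here’s a quick check that gammainc is regularized and agrees with the gamma distribution’s CDF:

    from scipy.special import gammainc
    from scipy.stats import gamma

    s, x = 2.5, 1.7
    print(gammainc(s, x))   # regularized incomplete gamma
    print(gamma.cdf(x, s))  # CDF of gamma distribution with shape s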


What is differential privacy?

Differential privacy is a strong form of privacy protection with a solid mathematical definition.

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not. This intuitive idea can be made precise as follows.


Queries and algorithms

First of all, differential privacy is something that applies to queries, not to databases. It could apply to a database if your query is to select everything in the database, but typically you want to run far more specific queries. Differential privacy adds noise to protect privacy, and the broader the query, the more noise must be added, so typically you want queries to be more narrow than “Tell me everything about everything.”

Generally the differential privacy literature speaks of algorithms rather than queries. The two are the same if you have a general idea of what a query is, but using the word algorithm emphasizes that the query need not be a simple SQL query.

Opt-out and neighboring databases

To quantify the idea of “whether your information is included or not” we look at two versions of a database, D and D′, that differ by one row, which you can think of as the row containing your data. Our formalism is symmetric, so we do not specify which database, D or D′, contains your row and which is missing your row.

We should mention some fine print here. The paragraph above implicitly assumes that your database is just a simple table. In general we can speak of neighboring databases. In the case of a more complex database design, the notion of what it means for two databases to be neighboring is more complex. Differential privacy can be defined whenever a notion of neighboring databases can be defined, so you could, for example, consider differential privacy for a non-relational (“NoSQL”) database. But for the simple case of a single table, two databases are neighboring if they differ by a row.

However, there’s a subtlety regarding what it means even for two tables to “differ by one row.” Unbounded differential privacy assumes one row has been removed from one of the databases. Bounded differential privacy assumes the two databases have the same number of rows, but the data in one row has been changed. In practice the difference between bounded and unbounded differential privacy doesn’t often matter.

Quantifying difference

We said it “makes little difference” whether an individual’s data is included or not. How do we quantify that statement? Differential privacy is a stochastic procedure, so we need to measure difference in terms of probability.

Whenever mathematicians want to speak of two things being close together, we often use ε as our symbol. While in principle ε could be large, the choice of symbol implies that you should think of it as a quantity you can make as small as you like (though there may be some cost that increases as ε decreases).

We’re going to look at ratios of probabilities, so we want these ratios to be near 1. This means we’ll work with e^ε = exp(ε) rather than ε itself. A convenient property that falls out of the Taylor series for the exponential function is that if ε is small then exp(ε) is approximately 1 + ε. For example, exp(0.05) = 1.0513, and so if the ratio of two probabilities is bounded by exp(0.05), the probabilities are roughly within 5% of each other.

Definition of differential privacy

Now we can finally state our definition. An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D′,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.
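The definition says nothing about how to achieve this bound. The canonical example is the Laplace mechanism: add Laplace noise with scale Δ/ε to a numerical query, where Δ is the query’s sensitivity, the most the answer can change between neighboring databases. A minimal sketch:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        # adding Laplace(sensitivity/epsilon) noise gives epsilon-DP for a
        # query whose answer changes by at most `sensitivity` between
        # neighboring databases
        rng = np.random.default_rng()
        return true_answer + rng.laplace(scale=sensitivity / epsilon)

    # A counting query changes by at most 1 when one row changes.
    print(laplace_mechanism(true_answer=42, sensitivity=1, epsilon=0.1))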

Update: See a follow-on post about Rényi differential privacy, a generalization of differential privacy that measures the difference between output probability distributions using information theory rather than bounding ratios. It includes the definition above as a special case.

Help with differential privacy

If you’d like help understanding and implementing differential privacy, let’s talk.


Earth mover distance and t-closeness


There’s an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it’s like a lot of other records.

k-anonymity

The idea of k-anonymity is that every database record appears at least k times when you restrict your attention to quasi-identifiers [1]. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity. Even when k is large, k-anonymity might prevent re-identification but still suffer from attribute disclosure.

Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.

Or going the other way around, if you already know something that stands out about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a background knowledge attack.

One way to address this shortcoming is l-diversity. This post won’t go into l-diversity because it’s an intermediate step to where we want to go, which is t-closeness.

t-closeness

The idea of t-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The “t” comes from requiring that the distributions be no more than a distance t apart in a sense that we’ll define below [2]. If the sensitive data in a group doesn’t stand out, this thwarts the homogeneity attack and the background knowledge attack.

Earth mover distance

When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what “far” means? That is, how do we measure the distance between two distributions?

There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Leibler divergence, though that’s not what we’ll use here. Instead, t-closeness uses the so-called earth mover distance (EMD), also known as the Wasserstein metric.

The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.

Suppose your data is some ordered response, 1 through 5. Suppose distribution X has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions Y and Z are the same except they have 80% of their probability mass at 2 and at 5 respectively. By some measures, X is equally far from Y and Z, but the earth mover distance would say that X is closer to Y than to Z, which is more appropriate in our setting.

We can calculate the EMD for the example above. To transform X into Y, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform X into Z we need to move the same amount of mass, but we need to move it 4x further, and so the EMD is 3.
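SciPy can check this calculation; wasserstein_distance computes the EMD for one-dimensional distributions:

    from scipy.stats import wasserstein_distance

    values = [1, 2, 3, 4, 5]
    X = [0.80, 0.05, 0.05, 0.05, 0.05]
    Y = [0.05, 0.80, 0.05, 0.05, 0.05]
    Z = [0.05, 0.05, 0.05, 0.05, 0.80]

    print(wasserstein_distance(values, values, X, Y))  # 0.75
    print(wasserstein_distance(values, values, X, Z))  # 3.0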


[1] What exactly is a quasi-identifier? That’s hard to say precisely. An identifier is something that clearly identifies an individual. A phone number, for example, is an identifier. A quasi-identifier is something that helps narrow down who a person is, but isn’t an identifier. For example, someone’s birthday would be a quasi-identifier. But in general “quasi-identifier” is not a precisely defined concept.

[2] It’s common in math to use a variable name as an adjective, as with k-anonymity, l-diversity, and t-closeness. This is unfortunate because it isn’t descriptive and locks in a variable naming convention. Someone used the variable k to count the number of redundant database records, and the name stuck. Similarly the l in l-diversity counts something.

As a sidenote, the variable names here follow the old FORTRAN convention that variables i through n represent integers by default, and that all other variables represent real (floating point) numbers by default. The t in t-closeness is a continuous measure, so the t is a real value, while k and l are integers.

Biometric security and hypothesis testing

A few weeks ago I wrote about how there are many ways to summarize the operating characteristics of a test. The most basic terms are accuracy, precision, and recall, but there are many others. Nobody uses all of them. Each application area has its own jargon.

Biometric security has its own lingo, and it doesn’t match any of the terms in the list I gave before.


False Acceptance Rate

Biometric security uses False Acceptance Rate (FAR) for the proportion of times a system grants access to an unauthorized person. In statistical terms, FAR is Type II error. FAR is also known as False Match Rate (FMR).

False Rejection Rate

False Rejection Rate (FRR) is the proportion of times a biometric system fails to grant access to an authorized person. In statistical terms, FRR is Type I error. FRR is also known as False Non-Match Rate (FNMR).

Crossover Error Rate

One way to summarize the operating characteristics of a biometric security system is to look at the Crossover Error Rate (CER), also known as the Equal Error Rate (EER). The system has parameters that can be tuned to adjust the FAR and FRR. Adjust these to the point where the FAR and FRR are equal. When the two are equal, their common value is the CER or EER.

The CER gives a way to compare systems. The smaller the CER the better. A smaller CER value means it’s possible to tune the system so that both the Type I and Type II error rates are smaller than they would be for another system.

Loss Function

CER is kind of a strange metric. Everyone agrees that you wouldn’t want to calibrate a system so that FAR = FRR. In security applications, FAR (unauthorized access) is worse than FRR (authorized user locked out). The former could be a disaster while the latter is an inconvenience. Of course there could be a context where the consequences of FAR and FRR are equal, or that FRR is worse, but that’s not usually the case.

A better approach would be to specify a loss function (or its negative, a utility function). If unauthorized access is K times more costly than locking out an authorized user, then you might want to know at what point K × FAR = FRR, or where your expected loss [1] is minimized, over the range of tuning parameters. The better system for you, in your application, is the one corresponding to your value of K.
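Here’s a sketch of what that tuning could look like, with made-up FAR and FRR curves standing in for a real system’s measured rates:

    import numpy as np

    t = np.linspace(0, 1, 101)           # tuning parameter
    FAR = np.exp(-5 * t)                 # made-up: stricter settings reject more impostors
    FRR = np.expm1(2 * t) / np.expm1(2)  # made-up: stricter settings lock out more friends

    K = 20  # unauthorized access is 20 times as costly as a lockout
    i = np.argmin(K * FAR + FRR)
    print(t[i], FAR[i], FRR[i])          # operating point minimizing K*FAR + FRR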

Since everyone has a different value of K, it is easier to just use K = 1, even though everyone’s value of K is likely to be much larger than 1. Unfortunately this often happens in decision theory. When people can’t agree on a realistic loss function, they standardize on a mathematically convenient implicit loss function that nobody would consciously choose.

If everyone had different values of K near 1, the CER metric might be robust, i.e. it might often make the right comparison between two different systems even though the criterion is wrong. But since K is probably much larger than 1 for most people, it’s questionable that CER would rank two systems the same way people would if they could use their own value of K.

***

[1] These are not the same thing. To compute expected loss you’d need to take into account the frequency of friendly and unfriendly access attempts. In a physically secure location, friends may attempt to log in much more often than foes. On a public web site the opposite is more likely to be true.

Probability of winning the World Series


Suppose you have two baseball teams, A and B, playing in the World Series. If you like, say A stands for Houston Astros and B for Milwaukee Brewers. Suppose that in each game the probability that A wins is p, and the probability of A losing is q = 1 – p. What is the probability that A will win the series?

The World Series is a best-of-seven series, so the first team to win 4 games wins the series. Once one team wins four games there’s no point in playing the rest of the games because the series winner has been determined.

At least four games will be played, so if you win the series, you win on the 4th, 5th, 6th, or 7th game.

The probability of A winning the series after the fourth game is simply p⁴.

The probability of A winning after the fifth game is 4p⁴q because A must have lost one game, and it could be any one of the first four games.

The probability of A winning after the sixth game is 10p⁴q² because A must have lost two of the first five games, and there are 10 ways to choose two items from a set of five.

Finally, the probability of A winning after the seventh game is 20p⁴q³ because A must have lost three of the first six games, and there are 20 ways to choose three items from a set of six.

The probability of winning the World Series is the sum of the probabilities of winning after 4, 5, 6, and 7 games, which is

p⁴(1 + 4q + 10q² + 20q³)

Here’s a plot:

Probability of winning the world series as a function of the probability of winning one game

Obviously, the more likely you are to win each game, the more likely you are to win the series. But it’s not a straight line because the better team is more likely to win the series than to win any given game.

Now if you only wanted to compute the probability of winning the series, not the probability of winning after different numbers of games, you could pretend that all the games are played, even though some may be unnecessary to determine the winner. Then we compute the probability that a Binomial(7, p) random variable takes on a value greater than or equal to 4, which is

35p⁴q³ + 21p⁵q² + 7p⁶q + p⁷

While this looks very different from the expression we worked out above, they’re actually the same. If you stick in (1 – p) for q and work everything out, you’ll see the two agree.
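SymPy can do the algebra for us:

    from sympy import symbols, expand

    p = symbols('p')
    q = 1 - p

    by_series_length = p**4 * (1 + 4*q + 10*q**2 + 20*q**3)
    binomial_tail = 35*p**4*q**3 + 21*p**5*q**2 + 7*p**6*q + p**7

    print(expand(by_series_length - binomial_tail))  # 0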

Accuracy, precision, and recall

I posted the definitions of accuracy, precision, and recall on @BasicStatistics this afternoon. I think the tweet was popular because people find these terms hard to remember and they liked a succinct cheat sheet.

There seems to be no end of related definitions, and multiple names for the same definitions.

Precision is also known as positive predictive value (PPV) and recall is also known as sensitivity, hit rate, and true positive rate (TPR).

Not mentioned in the tweet above are specificity (a.k.a. selectivity and true negative rate or TNR), negative predictive value (NPV), miss rate (a.k.a. false negative rate or FNR), fall-out (a.k.a. false positive rate or FPR), false discovery rate (FDR), and false omission rate (FOR).

How many terms are possible? There are four basic ingredients: TP, FP, TN, and FN. (You may see these arranged into a 2 by 2 grid called a confusion matrix.) So if each term may or may not be included in a sum in the numerator and denominator, that’s 16 possible numerators and 16 denominators, for a total of 256 possible terms to remember. Some of these are redundant, such as one (a.k.a. ONE), given by TP/TP, FP/FP, etc. If we insist that the numerator and denominator be different, that eliminates 16 possibilities, and we’re down to a more manageable 240 definitions. And if we rule out terms that are the reciprocals of other terms, we’re down to only 120 definitions to memorize.

But wait! We’ve assumed every term occurs with a coefficient of either 0 or 1. But the F1 score has a couple coefficients of 2:

F1 = 2TP / (2TP + FP + FN)

If we allow coefficients of 0, 1, or 2, and rule out redundancies and reciprocals …

This has been something of a joke. Accuracy, precision, and recall are useful terms, though I think positive predictive value and true positive rate are easier to remember than precision and recall respectively. But the proliferation of terms is ridiculous. I honestly wonder how many terms for similar ideas are in use somewhere. I imagine there are scores of possibilities, each with some subdiscipline that thinks it is very important, and few people use more than three or four of the possible terms.

Here are the terms I’ve collected so far. If you know of other terms or other names for the same terms, please leave a comment and we’ll see how long this table goes.

|-----------------------------------+----------------------------------|
| Term                              | Definition                       |
|-----------------------------------+----------------------------------|
| Accuracy                          | (TP + TN) / (FP + TP + FN + TN)  |
| True positive rate (TPR)          | TP / (TP + FN)                   |
| Sensitivity                       | see TPR                          |
| Recall                            | see TPR                          |
| Hit rate                          | see TPR                          |
| Probability of detection          | see TPR                          |
| True negative rate (TNR)          | TN / (TN + FP)                   |
| Specificity                       | see TNR                          |
| Selectivity                       | see TNR                          |
| Positive predictive value (PPV)   | TP / (TP + FP)                   |
| Precision                         | see PPV                          |
| Negative predictive value (NPV)   | TN / (TN + FN)                   |
| False negative rate (FNR)         | FN / (FN + TP)                   |
| Miss rate                         | see FNR                          |
| False positive rate (FPR)         | FP / (FP + TN)                   |
| Fall-out                          | see FPR                          |
| False discovery rate (FDR)        | FP / (FP + TP)                   |
| False omission rate (FOR)         | FN / (FN + TN)                   |
| F1 score                          | 2TP / (2TP + FP + FN)            |
| F-score                           | see F1 score                     |
| F-measure                         | see F1 score                     |
| Sørensen–Dice coefficient         | see F1 score                     |
| Dice similarity coefficient (DSC) | see F1 score                     |
|-----------------------------------+----------------------------------|
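Here’s a sketch of the handful of these definitions most people actually use, computed from the four basic ingredients (the counts are made up):

    def metrics(TP, FP, TN, FN):
        return {
            "accuracy":    (TP + TN) / (TP + FP + TN + FN),
            "precision":   TP / (TP + FP),    # positive predictive value
            "recall":      TP / (TP + FN),    # sensitivity, true positive rate
            "specificity": TN / (TN + FP),    # true negative rate
            "F1":          2*TP / (2*TP + FP + FN),
        }

    print(metrics(TP=40, FP=10, TN=45, FN=5))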

Pi primes

I was reading a recent blog post by Evelyn Lamb where she mentioned in passing that 314159 is a prime number and that made me curious how many such primes there are.

Let’s look at numbers formed from the digits of π to see which ones are prime.

Obviously 3 and 31 are prime. 314 is even. 3141 is divisible by 9 because its digits sum to 9, and 31415 is clearly divisible by 5. And now we know that 314159 is prime. What’s the next prime in the sequence? Here’s a little Python code to find out.

    from sympy import pi, isprime

    M = 1000
    # "3" followed by the first M digits of pi after the decimal point
    digits = "3" + str(pi.evalf(M+1))[2:]
    for i in range(1, M+1):
        n = int(digits[:i])  # number formed by the first i digits of pi
        if isprime(n):
            print(n)

This looks at numbers formed from the first digit up to the thousandth digit in the representation of π. The only other prime it finds is

31415926535897932384626433832795028841.

After running the Python code above, I wondered whether this sequence is in OEIS, and indeed it is, sequence A005042. The OEIS says the next number in the sequence is 16,208 digits long.

Expected number of primes

I expected to find more primes out of the first 1,000 digits of π and was surprised that the next prime is so long.

How many primes would we expect if we look at random numbers of length 1 through 1000? (The numbers formed from the digits of π are not random, but I want to compare them to random selections.)

From the prime number theorem, the probability that a d-digit number is prime is approximately 1/d log 10. So if we pick numbers of length 1, 2, 3, … N, the number of primes we’d expect to find is

\frac{1}{\log 10}\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots +\frac{1}{N} \right ) = \frac{H_N}{\log 10}

where H_N is the Nth harmonic number. Harmonic numbers grow very slowly, and so this gives us an idea why the number of primes we find is so small. In particular, it says that if we look out 1,000 digits, we’d expect to find about 3.25 primes. And so finding 4 primes isn’t surprising.

By the way, H_N is roughly log N, so the expected number of primes is roughly log N / log 10, or simply log₁₀ N. [1]
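The estimate is a one-liner to evaluate:

    from math import log

    N = 1000
    H = sum(1/k for k in range(1, N + 1))  # Nth harmonic number
    print(H / log(10))                     # about 3.25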

Other numbers

To see if our prediction holds, let’s try a few more irrational numbers.

First, let’s look at e. Then we get these four prime numbers:

2
271
2718281
2718281828459045235360287471352662497757247093699959574966967627724076630353547594571

Next, let’s try the square root of 2. Then we get these three prime numbers:

1414213562373095048801688724209698078569671875376948073
1414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641
141421356237309504880168872420969807856967187537694807317667973799073247846210703885038753432764157273501384623091229702492483605585073721264412149709993583141322266592750559275579995050115278206057147010955997160597027453459

And finally, sin(2018) gives us two primes:

89
890078059172214077866908781516358161827277734319970800477363723960779818810729974998238099333154892009885458973321640841442042739723513347225471340149039460391024770670777037177784685508982613866838293832904866761046669825265135653209394367093522291768886320360545920897111922035021827790766895274726280695701753189405911185099453319400331979946988526902527520607778873012543871813339138882571549068827579760491442082336321409598206489746377828352552073997340874649330911864108558685738003182332758299767812686413014214285003827180623455181083706822164258525158371971877396069605035317998430446771215296978812833752733

We might expect to find more primes when the leading digit is small because each time we select a number with d digits we’re looking lower in the range where primes are more dense. That is consistent with the small number of examples in this post.


[1] When I write “log” without a subscript, I always mean natural log, log base e. If you’re not accustomed to this, see why this is conventional.

Distribution of zeros of the Riemann zeta

A recent video by Quanta Magazine says that the eigenvalues of random matrices and the zeros of the Riemann zeta function appear to have the same distribution.

I assume by “random matrices” the video is referring specifically to Gaussian orthogonal ensembles. [Update: See the first comment below for more details. You need some assumptions on the distribution—fat tails are out—but you don’t need to assume a GOE.]

By zeros of the Riemann zeta function, they mean the imaginary parts of the zeros in the critical strip. You can download the first 100,000 zeros of the Riemann zeta function here. So, for example, the first zero is 14.134725142, which actually means 0.5 + 14.134725142 i.

Here’s the histogram of random matrix eigenvalues from my previous post:

And here’s a histogram of the spacing between the first two thousand zeros of the Riemann zeta function:

The distribution of random matrix eigenvalues is known exactly. I believe the corresponding distribution for the Riemann zeta zeros is conjectural, though the numerical evidence is persuasive. Here’s a plot using the first 100,000 zeros.
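If you’d rather compute zeros than download them, mpmath can find them directly, though not quickly. A sketch of how the spacing histogram could be made:

    import matplotlib.pyplot as plt
    import mpmath

    # imaginary parts of the first 500 nontrivial zeros (takes a while)
    zeros = [float(mpmath.zetazero(n).imag) for n in range(1, 501)]
    spacings = [b - a for a, b in zip(zeros, zeros[1:])]

    plt.hist(spacings, bins=40)
    plt.xlabel("spacing between consecutive zeros")
    plt.show()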

How long does it take to find large primes?

Descriptions of cryptography algorithms often start with something like “Let p be a large prime” or maybe more specifically “Let p be a 256 bit prime.” How hard is it to find large prime numbers?

Let’s look at the question in base b, so we can address binary and base 10 at the same time. The prime number theorem says that for large n, the number of primes less than bⁿ is approximately

bⁿ / n log b

and so the probability that an n-digit number base b is prime is roughly 1 / n log b. If you select only odd n-digit numbers, you double your chances.

If you generate odd n-digit numbers at random, how long would it be until you find a prime? You might get lucky and find one on your first try. Or you might have extraordinarily bad luck and find a composite every time until you give up.

The number of tries until you find a prime is a geometric random variable with

p = 2 / n log b.

The expected value is 1 / p.

So, for example, if you wanted to find a 256-bit prime, you would need to test on average 256 log(2) / 2 = 89 candidates. And if you wanted to find a 100-digit prime, you’d need to test on average 100 log(10) / 2 = 115 candidates.

You could get more accurate bounds by using a refined variation of the prime number theorem. Also, we implicitly looked at the problem of finding a number with at most n digits rather than exactly n digits. Neither of these would make much difference to our final result. To back up this claim, let’s compare our estimate to a simulation that actually looks for primes.


The histogram above was based on finding 10,000 primes, each with 100 digits, and counting how many candidates were evaluated each time. The plot matches a geometric distribution well. The mean from the simulation was 113.5, close to the 115 estimated above.
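Here’s a sketch of that kind of simulation, scaled down to fewer repetitions so it runs in a reasonable time:

    import random
    from sympy import isprime

    def candidates_until_prime(digits=100):
        count = 0
        while True:
            count += 1
            # random odd number with the given number of digits
            n = random.randrange(10**(digits - 1), 10**digits) | 1
            if isprime(n):
                return count

    trials = [candidates_until_prime() for _ in range(500)]
    print(sum(trials) / len(trials))  # should be near 115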

For cryptographic applications, you often don’t want just any prime. You might want a safe prime or a prime satisfying some other property, in which case you’ll need to test more primes. But your probability of success will likely still be geometrically distributed.

It’s plausible that for large p, the events “p is prime” and “2p + 1 is prime” are approximately independent, and a numerical experiment suggests that’s true. I generated 10,000 primes, each 100 digits long, and 48 were safe primes, about what you’d expect from theory. It is not known whether the number of safe primes is infinite, but safe primes with hundreds of thousands of digits have been found, so apparently they’re infinite enough for applications.


The other butterfly effect


The original butterfly effect

The butterfly effect is the semi-serious claim that a butterfly flapping its wings can cause a tornado halfway around the world. It’s a poetic way of saying that some systems show sensitive dependence on initial conditions, that the slightest change now can make an enormous difference later. Often this comes up in the context of nonlinear, chaotic systems, but it isn’t limited to that. I give an example here of a linear differential equation whose solutions start out essentially the same but eventually diverge completely.

Once you think about these things for a while, you start to see nonlinearity and potential butterfly effects everywhere. There are tipping points everywhere waiting to be tipped!

The other butterfly effect

But a butterfly flapping its wings usually has no effect, even in sensitive or chaotic systems. You might even say especially in sensitive or chaotic systems.

Sensitive systems are not always and everywhere sensitive to everything. They are sensitive in particular ways under particular circumstances, and can otherwise be quite resistant to influence. Not only may a system be insensitive to butterflies, it may even be relatively insensitive to raging bulls. The raging bull may have little more long-term effect than a butterfly. This is what I’m calling the other butterfly effect.

Steering complex systems

In some ways chaotic systems are less sensitive to change than other systems. And this can be a good thing. Resistance to control also means resistance to unwanted tampering. Chaotic or stochastic systems can have a sort of self-healing property. They are more stable than more orderly systems, though in a different way.

The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome. Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all. We should evaluate what aspects can be steered and how well they can be steered. The most effective intervention may not come from tweaking the inputs but from changing the structure of the system.


The cold start problem

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to create adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Thanks to the magic of Bayes theorem, you don’t have to have an ad hoc rule that says “Treat the first patient this way, then turn on the Bayesian machine to determine how to treat the next patient.” No, you use Bayes theorem from beginning to end. There’s no need to handle the first patient differently because expert opinion is already there, captured in the form of prior distributions (and the structure of the probability model).

Each patient is treated according to all information available at the time. At first all available information is prior information. After you have data on one patient, most of the information you have is still prior information, but Bayes’ theorem updates this prior information with your lone observation. As more data becomes available, the Bayesian machine incorporates it all, automatically shifting weight away from the prior and toward the data.
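Here’s a toy illustration of that shifting of weight, using a beta prior on a response rate and binary outcomes; the prior and the data are made up:

    from scipy.stats import beta

    a, b = 3, 7  # made-up prior: response rate around 30%

    # observe patients one at a time: 1 = response, 0 = no response
    for outcome in [1, 0, 0, 1, 1]:
        a += outcome
        b += 1 - outcome
        print(beta.mean(a, b))  # posterior mean after each patient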

The cold start problem for business applications is easier than the cold start problem for clinical trials. First of all, most business applications don’t have the potential to cost people their lives. Second, business applications typically have fewer competing criteria to balance.

What if you’re not sure where to draw your prior information? Bayes can handle that too. You can use Bayesian model selection or Bayesian model averaging to determine which source (or weighting of sources) best fits the new data as it comes in.

Once you’ve decided to use a Bayesian approach, there’s still plenty of work to do, but the Bayesian approach provides scaffolding for that work, a framework for moving forward.

Distribution of carries

Suppose you add two long numbers together. As you work your way toward the result, adding digits from right to left, sometimes you carry a 1 and sometimes you don’t. How often do you carry 1?

Now suppose you add three numbers at a time. Now your carry might be 0, 1, or 2. How often does each appear? How do things change if you’re not working in base 10?

John Holte goes into this problem in detail in [1]. We’ll only look at his first result here.

Suppose we’re adding m numbers together in base b, with each digit being uniformly distributed. Then at each step of the addition process, the probability distribution of the amount we carry out only depends on the amount we carried in from the previous step. In other words, the carries form a Markov chain!

This means that we can do more than describe the distribution of the amounts carried. We can specify the transition probabilities for carries, i.e. the distribution on the amount carried out, conditional on the amount carried in.

Examples

When we add two base ten numbers, m = 2 and b = 10, we have the transition matrix

\begin{pmatrix} 0.55 & 0.45 \\ 0.45 & 0.55 \end{pmatrix}

This means that on average, we’re as likely to carry 0 as to carry 1. But more specifically, there’s a 55% chance that we’ll carry a 1 if we carried a 1 on the previous step, and also a 55% chance that we won’t have a carry this time if we didn’t have one last time.

If we add three numbers, things get a little more interesting. Now the transition matrix is

\begin{pmatrix} 0.220 & 0.660 & 0.120 \\ 0.165 & 0.670 & 0.165 \\ 0.120 & 0.660 & 0.220 \end{pmatrix}

This gives the probabilities of each carry out for each carry in. We can compute the long run distribution from the matrix above to find that when adding three long numbers, we’ll carry 0 or 2 with probability 1/6 each, and carry 1 with probability 2/3.

When adding three binary numbers, the transition matrix is fairly different than for base 10

\begin{pmatrix} 0.500 & 0.500 & 0.000 \\ 0.125 & 0.750 & 0.125 \\ 0.000 & 0.500 & 0.500 \\ \end{pmatrix}

but the long run distributions are the same: carry a 1 two thirds of the time, and split the remaining probability equally between carrying 0 and 2.

The transition matrices don’t change much for larger bases; base 100 doesn’t look much different from base 10, for example.

Finally, let’s do an example adding five numbers in base 10. The transition matrix is

\begin{pmatrix} 0.020 & 0.305 & 0.533 & 0.140 & 0.003 \\ 0.013 & 0.259 & 0.548 & 0.176 & 0.005 \\ 0.008 & 0.216 & 0.553 & 0.216 & 0.008 \\ 0.005 & 0.176 & 0.548 & 0.260 & 0.013 \\ 0.003 & 0.140 & 0.533 & 0.305 & 0.020 \\ \end{pmatrix}

and the long run distribution is 1/120, 26/120, 66/120, 26/120, and 1/120 for carries of 0 through 4.

General case

In general, when adding m numbers in base b, Holte gives the transition probabilities as

\pi_{ij} = b^{-m} \sum_{r=0}^{j-\lfloor i/b\rfloor}(-1)^r {m+1\choose r}{m - i -1 + b(j-r+1) \choose m}

Here is some Python code to compute the transition matrices.

from math import floor
from numpy import array
from numpy.linalg import matrix_power
from scipy.special import comb

# transition probability: carry j out, given carry i in,
# when adding m numbers in base b
def pi(i, j, b, m):
    s = 0
    for r in range(0, j - floor(i/b) + 1):
        s += (-1)**r * comb(m+1, r) * comb(m-i-1 + b*(j-r+1), m)
    return s / b**m

def transition_matrix(b, m):
    P = [[pi(i, j, b, m) for j in range(m)]
         for i in range(m)]
    return array(P)

You can find the long run probabilities by raising the transition matrix to a large power.
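For example, continuing the code above:

# long run distribution of carries when adding three numbers in base 10
P = transition_matrix(10, 3)
print(matrix_power(P, 50)[0])  # approximately [1/6, 2/3, 1/6]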


[1] John M. Holte. Carries, Combinatorics, and an Amazing Matrix. The American Mathematical Monthly, Vol. 104, No. 2 (February 1997), pp. 138–149.