Here are 10 of my most popular posts this year, arranged as pairs of posts in different areas.
Here are 10 of my most popular posts this year, arranged as pairs of posts in different areas.
A new prime record was announced yesterday. The largest known prime is now
Written in hexadecimal the newly discovered prime is
For decades the largest known prime has been a Mersenne prime because there’s an efficient test for checking whether a Mersenne number is prime. I explain the test here.
There are now 51 known Mersenne primes. There may be infinitely many Mersenne primes, but this hasn’t been proven. Also, the newly discovered Mersenne prime is the 51st known Mersenne prime, but it may not be the 51st Mersenne prime, i.e. there may be undiscovered Mersenne primes hiding between the last few that have been discovered.
Three weeks ago I wrote a post graphing the trend in Mersenne primes. Here’s the updated post. Mersenne primes have the form 2p -1 where p is a prime, and this graph plots the values of p on a log scale. See the earlier post for details.
Because there’s a one-to-one correspondence between Mersenne primes and even perfect numbers, the new discovery means there is also a new perfect number. M is a Mersenne prime if and only if M(M + 1)/2 is an even perfect number. This is also the Mth triangular number.
This post will look at adaptively randomized trial designs. In particular, we want to focus on multi-arm trials, i.e. trials of more than two treatments. The aim is to drop the less effective treatments quickly so the trial can focus on determining which of the better treatments is best.
We’ll briefly review our approach to adaptive randomization but not go into much detail. For a more thorough introduction, see, for example, this report.
The contraction mapping theorem says that if a function moves points closer together, then there must be some point the function doesn’t move. We’ll make this statement more precise and give a historically important application.
A function f on a metric space X is a contraction if there exists a constant q with 0 ≤ q < 1 such that for any pair of points x and y in X,
d( f(x), f(y) ) ≤ q d(x, y)
where d is the metric on X.
A point x is a fixed point of a function f if f(x) = x.
Banach’s fixed point theorem, also known as the contraction mapping theorem, says that every contraction on a complete metric space has a fixed point. The proof is constructive: start with any point in the space and repeatedly apply the contraction. The sequence of iterates will converge to the fixed point.
Kepler’s equation in for an object in an elliptical orbit says
M + e sin E = E
where M is the mean anomaly, e is the eccentricity, and E is the eccentric anomaly. These “anomalies” are parameters that describe the location of an object in orbit. Kepler solved for E given M and e using the contraction mapping theorem, though he didn’t call it that.
Kepler speculated that it is not possible to solve for E in closed form—he was right—and used a couple iterations  of
f(E) = M + e sin E
to find an approximate fixed point. Since the mean anomaly is a good approximation for the eccentric anomaly, M makes a good starting point for the iteration. The iteration will converge from any starting point, as we will show below, but you’ll get a useful answer sooner starting from a good approximation.
Kepler came up with his idea for finding E around 1620, and Banach stated his fixed point theorem three centuries later. Kepler had the idea of Banach’s theorem, but he didn’t have a rigorous formulation of the theorem or a proof.
In modern terminology, the real line is a complete metric space and so we only need to prove that the function f above is a contraction. By the mean value theorem, it suffices to show that the absolute value of its derivative is less than 1. That is, we can use an upper bound on |f ‘| as the q in the definition of contraction.
f ‘ (E) = e cos E
|f ‘(E)| ≤ e
for all E. If our object is in an elliptical orbit, e < 1 and so we have a contraction.
The following example comes from , though the author uses Newton’s method to solve Kepler’s equation. This is more efficient, but anachronistic.
Consider a satellite on a geocentric orbit with eccentricity e = 0.37255. Determine the true anomaly at three hours after perigee passage, and calculate the position of the satellite.
The author determines that M = 3.6029 and solves Kepler’s equation
M + e sin E = E
for E, which she then uses to solve for the true anomaly and position of the satellite.
The following Python code shows the results of the first 10 iterations of Kepler’s equation.
from math import sin M = 3.6029 e = 0.37255 E = M for _ in range(10): E = M + e*sin(E) print(E)
3.437070 3.494414 3.474166 3.481271 3.478772 3.479650 3.479341 3.479450 3.479412 3.479425
and so it appears the iteration has converged to E = 3.4794 to four decimal places.
Note that this example has a fairly large eccentricity. Presumably Kepler would have been concerned with much smaller eccentricities. The eccentricity of Jupiter’s orbit, for example, is around 0.05. For such small values of e the iteration would converge more quickly.
 Bertil Gustafsson saying in his book Scientific Computing: A Historical Perspective that Kepler only used two iterations. Since M gives a good starting approximation to E, two iterations would give a good answer. I imagine Kepler would have done more iterations if necessary but found empirically that two was enough. Incidentally, it appears Gustaffson has a sign error in his statement of Kepler’s equation.
 Euler Celestial Analysis by Dora Musielak.
Earlier this year I was a coauthor on a paper about the Cap Score™ test for male fertility from Androvia Life Sciences . I just noticed today that when I added the publication to my CV, it caused some garbled text to appear in the PDF.
Here is the corresponding LaTeX source code.
There were two problems: the trademark symbol and the non-printing symbol denoted by a red underscore in the source file. The trademark was a non-ASCII character (Unicode U+2122) and the underscore represented a non-printing (U+00A0). At first I only noticed the trademark symbol, and I fixed it by including a LaTeX package to allow Unicode characters:
An alternative fix, one that doesn’t require including a new package, would be to replace the trademark Unicode character with
\texttrademark\. Note the trailing backslash. Without the backslash there would be no space after the trademark symbol. The problem with the unprintable character would remain, but the character could just be deleted.
I found out there are two Unicode code points render the trademark glyph, U+0099 and U+2122. The former is in the Latin 1 Supplement section and is officially a control character. The correct code point for the trademark symbol is the latter. Unicode files U+2122 under Letterlike Symbols and gives it the official name TRADE MARK SIGN.
 Jay Schinfeld, Fady Sharara, Randy Morris, Gianpiero D. Palermo, Zev Rosenwaks, Eric Seaman, Steve Hirshberg, John Cook, Cristina Cardona, G. Charles Ostermeier, and Alexander J. Travis. Cap-Score™ Prospectively Predicts Probability of Pregnancy, Molecular Reproduction and Development. To appear.
The RSA encryption setup begins by finding two large prime numbers. These numbers are kept secret, but their product is made public. We discuss below just how difficult it is to recover two large primes from knowing their product.
Suppose two people share one prime. That is, one person chooses primes p and q and the other chooses p and r, and q ≠ r. (This has happened .) Then you can easily find p. All three primes can be obtained in less than a millisecond as illustrated by the Python script below.
In a way it would be better to share both your primes with someone else than to share just one. If both your primes are the same as someone else’s, you could read each other’s messages, but a third party attacker could not. If you’re lucky, the person you share primes with doesn’t know that you share primes or isn’t interested in reading your mail. But if that person’s private key is ever disclosed, so is yours.
Nearly all the time required to run the script is devoted to finding the primes, so we time just the factorization.
from secrets import randbits from sympy import nextprime, gcd from timeit import default_timer as timer numbits = 2048 p = nextprime(randbits(numbits)) q = nextprime(randbits(numbits)) r = nextprime(randbits(numbits)) m = p*q n = p*r t0 = timer() g = gcd(m, n) assert(p == g) assert(q == m//p) assert(r == n//p) t1 = timer() # Print time in milliseconds print(1000*(t1 - t0))
There are a few noteworthy things about the Python code.
First, it uses a cryptographic random number generator, not the default generator, to create 2048-bit random numbers.
Second, it uses the portable
default_timer which chooses the appropriate timer on different operating systems. Note that it returns wall clock time, not CPU time.
Finally, it uses integer division
// to recover q and r. If you use the customary division operator Python will carry out integer division and attempt to cast the result to a float, resulting in the error message “OverflowError: integer division result too large for a float.”
If you wanted to find the greatest common divisor of two small numbers by hand, you’d probably start by factoring the two numbers. But as Euclid knew, you don’t have to factor two numbers before you can find the greatest common factor of the numbers. For large numbers, the Euclidean algorithm is orders of magnitude faster than factoring.
Nobody has announced being able to factor a 1024 bit number yet. The number m and n above have four times as many bits. The largest of the RSA numbers factored so far has 768 bits. This was done in 2009 and took approximately two CPU millennia, i.e. around 2000 CPU years.
The classic Euclidean algorithm takes O(n²) operations to find the greatest common divisor of two integers of length n. But there are modern sub-quadratic variations on the Euclidean algorithm that take
O(n log²(n) log(log(n)))
operations, which is much faster for very large numbers. I believe SymPy is using the classic Euclidean algorithm, but it could transparently switch to using the fast Euclidean algorithm for large inputs.
 As noted in the comments, this has happened due to faulty software implementation, but it shouldn’t happen by chance. By the prime number theorem, the number of n-bit primes is approximately 2n / (n log 2). If M people choose 2M primes each n bits long, the probability of two of these primes being the same is roughly
22-n M n log 2.
If M = 7.5 billion (the current world population) and n = 1024, the probability of two primes being the same is about 10-295. This roughly the probability of winning the Powerball jackpot 40 times in a row.
Following an idea to its logical conclusion might be extrapolating a model beyond its valid range.
Suppose you have a football field with area A. If you make two parallel sides twice as long, then the area will be 2A. If you double the length of the sides again, the area will be 4A. Following this reason to its logical conclusion, you could double the length of the sides as many times as you wish, say 15 times, and each time the area doubles.
Except that’s not true. By the time you’ve doubled the length of the sides 15 times, you have a shape so big that it is far from being a rectangle. The fact that Earth is round matters a lot for figure that big.
Euclidean geometry models our world really well for rectangles the size of a football field, or even rectangles the size of Kansas. But eventually it breaks down. If the top extends to the north pole, your rectangle becomes a spherical triangle.
The problem in this example isn’t logic; it’s geometry. If you double the length of the sides of a Euclidean rectangle 15 times, you do double the area 15 times. A football field is not exactly a Euclidean rectangle, though it’s close enough for all practical purposes. Even Kansas is a Euclidean rectangle for most practical purposes. But a figure on the surface of the earth with sides thousands of miles long is definitely not Euclidean.
Models are based on experience with data within some range. The surprising thing about Newtonian physics is not that it breaks down at a subatomic scale and at a cosmic scale. The surprising thing is that it is usually adequate for everything in between.
Most models do not scale up or down over anywhere near as many orders of magnitude as Euclidean geometry or Newtonian physics. If a dose-response curve, for example, is linear for based on observations in the range of 10 to 100 milligrams, nobody in his right mind would expect the curve to remain linear for doses up to a kilogram. It wouldn’t be surprising to find out that linearity breaks down before you get to 200 milligrams.
Kevin Kelly is one of the most optimistic people writing about technology, but there’s a nuance to his optimism that isn’t widely appreciated.
Kelly sees technological progress as steady and inevitable, but not monotone. He has often said that new technologies create almost as many problems as they solve. Maybe it’s 10 steps forward and 9 steps back, but as long as the net motion is forward, progress accumulates.
A naive form of technological optimism believes that everything is getting better every day in every way for everyone, and to suggest otherwise is heresy. Kevin Kelly has been called a “wild-eyed optimist,” but even he is not a dogmatic optimist on a micro scale.
Even if most people believe an innovation is an improvement by the criteria they care most about, that doesn’t mean it’s an improvement for every purpose. You’re not a Luddite just because you think Version 8 of some software package was better than Version 9 for some task.
Photo of Kevin Kelly by Christopher Michel [CC BY 2.0] via Wikipedia.
Recall the setup for RSA encryption given in the previous post.
φ is Euler’s totient function, defined here.
There’s a complication in the first step. Maybe you think the numbers p and q are prime, but they’re not. In that case the calculation of φ in step 2 is wrong.
The numbers p and q need to be “very large,” where the definition of what constitutes large changes over time due to Moore’s law and progress with factoring algorithms. Currently p and q would typically have at least 2048 bits each. It’s easy to find numbers that large that are probably prime, but it’s difficult to be certain.
At the moment, the fastest way to test for primes has a small chance making a false positive error, but no chance of a false negative. That is, if the test says a number is composite, it is certainly composite. But if the test says a number may be prime, there is a small chance that it is not prime. (Details here.) So in practice RSA starts by finding two large probable primes or pseudoprimes.
Discussions of RSA encryption often point out that large pseudoprimes are very rare, so it isn’t a problem that RSA starts with pseudoprimes. But what does that mean? Is there a one in a trillion chance that your private key won’t work and nobody can send you messages? Or that you can receive some messages and not others?
RSA encryption works by taking a message m and raising it to the exponent e modulo n where e and n are defined at the top of the post. To decrypt the message, you raise it to the exponent d modulo n where d is your private decryption key. Because d was computed to satisfy
ed = 1 (mod φ(n)),
Euler’s theorem says that we’ll get our message back. We’ll give an example below.
If p and q are prime, then φ(n) = (p – 1)(q – 1). But if we’re wrong in our assumption that one of these factors is prime, our calculation of φ(n) will be wrong. Will our encryption and decryption process work anyway? Maybe.
We’ll do three examples below, all using small numbers. We’ll use e = 257 as our public encryption exponent and m = 42 as the message to encrypt in all examples.
In the first example p and q are indeed prime, and everything works as it should. In the next two examples we will replace p with a pseudoprime. In the second example everything works despite our error, but in the third example decryption fails.
We’ll start with p = 337 and q = 283. Both are prime. The Python code below shows that d = 60833 and the encrypted message is 45431. Everything works as advertised.
Now we use p = 341 and q = 283. Here p is a pseudoprime for base 2, i.e.
2340 = 1 mod 341
and so 341 passes Fermat’s primality test  for b = 2. Now d = 10073 and the encrypted message is 94956. Decrypting the encrypted message works, even though p is not prime and our calculation for φ is wrong. In fact, the process works not just for our message m = 42 but for every message.
Here again we let p be the pseudoprime 341 but set q to the prime 389. Now d = 6673, the encrypted message is 7660, but decrypting the encrypted message returns 55669, not 42 as we started with. Decryption fails for all other messages as well.
If we use the correct value for φ(pq) the example works. That is, if we use
φ(341*389) = φ(11*31*389) = 10*30*388
rather than the incorrect value 340*388 the decryption process recovers our original message.
The examples above show that if we mistakenly think one of our numbers is a prime when it is only a pseudoprime, decryption might succeed or it might fail. In either case, we assume n = pq has two prime factors when in fact it has more, and so n is easier to factor than we thought.
Here’s the code for the examples above.
from sympy import gcd, mod_inverse message = 42 e = 257 def example(p, q): n = p*q phi = (p-1)*(q-1) assert( gcd(e, phi) == 1 ) d = mod_inverse(e, phi) assert( d*e % phi == 1 ) encrypted = pow(message, e, n) decrypted = pow(encrypted, d, n) return (message == decrypted) print(example(337, 283)) print(example(341, 283)) print(example(341, 389))
 Fermat’s primality test is explained here. In our example we only tried one base, b = 2. If we had also tried b = 3 we would have discovered that 341 is not prime. In practice one would try several different bases. Using the heuristic that failures are independent for each base, it’s very unlikely that a composite number would be a pseudoprime for each of, say, 5o different bases.
Imagine this conversation.
“Could you tell me your social security number?”
“Absolutely not! That’s private.”
“OK, how about just the last four digits?”
“Oh, OK. That’s fine.”
When I was in college, professors would post grades by the last four digits of student social security numbers. Now that seems incredibly naive, but no one objected at the time. Using these four digits rather than names would keep your grades private from the most lazy observer but not from anyone willing to put out a little effort.
There’s a widespread belief in the US that your social security number is a deep secret, and that telling someone your social security number gives them power over you akin to a fairy telling someone his true name. On the other hand, we also believe that telling someone just the last four digits of your SSN is harmless. Both are wrong. It’s not that hard to find someone’s full SSN, and revealing the last four digits gives someone a lot of information to use in identifying you.
In an earlier post I looked at how easily most people could be identified by the combination of birth date, sex, and zip code. We’ll use the analytical results from that post to look at how easily someone could be identified by their birthday, state, and the last four digits of their SSN . Note that the previous post used birth date, i.e. including year, where here we only look at birth day, i.e. month and day but no year. Note also that there’s nothing special about social security numbers for our purposes. The last four digits of your phone number would provide just as much information.
If you know someone lives in Wyoming, and you know their birthday and the last four digits of their SSN, you can uniquely identify them 85% of the time, and in an addition 7% of cases you can narrow down the possibilities to just two people. In Texas, by contrast, the chances of a birthday and four-digit ID being unique are 0.03%. The chances of narrowing the possibilities to two people are larger but still only 0.1%.
Here are results for a few states. Note that even though Texas has between two and three times the population of Ohio, it’s over 100x harder to uniquely identify someone with the information discussed here.
|-----------+------------+--------+--------| | State | Population | Unique | Pairs | |-----------+------------+--------+--------| | Texas | 30,000,000 | 0.03% | 0.11% | | Ohio | 12,000,000 | 3.73% | 6.14% | | Tennessee | 6,700,000 | 15.95% | 14.64% | | Wyoming | 600,000 | 84.84% | 6.97% | |-----------+------------+--------+--------|
 In that post we made the dubious simplifying assumption that birth dates were uniformly distributed from 0 to 78 years. This assumption is not accurate, but it was good enough to prove the point that it’s easier to identify people than you might think. Here our assumptions are better founded. Birthdays are nearly uniformly distributed, though there are some slight irregularities. The last four digits of social security numbers are uniformly distributed, though the first digits are correlated with the state.
The big idea of public key cryptography is that it lets you publish an encryption key e without compromising your decryption key d. A somewhat surprising detail of RSA public key cryptography is that in practice e is nearly always the same number, specifically e = 65537. We will review RSA, explain how this default e was chosen, and discuss why this may or may not be a problem.
Here’s the setup for RSA encryption in a nutshell.
The asymmetry in the encryption scheme comes from the fact that the person who knows p and q can compute n, φ(n), and d easily, but no one else can find d without factoring n. (Or so it seems.)
Here we focus on the third step above. In theory e could be very large, but very often e = 65537. In that case, we don’t actually choose e to be relatively prime to φ(n). Instead, we pick p and q, and hence n, and usually φ(n) will be relatively prime to 65537, but if it’s not, we choose p and q again until gcd(65537, φ(n)) = 1. Kraft and Washington  have this to say about the choice of e:
… surprisingly, e does not have to be large. Sometimes e = 3 has been used, although it is usually recommended that larger values be used. In practice, the most common value is e = 65537, The advantages are that 65537 is a relatively large prime, so it’s easier to arrange that gcd(e, φ(n)) = 1, and it is one more than a power of 2, so raising a number to the 65537th power consists mostly of squarings …
In short, the customary value of e was chosen for efficiency. Also, there aren’t many choices for similar values of e .
Heninger and Shacham  go into more detail.
The choice e = 65537 = 216 + 1 is especially widespread. Of the certificates observed in the UCSD TLS 2 Corpus (which was obtained by surveying frequently-used TLS servers), 99.5% had e = 65537, and all had e at most 32 bits.
So the “most common value” mentioned by Kraft and Washington appears to be used 99.5% of the time.
Is it a problem that e is very often the same number? At first glace no. If the number e is going to be public anyway, there doesn’t seem to be any harm in selecting it even before you choose p and q.
There may be a problem, however, that the value nearly everyone has chosen is fairly small. In particular,  gives an algorithm for recovering private RSA keys when some of the bits are know and the exponent e is small. How would some of the bits be known? If someone has physical access to a computer, they can recover some bits from it’s memory.
 James S. Kraft and Lawrence C. Washington. An Introduction to Number Theory with Cryptography. 2nd edition.
 Nadia Heninger and Hovav Shacham. Reconstructing RSA Private Keys from Random Key Bits.
 The property that makes e an efficient exponent is that it is a Fermat prime, i.e. it has the form 22m + 1 with m = 4. Raising a number to a Fermat number exponent takes n squarings and 1 multiplication. And being prime increases the changes that it will be relatively prime to φ(n). Unfortunately 65537 may be the largest Fermat prime. We know for certain it’s the largest Fermat prime that can be written in less than a billion digits. So if there are more Fermat primes, they’re too large to use.
However, there are smaller Fermat primes than 65537, and in fact the quote from  mentions the smallest, 3, has been used in practice. The remaining Fermat primes are 5, 17, and 257. Perhaps these have been used in practice as well, though that may not be a good idea.
FAS posted an article yesterday explaining how blurring military installations out of satellite photos points draws attention to them, showing exactly where they are and how big they are. The Russian mapping service Yandex Maps blurred out sensitive locations in Israel and Turkey. As the article says, this is an example of the Streisand effect, named after Barbara Streisand’s failed attempt to suppress photos of her Malibu home. Her efforts to protect her privacy caused the photos to go viral.
A similar issue arises in differential privacy. A practical implementation of differential privacy must rule out some queries a priori in some way that is not data-dependent. Otherwise it could be a violation of differential privacy either to answer or refuse to answer the same query. Short of refusing the query entirely, it could also be informative to reply “Are you sure you want to submit that query? It’s going to use up a lot of your privacy budget.”
Differential privacy adds random noise to query results, but in a more sophisticated way than Yandex blurring out portions of photos. Adding noise selectively reveals information about what the noise is trying to conceal. As the FAS article indicated, by blurring out only the sensitive areas, the blurred maps point out their exact location.
Selective security measures can have a similar effect. By placing greater protection on some information, you can mark that information as being particularly important. Once you realize this, otherwise nonsensical security measures start to make sense: applying uniform security results in excessive protection for some information, but keeps attackers from being able to identify the most valuable targets.
I recently got a review copy of Scientific Computing: A Historical Perspective by Bertil Gustafsson. I thought that thumbing through the book might give me ideas for new topics to blog about. It still may, but mostly it made me think of numerical methods I’ve already blogged about.
In historical order, or at least in the order of Gustafsson’s book:
The links above include numerical methods I have written about. What about the methods I have not written about?
I like to write fairly short, self-contained blog posts, and I’ve written about algorithms compatible with that. The methods in Gustafsson’s book that I haven’t written about are mostly methods in partial differential equations. I don’t see how to write short, self-contained posts about numerical methods for PDEs. In any case, my impression is that not many readers would be interested. If you are one of the few readers who would like to see more about PDEs, you may enjoy the following somewhat personal rambling about PDEs.
I studied PDEs in grad school—mostly abstract theory, but also numerical methods—and expected PDEs to be a big part of my career. I’ve done some professional work with differential equations, but the demand for other areas, particularly probability and statistics, has been far greater. In college I had the impression that applied math was practically synonymous with differential equations. I think that was closer to being true a generation or two ago than it is now.
My impression of the market demand for various kinds of math is no doubt colored by my experience. When I was at MD Anderson we had one person in biomathematics working in PDEs, and then none when she left. There may have been people at MDACC doing research into PDEs for modeling cancer, but they weren’t in the biostatistics or biomathematics departments. I know that people are working with differential equations in cancer research, but not at the world’s largest cancer center, and so I expect there aren’t many doing research in that area.
Since leaving MDACC to work as a consultant I’ve seen little demand for differential equations, more in Kalman filters than in differential equations per se. A lot of companies hire people to numerically solve PDEs, but there don’t seem to be many who want to use a consultant for such work. I imagine most companies with an interest in PDEs are large and have a staff of engineers and applied mathematicians working together. There’s more demand for statistical consulting because companies are likely to have an occasional need for statistics. The companies that need PDEs, say for making finite element models of oil reservoirs or airplanes, have an ongoing need and hire accordingly.
As mentioned in the previous post, Latanya Sweeney estimated that 87% of Americans can be identified by the combination of zip code, sex, and birth date. We’ll do a quick-and-dirty estimate and a simulation to show that this result is plausible. There’s no point being too realistic with a simulation because the actual data that Sweeney used is even more realistic. We’re just showing that her result is reasonable.
Suppose average life expectancy is around 78 years. If everyone is between 0 and 78, then there are 78*365 possible birth dates and twice that many combinations of birth date and sex.
What’s the average population of a US zip code? We’ll use 9,330 for reasons explained in .
We have 56,940 possible birth date and sex combinations for 9,330 people. There have to be many unused birth date and sex combinations in a typical zip code, and it’s plausible that many combinations will only be used once. We’ll run a Python simulation to see just how many we’d expect to be used one time.
demo below will keep track of the possible demographic values, i.e. combinations of birth date and sex. We’ll loop over the population of the zip code, randomly assigning everyone to a demographic value, then see what proportion of demographic values is only used once.
from random import randrange from numpy import zeros zctasize = 9330 demosize = 365*78*2 demo = zeros(demosize) for _ in range(zctasize): d = randrange(demosize) demo[d] += 1 unique = len(demo[demo == 1]) print(unique/zctasize)
I ran this simulation 10 times and got values ranging from 84.3% to 85.7%.
As Road White points out in the comments, you can estimate the number of unique demographics by a probability calculation.
Suppose there are z inhabitants in our zip code and d demographic categories. We’re assuming here (and above) that all demographic categories are equally likely, even though that’s not true of birth dates.
We start by looking at a particular demographic category. The probability that exactly one person falls in that category is
To find the expected number of demographic slots with exactly one entry we multiply by d, and to get the proportion of p of people this represents we divide by z.
which in our case is 84.8%, consistent with our simulation above.
Here we used the Taylor series approximation
If z is of the same magnitude as d or smaller, then the error in the approximation above is O(1/d).
You could also work this as a Poisson distribution, then condition on the probability of a slot being occupied.
By the way, a similar calculation shows that the expected number of demographic slots containing two people is r exp(-r)/2 where r = z/d. So while 85% can be uniquely identified, another 7% can be narrowed down to two possibilities.
 I don’t know, but I do know the average population of a “zip code tabulation area” or ZCTA, and that’s 9,330 according to the data linked to here. As I discuss in that post, the Census Bureau reports population by ZTCA, not by zip code per se, for several reasons. Zip codes can cross state boundaries, they are continually adjusted, and some zip codes contain only post office boxes.
In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous. The state of Massachusetts had released data on 135,000 state employees and their families with obvious identifiers removed. However, the data contained zip code, birth date, and sex for each individual. Sweeney was able to cross reference this data with publicly available voter registration data to find the medical records of then Massachusetts governor William Weld.
An estimated 87% of Americans can be identified by the combination of zip code, birth date, and sex. A back-of-the-envelope calculation shows that this should not be surprising, but Sweeney appears to be the first to do this calculation and pursue the results. (Update: See such a calculation in the next post.)
In her paper Only You, Your Doctor, and Many Others May Know Sweeney says that her research was unwelcome. Over 20 journals turned down her paper on the Weld study, and nobody wanted to fund privacy research that might reach uncomfortable conclusions.
A decade ago, funding sources refused to fund re-identification experiments unless there was a promise that results would likely show that no risk existed or that all problems could be solved by some promising new theoretical technology under development. Financial resources were unavailable to support rigorous scientific studies otherwise.
There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.
Photo of Latanya Sweeney via Parker Higgins [CC BY 4.0], from Wikimedia Commons