Simulating identification by zip code, sex, and birthdate

As mentioned in the previous post, Latanya Sweeney estimated that 87% of Americans can be identified by the combination of zip code, sex, and birth date. We’ll do a quick-and-dirty estimate and a simulation to show that this result is plausible. There’s no point being too realistic with a simulation because the actual data that Sweeney used is even more realistic. We’re just showing that her result is reasonable.

Quick estimate

Suppose average life expectancy is around 78 years. If everyone is between 0 and 78, then there are 78*365 possible birth dates and twice that many combinations of birth date and sex.

What’s the average population of a US zip code? We’ll use 9,330 for reasons explained in [1].

We have 56,940 possible birth date and sex combinations for 9,330 people. There have to be many unused birth date and sex combinations in a typical zip code, and it’s plausible that many combinations will only be used once. We’ll run a Python simulation to see just how many we’d expect to be used one time.
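The arithmetic behind this estimate is easy to check. The short sketch below uses only the figures quoted above (78 years, 365 days per year, two sexes, 9,330 people); the variable names are mine, and leap days are ignored as in the estimate itself.

```python
# Figures taken from the text; leap days are ignored, as in the estimate above.
years = 78
days_per_year = 365
sexes = 2
zcta_population = 9330

# Number of possible (birth date, sex) combinations.
combinations = years * days_per_year * sexes
# Average number of people per combination in a typical zip code.
people_per_combination = zcta_population / combinations

print(combinations)
print(round(people_per_combination, 3))
```

With far fewer people than combinations, most combinations are necessarily empty, which is why unique occupancy is plausible.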

Python simulation

The array demo below will keep track of the possible demographic values, i.e. combinations of birth date and sex. We’ll loop over the population of the zip code, randomly assigning everyone to a demographic value, then see what proportion of people have a demographic value that is used only once.

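    from random import randrange
    from numpy import zeros
    
    zctasize = 9330        # average population of a ZCTA
    demosize = 365*78*2    # birth date and sex combinations
    demo = zeros(demosize, dtype=int)  # number of people in each slot
    
    # Assign each person to a uniformly random demographic slot
    for _ in range(zctasize):
        d = randrange(demosize)
        demo[d] += 1
    
    # People who are alone in their slot, as a fraction of the population
    unique = len(demo[demo == 1])
    print(unique/zctasize)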

I ran this simulation 10 times and got values ranging from 84.3% to 85.7%.
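One of the comments below wonders whether numpy allows an essentially one-line version. Here is a minimal sketch of a vectorized variant, assuming the same uniform model; the function name `unique_fraction` and the choice of seeds are mine, not part of the original simulation.

```python
import numpy as np

def unique_fraction(zctasize=9330, demosize=365 * 78 * 2, seed=None):
    """Simulate one zip code and return the fraction of people who are
    the only one with their birth date and sex combination."""
    rng = np.random.default_rng(seed)
    # Assign each person a uniformly random demographic slot.
    slots = rng.integers(0, demosize, size=zctasize)
    # Count the occupants of every slot, including the empty ones.
    counts = np.bincount(slots, minlength=demosize)
    # Each slot with exactly one occupant is one uniquely identified person.
    return np.count_nonzero(counts == 1) / zctasize

# Repeat the experiment a few times to see the run-to-run variation.
runs = [unique_fraction(seed=s) for s in range(10)]
print(min(runs), max(runs), sum(runs) / len(runs))
```

Using `bincount` replaces the explicit loop, and fixing the seeds makes the runs reproducible.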

Analytic solution

As Road White points out in the comments, you can estimate the number of unique demographics by a probability calculation.

Suppose there are z inhabitants in our zip code and d demographic categories. We’re assuming here (and above) that all demographic categories are equally likely, even though that’s not true of birth dates.

We start by looking at a particular demographic category. The probability that exactly one person falls in that category is

\frac{z}{d}\left(1 - \frac{1}{d}\right)^{z-1}

To find the expected number of demographic slots with exactly one entry we multiply by d, and to get the proportion p of people this represents we divide by z, which gives p = (1 - 1/d)^{z-1}. Taking logs,

\log p = (z-1)\log\left(1 - \frac{1}{d}\right) \approx - \frac{z}{d}

and so

p \approx \exp(-z/d)

which in our case is 84.9%, consistent with our simulation above.

Here we used the Taylor series approximation

\log(1 + x) = x + {\cal O}(x^2)

If z is of the same magnitude as d or smaller, then the error in the approximation above is O(1/d).
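As a quick numerical check of the derivation, the sketch below compares the exact expression (1 - 1/d)^(z-1) with the approximation exp(-z/d) for the values used in this post, and prints 1/d alongside the difference. It assumes the same uniform model as above.

```python
from math import exp, log1p

z = 9330          # inhabitants of the zip code
d = 365 * 78 * 2  # demographic categories

# Exact expected proportion under the uniform model: (1 - 1/d)^(z-1).
# log1p keeps precision when 1/d is tiny.
p_exact = exp((z - 1) * log1p(-1 / d))
# Approximation from the text: exp(-z/d).
p_approx = exp(-z / d)

print(f"exact:  {p_exact:.5f}")
print(f"approx: {p_approx:.5f}")
print(f"difference: {abs(p_exact - p_approx):.2e}, 1/d: {1 / d:.2e}")
```

The printed values can be compared with the percentage quoted above and with the simulation results.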

You could also work this as a Poisson distribution, then condition on the probability of a slot being occupied.
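One way to sketch the Poisson route, under the same uniformity assumption: for large d, the number N of people in a given slot is approximately Poisson with mean r = z/d. The symbols N and r are introduced here for illustration.

```latex
P(N = k) \approx \frac{r^k e^{-r}}{k!}, \qquad r = \frac{z}{d}

E[\text{slots with exactly one person}] \approx d \, r e^{-r} = z e^{-r}

p \approx \frac{z e^{-r}}{z} = e^{-r} = \exp(-z/d)
```

Note that conditioning on a slot being occupied, P(N = 1 | N ≥ 1) = r e^{-r} / (1 - e^{-r}), gives the share of occupied slots that hold a single person, which is a slightly different quantity from the share p of people who are alone in their slot.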

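By the way, a similar calculation shows that the expected number of demographic slots containing exactly two people is approximately z r exp(-r)/2 where r = z/d. That is about 7% as many slots as there are people, and since each such slot holds two people, roughly 14% of people can be narrowed down to two possibilities, in addition to the 85% who can be uniquely identified.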
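The same reasoning extends to slots with any number k of occupants. The hypothetical helper `people_fraction` below is a sketch that computes, under the uniform binomial model, the expected fraction of people who share their slot with exactly k - 1 others. For k = 2, the number of such slots per inhabitant is half the value returned, since each slot holds two people.

```python
from math import comb

def people_fraction(k, z=9330, d=365 * 78 * 2):
    """Expected fraction of people in slots with exactly k occupants,
    assuming all d slots are equally likely (binomial model)."""
    # Probability that a given slot has exactly k of the z inhabitants.
    p_slot = comb(z, k) * (1 / d) ** k * (1 - 1 / d) ** (z - k)
    # d slots, each such slot holds k people, out of z people in total.
    return d * k * p_slot / z

for k in (1, 2, 3):
    print(k, round(people_fraction(k), 4))
```

Summed over all k, these fractions add up to 1, which is a useful sanity check on the calculation.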


[1] I don’t know, but I do know the average population of a “zip code tabulation area” or ZCTA, and that’s 9,330 according to the data linked to here. As I discuss in that post, the Census Bureau reports population by ZCTA, not by zip code per se, for several reasons. Zip codes can cross state boundaries, they are continually adjusted, and some zip codes contain only post office boxes.

8 thoughts on “Simulating identification by zip code, sex, and birthdate”

  1. Just for fun, the same program in R:

    zctasize = 9330
    demosize = 365*78*2
    sum(table(sample(demosize, zctasize, TRUE)) == 1) / zctasize

    I wouldn’t be surprised if there was a similar (essentially) one-line way to do it in numpy as well.

  2. A rough (not fully rigorous, but it’s quick) mathematical argument shows the same result:
    Using notation n for zctasize, N for demosize:

    probability of any individual combination having exactly one person assigned to it is n/N * (1 – 1/N)^(n-1) ~= n/N * (1 – (n/N)/(n-1))^(n-1) ~= n/N * exp(-n/N) (using Euler’s limit).

    Then, by linearity, the total number of uniquely identifiable demographic values is n*exp(-n/N), and so the proportion of people identified is exp(-n/N), and in this case n/N is about 0.164, giving about 84.9% of people being identified.

    Interestingly, this also implies (for large enough populations) that the proportion of people that can be identified only depends on the ratio of population size to possible demographic size – if you assume the usual independent distribution stuff, that is.

  3. Inspired by Nathan, but in Perl 6:

    my $zctasize = 9330;
    my $demosize = 365*78*2;
    say (1..$demosize).roll($zctasize).Bag.values.grep(* == 1) / $zctasize;

  4. Hi John.

    I was surprised that your simulation result (84-85%) is lower than the 87% number Sweeney came up with using actual census data. I would have expected that age clustering and unevenly sized zipcodes would have reduced this percentage, and that your simulation with uniform ages and equal zipcode populations would have acted as an upper bound.

    Looking at Sweeney’s paper (https://dataprivacylab.org/projects/identifiability/paper1.pdf) I notice that in section 4.3.1, she is using what seems like an odd definition for “Number of subjects uniquely identified in a subdivision of a geographical area”. Rather than a simulation like you ran, she seems to be using a hard threshold, and considering only whether the population of the age subclass in each zipcode exceeds the number of “pigeon holes” available.

    Am I reading her definitions right? If so, am I right that a simulation like yours would probably give a better estimation of the number of people who are uniquely identifiable in terms of being the only person with that zipcode, birthdate, sex combination? Do you have thoughts on how your estimate might change if you included these other factors?

  5. I wouldn’t expect the simulation results to match empirical results because, for one thing, birth dates are not uniformly distributed. But the results seem to come close nevertheless.

    Although this looks like a possible application of the pigeon hole principle, it’s really a probability problem. The pigeon hole principle can tell you that if d > z then at least d-z slots must be empty, but it can’t tell you how many slots are likely to have one element since likely is a matter of probability. I haven’t looked at Sweeney’s work in detail, but maybe she’s making an implicit probability assumption.

  6. Is it justified (even on a very crude approach) to consider the *mean* pop/zcta? Did you make a simulation to see what you get if that average value is a result from (as explicitly pointed out in [1]!) many zip codes having near zero population (post boxes: there are hundreds in most major city districts) and only very few zip codes having a very large population? (Typically there will be many more people for a given zip code: 9000 is the population of a small village! There may be 1000 and more living in a single building (20 stories with > 20 flats per floor) with many such buildings within a single zip code area.)

