The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone’s age, birthday, and birth date. Combining zip code with birth date will demonstrate the plausibility of Latanya Sweeney’s famous result [1] that 87% of the US population can be identified based on zip code, sex, and birth date.
Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn’t matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common, or use actual demographic data. It only makes a difference in the 3rd decimal place.
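As a rough check of the 8.51 figure, here is a minimal sketch that computes the Shannon entropy of a birthday under the equally-likely assumption. The helper name `entropy_bits` and the leap-day model (one Feb 29 per 1461-day four-year cycle) are my own assumptions for illustration; actual demographic data is not used here.

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Excluding leap days: 365 equally likely birthdays.
no_leap = [1 / 365] * 365

# Including leap days (assumed model): over a four-year cycle of 1461 days,
# each ordinary date occurs 4 times and Feb 29 occurs once.
with_leap = [4 / 1461] * 365 + [1 / 1461]

h_no_leap = entropy_bits(no_leap)
h_with_leap = entropy_bits(with_leap)
print(f"without leap days: {h_no_leap:.3f} bits")
print(f"with leap days:    {h_with_leap:.3f} bits")
```

Both values round to 8.51, and they differ only in the third decimal place.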
I’ll be using the following age distribution data found on Wikipedia.
|-----------+------------|
| Age range | Population |
|-----------+------------|
| 0– 4      |   20201362 |
| 5– 9      |   20348657 |
| 10–14     |   20677194 |
| 15–19     |   22040343 |
| 20–24     |   21585999 |
| 25–29     |   21101849 |
| 30–34     |   19962099 |
| 35–39     |   20179642 |
| 40–44     |   20890964 |
| 45–49     |   22708591 |
| 50–54     |   22298125 |
| 55–59     |   19664805 |
| 60–64     |   16817924 |
| 65–69     |   12435263 |
| 70–74     |    9278166 |
| 75–79     |    7317795 |
| 80–84     |    5743327 |
| 85+       |    5493433 |
|-----------+------------|
To get data for each particular age, I’ll assume ages are evenly distributed in each group, and I’ll assume the 85+ group consists of people from ages 85 to 92 [2].
With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 2^6 = 64.
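The age calculation can be sketched as follows, using the table above and the two stated assumptions (ages spread evenly within each group, and the 85+ group covering ages 85 through 92). The variable names are mine, and this is one way to carry out the computation rather than necessarily the code behind the figure in the text.

```python
from math import log2

# (first age, last age, population) for each row of the table.
# The 85+ group is capped at age 92, as assumed in the text.
groups = [
    (0, 4, 20201362), (5, 9, 20348657), (10, 14, 20677194),
    (15, 19, 22040343), (20, 24, 21585999), (25, 29, 21101849),
    (30, 34, 19962099), (35, 39, 20179642), (40, 44, 20890964),
    (45, 49, 22708591), (50, 54, 22298125), (55, 59, 19664805),
    (60, 64, 16817924), (65, 69, 12435263), (70, 74, 9278166),
    (75, 79, 7317795), (80, 84, 5743327), (85, 92, 5493433),
]

total = sum(pop for _, _, pop in groups)

# Spread each group's population evenly over its single years of age.
probs = []
for first, last, pop in groups:
    width = last - first + 1
    probs.extend([pop / width / total] * width)

# Shannon entropy of the resulting single-year age distribution.
age_bits = -sum(p * log2(p) for p in probs)
print(f"{age_bits:.1f} bits of information in age")
```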
If we assume birthdays are uniformly distributed within each age, then age and birthday are independent. The information contained in the birth date would be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.
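The reason the bits simply add is that entropy is additive for independent variables. Here is a small sketch of that fact using two toy distributions; the numbers in `x` and `y` are made up purely for illustration and have nothing to do with the census data.

```python
from itertools import product
from math import log2

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Toy marginal distributions (hypothetical values, for illustration only).
x = [0.5, 0.3, 0.2]
y = [0.25, 0.25, 0.25, 0.25]

# Under independence the joint probability factors: P(x, y) = P(x) * P(y).
joint = [px * py for px, py in product(x, y)]

h_x = entropy_bits(x)
h_y = entropy_bits(y)
h_joint = entropy_bits(joint)
print(f"H(X) + H(Y) = {h_x + h_y:.6f}")
print(f"H(X, Y)     = {h_joint:.6f}")
```

The two printed values agree up to floating point error. Without independence, the joint entropy would be smaller than the sum.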
Zip code, sex, and age
The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birth date would give a total of 29.7 bits. Since the US population is between 2^28 and 2^29, it’s plausible that we’d have enough information to identify everyone.
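To make the comparison concrete, here is a short sketch that adds up the bits and compares the total with the number of bits needed to give every person a distinct label. The population figure is the sum of the age table earlier in this post; the variable names are mine.

```python
from math import log2

zip_bits = 13.8         # from the previous post
sex_bits = 1.0          # roughly equal numbers of men and women
birth_date_bits = 14.9  # birthday (8.5) + age (6.4)

total_bits = zip_bits + sex_bits + birth_date_bits

# Sum of the populations in the age table above.
population = 308_745_538
bits_needed = log2(population)

print(f"bits available: {total_bits:.1f}")
print(f"bits needed:    {bits_needed:.1f}")
```

Having more bits available than log2 of the population makes unique identification plausible, but it does not guarantee it, since people are not spread evenly over all the possible combinations.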
We’ve made a number of simplifying assumptions. We were a little fast and loose with age data, and we’ve assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date [1]. Her study was based on 1990 census data, and at that time the US population was a little less than 2^28.
More privacy posts
- Randomized response and Bayes’ theorem
- Handedness, blood type, and introversion
- Toxic pairs and re-identification
- Data privacy consulting
[1] Latanya Sweeney. “Simple Demographics Often Identify People Uniquely”. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh 2000. Available here.
[2] Bob Wells and Mel Tormé. “The Christmas Song.” Commonly known as “Chestnuts Roasting on an Open Fire.”
3 thoughts on “Bits of information in age, birthday, and birthdate”
These facts have a dark side in electoral politics. Under the guise of voting fraud prevention, some lawmakers are agitating for cross-checks on voter records, flagging possible duplicates. It is reasonable to assume that there will be unresolved false positives (i.e. accidental name collisions between two people who are actually distinct) in any such process, so people will be erroneously denied voter registration. This is enough reason, on its own, to be wary of the cross-checks.
But it gets worse: the prevalence of name collisions is not independent of race. The chance that two African-Americans have the same last name is much higher than the chance in the general population. So cross-checks tend to disenfranchise African-Americans. Which is kind of the point if you happen to be a partisan politician who not-so-secretly wants to disenfranchise a population that doesn’t usually vote your way.
Read the paper and found it very sobering. Then I looked at the public databases provided by my local governments (city, county, state), and finally accepted that any hope of privacy or anonymity evaporated decades ago.
But why am I not bombarded by countless targeted campaigns? What I’m aware of is minor, hardly even a nuisance. Which to me means the law is a deterrent, at least to visible entities.
Which means it is the less visible entities I need to learn, and worry, about. No clue where to start on that.
Start by giving up on the myths of security and privacy. DoD orange book was clear about how networks cannot be secured. Eventually, someone will crack it.