If you know someone’s US zip code, how much do you know about them? We can use entropy to measure the amount of information in bits.
To quantify the amount of information in a zip code, we need to know how many zip codes there are, and how evenly people are divided into zip codes.
There are about 43,000 zip codes in the US. The number fluctuates over time due to small adjustments.
Average information is maximized by dividing people into categories as evenly as possible. Maximum information about one person is maximized by dividing people into categories as unevenly as possible. To see this, suppose there were only two zip codes. The information we’d expect to learn from a zip code would be maximized if we divided people into two equal groups. Suppose on the other hand you were in one zip code and everyone else in the other. On average, zip code would reveal very little about someone, though it would reveal a lot about you!
If everyone were divided evenly into one of 43,000 zip codes, the amount of information revealed by knowing someone’s zip code would be about 15.4 bits, a little more information than asking 15 independent yes/no questions, each with equally likely answers.
But zip codes are not evenly populated. How much information is there in an actual five-digit zip code? To answer that question we need to know the population of each zip code. That’s a little tricky. Zip codes represent mail delivery points, not geographical areas. Technically the US Census Bureau tracks population by ZCTA (Zip Code Tabulation Area) rather than zip code per se. Population by ZCTA is freely available, but difficult to find. I gave up trying to find the data from official sources but was able to find it here.
We can go through the data and find the probability p of someone living in each ZCTA and add up –p log2 p over each area. When we do, we find that a ZTCA contains 13.83 bits of information. (We knew it had to be less than 15.4 because uneven distribution reduces entropy.)
The Safe Harbor provision of US HIPAA law lists zip codes as a quasi-identifier. Five digit zip codes do not fall under Safe Harbor. Three digit zip codes (the first three digits of five digit zip codes) do fall under Safe Harbor, mostly. Some areas are so sparsely populated that even three-digit zip code areas are considered too informative. Any three-digit zip code with fewer than 20,000 people is excluded. You can find a snapshot of the list here, though the list may change over time.
If we repeat our calculation for three-digit zip codes, we find that they carry about 9.17 bits of information. It makes little difference to the result whether you include sparse regions, exclude them, or lump them all into one region.
See the next post on information contained in age, birthday, and birth date.