Sparsely populated zip codes

The dormitory I lived in as an undergraduate had its own five-digit zip code at one time. It was rumored to be the largest dorm in the US, or maybe the largest west of the Mississippi, or something like that. There were about 3,000 of us living there. Although the dorm had enough people to justify its own zip code (some zip codes have far fewer people), zip code boundaries were redrawn so that the dorm shares its zip code with other areas.

Some zip codes are so sparsely populated that people living in these areas are relatively easy to identify if you have other data. The so-called Safe Harbor provision of HIPAA (Health Insurance Portability and Accountability Act) says that it’s usually OK to include the first three digits of someone’s zip code in de-identified data. But there are 17 areas so thinly populated that even listing the first three digits of their zip code is considered too much of an identification risk. These are areas such that the first three digits of the zip code are:

  • 036
  • 059
  • 063
  • 102
  • 203
  • 556
  • 692
  • 790
  • 821
  • 823
  • 830
  • 831
  • 878
  • 879
  • 884
  • 890
  • 893

This list could change over time. These are the regions that currently contain 20,000 or fewer people, the criterion given in the HIPAA regulations.
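As a rough illustration of how the list above might be applied, here is a minimal Python sketch. It keeps the first three digits of a zip code unless the prefix is on the list, in which case it returns 000, which is what the Safe Harbor rule prescribes for sparsely populated prefixes. The names `SPARSE_ZIP3` and `safe_harbor_zip3` are hypothetical, and the set of prefixes is simply copied from this post, so it would need updating if the list changes.

```python
# Three-digit prefixes copied from the list in this post.
# Assumption: this snapshot is current; the list may change over time.
SPARSE_ZIP3 = frozenset({
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
})


def safe_harbor_zip3(zip_code: str) -> str:
    """Return the three-digit prefix to keep in de-identified data.

    Hypothetical helper: prefixes in SPARSE_ZIP3 are replaced with "000".
    Accepts five-digit zip codes and ZIP+4 strings such as "77002-1234".
    """
    digits = zip_code.strip()
    if len(digits) < 5 or not digits[:5].isdigit():
        raise ValueError(f"not a five-digit zip code: {zip_code!r}")
    prefix = digits[:3]
    return "000" if prefix in SPARSE_ZIP3 else prefix
```

For example, `safe_harbor_zip3("77002")` returns `"770"`, while `safe_harbor_zip3("03601")` returns `"000"`. Zip codes are handled as strings rather than integers so that leading zeros survive.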

Knowing that someone is part of an area containing 20,000 people hardly identifies them. The concern is that in combination with other information, zip code data is more informative in these areas.

Related post: Bayesian clinical trials in one zip code

One thought on “Sparsely populated zip codes”

  1. Dimitriy Masterov

    You can even have zip codes with a single fictional inhabitant: Smokey Bear has his own ZIP, 20252.

    Do you have strong feelings about using ZIP codes (or some approximation like ZCTAs) for analysis over things like block groups? I’ve thought that looking at ZIP codes is not always a very illuminating exercise, but one I see people use all the time. ZIP codes are not geographic areas, but are simply arrays of street addresses or carrier routes, modified at will by USPS for the purpose of routing mail as efficiently as possible. That means that this way of clustering people is an arbitrary method of aggregation that may obscure valuable signal, since these clusters are based on factors the PO finds convenient for delivery, and not any kind of underlying similarity. ZIP codes also change fairly frequently, so year-on-year comparisons can be compromised because it is not an apples-to-apples comparison. It might also be the case that bundling people who respond heterogeneously decreases the statistical power of our tests. Rural maps can have such a sparse network of roads with such strange zip code assignments that some rural areas cannot even be approximated with zip code regions. Finally, ZIPs don’t have characteristics such as population, so it is almost always an approximation to say that sales per capita in some ZIP code are low since the denominator is mis-measured. The trouble is that while census blocks use streets as edge boundaries, postal delivery routes generally service both sides of a single street. Therefore, census blocks near the edge are commonly split between ZIP codes.
