First names and Bayes’ theorem

Is the woman in this photograph more likely to be named Esther or Caitlin?

black and white photo of an elderly woman

Yesterday Mark Jason Dominus published wrote about statistics on first names in the US from 1960 to 2021. For each year and state, the data tell how many boys and girls were given each name.

Reading the data “forward” you could ask, for example, how common was it for girls born in 1960 to be named Paula. Reading the data “backward” you could ask how likely it is that a woman named Paula was born in 1960.

We do this intuitively. When you hear a name like Blanche you may think of an elderly woman because the name is uncommon now (in the US) but was more common in the past. Sometimes we get a bimodal distribution. Olivia, for example, has made a comeback. If you hear that a female is named Olivia, she’s probably not middle aged. She’s more likely to be in school or retired than to be a soccer mom.

Bayes’ theorem tells us how to turn probabilities around. We could go through Mark’s data and compute the probabilities in reverse. We could quantify the probability that a woman named Paula was born in the 1960s, for example, by adding up the probabilities that she was born in 1960, 1961, …, 1969.

Bayes theorem says

P(age | name) = P(name | age) P(age) / P(name).

Here the vertical bar separates the thing we want the probability of from the thing we assume. P(age | name), for example, is the probability of a particular age, given a particular name.

There is no bar in the probability in the denominator above. P(name) is the overall probability of a particular name, regardless of age.

People very often get probabilities backward; they need Bayes theorem to turn them around. A particular case of this is the prosecutor’s fallacy. In a court of law, the probability of a bit of evidence given that someone is guilty is irrelevant. What matters is the probability that they are guilty given the evidence.

In a paternity case, we don’t need to know the probability of someone having a particular genetic marker given that a certain man is or is not their father. We want to know the probability that a certain man is the father, given that someone has a particular genetic marker. The former probability is not what we’re after, but it is useful information to stick into Bayes’ theorem.

Related posts

Photo by Todd Cravens on Unsplash.

Identifiable to man or machine?

Like the previous post, this post riffs on a photo [1] I stumbled on while looking for something else.

Would it be easier to identify the man in this photo or the man whose photo appeared in the previous post, copied below.

I think it would be easier for a human to recognize the person in the first image. But what about a computer?

We humans identify people most easily by their faces, and especially by their eyes. These features are easier to see in the first photo. But what might we find if we applied some image processing to the two photos? Maybe the green man’s facial features could be exposed by some diligent processing. We see more of the second man’s body. Maybe a computer algorithm could extract more information out of the second image for this reason.

Photographs may, and often do, contain Exif (Exchangeable image file format) metadata, such as the GPS coordinates of the camera at the time the photo was taken. A photo taken when the lens cap on by mistake might contain a good deal of information about the subject even though the photo per se is useless. This information can be useful to the photographer, but it could also pose a privacy risk. Before posting a photo publicly, you might want to strip out the metadata.

As I noted in the previous post, the privacy risk from data depends on context. Suppose the metadata embedded in a photo contains the serial number of the camera. That serial number would not help most people identify a photo(grapher), but it would someone who had access to a database linking serial numbers to the customers who purchased the cameras.

Was the first photo created by actually projecting the red and green lights onto the subject, or were these added in post production? For that matter, was there actually a man who posed for the photo or was the image synthetically generated? A forensic investigation of the photo might be able to answer these questions.

[1] Photo by Sebastian Mark on Unsplash

Privacy and tomography

I ran across the image below [1] when I was searching for something else, and it made me think of a few issues in data privacy.

green grid of light on young man

The green lights cut across the man’s body like tomographic imaging. No one of these green lines would be identifiable, but maybe the combination of the lines is.

We can tell it’s a young man in the image, someone in good shape. But is this image identifiable?

It’s identifiable to the photographer. It’s identifiable to the person in the photo.

If hundreds of men were photographed under similar circumstances, and we knew who those men were, we could probably determine which of them is in this particular photo.

The identifiability of the photo depends on context. The HIPAA Privacy Rule acknowledges that the risk of identification of individuals in a data set depends on who is receiving the data and what information the recipient could reasonably combine the data with. The NSA could probably tell who the person in the photo is; I cannot.

Making photographs like this available is not a good idea if you want to protect privacy, even if its debatable whether the person in the photo can be identified. Other people photographed under the same circumstances might be easier to identify.

Differential privacy takes a more conservative approach. Differential privacy advocates argue that deidentification procedures should not depend on what the recipient knows. We can never be certain what another person does or doesn’t know, or what they might come to know in the future.

Conventional approaches to privacy consider some data more identifiable than other data. Your phone number and your lab test results from a particular time and place are both unique, but the former is readily available public information and the latter is not. Differential privacy purists would say both kinds of data should be treated identically. A Bayesian approach might be to say we need to multiply each risk by a prior probability: the probability of an intruder looking up your phone number is substantially greater than the probability of the intruder accessing your medical records.

Some differential privacy practitioners make compromises, agreeing that not all data is equally sensitive. Purists decry this as irresponsible. But if the only alternatives are pure differential privacy and traditional approaches, many clients will choose the latter, even though an impure approach to differential privacy would have provided better privacy protection.

[1] Photo by Caspian Dahlström on Unsplash

Privacy implications of hashing data

Cryptographic hash functions are also known as one-way functions because given an input x, one can easily compute its hashed value f(x), but it is impractical to recover x from knowing f(x).

However, if we know that x comes from a small universe of possible values, our one-way function can effectively become a two-way function, i.e. it may be possible to start with f(x) and recover x using a rainbow table attack.

It’s possible to defend against a rainbow attack yet still leak information about the probability distribution of values of x given f(x).

For privacy protection, hashing is better than not hashing. Keyed hashing is better than unkeyed hashing. But even keyed hashing can leak information.

Rainbow tables

Suppose a data set contains SHA-256 hashed values of US Social Security Numbers (SSNs). Since SSNs have 10 digits, there are 10 billion possible SSNs. It would be possible to hash all possible SSNs and create a lookup table, known as a rainbow table. There are three things that make the rainbow table attack possible in this example:

  1. The range of possible inputs is known and relatively small.
  2. The hashing algorithm is known.
  3. The hashing algorithm can be computed quickly.

There are a couple ways to thwart a rainbow table attack. Assuming we have no control of (1) above, we can alter (2) or (3).

Keyed hashing

A way to alter (2) is to use a keyed hash algorithm. For example, if we XOR the SSNs with a key before applying SHA-256, an attacker cannot construct a rainbow table without knowing the key. The attacker may know the core hash algorithm we are using, but they do not know the entire algorithm because the key is part of the algorithm.

Expensive hashing

A way to alter (3) is to use hashing algorithms that are expensive to compute, such as Argon2. The idea is that such hash functions can be computed quickly enough for legitimate use cases, but not for brute-force attacks.

The time required to create a rainbow table is the product of the time to compute a single hash and the size of the set of possible inputs. If it took an hour to compute a hash value, but there are only 10 possible values, then an attacker could compute a rainbow table in 10 hours.

Leaking attribute probabilities

Now suppose we’re using a keyed hashing algorithm. Maybe we’re paranoid and use an expensive hashing algorithm on top of that. What can go wrong now?

If we’re hashing values that each appear one time then we’re OK. If we apply a keyed hash to primary keys in a database, such as patient ID, then the hashed ID does not reveal the patient ID. But if we hash attributes associated with that patient ID, things are different.

Frequency analysis

Suppose US state is an attribute in your database, and you hash this value. No matter how secure and how expensive your hash algorithm is, each state will have a unique hash value. If the database contains a geographically representative sample of the US population, then the hashed state value that appears most often is probably the most populous state, i.e. California. The second most common hashed state value probably corresponds to Texas.

Things are fuzzier on the low end. The hashed state value appearing least often may not correspond to Wyoming, but it very likely does not correspond to Florida, for example.

In short, you can infer the state values using the same kind of frequency analysis you’d use to solve a simple substitution cipher in a cryptogram puzzle. The larger the data set, the more closely the empirical order will likely align with the population order.

Maybe this is OK?

Frequency analysis makes it possible to infer (with some uncertainty) the most common values. But the most common values are also the least informative. Knowing someone is from California narrows down their identity less than knowing that they are from Wyoming.

Frequency analysis is less specific for less common values. It might not tell you with much confidence that someone is from Wyoming, but it might tell you with some confidence that they come from a low-population state. However, since there are several low-population states, knowing someone is from such a state, without knowing which particular state, isn’t so informative.

Data privacy depends on context. Knowing what state someone is likely from may or may not be a problem depending on what other information is available.

Related posts

“We won’t sell your personal data, but …”

When a company promises not to sell your personal data, this promise alone doesn’t mean much.

“We will not sell your personal data, but …

  • We might get hacked.
  • We might give it to a law enforcement or intelligence agency.
  • We might share or trade your data without technically selling it.
  • We might alter our terms. Pray we do not alter them any further.
  • We might be acquired by a new company that alters the terms.
  • We might go bankrupt and the data be sold as an asset.”

 

(This post started as a Twitter thread. Thanks to Michael Madden for his contribution to the thread.)

Sharing data without letting it go

Sharing data

Suppose two companies would like to share data, but they’d also each like to retain ownership of their own data. They’d like to enable querying as if each company had given the other all its data, without actually letting go of its data.

Maybe the two companies are competitors who want to collaborate for a particular project. Or maybe the companies each have data that they are not legally allowed to share with the other. Maybe one company is interested in buying (the data of) the other and would like to have some sort of preview of what they may be buying.

Differential privacy makes this possible, and can be useful even if privacy is not an issue. The two companies have data on inanimate widgets, not persons, and yet they have privacy-like concerns. They don’t want to hand over row-level data about their widgets, and yet they both want to be able to pose questions about the combined widget data. The situation is analogous to being concerned about the “privacy” of the widgets.

Both companies would deposit data with a trusted third party, and gain access to this data via an API that implements differential privacy. Such APIs let users pose queries but do not allow either side to request row-level data.

How is this possible? What if one party poses a query that unexpectedly turns out to be asking for row-level data? For example, maybe someone asks for the average net worth of customers in Alaska, assuming there are hundreds of such customers, but the data only contains one customer in Alaska. What was intended to be an aggregate query turns out to be a row-level query.

Differential privacy handles this sort of thing automatically. It adds an amount of random noise to each query in proportion to how sensitive the query is. If you ask for what amounts to data about an individual person (or individual widget) the system will add enough noise to the result to prevent revealing row-level information. (The system may also refuse to answer the query; this is done more often in practice than in theory.) But if you ask a question that reveals very little information about a single database row, the amount of noise added will be negligible.

The degree of collaboration can be limited up front by setting a privacy budget for each company. (Again, we may not necessarily be talking about the privacy of people. We may be looking at analogous protections on units of information about physical things, such as results of destructive testing of expensive hardware.)

Someone could estimate at the start of the collaboration how large the privacy budget would need to be to allow both companies to satisfy their objectives without being so large as to risk giving away intellectual property that the parties do not wish to exchange. This budget would be spent over the course of the project. When the parties exhaust their privacy budgets, they can discuss whether to allow each other more query budget.

This arrangement allows both parties the ability to ask questions of the combined data as if they had exchanged data. However, neither party has given up control of its data. They have given each other some level of information inferred from the combined data, but neither gives a single row of data to the other.

Related posts

Conspicuously missing data

I was working on a report for a client this afternoon when I remembered this comic from Spiked Math.

Waitress: Does everyone want a beer? Logician 1: I don't know. Logician 2: I don't know. Logician 3: Yes!

I needed to illustrate the point that revealing information about one person or group can reveal information on other people or other groups. If you give your genetic information to a company, for example, you also give that company (and every entity they share your data with) information about your relatives.

This comic doesn’t illustrate the point I had in mind, but it does illustrate a related point. The third logician didn’t reveal the preferences of the first two, though it looks like that at first. Actually, the first two implicitly reported their own preferences.

If the first logician did not want a beer, he or she could have said “No” to the question “Does everyone want a beer?” Answering this question with “I don’t know” is tantamount to answering the question “Do you want a beer?” with “Yes.” What appears to be a non-committal answer is a definite answer on closer examination.

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

A naive approach to removing sensitive data can be about as effective as bleeping out profanity: it’s not hard to infer what was removed.

Related posts

Why target ads at pregnant women

I’m listening to a podcast interviewing Neil Richards, the author of Why Privacy Matters. Richards makes a couple interesting points about the infamous example of Target figuring out which women were pregnant based on their purchase history.

First, pregnancy is a point at which women are open to trying new things. So if a company can get a woman to buy a baby stroller at their store, they may be able to get her to remain a customer for years to come. (Richards mentioned going off to college as another such milestone, so a barrage of advertising is aimed at first-year college students.)

Second, women understandably freaked-out over the targeted ads. So Target hid the ads in with irrelevant ads. They might show a woman ads for lawnmowers and baby wipes. That way the baby wipe ads didn’t seem so on-the-nose. The target audience would see the ad without feeling like they’re being targeted.

Just to be clear, I’m not writing this post to offer how-to advice for doing creepy advertising. The info here is presumably common knowledge in the advertising industry, but it’s not common knowledge for the public.

What’s “differential” about differential privacy?

Interest in differential privacy is growing rapidly. As evidence of this, here’s the result of a Google Ngram search [1] on “differential privacy.”

Graph rapidly rising from 2000 to 2019

When I first mentioned differential privacy to consulting leads a few years ago, not many had heard of it. Now most are familiar with the term, though they may not be clear on what it means.

The term “differential privacy” is kinda mysterious, particularly the “differential” part. Are you taking the derivative of some privacy function?!

In math and statistics, the adjective “differential” usually has something to do with calculus. Differential geometry applies calculus to geometry problems. Differential equations are equations involving the derivatives of the function you’re after.

But differential privacy is different. There is a connection to differential calculus, but it’s further up the etymological tree. Derivatives are limits of difference quotients, and that’s the connection. Differential privacy is about differences, but without any limits. In a nutshell, a system is differentially private if it hardly makes any difference whether your data is in the system or not.

The loose phrase “it hardly makes any difference” can be quantified in terms of probability distributions. You can find a brief explanation of how this works here.

Differential privacy is an elegant and defensible solution to the problem of protecting personal privacy while also preserving the usefulness of data in the aggregate. But it’s also challenging to understand and implement. If you’re interested in exploring differential privacy for your business, let’s talk.

More differential privacy posts

[1] You can see the search here. I’ve chopped off the vertical axis labels because they’re virtually meaningless. The proportion of occurrences of “differential privacy” among all two-word phrases is extremely small. But the rate of growth in the proportion is large.

Hashing phone numbers

A cryptographic hash is also known as a one-way function because given an input x, one can quickly compute the hash h(x), but it is extremely time-consuming to try to recover x if you only know h(x).

Even if the hashing algorithm is considered “broken,” it may take an enormous effort to break it. Google demonstrated that they could break a SHA-1 hash, but they used a GPU-century of compute power to do so. Attacks have become more efficient since then, but it still takes many orders of magnitude less work to compute a hash than to attempt to invert it. [1]

However, if you know that the hash value comes from a small set of possible inputs, brute force can discover which one. I wrote about this a couple years ago in the post Hashing names does not protect privacy. You could, for example, easily create a table of the hashes of all nine-digit social security numbers.

I often explain this to clients who have been told that hashed data is “encrypted.” This is subtle, because the data is encrypted, in a way, but not in the way they think.

A paper came out a few weeks ago that hashed 118 billion phone numbers.

The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers from [reference] (i.e., mobile phone numbers allowed by Google’s libphonenumber and the WhatsApp registration API) paired with their SHA-1 hashes.

The authors were able to use this data base to query

10% of US mobile phone numbers for WhatsApp and 100% for Signal. For Telegram we find that its API exposes a wide range of sensitive information, even about numbers not registered with the service.

Related posts

[1] Hash functions are not invertible, even in theory, in the sense of a unique x leading to a hash value h(x). Suppose you’re computing a 256-bit hash on files that are one kilobyte (8192 bits). If you’re mapping a space of 28192 possible files into a space of 2256 possible hash values, the mapping cannot be one-to-one. However, if you know the inputs are not random bits but German prose, and you find a file of German prose that has a matching hash value, you’ve almost certainly recovered the file that led to the hash value.