Stochastic rounding and privacy

Suppose ages in some database are reported in decades: 0, 10, 20, etc. You need to add a 27 year old woman to the data set. How do you record her age? A reasonable approach would to round-to-nearest. In this case, 27 would be rounded up to 30.

Another approach would be stochastic rounding. In our example, we would round this woman’s age up to 30 with 70% probability and round it down to 20 with 30% probability. The recorded value is a random variable whose expected value is exactly 27.

Suppose we were to add a large number of 27 year olds to the database. With round-to-nearest, the average value would be 30 because all the values are 30. With stochastic rounding, about 30% of the ages would be recorded as 20 and about 70% would be recorded as 30. The average would likely be close to 27.

Next, suppose we add people to the database of varying ages. Stochastic rounding would record every person’s age using a random variable whose expected value is their age. If someone’s age is a d+x where d is a decade, i.e. a multiple of 10, and 0 < x < 10, then we would record their age as d with probability 1-x/10 and d+10 with probability x/10. There would be no bias in the reported age.

Round-to-nearest will be biased unless ages are uniformly distributed in each decade. Suppose, for example, our data is on undergraduate students. We would expect a lot more students in their early twenties than in their late twenties.

Now let’s turn things around. Instead of looking at recorded age given actual age, let’s look at actual age given recorded age. Suppose someone’s age is recorded as 30. What does that tell you about them?

With round-to-nearest, it tells you that they certainly are between 25 and 35. With stochastic rounding, they could be anywhere between 20 and 40. The probability distribution on this interval could be computed from Bayes’ theorem, depending on the prior distribution of ages on this interval. That is, if you know in general how ages are distributed over the interval (20, 40), you could use Bayes’ theorem to compute the posterior distribution on age, given that age was recorded as 30.

Stochastic rounding preserves more information than round-to-nearest on average, but less information in the case of a particular individual.

Related posts

Computed IDs and privacy implications

Thirty years ago, a lot of US states thought it would be a good idea to compute someone’s drivers license number (DLN) from their personal information [1]. In 1991, fifteen states simply used your Social Security Number as your DLN. Eleven other states computed DLNs by applying a hash function to personal information such as name, birth date, and sex. A few other states based DLNs in part but not entirely on personal information.

Presumably things have changed a lot since then. If you know of any states that still do this, please let me know in the comments. Even if states have stopped computing DLNs from personal data, I’m sure many organizations still compute IDs this way.

The article I stumbled on from 1991 gave no hint perhaps encoding personal information into an ID number could be a problem. And at the time it wasn’t as much of a problem as it would be now.

Why is it a problem if IDs are computed from personal data? People don’t realize what information they’re giving away. Maybe they would be willing to give someone their personal information, but not their DLN, or vice versa, not realizing that the two are equivalent. They also don’t realize what information about them someone may already have; a little bit more info may be all an attacker needs. And they don’t realize the potential consequences of their loss of privacy.

In some cases the hashing functions were complicated, but not too complicated to carry out by hand. And even if states were applying a cryptographic hash function, which they certainly were not, this would still be a problem for reasons explained here. If you have a database of personal information, say from voter registration records, you could compute the hash value of everyone in the state, or at least a large enough portion that you stand a good chance of being able to reverse a hashed value.

Related posts

[1] Joseph A. Gallian. Assigning Driver’s License Numbers. Mathematics Magazine, Vol. 64, No. 1 (Feb., 1991), pp. 13-22.

What is a privacy budget?

The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Some metaphors are dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can chose how you spend it.

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. The former simplifies privacy budget accounting due to simple composition rules, and makes privacy budgets stretch further by leaning away from worst-case analysis a bit and leaning toward average-case analysis. RDP depends on a tuning parameter that includes ε-differential privacy, so one can control how much RDP acts like ε-differential privacy by adjusting that parameter.

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions like, and get sufficiently accurate answers, without worrying about privacy budget.

Related posts

Amendment to CCPA regarding personal information

California’s new privacy law takes effect January 1, 2020, less than 100 days from now. The bill was written in a hurry in order to prevent a similar measuring from appearing on a ballot initiative. The thought was that the state legislature would pass something quickly then clean it up later with amendments.

Six amendments were passed recently, and the deadline for further amendments has passed. California governor Gavin Newsom has until October 13 to either sign or veto each of the amendments.

This post will look at just one of the six amendments, AB-874, and what it means for personal information. The text of the amendment repeats the text of the original law, so I ran a diff tool on the two in order to see what changed.

Update: Governor Newsom signed amendment AB-874 into law on October 11, 2019.

In a couple instances, capable was changed to reasonably capable.

“Personal information” means information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household. Personal information includes, but is not limited to, the following if it identifies, relates to, describes, is reasonably capable of being associated with, or could be reasonably linked, directly or indirectly, with a particular consumer or household …

“Capable” is awfully broad. Almost anything is capable of being associated with a particular consumer or household, so adding reasonable was reasonable. You see something similar in the HIPAA privacy rule when it speaks of “reasonably available information.”

The amendment also removed a clause that was ungrammatical and nonsensical as far as I can tell:

… “publicly available” means information that is lawfully made available from federal, state, or local government records, if any conditions associated with such information.

The following sentence from the CCPA was also removed in the amendment:

Information is not “publicly available” if that data is used for a purpose that is not compatible with the purpose for which the data is maintained and made available in the government records or for which it is publicly maintained.

I suppose the idea behind removing this line was that data is either publicly available or it’s not. Once information is publicly available, it’s kinda hard to ask people to act as if it’s not publicly available for some uses.

The final change appears to be correcting a mistake:

Publicly available Personal information” does not include consumer information that is deidentified or aggregate consumer information.

It makes no sense to say public information does not include deidentified information. You might deidentify data precisely because you want to make it public. I believe the author of this line of the CCPA meant to say what the amendment says, that deidentified and aggregate information are not considered personal.


As I have pointed out elsewhere, I am not a lawyer. Nor am I a lepidopterist, auto mechanic, or cosmetologist. Nothing here should be considered legal advice. Nor should it be considered advice on butterflies, cars, or hair care.

Related posts

Right to be forgotten in the news

erased people

The GDPR‘s right-to-be-forgotten has been in the news this week. This post will look at a couple news stories and how they relate.

Forgetting about a stabbing

On Monday the New York Times ran a story about an Italian news site that folded as a result of resisting requests to hide a story about a stabbing.

In 2008, Umberto Pecoraro stabbed his brother Vittorio in a restaurant with a fish knife. The victim, Vittorio, said that the news story violated his privacy and demanded that it be taken down, citing the right-to-be-forgotten clause in the GDPR. The journalist, Alessandro Biancardi, argued that the public’s right to know outweighed the right to be forgotten, but he lost the argument and lost his business.

The Streisand effect

This story is an example of the Streisand effect, making something more public by trying to keep it private. People around the world now about the Pecoraro brothers’ fight only because one of them fought to suppress the story. I’d never know about local news from Pescara, Italy if the NYT hadn’t brought it to my attention [1].

Extending EU law beyond the EU

Will Vittorio Pecoraro now ask the NYT to take down their story? Will he ask me to take down this blog post? Probably not in light of a story in the Los Angeles Times yesterday [2].

France’s privacy regulator had argued that the EU’s right-to-be-forgotten extends outside the EU, but the European Court of Justice ruled otherwise, sorta.

According to the LA Times story, the court ruled that Google did not have to censor search results to comply with the right to be forgotten, but it did need to “put measures in place to discourage internet users from going outside the EU to find the missing information.”

I’ve written about a right to be forgotten before and suggested that it is impractical if not logically impossible. It also seems difficult to have a world wide web subject to local laws. How can the EU both allow its citizens to access the open web and also hide things it doesn’t want them to see? That seems like an unstable equilibrium, with the two stable equilibria being an open web and a Chinese-style closed web.

Related posts

[1] I also wouldn’t write about it. Although I think it’s impractical to legally require news sites to take down articles, I also think it’s often the decent thing to do. Someone’s life can be seriously harmed by easily accessible records of things they did long ago and regret. I would not stir up the Pecoraro story on my own. But since the NYT has far greater circulation than my blog, the story is already public.

[2] There’s an interesting meta-news angle here. If it’s illegal for Alessandro Biancardi to keep his story about the Pecoraro brothers online, would it be legal for someone inside the EU to write a story about the censoring of the Pecoraro brothers story like the NYT did?

You could argue that the Pecoraro brother’s story should be taken down because it’s old news. But the battle over whether the story should be taken down is new news. Maybe Vittorio Pecoraro’s privacy outweighs the public’s right to know about the knife fight, but does his privacy outweigh the public’s right to know about news stories being taken down?


Three-digit zip codes and data privacy

Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here.

So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last two digits of a zip code. And even though three-digit zip codes are larger than five-digit zip codes on average, some three-digit zip codes are still sparsely populated.

But if you use three-digit zip codes, and cut out sparsely populated zip3s, then you’re OK, right?

Well, there’s still a problem if you also report state. Ordinarily a zip3 fits within one state, but not always.

Five digit zip codes are each entirely contained within a state as far as I know. But three-digit zip codes can straddle state lines. For example, about 200,000 people live in the three-digit zip code 834. The vast majority of these are in Idaho, but about 500 live in zip code 83414 which is in Wyoming. Zip code 834 is not sparsely populated, and doesn’t give much information about an individual by itself. But it is conditionally sparsely populated. It does carry a lot of information about an individual if that person lives in Wyoming.

On average, a three-digit zip code covers about 350,000 people. And so most of the time, the combination of zip3 and state covers 350,000 people. But in the example above, the combination of zip3 and state might narrow down to 500 people. In a group that small, birthday (just day of the year, not the full date) is enough to uniquely identify around 25% of the population. [1]

Related posts

[1] The 25% figure came from exp(-500/365). See this post for details.

Hashing names does not protect privacy

Secure hash functions are practically impossible to reverse, but only if the input is unrestricted.

If you generate 256 random bits and apply a secure 256-bit hash algorithm, an attacker wanting to recover your input can’t do much better than brute force, hashing 256-bit strings hoping to find one that matches your hash value. Even then, the attacker may have found a collision, another string of bits that happens to have the same hash value.

But you know the input comes from a restricted set, a set small enough to search by brute force, then hash functions are reversible. If I know that you’ve either hashed “yes” or “no,” then I can apply the hash function to both and see which one it was.

Hashing PII

Suppose someone has attempted to anonymize a data set by hashing personally identifying information (PII) such as name, phone number, etc. These inputs come from a small enough space that a brute force search is easy.

For instance, suppose someone has applied a cryptographic hash to first names. Then all an attacker needs to do is find a list of common names, hash them all, and see which hash values match. I searched for a list of names and found this, a list of the 1000 most popular baby girl and boy names in California in 2017.

The data set was compiled based on 473,441 births. Of those births, 366,039 had one of the 2,000 names. That is, 77% of the babies had one of the 1,000 most common names for their sex.

I wrote a little script to read in all the names and compute a SHA256 hash. The program took a fraction of a second to run. With the output of this program, I can’t re-identify every first name in a data set, but I could re-identify 77% of them, assuming my list of names is representative [1].

Now, for example, if I see the hash value


in the data set, I can look this value up in my list of 2000 hash values and see that it is the hashed value of ACHILLES [2].

If you saw the hash value above and had no idea where it came from—it could be the hash of a JPEG image file for all you know—it would be hopeless to try to figure out what produced it. But if you suspect it’s the hash of a first name, it’s trivial to reverse.

Hashing SSNs

Hashing numbers is simpler but more computationally intense. I had to do a little research to find a list of names, but I know that social security numbers are simply 9-digit numbers. There are only a billion possible nine-digit numbers, so it’s feasible to hash them all to make a look-up table.

I tried this using a little Python script on an old laptop. In 20 seconds I was able to create a file with the hash values of a million SSNs, so presumably I could have finished the job in about five and a half hours. Of course it would take even less time with more efficient code or a more powerful computer.

Using a key

One way to improve the situation would be to use a key, a random string of bits that you combine with values before hashing. An attacker not knowing your secret key value could not do something as simple as what was described above.

However, in a large data set, such as one from a data breach, an attacker could apply frequency analysis to get some idea how hash values correspond to names. Hash values that show up most frequently in the data probably correspond to popular names etc. This isn’t definitive, but it is useful information. You might be able to tell, for example, that someone has a common first name and a rare last name. This could help narrow down the possibilities for identifying someone.

Adding salt

Several people have commented that the problem goes away if you use a unique salt value for each record. In some ways this is true, but in others it is not.

If you do use a unique salt per record, and save the salt with the data, then you can still do a brute force attack. The execution time is now quadratic rather than linear, but still feasible.

If you throw the salt away, then you’ve effectively replaced each identifier with a random string. Then you’ve effectively removed these columns since they’re filled with useless random noise. Why not just delete them?

You could use a unique salt per user rather than per record. Then you couldn’t identify a given record, but you could tell when two records belong to the same person. But in this case, why not just assign random IDs to each user? Use the salt itself as the ID. No need to use a hash function.

Custom hash functions

Note that the argument above for keys applies to using a custom hashing algorithm, either something you write from scratch or by combining rounds of established methods.

Kirchoff’s principle advises against relying on security-by-obscurity. It says that you should assume that your algorithm will become known, which often turns out to be the case. But no matter: any hashing scheme that always maps the same input to the same output is vulnerable to frequency analysis.

Related posts

[1] Name frequencies change over time and space, but I imagine the data from California in 2017 would be adequate to identify most Americans of any age and in any state. Certainly this would not meet the HIPAA standard that “the risk is very small that the information could be used … to identify an individual.”

[2] The California baby name data set standardized all names to capital letters and removed diacritical marks. With a little extra effort, one could hash variations such as JOSE, JOSÉ, Jose, and José.



What does CCPA say about de-identified data?

The California Consumer Privacy Act, or CCPA, takes effect January 1, 2020, less than six months from now. What does the act say about using deidentified data?

First of all, I am not a lawyer; I work for lawyers, advising them on matters where law touches statistics. This post is not legal advice, but my attempt to parse the CCPA, ignoring details best left up to others.

In my opinion, the CCPA is more vague than HIPAA, but not as vague as GDPR. It contains some clear language about using deidentified data, but that language is scattered throughout the act.

Deidentified data

Where to start? Section 1798.145 says

The obligations imposed on businesses by this title shall not restrict a business’s ability to … collect, use, retain, sell, or disclose consumer information that is deidentified or in the aggregate consumer information.

The act discusses identifiers and more importantly probabilistic identifiers, a topic I wrote about earlier. This term is potentially very broad. See my earlier post for a discussion.

Aggregate consumer information

So what is “aggregate consumer information”? Section 1798.140 says that

For purposes of this title: (a) “Aggregate consumer information” means information that relates to a group or category of consumers, from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device. “Aggregate consumer information” does not mean one or more individual consumer records that have been de­identified.

So aggregate consumer information is different from deidentified information.


Later on (subsection (s) of the same section) the act says

Research with personal information … shall be … (2) Subsequently pseudonymized and deidentified, or deidentified and in the aggregate, such that the information cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer.

What? We’ve seen “in the aggregate” before, and presumably that has something to do with the use of “in the aggregate” above. But what does pseudonymized mean? Backing up to subsection (r) we have

Pseudonymize” or “Pseudonymization” means the processing of personal information in a manner that renders the personal information no longer attributable to a specific consumer without the use of additional information, provided that the additional information is kept separately and is subject to technical and organizational measures to ensure that the personal information is not attributed to an identified or identifiable consumer.

That sounds a lot like deidentification to me, but slightly weaker. It implies that a company can retain the ability to re-identify an individual, as long as the means of doing so is “is kept separately and is subject to technical and organizational measures.” I’m speculating here, but it seems like that might mean, for example, that a restricted part of a company might apply a secure hash function to data, and another part of the company sees the results and analyzes the data. Then again, the law says “pseudonymized and deidentified,” so who knows what that means.

The CCPA was written and passed in a hurry with the expectation of being amended later, and it shows.


How can you know whether you comply with CCPA‘s requirements for pseudonymizing, deidentifying, and aggregating data? A lawyer would have to tell you how the law applies to your situation. As I said above, I’m not a lawyer. But I can recommend lawyers working in this space. And I can work with your lawyer on the technical aspects: what methods are commonly used, how privacy risk is quantified, etc.

Related posts

Protecting privacy while keeping detailed date information

A common attempt to protect privacy is to truncate dates to just the year. For example, the Safe Harbor provision of the HIPAA Privacy Rule says to remove “all elements of dates (except year) for dates that are directly related to an individual …” This restriction exists because dates of service can be used to identify people as explained here.

Unfortunately, truncating dates to just the year ruins the utility of some data. For example, suppose you have a database of millions of individuals and you’d like to know how effective an ad campaign was. If all you have is dates to the resolution of years, you can hardly answer such questions. You could tell if sales of an item were up from one year to the next, but you couldn’t see, for example, what happened to sales in the weeks following the campaign.

With differential privacy you can answer such questions without compromising individual privacy. In this case you would retain accurate date information, but not allow direct access to this information. Instead, you’d allow queries based on date information.

Differential privacy would add just enough randomness to the results to prevent the kind of privacy attacks described here, but should give you sufficiently accurate answers to be useful. We said there were millions of individuals in the database, and the amount of noise added to protect privacy goes down as the size of the database goes up [1].

There’s no harm in retaining full dates, provided you trust the service that is managing differential privacy, because these dates cannot be retrieved directly. This may be hard to understand if you’re used to traditional deidentification methods that redact rows in a database. Differential privacy doesn’t let you retrieve rows at all, so it’s not necessary to deidentify the rows per se. Instead, differential privacy lets you pose queries, and adds as much noise as necessary to keep the results of the queries from threatening privacy. As I show here, differential privacy is much more cautious about protecting individual date information than truncating to years or even a range of years.

Traditional deidentification methods focus on inputs. Differential privacy focuses on outputs. Data scientists sometimes resist differential privacy because they are accustomed to thinking about the inputs, focusing on what they believe they need to do their job. It may take a little while for them to understand that differential privacy provides a way to accomplish what they’re after, though it does require them to change how they work. This may feel restrictive at first, and it is, but it’s not unnecessarily restrictive. And it opens up new possibilities, such as being able to retain detailed date information.


[1] The amount of noise added depends on the sensitivity of the query. If you were asking questions about a common treatment in a database of millions, the necessary amount of noise to add to query results should be small. If you were asking questions about an orphan drug, the amount of noise would be larger. In any case, the amount of noise added is not arbitrary, but calibrated to match the sensitivity of the query.

Internet privacy as seen from 1975

Science fiction authors set stories in the future, but they don’t necessarily try to predict the future, and so it’s a little odd to talk about what they “got right.” Getting something right implies they were making a prediction rather than imagining a setting of a story.

However, sometimes SF authors do indeed try to predict the future. This seems to have been at least somewhat the case with John Brunner and his 1975 novel The Shockwave Rider because he cites futurist Alvin Toffler in his acknowledgement.

The Shockwave Rider derives in large part from Alvin Toffler’s stimulating study Future Shock, and in consequence I’m much obliged to him.

In light of Brunner’s hat tip to Toffler, I think it’s fair to talk about what he got right, or possibly what Toffler got right. Here’s a paragraph from the dust jacket that seemed prescient.

Webbed in a continental data-net that year by year draws tighter as more and still more information is fed to it, most people are apathetic, frightened, resigned to what ultimately will be a total abolishment of individual privacy. A whole new reason has been invented for paranoia: it is beyond doubt — whoever your are! — that someone, somewhere, knows something about you that you wanted to keep a secret … and you stand no chance of learning what it is.

Related posts