Protecting privacy while keeping detailed date information

A common attempt to protect privacy is to truncate dates to just the year. For example, the Safe Harbor provision of the HIPAA Privacy Rule says to remove “all elements of dates (except year) for dates that are directly related to an individual …” This restriction exists because dates of service can be used to identify people as explained here.

Unfortunately, truncating dates to just the year ruins the utility of some data. For example, suppose you have a database of millions of individuals and you’d like to know how effective an ad campaign was. If all you have is dates to the resolution of years, you can hardly answer such questions. You could tell if sales of an item were up from one year to the next, but you couldn’t see, for example, what happened to sales in the weeks following the campaign.

With differential privacy you can answer such questions without compromising individual privacy. In this case you would retain accurate date information, but not allow direct access to this information. Instead, you’d allow queries based on date information.

Differential privacy would add just enough randomness to the results to prevent the kind of privacy attacks described here, but should give you sufficiently accurate answers to be useful. We said there were millions of individuals in the database, and the amount of noise added to protect privacy goes down as the size of the database goes up [1].

There’s no harm in retaining full dates, provided you trust the service that is managing differential privacy, because these dates cannot be retrieved directly. This may be hard to understand if you’re used to traditional deidentification methods that redact rows in a database. Differential privacy doesn’t let you retrieve rows at all, so it’s not necessary to deidentify the rows per se. Instead, differential privacy lets you pose queries, and adds as much noise as necessary to keep the results of the queries from threatening privacy. As I show here, differential privacy is much more cautious about protecting individual date information than truncating to years or even a range of years.

Traditional deidentification methods focus on inputs. Differential privacy focuses on outputs. Data scientists sometimes resist differential privacy because they are accustomed to thinking about the inputs, focusing on what they believe they need to do their job. It may take a little while for them to understand that differential privacy provides a way to accomplish what they’re after, though it does require them to change how they work. This may feel restrictive at first, and it is, but it’s not unnecessarily restrictive. And it opens up new possibilities, such as being able to retain detailed date information.

***

[1] The amount of noise added depends on the sensitivity of the query. If you were asking questions about a common treatment in a database of millions, the necessary amount of noise to add to query results should be small. If you were asking questions about an orphan drug, the amount of noise would be larger. In any case, the amount of noise added is not arbitrary, but calibrated to match the sensitivity of the query.

Per stirpes and random walks

If an inheritance is to be divided per stirpes, each descendant gets an equal share. If a descendant has died but has living descendants, his or her share is distributed by applying the rule recursively.

Example

For example, suppose a man had two children, Alice and Bob, and stipulates in his will that his estate is to be divided per stirpes. If Alice and Bob are still alive when he dies, his estate is split evenly. Suppose, however, that Alice is still alive but Bob has died, and that Bob has three living children, Carol, David, and Erin. In that case Alice would inherit 1/2 of the estate, and each of Bob’s children would inherit 1/6, i.e. 1/3 of Bob’s 1/2.

State law

In some states, such as Texas, per stirpes is the default when someone dies without a will. Who knows what they do in Nevada? Maybe the descendants play poker for the inheritance. I don’t know. I’m not a lawyer, certainly not a Nevada lawyer.

Random walk

Here’s a random process whose expected value gives the same result as per stirpes.

Convert the inheritance to a jar of pennies, possibly a very large jar. Repeat the following steps until all the pennies are gone.

  1. Take a penny out of the jar and perform a random walk on the family tree representing the descendants of the departed.
  2. When you come to a branch in the tree, choose a branch at random with each branch having equal probability.
  3. When you encounter a living person, give them the penny.

This assumes that you first prune the descendant tree of any lines that have died out. That is, we assume every terminal node of the tree represents a living person.

Why is it necessary to trim the tree? If you ended up at a dead end, couldn’t you just put the penny back and start over? No. Suppose in the example above that Alice and Carol are the only surviving descendants. Then per stirpes says they should receive equal amounts, since Carol inherits all of her father’s share. But if we did the random walk without removing David and Erin, then 1/2 the time we’d give a penny to Alice, 1/6 of the time we’d give it to Carol, and 1/3 of the time we’d start over. Alice would get 75% of the estate.

Related posts

The cost of no costs

The reason businesses have employees rather than contracting out everything is to reduce transaction costs. If a company needs enough graphics work, they hire a graphic artist rather than outsourcing every little project, eliminating the need to evaluate bids, write contracts, etc. Some things are easier when no money has to change hands.

But some things are made more complicated because money does not change hands. In-house transactions don’t require monetary negotiation, but they require emotional and political negotiation. Discussions can become unnecessarily heated because things don’t have prices. Discussions drift toward arguments over what is and is not possible rather than discussions of trade-offs based on cost.

Using one RNG to sample another

Suppose you have two pseudorandom bit generators. They’re both fast, but not suitable for cryptographic use. How might you combine them into one generator that is suitable for cryptography?

Coppersmith et al [1] had a simple but effective approach which they call the shrinking generator, also called irregular decimation. The idea is to use one bit stream to sample the other. Suppose the two bit streams are ai and bi. If ai = 1, then output bi. Otherwise, throw it away. In NumPy or R notation, this is simply b[a > 0].

Examples in Python and R

For example, in Python

    >>> import numpy as np
    >>> a = np.array([1,0,1,1,0,0,0,1])
    >>> b = np.array([1,0,1,0,0,1,1,0])
    >>> b[a > 0]
    array([1, 1, 0, 0])

Here’s the same example in R.

    > a = c(1,0,1,1,0,0,0,1)
    > b = c(1,0,1,0,0,1,1,0)
    > b[a>0]
    [1] 1 1 0 0

Linear Feedback Shift Registers

Coppersmith and colleagues were concerned specifically with linear feedback shift register (LFSR) streams. These are efficient sources of pseudorandom bits because they lend themselves to hardware implementation or low-level software implementation. But the problem is that linear feedback shift registers are linear. They have an algebraic structure that enables simple cryptographic attacks. But the procedure above is nonlinear and so less vulnerable to attack.

A potential problem is that the shrinking generator outputs bits at an irregular rate, and a timing attack might reveal something about the sampling sequence a unless some sort of buffering conceals this.

Other stream sources

While the shrinking generator was developed in the context of LFSRs, it seems like it could be used to combine any two PRNGs in hopes that the combination is better than the components. At a minimum, it doesn’t seem it should make things worse [2].

There is a problem of efficiency, however, because on average the shrinking generator has to generate 4n bits to output n bits. For very efficient generators like LFSRs this isn’t a problem, but it could be a problem for other generators.

Self-shrinking generators

A variation on the shrinking generator is the self-shrinking generator. The idea is to use half the bits of a stream to sample the other half. Specifically, look at pairs of bits, and if the first bit is a 1, return the second bit. If the first bit is a 0, return nothing.

Use in stream ciphers

The eSTREAM competition for cryptographically secure random bit generators included one entry, Decim v2, that uses shrinking generators. The competition had two categories, methods intended for software implementation and methods intended for hardware implementation. Decim was entered in the hardware category. According to the portfolio PDF [3] on the competition site,

Decim contains a unique component in eSTREAM, that of irregular decimation, and is an interesting addition to the field of stream ciphers.

That sounds like it was the only method to use irregular decimation (i.e. shrinking generators).

The first version of Decim had some flaws, but the document says “no adverse cryptanalytic results are known” for the second version. Decim v2 made it to the second round of the eSTREAM competition but was not chosen as a finalist because

… the cipher doesn’t seem to deliver such a satisfying performance profile, so while there might be some very interesting elements to the Decim construction, we feel that the current proposal doesn’t compare too well to the other submissions for the hardware profile.

That seems to imply Decim might be competitive with a different implementation or in some context with different trade-offs.

Related posts

[1] Coppersmith, D. Krawczyk, H. Mansour, Y. The Shrinking Generator. Advances in Cryptology — CRYPTO ’93. Lecture Notes in Computer Scienc, vol. 773, pp. 22–39. Springer, Berlin.

[2] If a and b are good sources of random bits, it seems b sampled by a should be at least as good. In fact, if a is poor quality but b is good quality, b sampled by a should still be good. However, the reverse could be a problem. If b is biased, say it outputs more 1s than 0s, then if a is a quality sample, that sample will be biased in favor of 1s as well.

[3] The link to this report sometimes works but often doesn’t. There’s something unstable about the site. In case it works, here’s the URL: http://www.ecrypt.eu.org/stream/portfolio.pdf

Rock, paper, scissors, algebra

Aatish Bhatia posted something interesting on Twitter: if you define multiplication on Rock, Paper, Scissors to be the winner of a match, the result is commutative but not associative.

The same is true of the extension Rock, Paper, Scissors, Lizard, Spock

Rules of rock, paper, scissors, lizard, spock.

Related post: Weakening the requirements of a group

Liminal and subliminal

It occurred to me for the first time this morning that the words liminal and subliminal must be related, just after reading an article by Vicki Boykis that discusses liminal spaces.

I hear the two words in such in different contexts—architecture versus psychology—and hadn’t thought about the connection until now. If I were playing a word association game, my responses would be these.

Q: Liminal?

A: Spaces.

Q: Subliminal?

A: Message.

Etymology

I checked Etymonline to verify that the two words are indeed cognate. Both come from the Latin word limen for threshold. Something is subliminal if it is below the threshold, typically the threshold of consciousness.

Frequency

Surely the word subliminal is far more common than liminal. To verify this, I turned to Google’s Ngram Viewer. I’ve included a screenshot below, and you can find the original here.

Ngram of liminal vs subliminal

It’s not surprising that subliminal was a popular term during the career of Sigmund Freud. He published The Interpretation of Dreams in 1899 and died in 1939.

What is surprising, at least to me, is that the word liminal has been gaining popularity and passed subliminal around the turn of the century. I didn’t expect liminal to be anywhere near as common as subliminal.

Bias

Google’s Ngram data comes from books. Word frequencies in books can be very different than word frequencies in common speech or other writing as this example shows. I can’t recall ever hearing someone use liminal in conversation. Maybe civil engineers and architects hear it all the time. As I type this, my spell checker puts a red squiggly line under every instance of liminal, showing that the word is not in its default dictionary, though it does recognize subliminal.

Related posts

Cop with a mop

Yesterday I was at a wedding, and a vase broke in the aisle shortly before the bridal party was to enter. Guests quickly picked up the pieces, but the vase left a pool of water on the hard floor.

A security guard ran (literally) for a mop and cheerfully picked up the water. He could have easily stood in the corner and said that mopping floors is not his job. And if he were guarding a jewelry store, it would be inappropriate for him to leave his post to get a mop. But his presence at the wedding was a formality, presumably a venue requirement, and no one was endangered by his fetching a mop. There was more danger of someone slipping on a wet floor.

I enjoy seeing anyone do their job with enthusiasm, doing more than the minimum required. Over-zealous people can cause problems, but I’d much rather deal with such problems than deal with people passively putting in their time.

R with Conda

I’ve been unable to get some R libraries to install on my Linux laptop. Two libraries in particular were tseries and tidyverse. The same libraries installed just fine on Windows. (Maybe you need to install Rtools first before installing these on Windows; I don’t remember.)

I use conda all the time with Python, but I hadn’t tried it with R until this evening. Apparently it just works. The libraries I was trying to install have a lot of dependencies, and conda is very good at managing dependencies.

I removed my installation of R and reinstalled from conda:

    conda install r-base

Then I installed tseries with

    conda install r-tseries

and installed tidyverse analogously:

    conda install r-tidyverse

Just prepend r- to the name of the R library you want to install.

I haven’t used it in anger yet, but it seems that everything works just fine.