Query, then deidentify

Suppose you have a database of personally identifiable information (PII) and you want to allow someone else to query the data while protecting the privacy of the individuals represented by the data. There are two approaches:

  1. Deidentify, then query
  2. Query, then deidentify

The first approach is to do whatever is necessary to deidentify the data—remove some fields, truncate or randomize others, etc.—and then pose a query to this redacted data.

The second approach is to query the original data, then do whatever is necessary to deidentify the results.

In graphical terms, you can get from raw data to a deidentified result either by following the green arrows or the blue arrows below. In mathematical terms, this diagram does not commute.

The first approach is most common. A company that owns data (a “covered entity” in HIPAA terms) will deidentify it and license it to another company who then queries it. The second approach is becoming more common, where a company will license access to querying their data.

Pros and cons

Which approach is better? If by better you mean more accurate results, it’s always best to query first then deidentify. The order in which you do things matters, and deidentifying as late as possible preserves information.

The situation is analogous to carrying out a sequence of steps on a calculator. If you want your final result to be accurate to two decimal places, you first carry out all your operations to as much precision as you can, then round the final result. If you round your numbers first, you probably will get less accurate results, maybe even useless results.

However, deidentifying data before querying it is better in some non-mathematical ways. Data scientists want the convenience of working with the data with their tools in their environment. They want to possess (a deidentified version of) the data rather than have access to query the (exact) data. They also want the freedom to run ad hoc queries [1].

There are logistical and legal details to work out in order to license access to query data rather than licensing the data. But it is doable, and companies are doing it.

Why query first

When you deidentify data first, you have to guard against every possible use of the data. But when you deidentify data last, you only have to guard against the actual use of the data.

For example, suppose you are considering creating a new clinic and you would like to know how many patients of a certain type live closer to the location you have in mind than the nearest alternative. A data vendor cannot give you exact locations of patients. If they were to release such data, they’d have to obscure the addresses somehow, such as giving you the first three digits of zip codes rather than full addresses. But if you could ask your query of someone holding the full data, they may tell you exactly what you want to know.

Some queries may pose no privacy risk, and the data holder can return exact results. Or they may need to jitter the result a bit in order to protect privacy, for reasons explained here. But it’s better to jitter an exact result than to jitter your data before computing.

How to query first

The query-first approach requires a trusted party to hold the unredacted data. There are a variety of ways the data holder can license access, from simple to sophisticated, and in between.

The simplest approach would be for the data holder to sell reports. Maybe the data holder offers a predetermined set of reports, or maybe they allow requests.

The most sophisticated approach would be to use differential privacy. Clients are allowed to pose any query they wish, and a query manager automatically adds an amount of randomness to the results in proportion to the sensitivity of the query. All this is done automatically according to a mathematical model of privacy with no need for anyone to decide a priori which queries will be allowed.

There are approaches conceptually between pre-determined reports and differential privacy, offering more flexibility than the former and being easier to implement than the latter. There’s a lot of room for creativity in this space.

Related posts

[1] Being able to run ad hoc queries with no privacy budget is certainly simpler, in the same way that an all-you-can-eat buffet is simpler than ordering food à la carte. But it also means the price is higher. Deidentifying an entire data set entails more loss of accuracy that deidentifying a set of queries.

Identifiable to man or machine?

Like the previous post, this post riffs on a photo [1] I stumbled on while looking for something else.

Would it be easier to identify the man in this photo or the man whose photo appeared in the previous post, copied below.

I think it would be easier for a human to recognize the person in the first image. But what about a computer?

We humans identify people most easily by their faces, and especially by their eyes. These features are easier to see in the first photo. But what might we find if we applied some image processing to the two photos? Maybe the green man’s facial features could be exposed by some diligent processing. We see more of the second man’s body. Maybe a computer algorithm could extract more information out of the second image for this reason.

Photographs may, and often do, contain Exif (Exchangeable image file format) metadata, such as the GPS coordinates of the camera at the time the photo was taken. A photo taken when the lens cap on by mistake might contain a good deal of information about the subject even though the photo per se is useless. This information can be useful to the photographer, but it could also pose a privacy risk. Before posting a photo publicly, you might want to strip out the metadata.

As I noted in the previous post, the privacy risk from data depends on context. Suppose the metadata embedded in a photo contains the serial number of the camera. That serial number would not help most people identify a photo(grapher), but it would someone who had access to a database linking serial numbers to the customers who purchased the cameras.

Was the first photo created by actually projecting the red and green lights onto the subject, or were these added in post production? For that matter, was there actually a man who posed for the photo or was the image synthetically generated? A forensic investigation of the photo might be able to answer these questions.

[1] Photo by Sebastian Mark on Unsplash

Topological sort

When I left academia [1] my first job was working as a programmer. I was very impressed by a new programmer we hired who hit the ground running. His first week he looked at some problem we were working on and said “Oh, you need a topological sort.” I’d never heard of a topological sort and it sounded exotic. What could this have to do with topology?!

A topological sort of a directed graph lists source nodes before target nodes. For example, if there is a directed edge from A to B and from C to A, then the nodes would be list C, A, B. It’s just a way of listing items in a directed graph so that no item in the list points to an item earlier in the list. All arrows point forward.

This is not exotic at all. It’s something you’ve likely done, maybe by hand. As pointed out in the comments, the make utility does this, compiling source files in the order that they’re needed [2].

Where does topology come in? Imagine your directed graph made of beads and strings. You want to pick up the graph by some bead so that all beads are higher than the beads they point to. It’s topological in the sense that you don’t need to preserve the geometry of the graph, only its connectivity.


The Unix utility tsort will do a topological sort. The input to the utility is a text file with two items per line, separated by white space, indicating a directed edge from the first item to the second.


Here is a thumbnail image of a graph of relationships between special functions. See this page for a full-sized image and an explanation of what the arrows represent.

special function relationships

I took the GraphViz file used to create the graph and formatted it for tsort. Then I randomly shuffled the file with shuf.

    Gegenbauer_polynomials Legendre_polynomials
    Gegenbauer_polynomials Chebyshev_polynomials_Second_kind
    Hypergeometric_2F1 Jacobi_polynomials
    Error_function Fresnel_S
    Hypergeometric_1F1 Error_function

The lines are not sorted topologically because, for example, the Gegenbauer polynomials are special cases of the Hypergeometric 2F1 functions, so Hypergeometric 2F1 should be listed before Gegenbauer polynomials.

When I ran the shuffled file through tsort I got


and now in this list more general functions always come before special cases.

Related posts

[1] After a postdoc at Vanderbilt, I took a job as a programmer. I got the job because they needed a programmer who knew some DSP. A few years later I got a job at MD Anderson Cancer Center managing a group of programmers. It’s fuzzy whether my time at MDACC should be considered time in Academia. My responsibilities there were sometimes academic—writing journal articles, teaching classes—and sometimes not—developing software and managing software developers.

[2] The make software can be used to run any directed acyclic graph of tasks, but is most often used to compile software.

Playfair cipher

The Playfair cipher was the first encryption technique to encrypt text two letters at a time. Instead of substituting one letter for another, it substitutes one pair of letters for another pair. This makes the method more secure than a simple substitution cipher, but hardly secure by modern standards.

The Playfair cipher was used (and broken) during the first world war. I vaguely remember reading somewhere that the cipher took about an hour to break using pencil and paper. It was secure in the sense that it could be used for messages that only needed to be secure for less time than it took to break the method. It was more secure than simple substitution, and easy to encrypt and decrypt manually.

True to Stigler’s law of eponymy, the Playfair cipher was not named after its inventor, Charles Wheatstone of Wheatstone bridge fame, but after Lyon Playfair who popularized the method. Playfair acknowledged Wheatstone, but his name stuck to the method nevertheless.

Message preparation

The Playfair cipher uses a 5 × 5 grid of letters, so some letter of the Roman alphabet has to go. A common choice was to use the same letter for I and J. (A variation on the method using a 6 × 6 grid of letters and digits would not have to leave out any letters.)

For reasons that will soon be apparent, double letters had to be broken up, say with an X. So “FOOTBALL” would become “FOXOTBALXL.” Amusingly, “MISSISSIPPI” would become “MISXSISXSIPXPI.”

After eliminating Js and splitting double letters, the message is divided into pairs. So FOXOTBALXL becomes FO XO TB AL XL.

Encryption algorithm

The key for the encryption method is the arrangement of the letters in a square. In practice, the key would be some word or phrase that was used to permute the alphabet, and then that permutation was filled into the grid.

Here’s a grid I constructed by asking Python for a random permutation of the alphabet.


Given a pair of letters, the two letters either lie on the same row, the same column, or are in different rows and columns. (This is why you break up double letters.)

If the two letters lie in the same row, advance each letter one position, wrapping around if necessary. For example, IT would be encrypted as FV, and TX would be encrypted as VI.

If two letter line in the same column, proceed analogously, moving each letter down. So TH would be encrypted as GB and OI would be encrypted as IP.

Finally, if the two letters are in different rows and columns, they form the diagonal corners of a rectangle. Replace the two letters with the letters on the remaining corners. For example, IH becomes TR, HE becomes RB, GW becomes DM, etc.


Just as you can attack a simple substitution cipher by looking at letter frequencies, you can attack a Playfair cipher by looking at bigram frequencies. You can find these frequencies for English text on Peter Norvig’s site. TH sticks out in bigram frequencies similarly to how E sticks out in letter frequencies. However, bigram frequencies are more evenly distributed than letter frequencies.

As I pointed out in the previous post, making a mapping between 676 pairs of letters to a randomly generated list of 676 other pairs of letters will not create a secure cipher. But Playfair is much weaker than such a random assignment. There is a lot of structure to the Playfair cipher. This makes it more convenient to use, and easier to break.

Suppose pairs of letters where mapped to random pairs of letters and you learn that GB is the encrypted form of TH. What have you learned about decrypting any other pair? Nothing, except that you’ve eliminated 1 out of 676 possibilities.

But if you learn that a Playfair cipher sends TH to GB, you learn that either (1) T, H. G, and B all lie in the same row or column, or (2) that T and B are in the same column, G and B are in the same column, T and G are in the same row, and H and B are in the same row.


If we rotate the rows or columns in our encryption matrix, nothing changes. This is easy to see in the case when two letters are in the same row or in the same column. It’s a little harder to see but still true when the letters are in different rows and columns.

For example, consider the following encryption matrix, formed by rotating the columns two positions and the rows one position.


If you work through all the examples above, you’ll see that they remain the same. IT still goes to FV etc.

The reason rotating columns or rows doesn’t make a difference is that in matrix notation, the encryption algorithm does not depend on the subscripts per se but the difference in subscripts mod 5.

It almost doesn’t matter if you transpose the encryption matrix. If you transpose a matrix, elements that were in the same row are now in the same column and vice versa. When two letters are not in the same row or column, transposing the encryption matrix transposes the encrypted pair. In the example above HE goes to RB. If we transpose the encryption matrix, HE goes to BR.

We said above that the key to a Playfair cipher is a permutation of the alphabet. But many keys correspond to the same encryption mapping. The analyst doesn’t need to recover the original encryption matrix but only some rearrangement of it.

Related posts

Simple substitution ciphers over a gargantuan alphabet

Simple substitution ciphers replace one letter with another. Maybe A goes to W, B goes to G, C goes to A, etc.

These ciphers are famously easy to break, so easy that they’re common in puzzle books. Here’s one I made [1] for this post in case you’d like to try it.


As is common in puzzle books, I kept the spaces and punctuation.

When you learn that simple substitution is breakable, you might reasonably think that the problem is the small alphabet size. What if you replaced pairs of letters with pairs of letters, effectively working over an alphabet of size 26² = 676. That’s an improvement, but it’s still not secure. It could be broken manually in a few hours, depending on the length of the text, and of course could be broken quickly using a computer.

If we want a cipher to be secure against computer-aided cryptanalysis, we’re going to need a much bigger alphabet.

The Roman alphabet has 26 letters, which can be expressed in 5 bits. Pairs of Roman letters would require 10 bits. What if we used a 32-bit alphabet, substituting 32-bit sequences with other 32-bit sequences? This is working over an alphabet of over 4 billion symbols. Surely that’s secure? Nope.

What if we use blocks of 128 bits? This is working over an alphabet of size

2128 = 340,282,366,920,938,463,463,374,607,431,768,211,456.

Nope. Still not good enough. Because you can see the penguin.

Original encrypted Tux image

The image above is a famous example of a downfall of simple substitution, albeit over a gargantuan alphabet. The image was created by taking a graphic of the Linux mascot and encrypting the bits using 128-bit encryption. Each block of 128 bits goes to a unique, essentially random replacement. Each block is well encrypted. But there are repetitive blocks in the original that become repetitive blocks in the encrypted version.

The AES (Rijndael) encryption algorithm is a good algorithm, but in the example above it was used poorly. It was used in electronic code book mode (ECB), something that nobody would do in practice.

In practice, you might do something like cipher block chaining where you XOR each block with the encrypted version of the previous block. You could think of this as a clever way of using a simple substitution over an enormous alphabet. You look up the substitution of each block, but then XOR its bits with the previously encrypted block. Now repetitive input does not produce repetitive output. You cannot see the penguin. The penguin image becomes random-looking static.

Related posts

[1] I produced the cryptogram using

    cat myfile | tr [a-z] [A-Z] | tr [A-Z] ...

where “…” is a permutation of the 26 upper case letters.

Small-scale automation


Saving keystrokes is overrated, but maintaining concentration is underrated.

This post is going to look at automating small tasks in order to maintain concentration, not to save time.

If a script lets you easily carry out some ancillary task without taking your concentration off your main task, that’s a big win. Maybe the script only saves you five seconds, but it could save you from losing a train of thought.

If your goal in writing a script is to preserve concentration, that script has to be effortless to run. It’s worth taking a few minutes to search for a script that is going to save you an hour. But if the purpose of a script is to preserve your state of flow, having to search for it defeats the purpose.

Remembering what you’ve written

I’ve often said to myself “I’ve had to do this task several times. I should automate it!” Good idea, except I couldn’t quickly find the code later when I needed it.

I’ve heard many people say you should automate repetitive tasks; I’ve never heard anyone discuss the problem of remembering what you’ve automated and how to invoke it. Maybe there’s less discussion of the memory problem because the solution is personal. It depends, for instance, on what tools you use and what you find memorable.

One suggestion would be to periodically review what you’ve written, say once a month [1]. Maybe you’ve put useful aliases in your shell configuration file. Skimming that config file occasionally could help you remember what’s there. If you have a directory where you keep custom scripts, it could help to browse that directory once in a while. It helps if there aren’t too many places you need to look, which leads to the next section.

Tool priorities

It would also help to minimize the number of tools you use, or at least the number of tools you customize.

And even with a very minimal tool set, it helps to have a primary emphasis on one of those tools. For example, maybe your work environment consists mostly of a shell, a programming language, and an editor. When it’s not obvious which tool to pick, are you going to write a shell script, a program, or an editor extension? By picking one tool as your default, you get better at that tool, accumulate more sample code for that tool, and have fewer contexts to explore when you’re looking for something you’ve written.


[1] A long time ago I heard someone say he reads documentation ever Friday afternoon. I did that for a while and recommend it. Maybe set aside a few minutes each Friday afternoon to review and tweak config files. If you don’t get through everything, pick up next week where you left off.

Number of bits in a particular integer

When I think of bit twiddling, I think of C. So I was surprised to read Paul Khuong saying he thinks of Common Lisp (“CL”).

As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration.

I would not have thought of Common Lisp being more expressive for bit manipulation than C, though in hindsight perhaps I should have. Common Lisp is a huge language, and a lot of thought went into it. It’s a good bet that if CL supports something it supports it well.

One of the functions Khoung uses is integer-length. I looked it up in Guy Steele’s book. Here’s what he says about the function.

This function performs the computation

ceiling(log2(if integer < 0 theninteger else integer + 1))

… if integer is non-negative, then its value can be represented in unsigned binary form in a field whose width is no smaller than (integer-length integer). …

Steele also describes how the function works for negative arguments and why this is useful. I’ve cut these parts out because they’re not my focus here.

I was curious how you’d implement integer-length in C, and so I turned to Hacker’s Delight. This book doesn’t directly implement a counterpart to integer-length, but it does implement the function nlz (number of leading zeros), and in fact implements it many times. Hacker’s Delight points out that for a 32-bit unsigned integer x,

⌊log2(x)⌋ = 31 – nlz(x)


⌈log2(x)⌉ = 32 – nlz(x -1).

So nlz(x) corresponds to 32 − (integer-length x).

Hacker’s Delight implements nlz at least 10 times. I say “at least” because it’s unclear whether a variation of sample code discussed in commentary remarks counts as a separate implementation.

Why so many implementations? Typically when you’re doing bit manipulation, you’re concerned about efficiency. Hacker’s Delight gives a variety of implementations, each of which may have advantages in different hardware. For example, one implementation is recommended in the case that your environment has a microcode implementation of popcount. The book also gives ways to compute nlz that involve casting an integer to a floating point number. The advisability of such a technique will be platform-dependent.

If you’re looking for C implementations of integer-length you can find a few on Sean Anderson’s Bit Twiddling Hacks page.

Related posts

Proof of optimization

Suppose you hire me to solve an optimization problem for you. You want me to find the value of x that minimizes f(x). I go off and work on finding the best value of x. I report back what I found, and you might say “Thanks, That’s a good value of x. But how do I know there’s not an even better value?”

In general this is a hard question to answer. If x were a single number, maybe I could produce a plot of f and show that my x is where f takes on its smallest value. But usually x is a vector, maybe a thousand-dimensional vector. I’m not very good at graphing functions in a thousand dimensions, so this approach isn’t going to work.

I may be able to back up my result by defending the process used to produce it. For example, maybe you ask me for the shortest path through the 254 counties in Texas and I come back with the following tour:

If you ask whether this is optimal, I’ll have to admit that I’m not certain that it is, but I am certain that it is close. The tour was produced using Mathematica, which in turn uses Bill Cook’s Concorde algorithm, which is known to produce near-optimal results. If the tour above isn’t the shortest, the shortest tour isn’t much shorter.

But if the optimization problem is convex then you may be able to certify the result analogous to the way you can certify primes. In general a certificate gives you a way to verify a solution with much less effort than it took to find the solution.

You can prove things by moving back and forth between your original (“primal”) problem and its dual, or between a variation of your original problem and its dual. You may be able to certify that no solution to the constraints of the original problem exists, or certify that a proposed solution is or isn’t optimal. In every case, demonstrating one solution to the dual problem proves something about all solutions to the primal problem.

Related posts

Self-documenting software

programmer using a laptop in the dark

The electricity went out for a few hours recently, and because the power was out, the internet was out. I was trying to do a little work on my laptop, but I couldn’t do what I intended to do because I needed a network connection to access some documentation. I keep offline documentation for just this situation, but the information I needed wasn’t in my local files. Or maybe it was there, but I gave up too soon.

This made me think of the Emacs slogan that it is a self-documenting editor. It’s also a very old editor, with roots going back to the 1970s. Originally the phrase “self-documenting” contrasted with software that only had paper documentation. Now it’s common for software to have online documentation, but most software still isn’t self-documenting in the way that Emacs is. The documentation for Emacs is extensive, well-written, and thoroughly integrated with the editor.

Most of the software I use has local documentation, but the documentation is more difficult to use than doing a web search. Maybe the local documentation would be easier to use if I invested more time learning how to use it, but this investment has to be repeated for each application; every application has its own documentation system.

The best approach may be to commit to a small number of tools and learn how each one’s documentation works. I’ve done the former but wish I’d put more work into the latter sooner.

Years ago I had gotten to the point that I was using a menagerie of different software applications, none of which I knew well. Following the advice to use the best tool for the job lead to too many tools for wide variety of jobs. I determined that sometimes using a sub-optimal tool would be optimal overall if it allowed me to switch tools less often.

What I didn’t do at the time, but I’d recommend now, is to also to dig into each tool’s documentation system. A web search will always be faster in the moment than learning how to use an arcane help system. (More on this here.) But in the long run, becoming fluent in the local help systems of your most important applications is more efficient, and leads to serendipitous discoveries. It also helps you preserve a state of flow by reducing context switches.

Related posts

Photo by Valeriy Khan on Unsplash

Pratt Primality Certificates

The previous post implicitly asserted that J = 8675309 is a prime number. Suppose you wanted proof that this number is prime.

You could get some evidence that J is probably prime by demonstrating that

2J-1 = 1 mod J.

You could do this in Python by running the following [1].

    >>> J = 8675309
    >>> assert( pow(2, J-1, J) == 1 )

This shows J is a probable prime to base 2.

If you want more evidence, you could also show J is a probable prime to base 3.

    >>> assert( pow(3, J-1, J) == 1 )

But no matter how many bases you try, you won’t have proof that J is prime, only evidence. There are pseudoprimes, (rare) composite numbers that satisfy the necessary-but-not-quite-sufficient conditions of Fermat’s primality test.

Primality certificates

A primality certificate is a proof that a number is prime. To be practical, a certificate must be persuasive and efficient to compute.

We could show that J is not divisible by any integer less than √J. That would actually be practical because J is not that large.

    >>> for n in range(2, int(J**0.5)+1):
    ...     assert(J % n > 0)

But we’d like to use J to illustrate a method that scales to much larger numbers than J.

Pratt certificates

Pratt primality certificates are based on a theorem by Lucas [2] that says a number n is prime if there exists a number a such that two conditions hold:

an-1 = 1 mod n,


a(n-1)/p ≠ 1 mod n

for all primes p that divide n-1.

How do you find a? See this post.


To find a Pratt certificate for J, we have to factor J-1. I assert that

J-1 = 8675308 = 4 × 2168827

and that 2168827 is prime. Here’s verification that Lucas’ theorem holds:

    >>> assert( pow(2, (J-1)//2, J) != 1 )
    >>> assert( pow(2, (J-1)//2168827, J) != 1 )

What’s that you say? You’re not convinced that 2168827 is prime? Well then, we’ll come up with a Pratt certificate for 2168827.

Pratt certificates are generally recursive. To prove that p is prime, we have to factor p-1, then prove all the claimed prime factors of p-1 are prime, etc. The recursion ends when it gets down to some set of accepted primes.

Now I assert that

    2168827 - 1 = 2168826 = 2 × 3 × 11 × 17 × 1933

and that all these numbers are prime. I’ll assume you’re OK with that, except you’re skeptical that 1933 is prime.

The following code is proof that 2168827 is prime, assuming 1933 is prime.

    >>> m = 2168827
    >>> for p in [2, 3, 11, 17, 1933]:
    ...     assert( pow(3, (m-1)//p, m) != 1 )

Finally, we’ll prove that 1933 is prime.

You can verify that

    1933 - 1 = 1932 = 2² × 3 × 7 × 23

and I assume you’re convinced that each of these factors is prime.

    >>> m = 1933
    >>> for p in [2, 3, 7, 23]:
    ...     assert( pow(5, (m-1)//p, m) != 1 )

Pratt certificates can be written in a compact form that verification software can read. Here I’ve made the process more chatty just to illustrate the parts.

Update: Here’s a visualization of the process above, drawing arrows from each prime p to the prime factors of p-1.

In this post we’ve agree to recognize 2, 3, 7, 11, 17, and 23 as primes. But you only have to assume 2 is prime. This would make the software implementation more elegant but would make the example tedious for a human consumption.


A primality certificate does not have to be efficient to produce, though of course that would be nice. It has to be efficient to verify. You could imagine that the prime number salesman has more compute power available than the prime number customer. In the example above, I used a computer to generate the Pratt certificate, but it wouldn’t be unreasonable to verify the certificate by hand.

The brute force certificate above, trying all divisors up to √p, obviously takes √p calculations to verify. A Pratt certificate, however, takes about

4 log2 p

calculations. So verifying a 10-digit prime requires on the order of 100 calculations rather than on the order of 100,000 calculations.

Atkin-Goldwasser-Kilian-Morain certificates

Producing Pratt certificates for very large numbers is difficult. Other certificate methods, like Atkin-Goldwasser-Kilian-Morain certificates, scale up better. Atkin-Goldwasser-Kilian-Morain certificates are more complicated to describe because they involve elliptic curves.

Just as Pratt took a characterization of primes by Lucas and turned it into a practical certification method, Atkin and Morain turned a characterization of primes by Goldwasser and Kilian, one involving elliptic curves, and turned it into an efficient certification method.

These certificates have the same recursive nature as Pratt certificates: proving that a number is prime requires proving that another (smaller) number is prime.

Update: More on elliptic curve primality proofs.


[1] This is a more efficient calculation than it seems. It can be done quickly using fast exponentiation. Note that it is not necessary to compute 2J-1 per se; we can carry out every intermediate calculation mod J.

[2] Lucas was French, and so the final s in his name is silent.