Format-preserving encryption (FPE) for privacy

Posted on 25 October 2018 by John

The idea of format-preserving encryption is to encrypt data while keeping its form, a sort of encryption in kind. An encrypted credit card number would look like a credit card number, a string of text would be replaced with a string of text, etc.

Format-preserving encryption (FPE) is useful in creating a test or demo database. You want realistic data without having accurate data (at least for sensitive data fields), and using FPE on real data might be the way to go.

If a field is supposed to contain a 10-digit number, say a phone number, you want the test data to also contain a 10-digit number. Otherwise validation software might break. And if that number is a key that links tables together, simply substituting a random number would break the relationships unless the same random replacement was used everywhere. Also, two clear numbers could be replaced with the same randomly chosen value. FPE would be a simple way to avoid these problems.
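To make the idea concrete, here is a minimal sketch of a toy format-preserving cipher for even-length digit strings, built as a Feistel network with HMAC-SHA256 as the round function. This is an illustration only: it is not a vetted scheme such as NIST's FF1, the function names and the demo key are made up for this example, and it should not be used to protect real data. It does show the properties mentioned above: the output has the same format as the input, the same input always maps to the same output (so keys still link tables), distinct inputs never collide, and malformed input raises an error rather than producing malformed output.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_no: int, half: int, modulus: int) -> int:
    """Keyed pseudorandom round function built from HMAC-SHA256."""
    message = f"{round_no}:{half}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _split(digits: str):
    """Validate the input and split it into two equal-width halves."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("expected an even-length string of ASCII digits")
    width = len(digits) // 2
    return int(digits[:width]), int(digits[width:]), width


def toy_fpe_encrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Toy Feistel encryption: digits in, same number of digits out."""
    left, right, width = _split(digits)
    modulus = 10 ** width
    for r in range(rounds):
        left, right = right, (left + _round_value(key, r, right, modulus)) % modulus
    return f"{left:0{width}d}{right:0{width}d}"


def toy_fpe_decrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Undo toy_fpe_encrypt by running the rounds in reverse order."""
    left, right, width = _split(digits)
    modulus = 10 ** width
    for r in reversed(range(rounds)):
        left, right = (right - _round_value(key, r, left, modulus)) % modulus, left
    return f"{left:0{width}d}{right:0{width}d}"


key = b"demo key, not a real secret"  # hypothetical key for illustration
token = toy_fpe_encrypt(key, "5551234567")
assert len(token) == 10 and token.isdigit()
assert toy_fpe_decrypt(key, token) == "5551234567"
```

Because a Feistel network is a permutation of its domain, two different phone numbers cannot be replaced by the same value, which is the collision problem that independent random substitution can run into.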

FPE is a two-edged sword. It may be desirable to preserve formatting, but it could also cause problems. Using any form of encryption, format-preserving or not, to preserve the database structure could reveal information you don’t intend to reveal.

It’s quite possible to encrypt data and still compromise privacy. If you encrypt data, but not metadata, then you might keep enough information to re-identify individuals. For example, if you encrypt someone’s messages but retain the time stamp on the messages, that might be enough to identify that person.

The meaning of “format-preserving” can vary, and that could inadvertently leak information. What does it mean to encrypt English text in a format-preserving way? It could mean that English words are replaced with English words. If this is done simplistically, then the number of words in the clear text is revealed. If a set of English words is replaced with a variable number of English words, you’re still revealing that the original text was English.

FPE may not reveal anything that wasn’t already known. If you know that a field in a database contains a 9-digit number, then encrypting it as a 9-digit number doesn’t reveal anything per se. But it might be a problem if it reveals that two numbers are the same number.

What about errors? What happens if a field that is supposed to be a 9-digit number isn’t? Maybe it contains a non-digit character, or it contains more than 9 digits. The encryption software should report an error. But if it doesn’t, maybe the encrypted output is not a 9-digit number, revealing that there was an error in the input. Maybe that’s a problem, maybe not. Depends on context.

Related: Data privacy consulting

Software quality: better in practice than in theory

Posted on 25 September 2018 by John

Sir Tony Hoare

C. A. R. Hoare wrote an article How Did Software Get So Reliable Without Proof? in 1996 that still sounds contemporary for the most part.

In the 1980s many believed that programs could not get much bigger unless we started using formal proof methods. The argument was that bugs are fairly common, and that each bug has the potential to bring a system down. Therefore the only way to build much larger systems was to rely on formal methods to catch bugs. And yet programs continued to get larger and formal methods never caught on. Hoare asks

Why have twenty years of pessimistic predictions been falsified?

Another twenty years later we can ask the same question. Systems have gotten far larger, and formal methods have not become common. Formal methods are used—more on that shortly—but have not become common.

Better in practice than in theory

It’s interesting that Hoare was the one to write this paper. He is best known for quicksort, a sorting algorithm that works better in practice than in theory! Quicksort is commonly used in practice, even though it has terrible worst-case efficiency, because its average efficiency has optimal asymptotic order [1], and in practice it works better than other algorithms with the same asymptotic order.
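The gap between worst case and average case is easy to see by counting comparisons. The sketch below is a deliberately naive quicksort that always picks the first element as the pivot (a choice made here for illustration; production implementations choose pivots more carefully and sort in place). It counts one comparison per element per partitioning pass. Already-sorted input is the worst case for this pivot rule, while shuffled input behaves like the average case.

```python
import random


def quicksort_comparisons(items):
    """Sort a copy of items with a first-element pivot.

    Returns (sorted_list, comparisons), counting one comparison
    against the pivot for each element in each partitioning pass.
    """
    count = 0

    def sort(xs):
        nonlocal count
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        count += len(rest)
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        return sort(smaller) + [pivot] + sort(larger)

    return sort(list(items)), count


n = 200
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)

_, worst = quicksort_comparisons(already_sorted)
_, typical = quicksort_comparisons(shuffled)

# Sorted input forces n(n-1)/2 comparisons with this pivot rule.
assert worst == n * (n - 1) // 2
print(worst, typical)
```

On sorted input every pivot is the minimum, so each pass removes only one element and the count grows quadratically; on shuffled input the partitions are roughly balanced on average and the count is far smaller.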

Economic considerations

It is logically possible that the smallest bug could bring down a system. And there have been examples, such as the Mars Climate Orbiter, where a single bug did in fact lead to complete failure. But this is rare. Most bugs are inconsequential.

Some will object “How can you be so blasé about bugs? A bug crashed a $300 million probe!” But what is the realistic alternative? Would spending an additional billion dollars on formal software verification have prevented the crash? Possibly, though not certainly, and the same money could send three more missions to Mars. (More along these lines here.)

It’s all a matter of economics. Formal verification is extremely tedious and expensive. The expense is worth it in some settings and not in others. The software that runs pacemakers is more critical than the software that runs a video game. For most software development, less formal methods have proved more cost effective at achieving acceptable quality: code reviews, unit tests, integration testing, etc.

Formal verification

I have some experience with formal software verification, including formal methods software used by NASA. When someone says that software has been formally verified, there’s an implicit disclaimer. It’s usually the algorithms that have been formally verified, not the implementation of those algorithms in software. Also, maybe not all the algorithms have been verified, but say 90%, the remaining 10% being too difficult to verify. In any case, formally verified software can fail and has failed. Formal verification greatly reduces the probability of encountering a bug, but it does not reduce the probability to zero.

There has been a small resurgence of interest in formal methods since Hoare wrote his paper. And again, it’s all about economics. Theorem proving technology has improved over the last 20 years. And software is being used in contexts where the consequences of failure are high. But for most software, the most economical way to achieve acceptable quality is not through theorem proving.

There are also degrees of formality. Full theorem proving is extraordinarily tedious. If I remember correctly, one research group said that they could formally verify about one page of a mathematics textbook per man-week. But there’s a continuum between full formality and no formality. For example, you could have formal assurance that your software satisfies certain conditions, even if you can’t formally prove that the software is completely correct. Where you want to be along this continuum of formality is again a matter of economics. It depends on the probability and consequences of errors, and the cost of reducing these probabilities.

[1] The worst-case performance of quicksort is O(n²) but the average performance is O(n log n).

Photograph of C. A. R. Hoare by Rama, Wikimedia Commons, Cc-by-sa-2.0-fr, CC BY-SA 2.0 fr

The right level of abstraction

Posted on 4 September 2018 by John

Mark Dominus wrote a blog post yesterday entitled Why I never finish my Haskell programs (part 1 of ∞). In a nutshell, there’s always another layer of abstraction. “Instead of just adding lists of numbers, I can do addition-like operations on list-like containers of number-like things!”

Is this a waste of time? It depends entirely on context.

I can think of two reasons to pursue high levels of abstraction. One is reuse. You have multiple instances of things that you want to handle simultaneously. The other reason is clarity. Sometimes abstraction makes things simpler, even if you only have one instance of your abstraction. Dijkstra had the latter in mind when he said

The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.

Both of these can backfire. You could make your code so reusable (in your mind) that nobody else wants to use it. Your bird’s eye view can become a Martian’s eye view that loses essential details. [1]

It’s easy, and often appropriate, to criticize high levels of abstraction. I could imagine asking “Just how often do you need to do addition-like operations on list-like containers of number-like things? We’ve got to ship by Friday. Why don’t you just add lists of numbers for now.”

And yet, sometimes what seems like excessive abstraction can pay off. I remember an interview with John Tate a few years ago in which he praised Alexander Grothendieck.

He just had an instinct for the right degree of generality. Some people make things too general, and they’re not of any use. But he just had an instinct to put whatever theory he thought about in the most general setting that was still useful. Not generalization for generalization’s sake but the right generalization. He was unbelievable.

I was taken aback by Tate saying that Grothendieck found just the right level of abstraction. But Tate is in a position to judge and I am not.

From my perspective, Grothendieck’s work, what glimpses I’ve seen, looks gratuitously abstract. Basic category theory is about as abstract as my mind can go, but category theory was the floor of the castle Grothendieck built in the sky. And yet he built his castle to solve specific challenging problems in number theory, and succeeded. (Maybe his castle in the sky turned into a Winchester Mansion later in life. I can’t say.)

***

[1] I’m more sympathetic to the clarity argument than the reuse argument. The former gives immediate feedback. You try something because you think it will make things more clear. Did it, at least in your opinion? Does anyone else find it helpful? But reuse is speculative because it happens in the future. (If you have several cases in hand that you want to handle uniformly, that’s a safer bet. You might just call that “use” rather than “reuse.” My skepticism is more about anticipated reuse.)

In software development in particular, I believe it’s easier to make your code re-editable than reusable. It’s easier to predict that code will need to do something different in the future than it is to predict exactly what that something will be.

Does computer science help you program?

Posted on 28 June 2018 by John

The relationship between programming and computer science is hard to describe. Purists will say that computer science has nothing to do with programming, but that goes too far.

Computer science is about more than programming, but it’s all motivated by getting computers to do things. With few exceptions, students major in computer science in college with the intention of becoming programmers.

I asked on Twitter yesterday how helpful people found computer science in writing software.

Has theoretical computer science helped you write software?

— Computer Science (@CompSciFact) June 26, 2018

In a follow up tweet I said “For this poll, basic CS would be data structures and analysis of algorithms. Advanced CS is anything after that.”

So about a quarter didn’t find computer science useful, but the rest either expected it to be useful or at least found the basic theory useful.

I suspect some of those who said they haven’t found (advanced) CS theory useful don’t know (advanced) CS theory. This isn’t a knock on them. It’s only the observation that you can’t use what you aren’t at least aware of. In fact, you may need to know something quite well before you can recognize an opportunity to use it. (More on that here.)

Many programmers are in jobs where they don’t have much need for computer science theory. I thought about making that a possible choice, something like “No, but I wish I were in a job that could use more theory.” Unfortunately Twitter survey responses have to be fairly short.

Of course this isn’t a scientific survey. (Even supposedly scientific surveys aren’t that great.) People who follow the CompSciFact twitter account have an interest in computer science. Maybe people who had strong feelings about CS, such as resentment for having to study something they didn’t want to or excitement for being able to use something they find interesting, were more likely to answer the question.

Proving life exists on Earth

Posted on 26 May 2018 by John

NASA’s Galileo mission was primarily designed to explore Jupiter and its moons. In 1989, the Galileo probe started out traveling away from Jupiter in order to do a gravity assist swing around Venus. About a year later it also did a gravity assist maneuver around Earth. Carl Sagan suggested that when passing Earth, the Galileo probe should turn its sensors on Earth to look for signs of life. [1]

Artist conception of Galileo probe surveying Jupiter and its moons

Now obviously we know there’s life on Earth. But if we’re going to look for life on other planets, it’s reasonable to ask that our methods return positive results when examining the one planet we know for sure does host life. So scientists looked at the data from Galileo as if it were coming from another planet to see what patterns in the data might indicate life.

I’ve started using looking for life on Earth as a metaphor. I’m working on a project right now where I’m looking for a needle in a haystack, or rather another needle in a haystack: I knew that one needle existed before I got started. So I want to make sure that my search procedure at least finds the one positive result I already know exists. I explained to a colleague that we need to make sure we can at least find life on Earth.

This reminds me of simulation work. You make up a scenario and treat it as the state of nature. Then pretend you don’t know that state, and see how successful your method is at discovering that state. It’s sort of a schizophrenic way of thinking, pretending that half of your brain doesn’t know what the other half is doing.

It also reminds me of software testing. The most trivial tests can be surprisingly effective at finding bugs. So you write a few tests to confirm that there’s life on Earth.
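Here is a small sketch of what a “life on Earth” test might look like. The search procedure, `find_needles`, is a hypothetical stand-in (a simple z-score outlier detector) and the planted value and seed are arbitrary choices for this example. The point is the shape of the test: build a haystack, plant one needle whose location you know, then pretend you don’t know it and check that the search finds it.

```python
import random
import statistics


def find_needles(values, threshold=4.0):
    """Hypothetical search procedure: indices of values far from the mean."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * spread]


def test_finds_life_on_earth():
    rng = random.Random(2018)  # fixed seed so the test is repeatable
    haystack = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    haystack[123] = 10.0  # the one needle we know exists
    assert 123 in find_needles(haystack)


test_finds_life_on_earth()
```

A test like this says nothing about needles you don’t know about, but if it fails, there is little reason to trust the search on real data.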

[1] I found out about Galileo’s Earth reconnaissance listening to the latest episode of the .NET Rocks! podcast.

Dark debt

Posted on 1 March 2018 by John

Dark debt is a form of technical debt that is invisible until it causes failures. The term was coined at the STELLA conference and codified in the conference report.

Dark debt is found in complex systems and the anomalies it generates are complex system failures. Dark debt is not recognizable at the time of creation. … It arises from the unforeseen interactions of hardware or software with other parts of the framework. …

The challenge of dark debt is a difficult one. Because it exists mainly in interactions between pieces of the complex system, it cannot be appreciated by examination of those pieces. …

Unlike technical debt, which can be detected and, in principle at least, corrected by refactoring, dark debt surfaces through anomalies. Spectacular failures like those listed above do not arise from technical debt.


The most disliked programming language

Posted on 31 October 2017 by John

According to this post from Stack Overflow, Perl is the most disliked programming language.

I have fond memories of writing Perl, though it’s been a long time since I used it. I mostly wrote scripts for file munging, the task it does best, and never had to maintain someone else’s Perl code. Under different circumstances I probably would have had less favorable memories.

Perl is a very large, expressive language. That’s a positive if you’re working alone but a negative if working with others. Individuals can carve out their favorite subsets of Perl and ignore the rest, but two people may carve out different subsets. You may personally avoid some feature, but you have to learn it anyway if your colleague uses it. Also, in a large language there’s a greater chance that you’ll accidentally use a feature you didn’t intend to. For example, in Perl you might use an array in a scalar context. This works, but not as you’d expect if you didn’t intend to do it.

I suspect that people who like large languages like C++ and Common Lisp are more inclined to like Perl, while people who prefer small languages like C and Scheme have opposite inclinations.

One practical application of functional programming

Posted on 9 June 2017 by John

Arguments in favor of functional programming are often unconvincing. For example, the most common argument is that functional programming makes it easier to “reason about your code.” That’s true to some extent. All other things being equal, it’s easier to understand a function if all its inputs and outputs are explicit. But all other things are not equal. In order to make one function easier to understand, you may have to make something else harder to understand.

Here’s an argument from Brian Beckman for using a functional style of programming in a particular circumstance that I find persuasive. The immediate context is Kalman filtering, but it applies to a broad range of mathematical computation.

By writing a Kalman filter as a functional fold, we can test code in friendly environments and then deploy identical code with confidence in unfriendly environments. In friendly environments, data are deterministic, static, and present in memory. In unfriendly, real-world environments, data are unpredictable, dynamic, and arrive asynchronously.

If you write the guts of your mathematical processing as a function to be folded over your data, you can isolate the implementation of your algorithm from the things that make code hardest to test, i.e. the data source. You can test your code in a well-controlled “friendly” test environment and deploy exactly the same code into production, i.e. an “unfriendly” environment.

Brian continues:

The flexibility to deploy exactly the code that was tested is especially important for numerical code like filters. Detecting, diagnosing and correcting numerical issues without repeatable data sequences is impractical. Once code is hardened, it can be critical to deploy exactly the same code, to the binary level, in production, because of numerical brittleness. Functional form makes it easy to test and deploy exactly the same code because it minimizes the coupling between code and environment.

I ran into this early on when developing clinical trial methods first for simulation, then for production. Someone would ask whether we were using the same code in production as in simulation.

“Yes we are.”

“Exactly the same code?”

“Well, essentially.”

“Essentially” was not good enough. We got to where we would use the exact same binary code for simulation and production, but something analogous to Kalman folding would have gotten us there sooner, and would have made it easier to enforce this separation between numerical code and its environment across applications.

Why is it important to use the exact same binary code in test and production, not just a recompile of the same source code? Brian explains:

Numerical issues can substantially complicate code, and being able to move exactly the same code, without even recompiling, between testing and deployment can make the difference to a successful application. We have seen many cases where differences in compiler flags, let alone differences in architectures, even between different versions of the same CPU family, introduce enough differences in the generated code to cause qualitative differences in the output. A filter that behaved well in the lab can fail in practice.

Emphasis added, here and in the first quote above.

Note that this post gives an argument for a functional style of programming, not necessarily for the use of functional programming languages. Whether the numerical core or the application that uses it would best be written in a functional language is a separate discussion.
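As an illustration of the style in a language that is not primarily functional, here is a minimal sketch of a scalar Kalman filter written as a fold in Python. It is not Beckman’s code: the one-dimensional model (estimating a constant from noisy measurements), the names, and the default variances are assumptions made for this example. The update function knows nothing about where its data come from, so the same function can be folded over an in-memory list in testing or over a lazily produced stream.

```python
from functools import reduce
from typing import Iterable, Tuple

State = Tuple[float, float]  # (estimate, variance of the estimate)


def kalman_step(state: State, z: float, meas_var: float = 1.0) -> State:
    """One scalar Kalman update for a constant signal observed with noise."""
    x, p = state
    gain = p / (p + meas_var)
    return (x + gain * (z - x), (1.0 - gain) * p)


def run_filter(measurements: Iterable[float],
               initial: State = (0.0, 1000.0)) -> State:
    """Fold kalman_step over any iterable of measurements."""
    return reduce(kalman_step, measurements, initial)


# "Friendly" environment: data are static and in memory.
in_memory = [5.0] * 50
estimate_a = run_filter(in_memory)


# Stand-in for an "unfriendly" environment: values arrive one at a time.
def stream():
    for _ in range(50):
        yield 5.0


estimate_b = run_filter(stream())

assert estimate_a == estimate_b  # same code, same arithmetic, same result
```

The generator here is only a stand-in for an asynchronous source; the design point is that `kalman_step` is identical in both settings, so what was tested is what gets deployed.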

Building software the right way

Posted on 5 May 2017 by John

Yesterday a friend told me about a software project whose owners said “We’re going to do this the right way.” I told him I have two opposite reactions when I hear that:

Ooh, that sounds like fun!
Run away!

I’ve been on several projects where the sponsors have identified some aspect of the status quo that clearly needs improving. Working on a project that realizes these problems and is willing to address them sounds like fun. But then the project runs to the opposite extreme and creates something worse.

For example, most software development is sloppy. So a project reacts against this and becomes so formal that nobody can get any work done. In those settings I like to say “Hold off on buying a tux. Just start by tucking your shirt tail in. Maybe that’s as formal as you need to be.”

I’d be more optimistic about a project with more modest goals, say one that wants to move 50% of the way toward an ideal, rather than wanting to jump from one end of a spectrum to the other. Or even better, a project that has identified a direction they want to move in, and thinks in terms of experimenting to find their optimal position along that direction.

Unnatural language processing

Posted on 11 March 2017 by John

Japanese Russian dictionary

Larry Wall, creator of the Perl programming language, created a custom degree plan in college, an interdisciplinary course of study in natural and artificial languages, i.e. linguistics and programming languages. Many of the features of Perl were designed as an attempt to apply natural language principles to the design of an artificial language.

I’ve been thinking of a different connection between natural and artificial languages, namely using natural language processing (NLP) to reverse engineer source code.

The source code of a computer program is text, but not a text. That is, it consists of plain text files, but it’s not a text in the sense that Paradise Lost or an email is a text. The most efficient way to parse a programming language is as a programming language. Treating it as an English text will lose vital structure, and wrongly try to impose a foreign structure.

But what if you have two computer programs? That’s the problem I’ve been thinking about. I have code in two very different programming languages, and I’d like to know how functions in one code base relate to those in the other. The connections are not ones that a compiler could find. The connections are more psychological than algorithmic. I’d like to reverse engineer, for example, which function in language A a developer had in mind when he wrote a function in language B.

Both code bases are in programming languages, but the function names are approximately natural language. If a pair of functions have the same name in both languages, and that name is not generic, then there’s a good chance they’re related. And if the names are similar, maybe they’re related.

I’ve done this sort of thing informally forever. I imagine most programmers do something like this from time to time. But only recently have I needed to do this on such a large scale that proceeding informally was not an option. I wrote a script to automate some of the work by looking for fuzzy matches between function names in both languages. This was far from perfect, but it reduced the amount of sleuthing necessary to line up the two sets of source code.
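A script along these lines might look like the following sketch. It is not the script described above: the function names are invented, and the normalization and similarity cutoff are assumptions chosen for illustration. It uses `difflib.get_close_matches` from the Python standard library to pair each name in one code base with its closest counterpart in the other, after stripping underscores and case so that `compute_posterior` and `computePosterior` compare as equal.

```python
import difflib


def normalize(name: str) -> str:
    """Remove naming-convention noise: underscores and letter case."""
    return name.replace("_", "").lower()


def fuzzy_match(names_a, names_b, cutoff=0.6):
    """Map each name in names_a to its closest match in names_b, or None."""
    by_normalized = {}
    for name in names_b:
        by_normalized.setdefault(normalize(name), name)
    candidates = list(by_normalized)

    matches = {}
    for name in names_a:
        hits = difflib.get_close_matches(normalize(name), candidates,
                                         n=1, cutoff=cutoff)
        matches[name] = by_normalized[hits[0]] if hits else None
    return matches


# Hypothetical function names from two code bases.
code_base_a = ["compute_posterior", "read_config", "zzz_qqq"]
code_base_b = ["computePosterior", "ReadConfigFile", "plotResults"]

for a, b in fuzzy_match(code_base_a, code_base_b).items():
    print(f"{a:20} -> {b}")
```

Matches found this way are only leads to check by hand. Generic names will match spuriously, and a low cutoff trades missed pairs for false ones.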

Around a year ago I had to infer which parts of an old Fortran program corresponded to different functions in a Python program. I also had to infer how some poorly written articles mapped to either set of source code. I did all this informally, but I wonder now whether NLP might have sped up my detective work.

Another situation where natural language processing could be helpful in software engineering is determining code authorship. Again this is something most programmers have probably done informally, saying things like “I bet Bill wrote this part of the code because it looks like his style” or “Looks like Pat left her fingerprints here.” This could be formalized using NLP techniques, and I imagine it has been. Just as Frederick Mosteller and colleagues did a statistical analysis of The Federalist Papers to determine who wrote which paper, I’m sure there have been similar analyses to try to find out who wrote what code, say for legal reasons.

Maybe this already has a name, but I like “unnatural language processing” for the application of natural language processing to unnatural (i.e. programming) languages. I’ve done a lot of ad hoc unnatural language processing, and I’m curious how much of it I could automate in the future.
