# A wrinkle in Clojure

Bob Martin recently posted a nice pair of articles, A Little Clojure and A Little More Clojure. In the first article he talks about how spare and elegant Clojure is.

In the second article he shows how to write a program to list primes using `map` and `filter` rather than `if` and `while`. He approaches the example leisurely, showing how to arrive at the solution in small steps understandable to someone new to Clojure.

The second article passes over a small wrinkle in Clojure that I’d like to say a little about. Near the top of the post we read

The `filter` function also takes a function and a list. `(filter odd? [1 2 3 4 5])` evaluates to `(1 3 5)`. From that I think you can tell what both the `filter` and the `odd?` functions do.

Makes sense: given a bunch of numbers, we pick out the ones that are odd. But look a little closer. We start with `[1 2 3 4 5]` in square brackets and end with `(1 3 5)` in parentheses. What’s up with that? The article doesn’t say, and rightfully so: it would derail the presentation to explain this subtlety. But it’s something that everyone new to Clojure will run into fairly quickly.

As long as we’re vague about what “a bunch of numbers” means, everything looks fine. But when we get more specific, we see that `filter` takes a vector of numbers and returns a list of numbers [1]. Or rather it can take a vector, as it does above; it could also take a list. There are reasons for this, explained here, but it’s puzzling if you’re new to the language.

There are a couple ways to make the filter example do what you’d expect, to either have it take a vector and return a vector, or to have it take a list and return a list. Both would have interrupted the flow of an introductory article. To take a vector and return a vector you could run

`    (filterv odd? [1 2 3 4 5])`

This returns the vector `[1 3 5]`.

Notice we use the function `filterv` rather than `filter`. If the article had included this code, readers would ask “Why does filter have a ‘v’ on the end? Why isn’t it just called ‘filter’?”

To take a list and return a list you could run

`    (filter odd? '(1 2 3 4 5))`

This returns the list `(1 3 5)`.

But if the article had written this, readers would ask “What is the little mark in front of `(1 2 3 4 5)`? Is that a typo? Why didn’t you just send it the list?” The little mark is a quote and not a typo. It tells Clojure that you are passing in a list and not making a function call.

One of the core principles of Lisp is that code and data use the same structure. Everything is a list, hence the name: LISP stands for LISt Processing. Clojure departs slightly from this by distinguishing vectors and lists, primarily for efficiency. But as in all Lisps, function calls are lists, where the first element of the list is the name of the function and the remaining elements are arguments [2]. Without the quote mark in the example above, the Clojure compiler would look in vain for a function called 1 and throw an exception:

`    ClassCastException java.lang.Long cannot be cast to clojure.lang.IFn`

Since vectors are unambiguously containers, never used to indicate function calls, there’s no need for vectors to be quoted, which simplifies the exposition in an introductory article.
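As an aside, the same wrinkle appears in Python, whose built-in `filter` also returns a lazy object rather than a list. The parallel isn't in the original articles, but it may help readers coming from Python:

```python
# Python's filter, like Clojure's, returns a lazy object, not a list.
result = filter(lambda n: n % 2 == 1, [1, 2, 3, 4, 5])
print(type(result))   # <class 'filter'>
print(list(result))   # [1, 3, 5] -- forcing the lazy sequence
```

As in Clojure, you call `list` explicitly when you want the materialized container back.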

## More Lisp posts

[1] The `filter` function actually returns a lazy sequence, displayed at the REPL as a list. Another detail one would be wise to omit from an introductory article.

[2] In Clojure function calls provide arguments in a list, but function definitions gather arguments into vectors.

# Software metric outliers

Goodhart’s law says “When a measure becomes a target, it ceases to be a good measure.” That is, when people are rewarded on the basis of some metric, they’ll learn how to improve that metric, but not necessarily in a way that increases what you’re after. Here are two examples of Goodhart’s law related to software development.

• If you use lines of code to measure software developer productivity, Goodhart’s law predicts that you’ll get more lines of code, but not more productivity. In fact, you decrease productivity by discouraging code reuse.
• If you evaluate developers by number of comments, you’re likely to get verbose, unhelpful comments that make code harder to maintain.

Despite their flaws and the potential for perverse incentives, I claim it’s worth looking at metric outliers.

When I managed a software development team, I ran software that computed the complexity [1] of all the functions in our code base. One function was 100x more complex than anything else anyone had written. That drew my attention to a problem I was unaware of, or at least had underestimated.

If one function is 50% more complex than another, as measured by the software metric, that does not mean that the former is more complex in terms of the mental burden it places on human readers. But a difference of 10,000% is worth investigating. Maybe there’s a good explanation—maybe it’s machine-generated code, for example—but more likely it’s an indication that there’s a problem.

Code reviews are far better than code metrics. But code metrics could suggest code that should be a priority to review.

## More software complexity posts

[1] McCabe complexity, a.k.a. cyclomatic complexity. Essentially the number of paths through a function.
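For concreteness, here is a rough sketch of how such a metric might be computed for Python code. It counts one plus the number of branch points, which is a simplification of McCabe's actual definition; real tools are more careful.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    # Rough McCabe number: one plus the count of branch points.
    # A simplification of the real metric, for illustration only.
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp)
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

src = """
def f(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                pass
    return x
"""
print(cyclomatic_complexity(src))  # 4: one path plus three branch points
```

Running something like this over a code base and sorting by the result is enough to surface the kind of outlier described above.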

# Computational survivalist

Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m not often in that position, but that’s kinda where I am on one project.

## Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use `grep -c`, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use `grep -o` and pipe it to `wc -l`.

The `-o` option tells `grep` to output the regex matches, one per line. So counting the number of lines with `wc -l` gives the number of matches.
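A minimal illustration, with a made-up sample file, shows the difference between the two approaches:

```shell
# Create a small sample file, then count occurrences of 'e',
# not merely the lines that contain one.
printf 'see\nbee\ncat\n' > sample.txt

grep -c 'e' sample.txt          # 2: lines containing an 'e'
grep -o 'e' sample.txt | wc -l  # 4: total occurrences of 'e'
```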

## Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as a computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.

# The hopeless task of the Unicode Consortium

Randall Munroe, author of xkcd, discussing Unicode on the Triangulation podcast:

I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves. … They started out just trying to unify a couple different character sets. And before they quite realized what was happening, they were grappling with decisions at the heart of how we use language, no matter how hard they tried to create policies to avoid these problems. It’s just a fun example of how weird language is and how hard human communication is and how you really can’t really get around those problems. … These are really hard problems and I do not envy them.

Reminds me of Jeffrey Snover’s remark about problems vs dilemmas: problems can be solved, but dilemmas can only be managed. Unicode faces a host of dilemmas.

Regarding Munroe’s comment about Unicode starting out small and getting more ambitious, see the next post for a plot of the number of characters as a function of time and of version number.

# Regular expressions and special characters

Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.

This post goes through an example in detail that shows how to manage special characters in several different contexts.

## Escaping special TeX characters

I recently needed to write a regular expression [1] to escape TeX special characters. I’m reading in text like `ICD9_CODE` and need to make that `ICD9\_CODE` so that TeX will understand the underscore to be a literal underscore and not a subscript instruction.

Underscore isn’t the only special character in TeX. It has ten special characters:

`    \ { } $ & # ^ _ % ~`

The two that people most commonly stumble over are probably `$` and `%` because these are fairly common in ordinary prose. Since `%` begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid; it just effectively cuts off the remainder of the line.

So whenever my script sees a TeX special character that isn’t already escaped, I’d like it to escape it.

## Raw strings

First I need to tell Python what the special characters are for TeX:

`    special = r"\\{}$&#^_%~"`

There’s something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The `r` prefix in front of the quotes tells Python this is a “raw” string and that it should not interpret backslashes as special. It’s saying “I literally want a string that begins with two backslashes.”

Why two backslashes? Wouldn’t one do? We’re about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.

## Lookbehind

Here’s my regular expression:

`    re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)`

I want special characters that have not already been escaped, so I’m using a negative lookbehind pattern. Negative lookbehind expressions begin with `(?<!` and end with `)`. So if, for example, I wanted to look for the string “ball” but only if it’s not preceded by “charity” I could use the regular expression

`    (?<!charity )ball`

This expression would match “foot ball” or “foosball” but not “charity ball”.

Our lookbehind expression is complicated by the fact that the thing we’re looking back for is a special character. We’re looking for a backslash, which is a special character for regular expressions [2].

After looking behind for a backslash and making sure there isn’t one, we look for our special characters. The reason we used two backslashes in defining the variable `special` is so the regular expression engine would see two backslashes and interpret that as one literal backslash.

## Captures

The second argument to `re.sub` tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is `\1`, the second is `\2`, etc.

We want to tell `re.sub` to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it `\\` to represent a literal backslash. When we follow this with `\1` for the first capture, the result is `\\\1` as above.

## Testing

We can test our code above with the following.

`    line = r"a_b $200 {x} %5 x\y"`

and get

`    a\_b \$200 \{x\} \%5 x\\y`

which would cause TeX to produce output that looks like

a_b $200 {x} %5 x\y

Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the `r` in front of the opening quote.
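Putting the pieces together, the whole thing fits in a few lines (the function name here is my own):

```python
import re

special = r"\\{}$&#^_%~"                 # TeX's special characters, regex-escaped
pattern = r"(?<!\\)([" + special + "])"  # a special character not already escaped

def escape_tex(line):
    # Put a backslash in front of each unescaped TeX special character.
    return re.sub(pattern, r"\\\1", line)

print(escape_tex(r"a_b $200 {x} %5 x\y"))  # a\_b \$200 \{x\} \%5 x\\y
```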

## P.S. on raw strings

Note that you don’t have to use raw strings. You could just escape your special characters with backslashes. But we’ve already got a lot of backslashes here, and without raw strings we’d need even more. We’d have to say

`    special = "\\\\{}$&#^_%~"`

starting with four backslashes to send Python two to send the regular expression engine one.

## Related posts

[1] Whenever I write about using regular expressions someone will complain that my solution isn’t completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I’m just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.

[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself “Special to whom?”. It’s especially confusing here because backslash is special to all three languages.

# Contributing to open source projects

David Heinemeier Hansson presents a very gracious view of open source software in his keynote address at RailsConf 2019. And by gracious, I mean gracious in the theological sense.

He says at one point “If I were a Christian …” implying that he is not, but his philosophy of software echoes the Christian idea of grace, a completely free gift rather than something earned. If you want to use my software without giving anything back in return, enjoy. If you’re motivated by gratitude, not obligation, to give something back, that’s great. Works follow grace. Works don’t earn grace.

I was thinking about making a donation to a particular open source project that has been important to my business when I stumbled on DHH’s talk. While watching it, I reconsidered that donation. The software is freely given, no strings attached. I don’t take anything from anyone else by using it. Etc. Then I made the donation anyway, out of a sense of gratitude rather than a sense of obligation.

My biggest contributions to open source software have been unconscious. I had no idea that code from this blog was being used in hundreds of open source projects until Tim Hopper pointed it out.

Most contributions to open source software are in kind, i.e. contributing code. But cash helps too. Here are a couple ideas if you’d like to donate a little cash.

You could buy some swag with a project logo on it, especially some of the more overpriced swag. Maybe your company’s rules would allow such a purchase even though they wouldn’t allow a donation. It feels odd to deliberately buy something overpriced—they want how much for this coffee mug?!—but that’s how the project can make a little money.

If you’d like to make a cash donation, but you’re hesitating because it’s not tax deductible, here’s a possibility: deduct your own tax. Give the after-tax amount that corresponds to the amount you would have given before taxes. For example, suppose you’d like to donate \$100 cash if it were tax deductible, and suppose that your marginal tax rate is 30%. Then donate \$70. It’s the same to you as donating \$100 and saving \$30 on your taxes.

Disclaimer: I am not an accountant or a tax attorney. And in case it ever needs to be said, I’m not a lepidopterist, auto mechanic, or cosmetologist either.

# Professional, amateur, and something else

I opened a blog post a while back by saying

One of the differences between amateur and professional software development is whether you’re writing software for yourself or for someone else. It’s like the difference between keeping a journal and being a journalist.

This morning I saw where someone pulled that quote and I thought about how I’m now somewhere that doesn’t fall into either category. I’m a professional, and a software developer, but not a professional software developer.

People pay me for work that may require writing software, but they usually don’t want my software per se. They want reports, and producing those reports may require me to write software. My code is directly for my own use but indirectly for someone else. Occasionally I will have a client ask me for software, but not often.

In the last few years I’ve been asked to read software more often than I’ve been asked to write it. For example, I was an expert witness on an intellectual property case and had to read a lot of source code. But even that project fit into the pattern above: I wrote some code to help me do my review, but my client didn’t care about that code per se, only what I found with it.

# Format-preserving encryption (FPE) for privacy

The idea of format-preserving encryption is to encrypt data while keeping its form, a sort of encryption in kind. An encrypted credit card number would look like a credit card number, a string of text would be replaced with a string of text, etc.

Format-preserving encryption (FPE) is useful in creating a test or demo database. You want realistic data without having accurate data (at least for sensitive data fields), and using FPE on real data might be the way to go.

If a field is supposed to contain a 10-digit number, say a phone number, you want the test data to also contain a 10-digit number. Otherwise validation software might break. And if that number is a key that links tables together, simply substituting a random number would break the relationships unless the same random replacement was used everywhere. Also, two clear numbers could be replaced with the same randomly chosen value. FPE would be a simple way to avoid these problems.
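To make the idea concrete, here is a toy sketch of FPE for even-length strings of digits, built as a Feistel network over the two halves of the string. All the names are made up, and this is not a vetted cipher; real systems use standardized constructions such as NIST’s FF1 or FF3-1.

```python
import hmac
import hashlib

def _round_value(key: bytes, r: int, half: str, width: int) -> int:
    # Keyed pseudorandom function for one Feistel round (HMAC-SHA256).
    mac = hmac.new(key, f"{r}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(mac, "big") % 10**width

def fpe_encrypt(digits: str, key: bytes, rounds: int = 10) -> str:
    # Toy Feistel cipher over an even-length decimal string: ten digits
    # in, ten digits out. Illustration only, not a vetted cipher.
    h = len(digits) // 2
    left, right = digits[:h], digits[h:]
    for r in range(rounds):
        f = _round_value(key, r, right, h)
        left, right = right, str((int(left) + f) % 10**h).zfill(h)
    return left + right

def fpe_decrypt(digits: str, key: bytes, rounds: int = 10) -> str:
    # Run the rounds in reverse to recover the original digits.
    h = len(digits) // 2
    left, right = digits[:h], digits[h:]
    for r in reversed(range(rounds)):
        f = _round_value(key, r, left, h)
        left, right = str((int(right) - f) % 10**h).zfill(h), left
    return left + right

ct = fpe_encrypt("4155551234", b"demo key")
print(ct)                            # another 10-digit string
print(fpe_decrypt(ct, b"demo key"))  # 4155551234
```

Because the mapping is deterministic for a given key, the same clear number encrypts to the same substitute everywhere, which is exactly the property needed to preserve key relationships between tables.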

FPE is a two-edged sword. It may be desirable to preserve formatting, but it could also cause problems. Using any form of encryption, format-preserving or not, to preserve the database structure could reveal information you don’t intend to reveal.

It’s quite possible to encrypt data and still compromise privacy. If you encrypt data, but not metadata, then you might keep enough information to re-identify individuals. For example, if you encrypt someone’s messages but retain the time stamp on the messages, that might be enough to identify that person.

The meaning of “format-preserving” can vary, and that could inadvertently leak information. What does it mean to encrypt English text in a format-preserving way? It could mean that English words are replaced with English words. If this is done simplistically, then the number of words in the clear text is revealed. Even if a set of English words is replaced with a variable number of English words, you’re still revealing that the original text was English.

FPE may not reveal anything that wasn’t already known. If you know that a field in a database contains a 9-digit number, then encrypting it as a 9-digit number doesn’t reveal anything per se. But it might be a problem if it reveals that two numbers are the same number.

What about errors? What happens if a field that is supposed to be a 9-digit number isn’t? Maybe it contains a non-digit character, or it contains more than 9 digits. The encryption software should report an error. But if it doesn’t, maybe the encrypted output is not a 9-digit number, revealing that there was an error in the input. Maybe that’s a problem, maybe not. Depends on context.

Related: Data privacy consulting

# Software quality: better in practice than in theory

C. A. R. Hoare wrote an article How Did Software Get So Reliable Without Proof? in 1996 that still sounds contemporary for the most part.

In the 1980s many believed that programs could not get much bigger unless we started using formal proof methods. The argument was that bugs are fairly common, and that each bug has the potential to bring a system down. Therefore the only way to build much larger systems was to rely on formal methods to catch bugs. And yet programs continued to get larger and formal methods never caught on. Hoare asks

Why have twenty years of pessimistic predictions been falsified?

Another twenty years later we can ask the same question. Systems have gotten far larger, and formal methods have not become common. Formal methods are used—more on that shortly—but have not become common.

## Better in practice than in theory

It’s interesting that Hoare was the one to write this paper. He is best known for quicksort, a sorting algorithm that works better in practice than in theory! Quicksort is commonly used in practice, even though it has terrible worst-case efficiency, because its average efficiency has optimal asymptotic order [1], and in practice it works better than other algorithms with the same asymptotic order.
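To see why the worst case is so easy to hit, here is a deliberately naive quicksort sketch. The first-element pivot is the textbook choice; practical implementations pick pivots more carefully, for example by randomization or median-of-three.

```python
def quicksort(xs):
    # Naive quicksort, first element as pivot. Expected O(n log n) on
    # random input, but already-sorted input makes every partition
    # maximally lopsided, giving the O(n^2) worst case.
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```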

## Economic considerations

It is logically possible that the smallest bug could bring down a system. And there have been examples, such as the Mars Climate Orbiter, where a single bug did in fact lead to complete failure. But this is rare. Most bugs are inconsequential.

Some will object “How can you be so blasé about bugs? A bug crashed a \$300 million probe!” But what is the realistic alternative? Would spending an additional billion dollars on formal software verification have prevented the crash? Possibly, though not certainly, and the same money could send three more missions to Mars. (More along these lines here.)

It’s all a matter of economics. Formal verification is extremely tedious and expensive. The expense is worth it in some settings and not in others. The software that runs pacemakers is more critical than the software that runs a video game. For most software development, less formal methods have proved more cost effective at achieving acceptable quality: code reviews, unit tests, integration testing, etc.

## Formal verification

I have some experience with formal software verification, including formal methods software used by NASA. When someone says that software has been formally verified, there’s an implicit disclaimer. It’s usually the algorithms that have been formally verified, not the implementation of those algorithms in software. Also, maybe not all the algorithms have been verified, but say 90%, the remaining 10% being too difficult to verify. In any case, formally verified software can and has failed. Formal verification greatly reduces the probability of encountering a bug, but it does not reduce the probability to zero.

There has been a small resurgence of interest in formal methods since Hoare wrote his paper. And again, it’s all about economics. Theorem proving technology has improved over the last 20 years. And software is being used in contexts where the consequences of failure are high. But for most software, the most economical way to achieve acceptable quality is not through theorem proving.

There are also degrees of formality. Full theorem proving is extraordinarily tedious. If I remember correctly, one research group said that they could formally verify about one page of a mathematics textbook per man-week. But there’s a continuum between full formality and no formality. For example, you could have formal assurance that your software satisfies certain conditions, even if you can’t formally prove that the software is completely correct. Where you want to be along this continuum of formality is again a matter of economics. It depends on the probability and consequences of errors, and the cost of reducing these probabilities.

## Related posts

[1] The worst-case performance of quicksort is O(n²) but the average performance is O(n log n).

Photograph of C. A. R. Hoare by Rama, Wikimedia Commons, Cc-by-sa-2.0-fr, CC BY-SA 2.0 fr

# The right level of abstraction

Mark Dominus wrote a blog post yesterday entitled Why I never finish my Haskell programs (part 1 of ∞). In a nutshell, there’s always another layer of abstraction. “Instead of just adding lists of numbers, I can do addition-like operations on list-like containers of number-like things!”
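His quip can be illustrated outside Haskell as well. Here is a rough Python sketch (the function name is made up): `reduce` assumes only that `+` is defined on the elements, so one piece of code adds numbers, concatenates strings, and merges lists.

```python
from functools import reduce
from operator import add

def total(xs, zero):
    # "Addition-like operations on list-like containers of number-like
    # things": reduce assumes only that + is defined on the elements.
    return reduce(add, xs, zero)

print(total([1, 2, 3], 0))         # 6
print(total(["a", "b", "c"], ""))  # abc
print(total([[1], [2, 3]], []))    # [1, 2, 3]
```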

Is this a waste of time? It depends entirely on context.

I can think of two reasons to pursue high levels of abstraction. One is reuse. You have multiple instances of things that you want to handle simultaneously. The other reason is clarity. Sometimes abstraction makes things simpler, even if you only have one instance of your abstraction. Dijkstra had the latter in mind when he said

The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.

Both of these can backfire. You could make your code so reusable (in your mind) that nobody else wants to use it. Your bird’s eye view can become a Martian’s eye view that loses essential details. [1]

It’s easy, and often appropriate, to criticize high levels of abstraction. I could imagine asking “Just how often do you need to do addition-like operations on list-like containers of number-like things? We’ve got to ship by Friday. Why don’t you just add lists of numbers for now?”

And yet, sometimes what seems like excessive abstraction can pay off. I remember an interview with John Tate a few years ago in which he praised Alexander Grothendieck.

He just had an instinct for the right degree of generality. Some people make things too general, and they’re not of any use. But he just had an instinct to put whatever theory he thought about in the most general setting that was still useful. Not generalization for generalization’s sake but the right generalization. He was unbelievable.

I was taken aback by Tate saying that Grothendieck found just the right level of abstraction. But Tate is in a position to judge and I am not.

From my perspective, Grothendieck’s work, what glimpses I’ve seen, looks gratuitously abstract. Basic category theory is about as abstract as my mind can go, but category theory was the floor of the castle Grothendieck built in the sky. And yet he built his castle to solve specific challenging problems in number theory, and succeeded. (Maybe his castle in the sky turned into a Winchester Mansion later in life. I can’t say.)

***

[1] I’m more sympathetic to the clarity argument than the reuse argument. The former gives immediate feedback. You try something because you think it will make things more clear. Did it, at least in your opinion? Does anyone else find it helpful? But reuse is speculative because it happens in the future. (If you have several cases in hand that you want to handle uniformly, that’s a safer bet. You might just call that “use” rather than “reuse.” My skepticism is more about anticipated reuse.)

In software development in particular, I believe it’s easier to make your code re-editable than reusable. It’s easier to predict that code will need to do something different in the future than it is to predict exactly what that something will be.