Software analysis and synthesis

People who haven’t written large programs think that writing software is easy. All you have to do is break a big problem into smaller problems until you have something so small that it’s easy to program.

The problem is putting the pieces back together. If you’ve only written small programs, you haven’t had many pieces to put together. It’s harder to put the pieces together when you write a large program by yourself. It’s even harder when you work on a large program with other people.

Synthesis is harder than analysis. Or as Perdita Stevens put it, integration is harder than separation.

The image above is a screenshot from her keynote at the RC2020 conference on reversible computation.

Related post: The cost of taking things apart and putting them back together.

Randomization audit

clipboard list of names

“How would you go about drawing a random sample?”

I thought that was kind of a silly question. I was in my first probability class in college, and the professor started the course with this. You just take a sample, right?

As with many things in life, it gets more complicated when you get down to business. Random sample of what? Sampled according to what distribution? What kind of constraints are there?

How would you select a random sample of your employees in a way that you could demonstrate was not designed to include any particular employee? How can you not only be fair, but convince people who may not understand probability that you were fair?

When there appear to be patterns in your random selections, how can you decide whether there has been an error or whether the results that you didn’t expect actually should have been expected?

How do you sample data that isn’t simply a list but instead has some sort of structure? Maybe you have data split into tables in a relational database. Maybe you have data that naturally forms a network. How can you sample from a relational database or a network in a way that respects the underlying structure?

How would you decide whether the software you’re using for randomization is adequate for your purposes? If it is adequate, are you using it the right way?

These are all questions that I’ve answered for clients. Sometimes they want me to develop randomization procedures. Sometimes they want me to audit the procedures they’re already using. They not only want good procedures, they want procedures that can stand up to critical review.

If you’d like me to design or audit randomization procedures for your company, let’s talk. You will get the assurance that you’re doing the right thing, and the ability to demonstrate that you’re doing the right thing.

Pretending OOP never happened

I ran across someone recently who says the way to move past object oriented programming (OOP) is to go back to simply telling the computer what to do, to clear OOP from your mind like it never happened. I don’t think that’s a good idea, but I also don’t think it’s possible.

Object oriented programming, for all its later excesses, was a big step forward in software engineering. It made it possible to develop much larger programs than before, maybe 10x larger. Someone might counter by saying that programs had to be 10x larger because of all the overhead of OOP, but that’s not the case. OOP does add some overhead, and the amount of overhead grew over time as tools and frameworks became more complicated, but OOP made it possible to write programs that could not have been written before.

OOP provides a way for programmers to organize their code. It may not be the best way, depending on the problem, but the way to move past OOP is to replace it with another discipline. And I imagine most people who have learned and then rejected OOP do just that, whether they realize it or not. Maybe they retain some organizational patterns that they learned in the context of OOP.

That has been my experience. I hardly ever write classes anymore; I write functions. But I don’t write functions quite the way I did before I spent years writing classes.

And while I don’t often write classes, I do often use classes that come from libraries. Sometimes these objects seem like they’d be better off as bare functions, but I imagine the same libraries would be harder to use if no functions were wrapped in objects.

There are many alternatives to OOP for organizing code. I generally like functional programming, but in my experience there’s a hockey stick effort curve as you try to push the purity to 100%. James Hague said it well:

100% pure functional programing doesn’t work. Even 98% pure functional programming doesn’t work. But if the slider between functional purity and 1980s BASIC-style imperative messiness is kicked down a few notches — say to 85% — then it really does work. You get all the advantages of functional programming, but without the extreme mental effort and unmaintainability that increases as you get closer and closer to perfectly pure.

It’s possible, and a good idea, to develop large parts of a system in purely functional code. But someone has to write the messy parts that interact with the outside world.

More programming posts

Short essays on programming languages

I saw a link to So You Think You Know C? by Oleksandr Kaleniuk on Hacker News and was pleasantly surprised. I expected a few comments about tricky parts of C, and found them, but there’s much more. The subtitle of the free book is And Ten More Short Essays on Programming Languages. Good reads.

This post gives a few of my reactions to the essays, my even shorter essays on Kaleniuk’s short essays.

My C

The first essay is about undefined parts of C. That essay, along with this primer on C obfuscation that I also found on Hacker News today, is enough to make anyone run screaming away from the language. And yet, in practice I don’t run into any of these pitfalls and find writing C kinda pleasant.

I have an atypical amount of freedom, and that colors my experience. I don’t maintain code that someone else has written—I paid my dues doing that years ago—and so I simply avoid using any features I don’t fully understand. And I usually have my choice of languages, so I use C only when there’s a good reason to use C.

I would expect that all these dark corners of C would be accidents waiting to happen. Even if I don’t intentionally use undefined or misleading features of the language, I could use them accidentally. And yet in practice that doesn’t seem to happen. C, or at least my personal subset of C, is safer in practice than in theory.

APL

The second essay is on APL. It seems that everyone who programs long enough eventually explores APL. I downloaded Iverson’s ACM lecture Notation as a Tool of Thought years ago and keep intending to read it. Maybe if things slow down I’ll finally get around to it. Kaleniuk said something about APL I hadn’t heard before:

[APL] didn’t originate as a computer language at all. It was proposed as a better notation for tensor algebra by Harvard mathematician Kenneth E. Iverson. It was meant to be written by hand on a blackboard to transfer mathematical ideas from one person to another.

There’s one bit of notation that Iverson introduced that I use fairly often, his indicator function notation described here. I used it a report for a client just recently where it greatly simplified the write-up. Maybe there’s something else I should borrow from Iverson.

Fortran

I last wrote Fortran during the Clinton administration and never thought I’d see it again, and yet I expect to need to use it on a project later this year. The language has modernized quite a bit since I last saw it, and I expect it won’t be that bad to use.

Apparently Fortran programmers are part of the dark matter of programmers, far more numerous than you’d expect based on visibility. Kaleniuk tells the story of a NASA programming competition in which submissions had to be written in Fortran. NASA cancelled the competition because they were overwhelmed by submissions.

Syntax

In his last essay, Kaleniuk gives some ideas for what he would do if he were to design a new language. His first point is that our syntax is arbitrarily constrained. We still use the small collection of symbols that were easy to input 50 years ago. As a result, symbols are highly overloaded. Regular expressions are a prime example of this, where the same character has to play multiple roles in various contexts.

I agree with Kaleniuk in principle that we should be able to expand our vocabulary of symbols, and yet in practice this hasn’t worked out well. It’s possible now, for example, to use λ than lambda in source code, but I never do that.

I suspect the reason we stick to the old symbols is that we’re stuck at a local maximum: small changes are not improvements. A former client had a Haskell codebase that used one non-ASCII character, a Greek or Russian letter if I remember correctly. The character was used fairly often and it did made the code slightly easier to read. But it wreaked havoc with the tool chain and eventually they removed it.

Maybe a wholehearted commitment using more symbols would be worth it; it would take no more effort to allow 100 non-ASCII characters than to allow one. For that matter, source code doesn’t even need to be limited to text files, ASCII or Unicode. But efforts along those lines have failed too. It may be another local maximum problem. A radical departure from the status quo might be worthwhile, but there’s not a way to get there incrementally. And radical departures nearly always fail because they violate Gall’s law: A complex system that works is invariably found to have evolved from a simple system that worked.

Related posts

A wrinkle in Clojure

Bob Martin recently posted a nice pair of articles, A Little Clojure and A Little More Clojure. In the first article he talks about how spare and elegant Clojure is.

In the second article he shows how to write a program to list primes using map and filter rather than if and while. He approaches the example leisurely, showing how to arrive at the solution in small steps understandable to someone new to Clojure.

The second article passes over a small wrinkle in Clojure that I’d like to say a little about. Near the top of the post we read

The filter function also takes a function and a list. (filter odd? [1 2 3 4 5]) evaluates to (1 3 5). From that I think you can tell what both the filter and the odd? functions do.

Makes sense: given a bunch of numbers, we pick out the ones that are odd. But look a little closer. We start with [1 2 3 4 5] in square brackets and end with (1 3 5) in parentheses. What’s up with that? The article doesn’t say, and rightfully so: it would derail the presentation to explain this subtlety. But it’s something that everyone new to Clojure will run into fairly quickly.

As long as we’re vague about what “a bunch of numbers” means, everything looks fine. But when we get more specific, we see that filter takes a vector of numbers and returns a list of numbers [1]. Or rather it can take a vector, as it does above; it could also take a list. There are reasons for this, explained here, but it’s puzzling if you’re new to the language.

There are a couple ways to make the filter example do what you’d expect, to either have it take a vector and return a vector, or to have it take a list and return a list. Both would have interrupted the flow of an introductory article. To take a vector and return a vector you could run

    (filterv odd? [1 2 3 4 5])

This returns the vector [1 3 5].

Notice we use the function filterv rather than filter. If the article had included this code, readers would ask “Why does filter have a ‘v’ on the end? Why isn’t it just called ‘filter’?”

To take a list and return a list you could run

    (filter odd? '(1 2 3 4 5))

This returns the list (1 3 5).

But if the article had written this, readers would ask “What is the little mark in front of (1 2 3 4 5)? Is that a typo? Why didn’t you just send it the list?” The little mark is a quote and not a typo. It tells Clojure that you are passing in a list and not making a function call.

One of the core principles of Lisp is that code and data use the same structure. Everything is a list, hence the name: LISP stands for LISt Processing. Clojure departs slightly from this by distinguishing vectors and lists, primarily for efficiency. But like all Lisps, function calls are lists, where the first element of the list is the name of the function and the remaining elements are arguments [2]. Without the quote mark in the example above, the Clojure compiler would look in vain for a function called 1 and throw an exception:

CompilerException java.lang.RuntimeException: Unable to resolve symbol: quote in this context, compiling: …

Since vectors are unambiguously containers, never used to indicate function calls, there’s no need for vectors to be quoted, which simplifies the exposition in an introductory article.

More Lisp posts

[1] The filter function actually returns a lazy sequence, displayed at the REPL as a list. Another detail one would be wise to omit from an introductory article.

[2] In Clojure function calls provide arguments in a list, but function definitions gather arguments into vectors.

Software metric outliers

Goodhart’s law says “When a measure becomes a target, it ceases to be a good measure.” That is, when people are rewarded on the basis of some metric, they’ll learn how to improve that metric, but not necessarily in a way that increases what you’re after. Here are three examples of Goodhart’s law related to software development.

  • If you use lines of code to measure software developer productivity, Goodhart’s law predicts that you’ll get more lines of code, but not more productivity. In fact, you decrease productivity by discouraging code reuse.
  • If you evaluate developers by number of comments, you’re likely to get verbose, unhelpful comments that make code harder to maintain.

Despite their flaws and the potential for perverse incentives, I claim it’s worth looking at metric outliers.

When I managed a software development team, I ran software that computed the complexity [1] of all the functions in our code base. One function was 100x more complex than anything else anyone had written. That drew my attention to a problem I was unaware of, or at least had underestimated.

If one function is 50% more complex than another, as measured by the software metric, that does not mean that the former is more complex in terms of the mental burden it places on human readers. But a difference of 10,000% is worth investigating. Maybe there’s a good explanation—maybe it’s machine-generated code, for example—but more likely it’s an indication that there’s a problem.

Code reviews are far better than code metrics. But code metrics could suggest code that should be a priority to review.

More software complexity posts

[1] McCabe complexity, a.k.a. cyclomatic complexity. Essentially the number of paths through a function.

Computational survivalist

survival gear

Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m rarely in that position, but that’s kinda where I am on one project.

Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use grep -c, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use grep -o and pipe it to wc -l.

The -o option tells grep to output the regex matches, one per line. So counting the number of lines with wc -l gives the number of matches.

Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.

More computational tool posts

The hopeless task of the Unicode Consortium

Randall Munroe, author of xkcd, discussing Unicode on the Triangulation podcast:

I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves. … They started out just trying to unify a couple different character sets. And before they quite realized what was happening, they were grappling with decisions at the heart of how we use language, no matter how hard they tried to create policies to avoid these problems. It’s just a fun example of how weird language is and how hard human communication is and how you really can’t really get around those problems. … These are really hard problems and I do not envy them.

Reminds me of Jeffrey Snover’s remark about problems vs dilemmas: problems can be solved, but dilemmas can only be managed. Unicode faces a host of dilemmas.

Regarding Munroe’s comment about Unicode starting out small and getting more ambitious, see the next post for a plot of the number of characters as a function of time and of version number.

Related posts

Regular expressions and special characters

Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.

This post goes through an example in detail that shows how to manage special characters in several different contexts.

Escaping special TeX characters

I recently needed to write a regular expression [1] to escape TeX special characters. I’m reading in text like ICD9_CODE and need to make that ICD9\_CODE so that TeX will understand the underscore to be a literal underscore, and a subscript instruction.

Underscore isn’t the only special character in TeX. It has ten special characters:

    \ { } $ & # ^ _ % ~

The two that people most commonly stumble over are probably $ and % because these are fairly common in ordinary prose. Since % begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid. It just effectively cuts off the remainder of the line.

So whenever my script sees a TeX special character that isn’t already escaped, I’d like it to escape it.

Raw strings

First I need to tell Python what the special characters are for TeX:

    special = r"\\{}$&#^_%~"

There’s something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The r prefix in front of the quotes tells Python this is a “raw” string and that it should not interpret backslashes as special. It’s saying “I literally want a string that begins with two backslashes.”

Why two backslashes? Wouldn’t one do? We’re about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.

Lookbehind

Here’s my regular expression:

    re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

I want special characters that have not already been escaped, so I’m using a negative lookbehind pattern. Negative lookbehind expressions begin with (?<! and end with ). So if, for example, I wanted to look for the string “ball” but only if it’s not preceded by “charity” I could use the regular expression

    (?<!charity )ball

This expression would match “foot ball” or “foosball” but not “charity ball”.

Our lookbehind expression is complicated by the fact that the thing we’re looking back for is a special character. We’re looking for a backslash, which is a special character for regular expressions [2].

After looking behind for a backslash and making sure there isn’t one, we look for our special characters. The reason we used two backslashes in defining the variable special is so the regular expression engine would see two backslashes and interpret that as one literal backslash.

Captures

The second argument to re.sub tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is \1, the second is \2, etc.

We want to tell re.sub to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it \\ to represent a literal backslash. When we follow this with \1 for the first capture, the result is \\\1 as above.

Testing

We can test our code above on with the following.

    line = r"a_b $200 {x} %5 x\y"

and get

    a\_b \$200 \{x\} \%5 x\\y

which would cause TeX to produce output that looks like

a_b $200 {x} %5 x\y.

Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the r in front of the opening quote.

P.S. on raw strings

Note that you don’t have to use raw strings. You could just escape your special characters with backslashes. But we’ve already got a lot of backslashes here. Without raw strings we’d need even more. Without raw strings we’d have to say

    special = "\\\\{}$&#^_%~"

starting with four backslashes to send Python two to send the regular expression engine one.

Related posts

[1] Whenever I write about using regular expressions someone will complain that my solution isn’t completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I’m just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.

[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself “Special to whom?”. It’s especially confusing here because backslash is special to all three languages.

Contributing to open source projects

David Heinemeier Hansson presents a very gracious view of open source software in his keynote address at RailsConf 2019. And by gracious, I mean gracious in the theological sense.

He says at one point “If I were a Christian …” implying that he is not, but his philosophy of software echos the Christian idea of grace, a completely free gift rather than something earned. If you want to use my software without giving anything back in return, enjoy. If you’re motivated by gratitude, not obligation, to give something back, that’s great. Works follow grace. Works don’t earn grace.

I was thinking about making a donation to a particular open source project that has been important to my business when I stumbled on DHH’s talk. While watching it, I reconsidered that donation. The software is freely given, no strings attached. I don’t take anything from anyone else by using it. Etc. Then I made the donation anyway, out of a sense of gratitude rather than a sense of obligation.

My biggest contributions to open source software have been unconscious. I had no idea that code from this blog was being used in hundreds of open source projects until Tim Hopper pointed it out.

Most contributions to open source software are in kind, i.e. contributing code. But cash helps too. Here are a couple ideas if you’d like to donate a little cash.

You could buy some swag with a project logo on it, especially some of the more overpriced swag. Maybe your company rules would allow this that wouldn’t allow making a donation. It feels odd to deliberately buy something overpriced—they want how much for this coffee mug?!—but that’s how the project can make a little money.

If you’d like to make a cash donation, but you’re hesitating because it’s not tax deductible, here’s a possibility: deduct your own tax. Give the after-tax amount that corresponds to the amount you would have given before taxes. For example, suppose you’d like to donate $100 cash if it were tax deductible, and suppose that your marginal tax rate is 30%. Then donate $70. It’s the same to you as donating $100 and saving $30 on your taxes.

Disclaimer: I am not an accountant or a tax attorney. And in case it ever needs to be said, I’m not a lepidopterist, auto mechanic, or cosmetologist either.