Set intersection and difference at the command line

A few years ago I wrote about comm, a utility that lets you do set theory at the command line. It’s a really useful little program, but it has two drawbacks: the syntax is hard to remember, and the input files must be sorted.

If A and B are two sorted lists,

    comm A B

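prints three columns: A − B (lines only in A), B − A (lines only in B), and A ∩ B (lines in both). You usually don’t want all three, and so comm lets you filter the output. It’s a little quirky in that you specify the columns you don’t want instead of the ones you do: the options -1, -2, and -3 suppress the corresponding columns, where 1, 2, and 3 correspond to A − B, B − A, and A ∩ B respectively. So comm -12 A B, for example, prints only A ∩ B.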
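To make the column numbering concrete, here is a small sketch that runs comm with each pair of columns suppressed. The file names A and B and their contents are made up for illustration, and the example assumes mktemp is available.

```shell
# Build two small, already-sorted lists in a scratch directory.
# (The names A and B and the fruit lists are illustrative only.)
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/A"
printf '%s\n' banana cherry date  > "$tmp/B"

only_a=$(comm -23 "$tmp/A" "$tmp/B")   # suppress columns 2 and 3, leaving A − B
only_b=$(comm -13 "$tmp/A" "$tmp/B")   # suppress columns 1 and 3, leaving B − A
both=$(comm -12 "$tmp/A" "$tmp/B")     # suppress columns 1 and 2, leaving A ∩ B

printf 'A - B:\n%s\n' "$only_a"
printf 'B - A:\n%s\n' "$only_b"
printf 'A and B:\n%s\n' "$both"

rm -rf "$tmp"
```

With these inputs, A − B is apple, B − A is date, and A ∩ B is banana and cherry.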

[Figure: Venn diagram of comm parameters]

A couple of little scripts can hide the quirks. I have a script intersect

    comm -12 <(sort "$1") <(sort "$2")

and another script setminus

    comm -23 <(sort "$1") <(sort "$2")

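that sort the input files on the fly and eliminate the need to remember comm’s filtering syntax. The <(...) syntax is process substitution, which requires bash or a similar shell such as zsh or ksh; it is not part of POSIX sh.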

The setminus script computes A − B. To find B − A call the script with the arguments reversed.
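If process substitution isn't available, the same idea can be sketched in plain sh. This is not the original scripts but a hypothetical variant written as shell functions: one input is sorted into a temporary file, and the other is sorted and piped to comm, which reads standard input when given - as a file name. It assumes mktemp is installed, and the sample data is made up.

```shell
# Hypothetical sh-compatible versions of intersect and setminus.
# comm reads standard input when one of its file arguments is "-".
intersect() {
    tmp=$(mktemp) || return 1
    sort "$1" > "$tmp"
    sort "$2" | comm -12 "$tmp" -   # keep only column 3: lines in both
    rm -f "$tmp"
}

setminus() {
    tmp=$(mktemp) || return 1
    sort "$1" > "$tmp"
    sort "$2" | comm -23 "$tmp" -   # keep only column 1: lines only in $1
    rm -f "$tmp"
}

# Unsorted sample inputs (contents are illustrative only).
dir=$(mktemp -d)
printf '%s\n' cherry apple banana > "$dir/A"
printf '%s\n' date cherry banana  > "$dir/B"

common=$(intersect "$dir/A" "$dir/B")   # A ∩ B
a_not_b=$(setminus "$dir/A" "$dir/B")   # A − B
b_not_a=$(setminus "$dir/B" "$dir/A")   # B − A: arguments reversed

printf 'intersection:\n%s\n' "$common"
printf 'A - B: %s\n' "$a_not_b"
printf 'B - A: %s\n' "$b_not_a"

rm -rf "$dir"
```

The trade-off is a temporary file per call in exchange for not depending on bash.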
