Don’t be a technical masochist

There’s an old joke from Henny Youngman:

I told the doctor I broke my leg in two places. He told me to quit going to those places.

Sometimes tech choices are that easy: if something is too hard, stop doing it. A great deal of pain comes from using a tool outside its intended use, and often that’s avoidable.

For example, when regular expressions get too hard, I stop using regular expressions and write a little procedural code. Or when Python is too slow, I try some simple ways of speeding it up, and if that’s not good enough I switch from Python to C++. If something is too hard to do in Windows, I’ll do it in Linux, and vice versa.

Sometimes there’s not a better tool available and you just have to slog through with what you have. And sometimes you don’t have the freedom to use a better tool even though one is available. But a lot of technical pain is self-imposed. If you keep breaking your leg somewhere, stop going there.

“Conventional” is relative

I found this line from Software Foundations amusing:

… we can ask Coq to “extract,” from a Definition, a program in some other, more conventional, programming language (OCaml, Scheme, or Haskell) with a high-performance compiler.

Most programmers would hardly consider OCaml, Scheme, or Haskell “conventional” programming languages, but they are conventional relative to Coq. As the authors said, these languages are “more conventional,” not “conventional.”

I don’t mean to imply anything negative about OCaml, Scheme, or Haskell. They have their strengths — I briefly mentioned the advantages of Haskell just yesterday — but they’re odd birds from the perspective of the large majority of programmers who work in C-like languages.

Real World Haskell

I’m reading Real World Haskell because one of my clients’ projects is written in Haskell. Some would say that “real world Haskell” is an oxymoron because Haskell isn’t used in the real world, as illustrated by a recent xkcd cartoon.

It’s true that Haskell accounts for a tiny portion of the world’s commercial software and that the language is more popular in research. (There would be no need to put “real world” in the title of a book on PHP, for example. You won’t find a lot of computer science researchers using PHP for its elegance and nice theoretical properties.) But people do use Haskell on real projects, particularly when correctness is a high priority.[1] In any case, Haskell is “real world” for me since one of my clients uses it. As I wrote about before, applied is in the eye of the client.

I’m not that far into Real World Haskell yet, but so far it’s just what I was looking for. Another book I’d recommend is Graham Hutton’s Programming in Haskell. It makes a good introduction to Haskell because it’s small (184 pages) and focused on the core of the language, not so much on “real world” complications.

A very popular introduction to Haskell is Learn You a Haskell for Great Good. I have mixed feelings about that one. It explains most things clearly and the informal tone makes it easy to read, but the humor becomes annoying after a while. It also introduces some non-essential features of the language up front that could wait until later or be left out of an introductory book.

* * *

[1] Everyone would say that it’s important for their software to be correct. But in practice, correctness isn’t always the highest priority, nor should it be necessarily. As the probability of error approaches zero, the cost of development approaches infinity. You have to decide what probability of error is acceptable given the consequences of the errors.

It’s more important that the software embedded in a pacemaker be correct than the software that serves up this blog. My blog fails occasionally, but I wouldn’t spend $10,000 to cut the error rate in half. Someone writing pacemaker software would jump at the chance to reduce the probability of error so much for so little money.

On a related note, see Maybe NASA could use some buggy software.

Book review: Practical Data Analysis

Many people have drawn Venn diagrams to locate machine learning and related ideas in the intellectual landscape. Drew Conway’s diagram may have been the first. It has at least been frequently referenced.

By this classification, Hector Cuesta’s new book Practical Data Analysis is located toward the “hacking skills” corner of the diagram. No single book can cover everything, and this one emphasizes practical software knowledge more than mathematical theory or details of a particular problem domain.

The biggest strength of the book may be that it brings together in one place information on tools that are used together but whose documentation is scattered. The book is a great source for sample code. The source code is available on GitHub, though it’s more understandable in the context of the book.

Much of the book uses Python and related modules and tools including:

  • NumPy
  • mlpy
  • PIL
  • twython
  • Pandas
  • NLTK
  • IPython
  • Wakari

It also uses D3.js (with JSON, CSS, HTML, …), MongoDB (with MapReduce, Mongo Shell, PyMongo, …), and miscellaneous other tools and APIs.

There’s a lot of material here in 360 pages, making it a useful reference.

* * *

For daily tips on data science, follow @DataSciFact on Twitter.


10 software tool virtues

Diomidis Spinellis gives a list of 10 software tool sins in The Tools at Hand episode of his Tools of the Trade podcast. Here are his points, but turned around. For each sin he lists, I give the opposite as a virtue.

10. Maintain API documentation with the source code.

9. Integrate unit testing in development.

8. Track bugs electronically.

7. Let the compiler do what it can do better than you.

6. Learn how to script your tools to work together.

5. Pay attention to compiler warnings and fix them.

4. Use a version control system.

3. Use tools to find definitions rather than scanning for them.

2. Use a debugger.

1. Use tools that eliminate repetitive manual editing.

I turned the original list around because I believe it’s easier to agree that the things above are good than it is to see that their lack is bad. Some items are opposites, like #5: you either pay attention to warnings or you ignore them. But some are not, like #8. Tracking bugs electronically is a good idea, but I wouldn’t call tracking bugs on paper a “sin.”

Related post: Reducing development friction comments on another podcast from Diomidis Spinellis.

For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.


Reducing development friction

Diomidis Spinellis gave an insightful list of ways to reduce software development friction in the Tools of the Trade podcast episode The Frictionless Development Environment Scorecard.

The first item on his list grabbed my attention:

Are my personal settings and preferences consistent on all the computers I’m using? Are they stored under version control? Can I install them on a new computer using a single command?

Listening to the podcast provoked me to finally sync my .emacs files on all my computers so that I now have the exact same file on all computers, maintained under version control. (Xah Lee gave me some sample code for creating the branching logic I needed for a few differences between Windows and Linux.)

Here is a small sample of questions from the podcast.

  • Are my files getting backed up? Is the backup tested, accessible, off site, in multiple media, with regularly retained copies?
  • Can I use the same editor for all my code and documentation editing tasks?
  • Can I get context-sensitive help and code completion?
  • Can I search recursively down a directory tree? Ignoring case? Only in a subset of files? With a regular expression?
  • Can I open a shell from the graphical file explorer and vice versa?
  • Can I quickly build the application I’m working on after a change? Can I test the application with a single command?
  • Can I automatically check my code for common or tricky errors? Are these checks run by default? Are they clean?
  • Does my application log its actions?
  • Is documentation for the tools and APIs I use readily available? Is it hyperlinked? Available offline?

The last question from the podcast summarizes the whole list:

Do I regularly evaluate my development environment to pinpoint and eliminate the sources of friction? Do I help my colleagues do the same?

Naming collections

When you have an array of things, do you name the array with a plural noun because it contains many things, or do you name it with a singular noun because each thing it contains is singular? For example, if you have a collection of words, should you name it words or word?

Does it make any difference if you’re using some container other than an array? For example if you have a dictionary (a.k.a. map, hash, associative array, etc.) counting word frequencies, should it be count or counts?

I’ve never had a convention that I consciously follow. But I’ve often stopped to wonder which way I should name things. One approach may look right when I declare a variable and another when I use it.

Damian Conway has a reasonable suggestion in his book Perl Best Practices. (There are many things in that book that are good advice for people who never touch Perl.) He recommends using plural names for most arrays and singular names for dictionaries and arrays used like dictionaries.

Because hash entries are typically accessed individually, it makes sense for the hash itself to be named in the singular. That convention causes the individual accesses to read more naturally in the code. … On the other hand, array values are more often processed collectively … So it makes sense to name them in the plural, after the group of items they store. … If, however, an array is to be used as a random-access look-up table, name it in the singular, using the same conventions as a hash.
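Here is a minimal sketch of how that convention reads in Python rather than Perl. The variable names and the sample sentence are my own illustration, not taken from the book.

```python
from collections import Counter

# Plural name: the list is usually processed as a whole.
words = "the quick brown fox jumps over the lazy dog the end".split()
lengths = [len(word) for word in words]

# Singular name: the dictionary is usually accessed one entry at a time,
# so count["the"] reads as "the count of 'the'".
count = Counter(words)

print(count["the"])
print(lengths[0])
```

The plural name reads naturally in the loop ("for word in words"), and the singular name reads naturally at each individual lookup.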

Tragedies and messes

Dorothy Parker said “It’s not the tragedies that kill us; it’s the messes.”

Sometimes that’s how I feel about computing. I think of messes such as having to remember that arc tangent is atan in R and Python, but arctan in NumPy and a in bc. Or that C, Python, and Perl use else if, elif, and elsif respectively. Or did I switch those last two?

These trivial but innumerable messes keep us from devoting our full energy to bigger problems.

One way to reduce these messes is to use fewer tools. Then you know less to be confused about. If you only use Python, for example, then elif is just how it is. But knowing more tools is worth the added mess, up to a point. Past some point, however, new tools add more mental burden than utility. You have to find the optimal combination of tools for yourself, and that combination will change over time.

To use fewer tools, you may need to use more complex tools. Maybe you can replace a list of moderately complex but inconsistent tools with one tool that is more complex but internally consistent.

Perl as a better …

Today I ran across Minimal Perl: For UNIX and Linux People. The book was published a few years ago but I hadn’t heard of it because I haven’t kept up with the Perl world. The following chapters from the table of contents jumped out at me because I’ve been doing a fair amount of awk and sed lately:

3. Perl as a (better) grep command
4. Perl as a (better) sed command
5. Perl as a (better) awk command
6. Perl as a (better) find command

These chapters can be read a couple ways. The most obvious reading would be “Learn a few features of Perl and use it as a replacement for a handful of separate tools.”

But if you find these tools familiar and are not looking to replace them, you could read the book as saying “Here’s an introduction to Perl that teaches you the language by comparing it to things you already know well.”

The book suggests learning one tool instead of several, and in the bargain getting more powerful features, such as more expressive pattern matching. It also suggests not necessarily committing to learn the entire enormous Perl language, and not necessarily committing to use Perl for every programming task.

Regarding Perl’s pattern matching, I could relate to the following quip from the book.

What’s the only thing worse than not having a particular metacharacter … in a pattern-matching utility? Thinking you do, when you don’t! Unfortunately, that’s a common problem when using Unix utilities for pattern matching.

That was my experience just yesterday. I wrote a regular expression containing \d for a digit and couldn’t understand why it wasn’t matching.

Most of the examples rely on giving Perl command line options such as -e so that it acts more like a command line utility. The book gives numerous examples carrying out common tasks in grep etc. and with Perl one-liners. The latter tend to be a little more verbose. If a task falls in the sweet spot of a common tool, that tool’s syntax will be more succinct. But when a task falls outside that sweet spot, such as matching a pattern that cannot be easily expressed with traditional regular expressions, the Perl solution will be shorter.



How to avoid shell scripting

Suppose you know a scripting language (Perl, Python, Ruby, etc) and you’d rather not learn shell scripting (bash, PowerShell, batch, etc.). Or maybe you know shell scripting on one platform and don’t want to take the time right now to learn shell scripting on another platform. For example, maybe you know bash on Linux but don’t want to learn PowerShell on Windows, or vice versa.

One strategy would be to use your preferred language to generate shell scripts. Shell scripts are trivial when they’re just a list of commands: do this, do this, do this, etc. Where shell scripting gets more complicated is when you have variables, branching logic, library calls, all the stuff you already know how to do in another language. Maybe you could do all the complicated logic in your “native language” and just generate a shell script that’s simply a list of instructions with no other logic.
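As a rough sketch of the code generation idea, the Python below decides which files to copy and emits a script that is nothing but a flat list of commands. The function name make_script, the file names, and the rule for skipping .tmp files are all hypothetical, chosen only for illustration.

```python
import shlex

def make_script(filenames, backup_dir):
    """Return the text of a shell script that copies each file to backup_dir."""
    lines = ["#!/bin/sh", f"mkdir -p {shlex.quote(backup_dir)}"]
    for name in filenames:
        # The branching happens here in Python, not in the shell.
        if name.endswith(".tmp"):
            continue
        # shlex.quote protects names containing spaces or shell metacharacters.
        lines.append(f"cp {shlex.quote(name)} {shlex.quote(backup_dir)}")
    return "\n".join(lines) + "\n"

script = make_script(["notes.txt", "scratch.tmp", "my report.doc"], "backup")
print(script)
```

The generated script contains no variables or conditionals, so there is very little shell syntax to get wrong. Note that shlex.quote targets POSIX shells; generating PowerShell or batch files would need different quoting.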

Another strategy is to make system calls from your preferred language. Most scripting languages have a system() function that takes a string and executes it as a system command. The advantage of this approach is that it could have conditional logic that the code generation approach could not handle. The disadvantage is that you have to sort out what process the system() call is running under etc.
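For instance, here is a minimal sketch in Python. It uses subprocess.run from the standard library rather than a bare system() call, since that makes it easy to capture output and to fail loudly on errors. The helper name run is my own, and the child command is just the Python interpreter itself so that the example does not depend on any particular platform.

```python
import subprocess
import sys

def run(command):
    """Run a command given as a list of arguments and return its standard output."""
    # check=True raises CalledProcessError if the command exits with a nonzero status.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout

# Use the current Python interpreter as the external program.
output = run([sys.executable, "-c", "print('hello from a child process')"])

# Conditional logic based on what the command returned stays in Python.
if "hello" in output:
    print("child said:", output.strip())
```

Passing the command as a list of arguments avoids going through a shell at all, which sidesteps quoting differences between platforms.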

Maybe you want to learn shell scripting, but you need to get work done now that you don’t yet know how to do. One of these strategies could buy you some time. You might transition, for example, from Python to PowerShell by generating more sophisticated shell scripts over time and writing simpler generator code until you just write scripts directly.

* * *

For daily tips on using Unix, follow @UnixToolTip on Twitter.


Too many objects

Around 1990, object oriented programming (OOP) was all the buzz. I was curious what the term meant and had a hard time finding a good definition. I still remember a description I saw somewhere that went something like this:

Object oriented programming is a way of organizing large programs. … Unless your program is fairly large, you will not see any difference between object oriented and structured programming.

The second sentence is no longer true. OOP is not just a high-level organizational style. Objects and method calls now permeate software at the lowest level. And that’s where things went wrong. Software developers got the idea that if objects are good, more objects are better. Everything should be an object!

For example, I had a discussion with a colleague once on how to represent depth in an oil well for software we were writing. I said “Let’s just use a number.”

double depth;

My suggestion was laughed off as hopelessly crude. We need to create depth objects! And not just C++ objects, COM objects! We couldn’t send a double out into the world without first wrapping it in the overhead of an object.

Languages like C# and Java enforce the everything-is-an-object paradigm. You can’t just write a function; you have to write member functions of an object. This leads to a proliferation of “-er” classes that do nothing but wrap a function. For example, suppose you need a function to solve a quadratic equation. You might make it the Solve method on a QuadraticEquationSolver object. That’s just silly. As John Carmack said,

Sometimes, the elegant implementation is a function. Not a method. Not a class. Not a framework. Just a function.
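To make the quadratic example concrete, here is a sketch of the plain-function version in Python. The name solve_quadratic is my own, and the textbook formula is used for brevity even though it can lose accuracy when b*b is much larger than 4*a*c.

```python
import cmath

def solve_quadratic(a, b, c):
    """Return both roots of a*x**2 + b*x + c = 0 as complex numbers."""
    if a == 0:
        raise ValueError("not a quadratic: a must be nonzero")
    # cmath.sqrt handles a negative discriminant by returning a complex value.
    root = cmath.sqrt(b * b - 4 * a * c)
    return (-b + root) / (2 * a), (-b - root) / (2 * a)

r1, r2 = solve_quadratic(1, -3, 2)
print(r1, r2)
```

There is no state to carry between calls, so wrapping this in a class would add ceremony without adding anything else.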

Languages are changing. You can create anonymous functions in C#, for example. You can even create a named function, by creating an anonymous function first and saving it to a named variable. A little wonky, but not too bad.

I imagine when people say OOP is terrible, they often mean that OOP as commonly practiced now goes too far.


The weight of code

From Bjorn Freeman-Benson’s talk Airplanes, Spaceships, and Missiles: Engineering Lessons from Famous Projects

Bjorn is discussing the ferrite core memory of the Apollo guidance system.

These are very, very robust memory systems. … But the problem is that they actually have weight to them. Core memory actually weighs a bunch, so when you’re writing your program for the lunar module … every line of code that you wrote had a consequence in weight. And you could measure how heavy your code was at the end of a compile line. … It’s an interesting analogy to keep in mind because in fact even today our code has weight. It doesn’t really have physical weight … Our code has psychological weight because every line of code we write has to be maintained. It has to be supported. It has to be operated.

Here’s the video. The context of the quote begins at 33:14.


Sum-free subset challenge

A set of integers is called sum-free if no element of the set is the sum of any other pair of elements in the set. For example, {1, 10, 100} is sum-free.

Let’s look at pulling out a sum-free subset of a larger set. For example, if we start with {1, 2, 3, …, 10}, then {1, 5, 10} is a sum-free subset. So is {1, 2, 4, 7}. Notice in this case 1 + 2 + 4 = 7, but that’s OK because we’re only concerned with whether an element is the sum of two other elements.

[Update: Thanks to Sjoerd Visscher for pointing out that the definition of sum-free does not require that the elements of a sum be distinct. So when I said that the set {1, 2, 4, 7} is sum-free, this was wrong because 2 + 2 = 4. The set A is sum-free if the intersection of A+A with A is empty.]
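The definition in the update, that A is sum-free when A+A and A have nothing in common, translates almost directly into code. Here is a minimal sketch; the function name is_sum_free is my own.

```python
def is_sum_free(a):
    """Return True if no x + y with x, y in a (x and y may be equal) is also in a."""
    s = set(a)
    return not any(x + y in s for x in s for y in s)

print(is_sum_free({1, 10, 100}))
print(is_sum_free({1, 2, 4, 7}))
print(is_sum_free({1, 5, 10}))
```

Under this definition {1, 10, 100} passes and {1, 2, 4, 7} fails, as the update says. The set {1, 5, 10} fails as well, since 5 + 5 = 10.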

Now let A be a set of integers with n elements. How large of a sum-free subset does A contain? It could be as large as n if the set A were sum-free to begin with, so that’s an upper bound. But what is a lower bound on the size of the largest sum-free subset?

There is a theorem that gives a number k such that every set of n non-zero integers contains a sum-free subset of size at least kn. You could let k be zero, but that’s no fun. Can you find a larger value of k? I’ll tell you later what value of k the theorem has. Until then, maybe you could try to find your own value.

Suppose you want to write a program to explore this empirically. For a given set, how would you find a maximal sum-free subset? Brute force examination of all subsets would take 2^n steps, so hopefully you could do better than that.
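As a baseline for such experiments, here is a sketch of the brute force search in Python. It is exponential in the size of the set, so it is only practical for small examples. The function name is my own, and it uses the definition in which the two summands may be equal.

```python
from itertools import combinations

def largest_sum_free_subset(a):
    """Return one sum-free subset of a of maximum size, found by exhaustive search."""
    elements = sorted(set(a))
    # Try the largest sizes first so the first hit has maximum size.
    for size in range(len(elements), 0, -1):
        for subset in combinations(elements, size):
            s = set(subset)
            if not any(x + y in s for x in s for y in s):
                return s
    # Reached only when no nonempty subset works, e.g. for the set {0}.
    return set()

best = largest_sum_free_subset(range(1, 11))
print(len(best), sorted(best))
```

For {1, 2, 3, …, 10} the search finds a subset with five elements. The odd numbers are one subset of that size, though the search may return a different one.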

What are some sets that have relatively small maximal sum-free subsets?
