Naming collections

When you have an array of things, do you name the array with a plural noun because it contains many things, or do you name it with a singular noun because each thing it contains is singular? For example, if you have a collection of words, should you name it words or word?

Does it make any difference if you’re using some container other than an array? For example, if you have a dictionary (a.k.a. map, hash, associative array, etc.) counting word frequencies, should it be count or counts?

I’ve never had a convention that I consciously follow. But I’ve often stopped to wonder which way I should name things. One approach may look right when I declare a variable and another when I use it.

Damian Conway has a reasonable suggestion in his book Perl Best Practices. (There are many things in that book that are good advice for people who never touch Perl.) He recommends using plural names for most arrays and singular names for dictionaries and arrays used like dictionaries.

Because hash entries are typically accessed individually, it makes sense for the hash itself to be named in the singular. That convention causes the individual accesses to read more naturally in the code. … On the other hand, array values are more often processed collectively … So it makes sense to name them in the plural, after the group of items they store. … If, however, an array is to be used as a random-access look-up table, name it in the singular, using the same conventions as a hash.
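
Here’s a small Perl sketch of my own (not from Conway’s book) showing how the convention reads. The array gets a plural name because it’s processed as a group; the hash gets a singular name because its entries are accessed one at a time:

    my @words = qw(apple banana cherry);   # plural: usually handled collectively
    my %count;                             # singular: entries looked up one by one

    for my $word (@words) {
        $count{$word}++;                   # reads naturally: the count for $word
    }

The loop ranges over @words as a collection, while $count{$word} reads as “the count for that word,” which is exactly the effect Conway is after.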

Tragedies and messes

Dorothy Parker said “It’s not the tragedies that kill us; it’s the messes.”

Sometimes that’s how I feel about computing. I think of messes such as having to remember that arc tangent is atan in R and Python, but arctan in NumPy and a in bc. Or that C, Python, and Perl use else if, elif, and elsif respectively. Or did I switch those last two?

These trivial but innumerable messes keep us from devoting our full energy to bigger problems.

One way to reduce these messes is to use fewer tools. Then there’s less to be confused about. If you only use Python, for example, then elif is just how it is. But knowing more tools is worth the added mess, up to a point. Past some point, however, new tools add more mental burden than utility. You have to find the optimal combination of tools for yourself, and that combination will change over time.

To use fewer tools, you may need to use more complex tools. Maybe you can replace a list of moderately complex but inconsistent tools with one tool that is more complex but internally consistent.

Perl as a better …

Today I ran across Minimal Perl: For UNIX and Linux People. The book was published a few years ago but I hadn’t heard of it because I haven’t kept up with the Perl world. The following chapters from the table of contents jumped out at me because I’ve been doing a fair amount of awk and sed lately:


3. Perl as a (better) grep command
4. Perl as a (better) sed command
5. Perl as a (better) awk command
6. Perl as a (better) find command

These chapters can be read a couple ways. The most obvious reading would be “Learn a few features of Perl and use it as a replacement for a handful of separate tools.”

But if you find these tools familiar and are not looking to replace them, you could read the book as saying “Here’s an introduction to Perl that teaches you the language by comparing it to things you already know well.”

The book suggests learning one tool instead of several, and in the bargain getting more powerful features, such as more expressive pattern matching. It also suggests not necessarily committing to learn the entire enormous Perl language, and not necessarily committing to use Perl for every programming task.

Regarding Perl’s pattern matching, I could relate to the following quip from the book.

What’s the only thing worse than not having a particular metacharacter … in a pattern-matching utility? Thinking you do, when you don’t! Unfortunately, that’s a common problem when using Unix utilities for pattern matching.

That was my experience just yesterday. I wrote a regular expression containing \d for a digit and couldn’t understand why it wasn’t matching.

Most of the examples rely on giving Perl command line options such as -e so that it acts more like a command line utility. The book gives numerous examples of carrying out common tasks with grep etc. and with Perl one-liners. The latter tend to be a little more verbose. If a task falls in the sweet spot of a common tool, that tool’s syntax will be more succinct. But when a task falls outside that sweet spot, such as matching a pattern that cannot be easily expressed with traditional regular expressions, the Perl solution will be shorter.
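
Here’s the flavor of that comparison, with an example of my own rather than one from the book (the file name and pattern are made up). The grep version is shorter, but the Perl version has the full Perl regex dialect available, \d included:

    grep -E '[0-9]{3}-[0-9]{4}' contacts.txt
    perl -ne 'print if /\d{3}-\d{4}/' contacts.txt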



How to avoid shell scripting

Suppose you know a scripting language (Perl, Python, Ruby, etc.) and you’d rather not learn shell scripting (bash, PowerShell, batch, etc.). Or maybe you know shell scripting on one platform and don’t want to take the time right now to learn shell scripting on another platform. For example, maybe you know bash on Linux but don’t want to learn PowerShell on Windows, or vice versa.

One strategy would be to use your preferred language to generate shell scripts. Shell scripts are trivial when they’re just a list of commands: do this, do this, do this, etc. Where shell scripting gets more complicated is when you have variables, branching logic, library calls, all the stuff you already know how to do in another language. Maybe you could do all the complicated logic in your “native language” and just generate a shell script that’s simply a list of instructions with no other logic.
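
Here’s a minimal sketch of that first strategy in Perl, with made-up file names. All the logic stays in Perl; the generated script is nothing but a flat list of commands:

    use strict;
    use warnings;

    # The looping and decision-making happen here, in Perl...
    my @logs = glob '*.log';

    # ...and the generated script is just a list of commands, no logic.
    open my $out, '>', 'cleanup.sh' or die "can't write cleanup.sh: $!";
    print $out "#!/bin/sh\n";
    print $out "gzip $_\n" for @logs;
    close $out;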

Another strategy is to make system calls from your preferred language. Most scripting languages have a system() function that takes a string and executes it as a system command. The advantage of this approach is that it can include conditional logic that the code generation approach cannot handle. The disadvantage is that you have to sort out what process the system() call is running under, etc.
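
A minimal sketch of the second strategy, again in Perl with made-up file names:

    use strict;
    use warnings;

    for my $log (glob '*.log') {
        # Each system() call runs a shell command; check the exit status yourself.
        system("gzip $log") == 0
            or warn "gzip failed on $log: exit status $?\n";
    }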

Maybe you want to learn shell scripting, but you need to get work done now that you don’t yet know how to do. One of these strategies could buy you some time. You might transition, for example, from Python to PowerShell by generating more sophisticated shell scripts over time and writing simpler generator code until you just write scripts directly.

Too many objects

Around 1990, object oriented programming (OOP) was all the buzz. I was curious what the term meant and had a hard time finding a good definition. I still remember a description I saw somewhere that went something like this:

Object oriented programming is a way of organizing large programs. … Unless your program is fairly large, you will not see any difference between object oriented and structured programming.

The second sentence is no longer true. OOP is not just a high-level organizational style. Objects and method calls now permeate software at the lowest level. And that’s where things went wrong. Software developers got the idea that if objects are good, more objects are better. Everything should be an object!

For example, I had a discussion with a colleague once on how to represent depth in an oil well for software we were writing. I said “Let’s just use a number.”

double depth;

My suggestion was laughed off as hopelessly crude. We need to create depth objects! And not just C++ objects, COM objects! We couldn’t send a double out into the world without first wrapping it in the overhead of an object.

Languages like C# and Java enforce the everything-is-an-object paradigm. You can’t just write a function; you have to write member functions of an object. This leads to a proliferation of “-er” classes that do nothing but wrap a function. For example, suppose you need a function to solve a quadratic equation. You might make it the Solve method on a QuadraticEquationSolver object. That’s just silly. As John Carmack said,

Sometimes, the elegant implementation is a function. Not a method. Not a class. Not a framework. Just a function.
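
For contrast, here’s roughly what “just a function” looks like, a sketch of my own in Perl, which (unlike the C# and Java of that era) is happy to let a function stand alone:

    # Just a function. No QuadraticEquationSolver object required.
    sub solve_quadratic {
        my ($a, $b, $c) = @_;
        my $d = sqrt($b**2 - 4*$a*$c);   # assumes real roots, for illustration
        return ((-$b + $d) / (2*$a), (-$b - $d) / (2*$a));
    }

    my @roots = solve_quadratic(1, -3, 2);   # (2, 1)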

Languages are changing. You can create anonymous functions in C#, for example. You can even create a named function, by creating an anonymous function first and saving it to a named variable. A little wonky, but not too bad.

I imagine when people say OOP is terrible, they often mean that OOP as commonly practiced now goes too far.


The weight of code

From Bjorn Freeman-Benson’s talk Airplanes, Spaceships, and Missiles: Engineering Lessons from Famous Projects.

Bjorn is discussing the ferrite core memory of the Apollo guidance system.

These are very, very robust memory systems. … But the problem is that they actually have weight to them. Core memory actually weighs a bunch, so when you’re writing your program for the lunar module … every line of code that you wrote had a consequence in weight. And you could measure how heavy your code was at the end of a compile line. … It’s an interesting analogy to keep in mind because in fact even today our code has weight. It doesn’t really have physical weight … Our code has psychological weight because every line of code we write has to be maintained. It has to be supported. It has to be operated.

Here’s the video. The context of the quote begins at 33:14.


Sum-free subset challenge

A set of integers is called sum-free if no element of the set is the sum of any other pair of elements in the set. For example, {1, 10, 100} is sum-free.

Let’s look at pulling out a sum-free subset of a larger set. For example, if we start with {1, 2, 3, …, 10}, then {1, 5, 10} is a sum-free subset. So is {1, 2, 4, 7}. Notice in this case 1 + 2 + 4 = 7, but that’s OK because we’re only concerned with whether an element is the sum of two other elements.

[Update: Thanks to Sjoerd Visscher for pointing out that the definition of sum-free does not require that the elements of a sum be distinct. So when I said that the set {1, 2, 4, 7} is sum-free, this was wrong because 2 + 2 = 4. The set A is sum-free if the intersection of A+A with A is empty.]

Now let A be a set of integers with n elements. How large of a sum-free subset does A contain? It could be as large as n if the set A were sum-free to begin with, so that’s an upper bound. But what is a lower bound on the size of the largest sum-free subset?

There is a theorem that gives a number k such that every set of n non-zero integers contains a sum-free subset of size at least kn. You could let k be zero, but that’s no fun. Can you find a larger value of k? I’ll tell you later what value of k the theorem has. Until then, maybe you could try to find your own value.

Suppose you want to write a program to explore this empirically. For a given set, how would you find a maximal sum-free subset? Brute force examination of all subsets would take 2^n steps, so hopefully you could do better than that.
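
Here’s the 2^n brute-force baseline as a Perl sketch of my own, using the corrected definition (a subset S is sum-free when the intersection of S+S with S is empty). It’s only practical for small sets, which is exactly why a smarter algorithm would be welcome:

    use strict;
    use warnings;

    # Check every subset and return the largest sum-free one found.
    sub largest_sum_free {
        my @set = @_;
        my @best;
        for my $mask (1 .. (1 << @set) - 1) {
            my @subset = map { $set[$_] } grep { $mask & (1 << $_) } 0 .. $#set;
            my %in = map { $_ => 1 } @subset;
            my $sum_free = 1;
            PAIR: for my $x (@subset) {
                for my $y (@subset) {
                    if ($in{$x + $y}) { $sum_free = 0; last PAIR; }
                }
            }
            @best = @subset if $sum_free and @subset > @best;
        }
        return @best;
    }

    print join(' ', largest_sum_free(1 .. 10)), "\n";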

What are some sets that have relatively small maximal sum-free subsets?


Extreme syntax

In his book Let Over Lambda, Doug Hoyte says

Lisp is the result of taking syntax away, Perl is the result of taking syntax all the way.

Lisp practically has no syntax. It simply has parenthesized expressions. This makes it very easy to start using the language. And above all, it makes it easy to treat code as data. Lisp macros are very powerful, and these macros are made possible by the fact that the language is simple to parse.

Perl has complex syntax. Some people say it looks like line noise because of its liberal use of non-alphanumeric characters as operators. Perl is not easy to parse — there’s a saying that only Perl can parse Perl — nor is it easy to start using. But the language was designed for regular users, not beginners, because you spend more time using a language than learning it.

There are reasons I no longer use Perl, but I don’t object to the rich syntax. Saying Perl is hard to use because of its symbols is like saying Greek is hard to learn because it has a different alphabet. It takes years to master Greek, but you can learn the alphabet in a day. The alphabet is not the hard part.

Symbols can make text more expressive. If you’ve ever tried to read mathematics from the 18th or 19th century, you’ll see what I mean. Before the 20th century, math publications were very verbose. It might take a paragraph to say what would now be said in a single equation. In part this is because notation has developed and standardized over time. Also, it is now much easier to typeset the symbols someone would use in handwriting. Perl’s repertoire of symbols is parsimonious compared to mathematics.

I imagine that programming languages will gradually expand their range of symbols.

People joke about how unreadable Perl code is, but I think a page of well-written Perl is easier to read than a page of well-written Lisp.  At least the Perl is easier to scan: Lisp’s typographical monotony makes it hard to skim for landmarks. One might argue that a page of Lisp can accomplish more than a page of Perl, and that may be true, but that’s another topic.

***

Any discussion of symbols and programming languages must mention APL. This language introduced a large number of new symbols and never gained wide acceptance. I don’t know that much about APL, but I’ll give my impression of why I don’t think APL’s failure is proof that programmers won’t use more symbols.

APL required a special keyboard for input. That would no longer be necessary. APL also introduced a new programming model; the language would have been hard to adopt even without the special symbols. Finally, APL’s symbols were completely unfamiliar and introduced all at once, unlike math notation that developed world-wide over centuries.

***

What if programming notation were more like music notation? Music notation is predominantly non-verbal, but people learn to read it fluently with a little training. And it expresses concurrency very easily. Or maybe programs could look more like choral music, a mixture of symbols and prose.

Dogfooding

Dogfooding refers to companies using their own software. According to Wikipedia,

In 1988, Microsoft manager Paul Maritz sent Brian Valentine, test manager for Microsoft LAN Manager, an email titled “Eating our own Dogfood”, challenging him to increase internal usage of the company’s product. From there, the usage of the term spread through the company.

Dogfooding is a great idea, but it’s no substitute for usability testing. I get the impression that some products, if they’re tested at all, are tested by developers intimately familiar with how they’re intended to be used.

If your company is developing consumer software, it’s not dogfooding if just the developers use it. It’s dogfooding when people in sales and accounting use it. But that’s still no substitute for getting people outside the company to use it.

Dogfooding doesn’t just apply to software development. Whenever I buy something with inscrutable assembly instructions, I wonder why the manufacturer didn’t pay a couple people off the street to put the thing together on camera.

Hacking debt

The term technical debt describes the accumulated effect of short term decisions in a software development process. In order to meet a deadline, for example, a project will take shortcuts, developing code in a way that’s not best for future maintainability but that saves time immediately. Once the pressure is off, hopefully, the team goes back and repays the technical debt by refactoring.

I’d like to propose hacking debt to describe the state of a person who has been focused on “real work” for so long that he or she hasn’t spent enough time playing around, making useless stuff for the fun of it. Some portion of a career should be devoted to hacking. Not 100%, but not 0% either. Without some time spent exploring and having fun, people become less effective and eventually burn out.

Related posts:

For hacking financial debt, see this.

A web built on LaTeX

The other day on TeXtip, I threw this out:

Imagine if the web had been built on LaTeX instead of HTML …

Here are some of the responses I got:

  • It would have been more pretty looking.
  • Frightening.
  • Single tear down the cheek.
  • No crap amateurish content because of the steep learning curve, and beautiful rendering … What a dream!
  • Shiny math, crappy picture placement: glad it did not!
  • Overfull hboxes EVERYWHERE.
  • LaTeX would have become bloated, and people would be tweeting about HTML being so much better.
  • Noooo! LaTeX would have been “standardised”, “extended” and would by now be a useless pile of complexity.

Wrapping a function in a burkha

Terrific quote from Jessica Kerr via Dan North:

In an OO language you can't just pass a function around unless it's dressed in a burkha and accompanied by nouns @jessitron at #scandev

If you feel like you’re missing an inside joke, here’s an explanation.

In object oriented languages, languages that don’t simply support object oriented programming but try to enforce their vision of it, you can’t pass around a simple function. You have to pass around an object that contains the function.

Functions are analogous to verbs and objects are analogous to nouns. In contrast to spoken English where the trend is to turn nouns into verbs, OOP wraps verbs up as nouns.

“Sometimes, the elegant implementation is a function. Not a method. Not a class. Not a framework. Just a function.” — John Carmack


Can regular expressions parse HTML or not?

Can regular expressions parse HTML? There are several answers to that question, both theoretical and practical.

First, let’s look at theoretical answers.

When programmers first learn about regular expressions, they often try to use them on HTML. Then someone wise will tell them “You can’t do that. There’s a computer science theorem that says regular expressions are not powerful enough.” And that’s true, if you stick to the original meaning of “regular expression.”

But if you interpret “regular expression” the way it is commonly used today, then regular expressions can indeed parse HTML. This post by Nikita Popov explains that what programmers commonly call regular expressions, such as PCRE (Perl compatible regular expressions), can match context-free languages.

Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.

So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE can.
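
To give a taste of what that recursion buys you, here’s a toy Perl sketch of my own (not from Popov’s post, and not a serious HTML parser) that matches balanced nested <b> tags, something no classical regular expression can do:

    # (?1) recursively reuses capture group 1, so the pattern handles
    # arbitrarily deep nesting -- beyond the power of true regular expressions.
    my $nested = qr{\A ( <b> (?: [^<]+ | (?1) )* </b> ) \z}x;

    print "balanced\n" if '<b>one <b>two</b> three</b>' =~ $nested;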

Now on to the practical answers. The next lines in Nikita Popov’s post say

But don’t forget two things: Firstly, most HTML you see in the wild is not well-formed (usually not even close to it). And secondly, just because you can, doesn’t mean that you should.

HTML in the wild can be rather wild. On the other hand, it can also be simpler than the HTML grammar allows. In practice, you may be able to parse a particular bit of HTML with regular expressions, even old fashioned regular expressions. It depends entirely on context, your particular piece of (possibly malformed) HTML and what you’re trying to do with it. I’m not advocating regular expressions for HTML parsing, just saying that the question of whether they work is complicated.

This opens up an interesting line of inquiry. Instead of asking whether strict regular expressions can parse strict HTML, you could ask what is the probability that a regular expression will succeed at a particular task for an HTML file in the wild. If you define “HTML” as actual web pages rather than files conforming to a particular grammar, every technique will fail with some probability. The question is whether that probability is acceptable in context, whether using regular expressions or any other technique.

Related post: Coming full circle

Randomized studies of productivity

A couple days ago I wrote a blog post quoting Cal Newport suggesting that four hours of intense concentration a day is as much as anyone can sustain. That post struck a chord and has gotten a lot of buzz on Hacker News and Twitter. Most of the feedback has been agreement, but a few people have complained that this four-hour limit is based only on anecdotes, not scientific data.

Realistic scientific studies of productivity are often not feasible. For example, people often claim that programming language X makes them more productive than language Y. How could you conduct a study where you randomly assign someone a programming language to use for a career? You could do some rinky-dink study where you have 30 CS students do an artificial assignment using language X and 30 using Y. But that’s not the same thing, not by a long shot.

If someone, for example Rich Hickey, says that he’s three times more productive using one language than another, you can’t test that assertion scientifically. But what you can do is ask whether you think you are similar to that person and whether you work on similar problems. If so, maybe you should give their recommended language a try.

Suppose you wanted to test whether people are more productive when they concentrate intensely for two hours in the morning and two hours in the afternoon. You couldn’t just randomize people to such a schedule. That would be like randomizing some people to run a four-minute mile. Many people are not capable of such concentration. They either lack the mental stamina or the opportunity to choose how they work. So you’d have to start with people who have the stamina and opportunity to work the way you want to test. Then you’d randomize some of these people to working longer, fractured work days. Is that even possible? How would you keep people from concentrating? Harrison Bergeron anyone? And if it’s possible, would it be ethical?

Real anecdotal evidence is sometimes more valuable than artificial scientific data. As Tukey said, it’s better to solve the right problem the wrong way than to solve the wrong problem the right way.
