Naming collections

When you have an array of things, do you name the array with a plural noun because it contains many things, or do you name it with a singular noun because each thing it contains is singular? For example, if you have a collection of words, should you name it words or word?

Does it make any difference if you’re using some container other than an array? For example, if you have a dictionary (a.k.a. map, hash, associative array, etc.) counting word frequencies, should it be count or counts?

I’ve never had a convention that I consciously follow. But I’ve often stopped to wonder which way I should name things. One approach may look right when I declare a variable and another when I use it.

Damian Conway has a reasonable suggestion in his book Perl Best Practices. (There are many things in that book that are good advice for people who never touch Perl.) He recommends using plural names for most arrays and singular names for dictionaries and arrays used like dictionaries.

Because hash entries are typically accessed individually, it makes sense for the hash itself to be named in the singular. That convention causes the individual accesses to read more naturally in the code. … On the other hand, array values are more often processed collectively … So it makes sense to name them in the plural, after the group of items they store. … If, however, an array is to be used as a random-access look-up table, name it in the singular, using the same conventions as a hash.
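Here is a minimal Python sketch of how that convention might read in practice. The names `words`, `count`, and `day_name` and the sample data are made up for illustration; they are not taken from Conway's book.

```python
# Hypothetical sample data; all names here are illustrative only.
words = ["the", "cat", "saw", "the", "dog"]

# A dictionary is accessed one entry at a time, so it gets a singular name.
count = {}
for word in words:  # the array is processed as a group, so it is plural
    count[word] = count.get(word, 0) + 1

# An array used as a random-access look-up table also gets a singular name.
day_name = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

print(count["the"])  # reads as "the count of 'the'"
print(day_name[2])   # reads as "day name 2"
```

The loop header and the individual look-ups each read close to plain English, which seems to be the point of mixing plural and singular names.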

15 thoughts on “Naming collections”

  1. The Welsh language has a thing called “collective-unit” nouns. One is “y plant”, which means “the children” even though the noun itself is singular. (If you are talking about one child, you have to say “y plentyn”, IIRC.) Usually, though not always, collective-unit nouns are things that you see as a group, so that you’d refer to “a set of children” as a singular thing.

  2. If I’m ever going to use a for(each) loop, I prefer plural nouns, so in python I would write:

    for word in words:
        dosomething(word)

  3. Interesting, I haven’t read “Perl Best Practices”, but I think that’s how I do it most of the time (in spirit anyway; really I just go by the seat of my pants). If you only handle the array (or hash) at the array (hash) level, the plural form of the array name makes the code read more cleanly. On the other hand, if the code is indexing individual elements, then a reference like “things[n]” looks awfully ugly to me, while “thing[n]” does not, so I feel forced to drop the plural on aesthetic grounds.
    As a side effect, high-level code tends to have pluralized names, but lower-level code tends to use singular names.

  4. There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

    [Well, somebody had to say it.]

  5. I think you just do what’s most compatible with your native grammar, in which case the perl book is consistent with English.

    thing[i] and thing['i'] are like modifiers, read like knuth['donald'], whereas you’d make it plural just so you can use the language’s equivalent of ‘for thing in things’.

  6. If I put something in a hash it’s usually because I want to look it up later by using the key, so a lot of my hashes are named X_by_Y. E.g. %count_by_word, %name_by_id, %size_by_filename. Then I don’t even have to think when I later write $name = $name_by_id{$id}.

  7. I name my hashes “apple_to_banana” (instead of “banana_by_apple”, as Jonathan does), and I think part of it is whether you think of your hash as an index on apples, or as a function from the apple domain to the banana domain.

    I use plural for arrays (although I like the point made about individual access being an exception to that rule). For hashes, I don’t use the plural type names unless the type is plural.

    In general, I would say that tracking plurality correctly is a big part of good naming.

  8. I use the reverse logic for arrays: I think the way entries are accessed (for both reading and writing) is much more important than how the array is allocated or how its type is declared, so I always use word[i], since it makes more sense to use that form to access the i-th word.

    For maps, I always name the map with both its domain and its range, and the word “To” in between, e.g. geneIdx = geneNameToGeneIndex.get(geneName). Using this convention has saved me a lot of confusion over the years when I am dealing with large numbers of complicated collections.

  9. I agree that word[i] reads better than words[i]. But sometimes you never explicitly index an array, such as

    for word in words:
       ...
    

    Another of Conway’s recommendations is not to index an array more than once. So if you have to explicitly use words[i], you could say word = words[i] and henceforth only work with word.

  10. Reading Luke’s and John’s comments made me understand my habits better. We all name hashes using the key and values “types” (or domain and range, as Luke says), but I say “value_by_key” instead of “keyToValue”. I realized I’m motivated by the dimensional analysis I learned in Chemistry and Physics.

    You know how easy it is to figure out

    (meters) = (meter/sec) * (seconds)

    because “the seconds in the numerator and denominator cancel each other”? If I have a map I think of it as expressing geneIndex divided by geneName, in some abstract, not precisely defined sense. So to convert a geneName into a geneIndex I do the same thing as I do with meters and seconds

    (geneIndex) = (geneIndex/geneName) * geneName

    which in Perl is written

    $geneIndex = $geneIndex_by_geneName{ $geneName };

    It’s like the “by_geneName” puts it in the denominator, and accessing the map with a geneName cancels it out.

    This also reminds me of the Chain Rule used for differentials [ dy = (dy/dx) * dx ], and the Einstein summation convention, which reinforces the model.

  11. I’ve been thinking about whether Key_to_Value or Value_by_Key is better. “meters/sec” is a good argument for Value_by_Key. Another argument for Value_by_Key is that a lexicographic sort will place items of the same type near each other if the hash name is prepended, which supports aggregate access patterns when you want to get all objects of a certain type…

    Also, when there is a chain, I think Value_by_Key is more readable…
    B_to_C[A_to_B[A1]]
    vs
    C_by_B[B_by_A[A1]]

    I think the reason why I’ve preferred “A_to_B” instead of “B_by_A” has to do with the same reason why one might prefer infix notation to prefix notation. If I have an object A, I often want to call a method A.ToB() and it’s easier for my internal autocomplete to go from “object you are currently dealing with” to “objects you might want”, than it is for it to perform the reverse computation.

  12. The reason I use Key_to_Value in my naming is that in my line of work, I construct a lot of complex data manipulation pipelines, and once you’ve built dozens of systems like this, the very clear single unifying design pattern that emerges is that you’re building DAGs of collections over and over again. Each collection in the DAG is produced from other collections, i.e. each node in the DAG has two parts: a function applied to the input collections represented by the incoming arcs, and the collection resulting from applying that function to the inputs. This “pipes and filters” approach views data as *passing through* filters, not being *returned by* functions. Once you start viewing things in this way, you realize that functions and maps have a very deep equivalence (beyond just thinking of functions as memoizable): both functions and maps are examples of morphisms in category theory; they both map from a domain to a range. Looking something up in a map is the same as calling a function corresponding to that map, effectively mapping from the domain to the range of the function.

    A function f that takes an input of type A and returns an output of type B could be thought of as “B from A”, but it’s easier to write it in “A to B” form as f:A->B, since once you start composing morphisms, the chaining of domains in dataflow order makes more sense. Given f:A->B and g:B->C, the function composition (g . f) is actually a mapping from A->B->C, which can be collapsed down into a single morphism A->C.

    Since working so extensively with data processing DAGs for so many years, I can’t think in terms of “from” mappings anymore. I have “seen the Matrix”, and now everything looks like a “to” mapping :)

  13. @luke I agree about maps == functions, the day I realized that was a big day for me. What do you think about lists though? It’s one I keep coming back to, whether a list should be thought of as a function/map from integers to objects…the thing that seems to make a list different is that it has a length and might not allow for missing/null entries, depending on your language…but if you allow for null entries then it seems to be very similar to a map/function.

  14. Yes, I have come to the same conclusion, that a list should be treated as a morphism from int (i.e. index) to value. Lists with null entries, if null means “missing”, can then be represented as morphisms with sparse domains. I’m creating a programming language where the orderedness and sparsity of domains are toplevel concepts. If you know a domain is non-sparse, the types of parallel operations you can perform on it are greater than if the domain is sparse, among many other things.

    The one I haven’t figured out yet is what to do with sets. Is a set just a mapping from an element to the value true, or from an element to itself? Should sets be treated as morphisms at all, or just as domains that morphisms can be constructed to map between?

    I need more knowledge of category theory. I have the category theory primer printed out and sitting on my desk that John Cook linked to recently…

  15. I like the idea of orderedness and sparseness being top-level concepts. They seem to be pretty fundamental concepts, so it would be interesting to see them formalized somehow. I admit that category theory is above my abstraction level, but feel free to email me any notes you have on your language, I’d be interested in hearing more (john.a.fries at gmail).

    I spent a decent amount of time working on my own language (although I think in my case it was more just me organizing my own thoughts on which concepts were fundamental than anything I was going to be able to release publicly). In my case, I ended up using the “relation” (or predicate) instead of the function as the fundamental concept, and building everything else, including functions, on top of that.

    The nice thing about using predicates is that you don’t need to have a set type (it also deals very neatly with the classical paradoxes of set theory, but that’s a longer discussion). For instance, instead of a set “bananas”, you have a predicate “banana”, and for any given thing X you can ask “banana X?” and it will return True, False, Unknown or Contradiction. You can ask for all the objects Y for which (Banana Y?) is True, and it’s only a set in the i/o sense.
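
The two map-naming styles debated in the comments above, value_by_key and key_to_value, can be compared side by side in a short Python sketch. The data and every name in it (`gene_name_by_id`, `id_to_gene_name`, and so on) are invented for illustration and are not any commenter's actual code.

```python
# Hypothetical data: id -> gene name -> gene index.

# "value_by_key" style
gene_name_by_id = {7: "geneA", 9: "geneB"}
gene_index_by_gene_name = {"geneA": 0, "geneB": 1}

# "key_to_value" style: the same mappings under different names
id_to_gene_name = gene_name_by_id
gene_name_to_gene_index = gene_index_by_gene_name

gene_id = 7

# by-style: in a nested lookup, neighbouring name parts line up
# (..._by_gene_name[gene_name_by_...]), like units cancelling.
index_by = gene_index_by_gene_name[gene_name_by_id[gene_id]]

# to-style: written as successive steps, the names follow dataflow order.
gene_name = id_to_gene_name[gene_id]
index_to = gene_name_to_gene_index[gene_name]

print(index_by, index_to)
```

Both spellings perform the same look-ups; the difference is only whether the name reads best in a nested expression or in a step-by-step pipeline.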
