Word frequencies in human and computer languages

This is one of my favorite quotes from Starbucks’ coffee cups:

When I was young I was misled by flash cards into believing that xylophones and zebras were much more common.

Alphabet books treat every letter as equally important even though letters like X and Z are far less common than letters like E and T. Children need to learn the entire alphabet eventually, and there are only 26 letters, so teaching all the letters at once is not bad. But uniform emphasis doesn’t scale well. Learning a foreign language, or a computer language, by learning words without regard to frequency is absurd. The most common words are far more common than the less common words, and so it makes sense to learn the most common words first.

John Myles White has applied this idea to learning R. He did a keyword frequency analysis for R and showed that the frequency of the keywords follows Zipf’s law or something similar. I’d like to see someone do a similar study for other programming languages.
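As a rough illustration of what such a study might look like for another language, here is a minimal sketch in Python rather than R. It is not White's analysis: the function name `keyword_frequencies` and the sample source are made up for this example, and a real study would run over a large body of code rather than a dozen lines.

```python
import io
import keyword
import tokenize
from collections import Counter


def keyword_frequencies(source: str) -> Counter:
    """Count how often each Python keyword appears in `source`.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        # Keywords arrive as NAME tokens. Strings and comments are separate
        # token types, so keyword-like words inside them are not counted.
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            counts[tok.string] += 1
    return counts


# A tiny made-up sample; substitute the text of real source files.
sample = '''
def total(xs):
    s = 0
    for x in xs:
        if x is not None:
            s += x
    return s

def mean(xs):
    if not xs:
        return None
    return total(xs) / len(xs)
'''

freqs = keyword_frequencies(sample)
for rank, (word, count) in enumerate(freqs.most_common(), start=1):
    # Under Zipf's law, count would be roughly proportional to 1 / rank.
    print(rank, word, count)
```

Plotting the log of the count against the log of the rank for a large corpus would show whether the points fall near a straight line, which is the usual informal check for Zipf-like behavior.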

It would be interesting to write a programming language tutorial that introduces the keywords in approximately the order of their frequency. Such a book might be quite unorthodox, and quite useful.

White points out that when teaching human languages in a classroom, “the usefulness of a word tends to be confounded with its respectability.” I imagine something similar happens with programming languages. Programs that produce lists of Fibonacci numbers or prime numbers are the xylophones and zebras of the software world.

5 thoughts on “Word frequencies in human and computer languages”

  1. “It would be interesting to write a programming language tutorial that introduces the keywords in approximately the order of their frequency. Such a book might be quite unorthodox, and quite useful.”

    Interesting idea.

    Also, interesting that you mention this… Just yesterday I got around to analyzing the first round of data from vim-logging, a patched version of Vim which records which commands I use most. The analysis of that is here: http://blog.codekills.net/archives/67-You-and-Your-Editor-Data-from-vim-logging-2-of-N.html

  2. You might find http://jtauber.com/blog/2008/02/10/a_new_kind_of_graded_reader/
    and http://jtauber.com/blog/2004/11/26/programmed_vocabulary_learning_as_a_travelling_salesman_problem/
    interesting.

    The idea is to figure out what new word will bring as many sentences as possible closer to being understood. Repeat and in exchange for learning a few words, you can read many sentences. (I have a Haskell implementation I work on every so often, but since I’m not actively learning any languages, and it’s not at all obvious how to do this for a programming language, I haven’t looked at it in a long time.)

  3. The information you are after is available, at least for C. I would advocate teaching students the two to four most commonly occurring instances of a construct. It would simplify code a lot, making life a lot easier for subsequent maintainers of the code and for optimizing compilers and static analyzers.
