This is one of my favorite quotes from Starbucks’ coffee cups:
When I was young I was misled by flash cards into believing that xylophones and zebras were much more common.
Alphabet books treat every letter as equally important, even though letters like X and Z are far less common than letters like E and T. Children need to learn the entire alphabet eventually, and there are only 26 letters, so teaching them all at once does no harm. But uniform emphasis doesn’t scale. Learning a foreign language, or a computer language, by studying words without regard to frequency is absurd: the most common words occur far more often than the rest, so it makes sense to learn them first.
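The imbalance is easy to see for yourself. Here is a minimal sketch that counts letter frequencies in a short sample sentence (real estimates would of course use a large corpus):

```python
from collections import Counter

# Count letter frequencies in a sample sentence (illustrative only;
# a real study would use a large corpus of text).
text = "the quick brown fox jumps over the lazy dog and then the fox sleeps"
counts = Counter(c for c in text.lower() if c.isalpha())

# Letters ranked from most to least common in this sample.
ranked = [letter for letter, _ in counts.most_common()]
print(ranked[:5])
```

Even in a sentence chosen to contain X and Z, the letter E dominates them by a wide margin.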
John Myles White has applied this idea to learning R. He did a keyword frequency analysis for R and showed that the frequency of the keywords follows Zipf’s law or something similar. I’d like to see someone do a similar study for other programming languages.
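A study like White’s is straightforward to set up for other languages. Here is a small sketch for Python, using its standard `tokenize` and `keyword` modules to count how often each reserved keyword appears in a source string (the sample program here is made up purely for illustration):

```python
import io
import keyword
import tokenize
from collections import Counter

def keyword_frequencies(source: str) -> Counter:
    """Count how often each reserved Python keyword appears in source code."""
    counts = Counter()
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        # Keywords are tokenized as NAME tokens; filter with keyword.iskeyword.
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            counts[tok.string] += 1
    return counts

# A tiny illustrative program; a real analysis would scan a large codebase.
sample = """
def f(x):
    if x > 0:
        return x
    else:
        return -x

for i in range(10):
    if i % 2 == 0:
        print(f(i))
"""
print(keyword_frequencies(sample).most_common())
```

Run over a large corpus of code and sorted by rank, such counts are what one would plot against rank to check for a Zipf-like distribution.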
It would be interesting to write a programming language tutorial that introduces the keywords in approximately the order of their frequency. Such a book might be quite unorthodox, and quite useful.
White points out that when teaching human languages in a classroom, “the usefulness of a word tends to be confounded with its respectability.” I imagine something similar happens with programming languages. Programs that produce lists of Fibonacci numbers or prime numbers are the xylophones and zebras of the software world.
5 thoughts on “Word frequencies in human and computer languages”
“It would be interesting to write a programming language tutorial that introduces the keywords in approximately the order of their frequency. Such a book might be quite unorthodox, and quite useful.”
Also, interesting that you mention this… Just yesterday I got around to analyzing the first round of data from vim-logging, a patched version of Vim which records which commands I use most. The analysis of that is here: http://blog.codekills.net/archives/67-You-and-Your-Editor-Data-from-vim-logging-2-of-N.html
You might enjoy Todd Veldhuizen’s “Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf’s Law”, available at http://arxiv.org/abs/cs/0508023.
Thanks, Greg. That looks like a good paper.
You might find http://jtauber.com/blog/2008/02/10/a_new_kind_of_graded_reader/ interesting.
The idea is to figure out which new word will bring as many sentences as possible closer to being understood. Repeat, and in exchange for learning a few words you can read many sentences. (I have a Haskell implementation I work on every so often, but since I’m not actively learning any languages, and it’s not at all obvious how to do this for a programming language, I haven’t looked at it in a long time.)
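The greedy selection described above can be sketched in a few lines of Python (a simplified stand-in, not the commenter’s Haskell implementation): at each step, learn the unknown word that appears in the most sentences, so each new word moves as many sentences as possible toward being readable.

```python
from collections import Counter

def next_word(sentences, known):
    """Greedily pick the unknown word that occurs in the most sentences."""
    counts = Counter()
    for sentence in sentences:
        for word in set(sentence.lower().split()):
            if word not in known:
                counts[word] += 1
    # Return the most widely useful unknown word, or None when done.
    return counts.most_common(1)[0][0] if counts else None

# A toy corpus; a real graded reader would use thousands of sentences.
sentences = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
]

known = set()
order = []
word = next_word(sentences, known)
while word is not None:
    known.add(word)
    order.append(word)
    word = next_word(sentences, known)
print(order)
```

On this toy corpus the learner is told to study “the” first, since it appears in every sentence; the rarer words follow.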
The information you are after is available, at least for C. I would advocate teaching students the 2..4 most commonly occurring instances of a construct. It would simplify code a lot, making life a lot easier for subsequent maintainers of the code as well as for optimizing compilers and static analyzers.