1000 most common words

Posted on 14 April 2025 by John

Last week I wrote about a hypothetical radio station that plays the top 100 songs in some genre, with songs being chosen randomly according to Zipf’s law. The nth most popular song is played with probability proportional to 1/n.

This post is a variation on that post looking at text consisting of the 1,000 most common words in a language, where word frequencies follow Zipf’s law.

How many words of text would you expect to read until you’ve seen all 1,000 words at least once? The math is the same as in the radio station post. The simulation code is the same too: I just changed a parameter from 100 to 1,000.

The result of a thousand simulation runs was an average of 41,246 words with a standard deviation of 8,417.
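
The simulation code is not reproduced in this post. Here is a minimal sketch of what such a simulation might look like, using only the Python standard library. It is not the original code: the function names, the batch size, and the seed are assumptions made for illustration, and the exact mean and standard deviation will vary with the seed and the number of runs.

```python
import random
from itertools import accumulate
from statistics import mean, stdev


def zipf_cum_weights(n_words):
    """Cumulative unnormalized Zipf weights: the word of rank k has weight 1/k."""
    return list(accumulate(1.0 / rank for rank in range(1, n_words + 1)))


def words_until_all_seen(n_words, cum_weights, rng, batch=1000):
    """Draw words at random until every one of the n_words has appeared.

    Returns the number of words read, including the draw that completes the set.
    """
    unseen = set(range(n_words))
    count = 0
    while True:
        # Draw in batches for speed, but stop counting once the set is complete.
        for word in rng.choices(range(n_words), cum_weights=cum_weights, k=batch):
            count += 1
            unseen.discard(word)
            if not unseen:
                return count


def simulate(n_words=1000, runs=1000, seed=0):
    """Return the sample mean and standard deviation over `runs` simulated readings."""
    rng = random.Random(seed)
    cum_weights = zipf_cum_weights(n_words)
    counts = [words_until_all_seen(n_words, cum_weights, rng) for _ in range(runs)]
    return mean(counts), stdev(counts)
```

Calling `simulate()` with the defaults corresponds to the thousand-word experiment and takes a while in pure Python. Calling `simulate(n_words=100)` corresponds to the radio station setup.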

This has pedagogical implications. Say you were learning a foreign language by studying naturally occurring text with a relatively small vocabulary, such as newspaper articles. You might have to read a lot of text before you’ve seen all of the thousand most common words.

On the one hand, it’s satisfying to read natural text. And it’s good to have the most common words reinforced the most. But it might be more effective to have slightly engineered text, text that has been subtly edited to make sure common words have not been left out. Ideally this would be done with such a light touch that it isn’t noticeable, unlike heavy-handed textbook dialogs.

Limitations on Venn diagrams

Posted on 28 September 2024 by John

Why do Venn diagrams almost always show the intersections of three sets and not more? Can Venn diagrams be generalized to show all intersections of more sets?

That depends on the rules you give yourself for generalization. If you require that your diagram consist of circles, then three is the limit. As John Venn put it in the original paper on Venn diagrams [1]:

Beyond three terms circles fail us, since we cannot draw a fourth circle which shall intersect three others in the way required.

But Mr. Venn noted that you could create what we now call a Venn diagram using four ellipses and included the following illustration.

Venn diagram with four ellipses by John Venn

(It’s confusing that there is an X inside the diagram. Venn meant that to be an asterisk and not the same symbol as the X outside. He says in the paper “Thus the one which is asterisked is instantly seen to be ‘X that is Y and Z, but is not W’.” Maybe someone else, like the publisher, drew the diagram for him.)

So the answer to whether, or how far, it is possible to generalize the classic Venn diagram depends on permissible generalizations of a circle. If you replace circles with arbitrary closed curves then Venn diagrams exist for all orders. If you demand the curves have some sort of symmetry, there are fewer options. It’s possible to make a Venn diagram from five ellipses, and that may be the limit for ellipses.

A Venn diagram is a visualization device, and so an important question is what is the limit for the use of Venn diagrams as an effective visualization technique. This is an aesthetic / pedagogical question, and not one with an objective answer, but in my mind the answer is four. Venn’s diagram made from four ellipses is practical for visualization, though it takes more effort to understand than the typical three-circle diagram.

Although my upper bound of four is admittedly subjective, it may be possible to make it objective post hoc. A Venn diagram made from n curves divides the plane into 2ⁿ regions [2]. In order to use more than four curves, you either have to gerrymander the curves or else tolerate some regions being much smaller than others. The former makes the diagram hard to understand, and the latter makes it hard to label the regions.

I suspect that if you make precise what it means for a curve to be simple [3], such as using ellipses or convex symmetric curves, and specify a maximum ratio between the largest and smallest bounded regions, then four curves will be the maximum.
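
The region count gives a quick way to see why circles stop at three. Two distinct circles cross in at most two points, so the k-th circle added to a drawing is cut into at most 2(k − 1) arcs, and each arc creates at most one new region. That caps n circles at n² − n + 2 regions. This is a standard counting argument, not something taken from Venn's paper. The sketch below compares that cap with the 2ⁿ regions a Venn diagram needs. The function names are made up for illustration.

```python
def regions_needed(n):
    """Regions in a Venn diagram of n sets, counting the outside."""
    return 2 ** n


def max_regions_from_circles(n):
    """Upper bound on the number of regions n >= 1 circles can cut the plane into."""
    return n * n - n + 2


for n in range(1, 7):
    needed = regions_needed(n)
    available = max_regions_from_circles(n)
    verdict = "not ruled out" if available >= needed else "ruled out"
    print(f"{n} circles: need {needed}, at most {available} -> {verdict}")
```

Three circles meet the cap exactly, which is consistent with the classic diagram, and four circles fall two regions short.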

Update: Here are a couple useful references.

[1] John Venn. On the Diagrammatic and Mechanical Representation of Propositions and Reasonings. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. July 1880.

[2] This includes the outside of the diagram representing the empty set. The diagram shows the intersection of 0, 1, 2, …, n sets, and the intersection of no sets is the empty set. This last statement might seem like an arbitrary definition, but it can be justified using category theory.

[3] Simple in the colloquial sense, which is more restrictive than the technical mathematical sense of a simple curve.

Packing versus unpacking

Posted on 11 May 2023 by John

I usually think of an instructor as someone who unpacks things, such as unpacking the meaning of an obscure word or explaining a difficult concept.

Last night I was trying to read some unbearably dry medical/legal material and thought about how an instructor might also pack things, wrapping dry material in some sort of story or illustration to make it more palatable.

I’ve often thought it would be nice to have someone explain this or that. I hadn’t really thought before that it would be nice to have someone make perfectly clear material more bearable to absorb, but I suppose that’s what a lot of instructors do. I was able to avoid courses that needed such an instructor, for which I am grateful, but I could imagine how a good instructor might make a memorization-based course tolerable.

Derive or memorize?

Posted on 24 February 2023 by John

A lot of smart people have a rule not to memorize anything that they can derive on the spot. That’s a good rule, up to a point. But past that point it becomes a liability.

Most students err on the side of memorizing too much. For example, it’s common for students to memorize three versions of Ohm’s law:

V = IR
I = V/R
R = V/I.

Not only is this wasteful, tripling the number of facts to remember, it’s also error prone. When you memorize things without understanding them, you have no way to detect mistakes. Someone memorizing the list above might think “Is I = V/R or R/V?” but someone who knows what the terms mean will know that more resistance means less current, so the latter cannot be right.
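
The same sanity check can be written down. The sketch below keeps V = IR as the one remembered form, solves it for the other two quantities, and confirms that current falls as resistance rises. The function names and the numbers are illustrative assumptions.

```python
def voltage(current, resistance):
    """Ohm's law in the one form worth remembering: V = I * R."""
    return current * resistance


def current_from(volts, resistance):
    """Solve V = I * R for I."""
    return volts / resistance


def resistance_from(volts, current):
    """Solve V = I * R for R."""
    return volts / current


# Sanity check: at a fixed 12 volts, doubling the resistance halves the current.
low_r = current_from(12.0, 4.0)
high_r = current_from(12.0, 8.0)
print(low_r, high_r)
```

The mistaken form R/V would give a current that grows with resistance, which is exactly what this kind of check catches.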

I got through my first probability class in college without memorizing anything. I worked every problem from first principles, and that was OK. But later on I realized that even though I could derive things from scratch every time I needed them, doing so was slowing me down and keeping me from seeing larger patterns.

The probability example stands out in my mind because it was a conscious decision. I must have implicitly decided to remember things in other classes, but it wasn’t such a deliberate choice.

I’d say “Don’t memorize, derive” is a good rule of thumb. But once you start to say to yourself “Here we go again. I know I can derive this. I’ve done it a dozen times.” then maybe it’s time to memorize the result. To put it another way, don’t memorize to avoid understanding. Memorize after thoroughly understanding.

Related post: Just-in-case versus just-in-time

Telescopes, awk, and learning

Posted on 1 December 2022 by John

Here’s a quote I think about often:

“It is faster to make a four-inch mirror and then a six-inch mirror than to make a six-inch mirror.” — Bill McKeeman, Thompson’s law of telescopes

If your goal is to make a six-inch mirror, why make a four-inch mirror first? From a reductionist perspective this makes no sense. But when you take into account how people learn, it makes perfect sense. The bigger project is more likely to succeed after you learn more about mirror-making in the context of a smaller project.

Awk

I was thrilled to discover the awk programming language in college. Munging files with little awk scripts was at least ten times easier than writing C programs.

When I told a friend about awk, he said “Have you seen Perl? It’ll do everything awk does and a lot more.”

If you want to learn Perl, I expect it would be faster to learn awk and then Perl than to learn Perl. I think I would have been intimidated by Perl if I’d tried to learn it first. But thinking of Perl as a more powerful awk made me more willing to try it. Awk made my life easier, and Perl had the potential to make it even easier. I’m not sure whether learning Perl was a good idea—that’s a discussion for another time—but I did.

C

I also learned C before learning C++. That was beneficial for similar reasons, starting with the four-inch mirror version of C++ before going on to the six-inch version.

Many people have said that learning C before C++ is a bad idea, that it teaches bad habits, and that it would be better to learn (modern) C++ from the beginning. That depends on what the realistic alternative is. Maybe if you attempted to learn C++ first you’d be intimidated and give up. As with giving up on learning Perl, giving up on learning C++ might be a good idea. At the time, however, learning C++ was a good move. Knowing C++ served me well when I left academia.

Learning on your own

Teaching yourself something requires different tactics than learning something in a classroom. The four-inch mirror warmup is more important when you’re learning on your own.

If I were teaching a course on C++, I would not teach C first. The added structure of a classroom makes it easier to learn C++ directly. The instructor can pace students through the material so as to avoid the intimidation they might face if they were attempting to learn C++ alone. Students don’t become overwhelmed and give up because they have the accountability of homework assignments etc. Of course some students will give up, but more would give up without the structure of a class.

Top-down vs bottom-up

From a strictly logical perspective, it’s most efficient to learn the most abstract version of a theorem first. But this is bad pedagogy. The people who are excited about the efficiency of compressing math this way, e.g. Bourbaki, learned what they know more concretely and incrementally, and think in hindsight that the process could be shortened.

It does save time to present things at some level of generality. However, the number of steps you can go up the abstraction ladder at a time varies by person. Some people might need to go one rung at a time, some could go two at a time or maybe three, but everyone has a limit. And you can take bigger steps when you have a teacher, or even better a tutor, to guide you and to rescue you if you try to take too big of a step.

You typically understand something better, and are more able to apply it, when you learn it bottom-up. People think they can specialize more easily than they can generalize, but the opposite is usually true. It’s easier to generalize from a few specific examples than to realize that a particular problem is an instance of a general pattern.

I’ve noticed this personally, and I’ve noticed it in other people. On Twitter, for example, I sometimes post a general and a concrete version of a theorem, and the more concrete version gets more engagement. The response to a general theorem may be “Ho hum. Everybody knows that.” but the response to a particular application may be “Wow, I never thought of that!” even when the latter is a trivial consequence of the former.

I think I’ll pass

Posted on 16 November 2020 by John

The other day I saw an article about some math test and thought “I bet I’d blow that away now.”

Anyone who has spent a career using some skill ought to blow away an exam intended for people who have been learning that skill for a semester.

However, after thinking about it more, I’m pretty sure I’d pass the test in question, but I’m not at all sure I’d ace it. Academic exams often test unimportant material that is in the short-term memory of both the instructor and the students.

From Timbuktu to …

When I was in middle school, I remember a question that read

It is a long way from ________ to ________.

I made up two locations that were far apart but my answer was graded as wrong.

My teacher was looking for a direct quote from a photo caption in our textbook that said it was a long way from Timbuktu to some place I can’t remember.

That stuck in my mind as the canonical example of a question that doesn’t test subject matter knowledge but tests the incidental minutiae of the course itself [1]. A geography professor would stand no better chance of giving the expected answer than I did.

The three reasons …

Almost any time you see a question asking for “the 3 reasons” for something or “the 5 consequences” of this or that, it’s likely a Timbuktu question. In open-world contexts [2], I’m suspicious whenever I see “the” followed by a specific number.

In some contexts you can make exhaustive lists—it makes sense to talk about the 3 branches of the US government or the 5 Platonic solids, but it doesn’t make sense to talk about the 4 causes of World War I. Surely historians could come up with more than 4 causes, and there’s probably no consensus regarding what the 4 most important causes are.

There’s a phrase, “teaching to the test,” for when the goal is not to teach the subject per se but to prepare the students to pass a standardized test related to the subject. The phenomenon discussed here is sort of the opposite, testing to the teaching.

When you ask students for the 4 causes of WWI, you’re asking for the 4 causes given in lecture or the 4 causes in the textbook. You’re not testing knowledge of WWI per se but knowledge of the course materials.

[1] Now that I’m in middle age rather than middle school, I could say that the real question was not geography but psychology. The task was to reverse-engineer from an ambiguous question what someone was thinking. That is an extremely valuable skill, but not one I possessed in middle school.

[2] A closed world is one in which the rules are explicitly known, finite, and exhaustive. Chess is a closed world. Sales is not. Academia often puts a box around some part of an open world so it can think of it as a closed world.

There’s more going on here

Posted on 17 September 2020 by John

At a new faculty orientation, a professor encouraged us rookies to teach intro courses and to keep coming back to teach them periodically. I didn’t fully appreciate what he said at the time, though I remembered it, even though I left academia a couple years later.

Now I think I have an idea what he was referring to. There’s a LOT of stuff swept under the rug, out of necessity, when teaching intro courses. The students think they’re starting at the beginning, and maybe junior faculty think the same thing, but they’re really starting in medias res.

For example, Michael Spivak’s Physics for Mathematicians makes explicit many of the implicit assumptions in a freshman mechanics class. Hardly anyone could learn physics if they had to start with Spivak. Instead, you do enough homework problems that you intuitively get a feel for things you can’t articulate and don’t fully understand. But it’s satisfying to read Spivak later and feel justified in thinking that things didn’t quite add up.

When you learn to read English, you’re told a lot of half-truths or quarter-truths. You’re told, for example, that English has 10 vowel sounds, when in reality it has more. Depending on how you count them, there are more than 20 vowel sounds in English. A child learning to read shouldn’t be burdened with a college-level course in phonetics, so it’s appropriate not to be too candid about the complexities of language at first.

It would have been easier for me to teach statistics when I was fresh out of college rather than teaching a few courses while I was working at MD Anderson. As a fresh graduate I could have taught out of a standard textbook in good conscience. By the time I did teach statistics classes, I was aware of how much material was not completely true or not practical.

I was thinking this morning about how there’s much more going on in a simple change of coordinates than is apparent at first. Tensor calculus is essentially the science of changing coordinates. It points out hidden structure, and creates conventions for making calculations manageable and for reducing errors. That’s not to say tensor calculus is easy but rather to say that changes of coordinates are hard.

Related post: Coming full circle

Variable-speed learning

Posted on 16 July 2018 by John

When I was in college, one of the professors seemed to lecture at a sort of quadratic pace, maybe even an exponential pace.

He would proceed very slowly at the beginning of the semester, so slowly that you didn’t see how he could possibly cover the course material by the end. But his pace would gradually increase to the point that he was going very quickly at the end. And yet the pace increased so smoothly that you were hardly aware of it. By understanding the first material thoroughly, you were able to go through the latter material quickly.

If you’ve got 15 weeks to cover 15 chapters, don’t assume the optimal pace is to cover one chapter every week.

I often read technical books the way the professor mentioned above lectured. The density of completely new ideas typically decreases as a book progresses. If your reading pace is proportional to the density of new ideas, you’ll start slow and speed up.

The preface may be the most important part of the book. For some books I’ve read only the preface and still felt like I got a lot out of the book.

The last couple chapters of technical books can often be ignored. It’s common for authors to squeeze in something about their research at the end of a book, even if it’s out of character with the rest of the book.

Off by one character

Posted on 26 April 2018 by John

There was a discussion on Twitter today about a mistake calculus students make:

$\frac{d}{dx}e^x = x e^{x-1}$

I pointed out that it’s only off by one character:

$\frac{d}{de}e^x = x e^{x-1}$

The first equation is simply wrong. The second is correct, but a gross violation of convention, using x as a constant and e as a variable.
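
Both claims are easy to check numerically. The sketch below uses a central difference, first varying the exponent x with the base fixed at e, then varying the base with the exponent fixed. The helper name, the step size, and the test point x = 3 are arbitrary choices for illustration.

```python
import math


def central_difference(f, t, h=1e-6):
    """Approximate f'(t) with a symmetric difference quotient."""
    return (f(t + h) - f(t - h)) / (2 * h)


x = 3.0

# Differentiate e^x with respect to x: the result is e^x, not x e^(x-1).
d_dx = central_difference(lambda t: math.e ** t, x)

# Differentiate b^x with respect to the base b, evaluated at b = e.
d_de = central_difference(lambda b: b ** x, math.e)

print(d_dx, math.exp(x))          # derivative in x versus e^x
print(d_de, x * math.exp(x - 1))  # derivative in the base versus x e^(x-1)
```

Each printed pair should agree to several decimal places, while the first derivative and x e^(x-1) differ noticeably at x = 3.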

It’s like this other thing except …

Posted on 24 April 2018 by John

One of my complaints about math writing is that definitions are hardly ever subtractive, even if that’s how people think of them.

For example, a monoid is a group except without inverses. But that’s not how you’ll see it defined. Instead you’ll read that it’s a set with an associative binary operation and an identity element. A module is a vector space, except the scalars come from a ring instead of a field. But the definition from scratch is more than I want to write here. Any time you have a sub-widget or a pre-widget or a semi-widget, it’s probably best to define the widget first.

I understand the logical tidiness of saying what a thing is rather than what it is not. But it makes more pedagogical sense to describe the difference between a new concept and the most similar familiar concept. And the nearest familiar concept may have more structure rather than less.

Suppose you wanted to describe where a smaller city is by giving directions from a larger, presumably better known city, but you could only move east. Then instead of saying Ft. Worth is 30 miles west of Dallas, you’d have to say it’s 1,000 miles east of Phoenix.

Writers don’t have to choose between crisp logic and good pedagogy. They can do both. They can say, for example, that a pre-thingy is a thingy without some property, then say “That is, a pre-thingy satisfies the following axioms: …”
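
The two-step style can be mirrored in code. The sketch below checks the monoid axioms on a small finite set, then defines the group check as the monoid check plus inverses, so that a monoid reads as a group without the inverse requirement. The function names and the mod 6 examples are illustrative choices, not anything from the post.

```python
from itertools import product


def is_monoid(elements, op, identity):
    """Closed, associative binary operation with an identity element."""
    elements = list(elements)
    members = set(elements)
    closed = all(op(a, b) in members for a, b in product(elements, repeat=2))
    associative = all(
        op(op(a, b), c) == op(a, op(b, c))
        for a, b, c in product(elements, repeat=3)
    )
    has_identity = all(op(identity, a) == a == op(a, identity) for a in elements)
    return closed and associative and has_identity


def is_group(elements, op, identity):
    """A monoid in which every element also has an inverse."""
    elements = list(elements)
    has_inverses = all(
        any(op(a, b) == identity == op(b, a) for b in elements) for a in elements
    )
    return is_monoid(elements, op, identity) and has_inverses


residues = range(6)
add_mod_6 = lambda a, b: (a + b) % 6
mul_mod_6 = lambda a, b: (a * b) % 6

print(is_group(residues, add_mod_6, 0))   # addition mod 6
print(is_monoid(residues, mul_mod_6, 1))  # multiplication mod 6
print(is_group(residues, mul_mod_6, 1))   # 0 has no multiplicative inverse
```

Multiplication mod 6 passes the monoid check and fails the group check, which is the subtractive description in action: the same structure, minus inverses.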
