The world is lumpy

The Pareto principle, or the 80-20 rule, says that 80% of output comes from 20% of inputs. For example, maybe the top 20% of salesmen generate 80% of a company’s revenue.

For some reason, the Pareto principle angers some people. Mention the Pareto principle and someone will explain why it can’t be true, based on an overly literal understanding of the principle. It’s a principle, a rule of thumb. It doesn’t mean that exactly 80% of output will come from 20% of input, though this is approximately true surprisingly often.

More generally, the Pareto principle means that importance is very unevenly distributed. For example, a study of Japanese kanji used in newspapers found around 4,000 characters in use, but the 10 most frequently used characters accounted for 10% of character usage [1]. If I wanted to learn to read Japanese, I’d start by learning these 10 kanji.

Top 10 kanji: 日一十二人大年会国三
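
Here is a rough sketch of how one might measure this kind of concentration in any body of text. The function name `top_k_coverage` and the toy sample string are my own inventions for illustration; they are not the study's method or data.

```python
from collections import Counter


def top_k_coverage(text: str, k: int) -> float:
    """Fraction of all characters in `text` covered by its k most frequent characters."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Toy data (hypothetical): 'a' x4, 'b' x3, 'c' x2, 'd' x1, ten characters in all.
sample = "aaaabbbccd"
print(top_k_coverage(sample, 1))  # share covered by the single most common character
print(top_k_coverage(sample, 2))  # share covered by the two most common characters
```

Run on a real newspaper corpus, filtered down to kanji, a function like this would give the kind of figure quoted above: the share of usage covered by the top 10 characters.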

The opposite of the Pareto principle would be the uniformitarian presumption that everything is equally important. A uniformitarian approach to reading Japanese would say “Well, there are 4,000 kanji in use, so I’m going to study a Japanese dictionary from the beginning and learn them all.” Even this would not be an entirely uniformitarian approach, since it starts with the 4,000 kanji the survey found in newspapers. Japanese has over 50,000 kanji.

Nobody would do this. Nobody is completely uniformitarian. However, most of us tend to underestimate how unevenly things are distributed. Of course the most common kanji are used the most: that’s what it means to be most common! But I would not have expected that just 10 characters account for 10% of character use.

I know about the Pareto principle, power laws, etc. I know things are unevenly distributed and have written about this many times. For example, I wrote about Twitter follower distribution a few weeks ago. I expected kanji frequency to be very uneven, but I still underestimated how uneven it is.

[1] The same study found that the top 500 kanji accounted for 80% of characters. So 500 out of 4,000, or about 12.5% of kanji, accounted for 80% of usage. That’s relative to the 4,000 kanji found in the study. It’s less than 1% of the potential kanji someone could use.
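
The arithmetic behind this footnote can be checked directly. This is a small sketch: the variable names are mine, and 50,000 is used as a lower bound on the total number of kanji, since the post says only "over 50,000".

```python
top_kanji = 500          # kanji covering 80% of usage in the study
kanji_in_study = 4_000   # distinct kanji the study found in newspapers
kanji_total = 50_000     # lower bound on all kanji ("over 50,000")

share_of_study = top_kanji / kanji_in_study
share_of_total = top_kanji / kanji_total  # upper bound, since kanji_total is a lower bound

print(f"{share_of_study:.1%} of the kanji in the study")  # 12.5% of the kanji in the study
print(f"at most {share_of_total:.1%} of all kanji")       # at most 1.0% of all kanji
```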

3 thoughts on “The world is lumpy”

Ciprian Tomoiaga

1 November 2022 at 09:52

We need a regular reminder about this! Thank you for posting the example.

Do you think this relates to the Pareto frontier, beyond just the name? My mind ran to that when you mentioned “everything is equally important”. And I remembered that with more dimensions, there are more Pareto-optimal items. So from that point of view, in some way, everything is almost equally important, right?

Maybe I’m getting this wrong. Our school example of a Pareto frontier was the decision to buy a car. If you only judge by speed, there is clearly 1 winner. If both speed and colour are important to you, there will be more items that are not dominated by the others. If you add more criteria, you get a lot more items that are “optimal”.
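
The car example can be tried out with a small simulation. The sketch below is illustrative only: the 200 "cars" and their five scores are random numbers (higher is better), and the brute-force `pareto_front` helper is a hypothetical name, not taken from any library. It counts how many cars remain undominated as criteria are added one at a time.

```python
import random


def dominates(q, p):
    """True if q is at least as good as p on every criterion and strictly better on at least one."""
    return all(a >= b for a, b in zip(q, p)) and any(a > b for a, b in zip(q, p))


def pareto_front(points):
    """Return the points that no other point dominates (brute force, O(n^2))."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


rng = random.Random(0)  # fixed seed so the run is repeatable
# 200 hypothetical cars, each with 5 random scores in [0, 1).
cars = [tuple(rng.random() for _ in range(5)) for _ in range(200)]

front_sizes = {}
for d in range(1, 6):
    # Judge the same cars using only their first d criteria.
    projected = [car[:d] for car in cars]
    front_sizes[d] = len(pareto_front(projected))
    print(f"{d} criteria: {front_sizes[d]} undominated cars")
```

With one criterion there is a single winner. Because each step reuses the same cars and only adds a criterion, a car that is undominated on d criteria stays undominated on d + 1 (barring exact ties), so the count can only stay the same or grow.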

Robert

1 November 2022 at 17:19

When I started studying Japanese, I erroneously believed that if, say, 80% of a typical text consists of X number of Kanji, then learning X number of Kanji means I’ll understand 80% of the text.
The mistake I made is thinking that reading comprehension is linear. It’s not.
I’m now about halfway through the official ~2000 Kanji list and my reading comprehension is still very bad, even though I theoretically recognize 80–85% of all characters.
Somebody who’s a bit more advanced told me there’s a threshold at around 1300 Kanji. When you cross this threshold, you can read most texts without getting lost.
The Pareto principle sometimes lures us into the wrong direction, when the nonlinearities are what characterizes our system.

Ondra

1 November 2022 at 17:57

Day, one, ten, two, person, big, year, meet, country, three. You can see why these kanji in particular would be popular. And they form compounds as well, e.g. 米国人 means an American person – two of the top 10 right there.
