How rare is it to encounter a rare word?


I recently ran across a paper on typesetting rare Chinese characters. From the abstract:

Written Chinese has tens of thousands of characters. But most available fonts contain only around 6 to 12 thousand common characters that can meet the needs of everyday users. However, in publications and information exchange in many professional fields, a number of rare characters that are not in common fonts are needed in each document.

There’s sort of a paradox here: the author is saying it’s common to need rare words. Aren’t rare words, you know, rare? Of course they are, but the chances of needing some rare word, not just a particular rare word, can be large, particularly in lengthy documents.

This post gives a sort of back-of-the-envelope calculation to justify the preceding paragraph.

Word frequencies often approximately follow Zipf’s law, where the frequency of the nth most common word is proportional to n raised to some negative power s. I’ve seen estimates that there are around N = 50,000 characters in Chinese, but that 1,000 characters make up about 90% of usage. This would correspond to a value of s around 1.25.

In practice, Zipf’s law, like all power laws, fits better over some parts of its range than others. We’re making a simplifying assumption by applying Zipf’s law to the entire vocabulary of Chinese, but this post isn’t trying to precisely model Chinese character frequency, only to show that the statement quoted above is plausible.

With our Zipf’s law model, the 10,000th most common character in Chinese would appear about 2 times in a million characters. But the frequency of all the words from the 10,000th most common to the 50,000th most common would be about 0.03.

So if we list all characters in order of frequency and call everything after the 10,000th position on the list rare, the combined frequency of all rare words is quite high, about 3%. To put it another way, a document of 1,000 words would likely contain around 30 rare words, according to the simplified model presented here.

Related posts

[1] The Chinese character at the top of the post comes from here. According to the source, “The Chinese character ‘biáng’ used to represent Biang Biang noodles, is one of the most complex and rare Chinese characters. It has 56 strokes and cannot be found in modern dictionaries or entered into computers.”

Twitter follower distribution

A conversation this morning prompted the question of how many Twitter accounts have between 10,000 and 20,000 followers. I hadn’t thought about the distribution numbers of followers in a while and was curious to revisit the topic.

Apparently this question was more popular five years ago. When I did a few searches on the topic, all the results I got back were five or more years old. Also, data about the Twitter network was more available then that it is now.

This post will come up with an estimate based on almost no data, approaching this like a Fermi problem.


I’m going to assume that the number of followers for the nth most followed account is

f = c n−α

for some constants c and α. A lot of networks approximately follow some distribution like this, and often α is somewhere between 1 and 3.

Two data points

I’ve got two unknowns, so I need two data points. Wikipedia has a list of the 50 most followed Twitter accounts here, and the 50th account has 37.7 million followers. (I chose the 50th account on the list rather than a higher one because power laws typically fit better after the first few data points.)

I believe that a few years ago the median number of Twitter followers was 0: more than half of Twitter accounts had no followers. Let’s assume the median number of followers is 1. But median out of what? I think I read there are about 350 million total Twitter accounts, and about 200 million active accounts. So should we base our median out of 350 accounts or 200 accounts? We’ll split the difference and assume the median account is the 137,500,000th most popular account.

Solve for parameters

So now we have two equations:

c 50−α = 37,700,000

c 137500000−α = 1

and taking logs gives us two linear equations in two unknowns. We solve this to find α = 1.1 and c = 2.9 × 109. The estimate of α is about the size we expected, so that’s a good sign.

Take this with a grain of salt. We’ve guessed a very simple model and fit it with just two data points, one of which we based on an educated guess.

Final estimate

We assumed f = c n−α and we can solve this for n. If an account has n followers, we’d estimate its rank n as

n = (c / f)1/α.

So we’d estimate the number of accounts with between 10,000 and 20,000 followers to be

(c / 10000)1/α − (c /20000)1/α.

which is about 40,000. I expect this final estimate is not bad, say within an order of magnitude, despite all the crude assumptions made along the way.

Yule-Simon distribution

The Yule-Simon distribution, named after Udny Yule and Herbert Simon, is a discrete probability with pmf

f(k; \rho) = \rho B(k, \rho + 1)

The semicolon in f(k; ρ) suggests that we think of f as a function of k, with a fixed parameter ρ. The way the distribution shows the connection to the beta function, but for our purposes it will be helpful to expand this function using

B(x, y) = \frac{\Gamma(x) \, \Gamma(y)}{\Gamma(x+y)}

and so

\begin{align*} f(k; \rho) &= \rho B(k, \rho + 1) \\ &= \rho \Gamma(\rho + 1) \,\,\frac{\Gamma(k)}{\Gamma(k + \rho + 1)} \\ &= \rho \Gamma(\rho + 1) \,\, \frac{1}{(k + \rho)^{\underline{\rho + 1}}} \end{align*}

Ignore the first part of the last line, ρ Γ(ρ + 1), because it doesn’t involve k. It helps to ignore proportionality constants in probability densities when they’re not necessary. What’s left is the (ρ + 1) falling power of k + ρ.

For large values of k, the falling power term is asymptotically equal to kρ+1. To see this, let k = 1000 and ρ = 3. Then we’re saying that the ratio of

1003 × 1002 × 1001 × 1000


1000 × 1000 × 1000 × 1000

is approximately 1, and the ratio converges 1 as k increases.

This says that the Yule-Simon distribution is a power law in the tails, just like the Zipf distribution and the zeta distribution. Details of the comparison between these three distributions are given here.

Related posts

Estimating the exponent of discrete power law data

Suppose you have data from a discrete power law with exponent α. That is, the probability of an outcome n is proportional to n−α. How can you recover α?

A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That does a very poor job [1]. The discrete case needs its own estimator.

To illustrate this, we start by generating 5,000 samples from a discrete power law with exponent 3 in the following Python code.

   import numpy.random

   alpha = 3
   n = 5000
   x = numpy.random.zipf(alpha, n)

The continuous MLE is very simple to implement:

    alpha_hat = 1 + n / sum(log(x))

Unfortunately, it gives an estimate of 6.87 for alpha, though we know it should be around 3.

The MLE for the discrete power law distribution satisfies

\frac{\zeta'(\hat{\alpha})}{\zeta(\hat{\alpha})} = -\frac{1}{n} \sum_{i-1}^n \log x_i

Here ζ is the Riemann zeta function, and xi are the samples. Note that the left side of the equation is the derivative of log ζ, or what is sometimes called the logarithmic derivative.

There are three minor obstacles to finding the estimator using Python. First, SciPy doesn’t implement the Riemann zeta function ζ(x) per se. It implements a generalization, the Hurwitz zeta function, ζ(x, q). Here we just need to set q to 1 to get the Riemann zeta function.

Second, SciPy doesn’t implement the derivative of zeta. We don’t need much accuracy, so it’s easy enough to implement our own. See an earlier post for an explanation of the implementation below.

Finally, we don’t have an explicit equation for our estimator. But we can easily solve for it using the bisection algorithm. (Bisect is slow but reliable. We’re not in a hurry, so we might as use something reliable.)

    from scipy import log
    from scipy.special import zeta
    from scipy.optimize import bisect 

    xmin = 1

    def log_zeta(x):
        return log(zeta(x, 1))

    def log_deriv_zeta(x):
        h = 1e-5
        return (log_zeta(x+h) - log_zeta(x-h))/(2*h)

    t = -sum( log(x/xmin) )/n
    def objective(x):
        return log_deriv_zeta(x) - t

    a, b = 1.01, 10
    alpha_hat = bisect(objective, a, b, xtol=1e-6)

We have assumed that our data follow a power law immediately from n = 1. In practice, power laws generally fit better after the first few elements. The code above works for the more general case if you set xmin to be the point at which power law behavior kicks in.

The bisection method above searches for a value of the power law exponent between 1.01 and 10, which is somewhat arbitrary. However, power law exponents are very often between 2 and 3 and seldom too far outside that range.

The code gives an estimate of α equal to 2.969, very near the true value of 3, and much better than the naive estimate of 6.87.

Of course in real applications you don’t know the correct result before you begin, so you use something like a confidence interval to give you an idea how much uncertainty remains in your estimate.

The following equation [2] gives a value of σ from a normal approximation to the distribution of our estimator.

\sigma = \frac{1}{\sqrt{n\left[ \frac{\zeta''(\hat{\alpha}, x_{min})}{\zeta(\hat{\alpha}, x_{min})} - \left(\frac{\zeta'(\hat{\alpha}, x_{min})}{\zeta(\hat{\alpha}, x_{min})}\right)^2\right]}}

So an approximate 95% confidence interval would be the point estimate +/− 2σ.

    from scipy.special import zeta
    from scipy import sqrt

    def zeta_prime(x, xmin=1):
        h = 1e-5
        return (zeta(x+h, xmin) - zeta(x-h, xmin))/(2*h)

    def zeta_double_prime(x, xmin=1):
        h = 1e-5
        return (zeta(x+h, xmin) -2*zeta(x,xmin) + zeta(x-h, xmin))/h**2

    def sigma(n, alpha_hat, xmin=1):
        z = zeta(alpha_hat, xmin)
        temp = zeta_double_prime(alpha_hat, xmin)/z
        temp -= (zeta_prime(alpha_hat, xmin)/z)**2
        return 1/sqrt(n*temp)

    print( sigma(n, alpha_hat) )

Here we use a finite difference approximation for the second derivative of zeta, an extension of the idea used above for the first derivative. We don’t need high accuracy approximations of the derivatives since statistical error will be larger than the approximation error.

In the example above, we have α = 2.969 and σ = 0.0334, so a 95% confidence interval would be [2.902, 3.036].


[1] Using the continuous MLE with discrete data is not so bad when the minimum output xmin is moderately large. But here, where xmin = 1 it’s terrible.

[2] Equation 3.6 from Power-law distributions in empirical data by Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman.


Social networks in fact and fiction

SIAM News arrived this afternoon and had an interesting story on the front page: Applying math to myth helps separate fact from fiction.

In a nutshell, the authors hope to get some insight into whether a myth is based on fact by seeing whether the social network of characters in the myth looks more like a real social network or like the social network in a work of deliberate fiction. For instance, the social networks of the Iliad and Beowulf look more like actual social networks than does the social network of Harry Potter. Real social networks follow a power law distribution more closely than do social networks in works of fiction.

This could be interesting. For example, the article points out that some scholars believe Beowulf has a basis in historical events, though they don’t believe that Beowulf the character corresponds to a historical person. The network approach lends support to this position: the Beowulf social network looks more realistic when Beowulf himself is removed.

It seems however that an accurate historical account might have a suspicious social network, not because the events in it were made up but because they were filtered according to what the historian thought was important.

Word frequencies in human and computer languages

This is one of my favorite quotes from Starbucks’ coffee cups:

When I was young I was mislead by flash cards into believing that xylophones and zebras were much more common.

Alphabet books treat every letter as equally important even though letters like X and Z are far less common than letters like E and T. Children need to learn the entire alphabet eventually, and there are only 26 letters, so teaching all the letters at once is not bad. But uniform emphasis doesn’t scale well. Learning a foreign language, or a computer language, by learning words without regard to frequency is absurd. The most common words are far more common than the less common words, and so it makes sense to learn the most common words first.

John Miles White has applied this idea to learning R. He did a keyword frequency analysis for R and showed that the frequency of the keywords follows Zipf’s law or something similar. I’d like to see someone do a similar study for other programming languages.

It would be interesting to write a programming language tutorial that introduces the keywords in the approximately the order of their frequency. Such a book might be quite unorthodox, and quite useful.

White points out that when teaching human languages in a classroom, “the usefulness of a word tends to be confounded with its respectability.” I imagine something similar happens with programming languages. Programs that produce lists of Fibonacci numbers or prime numbers are the xylophones and zebras of the software world.

Related posts

Power laws and the generalized CLT

Here’s an expert from a recent ACM Ubiquity interview with David Alderson that raises a few questions.

Actually, they [power laws] aren’t special at all. They can arise as natural consequences of aggregation of high variance data. You know from statistics that the Central Limit Theorem says distributions of data with limited variability tend to follow the Normal (bell-shaped, or Gaussian) curve. There is a less well-known version of the theorem that shows aggregation of high (or infinite) variance data leads to power laws. Thus, the bell curve is normal for low-variance data and the power law curve is normal for high-variance data. In many cases, I don’t think anything deeper than that is going on.

In this post I will explain the theory I believe Alderson is alluding to in his informal remarks. I’ll also explain some restrictions necessary for this theory to hold.

I don’t understand what Alderson has in mind when he refers to data with high but finite variance. If the variance is large but finite, the classical Central Limit Theorem (CLT) holds. If the variance is infinite, the classical CLT does not apply but a Generalized Central Limit Theorem might (or might not) apply.

The Generalized CLT says that if the “aggregation” converges to a non-degenerate distribution, that distribution must be a stable distribution. Also, stable distributions (except for normal distributions) have tails that are asymptotically proportional to the tails of a power law distribution. Note that this does not say under what conditions the aggregation has a non-degenerate limit. It only says something about what that limit must be like if it exists. Also, this does not say that the limit is a power law, only that it is a distribution whose tails are eventually proportional to those of a power law distribution.

In order to better understand what’s going on, there are several gaps to fill in.

  1. What are stable distributions?
  2. What do we mean by aggregation?
  3. What conditions insure that a non-degenerate limiting distribution exists?

Let X0, X1, and X2 be independent, identically distributed (iid) random variables. The distribution of these random variables is called stable if for every pair of positive real numbers a and b, there exists a positive c and a real d such that cX0 + d has the same distribution as aX1 + bX2.

Stable distributions can be specified by four parameters. One of the four parameters is the exponent parameter 0 < α ≤ 2. This parameter is controls the thickness of the distribution tails. The distributions with α = 2 are the normal (Gaussian) distributions. For α < 2, the PDF is asymptotically proportional to |x|−α−1 and the CDF is asymptotically proportional to |x|−α as x → ±∞.  And so except for the normal distribution, all stable distributions have thick tails.

A stable distribution can be described in terms of its characteristic function, the Fourier transform of its PDF.  The description of the characteristic function is a little complicated, but it can be written down in closed form. (See John P. Nolan’s notes on stable distributions for much more information.) However, the PDFs can only be written down in closed form in three special cases: the normal, Cauchy, and Lévy distributions. These three distributions correspond to α = 2, 1, and 1/2 respectively.

The Generalized CLT holds if there is a sequence of constants an and bn such that (X1 + X2 + … + Xnbn) / an converges to a stable distribution. This is what is meant by the “aggregation” of the X‘s. The factors are an necessarily asymptotically equal to n1/α where α is the exponential parameter for the limiting distribution.

We now get to the most critical question: what kinds of random variables lead to stable distributions when aggregated? They must have tails something like the tails of the limiting distribution. In this sense the Generalized CLT is not as magical as the classical CLT. The classical CLT says you can aggregate random variables quite unlike a normal distribution and get a normal distribution out in the limit. But the Generalized CLT requires that the distribution of the X‘s must be somewhat similar to limiting distribution. The specific requirements are given below.

Let F(x) be the CDF for the random variables Xi. The following conditions on F are necessary and sufficient for the aggregation of the X‘s to converge to a stable distribution with exponent α < 2.

  1. F(x) = (c1 + o(1)) |x|−α h(|x|) as x → -∞, and
  2. 1 − F(x) = (c2 + o(1)) x−α h(x) as x → ∞

where h(x) is a slowly varying function. The notation o(1) is explained in these notes on asymptotic notation. A slowly varying function h(x) is one such that h(cx) / h(x) → 1 as x → ∞ for all c > 0. Roughly speaking, this means F(x) has to look something like |x|−α in both the left and right tails, and so the X‘s must be distributed something like the limiting distribution.

Power laws do not fall out of the Generalized CLT as easily as the normal distribution falls out of the classical CLT. The aggregation of random variables with infinite variance might not converge to any distribution, or it might converge to a degenerate distribution. And if the aggregation converges to a non-degenerate distribution, this distribution is not strictly a power law but rather has tails like a power law.

Related links

Metabolism and power laws

Bigger animals have more cells than smaller animals. More cells means more cellular metabolism and so more heat produced. How does the amount of heat an animal produces vary with its size? We clearly expect it to go up with size, but does it increase in proportion to volume? Surface area? Something in between?

A first guess would be that metabolism (equivalently, heat produced) goes up in proportion to volume. If cells are all roughly the same size, then number of cells increases proportionately with volume. But heat is dissipated through the surface. Surface area increases in proportion to the square of length but volume increases in proportion to the cube of length. That means the ratio of surface area to volume decreases as overall size increases. The surface area to volume ratio for an elephant is much smaller than it is for a mouse. If an elephant’s metabolism per unit volume were the same as that of a mouse, the elephant’s skin would burn up.

So metabolism cannot be proportional to volume. What about surface area? Here we get into variety and controversy. Many people assume metabolism is proportional to surface area based on the argument above. This idea was first proposed by Max Rubner in 1883. Others emphasize data that supports the theory that suggests metabolism is proportional to surface area.

In the 1930’s, Max Kleiber proposed that metabolism increases according to body mass raised to the power 3/4. (I’ve been a little sloppy here using body mass and volume interchangeably. Body mass is more accurate, though to first approximation animals have uniform density.) If metabolism were proportional to volume, the exponent would be 1. If it were proportional to surface area, the exponent would be 2/3. But Kleiber’s law says it’s somewhere in between, namely 3/4. The image below comes from a paper by Kleiber from 1947.

Kleiber M. (1947). Body size and metabolic rate. Physiological Reviews 27: 511-541.

The graph shows that on a log-log plot, the metabolism rate versus body mass for a large variety of animals has slope approximately 3/4.

So why the exponent 3/4? There is a theoretical explanation called the metabolic scaling theory proposed by Geoffrey West, Brian Enquist, and James Brown. Metabolic scaling theory says that circulatory systems and other networks are fractal-like because this is the most efficient way to serve an animal’s physiological needs. To quote Enquist:

Although living things occupy a three-dimensional space, their internal physiology and anatomy operate as if they were four-dimensional. … Fractal geometry has literally given life an added dimension.

The fractal theory would explain the power law exponent 3/4 simply: it’s the ratio of the volume dimension to the fractal dimension. However, as I suggested earlier, this theory is controversial. Some biologists dispute Kleiber’s law. Others accept Kleiber’s law as an empirical observation but dispute the theoretical explanation of West, Enquist, and Brown.

To read more about metabolism and power laws, see chapter 17 of Complexity: A Guided Tour.

More power law posts

Rate of regularizing English verbs

English verbs form the past tense two ways. Regular verbs ad “ed” to the end of the verb. For example, “work” becomes “worked.” Irregular verbs, sometimes called “strong” verbs, require internal changes to form the past tense. For example, “sing” becomes “sang” and “do” becomes “did.”

Irregular verbs are becoming regularized over time. For example, “help” is now a regular verb, though its past tense was once “holp.” (I’ve heard that you can still occasionally hear someone use archaic forms such as “holp” and “holpen.”)

What I find most interesting about this change quantifying the rate of change. It appears that the half-life of an irregular verb is proportional to the square root of its frequency. Rarely used irregular verbs are most quickly regularized while commonly used irregular verbs are the most resistant to change.

Exceptions have to be constantly reinforced to keep speakers from applying the more general rules. Exceptions that we hear less often get dropped over time. So it’s not surprising that half-life is a decreasing function of frequency. What is surprising is that that half life is such a simple decreasing function, a constant over square root.

Source: Quantifying the evolutionary dynamics of language

Networks and power laws

In many networks—social networks, electrical power networks, networks of web pages, etc.—the number of connections for each node follows a statistical distribution known as a power law. Here’s a model of network formation that shows how power laws can emerge from very simple rules.

Albert-László Barabási describes the following algorithm in his book Linked. Create a network by starting with two connected nodes. Then add nodes one at a time. Every time a new node joins the network, it connects to two older nodes at random with probability proportional to the number of links the old nodes have. That is, nodes that already have more links are more likely to get new links. If a node has three times as many links as another node, then it’s three times as attractive to new nodes. The rich get richer, just like with Page Rank.

Barabási says this algorithm produces networks whose connectivity distribution follows a power law. That is, the number of nodes with n connections is proportional to nk. In particular he says k = 3.

I wrote some code to play with this algorithm. As the saying goes, programming is understanding. There were aspects of the algorithm I never would have noticed had I not written code for it. For one thing, after I decided to write a program I had to read the description more carefully and I noticed I’d misunderstood a couple details.

If the number of nodes with n connections really is proportional to nk, then when you plot the number of nodes with 1, 2, 3, … connections, the points should fall near a straight line when plotted on the log-log scale and the slope of this line should be –k.

In my experiment with 10,000 nodes, I got a line of slope −2.72 with a standard error of 0.04. Not exactly −3, but maybe the theoretical result only holds in the limit as the network becomes infinitely large. The points definitely fall near a straight line in log-log scale:

In this example, about half the nodes (5,073 out of 10,000) had only two connections. The average number of connections was 4, but the most connected node had 200 connections.

Note that the number of connections per node does not follow a bell curve at all. The standard deviation on the number of connections per node was 6.2. This means the node with 200 connections was over 30 standard deviations from the average. That simply does not happen with normal (Gaussian) distributions. But it’s not unusual at all for power law distributions because they have such thick tails.

Of course real networks are more complex than the model given here. For example, nodes add links over time, not just when they join a network. Barabási discusses in his book how some elaborations of the original model still produce power laws (possibly with different exponents) and while others do not.