Estimating vocabulary size with Heaps’ law

Heaps’ law says that the number of unique words in a text of n words is approximated by

V(n) = K n^β

where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 and 0.6.

(Note that it’s Heaps’ law, not Heap’s law. The law is named after Harold Stanley Heaps. However, true to Stigler’s law of eponymy, the law was first observed by someone else, Gustav Herdan.)

I’ll demonstrate Heaps’ law by looking at books of the Bible and then by looking at novels of Jane Austen. I’ll also look at words that appear only once, what linguists call “hapax legomena.”

Demonstrating Heaps’ law

For a collection of related texts, you can estimate the parameters K and β from data. I decided to see how well Heaps’ law worked in predicting the number of unique words in each book of the Bible. I used the King James Version because it is easy to download from Project Gutenberg.

I converted each line to lower case, replaced all non-alphabetic characters with spaces, and split the text on spaces to obtain a list of words. This gave the following statistics:

    |------------+-------+------|
    | Book       |     n |    V |
    |------------+-------+------|
    | Genesis    | 38520 | 2448 |
    | Exodus     | 32767 | 2024 |
    | Leviticus  | 24621 | 1412 |
                    ...
    | III John   |   295 |  155 |
    | Jude       |   609 |  295 |
    | Revelation | 12003 | 1283 |
    |------------+-------+------|

The parameter values that best fit the data were K = 10.64 and β = 0.518, in keeping with the typical ranges of these parameters.
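
Here’s a minimal sketch of the preprocessing and fitting just described. The per-book totals are assumed to be collected into arrays, and the fit is done by least squares on the log-log scale, which is one reasonable way to estimate K and β (the post doesn’t say exactly which fitting method was used).

    import re
    import numpy as np

    def word_stats(text):
        # lower case, replace non-alphabetic characters with spaces, split on spaces
        words = re.sub("[^a-z]", " ", text.lower()).split()
        return len(words), len(set(words))

    # n[i] = total words and V[i] = vocabulary size for each book,
    # e.g. the values tabulated above (first three books shown here)
    n = np.array([38520, 32767, 24621])   # ... and so on for all 66 books
    V = np.array([ 2448,  2024,  1412])   # ... and so on for all 66 books

    # Fit V = K n^β by regressing log V on log n
    beta, logK = np.polyfit(np.log(n), np.log(V), 1)
    K = np.exp(logK)   # the full fit reported above was K = 10.64, β = 0.518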

Here’s a sample of how the actual vocabulary size and predicted vocabulary size compare.

    |------------+------+-------|
    | Book       | True | Model |
    |------------+------+-------|
    | Genesis    | 2448 |  2538 |
    | Exodus     | 2024 |  2335 |
    | Leviticus  | 1412 |  2013 |
                    ...
    | III John   |  155 |   203 |
    | Jude       |  295 |   296 |
    | Revelation | 1283 |  1387 |
    |------------+------+-------|

Here’s a visual representation of the results.

KJV bible total words vs distinct words

It looks like the predictions are more accurate for small books, and that’s true on an absolute scale. But the relative error is actually smaller for large books as we can see by plotting again on a log-log scale.

KJV bible total words vs distinct words, log-log scale

Jane Austen novels

It’s a little surprising that Heaps’ law applies well to books of the Bible since the books were composed over centuries and in two different languages. On the other hand, the same committee translated all the books at the same time. Maybe Heaps’ law applies to translations better than it applies to the original texts.

I expect Heaps’ law would fit more closely if you looked at, say, all the novels by a particular author, especially if the author wrote all the books in his or her prime. (I believe I read that someone did a vocabulary analysis of Agatha Christie’s novels and detected a decrease in her vocabulary in her latter years.)

To test this out I looked at Jane Austen’s novels on Project Gutenberg. Here’s the data:

    |-----------------------+--------+------|
    | Novel                 |      n |    V |
    |-----------------------+--------+------|
    | Northanger Abbey      |  78147 | 5995 |
    | Persuasion            |  84117 | 5738 |
    | Sense and Sensibility | 120716 | 6271 |
    | Pride and Prejudice   | 122811 | 6258 |
    | Mansfield Park        | 161454 | 7758 |
    | Emma                  | 161967 | 7092 |
    |-----------------------+--------+------|

The parameters in Heaps’ law work out to K = 121.3 and β = 0.341, a much larger K than before, and a smaller β.

Here’s a comparison of the actual and predicted vocabulary sizes in the novels.

    |-----------------------+------+-------|
    | Novel                 | True | Model |
    |-----------------------+------+-------|
    | Northanger Abbey      | 5995 |  5656 |
    | Persuasion            | 5738 |  5799 |
    | Sense and Sensibility | 6271 |  6560 |
    | Pride and Prejudice   | 6258 |  6598 |
    | Mansfield Park        | 7758 |  7243 |
    | Emma                  | 7092 |  7251 |
    |-----------------------+------+-------|

If a suspected posthumous manuscript of Jane Austen were to appear, a possible test of authenticity would be to look at its vocabulary size to see if it is consistent with her other works. One could also look at the number of words used only once, as we discuss next.

Hapax legomenon

In linguistics, a hapax legomenon is a word that only appears once in a given context. The term comes from a Greek phrase meaning something said only once. The term is often shortened to just hapax.

I thought it would be interesting to look at the number of hapax legomena in each book since I could do it with a minor tweak of the code I wrote for the first part of this post.

Normally if someone were speaking of hapax legomena in the context of the Bible, they’d be looking at unique words in the original languages, i.e. Hebrew and Greek, not in English translation. But I’m going with what I have at hand.
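
The tweak to the word-counting code is small: tally word frequencies and keep only the words that occur exactly once. A sketch, reusing the tokenization above:

    import re
    from collections import Counter

    def hapax_count(text):
        # same tokenization as before: lower case, strip non-alphabetic characters
        words = re.sub("[^a-z]", " ", text.lower()).split()
        counts = Counter(words)
        return sum(1 for c in counts.values() if c == 1)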

Here’s a plot of the number of hapax legomena in each book of the KJV compared to the number of words in the book.

Hapax legomena in Bible, linear scale

This looks a lot like the plot of vocabulary size versus total words, suggesting the number of hapax legomena also follows a power law like Heaps’ law. This is evident when we plot the data again on a log-log scale and see a linear relation.

Number of hapax legomena on a log-log scale

Just to be clear on the difference between the two analyses in this post: in the first we looked at vocabulary size, the number of distinct words in each book. In the second we looked at words that only appear once. In both cases we’re counting unique words, but unique in different senses. In the first analysis, unique means that each word only counts once, no matter how many times it’s used. In the second, unique means that a word only appears once.

Testing Cliff RNG with DIEHARDER

My previous post introduced the Cliff random number generator. The post showed how to find starting seeds where the generator will start out by producing approximately equal numbers. Despite this flaw, the generator works well by some criteria.

I produced a file of a billion 32-bit integers by multiplying the output values, which were floating point numbers between 0 and 1, by 2^32 and truncating to integer. Then I ran the DIEHARDER random number generator test suite.
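
Here’s a sketch of how such a file might be produced, assuming the Cliff update x ← |100 ln x| mod 1 from the previous post; the seed and the (much smaller) sample count here are arbitrary.

    from math import log
    import numpy as np

    def cliff_uint32(seed, count):
        x = seed
        out = np.empty(count, dtype=np.uint32)
        for i in range(count):
            x = abs(100 * log(x)) % 1   # Cliff update: a float in [0, 1)
            out[i] = int(x * 2**32)     # scale by 2^32 and truncate to an integer
        return out

    cliff_uint32(0.5, 1_000_000).tofile("cliff.bin")   # raw binary input for the test suite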

The results were interesting. Before running the tests, I thought the tests would nearly all pass or nearly all fail, more likely the latter. But what happened was that many tests passed and some failed hard [1].

Here’s a list of the tests that passed:

  • diehard_birthdays
  • diehard_rank_32x32
  • diehard_rank_6x8
  • diehard_bitstream
  • diehard_oqso
  • diehard_dna
  • diehard_count_1s_str
  • diehard_count_1s_byt
  • diehard_runs
  • sts_monobit
  • sts_serial
  • rgb_bitdist
  • rgb_kstest_test
  • dab_dct
  • dab_filltree2
  • dab_monobit2

The tests that failed were:

  • diehard_parking_lot
  • diehard_2sphere
  • diehard_3sphere
  • diehard_squeeze
  • diehard_craps
  • marsaglia_tsang_gcd
  • rgb_lagged_sum
  • dab_bytedistrib

I’ve left out a few test results that were ambiguous, as well as tests that were described as “Suspect” and “Do not use” on the DIEHARDER web site.

The site I mentioned in the previous post where I ran across this generator said that it passed a spherical generation test. I assume the implementation of that test was less demanding than the version included in DIEHARD. But the generator does well by other tests.

The lagged sum test tests for autocorrelation. Maybe the failure of this test has something to do with the fixed points discussed earlier.

Update: After writing this post I discovered that the generator has a short period, as I discuss here. That explains why the lagged sum test fails: the output has perfect autocorrelation at a lag equal to the period.

[1] By “failed hard” I mean the test returned a p-value of zero. The p-value couldn’t actually be zero, but it was close enough that the displayed value was exactly zero.

Serious applications of a party trick

In a group of 30 people, it’s likely that two people have the same birthday. For a group of 23 the probability is about 1/2, and it goes up as the group gets larger.
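
Here’s a quick sketch of the calculation behind those numbers, assuming 365 equally likely birthdays and ignoring leap years.

    from math import prod

    def p_shared_birthday(k):
        # probability that at least two of k people share a birthday
        return 1 - prod((365 - i) / 365 for i in range(k))

    print(p_shared_birthday(23))  # about 0.507
    print(p_shared_birthday(30))  # about 0.706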

In a group of a few dozen people, it’s unlikely that anyone will have a particular birthday, but it’s likely that two people will have the same birthday. This makes a good party trick, but it’s much more than a party trick. It illustrates a general principle that comes up frequently in applications.

For example, it explains a common mistake in seeding random number generators and it lets you predict the chances of hash function collisions. Also, simulation of the birthday problem is used as a test for random number generators because the simulation is sensitive to flaws such as correlated output.

What are other examples of mathematical party tricks that illustrate an important principle in applications? I’m sure there are others, but the birthday problem is the first one that comes to mind.

Per stirpes and random walks

If an inheritance is to be divided per stirpes, each of the deceased’s children gets an equal share. If a child has died but has living descendants, his or her share is distributed by applying the rule recursively.

Example

For example, suppose a man had two children, Alice and Bob, and stipulates in his will that his estate is to be divided per stirpes. If Alice and Bob are still alive when he dies, his estate is split evenly. Suppose, however, that Alice is still alive but Bob has died, and that Bob has three living children, Carol, David, and Erin. In that case Alice would inherit 1/2 of the estate, and each of Bob’s children would inherit 1/6, i.e. 1/3 of Bob’s 1/2.

State law

In some states, such as Texas, per stirpes is the default when someone dies without a will. Who knows what they do in Nevada? Maybe the descendants play poker for the inheritance. I don’t know. I’m not a lawyer, certainly not a Nevada lawyer.

Random walk

Here’s a random process whose expected value gives the same result as per stirpes.

Convert the inheritance to a jar of pennies, possibly a very large jar. Repeat the following steps until all the pennies are gone.

  1. Take a penny out of the jar and perform a random walk on the family tree representing the descendants of the departed.
  2. When you come to a branch in the tree, choose a branch at random with each branch having equal probability.
  3. When you encounter a living person, give them the penny.

This assumes that you first prune the descendant tree of any lines that have died out. That is, we assume every terminal node of the tree represents a living person.

Why is it necessary to trim the tree? If you ended up at a dead end, couldn’t you just put the penny back and start over? No. Suppose in the example above that Alice and Carol are the only surviving descendants. Then per stirpes says they should receive equal amounts, since Carol inherits all of her father’s share. But if we did the random walk without removing David and Erin, then 1/2 the time we’d give a penny to Alice, 1/6 of the time we’d give it to Carol, and 1/3 of the time we’d start over. Alice would get 75% of the estate.
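
Here’s a small simulation sketch of the penny process for the example above, with the tree already pruned so every leaf is a living person. The encoding of the tree as nested lists is just an illustrative choice.

    import random

    # A dead ancestor is a list of his or her children's subtrees;
    # a living person is a string.
    bob = ["Carol", "David", "Erin"]   # Bob has died; his children survive
    estate = ["Alice", bob]            # the deceased's two children

    def random_walk(node):
        # walk down the tree, choosing branches uniformly, until reaching a living person
        while isinstance(node, list):
            node = random.choice(node)
        return node

    def simulate(tree, pennies=100_000):
        shares = {}
        for _ in range(pennies):
            heir = random_walk(tree)
            shares[heir] = shares.get(heir, 0) + 1
        return {heir: count / pennies for heir, count in shares.items()}

    print(simulate(estate))   # roughly 1/2 for Alice, 1/6 each for Carol, David, Erin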

Using one RNG to sample another

Suppose you have two pseudorandom bit generators. They’re both fast, but not suitable for cryptographic use. How might you combine them into one generator that is suitable for cryptography?

Coppersmith et al [1] had a simple but effective approach which they call the shrinking generator, also called irregular decimation. The idea is to use one bit stream to sample the other. Suppose the two bit streams are a_i and b_i. If a_i = 1, then output b_i. Otherwise, throw it away. In NumPy or R notation, this is simply b[a > 0].

Examples in Python and R

For example, in Python

    >>> import numpy as np
    >>> a = np.array([1,0,1,1,0,0,0,1])
    >>> b = np.array([1,0,1,0,0,1,1,0])
    >>> b[a > 0]
    array([1, 1, 0, 0])

Here’s the same example in R.

    > a = c(1,0,1,1,0,0,0,1)
    > b = c(1,0,1,0,0,1,1,0)
    > b[a>0]
    [1] 1 1 0 0

Linear Feedback Shift Registers

Coppersmith and colleagues were concerned specifically with linear feedback shift register (LFSR) streams. These are efficient sources of pseudorandom bits because they lend themselves to hardware implementation or low-level software implementation. But the problem is that linear feedback shift registers are linear. They have an algebraic structure that enables simple cryptographic attacks. But the procedure above is nonlinear and so less vulnerable to attack.
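
To make this concrete, here’s a toy sketch of a Fibonacci LFSR driving the shrinking construction. The register lengths, seeds, and tap positions are arbitrary choices for illustration, not taken from the paper; a real design would use taps corresponding to a primitive polynomial to get a maximal-length sequence.

    def lfsr_bits(seed, taps, count):
        # Fibonacci LFSR: output the last bit, feed back the XOR of the tapped bits
        state = list(seed)
        out = []
        for _ in range(count):
            out.append(state[-1])
            feedback = 0
            for t in taps:
                feedback ^= state[t]
            state = [feedback] + state[:-1]   # shift right, insert feedback bit
        return out

    a = lfsr_bits([1, 0, 0, 1, 0, 1, 1, 0], taps=[0, 2, 3, 7], count=32)
    b = lfsr_bits([0, 1, 1, 1, 0, 0, 1, 0], taps=[1, 5, 6, 7], count=32)
    shrunk = [y for x, y in zip(a, b) if x == 1]   # shrinking generator output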

A potential problem is that the shrinking generator outputs bits at an irregular rate, and a timing attack might reveal something about the sampling sequence a unless some sort of buffering conceals this.

Other stream sources

While the shrinking generator was developed in the context of LFSRs, it seems like it could be used to combine any two PRNGs in hopes that the combination is better than the components. At a minimum, it doesn’t seem it should make things worse [2].

There is a problem of efficiency, however, because on average the shrinking generator has to generate 4n bits to output n bits. For very efficient generators like LFSRs this isn’t a problem, but it could be a problem for other generators.

Self-shrinking generators

A variation on the shrinking generator is the self-shrinking generator. The idea is to use half the bits of a stream to sample the other half. Specifically, look at pairs of bits, and if the first bit is a 1, return the second bit. If the first bit is a 0, return nothing.
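
A sketch of that rule, continuing with the NumPy notation from above:

    import numpy as np

    def self_shrink(bits):
        # look at non-overlapping pairs; if the first bit is 1, output the second
        bits = np.asarray(bits)
        pairs = bits[: len(bits) // 2 * 2].reshape(-1, 2)
        return pairs[pairs[:, 0] == 1, 1]

    print(self_shrink([1, 0, 0, 1, 1, 1, 0, 0, 1, 1]))   # [0 1 1]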

Use in stream ciphers

The eSTREAM competition for cryptographically secure random bit generators included one entry, Decim v2, that uses shrinking generators. The competition had two categories, methods intended for software implementation and methods intended for hardware implementation. Decim was entered in the hardware category. According to the portfolio PDF [3] on the competition site,

Decim contains a unique component in eSTREAM, that of irregular decimation, and is an interesting addition to the field of stream ciphers.

That sounds like it was the only method to use irregular decimation (i.e. shrinking generators).

The first version of Decim had some flaws, but the document says “no adverse cryptanalytic results are known” for the second version. Decim v2 made it to the second round of the eSTREAM competition but was not chosen as a finalist because

… the cipher doesn’t seem to deliver such a satisfying performance profile, so while there might be some very interesting elements to the Decim construction, we feel that the current proposal doesn’t compare too well to the other submissions for the hardware profile.

That seems to imply Decim might be competitive with a different implementation or in some context with different trade-offs.

[1] Coppersmith, D., Krawczyk, H., Mansour, Y. The Shrinking Generator. Advances in Cryptology — CRYPTO ’93. Lecture Notes in Computer Science, vol. 773, pp. 22–39. Springer, Berlin.

[2] If a and b are good sources of random bits, it seems b sampled by a should be at least as good. In fact, if a is poor quality but b is good quality, b sampled by a should still be good. However, the reverse could be a problem. If b is biased, say it outputs more 1s than 0s, then even if a is good quality, the sampled output will be biased in favor of 1s as well.

[3] The link to this report sometimes works but often doesn’t. There’s something unstable about the site. In case it works, here’s the URL: http://www.ecrypt.eu.org/stream/portfolio.pdf

Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling.

Given just a file name, shuf randomly permutes the lines of the file.

With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. For example,

    shuf -n 10 foo.txt

would select 10 lines from foo.txt.

Actually, it would select at most 10 lines. You can’t select 10 lines without replacement from a file with fewer than 10 lines. If you ask for an impossible number of lines, the -n option is ignored.

You can also sample with replacement using the -r option. In that case you can select more lines than are in the file since lines may be reused. For example, you could run

    shuf -r -n 10 foo.txt

to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has. For example, when I ran the command above on a file containing

    alpha
    beta
    gamma

I got the output

    beta
    gamma
    gamma
    beta
    alpha
    alpha
    gamma
    gamma
    beta

I don’t know how shuf seeds its random generator. Maybe from the system time. But if you run it twice you will get different results. Probably.

Comparing Truncation to Differential Privacy

Traditional methods of data de-identification obscure data values. For example, you might truncate a date to just the year.

Differential privacy obscures query values by injecting enough noise to keep from revealing information on an individual.

Let’s compare two approaches for de-identifying a person’s age: truncation and differential privacy.

Truncation

First consider truncating birth date to year. For example, anyone born between January 1, 1955 and December 31, 1955 would be recorded as being born in 1955. This effectively produces a 100% confidence interval that is one year wide.

Next we’ll compare this to a 95% confidence interval using ε-differential privacy.

Differential privacy

Differential privacy adds noise in proportion to the sensitivity Δ of a query. Here sensitivity means the maximum impact that one record could have on the result. For example, a query that counts records has sensitivity 1.

Suppose people live to a maximum of 120 years. Then in a database with n records [1], one person’s presence in or absence from the database would change the average age by no more than 120/n years, the worst case corresponding to the extremely unlikely event of a database of n-1 newborns and one person 120 years old.

Laplace mechanism and CIs

The Laplace mechanism implements ε-differential privacy by adding noise with a Laplace(Δ/ε) distribution, which in our example means Laplace(120/nε).

A 95% confidence interval for a Laplace distribution with scale b centered at 0 is

[b log 0.05, –b log 0.05]

which is very nearly

[-3b, 3b].

In our case b = 120/nε, and so a 95% confidence interval for the noise we add would be [-360/nε, 360/nε].

When n = 1000 and ε = 1, this means we’re adding noise that’s usually between -0.36 and 0.36, i.e. we know the average age to within about 4 months. But if n = 1, our confidence interval is the true age ± 360. Since this is wider than the a priori bounds of [0, 120], we’d truncate our answer to be between 0 and 120. So we could query for the age of an individual, but we’d learn nothing.
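
Here’s a sketch of the mechanism just described, applied to the average-age query; the function name and the final clipping step are my own additions for illustration.

    import numpy as np

    def private_mean_age(ages, epsilon, max_age=120):
        # Laplace mechanism: sensitivity of the mean is max_age / n
        n = len(ages)
        scale = max_age / (n * epsilon)               # b = Δ/ε = 120/(nε)
        noisy = np.mean(ages) + np.random.laplace(0, scale)
        return float(np.clip(noisy, 0, max_age))      # truncate to the a priori range

    # 95% of the added noise falls within ±3b = ±360/(nε)
    n, epsilon = 1000, 1
    print(3 * 120 / (n * epsilon))   # about 0.36 years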

Comparison with truncation

The width of our confidence interval is 720/ε, and so to get a confidence interval one year wide, as we get with truncation, we would set ε = 720. Ordinarily ε is much smaller than 720 in application, say between 1 and 10, which means differential privacy reveals far less information than truncation does.

Even if you truncate age to decade rather than year, this still reveals more information than differential privacy provided ε < 72.

[1] Ordinarily even the number of records in the database is kept private, but we’ll assume here that for some reason we know the number of rows a priori.

Random projection

Last night after dinner, the conversation turned to high-dimensional geometry. (I realize how odd that sentence sounds; I was with some unusual company.)

Someone brought up the fact that two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal. This is a surprising but well known fact. Next the conversation turned to something also surprising but not so well known.

Suppose you’re in a 20,000 dimensional space and you choose 10,000 vectors at random. Now you choose one more, and ask what angle the new vector makes with the space spanned by the 10,000 previous vectors. The new vector is very likely to have a very small component in the direction of each of the previous vectors individually, but what about their span?

There are several ways to approach this problem. You can find the exact distribution on the angle analytically, but I thought it might be more interesting to demonstrate this by actually drawing random vectors. I started to write a simulation and discovered that the Python installation on my laptop is corrupted, and so I decided to take a third approach. I’ll give an informal explanation of what’s going on. Those who wish could either fill in the details to write a proof [1] or code it up to write a simulation.

First of all, what does it mean to pick a vector randomly from a vector space? We say we want to choose vectors uniformly, but that isn’t actually possible. What we actually mean is we want to pick vectors so that their directions are uniform. That’s possible, and we only need to work with unit vectors anyway.

The way to generate uniform random samples of unit vectors in n dimensions is to first choose a standard normal random value for each coordinate, then normalize so the vector has length 1.
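
In code, that recipe is just a couple of lines (a NumPy sketch):

    import numpy as np

    def random_unit_vector(n):
        # standard normal components, normalized to unit length
        v = np.random.normal(size=n)
        return v / np.linalg.norm(v)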

Suppose we do as we discussed above. We work in 20,000 dimensions and choose 10,000 random vectors. Call their span S. Then we choose another random vector v. What angle does v make with S? You can’t simply project v onto each of the basis vectors for S because even though they are nearly orthogonal, they’re not exactly orthogonal.

It turns out that it doesn’t matter what the vectors are that defined S. All that matters is that S is almost certainly a 10,000 dimensional subspace. By symmetry it doesn’t matter which subspace it is, so we’ll assume for convenience that it’s the subspace spanned by the first 10,000 unit vectors. The angle that v makes with S is the angle between v and its projection w onto S, and w is simply the vector you get by zeroing out the last 10,000 components of v.

The cosine of the angle between two unit vectors is their dot product. Our vector v is a unit vector, but its projection w is not. So the cosine of the angle is

cos θ = v·w/||w||

Now v·w is the sum of the squares of the first half of v’s components. This is probably very close to 1/2 since v is a unit vector. This is also w·w, and so ||w|| is probably very close to the square root of 1/2. And so cos θ is very nearly √2/2, which means θ = 45°.

So in summary, if you generate 10,000 vectors in 20,000 dimensions, then find the angle of a new random vector with the span of the previous vectors, it’s very likely to be very near 45°.
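
Here’s a simulation sketch of that argument, with the dimensions scaled down by a factor of ten so the linear algebra runs quickly; the projection onto S is computed from an orthonormal basis obtained by QR factorization.

    import numpy as np

    n, m = 2000, 1000                      # scaled-down stand-ins for 20,000 and 10,000
    rng = np.random.default_rng()

    M = rng.normal(size=(n, m))            # m random vectors; S is their span
    Q, _ = np.linalg.qr(M)                 # orthonormal basis for S

    v = rng.normal(size=n)
    v /= np.linalg.norm(v)                 # a new random unit vector

    w = Q @ (Q.T @ v)                      # projection of v onto S
    cos_theta = np.linalg.norm(w)          # since v is a unit vector, cos θ = ‖w‖
    print(np.degrees(np.arccos(cos_theta)))   # very close to 45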

[1] If you’re in an n-dimensional space and S has dimension m < n, then cos² θ has the distribution of X/(X + Y) where X ~ χ²(m) and Y ~ χ²(n − m), which is a beta(m/2, (n − m)/2) distribution. If m = n/2 and m is large, this is a beta distribution concentrated tightly around 1/2, and so cos θ is concentrated tightly around √2/2. More generally, if m and n are large, cos² θ is concentrated around m/n.

A truly horrible random number generator

I needed a bad random number generator for an illustration, and chose RANDU, possibly the worst random number generator that was ever widely deployed. Donald Knuth comments on RANDU in the second volume of his magnum opus.

When this chapter was first written in the late 1960’s, a truly horrible random number generator called RANDU was commonly used on most of the world’s computers.

This generator starts with an odd seed x_0 and generates subsequent values via

x_{n+1} = 65539 x_n mod 2^31

I needed to generate 8-bit numbers, and I took the lowest eight bits of each value from RANDU. (To be fair, the highest 8 bits would have been better, but my goal was to find a bad RNG, so I used the lowest.)
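
Here’s a sketch of the generator and the low-byte extraction just described; the sample count is arbitrary.

    def randu_low_bytes(seed, count):
        # RANDU: x_{n+1} = 65539 x_n mod 2^31; keep the low 8 bits of each value
        x = seed
        out = []
        for _ in range(count):
            x = (65539 * x) % 2**31
            out.append(x & 0xFF)
        return out

    samples = randu_low_bytes(33, 100_000)   # low bytes starting from seed 33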

To demonstrate the generator’s (lack of) quality I made a histogram of sorts with a 16 by 16 grid, one cell for each possible 8-bit number. I generated a large number of samples and plotted the grid. Here’s what I got:

RANDU histogram

Here’s what I get with a better random number generator:

randint histogram

I was looking for something bad, but I didn’t realize RANDU was that bad. The white stripes represent the generated values and the black stripes values that are never generated. So out of 256 possible 8-bit numbers, this generator only ever outputs 64 of them. (I used 33 as my seed. I might have gotten different vertical stripes if I had chosen a different seed, but I’d still get stripes.) Also, the frequencies of the values it does take on have suspiciously little variation.

You can see the pattern of values RANDU takes on by printing out the first few generated values in hex:

    0x63
    0x29
    0x7b
    0x71
    0x53
    0xf9
    0xeb
    0xc1

The last hex digit cycles 3, 9, b, 1, 3, 9, b, 1, … producing only four possible values.

We could prove the statements empirically demonstrated above by noting that

x_n = 65539^n x_0 mod 2^k

for k = 31, but also for smaller k. We could set k = 8 and prove that there are only 64 values, or set k = 4 and prove that there are only four final hex digits.
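
A quick check of those counts: since only the residue of 65539^n x_0 modulo 2^k matters, and the seed x_0 = 33 is odd, the number of distinct values equals the multiplicative order of 65539 mod 2^k.

    # distinct values of 65539^i * 33 mod 2^k over a full cycle
    print(len({pow(65539, i, 2**8) * 33 % 2**8 for i in range(1, 1000)}))   # 64
    print(len({pow(65539, i, 2**4) * 33 % 2**4 for i in range(1, 1000)}))   # 4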

Regression, modular arithmetic, and PQC

Linear regression

Suppose you have a linear regression with a couple predictors and no intercept term:

β_1 x_1 + β_2 x_2 = y + ε

where the x’s are inputs, the βs are fixed but unknown, y is the output, and ε is random error.

Given n observations (x_1, x_2, y + ε), linear regression estimates the parameters β_1 and β_2.

I haven’t said, but I implicitly assumed all the above numbers are real. Of course they’re real. It would be strange if they weren’t!

Learning with errors

Well, we’re about to do something strange. We’re going to pick a prime number p and do our calculations modulo p except for the addition of the error ε. Our inputs (x_1, x_2) are going to be pairs of integers. Someone is going to compute

r = β_1 x_1 + β_2 x_2 mod p

where β_1 and β_2 are secret integers. Then they’re going to tell us

r/p + ε

where ε is a random variable on the interval [0, 1]. We give them n pairs (x_1, x_2) and they give back n values of r/p with noise added. Our job is to infer the βs.

This problem is called learning with errors or LWE. It’s like linear regression, but much harder when the problem size is bigger. Instead of just two inputs, we could have m inputs with m secret coefficients where m is large. Depending on the number of variables m, the number of equations n, the modulus p, and the probability distribution on ε, the problem may be possible to solve but computationally very difficult.
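
Here’s a toy sketch of how such an LWE instance could be generated; the modulus, dimensions, and noise distribution are arbitrary illustrative choices, since the post leaves the distribution of ε unspecified.

    import numpy as np

    rng = np.random.default_rng()
    p, m, n = 257, 8, 20                     # modulus, number of secrets, number of samples

    beta = rng.integers(0, p, size=m)        # secret coefficients
    X = rng.integers(0, p, size=(n, m))      # public integer inputs
    r = X @ beta % p

    observed = r / p + rng.normal(0, 0.01, size=n)   # noisy values of r/p
    # The challenge: recover beta from X and 'observed'.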

Why is it so difficult? Working mod p is discontinuous. A little bit of error might completely change our estimation of the solution. If n is large enough, we could recover the coefficients anyway, using something like least squares. But how would we carry that out? If m and p are small we can just try all p^m possibilities, but that’s not going to be practical if m and p are large.

In linear regression, we assume there is some (approximately) linear process out in the real world that we’re allowed to observe with limited accuracy. Nobody’s playing a game with us; that’s just how data come to us. But with LWE, we are playing a game that someone has designed to be hard. Why? For cryptography. In particular, quantum-resistant cryptography.

Post Quantum Cryptography

Variations on LWE are the basis for several proposed encryption algorithms that are believed to be secure even if an adversary has access to a quantum computer.

The public key encryption systems in common use today would all be breakable if quantum computing becomes practical. They depend on mathematical problems like factoring and discrete logarithms being computationally difficult, which they appear to be with traditional computing resources. But we know that these problems could be solved in polynomial time on a quantum computer with Shor’s algorithm. LWE, on the other hand, is a hard problem even on a quantum computer. Or so we suspect.

The US government’s National Institute of Standards and Technology (NIST) is holding a competition to identify quantum-resistant encryption algorithms. Last month they announced 26 algorithms that made it to the second round. Many of these algorithms depend on LWE or variations.

One variation is LWR (learning with rounding) which uses rounding rather than adding random noise. There are also ring-based counterparts RLWE and RLWR which add random errors and use rounding respectively. And there are polynomial variations such as poly-LWE which uses a polynomial-based learning with errors problem. The general category for these methods is lattice methods.

Lattice methods

Of the public-key algorithms that made it to the second round of the NIST competition, 9 out of 17 use lattice-based cryptography:

  • CRYSTALS-KYBER
  • FrodoKEM
  • LAC
  • NewHope
  • NTRU
  • NTRU Prime
  • Round5
  • SABER
  • Three Bears

Also, two of the nine digital signature algorithms are based on lattice problems:

  • CRYSTALS-DILITHIUM
  • FALCON

Based purely on the names, and not on the merits of the algorithms, I hope the winner is one of the methods with a science fiction allusion in the name.
