Learning languages with the help of algorithms

Suppose you’re learning a new language and want to boost your vocabulary in a very time-efficient way.

People learn languages in many different ways. Suppose you want to improve your vocabulary by reading books in the language you're learning. To get the most impact, you'd like to pick books that cover as many common words in the language as possible.

Here is a formalization. Suppose you have a large set of m books with an average length of n words, and you want to pick the single book from the set with the highest vocabulary impact. The vocabulary impact of a book is a weighted sum over the distinct vocabulary words appearing in the book, with each word weighted by how common it is in the language, as estimated by its frequency across all the books; this frequency essentially gives each word a probability weight.

That’s an easy problem to solve. First, optionally filter out stop words, words like (in the case of English) “the” and “and” that are considered in some sense to have “not much meaning.” Second, build a list of the unique words across all books, along with a count of how many times each word occurs. Finally, for each book evaluate its coverage of those words, computing a score as described above.

Computing the unique word list can be done by a hashing process that typically runs in time on the order of mn, with worst case mn log(mn). Computing the scores also takes linear time on average. So the entire process can be done in roughly linear time.
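As a concrete illustration, here is a minimal sketch of the single-book computation in Python. It assumes each book has already been tokenized into a list of lowercase words and that a stop_words set is supplied; those representations are my assumptions, not part of the description above.

from collections import Counter

def best_single_book(books, stop_words=frozenset()):
    # books: list of books, each a list of lowercase word tokens (assumed representation)
    # Count occurrences of every non-stop word across all books
    counts = Counter(w for book in books for w in book if w not in stop_words)
    total = sum(counts.values())
    weight = {w: c / total for w, c in counts.items()}  # probability weight of each word

    # Score a book by the total weight of its distinct vocabulary
    def score(book):
        return sum(weight[w] for w in set(book) if w not in stop_words)

    return max(range(len(books)), key=lambda i: score(books[i]))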

What if you want to find the best two books to read, the pair with the best joint vocabulary coverage? To find the optimal solution, the best known algorithms for the general case take quadratic rather than linear time.

How about the best k books for arbitrary k > 0?

This is an NP-hard problem, meaning that for the best known algorithms the compute time to solve the general case exactly grows exponentially in k, the size of your desired set of reading books.

So you cannot expect to solve this problem exactly for large k. Not all hope is lost, however. This is a maximum weighted coverage problem, which belongs to a class of NP-hard problems with submodular objective functions. Because of this, approximation algorithms are known that are guaranteed to come within a certain known factor of the true optimum (see [1-5]).

These algorithms are described in [6], [7], [8]. The basic idea is greedy: add one high-impact book at a time to the running set, each time choosing the book that adds the most new coverage. The resulting list is not guaranteed to be the best possible, but it is reasonable: greedy selection achieves at least a 1 − 1/e ≈ 0.63 fraction of the optimal coverage [6].
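Here is a sketch of that greedy selection, continuing with the weight dictionary from the sketch above. It is an illustration of the idea, not the submodlib implementation.

def greedy_books(books, weight, k):
    # Greedily add the book with the largest marginal gain in newly covered weight
    chosen, covered = [], set()
    candidates = set(range(len(books)))
    for _ in range(k):
        def gain(i):
            return sum(weight.get(w, 0) for w in set(books[i]) - covered)
        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
        covered |= set(books[best])
    return chosen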

The helpful Python submodlib package runs very fast on this problem.

One can improve the quality of the result by spending more computation. First, you could use a blocking strategy: at each step compute exactly the best two books to add, or the best three, and so forth (at quadratic, cubic, and higher computational cost). Similarly, one could use a look-ahead strategy: add the two books that are jointly best by an exact computation, then leave the second one off and find the best second and third books, and so forth. In general these refinements do not improve the submodularity bound, but in practice they give more accurate results.

One can also use various heuristics to improve performance in some cases. For example, if a book has little or no vocabulary that is not present in the other books, it can sometimes be safely discarded. However, in general the exact case remains hard (unless P=NP).

References

[1] Abhimanyu Das and David Kempe. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. Proceedings of ICML, 2011.

[2] Abhimanyu Das and David Kempe. Approximate Submodularity and its Applications: Subset Selection, Sparse Approximation and Dictionary Selection.
Journal of Machine Learning Research (JMLR), 19(3):1–35, 2018.

[3] M. Conforti and G. Cornuéjols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds. Discrete Applied Mathematics, 7(3):251–274, 1984.

[4] Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Proceedings of SODA, 2013.

[5] Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions. 2013.
https://arxiv.org/abs/1311.2110

[6] Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. “An analysis of approximations for maximizing submodular set functions—I.” Mathematical programming 14, no. 1 (1978): 265-294.

[7] “Maximum coverage problem,” https://en.wikipedia.org/wiki/Maximum_coverage_problem.

[8] Wei, Kai, Rishabh Iyer, and Jeff Bilmes. “Fast multi-stage submodular maximization.” In International conference on machine learning, pp. 1494-1502. PMLR, 2014.

 

Mental math posts

I’ve written quite a few posts on mental math over the years. I think mental math is important/interesting for a couple reasons. First, there is some utility in being able to carry out small calculations with rough accuracy without having to stop and open up a calculator. Second, the constraints imposed by mental calculation make you look at a problem differently, often in interesting ways. An approximate mental calculation often requires more understanding than a precise machine calculation.

Divisibility and factoring

Day of the week

Transcendental functions

This page outlines how to calculate logarithms base 2, e, and 10, and their inverses, linking to other posts for details.

It also outlines computing trig functions, roots, and the gamma function. See the links below for more accurate ways to compute logarithms.

Incidentally, when I’ve posted about these approximations before, inevitably someone will say these are just Taylor approximations. No, they aren’t.

Logarithms

Miscellaneous

Morse code beyond the solar system

Voyager's golden record

The two Voyager probes, launched in 1977, are now in interstellar space beyond our solar system. Each carries a Golden Record, a recording of sounds and encoded images meant to represent Earth and human civilization.

I’ve long intended to listen to the record and yesterday I did. One of the cuts is a 12-minute collage of sounds from our planet. It starts with natural sounds—thunder, crickets, frogs, birds, etc.—and progresses to human sounds—a heartbeat, speech, tool use, etc. The sounds appear in roughly historical order, with the sound of a rocket launch toward the end.

At 7:49 there is Morse code on top of other sounds. I was curious what the message was, so I transcribed it: it’s “ad astra per aspera” played twice. This is Latin for “To the stars through hardships.”

Cycles in Marsaglia’s mental RNG

Last week I wrote about a mental random number generator designed by George Marsaglia. It’s terrible compared to any software RNG, but it produces better output than most people would if asked to say a list of random digits.

Marsaglia’s RNG starts with a two-digit number as a seed state, then at each step replaces n with the 10s digit of n plus 6 times the 1s digit. Here’s the generator in Python form, taken from the earlier post.

state = 42 # Set the seed to any two digit number

def random_digit():
    tens = lambda n: n // 10
    ones = lambda n: n % 10
    global state
    state = tens(state) + 6*ones(state)
    return ones(state)

Only the last digit of the state is returned: the state has two digits but the output has one digit.

Alexander Janßen left an interesting comment on Mastodon regarding the generator:

I’m not sure if it’s already documented anywhere, but a random seed of 59 sticks and always returns a 9. This is similiar to the initial state of 0 (which is more trivial). In return that means, that the state 59 is never reached from any other sequence.

To explore this further, let’s look at just the states of the mental RNG, not the output digits.

If you start with the state 1, you will produce each of the numbers from 1 to 58 in a cycle. It follows that if you start at any number between 1 and 58 you’ll produce the same sequence of states in a cycle, though starting at a different point in the cycle.

… → 1 → 6 → 36 → 39 → … → 1 → …

If you start at 59, you’ll stay at 59 because 5 + 6×9 = 59. In this case, the RNG literally does what is described in the Dilbert cartoon below.

Dilbert cartoon: “Over here we have our random number generator.” “Nine nine nine nine nine nine …”
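As a quick check (my own snippet, not part of the original post), we can iterate the state update directly and confirm both the 58-state cycle and the fixed point at 59.

def step(n):
    # One step of the state update: tens digit plus six times the ones digit
    return n // 10 + 6 * (n % 10)

seen, n = [], 1
while n not in seen:
    seen.append(n)
    n = step(n)
print(len(seen), sorted(seen) == list(range(1, 59)))  # 58 True
print(step(59))  # 59: the fixed point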

Collatz interlude

If you start with a seed larger than 59, you’ll eventually land on a state less than 59, and then produce the same cycle as described above. This is reminiscent of the Collatz conjecture, which I describe here as follows.

The Collatz conjecture asks whether the following procedure always terminates at 1. Take any positive integer n. If it’s odd, multiply it by 3 and add 1. Otherwise, divide it by 2. For obvious reasons the Collatz conjecture is also known as the 3n + 1 conjecture.
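Here is the procedure in Python, a minimal sketch that simply counts steps and assumes the trajectory does reach 1.

def collatz_steps(n):
    # Apply the 3n + 1 rule until reaching 1, counting the steps taken
    steps = 0
    while n != 1:
        n = 3*n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(collatz_steps(27))  # 111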

The Collatz conjecture remains an open question, though there has been some progress on the problem. I routinely get email from people who say they have no training in math but believe they have solved the problem. Recently they have begun to explain how an LLM led them to their proof.

Does the Collatz sequence get stuck in a cycle, never reaching 1, for some starting point? Nobody knows. But we can show every starting point for Marsaglia’s mental RNG ends up in a cycle.

Back to Marsaglia

If we start the mental RNG with 99, we next get 63, and then 24. So starting at 99, we end up in a cycle after two steps. The same is true starting at 69, 79, or 89. Otherwise, any starting point between 60 and 98 leads us to a cycle after one step.

Marsaglia’s generator is supposed to be seeded with a two-digit number. What if we seed it with a larger number? For example, what if we start at 1066 and consider the 10s “digit” to be 106?

Each step of the algorithm reduces the state by roughly an order of magnitude, and so if we start with a state greater than 1000, we eventually get to a state less than 1000. Then every state between 100 and 1000 leads to a cycle, usually the 58-state cycle described above, but the following three-digit states map directly to the fixed point 59.

118, 177, 236, 295, 354, 413, 472, 531, 590
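A quick check with the step function defined earlier confirms that these nine are exactly the three-digit states that map straight to 59. (A few others, such as 944, reach 59 indirectly by first landing on 118.)

direct = [n for n in range(100, 1000) if step(n) == 59]
print(direct)  # [118, 177, 236, 295, 354, 413, 472, 531, 590]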

Related posts

Monero’s seed phrase words

I wrote a couple posts last month about the seed phrase words used by Bitcoin and other cryptocurrencies. There are 2048 words on the BIP39 list. Monero uses a different word list, one with 1626 words [1]. You can find Monero’s list here.

Why 1626 words?

It’s not hard to guess why the BIP 39 list has 2048 words: each one encodes 11 bits of a key because 2¹¹ = 2048. It’s not as obvious where the number 1626 comes from. It is the smallest value of n such that

n²⁴ > 2²⁵⁶

Monero uses a seed phrase of 25 words, but the last word is a checksum, so there are 24 words which are used to create a 256-bit private key.
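A brute-force check (my own, not from the post) confirms that 1626 is the smallest such n.

# Find the smallest n with n**24 > 2**256
n = 1
while n**24 <= 2**256:
    n += 1
print(n)  # 1626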

Distinctiveness

I criticized the BIP 39 list for being less than ideal for memorization because some words are similar phonetically, such as angle and ankle. Other words are harder to remember because they are not vivid nouns or verbs, such as either or neither.

The Monero list has 200 words that differ from another word on the list by only one character, compared to 484 such words for BIP 39.

But as with the BIP 39 list, some of Monero’s words are similar and not easy to visualize, such as adapt, adept, and adopt.

Prefix uniqueness

One nice feature of the BIP 39 list is that the first four letters of each word are distinct. So if you’re typing the words, autocomplete can fill in the rest of the word by the time you’ve entered four letters.

Monero goes one step further: all words are uniquely determined by the first three letters.
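Here’s a sketch of how one might verify that claim. It assumes the word list has been saved locally as monero_words.txt, one word per line; that filename is hypothetical.

# Hypothetical local copy of the word list, one word per line
with open("monero_words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

# If the number of distinct 3-letter prefixes equals the number of words,
# then every word is determined by its first three letters
print(len(words), len({w[:3] for w in words}))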

Overlap with other lists

Only about 1/4 of the words on Monero’s list are also on the BIP 39 list. And most of Monero’s words are not in Google’s list of 10,000 most common words.

Venn diagram of the three word lists: Monero (M, 1626 words), BIP 39 (B, 2048 words), and Google’s 10,000 most common words (G). Overlaps: M∩B = 425, M∩G = 701, B∩G = 1666, M∩B∩G = 339.

Related posts

[1] Monero has other ways to convert words to keys. The way described here is now known as the Legacy mnemonics. The new Polyseed algorithm uses the BIP 39 word list.

Area of the unit disk after a Möbius transformation

Let f(z) = (az + b)/(cz + d) where Δ = ad − bc ≠ 0.

If f has no singularity inside the unit disk, i.e. if |d/c| > 1, then the image of the unit disk under f is another disk. What is the area of that disk?

The calculation is complicated, but the result turns out to be

Area = π |Δ|² / (|d|² − |c|²)².

Just as a sanity check, set c = 0 and d = 1. Then we multiply the disk by a and shift by b. The shift doesn’t change the area, and multiplying by a multiplies the area by |a|², which is consistent with our result.

As another sanity check, note that the area is infinite if c = d, which is correct since there would be a singularity at z = −1.

Finally, here’s a third sanity check in the form of Python code.

from numpy import linspace, pi, exp

# Parameters chosen so that the pole at -d/c lies outside the unit disk
a, b, c, d = 9, 15j, 20, 25j
theory_r = abs(a*d - b*c)/(abs(d)**2 - abs(c)**2)  # radius predicted by the formula above
print("theory r:", theory_r)

# Map points on the unit circle and estimate the image's radius from its horizontal extent
t = linspace(0, 2*pi, 10000)
z = exp(1j*t)
w = (a*z + b)/(c*z + d)
approx_r = (max(w.real) - min(w.real))/2
print("approx r:", approx_r)

More triangle inequalities

Yesterday I wrote about a triangle inequality discovered by Paul Erdős.

Let P be a point inside a triangle ABC. Let x, y, z be the distances from P to the vertices and let p, q, r be the distances to the sides. Then Erdős’ inequality says

x + y + z ≥ 2(p + q + r).

Using the same notation, here are four more triangle inequalities discovered by Oppenheim [1]; a quick numerical check follows the list.

  • px + qy + rz ≥ 2(qr + rp + pq)
  • yz + zx + xy ≥ 4(qr + rp + pq)
  • xyz ≥ 8pqr
  • 1/p + 1/q + 1/r ≥ 2(1/x + 1/y + 1/z)
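Here is that numerical check in Python, a sketch of my own rather than anything from Oppenheim’s paper. For the first inequality I pair each vertex distance with the distance to the opposite side (x with side BC, and so on); the other three inequalities do not depend on any pairing.

import numpy as np

rng = np.random.default_rng(1)

# A random triangle ABC and a random point P inside it
A, B, C = rng.standard_normal((3, 2))
u, v = rng.random(2)
if u + v > 1:
    u, v = 1 - u, 1 - v
P = A + u*(B - A) + v*(C - A)

def dist_to_side(P, V, W):
    # Distance from P to the line through V and W
    d = W - V
    return abs(d[0]*(P - V)[1] - d[1]*(P - V)[0]) / np.hypot(d[0], d[1])

x, y, z = (np.linalg.norm(P - V) for V in (A, B, C))  # distances to vertices
p = dist_to_side(P, B, C)  # distance to the side opposite A
q = dist_to_side(P, C, A)  # distance to the side opposite B
r = dist_to_side(P, A, B)  # distance to the side opposite C

print(p*x + q*y + r*z >= 2*(q*r + r*p + p*q))
print(y*z + z*x + x*y >= 4*(q*r + r*p + p*q))
print(x*y*z >= 8*p*q*r)
print(1/p + 1/q + 1/r >= 2*(1/x + 1/y + 1/z))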

[1] A. Oppenheim. The Erdös Inequality and Other Inequalities for a Triangle. The American Mathematical Monthly. Vol. 68, No. 3 (Mar., 1961), pp. 226–230

Area of unit disk under a univalent function

Let D be the unit disk in the complex plane and let f be a univalent function on D, meaning it is analytic and one-to-one on D.

There is a simple way to compute the area of f(D) from the coefficients in its power series.

If

f(z) = \sum_{n=0}^\infty c_n z^n

then

\text{area}\left(f(D)\right) = \int_D |f^\prime(z)|^2 \, dx \,dy = \pi \sum_{n=1}^\infty n |c_n|^2

The first equality follows from the change of variables theorem for functions of two variables and applying the Cauchy-Riemann equations to simplify the Jacobian. The second equality is a straightforward calculation that you can work out in polar coordinates.

Application

Let’s apply this to what I called the minimalist Mandelbrot set the other day.

The orange disk has radius 1/4 and so its area is simply π/16.

Finding the area of the blue cardioid takes more work, but the theorem above makes it easy. The cardioid is the image of the disk {α : |α| < ½} under the map α ↦ α − α². To apply the theorem above we need the domain to be the unit disk, not the disk of radius ½, so we define

f(z) = \frac{1}{2} z - \frac{1}{4} z^2

as a function on the unit disk. Now c₁ = ½ and c₂ = −¼, and so the area of f(D) is 3π/8.
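As a numerical check (my own, not from the post), we can compare the series formula with the area enclosed by the image of the unit circle, computed from Green’s theorem.

import numpy as np

# Area from the coefficients: pi * sum of n |c_n|^2 with c1 = 1/2, c2 = -1/4
series_area = np.pi*(1*(1/2)**2 + 2*(1/4)**2)

# Cross-check: area enclosed by the boundary curve f(e^{it}),
# using area = (1/2) Im of the integral of conj(w) dw around the curve
t = np.linspace(0, 2*np.pi, 100000, endpoint=False)
w = np.exp(1j*t)/2 - np.exp(2j*t)/4
dw = 1j*np.exp(1j*t)/2 - 1j*np.exp(2j*t)/2
contour_area = 0.5*np.imag(np.mean(np.conj(w)*dw))*2*np.pi

print(series_area, contour_area, 3*np.pi/8)  # all approximately 1.1781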

I said in the earlier post that the minimalist Mandelbrot set makes up most of the Mandelbrot set. Now we can quantify that. The area of the minimalist Mandelbrot set is 7π/16 ≈ 1.3744. The area of the Mandelbrot set is approximately 1.5065, so the minimalist set shown above makes up over 91% of the total area.

Related posts

Random samples from a polygon

Ted Dunning left a comment on my post on random sampling from a triangle saying you could extend this to sampling from a polygon by dividing the polygon into triangles, and selecting a triangle each time with probability proportional to the triangle’s area.

To illustrate this, let’s start with an irregular pentagon.

To pick a point inside, I used the centroid, the average of the vertices. Connecting the centroid to each of the vertices splits the pentagon into triangles. (Here I implicitly used the fact that this pentagon is convex. The centroid of a non-convex polygon could be outside the polygon.)

We can find the area of the triangles using Heron’s rule.
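Here is a sketch of the whole procedure in Python (my own illustration, using made-up pentagon coordinates rather than the one pictured): split at the centroid, weight the triangles by Heron’s rule, then sample.

import numpy as np

rng = np.random.default_rng(0)

# A convex pentagon (illustrative coordinates, not the one in the figure)
verts = np.array([[0, 0], [3, -1], [5, 2], [2, 4], [-1, 2]], dtype=float)
centroid = verts.mean(axis=0)

# Triangles formed by the centroid and consecutive vertices
tris = [(centroid, verts[i], verts[(i + 1) % len(verts)]) for i in range(len(verts))]

def heron(A, B, C):
    # Area from side lengths via Heron's rule
    a, b, c = np.linalg.norm(B - C), np.linalg.norm(C - A), np.linalg.norm(A - B)
    s = (a + b + c) / 2
    return np.sqrt(s*(s - a)*(s - b)*(s - c))

areas = np.array([heron(*t) for t in tris])
probs = areas / areas.sum()

def sample():
    # Pick a triangle with probability proportional to its area,
    # then pick a uniform point inside that triangle
    A, B, C = tris[rng.choice(len(tris), p=probs)]
    u, v = rng.random(2)
    if u + v > 1:
        u, v = 1 - u, 1 - v
    return A + u*(B - A) + v*(C - A)

points = np.array([sample() for _ in range(1000)])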

Here’s what we get for random samples.

A triangle inequality by Erdős

Plane geometry has been studied since ancient times, and yet new results keep being discovered millennia later, including elegant results. It’s easy to come up with a new result by proving a complicated theorem that Euclid would not have cared about. It’s more impressive to come up with a new theorem that Euclid would have understood and found interesting.

Paul Erdős conjectured such a triangle inequality in 1935, which was proved by Mordell and Barrow in 1937 [1].

Let P be a point inside a triangle ABC. Let x, y, z be the distances from P to the vertices and let p, q, r be the distances to the sides. Then

x + y + z ≥ 2(p + q + r)

with equality only if P is the center of an equilateral triangle [2]. In the figure above, the theorem says the dashed blue lines together are more than twice as long as the solid red lines.

How far apart are the left and right sides of the inequality? This was the motivation for the previous post on selecting random points from a triangle. I wanted to generate random points and compare the two sides of the Erdős-Mordell-Barrow inequality.

We can visualize the inequality by generating random points inside the triangle and plotting the points with a color that indicates the inequality gap, darker blue corresponding to a larger gap.
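Here is a sketch of that visualization, my own code rather than the post’s, coloring each sample by the gap x + y + z − 2(p + q + r). The triangle’s coordinates are arbitrary, not those of the figure.

import numpy as np
import matplotlib.pyplot as plt

A, B, C = np.array([0, 0]), np.array([1, 0]), np.array([0.3, 0.9])  # an arbitrary triangle

def dist_to_side(P, V, W):
    # Distance from P to the line through V and W
    d = W - V
    return abs(d[0]*(P - V)[1] - d[1]*(P - V)[0]) / np.hypot(d[0], d[1])

rng = np.random.default_rng(0)
u, v = rng.random((2, 5000))
u, v = np.where(u + v > 1, 1 - u, u), np.where(u + v > 1, 1 - v, v)
P = A + np.outer(u, B - A) + np.outer(v, C - A)  # uniform points in the triangle

x = np.linalg.norm(P - A, axis=1)
y = np.linalg.norm(P - B, axis=1)
z = np.linalg.norm(P - C, axis=1)
p = np.array([dist_to_side(pt, B, C) for pt in P])
q = np.array([dist_to_side(pt, C, A) for pt in P])
r = np.array([dist_to_side(pt, A, B) for pt in P])

gap = x + y + z - 2*(p + q + r)
plt.scatter(P[:, 0], P[:, 1], c=gap, s=2, cmap="Blues")  # darker blue = larger gap
plt.colorbar(label="inequality gap")
plt.gca().set_aspect("equal")
plt.show()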

This shows the inequality is sharper in the middle of the triangle than near the vertices.

[1] Problem 3740, American Mathematical Monthly, 44 (1937) 252-254.

[2] You could interpret this as a theorem comparing barycentric and trilinear coordinates.