Factored random numbers

A couple days ago Michael Nielsen posted an image of a one-page paper that gives an algorithm for generating factored random numbers, uniformly distributed from 1 to some designated N.

The algorithm does not generate random numbers then factor them. It’s more efficient than that, generating the factorization along with the final result. It does require testing for whether a number is prime, but this is more efficient than factorization.

I thought about trying to code up the algorithm in Python, but then I see that @iconjack beat me to it.

from sympy import isprime
from random import random, randint

def randfacts(N):
    while True:
        n, r, s = N, 1, []
        while n > 1:
            if r > N: break
            if isprime(n := randint(1,n)):
                r *= n
                s.append(n)
        else:
            if random() < r/N:
                return r, s

Python code for means

The last couple article have looked at various kinds of mean. The Python code for four of these means is trivial:

gm  = lambda a, b: (a*b)**0.5
am  = lambda a, b: (a + b)/2
hm  = lambda a, b: 2*a*b/(a+b)
chm = lambda a, b: (a**2 + b**2)/(a + b)

But the arithmetic-geometric mean (AGM) is not trivial:

from numpy import pi
from scipy.special import ellipk

agm = lambda a, b: 0.25*pi*(a + b)/ellipk((a - b)**2/(a + b)**2) 

The arithmetic-geometric mean is defined by iterating the arithmetic and geometric means and taking the limit. This iteration converges very quickly, and so writing code that directly implements the definition is efficient.

But the AGM can also be computed via a special function K, the “complete elliptic integral of the first kind,” which makes the code above more compact. This is conceptually nice because we can think of the AGM as a simple function, not an iterative process.

But how is K evaluated? In some sense it doesn’t matter: it’s encapsulated in the SciPy library. But someone has to write SciPy. I haven’t looked at the SciPy source code, but usually K is calculated numerically using the AGM because, as we said above, the AGM converges very quickly.

Bell curve meme: How to calculate the AGM? The left and right tails say to use a while loop. The middle says to evaluate a complete ellliptic integral of the first kind.

This fits the pattern of a bell curve meme: the novice and expert approaches are the same, but for different reasons. The novice uses an iterative approach because that directly implements the definition. The expert knows about the elliptic integral, but also knows that the iterative approach suggested by the definition is remarkably efficient and eliminates the need to import a library.

Although it’s easy to implement the AGM with a while loop, the code above does have some advantages. For one thing, it pushes the responsibility for validation and exception handling onto the library. On the other hand, the code is easy to get wrong because there are two conventions on how to parameterize K and you have to be sure to use the same one your library uses.

Searching for proper nouns

Suppose you want to find all the proper nouns in a document. You could grep for every word that starts with a capital letter with something like

    grep '\b[A-Z]\w+'

but this would return the first word of each sentence in addition to the words you’re after.

You could grep for capitalized words that are not preceded by a period or question mark followed by a space.

    grep -P '(?<![.?] )\b[A-Z]\w+'

That’s possibly better, but it misses proper nouns at the beginning of a sentence.

You might be able to accomplish what you’re after by tinkering with regular expressions, but it would be better to use a library that has some idea of what a proper noun is.

NLP with spaCy

The Python natural language processing library spaCy classifies words by part of speech, and so could in particular search for proper nouns.

Here’s an example using the opening lines of Moby Dick.

    import spacy
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul ... I account it high time to get to sea as soon as I can.")

    for tok in doc:
        if tok.pos_ == "PROPN":
            print(tok)

This will print Ishmael and November only. It does not print words at the beginning of a sentence such as Call or Some even though they are capitalized. When spaCy got to the the line

Queequeg was George Washington cannibalistically developed.

it detected that Queequeg is a proper noun. Presumably the model can tell this from context, because the word precedes the verb was and not because it knows Queeqeug is proper name.

When I changed November to november spaCy was still able to detect that november was a proper noun. When I downcased Ishmael it did not detect that ishmael was a proper noun, presumably because Ishmael is an uncommon name. When I changed the text to “Call me tim” the library did recognize tim as a proper noun.

When I fed spaCy the sentence

I never go as a passenger; nor, though I am something of a salt, do I ever go to sea as a Commodore, or a Captain, or a Cook.

the library thought that Commadore, Captain, and Cook were proper nouns. If I downcase these words, spaCy does not flag them as proper nouns.

When processing the line

For as in this world,head winds are far more prevalent than winds from astern (that is, if you never violate the Pythagorean maxim), so for the most part the Commodore on the quarter-deck gets his atmosphere at second hand from the sailors on the forecastle

spaCy correctly flagged Commodore as a proper noun in this instance. Also, it did not classify Pythagorean as a proper noun; the word is proper but not a noun, i.e. it’s a proper adjective.

TANSTAAFL

My script above has only six lines of code. But it depends on a library that uses a 588 MB language model. [1]

Related posts

[1] “TANSTAALF” stands for “There ain’t no such thing as a free lunch.” It comes from The Moon is a Harsh Mistress by Heinlein.

Incidentally, when I fed “The term TANSTAAFL comes from The Moon is a Harsh Mistress by Heinlein.” to spaCy, it flagged Harsh and Mistress as proper nouns.

When I fed it “The term TANSTAAFL comes from ‘The moon is a harsh mistress’ by Heinlein.” the library correctly tagged harsh as an adjective and mistress as a (non-proper) noun.

Golden integration

Let φ be the golden ratio. The fractional parts of nφ bounce around in the unit interval in a sort of random way. Technically, the sequence is quasi-random.

Quasi-random sequences are like random sequences but better in the sense that they explore a space more efficiently than random sequences. For this reason, Monte Carlo integration (“integration by darts“) can often be made more efficient by replacing random sequences with quasi-random sequence. This post will illustrate this efficiency advantage in one dimension using the fractional parts of nφ.

Here are functions that will generate our integration points.

    from numpy import random, zeros

    def golden_source(n):
        phi = (1 + 5**0.5)/2
        return (phi*n)%1

    def random_source(N):
        return random.random()

We will pass both of these generators as arguments to the following function which saves a running average of function evaluates at the generated points.

    def integrator(f, source, N):
        runs = zeros(N)
        runs[0] = f(source(0))
        for n in range(1, N):
            runs[n] = runs[n-1] + (f(source(n)) - runs[n-1])/n
        return runs

We’ll use as our example integrand f(x) = x² (1 – x)³. The integral of this function over the unit interval is 1/60.

    def f(x):
        return x**2 * (1-x)**3
    exact = 1/60

Now we call our integrator.

    random.seed(20230429)
    N = 1000
    golden_run = integrator(f, golden_source, N)
    random_run = integrator(f, random_source, N)

Now we plot the difference between each run and the exact value of the integral. Both methods start out with wild fluctuations. We leave out the first 10 elements in order to make the error easier to see.

    import matplotlib.pyplot as plt

    k = 10
    x = range(N)
    plt.plot(x[k:], golden_run[k:] - exact)
    plt.plot(x[k:], random_run[k:] - exact)

This produces the following plot.

The integration error using φn – ⌊φn⌋ is so close to zero that it’s hard to observe its progress. So we plot it again, this time taking the absolute value of the integration error and plotting on a log scale.

    plt.plot(x[k:], abs(golden_run[k:] - exact))
    plt.plot(x[k:], abs(random_run[k:] - exact))
    plt.yscale("log")

This produces a more informative plot.

The integration error for the golden sequence is at least an order of magnitude smaller, and often a few orders of magnitude smaller.

The function we’ve integrated has a couple features that make integration using quasi-random sequences (QMC, quasi-Monte Carlo) more efficient. First, it’s smooth. If the integrand is jagged, QMC has no advantage over MC. Second, our integrand could be extended smoothly to a periodic function, i.e. f(0) = f(1) and f′(0) = f′(1). This makes QMC integration even more efficient.

Related posts

Python code to solve Kepler’s equation

The previous post looked at solving Kepler’s equation using Newton’s method. The problem with using Newton’s method is that it may not converge when the eccentricity e is large unless you start very close to the solution. As discussed at the end of that post, John Machin came up with a clever way to start close. His starting point is defined as follow.

\begin{align*} n &= \sqrt{5 + \sqrt{16 + \frac{9}{e}}} \\ M &= n \left((1-e)s + \frac{e(n^2 - 1) + 1}{6}s^3 \right) \\ x_0 &= n \arcsin s \end{align*}

The variable s is implicitly defined as the root of a cubic polynomial. This could be messy. Maybe there are three real roots and we have to decide which one to use. Fortunately this isn’t the case.

The discriminant of our cubic equation is negative, so there is only one real root. And because our cubic equation for s has no s² term the expression for the root isn’t too complicated.

Here’s Python code to solve Kepler’s equation using Newton’s method with Machin’s starting point.

    from numpy import sqrt, cbrt, pi, sin, cos, arcsin, random
    
    # This will solve the special form of the cubic we need.
    def solve_cubic(a, c, d):
        assert(a > 0 and c > 0)
        p = c/a
        q = d/a
        k = sqrt( q**2/4 + p**3/27 )
        return cbrt(-q/2 - k) + cbrt(-q/2 + k)
    
    # Machin's starting point for Newton's method
    # See johndcook.com/blog/2022/11/01/kepler-newton/
    def machin(e, M):
        n = sqrt(5 + sqrt(16 + 9/e))
        a = n*(e*(n**2 - 1)+1)/6
        c = n*(1-e)
        d = -M
        s = solve_cubic(a, c, d)
        return n*arcsin(s)    
    
    def solve_kepler(e, M):
        "Find E such that M = E - e sin E."
        assert(0 <= e < 1)
        assert(0 <= M <= pi) 
        f = lambda E: E - e*sin(E) - M 
        E = machin(e, M) 
        tolerance = 1e-10 

        # Newton's method 
        while (abs(f(E)) > tolerance):
            E -= f(E)/(1 - e*cos(E))
        return E

To test this code, we’ll generate a million random values of e and M, solve for the corresponding value of E, and verify that the solution satisfies Kepler’s equation.

    random.seed(20221102)
    N = 1_000_000
    e = random.random(N)
    M = random.random(N)*pi
    for i in range(N):
        E = solve_kepler(e[i], M[i])
        k = E - e[i]*sin(E) - M[i]
        assert(abs(k) < 1e-10)
    print("Done")

All tests pass.

Machin’s starting point is very good, and could make an adequate solution on its own if e is not very large and if you don’t need a great deal of accuracy. Let’s illustrate by solving Kepler’s equation for the orbit of Mars with eccentricity e = 0.09341.

Error in Machin's starting guess as a function of M

Here the maximum error is 0.01675 radians and the average error is 0.002486 radians. The error is especially small for small values of M. When M = 1, the error is only 1.302 × 10-5 radians.

Dump a pickle file to a readable text file

I got a data file from a client recently in “pickle” format. I happen to know that pickle is a binary format for serializing Python objects, but trying to open a pickle file could be a puzzle if you didn’t know this.

Be careful

There are a couple problems with using pickle files for data transfer. First of all, it’s a security risk because an attacker could create a malformed pickle file that would cause your system to run arbitrary code. In the Python Cookbook, the authors David Beazley and Brian K. Jones warn

It’s essential that pickle only be used internally with interpreters that have some ability to authenticate one another.

The second problem is that the format could change. Again quoting the Cookbook,

Because of its Python-specific nature and attachment to source code, you probably shouldn’t use pickle as a format for long-term storage. For example, if the source code changes, all of your stored data might break and become unreadable.

Suppose someone gives you a pickle file and you’re willing to take your chances and open it. It’s from a trusted source, and it was created recently enough that the format probably hasn’t changed. How do you open it?

Unpickling

The following code will open the file data.pickle and read it into an object obj.

    import pickle
    obj = pickle.load(open("data.pickle", "rb"))

If the object in the pickle file is very  small, you could simply print obj. But if the object is at all large, you probably want to save it to a file rather than dumping it at the command line, and you also want to “pretty” print it than simply printing it.

Pretty printing

The following code will dump a nicely-formatted version of our pickled object to a text file out.txt.

    import pickle
    import pprint

    obj = pickle.load(open("sample_data.pickle", "rb"))

    with open("out.txt", "a") as f:
         pprint.pprint(obj, stream=f)

In my case, the client’s file contained a dictionary of lists of dictionaries. It printed as one incomprehensible line, but it pretty printed as 40,000 readable lines.

Prettier printing

Simon Brunning left a comment suggesting that the json module output is even easier to read.

    import json

    with open("out.txt", "a") as f:
        json.dump(obj, f, indent=2)

And he’s right, at least in my case. The indentation json.dump produces is more what I’d expect, more like what you’d see if you were writing the structure in well-formatted source code.

Related posts

Org-mode as a lightweight notebook

You can think of org-mode as simply a kind of markdown, a plain text file that can be exported to fancier formats such as HTML or PDF. It’s a lot more than that, but that’s a reasonable place to start.

Org-mode also integrates with source code. You can embed code in your file and have the code and/or the result of running the code appear when you export the file to another format.

Org-mode as notebook

You can use org-mode as a notebook, something like a Jupyter notebook, but much simpler. An org file is a plain text file, and you can execute embedded code right there in your editor. You don’t need a browser, and there’s no hidden state.

Here’s an example of mixing markup and code:

    The volume of an n-sphere of radius r is 

    $$\frac{\pi^{\frac{n}{2}}}{\Gamma\left(\frac{n}{2} + 1\right)}r^n.$$

    #+begin_src python :session
    from scipy import pi
    from scipy.special import gamma

    def vol(r, n):
        return pi**(n/2)*r**n/gamma(n/2 + 1)

    vol(1, 5)
    #+end_src

If you were to export the file to PDF, the equation for the volume of a sphere would be compiled into a image using LaTeX.

To run the code [1], put your cursor somewhere in the code block and type C-c C-c. When you do, the following lines will appear below your code.

    #+RESULTS:
    : 5.263789013914324

If you think of your org-mode file as primary, and you’re just inserting some code as a kind of scratch area, an advantage of org-mode is that you never leave your editor.

Jupyter notebooks

Now let’s compare that to a Jupyter notebook. Jupyter organizes everything by cells, and a cell can contain markup or code. So you could create a markup cell and enter the exact same introductory text [2].

    The volume of an n-sphere of radius r is 

    $$\frac{\pi^{\frac{n}{2}}}{\Gamma\left(\frac{n}{2} + 1\right)}r^n$$.

When you “run” the cell, the LaTeX is processed and you see the typeset expression rather than its LaTeX source. You can click on the cell to see the LaTeX code again.

Then you would enter the Python code in another cell. When you run the cell you see the result, much as in org-mode. And you could export your notebook to PDF as with org-mode.

File diff

Now suppose we make a couple small changes. We want the n and r in the comment section set in math italic, and we’d like to find the volume of a 5-sphere of radius 2 rather than radius 1. We do this, in Jupyter and in org-mode [3], by putting dollar signs around the “n” and the “r”, and we change vol(1, 5) to vol(2, 5).

Let’s run diff on the two versions of the org-mode file and on the two versions of the Jupyter notebook.

The differences in the org files are easy to spot:

    1c1
    < The volume of an n-sphere of radius r is 
    ---
    > The volume of an \(n\)-sphere of radius \(r\) is 
    12c12
    < vol(1, 5)
    ---
    > vol(2, 5)
    16c16,17
    < : 5.263789013914324
    ---
    > : 168.44124844525837

However, the differences in the Jupyter files are more complicated:

    5c5
    <    "id": "2a1b0bc4",
    ---
    >    "id": "a0a89fcf",
    8c8
    <     "The volume of an n-sphere of radius r is \n",
    ---
    >     "The volume of an $n$-sphere of radius $r$ is \n",
    15,16c15,16
    <    "execution_count": 1,
    <    "id": "888660a2",
    ---
    >    "execution_count": 2,
    >    "id": "1adcd8b1",
    22c22
    <        "5.263789013914324"
    ---
    >        "168.44124844525837"
    25c25
    <      "execution_count": 1,
    ---
    >      "execution_count": 2,
    37c37
    <     "vol(1, 5)"
    ---
    >     "vol(2, 5)"
    43c43
    <    "id": "f8d4d1b0",

There’s a lot of extra stuff in a Jupyter notebook. This is a trivial notebook, and more complex notebooks have more extra stuff. An apparently small change to the notebook can cause a large change in the underlying notebook file. This makes it difficult to track changes in a Jupyter notebook in a version control system.

Related posts

[1] Before this will work, you have to tell Emacs that Python is one of the languages you want to run inside org-mode. I have the following line in my init file to tell Emacs that I want to be able to run Python, DITAA, R, and Perl.

    (org-babel-do-load-languages 'org-babel-load-languages '((python . t) (ditaa . t) (R . t) (perl . t)))

[2] Org-mode will let you use \[ and \] to bracket LaTeX code for a displayed equation, and it will also let you use $$. Jupyter only supports the latter.

[3] In org-mode, putting dollar signs around variables sometimes works and sometimes doesn’t. And in this example, it works for the “r” but not for the “n”. This is very annoying, but it can be fixed by using \( and \) to enter and leave math mode rather than use a dollar sign for both.

Find log normal parameters for given mean and variance

Earlier today I needed to solve for log normal parameters that yield a given mean and variance. I’m going to save the calculation here in case I needed in the future or in case a reader needs it. The derivation is simple, but in the heat of the moment I’d rather look it up and keep going with my train of thought.

NB: The parameters μ and σ² of a log normal distribution are not the mean and variance of the distribution; they are the mean and variance of its log.

If m is the mean and v is the variance then

\begin{align*} m &= \exp(\mu + \sigma^2/2) \\ v &= (\exp(\sigma^2) - 1) \exp(2\mu + \sigma^2) \end{align*}

Notice that the square of the m term matches the second part of the v term.

Then

\frac{v}{m^2} = \exp(\sigma^2) -1

and so

\sigma^2 = \log\left(\frac{v}{m^2} + 1 \right)

and once you have σ² you can find μ by

\mu = \log m - \sigma^2/2

Here’s Python code to implement the above.

    from numpy immport log
    def solve_for_log_normal_parameters(mean, variance):
        sigma2 = log(variance/mean**2 + 1)
        mu = log(mean) - sigma2/2
        return (mu, sigma2)

And here’s a little test code for the code above.

    mean = 3.4
    variance = 5.6

    mu, sigma2 = solve_for_log_normal_parameters(mean, variance)

    X = lognorm(scale=exp(mu), s=sigma2**0.5)
    assert(abs(mean - X.mean()) < 1e-10)
    assert(abs(variance - X.var()) < 1e-10)

Related posts

Spaceship operator in Python

Some programming languages, such as Perl, have an infix operator <=> that returns a three-state comparison. The expression

    a <=> b

evaluates to -1, 0, or 1 depending on whether a < b, a = b, or a > b. You could think of <=> as a concatenation of <, =, and >.

The <=> operator is often called the “spaceship operator” because it looks like Darth Vader’s ship in Star Wars.

Python doesn’t have a spaceship operator, but you can get the same effect with numpy.sign(a-b). For example, suppose you wanted to write a program to compare two integers.

You could write

    from numpy import sign
    def compare(x, y):
        cmp = ["equal to", "greater than", "less than"][sign(x-y)]
        print(f"{x} is {cmp} {y}.")

Here we take advantage of the fact that an index of -1 points to the last element of a list.

The sign function will return an integer if its argument is an integer or a float if its argument is a float. The code above will break if you pass in floating point numbers because sign will return -1.0, 0.0, or 1.0. But if you replace sign(x-y) with int(sign(x-y)) it will work for floating point arguments.

Related post: Symbol pronunciation

Illustrating Gershgorin disks with NumPy

Gershgorin’s theorem gives bounds on the locations of eigenvalues for an arbitrary square complex matrix.

The eigenvalues are contained in disks, known as Gershgorin disks, centered on the diagonal elements of the matrix. The radius of the disk centered on the kth diagonal element is the sum of the absolute values of the elements in the kth row, excluding the diagonal element itself.

To illustrate this theorem, we create a 5 by 5 diagonal matrix and add some random noise to it. The diagonal elements are

0, 3 + i, 4 + i, 1 + 5i, 9 + 2i.

The eigenvalues of the diagonal matrix would simply be the diagonal elements. The additional random values pull the eigenvalues away from the diagonal values, but by an amount no more than Gershgorin predicts.

Gershgorin disks and eigenvalues

Note that in this example, two of the eigenvalues land in the smallest disk. It’s possible that a disk may not contain any eigenvalues; what Gershgorin guarantees is that the union of all the disks contains the union of all the eigenvalues.

Here’s the Python code that created the graph above.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(202103014)

N = 5 # dimension of our square matrix

D = np.diag([0, 3 + 1j, 4 + 1j, 1 + 5j, 9 + 2j])
M = np.random.rand(N, N) + D

R = np.zeros(N) # disk radii
for i in range(N):
    R[i] = sum(abs(M[i,:])) - abs(M[i,i])

eigenvalues = np.linalg.eigvals(M)

# Plotting code
fig, ax = plt.subplots()
for k in range(N):
    x, y = M[k,k].real, M[k,k].imag
    ax.add_artist( plt.Circle((x, y), R[k], alpha=0.5) )
    plt.plot(eigenvalues[k].real, eigenvalues[k].imag, 'k+')

ax.axis([-4, 12.5, -4, 9])
ax.set_aspect(1)    
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Gershgorin disks and eigenvalues $x + iy$")