Distance from a point to a line

Eric Lengyel’s new book Projective Geometric Algebra Illuminated arrived yesterday and I’m enjoying reading it. Imagine if someone started with ideas like dot products, cross products, and determinants that you might see in your first year of college, then thought deeply about those things for years. That’s kinda what the book is.

Early in the book is the example of finding the distance from a point q to a line of the form p + tv.

If you define u = q − p then a straightforward derivation shows that the distance d from q to the line is given by

d = \sqrt{{\bold u}^2 - \frac{({\bold u}\cdot {\bold v})^2}{{\bold v}^2}}

But as the author explains, it is better to calculate d by

d = \sqrt{\frac{({\bold u}\times {\bold v})^2}{{\bold v}^2}}

Why is that? The two expressions are algebraically equal, but the latter is better suited for numerical calculation.

The cardinal rule of numerical calculation is to avoid subtracting nearly equal floating point numbers. If two numbers agree to b bits, you may lose up to b bits of significance when computing their difference.

If u and v are vectors with large magnitude, but q is close to the line, then the first equation subtracts two large, nearly equal numbers under the square root.
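
To see the difference, here is a small NumPy sketch with a contrived example: the point is only one unit from the line, but u is enormous, so the first formula subtracts two numbers that agree in nearly all their bits.

import numpy as np

p = np.zeros(3)
v = np.array([1.0, 0.0, 0.0])      # direction of the line
q = np.array([1e8, 1.0, 0.0])      # far along the line, one unit off of it
u = q - p

# First formula: subtracts two nearly equal large numbers.
d1 = np.sqrt(np.dot(u, u) - np.dot(u, v)**2 / np.dot(v, v))

# Second formula: the subtractions inside the cross product involve small numbers.
w = np.cross(u, v)
d2 = np.sqrt(np.dot(w, w) / np.dot(v, v))

print(d1, d2)   # the first formula collapses to 0.0; the second returns the true distance 1.0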

The second equation involves subtraction too, but it’s less obvious. This is a common theme in numerical computing. Imagine this dialog.

[Student produces first equation.]

Mentor: Avoid subtracting nearly equal numbers.

[Student produces second equation.]

Student: OK, did it.

Mentor: That’s much better, though it could still have problems.

Where is there a subtraction in the second equation? We started with a subtraction in defining u. More subtly, the definition of cross product involves subtractions. But these subtractions involve smaller numbers than the first formula, because the first formula subtracts squared values. Eric Lengyel points this out in his book.

None of this may matter in practice, until it does matter, which is a common pattern in numerical computing. You implement something like the first formula, something that can be derived directly. You implicitly have in mind vectors whose magnitude is comparable to d and this guides your choice of unit tests, which all pass.

Some time goes by and a colleague tells you your code is failing. Impossible! You checked your derivation by hand and in Mathematica. Your unit tests all pass. Must be your colleague’s fault. But it’s not. Your code would be correct in infinite precision, but in an actual computer it fails on inputs that violate your implicit assumptions.

This can be frustrating, but it can also be fun. Implementing equations from a freshman textbook accurately, efficiently, and robustly is not a freshman-level exercise.


Experiences with Thread Programming in Microsoft Windows

Lately I’ve been helping a colleague to add worker threads to his GUI-based Windows application.

Thread programming can be tricky. Here are a few things I’ve learned along the way.

Performance. This app does compute-intensive work. Offloading the heavy computation to a worker thread frees the main thread to service GUI requests more responsively.

Thread libraries. Windows has multiple thread libraries, for example Microsoft Foundation Class library threads and C++ standard library threads. It is hazardous to use different thread libraries in the same app. In the extreme case, different threading runtimes, such as GOMP and LOMP used by the GCC and LLVM compiler families respectively, keep track of threads in different ways, and mixing them in the same code can cause silent errors.

Memory fences are a thing. Different threads can run on different processor cores and hold variables in their respective L1 caches, which are not flushed automatically (to maintain performance). An update to a variable by one thread is not guaranteed to be visible to other threads without special measures. For example, one can safely transfer information using ::PostMessage coupled with a handler function on the receiving thread, or send a signal using an MFC CEvent on one thread and read its Lock on the other. Also, a thread launch implicitly does a memory fence, so at that point, at least, the new thread is guaranteed to see the correct state of all memory locations.

GUI access should be done from the master thread only, not from a worker thread. Touching the GUI from a worker thread can result in deadlock. A worker thread can instead ::PostMessage to ask the master thread to perform a GUI action.

Thread launch. By default AfxBeginThread returns a thread handle which MFC takes care of deleting when no longer needed. If you want to manage the life cycle of the handle yourself, you can do something like:

// Create the thread suspended so it can be adjusted before it runs.
myWorkerThreadHandle = AfxBeginThread(myFunc, myParams,
  THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED);
// Keep the CWinThread object alive after the thread exits; we delete it ourselves.
myWorkerThreadHandle->m_bAutoDelete = false;
myWorkerThreadHandle->ResumeThread();

Joint use of a shared library like the DAO database library has hazards. Beware of using the library to allocate something in one thread and deallocate it in another: the allocation will likely live in a thread-local heap or stack rather than in a shared thread-safe heap, resulting in a crash.

Initialization. One should call CoInitializeEx(NULL, COINIT_APARTMENTTHREADED) and AfxDaoInit() (if using DAO) at thread initialization on both master and worker threads, and correspondingly CoUninitialize() and AfxDaoTerm() at completion of the thread.

Monitoring of thread state can be done with
WaitForSingleObject(myWorkerThreadHandle->m_hThread, 0) to determine if the thread has completed or WaitForSingleObject(myWorkerThreadHandle->m_hThread, INFINITE) for a blocking wait until completion.

Race conditions are always a risk but can be avoided by careful reasoning about execution. Someone once said up to 90% of code errors can be found by desk checking [1]. Race conditions are notoriously hard to debug, partly because they can occur nondeterministically. There are tools for trying to find race condition errors, though I’ve never tried them.

So far I find no rigorous specification of the MFC threading model online that touches on all these concerns. Hopefully this post is useful to someone else working through these issues.

References

[1] Aristides Dasso and Ana Funes. Verification, Validation and Testing in Software Engineering. Idea Group Publishing, 2007, p. 163.

Accelerating Archimedes

One way to approximate π is to find the areas of polygons inscribed inside a circle and polygons circumscribed outside the circle. The approximation improves as the number of sides in the polygons increases. This idea goes back at least as far as Archimedes (287–212 BC). Maybe you’ve tried this. It’s a lot of work.

In 1667 James Gregory came up with a more efficient way to carry out Archimedes’ approach. Tom Edgar and David Richeson cite Gregory’s theorem in [1] as follows.

Let Ik and Ck denote the areas of regular k-gons inscribed in and circumscribed around a given circle. Then I2n is the geometric mean of In and Cn, and C2n is the harmonic mean of I2n and Cn; that is,

\begin{align*} I_{2n} &= \sqrt{I_nC_n} \\ C_{2n} &= \frac{2C_n I_{2n}}{C_n + I_{2n}} \end{align*}

We can start with n = 4. Obviously the square circumscribed around the unit circle has vertices at (±1, ±1) and area 4. A little thought finds that the inscribed square has vertices at (±√2/2, ±√2/2) and area 2.

The following Python script finds π to six decimal places.

n = 4
I, C = 2, 4 # inscribed area, circumscribed area
delta = 2

while delta > 1e-6:
    I = (I*C)**0.5
    C = 2*C*I / (C + I)
    delta = C - I
    n *= 2
    print(n, I, C, delta)

The script stops when n = 8192, meaning that the difference between the areas of an inscribed and circumscribed 8192-gon is less than 10⁻⁶. The final output of the script is:

8192 3.1415923 3.1415928 4.620295e-07

If we average the upper and lower bound we get one more decimal place of accuracy.
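
Continuing from the script above, that average is a one-liner:

print((I + C)/2)   # about 3.1415926, closer to pi than either bound alone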

Gregory’s algorithm isn’t fast, though it’s much faster than carrying out Archimedes’ approach by computing each area from scratch.

Gregory gives an interesting recursion. It has an asymmetry that surprised me: you update I, then C. You don’t update them simultaneously as you do when you’re computing the arithmetic-geometric mean.


[1] Tom Edgar and David Richeson. A Visual Proof of Gregory’s Theorem. Mathematics Magazine, Vol. 92, No. 5 (December 2019), pp. 384–386.

How much will a cable sag? A simple approximation

Suppose you have a cable of length 2s suspended from two poles of equal height a distance 2x apart. Assuming the cable hangs in the shape of a catenary, how much does it sag in the middle?

If the cable were pulled perfectly taut, we would have s = x and there would be no sag. But in general, s > x and the cable is somewhat lower in the middle than on the two ends. The sag is the difference between the height at the end points and the height in the middle.

The equation of a catenary is

y = a cosh(t/a)

and the length of the catenary between t = −x and t = x is

2s = 2a sinh(x/a).

You could solve this equation for a, and then the sag is

g = a cosh(x/a) − a.

Solving for a given s is not an insurmountable problem, but there is a simple approximation for g that has an error of less than 1%. This approximation, given in [1], is

g² = (s − x)(s + x/2).

To visualize the error, I set x = 1 and let a vary to produce the plot below.

The relative error is always less than 1/117. It peaks somewhere around a = 1/3 and decreases from there, the error getting smaller as a increases (i.e. the cable is pulled tighter).
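
Here is a quick numerical check of that claim, a minimal sketch with x fixed at 1 and a handful of arbitrary values of a.

import numpy as np

x = 1.0
for a in [0.1, 0.25, 1/3, 0.5, 1.0, 2.0, 5.0]:
    s = a*np.sinh(x/a)                       # half the cable length
    g = a*np.cosh(x/a) - a                   # exact sag
    g_approx = np.sqrt((s - x)*(s + x/2))    # the approximation above
    print(f"a = {a:5.3f}   relative error = {abs(g - g_approx)/g:.2e}")

In this sweep the largest relative error occurs at a = 1/3 and is a bit under 1%, consistent with the bound above.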


[1] J. S. Frame. An approximation for the dip of a catenary. Pi Mu Epsilon Journal, Vol. 1, No. 2 (April 1950), pp. 44–46

Unique letter patterns in words

The word Mississippi has a unique pattern of letters. If you were solving a cryptogram puzzle and saw ZVFFVFFVCCV you might guess that the word is Mississippi.

Is the pattern of letters in Mississippi literally unique or just uncommon? What is the shortest word with a unique letter pattern? The longest word?

We can answer these questions by looking at normalized cryptograms, a sort of word signature. These are formed by replacing the first letter in a word with ‘A’, the next unique letter with ‘B’, etc. The normalized cryptogram of Mississippi is ABCCBCCBDDB.

The set of English words is fuzzy, but for my purposes I will take “words” to mean the entries in the dictionary file american-english on my Linux box, removing words that contain apostrophes or triple letters. I computed the cryptogram of each word, then looked for those that only appear once.
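
Here is a sketch of that computation. The dictionary path is the usual Debian location (an assumption on my part), and for brevity this version filters out only apostrophes, not triple letters.

from collections import Counter

def cryptogram(word):
    # Map the first letter to 'A', the next new letter to 'B', and so on.
    word = word.lower()
    mapping = {}
    for c in word:
        if c not in mapping:
            mapping[c] = chr(ord('A') + len(mapping))
    return ''.join(mapping[c] for c in word)

with open("/usr/share/dict/american-english") as f:
    words = [w.strip() for w in f if w.strip() and "'" not in w]

counts = Counter(cryptogram(w) for w in words)
unique = [w for w in words if counts[cryptogram(w)] == 1]

print(cryptogram("Mississippi"))                   # ABCCBCCBDDB
print(min(unique, key=len), max(unique, key=len))  # a shortest and a longest example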

Relative to the list of words I used, yes, Mississippi is unique.

The shortest word with a unique cryptogram is eerie. [1]

The longest word with a unique cryptogram is ambidextrously. Every letter in this 14-letter word appears only once.

[1] Update: eerie is a five-letter example, but there are more. Jack Kennedy pointed out amass, llama, and mamma in the comments. I noticed eerie because its cryptogram comes first in alphabetical order.

How to Organize Technical Research?


64 million scientific papers have been published since 1996 [1].

Assuming you can actually find the information you want in the first place—how can you organize your findings to be able to recall and use them later?

It’s not a trifling question. Discoveries often come from uniting different obscure pieces of information in a new way, possibly from very disparate sources.

Many software tools are used today for notetaking and organizing information, including simple text files and folders, Evernote, GitHub, wikis, Miro, mymind, Synthical and Notion—to name a diverse few.

AI tools can help, though they don’t always recall things correctly, and their ability to find connections between ideas is still elementary. But they are getting better [2,3].

One perspective was presented by Jared O’Neal of Argonne National Laboratory, from the standpoint of laboratory notebooks used by teams of experimental scientists [4]. His experience was that as problems become more complex and larger, researchers must invent new tools and processes to cope with the complexity—thus “reinventing the lab notebook.”

While acknowledging the value of paper notebooks, he found electronic methods essential because of distributed teammates. In his view many streams of notes are probably necessary, using tools such as GitLab and Jupyter notebooks. Crucial is the actual discipline and methodology of notetaking, for example a hierarchical organization of notes (separating high-level overview and low-level details) that are carefully written to be understandable to others.

A totally different case is the research methodology of the 19th century scientist Michael Faraday. He is not to be taken lightly: some have called him “the best experimentalist in the history of science,” a judgment that perhaps holds up even today [5].

A fascinating paper [6] documents Faraday’s development of “a highly structured set of retrieval strategies as dynamic aids during his scientific research.” He recorded a staggering 30,000 experiments over his lifetime. He used 12 different kinds of record-keeping media, including lab notebooks proper, idea books, loose slips, retrieval sheets and work sheets. Often he would combine ideas from different slips of paper to organize his discoveries. Notably, his process to some degree varied over his lifetime.

Certain motifs emerge from these examples: the value of well-organized notes as memory aids; the need to thoughtfully innovate one’s notetaking methods to find what works best; the freedom to use multiple media, not restricted to a single notetaking tool or format.

Do you have a favorite method for organizing your research? If so, please share in the comments below.

References

[1] How Many Journal Articles Have Been Published? https://publishingstate.com/how-many-journal-articles-have-been-published/2023/

[2] “Multimodal prompting with a 44-minute movie | Gemini 1.5 Pro Demo,” https://www.youtube.com/watch?v=wa0MT8OwHuk

[3] Geoffrey Hinton, “CBMM10 Panel: Research on Intelligence in the Age of AI,” https://www.youtube.com/watch?v=Gg-w_n9NJIE&t=4706s

[4] Jared O’Neal, “Lab Notebooks For Computational Mathematics, Sciences, Engineering: One Ex-experimentalist’s Perspective,” Dec. 14, 2022, https://www.exascaleproject.org/event/labnotebooks/

[5] “Michael Faraday,” https://dlab.epfl.ch/wikispeedia/wpcd/wp/m/Michael_Faraday.htm

[6] Tweney, R. D. and Ayala, C. D., 2015. Memory and the construction of scientific meaning: Michael Faraday’s use of notebooks and records. Memory Studies, 8(4), pp. 422–439. https://www.researchgate.net/profile/Ryan-Tweney/publication/279216243_Memory_and_the_construction_of_scientific_meaning_Michael_Faraday’s_use_of_notebooks_and_records/links/5783aac708ae3f355b4a1ca5/Memory-and-the-construction-of-scientific-meaning-Michael-Faradays-use-of-notebooks-and-records.pdf

A surprising result about surprise index

Surprise index

Warren Weaver [1] introduced what he called the surprise index to quantify how surprising an event is. At first it might seem that the probability of an event is enough for this purpose: the lower the probability of an event, the more surprise when it occurs. But Weaver’s notion is more subtle than this.

Let X be a discrete random variable taking non-negative integer values such that

\text{Prob}(X = i) = p_i

Then the surprise index of the ith event is defined as

S_i = \frac{\sum_{j=0}^\infty p_j^2 }{p_i}

Note that if X takes on values 0, 1, 2, … N−1 all with equal probability 1/N, then Si = 1, independent of N. If N is very large, each outcome is rare but not surprising: because all events are equally rare, no specific event is surprising.

Now let X be the number of legs a human selected at random has. Then p2 ≈ 1, and so the numerator in the definition of Si is approximately 1 and S2 is approximately 1, but Si is large for any value of i ≠ 2.
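
To make that concrete, here is a tiny sketch with made-up probabilities for the legs example; for a distribution with only a few outcomes the sum in the numerator is easy to compute directly.

# Hypothetical probabilities for the number of legs of a randomly selected person
p = {0: 0.001, 1: 0.004, 2: 0.995}

numerator = sum(q**2 for q in p.values())   # sum of squared probabilities
S = {i: numerator/p[i] for i in p}
print(S)   # S[2] is about 1, while S[0] and S[1] are in the hundreds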

The hard part of calculating the surprise index is computing the sum in the numerator. This is the same calculation that occurs in many contexts: Friedman’s index of coincidence, collision entropy in physics, Rényi entropy in information theory, etc.

Poisson surprise index

Weaver comments that he tried calculating his surprise index for Poisson and binomial random variables and had to admit defeat. As he colorfully says in a footnote:

I have spent a few hours trying to discover that someone else had summed these series and spent substantially more trying to do it myself; I can only report failure, and a conviction that it is a dreadfully sticky mess.

A few years later, however, R. M. Redheffer [2] was able to solve the Poisson case. His derivation is extremely terse. Redheffer starts with the generating function for the Poisson

e^{-\lambda} e^{\lambda x} = \sum p_mx^m

and then says

Let x = e^{iθ}; then e^{−iθ}; multiply; integrate from 0 to 2π and simplify slightly to obtain

\sum p_m^2 = (e^{-2\lambda}/\pi) \int_0^\pi e^{2\lambda \cos \theta}\, d\theta

The integral on the right is recognized as the zero-order Bessel function …

Redheffer then “recognizes” an expression involving a Bessel function, acknowledging in a footnote that a colleague, M. V. Cerrillo, was responsible for recognizing it.

It is surprising that the problem Weaver describes as a “dreadfully sticky mess” has a simple solution. It is also surprising that a Bessel function would pop up in this context. Bessel functions arise frequently in solving differential equations but not that often in probability and statistics.

Unpacking Redheffer’s derivation

When Redheffer says “Let x = e^{iθ}; then e^{−iθ}; multiply; integrate from 0 to 2π” he means that we should evaluate both sides of the Poisson generating function equation at these two values of x, multiply the results, and average both sides over the interval [0, 2π].

On the right hand side this means calculating

\frac{1}{2\pi} \int_0^{2\pi}\left(\sum_{m=0}^\infty p_m \exp(im\theta)\right) \left(\sum_{n=0}^\infty p_n \exp(-in\theta)\right)\, d\theta

This reduces to

\sum_{m=0}^\infty p_m^2

because

\frac{1}{2\pi} \int_0^{2\pi} \exp(im\theta) \exp(-in\theta) \, d\theta = \delta_{mn}

i.e. the integral evaluates to 1 when m = n but otherwise equals zero.

On the left hand side we have

\begin{align*} \frac{1}{2\pi} \int_0^{2\pi} \exp(-2\lambda) \exp(\lambda(e^{i\theta} + e^{-i\theta})) \, d\theta &= \frac{1}{2\pi} \int_0^{2\pi} \exp(-2\lambda) \exp(2\lambda \cos \theta) \, d\theta \\ &= \frac{e^{-2\lambda}}{\pi} \int_0^{\pi} \exp(2\lambda \cos \theta) \, d\theta \end{align*}

Cerrillo’s contribution was to recognize the integral as the Bessel function J0 evaluated at -2iλ or equivalently the modified Bessel function I0 evaluated at -2λ. This follows directly from equations 9.1.18 and 9.6.16 in Abramowitz and Stegun.

Putting it all together we have

\sum_{m=0}^\infty p_m^2 = \frac{e^{-2\lambda}}{\pi} \int_0^{\pi} \exp(2\lambda \cos \theta) \, d\theta = e^{-2\lambda} J_0(-2i\lambda ) = e^{-2\lambda} I_0(-2\lambda )

Using the asymptotic properties of I0 Redheffer notes that for large values of λ,

\sum_{m=0}^\infty p_m^2 \sim \frac{1}{2\sqrt{\pi\lambda}}
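
Here is a short SciPy sketch, an independent check rather than anything from the papers, comparing the direct sum with the closed form and the asymptotic approximation above.

import numpy as np
from scipy.special import i0      # modified Bessel function I_0
from scipy.stats import poisson

for lam in [1, 5, 20, 100]:
    m = np.arange(0, 20*lam + 50)
    direct = np.sum(poisson.pmf(m, lam)**2)   # sum of p_m^2 by brute force
    closed = np.exp(-2*lam) * i0(2*lam)       # closed form above (I_0 is even)
    asympt = 1/(2*np.sqrt(np.pi*lam))         # large-lambda approximation
    print(lam, direct, closed, asympt)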

[1] Warren Weaver, “Probability, rarity, interest, and surprise,” The Scientific Monthly, Vol 67 (1948), p. 390.

[2] R. M. Redheffer. A Note on the Surprise Index. The Annals of Mathematical Statistics, Vol. 22, No. 1 (March 1951), pp. 128–130.

Estimating an author’s vocabulary

How would you estimate the size of an author’s vocabulary? Suppose you have analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have known words he never used in writing, or at least not in the works you have access to.

Brainerd [1] suggested the following estimator based on a Markov chain model of language. The estimated vocabulary is the number N satisfying the equation

\sum_{j=0}^{x-1}\left(1 - \frac{j}{N}\right)^{-1} = n

The left side is a decreasing function of N, so you could solve the equation by finding values of N that make the sum smaller and larger than n, then using a bisection algorithm.
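
Here is a minimal sketch of that approach; the word counts at the end are made up for illustration.

def brainerd_sum(N, x):
    # Left side of Brainerd's equation: sum over j of (1 - j/N)^(-1)
    return sum(1.0/(1.0 - j/N) for j in range(x))

def estimate_vocabulary(n, x, tol=1e-3):
    # The sum decreases as N increases, so bracket the root and bisect.
    # Assumes the solution is at least x, i.e. the text is not extremely repetitive.
    lo, hi = float(x), 2.0*float(x)
    while brainerd_sum(hi, x) > n:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi)/2
        if brainerd_sum(mid, x) > n:
            lo = mid
        else:
            hi = mid
    return (lo + hi)/2

# Made-up counts: 10,000 words of text containing 8,000 distinct words
print(estimate_vocabulary(10_000, 8_000))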

We can see that the model is qualitatively reasonable. If every word is unique, i.e. x = n, then the solution is N = ∞. If you haven’t seen any repetition, then for all you know the author could keep writing new words indefinitely. As the amount of repetition increases, the estimate of N decreases.

Brainerd’s model is simple, but it tends to underestimate vocabulary. More complicated models might do a better job.

Problems analogous to estimating vocabulary size come up in other applications. For example, an ecologist might want to estimate the number of new species left to be found based on the number of species seen so far. In my work in data privacy I occasionally have to estimate diversity in a population based on diversity in a sample. Both of these examples are analogous to estimating potential new words based on the words you’ve seen.

[1] Brainerd, B. On the relation between types and tokens in literary text, J. Appl. Prob. 9, pp. 507-5

Detecting the language of encrypted text

Imagine you are a code breaker living a century ago. You’ve intercepted a message, and you go through your bag of tricks, starting with the simplest techniques first. Maybe the message has been encrypted using a simple substitution cipher, so you start with that.

Simple substitution ciphers can be broken by frequency analysis: the most common letter probably corresponds to E, the next most common letter probably corresponds to T, etc. But that’s only for English prose. Maybe the message was composed in French. Or maybe it was composed in Japanese, then transliterated into the Latin alphabet so it could be transmitted via Morse code. You’d like to know what language the message was written in before you try identifying letters via their frequency.

William Friedman’s idea was to compute a statistic, what he dubbed the index of coincidence, to infer the probable language of the source. Since this statistic only depends on symbol frequencies, it gives the same value whether computed on clear text or text encrypted with simple substitution. It gives the same value even if the text has also been run through a transposition cipher.

(Classical cryptanalysis techniques, such as computing the index of coincidence, are completely useless against modern cryptography. And yet ideas from classical cryptanalysis are still useful for other applications. Here’s an example that came up in a consulting project recently.)

As I mentioned at the top of the post, you’d try breaking the simplest encryption first. If the index of coincidence is lower than you’d expect for a natural language, you might suspect that the message has been encrypted using polyalphabetic substitution. That is, instead of using one substitution alphabet for every letter, maybe the message has been encrypted using a cycle of n different alphabets, such as the Vigenère cypher.

How would you break such a cipher? First, you’d like to know what n is. How would you do that? By trial and error. Try splitting the text into groups of letters according to their position mod n, then compute the index of coincidence again for each group. If the index statistics are much larger when n = 7, you’re probably looking at a message encrypted with a key of length 7.

The source language would still leave its signature. If the message was encrypted by cycling through seven scrambled alphabets, each of the seven groups would most likely have the statistical distribution of the language used in the clear text.
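
Here is a sketch of that key-length test, using the standard definition of the index of coincidence (the probability that two letters drawn at random from the text are the same); the ciphertext itself is left as a placeholder.

from collections import Counter

def index_of_coincidence(text):
    # Probability that two randomly chosen letters of the text are equal.
    counts = Counter(c for c in text.upper() if c.isalpha())
    N = sum(counts.values())
    if N < 2:
        return 0.0
    return sum(f*(f - 1) for f in counts.values()) / (N*(N - 1))

def key_length_scores(ciphertext, max_n=20):
    # For each candidate key length n, split the text by position mod n
    # and average the index of coincidence over the n groups.
    return {n: sum(index_of_coincidence(ciphertext[i::n]) for i in range(n)) / n
            for n in range(1, max_n + 1)}

# ciphertext = "..."   # intercepted message goes here
# English-like groups score around 0.066; uniformly random letters score near 0.038.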

Friedman’s index of coincidence, published in 1922, was one statistic that could be computed based on letter frequencies, one that worked well in practice, but you could try other statistics, and presumably people did. The index of coincidence is essentially Rényi entropy with parameter α = 2. You might try different values of α.

If the approach above doesn’t work, you might suspect that the text was not encrypted one letter at a time, even using multiple alphabets. Maybe pairs of letters were encrypted, as in the Playfair cipher. You could test this hypothesis by looking at the frequencies of pairs of letters in the encrypted text, calculating an index of coincidence (or some other statistic) based on pairs of letters.

Here again letter pair frequencies may suggest the original language. It might not distinguish Spanish from Portuguese, for example, but it would distinguish Japanese written in Roman letters from English.

Blow up in finite time

A few years ago I wrote a post about approximating the solution to a differential equation even though the solution did not exist. You can ask a numerical method for a solution at a point past where the solution blows up to infinity, and it will dutifully give you a finite answer. The answer is meaningless, but the method returns it anyway.

The more you can know about the solution to a differential equation before you attempt to solve it numerically the better. At a minimum, you’d like to know whether there even is a solution before you compute it. Unfortunately, a lot of theorems along these lines are local in nature: the theorem assures you that a solution exists in some interval, but doesn’t say how big that interval might be.

Here’s a nice theorem from [1] that tells you that a solution is going to blow up in finite time, and it even tells you what that time is.

The initial value problem

y′ = g(y)

with y(0) = y0, where g(y) > 0, blows up at time T if and only if the integral

\int_{y_0}^\infty \frac{1}{g(t)} \, dt
converges to T.

Note that it is not necessary to first find a solution then see whether the solution blows up.

Note also that an upper (or lower) bound on the integral gives you an upper (or lower) bound on T. So the theorem is still useful if the integral is hard to evaluate.
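
For a concrete example (mine, not from the paper), take g(y) = y² with y0 = 1. The exact solution is y = 1/(1 − t), which blows up at t = 1, and the integral of 1/g from 1 to ∞ is also 1, just as the theorem predicts. Here is a small SciPy sketch checking this numerically.

import numpy as np
from scipy.integrate import quad, solve_ivp

def g(y):
    return y**2

y0 = 1.0

# Blow-up time predicted by the theorem: integral of 1/g from y0 to infinity.
T, _ = quad(lambda t: 1/g(t), y0, np.inf)
print("predicted blow-up time:", T)   # 1.0 for this example

# Integrating toward t = T, the numerical solution grows without bound.
sol = solve_ivp(lambda t, y: g(y), (0, 0.999), [y0], rtol=1e-9)
print("y(0.999) =", sol.y[0, -1])     # roughly 1/(1 - 0.999) = 1000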

This theorem applies only to autonomous differential equations, i.e. equations whose right hand side depends only on the solution y and not on the independent variable t. The differential equation alluded to at the top of the post is not autonomous, and so the theorem above does not apply to it. There are non-autonomous extensions of the theorem (see, for example, [2]), but I do not know of one that covers that particular equation.

[1] Duff Campbell and Jared Williams. Exploring finite-time blow-up. Pi Mu Epsilon Journal, Vol. 11, No. 8 (Spring 2003), pp. 423–428

[2] Jacob Hines. Exploring finite-time blow-up of separable differential equations. Pi Mu Epsilon Journal, Vol. 14, No. 9 (Fall 2018), pp. 565–572