Imagine parallel parking is available along the shoulder of a road, but no parking spaces are marked.

The first person to park picks a spot on the shoulder at random. Then another car also chooses a spot along the shoulder at random, with the constraint that the second car can’t overlap the first. This process continues until no more cars can park. How many people can park this way?

Assume all cars have the same length, and we choose our units so that cars have length 1. The expected number of cars that can park is a function *M*(*x*) of the length of the parking strip *x*. Clearly if *x* < 1 then *M*(*x*) = 0. Alfréd Rényi [1] found that for *x* ≥ 1, *M*(*x*) satisfies the equation

*M*(*x*) = 1 + (2/(*x* − 1)) ∫₀^{*x*−1} *M*(*t*) *dt*.

This is an unusual equation, difficult to work with because it defines *M* only implicitly. It’s not even clear that the equation has a solution. But it does, and the ratio of *M*(*x*) to *x* approaches a constant as *x* increases.

The limiting ratio *m* = lim_{*x* → ∞} *M*(*x*)/*x* ≈ 0.7476 is known as **Rényi’s parking constant**.

This says that for a long parking strip, parking sequentially at random will allow about 3/4 as many cars to park as if you were to pack the cars end-to-end.
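
A quick Monte Carlo sketch makes the constant visible (this simulation is my illustration, not part of Rényi’s analysis; the segment-splitting trick just exploits the fact that once a car parks, the two gaps it leaves fill up independently):

```python
import random

def park(length):
    """Number of unit-length cars parked by the random sequential process."""
    count = 0
    segments = [length]
    while segments:
        seg = segments.pop()
        if seg < 1:
            continue                       # no room for another car
        left = random.uniform(0, seg - 1)  # left end of the newly parked car
        count += 1
        segments.extend([left, seg - left - 1])
    return count

random.seed(2024)
x, trials = 200, 50
m_est = sum(park(x) for _ in range(trials)) / (trials * x)  # near 0.7476
```

For a strip of length 200 the estimate typically lands within a percent or so of the constant.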

[1] Steven R. Finch. Mathematical Constants. Cambridge University Press, 2003.

The post Rényi’s parking constant first appeared on John D. Cook.

The word *planet* means “wanderer.” This is because the planets appear to wander in the night sky. Unlike the stars, which appear to move in simple circles, planets occasionally appear to reverse course. When planets appear to move backward this is called retrograde motion.

Here’s what the motion of Venus would look like over a period of 8 years as explored here.

Venus completes 13 orbits around the sun in the time it takes Earth to complete 8 orbits. The ratio isn’t exactly 13 to 8, but it’s very close. Five times over the course of eight years Venus will appear to reverse course for a few days. How many days? We will get to that shortly.

When we speak of the motion of the planets through the night sky, we’re not talking about their rising and setting each day due to the rotation of the earth on its axis. We’re talking about their motion from night to night. The image above is how an observer far above the Earth and not rotating with the Earth would see the position of Venus over the course of eight years.

The orbit of Venus as seen from earth is beautiful but complicated. From the Copernican perspective, the orbits of Earth and Venus are simply concentric circles. You may bristle at my saying planets have circular rather than elliptical orbits [1]. The orbits are not exactly circles, but are so close to circular that you cannot see the difference. For the purposes of this post, we’ll assume planets orbit the sun in circles.

There is a surprisingly simple equation [2] for finding the points where a planet will appear to change course:

cos(*kt*) = √(*rR*) / (*r* + *R* − √(*rR*)).

Here *r* is the radius of Earth’s orbit and *R* is the radius of the other planet’s orbit [3]. The constant *k* is the difference in angular velocities of the two planets. You can solve this equation for the times when the apparent motion changes.

Note that the equation is entirely symmetric in *r* and *R*. So a Venusian observing Earth and an Earthling observing Venus would agree on the times when the apparent motions of the two planets reverse.

Let’s find when Venus enters and leaves retrograde motion. Here are the constants we need.

```python
from numpy import pi, sqrt

r = 1                # AU
R = 0.72332          # AU
venus_year = 224.70  # days
earth_year = 365.24  # days
k = 2*pi/venus_year - 2*pi/earth_year
c = sqrt(r*R) / (r + R - sqrt(r*R))
```

With these constants we can now plot cos(*kt*) and see when it equals *c*.

This shows there are five times over the course of eight years when Venus is in apparent retrograde motion.

If we set time *t* = 0 to be a time when Earth and Venus are aligned, we start in the middle of a retrograde period. Venus enters prograde motion 21 days later, and the next retrograde period begins at day 563. So out of every 584 days, Venus spends 42 days in retrograde motion and 542 days in prograde motion.
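
Those numbers fall straight out of the equation. Here is a sketch of the arithmetic (variable names are mine):

```python
from math import pi, sqrt, acos

r = 1                 # AU
R = 0.72332           # AU
venus_year = 224.70   # days
earth_year = 365.24   # days

k = 2*pi/venus_year - 2*pi/earth_year   # difference in angular velocities
c = sqrt(r*R) / (r + R - sqrt(r*R))

t_exit = acos(c) / k      # days after an alignment when retrograde motion ends
synodic = 2*pi / k        # days between successive Earth-Venus alignments
retro_days = 2 * t_exit   # days of retrograde motion per synodic period
```

Running this gives about 21 days for `t_exit`, 584 days for the synodic period, and 42 retrograde days per cycle.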

[1] Planets do not exactly orbit in circles. They don’t *exactly* orbit in ellipses either. Modeling orbits as ellipses is much more accurate than modeling orbits as circles, but not still not perfectly accurate.

[2] 100 Great Problems of Elementary Mathematics: Their History and Solution. By Heinrich Dörrie. Dover, 1965.

[3] There’s nothing unique about observing planets from Earth. Here “Earth” simply means the planet you’re viewing from.

The post Calculating when a planet will appear to move backwards first appeared on John D. Cook.

The business principle of kaizen, from the Japanese 改善 for improvement, rests on the assumption that incremental improvements accumulate. But quantifying how improvements accumulate takes some care.

Two successive 1% improvements amount to a 2% improvement. But two successive 50% improvements amount to a 125% improvement. So sometimes you can add, and sometimes you cannot. What’s going on?

An *x*% improvement multiplies something by 1 + *x*/100. For example, if you earn 5% interest on a principal of *P* dollars, you now have 1.05 *P* dollars.

So an *x*% improvement followed by a *y*% improvement multiplies by

(1 + *x*/100)(1 + *y*/100) = 1 + (*x* + *y*)/100 + *xy*/10000.

If *x* and *y* are small, then *xy*/10000 is negligible. But if *x* and *y* are large, the product term may not be negligible, depending on context. I go into this further in this post: Small probabilities add, big ones don’t.
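
In code, with a hypothetical helper `combine`, the cross term shows up immediately:

```python
def combine(x, y):
    """Net percent improvement from an x% improvement followed by a y% one."""
    return ((1 + x/100) * (1 + y/100) - 1) * 100

combine(1, 1)    # about 2.01: nearly additive
combine(50, 50)  # 125.0: far from 100
```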

Now let’s look at a variation. Suppose doing one thing by itself brings an *x*% improvement and doing another thing by itself makes a *y*% improvement. How much improvement could you expect from doing both?

For example, suppose you find through A/B testing that changing the font on a page increases conversions by 10%. And you find in a separate A/B test that changing an image on the page increases conversions by 15%. If you change the font and the image, would you expect a 25% increase in conversions?

The issue here is not so much whether it is appropriate to add percentages. Since

1.1 × 1.15 = 1.265

you don’t get a much different answer whether you multiply or add. But maybe you could change the font and the image and conversions increase 12%. Maybe either change alone creates a better impression, but together they don’t make a better impression than doing one of the changes. Or maybe the new font and the new image clash somehow and doing both changes together *lowers* conversions.

The statistical term for what’s going on is interaction effects. A sequence of small improvements creates an additive effect if the improvements are independent. But the effects could be dependent, in which case the whole is less than the sum of the parts. This is typical. Assuming that improvements are independent is often overly optimistic. But sometimes you run into a synergistic effect and the whole is greater than the sum of the parts.

In the example above, we imagine testing the effect of a font change and an image change separately. What if we first changed the font, then with the new font tested the image? That’s better. If there were a clash between the new font and the new image we’d know it.

But we’re missing something here. If we had tested the image first and then tested the new font with the new image, we might have gotten different results. In general, the order of sequential testing matters.

If you have a small number of things to test, you can discover interaction effects by doing a factorial design, either a full factorial design or a fractional factorial design.

If you have a large number of things to test, you’ll have to do some sort of sequential testing. Maybe you do some combination of sequential and factorial testing, guided by which effects you have reason to believe will be approximately independent.

In practice, a testing plan needs to balance simplicity and statistical power. Sequentially testing one option at a time is simple, and may be fine if interaction effects are small. But if interaction effects are large, sequential testing may be leaving money on the table.

If you’d like some help with testing, or with web analytics more generally, we can help.

The post Do incremental improvements add, multiply, or something else? first appeared on John D. Cook.

I wondered whether it also sounds like a sawtooth wave, and indeed it does. More on that shortly.

The Clausen function can be defined in terms of its Fourier series:

Cl_{2}(*x*) = Σ_{*k* = 1}^∞ sin(*kx*)/*k*².

The function commonly known as *the* Clausen function is one of a family of functions, hence the subscript 2. The Clausen functions for all non-negative integers *n* are defined by replacing 2 with *n* on both sides of the defining equation.

The Fourier coefficients decay quadratically, as do those of a triangle wave, as discussed here. This implies the function Cl_{2}(*x*) cannot have a continuous derivative. In fact, the derivative of Cl_{2}(*x*) is infinite at 0. This follows quickly from the integral representation of the function.

The fundamental theorem of calculus shows that the derivative

Cl_{2}′(*x*) = −log |2 sin(*x*/2)|

blows up at 0.

Now suppose we create an audio clip of Cl_{2}(440*x*). This creates a sound with pitch A 440, but rather than a sine wave it has an unpleasant buzzing sound, much like a sawtooth wave.
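
Here is one way to generate such a clip from partial sums of the Fourier series (the sample rate, number of terms, and the 2π·440 angular frequency are my choices; writing the samples out to a WAV file is omitted):

```python
import numpy as np

def clausen2(theta, nterms=2000):
    """Partial sum of the Fourier series Cl_2(theta) = sum_{k>=1} sin(k theta)/k^2."""
    theta = np.asarray(theta, dtype=float)
    total = np.zeros_like(theta)
    for k in range(1, nterms + 1):
        total += np.sin(k * theta) / k**2
    return total

# One second of samples at 44.1 kHz; angular frequency 2*pi*440 gives pitch A 440.
t = np.linspace(0, 1, 44100, endpoint=False)
samples = clausen2(2 * np.pi * 440 * t, nterms=200)
```

A handy check on the partial sums: Cl_{2}(π/2) equals Catalan’s constant, 0.9159655941….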

The harshness of the sound is due to the slow decay of the Fourier coefficients; the Fourier coefficients of more pleasant musical sounds decay much faster than quadratically.

You’re doodling during a long meeting. You draw a circle, then draw a triangle around it. Next you draw a circle through the vertices of the triangle, and draw a square outside that.

Then you draw a circle through the vertices of the square, and draw a pentagon outside that.

Then you think “Will this ever stop?!”, meaning the meeting, but you could ask a similar question about your doodling: does your sequence of doodles converge to a circle of finite radius, or do they grow without bound?

An *n*-gon circumscribed on the outside of a circle of radius *r* is inscribed in a circle of radius

*r* sec(π/*n*).

So if you start with a unit circle, the radius of the circle through the vertices of the *N*-gon is

sec(π/3) sec(π/4) ··· sec(π/*N*),

and the limit as *N* → ∞ exists. The limit is known as the **polygon circumscribing constant**, and it equals 8.7000366252….
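
The product converges quickly enough to evaluate numerically. A sketch (the truncation point `N` is my choice; the tail beyond *N* contributes roughly π²/(2*N*) to the logarithm):

```python
import numpy as np

# Partial product of sec(pi/n) for n = 3 .. N, computed as a sum of logs.
# With N = 10^6 the truncated product is good to about five digits.
N = 10**6
n = np.arange(3, N + 1)
K = np.exp(-np.log(np.cos(np.pi / n)).sum())
```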

We can visualize the limit by making a plot with large *N*. The plot is less cluttered if we leave out the circles and just show the polygons. *N* = 30 in the plot below.

The reciprocal of the polygon circumscribing constant is known as the **Kepler-Bouwkamp constant**. The Kepler-Bouwkamp constant is the limiting radius if you were to reverse the process above, *inscribing* polygons at each step rather than *circumscribing* them. It would make sense to call the Kepler-Bouwkamp constant the polygon *inscribing* constant, but for historical reasons it is named after Johannes Kepler and Christoffel Bouwkamp.

The specification for the NPI (National Provider Identifier) number format says that the first digit must be either 1 or 2. Currently every NPI in the database starts with 1. There are about 8.4 million NPIs currently, so it’ll be a while before they’ll need to roll the first digit over to 2.

The last digit of the NPI is a check sum. The check sum uses the Luhn algorithm, the same check sum used for credit cards and other kinds of identifiers. The Luhn algorithm was developed in 1954 and was designed to be easy to implement by hand. It’s kind of a quirky algorithm, but it will catch all single-digit errors and nearly all transposition errors.

The Luhn algorithm is not applied to the NPI itself but to the first nine digits of the NPI with 80840 prepended.

For example, let’s look at 1993999998. This is not (currently) anyone’s NPI, but it has a valid NPI format because the Luhn checksum of 80840199399999 is 8. We will verify this with the code below.

The following code computes the Luhn checksum.

```python
def checksum(payload):
    digits = [int(c) for c in reversed(str(payload))]
    s = 0
    for i, d in enumerate(digits):
        if i % 2 == 0:
            t = 2*d
            if t > 9:
                t -= 9
            s += t
        else:
            s += d
    return (s*9) % 10
```

And the following checks whether the last digit of a number is the checksum of the previous digits.

```python
def verify(fullnumber):
    payload = fullnumber // 10
    return checksum(payload) == fullnumber % 10
```

And finally, the following validates an NPI number.

```python
def verify_npi(npi):
    return verify(int("80840" + str(npi)))
```

Here we apply the code above to the hypothetical NPI number mentioned above.

```python
assert checksum(80840199399999) == 8
assert verify(808401993999998)
assert verify_npi(1993999998)
```

How can you possibly solve a mission-critical problem with millions of variables—when the worst-case computational complexity of *every known algorithm* for that problem is exponential in the number of variables?

SAT (Satisfiability) solvers have seen dramatic orders-of-magnitude performance gains for many problems through algorithmic improvements over the last couple of decades or so. The SAT problem—finding an assignment of Boolean variables that makes a given Boolean expression true—represents the archetypal NP-complete problem and in the general case is intractable.
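
To make the problem statement concrete, here is a tiny brute-force satisfiability checker (a toy sketch of the definition; the function and the example formula are mine, and this is nothing like the clever search inside modern solvers):

```python
from itertools import product

def brute_force_sat(formula, nvars):
    """Try all 2^nvars assignments; return a satisfying one, or None."""
    for assignment in product([False, True], repeat=nvars):
        if formula(assignment):
            return assignment
    return None

# (x or y) and (not x or not y): satisfied when exactly one of x, y is true.
f = lambda a: (a[0] or a[1]) and (not a[0] or not a[1])
sat_assignment = brute_force_sat(f, 2)  # (False, True)
```

The 2ⁿ loop is exactly the exponential worst case that modern solvers usually manage to sidestep on structured instances.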

However, for many practical problems, solutions can be found very efficiently by use of modern methods. This “killer app” of computer science, as described by Donald Knuth, has applications to many areas, including software verification, electronic design automation, artificial intelligence, bioinformatics, and planning and scheduling.

Its uses are surprising and diverse, from running billion-dollar auctions to solving graph coloring problems to computing solutions to Sudoku puzzles. As an example, I’ve included a toy code below that uses SMT, a relative of SAT, to find the English language suffix rule for regular past tense verbs (“-ed”) from data.

When used as a machine learning method, SAT solvers are quite different from other methods such as neural networks. SAT solvers can for some problems have long or unpredictable runtimes (though MAXSAT can sometimes relax this restriction), whereas neural networks have essentially fixed inference cost (though looping agent-based models do not).

On the other hand, answers from SAT solvers are always guaranteed correct, and the process is interpretable; this is currently not so for neural network-based large language models.

To understand better how to think about this difference in method capabilities, we can take a lesson from the computational science community. There, it is common to have a well-stocked computational toolbox of both slow, accurate methods and fast, approximate methods.

In computational chemistry, ab initio methods can give highly accurate results by solving Schrödinger’s equation directly, but they only scale to limited numbers of atoms. Molecular dynamics (MD), by contrast, relies more on approximations but scales efficiently to many more atoms. Both are useful in different contexts. In fact, the two methodologies can cross-pollinate, for example when ab initio calculations are used to devise force fields for MD simulations.

A lesson to take from this is, it is paramount to find the best tool for the given problem, using any and all means at one’s disposal.

The following are some of my favorite general references on SAT solvers:

- Donald Knuth, The Art Of Computer Programming: Volume 4 Fascicle 6, Satisfiability. A coherent introduction to the methods, accessible to the mathematically inclined.
- Marijn J.H. Heule, Advanced Topics in Logic: Automated Reasoning and Satisfiability. Slides from an advanced class by one of the world experts.
- A. Biere, H. van Maaren, M. Heule, eds, Handbook of Satisfiability. An in-depth tome on state of the art, arranged topically.
- Daniel Kroening and Ofer Strichman, Decision Procedures: An Algorithmic Point of View. An accessible book on an important class of extensions to SAT: satisfiability modulo theories (SMT), which make SAT-type methods applicable to a much wider range of problems.
- Documentation for Z3, an innovative open-source solver for solving SAT, SMT, MAXSAT and combinatorial optimization problems.
- Simons Institute workshops here, here, here, here, and here covering a great range of topics and state of the art research.

It would seem that unless P = NP, commonly suspected to be false, the solution of these kinds of problems for any possible input is hopelessly beyond reach of even the world’s fastest computers. Thankfully, many of the problems we care about have an internal structure that makes them much more solvable (and likewise for neural networks). Continued improvement of SAT/SMT methods, in theory and implementation, will greatly benefit the effective solution of these problems.

```python
import csv
import z3

def char2int(c):
    return ord(c) - ord('a')

def int2char(i):
    return chr(i + ord('a'))

# Access the language data from the file.
with open('eng_cols.txt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    table = [row for row in reader]

nrow, ncol = len(table), len(table[0])

# Identify which columns of input table have stem and targeted word form.
stem_col, form_col = 0, 1

# Calculate word lengths.
nstem = [len(table[i][stem_col]) for i in range(nrow)]
nform = [len(table[i][form_col]) for i in range(nrow)]

# Length of suffix being sought.
ns = 2

# Initialize optimizer.
solver = z3.Optimize()

# Define variables to identify the characters of suffix; add constraints.
var_suf = [z3.Int(f'var_suf_{i}') for i in range(ns)]
for i in range(ns):
    solver.add(z3.And(var_suf[i] >= 0, var_suf[i] < 26))

# Define variables to indicate whether the given word matches the rule.
var_m = [z3.Bool(f'var_m_{i}') for i in range(nrow)]

# Loop over words.
for i in range(nrow):
    # Constraint on number of characters.
    constraints = [nform[i] == nstem[i] + ns]
    # Constraint that the form contains the stem.
    for j in range(nstem[i]):
        constraints.append(
            table[i][stem_col][j] == table[i][form_col][j]
            if j < nform[i] else False)
    # Constraint that the end of the word form matches the suffix.
    for j in range(ns):
        constraints.append(
            char2int(table[i][form_col][nform[i]-1-j]) == var_suf[j]
            if j < nform[i] else False)
    # var_m[i] is the "and" of all these constraints.
    solver.add(var_m[i] == z3.And(constraints))

# Seek suffix that maximizes number of matches.
count = z3.Sum([z3.If(var_m[i], 1, 0) for i in range(nrow)])
solver.maximize(count)

# Run solver, output results.
if solver.check() == z3.sat:
    model = solver.model()
    suf = [model[var_suf[i]] for i in range(ns)]
    print('Suffix identified: ' +
          ''.join([int2char(suf[i].as_long()) for i in range(ns)])[::-1])
    print('Number of matches: ' + str(model.evaluate(count)) +
          ' out of ' + str(nrow) + '.')
    var_m_values = [model[var_m[i]] for i in range(nrow)]
    print('Matches:')
    for i in range(nrow):
        if var_m_values[i]:
            print(table[i][stem_col], table[i][form_col])
```

The post Getting some (algorithmic) SAT-isfaction first appeared on John D. Cook.

The second step is always the same, applying Lagrange interpolation with enough points to get the accuracy you need. But the first step, range reduction, depends on the function being evaluated. And as the previous post concluded, evaluating more advanced functions generally requires more advanced range reduction.

For the gamma function, the identity

Γ(*x* + 1) = *x* Γ(*x*)

can be used to reduce the problem of computing Γ(*x*) for any real *x* to the problem of computing Γ(*x*) for *x* in the interval [1, 2]. If *x* is large, the identity will have to be applied many times and so this would be a lot of work. However, the larger *x* is, the more accurate Stirling’s approximation becomes.
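
A sketch of this range reduction using the identity above (the helper name is mine, and `math.gamma` merely stands in for whatever method is used on [1, 2]):

```python
from math import gamma, isclose

def gamma_reduced(x):
    """Write gamma(x) for x >= 1 as factor * gamma(xr) with xr in [1, 2)."""
    factor = 1.0
    while x >= 2:
        x -= 1
        factor *= x   # gamma(x + 1) = x * gamma(x), applied in reverse
    return factor, x
```

Each pass through the loop applies the identity once, which is why large arguments take many steps.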

Computing Γ(*x* + *iy*) is more complex, pardon the pun. We can still use the identity above to reduce the *real* part *x* of the argument to lie in the interval [1, 2], but what about the *imaginary* part *y*?

Abramowitz and Stegun, Table 6.7, tabulates the principal branch of log Γ(*x* + *iy*) for *x* from 1 to 2 and for *y* from 0 to 10, both in increments of 0.1. Generally the logarithm of the gamma function is more useful in computation than the gamma function itself. It is also easier to interpolate, which I imagine is the main reason A&S tabulate it rather than the gamma function *per se*. A note below Table 6.7 says that linear interpolation gives about 3 significant figures, and eight-point interpolation gives about 8 figures.

By the Schwarz reflection principle, Γ(*x* − *iy*) is the complex conjugate of Γ(*x* + *iy*),
and with this we can extend our range on *y* to [−10, 10].

What about larger *y*? We have two options: the duplication formula and Stirling’s approximation.

The duplication formula

Γ(2*z*) = 2^{2*z* − 1} Γ(*z*) Γ(*z* + ½) / √π

lets us compute Γ(2*z*) if we can compute Γ(*z*) and Γ(*z* + ½).

Stirling’s approximation for Γ(*z*) is accurate when |*z*| is large, and |*x* + *iy*| is large when |*y*| is large.

For example, the Handbook of Mathematical Functions edited by Abramowitz and Stegun tabulates sines and cosines in increments of one tenth of a degree, from 0 degrees to 45 degrees. What if your angle was outside the range 0° to 45° or if you needed to specify your angle more precisely than 1/10 of a degree? What if you wanted, for example, to calculate cos 203.147°?

The high-level answer is that you would use range reduction and interpolation. You’d first use range reduction to reduce the problem of working with any angle to the problem of working with an angle between 0° and 45°, then you’d use interpolation to get the necessary accuracy for a value within this range.

OK, but how exactly do you do the range reduction and how exactly do you do the interpolation? This isn’t deep, but it’s not trivial either.

Since sine and cosine have a period of 360°, you can add or subtract some multiple of 360° to obtain an angle between −180° and 180°.

Next, you can use parity to reduce the range further. That is, since sin(−*x*) = −sin(*x*) and cos(−*x*) = cos(*x*) you can reduce the problem to computing the sine or cosine of an angle between 0 and 180°.

The identities sin(180° − *x*) = sin(*x*) and cos(180° −*x*) = −cos(*x*) let you reduce the range further to between 0 and 90°.

Finally, the identities cos(*x*) = sin(90° − *x*) and sin(*x*) = cos(90° − *x*) can reduce the range to 0° to 45°.
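
The four reduction steps above can be coded directly. A sketch (the function name and the `(sign, function, angle)` return convention are mine):

```python
from math import cos, sin, radians

def reduce_cos(deg):
    """Reduce cos(deg degrees) to +/- sin or cos of an angle in [0, 45] degrees."""
    deg = deg % 360          # periodicity: reduce to [0, 360)
    if deg > 180:
        deg = 360 - deg      # cos(-x) = cos(x)
    sign = 1
    if deg > 90:
        deg = 180 - deg      # cos(180 - x) = -cos(x)
        sign = -1
    if deg > 45:
        return sign, 'sin', 90 - deg   # cos(x) = sin(90 - x)
    return sign, 'cos', deg

s, fn, angle = reduce_cos(203.147)   # cos 203.147 deg = -cos 23.147 deg
```

With the reduced angle in hand, you would then interpolate in the 0°–45° table.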

You can fill in between the tabulated angles using interpolation, but how accurate will your result be? How many interpolation points will you need to use in order to get single precision, e.g. an error on the order of 10^{−7}?

The tables tell you. As explained in this post on using a table of logarithms, the tables have a notation at the bottom of the table that tells you how many Lagrange interpolation points to use and what kind of accuracy you’ll get. Five interpolation points will give you roughly single precision accuracy, and the notation gives you a little more accurate error bound. The post on using log tables also explains how Lagrange interpolation works.

I intend to write more posts on using tables. The general pattern is always range reduction and interpolation, but it takes more advanced math to reduce the range of more advanced functions.

**Update**: The next post shows how to use tables to compute the gamma function for complex arguments.

Kepler noted in 1609 that you could approximate the perimeter of an ellipse as the perimeter of a circle whose radius is the mean of the semi-axes of the ellipse, where the mean could be either the arithmetic mean or the geometric mean. The previous post showed that the arithmetic mean is more accurate, and that it under-estimates the perimeter. This post will explain both of these facts.

There are several series for calculating the perimeter of an ellipse. In 1798 James Ivory came up with a series that converges more rapidly than previously discovered series. Ivory’s series is

*P* = π(*a* + *b*) (1 + Σ_{*n* = 1}^∞ ((2*n* − 3)!!/(2*n*)!!)² *h*^{*n*})

where

*h* = ((*a* − *b*)/(*a* + *b*))².

If you’re not familiar with the !! notation, see this post on multifactorials.

The 0th order approximation using Ivory’s series, dropping all the infinite series terms, corresponds to Kepler’s approximation using the arithmetic mean of the semi-axes *a* and *b*. By convention the semi-major axis is labeled *a* and the semi-minor axis *b*, but the distinction is unnecessary here since Ivory’s series is symmetric in *a* and *b*.

Note that *h* ≥ 0 and *h* = 0 only if the ellipse is a circle. So the terms in the series are positive, which explains why Kepler’s approximation under-estimates the perimeter.

Using just one term in Ivory’s series gives a very good approximation

*P* ≈ π(*a* + *b*)(1 + *h*/4).

The approximation error increases as the ratio of *a* to *b* increases, but even for a highly eccentric ellipse like the orbit of the Mars Orbiter Mission, the ratio of *a* to *b* is only 2, and the relative approximation error is about 1/500, about 12 times more accurate than Kepler’s approximation.
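
Ivory’s series is easy to evaluate numerically because the ratio of consecutive terms has a simple closed form. A sketch (the function name is mine):

```python
from math import pi

def ivory(a, b, nterms=10):
    """Partial sum of Ivory's series for the perimeter of an ellipse."""
    h = ((a - b) / (a + b))**2
    total, term = 1.0, 1.0
    for n in range(1, nterms + 1):
        term *= ((2*n - 3) / (2*n))**2 * h   # ratio of consecutive terms
        total += term
    return pi * (a + b) * total
```

With `nterms = 0` this is Kepler’s arithmetic-mean approximation, and with `nterms = 1` it is the one-term approximation above.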

*P* ≈ 2π(ab)^{½}

or

*P* ≈ π(*a* + *b*).

In other words, you can approximate the perimeter of an ellipse by the circumference of a circle of radius *r* where *r* is either the geometric mean or arithmetic mean of the semi-major and semi-minor axes.

How good are these approximations, particularly when *a* and *b* are roughly equal? Which one is better?

We can choose our unit of measurement so that the semi-minor axis *b* equals 1, then plot the error in the two approximations as *a* increases.

We see from this plot that both approximations give lower bounds, and that the arithmetic mean is more accurate than the geometric mean.

Incidentally, if we used the geometric mean of the semi-axes as the radius of a circle when approximating the *area* then the results would be exactly correct. But for perimeter, the arithmetic mean is better.

Next, if we just consider ellipses in which the semi-major axis is no more than twice as long as the semi-minor axis, the arithmetic approximation is within about 3% of the exact value and the geometric approximation is within about 9%. Both approximations are good when *a* ≈ *b*.
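
Here is a quick numerical check of the two approximations (the quadrature routine is my stand-in for an exact elliptic integral evaluation):

```python
from math import pi, sin, cos, sqrt

def perimeter(a, b, n=10000):
    """Ellipse perimeter by midpoint-rule quadrature of the arc length integral."""
    h = (pi / 2) / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += sqrt((a * sin(t))**2 + (b * cos(t))**2)
    return 4 * total * h

a, b = 2, 1
exact = perimeter(a, b)
arith = pi * (a + b)           # arithmetic-mean approximation
geom = 2 * pi * sqrt(a * b)    # geometric-mean approximation
```

Both approximations come in below the exact perimeter, with the arithmetic mean the closer of the two.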

The next post goes into more mathematical detail, explaining why Kepler’s approximation behaves as it does and giving ways to improve on it.

One way to capture the structure of a graph is its adjacency matrix *A*. Each element *a*_{ij} of this matrix equals 1 if there is an edge between the *i*th and *j*th node and a 0 otherwise. If you square the matrix *A*, the (*i*, *j*) entry of the result tells you how many paths of length 2 are between the *i*th and *j*th nodes. In general, the (*i,* *j*) entry of *A*^{n} tells you how many paths of length *n* there are between the corresponding nodes.

Calculating eigenvector centrality requires finding an eigenvector associated with the largest eigenvalue of *A*. One way to find such an eigenvector is the power method. You start with a random initial vector and repeatedly multiply it by *A*. This produces a sequence of vectors that converges to the eigenvector we’re after.

Conceptually this is the same as computing *A*^{n} first and multiplying it by the random initial vector. So not only does the matrix *A*^{n} count paths of length *n*, for large *n* it helps us compute eigenvector centrality.

Now for a little fine print. The power method will converge for any starting vector that has some component in the direction of the eigenvector you’re trying to find. This is almost certainly the case if you start with a vector chosen at random. The power method also requires that the matrix have a single eigenvalue of largest magnitude and that its associated eigenspace have dimension 1. The post on eigenvector centrality stated that these conditions hold, provided the network is connected.

In principle, you could calculate eigenvector centrality by computing *A*^{n} for some large *n*. In practice, you’d never do that. For a square matrix of size *N*, a matrix-vector product takes *O*(*N*²) operations, whereas squaring *A* requires *O*(*N*³) operations. So you would repeatedly apply *A* to a vector rather than computing powers of *A*.
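
A sketch of the power method on a small example (the graph and random seed are arbitrary choices for illustration; normalizing at each step keeps the iterates from overflowing):

```python
import numpy as np

def power_method(A, iters=500):
    """Dominant eigenvector by repeatedly applying A and normalizing."""
    rng = np.random.default_rng(42)
    v = rng.random(A.shape[0])   # random start: some component along the eigenvector
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
# (The triangle makes the graph non-bipartite, so the method converges.)
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
v = power_method(A)   # node 0 comes out most central
```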

Also, you wouldn’t use the power method unless *A* is sparse, making it relatively efficient to multiply by *A*. If most of the entries of *A* are zeros, and there is an exploitable pattern to where the non-zero elements are located, you can multiply *A* by a vector with far less than *N*² operations.

Anyone with more than casual experience with ChatGPT knows that prompt engineering is a thing. Minor or even trivial changes in a chatbot prompt can have significant effects, sometimes even dramatic ones, on the output [1]. For simple requests it may not make much difference, but for detailed requests it could matter a lot.

Industry leaders said they thought this would be a temporary limitation. But we are now a year and a half into the GPT-4 era, and it’s still a problem. And since the number of possible prompts scales exponentially with the prompt length, it can be hard to find a good prompt for a given task.

One proposed solution is to use search procedures to automate the prompt optimization / prompt refinement process. Given a base large language model (LLM) and an input (a prompt specification, commonly with a set of prompt/answer pair samples for training), a search algorithm seeks the best form of a prompt to use to elicit the desired answer from the LLM.

This approach is sometimes touted [2] as a possible solution to the problem. However, it is not without limitations.

A main one is cost. With this approach, one search for a good prompt can take many, many trial-and-error invocations of the LLM, with the cost measured in dollars, compared to a fraction of a cent for a single token of a single prompt. I know of one report of someone who does LLM prompting with such a tool full time for his job, at a cost of about $1,000/month (though, for certain kinds of task, one might alternatively seek a good prompt “template” and reuse that across many near-identical queries, to save costs).

This being said, it would seem that for now (depending on budget) our best option for difficult prompting problems is to use search-based prompt refinement methods. Various new tools have come out recently (for example, [3-6]). The following is a report on some of my (very preliminary) experiences with a couple of these tools.

The first is PromptAgent [5]. It’s a research code available on GitHub. The method is based on Monte Carlo tree search (MCTS), which tries out multiple chains of modification of a seed prompt and pursues the most promising. MCTS can be a powerful method, being part of the AlphaGo breakthrough result in 2016.

I ran one of the PromptAgent test problems using GPT-4/GPT-3.5 and interrupted it after it rang up a couple of dollars in charges. Looking at the logs, I was somewhat amazed that it generated long detailed prompts that included instructions to the model for what to pay close attention to, what to look out for, and what mistakes to avoid—presumably based on inspecting previous trial prompts generated by the code.

Unfortunately, PromptAgent is a research code and not fully productized, so it would take some work to adapt to a specific user problem.

DSPy [6] on the other hand is a finished product available for general users. DSPy is getting some attention lately not only as a prompt optimizer but also more generally as a tool for orchestrating multiple LLMs as agents. There is not much by way of simple examples for how to use the code. The website does have an AI chatbot that can generate sample code, but the code it generated for me required significant work to get it to behave properly.

I ran with the MIPRO optimizer, which is the most well-suited to prompt optimization. My experience with running the code was that it generated many random prompt variations but did not do in-depth prompt modifications like PromptAgent. PromptAgent does one thing, prompt refinement, and does it well, unlike DSPy, which has multiple uses. DSPy would be well-served to implement more powerful prompt refinement algorithms.

I would wholeheartedly agree that it doesn’t seem right that an LLM should be so dependent on the wording of a prompt. Hopefully, future LLMs, with training on more data and other improvements, will do a better job without the need for such lengthy trial-and-error processes.

[1] “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” https://openreview.net/forum?id=RIu5lyNXjT

[2] “AI Prompt Engineering Is Dead,” https://spectrum.ieee.org/prompt-engineering-is-dead (March 6, 2024).

[3] “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing,” https://openreview.net/forum?id=OXv0zQ1umU

[4] “Large Language Models as Optimizers,” https://openreview.net/forum?id=Bb4VGOWELI

[5] “PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization,” https://openreview.net/forum?id=22pyNMuIoa

[6] “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” https://openreview.net/forum?id=sY5N0zY5Od

The post The search for the perfect prompt first appeared on John D. Cook.

Let’s look at a few motivating examples before we get into the details.

Suppose you want to advertise something, say a book you’ve written, and you’re hoping people will promote it on Twitter. Would you rather get a shout out from someone with more followers or someone with fewer followers? All else being equal, more followers is better. But even better would be a shout out from someone whose followers have a lot of followers.

Suppose you’re at a graduation ceremony. Your mind starts to wander after the first few hundred people walk across the stage, and you start to think about how a cold might spread through the crowd. The dean handing out diplomas could be a superspreader because he’s shaking hands with everyone as they receive their diplomas. But an inconspicuous parent in the audience may also be a superspreader because she’s a flight attendant and will be in a different city every day for the next few days. And not only is she a traveler, she’s in contact with travelers.

Ranking web pages according to the number of inbound links they have was a good idea because this takes advantage of revealed preferences: instead of asking people what pages they recommend, you can observe what pages they implicitly recommend by linking to them. An even better idea was Page Rank, weighing inbound links by how many links the linking pages have.

The idea of eigenvector centrality is to give each node a rank proportional to the sum of the ranks of the adjacent nodes. This may seem circular, and it kinda is.

To know the rank of a node, you have to know the ranks of the nodes linking to it. But to know their ranks, you have to know the ranks of the nodes linking to them, etc. There is no **sequential** solution to the problem because you’d end up in an infinite regress, even for a finite network. But there is a **simultaneous** solution, considering all pages at once.

You want to assign to the *i*th node a rank *x*_{i} proportional to the sum of the ranks of all adjacent nodes. A more convenient way to express this is to compute the weighted sum of the ranks of *all* nodes, with weight 1 for adjacent nodes and weight 0 for non-adjacent nodes. That is, you want to compute Σ_{j} *a*_{ij} *x*_{j},

where the *a*‘s are the components of the adjacency matrix *A*. Here *a*_{ij} equals 1 if there is an edge between nodes *i* and *j* and it equals 0 otherwise.

This is equivalent to looking for solutions to the system of equations *Ax* = λ*x*,

where *A* is the adjacency matrix and *x* is the vector of node ranks. If there are *N* nodes, then *A* is an *N* × *N* matrix and *x* is a column vector of length *N*.

In linear algebra terminology, *x* is an eigenvector of the adjacency matrix *A*, hence the name eigenvector centrality.

There are several mathematical details to be concerned about. Does the eigenvalue problem defining *x* have a solution? Is the solution unique (up to scalar multiples)? What about the eigenvalue λ? Does the adjacency matrix have multiple eigenvalues, which would mean there are multiple eigenvectors?

If *A* is the adjacency matrix for a connected graph, then there is a unique eigenvalue λ of largest magnitude, there is a unique corresponding eigenvector *x*, and all the components of *x* are non-negative. It is often said that this is a result of the Perron–Frobenius theorem, but there’s a little more to it than that because you need the hypothesis that the graph is connected.

The matrix *A* is non-negative, but not necessarily positive: some entries may be zero. But if the graph is connected, there’s a path between any node *i* and any other node *j*, which means that for some power of *A*, the *ij* entry of that power is not zero. So although *A* is not necessarily positive, some power of *A* is positive.

This section has discussed graphs, but Page Rank applies to *directed* graphs because links have a direction. But similar theorems hold for directed graphs.
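To make the idea concrete, here is a short Python sketch (the small example graph and the variable names are my own, not from the post) that computes eigenvector centrality directly from an adjacency matrix:

```python
import numpy as np

# Adjacency matrix of a small connected, undirected example graph:
# node 0 is linked to everyone; nodes 1 and 2 are also linked to each other.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

# The rank vector is the eigenvector for the largest eigenvalue of A.
# For a non-negative matrix the largest eigenvalue is the spectral radius.
vals, vecs = np.linalg.eigh(A)
x = np.abs(vecs[:, np.argmax(vals)])   # Perron-Frobenius: components share a sign
x /= x.sum()                           # normalize the ranks to sum to 1

print(np.round(x, 3))                  # node 0, the best connected, ranks highest
```

For a large sparse network you’d use an iterative method rather than a dense eigendecomposition, but the dense version keeps the idea visible.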

If you were to do a linear regression on the data, you’d get a relation

lumens = *a* × watts + *b*

where the intercept term *b* is not zero. But this doesn’t make sense: a light bulb that is turned off doesn’t produce light, and it certainly doesn’t produce negative light. [1]

You may be able to fit the regression and ignore *b*; it’s probably small. But what if you wanted to *require* that *b* = 0? Some regression software will allow you to specify zero intercept, and some will not. But it’s easy enough to compute the slope *a* without using any regression software.

Let **x** be the vector of input data, the wattage of the LED bulbs. And let **y** be the corresponding light output in lumens. The regression line uses the slope *a* that minimizes

‖*a* **x** − **y**‖² = *a*² **x** · **x** − 2*a* **x** · **y** + **y** · **y**.

Setting the derivative with respect to *a* to zero shows

*a* = **x** · **y** / **x** · **x**
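Here’s a minimal sketch of that calculation in Python. The wattage and lumen values below are made up for illustration; only the formula *a* = **x** · **y** / **x** · **x** comes from the text.

```python
import numpy as np

# Hypothetical LED bulb data: wattages and measured lumens (illustrative only).
x = np.array([5.0, 8.5, 9.0, 13.0, 17.0])
y = np.array([480.0, 830.0, 900.0, 1300.0, 1700.0])

# Zero-intercept least squares slope: a = x.y / x.x
a = np.dot(x, y) / np.dot(x, x)
print(a)   # close to 100 lumens per watt for this made-up data
```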

Now there’s more to regression than just line fitting. A proper regression analysis would look at residuals, confidence intervals, etc. But the calculation above was good enough to conclude that LED lights put out about 100 lumens per watt.

It’s interesting that making the model more realistic, i.e. requiring *b* = 0, is either a complication or a simplification, depending on your perspective. It complicates using software, but it simplifies the math.

- Best line to fit three points
- Logistic regression quick takes
- Linear regression and post quantum cryptography

[1] The orange line in the image above is the least squares fit for the model *y* = *ax*, but it’s not quite the same line you’d get if you fit the model *y* = *ax* + *b*.

The answer to the first question is that I wrote about the local maxima of the sinc function three years ago. That post shows that the derivative of the sinc function sin(*x*)/*x* is zero if and only if *x* is a fixed point of the tangent function.

As for why that should be connected to the zeros of a Bessel function, that one’s pretty easy. In general, Bessel functions cannot be expressed in terms of elementary functions. But the Bessel functions whose order is an integer plus ½ can.

For integer *n*,

So when *n* = 1, we’ve got the derivative of sinc right there in the definition.
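For reference, the identity alluded to above can be written as follows. This is my reconstruction from the standard Rayleigh formula for half-integer-order Bessel functions, consistent with the *n* = 1 remark:

```latex
J_{n + \frac12}(x) = (-1)^n \sqrt{\frac{2}{\pi}}\; x^{n + \frac12}
\left( \frac{1}{x}\,\frac{d}{dx} \right)^{\!n} \frac{\sin x}{x}
```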

Not only can you bootstrap tables to calculate logarithms of real numbers not given in the tables, you can also bootstrap a table of logarithms and a table of arctangents to calculate logarithms of complex numbers.

One of the examples in Abramowitz and Stegun (Example 7, page 90) is to compute log(2 + 3*i*). How could you do that with tables? Or with a programming language that doesn’t support complex numbers?

Now we have to be a little careful about what we mean by the logarithm of a complex number.

In the context of real numbers, the logarithm of a real number *x* is the real number *y* such that *e*^{y} = *x*. This equation has a unique solution if *x* is positive and no solution otherwise.

In the context of complex numbers, **a** logarithm of the complex number *z* is any complex number *w* such that *e*^{w} = *z*. This equation has no solution if *z* = 0, and it has infinitely many solutions otherwise: for any solution *w*, *w* + 2*n*π*i* is also a solution for all integers *n*.

If you write the complex number *z* in polar form

*z* = *r* *e*^{iθ}

then

log(*z*) = log(*r*) + *i*θ.

The proof is immediate:

*e*^{log(r) + iθ} = *e*^{log(r)} *e*^{iθ} = *r* *e*^{iθ}.

So computing the logarithm of a complex number boils down to computing its magnitude *r* and its argument θ.

The equation defining a logarithm has a unique solution if we make a branch cut along the negative real axis and restrict θ to be in the range −π < θ ≤ π. This is called the **principal branch** of log, sometimes written Log. As far as I know, every programming language that supports complex logarithms uses the principal branch implicitly. For example, in Python (NumPy), `log(x)` computes the principal branch of the log function.

Going back to the example mentioned above,

log(2 + 3*i*) = log( √(2² + 3²) ) + arctan(3/2) *i* = ½ log(13) + arctan(3/2) *i*.

This could easily be computed by looking up the logarithm of 13 and the arc tangent of 3/2.

The exercise in A&S actually asks the reader to calculate log(±2 ± 3*i*). The reason for the variety of signs is to require the reader to pick the value of θ that lies in the range −π < θ ≤ π. For example,

log(−2 + 3*i*) = ½ log(13) + (π − arctan(3/2)) *i*.
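A quick check of these values in Python, using only real logarithms and arc tangents (plus `atan2` to sort out the quadrant). This is my own verification, not part of A&S:

```python
from math import log, atan, atan2, pi
import cmath

def log_complex(x, y):
    """Log of x + iy using only real logs and arc tangents, as with tables."""
    return 0.5*log(x*x + y*y), atan2(y, x)   # (log magnitude, principal argument)

re1, im1 = log_complex(2, 3)        # log(2 + 3i)
print(re1, im1)
print(cmath.log(2 + 3j))            # cross-check against the built-in

re2, im2 = log_complex(-2, 3)       # log(-2 + 3i): argument is pi - arctan(3/2)
print(im2, pi - atan(3/2))
```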

Before calculators were common, function values would be looked up in a table. For example, here is a piece of a table of logarithms from Abramowitz and Stegun, affectionately known as A&S.

But you wouldn’t just “look up” logarithm values. If you needed to know the value of a logarithm at a point where it is explicitly tabulated, then yes, you’d simply look it up. If you wanted to know the log of 1.754, then there it is in the table. But what if, for example, you wanted to know the log of 1.7543?

Notice that function values are given to 15 significant figures but input values are only given to four significant figures. If you wanted 15 sig figs in your output, presumably you’d want to specify your input to 15 sig figs as well. Or maybe you only needed 10 figures of precision, in which case you could ignore the rightmost column of decimal places in the table, but you still can’t directly specify input values to 10 figures.

If you go to the bottom of the column of A&S in the image above, you see this:

What’s the meaning of the mysterious square bracket expression? It’s telling you that for the input values in the range of this column, i.e. between 1.750 and 1.800, the error using linear interpolation will be less than 4 × 10^{−8}, and that if you want full precision, i.e. 15 sig figs, then you’ll need to use Lagrange interpolation with 5 points.

So going back to the example of wanting to know the value of log(1.7543), we could calculate it using

0.7 × log(1.754) + 0.3 × log(1.755)

and expect the error to be less than 4 × 10^{−8}.

We can confirm this with a little Python code.

```python
>>> from math import log
>>> exact = log(1.7543)
>>> approx = 0.7*log(1.754) + 0.3*log(1.755)
>>> exact - approx
3.411265947494968e-08
```

Python uses double precision arithmetic, which is accurate to between 15 and 16 figures—more on that here—and so the function calls above are essentially the same as the tabulated values.

Now suppose we want the value of *x* = 1.75430123456789. The hint in square brackets says we should use Lagrange interpolation at five points, centered at the nearest tabulated value to *x*. That is, we’ll use the values of log at 1.752, 1.753, 1.754, 1.755, and 1.756 to compute the value of log(*x*).

Here’s the Lagrange interpolation formula, given in A&S as equation 25.2.15.
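In the notation here, *f*_{k} is the tabulated value *k* steps from the central point *x*_{0}, and *p* = (*x* − *x*_{0})/*h*. This is a reconstruction of the five-point formula, consistent with the code below:

```latex
f(x_0 + ph) \approx \frac{(p^2-1)p(p-2)}{24} f_{-2}
            - \frac{(p-1)p(p^2-4)}{6} f_{-1}
            + \frac{(p^2-1)(p^2-4)}{4} f_{0}
            - \frac{(p+1)p(p^2-4)}{6} f_{1}
            + \frac{(p^2-1)p(p+2)}{24} f_{2}
```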

We illustrate this with the following Python code.

```python
import numpy as np

def interpolate(fs, p, h):
    # Five-point Lagrange interpolation (A&S 25.2.15); fs holds the five
    # tabulated values and p = (x - x0)/h is the offset from the center x0.
    s  = (p**2 - 1)*p*(p - 2)*fs[0]/24
    s -= (p - 1)*p*(p**2 - 4)*fs[1]/6
    s += (p**2 - 1)*(p**2 - 4)*fs[2]/4
    s -= (p + 1)*p*(p**2 - 4)*fs[3]/6
    s += (p**2 - 1)*p*(p + 2)*fs[4]/24
    return s

xs = np.linspace(1.752, 1.756, 5)
fs = np.log(xs)
h = 0.001
x = 1.75430123456789
p = (x - 1.754)/h

print(interpolate(fs, p, h))
print(np.log(x))
```

This prints

0.5620706206909348
0.5620706206909349

confirming that the interpolated value is indeed accurate to 15 figures.

Lagrange interpolation takes a lot of work to carry out by hand, and so sometimes you might use other techniques, such as transforming your calculation into one for which a Taylor series approximation converges quickly. In any case, sophisticated use of numerical tables was not simply a matter of looking things up.

A book of numerical tables enables you to do calculations without a computer. More than that, understanding how to do calculations **without** a computer helps you program calculations **with** a computer. Computers have to evaluate functions somehow, and one way is interpolating tabulated values.

For example, you could think of a digital image as a numerical table, the values of some ideal analog image sampled at discrete points. The screenshots above are interpolated: the HTML specifies the width to be less than that of the original screenshots. You’re not seeing the original image; you’re seeing a new image that your computer has created for you using interpolation.

Interpolation is a kind of compression. A&S would be 100 billion times larger if it tabulated functions at 15 figure inputs. Instead, it tabulated functions for 4 figure inputs and gives you a recipe (Lagrange interpolation) for evaluating the functions at 15 figure inputs if you desire. This is a very common pattern. An SVG image, for example, does not tell you pixel values, but gives you equations for calculating pixel values at whatever scale is needed.

He talks about how software developers bemoan duct taping systems together, and would rather work on core technologies. He thinks it is some tragic failure, that if only wise system design was employed, you wouldn’t be doing all the duct taping.

Wrong.

Every expansion in capabilities opens up the opportunity to duct tape it to new areas, and this is where a lot of value creation happens. Eventually, when a sufficient amount of duct tape is found in an area, it is an opportunity for systemic redesigns, but you don’t wait for that before grabbing newly visible low hanging fruit!

The realistic alternative to duct tape and other aesthetically disappointing code is often no code.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Common sense says it matters what you saw. If 148 people were randomized to Design A, and every single one of them bought your product, while 10 out of the 152 people randomized to Design B bought your product, common sense says you should call Design A the winner and push it into production ASAP.

But what if you saw somewhat better results for Design A? You can have *some* confidence that Design A is better, though not as much as the nominal confidence level of the full experiment. This is what your (frequentist) statistician was trying to protect you from.

When your statistician designed your experiment, he obviously didn’t know what data you’d see, so he designed a *process* that would be reliable in a certain sense. When you looked at the data early, you violated the process, and so now your actual practice no longer has the probability of success initially calculated.

But you don’t care about the process; you want to know whether to deploy Design A or Design B. And you saw the data that you saw. Particularly in the case where the results were lopsidedly in favor of Design A, your gut tells you that you know what to do next. You might reasonably say “I get what you’re saying about repeated experiments and all that. (OK, not really, but let’s say I do.) But look what happened? Design A is a runaway success!”

In formal terms, your common sense is telling you to condition on the observed data. If you’ve never studied Bayesian statistics you may not know exactly what that means and how to calculate it, but it’s intuitively what you’ve done. You’re making a decision based on what you actually saw, not on the basis of a hypothetical sequence of experiments you didn’t run and won’t run.

Bayesian statistics does formally what your intuition does informally. This is important because even though your intuition is a good guide in extreme cases, it can go wrong when things are less obvious. As I wrote about recently, smart people can get probability very wrong, even when their intuition is correct in some instances.

If you’d like help designing experiments or understanding their results, we can help.

The post Condition on your data first appeared on John D. Cook.

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

Of course there’s nothing morally or legally wrong with looking at interim data, but is there anything statistically wrong? That depends on what you do with your observation.

There are basically two statistical camps, Frequentist and Bayesian. (There are others, but these two account for the lion’s share.) Frequentists say no, you should not look at interim data, unless you apply something called **alpha spending**. Bayesians, on the other hand, say go ahead. Why shouldn’t you look at your data? I remember one Bayesian colleague mocking alpha spending as being embarrassing.

So who is right, the Frequentists or the Bayesians? **Both are right**, given their respective criteria.

If you want to achieve success, as defined by Frequentists, you have to play by Frequentist rules, alpha spending and all. Suppose you design a hypothesis test to have a confidence level α, and you look at the data midway through an experiment. If the results are conclusive at the midpoint, you stop early. This procedure does **not** have a confidence level α. You would have to require stronger evidence for early stopping, as specified by alpha spending.

The Bayesian interprets the data differently. This approach says to quantify what is believed before conducting the experiment in the form of a prior probability distribution. Then after each data point comes in, you update your prior distribution using Bayes’ theorem to produce a posterior distribution, which becomes your new prior distribution. That’s it. At every point along the way, this distribution captures all that is known about what you’re testing. Your planned sample size is irrelevant, and the fact that you’ve looked at your data is irrelevant.
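As a small illustration of that updating process, here is my own sketch using a conjugate beta prior for a conversion rate, which makes the update one line per observation:

```python
# Beta-binomial updating: a beta(a, b) prior on a conversion rate,
# updated after each visitor, is again a beta distribution.
a, b = 1, 1                  # uniform prior
observations = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # 1 = sale, 0 = no sale

for obs in observations:
    a += obs
    b += 1 - obs             # the posterior becomes the new prior

print(a, b)                  # beta(5, 7): 4 successes, 6 failures on top of the prior
print(a / (a + b))           # posterior mean conversion rate
```

At every point in the loop the pair (a, b) summarizes everything observed so far; nothing depends on a planned sample size.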

Now regarding your A/B test, **why** are you looking at the data early? If it’s simply out of curiosity, and cannot affect your actions, then it doesn’t matter. But if you act on your observation, you change the Frequentist operating characteristics of your experiment.

Stepping back a little, why are you conducting your test in the first place? If you want to make the correct decision with probability 1 − α in an infinite sequence of experiments, then you need to take into account alpha spending, or else you will lower that probability.

But if you don’t care about a hypothetical infinite sequence of experiments, you may find the Bayesian approach more intuitive. What do you know right at this point in the experiment? It’s all encapsulated in the posterior distribution.

Suppose your experiment size is a substantial portion of your potential visitors. You want to experiment with graphics, yes, but you also want to make money along the way. You’re ultimately concerned with profit, not publishing a scientific paper on your results. Then you could use a Bayesian approach to maximize your expected profit. This leads to things like “**bandits**,” so called by analogy to slot machines (“one-armed bandits”).

If you want to keep things simple, and if the experiment size is negligible relative to your expected number of visitors, just design your experiment according to Frequentist principles and don’t look at your data early.

But if you have good business reasons to want to look at the data early, not simply to satisfy curiosity, then you should probably interpret the interim data from a Bayesian perspective. As the next post explains, the Bayesian approach aligns well with common sense.

I’d recommend taking **either** a Frequentist approach or a Bayesian approach, but not complicating things by hybrid approaches such as alpha spending or designing Bayesian experiments to have desired Frequentist operating characteristics. The middle ground is more complicated and prone to subtle mistakes, though we can help you navigate this middle ground if you need to.

If you need help designing, conducting, or interpreting experiments, we can help. If you want/need to look at interim results, we can show you how to do it the right way.

`\label{foo}`

and referenced like `\ref{foo}`. Referring to sections by labels rather than hard-coded numbers allows references to automatically update when sections are inserted, deleted, or rearranged.
For every reference there ought to be a label. A label without a corresponding reference is fine, though it might be a mistake. If you have a reference with no corresponding label, and one label without a reference, there’s a good chance the reference is a typo variation on the unreferenced label.

We’ll build up a one-liner for comparing labels and references. We’ll use `grep` to find patterns that look like labels by searching for `label{` followed by any string of letters up to, but not including, a closing brace. We don’t want the `label{` part, just what follows it, so we’ll use look-behind syntax to exclude it from the match.

Here’s our regular expression:

    (?<=label{)[^}]+

We’re using Perl-style look-behind syntax, so we’ll need to give `grep` the `-P` option. Also, we only want the match itself, not matching lines, so we’ll also use the `-o` option. This will print all the labels:

    grep -oP '(?<=label{)[^}]+' foo.tex

The regex for finding references is the same with `label` replaced with `ref`.

To compare the list of labels and the list of references, we’ll use the `comm` command. For more on `comm`, see Set theory at the command line.

We could save the labels to a file, save the references to a file, and run `comm` on the two files. But we’re more interested in the differences between the two lists than in the lists themselves, so we pass both as streams to `comm` using the `<(...)` syntax. Finally, `comm` assumes its inputs are sorted, so we pipe the output of both `grep` commands to `sort`.

Here’s our one-liner:

    comm <(grep -oP '(?<=label{)[^}]+' foo.tex | sort) <(grep -oP '(?<=ref{)[^}]+' foo.tex | sort)

This will produce three sections of output: labels which are not references, references which are not labels, and labels that are also references.

If you just want to see references that don’t refer to a label, give `comm` the option `-13`. This suppresses the first and third sections of output, leaving only the second section: references that are not labels.

You can also add a `-u` option (*u* for *unique*) to the calls to `sort` to suppress multiple instances of the same label or the same reference.
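If you’d rather do the same check in Python than at the command line, here’s a rough equivalent of the whole pipeline (my own sketch, using `re` and set differences):

```python
import re

def check_refs(tex_source):
    """Return (labels with no ref, refs with no label) for a LaTeX source string."""
    labels = set(re.findall(r'\\label\{([^}]+)\}', tex_source))
    refs   = set(re.findall(r'\\ref\{([^}]+)\}', tex_source))
    return labels - refs, refs - labels

sample = r"\section{A}\label{sec:a} See \ref{sec:a} and \ref{sec:b}. \label{sec:c}"
print(check_refs(sample))
```

Using sets plays the role of `sort -u`, and the two set differences correspond to the first two columns of `comm`’s output.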

Note that the right hand side is not a series in φ but rather in sin φ.

Why might you know sin φ and want to calculate sin *m*φ / cos φ? This doesn’t seem like a sufficiently common task for the series to be well-known. The references are over a century old, and maybe the series were useful in hand calculations in a way that isn’t necessary anymore.

However, [1] was using the series for a theoretical derivation, not for calculation; the author was doing some hand-wavy derivation, sticking the difference operator *E* into a series as if it were a number, a technique known as “umbral calculus.” The name comes from the Latin word *umbra* for shadow. The name referred to the “shadowy” nature of the technique which wasn’t made rigorous until much later.

The series above terminates if *m* is an even integer. But there are no restrictions on *m*, and in general the series is infinite.

The series obviously has trouble if cos φ = 0, i.e. when φ = ±π/2, but it converges for all *m* if −π/2 < φ < π/2.

If *m* = 1, sin *m*φ / cos φ is simply tan φ. The function tan φ has a complicated power series in φ involving Bernoulli numbers, but it has a simpler power series in sin φ.
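For example, the series for tan φ in powers of *s* = sin φ follows from expanding 1/cos φ = (1 − *s*²)^{−1/2} binomially. Here’s a quick numerical check of the first four terms (my own verification, not from the references):

```python
from math import sin, tan

# tan(phi) = sin(phi) / sqrt(1 - sin(phi)**2) for |phi| < pi/2, so expanding
# (1 - s**2)**(-1/2) binomially gives a power series in s = sin(phi):
# tan(phi) = s + s**3/2 + 3*s**5/8 + 5*s**7/16 + ...
phi = 0.3
s = sin(phi)
series = s + s**3/2 + 3*s**5/8 + 5*s**7/16
print(series, tan(phi))
```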

[1] G. J. Lidstone. Notes on Everett’s Interpolation Formula. 1922

[2] E. W. Hobson. A Treatise on Plane Trigonometry. Fourth Edition, 1918. Page 276.

The post A “well-known” series first appeared on John D. Cook.

I wrote years ago about how striking it was to see two senior professors arguing over an undergraduate probability exercise. As I commented in that post, “Professors might forget how to do a calculus problem, or make a mistake in a calculation, but you wouldn’t see two professors defending incompatible solutions.”

Not only do smart people often get probability wrong, they can be very confident while doing so. The same applies to cryptography.

I recently learned of a cipher J. E. Littlewood invented that he believed was unbreakable. His idea was essentially a stream cipher, simulating a one-time pad by using a pseudorandom number generator. He assumed that since a one-time pad is secure, his simulacrum of a one-time pad would be secure. But it was not, for reasons explained in this paper.

Littlewood was a brilliant mathematician, but he was naive, and even arrogant, about cryptography. Here’s the opening to the paper in which he explained his method.

The legend that every cipher is breakable is of course absurd, though still widespread among people who should know better. I give a sufficient example …

He seems to be saying “Here’s a little example off the top of my head that shows how easy it is to create an unbreakable cipher.” He was the one who should have known better.

Richard Feynman’s Nobel Prize winning discoveries in quantum electrodynamics were partly inspired by his randomly observing a spinning dinner plate one day in the cafeteria. Paul Feyerabend said regarding scientific discovery, “The only principle that does not inhibit progress is: anything goes” (within relevant ethical constraints, of course).

Ideas can come from anywhere, including physical play. Various books can improve creative discovery skills, like George Pólya’s *How to Solve It*, Isaac Watts’ *Improvement of the Mind*, W. J. J. Gordon’s *Synectics*, and methodologies like mind mapping and C-K theory, to name a few. Many software products present themselves as shiny new tools promising help. However, we are not just disembodied minds interacting with a computer; we are integrated beings whose reasoning is interwoven with memories, life history, emotions, and multisensory interaction with the world. The tactile is certainly a key avenue of learning, discovering, and understanding.

Fidget toys are popular. Different kinds of toys have different semiotics with respect to how they interplay with our imaginations. Take Legos: structured, like Le Corbusier-style architecture, or multidimensional arrays or tensors, or the snapping together of many software components with well-defined interfaces, scaling regularly from one to many. Or, to take a much older example, Tinkertoys: the analogy of the graph, interconnectedness, semi-structured but composable, like DNA or protein chains, complex interrelated biological processes, neuronal connections, or the wild variety within order of human language.

As creative workers, we seek ideas from any and every place to help us in what we do. The tactile, the physical, is a vital place to look.

The post Thinking by playing around first appeared on John D. Cook.

This means that linear combinations of the polynomials

1, *x*, *x*², *x*³, …

are dense in *C* [0, 1].

Do you need all these powers of *x*? Could you approximate any continuous function arbitrarily well if you left out one of these powers, say *x*^{7}? Yes, you could.

You cannot omit the constant polynomial 1, but you can leave out any other power of *x*. In fact, you can leave out a lot of powers of *x*, as long as the sequence of exponents doesn’t thin out too quickly.

Herman Müntz proved in 1914 that a necessary and sufficient pair of conditions on the exponents of *x* is that the first exponent is 0 and that the sum of the reciprocals of the exponents diverges.

In other words, the sequence of powers of *x*

*x*^{λ0}, *x*^{λ1}, *x*^{λ2}, …

with

λ_{0} < λ_{1} < λ_{2} < …

is dense in *C* [0, 1] if and only if λ_{0} = 0 and

1/λ_{1} + 1/λ_{2} + 1/λ_{3} + … = ∞

Euler proved in 1737 that the sum of the reciprocals of the primes diverges, so the sequence

1, *x*^{2}, *x*^{3}, *x*^{5}, *x*^{7}, *x*^{11}, …

is dense in *C* [0, 1]. We can find a polynomial as close as we like to any particular continuous function if we combine enough prime powers.

Let’s see how well we can approximate |*x* − ½| using prime exponents up to 11.

The polynomial above is

0.4605 − 5.233 *x*^{2} + 7.211* x*^{3} + 0.9295 *x*^{5} − 4.4646 *x*^{7} + 1.614 *x*^{11}.

This polynomial is not the best possible uniform approximation: it’s the least squares approximation. That is, it minimizes the 2-norm and not the ∞-norm. That’s because it’s convenient to do a least squares fit in Python using `scipy.optimize.curve_fit`

.

Incidentally, the Müntz approximation theorem holds for the 2-norm as well.
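For reproducibility, here is a hedged sketch of such a fit using a plain least squares solve on a grid. The grid size and the use of `numpy.linalg.lstsq` rather than `scipy.optimize.curve_fit` are my choices, so the coefficients will differ slightly from those quoted above:

```python
import numpy as np

# Least squares fit of |x - 1/2| on [0, 1] using the constant 1 and
# prime-exponent powers of x, as in the example above.
exponents = [0, 2, 3, 5, 7, 11]
x = np.linspace(0, 1, 1001)
A = np.column_stack([x**e for e in exponents])
y = np.abs(x - 0.5)

coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(dict(zip(exponents, np.round(coeffs, 4))))
print(np.max(np.abs(A @ coeffs - y)))   # maximum approximation error on the grid
```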

log_{10}(*x*) ≈ (*x* – 1)/(*x* + 1)

base *e*,

log_{e}(*x*) ≈ 2(*x* – 1)/(*x* + 1),

and base 2

log_{2}(*x*) ≈ 3(*x* – 1)/(*x* + 1).

These can be used to mentally approximate logarithms to moderate accuracy, accurate enough for quick estimates.

Here’s what’s curious about the approximations: the proportionality constants are apparently wrong, and yet the approximations are each fairly accurate.

It is **not** the case that

log_{e}(*x*) = 2 log_{10}(*x*).

In fact,

log_{e}(*x*) = log_{e}(10) log_{10}(*x*) = 2.303 log_{10}(*x*)

and so it seems that the approximation for natural logarithms should be off by 15%. But it’s not. The error is less than 2.5%.

Similarly,

log_{2}(*x*) = log_{2}(10) log_{10}(*x*) = 3.32 log_{10}(*x*)

and so the approximation for logarithms base 2 should be off by about 10%. But it’s not. The error here is also less than 2.5%.

What’s going on?

First of all, the approximation errors are nonlinear functions of *x* and the three approximation errors are not proportional to one another. Second, the approximation for log_{b}(*x*) is only intended for a limited range of *x*, and that range depends on *b*.

Here’s a plot of the three error functions.

This plot makes it appear that the approximation error is much worse for natural logs and logs base 2 than for logs base 10. And it would be if we ignored the range of each approximation. Here’s another plot of the approximation errors, plotting each over only its valid range.

When restricted to their valid ranges, the approximations for logarithms base *e* and base 2 are *more* accurate than the approximation for logarithms base 10. Both errors are small, but in opposite directions.

Here’s a look at the relative approximation errors.

We can see that the relative errors for the log 2 and log *e* errors are less than 2.5%, while the relative error for log 10 can be up to 15%.
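As a numerical check of the base-*e* claim, here’s a sketch. I’m assuming the valid range for the base-*e* approximation is 1/√e ≤ *x* ≤ √e, by analogy with the ranges discussed above; that range is my assumption, not stated in the post:

```python
import numpy as np

# Relative error of ln(x) ~ 2(x - 1)/(x + 1) over an assumed range [1/sqrt(e), sqrt(e)].
x = np.linspace(np.exp(-0.5), np.exp(0.5), 1001)
approx = 2*(x - 1)/(x + 1)
exact = np.log(x)

mask = np.abs(exact) > 1e-8           # avoid 0/0 at x = 1
rel_err = np.abs(approx[mask] - exact[mask]) / np.abs(exact[mask])
print(rel_err.max())                  # stays under 2.5%, matching the claim
```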

The post Logarithm approximation curiosity first appeared on John D. Cook.

It turns out that if 2^{k} − 1 is prime then *k* must be prime, so Mersenne primes have the form 2^{p} − 1 with *p* prime. What about the converse? If *p* is prime, is 2^{p} − 1 also prime? No, because, for example, 2^{11} − 1 = 2047 = 23 × 89.

If *p* is not just a prime but a Mersenne prime, then is 2^{p} − 1 a prime? Sometimes, but not always. The first counterexample is *p* = 8191.
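One standard way to check such claims for moderate exponents is the Lucas–Lehmer test, which decides the primality of 2^{p} − 1 for an odd prime *p*. A minimal sketch:

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test: is the Mersenne number 2**p - 1 prime? (p an odd prime)"""
    M = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s*s - 2) % M
    return s == 0

print(lucas_lehmer(11))   # False: 2**11 - 1 = 2047 = 23 * 89
print(lucas_lehmer(13))   # True:  2**13 - 1 = 8191 is prime
```

The test runs in *p* − 2 modular squarings, which is why exponents in the millions are feasible while an exponent like 2^{127} − 1 is hopeless.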

There is an interesting chain of iterated Mersenne primes:
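The chain referred to, reconstructed here (these are the Catalan–Mersenne numbers: each term is a Mersenne prime whose exponent is the previous term), is

```latex
2^2 - 1 = 3, \qquad 2^3 - 1 = 7, \qquad 2^7 - 1 = 127, \qquad 2^{127} - 1 = M_{12}
```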

This raises the question of whether *m* = 2^{M12} − 1 is prime. Direct testing using available methods is completely out of the question. The only way we’ll ever know is if there is some theoretical result that settles the question.

Here’s an easier question. Suppose *m* is prime. Where would it fall on the list of Mersenne primes if conjectures about the distribution of Mersenne primes are true?

This post reports the following conjecture.

It has been conjectured that as *x* increases, the number of primes *p* ≤ *x* such that 2^{p} − 1 is also prime is asymptotically

*e*^{γ} log *x* / log 2,

where γ is the Euler–Mascheroni constant.

If that conjecture is true, the number of primes less than *M*_{12} that are the exponents of Mersenne primes would be approximately

*e*^{γ} log *M*_{12} / log 2 ≈ 226.2.

So if *m* is a Mersenne prime, it may be the 226th Mersenne prime, or *M*_{n} for some *n* around 226, if the conjectured distribution of Mersenne primes is correct.
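This estimate is straightforward to reproduce numerically. A sketch in Python (the value of γ is hard-coded, since the standard library's `math.gamma` is the Gamma function, not the constant):

```python
import math

gamma = 0.5772156649015329   # Euler-Mascheroni constant, hard-coded
M12 = 2**127 - 1             # the 12th Mersenne prime

# Conjectured count of Mersenne prime exponents p <= M12
estimate = math.exp(gamma) * math.log(M12) / math.log(2)
print(round(estimate, 1))  # 226.2
```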

We’ve discovered a dozen Mersenne primes since the turn of the century, bringing the total to 51 so far. We’re probably not going to get up to the 226th Mersenne prime, if there even is a 226th Mersenne prime, any time soon.

The post Iterated Mersenne primes first appeared on John D. Cook.

This is wrong, but it’s a common mistake. And one reason it’s common is that **a variation on the mistake is approximately correct**, which we will explain shortly.

It’s obvious the reasoning in the opening paragraph is wrong when you extend it to five, or especially six, attempts. Are you certain to succeed after five attempts? What does it even mean that you have a 120% chance of success after six attempts?!

But let’s reduce the probabilities in the opening paragraph. If there’s a 2% chance of success on your first attempt, is there a 4% chance of success in two attempts and a 6% chance of success in three attempts? *Yes*, approximately.

Here is the correct formula for the probability of an event happening in at least one of two tries:

P(*A* ∪ *B*) = P(*A*) + P(*B*) − P(*A* ∩ *B*).

In words, the probability of *A* or *B* happening equals the probability of *A* happening, plus the probability of *B* happening, minus the probability of *A* and *B* both happening. The last term is a correction term; without it, you’re counting some possibilities twice.

So if the probability of success on each attempt is 0.02, the probability of success on two attempts is

0.02 + 0.02 − 0.0004 = 0.0396 ≈ 0.04.

When the probabilities of *A* and *B* are each small, the probability of *A* and *B* both happening is an order of magnitude smaller, assuming independence [2]. The smaller the probabilities of *A* and *B*, the less the correction term matters.

If the probability of success on each attempt is 0.2, now the probability of success after two attempts is 0.36. Simply adding probabilities and neglecting the correction term is incorrect, but not terribly far from correct in this case.

When you consider more attempts, things get more complicated. The probability of success after three attempts is given by

P(*A* ∪ *B* ∪ *C*) = P(*A*) + P(*B*) + P(*C*) − P(*A* ∩ *B*) − P(*A* ∩ *C*) − P(*B* ∩ *C*) + P(*A* ∩ *B* ∩ *C*),

as I discuss here. Adding the probabilities of success separately over-estimates the correct probability, so you correct by subtracting the probabilities of pairs of successes. But this over-corrects, because you then need to add back in the probability of three successes.

If *A*, *B*, and *C* all have a 20% probability, the probability of *A* or *B* or *C* happening is 48.8%, not 60%, again assuming independence.
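The 48.8% figure can be checked two ways, by inclusion-exclusion and by the complement of all failures. A quick sketch:

```python
p = 0.2  # probability of each of three independent events A, B, C

# Inclusion-exclusion for three identical independent events:
# P(A or B or C) = 3p - 3p^2 + p^3
incl_excl = 3*p - 3*p**2 + p**3

# Complement form: 1 - P(all three fail)
complement = 1 - (1 - p)**3

print(round(incl_excl, 3))  # 0.488
assert abs(incl_excl - complement) < 1e-12
```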

The error from naively adding probabilities increases as the number of probabilities increases.

Now let’s look at the general case. Suppose your probability of success on each attempt is *p*. Then your probability of failure on each independent attempt is 1 − *p*. The probability of at least one success out of *n* attempts is the complement of the probability of all failures, i.e.

1 − (1 − *p*)^{*n*}.
When *p* is small, and when *n* is small, we can approximate this by *np*. That’s why naively adding probabilities works when the probabilities are small and there aren’t many of them. Here’s a way to say this precisely using the binomial theorem:

1 − (1 − *p*)^{*n*} = *np* − C(*n*, 2) *p*^{2} + C(*n*, 3) *p*^{3} − ⋯

The exact probability is *np* plus (*n* − 1) terms that involve higher powers of *p*. When *p* and *n* are sufficiently small, these terms can be ignored.
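The gap between the naive sum *np* and the exact probability can be tabulated for the values used in this post; a quick sketch:

```python
# Compare the naive sum n*p with the exact probability 1 - (1-p)**n.
# For p = 0.02 the naive sum is nearly exact; for p = 0.2 it drifts.
for p in (0.02, 0.2):
    for n in (2, 3):
        exact = 1 - (1 - p)**n
        naive = n * p
        print(f"p={p}, n={n}: naive {naive:.3f}, exact {exact:.4f}")
```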

[1] I’m deliberately not saying who. My point here is not to rub his nose in his mistake. This post will be online long after the particular video has been forgotten.

[2] Assuming *A* and *B* are independent. This is not always the case, and wrongly assuming independence can have disastrous consequences as I discuss here, but that’s a topic for another day.

***

Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.

***

The article cited above is entitled “Rough approximations to move between logit and probability scales.” Here is a paragraph from the article giving its motivation.

When working in this space, it’s helpful to have some approximations to move between the logit and probability scales. (As an analogy, it is helpful to know that for a normal distribution, the interval ± 2 standard deviations around the mean covers about 95% of the distribution, while ± 3 standard deviations covers about 99%.)

Here are half the results from the post; the other half follow by symmetry.

|-------+-------|
| prob  | logit |
|-------+-------|
| 0.500 | 0     |
| 0.750 | 1     |
| 0.900 | 2     |
| 0.950 | 3     |
| 0.999 | 7     |
|-------+-------|

Zero on the logit scale corresponds exactly to a probability of 0.5. The other values are approximate.

When I say the rest of the table follows by symmetry, I’m alluding to the fact that

logit(1 − *p*) = − logit(*p*).

So, for example, because logit(0.999) ≈ 7, logit(0.001) ≈ −7.
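Both the integer approximations and the symmetry relation are easy to check numerically with a few lines of Python:

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

# Rough integer approximations: logit(0.75) ~ 1, logit(0.9) ~ 2,
# logit(0.95) ~ 3, logit(0.999) ~ 7
for p in (0.75, 0.9, 0.95, 0.999):
    print(p, round(logit(p), 2))

# The symmetry logit(1 - p) = -logit(p)
assert abs(logit(0.001) + logit(0.999)) < 1e-9
```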

***

The post reminded me of the decibel scale. As I wrote in this post, “It’s a curious and convenient fact that many decibel values are close to integers.”

- 3 dB ≈ 2
- 6 dB ≈ 4
- 7 dB ≈ 5
- 9 dB ≈ 8

I was curious whether the logit / probability approximations were as accurate as these decibel approximations. Alas, they are not. They are rough approximations, as advertised in the title, but still useful.

***

The post also reminded me of a comment by Andrew Gelman and Jennifer Hill on why natural logs are natural for regression.

The post Logistic regression quick takes first appeared on John D. Cook.