Ancient accurate approximation for sine

This post started out as a Twitter thread. The text below is the same as that of the thread after correcting an error in the first part of the thread. I also added a footnote on a theorem the thread alluded to.

***

The following approximation for sin(x) is remarkably accurate for 0 < x < π.

\sin(x) \approx \frac{16x(\pi - x)}{5\pi^2 - 4x(\pi - x)}

The approximation is so good that you can’t see the difference between the exact value and the approximation until you get outside the range of the approximation.

Here’s a plot of just the error.
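To quantify “remarkably accurate,” here’s a short numerical sketch (mine, not from the original post) that measures the worst-case error over the interval:

    import numpy as np

    x = np.linspace(0, np.pi, 10001)
    approx = 16*x*(np.pi - x) / (5*np.pi**2 - 4*x*(np.pi - x))
    print(np.abs(approx - np.sin(x)).max())  # maximum error, about 0.0016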

This is a very old approximation, dating back to Aryabhata I, around 500 AD.

In modern terms, it is a rational approximation, quadratic in the numerator and denominator. It’s not quite optimal since the ripples in the error function are not of equal height [1], but the coefficients are nice numbers.

***

As pointed out in the comments, replacing x with π/2 − x in order to get an approximation for cosine gives a much nicer equation.

\cos x \approx \frac{\pi^2 - 4x^2}{\pi^2 + x^2}
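The algebra behind the substitution is easy to check symbolically. Here’s a quick sketch using sympy:

    from sympy import symbols, pi, simplify

    x = symbols('x')
    sine_approx = 16*x*(pi - x) / (5*pi**2 - 4*x*(pi - x))
    print(simplify(sine_approx.subs(x, pi/2 - x)))
    # (pi**2 - 4*x**2)/(pi**2 + x**2), possibly with terms in a different order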

***

[1] The equioscillation theorem says that the optimal approximation will have ripples of equal positive and negative amplitude. This post explores the equioscillation theorem further and finds how far Aryabhata’s approximation is from optimal.

Mentally multiply by π

This post will give three ways to multiply by π taken from [1].

Simplest approach

Here’s a very simple observation about π:

π ≈ 3 + 0.14 + 0.0014.

So if you need to multiply by π, you need to multiply by 3 and by 14. Once you’ve multiplied by 14, you can reuse your work.

For example, to compute 4π, you’d compute 4 × 3 = 12 and 4 × 14 = 56. Then

4π ≈ 12 + 0.56 + 0.0056 = 12.5656.

The correct value is 12.56637… and so the error is 0.00077.

First refinement

Now of course π = 3.14159… and so the approximation above is wrong in the fourth decimal place. But you can squeeze out a little more accuracy with the observation

π ≈ 3 + 0.14 + 0.0014 + 0.00014 = 3.14154.

Now if we redo our calculation of 4π we get

4π ≈ 12 + 0.56 + 0.0056 + 0.00056 = 12.56616.

Now our error is 0.00021, which is 3.6 times smaller.

Second refinement

The approximation above is based on an underestimate of π. We can improve it a bit by adding half of our last term, based on

π ≈ 3 + 0.14 + 0.0014 + 0.00014 + 0.00014/2 = 3.14161.

So in our running example,

4π ≈ 12 + 0.56 + 0.0056 + 0.00056 + 0.00028 = 12.56644,

which has an error of 0.00007, three times smaller than before.
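Here’s a small sketch (mine, not from [1]) that applies all three approximations to a multiplier n and prints the error in each:

    from math import pi

    def pi_times(n):
        a = 3*n + 0.14*n + 0.0014*n   # simplest approach
        b = a + 0.00014*n             # first refinement
        c = b + 0.00014*n/2           # second refinement
        return a, b, c

    for approx in pi_times(4):
        print(approx, abs(approx - 4*pi))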

[1] Trevor Lipscombe. Mental mathematics for multiples of π. The Mathematical Gazette, Vol. 97, No. 538 (March 2013), pp. 167–169.

A better integral for the normal distribution

For a standard normal random variable Z, the probability that Z exceeds some cutoff z is given by

\mbox{Prob}(Z \geq z) = Q(z) = \frac{1}{\sqrt{2\pi}} \int_z^\infty \exp(-x^2/2)\, dx

If you wanted to compute this probability numerically, you could obviously evaluate its defining integral numerically. But as is often the case in numerical analysis, the most obvious approach is not the best approach. The range of integration is unbounded and it varies with the argument.

J. W. Craig [1] came up with a better integral representation, better from the perspective of numerical integration. The integration is always over the same finite interval, with the argument appearing inside the integrand. The integrand is smooth and bounded, well suited to numerical integration.

For positive z, Craig’s integral representation is

Q(z) = \frac{1}{\pi} \int_0^{\pi/2} \exp\left( -\frac{z^2}{2\sin^2 \theta} \right) \, d\theta

Illustration

To show that Craig’s integral is easy to evaluate numerically, we’ll compute it using Gaussian quadrature with only 10 integration points.

    from numpy import sin, exp, pi
    from scipy import integrate
    from scipy.stats import norm

    for x in [0.5, 2, 5]:
        # 10-point Gaussian quadrature of Craig's integrand over [0, pi/2]
        q, _ = integrate.fixed_quad(
            lambda t: exp(-x**2 / (2*sin(t)**2))/pi,
            0.0, pi/2, n=10)
        print(q, norm.sf(x))  # compare with SciPy's normal CCDF

(SciPy uses sf, short for “survival function,” for the CCDF.)

The code above produces the following.

    0.30858301 0.30853754
    0.02274966 0.02275013
    2.86638437e-07 2.86651572e-07

So with 10 integration points, we get four correct figures. And the accuracy seems to be consistent for small, medium, and large values of x. (Five standard deviations is pretty far out in the tail of a normal distribution, as evidenced by the small value of the integral.)

[1] J. W. Craig. A new, simple and exact result for calculating the probability of error for two-dimensional signal constellations. IEEE MILCOM ’91 Conference Record, Boston, MA (1991), pp. 25.5.1–25.5.5.

Drawing with a compass on a globe

Take a compass and draw a circle on a globe. Then take the same compass, opened to the same width, and draw a circle on a flat piece of paper. Which circle has more area?

If the circle is small compared to the radius of the globe, then the two circles will be approximately equal because a small area on a globe is approximately flat.

To get an idea what happens for larger circles, let’s draw a circle on the globe that is as large as possible, i.e. the equator. If the globe has radius r, then to draw the equator we need our compass to be opened to a width of √2 r, the distance from the north pole to the equator along a straight line cutting through the globe.

The area of a hemisphere is 2πr². If we take our compass and draw a circle of radius √2 r on a flat surface we also get an area of 2πr². And by continuity we should expect that if we draw a circle that is nearly as big as the equator then the corresponding circle on a flat surface should have approximately the same area.

Interesting. This says that our compass will draw a circle with the same area whether on a globe or on a flat surface, at least approximately, if the width of the compass is sufficiently small or sufficiently large. In fact, we get exactly the same area, regardless of how wide the compass is opened up. We haven’t proven this, only given a plausibility argument, but you can find a proof in [1].

Note that the width w of the compass is the radius of the circle drawn on a flat surface, but it is not the radius of the circle drawn on the globe. The width w is greater than the radius of the circle, but less than the distance along the sphere from the center of the circle to the circle itself. In the case of the equator, the radius of the circle is r, the width of the compass is √2 r, and the distance along the sphere from the north pole to the equator is πr/2.
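As a sanity check, here’s a numerical sketch (mine, not the proof in [1]) that computes the area of a spherical cap as a surface of revolution and compares it with the area of a flat circle of radius w:

    import numpy as np
    from scipy import integrate

    r = 1.0
    for h in [0.1, 0.5, 1.0]:  # cap height, up to a full hemisphere
        # cap area as the surface of revolution of x(z) = sqrt(r^2 - z^2)
        cap, _ = integrate.quad(
            lambda z: 2*np.pi*np.sqrt(r**2 - z**2)
                      * np.sqrt(1 + z**2/(r**2 - z**2)),
            r - h, r)
        # squared compass width: straight-line distance from pole to rim
        w2 = h**2 + (r**2 - (r - h)**2)
        print(cap, np.pi*w2)  # the two areas agree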

[1] Nick Lord. On an alternative formula for the area of a spherical cap. The Mathematical Gazette, Vol. 102, No. 554 (July 2018), pp. 314–316.

The negative binomial distribution and Pascal’s triangle

The Poisson probability distribution gives a simple, elegant model for count data. Under certain assumptions you can even derive that count data must follow a Poisson distribution. Unfortunately reality doesn’t often go along with those assumptions.

A Poisson random variable with mean λ also has variance λ. But it’s often the case that data that would seem to follow a Poisson distribution has a variance greater than its mean. This phenomenon is called over-dispersion: the dispersion (variance) is larger than a Poisson distribution assumption would allow.

One way to address over-dispersion is to use a negative binomial distribution. This distribution has two parameters, r and p, and has the following probability mass function (PMF).

P(X = x) = \binom{r + x - 1}{x} p^r(1-p)^x

As the parameter r goes to infinity while the mean is held fixed, the negative binomial distribution converges to a Poisson distribution. So you can think of the negative binomial distribution as a generalization of the Poisson distribution.
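Here’s a sketch of that convergence (not from the original post) using SciPy’s parameterization, holding the mean at λ as r grows:

    from scipy.stats import nbinom, poisson

    lam = 3.0
    for r in [5, 50, 500]:
        p = r/(r + lam)  # keeps the mean r*(1 - p)/p equal to lam
        print(r, nbinom.pmf(3, r, p), poisson.pmf(3, lam))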

These notes go into the negative binomial distribution in some detail, including where its name comes from.

If the parameter r is a positive integer, then the binomial coefficients in the PMF for the negative binomial distribution are on the rth diagonal of Pascal’s triangle.

Pascal's triangle

The case r = 1 corresponds to the first diagonal, the one consisting of all 1s. The case r = 2 corresponds to the second diagonal, consisting of consecutive integers. The case r = 3 corresponds to the third diagonal, the one consisting of triangular numbers. And so forth.
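Here’s a quick sketch (mine) that prints these coefficients straight from the PMF:

    from scipy.special import comb

    # binomial coefficients C(r + x - 1, x) from the PMF above
    for r in [1, 2, 3]:
        print(r, [int(comb(r + x - 1, x)) for x in range(7)])
    # 1 [1, 1, 1, 1, 1, 1, 1]      all 1s
    # 2 [1, 2, 3, 4, 5, 6, 7]      consecutive integers
    # 3 [1, 3, 6, 10, 15, 21, 28]  triangular numbers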

A strange take on the harmonic series

It is well known that the harmonic series

1 + ½ + ⅓ + ¼ + …

diverges. But if you take the denominators as numbers in base 11 or higher, the series converges [1].

I wonder what inspired this observation. Maybe Brewster was bored, teaching yet another cohort of students that the harmonic series diverges, and let his mind wander.

Proof

Let f(n) be the function that takes a positive integer n, writes it in base 10, then reinterprets the result as a number in base b where b > 10. Brewster is saying that the sum of the series 1/f(n) converges.

To see this, note that the first 10 terms are less than or equal to 1. The next 100 terms are less than 1/b. The next 1000 terms are less than 1/b², and so on. This means the series is bounded by the geometric series with terms 10(10/b)^m, which converges because b > 10.

Python

Incidentally, despite being an unusual function, f is very easy to implement in Python:

    # read the base-10 digits of n as a number in base b
    def f(n, b): return int(str(n), b)
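For example, a brute-force partial sum in base 11 shows how slow the convergence is:

    s = sum(1/f(n, 11) for n in range(1, 1_000_001))
    print(s)  # partial sum of the first million terms, still far from the limit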

Citation

Brewster’s note was so brief that I will quote it here in full.

The [harmonic series] is divergent. But if the denominators of the terms are read as numbers in scale 11 or any higher scale, the series is convergent, and the sum is greater than 2.828 and less than 26.29. The convergence is rather slow. I estimate that, to find the last number by direct addition, one would have to work out 10^90 terms, to about 93 places of decimals.

[1] G. W. Brewster. An Old Result in a New Dress. The Mathematical Gazette, Vol. 37, No. 322 (December 1953), pp. 269–270.

Variance matters more than mean in the extremes

Suppose you have two normal random variables, X and Y, and that the variance of X is less than the variance of Y.

Let M be an equal mixture of X and Y. That is, to sample from M, you first choose X or Y with equal probability, then you choose a sample from the random variable you chose.

Now suppose you’ve observed an extreme value of M. Then it is more likely that the value came from Y. The means of X and Y don’t matter, other than determining the cutoff for what “extreme” means.

High-level math

To state things more precisely, there is some value t such that the posterior probability that a sample m from M came from Y, given that |m| > t, is greater than the posterior probability that m came from X.

Let’s just look at the right-hand tails, even though the principle applies to both tails. If X and Y have the same variance, but the mean of X is greater, then larger values of M are more likely to have come from X. Now suppose the variance of Y is larger. As you go further out in the right tail of M, the posterior probability of an extreme value having come from Y increases, and eventually it surpasses the posterior probability of the sample having come from X. If X has a larger mean than Y, that will delay the point at which the posterior probability of Y passes the posterior probability of X, but eventually variance matters more than mean.

Detailed math

Let’s give a name to the random variable that determines whether we choose X or Y. Let’s call it C for coin flip, and assume C takes on 0 and 1 each with probability 1/2. If C = 0 we sample from X and if C = 1 we sample from Y. We want to compute the probability P(C = 1 | M ≥ t).

Without loss of generality we can assume X has mean 0 and variance 1. (Otherwise transform X and Y by subtracting off the mean of X and dividing by the standard deviation of X.) Denote the mean of Y by μ and the standard deviation by σ.

From Bayes’ theorem we have

\text{P}(C = 1 \mid M \geq t) = \frac{ \text{P}(Y \geq t) }{ \text{P}(X \geq t) + \text{P}(Y \geq t) } = \frac{\Phi^c\left(\frac{t-\mu}{\sigma}\right)}{\Phi^c(t) + \Phi^c\left(\frac{t-\mu}{\sigma}\right)}

where Φᶜ(t) = P(Z ≥ t) for a standard normal random variable Z.

Similarly, to compute P(C = 1 | M ≤ t), just flip the direction of the inequality signs and replace Φᶜ(t) = P(Z ≥ t) with Φ(t) = P(Z ≤ t).

The calculation for P(C = 1 | |M| ≥ t) is similar.
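Here’s a short sketch of the right-tail case, using norm.sf for Φᶜ and the parameters from the example below:

    from scipy.stats import norm

    def prob_from_Y(t, mu, sigma):
        # P(C = 1 | M >= t) by Bayes' theorem
        py = norm.sf((t - mu)/sigma)
        px = norm.sf(t)
        return py/(px + py)

    # mean -2, standard deviation sqrt(10), as in the example below
    for t in [0, 1, 2, 3, 5]:
        print(t, prob_from_Y(t, mu=-2, sigma=10**0.5))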

Example

Suppose Y has mean −2 and variance 10. The blue curve shows that a large negative sample from M very likely comes from Y, and the orange curve shows that a large positive sample very likely comes from Y as well.

The dip in the orange curve shows the transition zone where X‘s advantage due to its larger mean gives way to the disadvantage of its smaller variance. This illustrates that the posterior probability of Y increases eventually but not necessarily monotonically.

Here’s a plot showing the probability of a sample having come from Y depending on its absolute value.

Increasing speed due to friction

Orbital mechanics is fascinating. I’ve learned a bit about it for fun, not for profit. I seriously doubt Elon Musk will ever call asking me to design an orbit for him. [1]

One of the things that makes orbital mechanics interesting is that it can be counter-intuitive. For example, atmospheric friction can make a satellite move faster. How can this be? Doesn’t friction always slow things down?

Friction does reduce a satellite’s tangential velocity, causing it to move into a lower orbit, which increases its velocity. It’s weird to think about, but the details are worked out in [2].

Note the date on the article: May 1958. The paper was written in response to Sputnik 1, which launched in October 1957. Parkyn described the phenomenon of acceleration due to friction in general, and how it applied to Sputnik in particular.

[1] I had a lead on a project with NASA once, but it wasn’t orbital mechanics, and the lead didn’t materialize.

[2] D. G. Parkyn. The Effect of Friction on Elliptic Orbits. The Mathematical Gazette, Vol. 42, No. 340 (May 1958), pp. 96–98.

Ptolemy’s theorem

Draw a quadrilateral by picking four arbitrary points on a circle and connecting them cyclically.

inscribed quadrilateral

Now multiply the lengths of the pairs of opposite sides. In the diagram below this means multiplying the lengths of the two horizontal-ish blue sides and the two vertical-ish orange sides.

quadrilateral with opposite sides colored

Ptolemy’s theorem says that the sum of the two products described above equals the product of the diagonals.

inscribed quadrilateral with diagonals

To put it in colorful terms, the product of the blue sides plus the product of the orange sides equals the product of the green diagonals.

The converse of Ptolemy’s theorem also holds. If the relationship above holds for a quadrilateral, then the quadrilateral can be inscribed in a circle.

Note that if the quadrilateral in Ptolemy’s theorem is a rectangle, then the theorem reduces to the Pythagorean theorem.
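Here’s a quick numerical check of the theorem (a sketch, not from the post) using four random points on a unit circle:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.sort(rng.uniform(0, 2*np.pi, 4))  # sorted for cyclic order
    pts = np.column_stack([np.cos(theta), np.sin(theta)])
    d = lambda i, j: np.linalg.norm(pts[i] - pts[j])

    # sum of products of opposite sides vs. product of the diagonals
    print(d(0, 1)*d(2, 3) + d(1, 2)*d(3, 0), d(0, 2)*d(1, 3))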

Rule for converting trig identities into hyperbolic identities

There is a simple rule of thumb for converting between (circular) trig identities and hyperbolic trig identities known as Osborn’s rule: stick an h on the end of trig functions and flip signs wherever two sinh functions are multiplied together.

Examples

For example, the circular identity

sin(θ + φ) = sin(θ) cos(φ) + cos(θ) sin(φ)

becomes the hyperbolic identity

sinh(θ + φ) = sinh(θ) cosh(φ) + cosh(θ) sinh(φ)

but the identity

2 sin(θ) sin(φ) = cos(θ − φ) − cos(θ + φ)

becomes

2 sinh(θ) sinh(φ) = cosh(θ + φ) − cosh(θ − φ)

because there are two sinh terms.

Derivation

Osborn’s rule isn’t deep. It’s a straightforward application of Euler’s formula:

exp(iθ) = cos(θ) + i sin(θ).

More specifically, Osborn’s rule follows from two corollaries of Euler’s formula:

sin(iθ) = i sinh(θ)
cos(iθ) = cosh(θ)

Why bother?

The advantage of Osborn’s rule is that it saves time, and perhaps more importantly, it reduces the likelihood of making a mistake.

You could always derive any identity you need on the spot. All trig identities—circular or hyperbolic—are direct consequences of Euler’s formula. But it saves time to work at a higher level of abstraction. And as I’ve often said in the context of more efficient computer usage, the advantage of doing things faster is not so much the time directly saved but the decreased probability of losing your train of thought.

Caveats

Osborn’s rule includes implicit occurrences of sinh, such as in tanh = sinh / cosh. So, for example, the circular identity

tan(2θ) = 2 tan(θ) / (1 − tan²(θ))

becomes

tanh(2θ) = 2 tanh(θ) / (1 + tanh²(θ))

because the tanh² term implicitly contains two sinh terms.
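As a quick numerical sanity check of the converted identities:

    from math import sinh, cosh, tanh

    t, u = 0.7, 1.3  # arbitrary test values

    print(sinh(t + u), sinh(t)*cosh(u) + cosh(t)*sinh(u))
    print(2*sinh(t)*sinh(u), cosh(t + u) - cosh(t - u))
    print(tanh(2*t), 2*tanh(t)/(1 + tanh(t)**2))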

Original note

Osborn’s original note [1] from 1902 is so short that I include the entire text below:

[1] G. Osborn. Mnemonic for Hyperbolic Formulae. The Mathematical Gazette, Vol. 2, No. 34 (July 1902), p. 189.