Stochastic rounding and privacy

Suppose ages in some database are reported in decades: 0, 10, 20, etc. You need to add a 27 year old woman to the data set. How do you record her age? A reasonable approach would be round-to-nearest. In this case, 27 would be rounded up to 30.

Another approach would be stochastic rounding. In our example, we would round this woman’s age up to 30 with 70% probability and round it down to 20 with 30% probability. The recorded value is a random variable whose expected value is exactly 27.

Suppose we were to add a large number of 27 year olds to the database. With round-to-nearest, the average value would be 30 because all the values are 30. With stochastic rounding, about 30% of the ages would be recorded as 20 and about 70% would be recorded as 30. The average would likely be close to 27.

Next, suppose we add people to the database of varying ages. Stochastic rounding would record every person’s age using a random variable whose expected value is their age. If someone’s age is d + x, where d is a decade, i.e. a multiple of 10, and 0 < x < 10, then we would record their age as d with probability 1 − x/10 and as d + 10 with probability x/10. There would be no bias in the reported age.
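Here’s a minimal sketch of this scheme in Python; the function name and the base parameter are my own choices.

    import random

    def stochastic_round(age, base=10):
        d = (age // base) * base   # decade at or below the age
        x = age - d                # offset within the decade
        if random.random() < x / base:
            return d + base        # round up with probability x/base
        return d                   # round down with probability 1 - x/base

    ages = [stochastic_round(27) for _ in range(10000)]
    print(sum(ages) / len(ages))   # close to 27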

Round-to-nearest will be biased unless ages are uniformly distributed in each decade. Suppose, for example, our data is on undergraduate students. We would expect a lot more students in their early twenties than in their late twenties.

Now let’s turn things around. Instead of looking at recorded age given actual age, let’s look at actual age given recorded age. Suppose someone’s age is recorded as 30. What does that tell you about them?

With round-to-nearest, it tells you that they are certainly between 25 and 35. With stochastic rounding, they could be anywhere between 20 and 40. The probability distribution on this interval can be computed from Bayes’ theorem and depends on the prior distribution of ages. That is, if you know in general how ages are distributed over the interval (20, 40), you can use Bayes’ theorem to compute the posterior distribution on age, given that the age was recorded as 30.
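Here’s a sketch of that calculation, assuming integer ages and, purely for illustration, a uniform prior over the interval:

    import numpy as np

    ages = np.arange(20, 40)
    prior = np.ones(len(ages)) / len(ages)   # hypothetical uniform prior

    # probability that stochastic rounding records age a as 30
    likelihood = np.where(ages <= 30, (ages - 20) / 10, (40 - ages) / 10)

    posterior = prior * likelihood
    posterior /= posterior.sum()   # P(age | recorded as 30)

With a uniform prior the posterior is triangular, peaking at 30; a prior weighted toward younger ages would pull the posterior to the left.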

Stochastic rounding preserves more information than round-to-nearest on average, but less information in the case of a particular individual.


Crinkle crankle calculus

A crinkle crankle wall

A crinkle crankle wall, also called a serpentine wall, is a wavy wall that may seem to sacrifice some efficiency for aesthetics. The curves add visual interest, but they use more material than a straight wall. Except they don’t! They use more bricks than a straight wall of the same thickness but they don’t have to be as thick.

Crinkle crankle walls resist horizontal forces, like wind, better than a straight wall would. So if the alternative to a crinkle crankle wall one brick thick is a straight wall two or more bricks thick, the former saves material. How much material?

The amount of material used in the wall is proportional to the product of its length and thickness. Suppose the wall is shaped like a sine wave and consider a section of wall 2π long. If the wall has the shape of a sin(x), then we need to find the arc length of this curve. This works out to the following integral.

\int_0^{2\pi} \sqrt{1 + a^2 \cos^2(x)}\, dx

The parameter a is the amplitude of the sine wave. If a = 0, we have a flat wave, i.e. a straight wall, and so the length of this segment is 2π = 6.2832. If a = 1, the integral is 7.6404. So a section of wall is 22% longer, but uses half as much material per unit length as a wall two bricks thick.

The integral above cannot be computed in closed form in terms of elementary functions, so this would make a good homework exercise for a class covering numerical integration.

The integral can be computed in terms of special functions. It equals 4 E(-a²) where E is the “complete elliptic integral of the second kind.” This function is implemented as EllipticE in Mathematica and as scipy.special.ellipe in Python.
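Here’s a quick numerical check. Note that scipy.special.ellipe takes the parameter m = k², so ellipe(-a**2) is the E(-a²) above.

    from numpy import pi, sqrt, cos
    from scipy.integrate import quad
    from scipy.special import ellipe

    a = 1
    length, _ = quad(lambda x: sqrt(1 + a**2 * cos(x)**2), 0, 2*pi)
    print(length)            # 7.6404...
    print(4*ellipe(-a**2))   # same value via the elliptic integral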

As the amplitude a increases, the arc length of a section of wall increases. You could solve for the value of a to give you whatever arc length you like. For example, if a ≈ 2.6 then the length is twice that of a straight line, and so a crinkle crankle wall with that amplitude uses about as many bricks as a straight wall twice as thick.
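Here’s one way you might solve for the amplitude numerically, a sketch using SciPy’s brentq root finder:

    from numpy import pi
    from scipy.optimize import brentq
    from scipy.special import ellipe

    # amplitude making a section of wall twice as long as a straight wall
    a = brentq(lambda a: 4*ellipe(-a**2) - 4*pi, 1, 5)
    print(a)   # about 2.6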

***

Photo “Crinkle crankle wall, Fulbourn” by Bob Jones is licensed under CC BY-SA 2.0

Inverse trig function implementations

Programming languages differ in the names they use for inverse trig functions, and in some cases differ in their return values for the same inputs.

Choice of range

If sin(θ) = x, then θ is the inverse sine of x, right? Maybe. Depends on how you define the inverse sine. If -1 ≤ x ≤ 1, then there are infinitely many θ’s whose sine is x. So you have to make a choice.

The conventional choice is for inverse sine to return a value of θ in the closed interval

[-π/2, π/2].

Inverse tangent uses the same choice. However, the end points aren’t possible outputs, so inverse tangent typically returns a value in the open interval

(-π/2, π/2).

For inverse cosine, the usual choice is to return values in the closed interval

[0, π].
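Here’s a quick illustration of these conventions using Python’s math module:

    from math import asin, acos, atan

    print(asin(-1), asin(1))   # -pi/2 and pi/2, the ends of the closed interval
    print(acos(1), acos(-1))   # 0 and pi
    print(atan(-1000))         # -1.5698..., close to but above -pi/2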

Naming conventions

Aside from how to define inverse trig functions, there’s the matter of what to call them. The inverse of tangent, for example, can be written tan⁻¹, arctan, or atan. You might even see arctg or atn or some other variation.

Most programming languages I’m aware of use atan etc. Python uses atan, but SciPy uses arctan. Mathematica uses ArcTan.

Software implementations typically follow the conventions above, and will return the same value for the same inputs, as long as inputs and outputs are real numbers.

Arctangent with two arguments

Many programming languages have a function atan2 that takes two arguments, y and x, and returns an angle whose tangent is y/x. Note that the y argument to atan2 comes first: top-down order, numerator then denominator, not alphabetical order.

However, unlike the one-argument atan function, the return value may not be in the interval (-π/2, π/2). Instead, atan2 returns an angle in the same quadrant as the point (x, y). For example,

    atan2(-1, 1)

returns -π/4; because the point (1, -1) is in the 4th quadrant, it returns an angle in the 4th quadrant. But

    atan2(1, -1)

returns 3π/4. Here the point (-1, 1) and the angle 3π/4 are in the 2nd quadrant.
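Here’s a quick check of the quadrant behavior in Python:

    from math import atan2

    print(atan2(-1, 1))   # -0.7853981... = -pi/4, 4th quadrant
    print(atan2(1, -1))   #  2.3561944... = 3 pi/4, 2nd quadrant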

In Common Lisp, there’s no function named atan2; you just call atan with two arguments. Mathematica is similar: you call ArcTan with two arguments. However, Mathematica uses the opposite convention and takes x as its first argument and y as its second.

Complex values

Some software libraries support complex numbers and some do not. And among those that do support complex numbers, the return values may differ. This is because, as above, you have choices to make when extending these functions to complex inputs and outputs.

For example, in Python,

    math.asin(2)

raises a math domain error (a ValueError) and

    scipy.arcsin(2)

returns 1.5707964 + 1.3169578i.

In Common Lisp,

    (asin 2)

returns 1.5707964 – 1.3169578i. Both SciPy and Common Lisp return complex numbers whose sine equals 2, but they follow different conventions for which number to choose. That is, both

    scipy.sin(1.5707964 - 1.3169578j)
    scipy.sin(1.5707964 + 1.3169578j)

in Python and

    (sin #C(1.5707964 -1.3169578))
    (sin #C(1.5707964 +1.3169578))

in Common Lisp both return 2, aside from rounding error.

In C99, with the function casin (complex arc sine) from <complex.h>, the line

    double complex z = casin(2);

sets z to 1.5707964 + 1.3169578i. The Mathematica call

    ArcSin[2.]

returns 1.5707964 – 1.3169578i.

Branch cuts

The Common Lisp standardization committee did a very careful job of defining math functions for complex arguments. I’ve written before about this here. You don’t need to know anything about Lisp to read that post.

The committee ultimately decided to first rigorously define the two-argument arctangent function and bootstrap everything else from there. The post above explains how.

Other programming languages have made different choices and so may produce different values, as demonstrated above. I mention the Common Lisp convention because they did a great job of considering every detail, such as how to handle +0 and -0 in IEEE floating point.

arctan( k tan(x) )

I recently had to work with the function

f(x; k) = arctan( k tan(x) )

The semicolon between x and k is meant to imply that we’re thinking of x as the argument and k as a parameter.

This function is an example of the coming full circle theme I’ve written about several times. Here’s how a novice, a journeyman, and an expert might approach studying our function.

  • Novice: arctan( k tan(x) ) = kx.
  • Journeyman: You can’t do that!
  • Expert: arctan( k tan(x) ) ≈ kx for small x.

Novices often move symbols around without thinking about their meaning, and so someone might pull the k outside (why not?) and notice that arctan( tan(x) ) = x.

Someone with a budding mathematical conscience might conclude that since our function is nonlinear in x and in k, there’s not much that can be said without more work.

Someone with more experience might see that both tan(x) and arctan(x) have the form x + O(x³) and so

arctan( k tan(x) ) ≈ kx

should be a very good approximation for small x.
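A quick numerical check of the approximation:

    import numpy as np

    k, x = 3, 0.01
    print(np.arctan(k * np.tan(x)))   # 0.029992..., very close to
    print(k * x)                      # 0.03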

Here’s a plot of our function for varying values of k.

Plot of atan( k tan(x) ) for varying k

Each curve is fairly flat in the middle, with slope equal to its value of k.

As k increases, f(x; k) becomes more and more like a step function, equal to -π/2 for negative x and π/2 for positive x, i.e.

arctan( k tan(x) ) ≈ sgn(x) π/2

for large k. Here again we might have a discussion like the one above.

  • Novice: Set k = ±∞. Then ±∞ tan(x) = ±∞ and arctan(±∞) = ±π/2.
  • Journeyman: You can’t do that!
  • Expert: Well, you can if you interpret everything in terms of limits.
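A numerical check of the limiting behavior:

    import numpy as np

    k = 1000
    for x in (-1.0, -0.1, 0.1, 1.0):
        # each value is close to sign(x) * pi/2 = ±1.5707...
        print(x, np.arctan(k * np.tan(x)))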


Progress on the Collatz conjecture

The Collatz conjecture is for computer science what, until recently, Fermat’s last theorem was for mathematics: a famous unsolved problem that is very simple to state.

The Collatz conjecture, also known as the 3n+1 problem, asks whether the following function terminates for all positive integer arguments n.

    def collatz(n):
        if n == 1:
            return 1
        elif n % 2 == 0:
            return collatz(n // 2)   # integer division keeps n an integer
        else:
            return collatz(3*n + 1)

In words, this says to start with a positive integer. Repeatedly either divide it by 2 if it’s even, or multiply it by 3 and add 1 if it’s odd. Will this sequence always reach 1?

The Collatz conjecture is a great example of how hard it can be to thoroughly understand even a few lines of code.

Terence Tao announced today that he has new partial results toward proving the Collatz conjecture. His blog post and arXiv paper are both entitled “Almost all Collatz orbits attain almost bounded values.”

When someone like Tao uses the word “almost,” it is a term of art, a common word with a precise technical meaning. Here he is using “almost” as it is used in number theory, which is different from the way the word is used in measure theory.

I get email routinely from people who believe they have a proof of the Collatz conjecture. These emails are inevitably from amateurs. The proofs are always short, elementary, and self-contained.

The contrasts with Tao’s result are stark. Tao has won the Fields Medal, arguably the highest prize in mathematics [1], and a couple dozen other awards. Amateurs can and do solve open problems, but it’s uncommon.

Tao’s proof is 48 pages of dense, advanced mathematics, building on the work of other researchers. Even so, he doesn’t claim to have a complete proof, but partial results. That is how big conjectures typically fall: by numerous people chipping away at them, building on each other’s work.


[1] Some say the Abel prize is more prestigious because it’s more of a lifetime achievement award. Surely Tao will win that one too when he’s older.

Asimov’s question about π

In 1977, Isaac Asimov [1] asked how many terms of the slowly converging series

π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – …

would you have to sum before doing better than the approximation

π ≈ 355/113.

A couple years later Richard Johnsonbaugh [2] answered Asimov’s question in the course of an article on techniques for computing the sum of series. Johnsonbaugh said you would need at least N = 3,748,630 terms.

Johnsonbaugh’s answer is based on exact calculations. I wondered how well you’d do with N terms using ordinary floating point arithmetic. Would there be so much rounding error that the result is terrible?

I wrote the most direct implementation in Python, with no tricks to improve the accuracy.

    from math import pi

    s = 0
    N = 3748630
    for n in range(1, N+1):
        s += (-1)**(n+1) * 4/(2*n - 1)

    print(abs(s - pi))        # error in the partial sum
    print(abs(355/113 - pi))  # error in the rational approximation

I intended to follow this up by showing that you could do better by summing all the positive and negative terms separately, then doing one subtraction at the end. But the naive version actually does quite well. It’s essentially as accurate as 355/113, with both approximations having an error of 2.66764 × 10⁻⁷.

Extended precision with bc

Next, I translated my program to bc [3] so I could control the precision. bc lets you specify your desired precision with its scale parameter.

    /* run with bc -l so the math library function a() (arctangent) is available */
    scale = 20
    pi = 4*a(1)
    s = 0
    m = 3748630
    for (n = 1; n <= m; n++) {
        s += (-1)^(n+1) * 4/(2*n - 1)
    }
    pi - s   /* displays the error in the partial sum */

Floating point precision is between 15 and 16 decimal places. I added more precision by setting scale to 20, i.e. carrying out calculations to 20 decimal places, and summed the series again.

The absolute error in the series was less than the error in 355/113 in the 14th decimal place. When I used one less term in the series, its error was larger than the error in 355/113 in the 14th decimal place. In other words, the calculations suggest Johnsonbaugh found exactly the minimum number of terms needed.

I doubt Johnsonbaugh ever verified his result computationally. He doesn’t mention computer calculations in his paper [4], and it would have been difficult in 1979 to have access to the necessary hardware and software.

If he had access to an Apple II at the time, it would have run at 1 MHz. My calculation took around a minute to run on a 2 GHz laptop, so I’m guessing the same calculation would have taken a day or two on an Apple II. This assumes he could find extended precision software like bc on an Apple II, which is doubtful.

The bc programming language had been written four years earlier, so someone could have run a program like the one above on a Unix machine somewhere. However, such machines were hard to find, and their owners would have been reluctant to give up a couple days of compute time for a guest to run a frivolous calculation.


[1] Isaac Asimov, Asimov on Numbers, 1977.

[2] Richard Johnsonbaugh, Summing an Alternating Series. The American Mathematical Monthly, Vol. 86, No. 8 (1979), pp. 637–648.

[3] Notice that N from the Python program became m in bc. I’ve used bc occasionally for years, and didn’t know until now that you cannot use capital letters for variables in standard bc. I guess I never tried before. The next post explains why bc doesn’t allow capital letters in variable names.

[4] Johnsonbaugh’s paper does include some numerical calculations, but he only sums up 500 terms, not millions of terms, and it appears he only uses ordinary precision, not extended precision.

Lebesgue’s proof of Weierstrass’ theorem

A couple weeks ago I wrote about the Weierstrass approximation theorem, the theorem that says every continuous function on a closed finite interval can be approximated as closely as you like by a polynomial.

The post mentioned above uses a proof by Bernstein. And in that post I used the absolute value function as an example. Not only is |x| an example, you could go the other way around and use it as a step in the proof. That is, there is a proof of the Weierstrass approximation theorem that starts by proving the special case of |x| and then uses that result to build a proof for the general case.

There have been many proofs of Weierstrass’ theorem, and recently I ran across a proof due to Lebesgue. Here I’ll show how Lebesgue constructed a sequence of polynomials approximating |x|. It’s like pulling a rabbit out of a hat.

The starting point is the binomial theorem. If x and y are real numbers with |x| > |y| and r is any real number, then

(x+y)^r = \sum_{k=0}^\infty {r \choose k} x^{r-k} y^k

Now apply the theorem, substituting 1 for x and x² – 1 for y above, and you get

|x| = (1 + (x^2 - 1))^{1/2} = \sum_{k=0}^\infty {1/2 \choose k} (x^2-1)^k

The partial sums of the right hand side are a sequence of polynomials converging to |x| on the interval [-1, 1].
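Here’s a sketch of these partial sums in Python, using scipy.special.binom for the generalized binomial coefficients:

    import numpy as np
    from scipy.special import binom

    def lebesgue_poly(x, n):
        # partial sum of the series above, a polynomial in x
        return sum(binom(0.5, k) * (x**2 - 1)**k for k in range(n + 1))

    x = np.linspace(-1, 1, 5)
    print(lebesgue_poly(x, 50))   # close to |x|, though convergence is slow near 0
    print(np.abs(x))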

***

If you’re puzzled by the binomial coefficient with a top number that isn’t a positive integer, see the general definition of binomial coefficients. The top number can even be complex, and indeed the binomial theorem holds for complex r.

You might also be puzzled by the binomial theorem being an infinite sum. Surely if r is a positive integer we should get the more familiar binomial theorem, which is a finite sum. And indeed we do. The general definition of binomial coefficients ensures that if r is a positive integer, all the binomial coefficients with k > r are zero.

How to plot the spiral of Theodorus

You may have seen the spiral of Theodorus. It sticks a sequence of right triangles together to make a sort of spiral. Each triangle has a short side of length 1, and the hypotenuse of each triangle becomes the long leg of the next triangle as shown below.

Spiral of Theodorus

How would you plot this spiral? At each step, you need to draw a segment of length 1, perpendicular to the hypotenuse of the previous triangle. There are two perpendicular directions, and you want to choose the one that moves counterclockwise.

If we step outside the xy plane, we can compute the cross product of the unit vector in the z direction with the vector (x, y). The cross product will be perpendicular to both, and by the right-hand rule, it will point in the counterclockwise direction.

The cross product of (0, 0, 1) and (x, y, 0) is (-y, x, 0), so the direction we want to go in the xy plane is (-y, x). We divide this vector by its length to get a vector of length 1, then add it to our previous point. [1]

Here’s Python code to draw the spiral.

    import matplotlib.pyplot as plt
    
    def next_vertex(x, y):
        h = (x**2 + y**2)**0.5
        return (x - y/h, y + x/h)
    
    plt.axes().set_aspect(1)
    plt.axis('off')
    
    # base of the first triangle
    plt.plot([0, 1], [0, 0])
    
    N = 17
    x_old, y_old = 1, 0
    for n in range(1, N):
        x_new, y_new = next_vertex(x_old, y_old)
        # draw short side
        plt.plot([x_old, x_new], [y_old, y_new])
        # draw hypotenuse
        plt.plot([0, x_new], [0, y_new])
        x_old, y_old = x_new, y_new
    plt.savefig("theodorus.png")

If you’re not familiar with the plot function above, you might expect the two arguments to plot to be points. But the first argument is a list of x coordinates and the second a list of y coordinates.

Update: See the next post for how the hypotenuse angles fill in a circle.


[1] Here’s an alternative derivation using complex numbers.

Label the current vertex in the complex plane zₙ. Then zₙ/|zₙ| has length 1 and points in the same direction from the origin. Multiplying this point by i rotates it a quarter turn counterclockwise, so

zₙ₊₁ = zₙ + i zₙ/|zₙ|.
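Here’s what that version might look like in Python, a sketch equivalent to the program above (the output file name is my own):

    import matplotlib.pyplot as plt

    plt.axes().set_aspect(1)
    plt.axis('off')
    plt.plot([0, 1], [0, 0])   # base of the first triangle

    z = 1 + 0j                 # current vertex as a complex number
    for n in range(1, 17):
        w = z + 1j*z/abs(z)    # step one unit counterclockwise
        plt.plot([z.real, w.real], [z.imag, w.imag])   # short side
        plt.plot([0, w.real], [0, w.imag])             # hypotenuse
        z = w
    plt.savefig("theodorus2.png")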

Discriminant of a cubic

The discriminant of a quadratic equation

ax² + bx + c = 0

is

Δ = b² – 4ac.

If the discriminant Δ is zero, the equation has a double root, i.e. there is a unique x that makes the equation zero, and it counts twice as a root. If the discriminant is not zero, there are two distinct roots.

Cubic equations also have a discriminant. For a cubic equation

ax³ + bx² + cx + d = 0

the discriminant is given by

Δ = 18abcd – 4b³d + b²c² – 4ac³ – 27a²d².

If Δ = 0, the equation has a multiple root, but otherwise it has three distinct roots.

A change of variable can reduce the general cubic equation to a so-called “depressed” cubic equation of the form

x³ + px + q = 0

in which case the discriminant simplifies to

Δ = – 4p³ – 27q².
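Here’s a quick sanity check of both formulas on a cubic with a known double root, x³ − 3x + 2 = (x − 1)²(x + 2):

    def disc_cubic(a, b, c, d):
        return 18*a*b*c*d - 4*b**3*d + b**2*c**2 - 4*a*c**3 - 27*a**2*d**2

    def disc_depressed(p, q):
        return -4*p**3 - 27*q**2

    print(disc_cubic(1, 0, -3, 2))   # 0, because of the double root at x = 1
    print(disc_depressed(-3, 2))     # 0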

Here are a couple interesting connections. The idea of reducing a cubic equation to a depressed cubic goes back to Cardano (1501–1576). What’s called a depressed cubic in this context is known as the Weierstrass (1815–1897) form in the context of elliptic curves. That is, an elliptic curve of the form

y² = x³ + ax + b

is said to be in Weierstrass form. In other words, an elliptic curve is in Weierstrass form if the right hand side is a depressed cubic.

Furthermore, an elliptic curve is required to be non-singular, which means it must satisfy

4a³ + 27b² ≠ 0.

In other words, the discriminant of the right hand side is non-zero. In the context of elliptic curves, the discriminant is defined to be

Δ = -16(4a³ + 27b²)

which is the same as the discriminant above, except for a factor of 16 that simplifies some calculations with elliptic curves.

A note on fields

In the context of solving quadratic and cubic equations, we’re usually implicitly working with real or complex numbers. Suppose all the coefficients of a quadratic equation are real. If the discriminant is positive, there are two distinct real roots. If the discriminant is negative, there are two distinct complex roots, and these roots are complex conjugates of each other.

Similar remarks hold for cubic equations when the coefficients are all real. If the discriminant is positive, there are three distinct real roots. If the discriminant is negative, there is one real root and a complex conjugate pair of complex roots.

In the first section I only considered whether the discriminant was zero, and so the statements are independent of the field the coefficients come from.

For elliptic curves, one works over a variety of fields. Maybe real or complex numbers, but also finite fields. In most of the blog posts I’ve written about elliptic curves, the field is integers modulo a large prime.


Quantum leaps

A literal quantum leap is a discrete change, typically extremely small [1].

A metaphorical quantum leap is a sudden, large change.

I can’t think of a good metaphor for a small but discrete change. I was reaching for such a metaphor recently and my first thought was “quantum leap,” though that would imply something much bigger than I had in mind.

Sometimes progress comes in small discrete jumps, and only in such jumps. Or at least that’s how it feels.

There’s a mathematical model for this called the single big jump principle. If you make a series of jumps according to a fat tailed probability distribution, most of your progress will come from your largest jump alone.

Your distribution can be continuous, and yet there’s something subjectively discrete about it. If you have a Lévy distribution, for example, your jumps can be any size, and so they are continuous in that sense. But when the lion’s share of your progress comes from one jump, it feels discrete, as if the big jump counted and the little ones didn’t.
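Here’s a quick simulation sketch using SciPy’s levy distribution as the fat-tailed example:

    import numpy as np
    from scipy.stats import levy

    jumps = levy.rvs(size=1000)
    print(jumps.max() / jumps.sum())   # typically a large fraction of the total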


[1] A literal quantum leap, such as an electron moving from one energy level to another in a hydrogen atom, is on the order of a billionth of a billionth of a joule. A joule is roughly the amount of energy needed to bring a hamburger to your mouth.