Limitations on Venn diagrams

Why do Venn diagrams almost always show the intersections of three sets and not more? Can Venn diagrams be generalized to show all intersections of more sets?

That depends on the rules you give yourself for generalization. If you require that your diagram consist of circles, then three is the limit. As John Venn put it in the original paper on Venn diagrams [1]:

Beyond three terms circles fail us, since we cannot draw a fourth circle which shall intersect three others in the way required.

But Mr. Venn noted that you could create what we now call a Venn diagram using four ellipses and included the following illustration.

Venn diagram with four ellipses by John Venn

(It’s confusing that there is an X inside the diagram. Venn meant that to be an asterisk and not the same symbol as the X outside. He says in the paper “Thus the one which is asterisked is instantly seen to be ‘X that is Y and Z, but is not W’.” Maybe someone else, like the publisher, drew the diagram for him.)

So the answer to whether, or how far, it is possible to generalize the classic Venn diagram depends on permissible generalizations of a circle. If you replace circles with arbitrary closed curves then Venn diagrams exist for all orders. If you demand the curves have some sort of symmetry, there are fewer options. It’s possible to make a Venn diagram from five ellipses, and that may be the limit for ellipses.

A Venn diagram is a visualization device, so an important question is how far Venn diagrams remain effective as a visualization technique. This is an aesthetic / pedagogical question, not one with an objective answer, but in my mind the answer is four. Venn’s diagram made from four ellipses is practical for visualization, though it takes more effort to understand than the typical three-circle diagram.

Although my upper bound of four is admittedly subjective, it may be possible to make it objective post hoc. A Venn diagram made from n curves divides the plane into 2ⁿ regions [2], so four curves already produce 16 regions. In order to use more than four curves, you either have to gerrymander the curves or else tolerate some regions being much smaller than others. The former makes the diagram hard to understand, and the latter makes it hard to label the regions.

I suspect that if you make precise what it means for a curve to be simple [3], such as using ellipses or convex symmetric curves, and specify a maximum ratio between the largest and smallest bounded regions, then four curves will be the maximum.

***

[1] John Venn. On the Diagrammatic and Mechanical Representation of Propositions and Reasonings. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. July 1880.

[2] This includes the outside of the diagram, which represents the empty intersection. The diagram shows the intersections of 0, 1, 2, …, n sets, and the intersection of no sets is everything, i.e. the universal set. This last statement might seem like an arbitrary definition, but it can be justified using category theory.

[3] Simple in the colloquial sense, which is more restrictive than the technical mathematical sense of a simple curve.

Birthday problem approximation

The birthday problem is a party trick with serious practical applications. It’s well known to people who have studied probability, but the general public is often amazed by it.

If you have a group of 23 people, there’s a 50-50 chance that at least two people have the same birthday. With a larger group, say 30, it’s quite likely two birthdays are the same. This is not only a theoretical result based on certain modeling assumptions; it actually works in practice, essentially as predicted.

Variations of the birthday problem come up routinely in applications. For example, in cryptography it’s important to know the probability of secure hash collisions. Hash functions are deterministic, but for practical purposes they act random. If you are hashing into a space of N possible hash values, you can expect to hash about √N items before two of them have the same hash value.

The square root rule of thumb is very useful. For example, if you’re computing 128-bit hash values, there’s about a 50-50 chance of seeing two duplicate hash values after hashing about 2⁶⁴ items.

The square root heuristic works well for large N, but gives mediocre results for N as small as 365. When applied to the original birthday problem, it predicts even odds for seeing a pair of equal birthdays in a group of 19 people. That’s a little low, but not too far off.

As useful as the square root rule is, it is only good for finding when the probability of duplication is 1/2. What if you’d like to know when the probability of a collision is, say, 0.01?

Let N be the number of possible options and let r be the number of items chosen independently from the set of N options. Let P(N, r) be the probability that all r choices are distinct. Then

P(N, r) ≈ exp( −r²/2N).

This approximation [1] is valid when N is large and r is small relative to N. We could be more precise about the error bounds, but suffice it to say that bigger N is better.

When N = 365 and r = 23, the approximation above computes the probability that all 23 choices are distinct as 0.48, matching the canonical birthday problem and showing an improvement over the square root heuristic.
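Here’s a minimal sketch of the computation in Python (the function name is mine). It reproduces the 0.48 figure above, and inverting the approximation shows why the bare square root heuristic errs a little low: setting P(N, r) = 1/2 gives r = √(2N ln 2) ≈ 1.18 √N rather than √N.

    from math import exp, log, sqrt

    # Approximate probability that r independent choices from N options
    # are all distinct: P(N, r) ≈ exp(−r²/2N)
    def prob_all_distinct(N, r):
        return exp(-r**2 / (2 * N))

    print(prob_all_distinct(365, 23))  # ≈ 0.48, the classic birthday problem

    # Solving exp(−r²/2N) = 1/2 for r gives r = √(2N ln 2) ≈ 1.18 √N
    print(sqrt(2 * 365 * log(2)))      # ≈ 22.5, vs. 19 from the bare √N rule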

[1] Anthony C. Robin. The Birthday Distribution: Some More Approximations. The Mathematical Gazette, Vol. 68, No. 445 (October, 1984), pp. 204–206

(1 − z) / (1 + z)

“I keep running into the function f(z) = (1 − z)/(1 + z).” I wrote this three years ago and it’s still true.

This function came up implicitly in the previous post. Ramanujan’s excellent approximation for the perimeter of an ellipse with semi-axes a and b begins by introducing

λ = (a − b)/(a + b).

If the problem is scaled so that a = 1, then λ = f(b). Kummer’s series for the exact perimeter of an ellipse begins by introducing the same variable squared.

As this post points out, the function f(z) comes up in the Smith chart from electrical engineering, and is also useful in mental calculation of roots. It also comes up in mentally calculating logarithms.

The function f(z) is also useful for computing the tangent of angles near a right angle because

tan(π/4 − z) ≈ f(z)

with an error on the order of z³. So when z is small, the error is very, very small, much like the approximation sin(x) ≈ x for small angles.
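Here’s a quick numerical check of the cubic error, a minimal sketch:

    from math import tan, pi

    def f(z):
        return (1 - z) / (1 + z)

    # The error in tan(π/4 − z) ≈ f(z) is O(z³): shrinking z by a factor
    # of 10 should shrink the error by a factor of roughly 1000.
    for z in [0.1, 0.01, 0.001]:
        print(z, tan(pi/4 - z) - f(z))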

Error in Ramanujan’s approximation for ellipse perimeter

Ramanujan discovered an incredibly accurate approximation for the perimeter of an ellipse. This post will illustrate how accurate the approximation is and push its limits.

As with all computations involving ellipses, the error of Ramanujan’s approximation increases as eccentricity increases. But the error increases slowly, asymptotically approaching an upper bound that is remarkably small.

Let a and b be the semi-major and semi-minor axes of an ellipse. Then Ramanujan’s approximation for the perimeter of the ellipse is

\pi (a + b) \left(1 + \frac{3\lambda^2}{10 + \sqrt{4 - 3\lambda^2}}\right)

where λ = (a − b)/(a + b).

Example

To illustrate how accurate the approximation is, let’s apply it to a very large, highly eccentric ellipse: the orbit of Sedna, a dwarf planet discovered in 2003. Sedna has the most eccentric orbit of any of the dwarf planets, 0.8496 compared to 0.2488 for Pluto. Sedna is also about 12 times farther from the sun than Pluto.

The semi-major axis of Sedna’s orbit is 76 billion kilometers. Eccentricity e corresponds to aspect ratio √(1 − e²) (see this post), and so the semi-minor axis is about 40 billion kilometers. Let’s assume the orbit of Sedna is perfectly elliptical, which it is not, and that the semi-axes stated above are exact, which they are not. Then the length of Sedna’s orbit is on the order of 366 billion kilometers, and Ramanujan’s approximation has an error of about 53 kilometers.
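Here’s a sketch of that computation in Python, taking the figures above at face value and computing the exact perimeter as 4aE(e²), where E is the complete elliptic integral of the second kind in scipy’s parameter convention:

    from math import pi, sqrt
    from scipy.special import ellipe

    a = 76e9                # semi-major axis in km, as above
    e = 0.8496              # eccentricity of Sedna's orbit
    b = a * sqrt(1 - e**2)  # semi-minor axis, about 40 billion km

    lam = (a - b) / (a + b)
    approx = pi * (a + b) * (1 + 3*lam**2 / (10 + sqrt(4 - 3*lam**2)))
    exact = 4 * a * ellipe(e**2)  # scipy's ellipe takes the parameter m = e²

    print(exact)           # ≈ 366 billion km
    print(exact - approx)  # ≈ 53 km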

Error

Here’s a plot of the relative error when b = 1 and a varies.

The error appears to be approaching an asymptote, and in fact it is. The error is bounded by 4/π − 14/11 = 0.00051227…, as proved here.

The Cauchy distribution’s counter-intuitive behavior

Someone with no exposure to probability or statistics likely has an intuitive sense that averaging random variables reduces variance, though they wouldn’t state it in those terms. They might, for example, agree that the average of several test grades gives a better assessment of a student than a single test grade. But data from a Cauchy distribution doesn’t behave this way.

Averages and scaling

If you have four independent random variables, each normally distributed with the same scale parameter σ, then their average is also normally distributed but with scale parameter σ/2.

If you have four independent random variables, each Cauchy distributed with the same scale parameter σ, then their average is also Cauchy distributed with the exact same scale parameter σ.

So the normal distribution matches common intuition, but the Cauchy distribution does not.

In the case of random variables with a normal distribution, the scale parameter σ is also the standard deviation. In the case of random variables with a Cauchy distribution, the scale parameter σ is not the standard deviation because Cauchy random variables don’t have a variance, so they don’t have a standard deviation.
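A simulation makes the contrast concrete. Since Cauchy samples have no variance, the sketch below measures spread with the interquartile range instead (a minimal sketch using NumPy; the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(20250101)  # arbitrary seed
    n, k = 100_000, 4

    def iqr(x):
        # Interquartile range: a measure of spread that exists
        # even for distributions with no variance
        q1, q3 = np.percentile(x, [25, 75])
        return q3 - q1

    # Spread of single draws vs. averages of four draws
    print(iqr(rng.standard_normal(n)),
          iqr(rng.standard_normal((n, k)).mean(axis=1)))  # shrinks by half
    print(iqr(rng.standard_cauchy(n)),
          iqr(rng.standard_cauchy((n, k)).mean(axis=1)))  # essentially unchanged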

Modeling

Some people object that nothing really follows a Cauchy distribution because the Cauchy distribution has no mean or variance. But nothing really follows a normal distribution either. All probability distributions are idealizations. The question of any probability distribution is whether it adequately captures the aspect of reality it is being used to model.

Mean

Suppose some phenomenon appears to behave like it has a Cauchy distribution, with no mean. Alternately, suppose the phenomenon has a mean, but this mean is so variable that it is impossible to estimate. There’s no practical difference between the two.

Variance

And in the alternate case, suppose there is a finite variance, but the variance is so large that it is impossible to estimate. If you take the average of four observations, the result is still so variable that the variance is impossible to estimate. You’ve cut the theoretical variance by a factor of four, but that makes no difference. Again this is practically indistinguishable from a Cauchy distribution.

Truncating

Now suppose you want to tame the Cauchy distribution by throwing out samples with absolute value greater than M. Now you have a truncated Cauchy distribution, and it has finite mean and variance.

But how do you choose M? If you don’t have an objective reason to choose a particular value of M, you would hope that your choice doesn’t matter too much. And that would be the case for a thin-tailed probability distribution like the normal, but it’s not true of the Cauchy distribution.

The variance of the truncated distribution will be approximately proportional to M, so by choosing M you choose the variance. If you double your cutoff for outliers that are to be discarded, you approximately double the variance of what’s left. Your choice of M matters a great deal.
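For a standard Cauchy distribution truncated to [−M, M], the variance works out to (M − arctan M)/arctan M, which grows linearly in M. A minimal sketch:

    from math import atan

    # Variance of a standard Cauchy truncated to [−M, M]. The mean is 0
    # by symmetry, so the variance is
    #   E[X²] = ∫ x²/(1+x²) dx / ∫ 1/(1+x²) dx  over [−M, M]
    #         = (M − arctan M) / arctan M ≈ 2M/π for large M
    def truncated_cauchy_variance(M):
        return (M - atan(M)) / atan(M)

    for M in [10, 20, 40, 80]:
        print(M, truncated_cauchy_variance(M))  # roughly doubles as M doubles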

Arithmetic, Geometry, Harmony, and Gold

I recently ran across a theorem connecting the arithmetic mean, geometric mean, harmonic mean, and the golden ratio. Each of these comes up fairly often, and there are elegant connections between them, but I don’t recall seeing all four together in one theorem before.

Here’s the theorem [1]:

The arithmetic, geometric, and harmonic means of two positive real numbers are the lengths of the sides of a right triangle if, and only if, the ratio of the arithmetic to the harmonic mean is the Golden Ratio.

The proof given in [1] is a straightforward calculation, only slightly longer than the statement of the theorem.

The conclusion of the theorem stops short of saying how to construct the triangle, though this is a simple exercise, which we carry out here.

Given two positive numbers, a and b, the three means are defined as follows.

AM = (a + b)/2
GM = √(ab)
HM = 2ab/(a + b)

Denote the Golden Ratio by

φ = (1 + √5)/2.

Then the equation AM/HM = φ is equivalent to the quadratic equation

a² + (2 − 4φ)ab + b² = 0.

The means are all homogeneous functions of a and b, i.e. if we multiply a and b by a constant, we multiply the three means by the same constant. Therefore we can set one of the parameters to 1 without loss of generality. Setting b = 1 gives

a² + (2 − 4φ)a + 1 = 0

and so there are two solutions:

a = 2φ − 3

and

a = 2φ + 1.

However, there is in a sense only one solution: the two solutions are reciprocals of each other, reversing the roles of a and b. So while there are two solutions to the quadratic equation, there is only one triangle, up to similarity.
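A few lines of Python confirm the construction (a quick check, not part of the proof):

    from math import sqrt

    phi = (1 + sqrt(5)) / 2
    a, b = 2*phi + 1, 1      # the solution above, with b = 1

    am = (a + b) / 2         # arithmetic mean
    gm = sqrt(a * b)         # geometric mean
    hm = 2*a*b / (a + b)     # harmonic mean

    print(am / hm)               # φ, as the theorem requires
    print(am**2, gm**2 + hm**2)  # equal: AM is the hypotenuse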

[1] Angelo Di Domenico. The Golden Ratio: The Right Triangle: And the Arithmetic, Geometric, and Harmonic Means. The Mathematical Gazette Vol. 89, No. 515 (July, 2005), p. 261

Ceva, cevians, and Routh’s theorem

I keep running into Edward John Routh (1831–1907). He is best known for the Routh-Hurwitz stability criterion but he pops up occasionally elsewhere. The previous post discussed Routh’s mnemonic for moments of inertia and his “stretch” theorem. This post will discuss his triangle theorem.

Before stating Routh’s theorem, we need to say what a cevian is. Giovanni Ceva (1647–1734) was an Italian geometer, best known for Ceva’s theorem, and for a construction in that theorem now known as a cevian.

A cevian is a line from the vertex of a triangle to the opposite side. Draw three cevians by connecting each vertex of a triangle to a point on its opposite side. If the cevians intersect at a point, Ceva’s theorem says something about how the lines divide the sides. If the cevians form a triangle, Routh’s theorem finds the area of that triangle.

Routh’s theorem is a generalization of Ceva’s theorem because if the cevians intersect at a common point, the area of the triangle formed is zero, and then Routh’s area equation implies Ceva’s theorem.

Let A, B, and C be the vertices of a triangle and let D, E, and F be the points where the cevians from A, B, and C respectively meet the opposite sides.

Let x, y, and z be the ratios into which each side is divided by the cevians. Specifically let x = FB/AF, y = DC/BD, and z = EA/CE.

Then Routh’s theorem says the area of the triangle formed by the cevians, relative to the area of the original triangle, is

\frac{(xyz - 1)^2}{(xy + y + 1)(yz + z + 1)(zx + x + 1)}

If the cevians intersect at a point, the area of the triangle is 0, which implies xyz = 1, which is Ceva’s theorem.
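Here’s a sketch that verifies the formula numerically, computing the cevian intersections directly; the triangle and the ratios are arbitrary choices for illustration:

    import numpy as np

    def cross2(u, v):
        return u[0]*v[1] - u[1]*v[0]

    def intersect(p1, p2, p3, p4):
        # Intersection of the line through p1, p2 with the line through p3, p4
        d1, d2 = p2 - p1, p4 - p3
        t = cross2(p3 - p1, d2) / cross2(d1, d2)
        return p1 + t * d1

    def area(p, q, r):
        return abs(cross2(q - p, r - p)) / 2

    A, B, C = map(np.array, [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)])
    x, y, z = 2.0, 3.0, 0.5    # arbitrary ratios, defined as above

    F = A + (B - A) / (1 + x)  # F on AB with x = FB/AF
    D = B + (C - B) / (1 + y)  # D on BC with y = DC/BD
    E = C + (A - C) / (1 + z)  # E on CA with z = EA/CE

    P = intersect(A, D, B, E)
    Q = intersect(B, E, C, F)
    R = intersect(C, F, A, D)

    routh = (x*y*z - 1)**2 / ((x*y + y + 1)*(y*z + z + 1)*(z*x + x + 1))
    print(area(P, Q, R) / area(A, B, C), routh)  # the two values agree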

Moments of inertia mnemonic

Edward John Routh (1831–1907) came up with a mnemonic for summarizing many formulas for moment of inertia of a solid rotating about an axis through its center of mass.

Routh’s mnemonic is

I = MS / k

where M is the mass of an object, S is the sum of the squares of the semi-axes, and k is 3, 4, or 5 depending on whether the object is rectangular, elliptical, or ellipsoidal respectively.

This post will show how a variety of formulas fit into Routh’s framework.

Rectangular solids

Suppose we have a box whose base is a rectangle with sides of length a and b, and we’re rotating the box about the vertical axis through the center of mass. The moment of inertia is

I = M(a² + b²) / 12.

The semi-axes have length a/2 and b/2 and so the formula above fits into Routh’s mnemonic with k = 3:

I = M( (a/2)² + (b/2)² ) / 3.

Why did Routh state his theorem in terms of semi-axes rather than axes? Because circles and spheres are typically described in terms of radius, and ellipses are described in terms of semi-axes.

Cylinders

The moment of inertia for a (circular) cylinder about its central axis is

I = Mr² /2.

From Routh’s perspective, there are two semi-axes perpendicular to the central axis, both of length r. So his mnemonic could calculate the moment of inertia as

I = M(r² + r²)/4

using k = 4.

For an elliptical cylinder, where the ellipse has semi-major axis a and semi-minor axis b, the moment of inertia is

I = M(a² + b²)/4

which reduces to the circular cylinder result when a = b = r.

Spheres and ellipsoids

The moment of inertia of a sphere about a line through its center is

I = 2Mr² / 5.

Again there are two semi-axes perpendicular to the line, both of length r, and so we get the result above using Routh’s mnemonic with k = 5.

For an ellipsoid with semi-axes a, b, and c, rotated about the axis corresponding to c, the moment of inertia is

I = M(a² + b²)/5.

Thin rod

The moment of inertia for a thin rod of length L rotated about its center is

I = ML²/12.

This can be derived from the case of a rectangular solid with length L and negligible width.

Note that the formula for moment of inertia of a cylinder does not apply because we are rotating the rod about its middle, not along the axis running the length of the rod.
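The cases above are easy to check against the mnemonic in a few lines of code (a minimal sketch; the function name and sample values are mine):

    # Routh's mnemonic: I = M·S/k, where S is the sum of the squares of
    # the semi-axes perpendicular to the rotation axis
    def routh(M, semi_axes, k):
        return M * sum(s**2 for s in semi_axes) / k

    M, a, b, r, L = 1.0, 2.0, 3.0, 1.5, 4.0

    print(routh(M, [a/2, b/2], 3), M*(a**2 + b**2)/12)  # rectangular box
    print(routh(M, [r, r], 4),     M*r**2/2)            # circular cylinder
    print(routh(M, [r, r], 5),     2*M*r**2/5)          # sphere
    print(routh(M, [L/2, 0], 3),   M*L**2/12)           # thin rod about center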

Routh’s stretch rule

Moving a point mass in a direction parallel to the axis of rotation doesn’t change its moment of inertia. The continuous version of this observation means that we can stretch the shapes without changing their inertia if we stretch them in the direction of the axis of rotation. This means the rules above apply to more general shapes.

Note that this post refers to physical moments of inertia; the moments that come up in statistics are closely related.

Binomial bound

I recently came across an upper bound I hadn’t seen before [1]. Given a binomial coefficient C(r, k), let

n = min(k, r − k)

and

m = r − n.

Then for any ε > 0,

C(n + m, n) ≤ (1 + ε)ⁿ⁺ᵐ / εⁿ.

The proof follows quickly from applying the binomial theorem to (1 + ε)ⁿ⁺ᵐ: the expansion includes the term C(n + m, n) εⁿ, and dividing by εⁿ gives the bound.

While I could imagine how a non-optimal choice of ε might be convenient in some context, it’s natural to ask how good the upper bound is for the best ε, which works out to be ε = n/m.

A little algebra shows this value of ε leads to

C(n + m, n) ≤ (n + m)ⁿ⁺ᵐ / (nⁿ mᵐ).

Note that while the original bound is not symmetric in n and m, the optimal bound is.

Returning to the original notation C(r, k), let’s see how tight the optimal bound is by plotting, as a function of r, the maximum relative error as k varies.

The maximum relative error, over the range plotted, is very roughly r/10.
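Here’s one way to carry out the computation behind such a plot, a minimal sketch:

    from math import comb

    # Optimal bound: C(n + m, n) ≤ (n + m)^(n+m) / (n^n m^m)
    def bound(n, m):
        return (n + m)**(n + m) / (n**n * m**m)

    def max_relative_error(r):
        errs = []
        for k in range(1, r):
            n = min(k, r - k)
            m = r - n
            exact = comb(r, k)  # equals C(n + m, n)
            errs.append((bound(n, m) - exact) / exact)
        return max(errs)

    for r in [10, 20, 40, 80]:
        print(r, max_relative_error(r))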

[1] Grzegorz Łysik. The ε-binomial inequality. The Mathematical Gazette. Vol. 92, No. 523 (March 2008), pp. 97–99

Separable functions in different contexts

I was skimming through the book Mathematical Reflections [1] recently. The authors discuss a set of generalizations [2] of the Star of David theorem from combinatorics, which says

\gcd\left(\binom{n - 1}{r - 1}, \binom{n}{r+1}, \binom{n+1}{r}\right) = \gcd\left(\binom{n-1}{r}, \binom{n+1}{r+1}, \binom{n}{r-1}\right)

The theorem is so named because if you draw a six-pointed star by connecting entries in Pascal’s triangle, each side of the equation corresponds to the vertices of one of the two overlapping triangles that form the star.

diagram illustrating the Star of David Theorem

One such generalization is the following.

\binom{n - \ell}{r - \ell} \binom{n - k}{r} \binom{n - k - \ell}{r - k} = \binom{n-k}{r-k} \binom{n - \ell}{r} \binom{n - k - \ell}{r - \ell}

This theorem also has a geometric interpretation, connecting vertices within Pascal’s triangle.

The authors point out that the binomial coefficient is a separable function of three variables, and that their generalized Star of David theorem is true for any separable function of three variables.

The binomial coefficient C(n, k) is a function of two variables, but you can think of it as a function of three variables: n, k, and n − k. That is

{n \choose k} = f(n) \, g(k) \, g(n-k)

where f(n) = n! and g(k) = 1/k!.
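The generalized identity is easy to check numerically over a small range of parameters, a quick sketch:

    from math import comb

    # Both sides of the generalized Star of David identity above,
    # with l standing in for ℓ
    def lhs(n, k, l, r):
        return comb(n - l, r - l) * comb(n - k, r) * comb(n - k - l, r - k)

    def rhs(n, k, l, r):
        return comb(n - k, r - k) * comb(n - l, r) * comb(n - k - l, r - l)

    print(all(lhs(n, k, l, r) == rhs(n, k, l, r)
              for n in range(10, 14) for r in range(4, 7)
              for k in range(1, 4) for l in range(1, 4)))  # True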

I was surprised to see the term separable function outside of a PDE context. My graduate work was in partial differential equations, and so when I hear separable function my mind goes to separation of variables as a technique for solving PDEs.

Coincidentally, I was looking at separable coordinate systems recently. These are coordinate systems in which the Helmholtz equation has separable solutions, i.e. coordinate systems in which the separation of variables technique works. The Laplacian can take on very different forms in different coordinate systems, and if possible you’d like to choose a coordinate system in which a PDE you care about is separable.

[1] Peter Hilton, Derek Holton, and Jean Pedersen. Mathematical Reflections. Springer, 1996.

[2] Hilton et al. refer to a set of theorems as generalizations of the Star of David theorem, but these theorems are not strictly generalizations in the sense that the original theorem is clearly a special case of the generalized theorems. The theorems are related, and I imagine with more effort I could see how to prove the older theorem from the newer ones, but it’s not immediately obvious.