Random projection

Last night after dinner, the conversation turned to high-dimensional geometry. (I realize how odd that sentence sounds; I was with some unusual company.)

Someone brought up the fact that two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal. This is a surprising but well known fact. Next the conversation turned to something also surprising but not so well known.

Suppose you’re in a 20,000 dimensional space and you choose 10,000 vectors at random. Now you choose one more, and ask what angle the new vector makes with the space spanned by the 10,000 previous vectors. The new vector is very likely to have a very small component in the direction of each of the previous vectors individually, but what about their span?

There are several ways to approach this problem. You can find the exact distribution on the angle analytically, but I thought it might be more interesting to demonstrate this by actually drawing random vectors. I started to write a simulation and discovered that the Python installation on my laptop is corrupted, and so I decided to take a third approach. I’ll give an informal explanation of what’s going on. Those who wish could either fill in the details to write a proof [1] or code it up to write a simulation.

First of all, what does it mean to pick a vector randomly from a vector space? We say we want to choose vectors uniformly, but that isn’t actually possible. What we actually mean is we want to pick vectors so that their directions are uniform. That’s possible, and we only need to work with unit vectors anyway.

The way to generate uniform random samples of unit vectors in n dimensions is to first choose a standard normal random value for each coordinate, then normalize so the vector has length 1.

Suppose we do as we discussed above. We work in 20,000 dimensions and choose 10,000 random vectors. Call their span S. Then we choose another random vector v. What angle does v make with S? You can’t simply project v onto each of the basis vectors for S because even though they are nearly orthogonal, they’re not exactly orthogonal.

It turns out that it doesn’t matter what the vectors are that defined S. All that matters is that S is almost certainly a 10,000 dimensional subspace. By symmetry it doesn’t matter which subspace it is, so we’ll assume for convenience that it’s the subspace spanned by the first 10,000 unit vectors. The angle that v makes with S is the angle between v and its projection w onto S, and w is simply the vector you get by zeroing out the last 10,000 components of v.

The cosine of the angle between two unit vectors is their dot product. Our vector v is a unit vector, but its projection w is not. So the cosine of the angle is

cos θ = v·w/||w||

Now v·w is the sum of the squares of the first half of v‘s components. This is probably very close to 1/2 since v is a unit vector. This is also w·w, and so ||w|| is probably very close to the square root of 1/2. And so cos θ is very nearly √2/2, which means θ = 45°.

So in summary, if you generate 10,000 vectors in 20,000 dimensions, then find the angle of a new random vector with the span of the previous vectors, it’s very likely to be very near 45°.

Related posts

[1] If you’re in an n-dimensional space and S has dimension m < n, then cos² θ has the distribution of X/(XY) where X ~ χ²(m) and Y ~ χ²(nm), which has distribution beta(m/2, (nm)/2). If mn/2 and m is large, this is a beta distribution concentrated tightly around 1/2, and so cos θ is concentrated tightly around √2/2. More generally, if m and n are large, cos² θ is concentrated around m/n.

A truly horrible random number generator

I needed a bad random number generator for an illustration, and chose RANDU, possibly the worst random number generator that was ever widely deployed. Donald Knuth comments on RANDU in the second volume of his magnum opus.

When this chapter was first written in the late 1960’s, a truly horrible random number generator called RANDU was commonly used on most of the world’s computers.

This generator starts with an odd seed x0 and generates subsequent values via

xn+1 = 65539 xn mod 231

I needed to generate 8-bit numbers, and I took the lowest eight bits of each value from RANDU. (To be fair, the highest 8 bits would have been better, but my goal was to find a bad RNG, so I used the lowest.)

To demonstrate the generator’s (lack of) quality I made a histogram of sorts with a 16 by 16 grid, one cell for each possible 8-bit number. I generated a large number of samples and plotted the grid. Here’s what I got:

RANDU histogram

Here’s what I get with a better random number generator:

randint histogram

I was looking for something bad, but I didn’t realize RANDU was that bad. The white stripes represent the generated values and the black stripes values that are never generated. So out of 256 possible 8-bit numbers, this generator only ever outputs 64 of them. (I used 33 as my seed. I might have gotten different vertical stripes if I had chosen a different seed, but I’d still get stripes.) Also, the frequencies of the values it does take on have suspiciously little variation.

You can see the pattern of values RANDU takes on by printing out the first few generated values in hex:


The last hex digit cycles 3, 9, b, 1, 3, 9, b, 1, … producing only four possible values.

We could prove the statements empirically demonstrated above by noting that

xn = 65539n x0 mod 2k

for k = 31, but also for smaller k. We could set k = 8 and prove that there are only 64 values, or set k = 4 and prove that there are only four final hex digits.

Related posts

Maybe you should’t script it after all

Programmers have an easier time scaling up than scaling down. You could call this foresight or over-engineering, depending on how things work out. Scaling is a matter of placing bets.

Experienced programmers are rightfully suspicious of claims that something only needs to be done once, or that quick-and-dirty will be OK [*]. They’ve been burned by claims that something was “temporary” and they have war stories in which they did more than required and were later vindicated. These stories make good blog posts.

But some things really do need to only be done once, or only so infrequently that it might as well be only once. And it might be OK for intermediate steps to be quick-and-dirty if the deliverable is high quality.

As a small business owner and a former/occasional programmer, I think about this often. For years I had a little voice in my head say “You really should script this.” And I have automated a few common tasks. But I’ve made peace with the fact that I often do things that (1) could be done far more elegantly and efficiently, and that (2) I will likely never do again [**].

Related posts

[*] “People forget how fast you did a job, but they remember how well you did it.” — Howard Newton

[**] I include in “never do again” things I might do in the future, but far enough in the future that I won’t remember where I saved the script I wrote to do the task last time, if I saved it. Or I saved the script, was able to find it, but it depends on a library that has gone away. Or the task is just different enough that I’d need to practically rewrite the script. Or …

Squircle perimeter and the isoparametric problem

If you have a fixed length of rope and you want to enclose the most area inside the rope, make it into a circle. This is the solution to the so-called isoparametric problem.

Dido’s problem is similar. If one side of your bounded area is given by a straight line, make your rope into a semi-circle.

This post looks at a similar problem for a squircle. Peter Panholzer mentioned the problem of finding the squircle exponent that led to the largest area in proportion to its arclength. This sounds like the isoparametric problem, but it’s not.

The isoparametric problem holds perimeter constant and lets the shape enclosed vary, maximizing the area. The question here is to hold the axes constant and maximize the ratio of the area to the perimeter. Panholzer reports the maximum occurs at p = 4.39365.

Computing the perimeter

The volume of a squircle can be found in closed form, and I’ve mentioned the equation a few times, for example here. The perimeter, however, cannot be found in closed form, as far as I know, except for special exponents.

We can solve for y as a function of x and find the arclength using the formula taught in calculus courses. However, the derivative of y has a singularity at x = 1. By switching to polar coordinates, we can find arclength in terms of an integrand with no singularities. We can also simplify things a little by computing the total arclength as 4 times the arclength in the first quadrant; this avoids having to take absolute values.

The following Python code computes the perimeter and the ratio of the area to the perimeter.

    from scipy import sin, cos, pi
    from scipy.integrate import quad
    from scipy.special import gamma
    def r(t, p):
        return (cos(t)**p + sin(t)**p)**-(1/p)
    def drdt(t, p):
        deriv = (cos(t)**p + sin(t)**p)**(-1-1/p)
        deriv *= cos(t)**(p-1)*sin(t) - sin(t)**(p-1)*cos(t)
        return deriv
    def integrand(t, p):
        return (drdt(t,p)**2 + r(t,p)**2)**0.5
    def arclength(p):
        integral = quad(lambda t: integrand(t,p), 0, pi/2)[0]
        return 4*integral
    def area(p):
        return 4*gamma(1 + 1/p)**2 / gamma(1 + 2/p)
    def ratio(p):
        return area(p) / arclength(p)

Basic geometry tells us the ratio is 1/2 when p = 2 and we have a circle. The ratio is also 1/2 when p = ∞, i.e. when we have a square. We can use the code above to find that the ratio when p = 0.528627, so there is at least one local maximum for values of p between 2 and ∞.

Here’s a plot of the ratio of area to perimeter as a function of p.

ratio of area to perimeter for squircle

The plot is quite flat for large values of p, but if we zoom in we can see that there is a local maximum near 4.4.

close up of previous graph near the maximum

When I find the maximum of the ratio function using Brent’s method (scipy.optimize.brent) I find p = 4.39365679, which agrees with the value Panholzer stated.

Here’s a plot of the squircle corresponding to this value of p.

squircle with largest area to perimeter ratio

Back to the isoparametric problem

Now why doesn’t this contradict the isoparametric problem? Area scales quadratically but perimeter scales linearly. If you don’t hold perimeter constant, you can find a larger ratio of area to volume by making the perimeter larger. And that’s what we’ve done. When p = 2, we have a unit circle, with area π and perimeter 2π. When p = 4.39365679 we have area 3.750961567 and permimeter 7.09566295. If we were to take a circle with the same perimeter, it would have area 4.00660097, larger than the squircle we started with.

Related posts

Taking the derivative of a muscle car

I’ve been getting a lot of spam lately saying my web site does not rank well on “certain keywords.” This is of course true: no web site ranks well for every keyword.

I was joking about this on Twitter, saying that my site does not rank well for women’s shoes, muscle cars, or snails because I don’t write about these topics. J. D. Long replied saying I should write more about muscle cars.

My first thought was to do an example drawing a Shelby Cobra with cubic splines. This would not be such an artificial example. Mathematical splines are named after physical splines, devices that have been used in designing, among other things, cars.

Instead, I downloaded an image of a Shelby Cobra from Wikipedia

Original image

and played with it in Mathematica. Specifically, I applied a Sobel filter which you can think of as a kind of derivative, looking for edges, places where the pixel values have a sharp change.

Sobol filtered image

Here’s the same image with the colors reversed.

Cobra image reversed

Here’s the Mathematica code to do the edge detection.

    cobra = Import["c:/users/mail/desktop/cobra.jpg"]
    ImageConvolve[cobra, {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}]

Safe Harbor and the calendar rollover problem

elderly woman

Data privacy is subtle and difficult to regulate. The lawmakers who wrote the HIPAA privacy regulations took a stab at what would protect privacy when they crafted the “Safe Harbor” list. The list is neither necessary or sufficient, depending on context, but it’s a start.

Extreme values of any measurement are more likely to lead to re-identification. Age in particular may be newsworthy. For example, a newspaper might run a story about a woman in the community turning 100. For this reason, the Safe Harbor previsions require that ages over 90 be lumped together. Specifically,

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.

One problem with this rule is that “age 90” is a moving target. Suppose that last year, in 2018, a data set recorded that a woman was born in 1930 and had a child in 1960. This data set was considered de-identified under the Safe Harbor provisions and published in a medical journal. On New Years Day 2019, does that data suddenly become sensitive? Or on New Years Day 2020? Should the journal retract the paper?!

No additional information is conveyed by the passage of time per se. However, if we knew in 2018 that the woman in question was still alive, and we also know that she’s alive now in 2019, we have more information. Knowing that someone born in 1930 is alive in 2019 is more informative than knowing that the same person was alive in 2018; there are fewer people in the former category than in the latter category.

The hypothetical journal article, committed to print in 2018, does not become more informative in 2019. But an online version of the article, revised with new information in 2019 implying that the woman in question is still alive, does become more informative.

No law can cover every possible use of data, and it would be a bad idea to try. Such a law would be both overly restrictive in some cases and not restrictive enough in others. HIPAA’s expert determination provision allows a statistician to say, for example, that the above scenario is OK, even though it doesn’t satisfy the letter of the Safe Harbor rule.

Related posts

Data privacy Twitter account

My newest Twitter account is Data Privacy (@data_tip). There I post tweets about ways to protect your privacy, statistical disclosure limitation, etc.

I had a clever idea for the icon, or so I thought. I started with the default Twitter icon, a sort of stylized anonymous person, and colored it with the same blue and white theme as the rest of my Twitter accounts. I think it looked so much like the default icon that most people didn’t register that it had been customized. It looked like an unpopular account, unlikely to post much content.

Now I’ve changed to the new icon below, and the number of followers is increasing.
data tip icon

Related pages

Ratio of Lebesgue norm ball volumes

As dimension increases, the ratio of volume between a unit ball and a unit cube goes to zero. Said another way, if you have a high-dimensional ball inside a high-dimensional box, nearly all the volume is in the corners. This is a surprising result when you first see it, but it’s well known among people who work with such things.

In terms of Lp (Lebesgue) norms, this says that the ratio of the volume of the 2-norm ball to that of the ∞-norm ball goes to zero. More generally, you could prove, using the volume formula in the previous post, that if p < q, then the ratio of the volume of a p-norm ball to that of a q-norm ball goes to zero as the dimension n goes to infinity.

Proof sketch: Write down the volume ratio, take logs, use the asymptotic series for log gamma, slug it out.

Here’s a plot comparing p = 2 and q = 3.

Plot of volume ratio for balls in L2 and L3 norm as dimension increases

Related posts

Higher dimensional squircles

The previous post looked at what exponent makes the area of a squircle midway between the area of a square and circle of the same radius. We could ask the analogous question in three dimensions, or in any dimension.

(What do you call a shape between a cube and a sphere? A cuere? A sphube?)


The sphube

In more conventional mathematical terminology, higher dimensional squircles are balls under Lp norms. The unit ball in n dimensions under the Lp norm has volume

2^n \frac{\Gamma\left(1 + \frac{1}{p}\right)^n}{\Gamma\left( 1 + \frac{n}{p} \right)}

We’re asking to solve for p so the volume of a p-norm ball is midway between that of 2-norm ball and an ∞-norm ball. We can compute this with the following Mathematica code.

    v[p_, n_] := 2^n Gamma[1 + 1/p]^n / Gamma[1 + n/p]
            v[p, n] - (2^n + v[2, n])/2, 
            {p, 3}
        {n, 2, 10}

This shows that the value of p increases steadily with dimension:


We saw the value 3.16204 in the previous post. The result for three dimensions is 3.43184, etc. The image above uses the solution for n = 3, and so it has volume halfway between that of a sphere and a cube.

In order to keep the total volume midway between that of a cube and a sphere, p has to increase with dimension, making each 2-D cross section more and more like a square.

Here’s the Mathematica code to draw the cuere/sphube.

    p = 3.43184
         Abs[x]^p + Abs[y]^p + Abs[z]^p == 1, 
         {x, -1, 1}, 
         {y, -1, 1}, 
         {z, -1, 1}

History of the “Squircle”

Architect Peter Panholzer coined the term “squircle” in the summer of 1966 while working for Gerald Robinson. Robinson had seen a Scientific American article on the superellipse shape popularized by Piet Hein and suggested Panholzer use the shape in a project.

Piet Hein used the term superellipse for a compromise between an ellipse and a rectangle, and the term “supercircle” for the special case of axes of equal length. While Piet Hein popularized the superellipse shape, the discovery of the shape goes back to Gabriel Lamé in 1818.

Squircle with p = 3.162034

You can find more on the superellipse and squircle by following these links, but essentially the idea is to take the equation for an ellipse or circle and replace the exponent 2 with a larger exponent. The larger the exponent is, the closer the superellipse is to being a rectangle, and the closer the supercircle/squircle is to being a square.

Panholzer contacted me in response to my article on squircles. He gives several pieces of evidence to support his claim to have been the first to use the term. One is a letter from his employer at the time, Gerald Robinson. He also cites these links. [However, see Andrew Dalke’s comment below.]

Optimal exponent

As mentioned above, squircles and more generally superellipses, involve an exponent p. The case p = 2 gives a circle. As p goes to infinity, the squircle converges to a square. As p goes to 0, you get a star-shape as shown here. As noted in that same post, Apple uses p = 4 in some designs. The Sergels Torg fountain in Stockholm is a superellipse with p = 2.5. Gerald Robinson designed a parking garage using a superellipse with p = e = 2.71828.

Panholzer experimented with various exponents [1] and decided that the optimal value of p would be the one for which the squircle has an area half way between the circle and corresponding square. This would create visual interest, leaving the viewer undecided whether the shape is closer to a circle or square.

The area of the portion of the unit circle contained in the first quadrant is π/4, and so we want to find the exponent p such that the area of the squircle in the first quadrant is (1 + π/4)/2. This means we need to solve

\int_0^1 (1 - x^p)^{1/p}\, dx = \frac{\Gamma\left(\frac{p+1}{p}\right)^2}{\Gamma\left(\frac{p+2}{p} \right )} = \frac{1}{2} + \frac{\pi}{8}

We can solve this numerically [2] to find p = 3.1620. It would be a nice coincidence if the solution were π, but it’s not quite.

Sometime around 1966 Panholzer had a conference table made in the shape of a squircle with this exponent.


I asked Panholzer how he created his squircles, and whether he had access to a computer in 1966. He did use a computer to find the optimal value of p; his brother in law, Hans Thurow, had access to a computer at McPhar Geophysics in Toronto. But he drew the plots by hand.

There was no plotter around at that time, but I used transparent vellum over graph paper and my architectural drawing skills with “French curves” to draw 15 squircles from p=2.6 (obviously “circlish”) to p=4.0 (obviously “squarish”).

Related posts

[1] The 15 plots mentioned in the quote at the end came first. A survey found that people preferred the curve corresponding to p around 3.1. Later the solution to the equation for the area to be half way between that of a circle and a square produced a similar value.

[2] Here are a couple lines of Mathematica code to find p.

    f[p_] := Gamma[1 + 1/p]^2/Gamma[1 + 2/p]
    FindRoot[f[p] - (1 + Pi/4)/2, {p, 4}]

The 4 in the final argument to FindRoot is just a suggested starting point for the search.