Moore-Penrose pseudoinverse is not an adjoint

The Moore-Penrose pseudoinverse of a matrix is a way of coming up with something like an inverse for a matrix that doesn’t have an inverse. If a matrix does have an inverse, then the pseudoinverse is in fact the inverse. The Moore-Penrose pseudoinverse is also called a generalized inverse for this reason: it’s not just like an inverse, it actually is an inverse when that’s possible.

Given an m by n matrix A, the Moore-Penrose pseudoinverse A+ is the unique n by m matrix satisfying four conditions:

  1. A A+ A = A
  2. A+ A A+ = A+
  3. (A A+)* = A A+
  4. (A+ A)* = A+ A

The first equation says that A A+ acts as a left identity for A, and A+ A acts as a right identity for A.

The second equation says that A+ A acts as a left identity for A+, and A A+ acts as a right identity for A+.

The third and fourth equations say that A A+ and A+A are Hermitian.

If A is invertible, A A+ and A+A are both the identity matrix. Otherwise A A+ and A+A act an awful lot like the identity, as much as you could expect, maybe a little more than you’d expect.
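
If you want to see the four conditions in action, here’s a quick numerical check (not from the original post) using NumPy’s pinv; the random rectangular matrix is just an illustration.

    import numpy as np

    # Check the four Penrose conditions for an arbitrary example matrix
    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 3))
    P = np.linalg.pinv(A)      # the Moore-Penrose pseudoinverse A+

    print(np.allclose(A @ P @ A, A))              # A A+ A == A
    print(np.allclose(P @ A @ P, P))              # A+ A A+ == A+
    print(np.allclose((A @ P).conj().T, A @ P))   # A A+ is Hermitian
    print(np.allclose((P @ A).conj().T, P @ A))   # A+ A is Hermitian

All four checks should print True.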

Galois connections and adjoints

John Baez recently wrote that a Galois connection, a kind of categorical adjunction, is

“the best approximation to reversing a computation that can’t be reversed.”

That sounds like a pseudoinverse! And the first two equations defining a pseudoinverse look a lot like things you’ll see in the context of adjunctions, so the pseudoinverse must be an adjunction, right?

The question was raised on MathOverflow and Michal R. Przybylek answered

I do not think the concept of Moore-Penrose Inverse and the concept of categorical adjunction have much in common (except they both try to generalise the concept of inverse) …

and gives several reasons why.

Too bad. It would have made a good connection. Applied mathematicians are likely to be familiar with Moore-Penrose pseudoinverses but not categorical adjoints. And pure mathematicians, depending on their interests, may be more familiar with adjoint functors than matrix pseudoinverses.

So what about John Baez’ comment? His comment was expository (and very helpful) but not meant to be rigorous. To make it rigorous you’d have to be precise about what you mean by “best approximation” and so on, and when you define your terms carefully, in the language of category theory, you get adjoints rather than pseudoinverses. The Moore-Penrose inverse, despite its many nice properties, doesn’t mesh well with categorical definitions.

Przybylek concludes

… adjunctions compose … but Moore-Penrose pseudoinverses—generally—do not. … pseudoinverses are not stable under isomorphisms, thus are not categorical.

This turns out to be an interesting example, but not of what I first expected. Rather than the pseudoinverse of a matrix being an example of an adjoint, it is an example of something that despite having convenient properties [1] does not compose well from a categorical perspective.


[1] The book Matrix Mathematics devotes about 40 pages to stating theorems about the Moore-Penrose pseudoinverse.

It’s like this other thing except …

One of my complaints about math writing is that definitions are hardly ever subtractive, even if that’s how people think of them.

For example, a monoid is a group except without inverses. But that’s not how you’ll see it defined. Instead you’ll read that it’s a set with an associative binary operation and an identity element. A module is a vector space, except the scalars come from a ring instead of a field. But the definition from scratch is more than I want to write here. Any time you have a sub-widget or a pre-widget or a semi-widget, it’s probably best to define the widget first.

I understand the logical tidiness of saying what a thing is rather than what it is not. But it makes more pedagogical sense to describe the difference between a new concept and the most similar familiar concept. And the nearest familiar concept may have more structure rather than less.

Suppose you wanted to describe where a smaller city is by giving directions from a larger, better-known city, but you could only move east. Then instead of saying that Ft. Worth is 30 miles west of Dallas, you’d have to say it’s 1,000 miles east of Phoenix.

Writers don’t have to choose between crisp logic and good pedagogy. They can do both. They can say, for example, that a pre-thingy is a thingy without some property, then say “That is, a pre-thingy satisfies the following axioms: …”

Obesity index: Measuring the fatness of probability distribution tails

A probability distribution is called “fat tailed” if its probability density goes to zero slowly. Slowly relative to what? That is often implicit and left up to context, but generally speaking the exponential distribution is the dividing line. Probability densities that decay faster than the exponential distribution are called “thin” or “light,” and densities that decay slower are called “thick”, “heavy,” or “fat,” or more technically “subexponential.” The distinction is important because fat-tailed distributions tend to defy our intuition.

One surprising property of heavy-tailed (subexponential) distributions is the single big jump principle. Roughly speaking, most of the contribution to the sum of several heavy-tailed random variables comes from the largest of the samples. To be more specific, let “several” = 4 for reasons that’ll be apparent soon, though the result is true for any n. As x goes to infinity, the probability that

X1 + X2 + X3 + X4

is larger than x is asymptotically equal to the probability that

max(X1, X2, X3, X4)

is larger than x.

The idea behind the obesity index [1] is to turn the theorem above around, making it an empirical measure of how thick a distribution’s tail is. If you draw four samples from a random variable and sort them, the obesity index is the probability that the sum of the max and min, X1 + X4, is greater than the sum of the middle samples, X2 + X3.

The obesity index could be defined for any distribution, but it only measures what the name implies for right-tailed distributions. For any symmetric distribution, the obesity index is 1/2. A Cauchy distribution is heavy-tailed, but it has two equally heavy tails, and so its obesity index is the same as the normal distribution, which has two light tails.

Note that location and scale parameters have no effect on the obesity index; shifting and scaling affect all the X values the same way, so they don’t change the probability that X1 + X4 is greater than X2 + X3.

To get an idea of the obesity index in action, we’ll look at the normal, exponential, and Cauchy distributions, since these are the canonical examples of thin, medium, and thick tailed distributions. But for reasons explained above, we’ll actually look at the folded normal and folded Cauchy distributions, i.e. we’ll take their absolute values to create right-tailed distributions.

To calculate the obesity index exactly you’d need to do analytical calculations with order statistics. We’ll simulate the obesity index because that’s easier. It’s also more in the spirit of calculating the obesity index from data.

    from scipy.stats import norm, expon, cauchy

    def simulate_obesity(dist, N):
        # estimate P(min + max > sum of middle two) from N sets of 4 samples
        data = abs(dist.rvs(size=(N,4)))
        count = 0
        for row in range(N):
            X = sorted(data[row])
            if X[0] + X[3] > X[1] + X[2]:
                count += 1
        return count/N

    for dist in [norm, expon, cauchy]:
        print( simulate_obesity(dist, 10000) )

When I ran the Python code above, I got

    0.6692
    0.7519
    0.8396

This ranks the three distributions in the anticipated order of tail thickness.

Note that the code above takes the absolute value of the random samples. This lets us pass in ordinary (unfolded) versions of the normal and Cauchy distributions, and it’s redundant for any distribution like the exponential that’s already positive-valued.

[I found out after writing this blog post that SciPy now has foldnorm and foldcauchy, but they don’t seem to work like I expect.]

Let’s try it on a few more distributions. Lognormal is between exponential and Cauchy in thickness. A Pareto distribution with parameter b has a density that goes to zero like x^(-1-b), and so we expect a Pareto distribution to have a smaller obesity index than Cauchy when b is greater than 1, and a larger index when b is less than 1. Once again the simulation results are what we’d expect.

The code

    from scipy.stats import lognorm, pareto   # lognorm(1) assumes shape sigma = 1
    for dist in [lognorm(1), pareto(2), pareto(0.5)]:
        print( simulate_obesity(dist, 10000) )

returns

    0.7766
    0.8242
    0.9249

By this measure, lognormal is just a little heavier than exponential. Pareto(2) comes in lighter than Cauchy, but not by much, and Pareto(0.5) comes in heavier.

Since the obesity index is a probability, it will always return a value between 0 and 1. Maybe it would be easier to interpret if we did something like take the logit transform of the index to spread the values out more. Then the distinctions between Pareto distributions of different orders, for example, might match intuition better.
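
For instance, here’s a two-line sketch of that idea applied to the simulated indices above:

    from scipy.special import logit

    # map the index from (0, 1) to the whole real line
    print(logit([0.6692, 0.7519, 0.8396, 0.7766, 0.8242, 0.9249]))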

[1] Roger M. Cooke et al. Fat-Tailed Distributions: Data, Diagnostics and Dependence. Wiley, 2014.

Duffing equation for nonlinear oscillator

The Duffing equation is an ordinary differential equation describing a nonlinear damped driven oscillator.

\frac{d^2x}{dt^2} + h\frac{dx}{dt} + \Omega_0^2 x + \mu x^3 = F \cos \omega t

If the parameter μ were zero, this would be a damped driven linear oscillator. It’s the x³ term that makes the equation nonlinear and interesting.

Using an analog computer in 1961, Yoshisuke Ueda discovered that this system was chaotic. It was the first discovery of chaos in a second-order non-autonomous periodic system. Lorenz discovered his famous Lorenz attractor around the same time, though his system is third order.

Ueda’s system has been called “Ueda’s strange attractor” or the “Japanese attractor.”

In the linear case, when μ = 0, the phase portrait simply spirals outward from the origin towards its steady state.

But things get much more interesting when μ = 1. Here’s Mathematica code to numerically solve the differential equation and plot the phase portrait.

duff[t_] = 
    NDSolve[{x''[t] + 0.05 x'[t] + x[t] + x[t]^3 == 7.5 Cos[t], 
             x[0] == 0, x'[0] == 1}, x[t], {t, 0, 100}][[1, 1, 2]]
ParametricPlot[{duff[t], duff'[t]}, {t, 0, 50}, AspectRatio -> 1]
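
If you’d rather avoid Mathematica, here’s a rough Python equivalent (a sketch using scipy.integrate.solve_ivp with the same parameter values; the function name duffing is just for illustration):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.integrate import solve_ivp

    # x'' + 0.05 x' + x + x^3 == 7.5 cos(t), with x(0) = 0, x'(0) = 1
    def duffing(t, y):
        x, v = y
        return [v, 7.5*np.cos(t) - 0.05*v - x - x**3]

    t = np.linspace(0, 50, 10000)
    sol = solve_ivp(duffing, [0, 50], [0, 1], t_eval=t, rtol=1e-9, atol=1e-9)
    plt.plot(sol.y[0], sol.y[1])   # phase portrait: x versus x'
    plt.show()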

Here’s the same system integrated out further, to t = 200 rather than 50.

Source: Wanda Szemplińska-Stupnicka. Chaos, Bifurcations and Fractals Around Us.

Surface area of an egg

The first post in this series looked at a possible formula for the shape of an egg, how to fit the parameters of the formula, and the curvature of the shape at each end of the egg.

The second post looked at the volume. This post looks at the surface area.

If you rotate the graph of a function f(x) around the x-axis between c and d, the area of the resulting surface is

2 \pi \int_c^d f(x)\sqrt{1 + (f'(x))^2} \, dx

This integral cannot be computed in closed form for the function f describing an egg. (At least I can’t find a closed form, and neither can Mathematica.) But we can do it with numerical integration.

Here’s the Mathematica code.

f[x_, a_, b_, k_] := b Sqrt[(1 - x^2/a^2) / (1 + x k)]
area[a_, b_, k_] := 
    2 Pi* NIntegrate[ 
    f[x, a, b, k] Sqrt[1 + D[f[x, a, b, k], x]^2], 
    {x, -a, a}
]

As a sanity check, let’s verify that if our egg were spherical we would get back the area of that sphere.

area[3, 3, 0] returns 113.097 and N[36 Pi] also returns 113.097, so that’s a good sign.
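
As an independent cross-check, here’s a Python sketch of the same integral (not the author’s code; the function name egg_area is mine). Writing the integrand as the square root of f² + (f f′)², with f² = b²g and f f′ = b²g′/2 where g(x) = (1 − x²/a²)/(1 + kx), avoids differentiating the square root numerically:

    import numpy as np
    from scipy.integrate import quad

    def egg_area(a, b, k):
        # surface of revolution of f(x) = b*sqrt(g(x)) about the x-axis,
        # written as 2 pi * integral of sqrt(f^2 + (f f')^2)
        g  = lambda x: (1 - x**2/a**2) / (1 + k*x)
        gp = lambda x: (-(2*x/a**2)*(1 + k*x) - k*(1 - x**2/a**2)) / (1 + k*x)**2
        integrand = lambda x: np.sqrt(b**2*g(x) + (b**2*gp(x)/2)**2)
        return 2*np.pi*quad(integrand, -a, a)[0]

    print(egg_area(3, 3, 0), 36*np.pi)   # both about 113.097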

Now let’s plot the surface area as a function of the parameter k.

Plot[area[4, 2, k], {k, -0.2, 0.2}]

The y-axis starts at 85.9, so the plot above exaggerates the effect of k. Here’s another plot with the y-axis starting at zero.

Plot[area[4, 2, k], {k, -0.2, 0.2}, PlotRange -> {0, 100}]

As with volume, the difference between an egg and an ellipsoid is approximately a quadratic function of the parameter k.

Volume of an egg

The previous post looked at an equation to fit the shape of an egg. In two dimensions we had

\frac{x^2}{a^2} + \frac{y^2}{b^2}(1 + kx) = 1

In this post, we’ll rotate that curve around the x-axis to find the volume. Then we’ll see how it compares to that of an ellipsoid.

If we rotate the graph of a function f(x) around the x-axis with x ranging from c to d, the volume is given by

\pi \int_c^d \left(f(x) \right)^2 \, dx

It works out well that the function f is squared because when we express y explicitly as a function of x we get a square root. Our volume works out to be

b^2 \pi \int_{-a}^a \left(1 - \frac{x^2}{a^2} \right ) \frac{1}{1 + kx}\, dx = \frac{2b^2\pi \left(\left(a^2 k^2-1\right) \tanh ^{-1}(a k)+a k\right)}{a^2 k^3}
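
Here’s a quick numerical check of that closed form (a Python sketch, with a = 4, b = 2, k = 0.1 as illustrative values):

    import numpy as np
    from scipy.integrate import quad

    a, b, k = 4, 2, 0.1     # illustrative values

    numerical = b**2*np.pi*quad(lambda x: (1 - x**2/a**2)/(1 + k*x), -a, a)[0]
    closed = 2*b**2*np.pi*((a**2*k**2 - 1)*np.arctanh(a*k) + a*k)/(a**2*k**3)
    print(numerical, closed)   # the two should agree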

We’d like to see how this compares to the volume of an ellipsoid, and that would be easier if we expanded the inverse hyperbolic tangent using its power series

\tanh^{-1}x = x + \frac{x^3}{3} + \frac{x^5}{5} + \cdots

This says our volume is given by

2b^2\pi\left(\frac{2}{3}a + \frac{2}{15}a^3k^2 + \frac{2}{35} a^5 k^4 + \cdots \right )

Note that if a = b = r and k = 0 this reduces to the volume of a sphere of radius r, i.e. 4πr³/3. If a and b are not necessarily equal but k = 0 we get 4πab²/3, the volume of an ellipsoid.

To first order, k does not affect the volume. That is, k does not appear in the series above except with exponent 2 and higher. This says that to a first approximation, the volume of an egg (assuming our formula for the shape) is simply that of an ellipsoid with the same major and minor axes. Also, k only appears to even powers. We should have expected that from the previous post, since changing the sign of k just flips the egg over and doesn’t change the volume.

To second order, the volume of an egg relative to that of an ellipsoid is a quadratic function of the parameter k. To change an ellipsoid into an egg shape, making one end flatter and the other end more pointed while keeping the length and width the same, you have to add volume. You gain more volume on the flatter end than you lose on the pointier end.

See the next post for a discussion of the surface area of an egg.

Equation to fit an egg

How would you fit an equation to the shape of an egg? This site suggests an equation of the form

\frac{x^2}{a^2} + \frac{y^2}{b^2}(1 + kx) = 1

Note that if k = 0 we get an ellipse. The larger the parameter k is, the more asymmetric the shape is about the y-axis.

Let’s try that out in Mathematica:

    ContourPlot[
        x^2/16 + y^2 (1 + 0.1 x)/4 == 1, 
        {x, -4, 4}, {y, -3, 3}
    ]

Here’s another plot with k = 0.05. This one’s a little closer to an ellipse.

Solving for parameters

If you measured an egg, how would you find a, b, and k?

Setting y = 0 shows that 2a is the length of the egg. Setting x = 0 shows that 2b is the width of the egg at the midpoint of the length. Note that it’s not the maximum height of the egg because that occurs to the left of the midpoint. (To the left if k is positive. The parameter k could be negative, which flips the egg about the y-axis so that the flatter side is on the right.)

To find k we measure the point x where the maximum height occurs.

We have the equation

y^2 = \left(1 - \frac{x^2}{a^2} \right) \frac{b^2}{1 + kx}

and implicit differentiation shows

2y\, \frac{dy}{dx} = -\frac{2x}{a^2}\, \frac{b^2}{1 + kx} - \left(1 - \frac{x^2}{a^2} \right) \frac{k b^2}{(1 + kx)^2}

At the maximum height the derivative of y is zero, and so the right side also equals zero. This lets us solve for k.

k = -\frac{2x}{a^2 + x^2}

Curvature

As k increases, the egg gets flatter on the left side and more pointed on the right side. We can quantify this by calculating the curvature at both ends.

For a curve given implicitly by F(xy) = 0, the curvature is given by

\kappa = \frac{-F_y^2F_{xx}+2F_xF_yF_{xy}-F_x^2F_{yy}}{(F_x^2+F_y^2)^{3/2}}

The expression above simplifies greatly at the two points we’re interested in, (±a, 0).

\begin{eqnarray*} F_x &=& \frac{2x}{a^2} + \frac{ky^2}{b^2} = \pm \frac{2}{a} \\ F_y &=& \frac{2y}{b^2} (1 + kx) = 0 \\ F_{xx} &=& \frac{2}{a^2} \\ F_{yy} &=& \frac{2(1 + kx)}{b^2} = \frac{2(1 \pm ka)}{b^2} \\ F_{xy} &=& \frac{2yk}{b^2} = 0 \end{eqnarray*}

And so the curvature reduces to

\frac{a(1 \pm ka)}{b^2}

So in our first example above, with a = 4, b = 2, and k = 0.1, we have a curvature of 0.6 on the left and 1.4 on the right. In the second example with k = 0.05, we have a curvature of 0.8 on the left and 1.2 on the right.
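
Here’s a quick Python check of those numbers against the curvature formula above (the helper name is just for illustration):

    def end_curvatures(a, b, k):
        # curvature a(1 - ka)/b^2 at the left end and a(1 + ka)/b^2 at the right
        return a*(1 - k*a)/b**2, a*(1 + k*a)/b**2

    print(end_curvatures(4, 2, 0.1))    # about (0.6, 1.4)
    print(end_curvatures(4, 2, 0.05))   # about (0.8, 1.2)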

See the next post for the volume of an egg, assuming the equation for the shape in this post.


Viability of unpopular programming languages

I said something about Perl 6 the other day, and someone replied asking whether anyone actually uses Perl 6. My first thought was I bet more people use Perl 6 than Haskell, and it’s well known that people use Haskell. I looked at the TIOBE Index to see whether that’s true. I won’t argue how well the index measures popularity, but for this post I’ll assume it’s a good enough proxy.

TIOBE doesn’t separate out variations on Perl [1]. What it calls Perl is 16th on the list this year, while Haskell comes in at 42nd. A few of the more obscure languages that TIOBE ranks higher than Haskell are Scratch, D, ABAP, Apex, and PL/I. Haskell has better public relations than all these languages.

There’s a lot more to viability than just popularity, though popularity matters. More users means more people to find bugs, write libraries, develop tools, answer questions, write tutorials, etc. But the benefit of community size is not linear. It goes through a sort of logistic S-curve. There’s some threshold size where the community is large enough for a language to be viable. And somewhere above that threshold you start hitting diminishing returns.

It’s interesting to look at some of the languages currently less popular than Haskell but more familiar: Common Lisp (63), Erlang (66), and F# (67). These show that popularity isn’t everything.

Common Lisp has been around since 1982, and was standardizing a language that had been in development since 1958. Erlang has been around since 1986. These languages have many of the benefits of popularity listed above, accumulated over time.

There is not a huge community devoted specifically to F#, but it shares tooling and libraries with C#, the 5th language on the list. (Maybe the number of F# developers is underestimated because F# is so closely related to C#, not syntactically but in terms of infrastructure.)

Common Lisp, Erlang, and F# would all be safer bets for a production software project than several more popular languages.


[1] At least I don’t think they do. TIOBE does separate out some versions of Lisp as separate languages. It’s possible they do consider Perl 6 a separate language that didn’t make the top rankings.

Larry Wall deliberately introduced many natural language principles in Perl. It seems that one feature that Perl has in common with natural languages is controversy over when two dialects of a language are sufficiently different to be considered separate languages. Advocates consider Perl 6 to be a separate language but outside observers, like TIOBE, may not.

 

Eight-bit floating point

Researchers have discovered that for some problems, deep neural networks (DNNs) can get by with low precision weights. Using fewer bits to represent weights means that more weights can fit in memory at once. This, as well as embedded systems, has renewed interest in low-precision floating point.

Microsoft mentioned its proprietary floating point formats ms-fp8 and ms-fp9 in connection with its Brainwave Project [1]. I haven’t been able to find any details about these formats, other than that they use two- and three-bit exponents (respectively?).

This post will look at what an 8-bit floating point number would look like if it followed the pattern of IEEE floats or posit numbers. In the notation of the previous post, we’ll look at ieee<8,2> and posit<8,0> numbers. (Update: Added a brief discussion of ieee<8,3>, ieee<8,4>, and posit<8,1> at the end.)

Eight-bit IEEE-like float

IEEE floating point reserves exponents of all 0’s and all 1’s for special purposes. That’s not a high price to pay when there are many possible exponents, but with only four possible exponents it seems very wasteful to devote half of them to special purposes. Maybe this is where Microsoft does something clever. But for this post, we’ll forge ahead with the analogy to larger IEEE floating point numbers.

There would be 191 representable finite numbers, counting the two representations of 0 as one number. There would be two infinities, positive and negative, and 62 ways to represent NaN.

The smallest non-zero number would be

2^-5 = 1/32 = 0.03125.

The largest finite value would have bit pattern 01011111 and value

2(2 - 2^-5) = 63/16 = 3.9375.

This makes the dynamic range just over two decades.
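
Here’s a brute-force enumeration of such a format in Python (a sketch assuming the IEEE-like layout described above; the function name decode is just for illustration):

    import math

    def decode(bits, n=8, es=2):
        # interpret an n-bit pattern as an IEEE-like float with es exponent bits
        s = n - es - 1                      # number of significand bits
        bias = 2**(es - 1) - 1
        sign = -1.0 if (bits >> (n - 1)) & 1 else 1.0
        e = (bits >> s) & (2**es - 1)       # exponent field
        f = bits & (2**s - 1)               # significand field
        if e == 2**es - 1:                  # all 1s: infinity or NaN
            return sign*math.inf if f == 0 else math.nan
        if e == 0:                          # all 0s: zero or denormalized
            return sign * f * 2.0**(1 - bias - s)
        return sign * (1 + f/2**s) * 2.0**(e - bias)

    values = [decode(b) for b in range(256)]
    finite = {v for v in values if math.isfinite(v)}    # set merges +0 and -0
    print(len(finite))                                  # 191
    print(sum(math.isnan(v) for v in values))           # 62
    print(min(v for v in finite if v > 0), max(finite)) # 0.03125 3.9375

Running this should reproduce the counts and extreme values above.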

Eight-bit posit

A posit<8, 0> has no exponent bits, just a sign bit, regime, and fraction. But in this case the useed value is 2, and so the regime acts like an exponent.

There are 255 representable finite numbers and one value corresponding to ±∞.

The smallest non-zero number would be 1/64 and the largest finite number would be 64. The dynamic range is 3.6 decades.

Update: Here is a list of all possible posit<8,0> numbers.

Distribution of values

The graphs below give the distribution of 8-bit IEEE-like numbers and 8-bit posits on a log scale.

eight bit IEEE distribution

eight bit posit distribution

The distribution of IEEE-like numbers is asymmetric because much of the dynamic range comes from denormalized numbers.

The distribution of posits is approximately symmetrical. If a power of 2 is representable as a posit, so is its reciprocal. But you don’t have perfect symmetry because, for example, 3/2 is representable while 2/3 is not.

Other eight-bit formats

I had originally considered a 2-bit significand because Microsoft’s ms-fp8 format has a two-bit significand. After this post was first published it was suggested in the comments that an ieee<8, 4> float might be better than ieee<8, 2>, so let’s look at that. Let’s look at ieee<8, 3> too while we’re at it. And a posit<8, 1> too.

An ieee<8, 3> floating point number would have a maximum value of 15.5 and a minimum value of 2^-6 = 1/64, a dynamic range of 3.0 decades. It would have 223 finite values, including two zeros, as well as 2 infinities and 30 NaNs.

An ieee<8, 4> floating point number would have a maximum value of 240 and a minimum value of 2^-9 = 1/512, a dynamic range of 5.1 decades. It would have 239 finite values, including two zeros, as well as 2 infinities and 14 NaNs.

A posit<8, 1> would have a maximum value of 2^12 = 4096 and a minimum value of 1/4096, a dynamic range of 7.2 decades. Any 8-bit posit, regardless of the maximum number of exponent bits, will have 255 finite values and one infinity.

Near 1, an ieee<8, 4> has 3 significand bits, an ieee<8, 3> has 4, and a posit<8,1> has 4.

***

[1] Chung et al. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. Available here.

Comparing range and precision of IEEE and posit

The IEEE standard 754-2008 defines several sizes of floating point numbers—half precision (binary16), single precision (binary32), double precision (binary64), quadruple precision (binary128), etc.—each with its own specification. Posit numbers, on the other hand, can be defined for any number of bits. However, the IEEE specifications share common patterns so that you could consistently define theoretical IEEE numbers that haven’t actually been specified, making them easier to compare to posit numbers.

An early post goes into the specification of posit numbers in detail. To recap briefly, a posit<n, es> number has n bits, a maximum of es of which are devoted to the exponent. The bits are divided into a sign bit, regime bits, exponent bits, and fraction bits. The sign bit is of course one bit, but the other components have variable lengths. We’ll come back to posits later for comparison.

IEEE floating point range and precision

We will denote a (possibly hypothetical) IEEE floating point number as ieee<n, es>, meaning one with n total bits and (exactly) es exponent bits. Such a number has one sign bit and n - es - 1 significand bits. Actual specifications exist for ieee<16, 5>, ieee<32, 8>, ieee<64, 11>, and ieee<128, 15>.

The exponent of a posit number is simply represented as an unsigned integer. The exponent of an IEEE floating point number equals the exponent bits interpreted as an unsigned integer minus a bias.

\text{bias} = 2^{es -1} - 1.

So the biases for half, single, double, and quad precision floats are 15, 127, 1023, and 16383 respectively. We could use the formula above to define the bias for a hypothetical format not yet specified, assuming the new format is consistent with existing formats in this regard.

The largest exponent is emax = 2^(es-1) - 1 (also equal to the bias), and the smallest (most negative) exponent is emin = 2 - 2^(es-1). This accounts for 2^es - 2 possible exponents. The two remaining possibilities, exponent bits all 1’s and all 0’s, are reserved for special use. They represent, in combination with the sign and significand bits, the special values ±0, ±∞, NaN, and denormalized numbers. (More on denormalized numbers shortly.)

The largest representable finite number has the maximum exponent and a significand of all 1’s. Its value is thus

2^{e_{\text{max}}} \left(2 - 2^{-s}\right)

where s is the number of significand bits. And so the largest representable finite number is just slightly less than

2^{2^{es -1}}

We’ll use this as the largest representable value when calculating dynamic range below.

The smallest representable normalized number (normalized meaning the significand represents a number greater than or equal to 1) is

2^{e_{\text{min}}} = 2^{2 - 2^{es -1}}

However, it is possible to represent smaller values with denormalized numbers. Ordinarily the significand bits fff… represent a number 1.fff… But when the exponent bit pattern consists of all 0’s, the significand bits are interpreted as 0.fff… This means that the smallest denormalized number has a significand of all 0’s except for a 1 at the end. This represents a value of

2^{e_{\text{min}}} \cdot 2^{-s} = 2^{2 - 2^{es-1} - s}

where again s is the number of significand bits.

The dynamic range of an ieee<n, es> number is the log base 10 of the ratio of the largest to smallest representable positive numbers, the smallest here including denormalized numbers.

\log_{10} \left(  \frac{2^{2^{es - 1} }}{2^{2 - 2^{es-1} - s}} \right) = \log_{10}2^{2^{es} - 2 + s} = (2^{es} - 2 + s) \log_{10}2

IEEE float and posit dynamic range at comparable precision

Which posit number should we compare with each IEEE number? We can’t simply compare ieee<n, es> with posit<n, es>. The value n means the same in both cases: the total number of bits. And although es does mean the number of exponent bits in both cases, they are not directly comparable because posits also have regime bits that are a special kind of exponent bits. In general a comparable posit number will have a smaller es value than its IEEE counterpart.

One way to compare IEEE floating point numbers and posit numbers is to choose a posit number format with comparable precision around 1. See the first post on posits for their dynamic range and significance near 1.

In the following table, the numeric headings are the number of bits in a number. The “sig” rows contain the number of significand bits in the representation of 1, and “DR” stands for dynamic range in decades.

|-----------+----+-----+------+-------|
|           | 16 |  32 |   64 |   128 |
|-----------+----+-----+------+-------|
| IEEE  es  |  5 |   8 |   11 |    15 |
| posit es  |  1 |   3 |    5 |     8 |
| IEEE  sig | 10 |  23 |   52 |   112 |
| posit sig | 12 |  26 |   56 |   117 |
| IEEE  DR  | 12 |  83 |  632 |  9897 |
| posit DR  | 17 | 144 | 1194 | 19420 |
|-----------+----+-----+------+-------|

Note that in each case the posit number has both more precision for numbers near 1 and a wider dynamic range.

It’s common to use a different set of posit es values that have a smaller dynamic range than their IEEE counterparts (except for 16 bits) but have more precision near 1.

|-----------+----+-----+------+-------|
|           | 16 |  32 |   64 |   128 |
|-----------+----+-----+------+-------|
| IEEE  es  |  5 |   8 |   11 |    15 |
| posit es  |  1 |   2 |    3 |     4 |
| IEEE  sig | 10 |  23 |   52 |   112 |
| posit sig | 12 |  27 |   58 |   121 |
| IEEE  DR  | 12 |  83 |  632 |  9897 |
| posit DR  | 17 |  72 |  299 |  1214 |
|-----------+----+-----+------+-------|

Python code

Here’s a little Python code if you’d like to experiment with other number formats.

from math import log10

def IEEE_dynamic_range(total_bits, exponent_bits):
    # dynamic range in decades of an ieee<total_bits, exponent_bits> number
    s = total_bits - exponent_bits - 1   # number of significand bits
    return (2**exponent_bits + s - 2)*log10(2)

def posit_dynamic_range(total_bits, max_exponent_bits):
    # dynamic range in decades of a posit<total_bits, max_exponent_bits> number
    return (2*total_bits - 4) * 2**max_exponent_bits * log10(2)
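
For example, this loop should reproduce the dynamic range rows of the first table above, using the es values listed there:

# es values from the first table: IEEE 5, 8, 11, 15; posit 1, 3, 5, 8
for n, ieee_es, posit_es in [(16, 5, 1), (32, 8, 3), (64, 11, 5), (128, 15, 8)]:
    print(n, round(IEEE_dynamic_range(n, ieee_es)), round(posit_dynamic_range(n, posit_es)))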

Next: See the next post for a detailed look at eight bit posits and IEEE-like floating point numbers.

Categorical Data Analysis

Categorical data analysis could mean a couple different things. One is analyzing data that falls into unordered categories (e.g. red, green, and blue) rather than numerical values (e.g. height in centimeters).

Another is using category theory to assist with the analysis of data. Here “category” means something more sophisticated than a list of items you might choose from in a drop-down menu. Instead we’re talking about applied category theory.

So we have ((categorical data) analysis) and (categorical (data analysis)), i.e. analyzing categorical data and categorically analyzing data. The former is far, far more common.

I ran across Alan Agresti’s classic book the other day in a used book store. The image below is from the third (2012) edition. The book store had the 1st (1990) edition with a more austere cover.

I bought Agresti’s book because it’s a good reference to have. But I was a little disappointed. My first thought when I saw the title was that someone had written a book on category theory and statistics, which is not the case, as far as I know.

The main reference for category theory and statistics is Peter McCullagh’s 2002 paper What is a statistical model? That paper raised a lot of interesting ideas, but the statistics community did not take McCullagh’s bait.

commutative diagram for statistical models

Maybe this just wasn’t a fruitful idea. I suspect it is a fruitful idea, but the number of people available to develop it, conversant in both statistics and category theory, is very small. I’ve seen category theory used in mathematical modeling more generally, but not in statistics per se.

At its most basic, category theory asks you to be explicit about the domain and range (codomain) of functions. It would be very helpful if statisticians merely did this. Statistical notation is notoriously bad at saying where a function goes from and to, or even whether a function is a function. Just 0th level category theory, defining categories, would be useful. Maybe it would be useful to go on to identifying limits or adjoints, but simply being explicit about “from” and “to” would be a good start.

Category theory is far too abstract to completely carry out a statistical analysis. But it can prompt you to ask questions that check whether your model has any inconsistencies you hadn’t noticed. The idea of a “categorical error” doesn’t differ that much moving from its philosophical meaning under Aristotle to its mathematical meaning under MacLane. Nor does the idea of something being “natural.” One of the primary motivations for creating category theory was to come up with a rigorous definition of what it means for something in math to be “natural.”

Anatomy of a posit number

This post will introduce posit numbers, explain the interpretation of their bits, and discuss their dynamic range and precision.

Posit numbers are a new way to represent real numbers for computers, an alternative to the standard IEEE floating point formats. The primary advantage of posits is the ability to get more precision or dynamic range out of a given number of bits. If an application can switch from using 64-bit IEEE floats to using 32-bit posits, for example, it can fit twice as many numbers in memory at a time. That can make a big difference in the performance of applications that process large amounts of data.

Let’s back up and say what a posit number is.

Unums and posits

John Gustafson introduced unums (universal numbers) as a different way to represent real numbers using a finite number of bits, an alternative to IEEE floating point. See, for example, his 2015 book The End of Error. Posits are a hardware-friendly version of unums.

A conventional floating point number (IEEE 754) has a sign bit, a set of bits to represent the exponent, and a set of bits called the significand (formerly called the mantissa). For details, see Anatomy of a floating point number. For a given size number, the lengths of the various parts are fixed. A 64-bit floating point number, for example, has 1 sign bit, 11 exponent bits, and 52 bits for the significand.

A posit adds an additional category of bits, known as the regime. A posit has four parts

  1. sign bit
  2. regime
  3. exponent
  4. fraction

An IEEE floating point number, by contrast, has a sign bit, exponent, and significand, the latter corresponding to the fraction part of a posit. Unlike IEEE numbers, the exponent and fraction parts of a posit do not have fixed length. The sign and regime bits have first priority. Next, the remaining bits, if any, go into the exponent. If there are still bits left after the exponent, the rest go into the fraction.

The main reference for this post is [1].

Bit pattern of a posit

To understand posits in more detail, and why they have certain advantages over conventional floating point numbers, we need to unpack their bit representation. A posit number type is specified by two numbers: the total number of bits n, and the maximum number of bits devoted to the exponent, es. (Yes, it’s a little odd to use a two-letter variable name, but that’s conventional in this context.) Together we say we have a posit<n, es> number.

Sign bit

As with an IEEE floating point number, the first bit of a posit is the sign bit. If the sign bit is 1, representing a negative number, take the two’s complement of the rest of the bits before unpacking the regime, exponent, and fraction bits.

Regime bits

After the sign bit come the regime bits. The number of regime bits is variable. There could be anywhere from 1 to n-1 regime bits. How do you know when the regime bits stop? When a run of identical bits ends, either because you run out of bits or because you run into an opposite bit.

If the first bit after the sign bit is a 0, then the regime bits continue until you run out of bits or encounter a 1. Similarly, if the first bit after the sign bit is a 1, the regime bits continue until you run out of bits or encounter a 0. The bit that indicates the end of a run is not included in the regime; the regime is a string of all 0’s or all 1’s.

Exponent bits

The sign bit and regime bits get first priority. If there are any bits left, the exponent bits are next in line.  There may be no exponent bits. The maximum number of exponent bits is specified by the number es. If there are at least es bits after the sign bit, regime bits, and the regime terminating bit, the next es bits belong to the exponent. If there are fewer than es bits left, what bits remain belong to the exponent.

Fraction bits

If there are any bits left after the sign bit, regime bits, regime terminating bit, and the exponent bits, they all belong to the fraction.

Interpreting the components of a posit

Next we look at how the components described above represent a real number.

Let b be the sign bit in a posit. The sign s of the number represented by the bit pattern is positive if this bit is 0 and negative otherwise.

s = (-1)^b

Let m be the number of bits in the regime, i.e. the length of the run of identical bits following the sign bit. Then let k = -m if the regime consists of all 0’s, and let k = m - 1 otherwise.

k = \left\{ \begin{array}{ll} -m & \text{ if regime has } m \text{ 0's} \\ m-1 & \text{ if regime has } m \text{ 1's} \end{array} \right.

The useed u of the posit is determined by es, the maximum exponent size.

u = 2^{2^{\text{\small\emph{es}}} }

The exponent e is simply the exponent bits interpreted as an unsigned integer.

The fraction f is 1 + the fraction bits interpreted as following a binary point. For example, if the fraction bits are 10011, then f = 1.10011 in binary.

Putting it all together, the value of the posit number is the product of the contributions from the sign bit, regime bits, exponent bits (if any), and fraction bits (if any).

x = s\, u^k\, 2^e f = (-1)^b \, f\, 2^{e + k2^{\text{\small\emph{es}}} }

Exceptional posits

There are two exceptional posits, both with all zeros after the sign bit. A string of n 0’s represents the number zero, and a 1 followed by n-1 0’s represents ±∞.

There’s only one zero for posit numbers, unlike IEEE floats that have two kinds of zero, one positive and one negative.

There’s also only one infinite posit number. For that reason you could say that posits represent projective real numbers rather than extended real numbers. IEEE floats have two kinds of infinities, positive and negative, as well as several kinds of non-numbers. Posits have only one entity that does not correspond to a real number, and that is ±∞.
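
Here’s a Python sketch of the decoding procedure just described (illustrative code, not from the paper; the function name decode_posit is mine). It handles the two exceptional bit patterns, and when the exponent field gets cut short it simply takes the remaining bits at face value:

    def decode_posit(bits, n, es):
        # decode an n-bit posit<n, es> bit pattern into a float
        if bits == 0:
            return 0.0
        if bits == 1 << (n - 1):
            return float("inf")                 # the single +-infinity posit
        sign = 1
        if bits >> (n - 1):                     # negative: two's complement first
            sign = -1
            bits = (-bits) & ((1 << n) - 1)
        rest = bits & ((1 << (n - 1)) - 1)      # the n-1 bits after the sign bit
        # regime: run of identical bits, ended by an opposite bit or running out
        first = (rest >> (n - 2)) & 1
        m, i = 0, n - 2
        while i >= 0 and ((rest >> i) & 1) == first:
            m, i = m + 1, i - 1
        k = m - 1 if first else -m
        i -= 1                                  # skip the regime-terminating bit
        # exponent: up to es of the remaining bits (taken at face value)
        e = 0
        if es > 0 and i >= 0:
            take = min(es, i + 1)
            e = (rest >> (i + 1 - take)) & ((1 << take) - 1)
            i -= take
        # fraction: whatever bits are left, read as 1.fff...
        if i >= 0:
            f = 1 + (rest & ((1 << (i + 1)) - 1)) / (1 << (i + 1))
        else:
            f = 1.0
        return sign * (2**2**es)**k * 2**e * f

    # all 256 posit<8, 0> bit patterns: minpos = 1/64, maxpos = 64
    values = [decode_posit(b, 8, 0) for b in range(256)]
    finite = [v for v in values if v != float("inf")]
    print(min(v for v in finite if v > 0), max(finite))   # 0.015625 64.0

Decoding all 256 bit patterns of a posit<8, 0> this way gives minpos = 1/64 and maxpos = 64, in agreement with the formulas below.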

Dynamic range and precision

The dynamic range and precision of a posit number depend on the value of es. The larger es is, the larger the contribution of the regime and exponent bits will be, and so the larger range of values one can represent. So increasing es increases dynamic range. Dynamic range, measured in decades, is the log base 10 of the ratio between the largest and smallest representable positive values.

However, increasing es means decreasing the number of bits available to the fraction, and so decreases precision. One of the benefits of posit numbers is this ability to pick es to adjust the trade-off between dynamic range and precision to meet your needs.

The largest representable finite posit is labeled maxpos. This value occurs when k is as large as possible, i.e. when all the bits after the sign bit are 1’s. In this case k = n - 2. So maxpos equals

u^{n-2} = \left( 2^{2^{\text{\small\emph{es}}} } \right)^{n-2}

The smallest representable positive number, minpos, occurs when k is as negative as possible, i.e. when the largest possible number of bits after the sign bit are 0’s. They can’t all be zeros or else we have the representation for the number 0, so there must be a 1 on the end. In this case m = n-2 and k = 2-n.

\mbox{minpos} = u^{2-n} = \left( 2^{2^{\text{\small\emph{es}}} } \right)^{2-n} = 1/\mbox{maxpos}

The dynamic range is given by the log base 10 of the ratio between maxpos and minpos.

\log_{10}\left( 2^{2^{\text{\small\emph{es}}} } \right)^{2n-4} = (2n-4)2^{es}\log_{10}2

For example, a 16-bit posit with es = 1 has a dynamic range of 17 decades, whereas a 16-bit IEEE floating point number has a dynamic range of 12 decades. The former has a fraction of 12 bits for numbers near 1, while the latter has a significand of 10 bits. So a posit<16, 1> number has both a greater dynamic range and greater precision (near 1) than its IEEE counterpart.

[Update: See this post for more on the dynamic range and precision of IEEE floats of various sizes and how posits compare.]

Note that the precision of a posit number depends on its size. This is the sense in which posits have tapered precision. Numbers near 1 have more precision, while extremely big numbers and extremely small numbers have less. This is often what you want. Typically the vast majority of numbers in a computation are roughly on the order of 1, while with the largest and smallest numbers, you mostly want them to not overflow or underflow.

Related post: Anatomy of a floating point number

***

[1] John L. Gustafson and Isaac Yonemoto. Beating Floating Point at its Own Game: Posit Arithmetic. DOI: 10.14529/jsfi170206

 

Up arrow and down arrow notation

I recently ran into a tweet saying that if ** denotes exponentiation then // should denote logarithm. With this notation, for example, if we say

    3**4 == 81

we would also say

    81 // 3 == 4.

This runs counter to convention since // has come to be a comment marker or a notation for integer division. I assume the original proposal was meant to be funny, or at most semi-serious, but it would have made sense had notation developed that way.

Donald Knuth’s arrow notation for hyperexponentiation works that way. An up arrow denotes repeated exponentiation and a down arrow denotes repeated logarithm.

Up arrow notation

One up arrow denotes repeated multiplication, i.e. exponentiation.

b \uparrow n = \underbrace{b \cdot b \cdot b \cdots b}_{n \text{ copies of } b} = b^n

Two up arrows denote repeated exponentiation, i.e. hyperexponentiation.

b \uparrow \uparrow n = b \underbrace{\uparrow(b \uparrow \cdots(b\uparrow b))}_{n \text{ copies of } b} = \underbrace{b^{b^{.\,^{.\,^{.\,^b}}}}}_{n \text{ copies of } b}

Three up arrows denote repeated applications of the double arrow, etc.

Here’s how you could calculate

b \underbrace{\uparrow \uparrow \uparrow \cdots \uparrow}_{k \text{ arrows}} n

using Python:

    def hyperexp(b, k, n):
        # compute b followed by k up arrows applied to n, recursively
        if n == 0:
            return 1
        if k == 1:
            return b**n
        return hyperexp(b, k-1, hyperexp(b, k, n-1))

This function grows shockingly quickly. The first four values of hyperexp(2, 2, n) starting with n=1 are 2, 4, 16, and 65536. But the fifth value has over 19,000 digits.

Up arrow notation comes up sometimes, for example, in analysis of algorithms.

Down arrow notation

The down arrow notation is less common, and a little more complicated.

The logarithm of a number x base b is the number of times you’d need to multiply b by itself to get x. So if down arrow is the inverse of up arrow, down arrow is logarithm.

b \downarrow n = \log_b n

Now what should two down arrows represent? It should be the inverse of double up arrows. For example,

2 \uparrow \uparrow 4 = 2 \uparrow( 2 \uparrow (2\uparrow 2)) = 2^{2^{2^{2}}} = 65,536

and so we should have

2 \downarrow \downarrow 65536 = 4

That means the double down arrow counts the height of the power tower of 2’s that equals 65536. But notice we’re counting things. We’re assuming the number actually is a power tower of the base, so that there’s an integer number of up arrow steps to invert. In general there isn’t. What if we change 65536 to 65535 above?

So here’s where there’s a slight complication, an asymmetry between up arrow and down arrow notation. Repeated down arrows count the number of times you have to apply down arrows to get something less than or equal to the base. See MathWorld, for example.
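
Here’s a small Python sketch of the counting idea. It counts how many times you apply the logarithm before the result drops to 1 or below, which matches the 2 ↓↓ 65536 = 4 example above; stopping at the base instead would shift the count by one.

    from math import log

    def iterated_log(b, x, tol=1e-9):
        # Count applications of log base b until the result drops to 1 (or below).
        # The tolerance guards against floating point error when x is exactly
        # a power tower of b.
        count = 0
        while x > 1 + tol:
            x = log(x, b)
            count += 1
        return count

    print(iterated_log(2, 65536))   # 4, matching 2 up-up 4 = 65536
    print(iterated_log(2, 65535))   # also 4, though 65535 isn't a tower of 2s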

Seems like down arrow notation might be useful in number theory where iterated logs are common. (There’s a joke that asks what sound a number theorist makes when drowning. Answer: log log log ….)


Asymmetric surprise

Motivating example: planet spacing

My previous post showed that planets are roughly evenly distributed on a log scale, not just in our solar system but also in extrasolar planetary systems. I hadn’t seen this before I stumbled on it by making some plots.

I didn’t think it was an original discovery—I assume someone did this exercise immediately when systems with several planets were discovered—but I didn’t know what this observation was called. I now know it’s known as the Titius-Bode law, a generalization of an observation about our solar system by Messrs. Titius and Bode a couple centuries ago. See, for example, [1].

Several people were skeptical of the claim that planets are distributed according to a power law and pointed out that uniformly distributed points can look fairly evenly distributed on a logarithmic scale. Which is true, and gets to the topic I want to discuss in this post. Planets are not spaced like uniform random samples (see [1]), and yet it is reasonable, at first glance, to ask whether they are.

Asymmetric surprise

If you’re expecting a power law, and you’re given uniformly distributed data, it doesn’t look too surprising. On the other hand, if you’re expecting uniformly distributed data and you see data distributed according to a power law, you are surprised. I’ll formalize this below.

If you’ve ever tried to make a scaled model of our solar system, you were probably surprised that the planets are far from uniformly spaced. A scaled model of our solar system, say at a museum, is likely to position a few of the inner planets to scale, and then use text to explain where the outer planets should be. For example, there may be a footnote saying “And if everything were to scale, Pluto would be behind the Exxon station at the end of the street.” This is an example of implicitly expecting a uniform distribution and receiving data distributed according to a power law.

Some people suspected that I was doing the opposite. By plotting distances on a log scale, I’m implicitly expecting a power law distribution. Maybe the data were roughly uniform, but I fooled myself into seeing a power law.

Quantifying surprise

The Kullback-Leibler divergence from Y to X, written KL(X || Y), is the average surprise of seeing Y when you expected X. That’s one of the interpretations. See this post for more interpretations.

In general, Kullback-Leibler divergence is not symmetric. The divergence from X to Y typically does not equal the divergence from Y to X. The discussion above claims that the surprise from seeing power law data when expecting a uniform distribution is greater than the surprise from seeing uniform data when expecting a power law distribution. We show below that this is true.

Let X be a random variable uniformly distributed on [0, 1] and let Y be a random variable with distribution proportional to x^α on the same interval. (The proportionality constant necessary to make the probability density integrate to 1 is α + 1.) We will show that KL(X || Y) is greater than KL(Y || X).

First we calculate the two divergences.

\begin{eqnarray*} \mathrm{KL}(X || Y) &=& - \int_0^1 f_X(x) \, \log\left(\frac{f_Y(x)}{f_X(x)} \right) \, dx \\ &=& -\int_0^1 1 \cdot \left( \log(\alpha+1) + \alpha \log x - \log 1 \right) \, dx \\ &=& \alpha - \log(\alpha+1) \end{eqnarray*}

and

\begin{eqnarray*} \mathrm{KL}(Y || X) &=& - \int_0^1 f_Y(x) \, \log\left(\frac{f_X(x)}{f_Y(x)} \right) \, dx \\ &=& -\int_0^1 (\alpha + 1)x^\alpha \left(\log 1 -\log(\alpha+1) - \alpha \log x \right) \, dx \\ &=& \log(\alpha+1) - \frac{\alpha}{1 + \alpha} \end{eqnarray*}
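
Here’s a quick numerical check of those two formulas (a Python sketch using scipy.integrate.quad, with α = 2 as an illustrative value):

    from math import log
    from scipy.integrate import quad

    alpha = 2.0   # illustrative exponent

    f_X = lambda x: 1.0                         # uniform density on [0, 1]
    f_Y = lambda x: (alpha + 1)*x**alpha        # power law density on [0, 1]

    kl_xy = quad(lambda x: f_X(x)*log(f_X(x)/f_Y(x)), 0, 1)[0]
    kl_yx = quad(lambda x: f_Y(x)*log(f_Y(x)/f_X(x)), 0, 1)[0]

    print(kl_xy, alpha - log(alpha + 1))              # both about 0.901
    print(kl_yx, log(alpha + 1) - alpha/(alpha + 1))  # both about 0.432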

And here is a plot comparing the two results as a function of the exponent α.


***

[1] Timothy Bovaird, Charles H. Lineweaver; Exoplanet predictions based on the generalized Titius–Bode relation, Monthly Notices of the Royal Astronomical Society, Volume 435, Issue 2, 21 October 2013, Pages 1126–1138, https://doi.org/10.1093/mnras/stt1357

Planets evenly spaced on log scale

The previous post was about Kepler’s observation that the planets were spaced out around the sun the same way that nested regular solids would be. Kepler only knew of six planets, which was very convenient because there are only five regular solids. In fact, Kepler thought there could only be six planets because there are only five regular solids.

The distances to the planets form roughly a geometric series. Ratios of consecutive distances are between 1.3 and 3.4. That means the distances should be fairly evenly spaced on a logarithmic scale. The plot below shows that this is the case.

Distance to planets in our solar system on log scale

The plot goes beyond the six planets known in Kepler’s day and adds four more: Uranus, Neptune, Pluto, and Eris. Here I’m counting the two largest Kuiper belt objects as planets. Distances are measured in astronomical units, i.e. Earth = 1.
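
Here’s a sketch in Python of how you might make such a plot yourself. The semimajor axes below are approximate values in AU, not the exact figures used for the plot above.

    import numpy as np
    import matplotlib.pyplot as plt

    # approximate semimajor axes in astronomical units
    planets = {"Mercury": 0.39, "Venus": 0.72, "Earth": 1.0, "Mars": 1.52,
               "Jupiter": 5.2, "Saturn": 9.6, "Uranus": 19.2, "Neptune": 30.1,
               "Pluto": 39.5, "Eris": 67.8}

    d = np.array(list(planets.values()))
    print(np.round(d[1:]/d[:-1], 2))   # consecutive ratios, roughly 1.3 to 3.4

    plt.scatter(np.log10(d), np.zeros_like(d))
    for name, x in planets.items():
        plt.annotate(name, (np.log10(x), 0), rotation=90,
                     textcoords="offset points", xytext=(0, 10))
    plt.yticks([])
    plt.xlabel("log10 distance (AU)")
    plt.show()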

Update: The fit is even better if we include Ceres, the largest object in the asteroid belt. It’s now called a dwarf planet, but it was considered a planet for the first fifty years after it was discovered.

Distances in our solar system including planets and dwarf planets

Extrasolar planets

Is this just true of our solar system, or is it true of other planetary systems as well? At the time of writing, we know of one planetary system outside our own that has 8 planets. Three other systems have around 7 planets; I say “around” because it depends on whether you include unconfirmed planets. The spacing between planets in these other systems is also fairly even on a logarithmic scale. Data on planet distances is taken from each system’s Wikipedia page. Distances are semimajor axes, not average distances.

Kepler-90

Kepler-90 is the only planetary system outside our own with eight confirmed planets that we know of.

Distances to planets in Kepler-90 system

HD 10180

HD 10180 has seven confirmed planets and two unconfirmed planets. The unconfirmed planets are included below, the 3rd and 6th objects.

Distances to planets in HD 10180 system

HR 8832

The next largest system is HR 8832 with five confirmed planets and two unconfirmed, numbers 5 and 6 below. It would work out well for this post if the 6th object were found to be a little closer to its star.

Distances to planets in HR 8832 system

TRAPPIST-1

TRAPPIST-1 is interesting because the planets are very close to their star, ranging from 0.01 to 0.06 AU. Also, this system follows an even logarithmic spacing more than the others.

Distances to planets in TRAPPIST-1 system

Systems with six planets

There are currently four known systems with six planets: Kepler-11, Kepler-20, HD 40307, and HD 34445. They also appear to have planets evenly spaced on a log scale.

Distances between planets in known systems with six planets

Like TRAPPIST-1, Kepler-20 has planets closer in and more evenly spaced (on a log scale).
