Poetic description of privacy-preserving analysis

Erlingsson et al. give a poetic description of privacy-preserving analysis in their RAPPOR paper [1]. They say that the goal is to

… allow the forest of client data to be studied, without permitting the possibility of looking at individual trees.

Related posts

[1] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. Available here.

Searching for Mersenne primes

The nth Mersenne number is

M_n = 2^n - 1.

A Mersenne prime is a Mersenne number which is also prime. So far 50 have been found [1].

A necessary condition for Mn to be prime is that n is prime, so searches for Mersenne primes only test prime values of n. It’s not sufficient for n to be prime, as you can see from the example

M11 = 2047 = 23 × 89.

Lucas-Lehmer test

The largest known prime has been a Mersenne prime since 1952, with one exception in 1989. This is because there is an efficient algorithm, the Lucas-Lehmer test, for determining whether a Mersenne number is prime. This is the algorithm used by GIMPS (Great Internet Mersenne Prime Search).

The Lucas-Lehmer test is very simple. The following Python code tests whether Mp is prime for an odd prime p.

    def lucas_lehmer(p):
        M = 2**p - 1
        s = 4
        for _ in range(p-2):
            s = (s*s - 2) % M
        return s == 0

Using this code I was able to verify the first 25 Mersenne primes in under 50 seconds. This includes all Mersenne primes that were known as of 40 years ago.
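
Here is a sketch of how such a search might be driven, using the lucas_lehmer function above together with a simple trial-division primality test for the exponent. The helper below is mine, not from the original post; any primality test would do. Exponents up to about 22,000 cover the first 25 Mersenne primes.

    def is_prime(n):
        # simple trial division, adequate for small exponents
        if n < 2:
            return False
        d = 2
        while d*d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    # M_2 = 3 is prime but is a special case of the test; start at p = 3.
    for p in range(3, 22000):
        if is_prime(p) and lucas_lehmer(p):
            print(p)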

History

Mersenne primes are named after the French monk Marin Mersenne (1588–1648) who compiled a list of Mersenne primes.

Édouard Lucas came up with his test for Mersenne primes in 1856 and in 1876 proved that M127 is prime. That is, he found a 39-digit prime number by hand. Derrick Lehmer refined the test in 1930.

As of January 2018, the largest known prime is M77,232,917.

Related posts

[1] We’ve found 50 Mersenne primes, but we’re not sure whether we’ve found the first 50 Mersenne primes. We know we’ve found the 47 smallest Mersenne primes. It’s possible that there are other Mersenne primes between the 47th and the largest one currently known.

Searching for Fermat primes

Fermat numbers have the form

F_n = 2^{2^n} + 1

Fermat numbers are prime if n = 0, 1, 2, 3, or 4. Nobody has confirmed that any other Fermat numbers are prime. Maybe there are only five Fermat primes and we’ve found all of them. But there might be infinitely many Fermat primes. Nobody knows.

There’s a specialized test for checking whether a Fermat number is prime, Pépin’s test. It says that for n ≥ 1, the Fermat number Fn is prime if and only if

3^{(F_n - 1)/2} \equiv -1 \pmod {F_n}

We can verify fairly quickly that F1 through F4 are prime and that F5 through F14 are not with the following Python code.

    def pepin(n):
        x = 2**(2**n - 1)
        y = 2*x
        return y == pow(3, x, y+1)

    for i in range(1, 15):
        print(pepin(i))

After that the algorithm gets noticeably slower.

We have an efficient algorithm for testing whether Fermat numbers are prime, efficient relative to the size of numbers involved. But the size of the Fermat numbers is growing exponentially. The number of digits in Fn is

1 + \lfloor 2^n \log_{10} 2 \rfloor \approx 0.3 \times 2^n

So F14 has 4,933 digits, for example.
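
Here is a quick Python sketch of that digit count, not from the original post. It reproduces the 4,933 digits for F14 and the F33 figure quoted below.

    from math import log10, floor

    def fermat_digits(n):
        # number of decimal digits in F_n = 2^(2^n) + 1
        return 1 + floor(2**n * log10(2))

    print(fermat_digits(14))  # 4933
    print(fermat_digits(33))  # 2585827973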

The Fermat numbers for n = 5 to 32 are known to be composite. Nobody knows whether F33 is prime. In principle you could find out by using Pépin’s test, but the number you’d need to test has 2,585,827,973 digits, and that will take a while. The problem is not so much that the exponent is so big but that the modulus is also very big.

The next post presents an analogous test for whether a Mersenne number is prime.

Geometry of an oblate spheroid

We all live on an oblate spheroid [1], so it could be handy to know a little about oblate spheroids.

Eccentricity

Conventional notation uses a for the equatorial radius and c for the polar radius. Oblate means a > c. The eccentricity e is defined by

e = \sqrt{1 - \frac{c^2}{a^2}}

For a perfect sphere, a = c and so e = 0. The eccentricity for earth is small, e = 0.08. The eccentricity increases as the polar radius becomes smaller relative to the equatorial radius. For Saturn, the polar radius is 10% less than the equatorial radius, and e = 0.43.

Volume

The volume of an oblate spheroid is simple:

V = \frac{4}{3}\pi a^2 c

Clearly we recover the volume of a sphere when a = c.

Surface area

The surface area is more interesting. The surface of a spheroid is a surface of revolution, and so its area can easily be derived. It works out to be

A = 2\pi a^2 + \pi \frac{c^2}{e} \log\left(\frac{1 +e}{1 - e} \right)

It’s not immediately clear that we get the area of a sphere as c approaches a, but it becomes clear when we expand the log term in a Taylor series.

A = 2\pi a^2 + 2\pi c^2\left( 1 + \frac{e^2}{3} + \frac{e^4}{5} + \frac{e^6}{7} + \cdots \right)

This suggests that an oblate spheroid has approximately the same area as a sphere with radius √((a² + c²)/2), with error on the order of e².
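
To make the formulas concrete, here is a small Python sketch, mine rather than the post’s, that computes the volume, the exact surface area, and the spherical approximation just mentioned. The Earth radii below are the WGS-84 equatorial and polar values used later in the ellipsoid distance post.

    from math import pi, sqrt, log

    def spheroid(a, c):
        # volume, exact area, and sphere approximation for an oblate
        # spheroid with equatorial radius a > polar radius c
        e = sqrt(1 - c**2/a**2)
        V = 4*pi*a**2*c/3
        A = 2*pi*a**2 + pi*(c**2/e)*log((1 + e)/(1 - e))  # natural log
        A_approx = 4*pi*(a**2 + c**2)/2  # sphere of radius sqrt((a^2 + c^2)/2)
        return V, A, A_approx

    print(spheroid(6378.137, 6356.752))  # Earth, radii in km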

If we set a = 1 and let c vary from 0 to 1, we can plot how the surface area varies.

plot c vs area

Here’s the corresponding plot where we use the eccentricity e as our independent variable rather than the polar radius c.

plot of area as a function of eccentricity

Related posts

[1] Not in a yellow submarine as some have suggested.

All possible scales

Pete White contacted me in response to a blog post I wrote enumerating musical scales. He has written a book on the subject, with audio, that he is giving away. He asked if I would host the content, and I am hosting it here.

Here are a couple screen shots from the book to give you an idea what it contains.

Here’s an example scale, number 277 out of 344.

scale 277

And here’s an example of the notes for the accompanying audio files.

sheet music example, track 22

 

Ellipsoid distance on Earth

globe

To first approximation, Earth is a sphere. But it bulges at the equator, and to second approximation, Earth is an oblate spheroid. Earth is not exactly an oblate spheroid either, but the error in the oblate spheroid model is about 100x smaller than the error in the spherical model.

Finding the distance between two points on a sphere is fairly simple. Here’s a calculator to compute the distance, and here’s a derivation of the formula used in the calculator.

Finding the distance between two points on an ellipsoid is much more complicated. (A spheroid is a kind of ellipsoid.) Wikipedia gives a description of Vincenty’s algorithm for finding the distance between two points on Earth using an oblate spheroid model (specifically WGS-84). I’ll include a Python implementation below.

Comparison with spherical distance

How much difference does it make when you calculate distance on an oblate spheroid rather than a sphere? To address that question I looked at the coordinates of several cities around the world using the CityData function in Mathematica. Latitude is in degrees north of the equator and longitude is in degrees east of the prime meridian.

    |--------------+--------+---------|
    | City         |    Lat |    Long |
    |--------------+--------+---------|
    | Houston      |  29.78 |  -95.39 |
    | Caracas      |  10.54 |  -66.93 |
    | London       |  51.50 |   -0.12 |
    | Tokyo        |  35.67 |  139.77 |
    | Delhi        |  28.67 |   77.21 |
    | Honolulu     |  21.31 | -157.83 |
    | Sao Paulo    | -23.53 |  -46.63 |
    | New York     |  40.66 |  -73.94 |
    | Los Angeles  |  34.02 | -118.41 |
    | Cape Town    | -33.93 |   18.46 |
    | Sydney       | -33.87 |  151.21 |
    | Tromsø       |  69.66 |   18.94 |
    | Singapore    |   1.30 |  103.85 |
    |--------------+--------+---------|

Here are the error extremes.

The spherical model underestimates the distance from London to Tokyo by 12.88 km, and it overestimates the distance from London to Cape Town by 45.40 km.

The relative error is most negative for London to New York (-0.157%) and most positive for Tokyo to Sydney (0.545%).

Update: The comparison above used the same radius for both the spherical and ellipsoidal models. As suggested in the comments, a better comparison would use the mean radius (average of equatorial and polar radii) in the spherical model rather than the equatorial radius.

With that change the most negative absolute error is between Tokyo and New York at -30,038 m. The most negative relative error is between London and New York at -0.324%.

The largest positive error, absolute and relative, is between Tokyo and Sydney. The absolute error is 29,289 m and the relative error is 0.376%.

Python implementation

The code below is a direct implementation of the equations in the Wikipedia article.

Note that longitude and latitude below are assumed to be in radians. You can convert from degrees to radians with NumPy’s deg2rad function.

from numpy import sin, cos, tan, arctan, arctan2, arccos, pi

def spherical_distance(lat1, long1, lat2, long2):
    phi1 = 0.5*pi - lat1
    phi2 = 0.5*pi - lat2
    r = 0.5*(6378137 + 6356752) # mean radius in meters
    t = sin(phi1)*sin(phi2)*cos(long1-long2) + cos(phi1)*cos(phi2)
    return r * arccos(t)

def ellipsoidal_distance(lat1, long1, lat2, long2):

    a = 6378137.0 # equatorial radius in meters 
    f = 1/298.257223563 # ellipsoid flattening 
    b = (1 - f)*a 
    tolerance = 1e-11 # to stop iteration

    phi1, phi2 = lat1, lat2
    U1 = arctan((1-f)*tan(phi1))
    U2 = arctan((1-f)*tan(phi2))
    L1, L2 = long1, long2
    L = L2 - L1

    lambda_old = L + 0

    while True:
    
        t = (cos(U2)*sin(lambda_old))**2
        t += (cos(U1)*sin(U2) - sin(U1)*cos(U2)*cos(lambda_old))**2
        sin_sigma = t**0.5
        cos_sigma = sin(U1)*sin(U2) + cos(U1)*cos(U2)*cos(lambda_old)
        sigma = arctan2(sin_sigma, cos_sigma) 
    
        sin_alpha = cos(U1)*cos(U2)*sin(lambda_old) / sin_sigma
        cos_sq_alpha = 1 - sin_alpha**2
        cos_2sigma_m = cos_sigma - 2*sin(U1)*sin(U2)/cos_sq_alpha
        C = f*cos_sq_alpha*(4 + f*(4-3*cos_sq_alpha))/16
    
        t = sigma + C*sin_sigma*(cos_2sigma_m + C*cos_sigma*(-1 + 2*cos_2sigma_m**2))
        lambda_new = L + (1 - C)*f*sin_alpha*t
        if abs(lambda_new - lambda_old) <= tolerance:
            break
        else:
            lambda_old = lambda_new

    u2 = cos_sq_alpha*((a**2 - b**2)/b**2)
    A = 1 + (u2/16384)*(4096 + u2*(-768+u2*(320 - 175*u2)))
    B = (u2/1024)*(256 + u2*(-128 + u2*(74 - 47*u2)))
    t = cos_2sigma_m + 0.25*B*(cos_sigma*(-1 + 2*cos_2sigma_m**2))
    t -= (B/6)*cos_2sigma_m*(-3 + 4*sin_sigma**2)*(-3 + 4*cos_2sigma_m**2)
    delta_sigma = B * sin_sigma * t
    s = b*A*(sigma - delta_sigma)

    return s
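
Here is a usage sketch, not from the original post, computing both distances between London and Tokyo using the coordinates in the table above.

    from numpy import deg2rad

    # London and Tokyo (latitude, longitude) in degrees, from the table above
    london = deg2rad(51.50), deg2rad(-0.12)
    tokyo = deg2rad(35.67), deg2rad(139.77)

    print(spherical_distance(london[0], london[1], tokyo[0], tokyo[1]))
    print(ellipsoidal_distance(london[0], london[1], tokyo[0], tokyo[1]))

Both results are in meters.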

Sequence alignment

In my previous post I illustrated the Levenshtein edit distance by comparing the opening paragraphs of Finnegans Wake by James Joyce and a parody by Adam Roberts.

In this post I’ll show how to align two sequences using the sequence alignment algorithms of Needleman-Wunsch and Hirschberg. These algorithms can be used to compare any sequences, though they are more often used to compare DNA sequences than impenetrable novels and parodies.

I’ll be using Gang Li’s implementation of these algorithms, available on github. I believe the two algorithms are supposed to produce the same results, and that Hirschberg’s algorithm is a more space-efficient implementation of the Needleman-Wunsch algorithm, though the two algorithms below produce slightly different results. I’ll give the output of Hirschberg’s algorithm.

Li’s alignment code uses lists of characters for input and output. I wrote a simple wrapper to take in strings and output strings.

    from alignment import Needleman, Hirschberg

    def compare(str1, str2):
        seq1 = list(str1)
        seq2 = list(str2)
        for algorithm in [Needleman(), Hirschberg()]:
            a, b = algorithm.align(seq1, seq2)
            print("".join(a))
            print("".join(b))
            print()
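
For example, assuming joyce and roberts are strings holding the opening paragraphs quoted in the previous post, the alignments below came from calls along these lines (a usage sketch, not code from the post).

    joyce = "riverrun, past Eve and Adam’s, ..."   # opening of Finnegans Wake
    roberts = "movierun, past new and hopes, ..."  # opening of Finnegans Ewok
    compare(joyce, roberts)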

The code inserts vertical bars to indicate spaces added for alignment. Here’s the result of using the Needleman-Wunsch algorithm on the opening paragraphs of Finnegans Wake and the parody Finnegans Ewok.

    |||riverrun, past Ev|e| and Adam'||||s,
    mov|i|er|un, past ||new and |||||hopes,

    from swe|rv||e of shore|||| to bend of
    from s||tr|ike of |||||back to bend of

    b|||ay, brings us by a commodius
    |jeday, brings us by a commodius

    vic|u||s of recirculation back to
    |||lucas of recirculation back to

    H|owth Ca||stle|||| and E|nvi||r|ons.
    |fo||||||rest||moon and |en||dor.||||

I mentioned in my previous post that I could compare the first four paragraphs easily, but I had some trouble aligning the fifth paragraphs. The fifth paragraphs of each version start out quite similar:

    Bygme||ster Fi|nnega||n, of the 
    Bygm|onster ||Ann||akin, of the 
    
    Stutte||r|||||||ing Hand, f|re|emen'|s
    ||||||Throatchokin| Hand, for|cemen|’s
    
    mau-rer, lived in the broadest way
    mau-rer, lived in the broadest way
    
    immarginable in his rushlit
    immarginable in his rushlit
    
    toofar-|||back for messuages before
    toofar| — back for messuages before 

but then Roberts takes the liberty of skipping over a large section of the original. This is what I suspected by looking at the two texts, but Hirschberg’s algorithm makes the edits obvious by showing two long sequences of vertical bars, one about 600 characters long and another about 90 characters long.

Levenshtein distance from Finnegans Wake to Return of the Jedi

Ewok

I ran into a delightfully strange blog post today called Finnegans Ewok that edits the first few paragraphs of Finnegans Wake to make it into something like Return of the Jedi.

The author, Adam Roberts, said via Twitter “What I found interesting here was how little I had to change Joyce’s original text. Tweak a couple of names and basically leave it otherwise as was.”

So what I wanted to do is quantify just how much had to change using the Levenshtein distance, which is essentially the number of one-character changes necessary to transform one string into another.

Here’s the first paragraph from James Joyce:

riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs.

And here’s the first paragraph from Adam Roberts:

movierun, past new and hopes, from strike of back to bend of jeday, brings us by a commodius lucas of recirculation back to forestmoon and endor.

The original paragraph is 150 characters, the parody is 145 characters, and the Levenshtein distance is 44.
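
Here is a sketch of one way to compute that distance in Python, using the standard Wagner-Fischer dynamic programming algorithm. The post doesn’t say what implementation was used; applied to the two paragraphs above, this should give 44, though the exact count depends on how the curly apostrophe is encoded.

    def levenshtein(s, t):
        # Wagner-Fischer dynamic programming, keeping one row at a time
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, start=1):
            curr = [i]
            for j, ct in enumerate(t, start=1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,        # deletion
                                curr[j-1] + 1,      # insertion
                                prev[j-1] + cost))  # substitution
            prev = curr
        return prev[-1]

    joyce = "riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs."
    roberts = "movierun, past new and hopes, from strike of back to bend of jeday, brings us by a commodius lucas of recirculation back to forestmoon and endor."
    print(len(joyce), len(roberts), levenshtein(joyce, roberts))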

Here’s a summary of the results for the first four paragraphs.

    |-------+---------+----------|
    | Joyce | Roberts | Distance |
    |-------+---------+----------|
    |   150 |     145 |       44 |
    |   700 |     727 |      119 |
    |   594 |     615 |      145 |
    |  1053 |     986 |      333 |
    |-------+---------+----------|

The fifth paragraph seems to diverge more from Joyce. I may have gotten something misaligned, and reading enough of Finnegans Wake to debug the problem made my head hurt, so I stopped.

Update: See the next post for sequence alignment applied to the two sources. This lets you see not just the number of edits but what the edits are. It also shows why I was having difficulty aligning the fifth paragraphs.

Related posts

Rényi Differential Privacy

Differential privacy, specifically ε-differential privacy, gives strong privacy guarantees, but it can be overly cautious by focusing on worst-case scenarios. The generalization (ε, δ)-differential privacy was introduced to make ε-differential privacy more flexible.

Rényi differential privacy (RDP) is a new generalization of ε-differential privacy by Ilya Mironov that is comparable to the (ε, δ) version but has several advantages. For instance, RDP is easier to interpret than (ε, δ)-DP and composes more simply.

Rényi divergence

My previous post discussed Rényi entropy. Rényi divergence is to Rényi entropy what Kullback-Leibler divergence is to Shannon entropy.

Given two random variables X and Y that take on n possible values each with positive probabilities pi and qi respectively, the Rényi divergence of X from Y is defined by

D_\alpha(X || Y) = \frac{1}{\alpha -1} \log\left( \sum_{i=1}^n \frac{p_i^\alpha}{q_i^{\alpha -1}}\right)

for α > 0 and not equal to 1. The definition extends to α = 0, 1, or ∞ by taking limits.

As α converges to 1, Dα converges to the Kullback-Leibler divergence.
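
Here is a small numerical sketch of the divergence, mine rather than Mironov’s, for two discrete distributions. Natural log is used; a different base only rescales the result.

    import numpy as np

    def renyi_divergence(p, q, alpha):
        # D_alpha(X || Y) for discrete distributions with positive probabilities
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.log(np.sum(p**alpha / q**(alpha - 1))) / (alpha - 1)

    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    print(renyi_divergence(p, q, 1.001))  # close to the Kullback-Leibler divergence
    print(renyi_divergence(p, q, 100.0))  # approaches the order-infinity divergence, log max(p_i/q_i)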

Rényi differential privacy

A couple weeks ago I wrote an introduction to differential privacy that started by saying

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not.

The post develops precisely what it means for a query to be differentially private. The bottom line (literally!) was

An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D’,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

It turns out that this definition is equivalent to saying the Rényi divergence of order ∞ between A(D) and A(D’) is less than ε. That is,

D_\alpha\big(A(D) || A(D')\big) \leq \varepsilon

where α = ∞. The idea of Rényi differential privacy (RDP) is to allow the possibility that α could be finite [1]. That is, an algorithm is (α, ε)-RDP if the Rényi divergence of order α between any two adjacent databases is no more than ε.

One of the nice properties of RDP is how simply algorithms compose: the composition of an (α, ε1)-RDP algorithm and an (α, ε2)-RDP algorithm is an (α, ε1 + ε2)-RDP algorithm. This leads to simple privacy budget accounting.

Comparison to (ε, δ)-differential privacy

The composition of ε-DP algorithms is simple: just substitute α = ∞ in the result above for composing RDP algorithms. However, the resulting bound may be overly pessimistic.

The composition of (ε, δ)-DP algorithms is not as simple: both ε and δ change. With RDP, the order α does not change, and the ε values simply add.

Mironov proves in [1] that every RDP algorithm is also an (ε, δ)-DP algorithm. Specifically, Proposition 3 says

If f is an (α, ε)-RDP mechanism, it also satisfies

\left(\varepsilon - \frac{\log \delta}{\alpha - 1}, \delta\right)

differential privacy for any 0 < δ < 1.

This tells us that Rényi differential privacy gives us privacy guarantees that are somewhere between ε-differential privacy and (ε, δ)-differential privacy.

Related posts

[1] Ilya Mironov, Rényi Differential Privacy. arXiv 1702.07476

Rényi Entropy

Alfred Renyi

The most common way of measuring information is Shannon entropy, but there are others. Rényi entropy, developed by the Hungarian mathematician Alfréd Rényi, generalizes Shannon entropy and includes other entropy measures as special cases.

Rényi entropy of order α

If a discrete random variable X has n possible values, where the ith outcome has probability pi, then the Rényi entropy of order α is defined to be

H_\alpha(X) = \frac{1}{1 - \alpha} \log_2 \left(\sum_{i=1}^n p_i^\alpha \right)

for 0 ≤ α ≤ ∞. In the case α = 1 or ∞ this expression means the limit as α approaches 1 or ∞ respectively.

Rényi entropy of continuous random variable

The definition of Rényi entropy can be extended to continuous random variables by

H_\alpha(X) = \frac{1}{1 - \alpha} \log_2 \int f_X(x)^\alpha \, dx

but unlike the discrete case, Rényi entropy can be negative for continuous random variables, and so Rényi entropy is typically only used for discrete variables. For example, let X be the random variable defined on [1, ∞) with density

f_X(x) = 3x^{-4}

Then

H_2(X) = -\log_2 \int_1^\infty (3x^{-4})^2\, dx = -\log_2(9/7) \approx -0.36

Special cases

Each value of α gives a possible entropy measure. All are additive for independent random variables. And for each discrete random variable X, Hα is a monotone non-increasing function of α.

Max-entropy: α = 0

Assume all the probabilities pi are positive. Then H0 is known as the max-entropy, or Hartley entropy. It is simply log2 n, the log of the number of values X takes on with positive probability.

Shannon entropy: α = 1

When α = 1 we get the more familiar Shannon entropy:

H_1(X) = \lim_{\alpha\to1} H_\alpha(X) = - \left(\sum_{i=1}^n p_i \log_2 p_i \right)

Collision entropy: α = 2

When the order α is not specified, its implicit default value is 2. That is, when people speak of Rényi entropy without qualification, they often have in mind the case α = 2. This case is also called collision entropy and is used in quantum information theory.

Min-entropy: α = ∞

In the limit as α goes to ∞, the Rényi entropy of X converges to the negative log of the probability of the most probable outcome. That is,

H_{\infty}(X) = \lim_{\alpha\to\infty} H_\alpha(X) = - \log_2 \max p_i
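
Here is a sketch that computes these special cases for a sample distribution. The printed values decrease as α increases, illustrating that Hα is non-increasing in α.

    import numpy as np

    def renyi_entropy(p, alpha):
        # Rényi entropy in bits for a discrete distribution with positive probabilities
        p = np.asarray(p, dtype=float)
        if alpha == 0:
            return np.log2(len(p))          # max-entropy (Hartley)
        if alpha == 1:
            return -np.sum(p * np.log2(p))  # Shannon entropy
        if alpha == np.inf:
            return -np.log2(np.max(p))      # min-entropy
        return np.log2(np.sum(p**alpha)) / (1 - alpha)

    p = [0.5, 0.25, 0.125, 0.125]
    for a in [0, 1, 2, np.inf]:
        print(a, renyi_entropy(p, a))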

Relation to Lebesgue norms

Let p without a subscript be the vector of all the pi. Then for α not equal to 1,

H_\alpha(X) = \frac{\alpha}{1 - \alpha} \log_2 || p ||_\alpha

As with Lebesgue norms, you use varying values of the parameter to emphasize various features.

Related posts

This post was a warm-up for the next post: Rényi differential privacy.

Other related posts:

Curry-Howard-Lambek correspondence

Curry-Howard-Lambek is a set of correspondences between logic, programming, and category theory. You may have heard of the slogan proofs-as-programs or propositions-as-types. These refer to the Curry-Howard correspondence between mathematical proofs and programs. Lambek’s name is appended to the Curry-Howard correspondence to represent connections to category theory.

The term Curry-Howard isomorphism is often used but is an overstatement. Logic and programming are not isomorphic, nor is either isomorphic to category theory.

You’ll also hear the term computational trinitarianism. That may be appropriate because logic, programming, and category theory share common attributes without being identical.

The relations between logic, programming, and categories are more than analogies, but less than isomorphisms.

There are formal correspondences between parts of each theory. And aside from these rigorous connections, there is also a heuristic that something interesting in one area is likely to have an interesting counterpart in the other areas. In that sense Curry-Howard-Lambek is similar to the Langlands program relating number theory and geometry, a set of theorems and conjectures as well as a sort of philosophy.

International internet privacy law

world map

Scott Hanselman interviewed attorney Gary Nissenbaum in show #647 of Hanselminutes. The title was “How GDPR is effecting the American Legal System.”

Can Europe pass laws constraining American citizens? Didn’t we settle that question in 1776, or at least by 1783? And yet it is inevitable that European law affects Americans. And in fact Nissenbaum argues that every country has the potential to pass internet regulation affecting citizens of every other country in the world.

Hanselman: Doesn’t that imply that we can’t win? There’s two hundred and something plus countries in the world and if any European decides to swing by a web site in Djibouti now they’re going to be subject to laws of Europe?

Nissenbaum: I’ll double down on that. It implies that any country that has users of the internet can create a more stringent law than even the Europeans, and then on the basis of that being the preeminent regulatory body of the world, because it’s a race to who can be the most restrictive. Because the most restrictive is what everyone needs to comply with.

So if Tanzania decides that it is going to be the most restrictive country in terms of the laws … that relate to internet use of their citizens, theoretically, all web sites around the world have to be concerned about that because there are users that could be accessing their web site from Tanzania and they wouldn’t even know it.

Will the “world wide web” someday not be worldwide at all? There has been speculation, for example, that we’ll eventually have at least two webs, one Chinese and one non-Chinese. The web could tear into a lot more pieces than that.

As Nissenbaum says toward the end of the podcast

If anyone assumes there’s a simple way of handling this, they’re probably wrong. It is complicated, and you just have to live with that, because that’s the world we’re in.

Related post: GDPR and the right to be forgotten

Prime denominators and nines complement

Let p be a prime. If the repeating decimal for the fraction a/p has even period, then the second half of the digits is the 9’s complement of the first half. This is known as Midy’s theorem.

For a small example, take

1/7 = 0.142857142857…

and notice that 142 + 857 = 999. That is, 8, 5, and 7 are the nine’s complements of 1, 4, and 2 respectively.

For a larger example, we can use Mathematica to look at the decimal expansion of 6/47:

    In:  N[6/47, 60]
    Out: 0.127659574468085106382978723404255319148936170212765957446809

and we can confirm

    12765957446808510638297 + 
    87234042553191489361702 =
    99999999999999999999999
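
Here is a short Python sketch, not from the original post, that computes the repeating block of a/p by long division and checks the complement property.

    def repetend(a, p):
        # digits of the repeating block of a/p, assuming p is a prime
        # other than 2 or 5 and a is not a multiple of p
        digits, r = [], a % p
        while True:
            r *= 10
            digits.append(r // p)
            r %= p
            if r == a % p:
                return digits

    d = repetend(6, 47)
    half = len(d) // 2
    first = int("".join(map(str, d[:half])))
    second = int("".join(map(str, d[half:])))
    print(first + second)  # 23 nines, as above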

Let’s do another example with 6/43:

    In:  N[6/43, 50]
    Out: 0.13953488372093023255813953488372093023255813953488

Does the theorem hold here? The hypothesis of the theorem doesn’t hold because the fraction has period 21, which is not even. Can the conclusion of the theorem hold anyway? No, because we can’t split 21 digits in half! But we can split 21 digits into three parts, and if we do we find

    1395348 + 8372093 + 232558 = 9999999

Let’s finish up with an example in another base. We’ll look at the fraction 3/7 in base 17.

    In:  BaseForm[N[3/7, 24], 17]
    Out: 0.74e9c274e9c274e9c2₁₇

If we split the repeating part of the decimal into two parts we get 74e and 9c2, and these add up to the base 17 analog of all 9’s.

    In:  BaseForm[17^^74e + 17^^9c2, 17]
    Out: ggg₁₇

That is, the base 17 numbers 9, c, and 2 are the g’s complements of 7, 4, and e respectively.

[The Mathematica operator ^^ interprets its right argument as a number whose base is given by the left argument. The BaseForm function takes a number as its first argument and returns its representation in the base given as its second argument.]

Related post: Casting out Z’s

Kilogram redefined in terms of Planck constant

The General Conference on Weights and Measures voted today to redefine the kilogram. The official definition no longer refers to the mass of the International Prototype of the Kilogram (IPK) stored at the BIPM (Bureau International des Poids et Mesures) in France. The coulomb, kelvin, and mole have also been redefined.

The vote took place today, 2018-11-16, and the new definitions will officially begin 2019-05-20.

New definition of kilogram

Until now, Planck’s constant had been measured to be

h = 6.62607015 × 10⁻³⁴ kg m² s⁻¹.

Now this is the exact value of Planck’s constant by definition, implicitly defining the kilogram.

The other SI units were already defined in terms of fundamental physical phenomena. The second is defined as

the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the cesium-133 atom

and the statement that the speed of light is

c = 299,792,458 m s⁻¹

went from being a measurement to a definition, much as the measurement of Planck’s constant became a definition [1].

So now the physics of cesium-133 defines the second, the speed of light defines the meter, and Planck’s constant defines the kilogram. There are no more SI units defined by physical objects.

New definition of ampere

The elementary charge had been measured as

e = 1.602176634 × 10⁻¹⁹ C

and now this equation is exact by definition, defining the coulomb in terms of the elementary charge. And since the ampere is defined as a flow of one coulomb per second, this also changes the definition of the ampere.

New definition of kelvin

The kelvin unit of temperature is now defined in terms of the Boltzmann constant, again reversing the role of measurement and definition. Now the expression

k = 1.380649 × 10⁻²³ kg m² s⁻² K⁻¹

for the Boltzmann constant is exact by definition, defining K.

New definition of mole

Avogadro’s number is now exact, defining the mole:

NA = 6.02214076 × 10²³ mol⁻¹.

Connection to Twitter accounts

Incidentally, when I first started my Twitter account @UnitFact for units of measure, I used the standard kilogram as the icon.

Then some time later I changed the icon to a Greek mu, the symbol for “micro” used with many units.

UnitFact icon

The icon for @ScienceTip, however, is Planck’s constant!

h bar, Planck constant over two pi

It would have been cool if I had anticipated the redefinition of the kilogram and replaced my kilogram icon with the Planck constant icon. (Actually, it’s not Planck’s constant h but the “reduced Planck constant” h-bar equal to h/2π.)

Related posts

[1] The meter was originally defined in 1791 to be one ten-millionth of the distance from the equator to the North Pole. In 1889, it was redefined to be the distance between two marks on a metal bar stored at BIPM. In 1960 the meter was defined in terms of the spectrum of krypton-86, the first definition that did not depend on an object kept in a vault. In 1983 the definition changed to the one above, making the speed of light exact.

The second was defined in terms of the motion of the earth until 1967 when the definition in terms of cesium-133 was adopted.

 

Comparing bfloat16 range and precision to other 16-bit numbers

server rack

Deep learning has spurred interest in novel floating point formats. Algorithms often don’t need as much precision as standard IEEE-754 doubles or even single precision floats. Lower precision makes it possible to hold more numbers in memory, reducing the time spent swapping numbers in and out of memory. Also, low-precision circuits are far less complex. Together these benefits can give a significant speedup.

Here I want to look at bfloat16, or BF16 for short, and compare it to 16-bit number formats I’ve written about previously, IEEE and posit. BF16 is becoming a de facto standard for deep learning. It is supported by several deep learning accelerators (such as Google’s TPU), and will be supported in Intel processors two generations from now.

Bit layout

The BF16 format is sort of a cross between FP16 and FP32, the 16- and 32-bit formats defined in the IEEE 754-2008 standard, also known as half precision and single precision.

BF16 has 16 bits like FP16, but has the same number of exponent bits as FP32. Each number has 1 sign bit. The rest of the bits in each of the formats are allocated as in the table below.

    |--------+------+----------+----------|
    | Format | Bits | Exponent | Fraction |
    |--------+------+----------+----------|
    | FP32   |   32 |        8 |       23 |
    | FP16   |   16 |        5 |       10 |
    | BF16   |   16 |        8 |        7 |
    |--------+------+----------+----------|

BF16 has as many bits as a FP16, but as many exponent bits as a FP32. The latter makes conversion between BF16 and FP32 simple, except for some edge cases regarding denormalized numbers.

Precision

The epsilon value, the smallest number ε such that 1 + ε > 1 in machine representation, is 2⁻ᵉ where e is the number of fraction bits. BF16 has much less precision near 1 than the other formats.

    |--------+------------|
    | Format |    Epsilon |
    |--------+------------|
    | FP32   | 0.00000012 |
    | FP16   | 0.00390625 |
    | BF16   | 0.03125000 |
    |--------+------------|

Dynamic range

The dynamic range of bfloat16 is similar to that of an IEEE single precision number. Relative to FP32, BF16 sacrifices precision to retain range. Range is mostly determined by the number of exponent bits, though not entirely.

Dynamic range in decades is the log base 10 of the ratio of the largest to smallest representable positive numbers. The dynamic ranges of the numeric formats are given below. (Python code to calculate dynamic range is given here.)

    |--------+-------|
    | Format | DR    |
    |--------+-------|
    | FP32   | 83.38 |
    | BF16   | 78.57 |
    | FP16   | 12.04 |
    |--------+-------|
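
The linked code isn’t reproduced here, but here is a sketch of one way to compute the table, measuring from the largest finite value down to the smallest positive subnormal of an IEEE-style format.

    from math import log10

    def dynamic_range(exponent_bits, fraction_bits):
        # dynamic range in decades, subnormals included
        bias = 2**(exponent_bits - 1) - 1
        largest = 2.0**bias * (2 - 2.0**-fraction_bits)
        smallest = 2.0**(1 - bias - fraction_bits)  # smallest positive subnormal
        return log10(largest) - log10(smallest)

    for name, e, f in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)]:
        print(name, round(dynamic_range(e, f), 2))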

Comparison to posits

The precision and dynamic range of posit numbers depends on how many bits you allocate to the maximum exponent, denoted es by convention. (Note “maximum.” The number of exponent bits varies for different numbers.) This post explains the anatomy of a posit number.

Posit numbers can achieve more precision and more dynamic range than IEEE-like floating point numbers with the same number of bits. Of course there’s no free lunch. Posits represent large numbers with low precision and small numbers with high precision, but this trade-off is often what you’d want.

For an n-bit posit, the number of fraction bits near 1 is n – 2 – es and so epsilon is 2 to the exponent es – n – 2. The dynamic range is

\log_{10}\left( 2^{2^{\text{\small\emph{es}}} } \right)^{2n-4} = (2n-4)2^{es}\log_{10}2

which is derived here. The dynamic range and epsilon values for 16-bit posits with es ranging from 1 to 4 are given in the table below.

    |----+--------+-----------|
    | es |     DR |   epsilon |
    |----+--------+-----------|
    |  1 |  16.86 | 0.0000076 |
    |  2 |  33.72 | 0.0000153 |
    |  3 |  67.43 | 0.0000305 |
    |  4 | 134.86 | 0.0000610 |
    |----+--------+-----------|

For all the values of es above, a 16-bit posit number has a smaller epsilon than either FP16 or BF16. The dynamic range of a 16-bit posit is larger than that of a FP16 for all values of es, and greater than BF16 and FP32 when es = 4.
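
As a quick sanity check on the formula, here is a sketch that evaluates it for the table above.

    from math import log10

    def posit_dynamic_range(n, es):
        # (2n - 4) * 2^es * log10(2), the formula above
        return (2*n - 4) * 2**es * log10(2)

    for es in range(1, 5):
        print(es, round(posit_dynamic_range(16, es), 2))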

Related posts