Ellipsoid distance on Earth

globe

To first approximation, Earth is a sphere. But it bulges at the equator, and to second approximation, Earth is an oblate spheroid. Earth is not exactly an oblate spheroid either, but the error in the oblate spheroid model is about 100x smaller than the error in the spherical model.

Finding the distance between two points on a sphere is fairly simple. Here’s a calculator to compute the distance, and here’s a derivation of the formula used in the calculator.

Finding the distance between two points on an ellipsoid is much more complicated. (A spheroid is a kind of ellipsoid.) Wikipedia gives a description of Vincenty’s algorithm for finding the distance between two points on Earth using an oblate spheroid model (specifically WGS-84). I’ll include a Python implementation below.

Comparison with spherical distance

How much difference does it make when you calculate difference on an oblate spheroid rather than a sphere? To address that question I looked at the coordinates of several cities around the world using the CityData function in Mathematica. Latitude is in degrees north of the equator and longitude is in degrees east of the prime meridian.

    |--------------+--------+---------|
    | City         |    Lat |    Long |
    |--------------+--------+---------|
    | Houston      |  29.78 |  -95.39 |
    | Caracas      |  10.54 |  -66.93 |
    | London       |  51.50 |   -0.12 |
    | Tokyo        |  35.67 |  139.77 |
    | Delhi        |  28.67 |   77.21 |
    | Honolulu     |  21.31 | -157.83 |
    | Sao Paulo    | -23.53 |  -46.63 |
    | New York     |  40.66 |  -73.94 |
    | Los Angeles  |  34.02 | -118.41 |
    | Cape Town    | -33.93 |   18.46 |
    | Sydney       | -33.87 |  151.21 |
    | Tromsø       |  69.66 |   18.94 |
    | Singapore    |   1.30 |  103.85 |
    |--------------+--------+---------|

Here are the error extremes.

The spherical model underestimates the distance from London to Tokyo by 12.88 km, and it overestimates the distance from London to Cape Town by 45.40 km.

The relative error is most negative for London to New York (-0.157%) and most positive for Tokyo to Sidney (0.545%).

Update: The comparison above used the same radius for both the spherical and ellipsoidal models. As suggested in the comments, a better comparison would use the mean radius (average of equatorial and polar radii) in the spherical model rather than the equatorial radius.

With that change the most negative absolute error is between Tokyo and New York at -30,038 m. The most negative relative error is between London and New York at -0.324%.

The largest positive error, absolute and relative, is between Tokyo and Sydney. The absolute error is 29,289 m and the relative error is 0.376%.

Python implementation

The code below is a direct implementation of the equations in the Wikipedia article.

Note that longitude and latitude below are assumed to be in radians. You can convert from degrees to radians with SciPy’s deg2rad function.

from scipy import sin, cos, tan, arctan, arctan2, arccos, pi

def spherical_distance(lat1, long1, lat2, long2):
    phi1 = 0.5*pi - lat1
    phi2 = 0.5*pi - lat2
    r = 0.5*(6378137 + 6356752) # mean radius in meters
    t = sin(phi1)*sin(phi2)*cos(long1-long2) + cos(phi1)*cos(phi2)
    return r * arccos(t)

def ellipsoidal_distance(lat1, long1, lat2, long2):

    a = 6378137.0 # equatorial radius in meters 
    f = 1/298.257223563 # ellipsoid flattening 
    b = (1 - f)*a 
    tolerance = 1e-11 # to stop iteration

    phi1, phi2 = lat1, lat2
    U1 = arctan((1-f)*tan(phi1))
    U2 = arctan((1-f)*tan(phi2))
    L1, L2 = long1, long2
    L = L2 - L1

    lambda_old = L + 0

    while True:
    
        t = (cos(U2)*sin(lambda_old))**2
        t += (cos(U1)*sin(U2) - sin(U1)*cos(U2)*cos(lambda_old))**2
        sin_sigma = t**0.5
        cos_sigma = sin(U1)*sin(U2) + cos(U1)*cos(U2)*cos(lambda_old)
        sigma = arctan2(sin_sigma, cos_sigma) 
    
        sin_alpha = cos(U1)*cos(U2)*sin(lambda_old) / sin_sigma
        cos_sq_alpha = 1 - sin_alpha**2
        cos_2sigma_m = cos_sigma - 2*sin(U1)*sin(U2)/cos_sq_alpha
        C = f*cos_sq_alpha*(4 + f*(4-3*cos_sq_alpha))/16
    
        t = sigma + C*sin_sigma*(cos_2sigma_m + C*cos_sigma*(-1 + 2*cos_2sigma_m**2))
        lambda_new = L + (1 - C)*f*sin_alpha*t
        if abs(lambda_new - lambda_old) <= tolerance:
            break
        else:
            lambda_old = lambda_new

    u2 = cos_sq_alpha*((a**2 - b**2)/b**2)
    A = 1 + (u2/16384)*(4096 + u2*(-768+u2*(320 - 175*u2)))
    B = (u2/1024)*(256 + u2*(-128 + u2*(74 - 47*u2)))
    t = cos_2sigma_m + 0.25*B*(cos_sigma*(-1 + 2*cos_2sigma_m**2))
    t -= (B/6)*cos_2sigma_m*(-3 + 4*sin_sigma**2)*(-3 + 4*cos_2sigma_m**2)
    delta_sigma = B * sin_sigma * t
    s = b*A*(sigma - delta_sigma)

    return s

Rényi Differential Privacy

Differential privacy, specifically ε-differential privacy, gives strong privacy guarantees, but it can be overly cautious by focusing on worst-case scenarios. The generalization (ε, δ)-differential privacy was introduced to make ε-differential privacy more flexible.

Rényi differential privacy (RDP) is a new generalization of ε-differential privacy by Ilya Mironov that is comparable to the (ε, δ) version but has several advantages. For instance, RDP is easier to interpret than (ε, δ)-DP and composes more simply.

Rényi divergence

My previous post discussed Rényi entropyRényi divergence is to Rényi entropy what Kullback-Leibler divergence is to Shannon entropy.

Given two random variables X and Y that take on n possible values each with positive probabilities pi and qi respectively, the Rényi divergence of X from Y is defined by

D_\alpha(X || Y) = \frac{1}{\alpha -1} \log\left( \sum_{i=1}^n \frac{p_i^\alpha}{q_i^{\alpha -1}}\right)

for α > 0 and not equal to 1. The definition extends to α = 0, 1, or ∞ by taking limits.

As α converges to 1, Dα converges to the Kullback-Leibler divergence.

Rényi differential privacy

A couple weeks ago I wrote an introduction to differential privacy that started by saying

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not.

The post develops precisely what it means for a query to be differentially private. The bottom line (literally!) was

An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D’,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

It turns out that this definition is equivalent to saying the Rényi divergence of order ∞ between A(D) and A(D‘) is less than ε. That is,

D_\alpha\big(A(D) || A(D')\big) \leq \varepsilon

where α = ∞. The idea of Rényi differential privacy (RDP) is to allow the possibility that α could be finite [1]. That is, an algorithm is (α, ε)-RDP if the Rényi divergence of order α between any two adjacent databases is no more than ε.

One of the nice properties of RDP is how simply algorithms compose: The composition of an (α, ε1)-RDP algorithm and an (α, ε2)-RDP algorithm is a (α, ε1 + ε2)-RDP algorithm. This leads to simple privacy budget accounting.

Comparison to (ε, δ)-differential privacy

The composition of ε-DP algorithms is simple: just substitute α = ∞ in the result above for composing RDP algorithms. However, the resulting bound may be overly pessimistic.

The composition of (ε, δ)-DP algorithms is not as simple: both ε and δ change. With RDP, the order α does not change, and the ε values simply add.

Mironov proves in [1] that every RDP algorithm is also (ε, δ)-DP algorithm. Specifically, Proposition 3 says

If f is an (α, ε)-RDP mechanism, it also satisfies

\left(\varepsilon - \frac{\log \delta}{\alpha - 1}, \delta\right)

differential privacy for any 0 < δ < 1.

This tells us that Rényi differential privacy gives us privacy guarantees that are somewhere between ε-differential privacy and (ε, δ)-differential privacy.

Related posts

[1] Ilya Mironov, Rényi Differential Privacy. arXiv 1702.07476

Rényi Entropy

Alfred Renyi

The most common way of measuring information is Shannon entropy, but there are others. Rényi entropy, developed by Hungarian mathematician Alfréd  Rényi, generalizes Shannon entropy and includes other entropy measures as special cases.

Rényi entropy of order α

If a discrete random variable X has n possible values, where the ith outcome has probability pi, then the Rényi entropy of order α is defined to be

H_\alpha(X) = \frac{1}{1 - \alpha} \log_2 \left(\sum_{i=1}^n p_i^\alpha \right)

for 0 ≤ α ≤ ∞. In the case α = 1 or ∞ this expression means the limit as α approaches 1 or ∞ respectively.

Rényi entropy of continuous random variable

The definition of Rényi entropy can be extended to continuous random variables by

 H_\alpha(X) = \frac{1}{1 - \alpha} \log_2 f_X(x)^\alpha \, dx

but unlike the discrete case, Rényi entropy can be negative for continuous random variables, and so Rényi entropy is typically only used for discrete variables. For example, let X be the random variable defined on [1, ∞) with density

f_X(x) = 3x^{-4}

Then

H_2(X) = - \int_1^\infty (3x^{-4})^2\, dx = -\log_2(9/7) = -0.36

Special cases

Each value of α gives a possible entropy measure. All are additive for independent random variables. And for each discrete random variable X, Hα is a monotone non-decreasing function of α.

Max-entropy: α = 0

Assume all the probabilities  pi are positive. Then the H0 is known as the max-entropy, or Hartley entropy. It is simply log2n, the log of the number of values X takes on with positive probability.

Shannon entropy: α = 1

When α = 1 we get the more familiar Shannon entropy:

H_1(X) = \lim_{\alpha\to1} H_\alpha(X) = - \left(\sum_{i=1}^n p_i \log_2 p_i \right)

Collision entropy: α = 2

When the order α is not specified, it’s implicit default value is 2. That is, when people speak of Rényi entropy without qualification, they often have in mind the case α = 2. This case is also called collision entropy and is used in quantum information theory.

Min-entropy: α = ∞

In the limit as α goes to ∞, the Rényi entropy of X converges to the negative log of the probability of the most probable outcome. That is,

H_{\infty}(X) = \lim_{\alpha\to\infty} H_\alpha(X) = - \log_2 \max p_i

Relation to Lebesgue norms

Let p without a subscript be the vector of all the pi. Then for α not equal to 1,

H_\alpha(X) = \frac{\alpha}{1 - \alpha} \log_2 || p ||_\alpha

As with Lebesgue norms, you use varying values of the parameter to emphasize various features.

Related posts

This post was a warm-up for the next post: Rényi differential privacy.

Other related posts:

Curry-Howard-Lambek correspondence

Curry-Howard-Lambek is a set of correspondences between logic, programming, and category theory. You may have heard of the slogan proofs-as-programs or propositions-as-types. These refer to the Curry-Howard correspondence between mathematical proofs and programs. Lambek’s name is appended to the Curry-Howard correspondence to represent connections to category theory.

The term Curry-Howard isomorphism is often used but is an overstatement. Logic and programming are not isomorphic, nor is either isomorphic to category theory.

You’ll also hear the term computational trinitarianism. That may be appropriate because logic, programming, and category theory share common attributes without being identical.

The relations between logic, programming, and categories are more than analogies, but less than isomorphisms.

There are formal correspondences between parts of each theory. And aside from these rigorous connections, there is also a heuristic that something interesting in one area is likely to have an interesting counterpart in the other areas. In that sense Curry-Howard-Lambek is similar to the Langlands program relating number theory and geometry, a set of theorems and conjectures as well as a sort of philosophy.

Prime denominators and nines complement

Let p be a prime. If the repeating decimal for the fraction a/p has even period, the the second half of the decimals are the 9’s complement of the first half. This is known as Midy’s theorem.

For a small example, take

1/7 = 0.142857142857…

and notice that 142 + 857 = 999. That is, 8, 5, and 7 are the nine’s complements of 1, 4, and 2 respectively.

For a larger example, we can use Mathematica to look at the decimal expansion of 6/47:

    In:  N[6/47, 60]
    Out: 0.127659574468085106382978723404255319148936170212765957446809

and we can confirm

    12765957446808510638297 + 
    87234042553191489361702 =
    99999999999999999999999

Let’s do another example with 6/43:

    In:  N[6/43, 50]
    Out: 0.13953488372093023255813953488372093023255813953488

Does the theorem hold here? The the hypothesis of the theorem doesn’t hold because the fraction has period 21, which is not even. Can the conclusion of the theorem hold anyway? No, because we can’t split 21 digits in half! But we can split 21 digits into three parts, and if we do we find

    1395348 + 8372093 + 232558 = 9999999

Let’s finish up with an example in another base. We’ll look at the fraction 3/7 in base 17.

    In:  BaseForm[N[3/7, 24], 17]
    Out: 0.74e9c274e9c274e9c2717

If we split the repeating part of the decimal into two parts we get 74e and 9c2, and these add up to the base 17 analog of all 9’s.

    In:  BaseForm[17^^74e + 17^^9c2, 17]
    Out: ggg17

That is, the base 17 numbers 9, c, and 2, are the g’s complements of 7, 4, and e respectively.

[The Mathematica operator ^^ interprets its right argument as a number whose base is given by the left argument. The BaseForm function takes a number as its first argument and returns its representation in the base given as its second argument.]

Related post: Casting out Z’s

Comparing bfloat16 range and precision to other 16-bit numbers

server rack

Deep learning has spurred interest in novel floating point formats. Algorithms often don’t need as much precision as standard IEEE-754 doubles or even single precision floats. Lower precision makes it possible to hold more numbers in memory, reducing the time spent swapping numbers in and out of memory. Also, low-precision circuits are far less complex. Together these can benefits can give significant speedup.

Here I want to look at bfloat16, or BF16 for short, and compare it to 16-bit number formats I’ve written about previously, IEEE and posit. BF16 is becoming a de facto standard for deep learning. It is supported by several deep learning accelerators (such as Google’s TPU), and will be supported in Intel processors two generations from now.

Bit layout

The BF16 format is sort of a cross between FP16 and FP32, the 16- and 32-bit formats defined in the IEEE 754-2008 standard, also known as half precision and single precision.

BF16 has 16 bits like FP16, but has the same number of exponent bits as FP32. Each number has 1 sign bit. The rest of the bits in each of the formats are allocated as in the table below.

    |--------+------+----------+----------|
    | Format | Bits | Exponent | Fraction |
    |--------+------+----------+----------|
    | FP32   |   32 |        8 |       23 |
    | FP16   |   16 |        5 |       10 |
    | BF16   |   16 |        8 |        7 |
    |--------+------+----------+----------|

BF16 has as many bits as a FP16, but as many exponent bits as a FP32. The latter makes conversion between BF16 and FP32 simple, except for some edge cases regarding denormalized numbers.

Precision

The epsilon value, the smallest number ε such that 1 + ε > 1 in machine representation, is 2e where e is the number of fraction bits. BF16 has much less precision near 1 than the other formats.

    |--------+------------|
    | Format |    Epsilon |
    |--------+------------|
    | FP32   | 0.00000012 |
    | FP16   | 0.00390625 |
    | BF16   | 0.03125000 |
    |--------+------------|

Dynamic range

The dynamic range of bfloat16 is similar to that of a IEEE single precision number. Relative to FP32, BF16 sacrifices precision to retain range. Range is mostly determined by the number of exponent bits, though not entirely.

Dynamic range in decades is the log base 10 of the ratio of the largest to smallest representable positive numbers. The dynamic ranges of the numeric formats are given below. (Python code to calculate dynamic range is given here.)

    |--------+-------|
    | Format | DR    |
    |--------+-------|
    | FP32   | 83.38 |
    | BF16   | 78.57 |
    | FP16   | 12.04 |
    |--------+-------|

Comparison to posits

The precision and dynamic range of posit numbers depends on how many bits you allocate to the maximum exponent, denoted es by convention. (Note “maximum.” The number of exponent bits varies for different numbers.) This post explains the anatomy of a posit number.

Posit numbers can achieve more precision and more dynamic range than IEEE-like floating point numbers with the same number of bits. Of course there’s no free lunch. Posits represent large numbers with low precision and small numbers with high precision, but this trade-off is often what you’d want.

For an n-bit posit, the number of fraction bits near 1 is n – 2 – es and so epsilon is 2 to the exponent es – n – 2. The dynamic range is

\log_{10}\left( 2^{2^{\text{\small\emph{es}}} } \right)^{2n-4} = (2n-4)2^{es}\log_{10}2

which is derived here. The dynamic range and epsilon values for 16-bit posits with es ranging from 1 to 4 are given in the table below.

    |----+--------+-----------|
    | es |     DR |   epsilon |
    |----+--------+-----------|
    |  1 |  16.86 | 0.0000076 |
    |  2 |  33.82 | 0.0000153 |
    |  3 |  37.43 | 0.0000305 |
    |  4 | 143.86 | 0.0000610 |
    |----+--------+-----------|

For all the values of es above, a 16-bit posit number has a smaller epsilon than either FP16 or BF16. The dynamic range of a 16-bit posit is larger than that of a FP16 for all values of es, and greater than BF16 and FP32 when es = 4.

Related posts

New expansions of confluent hypergeometric function

Last week Bujanda et al published a paper [1] that gives new expansions for the confluent hypergeometric function. I’ll back up explain what that means before saying more about the new paper.

Hypergeometric functions

Hypergeometric functions are something of a “grand unified theory” of special functions. Many functions that come up in application are special cases of hypergeometric function, as shown in the diagram below. (Larger version here.)

special function relationships

I give a brief introduction to hypergeometric functions in these notes (4-page pdf).

Confluent hypergeometric functions

The confluent hypergeometric function corresponds to Hypergeometric 1F1 in the chart above [2]. In Mathematica it goes by Hypergeometric1F1 and in Python goes by scipy.special.hyp1f1.

This function is important in probability because you can use it to compute the CDFs for the normal and gamma distributions.

New expansions

Bujanda et al give series expansions of the confluent hypergeometric functions in terms of incomplete gamma functions. That may not sound like progress because the incomplete gamma functions are still special functions, but the confluent hypergeometric function is a function of three variables M(ab; z) whereas the incomplete gamma function γ(sz) is a function of two variables.

They also give expansions in terms of elementary functions which may be of more interest. The authors give the series expansion below.

M_n(a, b; z) = \frac{\Gamma(b)}{\Gamma(a)\,\Gamma(b-a)} \sum_{k=0}^{n-1} A_k(a, b)\, F_k(z)

The A functions are given recursively by

and

A_n(a, b) = \frac{2}{n} \left( (2a - b)A_{n-1}(a, b) + 2(n-b)A_{n-2}(a,b) \right) 

for n > 1.

The F functions are given by

F_n(z) = \frac{n!}{(-z)^{n+1}} \Big( e_n(z/2) - \exp(z) e_n(-z/2) \Big)

where

e_n(z) = \sum_{k=0}^n \frac{z^k}{k!}

The authors give approximation error bounds in [1]. In the plot below we show that n = 3 makes a good approximation for M(3, 4.1, z). The error converges to zero uniformly as n increases.

Related posts

[1] Blanca Bujanda, José L. López, and Pedro J. Pagola. Convergence expansions of the confluent hypergeometric functions in terms of elementary functions. Mathematics of computation. DOI https://doi.org/10.1090/mcom/3389

[2] “Confluent” is a slightly archaic name, predating a more systematic naming for hypergeometric functions. The name mFn means the hypergeometric function has m upper parameters and n lower parameters. The confluent hypergeometric function has one of each, so it corresponds to 1F1. For more details, see these notes.

Big data and privacy

patient survey

How does big data impact privacy? Which is a bigger risk to your privacy, being part of a little database or a big database?

Rows vs Columns

People commonly speak of big data in terms of volume—the “four v’s” of big data being volume, variety, velocity, and veracity—but what we’re concerned with here might better be called “area.” We’ll think of our data being in one big table. If there are repeated measures on an individual, think of them as more columns in a denormalized database table.

In what sense is the data big: is it wide or long? That is, if we think of the data as a table with rows for individuals and columns for different fields of information on individuals, are there a lot of rows or a lot of columns?

All else being equal, your privacy goes down as columns go up. The more information someone has about you, the more likely some of it may be used in combination to identify you.

How privacy varies with the number of rows is more complicated. Your privacy could go up or down with the number of rows.

The more individuals in a dataset, the more likely there are individuals like you in the dataset. From the standpoint of k-anonymity, this makes it harder to indentify you, but easier to reveal information about you.

Group privacy

For example, suppose there are 50 people who have all the same quasi-identifiers as you do. Say you’re a Native American man in your 40’s and there are 49 others like you. Then someone who knows your demographics can’t be sure which record is yours; they’d have a 2% chance of guessing the right one. The presence of other people with similar demographics makes you harder to identify. On the other hand, their presence makes it more likely that someone could find out something you all have in common. If the data show that middle aged Native American men are highly susceptible to some disease, then the data imply that you are likely to be susceptible to that disease.

Because privacy measures aimed at protecting individuals don’t necessarily protect groups, some minority groups have been reluctant to participate in scientific studies. The Genetic Information Nondiscrimination Act of 2008 (GINA) makes some forms of genetic discrimination illegal, but it’s quite understandable that minority groups might still be reluctant to participate in studies.

Improving privacy with size

Your privacy can improve as a dataset gets bigger. If there are a lot of rows, the data curator can afford a substantial amount of randomization without compromising the value of the data. The noise in the data will not effect statistical conclusions from the data but will protect individual privacy.

With differential privacy, the data is held by a trusted curator. Noise is added not to the data itself but to the results of queries on the data. Noise is added in proportion to the sensitivity of a query. The sensitivity of a query often goes down with the size of the database, and so a differentially private query of a big dataset may only need to add a negligible amount of noise to maintain privacy.

If the dataset is very large, it may be possible to randomize the data itself before it enters the database using randomized response or local differential privacy. With these approaches, there’s no need for a trusted data curator. This wouldn’t be possible with a small dataset because the noise would be too large relative to the size of the data.

Related posts

What is proof-of-work?

The idea of proof of work (PoW) was first explained in a paper Cynthia Dwork and Moni Naor [1], though the term “proof of work” came later [2]. It was first proposed as a way to deter spam, but it’s better known these days through its association with cryptocurrency.

If it cost more to send email, even a fraction of a cent per message, that could be enough to deter spammers. So suppose you want to charge anyone $0.005 to send you an email message. You don’t actually want to collect their money, you just want proof that they’d be willing to spend something to email you. You’re not even trying to block robots, you just want to block cheap robots.

So instead of asking for a micropayment, you could ask the sender to solve a puzzle, something that would require around $0.005 worth of computing resources. If you’re still getting too much spam, you could increase your rate and by giving them a harder puzzle.

Dwork and Naor list several possible puzzles. The key is to find a puzzle that takes a fair amount of effort to solve but the solution is easy to verify.

Bitcoin uses hash problems for proof-of-work puzzles. Cryptographic hash functions are difficult to predict, and so you can’t do much better than brute force search if you want to come up with input whose hashed value has a specified pattern.

The goal is to add a fixed amount of additional text to a message such that when the hash function is applied, the resulting value is in some narrow range, such as requiring the first n bits to be zeros. The number n could be adjusted over time as needed to calibrate the problem difficulty. Verifying the solution requires computing only one hash, but finding the solution requires computing 2n hashes on average.

Related posts

[1] Cynthia Dwork and Noni Naor (1993). “Pricing via Processing, Or, Combatting Junk Mail, Advances in Cryptology”. CRYPTO’92: Lecture Notes in Computer Science No. 740. Springer: 139–147.

[2] Markus Jakobsson and Ari Juels (1999). “Proofs of Work and Bread Pudding Protocols”. Communications and Multimedia Security. Kluwer Academic Publishers: 258–272.

Gamma gamma gamma!

There are several things in math and statistics named gamma. Three examples are

  • the gamma function
  • the gamma constant
  • the gamma distribution

This post will show how these are related. We’ll also look at the incomplete gamma function which connects with all the above.

The gamma function

The gamma function is the most important function not usually found on a calculator. It’s the first “advanced” function you’re likely to learn about. You might see it in passing in a calculus class, in a homework problem on integration by parts, but usually not there’s not much emphasis on it. But it comes up a lot in application.

\Gamma(x) = \int_0^\infty t^{x-1 } e^{-t}\, dt

You can think of the gamma function as a way to extend factorial to non-integer values. For non-negative integers n, Γ(n + 1) = n!.

(Why is the argument n + 1 to Γ rather than n? There are a number of reasons, historical and practical. Short answer: some formulas turn out simpler if we define Γ that way.)

The gamma constant

The gamma constant, a.k.a. Euler’s constant or the Euler-Mascheroni constant, is defined as the asymptotic difference between harmonic numbers and logarithms. That is,

\gamma = \lim_{n\to\infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}- \log n\right)

The constant γ comes up fairly often in applications. But what does it have to do with the gamma function? There’s a reason the constant and the function are both named by the same Greek letter. One is that the gamma constant is part of the product formula for the gamma function.

\frac{1}{\Gamma(x)} = x e^{\gamma x} \prod_{r=1}^\infty \left(1 + \frac{x}{r} \right) e^{-x/r}

If we take the logarithm of this formula and differentiation we find out that

\frac{\Gamma'(1)}{\Gamma(1)} = -\gamma

The gamma distribution

If you take the integrand defining the gamma function and turn it into a probability distribution by normalizing it to integrate to 1, you get the gamma distribution. That is, a gamma random variable with shape k has probability density function (PDF) given by

f(x) = \frac{1}{\Gamma(k)} x^{k-1} e^{-x}

More generally you could add a scaling parameter to the gamma distribution in the usual way. You could imaging the scaling parameter present here but set to 1 to make things simpler.

The incomplete gamma function

The incomplete gamma function relates to everything above. It’s like the (complete) gamma function, except the range of integration is finite. So it’s now a function of two variables, the extra variable being the limit of integration.

\gamma(s, x)= \int_0^x t^{s-1} e^{-t} \, dt

(Note that now x appears in the limit of integration, not the exponent of t. This notation is inconsistent with the definition of the (complete) gamma function but it’s conventional.)

It uses a lower case gamma for its notation, like the gamma constant, and is a generalization of the gamma function. It’s also essentially the cumulative distribution function of the gamma distribution. That is, the CDF of a gamma random variable with shape s is γ(sx) / Γ(s).

The function γ(sx) / Γ(s) is called the regularized incomplete gamma function. Sometimes the distinction between the regularized and unregularized versions is not explicit. For example, in Python, the function gammainc does not compute the incomplete gamma function per se but the regularized incomplete gamma function. This makes sense because the latter is often more convenient to work with numerically.

Related posts

Logic and applications Twitter account

I stopped posting to the @FormalFact Twitter account last July, but I didn’t deactivate the account. Now I’m going to restart it.

Unlike my other Twitter accounts, I don’t plan to have a regular posting schedule. I may not post often. We’ll see how it goes.

LogicPractice icon

I’ve changed the account name from @FormalFact to @LogicPractice. The “formal” part of the original name referred to formal theorem proving, the initial focus of the account. The new name reflects a focus on logic more generally, and practical applications of logic that are less laborious than formal theorem proving.

Continued fraction cryptography

Every rational number can be expanded into a continued fraction with positive integer coefficients. And the process can be reversed: given a sequence of positive integers, you can make them the coefficients in a continued fraction and reduce it to a simple fraction.

In 1954, Arthur Porges published a one-page article pointing out that continued fractions could be used to create a cipher. To encrypt your cleartext, convert it to a list of integers, use them as continued fraction coefficients, and report the resulting fraction. To decrypt, expand the fraction into a continued fraction and convert the coefficients back to text.

We can implement this in Mathematica as follows:

    
encode[text_] := FromContinuedFraction[ ToCharacterCode[ text ]]
decode[frac_] := FromCharacterCode[ ContinuedFraction[ frac ]]

So, for example, suppose we want to encrypt “adobe.” If we convert each letter to its ASCII code we get {97, 100, 111, 98, 101}. When we make these numbers coefficients in a continued fraction we get

\mathrm{

which reduces to 10661292093 / 109898899.

This isn’t a secure encryption scheme by any means, but it’s a cute one. It’s more of an encoding scheme, a (non-secret) way to convert a string of numbers into a pair of numbers.

Related posts

[1] Arthur Porges. A Continued Fraction Cipher. The American Mathematical Monthly, Vol. 59, No. 4 (Apr., 1952), p. 236

Earth mover distance and t-closeness

airport crowd

There’s an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it’s like a lot of other records.

k-anonymity

The idea of k-anonymity is that every database record appears at least k times when you restrict your attention to quasi-identifiers [1]. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity. Even when k is large, k-anonymity might prevent re-identification but still suffer from attribute disclosure.

Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.

Or going the other way around, if you know already know something that stands about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a background knowledge attack.

One way to address this shortcoming is l-diversity. This post won’t go into l-diversity because it’s an intermediate step to where we want to go, which is t-closeness.

t-closeness

The idea of t-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The “t” comes from requiring that the distributions be no more than a distance t apart in a sense that we’ll define below [2]. If the sensitive data in a group doesn’t stand out, this thwarts the homogeneity attack and the background knowledge attack.

Earth mover distance

When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what “far” means? That is, how do we measure the distance between two distributions?

There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Leiler divergence, though that’s not what we’ll use here. Instead, t-closeness uses the so-called earth mover distance (EMD), also know as the Wasserstein metric.

The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.

Suppose your data is some ordered response, 1 through 5. Suppose distribution X has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions Y and Z are the same except they have 80% of their probability mass at 2 and at 5 respectively. By some measures, X is equally far from Y and Z, but the earth mover distance would say that X is closer to Y than to Z, which is more appropriate in our setting.

We can calculate the EMD for the example above. To transform X into Y, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform X into Z we need to move the same amount of mass, but we need to move it 4x further, and so the EMD is 3.

Related posts

[1] What exactly is a quasi-identifier? That’s hard to say precisely. An identifier is something that clearly identifies an individual. A phone number, for example, is an identifier. A quasi-identifier is something that helps narrow down who a person is, but isn’t an identifier. For example, someone’s birthday would be a quasi-identifier. But in general “quasi-identifier” is not a precisely defined concept.

[2] It’s common in math to use a variable name as an adjective, as with k-anonymity, l-diversity, and t-closeness. This is unfortunate because it isn’t descriptive and locks in a variable naming convention. Someone used the variable k to count the number of redundant database tables, and the name stuck. Similarly the l in l-diversity counts something.

As a sidenote, the variable names here follow the old FORTRAN convention that variables i through n represent integers by default, and that all other variables represent real (floating point) numbers by default. The t in t-closeness is a continuous measure, so the t is a real value, while k and l are integers.

Integration by long division

Since integration is the inverse of differentiation, you can think of integration as “dividing” by d.

J. P. Ballantine [1] shows that you can formally divide by d and get the correct integral. For example, he arrives at

\int x^2 \sin x \, dx = (2 - x^2 )\cos x + 2x \sin x + C

using long division!

[1] J. P. Ballantine. Integration by Long Division. The American Mathematical Monthly, Vol. 58, No. 2 (Feb., 1951), pp. 104-105

Modal and temporal logic for computer security

key

In the previous post, I mentioned that modal logic has a lot of interpretations and a lot of axiom systems. It can also have a lot of operators. This post will look at Security Logic, a modal logic for security applications based on a seminal paper by Glasgow et al [1]. It illustrates how modal and temporal logic can be applied to computer security, and it illustrates how a logic system can have a large number of operators and axioms.

Knowledge axioms

Security Logic starts with operators Ki that extend the box operator □. For a proposition p, Ki p is interpreted as saying that the ith agent knows p to be true. Glasgow et al chose the following axioms for inference regarding Ki.

Axiom K1:   Ki φ → φ
Axiom K2:   Ki(φ → ψ) → (Ki φ → Ki ψ)
Axiom K3:   Ki φ → Ki Ki φ
Axiom K4:   ¬ Ki φ → Ki( ¬ Ki φ)

These can be interpreted as follows.

If you know something to be true, it’s true.
If you know one thing implies another, then knowing the first lets you know the other.
If you know something, you know that you know it.
If you don’t know something, you know that you don’t know it.

This is not to claim that these propositions are universal metaphysical truths, only modeling assumptions. It may be possible, for example, to know something without knowing that you know it, but that’s a subtle matter excluded from this model.

Temporal logic axioms

Temporal logic is modal logic with operators interpreted as applying to propositions over time. Glasgow et al introduce three temporal operators: ∀□, ∀◇, and  ∃□. Note that each of these is a single operator; there is no box or diamond operator in Security Logic.

∀□p means that p always holds, ∀◇p means that p eventually holds, and ∃□p means that p sometimes holds.

The ∃□ and ∀□ operators are dual in the same sense that the basic modal operators □ and ◇ are dual:

∃□p ↔ ¬ ∀□ ¬p
∀□p ↔ ¬ ∃□ ¬p

Security Logic not give axioms for ∃□ because its behavior is determined by the axioms for ∀□.

The three temporal logic axioms for Security Logic are as follows.

Axiom T1:  φ → ∀◇φ
Axiom T2:  ∀□(φ → ψ) → (∀□φ → ∀□ψ)
Axiom T3:  ∀□φ → ∀□ ∀□φ

These can be interpreted as follows.

If a proposition holds, it eventually holds.
If its always the case that one proposition implies another, then if the former always holds then the latter always holds.
If something always holds, then it always always holds.

Obligation and Permission

Finally, Security Logic introduces modal operators O and P for obligation and permission.

Obligation and permission are dual, just as necessity and possibility are dual and the always and sometimes operators are dual.

Op ↔ ¬ P ¬p
Pp ↔ ¬ O ¬p

When obligation and permission are combined with knowledge, Security Logic introduces the following abbreviations.

Oi = OKi
Pi = PKi

Axioms for Security Logic

The complete axioms for Security Logic are the four knowledge axioms above, the three temporal axioms above, and the five additional axioms below.

Axiom SL1:   Pi φ for all propositional tautologies φ.
Axiom SL2:  Pi φ → φ
Axiom SL3:  (Pi φ ∧ Pi ψ) ↔ Pi (φ ∧ ψ)
Axiom SL4:  Pi φ → Pi (φ ∨ ψ)
Axiom SL5:  Oi φ → Pi φ

These can be interpreted as follows.

You are permitted to know all tautologies.
Whatever is permitted must be true.
Permission holds through conjunction and disjunction.
Whatever is obligatory must be permissible.

Related posts

[1] Janice Glasgow, Glenn MacEwen, and Prakash Panangaden. A Logic for Reasoning About Security. ACM Transactions on Computer Systems, Vol 10 No 3, August 1992, pp 226–264.