Database reconstruction attacks

In 2018, three researchers from the US Census Bureau published a paper entitled “Understanding Database Reconstruction Attacks on Public Data.” [1] The article showed that private data on many individuals could be reverse engineered from public data.

As I wrote about a few days ago, census blocks are at the bottom of the US Census Bureau’s hierarchy of geographical entities. On average a census block contains about 40 people, but a block may contain only one person.

In hindsight it seems fairly obvious that data reported at the census block level is vulnerable to re-identification, and yet this doesn’t seem to have been noticed until around 2000. There were some privacy measures in place before then, but it wasn’t appreciated at the time that these measures were insufficient to protect privacy.

You can think of each fact about each person as a variable and each reported statistic as an equation. When the number of equations is comparable to the number of variables, it’s possible that the system of equations has a unique solution. (We know a priori that there exists at least one solution, assuming the reported statistics were correctly computed.)

It’s not quite as simple as that, though that is roughly the idea in [1]. The data collected in the census is binary or integer data, which makes database reconstruction easier. Ages, for example, are integers, and typically integers less than 100.

One of the techniques the Census Bureau previously used in an attempt to protect individual privacy was a sort of small cell rule, a rule to not report statistics based on three or fewer individuals. This may or may not help. In the example given in [1], there are 7 people in a hypothetical census block, of whom 4 are adults and an unreported number are minors. Determining the number of minors is left as an exercise for the reader.

The set of equations is more complicated than a set of linear equations. The inference problem is a matter of logic programming or constraint satisfaction. Missing data is not always as trivial to reconstruct as in the preceding paragraph, but missing data can still convey partial information. The very fact that the data is missing tells you something.

The discrete nature of the data makes the solution process both harder and easier. It makes things harder in the sense of requiring a more complicated solution algorithm, but it makes things easier in the sense of increasing the likelihood that the equations have a unique solution.
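
Here is a minimal sketch of reconstruction as brute-force constraint satisfaction. The census block, its size, and the published statistics below are all made up for illustration; the point is that each published statistic is a constraint, and each additional constraint eliminates candidate solutions.

from itertools import combinations_with_replacement

# Hypothetical published statistics for a tiny census block:
# 3 residents, mean age 30, median age 30, ages are integers below 100.
def consistent(ages, mean, median):
    return sum(ages) == mean*len(ages) and sorted(ages)[len(ages)//2] == median

candidates = [ages for ages in combinations_with_replacement(range(100), 3)
              if consistent(ages, 30, 30)]
print(len(candidates))  # how many age combinations remain possible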

This is why the Census Bureau embraced differential privacy for the 2020 census. They had no choice but to do something substantially different than they had done in the past once it became apparent that their previous approach failed rather badly at protecting confidentiality.

[1] Simson Garfinkel, John M. Abowd, Christian Martindale. Understanding Database Reconstruction Attacks on Public Data. ACM Queue, October 2018. The article was also published in Communications of the ACM in March 2019.

Number of groups of squarefree order

This post is a sort of footnote to the previous post, Estimating the number of groups of a given order.

The following is taken from an answer to a question on Stack Exchange.

In general there is no formula f(n) for the number of groups of order n up to isomorphism. However, if n is squarefree (not divisible by the square of any prime), then Otto Hölder in 1893 proved the following amazing formula

f(n) = \sum_{m \mid n} \prod_p \frac{p^{c(p)} - 1}{p-1}

where p runs over the prime divisors of n/m and c(p) is the number of prime divisors q of m that satisfy q ≡ 1 (mod p).

In a sense that can be made precise, about 61% of integers are squarefree [1]. So Hölder’s formula is often useful, but there are many values of n it cannot handle.

In this post I develop Python code to implement Hölder’s formula. I’ve tested the code below by comparing it to a table of f(n) for n = 1, 2, 3, …, 100.

from sympy import mobius, divisors, factorint

def squarefree(n):
    # n is squarefree if and only if the Möbius function is nonzero at n.
    # Note that this counts 1 as squarefree, so holder(1) returns 1.
    return mobius(n) != 0

def prime_divisors(n):
    return list(factorint(n).keys())

def c(p, m):
    # number of prime divisors q of m with q ≡ 1 (mod p)
    return len( [q for q in prime_divisors(m) if q % p == 1] )

def holder(n):
    if not squarefree(n):
        print("Sorry. I can't help you.")
        return None

    s = 0
    for m in divisors(n):
        prod = 1
        for p in prime_divisors(n // m):
            prod *= (p**c(p, m) - 1) // (p - 1)
        s += prod

    return s
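
As a quick sanity check, 30 = 2 · 3 · 5 is squarefree, and there are exactly four groups of order 30, so we should have

assert holder(30) == 4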

[1] The number of square-free integers between 1 and x is asymptotically equal to 6x/π² as x → ∞.

Estimating number of groups of a given order

John Conway et al [1] give the name gnu(n) to the number of groups of order n, where “gnu” stands for group number. This function has been studied since the 19th century, but I don’t know whether there has ever been a standard notation for it. Mathematica calls it FiniteGroupCount. It’s also the first sequence in OEIS, A000001.

When n is square-free, there is a formula due to Hölder that computes gnu(n). This formula is given in the next post. But in general computing gnu(n) is hard. However, [1] gives a surprisingly good heuristic for estimating gnu(n).

Let Ω(n) be the number of prime factors of n, counted with multiplicity. So, for example, Ω(75) = 3 because 3 is a factor with multiplicity 1 and 5 is a factor with multiplicity 2.

Let B(n) be the nth Bell number, the number of ways to partition a set of n labeled items.

The heuristic estimate for gnu(n) is

gnu(n) ≈ B( Ω(n) ).

This can be implemented in Python as follows.

    from sympy import bell, primeomega
    def estimate(n): return bell(primeomega(n))
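
For example, Ω(100) = 4 and B(4) = 15, so the heuristic predicts 15 groups of order 100, while the exact count is 16.

    estimate(100)   # 15, compared with the exact value gnu(100) = 16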

The following plot shows the exact and approximate values of gnu(n) for n = 1, 2, 3, …, 100.

The exact value was plotted first in blue, then the estimate was plotted in orange. The orange line matches the blue so well as to hide it, except at the two spikes where gnu(n) is largest.

Here’s a plot of the errors.

Aside from the two outliers, the error is between −5 and 1.

[1] John H. Conway, Heiko Dietrich and E.A. O’Brien. Counting groups: gnus, moas and other exotica

Period of a nonlinear pendulum

The term “nonlinear pendulum” is analogous to a retronym, a new name for an old thing to distinguish it from a new variation. For example, once upon a time a guitar was just a guitar. Now such a guitar is called an acoustic guitar to distinguish it from an electric guitar. Similarly, analog signal processing is a retronym to distinguish what was once the only kind of signal processing from the new arrival, digital signal processing.

The equation of motion for a pendulum is nonlinear. If the initial angle of displacement is sufficiently small, the linearized form of the equation is adequate for most applications. This linearized approximation is better known than the more accurate original equation, and so the un-linearized equation is known as the nonlinear pendulum equation.

The (nonlinear) equation of motion for a pendulum is the differential equation

\theta'' + \frac{g}{\ell}\sin \theta = 0

where g is the acceleration due to gravity and ℓ is the length of the pendulum. For small initial displacement θ0 the linear approximation

\theta'' + \frac{g}{\ell} \theta = 0

works well. The smaller θ0 is, the more accurate the linear approximation is.

Linear and nonlinear period

The period of a pendulum obtained by solving the linearized equation is

T = 2\pi \sqrt{\frac{\ell}{g}}

The solution to the nonlinear pendulum equation is also periodic, though the solution is a combination of Jacobi elliptic functions rather than a combination of trig functions. The difference between the two solutions is small when θ0 is small, but becomes more significant as θ0 increases.

The difference in the periods is more evident than the difference in shape for the two waves. The period of the nonlinear solution is longer than that of the linearized solution. Here’s a plot of the solutions to the linear and nonlinear equations, with ℓ = g and θ0 = 1.

The period for the nonlinear pendulum is given by

T = 2\pi \sqrt{\frac{\ell}{g}}\, f(\theta_0)

where f is an increasing function with f(0) = 1.

The exact form of f involves special functions, and so there is naturally a lot of interest in approximations to f. The exact value is given by

f(\theta) = \frac{2}{\pi}K(\sin(\theta/2)) = \frac{1}{\text{AGM}(1, \cos(\theta/2))}

where K is the “complete elliptic integral of the first kind” and AGM is the arithmetic-geometric mean.

The AGM of two numbers is found by taking their ordinary (arithmetic) mean and geometric mean, then repeating the process. This process converges very rapidly, and so doing one step of the iteration gives a good approximation. If that’s not good enough, doing two steps gives an even better approximation, and so on. In fact, a common approximation for f(θ) is to do half a step, taking the geometric mean of 1 and cos(θ/2), i.e.

f(\theta) \approx \frac{1}{\sqrt{\cos(\theta/2)}}
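
Here is a minimal sketch of the full AGM iteration, to make the convergence claim concrete. This code is my own illustration, not something from the sources above; for scalar θ it agrees with the exact() function defined in the code below to machine precision.

from math import cos, sqrt

def agm(a, b, tol=1e-15):
    # Repeatedly replace (a, b) by their arithmetic and geometric means.
    while abs(a - b) > tol:
        a, b = (a + b)/2, sqrt(a*b)
    return a

def f_full(θ):
    return 1/agm(1, cos(θ/2))

The AGM iteration converges quadratically, roughly doubling the number of correct digits at each step, which is why even the half step above works as well as it does.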

To see how accurate this approximation is, let’s plot the exact and approximate values of f.

The two curves can hardly be distinguished visually, so let’s look at a plot of their difference.

Here’s the code that produced the two plots above.

from scipy.special import ellipk
from numpy import sin, cos, pi, linspace
import matplotlib.pyplot as plt

def exact(θ):
    return 2*ellipk(sin(θ/2)**2)/pi

def approx(θ):
    return cos(θ/2)**-0.5

t = linspace(0, 1.5)
plt.plot(t, exact(t))
plt.plot(t, approx(t))
plt.xlabel(r"$\theta$")
plt.legend(["exact", "approx"])
plt.savefig("nonlinear_pendulum1.png")
plt.close()

plt.plot(t, exact(t) - approx(t))
plt.xlabel(r"$\theta$")
plt.ylabel("approximation error")
plt.savefig("nonlinear_pendulum2.png")

NB: There are two conventions for defining the complete elliptic integral of the first kind. SciPy uses a convention for K that requires us to square the argument.

Driven oscillations

The differences between the linearized and nonlinear equations become more apparent when there is a forcing function, i.e. when the right-hand side of the differential equation is not zero. Here are the solutions when the forcing function is cos(2t).

Now the solutions not only differ in their period, the shapes of the solutions are substantially different. The linear solutions are well-behaved but the nonlinear solutions can be chaotic with sensitive dependence on initial conditions. This remains true if a damping term is added.
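
Here’s a sketch of how such a comparison could be computed with SciPy’s solve_ivp. The initial conditions and time span are my own choices for illustration, again taking ℓ = g; dropping the cos(2t) term recovers the unforced comparison from earlier in the post.

from numpy import sin, cos, linspace
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

def nonlinear(t, y):
    θ, ω = y
    return [ω, -sin(θ) + cos(2*t)]   # θ'' + sin θ = cos 2t

def linear(t, y):
    θ, ω = y
    return [ω, -θ + cos(2*t)]        # linearized version

ts = linspace(0, 50, 2000)
for rhs, label in [(linear, "linear"), (nonlinear, "nonlinear")]:
    sol = solve_ivp(rhs, [0, 50], [1, 0], t_eval=ts)  # θ0 = 1, starting at rest
    plt.plot(sol.t, sol.y[0], label=label)
plt.xlabel(r"$t$")
plt.legend()
plt.show()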

Kepler triangle

A Kepler triangle is a right triangle whose sides are in geometric progression. That is, if the sides have length a < b < c, then b/a = c/b = k.

All Kepler triangles are similar because the proportionality constant k can only take on one value. To see this, we first pick our units so that a = 1. Then b = k and c = k². By the Pythagorean theorem

a² + b² = c²

and so

1 + k² = k⁴

Setting x = k², this becomes x² = x + 1, whose positive root is the golden ratio φ. So k² = φ.

Here’s a nice geometric property of the Kepler triangle proved in [1].

Go around the triangle counterclockwise, placing a point on each side that divides the side into two pieces in golden proportion. Connect each of these three points with the opposite vertex. Then the triangle formed by the intersections of these line segments is also a Kepler triangle.

On each side, the ratio of the length of the green segment to the blue segment is φ. The grey triangle in the middle is another Kepler triangle.

The rest of this post will present the code that was used to create the image above.

We’ll need the following imports.

import matplotlib.pyplot as plt
from numpy import array
from numpy.linalg import solve

We’ll also need to define the golden ratio, a function to split a line segment into golden proportions, and a function to draw a line between two points.

φ = (1 + 5**0.5)/2

def golden_split(pt1, pt2):
    # point dividing the segment from pt1 to pt2 in golden proportion
    return (1/φ)*pt1 + (1 - 1/φ)*pt2

def connect(pt1, pt2, style):
    # draw the line segment from pt1 to pt2 in the given style
    plt.plot([pt1[0], pt2[0]], [pt1[1], pt2[1]], style)

Now we can draw the figure.

A = array([0, 1])
B = array([0, 0])
C = array([φ**0.5, 0])

X = golden_split(A, B)
Y = golden_split(B, C)
Z = golden_split(C, A)

connect(A, X, "b")
connect(X, B, "g")
connect(B, Y, "b")
connect(Y, C, "g")
connect(C, Z, "b")
connect(Z, A, "g")
connect(A, Y, "grey")
connect(B, Z, "grey")
connect(C, X, "grey")

plt.gca().set_aspect("equal")
plt.axis("off")
plt.show()

We can show algebraically that golden_split works as claimed, but here is a numerical illustration.

assert(abs( (C[0] - Y[0]) / (Y[0] - A[0]) - φ) < 1e-14)

Similarly, we can show numerically what [1] proves exactly, i.e. that the triangle in the middle is a Kepler triangle.

from numpy.linalg import solve, norm

def intersect(pt1, pt2, pt3, pt4):
    # Find the intersection of the line joining pt1 and pt2
    # with the line joining pt3 and pt4.
    m1 = (pt2[1] - pt1[1])/(pt2[0] - pt1[0])
    m3 = (pt4[1] - pt3[1])/(pt4[0] - pt3[0])
    A = array([[m1, -1], [m3, -1]])
    rhs = array([m1*pt1[0]-pt1[1], m3*pt3[0]-pt3[1]])
    x = solve(A, rhs)
    return x

I = intersect(A, Y, C, X)
J = intersect(A, Y, B, Z)
K = intersect(B, Z, C, X)

assert( abs( norm(I - J)/norm(J - K) - φ**0.5 ) < 1e-14 )
assert( abs( norm(I - K)/norm(I - J) - φ**0.5 ) < 1e-14 )

[1] Jun Li. Some properties of the Kepler triangle. The Mathematical Gazette, November 2017, Vol. 101, No. 552, pp. 494–495

Decoupling formal theorem proving effort

Terence Tao has been experimenting with formal theorem proving using Lean and writing about his experience.

Here’s something Tao said on Mathstodon that I thought was interesting.

It is remarkable how much “decoupling” is achieved by the Lean+Blueprint combo. Contributors can work locally on proving a lemma, without necessarily fully understanding the global proof structure. Mathematicians who do understand the global proof can work on the blueprint, without necessarily understanding the mechanics of Lean. Lean experts can work on technical aspects of the implementation, such as optimizing the selection of classes and definitions, without needing expert domain knowledge. A theorem can be formalized, before, after, or concurrently with the lemmas it relies on, or the applications it has. Two participants who want to discuss some finer point of the argument can localize to a very specific and highly formalized step and have a constructive discussion even if they come from quite different backgrounds. It allows for (certain types of) high-level mathematical activity to be done at a far more atomized level than is usually possible.

Partitioning dots and dashes

Given a set of dots and dashes, how many ways can they be partitioned into a set of Morse code letters?

There is at least one way, since you could take each dot to be an E and each dash to be a T.

If you have a sequence of n dots and dashes, there are no more than 2ⁿ⁻¹ ways to partition the symbols: at each of the n − 1 spaces between symbols, you either start a new letter or you don’t. This is an overestimate for large n, since a Morse code letter has at most 4 dots or dashes, and not all combinations of four dots and dashes correspond to a letter.

Last year I wrote about the song YYZ and how it was inspired by the sound of “YYZ” in Morse code, YYZ being the designation of the Toronto airport. Here’s the song’s opening theme:

The C code given here enumerates partitions of dots and dashes, and it shows that there are 1324 ways to partition -.---.----.. into Morse code letters [1]. This number 1324 is closer to our upper estimate of 2¹¹ = 2048 than to our lower estimate of 1.

Define a function M(n) as follows. Express n in binary, convert the 0s to dots and the 1s to dashes, and let M(n) be the number of ways this sequence of dots and dashes can be partitioned into letters. The n corresponding to the Morse code for YYZ is 101110111100₂ = 3004₁₀, so M(3004) = 1324.
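
Here’s a short Python implementation of M(n), my own sketch rather than the C code mentioned above, using the Morse code table for the 26 letters:

from functools import lru_cache

# Morse code for the 26 letters
morse = {
    '.-': 'A', '-...': 'B', '-.-.': 'C', '-..': 'D', '.': 'E', '..-.': 'F',
    '--.': 'G', '....': 'H', '..': 'I', '.---': 'J', '-.-': 'K', '.-..': 'L',
    '--': 'M', '-.': 'N', '---': 'O', '.--.': 'P', '--.-': 'Q', '.-.': 'R',
    '...': 'S', '-': 'T', '..-': 'U', '...-': 'V', '.--': 'W', '-..-': 'X',
    '-.--': 'Y', '--..': 'Z',
}

def M(n):
    s = bin(n)[2:].replace('1', '-').replace('0', '.')

    @lru_cache(maxsize=None)
    def count(i):
        # number of ways to partition s[i:] into Morse code letters
        if i == len(s):
            return 1
        return sum(count(i + k) for k in range(1, 5) if s[i:i+k] in morse)

    return count(0)

print(M(3004))   # 1324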

I looked to see whether M(n) is in OEIS, and it doesn’t appear to be, though there are several sequences in OEIS that include Morse code in their definition.

It’s easy to see that 1 ≤ M(n) ≤ n. Exercise for the reader: find sharper upper and lower bounds for M(n). For example, show that every group of three bits can be partitioned four ways, and so M(n) has a lower bound something like n^(2/3).

[1] The code returns 1490 possibilities, but some of these contain one or more asterisks indicating combinations of dots and dashes that do not correspond to letters.

Schwarz lemma, Schwarz-Pick theorem, and Poincare metric

Let D be the open unit disk in the complex plane. The Schwarz lemma says that if f is an analytic function from D to D with f(0) = 0, then

|f(z)| \leq |z|

for all z in D. The lemma also says more, but this post will focus on just this portion of the theorem.

The Schwarz-Pick theorem generalizes the Schwarz lemma by not requiring the origin to be fixed. That is, it says that if f is an analytic function from D to D then

\left| \frac{f(z) - f(w)}{1 - f(z)\,\overline{f(w)}} \right| \leq \left| \frac{z - w}{1 - z\,\overline{w}}\right|

The Schwarz-Pick theorem also concludes more, but again we’re focusing on part of the theorem here. Note that if f(0) = 0 then the Schwarz-Pick theorem reduces to the Schwarz lemma.

The Schwarz lemma is a sort of contraction theorem. Assuming f(0) = 0, the lemma says

|f(z) - f(0)| \leq |z - 0|

This says applying f to a point cannot move the point further from 0. That’s interesting, but it would be more interesting if we could say f is a contraction in general, not just with respect to 0. That is indeed what the Schwarz-Pick theorem does, though with respect to a new metric.

For any two points z and w in the open unit disk D, define the Poincaré distance between z and w by

d(z,w) = \tanh^{-1}\left( \left| \frac{z - w}{1 - z\overline{w}}\right| \right)

It’s not obvious that this is a metric, but it really is. As is often the case, most of the properties of a metric are simple to confirm, but proving the triangle inequality is the hard part.

If we apply the monotone function tanh⁻¹ to both sides of the Schwarz-Pick inequality, we see that any analytic function f from D to D is a contraction on D with respect to the Poincaré metric.

Here we’re using “contraction” in the loose sense. It would be more explicit to say that f is a non-expansive map. Applying f to a pair of points may not bring the points closer together, but it cannot move them any further apart (with respect to the Poincaré metric).
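
As a numerical illustration (my own sketch; the map f below is just an arbitrary analytic function from the disk into itself, chosen for the example):

import numpy as np

def poincare_dist(z, w):
    return np.arctanh(abs((z - w)/(1 - z*np.conj(w))))

def f(z):
    # an arbitrary analytic map from the unit disk into itself
    return (z**2 + 0.5)/2

rng = np.random.default_rng(42)

def random_disk_point():
    r, t = rng.uniform(0, 1)**0.5, rng.uniform(0, 2*np.pi)
    return r*np.exp(1j*t)

for _ in range(1000):
    z, w = random_disk_point(), random_disk_point()
    assert poincare_dist(f(z), f(w)) <= poincare_dist(z, w)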

By using the Poincaré metric, we turn the unit disk into a hyperbolic space. That is, D with the metric d is a model of the hyperbolic plane.

Factored random numbers

A couple days ago Michael Nielsen posted an image of a one-page paper that gives an algorithm for generating factored random numbers, uniformly distributed from 1 to some designated N.

The algorithm does not generate random numbers then factor them. It’s more efficient than that, generating the factorization along with the final result. It does require testing for whether a number is prime, but this is more efficient than factorization.

I thought about trying to code up the algorithm in Python, but then I saw that @iconjack had beaten me to it.

from sympy import isprime
from random import random, randint

def randfacts(N):
    # Generate a uniform random integer r in [1, N] along with its
    # prime factorization s, without factoring anything.
    while True:
        n, r, s = N, 1, []
        while n > 1:
            if r > N: break              # candidate exceeds N; start over
            if isprime(n := randint(1,n)):
                r *= n                   # accumulate the product of the primes drawn
                s.append(n)
        else:
            # the while loop ended without a break, so r <= N;
            # accept the candidate with probability r/N
            if random() < r/N:
                return r, s
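
A quick usage example; the bound 1000 is arbitrary.

from math import prod

r, s = randfacts(1000)
print(r, s)
assert r == prod(s) and all(isprime(q) for q in s)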

DICOM image data

X-ray of hand and arm

The previous post discussed EXIF data embedded in a digital photo. DICOM files are the analogous format for medical images.

You can think of a DICOM image as a JPEG with medical metadata. Strictly speaking a DICOM file is a sort of database, and one of the fields in the database contains the pixels. The pixels are usually stored in JPEG format or some variation thereof, but they don’t have to be.
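
For example, here’s a sketch of how you might look at this structure with the pydicom library. The file name is hypothetical, and the library choice is mine; it’s not part of the DICOM standard itself.

import pydicom

ds = pydicom.dcmread("scan.dcm")   # hypothetical file name

# print every metadata element except the (large) pixel data field
for elem in ds:
    if elem.keyword != "PixelData":
        print(elem.tag, elem.keyword, elem.value)

pixels = ds.pixel_array            # the image itself, as a NumPy array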

DICOM stands for Digital Imaging and Communications in Medicine. It is a standard created by the ACR (American College of Radiology) and NEMA (National Electrical Manufacturers Association). DICOM uses other standards, such as JPEG, and is used by standards built on top of it, such as IHE and HL7.

Consumer photos may contain a lot of EXIF metadata, but DICOM images can contain even more metadata. The DICOM standard is huge. It comes in 16 parts, and the data dictionary part alone is 274 pages. Pages 23 through 176 of the data dictionary consist of one long table defining possible DICOM data fields. Assuming an average of 30 fields per page, this is about 4,600 data fields. To make matters worse (or more flexible), many of these fields can contain arbitrarily long text strings. Well, not entirely arbitrary: fields must be less than 2³² characters. Moby Dick is about 2²⁰ characters, so a 2³² character limit is essentially no practical limit.

I have had numerous clients send me a description of their data in a small Excel file. Then they’ll say “and we have some images” meaning DICOM images. Maybe the Excel file contains a few dozen fields, but then the DICOM images potentially contain thousands of fields. The Excel file is burying the lede: the vast majority of the data (potentially) is in the DICOM images.

The enormous number of fields, and the lack of much structure to these fields, is a widely recognized problem. According to Mustra et al [1],

A major disadvantage of the DICOM Standard is the possibility for entering probably too many optional fields. This disadvantage is mostly showing in inconsistency of filling all the fields with the data. Some image objects are often incomplete because some fields are left blank and some are filled with incorrect data.

This is quite an understatement, saying that over four thousand fields is “probably too many optional fields.”

[1] Mario Mustra, Kresimir Delac, Mislav Grgic. Overview of the DICOM Standard. 50th International Symposium ELMAR-2008, 10-12 September 2008, Zadar, Croatia

Photo by Cara Shelton on Unsplash