When is less data less private?

If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right?

This seems very simple, and it mostly is, but there are a couple subtleties.

A common measure in data privacy is k-anonymity. The idea is that if at least k individuals in a data set share some set of data values, and k is large enough, then the privacy of those individuals is protected.
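
To make that concrete, here is a minimal sketch of how you might compute k for a given data set using pandas. The data frame and quasi-identifier columns are made up for illustration.

    import pandas as pd

    # toy data set; the quasi-identifier columns are made up for illustration
    df = pd.DataFrame({
        "zip":        ["77030", "77030", "77030", "77005", "77005"],
        "sex":        ["F", "F", "F", "M", "M"],
        "birth_year": [1980, 1980, 1980, 1975, 1975],
    })

    quasi_identifiers = ["zip", "sex", "birth_year"]

    # k is the size of the smallest group of records that share the same
    # combination of quasi-identifier values
    k = df.groupby(quasi_identifiers).size().min()
    print(k)  # 2 for this toy data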

Now suppose you randomly select a single record from a database that was deemed deidentified because it satisfied k-anonymity with k = 10. Then your new dataset, consisting of only one record, is k-anonymous with k = 1: every record is unique because there’s only one record. But how is this person’s data any less private than it was before?

Note that I said above that you selected a record at random. If you selected the row using information that you know but which isn’t in the database, you might have implicitly added information. But if you select a subset of data, using only information explicit in that data, you haven’t added information.

Here’s where k-anonymity breaks down. The important measure is k-anonymity in the general population, not k-anonymity in a data set, unless you know that someone is in the data set.

If you find someone named John Cook in a data set, you probably haven’t found my information, even if there is only one person by that name in the data set. My name may or may not be common in that particular data set, but my name is common in general.

The number of times a combination of data fields appears in a data set gives a lower bound on how often the combination appears in general, so k-anonymity in a data set is a good sign for privacy, but the lack of k-anonymity is not necessarily a bad sign. The latter could just be an artifact of having a small data set.

Additive functions

A function f from positive integers to real numbers is defined to be additive if for relatively prime numbers m and n,

f(mn) = f(m) + f(n).

The function f is called completely additive if the above holds for all positive integers m and n, i.e. we drop the requirement that m and n be relatively prime.

Example: total prime factors

One example of an additive function is the function Ω(n) defined to be the number of prime factors of n, counted with multiplicity. For example, Ω(12) = 3 because 12 = 2 × 2 × 3. The numbers 10 and 63 are relatively prime, and

Ω(630) = 5 = Ω(10) + Ω(63).

Example: distinct prime factors

Another example of an additive function is ω(n) defined to be the number of distinct prime factors of n, i.e. not counting with multiplicity. So, for example, ω(12) = 2.

This function is additive but not completely additive because, for example,

ω(20) = 2 ≠ ω(2) + ω(10)  = 3
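
Both examples are easy to check numerically. Here is a short sketch using SymPy’s factorint; the helper function names are mine.

    from sympy import factorint

    def big_omega(n):
        # number of prime factors of n, counted with multiplicity
        return sum(factorint(n).values())

    def little_omega(n):
        # number of distinct prime factors of n
        return len(factorint(n))

    print(big_omega(630), big_omega(10) + big_omega(63))         # 5 5
    print(little_omega(20), little_omega(2) + little_omega(10))  # 2 3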

A theorem of Erdős

Here is a remarkable theorem due to Paul Erdős [1]. Suppose f is an additive function such that

f(n + 1) − f(n)

converges to zero as n goes to infinity. Then

f(n) = c log(n)

for some constant c. And since a multiple of a logarithm is a logarithm to a different base, we can restate the conclusion by simply saying f is a logarithm.

Logarithms are completely additive functions, so even though we only assumed f was additive, this combined with the limit condition proves that in fact f is completely additive.

[1] Paul Erdős, “On the distribution function of additive functions,” Ann. of Math., Vol. 47 (1946), pp. 1–20.

Frequency analysis

Suppose you have a list of encrypted surnames of US citizens. If the list is long enough, the encrypted name that occurs most often probably corresponds to Smith. The second most common encrypted name probably corresponds to Johnson, and so forth. This kind of inference is analogous to solving a cryptogram puzzle by counting letter frequencies.

The probability of correctly guessing the most common names based on frequency analysis depends critically on the sample size. In a small sample, there may be no Smiths. In a larger sample, the name Smith may be common, but not the most common.

I did some simulations to estimate how well frequency analysis would work at identifying the 10 most common names as a function of the sample size N. For each N, I simulated 100 data sets using surname frequencies derived from US Census Bureau data.
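
Here is a rough sketch of how such a simulation can be set up. The name list and frequencies below are illustrative stand-ins for the full Census surname table, with everything outside the top few names lumped into a catch-all category.

    import numpy as np

    rng = np.random.default_rng(20240214)

    # illustrative frequencies, roughly in line with Census figures;
    # the real simulation used the full surname table
    names = np.array(["SMITH", "JOHNSON", "WILLIAMS", "BROWN", "JONES", "OTHER"])
    probs = np.array([0.0083, 0.0066, 0.0046, 0.0044, 0.0043, 0.9718])

    def smith_is_most_common(N):
        """Draw a sample of size N and check whether SMITH is the most
        common name in the sample (ignoring the OTHER catch-all)."""
        sample = rng.choice(names, size=N, p=probs)
        values, counts = np.unique(sample[sample != "OTHER"], return_counts=True)
        if counts.size == 0:
            return False
        return values[np.argmax(counts)] == "SMITH"

    # estimated probability that Smith tops a sample of 1,000 names
    print(sum(smith_is_most_common(1000) for _ in range(100)) / 100)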

When N = 1,000, there was a 53% chance that the most common name in the population, Smith, would be the most common name in the sample. The second most common name in the population, Johnson, was the second most common name in the sample only 14% of the time.

When N = 10,000, there was a 94% chance of identifying Smith, and at least a 30% chance of identifying the five most common names.

When N = 1,000,000, the three most common names were identified every time in the simulation, and each of the 10 most common names was correctly identified most of the time. In fact, the 18 most common names were correctly identified most of the time.

A consequence of this analysis is that hashing names does not protect privacy if the sample size is large. Hashing names along with other information, so that the combined data has a more uniform distribution, may protect privacy.

Security by obscurity

Security-by-obscurity is a bad idea in general. It’s better, for example, to have a login page than to give your site an obscure URL. It’s better to encrypt a file than to hide it in some odd directory. It’s better to use a well-vetted encryption algorithm than to roll your own.

There are people whose knee-jerk reaction to any form of obscurity is to shout “That’s security-by-obscurity!” but obscurity can be subtle.

All else being equal, adding a layer of obscurity doesn’t hurt. For example, you can literally make a public encryption key public, as I’ve done here. But for extra security, why distribute your encryption key more widely than necessary? And if your message is adequately encrypted, you could in principle publish it for the world to see. But why not just give it to the intended recipient?

The public key on my site is there for strangers to contact me, but if I were really concerned about secure communication between colleagues, I’d just circulate the key among those colleagues. That may not be much more secure, but surely it’s no less secure. And I’d share messages privately, even though they are encrypted.

It’s good to look closely at any argument that begins “all else being equal” to see if all else is indeed equal. A more nuanced objection to security-by-obscurity is that it can create a false sense of security.

One could argue, for example, that making your public key available to the world forces you to be more careful about your encryption. Maybe you’ve been using an RSA key for years, and you really should use a longer key, but you don’t because you can argue that not many people have your public key anyway. But if your key’s too short, obscuring your public key doesn’t help.

And while it’s better to deliver encrypted messages privately, it helps to not count on this, to assume that the encrypted message might be made public. That’s the basic premise behind encryption.

The principle behind no-security-by-obscurity is that you want to concentrate your security where it can be quantified. You can, for example, quantify how much more effort it would take to break a 64-bit key (like Blowfish) than a 56-bit key (like DES). Or even better, a 128-bit key (like AES). But you can’t quantify the level of protection that comes from obscurity.
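
The quantifiable part is simple arithmetic: each additional key bit doubles the search space.

    # each extra key bit doubles the brute-force search space
    print(2**(64 - 56))   # a 64-bit key: 256 times the effort of a 56-bit key
    print(2**(128 - 64))  # a 128-bit key: 2**64 times the effort of a 64-bit key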

Is it more secure to give someone a 56-bit DES key on a flash drive in a dark alley than to send them a 64-bit Blowfish key over SMS? You can’t calculate an answer to that question.

In some sense all security is by obscurity. Cryptography literally means hidden writing. But all else being equal—there’s that phrase again—you want to minimize the surface area of what you have to obscure, e.g. limiting your secret to your key and not your algorithm, and it’s better to have quantified risks than unquantified risks. But all else is often not equal, and there are difficult trade-offs.

Advanced questions about a basic diagram

Unit circle trig diagram

I saw a hand-drawn version of the diagram above yesterday and noticed that the points were too evenly distributed. That got me to thinking: is there any objective way to say that this famous diagram is in some sense complete? If you were to make a diagram with more points, what would they be?

Simple numbers

The numbers on the diagram are all simple. Once we’re more precise about what it means for these numbers to be “simple,” we can answer the questions above.

The angles in the diagram are all rational parts of a circle, that is, rational multiples of 2π. For the rest of the post, I’ll say “rational angle” to mean a rational multiple of 2π.

The sines and cosines all involve only one square root, i.e. no nested roots. A more useful way to express this is that all the values are the roots of a quadratic polynomial with integer coefficients.

Completeness

Could we add more rational angles whose sines and cosines are roots of quadratics? Maybe the chart would be too cluttered to put in a textbook, but would it be possible in principle? Could there be some chart analogous to the one above that has, for example, (1 + √7)/5 as one of the labels?

The angles in the common unit circle diagram, integer multiples of π/4 and π/6, are the only rational angles with sines and cosines that are roots of a quadratic polynomial with integer coefficients. That is, these are the only rational angles that have sines and cosines that are algebraic of degree 2. In that sense the diagram is complete.

The number (1 + √7)/5 is algebraic of degree 2 [1], but it is not among the sine and cosine values in the diagram, and we now know that list is exhaustive. So even if you were to try angles of the form pπ/q for very large integers p and q, you’ll never get a sine or cosine equal to (1 + √7)/5.

In 1933 Lehmer [2] showed how to classify all rational angles whose sines or cosines are algebraic of given degree. His theorem proves that the only rational angles whose sine is algebraic of degree 2 are integer multiples of π/4 and π/6.

Interestingly, there is another rational angle whose cosine is algebraic of degree 2:

cos(π/5) = (1 + √5)/4

So we could extend the unit circle diagram to include multiples of π/5, but only the cosine would be algebraic of degree 2. The sines are more complicated. For example,

sin(π/5) = √(5/8 − √(5)/8)

which is algebraic of degree 4.
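
These degree claims are easy to verify with SymPy, which can compute the minimal polynomial of an algebraic number. A quick sketch:

    from sympy import minimal_polynomial, cos, sin, pi, sqrt, Symbol

    x = Symbol("x")

    print(minimal_polynomial(cos(pi/5), x))        # 4*x**2 - 2*x - 1, degree 2
    print(minimal_polynomial(sin(pi/5), x))        # 16*x**4 - 20*x**2 + 5, degree 4
    print(minimal_polynomial((1 + sqrt(7))/5, x))  # 25*x**2 - 10*x - 6, as in [1]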

Higher degrees

There are no rational angles whose sine is algebraic of degree 3, so going up to degree 3 wouldn’t help.

If we go up to degree 4 then we could add multiples of π/5, π/8, and π/12. These all have sines and cosines that are algebraic of degree 4.

[1] (1 + √7)/5 is a root of 25x² − 10x − 6 = 0.

[2] D. H. Lehmer. A Note on Trigonometric Algebraic Numbers. The American Mathematical Monthly, March 1933, Vol. 40, No. 3, pp. 165–166.

How much metadata is in a photo?

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default, phone cameras include time and location data in the photos they take. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.

GPS Altitude  : 72.5 m Above Sea Level
GPS Date/Time : 2017:05:05 17:47:33.31Z
GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
Lens ID       : iPhone SE back camera 4.15mm f/2.2

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.

You can look up the elevation at that location: ground level is 46 meters above sea level, which would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the elevation of ground level or the elevation recorded in the phone isn’t correct.)

When I cropped the image, the edited image contained the software and operating system that was used to edit it.

Platform    : Linux
Software    : GIMP 2.10.30
Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all the metadata later using image editing software, but by default a photo reveals more than you may intend.
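
Incidentally, you don’t need exiftool to get at this information. The Pillow library, for example, can read the same EXIF fields from Python, though it reports less than exiftool does. A minimal sketch, with a made-up file name:

    from PIL import Image
    from PIL.ExifTags import TAGS, GPSTAGS

    img = Image.open("dalek.jpg")  # hypothetical file name
    exif = img.getexif()

    for tag_id, value in exif.items():
        print(TAGS.get(tag_id, tag_id), ":", value)

    # GPS fields live in their own sub-directory (IFD) within the EXIF data
    gps = exif.get_ifd(0x8825)
    for tag_id, value in gps.items():
        print(GPSTAGS.get(tag_id, tag_id), ":", value)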

The Borwein integrals

The Borwein integrals introduced in [1] are a famous example of how proof-by-example can go wrong.

Define sinc(x) as sin(x)/x. Then the following equations hold.

 \begin{align*} \int_0^\infty \text{sinc}(x) \,dx &= \frac{\pi}{2} \\ \int_0^\infty \text{sinc}(x) \, \text{sinc}\left(\frac{x}{3}\right) \,dx &= \frac{\pi}{2} \\ \int_0^\infty \text{sinc}(x)\, \text{sinc}\left(\frac{x}{3}\right) \,\text{sinc}\left(\frac{x}{5}\right) \,dx &= \frac{\pi}{2} \\ \vdots &\phantom{=} \\ \int_0^\infty \text{sinc}(x) \, \text{sinc}\left(\frac{x}{3}\right) \cdots \text{sinc}\left(\frac{x}{13}\right) \,dx &= \frac{\pi}{2} \\ \end{align*}

However

\int_0^\infty \text{sinc}(x) \, \text{sinc}\left(\frac{x}{3}\right) \cdots \text{sinc}\left(\frac{x}{15}\right) \,dx = \frac{\pi}{2} - \delta

where δ ≈ 2.3 × 10⁻¹¹.

This is where many presentations end, concluding with the moral that a pattern can hold for a while and then stop. But I’d like to go just a little further.

Define

B(n) = \int_0^\infty \prod_{k=0}^{n} \text{sinc}\left(\frac{x}{2k+1}\right) \, dx.

Then B(n) = π/2 for n = 1, 2, 3, …, 6 but not for n = 7, though it almost holds for n = 7. What happens for larger values of n?

The Borwein brothers proved that B(n) is a monotone function of n, and the limit as n → ∞ exists. In fact the limit is approximately π/2 − 0.0000352.

So while it would be wrong to conclude that B(n) = π/2 based on calculations for n ≤ 6, this conjecture would be approximately correct, never off by more than 0.0000352.
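
There is also a tidy way to see where the pattern has to break, going back to [1]: the integral equals π/2 as long as 1/3 + 1/5 + ⋯ + 1/(2n + 1) does not exceed 1, and the 1/15 term is the first to push the sum past 1. A few lines of Python confirm where that threshold is crossed.

    from fractions import Fraction

    # partial sums of 1/3 + 1/5 + 1/7 + ...
    total = Fraction(0)
    for k in range(1, 9):
        total += Fraction(1, 2*k + 1)
        print(f"1/3 + ... + 1/{2*k + 1} = {float(total):.4f}  (< 1: {total < 1})")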

[1] David Borwein and Jonathan Borwein. Some Remarkable Properties of Sinc and Related Integrals. The Ramanujan Journal, 3, 73–89, 2001.

This-way-up and Knuth arrows

I was looking today at a cardboard box that had the “this way up” symbol on it and wondered whether there is a Unicode value for it.

ISO 7000 symbol 0623 This way up

Apparently not. But there is an ISO code for it: ISO 7000 symbol 0623. It’s an international standard symbol for indicating how to orient a package. The name says it all: this way up.

There is a similar symbol in math and computer science: ↑↑. This is so-called up-arrow notation, introduced by Donald Knuth in 1976 [1].

In Knuth’s notation, ↑ indicates exponentiation, i.e. repeated multiplication, and ↑↑ indicates repeated exponentiation. There’s a little ambiguity here: we have to clarify in what order we apply exponentiation. Knuth stipulated that ↑↑ is right-associative, i.e.

b \uparrow\uparrow n = \underbrace{b \uparrow (b \uparrow \cdots (b\uparrow b))}_{n \text{ copies of } b} = \underbrace{b^{b^{.\,^{.\,^{.\,^b}}}}}_{n \text{ copies of } b}

So, for example, 5 ↑↑ 3 equals 5^(5^5) = 5^3125, not (5^5)^5 = 5^25.

In general, n arrows means to repeatedly apply n − 1 arrows. If we write ↑^n as a shortcut for n up arrows, then ↑^n can be defined recursively: b ↑^n k means combining k copies of b with the operation ↑^(n−1), associating from the right, as the short function below illustrates.
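
Here is a small recursive implementation of the definition, just a sketch:

    def arrow(b, k, arrows=1):
        """Knuth's up-arrow: one arrow is ordinary exponentiation, and
        n arrows means applying (n - 1) arrows repeatedly, associating
        from the right."""
        if arrows == 1:
            return b ** k
        if k == 0:
            return 1
        return arrow(b, arrow(b, k - 1, arrows), arrows - 1)

    print(arrow(2, 4, 2))                # 2 ↑↑ 4 = 2^(2^(2^2)) = 65536
    print(arrow(5, 3, 2) == 5**(5**5))   # 5 ↑↑ 3 is a tower of three 5s: True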

What I find most interesting about Knuth’s notation is how rarely it is used. I don’t think it’s because anyone objects to Knuth’s notation; it’s just that there isn’t much need for what the notation represents. Its primary use may be in theoretical computer science. There you sometimes want to construct functions that grow ridiculously fast, such as Ackermann’s function, and functions of the form a ↑^n b are good for that.

This is curious. Multiplication is repeated addition, exponentiation is repeated multiplication, and so repeated exponentiation seems like a natural extension. I won’t say that it’s unnatural, but it is very uncommon.

[1] Donald E. Knuth. “Mathematics and Computer Science: Coping with Finiteness”. Science. 194 (4271): 1235–1242.

Factoring pseudoprimes

Fermat’s little theorem says that if p is a prime number, then for any positive integer b < p we have

b^(p−1) = 1 (mod p).

This theorem gives a necessary but not sufficient condition for a number to be prime.

Fermat’s primality test

The converse of Fermat’s little theorem is not always true, but it’s often true. That is, if there exists some base 1 < b < n such that

b^(n−1) = 1 (mod n)

then n is likely to be prime. There are examples where the equation above holds for a pair (b, n) even though n is not prime, and in that case n is called a pseudoprime to the base b.

If you’re searching for large primes, say for use in encryption, then you’d begin by applying Fermat’s little theorem with a few small values of b. This is because although Fermat’s test can’t prove that a number is prime, it can prove that a number is not prime.

For a small example, suppose you wanted to test whether 50621 is prime. You could start by applying Fermat’s test with b = 2 as in the following Python code.

>>> n = 50621
>>> 2**(n-1) % n
9605

Since the result is not 1, we know 50621 is not prime. This doesn’t tell us what the factors of 50621 are, but we know that it has nontrivial factors. We say 2 is a witness that the number 50621 is not prime.

Next, let’s see whether 294409 might be prime.

>>> n = 294409
>>> 2**(n-1) % n
1

This tells us 294409 might be prime. It has passed a test that filters out a lot of composite numbers. What now? We could try other values of b: 3, 5, 7, 11, …. This will not resolve the question of whether 294409 is prime unless we keep going until we try 37. And in fact 37 is the smallest factor of 294409. Our number 294409 is a Carmichael number, a composite number n that passes Fermat’s primality test for all bases b relatively prime to n.

Note that it would be more efficient to use pow(b, n-1, n) rather than 2**(n-1) % n because the former takes advantage of the fact that we don’t need to compute 2^(n−1) per se and can reduce all intermediate calculations mod n.
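
With the three-argument pow it only takes a moment to run Fermat’s test on 294409 for all small bases at once. Every base from 2 through 36 passes, and the first base to fail is 37, which as noted above is a factor.

>>> n = 294409
>>> [b for b in range(2, 40) if pow(b, n - 1, n) != 1]
[37]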

Factoring pseudoprimes

Now suppose we have a number n that has passed Fermat’s primality test for some base b and we suspect that n is a pseudoprime. If we want to (try to) factor n, knowing that it is a pseudoprime to the base b gives us a head start. We can exploit the fact that we know b to factor n in polynomial time, unless n is a strong pseudoprime.

Suppose we have a number n that we suspect is a pseudoprime to the base b, and suppose we’ve at least checked that n is odd. We begin by pulling out all the factors of 2 that we can from n − 1, writing

n − 1 = 2^e f

with f odd.

Next consider the set of numbers

b^(kf) mod n

for k = 1, 2, 4, …, 2^e. Let x be the last of these numbers that is not congruent to 1 mod n. Then x² = 1 (mod n), and since n is not a strong pseudoprime to the base b, x is not congruent to −1 either [1].

Then gcd(x − 1, n) and gcd(x + 1, n) are nontrivial factors of n. This is Theorem 10.4 of [2].

Python example

Let n = 873181. This is a pseudoprime to the base b = 3, which we can confirm by seeing that pow(3, n-1, n) returns 1.

Now 873180 is divisible by 4 but not by 8, so e = 2. So the theorem above says we should compute

>>> n = 873181
>>> b, e = 3, 2
>>> f = (n - 1) // 2**e
>>> [pow(b, f*2**k, n) for k in range(e+1)]

This produces [2643, 1, 1]. So x = 2643, and the following code

>>> x = 2643
>>> from sympy import gcd
>>> gcd(x-1, n)
1321
>>> gcd(x+1, n)
661

shows that 1321 and 661 are both factors of 873181.
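
Here is the procedure above wrapped up as a function. This is my own sketch of the method, not code from [2]; it returns a nontrivial factor when n is an odd Fermat pseudoprime, but not a strong pseudoprime, to the base b, and returns None otherwise.

    from math import gcd

    def factor_pseudoprime(n, b):
        """Try to factor an odd number n that is a Fermat pseudoprime,
        but not a strong pseudoprime, to the base b."""
        # write n - 1 = 2**e * f with f odd
        e, f = 0, n - 1
        while f % 2 == 0:
            e, f = e + 1, f // 2
        # walk through b^f, b^(2f), b^(4f), ... mod n, squaring each time
        x = pow(b, f, n)
        if x == 1 or x == n - 1:
            return None  # n is a strong pseudoprime to the base b
        for _ in range(e):
            y = pow(x, 2, n)
            if y == 1:
                # x^2 = 1 (mod n) but x is not ±1, so this gcd is nontrivial
                return gcd(x - 1, n)
            if y == n - 1:
                return None  # n is a strong pseudoprime to the base b
            x = y
        return None  # b^(n-1) != 1 (mod n), so n is not a Fermat pseudoprime

    print(factor_pseudoprime(873181, 3))  # 1321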

[1] Definition of strong pseudoprime. A strong pseudoprime to the base b is a composite odd integer m such that if m − 1 = 2^e f with f odd, then either b^f = 1 (mod m) or b^(f·2^c) = −1 (mod m) for some 0 ≤ c < e.

[2] The Joy of Factoring by Samuel S. Wagstaff, Jr.

Do comments in a LaTeX file change the output?

When you add a comment to a LaTeX file, it makes no visible change to the output; the comment is ignored as far as the appearance of the document is concerned. But is that comment somehow included in the output file anyway?

If you compile a LaTeX file to PDF, then edit it by throwing in a comment, and compile again, your two files will differ. As I wrote about earlier, the time that a file is created is embedded in a PDF. That time stamp is also included in two or three hashes, so the files will differ by more than just the bits in the time stamp.

But even if you compile two files at the same time (within the resolution of the time stamp, which is one second), the PDF files will still differ. Apparently some kind of hash of the source file is included in the PDF.

So suppose you have two files. The content of foo.tex is

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

and the content of bar.tex is

    \documentclass{article}
    \begin{document}
    Hello world. % comment
    \end{document}

Then the output of running pdflatex on the two files will look the same.

Suppose you compile the files at the same time so that the time stamps are the same.

    pdflatex foo.tex && pdflatex bar.tex

It’s possible that the two time stamps could be different, one file compiling a little before the tick of a new second and one compiling a little after. But if your computer is fast enough and you don’t get unlucky, the time stamps will be the same.

Then you can compare hex dumps of the two PDF files with

    diff  <(xxd foo.pdf) <(xxd bar.pdf)

This produces the following

    < ...  ./ID [<F12AF1442
    < ...  E03CC6B3AB64A5D9
    < ... 8DEE2FE> <F12AF1
    < ...  442E03CC6B3AB64A
    < ... 5D98DEE2FE>]./Le
    --
    > ...  ./ID [<4FAA0E9F1
    > ...  CC6EFCC5068F481E
    > ...  0419AD6> <4FAA0E
    > ...  9F1CC6EFCC5068F4
    > ...  81E0419AD6>]./Le

You can’t recover the comment from the binary dump, but you can tell that the files differ.

I don’t know what hash is being used. My first guess was MD5, but that’s not it. It’s a 128-bit hash, so that rules out newer hashes like SHA256. I tried searching for it but didn’t find anything. If you know what hash pdflatex uses, please let me know.
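
One way to test a guess is to hash the source file yourself and compare the result with the ID string in the PDF. For example, to check the MD5 guess against foo.tex:

    import hashlib

    # hash the LaTeX source and compare with the 32-hex-digit ID in the PDF
    with open("foo.tex", "rb") as f:
        print(hashlib.md5(f.read()).hexdigest().upper())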

LaTeX will also let you add text at the end of the file, after the \end{document} command. This also will change the hash code but will not change the appearance of the output.
