No funding for uncomfortable results

Posted on 7 December 2018 by John

In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous. The state of Massachusetts had released data on 135,000 state employees and their families with obvious identifiers removed. However, the data contained zip code, birth date, and sex for each individual. Sweeney was able to cross reference this data with publicly available voter registration data to find the medical records of then Massachusetts governor William Weld.

An estimated 87% of Americans can be identified by the combination of zip code, birth date, and sex. A back-of-the-envelope calculation shows that this should not be surprising, but Sweeney appears to be the first to do this calculation and pursue the results. (Update: See such a calculation in the next post.)

In her paper Only You, Your Doctor, and Many Others May Know Sweeney says that her research was unwelcome. Over 20 journals turned down her paper on the Weld study, and nobody wanted to fund privacy research that might reach uncomfortable conclusions.

A decade ago, funding sources refused to fund re-identification experiments unless there was a promise that results would likely show that no risk existed or that all problems could be solved by some promising new theoretical technology under development. Financial resources were unavailable to support rigorous scientific studies otherwise.

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

Sine of a googol

Posted on 5 December 2018 by John

How do you evaluate the sine of a large number in floating point arithmetic? What does the result even mean?

Sine of a trillion

Let’s start by finding the sine of a trillion (10¹²) using floating point arithmetic. There are a couple ways to think about this. The floating point number t = 1.0e12 can only be represented to 15 or 16 significant decimal figures (to be precise, 53 bits [1]), and so you could think of it as a representative of the interval of numbers that all have the same floating point representation. Any result that is the sine of a number in that interval should be considered correct.

Another way to look at this is to say that t can be represented exactly—its binary representation requires 42 bits and we have 53 bits of significand precision available—and so we expect sin(t) to return the sine of exactly one trillion, correct to full precision.

It turns out that IEEE arithmetic does the latter, computing sin(1e12) correctly to 16 digits.

Here’s the result in Python

    >>> sin(1.0e12)
    -0.6112387023768895

and verified by Mathematica by computing the value to 20 decimal places

    In:= N[Sin[10^12], 20]
    Out= -0.61123870237688949819

Range reduction

Note that the result above is not what you’d get if you were first to take the remainder when dividing by 2π and then taking the sine.

    >>> sin(1.0e12 % (2*pi))
    -0.6112078499756778

This makes sense. The result of dividing a trillion by the floating point representation of 2π, 159154943091.89536, is correct to full floating point precision. But most of that precision is in the integer part. The fractional part is only has five digits of precision, and so we should expect the result above to be correct to at most five digits. In fact it’s accurate to four digits.

When your processor computes sin(1e12) it does not naively take the remainder by 2π and then compute the sine. If it did, you’d get four significant figures rather than 16.

We started out by saying that there were two ways of looking at the problem, and according to the first one, returning only four significant figures would be quite reasonable. If we think of the value t as a measured quantity, measured to the precision of floating point arithmetic (though hardly anything can be measured that accurately), then four significant figures would be all we should expect. But the people who designed the sine function implementation on your processor did more than they might be expected to do, finding the sine of a trillion to full precision. They do this using a range reduction algorithm that retains far more precision than naively doing a division. (I worked on this problem a long time ago.)

Sine of a googol?

What if we try to take the sine of a ridiculously large number like a googol (10¹⁰⁰)? This won’t work because a googol cannot be represented exactly as a floating point number (i.e. as a IEEE 754 double). It’s not too big; floating point numbers can be as large as around 10³⁰⁸. The problem is that a number that big cannot be represented to full precision. But we can represent 2³³³ exactly, and a googol is between 2³³² and 2³³³. And amazingly, Python’s sine function (or rather the sine function built into the AMD processor on my computer) returns the correct result to full precision.

    >>> sin(2**333)
    0.9731320373846353

How do we know this is right? I verified it in Mathematica:

    In:= Sin[2.0^333]
    Out= 0.9731320373846353

How do we know Mathematica is right? We can do naive range reduction using extended precision, say 200 decimal places.

    In:= p = N[Pi, 200]
    In:= theta = x - IntegerPart[x/ (2 p)] 2 p
    Out= 1.8031286178334699559384196689…

and verify that the sine of the range reduced value is correct.

    In:= Sin[theta]
    Out= 0.97313203738463526…

Interval arithmetic interpretation

Because floating point numbers have 53 bits of precision, all real numbers between 2⁵⁶ − 2² and 2⁵⁶ + 2² are represented as the floating point number 2⁵⁶. This is a range of width 8, wider than 2π, and so the sines of the numbers in this range cover the possible values of sine, i.e. [−1, 1]. So if you think of a floating point number as a sample from the set of all real numbers with the same binary representation, every possible value of sine is a valid return value for numbers bigger than 2⁵⁶. But if you think of a floating point number as representing itself exactly, then it makes sense to ask for the sine of numbers like 2³³³ and larger, up to the limits of floating point representation [2].

Six degrees of Kevin Bacon, Paul Erdos, and Wikipedia

Posted on 4 December 2018 by John

I just discovered the website Six Degrees of Wikipedia. It lets you enter two topics and it will show you how few hops it can take to get from one to the other.

Since the mathematical equivalent of Six Degrees of Kevin Bacon is Six degrees of Paul Erdős, I tried looking for the distance between Kevin Bacon and Paul Erdős and found this:

Erdos to Bacop

Kevin Bacon and Paul Erdős are both known for their prolific collaborations. Many actors have acted with Kevin Bacon, or acted with actors who have acted with him, etc. Many mathematicians wrote papers with Erdős, or have written papers with people who have written papers with him, etc. I’m four degrees of separation from Paul Erdős last I checked. [1]

Here’s a more complex graph showing the three degrees of separation between Thomas Aquinas and Thomas the Tank Engine.

Aquinas to Tank Engine

Note that the edges are directed, links from the starting page to pages that lead to the target page. If you reverse the starting and ending page, you get different results. For example, there are still three degrees of separation if we reverse our last example, going from Thomas the Tank Engine to Thomas Aquinas, but there are about twice as many possible paths.

Mersenne prime trend

Posted on 4 December 2018 by John

Mersenne primes have the form 2^p − 1 where p is a prime. The graph below plots the trend in the size of these numbers based on the 50 51 Mersenne primes currently known.

The vertical axis plots the exponents p, which are essentially the logs base 2 of the Mersenne primes. The scale is logarithmic, so this plot suggests that the Mersenne primes grow approximately linearly on a double logarithmic scale.

Trend in Mersenne primes

Update: A new Mersenne prime was announced December 21, 2018. The graph was updated to include this new data point.

Spherical trig, Research Triangle, and Mathematica

Posted on 3 December 2018 by John

This post will look at the triangle behind North Carolina’s Research Triangle using Mathematica’s geographic functions.

Spherical triangles

A spherical triangle is a triangle drawn on the surface of a sphere. It has three vertices, given by points on the sphere, and three sides. The sides of the triangle are portions of great circles running between two vertices. A great circle is a circle of maximum radius, a circle with the same center as the sphere.

An interesting aspect of spherical geometry is that both the sides and angles of a spherical triangle are angles. Because the sides of a spherical triangle are arcs, they have angular measure, the angle formed by connecting each vertex to the center of the sphere. The arc length of a side is its angular measure times the radius of the sphere.

Denote the three vertices by A, B, and C. Denote the side opposite A by a, etc. Denote the angles at A, B, and C by α, β, and γ respectively.

Research Triangle

Research Triangle is a (spherical!) triangle with vertices formed by Duke University, North Carolina State University, and University of North Carolina at Chapel Hill.

(That was the origin of the name, though it’s now applied more loosely to the general area around these three universities.)

We’ll take as our vertices

A = UNC Chapel Hill (35.9046 N, 79.0468 W)
B = Duke University in Durham (36.0011 N, 78.9389 W),
C = NCSU in Raleigh (35.7872 N, 78.6705 W)

Research Triangle

Mathematica

We’ll illustrate several features of Mathematica using the spherical triangle corresponding to Research Triangle.

Map

The map above was produced with the following Mathematica code.

    ptA = GeoPosition[{35.9046, -79.0468}]
    ptB = GeoPosition[{36.0011, -78.9389}]
    ptC = GeoPosition[{35.7872, -78.6705}]

    GeoGraphics[{Red, PointSize[Large], 
        Point[ptA], Point[ptB], Point[ptC]}, 
        GeoScaleBar -> "Imperial", 
        GeoRange -> 29000]

Note that longitude is in degrees east, so the longitudes above are negative.

Distance

To find the distance between two locations, you can use the function GeoDistance. For example,

    GeoDistance[ptA, ptB]

tells me that the distance between UNC and Duke is 8.99185 miles. I assume it displays miles by default based on my location in the US, though the GeoRange option above defaults to meters. You can make the system of units explicit. For example

    GeoDistance[ptA, ptB, UnitSystem -> "Metric"]

returns 14.741 km.

If we want to find the length of the side c in degrees, we can use the Earth’s radius.

    r = PlanetData["Earth", "Radius"]
    c = GeoDistance[ptA, ptB] 360 / (2 Pi r)

This shows that c is 0.13014°. Calculating the other sides similarly shows a = 0.30504° and b = 0.32739°.

Angles

Calling GeoDirection[ptA, ptB] returns 42.2432°, which says we need to head at an angle of about 42° to walk from UNC to Duke.

The code

    GeoDirection[ptA, ptB] - GeoDirection[ptA, ptC]

shows that the angle α is 68.6128°. (The code returns the negative of this angle because the angle is clockwise.) Similarly we find β = 87.9808° and γ = 23.4068.

The angles add up to 180° only because our triangle is small compared to the earth’s total area. The actual sum should be slightly more than 180°, but we’ve not retained enough precision to detect the difference. In general the “spherical excess,” i.e. the amount by which the sum of the angles exceed 180°, is proportional to the area of the triangle.

Visualizing data breaches

Posted on 1 December 2018 by John

The image below is a static screen shot of an interactive visualization of the world’s biggest data breaches. The site lets you filter the data by industry and type of breach. See the site for credits and the raw data.

Data breach visualization

Related: Help with data breach response

Topping out

Posted on 1 December 2018 by John

There’s an ancient tradition of construction workers putting a Christmas tree on top of a building when it reaches its full height. I happened to drive by a recently topped out building this morning.

topping out

Month: December 2018

No funding for uncomfortable results

More privacy posts

Sine of a googol

Sine of a trillion

Range reduction

Sine of a googol?

Interval arithmetic interpretation

More numerical math posts

Six degrees of Kevin Bacon, Paul Erdos, and Wikipedia

More network posts

Mersenne prime trend

More on Mersenne primes

Spherical trig, Research Triangle, and Mathematica

Spherical triangles

Research Triangle

Mathematica

Map

Distance

Angles

More spherical trig posts

Visualizing data breaches

Topping out