Exploring the sum-product conjecture

Quanta Magazine posted an article yesterday about the sum-product problem of Paul Erdős and Endre Szemerédi. This problem starts with a finite set of real numbers A then considers the size of the sets A+A and A*A. That is, if we add every element of A to every other element of A, how many distinct sums are there? If we take products instead, how many distinct products are there?

Proven results

Erdős and Szemerédi proved that there are constants c and ε > 0 such that

max{|A+A|, |A*A|} ≥ c|A|1+ε

In other words, either A+A or A*A is substantially bigger than A. Erdős and Szemerédi only proved that some positive ε exists, but they suspected ε could be chosen close to 1, i.e. that either |A+A| or |A*A| is bounded below by a fixed multiple of |A|² or nearly so. George Shakan later showed that one can take ε to be any value less than

1/3 + 5/5277 = 0.3342899…

but the conjecture remains that the upper limit on ε is 1.

Python code

The following Python code will let you explore the sum-product conjecture empirically. It randomly selects sets of size N from the non-negative integers less than R, then computes the sum and product sets using set comprehensions.

    from numpy.random import choice

    def trial(R, N):
        # R = integer range, N = sample size
        x = choice(R, N, replace=False)
        s = {a+b for a in x for b in x}
        p = {a*b for a in x for b in x}
        return (len(s), len(p))

When I first tried this code I thought it had a bug. I called trial 10 times and got the same values for |A+A| and |A*A| every time. That was because I chose R large relative to N. In that case, it is likely that every sum and every product will be unique, aside from the redundancy from commutativity. That is, if R >> N, it is likely that |A+A| and |A*A| will both equal N(N+1)/2. Things get more interesting when N is closer to R.

Probability vs combinatorics

The Erdős-Szemerédi problem is a problem in combinatorics, looking for deterministic lower bounds. But it seems natural to consider a probabilistic extension. Instead of asking about lower bounds on |A+A| and |A*A| you could ask for the distribution on |A+A| and |A*A| when the sets A are drawn from some probability distribution.

If the set A is drawn from a continuous distribution, then |A+A| and |A*A| both equal N(N+1)/2 with probability 1. Only careful choices, ones that would happen randomly with probability zero, could prevent the sums and products from being unique, modulo commutativity, as in the case R >> N above.

If the set A is an arithmetic sequence then |A+A| is small and |A*A| is large, and the opposite holds if A is a geometric sequence. So it might be interesting to look at the correlation of |A+A| and |A*A| when A comes from a discrete distribution, such as choosing N integers uniformly from [1, R] when N/R is not too small.

Normal approximation to Laplace distribution?

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace!

Normal and Laplace distributions

A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, it pointed in the middle, like a pole holding up a circus tent.

Normal and Laplace probability density functions

A normal distribution has very thin tails, i.e. probability density drops very rapidly as you move further from the middle, like exp(-x²). The Laplace distribution has moderate tails [1], going to zero like exp(-|x|).

So normal and Laplace distributions are qualitatively very different, both in the center and in the tails. So why would you want to replace one by the other?

Statistics meets differential privacy

The normal distribution is convenient to use in mathematical statistics. Whether it is realistic in application depends on context, but it’s convenient and conventional. The Laplace distribution is convenient and conventional in differential privacy. There’s no need to ask whether it is realistic because Laplace noise is added deliberately; the distribution assumption is exactly correct by construction. (See this post for details.)

When mathematical statistics and differential privacy combine, it could be convenient to “approximate” a Laplace distribution by a normal distribution [2].

Solving for parameters

So if you wanted to replace a Laplace distribution with a normal distribution, which one would you choose? Both distributions are symmetric about their means, so it’s natural to pick the means to be the same. So without loss of generality, we’ll assume both distribution have mean 0. The question then becomes how to choose the scale parameters.

You could just set the two scale parameters to be the same, but that’s similar to the Greek letter fallacy, assuming two parameters have the same meaning just because they have the same symbol. Because the two distributions have different tail weights, their scale parameters serve different functions.

One way to replace a Laplace distribution with a normal would be to pick the scale parameter of the normal so that both two quantiles match. For example, you might want both distributions to have have 95% of their probability mass in the same interval.

I’ve written before about how to solve for scale parameters given two quantiles. We find two quantiles of the Laplace distribution, then use the method in that post to find the corresponding normal distribution scale (standard deviation).

The Laplace distribution with scale s has density

f(x) = exp(-|x|/s)/2s.

If we want to solve for the quantile x such that Prob(Xx) = p, we have

x = –s log(2 – 2p).

Using the formula derived in the previously mentioned post,

σ = 2x / Φ-1(x)

where Φ is the cumulative distribution function of the standard normal.

Related posts

[1] The normal distribution is the canonical example of a thin-tailed distribution, while exponential tails are conventionally the boundary between thick and thin. “Thick tailed” and “thin tailed” are often taken to mean thicker than exponential and thinner that exponential respectively.

[2] You could use a Gaussian mechanism rather than a Laplace mechanism for similar reasons, but this makes the differential privacy theory more complicated. Rather than working with ε-differential privacy you have to work with (ε, δ)-differential privacy. The latter is messier and harder to interpret.

Probabilisitic Identifiers in CCPA

The CCPA, the California Consumer Privacy Act, was passed last year and goes into effect at the beginning of next year. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California.

The law specifically mentions probabilistic identifiers.

“Probabilistic identifier” means the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.

So anything that gives you better than a 50% chance of guessing personal data fields [1]. That could be really broad. For example, the fact that you’re reading this blog post makes it “more probable than not” that you have a college degree, and education is one of the categories mentioned in the law.

Personal information

What are these enumerated categories of personal information mentioned above? They start out specific:

Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier Internet Protocol address, email address, …

but then they get more vague:

purchasing or consuming histories or tendencies … interaction with an Internet Web site … professional or employment-related information.

And in addition to the vague categories are “any categories … similar to” these.

Significance

What is the significance of a probabilistic identifier? That’s hard to say. A large part of the CCPA is devoted to definitions, and some of these definitions don’t seem to be used. Maybe this is a consequence of the bill being rushed to a vote in order to avoid a ballot initiative. Maybe the definitions were included in case they’re needed in a future amended version of the law.

The CCPA seems to give probabilistic identifiers the same status as deterministic identifiers:

“Unique identifier” or “Unique personal identifier” means … or probabilistic identifiers that can be used to identify a particular consumer or device.

That seems odd. Data that can give you a “more probable than not” guess at someone’s “purchasing or consuming histories” hardly seems like a unique identifier.

Devices

It’s interesting that the CCPA says “a particular consumer or device.” That would seem to include browser fingerprinting. That could be a big deal. Identifying devices, but not directly people, is a major industry.

Related posts

[1] Nothing in this blog post is legal advice. I’m not a lawyer and I don’t give legal advice. I enjoy working with lawyers because the division of labor is clear: they do law and I do math.

Font Fingerprinting

Browser fingerprint

Web sites may not be able to identify you, but they can probably identify your web browser. Your browser sends a lot of information back to web servers, and the combination of settings for a particular browser are usually unique. To get an idea what information we’re talking about, you could take a look at Device Info.

Installed fonts

One of the pieces of information that gets sent back to servers is the list of fonts installed on your device. Your font fingerprint is just one component of your browser fingerprint, but it’s an easy component to understand.

Application fonts

Various applications install their own fonts. If you’ve installed Microsoft Office, for example, that would be evident in your list of fonts. However, Office is ubiquitous, so that information doesn’t go very far to identifying you. Maybe the lack of fonts installed with Office would be more conspicuous.

Less common software goes further toward identifying you. For example, I have Mathematica on one of my computers, and along with it Mathematica fonts, something that’s not too common.

Personal fonts

Then there are the fonts you’ve installed deliberately, many of the free. Maybe you’ve installed fonts to support various languages, such as Hebrew and Greek fonts for Bible scholars. Maybe you have dyslexia and have installed fonts that are easier for you to read. Maybe you’ve installed a font because it contains technical symbols you need for your work. These increase the chances that your combination of fonts is unique.

Commercial fonts

Maybe you have purchased a few commercial fonts. One of the reasons to buy fonts is to have something that doesn’t look so common. This also makes the font fingerprint of your browser less common.

Moderate obscurity

Servers have to query whether particular fonts are installed. An obscure font would go a long way toward identifying you. But if a font is truly obscure, the server isn’t likely to ask whether it’s installed. So the greatest privacy risk comes from moderately uncommon fonts [1].

Advertising

Your browser fingerprint is probably unique, unless you have a brand new device, or you’ve made a deliberate effort to keep your fingerprint generic. So while a site may not know who you are, it can recognize whether you’ve been there before, and customize the content you receive accordingly. Maybe you’ve looked at the same product three times without buying, and so you get a nudge to encourage you to go ahead and buy.

(It’ll be interesting to see what effect the California Consumer Privacy Act has on this when it goes into effect the beginning of next year.)

What about changes?

Since there’s more than enough information to uniquely identify your browser, fingerprints are robust to changes. Installing a new font won’t throw advertisers off your trail. If you still have the same monitor size, same geographic location, etc. then advertisers can update your fingerprint information to include he new font. You might even get an advertisement for more fonts if they infer you’re a typography aficionado.

Related posts

[1] Except for a spearphishing attack. A server might check for the presence of fonts that, although uncommon in general, are likely to be on the target’s computer. For example, if someone wanted to detect my browser in particular, they know I have Mathematica fonts installed because I said so above. And they might guess that I have installed the Greek and Hebrew fonts I mentioned. They might also look for obscure fonts I’ve mentioned in the blog, such as Unifont, Andika, and Inconsolata.

The Soviet license plate game and Kolmogorov complexity

Soviet license plate

Physicist Lev Landau used to play a mental game with Soviet license plates [1]. The plates had the form of two digits, a dash, two more digits, and some letters.

Rules of the game

His game was to apply high school math operators to the numbers on both side of the dash so that the dash could be replaced by an equal sign. For example, given the license plate 44-74, one solution would be

4! + 4 = 7*4.

Note that we can insert operators, such as !, +, and *, but not more digits.

Is there a solution for every possible license plate? That depends on what class of operators you allow.

You could trivialize the game by applying the fractional part operation {x} to both sides since the fractional part of an integer is zero. You could disallow the fractional part operator on the grounds that it’s not clearly a high school math operation, or just disallow it because it makes the game uninteresting.

Universal solution

It turns out there’s a universal solution, starting with the observation that

√(n + 1) = sec arctan √n.

If one side is greater than the other by one, the formula above gives an immediate solution. For example, a solution for the license plate 89-88 would be

√89 = sec arctan √88.

If the difference is greater, the formula can be applied repeatedly. For example, we could apply the formula twice to obtain

√(n + 2) = sec arctan √(n + 1) = sec arctan sec arctan √n

and so a possible solution for 35-37 is

sec arctan sec arctan √35 = √37.

Kolmogorov complexity

Given that a solution is always possible, we can make the game more interesting by looking for the simplest solution. We have some intuitive idea what this means. With our example of 44-74, the first solution

4! + 4 = 7*4

is simpler than the universal solution

sec arctan sec arctan … √44 = √74

which would require applying secant and arctangent each 30 times.

The Kolmogorov complexity of an object is the length of the shortest computer program to produce the object. We could compute the Kolmogorov complexity of the functions applied to the digits on each side in order to measure how complex a solution is.

To make this precise, we need to specify what our programming language is, and that’s not as easy as it sounds. If we think of mathematics notation as a programming language, do we want to count ! as one character and arctan as 6 characters? That doesn’t seem right. If we wrote “arctan” as “atn” we would use fewer characters without producing a different solution.

Complexity of Python code

To make things more objective, we could look at the length of actual computer programs rather than imagining math notation to be a programming language. Say we choose Python. Then here are a couple functions that compute our two solutions for license plate 44-74.

    from math import sqrt, cos, atan

    def f():
        sec = lambda x: 1/cos(x)
        y = sqrt(44)
        for _ in range(30):
            y = sec(atan(y))
        return y

    def g():
        return sqrt(77)

We could measure the complexity of our functions f and g by counting the number of characters in each one. But there still are difficulties.

What about the import statement? It should count toward the length of f because it uses everything imported but g could have used a shorter statement that only imported sqrt. More fundamentally, are we cheating by even importing a library?

Furthermore, the two functions above don’t produce exactly the same output due to limited precision. We could imagine that our imported functions are infinitely precise, but then we’re not really using Python but rather an idealized version of Python.

And what about the loop? That introduced new digits, 3 and 0, and so violates the rules of Landau’s game. So should we unroll the loop before calculating complexity?

Thought experiment

Kolmogorov complexity is a very useful concept, but it’s more of a thought experiment than something you can practically compute. We can imagine the shortest program to compute something, but we can seldom know that we’ve actually found such a program. All we can know in practice is upper bounds.

In theory you can enumerate all Turing machines of a given length, or all Python programs of a given length, and find the shortest one that does a given task [2], but the list grows exponentially with length.

However, it is possible to compute the length of particular programs if we deal with some of the complications mentioned above. We could make Landau’s game a two-player game by seeing who could come up with the simpler solution in a fixed amount of time.

Back to Landau

If we allow sine and degree in our set of operators, there’s a universal solution due to B. S. Gorobets. For n ≥ 6, n! is a multiple of 360, and so

sin (n!)° = 0.

And if n is less than 6, it’s two-digit representation begins with zero, so we can multiply the digits to get zero.

If we disallow transcendental functions, we block Gorobets’ trick and we have functions whose length we can objectively measure in a programming language.

Related posts

[1] Harun Šiljak. Soviet Street Mathematics: Landau’s License Plate Game. Mathematical Intelligencer, https://doi.org/10.1007/s00283-017-9743-9

[2] In addition to having to test an exponentially growing set of programs, there’s the problem of knowing if even one program will complete.

If you run a program and see it complete, then you know it completes! If you don’t see it complete, it might just need more time, or it might never finish. The halting problem theorem says that there’s no program that can tell whether another arbitrary program will terminate. You still might be able to determine whether a particular program terminates, but you might not.

Photo credit CC BY-SA 3.0 via Wikimedia Commons
Русский: Задний автомобильный номер СССР образца 1936 года

3,000th blog post

I just saw that I’d written 2,999 blog posts, so that makes this one the 3,000th. About a year ago was the 10th anniversary, and Tim Hopper wrote his retrospective about my blog.

In addition to chronological blog posts, there are about 200 “pages” on the site, mostly technical notes. These include the most popular content on the site.

***

The following is a rambling discussion of things that have changed over the history of the blog.

I started out posting a little more than once a day on average. I’ve slowed down a little, but I still wrote about 300 posts over the last year. My posts are getting a little longer. They’re still fairly short, but a little longer than they used to be.

I’ve also started using more footnotes. That has worked out well, letting me write for two audiences at the same time: those who want a high-level overview and those who want technical details.

More people are reading from mobile devices, so I moved to a responsive design. I also use more SVG images because they look great across a variety of platforms.

HTTPS has become more common, and I switched to HTTPS a while back.

Google nearly killed RSS when it killed the most popular RSS reader. I announce most of my posts on Twitter because Twitter has sorta taken the place of RSS. You can still subscribe by RSS, and many do, but not as many as before the demise of Google Reader.

I’ve been writing more about privacy and security lately because I’m doing more work with privacy and security.

I was reluctant to start blogging, but I gave it a chance after several people exhorted me to do it. I was especially hesitant to allow comments, but the signal to noise ratio has been a pleasant surprise (aside from the millions of comments blocked by my spam filter). I’ve learned a lot from your feedback. I’ve met a lot of friends and clients through the blog and am very glad I started it.

Economics, power laws, and hacking

Increasing costs impact some players more than others. Those who know about power laws and know how to prioritize are impacted less than those who naively believe everything is equally important.

This post will look at economics and power laws in the context of password cracking. Increasing the cost of verifying a password does not impact attackers proportionately because the attackers have a power law on their side.

Key stretching

In an earlier post I explained how key stretching increases the time required to verify a password. The idea is that if authentication systems can afford to spend more computation time verifying a password, they can force attackers to spend more time as well.

The stretching algorithm increases the time required to test a single salt and password combination by a factor of r where r is a parameter in the algorithm. But increasing the amount of work per instance by a factor of r does not decrease the number of users an attacker can compromise by a factor of r.

Power laws

If an attacker is going through a list of compromised passwords, the number of user passwords recovered for a fixed amount of computing power decreases by less than r because passwords follow a power law distribution [1]. The nth most common password appears with frequency proportional to ns for some constant s. Studies suggest s is between 0.5 and 0.9. Let’s say s = 0.7.

Suppose an attacker has a sorted list of common passwords. He starts with the most common password and tries it for each user. Then he tries the next most common password, and keeps going until he runs out of resources. If you increase the difficulty of testing each password by r, the attacker will have to stop after trying fewer passwords.

Example

Suppose the attacker had time to test the 1000 most common passwords, but then you set r to 1000, increasing the resources required for each individual test by 1000. Now the attacker can only test the most common password. The most common password accounts for about 1/24 of the users of the 1000 most common passwords. So the number of passwords the attacker can test went down by 1000, but the number of users effected only went down by 24.

It would take 1000 times longer if the attacker still tested the same number of passwords. But the attacker is hitting diminishing return after the first password tested. By testing passwords in order of popularity, his return on resources does not need to go down by a factor of 1000. In this example it only went down by a factor of 24.

How much the hacker’s ROI decreases depends on several factors. It never decreases by more than r, and often it decreases less.

Related posts

[1] Any mention of power laws brings out two opposite reactions, one enthusiastic and one pedantic. The naive reaction is “Cool! Another example of how power laws are everywhere!” The pedantic reaction is “Few things actually follow a power law. When people claim that data follow a power law distribution the data usually follow a slightly different distribution.”

The pedantic reaction is technically correct but not very helpful. The enthusiastic reaction is correct with a little nuance: many things approximately follow a power law distribution over a significant range. Nothing follows a power law distribution exactly and everywhere, but the same is true of any probability distribution.

It doesn’t matter for this post whether data exactly follow a power law. This post is just a sort of back-of-the-envelope calculation. The power law distribution has some empirical basis in this context, and it’s a convenient form for calculations.

Varsity versus junior varsity sports

soccer

Yesterday my wife and I watched our daughter’s junior varsity soccer game. Several statistical questions came to mind.

Larger schools tend to have better sports teams. If the talent distributions of a large school and a small school are the same, the larger school will have a better team because its players are the best from a larger population. If one school is twice as big as another, its team may consist of the top 5% while the other school’s team consists of its top 10%.

Does size benefit a school’s top (varsity) team or its second (junior varsity) team more? Would you expect more variability in varsity or junior varsity scores? Does your answer depend on whether you assume a thin-tailed (e.g. normal) or thick tailed (e.g. Cauchy) distribution on talent?

What if two schools have the same size, but one has a better athletic program, say due to better coaching. Suppose this shifts the center of the talent distribution. Does such a shift benefit varsity or junior varsity teams more?

Suppose both the varsity and junior varsity teams from two schools are playing each other, as was the case last night. If you know the outcome of the junior varsity game, how much should that influence your prediction of the outcome of the varsity game? Has anyone looked at this, either as an abstract model or by analyzing actual scores?

Related post: Probability of winning the World Series

 

The most low-key newsletter

My monthly newsletter is one of the most low-key ones around. It’s almost a secret. You can find it via the navigation menu if you look for it.

I won’t put a popup on my site cajoling you to subscribe, nor will I ask you to sign up before letting you read something I’ve written. My newsletter is just for people who want to read it.

I’ve been sending out a monthly newsletter for about four years. I’ve never sent out more than one email a month, though I might some day if I have a good reason to.

Each newsletter highlights a few of the most popular posts since the previous newsletter. I used to say a little about my business each month, or maybe something personal, but I quit when that got to be repetitive. I may go back to doing that occasionally if I have something to say that I believe you’ll find interesting.

Salting and stretching a password

This post will look at a progression of ways to store passwords, from naive to sophisticated.

Most naive: clear text

Storing passwords in plain text is least secure thing a server could do. If this list is leaked, someone knows all the passwords with no effort.

Better: hash values

A better approach would be to run passwords through a secure hash function before storing them. The server would never store a password per se, only its hashed value. When someone enters a password, the server would check whether the hash of the password matches the stored hashed value. If the list of hashed values is leaked, it will take some effort to recover the original passwords, though maybe not much effort, as I wrote about in my previous post.

Better still: add salt

The next step in sophistication is to append a random value, called a salt, to the password before applying the hash function. The server would store each user’s salt value and the hash of the password with the salt added.

Now if the user has a naive password like “qwerty” you couldn’t crack the password by doing a simple Google search. You can find the hash value of “qwerty” by searching, but not the hash value of qwerty + random salt, because although the former is common the latter is probably unique. You could still crack the password if the hash is insecure, but it would take a little effort.

Better still: key stretching

Suppose an attacker has a list of salt values and corresponding hash values for salt + password. They could guess passwords, hashing each with a salt value, to see if any hash values match. They would start with the most common passwords and probably get some matches.

Key stretching is a way to make this brute force search more time consuming by requiring repeated hashing. In the following stretching algorithm, p is the password, s is the salt, h is the hash function, and || means string concatenation.

\begin{align*} x_0 &= \phi \\ x_i &= h(x_{i-1} || p || s) \mbox{ for i = 1, \ldots, }r \\ K &= x_r \end{align*}

Now the time required to test each password has been multiplied by r. The idea is to pick a value of r that is affordable for legitimate use but expensive for attacks.