Regression, modular arithmetic, and PQC

Linear regression

Suppose you have a linear regression with a couple predictors and no intercept term:

β1x1 + β2x2 = y + ε

where the x's are inputs, the βs are fixed but unknown, y is the output, and ε is random error.

Given n observations (x1, x2, y + ε), linear regression estimates the parameters β1 and β2.

I haven’t said, but I implicitly assumed all the above numbers are real. Of course they’re real. It would be strange if they weren’t!

Learning with errors

Well, we’re about to do something strange. We’re going to pick a prime number p and do our calculations modulo p except for the addition of the error ε. Our inputs (x1, x2) are going to be pairs of integers. Someone is going to compute

r = β1x1 + β2x2 mod p

where β1 and β2 are secret integers. Then they’re going to tell us

r/p + ε

where ε is a random variable on the interval [0, 1].  We give them n pairs (x1, x2) and they give back n values of r/p with noise added. Our job is to infer the βs.

This problem is called learning with errors or LWE. It’s like linear regression, but much harder when the problem size is bigger. Instead of just two inputs, we could have m inputs with m secret coefficients, where m is large. Depending on the number of variables m, the number of equations n, the modulus p, and the probability distribution on ε, the problem may be possible to solve but computationally very difficult.

Why is it so difficult? Working mod p is discontinuous. A little bit of error might completely change our estimation of the solution. If n is large enough, we could recover the coefficients anyway, using something like least squares. But how would we carry that out? If m and p are small we can just try all p^m possibilities, but that’s not going to be practical if m and p are large.
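
To make this concrete, here’s a toy sketch (my own illustration, not from the post) of generating noisy LWE-style samples and recovering the secret by brute force. The parameter values, and the use of a much smaller noise interval than the [0, 1] mentioned above, are choices made purely so the toy search succeeds; real schemes use vastly larger parameters.

    import numpy as np

    rng = np.random.default_rng(20190101)  # arbitrary seed for reproducibility

    p = 97   # small prime modulus (toy size)
    m = 2    # number of secret coefficients
    n = 20   # number of noisy observations

    beta = rng.integers(0, p, size=m)        # the secret coefficients
    X = rng.integers(0, p, size=(n, m))      # the inputs we supply
    noise = rng.uniform(0, 0.05, size=n)     # small noise added to r/p
    obs = (X @ beta % p) / p + noise         # what we are told: r/p plus noise

    # Brute force: try all p**m candidate secrets and keep the best fit.
    # Only feasible here because m and p are tiny.
    best, best_err = None, np.inf
    for b1 in range(p):
        for b2 in range(p):
            r = (X @ np.array([b1, b2]) % p) / p
            err = np.sum((obs - r)**2)
            if err < best_err:
                best, best_err = (b1, b2), err

    print("true secret:", beta.tolist(), "recovered:", list(best))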

In linear regression, we assume there is some (approximately) linear process out in the real world that we’re allowed to observe with limited accuracy. Nobody’s playing a game with us; that’s just how data come to us. But with LWE, we are playing a game that someone has designed to be hard. Why? For cryptography. In particular, quantum-resistant cryptography.

Post Quantum Cryptography

Variations on LWE are the basis for several proposed encryption algorithms that are believed to be secure even if an adversary has access to a quantum computer.

The public key encryption systems in common use today would all be breakable if quantum computing becomes practical. They depend on mathematical problems like factoring and discrete logarithms being computationally difficult, which they appear to be with traditional computing resources. But we know that these problems could be solved in polynomial time on a quantum computer using Shor’s algorithm. LWE, by contrast, appears to be a hard problem even on a quantum computer. Or so we suspect.

The US government’s National Institute of Standards and Technology (NIST) is holding a competition to identify quantum-resistant encryption algorithms. Last month they announced 26 algorithms that made it to the second round. Many of these algorithms depend on LWE or variations.

One variation is LWR (learning with rounding) which uses rounding rather than adding random noise. There are also ring-based counterparts RLWE and RLWR which add random errors and use rounding respectively. And there are polynomial variations such as poly-LWE which uses a polynomial-based learning with errors problem. The general category for these methods is lattice methods.

Lattice methods

Of the public-key algorithms that made it to the second round of the NIST competition, 9 out of 17 use lattice-based cryptography:

  • CRYSTALS-KYBER
  • FrodoKEM
  • LAC
  • NewHope
  • NTRU
  • NTRU Prime
  • Round5
  • SABER
  • Three Bears

Also, two of the nine digital signature algorithms are based on lattice problems:

  • CRYSTALS-DILITHIUM
  • FALCON

Based purely on the names, and not on the merits of the algorithms, I hope the winner is one of the methods with a science fiction allusion in the name.


Entropy extractor used in μRNG

Yesterday I mentioned μRNG, a true random number generator (TRNG) that takes physical sources of randomness as input. These sources are independent but non-uniform. This post will present the entropy extractor μRNG uses to take non-uniform bits as input and produce uniform bits as output.

We will present Python code for playing with the entropy extractor. (μRNG is extremely efficient, but the Python code here is not; it’s just for illustration.) The code will show how to use the pyfinite library to do arithmetic over a finite field.

Entropy extractor

The μRNG generator starts with three bit streams—X, Y, and Z—each with at least 1/3 bit min-entropy per bit.

Min-entropy is Rényi entropy with α = ∞. For a Bernoulli random variable that takes on two values, one with probability p and the other with probability 1-p, the min-entropy is

-log₂ max(p, 1-p).

So requiring min-entropy of at least 1/3 means the two probabilities are less than 2^(-1/3) = 0.7937.
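
As a quick arithmetic check (an addition of mine, not in the original post), here is min-entropy as a function of p for a Bernoulli source:

    from math import log2

    def min_entropy(p):
        "Min-entropy per bit of a Bernoulli(p) source."
        return -log2(max(p, 1 - p))

    print(min_entropy(2**(-1/3)))  # 1/3, up to floating point
    print(min_entropy(0.79))       # about 0.34, the value used below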

Take eight bits (one byte) at a time from X, Y, and Z, and interpret each byte as an element of the finite field with 2^8 elements. Then compute

X*Y + Z

in this field. The resulting stream of bits will be independent and uniformly distributed, or very nearly so.

Python implementation

We will need the bernoulli class for generating our input bit streams, and the pyfinite library for doing finite field arithmetic on the bits.

    from scipy.stats import bernoulli
    from pyfinite import ffield

And we will need a couple bit manipulation functions.

    def bits_to_num(a):
        "Convert an array of bits to an integer."
    
        x = 0
        for i in range(len(a)):
            x += a[i]*2**i
        return x

    def bitCount(n):
        "Count how many bits are set to 1."
        count = 0
        while(n):
            n &= n - 1
            count += 1
        return count 

The following function generates random bytes using the entropy extractor. The input bit streams have p = 0.79, corresponding to min-entropy 0.34.

    def generate_byte():
        "Generate bytes using the entropy extractor."
    
        b = bernoulli(0.79)
    
        x = bits_to_num(b.rvs(8))
        y = bits_to_num(b.rvs(8))
        z = bits_to_num(b.rvs(8)) 

        F = ffield.FField(8)
        return F.Add(F.Multiply(x, y), z)

Note that 79% of the bits produced by the Bernoulli generator will be 1’s. But we can see that the output bytes are about half 1’s and half 0’s.

    s = 0
    N = 1000
    for _ in range(N):
        s += bitCount( generate_byte() )
    print( s/(8*N) )

This returned 0.50375 the first time I ran it and 0.49925 the second time.

For more details see the μRNG paper.


Exploring the sum-product conjecture

Quanta Magazine posted an article yesterday about the sum-product problem of Paul Erdős and Endre Szemerédi. This problem starts with a finite set of real numbers A, then considers the sizes of the sets A+A and A*A. That is, if we add every element of A to every other element of A, how many distinct sums are there? If we take products instead, how many distinct products are there?

Proven results

Erdős and Szemerédi proved that there are constants c and ε > 0 such that

max{|A+A|, |A*A|} ≥ c|A|^(1+ε)

In other words, either A+A or A*A is substantially bigger than A. Erdős and Szemerédi only proved that some positive ε exists, but they suspected ε could be chosen close to 1, i.e. that either |A+A| or |A*A| is bounded below by a fixed multiple of |A|² or nearly so. George Shakan later showed that one can take ε to be any value less than

1/3 + 5/5277 = 0.3342808…

but the conjecture remains that the upper limit on ε is 1.

Python code

The following Python code will let you explore the sum-product conjecture empirically. It randomly selects sets of size N from the non-negative integers less than R, then computes the sum and product sets using set comprehensions.

    from numpy.random import choice

    def trial(R, N):
        # R = integer range, N = sample size
        x = choice(R, N, replace=False)
        s = {a+b for a in x for b in x}
        p = {a*b for a in x for b in x}
        return (len(s), len(p))

When I first tried this code I thought it had a bug. I called trial 10 times and got the same values for |A+A| and |A*A| every time. That was because I chose R large relative to N. In that case, it is likely that every sum and every product will be unique, aside from the redundancy from commutativity. That is, if R >> N, it is likely that |A+A| and |A*A| will both equal N(N+1)/2. Things get more interesting when N is closer to R.

Probability vs combinatorics

The Erdős-Szemerédi problem is a problem in combinatorics, looking for deterministic lower bounds. But it seems natural to consider a probabilistic extension. Instead of asking about lower bounds on |A+A| and |A*A| you could ask for the distribution on |A+A| and |A*A| when the sets A are drawn from some probability distribution.

If the set A is drawn from a continuous distribution, then |A+A| and |A*A| both equal N(N+1)/2 with probability 1. Only careful choices, ones that would happen randomly with probability zero, could prevent the sums and products from being unique, modulo commutativity, as in the case R >> N above.

If the set A is an arithmetic sequence then |A+A| is small and |A*A| is large, and the opposite holds if A is a geometric sequence. So it might be interesting to look at the correlation of |A+A| and |A*A| when A comes from a discrete distribution, such as choosing N integers uniformly from [1, R] when N/R is not too small.
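
Here’s one way to poke at that numerically, reusing the trial function defined above. This snippet is my addition; the choices R = 200, N = 50, and 1000 repetitions are arbitrary.

    import numpy as np

    # Uses the trial() function defined above.
    R, N, reps = 200, 50, 1000
    results = np.array([trial(R, N) for _ in range(reps)])
    print("corr(|A+A|, |A*A|) =", np.corrcoef(results.T)[0, 1])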

Normal approximation to Laplace distribution?

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace!

Normal and Laplace distributions

A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, is pointed in the middle, like a pole holding up a circus tent.

Normal and Laplace probability density functions

A normal distribution has very thin tails, i.e. probability density drops very rapidly as you move further from the middle, like exp(-x²). The Laplace distribution has moderate tails [1], going to zero like exp(-|x|).

So normal and Laplace distributions are qualitatively very different, both in the center and in the tails. Why, then, would you want to replace one with the other?

Statistics meets differential privacy

The normal distribution is convenient to use in mathematical statistics. Whether it is realistic in application depends on context, but it’s convenient and conventional. The Laplace distribution is convenient and conventional in differential privacy. There’s no need to ask whether it is realistic because Laplace noise is added deliberately; the distribution assumption is exactly correct by construction. (See this post for details.)

When mathematical statistics and differential privacy combine, it could be convenient to “approximate” a Laplace distribution by a normal distribution [2].

Solving for parameters

So if you wanted to replace a Laplace distribution with a normal distribution, which one would you choose? Both distributions are symmetric about their means, so it’s natural to pick the means to be the same. So without loss of generality, we’ll assume both distributions have mean 0. The question then becomes how to choose the scale parameters.

You could just set the two scale parameters to be the same, but that’s similar to the Greek letter fallacy, assuming two parameters have the same meaning just because they have the same symbol. Because the two distributions have different tail weights, their scale parameters serve different functions.

One way to replace a Laplace distribution with a normal would be to pick the scale parameter of the normal so that the two distributions agree at a pair of quantiles. For example, you might want both distributions to have 95% of their probability mass in the same interval.

I’ve written before about how to solve for scale parameters given two quantiles. We find two quantiles of the Laplace distribution, then use the method in that post to find the corresponding normal distribution scale (standard deviation).

The Laplace distribution with scale s has density

f(x) = exp(-|x|/s) / (2s).

If we want to solve for the quantile x such that Prob(X ≤ x) = p, then for p ≥ 1/2 we have

x = –s log(2 – 2p).

Using the formula derived in the previously mentioned post,

σ = 2x / (Φ⁻¹(p) – Φ⁻¹(1 – p)) = x / Φ⁻¹(p)

where Φ is the cumulative distribution function of the standard normal.
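
Here is a small sketch of the calculation, using SciPy’s normal quantile function. This code is my addition; the helper name matching_sigma and the example values s = 1 and p = 0.975 (matching the central 95% interval) are just for illustration.

    from math import log
    from scipy.stats import norm, laplace

    def matching_sigma(s, p):
        "Normal sigma whose p-quantile matches a mean-zero Laplace with scale s."
        x = -s * log(2 - 2*p)    # Laplace quantile, valid for p >= 1/2
        return x / norm.ppf(p)   # sigma = x / Phi^{-1}(p)

    s, p = 1.0, 0.975
    sigma = matching_sigma(s, p)
    print(sigma)

    # sanity check: both distributions put probability p below x
    x = -s * log(2 - 2*p)
    print(laplace.cdf(x, scale=s), norm.cdf(x, scale=sigma))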


[1] The normal distribution is the canonical example of a thin-tailed distribution, while exponential tails are conventionally the boundary between thick and thin. “Thick tailed” and “thin tailed” are often taken to mean thicker than exponential and thinner than exponential respectively.

[2] You could use a Gaussian mechanism rather than a Laplace mechanism for similar reasons, but this makes the differential privacy theory more complicated. Rather than working with ε-differential privacy you have to work with (ε, δ)-differential privacy. The latter is messier and harder to interpret.

Probabilistic Identifiers in CCPA

The CCPA, the California Consumer Privacy Act, was passed last year and goes into effect at the beginning of next year. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California.

The law specifically mentions probabilistic identifiers.

“Probabilistic identifier” means the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.

So anything that gives you better than a 50% chance of guessing personal data fields [1]. That could be really broad. For example, the fact that you’re reading this blog post makes it “more probable than not” that you have a college degree, and education is one of the categories mentioned in the law.

Personal information

What are these enumerated categories of personal information mentioned above? They start out specific:

Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier Internet Protocol address, email address, …

but then they get more vague:

purchasing or consuming histories or tendencies … interaction with an Internet Web site … professional or employment-related information.

And in addition to the vague categories are “any categories … similar to” these.

Significance

What is the significance of a probabilistic identifier? That’s hard to say. A large part of the CCPA is devoted to definitions, and some of these definitions don’t seem to be used. Maybe this is a consequence of the bill being rushed to a vote in order to avoid a ballot initiative. Maybe the definitions were included in case they’re needed in a future amended version of the law.

The CCPA seems to give probabilistic identifiers the same status as deterministic identifiers:

“Unique identifier” or “Unique personal identifier” means … or probabilistic identifiers that can be used to identify a particular consumer or device.

That seems odd. Data that can give you a “more probable than not” guess at someone’s “purchasing or consuming histories” hardly seems like a unique identifier.

Devices

It’s interesting that the CCPA says “a particular consumer or device.” That would seem to include browser fingerprinting. That could be a big deal. Identifying devices, but not directly people, is a major industry.


[1] Nothing in this blog post is legal advice. I’m not a lawyer and I don’t give legal advice. I enjoy working with lawyers because the division of labor is clear: they do law and I do math.

Varsity versus junior varsity sports


Yesterday my wife and I watched our daughter’s junior varsity soccer game. Several statistical questions came to mind.

Larger schools tend to have better sports teams. If the talent distributions of a large school and a small school are the same, the larger school will have a better team because its players are the best from a larger population. If one school is twice as big as another, its team may consist of the top 5% while the other school’s team consists of its top 10%.

Does size benefit a school’s top (varsity) team or its second (junior varsity) team more? Would you expect more variability in varsity or junior varsity scores? Does your answer depend on whether you assume a thin-tailed (e.g. normal) or thick tailed (e.g. Cauchy) distribution on talent?

What if two schools have the same size, but one has a better athletic program, say due to better coaching? Suppose this shifts the center of the talent distribution. Does such a shift benefit varsity or junior varsity teams more?

Suppose both the varsity and junior varsity teams from two schools are playing each other, as was the case last night. If you know the outcome of the junior varsity game, how much should that influence your prediction of the outcome of the varsity game? Has anyone looked at this, either as an abstract model or by analyzing actual scores?
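
These are open questions, but they’re easy to poke at with simulation. Below is a rough sketch of one possible model, entirely my own construction: each student’s talent is an independent draw, the varsity team takes the top 15 players, and the junior varsity team takes the next 15. Swapping the normal draw for a Cauchy draw shows how the answer can depend on tail weight.

    import numpy as np

    rng = np.random.default_rng(1)

    def team_strengths(n_students, dist, team_size=15, reps=2000):
        "Average and spread of mean talent for the top and second teams."
        varsity, jv = [], []
        for _ in range(reps):
            talent = np.sort(dist(size=n_students))[::-1]   # best players first
            varsity.append(talent[:team_size].mean())
            jv.append(talent[team_size:2*team_size].mean())
        return (np.mean(varsity), np.std(varsity), np.mean(jv), np.std(jv))

    for n in (500, 1000):   # a small school and a school twice as large
        print(n, "normal:", np.round(team_strengths(n, rng.standard_normal), 3))
        print(n, "Cauchy:", np.round(team_strengths(n, rng.standard_cauchy), 3))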

Related post: Probability of winning the World Series


Unstructured data is an oxymoron


Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

Non-tabular data could mean anything other than tabular data, but in practice it often means text, or it could mean data with a graph structure or some other structure.

More productive discussions

My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.

Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

Free text fields

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most in a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]


[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.

May I have the last four digits of your social?


Imagine this conversation.

“Could you tell me your social security number?”

“Absolutely not! That’s private.”

“OK, how about just the last four digits?”

“Oh, OK. That’s fine.”

When I was in college, professors would post grades by the last four digits of student social security numbers. Now that seems incredibly naive, but no one objected at the time. Using these four digits rather than names would keep your grades private from the most lazy observer but not from anyone willing to put out a little effort.

There’s a widespread belief in the US that your social security number is a deep secret, and that telling someone your social security number gives them power over you akin to a fairy telling someone his true name. On the other hand, we also believe that telling someone just the last four digits of your SSN is harmless. Both are wrong. It’s not that hard to find someone’s full SSN, and revealing the last four digits gives someone a lot of information to use in identifying you.

In an earlier post I looked at how easily most people could be identified by the combination of birth date, sex, and zip code. We’ll use the analytical results from that post to look at how easily someone could be identified by their birthday, state, and the last four digits of their SSN [1]. Note that the previous post used birth date, i.e. including year, where here we only look at birth day, i.e. month and day but no year. Note also that there’s nothing special about social security numbers for our purposes. The last four digits of your phone number would provide just as much information.

If you know someone lives in Wyoming, and you know their birthday and the last four digits of their SSN, you can uniquely identify them 85% of the time, and in an additional 7% of cases you can narrow down the possibilities to just two people. In Texas, by contrast, the chances of a birthday and four-digit ID being unique are 0.03%. The chances of narrowing the possibilities to two people are larger but still only 0.1%.

Here are results for a few states. Note that even though Texas has between two and three times the population of Ohio, it’s over 100x harder to uniquely identify someone with the information discussed here.

|-----------+------------+--------+--------|
| State     | Population | Unique |  Pairs |
|-----------+------------+--------+--------|
| Texas     | 30,000,000 |  0.03% |  0.11% |
| Ohio      | 12,000,000 |  3.73% |  6.14% |
| Tennessee |  6,700,000 | 15.95% | 14.64% |
| Wyoming   |    600,000 | 84.84% |  6.97% |
|-----------+------------+--------+--------|
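
Here is a back-of-the-envelope sketch, my addition, of where numbers like the Unique column could come from. Assume birthdays and four-digit suffixes are uniform and independent, so there are about d = 365.25 × 10,000 possible combinations; under a Poisson approximation, the probability that no one else among n people shares your combination is roughly exp(−n/d).

    from math import exp

    d = 365.25 * 10_000   # birthday x last-four-digits combinations

    for state, n in [("Texas", 30_000_000), ("Ohio", 12_000_000),
                     ("Tennessee", 6_700_000), ("Wyoming", 600_000)]:
        print(f"{state:10s} P(unique) ~ {exp(-n/d):.2%}")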


[1] In that post we made the dubious simplifying assumption that birth dates were uniformly distributed from 0 to 78 years. This assumption is not accurate, but it was good enough to prove the point that it’s easier to identify people than you might think. Here our assumptions are better founded. Birthdays are nearly uniformly distributed, though there are some slight irregularities. The last four digits of social security numbers are uniformly distributed, though the first digits are correlated with the state.

Rényi Differential Privacy

Differential privacy, specifically ε-differential privacy, gives strong privacy guarantees, but it can be overly cautious by focusing on worst-case scenarios. The generalization (ε, δ)-differential privacy was introduced to make ε-differential privacy more flexible.

Rényi differential privacy (RDP) is a new generalization of ε-differential privacy by Ilya Mironov that is comparable to the (ε, δ) version but has several advantages. For instance, RDP is easier to interpret than (ε, δ)-DP and composes more simply.

Rényi divergence

My previous post discussed Rényi entropy. Rényi divergence is to Rényi entropy what Kullback-Leibler divergence is to Shannon entropy.

Given two random variables X and Y that take on n possible values each with positive probabilities pi and qi respectively, the Rényi divergence of X from Y is defined by

D_\alpha(X || Y) = \frac{1}{\alpha -1} \log\left( \sum_{i=1}^n \frac{p_i^\alpha}{q_i^{\alpha -1}}\right)

for α > 0 and not equal to 1. The definition extends to α = 0, 1, or ∞ by taking limits.

As α converges to 1, Dα converges to the Kullback-Leibler divergence.
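
Here is a direct translation of the definition into Python (my addition, not from the paper); p and q are arrays of positive probabilities over the same n values, and the α = 1 and α = ∞ cases are handled as limits.

    import numpy as np

    def renyi_divergence(p, q, alpha):
        "Rényi divergence D_alpha(p || q) for discrete distributions with positive probabilities."
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        if alpha == 1:
            return np.sum(p * np.log(p / q))     # Kullback-Leibler limit
        if alpha == np.inf:
            return np.log(np.max(p / q))         # max-divergence limit
        return np.log(np.sum(p**alpha / q**(alpha - 1))) / (alpha - 1)

    p = [0.1, 0.4, 0.5]
    q = [0.2, 0.3, 0.5]
    for a in (0.5, 1, 2, np.inf):
        print(a, renyi_divergence(p, q, a))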

Rényi differential privacy

A couple weeks ago I wrote an introduction to differential privacy that started by saying

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not.

The post develops precisely what it means for a query to be differentially private. The bottom line (literally!) was

An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D’,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

It turns out that this definition is equivalent to saying the Rényi divergence of order ∞ between A(D) and A(D′) is no more than ε. That is,

D_\alpha\big(A(D) || A(D')\big) \leq \varepsilon

where α = ∞. The idea of Rényi differential privacy (RDP) is to allow the possibility that α could be finite [1]. That is, an algorithm is (α, ε)-RDP if the Rényi divergence of order α between its outputs on any two adjacent databases is no more than ε.

One of the nice properties of RDP is how simply algorithms compose: The composition of an (α, ε1)-RDP algorithm and an (α, ε2)-RDP algorithm is an (α, ε1 + ε2)-RDP algorithm. This leads to simple privacy budget accounting.

Comparison to (ε, δ)-differential privacy

The composition of ε-DP algorithms is simple: just substitute α = ∞ in the result above for composing RDP algorithms. However, the resulting bound may be overly pessimistic.

The composition of (ε, δ)-DP algorithms is not as simple: both ε and δ change. With RDP, the order α does not change, and the ε values simply add.

Mironov proves in [1] that every RDP algorithm is also an (ε, δ)-DP algorithm. Specifically, Proposition 3 says

If f is an (α, ε)-RDP mechanism, it also satisfies

\left(\varepsilon - \frac{\log \delta}{\alpha - 1}, \delta\right)

differential privacy for any 0 < δ < 1.

This tells us that Rényi differential privacy gives us privacy guarantees that are somewhere between ε-differential privacy and (ε, δ)-differential privacy.
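
As an illustration of the accounting, here is a sketch (mine, based only on the two formulas quoted above): compose several RDP mechanisms at the same order α by adding their ε values, then convert the total to an (ε, δ)-DP guarantee via Proposition 3. The numbers are arbitrary.

    from math import log

    def compose_rdp(epsilons):
        "Compose (alpha, eps_i)-RDP mechanisms run at the same order alpha."
        return sum(epsilons)

    def rdp_to_dp(alpha, eps, delta):
        "Convert an (alpha, eps)-RDP guarantee to (eps', delta)-DP via Proposition 3."
        return eps + log(1/delta) / (alpha - 1)

    alpha = 10
    eps_total = compose_rdp([0.1, 0.1, 0.2])   # three mechanisms run on the same data
    print(rdp_to_dp(alpha, eps_total, delta=1e-6))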


[1] Ilya Mironov, Rényi Differential Privacy. arXiv 1702.07476

Gamma gamma gamma!

There are several things in math and statistics named gamma. Three examples are

  • the gamma function
  • the gamma constant
  • the gamma distribution

This post will show how these are related. We’ll also look at the incomplete gamma function which connects with all the above.

The gamma function

The gamma function is the most important function not usually found on a calculator. It’s the first “advanced” function you’re likely to learn about. You might see it in passing in a calculus class, in a homework problem on integration by parts, but usually there’s not much emphasis on it. Yet it comes up a lot in applications.

\Gamma(x) = \int_0^\infty t^{x-1 } e^{-t}\, dt

You can think of the gamma function as a way to extend factorial to non-integer values. For non-negative integers n, Γ(n + 1) = n!.

(Why is the argument n + 1 to Γ rather than n? There are a number of reasons, historical and practical. Short answer: some formulas turn out simpler if we define Γ that way.)
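
For instance, a quick check with SciPy (my addition):

    from math import factorial
    from scipy.special import gamma

    for n in range(6):
        print(n, gamma(n + 1), factorial(n))   # gamma(n+1) equals n!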

The gamma constant

The gamma constant, a.k.a. Euler’s constant or the Euler-Mascheroni constant, is defined as the asymptotic difference between harmonic numbers and logarithms. That is,

\gamma = \lim_{n\to\infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}- \log n\right)

The constant γ comes up fairly often in applications. But what does it have to do with the gamma function? There’s more than one reason the constant and the function are denoted by the same Greek letter. One is that the gamma constant appears in the product formula for the gamma function.

\frac{1}{\Gamma(x)} = x e^{\gamma x} \prod_{r=1}^\infty \left(1 + \frac{x}{r} \right) e^{-x/r}

If we take the logarithm of this formula and differentiate, we find that

\frac{\Gamma'(1)}{\Gamma(1)} = -\gamma
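
We can check this numerically: the logarithmic derivative Γ′/Γ is the digamma function, and evaluating it at 1 should give −γ. (This check is my addition.)

    import numpy as np
    from scipy.special import digamma

    print(digamma(1))        # Gamma'(1)/Gamma(1)
    print(-np.euler_gamma)   # minus the Euler-Mascheroni constant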

The gamma distribution

If you take the integrand defining the gamma function and turn it into a probability distribution by normalizing it to integrate to 1, you get the gamma distribution. That is, a gamma random variable with shape k has probability density function (PDF) given by

f(x) = \frac{1}{\Gamma(k)} x^{k-1} e^{-x}

More generally you could add a scale parameter to the gamma distribution in the usual way. You could imagine the scale parameter present here but set to 1 to keep things simple.
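
To make the connection concrete, here is a quick comparison (my addition) of the normalized integrand with SciPy’s gamma density; in SciPy the shape k is the parameter a, and the scale defaults to 1.

    import numpy as np
    from scipy.special import gamma as gamma_fn
    from scipy.stats import gamma as gamma_dist

    k = 2.5
    x = np.linspace(0.1, 10, 5)
    pdf_by_hand = x**(k - 1) * np.exp(-x) / gamma_fn(k)
    print(pdf_by_hand)
    print(gamma_dist.pdf(x, a=k))   # should agree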

The incomplete gamma function

The incomplete gamma function relates to everything above. It’s like the (complete) gamma function, except the range of integration is finite. So it’s now a function of two variables, the extra variable being the limit of integration.

\gamma(s, x)= \int_0^x t^{s-1} e^{-t} \, dt

(Note that now x appears in the limit of integration, not the exponent of t. This notation is inconsistent with the definition of the (complete) gamma function but it’s conventional.)

It uses a lower case gamma for its notation, like the gamma constant, and is a generalization of the gamma function. It’s also essentially the cumulative distribution function of the gamma distribution. That is, the CDF of a gamma random variable with shape s is γ(s, x) / Γ(s).

The function γ(s, x) / Γ(s) is called the regularized incomplete gamma function. Sometimes the distinction between the regularized and unregularized versions is not explicit. For example, in Python, the function gammainc does not compute the incomplete gamma function per se but the regularized incomplete gamma function. This makes sense because the latter is often more convenient to work with numerically.
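
For example, we can check that SciPy’s regularized incomplete gamma function agrees with the gamma distribution’s CDF (a quick check I’ve added):

    from scipy.special import gammainc
    from scipy.stats import gamma

    s, x = 3.0, 2.0
    print(gammainc(s, x))      # regularized incomplete gamma: gamma(s, x) / Gamma(s)
    print(gamma.cdf(x, a=s))   # CDF of a gamma random variable with shape s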
