Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling.

Given just a file name, shuf randomly permutes the lines of the file.

With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. For example,

    shuf -n 10 foo.txt

would select 10 lines from foo.txt.

Actually, it would select at most 10 lines. You can’t select 10 lines without replacement from a file with fewer than 10 lines. If you ask for an impossible number of lines, the -n option is ignored.

You can also sample with replacement using the -r option. In that case you can select more lines than are in the file since lines may be reused. For example, you could run

    shuf -r -n 10 foo.txt

to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has. For example, when I ran the command above on a file containing

    alpha
    beta
    gamma

I got the output

    beta
    gamma
    gamma
    beta
    alpha
    alpha
    gamma
    gamma
    beta

I don’t know how shuf seeds its random generator. Maybe from the system time. But if you run it twice you will get different results. Probably.
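
If you want to do the same kind of sampling in code rather than at the command line, here’s a rough Python analogue (this is not how shuf works internally, just the same sampling semantics):

    import random

    with open("foo.txt") as f:
        lines = f.read().splitlines()

    print(random.sample(lines, min(10, len(lines))))  # without replacement, like shuf -n 10
    print(random.choices(lines, k=10))                # with replacement, like shuf -r -n 10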

Related

Comparing Truncation to Differential Privacy

Traditional methods of data de-identification obscure data values. For example, you might truncate a date to just the year.

Differential privacy obscures query values by injecting enough noise to keep from revealing information on an individual.

Let’s compare two approaches for de-identifying a person’s age: truncation and differential privacy.

Truncation

First consider truncating birth date to year. For example, anyone born between January 1, 1955 and December 31, 1955 would be recorded as being born in 1955. This effectively produces a 100% confidence interval that is one year wide.

Next we’ll compare this to a 95% confidence interval using ε-differential privacy.

Differential privacy

Differential privacy adds noise in proportion to the sensitivity Δ of a query. Here sensitivity means the maximum impact that one record could have on the result. For example, a query that counts records has sensitivity 1.

Suppose people live to a maximum of 120 years. Then in a database with n records [1], one person’s presence in or absence from the database would change the average age by no more than 120/n years, the worst case corresponding to the extremely unlikely event of a database of n-1 newborns and one person 120 years old.

Laplace mechanism and CIs

The Laplace mechanism implements ε-differential privacy by adding noise with a Laplace(Δ/ε) distribution, which in our example means Laplace(120/nε).

A 95% confidence interval for a Laplace distribution with scale b centered at 0 is

[b log 0.05, –b log 0.05]

which is very nearly

[-3b, 3b].

In our case b = 120/nε, and so a 95% confidence interval for the noise we add would be [-360/nε, 360/nε].

When n = 1000 and ε = 1, this means we’re adding noise that’s usually between -0.36 and 0.36, i.e. we know the average age to within about 4 months. But if n = 1, our confidence interval is the true age ± 360. Since this is wider than the a priori bounds of [0, 120], we’d truncate our answer to be between 0 and 120. So we could query for the age of an individual, but we’d learn nothing.
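
Here’s a quick sketch of that calculation in Python, using the interval half-width -b log 0.05 derived above (the function name is just for illustration):

    from math import log

    def laplace_ci95_halfwidth(n, epsilon, sensitivity=120):
        "Half-width of a 95% interval for Laplace(sensitivity/(n ε)) noise."
        b = sensitivity / (n * epsilon)
        return -b * log(0.05)   # roughly 3b

    print(laplace_ci95_halfwidth(1000, 1))   # about 0.36 years
    print(laplace_ci95_halfwidth(1, 1))      # about 360 years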

Comparison with truncation

The width of our confidence interval is 720/ε, and so to get a confidence interval one year wide, as we get with truncation, we would set ε = 720. Ordinarily ε is much smaller than 720 in application, say between 1 and 10, which means differential privacy reveals far less information than truncation does.

Even if you truncate age to decade rather than year, this still reveals more information than differential privacy provided ε < 72.

Related posts

[1] Ordinarily even the number of records in the database is kept private, but we’ll assume here that for some reason we know the number of rows a priori.

Random projection

Last night after dinner, the conversation turned to high-dimensional geometry. (I realize how odd that sentence sounds; I was with some unusual company.)

Someone brought up the fact that two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal. This is a surprising but well known fact. Next the conversation turned to something also surprising but not so well known.

Suppose you’re in a 20,000 dimensional space and you choose 10,000 vectors at random. Now you choose one more, and ask what angle the new vector makes with the space spanned by the 10,000 previous vectors. The new vector is very likely to have a very small component in the direction of each of the previous vectors individually, but what about their span?

There are several ways to approach this problem. You can find the exact distribution on the angle analytically, but I thought it might be more interesting to demonstrate this by actually drawing random vectors. I started to write a simulation and discovered that the Python installation on my laptop is corrupted, and so I decided to take a third approach. I’ll give an informal explanation of what’s going on. Those who wish could either fill in the details to write a proof [1] or code it up to write a simulation.

First of all, what does it mean to pick a vector randomly from a vector space? We say we want to choose vectors uniformly, but that isn’t actually possible. What we actually mean is we want to pick vectors so that their directions are uniform. That’s possible, and we only need to work with unit vectors anyway.

The way to generate uniform random samples of unit vectors in n dimensions is to first choose a standard normal random value for each coordinate, then normalize so the vector has length 1.

Suppose we do as we discussed above. We work in 20,000 dimensions and choose 10,000 random vectors. Call their span S. Then we choose another random vector v. What angle does v make with S? You can’t simply project v onto each of the basis vectors for S because even though they are nearly orthogonal, they’re not exactly orthogonal.

It turns out that it doesn’t matter what the vectors are that defined S. All that matters is that S is almost certainly a 10,000 dimensional subspace. By symmetry it doesn’t matter which subspace it is, so we’ll assume for convenience that it’s the subspace spanned by the first 10,000 unit vectors. The angle that v makes with S is the angle between v and its projection w onto S, and w is simply the vector you get by zeroing out the last 10,000 components of v.

The cosine of the angle between two unit vectors is their dot product. Our vector v is a unit vector, but its projection w is not. So the cosine of the angle is

cos θ = v·w/||w||

Now v·w is the sum of the squares of the first half of v’s components. This is probably very close to 1/2 since v is a unit vector. This is also w·w, and so ||w|| is probably very close to the square root of 1/2. And so cos θ is very nearly √2/2, which means θ = 45°.

So in summary, if you generate 10,000 vectors in 20,000 dimensions, then find the angle of a new random vector with the span of the previous vectors, it’s very likely to be very near 45°.
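
Here’s a small simulation sketch in Python. It generates random unit vectors by normalizing normal samples as described above, but works in 2,000 dimensions with 1,000 vectors so the linear algebra stays manageable; the phenomenon is the same:

    import numpy as np

    n, m = 2000, 1000
    rng = np.random.default_rng()

    A = rng.standard_normal((n, m))   # m random vectors as columns
    Q, _ = np.linalg.qr(A)            # orthonormal basis for their span S

    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)            # random unit vector

    w = Q @ (Q.T @ v)                 # projection of v onto S
    cos_theta = v @ w / np.linalg.norm(w)
    print(np.degrees(np.arccos(cos_theta)))   # very close to 45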

Related posts

[1] If you’re in an n-dimensional space and S has dimension m < n, then cos² θ has the distribution of X/(X + Y) where X ~ χ²(m) and Y ~ χ²(n-m), which is a beta(m/2, (n-m)/2) distribution. If m = n/2 and m is large, this is a beta distribution concentrated tightly around 1/2, and so cos θ is concentrated tightly around √2/2. More generally, if m and n are large, cos² θ is concentrated around m/n.

A truly horrible random number generator

I needed a bad random number generator for an illustration, and chose RANDU, possibly the worst random number generator that was ever widely deployed. Donald Knuth comments on RANDU in the second volume of his magnum opus.

When this chapter was first written in the late 1960’s, a truly horrible random number generator called RANDU was commonly used on most of the world’s computers.

This generator starts with an odd seed x₀ and generates subsequent values via

xₙ₊₁ = 65539 xₙ mod 2³¹

I needed to generate 8-bit numbers, and I took the lowest eight bits of each value from RANDU. (To be fair, the highest 8 bits would have been better, but my goal was to find a bad RNG, so I used the lowest.)

To demonstrate the generator’s (lack of) quality I made a histogram of sorts with a 16 by 16 grid, one cell for each possible 8-bit number. I generated a large number of samples and plotted the grid. Here’s what I got:

RANDU histogram

Here’s what I get with a better random number generator:

randint histogram

I was looking for something bad, but I didn’t realize RANDU was that bad. The white stripes represent the generated values and the black stripes values that are never generated. So out of 256 possible 8-bit numbers, this generator only ever outputs 64 of them. (I used 33 as my seed. I might have gotten different vertical stripes if I had chosen a different seed, but I’d still get stripes.) Also, the frequencies of the values it does take on have suspiciously little variation.

You can see the pattern of values RANDU takes on by printing out the first few generated values in hex:

    0x63
    0x29
    0x7b
    0x71
    0x53
    0xf9
    0xeb
    0xc1

The last hex digit cycles 3, 9, b, 1, 3, 9, b, 1, … producing only four possible values.
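
Here’s a short Python sketch of the recurrence; with seed 33 and keeping only the low byte it reproduces the listing above:

    def randu(seed):
        "Generate RANDU values x_{n+1} = 65539 x_n mod 2^31."
        x = seed
        while True:
            x = (65539 * x) % 2**31
            yield x

    gen = randu(33)
    for _ in range(8):
        print(hex(next(gen) & 0xFF))   # keep only the lowest 8 bits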

We could prove the statements empirically demonstrated above by noting that

xₙ = 65539ⁿ x₀ mod 2ᵏ

for k = 31, but also for smaller k. We could set k = 8 and prove that there are only 64 values, or set k = 4 and prove that there are only four final hex digits.
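
Or we could simply check the claim by brute force. Here’s a quick sketch that counts the distinct low bytes RANDU produces from the seed 33 used above:

    x, seen = 33, set()
    for _ in range(10000):
        x = (65539 * x) % 2**31
        seen.add(x & 0xFF)
    print(len(seen))   # prints 64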

Related posts

Regression, modular arithmetic, and PQC

Linear regression

Suppose you have a linear regression with a couple predictors and no intercept term:

β₁x₁ + β₂x₂ = y + ε

where the x’s are inputs, the βs are fixed but unknown, y is the output, and ε is random error.

Given n observations (x₁, x₂, y + ε), linear regression estimates the parameters β₁ and β₂.

I haven’t said, but I implicitly assumed all the above numbers are real. Of course they’re real. It would be strange if they weren’t!

Learning with errors

Well, we’re about to do something strange. We’re going to pick a prime number p and do our calculations modulo p except for the addition of the error ε. Our inputs (x₁, x₂) are going to be pairs of integers. Someone is going to compute

r = β₁x₁ + β₂x₂ mod p

where β₁ and β₂ are secret integers. Then they’re going to tell us

r/p + ε

where ε is a random variable on the interval [0, 1]. We give them n pairs (x₁, x₂) and they give back n values of r/p with noise added. Our job is to infer the βs.

This problem is called learning with errors or LWE. It’s like linear regression, but much harder when the problem size is bigger. Instead of just two inputs, we could have m inputs with m secret coefficients, where m is large. Depending on the number of variables m, the number of equations n, the modulus p, and the probability distribution on ε, the problem may be possible to solve but computationally very difficult.

Why is it so difficult? Working mod p is discontinuous. A little bit of error might completely change our estimation of the solution. If n is large enough, we could recover the coefficients anyway, using something like least squares. But how would we carry that out? If m and p are small we can just try all pᵐ possibilities, but that’s not going to be practical if m and p are large.
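
Here’s a toy sketch of the setup above in Python. The parameters are tiny and chosen purely for illustration, and the noise is kept small so the structure is visible; the actual distribution of ε is part of the problem specification.

    import random

    p = 97             # small prime modulus
    beta = [41, 7]     # secret coefficients, unknown to the solver

    def lwe_sample():
        "Return inputs (x1, x2) and the noisy scaled result r/p + ε."
        x = [random.randrange(p) for _ in beta]
        r = sum(b * xi for b, xi in zip(beta, x)) % p
        eps = random.uniform(0, 0.05)   # small noise, for illustration only
        return x, r / p + eps

    samples = [lwe_sample() for _ in range(10)]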

In linear regression, we assume there is some (approximately) linear process out in the real world that we’re allowed to observe with limited accuracy. Nobody’s playing a game with us; that’s just how data come to us. But with LWE, we are playing a game that someone has designed to be hard. Why? For cryptography. In particular, quantum-resistant cryptography.

Post Quantum Cryptography

Variations on LWE are the basis for several proposed encryption algorithms that are believed to be secure even if an adversary has access to a quantum computer.

The public key encryption systems in common use today would all be breakable if quantum computing becomes practical. They depend on mathematical problems like factoring and discrete logarithms being computationally difficult, which they appear to be with traditional computing resources. But we know these problems could be solved in polynomial time on a quantum computer using Shor’s algorithm. LWE, on the other hand, is a hard problem even on a quantum computer. Or so we suspect.

The US government’s National Institute of Standards and Technology (NIST) is holding a competition to identify quantum-resistant encryption algorithms. Last month they announced 26 algorithms that made it to the second round. Many of these algorithms depend on LWE or variations.

One variation is LWR (learning with rounding) which uses rounding rather than adding random noise. There are also ring-based counterparts RLWE and RLWR which add random errors and use rounding respectively. And there are polynomial variations such as poly-LWE which uses a polynomial-based learning with errors problem. The general category for these methods is lattice methods.

Lattice methods

Of the public-key algorithms that made it to the second round of the NIST competition, 9 out of 17 use lattice-based cryptography:

  • CRYSTALS-KYBER
  • FrodoKEM
  • LAC
  • NewHope
  • NTRU
  • NTRU Prime
  • Round5
  • SABER
  • Three Bears

Also, two of the nine digital signature algorithms are based on lattice problems:

  • CRYSTALS-DILITHIUM
  • FALCON

Based purely on the names, and not on the merits of the algorithms, I hope the winner is one of the methods with a science fiction allusion in the name.

Related posts

Entropy extractor used in μRNG

Yesterday I mentioned μRNG, a true random number generator (TRNG) that takes physical sources of randomness as input. These sources are independent but non-uniform. This post will present the entropy extractor μRNG uses to take non-uniform bits as input and produce uniform bits as output.

We will present Python code for playing with the entropy extractor. (μRNG is extremely efficient, but the Python code here is not; it’s just for illustration.) The code will show how to use the pyfinite library to do arithmetic over a finite field.

Entropy extractor

The μRNG generator starts with three bit streams—X, Y, and Z—each with at least 1/3 bit min-entropy per bit.

Min-entropy is Rényi entropy with α = ∞. For a Bernoulli random variable that takes on two values, one with probability p and the other with probability 1-p, the min-entropy is

-log₂ max(p, 1-p).

So requiring min-entropy of at least 1/3 means the larger of the two probabilities is at most 2^(-1/3) = 0.7937.

Take eight bits (one byte) at a time from X, Y, and Z, and interpret each byte as an element of the finite field with 2⁸ elements. Then compute

X*Y + Z

in this field. The resulting stream of bits will be independent and uniformly distributed, or very nearly so.

Purified noise

Just a quick aside. Normally you want to remove noise from data to reveal a signal. Said another way, you want to split the data into signal and noise so you can throw out the noise. Here the goal is the opposite: we want to remove any unwanted signal in order to create pure noise!

Python implementation

We will need the bernoulli class for generating our input bit streams, and the pyfinite library for doing finite field arithmetic on the bits.

    from scipy.stats import bernoulli
    from pyfinite import ffield

And we will need a couple bit manipulation functions.

    def bits_to_num(a):
        "Convert an array of bits to an integer."
    
        x = 0
        for i in range(len(a)):
            x += a[i]*2**i
        return x

    def bitCount(n):
        "Count how many bits are set to 1."
        count = 0
        while(n):
            n &= n - 1
            count += 1
        return count 

The following function generates random bytes using the entropy extractor. The input bit streams have p = 0.79, corresponding to min-entropy 0.34.

    def generate_byte():
        "Generate bytes using the entropy extractor."
    
        b = bernoulli(0.79)
    
        x = bits_to_num(b.rvs(8))
        y = bits_to_num(b.rvs(8))
        z = bits_to_num(b.rvs(8)) 

        F = ffield.FField(8)
        return F.Add(F.Multiply(x, y), z)

Note that 79% of the bits produced by the Bernoulli generator will be 1’s. But we can see that the output bytes are about half 1’s and half 0’s.

    s = 0
    N = 1000
    for _ in range(N):
        s += bitCount( generate_byte() )
    print( s/(8*N) )

This returned 0.50375 the first time I ran it and 0.49925 the second time.

For more details see the μRNG paper.

Related posts

 

Exploring the sum-product conjecture

Quanta Magazine posted an article yesterday about the sum-product problem of Paul Erdős and Endre Szemerédi. This problem starts with a finite set of real numbers A then considers the size of the sets A+A and A*A. That is, if we add every element of A to every other element of A, how many distinct sums are there? If we take products instead, how many distinct products are there?

Proven results

Erdős and Szemerédi proved that there are constants c and ε > 0 such that

max{|A+A|, |A*A|} ≥ c|A|^(1+ε)

In other words, either A+A or A*A is substantially bigger than A. Erdős and Szemerédi only proved that some positive ε exists, but they suspected ε could be chosen close to 1, i.e. that either |A+A| or |A*A| is bounded below by a fixed multiple of |A|² or nearly so. George Shakan later showed that one can take ε to be any value less than

1/3 + 5/5277 = 0.3342808…

but the conjecture remains that the upper limit on ε is 1.

Python code

The following Python code will let you explore the sum-product conjecture empirically. It randomly selects sets of size N from the non-negative integers less than R, then computes the sum and product sets using set comprehensions.

    from numpy.random import choice

    def trial(R, N):
        # R = integer range, N = sample size
        x = choice(R, N, replace=False)
        s = {a+b for a in x for b in x}
        p = {a*b for a in x for b in x}
        return (len(s), len(p))

When I first tried this code I thought it had a bug. I called trial 10 times and got the same values for |A+A| and |A*A| every time. That was because I chose R large relative to N. In that case, it is likely that every sum and every product will be unique, aside from the redundancy from commutativity. That is, if R >> N, it is likely that |A+A| and |A*A| will both equal N(N+1)/2. Things get more interesting when N is closer to R.
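
For example, a run with R much larger than N almost always gives N(N+1)/2 for both counts:

    print(trial(10**6, 10))   # usually prints (55, 55), i.e. N(N+1)/2 for N = 10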

Probability vs combinatorics

The Erdős-Szemerédi problem is a problem in combinatorics, looking for deterministic lower bounds. But it seems natural to consider a probabilistic extension. Instead of asking about lower bounds on |A+A| and |A*A| you could ask for the distribution on |A+A| and |A*A| when the sets A are drawn from some probability distribution.

If the set A is drawn from a continuous distribution, then |A+A| and |A*A| both equal N(N+1)/2 with probability 1. Only careful choices, ones that would happen randomly with probability zero, could prevent the sums and products from being unique, modulo commutativity, as in the case R >> N above.

If the set A is an arithmetic sequence then |A+A| is small and |A*A| is large, and the opposite holds if A is a geometric sequence. So it might be interesting to look at the correlation of |A+A| and |A*A| when A comes from a discrete distribution, such as choosing N integers uniformly from [1, R] when N/R is not too small.

Normal approximation to Laplace distribution?

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace!

Normal and Laplace distributions

A normal distribution has the familiar bell curve shape. A Laplace distribution, also known as a double exponential distribution, is pointed in the middle, like a pole holding up a circus tent.

Normal and Laplace probability density functions

A normal distribution has very thin tails, i.e. probability density drops very rapidly as you move further from the middle, like exp(-x²). The Laplace distribution has moderate tails [1], going to zero like exp(-|x|).

Normal and Laplace distributions are thus qualitatively very different, both in the center and in the tails. So why would you want to replace one by the other?

Statistics meets differential privacy

The normal distribution is convenient to use in mathematical statistics. Whether it is realistic in application depends on context, but it’s convenient and conventional. The Laplace distribution is convenient and conventional in differential privacy. There’s no need to ask whether it is realistic because Laplace noise is added deliberately; the distribution assumption is exactly correct by construction. (See this post for details.)

When mathematical statistics and differential privacy combine, it could be convenient to “approximate” a Laplace distribution by a normal distribution [2].

Solving for parameters

So if you wanted to replace a Laplace distribution with a normal distribution, which one would you choose? Both distributions are symmetric about their means, so it’s natural to pick the means to be the same. So without loss of generality, we’ll assume both distributions have mean 0. The question then becomes how to choose the scale parameters.

You could just set the two scale parameters to be the same, but that’s similar to the Greek letter fallacy, assuming two parameters have the same meaning just because they have the same symbol. Because the two distributions have different tail weights, their scale parameters serve different functions.

One way to replace a Laplace distribution with a normal would be to pick the scale parameter of the normal so that the two distributions match at a pair of quantiles. For example, you might want both distributions to have 95% of their probability mass in the same interval.

I’ve written before about how to solve for scale parameters given two quantiles. We find two quantiles of the Laplace distribution, then use the method in that post to find the corresponding normal distribution scale (standard deviation).

The Laplace distribution with scale s has density

f(x) = exp(-|x|/s) / (2s).

If we want to solve for the quantile x such that Prob(X ≤ x) = p, we have

x = –s log(2 – 2p).

Using the formula derived in the previously mentioned post,

σ = 2x / (Φ⁻¹(p) – Φ⁻¹(1 – p)) = x / Φ⁻¹(p)

where Φ is the cumulative distribution function of the standard normal.
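
Here’s a short sketch of the whole calculation in Python, matching the 2.5% and 97.5% quantiles (p = 0.975) of a Laplace distribution with scale s = 1, with scipy used only as a sanity check:

    from math import log
    from scipy.stats import laplace, norm

    s = 1.0      # Laplace scale
    p = 0.975    # match the 2.5% and 97.5% quantiles, i.e. a 95% interval

    x = -s * log(2 - 2*p)       # Laplace quantile from the formula above
    sigma = x / norm.ppf(p)     # matching normal standard deviation

    print(x, laplace.ppf(p, scale=s))   # both about 3.0
    print(sigma)                        # about 1.53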

Related posts

[1] The normal distribution is the canonical example of a thin-tailed distribution, while exponential tails are conventionally the boundary between thick and thin. “Thick tailed” and “thin tailed” are often taken to mean thicker than exponential and thinner than exponential respectively.

[2] You could use a Gaussian mechanism rather than a Laplace mechanism for similar reasons, but this makes the differential privacy theory more complicated. Rather than working with ε-differential privacy you have to work with (ε, δ)-differential privacy. The latter is messier and harder to interpret.

Probabilistic Identifiers in CCPA

The CCPA, the California Consumer Privacy Act, was passed last year and goes into effect at the beginning of next year. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California.

The law specifically mentions probabilistic identifiers.

“Probabilistic identifier” means the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.

So anything that gives you better than a 50% chance of identifying a person or a device [1].

Presumably a mobile phone would be a device. Would a web browser be considered a “device”? If so, that would seem to outlaw browser fingerprinting. That could be a big deal. Identifying devices, but not directly people, is a major industry.

Personal information

What are these enumerated categories of personal information mentioned above? They start out specific:

Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier Internet Protocol address, email address, …

but then they get more vague:

purchasing or consuming histories or tendencies … interaction with an Internet Web site … professional or employment-related information.

And in addition to the vague categories are “any categories … similar to” these.

Significance

What is the significance of a probabilistic identifier? That’s hard to say. A large part of the CCPA is devoted to definitions, and some of these definitions don’t seem to be used. Maybe this is a consequence of the bill being rushed to a vote in order to avoid a ballot initiative. Maybe the definitions were included in case they’re needed in a future amended version of the law.

The CCPA seems to give probabilistic identifiers the same status as deterministic identifiers:

“Unique identifier” or “Unique personal identifier” means … or probabilistic identifiers that can be used to identify a particular consumer or device.

That seems odd. Data that can give you a “more probable than not” guess at someone’s “purchasing or consuming histories” hardly seems like a unique identifier.

Related posts

[1] Nothing in this blog post is legal advice. I’m not a lawyer and I don’t give legal advice. I enjoy working with lawyers because the division of labor is clear: they do law and I do math.

Varsity versus junior varsity sports


Yesterday my wife and I watched our daughter’s junior varsity soccer game. Several statistical questions came to mind.

Larger schools tend to have better sports teams. If the talent distributions of a large school and a small school are the same, the larger school will have a better team because its players are the best from a larger population. If one school is twice as big as another, its team may consist of the top 5% while the other school’s team consists of its top 10%.

Does size benefit a school’s top (varsity) team or its second (junior varsity) team more? Would you expect more variability in varsity or junior varsity scores? Does your answer depend on whether you assume a thin-tailed (e.g. normal) or thick tailed (e.g. Cauchy) distribution on talent?

What if two schools have the same size, but one has a better athletic program, say due to better coaching? Suppose this shifts the center of the talent distribution. Does such a shift benefit varsity or junior varsity teams more?

Suppose both the varsity and junior varsity teams from two schools are playing each other, as was the case last night. If you know the outcome of the junior varsity game, how much should that influence your prediction of the outcome of the varsity game? Has anyone looked at this, either as an abstract model or by analyzing actual scores?
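
Here’s a minimal simulation sketch of the first question, assuming (purely for illustration) that talent is normally distributed and that each team takes the top players from its school:

    import numpy as np

    rng = np.random.default_rng()

    def team_means(school_size, team_size=20):
        "Mean talent of the varsity (top) and junior varsity (next) players."
        talent = np.sort(rng.standard_normal(school_size))[::-1]
        return talent[:team_size].mean(), talent[team_size:2*team_size].mean()

    big   = np.mean([team_means(2000) for _ in range(1000)], axis=0)
    small = np.mean([team_means(1000) for _ in range(1000)], axis=0)
    print(big, small)   # (varsity, JV) average talent for each school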

Related post: Probability of winning the World Series