Automatic data reweighting

Suppose you are designing an autonomous system that will gather data and adapt its behavior to that data.

At first you face the so-called cold-start problem. You don’t have any data when you first turn the system on, and yet the system needs to do something before it has accumulated data. So you prime the pump by having the system act at first on what you believe to be true prior to seeing any data.

Now you face a problem. You initially let the system operate on assumptions rather than data out of necessity, but you’d like to go by data rather than assumptions once you have enough data. Not just some data, but enough data. Once you have a single data point, you have some data, but you can hardly expect a system to act reasonably based on one datum.

Rather than abruptly moving from prior assumptions to empirical data, you’d like the system to gradually transition from reliance on the former to reliance on the latter, weaning the system off initial assumptions as it becomes more reliant on new data.

The delicate part is how to manage this transition. How often should you adjust the relative weight of prior assumptions and empirical data? And how should you determine what weights to use? Should you set the weight given to the prior assumptions to zero at some point, or should you let the weight asymptotically approach zero?

Fortunately, there is a general theory of how to design such systems. It’s called Bayesian statistics. The design issues raised above are all handled automatically by Bayes theorem. You start with a prior distribution summarizing what you believe to be true before collecting new data. Then as new data arrive, the magic of Bayes theorem adjusts this distribution, putting more emphasis on the empirical data and less on the prior. The effect of the prior gradually fades away as more data become available.

Related posts

TestU01 small crush test suite

crushed cars

In recent posts I’ve written about using RNG test suites on the output of the μRNG entropy extractor. This is probably the last post in the series. I’ve looked at NIST STS, PractRand, and DIEHARDER before. In this post I’ll be looking at TestU01.

TestU01 includes three batteries of tests: Small Crush, Crush, and Big Crush. The entropy extractor failed the smallest of the three, so I didn’t go on to the larger suites. Small Crush isn’t small; it used over 22 billion 32-bit samples as input, about 0.84 GB of data. Crush uses two orders of magnitude more data, and Big Crush uses another order of magnitude more data than Crush.

SmallCrush consists of 10 tests:

  • smarsa_BirthdaySpacings
  • sknuth_Collision
  • sknuth_Gap
  • sknuth_SimpPoker
  • sknuth_CouponCollector
  • sknuth_MaxOft
  • svaria_WeightDistrib
  • smarsa_MatrixRank
  • sstring_HammingIndep
  • swalk_RandomWalk1

The test names begin with s, followed by a prefix indicating the origin of the test. For example, knuth refers to Donald Knuth’s tests in volume 2 of TAOCP and marsa refers to George Marsaglia. The remainder of the name is more descriptive, such as SimpPoker for Knuth’s simple poker test.

The output of the entropy extractor failed four of the tests, failure being defined as producing a p-value less than 10-300. The other tests passed without issue, meaning they returned p-values in the range [0.001, 0.999].

Recall from earlier posts that μRNG entropy extractor takes three possibly biased bit streams and produces an unbiased bit stream, provided each of the input streams has min-entropy of at least 1/3. I produced biased streams by taking the bitwise OR of two consecutive values, producing a stream with probability 0.75 of being a 1 and probability 0.25 of being a 0. The result passed all STS and DIEHARDER tests, but failed some PractRand and Test01 SmallCrush tests. This is consistent with the generally held opinion that STS and DIEHARDER are relatively weak tests and PractRand and TestU01 are more rigorous tests.

I applied the entropy extractor to PCG without creating a biased stream, and the result passed PractRand and TestIU01 SmallCrush. Presumably it would have passed STS and DIEHARDER as well. This confirms that the extractor does no harm to a high-quality stream of pseudorandom bits. It largely removes the bias from biased streams, enough to pass the easier two test suites but not enough to pass the two more demanding test suites.

Previous posts in this series

DIEHARDER test suite

The first well-known suite of tests for random number generators was George Marsalia’s DIEHARD battery. The name was a pun on DieHard car batteries. Robert G. Brown took over the DIEHARD test suite and called it DIEHARDER, presumably a pun on the Bruce Willis movie.

I’ve written lately about an entropy extractor that creates a stream of unbiased bits from streams of biased bits. The output passes the NIST STS tests but fails the PractRand tests decisively. I was curious to see how it did on DIEHARDER, whether the results would be more like STS or like PractRand.

Turns out they’re more like STS. All tests pass, with the only minor anomaly being that one test out of 114 tests had a “weak” pass.

It’s remarkable that random number generators are even possible. They’re deterministic programs that produce output that for many practical purposes behaves as if it’s random. The entropy extractor convincingly imitates random output as far as the DIEHARDER tests are concerned, and as far as the STS tests are concerned, and even 2/3 of the PractRand tests. The remaining 1/3 of the PractRand tests, however, are not impressed.

It would be too simplistic to say that an RNG is “bad” because it fails a particular test. It would be even worse to assume and RNG is “good” because it passes a test or suite of tests. Good and bad depend on suitability for a given tasks. But all other things being equal, you can have more confidence in a generator that passes more tests.

Using PractRand to test an RNG

Yesterday I wrote about my experience using NIST STS to test an entropy extractor, a filtering procedure that produces unbiased bits from biased sources. This post will look at testing the same entropy extractor using the Practically Random (PractRand) test suite. The results were much worse this time, which speaks to the limitations of both the entropy extractor and the STS test suite.

Difficulties with PractRand

I built PractRand on Linux with no difficulty. There were multiple warnings but no errors. But when I piped a file of bits into RNG_test I got a message saying “error reading standard input”.

A couple years ago I ran PractRand using my Mac with no problems (results here) so I switched over to the machine I’d used before. The latest version of PractRand (pre95) produces compile errors, but the version I’d used before (93) compiled successfully.

Piping bits to RNG_test from the command line worked when I used the program RNG_output included with PractRand, but piping my own file of bits to the test suite failed with the same error I saw on Linux. I applied a patch posted by Daniel Lemire and was able to pipe my file to the test program. I went back to Linux to see if I could get version 93 to work there, but no luck.

Testing bits from a file is unusual; it’s faster to generate random numbers and send them directly to the test without going to disk and back, and so naturally that’s what people will do more often. But I wanted to use bits from a file because I need to do that on another project.

Test results

The entropy extractor failed 47 out of 151 tests using 128 MB of data. Five other tests reported anomalies ranging from “mildly suspicious” to “very suspicious.” The tests that failed had p-values of 10-16 and smaller. Most of the p-values were less than can be represented by a 64-bit floating point number. PractRand must use extended precision to report just how infinitesimally small some of the p-values are.

To make sure the test failures weren’t due to some anomaly with piping the file into RNG_test I repeated my procedure using output from the Melissa O’Neill’s PCG generator. All tests passed with no anomalies.

First do no harm

What if you run the entropy extractor on a high-quality source of random bits? When I reran the experiment on three streams from PCG, the result passed all PractRand tests without any anomalies. This suggests that applying the μRNG entropy extractor improves random streams, though perhaps not as much as we’d like, and at a minimum it makes streams no worse.

More on random number generation

Sufficient statistic paradox

A sufficient statistic summarizes a set of data. If the data come from a known distribution with an unknown parameter, then the sufficient statistic carries as much information as the full data [0]. For example, given a set of n coin flips, the number of heads and the number of flips are sufficient statistics. More on sufficient statistics here.

There is a theorem by Koopman, Pitman, and Darmois that essentially says that useful sufficient statistics exist only if the data come from a class of probability distributions known as normal families [1]. This leads to what Persi Diaconis [2] labeled a paradox.

Exponential families are quite a restricted family of measuures. Modern statistics deals with far richer classes of probabilities. This suggests a kind of paradox. If statistics is to be of any real use it must provide ways of boiling down great masses of data to a few humanly interpretable numbers. The Koopman-Pitman-Darmois theorem suggests this is impossible unless nature follows highly specialized laws which no one really believes.

Diaconis suggests two ways around the paradox. First, the theorem he refers to concerns sufficient statistics of a fixed size; it doesn’t apply if the summary size varies with the data size. Second, and more importantly, the theorem says nothing about a summary containing approximately as much information as the full data.


[0] Students may naively take this to mean that all you need is sufficient statistics. No, it says that if you know the distribution the data come from then all you need is the sufficient statistics. You cannot test whether a model fits given sufficient statistics that assume the model. For example, mean and variance are sufficient statistics assuming data come from a normal distribution. But knowing only the mean and variance, you can’t assess whether a normal distribution model fits the data.

[1] This does not mean the data have to come from the normal distribution. The normal family of distributions includes the normal (Gaussian) distribution, but other distributions as well.

[2] Persi Diaconis. Sufficiency as Statistical Symmetry. Proceedings of the AMS Centennial Symposium. August 8–12, 1988.

A new take on the birthday problem

Vitalii Tymchyshyn and Andrii Khlevniuk posted a new paper here entitled “On the average number of birthdays in birthday paradox setup.” This paper gives a different perspective on the birthday problem and a new proof.

In the usual formation of the birthday problem, one asks the probability that everyone in a group of size n has a different birthday, or one asks how big n needs to be for the probability to be approximately 1/2 that two people share a birthday. Tymchyshyn and Khlevniuk ask about the expected number of unique birthdays in a group of size n and show that it equals

(1 – αn) / (1 – α)

where α = 364/365.

Variations on the birthday problem often come up in practice. For example, you often need to consider how likely it is that a hash function will map set of inputs to unique values. There’s a rule of thumb that if there are N possible hash values, then there’s a 50-50 chance of a collision if you hash √N values.

Let α = (N-1)/N and p = 1/N. If we take the first three terms in the Taylor series for (1 – p)n we see that the expected number of unique hash values after hashing n inputs is approximately

nn(n – 1) p/2.

If we choose n = √N, then the expected number of unique hash values is approximately

n – 1/2,

which is consistent with having about a 0.5 probability of one collision.

Related probability posts

Convergence rate of random walk on integers mod p

Consider a random walk on the integers mod p. At each step, you either take a step forward or a step back, with equal probability. After just a few steps, you’ll have to be near where you started. After a few more steps, you could be far from your starting point, but you’re probably not. But eventually, every point is essentially equally likely.

How many steps would you have to take before your location is uniformly distributed? There’s a theorem that says you need about p² steps. We’ll give the precise statement of the theorem shortly, but first let’s do a simulation.

We’ll take random walks on the integers mod 25. You could think of this as a walk on a 24-hour clock, with one extra hour squeezed in. Suppose we want to take N steps. We would generate a +1 or -1 at random, add that to our current position, and reduce mod 25, doing that N times. That would give us one outcome of a random walk with N steps. We could do this over and over, keeping track of where each random walk ended up, to get an idea how the end of the walk is distributed.

We will do an optimization to speed up the simulation. Suppose we generated all our +1s and -1s at once. Then we just need to add up the number of +1’s and subtract the number of -1’s. The number n of +1’s is a binomial(N, 1/2) random variable, and the number of +1’s minus the number of -1’s is 2nN. So our final position will be

(2nN) mod 25.

Here’s our simulation code.

    from numpy import zeros
    from numpy.random import binomial
    import matplotlib.pyplot as plt

    p = 25
    N = p*p//2
    reps = 100000

    count = zeros(p)

    for _ in range(reps):
        n = 2*binomial(N, 0.5) - N
        count[n%p] += 1, count/reps)

After doing half of p² steps the distribution is definitely not uniform.

Distribution after p^2/2 steps

But after p² steps the distribution is close to uniform, close enough that you start to wonder how much of the lack of uniformity is due to simulation.

Distribution after p^2 steps

Now here’s the theorem I promised, taken from Group Representations in Probability and Statistics by Persi Diaconis.

For np² with p odd and greater than 7, the maximum difference between the random walk density after n steps and the uniform density is no greater than exp(- π²n / 2p²).

There is no magic point at which the distribution suddenly becomes uniform. On the other hand, the difference between the random walk density and the uniform density does drop sharply at around p² steps. Between p²/2 steps and p² steps the difference drops by almost a factor of 12.

More random walk posts

Random sample overlap

Here’s a simple probability problem with an answer you may find surprising. (The statement of the problem and solution are simple. The proof is not as simple.)

Suppose you draw 1,000 serial numbers at random from a set of 1,000,000. Then you make another random sample of 1,000. How likely is it that no numbers will be the same on both lists?

To make the problem slightly more general, take two samples of size n from a population of size n² where n is large.

The probability of no overlap in the two samples is approximately 1/e. That is, in the limit as n approaches infinity, the probability converges to 1/e.

For n = 1,000 the approximation is good to three figures.


There are Binomial(n², n) ways to choose a sample of size n from a population of size n². After we’ve drawn the first sample, there are Binomial(n² – n, n) ways to draw a new sample that doesn’t overlap with the first one. The probability of such a sample is

\begin{align*} \left.\binom{n^2 - n}{n} \middle/ \binom{n^2}{n}\right. &= \frac{(n^2 - n)!}{(n^2 - 2n)!} \frac{(n^2 - n)!}{(n^2)!} \\ & = \frac{(n^2 -n)(n^2 - n -1) \cdots (n^2 - 2n+1)}{n^2 (n^2-1) \cdots (n^2 - n+1)} \\ &= \left( \frac{n^2 - n}{n^2} \right) \left( \frac{n^2 - n - 1}{n^2 -1} \right) \cdots \left( \frac{n^2 - 2n + 1}{n^2 - n+1} \right) \\ \end{align*}

The fractions on the last line are in decreasing order, and so their product is less than the first one raised to the nth power, and greater than the last one raised to the nth power.

By taking logs one can show that

\lim_{n\to\infty} \left( \frac{n^2 - n}{n^2} \right)^n = \lim_{n\to\infty} \left( \frac{n^2 - 2n + 1}{n^2 - n+1} \right)^n = \frac{1}{e}

Since upper and lower bounds on the probability converge to 1/e, the probability converges to 1/e.


This problem is reminiscent of the birthday problem, but it’s a little different because the samples are draw without replacement.

Incidentally, I didn’t make up this problem out of thin air. It came up naturally in the course of a consulting project this week.

Data Science and Star Science

I recently got a review copy of Statistics, Data Mining, and Machine Learning in Astronomy. I’m sure the book is especially useful to astronomers, but those of us who are not astronomers could use it as a survey of data analysis techniques, especially using Python tools, where all the examples happen to come from astronomy. It covers a lot of ground and is pleasant to read.

What is a privacy budget?


The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Some metaphors are dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can chose how you spend it.

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. The former simplifies privacy budget accounting due to simple composition rules, and makes privacy budgets stretch further by leaning away from worst-case analysis a bit and leaning toward average-case analysis. RDP depends on a tuning parameter that includes ε-differential privacy, so one can control how much RDP acts like ε-differential privacy by adjusting that parameter.

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions like, and get sufficiently accurate answers, without worrying about privacy budget.

More mathematical privacy posts