Suppose ages in some database are reported in decades: 0, 10, 20, etc. You need to add a 27 year old woman to the data set. How do you record her age? A reasonable approach would to round-to-nearest. In this case, 27 would be rounded up to 30.
Another approach would be stochastic rounding. In our example, we would round this woman’s age up to 30 with 70% probability and round it down to 20 with 30% probability. The recorded value is a random variable whose expected value is exactly 27.
Suppose we were to add a large number of 27 year olds to the database. With round-to-nearest, the average value would be 30 because all the values are 30. With stochastic rounding, about 30% of the ages would be recorded as 20 and about 70% would be recorded as 30. The average would likely be close to 27.
Next, suppose we add people to the database of varying ages. Stochastic rounding would record every person’s age using a random variable whose expected value is their age. If someone’s age is a d+x where d is a decade, i.e. a multiple of 10, and 0 < x < 10, then we would record their age as d with probability 1-x/10 and d+10 with probability x/10. There would be no bias in the reported age.
Round-to-nearest will be biased unless ages are uniformly distributed in each decade. Suppose, for example, our data is on undergraduate students. We would expect a lot more students in their early twenties than in their late twenties.
Now let’s turn things around. Instead of looking at recorded age given actual age, let’s look at actual age given recorded age. Suppose someone’s age is recorded as 30. What does that tell you about them?
With round-to-nearest, it tells you that they certainly are between 25 and 35. With stochastic rounding, they could be anywhere between 20 and 40. The probability distribution on this interval could be computed from Bayes’ theorem, depending on the prior distribution of ages on this interval. That is, if you know in general how ages are distributed over the interval (20, 40), you could use Bayes’ theorem to compute the posterior distribution on age, given that age was recorded as 30.
Stochastic rounding preserves more information than round-to-nearest on average, but less information in the case of a particular individual.
More privacy posts