Melissa O’Neill has a new post on generating random numbers from a given range. She gives the example of wanting to pick a card from a deck of 52 by first generating a 32-bit random integer, then taking the remainder when dividing by 52. There’s a slight bias because 2^{32} is not a multiple of 52.

Since 2^{32} = 82595524*52 + 48, there are 82595525 ways to generate the numbers 0 through 47, but only 82595524 ways to generate the numbers 48 through 51. As Melissa points out in her post, the bias here is small, but the bias increases linearly with the size of our “deck.” To clarify, it is the *relative* bias that increases, not the *absolute* bias.
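The counts above can be checked with a few lines of Python. This is only a sketch; the helper name `ways` is made up for illustration and does not come from Melissa's post.

```python
WORD = 2**32   # number of distinct 32-bit values
DECK = 52      # number of cards

def ways(card, modulus=DECK, word=WORD):
    """Count the integers x in [0, word) with x % modulus == card."""
    quotient, remainder = divmod(word, modulus)
    # The first `remainder` residues pick up one extra preimage.
    return quotient + 1 if card < remainder else quotient

print(divmod(WORD, DECK))   # (82595524, 48)
print(ways(0), ways(47))    # the favored cards
print(ways(48), ways(51))   # the slightly less likely cards
```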

Suppose you want to generate a number from 0 to *M* − 1, where *M* is less than 2^{32} and not a power of 2. There will be 1 + ⌊2^{32}/*M*⌋ ways to generate a 0, but only ⌊2^{32}/*M*⌋ ways to generate *M* − 1. The *difference* in the probability of generating 0 vs generating *M* − 1 is 1/2^{32}, independent of *M*. However, the *ratio* of the two probabilities is 1 + 1/⌊2^{32}/*M*⌋ or approximately 1 + *M*/2^{32}.
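Here is a small sketch of the same calculation for arbitrary *M*, using exact rational arithmetic. The function name `modulo_bias` and its `bits` parameter are illustrative choices, not anything from the original post.

```python
from fractions import Fraction

def modulo_bias(M, bits=32):
    """Return (difference, ratio) between the probabilities of the most
    and least likely outcomes of (uniform bits-bit integer) % M."""
    total = 2**bits
    quotient, remainder = divmod(total, M)
    if remainder == 0:
        # M divides 2**bits, so every outcome is equally likely.
        return Fraction(0), Fraction(1)
    p_high = Fraction(quotient + 1, total)  # e.g. the outcome 0
    p_low = Fraction(quotient, total)       # e.g. the outcome M - 1
    return p_high - p_low, p_high / p_low

for M in (52, 2**20 + 1, 2**31 + 1):
    diff, ratio = modulo_bias(M)
    print(M, diff == Fraction(1, 2**32), float(ratio), 1 + M / 2**32)
```

The difference is always exactly 1/2^{32}, while the ratio drifts from just above 1 toward 2 as *M* grows. Note that the approximation 1 + *M*/2^{32} is only good when *M* is small relative to 2^{32}.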

As *M* increases, both the favored and unfavored outcomes become increasingly rare, but the ratio of their respective probabilities approaches 2.

Whether this makes any practical difference depends on your context. In general, the need for random number generator quality increases with the volume of random numbers needed.

Under conventional assumptions, the sample size necessary to detect a difference between two very small probabilities *p*_{1} and *p*_{2} is approximately

8(*p*_{1} + *p*_{2})/(*p*_{1} − *p*_{2})²

and so in the card example, we would have to deal roughly 6 × 10^{18} cards to detect the bias between one of the more likely cards and one of the less likely cards.
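To reproduce that figure, here is a short sketch that plugs the two card probabilities into the rule of thumb. The function name `sample_size` is just a label for this example.

```python
from fractions import Fraction

def sample_size(p1, p2):
    """Rule-of-thumb sample size for telling apart two very small
    probabilities: 8 (p1 + p2) / (p1 - p2)^2."""
    return 8 * (p1 + p2) / (p1 - p2) ** 2

total = 2**32
quotient, remainder = divmod(total, 52)
p_more = Fraction(quotient + 1, total)  # one of the cards 0 through 47
p_less = Fraction(quotient, total)      # one of the cards 48 through 51

n = sample_size(p_more, p_less)
print(f"{float(n):.1e}")  # about 5.7e+18
```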

***

Excellent article, thanks.

At the start of the second paragraph, should 2^52 be 2^32?

The ratio of the two probabilities is 1 + 1/⌊2^32/M⌋ or approximately M/2^32. I am not sure about the approximation since the assumption is that M < 2^32. Isn't this why the ratio approaches 2 if M is large?

That should have been 1 + M/2^32.

This is a really interesting observation. Do you have a citation, link or common name for the formula you give at the end? (FWIW, when I put in values for p1 and p2, I got 5.7 × 10^18, not 3 × 10^18. My calculation was 8(2/52+1/2^32)/(1/2^32)^2; did I screw something up?)

One other thing to note is that this probability relates to observing that one particular card is biased. But suppose our range is 2^31+1, there will be 2^31+1 items that appear once and 2^31-1 items that appear twice. If we know what to look for and divide the groups into two buckets, we may quite quickly know that one bucket is much more full than the other, much sooner than the 1.7 × 10^11 outputs we’d need to be sure that a single item was biased.

In my book, one of three suggested methods of picking from a non power-of-two range is to add 64 bits (or 128 or whatever margin you want) to the bit size of the smallest binary number that can represent the range and use a random number that big, modulo the range size. Random bits are cheap, use plenty of them.
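A minimal Python sketch of the extra-bits idea in the comment above, not the book's actual code. The name `pick_with_margin` is hypothetical, and I am assuming "bit size of the range" means the number of bits needed to write *M* − 1.

```python
import secrets

def pick_with_margin(M, margin_bits=64):
    """Pick a value in [0, M) by reducing an oversized random integer mod M.

    Assumption: the base width is the bit length of M - 1, the largest
    value in the range; margin_bits extra random bits are added on top.
    """
    if M < 1:
        raise ValueError("M must be a positive integer")
    base_bits = max(1, (M - 1).bit_length())
    wide = secrets.randbits(base_bits + margin_bits)
    return wide % M

card = pick_with_margin(52)  # 6 + 64 = 70 random bits, reduced mod 52
```

The modulo bias is still there, but since *M* is at most 2^`base_bits`, the ratio of the most to least likely outcome is at most 1 + 2^−64 with the default margin.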

You were right about the mantissa of 5.7. I used one of the p’s instead of both.

The binary sample size rule of thumb is commonly known. You can find it, for example, in Statistical Rules of Thumb by Gerald van Belle. The formula on the page I link to has a few more details. I simplified the rule of thumb slightly for the case here, where (1 − p) can be approximated as simply 1 because p is so small.