Vitalii Tymchyshyn and Andrii Khlevniuk posted a new paper here entitled “On the average number of birthdays in birthday paradox setup.” This paper gives a different perspective on the birthday problem and a new proof.

In the usual formation of the birthday problem, one asks the probability that everyone in a group of size *n* has a different birthday, or one asks how big *n* needs to be for the probability to be approximately 1/2 that two people share a birthday. Tymchyshyn and Khlevniuk ask about the expected number of unique birthdays in a group of size *n* and show that it equals

(1 – α^{n}) / (1 – α)

where α = 364/365.

Variations on the birthday problem often come up in practice. For example, you often need to consider how likely it is that a hash function will map set of inputs to unique values. There’s a rule of thumb that if there are *N* possible hash values, then there’s a 50-50 chance of a collision if you hash √*N* values.

Let α = (*N*-1)/*N* and *p* = 1/*N*. If we take the first three terms in the Taylor series for (1 – *p*)^{n} we see that the expected number of unique hash values after hashing *n* inputs is approximately

*n* – *n*(*n* – 1) *p*/2.

If we choose *n* = √*N*, then the expected number of unique hash values is approximately

*n* – 1/2,

which is consistent with having about a 0.5 probability of one collision.

So about 19 individuals, rather than the 23 we get by assuming that birthdays are uniformly distributed and every year has 365 days. Since this is one of these “as N gets large” things, it seems the year is not quite large enough.

It feels like that should be α = 364.25/365.25 to account for the few individuals with Feb 29th birthdays?

But that won’t make much difference, I guess…

>>> (364.25/365.25)-(364.0/365.0)

1.875240265181155e-06

Heh – I guess it only matters for REALLY large n :)

Very interesting. I came across this under the name “frequency of frequencies”: How many groups of size m share a birthday? I am baffled that this is new, as I am sure to have read about similar findings years ago. e.g. https://en.wikipedia.org/wiki/Good–Turing_frequency_estimation