Suppose the police department in your community reported an average of 10 burglaries per month. You could take that at face value and assume there are 10 burglaries per month. But maybe there are 20 burglaries a month but only half are reported. How could you tell?

Here’s a similar problem. Suppose you gave away an electronic book. You stated that people are free to distribute it however they want, but that you’d appreciate feedback. How could you estimate the number of people who have read the book? If you get email from 100 readers, you know at least 100 people have read it, but maybe 10,000 people have read it and only 1% sent email. How can you estimate the number of readers and the percentage who send email at the same time?

The key is the variability in the data. You could never tell by looking at the average alone. If there were an average of 10 burglaries per month, the data would be less variable than if there were an average of 20 burglaries per month with only half reported. Along the same lines, big variations in the amount of fan mail suggest that there may be many readers but few who send email.

The statistical problem is estimating the parameters (*n*, *p*) of a binomial distribution. The expected number of reported events is *np* where *n* is the total number of events and *p* is the probability that an event will be reported. Suppose one town has *n*₁ actual burglaries, each with probability *p*₁ of being reported, and another town has *n*₂ burglaries, each with probability *p*₂ of being reported. If the expected numbers of reported burglaries are equal, then *n*₁*p*₁ = *n*₂*p*₂ = *r*. The variances of the burglary reports from the two towns will be *r*(1 − *p*₁) and *r*(1 − *p*₂). If *p*₁ is less than *p*₂, there will be more variance in the data from the first town.
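Here's a quick simulation sketch of that point. The towns, counts, and number of months are all made up for illustration: both towns expect *r* = 10 reported burglaries per month, but the under-reporting town shows much more month-to-month variation.

```python
import random
from statistics import mean, variance

random.seed(42)

def simulate_reports(n, p, months):
    """Monthly reported counts: each of n burglaries is reported with probability p."""
    return [sum(random.random() < p for _ in range(n)) for _ in range(months)]

# Both towns have expected reported count r = n*p = 10 per month.
town1 = simulate_reports(n=20, p=0.5, months=10_000)  # half reported
town2 = simulate_reports(n=10, p=1.0, months=10_000)  # all reported

# Means agree (about 10), but the variances differ:
# r*(1 - p) gives about 5 for town 1 and 0 for town 2.
```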

Estimating binomial parameters is complicated and I won’t give the details here. If *n* were known, estimating *p* would be trivial. But when *n* and *p* are both unknown, this is a much harder problem. Still, there are ways to approach the problem.
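One elementary approach (my own illustration, not the method the post has in mind) is the method of moments: since the mean of a binomial is *np* and the variance is *np*(1 − *p*), matching the sample mean *m* and sample variance *s*² gives the estimates *p̂* = 1 − *s*²/*m* and *n̂* = *m*/*p̂*. A minimal sketch, with made-up monthly counts:

```python
from statistics import mean, variance

def binomial_mom(counts):
    """Method-of-moments estimates of (n, p) from Binomial(n, p) samples."""
    m, s2 = mean(counts), variance(counts)
    p_hat = 1 - s2 / m          # from s2 = n*p*(1-p) = m*(1-p)
    if p_hat <= 0:
        raise ValueError("sample variance >= sample mean: no binomial fit")
    return m / p_hat, p_hat     # n_hat need not be an integer

# Hypothetical monthly reported burglary counts:
counts = [9, 12, 8, 11, 13, 7, 10, 9, 12, 9]
n_hat, p_hat = binomial_mom(counts)
```

These estimates are notoriously unstable when *p* is small or the sample is short, which is part of why the general problem is hard.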

Thanks, I’ve thought about this a little bit before but this post may be the impetus for me actually applying it.

John, this is very interesting. Can you recommend a good starting point for a novice?

Dan: Good question. Here’s a technical reference:

A. DasGupta and Herman Rubin, “Estimation of binomial parameters when both n, p are unknown,” *Journal of Statistical Planning and Inference* 130 (2005), 391–404.

I may write up some notes summarizing this. If I do, I’ll post a comment here.

And a similar problem on global scale:

http://blogs.wsj.com/numbersguy/how-the-cdc-counts-h1n1-cases-886

The best (not so technical) starting point I can think of is an example from Casella and Berger’s *Statistical Inference*, where they discuss the method of moments and show how to estimate n and p together.

Pertinent refs: http://cameron.econ.ucdavis.edu/racd/count.html

and the book by the same authors: A. C. Cameron and P. K. Trivedi, *Regression Analysis of Count Data* (1998).

Admittedly, these pursue the binomial in the Poisson limit of large N.

I thought this sounded like a really neat idea, so I thought some more about it. However, I’m not sure the above is a valid experimental approach.

Generally something like the true count of burglaries in a period is modelled by a Poisson distribution (with rate lambda). If we draw samples from this Poisson distribution and then under-report burglaries by a binomial model (with probability of reporting r), then by the magic of maths we come up with another Poisson distribution with rate r·lambda. Unfortunately typing the derivation into blog comments doesn’t work too well!

Therefore a series of observed counts tells you only about the product r·lambda, and you can’t make any inference about either r or lambda separately.
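A quick simulation sketch of that thinning claim (my own code; the Poisson sampler uses Knuth’s multiplication method, and the parameters are made up):

```python
import math
import random
from statistics import mean, variance

random.seed(1)

def poisson(lam):
    """Draw from Poisson(lam) by Knuth's method; fine for modest lam."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def thinned(lam, r):
    """Poisson(lam) true events, each reported with probability r."""
    return sum(random.random() < r for _ in range(poisson(lam)))

samples = [thinned(lam=20, r=0.5) for _ in range(20_000)]
# Mean and variance both come out near r*lam = 10, just as for a plain
# Poisson(10): the observed counts don't separate r from lam.
```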

Are there sensible cases where the true count is fixed (i.e. not Poisson) and we then observe various under-reported counts from it?