Suppose the police department in your community reported an average of 10 burglaries per month. You could take that at face value and assume there are 10 burglaries per month. But maybe there are 20 burglaries a month but only half are reported. How could you tell?
Here’s a similar problem. Suppose you gave away an electronic book. You stated that people are free to distribute it however they want, but that you’d appreciate feedback. How could you estimate the number of people who have read the book? If you get email from 100 readers, you know at least 100 people have read it, but maybe 10,000 people have read it and only 1% sent email. How can you simultaneously estimate the number of readers and the percentage who send email?
The key is the variability in the data. You could never tell by looking at the average alone. If there were an average of 10 burglaries per month, all of them reported, the monthly counts would be less variable than if there were an average of 20 burglaries per month with only half reported. Along the same lines, big variations in the amount of fan mail suggest that there may be many readers but few who send email.
The statistical problem is estimating the parameters (n, p) of a binomial distribution. The expected number of reported events is np, where n is the total number of events and p is the probability that an event will be reported. Suppose one town has n1 actual burglaries, each with probability p1 of being reported. Another town has n2 burglaries, each with probability p2 of being reported. If the expected numbers of reported burglaries are equal, then n1 p1 = n2 p2 = r. Because a binomial count has variance np(1 – p), the variances of the burglary reports from the two towns will be r(1 – p1) and r(1 – p2). If p1 is less than p2, there will be more variance in the data from the first town.
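To make the variance argument concrete, here is a small simulation sketch. The numbers are illustrative assumptions, not data from any real town: the first town has n1 = 100 and p1 = 0.1, the second has n2 = 20 and p2 = 0.5, so both expect r = 10 reports per month. The helper name simulate_reports is made up for this example.

```python
import random
import statistics

def simulate_reports(n, p, months, rng):
    """Simulate monthly report counts: n events per month,
    each independently reported with probability p."""
    return [sum(rng.random() < p for _ in range(n)) for _ in range(months)]

rng = random.Random(0)  # fixed seed so the run is repeatable
months = 10_000

# Hypothetical towns, both with expected reports r = n * p = 10.
town1 = simulate_reports(n=100, p=0.1, months=months, rng=rng)
town2 = simulate_reports(n=20, p=0.5, months=months, rng=rng)

mean1, var1 = statistics.fmean(town1), statistics.variance(town1)
mean2, var2 = statistics.fmean(town2), statistics.variance(town2)

# Theory: variance is r * (1 - p), i.e. 9 for town 1 and 5 for town 2.
print(f"town 1: mean {mean1:.2f}, variance {var1:.2f}")
print(f"town 2: mean {mean2:.2f}, variance {var2:.2f}")
```

The two sample means should come out nearly identical, while the sample variance for the first town, where reporting is less likely, should be noticeably larger.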
Estimating binomial parameters is complicated and I won’t give the details here. If n were known, estimating p would be trivial. But when n and p are both unknown, this is a much harder problem. Still, there are ways to approach the problem.
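One simple approach, offered here only as an illustration, is the method of moments: set the sample mean m equal to np and the sample variance s2 equal to np(1 – p), then solve to get p = 1 – s2/m and n = m/p. The sketch below assumes the true n is the same every month, and the function name estimate_binomial_moments is hypothetical. This estimator is known to be unstable when p is small or there are few observations, so treat it as a starting point rather than a recommended method.

```python
import random
import statistics

def estimate_binomial_moments(counts):
    """Method-of-moments estimate of (n, p) from observed counts.

    Returns None when the sample variance is not smaller than the
    sample mean, since the formulas would then give p <= 0.
    """
    m = statistics.fmean(counts)
    s2 = statistics.variance(counts)
    if m <= 0 or s2 >= m:
        return None
    p_hat = 1 - s2 / m   # from s2 = n p (1 - p) and m = n p
    n_hat = m / p_hat    # from m = n p
    return n_hat, p_hat

# Simulated data with assumed true values n = 20, p = 0.5.
rng = random.Random(1)
true_n, true_p = 20, 0.5
counts = [sum(rng.random() < true_p for _ in range(true_n)) for _ in range(5000)]

result = estimate_binomial_moments(counts)
print(result)
```

With thousands of simulated months the estimates land close to the true values. With only a year or two of data they can be far off, or the function can return None, which reflects how hard the problem is.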