I was buying a used book through Amazon this evening. Three resellers offered the book at essentially the same price. Here were their ratings:
- 94% positive out of 85,193 reviews
- 98% positive out of 20,785 reviews
- 99% positive out of 840 reviews
Which reseller is likely to give the best service? Before you assume it’s the seller with the highest percentage of positive reviews, consider the following simpler scenario.
Suppose one reseller has 90 positive reviews out of 100. The other reseller has two reviews, both positive. You could say one has 90% approval and the other has 100% approval, so the one with 100% approval is better. But this doesn’t take into consideration that there’s much more data on one than the other. You can have some confidence that 90% of the first reseller’s customers are satisfied. You don’t really know about the other because you have only two data points.

A Bayesian view of the problem naturally incorporates the amount of data as well as its average. Let θA be the probability of a customer being satisfied with company A’s service. Let θB be the corresponding probability for company B. Suppose before we see any reviews we think all ratings are equally likely. That is, we start with a uniform prior distribution on θA and θB. A uniform distribution is the same as a beta(1, 1) distribution.
After observing 90 positive reviews and 10 negative reviews, our posterior estimate on θA has a beta(91, 11) distribution. After observing 2 positive reviews, our posterior estimate on θB has a beta(3, 1) distribution. The probability that a sample from θA is bigger than a sample from θB is 0.713. That is, there’s a good chance you’d get better service from the reseller with the lower average approval rating.
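The 0.713 figure can be checked with a short sketch. Because a beta(3, 1) variable has CDF x^3 on [0, 1], the probability that θA exceeds θB equals the third moment of a beta(91, 11) variable, which has a closed form. The function names below are illustrative, and the Monte Carlo estimate is included only as a sanity check on the exact value.

```python
import random
from fractions import Fraction

def prob_greater_exact(a=91, b=11):
    """P(theta_A > theta_B) for theta_A ~ beta(a, b) and theta_B ~ beta(3, 1).

    A beta(3, 1) variable has CDF x**3, so the probability is E[theta_A**3],
    the third raw moment of beta(a, b): prod_{k=0..2} (a + k) / (a + b + k).
    """
    moment = Fraction(1)
    for k in range(3):
        moment *= Fraction(a + k, a + b + k)
    return float(moment)

def prob_greater_mc(n=100_000, seed=0):
    """Monte Carlo estimate of the same probability, using only the stdlib."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(91, 11) > rng.betavariate(3, 1) for _ in range(n))
    return hits / n

exact = prob_greater_exact()
estimate = prob_greater_mc()
print(f"exact: {exact:.3f}, Monte Carlo: {estimate:.3f}")
```

The shortcut through the third moment only works because the second distribution is beta(3, 1). For general parameters the probability has to be computed numerically.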

Now back to our original question. Which of the three resellers is most likely to satisfy a customer?
Assume a uniform prior on θX, θY, and θZ, the probabilities of good service for each reseller. The posterior distributions on these variables have distributions beta(80082, 5113), beta(20370, 417), and beta(833, 9).
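As a rough sketch of where these parameters come from: starting from a beta(1, 1) prior, each positive review adds one to the first parameter and each negative review adds one to the second. The code assumes the number of positive reviews can be recovered by rounding the displayed percentage times the review count, which is a guess about how the ratings were reported. The function name is illustrative.

```python
def posterior_params(percent_positive, n_reviews):
    """Beta posterior parameters under a uniform beta(1, 1) prior.

    Assumption: the count of positive reviews is the displayed percentage
    times the number of reviews, rounded to the nearest integer.
    """
    positive = round(percent_positive * n_reviews / 100)
    negative = n_reviews - positive
    return 1 + positive, 1 + negative

# (percent positive, number of reviews) for the three resellers
sellers = {"X": (94, 85193), "Y": (98, 20785), "Z": (99, 840)}
posteriors = {name: posterior_params(p, n) for name, (p, n) in sellers.items()}

for name, (a, b) in posteriors.items():
    print(f"theta_{name} ~ beta({a}, {b})")
```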
These beta distributions have such large parameters that we can approximate them by normal distributions with the same mean and variance. (A beta(a, b) random variable has mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)).) The variable with the most variance, θZ, has standard deviation of about 0.0035. The other variables have even smaller standard deviations. So the three distributions are highly concentrated at their mean values with practically non-overlapping support. And so a sample from θX or θY is unlikely to be higher than a sample from θZ.
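The normal approximation can be sketched in a few lines. The code computes each posterior's mean and standard deviation from the beta formulas, then approximates the probability that one variable exceeds another by treating their difference as normal. The helper names are illustrative, and the normal approximation is the only method used here, not an exact calculation.

```python
import math

def beta_mean_sd(a, b):
    """Mean and standard deviation of a beta(a, b) random variable."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

def prob_first_exceeds(p, q):
    """Normal approximation to P(first > second) for independent betas."""
    (m1, s1), (m2, s2) = beta_mean_sd(*p), beta_mean_sd(*q)
    z = (m1 - m2) / math.hypot(s1, s2)  # difference is approximately normal
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z

X, Y, Z = (80082, 5113), (20370, 417), (833, 9)
for name, params in (("X", X), ("Y", Y), ("Z", Z)):
    mean, sd = beta_mean_sd(*params)
    print(f"theta_{name}: mean {mean:.4f}, sd {sd:.4f}")

p_y_beats_z = prob_first_exceeds(Y, Z)
p_x_beats_z = prob_first_exceeds(X, Z)
print(f"P(theta_Y > theta_Z) ~ {p_y_beats_z:.4f}")
print(f"P(theta_X > theta_Z) ~ {p_x_beats_z:.2e}")
```

Under this approximation both probabilities come out well below 1%, which is the sense in which the distributions barely overlap.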
In general, going by averages alone works when you have a lot of customer reviews. But when you have a small number of reviews, going by averages alone could be misleading.
Thanks to Charles McCreary for suggesting the xkcd comic.
Related links:
Inequality Calculator
Calculating random inequalities
Exact calculation of beta inequalities


Charles McCreary 09.27.11 at 21:23
Pleasure visiting with you at SciPy, but this post demands a link:
http://xkcd.com/325/
John 09.27.11 at 21:31
The xkcd strip is perfect! Thanks. I updated the post to include it.
Chris Barts 09.27.11 at 22:03
Correct me if I’m wrong, but the comic isn’t entirely apt given the subject matter: Hat guy (that’s what the character is semi-officially called) can keep a respectable positive review ratio regardless of how many people he ships to as long as he keeps sending the same ratio of bobcats to office chairs. That means someone applying your methods could well end up with a supremely angry bobcat instead of a somewhat sub-par chair that took a while to arrive, simply because Hat Guy and Semi-Loser Chair Salesman looked the same in the model.
Of course, your model works well in the real world because Seller A through Seller Z really are pretty much the same if your model says they look similar. Applying it to restaurants gets messier, however, as you have to consider price points (McDonald’s probably gets better ratings than a sub-par $20-a-plate bistro) and the fact that if you go to a sufficiently bad restaurant your health is on the line.
Finally, you have to assume people know what quality means. With chairs it’s easy, with food it’s mostly personal preference once you leave salmonella-land, but with physicians and college professors the ‘clients’ are, almost by definition, unable to judge what they’re getting with any kind of expertise.
John 09.27.11 at 22:15
Chris: The model applies better to sports. Suppose Michael has made 80 of his last 100 free throw attempts. Larry has made 2 out of 2 free throws. Who is the better player? Larry has the better average, but I’d bet on Michael, given only that information.
But if I knew, for example, that Michael and Larry were professional basketball players, I would not start with a uniform prior. I’d start with a fairly informative prior that puts a lot of probability around whatever the average free throw percentage is for professional basketball players. That could change the bet of which is the better player.
Tieming Ji 09.27.11 at 22:43
I have had the same question when I tried to buy stuff from Amazon. I like your idea.
Aaron 09.28.11 at 00:28
The main problem with such a Bayesian argument that I see is the Beta(1,1) prior. I think I would choose different, larger parameters, based on how often someone with no feedback is likely to have bad service.
human mathematics 09.28.11 at 00:34
Pics, or it didn’t happen:
Beta(3,1) PDF & CDF
http://www.wolframalpha.com/input/?i=beta+distribution+with+alpha%3D3+and+beta%3D1#
Beta(91,11) PDF & CDF
http://www.wolframalpha.com/input/?_=1317191370801&fp=1&i=beta+distribution+with+alpha%3d91+and+beta%3d11&incTime=true#
Mahesh CR 09.28.11 at 00:34
John – Am completely clueless about statistics but thanks for illustrating the why and the how behind judging information, that too with a relatable example. Appreciate it!
human mathematics 09.28.11 at 00:41
I feel like this could turn into an App. Not only for Amazon — for Elance, Odesk, or any site with user ratings.
The company’s Sort should be working with a functional … and you’ve just suggested a good one. (A different prior could be used for ratings that come from untrustworthy users.)
c23 09.28.11 at 01:36
1) It’s not true that you expect service quality to be randomly distributed from 0 to 1 for a new seller. You would want to look at the average quality of new sellers in the past. I have no data other than my own experience as a frequent user of ebay and amazon since 1998, but I would guess it would be more like around 0.9, which changes the math considerably – enough that the seller with the highest percentage score is the way to bet unless he has very little feedback.
2) It’s kind of weird to illustrate your point with an example in which your idea doesn’t apply. Clearly you need feedback numbers well below 800 before this issue is worth thinking about.
The post would have been better if it had given some idea of how little feedback a seller should have before you worry about it.
Josh 09.28.11 at 03:50
I still didn’t catch the answer to your question “Which of the three resellers is most likely to satisfy a customer?”. So which is it now? None of them?
John 09.28.11 at 05:48
Josh: It turns out in the first example that the reseller with the highest average is also most likely to satisfy customers.
human mathematics: There’s an app for that, or at least a very good head start. The Inequality Calculator that I linked to calculates these kinds of inequalities, and it draws pictures. It plots the two density curves on top of each other so it’s easier to compare them. Calculating beta inequalities is numerically tricky, but the other two links explain how it can be done.
I added an image based on your suggestion. I first used the image from Inequality Calculator, but the image above from Sage scaled better so I posted it instead.
Martin 09.28.11 at 09:03
I think the analysis fails to take one thing into account (which may or may not be able to be taken into account). That is, the amount of time that the entity being reviewed has been receiving reviews. In the example, how are the values skewed if the entity with 100 reviews has been around receiving reviews for five years and the one with two reviews has been around receiving reviews for one week?
Abhijit 09.28.11 at 10:03
John,
Very nice post. The comment about the choice of prior other than beta(1,1) actually makes (indirectly) the very salient point that the prior is YOUR prior beliefs, and you’re welcome to use your prior beliefs quantified rather than a nice conjugate prior. One of the beauties of the Bayesian way.
Many people get caught up in the average value rather than in what went into the average. Likewise, many people and the press get swayed into reporting results from small observational or clinical trials as the truth, not understanding that the number of subjects and the variability are what’s important to assess the result’s importance.
John 09.28.11 at 11:15
c23: Regarding your first point, how do you know what I expect?
You seem to know more about online resellers, so you have more information and a different distribution on the probability of good service. That’s as it should be. But regardless of what I personally expect, a uniform distribution makes the example simple.
Tim H. 09.28.11 at 14:39
People (on Reddit and HN) are hung up on the prior. In the limit (of data points) would the posteriors always converge to the same distribution?
W. Volterman 09.28.11 at 19:41
It seems that this doesn’t take into account the act of creating fake reviews. It’s easier to create a few hundred fake reviews than 10,000. There should be some penalty in this regard, where we assume that some fixed number of positive reviews are fake.
SteveBrooklineMA 09.28.11 at 20:01
What would be the standard non-Bayesian approach? Would it lead to a different conclusion? My intuition about such things is that if the Bayesian and Classical analyses lead to different conclusions, you probably need more data. If you really need to reach a decision now, or for some other reason can’t collect more data, then pick your poison.
It seems to me that in the limit (of data points) each posterior should converge to a Dirac-delta centered at the true rate.
Dave Backus @ NYU 09.28.11 at 21:46
Nice idea. You might also take into account that these are not random draws — there’s potentially strategic behavior, maybe on both sides. Here’s an example of how that might work:
http://luiscabral.org/economics/publications/JIE%202010.pdf
George Crews 09.29.11 at 17:42
And yet another xkcd link about average ratings: http://xkcd.com/937/
Alessandro 10.01.11 at 08:10
Is it possible to calculate the inequality with R?
Alessandro 10.01.11 at 08:18
It would be interesting to see the frequentist approach to solve the same problem.
John 10.01.11 at 09:26
Alessandro: The last two links on the page point to technical reports that will tell you how to implement the inequality calculation.
StatStudent 10.12.11 at 18:43
This is always an interesting problem, caused mostly by small sample size. It reminds me of this recent comic, not about average ratings but about the sample size problem:
http://romancemath.blogspot.com/2011/10/12-casanova.html
Using this logic, it is hard to attract customers when you first start selling: even if all your reviews are positive, people prefer the more established sellers. The only way to compete is to offer more, i.e., cheaper shipping or a lower price. By the same token, if you are an established seller, you can hike up your price and don’t have to worry about losing customers.
Hannah 10.19.11 at 17:56
Excuse me, what is the formula to calculate that “The probability that a sample from θA is bigger than a sample from θB is 0.713.”
John 10.19.11 at 18:14
Hannah: See the “related links” at the bottom of the post. There’s no simple formula for the probability. But the first link is to software that can calculate it, the third explains how to compute the probability analytically for special values of the parameters, and the second explains how to compute the probability numerically in general.
Konst 11.12.11 at 05:15
A python module for the calculation of P(X>Y) would be great …
Mitch 01.30.12 at 18:56
It takes a while to make connections…here’s another method for ‘judging’ binary voting, using ratio -and- n: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html