The previous post looked at a paradox with A/B testing: your final result may depend heavily on the order of your tests. This post looks at another problem with A/B testing: the inability to find interaction effects.
Suppose you’re debating whether to put a photo of a car or a truck on your web site, and whether the vehicle should be red or blue. You decide to use A/B testing, so you test whether customers prefer a red truck or a blue truck. They prefer the blue truck. Then you test whether customers prefer a blue truck or a blue car. They prefer the blue truck.
Maybe customers would prefer a red car best of all, but you didn’t test that option. By testing vehicle type and color separately, you didn’t learn about the interaction of vehicle type and color. As Andrew Gelman and Jennifer Hill put it,
Interactions can be important. In practice, inputs that have large main effects also tend to have large interactions with other inputs. (However, small main effects do not preclude the possibility of large interactions.)
Notice that sample size is not the issue. Suppose you tested the red truck against the blue truck with 1000 users and found that 88.2% preferred the blue truck. You can be quite confident that users prefer the blue truck to the red truck. Suppose you also used 1000 users to test the blue truck against the blue car and this time 73.5% preferred the blue truck. Again you can be confident in your results. But you failed to learn something that you might have learned if you’d split 100 users between four options: red truck, blue truck, red car, blue car.
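Here is a minimal sketch of how the sequence of A/B tests above can miss the best option. The conversion rates are hypothetical numbers invented for illustration, and the `ab_test` helper simply assumes each test identifies the truly better option, which is the most favorable case for A/B testing.

```python
# Hypothetical "true" conversion rates, invented purely for illustration.
rates = {
    ("red", "truck"): 0.02,
    ("blue", "truck"): 0.05,
    ("blue", "car"): 0.03,
    ("red", "car"): 0.08,
}

def ab_test(option_a, option_b):
    """Idealized A/B test: always returns the option with the higher true rate."""
    return max(option_a, option_b, key=rates.get)

# One factor at a time: first color (truck held fixed), then vehicle type.
winner = ab_test(("red", "truck"), ("blue", "truck"))
winner = ab_test(winner, (winner[0], "car"))

# Factorial comparison: look at all four combinations at once.
best = max(rates, key=rates.get)

print("one factor at a time picks:", winner)  # ('blue', 'truck')
print("best of all four options:  ", best)    # ('red', 'car')
```

Even with perfect measurements, the two sequential tests never visit the red car, so no amount of additional traffic would reveal it.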
This is an example of a factorial design, testing all combinations of the factors involved. Factorial designs can seem impractical because the number of combinations grows exponentially with the number of factors. But if it’s not practical to test all combinations of 10 factors, for example, that doesn’t mean that it’s impractical to test all combinations of two factors, as in the example above. It is often practical to use a full factorial design for a moderate number of factors, and to use a fractional factorial design with more factors.
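As a rough sketch of what a full factorial design looks like, the snippet below enumerates every combination of levels with `itertools.product` and counts how many cells a design needs. The `factors` dictionary and the `n_cells` helper are names made up for this example.

```python
from itertools import product
from math import prod

# The two factors from the example, each with two levels.
factors = {
    "color": ["red", "blue"],
    "vehicle": ["car", "truck"],
}

# A full factorial design has one test cell per combination of levels.
cells = list(product(*factors.values()))
for cell in cells:
    print(dict(zip(factors, cell)))

def n_cells(levels_per_factor):
    """Number of cells in a full factorial design."""
    return prod(levels_per_factor)

print(n_cells([2, 2]))    # 4 cells for two two-level factors
print(n_cells([2] * 10))  # 1024 cells for ten two-level factors
```

Two factors need only four cells, while ten two-level factors need 1024, which is where fractional designs come in.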
If you only test one factor at a time, you’re betting that interaction effects don’t matter. Maybe you’re right, and you can optimize your design by optimizing each variable separately. But if you’re wrong, you won’t know.
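To make "interaction effect" concrete, here is a small sketch that computes main effects and the interaction from a 2×2 table. The rates are hypothetical percentages invented for illustration, and the definition of the interaction used here (half the difference between the color effect for trucks and the color effect for cars) is one common convention, not the only one.

```python
# Hypothetical conversion rates in percent, invented for illustration.
rate = {
    ("red", "truck"): 2.0,
    ("blue", "truck"): 5.0,
    ("blue", "car"): 3.0,
    ("red", "car"): 8.0,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Main effect of color: blue minus red, averaged over vehicle types.
color_effect = mean(rate[("blue", v)] - rate[("red", v)] for v in ("car", "truck"))

# Main effect of vehicle: truck minus car, averaged over colors.
vehicle_effect = mean(rate[(c, "truck")] - rate[(c, "car")] for c in ("red", "blue"))

# Interaction: how much the color effect changes between trucks and cars.
color_effect_truck = rate[("blue", "truck")] - rate[("red", "truck")]
color_effect_car = rate[("blue", "car")] - rate[("red", "car")]
interaction = (color_effect_truck - color_effect_car) / 2

print(color_effect)    # -1.0
print(vehicle_effect)  # -2.0
print(interaction)     # 4.0
```

With these made-up numbers the interaction is larger than either main effect: blue helps for trucks and hurts for cars, so optimizing color and vehicle type separately gives the wrong answer.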
The advantage of A/B tests is that they can often be done rapidly. Blue or red? Blue. Car or truck? Truck. Done. Now let’s test something else.
If the only choices were a rapid succession of one-factor-at-a-time tests or one big, complicated statistical test of everything, speed might win. But there’s another possibility: a rapid succession of slightly more sophisticated tests.
Suppose you have 9 factors that you’re interested in, and you understandably don’t want to test several replications of 2^9 = 512 possibilities. You might start out with a (fractional) factorial design of 5 of the factors. Say that only one of these factors seems to make much difference, no matter what you pair it with. Next you do another experiment testing 5 factors at a time, the winner of the first experiment and the 4 factors you haven’t tested yet. This lets you do two small experiments rather than one big one.
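The post doesn't say which fractional design to use. As one possible sketch, assume each factor has two levels coded as -1 and +1, and build a standard half fraction for 5 factors by setting the fifth factor equal to the product of the first four. The `half_fraction` function is a name invented for this example.

```python
from itertools import product
from math import prod

def half_fraction(k):
    """Half-fraction design for k two-level factors coded as -1/+1.

    The first k-1 factors take every combination of levels; the last
    factor is set to the product of the others (an assumed generator).
    """
    return [base + (prod(base),) for base in product((-1, 1), repeat=k - 1)]

design = half_fraction(5)
print(len(design))  # 16 cells instead of 2**5 = 32

# Each factor appears at its high and low level equally often.
for j in range(5):
    assert sum(run[j] for run in design) == 0

# Every pair of factor columns is orthogonal.
for i in range(5):
    for j in range(i + 1, 5):
        assert sum(run[i] * run[j] for run in design) == 0

# Two such 5-factor experiments versus one full 9-factor experiment.
print(2 * len(design), "cells versus", 2 ** 9)  # 32 cells versus 512
```

Under these assumptions, the two small experiments together use 32 cells per replication rather than 512, at the cost of not being able to estimate every higher-order interaction separately.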
Note that in this example you’re assuming that the factors that didn’t matter in the first experiment wouldn’t have important interactions with the factors in the second experiment. And your assumption might be wrong. But you’re making an educated guess, based on data from the first experiment. This is less than ideal, but it’s better than the alternative of testing every factor one at a time, assuming that no interactions matter. Assuming that some interactions don’t matter, based on data, is better than making a blanket assumption that no interactions matter, based on no data.
Testing more than one factor at a time can be efficient for screening as well as for finding interactions. It can help you narrow in on the variables you need to test more thoroughly.
Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
One thought on “Another problem with A/B testing: interaction effects”
Fisher’s classic text on this also highlights a third benefit (besides efficiency and interactions) of generality, in that a factorial experiment allows you to explore the response over a range of values of the parameters of interest (e.g. a variety of shades of colour). That work also emphasises the importance of randomization to avoid bias: https://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers