Another problem with A/B testing: interaction effects

The previous post looked at a paradox with A/B testing: your final result may depend heavily on the order of your tests. This post looks at another problem with A/B testing: the inability to find interaction effects.

Suppose you’re debating between putting a photo of a car or a truck on your web site, and you’re debating between whether the vehicle should be red or blue. You decide to use A/B testing, so you test whether customers prefer a red truck or a blue truck. They prefer the blue truck. Then you test whether customers prefer a blue truck or a blue car. They prefer the blue truck.

Maybe customers would prefer a red car best of all, but you didn’t test that option. By testing vehicle type and color separately, you didn’t learn about the interaction of vehicle type and color. As Andrew Gelman and Jennifer Hill put it [1],

Interactions can be important. In practice, inputs that have large main effects also tend to have large interactions with other inputs. (However, small main effects do not preclude the possibility of large interactions.)

Notice that sample size is not the issue. Suppose you tested the red truck against the blue truck with 1000 users and found that 88.2% preferred the blue truck. You can be quite confident that users prefer the blue truck to the red truck. Suppose you also used 1000 users to test the blue truck against the blue car and this time 73.5% preferred the blue truck. Again you can be confident in your results. But you failed to learn something that you might have learned if you’d split 100 users between four options: red truck, blue truck, red car, blue car.

Experiment size

This is an example of a factorial design, testing all combinations of the factors involved. Factorial designs seem impractical because the number of combinations can grow very quickly as the number of factors increases. But if it’s not practical to test all combinations of 10 factors, for example, that doesn’t mean that it’s impractical to test all combinations of two factors, as in the example above. It is often practical to use a full factorial design for a moderate number of factors, and to use a fractional factorial design with more factors.

If you only test one factor at a time, you’re betting that interaction effects don’t matter. Maybe you’re right, and you can optimize your design by optimizing each variable separately. But if you’re wrong, you won’t know.

Agility

The advantage of A/B tests is that they can often be done rapidly. Blue or red? Blue. Car or truck? Truck. Done. Now let’s test something else.

If the only options were between a rapid succession of tests of one factor at a time or one big, complicated statistical test of everything, speed might win. But there’s another possibility: a rapid succession of slightly more sophisticated tests.

Suppose you have 9 factors that you’re interested in, and you understandably don’t want to test several replications of 29 = 512 possibilities. You might start out with a (fractional) factorial design of 5 of the factors. Say that only one of these factors seems to make much difference, no matter what you pair it with. Next you do another experiment testing 5 factors at a time, the winner of the first experiment and the 4 factors you haven’t tested yet. This lets you do two small experiments rather than one big one.

Note that in this example you’re assuming that the factors that didn’t matter in the first experiment wouldn’t have important interactions with the factors in the second experiment. And your assumption might be wrong. But you’re making an educated guess, based on data from the first experiment. This is less than ideal, but it’s better than the alternative of testing every factor one at a time, assuming that no interactions matter. Assuming that some interactions don’t matter, based on data, is better than making a blanket assumption that no interactions matter, based on no data.

Testing more than one factor at a time can be efficient for screening as well as for finding interactions. It can help you narrow in on the variables you need to test more thoroughly.

Related posts

[1] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

A/B testing and a voting paradox

A B testing

One problem with A/B testing is that your results may depend on the order of your tests.

Suppose you’re testing three options: X, Y, and Z. Let’s say you have three market segments, equal in size, each with the following preferences.

Segment 1: X > Y > Z.

Segment 2: Y > ZX.

Segment 3: Z > X > Y.

Now suppose you test X against Y in an A/B test, then test the winner against Z. Segments 1 and 3 prefer X to Y, so X wins the first round of testing. Now you compare X to Z. Segments 2 and 3 prefer Z to X, so Z wins round 2 and is the overall winner.

Now let’s run the tests again in a different order. First we test Y against Z. Segments 1 and 2 will go for Y. Then in the next round, Y against X, segments 1 and 3 prefer X, so X is the overall winner. So one way of running the tests results in Z winning, and another way results in X winning.

Can we arrange our tests so that Y wins? Yes, by testing X against Z first. Z wins the first round, and Y wins in the second round.

The root of the problem is that group preferences are not transitive. We say that preferences are transitive if when someone prefers a to b, and they prefer b to c, then they prefer a to c. We implicitly assumed that each segment has transitive preferences. For example, when we said that the first segment’s preferences are X > Y > Z, we meant that they would rank X > Y,  Y > Z, and X > Z.

Individuals (generally) have transitive preferences, but groups may not. In the example above, the market at a whole prefers X to Y, prefers Y to Z, but prefers Z to X. The segments have transitive preference but the market does not. This is known as the Condorcet voting paradox.

Voting

This is not purely hypothetical. Our example is simplified, but it reflects a phenomenon that does happen in practice. It has been observed in voting. Constituencies in a legislature may have transitive preferences while the legislature as a whole does not. This opens the possibility of manipulating the final outcome by controlling the order in which items are voted on. In the example above, someone who knows the preferences of the groups could make any of the three outcomes the winner by picking the order of A/B comparisons.

Political scientists have looked back at congressional voting records and found instances of this happening, and can roughly determine when someone first discovered the technique of rigging sequential votes. They can also roughly point to when legislators became aware of the manipulation and learned that they sometimes need to vote against their actual preferences in one vote in order to get a better outcome at the end of the sequence of votes. (I think this was around 1940, but my memory could be wrong.) Political scientists call this sophisticated voting, as opposed to naive voting in which one always votes according to honest preferences.

Market research

The voting example is relevant to market research because it shows that intransitive group preferences really happen. But unlike in voting, customers respond honestly to A/B tests. They don’t even know that they’re part of an A/B test.

In the example above, we come away from our test believing that we have a clear winner. In both rounds of testing, the winner gets twice as many responses as the loser. The large margin in each test is misleading.

Any of the three options could be the winner, depending on the order of testing, but none of the options is any better than the others. So in the example we don’t so much make a bad choice, but we have too much confidence in our choice.

But now suppose the groups are not all the same size. Suppose the three segments represent 45%, 35%, and 20% of the market respectively. We can still have any option be the final winner, depending on the order of testing. But now some rests are better than others. If we tested all three options at once in an A/B/C test, we’d learn that a plurality of the market prefers X, and we’d learn that there is no option that the market as a whole prefers.

Related posts

Illegible work

When James Scott uses the word legible, he doesn’t refer to handwriting that is clear enough to read. He uses the word more broadly to mean something that is easy to classify, something that is bureaucrat-friendly. A thing is illegible if it is hard to pigeonhole. I first heard the term from Venkatesh Rao’s essay A Big Little Idea Called Legibility.

Much of the work I do is illegible. If the work were legible, companies would have an employee who does it [1] and they wouldn’t call a consultant.

Here’s a template for a conversation I’ve had a few times:

“We’ve got kind of an unusual problem. It’s related to some things I’ve seen you write about. Have you heard of …?”

“No, what’s that?”

“Never mind. We’d like you to help us with a project. …”

Years ago, when people heard that I worked in statistics they’d ask what programming language I worked in. They expected me to say R or SAS or something like that, but I’d say C++. Not that I recommend doing statistics in C++ [2] in general, but people came to me with unusual projects that they couldn’t get done with standard tools. If an R module would have done what they wanted, they wouldn’t have knocked on my door.

Doing illegible work is a lot of fun, but it’s hard to market. Try typing “Someone who can help with a kinda off the wall math / computer science project” into Google. It’s not helpful. Search engines can only answer questions that are legible to search engines. Illegible work is more likely to come from word of mouth than from a search engine.

***

[1] Sometimes companies call a consultant because they have occasional need for some skill, something they do not need often enough to justify hiring a full-time employee to do. Or maybe they have the skills in house to do a project but don’t have anyone available. Or maybe they want an outside auditor. But in this post I’m focusing on weird projects.

[2] When I mention C++, I know some people are thinking “But isn’t C++ terribly complex?” Why yes, yes it is. But my colleagues and I already knew C++, and we stuck to a sane subset of the language. It was not unusual to rewrite R code in C++ and make it 100x faster.

“Why don’t you just use C?” These days I’m more likely to write C than C++.  Clients don’t want me to write enterprise applications, just small numerical libraries, and they usually ask for C.

What use is mental math in 2022?

Now that most people are carrying around a powerful computer in their pocket, what use is it to be able to do math in your head?

Here’s something I’ve noticed lately: being able to do quick approximations in mid-conversation is a superpower.

Zoom call

When I’m on Zoom with a client, I can’t say “Excuse me a second. Something you said gave me an idea, and I’d like to pull out my calculator app.” Instead, I can say things like “That would require collecting four times as much data. Are you OK with that?”

There’s no advantage to being able to do calculations to six decimal places on the spot like Mr. Spock, and I can’t do that anyway. But being able to come up with one significant figure or even an order-of-magnitude approximation quickly keeps the conversation flowing.

I have never had a client say something like “Could you be more precise? You said between 10 and 15, and our project is only worth doing if the answer is more than 13.2.” If they did say something like that, I’d say “I will look at this more carefully offline and get back to you with a more precise answer.”

I’m combining two closely-related but separate skills here. One is the ability to simple calculations. The other is the ability to know what to calculate, how to do so-called Fermi problems. These problems are named after Enrico Fermi, someone who was known for being able to make rough estimates with little or no data.

A famous example of a Fermi problem is “How many piano tuners are there in New York?” I don’t know whether this goes back to Fermi himself, but it’s the kind of question he would ask. Of course nobody knows exactly how many piano tuners there are in New York, but you could guess about how many piano owners there are, how often a piano needs to be tuned, and how many tuners it would take to service this demand.

The piano tuner example is more complicated than the kinds of calculations I have to do on Zoom calls, but it may be the most well-known Fermi problem.

In my work with data privacy, for example, I often have to estimate how common some combination of personal characteristics is. Of course nobody should bet their company on guesses I pull out of the air, but it does help keep a conversation going if I can say on the spot whether something sounds like a privacy risk or not. If a project sounds feasible, then I go back and make things more precise.

Related links

A Bayesian approach to pricing

Suppose you want to determine how to price a product and you initially don’t know what the market is willing to pay. This post outlines some of the things you might think about, and how Bayesian modeling might help.

This post is not the final word on the subject, or even my final word on the subject. It is essentially a reply to a friend’s question turned into a blog post rather than an email.

Prior information

You must have some idea, however vague, what the market value of your product is. If you had absolutely no idea what a product is worth, you wouldn’t be considering it as a business opportunity.

There is always prior information. As a former colleague would say, when you want to measure the distance to the moon, you know not to pick up a yard stick. Whenever you do an experiment, something motivated you to do the experiment.

This is an ideal application of Bayesian statistics because you have valuable prior information before you have data. Until you have a moderate amount of data, your prior information may be more useful than your data.

Some people will say you should only act on data, not on subjective prior knowledge, but this is impossible. When you offer your product for the first time, you have no data. All you have to go on is prior information. You could hide your prior information in the design of an experiment rather than making it explicit in a prior distribution, but it’s still there.

Model

Assume the market price for some product can be modeled by a random variable X that depends on some parameter θ. I’m not saying that the price is random in any philosophical sense, only that it is useful to model it as random. More on this line of thinking here.

By modeling market price as a random variable rather than a single number, we’re acknowledging that it has some fuzziness to it. Different customers are willing to pay different amounts for the same product. Maybe the prices they’re willing to pay are tightly distributed around some center, or maybe there’s substantial variance.

When we make a sale, or fail to make a sale, we learn something about θ. But notice that we don’t observe X per se, we observe whether a particular sample from X was above or below the offer price. You’re not conducting a survey where you ask “What is the highest price p that you’d be willing to pay” and get a candid answer. You make an offer x, and it is either accepted or rejected. You observe whether x < p or x > p.

This means the likelihood function is similar to what you’d see in modeling survival data, but a little different. When someone dies, you fully observe their survival time. But if you follow up with someone and they’re still alive, you only know a lower bound on their survival time, not the survival time itself. We say the data is censored because we haven’t yet observed everything we want to know.

Survival data is usually asymmetric, censored on one side but not the other. You could have two-sided censoring, but that’s less common.

With pricing your data is always censored in both directions. You either get a lower bound or an upper bound on what someone would have been willing to pay.

After each offer and response, you can update your estimate of θ. Each interaction gives you a better idea of the distribution on θ.

Pricing

Now suppose after numerous observations you’re moderately confident in your knowledge of θ. Now what?

One response is “Well then you charge what the market is most likely to bear.” That’s kind of a simplistic optimization. It implicitly assumes you’re OK with a 50% chance of a sale going through. Maybe your business is struggling and you don’t have many leads. Then you want a higher conversion rate. Or maybe you’re doing well, have plenty leads, and are OK with a low conversion rate. This is especially the case if the distribution on market price has a lot of variance; if the variance is low it makes more sense to think of “the” price as if it were a single number.

So far I have implicitly assumed that the only consequence of asking too much is a lost sale. But if you ask for too much, you might lose future sales, even if you get the current sale. I’ve also assumed that customers always prefer lower prices. That’s not the case. Asking too little for a product can hurt your credibility, for example.

Estimating willingness to pay is complicated, and determining what to do once you’ve made that estimate is complicated as well. This post is just a sketch of the thought process a company might go through.

Related posts

Wire gauge and user perspective

wire gauge measurement device

Wire gauge is a perennial source of confusion: larger numbers denote smaller wires. The reason is that gauge numbers were assigned from the perspective of the manufacturing process. Thinner wires require more steps in production. This is a common error in user interface design and business more generally: describing things from your perspective rather than from the customer’s perspective.

Restaurants

When you order food at a restaurant, the person taking your order may rearrange your words before repeating them back to you. The reason may be that they’re restating it in manufacturing order, the order in which the person preparing the food needs the information.

Rheostats

A rheostat is a device for controlling resistance in an electrical circuit. It would seem natural for an engineer to give a user a control to vary resistance in Ohms. But Ohm’s law says

V = IR,

i.e. voltage equals current times resistance. Users expect that when they turn a knob clockwise they get more of something—brighter lights, louder music, etc.—and that means more voltage or more current, which means less resistance. Asking users to control resistance reverses expectations.

If I remember correctly, someone designed a defibrillator once where a knob controlled resistance rather than current. If that didn’t lead to someone dying, it easily could have.

Research

When I worked for MD Anderson Cancer Center, I managed the development of software for clinical trial design and conduct. Our software started out very statistician-centric and became more user-centric. This was a win, even for statisticians.

The general pattern was to move from eliciting technical parameters to eliciting desired behavior. Tell us how you want the design to behave and we’ll solve for the parameters to make that happen. Sometimes we weren’t able to completely automate parameter selection, but we were at least able to give the user a head start in knowing where to look.

Relax

Technical people don’t always want to have their technical hat on. Sometimes they want to relax and be consumers. When statisticians wanted to crank out a clinical trial design, they wanted software that was easy to use rather than technically transparent. That’s the backstory to this post.

It’s generally a good idea to conceal technical details, but provide a “service panel” to expose details when necessary.

Related posts

Black Swan Gratification

Psychologists say that random rewards are more addictive than steady, predictable rewards. But I believe this only applies to relatively frequent feedback. If rewards are too infrequent, there’s no emotional connection between behavior and reward. The connection becomes more intellectual and less visceral as feedback becomes less frequent and less predictable.

Nassim Taleb distinguishes between delayed gratification and random gratification in his foreword to the book Safe Haven by Mark Spitznagel.

There are activities with remote payoff and no feedback that are ignored by the common crowd. … So what this idea is about isn’t delayed gratification but the ability to operate without gratification — or rather, with random gratification.

Choosing a course of action that is certain to pay off a year from now is opting for delayed gratification. Choosing something that is likely to pay off eventually, maybe two years from now, or maybe next week, is opting for random gratification.

Random rewards encourage an addictive response to frequent feedback, and discourage a rational response to infrequent feedback.

The solution is to act on principle, rather than respond like the rats in the psychological studies alluded to above.

 

Not-to-do list

There is an apocryphal [1] story that Warren Buffett once asked someone to list his top 25 goals in order. Buffett then told him that he should avoid items 6 through 25 at all costs. The idea is that worthy but low-priority goals distract from high-priority goals.

Paul Graham wrote something similar about fake work. Blatantly non-productive activity doesn’t dissipate your productive energy as unimportant work does.

I have a not-to-do list, though it’s not as rigorous as the “avoid at all costs” list that Buffett is said to have recommended. These are not hard constraints, but more like what optimization theory calls soft constraints, more like stiff springs than brick walls.

One of the things on my not-to-do list is work with students. They don’t have money, and they often want you to do their work for them, e.g. to write the statistical chapter of their dissertation. It’s easier to avoid ethical dilemmas and unpaid invoices by simply turning down such work. I haven’t made exceptions to this one.

My softest constraint is to avoid small projects, unless they’re interesting, likely to lead to larger projects, or wrap up quickly. I’ve made exceptions to this rule, some of which I regret. My definition of “small” has generally increased over time.

I like the variety of working on lots of small projects, but it becomes overwhelming to have too many open projects at the same time. Also, transaction costs and mental overhead are proportionally larger for small projects.

Most of my not-to-do items are not as firm as my prohibition against working with students but more firm than my prohibition against small projects. These are mostly things I have pursued far past the point of diminishing return. I would pick them back up if I had a reason, but I’ve decided not to invest any more time in them just-in-case.

Sometimes things move off my not-to-do list. For example, Perl was on my not-to-do list for a long time. There are many reasons not to use Perl, and I agree with all of them in context. But nothing beats Perl for small text-munging scripts for personal use.

I’m not advocating my personal not-to-do list, only the idea of having a not-to-do list. And I’d recommend seeing it like a storage facility rather than a landfill: some things may stay there a while then come out again.

I’m also not advocating evaluating everything in terms of profit. I do lots of things that don’t make money, but when I am making money, I want to make money. I might take on a small project pro bono, for example, that I wouldn’t take on for work. I heard someone say “Work for full rate or for free, but not for cheap” and I think that’s good advice.

***

[1] Some sources say this story may be apocryphal. But “apocryphal” means of doubtful origin, so it’s redundant to say something may be apocryphal. Apocryphal does not mean “false.” I’d say a story might be false, but I wouldn’t say it might be apocryphal.

More stability, less stress

It’s been eight years since I started my consulting business. Two of the things I love about having my own business are the stability and the reduced stress. This may sound like a joke, but I’m completely serious.

Having a business is ostensibly less stable and more stressful than having a salaried job, but at a deeper level it can be more stable and less stressful.

If you are an employee, you have one client. If you lose that client, you lose 100% of your income. If you have a business with a dozen clients, losing a client or two at the same time is disappointing, but it’s not devastating.

As for stress, I prefer the stress of owning a business to the stresses of employment. My net stress level dropped when I went out on my own. My sleep, for example, improved immediately.

At first I never knew where the next project was coming from. But I found this less stressful than office politics, questioning the value of my work, a lack of correlation between my efforts and my rewards, etc.

If you’re thinking of striking out on your own, I wish you well. Here is some advice I wrote a few years ago that you may find helpful.

Simultaneous projects

I said something to my wife this evening to the effect that it’s best for employees to have one or at most two projects at a time. Two is good because you can switch off when you’re tired of one project or if you’re waiting on input. But with three or more projects you spend a lot of time task switching.

She said “But …” and I immediately knew what she was thinking. I have a lot more than two projects going on. In fact, I would have to look at my project tracker to know exactly how many projects I have going on right now. How does this reconcile with my statement that two projects is optimal?

Unless you’re doing staff augmentation contracting, consulting work is substantially different from salaried work. For one thing, projects tend to be smaller and better defined.

Also consultants, at least in my experience, spend a lot of time waiting on clients, especially when the clients are lawyers. So you take on more work than you could handle if everyone wanted your attention at once. At least you work up to that if you can. You balance the risk of being overwhelmed against the risk of not having enough work to do.

Working for several clients in a single day is exhausting, but that’s usually not necessary. My ideal is to do work for one or two clients each day, even if I have a lot of clients who are somewhere between initial proposal and final invoice.