An opinionated case against A/B testing: Exploration-vs-Exploitation

Motivation

Nowadays, a lot of companies (especially tech companies) have embraced data-driven decision making. Almost every business decision, such as what background color to use for the company website or whether to release a new kitty photo filter, requires justification with experimental performance metrics. A/B testing has become the de facto tool for trying out and comparing multiple ideas/solutions (aka arms, or options) simultaneously, often on production traffic such as user requests. In business, A/B testing is a general term for the practice of conducting randomized controlled trials (aka experiments) on real/production users.

A simple A/B test

Let's consider an oversimplified A/B test on a company homepage background color, in which we want to decide which color, red or blue, should be used. Say the CEO agrees to give me 20% of the website traffic to experiment on for one month. I then get a fair coin with a 50% chance of landing heads or tails. Whenever a user request comes in, I flip the coin: if it's heads, the user sees red; otherwise, the user sees blue. This process applies to up to 20% of the overall traffic, which splits more or less equally into the red group and the blue group (since I use a fair coin). I also need to decide which key performance metrics to measure on the experimental traffic. Let's say that for each user who comes to the website, I measure how long they stay on the site (in minutes) as a proxy for user engagement/happiness with the background color; the longer users stay, the better. After one month of experimenting, I compare the average happiness scores of the red and blue groups, and it turns out the red group has the higher score. I conclude that red is better and propose the red color to the CEO (the final decision maker). That's it: a simple yet typical example of an A/B test.
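To make the mechanics concrete, here is a minimal simulation sketch of that coin-flip assignment and comparison. The dwell-time distributions are entirely made up for illustration; in a real test the "observation" would come from production logging.

```python
import random
import statistics

def assign_color():
    """Flip a fair coin: heads -> red, tails -> blue."""
    return "red" if random.random() < 0.5 else "blue"

def observe_minutes_on_site(color):
    """Stand-in for the real measurement; the means below are invented."""
    mean = 5.3 if color == "red" else 5.0
    return max(0.0, random.gauss(mean, 2.0))

durations = {"red": [], "blue": []}
for _ in range(10_000):  # requests routed into the 20% experiment slice
    color = assign_color()
    durations[color].append(observe_minutes_on_site(color))

for color, values in durations.items():
    print(color, round(statistics.mean(values), 2), "avg minutes on site")
```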

More (boring) details

A typical A/B test usually includes a single baseline (aka the Control group) and one or more experimental arms (aka Treatment groups).

A/B testing isn't a new or exotic tool at all; in fact, A/B tests (aka experiments) are conducted daily by, e.g., scientists, for testing drugs, discovering new materials and chemical substances, developing new vaccines, evaluating new medical treatments, etc. With all due respect to the rigorous science behind A/B testing, I believe it's a pretty solid and reasonable tool for decision making. In this post, however, I want to make a case against A/B testing: it is NOT a silver bullet, and it may not work well in many practical scenarios.

When A/B testing fails

Let's revisit the simple red-or-blue example above. You may notice some naivety in my conclusion that red is the better color. With just one month of data, how dare I conclude that red is better than blue? It turns out the experiment ran during the Christmas season, and people seemed to prefer red (who doesn't like a red-and-white Santa Claus with big presents :)); somehow people seeing red stayed on the website longer, which tilted my happiness metric in red's favor. Sadly, soon after the Christmas season was over, traffic to the red homepage dropped drastically and many users voiced their dissatisfaction with the red color!

So you see the first flaw of a short- or mid-term A/B test: confounding factors (often outside the experimenter's control) can corrupt the performance metrics and lead to the wrong conclusion, in hindsight.

As a diehard A/B tester, I could argue that if I were permitted to run the experiment for longer, say a full year, I'd surely figure out the best color. There's no way I could get approval for that given the big risk of losing users. But even if I could run the experiment for that long, I might end up observing no significant difference between red and blue in the key metrics. How come? It turns out, in hindsight again, that people prefer red during winter and fall, and blue during spring and summer. Another possible scenario is that the user base shifts over time: when the experiment starts, a majority of users favor red (aka red users), but during the year red users churn and blue users become the majority. So, averaging over the whole year, blue and red perform comparably, each having its own good and bad periods. Just by observing the results at the end, no conclusion can be drawn about which color is better.

The key assumption of A/B testing

The key underlying assumption of A/B testing is that there is a single winner, applicable to all user traffic over a long period of time.

When this assumption holds, A/B testing works really well and helps us figure out the clear winner quickly. But how often does this assumption hold in practice? Many interesting/critical real-world problems don't have a single clear winning solution for all users, let alone over a long period of time. Arguably, it's more common to be in a situation where different user groups prefer different solutions and the set of working solutions shifts over time.

In other words, multiple solutions should co-exist, each serving its own user base over time!

Even when A/B testing works well (and there exists a single clear winner), it still has a critical drawback: during the experimental phase, a significant share of users is exposed to sub-optimal solutions, since A/B testing usually divides traffic into equal groups and only one group gets the winning treatment.

This situation incurs regret: a loss of opportunity for the experiment runner (who fails to deliver the best experience to users), while the sub-optimal experience hurts users and may turn them away from the business.
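To put a rough number on that regret, here is a back-of-the-envelope sketch with invented figures: if one color truly yields 5.5 average minutes on site and the other 5.0, a 50/50 split sends half of the experimental traffic to the worse option for the entire duration of the test.

```python
# All numbers are hypothetical, purely to illustrate the cost of a 50/50 split.
experimental_requests = 1_000_000
mean_best, mean_worst = 5.5, 5.0      # true average minutes on site per visit
gap = mean_best - mean_worst

# Half of the requests land on the worse arm during the experiment.
regret_minutes = (experimental_requests / 2) * gap
print(f"Engagement left on the table: {regret_minutes:,.0f} minutes")  # 250,000
```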

So what are the alternatives to A/B testing when it fails? To be fair, the problem with A/B testing is more about how we interpret and make decisions based on the experiment results than about the experiment itself (experimentation is good!). Often, after a short period of exploring multiple options equally, a single option is declared the winner and deployed to all traffic.

That's an explore-then-exploit (aka explore-then-commit) strategy, nothing more!
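Here is what that strategy looks like in code: a minimal explore-then-commit sketch, where `reward_fn` is a hypothetical hook returning the happiness metric observed for one request served with a given arm.

```python
import random

def explore_then_commit(arms, pulls_per_arm, horizon, reward_fn):
    """Try every arm equally for a while (the A/B test), then commit to the best-looking one."""
    totals = {arm: 0.0 for arm in arms}
    # Exploration phase: uniform traffic split, exactly like a classic A/B test.
    for arm in arms:
        for _ in range(pulls_per_arm):
            totals[arm] += reward_fn(arm)
    winner = max(arms, key=lambda arm: totals[arm] / pulls_per_arm)
    # Commit phase: every remaining request gets the "winner", no matter what.
    for _ in range(horizon):
        reward_fn(winner)
    return winner

# Hypothetical usage with a made-up reward function.
winner = explore_then_commit(
    arms=["red", "blue"],
    pulls_per_arm=5_000,
    horizon=100_000,
    reward_fn=lambda arm: random.gauss(5.3 if arm == "red" else 5.0, 2.0),
)
print("Committed to:", winner)
```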

The problem is that we commit (too soon) to a single solution for all users over the long term, which is usually sub-optimal. A natural alternative is to keep serving all options in proportion to how well each one performs, i.e., to keep a good balance between exploration and exploitation!

Explore vs Exploit

The exploration-vs-exploitation trade-off is at the core of sequential decision making. To perform well over the long run, it's critical to juggle exploration and exploitation effectively. We need a flexible strategy that can adapt to the dynamics of the environment it operates in.

Too abstract? Let's revisit the red-or-blue problem. Instead of committing to red after the experiment, what if we keep serving both red and blue, but with a catch: the traffic each color receives should be proportional to how well it currently performs. When red is doing better (based on the user happiness metric), we allocate more traffic to red than to blue, and vice versa when blue is doing better. The key here is that the traffic for each option is scaled dynamically over time according to user preference; the better a color performs, the more traffic it receives.

This sounds too good to be true; how can we do it? One popular approach to explore-vs-exploit situations across multiple options is the multi-armed bandit, a family of algorithms for decision making with strong theoretical support and empirical successes. Multi-armed bandits are a key ingredient in many successful real-world applications, such as Netflix artwork recommendation, website optimization, and ads optimization. A detailed treatment of multi-armed bandit algorithms is obviously out of the scope of this post; please refer to the References section for good resources on the topic.

The key point to remember here is that multi-armed bandits allow for continuously exploring multiple options while exploiting the best performers at any given time, with the amount of exploration each option receives proportional to its performance.
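As one concrete example, here is a minimal Thompson sampling sketch (one member of the multi-armed bandit family), assuming the happiness metric is binarized, e.g., "the user stayed longer than some threshold". The traffic each color receives automatically tracks how well it has been doing.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling: traffic share follows each arm's performance."""

    def __init__(self, arms):
        # One Beta(successes + 1, failures + 1) posterior per arm.
        self.stats = {arm: {"successes": 0, "failures": 0} for arm in arms}

    def choose(self):
        # Sample a plausible success rate for each arm; serve the arm whose sample wins.
        draws = {
            arm: random.betavariate(s["successes"] + 1, s["failures"] + 1)
            for arm, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, happy):
        self.stats[arm]["successes" if happy else "failures"] += 1

# Usage: route each request through choose(), then feed the observed outcome back.
bandit = ThompsonSampler(["red", "blue"])
color = bandit.choose()
bandit.update(color, happy=True)  # e.g., the user stayed past a dwell-time threshold
```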

Astute readers may wonder how to incorporate contextual information, such as user profiles, into the decision making here, instead of relying solely on performance metrics. Yes: contextual bandits are a family of algorithms within multi-armed bandits that take contextual information into account when determining which option to serve at a given time for a particular context. For example, for a user with a strong preference for and watch history of anime, the algorithm should probably explore anime for this user, even though the anime genre doesn't perform as well overall as other genres (like romantic drama) at the time.
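Below is a minimal sketch of the contextual idea, using the crudest possible approach: an independent Beta-Bernoulli model per (context bucket, option) pair. Real contextual bandit algorithms such as LinUCB generalize across contexts with a learned model instead of separate counters, but the decision loop is the same. The context labels and genres are made up for illustration.

```python
import random
from collections import defaultdict

class PerContextThompson:
    """Crudest contextual bandit: one Beta-Bernoulli posterior per (context, arm)."""

    def __init__(self, arms):
        self.arms = list(arms)
        # counts[context][arm] = [successes, failures]
        self.counts = defaultdict(lambda: {arm: [0, 0] for arm in self.arms})

    def choose(self, context):
        stats = self.counts[context]
        draws = {arm: random.betavariate(s[0] + 1, s[1] + 1) for arm, s in stats.items()}
        return max(draws, key=draws.get)

    def update(self, context, arm, reward):
        self.counts[context][arm][0 if reward else 1] += 1

# Hypothetical usage: context buckets and genres are invented for illustration.
recommender = PerContextThompson(["anime", "romantic_drama", "thriller"])
genre = recommender.choose(context="anime_fan")
recommender.update(context="anime_fan", arm=genre, reward=True)  # the user watched it
```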

So should we abandon A/B testing for multi-armed bandits? It depends on your specific problem.

A middle-ground proposal

Despite its limitations, A/B testing remains a widely adopted practice because it works reasonably well at relatively low cost (compared to multi-armed bandits) in practice. To end this long post, I want to propose some middle-ground approaches that take the explore-vs-exploit trade-off into account.

Proposal 1: Continuously experiment and reevaluate the status quo to search for something better

Committing to a single winner in the short term is reasonable; it's like accepting a local optimum at low cost. However, we should avoid sticking to a single winner for too long. Instead, a good strategy is to continuously run short A/B tests to reevaluate the status quo and promote new, better options when appropriate, as sketched below.
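A rough sketch of that champion/challenger loop; `run_short_ab_test` is a hypothetical hook into your experimentation platform that runs a brief test and returns each option's average happiness metric.

```python
def reevaluate_status_quo(champion, challengers, run_short_ab_test, min_lift=0.02):
    """Re-test the current winner periodically; promote a challenger only on a clear win.

    `run_short_ab_test` is assumed to return a dict like {"red": 5.1, "blue": 5.4}.
    """
    results = run_short_ab_test([champion] + challengers)
    best = max(results, key=results.get)
    # Promote only if a challenger beats the champion by a meaningful margin.
    if best != champion and results[best] >= results[champion] * (1 + min_lift):
        return best
    return champion
```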

Proposal 2: Incorporate a simple exploration strategy in addition to exploiting the current best option

Consider adopting a simple exploration strategy from the multi-armed bandit literature, for instance epsilon-greedy: serve the current best option most of the time, but route a small, fixed fraction of traffic to the other options so their performance estimates stay fresh, as sketched below.
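A minimal epsilon-greedy sketch, assuming a running average happiness metric is kept per option elsewhere:

```python
import random

def epsilon_greedy_choice(avg_happiness, epsilon=0.1):
    """With probability epsilon, explore a random option; otherwise exploit the best one.

    `avg_happiness` maps each option to its running average metric,
    e.g. {"red": 5.1, "blue": 5.4}.
    """
    if random.random() < epsilon:
        return random.choice(list(avg_happiness))      # explore: any option, uniformly
    return max(avg_happiness, key=avg_happiness.get)   # exploit: current best option
```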

References

  1. Bandit Algorithms for Website Optimization by John Myles White: an old yet good introductory book on basic bandit algorithms.
  2. Bandit Algorithms by Tor Lattimore and Csaba Szepesvári: an excellent book with a rigorous theoretical treatment of bandit algorithms.