Valuing A/B Testing: What's a Winner Worth?

David J. Neff
Mar 13, 2018 · 6 min read

Co-authored by Jared Huber, Data Science Lead & David J Neff, VP of Consulting

Copyright Clearhead: An Accenture Studio

We know that as you read this, you likely have an early (or intermediate) testing program that is helping to drive user experience changes and your product roadmap. As you move into later stages of the optimization journey, estimating the ongoing ROI of A/B testing "winners" will become an important part of building a flourishing A/B testing program. These revenue projections (although we suggest you also argue for other KPIs around user experience!) can help determine the value of A/B testing and be used to justify ongoing investment to executives and others.

Done poorly, or with overly aggressive assumptions, the same revenue projections can undermine the broader credibility of testing. Projection calculations that no one can agree on (or believe) can jeopardize overall adoption.

Despite this risk, we find there isn't strong agreement within the eCommerce/testing community on how these calculations should be done. In our experience at Clearhead, and now Accenture, even the most sophisticated A/B testing organizations have not yet adopted the financial modeling and forecasting tools that represent best practice elsewhere in the enterprise.

So why not? We don't think testing/eCommerce teams are trying to hide anything. We believe a big contributing factor is that A/B testing as a tactic has its roots in a very different department than demand and financial forecasting. If the tactic had grown up in finance or supply chain, we think this kind of modeling would be commonplace. Because it didn't, testing orgs are on the outside looking in at those methods, much like our marketing friends calculating attribution models.

As it stands, the push and pull comes from other business units that run their business in a predictable, forecasted way, and from testing orgs that want to show enough rigor to be seen as credible without over-investing in prediction.

So what’s the real problem here?

Testing programs traditionally project the value of testing "wins" into the future to quantify the value of testing and justify continued investment in the program. For instance: "because we implemented a feature on the site that won in our experiment, we anticipate future incremental revenue of X." Simple projection methods, such as a static Order Conversion Rate (OCR) uplift applied over the course of the following year, necessarily make assumptions about a future that is difficult to predict. Here's an example of one of those calculations:

Annual traffic × (Observed winning-variant OCR − Current OCR) × Current site AOV = "Incremental Annual Revenue" from productionalizing the winner
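To make that concrete, here is a minimal sketch of the calculation in Python; the traffic, conversion rate, and AOV figures are hypothetical placeholders, not numbers from any real program.

```python
# Static (finite-life) ROI projection for a winning A/B test variant.
# All inputs below are hypothetical placeholders for illustration only.

annual_traffic = 12_000_000   # projected sessions over the next year
control_ocr = 0.0250          # current order conversion rate
variant_ocr = 0.0265          # observed OCR of the winning variant
site_aov = 85.00              # current average order value, in dollars

# Incremental orders expected if the observed uplift held for a full year
incremental_orders = annual_traffic * (variant_ocr - control_ocr)

# "Incremental Annual Revenue" from productionalizing the winner
incremental_annual_revenue = incremental_orders * site_aov

print(f"Incremental annual revenue: ${incremental_annual_revenue:,.0f}")
# -> Incremental annual revenue: $1,530,000
```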

The problems with this approach may not be immediately apparent (we outline them in more detail below), but the biggest is that it assumes the uplift observed during a short-lived experiment, lasting weeks or at most a few months, will persist into the future for a full year.

On the other end of the quick-and-dirty to painstakingly-correct spectrum is what is likely the most accurate method of measuring ongoing test "value": measuring the empirical decay of an experiment after it ends by retaining a long-lived "universal holdout" group. No assumptions or complicated math are required; we are actually measuring the uplift on an ongoing basis. In reality, though, many organizations aren't set up for this type of measurement, and attempting it may become an operational drain.

The Tradeoff

More complex methods of determining ROI uplift, though likely more accurate, often require difficult operational tradeoffs, larger amounts of traffic to the pages, and additional investment in the exercise of projecting uplift. We've seen instances where this takes away from other investments, such as the value-creating exercise of finding user problems and testing user experiences. In a world of constrained resources, every retest we run precludes us from executing a new test and uncovering additional learnings about customer preferences. Even if we assume no opportunity cost for retesting or persisting a universal control, we are necessarily excluding some portion of traffic from seeing the "winner" in order to gain confidence in our answer, which lowers ROI. (It's outside the scope of this post, but interested readers should look into multi-armed bandits, which are designed to minimize this opportunity cost.)

Testing organizations, then, as part of larger data-driven organizations, need to strike the right balance between believable/defensible and accurate. As shown below, we believe there are pros and cons to each approach that help frame the conversation.

How do people approach this?

Static ROI approach, assuming a finite useful life

This approach measures a down-funnel KPI like Order Conversion Rate (OCR) for both control and variants during an "experiment" phase. The experiment ends upon reaching traffic thresholds; if a variant outperformed the control, the uplift is calculated and applied to the next full year.

Pros

  • Easiest of all the approaches to understand, and less likely to be challenged as a result of its transparency.
  • Assumes uplift applies for a single year only, which bakes in some conservatism, especially for changes to longer-lived portions of the site.

Cons

  • Assumes a static uplift over the course of the year.
  • Ignores the novelty effect (as returning visitors get "used to" the newer experience, the initially observed lift flattens out).
  • Ignores interactions as other changes to the site affect the productionalized experience.
  • In other words, it assumes that the conditions that existed during the experiment will persist for the next year.
  • Often projects the highest revenue estimate of all the options discussed here, so it may be seen as less credible.

Decaying / discounted ROI approach

This approach measures the uplift using an A/B test during the "experiment" phase. The experiment ends upon reaching traffic thresholds; if a variant outperformed the control, the uplift is calculated and applied to the following period. Instead of applying the full uplift to the entire year, the uplift is applied at 100% immediately after the experiment and the delta then decays over time using a mechanism such as straight-line or exponential decay. This approach also uses a shorter projection period, typically 3 to 6 months.
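As a rough sketch, here is the same placeholder math with an exponential decay applied month by month over a six-month horizon; the 20% monthly decay rate is an assumption chosen purely for illustration, not a recommendation.

```python
import math

# Decaying ROI projection: apply the observed uplift at 100% in the first
# month, then let it decay exponentially over a 6-month horizon.
# All figures are hypothetical placeholders for illustration only.

monthly_traffic = 1_000_000   # projected sessions per month
control_ocr = 0.0250
variant_ocr = 0.0265
site_aov = 85.00
horizon_months = 6
monthly_decay_rate = 0.20     # assumed 20% decay per month

full_monthly_uplift = monthly_traffic * (variant_ocr - control_ocr) * site_aov

projected_revenue = 0.0
for month in range(horizon_months):
    # Fraction of the original uplift assumed to persist in this month
    remaining = math.exp(-monthly_decay_rate * month)
    projected_revenue += full_monthly_uplift * remaining

print(f"Projected incremental revenue over {horizon_months} months: "
      f"${projected_revenue:,.0f}")
```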

Pros

  • Straight-line or exponential decay results in a more conservative ROI estimate than the static approach; depending on the culture of the organization, this conservatism can lend additional credibility to the calculations and to the program as a whole.

Cons

  • Revenue projections will be smaller than those of the first approach. For organizations trying to justify future investment, this may be a hurdle to overcome.
  • Deflates future revenue by an arbitrary decay rate compared to the first approach, perhaps detrimentally so: it can create a sense of false precision by appearing more accurate than it really is.

Measure the empirical decay of an experiment post-experiment

This approach measures uplift during the experiment phase as in the previous approaches; however, you then continue to measure the effect of the winning variant against a control group over a long period of time to better understand the novelty effect and decay. A few strategies can be employed for this: Universal Holdback, Victory Lap, or Re-tests (a minimal measurement sketch for the Universal Holdback case follows the list below).

  • Universal Holdback — Productionalize winning variants for 95% of traffic; use 5% ongoing as a control group against which to measure the majority of traffic.
  • Victory Lap — Occasionally measure the effectiveness of tests deemed winners. For instance, run a split test combining all winning variants over the last 3 months against a control experience to confirm the additive uplift of those individual experiments.
  • Re-tests — Re-test individual, winning tests after 6 months to confirm that “control” still underperforms (and the rate at which it does).
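For the Universal Holdback strategy, a minimal measurement sketch might look like the following; the 95/5 split, the column names, and the monthly rollup figures are all hypothetical, and a real program would layer significance testing on top of this.

```python
import pandas as pd

# Empirical decay measurement with a universal holdback:
# 95% of traffic sees the productionalized winner, 5% stays on the old control.
# The rollup below is made-up example data, grouped by month and experience.
data = pd.DataFrame({
    "month":    ["2018-04", "2018-04", "2018-05", "2018-05", "2018-06", "2018-06"],
    "group":    ["winner", "holdback", "winner", "holdback", "winner", "holdback"],
    "sessions": [950_000, 50_000, 940_000, 49_000, 960_000, 51_000],
    "orders":   [25_200, 1_260, 24_700, 1_250, 25_100, 1_290],
})

# Order conversion rate per group per month
data["ocr"] = data["orders"] / data["sessions"]

# Put the winner and holdback OCR side by side for each month
monthly = data.pivot(index="month", columns="group", values="ocr")

# Observed relative uplift of the winner over the holdback, month by month;
# if this shrinks over time, that is the empirical decay of the win.
monthly["relative_uplift"] = monthly["winner"] / monthly["holdback"] - 1

print(monthly[["winner", "holdback", "relative_uplift"]])
```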

Pros

  • Allows teams to more accurately measure the rate of uplift realized over time.
  • Allows teams to measure the cumulative effect of a testing program (and not just the summed effect of individual tests).
  • Allows teams to analyze interaction effects between prior winning tests (and develop hypotheses and audience insights ongoing).

Cons

  • Difficult operationally unless “productionalization” happens via 100% experiments in a SaaS tool.
  • Pushes per-hit and per-visit costs up on the SaaS testing tool.
  • New tests require extra, fragile development (and additional QA) involved with stacking experiences.
  • Opportunity costs:
      • Not immediately productionalizing winning tests (especially with the Victory Lap) means less revenue from testing.
      • A portion of traffic is tied up "confirming" past findings versus learning new things about customer behavior and preferences.

So, here are the real, real questions. Which of these are you using for your teams and programs in 2018? As you continue your journey into 2019, what are you interested in exploring? Leave us a note in the comments.
