When to A/B test, and when not to

Most teams under-experiment on pricing and over-experiment on copy. Here's the heuristic I use.

I’ve sat through enough experiment review meetings to feel certain about one thing: most teams have inverted their experimentation budget. They run a dozen tests a quarter on button colors and onboarding copy, and one or two on pricing — usually in a way that’s underpowered to detect anything but a catastrophe.

The asymmetry costs them money. Worse, it costs them learning. Here’s the heuristic I use to decide whether something deserves an A/B test or some other kind of investigation.

The heuristic

Ask three questions about the decision you’re about to test:

  1. How reversible is the result? Can I roll it back in a day, a week, or never?
  2. How heterogeneous is the effect? Will the answer be wildly different across cohorts, or roughly uniform?
  3. How well-instrumented is the dependent metric? Can I read the result at 95% confidence within a reasonable window?

A/B testing is the right tool when the result is reversible, the effect is roughly uniform, and the metric is well-instrumented. Three for three.

Two for three is a judgement call.

One for three or zero for three is a strong signal that you should be doing something other than an A/B test.
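
If it helps to see the scoring written down, here is a minimal sketch in Python. The names (Decision, score, recommend) are mine and purely illustrative, and each of the three answers is collapsed to a yes/no judgement, which is cruder than the questions themselves. The two sample decisions use the scores I give copy and pricing in the next section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """A decision under consideration. All three fields are judgement calls."""
    name: str
    reversible: bool         # question 1: can it be rolled back cheaply?
    uniform: bool            # question 2: is the effect roughly the same across cohorts?
    well_instrumented: bool  # question 3: can the metric be read in a reasonable window?


def score(decision: Decision) -> int:
    """Count how many of the three questions get a favorable answer."""
    return sum([decision.reversible, decision.uniform, decision.well_instrumented])


def recommend(decision: Decision) -> str:
    """Map the score to the recommendation described in the text."""
    s = score(decision)
    if s == 3:
        return "A/B test"
    if s == 2:
        return "judgement call"
    return "use another method"


# Illustrative inputs only: the booleans encode my own read of each case.
copy_tweak = Decision("headline copy", True, True, True)
pricing = Decision("pricing change", False, False, False)

print(recommend(copy_tweak))  # A/B test
print(recommend(pricing))     # use another method
```

The function is not the point. Writing the three answers down before the test starts forces the argument about them to happen up front, rather than after the results come in.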

What the matrix tells you

Copy tweaks are reversible (just deploy), roughly uniform (a clearer headline is usually clearer for everyone), and well-instrumented (click-through is the simplest metric in the world). Three for three. Test all of them.
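
To put a rough number on "well-instrumented", here is a sketch of the standard sample-size approximation for comparing two conversion rates. The assumptions are mine, not part of the heuristic: a two-sided z-test at 95% confidence, 80% power, equal traffic per arm, and hypothetical function names. Treat the output as an order-of-magnitude estimate, not a substitute for your experimentation platform's own calculator.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect a relative lift in a rate.

    Uses the normal approximation for a two-sided two-proportion z-test.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_read(baseline: float, relative_lift: float,
                 daily_users_per_arm: int) -> int:
    """Days of traffic needed to collect the required sample in each arm."""
    return math.ceil(users_per_arm(baseline, relative_lift) / daily_users_per_arm)


# Hypothetical inputs: 10% baseline click-through, 10% relative lift,
# 2,000 users per arm per day.
print(users_per_arm(0.10, 0.10))
print(days_to_read(0.10, 0.10, 2000))
```

The required sample grows with the inverse square of the lift you want to detect, so halving the detectable lift roughly quadruples the wait. A fast metric like click-through absorbs that. A metric that itself takes months to mature does not.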

Pricing changes are not reversible without burning trust. They are extremely heterogeneous — new users vs returning, light spenders vs heavy. And the dependent metric (LTV, not week-1 revenue) takes months to read. Zero for three. Don’t A/B test pricing.

So what do you do instead? A few options:

The harder cases

Onboarding flows are usually two-for-three: reversible, uniform-ish, but the dependent metric you care about (D30 or D90 retention) takes a long time to read. A common failure mode is reading week-1 activation as a proxy and shipping changes that look good on activation but eat retention. If you must A/B test onboarding, commit to reading the long-horizon metric before shipping the winner.

Paywall design is usually one-for-three at best. The effect is heterogeneous (high-intent and low-intent users react in opposite directions), the change is only sort of reversible (until churn shows up), and the metric you care about is LTV, which takes months to read. Most paywall A/B tests are theatre.

The real point

A/B testing is a tool, not a worldview. The teams that win the most aren’t the ones with the most experiments. They’re the ones who can distinguish between questions that A/B testing answers well and questions that it answers badly while feeling like it answers them well. The second category is where most of the wasted effort lives.