When to A/B test, and when not to
Most teams under-experiment on pricing and over-experiment on copy. Here's the heuristic I use.
I’ve sat through enough experiment review meetings to feel certain about one thing: most teams have inverted their experimentation budget. They run a dozen tests a quarter on button colors and onboarding copy, and one or two on pricing — usually in a way that’s underpowered to detect anything but a catastrophe.
The asymmetry costs them money. Worse, it costs them learning. Here’s the heuristic I use to decide whether something deserves an A/B test or some other kind of investigation.
The heuristic
Ask three questions about the decision you’re about to test:
- How reversible is the result? Can I roll it back in a day, a week, or never?
- How heterogeneous is the effect? Will the answer be wildly different across cohorts, or roughly uniform?
- How well-instrumented is the dependent metric? Can I read the result at 95% confidence within a reasonable window?
A/B testing is the right tool when the result is reversible, the effect is roughly uniform, and the metric is well-instrumented. Three for three.
Two for three is a judgement call.
One for three or zero for three is a strong signal that you should be doing something other than an A/B test.
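If it helps to make the scoring explicit, here is a minimal sketch of the three-question check. The `Decision` class and `recommend` function are hypothetical names of my own, and the yes/no answers are a simplification: in practice each question is a matter of degree.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """One decision you are considering testing (hypothetical structure)."""
    name: str
    reversible: bool         # can it be rolled back cheaply?
    uniform_effect: bool     # roughly the same answer across cohorts?
    well_instrumented: bool  # readable at 95% confidence in a reasonable window?


def recommend(decision: Decision) -> str:
    """Map the number of 'yes' answers onto the heuristic's three outcomes."""
    score = sum([decision.reversible,
                 decision.uniform_effect,
                 decision.well_instrumented])
    if score == 3:
        return "A/B test"
    if score == 2:
        return "judgement call"
    return "use another method"


# Answers taken from this post's own copy and pricing examples.
copy_tweak = Decision("headline copy", True, True, True)
pricing = Decision("pricing change", False, False, False)

for d in (copy_tweak, pricing):
    print(f"{d.name}: {recommend(d)}")
```

The value is less in the function than in being forced to write down an answer to each question before the test is built.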
What the heuristic tells you
Copy tweaks are reversible (just deploy), roughly uniform (a clearer headline is usually clearer for everyone), and well-instrumented (click-through is the simplest metric in the world). Three for three. Test all of them.
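A back-of-envelope way to check the "well-instrumented" claim for a click-through test is a standard sample-size approximation for comparing two proportions. This is a sketch, not a full power analysis: the 5% baseline rate, 10% relative lift, 80% power, and 20,000 users per day are all illustrative assumptions of mine, and `users_per_arm` and `days_to_read` are hypothetical helper names.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% confidence -> about 1.96
    z_power = NormalDist().inv_cdf(power)          # assumed 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_read(n_per_arm: int, daily_users: int, arms: int = 2) -> int:
    """Days of traffic needed to fill every arm, assuming an even split."""
    return ceil(n_per_arm * arms / daily_users)


# Illustrative inputs only: 5% baseline click-through, 10% relative lift.
n = users_per_arm(baseline_rate=0.05, relative_lift=0.10)
print(f"users per arm: {n}")
print(f"days at 20,000 users/day: {days_to_read(n, daily_users=20_000)}")
```

With these made-up inputs the answer is on the order of 30,000 users per arm, which a mid-sized product can collect in days. The same formula applied to a slow, noisy metric is what turns "a reasonable window" into months.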
Pricing changes are not reversible without burning trust. They are extremely heterogeneous — new users vs returning, light spenders vs heavy. And the dependent metric (LTV, not week-1 revenue) takes months to read. Zero for three. Don’t A/B test pricing.
So what do you do instead? A few options:
- Geo splits. Roll the new price out in one market, hold another, compare 90-day cohorts. Slower, lower-powered, but doesn’t poison your customer base.
- Models, not experiments. Build an elasticity model from historical data and forecast the impact of the new price. Validate the model against past changes. If the model says +12% with 95% CI of [+6%, +18%], that’s better information than an underpowered A/B at +9% ±15%.
- Look at competitors. Pricing is one of the few areas where competitor moves are visible, public, and informative. The discipline of writing down what you’d expect a competitor’s recent move to do, and then watching, is a cheap proxy for an experiment you can’t run.
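To make the comparison in the elasticity example concrete, the sketch below puts both estimates on the same footing. It assumes the "±15%" on the A/B result is the half-width of a 95% interval, which the example does not state outright; `interval`, `width`, and `excludes_zero` are hypothetical helper names.

```python
def interval(point: float, half_width: float) -> tuple[float, float]:
    """Turn 'point ± half_width' into (low, high)."""
    return (point - half_width, point + half_width)


def width(ci: tuple[float, float]) -> float:
    """Total width of an interval."""
    return ci[1] - ci[0]


def excludes_zero(ci: tuple[float, float]) -> bool:
    """True if the interval rules out 'no effect'."""
    low, high = ci
    return low > 0 or high < 0


# Numbers from the example above, in percent.
model_ci = (6.0, 18.0)       # elasticity model: +12%, 95% CI [+6%, +18%]
ab_ci = interval(9.0, 15.0)  # underpowered A/B: +9% ± 15% (assumed 95% interval)

for label, ci in (("model", model_ci), ("A/B", ab_ci)):
    print(f"{label}: {ci}, width {width(ci)}, excludes zero: {excludes_zero(ci)}")
```

Read this way, the A/B interval runs from -6% to +24%: it cannot even rule out that the price change lost money, while the model's interval can. That only holds if the model has been validated against past changes, as the bullet says.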
The harder cases
Onboarding flows are usually two-for-three: reversible, uniform-ish, but the dependent metric you care about (D30 or D90 retention) takes a long time to read. A common failure mode is reading week-1 activation as a proxy and shipping changes that look good on activation but eat retention. If you must A/B test onboarding, commit to reading the long-horizon metric before shipping the winner.
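One way to make that commitment hard to skip is to encode it as a gate in whatever process approves the winner. The rule below is a hypothetical policy of my own, not a standard: `ship_decision`, the interval inputs, and the `max_retention_loss` tolerance are all assumptions you would replace with your own.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (low, high) of the lift, treatment minus control


def ship_decision(activation_ci: Interval,
                  retention_ci: Optional[Interval],
                  max_retention_loss: float = 0.0) -> str:
    """Refuse to ship on the short-horizon proxy alone (hypothetical rule)."""
    if retention_ci is None:
        # D30/D90 retention has not been read yet: activation is not enough.
        return "wait"
    if retention_ci[1] < 0:
        # The whole retention interval is below zero: the change hurts.
        return "do not ship"
    if activation_ci[0] > 0 and retention_ci[0] >= -max_retention_loss:
        # Activation improved and retention loss is within tolerance.
        return "ship"
    return "inconclusive"


# Illustrative values only, in percentage points.
print(ship_decision(activation_ci=(1.0, 4.0), retention_ci=None))
print(ship_decision(activation_ci=(1.0, 4.0), retention_ci=(-3.0, -0.5)))
print(ship_decision(activation_ci=(1.0, 4.0), retention_ci=(0.2, 2.0)))
```

The first case is the important one: a clear activation win with no retention read returns "wait", which is exactly the failure mode described above.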
Paywall design is usually one-for-three at best. The effect is heterogeneous (high-intent and low-intent users react in opposite directions), the change is only sort of reversible (until churn shows up), and the metric you care about is LTV, which takes months to read. Most paywall A/B tests are theatre.
The real point
A/B testing is a tool, not a worldview. The teams that win the most aren’t the ones with the most experiments. They’re the ones who can distinguish between questions that A/B testing answers well and questions that it answers badly while feeling like it’s answering them well. The second category is where most of the wasted effort lives.