When to A/B test, and when not to

I’ve sat through enough experiment review meetings to feel certain about one thing: most teams have inverted their experimentation budget. They run a dozen tests a quarter on button colors and onboarding copy, and one or two on pricing — usually in a way that’s underpowered to detect anything but a catastrophe.

The asymmetry costs them money. Worse, it costs them learning. Here’s the heuristic I use to decide whether something deserves an A/B test or some other kind of investigation.

The heuristic

Ask three questions about the decision you’re about to test:

How reversible is the result? Can I roll it back in a day, a week, or never?
How heterogeneous is the effect? Will the answer be wildly different across cohorts, or roughly uniform?
How well-instrumented is the dependent metric? Can I read the result at 95% confidence in a reasonable window?
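The third question is the easiest one to put a number on before you start. Here is a rough sketch of how I'd estimate the reading window, using the textbook normal-approximation sample size for a two-sided two-proportion z-test. The 5% baseline conversion, 10% relative lift, 5,000 users a day, and the function names are all made-up illustration values, not figures from any real test.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect a relative lift in a rate.

    Uses the standard normal-approximation formula for comparing two
    proportions. It is a planning estimate, not an exact power analysis.
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be a rate between 0 and 1")
    if relative_lift <= 0:
        raise ValueError("relative_lift must be positive")
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    if p2 >= 1:
        raise ValueError("lifted rate must stay below 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_read(baseline: float, relative_lift: float,
                 daily_users: int, arms: int = 2) -> int:
    """Days of traffic needed to fill every arm (hypothetical helper)."""
    total_users = arms * users_per_arm(baseline, relative_lift)
    return ceil(total_users / daily_users)


# Illustrative numbers only: 5% baseline, hoping to detect a 10% relative lift.
print(users_per_arm(baseline=0.05, relative_lift=0.10))
print(days_to_read(baseline=0.05, relative_lift=0.10, daily_users=5_000))
```

Halve the lift you want to detect and the required sample roughly quadruples, which is how a "reasonable window" quietly turns into a quarter.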

A/B testing is the right tool when the result is reversible, the effect is roughly uniform, and the metric is well-instrumented. Three for three.

Two for three is a judgement call.

One for three or zero for three is a strong signal that you should be doing something other than an A/B test.
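If it helps to make the scoring mechanical, here is a minimal sketch of the rule. The `Decision` class, its field names, and the yes/no simplification are my own illustration; real answers are usually fuzzier than a boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Yes/no answers to the three questions for one decision (hypothetical type)."""
    name: str
    reversible: bool
    uniform_effect: bool
    well_instrumented: bool

    def score(self) -> int:
        # Each "yes" counts for one point, so the score is 0 to 3.
        return sum((self.reversible, self.uniform_effect, self.well_instrumented))


def recommend(decision: Decision) -> str:
    """Map the score onto the three outcomes of the heuristic."""
    score = decision.score()
    if score == 3:
        return "A/B test"
    if score == 2:
        return "judgement call"
    return "use another method"


copy_tweak = Decision("headline copy", True, True, True)
onboarding = Decision("onboarding flow", True, True, False)
pricing = Decision("pricing change", False, False, False)

for d in (copy_tweak, onboarding, pricing):
    print(f"{d.name}: {d.score()}/3 -> {recommend(d)}")
```

The value is less in the function than in being forced to write down three explicit answers before anyone opens the experiment tool.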

What the matrix tells you

Copy tweaks are reversible (just deploy), roughly uniform (a clearer headline is usually clearer for everyone), and well-instrumented (click-through is the simplest metric in the world). Three for three. Test all of them.

Pricing changes are not reversible without burning trust. They are extremely heterogeneous — new users vs returning, light spenders vs heavy. And the dependent metric (LTV, not week-1 revenue) takes months to read. Zero for three. Don’t A/B test pricing.

So what do you do instead? A few options:

Geo splits. Roll the new price out in one market, hold another, compare 90-day cohorts. Slower, lower-powered, but doesn’t poison your customer base.
Models, not experiments. Build an elasticity model from historical data and forecast the impact of the new price. Validate the model against past changes. If the model says +12% with a 95% CI of [+6%, +18%], that’s better information than an underpowered A/B test reading +9% ±15%.
Look at competitors. Pricing is one of the few areas where competitor moves are visible, public, and informative. The discipline of writing down what you’d expect a competitor’s recent move to do, and then watching, is a cheap proxy for an experiment you can’t run.
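To make the model-versus-experiment comparison concrete, here is a small sketch that puts the two estimates from the example side by side. The `Estimate` class is a hypothetical helper; the numbers are the illustrative ones from the text, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A point estimate with an interval, all in percentage points of lift."""
    label: str
    point: float
    low: float
    high: float

    @property
    def width(self) -> float:
        return self.high - self.low

    @property
    def excludes_zero(self) -> bool:
        # True when the whole interval sits on one side of "no effect".
        return self.low > 0 or self.high < 0


model = Estimate("elasticity model", point=12.0, low=6.0, high=18.0)
ab_test = Estimate("underpowered A/B", point=9.0, low=9.0 - 15.0, high=9.0 + 15.0)

for est in (model, ab_test):
    print(f"{est.label}: {est.point:+.0f}% "
          f"[{est.low:+.0f}%, {est.high:+.0f}%], "
          f"width {est.width:.0f} pts, excludes zero: {est.excludes_zero}")
```

The A/B interval is more than twice as wide and straddles zero, so it can't even tell you the sign of the effect. The model's interval is only as trustworthy as its validation against past changes, which is why that step isn't optional.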

The harder cases

Onboarding flows are usually two-for-three: reversible, uniform-ish, but the dependent metric you care about (D30 or D90 retention) takes a long time to read. A common failure mode is reading week-1 activation as a proxy and shipping changes that look good on activation but eat retention. If you must A/B test onboarding, commit to reading the long-horizon metric before shipping the winner.
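One way to hold yourself to that commitment is to encode it as a ship gate that refuses to answer until the long-horizon metric has been read. This is a sketch under simplifying assumptions: the `VariantReadout` type and its fields are hypothetical, and the lifts are assumed to have already passed whatever significance check you use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantReadout:
    """Lifts of the variant over control, as fractions (0.08 means +8%)."""
    activation_lift: float           # week-1 activation, readable early
    retention_lift: Optional[float]  # D30 retention; None until the cohort matures


def ship_decision(readout: VariantReadout) -> str:
    """Gate shipping on the long-horizon metric, not the early proxy."""
    if readout.retention_lift is None:
        return "wait"            # the proxy alone never ships anything
    if readout.retention_lift < 0:
        return "do not ship"     # activation win that eats retention
    if readout.activation_lift > 0:
        return "ship"
    return "no winner"


print(ship_decision(VariantReadout(activation_lift=0.08, retention_lift=None)))
print(ship_decision(VariantReadout(activation_lift=0.08, retention_lift=-0.03)))
print(ship_decision(VariantReadout(activation_lift=0.08, retention_lift=0.01)))
```

The first branch is the one that matters: a strong activation number with no retention read is a reason to wait, not a result.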

Paywall design is usually one-for-three at best. The effect is heterogeneous (high-intent and low-intent users react in opposite directions), the change is reversible only until churn shows up, and the metric you care about is LTV, which takes months to read. Most paywall A/B tests are theatre.

The real point

A/B testing is a tool, not a worldview. The teams that win the most aren’t the ones with the most experiments. They’re the ones who can tell apart the questions that A/B testing answers well and the questions it answers badly while feeling like it answers them well. The second category is where most of the wasted effort lives.