    Why Most A/B Tests Are Lying to You

By Editor Times Featured | March 11, 2026 | 15 Mins Read


Thursday. A product manager at a Series B SaaS company opens her A/B testing dashboard for the fourth time that day, a half-drunk cold brew beside her laptop. The screen reads: Variant B, +8.3% conversion lift, 96% statistical significance.

She screenshots the result. Posts it in the #product-wins Slack channel with a celebration emoji. The head of engineering replies with a thumbs-up and starts planning the rollout sprint.

Here's what the dashboard didn't show her: if she had waited three more days (the originally planned test duration), that significance would have dropped to 74%. The +8.3% lift would have shrunk to +1.2%. Below the noise floor. Not real.

If you've ever stopped a test early because it "hit significance," you've probably shipped a version of this mistake. You're in good company. At Google and Bing, only 10% to 20% of controlled experiments generate positive results, according to Ronny Kohavi's research published in the Harvard Business Review. At Microsoft broadly, one-third of experiments prove effective, one-third are neutral, and one-third actively hurt the metrics they were intended to improve. Most ideas don't work. The experiments that "prove" they do are often telling you what you want to hear.

If your A/B testing tool lets you peek at results daily and stop whenever the confidence bar turns green, it's not a testing tool. It's a random number generator with a nicer UI.

The four statistical sins below account for the majority of unreliable A/B test results. Each takes less than fifteen minutes to fix. By the end of this article, you'll have a five-item pre-test checklist and a decision framework for choosing between frequentist, Bayesian, and sequential testing that you can apply to your next experiment Monday morning.


The Peeking Problem: 26% of Your Winners Aren't Real

Every time you check your A/B test results before the planned end date, you're running a new statistical test. Not metaphorically. Literally.

Frequentist significance tests are designed for a single look at a pre-determined sample size. When you check results after 100 visitors, then 200, then 500, then 1,000, you're not running one test. You're running four. Each look gives noise another chance to masquerade as signal.

Evan Miller quantified this in his widely cited analysis "How Not to Run an A/B Test." If you check results after every batch of new data and stop the moment you see p < 0.05, the actual false positive rate isn't 5%.

    It’s 26.1%.

One in four "winners" is pure noise.

The mechanics are simple. A significance test controls the false positive rate at 5% for a single evaluation point. Multiple checks create multiple opportunities for random fluctuations to cross the significance threshold. As Miller puts it: "If you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5%."

Checking results repeatedly and stopping at significance inflates your false positive rate by more than 5x. Image by the author.
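You can see the inflation in a few lines of simulation. The sketch below uses hypothetical parameters (a 5% base rate, ten peeks of 200 visitors each) and runs repeated A/A tests, where both arms are identical, so any "significant" stop is by construction a false positive. The exact rate depends on how often you peek, but it lands far above the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims=2000, batch=200, n_peeks=10,
                                base_rate=0.05, alpha=0.05):
    """A/A simulation: both arms share the same true rate, so every
    'significant' stop is a false positive."""
    false_positives = 0
    for _ in range(n_sims):
        a = rng.binomial(1, base_rate, batch * n_peeks)
        b = rng.binomial(1, base_rate, batch * n_peeks)
        for k in range(1, n_peeks + 1):
            n = k * batch
            conv_a, conv_b = a[:n].sum(), b[:n].sum()
            pooled = (conv_a + conv_b) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se == 0:
                continue  # no conversions yet at this peek
            z = (conv_b - conv_a) / n / se
            if 2 * stats.norm.sf(abs(z)) < alpha:
                false_positives += 1
                break  # stop the "experiment" at the first significant peek
    return false_positives / n_sims

print(peeking_false_positive_rate())  # far above the nominal 0.05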

This is the most common sin in A/B testing, and the most expensive. Teams make product decisions, allocate engineering resources, and report revenue impact to leadership based on results that had a one-in-four chance of being imaginary.

The fix is simple but unpopular: calculate your required sample size before you start, and don't look at the results until you hit it. If that discipline feels painful (and for most teams, it does), sequential testing offers a middle path. More on that in the framework below.

Check your test results after every batch of visitors, and you'll "find" a winner 26% of the time. Even when there isn't one.


The Power Vacuum: Small Samples, Inflated Effects

Peeking creates false winners. The second sin makes real winners look bigger than they are.

Statistical power is the probability that your test will detect a real effect when one exists. The standard target is 80%, meaning a 20% chance you'll miss a real effect even when it's there. To hit 80% power, you need a specific sample size, and that number depends on three things: your baseline conversion rate, the smallest effect you want to detect, and your significance threshold.

Most teams skip the power calculation. They run the test "until it's significant" or "for two weeks," whichever comes first. This creates a phenomenon called the winner's curse.

Here's how it works. In an underpowered test, the random variation in your data is large relative to the true effect. The only way a real-but-small effect reaches statistical significance in a small sample is if random noise pushes the measured effect far above its true value. So the very act of reaching significance in an underpowered test guarantees that your estimated effect is inflated.

When small samples produce significant results, the observed effect is often inflated well above the true value.
Image by the author.
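A quick simulation makes the winner's curse concrete. The sketch below uses hypothetical numbers (a 3.2% baseline, a true lift of 0.2 percentage points, only 2,000 visitors per arm), keeps only the runs that reach significance, and compares their average measured lift to the truth. The filtered-for-significance lifts come out several times larger than the true effect.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def winners_curse(n_per_arm=2_000, base=0.032, true_lift=0.002,
                  n_sims=5_000, alpha=0.05):
    """Among underpowered tests that hit significance, the average
    measured lift is far larger than the true lift."""
    significant_lifts = []
    for _ in range(n_sims):
        conv_a = rng.binomial(n_per_arm, base)
        conv_b = rng.binomial(n_per_arm, base + true_lift)
        p_a, p_b = conv_a / n_per_arm, conv_b / n_per_arm
        pooled = (conv_a + conv_b) / (2 * n_per_arm)
        se = np.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
        z = (p_b - p_a) / se
        if 2 * stats.norm.sf(abs(z)) < alpha and p_b > p_a:
            significant_lifts.append(p_b - p_a)
    return true_lift, float(np.mean(significant_lifts))

true_effect, observed = winners_curse()
print(f"true lift: {true_effect:.3%}, average 'significant' lift: {observed:.3%}")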

A team might celebrate a +8% conversion lift, ship the change, and then watch the actual number settle at +2% over the following quarter. The test wasn't wrong exactly (there was a real effect), but the team based their revenue projections on an inflated number. An artifact of insufficient sample size.

An underpowered test that reaches significance doesn't find the truth. It finds an exaggeration of the truth.

The fix: run a power analysis before every test. Set your Minimum Detectable Effect (MDE) at the smallest change that would justify the engineering and product effort to ship. Calculate the sample size needed at 80% power. Then run the test until you reach that number. No early exits.
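If you have Python handy, the calculation is two calls to statsmodels (assuming it is installed). This sketch uses the article's running example, a 3.2% baseline and a 0.5 percentage point MDE; the number it prints will differ slightly from other calculators because each makes slightly different approximations.

# Sample size for a two-proportion test at 80% power, 5% two-sided alpha.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.032          # current conversion rate
mde = 0.005               # minimum detectable effect, absolute (0.5 pp)

effect = proportion_effectsize(baseline + mde, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"~{round(n_per_variant):,} visitors per variant")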


The Multiple Comparisons Trap

The third sin scales with ambition. Your A/B test tracks conversion rate, average order value, bounce rate, time on page, and click-through rate on the call-to-action. Five metrics. Standard practice.

Here's the problem. At a 5% significance level per metric, the probability of at least one false positive across all five isn't 5%. It's 22.6%.

The math: 1 − (1 − 0.05)^5 = 0.226.

Scale that to 20 metrics (common in analytics-heavy teams) and the probability hits 64.2%. You're more likely to find noise that looks real than to avoid it entirely.

At 20 metrics and a standard 5% threshold, you have a nearly two-in-three chance of celebrating noise.
Image by the author.

Test 20 metrics at a 5% threshold and you have a 64% chance of celebrating noise.

This is the multiple comparisons problem, and most practitioners know it exists in theory but don't correct for it in practice. They declare one primary metric, then quietly celebrate when a secondary metric hits significance. Or they run the same test across four user segments and count a segment-level win as a real result.

Two corrections exist, and major platforms already support them. Benjamini-Hochberg controls the expected proportion of false discoveries among your significant results (less conservative, preserves more power). Holm-Bonferroni controls the probability of even one false positive (more conservative, appropriate when a single wrong call has serious consequences). Optimizely uses a tiered version of Benjamini-Hochberg. GrowthBook offers both.
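In code, applying either correction to a batch of p-values is one function call. The sketch below uses statsmodels with five made-up p-values standing in for five metrics; it illustrates the two corrections, it is not a reproduction of either platform's implementation.

from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for five metrics from a single test run.
p_values = [0.012, 0.031, 0.047, 0.049, 0.20]

for method in ("fdr_bh", "holm"):   # Benjamini-Hochberg, Holm-Bonferroni
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, list(reject), p_adjusted.round(3))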

The fix: declare one primary metric before the test starts. Everything else is exploratory. If you must evaluate multiple metrics formally, apply a correction. If your platform doesn't offer one, you need a different platform.


When "Significant" Doesn't Mean Important

The fourth sin is the quietest and probably the most expensive. A test can be statistically significant and practically worthless at the same time.

Statistical significance answers exactly one question: "Is this result likely due to chance?" It says nothing about whether the difference is big enough to matter. A test with 2 million visitors can detect a 0.02 percentage point lift in conversion with high confidence. That lift is real. It's also not worth a single sprint of engineering time to ship.

The gap between "real" and "worth acting on" is where practical significance lives. Most teams never define it.

Before any test, set a practical significance threshold: the minimum effect size that justifies implementation. This should reflect the engineering cost of shipping the change, the opportunity cost of the test's runtime, and the downstream revenue impact. If a 0.5 percentage point lift translates to $200K in annual revenue and the change takes one sprint to build, that's your threshold. Anything below it is a "true but useless" finding.

The fix: calculate your MDE before the test starts, not just for the power analysis (though it's the same number), but as a decision gate. Even if a test reaches significance, if the measured effect falls below the MDE, you don't ship. Write this number down. Get stakeholder agreement before launch.
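The decision gate can literally be a few lines in your analysis notebook. A minimal sketch, with hypothetical names and the 0.5-point example above as the default threshold:

def should_ship(p_value, observed_lift_pp, mde_pp=0.5, alpha=0.05):
    """Ship only if the result is both statistically and practically significant."""
    statistically_significant = p_value < alpha
    clears_practical_bar = observed_lift_pp >= mde_pp
    return statistically_significant and clears_practical_bar

# A 0.02-point lift at p = 0.001 is real, but it does not ship:
print(should_ship(p_value=0.001, observed_lift_pp=0.02))  # False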


The Bayesian Fix That Doesn't Fix Anything

If you've read this far, a thought might be forming: "I'll just switch to Bayesian A/B testing. It handles peeking. It gives me 'probability of being best' instead of confusing p-values. Problem solved."

This is the most popular misconception in modern experimentation.

Bayesian A/B testing does solve one real problem: communication. Telling a VP "there's a 94% probability that Variant B is better" is clearer than "we reject the null hypothesis at α = 0.05." Business stakeholders understand the first statement intuitively. The second requires a statistics lecture.

But Bayesian testing doesn't solve the peeking problem.

In October 2025, Alex Molas published a detailed simulation study showing that Bayesian A/B tests with fixed posterior thresholds suffer from the same false positive inflation when you peek and stop on success. Using a 95% "probability to beat control" as a stopping rule, checked after every 100 observations, produced false positive rates of 80%. Not 5%. Not 26%. Eighty percent.

David Robinson at Variance Explained reached a parallel conclusion: a fixed posterior threshold used as a stopping rule doesn't control error rates in the way most practitioners assume. The posterior remains interpretable at any sample size. But interpretability isn't the same as error control.
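You can reproduce the flavor of both findings with the same A/A simulation as before, just with a Beta-Binomial posterior and a "probability B beats A" stopping rule instead of a p-value. A sketch with hypothetical parameters (flat Beta(1,1) priors, 20 peeks of 100 visitors); the exact rate depends on how many times you peek, but it sits well above the nominal 5%.

import numpy as np

rng = np.random.default_rng(7)

def bayesian_peeking_fpr(n_sims=1_000, batch=100, n_peeks=20,
                         base_rate=0.05, threshold=0.95, n_draws=4_000):
    """A/A simulation: stop as soon as the posterior P(B > A) clears the
    threshold. With identical arms, every stop is a false positive."""
    false_positives = 0
    for _ in range(n_sims):
        a = rng.binomial(1, base_rate, batch * n_peeks)
        b = rng.binomial(1, base_rate, batch * n_peeks)
        for k in range(1, n_peeks + 1):
            n = k * batch
            s_a, s_b = a[:n].sum(), b[:n].sum()
            post_a = rng.beta(1 + s_a, 1 + n - s_a, n_draws)  # Beta(1,1) prior
            post_b = rng.beta(1 + s_b, 1 + n - s_b, n_draws)
            if (post_b > post_a).mean() > threshold:
                false_positives += 1
                break
    return false_positives / n_sims

print(bayesian_peeking_fpr())  # well above 0.05, despite the "95% threshold"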

None of this means Bayesian methods are useless. For low-stakes directional decisions (picking a blog headline, choosing an email subject line) where Type I error control isn't critical, the intuitive probability framework is genuinely better. For high-stakes product decisions where you need reliable error guarantees, "just go Bayesian" isn't an answer. It's a costume change on the same problem.

Switching from frequentist to Bayesian doesn't solve peeking. It just changes the number you're misinterpreting.

The real solution isn't a switch in methodology. It's a pre-test protocol that forces statistical discipline regardless of which framework you choose.


The Pre-Test Protocol

This is the section the rest of the article was building toward. Everything above established why you need it. Everything below shows what changes once you have it.

The 5-Point Pre-Test Checklist

Run through these five items before pressing "Start" on any A/B test. Each one is pass/fail. If any item fails, fix it before launching.

1. Sample size calculated. Set your MDE (the smallest effect worth shipping). Calculate the required sample size at 80% power and 5% significance using Evan Miller's free calculator or your platform's built-in tool. Example: Baseline conversion 3.2%, MDE 0.5 percentage points → ~25,000 per variant.
2. Runtime fixed and documented. Divide the required sample size by daily eligible traffic. Round up. Add buffer for weekday/weekend variation (minimum 7 full days, even if the sample size is reached sooner). Write down the end date. Example: 8,300 eligible visitors/day, 50,000 total needed → 6 days minimum, rounded to 14 days to capture weekly cycles.
3. One primary metric declared. Write it down before the test starts. Secondary metrics are exploratory only. If you must evaluate multiple metrics formally, apply a Benjamini-Hochberg or Holm-Bonferroni correction. Example: "Primary: checkout conversion rate. Secondary (exploratory): average order value, cart abandonment rate."
4. Practical significance threshold set. Define the minimum effect that justifies implementation. Agree on this with engineering and product stakeholders before launch. If the test reaches statistical significance but falls below this threshold, you don't ship. Example: "Minimum +0.5 percentage points on conversion (worth ~$200K annually, justifies a 2-week sprint)."
5. Analysis method chosen. Pick one: frequentist, Bayesian, or sequential. Document why. Use the decision matrix below. Example: "Sequential testing. Two planned analyses at day 7 and day 14. Alpha spending via O'Brien-Fleming bounds."
Image by the author.

Worked Example: Checkout Flow Test

A mid-market e-commerce team (500K monthly visitors) wants to test a new single-page checkout against their current multi-step flow. Here's how they run the checklist:

1. MDE: 0.5 percentage points (from a 3.2% baseline to 3.7%). At 500K monthly visitors with a $65 average order value, a 0.5pp lift generates roughly $195K in incremental annual revenue. The new checkout costs about 2 weeks of engineering time (~$15K loaded). The ROI clears the bar.

2. Sample size: At 80% power and 5% significance, this requires ~25,000 per variant. 50,000 total.

3. Runtime: 250K monthly visitors reach checkout. That's ~8,300/day. 50,000 total ÷ 8,300/day ≈ 6 days. Rounded to 14 days to capture weekday/weekend effects.

4. Primary metric: Checkout conversion rate. Average order value and cart abandonment tracked as exploratory (no correction needed since they won't drive the ship/no-ship decision).

5. Method: Sequential testing. High traffic, and stakeholders want weekly progress updates. Two pre-planned analyses: day 7 and day 14. Alpha spending via O'Brien-Fleming bounds.

Outcome: At day 7, the observed lift is +0.3 percentage points. The sequential boundary isn't crossed. Continue. At day 14, the lift is +0.6 percentage points. Boundary crossed. Ship it.
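Where do those day-7 and day-14 boundaries come from? For two equally spaced looks, an O'Brien-Fleming-style rule rejects at look k when |Z_k| > C·sqrt(2/k), with C calibrated so the overall false positive rate stays at 5%. The sketch below finds C by brute-force simulation rather than the usual numerical integration; it is only meant to demystify the numbers (roughly 2.8 at the interim look and 2.0 at the final one). In production, use a proper alpha-spending library or your platform's sequential engine.

import numpy as np

rng = np.random.default_rng(11)

def obf_constant_two_looks(alpha=0.05, n_sims=500_000):
    """Calibrate C so that rejecting when |Z_k| > C * sqrt(2 / k) at two
    equally spaced looks gives an overall type I error of alpha."""
    # Under the null, the interim z-stat (half the data) and the final
    # z-stat (all the data) are correlated like this:
    z_interim = rng.standard_normal(n_sims)
    z_second_half = rng.standard_normal(n_sims)
    z_final = (z_interim + z_second_half) / np.sqrt(2)
    # Smallest C whose overall crossing probability is at most alpha.
    for c in np.arange(1.90, 2.30, 0.001):
        crossed = (np.abs(z_interim) > c * np.sqrt(2)) | (np.abs(z_final) > c)
        if crossed.mean() <= alpha:
            return c
    return None

c = obf_constant_two_looks()
print(f"final-look boundary ≈ {c:.2f}, interim boundary ≈ {c * np.sqrt(2):.2f}")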

Without the protocol: The PM checks daily, sees +1.1 percentage points on day 3 with 93% "significance," and declares a winner. She ships based on a number that's nearly double the truth. Revenue projections overshoot by 83%. The actual lift settles at +0.6 points over the next quarter. Leadership loses trust in the experimentation program.

The best A/B test is the one where you wrote down "what would change our minds?" before pressing Start.


What Rigorous Testing Actually Buys You

At Microsoft Bing, an engineer picked up a low-priority idea that had been shelved for months: a small change to how ad headlines displayed in search results. The change seemed too minor to prioritize. Someone ran an A/B test.

The result was a 12% increase in revenue per search, worth over $100 million annually in the U.S. alone. It became the single most valuable change Bing ever shipped.

This story, documented by Ronny Kohavi in the Harvard Business Review, carries two lessons. First, intuition about what matters is wrong most of the time. At Google and Bing, 80% to 90% of experiments show no positive effect. As Kohavi puts it: "Any figure that looks interesting or different is usually wrong." You need rigorous testing precisely because your instincts aren't good enough.

Second, rigorous testing compounds. Bing's experimentation program identified dozens of revenue-improving changes per month, collectively boosting revenue per search by 10% to 25% annually. This accumulation was a major factor in Bing growing its U.S. search share from 8% in 2009 to 23%.

The fifteen minutes you spend on a pre-test checklist isn't overhead. It's the difference between an experimentation program that compounds real gains and one that ships noise, erodes stakeholder trust, and makes A/B testing look like theater.

That product manager from 3 PM Thursday? She's going to run another test next week. So are you.

The dashboard will still show a confidence percentage. It will still turn green when it crosses a threshold. The UI is designed to make calling a winner feel satisfying and definitive.

But now you know what the dashboard doesn't show. The 26.1%. The winner's curse. The 64% false alarm rate. The Bayesian mirage.

Your next test starts soon. The checklist takes fifteen minutes. The decision matrix takes five. That's 20 minutes between shipping signal and shipping noise.

Which one will it be?


    References

    1. Evan Miller, “How Not To Run an A/B Test”
    2. Alex Molas, “Bayesian A/B Testing Is Not Immune to Peeking” (October 2025)
3. David Robinson, "Is Bayesian A/B Testing Immune to Peeking? Not Exactly", Variance Explained
4. Ron Kohavi, Stefan Thomke, "The Surprising Power of Online Experiments", Harvard Business Review (September 2017)
    5. Optimizely, “False Discovery Rate Control”, Help Documentation
    6. GrowthBook, “Multiple Testing Corrections”, Documentation
    7. Analytics-Toolkit, “Underpowered A/B Tests: Confusions, Myths, and Reality” (2020)
    8. Statsig, “Effect Size: Practical vs Statistical Significance”
    9. Statsig, “Sequential Testing: How to Peek at A/B Test Results Without Ruining Validity”


