On internet products, we constantly want to know “if we ship this new version, will the metric move?” Shipping to 100% of traffic is risky — if the new version is bad, DAU and engagement can drop overnight. The standard A/B test answer is to randomly sample a small fraction of traffic and split that fraction into two slices — one runs the old version, the other runs the new one — then compare the two groups’ metrics after running for a while. The procedure is intuitive and works in practice, but rests on a premise that is easy to miss: “treatment beat control on the sample” is not the same statement as “shipping the new version makes the population better.” Two distinct gaps sit in between: first, a sample-to-population gap — even with perfectly random assignment, the difference observed on a finite sample can be sampling noise that doesn’t reflect any real population-level difference; second, an association-to-causation gap — two things appearing together in the data is not the same as one causing the other (the apparent edge can come from confounders, the Hawthorne effect, SUTVA-violating network spillover, and so on). The entire mathematical scaffolding behind A/B testing exists to bridge these two gaps.
This post introduces the two core pieces of that scaffolding: the potential outcomes framework (Rubin causal model) and frequentist hypothesis testing. They handle two distinct jobs: the framework formalizes the causal effect as a difference between unobservable potential outcomes; hypothesis testing uses randomization and the central limit theorem to recover that unobservable quantity asymptotically from a finite sample. Drop either side and a “+1%” reported by an experiment cannot be honestly translated into “the new version delivered a real +1% lift.”
1. From Population Statistics to Sample-Based Inference
Statistical inference is the practice of using sample-level statistics to back out properties of a population. When we can directly enumerate the entire population, as in a national census, that is not statistical inference — it is a census. In most settings, however, a census is unaffordable, so we draw a sample using some mechanism and use sample-level statistics to estimate population-level parameters. The accuracy of inference depends on whether the sampling mechanism makes the sample “representative” of the population — and representativeness is a mathematical property, not an intuitive one.
A/B testing is one of the canonical industry applications of statistical inference. It originated in medicine, in the randomized controlled trial (RCT) — patients randomized to treatment vs. placebo, with the treatment effect on the sample used to predict the population effect after the drug ships. Internet products borrowed the same machinery for algorithm iteration. Before launching a new recommendation algorithm to all users, shipping it directly is too risky — a bad version can knock down DAU and time-spent immediately. The standard practice is to give the new algorithm to 5% of users and keep 95% on the old one, then compare per-user metrics after a week. Conceptually, this uses a 5% sample to predict “if we ship the new algorithm to everyone, how will the population’s per-user time-spent change?”
But there is a subtlety. After the experiment ends we see the treatment group is up 1.2% over control — can we conclude “the new algorithm lifted per-user time-spent by 1.2%”? Not strictly. Treatment and control are only separated at sampling time; the 5% who got the new algorithm cannot also be “assigned to control” so we can compare the same users under both versions. The quantity we actually want is “for the same user, how much higher is time-spent under the new algorithm than under the old one?” — and that is a counterfactual quantity, not directly observable at the individual level. Translating “between-group difference” into “causal effect” needs a mathematical framework.
2. The Potential Outcomes Framework
2.1 Setup and Potential Outcomes
We use the potential outcomes framework introduced by Neyman in 1923. Let there be $N$ experimental units, indexed by $i = 1, \dots, N$, and let $Z_i \in \{0, 1\}$ denote unit $i$’s assignment: $Z_i = 1$ for treatment, $Z_i = 0$ for control. The metric of interest has two versions:
- $Y_i(1)$: unit $i$’s metric value if assigned to treatment (got the new algorithm).
- $Y_i(0)$: unit $i$’s metric value if assigned to control (got the old algorithm).
These are jointly called the potential outcomes for unit $i$. They both “potentially” exist, but only one of them is actually observable, depending on $Z_i$:
$$Y_i^{\text{obs}} = Z_i \, Y_i(1) + (1 - Z_i) \, Y_i(0).$$
2.2 Individual Causal Effect and the Counterfactual
The individual causal effect for unit $i$ is naturally defined as
$$\tau_i = Y_i(1) - Y_i(0),$$
i.e. the difference between the same unit’s metric under the two treatments. Since we can observe at most one of them, $\tau_i$ is an individual-level quantity that can never be precisely computed. This is the fundamental difficulty of causal inference: every causal effect at the individual level involves an unobservable counterpart state, the so-called counterfactual.
2.3 SUTVA: Two Assumptions That Make Causal Inference Operational
The framework as Neyman originally proposed it is just statistical language. To turn it into a workable tool for causal inference, Rubin (1980) added two core assumptions, jointly known as the Stable Unit Treatment Value Assumption (SUTVA):
- No interference: unit $i$’s potential outcomes depend only on its own $Z_i$, not on the assignments of other units. Industrial counterexample: in social-feed products like WeChat Channels, a treated user gets recommended a great piece of content by the new algorithm and shares it to their feed, so control-group friends now see the same content. The control group’s metric has been “spilled into” by the treatment, breaking no-interference.
- Treatment variation irrelevance: every unit assigned to treatment receives one and the same version of the treatment. Industrial counterexample: a recommendation algorithm rolls out to servers in different regions at different times, so users nominally in “treatment” actually experience several different “new algorithms,” breaking the variation-irrelevance assumption.
SUTVA is silently assumed in most A/B tests and silently breaks in social, recommendation, and two-sided-market settings — experimentation under network effects is a research direction of its own (with cluster randomization and switchback as representative techniques), and this post does not go into it.
2.4 Average Treatment Effect
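Individual causal effects are unestimable, but we can step back and ask about the population average. The Average Treatment Effect (ATE) is
$$\tau = \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr) = \bar{Y}(1) - \bar{Y}(0), \qquad \bar{Y}(z) = \frac{1}{N} \sum_{i=1}^{N} Y_i(z),$$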
i.e. the population’s average metric under treatment minus the population’s average metric under control. The ATE is still defined in terms of unobservable potential outcomes, but under the right experimental design it can be estimated without bias — and the next section shows why “the right experimental design” is precisely “random assignment.”
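To make the definitions concrete, here is a toy table in which both potential outcomes are known for every unit, something no real experiment provides. The numbers, the assignment vector, and the variable names are made up for illustration.

```python
# Hypothetical "god's-eye" table: (Y_i(1), Y_i(0)) in minutes of time-spent.
units = [
    (32.0, 30.0),
    (18.0, 19.0),
    (45.0, 41.0),
    (27.0, 27.0),
]
assignment = [1, 0, 0, 1]  # Z_i: 1 = treatment, 0 = control

# Individual effects Y_i(1) - Y_i(0): computable only in this toy world.
individual_effects = [y1 - y0 for y1, y0 in units]

# The ATE is the average of the individual effects.
ate = sum(individual_effects) / len(units)

# What the experimenter actually sees: Z_i * Y_i(1) + (1 - Z_i) * Y_i(0).
observed = [z * y1 + (1 - z) * y0 for (y1, y0), z in zip(units, assignment)]

print(individual_effects)  # [2.0, -1.0, 4.0, 0.0]
print(ate)                 # 1.25
print(observed)            # [32.0, 19.0, 41.0, 27.0]
```

Only the `observed` list is available in practice; the other two quantities have to be inferred.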
3. Randomization and the Difference-in-Means Estimator
3.1 Why Randomization Defuses the Counterfactual Problem
A key detail of the potential outcomes framework worth pinning down first: every unit simultaneously carries both $Y_i(1)$ and $Y_i(0)$ as intrinsic properties; the assignment $Z_i$ only decides which one we observe, not which one exists. A unit assigned to the treatment group still has its $Y_i(0)$ as a counterfactual sitting behind the scenes (we simply can’t see it), and vice versa. With this in mind, when $Z_i$ is statistically independent of $(Y_i(1), Y_i(0))$, both the treatment group and the control group become random subsets of the same underlying population: the joint distribution of $(Y_i(1), Y_i(0))$ within each group equals, in expectation, the joint distribution over the full population, and therefore matches across the two groups. The claim of “matching” is not “the treatment group’s observed $Y_i(1)$ has the same distribution as the control group’s observed $Y_i(0)$”; those are different quantities by construction. The claim is that each group is an unbiased microcosm of the population’s $(Y(1), Y(0))$. As a direct consequence, the treatment-group sample mean is unbiased for the population mean $\bar{Y}(1) = \frac{1}{N} \sum_i Y_i(1)$, the control-group sample mean is unbiased for $\bar{Y}(0) = \frac{1}{N} \sum_i Y_i(0)$, and any systematic bias that would push one group higher is averaged out by the randomness of assignment.
More formally, in Neyman’s setup we treat $(Y_i(1), Y_i(0))$ as fixed properties of the population (not random variables); the only randomness comes from the assignment $Z_i$. If the experiment guarantees $P(Z_i = 1) = N_1 / N$ uniformly across $i$, where $N_1$ is the number of units assigned to treatment, then $Z_i$ is statistically independent of the potential outcomes by construction. The point of randomization is to use a Bernoulli variable that is independent of the potential outcomes to turn the constraint “each unit can only be observed under one state” into the guarantee “each state has a sub-sample that is unbiased for its own population mean.”
3.2 The Difference-in-Means Estimator
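Building on that observation, we estimate the ATE by
$$\hat{\tau} = \frac{1}{N_1} \sum_{i:\, Z_i = 1} Y_i^{\text{obs}} - \frac{1}{N_0} \sum_{i:\, Z_i = 0} Y_i^{\text{obs}},$$
where $N_1$ and $N_0 = N - N_1$ are the treatment and control group sizes. This is the difference-in-means estimator. The form is straightforward: treatment-group sample mean minus control-group sample mean. Below we show it is an unbiased estimator of $\tau$.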
3.3 The Full Unbiasedness Proof
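Rewrite the sums over the assigned subset as sums over all $i$, gated by the $Z_i$ / $(1 - Z_i)$ indicator, and expand $Y_i^{\text{obs}}$ (using $Z_i^2 = Z_i$ and $Z_i (1 - Z_i) = 0$):
$$\hat{\tau} = \frac{1}{N_1} \sum_{i=1}^{N} Z_i \, Y_i(1) - \frac{1}{N_0} \sum_{i=1}^{N} (1 - Z_i) \, Y_i(0).$$
In Neyman’s framing, $Y_i(1)$ and $Y_i(0)$ are fixed (they are population properties), $N_1$ and $N_0$ are design parameters, and the only random variable is $Z_i$. So the expectation passes through to act on $Z_i$:
$$\mathbb{E}[\hat{\tau}] = \frac{1}{N_1} \sum_{i=1}^{N} \mathbb{E}[Z_i] \, Y_i(1) - \frac{1}{N_0} \sum_{i=1}^{N} \bigl( 1 - \mathbb{E}[Z_i] \bigr) \, Y_i(0) = \frac{1}{N} \sum_{i=1}^{N} Y_i(1) - \frac{1}{N} \sum_{i=1}^{N} Y_i(0) = \tau.$$
The pivotal step is the middle equality: $\mathbb{E}[Z_i] = N_1 / N$ (and hence $1 - \mathbb{E}[Z_i] = N_0 / N$) has to hold uniformly across $i$. This is the formal statement of “random assignment.”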
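The unbiasedness claim can be checked exactly on a tiny population by averaging the estimator over every possible complete randomization. This is a sketch under assumptions: the potential outcomes are made up, and `diff_in_means` and `expected_estimate` are hypothetical helper names.

```python
from itertools import combinations

def diff_in_means(y1, y0, treated):
    """Difference-in-means for one assignment; `treated` is a set of unit indices."""
    control = [i for i in range(len(y1)) if i not in treated]
    mean_t = sum(y1[i] for i in treated) / len(treated)
    mean_c = sum(y0[i] for i in control) / len(control)
    return mean_t - mean_c

def expected_estimate(y1, y0, n_treated):
    """Average the estimator over all equally likely complete randomizations."""
    estimates = [
        diff_in_means(y1, y0, set(chosen))
        for chosen in combinations(range(len(y1)), n_treated)
    ]
    return sum(estimates) / len(estimates)

# Made-up fixed potential outcomes for N = 6 units.
y1 = [32.0, 18.0, 45.0, 27.0, 50.0, 12.0]
y0 = [30.0, 19.0, 41.0, 27.0, 44.0, 13.0]
true_ate = sum(a - b for a, b in zip(y1, y0)) / len(y1)

print(true_ate)
print(expected_estimate(y1, y0, n_treated=2))
```

Any single assignment can land far from the true ATE; only the average over assignments matches it, which is what "unbiased" means here.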
3.4 The Failure Mode: Mid-Experiment Changes to the Assignment Ratio
Anything that lets $P(Z_i = 1)$ vary by $i$ destroys unbiasedness. A failure mode that comes up constantly in industry is changing the treatment-to-control ratio mid-experiment: day 1 splits 1% control / 1% treatment (ratio 1:1), day 2 ramps up to 1% control / 5% treatment (ratio 1:5). If we just look at the cumulative treatment-vs-control sample-mean difference, early users (who entered on day 1) and late users (who only entered treatment on day 2) have different $P(Z_i = 1)$, and the simple difference-in-means absorbs the time-distribution differences from the ramp into the “treatment effect,” producing a Simpson’s-paradox-style bias.
By contrast: a proportional ramp (say both arms scale from 1%/1% to 5%/5%, keeping the ctrl:exp ratio fixed at 1:1) keeps $P(Z_i = 1) = 1/2$ for every user, and the difference-in-means stays unbiased. The failure isn’t about ramping per se; it’s about ramping the ratio. A proportional ramp does introduce time-in-experiment heterogeneity (day-1 users sit in the experiment longer than day-2 newcomers), but the heterogeneity exists symmetrically in both arms. It doesn’t break internal validity (unbiasedness); it just means the estimated ATE applies to the mixed early/late population, and extrapolating to a 100% rollout requires extra reasoning.
The mathematical fix is inverse-probability weighting (see Appendix B). The more reliable engineering fix is to keep the treatment-to-control ratio fixed for the entire run; scaling the total traffic proportionally is fine, but the ratio itself must not change.
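The size of the bias from a ratio change is easy to see with a deterministic calculation. The sketch below makes several assumptions that are not in the text: two cohorts of users, a constant outcome within each cohort (so the result is exact rather than noisy), a higher baseline for late entrants, and late entrants split 1:4 instead of 1:1. `group_means` is a hypothetical helper.

```python
def group_means(cohorts):
    """Pooled treatment and control means across cohorts.

    Each cohort is (n_users, p_treat, baseline, effect); outcomes are constant
    within a cohort, so the pooled means are exact expectations.
    """
    t_sum = t_n = c_sum = c_n = 0.0
    for n_users, p_treat, baseline, effect in cohorts:
        n_t = n_users * p_treat
        n_c = n_users - n_t
        t_sum += n_t * (baseline + effect)
        t_n += n_t
        c_sum += n_c * baseline
        c_n += n_c
    return t_sum / t_n, c_sum / c_n

TRUE_EFFECT = 1.0
# Late entrants have a higher baseline (for example, they arrive on a weekend).
ratio_changed = [(1000, 0.5, 10.0, TRUE_EFFECT), (1000, 0.8, 20.0, TRUE_EFFECT)]
ratio_fixed = [(1000, 0.5, 10.0, TRUE_EFFECT), (5000, 0.5, 20.0, TRUE_EFFECT)]

for name, cohorts in [("ratio changed", ratio_changed), ("ratio fixed", ratio_fixed)]:
    mean_t, mean_c = group_means(cohorts)
    print(name, round(mean_t - mean_c, 3))
```

With the changed ratio the pooled difference comes out near 4.3 even though every user's true effect is 1.0, because the treatment arm is over-weighted toward the high-baseline cohort. With the ratio held fixed, scaling traffic fivefold leaves the difference at 1.0.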
4. Hypothesis Testing: From Estimator to Decision
4.1 Null and Alternative Hypotheses, Type I and II Errors
We have an unbiased estimator $\hat{\tau}$. The remaining step is to turn one observed value of $\hat{\tau}$ into a decision: do we accept or reject the claim that “the new algorithm is effective”? That is the job of hypothesis testing.
We define two mutually exclusive hypotheses:
- Null hypothesis $H_0$: $\tau = 0$, the new algorithm has no effect.
- Alternative hypothesis $H_1$: $\tau \neq 0$, the new algorithm has a positive or negative effect.
A decision rule has two failure modes:
- Type I error (false positive): $H_0$ is in fact true but we reject it. Probability denoted $\alpha$, conventionally 5%.
- Type II error (false negative): $H_1$ is in fact true but we fail to reject $H_0$. Probability denoted $\beta$; the quantity $1 - \beta$ is called the test’s statistical power.
The frequentist philosophy of hypothesis testing rests on “rare events do not happen on a single trial”: if under $H_0$ we observe a sample that should rarely arise, that is grounds to suspect $H_0$ is wrong.
4.2 The Central Limit Theorem
We have a concrete value of $\hat{\tau}$. To translate it into “how rare is this under $H_0$?”, we need to know the distribution of $\hat{\tau}$. This is where the Central Limit Theorem (CLT) enters.
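Lindeberg-Lévy form: let $X_1, \dots, X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 \in (0, \infty)$. Then as $n \to \infty$,
$$\frac{\sqrt{n} \, (\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1), \qquad \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i,$$
where $\xrightarrow{d}$ denotes convergence in distribution.
Sketch of the characteristic-function proof. Let $W_i = X_i - \mu$, so $W_i$ has mean 0 and variance $\sigma^2$. The characteristic function $\varphi_W(t) = \mathbb{E}[e^{i t W_i}]$ has the Taylor expansion near $t = 0$:
$$\varphi_W(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2).$$
Define the standardized sum $S_n = \frac{1}{\sigma \sqrt{n}} \sum_{i=1}^{n} W_i$. By independence,
$$\varphi_{S_n}(t) = \left[ \varphi_W\!\left( \frac{t}{\sigma \sqrt{n}} \right) \right]^n = \left[ 1 - \frac{t^2}{2n} + o\!\left( \frac{1}{n} \right) \right]^n \longrightarrow e^{-t^2/2}.$$
And $e^{-t^2/2}$ is the characteristic function of the standard normal $\mathcal{N}(0, 1)$. By Lévy’s continuity theorem (pointwise convergence of characteristic functions to a limit that is continuous at 0 implies weak convergence of distributions), $S_n$ converges in distribution to $\mathcal{N}(0, 1)$. The full derivation (including the precise meaning of $o(\cdot)$) is in Appendix A.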
Intuitively: for any distribution with finite second moment, summing and averaging dampens the influence of higher-order moments faster than the influence of mean and variance, and only those two survive in the limit — that is why “averages of enough samples are normal.”
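A small simulation shows the convergence on a skewed distribution. The sketch assumes Exponential(1) data (mean 1, standard deviation 1) and looks at the lower tail of the standardized mean, which should hold about 5% of the mass below -1.645 once the normal approximation is good. `standardized_means` and `lower_tail_share` are hypothetical helper names.

```python
import math
import random

def standardized_means(n, reps, seed=0):
    """Draw `reps` samples of size n from Exponential(1); standardize each mean."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (xbar - mu) / sigma)
    return out

def lower_tail_share(values, cut=-1.645):
    """Fraction of values below `cut`; about 0.05 for a standard normal."""
    return sum(v < cut for v in values) / len(values)

for n in (2, 20, 200):
    print(n, lower_tail_share(standardized_means(n, reps=2000)))
```

For $n = 2$ the lower tail is exactly empty, because the sample mean cannot go below 0 and so the standardized value cannot go below $-\sqrt{2} \approx -1.41$. As $n$ grows the share moves toward 0.05.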
4.3 Application to A/B Testing
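Assume the treatment and control samples are i.i.d. from finite-variance populations, with variances $\sigma_1^2$ and $\sigma_0^2$, and that the two groups are independent. In the large-sample regime that practical experiments live in, the CLT and group independence give
$$\hat{\tau} \;\overset{\text{approx}}{\sim}\; \mathcal{N}\!\left( \tau, \; \frac{\sigma_1^2}{N_1} + \frac{\sigma_0^2}{N_0} \right).$$
Replacing the unknown population variances by their sample estimates $s_1^2, s_0^2$, we form the test statistic
$$T = \frac{\hat{\tau}}{\sqrt{s_1^2 / N_1 + s_0^2 / N_0}}.$$
Under $H_0$, $T$ is approximately $\mathcal{N}(0, 1)$. Given a significance level $\alpha$, the two-sided rejection region is $|T| > z_{1 - \alpha/2}$, where $z_q$ denotes the $q$-quantile of the standard normal.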
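As a minimal sketch, the test statistic can be computed from two lists of per-user values with the standard library. The function name `z_statistic` and the sample data are made up, and eight users per arm is far too few for the normal approximation to be trusted in practice.

```python
import math
from statistics import mean, variance

def z_statistic(treatment, control):
    """Return (tau_hat, standard_error, statistic) for two independent samples."""
    tau_hat = mean(treatment) - mean(control)
    # statistics.variance is the unbiased sample variance (divides by n - 1).
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    return tau_hat, se, tau_hat / se

# Made-up per-user time-spent values.
treatment = [31.0, 29.5, 35.2, 30.1, 33.4, 28.8, 32.0, 34.1]
control = [30.2, 28.9, 31.5, 29.7, 30.8, 27.9, 31.1, 29.4]
tau_hat, se, t_stat = z_statistic(treatment, control)
print(tau_hat, se, t_stat)
```

The null hypothesis is rejected at the 5% level when the absolute value of the returned statistic exceeds about 1.96.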
4.4 The p-Value
The p-value is defined as the probability, assuming $H_0$ holds, of observing a result as extreme as the current one or more extreme. Formally, writing $t_{\text{obs}}$ for the observed value of the test statistic $T$,
$$p = P\bigl( |T| \ge |t_{\text{obs}}| \,\big|\, H_0 \bigr) \approx 2 \bigl( 1 - \Phi(|t_{\text{obs}}|) \bigr),$$
where $\Phi$ is the standard normal CDF. The smaller $p$ is, the rarer the observed sample is under $H_0$, and the stronger the evidence for rejecting $H_0$.
A common misreading worth correcting: “$p = 0.05$ means $H_1$ is true with 95% probability” is wrong. A frequentist p-value is “how rare the observed data is under $H_0$,” not “the posterior probability of $H_0$”; the latter is what a Bayesian would compute, and it requires a prior. The American Statistical Association’s 2016 statement on p-values catalogs this and other common misuses.
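The two-sided p-value under the normal approximation is a one-liner with `statistics.NormalDist` (Python 3.8+). The function name `two_sided_p_value` is a hypothetical helper.

```python
from statistics import NormalDist

def two_sided_p_value(t_obs):
    """p = 2 * (1 - Phi(|t_obs|)) under the standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_obs)))

for t_obs in (0.5, 1.96, 3.0):
    print(t_obs, round(two_sided_p_value(t_obs), 4))
```

An observed statistic of 1.96 maps to a p-value of about 0.05, which is why 1.96 is the familiar two-sided cutoff at the 5% level.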
4.5 Confidence Intervals
The interval-version counterpart of the p-value: the $1 - \alpha$ confidence interval for $\tau$ is
$$\hat{\tau} \pm z_{1 - \alpha/2} \sqrt{\frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}}.$$
Its frequentist semantics: “if we repeat the same experimental design many times and compute a 95% CI each time, about 95% of those intervals will cover the true $\tau$.” It does not say “this particular interval covers $\tau$ with 95% probability.” The true $\tau$ is fixed; either it lies in your one realized interval or it does not. The probability lives in the sampling procedure, not in the parameter.
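The repeated-experiment reading of a confidence interval can be simulated directly. The sketch below assumes normally distributed outcomes with made-up parameters and a known true effect; `mean_var`, `confidence_interval`, and `coverage` are hypothetical helpers.

```python
import math
import random
from statistics import NormalDist

def mean_var(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)

def confidence_interval(treatment, control, alpha=0.05):
    m_t, v_t = mean_var(treatment)
    m_c, v_c = mean_var(control)
    se = math.sqrt(v_t / len(treatment) + v_c / len(control))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return (m_t - m_c) - z_crit * se, (m_t - m_c) + z_crit * se

def coverage(true_tau, n_per_group, reps, seed=0):
    """Share of repeated experiments whose 95% CI contains the true effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(10.0, 3.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(10.0 + true_tau, 3.0) for _ in range(n_per_group)]
        lo, hi = confidence_interval(treatment, control)
        hits += lo <= true_tau <= hi
    return hits / reps

print(coverage(true_tau=0.5, n_per_group=200, reps=1000))
```

The printed share should sit near 0.95. Each individual interval either contains 0.5 or it does not; the 95% describes the procedure across repetitions.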
4.6 Power and Minimum Sample Size
The significance level $\alpha$ controls the false positive rate but not the false negative rate. The latter is captured by the power $1 - \beta$, which depends on the true effect size $\delta$, the sample size $n$, and the variance $\sigma^2$. Under the symmetric approximation $N_1 = N_0 = n$ and $\sigma_1^2 = \sigma_0^2 = \sigma^2$, the two-sided $z$-test’s power at effect $\delta$ is approximately
$$1 - \beta \approx \Phi\!\left( \frac{|\delta|}{\sigma \sqrt{2 / n}} - z_{1 - \alpha/2} \right).$$
Inverting this for a target power $1 - \beta$ (commonly 80%) and a target detectable effect $\delta$ gives the per-group sample size
$$n \approx \frac{2 \sigma^2 \, (z_{1 - \alpha/2} + z_{1 - \beta})^2}{\delta^2}.$$
This formula is the starting point for the practical “minimum sample size” calculation: any well-run A/B test should estimate and freeze $n$ before launching, not “watch and decide as it goes.” The cost of “watching as it goes” is the topic of the next section.
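The two formulas translate directly into code. This sketch assumes equal group sizes and equal variances, as in the approximation above; the function names and the example numbers (a 0.1-minute effect on a metric with standard deviation 3.0) are made up.

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sided z-test to detect `delta` at the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

def approx_power(delta, sigma, n, alpha=0.05):
    """Approximate power, ignoring the negligible far rejection tail."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(delta) / (sigma * math.sqrt(2 / n)) - z_alpha)

n = sample_size_per_group(delta=0.1, sigma=3.0)
print(n, round(approx_power(0.1, 3.0, n), 3))
```

Because $n$ scales with $1 / \delta^2$, halving the detectable effect quadruples the required sample, which is why the target effect has to be pinned down before launch.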
5. Two Real-World Pitfalls
5.1 Fixed-Sample Size and the Peeking Problem
Classical t- and z-tests rest on a premise that is easy to overlook but central: the sample size has to be fixed before the experiment starts, and the test runs once at the end on the full sample. Operationally, businesses often do something else: the dashboard refreshes daily, the PM / data scientist / engineer checks the p-value every morning — “stop and ship if it’s significant, keep running if not.” That practice — repeatedly checking and deciding whether to stop based on what you see — is called peeking.
Peeking inflates the false positive rate. Intuitively: if you check once a day for 14 days and ask “is it significant yet?”, the probability that at least one of those checks shows significance is much higher than 5%. This is a multiple-comparison problem. Johari et al. ran simulations in Peeking at A/B Tests (KDD 2017): under $H_0$, with classical t-tests plus daily peeking, the actual false positive rate can drift from a nominal 5% to over 30%; a 14-day experiment is roughly in that range.
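The inflation is easy to reproduce on simulated A/A experiments, where the null hypothesis is true by construction. The sketch below is not the setup of the cited paper: it assumes standard normal per-user outcomes, a known unit variance, 14 daily looks, and a hypothetical function `false_positive_rate`. The exact inflation depends on the number of looks, the test, and the metric.

```python
import math
import random

def false_positive_rate(days, users_per_day, reps, peek, seed=0):
    """Share of A/A experiments declared significant at |statistic| > 1.96."""
    rng = random.Random(seed)
    daily_sd = math.sqrt(users_per_day)  # sd of a sum of N(0, 1) outcomes
    rejections = 0
    for _ in range(reps):
        n = 0
        sum_t = sum_c = 0.0
        rejected = False
        for day in range(1, days + 1):
            # One draw per arm per day: the sum of that day's N(0, 1) outcomes.
            sum_t += rng.gauss(0.0, daily_sd)
            sum_c += rng.gauss(0.0, daily_sd)
            n += users_per_day
            # Known unit variance keeps the sketch short: SE = sqrt(2 / n).
            stat = (sum_t / n - sum_c / n) / math.sqrt(2.0 / n)
            # With peek=False only the final day's check counts.
            if (peek or day == days) and abs(stat) > 1.96:
                rejected = True
                break
        rejections += rejected
    return rejections / reps

print("single look:", false_positive_rate(14, 100, 4000, peek=False))
print("daily peeks:", false_positive_rate(14, 100, 4000, peek=True))
```

The single-look rate should sit near the nominal 5%, while stopping at the first significant daily check pushes it several times higher.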
The business-side fallout is bizarre: “last year we ran 100 experiments, every one reported a significant +1% lift, and yet the year-end company metric went down.” Not surprising: most of those “+1%” wins came from $H_0$-true experiments that happened to peek into significance early; when stacked onto the company-wide metric they neither replicate nor add up.
The industry fix for peeking is sequential testing. Wald’s classic SPRT uses a likelihood ratio with thresholds $A$ and $B$ as a stopping rule; large platforms today more commonly use the mixture-prior version, mSPRT (Always Valid Inference, Johari et al. 2015), which produces a valid confidence sequence at every moment so the business can stop at any time without breaking Type-I control. The price is that single-shot power is a bit lower than that of the corresponding fixed-sample t-test.
One easy-to-miss premise underpins both: the theoretical guarantees of SPRT and mSPRT rest on the sample sequence being i.i.d., which rarely holds for a real streaming A/B test. A single user’s records across different batches are positively correlated (we unpack this in §5.2), and user behavior carries weekday / weekend and morning / evening structure, so even batch-level summaries aren’t truly independent.
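For intuition, here is a minimal sketch of Wald's SPRT in its simplest textbook form, not the mSPRT used by production platforms. It assumes i.i.d. Gaussian observations with known standard deviation (exactly the premise the paragraph above warns about), a simple alternative at a hypothetical effect `delta`, and Wald's approximate thresholds. `sprt` and `rejection_rate` are hypothetical names.

```python
import math
import random

def sprt(stream, delta, sigma, alpha=0.05, beta=0.20):
    """Wald SPRT for H0: mean = 0 vs H1: mean = delta on N(mean, sigma^2) data.

    Returns (decision, n_used); decision is "reject H0", "accept H0",
    or "undecided" if the stream ends before a threshold is crossed.
    """
    upper = math.log((1 - beta) / alpha)  # Wald's approximate log-threshold
    lower = math.log(beta / (1 - alpha))
    llr = 0.0
    n = 0
    for n, x in enumerate(stream, start=1):
        # Log-likelihood-ratio increment for one Gaussian observation.
        llr += (delta / sigma**2) * (x - delta / 2)
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n

def rejection_rate(true_mean, reps=2000, seed=0):
    """Share of simulated streams on which the SPRT rejects H0."""
    rng = random.Random(seed)
    rejects = 0
    for _ in range(reps):
        stream = (rng.gauss(true_mean, 1.0) for _ in range(5000))
        decision, _ = sprt(stream, delta=0.2, sigma=1.0)
        rejects += decision == "reject H0"
    return rejects / reps

print("H0 true:", rejection_rate(0.0))
print("H1 true:", rejection_rate(0.2))
```

The test is evaluated after every observation, yet the rejection rate under the null stays controlled, in contrast to repeatedly applying a fixed-sample test.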
5.2 i.i.d. Failure and Variance Underestimation
Both the CLT and the variance formula $\mathrm{Var}(\bar{X}_n) = \sigma^2 / n$ require the samples to be i.i.d. (independent and identically distributed). That premise breaks frequently in online A/B tests.
The most common failure mode is a single user contributing many records. In an impression / click experiment, the unit of randomization is the user, but the click data lives at (user, video) granularity. A single user’s many impressions and clicks across a week are obviously positively correlated — a heavy active user who clicks a lot this week tends to click a lot next week.
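The algebraic consequence of positive correlation is direct. Suppose user $i$ contributes $m_i$ records $X_{i1}, \dots, X_{i m_i}$; assume users are independent but within-user records may be correlated. The variance of the total is
$$\mathrm{Var}\Bigl( \sum_i \sum_{j=1}^{m_i} X_{ij} \Bigr) = \sum_i \Bigl[ \sum_{j=1}^{m_i} \mathrm{Var}(X_{ij}) + 2 \sum_{j < k} \mathrm{Cov}(X_{ij}, X_{ik}) \Bigr].$$
When $\mathrm{Cov}(X_{ij}, X_{ik}) > 0$, the true variance is larger than what the i.i.d. formula returns. If the implementation just plugs the record count $n = \sum_i m_i$ into the i.i.d. formula $\sigma^2 / n$, the variance gets understated, the test statistic gets inflated, the p-value gets shrunk, and the false positive rate climbs.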
The standard fix is to aggregate to the analysis unit (per-user aggregation): the fixed-sample t-test itself is fine — the problem is that what we feed it isn’t user-level. The most direct remedy is to lift the analysis unit from “many records” up to “one record per user” — collapse each user’s multiple records into a single per-user statistic and the “users are i.i.d. with each other” assumption is restored. The trade-off is that the metric semantics change from “per record” to “per user” — they are different mathematical objects. This works well when the experiment can run on a daily / weekly aggregated table; for cross-day metrics with composite keys, the offline aggregation cost rises and needs separate engineering.
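The effect of the analysis unit can be seen on simulated A/A experiments. The sketch assumes a simple random-effects model that is not from the text: each user has a persistent level plus independent per-record noise, with 10 records per user. All function names are hypothetical.

```python
import math
import random

def mean_and_se(values):
    """Sample mean and its i.i.d. standard error sqrt(s^2 / n)."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, math.sqrt(s2 / n)

def is_significant(treatment, control):
    m_t, se_t = mean_and_se(treatment)
    m_c, se_c = mean_and_se(control)
    return abs(m_t - m_c) / math.sqrt(se_t**2 + se_c**2) > 1.96

def simulate_arm(rng, n_users, records_per_user):
    """A persistent per-user level makes a user's records positively correlated."""
    users = []
    for _ in range(n_users):
        level = rng.gauss(0.0, 1.0)
        users.append([level + rng.gauss(0.0, 1.0) for _ in range(records_per_user)])
    return users

def false_positive_rates(reps=400, n_users=100, records_per_user=10, seed=0):
    """Return (record-level FPR, user-level FPR) over simulated A/A experiments."""
    rng = random.Random(seed)
    naive = per_user = 0
    for _ in range(reps):
        arm_t = simulate_arm(rng, n_users, records_per_user)
        arm_c = simulate_arm(rng, n_users, records_per_user)
        # Naive: treat every record as an independent observation.
        naive += is_significant([x for u in arm_t for x in u],
                                [x for u in arm_c for x in u])
        # Fix: one aggregated value per user, then the same test.
        per_user += is_significant([sum(u) / len(u) for u in arm_t],
                                   [sum(u) / len(u) for u in arm_c])
    return naive / reps, per_user / reps

print(false_positive_rates())
```

The record-level test rejects far more often than 5% even though there is no effect, while the per-user version stays close to the nominal rate.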
6. Wrapping Up
What an A/B test is really doing is translating “treatment beat control” into “shipping the new version makes the population better” — not a turn of phrase, but a full chain that runs from the potential outcomes framework through randomization, the CLT, and hypothesis testing, and then has to survive the two real-world failure modes of i.i.d. violation and peeking. A few points worth keeping in mind before applying any of this:
- An A/B test is a statistical and causal system on top of an engineering one, not just an engineering one. Traffic splitting, dashboard rendering, and gradual rollout are engineering problems; how to report numbers is a mathematical problem. Get the engineering right and the math wrong, and the “+1%” you publish is noise.
- Randomization is the cheapest currency for causal identification. When randomization isn’t available — observational data, natural experiments — you need additional assumptions to identify causal effects. Difference-in-Differences, instrumental variables, propensity score matching all have failure modes of their own; there is no free identification.
- CLT and i.i.d. are foundational and also leaky. When the sample structure carries hidden positive correlation (multi-record users, social spillover, two-sided markets), variance is underestimated and the false positive rate can drift from the nominal 5% to several tens of percent. Before reading any “significant” result, ask: is my sample really i.i.d.?
- Peeking is a daily trap. Watching a dashboard and stopping the moment something looks significant turns a fixed-sample test into a multiple-comparison test — this is the hardest habit to break on the business side. The industrial fix is to bake sequential testing into the A/B platform so that “looking at any time” is mathematically legal.
On top of these foundations, the industry has been refining a separate set of advanced techniques:
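- Variance reduction: CUPED (Deng et al. WSDM 2013) uses pre-experiment covariates as control variates. The variance shrinks by a factor of $1 - \rho^2$, where $\rho$ is the correlation between the covariate and the metric, so the gain depends on how predictive the pre-experiment data is; the original paper reports roughly a 50% variance reduction, which halves the sample size required for the same confidence-interval width.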
- Experimental design under network effects: when SUTVA’s no-interference clause breaks down — in social, recommendation, and two-sided-market settings — the industry uses cluster randomization (randomize by city or community rather than by individual) and switchback (randomize by time slice rather than by user) to keep spillover inside a unit rather than between units.
Looking at the two together, the “advanced problems” of A/B testing all revolve around one question: how do we shrink the sample-size requirement without losing causal validity?
Appendix A: A Full Proof of the CLT
Three technical details from the characteristic-function sketch in the body deserve to be filled in.
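1. The precise meaning of $o(t^2)$. Write $W = X_i - \mu$ for the centered variable and $\varphi_W$ for its characteristic function, as in the sketch. Strictly speaking, “$\varphi_W(t) = 1 - \sigma^2 t^2 / 2 + o(t^2)$” means there exists a function $r(t)$ with
$$\varphi_W(t) = 1 - \frac{\sigma^2 t^2}{2} + r(t), \qquad \lim_{t \to 0} \frac{r(t)}{t^2} = 0.$$
This expansion needs $W$ to have finite second moment, $\mathbb{E}[W^2] < \infty$. The derivation uses the standard properties of characteristic functions: $\varphi_W$ is twice differentiable with $\varphi_W'(0) = i \, \mathbb{E}[W] = 0$ and $\varphi_W''(0) = -\mathbb{E}[W^2] = -\sigma^2$.
2. Tightening the limit. Fix $t \in \mathbb{R}$. We compute the limit of $\varphi_{S_n}(t)$, the characteristic function of the standardized sum $S_n$, via its log:
$$\begin{aligned} \log \varphi_{S_n}(t) &= n \log \varphi_W\!\left( \frac{t}{\sigma \sqrt{n}} \right) \\ &= n \left[ -\frac{t^2}{2n} + o\!\left( \frac{1}{n} \right) \right] \longrightarrow -\frac{t^2}{2}. \end{aligned}$$
The second line uses $\log(1 + x) = x + O(x^2)$ as $x \to 0$ together with $x = \varphi_W\bigl( t / (\sigma \sqrt{n}) \bigr) - 1 = -\frac{t^2}{2n} + o(1/n)$ (since $t / (\sigma \sqrt{n}) \to 0$ as $n \to \infty$). Exponentiating gives $\varphi_{S_n}(t) \to e^{-t^2/2}$.
3. Weak convergence. By Lévy’s continuity theorem: if a family of characteristic functions $\varphi_n$ converges pointwise to some function $\varphi$ that is continuous at $t = 0$, then the corresponding distributions converge weakly to the distribution with characteristic function $\varphi$. Here $e^{-t^2/2}$ is continuous at $t = 0$, so $S_n \xrightarrow{d} \mathcal{N}(0, 1)$.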
For weaker conditions (Lindeberg’s condition, Feller’s non-identically-distributed version) see any measure-theoretic probability text — for example, Durrett’s Probability: Theory and Examples, Chapter 3.
Appendix B: Correction for Unequal Assignment Probabilities
In §3.4 we noted that when $P(Z_i = 1)$ varies with $i$, the simple difference-in-means estimator is biased. The fix is to attach an inverse probability weight to each observation, an idea originating in the Horvitz-Thompson estimator.
Let $p_i = P(Z_i = 1)$ be known (e.g. from the experimentation platform’s traffic-split log), with $0 < p_i < 1$. Define
$$\hat{\tau}_{\text{IPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{Z_i \, Y_i^{\text{obs}}}{p_i} - \frac{(1 - Z_i) \, Y_i^{\text{obs}}}{1 - p_i} \right].$$
A direct computation taking expectation over $Z_i$, using $\mathbb{E}[Z_i Y_i^{\text{obs}}] = p_i Y_i(1)$ and $\mathbb{E}[(1 - Z_i) Y_i^{\text{obs}}] = (1 - p_i) Y_i(0)$, confirms $\mathbb{E}[\hat{\tau}_{\text{IPW}}] = \tau$ and, critically, does so without requiring $p_i$ to be uniform across $i$.
The cost is that the variance of $\hat{\tau}_{\text{IPW}}$ is larger than that of the difference-in-means estimator, especially when some $p_i$ are close to 0 or 1 and the weights blow up. In practice, the safer approach is to keep $p_i$ uniform by design; IPW is an after-the-fact repair, not a license to adjust traffic mid-experiment. For systematic comparisons of IPW with more robust alternatives (doubly robust estimators, propensity score matching, etc.), see Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015), Chapters 12–15.
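The unbiasedness of the weighted estimator under unequal assignment probabilities can be verified exactly on a tiny population by enumerating every assignment. The sketch assumes independent Bernoulli assignment with known per-unit probabilities; the outcomes, the probabilities, and the helper names are made up.

```python
from itertools import product

def ipw_estimate(y1, y0, z, p):
    """Horvitz-Thompson style IPW estimate of the ATE for one assignment z."""
    total = 0.0
    for y1_i, y0_i, z_i, p_i in zip(y1, y0, z, p):
        y_obs = y1_i if z_i == 1 else y0_i
        total += z_i * y_obs / p_i - (1 - z_i) * y_obs / (1 - p_i)
    return total / len(z)

def assignment_probability(z, p):
    """Probability of assignment z under independent Bernoulli(p_i) draws."""
    prob = 1.0
    for z_i, p_i in zip(z, p):
        prob *= p_i if z_i == 1 else 1 - p_i
    return prob

def expected_ipw(y1, y0, p):
    """Exact expectation over all 2^N assignments."""
    return sum(
        assignment_probability(z, p) * ipw_estimate(y1, y0, z, p)
        for z in product((0, 1), repeat=len(p))
    )

# Made-up potential outcomes; early users have p = 0.5, late users p = 0.8.
y1 = [11.0, 12.0, 10.5, 21.0, 22.5, 20.0]
y0 = [10.0, 10.0, 10.0, 20.0, 21.0, 19.5]
p = [0.5, 0.5, 0.5, 0.8, 0.8, 0.8]
true_ate = sum(a - b for a, b in zip(y1, y0)) / len(y1)

print(true_ate, expected_ipw(y1, y0, p))
```

The two printed numbers agree up to floating-point error. Individual estimates still vary widely across assignments, which is the variance cost described above.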