On internet products, we constantly want to know “if we ship this new version, will the metric move?” Shipping to 100% of traffic is risky — if the new version is bad, DAU and engagement can drop overnight. The standard A/B test answer is to randomly sample a small fraction of traffic and split that fraction into two slices — one runs the old version, the other runs the new one — then compare the two groups’ metrics after running for a while. The procedure is intuitive and works in practice, but rests on a premise that is easy to miss: “treatment beat control on the sample” is not the same statement as “shipping the new version makes the population better.” Two distinct gaps sit in between: first, a sample-to-population gap — even with perfectly random assignment, the difference observed on a finite sample can be sampling noise that doesn’t reflect any real population-level difference; second, an association-to-causation gap — two things appearing together in the data is not the same as one causing the other (the apparent edge can come from confounders, the Hawthorne effect, SUTVA-violating network spillover, and so on). The entire mathematical scaffolding behind A/B testing exists to bridge these two gaps.

This post introduces the two core pieces of that scaffolding: the potential outcomes framework (Rubin causal model) and frequentist hypothesis testing. They handle two distinct jobs: the framework formalizes the causal effect as a difference between unobservable potential outcomes; hypothesis testing uses randomization and the central limit theorem to recover that unobservable quantity asymptotically from a finite sample. Drop either side and a “+1%” reported by an experiment cannot be honestly translated into “the new version delivered a real +1% lift.”

1. From Population Statistics to Sample-Based Inference

Statistical inference is the practice of using sample-level statistics to back out properties of a population. When we can directly enumerate the entire population, as in a national census, that is not statistical inference — it is a census. In most settings, however, a census is unaffordable, so we draw a sample using some mechanism and use sample-level statistics to estimate population-level parameters. The accuracy of inference depends on whether the sampling mechanism makes the sample “representative” of the population — and representativeness is a mathematical property, not an intuitive one.

A/B testing is one of the canonical industry applications of statistical inference. It originated in medicine, in the randomized controlled trial (RCT) — patients randomized to treatment vs. placebo, with the treatment effect on the sample used to predict the population effect after the drug ships. Internet products borrowed the same machinery for algorithm iteration. Before launching a new recommendation algorithm to all users, shipping it directly is too risky — a bad version can knock down DAU and time-spent immediately. The standard practice is to give the new algorithm to 5% of users and keep 95% on the old one, then compare per-user metrics after a week. Conceptually, this uses a 5% sample to predict “if we ship the new algorithm to everyone, how will the population’s per-user time-spent change?”

But there is a subtlety. After the experiment ends we see the treatment group is up 1.2% over control — can we conclude “the new algorithm lifted per-user time-spent by 1.2%”? Not strictly. Treatment and control are only separated at sampling time; the 5% who got the new algorithm cannot also be “assigned to control” so we can compare the same users under both versions. The quantity we actually want is “for the same user, how much higher is time-spent under the new algorithm than under the old one?” — and that is a counterfactual quantity, not directly observable at the individual level. Translating “between-group difference” into “causal effect” needs a mathematical framework.

2. The Potential Outcomes Framework

2.1 Setup and Potential Outcomes

We use the potential outcomes framework introduced by Neyman in 1923. Let there be $n$ experimental units and let $Z_i \in \{0, 1\}$ denote unit $i$’s assignment: $Z_i = 1$ for treatment, $Z_i = 0$ for control. The metric of interest $Y_i$ has two versions:

  • $Y_i(1)$: unit $i$’s metric value if assigned to treatment (got the new algorithm).
  • $Y_i(0)$: unit $i$’s metric value if assigned to control (got the old algorithm).

These are jointly called the potential outcomes for unit $i$: they both “potentially” exist, but only one of them is actually observable, depending on $Z_i$:

$$Y_i^{\text{obs}} = Z_i \cdot Y_i(1) + (1 - Z_i) \cdot Y_i(0).$$

2.2 Individual Causal Effect and the Counterfactual

The individual causal effect for unit $i$ is naturally defined as

$$\tau_i := Y_i(1) - Y_i(0),$$

i.e. the difference between the same unit’s metric under the two treatments. Since we can observe at most one of them, $\tau_i$ is an individual-level quantity that can never be precisely computed. This is the fundamental difficulty of causal inference: every causal effect at the individual level involves an unobservable counterpart state, the so-called counterfactual.

2.3 SUTVA: Two Assumptions That Make Causal Inference Operational

The framework as Neyman originally proposed it is just statistical language. To turn it into a workable tool for causal inference, Rubin (1980) added two core assumptions, jointly known as the Stable Unit Treatment Value Assumption (SUTVA):

  • No interference: unit $i$’s potential outcomes depend only on its own $Z_i$, not on the assignments $\{Z_j\}_{j \neq i}$ of other units. Industrial counterexample: in social-feed products like WeChat Channels, a treated user gets recommended a great piece of content by the new algorithm and shares it to their feed, so control-group friends now see the same content. The control group’s metric has been “spilled into” by the treatment, breaking no-interference.
  • Treatment variation irrelevance: every unit assigned to treatment receives one and the same version of the treatment. Industrial counterexample: a recommendation algorithm rolls out to servers in different regions at different times, so users nominally in “treatment” actually experience several different “new algorithms,” breaking the variation-irrelevance assumption.

SUTVA is silently assumed in most A/B tests and silently breaks in social, recommendation, and two-sided-market settings — experimentation under network effects is a research direction of its own (with cluster randomization and switchback as representative techniques), and this post does not go into it.

2.4 Average Treatment Effect

Individual causal effects $\tau_i$ are unestimable, but we can step back and ask about the population average. The Average Treatment Effect (ATE) is

$$\tau := \frac{1}{n} \sum_{i=1}^{n} \big(Y_i(1) - Y_i(0)\big) = \bar{Y}(1) - \bar{Y}(0),$$

i.e. the population’s average metric under treatment minus the population’s average metric under control. The ATE is still defined in terms of unobservable potential outcomes, but under the right experimental design it can be estimated without bias — and the next section shows why “the right experimental design” is precisely “random assignment.”
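To make the bookkeeping of Section 2 concrete, here is a minimal sketch on a toy table. The six pairs of potential outcomes are invented for illustration; the point is only that the assignment reveals one column per unit, while the individual effects and the ATE need both.

```python
import random

# Hypothetical potential outcomes (Y_i(0), Y_i(1)) for six users, e.g. minutes of time-spent.
# In a real experiment only one entry per row is ever revealed.
potential = [(30.0, 33.0), (12.0, 12.0), (45.0, 41.0),
             (20.0, 26.0), (8.0, 9.0), (60.0, 62.0)]

# Completely randomized assignment: exactly three of the six units are treated.
rng = random.Random(0)
z = [1, 1, 1, 0, 0, 0]
rng.shuffle(z)

# Y_obs = Z * Y(1) + (1 - Z) * Y(0): the assignment selects which outcome we see.
y_obs = [zi * y1 + (1 - zi) * y0 for (y0, y1), zi in zip(potential, z)]

# Individual effects and the ATE need both columns, so they are computable
# here only because the table is simulated.
tau_i = [y1 - y0 for y0, y1 in potential]
ate = sum(tau_i) / len(tau_i)

for (y0, y1), zi, y in zip(potential, z, y_obs):
    hidden = y0 if zi else y1
    print(f"Z={zi}  observed={y:5.1f}  counterfactual (hidden)={hidden:5.1f}")
print(f"ATE = {ate:.3f}")
```

Note that one of the toy users has a negative individual effect even though the ATE is positive; an average says nothing about any single unit.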

3. Randomization and the Difference-in-Means Estimator

3.1 Why Randomization Defuses the Counterfactual Problem

A key detail of the potential outcomes framework worth pinning down first: every unit simultaneously carries both $Y_i(0)$ and $Y_i(1)$ as intrinsic properties; the assignment $Z_i$ only decides which one we observe, not which one exists. A unit assigned to the treatment group still has its $Y_i(0)$ as a counterfactual sitting behind the scenes (we simply can’t see it), and vice versa. With this in mind, when $Z_i$ is statistically independent of $(Y_i(0), Y_i(1))$, both the treatment group and the control group become random subsets of the same underlying population: the joint distribution of $(Y_i(0), Y_i(1))$ within each group equals, in expectation, the joint distribution over the full population, and therefore matches across the two groups. The claim of “matching” is not “the treatment group’s observed $Y(1)$ has the same distribution as the control group’s observed $Y(0)$”; those are different quantities by construction. The claim is that each group is an unbiased microcosm of the population’s $(Y(0), Y(1))$. As a direct consequence, the treatment-group sample mean $\bar{X}_1$ is unbiased for $\bar{Y}(1)$, the control-group sample mean $\bar{X}_0$ is unbiased for $\bar{Y}(0)$, and any systematic bias that would push one group higher is averaged out by the randomness of assignment.

More formally, in Neyman’s setup we treat $\{Y_i(0), Y_i(1)\}_{i=1}^n$ as fixed properties of the population (not random variables); the only randomness comes from the assignment $\{Z_i\}_{i=1}^n$. If the experiment guarantees $\mathbb{P}(Z_i = 1) = n_1/n$ uniformly across $i$, then $Z_i$ is statistically independent of the potential outcomes by construction. The point of randomization is to use a Bernoulli variable that is independent of the potential outcomes to “split” the constraint that each unit can only be observed under one state into “each state has a sub-sample that is unbiased for its own population mean”.

3.2 The Difference-in-Means Estimator

Building on that observation, we estimate the ATE by

$$\hat{\tau} := \frac{1}{n_1} \sum_{Z_i = 1} Y_i^{\text{obs}} - \frac{1}{n_0} \sum_{Z_i = 0} Y_i^{\text{obs}}.$$

This is the difference-in-means estimator. The form is straightforward: treatment-group sample mean minus control-group sample mean. Below we show it is an unbiased estimator of $\tau$.

3.3 The Full Unbiasedness Proof

Rewrite the sums over the assigned subset as sums over all $i$, gated by the $Z_i$ / $(1 - Z_i)$ indicator, and expand $Y_i^{\text{obs}} = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$:

$$\hat{\tau} = \frac{1}{n_1} \sum_{i=1}^{n} Z_i \, Y_i(1) - \frac{1}{n_0} \sum_{i=1}^{n} (1 - Z_i) \, Y_i(0).$$

In Neyman’s framing, $Y_i(1)$ and $Y_i(0)$ are fixed (they are population properties), $n_1$ and $n_0$ are design parameters, and the only random variable is $Z_i$. So the expectation passes through to act on $Z_i$:

$$\begin{aligned} \mathbb{E}[\hat{\tau}] &= \frac{1}{n_1} \sum_{i=1}^{n} \mathbb{E}[Z_i] \cdot Y_i(1) - \frac{1}{n_0} \sum_{i=1}^{n} \big(1 - \mathbb{E}[Z_i]\big) \cdot Y_i(0) \\ &= \frac{1}{n_1} \sum_{i=1}^{n} \frac{n_1}{n} \, Y_i(1) - \frac{1}{n_0} \sum_{i=1}^{n} \frac{n_0}{n} \, Y_i(0) \\ &= \frac{1}{n} \sum_{i=1}^{n} Y_i(1) - \frac{1}{n} \sum_{i=1}^{n} Y_i(0) \\ &= \bar{Y}(1) - \bar{Y}(0) \\ &= \tau. \end{aligned}$$

The pivotal step is that $\mathbb{E}[Z_i] = n_1/n$ has to hold uniformly across $i$; this is the formal statement of “random assignment.”
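The proof can be checked by brute force on a tiny population. The sketch below uses an invented table of six units with three treated, enumerates all 20 equally likely assignments, and averages the difference-in-means over them; under complete randomization that average is exactly the expectation in the derivation.

```python
from itertools import combinations

# Hypothetical fixed potential outcomes (Y_i(0), Y_i(1)); Neyman's setup treats these as constants.
potential = [(30.0, 33.0), (12.0, 12.0), (45.0, 41.0),
             (20.0, 26.0), (8.0, 9.0), (60.0, 62.0)]
n, n1 = len(potential), 3
n0 = n - n1

true_ate = sum(y1 - y0 for y0, y1 in potential) / n

def diff_in_means(treated):
    """Treatment-group mean of Y(1) minus control-group mean of Y(0) for one assignment."""
    t_mean = sum(potential[i][1] for i in treated) / n1
    c_mean = sum(potential[i][0] for i in range(n) if i not in treated) / n0
    return t_mean - c_mean

# Every size-n1 subset is equally likely, so E[tau_hat] is the plain average over subsets.
estimates = [diff_in_means(set(tr)) for tr in combinations(range(n), n1)]
expected = sum(estimates) / len(estimates)

print(f"true ATE     = {true_ate:.4f}")
print(f"E[tau_hat]   = {expected:.4f}")
print(f"single draws range from {min(estimates):.2f} to {max(estimates):.2f}")
```

Any single assignment can land far from the truth; unbiasedness is a statement about the average over assignments, which is why the variance and testing machinery in Section 4 is still needed.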

3.4 The Failure Mode: Mid-Experiment Changes to the Assignment Ratio

Anything that lets $\mathbb{E}[Z_i]$ vary by $i$, so that $\mathbb{E}[Z_i] = n_1/n$ no longer holds uniformly, destroys unbiasedness. A failure mode that comes up constantly in industry is changing the treatment-to-control ratio mid-experiment: day 1 splits 1% control / 1% treatment (ratio 1:1), day 2 ramps up to 1% control / 5% treatment (ratio 1:5). If we just look at the cumulative treatment-vs-control sample-mean difference, early users (who entered on day 1) and late users (who only entered treatment on day 2) have different $\mathbb{E}[Z_i]$, and the simple difference-in-means absorbs the time-distribution differences from the ramp into the “treatment effect,” producing a Simpson’s-paradox-style bias.

By contrast, a proportional ramp (say both arms scale from 1%/1% to 5%/5%, keeping the ctrl:exp ratio fixed at 1:1) keeps $\mathbb{E}[Z_i] = 1/2$ for every user, and the difference-in-means stays unbiased. The failure isn’t about ramping per se; it’s about ramping the ratio. A proportional ramp does introduce time-in-experiment heterogeneity (day-1 users sit in the experiment longer than day-2 newcomers), but the heterogeneity exists symmetrically in both arms. It doesn’t break internal validity (unbiasedness); it just means the estimated ATE applies to the mixed early/late population, and extrapolating to a 100% rollout requires extra reasoning.

The mathematical fix is inverse-probability weighting (see Appendix B). The more reliable engineering fix is to keep the treatment-to-control ratio fixed for the entire run; scaling the total traffic proportionally is fine, but the ratio itself must not change.
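The size of the ratio-ramp bias is easy to see with expected-value arithmetic. The sketch below is not a simulation: it assumes two hypothetical cohorts whose baseline metric differs (late entrants have accumulated less time-spent), sets the true treatment effect to zero, and computes the pooled difference-in-means from expected group sizes. The cohort sizes, means, and assignment probabilities are all made up for illustration.

```python
def pooled_diff_in_means(cohorts):
    """Pooled treatment mean minus pooled control mean, using expected group sizes."""
    t_sum = t_n = c_sum = c_n = 0.0
    for cohort in cohorts:
        n_treat = cohort["n"] * cohort["p_treat"]
        n_ctrl = cohort["n"] * (1 - cohort["p_treat"])
        # The true effect is zero: both arms share the cohort's mean.
        t_sum += n_treat * cohort["mean"]
        c_sum += n_ctrl * cohort["mean"]
        t_n += n_treat
        c_n += n_ctrl
    return t_sum / t_n - c_sum / c_n

# Ratio changed mid-run: day-1 entrants split 1:1, day-2 entrants split 1:5.
ratio_ramp = [
    {"n": 20_000, "p_treat": 0.5, "mean": 10.0},    # entered on day 1
    {"n": 60_000, "p_treat": 5 / 6, "mean": 6.0},   # entered on day 2, lower baseline
]
# Proportional ramp: more traffic on day 2, but still 1:1.
proportional_ramp = [
    {"n": 20_000, "p_treat": 0.5, "mean": 10.0},
    {"n": 60_000, "p_treat": 0.5, "mean": 6.0},
]

biased = pooled_diff_in_means(ratio_ramp)
unbiased = pooled_diff_in_means(proportional_ramp)
print(f"ratio ramp:        apparent effect = {biased:+.3f}")
print(f"proportional ramp: apparent effect = {unbiased:+.3f}")
```

With no treatment effect at all, the ratio ramp reports a sizeable negative "effect" purely because the treatment arm is over-weighted toward the low-baseline cohort; the proportional ramp reports zero.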

4. Hypothesis Testing: From Estimator to Decision

4.1 Null and Alternative Hypotheses, Type I and II Errors

We have an unbiased estimator $\hat{\tau}$. The remaining step is to turn one observed value of $\hat{\tau}$ into a decision: do we accept or reject the claim that “the new algorithm is effective”? That is the job of hypothesis testing.

We define two mutually exclusive hypotheses:

  • Null hypothesis $H_0$: $\tau = 0$, the new algorithm has no effect.
  • Alternative hypothesis $H_1$: $\tau \neq 0$, the new algorithm has a positive or negative effect.

A decision rule has two failure modes:

  • Type I error (false positive): $H_0$ is in fact true but we reject it. Probability denoted $\alpha$, conventionally 5%.
  • Type II error (false negative): $H_1$ is in fact true but we fail to reject $H_0$. Probability denoted $\beta$; the quantity $1 - \beta$ is called the test’s statistical power.

The frequentist philosophy of hypothesis testing rests on “rare events do not happen on a single trial”: if under $H_0$ we observe a sample that should rarely arise, that is grounds to suspect $H_0$ is wrong.

4.2 The Central Limit Theorem

We have a concrete value of $\hat{\tau}$. To translate it into “how rare is this under $H_0$?”, we need to know the distribution of $\hat{\tau}$. This is where the Central Limit Theorem (CLT) enters.

Lindeberg-Lévy form: let $X_1, X_2, \ldots, X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then as $n \to \infty$,

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \;\xrightarrow{d}\; N(0, 1),$$

where $\xrightarrow{d}$ denotes convergence in distribution.

Sketch of the characteristic-function proof. Let $Y_i := X_i - \mu$, so $Y_i$ has mean 0 and variance $\sigma^2$. The characteristic function $\phi_Y(t) := \mathbb{E}[e^{i t Y}]$ has the Taylor expansion near $t = 0$:

$$\phi_Y(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2), \qquad t \to 0.$$

Define the standardized sum $S_n := \frac{1}{\sigma \sqrt{n}} \sum_{i=1}^{n} Y_i$. By independence,

$$\begin{aligned} \phi_{S_n}(t) &= \prod_{i=1}^{n} \mathbb{E}\!\left[\exp\!\left(\frac{i t Y_i}{\sigma \sqrt{n}}\right)\right] = \left[\phi_Y\!\left(\frac{t}{\sigma \sqrt{n}}\right)\right]^n \\ &= \left[1 - \frac{t^2}{2 n} + o\!\left(\frac{1}{n}\right)\right]^n \;\xrightarrow{n \to \infty}\; e^{-t^2 / 2}. \end{aligned}$$

And $e^{-t^2/2}$ is the characteristic function of the standard normal $N(0, 1)$. By Lévy’s continuity theorem (pointwise convergence of characteristic functions implies weak convergence of distributions), $S_n$ converges in distribution to $N(0, 1)$. The full derivation (including the precise meaning of $o(t^2)$) is in Appendix A.

Intuitively: for any distribution with finite second moment, summing and averaging dampens the influence of higher-order moments faster than the influence of mean and variance, and only those two survive in the limit — that is why “averages of enough samples are normal.”
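A quick numerical illustration, as a sketch rather than a proof: draw from a strongly skewed distribution (Exponential(1) is my choice here, not something the theorem singles out), standardize the sample mean, and measure how far its empirical CDF sits from $\Phi$ on a small grid as $n$ grows.

```python
import random
from math import sqrt
from statistics import NormalDist

def standardized_means(n, reps=4000, seed=42):
    """Simulate reps values of sqrt(n) * (mean - mu) / sigma for Exponential(1) draws."""
    rng = random.Random(seed)
    mu = sigma = 1.0                     # Exponential(1) has mean 1 and std 1
    return [sqrt(n) * (sum(rng.expovariate(1.0) for _ in range(n)) / n - mu) / sigma
            for _ in range(reps)]

def max_cdf_gap(samples, points=(-1.5, -0.5, 0.0, 0.5, 1.5)):
    """Largest gap between the empirical CDF and the standard normal CDF on a small grid."""
    phi = NormalDist().cdf
    return max(abs(sum(s <= x for s in samples) / len(samples) - phi(x)) for x in points)

gaps = {n: max_cdf_gap(standardized_means(n)) for n in (2, 20, 200)}
for n, gap in gaps.items():
    print(f"n={n:4d}  max |F_n - Phi| on grid = {gap:.3f}")
```

For $n = 2$ the standardized mean cannot fall below $-\sqrt{2}$, so the left tail is visibly wrong; by $n = 200$ the remaining gap is on the order of the simulation noise.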

4.3 Application to A/B Testing

Assume the treatment and control samples are i.i.d. from finite-variance populations and that the two groups are independent. In the regime of $n_0, n_1 \gg 30$ that practical experiments live in, the CLT and group independence give

$$\bar{X}_1 - \bar{X}_0 \;\overset{d}{\approx}\; N\!\left(\mu_1 - \mu_0,\;\frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}\right).$$

Replacing the unknown population variances by their sample estimates $\hat{\sigma}_0^2, \hat{\sigma}_1^2$, we form the test statistic

$$T := \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\hat{\sigma}_0^2 / n_0 + \hat{\sigma}_1^2 / n_1}}.$$

Under $H_0$, $T \overset{d}{\approx} N(0, 1)$. Given a significance level $\alpha$, the two-sided rejection region is $|T| > z_{1 - \alpha/2}$.

4.4 The p-Value

The p-value is defined as the probability, assuming $H_0$ holds, of observing a result as extreme as the current one or more extreme. Formally,

$$p = 2 \cdot \big(1 - \Phi(|T|)\big),$$

where $\Phi$ is the standard normal CDF. The smaller $p$ is, the rarer the observed sample is under $H_0$, and the stronger the evidence for rejecting $H_0$.
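The test statistic and p-value translate almost line for line into code. This is a minimal sketch using only the Python standard library; the function name and the simulated data (means 10.0 vs. 10.1, standard deviation 2.0, 5,000 users per arm) are illustrative choices, and the inputs are assumed to be one value per user.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z_test(treatment, control):
    """Return (tau_hat, T, p) for the large-sample two-sided test of H0: tau = 0."""
    n1, n0 = len(treatment), len(control)
    tau_hat = mean(treatment) - mean(control)
    # Plug-in standard error built from the two sample variances.
    se = sqrt(variance(treatment) / n1 + variance(control) / n0)
    t = tau_hat / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return tau_hat, t, p

# Simulated per-user metric; the true effect of +0.1 is an assumption of this example.
rng = random.Random(3)
control = [rng.gauss(10.0, 2.0) for _ in range(5000)]
treatment = [rng.gauss(10.1, 2.0) for _ in range(5000)]

tau_hat, t, p = two_sample_z_test(treatment, control)
print(f"tau_hat={tau_hat:.3f}  T={t:.2f}  p={p:.4f}")
```

Rejecting when `p < alpha` is the same decision as rejecting when $|T| > z_{1-\alpha/2}$; the p-value just reports how far into the tail the observed statistic fell.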

A common misreading worth correcting: “$p < 0.05$ means $H_1$ is true with 95% probability” is wrong. A frequentist p-value is “how rare the observed data is under $H_0$,” not “the posterior probability of $H_1$”; the latter is what a Bayesian would compute, and it requires a prior. The American Statistical Association’s 2016 statement on p-values catalogs this and other common misuses.

4.5 Confidence Intervals

The interval-version counterpart of the p-value: the $1 - \alpha$ confidence interval for $\tau$ is

$$\hat{\tau} \pm z_{1 - \alpha/2} \cdot \sqrt{\hat{\sigma}_0^2 / n_0 + \hat{\sigma}_1^2 / n_1}.$$

Its frequentist semantics: “if we repeat the same experimental design many times and compute a 95% CI each time, about 95% of those intervals will cover the true $\tau$”. It does not say “this particular interval covers $\tau$ with 95% probability”. The true $\tau$ is fixed; either it lies in your one realized interval or it does not. The probability lives in the sampling procedure, not in the parameter.
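The repeated-sampling reading can be demonstrated directly. The sketch below assumes a known true effect (0.3, chosen arbitrarily), reruns the same simulated experiment 1,000 times with normally distributed per-user metrics, and counts how often the realized interval covers the truth.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def confidence_interval(treatment, control, alpha=0.05):
    """Large-sample (1 - alpha) interval for tau around the difference in means."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    tau_hat = mean(treatment) - mean(control)
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    return tau_hat - z * se, tau_hat + z * se

def coverage(true_tau=0.3, n=200, reps=1000, seed=1):
    """Fraction of repeated experiments whose interval contains the true tau."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_tau, 1.0) for _ in range(n)]
        lo, hi = confidence_interval(treatment, control)
        hits += lo <= true_tau <= hi
    return hits / reps

cov = coverage()
print(f"empirical coverage over repeated experiments: {cov:.3f}")
```

Each individual interval either contains 0.3 or it does not; the "95%" only shows up as a long-run frequency across the repetitions.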

4.6 Power and Minimum Sample Size

The significance level $\alpha$ controls the false positive rate but not the false negative rate. The latter is captured by the power $1 - \beta$, which depends on the true effect size $\Delta := \mu_1 - \mu_0$, the sample size $n$, and the standard deviation $\sigma$. Under the symmetric approximation $\sigma_0 = \sigma_1 = \sigma$, $n_0 = n_1 = n$, the two-sided $\alpha$-test’s power at effect $\Delta$ is approximately (ignoring the negligible probability of rejecting in the wrong tail)

$$1 - \beta \approx 1 - \Phi\!\left(z_{1 - \alpha/2} - \frac{|\Delta|}{\sqrt{2 \sigma^2 / n}}\right).$$

Inverting this for a target power (commonly 80%) and a target detectable effect $\Delta$ gives the per-group sample size

$$n = 2 \left(\frac{\sigma \cdot (z_{1 - \alpha/2} - z_\beta)}{\Delta}\right)^2,$$

where $z_q$ denotes the $q$-quantile of the standard normal, so $z_\beta$ is negative for $\beta < 0.5$ and $z_{1 - \alpha/2} - z_\beta = z_{1 - \alpha/2} + z_{1 - \beta}$.

This formula is the starting point for the practical “minimum sample size” calculation: any well-run A/B test should estimate and freeze $n$ before launching, not “watch and decide as it goes”. The cost of “watching as it goes” is the topic of the next section.
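As a sketch of how the two formulas fit together, the helper below computes the per-group $n$ and then plugs it back into the power approximation. The function names and the example inputs ($\sigma = 1$, $\Delta = 0.1$) are illustrative; $z_\beta$ is taken as the $\beta$-quantile of the standard normal, which is negative for 80% power.

```python
from math import ceil, sqrt
from statistics import NormalDist

def per_group_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """n = 2 * (sigma * (z_{1-alpha/2} - z_beta) / delta)^2, rounded up."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(1 - power)        # beta-quantile; negative when power > 0.5
    return ceil(2 * (sigma * (z_alpha - z_beta) / delta) ** 2)

def approx_power(n, sigma, delta, alpha=0.05):
    """1 - Phi(z_{1-alpha/2} - |delta| / sqrt(2 sigma^2 / n)) for equal group sizes."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z_alpha - abs(delta) / sqrt(2 * sigma ** 2 / n))

n_required = per_group_sample_size(sigma=1.0, delta=0.1)
print(f"per-group n for delta = 0.1 sigma: {n_required}")
print(f"power at that n: {approx_power(n_required, 1.0, 0.1):.4f}")
print(f"per-group n for delta = 0.05 sigma: {per_group_sample_size(1.0, 0.05)}")
```

Detecting an effect of one tenth of a standard deviation takes about 1,570 users per arm, and because $n$ scales with $1/\Delta^2$, halving the detectable effect roughly quadruples the requirement.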

5. Two Real-World Pitfalls

5.1 Fixed-Sample Size and the Peeking Problem

Classical t- and z-tests rest on a premise that is easy to overlook but central: the sample size $n$ has to be fixed before the experiment starts, and the test runs once at the end on the full sample. Operationally, businesses often do something else: the dashboard refreshes daily, the PM / data scientist / engineer checks the p-value every morning, and the rule becomes “stop and ship if it’s significant, keep running if not.” That practice of repeatedly checking and deciding whether to stop based on what you see is called peeking.

Peeking inflates the false positive rate. Intuitively: if you check once a day for 14 days and ask “is it significant yet?”, the probability of “at least one of those checks showed significance” is much higher than 5%; this is a multiple-comparison problem. Johari et al. ran simulations in Peeking at A/B Tests (KDD 2017): under $H_0$, with classical t-tests plus daily peeking, the actual false positive rate can drift from a nominal 5% to over 30%, and a 14-day experiment is roughly in that range.
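The inflation is easy to reproduce in a small A/A simulation. The sketch below is my own toy setup, not the configuration from the paper: 14 daily looks, 50 new users per arm per day, standard normal metrics with no true effect, and a z-test on the cumulative data at each look. The exact rate depends on these choices; the direction does not.

```python
import random
from math import sqrt
from statistics import NormalDist

def run_experiment(rng, days=14, users_per_day=50, alpha=0.05):
    """One A/A experiment (H0 true). Returns (significant at any peek, significant at the end)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    n = 0
    s1 = q1 = s0 = q0 = 0.0            # running sums and sums of squares per arm
    any_peek = significant = False
    for _ in range(days):
        for _ in range(users_per_day):
            x1, x0 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            s1 += x1
            q1 += x1 * x1
            s0 += x0
            q0 += x0 * x0
        n += users_per_day
        var1 = (q1 - s1 * s1 / n) / (n - 1)
        var0 = (q0 - s0 * s0 / n) / (n - 1)
        t = (s1 / n - s0 / n) / sqrt(var1 / n + var0 / n)
        significant = abs(t) > z_crit   # the result of today's look
        any_peek = any_peek or significant
    return any_peek, significant

def false_positive_rates(reps=1000, seed=7):
    rng = random.Random(seed)
    peek = fixed = 0
    for _ in range(reps):
        stopped_early, final_look = run_experiment(rng)
        peek += stopped_early
        fixed += final_look
    return peek / reps, fixed / reps

peek_fpr, fixed_fpr = false_positive_rates()
print(f"stop at first significant look: {peek_fpr:.3f}")
print(f"single test on day 14:          {fixed_fpr:.3f}")
```

The single end-of-run test stays near the nominal 5%, while "ship as soon as any daily look is significant" rejects several times as often on data that contains no effect at all.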

The business-side fallout is bizarre: “last year we ran 100 experiments, every one reported a significant +1% lift, and yet the year-end company metric went down.” Not surprising: most of those “+1%” wins came from $H_0$-true experiments that happened to peek into significance early; when stacked onto the company-wide metric they neither replicate nor add up.

The industry fix for peeking is sequential testing. Wald’s classic SPRT uses a likelihood ratio with thresholds AA and BB as a stopping rule; large platforms today more commonly use the mixture-prior version, mSPRT (Always Valid Inference, Johari et al. 2015), which produces a valid confidence sequence at every moment so the business can stop at any time without breaking Type-I control. The price is that single-shot power is a bit lower than the corresponding fixed-sample t-test.

One easy-to-miss premise underpins both: the theoretical guarantees of SPRT and mSPRT rest on the sample sequence being i.i.d., which rarely holds for a real streaming A/B test. A single user’s records across different batches are positively correlated (we unpack this in §5.2), and user behavior carries weekday / weekend and morning / evening structure, so even batch-level summaries aren’t truly independent.

5.2 i.i.d. Failure and Variance Underestimation

Both the CLT and $\mathrm{Var}(\bar{X}) = \sigma^2 / n$ require the samples to be i.i.d. (independent and identically distributed). That premise breaks frequently in online A/B tests.

The most common failure mode is a single user contributing many records. In an impression / click experiment, the unit of randomization is the user, but the click data lives at (user, video) granularity. A single user’s many impressions and clicks across a week are obviously positively correlated — a heavy active user who clicks a lot this week tends to click a lot next week.

The algebraic consequence of positive correlation is direct. Suppose user $i$ contributes $n_i$ records $\{X_{ij}\}_{j=1}^{n_i}$; assume users are independent but within-user records may be correlated. The variance of the total is

$$\begin{aligned} \mathrm{Var}\!\left(\sum_{i, j} X_{ij}\right) &= \sum_i \mathrm{Var}\!\left(\sum_j X_{ij}\right) \\ &= \sum_i \!\left[\sum_j \mathrm{Var}(X_{ij}) + \sum_{j \neq k} \mathrm{Cov}(X_{ij}, X_{ik})\right]. \end{aligned}$$

When $\mathrm{Cov}(X_{ij}, X_{ik}) > 0$, the true variance is larger than what the i.i.d. formula returns. If the implementation just plugs into the i.i.d. formula, the variance gets understated, the test statistic gets inflated, the p-value gets shrunk, and the false positive rate climbs.

The standard fix is to aggregate to the analysis unit (per-user aggregation): the fixed-sample t-test itself is fine — the problem is that what we feed it isn’t user-level. The most direct remedy is to lift the analysis unit from “many records” up to “one record per user” — collapse each user’s multiple records into a single per-user statistic and the “users are i.i.d. with each other” assumption is restored. The trade-off is that the metric semantics change from “per record” to “per user” — they are different mathematical objects. This works well when the experiment can run on a daily / weekly aggregated table; for cross-day metrics with composite keys, the offline aggregation cost rises and needs separate engineering.
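A small A/A simulation shows both the problem and the repair. The data model is an assumption made for illustration: each user has a persistent level $u_i$, each record is $u_i$ plus independent noise, and every user contributes 20 records, so records from the same user are positively correlated. The same z-test is run twice, once on raw records and once on per-user means.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

Z_CRIT = NormalDist().inv_cdf(0.975)

def simulate_arm(rng, users=50, records=20):
    """Each user has a persistent level u; records are u + noise, so they correlate within user."""
    arm = []
    for _ in range(users):
        u = rng.gauss(0.0, 1.0)
        arm.append([u + rng.gauss(0.0, 1.0) for _ in range(records)])
    return arm

def rejects(a, b):
    """Two-sided z-test at alpha = 0.05 that treats its inputs as i.i.d. observations."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se > Z_CRIT

def false_positive_rates(reps=300, seed=11):
    rng = random.Random(seed)
    record_level = user_level = 0
    for _ in range(reps):
        t, c = simulate_arm(rng), simulate_arm(rng)      # A/A test: no true effect
        # Naive: every record counted as an independent sample.
        record_level += rejects([x for u in t for x in u], [x for u in c for x in u])
        # Fix: one aggregated value per user, matching the randomization unit.
        user_level += rejects([mean(u) for u in t], [mean(u) for u in c])
    return record_level / reps, user_level / reps

record_fpr, user_fpr = false_positive_rates()
print(f"record-level test false positive rate: {record_fpr:.3f}")
print(f"user-level test false positive rate:   {user_fpr:.3f}")
```

Because every user has the same number of records here, the point estimate is identical in both analyses; only the standard error changes. The record-level test pretends to have 1,000 independent samples per arm when it really has 50, which is where the inflated rejection rate comes from.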

6. Wrapping Up

What an A/B test is really doing is translating “treatment beat control” into “shipping the new version makes the population better” — not a turn of phrase, but a full chain that runs from the potential outcomes framework through randomization, the CLT, and hypothesis testing, and then has to survive the two real-world failure modes of i.i.d. violation and peeking. A few points worth keeping in mind before applying any of this:

  • An A/B test is a statistical and causal system on top of an engineering one, not just an engineering one. Traffic splitting, dashboard rendering, and gradual rollout are engineering problems; how to report numbers is a mathematical problem. Get the engineering right and the math wrong, and the “+1%” you publish is noise.
  • Randomization is the cheapest currency for causal identification. When randomization isn’t available — observational data, natural experiments — you need additional assumptions to identify causal effects. Difference-in-Differences, instrumental variables, propensity score matching all have failure modes of their own; there is no free identification.
  • CLT and i.i.d. are foundational and also leaky. When the sample structure carries hidden positive correlation (multi-record users, social spillover, two-sided markets), variance is underestimated and the false positive rate can drift from the nominal 5% to several tens of percent. Before reading any “significant” result, ask: is my sample really i.i.d.?
  • Peeking is a daily trap. Watching a dashboard and stopping the moment something looks significant turns a fixed-sample test into a multiple-comparison test — this is the hardest habit to break on the business side. The industrial fix is to bake sequential testing into the A/B platform so that “looking at any time” is mathematically legal.

On top of these foundations, the industry has been refining a separate set of advanced techniques:

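  • Variance reduction: CUPED (Deng et al. WSDM 2013) uses pre-experiment covariates as control variates. With a single covariate, the variance shrinks by a factor of $1 - \rho^2$, where $\rho$ is the correlation between the covariate and the experiment metric, and the sample size required for the same confidence-interval width shrinks by the same factor: roughly half when $\rho^2 \approx 0.5$.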
  • Experimental design under network effects: when SUTVA’s no-interference clause breaks down — in social, recommendation, and two-sided-market settings — the industry uses cluster randomization (randomize by city or community rather than by individual) and switchback (randomize by time slice rather than by user) to keep spillover inside a unit rather than between units.

Looking at the two together, the “advanced problems” of A/B testing all revolve around one question: how do we shrink the sample-size requirement without losing causal validity?

Appendix A: A Full Proof of the CLT

Three technical details from the characteristic-function sketch in the body deserve to be filled in.

1. The precise meaning of $o(t^2)$. Strictly speaking, “$\phi_Y(t) = 1 - \sigma^2 t^2 / 2 + o(t^2)$” means there exists a function $\epsilon(t)$ with

$$\phi_Y(t) = 1 - \frac{\sigma^2 t^2}{2} + t^2 \cdot \epsilon(t), \qquad \epsilon(t) \to 0 \text{ as } t \to 0.$$

This expansion needs $Y$ to have finite second moment, $\mathbb{E}[Y^2] = \sigma^2 < \infty$. The derivation uses the standard properties of characteristic functions: $\phi_Y$ is twice differentiable with $\phi_Y'(0) = i\,\mathbb{E}[Y] = 0$ and $\phi_Y''(0) = -\mathbb{E}[Y^2] = -\sigma^2$.

2. Tightening the limit. Fix $t$. We compute the limit of $\phi_Y(t/(\sigma \sqrt{n}))^n$ via its log:

$$\begin{aligned} \phi_Y\!\left(\frac{t}{\sigma \sqrt{n}}\right) &= 1 - \frac{t^2}{2 n} + \frac{t^2}{n} \cdot \epsilon\!\left(\frac{t}{\sigma \sqrt{n}}\right), \\ n \cdot \ln \phi_Y\!\left(\frac{t}{\sigma \sqrt{n}}\right) &= n \cdot \left[ -\frac{t^2}{2 n} + o\!\left(\frac{1}{n}\right) \right] = -\frac{t^2}{2} + o(1). \end{aligned}$$

The second line uses $\ln(1 + x) = x + O(x^2)$ together with $\epsilon(t/(\sigma \sqrt{n})) \to 0$ (since $t/(\sigma\sqrt{n}) \to 0$). Exponentiating gives $\phi_{S_n}(t) \to e^{-t^2 / 2}$.

3. Weak convergence. By Lévy’s continuity theorem: if a family of characteristic functions $\phi_n(t)$ converges pointwise to some function $\phi(t)$ that is continuous at $t = 0$, then the corresponding distributions converge weakly to the distribution with characteristic function $\phi$. Here $\phi(t) = e^{-t^2/2}$ is continuous at $t = 0$, so $S_n \xrightarrow{d} N(0, 1)$.

For weaker conditions (Lindeberg’s condition, Feller’s non-identically-distributed version) see any measure-theoretic probability text — for example, Durrett’s Probability: Theory and Examples, Chapter 3.

Appendix B: Correction for Unequal Assignment Probabilities

In §3.4 we noted that when $\mathbb{E}[Z_i]$ varies with $i$, the simple difference-in-means estimator is biased. The fix is to attach an inverse probability weight to each observation, an idea originating in the Horvitz-Thompson estimator.

Let $\pi_i := \mathbb{P}(Z_i = 1)$ be known (e.g. from the experimentation platform’s traffic-split log). Define

$$\hat{\tau}_{\text{IPW}} := \frac{1}{n} \sum_{i=1}^{n} \!\left[\frac{Z_i \, Y_i^{\text{obs}}}{\pi_i} - \frac{(1 - Z_i) \, Y_i^{\text{obs}}}{1 - \pi_i}\right].$$

A direct computation taking expectation over $Z_i$ confirms $\mathbb{E}[\hat{\tau}_{\text{IPW}}] = \tau$ and, critically, does so without requiring $\pi_i$ to be uniform across $i$.

The cost is that the variance of $\hat{\tau}_{\text{IPW}}$ is larger than that of the difference-in-means estimator, especially when some $\pi_i$ are close to 0 or 1 and the weights blow up. In practice, the safer approach is to keep $\pi_i$ uniform by design; IPW is an after-the-fact repair, not a license to adjust traffic mid-experiment. For systematic comparisons of IPW with more robust alternatives (doubly robust estimators, propensity score matching, etc.), see Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015), Chapters 12–15.
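The unbiasedness claim can be verified exactly on a toy population. The sketch below assumes independent Bernoulli assignment with known, non-uniform $\pi_i$ (four invented units, two assigned with probability 0.5 and two with 0.8), enumerates all $2^4$ assignments, and computes the probability-weighted average of $\hat{\tau}_{\text{IPW}}$.

```python
from itertools import product

# Hypothetical fixed potential outcomes (Y_i(0), Y_i(1)) and known assignment probabilities.
potential = [(10.0, 11.0), (9.0, 11.0), (5.0, 5.0), (6.0, 7.0)]
pi = [0.5, 0.5, 0.8, 0.8]
n = len(potential)

tau = sum(y1 - y0 for y0, y1 in potential) / n

def ipw_estimate(z):
    """(1/n) * sum of Z*Y_obs/pi - (1-Z)*Y_obs/(1-pi) for one assignment vector z."""
    total = 0.0
    for (y0, y1), zi, p in zip(potential, z, pi):
        y_obs = zi * y1 + (1 - zi) * y0
        total += zi * y_obs / p - (1 - zi) * y_obs / (1 - p)
    return total / n

def assignment_prob(z):
    """Probability of assignment z under independent Bernoulli(pi_i) draws."""
    prob = 1.0
    for zi, p in zip(z, pi):
        prob *= p if zi else 1 - p
    return prob

assignments = list(product([0, 1], repeat=n))
expected_ipw = sum(assignment_prob(z) * ipw_estimate(z) for z in assignments)
spread = max(ipw_estimate(z) for z in assignments) - min(ipw_estimate(z) for z in assignments)

print(f"true ATE              = {tau:.4f}")
print(f"E[tau_hat_IPW]        = {expected_ipw:.4f}")
print(f"range of single draws = {spread:.2f}")
```

The expectation matches the true ATE exactly, but the range of individual realizations is wide even on this tiny example, which is the variance cost described above.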