From Potential Outcomes to Hypothesis Testing: Statistical and Causal Inference in A/B Tests

For an internet product, the question behind a strategy change is usually plain: after we ship a new recommender, ranking rule, or entry point, will DAU, time spent, click-through rate, or another population metric improve? Shipping to all traffic is risky, so A/B testing becomes the default: sample part of the traffic, randomly split it into control and treatment, keep the old version for one group and use the new version for the other, then compare metrics after the run. That comparison gives a between-group difference in the sample; it does not yet give a population-level causal effect.

Moving from a sample difference to a causal conclusion requires two checks. First, sample to population: a finite-sample difference can be sampling noise. Second, association to causation: two events appearing together in data does not mean one caused the other; confounders, the Hawthorne effect, and network spillover can all produce that illusion. The mathematical framework behind A/B testing exists to translate “treatment beat control” into “the new version caused the population metric to move”.

This post introduces the two main tools in that translation: the potential outcomes framework (Rubin causal model) and frequentist hypothesis testing. The former answers “what exactly is the causal effect?”; the latter answers “is the observed finite-sample difference rare enough under no effect?” Without potential outcomes, the experiment remains a correlation story; without hypothesis testing, it cannot separate real effects from random fluctuation.

1. From Population Statistics to Sample-Based Inference

Statistical inference studies how to infer population properties from samples. When we can enumerate the entire population, as a national census does by recording every person, we simply compute the population quantity and no inference is needed. More often, the population is too large, too costly, or too risky to expose directly, so we draw a sample by some mechanism and use sample statistics to estimate population parameters. The reliability of statistical inference starts with whether the sampling mechanism makes the sample representative of the population.

A/B testing is a standard industrial form of this idea. Its lineage goes back to medical randomized controlled trials (RCTs): patients are randomized to drug or placebo, and the sample effect is used to infer the effect in the target population. Internet products use the same logic for algorithm iteration. For example, 5% of users may receive a new recommendation algorithm while 95% remain on the old one, and per-user time spent is compared after a week. The purpose is not to validate those 5% of users in isolation, but to estimate what would happen to the population metric after a full rollout.

The hard part appears immediately. If the treatment group ends up 1.2% higher than control, can we conclude that “the new algorithm lifted per-user time spent by 1.2%”? Not strictly. Users assigned to treatment cannot simultaneously use the old algorithm, so we cannot observe the same user under both versions. The quantity we want is the difference for the same unit under two treatments, and at the individual level that quantity necessarily contains an unobserved counterfactual.

2. The Potential Outcomes Framework

2.1 Setup and Potential Outcomes

We use the potential outcomes framework introduced by Neyman in 1923. Suppose the experiment has $n$ units. Let $Z_i \in \{0, 1\}$ denote unit $i$’s assignment: $Z_i = 1$ for treatment and $Z_i = 0$ for control. The metric $Y_i$ has two versions:

$Y_i(1)$: unit $i$’s metric value under treatment, i.e. after using the new version.
$Y_i(0)$: unit $i$’s metric value under control, i.e. after using the old version.

These two quantities are called unit $i$’s potential outcomes. They are both defined by the framework, but only one is observed in the experiment: the assignment $Z_i$ decides which potential outcome we see, not which potential outcome exists. Formally, the observed outcome is:

$Y_i^{\text{obs}} = Z_i \cdot Y_i(1) + (1 - Z_i) \cdot Y_i(0).$

2.2 Individual Causal Effect and the Counterfactual

The individual causal effect for unit $i$ is defined as the difference between the same unit under the two treatments:

$\tau_i := Y_i(1) - Y_i(0).$

The definition is natural, and it exposes the fundamental difficulty of causal inference: the same unit can receive only one treatment, so at most one of $Y_i(1)$ and $Y_i(0)$ can be observed. The individual causal effect $\tau_i$ is defined by one observed outcome and one counterfactual, so it cannot be computed exactly. A counterfactual is not a rhetorical device here; it is the mathematical source of individual-level unobservability.
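The bookkeeping can be made concrete with a toy table. The sketch below is illustrative only: the six units and their potential outcomes are invented numbers (a real experiment never reveals both columns), and `observe` is a hypothetical helper that applies the observed-outcome rule $Y_i^{\text{obs}} = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$.

```python
def observe(y0, y1, z):
    """Return Y_obs = Z * Y(1) + (1 - Z) * Y(0), unit by unit."""
    return [zi * b + (1 - zi) * a for a, b, zi in zip(y0, y1, z)]


# Invented potential outcomes for 6 units (e.g. minutes of time spent).
# Both columns are visible only because this is a simulation.
y0 = [10.0, 12.0, 8.0, 15.0, 9.0, 11.0]   # outcome under control
y1 = [11.0, 12.0, 9.5, 15.0, 10.0, 13.0]  # outcome under treatment
z = [1, 0, 1, 0, 0, 1]                    # one particular assignment

y_obs = observe(y0, y1, z)

# Individual effects tau_i: computable here, never in a real experiment.
tau_i = [b - a for a, b in zip(y0, y1)]

# The analyst's view: for each unit, one potential outcome is missing.
for i, (zi, y) in enumerate(zip(z, y_obs)):
    seen_y0 = y if zi == 0 else None
    seen_y1 = y if zi == 1 else None
    print(f"unit {i}: Z={zi}  Y(0)={seen_y0}  Y(1)={seen_y1}")
```

Every printed row has exactly one `None`, which is the counterfactual that makes $\tau_i$ uncomputable from observed data alone.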

2.3 SUTVA: Assumptions That Make Potential Outcomes Causal

The potential outcomes framework is only a language by itself. To use it for causal inference, we need the assumptions systematized by Rubin in 1980 under the Stable Unit Treatment Value Assumption (SUTVA). Two assumptions are commonly taken as default:

No interference: unit $i$’s potential outcomes depend only on its own $Z_i$, not on other units’ assignments $\{Z_j\}_{j \neq i}$. An industrial counterexample is social distribution: a treated user receives a strong recommendation and shares it with control-group friends, changing the control group’s metric as well.
Treatment variation irrelevance: every treated unit receives the same treatment. An industrial counterexample is a recommender rollout by region, where some treated users actually see version A and others see version B.

SUTVA is a causal condition silently used by most A/B tests, and it breaks easily in social, recommendation, and two-sided-market settings. Experiments under network effects usually need designs such as cluster randomization or switchback to keep spillover within an interpretable unit; this post stays with the basic randomized controlled setting.

2.4 Average Treatment Effect

We cannot compute individual causal effects one by one, but in most experiments we do not need to. A/B testing usually targets the population average effect, the Average Treatment Effect (ATE):

$\tau := \frac{1}{n} \sum_{i=1}^{n} \big(Y_i(1) - Y_i(0)\big) = \bar{Y}(1) - \bar{Y}(0).$

The ATE is the population’s average metric under treatment minus its average metric under control. The ATE is still defined through potential outcomes that are not jointly observable, but a randomized experiment can estimate it without bias.

3. Randomization and the Difference-in-Means Estimator

3.1 How Randomization Handles the Counterfactual

In Neyman’s view, $\{Y_i(0), Y_i(1)\}_{i=1}^n$ are fixed potential outcomes in the population, and the only randomness comes from the assignments $\{Z_i\}_{i=1}^n$ . If assignment is independent of potential outcomes, treatment and control are random subsets of the same population. Randomization gives each potential state an observable sub-sample that is unbiased for its population mean.

One distinction matters. We are not saying that the treatment group’s observed $Y(1)$ has the same distribution as the control group’s observed $Y(0)$ ; those are outcomes under different treatments. We are saying that treatment and control represent the same population in the distribution of the potential-outcome pair $(Y(0), Y(1))$ , in expectation. That is why the treatment mean can estimate $\bar{Y}(1)$ , the control mean can estimate $\bar{Y}(0)$ , and their difference can estimate the ATE.

3.2 The Difference-in-Means Estimator

Under random assignment, we estimate the ATE by

$\hat{\tau} := \frac{1}{n_1} \sum_{Z_i = 1} Y_i^{\text{obs}} - \frac{1}{n_0} \sum_{Z_i = 0} Y_i^{\text{obs}}.$

This is the difference-in-means estimator. It is simply the observed treatment mean minus the observed control mean, but the formula gets its causal interpretation from random assignment, not from a visual impression that the groups “look similar.” The difference in group means becomes a causal estimator because randomization builds comparability into the experiment design.
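As a minimal sketch, the estimator is a few lines of code. The function name `difference_in_means` and the four-unit example are illustrative choices, not part of any particular platform.

```python
def difference_in_means(y_obs, z):
    """Observed treatment mean minus observed control mean."""
    treated = [y for y, zi in zip(y_obs, z) if zi == 1]
    control = [y for y, zi in zip(y_obs, z) if zi == 0]
    if not treated or not control:
        raise ValueError("both arms need at least one unit")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data: units 0 and 2 are treated, units 1 and 3 are control.
tau_hat = difference_in_means([3.0, 1.0, 4.0, 2.0], [1, 0, 1, 0])
print(tau_hat)  # (3 + 4) / 2 - (1 + 2) / 2 = 2.0
```

The code itself knows nothing about causality. The same arithmetic applied to self-selected groups would return a number with no causal meaning, which is why the interpretation has to come from the assignment mechanism.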

3.3 The Full Unbiasedness Proof

Rewrite the subset sums as sums over all $i$, using $Z_i$ and $1 - Z_i$ as indicators:

$\hat{\tau} = \frac{1}{n_1} \sum_{i=1}^{n} Z_i \, Y_i(1) - \frac{1}{n_0} \sum_{i=1}^{n} (1 - Z_i) \, Y_i(0).$

Here $Y_i(1)$ and $Y_i(0)$ are fixed population properties, $n_1$ and $n_0$ are design parameters, and the only random variable is $Z_i$. Therefore the expectation operator acts only on $Z_i$:

$\begin{aligned} \mathbb{E}[\hat{\tau}] &= \frac{1}{n_1} \sum_{i=1}^{n} \mathbb{E}[Z_i] \cdot Y_i(1) - \frac{1}{n_0} \sum_{i=1}^{n} \big(1 - \mathbb{E}[Z_i]\big) \cdot Y_i(0) \\ &= \frac{1}{n_1} \sum_{i=1}^{n} \frac{n_1}{n} \, Y_i(1) - \frac{1}{n_0} \sum_{i=1}^{n} \frac{n_0}{n} \, Y_i(0) \\ &= \frac{1}{n} \sum_{i=1}^{n} Y_i(1) - \frac{1}{n} \sum_{i=1}^{n} Y_i(0) \\ &= \bar{Y}(1) - \bar{Y}(0) \\ &= \tau. \end{aligned}$

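The key condition is that $\mathbb{E}[Z_i] = n_1/n$ holds for every unit. Under complete randomization, where exactly $n_1$ of the $n$ units are drawn into treatment uniformly at random, each unit is selected with probability $n_1/n$, which gives this equality. This is the mathematical statement of “every unit has the same probability of entering treatment,” and it is the minimal requirement for random assignment here.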
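For a small population the proof can be checked by brute force. The sketch below assumes complete randomization (every subset of size $n_1$ is equally likely), enumerates all such assignments, and averages the resulting estimates. The potential outcomes are invented, and `ate` and `expected_diff_in_means` are hypothetical helper names.

```python
from itertools import combinations


def ate(y0, y1):
    """Average treatment effect over the finite population."""
    return sum(b - a for a, b in zip(y0, y1)) / len(y0)


def expected_diff_in_means(y0, y1, n1):
    """Average tau-hat over every equally likely assignment with n1 treated."""
    n = len(y0)
    estimates = []
    for treated in combinations(range(n), n1):
        t = set(treated)
        mean1 = sum(y1[i] for i in t) / n1
        mean0 = sum(y0[i] for i in range(n) if i not in t) / (n - n1)
        estimates.append(mean1 - mean0)
    return sum(estimates) / len(estimates)


# Invented fixed potential outcomes for a 6-unit population.
y0 = [10.0, 12.0, 8.0, 15.0, 9.0, 11.0]
y1 = [11.0, 12.0, 9.5, 15.0, 10.0, 13.0]

print("true ATE       :", ate(y0, y1))
print("E[tau_hat], n1=2:", expected_diff_in_means(y0, y1, 2))
print("E[tau_hat], n1=3:", expected_diff_in_means(y0, y1, 3))
```

The expectation matches the ATE for any split size, even an unbalanced one, because every unit has the same inclusion probability $n_1/n$. Individual assignments can still land far from $\tau$; unbiasedness is a statement about the average over assignments.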

3.4 The Failure Mode: Changing the Assignment Ratio Mid-Experiment

Once $\mathbb{E}[Z_i]$ varies across units, the simple difference-in-means estimator becomes biased. A common industrial example is changing the control-to-treatment ratio during the run: day 1 uses 1% control / 1% treatment, while day 2 uses 1% control / 5% treatment. If we compare final cumulative means directly, early and late users have different probabilities of treatment assignment, and time-composition differences get mixed into the treatment effect, producing a Simpson’s-paradox-style bias. The problem is not ramping traffic; it is changing the control-to-treatment ratio.

A proportional ramp is different. If both arms scale from 1%/1% to 5%/5% and the ctrl:exp ratio remains 1:1, every unit still has the same $\mathbb{E}[Z_i]$ , and difference-in-means remains unbiased. A proportional ramp does make earlier users exposed for longer, but that heterogeneity exists symmetrically in both arms. It changes the population-time mix of the estimand, not the experiment’s internal validity.

Inverse-probability weighting can repair the bias after the fact (Appendix B), but the more reliable engineering rule is simpler: keep the ctrl:exp ratio fixed throughout the experiment; total traffic can ramp proportionally, but the assignment ratio should not change mid-run.
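A small simulation makes the failure mode visible. It is a sketch under invented assumptions: the true effect is zero, day-2 users have a higher baseline than day-1 users (12 versus 10), each user is assigned by an independent coin flip, and user counts are scaled so that 1% of traffic is 1,000 users. `pooled_diff` is a hypothetical helper.

```python
import random


def pooled_diff(n_day1, n_day2, p_treat_day1, p_treat_day2, seed=0):
    """Pooled difference in means over two days when the true effect is zero."""
    rng = random.Random(seed)
    treated, control = [], []
    days = ((n_day1, p_treat_day1, 10.0), (n_day2, p_treat_day2, 12.0))
    for n_users, p_treat, baseline in days:
        for _ in range(n_users):
            y = rng.gauss(baseline, 1.0)  # Y(1) == Y(0): no treatment effect
            if rng.random() < p_treat:
                treated.append(y)
            else:
                control.append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


# Day 1: 1% / 1%. Day 2: 1% control / 5% treatment (ratio changes).
changed_ratio = pooled_diff(2000, 6000, 0.5, 5 / 6)

# Day 1: 1% / 1%. Day 2: 5% / 5% (proportional ramp, ratio stays 1:1).
proportional = pooled_diff(2000, 10000, 0.5, 0.5)

print(f"ratio changed mid-run: {changed_ratio:+.3f}")
print(f"proportional ramp    : {proportional:+.3f}")
```

With the changed ratio, treatment is dominated by day-2 users while control is an even mix of both days, so the day-to-day baseline shift shows up as a spurious lift. Under the proportional ramp both arms share the same day mix and the estimate stays near zero.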

4. Hypothesis Testing: From Estimator to Decision

4.1 Null and Alternative Hypotheses, Type I and II Errors

An unbiased estimator $\hat{\tau}$ still does not directly make a business decision. In a finite sample, $\hat{\tau}$ will not equal $\tau$ exactly; even when the true effect is zero, random fluctuation can produce a nonzero estimate. Hypothesis testing asks whether the observed difference is rare enough under the premise that the new version has no effect.

Define two mutually exclusive hypotheses:

Null hypothesis $H_0$: $\tau = 0$, the new version has no effect.
Alternative hypothesis $H_1$: $\tau \neq 0$, the new version has a positive or negative effect.

A decision rule has two error types:

Type I error (false positive): $H_0$ is true but rejected. Its probability is $\alpha$, commonly set to 5% in practice.
Type II error (false negative): $H_1$ is true but $H_0$ is not rejected. Its probability is $\beta$; $1 - \beta$ is the test’s statistical power.

The frequentist testing logic can be summarized in one sentence: if an outcome is very rare when $H_0$ is true, and we observe it, we have evidence to reject $H_0$ .

4.2 The Central Limit Theorem

To decide how rare the current result is, we need the distribution of $\hat{\tau}$ under $H_0$ . This is where the Central Limit Theorem (CLT) enters.

Lindeberg-Lévy form: let $X_1, X_2, \ldots, X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $0 < \mathrm{Var}(X_i) = \sigma^2 < \infty$. Then as $n \to \infty$,

$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \;\xrightarrow{d}\; N(0, 1),$

where $\xrightarrow{d}$ denotes convergence in distribution.

A characteristic-function proof gives the shape of the argument. Let $Y_i := X_i - \mu$, so $Y_i$ has mean 0 and variance $\sigma^2$. Its characteristic function $\phi_Y(t) := \mathbb{E}[e^{i t Y}]$ (in the exponent, $i$ is the imaginary unit, not a unit index) has the expansion near $t = 0$:

$\phi_Y(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2), \qquad t \to 0.$

Define the standardized sum $S_n := \frac{1}{\sigma \sqrt{n}} \sum_{i=1}^{n} Y_i$ . Independence gives:

$\begin{aligned} \phi_{S_n}(t) &= \prod_{i=1}^{n} \mathbb{E}\!\left[\exp\!\left(\frac{i t Y_i}{\sigma \sqrt{n}}\right)\right] = \left[\phi_Y\!\left(\frac{t}{\sigma \sqrt{n}}\right)\right]^n \\ &= \left[1 - \frac{t^2}{2 n} + o\!\left(\frac{1}{n}\right)\right]^n \;\xrightarrow{n \to \infty}\; e^{-t^2 / 2}. \end{aligned}$

$e^{-t^2/2}$ is the characteristic function of the standard normal $N(0, 1)$ . By Lévy’s continuity theorem, $S_n$ converges in distribution to $N(0, 1)$ . Appendix A fills in the technical details. Intuitively, as long as the second moment is finite, averaging gradually leaves only the mean and variance visible, and the distribution approaches normality.
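The theorem can also be checked numerically. The sketch below draws from a deliberately skewed distribution, Exponential(1) with $\mu = \sigma = 1$, and measures how often the standardized mean falls inside $\pm 1.96$, which a standard normal does about 95% of the time. The distribution, sample size, and replication count are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math
import random


def standardized_mean(sample, mu, sigma):
    """sqrt(n) * (sample mean - mu) / sigma, the quantity in the CLT."""
    n = len(sample)
    return math.sqrt(n) * (sum(sample) / n - mu) / sigma


def share_within(n, reps, z=1.96, seed=0):
    """Fraction of replications whose standardized mean lies in [-z, z]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.expovariate(1.0) for _ in range(n)]  # Exp(1): mu = sigma = 1
        if abs(standardized_mean(xs, 1.0, 1.0)) <= z:
            hits += 1
    return hits / reps


share = share_within(n=200, reps=2000)
print(f"share of standardized means within +/-1.96: {share:.3f}")
```

The individual draws are far from normal, yet the standardized mean already behaves close to $N(0, 1)$ at $n = 200$. Heavier tails or stronger skew would need larger $n$ for the same quality of approximation.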

4.3 Application to A/B Testing

Assume treatment and control samples come from finite-variance populations, are i.i.d. within each group, and are independent across groups. In the large-sample regime typical of production experiments, the CLT and group independence give:

$\bar{X}_1 - \bar{X}_0 \;\overset{d}{\approx}\; N\!\left(\mu_1 - \mu_0,\;\frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}\right).$

Replacing the unknown population variances by sample variances $\hat{\sigma}_0^2, \hat{\sigma}_1^2$, we form the test statistic:

$T := \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\hat{\sigma}_0^2 / n_0 + \hat{\sigma}_1^2 / n_1}}.$

Under $H_0$, $T \overset{d}{\approx} N(0, 1)$. Given significance level $\alpha$, the two-sided rejection region is $|T| > z_{1 - \alpha/2}$. This step converts a group-mean difference into a standardized measure of rarity under the null.
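A minimal sketch of the statistic and the rejection rule, using only the Python standard library. The function names and the eight data points are illustrative, and the normal approximation is only reasonable for samples much larger than this toy example.

```python
import math
from statistics import NormalDist, mean, variance


def z_statistic(treatment, control):
    """T = (mean1 - mean0) / sqrt(s1^2 / n1 + s0^2 / n0), unpooled variances."""
    se = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    return (mean(treatment) - mean(control)) / se


def reject_null(t, alpha=0.05):
    """Two-sided rule: reject H0 when |T| exceeds z_{1 - alpha/2}."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(t) > z_crit


# Toy data, far too small for the CLT; shown only to trace the arithmetic.
t = z_statistic([2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 5.0, 7.0])
print(f"T = {t:.4f}, reject H0 at 5%: {reject_null(t)}")
```

`statistics.variance` is the sample variance with an $n - 1$ denominator, which matches $\hat{\sigma}^2$ in the formula above.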

4.4 The p-Value

The p-value is the probability, assuming $H_0$ is true, of observing the current result or a more extreme one. For a two-sided test:

$p = 2 \cdot \big(1 - \Phi(|T|)\big),$

where $\Phi$ is the standard normal CDF. Smaller p-values mean the observation is rarer under $H_0$ and provide stronger evidence against it. A p-value measures the rarity of the data relative to the null; it is not the probability that the alternative hypothesis is true.

Therefore, “$p < 0.05$ means $H_1$ is true with 95% probability” is wrong. A frequentist p-value is always conditioned on $H_0$; computing the posterior probability of $H_1$ requires a prior and belongs to Bayesian inference. The American Statistical Association’s 2016 statement on p-values discusses this and related misuses. Do not read a p-value as the probability that the experiment succeeded.
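The two-sided formula translates directly into code. This is a sketch under the large-sample normal approximation used throughout this post; `two_sided_p_value` is a hypothetical helper name.

```python
from statistics import NormalDist


def two_sided_p_value(t):
    """p = 2 * (1 - Phi(|T|)) under the standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


for t in (0.0, 1.0, 1.96, 3.0):
    print(f"T = {t:>4}: p = {two_sided_p_value(t):.4f}")
```

The function takes only $T$ as input. Nothing about $H_1$ or a prior enters the calculation, which is why the result cannot be read as the probability that the new version works.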

4.5 Confidence Intervals

The p-value gives a test conclusion; the confidence interval gives a plausible range of effects. The $1 - \alpha$ confidence interval for $\tau$ is:

$\hat{\tau} \pm z_{1 - \alpha/2} \cdot \sqrt{\hat{\sigma}_0^2 / n_0 + \hat{\sigma}_1^2 / n_1}.$

The frequentist meaning of a confidence interval is: if we repeatedly sample under the same experimental design and construct an interval each time, about a $1 - \alpha$ fraction of those intervals will cover the true $\tau$ . It does not mean this particular interval covers $\tau$ with probability $1 - \alpha$ . The true parameter is fixed; the realized interval either covers it or does not. The probability belongs to the sampling procedure.
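The repeated-sampling reading can be simulated directly. The sketch below assumes an invented setting: normal outcomes with unit variance, a true mean difference of 0.3, and 200 users per arm. It rebuilds the interval many times and counts how often the true difference falls inside; all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def mean_and_var(xs):
    """Sample mean and sample variance (n - 1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def confidence_interval(treatment, control, alpha=0.05):
    """tau_hat +/- z_{1 - alpha/2} * sqrt(s0^2 / n0 + s1^2 / n1)."""
    m1, v1 = mean_and_var(treatment)
    m0, v0 = mean_and_var(control)
    se = math.sqrt(v1 / len(treatment) + v0 / len(control))
    half = NormalDist().inv_cdf(1 - alpha / 2) * se
    return m1 - m0 - half, m1 - m0 + half


def coverage_rate(true_effect=0.3, n=200, reps=1000, seed=0):
    """Share of repeated experiments whose interval covers the true effect."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        lo, hi = confidence_interval(treatment, control)
        covered += lo <= true_effect <= hi
    return covered / reps


rate = coverage_rate()
print(f"share of intervals covering the true effect: {rate:.3f}")
```

Each simulated interval either contains 0.3 or does not. The roughly 95% figure only appears as a frequency across repetitions, which is the sense in which the probability belongs to the procedure.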

4.6 Power and Minimum Sample Size

The significance level $\alpha$ controls false positives, but not false negatives. Whether the test can detect a real effect depends on the true effect size $\Delta := \mu_1 - \mu_0$, the per-group sample size $n$, and the standard deviation $\sigma$. Under the symmetric approximation $\sigma_0 = \sigma_1 = \sigma$, $n_0 = n_1 = n$, the power of a two-sided $\alpha$-test at effect $\Delta$ is approximately:

$1 - \beta \approx 1 - \Phi\!\left(z_{1 - \alpha/2} - \frac{|\Delta|}{\sqrt{2 \sigma^2 / n}}\right).$

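Setting the target power to $1 - \beta$ and solving for the per-group sample size gives, where $z_\beta$ is the $\beta$-quantile of the standard normal (so $-z_\beta = z_{1 - \beta}$):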

$n = 2 \left(\frac{\sigma \cdot (z_{1 - \alpha/2} - z_\beta)}{\Delta}\right)^2.$

Minimum sample size should be estimated before launch and fixed once the experiment begins. Watching the experiment as it runs and stopping when it becomes significant changes the testing process itself; the next section covers the cost.
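Both formulas are easy to evaluate before launch. The sketch below implements them as written above, under the same symmetric approximation; the inputs ($\Delta = 0.1$, $\sigma = 1$, 80% power) are illustrative values, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(delta, sigma, n, alpha=0.05):
    """1 - Phi(z_{1 - alpha/2} - |delta| / sqrt(2 * sigma^2 / n))."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    return 1.0 - nd.cdf(z_alpha - abs(delta) / math.sqrt(2 * sigma**2 / n))


def per_group_sample_size(delta, sigma, alpha=0.05, power=0.8):
    """n = 2 * (sigma * (z_{1 - alpha/2} - z_beta) / delta)^2, rounded up."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(1 - power)  # beta-quantile; negative when power > 0.5
    return math.ceil(2 * (sigma * (z_alpha - z_beta) / delta) ** 2)


n = per_group_sample_size(delta=0.1, sigma=1.0)
print(f"users needed per group: {n}")
print(f"power at that size    : {approx_power(0.1, 1.0, n):.4f}")
```

Because $n$ scales with $(\sigma / \Delta)^2$, halving the detectable effect quadruples the required sample. That scaling is the main reason small expected lifts need long experiments.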

5. Two Real-World Pitfalls

5.1 Fixed-Sample Size and the Peeking Problem

Classical t- and z-tests rely on a central premise: the sample size $n$ is fixed before the experiment starts, and the test is run once after the sample is complete. In practice, dashboards refresh daily and PMs, engineers, and data scientists check the p-value every morning: stop if significant, continue if not. Stopping based on interim results is called peeking.

Peeking inflates the false positive rate. Checking once per day for 14 days means asking “is it significant yet?” 14 times; even if each individual test has nominal $\alpha = 5\%$ , the probability of at least one significant result is much higher than 5%. Johari et al. simulate this in Peeking at A/B Tests (KDD 2017): under a true $H_0$ , classical t-tests with daily peeking can push the actual false positive rate from nominal 5% to above 30%.

The business symptom is familiar: a team runs 100 experiments in a year, every one reports a significant positive lift, and the year-end business metric does not grow. Many of those “significant +1%” results are random high points selected by peeking under a true $H_0$ ; they neither replicate nor add up online. Peeking turns a fixed-sample test into a multiple-comparison problem.
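The inflation is easy to reproduce. The following is a simplified sketch, not a replication of the cited paper: both arms draw from the same $N(0, 1)$ distribution (so $H_0$ is true), 50 users per arm arrive each day for 14 days, and a z-test is run on the cumulative data after each day. All of these settings and the helper names are assumptions chosen for illustration.

```python
import math
import random


def z_from_sums(s1, q1, s0, q0, n):
    """z-statistic from running sums and sums of squares, n users per arm."""
    m1, m0 = s1 / n, s0 / n
    v1 = (q1 - n * m1 * m1) / (n - 1)
    v0 = (q0 - n * m0 * m0) / (n - 1)
    return (m1 - m0) / math.sqrt((v1 + v0) / n)


def false_positive_rates(days=14, per_day=50, reps=1000, z_crit=1.96, seed=0):
    """Return (FPR when stopping at any significant look, FPR at final look)."""
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(reps):
        s1 = q1 = s0 = q0 = 0.0
        n = 0
        significant_at_some_look = False
        for _ in range(days):
            for _ in range(per_day):
                a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # no true effect
                s1 += a
                q1 += a * a
                s0 += b
                q0 += b * b
            n += per_day
            z = z_from_sums(s1, q1, s0, q0, n)
            if abs(z) > z_crit:
                significant_at_some_look = True
        peek_hits += significant_at_some_look
        fixed_hits += abs(z) > z_crit  # z now holds the day-14 statistic
    return peek_hits / reps, fixed_hits / reps


peek_fpr, fixed_fpr = false_positive_rates()
print(f"stop at first significant daily look: {peek_fpr:.3f}")
print(f"single test on day 14               : {fixed_fpr:.3f}")
```

The single final test stays near the nominal 5%, while the "stop when significant" rule rejects far more often on the same data. The exact inflated rate depends on the number of looks and how traffic accumulates, so the simulated figure should not be compared directly with numbers from other setups.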

The standard repair is sequential testing. Wald’s SPRT defines a stopping rule using a likelihood ratio and thresholds $A, B$ ; production platforms more often use the mixture-prior version, mSPRT (Always Valid Inference, Johari et al. 2015), which produces a valid confidence sequence at every time point. Sequential testing lets the business look and stop at any time while preserving Type I error control. The cost is that, for the same sample size, power is usually lower than in a fixed-sample t-test.

One premise still matters: the guarantees of SPRT and mSPRT depend on an i.i.d. sample sequence. In real streaming A/B data, records from the same user across batches are often positively correlated, and behavior has weekday / weekend and morning / evening structure. Sequential testing fixes peeking; it does not automatically fix sample dependence.

5.2 i.i.d. Failure and Variance Underestimation

Both the CLT and $\mathrm{Var}(\bar{X}) = \sigma^2 / n$ require samples to be i.i.d. (independent and identically distributed). This assumption is often violated in online experiments, especially when the randomization unit differs from the event-recording grain. If users are randomized but metrics are computed over impressions, clicks, or play records, the samples are usually not independent.

Take an impression-click experiment. The randomization unit is the user, but the click table lives at (user, video) granularity. A user’s impressions and clicks across a week are plainly positively correlated: a highly active user contributes more records and tends to keep a stable click tendency. Suppose user $i$ contributes $n_i$ records $\{X_{ij}\}_{j=1}^{n_i}$ ; users are independent, but records within a user may be correlated. The variance of the total is:

$\begin{aligned} \mathrm{Var}\!\left(\sum_{i, j} X_{ij}\right) &= \sum_i \mathrm{Var}\!\left(\sum_j X_{ij}\right) \\ &= \sum_i \!\left[\sum_j \mathrm{Var}(X_{ij}) + \sum_{j \neq k} \mathrm{Cov}(X_{ij}, X_{ik})\right]. \end{aligned}$

When $\mathrm{Cov}(X_{ij}, X_{ik}) > 0$ , the true variance is larger than the variance returned by the i.i.d. formula. If the implementation plugs the event-level data directly into an i.i.d. variance estimate, the standard error is understated, the test statistic is inflated, the p-value is compressed, and the false positive rate rises.

The standard correction is to aggregate to the analysis unit. If randomization happens at the user level, the testing sample should usually return to the user level: collapse each user’s records into one per-user metric and run the t-test on user-level samples. The fixed-sample t-test is not the problem; the problem is often the grain of the data fed into it. The trade-off is that the metric changes from “per event” to “per user,” and complex cross-day or multi-key metrics can require additional offline aggregation work.
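The effect of the analysis grain can be seen in an A/A simulation. The sketch below rests on invented assumptions: each user has a personal click probability drawn uniformly from $[0, 0.6]$, contributes between 1 and 20 impressions, and there is no treatment effect. It compares a z-test on raw impression rows with a z-test on one CTR value per user; the helper names are hypothetical.

```python
import math
import random


def mean_and_var(xs):
    """Sample mean and sample variance (n - 1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def rejects(a, b, z_crit=1.96):
    """Two-sided z-test that treats every element of a and b as i.i.d."""
    (ma, va), (mb, vb) = mean_and_var(a), mean_and_var(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return abs(ma - mb) / se > z_crit


def simulate_arm(rng, users):
    """Return (impression-level clicks, one CTR per user) for one arm."""
    events, per_user = [], []
    for _ in range(users):
        p = rng.uniform(0.0, 0.6)   # user-specific click tendency
        n_i = rng.randint(1, 20)    # impressions contributed by this user
        clicks = [1.0 if rng.random() < p else 0.0 for _ in range(n_i)]
        events.extend(clicks)
        per_user.append(sum(clicks) / n_i)
    return events, per_user


def false_positive_rates(users=200, reps=400, seed=0):
    """A/A test: share of runs rejected at event grain and at user grain."""
    rng = random.Random(seed)
    event_hits = user_hits = 0
    for _ in range(reps):
        ev_t, us_t = simulate_arm(rng, users)
        ev_c, us_c = simulate_arm(rng, users)
        event_hits += rejects(ev_t, ev_c)
        user_hits += rejects(us_t, us_c)
    return event_hits / reps, user_hits / reps


event_fpr, user_fpr = false_positive_rates()
print(f"event-level test, false positive rate: {event_fpr:.3f}")
print(f"user-level test, false positive rate : {user_fpr:.3f}")
```

The event-level test treats thousands of correlated rows as independent and rejects a true null much more often than 5%. The user-level test has far fewer rows but an honest standard error. The two tests also target slightly different metrics (impression-weighted CTR versus average per-user CTR), which is the trade-off noted above.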

6. Wrapping Up

What an A/B test really answers is not “is treatment higher than control?”, but “will the population metric change because of this new version?” The reasoning chain starts by defining the causal quantity through potential outcomes, obtains an unbiased estimate through randomization, and then uses the CLT and hypothesis testing to turn a finite-sample result into a decision. A few points are worth keeping in mind before applying it:

An A/B test is a statistical and causal system, not only a traffic system. Traffic splitting, gradual rollout, and dashboards are engineering foundations; effect definitions, variance estimates, and error-rate control determine whether the reported number is trustworthy.
Randomization is the cheapest condition for causal identification. Without it, methods for observational data, such as natural experiments, Difference-in-Differences, instrumental variables, and propensity score matching, all need additional assumptions, and each has its own failure modes.
The CLT and i.i.d. assumptions are foundational, and easy to misuse. Multi-record users, social-network spillover, and two-sided-market coupling all inject dependence into variance estimation and can push a nominal 5% error rate much higher.
Peeking is the most common operational error. A fixed-sample test does not allow daily p-value checks followed by optional stopping; if the business needs continuous monitoring, the platform should use sequential testing.

On top of these foundations, industrial experimentation keeps improving in two directions. Variance reduction asks how to reduce sample size under the same error-rate control, for example CUPED (Deng et al. WSDM 2013), which uses pre-experiment covariates as control variates. Experimental design under network effects asks how to redefine the randomization unit when SUTVA fails, as in cluster randomization and switchback. Together they point to the same question: how can experiments become faster, stabler, and cheaper without giving up causal validity?

Appendix A: A Full Proof of the CLT

Three technical details in the characteristic-function proof from the body need to be filled in.

1. The precise meaning of $o(t^2)$ . Strictly speaking, $\phi_Y(t) = 1 - \sigma^2 t^2 / 2 + o(t^2)$ means there exists a function $\epsilon(t)$ such that

$\phi_Y(t) = 1 - \frac{\sigma^2 t^2}{2} + t^2 \cdot \epsilon(t), \qquad \epsilon(t) \to 0 \text{ as } t \to 0.$

This expansion requires $Y$ to have finite second moment, $\mathbb{E}[Y^2] = \sigma^2 < \infty$ . The derivation uses the standard differentiability properties of characteristic functions, with $\phi_Y'(0) = i\,\mathbb{E}[Y] = 0,\;\phi_Y''(0) = -\mathbb{E}[Y^2] = -\sigma^2$ . Finite second moment is the key condition behind this version of the CLT.

2. Tightening the limit. For fixed $t$ , take the logarithm of $\phi_Y(t/(\sigma \sqrt{n}))^n$ :

$\begin{aligned} \phi_Y\!\left(\frac{t}{\sigma \sqrt{n}}\right) &= 1 - \frac{t^2}{2 n} + \frac{t^2}{n} \cdot \epsilon\!\left(\frac{t}{\sigma \sqrt{n}}\right), \\ n \cdot \ln \phi_Y\!\left(\frac{t}{\sigma \sqrt{n}}\right) &= n \cdot \left[ -\frac{t^2}{2 n} + o\!\left(\frac{1}{n}\right) \right] = -\frac{t^2}{2} + o(1). \end{aligned}$

The second step uses $\ln(1 + x) = x + O(x^2)$ and $\epsilon(t/(\sigma \sqrt{n})) \to 0$ . Exponentiating gives $\phi_{S_n}(t) \to e^{-t^2 / 2}$ . This step shows that the standardized sum’s characteristic function converges to the characteristic function of the standard normal.

3. Weak convergence. By Lévy’s continuity theorem, if characteristic functions $\phi_n(t)$ converge pointwise to a function $\phi(t)$ that is continuous at $t = 0$ , then the corresponding distributions converge weakly to the distribution with characteristic function $\phi$ . Here $\phi(t) = e^{-t^2/2}$ is continuous at $t = 0$ , so $S_n \xrightarrow{d} N(0, 1)$ . Convergence of characteristic functions gives convergence in distribution.

For weaker conditions, including Lindeberg’s condition and Feller’s non-identically-distributed version, see a measure-theoretic probability text such as Durrett’s Probability: Theory and Examples, Chapter 3.

Appendix B: Correction for Unequal Assignment Probabilities

Section 3.4 noted that when $\mathbb{E}[Z_i]$ varies across units, the simple difference-in-means estimator is biased. The correction is to attach an inverse probability weight to each observation, an idea from the Horvitz-Thompson estimator. It uses known assignment probabilities to recalibrate each unit’s chance of being observed back to the population scale.

Let $\pi_i := \mathbb{P}(Z_i = 1)$ be known, for example from the experimentation platform’s traffic-split configuration. Define:

$\hat{\tau}_{\text{IPW}} := \frac{1}{n} \sum_{i=1}^{n} \!\left[\frac{Z_i \, Y_i^{\text{obs}}}{\pi_i} - \frac{(1 - Z_i) \, Y_i^{\text{obs}}}{1 - \pi_i}\right].$

Taking expectation over $Z_i$ directly verifies $\mathbb{E}[\hat{\tau}_{\text{IPW}}] = \tau$ , without requiring $\pi_i$ to be identical across units. IPW repairs the bias caused by unequal assignment probabilities.
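The unbiasedness claim can be verified exactly on a tiny population. The sketch below assumes each unit is assigned by an independent coin flip with its own known probability $\pi_i$, enumerates all $2^n$ assignments, and takes the probability-weighted average of the IPW estimate. The potential outcomes and the $\pi_i$ values are invented, and the function names are hypothetical.

```python
from itertools import product


def ipw_estimate(y_obs, z, pi):
    """(1/n) * sum of Z*Y/pi - (1 - Z)*Y/(1 - pi) over all units."""
    n = len(z)
    total = 0.0
    for y, zi, p in zip(y_obs, z, pi):
        total += zi * y / p - (1 - zi) * y / (1 - p)
    return total / n


def expected_ipw(y0, y1, pi):
    """Exact E[tau_hat_IPW] under independent Bernoulli(pi_i) assignment."""
    expectation = 0.0
    for z in product((0, 1), repeat=len(pi)):
        prob = 1.0
        for zi, p in zip(z, pi):
            prob *= p if zi == 1 else 1 - p
        y_obs = [b if zi == 1 else a for a, b, zi in zip(y0, y1, z)]
        expectation += prob * ipw_estimate(y_obs, z, pi)
    return expectation


# Invented population: later units have both higher outcomes and higher pi_i,
# the pattern that biases a pooled difference in means.
y0 = [10.0, 10.5, 12.0, 12.5]
y1 = [10.5, 11.0, 13.0, 13.0]
pi = [0.5, 0.5, 0.8, 0.8]

true_ate = sum(b - a for a, b in zip(y0, y1)) / len(y0)
print("true ATE      :", true_ate)
print("E[tau_hat_IPW]:", expected_ipw(y0, y1, pi))
```

The two printed values agree up to floating-point rounding, even though $\pi_i$ differs across units. The enumeration says nothing about variance, which is where IPW pays its price.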

The cost is higher variance, especially when some $\pi_i$ are close to 0 or 1 and the weights blow up. In practice, the safer approach is still to make $\pi_i$ uniform by design; IPW is an after-the-fact repair, not a reason to change assignment ratios mid-experiment. Get the randomization design right first, then consider estimator corrections. For systematic comparisons of IPW with doubly robust estimators, propensity score matching, and related methods, see Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015), Chapters 12-15.