Seven Common Statistical Traps in Daily Data: From Small-Sample Bias to Simpson's Paradox

Open a newspaper and you can usually find a few pairs of mutually contradictory numbers: the official average wage rises every year while ordinary people keep saying their money goes less far than it used to; an official press release reports city home prices “down 0.5% year on year” while neighborhood listings keep ticking upward; a wellness practice spreads through friend groups as obviously effective even though large clinical trials show no significant benefit. This kind of “felt vs reported” mismatch is everywhere — and most of the time it isn’t anyone’s imagination, but the numbers being quietly distorted in the process of being aggregated, reported, and used.

This post walks through the seven most common statistical traps in everyday data, grouped into four themes by where the distortion enters: (1) generalizing from a single observation to a population (small-sample bias and survivorship bias); (2) summarizing a sample with mean and variance (means that mislead under skewed distributions, and comparisons made without significance testing); (3) treating correlation as causation (confounders and Simpson’s paradox); (4) turning the metric itself into a target (cherry-picking and Goodhart’s law). Each trap starts from a concrete example, then unpacks its cause, its mathematical core, and the professional fix typically used against it. Building enough statistical intuition to resist these traps is not about becoming an actuary; it is about keeping a working brain in an information stream curated by algorithms and shaped by data rhetoric.

This post sits at the “everyday data consumption” layer — not how to build a particular engineering system, but which traps an ordinary reader looking at a number is most likely to fall into, and how to climb back out.

1. Generalizing from One Case: Small Samples and Survivors

1.1 Anecdotal Statistics: Small-Sample Bias and the Availability Heuristic

We’ve all heard sentences like these: “My great-uncle smoked, drank, and ate fatty foods all his life and lived to 98 — clearly health advice is just noise,” or “this herbalist is amazing, my cousin got cured after just three doses,” or “follow this guy on Twitter — whatever he buys goes up.” What these statements have in common is that they take one vivid, personally known individual as the entire evidence base for a sweeping conclusion.

The brain weights personally experienced cases far more heavily than they deserve statistically. Tversky and Kahneman labeled this the availability heuristic — something “easy to retrieve from memory” gets misread as “highly probable.” And individual cases happen to score very high on availability — they’re vivid, recent, and personally witnessed all at once. A great-uncle’s longevity sticks in memory far more than a ten-thousand-person clinical trial does. The result is that individual experience gets weighted in the head far above its statistical weight in the population, and inductions drawn from it routinely miss the population’s true distribution.

The statistical counterpart is the law of large numbers. Let the population have mean $\mu$ and variance $\sigma^2$, and let $\bar X_n$ be the sample mean of $n$ i.i.d. draws. Then

$\mathrm{Var}(\bar X_n) = \frac{\sigma^2}{n}, \qquad \mathbb P(|\bar X_n - \mu| > \epsilon) \to 0 \;\; \text{as} \;\; n \to \infty.$

At $n = 1$ the sample mean’s variance is the full $\sigma^2$: its distance from $\mu$ is only as well controlled as a single draw from the population. Only as $n$ grows does the variance shrink at rate $1/n$ and the sample mean converge in probability to $\mu$. Between “my great-uncle lived to 98” and “this applies to everyone” sits an entire $1/n \to 0$ limit that the anecdote skips.
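A quick simulation makes the $\sigma^2 / n$ shrinkage concrete. This is a minimal sketch, assuming a hypothetical normal population with mean 70 and standard deviation 15; those numbers and the function name `sample_mean_variance` are illustrative choices, not taken from any dataset.

```python
import random
import statistics

def sample_mean_variance(n, trials=5_000, mu=70.0, sigma=15.0, seed=0):
    """Empirical variance of the mean of n i.i.d. Normal(mu, sigma) draws.

    mu and sigma describe a made-up population used only for illustration.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

# Compare the simulated variance with the theoretical sigma^2 / n.
for n in (1, 10, 100):
    print(n, round(sample_mean_variance(n), 2), "theory:", 15.0**2 / n)
```

At `n = 1` the simulated variance should sit near the full population variance of 225, and it should fall by roughly a factor of ten each time `n` grows tenfold.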

The professional fix has two parts. First, use a random sample large enough to be representative — population surveys, randomized clinical trials, and the like all rest on this. Second, whenever someone reaches for a “people I know” generalization, mentally collapse the sample back to $n = 1$ — a single anecdote is at best one data point and does not constitute a population-level claim.

1.2 Survivorship Bias: The Samples Filtered Out

During World War II, the U.S. military asked the Statistical Research Group at Columbia University, where statistician Abraham Wald worked, to analyze the bullet-hole distribution on returning bombers, hoping to learn where to add armor. The raw data showed the densest hits on wings and tail sections, with engines and cockpits almost untouched. The military’s first instinct was to armor the spots with the most bullet holes. Wald’s reading was the opposite: the parts to armor are the ones with almost no bullet holes, because the bullet-hole map only tells you “where the planes that came back got hit.” Planes hit in the engine or cockpit mostly never made it home and were therefore absent from the statistical sample altogether.

A modern retelling is “drop out of college, become a billionaire”: Bill Gates, Mark Zuckerberg, Steve Jobs, and so on. Counting only the survivors yields a string of famous names, but the dropouts who tried to start companies and failed are vastly more numerous and mostly invisible. Inferring “dropping out leads to success” from the famous-survivor sample is exactly the error Wald warned against with the bombers.

The shared structure is: the sample we can see has already been through an unnoticed “death filter” — planes that didn’t return and founders who failed never enter observation. The bias does not come from the sampling procedure doing anything wrong; it comes from the observation process itself being silently cut off by a latent variable (survival, success).

Formally, let the population random variable be $X$ and the observation condition be $X \in \Omega_{\text{obs}}$ (for the bombers, $\Omega_{\text{obs}}$ = “came back”). What we see is not the distribution of $X$, but the conditional distribution $X \mid X \in \Omega_{\text{obs}}$. The simplest version is one-sided truncation: if we observe only $X > c$, and some of the population lies at or below $c$, then

$\mathbb E[X \mid X > c] > \mathbb E[X],$

and using the “survivor sample mean” as the population mean is biased upward by construction. Computing the bullet-hole distribution only over returning planes does not give you a sample from the true ballistics distribution — it gives you that distribution restricted to the “survivor set”.
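The upward bias from one-sided truncation is easy to reproduce. The sketch below assumes a hypothetical standard-normal population and an arbitrary survival threshold `c = 1.0`; neither comes from the bomber data, they only illustrate the inequality above.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 100,000 outcomes drawn from Normal(0, 1).
population = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Assumed filter: only units with X > c are ever observed.
c = 1.0
survivors = [x for x in population if x > c]

population_mean = statistics.fmean(population)
survivor_mean = statistics.fmean(survivors)
survival_rate = len(survivors) / len(population)

print(f"population mean: {population_mean:.3f}")
print(f"survivor mean:   {survivor_mean:.3f}")
print(f"share observed:  {survival_rate:.1%}")
```

Under these assumptions only a minority of the population passes the filter, and the mean of what remains sits well above the true population mean even though no individual measurement is wrong.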

The professional fix is to spell out the entry criterion of the sample pool and try to estimate the “silent data.” Randomized controlled trials prevent this filtering by design — every sampled unit is followed up regardless of outcome. In observational studies, you first have to identify the filtering mechanism (survivorship, self-selection, response willingness, and so on) and then use methods like propensity score matching or Heckman correction to recover the missing portion as best you can.

2. Descriptive-Statistics Traps: Mean and Variance

2.1 Mean Misleading Under a Skewed Distribution

A well-worn complaint goes: “the Zhangs have ten million, their nine neighbors have nothing; average it out and everyone’s a millionaire.” In numbers, measured in units of ten thousand, that’s the sample $\{0, 0, 0, 0, 0, 0, 0, 0, 0, 1000\}$. The mean is 100 (one million), the median is 0, and the mode is 0. Describing these ten households as “average wealth one million” is a complete misrepresentation of nine of them.

The same pattern shows up everywhere. The “fresh-graduate average salary” published every recruiting season is dragged upward by a handful of top tech or finance offers, leaving most graduates feeling they’re embarrassingly below average. “Average follower count” on a social platform, “average creator income” on a content platform — same trick.

The cause is that the underlying distribution is not symmetric and bell-shaped. It is strongly skewed, often a power law or some other heavy-tailed shape — a small number of extreme values (outliers) account for most of the total, with the rest of the sample clustered near the low end.

The math is direct: the mean $\bar X = \frac{1}{n}\sum X_i$ weights every point equally, so the farther a point is from the bulk the more it pulls the mean. Replacing a single point $x$ with $x'$ moves the mean by

$\Delta \bar X = \frac{x' - x}{n},$

linear in $x' - x$ and unbounded as the replacement value grows. The median, by contrast, depends only on rank order; replace the Zhangs’ ten million with a billion and the median does not budge. For skewed data the mean is far more sensitive to outliers than the median is: a handful of extreme values can drag the mean far from the population’s “typical” level.

The professional fix is straightforward: when skewness is obvious, report the median plus interquartile range (IQR) or quantiles, not just the mean; for a fuller picture, look at the distribution shape (histogram, KDE) and the variance. When someone hands you an “average” or “per capita” number, first ask whether the underlying distribution is bell-shaped. If it isn’t, the median is the honest description of a typical sample.
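The ten-household example can be checked directly with Python's standard `statistics` module. This is a small sketch using the numbers from the text (in units of ten thousand); the variable names are illustrative.

```python
import statistics

# Nine households with nothing, one with ten million (units of ten thousand).
wealth = [0] * 9 + [1000]

mean_before = statistics.mean(wealth)
median_before = statistics.median(wealth)
mode_before = statistics.mode(wealth)

# Replace the Zhangs' ten million with a billion (100,000 in the same units).
wealth_extreme = [0] * 9 + [100_000]
mean_after = statistics.mean(wealth_extreme)
median_after = statistics.median(wealth_extreme)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves by $(x' - x)/n = 9900$ while the median stays at 0, which is the sensitivity gap described above.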

2.2 Comparisons Without a Significance Test

A principal at a staff meeting calls out the year-end averages of two classes: class A averaged 85, class B averaged 86 — a 1-point gap. The principal is furious and demands teacher A account for the “decline.” Is the verdict reasonable?

It depends entirely on two things: each class’s variance and sample size. Whether a 1-point mean gap is noise or signal cannot be judged from the gap alone — it has to be read together with variance and sample size.

Case 1 (no significant difference): both classes are 60-student general-track classes, with scores spread out (high 100, low 20). Each group’s variance is around $\sigma^2 \approx 400$ (standard deviation 20 points). The standard error of the difference in means is then $\sqrt{2\sigma^2 / n} \approx 3.65$ points, so a 1-point gap is barely a quarter of one standard error. The corresponding $p \approx 0.78 \gg 0.05$, so the gap is entirely consistent with sampling noise.

Case 2 (significant difference): A and B are sections of a high-stakes standardized exam where class A scores cluster tightly in 84–86 and class B in 85–87, with a per-class standard deviation around 1 point, and each class has not 60 but a thousand students. The standard error of the difference is now $\sqrt{2 \times 1 / 1000} \approx 0.045$ points, so the same 1-point gap is roughly 22 standard errors wide, $p \ll 0.05$, and the 1-point difference is real.

Formalized as a two-sample t-test, the test statistic is

$T = \frac{\bar X_A - \bar X_B}{\sqrt{\hat\sigma_A^2 / n_A + \hat\sigma_B^2 / n_B}}.$

Whether $T$ is large depends not on the mean gap itself but on how that gap compares to its own standard error, that is, the typical sampling fluctuation the gap would exhibit if the same procedure were rerun many times. The denominator $\sqrt{\hat\sigma_A^2 / n_A + \hat\sigma_B^2 / n_B}$ is exactly that standard error: it grows with variance and shrinks with sample size. The p-value is “the probability, under the null hypothesis $H_0$, of observing a result as extreme as the current one or more so.” A $p < 0.05$ leads us to reject $H_0$ (the gap is unlikely to be noise); a larger $p$ leaves no grounds to rule noise out. This is exactly the same machinery as an A/B test; for the full derivation, see From Potential Outcomes to Hypothesis Testing: Statistical and Causal Inference in A/B Tests, §4.
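The two classroom cases can be run through the test statistic above. The sketch below works from summary statistics only and, as a simplifying assumption, converts $T$ to a two-sided p-value with a normal approximation rather than the exact t distribution; the helper names `welch_t` and `two_sided_p_normal` are made up for this example.

```python
import math

def welch_t(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """Return (T, standard error) for the difference in two sample means."""
    se = math.sqrt(var_a / n_a + var_b / n_b)
    return (mean_a - mean_b) / se, se

def two_sided_p_normal(t):
    """Two-sided p-value under a normal approximation (an assumption that is
    reasonable for samples of this size, not an exact t-test)."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# Case 1: 60 students per class, standard deviation about 20 points.
t1, se1 = welch_t(85, 86, 400, 400, 60, 60)
p1 = two_sided_p_normal(t1)

# Case 2: 1000 students per class, standard deviation about 1 point.
t2, se2 = welch_t(85, 86, 1, 1, 1000, 1000)
p2 = two_sided_p_normal(t2)

print(f"case 1: se={se1:.2f}, T={t1:.2f}, p={p1:.2f}")
print(f"case 2: se={se2:.3f}, T={t2:.1f}, p={p2:.1e}")
```

The same 1-point gap yields a $T$ well inside one standard error in the first case and more than twenty standard errors in the second, purely because variance and sample size changed.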

A related rhetorical move is “false precision.” A skincare ad reading “skin elasticity improved by 34.567% after four weeks” looks aggressively scientific. But if the sample is ten volunteers with substantial individual variation, three decimal places of precision are theater — the actual confidence interval could easily span ±20%. A decimal place without a sample size and a variance to back it up is the cheapest possible piece of statistical theater.

The fix is simple: every time you see “A is X% better than B,” ask three things: how big is $n$, how big is $\sigma$, and what’s the $p$-value or confidence interval. Without those three numbers, the comparison cannot be interpreted.

3. Correlation Is Not Causation: Confounders

Every summer, ice-cream sales spike and so does the number of swimming-pool drownings. The two curves are highly correlated. Reading correlation as causation gives you the absurd “eating ice cream causes drowning.” The real driver is that rising summer temperatures push up both ice-cream sales and the number of people swimming: heat raises sales directly, and it raises drownings indirectly, through more swimmers in the water. Temperature is the confounder here.

Drawing the structure: a treatment variable $X$ (ice-cream sales), an outcome variable $Y$ (drownings), and a confounder $Z$ (temperature) that drives both. The data show $X$ and $Y$ strongly correlated, but the correlation does not come from the causal path $X \to Y$ — it comes from $Z$ driving both and creating a “spurious correlation.” As long as $Z$ is not controlled for, the observed correlation between $X$ and $Y$ mixes the genuine causal effect with the confounding effect, and the two cannot be separated directly.

Mathematically this is cleanest in Pearl’s do-notation, using the backdoor adjustment formula. Given a confounder $Z$ that accounts for all the confounding between $X$ and $Y$,

$\begin{aligned} P(Y \mid X) &= \sum_z P(Y \mid X, Z = z)\, P(Z = z \mid X), \\ P(Y \mid \mathrm{do}(X)) &= \sum_z P(Y \mid X, Z = z)\, P(Z = z). \end{aligned}$

The only difference between the two is the last factor: $P(Z = z \mid X)$ versus $P(Z = z)$. When $X$ and $Z$ are not independent (as with ice-cream sales and temperature), these two probabilities differ, and so in general do $P(Y \mid X)$ and $P(Y \mid \mathrm{do}(X))$; only the latter is the genuine causal effect. A sufficient condition for the two to agree is $X \perp Z$, that is, the treatment is independent of every potential confounder, which is exactly the property that an RCT creates by randomizing $X$.
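The two formulas can be evaluated side by side on a toy version of the ice-cream story. Every probability below is invented for illustration: temperature $Z$ is hot or cold with equal probability, high ice-cream sales ($X = 1$) are much more likely on hot days, and the drowning risk $Y$ depends on temperature only, so ice cream has no causal effect by construction.

```python
# Hypothetical probabilities, chosen only to illustrate the adjustment formula.
p_z = {"hot": 0.5, "cold": 0.5}              # P(Z = z)
p_x1_given_z = {"hot": 0.9, "cold": 0.1}     # P(X = 1 | Z = z)
p_y1_given_xz = {                            # P(Y = 1 | X = x, Z = z)
    (1, "hot"): 0.30, (0, "hot"): 0.30,      # same risk whatever X is
    (1, "cold"): 0.05, (0, "cold"): 0.05,
}

def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]

def observational(x):
    """P(Y = 1 | X = x): weights each stratum by P(Z = z | X = x)."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return sum(
        p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] / p_x for z in p_z
    )

def interventional(x):
    """P(Y = 1 | do(X = x)): weights each stratum by P(Z = z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

obs_gap = observational(1) - observational(0)
do_gap = interventional(1) - interventional(0)
print(f"observational gap:  {obs_gap:.3f}")
print(f"interventional gap: {do_gap:.3f}")
```

With these made-up numbers the observational gap is large even though the interventional gap is exactly zero, which is the confounding pattern the formulas describe.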

The professional fix runs along two paths. One is to randomize at the design stage: run an RCT when feasible, as in medical trials and industrial A/B tests. The other, when RCTs aren’t feasible (cost, ethics, practicality), is to use econometric identification strategies to approximate the causal effect in observational data: difference-in-differences (cross-differencing pre/post and treated/control to strip out time trends), instrumental variables (an exogenous variable that affects $Y$ only through $X$), propensity score matching (matching treated and control on covariate distribution), and so on. Each has its own failure modes, but they share one feature: all of them explicitly substitute assumptions for randomization. There is no free causal identification.

Further reading: how randomization defuses the counterfactual problem, how the ATE is estimated via difference-in-means, and the full derivation of randomization in A/B tests are all in From Potential Outcomes to Hypothesis Testing: Statistical and Causal Inference in A/B Tests, §2-3.

4. Simpson’s Paradox: The Aggregation Trap Under a Shifting Sample Mix

Suppose we look at second-hand-housing transactions in two districts of Shenzhen. In “the past,” Nanshan (the luxury district) recorded 6 transactions at ¥20,000 per square meter, and Bao’an (the entry-level district) recorded 1 transaction at ¥10,000 per square meter. The overall average that year was

$\frac{6 \times 2 + 1 \times 1}{7} = \frac{13}{7} \approx 1.857 \;\; \text{(in ¥10,000 / sqm)}.$

In “the present,” the transaction mix has shifted dramatically — Nanshan has only 1 transaction, now at ¥30,000 per square meter (+50%), and Bao’an has 6 transactions at ¥15,000 per square meter (+50%). Each district’s per-square-meter price has jumped 50%, but the overall average is now

$\frac{1 \times 3 + 6 \times 1.5}{7} = \frac{12}{7} \approx 1.714 \;\; \text{(in ¥10,000 / sqm)},$

an overall drop of $1 - 12/13 \approx 7.7\%$ . Every district has surged 50%, yet the overall average has fallen by roughly 8%. Listing prices and individual transactions all line up with the felt “prices are climbing”; the moment you aggregate the data city-wide with that weighting, the conclusion flips. This is Simpson’s paradox.

A canonical version is the 1973 UC Berkeley admissions case. The university-level acceptance rate for women was lower than for men, which superficially looked like “systematic bias against women.” Department by department, however, the rate for women was no lower than for men in almost every department. The real driver was that women were disproportionately applying to departments with already-low acceptance rates (such as English or Psychology), while men leaned toward departments with higher acceptance rates (such as Engineering). The “distribution of departments applied to” was the confounder that flipped the gender-vs-acceptance correlation when the data were aggregated.

The structural cause has a clean mathematical form. The aggregate mean is a weighted sum of group means,

$\mu = \sum_k w_k \mu_k,$

and depends on both the group means $\mu_k$ and the group weights $w_k$. Even if every $\mu_k$ rises in lockstep, a redistribution of $w_k$ away from high-$\mu$ groups toward low-$\mu$ groups can pull the aggregate $\mu$ downward. In the Shenzhen example, Nanshan’s weight collapsed from 6/7 to 1/7 while Bao’an’s rose from 1/7 to 6/7, and the effect of that weight reshuffle on the aggregate outweighed each district’s 50% price rise.
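The Shenzhen arithmetic can be reproduced in a few lines. This sketch uses the transaction counts and prices from the example above; the function name `aggregate_mean` and the dictionary layout are illustrative choices.

```python
def aggregate_mean(groups):
    """Count-weighted overall mean from {name: (count, group_mean)}."""
    total = sum(count for count, _ in groups.values())
    return sum(count * mean for count, mean in groups.values()) / total

# (number of transactions, price in units of 10,000 yuan per sqm)
past = {"Nanshan": (6, 2.0), "Bao'an": (1, 1.0)}
present = {"Nanshan": (1, 3.0), "Bao'an": (6, 1.5)}

# Every district's own price rose by 50%.
district_growth = {
    name: present[name][1] / past[name][1] - 1 for name in past
}

overall_change = aggregate_mean(present) / aggregate_mean(past) - 1

print(district_growth)
print(f"overall change: {overall_change:.1%}")
```

Each district shows +50% while the count-weighted aggregate falls by about 7.7%, because the weights moved from the expensive district to the cheap one.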

Two professional fixes are standard.

Fix 1 (standardized rate / direct standardization): fix a set of “standard weights” and use them to combine each period’s group means, so weight changes can no longer leak into the aggregate. In the Shenzhen example, if we assume the two districts each carry weight 0.5 in both periods,

$\begin{aligned} \text{past (standardized)} &= 0.5 \times 2 + 0.5 \times 1 = 1.5, \\ \text{present (standardized)} &= 0.5 \times 3 + 0.5 \times 1.5 = 2.25, \\ \text{true growth} &= 2.25 / 1.5 - 1 = 50\%. \end{aligned}$

Data and felt experience finally line up. The point of standardized weights is to freeze the “weight change” confounder so the comparison reflects only within-group movement — the same idea is what gives “age-adjusted mortality rate” its name in epidemiology.
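Direct standardization is a one-line weighted sum once the weights are frozen. The sketch below assumes two alternative sets of standard weights, the equal split used above and the past transaction mix of 6:1; the helper name `standardized_mean` is hypothetical.

```python
def standardized_mean(group_means, standard_weights):
    """Combine group means with a fixed set of weights (normalized to sum to 1)."""
    total = sum(standard_weights.values())
    return sum(w * group_means[g] for g, w in standard_weights.items()) / total

# Group means in units of 10,000 yuan per sqm, taken from the example.
past_means = {"Nanshan": 2.0, "Bao'an": 1.0}
present_means = {"Nanshan": 3.0, "Bao'an": 1.5}

equal_weights = {"Nanshan": 0.5, "Bao'an": 0.5}
past_mix_weights = {"Nanshan": 6, "Bao'an": 1}   # assumed alternative standard

growth_equal = (
    standardized_mean(present_means, equal_weights)
    / standardized_mean(past_means, equal_weights) - 1
)
growth_past_mix = (
    standardized_mean(present_means, past_mix_weights)
    / standardized_mean(past_means, past_mix_weights) - 1
)

print(f"{growth_equal:.0%} {growth_past_mix:.0%}")
```

Because both districts rose by the same 50%, any fixed set of weights recovers 50% here. When group growth rates differ, the standardized figure does depend on which standard weights are chosen, so the choice should be stated alongside the result.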

Fix 2 (Cochran-Mantel-Haenszel): for binary outcomes (admit / reject, convert / not, click / not), the Cochran-Mantel-Haenszel procedure forms a 2×2 table within each stratum (e.g. each department) and pools the strata into a single common odds ratio, weighting each stratum by its sample size. The UC Berkeley case can be untangled this way: stratifying by department and computing a combined odds ratio dissolves the “systematic gender bias” reading that came from the pooled, unstratified aggregate.
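As a rough illustration of the stratified approach, the sketch below computes the Mantel-Haenszel pooled odds ratio, $\sum_k (a_k d_k / n_k) \big/ \sum_k (b_k c_k / n_k)$, on two invented departments. The counts are made up to mimic the pattern and are not the actual Berkeley figures; the accompanying significance test is omitted.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a single 2x2 table (a/b) / (c/d)."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Each table: (women admitted, women rejected, men admitted, men rejected).
# Hypothetical counts: women mostly apply to the low-acceptance department.
strata = [
    (65, 35, 480, 320),    # high-acceptance department: women 65%, men 60%
    (225, 675, 40, 160),   # low-acceptance department:  women 25%, men 20%
]

# Crude (unstratified) table: add the departments together first.
crude_table = tuple(sum(col) for col in zip(*strata))
crude_or = odds_ratio(*crude_table)
pooled_or = mantel_haenszel_or(strata)

print(f"crude odds ratio:  {crude_or:.2f}")
print(f"pooled odds ratio: {pooled_or:.2f}")
```

With these invented counts the crude odds ratio falls well below 1 (women appear disadvantaged) while the stratified estimate is above 1, matching the direction seen within each department.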

The same mechanism shows up inside A/B testing as well: ramping the control-vs-treatment split mid-experiment makes $\mathbb{E}[Z_i]$ differ between early users and late users — naively differencing the two arms’ means then reproduces a Simpson-style bias. See From Potential Outcomes to Hypothesis Testing: Statistical and Causal Inference in A/B Tests, §3.4, for details.

5. When the Metric Becomes the Target: Cherry-Picking and Goodhart’s Law

The traps in the previous four sections all came from the structure of the data itself: too small a sample, a sample filtered before observation, a skewed underlying distribution, a hidden confounder, or a shifting group mix. The last group is different: it comes from humans actively intervening on the data, through selective disclosure and through treating a metric as something to game.

5.1 Cherry-Picking

Cherry-picking literally means “picking only the sweetest cherries.” A typical scene: a fund manager pitching to clients only shows the one fund that outperformed — the ten funds the same firm ran over the same period that lagged the market are quietly omitted. Another common move is carefully cropping the time window on a long-term price chart — shift the start and end dates and the same stock can appear to have “tripled in five years” or “halved in one.”

Mathematically this is “multiple comparisons plus selective reporting.” Suppose we run $K$ independent hypothesis tests — in the fund example, $K$ funds each tested for “beat the market or not”; in the chart example, $K$ time windows each tested for “trend significance” — each at significance level $\alpha$ . The probability of “at least one comes back significant” is

$1 - (1 - \alpha)^K \approx K\alpha \;\; (\text{when } K\alpha \ll 1).$

Reporting only the one significant test is equivalent to silently inflating the significance level from $\alpha$ to $1 - (1 - \alpha)^K$ . Running 20 independent comparisons and reporting only the best one effectively raises the 5% significance level to roughly 64% — nominally significant but statistically noise.
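The inflation is easy to verify both from the formula and by simulation. This sketch assumes every one of the $K$ tests is a true null and the tests are independent, so each comes back "significant" with probability $\alpha$; the function names are illustrative.

```python
import random

def familywise_rate(alpha, k):
    """Probability that at least one of k independent null tests is significant."""
    return 1.0 - (1.0 - alpha) ** k

def simulate_best_of_k(alpha=0.05, k=20, trials=20_000, seed=1):
    """Share of simulated 'reports' in which at least one null test passes."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials

exact = familywise_rate(0.05, 20)
simulated = simulate_best_of_k()
print(f"formula: {exact:.3f}, simulation: {simulated:.3f}")
```

Both numbers should land near 0.64, so a manager who shows only the best of twenty noise-driven funds has better than even odds of having something "significant" to show.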

5.2 Goodhart’s Law and the Cobra Effect

A widely repeated anecdote, possibly apocryphal, has the British colonial government in Delhi trying to reduce the local cobra population by paying a bounty for every dead cobra. Residents soon started breeding cobras at home to collect bounties; when the government caught on and ended the program, the breeders released their snakes into the wild, and Delhi ended up with more cobras than before the bounty was introduced. This is the origin of the term cobra effect.

The same story replays endlessly in corporate KPIs: measure engineers on lines of code, and engineers start writing pointlessly repetitive code; measure them on bug-fix rate, and they begin filing bugs against themselves so they can close them; measure a call center on average call duration, and agents start hanging up on customers to shorten the metric. The moment a metric stops being an “observed signal” and becomes a “target to optimize,” the people being measured drift toward optimizing the metric itself rather than the real-world value it was supposed to proxy for.

Goodhart’s law compresses this empirical regularity into one sentence: “when a measure becomes a target, it ceases to be a good measure.” In the modern reinforcement-learning framing, this is reward hacking — as optimization pressure on a proxy objective grows, the correlation between the proxy and the true objective tends to degrade, and can ultimately flip sign.
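One narrow slice of this can be simulated. The toy below assumes a proxy equal to the true value plus independent noise, then looks at how well the proxy tracks the true value among only the top-scoring candidates. It captures the selection side of the effect (a noisy proxy pushed hard), not deliberate gaming, and all distributions and names are assumptions made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(7)
# Hypothetical setup: true value ~ Normal(0, 1), proxy = true value + noise.
true_value = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
proxy = [v + rng.gauss(0.0, 1.0) for v in true_value]

def correlation_in_top(fraction):
    """Proxy-vs-true correlation among the top `fraction` ranked by proxy."""
    ranked = sorted(zip(proxy, true_value), reverse=True)
    top = ranked[: max(2, int(len(ranked) * fraction))]
    return pearson([p for p, _ in top], [v for _, v in top])

corr_all = correlation_in_top(1.0)
corr_top_10pct = correlation_in_top(0.10)
corr_top_1pct = correlation_in_top(0.01)
print(round(corr_all, 2), round(corr_top_10pct, 2), round(corr_top_1pct, 2))
```

Under these assumptions the correlation weakens as selection gets stricter: the harder the proxy is pushed, the less it says about the true value among the candidates that remain.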

The professional fix is not “find a better single metric” — it is to accept that any single metric will be gamed and design around that:

Multi-dimensional, mutually-constraining metrics: never decide on a single number. Pair lines of code with regression bug rate and maintenance cost; pair short-term CTR in an A/B test with long-term retention and a “negative experience” metric; pair ad CTR with revenue per user and complaint rate — so that gaming one metric immediately pays a cost on another.
Adversarial metrics and deep auditing: in addition to the main metric, track an adversarial metric that gets worse as the main one gets gamed (alongside “bugs closed,” also track “regression bug rate in the same module”). Periodically run deep audits that cross-check the metric back against the real-world value it was supposed to proxy.
Hiding the objective: in some settings, don’t reveal the exact functional form of the objective to the measured party. This is essentially using “an un-reverse-engineerable reward function” as defense against reward hacking.

What cherry-picking and Goodhart’s law have in common is that neither of them requires any statistical error — the mechanism is purely “people respond to incentives,” and that alone is enough to break metric-based decision-making. This is why this class of trap can’t be defended against with mathematics alone — defenses have to be built into process and institutional design.

6. Wrapping Up

Looking at all seven traps together, they are different answers to the same underlying question: how far apart is what a number is trying to tell you from what it actually represents? In the first four sections, that gap comes from distortion in the data structure itself; in the last, it comes from humans actively intervening on the data. A few things worth keeping in mind before applying any of this:

An anecdote is not evidence. Every “someone I know” generalization has a sample size of 1; mentally collapse such claims back to $n = 1$ before they go any further. A statistically meaningful conclusion is always backed by a sample big enough and random enough — how big depends on effect size and variance, but it is never “me and my cousin.”
For skewed data, read the median, not the mean. Wealth, income, traffic, follower counts, time-on-site, and almost any “human-generated metric capable of extreme outliers” follow heavy-tailed distributions. When the headline is an “average” or “per capita,” the first follow-up should be “what’s the median?” and “what does the distribution look like?”
A comparison without a significance test cannot be told apart from noise. The moment you see “A is 1% better than B,” ask how big $n$ is, how big $\sigma$ is, and what the $p$-value is. Without those three, there is no way to tell signal from noise; extra decimal places are rhetoric, not evidence.
Correlation is not causation, and a within-group trend is not the aggregate trend. Ice-cream sales correlating with drowning doesn’t mean one causes the other; every district rising doesn’t mean the aggregate rises. Whenever you see “X and Y moving together,” first look for a $Z$ that could drive both; whenever you see an aggregate computed across some grouping, check whether the group weights moved.
The moment a metric becomes a target it stops being a good metric. Any single-metric decision system carries Goodhart risk; the answer isn’t to find a better single metric but to add adversarial metrics, deep audits, and multi-dimensional balancing.