Neural networks as “end-to-end learnable functions” are now everyday tools, but for many practitioners the mental model of “learning” stops at the phrase “there’s backpropagation.” This black-box view holds up while training behaves; the moment loss stops decreasing, gradients explode, or inference performance collapses, there is nothing concrete to grab onto. Understanding the foundations of deep learning isn’t about reinventing PyTorch — it’s about knowing where to look when a model fails to converge or generalize.

This post walks through the mathematical skeleton at the bottom of deep learning — from Rosenblatt’s 1957 perceptron all the way to the engineering foundations that let modern deep networks train stably with hundreds of layers. The core is three things: stacking linear transforms with nonlinear activations to gain “any-function” expressiveness, propagating error through a computational graph by the chain rule, and using variance control plus regularization to keep deep networks actually trainable.

The post unfolds across §§ 1-5: the perceptron’s mathematical form and the XOR problem establish the necessity of nonlinearity (§ 1); multilayer perceptrons plus the universal approximation theorem show its sufficiency (§ 2); the basic neural-network training pipeline — loss function plus gradient descent — turns “can be expressed” into “can be learned” (§ 3); vanishing / exploding gradients along with initialization and normalization keep training stable at dozens or hundreds of layers (§ 4); finally, the bias–variance tradeoff through L1 / L2 / Dropout maps the gap between “learning” and “generalizing” (§ 5).

1. The Perceptron: Limits of Linear Models

1.1 The Mathematical Form of the Perceptron

The perceptron was proposed by Rosenblatt in 1957 and is the first model in history whose parameters could be learned from data for binary classification. Its core operation is a weighted sum over input dimensions passed through a step function:

y  =  sign(wx+b),sign(z)={+1,z01,z<0y \;=\; \mathrm{sign}(w \cdot x + b), \qquad \mathrm{sign}(z) = \begin{cases} +1, & z \ge 0 \\ -1, & z < 0 \end{cases}

Here xRnx \in \mathbb{R}^n is the input vector, wRnw \in \mathbb{R}^n is the weight vector, bRb \in \mathbb{R} is the bias, and the output y{1,+1}y \in \{-1, +1\} is the class label. Geometrically the decision boundary wx+b=0w \cdot x + b = 0 is a hyperplane in Rn\mathbb{R}^n. The perceptron is, at its core, a model that “finds a hyperplane separating two classes of data”.

graph LR
  x1((x1)) -->|w1| S["Σ + b"]
  x2((x2)) -->|w2| S
  xn((xn)) -->|wn| S
  S --> step["sign·"]
  step --> y((y))

The learning algorithm itself is minimal — but to see it clearly, the symbols it uses should be listed once up front:

SymbolMeaningType
wRnw \in \mathbb{R}^nweight vectorparameter
bRb \in \mathbb{R}bias scalarparameter
xiRnx_i \in \mathbb{R}^ninput of the ii-th sampledata
yi{1,+1}y_i \in \{-1, +1\}label of the ii-th sampledata
η>0\eta > 0learning rate, the step size per updatehyperparameter
\leftarrowassignment (write the right value into the left)operator

Both ww and bb are learnable parameters, while η\eta is a human-chosen hyperparameter; without bb the decision hyperplane is locked at the origin and cannot be translated.

The full training loop is:

eta = 1                                   # learning rate
w = 0; b = 0
for _ in range(max_iter):                 # iteration cap
    n_updates = 0
    for x_i, y_i in training_samples:
        score = w @ x_i + b               # dot product
        if y_i * score <= 0:              # misclassified
            w = w + eta * y_i * x_i
            b = b + eta * y_i
            n_updates += 1
    if n_updates == 0:
        break                             # full pass with no update -> converged

Only misclassified points trigger an update; correctly classified points are skipped — this is the key difference between the perceptron and later gradient descent in “when does an update fire”. The compact misclassification test is yi(wxi+b)0y_i (w \cdot x_i + b) \le 0: the true label and the predicted score have opposite signs (or the score is exactly 0).

Why does the update point along +ηyixi+\eta y_i x_i rather than some other direction? Suppose some point (xi,yi)(x_i, y_i) is misclassified. Recompute its score after the update:

wnewxi+bnew=(w+ηyixi)xi+(b+ηyi)=(wxi+b)+ηyi(xi2+1)=old score+ηyi(xi2+1).\begin{aligned} w_{\text{new}} \cdot x_i + b_{\text{new}} &= (w + \eta y_i x_i) \cdot x_i + (b + \eta y_i) \\ &= (w \cdot x_i + b) + \eta y_i (\|x_i\|^2 + 1) \\ &= \text{old score} + \eta y_i (\|x_i\|^2 + 1). \end{aligned}

With η>0\eta > 0 and xi2+1>0\|x_i\|^2 + 1 > 0, the new score moves strictly along the yiy_i direction relative to the old score: larger for yi=+1y_i = +1, smaller for yi=1y_i = -1, in both cases toward “next pass predicts correctly”. yiy_i is both the signal of “which direction I want the score to move” and the multiplier that flips the update direction with the label; replacing it with a fixed +ηxi+\eta x_i would actually push negative samples further into being wrong.

A concrete pass: take η=1\eta = 1, initial w=(0,0),b=0w = (0, 0), b = 0, with two training points x1=(2,1),y1=+1x_1 = (2, 1), y_1 = +1 and x2=(1,2),y2=1x_2 = (-1, 2), y_2 = -1.

  • See x1x_1: score =0= 0, y10=00y_1 \cdot 0 = 0 \le 0, misclassified. Update w(2,1),  b1w \leftarrow (2, 1),\; b \leftarrow 1.
  • See x2x_2: score =2(1)+12+1=1= 2 \cdot (-1) + 1 \cdot 2 + 1 = 1, predicted +1+1 but the true label is 1-1, misclassified. Update w(3,1),  b0w \leftarrow (3, -1),\; b \leftarrow 0.
  • Recheck x1x_1: score =5>0= 5 > 0 ✓; recheck x2x_2: score =5<0= -5 < 0 ✓. Both correct, the loop terminates.

A common engineering simplification is to absorb bb into ww: append a constant 1 to every input to form x~=(x,1)Rn+1\tilde x = (x, 1) \in \mathbb{R}^{n+1}, and extend the weights to w~=(w,b)\tilde w = (w, b). Then wx+b=w~x~w \cdot x + b = \tilde w \cdot \tilde x, and the whole update collapses into w~w~+ηyix~i\tilde w \leftarrow \tilde w + \eta y_i \tilde x_i, which is more compact. bb remains a parameter mathematically — it’s just hidden inside the last component of the extended vector.

The whole learning procedure has no gradient, no backpropagation, no notion of a loss function — it’s the most primitive “fix on mistake” version: misclassified points pull the hyperplane toward themselves, correctly classified points leave it alone.

1.2 The Novikoff Convergence Theorem

A learning rule alone isn’t enough; we still need to answer two questions: when is convergence guaranteed, and how many steps does it take? In 1962, Novikoff gave a complete answer: when the data is linearly separable, the perceptron algorithm converges in at most R2/γ2\lceil R^2 / \gamma^2 \rceil steps.

Theorem (Novikoff 1962): Suppose the training set {(xi,yi)}i=1N\{(x_i, y_i)\}_{i=1}^N satisfies xiR\|x_i\| \le R, and there exists ww^* (with w=1\|w^*\| = 1 WLOG) such that yi(wxi)γ>0y_i (w^* \cdot x_i) \ge \gamma > 0 for every ii. Then the perceptron algorithm, starting from w0=0w_0 = 0 with learning rate η=1\eta = 1, converges after at most tR2/γ2t \le R^2 / \gamma^2 updates.

The proof bounds the weight vector from above and below, then squeezes with Cauchy-Schwarz. Let wtw_t denote the weights after tt updates.

Inner-product lower bound: each update is wt=wt1+yitxitw_t = w_{t-1} + y_{i_t} x_{i_t}, so

wtw=(wt1+yitxit)w=wt1w+yit(xitw)wt1w+γ.\begin{aligned} w_t \cdot w^* &= (w_{t-1} + y_{i_t} x_{i_t}) \cdot w^* \\ &= w_{t-1} \cdot w^* + y_{i_t} (x_{i_t} \cdot w^*) \\ &\ge w_{t-1} \cdot w^* + \gamma. \end{aligned}

Starting from w0w=0w_0 \cdot w^* = 0 and iterating to tt gives wtwtγw_t \cdot w^* \ge t \gamma.

Norm-squared upper bound:

wt2=wt1+yitxit2=wt12+2yit(wt1xit)+xit2.\begin{aligned} \|w_t\|^2 &= \|w_{t-1} + y_{i_t} x_{i_t}\|^2 \\ &= \|w_{t-1}\|^2 + 2 y_{i_t} (w_{t-1} \cdot x_{i_t}) + \|x_{i_t}\|^2. \end{aligned}

An update happens only when (xit,yit)(x_{i_t}, y_{i_t}) is misclassified under wt1w_{t-1}, i.e., yit(wt1xit)0y_{i_t} (w_{t-1} \cdot x_{i_t}) \le 0, so the cross term is 0\le 0; substituting xit2R2\|x_{i_t}\|^2 \le R^2 gives wt2wt12+R2\|w_t\|^2 \le \|w_{t-1}\|^2 + R^2. Iterating from w0=0\|w_0\| = 0 to tt yields wt2tR2\|w_t\|^2 \le t R^2.

Squeezing: by Cauchy-Schwarz, wtwwtw=wtw_t \cdot w^* \le \|w_t\| \cdot \|w^*\| = \|w_t\|. Combined with both bounds,

tγ    wtw    wt    tR2=Rt.t \gamma \;\le\; w_t \cdot w^* \;\le\; \|w_t\| \;\le\; \sqrt{t R^2} = R \sqrt{t}.

Squaring both sides and dividing by γ2\gamma^2 gives tR2/γ2t \le R^2 / \gamma^2. \blacksquare

Novikoff gives not an asymptotic guarantee but an absolute upper bound — provided the data is linearly separable, the iteration count is fixed by the ratio of “radius RR to margin γ\gamma. This result fueled a wave of optimism about the perceptron in the 1960s; a new limit was about to follow.

1.3 The XOR Problem: Algebra and Geometry of Non-separability

Minsky and Papert’s 1969 book Perceptrons gave a counterexample that nearly buried the perceptron: exclusive-or (XOR). XOR maps (x1,x2){0,1}2(x_1, x_2) \in \{0, 1\}^2 to 0 or 1 according to:

x1x_1x2x_2XOR
000
011
101
110

Algebraically suppose (w1,w2,b)(w_1, w_2, b) exists so the perceptron learns XOR. Plugging in the four points with their sign constraints (output 1 ⇒ 0\ge 0, output 0 ⇒ <0< 0):

b<0from (0,0)0,w2+b0from (0,1)1,w1+b0from (1,0)1,w1+w2+b<0from (1,1)0.\begin{aligned} b &< 0 \quad &&\text{from } (0, 0) \to 0, \\ w_2 + b &\ge 0 \quad &&\text{from } (0, 1) \to 1, \\ w_1 + b &\ge 0 \quad &&\text{from } (1, 0) \to 1, \\ w_1 + w_2 + b &< 0 \quad &&\text{from } (1, 1) \to 0. \end{aligned}

Adding the second and third inequalities gives w1+w2+2b0w_1 + w_2 + 2b \ge 0; adding the first and fourth gives w1+w2+2b<0w_1 + w_2 + 2b < 0. These two inequalities are directly contradictory — no single-layer linear model can implement XOR.

Geometrically the picture is even clearer: the decision boundary w1x1+w2x2+b=0w_1 x_1 + w_2 x_2 + b = 0 is a line in R2\mathbb{R}^2 that splits the plane into two half-planes. XOR demands that (0,0)(0, 0) and (1,1)(1, 1) live in the 0 half-plane while (0,1)(0, 1) and (1,0)(1, 0) live in the 1 half-plane — connecting the two points of each class with a segment gives exactly the two diagonals of the unit square, and these two diagonal segments intersect at the center (0.5,0.5)(0.5,\, 0.5). Once the two classes’ convex hulls intersect (here, two diagonals meeting at the center), no single line can place them in different half-planes, so no hyperplane can perform XOR’s “same color across the diagonal” split.

 x2

1 ●           ○        ●  class 1


0 ○           ●        ○  class 0
  └────────────── x1
    0           1

XOR isn’t an isolated counterexample — it points to a broader class of problems: whenever classes need a “non-convex decision region,” single-layer linear models necessarily fail. To push past this ceiling the model has to move from “one hyperplane” to “a composition of hyperplanes,” which is what the multilayer perceptron is built for.

2. Multilayer Perceptrons: Where Non-linearity Emerges

2.1 From One Layer to Many: Matrix Composition

The multilayer perceptron (MLP) inserts a nonlinear activation σ\sigma between successive linear transforms. Two-layer form:

y  =  σ ⁣(W2σ(W1x+b1)+b2),y \;=\; \sigma\!\left(W_2 \, \sigma(W_1 x + b_1) + b_2\right),

where W1Rh×nW_1 \in \mathbb{R}^{h \times n} maps the nn-dim input into an hh-dim hidden layer, W2Rm×hW_2 \in \mathbb{R}^{m \times h} maps the hidden layer to an mm-dim output, and σ\sigma is an element-wise nonlinearity. The general LL-layer form is

a0=x,al=σ(Wlal1+bl)    (l=1,,L),y=aL.a_0 = x, \quad a_l = \sigma(W_l a_{l-1} + b_l) \;\; (l = 1, \dots, L), \quad y = a_L.

A multilayer perceptron strings together a sequence of linear transforms with matrix multiplication, and inserts a nonlinear σ\sigma between every two layers to keep the whole stack from collapsing. The “nonlinear σ\sigma” is the load-bearing piece — the next section proves that without it the entire structure degenerates to a single layer.

2.2 The Linear-Collapse Theorem: Why Activations Are Required

Replace σ\sigma with the identity σ(z)=z\sigma(z) = z. The two-layer MLP becomes

y=W2(W1x+b1)+b2=(W2W1)x+(W2b1+b2)W~x+b~,\begin{aligned} y &= W_2 (W_1 x + b_1) + b_2 \\ &= (W_2 W_1) x + (W_2 b_1 + b_2) \\ &\equiv \tilde W x + \tilde b, \end{aligned}

a brand-new linear transform. In general, an LL-layer purely linear MLP becomes

y=WLWL1W1x+l=1L ⁣(k=l+1LWk)bl=W~x+b~,y = W_L W_{L-1} \cdots W_1 x + \sum_{l=1}^{L} \!\left(\prod_{k=l+1}^{L} W_k\right) b_l = \tilde W x + \tilde b,

still a single-layer linear transform. Any finite-depth composition of purely linear maps remains linear — no matter how many layers you stack, the expressive power never exceeds that of the shallowest possible model. This is “linear collapse.”

The nonlinear activation is the only mechanism that lets the stack escape the affine subspace. How far it escapes depends on σ\sigma and on LL, but the prerequisite is simply that σ\sigma is not the identity. Activations are not decoration — they are the sole source of nonlinear expressive power.

2.3 The Universal Approximation Theorem: Sketch of a Constructive Proof

What is the “floor” of expressive power once nonlinearity is in place? In 1989, Cybenko, and in 1991, Hornik, independently provided the answer: the universal approximation theorem (UAT).

Theorem (Cybenko 1989): Let σ\sigma be a continuous, monotone, bounded activation function (sigmoidal). For any compact set (i.e., closed and bounded — e.g., [0,1]n[0, 1]^n) KRnK \subset \mathbb{R}^n, any continuous f:KRf: K \to \mathbb{R}, and any ϵ>0\epsilon > 0, there exist a natural number hh and weights {wi,bi,αi}i=1h\{w_i, b_i, \alpha_i\}_{i=1}^h such that the network

g(x)=i=1hαiσ(wix+bi)g(x) = \sum_{i=1}^{h} \alpha_i \sigma(w_i \cdot x + b_i)

satisfies supxKg(x)f(x)<ϵ\sup_{x \in K} |g(x) - f(x)| < \epsilon.

A constructive sketch (not a full proof): take σ\sigma to be sigmoid. σ(α(zc))\sigma(\alpha(z - c)) is a smooth S-curve that transitions from 0 to 1 around z=cz = c — larger α\alpha makes the transition region narrower and the curve steeper, so the shape approaches a step function at z=cz = c.

Subtracting two such sigmoids whose transition points are close (separated by δ\delta) — σ(α(zc))σ(α(zcδ))\sigma(\alpha(z - c)) - \sigma(\alpha(z - c - \delta)) — is most intuitive when split into three regions along the zz-axis:

Regionσ(α(zc))\sigma(\alpha(z - c))σ(α(zcδ))\sigma(\alpha(z - c - \delta))Difference
z<cz < c0\approx 00\approx 00\approx 0
z[c,c+δ]z \in [c,\, c + \delta]1\approx 1 (already past cc)0\approx 0 (not yet at c+δc + \delta)1\approx 1
z>c+δz > c + \delta1\approx 11\approx 10\approx 0

The difference is 1\approx 1 only on the narrow band in the middle and 0\approx 0 everywhere else — a rectangular pulse supported on [c,c+δ][c, c + \delta], obtained as the difference of two step functions offset by δ\delta. At finite α\alpha the rectangle has smooth transitions (no sharp corners), so the sigmoid difference is a smooth approximation to a rectangular pulse.

And any continuous function on a compact set can be uniformly approximated to arbitrary precision by a step function (a piecewise-constant function built from rectangles of varying height and sufficiently small width) — this is a direct consequence of the Heine–Cantor theorem (a continuous ff on a compact set is uniformly continuous, so partitioning the set finely enough and using f(xi)f(x_i) as the constant on each piece keeps the uniform error under any preset bound). Replacing each rectangle in the step function with a sigmoid-difference bump scaled to that rectangle’s height and summing them gives a single-hidden-layer sigmoid network that approximates ff piece by piece — exactly the form g(x)=iαiσ(wix+bi)g(x) = \sum_i \alpha_i \sigma(w_i \cdot x + b_i) stated in the UAT. Hornik 1991 generalized the conclusion to any non-constant, bounded, continuous σ\sigma, dropping the sigmoidal requirement.

UAT delivers an existence guarantee for expressive power, not an efficiency claim, and not an optimization or generalization guarantee. It says “neural networks can in principle approximate any continuous function” — but “can approximate” ≠ “can be learned” (an optimization problem) ≠ “approximates on the test set” (a generalization problem). These two gaps are exactly what §§ 3-5 address.

2.4 The Activation Function Zoo: Saturating to Non-saturating

Before walking through specific functions, fix a few terms that will recur:

  • Saturation region: the part of the input where σ(z)0\sigma'(z) \to 0. Inside it, input changes barely move the output, and backprop compresses the gradient through that layer toward zero.
  • Saturating activation: an activation with saturation regions on both sides. Sigmoid (z±z \to \pm\infty approaching 0 / 1) and tanh (approaching 1\mp 1) saturate on both sides; in deep networks, the chain product σ\prod \sigma' of such functions decays exponentially.
  • Half-saturating activation: saturates on only one side — e.g., ReLU’s negative half (σ=0\sigma = 0 for z<0z < 0); the positive half has σ(z)=1\sigma'(z) = 1 and does not saturate.
  • Dying region: the part where σ(z)\sigma'(z) is exactly zero, not merely near zero. ReLU’s negative half is exactly this: σ(z)=0\sigma(z) = 0 and σ(z)=0\sigma'(z) = 0, so any neuron pushed into this region receives no gradient signal and never updates again — this is “dying ReLU”. The difference between dying and saturating is “exactly zero” vs “approaching zero”; the former is irrecoverable, the latter can still escape via small perturbations.

UAT’s requirements on σ\sigma are quite loose, but the engineering choice of activation directly determines training stability. The four most common activations:

-4-3-2-101234Input z -101234Activation σ(z) Sigmoid Tanh ReLU GeLU
Four activation functions

What matters more for training are their derivatives — backpropagation’s chain product multiplies σ\sigma' into every layer:

-4-3-2-101234Input z 00.20.40.60.811.2Derivative σ'(z) Sigmoid' Tanh' ReLU' GeLU'
Derivatives of the four activations

The key differences across these four activations all live in their derivatives:

ActivationExpressionσ(0)\sigma'(0)Saturation-region σ\sigma'Training behavior
Sigmoid1/(1+ez)1 / (1 + e^{-z})0.250\to 0Deep networks vanish easily
Tanhtanh(z)\tanh(z)10\to 0Slightly better than sigmoid, still saturates
ReLUmax(0,z)\max(0, z)1 (by convention)1 (positive) / 0 (negative)Non-saturating but has dying region
GeLUzΦ(z)z \cdot \Phi(z)0.51\to 1 (positive) / 0\to 0 (negative)Transformer default, smooth

Saturating activations (sigmoid, tanh) have derivatives that tend to 0 at large z|z|, so the chain product σ\prod \sigma' in a deep network decays exponentially — this is why sigmoid is essentially abandoned in deep architectures. ReLU fixes the saturation problem by pinning the positive-region derivative at 1, but introduces the “dying ReLU” problem: the negative-region derivative is permanently 0, and once a neuron is pushed into the negative region it never updates again. GeLU smooths the kink at the ReLU origin and is the default in modern Transformer-style architectures.

ReLU’s dying region motivated a family of improvements. The common ones in practice:

  • Leaky ReLU: σ(z)=max(αz,z)\sigma(z) = \max(\alpha z, z) with α\alpha a small positive constant (typically 0.01). Replacing ReLU’s zero negative-slope with α\alpha eliminates the dying region — every input now has a nonzero gradient flowing back.
  • PReLU (Parametric ReLU): same shape as Leaky ReLU, but α\alpha becomes a learnable parameter, one per layer.
  • ELU (Exponential Linear Unit): σ(z)=z\sigma(z) = z if z>0z > 0, α(ez1)\alpha(e^z - 1) if z0z \le 0. As zz \to -\infty, σ(z)α\sigma(z) \to -\alpha (the negative side saturates at y=αy = -\alpha), so the function is differentiable everywhere and bounded below by α-\alpha — usually more stable than Leaky ReLU.
  • Swish / SiLU: σ(z)=zσsigm(z)\sigma(z) = z \cdot \sigma_\text{sigm}(z) where σsigm\sigma_\text{sigm} is sigmoid. Shape is essentially the same as GeLU — GeLU uses the Gaussian CDF Φ\Phi for the smoothing weight, Swish uses sigmoid; the two are nearly interchangeable in large-model practice.

Plotting these four variants (Leaky ReLU with α=0.1\alpha = 0.1, ELU with α=1\alpha = 1, plus plain ReLU as a reference) overlaid makes the shape differences concrete:

-4-3-2-101234Input z -101234Activation σ(z) ReLU (reference) Leaky ReLU (α=0.1) ELU (α=1) Swish / SiLU
ReLU-family variants

And their derivatives:

-4-3-2-101234Input z -0.200.20.40.60.811.2Derivative σ'(z) ReLU' (reference) Leaky ReLU' (α=0.1) ELU' (α=1) Swish' / SiLU'
Derivatives of ReLU-family variants

A few things to notice: Leaky ReLU lifts ReLU’s negative-side derivative from 0 to α\alpha (a flat line — that’s the entire cost of removing the dying region); ELU’s derivative is continuous at z=0z = 0 where ReLU has a kink; Swish is smooth overall and even dips slightly below 0 around z1.3z \approx -1.3 — exactly the shape that makes it nearly interchangeable with GeLU.

Practical guidance by setting:

  • Transformers / large language models: GeLU by default (BERT, the GPT family, LLaMA all use it); Swish / SiLU as a drop-in alternative. Smoothness helps large-model optimizers stay stable.
  • CNN / vanilla MLP hidden layers: ReLU remains the default — simple and efficient; if dying ReLU is a concern, switch to Leaky ReLU or ELU.
  • Output layer: depends on the task — linear for regression, sigmoid for binary classification, softmax for multi-class. “Saturation” here is the design goal (squashing logits into probabilities), not a flaw.
  • LSTM / GRU gates: gates need [0,1][0, 1] outputs, so sigmoid; cell state uses tanh. This is intrinsic to RNN design — don’t swap.
  • Avoid sigmoid / tanh in deep hidden layers: chain-product decay is too severe, unless BatchNorm / LayerNorm rescues things (§ 4.3 expands on this).

Initialization should match: He init for the ReLU family, Xavier for sigmoid / tanh (§ 4.2 derives both).

§ 4 will quantify what “chain product decay” actually means.

3. Training Neural Networks: Loss, Backpropagation, Gradient Descent

3.1 The Training Loop at a Glance

§§ 1-2 covered what a neural network can express — single-layer models only carve hyperplanes, multilayer networks with nonlinearities approximate any continuous function. The remaining core question: given a network architecture, how do we learn its parameters θ\theta from data? This section gives the high-level view — the three core actions in every training loop (defining a loss / computing gradients via backprop / updating with gradient descent); § 3.2 then expands “where the loss comes from”.

1. The training objective: think of the network as a parameterized function fθf_\theta. Given a training set {(xi,yi)}i=1N\{(x_i, y_i)\}_{i=1}^N, training seeks θ\theta^* that minimizes “the gap between the model’s predictions and the true labels” — quantified by a loss function L(θ)L(\theta) (common forms catalogued in § 3.2, probabilistic interpretation in § 3.3).

2. Gradient descent: neural networks generally have no closed-form optimum, so training is iterative — each step moves a small distance along the negative gradient of the loss w.r.t. the parameters:

θθηθL(θ).\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta).

Intuition: the gradient θL\nabla_\theta L points in the direction of steepest ascent, so the opposite direction is steepest descent. The learning rate η\eta (a small positive number, e.g. 10310^{-3}) controls the step size — too large overshoots the minimum, too small makes convergence painfully slow. This is gradient descent.

3. Backpropagation is the algorithm that makes computing θL\nabla_\theta L tractable: modern networks have millions to billions of parameters; computing each L/θj\partial L / \partial \theta_j by hand is hopeless. Backpropagation solves this by treating the network as a computational graph (built during the forward pass) and applying the chain rule in reverse topological order to compute all L/θj\partial L / \partial \theta_j in one sweep, at the same compute cost as the forward pass. Note the distinction: “backpropagation” is the algorithm that computes the gradient, “gradient descent” is the optimization algorithm that uses the gradient; both are required for a single parameter update.

4. The full training loop: stringing these three pieces together:

graph LR
  input["input x_batch"] --> forward["forward: y_pred = f_θx"]
  forward --> loss["compute loss L"]
  loss --> backward["backward: ∇_θ L"]
  backward --> update["update: θ ← θ - η∇L"]
  update --> input

The corresponding pseudocode (a minimal mini-batch SGD — stochastic gradient descent — training loop, which computes the gradient and updates on each small batch rather than scanning the full training set; this is the default pattern in every modern framework):

for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        y_pred = model.forward(x_batch)       # forward
        loss = loss_fn(y_pred, y_batch)       # compute loss
        grads = loss.backward()               # backprop -> ∇L
        for param in model.parameters():
            param -= eta * grads[param]       # gradient descent

The training loop is the repeated cycle of “forward to compute predictions → compute loss → backward to compute gradient → take a small step”.

3.2 A Quick Catalog of Common Loss Functions

Before the probabilistic derivation, here is a quick landscape of the loss functions most commonly used in deep learning:

LossFormTypical use
MSE / L21Ni(yiy^i)2\frac{1}{N}\sum_i (y_i - \hat y_i)^2regression
MAE / L11Niyiy^i\frac{1}{N}\sum_i \lvert y_i - \hat y_i \rvertregression, more robust to outliers
Binary cross-entropyylogy^(1y)log(1y^)-y \log \hat y - (1-y) \log(1 - \hat y)binary classification
Categorical cross-entropykyklogy^k-\sum_k y_k \log \hat y_kmulti-class classification (paired with softmax)
Hinge lossmax(0,1yy^)\max(0,\, 1 - y \hat y)SVM, max-margin classification
KL divergencexp(x)logp(x)q(x)\sum_x p(x) \log \frac{p(x)}{q(x)}distribution matching (VAE, knowledge distillation)

3.3 A Probabilistic Interpretation of Loss: MLE Derives MSE and Cross-Entropy

Many textbooks present MSE and cross-entropy as two independent losses, but they actually share one source: maximum likelihood estimation (MLE) under different noise / output distribution assumptions.

Regression (Gaussian noise): assume yi=fθ(xi)+ϵiy_i = f_\theta(x_i) + \epsilon_i with ϵiN(0,σ2)\epsilon_i \sim \mathcal{N}(0, \sigma^2) i.i.d. The conditional density is

p(yixi,θ)=12πσexp ⁣((yifθ(xi))22σ2).p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right).

The log-likelihood is

logp({yi}{xi},θ)=i=1Nlogp(yixi,θ)=12σ2i=1N(yifθ(xi))2Nlog(2πσ).\begin{aligned} \log p(\{y_i\} \mid \{x_i\}, \theta) &= \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) \\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_\theta(x_i))^2 - N \log(\sqrt{2\pi}\,\sigma). \end{aligned}

The second term is independent of θ\theta, so maximizing the log-likelihood is equivalent to minimizing (yifθ(xi))2\sum (y_i - f_\theta(x_i))^2 — exactly MSE. MSE isn’t an arbitrary loss — it’s the direct product of MLE under the “output plus Gaussian noise” assumption.

Classification (Bernoulli output): for binary classification let fθ(x)(0,1)f_\theta(x) \in (0, 1) be the model’s predicted “probability of class 1” and yi{0,1}y_i \in \{0, 1\} the label. A Bernoulli distribution models a random variable taking only {0,1}\{0, 1\} and is fully determined by a single parameter π[0,1]\pi \in [0, 1]: P(Y=1)=πP(Y = 1) = \pi, P(Y=0)=1πP(Y = 0) = 1 - \pi. Its compact form uses “exponents as if-else selectors” to merge both cases into one line —

P(Y=y)=πy(1π)1y,y{0,1}.P(Y = y) = \pi^y (1 - \pi)^{1 - y}, \quad y \in \{0, 1\}.

Verifying: at y=1y = 1, (1π)0=1(1 - \pi)^0 = 1 drops out, leaving π\pi; at y=0y = 0, π0=1\pi^0 = 1 drops out, leaving 1π1 - \pi. Anything-to-the-zero is 1, so “the factor that shouldn’t be selected” disappears automatically.

The “conditional” in “conditional distribution” means conditioning on the input xix_i — the model lets the Bernoulli parameter π\pi depend on xix_i via π(xi)=fθ(xi)\pi(x_i) = f_\theta(x_i); the same θ\theta produces different π\pi on different xix_i. Substituting π(xi)\pi(x_i) into the Bernoulli compact form gives

p(yixi,θ)=fθ(xi)yi(1fθ(xi))1yi.p(y_i \mid x_i, \theta) = f_\theta(x_i)^{y_i} (1 - f_\theta(x_i))^{1 - y_i}.

Concrete example (cat-vs-dog): a classifier outputs fθ(x)=0.8f_\theta(x) = 0.8 for an image (predicted probability of “cat”). If the true label is y=1y = 1 (cat), p(y=1x,θ)=0.810.20=0.8p(y = 1 \mid x, \theta) = 0.8^1 \cdot 0.2^0 = 0.8 — the model assigns a high probability to the true label. If the true label is y=0y = 0 (dog), p(y=0x,θ)=0.800.21=0.2p(y = 0 \mid x, \theta) = 0.8^0 \cdot 0.2^1 = 0.2 — the model is wrong and gives the true label a low probability. MLE tunes θ\theta to maximize the probability the model assigns to the true label across all training samples.

The log-likelihood is

logp({yi}{xi},θ)=i=1N[yilogfθ(xi)+(1yi)log(1fθ(xi))].\log p(\{y_i\} \mid \{x_i\}, \theta) = \sum_{i=1}^{N} \big[\, y_i \log f_\theta(x_i) + (1 - y_i) \log(1 - f_\theta(x_i)) \,\big].

The negative of this is the binary cross-entropy loss. For multi-class output with softmax fθ(x)=(p1,,pK)f_\theta(x) = (p_1, \dots, p_K) and one-hot yy, the same derivation yields categorical cross-entropy kyklogpk-\sum_k y_k \log p_k.

So why is it called cross-entropy? The name comes from information theory. Following the classical convention with log2\log_2, the entropy of a discrete distribution pp,

H(p)=xp(x)log2p(x),H(p) = -\sum_x p(x) \log_2 p(x),

measures its average information content — i.e., how many bits per event you need on average when using a code optimized for pp (the continuous analogue is the integral form p(x)log2p(x)dx-\int p(x) \log_2 p(x)\, dx, called differential entropy, not discussed here). Cross-entropy mixes two discrete distributions:

H(p,q)=xp(x)log2q(x).H(p, q) = -\sum_x p(x) \log_2 q(x).

It answers: “events occur under the true distribution pp, but you use a code designed for qq — how many bits do you need on average?” The “cross” comes from the two distributions playing different roles in the formula: pp sits outside the log as a weight, qq sits inside the log as the probability argument. Swapping the roles — letting qq be the weight and putting pp inside the log — gives H(q,p)=xq(x)log2p(x)H(q, p) = -\sum_x q(x) \log_2 p(x), generally a different number. Cross-entropy is asymmetric: H(p,q)H(q,p)H(p, q) \neq H(q, p).

In classification: the one-hot true label yy plays the role of pp, and the model’s softmax output y^\hat y plays qq, so kyklogy^k=H(p,q)-\sum_k y_k \log \hat y_k = H(p, q) — the classification cross-entropy loss is literally “the cross-entropy of the true label distribution against the model’s predicted distribution”. Minimizing it pushes qq toward pp; equivalently, one can minimize the KL divergence DKL(pq)=H(p,q)H(p)D_{KL}(p \| q) = H(p, q) - H(p), which differs from cross-entropy only by the data-only constant H(p)H(p).

⚙️ The log base only affects the unit, not the minimization result: log2\log_2 → bit, natural log ln\ln → nat, log10\log_{10} → ban — all related by constant factors. ML implementations (cross-entropy in PyTorch / TensorFlow) actually use natural log ln\ln — derivatives are simpler and the form maps directly onto the negative log-likelihood from MLE, so the unit is nats rather than bits. The log\log inside the MLE-derived "kyklogy^k-\sum_k y_k \log \hat y_k" above is therefore ln\ln in practice; the optimal parameters θ\theta are identical to what you’d get using log2\log_2.

Cross-entropy isn’t arbitrary either — it’s the direct product of MLE under “outputs distributed Bernoulli / categorical”. Both losses share the same underlying logic: the negative log-likelihood.

One last word on the KL divergence (Kullback & Leibler 1951, also called relative entropy). It was originally proposed in mathematical statistics to measure “how easy two hypothesis distributions are to tell apart”: it is non-negative (Gibbs’ inequality, DKL(pq)=0D_{KL}(p \| q) = 0 iff p=qp = q almost everywhere), asymmetric like cross-entropy, and commonly read as “the information lost when approximating pp with qq”. Beyond its implicit appearance in classification training, common landings in deep learning include:

  • Variational autoencoder (VAE): the loss contains DKL(q(zx)p(z))D_{KL}(q(z \mid x) \| p(z)), which pulls the encoder’s latent distribution toward a prior so the latent space becomes structured and sample-able.
  • Reinforcement learning (PPO / TRPO): each policy update constrains DKL(πnewπold)D_{KL}(\pi_\text{new} \| \pi_\text{old}) to stay small, preventing drastic policy shifts in a single step.
  • Knowledge distillation: minimizing DKL(pstudentpteacher)D_{KL}(p_\text{student} \| p_\text{teacher}) pushes the student model’s output distribution toward the teacher’s on every input.

When pp is fixed during training (as in classification), minimizing cross-entropy is equivalent to minimizing KL; but when pp is itself part of the model (e.g., the prior term in a VAE), the full KL must be used.

3.4 Backpropagation: Computational Graph, Chain Rule, Jacobian

§ 3.1 framed backpropagation as “the algorithm that computes the gradient”. Backpropagation is not a new differentiation rule — everything it does is “apply the chain rule in reverse topological order over a computational graph.” This subsection introduces the computational graph, the Jacobian matrix, and the chain-product form, along with how each takes shape on tensors, matrices, and a deep network’s layered structure.

The computational graph and local derivatives. The bottom-most abstraction of modern deep-learning frameworks (PyTorch / TensorFlow / JAX) is the computational graph — a directed acyclic graph where nodes are all the variables (inputs, parameters WW and bb, intermediates, output) and edges represent computational dependence and carry the local partial derivative from parent to child. Note this differs from the “network architecture diagrams” in § 1.1 and § 2.1: those diagrams put weights on the edges (the weighted connections between neurons), whereas a computational graph treats each weight as a node, and the edge between that node and a downstream intermediate carries the local partial z/W\partial z / \partial W — not the weight itself.

Take the simplest regression loss L=(σ(Wx+b)y)2L = (\sigma(W x + b) - y)^2. Splitting it into intermediate variables z=Wx+bz = W x + b, a=σ(z)a = \sigma(z), L=(ay)2L = (a - y)^2, the corresponding computational graph is:

graph LR
  x((x)) --> z[z]
  W((W)) --> z
  b((b)) --> z
  z --> a[a]
  a --> diff["a - y"]
  y((y)) --> diff
  diff --> L[L]

Each edge corresponds to a local derivative: z/W=xT\partial z / \partial W = x^T, a/z=σ(z)\partial a / \partial z = \sigma'(z), L/a=2(ay)\partial L / \partial a = 2(a - y), etc. To compute L/W\partial L / \partial W, multiply the local derivatives along the path LazWL \to a \to z \to W using the chain rule:

LW=LaazzW=2(ay)σ(z)xT.\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W} = 2(a - y) \cdot \sigma'(z) \cdot x^T.

Backpropagation isn’t a new differentiation rule — it’s the chain rule executed in reverse topological order over a DAG using local derivatives. Any function that can be expressed as a DAG admits this procedure; the framework just has to remember the local-derivative operator on each edge.

The Jacobian matrix. When a layer maps a vector to a vector rather than a scalar to a scalar, the “local derivative” becomes a Jacobian matrix: if f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m, then f/xRm×n\partial f / \partial x \in \mathbb{R}^{m \times n} with (f/x)ij=fi/xj(\partial f / \partial x)_{ij} = \partial f_i / \partial x_j. The three common neural-network layers have these Jacobians:

  • Linear layer y=Wx+by = W x + b (yRmy \in \mathbb{R}^m, xRnx \in \mathbb{R}^n, WRm×nW \in \mathbb{R}^{m \times n}): y/x=W\partial y / \partial x = W; element-wise w.r.t. WW is yi/Wij=xj\partial y_i / \partial W_{ij} = x_j.
  • Activation layer y=σ(x)y = \sigma(x) (element-wise): y/x=diag(σ(x))\partial y / \partial x = \mathrm{diag}(\sigma'(x)), a diagonal matrix.
  • Loss layer L=(y,ytrue)L = \ell(y, y_{\text{true}}) (scalar output): L/yR1×m\partial L / \partial y \in \mathbb{R}^{1 \times m}, a row vector.

The Jacobian of an element-wise activation is diagonal — this observation drives § 4.1’s “gradient chain product” analysis: a diagonal matrix’s singular values are its diagonal entries σ(zi)\sigma'(z_i) themselves, so the magnitude of the whole chain product is determined by σ\sigma' multiplied across all layers.

The chain-product form. Writing the whole backward pass as one expression: the gradient of the loss at depth LL with respect to input x0x_0 is the chain product of all layer Jacobians,

Lx0=LxLl=L1Jl,Jl=xlxl1.\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{l=L}^{1} J_l, \quad J_l = \frac{\partial x_l}{\partial x_{l-1}}.

This is the starting point for § 4’s analysis of vanishing / exploding gradients.

4. Training Stability: How Deep Networks Survive

4.1 Vanishing and Exploding Gradients: the Spectral-Radius View

Return to § 2.4’s derivative plot, and pick up § 3.4’s closing chain-product formula:

Lx0=LxLl=L1Jl,Jl=xlxl1.\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{l=L}^{1} J_l, \quad J_l = \frac{\partial x_l}{\partial x_{l-1}}.

The magnitude of the chain product is governed by how much each layer’s JlJ_l scales a vector’s length. For a matrix AA, its largest singular value σmax(A)\sigma_{\max}(A) is the maximum factor by which AA can stretch a unit vector — precisely, for any vector vv,

Av2    σmax(A)v2,\|A v\|_2 \;\le\; \sigma_{\max}(A) \cdot \|v\|_2,

with equality when vv points along AA‘s “longest” direction. So if every layer has σmax(Jl)<1\sigma_{\max}(J_l) < 1, the chain product shrinks the gradient magnitude exponentially; if all >1> 1, it explodes.

For square matrices one also uses the spectral radius ρ(J)=maxiλi(J)\rho(J) = \max_i \lvert \lambda_i(J) \rvert (the largest eigenvalue magnitude) to describe asymptotic scaling under iteration; for non-symmetric JJ, ρ(J)σmax(J)\rho(J) \le \sigma_{\max}(J), and the two are used interchangeably in engineering as “the effective scaling factor of the map”. Below, ρˉ\bar\rho denotes the average per-layer scaling factor (either singular value or spectral radius works).

If the average per-layer scaling factor is ρˉ\bar\rho, the chain product’s magnitude scales as ρˉL\bar\rho^L. Two extremes:

  • ρˉ<1\bar\rho < 1: chain product decays exponentially — vanishing gradient, layers near the input receive almost no signal.
  • ρˉ>1\bar\rho > 1: chain product grows exponentially — exploding gradient, numerical overflow and optimization failure.

The failure modes of vanishing and exploding gradients differ. With vanishing gradients, training looks like it’s still running and the loss may still decrease, but the layers near the input barely update — the deep network effectively degenerates into a shallow one that only trains its last few layers, capping generalization.

Exploding is more violent. Once any gradient component exceeds the float32 maximum (about 3.4×10383.4 \times 10^{38}), it becomes inf; subsequent operations like inf - inf / 0 * inf / inf / inf produce NaN, and NaN is absorbing under IEEE 754 — any arithmetic involving it returns NaN. Once NaN enters a parameter, the next forward pass turns the entire activation map into NaN, the loss becomes NaN, every gradient is NaN, every parameter is updated to NaN — training dies completely with no recovery.

Even without hitting inf, an oversized update ηL\eta \nabla L can push the parameters into a completely unfamiliar region of the loss landscape, and the trajectory diverges. The engineering remedy is gradient clipping: whenever L>τ\|\nabla L\| > \tau, scale it down to τ\tautrading a bit of speed for not letting any single step blow up. It is nearly standard in RNN / Transformer training.

Take Sigmoid as a concrete number: σ(z)=σ(z)(1σ(z))\sigma'(z) = \sigma(z)(1 - \sigma(z)) is a bell curve with maximum 0.250.25 at z=0z = 0, and σ\sigma' drops quickly as z|z| grows. So even in the “most favorable spot” (z0z \approx 0), a layer’s ρˉ\bar\rho tops out at 0.250.25; with un-tuned initialization the per-layer value is smaller in practice, and the 10-layer chain product is bounded above by 0.25101060.25^{10} \approx 10^{-6} — gradient effectively zero.

51152Depth L 012345Chain product ρ^L sigmoid' chain (0.25)^L tanh' near saturation (0.9)^L ReLU' positive region (1.0)^L Exploding ρ=1.2
Gradient chain product vs depth

The ρ=1.2\rho = 1.2 curve runs off the top of the plot around L9L \approx 9 — that’s the literal meaning of “explosion.” The gradient problem isn’t a curse of “depth” — it’s a curse of “Jacobian chain-product spectral radius far from 1”. This reframes the issue from “architecture choice” to “keep each layer’s JlJ_l spectral radius near 1,” an engineering task. §§ 4.2 and 4.3 are its two solutions.

4.2 Weight Initialization: the Variance-Preservation Derivation of Xavier and He

Xavier initialization (Glorot & Bengio 2010) is built on the idea to keep the variance of each dimension equal across layers in both forward and backward passes — making the chain-product spectral radius hover near 1 automatically.

Why does “variance preservation” pin the spectral radius near 1? Two steps: variance preservation means each layer on average neither stretches nor shrinks the squared length of a vector; the Marchenko–Pastur distribution from random matrix theory says that a random matrix with i.i.d. entries of variance σw2\sigma_w^2 and shape m×nm \times n has singular values concentrated around σwn\sigma_w \sqrt{n}. Plug in σw2=1/n\sigma_w^2 = 1/n (the Xavier forward equation derived below) and the singular values 1/nn=1\approx \sqrt{1/n} \cdot \sqrt{n} = 1, so ρ1\rho \approx 1. Variance equalization is essentially an engineering proxy for “make every layer Jacobian’s singular values concentrate near 1” — controlling variance during sampling is O(1)O(1); directly controlling singular values via SVD is O(n3)O(n^3).

Assumptions: input xx has zero-mean, i.i.d. components with variance σx2\sigma_x^2; weight matrix WRm×nW \in \mathbb{R}^{m \times n} has zero-mean i.i.d. entries with variance σw2\sigma_w^2; the activation is approximately linear (small-input regime, e.g., tanh near zero). Notation: WW maps an nn-dimensional input to an mm-dimensional output, so each output neuron receives nn inputs — its fan-in =n= n (number of columns of WW); each input is dispatched to mm outputs — its fan-out =m= m (number of rows of WW). Both terms come from circuit terminology — number of incoming / outgoing ports.

Forward variance: for a linear layer zj=k=1nWjkxkz_j = \sum_{k=1}^{n} W_{jk} x_k,

Var(zj)=k=1nVar(Wjkxk)=k=1nVar(Wjk)Var(xk)=nσw2σx2.\begin{aligned} \mathrm{Var}(z_j) &= \sum_{k=1}^{n} \mathrm{Var}(W_{jk} x_k) \\ &= \sum_{k=1}^{n} \mathrm{Var}(W_{jk}) \cdot \mathrm{Var}(x_k) \\ &= n \cdot \sigma_w^2 \cdot \sigma_x^2. \end{aligned}

To preserve Var(z)=Var(x)\mathrm{Var}(z) = \mathrm{Var}(x), we need σw2=1/n\sigma_w^2 = 1 / n (nn is the fan-in).

Backward variance: the propagating gradient gj=i=1mWijgi(out)g_j = \sum_{i=1}^{m} W_{ij} g_i^{\text{(out)}} (mm is fan-out) satisfies

Var(gj)=mσw2Var(g(out)).\mathrm{Var}(g_j) = m \cdot \sigma_w^2 \cdot \mathrm{Var}(g^{\text{(out)}}).

To preserve backward variance we need σw2=1/m\sigma_w^2 = 1 / m. Forward and backward cannot be simultaneously satisfied; Xavier takes the harmonic compromise:

σw2=2n+m.\sigma_w^2 = \frac{2}{n + m}.

He initialization (He et al. 2015) corrects this for ReLU: since ReLU zeros out the negative half, only half the activations carry through, so forward variance becomes Var(z)=12nσw2σx2\mathrm{Var}(z) = \tfrac{1}{2} n \sigma_w^2 \sigma_x^2. To preserve variance one needs

σw2=2n.\sigma_w^2 = \frac{2}{n}.

The difference between Xavier and He isn’t a choice of σx2\sigma_x^2 — it’s the physical assumption that “ReLU kills half the activations,” which moves a factor of 2 into the numerator. Engineering practice now defaults to Xavier for sigmoid / tanh networks and He for ReLU-family networks.

4.3 Normalization: the Mathematical Essence of BatchNorm and LayerNorm

Weight initialization only enforces ρˉ1\bar\rho \approx 1 at t=0t = 0; during training, weights drift, activation distributions shift, and the spectral radius wanders away from 1. Normalization layers exist to forcibly pull each layer’s activations back to zero-mean unit-variance distribution, keeping σ\sigma' in its “sensitive region.”

Batch Normalization (BN) normalizes along the mini-batch dimension. For a batch {x(1),,x(B)}\{x^{(1)}, \dots, x^{(B)}\}, channel / neuron jj:

μj=1Bi=1Bxj(i),σj2=1Bi=1B(xj(i)μj)2,x^j(i)=xj(i)μjσj2+ϵ,yj(i)=γjx^j(i)+βj.\begin{aligned} \mu_j &= \frac{1}{B} \sum_{i=1}^{B} x_j^{(i)}, \\ \sigma_j^2 &= \frac{1}{B} \sum_{i=1}^{B} (x_j^{(i)} - \mu_j)^2, \\ \hat x_j^{(i)} &= \frac{x_j^{(i)} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}, \\ y_j^{(i)} &= \gamma_j \hat x_j^{(i)} + \beta_j. \end{aligned}

The trailing affine parameters γj,βj\gamma_j, \beta_j are learnable, so normalization doesn’t lock the network’s expressive capacity — if a layer genuinely needs nonzero-mean or non-unit-variance output, the model can learn it back through γ,β\gamma, \beta.

⚙️ In modern architectures the linear layer’s bb is often disabled: ResNet’s convolutional layers default to bias=False; LLaMA / PaLM’s FFN and Q / K / V projections are also bias-free. The reason is that these linear layers are followed by BN / LN, and the normalization layer’s β\beta already provides the same shift freedom — keeping bb alongside is a redundant parameter that adds count without adding expressive power. Disabling bb in the “Linear + BN + ReLU” combination is the default in PyTorch implementations.

⚙️ What the implementation actually does: BatchNorm uses batch statistics μj,σj2\mu_j, \sigma_j^2 during training; at inference, using batch statistics would make the same input produce different outputs in different batches. PyTorch / TensorFlow implementations maintain a running average of μ,σ2\mu, \sigma^2 during training, and switch to it for inference. This split makes BatchNorm’s train-inference behavior fundamentally asymmetric, which is why it underperforms in small-batch, variable-length-input, RNN / Transformer scenarios. In practice RNN / Transformer architectures use LayerNorm (introduced below); modern LLMs (LLaMA, Mistral, Gemma, etc.) further simplify to RMSNorm — dropping the mean subtraction and normalizing only by the root mean square.

Layer Normalization (LN) instead normalizes along all dimensions of a single sample, independent of batch size:

μ=1nj=1nxj,σ2=1nj=1n(xjμ)2,x^j=xjμσ2+ϵ.\mu = \frac{1}{n}\sum_{j=1}^{n} x_j, \quad \sigma^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \mu)^2, \quad \hat x_j = \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}}.

LayerNorm decouples from the batch, making it especially friendly to Transformer-style architectures with variable-length sequences within a batch — this is the engineering reason it has almost completely replaced BatchNorm in NLP and modern LLMs.

Back to the § 4.1 view: normalization pins each layer’s activations to zero-mean unit-variance, which forces σ\sigma' to stay roughly constant. The real goal of normalization isn’t the vague “distribution stability” — it’s keeping the activation distribution in the region where σ\sigma' \approx constant, so the chain product of Jacobian spectral radii doesn’t drift.

5. Generalization and Regularization

5.1 The Bias–Variance Tradeoff

Everything so far has been about whether training converges. But low training loss isn’t the goal — the goal is low generalization error on new samples. Given an input x0x_0, true label y0y_0, and training set DD, define the prediction y^(x0;D)\hat y(x_0; D); the expected MSE at x0x_0 decomposes as

ED ⁣[(y^(x0;D)y0)2]=ED ⁣[(y^y^ˉ)2]+(y^ˉf(x0))2+σϵ2=VarD[y^(x0;D)]variance+(ED[y^(x0;D)]f(x0))2bias2+σϵ2noise,\begin{aligned} \mathbb{E}_D\!\left[(\hat y(x_0; D) - y_0)^2\right] &= \mathbb{E}_D\!\left[(\hat y - \bar{\hat y})^2\right] + (\bar{\hat y} - f(x_0))^2 + \sigma_\epsilon^2 \\ &= \underbrace{\mathrm{Var}_D[\hat y(x_0; D)]}_{\text{variance}} + \underbrace{(\mathbb{E}_D[\hat y(x_0; D)] - f(x_0))^2}_{\text{bias}^2} + \underbrace{\sigma_\epsilon^2}_{\text{noise}}, \end{aligned}

where y^ˉ=ED[y^(x0;D)]\bar{\hat y} = \mathbb{E}_D[\hat y(x_0; D)], f(x0)f(x_0) is the true conditional expectation, and σϵ2\sigma_\epsilon^2 is irreducible label noise.

The three terms trade off: larger model capacity can fit more complex ff → smaller bias; larger capacity is also more sensitive to training-sample details → larger variance. The essence of the bias–variance tradeoff is that model capacity is both the enemy of bias and the ally of variance, and the optimum total error sits at the intersection of the two.

In practice, neural networks’ double descent phenomenon shows the classical bias-variance picture is incomplete in the over-parameterized regime (Belkin et al. 2019 and Nakkiran et al. 2020 gave systematic evidence). Sweeping a neural network’s width / parameter count from small to large and plotting test error, the curve is not a classical U shape but descent → ascent → descent:

  1. At low capacity, the test error follows the classical U shape down to a classical optimum;
  2. As capacity continues to grow and variance dominates, test error rises;
  3. Around the interpolation threshold (the critical capacity at which the model can perfectly memorize the training set, training error 0\approx 0), the test error spikes to a peak;
  4. With still more capacity, entering the over-parameterized regime (parameter count \gg training-set size), the test error descends again, often dropping below the classical optimum.

This “descent → ascent → descent” curve directly contradicts the classical bias-variance prediction that “more capacity → more variance → eventual overfitting” — in the over-parameterized regime, growing the model further generalizes better, which is the empirical basis for modern deep learning (GPT, LLaMA, and other large models) extracting good performance by scaling parameters.

The explanations are not fully settled; three common candidates:

  • Implicit regularization from SGD in the over-parameterized regime: many parameter settings perfectly fit the training set, and SGD prefers the one with smallest norm, which acts as automatic regularization.
  • Neural Tangent Kernel (NTK) view: when networks are wide enough, training dynamics approach a deterministic kernel method whose variance behavior differs from finite-width classical models.
  • Lottery ticket hypothesis: a large network contains a sparse “winning ticket” subnetwork that training selects automatically, with the other parameters playing no role.

When model size is moderate (parameter count of the same order as or smaller than the training set), the classical bias-variance picture still holds; double descent only appears in the over-parameterized regime. As a foundational view and design lens for regularization, the classical picture remains the starting point for the next two subsections on L1 / L2 / Dropout.

5.2 The Bayesian Essence of L2 / L1 Regularization

The simplest defense against overfitting is to add a “keep the weights small” term to the loss. Below, NLL(θ)\mathrm{NLL}(\theta) denotes the negative log-likelihood logp(Dθ)-\log p(D \mid \theta) — the quantity § 3.3 minimized when deriving MSE and cross-entropy from MLE, and virtually every loss in deep learning is some form of NLL. Two common regularizers:

LL2(θ)=NLL(θ)+λθ22,LL1(θ)=NLL(θ)+λθ1.L_{\text{L2}}(\theta) = \mathrm{NLL}(\theta) + \lambda \|\theta\|_2^2, \qquad L_{\text{L1}}(\theta) = \mathrm{NLL}(\theta) + \lambda \|\theta\|_1.

Both regularizers have a clean Bayesian interpretation: they correspond to MAP (maximum a posteriori) estimation under different parameter priors.

L2 ↔ Gaussian prior: assume parameters follow an isotropic Gaussian prior θN(0,τ2I)\theta \sim \mathcal{N}(0, \tau^2 I). Given data DD, the log-posterior is

logp(θD)logp(Dθ)+logp(θ)=NLL(θ)12τ2θ22+const.\begin{aligned} \log p(\theta \mid D) &\propto \log p(D \mid \theta) + \log p(\theta) \\ &= -\mathrm{NLL}(\theta) - \frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}. \end{aligned}

Maximizing the posterior is equivalent to minimizing NLL(θ)+12τ2θ22\mathrm{NLL}(\theta) + \frac{1}{2\tau^2}\|\theta\|_2^2 — setting λ=1/(2τ2)\lambda = 1 / (2 \tau^2) recovers L2 (ridge regression). L2 regularization = MAP under a Gaussian prior — “don’t let weights get too big” is equivalent to “the prior on weights concentrates around 0.”

One engineering caveat: θ22\|\theta\|_2^2 in practice usually applies only to weights, not to biases. PyTorch’s weight_decay does decay biases by default, but textbooks and practical guides generally recommend excluding biases explicitly. This is consistent with the intuition that “bias should retain its freedom to translate”; in the bias-free modern architectures mentioned in § 4.3 the requirement is automatically satisfied.

L1 ↔ Laplace prior: swap in a Laplace prior θjLaplace(0,b)\theta_j \sim \mathrm{Laplace}(0, b) with density exp(θj/b)\propto \exp(-|\theta_j| / b). The log-posterior becomes

logp(θD)NLL(θ)1bjθj+const,\log p(\theta \mid D) \propto -\mathrm{NLL}(\theta) - \frac{1}{b}\sum_j |\theta_j| + \text{const},

so maximizing it corresponds to minimizing NLL+λθ1\mathrm{NLL} + \lambda \|\theta\|_1 (λ=1/b\lambda = 1 / b). L1 (Lasso) regularization = MAP under a Laplace prior — choosing Laplace over Gaussian is what introduces sparsity.

Why does L1 produce sparsity (many θj\theta_j exactly zero) and L2 doesn’t? The answer is in each prior’s shape at 0: the Gaussian density is smooth at 0 (zero first derivative), while the Laplace density has a “peak” at 0 (a kink in the first derivative). Treating the negative log of the latter as a penalty θj|\theta_j|, the objective is non-smooth at θj=0\theta_j = 0 (“corner”), and the optimum tends to push components exactly to 0; the negative log of a Gaussian is a smooth θj2\theta_j^2, so the optimum only shrinks components but never zeroes them. L1’s sparsity isn’t an accident — it follows directly from the non-smoothness of the Laplace prior at 0.

5.3 The Ensemble View of Dropout

Dropout (Srivastava et al. 2014) is a minimal-looking operation: during training, each neuron is independently zeroed (masked) with probability pp; at inference no mask is applied, but activations are scaled by (1p)(1 - p). The puzzle is why this “noise injection” gives such a large generalization boost.

The answer is that it’s mathematically equivalent to a gigantic ensemble. Consider a network with nn hidden neurons. Each mini-batch’s random mask induces one of 2n2^n possible sub-networks (each neuron 0 or 1); each sub-network gets one training step (with shared weights). Dropout isn’t just random noise — it trains a different sub-network on every batch, forming an implicit ensemble of up to 2n2^n sub-networks.

At inference, dropout is turned off and each activation is multiplied by (1p)(1 - p) — this operation is exactly the normalized geometric mean of the 2n2^n subnetwork softmax probabilities for a softmax output layer. Derivation: each mask’s logit zk(m)=Wk(mh)z_k(m) = W_k \cdot (m \odot h) is linear in mm, so by linearity of expectation Em[zk(m)]=(1p)Wkh\mathbb{E}_m[z_k(m)] = (1 - p) \cdot W_k \cdot hexactly the logit produced by (1p)(1-p) scaling at inference. Plugging this into softmax and using the identity eEm[z]=meP(m)z(m)e^{\mathbb{E}_m[z]} = \prod_m e^{P(m) z(m)}, the exponential of an expectation expands into a product over masks; after normalization it equals the (normalized) geometric mean my^k(m)P(m)\prod_m \hat y_k(m)^{P(m)} of the subnetwork softmax outputs y^k(m)\hat y_k(m).

The “arithmetic mean of logits → geometric mean of probabilities” comes from softmax’s exponential — a direct manifestation of the AM-GM identity ezˉ=(ezi)1/Ne^{\bar z} = (\prod e^{z_i})^{1/N}: a linear average in logit space becomes a multiplicative average in probability space after exp. For non-softmax outputs (sigmoid regression heads, ReLU inside conv layers, etc.) the equivalence is no longer strict, but the “ensemble approximation” of dropout still works well empirically in deep nonlinear networks.

In practice dropout is highly effective on conv-nets and MLPs, while Transformers often substitute it with stochastic depth or attention dropout. Dropout’s theoretical value is in connecting “regularization” and “ensemble learning” as one phenomenon — the same lens also applies to BatchNorm (batch-level noise) and data augmentation (input perturbation).

6. Wrapping Up

Strung together, this post is really answering three root questions: why neural networks can learn (§§ 1-2), how they learn (§ 3), and why depth makes learning hard (§§ 4-5). A few things not stressed in the body but worth keeping in mind before applying any of this:

  • The perceptron’s limit isn’t “linearity” — it’s “single layer”. XOR gave that answer back in 1969; the subsequent “AI winter” wasn’t because the perceptron itself was inadequate but because at that time backpropagation, compute, and the engineering scaffolding around nonlinear activations were all missing. Stacking layers with nonlinearity is what actually lifts the expressiveness ceiling.
  • Backpropagation isn’t a new algorithm — it’s “the chain rule executed in reverse topological order over a computational graph”. VJP is the load-bearing piece; without it, none of the modern AD frameworks (PyTorch / JAX / Autograd) would exist, and the specific memory-vs-compute tradeoff of reverse-mode AD comes from VJP, not from the chain rule itself.
  • The gradient problem isn’t a curse of “depth” — it’s a curse of “spectral radius”. Xavier / He initialization and BatchNorm / LayerNorm all chase the same target: keep the chain product’s ρˉ1\bar\rho \approx 1. Residual connections (ResNet), normalization, and attention design can all be viewed through the same “keep spectral radius stable” lens.
  • All regularization is fundamentally a Bayesian prior over parameters. L2 ↔ Gaussian prior, L1 ↔ Laplace prior, Dropout ↔ implicit ensemble — they’re the same idea in different engineering forms. New regularizers (label smoothing, mixup, weight averaging) fit the same template once you identify the corresponding prior or ensemble structure.
  • The Universal Approximation Theorem is an expressiveness theorem, not a generalization theorem. “Can approximate” ≠ “can learn” (an optimization problem) ≠ “approximates on the test set” (a generalization problem). The real progress in the past decade has been on the latter two — optimization algorithms, normalization, regularization, and architectural priors (CNN / attention / Transformer) together form the engineering substrate that makes UAT actually land on real data.