Deep Learning Foundations: From Perceptrons to Training Deep Networks

Neural networks as “end-to-end learnable functions” are now everyday tools, but for many practitioners the mental model of “learning” stops at the phrase “there’s backpropagation.” This black-box view holds up while training behaves; the moment loss stops decreasing, gradients explode, or inference performance collapses, there is nothing concrete to grab onto. Understanding the foundations of deep learning isn’t about reinventing PyTorch — it’s about knowing where to look when a model fails to converge or generalize.

This post walks through the mathematical skeleton at the bottom of deep learning — from Rosenblatt’s 1957 perceptron all the way to the engineering foundations that let modern deep networks train stably with hundreds of layers. The core is three things: stacking linear transforms with nonlinear activations to gain “any-function” expressiveness, propagating error through a computational graph by the chain rule, and using variance control plus regularization to keep deep networks actually trainable.

The post unfolds across §§ 1-5: the perceptron’s mathematical form and the XOR problem establish the necessity of nonlinearity (§ 1); multilayer perceptrons plus the universal approximation theorem show its sufficiency (§ 2); the basic neural-network training pipeline — loss function plus gradient descent — turns “can be expressed” into “can be learned” (§ 3); vanishing / exploding gradients along with initialization and normalization keep training stable at dozens or hundreds of layers (§ 4); finally, the bias–variance tradeoff through L1 / L2 / Dropout maps the gap between “learning” and “generalizing” (§ 5).

1. The Perceptron: Limits of Linear Models

1.1 The Mathematical Form of the Perceptron

The perceptron was proposed by Rosenblatt in 1957 and is the first model in history whose parameters could be learned from data for binary classification. Its core operation is a weighted sum over input dimensions passed through a step function:

y \;=\; \mathrm{sign}(w \cdot x + b), \qquad \mathrm{sign}(z) = \begin{cases} +1, & z \ge 0 \\ -1, & z < 0 \end{cases}

Here $x \in \mathbb{R}^n$ is the input vector, $w \in \mathbb{R}^n$ is the weight vector, $b \in \mathbb{R}$ is the bias, and the output $y \in \{-1, +1\}$ is the class label. Geometrically the decision boundary $w \cdot x + b = 0$ is a hyperplane in $\mathbb{R}^n$ . The perceptron is, at its core, a model that “finds a hyperplane separating two classes of data”.

graph LR
  x1((x1)) -->|w1| S["Σ + b"]
  x2((x2)) -->|w2| S
  xn((xn)) -->|wn| S
  S --> step["sign·"]
  step --> y((y))

The learning algorithm itself is minimal — but to see it clearly, the symbols it uses should be listed once up front:

Symbol	Meaning	Type
$w \in \mathbb{R}^n$	weight vector	parameter
$b \in \mathbb{R}$	bias scalar	parameter
$x_i \in \mathbb{R}^n$	input of the $i$ -th sample	data
$y_i \in \{-1, +1\}$	label of the $i$ -th sample	data
$\eta > 0$	learning rate, the step size per update	hyperparameter
$\leftarrow$	assignment (write the right value into the left)	operator

Both $w$ and $b$ are learnable parameters, while $\eta$ is a human-chosen hyperparameter; without $b$ the decision hyperplane is locked at the origin and cannot be translated.

The full training loop is:

eta = 1                                   # learning rate
w = 0; b = 0
for _ in range(max_iter):                 # iteration cap
    n_updates = 0
    for x_i, y_i in training_samples:
        score = w @ x_i + b               # dot product
        if y_i * score <= 0:              # misclassified
            w = w + eta * y_i * x_i
            b = b + eta * y_i
            n_updates += 1
    if n_updates == 0:
        break                             # full pass with no update -> converged

Only misclassified points trigger an update; correctly classified points are skipped — this is the key difference between the perceptron and later gradient descent in “when does an update fire”. The compact misclassification test is $y_i (w \cdot x_i + b) \le 0$ : the true label and the predicted score have opposite signs (or the score is exactly 0).

Why does the update point along $+\eta y_i x_i$ rather than some other direction? Suppose some point $(x_i, y_i)$ is misclassified. Recompute its score after the update:

\begin{aligned} w_{\text{new}} \cdot x_i + b_{\text{new}} &= (w + \eta y_i x_i) \cdot x_i + (b + \eta y_i) \\ &= (w \cdot x_i + b) + \eta y_i (\|x_i\|^2 + 1) \\ &= \text{old score} + \eta y_i (\|x_i\|^2 + 1). \end{aligned}

With $\eta > 0$ and $\|x_i\|^2 + 1 > 0$ , the new score moves strictly along the $y_i$ direction relative to the old score: larger for $y_i = +1$ , smaller for $y_i = -1$ , in both cases toward “next pass predicts correctly”. $y_i$ is both the signal of “which direction I want the score to move” and the multiplier that flips the update direction with the label; replacing it with a fixed $+\eta x_i$ would actually push negative samples further into being wrong.

A concrete pass: take $\eta = 1$ , initial $w = (0, 0), b = 0$ , with two training points $x_1 = (2, 1), y_1 = +1$ and $x_2 = (-1, 2), y_2 = -1$ .

See $x_1$ : score $= 0$ , $y_1 \cdot 0 = 0 \le 0$ , misclassified. Update $w \leftarrow (2, 1),\; b \leftarrow 1$ .
See $x_2$ : score $= 2 \cdot (-1) + 1 \cdot 2 + 1 = 1$ , predicted $+1$ but the true label is $-1$ , misclassified. Update $w \leftarrow (3, -1),\; b \leftarrow 0$ .
Recheck $x_1$ : score $= 5 > 0$ ✓; recheck $x_2$ : score $= -5 < 0$ ✓. Both correct, the loop terminates.

A common engineering simplification is to absorb $b$ into $w$ : append a constant 1 to every input to form $\tilde x = (x, 1) \in \mathbb{R}^{n+1}$ , and extend the weights to $\tilde w = (w, b)$ . Then $w \cdot x + b = \tilde w \cdot \tilde x$ , and the whole update collapses into $\tilde w \leftarrow \tilde w + \eta y_i \tilde x_i$ , which is more compact. $b$ remains a parameter mathematically — it’s just hidden inside the last component of the extended vector.

The whole learning procedure has no gradient, no backpropagation, no notion of a loss function — it’s the most primitive “fix on mistake” version: misclassified points pull the hyperplane toward themselves, correctly classified points leave it alone.

1.2 The Novikoff Convergence Theorem

A learning rule alone isn’t enough; we still need to answer two questions: when is convergence guaranteed, and how many steps does it take? In 1962, Novikoff gave a complete answer: when the data is linearly separable, the perceptron algorithm converges in at most $\lceil R^2 / \gamma^2 \rceil$ steps.

Theorem (Novikoff 1962): Suppose the training set $\{(x_i, y_i)\}_{i=1}^N$ satisfies $\|x_i\| \le R$ , and there exists $w^*$ (with $\|w^*\| = 1$ WLOG) such that $y_i (w^* \cdot x_i) \ge \gamma > 0$ for every $i$ . Then the perceptron algorithm, starting from $w_0 = 0$ with learning rate $\eta = 1$ , converges after at most $t \le R^2 / \gamma^2$ updates.

The proof bounds the weight vector from above and below, then squeezes with Cauchy-Schwarz. Let $w_t$ denote the weights after $t$ updates.

Inner-product lower bound: each update is $w_t = w_{t-1} + y_{i_t} x_{i_t}$ , so

\begin{aligned} w_t \cdot w^* &= (w_{t-1} + y_{i_t} x_{i_t}) \cdot w^* \\ &= w_{t-1} \cdot w^* + y_{i_t} (x_{i_t} \cdot w^*) \\ &\ge w_{t-1} \cdot w^* + \gamma. \end{aligned}

Starting from $w_0 \cdot w^* = 0$ and iterating to $t$ gives $w_t \cdot w^* \ge t \gamma$ .

Norm-squared upper bound:

\begin{aligned} \|w_t\|^2 &= \|w_{t-1} + y_{i_t} x_{i_t}\|^2 \\ &= \|w_{t-1}\|^2 + 2 y_{i_t} (w_{t-1} \cdot x_{i_t}) + \|x_{i_t}\|^2. \end{aligned}

An update happens only when $(x_{i_t}, y_{i_t})$ is misclassified under $w_{t-1}$ , i.e., $y_{i_t} (w_{t-1} \cdot x_{i_t}) \le 0$ , so the cross term is $\le 0$ ; substituting $\|x_{i_t}\|^2 \le R^2$ gives $\|w_t\|^2 \le \|w_{t-1}\|^2 + R^2$ . Iterating from $\|w_0\| = 0$ to $t$ yields $\|w_t\|^2 \le t R^2$ .

Squeezing: by Cauchy-Schwarz, $w_t \cdot w^* \le \|w_t\| \cdot \|w^*\| = \|w_t\|$ . Combined with both bounds,

t \gamma \;\le\; w_t \cdot w^* \;\le\; \|w_t\| \;\le\; \sqrt{t R^2} = R \sqrt{t}.

Squaring both sides and dividing by $\gamma^2$ gives $t \le R^2 / \gamma^2$ . $\blacksquare$

Novikoff gives not an asymptotic guarantee but an absolute upper bound — provided the data is linearly separable, the iteration count is fixed by the ratio of “radius $R$ to margin $\gamma$ ”. This result fueled a wave of optimism about the perceptron in the 1960s; a new limit was about to follow.

1.3 The XOR Problem: Algebra and Geometry of Non-separability

Minsky and Papert’s 1969 book Perceptrons gave a counterexample that nearly buried the perceptron: exclusive-or (XOR). XOR maps $(x_1, x_2) \in \{0, 1\}^2$ to 0 or 1 according to:

$x_1$	$x_2$	XOR
0	0	0
0	1	1
1	0	1
1	1	0

Algebraically suppose $(w_1, w_2, b)$ exists so the perceptron learns XOR. Plugging in the four points with their sign constraints (output 1 ⇒ $\ge 0$ , output 0 ⇒ $< 0$ ):

\begin{aligned} b &< 0 \quad &&\text{from } (0, 0) \to 0, \\ w_2 + b &\ge 0 \quad &&\text{from } (0, 1) \to 1, \\ w_1 + b &\ge 0 \quad &&\text{from } (1, 0) \to 1, \\ w_1 + w_2 + b &< 0 \quad &&\text{from } (1, 1) \to 0. \end{aligned}

Adding the second and third inequalities gives $w_1 + w_2 + 2b \ge 0$ ; adding the first and fourth gives $w_1 + w_2 + 2b < 0$ . These two inequalities are directly contradictory — no single-layer linear model can implement XOR.

Geometrically the picture is even clearer: the decision boundary $w_1 x_1 + w_2 x_2 + b = 0$ is a line in $\mathbb{R}^2$ that splits the plane into two half-planes. XOR demands that $(0, 0)$ and $(1, 1)$ live in the 0 half-plane while $(0, 1)$ and $(1, 0)$ live in the 1 half-plane — connecting the two points of each class with a segment gives exactly the two diagonals of the unit square, and these two diagonal segments intersect at the center $(0.5,\, 0.5)$ . Once the two classes’ convex hulls intersect (here, two diagonals meeting at the center), no single line can place them in different half-planes, so no hyperplane can perform XOR’s “same color across the diagonal” split.

 x2
  │
1 ●           ○        ●  class 1
  │
  │
0 ○           ●        ○  class 0
  └────────────── x1
    0           1

XOR isn’t an isolated counterexample — it points to a broader class of problems: whenever classes need a “non-convex decision region,” single-layer linear models necessarily fail. To push past this ceiling the model has to move from “one hyperplane” to “a composition of hyperplanes,” which is what the multilayer perceptron is built for.

2. Multilayer Perceptrons: Where Non-linearity Emerges

2.1 From One Layer to Many: Matrix Composition

The multilayer perceptron (MLP) inserts a nonlinear activation $\sigma$ between successive linear transforms. Two-layer form:

y \;=\; \sigma\!\left(W_2 \, \sigma(W_1 x + b_1) + b_2\right),

where $W_1 \in \mathbb{R}^{h \times n}$ maps the $n$ -dim input into an $h$ -dim hidden layer, $W_2 \in \mathbb{R}^{m \times h}$ maps the hidden layer to an $m$ -dim output, and $\sigma$ is an element-wise nonlinearity. The general $L$ -layer form is

a_0 = x, \quad a_l = \sigma(W_l a_{l-1} + b_l) \;\; (l = 1, \dots, L), \quad y = a_L.

A multilayer perceptron strings together a sequence of linear transforms with matrix multiplication, and inserts a nonlinear $\sigma$ between every two layers to keep the whole stack from collapsing. The “nonlinear $\sigma$ ” is the load-bearing piece — the next section proves that without it the entire structure degenerates to a single layer.

2.2 The Linear-Collapse Theorem: Why Activations Are Required

Replace $\sigma$ with the identity $\sigma(z) = z$ . The two-layer MLP becomes

\begin{aligned} y &= W_2 (W_1 x + b_1) + b_2 \\ &= (W_2 W_1) x + (W_2 b_1 + b_2) \\ &\equiv \tilde W x + \tilde b, \end{aligned}

a brand-new linear transform. In general, an $L$ -layer purely linear MLP becomes

y = W_L W_{L-1} \cdots W_1 x + \sum_{l=1}^{L} \!\left(\prod_{k=l+1}^{L} W_k\right) b_l = \tilde W x + \tilde b,

still a single-layer linear transform. Any finite-depth composition of purely linear maps remains linear — no matter how many layers you stack, the expressive power never exceeds that of the shallowest possible model. This is “linear collapse.”

The nonlinear activation is the only mechanism that lets the stack escape the affine subspace. How far it escapes depends on $\sigma$ and on $L$ , but the prerequisite is simply that $\sigma$ is not the identity. Activations are not decoration — they are the sole source of nonlinear expressive power.

2.3 The Universal Approximation Theorem: Sketch of a Constructive Proof

What is the “floor” of expressive power once nonlinearity is in place? In 1989, Cybenko, and in 1991, Hornik, independently provided the answer: the universal approximation theorem (UAT).

Theorem (Cybenko 1989): Let $\sigma$ be a continuous, monotone, bounded activation function (sigmoidal). For any compact set (i.e., closed and bounded — e.g., $[0, 1]^n$ ) $K \subset \mathbb{R}^n$ , any continuous $f: K \to \mathbb{R}$ , and any $\epsilon > 0$ , there exist a natural number $h$ and weights $\{w_i, b_i, \alpha_i\}_{i=1}^h$ such that the network

g(x) = \sum_{i=1}^{h} \alpha_i \sigma(w_i \cdot x + b_i)

satisfies $\sup_{x \in K} |g(x) - f(x)| < \epsilon$ .

A constructive sketch (not a full proof): take $\sigma$ to be sigmoid. $\sigma(\alpha(z - c))$ is a smooth S-curve that transitions from 0 to 1 around $z = c$ — larger $\alpha$ makes the transition region narrower and the curve steeper, so the shape approaches a step function at $z = c$ .

Subtracting two such sigmoids whose transition points are close (separated by $\delta$ ) — $\sigma(\alpha(z - c)) - \sigma(\alpha(z - c - \delta))$ — is most intuitive when split into three regions along the $z$ -axis:

Region	$\sigma(\alpha(z - c))$	$\sigma(\alpha(z - c - \delta))$	Difference
$z < c$	$\approx 0$	$\approx 0$	$\approx 0$
$z \in [c,\, c + \delta]$	$\approx 1$ (already past $c$ )	$\approx 0$ (not yet at $c + \delta$ )	$\approx 1$
$z > c + \delta$	$\approx 1$	$\approx 1$	$\approx 0$

The difference is $\approx 1$ only on the narrow band in the middle and $\approx 0$ everywhere else — a rectangular pulse supported on $[c, c + \delta]$ , obtained as the difference of two step functions offset by $\delta$ . At finite $\alpha$ the rectangle has smooth transitions (no sharp corners), so the sigmoid difference is a smooth approximation to a rectangular pulse.

And any continuous function on a compact set can be uniformly approximated to arbitrary precision by a step function (a piecewise-constant function built from rectangles of varying height and sufficiently small width) — this is a direct consequence of the Heine–Cantor theorem (a continuous $f$ on a compact set is uniformly continuous, so partitioning the set finely enough and using $f(x_i)$ as the constant on each piece keeps the uniform error under any preset bound). Replacing each rectangle in the step function with a sigmoid-difference bump scaled to that rectangle’s height and summing them gives a single-hidden-layer sigmoid network that approximates $f$ piece by piece — exactly the form $g(x) = \sum_i \alpha_i \sigma(w_i \cdot x + b_i)$ stated in the UAT. Hornik 1991 generalized the conclusion to any non-constant, bounded, continuous $\sigma$ , dropping the sigmoidal requirement.

UAT delivers an existence guarantee for expressive power, not an efficiency claim, and not an optimization or generalization guarantee. It says “neural networks can in principle approximate any continuous function” — but “can approximate” ≠ “can be learned” (an optimization problem) ≠ “approximates on the test set” (a generalization problem). These two gaps are exactly what §§ 3-5 address.

2.4 The Activation Function Zoo: Saturating to Non-saturating

Before walking through specific functions, fix a few terms that will recur:

Saturation region: the part of the input where $\sigma'(z) \to 0$ . Inside it, input changes barely move the output, and backprop compresses the gradient through that layer toward zero.
Saturating activation: an activation with saturation regions on both sides. Sigmoid ( $z \to \pm\infty$ approaching 0 / 1) and tanh (approaching $\mp 1$ ) saturate on both sides; in deep networks, the chain product $\prod \sigma'$ of such functions decays exponentially.
Half-saturating activation: saturates on only one side — e.g., ReLU’s negative half ( $\sigma = 0$ for $z < 0$ ); the positive half has $\sigma'(z) = 1$ and does not saturate.
Dying region: the part where $\sigma'(z)$ is exactly zero, not merely near zero. ReLU’s negative half is exactly this: $\sigma(z) = 0$ and $\sigma'(z) = 0$ , so any neuron pushed into this region receives no gradient signal and never updates again — this is “dying ReLU”. The difference between dying and saturating is “exactly zero” vs “approaching zero”; the former is irrecoverable, the latter can still escape via small perturbations.

UAT’s requirements on $\sigma$ are quite loose, but the engineering choice of activation directly determines training stability. The four most common activations:

Four activation functions

What matters more for training are their derivatives — backpropagation’s chain product multiplies $\sigma'$ into every layer:

Derivatives of the four activations

The key differences across these four activations all live in their derivatives:

Activation	Expression	$\sigma'(0)$	Saturation-region $\sigma'$	Training behavior
Sigmoid	$1 / (1 + e^{-z})$	0.25	$\to 0$	Deep networks vanish easily
Tanh	$\tanh(z)$	1	$\to 0$	Slightly better than sigmoid, still saturates
ReLU	$\max(0, z)$	1 (by convention)	1 (positive) / 0 (negative)	Non-saturating but has dying region
GeLU	$z \cdot \Phi(z)$	0.5	$\to 1$ (positive) / $\to 0$ (negative)	Transformer default, smooth

Saturating activations (sigmoid, tanh) have derivatives that tend to 0 at large $|z|$ , so the chain product $\prod \sigma'$ in a deep network decays exponentially — this is why sigmoid is essentially abandoned in deep architectures. ReLU fixes the saturation problem by pinning the positive-region derivative at 1, but introduces the “dying ReLU” problem: the negative-region derivative is permanently 0, and once a neuron is pushed into the negative region it never updates again. GeLU smooths the kink at the ReLU origin and is the default in modern Transformer-style architectures.

ReLU’s dying region motivated a family of improvements. The common ones in practice:

Leaky ReLU: $\sigma(z) = \max(\alpha z, z)$ with $\alpha$ a small positive constant (typically 0.01). Replacing ReLU’s zero negative-slope with $\alpha$ eliminates the dying region — every input now has a nonzero gradient flowing back.
PReLU (Parametric ReLU): same shape as Leaky ReLU, but $\alpha$ becomes a learnable parameter, one per layer.
ELU (Exponential Linear Unit): $\sigma(z) = z$ if $z > 0$ , $\alpha(e^z - 1)$ if $z \le 0$ . As $z \to -\infty$ , $\sigma(z) \to -\alpha$ (the negative side saturates at $y = -\alpha$ ), so the function is differentiable everywhere and bounded below by $-\alpha$ — usually more stable than Leaky ReLU.
Swish / SiLU: $\sigma(z) = z \cdot \sigma_\text{sigm}(z)$ where $\sigma_\text{sigm}$ is sigmoid. Shape is essentially the same as GeLU — GeLU uses the Gaussian CDF $\Phi$ for the smoothing weight, Swish uses sigmoid; the two are nearly interchangeable in large-model practice.

Plotting these four variants (Leaky ReLU with $\alpha = 0.1$ , ELU with $\alpha = 1$ , plus plain ReLU as a reference) overlaid makes the shape differences concrete:

ReLU-family variants

And their derivatives:

Derivatives of ReLU-family variants

A few things to notice: Leaky ReLU lifts ReLU’s negative-side derivative from 0 to $\alpha$ (a flat line — that’s the entire cost of removing the dying region); ELU’s derivative is continuous at $z = 0$ where ReLU has a kink; Swish is smooth overall and even dips slightly below 0 around $z \approx -1.3$ — exactly the shape that makes it nearly interchangeable with GeLU.

Practical guidance by setting:

Transformers / large language models: GeLU by default (BERT, the GPT family, LLaMA all use it); Swish / SiLU as a drop-in alternative. Smoothness helps large-model optimizers stay stable.
CNN / vanilla MLP hidden layers: ReLU remains the default — simple and efficient; if dying ReLU is a concern, switch to Leaky ReLU or ELU.
Output layer: depends on the task — linear for regression, sigmoid for binary classification, softmax for multi-class. “Saturation” here is the design goal (squashing logits into probabilities), not a flaw.
LSTM / GRU gates: gates need $[0, 1]$ outputs, so sigmoid; cell state uses tanh. This is intrinsic to RNN design — don’t swap.
Avoid sigmoid / tanh in deep hidden layers: chain-product decay is too severe, unless BatchNorm / LayerNorm rescues things (§ 4.3 expands on this).

Initialization should match: He init for the ReLU family, Xavier for sigmoid / tanh (§ 4.2 derives both).

§ 4 will quantify what “chain product decay” actually means.

3. Training Neural Networks: Loss, Backpropagation, Gradient Descent

3.1 The Training Loop at a Glance

§§ 1-2 covered what a neural network can express — single-layer models only carve hyperplanes, multilayer networks with nonlinearities approximate any continuous function. The remaining core question: given a network architecture, how do we learn its parameters $\theta$ from data? This section gives the high-level view — the three core actions in every training loop (defining a loss / computing gradients via backprop / updating with gradient descent); § 3.2 then expands “where the loss comes from”.

1. The training objective: think of the network as a parameterized function $f_\theta$ . Given a training set $\{(x_i, y_i)\}_{i=1}^N$ , training seeks $\theta^*$ that minimizes “the gap between the model’s predictions and the true labels” — quantified by a loss function $L(\theta)$ (common forms catalogued in § 3.2, probabilistic interpretation in § 3.3).

2. Gradient descent: neural networks generally have no closed-form optimum, so training is iterative — each step moves a small distance along the negative gradient of the loss w.r.t. the parameters:

\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta).

Intuition: the gradient $\nabla_\theta L$ points in the direction of steepest ascent, so the opposite direction is steepest descent. The learning rate $\eta$ (a small positive number, e.g. $10^{-3}$ ) controls the step size — too large overshoots the minimum, too small makes convergence painfully slow. This is gradient descent.

3. Backpropagation is the algorithm that makes computing $\nabla_\theta L$ tractable: modern networks have millions to billions of parameters; computing each $\partial L / \partial \theta_j$ by hand is hopeless. Backpropagation solves this by treating the network as a computational graph (built during the forward pass) and applying the chain rule in reverse topological order to compute all $\partial L / \partial \theta_j$ in one sweep, at the same compute cost as the forward pass. Note the distinction: “backpropagation” is the algorithm that computes the gradient, “gradient descent” is the optimization algorithm that uses the gradient; both are required for a single parameter update.

4. The full training loop: stringing these three pieces together:

graph LR
  input["input x_batch"] --> forward["forward: y_pred = f_θx"]
  forward --> loss["compute loss L"]
  loss --> backward["backward: ∇_θ L"]
  backward --> update["update: θ ← θ - η∇L"]
  update --> input

The corresponding pseudocode (a minimal mini-batch SGD — stochastic gradient descent — training loop, which computes the gradient and updates on each small batch rather than scanning the full training set; this is the default pattern in every modern framework):

for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        y_pred = model.forward(x_batch)       # forward
        loss = loss_fn(y_pred, y_batch)       # compute loss
        grads = loss.backward()               # backprop -> ∇L
        for param in model.parameters():
            param -= eta * grads[param]       # gradient descent

The training loop is the repeated cycle of “forward to compute predictions → compute loss → backward to compute gradient → take a small step”.

3.2 A Quick Catalog of Common Loss Functions

Before the probabilistic derivation, here is a quick landscape of the loss functions most commonly used in deep learning:

Loss	Form	Typical use
MSE / L2	$\frac{1}{N}\sum_i (y_i - \hat y_i)^2$	regression
MAE / L1	$\frac{1}{N}\sum_i \lvert y_i - \hat y_i \rvert$	regression, more robust to outliers
Binary cross-entropy	$-y \log \hat y - (1-y) \log(1 - \hat y)$	binary classification
Categorical cross-entropy	$-\sum_k y_k \log \hat y_k$	multi-class classification (paired with softmax)
Hinge loss	$\max(0,\, 1 - y \hat y)$	SVM, max-margin classification
KL divergence	$\sum_x p(x) \log \frac{p(x)}{q(x)}$	distribution matching (VAE, knowledge distillation)

3.3 A Probabilistic Interpretation of Loss: MLE Derives MSE and Cross-Entropy

Many textbooks present MSE and cross-entropy as two independent losses, but they actually share one source: maximum likelihood estimation (MLE) under different noise / output distribution assumptions.

Regression (Gaussian noise): assume $y_i = f_\theta(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d. The conditional density is

p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right).

The log-likelihood is

\begin{aligned} \log p(\{y_i\} \mid \{x_i\}, \theta) &= \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) \\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_\theta(x_i))^2 - N \log(\sqrt{2\pi}\,\sigma). \end{aligned}

The second term is independent of $\theta$ , so maximizing the log-likelihood is equivalent to minimizing $\sum (y_i - f_\theta(x_i))^2$ — exactly MSE. MSE isn’t an arbitrary loss — it’s the direct product of MLE under the “output plus Gaussian noise” assumption.

Classification (Bernoulli output): for binary classification let $f_\theta(x) \in (0, 1)$ be the model’s predicted “probability of class 1” and $y_i \in \{0, 1\}$ the label. A Bernoulli distribution models a random variable taking only $\{0, 1\}$ and is fully determined by a single parameter $\pi \in [0, 1]$ : $P(Y = 1) = \pi$ , $P(Y = 0) = 1 - \pi$ . Its compact form uses “exponents as if-else selectors” to merge both cases into one line —

P(Y = y) = \pi^y (1 - \pi)^{1 - y}, \quad y \in \{0, 1\}.

Verifying: at $y = 1$ , $(1 - \pi)^0 = 1$ drops out, leaving $\pi$ ; at $y = 0$ , $\pi^0 = 1$ drops out, leaving $1 - \pi$ . Anything-to-the-zero is 1, so “the factor that shouldn’t be selected” disappears automatically.

The “conditional” in “conditional distribution” means conditioning on the input $x_i$ — the model lets the Bernoulli parameter $\pi$ depend on $x_i$ via $\pi(x_i) = f_\theta(x_i)$ ; the same $\theta$ produces different $\pi$ on different $x_i$ . Substituting $\pi(x_i)$ into the Bernoulli compact form gives

p(y_i \mid x_i, \theta) = f_\theta(x_i)^{y_i} (1 - f_\theta(x_i))^{1 - y_i}.

Concrete example (cat-vs-dog): a classifier outputs $f_\theta(x) = 0.8$ for an image (predicted probability of “cat”). If the true label is $y = 1$ (cat), $p(y = 1 \mid x, \theta) = 0.8^1 \cdot 0.2^0 = 0.8$ — the model assigns a high probability to the true label. If the true label is $y = 0$ (dog), $p(y = 0 \mid x, \theta) = 0.8^0 \cdot 0.2^1 = 0.2$ — the model is wrong and gives the true label a low probability. MLE tunes $\theta$ to maximize the probability the model assigns to the true label across all training samples.

The log-likelihood is

\log p(\{y_i\} \mid \{x_i\}, \theta) = \sum_{i=1}^{N} \big[\, y_i \log f_\theta(x_i) + (1 - y_i) \log(1 - f_\theta(x_i)) \,\big].

The negative of this is the binary cross-entropy loss. For multi-class output with softmax $f_\theta(x) = (p_1, \dots, p_K)$ and one-hot $y$ , the same derivation yields categorical cross-entropy $-\sum_k y_k \log p_k$ .

So why is it called cross-entropy? The name comes from information theory. Following the classical convention with $\log_2$ , the entropy of a discrete distribution $p$ ,

$H(p) = -\sum_x p(x) \log_2 p(x),$

measures its average information content — i.e., how many bits per event you need on average when using a code optimized for $p$ (the continuous analogue is the integral form $-\int p(x) \log_2 p(x)\, dx$ , called differential entropy, not discussed here). Cross-entropy mixes two discrete distributions:

$H(p, q) = -\sum_x p(x) \log_2 q(x).$

It answers: “events occur under the true distribution $p$ , but you use a code designed for $q$ — how many bits do you need on average?” The “cross” comes from the two distributions playing different roles in the formula: $p$ sits outside the log as a weight, $q$ sits inside the log as the probability argument. Swapping the roles — letting $q$ be the weight and putting $p$ inside the log — gives $H(q, p) = -\sum_x q(x) \log_2 p(x)$ , generally a different number. Cross-entropy is asymmetric: $H(p, q) \neq H(q, p)$ .

In classification: the one-hot true label $y$ plays the role of $p$ , and the model’s softmax output $\hat y$ plays $q$ , so $-\sum_k y_k \log \hat y_k = H(p, q)$ — the classification cross-entropy loss is literally “the cross-entropy of the true label distribution against the model’s predicted distribution”. Minimizing it pushes $q$ toward $p$ ; equivalently, one can minimize the KL divergence $D_{KL}(p \| q) = H(p, q) - H(p)$ , which differs from cross-entropy only by the data-only constant $H(p)$ .

⚙️ The log base only affects the unit, not the minimization result: $\log_2$ → bit, natural log $\ln$ → nat, $\log_{10}$ → ban — all related by constant factors. ML implementations (cross-entropy in PyTorch / TensorFlow) actually use natural log $\ln$ — derivatives are simpler and the form maps directly onto the negative log-likelihood from MLE, so the unit is nats rather than bits. The $\log$ inside the MLE-derived " $-\sum_k y_k \log \hat y_k$ " above is therefore $\ln$ in practice; the optimal parameters $\theta$ are identical to what you’d get using $\log_2$ .

Cross-entropy isn’t arbitrary either — it’s the direct product of MLE under “outputs distributed Bernoulli / categorical”. Both losses share the same underlying logic: the negative log-likelihood.

One last word on the KL divergence (Kullback & Leibler 1951, also called relative entropy). It was originally proposed in mathematical statistics to measure “how easy two hypothesis distributions are to tell apart”: it is non-negative (Gibbs’ inequality, $D_{KL}(p \| q) = 0$ iff $p = q$ almost everywhere), asymmetric like cross-entropy, and commonly read as “the information lost when approximating $p$ with $q$ ”. Beyond its implicit appearance in classification training, common landings in deep learning include:

Variational autoencoder (VAE): the loss contains $D_{KL}(q(z \mid x) \| p(z))$ , which pulls the encoder’s latent distribution toward a prior so the latent space becomes structured and sample-able.
Reinforcement learning (PPO / TRPO): each policy update constrains $D_{KL}(\pi_\text{new} \| \pi_\text{old})$ to stay small, preventing drastic policy shifts in a single step.
Knowledge distillation: minimizing $D_{KL}(p_\text{student} \| p_\text{teacher})$ pushes the student model’s output distribution toward the teacher’s on every input.

When $p$ is fixed during training (as in classification), minimizing cross-entropy is equivalent to minimizing KL; but when $p$ is itself part of the model (e.g., the prior term in a VAE), the full KL must be used.

3.4 Backpropagation: Computational Graph, Chain Rule, Jacobian

§ 3.1 framed backpropagation as “the algorithm that computes the gradient”. Backpropagation is not a new differentiation rule — everything it does is “apply the chain rule in reverse topological order over a computational graph.” This subsection introduces the computational graph, the Jacobian matrix, and the chain-product form, along with how each takes shape on tensors, matrices, and a deep network’s layered structure.

The computational graph and local derivatives. The bottom-most abstraction of modern deep-learning frameworks (PyTorch / TensorFlow / JAX) is the computational graph — a directed acyclic graph where nodes are all the variables (inputs, parameters $W$ and $b$ , intermediates, output) and edges represent computational dependence and carry the local partial derivative from parent to child. Note this differs from the “network architecture diagrams” in § 1.1 and § 2.1: those diagrams put weights on the edges (the weighted connections between neurons), whereas a computational graph treats each weight as a node, and the edge between that node and a downstream intermediate carries the local partial $\partial z / \partial W$ — not the weight itself.

Take the simplest regression loss $L = (\sigma(W x + b) - y)^2$ . Splitting it into intermediate variables $z = W x + b$ , $a = \sigma(z)$ , $L = (a - y)^2$ , the corresponding computational graph is:

graph LR
  x((x)) --> z[z]
  W((W)) --> z
  b((b)) --> z
  z --> a[a]
  a --> diff["a - y"]
  y((y)) --> diff
  diff --> L[L]

Each edge corresponds to a local derivative: $\partial z / \partial W = x^T$ , $\partial a / \partial z = \sigma'(z)$ , $\partial L / \partial a = 2(a - y)$ , etc. To compute $\partial L / \partial W$ , multiply the local derivatives along the path $L \to a \to z \to W$ using the chain rule:

\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W} = 2(a - y) \cdot \sigma'(z) \cdot x^T.

Backpropagation isn’t a new differentiation rule — it’s the chain rule executed in reverse topological order over a DAG using local derivatives. Any function that can be expressed as a DAG admits this procedure; the framework just has to remember the local-derivative operator on each edge.

The Jacobian matrix. When a layer maps a vector to a vector rather than a scalar to a scalar, the “local derivative” becomes a Jacobian matrix: if $f: \mathbb{R}^n \to \mathbb{R}^m$ , then $\partial f / \partial x \in \mathbb{R}^{m \times n}$ with $(\partial f / \partial x)_{ij} = \partial f_i / \partial x_j$ . The three common neural-network layers have these Jacobians:

Linear layer $y = W x + b$ ( $y \in \mathbb{R}^m$ , $x \in \mathbb{R}^n$ , $W \in \mathbb{R}^{m \times n}$ ): $\partial y / \partial x = W$ ; element-wise w.r.t. $W$ is $\partial y_i / \partial W_{ij} = x_j$ .
Activation layer $y = \sigma(x)$ (element-wise): $\partial y / \partial x = \mathrm{diag}(\sigma'(x))$ , a diagonal matrix.
Loss layer $L = \ell(y, y_{\text{true}})$ (scalar output): $\partial L / \partial y \in \mathbb{R}^{1 \times m}$ , a row vector.

The Jacobian of an element-wise activation is diagonal — this observation drives § 4.1’s “gradient chain product” analysis: a diagonal matrix’s singular values are its diagonal entries $\sigma'(z_i)$ themselves, so the magnitude of the whole chain product is determined by $\sigma'$ multiplied across all layers.

The chain-product form. Writing the whole backward pass as one expression: the gradient of the loss at depth $L$ with respect to input $x_0$ is the chain product of all layer Jacobians,

\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{l=L}^{1} J_l, \quad J_l = \frac{\partial x_l}{\partial x_{l-1}}.

This is the starting point for § 4’s analysis of vanishing / exploding gradients.

4. Training Stability: How Deep Networks Survive

4.1 Vanishing and Exploding Gradients: the Spectral-Radius View

Return to § 2.4’s derivative plot, and pick up § 3.4’s closing chain-product formula:

\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{l=L}^{1} J_l, \quad J_l = \frac{\partial x_l}{\partial x_{l-1}}.

The magnitude of the chain product is governed by how much each layer’s $J_l$ scales a vector’s length. For a matrix $A$ , its largest singular value $\sigma_{\max}(A)$ is the maximum factor by which $A$ can stretch a unit vector — precisely, for any vector $v$ ,

\|A v\|_2 \;\le\; \sigma_{\max}(A) \cdot \|v\|_2,

with equality when $v$ points along $A$ ‘s “longest” direction. So if every layer has $\sigma_{\max}(J_l) < 1$ , the chain product shrinks the gradient magnitude exponentially; if all $> 1$ , it explodes.

For square matrices one also uses the spectral radius $\rho(J) = \max_i \lvert \lambda_i(J) \rvert$ (the largest eigenvalue magnitude) to describe asymptotic scaling under iteration; for non-symmetric $J$ , $\rho(J) \le \sigma_{\max}(J)$ , and the two are used interchangeably in engineering as “the effective scaling factor of the map”. Below, $\bar\rho$ denotes the average per-layer scaling factor (either singular value or spectral radius works).

If the average per-layer scaling factor is $\bar\rho$ , the chain product’s magnitude scales as $\bar\rho^L$ . Two extremes:

$\bar\rho < 1$ : chain product decays exponentially — vanishing gradient, layers near the input receive almost no signal.
$\bar\rho > 1$ : chain product grows exponentially — exploding gradient, numerical overflow and optimization failure.

The failure modes of vanishing and exploding gradients differ. With vanishing gradients, training looks like it’s still running and the loss may still decrease, but the layers near the input barely update — the deep network effectively degenerates into a shallow one that only trains its last few layers, capping generalization.

Exploding is more violent. Once any gradient component exceeds the float32 maximum (about $3.4 \times 10^{38}$ ), it becomes inf; subsequent operations like inf - inf / 0 * inf / inf / inf produce NaN, and NaN is absorbing under IEEE 754 — any arithmetic involving it returns NaN. Once NaN enters a parameter, the next forward pass turns the entire activation map into NaN, the loss becomes NaN, every gradient is NaN, every parameter is updated to NaN — training dies completely with no recovery.

Even without hitting inf, an oversized update $\eta \nabla L$ can push the parameters into a completely unfamiliar region of the loss landscape, and the trajectory diverges. The engineering remedy is gradient clipping: whenever $\|\nabla L\| > \tau$ , scale it down to $\tau$ — trading a bit of speed for not letting any single step blow up. It is nearly standard in RNN / Transformer training.

Take Sigmoid as a concrete number: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is a bell curve with maximum $0.25$ at $z = 0$ , and $\sigma'$ drops quickly as $|z|$ grows. So even in the “most favorable spot” ( $z \approx 0$ ), a layer’s $\bar\rho$ tops out at $0.25$ ; with un-tuned initialization the per-layer value is smaller in practice, and the 10-layer chain product is bounded above by $0.25^{10} \approx 10^{-6}$ — gradient effectively zero.

Gradient chain product vs depth

The $\rho = 1.2$ curve runs off the top of the plot around $L \approx 9$ — that’s the literal meaning of “explosion.” The gradient problem isn’t a curse of “depth” — it’s a curse of “Jacobian chain-product spectral radius far from 1”. This reframes the issue from “architecture choice” to “keep each layer’s $J_l$ spectral radius near 1,” an engineering task. §§ 4.2 and 4.3 are its two solutions.

4.2 Weight Initialization: the Variance-Preservation Derivation of Xavier and He

Xavier initialization (Glorot & Bengio 2010) is built on the idea to keep the variance of each dimension equal across layers in both forward and backward passes — making the chain-product spectral radius hover near 1 automatically.

Why does “variance preservation” pin the spectral radius near 1? Two steps: variance preservation means each layer on average neither stretches nor shrinks the squared length of a vector; the Marchenko–Pastur distribution from random matrix theory says that a random matrix with i.i.d. entries of variance $\sigma_w^2$ and shape $m \times n$ has singular values concentrated around $\sigma_w \sqrt{n}$ . Plug in $\sigma_w^2 = 1/n$ (the Xavier forward equation derived below) and the singular values $\approx \sqrt{1/n} \cdot \sqrt{n} = 1$ , so $\rho \approx 1$ . Variance equalization is essentially an engineering proxy for “make every layer Jacobian’s singular values concentrate near 1” — controlling variance during sampling is $O(1)$ ; directly controlling singular values via SVD is $O(n^3)$ .

Assumptions: input $x$ has zero-mean, i.i.d. components with variance $\sigma_x^2$ ; weight matrix $W \in \mathbb{R}^{m \times n}$ has zero-mean i.i.d. entries with variance $\sigma_w^2$ ; the activation is approximately linear (small-input regime, e.g., tanh near zero). Notation: $W$ maps an $n$ -dimensional input to an $m$ -dimensional output, so each output neuron receives $n$ inputs — its fan-in $= n$ (number of columns of $W$ ); each input is dispatched to $m$ outputs — its fan-out $= m$ (number of rows of $W$ ). Both terms come from circuit terminology — number of incoming / outgoing ports.

Forward variance: for a linear layer $z_j = \sum_{k=1}^{n} W_{jk} x_k$ ,

\begin{aligned} \mathrm{Var}(z_j) &= \sum_{k=1}^{n} \mathrm{Var}(W_{jk} x_k) \\ &= \sum_{k=1}^{n} \mathrm{Var}(W_{jk}) \cdot \mathrm{Var}(x_k) \\ &= n \cdot \sigma_w^2 \cdot \sigma_x^2. \end{aligned}

To preserve $\mathrm{Var}(z) = \mathrm{Var}(x)$ , we need $\sigma_w^2 = 1 / n$ ( $n$ is the fan-in).

Backward variance: the propagating gradient $g_j = \sum_{i=1}^{m} W_{ij} g_i^{\text{(out)}}$ ( $m$ is fan-out) satisfies

\mathrm{Var}(g_j) = m \cdot \sigma_w^2 \cdot \mathrm{Var}(g^{\text{(out)}}).

To preserve backward variance we need $\sigma_w^2 = 1 / m$ . Forward and backward cannot be simultaneously satisfied; Xavier takes the harmonic compromise:

\sigma_w^2 = \frac{2}{n + m}.

He initialization (He et al. 2015) corrects this for ReLU: since ReLU zeros out the negative half, only half the activations carry through, so forward variance becomes $\mathrm{Var}(z) = \tfrac{1}{2} n \sigma_w^2 \sigma_x^2$ . To preserve variance one needs

\sigma_w^2 = \frac{2}{n}.

The difference between Xavier and He isn’t a choice of $\sigma_x^2$ — it’s the physical assumption that “ReLU kills half the activations,” which moves a factor of 2 into the numerator. Engineering practice now defaults to Xavier for sigmoid / tanh networks and He for ReLU-family networks.

4.3 Normalization: the Mathematical Essence of BatchNorm and LayerNorm

Weight initialization only enforces $\bar\rho \approx 1$ at $t = 0$ ; during training, weights drift, activation distributions shift, and the spectral radius wanders away from 1. Normalization layers exist to forcibly pull each layer’s activations back to zero-mean unit-variance distribution, keeping $\sigma'$ in its “sensitive region.”

Batch Normalization (BN) normalizes along the mini-batch dimension. For a batch $\{x^{(1)}, \dots, x^{(B)}\}$ , channel / neuron $j$ :

\begin{aligned} \mu_j &= \frac{1}{B} \sum_{i=1}^{B} x_j^{(i)}, \\ \sigma_j^2 &= \frac{1}{B} \sum_{i=1}^{B} (x_j^{(i)} - \mu_j)^2, \\ \hat x_j^{(i)} &= \frac{x_j^{(i)} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}, \\ y_j^{(i)} &= \gamma_j \hat x_j^{(i)} + \beta_j. \end{aligned}

The trailing affine parameters $\gamma_j, \beta_j$ are learnable, so normalization doesn’t lock the network’s expressive capacity — if a layer genuinely needs nonzero-mean or non-unit-variance output, the model can learn it back through $\gamma, \beta$ .

⚙️ In modern architectures the linear layer’s $b$ is often disabled: ResNet’s convolutional layers default to bias=False; LLaMA / PaLM’s FFN and Q / K / V projections are also bias-free. The reason is that these linear layers are followed by BN / LN, and the normalization layer’s $\beta$ already provides the same shift freedom — keeping $b$ alongside is a redundant parameter that adds count without adding expressive power. Disabling $b$ in the “Linear + BN + ReLU” combination is the default in PyTorch implementations.

⚙️ What the implementation actually does: BatchNorm uses batch statistics $\mu_j, \sigma_j^2$ during training; at inference, using batch statistics would make the same input produce different outputs in different batches. PyTorch / TensorFlow implementations maintain a running average of $\mu, \sigma^2$ during training, and switch to it for inference. This split makes BatchNorm’s train-inference behavior fundamentally asymmetric, which is why it underperforms in small-batch, variable-length-input, RNN / Transformer scenarios. In practice RNN / Transformer architectures use LayerNorm (introduced below); modern LLMs (LLaMA, Mistral, Gemma, etc.) further simplify to RMSNorm — dropping the mean subtraction and normalizing only by the root mean square.

Layer Normalization (LN) instead normalizes along all dimensions of a single sample, independent of batch size:

\mu = \frac{1}{n}\sum_{j=1}^{n} x_j, \quad \sigma^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \mu)^2, \quad \hat x_j = \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}}.

LayerNorm decouples from the batch, making it especially friendly to Transformer-style architectures with variable-length sequences within a batch — this is the engineering reason it has almost completely replaced BatchNorm in NLP and modern LLMs.

Back to the § 4.1 view: normalization pins each layer’s activations to zero-mean unit-variance, which forces $\sigma'$ to stay roughly constant. The real goal of normalization isn’t the vague “distribution stability” — it’s keeping the activation distribution in the region where $\sigma' \approx$ constant, so the chain product of Jacobian spectral radii doesn’t drift.

5. Generalization and Regularization

5.1 The Bias–Variance Tradeoff

Everything so far has been about whether training converges. But low training loss isn’t the goal — the goal is low generalization error on new samples. Given an input $x_0$ , true label $y_0$ , and training set $D$ , define the prediction $\hat y(x_0; D)$ ; the expected MSE at $x_0$ decomposes as

\begin{aligned} \mathbb{E}_D\!\left[(\hat y(x_0; D) - y_0)^2\right] &= \mathbb{E}_D\!\left[(\hat y - \bar{\hat y})^2\right] + (\bar{\hat y} - f(x_0))^2 + \sigma_\epsilon^2 \\ &= \underbrace{\mathrm{Var}_D[\hat y(x_0; D)]}_{\text{variance}} + \underbrace{(\mathbb{E}_D[\hat y(x_0; D)] - f(x_0))^2}_{\text{bias}^2} + \underbrace{\sigma_\epsilon^2}_{\text{noise}}, \end{aligned}

where $\bar{\hat y} = \mathbb{E}_D[\hat y(x_0; D)]$ , $f(x_0)$ is the true conditional expectation, and $\sigma_\epsilon^2$ is irreducible label noise.

The three terms trade off: larger model capacity can fit more complex $f$ → smaller bias; larger capacity is also more sensitive to training-sample details → larger variance. The essence of the bias–variance tradeoff is that model capacity is both the enemy of bias and the ally of variance, and the optimum total error sits at the intersection of the two.

In practice, neural networks’ double descent phenomenon shows the classical bias-variance picture is incomplete in the over-parameterized regime (Belkin et al. 2019 and Nakkiran et al. 2020 gave systematic evidence). Sweeping a neural network’s width / parameter count from small to large and plotting test error, the curve is not a classical U shape but descent → ascent → descent:

At low capacity, the test error follows the classical U shape down to a classical optimum;
As capacity continues to grow and variance dominates, test error rises;
Around the interpolation threshold (the critical capacity at which the model can perfectly memorize the training set, training error $\approx 0$ ), the test error spikes to a peak;
With still more capacity, entering the over-parameterized regime (parameter count $\gg$ training-set size), the test error descends again, often dropping below the classical optimum.

This “descent → ascent → descent” curve directly contradicts the classical bias-variance prediction that “more capacity → more variance → eventual overfitting” — in the over-parameterized regime, growing the model further generalizes better, which is the empirical basis for modern deep learning (GPT, LLaMA, and other large models) extracting good performance by scaling parameters.

The explanations are not fully settled; three common candidates:

Implicit regularization from SGD in the over-parameterized regime: many parameter settings perfectly fit the training set, and SGD prefers the one with smallest norm, which acts as automatic regularization.
Neural Tangent Kernel (NTK) view: when networks are wide enough, training dynamics approach a deterministic kernel method whose variance behavior differs from finite-width classical models.
Lottery ticket hypothesis: a large network contains a sparse “winning ticket” subnetwork that training selects automatically, with the other parameters playing no role.

When model size is moderate (parameter count of the same order as or smaller than the training set), the classical bias-variance picture still holds; double descent only appears in the over-parameterized regime. As a foundational view and design lens for regularization, the classical picture remains the starting point for the next two subsections on L1 / L2 / Dropout.

5.2 The Bayesian Essence of L2 / L1 Regularization

The simplest defense against overfitting is to add a “keep the weights small” term to the loss. Below, $\mathrm{NLL}(\theta)$ denotes the negative log-likelihood $-\log p(D \mid \theta)$ — the quantity § 3.3 minimized when deriving MSE and cross-entropy from MLE, and virtually every loss in deep learning is some form of NLL. Two common regularizers:

L_{\text{L2}}(\theta) = \mathrm{NLL}(\theta) + \lambda \|\theta\|_2^2, \qquad L_{\text{L1}}(\theta) = \mathrm{NLL}(\theta) + \lambda \|\theta\|_1.

Both regularizers have a clean Bayesian interpretation: they correspond to MAP (maximum a posteriori) estimation under different parameter priors.

L2 ↔ Gaussian prior: assume parameters follow an isotropic Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ . Given data $D$ , the log-posterior is

\begin{aligned} \log p(\theta \mid D) &\propto \log p(D \mid \theta) + \log p(\theta) \\ &= -\mathrm{NLL}(\theta) - \frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}. \end{aligned}

Maximizing the posterior is equivalent to minimizing $\mathrm{NLL}(\theta) + \frac{1}{2\tau^2}\|\theta\|_2^2$ — setting $\lambda = 1 / (2 \tau^2)$ recovers L2 (ridge regression). L2 regularization = MAP under a Gaussian prior — “don’t let weights get too big” is equivalent to “the prior on weights concentrates around 0.”

One engineering caveat: $\|\theta\|_2^2$ in practice usually applies only to weights, not to biases. PyTorch’s weight_decay does decay biases by default, but textbooks and practical guides generally recommend excluding biases explicitly. This is consistent with the intuition that “bias should retain its freedom to translate”; in the bias-free modern architectures mentioned in § 4.3 the requirement is automatically satisfied.

L1 ↔ Laplace prior: swap in a Laplace prior $\theta_j \sim \mathrm{Laplace}(0, b)$ with density $\propto \exp(-|\theta_j| / b)$ . The log-posterior becomes

\log p(\theta \mid D) \propto -\mathrm{NLL}(\theta) - \frac{1}{b}\sum_j |\theta_j| + \text{const},

so maximizing it corresponds to minimizing $\mathrm{NLL} + \lambda \|\theta\|_1$ ( $\lambda = 1 / b$ ). L1 (Lasso) regularization = MAP under a Laplace prior — choosing Laplace over Gaussian is what introduces sparsity.

Why does L1 produce sparsity (many $\theta_j$ exactly zero) and L2 doesn’t? The answer is in each prior’s shape at 0: the Gaussian density is smooth at 0 (zero first derivative), while the Laplace density has a “peak” at 0 (a kink in the first derivative). Treating the negative log of the latter as a penalty $|\theta_j|$ , the objective is non-smooth at $\theta_j = 0$ (“corner”), and the optimum tends to push components exactly to 0; the negative log of a Gaussian is a smooth $\theta_j^2$ , so the optimum only shrinks components but never zeroes them. L1’s sparsity isn’t an accident — it follows directly from the non-smoothness of the Laplace prior at 0.

5.3 The Ensemble View of Dropout

Dropout (Srivastava et al. 2014) is a minimal-looking operation: during training, each neuron is independently zeroed (masked) with probability $p$ ; at inference no mask is applied, but activations are scaled by $(1 - p)$ . The puzzle is why this “noise injection” gives such a large generalization boost.

The answer is that it’s mathematically equivalent to a gigantic ensemble. Consider a network with $n$ hidden neurons. Each mini-batch’s random mask induces one of $2^n$ possible sub-networks (each neuron 0 or 1); each sub-network gets one training step (with shared weights). Dropout isn’t just random noise — it trains a different sub-network on every batch, forming an implicit ensemble of up to $2^n$ sub-networks.

At inference, dropout is turned off and each activation is multiplied by $(1 - p)$ — this operation is exactly the normalized geometric mean of the $2^n$ subnetwork softmax probabilities for a softmax output layer. Derivation: each mask’s logit $z_k(m) = W_k \cdot (m \odot h)$ is linear in $m$ , so by linearity of expectation $\mathbb{E}_m[z_k(m)] = (1 - p) \cdot W_k \cdot h$ — exactly the logit produced by $(1-p)$ scaling at inference. Plugging this into softmax and using the identity $e^{\mathbb{E}_m[z]} = \prod_m e^{P(m) z(m)}$ , the exponential of an expectation expands into a product over masks; after normalization it equals the (normalized) geometric mean $\prod_m \hat y_k(m)^{P(m)}$ of the subnetwork softmax outputs $\hat y_k(m)$ .

The “arithmetic mean of logits → geometric mean of probabilities” comes from softmax’s exponential — a direct manifestation of the AM-GM identity $e^{\bar z} = (\prod e^{z_i})^{1/N}$ : a linear average in logit space becomes a multiplicative average in probability space after exp. For non-softmax outputs (sigmoid regression heads, ReLU inside conv layers, etc.) the equivalence is no longer strict, but the “ensemble approximation” of dropout still works well empirically in deep nonlinear networks.

In practice dropout is highly effective on conv-nets and MLPs, while Transformers often substitute it with stochastic depth or attention dropout. Dropout’s theoretical value is in connecting “regularization” and “ensemble learning” as one phenomenon — the same lens also applies to BatchNorm (batch-level noise) and data augmentation (input perturbation).

6. Wrapping Up

Strung together, this post is really answering three root questions: why neural networks can learn (§§ 1-2), how they learn (§ 3), and why depth makes learning hard (§§ 4-5). A few things not stressed in the body but worth keeping in mind before applying any of this:

The perceptron’s limit isn’t “linearity” — it’s “single layer”. XOR gave that answer back in 1969; the subsequent “AI winter” wasn’t because the perceptron itself was inadequate but because at that time backpropagation, compute, and the engineering scaffolding around nonlinear activations were all missing. Stacking layers with nonlinearity is what actually lifts the expressiveness ceiling.
Backpropagation isn’t a new algorithm — it’s “the chain rule executed in reverse topological order over a computational graph”. VJP is the load-bearing piece; without it, none of the modern AD frameworks (PyTorch / JAX / Autograd) would exist, and the specific memory-vs-compute tradeoff of reverse-mode AD comes from VJP, not from the chain rule itself.
The gradient problem isn’t a curse of “depth” — it’s a curse of “spectral radius”. Xavier / He initialization and BatchNorm / LayerNorm all chase the same target: keep the chain product’s $\bar\rho \approx 1$ . Residual connections (ResNet), normalization, and attention design can all be viewed through the same “keep spectral radius stable” lens.
All regularization is fundamentally a Bayesian prior over parameters. L2 ↔ Gaussian prior, L1 ↔ Laplace prior, Dropout ↔ implicit ensemble — they’re the same idea in different engineering forms. New regularizers (label smoothing, mixup, weight averaging) fit the same template once you identify the corresponding prior or ensemble structure.
The Universal Approximation Theorem is an expressiveness theorem, not a generalization theorem. “Can approximate” ≠ “can learn” (an optimization problem) ≠ “approximates on the test set” (a generalization problem). The real progress in the past decade has been on the latter two — optimization algorithms, normalization, regularization, and architectural priors (CNN / attention / Transformer) together form the engineering substrate that makes UAT actually land on real data.