Before Attention: From Sequence Bottlenecks to Dot Products, Softmax, and the GPU Memory Wall

The core Attention formula fits on one line, but that line compresses sequence modeling, matrix computation, probability normalization, and the GPU execution model at the same time. If we start directly from $QK^T$ , $\sqrt{d_k}$ , and Softmax, the formula looks like a set of symbols to memorize rather than a design forced by constraints. The prerequisites for Attention are not a checklist of concepts; they form a chain from “how a sequence is represented” to “how relations are computed in parallel”.

This post introduces the background needed before the main Attention derivation: why traditional sequence models are limited, how text becomes tensors, why dot products can represent relevance, how Softmax turns scores into differentiable choices, why deep networks need Embedding, Residual Connection, and LayerNorm, and how GPU memory hierarchy shapes the implementation of Attention. This post does not derive full Attention; it establishes the mathematical and engineering context for the next post’s Scaled Dot-Product Attention formula.

This post follows Deep Learning Foundations: From Perceptrons to Training Deep Networks and Backpropagation and Automatic Differentiation: From Jacobians to Computational Graphs to VJP. The first explains why neural networks can express complex functions; the second explains how gradients are computed efficiently. This post shifts the view to sequence data: inputs have variable length, elements have order, and any two positions may depend on each other. Once the input changes from a fixed vector into a sequence, the core modeling problem expands from fitting a function to preserving, transmitting, and reading information.

1. Fixed-Dimension Friction in Sequence Modeling

An MLP is most naturally fed a fixed-length vector: numerical features for one sample, a flattened image patch, or a pre-aggregated metric. Text, code, logs, and user-behavior sequences do not have that shape. They vary in length, order matters, and a later element may depend on a much earlier position. When a fixed-dimensional network meets sequence data, the first question is not whether the model is deep enough, but how the input enters a shared tensor interface.

The most direct workarounds are truncation, padding, or aggregation. Truncation cuts off long inputs and loses tail information; padding extends short sequences to a common length and introduces invalid computation; average pooling collapses every token into one vector and weakens order plus local structure. All three turn a variable-length sequence into a fixed-length vector, but they also weaken the positional and dependency structure that makes the sequence valuable.

Recurrent neural networks (RNNs) provide a more natural abstraction: the model reads the input one time step at a time and maintains a hidden state at each step. The simplest form is:

h_t = f(h_{t-1}, x_t)

Here $x_t$ is the input at step $t$ , $h_{t-1}$ is the historical summary left by the previous step, and $h_t$ is the new summary after reading the current token. A vanilla RNN usually expands $f$ as:

h_t = \phi(W_x x_t + W_h h_{t-1} + b_h)

where $W_x$ acts on the current input, $W_h$ acts on the previous hidden state, $b_h$ is the bias, and $\phi$ can be a nonlinearity such as $\tanh$ or ReLU. The key design in an RNN is to compress arbitrary-length history into a continuously updated state vector, while reusing the same parameters $W_x, W_h, b_h$ at every time step.

If the input dimension is $D_x$ and the hidden-state dimension is $D_h$ , the usual column-vector convention gives:

Symbol	Shape	Meaning
$x_t$	$D_x \times 1$	Input vector at step $t$
$h_{t-1}$	$D_h \times 1$	Previous hidden state
$W_x$	$D_h \times D_x$	Maps input into hidden-state space
$W_h$	$D_h \times D_h$	Maps previous hidden state into the next hidden-state space
$b_h$	$D_h \times 1$	Hidden-state bias
$h_t$	$D_h \times 1$	Current hidden state

Therefore $W_x x_t$ and $W_h h_{t-1}$ are both $D_h \times 1$ , so they can be added before the nonlinearity. $D_x$ is determined by the input representation, while $D_h$ is the model’s chosen hidden-state capacity; every time step writes into the same $D_h$ -dimensional state space.

If the task needs an output at each time step, the model can read it from the hidden state:

y_t = g(W_y h_t + b_y)

The output parameters $W_y, b_y$ are also usually shared across time steps; for whole-sequence classification, the model may instead read only the final $h_T$ . The repeated $f$ boxes in an unrolled RNN are not separate layers with independent weights; they are repeated calls to the same RNN cell at different time steps.

Unrolling the same recurrence over time makes the dependency chain visible. The output $y_t$ can be read at each step or only at the end, depending on the task; the bottleneck discussed here is the hidden-state path. Every time step reuses the same transition function $f$ , while the hidden state carries information forward along the sequence.

A compact RNN cell and its unfolded form: the same transition function is reused at each time step.

That design solves the variable-length input interface, but it introduces a representation bottleneck. A sequence of length $N$ is gradually folded into $h_1, h_2, \ldots, h_N$ ; the earlier a piece of information appears, the more state updates it must survive before it can influence the final output. If every step writes new information into the same finite-capacity state vector, early information can be overwritten or diluted. The hidden state in an RNN plays both memory and current-representation roles, and that dual role makes long-range dependence inherently difficult.

Gradient propagation faces the same long chain. When the loss flows backward to position $t$ , it must pass through the recurrence path $h_N \to h_{N-1} \to \cdots \to h_t$ ; each step multiplies by a local Jacobian, and repeated multiplication can produce vanishing or exploding gradients. Long-range dependence appears as compressed information in the forward pass and as an overlong multiplication chain in the backward pass.

There is also a direct engineering limit: $h_t$ depends on $h_{t-1}$ , so time steps have strict data dependence. Even if a batch contains many samples, the positions inside one sequence cannot easily be expanded into one round of matrix multiplication. The recurrence structure makes the sequence dimension look like a serial queue, leaving much of the GPU’s parallel capacity unused.

Attention targets this group of problems. Each position no longer depends only on the previous hidden state; it can read information directly from other positions. This read is not a hard-match rule such as query == key; the model first scores how relevant every position is, then uses Softmax to turn those scores into weights, and finally takes a weighted sum of information from other positions. The shift from RNNs to Attention is a shift from passing history step by step to constructing a relation matrix across positions in one shot.

2. The Representation Path from Tokens to Tensors

Neural networks do not operate on strings directly; they operate on numerical tensors. Text is usually tokenized into tokens; each token is then mapped through the vocabulary to an integer token id, and that id is used to fetch the corresponding continuous vector from an Embedding table. Embedding connects discrete symbols to continuous space, so downstream layers can process language units with matrix multiplication and gradient descent.

An Embedding table can be viewed as a trainable lookup matrix. If the vocabulary size is $V$ and the vector dimension is $D$ , the Embedding matrix is $E \in \mathbb{R}^{V \times D}$ ; token id $i$ corresponds to row $E_i$ . During training, tokens that appear in similar contexts tend to receive related representations through parameter updates. Embedding is not a dictionary definition; it is a set of continuous coordinates learned together with the task objective.

From an implementation view, the lookup is a row gather; from a linear-algebra view, it is equivalent to multiplying a one-hot vector by $E$ . If $o_i \in \mathbb{R}^{V}$ has a single 1 at token id $i$ , then $o_i^T E = E_i$ . Frameworks avoid materializing the one-hot vector and directly gather the row by token id, but $E$ is still a model parameter: training updates the gathered rows through backpropagation. An Embedding table is an ordinary trainable weight whose forward path uses indexing instead of a dense matrix multiply.

Before entering Attention, the input is usually represented as a three-dimensional tensor:

X \in \mathbb{R}^{B \times N \times D}

where $B$ is batch size, $N$ is sequence length, and $D$ is hidden size or embedding dimension. For clarity in derivations, we often ignore the batch dimension first and write one sequence as:

X \in \mathbb{R}^{N \times D}

This shape convention matters: $N$ says how many positions we have, while $D$ says how many features describe each position.

Matrix multiplication here has two useful readings. For a single token vector $x_i \in \mathbb{R}^{D}$ , $x_i W$ is a linear map that moves it from the original $D$ -dimensional feature space into a new $D'$ -dimensional feature space. For the whole sequence $X \in \mathbb{R}^{N \times D}$ , $XW$ applies the same linear map to all $N$ tokens at once. In neural networks, matrix multiplication is both batched computation and feature-space transformation.

A linear layer can be written as:

Y = XW,\qquad X \in \mathbb{R}^{N \times D},\quad W \in \mathbb{R}^{D \times D'}

The output $Y \in \mathbb{R}^{N \times D'}$ keeps the same $N$ positions, but each position moves from a $D$ -dimensional representation to a $D'$ -dimensional one. A linear layer does not change sequence length; it changes the feature space in which each token is represented.

This observation leads directly into the main Attention derivation. Query, Key, and Value are all produced by applying different linear projections to the same input $X$ ; their specific roles are expanded in the next post. The prerequisite point is that a linear projection keeps the number of tokens fixed while changing each token’s representation space.

3. Dot-Product Similarity and Differentiable Addressing

The dot product of two vectors is:

x_i \cdot x_j = \sum_{k=1}^{D} x_{ik}x_{jk}

Its geometric identity is:

x_i \cdot x_j = \lVert x_i\rVert\,\lVert x_j\rVert\cos\theta

where $\theta$ is the angle between the two vectors. If both vectors are normalized, the dot product is exactly cosine similarity; with fixed norms, it is largest when the directions align, zero when they are orthogonal, and negative when they point in opposite directions. In a continuous representation space, the dot product gives a simple, differentiable, hardware-friendly relevance score based on directional alignment.

This differs sharply from a traditional Key-Value lookup. A hash-table lookup is a hard match: equal keys hit, unequal keys miss. Neural networks need a soft match: one position and another are not absolutely related or unrelated; they receive a continuous score. Attention does not address memory by query == key; it scores every candidate position with a dot product at the same time.

For a single sequence $X \in \mathbb{R}^{N \times D}$ , directly computing $XX^T$ produces an $N \times N$ matrix:

S = XX^T,\qquad S_{ij} = x_i \cdot x_j

Row $i$ in this matrix contains the similarity between token $i$ and every token. The $N \times N$ relation matrix turns sequence modeling from “passing state through time” into “explicitly computing the relation between any pair of positions”.

Why not use Euclidean distance or another scoring function? In terms of expressiveness, many similarity functions can work; in terms of implementation, dot product can be batched into GEMM (General Matrix Multiply, the matrix operation that GPUs, BLAS, cuBLAS, and deep-learning frameworks optimize heavily. When an operation can be written as large-scale matrix multiplication, it can usually use GPU parallelism efficiently). Attention uses dot product not only because it is mathematically reasonable, but also because it maps global relation computation onto highly optimized linear-algebra kernels.

Directly using $XX^T$ already shows the main prerequisite: any two positions in a sequence can receive a relevance score through one matrix multiplication. It is still not full Attention. The full operator first projects the input into different representation spaces for scoring and output construction; those roles appear in the next post as Q, K, and V. The point to keep here is simpler: dot products can express all pairwise position relevance as one matrix multiplication.

The next post expands the projection matrices and Multi-Head Attention details. This post stops at the prerequisite level: once token representations are organized as matrices, dot-product scoring can cover every pair of positions at once.

4. Softmax and Differentiable Choice

A dot product only gives a set of real-valued scores; it does not yet say how much information should be read from each position. Softmax maps a set of real numbers into non-negative weights that sum to 1:

softmax(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}

High-scoring positions receive larger weights, while low-scoring positions still keep nonzero weight. Softmax turns hard selection into a probability distribution, allowing multiple tokens to contribute proportionally to the current representation.

As shown below, Softmax behaves differently under two input scales: when score gaps are moderate, several positions keep visible weight; when one score is much larger, the output becomes nearly one-hot. Softmax always normalizes scores into a weight distribution, but the score scale determines whether that distribution is spread out or sharp.

Softmax converts raw scores into weights that sum to 1; larger score gaps make the distribution sharper.

For this prerequisite post, it is enough to view Softmax as a function that turns scores into weights: the input is one row of scores, the output is one row of weights, and a larger weight means the corresponding item contributes more in a later weighted sum. This section only needs two Softmax facts: it normalizes scores into weights, and score scale changes the shape of the distribution.

Practical implementations do not compute $e^{z_i}$ directly, because exponentials can overflow easily. The standard form subtracts the row maximum first:

softmax(z_i) = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}

This transformation does not change the Softmax result, because numerator and denominator are both multiplied by the same factor $e^{-\max(z)}$ . This form is usually called stable softmax: it keeps exponentials inside the range that floating-point arithmetic can represent safely instead of changing the mathematical definition. “Stable” describes the numerical computation, not a different meaning of the output.

Softmax also has a training-related issue: if the gaps within one row of scores are too large, the largest score takes almost all the weight, the output approaches one-hot, and gradients shrink. The reason can be seen by differentiating Softmax:

\begin{aligned} s_i &= \frac{e^{z_i}}{Z}, \qquad Z = \sum_k e^{z_k} \\ \frac{\partial s_i}{\partial z_j} &= \frac{\partial}{\partial z_j}\left(\frac{e^{z_i}}{Z}\right) \\ &= \frac{\delta_{ij} e^{z_i} Z - e^{z_i} e^{z_j}}{Z^2} \\ &= \frac{e^{z_i}}{Z}\left(\delta_{ij} - \frac{e^{z_j}}{Z}\right) \\ &= s_i(\delta_{ij} - s_j) \end{aligned}

where $s = softmax(z)$ and $\delta_{ij}$ is the Kronecker delta: it equals 1 when $i=j$ and 0 otherwise. The key is in the derivative terms themselves: the diagonal terms are $s_i(1 - s_i)$ , and the off-diagonal terms are $-s_i s_j$ . If the output is already close to one-hot, for example $s_i=0.99$ , then the diagonal term is only $0.99 \times 0.01 = 0.0099$ ; when the other entries are near 0, the off-diagonal terms $s_i s_j$ are also near 0. If scores are too similar, the weights output by Softmax have no clear preference; if score gaps are too large, the distribution enters saturation and gradients weaken.

This is the prerequisite motivation for the $\sqrt{d_k}$ term in Scaled Dot-Product Attention. Suppose two vectors $q,k \in \mathbb{R}^{d}$ have independent coordinates with mean 0 and variance 1. Their dot product is:

q \cdot k = \sum_{r=1}^{d} q_r k_r

Under this simplified independent-identically-distributed assumption, each term $q_r k_r$ has mean 0 and variance 1, so the sum of $d$ terms has variance about $d$ . The higher the dimension, the larger the variance of unscaled dot-product scores, and the more easily Softmax is pushed into saturation.

The scaling term in the next formula is therefore not decoration:

\frac{QK^T}{\sqrt{d_k}}

Under the simplified assumption above, dividing by $\sqrt{d_k}$ brings the score variance from roughly $d_k$ back to roughly 1. Real model activations for $Q$ and $K$ will not strictly be independent, mean-zero, unit-variance variables, but the derivation identifies the quantity that the scaling term is meant to control. The core purpose of $\sqrt{d_k}$ scaling is to control score magnitude, preventing score gaps from growing too large with dimension, so Softmax works in a more stable numerical range.

5. Three Stabilizing Components in Deep Networks

Attention itself describes only how to read information from other positions, but a Transformer block is not a standalone Attention operator. A trainable deep network also needs input representations, the direct propagation path provided by Residual Connection, and normalization mechanisms. Embedding, Residual Connection, and LayerNorm are the basic components that let Attention stack into a deep model.

5.1 Embedding: Initial Token Representation

Embedding has already appeared above: it maps token ids to continuous vectors. The point worth repeating is that Embedding is model parameterization and is updated by the training objective; it is usually tied to the tokenizer and vocabulary, because a token id is effectively a row index in that vocabulary. Each large model typically has its own Embedding matrix, learned during pretraining from its initialized state; even if two models have similar architecture, different tokenizers, training data, or training processes usually produce different Embeddings. The same token does not necessarily have the same coordinates across models, and no single coordinate has to correspond to a human-nameable semantic dimension. Embedding provides a learnable representation space, not a hand-written table of linguistic knowledge.

Text is first split into tokens by a tokenizer, then mapped to token ids; each token id is a row index used to retrieve a continuous vector from the Embedding table.

5.2 Residual Connection: Learning a Correction

Residual Connection has a short form:

y = x + F(x)

Residual Connection preserves a direct add-back path from input to output; the sublayer only needs to learn the correction term F(x).

Mathematically, $F(x)$ is the transformation learned by the sublayer, while $x$ is the original input itself. Residual Connection does not output only $F(x)$ ; it outputs $x + F(x)$ . If a layer has not learned a useful correction yet, it can make $F(x)$ close to 0, so the layer behaves roughly like $y \approx x$ instead of damaging an already useful representation. Residual Connection changes the learning target of each layer from “recreate the whole representation” to “learn a correction relative to the input”.

Intuitively, it is like editing an existing draft instead of tearing it up and rewriting it at every layer. Without a residual connection, information must be reprocessed layer after layer and can become distorted or lost as the network gets deeper; with the direct add-back term $x$ , the original representation still has a chance to pass into the next step, while the sublayer only adds, removes, or adjusts part of it. This is why Residual Connection makes deep models easier to train: the model does not have to re-express the same information from scratch at every layer.

5.3 LayerNorm: Stabilizing Feature Scale per Token

Start with BatchNorm. During training, BatchNorm takes one feature dimension in the output of one layer, computes its mean and variance across multiple samples in a mini-batch, and uses those batch statistics for normalization. In a simplified notation, let $a_{b,d}$ be the value of sample $b$ on dimension $d$ , meaning one element in that layer’s output tensor; many textbooks call this value an activation. BatchNorm computes statistics along the batch dimension:

\begin{aligned} \mu_d &= \frac{1}{B}\sum_{b=1}^{B} a_{b,d} \\ \sigma_d^2 &= \frac{1}{B}\sum_{b=1}^{B}(a_{b,d} - \mu_d)^2 \\ BN(a_{b,d}) &= \gamma_d \frac{a_{b,d} - \mu_d}{\sqrt{\sigma_d^2 + \epsilon}} + \beta_d \end{aligned}

In a large Transformer model, one sample usually means one token sequence; after batching, the activation can be roughly viewed as shape $B \times N \times D$ , where $B$ is batch size, $N$ is sequence length, and $D$ is hidden dimension. Different samples can originally have different token counts, and engineering systems can use padding, attention masks, packing, and related methods to organize them into computable batches; Section 6 expands this batching story. The formula above omits the sequence-position dimension and keeps only the key point: BatchNorm statistics come from multiple samples in the batch. BatchNorm computes statistics across samples, so it works best when the batch has stable shape and composition.

Variable-length sequences make that assumption awkward. Padding solves the shape-alignment problem, not the statistics-stability problem. If normalization depends on the current batch mean $\mu$ and variance $\sigma^2$ , those statistics are affected by the current batch’s length composition, padding ratio, and batch size. In a Transformer, the BatchNorm issue is not whether sequences can be batched, but whether batch statistics are a stable basis for normalization.

Layer Normalization comes from the Layer Normalization paper. Here, “layer” does not mean normalizing across many layers; it means normalizing the output vector of one sample within the same layer. In a Transformer, that means normalizing the hidden vector of one token inside one layer. The problem it addresses is that, in a deep network, each token’s hidden vector passes through Attention, MLPs, residual additions, and other operations again and again; its numerical scale can drift from layer to layer. Some dimensions may become large, others small, and the next sublayer then has a harder input to process. LayerNorm pulls the hidden vector of one token back to a more stable scale before the next computation uses it.

Here, a hidden vector means the current internal representation of a token. The embedding vector is the initial representation: a token id looks up a row $e_i$ in the Embedding table, then position-related information is added before it enters the first Transformer block; after one layer of Attention, MLP, Residual Connection, and related operations, the vector at the same position is updated into a new representation. The embedding vector is the layer-0 input, while a hidden vector is the current token representation at some position inside some layer of the model.

Concretely, given a vector $x \in \mathbb{R}^{D}$ at one position, LayerNorm computes the mean and variance across that vector’s own $D$ hidden dimensions, then applies normalization, scale, and shift:

LN(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

The terms mean:

Symbol	Meaning
$x \in \mathbb{R}^{D}$	the hidden vector of one token at the current layer and sublayer input
$D$	the hidden dimension size
$\mu$	the mean across the $D$ dimensions of this vector
$\sigma^2$	the variance across the $D$ dimensions of this vector
$\epsilon$	a small constant that prevents division by zero
$\gamma$	a trainable scale parameter, allowing the model to readjust each dimension’s scale
$\beta$	a trainable shift parameter, allowing the model to readjust each dimension’s offset

LayerNorm computes mean and variance independently for each token, so it does not need stable statistics estimated from the batch; its definition is the same whether the current step processes 1 token or a full batch. In Transformer blocks, LayerNorm usually appears together with Residual Connection around the Attention and MLP sublayers, with common variants such as Pre-LN and Post-LN. LayerNorm is commonly placed near every Transformer sublayer to stabilize hidden-vector scale, not to change token length or positional structure.

These three components repeatedly combine with Attention inside a Transformer. A simplified block can be viewed as: input enters an Attention sublayer, then residual addition and normalization; next it enters an MLP sublayer, followed again by residual addition and normalization. Attention reads across positions, the MLP applies per-position nonlinear transforms, and Residual plus LayerNorm make those sublayers stable to stack.

This post does not expand Pre-LN, Post-LN, activation choices, or training-stability variants. The boundary we need here is simpler: Attention is the core operator, but it is not the full network; large models place Attention inside a trainable, stackable, parallelizable structure. Separating Attention from the Transformer block helps us discuss the mathematical formula and the engineering implementation independently.

6. Parallel Computation and the GPU Memory Wall

6.1 Batching in Transformers: Training, Prefill, and Decode

During Transformer training, batching first appears on the token-id tensor. If a batch contains $B$ sequences and the shared length after padding or packing is $N$ , a few key tensor shapes can be roughly written as follows. During training, batching means organizing multiple sequences into the same tensor shape.

input_ids: B × N: each position is one token id.
embedding: B × N × D: after the Embedding-table lookup, each token id becomes a $D$ -dimensional vector.
hidden: B × N × D: after entering Transformer blocks, the current representation inside each layer is usually called hidden states.

If samples originally have different lengths, the simple approach pads shorter sequences to the same $N$ and uses an attention mask to mark real tokens versus padding; in pretraining, systems also commonly split long text streams into fixed-length blocks or pack multiple text segments into one block to reduce padding waste. The purpose of a training batch is to organize many sequences into regular tensors so the GPU can process many tokens at once.

Packing does not mean discarding sample boundaries. Pretraining commonly uses separator tokens such as <eos> to mark text endings before next-token prediction; SFT or instruction tuning also uses attention masks, loss masks, and related metadata to control which positions can attend to each other and which positions contribute to the loss. Packing is an optimization in how training data is organized, not an unconditional merge of multiple samples into one text.

In Attention, Query can be roughly read as “what information I am looking for now,” Key as “what condition can match me,” and Value as “what content I contribute if matched.” The first stage of inference is usually called prefill: the model processes the user’s existing prompt and computes Key and Value vectors for those history tokens in every Attention layer; these Key/Value tensors are stored in KV Cache and reused during later decode steps. Different requests can have different prompt lengths, so a simple implementation can still pad to a shared length and use masks; high-performance inference frameworks instead use variable-length attention, paged attention, or similar mechanisms, recording each request’s true length in metadata to reduce wasted padding. In the prefill stage, the batch contains multiple prompts, and the difficulty is that those prompts need not have the same length.

The second stage is decode: the model generates autoregressively, usually adding 1 token per active request at each step. Suppose the current decode batch contains 3 requests. After prefill and several earlier decode steps, they have accumulated KV Caches of different lengths:

request A: KV Cache for 120 history tokens
request B: KV Cache for 900 history tokens
request C: KV Cache for 35 history tokens

At the next generation step, each request adds only 1 new token, so the current input is closer to $B_{\text{active}} \times 1 \times D$ ; but during Attention, A’s new token reads A’s 120 history tokens, B’s new token reads B’s 900 history tokens, and C’s new token reads C’s 35 history tokens. In decode, the current batched input is short, but each request points to a different amount of cached history.

current input: 3 × 1 × D
A's new token -> reads A's own 120 history tokens
B's new token -> reads B's own 900 history tokens
C's new token -> reads C's own 35 history tokens

Implementations use metadata such as position ids, sequence lengths, slot mapping, and paged KV Cache to connect “the token at this step” with “that request’s own history cache.” Transformer inference is still batched, but it is a dynamic batch of variable-length requests, not one huge rectangle created by padding every request to the same length.

6.2 Matrix Parallelism and the GPU Memory Wall

One important advantage of Attention over RNNs is that it organizes dependencies inside a sequence as matrix operations. Step $t$ in an RNN depends on step $t-1$ , so the sequence dimension is hard to parallelize; Attention can construct the $N \times N$ relation matrix at once and then aggregate information through matrix multiplication. Attention is more GPU-friendly not just because it works well as a model, but because it rewrites sequence relations as large-scale parallel linear algebra.

GPUs excel at high-throughput matrix computation. Large numbers of multiply-add operations can be distributed across many SMs (Streaming Multiprocessors, the basic parallel compute units on a GPU) and Tensor Cores (hardware units specialized for matrix multiplication), and if data arrives on time, the hardware can maintain high utilization. The issue is that compute units and memory do not run at the same speed: HBM (High Bandwidth Memory, the GPU’s high-bandwidth device memory) has large capacity and high bandwidth, but access is still much slower than on-chip registers, shared memory, and cache. GPU performance is often determined not only by FLOPs (Floating Point Operations), but by where tensors live and how often data moves across memory hierarchy levels.

At a high level, the GPU memory hierarchy can be read by distance from the compute units: registers are the fastest and smallest, usually holding temporary values inside a thread; shared memory and cache are on-chip, smaller but fast; HBM is the main GPU device memory, storing model weights, activations, intermediate matrices, and KV Cache; CPU host memory is farther away and should not be touched frequently by core kernels. Using the H100 SXM from NVIDIA’s official page as an example, it has 80GB of HBM3 and 3.35TB/s of GPU memory bandwidth; this HBM is large off-chip device memory, while registers, shared memory, and cache remain smaller working areas closer to compute. The closer a storage level is to compute, the faster and smaller it is; the farther it is, the larger and more expensive it is to move data through.

registers / shared memory / cache   on-chip, small, close to SMs and Tensor Cores
HBM3 (H100 SXM example)              80 GB, 3.35 TB/s, off-chip device memory for weights, activations, and KV Cache
CPU host memory                      farther away, larger, not something core kernels should touch frequently

This gives the distinction between Compute Bound and I/O Bound. Compute Bound means the bottleneck is mostly multiply-add work, so more compute capacity helps directly; I/O Bound means the bottleneck is reading and writing data, so compute units may sit waiting for data. When a model writes large intermediate matrices back to HBM and later reads them again, the bottleneck can move from arithmetic to memory bandwidth.

The roofline model states the same idea through arithmetic intensity: how many floating-point operations an operator performs per byte moved. If an operator moves too many bytes for each FLOP relative to the hardware’s compute-to-bandwidth ratio, it becomes memory-bound. Attention’s $N \times N$ score and probability matrices have enough data movement that naive kernels can land on the memory-bound side even though the formula is written as matrix multiplication.

Standard Attention produces an $N \times N$ score matrix and then an $N \times N$ probability matrix after Softmax. When sequence length doubles, these intermediate results grow as $N^2$ ; in long-context settings, memory traffic can quickly become the main cost. The mathematical formula of Attention is parallel-friendly, but a naive implementation creates massive intermediate-matrix I/O.

This is the starting point for work such as FlashAttention. It does not change the mathematical result of Attention; it uses tiling and online softmax to reduce reads and writes to HBM, keeping more computation close to faster on-chip storage. During inference, KV Cache makes a complementary trade-off: it stores previous keys and values in HBM so the model avoids recomputing them for every new token. The key idea in FlashAttention and KV Cache is not to redefine Attention, but to spend the memory hierarchy deliberately.

Learning Attention therefore cannot stop at the formula. The formula explains the composition of $QK^T$ , Softmax, and $V$ ; the system implementation still has to answer how those matrices are tiled, moved, cached, and whether long-sequence intermediate results can avoid landing in HBM. In large models, Attention is both a mathematical operator and a system component constrained by memory hierarchy.

Wrapping Up

This post answered a prerequisite question: why does the Attention formula have this shape? The answer is not one trick, but the combination of several constraints. Sequence input needs position-to-position relations, continuous vectors need differentiable similarity, Softmax needs controlled score scale, deep networks need stabilizing components, and GPU implementations need to reduce unnecessary serial dependence and memory traffic. The Attention formula is short because many motivations have already been compressed into the role of each symbol.

A few boundaries are worth keeping in mind before entering the main derivation:

Attention is not the full Transformer. Attention only handles cross-position information reading; a Transformer block also includes the MLP, Residual Connection, LayerNorm, position encoding, and masks.
Dot-product scoring is not semantic understanding by itself. It is only a similarity computation in continuous space; semantic capability comes from data, objectives, parameter scale, and multilayer composition shaping the representation space together.
Softmax probability weights are not a guarantee of interpretability. A high weight means a high aggregation share for the current layer and head, not a human-level causal explanation.
Matrixization is not free. Attention removes the time-step serial dependence of RNNs, but the $N \times N$ intermediate matrix shifts long-sequence cost toward memory capacity and bandwidth.
Hardware optimization does not change the mathematical semantics. Optimizations such as FlashAttention mainly reduce HBM round trips; they execute the same Attention formula in a GPU-friendlier way.

The next post can now enter Scaled Dot-Product Attention from Attention Is All You Need: starting with the role split among $Q$ , $K$ , and $V$ , deriving $softmax(QK^T / \sqrt{d_k})V$ term by term, and then connecting masks, Multi-Head Attention, RoPE, KV Cache, and inference-side memory optimization onto the same line.