Before Attention: From Sequence Bottlenecks to Dot Products, Softmax, and the GPU Memory Wall
The Attention formula is short, but every term in it answers a concrete problem: the serial dependence of sequence models, the need for a differentiable measure of vector similarity, the need to normalize raw scores into weights with Softmax, and the tradeoff between compute and data movement on GPUs. This post starts from the recurrence bottleneck in RNNs, then introduces Embedding, matrix projection, dot-product scoring, Softmax, LayerNorm, Residual Connection, and the memory wall, in preparation for deriving Scaled Dot-Product Attention in the next post.