Blog - pengleizhao

Attention Is All You Need: From the Attention Formula to Transformer Architecture and Inference Optimization

June 4, 2026

#Artificial Intelligence#Deep Learning#Attention#Transformer#GPU

The prerequisite post established sequence bottlenecks, dot-product scoring, Softmax scaling, and the GPU memory wall. This post derives the Scaled Dot-Product Attention operator from Attention Is All You Need: starting with Q, K, and V, then unpacking $\mathrm{softmax}(QK^T / \sqrt{d_k})V$, masks, Multi-Head Attention, positional encodings, KV Cache, GQA, and FlashAttention.
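For readers who want to see the formula move before opening the post, here is a minimal pure-Python sketch of $\mathrm{softmax}(QK^T / \sqrt{d_k})V$ with an optional causal mask. It is an illustration under simplifying assumptions (single head, tiny list-of-lists matrices, no batching), and the names `softmax`, `attention`, and `causal` are hypothetical helpers, not code from the post.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for small list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Scaled dot-product score of query i against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Causal mask: position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output row is a weighted average of the rows of V.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
full = attention(Q, K, V)                  # every position sees every key
masked = attention(Q, K, V, causal=True)   # row 0 can only see key 0
```

With the causal mask on, the first output row collapses to the first row of `V`, because the only unmasked score belongs to position 0.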

Before Attention: From Sequence Bottlenecks to Dot Products, Softmax, and the GPU Memory Wall

June 1, 2026

#Artificial Intelligence#Deep Learning#Attention#Transformer#GPU

The Attention formula is short, but every term in it answers a concrete problem: serial dependence in sequence models, differentiable vector similarity, Softmax normalization, and the compute-vs-data-movement tradeoff on GPUs. This post starts from the recurrence bottleneck in RNNs, then introduces Embedding, matrix projection, dot-product scoring, Softmax, LayerNorm, Residual Connection, and the memory wall as preparation for deriving Scaled Dot-Product Attention in the next post.

Mindfulness Meditation: On Purpose, Non-Judgmentally, in the Present

May 24, 2026

#Mindfulness#Meditation#Reading

Mindfulness gets talked about a lot, and just as often misunderstood — as emptying your head, a relaxation trick, or something mystical. So what is it really? How does it relate to meditation? And the common practices — breath awareness, the body scan, walking, everyday mindfulness — how do you actually do them, and what should you watch for? This is what I pulled together after reading seven of Jon Kabat-Zinn's books, written partly for the version of me who will slowly let the practice slide.

Backpropagation and Automatic Differentiation: From Jacobians to Computational Graphs to VJP

May 17, 2026

#Artificial Intelligence#Deep Learning#Backpropagation#Automatic Differentiation

Training a neural network requires the gradient of the loss with respect to every parameter. Backpropagation plus automatic differentiation (AD) is the method for doing this efficiently. This post starts from multidimensional calculus basics (Jacobians, the numerator-layout convention), derives the matrix chain rule and the gradient alignment for a linear layer $Y = WX$ (using index notation and matrix differentials with the trace trick as two methods that check each other), then moves on to computational graphs, vector-Jacobian products (VJP), and reverse-mode AD, and closes with engineering tradeoffs: gradient checkpointing, and dynamic vs. static graphs. It is the deep-dive companion to § 3.4, "Backpropagation: Computational Graph, Chain Rule, Jacobian", of "Deep Learning Foundations: From Perceptrons to Backpropagation to Training Deep Networks".

Deep Learning Foundations: From Perceptrons to Training Deep Networks

May 12, 2026

#Artificial Intelligence#Deep Learning#Perceptron#Backpropagation#Neural Networks

Deep learning can fit arbitrarily complex functions, but the machinery underneath uses only a handful of pieces: a linear transform, a nonlinear activation, and a chain rule that propagates error backward. Turning that machinery from "mathematically workable" into "trainable on a 100-layer network" requires walking through linear collapse, exploding gradients, and overfitting. This post follows the path perceptron → multilayer perceptron → training neural networks → training stability → generalization and regularization — every step starts from a mathematical motivation, gives a rigorous proof or derivation, then returns to engineering reality.

ClickHouse NumericIndexedVector Best Practices: Bucketing & Position Encoding, Applicable Scenarios, Usage Guide

December 28, 2025

#ClickHouse#BSI#Bitmap#Software Engineering

This post covers best practices for using NumericIndexedVector in ClickHouse, organized into four parts: bucketing & Position Encoding → applicable scenarios → usage guide → measured payoff. Measured in WeChat scenarios across 29 days and 105 core metrics, the typical scenario delivers: storage reduced by 60% (4.1 TB → 1.6 TB), a single global sum sped up 100× (59.2 s → 0.6 s), and mean ad-hoc query latency cut 3.7× (22.3 s → 6.0 s). Bucketing strategy directly affects the underlying RoaringBitmap's storage cost: the gap between the best and worst container tiers can reach an order of magnitude.

ClickHouse NumericIndexedVector: Design and Implementation of a Sparse Numeric Vector on Bitmap + BSI

August 10, 2025

#ClickHouse#BSI#Bitmap#Software Engineering

Bitmaps answer 'does key k exist?' efficiently, and AND / OR / NOT compose set operations cheaply. But what if every key carried a real-valued payload — how would we store that on top of a bitmap, and use those AND / OR / NOT primitives to implement pointwise (per-key) add / sub / mul / div and comparison? This post introduces NumericIndexedVector, newly merged into ClickHouse, which builds exactly this on top of RoaringBitmap and BSI (Bit-Sliced Index).
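To make the bit-sliced idea concrete, here is a toy sketch. It assumes small non-negative integer payloads and uses plain Python integers as stand-in bitmaps instead of RoaringBitmap. `build_bsi` and `bsi_sum` are hypothetical names for illustration and are not the ClickHouse API. Slice i holds the keys whose value has bit i set, so a sum over any key set needs only AND plus popcount.

```python
def build_bsi(values, bits=8):
    """Build (existence bitmap, bit slices) from {key: non-negative int < 2**bits}."""
    existence = 0
    slices = [0] * bits
    for key, val in values.items():
        existence |= 1 << key          # mark that this key exists
        for i in range(bits):
            if (val >> i) & 1:
                slices[i] |= 1 << key  # key belongs to slice i
    return existence, slices

def bsi_sum(slices, mask):
    """Sum of values over keys in mask: sum_i 2^i * popcount(slice_i AND mask)."""
    return sum((1 << i) * bin(s & mask).count("1") for i, s in enumerate(slices))

values = {0: 5, 3: 2, 7: 9}
existence, slices = build_bsi(values)
total = bsi_sum(slices, existence)               # sum over all keys
subset = bsi_sum(slices, (1 << 0) | (1 << 7))    # sum over keys 0 and 7 only
```

The filter is just another bitmap, so restricting the sum to a subset of keys costs one extra AND per slice.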

Seven Common Statistical Traps in Daily Data: From Small-Sample Bias to Simpson's Paradox

November 23, 2024

#Statistical Inference#Statistical Traps#Simpson's Paradox#A/B Testing

A familiar tension keeps showing up: people feel poorer year after year while the official average wage keeps climbing; neighbors swear local home prices are rising while the national statistics bureau reports a small year-on-year dip. That gap is usually not anyone's imagination — the numbers get quietly distorted somewhere between being collected, summarized, and reported. This post walks through the seven most common statistical traps in everyday data, grouped by where the distortion comes from: small-sample induction and [survivorship bias](https://en.wikipedia.org/wiki/Survivorship_bias), skewed-distribution means and missing significance tests, confounders and [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox), and finally cherry-picking and [Goodhart's law](https://en.wikipedia.org/wiki/Goodhart%27s_law) — each illustrated with a concrete example, then unpacked into its cause, its mathematical core, and its professional remedy.

Black-box Optimization: Bayesian Optimization and Its Multi-task Extension

August 5, 2021

#Black-box Optimization#Bayesian Optimization#Gaussian Process Regression#Multi-task Learning#Machine Learning

The previous post derived Gaussian Process Regression from linear regression. This post uses GPR as a surrogate model to assemble the full Bayesian optimization loop, and surveys six acquisition functions (EI / EIC / NEIC / UCB / ES / PI). We then extend to the multi-task setting, covering ICM, SLFM, LMC, and PC formulations of multi-output GPs, plus the acquisition functions used in multi-task Bayesian optimization.

Black-box Optimization: From Linear Regression to Gaussian Process Regression

March 2, 2021

#Black-box Optimization#Gaussian Process Regression#Bayesian Methods#Machine Learning

Linear regression is the simplest, most classical model in machine learning. Replace its point-estimated parameters with a distribution (Bayesian linear regression), use the kernel trick to lift the input space into a high-dimensional feature space, and let that feature space go infinite — and we arrive at Gaussian Process Regression. This post derives GPR step by step from linear regression, and shows how its hyperparameters (the kernel's parameters) are tuned by maximising the log marginal likelihood.

From Potential Outcomes to Hypothesis Testing: Statistical and Causal Inference in A/B Tests

December 26, 2020

#Statistical Inference#Causal Inference#A/B Testing#Hypothesis Testing

The core question in an online A/B test is simple: did the new version improve the population metric? A between-group difference in the sample is not automatically a causal effect. This post introduces how the potential outcomes framework (Rubin causal model) connects with frequentist hypothesis testing: the former defines the causal quantity we want; the latter uses randomization, the central limit theorem, and a test statistic to turn it into an operational experiment decision.
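As a rough illustration of the "operational decision" half, the sketch below runs a two-sample z-test on summary statistics. It assumes large samples, so the central limit theorem makes the difference in means approximately normal, and it assumes randomized assignment. `two_sample_z` and its parameters are hypothetical names, not code from the post.

```python
import math

def two_sample_z(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Return (difference in means, z statistic, two-sided p-value)."""
    diff = mean_b - mean_a
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(var_a / n_a + var_b / n_b)
    z = diff / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p

# Hypothetical numbers: control (a) vs. treatment (b) on some per-user metric.
diff, z, p = two_sample_z(mean_a=10.0, var_a=4.0, n_a=5000,
                          mean_b=10.1, var_b=4.0, n_b=5000)
significant = p < 0.05
```

The p-value only speaks to the sample difference; reading it as a causal effect relies on the randomization that the potential outcomes framework makes explicit.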