We derive Gaussian Process Regression starting from linear regression, following Rasmussen & Williams (2006).

1. The Linear Regression Problem

Problem statement: Given a dataset D={(x1,y1),(x2,y2),...,(xn,yn)}D = \left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), ..., (\mathbf{x}_n, y_n)\right\}, where xi=[xi1;xi2;...;xip]\mathbf{x}_i = [x_{i1}; x_{i2}; ...; x_{ip}] is a pp-dimensional input (each dimension representing a feature) and yiRy_i \in \mathbf{R} is the observed output. Linear regression aims to learn a linear model such that, given input xi\mathbf{x}_i, the model output is as close as possible to the observed output yiy_i.

This linear model is typically defined as f(xi)=wTxif(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i, where w\mathbf{w} is a p×1p \times 1 vector. We assume that the true observation differs from the model output by an error term ϵ\mathbf{\epsilon}, i.e. yi=f(xi)+ϵiy_i = f(\mathbf{x}_i) + \epsilon_i. Each error term is further assumed to be i.i.d. and to follow a Gaussian distribution with zero mean and variance σn2\sigma_n^2: ϵN(0,σn2)\epsilon \sim \mathcal{N}(0, \sigma_n^2).

By appending a constant feature of 11 to xi\mathbf{x}_i, we can equivalently realise f(xi)=wTxi+bf(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i + b. We assume here that the pp-dimensional xi\mathbf{x}_i already contains such a constant entry.

For a dataset DD with nn samples, define X=[x1,x2,...,xn]\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n] as the p×np \times n input matrix; y=[y1;y2;...,yn;]\mathbf{y} = [y_1; y_2; ..., y_n;] as the n×1n \times 1 vector of true outputs; and ϵ=[ϵ1;ϵ2;...,ϵn;]\mathbf{\epsilon} = [\epsilon_1; \epsilon_2; ..., \epsilon_n;] as the n×1n \times 1 vector of errors. The model can then be written in matrix form:

f(X)=XTw    y=f(X)+ϵf(\mathbf{X}) = \mathbf{X}^T \mathbf{w} \ \ \ \ \mathbf{y} = f(\mathbf{X}) + \mathbf{\epsilon}

2. Frequentist Linear Regression

The frequentist approach computes the model parameter w\mathbf{w} purely from data. The most common method is least squares: find a w\mathbf{w}^* that minimises the mean squared error:

w=argminw(yXTw)T(yXTw)\mathbf{w}^* = \arg\min_{\mathbf{w}} (\mathbf{y} - \mathbf{X}^T \mathbf{w})^T (\mathbf{y} - \mathbf{X}^T \mathbf{w})

Let Ew=(yXTw)T(yXTw)E_\mathbf{w} = (\mathbf{y} - \mathbf{X}^T \mathbf{w})^T (\mathbf{y} - \mathbf{X}^T \mathbf{w}). Differentiating with respect to w\mathbf{w} gives Eww=2X(XTwy)\frac{\partial{E_\mathbf{w}}}{\partial{\mathbf{w}}} = 2 \mathbf{X} (\mathbf{X}^T\mathbf{w} - \mathbf{y}).

Setting the derivative to zero yields the optimal w\mathbf{w}.

Since (yXTw)T(yXTw)=yXTw2(\mathbf{y} - \mathbf{X}^T \mathbf{w})^T (\mathbf{y} - \mathbf{X}^T \mathbf{w}) = \| \mathbf{y} - \mathbf{X}^T \mathbf{w} \|^2, it is also evident without differentiation that the optimum is reached when y=XTw\mathbf{y} = \mathbf{X}^T \mathbf{w}. At this point,

When XTX\mathbf{X}^T\mathbf{X} is invertible, w\mathbf{w}^* has a unique solution:

XXTw=Xyw=(XXT)1Xy\begin{aligned} \mathbf{X} \mathbf{X}^T \mathbf{w}^* &= \mathbf{X} \mathbf{y} \\ \mathbf{w}^* &= (\mathbf{X} \mathbf{X}^T)^{-1} \mathbf{X} \mathbf{y} \end{aligned}

The learned linear regression model is then f(xi)=xiT(XXT)1Xyf(\mathbf{x_i}) = \mathbf{x_i}^T (\mathbf{X} \mathbf{X}^T)^{-1} \mathbf{X} \mathbf{y}.

In the problem statement we assumed that the true output and the model prediction differ by a Gaussian-distributed error term. In that case the optimal parameter from least squares is also the maximum likelihood estimate:

When the observations are independent, the likelihood function is:

p(yX,w)=i=1np(yixi,w)=i=1n12πσnexp((yixiTw)22σn2)=1(2πσn2)n2exp(12σn2yXTw2)=1(2πσn2)n2exp[12σn2(yXTw)T(yXTw)]\begin{aligned} p(\mathbf{y} | \mathbf{X}, \mathbf{w}) & = \prod_{i=1}^{n} p (y_i | \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi} \sigma_n} \exp \left( - \frac{(y_i - \mathbf{x_i}^T \mathbf{w})^2}{2 \sigma_n^2} \right) \\ &= \frac{1}{(2\pi\sigma_n^2)^{\frac{n}{2}}} \exp \left( - \frac{1}{2 \sigma_n^2} |\mathbf{y} - \mathbf{X}^T \mathbf{w}|^2 \right) \\ &= \frac{1}{(2\pi\sigma_n^2)^{\frac{n}{2}}} \exp \left[ - \frac{1}{2 \sigma_n^2} (\mathbf{y} - \mathbf{X}^T \mathbf{w})^T (\mathbf{y} - \mathbf{X}^T \mathbf{w}) \right] \end{aligned}

Differentiating with respect to w\mathbf{w}: p(yX,w)w=p(yX,w)×Eww\frac{\partial p(\mathbf{y} | \mathbf{X}, \mathbf{w})}{\partial \mathbf{w}} = p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \times \frac{\partial{E_\mathbf{w}}}{\partial{\mathbf{w}}}. Since p(yX,w)>0p(\mathbf{y} | \mathbf{X}, \mathbf{w}) > 0, the solution to p(yX,w)w=0\frac{\partial p(\mathbf{y} | \mathbf{X}, \mathbf{w})}{\partial \mathbf{w}} = 0 is equivalent to the solution to Eww=0\frac{\partial{E_\mathbf{w}}}{\partial{\mathbf{w}}} = 0 in least squares. So under the assumption of Gaussian noise, the least-squares estimate is equivalent to the maximum likelihood estimate.

Summary:

  • The frequentist linear regression method only yields a point estimate of the model parameters; predictions use that point estimate.
  • The model is determined entirely by the data — all the model’s information is encoded in the available training set.

3. Bayesian Linear Regression

In the Bayesian approach to linear regression, we first assign a prior distribution to the parameters, representing our knowledge of them before observing data. We then use the observed data to update that knowledge via Bayes’ theorem, obtaining the posterior distribution. Finally, we use the posterior distribution to make predictions. Compared with the frequentist approach, this differs in two ways:

  • The parameter information is not determined entirely by the data; the prior also has influence.
  • Predictions use the parameter distribution.

The procedure is as follows. To make the posterior tractable, we choose the prior to be the conjugate prior of the likelihood. The likelihood is Gaussian, and the Gaussian distribution is its own conjugate (self-conjugate). So we set the prior on w\mathbf{w} to be a multivariate Gaussian with zero mean and covariance Σp\Sigma_p: wN(0,Σp)\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p).

Setting the prior mean of wN(0,Σp)\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p) to 0\mathbf{0} is purely for simplicity — it does not have to be 0\mathbf{0}. In practice, this choice does not affect the result.

After observing the data D=(X,y)D = (\mathbf{X}, \mathbf{y}), since the data points are independent and Gaussian, the likelihood function is:

p(yX,w)=i=1np(yixi,w)=i=1n12πσnexp((yixiTw)22σn2)=1(2πσn2)n2exp(12σn2yXTw2)=1(2πσn2)n2exp[12σn2(yXTw)T(yXTw)]=1(2πσn2)n2exp[12(yXTw)T(σn2I)1(yXTw)]N(XTw,σn2I)\begin{aligned} p(\mathbf{y} | \mathbf{X}, \mathbf{w}) & = \prod_{i=1}^{n} p (y_i | \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi} \sigma_n} \exp \left( - \frac{(y_i - x_i^T \mathbf{w})^2}{2 \sigma_n^2} \right) \\ &= \frac{1}{(2\pi\sigma_n^2)^{\frac{n}{2}}} \exp \left( - \frac{1}{2 \sigma_n^2} |\mathbf{y} - \mathbf{X}^T \mathbf{w}|^2 \right) \\ &= \frac{1}{(2\pi\sigma_n^2)^{\frac{n}{2}}} \exp \left[ - \frac{1}{2 \sigma_n^2} (\mathbf{y} - \mathbf{X}^T \mathbf{w})^T (\mathbf{y} - \mathbf{X}^T \mathbf{w}) \right] \\ &= \frac{1}{(2\pi\sigma_n^2)^{\frac{n}{2}}} \exp \left[ - \frac{1}{2} (\mathbf{y} - \mathbf{X}^T \mathbf{w})^T (\sigma_n^2 \mathit{I})^{-1} (\mathbf{y} - \mathbf{X}^T \mathbf{w}) \right] \\ & \sim \mathcal{N}(\mathbf{X}^T \mathbf{w}, \sigma_n^2 \mathbf{I}) \end{aligned}

By Bayes’ theorem:

posterior=likelihood×priormarginal likelihood,p(wy,X)=p(yX,w)p(w)p(yX)\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \quad p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \frac{p(\mathbf{y} | \mathbf{X}, \mathbf{w}) p(\mathbf{w})}{p(\mathbf{y} | \mathbf{X})}

where p(yX)p(\mathbf{y} | \mathbf{X}) is the marginal likelihood:

p(yX)=p(yX,w)p(w)dwp(\mathbf{y} | X) = \int p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w}) d \mathbf{w}

We can derive the posterior distribution of w\mathbf{w}, which contains all information from the prior and the likelihood:

p(wy,X)= p(yX,w)p(w)p(yX) exp(12σn2(yXTw)T(yXTw))exp(12wTΣp1w) exp(12(wwˉ)T(1σn2XXT+Σp1)(wwˉ))where: wˉ=σn2(σn2XXT+Σp1)Xy N(wˉ=1σn2A1Xy, A1)where: A=σn2XXT+Σp1\begin{aligned} p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \ & \frac{ p(\mathbf{y} | \mathbf{X}, \mathbf{w}) p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})} \\ \propto \ & \exp \left( - \frac{1}{2 \sigma_n^2} (\mathbf{y} - \mathbf{X}^T \mathbf{w})^T (\mathbf{y} - \mathbf{X}^T \mathbf{w}) \right) \exp \left( - \frac{1}{2} \mathbf{w}^T \Sigma_p^{-1} \mathbf{w} \right) \\ \propto \ & \exp \left( - \frac{1}{2} (\mathbf{w} - \mathbf{\bar{w}})^T \left(\frac{1}{\sigma_n^2}\mathbf{X}\mathbf{X}^T + \Sigma_{p}^{-1}\right)(\mathbf{w} - \mathbf{\bar{w}}) \right) \\ & \text{where:} \ \mathbf{\bar{w}} = \sigma_{n}^{-2}(\sigma_n^{-2}\mathbf{X}\mathbf{X}^T + \Sigma_p^{-1})\mathbf{X}\mathbf{y} \\ \sim \ & \mathcal{N}\left(\mathbf{\bar{w}} = \frac{1}{\sigma_n^2}\mathbf{A}^{-1}\mathbf{X}\mathbf{y}, \ \mathbf{A}^{-1}\right) \\ & \text{where:} \ \mathbf{A} = \sigma_{n}^{-2} \mathbf{X} \mathbf{X}^T + \Sigma_p^{-1} \end{aligned}

For a test point x\mathbf{x}_*, we use the posterior predictive distribution to predict its output ff_{*}:

p(fx,X,y)= p(fx,w)p(wX,y)dw= N(1σn2xTA1Xy, xTA1x)\begin{aligned} p(f_{*}|\mathbf{x}_{*}, \mathbf{X}, \mathbf{y}) = \ & \int p(f_{*} | \mathbf{x}_{*}, \mathbf{w}) p(\mathbf{w} | \mathbf{X}, \mathbf{y}) d\mathbf{w} \\ = \ & \mathcal{N} \left(\frac{1}{\sigma_n^2} \mathbf{x}_{*}^T \mathbf{A}^{-1} \mathbf{X} \mathbf{y}, \ \mathbf{x}_{*}^T \mathbf{A}^{-1} \mathbf{x}_{*}\right) \end{aligned}

The posterior predictive distribution is again a Gaussian. As the formula shows, the posterior predictive is a weighted average over the predictions made by every parameter (an integral, in the continuous case). It no longer picks a single parameter via some strategy and predicts with it; instead it accounts for the parameter uncertainty (its distribution). And the prediction itself is a distribution, not a point. This is the main difference from the frequentist (point-estimate) approach.

3.1 Kernel Trick: Handling Non-linear Data

Standard Bayesian linear regression can only express linear relationships and cannot capture complex non-linear patterns. We can address this by mapping the input space, via a non-linear function, into a higher-dimensional feature space, then applying the linear model in that feature space. As long as the mapping is independent of the parameter w\mathbf{w}, the overall model remains linear (in w\mathbf{w}). How should we choose the mapping function? In practice we usually do not need to perform the mapping explicitly — we use the kernel trick instead. See the kernel-function chapter for the kernel trick.

Specifically, we use a non-linear map ϕ(x)\phi(\mathbf{x}) to lift the pp-dimensional original input x\mathbf{x} into a qq-dimensional space, and perform linear regression in the new feature space. Now w\mathbf{w} is a q×1q \times 1 vector with ϵN(0,σn2)\epsilon \sim \mathcal{N}(0, \sigma_n^2), and the regression model is:

f(X)=ϕ(X)Tw,y=f(X)+ϵf(\mathbf{X}) = \phi(\mathbf{X})^T \mathbf{w}, \quad \mathbf{y} = f(\mathbf{X}) + \mathbf{\epsilon}

The posterior predictive becomes:

p(fx,X,y)=N(1σn2ϕ(x)TA1Φy,ϕ(x)TA1ϕ(x))p(f_{*}|\mathbf{x}_{*}, \mathbf{X}, \mathbf{y}) = \mathcal{N} \left(\frac{1}{\sigma_n^2} \phi(\mathbf{x}_{*})^T \mathbf{A}^{-1} \Phi \mathbf{y}, \quad \phi(\mathbf{x}_{*})^T \mathbf{A}^{-1} \phi(\mathbf{x}_{*})\right)

where Φ=Φ(X)\Phi = \Phi(\mathbf{X}) and A=σn2ΦΦT+Σq1\mathbf{A} = \sigma_{n}^{-2} \Phi \Phi^T + \Sigma_q^{-1}.

This formula is hard to use in practice because computing A1\mathbf{A}^{-1} (q×qq \times q) is expensive when qq is large, and infeasible when the feature space is infinite-dimensional. The kernel trick comes in here: it not only avoids computing A1\mathbf{A}^{-1}, but also removes the need to specify the mapping function explicitly. The kernel trick computes feature-space inner products directly from the input space, so we need to rewrite the formula above in inner-product form.

For the mean:

AΣqΦ=σn2Φ(ΦTΣqΦ+σn2I)\mathbf{A} \Sigma_q \Phi = \sigma_n^{-2} \Phi\left(\Phi^T \Sigma_q \Phi + \sigma_n^2 I\right)

Multiplying by A1\mathbf{A}^{-1} on the left and by (ΦTΣqΦ+σn2I)1\left(\Phi^T \Sigma_q \Phi+\sigma_n^2 I\right)^{-1} on the right gives an equivalent expression for the mean before and after the transformation:

ΣqΦ(ΦTΣqΦ+σn2I)1=σn2A1Φ\Sigma_q \Phi \left(\Phi^T \Sigma_q \Phi+\sigma_n^2 I\right)^{-1} = \sigma_n^{-2} \mathbf{A}^{-1} \Phi

For the variance, by the Woodbury matrix identity:

(Z+UWVT)1=Z1Z1U(W1+VTZ1U)1VTZ1(\mathbf{Z}+\mathbf{U}\mathbf{W}\mathbf{V}^T)^{-1}=\mathbf{Z}^{-1}-\mathbf{Z}^{-1}\mathbf{U}(\mathbf{W}^{-1}+\mathbf{V}^T\mathbf{Z}^{-1}\mathbf{U})^{-1}\mathbf{V}^T\mathbf{Z}^{-1}

Let:

Z1=Σq,    W1=σn2I,    U=V=Φ\mathbf{Z}^{-1} = \Sigma_q, \ \ \ \ \mathbf{W}^{-1} = \sigma_n^2 \mathbf{I}, \ \ \ \ U = V = \Phi

Then:

(Σq1+σn2ΦΦT)1=ΣqΣqΦ(σn2I+ΦTΣqΦ)1ΦTΣq(\Sigma_q^{-1} + \sigma_{n}^{-2} \Phi \Phi^T)^{-1} = \Sigma_q - \Sigma_q \Phi (\sigma_n^2 \mathbf{I} + \Phi^T \Sigma_q \Phi)^{-1} \Phi^T \Sigma_q

Substituting gives the posterior under the kernel trick:

fx,X,yN(ϕTΣqΦ(K+σn2I)1y,ϕTΣqϕϕTΣqΦ(ΦTΣqΦ+σn2I)1ΦTΣqϕ)\begin{aligned} f_{*}|\mathbf{x}_{*}, \mathbf{X}, \mathbf{y} \sim \mathcal{N} & \left(\phi_{*}^T\Sigma_q \Phi (\mathbf{K} + \sigma^{2}_{n}\mathbf{I})^{-1} \mathbf{y}, \right.\\ & \left. \phi_{*}^{T} \Sigma_q \phi_{*} - \phi_{*}^T \Sigma_q\Phi (\Phi^T \Sigma_q \Phi + \sigma^{2}_{n}\mathbf{I})^{-1} \Phi^T \Sigma_q \phi_{*}\right) \end{aligned}

where ϕ=ϕ(x)\phi_* = \phi({\mathbf{x}_*}) and K=ΦTΣqΦ\mathbf{K} = \Phi^T \Sigma_q \Phi. This K\mathbf{K} is exactly the covariance matrix:

Cov(f(X),f(X))=E[f(X)f(X)]=Φ(X)E[ww]Φ(X)=ΦTΣqΦ\operatorname{Cov}(f(\mathbf{X}), f(\mathbf{X})) = \mathbb{E}\left[f(\mathbf{X}) f\left(\mathbf{X}\right)\right] = \boldsymbol{\Phi}(\mathbf{X})^{\top} \mathbb{E}\left[\mathbf{w} \mathbf{w}^{\top}\right] \boldsymbol{\Phi}\left(\mathbf{X}\right) = \boldsymbol{\Phi^T} \Sigma_{q} \boldsymbol{\Phi}

For a test set with multiple points, applying the formula above and the definition of covariance, the prediction can be written in vector form:

fX,X,yN(K(X,X)[K(X,X)+σn2I]1y,K(X,X)K(X,X)[K(X,X)+σn2I]1K(X,X))\begin{aligned} \mathbf{f}_{*} | \mathbf{X}_{*}, \mathbf{X}, \mathbf{y} \sim \mathcal{N} & \left(K\left(\mathbf{X}_{*}, \mathbf{X}\right) \left[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 \mathbf{I} \right]^{-1} \mathbf{y}, \right.\\ & \left. K\left(\mathbf{X}_{*}, \mathbf{X}_{*}\right) - K\left(\mathbf{X}_{*}, \mathbf{X}\right) \left[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 \mathbf{I} \right]^{-1} K\left(\mathbf{X}, \mathbf{X}_{*}\right)\right) \end{aligned}

Some remarks:

  • For each test point x\mathbf{x}_*, the posterior mean is a linear combination of y\mathbf{y}; equivalently, a linear combination of k(xi,x)k(\mathbf{x}_i, \mathbf{x}_\star): fˉ(x)=i=1nαik(xi,x)\bar{f}(\mathbf{x}_*) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x}_*), where α=[K(X,X)+σn2I]1y\alpha = [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 \mathbf{I}]^{-1} \mathbf{y}.
  • For each test point x\mathbf{x}_*, the posterior variance is independent of the training output y\mathbf{y} and depends only on the inputs. The variance can be viewed as the prior variance (K(X,X)K(\mathbf{X}_*, \mathbf{X}_*)) minus the information acquired from the training set K(X,X)[K(X,X)+σn2I]1K(X,X)K(\mathbf{X}_*, \mathbf{X}) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 \mathbf{I}]^{-1} K(\mathbf{X}, \mathbf{X}_*) — observations gradually shrink the variance, which makes sense.
  • Although a Gaussian process corresponds to an infinite-dimensional feature space, the posterior at each point depends only on the training set, which is what makes it practical.
  • The covariance matrix is computed via a kernel function. There are many kernels; if we have deep insight into the data we can pick a suitable one. For example, if we know the output is periodic in parameter space, we can use a periodic kernel. In most practical cases such information is unknown and there is no especially effective method for picking a kernel. For Gaussian process regression, the most commonly used kernels are the RBF kernel and the Matérn kernel.

4. Gaussian Process Regression

4.1 Theory

The result derived in the previous section (kernel trick) is in fact the result of Gaussian process regression. This section derives it directly from the Gaussian process.

A stochastic process is a collection of random variables; a Gaussian process is one kind of stochastic process. A stochastic process is called a Gaussian process if every finite subset of its random variables follows a multivariate Gaussian distribution. A Gaussian process is uniquely determined by its mean function (mean function: m(x)m(\mathbf{x})) and its covariance function (covariance function: k(x,x)k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)). We denote the Gaussian process f(x)f(\mathbf{x}) with mean function m(x)m(\mathbf{x}) and covariance function k(x,x)k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) as (here ff no longer denotes a function):

f(x)GP(m(x),k(x,x))f(\mathbf{x}) \sim \mathcal{G} \mathcal{P}\left(m(\mathbf{x}), k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right)

The mean and covariance functions are defined as:

m(x)=E[f(x)]k(x,x)=E[(f(x)m(x))(f(x)m(x))]\begin{aligned} m(\mathbf{x}) &= \mathbb{E} \left[ f(\mathbf{x}) \right] \\ k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &= \mathbb{E}\left[(f(\mathbf{x})-m(\mathbf{x}))\left(f\left(\mathbf{x}^{\prime}\right)-m\left(\mathbf{x}^{\prime}\right)\right)\right] \end{aligned}

For a training set D(X,y)D(\mathbf{X}, \mathbf{y}), modelling its output with a Gaussian process, y\mathbf{y} follows a multivariate Gaussian:

yN(0,K(X,X)+σn2I)\mathbf{y} \sim \mathcal{N} \left( \mathbf{0}, \mathbf{K}(\mathbf{X}, \mathbf{X}) + \sigma_{n}^2 \mathbf{I} \right)

For nn_* test points X\mathbf{X}_{*}, we want to predict the outputs f\mathbf{f}_*. By the Gaussian process property that any finite subset is multivariate Gaussian, y\mathbf{y} and f\mathbf{f}_* jointly are still multivariate Gaussian:

[yf]N(0,[K(X,X)+σn2IK(X,X)K(X,X)K(X,X)])\left[\begin{array}{c}{\mathbf{y}} \\ {\mathbf{f}_{*}}\end{array}\right] \sim \mathcal{N}\left(\mathbf{0},\left[\begin{array}{cc}{K(\mathbf{X}, \mathbf{X}) + \sigma_{n}^2 \mathbf{I}} & {K\left(\mathbf{X}, \mathbf{X}_{*}\right)} \\ {K\left(\mathbf{X}_{*}, \mathbf{X}\right)} & {K\left(\mathbf{X}_{*}, \mathbf{X}_{*}\right)}\end{array}\right]\right)

Since the joint is Gaussian, the conditional is straightforward to compute. See conditional distributions of the multivariate normal:

fX,X,yN(K(X,X)[K(X,X)+σn2I]1y,K(X,X)K(X,X)[K(X,X)+σn2I]1K(X,X))\begin{aligned} \mathbf{f}_{*} | \mathbf{X}_{*}, \mathbf{X}, \mathbf{y} \sim \mathcal{N} & \left(K\left(\mathbf{X}_{*}, \mathbf{X}\right) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2\mathbf{I}]^{-1} \mathbf{y}, \right.\\ & \left. K\left(\mathbf{X}_{*}, \mathbf{X}_{*}\right) - K\left(\mathbf{X}_{*}, \mathbf{X}\right) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 \mathbf{I}]^{-1} K\left(\mathbf{X}, \mathbf{X}_{*}\right)\right) \end{aligned}

This result agrees with the kernel-trick-based Bayesian linear regression derivation under Gaussian noise — the two paths converge.

The log marginal likelihood (or evidence) is used in hyperparameter tuning (see §5). The marginal likelihood is the integral of the product of likelihood and prior:

p(yX)=p(yf,X)p(fX)dfp(\mathbf{y} | \mathbf{X}) = \int p(\mathbf{y} | \mathbf{f}, \mathbf{X}) p(\mathbf{f} | \mathbf{X}) d \mathbf{f}

The prior in GPR is the Gaussian process fXN(0,K)f|\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}), so:

p(fX)=exp(12(fμ)TΣ1(fμ))(2π)kΣ=exp(12(f)TΣ1(f))(2π)kΣ\begin{aligned} p(\mathbf{f} | \mathbf{X}) & = \frac{\exp \left(-\frac{1}{2}(\mathbf{f}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{f}-\boldsymbol{\mu})\right)}{\sqrt{(2 \pi)^k|\boldsymbol{\Sigma}|}} \\ & = \frac{\exp \left(-\frac{1}{2}(\mathbf{f})^T \boldsymbol{\Sigma}^{-1}(\mathbf{f})\right)}{\sqrt{(2 \pi)^k|\boldsymbol{\Sigma}|}} \end{aligned}

Taking the logarithm:

logp(fX)=12fTK1f12logKn2log2π\log p(\mathbf{f} \mid \mathbf{X}) = -\frac{1}{2} \mathbf{f}^T \mathbf{K}^{-1} \mathbf{f} - \frac{1}{2} \log |\mathbf{K}| - \frac{n}{2} \log 2 \pi

The likelihood is yfN(f,σn2I)yN(0,K+σn2I)\mathbf{y} | \mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 \mathbf{I}) \rightarrow \mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K} + \sigma_n^2 \mathbf{I}), so by analogy with logp(fX)\log p(\mathbf{f} | \mathbf{X}):

logp(yX)=12y(K+σn2I)1y12logK+σn2In2log2π\log p(\mathbf{y} \mid \mathbf{X}) = -\frac{1}{2} \mathbf{y}^{\top}\left(\mathbf{K}+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y} - \frac{1}{2} \log \left|\mathbf{K}+\sigma_n^2 \mathbf{I}\right| - \frac{n}{2} \log 2 \pi

4.2 Implementation

In practice, direct matrix inversion is expensive (O(n3)O(n^3)) and prone to numerical error. We instead use Cholesky decomposition and two triangular solves to bypass the matrix inversion. The complexity remains O(n3)O(n^3) but with a better constant — about 1/61/6 of the direct inversion — and improved numerical stability:

L=cholesky(K(X,X)+σn2I)let  (LLT)1y=α  α=LT\(L\y)\begin{aligned} & \mathbf{L} = \text{cholesky}(K(\mathbf{X}, \mathbf{X}) + \sigma_n^2\mathbf{I}) \\ & \text{let} \ \ (\mathbf{L} \mathbf{L}^T)^{-1} \mathbf{y} = \boldsymbol{\alpha} \ \rightarrow \ \boldsymbol{\alpha} = \mathbf{L}^T \backslash (\mathbf{L} \backslash \mathbf{y}) \end{aligned}
  • L\y\mathbf{L} \backslash \mathbf{y} denotes solving the linear system Lx=y\mathbf{L} \mathbf{x} = \mathbf{y}.

This gives the posterior mean:

fˉ=k(x,X)α\bar{f}_* = \mathbf{k}(\mathbf{x}_*, \mathbf{X}) \boldsymbol{\alpha}

For the variance:

k(x,X)[K(X,X)+σn2I]1k(X,x)=k(x,X)(LLT)1k(X,x)=k(X,x)T(LT)1L1k(X,x)=(L1k(X,x))TL1k(X,x)\begin{aligned} k\left(\mathbf{x}_{*}, \mathbf{X}\right) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 \mathbf{I}]^{-1} k\left(\mathbf{X}, \mathbf{x}_{*}\right) & = k\left(\mathbf{x}_{*}, \mathbf{X}\right) (\mathbf{L} \mathbf{L}^T)^{-1} k\left(\mathbf{X}, \mathbf{x}_{*}\right) \\ & = k\left(\mathbf{X}, \mathbf{x}_{*}\right)^T (\mathbf{L}^T)^{-1} \mathbf{L}^{-1} k\left(\mathbf{X}, \mathbf{x}_{*}\right) \\ & = \left(\mathbf{L}^{-1} k\left(\mathbf{X}, \mathbf{x}_{*}\right)\right)^T \mathbf{L}^{-1} k\left(\mathbf{X}, \mathbf{x}_{*}\right) \end{aligned}

Let v=L1k(X,x)\mathbf{v} = \mathbf{L}^{-1} k\left(\mathbf{X}, \mathbf{x}_{*}\right), also obtained via a triangular solve:

v=L\K(X,x)\mathbf{v} = \mathbf{L} \backslash K\left(\mathbf{X}, \mathbf{x}_{*}\right)

This gives the posterior variance:

V[f]=k(x,x)vv\mathbb{V}\left[f_*\right] = k\left(\mathbf{x}_*, \mathbf{x}_*\right) - \mathbf{v}^{\top} \mathbf{v}

and the log marginal likelihood:

logp(yX)=12yαilogLiin2log2π\log p(\mathbf{y} \mid \mathbf{X}) = -\frac{1}{2} \mathbf{y}^{\top} \boldsymbol{\alpha} - \sum_i \log \mathbf{L}_{i i} - \frac{n}{2} \log 2 \pi

where:

  • 12logK+σn2I=ilogLii\frac{1}{2} \log \left|\mathbf{K}+\sigma_n^2 \mathbf{I}\right| = \sum_i \log \mathbf{L}_{i i} uses the lower-triangular property of the Cholesky factor (12logLLT=ilogLii\frac{1}{2} \log |\mathbf{L} \mathbf{L}^T| = \sum_i \log \mathbf{L}_{ii}).

5. Model Selection and Hyperparameter Optimization

The theory of GPR is complete, but turning it into a powerful tool still requires more work. Many uncertainties remain in the procedure: how should we choose the kernel function? how should we determine the kernel’s hyperparameters? These problems must be solved before GPR becomes a strong, practical tool. We collectively call them the model selection problem (different hyperparameters being analogous to different models), or the model training process.

5.1 Bayesian Model Selection Framework

We can do model selection from a Bayesian perspective. Group model selection into three levels:

  • Parameters w\mathbf{w}: e.g. w\mathbf{w} in linear regression, or weights in a neural network.
  • Hyperparameters θ\boldsymbol{\theta}: e.g. the length-scale parameter of a kernel function, or weight decay in a neural network.
  • Structure Hi\mathcal{H}_i: different model structures, e.g. linear models, neural networks.

For parameter w\mathbf{w}, by Bayes’ theorem the posterior is:

p(wy,X,θ,Hi)=p(yX,w,Hi)p(wθ,Hi)p(yX,θ,Hi)p\left(\mathbf{w} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\theta}, \mathcal{H}_i\right) = \frac{p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \mathcal{H}_i\right) p\left(\mathbf{w} \mid \boldsymbol{\theta}, \mathcal{H}_i\right)}{p\left(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}, \mathcal{H}_i\right)}

where the denominator is the marginal likelihood:

p(yX,θ,Hi)=p(yX,w,Hi)p(wθ,Hi)dwp\left(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}, \mathcal{H}_i\right) = \int p\left(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \mathcal{H}_i\right) p\left(\mathbf{w} \mid \boldsymbol{\theta}, \mathcal{H}_i\right) d \mathbf{w}

For hyperparameter θ\boldsymbol{\theta}, the posterior is:

p(θy,X,Hi)=p(yX,θ,Hi)p(θHi)p(yX,Hi)p\left(\boldsymbol{\theta} \mid \mathbf{y}, \mathbf{X}, \mathcal{H}_i\right) = \frac{p\left(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}, \mathcal{H}_i\right) p\left(\boldsymbol{\theta} \mid \mathcal{H}_i\right)}{p\left(\mathbf{y} \mid \mathbf{X}, \mathcal{H}_i\right)}

with marginal likelihood:

p(yX,Hi)=p(yX,θ,Hi)p(θHi)dθp\left(\mathbf{y} \mid \mathbf{X}, \mathcal{H}_i\right) = \int p\left(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}, \mathcal{H}_i\right) p\left(\boldsymbol{\theta} \mid \mathcal{H}_i\right) d \boldsymbol{\theta}

For model structure w\mathbf{w}, the posterior is:

p(Hiy,X)=p(yX,Hi)p(Hi)p(yX)p\left(\mathcal{H}_i \mid \mathbf{y}, \mathbf{X}\right) = \frac{p\left(\mathbf{y} \mid \mathbf{X}, \mathcal{H}_i\right) p\left(\mathcal{H}_i\right)}{p(\mathbf{y} \mid \mathbf{X})}

Since the candidate model structures are usually a discrete set, the marginal likelihood is:

p(yX)=ip(yX,Hi)p(Hi)p(\mathbf{y} \mid \mathbf{X}) = \sum_i p\left(\mathbf{y} \mid \mathbf{X}, \mathcal{H}_i\right) p\left(\mathcal{H}_i\right)

This section provides a compelling and unified Bayesian framework for model selection. Unfortunately, for most machine learning models the marginal likelihood (an integral over parameter space) has no analytic solution and no good approximation. GPR is an exception — it has an analytic solution. The next section explains how to apply this framework to do model selection in GPR.

5.2 Model Selection in Gaussian Process Regression

Model selection in GPR mainly refers to tuning the hyperparameters of the kernel function, e.g. the length-scale parameter. Following Bayesian theory, we tune hyperparameters by maximising the log marginal likelihood (the derivation appears at the end of §4.1):

logp(yX,θ)=12y(Kf+σn2I)1y12logKf+σn2In2log2π=12yKy1y12logKyn2log2π\begin{aligned} \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol \theta) &= -\frac{1}{2} \mathbf{y}^{\top}\left(\mathbf{K}_{\mathbf{f}}+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y} - \frac{1}{2} \log \left|\mathbf{K}_{\mathbf{f}} +\sigma_n^2 \mathbf{I}\right| - \frac{n}{2} \log 2 \pi \\ &= -\frac{1}{2} \mathbf{y}^{\top}\mathbf{K}_{\mathbf{y}}^{-1} \mathbf{y} - \frac{1}{2} \log \left|\mathbf{K}_{\mathbf{y}}\right| - \frac{n}{2} \log 2 \pi \end{aligned}

with each part interpreted as:

  • 12yKy1y\frac{1}{2} \mathbf{y}^{\top}\mathbf{K}_{\mathbf{y}}^{-1} \mathbf{y}: data fit term. The only term involving the observed targets.
  • 12logKy\frac{1}{2} \log \left|\mathbf{K}_{\mathbf{y}}\right|: complexity penalty term. Depends only on the covariance function and the input data X\mathbf{X}.
  • n2log2π\frac{n}{2} \log 2 \pi: normalisation constant.

Differentiating the log marginal likelihood:

θjlogp(yX,θ)=12yK1KθjK1y12tr(K1Kθj)=12tr((ααK1)Kθj) where α=K1y\begin{aligned} \frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) & = \frac{1}{2} \mathbf{y}^{\top} \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta_j} \mathbf{K}^{-1} \mathbf{y} - \frac{1}{2} \operatorname{tr}\left(\mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \theta_j}\right) \\ & = \frac{1}{2} \operatorname{tr}\left(\left(\boldsymbol{\alpha} \boldsymbol{\alpha}^{\top}-\mathbf{K}^{-1}\right) \frac{\partial \mathbf{K}}{\partial \theta_j}\right) \text{ where } \boldsymbol{\alpha} = \mathbf{K}^{-1} \mathbf{y} \end{aligned}

The cost above is dominated by inverting KK. For an n×nn \times n positive-definite symmetric matrix the standard inversion is O(n3)O(n^3). For each hyperparameter, the partial derivative then costs O(n2)O(n^2). Overall, this is reasonably efficient.