Black-box Optimization: From Linear Regression to Gaussian Process Regression
We derive Gaussian Process Regression starting from linear regression, following Rasmussen & Williams (2006).
1. The Linear Regression Problem
Problem statement: Given a dataset D={(x1,y1),(x2,y2),...,(xn,yn)}, where xi=[xi1;xi2;...;xip] is a p-dimensional input (each dimension representing a feature) and yi∈R is the observed output. Linear regression aims to learn a linear model such that, given input xi, the model output is as close as possible to the observed output yi.
This linear model is typically defined as f(xi)=wTxi, where w is a p×1 vector. We assume that the true observation differs from the model output by an error term ϵ, i.e. yi=f(xi)+ϵi. Each error term is further assumed to be i.i.d. and to follow a Gaussian distribution with zero mean and variance σn2: ϵ∼N(0,σn2).
By appending a constant feature of 1 to xi, we can equivalently realise f(xi)=wTxi+b. We assume here that the p-dimensional xi already contains such a constant entry.
For a dataset D with n samples, define X=[x1,x2,...,xn] as the p×n input matrix; y=[y1;y2;...,yn;] as the n×1 vector of true outputs; and ϵ=[ϵ1;ϵ2;...,ϵn;] as the n×1 vector of errors. The model can then be written in matrix form:
f(X)=XTwy=f(X)+ϵ
2. Frequentist Linear Regression
The frequentist approach computes the model parameter w purely from data. The most common method is least squares: find a w∗ that minimises the mean squared error:
w∗=argwmin(y−XTw)T(y−XTw)
Let Ew=(y−XTw)T(y−XTw). Differentiating with respect to w gives ∂w∂Ew=2X(XTw−y).
Setting the derivative to zero yields the optimal w.
Since (y−XTw)T(y−XTw)=∥y−XTw∥2, it is also evident without differentiation that the optimum is reached when y=XTw. At this point,
When XTX is invertible, w∗ has a unique solution:
XXTw∗w∗=Xy=(XXT)−1Xy
The learned linear regression model is then f(xi)=xiT(XXT)−1Xy.
In the problem statement we assumed that the true output and the model prediction differ by a Gaussian-distributed error term. In that case the optimal parameter from least squares is also the maximum likelihood estimate:
When the observations are independent, the likelihood function is:
Differentiating with respect to w: ∂w∂p(y∣X,w)=p(y∣X,w)×∂w∂Ew. Since p(y∣X,w)>0, the solution to ∂w∂p(y∣X,w)=0 is equivalent to the solution to ∂w∂Ew=0 in least squares. So under the assumption of Gaussian noise, the least-squares estimate is equivalent to the maximum likelihood estimate.
Summary:
The frequentist linear regression method only yields a point estimate of the model parameters; predictions use that point estimate.
The model is determined entirely by the data — all the model’s information is encoded in the available training set.
3. Bayesian Linear Regression
In the Bayesian approach to linear regression, we first assign a prior distribution to the parameters, representing our knowledge of them before observing data. We then use the observed data to update that knowledge via Bayes’ theorem, obtaining the posterior distribution. Finally, we use the posterior distribution to make predictions. Compared with the frequentist approach, this differs in two ways:
The parameter information is not determined entirely by the data; the prior also has influence.
Predictions use the parameter distribution.
The procedure is as follows. To make the posterior tractable, we choose the prior to be the conjugate prior of the likelihood. The likelihood is Gaussian, and the Gaussian distribution is its own conjugate (self-conjugate). So we set the prior on w to be a multivariate Gaussian with zero mean and covariance Σp: w∼N(0,Σp).
Setting the prior mean of w∼N(0,Σp) to 0 is purely for simplicity — it does not have to be 0. In practice, this choice does not affect the result.
After observing the data D=(X,y), since the data points are independent and Gaussian, the likelihood function is:
The posterior predictive distribution is again a Gaussian. As the formula shows, the posterior predictive is a weighted average over the predictions made by every parameter (an integral, in the continuous case). It no longer picks a single parameter via some strategy and predicts with it; instead it accounts for the parameter uncertainty (its distribution). And the prediction itself is a distribution, not a point. This is the main difference from the frequentist (point-estimate) approach.
3.1 Kernel Trick: Handling Non-linear Data
Standard Bayesian linear regression can only express linear relationships and cannot capture complex non-linear patterns. We can address this by mapping the input space, via a non-linear function, into a higher-dimensional feature space, then applying the linear model in that feature space. As long as the mapping is independent of the parameter w, the overall model remains linear (in w). How should we choose the mapping function? In practice we usually do not need to perform the mapping explicitly — we use the kernel trick instead. See the kernel-function chapter for the kernel trick.
Specifically, we use a non-linear map ϕ(x) to lift the p-dimensional original input x into a q-dimensional space, and perform linear regression in the new feature space. Now w is a q×1 vector with ϵ∼N(0,σn2), and the regression model is:
This formula is hard to use in practice because computing A−1 (q×q) is expensive when q is large, and infeasible when the feature space is infinite-dimensional. The kernel trick comes in here: it not only avoids computing A−1, but also removes the need to specify the mapping function explicitly. The kernel trick computes feature-space inner products directly from the input space, so we need to rewrite the formula above in inner-product form.
For the mean:
AΣqΦ=σn−2Φ(ΦTΣqΦ+σn2I)
Multiplying by A−1 on the left and by (ΦTΣqΦ+σn2I)−1 on the right gives an equivalent expression for the mean before and after the transformation:
For each test pointx∗, the posterior mean is a linear combination of y; equivalently, a linear combination of k(xi,x⋆): fˉ(x∗)=∑i=1nαik(xi,x∗), where α=[K(X,X)+σn2I]−1y.
For each test pointx∗, the posterior variance is independent of the training output y and depends only on the inputs. The variance can be viewed as the prior variance (K(X∗,X∗)) minus the information acquired from the training set K(X∗,X)[K(X,X)+σn2I]−1K(X,X∗) — observations gradually shrink the variance, which makes sense.
Although a Gaussian process corresponds to an infinite-dimensional feature space, the posterior at each point depends only on the training set, which is what makes it practical.
The covariance matrix is computed via a kernel function. There are many kernels; if we have deep insight into the data we can pick a suitable one. For example, if we know the output is periodic in parameter space, we can use a periodic kernel. In most practical cases such information is unknown and there is no especially effective method for picking a kernel. For Gaussian process regression, the most commonly used kernels are the RBF kernel and the Matérn kernel.
4. Gaussian Process Regression
4.1 Theory
The result derived in the previous section (kernel trick) is in fact the result of Gaussian process regression. This section derives it directly from the Gaussian process.
A stochastic process is a collection of random variables; a Gaussian process is one kind of stochastic process. A stochastic process is called a Gaussian process if every finite subset of its random variables follows a multivariate Gaussian distribution. A Gaussian process is uniquely determined by its mean function (mean function: m(x)) and its covariance function (covariance function: k(x,x′)). We denote the Gaussian process f(x) with mean function m(x) and covariance function k(x,x′) as (here f no longer denotes a function):
f(x)∼GP(m(x),k(x,x′))
The mean and covariance functions are defined as:
m(x)k(x,x′)=E[f(x)]=E[(f(x)−m(x))(f(x′)−m(x′))]
For a training set D(X,y), modelling its output with a Gaussian process, y follows a multivariate Gaussian:
y∼N(0,K(X,X)+σn2I)
For n∗ test points X∗, we want to predict the outputs f∗. By the Gaussian process property that any finite subset is multivariate Gaussian, y and f∗ jointly are still multivariate Gaussian:
This result agrees with the kernel-trick-based Bayesian linear regression derivation under Gaussian noise — the two paths converge.
The log marginal likelihood (or evidence) is used in hyperparameter tuning (see §5). The marginal likelihood is the integral of the product of likelihood and prior:
p(y∣X)=∫p(y∣f,X)p(f∣X)df
The prior in GPR is the Gaussian process f∣X∼N(0,K), so:
In practice, direct matrix inversion is expensive (O(n3)) and prone to numerical error. We instead use Cholesky decomposition and two triangular solves to bypass the matrix inversion. The complexity remains O(n3) but with a better constant — about 1/6 of the direct inversion — and improved numerical stability:
Let v=L−1k(X,x∗), also obtained via a triangular solve:
v=L\K(X,x∗)
This gives the posterior variance:
V[f∗]=k(x∗,x∗)−v⊤v
and the log marginal likelihood:
logp(y∣X)=−21y⊤α−i∑logLii−2nlog2π
where:
21logK+σn2I=∑ilogLii uses the lower-triangular property of the Cholesky factor (21log∣LLT∣=∑ilogLii).
5. Model Selection and Hyperparameter Optimization
The theory of GPR is complete, but turning it into a powerful tool still requires more work. Many uncertainties remain in the procedure: how should we choose the kernel function? how should we determine the kernel’s hyperparameters? These problems must be solved before GPR becomes a strong, practical tool. We collectively call them the model selection problem (different hyperparameters being analogous to different models), or the model training process.
5.1 Bayesian Model Selection Framework
We can do model selection from a Bayesian perspective. Group model selection into three levels:
Parameters w: e.g. w in linear regression, or weights in a neural network.
Hyperparameters θ: e.g. the length-scale parameter of a kernel function, or weight decay in a neural network.
Structure Hi: different model structures, e.g. linear models, neural networks.
For parameter w, by Bayes’ theorem the posterior is:
Since the candidate model structures are usually a discrete set, the marginal likelihood is:
p(y∣X)=i∑p(y∣X,Hi)p(Hi)
This section provides a compelling and unified Bayesian framework for model selection. Unfortunately, for most machine learning models the marginal likelihood (an integral over parameter space) has no analytic solution and no good approximation. GPR is an exception — it has an analytic solution. The next section explains how to apply this framework to do model selection in GPR.
5.2 Model Selection in Gaussian Process Regression
Model selection in GPR mainly refers to tuning the hyperparameters of the kernel function, e.g. the length-scale parameter. Following Bayesian theory, we tune hyperparameters by maximising the log marginal likelihood (the derivation appears at the end of §4.1):
21y⊤Ky−1y: data fit term. The only term involving the observed targets.
21log∣Ky∣: complexity penalty term. Depends only on the covariance function and the input data X.
2nlog2π: normalisation constant.
Differentiating the log marginal likelihood:
∂θj∂logp(y∣X,θ)=21y⊤K−1∂θj∂KK−1y−21tr(K−1∂θj∂K)=21tr((αα⊤−K−1)∂θj∂K) where α=K−1y
The cost above is dominated by inverting K. For an n×n positive-definite symmetric matrix the standard inversion is O(n3). For each hyperparameter, the partial derivative then costs O(n2). Overall, this is reasonably efficient.