Black-box Optimization: Bayesian Optimization and Its Multi-task Extension
The previous post derived Gaussian Process Regression from linear regression. This post uses GPR as a surrogate model to assemble the full Bayesian optimization loop, and surveys six acquisition functions (EI / EIC / NEIC / UCB / ES / PI). We then extend to the multi-task setting, covering ICM, SLFM, LMC, and PC formulations of multi-output GPs, plus the acquisition functions used in multi-task Bayesian optimization.
1. The Black-box Optimization Problem
Suppose there exists an unknown objective function f:X→R with the following properties:
Objective function characteristics
Black-box: the analytic form, gradients, and higher-order derivatives of f are unavailable; we can only obtain observations by querying with input x∈X.
Noisy: the observation is y=f(x)+ϵ, where ϵ∼N(0,τ2) is additive noise (the noise is linearly superimposed on the true signal).
Costly to evaluate:
High computational cost: each evaluation of f consumes substantial compute or time.
Limited evaluation budget: in practice the number of evaluations is strictly bounded.
May involve multi-objective optimization — minimising several objectives simultaneously.
Domain and constraints
The input space is X⊆Rp, where p is the number of parameters; mixed parameter types are possible (continuous, integer, categorical).
The constraint functions cj share the same characteristics as the objective f — black-box, noisy, costly to evaluate — and can be evaluated alongside the objective.
Optimization objective: within a finite evaluation budget, find an approximate optimum x∗∈X such that:
x∗=argx∈Xminf(x) subject to wTx⩽b and cj(x)≤0,j=1,…,J.
2. Bayesian Optimization Loop
Bayesian optimization tackles the problem of an unknown and expensive-to-evaluate f(x) by introducing a surrogate model. The surrogate f^ must satisfy two properties: (1) evaluating the surrogate is cheap; (2) f^ approximates the true function f well.
To reach the optimum within as few evaluations as possible, we additionally need a function that evaluates the value of querying an unevaluated point, so that we pick the highest-value point to evaluate next. This function is called the acquisition function, and it uses the surrogate f^ to compute the value of each candidate point.
The basic Bayesian optimization loop:
Step 1: Randomly select some initial points from the parameter space and evaluate them, producing the dataset D=(X,f(x)).
Step 2: Fit the surrogate to the parameter space using D, obtaining a posterior distribution at every point.
Step 3: Use the acquisition function to score every point’s value based on its posterior; pick the point with the highest value (subject to constraints, if any) to evaluate next.
Step 4: Add the newly evaluated point to D. To continue, return to Step 2; to stop, go to Step 5.
Step 5: Pick the best already-evaluated point. Specifically:
If observations are noiseless (no variance), simply take the x with the best f(x) from the evaluated dataset.
If observations are noisy, compute the predictive posterior at the evaluated points and pick the optimal x based on some statistic (posterior mean, median, etc.). Under constraints, the chosen point must also satisfy the constraints.
3. Common Surrogate Models
3.1 Requirements on the Surrogate Model
The surrogate must produce a posterior distribution p(f(x)∣D) over candidate points x, with both a mean (prediction) and a variance (uncertainty). The predicted variance must accurately reflect the true uncertainty (especially in low-data regions).
It must not require a pre-specified functional form (e.g. linear, polynomial); it should adaptively fit complex non-linear relationships.
Bayesian optimization frequently updates the surrogate, so its computational complexity must be manageable.
Real observations often contain noise y=f(x)+ϵ; the surrogate must be able to separate signal from noise.
3.2 Gaussian Processes
The Gaussian process is the surrogate model of choice — Gaussian process regression is an efficient way to approximate a black-box function. First, GPs are non-parametric: their model complexity adapts to the data, avoiding the structural limitations of parametric models (polynomial regression, neural networks):
Kernel flexibility: by choosing the kernel function (e.g. RBF kernel, Matern kernel), GPs implicitly map into infinite-dimensional feature spaces, capturing non-linear relationships of arbitrary smoothness. For example:
The RBF kernel models infinitely differentiable functions.
The Matern kernel allows control over smoothness (e.g. permits some “jaggedness”).
Data-driven fitting: GPs do not pre-specify a functional form (e.g. linear, quadratic); they learn the function shape from data via the kernel and hyperparameter optimization (e.g. maximising the marginal likelihood). This flexibility lets them fit highly non-convex, multi-modal black-box functions.
GPR also outputs a posterior distribution over the search space (mean and variance) rather than a single prediction, which pairs perfectly with acquisition functions. Emphasising the mean is conservative (Exploitation — preserve current gains); emphasising the variance is aggressive (Exploration).
The mechanics are covered in earlier sections. For kernel functions, see the kernel function chapter; for Gaussian processes and GPR, see the GPR derivation; for hyperparameter tuning of GPR, see hyperparameter optimization.
4. Common Acquisition Functions
4.1 Requirements on the Acquisition Function
The acquisition function must balance Exploitation and Exploration.
Exploitation: sample more in regions where the current model predicts well (i.e. high-mean regions), to refine the local optimum.
Use case 1: the objective has a clear optimum region (single-modal or with a dominant peak).
Use case 2: evaluation cost is extremely high; we need fast convergence to an acceptable solution.
Risk: over-exploitation can trap us in a local optimum, ignoring globally promising regions.
Exploration: sample in regions of high uncertainty (i.e. high-variance regions), to discover new potentially optimal points.
Use case 1: the objective may have multiple local optima (multi-modal).
Use case 2: the data is sparse or initial sampling is insufficient; a global search is needed.
Risk: over-exploration wastes resources and slows convergence.
4.2 Expected Improvement (EI)
Suppose the goal is to maximise f(x), in a noiseless setting, and the so-far-iterated data is D=(X,f(X)). GPR gives the posterior at the candidate point x∗ as f(x∗)∼N(μx∗,σx∗). Let f∗=i∈{1,...,n}maxf(xi) be the best-so-far observation. The EI of candidate x∗ is defined as:
The EI of an already-observed point is 0, so we never re-sample the same point.
Requires a definite “best observation so far”, so it only applies to the noiseless setting.
4.3 EI with Constraints (EIC)
If the optimization has J non-linear constraints, the problem becomes:
xmaxf(x) subject to cj(x)≤0,j=1,…,J,
The constraints share the same characteristics as the objective (black-box, costly, etc.); during optimization we use GPR for each constraint to obtain its posterior mean and variance. The best-so-far observation (a feasible point that satisfies the constraints) is:
fc∗=i∈{1,…,n}cj(xi)≤0,j=1,…,Jmaxf(xi)
Assuming {f,c1,...,cJ} are mutually independent:
αEIC(x∗∣fc∗)=αEI(x∗∣fc∗)j=1∏JP(cj(x∗)≤0)
In the noiseless setting, a constraint is either satisfied or not, so P(cj(x∗)≤0) is either 0 or 1.
4.4 Noisy Expected Improvement With Constraints (NEIC)
In real-world settings there are usually both constraints and noise. The objective’s observed data is Df=(X,y) with yi=f(xi)+ϵi. To simplify GPR, we assume ϵi∈N(0,τi2). The difficulty is that we no longer know which point in Df is the best observation. One approach is to substitute the posterior mean of each observed point — but this converges too slowly (Vazquez & Bect, 2008), and the suggested points cluster too tightly to explore the space effectively (Letham et al. 2019, arXiv:1706.07094).
Instead, we can take a weighted average over all possible outcomes based on the GPR posterior. The posterior at the observed points is:
Note that the posterior at x∗ itself is deterministic; only when computing EI do we need to consider the posterior at the observed points.
No analytic form, but value and gradient can both be computed efficiently, so it is practical.
For batch sampling (producing multiple samples in one step): produce one sample, add its posterior to fn and cn, then produce the next sample, and so on.
Like EI, the NEI of an already-observed point is 0, so we never re-sample.
4.5 Upper Confidence Bound (UCB)
For maximisation, UCB is defined as the upper end of the posterior interval:
αUCB(x∗)=μx∗+βσx∗
For minimisation, UCB is the lower end of the posterior interval:
αUCB(x∗)=μx∗−βσx∗
where:
μx∗: posterior mean at the candidate x∗.
σx∗: posterior standard deviation at the candidate.
β≥0: balance parameter. A larger β increases exploration; a smaller β emphasises exploitation.
UCB is computationally efficient — only the posterior mean and variance are needed.
β must be chosen carefully. A fixed β can imbalance the double-E; an adaptive strategy is generally preferred.
Theoretical Guarantee for UCB
Under certain conditions, the UCB algorithm guarantees that the cumulative regret R(T) grows sub-linearly, in which case the algorithm is guaranteed to converge to the optimum:
R(T)=t=1∑Tf(x∗)−f(xt)≤O~(T)
where:
x∗ is the global optimum.
xt is the point chosen at step t.
O~ denotes an asymptotic upper bound that ignores logarithmic factors.
4.6 Entropy Search (ES)
ES aims to minimise the uncertainty about the objective’s argmax xmax (measured by information entropy). Given the observation set D, define:
Current entropy:
H(xmax∣D)=−∫p(xmax∣D)logp(xmax∣D)dxmax
Expected entropy (the conditional entropy after sampling at x∗):
Ef(x∗)[H(xmax∣D∪{(x∗,f(x∗))})]
The ES acquisition function is the expected reduction in entropy:
ES is computationally expensive because p(xmax∣D) has no analytic form and must be approximated (Monte Carlo sampling, etc.), so it is suitable for low-dimensional problems (within roughly 10 dimensions).
4.7 Probability of Improvement (PI)
PI measures the probability that a candidate x∗ improves over the best-so-far observation f∗. Based on the posterior f(x∗)∼N(μx∗,σx∗):
αPI(x∗∣f∗)=P(f(x∗)≥f∗)=Φ(σx∗μx∗−f∗)
where Φ is the standard normal cumulative distribution function (CDF). An exploration parameter ξ≥0 can be added to avoid local optima:
αPI(x∗∣f∗,ξ)=Φ(σx∗μx∗−(f∗+ξ))
Has an analytic form, depending only on the posterior mean and variance, so it is computationally efficient.
Without ξ, PI tends to over-exploit (prefers high-mean, low-variance regions).
PI ignores the magnitude of the improvement, whereas EI balances probability and magnitude — so EI is more commonly used.
5. Multi-task Gaussian Process Regression
Machine learning has a paradigm called multi-task learning, whose core idea is to share information across different tasks — relative to learning each task in isolation, this can boost performance. In an optimization context, we can use multi-task optimization to jointly optimise multiple correlated tasks for a higher payoff; or, for an expensive online optimization, we can construct biased-but-related offline optimization tasks and use multi-task learning to enhance the online optimization. In the Gaussian process regression setting this is also called Multi-Output Learning or Vector-Valued Learning.
The earlier single-task GPR section covers single-task Gaussian process regression. For the multi-task setting, we define a vector-valued function f=[f1,…,fO]T that follows a Gaussian process:
f∼GP(m,K)
where:
m is the mean function vector — an O-dimensional vector of functions: [m1(x),...,mO(x)]T.
K is the covariance function matrix — an O×O positive matrix-valued function. Each entry is a covariance function as in single-task GPR. The element at row o column o′ of K(x,x′), written (K(x,x′))o,o′, represents the correlation between tasks fo(x) and fo′(x′).
Kx∗∈RO×NO has entries (K(x∗,xj))o,o′ for j=1,…,N and o,o′=1,…,O.
θ: the set of all kernel hyperparameters.
In practice, proposing a model usually starts from an assumption — for the multi-task setting, we first assume something about the data distribution and the inter-task relationships, then derive the corresponding model. The next sections cover models derived from different assumptions, including:
Separable Kernels and Sum of Separable Kernels (SoS kernels): A special class of multi-output kernel functions that can be formulated as a sum of products between a kernel function for the input space alone, and a kernel function that encodes the interactions among the outputs. Includes ICM (Intrinsic Coregionalization Model) and SLFM (Semiparametric Latent Factor Model) (Linear Model of Coregionalization). LMC is a generalisation of ICM and SLFM.
Beyond Separable Kernels: more sophisticated models, such as PC (Process Convolutions). Process convolutions further generalise LMC.
Glossary
cokriging: probabilistic models and Gaussian processes for multivariate learning were originally pioneered and developed in geostatistics, where prediction over vector-valued output data is called cokriging (multivariate co-kriging).
autokrigeability: a term describing the special condition under which multivariate cokriging can be reduced to univariate kriging. When the data satisfies certain conditions, the covariance matrix of the multivariate cokriging model becomes diagonal (block-diagonal, indicating independence among variables), and the problem automatically decouples into multiple independent univariate kriging problems.
Isotopic / Heterotopic Data: Isotopic means the input datasets across different tasks are exactly the same; heterotopic means the input datasets across tasks differ.
5.2 Intrinsic Coregionalization Model (ICM)
ICM assumes that multiple outputs are obtained as different linear transformations of multiple samples drawn from the same Gaussian process regression.
Two-output Setting
We derive ICM from the two-output setting. Consider two outputs f1(x) and f2(x) that share the same input space x∈Rp. The corresponding vector-valued function is:
Sample two functions μ1(x) and μ2(x) from the Gaussian process μ(x)∼GP(0,k(x,x′)). The two are mutually independent and share the same covariance function k(x,x′).
Assume f1(x) and f2(x) are obtained from the following linear transformations of μ1(x) and μ2(x):
cov(f(x),f(x′))=a1(a1)⊤cov(u1(x),u1(x′))+a2(a2)⊤cov(u2(x),u2(x′))=a1(a1)⊤k(x,x′)+a2(a2)⊤k(x,x′)=[a1(a1)⊤+a2(a2)⊤]k(x,x′)define B=a1(a1)⊤+a2(a2)⊤ which has rank two.=Bk(x,x′)=[b11b21b12b22]k(x,x′)
Generalisation to R samples and O outputs
In ICM we assume the O outputs {fo(x)}o=1O are all obtained from R independent samples of the same Gaussian process. These samples are mutually independent and share a single covariance function:
cov[f(x),f(x′)]=AA⊤k(x,x′)define B=AA⊤,B∈RO×O, B has rank R=Bk(x,x′)
ICM properties:
If the outputs are noiseless and the data is isotopic, then ICM yields the same result as running single-task GPR on each output independently. This automatic reduction from cokriging to kriging is called autokrigeability.
5.3 Semiparametric Latent Factor Model (SLFM)
Like ICM, SLFM assumes that multiple outputs are different linear transformations of multiple sampled functions — but here the samples come from different, mutually independent GPRs (one function sampled from each GPR). With Q GPRs μq(x)∼GP(0,kq(x,x′)),q∈{1,2,...,Q}, we get Q sampled functions {μq(x)}. Q=1 corresponds to the R=1 case of ICM.
As in the generalised ICM, for O outputs:
fo(x)=q=1∑Qao,qμq(x)
Define:
aq=[a1,qa2,q...aO,q]T
The vector-valued function is:
f(x)=[f1(x)⋯fO(x)]⊤=q=1∑Qaqμq(x)
The covariance function is:
cov[f(x),f(x′)]=q=1∑Qaqaq⊤kq(x,x′)define Bq=aqaq⊤, The rank of each Bq∈RO×O is one.=q=1∑QBqkq(x,x′)
5.4 Linear Model of Coregionalization (LMC)
LMC is a further generalisation of ICM and SLFM. The base assumption is the same — multiple outputs are different linear combinations of sampled functions — but ICM assumes the samples come from a single Gaussian process; SLFM assumes they come from completely different Gaussian processes; LMC assumes some samples come from the same Gaussian process and others from different ones.
For O outputs:
fo(x)=q=1∑Qr=1∑Rqao,qruqr(x)
There are Q groups of samples, each containing Rq sampled functions. The Rq functions in a group are mutually independent and share the same Gaussian process with covariance kq(x,x′); the Q groups are also mutually independent. This is equivalent to summing Q ICMs.
where Bq is usually called the coregionalization matrices, with rank Rq.
5.5 Process Convolutions (PC)
In LMC, the value of the vector-valued function f(x) at point x depends only on the values of the sampled functions at point x, not on their values elsewhere. This simplification gives LMC the form of a separable kernel. PC convolves a smoothing kernel with the sampled functions, making each output depend on the values of the sampled functions at all points — a further generalisation of LMC. Because the sampled functions come from GPR, the convolution result is still a Gaussian process, so the posterior remains analytic.
About the smoothing kernel:
Here the smoothing kernel is a class of functions, but it does not refer to a kernel function. The convolution process uses the smoothing kernel to take a weighted average over a neighbourhood, producing smoother results — this is sometimes also called a moving average function.
As in LMC, assume there are Q groups of samples, each with Rq sampled functions. The Rq sampled functions μqr(x),i∈{1,...,Rq} are mutually independent and share the same Gaussian process with covariance kq(x,x′); the Q groups are also mutually independent. For O outputs {fo(x)}o=1O:
{wo(x)}o=1O are mutually independent Gaussian processes with mean 0 and covariance kwo(x,x′).
For the integral above to converge, Go,qr(x) must be a continuous function with compact support (locally non-zero, so the integral does not blow up) or be square-integrable.
When the smoothing kernel is the Dirac δ function Go,qi(x−z)=ao,qiδ(x−z), PC degenerates to LMC. So:
Degenerates to ICM: Go,qi(x−z)=ao,qiδ(x−z),Q=1,Rq>1
Degenerates to SLFM: Go,qi(x−z)=ao,qiδ(x−z),Q>1,Rq=1
Degenerates to LMC: Go,qi(x−z)=ao,qiδ(x−z),Q>1,Rq>1
Degenerates to Dependent GPs: k1(z,z′)=σ2δ(z,z′),Q=1,Rq=1
6. Multi-task Bayesian Optimization
Once we have fitted multiple tasks with multi-task GPR, multi-task Bayesian optimization mainly addresses how to choose the next probe point — i.e. it defines an acquisition function for the multi-task setting. The paper Swersky, Snoek & Adams 2013 — Multi-Task Bayesian Optimization extends Entropy Search to the multi-task setting. The paper Letham et al. (2019) extends Expected Improvement to the multi-task setting; its implementation is open-sourced in facebook/ax.