VGP Notes

James Hensman 2016

Here are some implementation notes on the variational Gaussian approximation, gpflow.models.VGP. The reference for this work is Opper and Archambeau 2009, *The variational Gaussian approximation revisited*; these notes serve to map the conclusions of that paper to the implementation in GPflow. I’ll give derivations for the expressions that are implemented in the VGP object.

Two things are not covered by this notebook: prior mean functions and the extension to multiple independent outputs. These extensions are straightforward in theory but we have taken some care in the code to ensure that they are handled efficiently.

Optimal distribution

The key result in the work of Opper and Archambeau is that for a Gaussian process with a non-Gaussian likelihood, the optimial Gaussian approximation (in the KL sense) is given by

\[\hat q(\mathbf f) = \mathcal N\left(\mathbf m, [\mathbf K^{-1} + \textrm{diag}(\boldsymbol \lambda)]^{-1}\right)\,.\]

We follow their advice in reparameterising the mean as

\[\mathbf m = \mathbf K \boldsymbol \alpha\]

and additionally, to avoid having to constrain the parameter \(\lambda\) to be positive, take the square. The approximation then becomes

\[\hat q(\mathbf f) = \mathcal N\left(\mathbf K \boldsymbol \alpha, [\mathbf K^{-1} + \textrm{diag}(\boldsymbol \lambda)^2]^{-1}\right)\,.\]

The ELBO is

\[\textrm{ELBO} = \sum_n\mathbb E_{q(f_n)}\left[ \log p(y_n\,|\,f_n)\right] - \textrm{KL}\left[q(\mathbf f)||p(\mathbf f)\right]\]

We’ll split the rest of this document into considering two terms: the marginals of \(q(f)\) and the KL term. Given these, it is straight-forward to compute the ELBO: GPflow uses quadrature to compute 1D expectations where no closed form is available.

Marginals of \(q(\mathbf f)\)

Given the above form for \(q(\mathbf f)\), what is a quick and stable way to compute the marginals of this Gaussian? The means are trivial, but the variance could do with some work in order to avoid having to perform two matrix inverses.

Let \(\boldsymbol \Lambda = \textrm{diag}(\boldsymbol \lambda)\) and \(\boldsymbol \Sigma\) be the covariance in question; \(\boldsymbol \Sigma = [\mathbf K^{-1} + \boldsymbol \Lambda^2]^{-1}\) By the matrix inversion lemma we have

\[\boldsymbol \Sigma = [\mathbf K^{-1} + \boldsymbol \Lambda^2]^{-1}\]
\[= \boldsymbol \Lambda^{-2} - \boldsymbol \Lambda^{-2}[\mathbf K + \boldsymbol \Lambda^{-2}]^{-1}\boldsymbol \Lambda^{-2}\]
\[= \boldsymbol \Lambda^{-2} - \boldsymbol \Lambda^{-1}\mathbf A^{-1}\boldsymbol \Lambda^{-1}\]

with \(\mathbf A = \boldsymbol \Lambda\mathbf K \boldsymbol \Lambda + \mathbf I\,.\)

Working with this form means that only one matrix decomposition is needed, and taking the cholesky factor of \(\mathbf A\) should be numerically stable since the eigenvalues are bounded by 1.

KL divergence

The KL divergence term will benefit from a similar re-organisation to the above. The KL is

\[\textrm{KL} = -0.5 \log |\boldsymbol \Sigma| + 0.5 \log |\mathbf K| +0.5\mathbf m^\top\mathbf K^{-1}\mathbf m + 0.5\textrm{tr}(\mathbf K^{-1} \boldsymbol \Sigma) - 0.5 N\]

Where N is the number of data. Recalling our parameterization \(\boldsymbol \alpha\) and combining like terms,

\[\textrm{KL} = 0.5 (-\log |\mathbf K^{-1}\boldsymbol \Sigma | +\boldsymbol \alpha^\top\mathbf K\boldsymbol \alpha + \textrm{tr}(\mathbf K^{-1} \boldsymbol \Sigma) - N)\,.\]

With a little manipulation it’s possible to show that \(\textrm{tr}(\mathbf K^{-1} \boldsymbol \Sigma) = \textrm{tr}(\mathbf A^{-1})\) and \(|\mathbf K^{-1} \boldsymbol \Sigma| = |\mathbf A^{-1}|\), giving the final expression

\[\textrm{KL} = 0.5 (\log |\mathbf A| +\boldsymbol \alpha^\top\mathbf K\boldsymbol \alpha + \textrm{tr}(\mathbf A^{-1}) - N)\,.\]

This expression is not completely ideal because we have to compute the diagonal elements of \(\mathbf A^{-1}\). We do this with an extra backsubstitution (into the identity matrix), although it may be possible to do this faster in theory (not in tensorflow, to my knowledge).

Prediction

To make predictions with the Gaussian approximation, we need to integrate:

\[q(f^\star \,|\,\mathbf y) = \int p(f^\star \,|\, \mathbf f)q(\mathbf f)\,\textrm d \mathbf f\]

The integral is a Gaussian one, and follows straightforwardly. Substituting the equations for these quantities:

\[q(f^\star \,|\,\mathbf y) = \int \mathcal N(f^\star \,|\, \mathbf K_{\star \mathbf f}\mathbf K^{-1}\mathbf f,\, \mathbf K_{\star \star} - \mathbf K_{\star \mathbf f}\mathbf K^{-1}\mathbf K_{\mathbf f \star})\mathcal N (\mathbf f\,|\, \mathbf K \boldsymbol\alpha, \boldsymbol \Sigma)\,\textrm d \mathbf f\]
\[q(f^\star \,|\,\mathbf y) = \mathcal N\left(f^\star \,|\, \mathbf K_{\star \mathbf f}\boldsymbol \alpha,\, \mathbf K_{\star \star} - \mathbf K_{\star \mathbf f}(\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma\mathbf K^{-1})\mathbf K_{\mathbf f \star}\right)\]

Where the notation \(\mathbf K_{\star \mathbf f}\) means the covariance between the prediction points and the data points, and the matrix \(\mathbf K\) is shorthand for \(\mathbf K_{\mathbf{ff}}\).

The matrix \(\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma\mathbf K^{-1}\) can be expanded:

\[ \begin{align}\begin{aligned} \mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma\mathbf K^{-1} = \mathbf K^{-1} - \mathbf K^{-1}[\mathbf K^{-1} + \boldsymbol\Lambda^2]^{-1}\mathbf K^{-1}\,,\\and simplified by recognising the form of the matrix inverse lemma:\end{aligned}\end{align} \]
\[\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma\mathbf K^{-1} = [\mathbf K + \boldsymbol\Lambda^2]^{-1}\,.\]

This leads to the final expression for the prediction

\[q(f^\star \,|\,\mathbf y) = \mathcal N\left(f^\star \,|\, \mathbf K_{\star \mathbf f}\boldsymbol \alpha,\, \mathbf K_{\star \star} - \mathbf K_{\star \mathbf f}[\mathbf K + \boldsymbol \Lambda^2]^{-1}\mathbf K_{\mathbf f \star}\right)\]

The VGP class has a little extra functionality to enable us to compute the marginal variance of the prediction when the full covariance matrix is not required.