# VGP Notes

*James Hensman 2016*

Here are some implementation notes on the variational Gaussian
approximation, `gpflow.models.VGP`. The reference for this work is
Opper and Archambeau 2009, *The variational Gaussian approximation
revisited*;
these notes serve to map the conclusions of that paper to the
implementation in GPflow. I’ll give derivations for the expressions that
are implemented in the VGP object.

Two things are not covered by this notebook: prior mean functions and the extension to multiple independent outputs. These extensions are straightforward in theory but we have taken some care in the code to ensure that they are handled efficiently.

## Optimal distribution

The key result in the work of Opper and Archambeau is that for a Gaussian process with a non-Gaussian likelihood, the optimal Gaussian approximation (in the KL sense) is given by

\[
\hat q(\mathbf f) = \mathcal N\left(\mathbf m, \left[\mathbf K^{-1} + \textrm{diag}(\boldsymbol \lambda)\right]^{-1}\right)\,.
\]

We follow their advice in reparameterising the mean as

\[
\mathbf m = \mathbf K \boldsymbol \alpha\,,
\]

and additionally, to avoid having to constrain the parameter \(\boldsymbol \lambda\) to be positive, we take its square. The approximation then becomes

\[
\hat q(\mathbf f) = \mathcal N\left(\mathbf K \boldsymbol \alpha, \left[\mathbf K^{-1} + \textrm{diag}(\boldsymbol \lambda^2)\right]^{-1}\right)\,.
\]

The ELBO is

\[
\textrm{ELBO} = \sum_n \mathbb E_{q(f_n)}\left[\log p(y_n \,|\, f_n)\right] - \textrm{KL}\left[q(\mathbf f)\,||\,p(\mathbf f)\right]\,.
\]

We’ll split the rest of this document into considering two terms: the marginals of \(q(\mathbf f)\) and the KL term. Given these, it is straightforward to compute the ELBO: GPflow uses quadrature to compute one-dimensional expectations where no closed form is available.
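As an aside on that last point, the quadrature step can be sketched in a few lines of NumPy. This is a minimal illustration, not the GPflow implementation; the Gaussian likelihood and all parameter values here are invented so that the answer can be checked against a closed form.

```python
import numpy as np

def gauss_hermite_expectation(g, mean, var, n_points=20):
    # E_{f ~ N(mean, var)}[g(f)] via Gauss-Hermite quadrature:
    # substituting f = mean + sqrt(2 * var) * x reduces the Gaussian
    # expectation to the standard Hermite weight exp(-x^2).
    x, w = np.polynomial.hermite.hermgauss(n_points)
    f = mean + np.sqrt(2.0 * var) * x
    return np.sum(w * g(f)) / np.sqrt(np.pi)

# Example: the expectation of a Gaussian log-density log N(y | f, s2)
# under q(f) = N(m, v), which has the closed form
#   -0.5 * log(2 * pi * s2) - ((y - m)^2 + v) / (2 * s2).
y, s2, m, v = 0.3, 0.1, 0.1, 0.5
log_lik = lambda f: -0.5 * np.log(2 * np.pi * s2) - 0.5 * (y - f) ** 2 / s2
approx = gauss_hermite_expectation(log_lik, m, v)
exact = -0.5 * np.log(2 * np.pi * s2) - ((y - m) ** 2 + v) / (2 * s2)
```

For a genuinely non-Gaussian likelihood the closed form is unavailable, and the quadrature estimate is used directly in the ELBO sum.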

## Marginals of \(q(\mathbf f)\)

Given the above form for \(q(\mathbf f)\), what is a quick and stable way to compute the marginals of this Gaussian? The means are trivial, but the variance could do with some work in order to avoid having to perform two matrix inverses.

Let \(\boldsymbol \Lambda = \textrm{diag}(\boldsymbol \lambda)\) and let \(\boldsymbol \Sigma\) be the covariance in question: \(\boldsymbol \Sigma = [\mathbf K^{-1} + \boldsymbol \Lambda^2]^{-1}\). By the matrix inversion lemma we have

\[
\boldsymbol \Sigma = \left[\mathbf K^{-1} + \boldsymbol \Lambda^2\right]^{-1} = \mathbf K - \mathbf K \boldsymbol \Lambda \left[\boldsymbol \Lambda \mathbf K \boldsymbol \Lambda + \mathbf I\right]^{-1} \boldsymbol \Lambda \mathbf K = \mathbf K - \mathbf K \boldsymbol \Lambda \mathbf A^{-1} \boldsymbol \Lambda \mathbf K\,,
\]

with \(\mathbf A = \boldsymbol \Lambda\mathbf K \boldsymbol \Lambda + \mathbf I\,.\)

Working with this form means that only one matrix decomposition is needed, and taking the Cholesky factor of \(\mathbf A\) should be numerically stable, since the eigenvalues of \(\mathbf A\) are bounded below by 1.
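As a numerical check of this identity, here is a minimal NumPy sketch; the toy RBF kernel, inputs, and parameter values are invented for illustration, and this is not the GPflow code (which uses triangular solves rather than a generic solve).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
X = rng.normal(size=(N, 1))
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-9 * np.eye(N)  # toy RBF kernel matrix
lam = rng.normal(size=N)   # unconstrained lambda; squared via Lam @ Lam below
Lam = np.diag(lam)

# A = Lam K Lam + I is the only matrix we decompose.
A = Lam @ K @ Lam + np.eye(N)
L = np.linalg.cholesky(A)

# Sigma = K - K Lam A^{-1} Lam K, using the Cholesky factor of A.
tmp = np.linalg.solve(L, Lam @ K)  # L^{-1} Lam K
Sigma = K - tmp.T @ tmp
marginal_vars = np.diag(Sigma)

# Check against the naive form, which needs two explicit inverses.
Sigma_naive = np.linalg.inv(np.linalg.inv(K) + Lam @ Lam)
```

Only the diagonal of \(\boldsymbol \Sigma\) is needed for the expectation terms in the ELBO, which is why the marginals are singled out here.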

## KL divergence

The KL divergence term will benefit from a similar re-organisation to the above. The KL is

\[
\textrm{KL}\left[q(\mathbf f)\,||\,p(\mathbf f)\right] = \tfrac{1}{2}\left(\textrm{tr}(\mathbf K^{-1}\boldsymbol \Sigma) + \mathbf m^\top \mathbf K^{-1}\mathbf m - N + \log|\mathbf K| - \log|\boldsymbol \Sigma|\right)\,,
\]

where \(N\) is the number of data points. Recalling our parameterization \(\mathbf m = \mathbf K \boldsymbol \alpha\) and combining like terms,

\[
\textrm{KL} = \tfrac{1}{2}\left(\textrm{tr}(\mathbf K^{-1}\boldsymbol \Sigma) + \boldsymbol \alpha^\top \mathbf K \boldsymbol \alpha - N - \log|\mathbf K^{-1}\boldsymbol \Sigma|\right)\,.
\]

With a little manipulation it’s possible to show that \(\textrm{tr}(\mathbf K^{-1} \boldsymbol \Sigma) = \textrm{tr}(\mathbf A^{-1})\) and \(|\mathbf K^{-1} \boldsymbol \Sigma| = |\mathbf A^{-1}|\), giving the final expression

\[
\textrm{KL} = \tfrac{1}{2}\left(\textrm{tr}(\mathbf A^{-1}) + \boldsymbol \alpha^\top \mathbf K \boldsymbol \alpha - N + \log|\mathbf A|\right)\,.
\]

This expression is not completely ideal, because we have to compute the diagonal elements of \(\mathbf A^{-1}\). We do this with an extra back-substitution (into the identity matrix), although it may be possible to do this faster in theory (not in TensorFlow, to my knowledge).
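The rearranged KL can be checked against the standard Gaussian KL formula. Again this is a hedged NumPy sketch with invented kernel and parameter values, not the GPflow implementation; it follows the back-substitution-into-identity approach described above.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
X = rng.normal(size=(N, 1))
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-9 * np.eye(N)  # toy RBF kernel matrix
alpha = rng.normal(size=(N, 1))
lam = rng.normal(size=N)
Lam = np.diag(lam)

A = Lam @ K @ Lam + np.eye(N)
L = np.linalg.cholesky(A)
# A^{-1} via back-substitution into the identity matrix.
A_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(N)))

# Rearranged KL: only A is decomposed; log|A| comes from the Cholesky factor.
kl = 0.5 * (np.trace(A_inv)
            + (alpha.T @ K @ alpha).item()
            - N
            + 2.0 * np.sum(np.log(np.diag(L))))

# Reference: standard Gaussian KL with m = K alpha, Sigma = [K^-1 + Lam^2]^-1.
m = K @ alpha
K_inv = np.linalg.inv(K)
Sigma = np.linalg.inv(K_inv + Lam @ Lam)
kl_ref = 0.5 * (np.trace(K_inv @ Sigma) + (m.T @ K_inv @ m).item() - N
                - np.linalg.slogdet(Sigma)[1] + np.linalg.slogdet(K)[1])
```

The rearranged form touches \(\mathbf K\) only through \(\mathbf A\), so no decomposition of \(\mathbf K\) itself is required.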

## Prediction

To make predictions with the Gaussian approximation, we need to integrate:

\[
q(f^\star) = \int p(f^\star \,|\, \mathbf f)\, q(\mathbf f)\, \textrm{d}\mathbf f\,.
\]

The integral is a Gaussian one, and follows straightforwardly. Substituting the expressions for these quantities:

\[
q(f^\star) = \mathcal N\left(\mathbf K_{\star \mathbf f}\boldsymbol \alpha,\; \mathbf K_{\star\star} - \mathbf K_{\star \mathbf f}\left[\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma \mathbf K^{-1}\right]\mathbf K_{\mathbf f \star}\right)\,,
\]

where the notation \(\mathbf K_{\star \mathbf f}\) means the covariance between the prediction points and the data points, and the matrix \(\mathbf K\) is shorthand for \(\mathbf K_{\mathbf{ff}}\).

The matrix \(\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma\mathbf K^{-1}\) can be expanded:

\[
\mathbf K^{-1} - \mathbf K^{-1}\boldsymbol \Sigma\mathbf K^{-1} = \mathbf K^{-1} - \mathbf K^{-1}\left[\mathbf K - \mathbf K \boldsymbol \Lambda \mathbf A^{-1} \boldsymbol \Lambda \mathbf K\right]\mathbf K^{-1} = \boldsymbol \Lambda \mathbf A^{-1} \boldsymbol \Lambda\,.
\]

This leads to the final expression for the prediction:

\[
q(f^\star) = \mathcal N\left(\mathbf K_{\star \mathbf f}\boldsymbol \alpha,\; \mathbf K_{\star\star} - \mathbf K_{\star \mathbf f}\boldsymbol \Lambda \mathbf A^{-1}\boldsymbol \Lambda \mathbf K_{\mathbf f \star}\right)\,.
\]

The VGP class has a little extra functionality to enable us to compute the marginal variance of the prediction when the full covariance matrix is not required.
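The marginal-variance shortcut can be sketched as follows. This is an illustrative NumPy check, not the VGP code itself; the toy kernel and parameter values are made up, and the reference computation deliberately uses the unexpanded, two-inverse form.

```python
import numpy as np

def rbf(a, b):
    # toy RBF kernel between 1-D input arrays (stand-in for a GPflow kernel)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=6)    # training inputs
Xs = rng.normal(size=3)   # prediction inputs
K = rbf(X, X) + 1e-9 * np.eye(6)
Ksf = rbf(Xs, X)          # K_{star f}
Kss = rbf(Xs, Xs)         # K_{star star}

alpha = rng.normal(size=(6, 1))
lam = rng.normal(size=6)
Lam = np.diag(lam)
A = Lam @ K @ Lam + np.eye(6)
L = np.linalg.cholesky(A)

mean = Ksf @ alpha                              # K_{star f} alpha
tmp = np.linalg.solve(L, Lam @ Ksf.T)           # L^{-1} Lam K_{f star}
var = np.diag(Kss) - np.sum(tmp ** 2, axis=0)   # marginal variances only

# Reference: the unexpanded covariance with two explicit inverses.
K_inv = np.linalg.inv(K)
Sigma = np.linalg.inv(K_inv + Lam @ Lam)
cov_ref = Kss - Ksf @ (K_inv - K_inv @ Sigma @ K_inv) @ Ksf.T
```

Computing only the diagonal this way avoids forming the full predictive covariance when it is not required.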