SGPR notes

James Hensman, March 2016. Small corrections by Alex Matthews, December 2016.

The GPflow method SGPR performs Sparse Gaussian Process Regression. It is based mainly on Titsias (2009), though other works (Hensman et al. (2013); Matthews et al. (2016)) are useful for clarifying the prediction density.

This notebook contains a derivation of the form of the equations used in GPflow for the marginal likelihood bound and for the prediction equations.

Marginal Likelihood bound

The bound on the marginal likelihood is (Titsias 2009):

\[\log p(\mathbf y) \geq \log \mathcal N(\mathbf y\,|\,\mathbf 0,\, \mathbf Q_{ff} + \sigma^2 \mathbf I) - \tfrac{1}{2} \sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff}) \triangleq \mathcal L\]

with

\[\mathbf Q_{ff} = \mathbf K_{fu}\mathbf K_{uu}^{-1}\mathbf K_{uf}\,.\]

The kernel matrices \(\mathbf K_{ff}\), \(\mathbf K_{uu}\), \(\mathbf K_{fu}\) represent the kernel evaluated at the data points \(\mathbf X\), at the inducing input points \(\mathbf Z\), and between the data and inducing points, respectively. We refer to the values of the GP at the data points \(\mathbf X\) as \(\mathbf f\), at the inducing points \(\mathbf Z\) as \(\mathbf u\), and at the remainder of the function as \(\mathbf f^\star\).

To obtain an efficient and stable evaluation of the bound \(\mathcal L\), we first apply the Woodbury identity to the effective covariance matrix:

\[[\mathbf Q_{ff} + \sigma^2 \mathbf I ]^{-1} = \sigma^{-2} \mathbf I - \sigma^{-4} \mathbf K_{fu}[\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}]^{-1}\mathbf K_{uf}\]
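As a quick numerical sanity check of this identity, the two sides can be compared on a small random problem (a minimal sketch; the squared-exponential kernel, the jitter, and the sizes N and M are arbitrary choices for illustration, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma2 = 6, 3, 0.1

# an arbitrary kernel (squared exponential) evaluated at random inputs
def kern(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

X = rng.normal(size=(N, 1))   # data inputs
Z = rng.normal(size=(M, 1))   # inducing inputs

Kuu = kern(Z, Z) + 1e-12 * np.eye(M)
Kuf = kern(Z, X)
Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)

# left- and right-hand sides of the Woodbury identity above
lhs = np.linalg.inv(Qff + sigma2 * np.eye(N))
rhs = np.eye(N) / sigma2 - Kuf.T @ np.linalg.solve(
    Kuu + Kuf @ Kuf.T / sigma2, Kuf) / sigma2**2
assert np.allclose(lhs, rhs)
```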

Now, to obtain a better conditioned matrix for inversion, we rotate by \(\mathbf L\), where \(\mathbf L\mathbf L^\top = \mathbf K_{uu}\):

\[[\mathbf Q_{ff} + \sigma^2 \mathbf I ]^{-1} = \sigma^{-2} \mathbf I - \sigma^{-4} \mathbf K_{fu}\color{red}{\mathbf L^{-\top} \mathbf L^\top}[\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}]^{-1}\color{red}{\mathbf L \mathbf L^{-1}}\mathbf K_{uf}\]

This matrix is better conditioned because, for many kernels, its eigenvalues are bounded above and below. For more detail, see GPML section 3.4.3.

\[\phantom{[\mathbf Q_{ff} + \sigma^2 \mathbf I ]^{-1}} = \sigma^{-2} \mathbf I - \sigma^{-4} \mathbf K_{fu}\color{red}{\mathbf L^{-\top}} [\color{red}{\mathbf L^{-1}}(\mathbf K_{uu} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2})\color{red}{\mathbf L^{-\top}}]^{-1}\color{red}{ \mathbf L^{-1}}\mathbf K_{uf}\]
\[\phantom{[\mathbf Q_{ff} + \sigma^2 \mathbf I ]^{-1} }= \sigma^{-2} \mathbf I - \sigma^{-4} \mathbf K_{fu}\color{red}{\mathbf L^{-\top}} [\mathbf I + \color{red}{\mathbf L^{-1}}\mathbf K_{uf}\mathbf K_{fu}\color{red}{\mathbf L^{-\top}}\sigma^{-2}]^{-1}\color{red}{ \mathbf L^{-1}}\mathbf K_{uf}\]

For notational convenience, we’ll define \(\mathbf L^{-1}\mathbf K_{uf}\sigma^{-1} \triangleq \mathbf A\), and \([\mathbf I + \mathbf A\mathbf A^\top]\triangleq \mathbf B\).

\[\phantom{[\mathbf Q_{ff} + \sigma^2 \mathbf I ]^{-1} }= \sigma^{-2} \mathbf I - \sigma^{-2} \mathbf A^{\top} [\mathbf I + \mathbf A\mathbf A^\top]^{-1}\mathbf A\]
\[\phantom{[\mathbf Q_{ff} + \sigma^2 \mathbf I ]^{-1} }= \sigma^{-2} \mathbf I - \sigma^{-2} \mathbf A^{\top} \mathbf B^{-1}\mathbf A\]
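In code, \(\mathbf A\) and \(\mathbf B\) are naturally formed with a Cholesky factorisation of \(\mathbf K_{uu}\) and a triangular solve, avoiding any explicit inverse. A minimal sketch, continuing the snippet above (so Kuu, Kuf, lhs and sigma2 are reused):

```python
from scipy.linalg import solve_triangular

L = np.linalg.cholesky(Kuu)                                   # L L^T = Kuu
A = solve_triangular(L, Kuf, lower=True) / np.sqrt(sigma2)    # A = L^{-1} Kuf / sigma
B = np.eye(M) + A @ A.T                                       # B = I + A A^T

# check the Woodbury result written in terms of A and B
assert np.allclose(lhs, np.eye(N) / sigma2 - A.T @ np.linalg.solve(B, A) / sigma2)
```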

We also apply the matrix determinant lemma to the same matrix:

\[|{\mathbf Q_{ff}} + \sigma^2 {\mathbf I}| = |{\mathbf K_{uu}} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}| \, |\mathbf K_{uu}^{-1}| \, |\sigma^{2}\mathbf I|\]

Substituting \(\mathbf K_{uu} = {\mathbf {L L}^\top}\):

\[|{\mathbf Q_{ff}} + \sigma^2 {\mathbf I}| = |{\mathbf {L L}^\top} + \mathbf K_{uf}\mathbf K_{fu}\sigma^{-2}| \, |\mathbf L^{-\top}|\,| \mathbf L^{-1}| \, |\sigma^{2}\mathbf I|\]
\[|{\mathbf Q_{ff}} + \sigma^2 {\mathbf I}| = |\mathbf I + \mathbf L^{-1}\mathbf K_{uf}\mathbf K_{fu} \mathbf L^{-\top}\sigma^{-2}| \, |\sigma^{2}\mathbf I|\]
\[|{\mathbf Q_{ff}} + \sigma^2 {\mathbf I}| = |\mathbf B| \, |\sigma^{2}\mathbf I|\]
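This determinant identity can be checked in the same way (continuing the snippets above):

```python
# log|Qff + sigma^2 I| computed directly and via |B| |sigma^2 I|
logdet_direct = np.linalg.slogdet(Qff + sigma2 * np.eye(N))[1]
logdet_via_B = np.linalg.slogdet(B)[1] + N * np.log(sigma2)
assert np.allclose(logdet_direct, logdet_via_B)
```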

With these two definitions, we’re ready to expand the bound:

\[\mathcal L = \log \mathcal N(\mathbf y\,|\,\mathbf 0,\, \mathbf Q_{ff} + \sigma^2 \mathbf I) - \tfrac{1}{2} \sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})\]
\[= -\tfrac{N}{2}\log{2\pi} -\tfrac{1}{2}\log|\mathbf Q_{ff}+\sigma^2\mathbf I| - \tfrac{1}{2}\mathbf y^\top [ \mathbf Q_{ff} + \sigma^2 \mathbf I]^{-1}\mathbf y - \tfrac{1}{2} \sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})\]
\[= -\tfrac{N}{2}\log{2\pi} -\tfrac{1}{2}\log(|\mathbf B||\sigma^{2}\mathbf I|) - \tfrac{1}{2}\sigma^{-2}\mathbf y^\top (\mathbf I - \sigma^{-2} \mathbf A^{\top} \mathbf B^{-1}\mathbf A)\mathbf y - \tfrac{1}{2} \sigma^{-2}\textrm{tr}(\mathbf K_{ff} - \mathbf Q_{ff})\]
\[ = -\tfrac{N}{2}\log{2\pi} -\tfrac{1}{2}\log|\mathbf B| -\tfrac{N}{2}\log\sigma^{2} -\tfrac{1}{2}\sigma^{-2}\mathbf y^\top\mathbf y +\tfrac{1}{2}\sigma^{-2}\mathbf y^\top\mathbf A^{\top} \mathbf B^{-1}\mathbf A\mathbf y -\tfrac{1}{2}\sigma^{-2}\textrm{tr}(\mathbf K_{ff}) + \tfrac{1}{2}\textrm{tr}(\mathbf {AA}^\top)\]

where we have used \(\sigma^{-2}\textrm{tr}(\mathbf Q_{ff}) = \textrm{tr}(\mathbf {AA}^\top)\).

Finally, we define \(\mathbf c \triangleq \mathbf L_{\mathbf B}^{-1}\mathbf A\mathbf y \sigma^{-1}\), with \(\mathbf {L_BL_B}^\top = \mathbf B\), so that:

\[\sigma^{-2}\mathbf y^\top\mathbf A^{\top} \mathbf B^{-1}\mathbf A\mathbf y = \mathbf c^\top \mathbf c\]

The SGPR code implements this bound, with small changes to handle multiple concurrent outputs (columns of the data matrix Y) and a prior mean function.
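For concreteness, here is a minimal NumPy sketch of the bound exactly as derived above, for a single output column and a zero mean function. It is an illustration of the equations, not the GPflow implementation, and the function and argument names (sgpr_elbo, Kff_diag, and so on) are made up for this note:

```python
import numpy as np
from scipy.linalg import solve_triangular

def sgpr_elbo(y, Kff_diag, Kuf, Kuu, sigma2):
    """Titsias (2009) bound on log p(y): single output, zero mean function."""
    N = y.shape[0]
    L = np.linalg.cholesky(Kuu)
    A = solve_triangular(L, Kuf, lower=True) / np.sqrt(sigma2)     # A = L^{-1} Kuf / sigma
    B = np.eye(A.shape[0]) + A @ A.T                               # B = I + A A^T
    LB = np.linalg.cholesky(B)
    c = solve_triangular(LB, A @ y, lower=True) / np.sqrt(sigma2)  # c = LB^{-1} A y / sigma

    bound = -0.5 * N * np.log(2 * np.pi)
    bound -= np.sum(np.log(np.diag(LB)))        # -1/2 log|B|
    bound -= 0.5 * N * np.log(sigma2)           # -N/2 log sigma^2
    bound -= 0.5 * np.sum(y**2) / sigma2        # -1/2 sigma^{-2} y^T y
    bound += 0.5 * np.sum(c**2)                 # +1/2 c^T c
    bound -= 0.5 * np.sum(Kff_diag) / sigma2    # -1/2 sigma^{-2} tr(Kff)
    bound += 0.5 * np.sum(A**2)                 # +1/2 tr(A A^T)
    return bound
```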

Prediction

At prediction time, we need to compute the mean and variance of the variational approximation at some new points \(\mathbf X^\star\). Following Hensman et al. (2013), we know that all the information in the posterior approximation is contained in the Gaussian distribution \(q(\mathbf u)\), which represents the distribution on function values at the inducing points \(\mathbf Z\). Recall:

\[q(\mathbf u) = \mathcal N(\mathbf u\,|\, \mathbf m, \mathbf \Lambda^{-1})\]

with

\[\mathbf \Lambda = \mathbf K_{uu}^{-1} + \mathbf K_{uu}^{-1}\mathbf K_{uf}\mathbf K_{fu}\mathbf K_{uu}^{-1} \sigma^{-2}\]
\[\mathbf m = \mathbf \Lambda^{-1} \mathbf K_{uu}^{-1}\mathbf K_{uf}\mathbf y\sigma^{-2}\]

To make a prediction, we need to integrate:

\[p(\mathbf f^\star) = \int p(\mathbf f^\star \,|\, \mathbf u)\, q(\mathbf u)\, \textrm d \mathbf u\]

with
\[p(\mathbf f^\star \,|\, \mathbf u) = \mathcal N(\mathbf f^\star\,|\, \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf u, \,\mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf K_{u\star})\]

The integral results in

\[p(\mathbf f^\star) = \mathcal N(\mathbf f^\star\,|\, \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf m, \,\mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf K_{u\star} + \mathbf K_{\star u}\mathbf K_{uu}^{-1}\mathbf \Lambda^{-1}\mathbf K_{uu}^{-1}\mathbf K_{u\star})\]

Note from our above definitions that we have

\[\mathbf K_{uu}^{-1}\mathbf \Lambda^{-1}\mathbf K_{uu}^{-1} = \mathbf L^{-\top}\mathbf B^{-1}\mathbf L^{-1}\]

and further

\[\mathbf K_{uu}^{-1}\mathbf m = \mathbf L^{-\top}\mathbf L_{\mathbf B}^{-\top}\mathbf c\]


Substituting:

\[p(\mathbf f^\star) = \mathcal N(\mathbf f^\star\,|\, \mathbf K_{\star u}\mathbf L^{-\top}\mathbf L_{\mathbf B}^{-\top}\mathbf c, \,\mathbf K_{\star\star} - \mathbf K_{\star u}\mathbf L^{-\top}(\mathbf I - \mathbf B^{-1})\mathbf L^{-1}\mathbf K_{u\star})\]


The code in sgpr.py implements this equation, with an additional switch depending on whether the full covariance matrix is required.
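In the same spirit, here is a minimal sketch of the predictive mean and marginal variance as written above (again an illustration of the maths rather than the GPflow code, with made-up names; Kus and Kss_diag stand for \(\mathbf K_{u\star}\) and the diagonal of \(\mathbf K_{\star\star}\)):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sgpr_predict(y, Kuf, Kuu, Kus, Kss_diag, sigma2):
    """Predictive mean and marginal variance of f* under the approximation."""
    L = np.linalg.cholesky(Kuu)
    A = solve_triangular(L, Kuf, lower=True) / np.sqrt(sigma2)     # A = L^{-1} Kuf / sigma
    B = np.eye(A.shape[0]) + A @ A.T                               # B = I + A A^T
    LB = np.linalg.cholesky(B)
    c = solve_triangular(LB, A @ y, lower=True) / np.sqrt(sigma2)  # c = LB^{-1} A y / sigma

    tmp1 = solve_triangular(L, Kus, lower=True)      # L^{-1} K_{u*}
    tmp2 = solve_triangular(LB, tmp1, lower=True)    # LB^{-1} L^{-1} K_{u*}
    mean = tmp2.T @ c                                # K_{*u} L^{-T} LB^{-T} c
    var = Kss_diag - np.sum(tmp1**2, axis=0) + np.sum(tmp2**2, axis=0)
    return mean, var
```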