The Gaussian Processes Web Site

This web site aims to provide an overview of resources concerned with probabilistic modeling, inference and learning based on Gaussian processes. Although Gaussian processes have a long history in the field of statistics, they seem to have been employed extensively only in niche areas. With the advent of kernel machines in the machine learning community, models based on Gaussian processes have become commonplace for problems of regression (kriging) and classification as well as a host of more specialized applications.

[ Books | Events | Other Web Sites | Software | Research Papers ]

Books

Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Chris Williams, the MIT Press, 2006, online version.

Statistical Interpolation of Spatial Data: Some Theory for Kriging, Michael L. Stein, Springer, 1999.

Statistics for Spatial Data (revised edition), Noel A. C. Cressie, Wiley, 1993

Spline Models for Observational Data, Grace Wahba, SIAM, 1990

Future and Past Events

The Bayesian Research Kitchen at The Wordsworth Hotel, Grasmere, Ambleside, Lake District, United Kingdom 05 - 07 September 2008.

A tutorial entitled Advances in Gaussian Processes on Dec. 4th at NIPS 2006 in VanCouver, slides, lecture.

The Gaussian Processes in Practice workshop at Bletchley Park, U.K., June 12-13 2006.

The Open Problems in Gaussian Processes for Machine Learning workshop at nips*05 in Whistler, December 10th, 2005.

The Gaussian Process Round Table meeting in Sheffield, June 9-10, 2005.

Other Web Sites of Related Interest

The kernel-machines web site.

Wikipedia entry on Gaussian processes.

The ai-geostats web site for spatial statistics and geostatistics.

The Bibliography of Gaussian Process Models in Dynamic Systems Modelling web site maintained by Juš Kocijan.

Software

Andreas Geiger has written a simple Gaussian process regression Java applet, illustrating the behaviour of covariance functions and hyperparameters.

package	title	author	implementation	description
bcm	The Bayesian Committee Machine	Anton Schwaighofer	matlab and NETLAB	An extension of the Netlab implementation for GP regression. It allows large scale regression based on the BCM approximation, see also the accompanying paper
fbm	Software for Flexible Bayesian Modeling	Radford M. Neal	C for linux/unix	An extensive and well documented package implementing Markov chain Monte Carlo methods for Bayesian inference in neural networks, Gaussian processes (regression, binary and multi-class classification), mixture models and Dirichlet Diffusion trees.
gp-lvm and fgp-lvm	A (fast) implementation of Gaussian Process Latent Variable Models	Neil D. Lawrence	matlab and C
gpml	Code from the Rasmussen and Williams: Gaussian Processes for Machine Learning book.	Carl Edward Rasmussen and Hannes Nickisch	matlab and octave	The GPML toolbox implements approximate inference algorithms for Gaussian processes such as Expectation Propagation, the Laplace Approximation and Variational Bayes for a wide class of likelihood functions for both regression and classification. It comes with a big algebra of covariance and mean functions allowing for flexible modeling. The code is fully compatible to Octave 3.2.x. JMLR paper describing the toolbox.
c++-ivm	Sparse approximations based on the Informative Vector Machine	Neil D. Lawrence	C++	IVM Software in C++ , also includes the null category noise model for semi-supervised learning.
BFD	Bayesian Fisher's Discriminant software	Tonatiuh Peña Centeno	matlab	Implements a Gaussian process interpretation of Kernel Fisher's discriminant.
gpor	Gaussian Processes for Ordinal Regression	Wei Chu	C for linux/unix	Software implementation of Gaussian Processes for Ordinal Regression. Provides Laplace Approximation, Expectation Propagation and Variational Lower Bound.
MCMCstuff	MCMC Methods for MLP and GP and Stuff	Aki Vehtari	matlab and C	A collection of matlab functions for Bayesian inference with Markov chain Monte Carlo (MCMC) methods. The purpose of this toolbox was to port some of the features in fbm to matlab for easier development for matlab users.
ogp	Sparse Online Gaussian Processes	Lehel Csató	matlab and NETLAB	Approximate online learning in sparse Gaussian process models for regression (including several non-Gaussian likelihood functions) and classification.
sogp	Sparse Online Gaussian Process C++ Library	Dan Grollman	C++	Sparse online Gaussian process C++ library based on the PhD thesis of Lehel Csató
spgp .tgz or .zip	Sparse Pseudo-input Gaussian Processes	Ed Snelson	matlab	Implements sparse GP regression as described in Sparse Gaussian Processes using Pseudo-inputs and Flexible and efficient Gaussian process models for machine learning. The SPGP uses gradient-based marginal likelihood optimization to find suitable basis points and kernel hyperparameters in a single joint optimization.
tgp	Treed Gaussian Processes	Robert B. Gramacy	C/C++ for R	Bayesian Nonparametric and nonstationary regression by treed Gaussian processes with jumps to the limiting linear model (LLM). Special cases also implememted include Bayesian linear models, linear CART, stationary separable and isotropic Gaussian process regression. Includes 1-d and 2-d plotting functions (with higher dimension projection and slice capabilities), and tree drawing, designed for visualization of tgp class output. See also Gramacy 2007
Tpros	Gaussian Process Regression	David MacKay and Mark Gibbs	C	Tpros is the Gaussian Process program written by Mark Gibbs and David MacKay.
GP Demo	Octave demonstration of Gaussian process interpolation	David MacKay	octave	This DEMO works fine with octave-2.0 and did not work with 2.1.33.
GPClass	Matlab code for Gaussian Process Classification	David Barber and C. K. I. Williams	matlab	Implements Laplace's approximation as described in Bayesian Classification with Gaussian Processes for binary and multiclass classification.
VBGP	Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors	Mark Girolami and Simon Rogers	matlab	Implements a variational approximation for Gaussian Process based multiclass classification as described in the paper Variational Bayesian Multinomial Probit Regression.
pyGPs	Gaussian Processes for Regression and Classification	Marion Neumann	Python	pyGPs is a library containing an object-oriented python implementation for Gaussian Process (GP) regression and classification. github
gaussian-process	Gaussian process regression	Anand Patil	Python	under development
gptk	Gaussian Process Tool-Kit	Alfredo Kalaitzis	R	The gptk package implements a general-purpose toolkit for Gaussian process regression with an RBF covariance function. Based on a MATLAB implementation written by Neil D. Lawrence.

Other software that way be useful for implementing Gaussian process models:

The NETLAB package by Ian Nabney includes code for Gaussian process regression and many other useful thing, e.g. optimisers.
See Tom Minka's page on accelerating matlab and his lightspeed toolbox.
Matthias Seeger shares his code for Kernel Multiple Logistic Regression, Incomplete Cholesky Factorization and Low-rank Updates of Cholesky Factorizations.
See the software section of www.kernel-machines.org.

Annotated Bibliography

Below is a collection of papers relevant to learning in Gaussian process models. The papers are ordered according to topic, with occational papers occuring under multiple headings.

Tutorials

Several papers provide tutorial material suitable for a first introduction to learning in Gaussian process models. These range from very short [Williams 2002] over intermediate [MacKay 1998], [Williams 1999] to the more elaborate [Rasmussen and Williams 2006]. All of these require only a minimum of prerequisites in the form of elementary probability theory and linear algebra.

D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, UK, 2003. chapter 45.

Comment: A short introduction to GPs, emphasizing the relationships to paramteric models (RBF networks, neural networks, splines).

D. J. C. MacKay. Gaussian processes - a replacement for supervised neural networks?. Tutorial lecture notes for NIPS 1997, 1997.

D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168 of NATO ASI Series, pages 133-165. Springer, Berlin, 1998.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannary. Numerical Recipes. Cambridge University Press, third edition, 2007. Section 15.9.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.

Comment: The initial chapters contain significant amounts of tutorial material. The whole book, including all chapters are freely available online.

M. Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(2):69-106, 2004.

Abstract: Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated.

C. K. I. Williams. Gaussian processes. In M. A. Arbib, editor, Handbook of Brain Theory and Neural Networks, pages 466-470. The MIT Press, second edition, 2002.

C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning in Graphical Models, pages 599-621. The MIT Press, Cambridge, MA, 1999. Previously (1998) published by Kluwer Academic Press.

Abstract: The main aim of this paper is to provide a tutorial on regression with Gaussian processes. We start from Bayesian linear regression, and show how by a change of viewpoint one can se this method as a Gaussian process predictor based on priors over functions, rather than on priors over parameters. This leads to a more general discussion of Gaussian processes in section 4. Section 5 deals with further issues, including hierarchical modelling and the setting of the parameters that control the Gaussian process, the covariance functions for neural network models and the use of Gaussian processes in classification problems.

Regression

The simplest uses of Gaussian process models are for (the conjugate case of) regression with Gaussian noise. See the approximation section for papers which deal specifically with sparse or fast approximation techniques. O'Hagan 1978 represents an early reference from the statistics comunity for the use of a Gaussian process as a prior over functions, an idea which was only introduced to the machine learning community by Williams and Rasmussen 1996.

P. Boyle and M. Frean. Dependent Gaussian processes. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 217-224. The MIT Press, 2005.

Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs.

P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10. The MIT Press, Cambridge, MA, 1998.

Abstract: Gaussian processes provide natural non-parametric prior distributions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth functon of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that prior uncertainty about the parameters controlling both processes can be handled and that the posterior distribution of the noise rate can be sampled from using Markov chain Monte Carlo methods. Our results on a sythetic data set give a posterior noise variance that well-approximates the true variance.

R. B. Gramacy. tgp: An R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models. Journal of Statistical Software, 19, 2007.

Abstract: The tgp package for R is a tool for fully Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian processes with jumps to the limiting linear model. Special cases also implemented include Bayesian linear models, linear CART, stationary separable and isotropic Gaussian processes. In addition to inference and posterior prediction, the package supports the (sequential) design of experiments under these models paired with several objective criteria. 1-d and 2-d plotting, with higher dimension projection and slice capabilities, and tree drawing functions (requiring maptree and combinat packages), are also provided for visualization of tgp objects.

R. B. Gramacy, Herbert K. H. Lee, and William MacReady. Parameter space exploration with Gaussian process trees. In 21st International Conference on Machine Learning, pages 353-360. Omnipress and ACM Digital Library, 2004.

Abstract: Computer experiments often require dense sweeps over input parameters to obtain a qualitative understanding of their response. Such sweeps can be prohibitively expensive, and are unnecessary in regions where the response is easily predicted; well-chosen designs could allow a mapping of the response with far fewer simulation runs. Thus, there is a need for computationally inexpensive surrogate models and an accompanying method for selecting small designs. We explore a general methodology for addressing this need that uses non-stationary Gaussian processes. Binary trees partition the input space to facilitate non-stationarity and a Bayesian interpretation provides an explicit measure of predictive uncertainty that can be used to guide sampling. Our methods are illustrated on several examples, including a motivating example involving computational fluid dynamics simulation of a NASA reentry vehicle.

E. Meeds and S. Osindero. An alternative infinite mixture of Gaussian process experts. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 883-890. The MIT Press, Cambridge, MA, 2006.

Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective 'gating network' for the different experts.

A. O'Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, Series B, 40(1):1-42, 1978.

A. O'Hagan. Bayesian Inference, volume 2B of Kendall's Advanced Theory of Statistics. Arnold, London, 1994. chapter 10.48.

T. Plate. Accuracy versus interpretability in flexible modeling: implementing a tradeoff using Gaussian process models. Behaviourmetrika, 26(1):29-50, 1999.

Abstract: One of the widely acknowledged drawbacks of flexible statistical models is that the fitted models are often extremely difficult to interpret. However, if flexible models are constrained to be additive the fitted models are much easier to interpret, as each input can be considered independently. The problem with additive models is that they cannot provide an accurate model if the phenomenon being modeled is not additive. This paper shows that a tradeoff between accuracy and additivity can be implemented easily in Gaussian process models, which are a type of flexible model closely related to feedforward neural networks. One can fit a series of Gaussian process models that begins with the completely flexible and are constrained to be progressively more additive, and thus progressively more interpretable. Observations of how the degree of non-additivity and the test error change as the models become more additive give insight into the importance of interactions in a particular model. Fitted models in the series can also be interpreted graphically with a technique for visualizing the effects of inputs in non-additive models that was adapted from plots for generalized additive models. This visualization technique shows the overall effects of different inputs and also shows which inputs are involved in interactions and how strong those interactions are.

C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-linear Regression. PhD thesis, Department of Computer Science, University of Toronto, 1996.

Abstract: This thesis develops two Bayesian learning methods relying on Gaussian processes and a rigorous statistical approach for evaluating such methods. In these experimental designs the sources of uncertainty in the estimated generalisation performances due to both variation in training and test sets are accounted for. The framework allows for estimation of generalisation performance as well as statistical tests of significance for pairwise comparisons. Two experimental designs are recommended and supported by the DELVE software environment. Two new non-parametric Bayesian learning methods relying on Gaussian process priors over functions are developed. These priors are controlled by hyperparameters which set the characteristic length scale for each input dimension. In the simplest method, these parameters are fit from the data using optimization. In the second, fully Bayesian method, a Markov chain Monte Carlo technique is used to integrate over the hyperparameters. One advantage of these Gaussian process methods is that the priors and hyperparameters of the trained models are easy to interpret. The Gaussian process methods are benchmarked against several other methods, on regression tasks using both real data and data generated from realistic simulations. The experiments show that small datasets are unsuitable for benchmarking purposes because the uncertainties in performance measurements are large. A second set of experiments provide strong evidence that the bagging procedure is advantageous for the Multivariate Adaptive Regression Splines (MARS) method. The simulated datasets have controlled characteristics which make them useful for understanding the relationship between properties of the dataset and the performance of different methods. The dependency of the performance on available computation time is also investigated. It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time. The Gaussian process methods are shown to consistently outperform the more conventional methods.

C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In T. G. Diettrich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. The MIT Press, 2002.

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using a input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets - thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

E. Snelson, C. E. Rasmussen, and Z. Ghahramani. Warped Gaussian processes. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. The MIT Press.

Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.

P. Sollich and C. K. I. Williams. Using the equivalent kernel to understand Gaussian process regression. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17. The MIT Press, 2005.

Abstract: The equivalent kernel [Silverman, 1984] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.

F. Vivarelli and C. K. I Williams. Discovering hidden features with Gaussian processes regression. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. The MIT Press, 1999.

Abstract: In Gaussian process regression the covariance between the outputs at input locations x and x' is usually assumed to depend on the distance (x-x')W(x-x'), where W is a positive definite matrix. W is often taken to be diagonal, but if we allow W to be a general positive definite matrix which can be tuned on the basis of training data, then an eigen-analysis of W shows that we are effectively creating hidden features, where the dimensionality of the hidden-feature space is determined by the data. We demonstrate the superiority of predictions using the general matrix over those based on a diagonal matrix on two test problems.

C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514-520. The MIT Press, Cambridge, MA, 1996.

Abstract: The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior over functions. We investigate the use of a Gaussian process prior over functions, which permits the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.

H. Zhu, C. K. I. Williams, R. J. Rohwer, and M. Morciniec. Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop, editor, Neural Networks and Machine Learning. Springer-Verlag, Berlin, 1998. See also technical report NCRG/97/011, Aston University.

Abstract: The problem of regression under Gaussian assumptions is treated generally. The relationship between Bayesian prediction, regularization and smoothing is elucidated. The ideal regression is the posterior mean and its computation scales as O(n³), where n is the same size. We show that the optimal m-dimensional linear model under a given prior is spanned by the first m eigenfunctions of a covaraince operator, which is a trace-class operator. This is an infinite dimensional analogue of principal component analysis. The importance of Hilbert space methods to practical statistics is also discussed.

Classification

Exact inference in Gaussian process models for classification is not tractable. Several approximation schemes have been suggested, including Laplace's method, variational approximations, mean field methods, Markov chain Monte Carlo and Expectation Propagation. See also the approximation section. Multi-class classification may be treated explicitly, or decomposed into multiple, binary (one against the rest) problems. For introductions, see for example Williams and Barber 1998 or Kuss and Rasmussen 2005. Bounds from the PAC-Bayesian perspective are applied in Seeger 2002.

Y. Altun, T. Hofmann, and A. J. Smola. Gaussian process classification for segmenting and annotating sequences.. In C. E. Brodley, editor, Proceedings of the Twenty-first International Conference on Machine Learning (ICML 2004). ACM, 2004.

D. Barber and C. K. I. Williams. Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. The MIT Press.

Abstract: The full Bayesian method for applying neural networks to a prediction problem is to set up the prior/hyperprior structure for the net and then perform the necessary integrals. However, these integrals are not tractable analytically, and Markov Chain Monte Carlo (MCMC) methods are slow, especially if the parameter space is high-dimensional. Using Gaussian processes we can approximate the weight space integral analytically, so that only a small number of hyperparameters need be integrated over by MCMC methods. We have applied this idea to classification problems, obtaining excellent results on the real-world problems investigated so far.

N. Choudhuri, S. Ghosal, and A. Roy. Nonparametric binary regression using a Gaussian process prior. (unpublished), 2005.

L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther. Efficient approaches to Gaussian process classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. The MIT Press.

Abstract: We present three simple approximations for the calculation of the posterior mean in Gaussian Process classification. The first two methods are related to mean field ideas known in Statistical Physics. The third approach is based on Bayesan online approach which was motivated by recent results in the Statistical Mechanics of Neural Networks. We present simulation results showing: 1. that the mean field Bayesian evidence may be used for hyperparameter tuning and 2. that the online approach may achieve a low training error fast.

L. Csató and M. Opper. Sparse representation for Gaussian process models. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. The MIT Press.

Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world datasets indicate that efficiency of the approach.

L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(2):641-669, 2002.

Abstract: We develop an approach for sparse representations of gaussian process (GP) models (which are Bayesian types of kernel machines) in order to overcome their limitations for large data sets. The method is based on a combination of a Bayesian on-line algorithm, together with a sequential construction of a relevant subsample of the data that fully specifies the prediction of the GP model. By using an appealing parameterization and projection techniques in a reproducing kernel Hilbert space, recursions for the effective parameters and a sparse gaussian approximation of the posterior process are obtained. This allows for both a propagation of predictions and Bayesian error measures. The significance and robustness of our approach are demonstrated on a variety of experiments.

M. N. Gibbs and D. J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458-1464, 2000.

Abstract: Gaussian processes are a promising non-linear interpolation tool [williams-95,Williams and Rasmussen 1996], but it is not straightforward to solve classification problems with them. In this paper the variational methods of [jaakkola-jordan-96] are applied to Gaussian processes to produce an efficient Bayesian binary classifier.

M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790-1817, 2006.

Abstract: It is well known in the statistics literature that augmenting binary and polychotomous response models with Gaussian latent variables enables exact Bayesian analysis via Gibbs sampling from the parameter posterior. By adopting such a data augmentation strategy, dispensing with priors over regression coefficients in favour of Gaussian Process (GP) priors over functions, and employing variational approximations to the full posterior we obtain efficient computational methods for Gaussian Process classification in the multi-class setting. The model augmentation with additional latent variables ensures full a posteriori class coupling whilst retaining the simple a priori independent GP covariance structure from which sparse approximations, such as multi-class Informative Vector Machines (IVM), emerge in a very natural and straightforward manner. This is the first time that a fully Variational Bayesian treatment for multi-class GP classification has been developed without having to resort to additional explicit approximations to the non-Gaussian likelihood term. Empirical comparisons with exact analysis via MCMC and Laplace approximations illustrate the utility of the variational approximation as a computationally economic alternative to full MCMC and it is shown to be more accurate than the Laplace approximation.

A. Kapoor, K. Grauman, R. Urtasun, and T. Darell. Active learning with Gaussian processes for object categorization. In Proceedings of the International Conference in Cmputer Vision, 2007.

Abstract: Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) are powerful regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions defined based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. The uncertainty model provided by GPs offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We derive a novel active category learning method based on our probabilistic regression model, and show that a significant boost in classification performance is possible, especially when the amount of training data for a category is ultimately very small.

H.-C. Kim and Z. Ghahramani. The EM-EP algorithm for Gaussian process classification. In Proceedings of the Workshop on Probabilistic Graphical Models for Classification (at ECML), 2003.

Abstract: Gaussian process classifiers (GPCs) are fully statistical kernel classification models derived from Gaussian processes for regression. In GPCs, the probability of belonging to a certain class at an input location is monotonically related to the value of some latent function at that location. Starting from a prior over this latent function, the data are used to infer both the posterior over the latent function and the values of hyperparameters determining various aspects of the function. GPCs can also be viewed as graphical models with latent variables. Based on the work of [Minka 2001, Opper and Winther 2000], we present an approximate EM algorithm, the EM-EP algorithm for learning both the latent function and the hyperparameters of a GPC. The algorithm alternates the following steps until convergence. In the E-step, given the hyperparameters, a density for the latent variables defining the latent function is computed via the Expectation-Propagation (EP) algorithm [Minka 2001, Opper and Winther 2000]. In the M-step, given the density for the latent values, the hyperparameters are selected to maximize a variational lower bound on the marginal likelihood (i.e. the model evidence). This algorithm is found to converge in practice and provides an efficient Bayesian framework for learning hyperparameters of the kernel. We examine the role of various different hyperparameters which model labeling errors, the lengthscales (i.e. relevances) of different features, and sharpness of the decision boundary. The added flexibility these provide results in signicantly improved performance. Experimental results on synthetic and real data sets show that the EM-EP algorithm works well, with GPCs giving equal or better performance than support vector machines (SVMs) on all data sets tested.

M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704, 2005.

Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace's method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace's method.

M. Kuss and C. E. Rasmussen. Assessing approximations for Gaussian process classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 699-706, Cambridge, MA, 2006. The MIT Press.

Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace's method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.

N. D. Lawrence and M. I. Jordan. Semi-supervised learning via Gaussian processes. In L. K. Saul, Y. Weiss, and Bottou L, editors, Advances in Neural Information Processing Systems 17, pages 753-760, Cambridge, MA, 2005. The MIT Press.

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a "null category noise model" (NCNM) inspired by ordered cate- gorical noise models. The noise model re ects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits.

R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475-501. Oxford University Press, 1998.

Abstract: Gaussian processes are a natural way of specifying prior distributions over functions of one or more input variables. When such a function defines the mean response in a regression model with Gaussian errors, inference can be done using matrix computations, which are feasible for datasets of up to about a thousand cases. The covariance function of the Gaussian process can be given a hierarchical prior, which allows the model to discover high-level properties of the data, such as which inputs are relevant to predicting the response. Inference for these covariance hyperparameters can be done using Markov chain sampling. Classification models can be defined using Gaussian processes for underlying latent values, which can also be sampled within the Markov chain. Gaussian processes are in my view the simplest and most obvious way of defining flexible Bayesian regression and classification models, but despite some past usage, they appear to have been rather neglected as a general-purpose technique. This may be partly due to a confusion between the properties of the function being modeled and the properties of the best predictor for this unknown function.

H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035-2078, 2008.

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches.

M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655-2684, 2000.

Abstract: We derive a mean-field algorithm for binary classification with gaussian processes that is based on the TAP approach originally proposed in statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error, which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler "naive" mean-field theory and support vector machines (SVMs) as limiting cases. For both mean-field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show that one may get state-of-the-art performance by using the leave-one-out estimator for model selection and the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The second result is taken as strong support for the internal consistency of the mean-field approach.

M. Opper and O. Winther. Mean field methods for classification with Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 309-315, Cambridge, MA, 1999. The MIT Press.

Abstract: We discuss the application of TAP mean field methods known from the Statistical Mechanics of diordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowledge about the distribution of inputs is needed. Simulation results for the Sonar data set are given.

T. Peña Centeno and N. D. Lawrence. Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research, 7:455-491, 2006.

Abstract: In this paper we consider a novel Bayesian interpretation of Fisher's discriminant analysis. We relate Rayleigh's coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the 'regression to the labels' assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher's discriminant, and with the incorporation of a prior we can apply Bayes' rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher's discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.

M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233-269, 2002.

Abstract: Approximate Bayesian Gaussian process (GP) classification techniques are powerful non-parametric learning methods, similar in appearance and performance to support vector machines. Based on simple probabilistic models, they render interpretable results and can be embedded in Bayesian frameworks for model selection, feature selection, etc. In this paper, by applying the PAC-Bayesian theorem of McAllester (1999a), we prove distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques. We also provide a new and much simplified proof for this powerful theorem, making use of the concept of convex duality which is a backbone of many machine learning techniques. We instantiate and test our bounds for two particular GPC techniques, including a recent sparse method which circumvents the unfavourable scaling of standard GP algorithms. As is shown in experiments on a real-world task, the bounds can be very tight for moderate training sample sizes. To the best of our knowledge, these results provide the tightest known distribution-free error bounds for approximate Bayesian GPC methods, giving a strong learning-theoretical justification for the use of these techniques.

M. Seeger and M. I. Jordan. Sparse Gaussian process classification with multiple classes. Technical Report TR 661, Department of Statistics, University of California at Berkeley, 2004.

Abstract: Sparse approximations to Bayesian inference for nonparametric Gaussian Process models scale linearly in the number of training points, allowing for the application of these powerful kernel-based models to large datasets. We show how to generalize the binary classification Informative Vector Machine (IVM) to multiple classes. In contrast to earlier efficient approaches to kernel-based non-binary classification, our method is a principled approximation to Bayesian inference which yields valid uncertainty estimates and allows for hyperparameter adaption via marginal likelihood maximization. While most earlier proposals suggest fitting independent binary discriminants to heuristically chosen partitions of the data and combining these in a heuristic manner, our method operates jointly on the data for all classes. Crucially, we still achieve a linear scaling in both the number of classes and the number of training points.

R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In 24th International Conference on Machine Learning, 2007.

Abstract: Supervised learning is dicult with high dimensional input spaces and very small training sets, but accurate classcation may be possible if the data lie on a low-dimensional manifold. Gaussian Process Latent Variable Models can discover low dimensional manifolds given only a small number of examples, but learn a latent space without regard for class labels. Existing methods for discriminative manifold learning (e.g., LDA, GDA) do constrain the class distribution in the latent space, but are generally deterministic and may not generalize well with limited training data. We introduce a method for Gaussian Process Classcation using latent variable models trained with discriminative priors over the latent space, which can learn a discriminative latent space from a small training set.

C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998. Accompanying code.

Abstract: We consider the problem of assigning an input vector x to one of m classes by predicting P(c|x) for c=1,...,m. For a two-class problem, the probability of class 1 given x is esimated by σ(y(x)), where σ(y)=1/(1+e^-y). A Gaussian process prior is placed on y(x), and is combined with the training data to obtain predictions for the x points. We provide a Bayesian treatment, integrating over uncertainty in y and in the parameters that control the Gaussian process prior; the necessary integration over y is carried out using Laplace's approximation. The method is generalized to multi-class problems (m>2) using the softmax function. We demonstrate the effectiveness of the method on a number of datasets.

Covariance Functions and Properties of Gaussian Processes

The properties of Gaussian processes are controlled by the (mean function and) covariance function. Some references here describe difference covariance functions, while others give mathematical characterizations, see eg. Abrahamsen 1997 for a review. Some references describe non-standard covariance functions leading to non-stationarity etc.

P. Abrahamsen. A review of Gaussian random fields and correlation functions. Technical Report 917, Norwegian Computing Center, Oslo, 1997.

R. J. Adler. The Geometry of Random Fields. John Wiley & Sons, Chichester, 1981.

J. L. Doob. The elementary Gaussian processes. Annals of Mathematical Statistics, 15(3):229-282, 1944.

M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Department of Physics, University of Cambridge, 1997.

Abstract: Bayesian inference offers us a powerful tool with which to tackle the problem of data modelling. However the performance of Bayesian methods is crucially dependent on being able to find good models for our data. The principal focus of this thesis is the development of models based on Gaussian process priors. Such models, which can be thought of as the infinite extension of several existing finite models have the flexibility to model complex phenomena while being mathematically simple. In thesis, I present a review of the theory of Gaussian processes and their covariance functions and demonstrate how they fit into the Bayesian framework. The efficient implementation of a Gaussian process is discussed with particular reference to approximate methods for matrix inversion based on the work of Skilling (1993). Several regression problems are examined. Non-stationary covariance functions are developed for the regression of neuron spike data and the use of Gaussian processes to model the potential energy surfaces of weakly bound molecules is discussed. Classification methods based on Gaussian processes are implemented using variational methods. Existing bounds (Jaakkola and Jordan 1996) for the sigmoid function are used to tackle binary problems and multi-dimensional bounds on the softmax function are presented for the multiple class case. The performance of the variational classifier is compared with that of other methods using the CRABS and PIMA datasets (Ripley 1996) and the problem of predicting the cracking of welds based on their chemical composition is also investigated. The theoretical calculation of the density of states of crystal structures is discussed in detail. Three possible approaches to the problem are described based on free energy minimization, Gaussian processes and the theory of random matrices. Results from these approaches are compared with the state-of-the-art techniques (Pickard 1997).

G. Matheron. The intrinsic random functions and their applications. Advances in Applied Probability, 5:439-468, 1973.

W. Meiring, P. Monestiez, P. D. Sampson, and Guttorp. P. Developments in the modelling of nonstationary spatial covariance structure for space-time monitoring data. In E. Y. Baafi and N. Schofield, editors, Geostatistics Wollongong '96, volume 1, pages 162-173. Kluwer, 1997.

R. M. Neal. Bayesian Learning for Neural Networks. Springer, New York, 1996. chapter 2.

Abstract: Artificial ``neural networks'' are now widely used as flexible models for regression and classification applications, but questions remain regarding what these models mean, and how they can safely be used when training data is limited. Bayesian Learning for Neural Networks shows that Bayesian methods allow complex neural network models to be used without fear of the ``overfitting'' that can occur with traditional neural network learning methods. Insight into the nature of these complex Bayesian models is provided by a theoretical investigation of the priors over functions that underlie them. Use of these models in practice is made possible using Markov chain Monte Carlo techniques. Both the theoretical and computational aspects of this work are of wider statistical interest, as they contribute to a better understanding of how Bayesian methods can be applied to complex problems. Presupposing only basic knowledge of probability and statistics, this book should be of interest to many researchers in Statistics, Engineering, and Artificial Intelligence. Software for Unix systems that implements the methods described is freely available over the Internet.

Comment: Gaussian processes are not the main topic of this book but, chapter 2 entitled "Priors for Infinite Networks" contains a characterization of Gaussian and non-Gaussian limits of priors over functions generated from neural networks. See also Williams 1998.

C. J. Paciorek and M. J. Schervish. Nonstationary covariance functions for Gaussian process regression. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. The MIT Press, 2004.

P. D. Sampson and P. Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108-119, 1992.

A. M. Schmidt and A. O'Hagan. Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society B, 65(3):745-758, 2003.

I. J. Schoenberg. Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3):522-536, 1938.

C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10:1203-1216, 1998.

Abstract: For neural networks with a wide class of weight priors, it can be shown that in the limit of an infinite number of hidden units, the prior over functions tends to a Gaussian process. In this article, analytic forms are derived for the covariance function of the Gaussian processes corresponding to networks with sigmoidal and gaussian hidden units. This allows predictions to be made efficiently using networks with an infinite number of hidden units and shows, somewhat paradoxically, that it may be easier to carry out Bayesian prediction with infinite networks rather than finite ones.

A. M. Yaglom. Stationary Random Functions. Prentice-Hall, Englewood Cliffs, NJ, 1962.

Model Selection

D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Compuration, 11(5):1035-1068, 1999.

K. V. Mardia and R. J. Marshall. Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika, 71(1):135-146, 1984.

Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In C. E. Brodley, editor, Proceedings of Twenty-first International Conference on Machine Learning, 2004.

A. Schwaighofer, V. Tresp, and K. Yu. Learning Gaussian process kernels via hierarchical bayes. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA, 2005. The MIT Press.

M. Seeger. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 603-609, Cambridge, MA, 2000. The MIT Press.

M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233-269, 2002.

P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46(1-3):21-52, 2002.

Abstract: I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This probabilistic interpretation can provide intuitive guidelines for choosing a lsquogoodrsquo SVM kernel. Beyond this, it allows Bayesian methods to be used for tackling two of the outstanding challenges in SVM classification: how to tune hyperparameters\u2014the misclassification penalty C, and any parameters specifying the ernel\u2014and how to obtain predictive class probabilities rather than the conventional deterministic class label predictions. Hyperparameters can be set by maximizing the evidence; I explain how the latter can be defined and properly normalized. Both analytical approximations and numerical methods (Monte Carlo chaining) for estimating the evidence are discussed. I also compare different methods of estimating class probabilities, ranging from simple evaluation at the MAP or at the posterior average to full averaging over the posterior. A simple toy application illustrates the various concepts and techniques.

S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes,. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. The MIT Press.

S. Sundararajan and S. Sathiya Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13:1103-1118, 2001.

G. Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society, Series B, 40:364-372, 1978.

Approximations

There are two main reasons for doing approximations in Gaussian process models. Either because of analytical intractability such as arrises in classification and regression with non-Gaussian noise. Or in order to gain a computational advantage when using large datasets, by the use of sparse approximations. Some methods address both issues simultaneously. The approximation methods and approximate inference algorithms are quite diverse, see Quiñonero-Candela and Ramussen 2005 for a unifying framework for sparse approximations in the Gaussian regression model.

L. Csató. Gaussian Processes - Iterative Sparse Approximations. PhD thesis, Neural Computing Research Group, Aston University, 2002.

Abstract: In recent years there has been an increased interest in applying non-parametric methods to real-world problems. Significant research has been devoted to Gaussian processes (GPs) due to their increased flexibility when compared with parametric models. These methods use Bayesian learning, which generally leads to analytically intractable posteriors. This thesis proposes a two-step solution to construct a probabilistic approximation to the posterior. In the first step we adapt the Bayesian online learning to GPs: the final approximation to the posterior is the result of propagating the first and second moments of intermediate posteriors obtained by combining a new example with the previous approximation. The propagation of functional forms is solved by showing the existence of a parametrisation to posterior moments that uses combinations of the kernel function at the training points, transforming the Bayesian online learning of functions into a parametric formulation. The drawback is the prohibitive quadratic scaling of the number of parameters with the size of the data, making the method inapplicable to large datasets. The second step solves the problem of the exploding parameter size and makes GPs applicable to arbitrarily large datasets. The approximation is based on a measure of distance between two GPs, the KL-divergence between GPs. This second approximation is with a constrained GP in which only a small subset of the whole training dataset is used to represent the GP. This subset is called the Basis Vector, or BV set and the resulting GP is a sparse approximation to the true posterior. As this sparsity is based on the KL-minimisation, it is probabilistic and independent of the way the posterior approximation from the first step is obtained. We combine the sparse approximation with an extension to the Bayesian online algorithm that allows multiple iterations for each input and thus approximating a batch solution. The resulting sparse learning algorithm is a generic one: for different problems we only change the likelihood. The algorithm is applied to a variety of problems and we examine its performance both on more classical regression and classification tasks and to the data-assimilation and a simple density estimation problems.

L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(2):641-669, 2002.

G. Ferrari Trecate, C. K. I. Williams, and M. Opper. Finite-dimensional approximation of Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 218-224. The MIT Press, 1999.

M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Department of Physics, University of Cambridge, 1997.

M. N. Gibbs and D. J. C. MacKay. Efficient implementation of Gaussian processes. Technical report, Department of Physics, Cavendish Laboratory, Cambridge University, 1997.

Abstract: Neural networks and Bayesian inference provide a useful framework within which to solve regression problems. However their parameterisation means that the Bayesian analysis of neural networks can be difficult. In this paper, we investigate a method for regression using Gaussian Process Priors which allows exact Bayesian analysis using matrix manipulation for fixed values of hyperparameters. We discuss in detail the workings of the method and we detail a range of mathematical and numerical techniques that are useful in applying Gaussian Processes to general problems including efficient approximate matrix inversion methods developed by Skilling.

S. S. Keerthi and W. Chu. A matching pursuit approach to sparse Gaussian process regression. In Advances in Neural Information Processing Systems 18, 2006.

Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions.

N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 609-616, Cambridge, MA, 2003. The MIT Press.

T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.

T. P. Minka. Expectation propagation for approximate Bayesian inference. In J. S. Breese and D. Koller, editors, Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362-369. Morgan Kaufmann, 2001.

R. M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Department of Statistics, University of Toronto, 1997.

Abstract: Gaussian processes are a natural way of defining prior distributions over functions of one or more input variables. In a simple nonparametric regression problem, where such a function gives the mean of a Gaussian distribution for an observed response, a Gaussian process model can easily be implemented using matrix computations that are feasible for datasets of up to about a thousand cases. Hyperparameters that define the covariance function of the Gaussian process can be sampled using Markov chain methods. Regression models where the noise has a t distribution and logistic or probit models for classification applications can be implemented by sampling as well for latent values underlying the observations. Software is now available that implements these methods using covariance functions with hierarchical parameterizations. Models defined in this way can discover high-level properties of the data, such as which inputs are relevant to predicting the response.

M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655-2684, 2000.

J. Quiñonero-Candela. Learning with Uncertainty-Gaussian Processes and Relevance Vector Machines. PhD thesis, Informatics and Mathematical Modelling, Technical Univeristy of Denmark, 2004.

J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935-1959, 12 2005.

Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

Comment: Errata: there is a mistake in computational complexity for taking the derivative of the log marginal likelihood wrt all elements of X_u, stated as O(dnm²), in line -12 on page 1953. A more careful derivation reduces this to O(dnm+nm²).

Joaquin Quiñonero-Candela and Ole Winther. Incremental Gaussian processes. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. The MIT Press.

A. Schwaighofer and V. Tresp. Transductive and inductive methods for approximate Gaussian process regression. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. The MIT Press, 2003.

Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.

M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, Institute of Adaptive and Neural Computation, University of Edinburgh, 2003.

M. Seeger, C. K. I. Williams, and N. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In C.M. Bishop and B. J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003.

Y. Shen, A. Ng, and M. Seeger. Fast Gaussian process regression using KD-trees. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1227-1234. The MIT Press, Cambridge, MA, 2006.

A. J. Smola and P. L. Bartlett. Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Diettrich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619-625. The MIT Press, 2001.

E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2007.

Abstract: Gaussian process (GP) models are widely used to perform Bayesian nonlinear regression and classification—tasks that are central to many machine learning problems. A GP is nonparametric, meaning that the complexity of the model grows as more data points are received. Another attractive feature is the behaviour of the error bars. They naturally grow in regions away from training data where we have high uncertainty about the interpolating function.
In their standard form GPs have several limitations, which can be divided into two broad categories: computational difficulties for large data sets, and restrictive modelling assumptions for complex data sets. This thesis addresses various aspects of both of these problems.
The training cost for a GP has O(N³) complexity, where N is the number of training data points. This is due to an inversion of the N × N covariance matrix. In this thesis we develop several new techniques to reduce this complexity to O(NM²), where M is a user chosen number much smaller than N. The sparse approximation we use is based on a set of M ‘pseudo-inputs’ which are optimised together with hyperparameters at training time. We develop a further approximation based on clustering inputs that can be seen as a mixture of local and global approximations.
Standard GPs assume a uniform noise variance. We use our sparse approximation described above as a way of relaxing this assumption. By making a modification of the sparse covariance function, we can model input dependent noise. To handle high dimensional data sets we use supervised linear dimensionality reduction. As another extension of the standard GP, we relax the Gaussianity assumption of the process by learning a nonlinear transformation of the output space. All these techniques further increase the applicability of GPs to real complex data sets.
We present empirical comparisons of our algorithms with various competing techniques, and suggest problem dependent strategies to follow in practice.

Comment: See also the matlab implementation spgp.

E. Snelson and Z Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1259-1266. The MIT Press, Cambridge, MA, 2006.

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M<<N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M²N) training cost and O(M²) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M, i.e. very sparse solutions, and it significantly outperforms other approaches in this regime.

V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719-2741, 2000.

Abstract: The Bayesian committee machine (BCM) is a novel approach to combining estimators that were trained on different data sets. Although the BCM can be applied to the combination of any kind of estimators, the main foci are gaussian process regression and related systems such as regularization networks and smoothing splines for which the degrees of freedom increase with the number of training data. Somewhat surprisingly, we find that the performance of the BCM improves if several test points are queried at the same time and is optimal if the number of test points is at least as large as the degrees of freedom of the estimator. The BCM also provides a new solution for on-line learning with potential applications to data mining. We apply the BCM to systems with fixed basis functions and discuss its relationship to gaussian process regression. Finally, we show how the ideas behind the BCM can be applied in a non-Bayesian setting to extend the input-dependent combination of estimators.

M. J. Wainwright, E. B. Sudderth, and A. S. Willsky. Tree-based modeling and estimation of Gaussian processes on graphs with cycles. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. The MIT Press.

C. K. I. Williams, C. E. Rasmussen, A. Schwaighofer, and V. Tresp. Observations on the Nyström method for Gaussian process prediction. Technical report, University of Edinburgh, 2002.

C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Diettrich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682-688. The MIT Press, 2001.

O. Winther. Bayesian Mean Field Algorithms for Neural Networks and Gaussian Processes. PhD thesis, University of Copenhagen, 1988.

S. A. Wood, W. Jiang, and M. Tanner. Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika, 89(3):513-528, 2002.

References from the Statistics Community

Gaussian processes have a long history in the statistics community. They have been particularly well developed in geostatistics under the name of kriging. The papers have been grouped because they are written using a common terminology, and have slightly different focus from typical machine learning papers,

D. Barry. Nonparametric Bayesian regression. The Annals of Statistics, 14(3):934-953, 1986.

B. J. N. Blight and L. Ott. A Bayesian approach to model inadequacy for polynomial regression. Biometrika, 62(1):79-88, 1975.

N. A. C. Cressie. Statistics for Spatial Data. John Wiley & Sons, New York, 1993.

P. J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-based geostatistics (with discussion). Applied Statistics, 47:299-350, 1998.

M. S. Handcock and M. L. Stein. A Bayesian analysis of kriging. Technometrics, 35(4):403-410, 1993.

J. T. Kent and K. V. Mardia. The link between Kriging and thin-plate splines. In F. P. Kelly, editor, Probability, Statsitics and Optimization, pages 325-339. Wiley, 1994.

D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52(6):119-139, 1951.

G. M. Laslett. Kriging and splines: An empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89(426):391-409, 1994.

A. O'Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, Series B, 40(1):1-42, 1978.

H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, 2005.

H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society B, 71, 2009. PREPRINT.

Abstract: Structured additive regression models are perhaps the most commonly used class of models in statistical applications. It includes, among others, (generalized) linear models, (generalized) additive models, smoothing spline models, state space models, semiparametric regression, spatial and spatiotemporal models, log-Gaussian Cox processes and geostatistical and geoadditive models. We consider approximate Bayesian inference in a popular subset of structured additive regression models, latent Gaussian models, where the latent field is Gaussian, controlled by a few hyperparameters and with non-Gaussian response variables. The posterior marginals are not available in closed form owing to the non-Gaussian response variables. For such models, Markov chain Monte Carlo methods can be implemented, but they are not without problems, in terms of both convergence and computational time. In some practical applications, the extent of these problems is such that Markov chain Monte Carlo sampling is simply not an appropriate tool for routine analysis.We show that, by using an integrated nested Laplace approximation and its simplified version, we can directly compute very accurate approximations to the posterior marginals. The main benefit of these approximations is computational: where Markov chain Monte Carlo algorithms need hours or days to run, our approximations provide more precise estimates in seconds or minutes. Another advantage with our approach is its generality, which makes it possible to perform Bayesian analysis in an automatic, streamlined way, and to compute model comparison criteria and various predictive measures so that models can be compared and the model under study can be challenged.

B. W. Silverman. Spline smoothing: The equivalent variable kernel method. Annals of Statistics, 12(3):898-916, 1984.

B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion). Journal of the Royal Statistical Society, Series B, 47(1):1-52, 1985.

M. L. Stein. A kernel approximation to the kriging predictor of a spatial process. Ann. Inst. Statist. Math, 43(1):61-75, 1991.

M. L. Stein. Interpolation of Spatial Data. Springer, New York, 1999.

G. Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society, Series B, 40:364-372, 1978.

S. J. Yakowitz and F. Szidarovszky. A comparison of kriging with nonparametric regression methods. Journal of Multivariate Analysis, 16:21-53, 1985.

A. S. Young. A Bayesian approach to prediction using polynomials. Biometrika, 64(2):309-317, 1977.

Consistency, Learning Curves and Bounds

The papers in this section give theoretical results on learning curves, which describe the expected generalization performance as a function of the number of training cases. Consistency addresses the question whether the solution approaches the true data generating process in the limit of infinitely many training examples.

T. Choi and M. J. Schervish. Posterior consistency in nonparametric regression problems under Gaussian process priors. Technical Report 809, Department of Statistics, CMU, 2004.

S. Kakade, M. Seeger, and P. Foster. Worst-case bounds for Gaussian process models. In Advances in Neural Information Processing Systems 18. The MIT Press, 2006.

Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms-including Gaussian regression and logistic regression-which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy.

D. Malzahn and M. Opper. A variational approach to learning curves. In T. G. Diettrich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. The MIT Press, 2002.

M. Opper and F. Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302-308. The MIT Press, 1999.

P. Sollich. Approximate learning curves for Gaussian processes. In ICANN99 - Ninth International Conference on Artificial Neural Networks, pages 437-442, London, 1999. IEE.

P. Sollich. Learning curves for Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Neural Information Processing Systems, Vol. 11. The MIT Press, 1999.

P. Sollich and A. Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14:1393-1428, 2002.

Abstract: We consider the problem of calculating learning curves (i.e., average generalization performance) of gaussian processes used for regression. On the basis of a simple expression for the generalization error, in terms of the eigenvalue decomposition of the covariance function, we derive a number of approximation schemes. We identify where these become exact and compare with existing bounds on learning curves; the new approximations, which can be used for any input space dimension, generally get substantially closer to the truth. We also study possible improvements to our approximations. Finally, we use a simple exactly solvable learning scenario to show that there are limits of principle on the quality of approximations and bounds expressible solely in terms of the eigenvalue spectrum of the covariance function.

I Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. on Information Theory, 51(1):128-142, 2005.

A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36(3):1435-1463, 2008.

Abstract: We derive rates of contraction of posterior distributions on nonparametric or semiparametric models based on Gaussian processes. The rate of contraction is shown to depend on the position of the true parameter relative to the reproducing kernel Hilbert space of the Gaussian process and the small ball probabilities of the Gaussian process. We determine these quantities for a range of examples of Gaussian priors and in several statistical settings. For instance, we consider the rate of contraction of the posterior distribution based on sampling from a smooth density model when the prior models the log density as a (fractionally integrated) Brownian motion. We also consider regression with Gaussian errors and smooth classification under a logistic or probit link function combined with various priors.

C. K. I. Williams and F. Vivarelli. Upper and lower bounds on the learning curve for Gaussian proccesses. Machine Learning, 40:77-102, 2000.

Abstract: In this paper we introduce and illustrate non-trivial upper and lower bounds on the learning curves for one-dimensional Guassian Processes. The analysis is carried out emphasising the effects induced on the bounds by the smoothness of the random process described by the Modified Bessel and the Squared Exponential covariance functions. We present an explanation of the early, linearly-decreasing behavior of the learning curves and the bounds as well as a study of the asymptotic behavior of the curves. The effects of the noise level and the lengthscale on the tightness of the bounds are also discussed.

Reproducing Kernel Hilbert Spaces

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.

M. G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2:299-312, 2001.

Abstract: In this paper, we present classes of kernels for machine learning from a statistics perspective. Indeed, kernels are positive definite functions and thus also covariances. After discussing key properties of kernels, as well as a new formula to construct kernels, we present several important classes of kernels: anisotropic stationary kernels, isotropic stationary kernels, compactly supported kernels, locally stationary kernels, nonstationary kernels, and separable nonstationary kernels. Compactly supported kernels and separable nonstationary kernels are of prime interest because they provide a computational reduction for kernel-based methods. We describe the spectral representation of the various classes of kernels and conclude with a discussion on the characterization of nonlinear maps that reduce nonstationary kernels to either stationarity or local stationarity.

G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495-502, 1970.

G. Wahba. Spline Models for Observational Data, volume 59. Society for Industrial and Applied Mathematics, Philadelphia, 1990.

Reinforcement Learning

M. P. Deisenroth, J. Peters, and C. E. Rasmussen. Approximate Dynamic Programming with Gaussian Processes. In Proceedings of the 2008 American Control Conference (ACC 2008), pages 4480-4485, Seattle, WA, USA, June 2008.

Abstract: In general, it is difficult to determine an optimal closed-loop policy in nonlinear control problems with continuous-valued state and control domains. Hence, approximations are often inevitable. The standard method of discretizing states and controls suffers from the curse of dimensionality and strongly depends on the chosen temporal sampling rate. In this paper, we introduce Gaussian process dynamic programming (GPDP) and determine an approximate globally optimal closed-loop policy. In GPDP, value functions in the Bellman recursion of the dynamic programming algorithm are modeled using Gaussian processes. GPDP returns an optimal state-feedback for a finite set of states. Based on these outcomes, we learn a possibly discontinuous closed-loop policy on the entire state space by switching between two independently trained Gaussian processes. A binary classifier selects one Gaussian process to predict the optimal control signal. We show that GPDP is able to yield an almost optimal solution to an LQ problem using few sample points. Moreover, we successfully apply GPDP to the underpowered pendulum swing up, a complex nonlinear control problem.

M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Model-Based Reinforcement Learning with Continuous States and Actions. In Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008), pages 19-24, Bruges, Belgium, April 2008.

Abstract: Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.

M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508-1524, March 2009.

Abstract: Reinforcement learning (RL) and optimal control of systems with continuous states and actions require approximation techniques in most interesting cases. In this article, we introduce Gaussian process dynamic programming (GPDP), an approximate value-function based RL algorithm. We consider both a classic optimal control problem, where problem-specific prior knowledge is available, and a classic RL problem, where only very general priors can be used. For the classic optimal control problem, GPDP models the unknown value functions with Gaussian processes and generalizes dynamic programming to continuous-valued states and actions. For the RL problem, GPDP starts from a given initial state and explores he state space using Bayesian active learning. To design a fast learner, available data has to be used efficiently. Hence, we propose to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly. In both cases, we successfully apply the resulting continuous-valued controllers to the under-actuated pendulum swing up and analyze the performances of the suggested algorithms. It turns out that GPDP uses data very efficiently and can be applied to problems, where classic dynamic programming would be cumbersome.

Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, Hebrew University, Jerusalem, Israel, April 2005.

Abstract: Machine Learning is a field of research aimed at constructing intelligent machines that gain and improve their skills by learning and adaptation. As such, Machine Learning research addresses several classes of learning problems, including for instance, supervised and unsupervised learning. Arguably, the most ubiquitous and realistic class of learning problems, faced by both living creatures and artificial agents, is known as Reinforcement Learning. Reinforcement Learning problems are characterized by a long-term interaction between the learning agent and a dynamic, unfamiliar, uncertain, possibly even hostile environment. Mathematically, this interaction is modeled as a Markov Decision Process (MDP). Probably the most significant contribution of this thesis is in the introduction of a new class of Reinforcement Learning algorithms, which leverage the power of a statistical set of tools known as Gaussian Processes. This new approach to Reinforcement Learning offers viable solutions to some of the major limitations of current Reinforcement Learning methods, such as the lack of confidence intervals for performance predictions, and the difficulty of appropriately reconciling exploration with exploitation. Analysis of these algorithms and their relationship with existing methods also provides us with new insights into the assumptions underlying some of the most popular Reinforcement Learning algorithms to date.

Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In T. Fawcett and N. Mishra, editors, 20th International Conference on Machine Learning. AAAI Press, 2003.

Abstract: We present a novel Bayesian approach to the problem of value function estimation in con- tinuous state spaces. We derne a probabilistic generative model for the value function by imposing a Gaussian prior over value functions and assuming a Gaussian noise model. Due to the Gaussian nature of the random processes involved, the posterior distribution of the value function is also Gaussian and is therefore described entirely by its mean and covariance. We derive exact expressions for the posterior process moments, and utilizing an ecient sequential sparsfication method, we describe an on-line algorithm for learning them. We demonstrate the operation of the algorithm on a 2-dimensional continuous spatial navigation domain.

Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In 22nd International Conference on Machine Learning, pages 201-208, Bonn, Germany, August 2005.

Abstract: Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper Engel et al., 2003. The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.

Y. Engel, P. Szabo, and D. Volkinshtein. Learning to control an octopus arm with Gaussian process temporal difference methods. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 347-354, Cambridge, MA, U.S.A., 2006. The MIT Press.

Abstract: The Octopus arm is a highly versatile and complex limb. How the Octopus controls such a hyper-redundant arm (not to mention eight of them!) is as yet unknown. Robotic arms based on the same mechanical principles may render present day robotic arms obsolete. In this paper, we tackle this control problem using an online reinforcement learning algorithm, based on a Bayesian approach to policy evaluation known as Gaussian process temporal difference (GPTD) learning. Our substitute for the real arm is a computer simulation of a 2-dimensional model of an Octopus arm. Even with the simplifications inherent to this model, the state space we face is a high-dimensional one. We apply a GPTDbased algorithm to this domain, and demonstrate its operation on several learning tasks of varying degrees of difficulty.

M. Ghavamzadeh and Y. Engel. Bayesian policy gradient algorithms. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 457-464, Cambridge, MA, U.S.A., 2007. The MIT Press.

Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost.

M. Ghavamzadeh and Y. Engel. Bayesian actor-critic algorithms. In 24th International Conference on Machine Learning, pages 297-304, Corvallis, OR, U.S.A., June 2007.

Abstract: We present a new actor-critic learning model in which a Bayesian class of non-parametric critics, using Gaussian process temporal difference learning is used. Such critics model the state-action value function as a Gaussian process, allowing Bayes' rule to be used in computing the posterior distribution over state-action value functions, conditioned on the observed data. Appropriate choices of the prior covariance (kernel) between state-action values and of the parametrization of the policy allow us to obtain closed-form expressions for the posterior distribution of the gradient of the average discounted return with respect to the policy parameters. The posterior mean, which serves as our estimate of the policy gradient, is used to update the policy, while the posterior covariance allows us to gauge the reliability of the update.

J. Ko, D. J. Klein, D. Fox, and D. Haehnel. Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp. In Proceedings of the International Conference on Robotics and Automation (ICRA), pages 742-747, Rome, Italy, April 2007.

Abstract: Blimps are a promising platform for aerial robotics and have been studied extensively for this purpose. Unlike other aerial vehicles, blimps are relatively safe and also possess the ability to loiter for long periods. These advantages, however, have been difficult to exploit because blimp dynamics are complex and inherently non-linear. The classical approach to system modeling represents the system as an ordinary differential equation (ODE) based on Newtonian principles. A more recent modeling approach is based on representing state transitions as a Gaussian process (GP). In this paper, we present a general technique for system identification that combines these two modeling approaches into a single formulation. This is done by training a Gaussian process on the residual between the non-linear model and ground truth training data. The result is a GP-enhanced model that provides an estimate of uncertainty in addition to giving better state predictions than either ODE or GP alone. We show how the GP-enhanced model can be used in conjunction with reinforcement learning to generate a blimp controller that is superior to those learned with ODE or GP models alone.

J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and B. Likar. Predictive control with Gaussian process models. In B. Zajc and M. Tkal, editors, Proceedings of IEEE Region 8 Eurocon 2003: Computer as a Tool, pages 352-356, Piscataway, 2003. IEEE.

Abstract: This paper describes model-based predictive control based on Gaussian processes.Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identification of non-linear dynamic systems. It offers more insight in variance of obtained model response, as well as fewer parameters to determine than other models. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. This property is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on a simulated example of nonlinear system.

C. E. Rasmussen and M. P. Deisenroth. Probabilistic Inference for Fast Learning in Control. In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, Recent Advances in Reinforcement Learning, volume 5323 of Lecture Notes in Computer Science, pages 229-242. Springer-Verlag, November 2008.

Abstract: We provide a novel framework for very fast model-based reinforcement learning in continuous state and action spaces. The framework requires probabilistic models that explicitly characterize their levels of confidence. Within this framework, we use flexible, non-parametric models to describe the world based on previously collected experience. We demonstrate learning on the cart-pole problem in a setting where we provide very limited prior knowledge about the task. Learning progresses rapidly, and a good policy is found after only a hand-full of iterations.

C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. The MIT Press, Cambridge, MA, 2004.

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

Gaussian Process Latent Variable Models (GP-LVM)

N. Lawrence. Gaussian process latent variable models for visualization of high dimensional data. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 329-336. The MIT Press, 2004.

Abstract: In this paper we introduce a new underlying probabilistic model for principal component analysis (PCA). Our formulation interprets PCA as a particular Gaussian process prior on a mapping from a latent space to the observed data-space. We show that if the prior’s covariance function constrains the mappings to be linear the model is equivalent to PCA, we then extend the model by considering less restrictive covariance functions which allow non-linear mappings. This more general Gaussian process latent variable model (GPLVM) is then evaluated as an approach to the visualisation of high dimensional data for three different data-sets. Additionally our non-linear algorithm can be further kernelised leading to ‘twin kernel PCA’ in which a mapping between feature spaces occurs.

N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.

Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be non-linearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets.

R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In 24th International Conference on Machine Learning, 2007.

J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1443-1450. The MIT Press, Cambridge, MA, 2006.

Abstract: This paper introduces Gaussian Process Dynamical Models (GPDM) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces.

Comment: Web page http://www.dgp.toronto.edu/~jmwang/gpdm.

Applications

C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor. Gaussian process approximations of stochastic differential equations. In JMLR: Workshop and Conference Proceedings, 2007.

Abstract: Stochastic differential equations arise naturally in a range of contexts, from financial to environmental modeling. Current solution methods are limited in their representation of the posterior process in the presence of data. In this work, we present a novel Gaussian process approximation to the posterior measure over paths for a general class of stochastic differential equations in the presence of observations. The method is applied to two simple problems: the Ornstein-Uhlenbeck process, of which the exact solution is known and can be compared to, and the double-well system, for which standard approaches such as the ensemble Kalman smoother fail to provide a satisfactory result. Experiments show that our variational approximation is viable and that the results are very promising as the variational approximate solution outperforms standard Gaussian process regression for non-Gaussian Markov processes.

C. A. L. Bailer-Jones, T. J. Sabin, D. J. C. MacKay, and P. J. Withers. Prediction of deformed and annealed microstructures using Bayesian neural networks and Gaussian processes. In T. Chandra, S. R. Leclair, J. A. Meech, B. Verma, M. Smith, and B. Balachandran, editors, Proceedings of the Australasia Pacific Forum on Intelligent Processing and Manufacturing of Materials (IPMM97), Brisbane, 1997. Watson Ferguson & Co.

Abstract: The forming of metals is important in many manufacturing industries. It has long been known that microstructure and texture affect the properties of a material, but to date limited progress has been made in predicting microstructural development during thermomechanical forming due to the complexity of the relationship between microstructure and local deformation conditions. In this paper we investigate the utility of non-linear interpolation models, in particular Gaussian processes, to model the development of microstructure during thermomechanical processing of metals. We adopt a Bayesian approach which allows: (1) automatic control of the complexity of the non-linear model; (2) calculation of error bars describing the reliability of the model predictions; (3) automatic determination of the relevance of the various input variables. Although this method is not intelligent in that it does not attempt to provide a fundamental understanding of the underlying micromechanical deformation processes, it can lead to empirical relations that predict microstructure as a function of deformation and heat treatments. These can easily be incorporated into existing Finite Element forging design tools. Future work will examine the use of these models in reverse to guide the definition of deformation processes aimed at delivering the required microstructures. In order to thoroughly train and test a Gaussian Process or neural network model, a large amount of representative experimental data is required. Initial experimental work has focused on an Al-1%Mg alloy deformed in non-uniform cold compression followed by different annealing treatments to build up a set of microstructural data brought about by a range of processing conditions. The DEFORM Finite Element modelling package has been used to calculate the local effective strain as a function of position across the samples. This is correlated with measurements of grain areas to construct the data set with which to develop the model.

W. Chu, Z. Ghahramani, F. Falciani, and D. L. Wild. Biomarker discovery in microarray gene expression data with Gaussian processes. Bioinformatics, 21(16):3385-3393, 2005.

Abstract: Motivation: In clinical practice, pathological phenotypes are often labelled with ordinal scales rather than binary, e.g. the Gleason grading system for tumor cell differentiation. However, in the literature of microarray analysis, these ordinal labels have been rarely treated in a principled way. This paper describes a gene selection algorithm based on Gaussian processes to discover consistent gene expression patterns associated with ordinal clinical phenotypes. The technique of automatic relevance determination is applied to represent the significance level of the genes in a Bayesian inference framework. Results: The usefulness of the proposed algorithm for ordinal labels is demonstrated by the gene expression signature associated with the Gleason score for prostate cancer data. Our results demonstrate how multi-gene markers that may be initially developed with a diagnostic or prognostic application in mind are also useful as an investigative tool to reveal associations between specific molecular and cellular events and features of tumor physiology. Our algorithm can also be applied to microarray data with binary labels with results comparable to other methods in the literature. Availability: The source code was written in ANSI C, which is accessible at www.gatsby.ucl.ac.uk/~chuwei/code/gpgenes.tar.

W. Chu, S. Sindwhani, Z. Ghahramani, and S. S. Keerthi. Relational learning with Gaussian processes. In Advances in Neural Information Processing Systems 18, 2006.

Abstract: Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent covariance function for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm.

J. P. Cunningham, K. V. Shenoy, and M. Sahani. Fast Gaussian process methods for point process intensity estimation. In Andrew McCallum and Sam Roweis, editors, 25th International Conference on Machine Learning, pages 192-199. Omnipress, 2008.

Abstract: Point processes are difficult to analyze because they provide only a sparse and noisy observation of the intensity function driving the process. Gaussian Processes offer an attractive framework within which to infer underlying intensity functions. The result of this inference is a continuous function defined across time that is typically more amenable to analytical efforts. However, a naive implementation will become computationally infeasible in any problem of reasonable size, both in memory and run time requirements. We demonstrate problem specific methods for a class of renewal processes that eliminate the memory burden and reduce the solve time by orders of magnitude.

J. Eichhorn, A. Tolias, A. Zien, M. Kuss, C. E. Rasmussen, J. Weston, N. Logothetis, and B. Schölkopf. Prediction on spike data using kernel algorithms. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. The MIT Press.

Abstract: We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms.

J. Ko and D. Fox. GP-BayesFilters: Bayesian Filtering Using Gaussian Process Prediction and Observation Models. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3471-3476, Nice, France, September 2008.

Abstract: Bayesian filtering is a general framework for recursively estimating the state of a dynamical system. The most common instantiations of Bayes filters are Kalman filters (extended and unscented) and particle filters. Key components of each Bayes filter are probabilistic prediction and observation models. Recently, Gaussian processes have been introduced as a non-parametric technique for learning such models from training data. In the context of unscented Kalman filters, these models have been shown to provide estimates that can be superior to those achieved with standard, parametric models. In this paper we show how Gaussian process models can be integrated into other Bayes filters, namely particle filters and extended Kalman filters. We provide a complexity analysis of these filters and evaluate the alternative techniques using data collected with an autonomous micro-blimp.

J. Ko, D. J. Klein, D. Fox, and D. Haehnel. GP-UKF: Unscented Kalman Filters with Gaussian Process Prediction and Observation Models. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1901-1907, San Diego, CA, USA, October 2007.

Abstract: This paper considers the use of non-parametric system models for sequential state estimation. In particular, motion and observation models are learned from training examples using Gaussian Process (GP) regression. The state estimator is an Unscented Kalman Filter (UKF). The resulting GP-UKF algorithm has a number of advantages over standard (parametric) UKFs. These include the ability to estimate the state of arbitrary nonlinear systems, improved tracking quality compared to a parametric UKF, and graceful degradation with increased model uncertainty. These advantages stem from the fact that GPs consider both the noise in the system and the uncertainty in the model. If an approximate parametric model is available, it can be incorporated into the GP; resulting in further performance improvements. In experiments, we show how the GP-UKF algorithm can be applied to the problem of tracking an autonomous micro-blimp.

J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gasussian process model based predictive control. In Proceedings of the Amarican Control Conference, pages 2214-2219, 2004.

Abstract: Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identi cation of non-linear dynamic systems. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. Gaussian process models contain noticeably less coef cients to be optimised. This paper illustrates possible application of Gaussian process models within model-based predictive control. The extra information provided within Gaussian process model is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on control of pH process benchmark.

G. M. Laslett. Kriging and splines: An empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89(426):391-409, 1994.

J. J. Murillo-Fuentes, S. Caro, and F. Perez-Cruz. Gaussian processes for multiuser detection in cdma receivers. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 939-946. The MIT Press, Cambridge, MA, 2006.

A. O'Hagan, M. C. Kennedy, and J. E. Oakley. Uncertainty analysis and other inference tools for complex computer codes. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 503-524. Oxford University Press, 1999. (with discussion).

J. C. Platt, C. J. C. Burges, S. Swenson, C. Weare, and A. Zheng. Learning a Gaussian process prior for automatically generating music playlists. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. The MIT Press.

Abstract: This paper presents AutoDJ: a system for automatically generating music playlists based on one or more seed songs selected by a user. AutoDJ uses Gaussian Process Regression to learn a user preference function over songs. This function takes music metadata as inputs. This paper further introduces Kernel Meta-Training, which is a method of learning a Gaussian Process kernel from a distribution of functions that generates the learned function. For playlist generation, AutoDJ learns a kernel from a large set of albums. This learned kernel is shown to be more effective at predicting users' playlists than a reasonable hand-designed kernel.

J. Quiñonero-Candela, A. Girard, Jan Larsen, and C. E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In Proceedings of the 2003 IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP 03), 2003.

Abstract: The object of Bayesian modelling is the predictive distribution, which in a forecasting scenario enables improved estimates of forecasted values and their uncertainties. In this paper we focus on reliably estimating the predictive mean and variance of forecasted values using Bayesian kernel based models such as the Gaussian Process and the Relevance Vector Machine. We derive novel analytic expressions for the predictive mean and variance for Gaussian kernel shapes under the assumption of a Gaussian input distribution in the static case, and of a recursive Gaussian predictive density in iterative forecasting. The capability of the method is demonstrated for forecasting of time-series and compared to approximate methods.

J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409-435, 1989.

A. Schwaighofer, M. Grigoras, V. Tresp, and C Hoffmann. GPPS: A Gaussian process positioning system for cellular networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. The MIT Press.

Abstract: In this article, we present a novel approach to solving the localization problem in cellular networks. The goal is to estimate a mobile users position, based on measurements of the signal strengths received from network base stations. Our solution works by building Gaussian process models for the distribution of signal strengths, as obtained in a series of calibration measurements. In the localization stage, the users position can be estimated by maximizing the likelihood of received signal strengths with respect to the position. We investigate the accuracy of the proposed approach on data obtained within a large indoor cellular network.

Jian Qing Shi, B. Wang, Roderick Murray-Smith, and D. M. Titterington. Gaussian process functional regression modeling for batch data. Biometrics, 63:714-723, 2007.

Abstract: A Gaussian process functional regression model is proposed for the analysis of batch data. Covariance structure and mean structure are considered simultaneously, with the covariance structure modelled by a Gaussian process regression model and the mean structure modelled by a functional regression model. The model allows the inclusion of covariates in both the covariance structure and the mean structure. It models the nonlinear relationship between a functional output variable and a set of functional and non-functional covariates. Several applications and simulation studies are reported and show that the method provides very good results for curve fitting and prediction.

Jian Qing Shi and B. Wang. Curve prediction and clustering with mixtures of Gaussian process functional regression models. Statistics and Computing, 18:267-283, 2008.

Abstract: Shi et al. (2007) proposed a Gaussian process functional regression (GPFR) model to model functional response curves with a set of functional covariates. Two main problems are addressed by this method: modelling nonlinear and nonparametric regression relationship and modelling covariance structure and mean structure simultaneously. The method gives very good results for curve fitting and prediction but side-steps the problem of heterogeneity. In this paper we present a new method for modelling functional data with ‘spatially’ indexed data, i.e., the heterogeneity is dependent on factors such as region and individual patient’s information. For data collected from different sources, we assume that the data corresponding to each curve (or batch) follows a Gaussian process func- tional regression model as a lower-level model, and introduce an allocation model for the latent indicator variables as a higher-level model. This higher-level model is dependent on the information related to each batch. This method takes ad- vantage of both GPFR and mixture models and therefore improves the accuracy of predictions. The mixture model has also been used for curve clustering, but focusing on the problem of clustering functional relationships between response curve and covariates. The model is examined on simulated data and real data.

F. Sinz, J. Quiñonero-Candela, G. H. Bakir, C. E. Rasmussen, and M. O. Franz. Learning depth from stereo. In C. E. Rasmussen, H. H. Buelthoff, M. A. Giese, and B. Schölkopf, editors, Pattern Recognition, Proc. 26th DAGM Symposium, LNCS 3175, pages 245-252. Springer, Berlin, 2004.

Abstract: We compare two approaches to the problem of estimating the depth of a point in space from observing its image position in two different cameras: 1. The classical photogrammetric approach explicitly models the two cameras and estimates their intrinsic and extrinsic parameters using a tedious calibration procedure; 2. A generic machine learning approach where the mapping from image to spatial coordinates is directly approximated by a Gaussian Process regression. Our results show that the generic learning approach, in addition to simplifying the procedure of calibration, can lead to higher depth accuracies than classical calibration although no specific domain knowledge is used.

W. J. Welch, R. J. Buck, J. Sacks, H. P. Wynn, T. J. Mitchell, and M. D. Morris. Screening, predicting, and computer experiments. Technometrics, 34:15-25, 1992.

Comment: Application of noise free Gaussian Process to screening (input selection) and prediction in computer experiments. The covariance function is $C(x, x'|θ, p)\propto\exp (-θ_j|x_j-x_j'|^p_j)$; variable selection is done by lumping some hyperparameters and letting others vary freely. Training by maximum likelihood using linesearches and the simplex algorithm.