Title: Sparse Gaussian Process Classification With Multiple Classes
1 Sparse Gaussian Process Classification With Multiple Classes
- Matthias W. Seeger, Michael I. Jordan
- University of California, Berkeley
- www.cs.berkeley.edu/mseeger
2 Gaussian Processes are different
- Kernel machines: estimate a single best function to solve the problem
- Bayesian Gaussian processes: inference over random functions -> mean predictions and uncertainty estimates
- Gives a posterior distribution over functions
- More expressive
- Powerful empirical Bayesian model selection
- Combination in larger probabilistic structures
- -> Harder to run, but worth it!
3 The Need for Linear Time
- So Gaussian Processes aim for more than kernel machines --- do they run much slower, then? Not necessarily (anymore)!
- GP multi-way classification:
- Linear in the number of datapoints
- Linear in the number of classes
- No artificial output coding
- Predictive uncertainties
- Empirical Bayesian model selection
4 Sparse GP Approximations
- Lawrence, Seeger, Herbrich: IVM (NIPS 02)
- Home in on an active set I of size d, with d much smaller than n
- Replace the likelihood by a likelihood approximation: a Gaussian function of the active-set values only (see the sketch after this slide)
- Use information criteria to find I greedily
- Restricted to models with one process only (like other sparse GP methods)
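As an illustration, the single-process IVM likelihood replacement in formulas; the site parameters pi_i and b_i below are my notation, not the talk's:

    \prod_{i=1}^{n} P(y_i \mid u_i) \;\approx\; \prod_{i \in I} \tilde t_i(u_i),
    \qquad
    \tilde t_i(u_i) \propto \exp\!\Bigl(-\tfrac{1}{2}\,\pi_i u_i^2 + b_i u_i\Bigr),

so the approximate posterior remains Gaussian and is determined by the d points in I.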
5 Multi-Class Models
- Multinomial likelihood (softmax); see the formula after this slide
- Use one process u^(c)(.) for each class
- Processes independent a priori
- Different kernels K^(c) for each class
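For reference, the standard softmax form of the multinomial likelihood, writing u_i^(c) = u^(c)(x_i):

    P(y_i = c \mid \mathbf u_i) \;=\;
    \frac{\exp\bigl(u_i^{(c)}\bigr)}{\sum_{c'=1}^{C} \exp\bigl(u_i^{(c')}\bigr)},
    \qquad
    \mathbf u_i = \bigl(u_i^{(1)}, \dots, u_i^{(C)}\bigr).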
6 But That's Easy
- ... we thought back then. But consider the posterior covariance A (a generic form is sketched after this slide)
- The prior and the likelihood couplings are both block-diagonal, but in different systems! Together, A has no simple structure!
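As a generic illustration only (the exact parameterization used in the talk may differ), a Gaussian approximation to the posterior has covariance

    A \;=\; \bigl(K^{-1} + \Lambda\bigr)^{-1},
    \qquad
    K = \mathrm{blockdiag}\bigl(K^{(1)}, \dots, K^{(C)}\bigr),

where K is block-diagonal across the classes (prior independence) while Lambda, collecting the likelihood precision blocks, is block-diagonal across the datapoints; their combination, and hence A, has neither structure.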
7 Second Order Approximation
- The u^(c) should be coupled a posteriori -> a diagonal site approximation is not useful
- The Hessian of the log-likelihood has a simple form (see below)
- Allow the likelihood coupling to be represented exactly up to second order: site precision blocks are diagonal minus rank 1
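For the softmax likelihood this is the standard identity, with pi_i the vector of class probabilities at datapoint i:

    -\nabla_{\mathbf u_i} \nabla_{\mathbf u_i}^{\top} \log P(y_i \mid \mathbf u_i)
    \;=\; \mathrm{diag}(\boldsymbol\pi_i) - \boldsymbol\pi_i \boldsymbol\pi_i^{\top},
    \qquad
    \pi_{i,c} = P(y_i = c \mid \mathbf u_i),

which is exactly the "diagonal minus rank 1" form named on the slide.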
8 Subproblems
- Efficient representation exploiting the prior independence and the constrained form
- ADF projections onto constrained Gaussians to compute the site precision blocks
- Forward selection of I
- Extensions of the simple myopic scheme
- Model selection based on conditional inference
9 Representation
- Exploits block-diagonal matrix structures
- Nontrivial to get the numerics right (Cholesky factors)
- Dominating cost: stub buffers used to compute the marginal moments (a sketch follows this slide)
- Stubs are updated after each inclusion
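A hedged sketch, for the simpler single-process IVM case, of how marginal moments are read off such a buffer; the names K_diag, M and h are mine, and the multi-class representation keeps block-structured analogues:

    import numpy as np

    def candidate_marginals(K_diag, M, h):
        # Single-process IVM-style representation: posterior covariance
        # A = K - M^T M, with M a (d x n) buffer that grows by one row per
        # inclusion and h the maintained vector of posterior marginal means.
        # Marginal variances of all n candidates then cost O(n d).
        a = K_diag - np.sum(M * M, axis=0)   # a_i = k_ii - ||m_i||^2
        return h, a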
10 Restricted ADF Projection
- Hard (non-convex) because of the constraint (the generic projection is stated after this slide)
- Use a double-loop scheme: outer loop analytic, inner loop convex -> very fast
- Initialization matters. Our choice can be motivated from the second order approximation (once more)
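For context, the generic ADF (moment-matching) step being constrained here; the restriction to the family F of Gaussians whose site precision blocks are "diagonal minus rank 1" is specific to this work, while the form below is standard:

    \hat P(\mathbf u_i) \;\propto\; P(y_i \mid \mathbf u_i)\, Q^{\text{old}}(\mathbf u_i),
    \qquad
    Q^{\text{new}} \;=\; \operatorname*{argmin}_{Q \in \mathcal F}\;
    \mathrm{KL}\bigl(\hat P \,\|\, Q\bigr).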
11 Information Gain Criterion
- The selection score measures the informativeness of a candidate, comparing the current belief with the belief after inclusion of candidate i (an illustrative form follows this slide)
- Favours points close to, or on the wrong side of, class boundaries
- Requires marginals computed from the stubs
- Score candidates prior to each inclusion
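The slide does not spell the score out; purely as an illustration, a generic entropy-reduction form for the C-dimensional Gaussian marginal of candidate i, with A_i its marginal covariance (the exact criterion used in the talk may differ):

    \Delta_i \;=\; \mathrm H\bigl[Q^{\text{old}}(\mathbf u_i)\bigr]
    - \mathrm H\bigl[Q^{\text{new}}(\mathbf u_i)\bigr]
    \;=\; \tfrac{1}{2}\,\log \frac{\bigl|A_i^{\text{old}}\bigr|}{\bigl|A_i^{\text{new}}\bigr|}.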
12 Extensions of the Myopic Scheme
- Solid active set: growing, site parameters kept fixed (for efficiency)
- Liquid set: fixed size, site parameters iteratively updated using EP
13 Overview of the Inference Algorithm
- Selection phase: compute marginals, score O(n/C) candidates, select the winner
- Inclusion phase: include the pattern; move the oldest liquid point to the solid active set
- EP phase: run EP updates iteratively on the liquid-set site parameters (a structural sketch of the loop follows this slide)
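A hedged structural sketch of the three-phase loop; the callbacks score(i), include(i) and ep_update(i) are hypothetical stand-ins for the stub/representation operations of the earlier slides:

    def sparse_inference_loop(n, d_final, liquid_max, score, include, ep_update):
        solid, liquid = [], []
        while len(solid) + len(liquid) < d_final:
            # Selection phase: score remaining candidates (in practice only a
            # random O(n/C) subset) and pick the winner.
            remaining = [i for i in range(n) if i not in solid and i not in liquid]
            winner = max(remaining, key=score)
            # Inclusion phase: include the pattern into the liquid set; once the
            # liquid set is full, its oldest member moves to the solid active set.
            include(winner)
            liquid.append(winner)
            if len(liquid) > liquid_max:
                solid.append(liquid.pop(0))
            # EP phase: iterate EP updates on the liquid-set site parameters only.
            for j in liquid:
                ep_update(j)
        return solid, liquid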
14 Model Selection
- Use a variational bound on the marginal likelihood, based on the inference approximation
- Computing the gradient costs an inference run plus additional overhead
- Minimize using quasi-Newton, reselecting I and the site parameters for new search directions (a non-standard optimization problem; a sketch follows this slide)
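A hedged sketch of the outer model-selection loop; neg_bound_and_grad is a hypothetical callback that, for kernel parameters theta, reruns conditional inference (reselecting I and the site parameters) and returns the variational bound on the negative log marginal likelihood with its gradient:

    from scipy.optimize import minimize

    def select_hyperparameters(theta0, neg_bound_and_grad):
        # Because I is reselected between search directions, the objective is
        # only piecewise smooth -- the "non-standard" aspect noted on the slide.
        res = minimize(neg_bound_and_grad, theta0, jac=True, method="L-BFGS-B")
        return res.x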
15 Preliminary Experiments
- Small part of MNIST (even digits, C = 5, n = 800)
- No model selection (MS not yet tested); all K^(c) the same
- d_final = 150, liquid set size L = 25
16 Preliminary Experiments (2)
17 Preliminary Experiments (3)
18 Preliminary Experiments (4)
19 Future Experiments
- Much larger experiments are in preparation, including model selection
- They use a novel, powerful object-oriented Matlab/C interface:
- Control over very large persistent C objects from Matlab
- Faster transition from prototype (Matlab) to product (C)
- Powerful matrix classes (masking, LAPACK/BLAS)
- Optimization code
- Will be released into the public domain
20 Future Work
- Experiments on much larger tasks
- Model selection with independent, heavily parameterized kernels (ARD, ...)
- The present scheme cannot be used for large C
21 Future Work (2)
- Gaussian process priors in large structured networks -> Gaussian process conditional random fields, ...
- Previous work addresses function point estimation. We aim for GP inference including uncertainty estimates
- Have to deal with huge random-field correlations, not only between datapoints but also along time -> automatic factorizations will be crucial
- The multi-class scheme will be a major building block