Title: Sparse Gaussian Process Classification With Multiple Classes
1 Sparse Gaussian Process Classification With Multiple Classes
- Matthias W. Seeger, Michael I. Jordan
- University of California, Berkeley
- www.cs.berkeley.edu/mseeger
2 Gaussian Processes are different
- Kernel machines: estimate a single best function to solve the problem
- Bayesian Gaussian processes: inference over random functions -> mean predictions and uncertainty estimates
- Gives a posterior distribution over functions
- More expressive
- Powerful empirical Bayesian model selection
- Combination in larger probabilistic structures
- -> Harder to run, but worth it!
3 The Need for Linear Time
- So Gaussian Processes aim for more than kernel machines --- do they run much slower, then? Not necessarily (anymore)!
- GP multi-way classification:
- Linear in the number of datapoints
- Linear in the number of classes
- No artificial output coding
- Predictive uncertainties
- Empirical Bayesian model selection
4 Sparse GP Approximations
- Lawrence, Seeger, Herbrich: IVM (NIPS 02)
- Home in on an active set I of size d, with d much smaller than n
- Replace the likelihood by a likelihood approximation: a Gaussian function of the active-set values only (see the sketch after this slide)
- Use information criteria to find I greedily
- Restricted to models with one process only (like other sparse GP methods)
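As an illustration, the single-process IVM likelihood replacement in formulas; the site parameters pi_i and b_i below are my notation, not the talk's:

    \prod_{i=1}^{n} P(y_i \mid u_i) \;\approx\; \prod_{i \in I} \tilde t_i(u_i),
    \qquad
    \tilde t_i(u_i) \propto \exp\!\Bigl(-\tfrac{1}{2}\,\pi_i u_i^2 + b_i u_i\Bigr),

so the approximate posterior remains Gaussian and is determined by the d points in I.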
5 Multi-Class Models
- Multinomial likelihood (softmax); see the formula after this slide
- Use one process u^(c)(.) for each class
- Processes independent a priori
- Different kernels K^(c) for each class
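For reference, the standard softmax form of the multinomial likelihood, writing u_i^(c) = u^(c)(x_i):

    P(y_i = c \mid \mathbf u_i) \;=\;
    \frac{\exp\bigl(u_i^{(c)}\bigr)}{\sum_{c'=1}^{C} \exp\bigl(u_i^{(c')}\bigr)},
    \qquad
    \mathbf u_i = \bigl(u_i^{(1)}, \dots, u_i^{(C)}\bigr).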
6 But That's Easy
- ... we thought back then. But consider the posterior covariance A (a generic form is sketched after this slide)
- The prior and the likelihood couplings are both block-diagonal, but in different systems! Together, A has no simple structure!
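As a generic illustration only (the exact parameterization used in the talk may differ), a Gaussian approximation to the posterior has covariance

    A \;=\; \bigl(K^{-1} + \Lambda\bigr)^{-1},
    \qquad
    K = \mathrm{blockdiag}\bigl(K^{(1)}, \dots, K^{(C)}\bigr),

where K is block-diagonal across the classes (prior independence) while Lambda, collecting the likelihood precision blocks, is block-diagonal across the datapoints; their combination, and hence A, has neither structure.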
7 Second Order Approximation
- The u^(c) should be coupled a posteriori -> a diagonal site approximation is not useful
- The Hessian of the log-likelihood has a simple form (see below)
- Allow the likelihood coupling to be represented exactly up to second order: site precision blocks are diagonal minus rank 1
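For the softmax likelihood this is the standard identity, with pi_i the vector of class probabilities at datapoint i:

    -\nabla_{\mathbf u_i} \nabla_{\mathbf u_i}^{\top} \log P(y_i \mid \mathbf u_i)
    \;=\; \mathrm{diag}(\boldsymbol\pi_i) - \boldsymbol\pi_i \boldsymbol\pi_i^{\top},
    \qquad
    \pi_{i,c} = P(y_i = c \mid \mathbf u_i),

which is exactly the "diagonal minus rank 1" form named on the slide.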
8 Subproblems
- Efficient representation exploiting the prior independence and the constrained form
- ADF projections onto constrained Gaussians to compute the site precision blocks
- Forward selection of I
- Extensions of the simple myopic scheme
- Model selection based on conditional inference
9 Representation
- Exploits block-diagonal matrix structures
- Nontrivial to get the numerics right (Cholesky factors)
- Dominating cost: stub buffers used to compute the marginal moments (a sketch follows this slide)
- Stubs are updated after each inclusion
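A hedged sketch, for the simpler single-process IVM case, of how marginal moments are read off such a buffer; the names K_diag, M and h are mine, and the multi-class representation keeps block-structured analogues:

    import numpy as np

    def candidate_marginals(K_diag, M, h):
        # Single-process IVM-style representation: posterior covariance
        # A = K - M^T M, with M a (d x n) buffer that grows by one row per
        # inclusion and h the maintained vector of posterior marginal means.
        # Marginal variances of all n candidates then cost O(n d).
        a = K_diag - np.sum(M * M, axis=0)   # a_i = k_ii - ||m_i||^2
        return h, a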
10 Restricted ADF Projection
- Hard (non-convex) because of the constraint (the generic projection is stated after this slide)
- Use a double-loop scheme: outer loop analytic, inner loop convex -> very fast
- Initialization matters. Our choice can be motivated from the second order approximation (once more)
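For context, the generic ADF (moment-matching) step being constrained here; the restriction to the family F of Gaussians whose site precision blocks are "diagonal minus rank 1" is specific to this work, while the form below is standard:

    \hat P(\mathbf u_i) \;\propto\; P(y_i \mid \mathbf u_i)\, Q^{\text{old}}(\mathbf u_i),
    \qquad
    Q^{\text{new}} \;=\; \operatorname*{argmin}_{Q \in \mathcal F}\;
    \mathrm{KL}\bigl(\hat P \,\|\, Q\bigr).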
11 Information Gain Criterion
- The selection score measures the informativeness of a candidate, comparing the current belief with the belief after inclusion of candidate i (an illustrative form follows this slide)
- Favours points close to, or on the wrong side of, class boundaries
- Requires marginals computed from the stubs
- Score candidates prior to each inclusion
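The slide does not spell the score out; purely as an illustration, a generic entropy-reduction form for the C-dimensional Gaussian marginal of candidate i, with A_i its marginal covariance (the exact criterion used in the talk may differ):

    \Delta_i \;=\; \mathrm H\bigl[Q^{\text{old}}(\mathbf u_i)\bigr]
    - \mathrm H\bigl[Q^{\text{new}}(\mathbf u_i)\bigr]
    \;=\; \tfrac{1}{2}\,\log \frac{\bigl|A_i^{\text{old}}\bigr|}{\bigl|A_i^{\text{new}}\bigr|}.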
12 Extensions of the Myopic Scheme
- Solid active set: growing, site parameters kept fixed (for efficiency)
- Liquid set: fixed size, site parameters iteratively updated using EP
13 Overview of the Inference Algorithm
- Selection phase: compute marginals, score O(n/C) candidates, select the winner
- Inclusion phase: include the pattern; move the oldest liquid point to the solid active set
- EP phase: run EP updates iteratively on the liquid-set site parameters (a structural sketch of the loop follows this slide)
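A hedged structural sketch of the three-phase loop; the callbacks score(i), include(i) and ep_update(i) are hypothetical stand-ins for the stub/representation operations of the earlier slides:

    def sparse_inference_loop(n, d_final, liquid_max, score, include, ep_update):
        solid, liquid = [], []
        while len(solid) + len(liquid) < d_final:
            # Selection phase: score remaining candidates (in practice only a
            # random O(n/C) subset) and pick the winner.
            remaining = [i for i in range(n) if i not in solid and i not in liquid]
            winner = max(remaining, key=score)
            # Inclusion phase: include the pattern into the liquid set; once the
            # liquid set is full, its oldest member moves to the solid active set.
            include(winner)
            liquid.append(winner)
            if len(liquid) > liquid_max:
                solid.append(liquid.pop(0))
            # EP phase: iterate EP updates on the liquid-set site parameters only.
            for j in liquid:
                ep_update(j)
        return solid, liquid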
14 Model Selection
- Use a variational bound on the marginal likelihood, based on the inference approximation
- Computing the gradient costs an inference run plus additional overhead
- Minimize using quasi-Newton, reselecting I and the site parameters for new search directions (a non-standard optimization problem; a sketch follows this slide)
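A hedged sketch of the outer model-selection loop; neg_bound_and_grad is a hypothetical callback that, for kernel parameters theta, reruns conditional inference (reselecting I and the site parameters) and returns the variational bound on the negative log marginal likelihood with its gradient:

    from scipy.optimize import minimize

    def select_hyperparameters(theta0, neg_bound_and_grad):
        # Because I is reselected between search directions, the objective is
        # only piecewise smooth -- the "non-standard" aspect noted on the slide.
        res = minimize(neg_bound_and_grad, theta0, jac=True, method="L-BFGS-B")
        return res.x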
15 Preliminary Experiments
- Small part of MNIST (even digits, C = 5, n = 800)
- No model selection (MS not yet tested); all K^(c) the same
- d_final = 150, liquid set size L = 25
16 Preliminary Experiments (2)
17 Preliminary Experiments (3)
18 Preliminary Experiments (4)
19 Future Experiments
- Much larger experiments are in preparation, including model selection
- They use a novel, powerful object-oriented Matlab/C interface:
- Control over very large persistent C objects from Matlab
- Faster transition from prototype (Matlab) to product (C)
- Powerful matrix classes (masking, LAPACK/BLAS)
- Optimization code
- Will be released into the public domain
20 Future Work
- Experiments on much larger tasks
- Model selection with independent, heavily parameterized kernels (ARD, ...)
- The present scheme cannot be used for large C
21 Future Work (2)
- Gaussian process priors in large structured networks -> Gaussian process conditional random fields, ...
- Previous work addresses function point estimation. We aim for GP inference including uncertainty estimates
- Have to deal with huge random-field correlations, not only between datapoints but also along time -> automatic factorizations will be crucial
- The multi-class scheme will be a major building block