1
Sparse Gaussian Process Classification
With Multiple Classes
  • Matthias W. Seeger, Michael I. Jordan
  • University of California, Berkeley
  • www.cs.berkeley.edu/mseeger

2
Gaussian Processes are different
  • Kernel Machines: Estimate a single best function
    to solve the problem
  • Bayesian Gaussian Processes: Inference over
    random functions → mean predictions and
    uncertainty estimates
  • Gives posterior distribution over functions
  • More expressive
  • Powerful empirical Bayesian model selection
  • Combination in larger probabilistic structure
  • → Harder to run, but worth it!

3
The Need for Linear Time
  • So Gaussian Processes aim for more than Kernel
    Machines --- do they run much slower, then? Not
    necessarily (anymore)!
  • GP multi-way classification
  • Linear in the number of datapoints
  • Linear in the number of classes
  • No artificial output coding
  • Predictive uncertainties
  • Empirical Bayesian model selection

4
Sparse GP Approximations
  • Lawrence, Seeger, Herbrich IVM (NIPS 02)
  • Home in on an active set I of size d ≪ n
  • Replace the likelihood by a likelihood
    approximation: a Gaussian function of the
    active-set values only
  • Use an information gain criterion to find I
    greedily (sketch after this list)
  • Restricted to models with one process only (like
    other sparse GP methods)
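A minimal sketch of the greedy active-set idea in Python (not the paper's implementation; score_candidate is a hypothetical stand-in for the information gain criterion discussed on a later slide):

    def greedy_active_set(n, d, score_candidate):
        """Greedily grow an active set I of size d << n out of n candidates.
        score_candidate(i, I) ranks candidate i given the current set I."""
        I, remaining = [], set(range(n))
        for _ in range(d):
            best = max(remaining, key=lambda i: score_candidate(i, I))
            I.append(best)
            remaining.remove(best)
        return I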

5
Multi-Class Models
  • Multinomial Likelihood (Softmax; see the sketch
    below)
  • Use one process u_c(·) for each class
  • Processes independent a priori
  • Different kernels K^(c) for each class
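A minimal sketch of this likelihood in Python (array shapes and names are our assumptions):

    import numpy as np

    def softmax_likelihood(U):
        """U: (n, C) array with U[i, c] = u_c(x_i), the value of the class-c
        process at datapoint i. Returns P(y_i = c | u(x_i)) for every class."""
        U = U - U.max(axis=1, keepdims=True)   # subtract row max for numerical stability
        E = np.exp(U)
        return E / E.sum(axis=1, keepdims=True)

Because the C processes are independent a priori, the joint prior covariance over all n·C latent values is block-diagonal, with one K^(c) block per class.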

6
But That's Easy…
  • …we thought back then, but: the posterior
    covariance A combines the prior covariance and
    the likelihood Hessian
  • Both are block-diagonal, but in different
    systems! Together A has no simple structure
    (see the worked equation below)
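A hedged reconstruction of the structure clash in standard GP notation (the exact symbols on the original slide are not recoverable): with a Gaussian second-order approximation to the likelihood, the posterior covariance is

    A = \left(K^{-1} + W\right)^{-1}, \qquad
    K = \mathrm{diag}\bigl(K^{(1)}, \dots, K^{(C)}\bigr), \qquad
    W = \mathrm{diag}\bigl(W_1, \dots, W_n\bigr)

K is block-diagonal when the latents are grouped by class, while W (built from the per-datapoint likelihood Hessians) is block-diagonal when they are grouped by datapoint; the two groupings do not align, so A itself has no simple structure.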

7
Second Order Approximation
  • The u^(c) should be coupled a posteriori →
    a diagonal approximation is not useful
  • The Hessian of the log likelihood has a simple
    form
  • Allow the likelihood coupling to be represented
    exactly up to second order: blocks that are
    diagonal minus rank 1 (see the sketch below)
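For the softmax likelihood this simple form is the standard per-datapoint Hessian diag(π) − ππᵀ, which is exactly diagonal minus rank 1. A small numerical illustration in Python (the numbers are arbitrary):

    import numpy as np

    # Hessian of the negative log softmax likelihood at one datapoint,
    # taken w.r.t. its C latent values u = (u_1, ..., u_C):
    #   W_i = diag(pi) - pi pi^T,  with  pi_c = exp(u_c) / sum_j exp(u_j)
    u = np.array([0.3, -1.2, 0.5, 0.0, 2.1])   # latent values for C = 5 classes
    pi = np.exp(u - u.max())
    pi /= pi.sum()                             # softmax probabilities
    W_i = np.diag(pi) - np.outer(pi, pi)       # diagonal minus rank 1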

8
Subproblems
  • Efficient representation exploiting
    the prior independence and constrained form
  • ADF projections onto a constrained Gaussian to
    compute the site precision blocks
  • Forward selection of I
  • Extensions of simple myopic scheme
  • Model selection based on conditional inference

9
Representation
  • Exploits the block-diagonal matrix structures
    (illustrated in the sketch below)
  • Nontrivial to get the numerics right (Cholesky
    factors)
  • Dominating cost: stub buffers used to compute
    the marginal moments
  • Stubs are updated after each inclusion; the
    total cost stays linear in n and C
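A hedged illustration of the block-diagonal savings in Python (the paper's actual representation maintains low-rank "stub" buffers over the active set, which this sketch does not reproduce): C separate n × n Cholesky factors replace one factorization of the full nC × nC system.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def per_class_cholesky(K_blocks, jitter=1e-8):
        """One Cholesky factor per class kernel matrix K^(c)."""
        return [cho_factor(K + jitter * np.eye(K.shape[0])) for K in K_blocks]

    def solve_per_class(factors, rhs_blocks):
        """Solve the block-diagonal prior system class by class."""
        return [cho_solve(f, r) for f, r in zip(factors, rhs_blocks)]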

10
Restricted ADF Projection
  • Hard (non-convex) because it is constrained
  • Use a double-loop scheme: outer loop analytic,
    inner loop convex → very fast
  • Initialization matters. Our choice can be
    motivated from the second-order approximation
    (once more)

11
Information Gain Criterion
  • The selection score measures the informativeness
    of a candidate, i.e. how much the current belief
    changes after inclusion of candidate i
    (a generic sketch follows below)
  • Prefers points close to, or on the wrong side
    of, class boundaries
  • Requires marginals computed from the stubs
  • Score O(n/C) candidates prior to each inclusion
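A hedged, generic version of such a score in Python (in the spirit of the IVM information gain; the exact criterion used on this slide is not recoverable from the transcript):

    import numpy as np

    def entropy_reduction(var_before, var_after):
        """Reduction in differential entropy of a Gaussian marginal when a
        candidate is (tentatively) included: H = 0.5*log(2*pi*e*var), so the
        gain is 0.5*log(var_before / var_after). Candidates whose inclusion
        shrinks the uncertainty most score highest."""
        return 0.5 * np.log(np.asarray(var_before) / np.asarray(var_after))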

12
Extensions of Myopic Scheme
  • Solid active set: growing; its site parameters
    stay fixed (for efficiency)
  • Liquid set: fixed size; its site parameters are
    iteratively updated using EP

13
Overview: Inference Algorithm
  • Selection phase: Compute marginals, score
    O(n/C) candidates. Select the winner
  • Inclusion phase: Include the pattern. Move the
    oldest liquid point to the solid active set
  • EP phase: Run EP updates iteratively on the
    liquid-set site parameters
    (a sketch of the loop follows below)
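A minimal Python sketch of that loop, with hypothetical callables score, include and ep_update standing in for the three phases (not the paper's code):

    def inference_loop(candidates, d_final, liquid_size, score, include, ep_update):
        """Myopic inclusion loop with a fixed-size liquid set."""
        solid, liquid = [], []
        while len(solid) + len(liquid) < d_final and candidates:
            # Selection phase: score the remaining candidates, pick the winner
            winner = max(candidates, key=score)
            candidates.remove(winner)
            # Inclusion phase: include the pattern; once the liquid set is
            # full, move its oldest member to the solid active set
            include(winner)
            liquid.append(winner)
            if len(liquid) > liquid_size:
                solid.append(liquid.pop(0))
            # EP phase: iteratively refresh the liquid-set site parameters
            for j in liquid:
                ep_update(j)
        return solid, liquid
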
14
Model Selection
  • Use a variational bound on the marginal
    likelihood, based on the inference approximation
  • The gradient costs an inference run plus an
    additional term
  • Minimize using quasi-Newton, re-selecting I and
    the site parameters for new search directions
    (a non-standard optimization problem; see the
    sketch below)
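A hedged Python sketch of this outer loop using SciPy's quasi-Newton optimizer; bound_and_grad is a hypothetical callable that re-runs conditional inference (re-selecting I and the site parameters) and returns the bound and its gradient with respect to the kernel hyperparameters theta:

    from scipy.optimize import minimize

    def select_hyperparameters(bound_and_grad, theta0):
        """Minimize the variational bound over kernel hyperparameters
        with a quasi-Newton (L-BFGS) method."""
        result = minimize(bound_and_grad, theta0, jac=True, method="L-BFGS-B")
        return result.x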

15
Preliminary Experiments
  • Small part of MNIST (even digits, C = 5, n = 800)
  • No model selection (MS not yet tested); all K^(c)
    the same
  • d_final = 150, L = 25 (liquid set size)

16
Preliminary Experiments (2)
17
Preliminary Experiments (3)
18
Preliminary Experiments (4)
19
Future Experiments
  • Much larger experiments are in preparation,
    including model selection
  • They use a novel, powerful object-oriented
    Matlab/C interface
  • Control over very large persistent C objects
    from Matlab
  • Faster transition from prototype (Matlab) to
    product (C)
  • Powerful matrix classes (masking, LAPACK/BLAS)
  • Optimization code
  • Will be released into the public domain

20
Future Work
  • Experiments on much larger tasks
  • Model selection with independent, heavily
    parameterized kernels (ARD, …)
  • The present scheme cannot be used for large C

21
Future Work (2)
  • Gaussian process priors in large structured
    networks → Gaussian process conditional random
    fields, …
  • Previous work addresses function point
    estimation. We aim for GP inference including
    uncertainty estimates
  • Have to deal with huge random field correlations,
    not only between datapoints but also along time
    → automatic factorizations will be crucial
  • The multi-class scheme will be a major building
    block