1
Machine Learning Techniques for Computer Vision
  • Part 2: Unsupervised Learning

Christopher M. Bishop
Microsoft Research Cambridge
ECCV 2004, Prague
2
Overview of Part 2
  • Mixture models
  • EM
  • Variational Inference
  • Bayesian model complexity
  • Continuous latent variables

3
The Gaussian Distribution
  • Multivariate Gaussian
  • Maximum likelihood estimate of the mean
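The density and the ML estimate were equation images on the slide; standard forms (D-dimensional x, data points x_1, ..., x_N) are:
N(x | \mu, \Sigma) = (2\pi)^{-D/2} \, |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n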
4
Gaussian Mixtures
  • Linear super-position of Gaussians
  • Normalization and positivity require
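The mixture and its constraints were slide images; the standard form, with K components and mixing coefficients \pi_k, is:
p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)
\sum_{k=1}^{K} \pi_k = 1, \qquad 0 \le \pi_k \le 1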

5
Example: Mixture of 3 Gaussians
6
Maximum Likelihood for the GMM
  • Log likelihood function
  • Sum over components appears inside the log
  • no closed form ML solution
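The log likelihood referred to above, in standard notation for a data set X = {x_1, ..., x_N}:
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, N(x_n \mid \mu_k, \Sigma_k) \right\}
Because the sum over k sits inside the logarithm, setting the derivatives to zero gives no closed-form solution.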

7
EM Algorithm: Informal Derivation
8
EM Algorithm: Informal Derivation
  • M step equations
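The M-step equations were images on the slide; the standard updates, writing \gamma(z_{nk}) for the responsibilities and N_k = \sum_n \gamma(z_{nk}):
\mu_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\Sigma_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\mathrm{new}})(x_n - \mu_k^{\mathrm{new}})^T
\pi_k^{\mathrm{new}} = \frac{N_k}{N}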

9
EM Algorithm: Informal Derivation
  • E step equation
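The E-step equation, in the same notation; the responsibilities are the posterior probabilities that data point n was generated by component k:
\gamma(z_{nk}) = \frac{\pi_k \, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, N(x_n \mid \mu_j, \Sigma_j)}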

10
EM Algorithm: Informal Derivation
  • Can interpret the mixing coefficients as prior
    probabilities
  • Corresponding posterior probabilities
    (responsibilities)

11
Old Faithful Data Set
(Figure: duration of eruption vs. time between eruptions, in minutes)
12-17
(Figure-only slides: no text transcript)
18
Latent Variable View of EM
  • To sample from a Gaussian mixture (see the sketch below)
  • first pick one of the components with probability
    given by its mixing coefficient
  • then draw a sample from that component
  • repeat these two steps for each new data point
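A minimal NumPy sketch of this ancestral sampling procedure; the mixture parameters below are illustrative values, not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D mixture with K = 3 components (made-up parameters)
pi = np.array([0.5, 0.3, 0.2])                        # mixing coefficients, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])  # component means
Sigma = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])])  # covariances

def sample_gmm(n_samples):
    """Ancestral sampling: pick a component with probability pi_k, then draw from it."""
    X = np.empty((n_samples, 2))
    z = rng.choice(len(pi), size=n_samples, p=pi)     # latent component labels
    for n in range(n_samples):
        X[n] = rng.multivariate_normal(mu[z[n]], Sigma[z[n]])
    return X, z

X, z = sample_gmm(500)
print(X.shape, np.bincount(z))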

19
Latent Variable View of EM
  • Goal: given a data set, find the mixture parameters
  • Suppose we knew the colours
  • maximum likelihood would involve fitting each
    component to the corresponding cluster
  • Problem: the colours are latent (hidden) variables

20
Incomplete and Complete Data
(Figure: incomplete data vs. complete data)
21
Latent Variable Viewpoint
22
Latent Variable Viewpoint
  • Binary latent variables
    describing which component generated each data
    point
  • Conditional distribution of observed variable
  • Prior distribution of latent variables
  • Marginalizing over the latent variables we obtain
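Written out in standard notation (a 1-of-K binary latent vector z with z_k in {0,1} and \sum_k z_k = 1):
p(x \mid z_k = 1) = N(x \mid \mu_k, \Sigma_k), \qquad p(z_k = 1) = \pi_k
p(x) = \sum_{z} p(z) \, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)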

23
Graphical Representation of GMM
24
Latent Variable View of EM
  • Suppose we knew the values for the latent
    variables
  • maximize the complete-data log likelihood
    (written out below)
  • trivial closed-form solution: fit each component
    to the corresponding set of data points
  • We don't know the values of the latent variables
  • however, for given parameter values we can
    compute the expected values of the latent
    variables
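The complete-data log likelihood referred to above, in standard notation (z_{nk} is 1 if point n came from component k, and 0 otherwise):
\ln p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln N(x_n \mid \mu_k, \Sigma_k) \right\}
\mathbb{E}[z_{nk}] = \gamma(z_{nk})
Replacing each z_{nk} by its expected value gives the quantity maximized in the M step.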

25
Posterior Probabilities (colour coded)
26
Over-fitting in Gaussian Mixture Models
  • Infinities in the likelihood function when a
    component collapses onto a data point and its
    variance shrinks to zero (see the example below)
  • Also, maximum likelihood cannot determine the
    number K of components
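As an illustration (not on the slide): a component whose mean sits exactly on a data point x_n, with isotropic covariance \sigma_j^2 I, contributes a factor that grows without bound as its variance shrinks:
N(x_n \mid x_n, \sigma_j^2 I) = (2\pi\sigma_j^2)^{-D/2} \to \infty \quad \text{as } \sigma_j \to 0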

27
Cross Validation
  • Can select model complexity using an independent
    validation data set
  • If data is scarce, use S-fold cross-validation (sketch below)
  • partition data into S subsets
  • train on S-1 subsets
  • test on remainder
  • repeat and average
  • Disadvantages
  • computationally expensive
  • can only determine one or two complexity
    parameters
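A minimal sketch of this procedure for choosing the number of mixture components, assuming scikit-learn's GaussianMixture is available; the data array X and the candidate range for K are placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cv_score(X, K, S=5, seed=0):
    """Average held-out log likelihood per data point for a K-component GMM."""
    kf = KFold(n_splits=S, shuffle=True, random_state=seed)  # partition data into S subsets
    scores = []
    for train_idx, test_idx in kf.split(X):
        gmm = GaussianMixture(n_components=K, n_init=3, random_state=seed)
        gmm.fit(X[train_idx])                  # train on S-1 subsets
        scores.append(gmm.score(X[test_idx]))  # test on the remaining subset
    return np.mean(scores)                     # repeat and average

# Example usage on synthetic data standing in for a real data set
X = np.random.default_rng(0).normal(size=(300, 2))
best_K = max(range(1, 6), key=lambda K: cv_score(X, K))
print("selected K =", best_K)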

28
Bayesian Mixture of Gaussians
  • Parameters and latent variables appear on equal
    footing
  • Conjugate priors
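One standard choice of conjugate priors for a Bayesian Gaussian mixture: a Dirichlet prior over the mixing coefficients and a Gaussian-Wishart prior over each mean and precision (the hyperparameters \alpha_0, m_0, \beta_0, W_0, \nu_0 are user-chosen):
p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0)
p(\mu, \Lambda) = \prod_{k=1}^{K} N\!\left(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k \mid W_0, \nu_0)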

29
Data Set Size
  • Problem 1: learn a function from 100 (slightly)
    noisy examples
  • data set is computationally small but
    statistically large
  • Problem 2: learn to recognize 1,000 everyday
    objects from 5,000,000 natural images
  • data set is computationally large but
    statistically small
  • Bayesian inference
  • computationally more demanding than ML or
    MAP (but see discussion of Gaussian mixtures
    later)
  • significant benefit for statistically small data
    sets

30
Variational Inference
  • Exact Bayesian inference intractable
  • Markov chain Monte Carlo
  • computationally expensive
  • issues of convergence
  • Variational Inference
  • broadly applicable deterministic approximation
  • let Z denote the set of all latent variables and
    parameters
  • approximate the true posterior using a
    simpler distribution
  • minimize the Kullback-Leibler divergence

31
General View of Variational Inference
  • For an arbitrary distribution q(Z), the log marginal
    likelihood ln p(X) decomposes into a lower bound plus
    a Kullback-Leibler term (see below)
  • Maximizing the bound over an unrestricted q would give
    the true posterior
  • this is intractable by definition
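The decomposition behind these bullets was an equation image on the slide; with Z denoting all latent variables and parameters and q(Z) the approximating distribution, the standard form is:
\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)
\mathcal{L}(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)} \, dZ, \qquad \mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln \frac{p(Z \mid X)}{q(Z)} \, dZ
Since the KL term is non-negative, \mathcal{L}(q) is a lower bound on \ln p(X), with equality exactly when q(Z) equals the true posterior.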

32
Variational Lower Bound
33
Factorized Approximation
  • Goal: choose a family of q distributions which are
  • sufficiently flexible to give good approximation
  • sufficiently simple to remain tractable
  • Here we consider factorized distributions
    (written out below)
  • No further assumptions are required!
  • Optimal solution for one factor, keeping the
    remainder fixed
  • coupled solutions, so initialize and then cyclically
    update
  • message passing view (Winn and Bishop, 2004)
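The factorized form and the optimal-factor equation referred to above, in standard notation (Z partitioned into disjoint groups Z_i):
q(Z) = \prod_{i} q_i(Z_i)
\ln q_j^{\star}(Z_j) = \mathbb{E}_{i \neq j}\left[ \ln p(X, Z) \right] + \mathrm{const}
where the expectation is taken over all factors other than q_j; cycling through the factors with this update never decreases the lower bound.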

34
(No Transcript)
35
Lower Bound
  • Can also be evaluated
  • Useful for maths/code verification
  • Also useful for model comparison

36
Illustration: Univariate Gaussian
  • Likelihood function
  • Conjugate prior
  • Factorized variational distribution
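The standard setup for this example, with mean \mu, precision \tau and data D = {x_1, ..., x_N} (hyperparameters \mu_0, \lambda_0, a_0, b_0 assumed given):
p(D \mid \mu, \tau) = \prod_{n=1}^{N} \left( \frac{\tau}{2\pi} \right)^{1/2} \exp\left\{ -\frac{\tau}{2}(x_n - \mu)^2 \right\}
p(\mu, \tau) = N\!\left(\mu \mid \mu_0, (\lambda_0 \tau)^{-1}\right) \mathrm{Gam}(\tau \mid a_0, b_0)
q(\mu, \tau) = q_\mu(\mu) \, q_\tau(\tau)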

37
Initial Configuration
38
After Updating
39
After Updating
40
Converged Solution
41
Variational Mixture of Gaussians
  • Assume factorized posterior distribution
  • No other approximations needed!
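The assumed factorization, writing Z for the component assignments and \pi, \mu, \Lambda for the mixture parameters:
q(Z, \pi, \mu, \Lambda) = q(Z) \, q(\pi, \mu, \Lambda)
The further factorization of the parameter posterior then follows from the structure of the model rather than from additional assumptions.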

42
Variational Equations for GMM
43
Lower Bound for GMM
44
VIBES
  • Bishop, Spiegelhalter and Winn (2002)

45
ML Limit
  • If instead we choose point estimates (delta functions)
    for the parameter factors, we recover the maximum
    likelihood EM algorithm

46
Bound vs. K for Old Faithful Data
47
Bayesian Model Complexity
48
Sparse Bayes for Gaussian Mixture
  • Corduneanu and Bishop (2001)
  • Start with large value of K
  • treat mixing coefficients as parameters
  • maximize marginal likelihood
  • prunes out excess components

49
(No Transcript)
50
(No Transcript)
51
Summary: Variational Gaussian Mixtures
  • Simple modification of maximum likelihood EM code
  • Small computational overhead compared to EM
  • No singularities
  • Automatic model order selection

52
Continuous Latent Variables
  • Conventional PCA
  • data covariance matrix
  • eigenvector decomposition (see below)
  • Minimizes the sum-of-squares projection error
  • not a probabilistic model
  • how should we choose L ?
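The quantities named in the bullets, in standard notation (sample mean \bar{x}; the L eigenvectors with the largest eigenvalues define the principal subspace):
S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T, \qquad S u_i = \lambda_i u_i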

53
Probabilistic PCA
  • Tipping and Bishop (1998)
  • L dimensional continuous latent space
  • D dimensional data space

isotropic noise model: PCA
general diagonal noise model: factor analysis
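The generative model behind these two cases, in commonly used notation (L-dimensional latent vector z, D-dimensional observation x, D x L weight matrix W, Gaussian noise \epsilon):
p(z) = N(z \mid 0, I), \qquad x = W z + \mu + \epsilon
p(x \mid z) = N(x \mid W z + \mu, \sigma^2 I) \quad \text{(isotropic noise: probabilistic PCA)}
p(x \mid z) = N(x \mid W z + \mu, \Psi) \quad \text{(diagonal } \Psi \text{: factor analysis)}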
54
Probabilistic PCA
  • Marginal distribution (written out below)
  • Advantages
  • exact ML solution
  • computationally efficient EM algorithm
  • captures dominant correlations with few
    parameters
  • mixtures of PPCA
  • Bayesian PCA
  • building block for more complex models
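The marginal distribution referred to in the first bullet, obtained by integrating out the latent variable:
p(x) = \int p(x \mid z) \, p(z) \, dz = N(x \mid \mu, C), \qquad C = W W^T + \sigma^2 I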

55
EM for PCA
56
EM for PCA
57
EM for PCA
58
EM for PCA
59
EM for PCA
60
EM for PCA
61
EM for PCA
62
Bayesian PCA
  • Bishop (1998)
  • Gaussian prior over the columns of the weight matrix W
    (see below)
  • Automatic relevance determination (ARD)
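A standard form of the ARD prior used in Bayesian PCA: one precision hyperparameter \alpha_i per column w_i of W, so that columns whose \alpha_i becomes large are effectively switched off, determining the latent dimensionality automatically:
p(W \mid \alpha) = \prod_{i=1}^{L} \left( \frac{\alpha_i}{2\pi} \right)^{D/2} \exp\left\{ -\frac{\alpha_i}{2} \, w_i^T w_i \right\}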

(Figure: comparison of ML PCA and Bayesian PCA solutions)
63
Non-linear Manifolds
  • Example: images of a rigid object

64
Bayesian Mixture of BPCA Models
65
(No Transcript)
66
Flexible Sprites
  • Jojic and Frey (2001)
  • Automatic decomposition of video sequence into
  • background model
  • ordered set of masks (one per object per frame)
  • foreground model (one per object per frame)

67
(No Transcript)
68
Transformed Component Analysis
  • Generative model
  • Now include transformations (translations)
  • Extend to L layers
  • Inference is intractable, so use a variational
    framework

69
(No Transcript)
70
Bayesian Constellation Model
  • Li, Fergus and Perona (2003)
  • Object recognition from small training sets
  • Variational treatment of fully Bayesian model

71
Bayesian Constellation Model
72
Summary of Part 2
  • Discrete and continuous latent variables
  • EM algorithm
  • Build complex models from simple components
  • represented graphically
  • incorporates prior knowledge
  • Variational inference
  • Bayesian model comparison