Title: Christopher M. Bishop
1  Latent Variables, Mixture Models and EM
Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003
2  Overview
- K-means clustering
- Gaussian mixtures
- Maximum likelihood and EM
- Probabilistic graphical models
- Latent variables; EM revisited
- Bayesian Mixtures of Gaussians
- Variational Inference
- VIBES
3  Old Faithful
4  Old Faithful Data Set
[Scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]
5  K-means Algorithm
- Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
- Initialize prototypes, then iterate between two phases (see the sketch below)
- E-step: assign each data point to the nearest prototype
- M-step: update prototypes to be the cluster means
- Simplest version is based on Euclidean distance
- re-scale Old Faithful data
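As an illustration of the two-phase scheme above, here is a minimal K-means sketch in Python/NumPy. It is not from the original slides; the function name `kmeans`, the random initialization, the fixed iteration count, and the synthetic example data are my own choices.

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    """Minimal K-means: alternate E-step (assign) and M-step (update means)."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to the nearest prototype (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        labels = d2.argmin(axis=1)
        # M-step: set each prototype to the mean of the points assigned to it
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return mu, labels

# Example: two well-separated synthetic 2-D clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5.0])
mu, labels = kmeans(X, K=2)
```

In practice one would also monitor the cost function and stop when the assignments no longer change, rather than running a fixed number of iterations.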
6-14  (No transcript: figures only)
15  Responsibilities
- Responsibilities $r_{nk}$ assign data points to clusters such that $r_{nk} \in \{0, 1\}$ and $\sum_k r_{nk} = 1$
- Example: 5 data points and 3 clusters
16  K-means Cost Function
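The equation on this slide is not in the transcript; the standard K-means cost (distortion measure), written with the binary responsibilities $r_{nk}$ and prototypes $\boldsymbol{\mu}_k$ introduced above, is

\[ J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 . \]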
17  Minimizing the Cost Function
- E-step: minimize $J$ w.r.t. the responsibilities $r_{nk}$
- assigns each data point to the nearest prototype
- M-step: minimize $J$ w.r.t. the prototypes $\boldsymbol{\mu}_k$
- gives each prototype set to the mean of the points in that cluster, $\boldsymbol{\mu}_k = \sum_n r_{nk}\mathbf{x}_n \big/ \sum_n r_{nk}$
- Convergence guaranteed since there is a finite number of possible settings for the responsibilities
18-19  (No transcript: figures only)
20  Limitations of K-means
- Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster
- Not clear how to choose the value of K
- Solution: replace the hard clustering of K-means with soft, probabilistic assignments
- Represents the probability distribution of the data as a Gaussian mixture model
21  The Gaussian Distribution
- Multivariate Gaussian
- Define the precision to be the inverse of the covariance
- In 1 dimension
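The equations on this slide are not in the transcript; the standard forms are the D-dimensional Gaussian and its one-dimensional special case:

\[ \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}, \qquad \mathcal{N}(x\,|\,\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}, \]

with precision $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ (or $\tau = 1/\sigma^2$ in one dimension).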
22  Likelihood Function
- Data set $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
- Assume the observed data points are generated independently
- Viewed as a function of the parameters, this is known as the likelihood function (see below)
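The likelihood referred to above is not reproduced in the transcript; for independent points it has the standard form

\[ p(\mathbf{X}\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \ln p(\mathbf{X}\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}). \]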
23  Maximum Likelihood
- Set the parameters by maximizing the likelihood function
- Equivalently, maximize the log likelihood
24  Maximum Likelihood Solution
- Maximizing w.r.t. the mean gives the sample mean
- Maximizing w.r.t. the covariance gives the sample covariance
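The maximum likelihood solutions referred to on this slide have the standard closed forms

\[ \boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}. \]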
25  Bias of Maximum Likelihood
- Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
- The maximum likelihood solution systematically under-estimates the covariance
- This is an example of over-fitting
26  Intuitive Explanation of Over-fitting
27  Unbiased Variance Estimate
- Clearly we can remove the bias by rescaling the maximum likelihood estimate, since this gives an unbiased estimator (see below)
- Arises naturally in a Bayesian treatment (see later)
- For an infinite data set the two expressions are equal
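The expectations and the corrected estimator behind the last two content slides are, in their standard form,

\[ \mathbb{E}[\boldsymbol{\mu}_{\mathrm{ML}}] = \boldsymbol{\mu}, \qquad \mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\,\boldsymbol{\Sigma}, \qquad \widetilde{\boldsymbol{\Sigma}} = \frac{N}{N-1}\,\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}, \]

so that $\mathbb{E}[\widetilde{\boldsymbol{\Sigma}}] = \boldsymbol{\Sigma}$.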
28  Gaussian Mixtures
- Linear super-position of Gaussians
- Normalization and positivity impose constraints on the mixing coefficients (see below)
- Can interpret the mixing coefficients as prior probabilities
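The mixture density and the constraints referred to above take the standard form

\[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1. \]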
29  Example: Mixture of 3 Gaussians
30  Contours of Probability Distribution
31  Surface Plot
32  Sampling from the Gaussian
- To generate a data point:
- first pick one of the components with probability $\pi_k$
- then draw a sample from that component
- Repeat these two steps for each new data point (see the sketch below)
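A minimal sketch of this ancestral sampling procedure in Python/NumPy. It is illustrative only and not from the slides; the particular two-dimensional mixture parameters below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 2-D mixture of 3 Gaussians: mixing coefficients, means, covariances
pi = np.array([0.5, 0.3, 0.2])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]

def sample_gmm(n):
    """Ancestral sampling: pick a component with probability pi_k, then sample from it."""
    X, labels = [], []
    for _ in range(n):
        k = rng.choice(len(pi), p=pi)                   # pick a component
        x = rng.multivariate_normal(mus[k], covs[k])    # draw from that Gaussian
        X.append(x)
        labels.append(k)
    return np.array(X), np.array(labels)

X, labels = sample_gmm(500)
```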
33  Synthetic Data Set
34  Fitting the Gaussian Mixture
- We wish to invert this process: given the data set, find the corresponding parameters
- mixing coefficients
- means
- covariances
- If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
- Problem: the data set is unlabelled
- We shall refer to the labels as latent (hidden) variables
35  Synthetic Data Set Without Labels
36  Posterior Probabilities
- We can think of the mixing coefficients as prior probabilities for the components
- For a given value of $\mathbf{x}$ we can evaluate the corresponding posterior probabilities, called responsibilities
- These are given from Bayes' theorem by the expression below
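The responsibility formula itself is not reproduced in the transcript; applying Bayes' theorem to the mixture gives the standard expression

\[ \gamma_k(\mathbf{x}) \equiv p(k\,|\,\mathbf{x}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \]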
37  Posterior Probabilities (colour coded)
38  Posterior Probability Map
39  Maximum Likelihood for the GMM
- The log likelihood function takes the form shown below
- Note: the sum over components appears inside the log
- There is no closed-form solution for maximum likelihood
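The log likelihood referred to above, with the sum over components inside the logarithm, is

\[ \ln p(\mathbf{X}\,|\,\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}. \]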
40  Over-fitting in Gaussian Mixture Models
- Singularities in the likelihood function when a component collapses onto a data point: its variance can be shrunk towards zero, driving the likelihood to infinity
- Likelihood function gets larger as we add more components (and hence parameters) to the model
- not clear how to choose the number K of components
41  Problems and Solutions
- How to maximize the log likelihood?
- solved by the expectation-maximization (EM) algorithm
- How to avoid singularities in the likelihood function?
- solved by a Bayesian treatment
- How to choose the number K of components?
- also solved by a Bayesian treatment
42  EM Algorithm: Informal Derivation
- Let us proceed by simply differentiating the log likelihood
- Setting the derivative with respect to the means equal to zero gives a solution which is simply the responsibility-weighted mean of the data (see the updates below)
43  EM Algorithm: Informal Derivation
- Similarly for the covariances
- For the mixing coefficients, use a Lagrange multiplier to give the result below
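The update equations referred to on these two slides have the standard form, with $\gamma(z_{nk})$ the responsibilities and $N_k = \sum_n \gamma(z_{nk})$:

\[ \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}, \qquad \pi_k = \frac{N_k}{N}. \]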
44  EM Algorithm: Informal Derivation
- The solutions are not closed form since they are coupled
- Suggests an iterative scheme for solving them (see the sketch below)
- Make initial guesses for the parameters
- Alternate between the following two stages:
- E-step: evaluate responsibilities
- M-step: update parameters using the ML results
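A minimal sketch of this iterative scheme in Python, using NumPy and SciPy. It is illustrative only; the function name, the initialization, the fixed iteration count, and the small variance floor (the only safeguard against the singularities discussed earlier) are my own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture: E-step computes responsibilities, M-step re-fits parameters."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: gamma[n, k] proportional to pi_k * N(x_n | mu_k, cov_k)
        gamma = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=cov[k]) for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, cov, gamma
```

In practice one would monitor the log likelihood and stop at convergence rather than running a fixed number of iterations.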
45-50  (No transcript: figures only)
51  Digression: Probabilistic Graphical Models
- Graphical representation of a probabilistic model
- Each variable corresponds to a node in the graph
- Links in the graph denote relations between variables
- Motivation:
- visualization of models and motivation for new models
- graphical determination of conditional independence
- complex calculations (inference) performed using graphical operations (e.g. forward-backward for HMMs)
- Here we consider directed graphs
52  Example: 3 Variables
- General distribution over 3 variables
- Apply the product rule of probability twice (see below)
- Express as a directed graph
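Applying the product rule twice to three variables $a$, $b$, $c$ gives, for example,

\[ p(a, b, c) = p(c\,|\,a, b)\, p(a, b) = p(c\,|\,a, b)\, p(b\,|\,a)\, p(a), \]

which corresponds to a directed graph with a link from $a$ to $b$, and links from $a$ and $b$ to $c$.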
53  General Decomposition Formula
- Joint distribution is a product of conditionals, each conditioned on its parent nodes (see below)
- Example
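The slide's specific example is not in the transcript; the general decomposition, with $\mathrm{pa}_k$ denoting the parents of node $x_k$ in the directed graph, is

\[ p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k\,|\,\mathrm{pa}_k). \]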
54  EM: Latent Variable Viewpoint
- Binary latent variables describing which component generated each data point
- Conditional distribution of the observed variable
- Prior distribution of the latent variables
- Marginalizing over the latent variables, we obtain the Gaussian mixture (see below)
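With a 1-of-K binary latent vector $\mathbf{z}$ (so that $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$), the distributions referred to above take the standard form

\[ p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad p(\mathbf{x}\,|\,\mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}, \qquad p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x}\,|\,\mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \]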
55  Expected Value of Latent Variable
56  Graphical Representation of GMM
57  Complete and Incomplete Data
[Figure: complete (labelled) versus incomplete (unlabelled) data]
58  Graph for Complete-Data Model
59  Latent Variable View of EM
- If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
- We don't know the values of the latent variables
- However, for given parameter values we can compute the expected values of the latent variables
60  Expected Complete-Data Log Likelihood
- Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
- Use these to evaluate the responsibilities
- Consider the expected complete-data log likelihood, where the responsibilities are computed using the guessed parameters
- We are implicitly filling in the latent variables with our best guess
- Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results
61  K-means Revisited
- Consider a GMM with common covariances $\epsilon\mathbf{I}$
- Take the limit $\epsilon \rightarrow 0$
- Responsibilities become binary
- Expected complete-data log likelihood becomes (up to an additive constant and a scale factor) the negative of the K-means cost function $J$
62  EM in General
- Consider an arbitrary distribution $q(\mathbf{Z})$ over the latent variables
- The following decomposition always holds (see below)
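The decomposition referred to above is not reproduced in the transcript; its standard form is

\[ \ln p(\mathbf{X}\,|\,\boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p), \]

where

\[ \mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z}\,|\,\boldsymbol{\theta})}{q(\mathbf{Z})}, \qquad \mathrm{KL}(q \,\|\, p) = - \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z}\,|\,\mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}. \]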
63  Decomposition
64  Optimizing the Bound
- E-step: maximize the bound with respect to $q(\mathbf{Z})$
- equivalent to minimizing the KL divergence
- sets $q(\mathbf{Z})$ equal to the posterior distribution
- M-step: maximize the bound with respect to the parameters $\boldsymbol{\theta}$
- equivalent to maximizing the expected complete-data log likelihood
- Each EM cycle must increase the incomplete-data likelihood unless it is already at a (local) maximum
65  E-step
66  M-step
67  Bayesian Inference
- Include prior distributions over the parameters
- Advantages in using conjugate priors
- Example: consider a single Gaussian over one variable
- assume the variance is known and the mean is unknown
- likelihood function for the mean
- Choose a Gaussian prior for the mean
68  Bayesian Inference for a Gaussian
- The posterior (proportional to the product of prior and likelihood) will then also be Gaussian (see below)
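With likelihood $\prod_n \mathcal{N}(x_n\,|\,\mu, \sigma^2)$ (known $\sigma^2$) and prior $\mathcal{N}(\mu\,|\,\mu_0, \sigma_0^2)$, the posterior referred to above is $\mathcal{N}(\mu\,|\,\mu_N, \sigma_N^2)$ with the standard parameters

\[ \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \]

where $\mu_{\mathrm{ML}} = \frac{1}{N}\sum_n x_n$.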
69  Bayesian Inference for a Gaussian
70  Bayesian Mixture of Gaussians
- Conjugate priors for the parameters
- Dirichlet prior for the mixing coefficients
- Normal-Wishart prior for the means and precisions, where the Wishart distribution is the conjugate prior for the precision of a multivariate Gaussian
71  Graphical Representation
- Parameters and latent variables appear on an equal footing
72  Variational Inference
- As with many Bayesian models, exact inference for the mixture of Gaussians is intractable
- Approximate Bayesian inference is traditionally based on Laplace's method (local Gaussian approximation to the posterior) or Markov chain Monte Carlo
- Variational inference is an alternative, broadly applicable deterministic approximation scheme
73  General View of Variational Inference
- Consider again the previous decomposition, but where the posterior is over all latent variables and parameters
- Maximizing the bound over $q$ would give the true posterior distribution, but this is intractable by definition
74  Factorized Approximation
- Goal: choose a family of distributions which are
- sufficiently flexible to give a good posterior approximation
- sufficiently simple to remain tractable
- Here we consider factorized distributions
- No further assumptions are required!
- Optimal solution for one factor, keeping the remainder fixed (see below)
- Coupled solutions, so initialize and then cyclically update
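The factorization and the optimal-factor equation referred to above take the standard form

\[ q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i), \qquad \ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\big[ \ln p(\mathbf{X}, \mathbf{Z}) \big] + \text{const}, \]

where the expectation is taken with respect to all the factors $q_i$ with $i \neq j$.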
75  Lower Bound
- Can also be evaluated
- Useful for maths/code verification
- Also useful for model comparison
76-77  (No transcript: figures only)
78  Illustration: Univariate Gaussian
- Likelihood function
- Conjugate priors
- Factorized variational distribution
79  Variational Posterior Distribution
80  Initial Configuration
81  After Updating
82  After Updating
83  Converged Solution
84  Exact Solution
- For this very simple example there is an exact solution
- The expected precision can be written in closed form
- Compare with the earlier maximum likelihood solution
85  Variational Mixture of Gaussians
- Assume a factorized posterior distribution
- Gives an optimal solution in which the factor over the mixing coefficients is a Dirichlet, and the factor over the means and precisions is a Normal-Wishart
86  Sufficient Statistics
- The variational updates depend on the data only through responsibility-weighted sufficient statistics
- Small computational overhead compared to maximum likelihood EM
87  Variational Equations for GMM
88  Bound vs. K for Old Faithful Data
89  Bayesian Model Complexity
90  Sparse Bayes for Gaussian Mixtures
- Instead of comparing different values of K, start with a large value and prune out excess components
- Treat mixing coefficients as parameters, and maximize the marginal likelihood (Corduneanu and Bishop, AISTATS 2001)
- Gives simple re-estimation equations for the mixing coefficients; interleave with the variational updates
91-92  (No transcript: figures only)
93  General Variational Framework
- Currently, for each new model we must
- derive the variational update equations
- write application-specific code to find the solution
- Both stages are time consuming and error prone
- Can we build a general-purpose inference engine which automates these procedures?
94  Lower Bound for GMM
95  VIBES
- Variational Inference for Bayesian Networks
- Bishop and Winn (1999)
- A general inference engine using variational methods
- Models specified graphically
96  Example: Mixtures of Bayesian PCA
97  Solution
98  Local Computation in VIBES
- A key observation is that, in the general solution, the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket
- Permits a local, object-oriented implementation
99  Shared Hyper-parameters
100  Take-home Messages
- Bayesian mixture of Gaussians
- no singularities
- determines the optimal number of components
- Variational inference
- effective solution for the Bayesian GMM
- optimizes a rigorous bound
- little computational overhead compared to EM
- VIBES
- rapid prototyping of probabilistic models
- graphical specification
101  Viewgraphs, tutorials and publications available from
- http://research.microsoft.com/cmbishop