Title: Christopher M. Bishop
1  Latent Variables, Mixture Models and EM
Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003
2  Overview
- K-means clustering
- Gaussian mixtures
- Maximum likelihood and EM
- Probabilistic graphical models
- Latent variables; EM revisited
- Bayesian Mixtures of Gaussians
- Variational Inference
- VIBES
3  Old Faithful
4  Old Faithful Data Set
[Scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]
5  K-means Algorithm
- Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
- Initialize prototypes, then iterate between two phases (see the sketch below)
- E-step: assign each data point to the nearest prototype
- M-step: update prototypes to be the cluster means
- Simplest version is based on Euclidean distance
- re-scale Old Faithful data
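As an illustration of the two-phase scheme above, here is a minimal K-means sketch in Python/NumPy. It is not from the original slides; the function name `kmeans`, the random initialization, the fixed iteration count, and the synthetic example data are my own choices.

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    """Minimal K-means: alternate E-step (assign) and M-step (update means)."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to the nearest prototype (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        labels = d2.argmin(axis=1)
        # M-step: set each prototype to the mean of the points assigned to it
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return mu, labels

# Example: two well-separated synthetic 2-D clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5.0])
mu, labels = kmeans(X, K=2)
```

In practice one would also monitor the cost function and stop when the assignments no longer change, rather than running a fixed number of iterations.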
6-14  (No transcript: figures only)
15  Responsibilities
- Responsibilities $r_{nk}$ assign data points to clusters such that $r_{nk} \in \{0, 1\}$ and $\sum_k r_{nk} = 1$
- Example: 5 data points and 3 clusters
16  K-means Cost Function
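The equation on this slide is not in the transcript; the standard K-means cost (distortion measure), written with the binary responsibilities $r_{nk}$ and prototypes $\boldsymbol{\mu}_k$ introduced above, is

\[ J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 . \]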
17  Minimizing the Cost Function
- E-step: minimize $J$ w.r.t. the responsibilities $r_{nk}$
- assigns each data point to the nearest prototype
- M-step: minimize $J$ w.r.t. the prototypes $\boldsymbol{\mu}_k$
- gives each prototype set to the mean of the points in that cluster, $\boldsymbol{\mu}_k = \sum_n r_{nk}\mathbf{x}_n \big/ \sum_n r_{nk}$
- Convergence guaranteed since there is a finite number of possible settings for the responsibilities
18-19  (No transcript: figures only)
20  Limitations of K-means
- Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster
- Not clear how to choose the value of K
- Solution: replace the hard clustering of K-means with soft, probabilistic assignments
- Represents the probability distribution of the data as a Gaussian mixture model
21  The Gaussian Distribution
- Multivariate Gaussian
- Define the precision to be the inverse of the covariance
- In 1 dimension
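The equations on this slide are not in the transcript; the standard forms are the D-dimensional Gaussian and its one-dimensional special case:

\[ \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}, \qquad \mathcal{N}(x\,|\,\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}, \]

with precision $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ (or $\tau = 1/\sigma^2$ in one dimension).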
22  Likelihood Function
- Data set $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
- Assume the observed data points are generated independently
- Viewed as a function of the parameters, this is known as the likelihood function (see below)
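The likelihood referred to above is not reproduced in the transcript; for independent points it has the standard form

\[ p(\mathbf{X}\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \ln p(\mathbf{X}\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma}). \]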
23  Maximum Likelihood
- Set the parameters by maximizing the likelihood function
- Equivalently, maximize the log likelihood
24  Maximum Likelihood Solution
- Maximizing w.r.t. the mean gives the sample mean
- Maximizing w.r.t. the covariance gives the sample covariance
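The maximum likelihood solutions referred to on this slide have the standard closed forms

\[ \boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}. \]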
25  Bias of Maximum Likelihood
- Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
- The maximum likelihood solution systematically under-estimates the covariance
- This is an example of over-fitting
26  Intuitive Explanation of Over-fitting
27  Unbiased Variance Estimate
- Clearly we can remove the bias by rescaling the maximum likelihood estimate, since this gives an unbiased estimator (see below)
- Arises naturally in a Bayesian treatment (see later)
- For an infinite data set the two expressions are equal
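The expectations and the corrected estimator behind the last two content slides are, in their standard form,

\[ \mathbb{E}[\boldsymbol{\mu}_{\mathrm{ML}}] = \boldsymbol{\mu}, \qquad \mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\,\boldsymbol{\Sigma}, \qquad \widetilde{\boldsymbol{\Sigma}} = \frac{N}{N-1}\,\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}, \]

so that $\mathbb{E}[\widetilde{\boldsymbol{\Sigma}}] = \boldsymbol{\Sigma}$.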
28  Gaussian Mixtures
- Linear super-position of Gaussians
- Normalization and positivity impose constraints on the mixing coefficients (see below)
- Can interpret the mixing coefficients as prior probabilities
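The mixture density and the constraints referred to above take the standard form

\[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1. \]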
29  Example: Mixture of 3 Gaussians
30  Contours of Probability Distribution
31  Surface Plot
32  Sampling from the Gaussian
- To generate a data point:
- first pick one of the components with probability $\pi_k$
- then draw a sample from that component
- Repeat these two steps for each new data point (see the sketch below)
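A minimal sketch of this ancestral sampling procedure in Python/NumPy. It is illustrative only and not from the slides; the particular two-dimensional mixture parameters below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 2-D mixture of 3 Gaussians: mixing coefficients, means, covariances
pi = np.array([0.5, 0.3, 0.2])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]

def sample_gmm(n):
    """Ancestral sampling: pick a component with probability pi_k, then sample from it."""
    X, labels = [], []
    for _ in range(n):
        k = rng.choice(len(pi), p=pi)                   # pick a component
        x = rng.multivariate_normal(mus[k], covs[k])    # draw from that Gaussian
        X.append(x)
        labels.append(k)
    return np.array(X), np.array(labels)

X, labels = sample_gmm(500)
```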
33  Synthetic Data Set
34  Fitting the Gaussian Mixture
- We wish to invert this process: given the data set, find the corresponding parameters
- mixing coefficients
- means
- covariances
- If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
- Problem: the data set is unlabelled
- We shall refer to the labels as latent (hidden) variables
35  Synthetic Data Set Without Labels
36  Posterior Probabilities
- We can think of the mixing coefficients as prior probabilities for the components
- For a given value of $\mathbf{x}$ we can evaluate the corresponding posterior probabilities, called responsibilities
- These are given from Bayes' theorem by the expression below
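The responsibility formula itself is not reproduced in the transcript; applying Bayes' theorem to the mixture gives the standard expression

\[ \gamma_k(\mathbf{x}) \equiv p(k\,|\,\mathbf{x}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \]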
37  Posterior Probabilities (colour coded)
38  Posterior Probability Map
39  Maximum Likelihood for the GMM
- The log likelihood function takes the form shown below
- Note: the sum over components appears inside the log
- There is no closed-form solution for maximum likelihood
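The log likelihood referred to above, with the sum over components inside the logarithm, is

\[ \ln p(\mathbf{X}\,|\,\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}. \]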
40  Over-fitting in Gaussian Mixture Models
- Singularities in the likelihood function when a component collapses onto a data point: its variance can be shrunk towards zero, driving the likelihood to infinity
- Likelihood function gets larger as we add more components (and hence parameters) to the model
- not clear how to choose the number K of components
41  Problems and Solutions
- How to maximize the log likelihood?
- solved by the expectation-maximization (EM) algorithm
- How to avoid singularities in the likelihood function?
- solved by a Bayesian treatment
- How to choose the number K of components?
- also solved by a Bayesian treatment
42  EM Algorithm: Informal Derivation
- Let us proceed by simply differentiating the log likelihood
- Setting the derivative with respect to the means equal to zero gives a solution which is simply the responsibility-weighted mean of the data (see the updates below)
43  EM Algorithm: Informal Derivation
- Similarly for the covariances
- For the mixing coefficients, use a Lagrange multiplier to give the result below
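The update equations referred to on these two slides have the standard form, with $\gamma(z_{nk})$ the responsibilities and $N_k = \sum_n \gamma(z_{nk})$:

\[ \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}, \qquad \pi_k = \frac{N_k}{N}. \]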
44  EM Algorithm: Informal Derivation
- The solutions are not closed form since they are coupled
- Suggests an iterative scheme for solving them (see the sketch below)
- Make initial guesses for the parameters
- Alternate between the following two stages:
- E-step: evaluate responsibilities
- M-step: update parameters using the ML results
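A minimal sketch of this iterative scheme in Python, using NumPy and SciPy. It is illustrative only; the function name, the initialization, the fixed iteration count, and the small variance floor (the only safeguard against the singularities discussed earlier) are my own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture: E-step computes responsibilities, M-step re-fits parameters."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: gamma[n, k] proportional to pi_k * N(x_n | mu_k, cov_k)
        gamma = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=cov[k]) for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, cov, gamma
```

In practice one would monitor the log likelihood and stop at convergence rather than running a fixed number of iterations.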
45-50  (No transcript: figures only)
51  Digression: Probabilistic Graphical Models
- Graphical representation of a probabilistic model
- Each variable corresponds to a node in the graph
- Links in the graph denote relations between variables
- Motivation:
- visualization of models and motivation for new models
- graphical determination of conditional independence
- complex calculations (inference) performed using graphical operations (e.g. forward-backward for HMMs)
- Here we consider directed graphs
52  Example: 3 Variables
- General distribution over 3 variables
- Apply the product rule of probability twice (see below)
- Express as a directed graph
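Applying the product rule twice to three variables $a$, $b$, $c$ gives, for example,

\[ p(a, b, c) = p(c\,|\,a, b)\, p(a, b) = p(c\,|\,a, b)\, p(b\,|\,a)\, p(a), \]

which corresponds to a directed graph with a link from $a$ to $b$, and links from $a$ and $b$ to $c$.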
53  General Decomposition Formula
- Joint distribution is a product of conditionals, each conditioned on its parent nodes (see below)
- Example
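The slide's specific example is not in the transcript; the general decomposition, with $\mathrm{pa}_k$ denoting the parents of node $x_k$ in the directed graph, is

\[ p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p(x_k\,|\,\mathrm{pa}_k). \]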
54  EM: Latent Variable Viewpoint
- Binary latent variables describing which component generated each data point
- Conditional distribution of the observed variable
- Prior distribution of the latent variables
- Marginalizing over the latent variables, we obtain the Gaussian mixture (see below)
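With a 1-of-K binary latent vector $\mathbf{z}$ (so that $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$), the distributions referred to above take the standard form

\[ p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad p(\mathbf{x}\,|\,\mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}, \qquad p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x}\,|\,\mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \]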
55  Expected Value of Latent Variable
56  Graphical Representation of GMM
57  Complete and Incomplete Data
[Figure: complete (labelled) versus incomplete (unlabelled) data]
58  Graph for Complete-Data Model
59  Latent Variable View of EM
- If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
- We don't know the values of the latent variables
- However, for given parameter values we can compute the expected values of the latent variables
60  Expected Complete-Data Log Likelihood
- Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
- Use these to evaluate the responsibilities
- Consider the expected complete-data log likelihood, where the responsibilities are computed using the guessed parameters
- We are implicitly filling in the latent variables with our best guess
- Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results
61  K-means Revisited
- Consider a GMM with common covariances $\epsilon\mathbf{I}$
- Take the limit $\epsilon \rightarrow 0$
- Responsibilities become binary
- Expected complete-data log likelihood becomes (up to an additive constant and a scale factor) the negative of the K-means cost function $J$
62  EM in General
- Consider an arbitrary distribution $q(\mathbf{Z})$ over the latent variables
- The following decomposition always holds (see below)
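The decomposition referred to above is not reproduced in the transcript; its standard form is

\[ \ln p(\mathbf{X}\,|\,\boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p), \]

where

\[ \mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z}\,|\,\boldsymbol{\theta})}{q(\mathbf{Z})}, \qquad \mathrm{KL}(q \,\|\, p) = - \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z}\,|\,\mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}. \]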
63  Decomposition
64  Optimizing the Bound
- E-step: maximize the bound with respect to $q(\mathbf{Z})$
- equivalent to minimizing the KL divergence
- sets $q(\mathbf{Z})$ equal to the posterior distribution
- M-step: maximize the bound with respect to the parameters $\boldsymbol{\theta}$
- equivalent to maximizing the expected complete-data log likelihood
- Each EM cycle must increase the incomplete-data likelihood unless it is already at a (local) maximum
65  E-step
66  M-step
67  Bayesian Inference
- Include prior distributions over the parameters
- Advantages in using conjugate priors
- Example: consider a single Gaussian over one variable
- assume the variance is known and the mean is unknown
- likelihood function for the mean
- Choose a Gaussian prior for the mean
68  Bayesian Inference for a Gaussian
- The posterior (proportional to the product of prior and likelihood) will then also be Gaussian (see below)
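With likelihood $\prod_n \mathcal{N}(x_n\,|\,\mu, \sigma^2)$ (known $\sigma^2$) and prior $\mathcal{N}(\mu\,|\,\mu_0, \sigma_0^2)$, the posterior referred to above is $\mathcal{N}(\mu\,|\,\mu_N, \sigma_N^2)$ with the standard parameters

\[ \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \]

where $\mu_{\mathrm{ML}} = \frac{1}{N}\sum_n x_n$.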
69  Bayesian Inference for a Gaussian
70  Bayesian Mixture of Gaussians
- Conjugate priors for the parameters
- Dirichlet prior for the mixing coefficients
- Normal-Wishart prior for the means and precisions, where the Wishart distribution is the conjugate prior for the precision of a multivariate Gaussian
71  Graphical Representation
- Parameters and latent variables appear on an equal footing
72  Variational Inference
- As with many Bayesian models, exact inference for the mixture of Gaussians is intractable
- Approximate Bayesian inference is traditionally based on Laplace's method (local Gaussian approximation to the posterior) or Markov chain Monte Carlo
- Variational inference is an alternative, broadly applicable deterministic approximation scheme
73  General View of Variational Inference
- Consider again the previous decomposition, but where the posterior is over all latent variables and parameters
- Maximizing the bound over $q$ would give the true posterior distribution, but this is intractable by definition
74  Factorized Approximation
- Goal: choose a family of distributions which are
- sufficiently flexible to give a good posterior approximation
- sufficiently simple to remain tractable
- Here we consider factorized distributions
- No further assumptions are required!
- Optimal solution for one factor, keeping the remainder fixed (see below)
- Coupled solutions, so initialize and then cyclically update
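The factorization and the optimal-factor equation referred to above take the standard form

\[ q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i), \qquad \ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\big[ \ln p(\mathbf{X}, \mathbf{Z}) \big] + \text{const}, \]

where the expectation is taken with respect to all the factors $q_i$ with $i \neq j$.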
75  Lower Bound
- Can also be evaluated
- Useful for maths/code verification
- Also useful for model comparison
76-77  (No transcript: figures only)
78  Illustration: Univariate Gaussian
- Likelihood function
- Conjugate priors
- Factorized variational distribution
79  Variational Posterior Distribution
80  Initial Configuration
81  After Updating
82  After Updating
83  Converged Solution
84  Exact Solution
- For this very simple example there is an exact solution
- The expected precision can be written in closed form
- Compare with the earlier maximum likelihood solution
85  Variational Mixture of Gaussians
- Assume a factorized posterior distribution
- Gives an optimal solution in which the factor over the mixing coefficients is a Dirichlet, and the factor over the means and precisions is a Normal-Wishart
86  Sufficient Statistics
- The variational updates depend on the data only through responsibility-weighted sufficient statistics
- Small computational overhead compared to maximum likelihood EM
87  Variational Equations for GMM
88  Bound vs. K for Old Faithful Data
89  Bayesian Model Complexity
90  Sparse Bayes for Gaussian Mixtures
- Instead of comparing different values of K, start with a large value and prune out excess components
- Treat mixing coefficients as parameters, and maximize the marginal likelihood (Corduneanu and Bishop, AISTATS 2001)
- Gives simple re-estimation equations for the mixing coefficients; interleave with the variational updates
91-92  (No transcript: figures only)
93  General Variational Framework
- Currently, for each new model we must
- derive the variational update equations
- write application-specific code to find the solution
- Both stages are time consuming and error prone
- Can we build a general-purpose inference engine which automates these procedures?
94  Lower Bound for GMM
95  VIBES
- Variational Inference for Bayesian Networks
- Bishop and Winn (1999)
- A general inference engine using variational methods
- Models specified graphically
96  Example: Mixtures of Bayesian PCA
97  Solution
98  Local Computation in VIBES
- A key observation is that, in the general solution, the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket
- Permits a local, object-oriented implementation
99  Shared Hyper-parameters
100  Take-home Messages
- Bayesian mixture of Gaussians
- no singularities
- determines the optimal number of components
- Variational inference
- effective solution for the Bayesian GMM
- optimizes a rigorous bound
- little computational overhead compared to EM
- VIBES
- rapid prototyping of probabilistic models
- graphical specification
101  Viewgraphs, tutorials and publications available from
- http://research.microsoft.com/cmbishop