Title: An Introduction to MCMC for Machine Learning (Markov Chain Monte Carlo)
1. An Introduction to MCMC for Machine Learning
(Markov Chain Monte Carlo)
Young Ki Baik
Computer Vision Lab., SNU
2. References
- An Introduction to MCMC for Machine Learning
  - Andrieu et al. (Machine Learning, 2003)
- Introduction to Monte Carlo Methods
  - David MacKay
- Markov Chain Monte Carlo for Computer Vision
  - Zhu, Dellaert and Tu (a tutorial at ICCV 2005)
  - http://civs.stat.ucla.edu/MCMC/MCMC_tutorial.htm
- Various MCMC presentations on the web
3. Contents
- MCMC
- Metropolis-Hastings algorithm
- Mixtures and cycles of MCMC kernels
- Auxiliary variable samplers
- Adaptive MCMC
- Other applications of MCMC
- Convergence problems and tricks of MCMC
- Remaining problems
- Conclusion
4. MCMC
- Problems with plain Monte Carlo (MC)
  - Assembling the entire distribution for MC is usually hard:
    - Complicated energy landscapes
    - High-dimensional systems
    - Extraordinarily difficult normalization
- Solution: MCMC
  - Build up the distribution from a Markov chain.
  - Choose local transition probabilities that generate the distribution of interest (ensure detailed balance).
  - Each random variable is chosen based on the previous variable in the chain.
  - Walk along the Markov chain until convergence is reached.
- Result: normalization is not required, and calculations are local.
5. MCMC
- What is a Markov chain?
  - A Markov chain is a mathematical model for a stochastic system that generates random variables X1, X2, ..., Xt.
  - The distribution of the next random variable depends only on the current random variable.
  - Under suitable conditions, the chain settles into a stationary probability distribution.
6. MCMC
- What is Markov Chain Monte Carlo?
  - MCMC is a general-purpose technique for generating fair samples from a probability distribution in a high-dimensional space, using random numbers (dice) drawn from a uniform distribution over a certain range.
  - (Figure: Markov chain states, driven by independent trials of the dice.)
7. MCMC
- MCMC as a general-purpose computing technique
  - Task 1: Simulation. Draw fair (typical) samples from a probability distribution that governs a system.
  - Task 2: Integration. Compute expectations and integrals in very high dimensions.
  - Task 3: Optimization, with an annealing scheme.
  - Task 4: Learning. Unsupervised learning with hidden variables (simulated from the posterior), or MLE learning of parameters, needs simulations as well.
8. MCMC
- Some notation
  - The stochastic process {x_i} is called a Markov chain if
    p(x_i | x_{i-1}, ..., x_1) = T(x_i | x_{i-1}).
  - The chain is homogeneous if T remains invariant for all i, with
    sum over x_i of T(x_i | x_{i-1}) = 1 for any i.
  - The chain then depends solely on its current state and a fixed transition matrix.
9. MCMC
- Example
  - Transition graph for a Markov chain with three states (s = 3), with edge probabilities 1, 0.1, 0.9, 0.6, and 0.4.
  - Transition matrix (read off the graph):
    T = [[0, 1, 0], [0, 0.1, 0.9], [0.6, 0.4, 0]]
  - For any initial state distribution mu(x_1), the product mu(x_1) T^t converges to the same stationary distribution p(x) as t grows.
  - This stability result plays a fundamental role in MCMC.
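The stability result above can be checked numerically. A minimal sketch, assuming the transition matrix reconstructed from the slide's edge labels (1, 0.1, 0.9, 0.6, 0.4):

```python
import numpy as np

# Three-state transition matrix; the entries are an assumption
# reconstructed from the transition graph's edge labels.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

# An arbitrary initial distribution over the three states.
mu = np.array([0.5, 0.2, 0.3])

# Repeated application of T drives mu toward the invariant
# distribution, regardless of the starting point.
for _ in range(100):
    mu = mu @ T

print(mu)  # approximately (0.221, 0.410, 0.369)
```

Each row of T sums to 1, so T is a valid stochastic matrix; the loop simply computes mu T^100.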
10. MCMC
- Convergence properties
  - For any starting point, the chain will converge to the invariant distribution p(x), as long as T is a stochastic transition matrix that obeys the following properties:
  - 1) Irreducibility
    - Every state must be (eventually) reachable from every other state.
  - 2) Aperiodicity
    - This stops the chain from oscillating between different states in a periodic way.
  - 3) Reversibility (detailed balance)
    - p(x) T(x' | x) = p(x') T(x | x'); this keeps the system in its stationary distribution.
  - In the discrete case T is a transition matrix; in the continuous case it is a kernel (proposal distribution).
11. MCMC
- Eigen-analysis
  - From spectral theory, p(x) is the left eigenvector of the matrix T with corresponding eigenvalue 1: p T = p. The largest eigenvalue lambda_1 is always 1, and its left eigenvector is the stationary distribution.
  - The second-largest eigenvalue determines the rate of convergence of the chain, and should be as small as possible.
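The eigen-analysis can be illustrated directly with NumPy, again using the assumed three-state matrix from the earlier example:

```python
import numpy as np

# Assumed three-state transition matrix from the earlier example.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

# Left eigenvectors of T are right eigenvectors of T transposed.
eigvals, eigvecs = np.linalg.eig(T.T)

# The eigenvector for eigenvalue 1 is the stationary distribution,
# once normalized to sum to one.
idx = int(np.argmin(np.abs(eigvals - 1.0)))
p = np.real(eigvecs[:, idx])
p = p / p.sum()

# The modulus of the second-largest eigenvalue governs how fast the
# chain forgets its starting point.
moduli = sorted(np.abs(eigvals), reverse=True)
print(p, moduli[1])
```

Since the second-largest eigenvalue modulus here is strictly below 1, repeated application of T contracts any initial distribution onto p.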
12. Metropolis-Hastings algorithm
- The MH algorithm
  - The most popular MCMC method
  - Invariant distribution p(x)
  - Proposal distribution q(x* | x)
  - Candidate value x*
  - Acceptance probability
    A(x, x*) = min{1, [p(x*) q(x | x*)] / [p(x) q(x* | x)]}
  - Kernel K_MH(x' | x)
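A minimal random-walk MH sketch in Python; the bimodal target below is an assumption chosen for illustration, and the symmetric Gaussian proposal makes the q ratio cancel in A(x, x*):

```python
import math
import random

def p_unnorm(x):
    # Assumed bimodal target, known only up to its normalizing constant:
    # p(x) proportional to 0.3 exp(-0.2 x^2) + 0.7 exp(-0.2 (x - 10)^2)
    return 0.3 * math.exp(-0.2 * x * x) + 0.7 * math.exp(-0.2 * (x - 10.0) ** 2)

def metropolis_hastings(n_samples, sigma=10.0, x0=0.0, seed=1):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_star = rng.gauss(x, sigma)   # candidate from q(x* | x)
        # Symmetric proposal: q(x | x*) / q(x* | x) = 1, so the
        # acceptance probability reduces to min(1, p(x*) / p(x)).
        if rng.random() < min(1.0, p_unnorm(x_star) / p_unnorm(x)):
            x = x_star                 # accept; otherwise keep x
        samples.append(x)
    return samples

samples = metropolis_hastings(5000)
```

Because only ratios of p appear, the intractable normalizing constant is never needed, which is the whole point of the method.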
13. Metropolis-Hastings algorithm
- Results of running the MH algorithm
  - (Figure: target distribution p(x) and proposal distribution q(x* | x).)
14. Metropolis-Hastings algorithm
- Different choices of the proposal standard deviation sigma
  - MH requires careful design of the proposal distribution.
  - If sigma is too narrow, only one mode of p(x) might be visited.
  - If sigma is too wide, the rejection rate can be high.
  - If all the modes are visited while the acceptance probability is high, the chain is said to mix well.
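The effect of the proposal width can be demonstrated empirically. A small self-contained experiment, using an assumed bimodal target, that estimates the acceptance rate for a narrow, a moderate, and a very wide proposal:

```python
import math
import random

def p_unnorm(x):
    # Assumed bimodal target used for illustration.
    return 0.3 * math.exp(-0.2 * x * x) + 0.7 * math.exp(-0.2 * (x - 10.0) ** 2)

def acceptance_rate(sigma, n=5000, seed=1):
    # Random-walk MH with proposal standard deviation sigma;
    # returns the fraction of accepted candidates.
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n):
        x_star = rng.gauss(x, sigma)
        if rng.random() < min(1.0, p_unnorm(x_star) / p_unnorm(x)):
            x = x_star
            accepted += 1
    return accepted / n

for sigma in (0.1, 10.0, 100.0):
    print(sigma, acceptance_rate(sigma))
```

A narrow proposal accepts almost everything but explores slowly and may never leave one mode; a very wide one mostly lands in low-density regions and is rejected, which is the trade-off the slide describes.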
15. Mixtures and cycles of MCMC kernels
- Mixture and cycle
  - It is possible to combine several samplers into mixtures and cycles of the individual samplers.
  - If the transition kernels K1 and K2 each have invariant distribution p(x), then the cycle hybrid kernel K1 K2 and the mixture hybrid kernel v K1 + (1 - v) K2, for 0 < v < 1, are also transition kernels with invariant distribution p(x).
16. Mixtures and cycles of MCMC kernels
- Mixtures of kernels
  - Incorporate global proposals to explore vast regions of the state space and local proposals to discover finer details of the target distribution.
  - -> suited to target distributions with many narrow peaks
  - (e.g. the reversible-jump MCMC algorithm)
17. Mixtures and cycles of MCMC kernels
- Cycles of kernels
  - Split a multivariate state vector into components (blocks)
    -> each block can be updated separately.
  - -> block highly correlated variables together
  - (e.g. the Gibbs sampling algorithm)
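A Gibbs sampler is exactly such a cycle of kernels: each component is updated in turn from its full conditional. A sketch for an assumed standard bivariate Gaussian with correlation rho, whose conditionals are known in closed form:

```python
import random

def gibbs_bivariate_normal(n, rho=0.8, seed=1):
    # Cycle of two kernels for a standard bivariate Gaussian:
    #   x | y ~ N(rho * y, 1 - rho^2)
    #   y | x ~ N(rho * x, 1 - rho^2)
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    s = (1.0 - rho * rho) ** 0.5   # conditional standard deviation
    samples = []
    for _ in range(n):
        x = rng.gauss(rho * y, s)  # update block 1 given block 2
        y = rng.gauss(rho * x, s)  # update block 2 given block 1
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(20000)
```

Every conditional update is accepted with probability 1, so no acceptance test is needed. The slide's caveat applies here: when rho is close to 1, the coordinate-wise cycle mixes slowly, which is why highly correlated variables are better updated as one block.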
18. Auxiliary variable samplers
- Auxiliary variables
  - It is often easier to sample from an augmented distribution p(x, u), where u is an auxiliary variable.
  - Marginal samples of x can be obtained by sampling pairs (x, u) and ignoring the u component.
  - Hybrid Monte Carlo (HMC)
    - Uses gradient information in the proposal.
  - Slice sampling
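Slice sampling makes the auxiliary-variable idea concrete: u is drawn uniformly under the density curve, and given u the new x is drawn uniformly from the "slice" {x : p(x) > u}. A univariate sketch with stepping-out and shrinkage, using an assumed unnormalized Gaussian target:

```python
import math
import random

def p_unnorm(x):
    # Assumed target: an unnormalized standard Gaussian.
    return math.exp(-0.5 * x * x)

def slice_sample(p, x0, n, w=1.0, seed=1):
    rng = random.Random(seed)
    x = x0
    out = []
    for _ in range(n):
        # Auxiliary variable: u ~ Uniform(0, p(x)) defines the slice.
        u = rng.uniform(0.0, p(x))
        # Step out: grow an interval of width w until it brackets the slice.
        left = x - rng.uniform(0.0, w)
        right = left + w
        while p(left) > u:
            left -= w
        while p(right) > u:
            right += w
        # Shrink: sample uniformly in the interval, narrowing it on rejection.
        while True:
            x_new = rng.uniform(left, right)
            if p(x_new) > u:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        out.append(x)  # keep x, discard u: a marginal sample from p
    return out

samples = slice_sample(p_unnorm, 0.0, 3000)
```

As the slide says, discarding u leaves marginal samples of x; no proposal standard deviation has to be tuned, only the step width w.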
19. Adaptive MCMC
- Adaptive selection of the proposal distribution
  - The variance of the proposal distribution is important.
  - Goal: automate the process of choosing the proposal distribution as much as possible.
- Problem
  - Adaptation can disturb the stationary distribution.
  - Gelfand and Sahu (1994): the stationary distribution is disturbed despite the fact that each participating kernel has the same stationary distribution.
- Avoidance
  - Carry out adaptation only for an initial fixed number of steps.
  - Parallel chains
  - And so on
  - -> inefficient; much more research is required.
20. Other applications of MCMC
- Simulated annealing for global optimization
  - To find the global maximum of p(x)
- Monte Carlo EM
  - To compute a fast approximation for the E-step
- Sequential Monte Carlo methods and particle filters
  - To carry out on-line approximation of probability distributions using samples
  - -> using parallel sampling
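The simulated-annealing idea can be sketched by running MH on the tempered target p(x)^(1/T_i) with a decreasing temperature T_i, so the chain concentrates on the global maximum. Everything below (target, cooling schedule, proposal width) is an assumed illustration:

```python
import math
import random

def p_unnorm(x):
    # Assumed bimodal objective; its global maximum is near x = 10.
    return 0.3 * math.exp(-0.2 * x * x) + 0.7 * math.exp(-0.2 * (x - 10.0) ** 2)

def anneal(n_steps=3000, seed=1):
    rng = random.Random(seed)
    x, best = 0.0, 0.0
    for i in range(n_steps):
        temp = max(0.01, 1.0 - i / n_steps)   # linear cooling schedule
        x_star = rng.gauss(x, 5.0)            # random-walk proposal
        r = p_unnorm(x_star) / p_unnorm(x)
        # Tempered MH acceptance min(1, r)^(1/temp), written so that
        # large ratios are never raised to a huge power.
        accept_p = 1.0 if r >= 1.0 else r ** (1.0 / temp)
        if rng.random() < accept_p:
            x = x_star
        if p_unnorm(x) > p_unnorm(best):
            best = x                          # track the best point seen
    return best

best = anneal()
```

Early on (high temperature) the chain moves freely between modes; as the temperature falls, downhill moves become increasingly unlikely and the chain settles near the global maximum.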
21. Convergence problems and tricks of MCMC
- Convergence problem
  - Determining the length of the Markov chain is a difficult task.
- Tricks
  - Initialization (to reduce starting biases)
    - Discard an initial set of samples (burn-in).
    - Set the initial sample value manually.
  - Markov chain tests
    - Apply several graphical and statistical tests to assess whether the chain has stabilized.
    - -> These do not provide entirely satisfactory diagnostics.
  - Further study of the convergence problem is needed.
22. Remaining problems
- Large-dimensional models
  - The combination of sampling algorithms with either gradient optimization or exact methods.
- Massive data sets
  - A few solutions based on importance sampling have been proposed.
- Many and varied applications
  - -> But there is still great room for innovation in this area.
23. Conclusion
- MCMC
  - The Markov Chain Monte Carlo methods cover a variety of different fields and applications.
  - There are great opportunities for combining existing sub-optimal algorithms with MCMC in many machine learning problems.
  - Some areas already benefiting from sampling methods include: tracking, restoration, segmentation, probabilistic graphical models, classification, data association for localization, and classical mixture models.