Title: Introduction to Expectation Maximization

1. Introduction to Expectation Maximization
Assembled and extended by Longin Jan Latecki, Temple University, latecki@temple.edu, based on slides by
- Andrew Blake, Microsoft Research, and Bill Freeman, MIT, ICCV 2003
- Andrew W. Moore, Carnegie Mellon University
2. Learning and vision: Generative Methods
- Machine learning is an important direction in computer vision.
- Our goal for this class:
  - Give overviews of useful techniques.
  - Show how these methods can be used in vision.
  - Provide references and pointers.
3. What is the goal of vision?
If you are asking, "Are there any faces in this image?", then you would probably want to use discriminative methods.
4. What is the goal of vision?
If you are asking, "Are there any faces in this image?", then you would probably want to use discriminative methods.
If you are asking, "Find a 3-d model that describes the runner", then you would use generative methods.
5. Modeling
- So we want to look at high-dimensional visual data and fit models to it, forming summaries of it that let us understand what we see.
6. The simplest data to model: a set of 1-d samples
7. Fit this distribution with a Gaussian
8. How do we find the parameters of the best-fitting Gaussian?
9. How do we find the parameters of the best-fitting Gaussian?
10. Derivation of MLE for Gaussians
Observation density
Log likelihood
Maximisation
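A standard reconstruction of these three steps, assuming N i.i.d. scalar samples x_1, ..., x_N from N(µ, σ²):

Observation density:  p(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

Log likelihood:  L(\mu, \sigma^2) = \sum_{i=1}^{N} \log p(x_i \mid \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}

Maximisation:  \frac{\partial L}{\partial \mu} = 0, \; \frac{\partial L}{\partial \sigma^2} = 0 \;\Rightarrow\; \hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2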
11. Basic Maximum Likelihood Estimate (MLE) of a Gaussian distribution
Mean
Variance
Covariance Matrix
12. Basic Maximum Likelihood Estimate (MLE) of a Gaussian distribution
Mean
Variance
For vector-valued data, we have the Covariance Matrix
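The resulting estimates, in standard form:

Mean:  \hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i

Variance:  \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2

Covariance matrix (vector-valued data):  \hat\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)(x_i - \hat\mu)^\top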
13. Model fitting example 2: fit a line to observed data
[Figure: scatter of observed data points in the (x, y) plane with a fitted line]
14. Maximum likelihood estimation for the slope of a single line
The maximum likelihood estimate of the slope gives the usual regression formula.
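A reconstruction of that estimate, assuming (as in this tutorial's setup) a zero-intercept line y_i = a x_i + n_i with Gaussian noise n_i ~ N(0, σ²):

\hat{a} = \arg\max_a \sum_i \log N(y_i;\, a x_i, \sigma^2) = \arg\min_a \sum_i (y_i - a x_i)^2, \quad \text{which gives} \quad \hat{a} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}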
15. Model fitting example 3: fitting two lines to observed data
[Figure: scatter of observed data points in the (x, y) plane with two candidate lines]
16. MLE for fitting a line pair
(a form of mixture distribution for the observations)
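Under the same assumptions as the single-line case (zero-intercept lines with slopes a_1 and a_2, noise variance σ², equal mixing weights), the mixture likelihood is:

p(y_i \mid a_1, a_2) = \tfrac{1}{2}\, N(y_i;\, a_1 x_i, \sigma^2) + \tfrac{1}{2}\, N(y_i;\, a_2 x_i, \sigma^2), \qquad \log p(y \mid a_1, a_2) = \sum_i \log p(y_i \mid a_1, a_2)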
17. Fitting two lines: on the one hand
If we knew which points went with which lines, we'd be back at the single line-fitting problem, twice.
[Figure: (x, y) scatter of the two-line data]
18. Fitting two lines: on the other hand
We could figure out the probability that any point came from either line if we just knew the two equations for the two lines.
[Figure: (x, y) scatter of the two-line data]
19. Expectation Maximization (EM): a solution to chicken-and-egg problems
20. EM example
21. EM example
22. EM example
23. EM example
24. EM example
Converged!
25. MLE with hidden/latent variables: Expectation Maximisation
General problem:
- data y
- parameters θ
- hidden variables z
For MLE, we want to maximise the log likelihood. The sum over z inside the log gives a complicated expression for the ML solution.
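In the notation above, the quantity to maximise is

\log p(y \mid \theta) = \log \sum_{z} p(y, z \mid \theta),

and it is the sum over z inside the log that makes direct maximisation hard.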
26. The EM algorithm
We don't know the values of the labels z_i, but let's use their expected values under the posterior with the current parameter values, θ_old. That gives us the expectation step (E-step).
Now let's maximize this Q function, an expected log-likelihood, over the parameter values, giving the maximization step (M-step).
Each iteration increases the total log-likelihood log p(y|θ).
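In standard form, the two steps are:

E-step:  Q(\theta, \theta^{\text{old}}) = \sum_{z} p(z \mid y, \theta^{\text{old}}) \, \log p(y, z \mid \theta)

M-step:  \theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})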
27. Expectation Maximisation applied to fitting the two lines
Hidden variables z_i associate each data point with a line, and the probabilities of association are given by the posterior under the current line parameters.
28. EM fitting to two lines
E-step: compute the association probabilities of each data point with each of the two lines, using the current line parameters.
M-step: the regression becomes a weighted least-squares fit for each line, weighted by the association probabilities.
Repeat until convergence.
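As a concrete illustration, here is a minimal numpy sketch of this procedure (not the authors' code; it assumes zero-intercept lines, a known noise standard deviation sigma, and equal 1/2 mixing weights, and all names are illustrative):

import numpy as np

def em_two_lines(x, y, sigma=1.0, n_iter=50):
    """Fit slopes a1, a2 of two lines y ~ a*x by EM (equal 1/2 mixing weights)."""
    a1, a2 = 0.1, 1.0  # arbitrary initial guesses
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to line 1.
        d1 = np.exp(-(y - a1 * x) ** 2 / (2 * sigma ** 2))
        d2 = np.exp(-(y - a2 * x) ** 2 / (2 * sigma ** 2))
        r1 = d1 / (d1 + d2)          # responsibility of line 1
        r2 = 1.0 - r1                # responsibility of line 2
        # M-step: weighted least-squares regression for each slope.
        a1 = np.sum(r1 * x * y) / np.sum(r1 * x ** 2)
        a2 = np.sum(r2 * x * y) / np.sum(r2 * x ** 2)
    return a1, a2

# Synthetic example: points generated from two lines plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
labels = rng.integers(0, 2, 200)
y = np.where(labels == 0, 0.5 * x, 2.0 * x) + rng.normal(0, 0.3, 200)
print(em_two_lines(x, y, sigma=0.3))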
29. Experiments: EM fitting to two lines
(from a tutorial by Yair Weiss, http://www.cs.huji.ac.il/yweiss/tutorials.html)
[Figure: line weights for line 1 and line 2 plotted over iterations 1, 2, 3]
30. Applications of EM in computer vision
- Image segmentation
- Motion estimation combined with perceptual grouping
- Polygonal approximation of edges
31. Next: back to density estimation. What if we want to do density estimation with multimodal or "clumpy" data?
32. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
[Figure: component means µ_2, µ_3]
33. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
[Figure: component means µ_2, µ_3]
34. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(w_i).
[Figure: component mean µ_2]
35. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(w_i).
  2. Datapoint ~ N(µ_i, σ²I).
[Figure: a datapoint x drawn around component mean µ_2]
36. The General GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i.
- Assume that each datapoint is generated according to the following recipe (a code sketch follows this slide):
  1. Pick a component at random. Choose component i with probability P(w_i).
  2. Datapoint ~ N(µ_i, Σ_i).
[Figure: component means µ_2, µ_3]
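A minimal numpy sketch of this generative recipe (the weights, means, and covariances below are illustrative values, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-d GMM with k = 3 components.
weights = np.array([0.5, 0.3, 0.2])                        # P(w_i)
means = np.array([[0.0, 0.0], [4.0, 4.0], [-3.0, 5.0]])    # mu_i
covs = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])  # Sigma_i

def sample_gmm(n):
    # 1. Pick a component at random with probability P(w_i).
    comps = rng.choice(len(weights), size=n, p=weights)
    # 2. Draw each datapoint from N(mu_i, Sigma_i).
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])

data = sample_gmm(500)   # 500 unlabeled 2-d datapoints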
37. Unsupervised Learning: not as hard as it looks
Sometimes easy. Sometimes impossible. And sometimes in between.
[Figures: 2-d unlabeled data (x vectors) distributed in 2-d space; the top one has three very clear Gaussian centers]
38. Computing likelihoods in the unsupervised case
- We have x_1, x_2, ..., x_N.
- We know P(w_1), P(w_2), ..., P(w_k).
- We know σ.
- P(x | w_i, µ_1, ..., µ_k) = probability that an observation from class w_i would have value x, given class means µ_1 ... µ_k.
- Can we write an expression for that?
39. Likelihoods in the unsupervised case
- We have x_1, x_2, ..., x_n.
- We have P(w_1), ..., P(w_k). We have σ.
- We can define, for any x, P(x | w_i, µ_1, µ_2, ..., µ_k).
- Can we define P(x | µ_1, µ_2, ..., µ_k)?
- Can we define P(x_1, x_2, ..., x_n | µ_1, µ_2, ..., µ_k)?
YES, IF WE ASSUME THE x_i'S WERE DRAWN INDEPENDENTLY.
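With the independence assumption, the standard expressions are:

P(x \mid \mu_1, \ldots, \mu_k) = \sum_{i=1}^{k} P(x \mid w_i, \mu_1, \ldots, \mu_k)\, P(w_i)

P(x_1, \ldots, x_n \mid \mu_1, \ldots, \mu_k) = \prod_{j=1}^{n} P(x_j \mid \mu_1, \ldots, \mu_k)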
40. Unsupervised Learning: Mediumly Good News
We now have a procedure such that, if you give me a guess at µ_1, µ_2, ..., µ_k, I can tell you the probability of the unlabeled data given those µ's.
Suppose the x's are 1-dimensional and there are two classes, w_1 and w_2, with P(w_1) = 1/3, P(w_2) = 2/3, and σ = 1. There are 25 unlabeled datapoints (from Duda and Hart):
x_1 = 0.608, x_2 = -1.590, x_3 = 0.235, x_4 = 3.949, ..., x_25 = -0.712
41. Duda & Hart's Example
Graph of log P(x_1, x_2, ..., x_25 | µ_1, µ_2) against µ_1 and µ_2.
Max likelihood: (µ_1 = -2.13, µ_2 = 1.668). There is a second local optimum, very close to the global one, at (µ_1 = 2.085, µ_2 = -1.257); it corresponds to switching w_1 with w_2.
42. Duda & Hart's Example
We can graph the probability distribution function of the data given our µ_1 and µ_2 estimates.
We can also graph the true function from which the data was randomly generated.
- They are close. Good.
- The 2nd solution tries to put the 2/3 hump where the 1/3 hump should go, and vice versa.
- In this example, unsupervised learning is almost as good as supervised. If x_1, ..., x_25 are given the classes which were used to generate them, then the results are (µ_1 = -2.176, µ_2 = 1.684). Unsupervised got (µ_1 = -2.13, µ_2 = 1.668).
43. Finding the max likelihood µ_1, µ_2, ..., µ_k
- We can compute P(data | µ_1, µ_2, ..., µ_k).
- How do we find the µ_i's which give max likelihood?
- The normal max likelihood trick:
  - Set ∂/∂µ_i log Prob(...) = 0 and solve for the µ_i's.
  - Here you get non-linear, non-analytically-solvable equations.
- Use gradient descent: slow but doable.
- Or use a much faster, cuter, and recently very popular method...
44. Expectation Maximization
45. The E.M. Algorithm
DETOUR
- We'll get back to unsupervised learning soon.
- But now we'll look at an even simpler case with hidden information.
- The EM algorithm:
  - Can do trivial things, such as the contents of the next few slides.
  - Is an excellent way of doing our unsupervised learning problem, as we'll see.
  - Has many, many other uses, including inference of Hidden Markov Models.
46. Silly Example
- Let events be grades in a class:
  - w_1 = Gets an A, P(A) = ½
  - w_2 = Gets a B, P(B) = µ
  - w_3 = Gets a C, P(C) = 2µ
  - w_4 = Gets a D, P(D) = ½ - 3µ
  - (Note: 0 ≤ µ ≤ 1/6)
- Assume we want to estimate µ from data. In a given class there were
  - a A's
  - b B's
  - c C's
  - d D's
- What's the maximum likelihood estimate of µ given a, b, c, d?
47. Silly Example
- Let events be grades in a class:
  - w_1 = Gets an A, P(A) = ½
  - w_2 = Gets a B, P(B) = µ
  - w_3 = Gets a C, P(C) = 2µ
  - w_4 = Gets a D, P(D) = ½ - 3µ
  - (Note: 0 ≤ µ ≤ 1/6)
- Assume we want to estimate µ from data. In a given class there were
  - a A's
  - b B's
  - c C's
  - d D's
- What's the maximum likelihood estimate of µ given a, b, c, d?
48. Trivial Statistics
- P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
- P(a,b,c,d | µ) = K (½)^a (µ)^b (2µ)^c (½ - 3µ)^d
- log P(a,b,c,d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ - 3µ)

  A: 14    B: 6    C: 9    D: 10

Boring, but true!
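Setting the derivative of the log likelihood to zero gives the maximum likelihood estimate:

\frac{\partial}{\partial \mu} \log P(a,b,c,d \mid \mu) = \frac{b}{\mu} + \frac{c}{\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0 \;\Rightarrow\; \hat\mu = \frac{b + c}{6\,(b + c + d)}

For the class above (a = 14, b = 6, c = 9, d = 10) this gives \hat\mu = 15/150 = 1/10.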
49. Same Problem with Hidden Information
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
Someone tells us that:
  Number of high grades (A's + B's) = h
  Number of C's = c
  Number of D's = d
What is the max. likelihood estimate of µ now?
50. Same Problem with Hidden Information
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
Someone tells us that:
  Number of high grades (A's + B's) = h
  Number of C's = c
  Number of D's = d
What is the max. likelihood estimate of µ now? We can answer this question circularly:
EXPECTATION: If we knew the value of µ, we could compute the expected values of a and b, since the ratio a : b should be the same as the ratio ½ : µ.
MAXIMIZATION: If we knew the expected values of a and b, we could compute the maximum likelihood value of µ.
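Written out, the two steps of this circular answer are:

EXPECTATION:  E[b \mid \mu, h] = \frac{\mu}{\tfrac{1}{2} + \mu}\, h  \quad\text{(and correspondingly } E[a \mid \mu, h] = \frac{1/2}{\tfrac{1}{2} + \mu}\, h\text{)}

MAXIMIZATION:  \hat\mu = \frac{b + c}{6\,(b + c + d)}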
51. E.M. for our Trivial Problem
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
- We begin with a guess for µ.
- We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b.
- Define µ(t) = the estimate of µ on the t-th iteration, b(t) = the estimate of b on the t-th iteration.
E-step
M-step
Continue iterating until converged. Good news: converging to a local optimum is assured. Bad news: I said local optimum.
52. E.M. Convergence
- Convergence proof is based on the fact that Prob(data | µ) must increase or remain the same between each iteration [NOT OBVIOUS]
- But it can never exceed 1 [OBVIOUS]
- So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0:

  t    µ(t)     b(t)
  0    0        0
  1    0.0833   2.857
  2    0.0937   3.158
  3    0.0947   3.185
  4    0.0948   3.187
  5    0.0948   3.187
  6    0.0948   3.187

Convergence is generally linear: the error decreases by a constant factor each time step.
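A minimal script (not from the slides) that reproduces this iteration, up to rounding in the last digit:

# EM for the grades example: h high grades (A's + B's), c C's, d D's.
h, c, d = 20, 10, 10
mu, b = 0.0, 0.0                          # initial guess: mu(0) = 0, so b(0) = 0
print(0, mu, b)
for t in range(1, 7):
    mu = (b + c) / (6.0 * (b + c + d))    # M-step: ML estimate of mu from expected counts
    b = h * mu / (0.5 + mu)               # E-step: expected number of B's given mu
    print(t, round(mu, 4), round(b, 3))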
53. Back to Unsupervised Learning of GMMs
Remember:
- We have unlabeled data x_1, x_2, ..., x_R.
- We know there are k classes.
- We know P(w_1), P(w_2), P(w_3), ..., P(w_k).
- We don't know µ_1, µ_2, ..., µ_k.
- We can write P(data | µ_1, ..., µ_k).
54. E.M. for GMMs
See http://www.cs.cmu.edu/awm/doc/gmm-algebra.pdf
Setting the derivatives of the log likelihood to zero gives a set of coupled nonlinear equations in the µ_j's.
If, for each x_i, we knew the probability P(w_j | x_i, µ_1, ..., µ_k) that it belongs to each class w_j, then we could easily compute the µ_j's. If we knew each µ_j, then we could easily compute P(w_j | x_i, µ_1, ..., µ_k) for each w_j and x_i.
I feel an EM experience coming on!
55. E.M. for GMMs
- Iterate. On the t-th iteration let our estimates be λ_t = { µ_1(t), µ_2(t), ..., µ_c(t) }.
- E-step: compute the "expected classes" of all datapoints for each class; this just requires evaluating a Gaussian at each x_k.
- M-step: compute the maximum likelihood µ's given our data's class membership distributions.
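In standard form (c classes, shared covariance σ²I), the two steps are:

E-step:  P(w_j \mid x_k, \lambda_t) = \frac{P(w_j)\, N(x_k;\, \mu_j(t), \sigma^2 I)}{\sum_{i=1}^{c} P(w_i)\, N(x_k;\, \mu_i(t), \sigma^2 I)}

M-step:  \mu_j(t+1) = \frac{\sum_k P(w_j \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_j \mid x_k, \lambda_t)}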
56. E.M. Convergence
- As with all EM procedures, convergence to a local optimum is guaranteed.
- This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g. Vector Quantization for Speech Data.
57. E.M. for General GMMs
p_i(t) is shorthand for the estimate of P(w_i) on the t-th iteration.
- Iterate. On the t-th iteration let our estimates be λ_t = { µ_1(t), ..., µ_c(t), Σ_1(t), ..., Σ_c(t), p_1(t), ..., p_c(t) }.
- E-step: compute the "expected classes" of all datapoints for each class; this just requires evaluating a Gaussian at each x_k.
- M-step: compute the maximum likelihood µ's, Σ's, and p's given our data's class membership distributions, summing over all R records.
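As an illustration, here is a minimal numpy sketch (not the course code) of EM for a general GMM with full covariances; the initialization and the small ridge term are arbitrary choices, and all names are illustrative:

import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density N(x; mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a k-component GMM with full covariance matrices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random datapoints as means, identity covariances, uniform weights.
    means = X[rng.choice(n, k, replace=False)]
    covs = np.array([np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities P(w_j | x_i, current parameters).
        resp = np.column_stack(
            [weights[j] * gaussian_pdf(X, means[j], covs[j]) for j in range(k)]
        )
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        Nj = resp.sum(axis=0)                       # effective counts per component
        weights = Nj / n                            # p_j(t+1)
        means = (resp.T @ X) / Nj[:, None]          # mu_j(t+1)
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nj[j]
            covs[j] += 1e-6 * np.eye(d)             # small ridge for numerical stability
    return weights, means, covs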
58. Gaussian Mixture Example: Start
59. After first iteration
60. After 2nd iteration
61. After 3rd iteration
62. After 4th iteration
63. After 5th iteration
64. After 6th iteration
65. After 20th iteration
66. Some Bio Assay data
67. GMM clustering of the assay data
68. Resulting Density Estimator
69. In EM the model (the number of parameters, which is the number of lines in our case) is fixed. What happens if we start with a wrong model?
[Figure: converged EM fit with one line]
70. Three lines
71. Why not 8 lines?
72. Which model is better?
We cannot just compute an approximation error (e.g. LSE), since it decreases with the number of lines. If we have too many lines, we just fit the noise.
73. Can we find an optimal model? The human visual system does it.
[Figure: two fits labeled "Good fit" and "Bad fit"]
74. Polygonal Approximation of Laser Range Data Based on Perceptual Grouping and EM
Longin Jan Latecki and Rolf Lakaemper, CIS Dept., Temple University, Philadelphia
75. Grouping Edge Points to Contour Parts
Longin Jan Latecki and Rolf Lakaemper, CIS Dept., Temple University, Philadelphia