1
Introduction to Expectation Maximization
Assembled and extended by Longin Jan Latecki, Temple University, latecki@temple.edu
based on slides by
  • Andrew Blake, Microsoft Research and Bill
    Freeman, MIT, ICCV 2003
  • and
  • Andrew W. Moore, Carnegie Mellon University

2
Learning and vision: Generative Methods
  • Machine learning is an important direction in
    computer vision.
  • Our goal for this class:
  • Give overviews of useful techniques.
  • Show how these methods can be used in vision.
  • Provide references and pointers.

3
What is the goal of vision?
If you are asking, "Are there any faces in this image?", then you would probably want to use discriminative methods.
4
What is the goal of vision?
If you are asking, "Are there any faces in this image?", then you would probably want to use discriminative methods.
If you are asking, "Find a 3-d model that describes the runner," then you would use generative methods.
5
Modeling
  • So we want to look at high-dimensional visual data and fit models to it, forming summaries that let us understand what we see.

6
The simplest data to model: a set of 1-d samples
7
Fit this distribution with a Gaussian
8
How do we find the parameters of the best-fitting Gaussian?
9
How do we find the parameters of the best-fitting Gaussian?
10
Derivation of MLE for Gaussians
Observation density
Log likelihood
Maximisation
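In standard notation for 1-d samples y1, ..., yN, a reconstruction of the three steps named above:
Observation density:
\[ p(y_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\Big(-\frac{(y_n - \mu)^2}{2\sigma^2}\Big) \]
Log likelihood:
\[ L(\mu, \sigma^2) = \sum_{n=1}^{N} \log p(y_n \mid \mu, \sigma^2) \]
Maximisation: set \( \partial L / \partial \mu = 0 \) and \( \partial L / \partial \sigma^2 = 0 \) and solve.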
11
Basic Maximum Likelihood Estimate (MLE) of a
Gaussian distribution
Mean
Variance
Covariance Matrix
12
Basic Maximum Likelihood Estimate (MLE) of a
Gaussian distribution
Mean
Variance
For vector-valued data, we have the Covariance
Matrix
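The resulting estimates, in standard form:
Mean:
\[ \hat\mu = \frac{1}{N} \sum_{n=1}^{N} y_n \]
Variance:
\[ \hat\sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat\mu)^2 \]
Covariance matrix (for vector-valued data):
\[ \hat\Sigma = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat\mu)(y_n - \hat\mu)^{\top} \]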
13
Model fitting example 2: Fit a line to observed data
[Figure: observed data points in the (x, y) plane]
14
Maximum likelihood estimation for the slope of a
single line
The maximum likelihood estimate of the slope gives the familiar regression formula (see the sketch below).
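A sketch, under the assumption that each observation is y_n = a x_n + e_n with Gaussian noise e_n ~ N(0, σ²) on y only:
\[ \hat a = \arg\max_a \sum_n \log N(y_n;\, a x_n, \sigma^2) = \arg\min_a \sum_n (y_n - a x_n)^2 \]
which, on setting the derivative to zero, gives the regression formula
\[ \hat a = \frac{\sum_n x_n y_n}{\sum_n x_n^2}. \]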
15
Model fitting example 3: Fitting two lines to observed data
[Figure: observed data points in the (x, y) plane]
16
MLE for fitting a line pair
(a form of mixture distribution for the observations)
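Under the same assumptions as the single-line case, and giving the two lines equal weight (consistent with the 1/2 weights used in the EM steps later), the observation density becomes a mixture:
\[ p(y_n \mid x_n, a_1, a_2) = \tfrac{1}{2}\, N(y_n;\, a_1 x_n, \sigma^2) + \tfrac{1}{2}\, N(y_n;\, a_2 x_n, \sigma^2) \]
and the MLE maximises \( \sum_n \log p(y_n \mid x_n, a_1, a_2) \), which has no closed form because of the sum inside the log.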
17
Fitting two lines: on the one hand...
If we knew which points went with which lines, we'd be back at the single line-fitting problem, twice.
[Figure: observed data points in the (x, y) plane]
18
Fitting two lines: on the other hand...
We could figure out the probability that any
point came from either line if we just knew the
two equations for the two lines.
[Figure: observed data points in the (x, y) plane]
19
Expectation Maximization (EM): a solution to chicken-and-egg problems
20
EM example
21
EM example
22
EM example
23
EM example
24
EM example
Converged!
25
MLE with hidden/latent variables: Expectation Maximisation
General problem: data y, parameters θ, hidden variables z.
For MLE, we want to maximise the log likelihood
log p(y | θ) = log Σz p(y, z | θ),
but the sum over z inside the log gives a complicated expression for the ML solution.
26
The EM algorithm
We don't know the values of the labels, zi, but let's use their expected value under the posterior with the current parameter values, θold. That gives us the expectation step:
E-step
Now let's maximize this Q function, an expected log-likelihood, over the parameter values, giving the maximization step:
M-step
Each iteration increases the total log-likelihood log p(y|θ).
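In symbols, a standard statement of the two steps (using θold for the current parameter values):
E-step:
\[ Q(\theta;\, \theta^{\mathrm{old}}) = \sum_{z} p(z \mid y, \theta^{\mathrm{old}})\, \log p(y, z \mid \theta) \]
M-step:
\[ \theta^{\mathrm{new}} = \arg\max_{\theta}\, Q(\theta;\, \theta^{\mathrm{old}}) \]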
27
Expectation Maximisation applied to fitting the two lines
Hidden variables associate each data point with one of the two lines, and the probabilities of association are the posterior probabilities of each line given the point and the current line parameters.



28
EM fitting to two lines
E-step: compute the association probabilities of each data point with each of the two lines, given the current line parameters (the two lines carry equal weight 1/2).
M-step: the regression becomes a weighted least-squares fit for each line, with each point weighted by its association probability.
Repeat the two steps until convergence (see the sketch below).
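A minimal sketch of the two steps, assuming equal weights 1/2 on the two lines and a common noise variance σ²; rnk below denotes the association probability of point n with line k (notation introduced here for illustration):
E-step:
\[ r_{nk} = \frac{N(y_n;\, a_k x_n, \sigma^2)}{N(y_n;\, a_1 x_n, \sigma^2) + N(y_n;\, a_2 x_n, \sigma^2)}, \qquad k = 1, 2 \]
M-step (weighted regression for each line):
\[ a_k = \frac{\sum_n r_{nk}\, x_n y_n}{\sum_n r_{nk}\, x_n^2} \]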
29
Experiments: EM fitting to two lines
(from a tutorial by Yair Weiss, http://www.cs.huji.ac.il/yweiss/tutorials.html)
[Figure: line weights for line 1 and line 2 plotted over iterations 1, 2, 3]
30
Applications of EM in computer vision
  • Image segmentation
  • Motion estimation combined with perceptual
    grouping
  • Polygonal approximation of edges

31
Next: back to Density Estimation. What if we want to do density estimation with multimodal or clumpy data?
32
The GMM assumption
  • There are k components. The i-th component is called wi.
  • Component wi has an associated mean vector µi.

33
The GMM assumption
  • There are k components. The i-th component is called wi.
  • Component wi has an associated mean vector µi.
  • Each component generates data from a Gaussian with mean µi and covariance matrix σ²I.
  • Assume that each datapoint is generated according to the following recipe:

34
The GMM assumption
  • There are k components. The i-th component is called wi.
  • Component wi has an associated mean vector µi.
  • Each component generates data from a Gaussian with mean µi and covariance matrix σ²I.
  • Assume that each datapoint is generated according to the following recipe:
  • Pick a component at random. Choose component i with probability P(wi).

35
The GMM assumption
  • There are k components. The i-th component is called wi.
  • Component wi has an associated mean vector µi.
  • Each component generates data from a Gaussian with mean µi and covariance matrix σ²I.
  • Assume that each datapoint is generated according to the following recipe:
  • Pick a component at random. Choose component i with probability P(wi).
  • Datapoint ~ N(µi, σ²I)

36
The General GMM assumption
  • There are k components. The i-th component is called wi.
  • Component wi has an associated mean vector µi.
  • Each component generates data from a Gaussian with mean µi and covariance matrix Σi.
  • Assume that each datapoint is generated according to the following recipe:
  • Pick a component at random. Choose component i with probability P(wi).
  • Datapoint ~ N(µi, Σi)

37
Unsupervised Learning: not as hard as it looks
Sometimes easy
Sometimes impossible
and sometimes in between
IN CASE YOU'RE WONDERING WHAT THESE DIAGRAMS ARE, THEY SHOW 2-d UNLABELED DATA (X VECTORS) DISTRIBUTED IN 2-d SPACE. THE TOP ONE HAS THREE VERY CLEAR GAUSSIAN CENTERS.
38
Computing likelihoods in unsupervised case
  • We have x1, x2, ..., xN
  • We know P(w1), P(w2), ..., P(wk)
  • We know σ
  • P(x | wi, µ1, ..., µk) = probability that an observation from class wi would have value x, given class means µ1, ..., µk
  • Can we write an expression for that?

39
Likelihoods in unsupervised case
  • We have x1, x2, ..., xn
  • We have P(w1), ..., P(wk). We have σ.
  • We can define, for any x, P(x | wi, µ1, µ2, ..., µk)
  • Can we define P(x | µ1, µ2, ..., µk)?
  • Can we define P(x1, x2, ..., xn | µ1, µ2, ..., µk)?

YES, IF WE ASSUME THE xi'S WERE DRAWN INDEPENDENTLY
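The answers, using the standard mixture-model identities:
\[ P(x \mid \mu_1, \ldots, \mu_k) = \sum_{i=1}^{k} P(x \mid w_i, \mu_1, \ldots, \mu_k)\, P(w_i) \]
\[ P(x_1, \ldots, x_n \mid \mu_1, \ldots, \mu_k) = \prod_{j=1}^{n} P(x_j \mid \mu_1, \ldots, \mu_k) \]
(the second equality is where the independence assumption is used).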
40
Unsupervised Learning: Mediumly Good News
We now have a procedure s.t. if you give me a guess at µ1, µ2, ..., µk, I can tell you the prob of the unlabeled data given those µ's.
Suppose x's are 1-dimensional. There are two classes, w1 and w2, with P(w1) = 1/3, P(w2) = 2/3, and σ = 1. There are 25 unlabeled datapoints (from Duda and Hart):
x1 = 0.608, x2 = -1.590, x3 = 0.235, x4 = 3.949, ..., x25 = -0.712
41
Duda & Hart's Example
  • Graph of
  • log P(x1, x2, ..., x25 | µ1, µ2)
  • against µ1 and µ2

Max likelihood: (µ1 = -2.13, µ2 = 1.668). A second local maximum, very close to the global one, is at (µ1 = 2.085, µ2 = -1.257); it corresponds to switching w1 with w2.
42
Duda & Hart's Example
We can graph the prob. dist. function of the data given our µ1 and µ2 estimates.
We can also graph the true function from which the data was randomly generated.
  • They are close. Good.
  • The 2nd solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa.
  • In this example unsupervised is almost as good as supervised. If the x1 .. x25 are given the class which was used to learn them, then the results are (µ1 = -2.176, µ2 = 1.684). Unsupervised got (µ1 = -2.13, µ2 = 1.668).

43
Finding the max likelihood µ1, µ2, ..., µk
  • We can compute P(data | µ1, µ2, ..., µk)
  • How do we find the µi's which give max likelihood?
  • The normal max likelihood trick: set ∂ log Prob(...) / ∂µi = 0 and solve for the µi's.
  • Here you get non-linear, non-analytically-solvable equations.
  • Use gradient descent: slow but doable.
  • Or use a much faster, cuter, and recently very popular method...

44
Expectation Maximization
45
The E.M. Algorithm
DETOUR
  • We'll get back to unsupervised learning soon.
  • But now we'll look at an even simpler case with hidden information.
  • The EM algorithm:
  • Can do trivial things, such as the contents of the next few slides.
  • An excellent way of doing our unsupervised learning problem, as we'll see.
  • Many, many other uses, including inference of Hidden Markov Models.

46
Silly Example
  • Let events be grades in a class:
  • w1 = Gets an A: P(A) = ½
  • w2 = Gets a B: P(B) = µ
  • w3 = Gets a C: P(C) = 2µ
  • w4 = Gets a D: P(D) = ½ - 3µ
  • (Note: 0 ≤ µ ≤ 1/6)
  • Assume we want to estimate µ from data. In a given class there were
  • a A's
  • b B's
  • c C's
  • d D's
  • What's the maximum likelihood estimate of µ given a, b, c, d?

48
Trivial Statistics
  • P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
  • P(a, b, c, d | µ) = K (½)^a (µ)^b (2µ)^c (½ - 3µ)^d
  • log P(a, b, c, d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ - 3µ)

Observed grade counts: A = 14, B = 6, C = 9, D = 10
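Setting the derivative of the log-likelihood to zero gives the closed-form maximum (a standard reconstruction of the algebra):
\[ \frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{c}{\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0 \quad\Longrightarrow\quad \hat\mu = \frac{b + c}{6\,(b + c + d)} \]
With the counts above, \( \hat\mu = (6 + 9) / (6 \cdot 25) = 1/10 \).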
Boring, but true!
49
Same Problem with Hidden Information
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
Someone tells us that
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d
What is the max. like estimate of µ now?
50
Same Problem with Hidden Information
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
Someone tells us that
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d
What is the max. like estimate of µ now? We can answer this question circularly:
EXPECTATION
If we knew the value of µ, we could compute the expected values of a and b, since the ratio a : b should be the same as the ratio ½ : µ.
MAXIMIZATION
If we knew the expected values of a and b, we could compute the maximum likelihood value of µ.
51
E.M. for our Trivial Problem
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
  • We begin with a guess for µ.
  • We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b.
  • Define µ(t) = the estimate of µ on the t-th iteration, and b(t) = the estimate of b on the t-th iteration.

E-step
M-step
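For this problem the two steps have a simple closed form (reconstructed from the ratio argument and the closed-form MLE above):
E-step:
\[ b(t) = \frac{\mu(t)}{\tfrac{1}{2} + \mu(t)}\, h \]
M-step:
\[ \mu(t+1) = \frac{b(t) + c}{6\,\big(b(t) + c + d\big)} \]
(These updates reproduce the iterates in the table on the next slide.)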
Continue iterating until converged. Good news: converging to a local optimum is assured. Bad news: I said local optimum.
52
E.M. Convergence
  • Convergence proof is based on the fact that P(data | µ) must increase or remain the same between iterations [NOT OBVIOUS]
  • But it can never exceed 1 [OBVIOUS]
  • So it must therefore converge [OBVIOUS]

t µ(t) b(t)
0 0 0
1 0.0833 2.857
2 0.0937 3.158
3 0.0947 3.185
4 0.0948 3.187
5 0.0948 3.187
6 0.0948 3.187
In our example, suppose we had h = 20, c = 10, d = 10, and µ(0) = 0.
Convergence is generally linear: error decreases by a constant factor each time step.
53
Back to Unsupervised Learning of GMMs
  • Remember:
  • We have unlabeled data x1, x2, ..., xR
  • We know there are k classes
  • We know P(w1), P(w2), P(w3), ..., P(wk)
  • We don't know µ1, µ2, ..., µk
  • We can write P(data | µ1, ..., µk)

54
E.M. for GMMs
See http://www.cs.cmu.edu/awm/doc/gmm-algebra.pdf

This is n nonlinear equations in the µj's.
If, for each xi, we knew the probability that xi came from class wj, P(wj | xi, µ1, ..., µk), then we would easily compute the µj's. If we knew each µj, then we could easily compute P(wj | xi, µ1, ..., µk) for each wj and xi.
I feel an EM experience coming on!!
55
E.M. for GMMs
  • Iterate. On the t-th iteration let our estimates be
  • λt = { µ1(t), µ2(t), ..., µc(t) }
  • E-step:
  • Compute expected classes of all datapoints for each class

(Just evaluate a Gaussian at xk.)
M-step: compute the max. likelihood µ given our data's class membership distributions
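A sketch of the two updates under the GMM assumptions above (spherical covariance σ²I and known class priors P(wi)), with c classes and datapoints xk:
E-step, for each datapoint xk and class wi:
\[ P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \mu_i(t), \sigma^2 I)\, P(w_i)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \sigma^2 I)\, P(w_j)} \]
M-step:
\[ \mu_i(t+1) = \frac{\sum_{k} P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_{k} P(w_i \mid x_k, \lambda_t)} \]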
56
E.M. Convergence
  • As with all EM procedures, convergence to a local
    optimum guaranteed.
  • This algorithm is REALLY USED. And in high
    dimensional state spaces, too. E.G. Vector
    Quantization for Speech Data

57
E.M. for General GMMs
pi(t) is shorthand for the estimate of P(wi) on the t-th iteration
  • Iterate. On the t-th iteration let our estimates be
  • λt = { µ1(t), µ2(t), ..., µc(t), Σ1(t), Σ2(t), ..., Σc(t), p1(t), p2(t), ..., pc(t) }
  • E-step:
  • Compute expected classes of all datapoints for each class

(Just evaluate a Gaussian at xk.)
M-step: compute the max. likelihood µ, Σ, and p given our data's class membership distributions
(R = number of records)
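A sketch of the general updates, with R = number of records and c components (the standard full-covariance EM recursions, reconstructed rather than copied from the slide):
E-step:
\[ P(w_i \mid x_k, \lambda_t) = \frac{p_i(t)\, N\!\big(x_k;\, \mu_i(t), \Sigma_i(t)\big)}{\sum_{j=1}^{c} p_j(t)\, N\!\big(x_k;\, \mu_j(t), \Sigma_j(t)\big)} \]
M-step:
\[ \mu_i(t+1) = \frac{\sum_{k=1}^{R} P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_{k=1}^{R} P(w_i \mid x_k, \lambda_t)}, \qquad p_i(t+1) = \frac{1}{R}\sum_{k=1}^{R} P(w_i \mid x_k, \lambda_t) \]
\[ \Sigma_i(t+1) = \frac{\sum_{k=1}^{R} P(w_i \mid x_k, \lambda_t)\,\big(x_k - \mu_i(t+1)\big)\big(x_k - \mu_i(t+1)\big)^{\top}}{\sum_{k=1}^{R} P(w_i \mid x_k, \lambda_t)} \]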
58
Gaussian Mixture Example: Start
59
After first iteration
60
After 2nd iteration
61
After 3rd iteration
62
After 4th iteration
63
After 5th iteration
64
After 6th iteration
65
After 20th iteration
66
Some Bio Assay data
67
GMM clustering of the assay data
68
Resulting Density Estimator
69
In EM the model (the number of parameters, which is the number of lines in our case) is fixed. What happens if we start with a wrong model?
Converged!
One line
70
Three lines
71
Why not 8 lines?
72
Which model is better?
We cannot just compute an approximation error
(e.g. LSE), since it decreases with the number of
lines. If we have too many lines, we just fit the
noise.
73
Can we find an optimal model? The human visual system does it.
Good fit
Bad fit
74
Polygonal Approximation of Laser Range Data Based on Perceptual Grouping and EM
Longin Jan Latecki and Rolf Lakaemper, CIS Dept., Temple University, Philadelphia
75
Grouping Edge Points to Contour Parts
Longin Jan Latecki and Rolf Lakaemper, CIS Dept., Temple University, Philadelphia