Title: Introduction to Expectation Maximization

1. Introduction to Expectation Maximization
Assembled and extended by Longin Jan Latecki, Temple University, latecki@temple.edu, based on slides by
- Andrew Blake, Microsoft Research, and Bill Freeman, MIT, ICCV 2003
- Andrew W. Moore, Carnegie Mellon University
2. Learning and vision: Generative Methods
- Machine learning is an important direction in computer vision.
- Our goal for this class:
  - Give overviews of useful techniques.
  - Show how these methods can be used in vision.
  - Provide references and pointers.
3. What is the goal of vision?
If you are asking, "Are there any faces in this image?", then you would probably want to use discriminative methods.
4. What is the goal of vision?
If you are asking, "Are there any faces in this image?", then you would probably want to use discriminative methods.
If you are asking, "Find a 3-d model that describes the runner", then you would use generative methods.
5. Modeling
- So we want to look at high-dimensional visual data and fit models to it, forming summaries of it that let us understand what we see.
6. The simplest data to model: a set of 1-d samples
7. Fit this distribution with a Gaussian
8. How do we find the parameters of the best-fitting Gaussian?
9. How do we find the parameters of the best-fitting Gaussian?
10. Derivation of MLE for Gaussians
Observation density
Log likelihood
Maximisation
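A standard reconstruction of these three steps, assuming N i.i.d. scalar samples x_1, ..., x_N from N(µ, σ²):

Observation density:  p(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

Log likelihood:  L(\mu, \sigma^2) = \sum_{i=1}^{N} \log p(x_i \mid \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}

Maximisation:  \frac{\partial L}{\partial \mu} = 0, \; \frac{\partial L}{\partial \sigma^2} = 0 \;\Rightarrow\; \hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2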
11. Basic Maximum Likelihood Estimate (MLE) of a Gaussian distribution
Mean
Variance
Covariance Matrix
12. Basic Maximum Likelihood Estimate (MLE) of a Gaussian distribution
Mean
Variance
For vector-valued data, we have the Covariance Matrix
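The resulting estimates, in standard form:

Mean:  \hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i

Variance:  \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2

Covariance matrix (vector-valued data):  \hat\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)(x_i - \hat\mu)^\top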
13. Model fitting example 2: fit a line to observed data
[Figure: scatter of observed data points in the (x, y) plane with a fitted line]
14. Maximum likelihood estimation for the slope of a single line
The maximum likelihood estimate of the slope gives the usual regression formula.
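A reconstruction of that estimate, assuming (as in this tutorial's setup) a zero-intercept line y_i = a x_i + n_i with Gaussian noise n_i ~ N(0, σ²):

\hat{a} = \arg\max_a \sum_i \log N(y_i;\, a x_i, \sigma^2) = \arg\min_a \sum_i (y_i - a x_i)^2, \quad \text{which gives} \quad \hat{a} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}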
15. Model fitting example 3: fitting two lines to observed data
[Figure: scatter of observed data points in the (x, y) plane with two candidate lines]
16. MLE for fitting a line pair
(a form of mixture distribution for the observations)
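Under the same assumptions as the single-line case (zero-intercept lines with slopes a_1 and a_2, noise variance σ², equal mixing weights), the mixture likelihood is:

p(y_i \mid a_1, a_2) = \tfrac{1}{2}\, N(y_i;\, a_1 x_i, \sigma^2) + \tfrac{1}{2}\, N(y_i;\, a_2 x_i, \sigma^2), \qquad \log p(y \mid a_1, a_2) = \sum_i \log p(y_i \mid a_1, a_2)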
17. Fitting two lines: on the one hand
If we knew which points went with which lines, we'd be back at the single line-fitting problem, twice.
[Figure: (x, y) scatter of the two-line data]
18. Fitting two lines: on the other hand
We could figure out the probability that any point came from either line if we just knew the two equations for the two lines.
[Figure: (x, y) scatter of the two-line data]
19. Expectation Maximization (EM): a solution to chicken-and-egg problems
20. EM example
21. EM example
22. EM example
23. EM example
24. EM example
Converged!
25. MLE with hidden/latent variables: Expectation Maximisation
General problem:
- data y
- parameters θ
- hidden variables z
For MLE, we want to maximise the log likelihood. The sum over z inside the log gives a complicated expression for the ML solution.
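In the notation above, the quantity to maximise is

\log p(y \mid \theta) = \log \sum_{z} p(y, z \mid \theta),

and it is the sum over z inside the log that makes direct maximisation hard.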
26. The EM algorithm
We don't know the values of the labels z_i, but let's use their expected values under the posterior with the current parameter values, θ_old. That gives us the expectation step (E-step).
Now let's maximize this Q function, an expected log-likelihood, over the parameter values, giving the maximization step (M-step).
Each iteration increases the total log-likelihood log p(y|θ).
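In standard form, the two steps are:

E-step:  Q(\theta, \theta^{\text{old}}) = \sum_{z} p(z \mid y, \theta^{\text{old}}) \, \log p(y, z \mid \theta)

M-step:  \theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})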
27. Expectation Maximisation applied to fitting the two lines
Hidden variables z_i associate each data point with a line, and the probabilities of association are given by the posterior under the current line parameters.
28. EM fitting to two lines
E-step: compute the association probabilities of each data point with each of the two lines, using the current line parameters.
M-step: the regression becomes a weighted least-squares fit for each line, weighted by the association probabilities.
Repeat until convergence.
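As a concrete illustration, here is a minimal numpy sketch of this procedure (not the authors' code; it assumes zero-intercept lines, a known noise standard deviation sigma, and equal 1/2 mixing weights, and all names are illustrative):

import numpy as np

def em_two_lines(x, y, sigma=1.0, n_iter=50):
    """Fit slopes a1, a2 of two lines y ~ a*x by EM (equal 1/2 mixing weights)."""
    a1, a2 = 0.1, 1.0  # arbitrary initial guesses
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to line 1.
        d1 = np.exp(-(y - a1 * x) ** 2 / (2 * sigma ** 2))
        d2 = np.exp(-(y - a2 * x) ** 2 / (2 * sigma ** 2))
        r1 = d1 / (d1 + d2)          # responsibility of line 1
        r2 = 1.0 - r1                # responsibility of line 2
        # M-step: weighted least-squares regression for each slope.
        a1 = np.sum(r1 * x * y) / np.sum(r1 * x ** 2)
        a2 = np.sum(r2 * x * y) / np.sum(r2 * x ** 2)
    return a1, a2

# Synthetic example: points generated from two lines plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
labels = rng.integers(0, 2, 200)
y = np.where(labels == 0, 0.5 * x, 2.0 * x) + rng.normal(0, 0.3, 200)
print(em_two_lines(x, y, sigma=0.3))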
29. Experiments: EM fitting to two lines
(from a tutorial by Yair Weiss, http://www.cs.huji.ac.il/yweiss/tutorials.html)
[Figure: line weights for line 1 and line 2 plotted over iterations 1, 2, 3]
30. Applications of EM in computer vision
- Image segmentation
- Motion estimation combined with perceptual grouping
- Polygonal approximation of edges
31. Next: back to density estimation. What if we want to do density estimation with multimodal or "clumpy" data?
32. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
[Figure: component means µ_2, µ_3]
33. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
[Figure: component means µ_2, µ_3]
34. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(w_i).
[Figure: component mean µ_2]
35. The GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I.
- Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(w_i).
  2. Datapoint ~ N(µ_i, σ²I).
[Figure: a datapoint x drawn around component mean µ_2]
36. The General GMM assumption
- There are k components. The i-th component is called w_i.
- Component w_i has an associated mean vector µ_i.
- Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i.
- Assume that each datapoint is generated according to the following recipe (a code sketch follows this slide):
  1. Pick a component at random. Choose component i with probability P(w_i).
  2. Datapoint ~ N(µ_i, Σ_i).
[Figure: component means µ_2, µ_3]
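A minimal numpy sketch of this generative recipe (the weights, means, and covariances below are illustrative values, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-d GMM with k = 3 components.
weights = np.array([0.5, 0.3, 0.2])                        # P(w_i)
means = np.array([[0.0, 0.0], [4.0, 4.0], [-3.0, 5.0]])    # mu_i
covs = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])  # Sigma_i

def sample_gmm(n):
    # 1. Pick a component at random with probability P(w_i).
    comps = rng.choice(len(weights), size=n, p=weights)
    # 2. Draw each datapoint from N(mu_i, Sigma_i).
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])

data = sample_gmm(500)   # 500 unlabeled 2-d datapoints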
37. Unsupervised Learning: not as hard as it looks
Sometimes easy. Sometimes impossible. And sometimes in between.
[Figures: 2-d unlabeled data (x vectors) distributed in 2-d space; the top one has three very clear Gaussian centers]
38. Computing likelihoods in the unsupervised case
- We have x_1, x_2, ..., x_N.
- We know P(w_1), P(w_2), ..., P(w_k).
- We know σ.
- P(x | w_i, µ_1, ..., µ_k) = probability that an observation from class w_i would have value x, given class means µ_1 ... µ_k.
- Can we write an expression for that?
39. Likelihoods in the unsupervised case
- We have x_1, x_2, ..., x_n.
- We have P(w_1), ..., P(w_k). We have σ.
- We can define, for any x, P(x | w_i, µ_1, µ_2, ..., µ_k).
- Can we define P(x | µ_1, µ_2, ..., µ_k)?
- Can we define P(x_1, x_2, ..., x_n | µ_1, µ_2, ..., µ_k)?
YES, IF WE ASSUME THE x_i'S WERE DRAWN INDEPENDENTLY.
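With the independence assumption, the standard expressions are:

P(x \mid \mu_1, \ldots, \mu_k) = \sum_{i=1}^{k} P(x \mid w_i, \mu_1, \ldots, \mu_k)\, P(w_i)

P(x_1, \ldots, x_n \mid \mu_1, \ldots, \mu_k) = \prod_{j=1}^{n} P(x_j \mid \mu_1, \ldots, \mu_k)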
40. Unsupervised Learning: Mediumly Good News
We now have a procedure such that, if you give me a guess at µ_1, µ_2, ..., µ_k, I can tell you the probability of the unlabeled data given those µ's.
Suppose the x's are 1-dimensional and there are two classes, w_1 and w_2, with P(w_1) = 1/3, P(w_2) = 2/3, and σ = 1. There are 25 unlabeled datapoints (from Duda and Hart):
x_1 = 0.608, x_2 = -1.590, x_3 = 0.235, x_4 = 3.949, ..., x_25 = -0.712
41. Duda & Hart's Example
Graph of log P(x_1, x_2, ..., x_25 | µ_1, µ_2) against µ_1 and µ_2.
Max likelihood: (µ_1 = -2.13, µ_2 = 1.668). There is a second local optimum, very close to the global one, at (µ_1 = 2.085, µ_2 = -1.257); it corresponds to switching w_1 with w_2.
42. Duda & Hart's Example
We can graph the probability distribution function of the data given our µ_1 and µ_2 estimates.
We can also graph the true function from which the data was randomly generated.
- They are close. Good.
- The 2nd solution tries to put the 2/3 hump where the 1/3 hump should go, and vice versa.
- In this example, unsupervised learning is almost as good as supervised. If x_1, ..., x_25 are given the classes which were used to generate them, then the results are (µ_1 = -2.176, µ_2 = 1.684). Unsupervised got (µ_1 = -2.13, µ_2 = 1.668).
43. Finding the max likelihood µ_1, µ_2, ..., µ_k
- We can compute P(data | µ_1, µ_2, ..., µ_k).
- How do we find the µ_i's which give max likelihood?
- The normal max likelihood trick:
  - Set ∂/∂µ_i log Prob(...) = 0 and solve for the µ_i's.
  - Here you get non-linear, non-analytically-solvable equations.
- Use gradient descent: slow but doable.
- Or use a much faster, cuter, and recently very popular method...
44. Expectation Maximization
45. The E.M. Algorithm
DETOUR
- We'll get back to unsupervised learning soon.
- But now we'll look at an even simpler case with hidden information.
- The EM algorithm:
  - Can do trivial things, such as the contents of the next few slides.
  - Is an excellent way of doing our unsupervised learning problem, as we'll see.
  - Has many, many other uses, including inference of Hidden Markov Models.
46. Silly Example
- Let events be grades in a class:
  - w_1 = Gets an A, P(A) = ½
  - w_2 = Gets a B, P(B) = µ
  - w_3 = Gets a C, P(C) = 2µ
  - w_4 = Gets a D, P(D) = ½ - 3µ
  - (Note: 0 ≤ µ ≤ 1/6)
- Assume we want to estimate µ from data. In a given class there were
  - a A's
  - b B's
  - c C's
  - d D's
- What's the maximum likelihood estimate of µ given a, b, c, d?
47. Silly Example
- Let events be grades in a class:
  - w_1 = Gets an A, P(A) = ½
  - w_2 = Gets a B, P(B) = µ
  - w_3 = Gets a C, P(C) = 2µ
  - w_4 = Gets a D, P(D) = ½ - 3µ
  - (Note: 0 ≤ µ ≤ 1/6)
- Assume we want to estimate µ from data. In a given class there were
  - a A's
  - b B's
  - c C's
  - d D's
- What's the maximum likelihood estimate of µ given a, b, c, d?
48. Trivial Statistics
- P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
- P(a,b,c,d | µ) = K (½)^a (µ)^b (2µ)^c (½ - 3µ)^d
- log P(a,b,c,d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ - 3µ)

  A: 14    B: 6    C: 9    D: 10

Boring, but true!
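Setting the derivative of the log likelihood to zero gives the maximum likelihood estimate:

\frac{\partial}{\partial \mu} \log P(a,b,c,d \mid \mu) = \frac{b}{\mu} + \frac{c}{\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0 \;\Rightarrow\; \hat\mu = \frac{b + c}{6\,(b + c + d)}

For the class above (a = 14, b = 6, c = 9, d = 10) this gives \hat\mu = 15/150 = 1/10.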
49. Same Problem with Hidden Information
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
Someone tells us that:
  Number of high grades (A's + B's) = h
  Number of C's = c
  Number of D's = d
What is the max. likelihood estimate of µ now?
50. Same Problem with Hidden Information
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
Someone tells us that:
  Number of high grades (A's + B's) = h
  Number of C's = c
  Number of D's = d
What is the max. likelihood estimate of µ now? We can answer this question circularly:
EXPECTATION: If we knew the value of µ, we could compute the expected values of a and b, since the ratio a : b should be the same as the ratio ½ : µ.
MAXIMIZATION: If we knew the expected values of a and b, we could compute the maximum likelihood value of µ.
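Written out, the two steps of this circular answer are:

EXPECTATION:  E[b \mid \mu, h] = \frac{\mu}{\tfrac{1}{2} + \mu}\, h  \quad\text{(and correspondingly } E[a \mid \mu, h] = \frac{1/2}{\tfrac{1}{2} + \mu}\, h\text{)}

MAXIMIZATION:  \hat\mu = \frac{b + c}{6\,(b + c + d)}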
51. E.M. for our Trivial Problem
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ - 3µ
- We begin with a guess for µ.
- We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b.
- Define µ(t) = the estimate of µ on the t-th iteration, b(t) = the estimate of b on the t-th iteration.
E-step
M-step
Continue iterating until converged. Good news: converging to a local optimum is assured. Bad news: I said local optimum.
52. E.M. Convergence
- Convergence proof is based on the fact that Prob(data | µ) must increase or remain the same between each iteration [NOT OBVIOUS]
- But it can never exceed 1 [OBVIOUS]
- So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0:

  t    µ(t)     b(t)
  0    0        0
  1    0.0833   2.857
  2    0.0937   3.158
  3    0.0947   3.185
  4    0.0948   3.187
  5    0.0948   3.187
  6    0.0948   3.187

Convergence is generally linear: the error decreases by a constant factor each time step.
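A minimal script (not from the slides) that reproduces this iteration, up to rounding in the last digit:

# EM for the grades example: h high grades (A's + B's), c C's, d D's.
h, c, d = 20, 10, 10
mu, b = 0.0, 0.0                          # initial guess: mu(0) = 0, so b(0) = 0
print(0, mu, b)
for t in range(1, 7):
    mu = (b + c) / (6.0 * (b + c + d))    # M-step: ML estimate of mu from expected counts
    b = h * mu / (0.5 + mu)               # E-step: expected number of B's given mu
    print(t, round(mu, 4), round(b, 3))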
53. Back to Unsupervised Learning of GMMs
Remember:
- We have unlabeled data x_1, x_2, ..., x_R.
- We know there are k classes.
- We know P(w_1), P(w_2), P(w_3), ..., P(w_k).
- We don't know µ_1, µ_2, ..., µ_k.
- We can write P(data | µ_1, ..., µ_k).
54. E.M. for GMMs
See http://www.cs.cmu.edu/awm/doc/gmm-algebra.pdf
Setting the derivatives of the log likelihood to zero gives a set of coupled nonlinear equations in the µ_j's.
If, for each x_i, we knew the probability P(w_j | x_i, µ_1, ..., µ_k) that it belongs to each class w_j, then we could easily compute the µ_j's. If we knew each µ_j, then we could easily compute P(w_j | x_i, µ_1, ..., µ_k) for each w_j and x_i.
I feel an EM experience coming on!
55. E.M. for GMMs
- Iterate. On the t-th iteration let our estimates be λ_t = { µ_1(t), µ_2(t), ..., µ_c(t) }.
- E-step: compute the "expected classes" of all datapoints for each class; this just requires evaluating a Gaussian at each x_k.
- M-step: compute the maximum likelihood µ's given our data's class membership distributions.
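In standard form (c classes, shared covariance σ²I), the two steps are:

E-step:  P(w_j \mid x_k, \lambda_t) = \frac{P(w_j)\, N(x_k;\, \mu_j(t), \sigma^2 I)}{\sum_{i=1}^{c} P(w_i)\, N(x_k;\, \mu_i(t), \sigma^2 I)}

M-step:  \mu_j(t+1) = \frac{\sum_k P(w_j \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_j \mid x_k, \lambda_t)}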
56. E.M. Convergence
- As with all EM procedures, convergence to a local optimum is guaranteed.
- This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g. Vector Quantization for Speech Data.
57. E.M. for General GMMs
p_i(t) is shorthand for the estimate of P(w_i) on the t-th iteration.
- Iterate. On the t-th iteration let our estimates be λ_t = { µ_1(t), ..., µ_c(t), Σ_1(t), ..., Σ_c(t), p_1(t), ..., p_c(t) }.
- E-step: compute the "expected classes" of all datapoints for each class; this just requires evaluating a Gaussian at each x_k.
- M-step: compute the maximum likelihood µ's, Σ's, and p's given our data's class membership distributions, summing over all R records.
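As an illustration, here is a minimal numpy sketch (not the course code) of EM for a general GMM with full covariances; the initialization and the small ridge term are arbitrary choices, and all names are illustrative:

import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density N(x; mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a k-component GMM with full covariance matrices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random datapoints as means, identity covariances, uniform weights.
    means = X[rng.choice(n, k, replace=False)]
    covs = np.array([np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities P(w_j | x_i, current parameters).
        resp = np.column_stack(
            [weights[j] * gaussian_pdf(X, means[j], covs[j]) for j in range(k)]
        )
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        Nj = resp.sum(axis=0)                       # effective counts per component
        weights = Nj / n                            # p_j(t+1)
        means = (resp.T @ X) / Nj[:, None]          # mu_j(t+1)
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nj[j]
            covs[j] += 1e-6 * np.eye(d)             # small ridge for numerical stability
    return weights, means, covs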
58. Gaussian Mixture Example: Start
59. After first iteration
60. After 2nd iteration
61. After 3rd iteration
62. After 4th iteration
63. After 5th iteration
64. After 6th iteration
65. After 20th iteration
66. Some Bio Assay data
67. GMM clustering of the assay data
68. Resulting Density Estimator
69. In EM the model (the number of parameters, which is the number of lines in our case) is fixed. What happens if we start with a wrong model?
[Figure: converged EM fit with one line]
70. Three lines
71. Why not 8 lines?
72. Which model is better?
We cannot just compute an approximation error (e.g. LSE), since it decreases with the number of lines. If we have too many lines, we just fit the noise.
73. Can we find an optimal model? The human visual system does it.
[Figure: two fits labeled "Good fit" and "Bad fit"]
74. Polygonal Approximation of Laser Range Data Based on Perceptual Grouping and EM
Longin Jan Latecki and Rolf Lakaemper, CIS Dept., Temple University, Philadelphia
75. Grouping Edge Points to Contour Parts
Longin Jan Latecki and Rolf Lakaemper, CIS Dept., Temple University, Philadelphia