1
Prototype Classification Methods
  • Fu Chang
  • Institute of Information Science
  • Academia Sinica
  • 2788-3799 ext. 1819
  • fchang@iis.sinica.edu.tw

2
Types of Prototype Methods
  • Crisp model (K-means, KM): prototypes are centers
    of non-overlapping clusters
  • Fuzzy model (fuzzy c-means, FCM): prototypes are
    weighted averages of all samples
  • Gaussian mixture model (GM): prototypes are the
    component means of a mixture of distributions
  • Linear discriminant analysis (LDA): prototypes are
    projected sample means
  • K-nearest neighbor classifier (K-NN)
  • Learning vector quantization (LVQ)

3
Prototypes through Clustering
  • Given the number k of prototypes, find k clusters
    whose centers serve as the prototypes
  • Commonality
  • Use an iterative algorithm aimed at decreasing an
    objective function
  • May converge to a local minimum
  • The number of prototypes k, as well as an initial
    solution, must be specified

4
Clustering Objectives
  • The aim of the iterative algorithm is to decrease
    the value of an objective function
  • Notation
  • Samples x1, x2, ..., xn
  • Prototypes p1, p2, ..., pk
  • L2-distance d(x, p) = ||x - p||

5
Objectives (Cntd)
  • Crisp objective
  • Fuzzy objective
  • Gaussian mixture objective
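The three objective functions were images on the original slides and are missing from this transcript. Standard forms, in the notation of the previous slide (the fuzzifier m > 1 and the mixture weights α_i are introduced later in the talk), would be:

```latex
% Crisp (K-means) objective: squared distance of each sample to its assigned prototype
J_{KM} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - p_i \rVert^{2}

% Fuzzy (FCM) objective, with memberships u_{ij} and fuzzifier m > 1
J_{FCM} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \lVert x_j - p_i \rVert^{2},
\qquad \sum_{i=1}^{c} u_{ij} = 1 \ \text{for every } j

% Gaussian mixture objective: log-likelihood of the data under the mixture
J_{GM} = \sum_{j=1}^{n} \log \sum_{i=1}^{M} \alpha_i \, p_i(x_j \mid \theta_i)
```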

6
K-Means Clustering
7
The Algorithm
  • Initialize k seeds of prototypes p1, p2, ..., pk
  • Grouping
  • Assign each sample to its nearest prototype
  • Form non-overlapping clusters out of these
    samples
  • Centering
  • The centers of the clusters become the new
    prototypes
  • Repeat the grouping and centering steps until
    convergence (see the sketch below)
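The slides contain no code; as a concrete illustration of the grouping and centering loop, here is a minimal NumPy sketch. The function name, the random initialization, and the convergence test are my own choices, not part of the presentation.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: X is (n_samples, n_features); returns (prototypes, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize k seeds by picking k distinct samples at random.
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Grouping: assign each sample to its nearest prototype (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centering: the center of each non-empty cluster becomes the new prototype.
        new_prototypes = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else prototypes[i]
            for i in range(k)
        ])
        if np.allclose(new_prototypes, prototypes):
            break  # prototypes no longer move, so the objective has converged
        prototypes = new_prototypes
    return prototypes, labels
```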

8
Justification
  • Grouping
  • Assigning samples to their nearest prototypes
    helps to decrease the objective
  • Centering
  • Also helps to decrease the above objective,
    because, for any vector c,
    Σ_i ||y_i - c||^2 ≥ Σ_i ||y_i - y*||^2, where y*
    is the mean of the y_i
  • and equality holds only if c = y*

9
Exercise
  • Prove that, for any group of vectors y_i and any
    vector c, the following inequality always holds:
    Σ_i ||y_i - c||^2 ≥ Σ_i ||y_i - y*||^2, where y*
    is the mean of the y_i
  • Prove that the equality holds only when c = y*
  • Use this fact to prove that the centering step
    helps to decrease the objective function

10
Fuzzy c-Means Clustering
11
Crisp vs. Fuzzy Membership
  • Membership matrix U = [u_ij], of size c x n
  • u_ij is the grade of membership of sample j with
    respect to prototype i
  • Crisp membership: u_ij is 0 or 1, and each sample
    belongs to exactly one prototype (Σ_i u_ij = 1)
  • Fuzzy membership: u_ij lies in [0, 1], with
    Σ_i u_ij = 1 for every sample j

12
Fuzzy c-means (FCM)
  • The objective function of FCM is
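The formula itself is missing; it is presumably the fuzzy objective from slide 5, with fuzzifier m > 1 and d_ij = ||x_j - p_i||:

```latex
J_{FCM} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d_{ij}^{2},
\qquad d_{ij} = \lVert x_j - p_i \rVert
```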

13
FCM (Cntd)
  • Introducing the Lagrange multiplier λ with
    respect to the constraint Σ_i u_ij = 1 (one
    multiplier per sample j),
  • we rewrite the objective function as
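The rewritten objective is missing; presumably it is the Lagrangian with one multiplier λ_j per sample:

```latex
J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d_{ij}^{2}
  + \sum_{j=1}^{n} \lambda_j \Bigl( \sum_{i=1}^{c} u_{ij} - 1 \Bigr)
```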

14
FCM (Cntd)
  • Setting the partial derivatives to zero, we obtain
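The equations did not survive the transcript; setting the derivatives of the Lagrangian with respect to u_ij and λ_j to zero presumably gives

```latex
\frac{\partial J}{\partial u_{ij}} = m \, u_{ij}^{m-1} d_{ij}^{2} + \lambda_j = 0
\qquad \text{(1st equation)}

\frac{\partial J}{\partial \lambda_j} = \sum_{i=1}^{c} u_{ij} - 1 = 0
\qquad \text{(2nd equation)}
```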

15
FCM (Cntd)
  • From the 2nd equation, we obtain
  • From this fact and the 1st equation, we obtain

16
FCM (Cntd)
  • Therefore,
  • and

17
FCM (Cntd)
  • Together with the 2nd equation, we obtain the
    updating rule for uij
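The update rule itself is missing; solving the 1st equation for u_ij and eliminating λ_j through the constraint gives the familiar FCM membership update:

```latex
% From the 1st equation: u_{ij} = \bigl( -\lambda_j / (m \, d_{ij}^{2}) \bigr)^{1/(m-1)}
% Substituting into the 2nd equation and eliminating \lambda_j:
u_{ij} = \frac{1}{\sum_{k=1}^{c} \bigl( d_{ij} / d_{kj} \bigr)^{2/(m-1)}}
```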

18
FCM (Cntd)
  • On the other hand, setting the derivative of J
    with respect to pi to zero, we obtain

19
FCM (Cntd)
  • It follows that
  • Finally, we obtain the update rule for the
    prototype p_i
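The equations are missing; the standard result is

```latex
\frac{\partial J}{\partial p_i} = -2 \sum_{j=1}^{n} u_{ij}^{m} (x_j - p_i) = 0
\quad \Longrightarrow \quad
p_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} \, x_j}{\sum_{j=1}^{n} u_{ij}^{m}}
```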

20
FCM (Cntd)
  • To summarize
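As a concrete companion to the derivation above, here is a minimal NumPy sketch of the two alternating FCM updates; function and parameter names are my own, not from the slides.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, eps=1e-10, seed=0):
    """Minimal FCM sketch: returns (prototypes, membership matrix U of shape (c, n))."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random fuzzy memberships, normalized so that each column sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # Center update: weighted average of all samples with weights u_ij^m.
        W = U ** m
        prototypes = (W @ X) / W.sum(axis=1, keepdims=True)
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)).
        d = np.linalg.norm(X[None, :, :] - prototypes[:, None, :], axis=2) + eps
        U = 1.0 / (d ** (2.0 / (m - 1)) * np.sum(d ** (-2.0 / (m - 1)), axis=0))
    return prototypes, U
```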

21
K-means vs. Fuzzy c-means
[Figure: the sample points used for the comparison]
22
K-means vs. Fuzzy c-means
[Figure: clustering results of K-means and fuzzy c-means on the same sample points]
23
Expectation-Maximization (EM) Algorithm
24
What Is Given
  • Observed data X = {x1, x2, ..., xn}, each sample
    drawn independently from a mixture of probability
    distributions with the density given below
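The density and its constraint are missing; following the Bilmes tutorial cited in the references, they are presumably

```latex
p(x \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x \mid \theta_i),
\qquad \sum_{i=1}^{M} \alpha_i = 1, \ \alpha_i \ge 0,
\qquad \Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M)
```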

25
Incomplete vs. Complete Data
  • The incomplete-data log-likelihood (see below) is
    difficult to optimize
  • The complete-data log-likelihood can be handled
    much more easily, where H is the set of hidden
    random variables
  • How do we compute the distribution of H?
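Both log-likelihoods are missing; standard reconstructions, with hidden labels H = (h_1, ..., h_n) indicating which component generated each sample, are

```latex
% Incomplete-data log-likelihood: the sum inside the logarithm makes it hard to optimize
\log L(\Theta \mid X) = \sum_{j=1}^{n} \log \sum_{i=1}^{M} \alpha_i \, p_i(x_j \mid \theta_i)

% Complete-data log-likelihood: the logarithm acts on a single product term
\log L(\Theta \mid X, H) = \sum_{j=1}^{n} \log \bigl( \alpha_{h_j} \, p_{h_j}(x_j \mid \theta_{h_j}) \bigr)
```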

26
EM Algorithm
  • E-step: find the expected value of the
    complete-data log-likelihood, where Θ^g is the
    current estimate of the parameters Θ
  • M-step: update the estimate by maximizing this
    expected value
  • Repeat the two steps until convergence (both are
    written out below)
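The formulas for the two steps are missing; in the usual notation, with Θ^g the current estimate, they are

```latex
% E-step: expected complete-data log-likelihood under the current estimate
Q(\Theta, \Theta^{g}) = E\bigl[ \log p(X, H \mid \Theta) \;\big|\; X, \Theta^{g} \bigr]

% M-step: re-estimate the parameters by maximizing this surrogate
\Theta^{g+1} = \arg\max_{\Theta} \, Q(\Theta, \Theta^{g})
```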

27
E-M Steps
28
Justification
  • The expected value Q(Θ, Θ^g) is a lower bound of
    the log-likelihood

29
Justification (Cntd)
  • The maximum of the lower bound equals the
    log-likelihood
  • The first term of (1) is the (negative) relative
    entropy of q(h) with respect to p(h | X, Θ)
  • The second term is a quantity that does not
    depend on h
  • We obtain the maximum of (1) when the relative
    entropy becomes zero, i.e., when q(h) = p(h | X, Θ)
  • With this choice, the first term vanishes and (1)
    attains its maximum, which is the log-likelihood
    log p(X | Θ)
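The expression the slides call (1) is missing; in Minka's lower-bound view it is presumably, for any distribution q(h) over the hidden variables,

```latex
\log p(X \mid \Theta) \;\ge\; \mathcal{L}(q, \Theta)
  = \sum_{h} q(h) \log \frac{p(X, h \mid \Theta)}{q(h)}
  = -\, D\bigl( q(h) \,\big\|\, p(h \mid X, \Theta) \bigr) + \log p(X \mid \Theta)
\qquad (1)

% The relative-entropy term D vanishes exactly when q(h) = p(h | X, \Theta),
% so the maximum of the lower bound equals the log-likelihood \log p(X \mid \Theta).
```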

30
Details of EM Algorithm
  • Let Θ^g = (α1^g, ..., αM^g, θ1^g, ..., θM^g) be
    the guessed values of the parameters Θ
  • For the given Θ^g, we can compute the posterior
    probabilities shown below
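The formula is missing; following Bilmes, it is presumably the posterior probability that sample x_j was generated by component i:

```latex
p(i \mid x_j, \Theta^{g})
  = \frac{\alpha_i^{g} \, p_i(x_j \mid \theta_i^{g})}
         {\sum_{k=1}^{M} \alpha_k^{g} \, p_k(x_j \mid \theta_k^{g})}
```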

31
Details (Cntd)
  • We then consider the expected value
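The expected value itself is missing; for the mixture model it presumably takes the form

```latex
Q(\Theta, \Theta^{g})
  = \sum_{i=1}^{M} \sum_{j=1}^{n}
    \log \bigl( \alpha_i \, p_i(x_j \mid \theta_i) \bigr) \, p(i \mid x_j, \Theta^{g})
```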

32
Details (Cntd)
  • Lagrangian and partial derivative equation
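Both expressions are missing; a standard reconstruction, with a multiplier λ enforcing Σ_i α_i = 1 (the result is what the next slide calls equation (2)), is

```latex
\frac{\partial}{\partial \alpha_i}
\Bigl[ \sum_{i=1}^{M} \sum_{j=1}^{n} \log(\alpha_i) \, p(i \mid x_j, \Theta^{g})
     + \lambda \Bigl( \sum_{i=1}^{M} \alpha_i - 1 \Bigr) \Bigr]
  = \sum_{j=1}^{n} \frac{p(i \mid x_j, \Theta^{g})}{\alpha_i} + \lambda = 0
\qquad (2)
```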

33
Details (Cntd)
  • From (2), we derive that λ = -n and
    αi = (1/n) Σ_j p(i | xj, Θ^g)
  • Based on these values, we can derive the optimal
    Θ for Q(Θ, Θ^g), of which only the following part
    involves θi: Σ_i Σ_j log pi(xj | θi) p(i | xj, Θ^g)

34
Exercise
  • Deduce from (2) that λ = -n and that
    αi = (1/n) Σ_j p(i | xj, Θ^g)

35
Gaussian Mixtures
  • The Gaussian distribution is given by
  • For Gaussian mixtures,
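Both formulas are missing; presumably they are the d-dimensional Gaussian density and the identification of the component parameters:

```latex
p_i(x \mid \mu_i, \Sigma_i)
  = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}}
    \exp\Bigl( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \Bigr)

% For Gaussian mixtures, the component parameters are \theta_i = (\mu_i, \Sigma_i)
```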

36
Gaussian Mixtures (Cntd)
  • Partial derivative
  • Setting this to zero, we obtain
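The derivative and the resulting update are missing; for the mean of component i they are presumably

```latex
\frac{\partial Q}{\partial \mu_i}
  = \sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}) \, \Sigma_i^{-1} (x_j - \mu_i) = 0
\quad \Longrightarrow \quad
\mu_i = \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}) \, x_j}
             {\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}
```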

37
Gaussian Mixtures (Cntd)
  • Taking the derivative of with
    respect to
  • and setting it to zero, we get
  • (many details are omitted)

38
Gaussian Mixtures (Cntd)
  • To summarize
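The summary equations are missing; the standard EM updates for a Gaussian mixture (as in the Bilmes tutorial) are

```latex
\alpha_i^{new} = \frac{1}{n} \sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})

\mu_i^{new} = \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}) \, x_j}
                   {\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}

\Sigma_i^{new} = \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})
                        \, (x_j - \mu_i^{new})(x_j - \mu_i^{new})^{T}}
                      {\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}
```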

39
Linear Discriminant Analysis (LDA)
40
Illustration
41
Definitions
  • Given
  • Samples x1, x2, ..., xn
  • Classes: n_i of the samples are of class i,
    i = 1, 2, ..., c
  • Definitions
  • Sample mean for class i
  • Scatter matrix for class i
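The two definitions are missing; the standard ones are

```latex
% Sample mean for class i (the sum runs over the n_i samples of class i)
m_i = \frac{1}{n_i} \sum_{x \in \text{class } i} x

% Scatter matrix for class i
S_i = \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^{T}
```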

42
Scatter Matrices
  • Total scatter matrix
  • Within-class scatter matrix
  • Between-class scatter matrix
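The three matrices are missing; the usual definitions, with m the mean of all n samples, are

```latex
% Total scatter
S_T = \sum_{j=1}^{n} (x_j - m)(x_j - m)^{T}

% Within-class scatter
S_W = \sum_{i=1}^{c} S_i

% Between-class scatter (and S_T = S_W + S_B)
S_B = \sum_{i=1}^{c} n_i \, (m_i - m)(m_i - m)^{T}
```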

43
Multiple Discriminant Analysis
  • We seek vectors wi, i = 1, 2, ..., c-1
  • and project the samples x onto the
    (c-1)-dimensional space: y = W^T x
  • The criterion for W = (w1, w2, ..., wc-1) is
    given below
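The projection and the criterion are missing; presumably they are the standard determinant-ratio criterion of multiple discriminant analysis:

```latex
y = W^{T} x, \qquad W = (w_1, w_2, \dots, w_{c-1})

J(W) = \frac{\bigl| W^{T} S_B \, W \bigr|}{\bigl| W^{T} S_W \, W \bigr|}
```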

44
Multiple Discriminant Analysis (Cntd)
  • Consider the Lagrangian
  • Take the partial derivative
  • Setting the derivative to zero, we obtain
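The Lagrangian and the resulting equation are missing; a standard reconstruction, maximizing w_i^T S_B w_i subject to w_i^T S_W w_i = 1, leads to a generalized eigenvalue problem:

```latex
L(w_i, \lambda_i) = w_i^{T} S_B \, w_i - \lambda_i \bigl( w_i^{T} S_W \, w_i - 1 \bigr)

\frac{\partial L}{\partial w_i} = 2 S_B \, w_i - 2 \lambda_i S_W \, w_i = 0
\quad \Longrightarrow \quad
S_B \, w_i = \lambda_i \, S_W \, w_i
```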

45
Multiple Discriminant Analysis (Cntd)
  • Find the roots λi of the characteristic equation
    det(S_B - λ S_W) = 0 as eigenvalues
  • and then solve (S_B - λi S_W) wi = 0
  • for the wi corresponding to the largest c-1
    eigenvalues

46
LDA Prototypes
  • The prototype of each class is the mean of the
    projected samples of that class, where the
    projection is through the matrix W
  • In the testing phase
  • All test samples are projected through the same
    optimal W
  • The nearest prototype is the winner (a sketch
    follows below)
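To tie slides 41 to 46 together, here is a minimal sketch of LDA prototype classification. The function names, the use of scipy.linalg.eigh for the generalized eigenproblem, and the assumption that S_W is nonsingular are my own choices, not from the slides.

```python
import numpy as np
from scipy.linalg import eigh

def lda_prototypes(X, y, n_dims):
    """Fit projection directions W and per-class prototypes (projected class means)."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)              # within-class scatter
        diff = (mc - overall_mean)[:, None]
        S_B += len(Xc) * (diff @ diff.T)            # between-class scatter
    # Generalized eigenproblem S_B w = lambda S_W w; eigh needs S_W positive definite
    # and returns eigenvalues in ascending order, so keep the last n_dims columns.
    _, eigvecs = eigh(S_B, S_W)
    W = eigvecs[:, -n_dims:]
    prototypes = {c: X[y == c].mean(axis=0) @ W for c in classes}
    return W, prototypes

def lda_classify(x, W, prototypes):
    """Project a test sample through W; the nearest prototype is the winner."""
    z = x @ W
    return min(prototypes, key=lambda c: np.linalg.norm(z - prototypes[c]))
```

For c classes the slides suggest projecting to c - 1 dimensions, i.e. lda_prototypes(X, y, n_dims=len(np.unique(y)) - 1).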

47
K-Nearest Neighbor (K-NN) Classifier
48
K-NN Classifier
  • For each test sample x, find the K nearest
    training samples and classify x according to the
    majority vote among those K neighbors
  • The asymptotic error rate satisfies the bound
    given below
  • This shows that the error rate is at most twice
    the Bayes error
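The bound itself is missing from the transcript; presumably it is the asymptotic result of Cover and Hart for the nearest-neighbor rule, as presented in Duda, Hart, and Stork: with P* the Bayes error and c the number of classes, the asymptotic nearest-neighbor error P satisfies

```latex
P^{*} \;\le\; P \;\le\; P^{*} \Bigl( 2 - \frac{c}{c-1} \, P^{*} \Bigr) \;\le\; 2 P^{*}
```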

49
Learning Vector Quantization (LVQ)
50
LVQ Algorithm
  • Initialize R prototypes for each class: m1(k),
    m2(k), ..., mR(k), where k = 1, 2, ..., K
  • Draw a training sample x and find the prototype
    mj(k) nearest to x
  • If x and mj(k) match in class type, move the
    prototype toward x: mj(k) ← mj(k) + ε(x - mj(k))
  • Otherwise, move it away from x:
    mj(k) ← mj(k) - ε(x - mj(k))
  • Repeat step 2, decreasing the learning rate ε at
    each iteration (see the sketch below)
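As a concrete companion to the steps above, here is a minimal NumPy sketch of LVQ1; the function name, the epoch loop, and the geometric decay of ε are my own choices, not from the slides.

```python
import numpy as np

def lvq1(X, y, prototypes, proto_labels, epsilon=0.1, n_epochs=20, decay=0.9, seed=0):
    """Minimal LVQ1 sketch: nudge the nearest prototype toward or away from each sample."""
    rng = np.random.default_rng(seed)
    prototypes = np.asarray(prototypes, dtype=float).copy()
    for _ in range(n_epochs):
        for idx in rng.permutation(len(X)):
            x, label = X[idx], y[idx]
            # Find the prototype nearest to the training sample.
            j = np.linalg.norm(prototypes - x, axis=1).argmin()
            # Move it toward x if the class matches, away from x otherwise.
            direction = 1.0 if proto_labels[j] == label else -1.0
            prototypes[j] += direction * epsilon * (x - prototypes[j])
        epsilon *= decay  # decrease the learning rate at each pass
    return prototypes
```

The initial prototypes (R per class, as in step 1) can come from, e.g., running K-means separately within each class.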

51
References
  • F. Höppner, F. Klawonn, R. Kruse, and T. Runkler,
    Fuzzy Cluster Analysis: Methods for
    Classification, Data Analysis and Image
    Recognition, John Wiley & Sons, 1999.
  • J. A. Bilmes, A Gentle Tutorial of the EM
    Algorithm and its Application to Parameter
    Estimation for Gaussian Mixture and Hidden Markov
    Models, www.cs.berkeley.edu/daf/appsem/WordsAndPictures/Papers/bilmes98gentle.pdf
  • T. P. Minka, Expectation-Maximization as Lower
    Bound Maximization, www.stat.cmu.edu/minka/papers/em.html
  • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern
    Classification, 2nd Ed., Wiley-Interscience, 2001.
  • T. Hastie, R. Tibshirani, and J. Friedman, The
    Elements of Statistical Learning, Springer-Verlag,
    2001.