1
Prototype Classification Methods
  • Fu Chang
  • Institute of Information Science
  • Academia Sinica
  • 2788-3799 ext. 1819
  • fchang@iis.sinica.edu.tw

2
Types of Prototype Methods
  • Crisp model (K-means, KM): prototypes are centers
    of non-overlapping clusters
  • Fuzzy model (fuzzy c-means, FCM): prototypes are
    weighted averages of all samples
  • Gaussian mixture model (GM): prototypes are the
    component means of a mixture of distributions
  • Linear discriminant analysis (LDA): prototypes are
    projected sample means
  • K-nearest neighbor classifier (K-NN)
  • Learning vector quantization (LVQ)

3
Prototypes through Clustering
  • Given the number k of prototypes, find k clusters
    whose centers serve as the prototypes
  • Commonality
  • Use an iterative algorithm aimed at decreasing an
    objective function
  • May converge to a local minimum
  • The number of prototypes k, as well as an initial
    solution, must be specified

4
Clustering Objectives
  • The aim of the iterative algorithm is to decrease
    the value of an objective function
  • Notation
  • Samples x1, x2, ..., xn
  • Prototypes p1, p2, ..., pk
  • L2-distance d(x, p) = ||x - p||

5
Objectives (Cntd)
  • Crisp objective
  • Fuzzy objective
  • Gaussian mixture objective
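The three objective functions were images on the original slides and are missing from this transcript. Standard forms, in the notation of the previous slide (the fuzzifier m > 1 and the mixture weights α_i are introduced later in the talk), would be:

```latex
% Crisp (K-means) objective: squared distance of each sample to its assigned prototype
J_{KM} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - p_i \rVert^{2}

% Fuzzy (FCM) objective, with memberships u_{ij} and fuzzifier m > 1
J_{FCM} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \lVert x_j - p_i \rVert^{2},
\qquad \sum_{i=1}^{c} u_{ij} = 1 \ \text{for every } j

% Gaussian mixture objective: log-likelihood of the data under the mixture
J_{GM} = \sum_{j=1}^{n} \log \sum_{i=1}^{M} \alpha_i \, p_i(x_j \mid \theta_i)
```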

6
K-Means Clustering
7
The Algorithm
  • Initialize k seeds of prototypes p1, p2, ..., pk
  • Grouping
  • Assign each sample to its nearest prototype
  • Form non-overlapping clusters out of these
    samples
  • Centering
  • The centers of the clusters become the new
    prototypes
  • Repeat the grouping and centering steps until
    convergence (see the sketch below)
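The slides contain no code; as a concrete illustration of the grouping and centering loop, here is a minimal NumPy sketch. The function name, the random initialization, and the convergence test are my own choices, not part of the presentation.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: X is (n_samples, n_features); returns (prototypes, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize k seeds by picking k distinct samples at random.
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Grouping: assign each sample to its nearest prototype (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centering: the center of each non-empty cluster becomes the new prototype.
        new_prototypes = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else prototypes[i]
            for i in range(k)
        ])
        if np.allclose(new_prototypes, prototypes):
            break  # prototypes no longer move, so the objective has converged
        prototypes = new_prototypes
    return prototypes, labels
```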

8
Justification
  • Grouping
  • Assigning samples to their nearest prototypes
    helps to decrease the objective
  • Centering
  • Also helps to decrease the above objective,
    because, for any vector c,
    Σ_i ||y_i - c||^2 ≥ Σ_i ||y_i - y*||^2, where y*
    is the mean of the y_i
  • and equality holds only if c = y*

9
Exercise
  • Prove that, for any group of vectors y_i and any
    vector c, the following inequality always holds:
    Σ_i ||y_i - c||^2 ≥ Σ_i ||y_i - y*||^2, where y*
    is the mean of the y_i
  • Prove that the equality holds only when c = y*
  • Use this fact to prove that the centering step
    helps to decrease the objective function

10
Fuzzy c-Means Clustering
11
Crisp vs. Fuzzy Membership
  • Membership matrix U = [u_ij], of size c x n
  • u_ij is the grade of membership of sample j with
    respect to prototype i
  • Crisp membership: u_ij is 0 or 1, and each sample
    belongs to exactly one prototype (Σ_i u_ij = 1)
  • Fuzzy membership: u_ij lies in [0, 1], with
    Σ_i u_ij = 1 for every sample j

12
Fuzzy c-means (FCM)
  • The objective function of FCM is
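The formula itself is missing; it is presumably the fuzzy objective from slide 5, with fuzzifier m > 1 and d_ij = ||x_j - p_i||:

```latex
J_{FCM} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d_{ij}^{2},
\qquad d_{ij} = \lVert x_j - p_i \rVert
```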

13
FCM (Cntd)
  • Introducing the Lagrange multiplier λ with
    respect to the constraint Σ_i u_ij = 1 (one
    multiplier per sample j),
  • we rewrite the objective function as
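The rewritten objective is missing; presumably it is the Lagrangian with one multiplier λ_j per sample:

```latex
J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d_{ij}^{2}
  + \sum_{j=1}^{n} \lambda_j \Bigl( \sum_{i=1}^{c} u_{ij} - 1 \Bigr)
```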

14
FCM (Cntd)
  • Setting the partial derivatives to zero, we obtain
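The equations did not survive the transcript; setting the derivatives of the Lagrangian with respect to u_ij and λ_j to zero presumably gives

```latex
\frac{\partial J}{\partial u_{ij}} = m \, u_{ij}^{m-1} d_{ij}^{2} + \lambda_j = 0
\qquad \text{(1st equation)}

\frac{\partial J}{\partial \lambda_j} = \sum_{i=1}^{c} u_{ij} - 1 = 0
\qquad \text{(2nd equation)}
```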

15
FCM (Cntd)
  • From the 2nd equation, we obtain
  • From this fact and the 1st equation, we obtain

16
FCM (Cntd)
  • Therefore,
  • and

17
FCM (Cntd)
  • Together with the 2nd equation, we obtain the
    updating rule for uij
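The update rule itself is missing; solving the 1st equation for u_ij and eliminating λ_j through the constraint gives the familiar FCM membership update:

```latex
% From the 1st equation: u_{ij} = \bigl( -\lambda_j / (m \, d_{ij}^{2}) \bigr)^{1/(m-1)}
% Substituting into the 2nd equation and eliminating \lambda_j:
u_{ij} = \frac{1}{\sum_{k=1}^{c} \bigl( d_{ij} / d_{kj} \bigr)^{2/(m-1)}}
```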

18
FCM (Cntd)
  • On the other hand, setting the derivative of J
    with respect to pi to zero, we obtain

19
FCM (Cntd)
  • It follows that
  • Finally, we obtain the update rule for the
    prototype p_i
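The equations are missing; the standard result is

```latex
\frac{\partial J}{\partial p_i} = -2 \sum_{j=1}^{n} u_{ij}^{m} (x_j - p_i) = 0
\quad \Longrightarrow \quad
p_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} \, x_j}{\sum_{j=1}^{n} u_{ij}^{m}}
```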

20
FCM (Cntd)
  • To summarize
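As a concrete companion to the derivation above, here is a minimal NumPy sketch of the two alternating FCM updates; function and parameter names are my own, not from the slides.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, eps=1e-10, seed=0):
    """Minimal FCM sketch: returns (prototypes, membership matrix U of shape (c, n))."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random fuzzy memberships, normalized so that each column sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # Center update: weighted average of all samples with weights u_ij^m.
        W = U ** m
        prototypes = (W @ X) / W.sum(axis=1, keepdims=True)
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)).
        d = np.linalg.norm(X[None, :, :] - prototypes[:, None, :], axis=2) + eps
        U = 1.0 / (d ** (2.0 / (m - 1)) * np.sum(d ** (-2.0 / (m - 1)), axis=0))
    return prototypes, U
```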

21
K-means vs. Fuzzy c-means
[Figure: the sample points used for the comparison]
22
K-means vs. Fuzzy c-means
[Figure: clustering results of K-means and fuzzy c-means on the same sample points]
23
Expectation-Maximization (EM) Algorithm
24
What Is Given
  • Observed data X = {x1, x2, ..., xn}, each sample
    drawn independently from a mixture of probability
    distributions with the density given below
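The density and its constraint are missing; following the Bilmes tutorial cited in the references, they are presumably

```latex
p(x \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x \mid \theta_i),
\qquad \sum_{i=1}^{M} \alpha_i = 1, \ \alpha_i \ge 0,
\qquad \Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M)
```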

25
Incomplete vs. Complete Data
  • The incomplete-data log-likelihood (see below) is
    difficult to optimize
  • The complete-data log-likelihood can be handled
    much more easily, where H is the set of hidden
    random variables
  • How do we compute the distribution of H?
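Both log-likelihoods are missing; standard reconstructions, with hidden labels H = (h_1, ..., h_n) indicating which component generated each sample, are

```latex
% Incomplete-data log-likelihood: the sum inside the logarithm makes it hard to optimize
\log L(\Theta \mid X) = \sum_{j=1}^{n} \log \sum_{i=1}^{M} \alpha_i \, p_i(x_j \mid \theta_i)

% Complete-data log-likelihood: the logarithm acts on a single product term
\log L(\Theta \mid X, H) = \sum_{j=1}^{n} \log \bigl( \alpha_{h_j} \, p_{h_j}(x_j \mid \theta_{h_j}) \bigr)
```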

26
EM Algorithm
  • E-step: find the expected value of the
    complete-data log-likelihood, where Θ^g is the
    current estimate of the parameters Θ
  • M-step: update the estimate by maximizing this
    expected value
  • Repeat the two steps until convergence (both are
    written out below)
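The formulas for the two steps are missing; in the usual notation, with Θ^g the current estimate, they are

```latex
% E-step: expected complete-data log-likelihood under the current estimate
Q(\Theta, \Theta^{g}) = E\bigl[ \log p(X, H \mid \Theta) \;\big|\; X, \Theta^{g} \bigr]

% M-step: re-estimate the parameters by maximizing this surrogate
\Theta^{g+1} = \arg\max_{\Theta} \, Q(\Theta, \Theta^{g})
```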

27
E-M Steps
28
Justification
  • The expected value Q(Θ, Θ^g) is a lower bound of
    the log-likelihood

29
Justification (Cntd)
  • The maximum of the lower bound equals the
    log-likelihood
  • The first term of (1) is the (negative) relative
    entropy of q(h) with respect to p(h | X, Θ)
  • The second term is a quantity that does not
    depend on h
  • We obtain the maximum of (1) when the relative
    entropy becomes zero, i.e., when q(h) = p(h | X, Θ)
  • With this choice, the first term vanishes and (1)
    attains its maximum, which is the log-likelihood
    log p(X | Θ)
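The expression the slides call (1) is missing; in Minka's lower-bound view it is presumably, for any distribution q(h) over the hidden variables,

```latex
\log p(X \mid \Theta) \;\ge\; \mathcal{L}(q, \Theta)
  = \sum_{h} q(h) \log \frac{p(X, h \mid \Theta)}{q(h)}
  = -\, D\bigl( q(h) \,\big\|\, p(h \mid X, \Theta) \bigr) + \log p(X \mid \Theta)
\qquad (1)

% The relative-entropy term D vanishes exactly when q(h) = p(h | X, \Theta),
% so the maximum of the lower bound equals the log-likelihood \log p(X \mid \Theta).
```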

30
Details of EM Algorithm
  • Let Θ^g = (α1^g, ..., αM^g, θ1^g, ..., θM^g) be
    the guessed values of the parameters Θ
  • For the given Θ^g, we can compute the posterior
    probabilities shown below
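The formula is missing; following Bilmes, it is presumably the posterior probability that sample x_j was generated by component i:

```latex
p(i \mid x_j, \Theta^{g})
  = \frac{\alpha_i^{g} \, p_i(x_j \mid \theta_i^{g})}
         {\sum_{k=1}^{M} \alpha_k^{g} \, p_k(x_j \mid \theta_k^{g})}
```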

31
Details (Cntd)
  • We then consider the expected value
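The expected value itself is missing; for the mixture model it presumably takes the form

```latex
Q(\Theta, \Theta^{g})
  = \sum_{i=1}^{M} \sum_{j=1}^{n}
    \log \bigl( \alpha_i \, p_i(x_j \mid \theta_i) \bigr) \, p(i \mid x_j, \Theta^{g})
```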

32
Details (Cntd)
  • Lagrangian and partial derivative equation
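Both expressions are missing; a standard reconstruction, with a multiplier λ enforcing Σ_i α_i = 1 (the result is what the next slide calls equation (2)), is

```latex
\frac{\partial}{\partial \alpha_i}
\Bigl[ \sum_{i=1}^{M} \sum_{j=1}^{n} \log(\alpha_i) \, p(i \mid x_j, \Theta^{g})
     + \lambda \Bigl( \sum_{i=1}^{M} \alpha_i - 1 \Bigr) \Bigr]
  = \sum_{j=1}^{n} \frac{p(i \mid x_j, \Theta^{g})}{\alpha_i} + \lambda = 0
\qquad (2)
```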

33
Details (Cntd)
  • From (2), we derive that λ = -n and
    αi = (1/n) Σ_j p(i | xj, Θ^g)
  • Based on these values, we can derive the optimal
    Θ for Q(Θ, Θ^g), of which only the following part
    involves θi: Σ_i Σ_j log pi(xj | θi) p(i | xj, Θ^g)

34
Exercise
  • Deduce from (2) that λ = -n and that
    αi = (1/n) Σ_j p(i | xj, Θ^g)

35
Gaussian Mixtures
  • The Gaussian distribution is given by
  • For Gaussian mixtures,
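Both formulas are missing; presumably they are the d-dimensional Gaussian density and the identification of the component parameters:

```latex
p_i(x \mid \mu_i, \Sigma_i)
  = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}}
    \exp\Bigl( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \Bigr)

% For Gaussian mixtures, the component parameters are \theta_i = (\mu_i, \Sigma_i)
```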

36
Gaussian Mixtures (Cntd)
  • Partial derivative
  • Setting this to zero, we obtain
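The derivative and the resulting update are missing; for the mean of component i they are presumably

```latex
\frac{\partial Q}{\partial \mu_i}
  = \sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}) \, \Sigma_i^{-1} (x_j - \mu_i) = 0
\quad \Longrightarrow \quad
\mu_i = \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}) \, x_j}
             {\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}
```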

37
Gaussian Mixtures (Cntd)
  • Taking the derivative of with
    respect to
  • and setting it to zero, we get
  • (many details are omitted)

38
Gaussian Mixtures (Cntd)
  • To summarize
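The summary equations are missing; the standard EM updates for a Gaussian mixture (as in the Bilmes tutorial) are

```latex
\alpha_i^{new} = \frac{1}{n} \sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})

\mu_i^{new} = \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}) \, x_j}
                   {\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}

\Sigma_i^{new} = \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})
                        \, (x_j - \mu_i^{new})(x_j - \mu_i^{new})^{T}}
                      {\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}
```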

39
Linear Discriminant Analysis (LDA)
40
Illustration
41
Definitions
  • Given
  • Samples x1, x2, ..., xn
  • Classes: n_i of the samples are of class i,
    i = 1, 2, ..., c
  • Definitions
  • Sample mean for class i
  • Scatter matrix for class i
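The two definitions are missing; the standard ones are

```latex
% Sample mean for class i (the sum runs over the n_i samples of class i)
m_i = \frac{1}{n_i} \sum_{x \in \text{class } i} x

% Scatter matrix for class i
S_i = \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^{T}
```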

42
Scatter Matrices
  • Total scatter matrix
  • Within-class scatter matrix
  • Between-class scatter matrix
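The three matrices are missing; the usual definitions, with m the mean of all n samples, are

```latex
% Total scatter
S_T = \sum_{j=1}^{n} (x_j - m)(x_j - m)^{T}

% Within-class scatter
S_W = \sum_{i=1}^{c} S_i

% Between-class scatter (and S_T = S_W + S_B)
S_B = \sum_{i=1}^{c} n_i \, (m_i - m)(m_i - m)^{T}
```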

43
Multiple Discriminant Analysis
  • We seek vectors wi, i = 1, 2, ..., c-1
  • and project the samples x onto the
    (c-1)-dimensional space: y = W^T x
  • The criterion for W = (w1, w2, ..., wc-1) is
    given below
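The projection and the criterion are missing; presumably they are the standard determinant-ratio criterion of multiple discriminant analysis:

```latex
y = W^{T} x, \qquad W = (w_1, w_2, \dots, w_{c-1})

J(W) = \frac{\bigl| W^{T} S_B \, W \bigr|}{\bigl| W^{T} S_W \, W \bigr|}
```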

44
Multiple Discriminant Analysis (Cntd)
  • Consider the Lagrangian
  • Take the partial derivative
  • Setting the derivative to zero, we obtain
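The Lagrangian and the resulting equation are missing; a standard reconstruction, maximizing w_i^T S_B w_i subject to w_i^T S_W w_i = 1, leads to a generalized eigenvalue problem:

```latex
L(w_i, \lambda_i) = w_i^{T} S_B \, w_i - \lambda_i \bigl( w_i^{T} S_W \, w_i - 1 \bigr)

\frac{\partial L}{\partial w_i} = 2 S_B \, w_i - 2 \lambda_i S_W \, w_i = 0
\quad \Longrightarrow \quad
S_B \, w_i = \lambda_i \, S_W \, w_i
```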

45
Multiple Discriminant Analysis (Cntd)
  • Find the roots λi of the characteristic equation
    det(S_B - λ S_W) = 0 as eigenvalues
  • and then solve (S_B - λi S_W) wi = 0
  • for the wi corresponding to the largest c-1
    eigenvalues

46
LDA Prototypes
  • The prototype of each class is the mean of the
    projected samples of that class, where the
    projection is through the matrix W
  • In the testing phase
  • All test samples are projected through the same
    optimal W
  • The nearest prototype is the winner (a sketch
    follows below)
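To tie slides 41 to 46 together, here is a minimal sketch of LDA prototype classification. The function names, the use of scipy.linalg.eigh for the generalized eigenproblem, and the assumption that S_W is nonsingular are my own choices, not from the slides.

```python
import numpy as np
from scipy.linalg import eigh

def lda_prototypes(X, y, n_dims):
    """Fit projection directions W and per-class prototypes (projected class means)."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)              # within-class scatter
        diff = (mc - overall_mean)[:, None]
        S_B += len(Xc) * (diff @ diff.T)            # between-class scatter
    # Generalized eigenproblem S_B w = lambda S_W w; eigh needs S_W positive definite
    # and returns eigenvalues in ascending order, so keep the last n_dims columns.
    _, eigvecs = eigh(S_B, S_W)
    W = eigvecs[:, -n_dims:]
    prototypes = {c: X[y == c].mean(axis=0) @ W for c in classes}
    return W, prototypes

def lda_classify(x, W, prototypes):
    """Project a test sample through W; the nearest prototype is the winner."""
    z = x @ W
    return min(prototypes, key=lambda c: np.linalg.norm(z - prototypes[c]))
```

For c classes the slides suggest projecting to c - 1 dimensions, i.e. lda_prototypes(X, y, n_dims=len(np.unique(y)) - 1).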

47
K-Nearest Neighbor (K-NN) Classifier
48
K-NN Classifier
  • For each test sample x, find the K nearest
    training samples and classify x according to the
    majority vote among those K neighbors
  • The asymptotic error rate satisfies the bound
    given below
  • This shows that the error rate is at most twice
    the Bayes error
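The bound itself is missing from the transcript; presumably it is the asymptotic result of Cover and Hart for the nearest-neighbor rule, as presented in Duda, Hart, and Stork: with P* the Bayes error and c the number of classes, the asymptotic nearest-neighbor error P satisfies

```latex
P^{*} \;\le\; P \;\le\; P^{*} \Bigl( 2 - \frac{c}{c-1} \, P^{*} \Bigr) \;\le\; 2 P^{*}
```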

49
Learning Vector Quantization (LVQ)
50
LVQ Algorithm
  • Initialize R prototypes for each class: m1(k),
    m2(k), ..., mR(k), where k = 1, 2, ..., K
  • Draw a training sample x and find the prototype
    mj(k) nearest to x
  • If x and mj(k) match in class type, move the
    prototype toward x: mj(k) ← mj(k) + ε(x - mj(k))
  • Otherwise, move it away from x:
    mj(k) ← mj(k) - ε(x - mj(k))
  • Repeat step 2, decreasing the learning rate ε at
    each iteration (see the sketch below)
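As a concrete companion to the steps above, here is a minimal NumPy sketch of LVQ1; the function name, the epoch loop, and the geometric decay of ε are my own choices, not from the slides.

```python
import numpy as np

def lvq1(X, y, prototypes, proto_labels, epsilon=0.1, n_epochs=20, decay=0.9, seed=0):
    """Minimal LVQ1 sketch: nudge the nearest prototype toward or away from each sample."""
    rng = np.random.default_rng(seed)
    prototypes = np.asarray(prototypes, dtype=float).copy()
    for _ in range(n_epochs):
        for idx in rng.permutation(len(X)):
            x, label = X[idx], y[idx]
            # Find the prototype nearest to the training sample.
            j = np.linalg.norm(prototypes - x, axis=1).argmin()
            # Move it toward x if the class matches, away from x otherwise.
            direction = 1.0 if proto_labels[j] == label else -1.0
            prototypes[j] += direction * epsilon * (x - prototypes[j])
        epsilon *= decay  # decrease the learning rate at each pass
    return prototypes
```

The initial prototypes (R per class, as in step 1) can come from, e.g., running K-means separately within each class.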

51
References
  • F. Höppner, F. Klawonn, R. Kruse, and T. Runkler,
    Fuzzy Cluster Analysis: Methods for
    Classification, Data Analysis and Image
    Recognition, John Wiley & Sons, 1999.
  • J. A. Bilmes, A Gentle Tutorial of the EM
    Algorithm and its Application to Parameter
    Estimation for Gaussian Mixture and Hidden Markov
    Models, www.cs.berkeley.edu/daf/appsem/WordsAndPictures/Papers/bilmes98gentle.pdf
  • T. P. Minka, Expectation-Maximization as Lower
    Bound Maximization, www.stat.cmu.edu/minka/papers/em.html
  • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern
    Classification, 2nd Ed., Wiley-Interscience, 2001.
  • T. Hastie, R. Tibshirani, and J. Friedman, The
    Elements of Statistical Learning, Springer-Verlag,
    2001.