1
Pattern Classification. All materials in these
slides were taken from Pattern Classification
(2nd ed) by R. O. Duda, P. E. Hart and D. G.
Stork, John Wiley & Sons, 2000, with the
permission of the authors and the publisher
2
Chapter 10: Unsupervised Learning and Clustering
  • Introduction
  • Mixture Densities and Identifiability
  • ML Estimates
  • Application to Normal Mixtures
  • K-means algorithm
  • Unsupervised Bayesian Learning
  • Data description and clustering
  • Criterion function for clustering
  • Hierarchical clustering
  • The number-of-clusters problem and cluster validation
  • On-line clustering
  • Graph-theoretic methods
  • PCA and ICA
  • Low-dim reps and multidimensional scaling
    (self-organizing maps)
  • Clustering and dimensionality reduction

3
Introduction
  • Previously, all our training samples were labeled; learning from such samples is said to be supervised
  • Why are we interested in unsupervised
    procedures which use unlabeled samples?
  • Collecting and labeling a large set of sample patterns can be costly
  • We can train with large amounts of (less
    expensive) unlabeled data
  • → Then use supervision to label the groupings found; this is appropriate for large data-mining applications where the contents of a large database are not known beforehand

4
  • Patterns may change slowly with time
  • → Improved performance can be achieved by classifiers running in an unsupervised mode
  • We can use unsupervised methods to identify features that will then be useful for categorization
  • → smart feature extraction
  • We gain some insight into the nature (or structure) of the data
  • → which set of classification labels?

5
Mixture Densities and Identifiability
  • Assume
  • functional forms for underlying probability
    densities are known
  • value of an unknown parameter vector must be
    learned
  • i.e., like chapter 3 but without class labels
  • Specific assumptions
  • The samples come from a known number c of classes
  • The prior probabilities P(ωj) for each class are known (j = 1, …, c)
  • The forms of the class-conditional densities p(x | ωj, θj), j = 1, …, c, are known
  • The values of the c parameter vectors θ1, θ2, …, θc are unknown
  • The category labels are unknown

6
  • The PDF for the samples is
    p(x | θ) = Σj P(ωj) p(x | ωj, θj),  j = 1, …, c
  • This density function is called a mixture density
  • Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ.
  • Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities (a minimal code sketch follows).
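To make the mixture-density idea concrete, here is a minimal Python sketch (not from the slides) that evaluates p(x | θ) = Σj P(ωj) p(x | ωj, θj) for Gaussian components; the priors and component parameters below are arbitrary placeholders.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def mixture_density(x, priors, mus, sigmas):
    """p(x | theta) = sum_j P(w_j) p(x | w_j, theta_j)."""
    return sum(P * gaussian(x, mu, s) for P, mu, s in zip(priors, mus, sigmas))

# Hypothetical two-component example: priors and component parameters are assumptions.
print(mixture_density(0.5, priors=[0.3, 0.7], mus=[-2.0, 2.0], sigmas=[1.0, 1.0]))
```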

7
  • Can θ be recovered from the mixture?
  • Consider the case where
  • we have an unlimited number of samples
  • we use a nonparametric technique to find p(x | θ) for every x
  • If several θ result in the same p(x | θ) → we cannot find a unique solution
  • This is the issue of solution identifiability.
  • Definition (identifiability): A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that p(x | θ) ≠ p(x | θ′)

8
  • As a simple example, consider the case where x is binary and P(x | θ) is the mixture
    P(x | θ) = ½ θ1^x (1 − θ1)^(1−x) + ½ θ2^x (1 − θ2)^(1−x)
  • Assume that P(x = 1 | θ) = 0.6 → P(x = 0 | θ) = 0.4
  • We know P(x | θ) but not θ
  • We can say that θ1 + θ2 = 1.2, but not what θ1 and θ2 individually are.
  • Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible (a small numeric check follows).
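A small numeric check of this point, assuming the equal-prior Bernoulli mixture written above: any two parameter vectors with θ1 + θ2 = 1.2 yield exactly the same distribution P(x | θ).

```python
def p_binary_mixture(x, theta1, theta2):
    """Equal-prior mixture of two Bernoulli components (an assumed form)."""
    comp = lambda t: t ** x * (1 - t) ** (1 - x)
    return 0.5 * comp(theta1) + 0.5 * comp(theta2)

# Both parameter vectors satisfy theta1 + theta2 = 1.2 ...
for theta in [(0.9, 0.3), (0.7, 0.5)]:
    print(theta, p_binary_mixture(1, *theta), p_binary_mixture(0, *theta))
# ... and both give P(x=1) = 0.6, P(x=0) = 0.4: the mixture is unidentifiable.
```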

9
  • In discrete distributions, too many mixture components can be problematic
  • Too many unknowns
  • Perhaps more unknowns than independent equations
  • → identifiability can become a serious problem!

10
  • While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density
  • cannot be uniquely identified if P(ω1) = P(ω2)
  • (we cannot recover a unique θ even from an infinite amount of data!)
  • θ = (θ1, θ2) and θ′ = (θ2, θ1) are two possible vectors that can be interchanged without affecting p(x | θ).
  • Identifiability can be a problem; here we always assume that the densities we are dealing with are identifiable!

11
ML Estimates
  • Suppose that we have a set D = {x1, …, xn} of n unlabeled samples drawn independently from the mixture density
    p(x | θ) = Σj P(ωj) p(x | ωj, θj)
  • (θ is fixed but unknown!)
  • The ML estimate is the value of θ that maximizes p(D | θ)

12
ML Estimates
  • Then the log-likelihood is
    l = Σk ln p(xk | θ),  k = 1, …, n
  • And the gradient of the log-likelihood with respect to θi is
    ∇θi l = Σk P(ωi | xk, θ) ∇θi ln p(xk | ωi, θi)

13
  • Since the gradient must vanish at the value of θi that maximizes l, the ML estimate must satisfy the conditions
    Σk P(ωi | xk, θ) ∇θi ln p(xk | ωi, θi) = 0,  i = 1, …, c

14
  • When the priors are also unknown, the ML estimates for P(ωi) and θi must satisfy
    P(ωi) = (1/n) Σk P(ωi | xk, θ)
    together with the gradient conditions above, where P(ωi | xk, θ) is the posterior probability of class ωi given xk

15
Applications to Normal Mixtures
  • p(x | ωi, θi) ~ N(μi, Σi)
  • Case 1: the simplest case
  • Case 2: a more realistic case

Case    μi    Σi    P(ωi)    c
 1       ?     ×     ×       ×
 2       ?     ?     ?       ×
 3       ?     ?     ?       ?
(? = unknown, × = known)
16
  • Case 1: Multivariate normal, unknown mean vectors
  • θi = μi, i = 1, …, c; the likelihood for the i-th mean follows from the normal form of p(x | ωi, μi)
  • The ML estimate of μ = (μi) must satisfy equation (1):
    μi = Σk P(ωi | xk, μ) xk / Σk P(ωi | xk, μ)
  • where P(ωi | xk, μ) is the fraction of those samples having value xk that come from the i-th class, and μi is the average of the samples coming from the i-th class.

17
  • Unfortunately, equation (1) does not give the estimates explicitly
  • However, if we have some way of obtaining good initial estimates for the unknown means, equation (1) can be used as an iterative process for improving the estimates: the current estimates are used to compute P(ωi | xk, μ), which in turn gives updated means (see the sketch below)
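A minimal sketch of that iterative process for Case 1, under the simplifying assumptions of one-dimensional data, known priors and unit variances; the data, priors and initial estimates are placeholders.

```python
import numpy as np

def iterate_means(X, mus, priors, n_iter=50):
    """Iteratively improve the mean estimates of a 1-D Gaussian mixture
    with known priors and unit variances (Case 1: only the means unknown)."""
    mus = np.array(mus, dtype=float)
    for _ in range(n_iter):
        # p(x_k | w_i, mu_i) for every sample/component pair (unit variance)
        comp = np.exp(-0.5 * (X[:, None] - mus[None, :]) ** 2)
        post = priors * comp                      # proportional to P(w_i | x_k, mu)
        post /= post.sum(axis=1, keepdims=True)   # normalize over components
        mus = (post * X[:, None]).sum(axis=0) / post.sum(axis=0)  # eq. (1) as a fixed point
    return mus

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 200)])  # synthetic data
print(iterate_means(X, mus=[-1.0, 1.0], priors=np.array([1/3, 2/3])))
```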

18
  • This is a gradient-ascent procedure for maximizing the log-likelihood function
  • Example:
  • Consider the simple two-component one-dimensional normal mixture (2 clusters!)
  • Let's set μ1 = −2, μ2 = 2 and draw 25 samples sequentially from this mixture. The log-likelihood function is l(μ1, μ2) = Σk ln p(xk | μ1, μ2) (a grid evaluation of this log-likelihood is sketched below)

(figure: the two mixture components ω1 and ω2)
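A sketch of evaluating this log-likelihood on a grid. The mixing weights (1/3 and 2/3) and unit variances are assumptions, since the mixture equation itself is not reproduced in this transcript; with unequal weights the swapped peak is slightly lower, matching the "almost the same height" remark on the next slide.

```python
import numpy as np

rng = np.random.default_rng(1)
# Draw 25 samples from the assumed mixture: 1/3 N(-2, 1) + 2/3 N(2, 1)
labels = rng.random(25) < 1/3
X = np.where(labels, rng.normal(-2, 1, 25), rng.normal(2, 1, 25))

def log_likelihood(mu1, mu2, X):
    p = (1/3) * np.exp(-0.5 * (X - mu1) ** 2) / np.sqrt(2 * np.pi) \
      + (2/3) * np.exp(-0.5 * (X - mu2) ** 2) / np.sqrt(2 * np.pi)
    return np.log(p).sum()

grid = np.linspace(-4, 4, 81)
L = np.array([[log_likelihood(a, b, X) for b in grid] for a in grid])
i, j = np.unravel_index(L.argmax(), L.shape)
print("peak near mu1 =", grid[i], " mu2 =", grid[j])  # a second, slightly lower peak
                                                      # sits near the swapped values
```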
19
  • The maximum value of l occurs at estimates that are not far from the true values μ1 = −2 and μ2 = 2
  • There is another peak, with the roles of the two means roughly interchanged, which has almost the same height, as can be seen from the following figure.
  • This mixture of normal densities is identifiable
  • When the mixture density is not identifiable, the ML solution is not unique

20
(No Transcript)
21
  • Case 2: All parameters unknown
  • No constraints are placed on the covariance matrix
  • Let p(x | μ, σ²) be the two-component normal mixture in which one component has unknown mean μ and variance σ² and the other is a fixed standard normal, each with probability ½

22
  • Suppose μ = x1; then the first component's contribution to p(x1 | μ, σ²) grows without bound as σ → 0
  • For the rest of the samples, p(xk | μ, σ²) stays bounded away from zero because of the fixed component. Finally,
  • the likelihood can therefore be made arbitrarily large, and the maximum-likelihood solution becomes singular (a numeric illustration follows).
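A small numeric illustration of the singularity, assuming the mixture form described above (one component N(μ, σ²), the other a fixed standard normal, each with weight ½): placing μ on a sample and shrinking σ makes the log-likelihood grow without bound.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(0, 1, 20)          # any sample set will do

def log_likelihood(mu, sigma, X):
    """Assumed mixture: 0.5 * N(mu, sigma^2) + 0.5 * N(0, 1)."""
    p = 0.5 * np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma) \
      + 0.5 * np.exp(-0.5 * X ** 2) / np.sqrt(2 * np.pi)
    return np.log(p).sum()

mu = X[0]                         # put the adjustable mean on top of one sample
for sigma in [1.0, 0.1, 0.01, 0.001]:
    print(sigma, log_likelihood(mu, sigma, X))   # diverges as sigma -> 0
```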

23
  • Assumption: the MLE is well behaved at local maxima.
  • Consider the largest of the finite local maxima of the likelihood function and use the ML estimation.
  • We obtain the following local-maximum-likelihood estimates:
    P(ωi) = (1/n) Σk P(ωi | xk, θ)
    μi = Σk P(ωi | xk, θ) xk / Σk P(ωi | xk, θ)
    Σi = Σk P(ωi | xk, θ)(xk − μi)(xk − μi)^t / Σk P(ωi | xk, θ)

Iterative scheme
24
  • where the posterior probabilities are given by Bayes' rule:
    P(ωi | xk, θ) = P(ωi) p(xk | ωi, θi) / Σj P(ωj) p(xk | ωj, θj)

25
  • K-Means Clustering
  • Goal: find the c mean vectors μ1, μ2, …, μc
  • Replace the squared Mahalanobis distance in the posterior computation by the squared Euclidean distance ||xk − μi||²
  • Find the mean μm nearest to xk and approximate P(ωi | xk, θ) as 1 if i = m and 0 otherwise
  • Use the iterative scheme to find μ1, μ2, …, μc

26
  • If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:
  • Begin
  •   initialize n, c, μ1, μ2, …, μc (randomly selected)
  •   do  classify the n samples according to the nearest μi
  •       recompute μi
  •   until no change in μi
  •   return μ1, μ2, …, μc
  • End
  • Complexity is O(ndcT), where d is the number of features and T the number of iterations (a runnable sketch follows)
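A runnable NumPy sketch of the k-means loop above (random initialization from the data, stopping when the means stop changing); the data at the bottom are synthetic.

```python
import numpy as np

def k_means(X, c, rng=np.random.default_rng(0), max_iter=100):
    """Basic k-means: X is (n, d); returns the (c, d) mean vectors and labels."""
    mu = X[rng.choice(len(X), size=c, replace=False)]   # randomly selected initial means
    for _ in range(max_iter):
        labels = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):      # until no change in mu_i
            break
        mu = new_mu
    return mu, labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (-2, 0, 2)])  # synthetic clusters
means, labels = k_means(X, c=3)
print(means)
```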

27
  • K-means clustering applied to the data from the previous figure

28
Unsupervised Bayesian Learning
  • Besides ML estimation, Bayesian estimation can also be used in the unsupervised case (see the ML and Bayesian methods in Chap. 3 of the textbook)
  • The number of classes is known
  • The class priors are known
  • The forms of the class-conditional probability densities p(x | ωj, θj) are known
  • However, the full parameter vector θ is unknown
  • Part of our knowledge about θ is contained in the prior p(θ)
  • The rest of our knowledge of θ is in the training samples
  • We compute the posterior distribution using the training samples

29
  • We can compute p(θ | D) as seen previously
  • and pass through the usual formulation, introducing the unknown parameter vector θ.
  • Hence, the best estimate of p(x | ωi) is obtained by averaging p(x | ωi, θi) over θi.
  • The goodness of this estimate depends on p(θ | D); this is the main issue of the problem.

P(ωi | D) = P(ωi), since the selection of ωi is
independent of the previous samples
30
  • From Bayes we get
    p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
  • where independence of the samples yields the likelihood
    p(D | θ) = Πk p(xk | θ)
  • or alternately (denoting by Dn the set of n samples) the recursive form
    p(θ | Dn) ∝ p(xn | θ) p(θ | Dn−1)
  • If p(θ) is almost uniform in the region where p(D | θ) peaks, then p(θ | D) peaks in the same place.

31
  • If the only significant peak occurs at θ = θ̂ and the peak is very sharp, then
    p(x | ωi, D) ≅ p(x | ωi, θ̂i)
  • Therefore, the ML estimate is justified.
  • Both approaches coincide if large amounts of data are available.
  • In small-sample-size problems they may or may not agree, depending on the form of the distributions
  • The ML method is typically easier to implement than the Bayesian one

32
  • Formal Bayesian solution: unsupervised learning of the parameters of a mixture density is similar to the supervised learning of the parameters of a component density.
  • Significant differences: identifiability and computational complexity
  • The issue of identifiability:
  • With supervised learning (SL), the lack of identifiability means that we do not obtain a unique parameter vector but an equivalence class, which presents no theoretical difficulty since all members yield the same component density.
  • With unsupervised learning (UL), the lack of identifiability means that the mixture cannot be decomposed into its true components
  • → p(x | Dn) may still converge to p(x), but p(x | ωi, Dn) will not in general converge to p(x | ωi); hence there is a theoretical barrier.
  • The issue of computational complexity:
  • With SL, the existence of sufficient statistics makes the solutions computationally feasible

33
  • With UL, the samples come from a mixture density and there is little hope of finding simple exact solutions for p(D | θ) → n samples result in 2^n terms (corresponding to the ways in which the n samples can be drawn from the 2 classes).
  • Another way of comparing UL and SL is to consider the usual equation in which the mixture density is explicit

34
  • If we consider the case in which P(ω1) = 1 and all other prior probabilities are zero, corresponding to the supervised case in which all samples come from class ω1, then we get

From Previous slide
35
  • Comparing the two equations, we see that observing an additional sample changes the estimate of θ.
  • Ignoring the denominator, which is independent of θ, the only significant difference is that
  • in SL we multiply the prior density for θ by the component density p(xn | ω1, θ1)
  • in UL we multiply the prior density by the whole mixture
  • Assuming that the sample did come from class ω1, the effect of not knowing this category is to diminish the influence of xn in changing θ for category 1.

Eqns From Previous slide
36
Data Clustering
  • Structures of multidimensional patterns are important for clustering
  • If we know that the data come from a specific distribution, such data can be represented by a compact set of parameters (sufficient statistics)
  • If samples are assumed to come from a specific distribution but actually do not, these statistics give a misleading representation of the data

37
  • Approximation of density functions:
  • Mixtures of normal distributions can approximate arbitrary PDFs
  • In these cases, one can use parametric methods to estimate the parameters of the mixture density.
  • No free lunch → dimensionality issue!
  • Huh?

38
  • Caveat:
  • If little prior knowledge can be assumed, the assumption of a parametric form is meaningless
  • Issue: imposing structure vs. finding structure
  • → use nonparametric methods to estimate the unknown mixture density.
  • Alternatively, for subclass discovery:
  • use a clustering procedure
  • identify data points having strong internal similarities

39
Similarity measures
  • What do we mean by similarity?
  • Two issues:
  • How to measure the similarity between samples?
  • How to evaluate a partitioning of a set into clusters?
  • An obvious measure of the similarity/dissimilarity between two samples is the distance between them
  • Samples of the same cluster should be closer to each other than to samples in different clusters.

40
  • Euclidean distance is a possible metric:
  • assume samples belong to the same cluster if their distance is less than a threshold d0
  • Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not invariant to general transformations that distort the distance relationships

41
  • Achieving invariance:
  • normalize the data, e.g., so that they all have zero mean and unit variance,
  • or use principal components for invariance to rotation
  • A broad class of metrics is the Minkowski metric
    d(x, x′) = (Σi |xi − x′i|^q)^(1/q)
  • where q ≥ 1 is a selectable parameter:
  • q = 1 → Manhattan or city-block metric
  • q = 2 → Euclidean metric (see the sketch below)
  • One can also use a nonmetric similarity function s(x, x′) to compare two vectors.
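A short sketch of the Minkowski metric for the two special cases named above (q = 1: Manhattan, q = 2: Euclidean); the vectors are arbitrary.

```python
import numpy as np

def minkowski(x, y, q):
    """d(x, x') = (sum_i |x_i - x'_i|^q)^(1/q), q >= 1."""
    return (np.abs(np.asarray(x) - np.asarray(y)) ** q).sum() ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))   # Manhattan / city-block distance: 5.0
print(minkowski(x, y, 2))   # Euclidean distance: ~3.606
```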

42
  • It is typically a symmetric function whose value is large when x and x′ are similar.
  • For example, the inner product
  • In the case of binary-valued features we have, e.g., the

Tanimoto distance
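A sketch of two such similarity measures for binary feature vectors: the plain inner product (number of shared attributes) and the Tanimoto coefficient s(x, x′) = (shared attributes) / (attributes in x + attributes in x′ − shared), which I take to be the measure named in the caption above; the corresponding distance is commonly 1 − s.

```python
import numpy as np

def tanimoto_similarity(x, y):
    """Tanimoto coefficient for binary vectors:
    shared attributes / (attributes in x + attributes in y - shared)."""
    x, y = np.asarray(x), np.asarray(y)
    shared = np.dot(x, y)
    return shared / (x.sum() + y.sum() - shared)

x = np.array([1, 1, 0, 1, 0])
y = np.array([1, 0, 0, 1, 1])
print(np.dot(x, y))               # inner-product similarity: number of shared 1s
print(tanimoto_similarity(x, y))  # 2 / (3 + 3 - 2) = 0.5
# (The corresponding Tanimoto "distance" is often taken as 1 - similarity.)
```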
43
Clustering as optimization
  • The second issue: how to evaluate a partitioning of a set into clusters?
  • Clustering can be posed as the optimization of a criterion function, e.g.:
  • the sum-of-squared-error criterion and its variants
  • scatter criteria
  • The sum-of-squared-error criterion:
  • Let ni be the number of samples in Di and mi the mean of those samples,
    mi = (1/ni) Σ_{x ∈ Di} x

44
  • The sum of squared errors is defined as
    Je = Σi Σ_{x ∈ Di} ||x − mi||²
  • This criterion defines clusters by their mean vectors mi
  • → it minimizes the sum of the squared lengths of the error vectors x − mi.
  • The partition that minimizes Je is the minimum-variance partition
  • Results:
  • good when the clusters form well-separated, compact clouds
  • bad when there are large differences in the number of samples in different clusters (a sketch of computing Je follows).
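A minimal sketch of computing Je for a given partition; the toy data and label assignment are placeholders.

```python
import numpy as np

def sum_of_squared_error(X, labels):
    """J_e = sum_i sum_{x in D_i} ||x - m_i||^2 for the partition given by labels."""
    Je = 0.0
    for i in np.unique(labels):
        Di = X[labels == i]
        mi = Di.mean(axis=0)                     # cluster mean vector
        Je += ((Di - mi) ** 2).sum()
    return Je

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
print(sum_of_squared_error(X, labels))           # 1.0 for this toy partition
```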

45
  • Scatter criteria:
  • scatter matrices used in multiple discriminant analysis, i.e., the within-cluster scatter matrix SW and the between-cluster scatter matrix SB, with
    ST = SB + SW
  • Note:
  • ST (the total scatter matrix) does not depend on the partitioning
  • in contrast, SB and SW do depend on the partitioning
  • Two approaches:
  • minimize the within-cluster scatter
  • maximize the between-cluster scatter

46
  • The trace (the sum of the diagonal elements) is the simplest scalar measure of a scatter matrix
  • it is proportional to the sum of the variances in the coordinate directions
  • for the within-cluster scatter matrix, this is exactly the sum-of-squared-error criterion: tr[SW] = Je

47
  • Since tr[ST] = tr[SW] + tr[SB] and tr[ST] is independent of the partitioning, no new results are obtained by maximizing tr[SB]
  • However, seeking to minimize the within-cluster criterion Je = tr[SW] is equivalent to maximizing the between-cluster criterion
    tr[SB] = Σi ni ||mi − m||²
  • where m is the total mean vector (a numeric check of the trace identity follows)
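A quick numeric check of the identity tr[ST] = tr[SW] + tr[SB] on a toy partition, following the definitions of the scatter matrices above.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster (SW), between-cluster (SB) and total (ST) scatter matrices."""
    m = X.mean(axis=0)                                     # total mean vector
    d = X.shape[1]
    SW, SB = np.zeros((d, d)), np.zeros((d, d))
    for i in np.unique(labels):
        Di = X[labels == i]
        mi = Di.mean(axis=0)
        SW += (Di - mi).T @ (Di - mi)
        SB += len(Di) * np.outer(mi - m, mi - m)
    ST = (X - m).T @ (X - m)
    return SW, SB, ST

X = np.array([[0., 0.], [1., 0.], [4., 4.], [5., 5.], [6., 4.]])
labels = np.array([0, 0, 1, 1, 1])
SW, SB, ST = scatter_matrices(X, labels)
print(np.trace(ST), np.trace(SW) + np.trace(SB))           # the two traces agree
```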

48
Iterative optimization
  • Clustering ? discrete optimization problem
  • Finite data set ? finite number of partitions
  • What is the cost of exhaustive search?
  • ?cn/c! For c clusters. Not a good idea
  • Typically iterative optimization used
  • starting from a reasonable initial partition
  • Redistribute samples to minimize criterion
    function.
  • ? guarantees local, not global, optimization.

49
  • Consider an iterative procedure to minimize the sum-of-squared-error criterion Je = Σi Ji,
  • where Ji = Σ_{x ∈ Di} ||x − mi||² is the effective error per cluster.
  • Moving a sample x̂ from cluster Di to Dj changes the errors in the two clusters by
    Jj ← Jj + (nj / (nj + 1)) ||x̂ − mj||²   and   Ji ← Ji − (ni / (ni − 1)) ||x̂ − mi||²

50
  • Hence, the transfer is advantageous if the decrease in Ji is larger than the increase in Jj (a sketch of this test follows)
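A sketch of that transfer test. The single-sample update formulas used here (adding x̂ to Dj increases Jj by nj/(nj+1)·||x̂ − mj||²; removing it from Di decreases Ji by ni/(ni−1)·||x̂ − mi||²) are my reconstruction of the equations omitted from this transcript.

```python
import numpy as np

def transfer_is_advantageous(x_hat, Di, Dj):
    """Should x_hat (currently in Di) be moved to Dj?
    Compare the decrease in J_i with the increase in J_j."""
    ni, nj = len(Di), len(Dj)
    mi, mj = Di.mean(axis=0), Dj.mean(axis=0)
    increase_j = nj / (nj + 1) * np.sum((x_hat - mj) ** 2)
    # singleton clusters are simply left alone in this sketch
    decrease_i = ni / (ni - 1) * np.sum((x_hat - mi) ** 2) if ni > 1 else 0.0
    return decrease_i > increase_j

Di = np.array([[0., 0.], [0., 1.], [4., 4.]])   # x_hat = (4, 4) sits far from Di's mean
Dj = np.array([[5., 5.], [5., 6.]])
print(transfer_is_advantageous(np.array([4., 4.]), Di, Dj))   # True: move it
```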

51
  • Algorithm 3 (in the textbook) is a sequential version of the k-means algorithm
  • Algorithm 3 updates the means each time a sample is reclassified
  • k-means waits until all n samples have been reclassified before updating
  • Algorithm 3 can get trapped in local minima
  • it depends on the order of the samples
  • it is basically a myopic approach
  • but it is online!

52
  • The starting point is always a problem
  • Approaches:
  • random centers for the clusters
  • repetition with different random initializations
  • use as the c-cluster starting point the solution of the (c−1)-cluster problem plus the sample farthest from the nearest cluster center

53
Hierarchical Clustering
  • Many times, clusters are not disjoint: a cluster may have subclusters, which in turn have sub-subclusters, etc.
  • Consider a sequence of partitions of the n samples into c clusters
  • The first is a partition into n clusters, each containing exactly one sample
  • The second is a partition into n − 1 clusters, the third into n − 2, and so on, until the n-th, in which there is only one cluster containing all of the samples
  • At level k in the sequence, c = n − k + 1.

54
  • Given any two samples x and x′, they will be grouped together at some level, and if they are grouped at level k, they remain grouped for all higher levels
  • Hierarchical clustering → a tree representation called a dendrogram

55
  • Are the groupings natural or forced? Check the similarity values
  • evenly distributed similarity values → no justification for the grouping
  • Another representation is based on sets, e.g., Venn diagrams

56
  • Hierarchical clustering can be divided into agglomerative and divisive approaches.
  • Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by merging clusters
  • Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters

57
  • Agglomerative hierarchical clustering:
  • the procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than the mean or a representative vector for each cluster (a sketch follows)
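A minimal sketch of agglomerative clustering with a pluggable cluster-distance function (dmin here, i.e., single linkage); it keeps merging the two nearest clusters until c clusters remain and returns them as sets of points, as described above. The brute-force pair search is for clarity, not efficiency.

```python
import numpy as np

def d_min(A, B):
    """Single-linkage distance: nearest pair of points across the two clusters."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def agglomerate(X, c, dist=d_min):
    clusters = [[x] for x in X]                 # start from n singleton clusters
    while len(clusters) > c:
        # find the pair of clusters with the smallest distance and merge them
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters                             # clusters as lists of points

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-3, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])
for cl in agglomerate(X, c=2):
    print(len(cl), "points")
```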

58
  • At any level, the distance between the nearest clusters can provide the dissimilarity value for that level
  • To find the nearest clusters, one can use
    dmin(Di, Dj) = min ||x − x′||,   dmax(Di, Dj) = max ||x − x′||,
    davg(Di, Dj) = (1 / (ni nj)) Σ Σ ||x − x′||,   dmean(Di, Dj) = ||mi − mj||
    (minima, maxima and sums taken over x ∈ Di, x′ ∈ Dj)
  • which behave quite similarly if the clusters are hyperspherical and well separated.
  • The computational complexity is O(cn²d²), n >> c

59
  • Nearest-neighbor algorithm (single linkage):
  • dmin is used
  • viewed in graph terms, an edge is added between the nearest nonconnected components
  • equivalent to Prim's minimum-spanning-tree algorithm
  • terminates when the distance between the nearest clusters exceeds an arbitrary threshold

60
  • The use of dmin as the distance measure in agglomerative clustering generates a minimal spanning tree
  • Chaining effect: a defect of this distance measure (figure, right)

61
  • The farthest-neighbor algorithm (complete linkage):
  • dmax is used
  • this method discourages the growth of elongated clusters
  • in graph-theoretic terms:
  • every cluster is a complete subgraph
  • the distance between two clusters is determined by the most distant nodes in the two clusters
  • terminates when the distance between the nearest clusters exceeds an arbitrary threshold

62
  • When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters
  • All the procedures involving minima or maxima are sensitive to outliers; the use of dmean or davg is a natural compromise

63
The problem of the number of clusters
  • How many clusters should there be?
  • For clustering by extremizing a criterion function:
  • repeat the clustering with c = 1, 2, 3, etc.
  • look for large changes in the criterion function (see the sketch below)
  • Alternatively:
  • set a threshold for the creation of a new cluster
  • useful for on-line cases
  • sensitive to the order of presentation of the data
  • These approaches are similar to model selection procedures
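A sketch of the first approach: rerun a criterion-based clustering (a basic k-means here) for c = 1, 2, 3, … and look for the value of c beyond which Je stops dropping sharply; the data are synthetic with three true clusters.

```python
import numpy as np

def kmeans_Je(X, c, rng, iters=100):
    """Run basic k-means and return the resulting sum-of-squared-error J_e."""
    mu = X[rng.choice(len(X), c, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        mu = np.array([X[labels == i].mean(0) if np.any(labels == i) else mu[i]
                       for i in range(c)])
    return sum(((X[labels == i] - mu[i]) ** 2).sum() for i in range(c))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.4, (40, 2)) for m in (-3, 0, 3)])   # 3 "true" clusters
for c in range(1, 7):
    print(c, round(kmeans_Je(X, c, rng), 1))   # J_e drops sharply up to c = 3, then flattens
```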

64
Graph-theoretic methods
  • Caveat: there is no uniform way of posing clustering as a graph-theoretic problem
  • Generalize from a threshold distance to arbitrary similarity measures.
  • If s0 is a threshold value, we can say that xi is similar to xj if s(xi, xj) > s0.
  • We can then define a binary similarity matrix S = [sij], with sij = 1 if xi is similar to xj and 0 otherwise

65
  • This matrix induces a similarity graph, dual to S, in which nodes correspond to points and an edge joins nodes i and j iff sij = 1.
  • Single-linkage algorithm: two samples x and x′ are in the same cluster if there exists a chain x, x1, x2, …, xk, x′ such that x is similar to x1, x1 to x2, and so on → the clusters are the connected components of the graph
  • Complete-link algorithm: all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster.
  • The nearest-neighbor algorithm is a method for finding the minimum spanning tree, and vice versa
  • Removal of the longest edge produces a 2-cluster grouping, removal of the next-longest edge produces a 3-cluster grouping, and so on (a sketch of the connected-components view follows).
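A minimal sketch of the threshold-graph view: build the binary similarity matrix S and read off the single-linkage clusters as the connected components of the induced graph. The similarity function s(x, x′) = exp(−||x − x′||) is an arbitrary choice for illustration.

```python
import numpy as np

def similarity_graph_clusters(X, s0):
    """Cluster = connected component of the graph with an edge (i, j) iff s_ij = 1."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    S = (np.exp(-dist) > s0).astype(int)          # assumed similarity: s(x, x') = exp(-||x - x'||)
    # depth-first search over the similarity graph to collect connected components
    unseen, clusters = set(range(n)), []
    while unseen:
        stack, comp = [unseen.pop()], []
        while stack:
            i = stack.pop()
            comp.append(i)
            neighbors = {j for j in list(unseen) if S[i, j]}
            unseen -= neighbors
            stack.extend(neighbors)
        clusters.append(sorted(comp))
    return clusters

X = np.array([[0., 0.], [0.2, 0.1], [5., 5.], [5.1, 4.9]])
print(similarity_graph_clusters(X, s0=0.5))       # two components: [0, 1] and [2, 3]
```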

66
  • This is a divisive hierarchical procedure, and it suggests ways of dividing the graph into subgraphs
  • e.g., in selecting an edge to remove, compare its length with the lengths of the other edges incident on its nodes

67
  • One useful statistic to be estimated from the minimal spanning tree is the edge-length distribution
  • for instance, in the case of two dense clusters immersed in a sparse set of points