CIS732-Lecture-36-20070418
Transcript and Presenter's Notes
1
Lecture 21 of 42
Partitioning-Based Clustering and Expectation Maximization (EM)
Monday, 10 March 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/Courses/Spring-2008/CIS732
Readings: Sections 7.1 - 7.3, Han & Kamber 2e
2
EM Algorithm: Example [3]
3
EM for Unsupervised Learning
  • Unsupervised Learning Problem
  • Objective: estimate a probability distribution with unobserved variables
  • Use EM to estimate mixture policy (more on this later; see 6.12, Mitchell)
  • Pattern Recognition Examples
  • Human-computer intelligent interaction (HCII)
  • Detecting facial features in emotion recognition
  • Gesture recognition in virtual environments
  • Computational medicine [Frey, 1998]
  • Determining morphology (shapes) of bacteria, viruses in microscopy
  • Identifying cell structures (e.g., nucleus) and shapes in microscopy
  • Other image processing
  • Many other examples (audio, speech, signal processing; motor control; etc.)
  • Inference Examples
  • Plan recognition: mapping from (observed) actions to agents' (hidden) plans
  • Hidden changes in context: e.g., aviation; computer security; MUDs

4
Unsupervised Learning: AutoClass [1]
5
Unsupervised Learning: AutoClass [2]
  • AutoClass Algorithm [Cheeseman et al., 1988]
  • Based on maximizing P(x | θj, yj, J)
  • θj: class (cluster) parameters (e.g., mean and variance)
  • yj: synthetic classes (can estimate marginal P(yj) any time)
  • Apply Bayes's Theorem; use numerical BOC estimation techniques (cf. Gibbs)
  • Search objectives
  • Find best J (ideally integrate out θj, yj; really start with big J, decrease)
  • Find θj, yj: use MAP estimation, then integrate in the neighborhood of yMAP
  • EM: Find MAP Estimate for P(x | θj, yj, J) by Iterative Refinement
  • Advantages over Symbolic (Non-Numerical) Methods
  • Returns probability distribution over class membership (see the sketch below)
  • More robust than best yj
  • Compare: fuzzy set membership (similar but probabilistically motivated)
  • Can deal with continuous as well as discrete data
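The following is a minimal sketch (not the AutoClass code) of what "a probability distribution over class membership" looks like computationally: posterior responsibilities P(yj | x) under a finite mixture with diagonal Gaussian components, assuming the class parameters θj, the priors, and the number of classes J are already fixed. AutoClass itself additionally searches over J and handles discrete attributes; the example values below are made up for illustration.

  import numpy as np

  def soft_membership(x, means, variances, priors):
      # Posterior P(y_j | x) for a mixture of diagonal Gaussians.
      # means, variances: (J, d) arrays; priors: (J,) array summing to 1.
      log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances)
                              + (x - means) ** 2 / variances, axis=1)
      log_post = np.log(priors) + log_lik
      log_post -= log_post.max()          # numerical stability
      post = np.exp(log_post)
      return post / post.sum()            # distribution over classes, sums to 1

  # Example: 2 synthetic classes in 2-D
  means = np.array([[0.0, 0.0], [3.0, 3.0]])
  variances = np.ones((2, 2))
  priors = np.array([0.6, 0.4])
  print(soft_membership(np.array([2.5, 2.0]), means, variances, priors))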

6
Unsupervised Learning: AutoClass [3]
  • AutoClass Resources
  • Beginning tutorial (AutoClass II): Cheeseman et al., 4.2.2 in Buchanan and Wilkins
  • Project page: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/
  • Applications
  • Knowledge discovery in databases (KDD) and data mining
  • Infrared astronomical satellite (IRAS) spectral atlas (sky survey)
  • Molecular biology: pre-clustering DNA acceptor, donor sites (mouse, human)
  • LandSat data from Kansas (30 km2 region, 1024 x 1024 pixels, 7 channels)
  • Positive findings: see book chapter by Cheeseman and Stutz, online
  • Other typical applications: see KD Nuggets (http://www.kdnuggets.com)
  • Implementations
  • Obtaining source code from project page
  • AutoClass III: Lisp implementation [Cheeseman, Stutz, Taylor, 1992]
  • AutoClass C: C implementation [Cheeseman, Stutz, Taylor, 1998]
  • These and others at http://www.recursive-partitioning.com/cluster.html

7
Unsupervised Learning: Competitive Learning for Feature Discovery
  • Intuitive Idea: Competitive Mechanisms for Unsupervised Learning
  • Global organization from local, competitive weight update
  • Basic principle expressed by Von der Malsburg
  • Guiding examples from (neuro)biology: lateral inhibition
  • Previous work: [Hebb, 1949], [Rosenblatt, 1959], [Von der Malsburg, 1973], [Fukushima, 1975], [Grossberg, 1976], [Kohonen, 1982]
  • A Procedural Framework for Unsupervised Connectionist Learning
  • Start with identical (neural) processing units, with random initial parameters
  • Set limit on activation strength of each unit
  • Allow units to compete for right to respond to a set of inputs
  • Feature Discovery
  • Identifying (or constructing) new features relevant to supervised learning
  • Examples: finding distinguishable letter characteristics in handwritten character recognition (HCR), optical character recognition (OCR)
  • Competitive learning: transform X into X'; train units in X' closest to x (see the sketch below)
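A minimal winner-take-all sketch of the competitive update described above. The particular rule (only the winning unit moves, by a fixed learning rate) is an assumption for illustration; the cited formulations (Kohonen, Grossberg, Von der Malsburg) differ in normalization and neighborhood details.

  import numpy as np

  def competitive_learning(X, k, lr=0.1, epochs=20, seed=0):
      # Winner-take-all competitive learning: the unit closest to each
      # input "wins" the competition and is the only one updated, moving
      # its weight vector toward that input.
      rng = np.random.default_rng(seed)
      W = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial unit weights
      for _ in range(epochs):
          for x in X[rng.permutation(len(X))]:
              winner = np.argmin(np.linalg.norm(W - x, axis=1))
              W[winner] += lr * (x - W[winner])
      return W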

8
Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) [1]
  • Another Clustering Algorithm
  • aka Self-Organizing Feature Map (SOFM)
  • Given: vectors of attribute values (x1, x2, …, xn)
  • Returns: vectors of attribute values (x1', x2', …, xk')
  • Typically, n >> k (n is high, k = 1, 2, or 3; hence dimensionality reducing)
  • Output vectors x': the projections of input points x; also get P(xj' | xi)
  • Mapping from x to x' is topology preserving
  • Topology Preserving Networks
  • Intuitive idea: similar input vectors will map to similar clusters
  • Recall informal definition of cluster (isolated set of mutually similar entities)
  • Restatement: clusters of X (high-D) will still be clusters of X' (low-D)
  • Representation of Node Clusters
  • Group of neighboring artificial neural network units (neighborhood of nodes)
  • SOMs combine ideas of topology-preserving networks, unsupervised learning (see the sketch below)
  • Implementation: http://www.cis.hut.fi/nnrc/ and MATLAB NN Toolkit
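A compact sketch of SOM training with k units arranged on a 1-D chain, only to make the neighborhood-based, topology-preserving update concrete; the implementations cited above (the Helsinki group's code and the MATLAB NN Toolkit) are the ones to use in practice. The learning-rate and neighborhood schedules below are illustrative assumptions.

  import numpy as np

  def som_1d(X, k=10, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
      # 1-D self-organizing map: each input pulls its best-matching unit
      # (BMU) and the BMU's chain neighbors toward it, which is what
      # preserves topology (nearby units represent nearby inputs).
      rng = np.random.default_rng(seed)
      W = X[rng.choice(len(X), size=k, replace=False)].astype(float)
      idx = np.arange(k)
      for t in range(epochs):
          lr = lr0 * (1 - t / epochs)                      # decaying learning rate
          sigma = max(sigma0 * (1 - t / epochs), 0.5)      # shrinking neighborhood
          for x in X[rng.permutation(len(X))]:
              bmu = np.argmin(np.linalg.norm(W - x, axis=1))
              h = np.exp(-((idx - bmu) ** 2) / (2 * sigma ** 2))   # neighborhood weights
              W += lr * h[:, None] * (x - W)
      return W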

9
Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) [2]
10
Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) [3]
11
Unsupervised Learning: SOM and Other Projections for Clustering
Cluster Formation and Segmentation Algorithm
(Sketch)
12
Unsupervised Learning: Other Algorithms (PCA, Factor Analysis)
  • Intuitive Idea
  • Q: Why are dimensionality-reducing transforms good for supervised learning?
  • A: There may be many attributes with undesirable properties, e.g.,
  • Irrelevance: xi has little discriminatory power over c(x) = yi
  • Sparseness of information: feature of interest spread out over many xi's (e.g., text document categorization, where xi is a word position)
  • We want to increase the information density by squeezing X down
  • Principal Components Analysis (PCA)
  • Combining redundant variables into a single variable (aka component, or factor); see the sketch below
  • Example: ratings (e.g., Nielsen) and polls (e.g., Gallup); responses to certain questions may be correlated (e.g., like fishing? time spent boating)
  • Factor Analysis (FA)
  • General term for a class of algorithms that includes PCA
  • Tutorial: http://www.statsoft.com/textbook/stfacan.html
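A minimal PCA sketch via the SVD of the centered data matrix, shown only to make the "squeezing X down" idea concrete; the number of components kept, k, is the analyst's choice.

  import numpy as np

  def pca(X, k):
      # Center the data, take the SVD, and keep the k directions of
      # greatest variance (the top-k principal components).
      Xc = X - X.mean(axis=0)
      U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
      components = Vt[:k]                   # k x d matrix of projection directions
      return Xc @ components.T, components  # n x k reduced data, plus the components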

13
Clustering Methods: Design Choices
14
Clustering Applications
Information Retrieval: Text Document Categorization
15
Unsupervised Learning and Constructive Induction
  • Unsupervised Learning in Support of Supervised Learning
  • Given: D ≡ labeled vectors (x, y)
  • Return: D' ≡ transformed training examples (x', y')
  • Solution approach: constructive induction
  • Feature construction: generic term
  • Cluster definition
  • Feature Construction (Front End)
  • Synthesizing new attributes
  • Logical: x1 ∧ ¬ x2; arithmetic: x1 + x5 / x2
  • Other synthetic attributes: f(x1, x2, …, xn), etc. (see the sketch below)
  • Dimensionality-reducing projection, feature extraction
  • Subset selection: finding relevant attributes for a given target y
  • Partitioning: finding relevant attributes for given targets y1, y2, …, yp
  • Cluster Definition (Back End)
  • Form, segment, and label clusters to get intermediate targets y'
  • Change of representation: find an (x', y') that is good for learning target y

x ≡ (x1, …, xp)
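A small illustrative sketch of the feature-construction front end: appending a logical and an arithmetic synthetic attribute to x. The column indices, the thresholding at 0, and the assumption that X has at least five columns are choices made here to mirror the slide's examples, not a prescribed recipe.

  import numpy as np

  def construct_features(X):
      # Append two synthetic attributes to each row of X: a logical
      # combination (x1 AND NOT x2, on thresholded values) and an
      # arithmetic combination (x1 + x5 / x2, guarding against x2 = 0).
      x1, x2, x5 = X[:, 0], X[:, 1], X[:, 4]
      logical = np.logical_and(x1 > 0, np.logical_not(x2 > 0)).astype(float)
      arithmetic = x1 + x5 / np.where(x2 != 0, x2, 1.0)
      return np.column_stack([X, logical, arithmetic])   # x' = x plus synthetic attributes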
16
Clustering: Relation to Constructive Induction
  • Clustering versus Cluster Definition
  • Clustering: 3-step process
  • Cluster definition: back end for feature construction
  • Clustering: 3-Step Process
  • Form
  • (x1', …, xk') in terms of (x1, …, xn)
  • NB: typically part of construction step, sometimes integrates both
  • Segment
  • (y1, …, yJ) in terms of (x1', …, xk')
  • NB: number of clusters J not necessarily same as number of dimensions k
  • Label
  • Assign names (discrete/symbolic labels (v1, …, vJ)) to (y1, …, yJ)
  • Important in document categorization (e.g., clustering text for info retrieval)
  • Hierarchical Clustering: Applying Clustering Recursively

17
CLUTO
  • Clustering Algorithms
  • High-performance, high-quality partitional clustering
  • High-quality agglomerative clustering
  • High-quality graph-partitioning-based clustering
  • Hybrid partitional/agglomerative algorithms for building trees for very large datasets
  • Cluster Analysis Tools
  • Cluster signature identification
  • Cluster organization identification
  • Visualization Tools
  • Hierarchical Trees
  • High-dimensional datasets
  • Cluster relations
  • Interfaces
  • Stand-alone programs
  • Library with a fully published API
  • Available on Windows, Sun, and Linux

http://www.cs.umn.edu/cluto
18
Today
  • Clustering
  • Distance Measures
  • Graph-based Techniques
  • K-Means Clustering
  • Tools and Software for Clustering

19
Prediction, Clustering, Classification
  • What is Prediction?
  • The goal of prediction is to forecast or deduce
    the value of an attribute based on values of
    other attributes
  • A model is first created based on the data
    distribution
  • The model is then used to predict future or
    unknown values
  • Supervised vs. Unsupervised Classification
  • Supervised Classification = Classification
  • We know the class labels and the number of
    classes
  • Unsupervised Classification = Clustering
  • We do not know the class labels and may not know
    the number of classes

20
What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
Helps users understand the natural grouping or
structure in a data set
  • Cluster
  • a collection of data objects that are similar
    to one another and thus can be treated
    collectively as one group
  • but as a collection, they are sufficiently
    different from other groups
  • Clustering
  • unsupervised classification
  • no predefined classes

21
Requirements of Clustering Methods
  • Scalability
  • Dealing with different types of attributes
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • The curse of dimensionality
  • Interpretability and usability

22
Applications of Clustering
  • Clustering has wide applications in
  • Pattern Recognition
  • Spatial Data Analysis
  • create thematic maps in GIS by clustering feature
    spaces
  • detect spatial clusters and explain them in
    spatial data mining
  • Image Processing
  • Market Research
  • Information Retrieval
  • Document or term categorization
  • Information visualization and IR interfaces
  • Web Mining
  • Cluster Web usage data to discover groups of
    similar access patterns
  • Web Personalization

23
Clustering Methodologies
  • Two general methodologies
  • Partitioning Based Algorithms
  • Hierarchical Algorithms
  • Partitioning Based
  • divide a set of N items into K clusters
    (top-down)
  • Hierarchical
  • agglomerative: pairs of items or clusters are successively linked to produce larger clusters
  • divisive: start with the whole set as a cluster and successively divide sets into smaller partitions

24
Distance or Similarity Measures
  • Measuring Distance
  • In order to group similar items, we need a way to
    measure the distance between objects (e.g.,
    records)
  • Note: distance = inverse of similarity
  • Often based on the representation of objects as
    feature vectors

[Tables on slide: Term Frequencies for Documents; An Employee DB]
Which objects are more similar?
25
Distance or Similarity Measures
  • Properties of Distance Measures
  • for all objects A and B: dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
  • for any object A: dist(A, A) = 0
  • dist(A, C) ≤ dist(A, B) + dist(B, C)
  • Common Distance Measures (see the sketch below)
  • Manhattan distance
  • Euclidean distance
  • Cosine similarity

Can be normalized to make values fall between 0
and 1.
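A minimal sketch of the three measures named above, applied to two made-up term-frequency vectors (the slide's own example tables are not reproduced in this transcript). Note that cosine is a similarity, not a distance, so larger values mean more alike.

  import numpy as np

  def manhattan(a, b):
      return np.sum(np.abs(a - b))

  def euclidean(a, b):
      return np.sqrt(np.sum((a - b) ** 2))

  def cosine_similarity(a, b):
      # a similarity, not a distance: 1.0 means identical direction
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  # e.g., two hypothetical term-frequency vectors for documents
  d1 = np.array([2.0, 0.0, 4.0, 3.0])
  d2 = np.array([1.0, 1.0, 3.0, 0.0])
  print(manhattan(d1, d2), euclidean(d1, d2), cosine_similarity(d1, d2))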
26
Distance or Similarity Measures
  • Weighting Attributes
  • in some cases we want some attributes to count
    more than others
  • associate a weight with each of the attributes in
    calculating distance, e.g.,
  • Nominal (categorical) Attributes
  • can use simple matching distance1 if values
    match, 0 otherwise
  • or convert each nominal attribute to a set of
    binary attribute, then use the usual distance
    measure
  • if all attributes are nominal, we can normalize
    by dividing the number of matches by the total
    number of attributes
  • Normalization
  • want values to fall between 0 an 1
  • other variations possible
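A sketch of a weighted distance over mixed attribute types, under assumptions stated in the comments. One deliberate choice: nominal attributes contribute 0 to the distance when values match and 1 otherwise (the complement of the match indicator above), so that identical objects come out at distance 0; numeric attributes are assumed already normalized to [0, 1]. The example values are hypothetical.

  import numpy as np

  def mixed_distance(a, b, weights, nominal):
      # Weighted distance over mixed attribute types.  Nominal attributes
      # contribute 0 when the values match and 1 otherwise; numeric
      # attributes contribute a squared difference.  Weights let some
      # attributes count more than others.
      total = 0.0
      for ai, bi, w, is_nominal in zip(a, b, weights, nominal):
          if is_nominal:
              total += w * (0.0 if ai == bi else 1.0)
          else:
              total += w * (float(ai) - float(bi)) ** 2
      return np.sqrt(total)

  # e.g., (gender, normalized income, normalized age)
  print(mixed_distance(("M", 0.30, 0.52), ("F", 0.26, 0.40),
                       weights=(1.0, 1.0, 1.0), nominal=(True, False, False)))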

27
Distance or Similarity Measures
  • Example
  • max distance for age 100000-19000 79000
  • max distance for age 52-27 25
  • dist(ID2, ID3) SQRT( 0 (0.04)2 (0.44)2 )
    0.44
  • dist(ID2, ID4) SQRT( 1 (0.72)2 (0.12)2 )
    1.24
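The arithmetic above can be checked directly. The normalized differences (0.04, 0.44, 0.72, 0.12) and the leading 0/1 terms (presumably a matching vs. mismatching nominal attribute) are taken from the slide, since the employee table itself is not reproduced in this transcript.

  from math import sqrt

  # dist(ID2, ID3): the nominal attribute matches (contributes 0);
  # normalized income and age differences are 0.04 and 0.44.
  print(round(sqrt(0 + 0.04 ** 2 + 0.44 ** 2), 2))   # 0.44
  # dist(ID2, ID4): a nominal mismatch contributes 1; differences are 0.72 and 0.12.
  print(round(sqrt(1 + 0.72 ** 2 + 0.12 ** 2), 2))   # 1.24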

28
Domain Specific Distance Functions
  • For some data sets, we may need to use
    specialized functions
  • we may want a single or a selected group of
    attributes to be used in the computation of
    distance - same problem as feature selection
  • may want to use special properties of one or more attributes in the data
  • natural distance functions may exist in the data

Example: Zip Codes (see the sketch below)
  distzip(A, B) = 0, if zip codes are identical
  distzip(A, B) = 0.1, if first 3 digits are identical
  distzip(A, B) = 0.5, if first digits are identical
  distzip(A, B) = 1, if first digits are different

Example: Customer Solicitation
  distsolicit(A, B) = 0, if both A and B responded
  distsolicit(A, B) = 0.1, if both A and B were chosen but did not respond
  distsolicit(A, B) = 0.5, if both A and B were chosen, but only one responded
  distsolicit(A, B) = 1, if one was chosen, but the other was not
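The zip-code rules translate directly into code; a minimal sketch (the customer-solicitation function would follow the same pattern, keyed on response status instead of digits):

  def dist_zip(a, b):
      # Domain-specific distance for zip codes, following the rules above.
      a, b = str(a), str(b)
      if a == b:
          return 0.0
      if a[:3] == b[:3]:
          return 0.1
      if a[0] == b[0]:
          return 0.5
      return 1.0

  print(dist_zip("66502", "66506"))   # first 3 digits identical -> 0.1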
29
Distance (Similarity) Matrix
  • Similarity (Distance) Matrix
  • based on the distance or similarity measure we
    can construct a symmetric matrix of distance (or
    similarity values)
  • (i, j) entry in the matrix is the distance
    (similarity) between items i and j

Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).
30
Example: Term Similarities in Documents
Term-Term Similarity Matrix
31
Similarity (Distance) Thresholds
  • A similarity (distance) threshold may be used to
    mark pairs that are sufficiently similar

Using a threshold value of 10 in the previous
example
32
Graph Representation
  • The similarity matrix can be visualized as an
    undirected graph
  • each item is represented by a node, and edges
    represent the fact that two items are similar (a
    one in the similarity threshold matrix)

If no threshold is used, then the matrix can be represented as a weighted graph (see the sketch below for the thresholded case)
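A small sketch of building the thresholded graph from a similarity matrix: entries at or above the threshold become edges in a 0/1 adjacency matrix. Dropping the threshold step would leave the similarity matrix itself as the weighted graph.

  import numpy as np

  def threshold_graph(S, threshold):
      # Adjacency matrix of the undirected graph: an edge wherever the
      # similarity is at or above the threshold; no self-edges.
      A = (S >= threshold).astype(int)
      np.fill_diagonal(A, 0)
      return A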
33
Simple Clustering Algorithms
  • If we are interested only in threshold (and not
    the degree of similarity or distance), we can use
    the graph directly for clustering
  • Clique Method (complete link)
  • all items within a cluster must be within the
    similarity threshold of all other items in that
    cluster
  • clusters may overlap
  • generally produces small but very tight clusters
  • Single Link Method
  • any item in a cluster must be within the
    similarity threshold of at least one other item
    in that cluster
  • produces larger but weaker clusters
  • Other methods
  • star method - start with an item and place all
    related items in that cluster
  • string method - start with an item place one
    related item in that cluster then place anther
    item related to the last item entered, and so on
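For the single link method as described here, clusters are simply the connected components of the threshold graph; a minimal sketch follows (the clique method would instead require enumerating maximal cliques, which this does not do).

  def single_link_clusters(A):
      # Single link clustering on a 0/1 adjacency matrix A: each cluster
      # is a connected component of the threshold graph, so the result
      # is a partition (no overlapping clusters).
      n = len(A)
      unvisited = set(range(n))
      clusters = []
      while unvisited:
          start = unvisited.pop()
          component, frontier = {start}, [start]
          while frontier:
              i = frontier.pop()
              for j in range(n):
                  if A[i][j] and j in unvisited:
                      unvisited.remove(j)
                      component.add(j)
                      frontier.append(j)
          clusters.append(component)
      return clusters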

34
Simple Clustering Algorithms
  • Clique Method
  • a clique is a completely connected subgraph of a
    graph
  • in the clique method, each maximal clique in the
    graph becomes a cluster

[Figure: threshold graph over terms T1-T8]
Maximal cliques (and therefore the clusters) in the previous example are {T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}. Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
35
Simple Clustering Algorithms
  • Single Link Method
  • 1. select an item not in a cluster and place it in a new cluster
  • 2. place all other similar items in that cluster
  • 3. repeat step 2 for each item in the cluster until nothing more can be added
  • 4. repeat steps 1-3 for each item that remains unclustered

[Figure: same threshold graph over terms T1-T8]
In this case the single link method produces only two clusters: {T1, T3, T4, T5, T6, T2, T8} and {T7}. Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
36
Clustering with Existing Clusters
  • The notion of comparing item similarities can be
    extended to clusters themselves, by focusing on a
    representative vector for each cluster
  • cluster representatives can be actual items in
    the cluster or other virtual representatives
    such as the centroid
  • this methodology reduces the number of similarity
    computations in clustering
  • clusters are revised successively until a
    stopping condition is satisfied, or until no more
    changes to clusters can be made
  • Partitioning Methods
  • reallocation method - start with an initial
    assignment of items to clusters and then move
    items from cluster to cluster to obtain an
    improved partitioning
  • Single pass method - simple and efficient, but
    produces large clusters, and depends on order in
    which items are processed
  • Hierarchical Agglomerative Methods
  • start with individual items and combine them into clusters
  • then successively combine smaller clusters to form larger ones
  • grouping of individual items can be based on any
    of the methods discussed earlier

37
K-Means Algorithm
  • The basic algorithm (based on the reallocation method); see the sketch below
  • 1. select K data points as the initial representatives
  • 2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)
  • 3. for j = 1 to K, recalculate the cluster centroid Cj
  • 4. repeat steps 2 and 3 until there is (little or) no change in clusters
  • Example: Clustering Terms

Initial (arbitrary) assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6}
Cluster Centroids
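A minimal sketch of steps 1-4 above using Euclidean distance on numeric vectors; the slide's own example instead clusters terms with a similarity measure over a document-term matrix, but the control flow is the same.

  import numpy as np

  def k_means(X, k, iters=100, seed=0):
      rng = np.random.default_rng(seed)
      centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1
      for _ in range(iters):
          # step 2: assign each item to its most similar (closest) centroid
          d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = d.argmin(axis=1)
          # step 3: recalculate each cluster centroid as the mean of its members
          new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centroids[j] for j in range(k)])
          # step 4: stop when the centroids (and hence the clusters) stop changing
          if np.allclose(new_centroids, centroids):
              break
          centroids = new_centroids
      return labels, centroids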
38
Example K-Means
  • Example (continued)

Now, using a simple similarity measure, compute the new cluster-term similarity matrix.
Now compute the new cluster centroids using the original document-term matrix.
The process is repeated until no further changes are made to the clusters.
39
K-Means Algorithm
  • Strengths of k-means
  • Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k, t << n
  • Often terminates at a local optimum
  • Weaknesses of k-means
  • Applicable only when the mean is defined; what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers
  • Variations of k-means usually differ in
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means

40
Terminology
  • Expectation-Maximization (EM) Algorithm
  • Iterative refinement: repeat until convergence to a locally optimal label
  • Expectation step: estimate parameters with which to simulate data
  • Maximization step: use simulated (fictitious) data to update parameters (see the sketch after this list)
  • Unsupervised Learning and Clustering
  • Constructive induction: using unsupervised learning for supervised learning
  • Feature construction: front end - construct new x' values
  • Cluster definition: back end - use these to reformulate y
  • Clustering problems: formation, segmentation, labeling
  • Key criterion: distance metric (points closer intra-cluster than inter-cluster)
  • Algorithms
  • AutoClass: Bayesian clustering
  • Principal Components Analysis (PCA), factor analysis (FA)
  • Self-Organizing Maps (SOM): topology-preserving transform (dimensionality reduction) for competitive unsupervised learning
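A minimal sketch of the E/M iteration above for a spherical Gaussian mixture, one common instantiation (Mitchell 6.12 treats the related means-only case, and AutoClass uses richer component models). The random initialization and the spherical-covariance assumption are choices made here for brevity.

  import numpy as np

  def em_gaussian_mixture(X, k, iters=50, seed=0):
      rng = np.random.default_rng(seed)
      n, d = X.shape
      means = X[rng.choice(n, size=k, replace=False)].astype(float)
      variances = np.full(k, X.var())
      weights = np.full(k, 1.0 / k)
      for _ in range(iters):
          # E-step: responsibilities r[i, j] = P(cluster j | x_i) under current parameters
          sq = ((X[:, None, :] - means) ** 2).sum(axis=2)          # (n, k) squared distances
          log_r = np.log(weights) - 0.5 * d * np.log(2 * np.pi * variances) - 0.5 * sq / variances
          log_r -= log_r.max(axis=1, keepdims=True)                # numerical stability
          r = np.exp(log_r)
          r /= r.sum(axis=1, keepdims=True)
          # M-step: update parameters from the responsibility-weighted ("fictitious") data
          Nk = r.sum(axis=0)
          means = (r.T @ X) / Nk[:, None]
          sq = ((X[:, None, :] - means) ** 2).sum(axis=2)
          variances = np.maximum((r * sq).sum(axis=0) / (d * Nk), 1e-6)
          weights = Nk / n
      return means, variances, weights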

41
Summary Points
  • Expectation-Maximization (EM) Algorithm
  • Unsupervised Learning and Clustering
  • Types of unsupervised learning
  • Clustering, vector quantization
  • Feature extraction (typically, dimensionality reduction)
  • Constructive induction: unsupervised learning in support of supervised learning
  • Feature construction (aka feature extraction)
  • Cluster definition
  • Algorithms
  • EM: mixture parameter estimation (e.g., for AutoClass)
  • AutoClass: Bayesian clustering
  • Principal Components Analysis (PCA), factor analysis (FA)
  • Self-Organizing Maps (SOM): projection of data; competitive algorithm
  • Clustering problems: formation, segmentation, labeling
  • Next Lecture: Time Series Learning and Characterization