Text Clustering - PowerPoint PPT Presentation

Title: Text Clustering


1
Text Clustering
  • PengBo
  • Nov 16, 2009

2
Review
  • Definition
  • The category of x: c(x) ∈ C
  • K-Nearest Neighbor
  • Naïve Bayes
  • Bayesian Methods
  • Bernoulli NB classifier
  • Multinomial NB classifier
  • Categorization Evaluation
  • Training data/Test data
  • Over-fitting vs. generalization

3
How to Evaluate?
  • Test data set
  • Kept separate from the training set
  • Measure
  • Precision
  • Recall
  • Accuracy (Correct rate)
  • F1
  • Micro/Macro Average

(Figure: contingency table of actual class vs. predicted class)
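The measures listed on this slide can be sketched in a few lines. This is a minimal illustration for single-label classification; the function names and the toy labels are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

def per_class_counts(gold, pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but the document is not in p
            fn[g] += 1  # the document is in g, but g was not predicted
    return tp, fp, fn

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_micro_f1(gold, pred):
    tp, fp, fn = per_class_counts(gold, pred)
    labels = sorted(set(gold) | set(pred))
    # Macro average: compute F1 per class, then average over classes.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro average: pool the counts over classes, then compute F1 once.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Toy data (hypothetical): four documents of class "a", two of class "b".
gold = ["a", "a", "a", "a", "b", "b"]
pred = ["a", "a", "a", "b", "b", "a"]
macro, micro = macro_micro_f1(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Macro averaging gives every class the same weight, while micro averaging gives every document the same weight, so a poorly handled small class lowers the macro score more.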
4
Exercise
  • Federalist papers
  • In 1787-1788, Hamilton, Jay and Madison published 77 essays
    urging New York to ratify the US Constitution
  • The authorship of 12 of the papers is disputed

5
Author identification
  • In 1964 Mosteller and Wallace solved the problem
  • Mosteller, Frederick and Wallace, David L. 1964.
    Inference and Disputed Authorship: The
    Federalist.
  • It's a text categorization problem
  • They identified 70 function words as good
    candidates for authorship analysis
  • Using statistical inference they concluded the
    author was Madison

6
Function words for Author Identification
7
Function Words for Author Identification
8
Today's Topic
  • Document clustering
  • Motivations
  • Clustering algorithms
  • Partitional
  • Hierarchical
  • Evaluation

9
What's Clustering?
10
What is clustering?
  • Clustering: the process of grouping a set of
    objects into classes of similar objects
  • The commonest form of unsupervised learning
  • Unsupervised learning: learning from raw data,
    as opposed to supervised learning, where a
    classification of examples is given
  • A common and important task that finds many
    applications in IR and other places

11
Clustering: Internal Criterion
How many clusters?
  • High intra-cluster similarity
  • Low inter-cluster similarity

12
Issues for clustering
  • Representation for clustering
  • Document representation
  • Vector space or language model?
  • A notion of similarity/distance
  • Cosine similarity or KL distance
  • How many clusters?
  • Fixed a priori?
  • Completely data driven?
  • Avoid trivial clusters - too large or small

13
Clustering Algorithms
  • Hard clustering algorithms
  • Compute a hard assignment: each document is a
    member of exactly one cluster.
  • Soft clustering algorithms
  • The assignment is soft: a document's assignment is a
    distribution over all clusters.

14
Clustering Algorithms
  • Flat algorithms
  • Create cluster set without explicit structure
  • Usually start with a random (partial)
    partitioning
  • Refine it iteratively
  • K-means clustering
  • Model-based clustering
  • Hierarchical algorithms
  • Bottom-up, agglomerative
  • Top-down, divisive

16
Evaluation
17
Think about it
  • Evaluate by high internal criterion scores?
  • Objective function: high intra-cluster similarity
    and low inter-cluster similarity

(Figure labels: application, user judgment, internal judgment)
18
External criteria for clustering quality
  • Assess a clustering against a ground truth
  • Assume documents with C gold standard classes,
    while our clustering algorithms produce K
    clusters, ω1, ω2, ..., ωK with ni members each.
  • Simple measure: purity, the ratio between the size of the
    dominant class Ci in cluster ωK and the size of cluster ωK
  • Ω = {ω1, ω2, ..., ωK} is the set of clusters and
    C = {c1, c2, ..., cJ} the set of classes.

19
Purity example
(Figure: 17 points from three classes, grouped into Cluster I, Cluster II and Cluster III)
Cluster I: Purity = 1/6 × max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 × max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 × max(2, 0, 3) = 3/5
Total Purity = 1/17 × (5 + 4 + 3) = 12/17
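The purity arithmetic above can be checked with a short sketch. The per-cluster class counts are taken from the example; the class labels "x", "o" and "d" are placeholder names, since the original symbols did not survive in the transcript.

```python
def purity(clusters):
    """clusters: list of dicts mapping a class label to its member count."""
    n = sum(sum(counts.values()) for counts in clusters)
    # For each cluster, count only the members of its dominant class.
    return sum(max(counts.values()) for counts in clusters) / n

example = [
    {"x": 5, "o": 1, "d": 0},  # Cluster I
    {"x": 1, "o": 4, "d": 1},  # Cluster II
    {"x": 2, "o": 0, "d": 3},  # Cluster III
]
total_purity = purity(example)  # (5 + 4 + 3) / 17
```

Purity is easy to compute but rises as the number of clusters grows: putting every document in its own cluster gives a purity of 1.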
20
Rand Index
  • View it as a series of decisions, one for each of
    the N(N - 1)/2 pairs of documents in the
    collection.
  • A true positive (TP) decision assigns two similar
    documents to the same cluster.
  • A true negative (TN) decision assigns two
    dissimilar documents to different clusters.
  • A false positive (FP) decision assigns two
    dissimilar documents to the same cluster.
  • A false negative (FN) decision assigns two similar
    documents to different clusters.

21
Rand Index
$$\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}$$
22
Rand index Example
(Figure: 17 points grouped into Cluster I, Cluster II and Cluster III)
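A sketch of the pair counting behind the Rand index follows. It assumes the figure uses the same class counts per cluster as the purity example, (5, 1, 0), (1, 4, 1) and (2, 0, 3); the function name is illustrative.

```python
from math import comb

def rand_index(table):
    """table[k][j] is the number of documents of class j in cluster k."""
    n = sum(map(sum, table))
    total_pairs = comb(n, 2)  # N(N - 1)/2 pair decisions
    same_cluster = sum(comb(sum(row), 2) for row in table)      # TP + FP
    same_class = sum(comb(sum(col), 2) for col in zip(*table))  # TP + FN
    tp = sum(comb(cell, 2) for row in table for cell in row)
    fp = same_cluster - tp
    fn = same_class - tp
    tn = total_pairs - tp - fp - fn
    return tp, fp, fn, tn, (tp + tn) / total_pairs

table = [[5, 1, 0], [1, 4, 1], [2, 0, 3]]
tp, fp, fn, tn, ri = rand_index(table)
```

Under that assumption, the 136 pairs split into TP = 20, FP = 20, FN = 24 and TN = 72, so the Rand index is 92/136, about 0.68.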
23
K-Means Algorithm
24
Partitioning Algorithms
  • Given
  • a set of documents D and the number K
  • Find
  • A partition into K clusters that optimizes the chosen partitioning criterion
  • Globally optimal: exhaustively enumerate all
    partitions
  • Effective heuristic methods: K-means algorithms

Partitioning criterion: residual sum of
squares (RSS)
25
K-Means
  • Represent documents as vectors.
  • Represent each cluster by its centroid (aka the center of
    gravity or mean)
  • Assign instances to clusters by distance to the cluster
    centroids: each instance goes to the nearest centroid

26
K-Means Example (K=2)
(Figure: animation steps, with the labels "Reassign clusters" and "Converged!")
27
K-Means Algorithm
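The slide's pseudo-code is not in the transcript, so here is a minimal sketch of K-means with its two alternating steps, reassignment and recomputation. It uses squared Euclidean distance and dense tuples; the function names, the empty-cluster handling and the toy points are assumptions made for this example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))

def kmeans(points, k, seeds=None, max_iter=100, rng=None):
    rng = rng or random.Random(0)
    centroids = list(seeds) if seeds else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: the clusters no longer change
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean(members)
    rss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, rss

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, centroids, rss = kmeans(points, k=2, seeds=[points[0], points[3]])
```

Passing explicit `seeds` keeps the run reproducible; with random seeds the result can differ from run to run.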
28
Convergence
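  • Why does K-means reach a fixed point?
  • A state in which clusters don't change.
  • Reassignment: RSS does not increase, because each vector moves to its
    closest centroid.
  • Recomputation: each RSS_k does not increase (m_k is the number of members
    in cluster k)
  • Which a minimizes RSS_k(a)? Setting the derivative to zero gives the
    centroid of ω_k:

$$\mathrm{RSS}_k(a) = \sum_{x \in \omega_k} \lVert x - a \rVert^2$$
$$\sum_{x \in \omega_k} -2(x - a) = 0 \;\Rightarrow\; \sum_{x \in \omega_k} x = \sum_{x \in \omega_k} a = m_k a \;\Rightarrow\; a = \frac{1}{m_k} \sum_{x \in \omega_k} x$$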
29
Convergence: Global Minimum?
  • There is unfortunately no guarantee that a global
    minimum in the objective function will be reached

(Figure label: outlier)
30
Seed Choice
  • Results can vary with the choice of seeds
  • Some seeds lead to slow convergence, or to convergence to sub-optimal clusterings.
  • Pick seeds with a heuristic (e.g., doc least similar to any
    existing mean)
  • Try multiple starting points
  • Initialize with the result of another clustering method (e.g., by sampling)

In the example figure (points A to F), if you start with B and E as
centroids you converge to {A,B,C} and {D,E,F}. If
you start with D and F you converge to {A,B,D,E} and
{C,F}.
31
How Many Clusters?
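  • How do we choose the number of clusters K?
  • Trade-off between having more clusters (better focus within each cluster)
    and having too many clusters
  • One approach
  • Define the Benefit of a doc as its cosine similarity to its cluster
    centroid; the sum of all docs' benefits is the Total Benefit.
  • Define a Cost for each cluster
  • Value of a clustering = Total Benefit - Total Cost.
  • Over all choices of K, pick the one with the highest value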
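The Benefit/Cost/Value trade-off can be illustrated as follows. This sketch assumes a constant cost per cluster (the slide does not say how Cost is defined) and compares three hand-made clusterings of four toy documents rather than running a clustering algorithm.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def clustering_value(clusters, cost_per_cluster):
    """Total Benefit minus Total Cost; clusters is a list of lists of vectors."""
    total_benefit = sum(
        cosine(doc, centroid(cluster)) for cluster in clusters for doc in cluster
    )
    total_cost = cost_per_cluster * len(clusters)  # assumed: flat cost per cluster
    return total_benefit - total_cost

docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
candidates = {
    1: [docs],                   # everything in one cluster
    2: [docs[:2], docs[2:]],     # two topical groups
    4: [[d] for d in docs],      # every doc alone
}
values = {k: clustering_value(c, cost_per_cluster=0.3) for k, c in candidates.items()}
best_k = max(values, key=values.get)
```

With this cost, K=2 has the highest value: one cluster loses benefit because the centroid is far from every document, and four clusters pay too much cost for a small gain in benefit.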

32
Is K-Means Efficient?
  • Time Complexity
  • Computing distance between two docs is O(M) where
    M is the dimensionality of the vectors.
  • Reassigning clusters: O(KN) distance
    computations, or O(KNM).
  • Computing centroids: each doc gets added once to
    some centroid, O(NM).
  • Assume these two steps are each done once for I
    iterations: O(IKNM).
  • M is
  • A document is a sparse vector, but a centroid is not
  • K-medoids algorithms: use the element closest to the
    center as "the medoid"

33
Efficiency: Medoid As Cluster Representative
  • Medoid: use a document as the cluster representative
  • e.g., the document closest to the centroid
  • One reason this is useful
  • Consider the representative of a large cluster (>1000 documents)
  • The centroid of this cluster will be a dense
    vector
  • The medoid of this cluster will be a sparse
    vector
  • Compare:
  • mean vs. median
  • centroid vs. medoid
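The sparsity argument can be seen in a small sketch with documents stored as term-to-weight dictionaries. The helper names and the toy cluster are assumptions; the medoid is taken here to be the document closest to the centroid in squared Euclidean distance.

```python
def sparse_centroid(docs):
    """Mean of sparse vectors stored as {term: weight} dicts."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}

def sq_dist(a, b):
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))

def medoid(docs):
    """The document closest to the centroid."""
    c = sparse_centroid(docs)
    return min(docs, key=lambda d: sq_dist(d, c))

cluster = [
    {"jaguar": 1.0, "car": 1.0},
    {"jaguar": 1.0, "engine": 1.0},
    {"jaguar": 1.0, "car": 1.0, "speed": 1.0},
]
c = sparse_centroid(cluster)
m = medoid(cluster)
```

The centroid has a non-zero entry for every term that occurs anywhere in the cluster, while the medoid has only the terms of one document, so comparisons against it stay cheap as the cluster grows.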

34
Hierarchical Clustering Algorithm
35
Hierarchical Agglomerative Clustering (HAC)
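  • Assume a similarity function that determines the similarity of two instances.
  • Algorithm:
  • Start with each instance in its own cluster
  • Repeatedly merge the two most similar clusters into a new cluster
  • Stop when only one cluster remains
  • The history of merges forms a binary tree, or hierarchy.

(Figure: dendrogram)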
36
Dendrogram: Document Example
  • As clusters agglomerate, docs likely to fall into
    a hierarchy of topics or concepts.

(Figure: dendrogram over documents d1, d2, d3, d4 and d5, including the merged node d1,d2)
37
HAC Algorithm, pseudo-code
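The pseudo-code itself is missing from the transcript. The following is a naive sketch of bottom-up clustering, not the slide's version: it uses single-link cosine similarity by default, recomputes all cluster similarities at every step (roughly cubic time), and all names and data are assumptions.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def single_link(c1, c2, docs):
    """Similarity of the most similar pair across two clusters of doc ids."""
    return max(cosine(docs[i], docs[j]) for i in c1 for j in c2)

def hac(docs, cluster_sim=single_link):
    """Naive HAC; returns the merge history as (cluster_a, cluster_b) tuples."""
    clusters = [(i,) for i in range(len(docs))]  # every doc starts alone
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair among the current clusters.
        a, b = max(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: cluster_sim(pair[0], pair[1], docs),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # the merged cluster replaces its two parts
        merges.append((a, b))
    return merges

docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.9), (0.7, 0.7)]
merges = hac(docs)
```

The merge history is the dendrogram read bottom-up: here documents 0 and 1 merge first, then 2 and 3, and the last merge joins everything into one cluster.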
38
Hierarchical Clustering algorithms
  • Agglomerative (bottom-up)
  • Start with each document being a single cluster.
  • Eventually all documents belong to the same
    cluster.
  • Divisive (top-down)
  • Start with all documents belonging to the same
    cluster.
  • Eventually each node forms a cluster on its own.
  • No need to fix the number of clusters k in advance

39
Key notion: cluster representative
  • Which clusters are closest to each other?
  • To compute this, how do we represent each cluster (cluster representation)?
  • The representative should be a typical or central point of the cluster, e.g.:
  • point inducing smallest radii to docs in cluster,
    smallest squared distances, etc.
  • point that is the average of all docs in the
    cluster
  • Centroid or center of gravity

40
Closest pair of clusters
  • Center of gravity
  • Clusters whose centroids (centers of gravity) are the most cosine-similar
  • Average-link
  • Average cosine similarity between pairs of elements
  • Single-link
  • Nearest points (similarity of the most cosine-similar)
  • Complete-link
  • Furthest points (similarity of the furthest points, the
    least cosine-similar)
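The four definitions of "closest pair" can be compared side by side on two small clusters. In this sketch average-link is read as the mean over cross-cluster pairs, which is one common interpretation and an assumption here; the function names and vectors are illustrative.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def single_link(c1, c2):
    return max(cosine(a, b) for a in c1 for b in c2)  # most similar pair

def complete_link(c1, c2):
    return min(cosine(a, b) for a in c1 for b in c2)  # least similar pair

def average_link(c1, c2):
    sims = [cosine(a, b) for a in c1 for b in c2]
    return sum(sims) / len(sims)  # mean over cross-cluster pairs

def centroid_sim(c1, c2):
    return cosine(centroid(c1), centroid(c2))

c1 = [(1.0, 0.0), (0.8, 0.6)]
c2 = [(0.6, 0.8), (0.0, 1.0)]
scores = {
    "single": single_link(c1, c2),
    "complete": complete_link(c1, c2),
    "average": average_link(c1, c2),
    "centroid": centroid_sim(c1, c2),
}
```

On the same pair of clusters the single-link score is the highest and the complete-link score the lowest, which is why the choice of criterion changes which clusters get merged next.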

41
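Single-Link Example
(Figure: single-link clustering, showing the chaining effect)
42
Complete-Link Example
(Figure: complete-link clustering, which is affected by outliers)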
43
Computational Complexity
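  • In the first iteration, HAC computes the similarity of all pairs: O(n²).
  • In each of the following n-2 merging iterations, compute the similarity
    between the most recently created cluster and all other existing clusters
  • The other similarities are unchanged
  • To maintain overall O(n²) performance,
  • computing the similarity to each other cluster must take constant time.
  • Otherwise O(n² log n) or O(n³)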

44
Centroid Agglomerative Clustering
Example: n=6, k=3, closest pair of centroids
(Figure: points d1, d2, d3, d4, d5 and d6 with their cluster centroids)
45
Group Average Agglomerative Clustering
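  • Similarity of two clusters: the average similarity over all pairs in the merged cluster
  • Computing it efficiently:
  • Vectors are normalized to unit length.
  • Maintain the sum of vectors for each cluster.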
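Why the sum of vectors is enough: for n unit vectors with sum s, the dot product s·s equals n (each vector with itself) plus twice the sum over distinct pairs, so the average pairwise similarity is (s·s - n) / (n(n - 1)). A small sketch that checks this identity against a brute-force average; the names and data are illustrative.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def group_average_bruteforce(cluster):
    """Average dot product over all pairs of distinct unit vectors."""
    pairs = list(combinations(cluster, 2))
    return sum(dot(a, b) for a, b in pairs) / len(pairs)

def group_average_from_sum(vector_sum, n):
    """The same quantity, computed from the maintained sum of n unit vectors."""
    return (dot(vector_sum, vector_sum) - n) / (n * (n - 1))

# Unit-length vectors of a merged cluster (toy data).
merged = [(1.0, 0.0), (0.8, 0.6), (0.6, 0.8), (0.0, 1.0)]
s = tuple(sum(col) for col in zip(*merged))
slow = group_average_bruteforce(merged)
fast = group_average_from_sum(s, len(merged))
```

When two clusters merge, the new sum is the sum of the two stored sums, so the similarity of a candidate merge costs one dot product instead of a pass over all pairs.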

46
Exercise
  • Consider agglomerative clustering of n points on a line. Explain how to
    avoid n³ distance/similarity computations. How many computations does your scheme need?

47
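Efficiency: Using approximations
  • In the standard algorithm, each step must find the closest pair of centroids
  • Approximation: find a nearly closest pair instead
  • Simplistic example: maintain the closest pair based
    on distances in projection on a random line

(Figure: points projected onto a random line)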
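A one-shot sketch of the random-line idea: project all points onto a random direction, sort them, and consider only neighbours along the line as candidates. This illustrates the approximation under assumed names and toy data; it does not maintain the structure across merges, and the pair it returns is not guaranteed to be the true closest pair.

```python
import random
from math import sqrt

def nearly_closest_pair(points, rng):
    """Approximate closest pair: compare only neighbours along a random line."""
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sqrt(sum(x * x for x in direction))
    direction = [x / norm for x in direction]
    # Project every point onto the line and sort by position along it.
    projected = sorted(
        (sum(p_i * d_i for p_i, d_i in zip(p, direction)), idx)
        for idx, p in enumerate(points)
    )
    # The candidate is the adjacent pair with the smallest gap on the line.
    (_, i), (_, j) = min(
        zip(projected, projected[1:]), key=lambda pair: pair[1][0] - pair[0][0]
    )
    return tuple(sorted((i, j)))

points = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 1.0), (2.0, 8.0)]
pair = nearly_closest_pair(points, random.Random(42))
```

Points that are close in the original space stay close on the line, but distant points can also land next to each other, which is why the result is only "nearly" closest. The work is one sort, O(n log n), instead of comparing all pairs.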
48
Applications in IR
49
Navigating document collections
Table of Contents
1. Science of Cognition
1.a. Motivations
1.a.i. Intellectual Curiosity
1.a.ii. Practical Applications
1.b. History of Cognitive Psychology
2. The Neural Basis of Cognition
2.a. The Nervous System
2.b. Organization of the Brain
2.c. The Visual System
3. Perception and Attention
3.a. Sensory Memory
3.b. Attention and Sensory Information Processing

Index
Aardvark, 15
Blueberry, 200
Capricorn, 1, 45-55
Dog, 79-99
Egypt, 65
Falafel, 78-90
Giraffes, 45-59
  • Information Retrieval: a book index
  • Document clusters: a table of contents

50
Scatter/Gather: Cutting, Karger, and Pedersen
51
For better navigation of search results
52
Vivisimo SE
53
Navigating search results (2)
  • Group documents by the sense of a word
  • Given the results of a search (say Jaguar, or NLP), partition them into groups of related documents
  • Can be viewed as a form of word sense disambiguation

55
For speeding up vector space retrieval
  • In VSM retrieval, we must find the doc vectors nearest to the query vector
  • Computing the similarity of the query to every doc is slow (for some
    applications)
  • With an inverted index, only docs that share a term with the query need to be scored
  • By clustering docs in corpus a priori
  • Search only the cluster(s) closest to the query
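A sketch of cluster-based retrieval: documents are clustered offline, and at query time only the cluster with the most similar centroid is searched. The cluster assignment is taken as given, and all names and vectors are assumptions for the example.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def build_index(docs, assignment):
    """Group doc ids by precomputed cluster id and store each centroid."""
    members = {}
    for doc_id, cluster_id in enumerate(assignment):
        members.setdefault(cluster_id, []).append(doc_id)
    return {
        cid: (centroid([docs[i] for i in ids]), ids) for cid, ids in members.items()
    }

def search(query, docs, index, top_n=2):
    # Step 1: pick the cluster whose centroid is most similar to the query.
    _, ids = max(index.values(), key=lambda entry: cosine(query, entry[0]))
    # Step 2: rank only the documents inside that cluster.
    return sorted(ids, key=lambda i: cosine(query, docs[i]), reverse=True)[:top_n]

docs = [(1.0, 0.1, 0.0), (0.9, 0.2, 0.0), (0.0, 0.1, 1.0), (0.1, 0.0, 0.9)]
assignment = [0, 0, 1, 1]  # assumed output of an offline clustering step
index = build_index(docs, assignment)
hits = search((0.0, 0.0, 1.0), docs, index)
```

The search is inexact: a relevant document that sits in a different cluster is never scored. Searching the few closest clusters instead of just one trades some speed for recall.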

56
Resources
  • Weka 3 - Data Mining with Open Source Machine
    Learning Software in Java

57
Summary
  • Text Clustering
  • Evaluation
  • Purity, NMI, Rand Index
  • Partition Algorithm
  • K-Means
  • Reassignment
  • Recomputation
  • Hierarchical Algorithm
  • Cluster representation
  • Closeness measure for cluster pairs
  • Single link
  • Complete link
  • Average link
  • centroid

58
Readings
  • 1. IIR Ch. 16.1-4, Ch. 17.1-4
  • 2. F. Beil, M. Ester, and X. Xu,
    "Frequent term-based text clustering," in
    Proceedings of the Eighth ACM SIGKDD
    International Conference on Knowledge Discovery
    and Data Mining. Edmonton, Alberta, Canada: ACM,
    2002.

59
Thank You!
  • Q&A

60
Cluster Labeling
61
Major issue - labeling
  • After clustering algorithm finds clusters - how
    can they be useful to the end user?
  • Need pithy label for each cluster
  • In search results, say Animal or Car in the
    jaguar example.
  • In topic trees (Yahoo), need navigational cues.
  • Often done by hand, a posteriori.

62
How to Label Clusters
  • Show titles of typical documents
  • Titles are easy to scan
  • Authors create them for quick scanning!
  • But you can only show a few titles which may not
    fully represent cluster
  • Show words/phrases prominent in cluster
  • More likely to fully represent cluster
  • Use distinguishing words/phrases
  • Differential labeling (think about Feature
    Selection)
  • But harder to scan

63
Labeling
  • Common heuristics - list 5-10 most frequent terms
    in the centroid vector.
  • Drop stop-words; stem.
  • Differential labeling by frequent terms
  • Within a collection "Computers", all clusters
    have the word "computer" as a frequent term.
  • Discriminant analysis of centroids.
  • Perhaps better: distinctive noun phrase
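The two labeling heuristics can be contrasted in a short sketch. Raw term counts over the cluster stand in for the centroid vector, stemming is omitted, and the differential score (count in cluster squared, divided by count in the collection) is an assumed heuristic chosen for illustration, not the lecture's method.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list

def term_counts(docs):
    counts = Counter()
    for doc in docs:
        counts.update(w for w in doc.lower().split() if w not in STOP_WORDS)
    return counts

def frequent_label(cluster_docs, n=3):
    """Label by the n most frequent non-stop-word terms in the cluster."""
    return [term for term, _ in term_counts(cluster_docs).most_common(n)]

def differential_label(cluster_docs, all_docs, n=3):
    """Prefer terms that are frequent here but rare in the collection overall."""
    inside = term_counts(cluster_docs)
    overall = term_counts(all_docs)
    score = {t: c * c / overall[t] for t, c in inside.items()}
    return sorted(score, key=lambda t: (-score[t], t))[:n]

hardware = ["computer cpu memory", "computer cpu cache", "computer memory disk"]
networks = ["computer network router", "computer network protocol",
            "computer router switch"]
collection = hardware + networks

plain = frequent_label(hardware, n=1)
diff_hw = differential_label(hardware, collection, n=2)
diff_net = differential_label(networks, collection, n=2)
```

In this toy "Computers" collection the plain heuristic labels the hardware cluster with "computer", which every cluster shares, while the differential score surfaces terms that set each cluster apart.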