Text Clustering - PowerPoint PPT Presentation


Slides: 64

Provided by: pb8

Transcript and Presenter's Notes

Title: Text Clustering

1
Text Clustering

PengBo
Nov 16, 2009

2
Review

Definition
The category of x: c(x) ∈ C
K-Nearest Neighbor
Naïve Bayes
Bayesian Methods
Bernoulli NB classifier
Multinomial NB classifier
Categorization Evaluation
Training data/Test data
Over-fitting and generalization

3
How to Evaluate?

Test data set
Kept separate from the training set
Measure
Precision
Recall
Accuracy (Correct rate)
F1
Micro/Macro Average

Actual Class
Predicted class
4
Exercise

Federalist papers
In 1787-1788, Hamilton, Jay and Madison wrote 77 short essays
to persuade NY to ratify the US Constitution
The authorship of 12 of the papers is disputed

?????
5
Author identification

In 1964 Mosteller and Wallace solved the problem
Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.
It's a text categorization problem
They identified 70 function words as good candidates for authorship analysis
Using statistical inference, they concluded that the author was Madison

6
Function words for Author Identification
7
Function Words for Author Identification
8
Today's Topic

Document clustering
Motivations
Clustering algorithms
Partitional
Hierarchical
Evaluation

9
What's Clustering?
10
What is clustering?

Clustering: the process of grouping a set of objects into classes of similar objects
The commonest form of unsupervised learning
Unsupervised learning: learning from raw data, as opposed to supervised data where a classification of examples is given
A common and important task that finds many applications in IR and other places

11
Clustering: Internal Criterion
How many clusters?

High intra-cluster similarity
Low inter-cluster similarity

12
Issues for clustering

Representation for clustering
Document representation
Vector space or language model?
Notion of similarity/distance
Cosine similarity or KL distance
How many clusters?
Fixed a priori?
Completely data driven?
Avoid trivial clusters - too large or small

13
Clustering Algorithms

Hard clustering algorithms
Compute a hard assignment: each document is a member of exactly one cluster.
Soft clustering algorithms
The assignment is soft: a document's assignment is a distribution over all clusters.

14
Clustering Algorithms

Flat algorithms
Create cluster set without explicit structure
Usually start with a random (partial)
partitioning
Refine it iteratively
K-means clustering
Model-based clustering
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive

16
Evaluation
17
Think about it

Evaluate by high internal criterion scores?
Objective function: high intra-cluster similarity and low inter-cluster similarity

Application User judgment
Internal judgment
18
External criteria for clustering quality

Evaluate against a ground truth (gold standard)
Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω1, ω2, ..., ωK with ni members each.
A simple measure is purity: the number of documents of the dominant class Ci in a cluster, divided by the size of that cluster ωk
Ω = {ω1, ω2, ..., ωK} is the set of clusters and C = {c1, c2, ..., cJ} the set of classes.

19
Purity example
[Figure: 17 documents from three classes grouped into Cluster I, Cluster II and Cluster III]
Cluster I: Purity = 1/6 × max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 × max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 × max(2, 0, 3) = 3/5
Total Purity = 1/17 × (5 + 4 + 3) = 12/17
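
The purity arithmetic of this example can be checked with a few lines of Python. This is only a minimal sketch: the class labels "x", "o" and "d" are placeholder names that do not come from the slides, and the per-cluster class counts (5, 1, 0), (1, 4, 1) and (2, 0, 3) are copied from the example.

```python
from collections import Counter


def purity(clusters):
    """Purity of a clustering.

    clusters: one list per cluster, holding the gold-standard class
    label of every document assigned to that cluster.
    """
    total_docs = sum(len(members) for members in clusters)
    # For each cluster, count the documents of its dominant class.
    dominant = sum(max(Counter(members).values()) for members in clusters)
    return dominant / total_docs


# Placeholder labels; only the counts are taken from the example.
example = [
    ["x"] * 5 + ["o"] * 1,              # Cluster I
    ["x"] * 1 + ["o"] * 4 + ["d"] * 1,  # Cluster II
    ["x"] * 2 + ["d"] * 3,              # Cluster III
]

print(purity(example))  # 12/17, about 0.706
```

Note that purity rises trivially as the number of clusters grows (one document per cluster gives purity 1), so it is usually read together with other measures.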
20
Rand Index

View it as a series of decisions, one for each of the N(N - 1)/2 pairs of documents in the collection.
A true positive (TP) decision assigns two similar documents to the same cluster.
A true negative (TN) decision assigns two dissimilar documents to different clusters.
A false positive (FP) decision assigns two dissimilar documents to the same cluster.
A false negative (FN) decision assigns two similar documents to different clusters.

21
Rand Index
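RI = (TP + TN) / (TP + FP + FN + TN)

                     Same cluster    Different clusters
Same class           TP              FN
Different classes    FP              TN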
22
Rand index Example
[Figure: the 17 documents grouped into Cluster I, Cluster II and Cluster III, as in the purity example]
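
The pair-counting view of the Rand index can be sketched as follows. The sketch assumes that this example uses the same 17 documents as the purity example (class counts 5/1/0, 1/4/1 and 2/0/3), uses the usual definition RI = (TP + TN) / (TP + FP + FN + TN), and treats "similar" as "same gold-standard class". The labels "x", "o", "d" are placeholders.

```python
from itertools import combinations


def pair_counts(cluster_ids, class_ids):
    """Count TP, FP, FN, TN over all N(N-1)/2 document pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_class = class_ids[i] == class_ids[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(cluster_ids, class_ids):
    tp, fp, fn, tn = pair_counts(cluster_ids, class_ids)
    return (tp + tn) / (tp + fp + fn + tn)


# Cluster I has 6 documents, Cluster II has 6, Cluster III has 5.
cluster_ids = [1] * 6 + [2] * 6 + [3] * 5
# Placeholder class labels matching the counts 5/1/0, 1/4/1, 2/0/3.
class_ids = list("xxxxxo" "xooood" "xxddd")

print(pair_counts(cluster_ids, class_ids))  # (20, 20, 24, 72)
print(rand_index(cluster_ids, class_ids))   # 92/136, about 0.676
```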
23
K Means Algorithm
24
Partitioning Algorithms

Given:
a set of documents D and the number K
Find:
a partition into K clusters that optimizes the chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions
Effective heuristic methods: K-means algorithms

Partitioning criterion: residual sum of squares (RSS)
25
K-Means

Represent documents as vectors.
Represent each cluster by its centroid (aka the center of gravity or mean)
Assign instances to clusters based on their distance to the cluster centroids, choosing the nearest centroid

26
K-Means Example (K=2)
Reassign clusters
Converged!
27
K-Means Algorithm
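
The algorithm figure for this slide was not transcribed. As a stand-in, here is a minimal sketch of standard K-means with its two alternating steps, reassignment and recomputation. It assumes dense tuples as document vectors, squared Euclidean distance, and caller-supplied seeds; the function names and the `initial_centroids` and `max_iterations` parameters are illustrative choices, not taken from the slides.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centroids, max_iterations=100):
    """Return (assignment, centroids, rss) for the given seeds."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignment = None
    for _ in range(max_iterations):
        # Reassignment: each point moves to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: clusters did not change
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    rss = sum(squared_distance(p, centroids[a])
              for p, a in zip(points, assignment))
    return assignment, centroids, rss


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids, rss = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Passing the seeds explicitly keeps the run deterministic and makes it easy to try several starting points and keep the result with the lowest RSS.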
28
Convergence

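Why does K-means converge?
Convergence: a state in which clusters don't change.
Reassignment: RSS decreases monotonically, since each vector is assigned to its closest centroid.
Recomputation: each RSS_k decreases monotonically (m_k is the number of members in cluster k)
$RSS_k(a) = \sum_{X \in \omega_k} (X - a)^2$ reaches its minimum where its derivative with respect to a is zero:

$\sum_{X \in \omega_k} -2(X - a) = 0$
$\sum_{X \in \omega_k} X = \sum_{X \in \omega_k} a = m_k a$
$a = \frac{1}{m_k} \sum_{X \in \omega_k} X$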
29
Convergence Global Minimum?

There is unfortunately no guarantee that a global
minimum in the objective function will be reached

outlier
30
Seed Choice

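Results can vary depending on the choice of seeds.
Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
Try out multiple starting points
Initialize with the results of another clustering method (e.g., by sampling)

In the above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}. If you start with D and F you converge to {A,B,D,E} and {C,F}.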
31
How Many Clusters?

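How do we choose the number of clusters K?
Trade-off between having more clusters (better focus within each cluster) and having too many clusters
For example:
Define the Benefit for a doc as its cosine similarity to its cluster centroid; the sum of the benefits of all docs is the Total Benefit.
Define a Cost for each cluster
Define the Value of a clustering as Total Benefit - Total Cost.
Over all choices of K, pick the one with the highest value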
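
The Value = Total Benefit - Total Cost criterion can be sketched as below. The slide does not say how the cost of a cluster is defined, so a fixed `cost_per_cluster` is assumed here purely for illustration; the candidate clusterings are also written by hand rather than produced by a clustering algorithm.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def clustering_value(clusters, cost_per_cluster):
    """Total Benefit minus Total Cost for one clustering."""
    total_benefit = 0.0
    for members in clusters:
        center = centroid(members)
        # Benefit of a doc: cosine similarity to its cluster centroid.
        total_benefit += sum(cosine(doc, center) for doc in members)
    total_cost = cost_per_cluster * len(clusters)  # assumed cost model
    return total_benefit - total_cost


docs = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]
candidates = {
    1: [docs],                  # everything in one cluster
    2: [docs[:2], docs[2:]],    # two topical groups
    4: [[d] for d in docs],     # one cluster per document
}
values = {k: clustering_value(c, cost_per_cluster=0.5)
          for k, c in candidates.items()}
best_k = max(values, key=values.get)
print(best_k)  # 2
```

With a cost of zero the singleton clustering always wins, which is why some per-cluster penalty is needed.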

32
Is K-Means Efficient?

Time Complexity
Computing distance between two docs is O(M), where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations, or O(KNM).
Computing centroids: each doc gets added once to some centroid: O(NM).
Assume these two steps are each done once for I iterations: O(IKNM).
M is
A document is a sparse vector, but a centroid is not
K-medoids algorithms: use the element closest to the center as "the medoid"

33
Efficiency: Medoid As Cluster Representative

Medoid: a document that serves as the cluster representative
e.g., the document closest to the centroid
One reason this is useful
Consider the representative of a large cluster (>1000 documents)
The centroid of this cluster will be a dense vector
The medoid of this cluster will be a sparse vector
Compare:
mean vs. median
centroid vs. medoid
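
The sparsity argument can be illustrated with a small sketch. It assumes documents are sparse term-to-weight dictionaries and takes the medoid to be the document most cosine-similar to the centroid; the toy documents and function names are made up for the example.

```python
import math


def centroid(docs):
    """Average of sparse vectors; has an entry for every term seen."""
    totals = {}
    for doc in docs:
        for term, weight in doc.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in totals.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def medoid(docs):
    """The document closest (by cosine) to the cluster centroid."""
    center = centroid(docs)
    return max(docs, key=lambda doc: cosine(doc, center))


docs = [
    {"jaguar": 1.0, "car": 1.0},
    {"jaguar": 1.0, "engine": 1.0},
    {"jaguar": 1.0, "car": 1.0, "speed": 1.0},
]
print(len(centroid(docs)))  # 4 non-zero terms: the union of all terms
print(medoid(docs))         # {'jaguar': 1.0, 'car': 1.0}
```

Even in this three-document cluster the centroid already carries more non-zero terms than any single document, and the gap grows with cluster size.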

34
Hierarchical Clustering Algorithm
35
Hierarchical Agglomerative Clustering (HAC)

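Assumes a similarity function for determining the similarity of two instances.
Procedure:
Start with all instances in separate clusters
Repeatedly join the two most similar clusters into a new cluster
until there is only one cluster
The history of merging forms a binary tree or hierarchy.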

Dendrogram
36
Dendrogram: Document Example

As clusters agglomerate, docs are likely to fall into a hierarchy of topics or concepts.

[Figure: dendrogram over documents d1 to d5, including the merged node d1,d2]
37
HAC Algorithm, pseudo-code
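
The pseudo-code figure for this slide was not transcribed. The following is a naive sketch of agglomerative clustering, not the slide's own pseudo-code: it recomputes all cluster-pair similarities at every step, so it is far from the efficient variants. The `linkage` parameter is an illustrative assumption: `max` gives single-link behaviour and `min` gives complete-link.

```python
def hac(items, sim, linkage=max):
    """Naive agglomerative clustering; returns the merge history.

    items:   the instances to cluster
    sim:     similarity function on two instances
    linkage: max -> single-link, min -> complete-link
    """
    clusters = [[i] for i in range(len(items))]  # one cluster per item
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(sim(items[i], items[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), s))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges


# Toy data: points on a line, similarity = negative distance.
points = [0.0, 1.0, 5.0, 5.5, 20.0]
merges = hac(points, sim=lambda x, y: -abs(x - y))
for left, right, s in merges:
    print(left, right, s)
```

Read from top to bottom, the merge history is the dendrogram: the first merges are the leaves being joined, the last merge is the root.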
38
Hierarchical Clustering algorithms

Agglomerative (bottom-up)
Start with each document being a single cluster.
Eventually all documents belong to the same cluster.
Divisive (top-down)
Start with all documents belonging to the same cluster.
Eventually each node forms a cluster on its own.
Does not require the number of clusters k in advance

39
Key notion: cluster representative

How do we measure the distance between clusters?
To compute it, how should we represent a cluster (cluster representation)?
The representative should be some sort of "typical" or central point in the cluster, e.g.,
the point inducing smallest radii to docs in the cluster
the point with smallest squared distances, etc.
the point that is the average of all docs in the cluster
Centroid or center of gravity

40
Closest pair of clusters

Center of gravity
Clusters whose centroids (centers of gravity) are the most cosine-similar
Average-link
Average cosine similarity between pairs of elements
Single-link
Nearest points (similarity of the most cosine-similar)
Complete-link
Furthest points (similarity of the furthest points, the least cosine-similar)
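
The four ways of scoring a pair of clusters can be compared side by side on a toy example. This is a sketch under the assumption that average-link averages over all pairs drawn from the two different clusters; the vectors and function names are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pairwise(cluster_a, cluster_b):
    return [cosine(a, b) for a in cluster_a for b in cluster_b]


def single_link(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))   # most similar pair


def complete_link(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))   # least similar pair


def average_link(cluster_a, cluster_b):
    sims = pairwise(cluster_a, cluster_b)
    return sum(sims) / len(sims)


def centroid_similarity(cluster_a, cluster_b):
    return cosine(centroid(cluster_a), centroid(cluster_b))


cluster_a = [(1.0, 0.0), (0.8, 0.6)]
cluster_b = [(0.0, 1.0), (0.6, 0.8)]
print(single_link(cluster_a, cluster_b))          # about 0.96
print(complete_link(cluster_a, cluster_b))        # 0.0
print(average_link(cluster_a, cluster_b))         # about 0.54
print(centroid_similarity(cluster_a, cluster_b))  # about 0.6
```

The same two clusters look very close under single-link and completely dissimilar under complete-link, which is the source of the chaining and outlier effects.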

41
Single Link Example
chaining
42
Complete Link Example
Affected by outliers
43
Computational Complexity

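In the first iteration, HAC computes the similarity of all pairs of instances: O(n^2).
In each of the subsequent n-2 merging iterations, compute the similarity between the most recently created cluster and all other existing clusters
Maintaining the similarities:
To keep an overall O(n^2) performance
the similarity to each other cluster must be computed in constant time.
Otherwise O(n^2 log n) or O(n^3)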

44
Centroid Agglomerative Clustering
Example: n=6, k=3, closest pair of centroids
[Figure: documents d1 to d6 merged by closest pair of centroids]
45
Group Average Agglomerative Clustering

Use the average similarity across all pairs within the merged cluster
??????????
Vectors are assumed to be normalized to unit length.
Always maintain the sum of vectors for each cluster.
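
Why keeping the sum of vectors helps can be shown with a short sketch. It assumes unit-length vectors, dot product as cosine similarity, and a group average taken over all distinct pairs in the merged cluster (self-similarities excluded). Under those assumptions, with s the sum of the N vectors of the merged cluster, s·s = N + 2 × (sum over distinct pairs), so the average is (s·s - N) / (N(N - 1)). The function names are illustrative.

```python
import math
from itertools import combinations


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vector_sum(vectors):
    return tuple(sum(column) for column in zip(*vectors))


def group_average_from_sums(sum_a, n_a, sum_b, n_b):
    """Group-average similarity of the merged cluster, using only the
    per-cluster vector sums and sizes (no pairwise loop)."""
    s = tuple(x + y for x, y in zip(sum_a, sum_b))
    n = n_a + n_b
    return (dot(s, s) - n) / (n * (n - 1))


def group_average_brute_force(vectors):
    """Reference implementation: average over all distinct pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(dot(a, b) for a, b in pairs) / len(pairs)


cluster_a = [normalize(v) for v in [(1.0, 0.2, 0.0), (0.9, 0.1, 0.3)]]
cluster_b = [normalize(v) for v in [(0.2, 1.0, 0.1), (0.0, 0.8, 0.5), (0.3, 0.7, 0.2)]]

fast = group_average_from_sums(vector_sum(cluster_a), len(cluster_a),
                               vector_sum(cluster_b), len(cluster_b))
slow = group_average_brute_force(cluster_a + cluster_b)
print(fast, slow)  # the two values agree up to rounding
```

The fast version costs one vector addition and one dot product per candidate merge, independent of how many documents the clusters contain.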

46
Exercise

Consider agglomerative clustering of n points on a line. Explain how to avoid n^3 distance/similarity computations. How many computations does your scheme use?

47
Efficiency Using approximations

In the standard algorithm, we must find the closest pair of centroids at each step
Approximation: find a nearly closest pair instead
Simplistic example: maintain the closest pair based on distances in projection on a random line

Random line
48
Applications in IR
49
Navigating document collections
Table of Contents
1. Science of Cognition
1.a. Motivations
1.a.i. Intellectual Curiosity
1.a.ii. Practical Applications
1.b. History of Cognitive Psychology
2. The Neural Basis of Cognition
2.a. The Nervous System
2.b. Organization of the Brain
2.c. The Visual System
3. Perception and Attention
3.a. Sensory Memory
3.b. Attention and Sensory Information Processing

Index
Aardvark, 15
Blueberry, 200
Capricorn, 1, 45-55
Dog, 79-99
Egypt, 65
Falafel, 78-90
Giraffes, 45-59

Information Retrieval: a book index
Document clusters: a table of contents

50
Scatter/Gather: Cutting, Karger, and Pedersen
51
For better navigation of search results
52
Vivisimo SE
53
Navigating search results (2)

Group documents by the sense of a word
Given the results of a search (say Jaguar, or NLP), partition them into groups of related documents
Can be viewed as a form of word sense disambiguation

55
For speeding up vector space retrieval

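For retrieval in the VSM, we must find the doc vectors nearest to the query vector
Computing the similarity between the query and every doc is slow (for some applications)
With an inverted index, only docs that share a term with the query need to be scored
By clustering docs in corpus a priori
we only need to search the docs in the cluster(s) closest to the query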
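
A minimal sketch of retrieval restricted to the closest cluster follows. It assumes the clusters and their centroids were computed offline and that only the single nearest cluster is searched; the data layout, the `top_n` parameter and the function name are illustrative assumptions. The result is inexact, since a relevant document in another cluster is never scored.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_in_nearest_cluster(query, clusters, top_n=3):
    """clusters: list of (centroid, [(doc_id, vector), ...]) pairs
    built ahead of time. Only the closest cluster is scored."""
    # Step 1: compare the query with K centroids instead of N docs.
    _, members = max(clusters, key=lambda c: cosine(query, c[0]))
    # Step 2: rank only the documents inside that cluster.
    ranked = sorted(members, key=lambda m: cosine(query, m[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]


clusters = [
    ((1.0, 0.1), [("a1", (1.0, 0.0)), ("a2", (1.0, 0.2))]),
    ((0.1, 1.0), [("b1", (0.0, 1.0)), ("b2", (0.2, 1.0))]),
]
print(search_in_nearest_cluster((0.9, 0.3), clusters))  # ['a2', 'a1']
```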

56
Resources

Weka 3 - Data Mining with Open Source Machine
Learning Software in Java

57
Summary

Text Clustering
Evaluation
Purity, NMI, Rand Index
Partition Algorithm
K-Means
Reassignment
Recomputation
Hierarchical Algorithm
Cluster representation
Closeness measure for a pair of clusters
Single link
Complete link
Average link
centroid

58
Readings

1. IIR Ch. 16.1-4, Ch. 17.1-4
2. F. Beil, M. Ester, and X. Xu, "Frequent term-based text clustering," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada: ACM, 2002.

59
Thank You!

60
Cluster Labeling
61
Major issue - labeling

After clustering algorithm finds clusters - how
can they be useful to the end user?
Need pithy label for each cluster
In search results, say "Animal" or "Car" in the jaguar example.
In topic trees (Yahoo), need navigational cues.
Often done by hand, a posteriori.

62
How to Label Clusters

Show titles of typical documents
Titles are easy to scan
Authors create them for quick scanning!
But you can only show a few titles which may not
fully represent cluster
Show words/phrases prominent in cluster
More likely to fully represent cluster
Use distinguishing words/phrases
Differential labeling (think about Feature
Selection)
But harder to scan

63
Labeling

Common heuristics: list the 5-10 most frequent terms in the centroid vector.
Drop stop-words; stem.
Differential labeling by frequent terms
Within a collection "Computers", clusters all have the word "computer" as a frequent term.
Discriminant analysis of centroids.
Perhaps better: distinctive noun phrase
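
The two labeling heuristics can be sketched as follows. The frequent-term label follows the slide directly. The differential label is only one simple stand-in for "discriminant analysis of centroids": it scores a term by how much heavier it is in the cluster centroid than in the centroid of the whole collection. Stemming is left out, and the toy weights and stop-word list are made up.

```python
def frequent_term_label(cluster_centroid, stop_words, n=5):
    """The n heaviest non-stop-word terms of the centroid vector."""
    terms = [t for t in cluster_centroid if t not in stop_words]
    return sorted(terms, key=lambda t: (-cluster_centroid[t], t))[:n]


def differential_label(cluster_centroid, collection_centroid, stop_words, n=5):
    """Terms that are heavier in this cluster than in the collection.

    Assumed scoring: cluster weight minus collection weight.
    """
    scores = {
        term: weight - collection_centroid.get(term, 0.0)
        for term, weight in cluster_centroid.items()
        if term not in stop_words
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


stop_words = {"the"}
# One cluster inside a "Computers" collection (toy weights).
cluster = {"the": 0.9, "computer": 0.8, "laptop": 0.5, "battery": 0.4}
collection = {"the": 0.9, "computer": 0.8, "laptop": 0.1,
              "battery": 0.1, "server": 0.3}

print(frequent_term_label(cluster, stop_words, n=2))
# ['computer', 'laptop']
print(differential_label(cluster, collection, stop_words, n=2))
# ['laptop', 'battery']
```

The frequent-term label still leads with "computer", which every cluster in the collection shares; the differential label drops it in favour of terms specific to this cluster.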
