Title: Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage
1. Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage
- Raymond J. Mooney
- Sugato Basu
- Mikhail Bilenko
- Arindam Banerjee
2. Supervised Classification Example
[figure]
3. Supervised Classification Example
[figure]
4. Supervised Classification Example
[figure]
5. Unsupervised Clustering Example
[figure]
6. Unsupervised Clustering Example
[figure]
7. Semi-Supervised Learning
- Combines labeled and unlabeled data during training to improve performance.
- Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
- Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
8. Semi-Supervised Classification Example
[figure]
9. Semi-Supervised Classification Example
[figure]
10. Semi-Supervised Classification
- Algorithms:
- Semi-supervised EM [Ghahramani NIPS '94; Nigam ML '00]
- Co-training [Blum COLT '98]
- Transductive SVMs [Vapnik '98; Joachims ICML '99]
- Assumptions:
- Known, fixed set of categories given in the labeled data.
- Goal is to improve classification of examples into these known categories.
11. Semi-Supervised Clustering Example
[figure]
12. Semi-Supervised Clustering Example
[figure]
13. Second Semi-Supervised Clustering Example
[figure]
14. Second Semi-Supervised Clustering Example
[figure]
15. Semi-Supervised Clustering
- Can group data using the categories in the initial labeled data.
- Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.
- Can cluster a disjoint set of unlabeled data using the labeled data as a guide to the type of clusters desired.
16. Search-Based Semi-Supervised Clustering
- Alter the clustering algorithm that searches for a good partitioning by:
- Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz ANNIE '99].
- Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff ICML '00, ICML '01].
- Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu ICML '02].
17. Unsupervised KMeans Clustering
- KMeans is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
- Algorithm: initialize K cluster centers randomly, then repeat until convergence:
- Cluster assignment step: assign each data point x to the cluster X_l such that the L2 distance of x from the center of X_l is minimum.
- Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster.
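The two alternating steps above can be sketched in Python (a minimal NumPy sketch, not the authors' code; it initializes centers from randomly chosen data points):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain KMeans: alternate cluster assignment and center re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers randomly (here: K distinct data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```

A production version would also handle empty clusters and multiple random restarts, which this sketch omits.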
18. KMeans Objective Function
- Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers.
- Initialization of K cluster centers:
- Totally random
- Random perturbation from the global mean
- Heuristic to ensure well-separated centers
- etc.
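Concretely, with clusters X_1, ..., X_K and centers μ_l, KMeans locally minimizes:

```latex
J = \sum_{l=1}^{K} \sum_{x_i \in X_l} \lVert x_i - \mu_l \rVert^2,
\qquad \mu_l = \frac{1}{|X_l|} \sum_{x_i \in X_l} x_i
```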
19. KMeans Example
20. KMeans Example: Randomly Initialize Means
[figure]
21. KMeans Example: Assign Points to Clusters
[figure]
22. KMeans Example: Re-estimate Means
[figure]
23. KMeans Example: Re-assign Points to Clusters
[figure]
24. KMeans Example: Re-estimate Means
[figure]
25. KMeans Example: Re-assign Points to Clusters
[figure]
26. KMeans Example: Re-estimate Means and Converge
[figure]
27. Semi-Supervised KMeans
- Seeded KMeans:
- Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
- Seed points are only used for initialization, not in subsequent steps.
- Constrained KMeans:
- Labeled data provided by the user are used to initialize the KMeans algorithm.
- Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
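The difference between the two variants can be sketched as follows (a hypothetical NumPy sketch; `seed_idx` and `seed_labels` are assumed index/label arrays for the seed points):

```python
import numpy as np

def seeded_kmeans(X, k, seed_idx, seed_labels, n_iter=100, constrained=False):
    """Seeded KMeans: initialize centers from labeled seed points.
    With constrained=True, seed labels are also clamped in every
    assignment step (Constrained KMeans)."""
    # Initial center for cluster i is the mean of seed points with label i.
    centers = np.array([X[seed_idx][seed_labels == i].mean(axis=0)
                        for i in range(k)])
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if constrained:
            # Keep cluster labels of the seed data unchanged.
            labels[seed_idx] = seed_labels
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

Seeded KMeans is the `constrained=False` path: the seeds only pick the starting centers and may later be re-assigned.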
28. Semi-Supervised KMeans Example
29. Semi-Supervised KMeans Example: Initialize Means Using Labeled Data
[figure]
30. Semi-Supervised KMeans Example: Assign Points to Clusters
[figure]
31. Semi-Supervised KMeans Example: Re-estimate Means and Converge
[figure]
32. Similarity-Based Semi-Supervised Clustering
- Train an adaptive similarity function to fit the labeled data.
- Use a standard clustering algorithm with the trained similarity function to cluster the unlabeled data.
- Adaptive similarity functions:
- Altered Euclidean distance [Klein ICML '02]
- Trained Mahalanobis distance [Xing NIPS '02]
- EM-trained edit distance [Bilenko KDD '03]
- Clustering algorithms:
- Single-link agglomerative [Bilenko KDD '03]
- Complete-link agglomerative [Klein ICML '02]
- K-means [Xing NIPS '02]
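A trained metric of the Mahalanobis family replaces Euclidean distance with d_A(x, y) = sqrt((x - y)^T A (x - y)), where the positive semi-definite matrix A is what gets learned from the labeled data. A minimal sketch (the matrix A here is hand-set for illustration, not actually learned):

```python
import numpy as np

def mahalanobis(x, y, A):
    """Distance under a PSD matrix A; A = I recovers plain Euclidean."""
    d = x - y
    return float(np.sqrt(d @ A @ d))

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# With A = I this is just the Euclidean distance sqrt(2).
d_euclid = mahalanobis(x, y, np.eye(2))
# Down-weighting the first feature shrinks distances along that axis,
# which is the kind of reshaping a learned metric performs.
d_learned = mahalanobis(x, y, np.diag([0.1, 1.0]))
```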
33. Semi-Supervised Clustering Example: Similarity-Based
[figure]
34. Semi-Supervised Clustering Example: Distances Transformed by Learned Metric
[figure]
35. Semi-Supervised Clustering Example: Clustering Result with Trained Metric
[figure]
36. Experiments
- Evaluation measures:
- Objective function value for KMeans.
- Mutual Information (MI) between the distributions of computed cluster labels and human-provided class labels.
- Experiments:
- Change of objective function and MI with increasing fraction of seeding (for complete labeling and no noise).
- Change of objective function and MI with increasing noise in seeds (for complete labeling and fixed seeding).
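The MI measure can be computed directly from the joint counts of (cluster label, class label) pairs; a self-contained sketch (in nats; normalization conventions vary across papers):

```python
import math
from collections import Counter

def mutual_information(clusters, classes):
    """MI between two labelings of the same points."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))   # joint counts n_ck
    pc = Counter(clusters)                    # cluster marginals n_c
    pk = Counter(classes)                     # class marginals n_k
    mi = 0.0
    for (c, k), n_ck in joint.items():
        # p_ck * log(p_ck / (p_c * p_k)), with probabilities as count ratios
        mi += (n_ck / n) * math.log(n_ck * n / (pc[c] * pk[k]))
    return mi
```

Identical labelings give MI equal to the label entropy; independent labelings give MI of zero.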
37. Experimental Methodology
- The clustering algorithm is always run on the entire dataset.
- Learning curves with 10-fold cross-validation:
- 10% of the data is set aside as a test set whose labels are always hidden.
- The learning curve is generated by training on different seed fractions of the remaining 90% of the data, whose labels are provided.
- The objective function is calculated over the entire dataset.
- The MI measure is calculated only on the independent test set.
38. Experimental Methodology (contd.)
- For each fold in the seeding experiments:
- Seeds selected from the training dataset by varying the seed fraction from 0.0 to 1.0, in steps of 0.1.
- For each fold in the noise experiments:
- Noise simulated by changing the labels of a fraction of the seed values to a random incorrect value.
39. COP-KMeans
- COP-KMeans [Wagstaff et al. ICML '01] is KMeans with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
- Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so those points cannot later be chosen as the center of another cluster).
- Algorithm: during the cluster assignment step in COP-KMeans, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
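The constrained assignment step can be sketched as follows (hypothetical helpers, not the authors' code; `must`/`cannot` are sets of index pairs, and `labels` holds the assignments already made in this pass):

```python
def violates(point, cluster, labels, must, cannot):
    """Would assigning `point` to `cluster` break a constraint, given
    the (partial) assignments already made in this pass?"""
    for a, b in must:
        other = b if a == point else a if b == point else None
        if other is not None and labels.get(other) is not None \
                and labels[other] != cluster:
            return True   # must-link partner sits in a different cluster
    for a, b in cannot:
        other = b if a == point else a if b == point else None
        if other is not None and labels.get(other) == cluster:
            return True   # cannot-link partner sits in this cluster
    return False

def assign(point, dists_to_centers, labels, must, cannot):
    """Assign to the nearest cluster that violates no constraint; abort if none."""
    for cluster in sorted(range(len(dists_to_centers)),
                          key=dists_to_centers.__getitem__):
        if not violates(point, cluster, labels, must, cannot):
            labels[point] = cluster
            return cluster
    raise RuntimeError("no consistent assignment exists -- abort")
```

Note the order dependence: a must-link partner assigned earlier in the pass forces the point into that cluster even when another center is nearer.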
40. Datasets
- Data sets:
- UCI Iris (3 classes, 150 instances)
- CMU 20 Newsgroups (20 classes, 20,000 instances)
- Yahoo! News (20 classes, 2,340 instances)
- Data subsets created for experiments:
- Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study the effect of data size on the algorithms.
- Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms.
- Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).
41. Text Data
- Vector space model with TF-IDF weighting for text data.
- Non-content-bearing words removed:
- Stopwords
- High- and low-frequency words
- Words of length < 3
- Text-handling software:
- Spherical KMeans was used as the underlying clustering algorithm; it uses cosine similarity instead of Euclidean distance between word vectors.
- Code base built on top of the MC and SPKMeans packages developed at UT Austin.
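The vector-space representation with cosine similarity (the similarity spherical KMeans relies on) can be sketched as follows (a toy TF-IDF; real systems add smoothing and the filtering steps listed above):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to TF-IDF weighted sparse term vectors."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between sparse dict vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```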
42. Results: MI and Seeding
- Zero noise in seeds (Small-20 Newsgroups).
- Semi-supervised KMeans substantially better than unsupervised KMeans.
43. Results: Objective Function and Seeding
- User labeling consistent with KMeans assumptions (Small-20 Newsgroups).
- Objective function of the data partition increases exponentially with seed fraction.
44. Results: MI and Seeding
- Zero noise in seeds (Yahoo! News).
- Semi-supervised KMeans still better than unsupervised.
45. Results: Objective Function and Seeding
- User labeling inconsistent with KMeans assumptions (Yahoo! News).
- Objective function of the constrained algorithms decreases with seeding.
46. Results: Dataset Separability
- Difficult datasets: lots of overlap between the clusters (Same-3 Newsgroups).
- Semi-supervision gives substantial improvement.
47. Results: Dataset Separability
- Easy datasets: not much overlap between the clusters (Different-3 Newsgroups).
- Semi-supervision does not give substantial improvement.
48. Results: Noise Resistance
- Seed fraction = 0.1 (20 Newsgroups).
- Seeded KMeans is most robust against noisy seeding.
49. Record Linkage
- Identify and merge duplicate field values and duplicate records in a database.
- Applications:
- Duplicates in mailing lists.
- Information integration of multiple databases of stores, restaurants, etc.
- Matching bibliographic references in research papers (Cora/ResearchIndex).
- Different published editions in a database of books.
50. Experimental Datasets
- 1,200 artificially corrupted mailing list addresses.
- 1,295 Cora research paper citations.
- 864 restaurant listings from Fodor's and Zagat's guidebooks.
- 1,675 Citeseer research paper citations.
51. Record Linkage Examples

Author | Title | Venue | Address | Year
Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby | Information, prediction, and query by committee | Advances in Neural Information Processing System | San Mateo, CA | 1993
Freund, Y., Seung, H.S., Shamir, E., Tishby, N. | Information, prediction, and query by committee | Advances in Neural Information Processing Systems | San Mateo, CA. |

Name | Address | City | Cuisine
Second Avenue Deli | 156 2nd Ave. at 10th | New York | Delicatessen
Second Avenue Deli | 156 Second Ave. | New York City | Delis
52. Traditional Record Linkage
- Apply a static text-similarity metric to each field:
- Cosine similarity
- Jaccard similarity
- Edit distance
- Combine the similarity of each field to determine overall similarity:
- Manually weighted sum
- Threshold the overall similarity to detect duplicates.
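The static combination step is just a manually weighted sum compared against a threshold; a sketch with hypothetical weights and field-similarity values:

```python
def record_similarity(field_sims, weights):
    """Manually weighted sum of per-field similarities."""
    return sum(weights[f] * s for f, s in field_sims.items())

# Hand-chosen weights: name matters more than city for restaurant records.
weights = {"name": 0.6, "address": 0.3, "city": 0.1}
sims = {"name": 0.95, "address": 0.7, "city": 1.0}
is_duplicate = record_similarity(sims, weights) > 0.8  # manually set threshold
```

The weakness this slide sets up is exactly these hand-tuned weights and threshold, which the trainable approach later replaces with a learned combiner.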
53. Edit (Levenshtein) Distance
- Minimum number of character deletions, additions, or substitutions needed to make two strings equivalent:
- "misspell" to "mispell" is distance 1
- "misspell" to "mistell" is distance 2
- "misspell" to "misspelling" is distance 3
- Can be computed efficiently using dynamic programming in O(mn) time, where m and n are the lengths of the two strings being compared.
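The O(mn) dynamic program fills a table of prefix-to-prefix distances:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t))."""
    m, n = len(s), len(t)
    # d[i][j] = distance between prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                           # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                           # add all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]
```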
54. Edit Distance with Affine Gaps
- Contiguous deletions/additions are less expensive than non-contiguous ones:
- "misspell" to "misspelling" is distance < 3
- The relative cost of contiguous and non-contiguous deletions/additions is determined by a manually set parameter.
- Affine-gap edit distance is better for identifying duplicates than Levenshtein distance.
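One standard formulation is a Gotoh-style three-matrix dynamic program, where a run of k contiguous deletions or additions costs gap_open + (k-1)*gap_extend; the cost parameters below are illustrative, not the values used in the experiments:

```python
def affine_edit_distance(s, t, sub=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance with affine gap costs (Gotoh-style DP)."""
    INF = float("inf")
    m, n = len(s), len(t)
    # M: last op was match/substitute; D: gap in t (deletion from s);
    # I: gap in s (addition of t characters).
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    I = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0.0 if s[i - 1] == t[j - 1] else sub
            M[i][j] = c + min(M[i-1][j-1], D[i-1][j-1], I[i-1][j-1])
            D[i][j] = min(M[i-1][j] + gap_open,    # open a new gap
                          D[i-1][j] + gap_extend,  # extend a running gap
                          I[i-1][j] + gap_open)
            I[i][j] = min(M[i][j-1] + gap_open,
                          I[i][j-1] + gap_extend,
                          D[i][j-1] + gap_open)
    return min(M[m][n], D[m][n], I[m][n])
```

With these parameters the contiguous "ing" run costs 1 + 0.5 + 0.5 = 2, matching the "< 3" claim above; setting gap_extend equal to gap_open recovers unit-cost behavior on simple cases.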
55. Trainable Record Linkage
- MARLIN (Multiply Adaptive Record Linkage using INduction)
- Learn parameterized similarity metrics for comparing each field:
- Trainable edit distance
- Use EM to set edit-operation costs
- Learn to combine the multiple similarity metrics for each field to determine equivalence:
- Use an SVM to decide on duplicates
56. Trainable Edit Distance
- Learnable edit distance based on a generative probabilistic model for producing matched pairs of strings.
- Parameters trained using EM to maximize the probability of producing training pairs of equivalent strings.
- Originally developed for Levenshtein distance by Ristad & Yianilos (1998).
- We modified it for affine-gap edit distance.
57. Sample Learned Edit Operations
- Inexpensive operations:
- Deleting/adding a space
- Substituting "/" for "-" in phone numbers
- Deleting/adding "e" and "t" in addresses (Street → St.)
- Expensive operations:
- Deleting/adding a digit in a phone number
- Deleting/adding a "q" in a name
58. Combining Field Similarities
- Record similarity is determined by combining the similarities of the individual fields.
- Some fields are more indicative of record similarity than others:
- For addresses, city similarity is less relevant than restaurant/person name or street address.
- For bibliographic citations, venue (i.e. conference or journal name) is less relevant than author or title.
- Field similarities should therefore be weighted when combined into a record similarity.
- The weights should be set by a learning algorithm.
59. MARLIN Record Linkage Framework
[figure: trainable duplicate detector; trainable similarity metrics]
60. Learned Record Similarity
- Field similarities are used as feature vectors describing a pair of records.
- An SVM is trained on these feature vectors to discriminate duplicate from non-duplicate pairs.
- Record similarity is based on the distance of the feature vector from the separator.
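This pipeline stage can be sketched with scikit-learn's SVC, using `decision_function` as the signed distance from the separating hyperplane (the pair vectors below are hypothetical toy data, not the MARLIN features):

```python
import numpy as np
from sklearn.svm import SVC

# Each row: per-field similarities for one record pair, e.g. (name, address, city).
X = np.array([
    [0.90, 0.80, 1.0],   # duplicate pairs
    [0.95, 0.70, 0.9],
    [0.85, 0.90, 1.0],
    [0.20, 0.10, 0.9],   # non-duplicate pairs
    [0.30, 0.20, 0.1],
    [0.10, 0.40, 0.8],
])
y = np.array([1, 1, 1, 0, 0, 0])

svm = SVC(kernel="linear").fit(X, y)
# Record similarity = signed distance of the feature vector from the separator.
record_sim = svm.decision_function(X)
```

Note how the city column is nearly useless here (non-duplicates can share a city); the learned weights, unlike a manually weighted sum, discover that automatically.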
61. Record Pair Classification Example
[figure]
62. Clustering Records into Equivalence Classes
- Use similarity-based semi-supervised clustering to identify groups of equivalent records.
- Use single-link agglomerative clustering to cluster records based on the learned similarity metric.
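Single-link agglomerative clustering with a stopping threshold is equivalent to taking the connected components of the graph that links every pair whose similarity exceeds the threshold; a minimal union-find sketch (the pair similarities are hypothetical):

```python
def single_link_clusters(n, pair_sims, threshold):
    """Cluster records 0..n-1: merge every pair whose similarity exceeds
    the threshold (single-link = connected components of that graph)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for (a, b), sim in pair_sims.items():
        if sim > threshold:
            parent[find(a)] = find(b)      # union the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Single-link transitivity is what makes this fit record linkage: records 0 and 2 below end up in one equivalence class via record 1, even though their direct similarity is low.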
63. Experimental Methodology
- 2-fold cross-validation with equivalence classes of records randomly assigned to folds.
- Results averaged over 20 runs of cross-validation.
- Accuracy of duplicate detection on test data measured using:
- Precision
- Recall
- F-measure
64. Mailing List Name Field Results
[figure]
65. Cora Title Field Results
[figure]
66. Maximum F-measure for Detecting Duplicate Field Values

Metric | Restaurant Name | Restaurant Address | Citeseer Reason | Citeseer Face | Citeseer RL | Citeseer Constraint
Static Affine Edit Dist. | 0.29 | 0.68 | 0.93 | 0.95 | 0.89 | 0.92
Learned Affine Edit Dist. | 0.35 | 0.71 | 0.94 | 0.97 | 0.91 | 0.94

- T-test results indicate the differences are significant at the .05 level.
67. Mailing List Record Results
[figure]
68. Restaurant Record Results
[figure]
69. Combining Similarity- and Search-Based Semi-Supervised Clustering
- Can apply seeded/constrained clustering with a trained similarity metric.
- We developed a unified framework for Euclidean distance with soft pairwise constraints (must-link, cannot-link).
- Experiments on UCI data comparing the approaches:
- With small amounts of training, seeded/constrained tends to do better than similarity-based.
- With larger amounts of labeled data, similarity-based tends to do better.
- Combining both approaches outperforms either individual approach.
70. Active Semi-Supervision
- Use active learning to select the most informative labeled examples.
- We have developed an active approach for selecting good pairwise queries to obtain must-link and cannot-link constraints:
- "Should these two examples be in the same or different clusters?"
- Experimental results on UCI and text data:
- Active learning achieves much higher accuracy with fewer labeled training pairs.
71. Future Work
- Adaptive metric learning for vector-space cosine similarity.
- Supervised learning of better token weights than TF-IDF.
- Unified method for text data (cosine similarity) that combines seeded/constrained clustering with a learned similarity measure.
- Active learning results for duplicate detection.
- Static-active learning.
- Exploiting external data/knowledge (e.g. from the web) to improve similarity measures for duplicate detection.
72. Conclusion
- Semi-supervised clustering is an alternative way of combining labeled and unlabeled data in learning.
- Search-based and similarity-based methods are the two alternative approaches.
- They have useful applications in text clustering and database record linkage.
- Experimental results for these applications illustrate their utility.
- The two approaches can be combined to produce even better results.