1
Semi-Supervised Clustering and its Application to
Text Clustering and Record Linkage
  • Raymond J. Mooney
  • Sugato Basu
  • Mikhail Bilenko
  • Arindam Banerjee

2
Supervised Classification Example
[figure: labeled data points]
3
Supervised Classification Example
[figure: labeled data points]
4
Supervised Classification Example
[figure: labeled data points]
5
Unsupervised Clustering Example
[figure: unlabeled data points]
6
Unsupervised Clustering Example
[figure: unlabeled data points]
7
Semi-Supervised Learning
  • Combines labeled and unlabeled data during training to improve performance.
  • Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  • Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

8
Semi-Supervised Classification Example
[figure: a few labeled and many unlabeled data points]
9
Semi-Supervised Classification Example
[figure: a few labeled and many unlabeled data points]
10
Semi-Supervised Classification
  • Algorithms:
  • Semi-supervised EM [Ghahramani NIPS94; Nigam ML00]
  • Co-training [Blum COLT98]
  • Transductive SVMs [Vapnik 98; Joachims ICML99]
  • Assumptions:
  • A known, fixed set of categories is given in the labeled data.
  • The goal is to improve classification of examples into these known categories.

11
Semi-Supervised Clustering Example
[figure: data points with a few labeled examples]
12
Semi-Supervised Clustering Example
[figure: data points with a few labeled examples]
13
Second Semi-Supervised Clustering Example
[figure: data points with a few labeled examples]
14
Second Semi-Supervised Clustering Example
[figure: data points with a few labeled examples]
15
Semi-Supervised Clustering
  • Can group data using the categories in the
    initial labeled data.
  • Can also extend and modify the existing set of
    categories as needed to reflect other
    regularities in the data.
  • Can cluster a disjoint set of unlabeled data
    using the labeled data as a guide to the type
    of clusters desired.

16
Search-Based Semi-Supervised Clustering
  • Alter the clustering algorithm that searches for a good partitioning by:
  • Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz ANNIE99].
  • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff ICML00; ICML01].
  • Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu ICML02].

17
Unsupervised KMeans Clustering
  • KMeans is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
  • Algorithm (a code sketch follows below): initialize K cluster centers randomly, then repeat until convergence:
  • Cluster Assignment Step: assign each data point x to the cluster X_l whose center is nearest to x in L2 distance.
  • Center Re-estimation Step: re-estimate each cluster center as the mean of the points in that cluster.
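
A minimal NumPy sketch of this loop (our own illustration, not the authors' code; all names are ours):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain KMeans on an (n, d) array X, partitioning into k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # Cluster assignment step: nearest center in L2 distance
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```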

18
KMeans Objective Function
  • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (formula below).
  • Initialization of the K cluster centers:
  • Totally random
  • Random perturbation from the global mean
  • Heuristic to ensure well-separated centers
  • etc.
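
In symbols, the quantity being locally minimized is the standard KMeans sum-of-squared-errors objective:

```latex
J = \sum_{l=1}^{K} \sum_{x_i \in X_l} \lVert x_i - \mu_l \rVert^2,
\qquad \mu_l = \frac{1}{|X_l|} \sum_{x_i \in X_l} x_i .
```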

19
K Means Example
20
K Means Example: Randomly Initialize Means
[figure: data points with two initial cluster centers marked x]
21
K Means Example: Assign Points to Clusters
[figure: points assigned to the nearest center (x)]
22
K Means Example: Re-estimate Means
[figure: cluster centers (x) moved to the mean of their assigned points]
23
K Means Example: Re-assign Points to Clusters
[figure: points re-assigned to the nearest center (x)]
24
K Means Example: Re-estimate Means
[figure: cluster centers (x) updated again]
25
K Means Example: Re-assign Points to Clusters
[figure: points re-assigned to the nearest center (x)]
26
K Means Example: Re-estimate Means and Converge
[figure: final cluster centers (x) after convergence]
27
Semi-Supervised KMeans
  • Seeded KMeans:
  • Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  • Seed points are used only for initialization, not in subsequent steps.
  • Constrained KMeans:
  • Labeled data provided by the user are used to initialize the KMeans algorithm.
  • Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of non-seed data are re-estimated (see the sketch below).
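
A sketch of the two variants' distinguishing steps (our own illustration; the `seeds` mapping from point index to cluster label is a hypothetical encoding, and we assume every cluster has at least one seed):

```python
import numpy as np

def seeded_init(X, seeds, k):
    """Seeded KMeans initialization: the initial center for cluster i is
    the mean of the seed points labeled i (assumes each cluster is seeded)."""
    centers = np.empty((k, X.shape[1]))
    for i in range(k):
        idx = [p for p, lbl in seeds.items() if lbl == i]
        centers[i] = X[idx].mean(axis=0)
    return centers

def constrained_assign(X, centers, seeds):
    """Constrained KMeans assignment: seed points keep their given labels;
    only non-seed points are re-assigned to their nearest center."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    for p, lbl in seeds.items():
        labels[p] = lbl  # clamp the seeds' cluster labels
    return labels
```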

28
Semi-Supervised K Means Example
29
Semi-Supervised K Means Example: Initialize Means Using Labeled Data
[figure: cluster centers (x) initialized from labeled points]
30
Semi-Supervised K Means Example: Assign Points to Clusters
[figure: points assigned to the nearest center (x)]
31
Semi-Supervised K Means Example: Re-estimate Means and Converge
[figure: final cluster centers (x) after convergence]
32
Similarity-Based Semi-Supervised Clustering
  • Train an adaptive similarity function to fit the labeled data.
  • Use a standard clustering algorithm with the trained similarity function to cluster the unlabeled data.
  • Adaptive similarity functions:
  • Altered Euclidean distance [Klein ICML02]
  • Trained Mahalanobis distance [Xing NIPS02] (see the sketch below)
  • EM-trained edit distance [Bilenko KDD03]
  • Clustering algorithms:
  • Single-link agglomerative [Bilenko KDD03]
  • Complete-link agglomerative [Klein ICML02]
  • KMeans [Xing NIPS02]
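
As one concrete example of an adaptive metric, the trained Mahalanobis distance parameterizes distance by a learned positive semi-definite matrix A; a sketch of the distance computation alone (learning A is omitted here):

```python
import numpy as np

def mahalanobis_distance(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a learned PSD matrix A.
    With A = I this reduces to ordinary Euclidean distance."""
    d = x - y
    return np.sqrt(d @ A @ d)
```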

33
Semi-Supervised Clustering Example: Similarity-Based
34
Semi-Supervised Clustering Example: Distances Transformed by Learned Metric
35
Semi-Supervised Clustering Example: Clustering Result with Trained Metric
36
Experiments
  • Evaluation measures:
  • Objective function value for KMeans.
  • Mutual Information (MI) between the distributions of computed cluster labels and human-provided class labels (see the sketch below).
  • Experiments:
  • Change of objective function and MI with an increasing fraction of seeding (complete labeling, no noise).
  • Change of objective function and MI with increasing noise in the seeds (complete labeling, fixed seeding).
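
For instance, (normalized) mutual information between computed clusters and true classes can be obtained with scikit-learn; this is our tooling choice, not necessarily what the authors used:

```python
from sklearn.metrics import normalized_mutual_info_score

true_classes = [0, 0, 1, 1, 2, 2]      # human-provided class labels
cluster_labels = [1, 1, 0, 0, 2, 2]    # computed cluster labels
# NMI is invariant to label permutation, so this perfect grouping scores 1.0.
print(normalized_mutual_info_score(true_classes, cluster_labels))
```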

37
Experimental Methodology
  • The clustering algorithm is always run on the entire dataset.
  • Learning curves with 10-fold cross-validation:
  • 10% of the data is set aside as a test set whose labels are always hidden.
  • The learning curve is generated by training on different seed fractions of the remaining 90% of the data, whose labels are provided.
  • The objective function is calculated over the entire dataset.
  • The MI measure is calculated only on the independent test set.

38
Experimental Methodology (contd.)
  • For each fold in the seeding experiments:
  • Seeds are selected from the training dataset by varying the seed fraction from 0.0 to 1.0 in steps of 0.1.
  • For each fold in the noise experiments:
  • Noise is simulated by changing the labels of a fraction of the seeds to a random incorrect value.

39
COP-KMeans
  • COP-KMeans [Wagstaff et al. ICML01] is KMeans with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
  • Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so those points cannot later be chosen as the center of another cluster).
  • Algorithm: during the cluster assignment step, a point is assigned to its nearest cluster such that none of its constraints are violated. If no such assignment exists, abort. (A sketch of this check follows below.)
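
A sketch of that constraint check and nearest-first assignment (the data structures are our own; `labels[i]` is None while point i is unassigned):

```python
def violates(point, cluster, labels, must, cannot):
    """True if putting `point` into `cluster` breaks any pairwise constraint."""
    for a, b in must:
        if point in (a, b):
            other = b if point == a else a
            if labels[other] is not None and labels[other] != cluster:
                return True  # must-link partner sits in a different cluster
    for a, b in cannot:
        if point in (a, b):
            other = b if point == a else a
            if labels[other] == cluster:
                return True  # cannot-link partner sits in this cluster
    return False

def assign_point(point, clusters_nearest_first, labels, must, cannot):
    """Try clusters in order of increasing distance; abort if none is feasible."""
    for c in clusters_nearest_first:
        if not violates(point, c, labels, must, cannot):
            return c
    raise RuntimeError("COP-KMeans: no constraint-consistent assignment, abort")
```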

40
Datasets
  • Datasets:
  • UCI Iris (3 classes, 150 instances)
  • CMU 20 Newsgroups (20 classes, 20,000 instances)
  • Yahoo! News (20 classes, 2,340 instances)
  • Data subsets created for experiments:
  • Small-20 newsgroups: a random sample of 100 documents from each newsgroup, created to study the effect of dataset size on the algorithms.
  • Different-3 newsgroups: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms.
  • Same-3 newsgroups: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).

41
Text Data
  • Vector space model with TF-IDF weighting for text data.
  • Non-content-bearing words removed:
  • Stopwords
  • High- and low-frequency words
  • Words of length < 3
  • Text-handling software:
  • Spherical KMeans was used as the underlying clustering algorithm; it uses cosine similarity instead of Euclidean distance between word vectors.
  • The code base is built on top of the MC and SPKMeans packages developed at UT Austin. (A rough stand-in pipeline is sketched below.)
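
A rough scikit-learn stand-in for this pipeline (our illustration; it is not the MC/SPKMeans code, and plain KMeans on L2-normalized TF-IDF vectors only approximates spherical KMeans):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["space shuttle launch", "baseball season opener", "orbital telescope data"]
# TfidfVectorizer L2-normalizes rows by default; on unit-length vectors,
# minimizing Euclidean distance ranks centers like cosine similarity.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)
```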

42
Results: MI and Seeding
  • Zero noise in seeds; Small-20 Newsgroups dataset.
  • Semi-supervised KMeans is substantially better than unsupervised KMeans.

43
Results: Objective Function and Seeding
  • User labeling consistent with KMeans assumptions; Small-20 Newsgroups dataset.
  • The objective function of the data partition increases exponentially with the seed fraction.

44
Results: MI and Seeding
  • Zero noise in seeds; Yahoo! News dataset.
  • Semi-supervised KMeans is still better than unsupervised.

45
Results: Objective Function and Seeding
  • User labeling inconsistent with KMeans assumptions; Yahoo! News dataset.
  • The objective function of the constrained algorithms decreases with seeding.

46
Results: Dataset Separability
  • Difficult dataset with a lot of overlap between clusters; Same-3 Newsgroups.
  • Semi-supervision gives a substantial improvement.

47
Results: Dataset Separability
  • Easy dataset with little overlap between clusters; Different-3 Newsgroups.
  • Semi-supervision does not give a substantial improvement.

48
Results: Noise Resistance
  • Seed fraction = 0.1; 20 Newsgroups dataset.
  • Seeded-KMeans is the most robust against noisy seeding.

49
Record Linkage
  • Identify and merge duplicate field values and duplicate records in a database.
  • Applications:
  • Duplicates in mailing lists
  • Information integration of multiple databases of stores, restaurants, etc.
  • Matching bibliographic references in research papers (Cora/ResearchIndex)
  • Different published editions in a database of books

50
Experimental Datasets
  • 1,200 artificially corrupted mailing list
    addresses.
  • 1,295 Cora research paper citations.
  • 864 restaurant listings from Fodor's and Zagat's guidebooks.
  • 1,675 Citeseer research paper citations.

51
Record Linkage Examples
Citation records:
  Author:  Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby
  Title:   Information, prediction, and query by committee
  Venue:   Advances in Neural Information Processing System
  Address: San Mateo, CA      Year: 1993

  Author:  Freund, Y., Seung, H.S., Shamir, E. & Tishby, N.
  Title:   Information, prediction, and query by committee
  Venue:   Advances in Neural Information Processing Systems
  Address: San Mateo, CA.

Restaurant records:
  Name: Second Avenue Deli    Address: 156 2nd Ave. at 10th    City: New York         Cuisine: Delicatessen
  Name: Second Avenue Deli    Address: 156 Second Ave.         City: New York City    Cuisine: Delis
52
Traditional Record Linkage
  • Apply a static text-similarity metric to each field:
  • Cosine similarity
  • Jaccard similarity
  • Edit distance
  • Combine the similarity of each field to determine overall similarity:
  • Manually weighted sum
  • Threshold the overall similarity to detect duplicates (see the sketch below).
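
A sketch of this static scheme (the field names, similarity functions, weights, and the 0.8 threshold are all hypothetical):

```python
def record_similarity(rec_a, rec_b, field_sims, weights):
    """Manually weighted sum of per-field similarities in [0, 1].
    `field_sims` maps a field name to a similarity function."""
    return sum(w * field_sims[f](rec_a[f], rec_b[f]) for f, w in weights.items())

def is_duplicate(rec_a, rec_b, field_sims, weights, threshold=0.8):
    """Threshold the overall similarity to decide duplicate vs. distinct."""
    return record_similarity(rec_a, rec_b, field_sims, weights) >= threshold
```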

53
Edit (Levenshtein) Distance
  • Minimum number of character deletions, additions, or substitutions needed to make two strings equivalent:
  • misspell to mispell is distance 1
  • misspell to mistell is distance 2
  • misspell to misspelling is distance 3
  • Can be computed efficiently using dynamic programming in O(mn) time, where m and n are the lengths of the two strings being compared (see the sketch below).
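
A standard dynamic-programming implementation (our own sketch, using the slide's examples as checks):

```python
def edit_distance(s, t):
    """Levenshtein distance in O(mn) time and O(n) space."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from "" to prefixes of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + 1,                            # deletion
                          curr[j - 1] + 1,                        # addition
                          prev[j - 1] + (s[i - 1] != t[j - 1]))   # substitution/match
        prev = curr
    return prev[n]

assert edit_distance("misspell", "mispell") == 1
assert edit_distance("misspell", "misspelling") == 3
```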

54
Edit Distance with Affine Gaps
  • Contiguous deletions/additions are less expensive than non-contiguous ones:
  • misspell to misspelling is distance < 3
  • The relative cost of contiguous and non-contiguous deletions/additions is determined by a manually set parameter.
  • Affine-gap edit distance is better for identifying duplicates than Levenshtein distance. (A sketch follows below.)
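
A sketch of an affine-gap variant in the style of Gotoh's algorithm; the gap_open and gap_extend costs stand in for the slide's manually set parameter and are hypothetical values:

```python
def affine_gap_distance(s, t, sub_cost=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance where extending a contiguous gap costs less than opening one."""
    INF = float("inf")
    m, n = len(s), len(t)
    # M: s[i-1] aligned to t[j-1]; D: gap in t (deleting from s); I: gap in s.
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    I = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            M[i][j] = c + min(M[i-1][j-1], D[i-1][j-1], I[i-1][j-1])
            D[i][j] = min(M[i-1][j] + gap_open, D[i-1][j] + gap_extend,
                          I[i-1][j] + gap_open)
            I[i][j] = min(M[i][j-1] + gap_open, I[i][j-1] + gap_extend,
                          D[i][j-1] + gap_open)
    return min(M[m][n], D[m][n], I[m][n])

# The contiguous 3-character gap costs gap_open + 2 * gap_extend = 2.0 < 3.
assert affine_gap_distance("misspell", "misspelling") < 3
```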

55
Trainable Record Linkage
  • MARLIN (Multiply Adaptive Record Linkage using INduction):
  • Learn parameterized similarity metrics for comparing each field:
  • Trainable edit distance
  • Use EM to set edit-operation costs
  • Learn to combine the multiple similarity metrics for each field to determine equivalence:
  • Use an SVM to decide on duplicates

56
Trainable Edit Distance
  • Learnable edit distance based on a generative probabilistic model for producing matched pairs of strings.
  • Parameters are trained using EM to maximize the probability of producing training pairs of equivalent strings.
  • Originally developed for Levenshtein distance by Ristad & Yianilos (1998).
  • We modified it for affine-gap edit distance.

57
Sample Learned Edit Operations
  • Inexpensive operations:
  • Deleting/adding a space
  • Substituting / for - in phone numbers
  • Deleting/adding e and t in addresses (Street → St.)
  • Expensive operations:
  • Deleting/adding a digit in a phone number
  • Deleting/adding a q in a name

58
Combining Field Similarities
  • Record similarity is determined by combining the similarities of individual fields.
  • Some fields are more indicative of record similarity than others:
  • For addresses, city similarity is less relevant than restaurant/person name or street address.
  • For bibliographic citations, venue (i.e., conference or journal name) is less relevant than author or title.
  • Field similarities should therefore be weighted when combined to determine record similarity.
  • The weights should be learned with a learning algorithm.

59
MARLIN Record Linkage Framework
[diagram: MARLIN framework, with trainable similarity metrics feeding a trainable duplicate detector]
60
Learned Record Similarity
  • Field similarities are used as feature vectors describing a pair of records.
  • An SVM is trained on these feature vectors to discriminate duplicate from non-duplicate pairs.
  • Record similarity is based on the distance of the feature vector from the separator (see the sketch below).
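
A minimal sketch of this pair classifier; the per-field similarity features and the use of scikit-learn's SVC are our illustration:

```python
import numpy as np
from sklearn.svm import SVC

# One row of per-field similarities (name, address, city) per record pair.
X_train = np.array([[0.9, 0.8, 0.7],
                    [0.2, 0.1, 0.9],
                    [0.8, 0.9, 0.8],
                    [0.1, 0.3, 0.2]])
y_train = np.array([1, 0, 1, 0])  # 1 = duplicate pair, 0 = distinct pair

svm = SVC(kernel="linear").fit(X_train, y_train)
# Signed distance from the separating hyperplane serves as record similarity.
print(svm.decision_function(np.array([[0.85, 0.75, 0.9]])))
```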

61
Record Pair Classification Example
62
Clustering Records into Equivalence Classes
  • Use similarity-based semi-supervised clustering to identify groups of equivalent records.
  • Use single-link agglomerative clustering to cluster records based on the learned similarity metric (see the sketch below).
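
A sketch with SciPy's hierarchical clustering (our tooling choice; the distances and the 0.5 cut threshold are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# dist[i][j] = 1 - learned_similarity(record_i, record_j), illustrative values.
dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])
Z = linkage(squareform(dist), method="single")     # single-link agglomerative
groups = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram at 0.5
print(groups)  # records 0 and 1 merge into one equivalence class
```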

63
Experimental Methodology
  • 2-fold cross-validation, with equivalence classes of records randomly assigned to folds.
  • Results are averaged over 20 runs of cross-validation.
  • Accuracy of duplicate detection on test data is measured using the pairwise measures defined below:
  • Precision
  • Recall
  • F-measure
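
For duplicate detection these are the standard pairwise definitions:

```latex
\mathrm{Precision} = \frac{\#\,\text{correctly identified duplicate pairs}}{\#\,\text{pairs labeled as duplicates}},\quad
\mathrm{Recall} = \frac{\#\,\text{correctly identified duplicate pairs}}{\#\,\text{true duplicate pairs}},\quad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```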

64
Mailing List Name Field Results
65
Cora Title Field Results
66
Maximum F-measure for Detecting Duplicate Field Values

Metric                    | Restaurant Name | Restaurant Address | Citeseer Reason | Citeseer Face | Citeseer RL | Citeseer Constraint
Static Affine Edit Dist.  | 0.29            | 0.68               | 0.93            | 0.95          | 0.89        | 0.92
Learned Affine Edit Dist. | 0.35            | 0.71               | 0.94            | 0.97          | 0.91        | 0.94

T-test results indicate the differences are significant at the 0.05 level.
67
Mailing List Record Results
68
Restaurant Record Results
69
Combining Similarity-Based and Search-Based Semi-Supervised Clustering
  • Seeded/constrained clustering can be applied with a trained similarity metric.
  • We developed a unified framework for Euclidean distance with soft pairwise constraints (must-link, cannot-link).
  • Experiments on UCI data comparing the approaches:
  • With small amounts of training data, the seeded/constrained approach tends to do better than the similarity-based approach.
  • With larger amounts of labeled data, the similarity-based approach tends to do better.
  • Combining the two outperforms both individual approaches.

70
Active Semi-Supervision
  • Use active learning to select the most informative labeled examples.
  • We have developed an active approach for selecting good pairwise queries to obtain must-link and cannot-link constraints:
  • Should these two examples be in the same or different clusters?
  • Experimental results on UCI and text data:
  • Active learning achieves much higher accuracy with fewer labeled training pairs.

71
Future Work
  • Adaptive metric learning for vector-space cosine similarity.
  • Supervised learning of better token weights than TF-IDF.
  • A unified method for text data (cosine similarity) that combines seeded/constrained clustering with a learned similarity measure.
  • Active learning results for duplicate detection.
  • Static-active learning.
  • Exploiting external data/knowledge (e.g., from the web) to improve similarity measures for duplicate detection.

72
Conclusion
  • Semi-supervised clustering is an alternative way of combining labeled and unlabeled data in learning.
  • Search-based and similarity-based methods are two alternative approaches.
  • Both have useful applications in text clustering and database record linkage.
  • Experimental results in these applications illustrate their utility.
  • The two approaches can be combined to produce even better results.