1
Efficient Clustering of High-Dimensional Data
Sets with Application to Reference Matching
  • Andrew McCallum
  • Kamal Nigam
  • Lyle H. Ungar

2
ABSTRACT
  • Traditional clustering techniques are efficient
    only when the data set has
  • 1. a limited number of clusters,
  • 2. low feature dimensionality, or
  • 3. a small number of data points.
  • They are not efficient for millions of data points
    that exist in many thousands of dimensions.

3
  • Canopies: a new technique for clustering
    large, high-dimensional data sets.
  • Idea
  • Use a cheap approximate distance measure to
    efficiently divide the data into overlapping
    subsets.
  • Applicable in many domains and usable with a
    variety of clustering approaches:
    Greedy Agglomerative Clustering (GAC),
    k-means, Expectation-Maximization (EM).

4
1. INTRODUCTION
  • Traditional clustering algorithms become
    computationally expensive in three ways
  • 1. There can be a large number of
    elements in the data set.
  • 2. Each element can have many features.
  • 3. There can be many clusters to discover.

5
  • KD-trees provide efficient EM-style clustering
    of many elements, but require that the
    dimensionality of each element be small.
  • K-means clustering can be made efficient by
    finding good initial starting points, but is not
    efficient when the number of clusters is large.

6
Two stages of clustering
  • First, a rough and quick stage that divides the
    data into overlapping subsets called canopies.
    The canopies are built using the cheap,
    approximate distance measure.
  • The second stage completes the clustering by
    running a standard clustering algorithm using a
    rigorous, and thus more expensive, distance
    metric.

7
  • Significant computation is saved by eliminating
    all the distance comparisons among points that do
    not fall within a common canopy.
  • Difference from previous clustering methods: it
    uses two different distance metrics for the two
    stages, and forms overlapping regions.

8
2. EFFICIENT CLUSTERING WITH CANOPIES
  • The key idea of the canopy algorithm is that one
    can greatly reduce the number of distance
    computations required for clustering by first
    cheaply partitioning the data into overlapping
    subsets, and then only measuring distances among
    pairs of data points that belong to a common
    subset.

9
  • The canopies technique thus uses two different
    sources of information to cluster items:
  • 1. a cheap and approximate similarity
    measure (first stage)
  • 2. a more expensive and accurate similarity
    measure (second stage)

10
First stage
  • A canopy is simply a subset of the elements.
  • An element may appear in more than one canopy,
    and every element must appear in at least one
    canopy. (Figure 1)

11
Figure 1
12
Second stage
  • The accurate distance measure is used by a
    traditional clustering algorithm,
  • but we do not calculate the distance between
    two points that never appear in the same canopy.
  • (We assume their distance to be infinite.)

13
  • If all items are trivially placed into a single
    canopy, then the second stage gains no
    efficiency.
  • If the canopies are not too large and do not
    overlap too much, then a large number of
    expensive distance measurements will be avoided.

14
The formal conditions on canopies
  • 1. For k-means and EM, the clustering accuracy is
    preserved exactly when
  • for every traditional cluster, there exists
    a canopy such that all elements of the cluster
    are in that canopy.
  • 2. For GAC, which measures the closest point in a
    cluster,
  • for every cluster, there must exist a set of
    canopies such that the elements of the cluster
    connect those canopies.

15
2.1 Creating Canopies
  • In most cases, a user of the canopies technique
    will be able to leverage domain-specific features
    in order to design a cheap distance metric and
    efficiently create canopies using the metric.
  • Often one or a small number of features suffice
    to build canopies, even if the items being
    clustered (e.g. the patients) have thousands of
    features.

16
2.1.1 A Cheap Distance Metric
  • All the very fast distance metrics for text used
    by search engines are based on the inverted
    index.
  • Thus we can use an inverted index to efficiently
    calculate a distance metric that is based on the
    number of words two documents have in common.
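As an illustration, here is a minimal sketch (not from the paper; the function names and toy documents are invented for this example) of how an inverted index yields word-overlap counts, and therefore a cheap distance, while never touching document pairs that share no words:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for w in set(words):
            index[w].add(doc_id)
    return index

def common_word_counts(query_words, index):
    """For every document sharing at least one word with the query,
    count how many distinct words are shared.  Documents with no
    shared words are never touched (their distance stays 'infinite')."""
    counts = defaultdict(int)
    for w in set(query_words):
        for doc_id in index.get(w, ()):
            counts[doc_id] += 1
    return counts

# Toy usage: documents given as lists of words.
docs = [["canopy", "clustering", "reference"],
        ["reference", "matching"],
        ["unrelated", "text"]]
index = build_inverted_index(docs)
print(dict(common_word_counts(["canopy", "reference"], index)))  # {0: 2, 1: 1}
```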

17
  • Given the above distance metric, one can create
    canopies as follows. Start with a list of the
    data points in any order, and with two distance
    thresholds, T1 and T2, where T1 > T2.
  • Pick a point off the list and approximately
    measure its distance to all other points. Put all
    points that are within distance threshold T1 into
    a canopy. Remove from the list all points that
    are within distance threshold T2. Repeat until
    the list is empty.
  • Figure 1 shows some canopies that were created by
    this procedure.
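A minimal sketch of the canopy-creation loop just described, assuming some user-supplied cheap_distance function (for instance, one derived from the inverted-index counts above); all names are illustrative:

```python
def create_canopies(points, cheap_distance, t1, t2):
    """Greedy canopy creation with loose threshold t1 and tight threshold t2 (t1 > t2)."""
    assert t1 > t2
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)              # pick any point off the list
        canopy = {center}
        keep = []
        for i in remaining:
            d = cheap_distance(points[center], points[i])
            if d < t1:                         # loose threshold: joins this canopy
                canopy.add(i)
            if d >= t2:                        # outside tight threshold: stays on the list
                keep.append(i)
        remaining = keep
        canopies.append(canopy)
    return canopies
```

Points whose distance to the center lies between T2 and T1 both join the canopy and stay on the list, which is exactly how overlapping canopies arise.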

18
2.2 Canopies with Greedy Agglomerative
Clustering
  • GAC is a common clustering technique used to
    group items together based on similarity.
  • Implementation of GAC
  • 1. Initialize each element as its own
    singleton cluster.
  • 2. Compute the distances between all pairs
    of clusters, and sort them.
  • 3. Repeatedly merge the two clusters that
    are closest together.

19
  • We are guaranteed that any two points that do not
    share a canopy will not fall into the same
    cluster.
  • Thus, we do not need to calculate the distances
    between these pairs of points.
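A minimal sketch of how the canopy restriction plugs into GAC: only pairs that share a canopy ever get an expensive distance computation, so only they can become merge candidates. The heap-based bookkeeping and names are illustrative, not the paper's implementation:

```python
import heapq
from itertools import combinations

def canopy_pair_heap(points, canopies, expensive_distance):
    """Build a min-heap of (distance, i, j) only for pairs that share a canopy.
    Pairs that never share a canopy are treated as infinitely far apart."""
    seen = set()
    heap = []
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            if (i, j) not in seen:
                seen.add((i, j))
                heapq.heappush(heap, (expensive_distance(points[i], points[j]), i, j))
    return heap  # GAC then repeatedly pops the closest pair and merges their clusters
```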

20
2.3 Canopies with Expectation-
Maximization Clustering
  • Method 1
  • 1. Create the canopies.
  • 2. Decide how many prototypes will be
    created for each canopy, then place the
    prototypes into each canopy.
  • 3. Instead of calculating the distance from
    each prototype to every point, the E-step
    calculates the distance from each prototype
    to a much smaller number of points (see the
    sketch below).
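A minimal sketch of the canopy-restricted assignment step in the style of Method 1, written for k-means. It assumes each prototype has been associated with one canopy and each point knows which canopies it belongs to; this data layout is an assumption made only for illustration:

```python
def assign_points(points, prototypes, proto_canopy, point_canopies, distance):
    """k-means E-step with canopies: each point is compared only against
    prototypes that live in one of its own canopies, not against all k."""
    assignments = {}
    for p, my_canopies in enumerate(point_canopies):
        best, best_d = None, float("inf")
        for k, c in enumerate(proto_canopy):
            if c in my_canopies:               # skip prototypes outside shared canopies
                d = distance(points[p], prototypes[k])
                if d < best_d:
                    best, best_d = k, d
        assignments[p] = best
    return assignments
```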

21
  • With canopies, k-means gives not just results
    similar to the traditional setting, but exactly
    identical clusters.
  • In k-means, each data point is assigned to a
    single prototype.

22
  • Method 2
  • This method allows us to select a number of
    prototypes that cover the whole data set, and
    the prototypes are influenced by all of the data.
  • Not only can we efficiently compute the distance
    between two points using the cheap distance
    metric, but the use of inverted indices avoids
    computing the distance to many points at all.

23
  • Method 3
  • Dynamically determine the number of prototypes.
  • We create (and possibly destroy) prototypes
    dynamically during clustering.
  • We avoid creating multiple prototypes to cover
    points that fall into more than one canopy.

24
A summary of the three different methods of
combining canopies and EM
25
2.4 Computational Complexity
  • n data points that originated from k clusters.
  • c is the number of canopies.
  • Each data point falls into f canopies on average.
  • The factor f estimates the degree to which
    the canopies overlap with each other.

26
  • Consider the GAC algorithm:
  • fn/c data points per canopy
  • Time complexity O(n²) → O(f²n²/c) distance
    measurements
  • Consider EM Method 1:
  • Assume that clusters have roughly the
    same overlap factor f as data points do.
  • Time complexity O(nk) → O(nkf²/c)
  • Both methods have the same complexity reduction
    factor, f²/c.
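As a worked illustration (the numbers below are hypothetical, chosen only to show the scale of the saving):

```latex
% GAC: the O(n^2) pairwise comparisons shrink by the factor f^2/c.
% With, say, c = 100 canopies and an average overlap of f = 2 canopies per point:
\frac{f^2}{c} = \frac{2^2}{100} = \frac{1}{25}
\qquad\Longrightarrow\qquad
\text{only about } 4\% \text{ of the } O(n^2) \text{ distance computations remain.}
```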

27
3. Clustering Textual Bibliographic References
28
3.2 Dataset, Protocol and Evaluation Metrics
  • Precision: the fraction of correct predictions
    among all pairs of citations predicted to fall in
    the same cluster.
  • Recall: the fraction of correct predictions
    among all pairs of citations that truly fall in
    the same cluster.
  • F1: the harmonic mean of precision
    and recall.
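Written as formulas (standard definitions over pairs of citations):

```latex
\mathrm{Precision} = \frac{\#\{\text{correct pairs predicted in the same cluster}\}}
                          {\#\{\text{all pairs predicted in the same cluster}\}},
\qquad
\mathrm{Recall} = \frac{\#\{\text{correct pairs predicted in the same cluster}\}}
                       {\#\{\text{all pairs truly in the same cluster}\}},
\qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
```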

29
3.3 Experimental Results(1)
Table 2: The error and time costs of different
methods of clustering references. The
naive baseline places each citation in its own
cluster. The Author/Year baseline clusters each
citation based on the year and first author of
the citation, identified by hand-built regular
expressions. The existing Cora method is a
word-matching-based technique.
30
3.3 Experimental Results(2)
Table 3: F1 results obtained by varying the tight
and loose threshold parameters during canopy
creation.
31
3.3 Experimental Results(3)
Table 4: The accuracy of the clustering as we
vary the final number of clusters.
32
4. RELATED WORK
  • The canopy technique is not usable with KD-trees
    and ball trees.
  • A special case of the clustering problem:
  • the record linkage or merge-purge problem, which
    occurs when a company purchases multiple
    databases.

33
5. Conclusions
  • Clustering large data sets is a ubiquitous task:
  • 1. Astronomers analyze astronomical images.
  • 2. Web search engines group similar documents.
  • 3. Marketers seek clusters of similar
    shoppers.
  • 4. Biologists seek to group DNA sequences.

34
  • The canopy approach is widely applicable, even
    to the complex merge-purge problem,
  • because that problem also has a cheap,
    approximate comparison and an accurate,
    expensive one.

35
Future work
  • Future work will analytically quantify the
    correspondence between the cheap and expensive
    distance metrics,
  • and will perform experiments with EM and with a
    wider variety of domains, including data sets
    with real-valued attributes.