Title: Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
- Andrew McCallum
- Kamal Nigam
- Lyle H. Ungar
ABSTRACT
- Traditional clustering techniques are efficient only when the data set has either
  - 1. a limited number of clusters,
  - 2. a low feature dimensionality, or
  - 3. a small number of data points.
- They are not efficient when there are millions of data points that exist in many thousands of dimensions.
- Canopies: a new technique for clustering large, high-dimensional data sets.
- Idea
  - Use a cheap, approximate distance measure to efficiently divide the data into overlapping subsets.
  - Applicable to many domains and usable with a variety of clustering approaches: Greedy Agglomerative Clustering (GAC), k-means, and Expectation-Maximization (EM).
1. INTRODUCTION
- Traditional clustering algorithms become computationally expensive in three ways:
  - 1. There can be a large number of elements in the data set.
  - 2. Each element can have many features.
  - 3. There can be many clusters to discover.
- KD-trees provide for efficient EM-style clustering of many elements, but require that the dimensionality of each element be small.
- K-means clustering can be made efficient by finding good initial starting points, but is not efficient when the number of clusters is large.
Two stages of clustering
- First, a rough and quick stage divides the data into overlapping subsets called canopies. The canopies are built using the cheap, approximate distance measure.
- The second stage completes the clustering by running a standard clustering algorithm using a rigorous, and thus more expensive, distance metric.
- Significant computation is saved by eliminating all the distance comparisons among points that do not fall within a common canopy.
- Difference from previous clustering methods: it uses two different distance metrics for the two stages, and forms overlapping regions.
2. EFFICIENT CLUSTERING WITH CANOPIES
- The key idea of the canopy algorithm is that one
can greatly reduce the number of distance
computations required for clustering by first
cheaply partitioning the data into overlapping
subsets, and then only measuring distances among
pairs of data points that belong to a common
subset.
- The canopies technique thus uses two different sources of information to cluster items:
  - 1. A cheap and approximate similarity measure (first stage).
  - 2. A more expensive and accurate similarity measure (second stage).
First stage
- A canopy is simply a subset of the elements.
- An element may appear under more than one canopy, and every element must appear in at least one canopy (Figure 1).
Figure 1.
Second stage
- The accurate distance measure is used by a traditional clustering algorithm,
- but we do not calculate the distance between two points that never appear in the same canopy. (We assume their distance to be infinite.)
- If all items are trivially placed into a single canopy, then the second stage gains no efficiency.
- If the canopies are not too large and do not overlap too much, then a large number of expensive distance measurements will be avoided.
The formal conditions on canopies
- 1. For k-means and EM, the clustering accuracy is preserved exactly when, for every traditional cluster, there exists a canopy such that all elements of the cluster are in that canopy.
- 2. For GAC (which measures the distance to the closest point in a cluster), accuracy is preserved when, for every cluster, there exists a set of canopies such that the elements of the cluster connect the canopies.
2.1 Creating Canopies
- In most cases, a user of the canopies technique will be able to leverage domain-specific features in order to design a cheap distance metric and efficiently create canopies using the metric.
- Often one or a small number of features suffice to build canopies, even if the items being clustered (e.g. the patients) have thousands of features.
2.1.1 A Cheap Distance Metric
- All the very fast distance metrics for text used by search engines are based on the inverted index.
- Thus we can use an inverted index to efficiently calculate a distance metric that is based on the number of words two documents have in common.
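As an illustration (a minimal sketch, not the paper's implementation), the snippet below builds an inverted index over tokenized references and counts shared words; documents sharing no words with the query never appear, so no distance is computed for them at all:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for w in set(words):
            index[w].add(doc_id)
    return index

def common_word_counts(query_words, index):
    """Count words shared between the query and every overlapping document."""
    counts = defaultdict(int)
    for w in set(query_words):
        for doc_id in index.get(w, ()):
            counts[doc_id] += 1
    return counts  # a larger count means "closer" under the cheap metric
```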
- Given the above distance metric, one can create canopies as follows. Start with a list of the data points in any order, and with two distance thresholds, T1 and T2, where T1 > T2.
- Pick a point off the list and approximately measure its distance to all other points. Put all points that are within distance threshold T1 into a canopy. Remove from the list all points that are within distance threshold T2. Repeat until the list is empty.
- Figure 1 shows some canopies that were created by this procedure.
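A minimal sketch of this procedure, assuming a cheap_distance(a, b) function and thresholds t1 > t2 (names are hypothetical):

```python
def create_canopies(points, cheap_distance, t1, t2):
    """Greedily form overlapping canopies using the cheap metric (t1 > t2)."""
    remaining = set(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop()          # pick an arbitrary point as the canopy center
        canopy = {center}
        removed = set()
        for i in list(remaining):
            d = cheap_distance(points[center], points[i])
            if d < t1:
                canopy.add(i)             # within the loose threshold: joins this canopy
            if d < t2:
                removed.add(i)            # within the tight threshold: never a center again
        remaining -= removed
        canopies.append(canopy)
    return canopies
```

Points that fall between the two thresholds stay on the list, so they can appear in more than one canopy, which is what produces the overlapping subsets.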
2.2 Canopies with Greedy Agglomerative Clustering
- GAC is a common clustering technique used to group items together based on similarity.
- Implementation of GAC:
  - 1. Initialize each element as its own cluster.
  - 2. Compute the distances between all pairs of clusters and sort them.
  - 3. Repeatedly merge the two clusters that are closest together.
- With canopies we are guaranteed that any two points that do not share a canopy will not fall into the same cluster.
- Thus, we do not need to calculate the distances between these pairs of points.
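As an illustrative sketch (not the paper's code), the expensive distances can be computed only for pairs that co-occur in at least one canopy, reusing the canopies produced above; all other pairs are treated as infinitely far apart:

```python
from itertools import combinations

def canopy_restricted_pairs(canopies):
    """Yield only the point-index pairs that share at least one canopy."""
    seen = set()
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            if (i, j) not in seen:
                seen.add((i, j))
                yield i, j

def expensive_distances(points, canopies, expensive_distance):
    """Accurate metric evaluated only within canopies (hypothetical helper)."""
    return {
        (i, j): expensive_distance(points[i], points[j])
        for i, j in canopy_restricted_pairs(canopies)
    }
```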
2.3 Canopies with Expectation-Maximization Clustering
- Method 1
  - 1. Create the canopies.
  - 2. Decide how many prototypes will be created for each canopy, then place the prototypes into each canopy.
  - 3. Instead of calculating the distance from each prototype to every point, the E-step calculates the distance from each prototype only to the points in its canopy, a much smaller number.
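A minimal sketch of the canopy-restricted assignment step in a k-means-style E-step, assuming prototypes have already been placed into canopies (all names are hypothetical):

```python
def canopy_restricted_assignment(points, prototypes, prototype_canopy,
                                 point_canopies, expensive_distance):
    """Assign each point to the nearest prototype that shares a canopy with it.

    prototype_canopy[p] is the canopy id of prototype p;
    point_canopies[i] is the set of canopy ids containing point i.
    Prototypes outside all of a point's canopies are skipped (distance
    assumed infinite), which is where the computation is saved.
    """
    assignment = {}
    for i, x in enumerate(points):
        best, best_d = None, float("inf")
        for p, proto in enumerate(prototypes):
            if prototype_canopy[p] not in point_canopies[i]:
                continue
            d = expensive_distance(x, proto)
            if d < best_d:
                best, best_d = p, d
        assignment[i] = best
    return assignment
```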
- K-means gives not just similar results in the canopy and traditional settings, but exactly identical clusters.
- In k-means each data point is assigned to a single prototype.
- Method 2
  - This method allows us to select the number of prototypes that will cover the whole data set, and the prototypes are influenced by all the other data.
  - Not only can we efficiently compute the distance between two points using the cheap distance metric, but the use of inverted indices avoids computing the distance to many points entirely.
- Method 3
  - Dynamically determine the number of prototypes.
  - We create (and possibly destroy) prototypes dynamically during clustering.
  - We avoid creating multiple prototypes to cover points that fall into more than one canopy.
A summary of the three different methods of combining canopies and EM.
2.4 Computational Complexity
- n: the number of data points, originating from k clusters.
- c: the number of canopies.
- f: each data point on average falls into f canopies; the factor f estimates how much the canopies overlap with each other.
- Consider the GAC algorithm:
  - fn/c data points per canopy on average.
  - Distance measurements are reduced from O(n^2) to O(f^2 n^2 / c).
- Consider EM Method 1:
  - Assume that clusters have roughly the same overlap factor f as data points do.
  - Distance measurements are reduced from O(nk) to O(nk f^2 / c).
- Both methods have the same complexity reduction factor, f^2/c.
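A short counting argument consistent with the quantities above (a sketch, not a quote from the paper): with about $fn/c$ points per canopy and $c$ canopies, the number of within-canopy pairs is roughly

$$
c \binom{fn/c}{2} \;\approx\; c \cdot \frac{(fn/c)^2}{2} \;=\; \frac{f^2 n^2}{2c} \;=\; O\!\left(\frac{f^2 n^2}{c}\right),
$$

versus $\binom{n}{2} = O(n^2)$ pairs without canopies. Similarly, with about $fk/c$ prototypes per canopy and each point lying in $f$ canopies, EM Method 1 computes roughly $n \cdot f \cdot fk/c = O(nkf^2/c)$ point-to-prototype distances instead of $O(nk)$; both are smaller by a factor of roughly $f^2/c$.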
3. Clustering Textual Bibliographic References
3.2 Dataset, Protocol and Evaluation Metrics
- Precision: the fraction of correct predictions among all pairs of citations predicted to fall in the same cluster.
- Recall: the fraction of correct predictions among all pairs of citations that truly fall in the same cluster.
- F1: the harmonic average of precision and recall.
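In symbols, matching the pairwise definitions above, with $C$ the set of pairs predicted to be co-clustered and $T$ the set of pairs that truly are:

$$
P = \frac{|C \cap T|}{|C|}, \qquad R = \frac{|C \cap T|}{|T|}, \qquad F_1 = \frac{2PR}{P + R}.
$$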
3.3 Experimental Results (1)
Table 2: The error and time costs of different methods of clustering references. The naive baseline places each citation in its own cluster. The Author/Year baseline clusters each citation based on the year and first author of the citation, identified by hand-built regular expressions. The existing Cora method is a word-matching-based technique.
3.3 Experimental Results (2)
Table 3: F1 results obtained by varying the tight and loose threshold parameters during canopy creation.
3.3 Experimental Results (3)
Table 4: The accuracy of the clustering as we vary the final number of clusters.
4. RELATED WORK
- Related data structures such as KD-trees and ball trees are not usable for data of this dimensionality.
- A special case of the clustering problem:
  - The record linkage or merge/purge problem, which occurs when a company purchases multiple databases.
5. Conclusions
- Clustering large data sets is a ubiquitous task:
  - 1. Analysis of astronomical images.
  - 2. Web search engines seek clusters of similar documents.
  - 3. Marketers seek clusters of similar shoppers.
  - 4. Biologists seek to group DNA sequences.
- The canopy approach is widely applicable, even to the complex merge/purge problem,
- because that problem also has both a cheap, approximate means of comparison and an accurate, expensive one.
Future work
- Quantify analytically the correspondence between the cheap and expensive distance metrics.
- Perform experiments with EM and with a wider variety of domains, including data sets with real-valued attributes.