Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification - PowerPoint PPT Presentation

Loading...

PPT – Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification PowerPoint presentation | free to download - id: 7386ee-MGFiM



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification

Description:

Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification


1
Discriminative Training ofClustering Functions
Theory and Experiments with EntityIdentificatio
n
Xin Li Dan Roth University of Illinois,
Urbana-Champaign
2
Outline
  • Clustering
  • Current approaches
  • Some problems
  • Making Clustering a Learning problem
  • Supervised Discriminative Clustering Framework
  • The Reference Problem
  • Entity Identification within across documents.

3
The Reference Problem
Document 1 The Justice Department has officially
ended its inquiry into the assassinations of John
F. Kennedy and Martin Luther King Jr., finding
no persuasive evidence'' to support conspiracy
theories, according to department documents. The
House Assassinations Committee concluded in 1978
that Kennedy was probably'' assassinated as the
result of a conspiracy involving a second gunman,
a finding that broke from the Warren Commission
's belief that Lee Harvey Oswald acted alone in
Dallas on Nov. 22, 1963. Document 2 In 1953,
Massachusetts Sen. John F. Kennedy married
Jacqueline Lee Bouvier in Newport, R.I. In 1960,
Democratic presidential candidate John F. Kennedy
confronted the issue of his Roman Catholic faith
by telling a Protestant group in Houston, I do
not speak for my church on public matters, and
the church does not speak for me.' Document 3
David Kennedy was born in Leicester, England in
1959.  Kennedy co-edited The New Poetry
(Bloodaxe Books 1993), and is the author of New
Relations The Refashioning Of British Poetry
1980-1994 (Seren 1996). 
4
Entity Identification in Text
  • Goal Given names, within or across documents,
    identify real-world entities behind them.
  • Problem Definition Given a set of names and
    their semantic types, people, locations
    Organizations partition them into groups that
    refer to different entities.
  • Approaches
  • A generative Model
  • Li, Morie, Roth, NAACL04
  • A discriminative approach
  • Li, Morie, Roth, AAAI04
  • Other works on citation, and more Milche et.
    al Bilenko et. al.,
  • Intuitively, a discriminative approach, requires
  • using some similarity measure
  • between names, followed up by
  • clustering into clusters that
  • represent entities.

5
Clustering
  • An optimization procedure that takes
  • A collection of data elements
  • A distance (similarity) measure on the space of
    data elements
  • A Partition Algorithm
  • Attempts to
  • Optimize some quality with respect to the given
    distance metric.

6
Clustering Example (k4)
7
Example K-means Clustering
  • An Optimization Problem
  • Data X x1,x2, Cluster Names C
    1,2,3,,K
  • The Euclidean Distance d(x1,x2)
    (x1-x2)T(x1-x2)1/2
  • Find a mapping f X ? C
  • That minimizes ?j ?x 2 Cj d(x,?j )2
  • Where ?j 1/m ?x 2 Cj x mean of elements in
    the k-th cluster

8
Many NLP Applications
  • Class-based language models
  • group similar words together based on their
    semantics (Dagan et. al 99, Lee et. al Pantel
    and Lin, 2002).
  • Document categorization and topic
    identification (Karypis, Han 99,02).
  • Co-reference resolution
  • build coreference chain of noun phrases (Cardie,
    Wagstaff 99).
  • In all cases fixed metric distance tuned for
    the application and the data.
    (and the algorithm?)

9
Clustering Metric and Algorithm
  • There is no universal distance metric that is
    appropriate for all clustering algorithms
  • How do we make sure we have an appropriate one,
    that reflects the task/designer intentions?

d1(x,x) (f1 - f1) 2(f2 - f2) 21/2
d2(x,x) (f1 f2)-(f1f2)
(a) Single-Linkage with Euclidean
(c) K-Means with a Linear Metric
(b) K-Means with Euclidean
10
Additional information in Clustering
11
Traditional Clustering Framework
  • Typically, unsupervised no learning.
  • More recently work on metric learning with
    supervision
  • BilenkoMooney 03, 04, Xing et. al.03, Schultz
    Joachims03, Bach Jordan03
  • Learning a metric then cluster
  • Learning while clustering (algorithm specific)

12
Supervised Discriminative Clustering (SDC)
  • Incorporates supervision directly into metric
    training process
  • Training is driven by true clustering error
  • Computed via the chosen data partition algorithm.

13
Elements of SDC Partition Function and Error
  • Goal A partition function h maps a set of data
    points S to
  • a partition h(S) of S.
    (outcome of a clustering algorithm)
  • Note difference from
    multi-class classification
  • The partition function h is a function of the
    parameterized distance metric
    d(x1,x2) ? wi xi1- xi2
  • Error Given a labeled data set S
  • p(S) (xi,ci)1m, the
    correct partition, and
  • a fixed clustering algorithm
    A,
  • the training process attempts to find d,
    minimizing the clustering error
  • d argmind errS(h,p), where h(S)Ad(S).

Optimal (given) Partition
Learned Partition
14
A Supervised Clustering Error
  • errS(h,p) 1/S2 ?ij d(xi,xj)Aij
    (D-d(xi,xj))Bij
  • (as opposed to a quality function that depends
    only on the distance)
  • Two types of errors in pairwise prediction
  • (xi,xj) ?h together or apart
  • False negative Aij I p(xi)p(xj)
    h(xi)?h(xj),
  • False positive Bij I p(xi) ? p(xj)
    h(xi)h(xj),
  • D maxij d(xi,xj ) .
  • (See paper for a comparison with other error
    functions)

15
Training the distance function
S
Initialize the distance Metric d
Cluster S using algorithm hAd
Update d
Evaluate ErrS(h,p)
A gradient descent based algorithm
16
Training the distance function
  • Gradient descent Alg.
  • Learns a metric Iteratively by adjusting the
    parameter vector ? by a small amount in the
    direction that would most reduce the error.

17
Entity Identification in Text
  • Goal Given names, within or across documents,
    identify real-world entities behind them.
  • Problem Definition Given a set of names and
    their semantic types, people, locations
    Organizations partition them into groups that
    refer to different entities.
  • Approaches
  • A generative Model
  • Li, Morie, Roth, NAACL04
  • A discriminative approach
  • Li, Morie, Roth, AAAI04

18
Parameterized Distance Metrics for Name Matching
John F. Kennedy
President Kennedy
?
  • Feature Extraction (John F. Kennedy, President
    Kennedy ) (?1, ?2 , )
  • Fixed distance distance (similarity) metric d
    for names.
  • d (John F. Kennedy, President Kennedy) ? 0.6
  • d (Chicago Cubs, Cubs) ? 0.6
  • d (United States, USA) ? 0.7
  • A learned distance function parameterized as a
    Linear function over features (kernelized)
  • d(John F. Kennedy, President Kennedy )
    ? wi ?i
  • Make it a pairwise classifier
  • h (John F. Kennedy, President Kennedy
    ) together
  • iff ? wi ?i lt 0.5
  • The distance function can be trained separately,
    to optimize partition quality, or via SDC, to
    minimize Error. (via gradient
    descent)

19
Features
  • Relational features that are extracted from a
    pair of strings, taking into account relative
    positions of tokens, substring relations, etc.

20
Experimental Setting
  • Names of people, locations and organizations.
  • John F. Kennedy, Bush, George W. Bush
  • U.S.A, United States, and America
  • University of Illinois, U. of I., IBM,
    International Business Machines.
  • 300 randomly picked New York Times news articles.
  • 8,600 names annotated by a named entity tagger
    and manually verified.
  • Training sets contain names labeled with its
    global entity.
  • John F. Kennedy ? Kennedy1
  • President Kennedy? Kennedy1,
  • David Kennedy? Kennedy2.
  • Data is available from http//l2r.cs.uiuc.edu/cog
    comp/

21
Gain from Metric Learning while Clustering
  • SoftTFIDF (Cohen et. al) Fixed metric
  • LMR (Li, Morie, Roth, AAAI04) learned metric
    via a pairwise classifier
  • relational features extracted from pairs of
    strings feedback from pairwise labels
  • SDC trains a linear weighted distance metric
    for the single-link clustering
  • algorithm with labeled pairs of 600 names.

22
Influence of Data Size
23
Different Clustering Algorithms
  • Difference across clustering algorithm is not as
    significant as difference obtained from learning
    a good metric via SDC.

24
Summary
  • A framework for Metric Learning for Clustering
    that is guided by global supervision with
    clustering as part of the feedback loop.
  • A parameterized distance metric is learned in a
    way that depends on the specific clustering
    algorithm used.
  • Significant improvement shown on the Reference
    Problem Entity Identification Across documents.

25
Intuition behind SDC
d
K16
26
Relational Features
  • Relational features do not depend on specific
    tokens in the two names, but depend on some
    abstraction over tokens.
  • Honorific Equal Mr., Mrs., President, Prof.
  • Nickname Thomas, Tom
  • Edit Distance.

Toward Concept-Based Text Understanding and Mining
27
How to employ transitivity between names ?
  • Clustering splitting a set of names.
  • Distance Metrics Edit distance, SoftTFIDF,
    Jora-Winkler
  • Clustering Algorithms Single-Link,
    Complete-Link, K-means, graph cut.

Toward Concept-Based Text Understanding and Mining
28
Outline
  • Clustering
  • Current approaches
  • Some problems
  • Making Clustering a Learning problem
  • Supervised Discriminative Clustering Framework
  • The Reference Problem
  • Entity Identification in within across document.

29
Entity Identification in Text
  • Goal Given names, within or across documents,
    identify real-world entities behind them.
  • Problem Definition Given a set of names and
    their semantic types, people, locations
    Organizations partition them into groups that
    refer to different entities.
  • Approaches
  • A generative Model
  • Li, Morie, Roth, NAACL04
  • A discriminative approach
  • Li, Morie, Roth, AAAI04

Challenge millions of entities in the world, but
in training, we can only see names of a limited
number of entities.
About PowerShow.com